Reinforcement Learning: Theory, Algorithms, and Applications

张汝波 1, 顾国昌 1, 刘照德 1, 王醒策 1

Author address

1 Department of Computer Science, Harbin Engineering University, Harbin 150001, China


Abstract
The term "reinforcement learning" comes from behavioral psychology; this theory views behavior learning as a trial-and-error process through which environment states are mapped to corresponding actions. This paper first gives a comprehensive introduction to the main algorithms of reinforcement learning theory, namely the temporal-difference (TD) method, the Q-learning algorithm, and the adaptive heuristic critic (AHC) algorithm; it then surveys applications of reinforcement learning; finally, it discusses open problems that current reinforcement learning research must address.
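To make the surveyed update rules concrete, here is a minimal tabular Q-learning sketch in Python. It is an illustration of the general algorithm rather than code from the paper; the environment object env, its reset()/step(action) interface, and all parameter values are assumptions made for the example. The quantity in parentheses in the update line is the temporal-difference error that also underlies the TD and AHC methods the paper discusses.

import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    # The bracketed quantity is the temporal-difference (TD) error.
    Q = defaultdict(float)                       # (state, action) -> value estimate
    for _ in range(episodes):
        s, done = env.reset(), False             # assumed interface: reset() -> initial state
        while not done:
            if random.random() < epsilon:        # epsilon-greedy trial-and-error exploration
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)            # assumed: step(a) -> (next state, reward, done)
            best_next = 0.0 if done else max(Q[(s2, act)] for act in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # TD-error update
            s = s2
    return Q

After training, the learned table is used greedily: in state s the agent picks the action a that maximizes Q[(s, a)].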
