An efficient and robust gradient reinforcement learning: Deep comparative policy

Wang, Jiaguo; Li, Wenheng; Lei, Chao; Yang, Meng; Pei, Yang

doi:10.3233/JIFS-233747

Article type: Research Article

Authors: Wang, Jiaguo^a | Li, Wenheng^b | Lei, Chao^c | Yang, Meng^d | Pei, Yang^{a; *}

Affiliations: [a] Northwestern Polytechnical University, Xi’an, China | [b] AVIC Xi’an Aeronautics Computing Technique Research Institute, Xi’an, China | [c] School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria, Australia | [d] Faculty of Information Technology, Monash University, Clayton, Victoria, Australia

Correspondence: [*] Corresponding author. Yang Pei, Northwestern Polytechnical University, Xi’an, China.

Abstract: Actor-critic architectures such as deep deterministic policy gradient (DDPG) can learn higher-level concepts when searching for rich rewards and can generate complex actions in continuous action spaces, and they are widely used in practical applications. However, when the action space is limited and has dynamic hard margins, training DDPG can be problematic and inefficient. Because real-world actuators always have margins and are subject to interference, the actor network is likely, after initialization, to become stuck at a local optimum on the action space margin: the actor gradient points outside the action space, but the actuators stop at the margin. If the hard margins are complex, dynamic and unknown to the DDPG agent, the agent cannot use penalty functions to recover from the local optimum. Enlarging the random process for local exploration puts training at risk of failure. Therefore, relying solely on the gradient of the critic network to train the actor network is not a robust method in real environments. To solve this problem, in this paper we modify DDPG into deep comparative policy (DCP). Rather than leveraging the critic-to-actor gradient, the core training process of DCP is regulated by a T-fold comparison among randomly proposed adjacent actions. The performance of DDPG, DCP and related algorithms is tested and compared in two experiments. Our results show that DCP is effective, efficient and able to perform all tasks that DDPG can perform. More importantly, DCP is less likely to be influenced by the action space margins: it provides more safety in avoiding training failure and local optima, and it is more robust in applications with dynamic hard margins in the action space. Another advantage is that no complex penalty for detecting margin contact is required, so the reward function can remain short and simple.
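
The comparison idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not specify the proposal distribution, the value of T, or how the actor is updated, so the uniform proposals, the `radius` parameter, the one-dimensional action, the toy critic and all function names below are assumptions made only for illustration. The sketch draws T random actions adjacent to the current one, clips them to the hard margins, and picks the candidate the critic scores highest, so the selected target can never lie outside the action space.

```python
import random


def clip(a, low, high):
    """Model an actuator that stops at its hard margins."""
    return max(low, min(high, a))


def propose_adjacent(action, low, high, t_fold, radius, rng):
    """Draw t_fold random candidates near `action`, clipped to the margins.

    Uniform noise with half-width `radius` is an assumption of this sketch.
    """
    return [
        clip(action + rng.uniform(-radius, radius), low, high)
        for _ in range(t_fold)
    ]


def comparative_target(q, state, action, low, high,
                       t_fold=8, radius=0.2, rng=random):
    """Return the best of the current action and t_fold adjacent proposals.

    `q(state, action)` stands in for a critic. Because the current action is
    included in the comparison, the result never scores lower than it.
    """
    current = clip(action, low, high)
    candidates = [current]
    candidates += propose_adjacent(current, low, high, t_fold, radius, rng)
    return max(candidates, key=lambda a: q(state, a))


if __name__ == "__main__":
    # Hypothetical critic with its optimum at a = 0.3 inside [-1, 1].
    def toy_q(state, a):
        return -(a - 0.3) ** 2

    rng = random.Random(0)
    action = 1.0  # start on the upper margin
    for _ in range(50):
        target = comparative_target(toy_q, None, action, -1.0, 1.0, rng=rng)
        action += 0.5 * (target - action)  # move the "actor" toward the target
    print(round(action, 3))
```

In this toy setting the update uses only comparisons of critic values at feasible actions, so no gradient pointing outside the margins is ever followed. How DCP actually trains the actor network from such comparisons is described in the full paper, not in this abstract.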

Keywords: Actor-critic, deep reinforcement learning, intelligent agent, iterative learning

Journal: Journal of Intelligent & Fuzzy Systems, vol. 46, no. 2, pp. 3773-3788, 2024

Published: 14 February 2024
