Graph convolutional recurrent networks for reward shaping in reinforcement learning

Research output: Contribution to journal › Article › peer-review

21 Scopus citations

Abstract

In this paper, we consider the problem of slow convergence in Reinforcement Learning (RL). Various potential-based reward shaping techniques have been proposed to address it, but learning a potential function remains challenging and comparable in difficulty to building a value function from scratch. Our main contribution is a new reward shaping scheme that combines (1) Graph Convolutional Recurrent Networks (GCRN), (2) the augmented Krylov method, and (3) look-ahead advice to form the potential function. We propose a GCRN architecture that combines Graph Convolutional Networks (GCN) to capture spatial dependencies with Bi-Directional Gated Recurrent Units (Bi-GRUs) to account for temporal dependencies. The loss function of the GCRN incorporates the message-passing technique of Hidden Markov Models (HMMs). Since the transition matrix of the environment is hard to compute, we estimate it with the Krylov basis, which outperforms existing approximation bases. Unlike existing potential functions that rely only on states to perform reward shaping, we use both states and actions through the look-ahead advice mechanism to produce more precise advice. Our evaluations on the Atari 2600 and MuJoCo games show that, in most games, our solution outperforms the state-of-the-art that uses GCN as the potential function, learning faster while reaching higher rewards.
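Two ingredients of the scheme have simple tabular analogues. The sketch below is a minimal illustration, not the paper's implementation: the potential table `phi`, the transition matrix `P`, and the reward vector `r` are hypothetical stand-ins (in the paper the potential is the learned GCRN and `P` is only estimated). It shows the Krylov basis {r, Pr, P²r, …} built from a transition-matrix estimate, and the look-ahead advice shaping term F(s, a, s′, a′) = γΦ(s′, a′) − Φ(s, a), which depends on actions as well as states.

```python
import numpy as np

gamma = 0.99                      # discount factor (illustrative choice)
n_states, n_actions = 5, 2

rng = np.random.default_rng(0)
# Hypothetical state-action potential table; in the paper this role is
# played by the learned GCRN, a random table stands in here.
phi = rng.standard_normal((n_states, n_actions))

def krylov_basis(P, r, k):
    """Krylov basis {r, Pr, P^2 r, ...} for a transition-matrix estimate P."""
    vecs = [r]
    for _ in range(k - 1):
        vecs.append(P @ vecs[-1])
    return np.stack(vecs)

def shaped_reward(r_env, s, a, s_next, a_next):
    """Potential-based shaping with look-ahead advice:
    F(s, a, s', a') = gamma * phi[s', a'] - phi[s, a]."""
    return r_env + gamma * phi[s_next, a_next] - phi[s, a]

# Toy row-stochastic transition matrix and reward vector.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)
B = krylov_basis(P, r, k=3)       # shape (3, n_states)
```

The action-dependent shaping term is what distinguishes look-ahead advice from state-only potential shaping, where F would reduce to γΦ(s′) − Φ(s).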

Original language: British English
Pages (from-to): 63-80
Number of pages: 18
Journal: Information Sciences
Volume: 608
State: Published - Aug 2022

Keywords

  • Atari
  • Augmented Krylov
  • GCRN
  • Look-Ahead Advice
  • MuJoCo
  • Reinforcement Learning
  • Reward Shaping
