A Survey on Recent Advances and Challenges in Reinforcement Learning Methods for Task-oriented Dialogue Policy Learning

Dialogue policy learning (DPL) is a key component of a task-oriented dialogue (TOD) system. Its goal is to decide the system's next action at each turn, given the current dialogue state, according to a learned dialogue policy. Reinforcement learning (RL) is widely used to optimize this policy: during learning, the user is regarded as the environment and the system as the agent. The research team of Prof. Wong Kam-Fai from The Chinese University of Hong Kong presents an overview of recent advances and challenges in dialogue policy learning from the perspective of RL. More specifically, they identify the problems of RL-based dialogue policy learning and summarize the corresponding solutions. In addition, they provide a comprehensive survey of applying RL to DPL by categorizing recent methods according to the five basic elements of RL. The work has been published in the third issue of Machine Intelligence Research in 2023.




A task-oriented dialogue (TOD) system aims to assist users in accomplishing tasks ranging from weather inquiries to schedule planning. Such systems are built following one of two approaches. The first is the end-to-end approach, which directly maps the current dialogue context to the system's natural language response. Works of this kind often adopt a sequence-to-sequence model trained in a supervised manner. The second is the pipeline approach, which separates the system into four interdependent components: natural language understanding (NLU), dialogue state tracking (DST), dialogue policy learning (DPL) and natural language generation (NLG), as shown in Fig. 1.




Both approaches have their own advantages and limitations. The end-to-end approach is more flexible and has fewer requirements for data annotation. However, it requires a large amount of data, and its black-box structure offers no interpretability and little control. The pipeline approach, in contrast, is more interpretable and easier to implement. Although the whole system is harder to optimize globally, the pipeline approach is preferred by most commercial dialogue systems. The survey likewise adopts the pipeline view to investigate and summarize the current progress of dialogue policy learning. It briefly introduces the functions of the four modules and then looks deeper into the dialogue policy learning module.


Among these four modules, NLU is the first to interact directly with the user: it identifies the intents and slots in the input sentence. The DST module then represents all previously extracted intents and slots as an internal dialogue state. Next, the DPL module takes this state as input and selects an action that satisfies the user's intent. Finally, the NLG module converts the action into natural language and outputs it. In this pipeline, DPL plays a key role as the intermediate link between the DST and NLG modules, and it directly affects the success of the dialogue system.
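The four-stage flow described above can be sketched in a few lines of Python. This is only an illustrative toy, not code from the survey: the function names (nlu, dst, dpl, nlg, respond), the keyword matching, the hand-written policy and the response templates are all assumptions chosen to make the data flow between modules visible.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Intent and slots accumulated over the dialogue (maintained by DST)."""
    intent: str = ""
    slots: dict = field(default_factory=dict)


def nlu(utterance: str) -> dict:
    """Toy NLU: keyword matching stands in for a trained model."""
    text = utterance.lower()
    frame = {"intent": "", "slots": {}}
    if "weather" in text:
        frame["intent"] = "ask_weather"
    for city in ("hong kong", "london"):
        if city in text:
            frame["slots"]["city"] = city
    return frame


def dst(state: DialogueState, frame: dict) -> DialogueState:
    """Toy DST: merge the newly extracted frame into the running state."""
    return DialogueState(
        intent=frame["intent"] or state.intent,
        slots={**state.slots, **frame["slots"]},
    )


def dpl(state: DialogueState) -> str:
    """Toy hand-written policy: request a missing slot, otherwise inform."""
    if state.intent != "ask_weather":
        return "request(intent)"
    if "city" not in state.slots:
        return "request(city)"
    return "inform(weather)"


def nlg(action: str, state: DialogueState) -> str:
    """Toy NLG: template lookup keyed by the system action."""
    templates = {
        "request(intent)": "How can I help you?",
        "request(city)": "Which city are you interested in?",
        "inform(weather)": "Here is the weather for {city}.",
    }
    return templates[action].format(**state.slots)


def respond(state: DialogueState, utterance: str):
    """One system turn: NLU -> DST -> DPL -> NLG."""
    state = dst(state, nlu(utterance))
    return state, nlg(dpl(state), state)


if __name__ == "__main__":
    state = DialogueState()
    for user_turn in ("What is the weather like?", "Hong Kong"):
        state, reply = respond(state, user_turn)
        print(reply)
```

In an RL-based system, the hand-written dpl function is the part that would be replaced by a learned policy; the other three modules only supply its input and render its output.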


Recently, progress in DPL has been significantly facilitated by the development of reinforcement learning (RL) algorithms. Levin et al. were the first to treat DPL as a Markov decision process (MDP) problem. They outlined the complexities of modelling DPL as an MDP and justified the application of RL algorithms to optimize the dialogue policy. Since then, the majority of studies have investigated and tried to resolve the technical issues that arise when RL algorithms are applied to dialogue systems in practice. At the other end of the spectrum, several researchers have explored the use of supervised learning (SL) techniques in DPL. The main idea is to treat dialogue policy learning as a multi-class classification problem, with states as inputs and actions as labels. However, SL techniques have a well-known and serious flaw: they do not consider the future effects of the current decision, which results in sub-optimal behaviour.
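The cost of ignoring future effects can be illustrated with a toy slot-filling MDP. The sketch below is an assumption-laden illustration, not a method from the survey: the environment, the reward values (-1 per request, +20 for booking once all slots are filled, 0 for booking too early) and the hyperparameters are all made up. It runs tabular Q-learning twice: once with a discount factor of 0, which stands in for a decision rule that looks only at the immediate outcome (it is not an SL classifier itself), and once with a discount factor of 0.9.

```python
import random

ACTIONS = ("request", "book")
N_SLOTS = 2  # dialogue state = number of slots filled so far (0, 1 or 2)


def step(state, action):
    """Toy user simulator: returns (next_state, reward, done)."""
    if action == "request":
        # Asking for a slot costs one turn and fills the next slot.
        return min(state + 1, N_SLOTS), -1.0, False
    # "book" ends the dialogue; it only succeeds once every slot is filled.
    return state, (20.0 if state == N_SLOTS else 0.0), True


def q_learning(gamma, episodes=2000, alpha=0.5, max_turns=10, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_SLOTS + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(max_turns):
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            future = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * future
            q[(state, action)] += alpha * (target - q[(state, action)])
            if done:
                break
            state = next_state
    return q


def greedy_policy(q):
    """Best action for each state 0..N_SLOTS under the learned Q-values."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_SLOTS + 1)]


if __name__ == "__main__":
    print("gamma=0.0:", greedy_policy(q_learning(gamma=0.0)))
    print("gamma=0.9:", greedy_policy(q_learning(gamma=0.9)))
```

With a discount factor of 0 the agent books immediately in every state, because an early booking (reward 0) looks better than paying the turn cost of a request. With a discount factor of 0.9 it learns to request both slots first and only then book, since the delayed success reward outweighs the per-turn cost.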


With the breakthroughs in deep learning, deep reinforcement learning (DRL) methods that combine neural networks with RL have recently led to successes in learning policies for a wide range of sequential decision-making problems. These include simulated environments such as the Atari games, the board game Go, and various robotic tasks. Following that, DRL has received a lot of attention in dialogue research and achieved promising results, mainly in single-domain dialogue scenarios. Neural models can extract high-level dialogue states and encode long, complicated utterances, which was the biggest challenge that early works faced. As the focus of DPL research has gradually shifted to more complicated multi-domain datasets, many RL algorithms face scalability problems.


Recently, there has been a flurry of works on adapting and improving RL agents in multi-domain scenarios. However, few works attempt to review the vast literature on recent applications of RL to DPL in TOD systems. Grassl surveyed the use of RL in four types of dialogue systems, namely social chatbots, infobots, task-oriented bots, and personal assistant bots. However, the progress and challenges of using RL in TOD systems were not well discussed. Similarly, Dai et al. reviewed the recent progress and challenges of dialogue management, but because of the wide scope of that review, it contained only a limited discussion of RL methods in DPL. Furthermore, RL dialogue systems often have different settings for the five core RL elements, namely environment, policy, state, action, and reward. Previous surveys did not take these inconsistent settings into account, which resulted in unfair comparisons among the systems.
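For reference, the five elements fit together in the standard RL interaction loop, which can be written as follows. This is the textbook formulation rather than notation taken from the survey; the symbols (s_t for the state, a_t for the action, r_t for the reward, \pi for the policy, Env for the environment, \gamma for the discount factor, T for the final turn) are assumptions introduced here.

```latex
% Assumed notation (not from the survey): at turn t the policy \pi picks an
% action a_t from the state s_t; the environment (the user or a user
% simulator) returns the reward r_t and the next state s_{t+1}.
a_t \sim \pi(\cdot \mid s_t), \qquad
(s_{t+1}, r_t) \sim \mathrm{Env}(s_t, a_t), \qquad
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right]
```

Two systems that differ in any of these elements, for example in the reward definition or in the user simulator playing the role of the environment, are optimizing different objectives J(\pi), which is why comparing their reported results directly can be misleading.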


This survey describes the unique strengths of previous works and categorizes them based on the five elements of RL. It then focuses on three main recent challenges of applying RL to DPL, namely exploration efficiency, the cold start problem, and the large state-action space. Most recent works that use RL to optimize DPL attempt to address these challenges. The procedure used to shortlist these works for review is provided in the Appendix.


The remainder of the paper is organized as follows. Section 2 first gives the problem definition of DPL and elaborates on the challenges of using RL to train a dialogue agent in TOD systems. It then introduces the proposed methodology for characterizing recent DPL works. The methodology is motivated by the observation that the key differences between recently proposed methods boil down to differences in the five fundamental elements of RL. Viewed this way, the similarities and differences between methods become easy to see. Furthermore, this helps identify the component of each work that contributed the most to its improvement. The state-of-the-art techniques of recent DPL works, categorized by the five RL elements, are discussed in detail in Sections 3–7. Section 8 discusses the current status of DPL research with RL. Section 9 presents the challenges of applying RL dialogue agents in real-life scenarios and three promising future research directions. Finally, Section 10 concludes the survey.




Citation: Wai-Chung Kwan, Hong-Ru Wang, Hui-Min Wang, Kam-Fai Wong. A Survey on Recent Advances and Challenges in Reinforcement Learning Methods for Task-oriented Dialogue Policy Learning. Machine Intelligence Research, vol. 20, no. 3, pp. 318-334, 2023. DOI: 10.1007/s11633-022-1347-y

Release Date: 2023-07-11