
Taxonomy of RL methods
The cross-entropy method falls into the model-free and policy-based category of methods. These notions are new, so let's spend some time exploring them. RL methods can be classified along several dimensions:
- Model-free or model-based
- Value-based or policy-based
- On-policy or off-policy
There are other ways to classify RL methods, but for now we're interested in the preceding three. Let's define them, as the specifics of your problem can influence your choice of method.
The term model-free means that the method doesn't build a model of the environment or reward; it just directly connects observations to actions (or to values that are related to actions). In other words, the agent takes the current observation, performs some computation on it, and the result is the action it should take. In contrast, model-based methods try to predict what the next observation and/or reward will be. Based on this prediction, the agent tries to choose the best possible action to take, often making such predictions multiple times to look more and more steps into the future.
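To make the distinction concrete, here is a minimal sketch (not from the book) of the two interfaces for a discrete action space; the class names, layer sizes, and arguments are illustrative assumptions only:

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Model-free: maps an observation directly to action preferences."""
    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, obs):
        # Scores (logits) for every available action
        return self.net(obs)


class EnvModel(nn.Module):
    """Model-based: predicts the next observation and reward from the
    current observation and a chosen action."""
    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size + n_actions, 128), nn.ReLU(),
            nn.Linear(128, obs_size + 1))  # next observation + reward

    def forward(self, obs, action_one_hot):
        out = self.net(torch.cat([obs, action_one_hot], dim=1))
        return out[:, :-1], out[:, -1]     # (predicted next obs, predicted reward)
```

A model-based agent can apply EnvModel repeatedly to its own predictions to look several steps ahead, whereas PolicyNet is used once per step to act.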
Both classes of methods have strengths and weaknesses, but usually pure model-based methods are used in deterministic environments, such as board games with strict rules. On the other hand, model-free methods are usually easier to train, as it's hard to build good models of complex environments with rich observations. All of the methods described in this book are from the model-free category, as those methods have been the most active area of research for the past few years. Only recently have researchers started to mix the benefits of both worlds (for example, refer to DeepMind's papers on imagination in agents; this approach will be described in Chapter 17, Beyond Model-Free – Imagination).
Looking from another angle, policy-based methods directly approximate the policy of the agent, that is, what actions the agent should carry out at every step. The policy is usually represented as a probability distribution over the available actions. In contrast, a method can be value-based: instead of the probabilities of actions, the agent calculates the value of every possible action and chooses the action with the best value. Both families of methods are equally popular, and we'll discuss value-based methods in the next part of the book. Policy-based methods will be the topic of part three.
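The following small sketch (not from the book) shows how the two families select an action from the same observation; the tensors stand in for the outputs of two hypothetical networks:

```python
import torch
import torch.nn.functional as F

# Illustrative outputs for three available actions
policy_logits = torch.tensor([1.0, 2.0, 0.5])   # policy-based: raw scores from a policy net
action_values = torch.tensor([0.3, 0.9, 0.1])   # value-based: estimated value of each action

# Policy-based: turn the scores into a probability distribution and sample from it
probs = F.softmax(policy_logits, dim=0)
policy_action = torch.multinomial(probs, num_samples=1).item()

# Value-based: take the action with the largest estimated value
value_action = torch.argmax(action_values).item()
```

Note that the policy-based agent keeps some randomness in its behavior (it samples), while the value-based agent acts greedily with respect to its value estimates.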
The third important classification of methods is on-policy versus off-policy. We'll discuss this distinction more in parts two and three of the book, but for now, it's enough to describe off-policy as the ability of a method to learn from old historical data (obtained by a previous version of the agent, recorded from human demonstrations, or just seen by the same agent several episodes ago).
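In terms of training data, the difference can be sketched like this (not from the book; the Transition tuple, buffer size, and the play_episode callable are illustrative assumptions):

```python
import collections
import random

Transition = collections.namedtuple(
    'Transition', ['obs', 'action', 'reward', 'next_obs', 'done'])

# Off-policy: old transitions (from earlier versions of the agent, or even
# from human demonstrations) are kept in a replay buffer and reused
replay_buffer = collections.deque(maxlen=100_000)

def sample_off_policy_batch(batch_size=32):
    """Sample training data regardless of which policy produced it."""
    return random.sample(replay_buffer, batch_size)

# On-policy (as in the cross-entropy method): every training batch must be
# produced by the current policy, so old episodes are discarded after each
# update and fresh ones are played
def gather_on_policy_batch(play_episode, batch_size=16):
    """play_episode() is a hypothetical helper that plays one episode with
    the current policy and returns its transitions."""
    return [play_episode() for _ in range(batch_size)]
```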
So, our cross-entropy method is model-free, policy-based, and on-policy, which means the following:
- It doesn't build any model of the environment; it just tells the agent what to do at every step
- It approximates the policy of the agent
- It requires fresh data obtained from the environment