
Value of action

To make our life slightly easier, we can define another quantity in addition to the value of the state, $V(s)$: the value of the action, $Q(s, a)$. Basically, it equals the total reward we can get by executing action $a$ in state $s$, and it can be defined via $V(s)$. Being a much less fundamental entity than $V(s)$, this quantity gave its name to the whole family of methods called "Q-learning", because it is slightly more convenient in practice. In these methods, our primary objective is to get the values of $Q$ for every pair of state and action:

$$Q(s, a) = \mathbb{E}_{s' \sim S}\big[r(s, a) + \gamma V(s')\big] = \sum_{s' \in S} p_{a, s \to s'}\big(r(s, a) + \gamma V(s')\big)$$
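As a minimal sketch (not the book's code), this definition can be computed directly for a small discrete MDP. The container names `probs`, `rewards`, and `values` are hypothetical stand-ins for the transition probabilities $p_{a, s \to s'}$, the rewards $r(s, a)$, and the state values $V(s')$:

```python
# A minimal sketch: Q(s, a) computed from state values V(s').
# probs[(s, a)] maps next_state -> transition probability,
# rewards[(s, a)] is the immediate reward, values[s'] is V(s').
# These container names are hypothetical, not the book's API.

def action_value(state, action, probs, rewards, values, gamma=1.0):
    """Q(s, a) = sum over s' of p(s'|s, a) * (r(s, a) + gamma * V(s'))."""
    return sum(prob * (rewards[(state, action)] + gamma * values[next_state])
               for next_state, prob in probs[(state, action)].items())
```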
$Q$ for this state $s$ and action $a$ equals the expected immediate reward plus the discounted long-term reward of the destination state. We can also define $V(s)$ via $Q(s, a)$:

$$V(s) = \max_{a \in A} Q(s, a)$$
This just means that the value of some state equals the value of the best action we can execute from that state. It may look very close to the value of the state, but there is still a difference, which is important to understand. Finally, we can express $Q(s, a)$ via itself, which will be used in the next chapter's topic of Q-learning:

$$Q(s, a) = r(s, a) + \gamma \max_{a' \in A} Q(s', a')$$
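Here is a similarly hedged sketch of the other two relations, reusing the same hypothetical containers as above: `state_value` implements $V(s) = \max_a Q(s, a)$, and `action_value_recursive` is the recursive form of $Q(s, a)$, with the expectation over next states written out explicitly for stochastic transitions:

```python
# A sketch of V(s) = max_a Q(s, a) and of Q expressed via itself,
# using the same hypothetical probs/rewards/q_values containers.

def state_value(state, actions, q_values):
    """V(s) = max over a of Q(s, a)."""
    return max(q_values[(state, a)] for a in actions)

def action_value_recursive(state, action, probs, rewards, q_values,
                           actions, gamma=1.0):
    """Q(s, a) = E over s' of [r(s, a) + gamma * max over a' of Q(s', a')]."""
    total = 0.0
    for next_state, prob in probs[(state, action)].items():
        best_next = max(q_values[(next_state, a)] for a in actions)
        total += prob * (rewards[(state, action)] + gamma * best_next)
    return total
```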
To give you a concrete example, let's consider an environment which is similar to FrozenLake, but has a much simpler structure: we have one initial state, $s_0$, surrounded by four target states, $s_1$, $s_2$, $s_3$, and $s_4$, with different rewards.

Figure 5: A simplified grid-like environment

Every action is probabilistic in the same way as in FrozenLake: with a 33% chance the action is executed without modification, with a 33% chance we slip to the cell to the left of the target cell, and with a 33% chance we slip to the cell to the right of it. For simplicity, we use a discount factor of $\gamma = 1$.

Figure 6: A transition diagram of the grid environment

Let's calculate the values of the actions to begin with. The terminal states $s_1 \ldots s_4$ have no outbound connections, so $Q$ for those states is zero for all actions. Because of this, the value of each terminal state is equal to its immediate reward (once we get there, our episode ends without any subsequent states): $V(s_i) = r_i$, where $r_i$ is the reward attached to state $s_i$ in the figure.

The values of the actions for state $s_0$ are a bit more complicated. Let's start with the "up" action. Its value, according to the definition, is equal to the expected sum of the immediate reward plus the long-term value of the destination state. We have no subsequent steps for any possible transition of the "up" action, so $Q(s_0, \mathrm{up})$ is just the 0.33-weighted sum of the rewards of the three terminal states this action can land in: the intended cell above $s_0$ and the two cells we can slip into.

Repeating this for the rest of $s_0$'s actions gives the values of the other three actions in the same way, each as a 0.33-weighted sum of the rewards of the three cells that action can reach.

The final value for state $s_0$ is the maximum of those actions' values, which is 2.97.
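To make the arithmetic easy to check, here is a small sketch of the same calculation. The placement of the terminal states around $s_0$ and their rewards (1, 2, 3, and 4) are assumptions taken from the figure, so the individual per-action numbers should be treated as illustrative; the point is the structure of the calculation, and with these assumptions the best action's value comes out to 0.33 * (2 + 3 + 4) = 2.97, matching the number above:

```python
# A sketch of the action-value calculation for the simplified grid.
# The reward layout below is an assumption read off the figure;
# the structure of the calculation is what matters.

SLIP_PROB = 0.33  # chance of the intended move and of each slip
GAMMA = 1.0       # terminal states have no future reward anyway

# For every action: [intended cell, slip-left cell, slip-right cell]
# under an assumed layout s1=west, s2=north, s3=east, s4=south.
REACHABLE = {
    "up":    ["s2", "s1", "s3"],
    "down":  ["s4", "s3", "s1"],
    "left":  ["s1", "s4", "s2"],
    "right": ["s3", "s2", "s4"],
}
# Assumed terminal rewards; each terminal value equals its reward.
VALUES = {"s1": 1.0, "s2": 2.0, "s3": 3.0, "s4": 4.0}

q_values = {
    action: sum(SLIP_PROB * (VALUES[cell] + GAMMA * 0.0) for cell in cells)
    for action, cells in REACHABLE.items()
}
print(q_values)                # Q(s0, a) for each of the four actions
print(max(q_values.values()))  # V(s0) = max_a Q(s0, a), about 2.97
```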

Q values are much more convenient in practice, as it's much simpler for the agent to make decisions about actions based on Q than based on V. In the case of Q, to choose an action in some state, the agent just needs to calculate Q for all available actions using the current state, and pick the action with the largest value of Q. To do the same using values of states, the agent needs to know not only the values but also the probabilities of transitions. In practice, we rarely know them in advance, so the agent needs to estimate the transition probabilities for every action and state pair. Later in this chapter, we'll see this in practice by solving the FrozenLake environment both ways. However, to be able to do this, we still have one important thing missing: a general way to calculate those Vs and Qs.
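As a final sketch (again with hypothetical containers, not the book's code), the practical difference looks like this: with Q, the agent only needs an argmax over the available actions, while with V alone it also needs the transition probabilities and rewards to evaluate a one-step lookahead:

```python
# A sketch contrasting action selection from Q versus from V.
# q_values, values, probs and rewards are hypothetical containers.

def select_action_with_q(state, actions, q_values):
    # With Q, only the Q value of every available action is needed.
    return max(actions, key=lambda a: q_values[(state, a)])

def select_action_with_v(state, actions, probs, rewards, values, gamma=1.0):
    # With V alone, transition probabilities and rewards are needed
    # to evaluate a one-step lookahead for every action.
    def lookahead(action):
        return sum(prob * (rewards[(state, action)] + gamma * values[next_state])
                   for next_state, prob in probs[(state, action)].items())
    return max(actions, key=lookahead)
```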