
Value of action
To make our life slightly easier, we can define another quantity, in addition to the value of the state, $V(s)$: the value of the action, $Q(s, a)$. Basically, it equals the total reward we can get by executing action $a$ in state $s$, and it can be defined via $V(s)$. Being a much less fundamental entity than $V(s)$, this quantity gave its name to the whole family of methods called "Q-learning", because it is slightly more convenient in practice. In these methods, our primary objective is to get values of $Q$ for every pair of state and action:

$$Q(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma V(s')\big] = \sum_{s' \in S} p(s' \mid s, a)\,\big(r(s, a) + \gamma V(s')\big)$$

Q for this state, $s$, and action, $a$, equals the expected immediate reward plus the discounted long-term value of the destination state. We can also define $V(s)$ via $Q(s, a)$:

$$V(s) = \max_{a \in A} Q(s, a)$$

This just means that the value of some state equals the value of the best action we can execute from this state. It may look very close to the value of the state, but there is still a difference, which is important to understand. Finally, we can express $Q(s, a)$ via itself, which will be used in the next chapter's topic of Q-learning:

$$Q(s, a) = \mathbb{E}_{s'}\Big[r(s, a) + \gamma \max_{a' \in A} Q(s', a')\Big]$$

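The relationships above translate directly into code. The following is a minimal sketch, assuming we somehow know the MDP's transition probabilities and rewards and keep them in dense NumPy arrays indexed by state and action; the array layout and function names are purely illustrative, not an API from any particular library:

```python
import numpy as np

def action_value(probs, rewards, values, state, action, gamma=0.9):
    """Q(s, a): expected immediate reward plus discounted value of s'.

    probs[s, a, s']   - transition probability p(s' | s, a)
    rewards[s, a, s'] - immediate reward of that transition
    values[s']        - current estimate of V(s')
    """
    return sum(
        probs[state, action, s2] * (rewards[state, action, s2] + gamma * values[s2])
        for s2 in range(len(values))
    )

def state_value(probs, rewards, values, state, gamma=0.9):
    """V(s) = max over actions of Q(s, a)."""
    n_actions = probs.shape[1]
    return max(
        action_value(probs, rewards, values, state, a, gamma)
        for a in range(n_actions)
    )
```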
To give you a concrete example, let's consider an environment that is similar to FrozenLake but has a much simpler structure: we have one initial state, $s_0$, surrounded by four target states, $s_1, s_2, s_3, s_4$, with different rewards.

Figure 5: A simplified grid-like environment
Every action is probabilistic in the same way as in FrozenLake: with a 33% chance our action will be executed without modification, with a 33% chance we will slip to the cell to the left of our target cell, and with a 33% chance we will slip to the cell to the right of it. For simplicity, we use a discount factor of $\gamma = 1$.

Figure 6: A transition diagram of the grid environment
Let's calculate the values of the actions, to begin with. Terminal states have no outbound connections, so Q for those states is zero for all actions. Because of this, the values of the terminal states are equal to their immediate rewards (once we get there, our episode ends without any subsequent states): $V(s_1) = 1$, $V(s_2) = 2$, $V(s_3) = 3$, $V(s_4) = 4$.
The values of the actions for state $s_0$ are a bit more complicated. Let's start with the "up" action. Its value, according to the definition, is equal to the expected sum of the immediate reward plus the long-term value of the subsequent steps. We have no subsequent steps for any possible transition of the "up" action, so

$$Q(s_0, \text{up}) = 0.33 \cdot V(s_1) + 0.33 \cdot V(s_2) + 0.33 \cdot V(s_4) = 0.33 \cdot 1 + 0.33 \cdot 2 + 0.33 \cdot 4 = 2.31$$
Repeating this for the rest of the actions results in the following:

$$Q(s_0, \text{left}) = 0.33 \cdot V(s_1) + 0.33 \cdot V(s_2) + 0.33 \cdot V(s_3) = 0.33 \cdot 1 + 0.33 \cdot 2 + 0.33 \cdot 3 = 1.98$$

$$Q(s_0, \text{right}) = 0.33 \cdot V(s_4) + 0.33 \cdot V(s_1) + 0.33 \cdot V(s_3) = 0.33 \cdot 4 + 0.33 \cdot 1 + 0.33 \cdot 3 = 2.64$$

$$Q(s_0, \text{down}) = 0.33 \cdot V(s_3) + 0.33 \cdot V(s_2) + 0.33 \cdot V(s_4) = 0.33 \cdot 3 + 0.33 \cdot 2 + 0.33 \cdot 4 = 2.97$$
The final value for state $s_0$ is the maximum of those action values, which is 2.97.
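To make the arithmetic above easy to check, here is a tiny Python sketch that recomputes the four action values. The per-action destination states and the terminal rewards are read off the transition diagram above, so treat them as assumptions tied to this particular layout:

```python
# Values of the terminal states equal their immediate rewards.
V = {"s1": 1.0, "s2": 2.0, "s3": 3.0, "s4": 4.0}

# For every action from s0: the three cells we may end up in, each with
# probability 0.33 (the intended cell, slip to the left, slip to the right).
TRANSITIONS = {
    "up":    ("s1", "s2", "s4"),
    "left":  ("s1", "s2", "s3"),
    "right": ("s4", "s1", "s3"),
    "down":  ("s3", "s2", "s4"),
}

q_values = {a: sum(0.33 * V[s] for s in dst) for a, dst in TRANSITIONS.items()}
for action, q in q_values.items():
    print(f"Q(s0, {action}) = {q:.2f}")
print(f"V(s0) = {max(q_values.values()):.2f}")   # -> 2.97
```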
Q values are much more convenient in practice, as it is much simpler for the agent to make decisions about actions based on Q than based on V. With Q, to choose an action in a given state, the agent just needs to calculate Q for all available actions in the current state and pick the action with the largest value of Q. To do the same using values of states, the agent needs to know not only the values, but also the transition probabilities. In practice, we rarely know them in advance, so the agent needs to estimate the transition probabilities for every action and state pair. Later in this chapter, we'll see this in practice by solving the FrozenLake environment both ways. However, to be able to do this, we still have one important thing missing: a general way to calculate those Vs and Qs.
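As a small illustration of that convenience, here is a hedged sketch of greedy action selection from a Q-table; the table name and shape are assumptions made for the example, not part of any particular library:

```python
import numpy as np

def select_action(q_table: np.ndarray, state: int) -> int:
    """Greedy action choice: argmax over Q(state, a) for all actions.

    Assumes q_table has shape (n_states, n_actions) and was filled in by
    some learning procedure. No transition probabilities are needed.
    """
    return int(np.argmax(q_table[state]))

# Doing the same from state values V(s') would require knowing
# p(s' | s, a) and r(s, a) to evaluate each action first.
```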