
Q-learning for FrozenLake

The whole example is in the Chapter05/02_frozenlake_q_learning.py file, and the difference from the previous version is really minor. The most obvious change is to our value table. In the previous example, we kept the value of the state, so the key in the dictionary was just a state. Now we need to store values of the Q-function, which has two parameters, state and action, so the key in the value table is now a composite (state, action) pair.
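
For illustration, here is a minimal sketch of how such a composite-keyed table can be declared and queried (the defaultdict usage mirrors the previous example; the exact initialization inside the book's Agent class may differ slightly):

    import collections

    # Q-values are keyed by a (state, action) pair instead of just the state;
    # missing entries default to 0.0, which acts as the initial Q-value
    values = collections.defaultdict(float)
    values[(0, 1)] = 0.25                                 # Q(state=0, action=1)
    state_value = max(values[(0, a)] for a in range(4))   # greedy value of state 0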

The second difference is in our calc_action_value function. We just don't need it anymore, as our action values are stored in the value table. Finally, the most important change in the code is in the agent's value_iteration method. Before, it was just a wrapper around the calc_action_value call, which did the job of the Bellman approximation. Now that this function is gone and has been replaced by the value table, we need to do this approximation inside the value_iteration method.

Let's look at the code. As it's almost the same, I'll jump directly to the most interesting value_iteration function:

    def value_iteration(self):
        for state in range(self.env.observation_space.n):
            for action in range(self.env.action_space.n):
                action_value = 0.0
                # counters of target states observed after taking this action
                target_counts = self.transits[(state, action)]
                total = sum(target_counts.values())
                for tgt_state, count in target_counts.items():
                    reward = self.rewards[(state, action, tgt_state)]
                    # the value of the target state is its best Q-value
                    best_action = self.select_action(tgt_state)
                    action_value += (count / total) * (reward + GAMMA * self.values[(tgt_state, best_action)])
                self.values[(state, action)] = action_value

The code is very similar to calc_action_value in the previous example and, in fact, it does almost the same thing. For the given state and action, it needs to calculate the value of this action using statistics about the target states that we've reached with this action. To calculate this value, we use the Bellman equation and our counters, which allow us to approximate the probability of reaching each target state. However, in the Bellman equation we have the value of the state, and now we need to calculate it differently. Before, we had it stored in the value table (as we approximated the values of states), so we just took it from this table. We can't do this anymore, so we have to call the select_action method, which will choose the action with the largest Q-value, and then we take this Q-value as the value of the target state. Of course, we could implement another function to calculate this value of the state, but select_action does almost everything we need, so we will reuse it here.
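
To make this relationship explicit: the value of a state under a greedy policy is just the maximum of its Q-values, V(s) = max_a Q(s, a), which is what select_action gives us implicitly. A hypothetical helper doing exactly that (not part of the book's code, shown only for illustration) could look like this:

    def best_value(self, state):
        # V(s) = max_a Q(s, a): the same loop as in select_action, but we
        # return the maximal Q-value instead of the action that achieves it
        return max(self.values[(state, action)]
                   for action in range(self.env.action_space.n))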

There is another piece of this example that I'd like to emphasize here. Let's look at our select_action method:

    def select_action(self, state):
        # greedy selection: return the action with the largest Q-value for this state
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action

As I said, we don't have the calc_action_value method anymore, so, to select an action, we just iterate over the actions and look up their values in our value table. It may look like a minor improvement, but if you think about the data we used in calc_action_value, it becomes obvious why learning the Q-function is much more popular in RL than learning the V-function.

Our calc_action_value function uses information about both the reward and the transition probabilities. It's not a huge problem for the value iteration method, which relies on this information during training. However, in the next chapter, we'll learn about an extension of the value iteration method that doesn't require the approximation of probabilities, but just estimates values from environment samples. For such methods, this dependency on probabilities adds an extra burden for the agent. In the case of Q-learning, all the agent needs to make a decision is the Q-values.
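
To give a taste of this difference: with Q-values in place, a single observed transition (state, action, reward, next_state) is enough to blend a new estimate into the table, and no transition counts are needed at all. The following is only a rough sketch of such a sample-based update, not the book's code; ALPHA is a hypothetical learning rate (for example, 0.2), and the actual method is the subject of the next chapter:

    # ALPHA and GAMMA are assumed to be module-level constants in this sketch
    def sample_update(self, state, action, reward, next_state):
        # the value of the next state is its best Q-value, as before
        best_next = max(self.values[(next_state, a)]
                        for a in range(self.env.action_space.n))
        new_q = reward + GAMMA * best_next
        old_q = self.values[(state, action)]
        # blend the freshly sampled estimate into the stored Q-value
        self.values[(state, action)] = old_q * (1 - ALPHA) + new_q * ALPHA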

I don't want to say that V-functions are completely useless, because they are an essential part of the Actor-Critic method, which we'll talk about in part three of this book. However, in the area of value learning, the Q-function is the definite favorite. With regard to convergence speed, both our versions are almost identical (but the Q-learning version requires roughly four times more memory for the value table, as FrozenLake has four actions for each of its 16 states, so the table grows from 16 entries to 64).

rl_book_samples/Chapter05$ ./02_frozenlake_q_learning.py
[2017-10-13 12:38:56,658] Making new env: FrozenLake-v0
[2017-10-13 12:38:56,863] Making new env: FrozenLake-v0
Best reward updated 0.000 -> 0.050
Best reward updated 0.050 -> 0.200
Best reward updated 0.200 -> 0.350
Best reward updated 0.350 -> 0.700
Best reward updated 0.700 -> 0.750
Best reward updated 0.750 -> 0.850
Solved in 22 iterations!