
Category 1 - value-based

The value function looks like the right-hand side of the image (the sum of discounted future rewards), where every state has a value. Say the state one step away from the goal has a value of -1, and a state two steps away has a value of -2. In the same way, the starting point has a value of -16, and if the agent gets stuck in the wrong place, the value could be as low as -24. The agent moves across the grid by always heading toward the best available value until it reaches its goal. For example, suppose the agent is in a state with a value of -15 and can move either north or south. It chooses to move north, because the state to the north has the higher value of -14, rather than moving south to a state with a value of -16. In this way, the agent picks its path across the grid until it reaches the goal, as sketched in the short example after the list below.

  • Value function: only values are defined, one for every state
  • No policy (implicit): no explicit policy is stored; actions are chosen greedily according to the values of the neighbouring states
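
To make this concrete, the following is a minimal sketch (not the book's own code) of value-based movement on a small grid: every cell stores a value equal to the negative number of steps to the goal, and the agent repeatedly moves to the neighbouring cell with the highest value until it reaches the goal. The grid layout, the values array, and the greedy_step helper are illustrative assumptions chosen to mirror the -1, -2, ..., -16 example above.

import numpy as np

# Each cell holds the (negative) number of steps needed to reach the goal,
# so values increase toward the goal. The bottom-right cell is the goal (0).
# These numbers are illustrative, not taken from a specific environment.
values = np.array([
    [-6, -5, -4, -3],
    [-5, -4, -3, -2],
    [-4, -3, -2, -1],
    [-3, -2, -1,  0],
])

# The four possible moves: north, south, west, east.
moves = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}

def greedy_step(state, values):
    """Pick the neighbouring state with the highest value (implicit greedy policy)."""
    best_state, best_value = state, values[state]
    for dr, dc in moves.values():
        r, c = state[0] + dr, state[1] + dc
        if 0 <= r < values.shape[0] and 0 <= c < values.shape[1]:
            if values[r, c] > best_value:
                best_state, best_value = (r, c), values[r, c]
    return best_state

# Follow the values greedily from the start state to the goal.
state = (0, 0)                    # starting point, value -6 in this grid
path = [state]
while values[state] != 0:         # stop once the goal (value 0) is reached
    state = greedy_step(state, values)
    path.append(state)

print(path)

Running this prints the sequence of cells visited on the way to the goal; note that no policy is represented anywhere, only the table of state values, which is exactly the point of the value-based category.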