The Bellman Optimality Equation
Let’s first briefly outline the deterministic and stochastic environments in reinforcement learning (RL). We can then define the Bellman equation
in each case and later generalize it.
- Deterministic Environment: the environment is considered deterministic if a given action $a$ taken in a state $s$ always results in the same next state $s'$. You can say that the state transition probability ($P_{ss'}^{a}$) is 1, defined as:

  $$P_{ss'}^{a} = 1$$

  Let's look at a scenario in the deterministic case wherein you can take several actions $a_1, a_2, \ldots$ in a given state $s$, each resulting in a different next state $s'_1, s'_2, \ldots$ respectively. The value of the state would then be defined as the maximum of the sum of the immediate reward and the discounted long-term reward (or value) of the next state, shown as:

  $$V(s) = \max_{a}\bigl[R(s, a) + \gamma\, V(s')\bigr]$$

  The above is the Bellman equation of value for the deterministic case.
- Stochastic Environment: the environment is considered stochastic if a given action $a$ taken in a given state $s$ can result in different next states with different transition probabilities. Let's look at a scenario in the stochastic case wherein an action $a$ taken in a given state $s$ results in three different next states $s'_1, s'_2, s'_3$, each with some transition probability $P_{ss'_1}^{a}, P_{ss'_2}^{a}, P_{ss'_3}^{a}$. The expected value of the state would then be defined as the sum of the immediate reward and the discounted long-term rewards (or values) of the next states weighted by their respective transition probabilities, shown as:

  $$V(s) = R(s, a) + \gamma\bigl[P_{ss'_1}^{a}\, V(s'_1) + P_{ss'_2}^{a}\, V(s'_2) + P_{ss'_3}^{a}\, V(s'_3)\bigr]$$

  A short code sketch of both backups follows this list.
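To make the two backups concrete, here is a minimal Python sketch that evaluates each of them for a single state. The discount factor, rewards, next-state values, and transition probabilities below are illustrative assumptions, not values from the text.

```python
gamma = 0.9  # discount factor (assumed for illustration)

# Deterministic case: each action a leads to exactly one next state s',
# so V(s) is the max over actions of R(s, a) + gamma * V(s').
deterministic_outcomes = {
    "a1": {"reward": 1.0, "next_value": 5.0},
    "a2": {"reward": 2.0, "next_value": 3.0},
    "a3": {"reward": 0.0, "next_value": 7.0},
}
v_deterministic = max(
    o["reward"] + gamma * o["next_value"] for o in deterministic_outcomes.values()
)

# Stochastic case: one action a can lead to several next states s'_i,
# each with transition probability P(s'_i | s, a); the discounted
# next-state values are weighted by those probabilities.
reward = 1.0
transitions = [  # (transition probability, value of next state)
    (0.6, 5.0),
    (0.3, 3.0),
    (0.1, 7.0),
]
v_stochastic = reward + gamma * sum(p * v for p, v in transitions)

print(f"Deterministic backup: {v_deterministic:.2f}")
print(f"Stochastic backup:    {v_stochastic:.2f}")
```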
Bellman Optimality Equation
Combining the Bellman equation for the deterministic case (the maximum over actions) with the expected value over stochastic transitions gives the Bellman optimality equation for the general case:

$$V^{*}(s) = \max_{a}\Bigl[R(s, a) + \gamma \sum_{s'} P_{ss'}^{a}\, V^{*}(s')\Bigr]$$

Also written as:

$$V^{*}(s) = \max_{a}\, \mathbb{E}\bigl[R_{t+1} + \gamma\, V^{*}(S_{t+1}) \mid S_t = s,\, A_t = a\bigr]$$
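As a rough illustration of how the optimality backup is applied in practice, the sketch below runs value iteration on a tiny made-up MDP: the transition matrix `P`, reward tensor `R`, state/action counts, and discount factor are all invented for the example and are not taken from the text.

```python
import numpy as np

# Repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_s' P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]
# on a toy 2-state, 2-action MDP until the values stop changing.

n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] : transition probabilities, R[s, a, s'] : rewards
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.5, 0.5], [0.0, 1.0]],
])

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=1)          # greedy (optimality) backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("Optimal state values:", V)
print("Greedy policy:       ", Q.argmax(axis=1))
```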