Let’s first briefly outline the deterministic and stochastic environments in reinforcement learning (RL). We can then define the Bellman equation in each case and later generalize it.

  • Deterministic Environment: the environment is considered deterministic if a given action \((A=a)\) taken in a state \((S=s)\) always results in the same next state \((S=s')\). You can say that the state transition probability (\(P\)) is 1, defined as: \(P[S_{t+1}= s' | S_{t}=s, A_{t}=a] = 1\).

    Let’s look at a scenario in the deterministic case wherein you can take \(n\) actions \((A_{1...n})\) in a given state \((S=s_{0})\), each resulting in a different next state \((S_{1...n})\). The value of the state \((V_{s_{0}})\) would then be defined as the maximum, over the actions, of the immediate reward plus the discounted long-term reward (or value) of the resulting next state, shown as:

    \[\begin{aligned} V_{s_{0}} = \max_{a = 1 \ldots n}\left(r_{a} + \gamma V_{s_{a}}\right) \end{aligned}\]

    The above is the Bellman equation of value for the deterministic case.

  • Stochastic Environment: the environment is considered to be stochastic if a given action \((A=a)\) taken in a given state \((S = s)\) can result in different next states \((S=s')\) with different transition probabilities.

    Let’s look at a scenario in the stochastic case wherein an action \((A = a_{0})\) taken in a given state \((S=s_{0})\) results in three different next states \((S = s' \mid s' \in \{s_{1}, s_{2}, s_{3}\})\), each with some transition probability \((P = p_{i})\). The value of taking this action in the state \((V_{s_{0}}(a_{0}))\) would then be defined as the sum, over the possible next states, of each transition probability multiplied by the immediate reward plus the discounted long-term reward (or value) of that next state (a short code sketch after this list illustrates both the deterministic and stochastic calculations), shown as:

    \[\begin{aligned} V_{s_{0}}(a_{0}) = p_{1}(r_{1} + \gamma V_{s_{1}}) + p_{2}(r_{2} + \gamma V_{s_{2}}) + p_{3}(r_{3} + \gamma V_{s_{3}}) \end{aligned}\]
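
To make the two cases concrete, here is a minimal Python sketch of both one-step calculations. The rewards, transition probabilities, next-state values, and the discount factor \(\gamma = 0.9\) are made-up illustrative assumptions, not values from any particular environment.

```python
# Minimal sketch of the two one-step value calculations above.
# All numbers (rewards, probabilities, next-state values, gamma) are
# made-up assumptions for illustration only.

GAMMA = 0.9

# Deterministic case: each action leads to exactly one next state, so
# V(s0) is the max over actions of (immediate reward + gamma * V(next state)).
deterministic_actions = [
    {"reward": 1.0, "next_value": 5.0},   # action 1
    {"reward": 2.0, "next_value": 3.0},   # action 2
    {"reward": 0.0, "next_value": 7.0},   # action 3
]
v_s0_det = max(a["reward"] + GAMMA * a["next_value"] for a in deterministic_actions)
print(f"Deterministic V(s0) = {v_s0_det:.2f}")

# Stochastic case: the single action a0 can land in s1, s2, or s3 with
# probabilities p1, p2, p3, so V(s0, a0) is the probability-weighted sum
# of (immediate reward + gamma * V(next state)).
stochastic_outcomes = [
    {"prob": 0.7, "reward": 1.0, "next_value": 5.0},  # -> s1
    {"prob": 0.2, "reward": 2.0, "next_value": 3.0},  # -> s2
    {"prob": 0.1, "reward": 0.0, "next_value": 7.0},  # -> s3
]
v_s0_a0 = sum(o["prob"] * (o["reward"] + GAMMA * o["next_value"])
              for o in stochastic_outcomes)
print(f"Stochastic V(s0, a0) = {v_s0_a0:.2f}")
```

Running it prints the deterministic value (the single best action) and the stochastic value of taking \(a_{0}\) (the probability-weighted average over its three possible outcomes).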

Bellman Optimality Equation

Combining the Bellman equation for the deterministic case (the max over actions) with the expected value over stochastic transitions gives the Bellman optimality equation for the general case:

\[\begin{aligned} V_{0} = \max_{a \in A} \sum_{s \in S} p_{a, 0\rightarrow s}\,(r_{s,a} + \gamma V_{s}) \end{aligned}\]

Also written as,

\[\begin{aligned} V_{0} = \max_{a \in A} \mathbb{E}_{s \sim S} \left[r_{s,a} + \gamma V_{s}\right] \end{aligned}\]
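
As a rough sketch of how this backup looks in code, the snippet below takes the expectation over next states for each action and then the max over actions. The transition table, rewards, next-state values, and \(\gamma = 0.9\) are again made-up assumptions chosen only to make the computation runnable.

```python
# Minimal sketch of the Bellman optimality backup:
# V(s0) = max over actions of sum over next states of p * (r + gamma * V(next)).
# The transition table and gamma are illustrative assumptions.

GAMMA = 0.9

# transitions[action] is a list of (probability, reward, value_of_next_state)
transitions = {
    "left":  [(0.8, 1.0, 4.0), (0.2, 0.0, 1.0)],
    "right": [(0.5, 2.0, 3.0), (0.5, 0.5, 6.0)],
}

def bellman_backup(transitions, gamma):
    """Return max_a sum_s p(s | a) * (r(s, a) + gamma * V(s))."""
    return max(
        sum(p * (r + gamma * v_next) for p, r, v_next in outcomes)
        for outcomes in transitions.values()
    )

print(f"V(s0) = {bellman_backup(transitions, GAMMA):.2f}")
```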