Let’s first briefly outline the deterministic and stochastic environments in reinforcement learning (RL). We can then define the Bellman equation in each case and later generalize it.

  • Deterministic Environment: the environment is considered deterministic if a given action \((A=a)\) taken in a state \((S=s)\) always results in the same next state \((S=s')\). You can say that the state transition probability (\(P\)) is 1, defined as: \(P[S_{t+1}= s' | S_{t}=s, A_{t}=a] = 1\).

    Let’s look at a scenario in the deterministic case wherein you can take \(n\) actions \((A_{1...n})\) in a given state \((S=s_{0})\), each resulting in a different next state \((S_{1...n})\). The value of the state \((V_{s_{0}})\) would then be defined as the maximum, over the actions, of the immediate reward plus the discounted long-term reward (or value) of the resulting next state, shown as:

    \[\begin{aligned} V_{s_{0}} = \max_{a = 1 \ldots n}\left(r_{a} + \gamma V_{s_{a}}\right) \end{aligned}\]

    The above is the Bellman equation of value for the deterministic case.

  • Stochastic Environment: the environment is considered to be stochastic if a given action \((A=a)\) taken in a given state \((S = s)\) can result in different next states \((S=s')\) with different transition probabilities.

    Let’s look at a scenario in the stochastic case wherein an action \((A = a_{0})\) taken in a given state \((S=s_{0})\) results in three different next states \((S = s' \mid s' \in \{s_{1}, s_{2}, s_{3}\})\), each with some transition probability \((P = p_{i})\). The value of taking this action in the state \((V_{s_{0}}(a_{0}))\) would then be defined as the sum, over the possible next states, of each transition probability multiplied by the immediate reward plus the discounted long-term reward (or value) of that next state (a short code sketch after this list illustrates both the deterministic and stochastic calculations), shown as:

    \[\begin{aligned} V_{s_{0}}(a_{0}) = p_{1}(r_{1} + \gamma V_{s_{1}}) + p_{2}(r_{2} + \gamma V_{s_{2}}) + p_{3}(r_{3} + \gamma V_{s_{3}}) \end{aligned}\]
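
To make the two cases concrete, here is a minimal Python sketch of both one-step calculations. The rewards, transition probabilities, next-state values, and the discount factor \(\gamma = 0.9\) are made-up illustrative assumptions, not values from any particular environment.

```python
# Minimal sketch of the two one-step value calculations above.
# All numbers (rewards, probabilities, next-state values, gamma) are
# made-up assumptions for illustration only.

GAMMA = 0.9

# Deterministic case: each action leads to exactly one next state, so
# V(s0) is the max over actions of (immediate reward + gamma * V(next state)).
deterministic_actions = [
    {"reward": 1.0, "next_value": 5.0},   # action 1
    {"reward": 2.0, "next_value": 3.0},   # action 2
    {"reward": 0.0, "next_value": 7.0},   # action 3
]
v_s0_det = max(a["reward"] + GAMMA * a["next_value"] for a in deterministic_actions)
print(f"Deterministic V(s0) = {v_s0_det:.2f}")

# Stochastic case: the single action a0 can land in s1, s2, or s3 with
# probabilities p1, p2, p3, so V(s0, a0) is the probability-weighted sum
# of (immediate reward + gamma * V(next state)).
stochastic_outcomes = [
    {"prob": 0.7, "reward": 1.0, "next_value": 5.0},  # -> s1
    {"prob": 0.2, "reward": 2.0, "next_value": 3.0},  # -> s2
    {"prob": 0.1, "reward": 0.0, "next_value": 7.0},  # -> s3
]
v_s0_a0 = sum(o["prob"] * (o["reward"] + GAMMA * o["next_value"])
              for o in stochastic_outcomes)
print(f"Stochastic V(s0, a0) = {v_s0_a0:.2f}")
```

Running it prints the deterministic value (the single best action) and the stochastic value of taking \(a_{0}\) (the probability-weighted average over its three possible outcomes).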

Bellman Optimality Equation

Combining the Bellman equation for the deterministic case (the max over actions) with the expected value over stochastic transitions gives the Bellman optimality equation for the general case:

\[\begin{aligned} V_{0} = \max_{a \in A} \sum_{s \in S} p_{a, 0\rightarrow s}\,(r_{s,a} + \gamma V_{s}) \end{aligned}\]

Also written as,

\[\begin{aligned} V_{0} = \max_{a \in A} \mathbb{E}_{s \sim S} \left[r_{s,a} + \gamma V_{s}\right] \end{aligned}\]
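
As a rough sketch of how this backup looks in code, the snippet below takes the expectation over next states for each action and then the max over actions. The transition table, rewards, next-state values, and \(\gamma = 0.9\) are again made-up assumptions chosen only to make the computation runnable.

```python
# Minimal sketch of the Bellman optimality backup:
# V(s0) = max over actions of sum over next states of p * (r + gamma * V(next)).
# The transition table and gamma are illustrative assumptions.

GAMMA = 0.9

# transitions[action] is a list of (probability, reward, value_of_next_state)
transitions = {
    "left":  [(0.8, 1.0, 4.0), (0.2, 0.0, 1.0)],
    "right": [(0.5, 2.0, 3.0), (0.5, 0.5, 6.0)],
}

def bellman_backup(transitions, gamma):
    """Return max_a sum_s p(s | a) * (r(s, a) + gamma * V(s))."""
    return max(
        sum(p * (r + gamma * v_next) for p, r, v_next in outcomes)
        for outcomes in transitions.values()
    )

print(f"V(s0) = {bellman_backup(transitions, GAMMA):.2f}")
```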