Let’s first briefly outline the deterministic and stochastic environments in reinforcement learning (RL). We can then define the Bellman equation in each case and later generalize it.

  • Deterministic Environment: the environment is considered deterministic if a given action ($A=a$) taken in a given state ($S=s$) always results in the same next state ($S=s'$). You can say that the state transition probability ($P$) is 1, defined as: $P[S_{t+1}=s' \mid S_t=s, A_t=a] = 1$.

    Let’s look at a scenario in the deterministic case wherein you can take $n$ actions ($A_{1 \ldots n}$) in a given state ($S=s_0$), each resulting in a different next state ($S_{1 \ldots n}$) respectively. The value of the state ($V_{s_0}$) would then be defined as the maximum over actions of the sum of the immediate reward ($R_a$) and the discounted long-term reward (or value, $V_a$) of the next state reached by that action, shown as:

    $V_{s_0} = \max_{a = 1 \ldots n}\left(R_a + \gamma V_a\right)$

    The above is the Bellman equation of value for the deterministic case; a numerical sketch of it appears after this list.

  • Stochastic Environment: the environment is considered to be stochastic if a given action ($A=a$) taken in a given state ($S=s$) can result in different next states ($S=s'$), each with a different transition probability.

    Let’s look at a scenario in the stochastic case wherein an action ($A=a_0$) taken in a given state ($S=s_0$) results in three different next states ($s' \in \{s_1, s_2, s_3\}$), each with some transition probability ($P=p_i$). The expected value of the state under this action ($V_{s_0}(a_0)$) would then be defined as the sum, over the next states, of each transition probability multiplied by the immediate reward plus the discounted long-term reward (or value) of that next state, shown as (see the sketch after this list):

    $V_{s_0}(a_0) = p_1\left(r_1 + \gamma V_{s_1}\right) + p_2\left(r_2 + \gamma V_{s_2}\right) + p_3\left(r_3 + \gamma V_{s_3}\right)$
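
To make both backups concrete, here is a minimal Python sketch. The rewards, next-state values, transition probabilities, and discount factor below are made-up illustrative numbers, not values from any particular problem; it computes the deterministic value of $s_0$ as a max over actions and the stochastic value of the pair ($s_0$, $a_0$) as a probability-weighted sum.

```python
GAMMA = 0.9  # discount factor (assumed value for illustration)

# Deterministic case: each action leads to exactly one next state, so
# V(s0) is the max over actions of (immediate reward + discounted next value).
rewards = {"a1": 1.0, "a2": 0.0, "a3": 2.0}        # R_a for each action (made up)
next_values = {"a1": 5.0, "a2": 10.0, "a3": 3.0}   # V of the state each action reaches (made up)

v_s0 = max(rewards[a] + GAMMA * next_values[a] for a in rewards)
print(v_s0)  # 9.0, achieved by a2: 0.0 + 0.9 * 10.0

# Stochastic case: action a0 branches to s1, s2, s3 with probabilities p1..p3,
# so its value is the probability-weighted (expected) backup.
transitions = [
    # (probability, immediate reward, value of next state) -- made up
    (0.7, 1.0, 5.0),   # -> s1
    (0.2, 2.0, 10.0),  # -> s2
    (0.1, 0.0, 3.0),   # -> s3
]

v_s0_a0 = sum(p * (r + GAMMA * v) for p, r, v in transitions)
print(v_s0_a0)  # 0.7*5.5 + 0.2*11.0 + 0.1*2.7 = 6.32
```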

Bellman Optimality Equation

Combining the max over actions from the deterministic case with the expected value over next states from the stochastic case gives the Bellman optimality equation for the general case:

$V_0 = \max_{a \in A} \sum_{s \in S} p_{a, 0 \to s}\left(r_{s,a} + \gamma V_s\right)$

Also written as,

$V_0 = \max_{a \in A} \mathbb{E}_{s \sim S}\left[r_{s,a} + \gamma V_s\right]$
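
In practice, this optimality backup is applied to every state repeatedly until the values stop changing (value iteration). The sketch below assumes a hypothetical two-state MDP; the state names, actions, transition probabilities, and rewards are invented purely for illustration.

```python
GAMMA = 0.9  # discount factor (assumed for illustration)

# Hypothetical MDP: transitions[state][action] = list of (probability, reward, next_state).
transitions = {
    "s0": {
        "stay": [(1.0, 0.0, "s0")],
        "go":   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    },
    "s1": {
        "back": [(1.0, 0.0, "s0")],
        "stay": [(1.0, 2.0, "s1")],
    },
}

values = {s: 0.0 for s in transitions}

# Repeatedly apply the Bellman optimality backup:
# V(s) = max over actions of the expected (reward + discounted next-state value).
for _ in range(200):  # enough sweeps for this tiny example to converge
    values = {
        s: max(
            sum(p * (r + GAMMA * values[s_next]) for p, r, s_next in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in transitions.items()
    }

print(values)  # roughly {'s0': 18.54, 's1': 20.0}; note V(s1) = 2 / (1 - 0.9) = 20
```

Each sweep is exactly the max-over-actions, expectation-over-next-states structure of the equation above, applied to every state at once.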