Part 2: Markov Decision Processes Explained - Mathematical Foundation of RL
Welcome to the second post in our Deep Reinforcement Learning Series! In this comprehensive guide, we'll explore Markov Decision Processes (MDPs) - the mathematical framework that formalizes reinforcement learning problems. Understanding MDPs is crucial for grasping the theoretical foundations of RL.
What is a Markov Decision Process?
A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. MDPs provide a formal way to describe the environment in which reinforcement learning agents operate.
Key Properties
Markov Property: The future is independent of the past given the present: \(P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, \dots) = P(S_{t+1} | S_t, A_t)\)
This means the current state contains all information needed to make optimal decisions.
Stationary Transitions: Transition dynamics don't change over time: \(P(S_{t+1} = s' | S_t = s, A_t = a) = P(S_{t+k+1} = s' | S_{t+k} = s, A_{t+k} = a) \quad \forall t, k\)
Formal Definition of MDP
An MDP is defined by the tuple: \[\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)\]
1. State Space (S)
The set of all possible states the environment can be in: \[\mathcal{S} = \{s_1, s_2, \dots, s_n\}\]
Types of State Spaces:
Discrete State Space:
- Finite number of states
- Examples: Chess board positions, Grid World
- Easier to solve analytically
Continuous State Space:
- Infinite number of states
- Examples: Robot joint angles, Stock prices
- Requires function approximation
Example - Grid World: \(\mathcal{S} = \{(x,y) | x \in \{0,1,2,3\}, y \in \{0,1,2,3\}\}\)
2. Action Space (A)
The set of all possible actions the agent can take: \[\mathcal{A} = \{a_1, a_2, \dots, a_m\}\]
Types of Action Spaces:
Discrete Action Space:
- Finite number of actions
- Examples: Move up/down/left/right, Press button
- Suitable for value-based methods
Continuous Action Space:
- Infinite action possibilities
- Examples: Robot motor torques, Steering angle
- Requires policy-based methods
Example - Grid World: \(\mathcal{A} = \{\text{up}, \text{down}, \text{left}, \text{right}\}\)
3. Transition Function (P)
The probability of transitioning to state \(s'\) given current state \(s\) and action \(a\): \[\mathcal{P}(s' | s, a) = P(S_{t+1} = s' | S_t = s, A_t = a)\]
Properties:
- Stochastic: \(0 \leq P(s'|s,a) \leq 1\)
- Normalized: \(\sum_{s' \in \mathcal{S}} P(s'|s,a) = 1\)
Example - Deterministic Transitions: \(P(s'|s,a) = \begin{cases} 1 & \text{if } s' = f(s,a) \\ 0 & \text{otherwise} \end{cases}\)
Where \(f(s,a)\) is the deterministic transition function.
Example - Stochastic Transitions:
# 80% chance to move in intended direction
# 20% chance to move in random direction
P(s'|s,a) = {
intended_direction: 0.8,
other_directions: 0.2/3
}
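To make this concrete, here's a minimal Python sketch of such a "slippery" transition rule. The helper name build_slippery_transition and the 0.8/0.2 split are our own illustrative choices, not a standard API:

def build_slippery_transition(intended_action, actions, p_intended=0.8):
    """Put p_intended on the intended move and spread the rest evenly
    over the remaining actions."""
    p_other = (1.0 - p_intended) / (len(actions) - 1)
    return {a: (p_intended if a == intended_action else p_other) for a in actions}

# Example: intending to move "up" out of four possible moves
probs = build_slippery_transition("up", ["up", "down", "left", "right"])
# probs == {"up": 0.8, "down": 0.0667, "left": 0.0667, "right": 0.0667} (approximately)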
4. Reward Function (R)
The expected immediate reward after taking action \(a\) in state \(s\) and transitioning to \(s'\): \[\mathcal{R}(s, a, s') = \mathbb{E}[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s']\]
Types of Rewards:
Deterministic Rewards: \(R(s,a,s') = r \quad \text{(fixed value)}\)
Stochastic Rewards: \(R(s,a,s') = \mathbb{E}[R | s,a,s']\)
Example - Grid World: \(R(s,a,s') = \begin{cases} +10 & \text{if } s' = \text{goal} \\ -1 & \text{if } s' = \text{obstacle} \\ -0.1 & \text{otherwise (time penalty)} \end{cases}\)
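Here's the same reward rule as a small Python sketch; the GOAL and OBSTACLE coordinates are illustrative and simply mirror the example above:

GOAL, OBSTACLE = (0, 3), (1, 1)   # assumed positions for this example

def grid_reward(s, a, s_next):
    """Reward depends only on where the agent lands."""
    if s_next == GOAL:
        return 10.0       # reaching the goal
    if s_next == OBSTACLE:
        return -1.0       # stepping on the obstacle
    return -0.1           # small time penalty encourages short paths

print(grid_reward((0, 2), "right", (0, 3)))   # 10.0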
5. Discount Factor (γ)
A factor \(\gamma \in [0,1]\) that determines the importance of future rewards:
Interpretation:
- \(\gamma \approx 0\): Agent is myopic (focuses on immediate rewards)
- \(\gamma \approx 1\): Agent is farsighted (considers long-term rewards)
- \(\gamma = 1\): Values all future rewards equally (only well-defined for episodic tasks, since the infinite sum can otherwise diverge)
Typical Values:
- \(\gamma = 0.9\): Moderate discounting
- \(\gamma = 0.95\): Strong long-term consideration
- \(\gamma = 0.99\): Very long-term planning
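To build intuition for these values, note that a constant reward of 1 per step sums to \(1/(1-\gamma)\), which acts as a rough "effective horizon". A quick arithmetic check:

# Effective horizon 1 / (1 - gamma) for the typical values above
for gamma in (0.9, 0.95, 0.99):
    print(gamma, 1 / (1 - gamma))   # ~10, ~20, ~100 steps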
The MDP Dynamics
Episode Trajectory
A sequence of states, actions, and rewards: \[\tau = (S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_T)\]
Return (Cumulative Reward)
The sum of discounted rewards from time step \(t\): \[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\]
Finite Horizon (Episodic): \(G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}\)
Infinite Horizon (Continuing): \(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\)
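Computing a return from sampled rewards takes only a few lines; here's a minimal sketch (the reward list is made up for illustration):

def discounted_return(rewards, gamma=0.9):
    """G = sum_k gamma^k * R_{k+1} for a finite list of rewards."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))   # 1.0 + 0.9**3 * 10 = 8.29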
Value Functions
1. State-Value Function (V)
The expected return starting from state \(s\) and following policy \(\pi\): \[V^\pi(s) = \mathbb{E}_\pi \left[ G_t | S_t = s \right]\]
Bellman Expectation Equation: \[V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]\]
Optimal State-Value Function: \[V^*(s) = \max_\pi V^\pi(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]\]
2. Action-Value Function (Q)
The expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\): \[Q^\pi(s,a) = \mathbb{E}_\pi \left[ G_t | S_t = s, A_t = a \right]\]
Bellman Expectation Equation: \[Q^\pi(s,a) = \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a' \in \mathcal{A}} \pi(a'|s') Q^\pi(s',a') \right]\]
Optimal Action-Value Function: \[Q^*(s,a) = \max_\pi Q^\pi(s,a) = \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s',a') \right]\]
3. Relationship Between V and Q
\[V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^\pi(s,a)\] \[Q^\pi(s,a) = \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]\]
Policies
Definition
A policy \(\pi\) specifies how the agent selects actions. In its simplest (deterministic) form, it is a mapping from states to actions: \[\pi: \mathcal{S} \rightarrow \mathcal{A}\]
Types of Policies
Deterministic Policy: \(\pi(s) = a \quad \text{(single action)}\)
Stochastic Policy: \(\pi(a|s) = P(A_t = a | S_t = s) \quad \text{(probability distribution)}\)
Optimal Policy: \[\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q^*(s,a)\]
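In code, the difference between policy types is just the data structure. Here's a small illustrative sketch (the toy states, actions, and Q-values are made up):

import numpy as np

# Deterministic policy: one action per state
pi_det = {"s0": "left", "s1": "right"}

# Stochastic policy: a probability distribution over actions per state
pi_stoch = {"s0": {"left": 0.7, "right": 0.3},
            "s1": {"left": 0.5, "right": 0.5}}

# Greedy policy extracted from a Q-table (rows = states, columns = actions)
Q = np.array([[1.0, 2.5],   # Q(s0, left), Q(s0, right)
              [0.3, 0.1]])  # Q(s1, left), Q(s1, right)
greedy = Q.argmax(axis=1)   # pi*(s) = argmax_a Q*(s, a)
print(greedy)               # [1 0] -> right in s0, left in s1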
Bellman Equations
Bellman Expectation Equation for V
\[V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma V^\pi(S_{t+1}) | S_t = s \right]\]
Expanded Form: \[V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]\]
Bellman Optimality Equation for V
\[V^*(s) = \max_{a \in \mathcal{A}} \mathbb{E} \left[ R_{t+1} + \gamma V^*(S_{t+1}) | S_t = s, A_t = a \right]\]
Expanded Form: \[V^*(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]\]
Bellman Expectation Equation for Q
\[Q^\pi(s,a) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a \right]\]
Expanded Form: \[Q^\pi(s,a) = \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a' \in \mathcal{A}} \pi(a'|s') Q^\pi(s',a') \right]\]
Bellman Optimality Equation for Q
\[Q^*(s,a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q^*(S_{t+1}, a') | S_t = s, A_t = a \right]\]
Expanded Form: \[Q^*(s,a) = \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s',a') \right]\]
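All four equations share the same one-step-lookahead structure. As a sketch, here's a single synchronous Bellman optimality backup for a tabular Q, assuming (our convention, not the post's later code) transition and reward arrays P[s, a, s'] and R[s, a, s']:

import numpy as np

def bellman_optimality_backup(Q, P, R, gamma):
    """Return Q_new[s, a] = sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * max_{a'} Q[s',a'])."""
    V = Q.max(axis=1)   # current estimate of V*(s') = max_a' Q(s', a')
    return np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])

Iterating this backup to a fixed point is exactly value iteration on Q.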
Python Implementation
MDP Class
import numpy as np
from typing import Tuple, List, Dict


class MDP:
    """
    Markov Decision Process implementation

    Args:
        states: List of all possible states
        actions: List of all possible actions
        transitions: Transition function P(s'|s,a)
        rewards: Reward function R(s,a,s')
        gamma: Discount factor
    """

    def __init__(self,
                 states: List,
                 actions: List,
                 transitions: Dict,
                 rewards: Dict,
                 gamma: float = 0.95):
        self.states = states
        self.actions = actions
        self.transitions = transitions
        self.rewards = rewards
        self.gamma = gamma
        self.n_states = len(states)
        self.n_actions = len(actions)

    def get_transition_prob(self, state: int, action: int,
                            next_state: int) -> float:
        """
        Get transition probability P(s'|s,a)

        Args:
            state: Current state
            action: Action taken
            next_state: Next state

        Returns:
            Transition probability
        """
        return self.transitions.get((state, action, next_state), 0.0)

    def get_reward(self, state: int, action: int,
                   next_state: int) -> float:
        """
        Get expected reward R(s,a,s')

        Args:
            state: Current state
            action: Action taken
            next_state: Next state

        Returns:
            Expected reward
        """
        return self.rewards.get((state, action, next_state), 0.0)

    def get_expected_reward(self, state: int, action: int) -> float:
        """
        Get expected reward for state-action pair

        Args:
            state: Current state
            action: Action taken

        Returns:
            Expected reward
        """
        expected_reward = 0.0
        for next_state in self.states:
            prob = self.get_transition_prob(state, action, next_state)
            reward = self.get_reward(state, action, next_state)
            expected_reward += prob * reward
        return expected_reward
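As a quick sanity check, here's how the class might be used on a made-up two-state MDP (all numbers are illustrative):

# Two states (0, 1), two actions (0, 1); action 1 in state 0 usually reaches state 1
toy = MDP(
    states=[0, 1],
    actions=[0, 1],
    transitions={(0, 0, 0): 1.0,
                 (0, 1, 1): 0.9, (0, 1, 0): 0.1,
                 (1, 0, 1): 1.0, (1, 1, 1): 1.0},
    rewards={(0, 1, 1): 1.0},   # unspecified (s, a, s') entries default to 0.0
    gamma=0.9,
)
print(toy.get_transition_prob(0, 1, 1))   # 0.9
print(toy.get_expected_reward(0, 1))      # 0.9 * 1.0 = 0.9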
Value Iteration Algorithm
def value_iteration(mdp: MDP, theta: float = 1e-6,
                    max_iterations: int = 1000) -> Tuple[np.ndarray, np.ndarray]:
    """
    Solve MDP using Value Iteration

    Args:
        mdp: Markov Decision Process
        theta: Convergence threshold
        max_iterations: Maximum iterations

    Returns:
        (optimal_values, optimal_policy)
    """
    # Initialize value function
    V = np.zeros(mdp.n_states)

    for iteration in range(max_iterations):
        delta = 0.0
        V_new = np.zeros(mdp.n_states)

        # Update each state
        for state in mdp.states:
            # Bellman optimality equation
            max_value = -np.inf
            for action in mdp.actions:
                # Expected value for this action
                action_value = mdp.get_expected_reward(state, action)
                # Add discounted future value
                for next_state in mdp.states:
                    prob = mdp.get_transition_prob(state, action, next_state)
                    action_value += prob * mdp.gamma * V[next_state]
                max_value = max(max_value, action_value)

            V_new[state] = max_value
            delta = max(delta, abs(V_new[state] - V[state]))

        V = V_new

        # Check convergence
        if delta < theta:
            print(f"Converged in {iteration + 1} iterations")
            break

    # Extract optimal policy
    policy = np.zeros(mdp.n_states, dtype=int)
    for state in mdp.states:
        max_value = -np.inf
        best_action = 0
        for action in mdp.actions:
            action_value = mdp.get_expected_reward(state, action)
            for next_state in mdp.states:
                prob = mdp.get_transition_prob(state, action, next_state)
                action_value += prob * mdp.gamma * V[next_state]
            if action_value > max_value:
                max_value = action_value
                best_action = action
        policy[state] = best_action

    return V, policy
Policy Iteration Algorithm
def policy_iteration(mdp: MDP, max_iterations: int = 100) -> Tuple[np.ndarray, np.ndarray]:
    """
    Solve MDP using Policy Iteration

    Args:
        mdp: Markov Decision Process
        max_iterations: Maximum iterations

    Returns:
        (optimal_values, optimal_policy)
    """
    # Initialize random policy
    policy = np.random.randint(0, mdp.n_actions, size=mdp.n_states)
    V = np.zeros(mdp.n_states)

    for iteration in range(max_iterations):
        # Policy Evaluation
        policy_stable = True
        for _ in range(100):  # Fixed number of sweeps for evaluation
            V_new = np.zeros(mdp.n_states)
            for state in mdp.states:
                action = policy[state]
                # Bellman expectation equation
                value = mdp.get_expected_reward(state, action)
                for next_state in mdp.states:
                    prob = mdp.get_transition_prob(state, action, next_state)
                    value += prob * mdp.gamma * V[next_state]
                V_new[state] = value
            V = V_new

        # Policy Improvement
        for state in mdp.states:
            old_action = policy[state]
            max_value = -np.inf
            best_action = 0
            for action in mdp.actions:
                action_value = mdp.get_expected_reward(state, action)
                for next_state in mdp.states:
                    prob = mdp.get_transition_prob(state, action, next_state)
                    action_value += prob * mdp.gamma * V[next_state]
                if action_value > max_value:
                    max_value = action_value
                    best_action = action
            policy[state] = best_action
            if old_action != best_action:
                policy_stable = False

        if policy_stable:
            print(f"Converged in {iteration + 1} iterations")
            break

    return V, policy
Example: Grid World MDP
def create_grid_world_mdp() -> MDP:
    """
    Create a 4x4 Grid World MDP

    Returns:
        MDP instance
    """
    # States: 16 positions (4x4 grid)
    states = list(range(16))
    # Actions: 0=up, 1=down, 2=left, 3=right
    actions = list(range(4))

    # Define transitions and rewards
    transitions = {}
    rewards = {}

    for state in states:
        x, y = state // 4, state % 4
        for action in actions:
            # Calculate next position
            if action == 0:    # up
                next_x, next_y = x, min(3, y + 1)
            elif action == 1:  # down
                next_x, next_y = x, max(0, y - 1)
            elif action == 2:  # left
                next_x, next_y = max(0, x - 1), y
            else:              # right
                next_x, next_y = min(3, x + 1), y

            next_state = next_x * 4 + next_y

            # Deterministic transition
            transitions[(state, action, next_state)] = 1.0

            # Define rewards
            if next_state == 3:    # Goal at (0,3)
                rewards[(state, action, next_state)] = 10.0
            elif next_state == 5:  # Obstacle at (1,1)
                rewards[(state, action, next_state)] = -1.0
            else:
                rewards[(state, action, next_state)] = -0.1  # Time penalty

    return MDP(states, actions, transitions, rewards, gamma=0.95)


# Create and solve MDP
mdp = create_grid_world_mdp()

# Solve using Value Iteration
V, policy = value_iteration(mdp)

print("Optimal Values:")
print(V.reshape(4, 4))
print("\nOptimal Policy:")
print(policy.reshape(4, 4))

# Visualize policy
action_symbols = ['↑', '↓', '←', '→']
policy_grid = np.array([[action_symbols[a] for a in row] for row in policy.reshape(4, 4)])

print("\nPolicy Visualization:")
for row in reversed(policy_grid):
    print(' '.join(row))
Solving MDPs
Dynamic Programming Methods
Value Iteration:
- Iteratively improves value function
- Converges to optimal values
- Guarantees optimality
Policy Iteration:
- Alternates between evaluation and improvement
- Often faster convergence
- Guarantees optimality
Linear Programming
Formulate as linear program:
- Variables: State values
- Constraints: Bellman inequalities (one per state-action pair)
- Objective: Minimize a weighted sum of state values; the dual LP maximizes expected discounted return (see the sketch below)
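As a sketch, one common primal formulation (with arbitrary positive weights \(\alpha(s)\), often the initial-state distribution) is:
\[\min_{V} \sum_{s \in \mathcal{S}} \alpha(s) V(s) \quad \text{s.t.} \quad V(s) \geq \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right] \quad \forall s \in \mathcal{S}, a \in \mathcal{A}\]
Any feasible \(V\) upper-bounds \(V^*\), so minimizing pushes it down onto \(V^*\).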
Monte Carlo Methods
Estimate values through sampling:
- No model required
- Model-free approach
- Converges with enough samples
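As a preview (we'll treat model-free methods properly later in the series), here's a minimal first-visit Monte Carlo update that averages sampled returns per state; the (state, reward) episode format and helper name are our own illustrative choices:

from collections import defaultdict

def mc_first_visit_update(episode, V, counts, gamma=0.9):
    """episode is a list of (state, reward) pairs, where reward is received
    after leaving that state. Updates V with an incremental average of returns."""
    g, returns = 0.0, []
    for state, reward in reversed(episode):
        g = reward + gamma * g
        returns.append((state, g))
    seen = set()
    for state, g in reversed(returns):   # back to chronological order
        if state not in seen:            # first visit only
            seen.add(state)
            counts[state] += 1
            V[state] += (g - V[state]) / counts[state]

V, counts = defaultdict(float), defaultdict(int)
mc_first_visit_update([("s0", 0.0), ("s1", 1.0)], V, counts)
print(V["s0"], V["s1"])   # 0.9 1.0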
Temporal Difference Learning
Combine ideas from DP and MC:
- Learn from experience
- Online learning
- Basis for Q-learning
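Again as a preview of the next post, a tabular TD(0) update for \(V^\pi\) fits in a few lines; the dict-based value table and the step size alpha are illustrative assumptions:

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Nudge V(s) toward the bootstrapped target r + gamma * V(s')."""
    v_s = V.get(s, 0.0)
    v_next = 0.0 if done else V.get(s_next, 0.0)
    V[s] = v_s + alpha * (r + gamma * v_next - v_s)
    return V

V = td0_update({}, "s0", 1.0, "s1")
print(V)   # {'s0': 0.1} -- moved 10% of the way toward the target 1.0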
Example: Small MDP
Problem Setup
Consider a simple 3-state MDP:
      ┌─────────┐
      │   S0    │
      │ (Start) │
      └────┬────┘
           │
      ┌────┴────┐
      │         │
     S1         S2
   (Good)     (Bad)
States: \(\mathcal{S} = \{S_0, S_1, S_2\}\)
Actions: \(\mathcal{A} = \{\text{left}, \text{right}\}\)
Transitions:
- From \(S_0\): 50% to \(S_1\), 50% to \(S_2\)
- From \(S_1\): Stay in \(S_1\) (terminal)
- From \(S_2\): Stay in \(S_2\) (terminal)
Rewards:
- \(R(S_0, \cdot, S_1) = +1\)
- \(R(S_0, \cdot, S_2) = -1\)
- \(R(S_1, \cdot, S_1) = 0\)
- \(R(S_2, \cdot, S_2) = 0\)
Discount: \(\gamma = 0.9\)
Solving with Bellman Equations
Value Function:
\(V(S_0) = 0.5 \times [1 + 0.9 \times V(S_1)] + 0.5 \times [-1 + 0.9 \times V(S_2)]\)
\(V(S_1) = 0\)
\(V(S_2) = 0\)
Substituting:
\(V(S_0) = 0.5 \times [1 + 0.9 \times 0] + 0.5 \times [-1 + 0.9 \times 0]\)
\(V(S_0) = 0.5 \times 1 + 0.5 \times (-1)\)
\(V(S_0) = 0\)
Optimal Policy: \(\pi^*(S_0) = \text{left} \quad \text{(both actions give same value)}\)
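A quick numeric check of the hand calculation (plain Python, nothing model-specific):

gamma = 0.9
V_s1 = V_s2 = 0.0                                            # terminal states
V_s0 = 0.5 * (1 + gamma * V_s1) + 0.5 * (-1 + gamma * V_s2)
print(V_s0)                                                  # 0.0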
Partially Observable MDPs (POMDPs)
Definition
When the agent cannot fully observe the state: \[\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma)\]
Where:
- \(\mathcal{O}\): Observation space
- \(\mathcal{Z}\): Observation function \(P(o | s, a, s')\)
Belief State
Agent maintains probability distribution over states: \[b_t(s) = P(S_t = s | o_0, a_0, o_1, a_1, \dots, o_t)\]
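As a sketch, the standard Bayes-filter belief update takes a few lines of NumPy, assuming (our convention) arrays P[s, a, s'] for transitions and Z[a, s', o] for observation probabilities:

import numpy as np

def belief_update(b, a, o, P, Z):
    """b_new(s') is proportional to Z[a, s', o] * sum_s P[s, a, s'] * b(s)."""
    predicted = b @ P[:, a, :]        # predict: sum_s b(s) * P(s'|s,a)
    b_new = Z[a, :, o] * predicted    # correct: weight by observation likelihood
    return b_new / b_new.sum()        # normalize to a probability distribution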
Solving POMDPs
- Belief state MDP: Convert to MDP over beliefs
- Point-based value iteration: Approximate solution
- Monte Carlo methods: Sample-based approaches
MDP Properties
Finite Horizon
- Fixed number of time steps
- Episode terminates after \(T\) steps
- Value function depends on time to go
Infinite Horizon
- No fixed termination
- Discounting ensures convergence
- Stationary optimal policy
Episodic vs Continuing
Episodic:
- Natural termination points
- Episodes are independent
- Example: Games
Continuing:
- No natural termination
- Goes on indefinitely
- Example: Robot control
What's Next?
In the next post, we'll implement Q-Learning from Scratch - a model-free algorithm that learns optimal policies through experience. We'll cover:
- Q-learning algorithm
- Exploration strategies
- Implementation details
- Practical examples
- Hyperparameter tuning
Key Takeaways
- MDPs formalize RL problems mathematically
- Bellman equations provide recursive structure
- Value functions measure expected returns
- Dynamic programming can solve MDPs exactly
- Model-free methods learn from experience
- POMDPs handle partial observability
Practice Exercises
- Implement Value Iteration for a different MDP
- Compare Value vs Policy Iteration convergence speed
- Create a POMDP and implement belief updates
- Experiment with different discount factors
- Visualize value functions as heatmaps
Questions?
Have questions about MDPs or Bellman equations? Drop them in the comments below!
Next Post: Part 3: Q-Learning from Scratch
Series Index: Deep Reinforcement Learning Series Roadmap