Lecture: Introduction to Reinforcement Learning

Learning Objectives

Understand the Markov Decision Process framework. Implement Q-learning to solve a grid world problem. Explain the exploration-exploitation tradeoff and how epsilon-greedy addresses it. Connect RL to real applications including RLHF.

1. The RL Setting

Reinforcement learning addresses a fundamental question: how should an agent act in an environment to maximize cumulative reward? Unlike supervised learning (learn from labeled examples), RL learns from interaction: try something, see what happens, adjust.

The formal framework is the Markov Decision Process (MDP), consisting of:

S: State space — all possible states the environment can be in
A: Action space — all possible actions the agent can take
P(s'|s,a): Transition dynamics — probability of reaching s' after taking action a in state s
R(s,a,s'): Reward function — scalar feedback signal
γ ∈ [0,1): Discount factor — how much to value future rewards vs. immediate ones

The agent's goal: find a policy π(a|s) — a mapping from states to action probabilities — that maximizes expected discounted cumulative reward: E[Σ γᵗ rₜ].

2. The Grid World Environment

import numpy as np
import random

class GridWorld:
"""
Simple 4x4 grid world:
S: start (0,0), G: goal (3,3), H: holes (1,1), (2,3)

Actions: 0=up, 1=right, 2=down, 3=left
Reward: +10 for goal, -10 for hole, -0.1 for step
"""
def __init__(self):
    self.size = 4
    self.holes = {(1,1), (2,3)}
    self.goal = (3,3)
    self.reset()

def reset(self):
    self.state = (0, 0)
    return self.state

def step(self, action):
    r, c = self.state
    if action == 0:   dr, dc = -1, 0   # up
    elif action == 1: dr, dc = 0, 1    # right
    elif action == 2: dr, dc = 1, 0    # down
    else:             dr, dc = 0, -1   # left

    # Clip to grid boundaries
    new_r = max(0, min(self.size-1, r + dr))
    new_c = max(0, min(self.size-1, c + dc))
    self.state = (new_r, new_c)

    if self.state == self.goal:
        return self.state, +10, True    # done!
    elif self.state in self.holes:
        return self.state, -10, True    # fell in
    else:
        return self.state, -0.1, False  # still going

3. Q-Learning: Learning Action Values

Q-learning estimates the value of taking action a in state s under the optimal policy:

Q*(s,a) = E[r + γ max_a' Q*(s', a')]

The Q-learning update rule applies this Bellman equation incrementally:

Q[s,a] ← Q[s,a] + α (r + γ max_a' Q[s',a'] - Q[s,a])

Where:
α = learning rate (how fast to update)
γ = discount factor (future vs immediate reward)
The term in parentheses is the "TD error" (temporal difference error)

class QLearningAgent:
def __init__(self, n_states=16, n_actions=4, lr=0.1, gamma=0.99, epsilon=0.1):
    self.Q = np.zeros((n_states, n_actions))
    self.lr = lr
    self.gamma = gamma
    self.epsilon = epsilon

def state_to_idx(self, state):
    r, c = state
    return r * 4 + c  # Convert (row, col) to flat index

def choose_action(self, state):
    """Epsilon-greedy: explore randomly epsilon% of the time."""
    if random.random()  self.epsilon:
        return random.randint(0, 3)     # random action (explore)
    else:
        s = self.state_to_idx(state)
        return np.argmax(self.Q[s])     # greedy action (exploit)

def update(self, state, action, reward, next_state, done):
    s = self.state_to_idx(state)
    s_next = self.state_to_idx(next_state)

    # Bellman target
    if done:
        target = reward
    else:
        target = reward + self.gamma * np.max(self.Q[s_next])

    # Q-value update
    self.Q[s, action] += self.lr * (target - self.Q[s, action])

4. Training Loop

env = GridWorld()
agent = QLearningAgent(epsilon=0.3)   # start with high exploration

episode_rewards = []
for episode in range(5000):
state = env.reset()
total_reward = 0

for step in range(100):   # max 100 steps per episode
    action = agent.choose_action(state)
    next_state, reward, done = env.step(action)
    agent.update(state, action, reward, next_state, done)
    state = next_state
    total_reward += reward
    if done:
        break

# Decay epsilon (explore less as we learn more)
agent.epsilon = max(0.01, agent.epsilon * 0.999)
episode_rewards.append(total_reward)

print(f"Average reward (last 100 episodes): {np.mean(episode_rewards[-100:]):.2f}")

5. The Exploration-Exploitation Tradeoff

Epsilon-greedy exploration is simple but naive. The fundamental question: how much should the agent explore (try new actions to gather information) vs. exploit (take the best known action)?

Too much exploration: the agent never converges to a good policy
Too little exploration: the agent gets stuck in locally good but globally suboptimal policies

More sophisticated exploration strategies: Upper Confidence Bound (UCB) which selects actions based on both expected value and uncertainty; Thompson Sampling which samples from a posterior over Q-values; and Intrinsic Motivation which adds rewards for visiting unexplored states.

6. Connection to RLHF

RLHF (Reinforcement Learning from Human Feedback) applies exactly these ideas to language models. The "state" is the conversation context, the "actions" are tokens generated, and the "reward" comes from a learned reward model trained on human preferences. The same Bellman equations, the same policy gradient updates, the same exploration-exploitation tradeoff — applied at a massive scale with neural network function approximation instead of a lookup table.

Exercise

Modify the epsilon decay schedule. Try: (1) no decay (constant epsilon), (2) fast linear decay, (3) slow exponential decay. What effect does each have on the learning curve? Which produces the best final policy?