Reinforcement Learning Fundamentals
Train agents to make sequential decisions through trial and error. Covers Markov decision processes, Q-learning, policy gradients, reward shaping, and the patterns that let AI systems learn optimal behavior from interaction with an environment.
Reinforcement learning (RL) is how machines learn to make decisions by trying things and observing outcomes. Unlike supervised learning (where you provide correct answers) or unsupervised learning (where you find patterns), RL learns through interaction: take an action, observe a reward, adjust strategy. This is how AlphaGo learned to play Go, how OpenAI trained ChatGPT with RLHF, and how autonomous vehicles learn to drive.
Core Concepts
Agent: The learner/decision-maker
Environment: The world the agent interacts with
State (s): Current situation
Action (a): What the agent can do
Reward (r): Feedback signal (positive or negative)
Policy (π): Strategy mapping states to actions
The RL Loop:
┌────────────┐                          ┌───────────────┐
│   Agent    │────────  action  ───────►│  Environment  │
│  π(s) → a  │                          │               │
│            │◄─────── reward, s' ──────│               │
└────────────┘                          └───────────────┘
Goal: Learn a policy π that maximizes expected cumulative (discounted) reward
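Concretely, "cumulative reward" usually means the discounted return: r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …. A minimal sketch of that computation (the function name and the γ=0.99 default are illustrative, chosen to match the discount used in the Q-learning code below):

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence: r1 + γ·r2 + γ²·r3 + ..."""
    g = 0.0
    for r in reversed(rewards):     # work backwards so each step folds in the discount once
        g = r + gamma * g
    return g

# discounted_return([-1, -1, 100])  # -1 - 0.99 + 0.9801 * 100 ≈ 96.02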
Example — Robot Navigation:
State: Robot position (x, y), obstacle locations
Actions: Move up, down, left, right
Reward: +100 for reaching goal, -1 per step, -50 for hitting wall
Policy: Learned path that reaches goal in minimum steps
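A minimal environment sketch for this example, assuming a 5×5 grid with a fixed start, goal, and wall layout (the GridWorld name and all of those specifics are illustrative, not from the original text):

class GridWorld:
    """5x5 grid. Agent starts top-left, goal is bottom-right, a few wall cells in between."""
    def __init__(self, size=5, walls=((1, 2), (2, 2), (3, 2))):
        self.size = size
        self.walls = set(walls)
        self.goal = (size - 1, size - 1)
        self.n_states = size * size
        self.n_actions = 4              # 0=up, 1=down, 2=left, 3=right
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self._state()

    def _state(self):
        x, y = self.pos
        return x * self.size + y        # flatten (row, col) to a single integer state

    def step(self, action):
        x, y = self.pos
        dx, dy = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        nx, ny = x + dx, y + dy
        if not (0 <= nx < self.size and 0 <= ny < self.size) or (nx, ny) in self.walls:
            return self._state(), -50.0, False    # hit a wall or the boundary, stay put
        self.pos = (nx, ny)
        if self.pos == self.goal:
            return self._state(), +100.0, True    # reached the goal
        return self._state(), -1.0, False         # per-step cost encourages short paths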
Q-Learning
import numpy as np

class QLearningAgent:
    """Tabular Q-learning for discrete state/action spaces."""

    def __init__(self, n_states, n_actions, learning_rate=0.1,
                 discount=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon

    def choose_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])  # Explore
        return np.argmax(self.q_table[state])                # Exploit

    def learn(self, state, action, reward, next_state, done):
        """Update Q-value using the Bellman equation."""
        current_q = self.q_table[state, action]
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.q_table[next_state])
        # Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
        self.q_table[state, action] += self.lr * (target - current_q)
# Training loop:
# for episode in range(10000):
#     state = env.reset()
#     done = False
#     while not done:
#         action = agent.choose_action(state)
#         next_state, reward, done = env.step(action)
#         agent.learn(state, action, reward, next_state, done)
#         state = next_state
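Wired up with the GridWorld sketch from the navigation example (both are illustrative, not part of the original text), the loop above becomes runnable, including a step cap so early random episodes stay bounded and a final read-out of the learned greedy policy:

env = GridWorld()
agent = QLearningAgent(env.n_states, env.n_actions)

for episode in range(5000):
    state = env.reset()
    done = False
    steps = 0
    while not done and steps < 200:   # cap episode length while the policy is still random
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        steps += 1

greedy_policy = agent.q_table.argmax(axis=1)   # best action per state after training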
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Sparse reward signal | Agent learns slowly or not at all | Reward shaping, intermediate rewards |
| Same epsilon throughout training | Either keeps exploring when it should exploit, or never explores enough | Epsilon decay schedule (see sketch below) |
| No reward normalization | Reward scale causes training instability | Normalize rewards to standard range |
| Train only in simulation | Sim-to-real gap causes real-world failures | Domain randomization, sim-to-real transfer |
| Overfit to training environment | Fails in slightly different conditions | Environment variation during training |
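One way to implement the epsilon-decay fix from the table, as a sketch (the linear schedule and its constants are assumptions, not prescribed by the text):

def epsilon_schedule(episode, start=1.0, end=0.05, decay_episodes=3000):
    """Linear decay: explore heavily early, exploit more once the Q-table is informative."""
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)

# Inside the training loop, before choosing actions for the episode:
#     agent.epsilon = epsilon_schedule(episode)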
Reinforcement learning is powerful but challenging. The reward function defines what the agent optimizes for — get it wrong and the agent will find clever but undesirable solutions (reward hacking). Start simple, iterate on the reward function, and validate in diverse environments.