Reinforcement Learning Fundamentals
Train agents to make sequential decisions through trial and error. Covers Markov decision processes, Q-learning, policy gradients, reward shaping, and the patterns that let AI systems learn optimal behavior from interaction with an environment.
Reinforcement learning (RL) is how machines learn to make decisions by trying things and observing outcomes. Unlike supervised learning (where you provide correct answers) or unsupervised learning (where you find patterns), RL learns through interaction: take an action, observe a reward, adjust strategy. This is how AlphaGo learned to play Go, how OpenAI trained ChatGPT with RLHF, and how autonomous vehicles learn to drive.
Core Concepts
Agent: The learner/decision-maker
Environment: The world the agent interacts with
State (s): Current situation
Action (a): What the agent can do
Reward (r): Feedback signal (positive or negative)
Policy (π): Strategy mapping states to actions
The RL Loop:
┌────────────┐                          ┌───────────────┐
│   Agent    │────────  action  ───────►│  Environment  │
│  π(s) → a  │                          │               │
│            │◄─────── reward, s' ──────│               │
└────────────┘                          └───────────────┘
Goal: Learn a policy π that maximizes expected cumulative (discounted) reward
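Concretely, "cumulative reward" usually means the discounted return: r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …. A minimal sketch of that computation (the function name and the γ=0.99 default are illustrative, chosen to match the discount used in the Q-learning code below):

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence: r1 + γ·r2 + γ²·r3 + ..."""
    g = 0.0
    for r in reversed(rewards):     # work backwards so each step folds in the discount once
        g = r + gamma * g
    return g

# discounted_return([-1, -1, 100])  # -1 - 0.99 + 0.9801 * 100 ≈ 96.02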
Example — Robot Navigation:
State: Robot position (x, y), obstacle locations
Actions: Move up, down, left, right
Reward: +100 for reaching goal, -1 per step, -50 for hitting wall
Policy: Learned path that reaches goal in minimum steps
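A minimal environment sketch for this example, assuming a 5×5 grid with a fixed start, goal, and wall layout (the GridWorld name and all of those specifics are illustrative, not from the original text):

class GridWorld:
    """5x5 grid. Agent starts top-left, goal is bottom-right, a few wall cells in between."""
    def __init__(self, size=5, walls=((1, 2), (2, 2), (3, 2))):
        self.size = size
        self.walls = set(walls)
        self.goal = (size - 1, size - 1)
        self.n_states = size * size
        self.n_actions = 4              # 0=up, 1=down, 2=left, 3=right
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self._state()

    def _state(self):
        x, y = self.pos
        return x * self.size + y        # flatten (row, col) to a single integer state

    def step(self, action):
        x, y = self.pos
        dx, dy = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        nx, ny = x + dx, y + dy
        if not (0 <= nx < self.size and 0 <= ny < self.size) or (nx, ny) in self.walls:
            return self._state(), -50.0, False    # hit a wall or the boundary, stay put
        self.pos = (nx, ny)
        if self.pos == self.goal:
            return self._state(), +100.0, True    # reached the goal
        return self._state(), -1.0, False         # per-step cost encourages short paths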
Q-Learning
import numpy as np

class QLearningAgent:
    """Tabular Q-learning for discrete state/action spaces."""

    def __init__(self, n_states, n_actions, learning_rate=0.1,
                 discount=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon

    def choose_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])  # Explore
        return np.argmax(self.q_table[state])                # Exploit

    def learn(self, state, action, reward, next_state, done):
        """Update Q-value using the Bellman equation."""
        current_q = self.q_table[state, action]
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.q_table[next_state])
        # Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
        self.q_table[state, action] += self.lr * (target - current_q)
# Training loop:
# for episode in range(10000):
#     state = env.reset()
#     done = False
#     while not done:
#         action = agent.choose_action(state)
#         next_state, reward, done = env.step(action)
#         agent.learn(state, action, reward, next_state, done)
#         state = next_state
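Wired up with the GridWorld sketch from the navigation example (both are illustrative, not part of the original text), the loop above becomes runnable, including a step cap so early random episodes stay bounded and a final read-out of the learned greedy policy:

env = GridWorld()
agent = QLearningAgent(env.n_states, env.n_actions)

for episode in range(5000):
    state = env.reset()
    done = False
    steps = 0
    while not done and steps < 200:   # cap episode length while the policy is still random
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        steps += 1

greedy_policy = agent.q_table.argmax(axis=1)   # best action per state after training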
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Sparse reward signal | Agent learns slowly or not at all | Reward shaping, intermediate rewards |
| Same epsilon throughout training | Either keeps exploring when it should exploit, or never explores enough | Epsilon decay schedule (see sketch below) |
| No reward normalization | Reward scale causes training instability | Normalize rewards to standard range |
| Train only in simulation | Sim-to-real gap causes real-world failures | Domain randomization, sim-to-real transfer |
| Overfit to training environment | Fails in slightly different conditions | Environment variation during training |
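One way to implement the epsilon-decay fix from the table, as a sketch (the linear schedule and its constants are assumptions, not prescribed by the text):

def epsilon_schedule(episode, start=1.0, end=0.05, decay_episodes=3000):
    """Linear decay: explore heavily early, exploit more once the Q-table is informative."""
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)

# Inside the training loop, before choosing actions for the episode:
#     agent.epsilon = epsilon_schedule(episode)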
Reinforcement learning is powerful but challenging. The reward function defines what the agent optimizes for — get it wrong and the agent will find clever but undesirable solutions (reward hacking). Start simple, iterate on the reward function, and validate in diverse environments.