
Reinforcement Learning Fundamentals

Train agents to make sequential decisions through trial and error. Covers Markov decision processes, Q-learning, policy gradients, reward shaping, and the patterns that let AI systems learn optimal behavior from interaction with an environment.

Reinforcement learning (RL) is how machines learn to make decisions by trying things and observing the outcomes. Unlike supervised learning (where you provide correct answers) or unsupervised learning (where the model finds patterns on its own), an RL agent learns through interaction: take an action, observe a reward, adjust the strategy. This is how AlphaGo learned to play Go, how OpenAI trained ChatGPT with RLHF (reinforcement learning from human feedback), and how autonomous vehicles learn to drive.


Core Concepts

Agent: The learner/decision-maker
Environment: The world the agent interacts with
State (s): Current situation
Action (a): What the agent can do
Reward (r): Feedback signal (positive or negative)
Policy (π): Strategy mapping states to actions

The RL Loop:
  ┌──────────┐                    ┌───────────────┐
  │  Agent   │──── action ───────►│  Environment  │
  │  π(s)→a  │◄── reward, s' ─────│               │
  └──────────┘                    └───────────────┘

Goal: Learn a policy π that maximizes expected cumulative discounted reward,
      G = r₁ + γ·r₂ + γ²·r₃ + ⋯, where γ ∈ (0, 1] is the discount factor

Example — Robot Navigation:
  State: Robot position (x, y), obstacle locations
  Actions: Move up, down, left, right
  Reward: +100 for reaching goal, -1 per step, -50 for hitting wall
  Policy: Learned path that reaches goal in minimum steps
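
To make the example concrete, here is a minimal gridworld environment in this spirit. It is an illustrative sketch, not a standard API: the GridWorld class, its 5×5 size, and its reset()/step() interface are assumptions chosen to match the rewards above.

class GridWorld:
    """Illustrative 5x5 gridworld (hypothetical, not a library class).
    Start at (0, 0), goal at (4, 4); rewards mirror the example above:
    +100 for the goal, -1 per step, -50 for hitting a wall."""

    def __init__(self, size=5):
        self.size = size
        self.pos = (0, 0)

    def reset(self):
        """Move the agent back to the start; return the initial state."""
        self.pos = (0, 0)
        return self._state()

    def _state(self):
        # Flatten (x, y) into one discrete index for a tabular Q-table
        return self.pos[0] * self.size + self.pos[1]

    def step(self, action):
        """Apply action 0=up, 1=down, 2=left, 3=right.
        Returns (next_state, reward, done)."""
        dx, dy = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        x, y = self.pos[0] + dx, self.pos[1] + dy
        if not (0 <= x < self.size and 0 <= y < self.size):
            return self._state(), -50, False   # Hit a wall: stay put
        self.pos = (x, y)
        if self.pos == (self.size - 1, self.size - 1):
            return self._state(), 100, True    # Reached the goal
        return self._state(), -1, False        # Per-step cost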

Q-Learning

import numpy as np

class QLearningAgent:
    """Tabular Q-learning for discrete state/action spaces."""
    
    def __init__(self, n_states, n_actions, learning_rate=0.1, 
                 discount=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon
    
    def choose_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])  # Explore
        return np.argmax(self.q_table[state])  # Exploit
    
    def learn(self, state, action, reward, next_state, done):
        """Update Q-value using Bellman equation."""
        current_q = self.q_table[state, action]
        
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.q_table[next_state])
        
        # Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
        self.q_table[state, action] += self.lr * (target - current_q)

# Training loop:
# for episode in range(10000):
#     state = env.reset()
#     done = False
#     while not done:
#         action = agent.choose_action(state)
#         next_state, reward, done = env.step(action)
#         agent.learn(state, action, reward, next_state, done)
#         state = next_state
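
With the hypothetical GridWorld sketched earlier, the commented loop above becomes runnable. One addition worth making is an epsilon decay step at the end of each episode, so the agent explores early and exploits later (see the anti-patterns below):

# Runnable sketch, assuming the illustrative GridWorld above:
# a 5x5 grid gives 25 states; there are 4 actions.
env = GridWorld()
agent = QLearningAgent(n_states=25, n_actions=4)

for episode in range(10000):
    state = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        state = next_state
    # Decay epsilon toward a floor: explore early, exploit later
    agent.epsilon = max(0.01, agent.epsilon * 0.999)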

Anti-Patterns

Anti-Pattern                      Consequence                                 Fix
Sparse reward signal              Agent learns slowly or not at all           Reward shaping, intermediate rewards
Same epsilon throughout training  Never exploits or never explores enough     Epsilon decay schedule
No reward normalization           Reward scale causes training instability    Normalize rewards to standard range
Train only in simulation          Sim-to-real gap causes real-world failures  Domain randomization, sim-to-real transfer
Overfit to training environment   Fails in slightly different conditions      Environment variation during training
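
As one way to address the normalization row, here is a small running reward normalizer. It is a sketch using Welford's online mean/variance update; the RewardNormalizer class and its constants are illustrative, not a library API.

class RewardNormalizer:
    """Standardize rewards with running statistics (Welford's algorithm)."""

    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def normalize(self, reward):
        # Update the running mean/variance, then standardize the reward
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        std = (self.m2 / self.count) ** 0.5 if self.count > 1 else 1.0
        return (reward - self.mean) / (std + 1e-8)

# Usage: agent.learn(state, action, normalizer.normalize(reward), next_state, done)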

Reinforcement learning is powerful but challenging. The reward function defines what the agent optimizes for — get it wrong and the agent will find clever but undesirable solutions (reward hacking). Start simple, iterate on the reward function, and validate in diverse environments.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
