# A/B Testing at Scale
Design and run rigorous A/B tests that produce trustworthy results. Covers experiment design, sample size calculation, statistical significance, guardrail metrics, multi-variant testing, and the common statistical mistakes that lead to wrong conclusions.
A/B testing is the scientific method applied to product development. Instead of debating whether Feature A or Feature B is better, you run a controlled experiment with real users and let the data decide. Done right, it replaces opinion with evidence; done wrong, it produces false confidence that leads to worse products.
## Experiment Design

Pin down seven things before splitting any traffic:

1. **Hypothesis**: "Adding a progress bar to checkout will increase completion rate."
2. **Primary metric** (what you're optimizing): checkout completion rate.
3. **Guardrail metrics** (what must NOT degrade): revenue per user, page load time, error rate.
4. **Minimum detectable effect (MDE)**: a 1% absolute increase (from 65% to 66%).
5. **Sample size**: calculated from the baseline rate, MDE, significance level, and power.
6. **Duration**: 2 full weeks minimum, to capture weekly patterns.
7. **Randomization unit**: user ID (not session, not request); see the assignment sketch below.
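In practice, randomizing by user ID usually means deterministic hashing rather than storing a coin flip per user. A minimal sketch (the `assign_variant` helper and experiment name are illustrative, not any particular framework's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant by hashing user ID + experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Example: the same user always gets the same variant for a given experiment
assign_variant("user-42", "checkout-progress-bar")  # "control" or "treatment"
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments, so a user's bucket in one test tells you nothing about their bucket in another.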
## Sample Size Calculation

Compute the required sample size per variant before launch, using the standard two-proportion power formula:

```python
from scipy import stats
import numpy as np

def sample_size_proportion(
    baseline_rate: float,
    mde: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Calculate required sample size per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + mde

    # Pooled proportion under the null hypothesis
    p_pool = (p1 + p2) / 2

    # Critical values for a two-sided test at the given alpha and power
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    n = ((z_alpha * np.sqrt(2 * p_pool * (1 - p_pool)) +
          z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) /
         (p2 - p1)) ** 2
    return int(np.ceil(n))

# Example: detect a 1% absolute increase from a 65% baseline
n = sample_size_proportion(baseline_rate=0.65, mde=0.01)
# Result: ~35,500 users per variant
# Total needed: ~71,000 users
```
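Cross-checking the number against an independent implementation is cheap insurance against formula bugs. If statsmodels is available, its power solver (which works on the arcsine-transformed effect size, Cohen's h, rather than the pooled-variance formula above) should land in the same place:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for 65% -> 66%, then solve for the per-group sample size
effect_size = proportion_effectsize(0.66, 0.65)
n = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                 power=0.8, alternative="two-sided")
print(round(n))  # ~35,500 per variant, agreeing with the formula above
```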
## Statistical Analysis

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

def analyze_experiment(control, treatment):
    """Analyze A/B test results.

    `control` and `treatment` are dicts with "users" and "conversions" counts.
    """
    # Raw conversion rates and relative lift
    ctrl_rate = control["conversions"] / control["users"]
    treat_rate = treatment["conversions"] / treatment["users"]
    lift = (treat_rate - ctrl_rate) / ctrl_rate * 100

    # Chi-squared test on the 2x2 table of converted vs. not converted
    table = [
        [control["conversions"], control["users"] - control["conversions"]],
        [treatment["conversions"], treatment["users"] - treatment["conversions"]],
    ]
    chi2, p_value, _, _ = chi2_contingency(table)

    # 95% confidence interval for the absolute difference in rates
    se = np.sqrt(
        ctrl_rate * (1 - ctrl_rate) / control["users"] +
        treat_rate * (1 - treat_rate) / treatment["users"]
    )
    z = norm.ppf(0.975)  # 1.96 for a 95% interval
    ci_lower = (treat_rate - ctrl_rate) - z * se
    ci_upper = (treat_rate - ctrl_rate) + z * se

    return {
        "control_rate": f"{ctrl_rate:.4%}",
        "treatment_rate": f"{treat_rate:.4%}",
        "lift": f"{lift:+.2f}%",
        "p_value": f"{p_value:.4f}",
        "significant": p_value < 0.05,
        "confidence_interval": f"[{ci_lower:.4%}, {ci_upper:.4%}]",
    }
```
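A usage sketch with made-up counts, at roughly the per-variant sample size computed above:

```python
result = analyze_experiment(
    control={"users": 35_000, "conversions": 22_750},    # 65.0% conversion
    treatment={"users": 35_000, "conversions": 23_275},  # 66.5% conversion
)
print(result)
# A +1.5 point absolute difference at this sample size is comfortably
# significant: lift +2.31%, p-value far below 0.05.
```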
## Common Statistical Mistakes

**Mistake 1: Peeking.** Checking results daily and stopping as soon as they look significant.
- Problem: repeated testing inflates the false positive rate.
- Fix: pre-commit to a sample size and duration, or use a proper sequential testing procedure. The simulation below shows how badly naive peeking misbehaves.
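To make the inflation concrete, here is a small A/A simulation (both variants identical, so every "win" is a false positive; all parameters are illustrative) that stops at the first daily check crossing p < 0.05:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims=2000, days=14,
                                users_per_day=1000, p=0.65):
    """Simulate A/A tests with a significance check after each day."""
    false_positives = 0
    for _ in range(n_sims):
        c_conv = t_conv = c_n = t_n = 0
        for _ in range(days):
            c_conv += rng.binomial(users_per_day, p)
            t_conv += rng.binomial(users_per_day, p)
            c_n += users_per_day
            t_n += users_per_day
            # Two-proportion z-test on the data accumulated so far
            pooled = (c_conv + t_conv) / (c_n + t_n)
            se = np.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / t_n))
            z = (t_conv / t_n - c_conv / c_n) / se
            if 2 * norm.sf(abs(z)) < 0.05:  # "we have a winner" -- stop early
                false_positives += 1
                break
    return false_positives / n_sims

# No real effect exists, yet stopping at the first significant daily check
# yields a false positive rate well above the nominal 5%.
print(peeking_false_positive_rate())
```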
**Mistake 2: Wrong randomization unit.** Randomizing by session puts the same user in both variants.
- Problem: contaminated results.
- Fix: randomize by user ID, consistently (see the assignment sketch above).

**Mistake 3: Not enough power.** "We ran 500 users and it's not significant."
- Problem: an underpowered test cannot detect real effects.
- Fix: calculate the sample size BEFORE running.

**Mistake 4: Survivorship bias.** Counting only users who completed onboarding.
- Problem: the treatment might cause more dropoff before counting begins.
- Fix: intent-to-treat analysis (count all assigned users).

**Mistake 5: Multiple metrics, no correction.** Testing 20 metrics and declaring victory on the 1 that comes up significant.
- Problem: with 20 tests at α = 0.05, you expect 1 false positive by chance alone.
- Fix: a Bonferroni correction or a pre-declared primary metric, as sketched below.
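The Bonferroni correction itself is one line: divide α by the number of tests. A minimal sketch (`bonferroni_significant` is an illustrative helper, not a library function):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that survive a Bonferroni correction (alpha / m per test)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Three metrics tested together: each must clear 0.05 / 3, roughly 0.0167.
print(bonferroni_significant([0.03, 0.001, 0.20]))  # [False, True, False]
```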
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Peeking at results early | Inflated false positive rate | Pre-register duration + sample size |
| No guardrail metrics | Win primary metric, damage overall | Monitor revenue, latency, errors |
| Test too briefly | Miss weekly patterns | Run full weeks (2 minimum) |
| Multiple comparisons | False discoveries | Pre-declared primary metric |
| Ship on p=0.049 | Borderline results, likely noise | Replicate or increase sample size |
A/B testing done right is rigorous science. Done wrong, it is a way to confirm biases with a veneer of statistical credibility.