
A/B Testing at Scale

Design and run rigorous A/B tests that produce trustworthy results. Covers experiment design, sample size calculation, statistical significance, guardrail metrics, multi-variant testing, and the common statistical mistakes that lead to wrong conclusions.

A/B testing is the scientific method applied to product development. Instead of debating whether Feature A or Feature B is better, you run a controlled experiment with real users and let the data decide. Done wrong, A/B testing produces false confidence that leads to worse products.


Experiment Design

1. Hypothesis
   "Adding a progress bar to checkout will increase completion rate"
   
2. Primary Metric (what you're optimizing)
   Checkout completion rate
   
3. Guardrail Metrics (what must NOT degrade)
   Revenue per user, page load time, error rate
   
4. Minimum Detectable Effect (MDE)
   1 percentage-point absolute increase (from 65% to 66%)
   
5. Sample Size
   Calculated based on MDE, baseline rate, significance level
   
6. Duration
   2 full weeks minimum (capture weekly patterns)
   
7. Randomization Unit
   User ID (not session, not request)
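
A useful discipline is to write the whole design down as a single pre-registered plan before any traffic is assigned. A minimal sketch in Python (the class and field names are illustrative, not from any particular experimentation framework):

from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentPlan:
    """Pre-registered design, committed before any data is collected."""
    hypothesis: str
    primary_metric: str
    guardrail_metrics: list
    baseline_rate: float
    mde: float                  # minimum detectable effect, absolute
    alpha: float = 0.05
    power: float = 0.80
    min_duration_days: int = 14
    randomization_unit: str = "user_id"

checkout_progress_bar = ExperimentPlan(
    hypothesis="Adding a progress bar to checkout will increase completion rate",
    primary_metric="checkout_completion_rate",
    guardrail_metrics=["revenue_per_user", "page_load_time", "error_rate"],
    baseline_rate=0.65,
    mde=0.01,
)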

Sample Size Calculation

from scipy import stats
import numpy as np

def sample_size_proportion(
    baseline_rate: float,
    mde: float,
    alpha: float = 0.05,
    power: float = 0.8
) -> int:
    """Calculate required sample size per variant."""
    p1 = baseline_rate
    p2 = baseline_rate + mde
    
    # Pooled proportion
    p_pool = (p1 + p2) / 2
    
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    n = ((z_alpha * np.sqrt(2 * p_pool * (1 - p_pool)) + 
          z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / 
         (p2 - p1)) ** 2
    
    return int(np.ceil(n))

# Example: detect a 1 percentage-point increase from a 65% baseline
n = sample_size_proportion(baseline_rate=0.65, mde=0.01)
# Result: ~35,500 users per variant
# Total needed: ~71,000 users
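
The required sample size, together with eligible traffic, pins down the minimum duration. A small sketch, assuming a hypothetical 6,000 eligible users per day and rounding up to whole weeks as the design calls for:

import math

def experiment_duration_days(total_sample: int, daily_eligible_users: int) -> int:
    """Days needed to reach the required sample, rounded up to full weeks."""
    days = math.ceil(total_sample / daily_eligible_users)
    return math.ceil(days / 7) * 7  # whole weeks, to capture weekly patterns

# Hypothetical traffic: 6,000 eligible users/day, ~71,000 users needed in total
experiment_duration_days(71_000, 6_000)  # -> 14 days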

Statistical Analysis

import numpy as np
from scipy.stats import chi2_contingency, norm

def analyze_experiment(control: dict, treatment: dict) -> dict:
    """Analyze A/B test results. Each arm: {"users": int, "conversions": int}."""
    # Raw metrics
    ctrl_rate = control["conversions"] / control["users"]
    treat_rate = treatment["conversions"] / treatment["users"]
    
    lift = (treat_rate - ctrl_rate) / ctrl_rate * 100
    
    # Chi-squared test
    table = [
        [control["conversions"], control["users"] - control["conversions"]],
        [treatment["conversions"], treatment["users"] - treatment["conversions"]]
    ]
    chi2, p_value, _, _ = chi2_contingency(table)
    
    # 95% confidence interval for the absolute difference in rates
    se = np.sqrt(
        ctrl_rate * (1 - ctrl_rate) / control["users"] +
        treat_rate * (1 - treat_rate) / treatment["users"]
    )
    z = norm.ppf(0.975)  # two-sided 95% interval
    ci_lower = (treat_rate - ctrl_rate) - z * se
    ci_upper = (treat_rate - ctrl_rate) + z * se
    
    return {
        "control_rate": f"{ctrl_rate:.4%}",
        "treatment_rate": f"{treat_rate:.4%}",
        "lift": f"{lift:+.2f}%",
        "p_value": f"{p_value:.4f}",
        "significant": p_value < 0.05,
        "confidence_interval": f"[{ci_lower:.4%}, {ci_upper:.4%}]"
    }
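
A usage sketch with made-up counts, sized roughly to the pre-committed sample from above (control at the 65% baseline, treatment one point higher):

result = analyze_experiment(
    control={"users": 35_500, "conversions": 23_075},    # 65.0% completion
    treatment={"users": 35_500, "conversions": 23_430},  # 66.0% completion
)
# With these illustrative counts the lift is about +1.5% relative, the p-value
# is well below 0.05, and the confidence interval excludes zero.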

Common Statistical Mistakes

Mistake 1: Peeking
  Checking results daily and stopping when significant
  Problem: Multiple testing inflates false positive rate
  Fix: Pre-commit to sample size/duration OR use sequential testing
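
The inflation is easy to demonstrate with an A/A simulation: there is no real effect, yet stopping at the first "significant" peek flags far more than 5% of experiments. A rough sketch (simulation parameters are arbitrary):

import numpy as np
from scipy.stats import chi2_contingency

def peeking_false_positive_rate(n_sims=1000, n_users=10_000, checks=20, seed=0):
    """A/A tests (identical 65% rates); stop at the first significant peek."""
    rng = np.random.default_rng(seed)
    checkpoints = np.linspace(n_users // checks, n_users, checks, dtype=int)
    false_positives = 0
    for _ in range(n_sims):
        a = rng.random(n_users) < 0.65   # control conversions
        b = rng.random(n_users) < 0.65   # "treatment" with the same true rate
        for n in checkpoints:
            table = [[a[:n].sum(), n - a[:n].sum()],
                     [b[:n].sum(), n - b[:n].sum()]]
            _, p_value, _, _ = chi2_contingency(table)
            if p_value < 0.05:
                false_positives += 1
                break
    return false_positives / n_sims  # lands well above the nominal 0.05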

Mistake 2: Wrong randomization unit
  Randomize by session → same user in both variants
  Problem: Contaminated results
  Fix: Randomize by user ID consistently
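
A common way to get consistent user-level assignment is to hash the user ID with an experiment-specific salt, so the same user always lands in the same variant. A minimal sketch (function and experiment names are illustrative):

import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministic assignment: hash of experiment + user ID picks the bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

assign_variant("user-12345", "checkout-progress-bar")  # same answer on every call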

Mistake 3: Not enough power
  "We ran 500 users and it's not significant"
  Problem: Underpowered test, can't detect real effects
  Fix: Calculate sample size BEFORE running

Mistake 4: Survivor bias
  Only count users who completed onboarding
  Problem: Treatment might cause more dropoff before counting
  Fix: Intent-to-treat analysis (count all assigned users)

Mistake 5: Multiple metrics, no correction
  Test 20 metrics, declare victory on the 1 significant one
  Problem: With 20 tests at α=0.05, expect 1 false positive
  Fix: Bonferroni correction or pre-declared primary metric
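
For illustration, the Bonferroni version of the fix is a one-line threshold adjustment (the p-values below are made up):

def bonferroni_significant(p_values: list, alpha: float = 0.05) -> list:
    """Flag only results that survive dividing alpha by the number of comparisons."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# 20 metrics at alpha=0.05 -> per-metric threshold of 0.0025
bonferroni_significant([0.04, 0.001, 0.30] + [0.50] * 17)
# -> only the 0.001 result survives; the nominally "significant" 0.04 does not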

Anti-Patterns

Anti-Pattern               | Consequence                          | Fix
Peeking at results early   | Inflated false positive rate         | Pre-register duration + sample size
No guardrail metrics       | Win primary metric, damage overall   | Monitor revenue, latency, errors
Test too briefly           | Miss weekly patterns                 | Run full weeks (2 minimum)
Multiple comparisons       | False discoveries                    | Pre-declared primary metric
Ship on p=0.049            | Borderline results, likely noise     | Replicate or increase sample size

A/B testing done right is rigorous science. Done wrong, it is a way to confirm biases with a veneer of statistical credibility.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
