# A/B Testing at Scale
Design and run rigorous A/B tests that produce trustworthy results. Covers experiment design, sample size calculation, statistical significance, guardrail metrics, multi-variant testing, and the common statistical mistakes that lead to wrong conclusions.
A/B testing is the scientific method applied to product development. Instead of debating whether Feature A or Feature B is better, you run a controlled experiment with real users and let the data decide. Done right, it replaces opinion with evidence; done wrong, it produces false confidence that leads to worse products.
## Experiment Design

Pin down seven things before splitting any traffic:

1. **Hypothesis**: "Adding a progress bar to checkout will increase completion rate."
2. **Primary metric** (what you're optimizing): checkout completion rate.
3. **Guardrail metrics** (what must NOT degrade): revenue per user, page load time, error rate.
4. **Minimum detectable effect (MDE)**: a 1% absolute increase (from 65% to 66%).
5. **Sample size**: calculated from the baseline rate, MDE, significance level, and power.
6. **Duration**: 2 full weeks minimum, to capture weekly patterns.
7. **Randomization unit**: user ID (not session, not request); see the assignment sketch below.
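In practice, randomizing by user ID usually means deterministic hashing rather than storing a coin flip per user. A minimal sketch (the `assign_variant` helper and experiment name are illustrative, not any particular framework's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant by hashing user ID + experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Example: the same user always gets the same variant for a given experiment
assign_variant("user-42", "checkout-progress-bar")  # "control" or "treatment"
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments, so a user's bucket in one test tells you nothing about their bucket in another.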
## Sample Size Calculation

Compute the required sample size per variant before launch, using the standard two-proportion power formula:

```python
from scipy import stats
import numpy as np

def sample_size_proportion(
    baseline_rate: float,
    mde: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Calculate required sample size per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + mde

    # Pooled proportion under the null hypothesis
    p_pool = (p1 + p2) / 2

    # Critical values for a two-sided test at the given alpha and power
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    n = ((z_alpha * np.sqrt(2 * p_pool * (1 - p_pool)) +
          z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) /
         (p2 - p1)) ** 2
    return int(np.ceil(n))

# Example: detect a 1% absolute increase from a 65% baseline
n = sample_size_proportion(baseline_rate=0.65, mde=0.01)
# Result: ~35,500 users per variant
# Total needed: ~71,000 users
```
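Cross-checking the number against an independent implementation is cheap insurance against formula bugs. If statsmodels is available, its power solver (which works on the arcsine-transformed effect size, Cohen's h, rather than the pooled-variance formula above) should land in the same place:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for 65% -> 66%, then solve for the per-group sample size
effect_size = proportion_effectsize(0.66, 0.65)
n = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                 power=0.8, alternative="two-sided")
print(round(n))  # ~35,500 per variant, agreeing with the formula above
```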
## Statistical Analysis

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

def analyze_experiment(control, treatment):
    """Analyze A/B test results.

    `control` and `treatment` are dicts with "users" and "conversions" counts.
    """
    # Raw conversion rates and relative lift
    ctrl_rate = control["conversions"] / control["users"]
    treat_rate = treatment["conversions"] / treatment["users"]
    lift = (treat_rate - ctrl_rate) / ctrl_rate * 100

    # Chi-squared test on the 2x2 table of converted vs. not converted
    table = [
        [control["conversions"], control["users"] - control["conversions"]],
        [treatment["conversions"], treatment["users"] - treatment["conversions"]],
    ]
    chi2, p_value, _, _ = chi2_contingency(table)

    # 95% confidence interval for the absolute difference in rates
    se = np.sqrt(
        ctrl_rate * (1 - ctrl_rate) / control["users"] +
        treat_rate * (1 - treat_rate) / treatment["users"]
    )
    z = norm.ppf(0.975)  # 1.96 for a 95% interval
    ci_lower = (treat_rate - ctrl_rate) - z * se
    ci_upper = (treat_rate - ctrl_rate) + z * se

    return {
        "control_rate": f"{ctrl_rate:.4%}",
        "treatment_rate": f"{treat_rate:.4%}",
        "lift": f"{lift:+.2f}%",
        "p_value": f"{p_value:.4f}",
        "significant": p_value < 0.05,
        "confidence_interval": f"[{ci_lower:.4%}, {ci_upper:.4%}]",
    }
```
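A usage sketch with made-up counts, at roughly the per-variant sample size computed above:

```python
result = analyze_experiment(
    control={"users": 35_000, "conversions": 22_750},    # 65.0% conversion
    treatment={"users": 35_000, "conversions": 23_275},  # 66.5% conversion
)
print(result)
# A +1.5 point absolute difference at this sample size is comfortably
# significant: lift +2.31%, p-value far below 0.05.
```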
## Common Statistical Mistakes

**Mistake 1: Peeking.** Checking results daily and stopping as soon as they look significant.
- Problem: repeated testing inflates the false positive rate.
- Fix: pre-commit to a sample size and duration, or use a proper sequential testing procedure. The simulation below shows how badly naive peeking misbehaves.
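To make the inflation concrete, here is a small A/A simulation (both variants identical, so every "win" is a false positive; all parameters are illustrative) that stops at the first daily check crossing p < 0.05:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims=2000, days=14,
                                users_per_day=1000, p=0.65):
    """Simulate A/A tests with a significance check after each day."""
    false_positives = 0
    for _ in range(n_sims):
        c_conv = t_conv = c_n = t_n = 0
        for _ in range(days):
            c_conv += rng.binomial(users_per_day, p)
            t_conv += rng.binomial(users_per_day, p)
            c_n += users_per_day
            t_n += users_per_day
            # Two-proportion z-test on the data accumulated so far
            pooled = (c_conv + t_conv) / (c_n + t_n)
            se = np.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / t_n))
            z = (t_conv / t_n - c_conv / c_n) / se
            if 2 * norm.sf(abs(z)) < 0.05:  # "we have a winner" -- stop early
                false_positives += 1
                break
    return false_positives / n_sims

# No real effect exists, yet stopping at the first significant daily check
# yields a false positive rate well above the nominal 5%.
print(peeking_false_positive_rate())
```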
**Mistake 2: Wrong randomization unit.** Randomizing by session puts the same user in both variants.
- Problem: contaminated results.
- Fix: randomize by user ID, consistently (see the assignment sketch above).

**Mistake 3: Not enough power.** "We ran 500 users and it's not significant."
- Problem: an underpowered test cannot detect real effects.
- Fix: calculate the sample size BEFORE running.

**Mistake 4: Survivorship bias.** Counting only users who completed onboarding.
- Problem: the treatment might cause more dropoff before counting begins.
- Fix: intent-to-treat analysis (count all assigned users).

**Mistake 5: Multiple metrics, no correction.** Testing 20 metrics and declaring victory on the 1 that comes up significant.
- Problem: with 20 tests at α = 0.05, you expect 1 false positive by chance alone.
- Fix: a Bonferroni correction or a pre-declared primary metric, as sketched below.
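The Bonferroni correction itself is one line: divide α by the number of tests. A minimal sketch (`bonferroni_significant` is an illustrative helper, not a library function):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that survive a Bonferroni correction (alpha / m per test)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Three metrics tested together: each must clear 0.05 / 3, roughly 0.0167.
print(bonferroni_significant([0.03, 0.001, 0.20]))  # [False, True, False]
```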
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Peeking at results early | Inflated false positive rate | Pre-register duration + sample size |
| No guardrail metrics | Win primary metric, damage overall | Monitor revenue, latency, errors |
| Test too briefly | Miss weekly patterns | Run full weeks (2 minimum) |
| Multiple comparisons | False discoveries | Pre-declared primary metric |
| Ship on p=0.049 | Borderline results, likely noise | Replicate or increase sample size |
A/B testing done right is rigorous science. Done wrong, it is a way to confirm biases with a veneer of statistical credibility.