A/B Testing Infrastructure: Making Data-Driven Decisions Without Breaking Production
Build experimentation infrastructure that produces trustworthy results. Covers statistical foundations, feature flag integration, sample size calculations, metric selection, guardrail metrics, and the organizational patterns that prevent HiPPO-driven decisions.
Most A/B tests at most companies are statistically invalid. The sample size is too small. The test ran for 3 days instead of 3 weeks. Someone peeked at results on day 2 and declared victory. The primary metric moved 2% but nobody checked whether the payment flow still worked.
This guide covers how to build A/B testing infrastructure that produces results you can actually trust, because making business decisions on bad data is worse than making no decision at all.
Statistical Foundations (The Minimum You Need)
You do not need a statistics PhD, but you need to understand 4 concepts:
| Concept | What It Is | Practical Meaning |
|---|---|---|
| Statistical significance | Probability the result is not due to chance | Standard threshold: p < 0.05 (95% confidence) |
| Statistical power | Probability of detecting a real effect | Standard threshold: 80% power |
| Minimum Detectable Effect (MDE) | Smallest improvement you care about | "I want to detect a 5% lift in conversion" |
| Sample size | Users needed per variant to reach significance + power | Depends on baseline rate, MDE, and significance level |
Sample Size Calculator
from scipy import stats
import math
def required_sample_size(
    baseline_rate: float,          # Current conversion rate (e.g., 0.03 = 3%)
    min_detectable_effect: float,  # Relative lift (e.g., 0.10 = 10% improvement)
    significance: float = 0.05,    # Alpha (false positive rate)
    power: float = 0.80,           # 1 - Beta (true positive detection rate)
) -> int:
    """Calculate required sample size per variant."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)
    z_alpha = stats.norm.ppf(1 - significance / 2)
    z_beta = stats.norm.ppf(power)
    p_bar = (p1 + p2) / 2
    n = (
        (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    ) / (p2 - p1) ** 2
    return math.ceil(n)
# Example: 3% baseline conversion, want to detect 10% relative lift
sample_size = required_sample_size(0.03, 0.10)
# Result: ~53,211 per variant → ~106,422 total
# At 1,000 users/day → test needs to run ~107 days (about 3.5 months)
The uncomfortable truth: Most companies do not have enough traffic to detect small effects. If your baseline conversion is 3% and you get 500 users/day, detecting a 5% relative lift requires roughly 208,000 users per variant, more than two years of traffic. Either increase your MDE or accept that you cannot run that test.
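One way to internalize this tradeoff is to sweep the MDE with the calculator above. A quick sketch, reusing the 3% baseline and assuming 1,000 users/day split across two variants:

# Sample size grows with roughly 1/MDE^2, so halving the MDE
# roughly quadruples the required traffic.
for mde in (0.02, 0.05, 0.10, 0.20):
    n = required_sample_size(0.03, mde)
    days = 2 * n / 1000  # two variants, 1,000 users/day total
    print(f"MDE {mde:.0%}: {n:,} per variant ≈ {days:,.0f} days")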
Experimentation Architecture
┌──────────┐    ┌──────────────┐    ┌──────────────┐
│   User   │───▶│  Assignment  │───▶│   Feature    │
│  Request │    │   Service    │    │    Flags     │
│          │    │              │    │              │
│          │    │  user_id →   │    │  variant_a   │
│          │    │  variant_a   │    │  or          │
│          │    │              │    │  variant_b   │
└──────────┘    └──────────────┘    └──────────────┘
                                           │
                                           ▼
                                    ┌──────────────┐    ┌──────────────┐
                                    │  Event       │───▶│  Analysis    │
                                    │  Tracking    │    │  Pipeline    │
                                    │              │    │              │
                                    │  exposure,   │    │  stats       │
                                    │  conversion, │    │  significance│
                                    │  revenue     │    │  guardrails  │
                                    └──────────────┘    └──────────────┘
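At the far end of that flow, the analysis pipeline's significance check can start as a plain two-proportion z-test, the analysis-time counterpart of the sample size formula above. A minimal sketch (the function name and example figures are illustrative):

from scipy import stats
import math

def two_proportion_z_test(conversions_a: int, n_a: int,
                          conversions_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in conversion rates.
    Returns (z statistic, p-value)."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p_value

# Example: control converts 300/10,000, treatment 345/10,000
z, p = two_proportion_z_test(300, 10_000, 345, 10_000)
# → z ≈ 1.80, p ≈ 0.07: a 15% relative lift, yet not significant at 0.05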
Assignment Service
import hashlib
class ExperimentAssignment:
    """Deterministic, consistent experiment assignment."""

    def assign(self, user_id: str, experiment_id: str,
               variants: list[str],
               weights: list[float] | None = None) -> str:
        """Assign user to variant deterministically.

        Same user + same experiment = always same variant.
        Different experiments = independent assignment.
        """
        # Hash user_id + experiment_id for deterministic assignment
        hash_input = f"{user_id}:{experiment_id}"
        hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
        bucket = hash_value % 10000  # 10,000 buckets for precision
        if weights is None:
            weights = [1.0 / len(variants)] * len(variants)
        cumulative = 0.0
        for variant, weight in zip(variants, weights):
            cumulative += weight * 10000
            if bucket < cumulative:
                return variant
        return variants[-1]  # Fallback if float rounding leaves the last bucket uncovered
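Assignment alone is not enough: the analysis pipeline needs exposure events to know which users actually saw which variant. A sketch of how the pieces fit together; the event schema is an assumption, and print() stands in for a real event transport:

import json
import time
import uuid

def log_exposure(user_id: str, experiment_id: str, variant: str) -> None:
    """Emit an exposure event when the user actually sees the variant,
    not merely when assignment is computed."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": "exposure",
        "user_id": user_id,
        "experiment_id": experiment_id,
        "variant": variant,
        "timestamp": time.time(),
    }
    print(json.dumps(event))  # stand-in for your event pipeline

assigner = ExperimentAssignment()
variant = assigner.assign("user-123", "checkout-button-color",
                          ["control", "treatment"])
log_exposure("user-123", "checkout-button-color", variant)

Logging exposure at render time rather than assignment time keeps never-exposed users out of the denominator.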
Metric Selection
| Metric Type | Definition | Example |
|---|---|---|
| Primary metric | The ONE thing the experiment is trying to improve | Checkout conversion rate |
| Secondary metrics | Additional metrics that should improve or stay neutral | Revenue per user, AOV |
| Guardrail metrics | Metrics that must NOT degrade | Page load time, error rate, customer support tickets |
| Counter metrics | Metrics you expect to move in the opposite direction | Cart abandonment (should decrease) |
Guardrail Metrics: The Safety Net
guardrail_metrics:
  - name: "error_rate"
    threshold: "< 1% increase from control"
    action: "Auto-halt experiment if breached"
  - name: "p95_latency"
    threshold: "< 200ms increase from control"
    action: "Alert + review"
  - name: "crash_rate"
    threshold: "< 0.1% increase from control"
    action: "Auto-halt experiment"
  - name: "customer_support_tickets"
    threshold: "< 10% increase from control"
    action: "Alert + review"
Common Mistakes
| Mistake | What Goes Wrong | Prevention |
|---|---|---|
| Peeking | Checking results daily and stopping when significant | Pre-commit to sample size. Use sequential testing if you must peek. (See the A/A simulation after this table.) |
| Too many variants | Dilutes traffic, extends timeline | Max 3-4 variants per test |
| No guardrails | Conversion up but latency doubled | Always monitor guardrail metrics |
| Wrong randomization unit | Assign per-session, user sees both variants | Randomize per user, not session |
| Post-hoc segmentation | "It worked for mobile users!" (after seeing all data) | Pre-register segments of interest before the test |
| Survivorship bias | Measuring conversions only for users who stayed | Include all exposed users in analysis |
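The cost of peeking is easy to demonstrate with an A/A simulation: both arms behave identically, so every "significant" result is a false positive. A sketch (the simulation parameters are arbitrary):

import math
import random
from scipy import stats

def peeking_false_positive_rate(n_sims: int = 1000, days: int = 20,
                                users_per_day: int = 500,
                                rate: float = 0.03) -> float:
    """A/A simulation: run a z-test every day and stop at the first
    p < 0.05. With a single pre-committed look the false positive rate
    would be ~5%; daily peeking inflates it to several times that."""
    false_positives = 0
    for _ in range(n_sims):
        conv, n = [0, 0], [0, 0]
        for _ in range(days):
            for arm in (0, 1):
                users = users_per_day // 2
                n[arm] += users
                conv[arm] += sum(random.random() < rate for _ in range(users))
            p_pool = (conv[0] + conv[1]) / (n[0] + n[1])
            se = math.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
            if se == 0:
                continue  # no conversions yet; nothing to test
            z = (conv[1] / n[1] - conv[0] / n[0]) / se
            if 2 * (1 - stats.norm.cdf(abs(z))) < 0.05:
                false_positives += 1
                break  # "declared victory" early — a false positive
    return false_positives / n_sims

The same logic underpins the quarterly A/A tests in the checklist below: on identical variants, a single pre-committed analysis should come out "significant" only about 5% of the time.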
Organization: Avoiding HiPPO Decisions
HiPPO = Highest Paid Person’s Opinion. The most common threat to data-driven decisions.
Before A/B testing culture:
VP: "I think the button should be blue."
Team: "OK." → Ships blue button.
After A/B testing culture:
VP: "I think the button should be blue."
Team: "Great hypothesis! Let's test blue vs. current green."
Results: Green converts 8% better.
Team: "Data says green. We're keeping green."
VP: "Fair enough." ← This is the culture you want.
Experiment Governance
| Process | Purpose |
|---|---|
| Experiment review board | Reviews proposed experiments for statistical validity and ethical concerns |
| Pre-registration | Document hypothesis, primary metric, sample size, and analysis plan BEFORE the test starts (template sketch below) |
| Mandatory guardrails | Every experiment must include error rate and latency guardrails |
| Result review | Analysis reviewed by someone not involved in the experiment |
| Decision log | Record what was decided and why (even for tests that were not implemented) |
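Pre-registration works best as an artifact rather than a meeting. One option, in the same spirit as the guardrail config above, is a small file committed before launch (the field names here are an assumption):

experiment: checkout_button_color
owner: growth-team
hypothesis: "Blue CTA increases checkout conversion vs. current green"
primary_metric: checkout_conversion_rate
baseline_rate: 0.03
min_detectable_effect: 0.10        # relative lift
sample_size_per_variant: 53211     # from the calculator above
guardrails: [error_rate, p95_latency]
segments_of_interest: [mobile, desktop]   # declared now, not post-hoc
analysis_plan: "Two-sided z-test, alpha 0.05, analyzed only at full sample size"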
Implementation Checklist
- Build deterministic assignment service (same user always gets same variant)
- Define primary, secondary, and guardrail metrics for every experiment
- Calculate required sample size BEFORE launching (not during or after)
- Implement exposure logging: track which users saw which variant
- Set up auto-halt triggers for guardrail metric violations
- Enforce pre-registration: document hypothesis and analysis plan before test starts
- Build a results dashboard with confidence intervals, not just p-values (a CI sketch follows this checklist)
- Run A/A tests quarterly to validate your experimentation infrastructure
- Train product managers on statistical significance and sample size requirements
- Create an experiment decision log: what was tested, results, and what was shipped
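For the dashboard item above, a normal-approximation confidence interval on the lift is a reasonable starting point (the function name and example figures are illustrative):

import math
from scipy import stats

def lift_confidence_interval(conversions_a: int, n_a: int,
                             conversions_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Unpooled normal-approximation CI for the absolute difference
    in conversion rate (treatment minus control)."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Same numbers as the z-test example: the interval includes zero,
# which a dashboard communicates far better than a lone p-value.
low, high = lift_confidence_interval(300, 10_000, 345, 10_000)
# → (-0.0004, 0.0094)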