A/B Testing Infrastructure: Making Data-Driven Decisions Without Breaking Production
Build experimentation infrastructure that produces trustworthy results. Covers statistical foundations, feature flag integration, sample size calculations, metric selection, guardrail metrics, and the organizational patterns that prevent HiPPO-driven decisions.
Most A/B tests at most companies are statistically invalid. The sample size is too small. The test ran for 3 days instead of 3 weeks. Someone peeked at results on day 2 and declared victory. The primary metric moved 2% but nobody checked whether the payment flow still worked.
This guide covers how to build A/B testing infrastructure that produces results you can actually trust, because making business decisions on bad data is worse than making no decision at all.
Statistical Foundations (The Minimum You Need)
You do not need a statistics PhD, but you need to understand 4 concepts:
| Concept | What It Is | Practical Meaning |
|---|---|---|
| Statistical significance | Probability the result is not due to chance | Standard threshold: p < 0.05 (95% confidence) |
| Statistical power | Probability of detecting a real effect | Standard threshold: 80% power |
| Minimum Detectable Effect (MDE) | Smallest improvement you care about | "I want to detect a 5% lift in conversion" |
| Sample size | Users needed per variant to reach significance + power | Depends on baseline rate, MDE, and significance level |
Sample Size Calculator
from scipy import stats
import math
def required_sample_size(
    baseline_rate: float,          # Current conversion rate (e.g., 0.03 = 3%)
    min_detectable_effect: float,  # Relative lift (e.g., 0.10 = 10% improvement)
    significance: float = 0.05,    # Alpha (false positive rate)
    power: float = 0.80,           # 1 - Beta (true positive detection rate)
) -> int:
    """Calculate required sample size per variant."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)
    z_alpha = stats.norm.ppf(1 - significance / 2)
    z_beta = stats.norm.ppf(power)
    p_bar = (p1 + p2) / 2
    n = (
        (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    ) / (p2 - p1) ** 2
    return math.ceil(n)
# Example: 3% baseline conversion, want to detect 10% relative lift
sample_size = required_sample_size(0.03, 0.10)
# Result: ~53,211 per variant → ~106,422 total
# At 1,000 users/day → test needs to run ~107 days (about 3.5 months)
The uncomfortable truth: Most companies do not have enough traffic to detect small effects. If your baseline conversion is 3% and you get 500 users/day, detecting a 5% relative lift requires roughly 208,000 users per variant, more than two years of traffic. Either increase your MDE or accept that you cannot run that test.
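One way to internalize this tradeoff is to sweep the MDE with the calculator above. A quick sketch, reusing the 3% baseline and assuming 1,000 users/day split across two variants:

# Sample size grows with roughly 1/MDE^2, so halving the MDE
# roughly quadruples the required traffic.
for mde in (0.02, 0.05, 0.10, 0.20):
    n = required_sample_size(0.03, mde)
    days = 2 * n / 1000  # two variants, 1,000 users/day total
    print(f"MDE {mde:.0%}: {n:,} per variant ≈ {days:,.0f} days")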
Experimentation Architecture
┌──────────┐    ┌──────────────┐    ┌──────────────┐
│   User   │───▶│  Assignment  │───▶│   Feature    │
│  Request │    │   Service    │    │    Flags     │
│          │    │              │    │              │
│          │    │  user_id →   │    │  variant_a   │
│          │    │  variant_a   │    │  or          │
│          │    │              │    │  variant_b   │
└──────────┘    └──────────────┘    └──────────────┘
                                           │
                                           ▼
                                    ┌──────────────┐    ┌──────────────┐
                                    │  Event       │───▶│  Analysis    │
                                    │  Tracking    │    │  Pipeline    │
                                    │              │    │              │
                                    │  exposure,   │    │  stats       │
                                    │  conversion, │    │  significance│
                                    │  revenue     │    │  guardrails  │
                                    └──────────────┘    └──────────────┘
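At the far end of that flow, the analysis pipeline's significance check can start as a plain two-proportion z-test, the analysis-time counterpart of the sample size formula above. A minimal sketch (the function name and example figures are illustrative):

from scipy import stats
import math

def two_proportion_z_test(conversions_a: int, n_a: int,
                          conversions_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in conversion rates.
    Returns (z statistic, p-value)."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p_value

# Example: control converts 300/10,000, treatment 345/10,000
z, p = two_proportion_z_test(300, 10_000, 345, 10_000)
# → z ≈ 1.80, p ≈ 0.07: a 15% relative lift, yet not significant at 0.05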
Assignment Service
import hashlib
class ExperimentAssignment:
    """Deterministic, consistent experiment assignment."""

    def assign(self, user_id: str, experiment_id: str,
               variants: list[str],
               weights: list[float] | None = None) -> str:
        """Assign user to variant deterministically.

        Same user + same experiment = always same variant.
        Different experiments = independent assignment.
        """
        # Hash user_id + experiment_id for deterministic assignment
        hash_input = f"{user_id}:{experiment_id}"
        hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
        bucket = hash_value % 10000  # 10,000 buckets for precision
        if weights is None:
            weights = [1.0 / len(variants)] * len(variants)
        cumulative = 0.0
        for variant, weight in zip(variants, weights):
            cumulative += weight * 10000
            if bucket < cumulative:
                return variant
        return variants[-1]  # Fallback if float rounding leaves the last bucket uncovered
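Assignment alone is not enough: the analysis pipeline needs exposure events to know which users actually saw which variant. A sketch of how the pieces fit together; the event schema is an assumption, and print() stands in for a real event transport:

import json
import time
import uuid

def log_exposure(user_id: str, experiment_id: str, variant: str) -> None:
    """Emit an exposure event when the user actually sees the variant,
    not merely when assignment is computed."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": "exposure",
        "user_id": user_id,
        "experiment_id": experiment_id,
        "variant": variant,
        "timestamp": time.time(),
    }
    print(json.dumps(event))  # stand-in for your event pipeline

assigner = ExperimentAssignment()
variant = assigner.assign("user-123", "checkout-button-color",
                          ["control", "treatment"])
log_exposure("user-123", "checkout-button-color", variant)

Logging exposure at render time rather than assignment time keeps never-exposed users out of the denominator.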
Metric Selection
| Metric Type | Definition | Example |
|---|---|---|
| Primary metric | The ONE thing the experiment is trying to improve | Checkout conversion rate |
| Secondary metrics | Additional metrics that should improve or stay neutral | Revenue per user, AOV |
| Guardrail metrics | Metrics that must NOT degrade | Page load time, error rate, customer support tickets |
| Counter metrics | Metrics you expect to move in the opposite direction | Cart abandonment (should decrease) |
Guardrail Metrics: The Safety Net
guardrail_metrics:
  - name: "error_rate"
    threshold: "< 1% increase from control"
    action: "Auto-halt experiment if breached"
  - name: "p95_latency"
    threshold: "< 200ms increase from control"
    action: "Alert + review"
  - name: "crash_rate"
    threshold: "< 0.1% increase from control"
    action: "Auto-halt experiment"
  - name: "customer_support_tickets"
    threshold: "< 10% increase from control"
    action: "Alert + review"
Common Mistakes
| Mistake | What Goes Wrong | Prevention |
|---|---|---|
| Peeking | Checking results daily and stopping when significant | Pre-commit to sample size. Use sequential testing if you must peek. (See the A/A simulation after this table.) |
| Too many variants | Dilutes traffic, extends timeline | Max 3-4 variants per test |
| No guardrails | Conversion up but latency doubled | Always monitor guardrail metrics |
| Wrong randomization unit | Assign per-session, user sees both variants | Randomize per user, not session |
| Post-hoc segmentation | "It worked for mobile users!" (after seeing all data) | Pre-register segments of interest before the test |
| Survivorship bias | Measuring conversions only for users who stayed | Include all exposed users in analysis |
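The cost of peeking is easy to demonstrate with an A/A simulation: both arms behave identically, so every "significant" result is a false positive. A sketch (the simulation parameters are arbitrary):

import math
import random
from scipy import stats

def peeking_false_positive_rate(n_sims: int = 1000, days: int = 20,
                                users_per_day: int = 500,
                                rate: float = 0.03) -> float:
    """A/A simulation: run a z-test every day and stop at the first
    p < 0.05. With a single pre-committed look the false positive rate
    would be ~5%; daily peeking inflates it to several times that."""
    false_positives = 0
    for _ in range(n_sims):
        conv, n = [0, 0], [0, 0]
        for _ in range(days):
            for arm in (0, 1):
                users = users_per_day // 2
                n[arm] += users
                conv[arm] += sum(random.random() < rate for _ in range(users))
            p_pool = (conv[0] + conv[1]) / (n[0] + n[1])
            se = math.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
            if se == 0:
                continue  # no conversions yet; nothing to test
            z = (conv[1] / n[1] - conv[0] / n[0]) / se
            if 2 * (1 - stats.norm.cdf(abs(z))) < 0.05:
                false_positives += 1
                break  # "declared victory" early — a false positive
    return false_positives / n_sims

The same logic underpins the quarterly A/A tests in the checklist below: on identical variants, a single pre-committed analysis should come out "significant" only about 5% of the time.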
Organization: Avoiding HiPPO Decisions
HiPPO = Highest Paid Person’s Opinion. The most common threat to data-driven decisions.
Before A/B testing culture:
VP: "I think the button should be blue."
Team: "OK." → Ships blue button.
After A/B testing culture:
VP: "I think the button should be blue."
Team: "Great hypothesis! Let's test blue vs. current green."
Results: Green converts 8% better.
Team: "Data says green. We're keeping green."
VP: "Fair enough." ← This is the culture you want.
Experiment Governance
| Process | Purpose |
|---|---|
| Experiment review board | Reviews proposed experiments for statistical validity and ethical concerns |
| Pre-registration | Document hypothesis, primary metric, sample size, and analysis plan BEFORE the test starts (template sketch below) |
| Mandatory guardrails | Every experiment must include error rate and latency guardrails |
| Result review | Analysis reviewed by someone not involved in the experiment |
| Decision log | Record what was decided and why (even for tests that were not implemented) |
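Pre-registration works best as an artifact rather than a meeting. One option, in the same spirit as the guardrail config above, is a small file committed before launch (the field names here are an assumption):

experiment: checkout_button_color
owner: growth-team
hypothesis: "Blue CTA increases checkout conversion vs. current green"
primary_metric: checkout_conversion_rate
baseline_rate: 0.03
min_detectable_effect: 0.10        # relative lift
sample_size_per_variant: 53211     # from the calculator above
guardrails: [error_rate, p95_latency]
segments_of_interest: [mobile, desktop]   # declared now, not post-hoc
analysis_plan: "Two-sided z-test, alpha 0.05, analyzed only at full sample size"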
Implementation Checklist
- Build deterministic assignment service (same user always gets same variant)
- Define primary, secondary, and guardrail metrics for every experiment
- Calculate required sample size BEFORE launching (not during or after)
- Implement exposure logging: track which users saw which variant
- Set up auto-halt triggers for guardrail metric violations
- Enforce pre-registration: document hypothesis and analysis plan before test starts
- Build a results dashboard with confidence intervals, not just p-values (a CI sketch follows this checklist)
- Run A/A tests quarterly to validate your experimentation infrastructure
- Train product managers on statistical significance and sample size requirements
- Create an experiment decision log: what was tested, results, and what was shipped
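For the dashboard item above, a normal-approximation confidence interval on the lift is a reasonable starting point (the function name and example figures are illustrative):

import math
from scipy import stats

def lift_confidence_interval(conversions_a: int, n_a: int,
                             conversions_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Unpooled normal-approximation CI for the absolute difference
    in conversion rate (treatment minus control)."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Same numbers as the z-test example: the interval includes zero,
# which a dashboard communicates far better than a lone p-value.
low, high = lift_confidence_interval(300, 10_000, 345, 10_000)
# → (-0.0004, 0.0094)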