A/B Testing Statistical Framework
Rigorous A/B testing for product decisions. Covers sample size calculation, statistical significance, Bayesian vs frequentist approaches, and common pitfalls that invalidate experiments.
A/B testing is the only reliable method for measuring the causal impact of product changes. Correlation studies, user surveys, and expert opinions all introduce bias. Controlled experiments eliminate it. But most A/B tests are poorly designed — they run too short, check results too often, or draw conclusions from noise. A rigorous statistical framework turns A/B testing from guesswork into science.
Experiment Design
Step 1: Define the Hypothesis
Every experiment starts with a specific, measurable hypothesis:
- Bad: “The new checkout flow will be better”
- Good: “The new checkout flow will increase purchase completion rate by 5% (from 3.2% to 3.36%)”
The hypothesis defines: the metric, the expected effect size, and the direction (one-tailed vs two-tailed).
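One lightweight way to enforce this discipline is to write the hypothesis down as structured data before any traffic is assigned. The shape below is purely illustrative, not a prescribed schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    metric: str              # e.g. "purchase_completion_rate"
    baseline: float          # current value of the metric
    min_effect: float        # minimum relative effect worth detecting
    two_tailed: bool = True  # one-tailed only if the direction is pre-committed

checkout = Hypothesis(metric="purchase_completion_rate",
                      baseline=0.032, min_effect=0.05)  # 3.2% -> 3.36%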
Step 2: Calculate Sample Size
Running an experiment without calculating sample size is the most common mistake. Under-powered experiments produce inconclusive results. Over-powered experiments waste time and traffic.
from scipy import stats
import math

def required_sample_size(
    baseline_rate: float,   # Current conversion rate
    min_effect: float,      # Minimum detectable effect (relative)
    alpha: float = 0.05,    # Significance level
    power: float = 0.80,    # Statistical power
) -> int:
    # Standard two-proportion z-test sample size formula
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_effect)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    pooled_p = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * pooled_p * (1 - pooled_p)) +
          z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / \
        (p2 - p1) ** 2
    return math.ceil(n)  # Per group
Example: Baseline 3.2% conversion, want to detect 5% relative lift, 80% power:
- Required: ~195,000 users per group
- At 10,000 daily users per group: run for ~20 days
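Running the function above on these inputs reproduces the per-group figure:

print(required_sample_size(baseline_rate=0.032, min_effect=0.05))
# -> 194530, i.e. roughly 195,000 per group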
Step 3: Run Duration
- Minimum: enough time to reach the calculated sample size.
- Practical minimum: at least 1 full business cycle (7 days) to capture day-of-week effects.
- Maximum: 4 weeks; longer experiments suffer from cookie churn and changing conditions.
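A small helper makes the duration arithmetic explicit; it assumes traffic is split evenly across the groups, which is the usual 50/50 design:

import math

def run_duration_days(n_per_group: int, daily_users: int, n_groups: int = 2) -> int:
    # Days until each group reaches its required sample size
    return math.ceil(n_per_group / (daily_users / n_groups))

run_duration_days(194_530, daily_users=20_000)  # -> 20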
Statistical Analysis
Frequentist Approach
The traditional hypothesis testing framework:
import math
from scipy.stats import norm

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    p_control = control_conversions / control_total
    p_treatment = treatment_conversions / treatment_total
    # Relative lift
    lift = (p_treatment - p_control) / p_control
    # Z-test for proportions
    p_pooled = (control_conversions + treatment_conversions) / \
               (control_total + treatment_total)
    se = math.sqrt(p_pooled * (1 - p_pooled) *
                   (1 / control_total + 1 / treatment_total))
    z_stat = (p_treatment - p_control) / se
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))
    # 95% confidence interval for the absolute difference
    se_diff = math.sqrt(p_control * (1 - p_control) / control_total +
                        p_treatment * (1 - p_treatment) / treatment_total)
    ci_lower = (p_treatment - p_control) - 1.96 * se_diff
    ci_upper = (p_treatment - p_control) + 1.96 * se_diff
    return {
        'lift': f'{lift:.2%}',
        'p_value': f'{p_value:.4f}',
        'significant': p_value < 0.05,
        'ci_95': f'[{ci_lower:.4f}, {ci_upper:.4f}]',
    }
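For example, with counts invented for illustration:

result = analyze_ab_test(control_conversions=3_200, control_total=100_000,
                         treatment_conversions=3_400, treatment_total=100_000)
# -> lift '6.25%', p_value around 0.012: significant at the 0.05 level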
Bayesian Approach
Bayesian analysis provides probability statements about the treatment effect, which are often more useful for decision-making (a simulation sketch follows the table):
| Question | Frequentist Answer | Bayesian Answer |
|---|---|---|
| "Is B better than A?" | "We reject H₀ at α=0.05" | "There's a 94% probability B is better" |
| "How much better?" | "CI: [0.2%, 1.8%]" | "Expected lift: 1.0% with 95% credible interval [0.2%, 1.8%]" |
| "Should we ship?" | "p < 0.05, yes" | "94% chance of positive lift, expected value $120K/year" |
Common Pitfalls
1. Peeking at Results
Checking results daily and stopping as soon as p < 0.05 inflates the false positive rate from the nominal 5% to 20-30%. Either commit to a fixed sample size up front or use sequential testing methods designed for repeated looks (e.g., alpha-spending functions).
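An A/A simulation makes the inflation concrete: both groups share the same true rate, so every "significant" result is a false positive. Peek frequency and traffic below are illustrative assumptions:

import numpy as np

def peeking_false_positive_rate(n_experiments=2_000, n_peeks=20,
                                users_per_peek=1_000, rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    stopped_early = 0
    for _ in range(n_experiments):
        # Cumulative conversions per group after each peek
        a = rng.binomial(users_per_peek, rate, n_peeks).cumsum()
        b = rng.binomial(users_per_peek, rate, n_peeks).cumsum()
        for peek in range(n_peeks):
            n = (peek + 1) * users_per_peek
            p_pool = (a[peek] + b[peek]) / (2 * n)
            se = (p_pool * (1 - p_pool) * 2 / n) ** 0.5
            if abs(b[peek] - a[peek]) / n > 1.96 * se:
                stopped_early += 1  # spurious "win" under repeated peeking
                break
    return stopped_early / n_experiments  # lands well above the nominal 5%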
2. Multiple Comparisons
Testing 10 metrics without correction gives roughly a 40% chance of at least one false positive (1 − 0.95¹⁰ ≈ 0.40). Apply Bonferroni correction or pre-designate a single primary metric.
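Applying Bonferroni is one line: divide α by the number of metrics. The p-values below are made up for illustration:

p_values = {'conversion': 0.03, 'revenue': 0.20, 'retention': 0.04}
threshold = 0.05 / len(p_values)  # Bonferroni-corrected: ~0.0167
significant = [m for m, p in p_values.items() if p < threshold]
# -> []  (0.03 and 0.04 pass alone, but neither survives correction)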
3. Simpson’s Paradox
A treatment can appear positive overall while being negative in every subgroup (or vice versa). Always check segment-level results.
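A concrete illustration with fabricated numbers, where badly imbalanced assignment across segments flips the aggregate result:

| Segment | Control | Treatment |
|---|---|---|
| Desktop | 50/100 (50.0%) | 400/900 (44.4%) |
| Mobile | 90/900 (10.0%) | 9/100 (9.0%) |
| Overall | 140/1,000 (14.0%) | 409/1,000 (40.9%) |

The treatment loses in both segments yet wins overall because 90% of its traffic came from high-converting desktop users.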
4. Survivorship Bias
If users drop off differently between groups, your end-of-experiment comparison is biased. Analyze using intent-to-treat: include all randomized users, not just those who completed the flow.
Decision Framework
| Scenario | Action |
|---|---|
| p < 0.05 and lift > MDE | Ship it — statistically and practically significant |
| p < 0.05 and lift < MDE | Reconsider — statistically significant but too small to matter |
| p > 0.05 and CI includes MDE | Inconclusive — need more data |
| p > 0.05 and CI excludes MDE | No effect — the treatment doesn’t work |
The most dangerous outcome is declaring “no effect” when the experiment was simply under-powered. Always check if the confidence interval is narrow enough to rule out meaningful effects.
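For completeness, the decision table translates into a small helper. It assumes lift, the CI bound, and the MDE are expressed on the same relative scale, which differs from the absolute CI that analyze_ab_test returns above:

def decide(p_value: float, lift: float, ci_upper: float,
           mde: float, alpha: float = 0.05) -> str:
    if p_value < alpha:
        if lift >= mde:
            return 'ship: statistically and practically significant'
        return 'reconsider: significant but too small to matter'
    if ci_upper >= mde:
        return 'inconclusive: the CI still contains the MDE; need more data'
    return 'no effect: the CI is narrow enough to rule out the MDE'

decide(p_value=0.20, lift=0.01, ci_upper=0.08, mde=0.05)  # -> inconclusive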