A/B Testing Statistical Framework
Rigorous A/B testing for product decisions. Covers sample size calculation, statistical significance, Bayesian vs frequentist approaches, and common pitfalls that invalidate experiments.
A/B testing is the only reliable method for measuring the causal impact of product changes. Correlation studies, user surveys, and expert opinions all introduce bias. Controlled experiments eliminate it. But most A/B tests are poorly designed — they run too short, check results too often, or draw conclusions from noise. A rigorous statistical framework turns A/B testing from guesswork into science.
Experiment Design
Step 1: Define the Hypothesis
Every experiment starts with a specific, measurable hypothesis:
- Bad: “The new checkout flow will be better”
- Good: “The new checkout flow will increase purchase completion rate by 5% (from 3.2% to 3.36%)”
The hypothesis defines: the metric, the expected effect size, and the direction (one-tailed vs two-tailed).
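One lightweight way to enforce this discipline is to write the hypothesis down as structured data before any traffic is assigned. The shape below is purely illustrative, not a prescribed schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    metric: str              # e.g. "purchase_completion_rate"
    baseline: float          # current value of the metric
    min_effect: float        # minimum relative effect worth detecting
    two_tailed: bool = True  # one-tailed only if the direction is pre-committed

checkout = Hypothesis(metric="purchase_completion_rate",
                      baseline=0.032, min_effect=0.05)  # 3.2% -> 3.36%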
Step 2: Calculate Sample Size
Running an experiment without calculating sample size is the most common mistake. Under-powered experiments produce inconclusive results. Over-powered experiments waste time and traffic.
from scipy import stats
import math

def required_sample_size(
    baseline_rate: float,   # Current conversion rate
    min_effect: float,      # Minimum detectable effect (relative)
    alpha: float = 0.05,    # Significance level
    power: float = 0.80,    # Statistical power
) -> int:
    # Standard two-proportion z-test sample size formula
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_effect)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    pooled_p = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * pooled_p * (1 - pooled_p)) +
          z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / \
        (p2 - p1) ** 2
    return math.ceil(n)  # Per group
Example: Baseline 3.2% conversion, want to detect 5% relative lift, 80% power:
- Required: ~195,000 users per group
- At 10,000 daily users per group: run for ~20 days
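Running the function above on these inputs reproduces the per-group figure:

print(required_sample_size(baseline_rate=0.032, min_effect=0.05))
# -> 194530, i.e. roughly 195,000 per group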
Step 3: Run Duration
- Minimum: enough time to reach the calculated sample size.
- Practical minimum: at least 1 full business cycle (7 days) to capture day-of-week effects.
- Maximum: 4 weeks; longer experiments suffer from cookie churn and changing conditions.
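A small helper makes the duration arithmetic explicit; it assumes traffic is split evenly across the groups, which is the usual 50/50 design:

import math

def run_duration_days(n_per_group: int, daily_users: int, n_groups: int = 2) -> int:
    # Days until each group reaches its required sample size
    return math.ceil(n_per_group / (daily_users / n_groups))

run_duration_days(194_530, daily_users=20_000)  # -> 20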
Statistical Analysis
Frequentist Approach
The traditional hypothesis testing framework:
import math
from scipy.stats import norm

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    p_control = control_conversions / control_total
    p_treatment = treatment_conversions / treatment_total
    # Relative lift
    lift = (p_treatment - p_control) / p_control
    # Z-test for proportions
    p_pooled = (control_conversions + treatment_conversions) / \
               (control_total + treatment_total)
    se = math.sqrt(p_pooled * (1 - p_pooled) *
                   (1 / control_total + 1 / treatment_total))
    z_stat = (p_treatment - p_control) / se
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))
    # 95% confidence interval for the absolute difference
    se_diff = math.sqrt(p_control * (1 - p_control) / control_total +
                        p_treatment * (1 - p_treatment) / treatment_total)
    ci_lower = (p_treatment - p_control) - 1.96 * se_diff
    ci_upper = (p_treatment - p_control) + 1.96 * se_diff
    return {
        'lift': f'{lift:.2%}',
        'p_value': f'{p_value:.4f}',
        'significant': p_value < 0.05,
        'ci_95': f'[{ci_lower:.4f}, {ci_upper:.4f}]',
    }
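For example, with counts invented for illustration:

result = analyze_ab_test(control_conversions=3_200, control_total=100_000,
                         treatment_conversions=3_400, treatment_total=100_000)
# -> lift '6.25%', p_value around 0.012: significant at the 0.05 level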
Bayesian Approach
Bayesian analysis provides probability statements about the treatment effect, which are often more useful for decision-making (a simulation sketch follows the table):
| Question | Frequentist Answer | Bayesian Answer |
|---|---|---|
| "Is B better than A?" | "We reject H₀ at α=0.05" | "There's a 94% probability B is better" |
| "How much better?" | "CI: [0.2%, 1.8%]" | "Expected lift: 1.0% with 95% credible interval [0.2%, 1.8%]" |
| "Should we ship?" | "p < 0.05, yes" | "94% chance of positive lift, expected value $120K/year" |
Common Pitfalls
1. Peeking at Results
Checking results daily and stopping as soon as p < 0.05 inflates the false positive rate from the nominal 5% to 20-30%. Either commit to a fixed sample size up front or use sequential testing methods designed for repeated looks (e.g., alpha-spending functions).
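An A/A simulation makes the inflation concrete: both groups share the same true rate, so every "significant" result is a false positive. Peek frequency and traffic below are illustrative assumptions:

import numpy as np

def peeking_false_positive_rate(n_experiments=2_000, n_peeks=20,
                                users_per_peek=1_000, rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    stopped_early = 0
    for _ in range(n_experiments):
        # Cumulative conversions per group after each peek
        a = rng.binomial(users_per_peek, rate, n_peeks).cumsum()
        b = rng.binomial(users_per_peek, rate, n_peeks).cumsum()
        for peek in range(n_peeks):
            n = (peek + 1) * users_per_peek
            p_pool = (a[peek] + b[peek]) / (2 * n)
            se = (p_pool * (1 - p_pool) * 2 / n) ** 0.5
            if abs(b[peek] - a[peek]) / n > 1.96 * se:
                stopped_early += 1  # spurious "win" under repeated peeking
                break
    return stopped_early / n_experiments  # lands well above the nominal 5%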
2. Multiple Comparisons
Testing 10 metrics without correction gives roughly a 40% chance of at least one false positive (1 − 0.95¹⁰ ≈ 0.40). Apply Bonferroni correction or pre-designate a single primary metric.
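Applying Bonferroni is one line: divide α by the number of metrics. The p-values below are made up for illustration:

p_values = {'conversion': 0.03, 'revenue': 0.20, 'retention': 0.04}
threshold = 0.05 / len(p_values)  # Bonferroni-corrected: ~0.0167
significant = [m for m, p in p_values.items() if p < threshold]
# -> []  (0.03 and 0.04 pass alone, but neither survives correction)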
3. Simpson’s Paradox
A treatment can appear positive overall while being negative in every subgroup (or vice versa). Always check segment-level results.
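A concrete illustration with fabricated numbers, where badly imbalanced assignment across segments flips the aggregate result:

| Segment | Control | Treatment |
|---|---|---|
| Desktop | 50/100 (50.0%) | 400/900 (44.4%) |
| Mobile | 90/900 (10.0%) | 9/100 (9.0%) |
| Overall | 140/1,000 (14.0%) | 409/1,000 (40.9%) |

The treatment loses in both segments yet wins overall because 90% of its traffic came from high-converting desktop users.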
4. Survivorship Bias
If users drop off differently between groups, your end-of-experiment comparison is biased. Analyze using intent-to-treat: include all randomized users, not just those who completed the flow.
Decision Framework
| Scenario | Action |
|---|---|
| p < 0.05 and lift > MDE | Ship it — statistically and practically significant |
| p < 0.05 and lift < MDE | Reconsider — statistically significant but too small to matter |
| p > 0.05 and CI includes MDE | Inconclusive — need more data |
| p > 0.05 and CI excludes MDE | No effect — the treatment doesn’t work |
The most dangerous outcome is declaring “no effect” when the experiment was simply under-powered. Always check if the confidence interval is narrow enough to rule out meaningful effects.
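For completeness, the decision table translates into a small helper. It assumes lift, the CI bound, and the MDE are expressed on the same relative scale, which differs from the absolute CI that analyze_ab_test returns above:

def decide(p_value: float, lift: float, ci_upper: float,
           mde: float, alpha: float = 0.05) -> str:
    if p_value < alpha:
        if lift >= mde:
            return 'ship: statistically and practically significant'
        return 'reconsider: significant but too small to matter'
    if ci_upper >= mde:
        return 'inconclusive: the CI still contains the MDE; need more data'
    return 'no effect: the CI is narrow enough to rule out the MDE'

decide(p_value=0.20, lift=0.01, ci_upper=0.08, mde=0.05)  # -> inconclusive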