
A/B Testing Statistical Framework

Rigorous A/B testing for product decisions. Covers sample size calculation, statistical significance, Bayesian vs frequentist approaches, and common pitfalls that invalidate experiments.

A/B testing is the only reliable method for measuring the causal impact of product changes. Correlation studies, user surveys, and expert opinions all introduce bias. Controlled experiments eliminate it. But most A/B tests are poorly designed — they run too short, check results too often, or draw conclusions from noise. A rigorous statistical framework turns A/B testing from guesswork into science.


Experiment Design

Step 1: Define the Hypothesis

Every experiment starts with a specific, measurable hypothesis:

  • Bad: “The new checkout flow will be better”
  • Good: “The new checkout flow will increase purchase completion rate by 5% (from 3.2% to 3.36%)”

The hypothesis defines: the metric, the expected effect size, and the direction (one-tailed vs two-tailed).

Step 2: Calculate Sample Size

Running an experiment without calculating sample size is the most common mistake. Under-powered experiments produce inconclusive results. Over-powered experiments waste time and traffic.

from scipy import stats
import math

def required_sample_size(
    baseline_rate: float,    # Current conversion rate
    min_effect: float,       # Minimum detectable effect (relative)
    alpha: float = 0.05,     # Significance level
    power: float = 0.80      # Statistical power
) -> int:
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_effect)
    
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    pooled_p = (p1 + p2) / 2
    
    n = ((z_alpha * math.sqrt(2 * pooled_p * (1 - pooled_p)) +
          z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / \
         (p2 - p1) ** 2
    
    return math.ceil(n)  # Per group

Example: Baseline 3.2% conversion, detecting a 5% relative lift at 80% power (see the check below):

  • Required: ~195,000 users per group
  • At 10,000 daily users per group: run for ~20 days
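
As a quick sanity check of those numbers, calling the function above with the example inputs (and assuming 10,000 users per group per day):

print(required_sample_size(baseline_rate=0.032, min_effect=0.05))   # ≈ 194,500 per group
print(math.ceil(required_sample_size(0.032, 0.05) / 10_000))        # ≈ 20 days at 10,000 users/group/day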

Step 3: Run Duration

  • Minimum: enough time to reach the calculated sample size.
  • Practical minimum: at least one full business cycle (7 days) to capture day-of-week effects.
  • Maximum: 4 weeks. Longer experiments suffer from cookie churn and changing conditions.


Statistical Analysis

Frequentist Approach

The traditional hypothesis testing framework:

import math

from scipy.stats import norm

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    p_control = control_conversions / control_total
    p_treatment = treatment_conversions / treatment_total
    
    # Relative lift
    lift = (p_treatment - p_control) / p_control
    
    # Z-test for proportions
    p_pooled = (control_conversions + treatment_conversions) / \
               (control_total + treatment_total)
    se = math.sqrt(p_pooled * (1 - p_pooled) * 
                   (1/control_total + 1/treatment_total))
    z_stat = (p_treatment - p_control) / se
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))
    
    # Confidence interval
    se_diff = math.sqrt(p_control * (1-p_control) / control_total +
                       p_treatment * (1-p_treatment) / treatment_total)
    ci_lower = (p_treatment - p_control) - 1.96 * se_diff
    ci_upper = (p_treatment - p_control) + 1.96 * se_diff
    
    return {
        'lift': f'{lift:.2%}',
        'p_value': f'{p_value:.4f}',
        'significant': p_value < 0.05,
        'ci_95': f'[{ci_lower:.4f}, {ci_upper:.4f}]'
    }
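
For example, with hypothetical counts (5,000 of 100,000 converting in control versus 5,300 of 100,000 in treatment; illustrative numbers, not real data):

print(analyze_ab_test(control_conversions=5_000, control_total=100_000,
                      treatment_conversions=5_300, treatment_total=100_000))
# lift '6.00%', p_value ≈ 0.0024, significant, ci_95 ≈ [0.0011, 0.0049]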

Bayesian Approach

Bayesian analysis provides probability statements about the treatment effect, which are often more useful for decision-making:

Question | Frequentist Answer | Bayesian Answer
"Is B better than A?" | "We reject H₀ at α = 0.05" | "There's a 94% probability B is better"
"How much better?" | "CI: [0.2%, 1.8%]" | "Expected lift: 1.0%, with 95% credible interval [0.2%, 1.8%]"
"Should we ship?" | "p < 0.05, yes" | "94% chance of positive lift, expected value $120K/year"

Common Pitfalls

1. Peeking at Results

Checking results daily and stopping when p < 0.05 inflates false positive rates from 5% to 20-30%. Either commit to a fixed sample size or use sequential testing methods (e.g., spending functions).
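
The inflation is easy to see in an A/A simulation: there is no true effect, yet a daily "check and stop at p < 0.05" rule flags a difference far more than 5% of the time. A minimal sketch (the traffic and duration parameters are illustrative):

import numpy as np

def peeking_false_positive_rate(n_sims=2_000, daily_n=1_000, days=14,
                                p=0.05, alpha=0.05, seed=0):
    # Simulate A/A tests (identical conversion rates) with a significance check after each day.
    rng = np.random.default_rng(seed)
    z_crit = 1.96  # two-sided, alpha = 0.05
    false_positives = 0
    for _ in range(n_sims):
        a = rng.binomial(1, p, daily_n * days)
        b = rng.binomial(1, p, daily_n * days)
        for day in range(1, days + 1):
            n = daily_n * day
            pa, pb = a[:n].mean(), b[:n].mean()
            pooled = (pa + pb) / 2
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            if se > 0 and abs(pa - pb) / se > z_crit:
                false_positives += 1   # stopped early on noise
                break
    return false_positives / n_sims    # lands well above the nominal 5%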

2. Multiple Comparisons

Testing 10 metrics at α = 0.05 without correction gives roughly a 40% chance of at least one false positive (1 − 0.95¹⁰ ≈ 0.40), even when nothing truly changed. Apply a Bonferroni correction or pre-designate a single primary metric.
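
A minimal Bonferroni sketch (the example p-values are made up): each of the m metrics is tested at α / m, which keeps the family-wise error rate at or below α.

def bonferroni_significant(p_values, alpha=0.05):
    # Compare each p-value against alpha / m instead of alpha
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# With ten metrics the corrected threshold is 0.005, so a raw p-value of 0.02 no longer counts
print(bonferroni_significant([0.02, 0.30, 0.45, 0.08, 0.60, 0.51, 0.12, 0.33, 0.71, 0.90]))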

3. Simpson’s Paradox

A treatment can appear positive overall while being negative in every subgroup (or vice versa). Always check segment-level results.
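
A toy illustration with made-up counts: treatment converts worse in both segments, yet "wins" overall because it happened to receive far more of the high-converting segment.

# Hypothetical (conversions, users) per segment, constructed only to show the reversal
data = {
    'new_users':       {'control': (200, 10_000), 'treatment': (36, 2_000)},      # 2.0% vs 1.8%
    'returning_users': {'control': (400, 2_000),  'treatment': (1_900, 10_000)},  # 20.0% vs 19.0%
}

for segment, groups in data.items():
    print(segment, {g: conv / total for g, (conv, total) in groups.items()})  # treatment loses in both

overall = {g: sum(data[s][g][0] for s in data) / sum(data[s][g][1] for s in data)
           for g in ('control', 'treatment')}
print(overall)  # control 5.0%, treatment ~16.1%: the "win" comes entirely from the segment mix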

4. Survivorship Bias

If users drop off differently between groups, your end-of-experiment comparison is biased. Analyze using intent-to-treat: include all randomized users, not just those who completed the flow.
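
A minimal numeric illustration (hypothetical counts for one group): the per-protocol rate looks much better than the intent-to-treat rate simply because the denominator drops the users who never reached checkout.

randomized, reached_checkout, purchased = 10_000, 6_000, 300
itt_rate = purchased / randomized           # 3.0%; compare this number across groups
per_protocol_rate = purchased / reached_checkout   # 5.0%; biased if drop-off differs between groups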


Decision Framework

Scenario | Action
p < 0.05 and lift > MDE | Ship it: statistically and practically significant
p < 0.05 and lift < MDE | Reconsider: statistically significant but too small to matter
p > 0.05 and CI includes MDE | Inconclusive: need more data
p > 0.05 and CI excludes MDE | No effect: the treatment doesn't work

The most dangerous outcome is declaring “no effect” when the experiment was simply under-powered. Always check if the confidence interval is narrow enough to rule out meaningful effects.
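
A minimal sketch of this table as code (the function name and thresholds are illustrative; it assumes the lift, the CI bounds, and the MDE are all expressed on the same scale, e.g. relative lift):

def decide(p_value, lift, ci_upper, mde, alpha=0.05):
    # Map an experiment readout onto the decision table above
    if p_value < alpha:
        return 'ship' if lift >= mde else 'reconsider: real but too small to matter'
    if ci_upper >= mde:
        return 'inconclusive: the CI still includes the MDE, so collect more data'
    return 'no meaningful effect: the CI rules out lifts as large as the MDE'

The last branch is what guards against the dangerous "no effect" call: it only fires when the interval is narrow enough to exclude the effect you cared about.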

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
