Causal Inference for Product
Move beyond correlation to understand causation in product decisions. Covers A/B test limitations, difference-in-differences, instrumental variables, regression discontinuity, propensity score matching, and when to use each causal inference technique.
A/B testing is the gold standard for causal inference, but it is not always feasible. You cannot randomly assign users to a data breach to study its impact on churn. You cannot randomly withhold features from paying customers. Causal inference techniques let you identify cause-and-effect relationships from observational data when experiments are impossible or impractical.
When A/B Tests Are Not Enough
Cannot A/B test:
- Pricing changes (legal and brand risk)
- Outages and incidents (ethical issues)
- Competitor actions (not in your control)
- Long-term behavior changes (experiment duration limits)
- Network effects (treatment group affects control group)
Observational data alternatives:
- Difference-in-differences
- Instrumental variables
- Regression discontinuity
- Propensity score matching
- Synthetic control
Difference-in-Differences (DiD)
Compares the change over time between a treatment and control group:
```python
import statsmodels.formula.api as smf

# Setup: feature launched in Region A (treatment), not Region B (control).
# Measure: revenue before and after launch in both regions.
# Region fixed effects absorb the treatment main effect and month fixed
# effects absorb the post_launch main effect, so only the interaction enters.
model = smf.ols(
    'revenue ~ treatment:post_launch + C(region) + C(month)',
    data=df,
).fit()

# The treatment:post_launch coefficient is the DiD estimate of the
# feature's causal effect on revenue.
print(f"Feature effect: ${model.params['treatment:post_launch']:.2f}")
print(f"p-value: {model.pvalues['treatment:post_launch']:.4f}")
```
Key assumption: Parallel trends
Without treatment, both groups would have followed the same trend over time.
Validate: check that pre-treatment trends are parallel (see the sketch below). If they are not, the DiD estimate is biased.
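A minimal pre-trends check, assuming the same hypothetical df with a numeric month column and binary treatment and post_launch columns: if trends are parallel, the treatment-by-month slope in the pre-period should be indistinguishable from zero.

```python
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Restrict to the pre-launch period (hypothetical column names).
pre = df[df['post_launch'] == 0]

# Parallel trends implies equal slopes: the treatment:month interaction
# should be near zero and statistically insignificant.
pretrend = smf.ols('revenue ~ treatment * month', data=pre).fit()
print(pretrend.summary().tables[1])

# Eyeball it too: plot mean revenue per month for each group.
(pre.groupby(['month', 'treatment'])['revenue'].mean()
    .unstack('treatment')
    .plot(marker='o', title='Pre-launch revenue by group'))
plt.show()
```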
Propensity Score Matching
Match treated users to similar untreated users based on observable characteristics:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Step 1: Estimate propensity scores, P(treatment | features), for each user.
# 'features' is a list of observable covariate columns in the DataFrame X.
propensity_model = LogisticRegression()
propensity_model.fit(X[features], X['treated'])
X['propensity_score'] = propensity_model.predict_proba(X[features])[:, 1]

# Step 2: Match each treated user to the untreated user with the closest
# propensity score (1-nearest-neighbor matching, with replacement).
treated = X[X['treated'] == 1]
untreated = X[X['treated'] == 0]
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(untreated[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_untreated = untreated.iloc[indices.flatten()]

# Step 3: Compare outcomes between treated users and their matches.
att = treated['outcome'].mean() - matched_untreated['outcome'].mean()
print(f"Average Treatment Effect on Treated: {att:.3f}")
```
Regression Discontinuity
Exploit a threshold that determines treatment:
```python
import statsmodels.formula.api as smf

# Users who score above 80 get premium features.
# Compare users just above 80 (treated) vs. just below (untreated).
threshold = 80
bandwidth = 5  # keep only users scoring 75-85

near_threshold = df[
    (df['score'] >= threshold - bandwidth) &
    (df['score'] <= threshold + bandwidth)
].copy()
near_threshold['treated'] = (near_threshold['score'] >= threshold).astype(int)
near_threshold['score_centered'] = near_threshold['score'] - threshold

# Fit separate slopes on each side of the cutoff; the 'treated'
# coefficient is the jump in retention at the threshold.
model = smf.ols(
    'retention ~ treated + score_centered + treated:score_centered',
    data=near_threshold,
).fit()
print(f"Effect of premium features on retention: {model.params['treated']:.3f}")
```
Choosing the Right Method
| Method | When to Use | Key Assumption |
|---|---|---|
| A/B Test | Random assignment possible | Random assignment |
| DiD | Treatment at specific time, control group exists | Parallel trends |
| Propensity Matching | Many observables, no time dimension | No unobserved confounders |
| Regression Discontinuity | Treatment based on a threshold | No manipulation around threshold |
| Instrumental Variables | Unobserved confounders exist | Valid instrument available |
| Synthetic Control | One treated unit, many controls | Weighted combination matches pre-treatment |
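Of the methods in the table, instrumental variables is the only one without a worked example above. A minimal two-stage least squares sketch, under an invented scenario where a randomly sent email prompt (got_prompt) instruments for confounded feature adoption (adopted_feature):

```python
import statsmodels.formula.api as smf

# Stage 1: predict the endogenous treatment from the instrument.
stage1 = smf.ols('adopted_feature ~ got_prompt', data=df).fit()
df['adoption_hat'] = stage1.fittedvalues

# Stage 2: regress the outcome on the predicted treatment.
# Caveat: standard errors from a manual second stage are too small;
# use a dedicated IV routine (e.g., IV2SLS in the linearmodels package)
# for inference in practice.
stage2 = smf.ols('retention ~ adoption_hat', data=df).fit()
print(f"IV estimate of adoption effect: {stage2.params['adoption_hat']:.3f}")
```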
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Confusing correlation with causation | Wrong product decisions | Use causal inference techniques |
| Not checking parallel trends | Biased DiD estimates | Plot pre-treatment trends |
| Matching on too many variables | Overfitting, poor matches | Use propensity score, not exact matching |
| Small sample near discontinuity | Low statistical power | Increase bandwidth (with trade-offs) |
| Reporting results without confidence intervals | Overconfidence in estimates | Always report uncertainty |
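On the last row: the matching ATT above is a single point estimate. A minimal bootstrap sketch for its 95% confidence interval, reusing the hypothetical X, propensity_score, and outcome columns from the matching example (re-matching inside each resample so the interval reflects matching variability):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
atts = []
for _ in range(1000):
    # Resample users with replacement, then redo the matching step.
    sample = X.sample(frac=1, replace=True, random_state=rng)
    t = sample[sample['treated'] == 1]
    u = sample[sample['treated'] == 0]
    nn = NearestNeighbors(n_neighbors=1).fit(u[['propensity_score']])
    _, idx = nn.kneighbors(t[['propensity_score']])
    atts.append(t['outcome'].mean() - u.iloc[idx.flatten()]['outcome'].mean())

lo, hi = np.percentile(atts, [2.5, 97.5])
print(f"ATT 95% CI: [{lo:.3f}, {hi:.3f}]")
```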
Causal inference is how data teams move from “Feature X users have higher retention” (correlation) to “Feature X causes 5% higher retention” (causation). The distinction determines whether the next $1M investment is well-spent or wasted.