Chapter B5: A/B Testing Playbook

Story → Concept → Insight → Practice

Story: “Did the campaign really move the needle?”

A national streaming service launches a new paid‑social creative. They target a subset of ZIP codes and see conversions rise 6%. Leadership asks: Was that incrementality, or would it have happened anyway?

Two problems pop up immediately:

  1. The targeted ZIPs skew wealthier and more urban than average (selection bias).

  2. Conversions are noisy week‑to‑week (variance problem).

To answer cleanly, we’ll design an experiment rather than rely on naïve before/after comparisons. We’ll match comparable geographies, randomize exposure within each matched pair, and size the test so we can detect the lift with high probability.


Concept: What good A/B tests look like (beyond the buzzwords)

1) Core principles

  • Randomization breaks the link between treatment and unobservables.

  • Stable measurement unit: choose the unit where spillovers are minimal (user/account/device/household, store, ZIP, DMA, etc.).

  • Pre‑treatment balance: improve precision by making treated and control units similar before treatment (match or stratify on outcome‑predictive covariates).

  • Analysis mirrors design: if you pair/stratify in the design, respect that structure in the analysis (paired tests, fixed effects, cluster‑robust SEs).

2) Matching & stratification (design‑time variance reduction)

Goal: create groups that look alike on variables that predict the outcome (Y). Common choices: historical conversions/revenue, traffic, paid/organic mix, demographics, store size, etc.

Matched pairs (blocking of size 2)

  • Build pairs of near‑twins using pre‑treatment covariates (e.g., Mahalanobis distance; a minimal pairing sketch follows this list).

  • Randomize within each pair: 1 treated, 1 control.

  • Analyze with paired differences or a regression with pair fixed effects.
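
For concreteness, here is a minimal greedy‑pairing sketch on Mahalanobis distance. It assumes a covariate matrix X (one row per unit, even row count); greedy matching is quick but not globally optimal.

import numpy as np
from scipy.spatial.distance import cdist

def greedy_mahalanobis_pairs(X):
    """Greedily pair each unit with its nearest unused neighbour."""
    VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix
    D = cdist(X, X, metric='mahalanobis', VI=VI)  # all pairwise distances
    np.fill_diagonal(D, np.inf)                   # never pair a unit with itself
    unused, out = set(range(len(X))), []
    while len(unused) > 1:
        i = min(unused)
        j = min((k for k in unused if k != i), key=lambda k: D[i, k])
        out.append((i, j))
        unused -= {i, j}
    return out  # list of (index, index) pairs

For a globally optimal pairing, formulate it as a max‑weight matching on negative distances (e.g., networkx.max_weight_matching), as noted in the Practice section.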

Stratification (blocking of size ≥2)

  • Partition units into wider strata (e.g., low/med/high baseline traffic; or k‑means clusters).

  • Randomize a fixed proportion within each stratum.

  • Analyze with stratum fixed effects or post‑stratification weights (see the sketch just below).
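
A sketch of the weighted option, assuming hypothetical columns stratum, treatment, and outcome y in a pandas DataFrame:

import pandas as pd

def stratum_weighted_ate(df):
    """Post-stratified ATE: per-stratum treated-minus-control gaps, weighted by stratum share."""
    means = df.groupby(['stratum', 'treatment'])['y'].mean().unstack('treatment')
    gaps = means[1] - means[0]                       # per-stratum treatment effect
    weights = df.groupby('stratum').size() / len(df) # stratum share of the sample
    return (gaps * weights).sum()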

Practical matching options

  • Mahalanobis distance on standardized covariates (simple, effective).

  • Coarse k‑means clusters plus within‑cluster pairing when covariates are many (k‑means cannot guarantee clusters of exactly size 2).

  • Greedy sort‑and‑pair by a composite pre‑period score when you need something fast.

Tip: In observational DiD you prioritize covariates that predict treatment (to control selection). In randomized A/B tests you prioritize covariates that predict outcomes (to reduce variance).

3) Determining sample size (MDE ↔ power ↔ α)

Pick:

  • Metric (e.g., conversion rate, revenue per user),

  • Baseline level (p₀, μ₀, σ),

  • Minimum Detectable Effect (MDE) you care about,

  • α (Type I error, often 0.05 two‑sided) and power (1−β, often 0.8 or 0.9).

Binary outcome (conversion rate). For equal allocation A/B and two‑sided α:

\( n_{\text{per group}} = \dfrac{\left( z_{\alpha/2}\sqrt{2\,\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}{(p_2 - p_1)^{2}} \), with \( \bar{p} = (p_1+p_2)/2 \). For planning, use \( p_1 = p_0 \) and \( p_2 = p_0 + \mathrm{MDE} \).

Continuous outcome (e.g., revenue/user). \( n_{\text{per group}} = \dfrac{2\sigma^{2}(z_{\alpha/2} + z_{\beta})^{2}}{\Delta^{2}} \), where \( \sigma^{2} \) is the outcome variance and \( \Delta \) is the smallest mean difference worth detecting.

Clustered/geo tests. Power depends on the number of clusters and the intra‑cluster correlation (ICC). Blocking (pairs/strata) helps because analysis uses within‑block contrasts that cancel shared noise. For matched pairs with pair‑level outcomes, a back‑of‑envelope is \( m = \dfrac{(z_{\alpha/2} + z_{\beta})^{2}\,\sigma_d^{2}}{\Delta^{2}} \), where \( m \) is the number of treated pairs, \( \sigma_d^{2} \) is the variance of pre‑period within‑pair differences, and \( \Delta \) is the target post‑period mean difference.
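
This back‑of‑envelope can be wrapped as a helper in the same style as the sizing functions of Practice section C; a minimal sketch:

import math
from statistics import NormalDist

def m_matched_pairs(sigma_d, delta, power=0.80, alpha_two_sided=0.05):
    """Back-of-envelope number of treated pairs for a paired geo test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_beta) ** 2 * sigma_d ** 2 / delta ** 2)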

4) How long should you run the test?

Duration = sample size per arm ÷ daily eligible traffic per arm, rounded up to cover full weekly cycles. Add time for ramp‑up/learning and avoid known anomalies (holidays, product drops). For geo tests, ensure you span multiple business cycles (often ≥4 weeks) because cluster‑level noise is higher.

5) Analysis overview (respect the design)

  • User‑level tests: difference in means/proportions with robust SEs; optionally add CUPED/regression adjustment using pre‑period metrics (a minimal CUPED sketch follows this list).

  • Matched pairs: paired t‑test on pair‑level post‑period differences; or regression with pair FE and cluster‑robust (pair) SEs.

  • Stratified tests: include stratum FE; report a stratum‑weighted average treatment effect.

  • Always sanity‑check balance and randomization integrity; verify no peeking inflates Type I error (use sequential methods if you must monitor).
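
Since the first bullet mentions CUPED, here is a minimal sketch of the single‑covariate variant; the column names y_post, y_pre, and treatment are hypothetical:

import numpy as np

def cuped_adjust(y_post, y_pre):
    """CUPED: remove the part of the outcome predictable from the pre-period metric."""
    theta = np.cov(y_post, y_pre, ddof=1)[0, 1] / np.var(y_pre, ddof=1)
    return y_post - theta * (y_pre - y_pre.mean())

# usage sketch:
# adj = cuped_adjust(df['y_post'].to_numpy(), df['y_pre'].to_numpy())
# treated = (df['treatment'] == 1).to_numpy()
# lift = adj[treated].mean() - adj[~treated].mean()

The adjusted metric keeps the treatment effect but has lower variance whenever y_pre correlates with y_post, which tightens confidence intervals.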


Insight: Apply the concept to the Story

Setup: 500 ZIPs total. We’ll create 250 matched pairs using pre‑period revenue, ad spend, and site visits. We plan to treat 20% of all ZIPs.

Correct assignment logic

  1. Match all 500 into 250 pairs.

  2. Randomly pick 100 pairs → in each, flip a coin so 1 treated, 1 control.

  3. Leave the other 150 pairs entirely control. Result: 100 treated ZIPs and 400 controls, while preserving the matched structure.

Sizing & duration (binary example). Suppose baseline conversion p₀ = 3.0%, and we care about a 10% relative lift (MDE = 0.3 pp, i.e., 3.0% → 3.3%). With α = 0.05 (two‑sided) and 80% power:

  • z‑values: \( z_{\alpha/2} = 1.96 \) and \( z_{\beta} = 0.84 \).

  • Plugging in gives ≈ 53,000 visitors per group (≈ 106k total).

  • If you get ~8,000 eligible visitors/day and split 50/50 → ~4,000 per arm/day → ~14 days minimum (then pad to cover full weekly cycles; aim 2–3 weeks).

Analysis. Report pair‑level differences and a pooled ATE with cluster‑robust SEs. Add pre‑period adjustment (e.g., regress the post‑period outcome on treatment plus the pre‑period outcome and pair FE) for extra precision. Visualize treatment‑control gaps over time, by pair and overall.


Practice: Code & hands‑on tasks

A) Build matched pairs and assign treatment (Python)

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1) Load or simulate geo‑level pre‑period data
# df has one row per geo with pre‑period metrics used ONLY for design
# columns: ['geo','pre_revenue','pre_ad_spend','pre_site_visits']

# Example simulation scaffold (replace with your data)
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'geo': [f'ZIP_{i:03d}' for i in range(n)],
    'pre_revenue': np.random.normal(1000, 200, n),
    'pre_ad_spend': np.random.normal(500, 100, n),
    'pre_site_visits': np.random.normal(10000, 1500, n)
})

# 2) Standardize covariates
covars = ['pre_revenue','pre_ad_spend','pre_site_visits']
X = StandardScaler().fit_transform(df[covars])

# 3) Quick pairing: coarse k‑means neighbourhoods, then greedy sort‑and‑pair within them.
# sklearn's KMeans cannot force clusters of exactly size 2, so we cluster coarsely
# and pair adjacent geos inside each cluster (a few pairs may straddle boundaries).
# (For globally optimal matching use networkx.max_weight_matching on a graph of negative distances.)
km = KMeans(n_clusters=20, n_init=10, random_state=42)
df['cluster'] = km.fit_predict(X)

# Composite pre‑period score: mean percentile rank across covariates
df['_score'] = df[covars].rank(pct=True).mean(axis=1)

# Sort so adjacent rows are near‑twins, then pair adjacent rows (n is even here)
pairs = df.sort_values(['cluster', '_score']).reset_index(drop=True)
pairs['pair_id'] = [f'P{i // 2:03d}' for i in range(len(pairs))]
pairs = pairs.drop(columns=['_score'])

# 4) Randomize 20% of all geos: select 20% of pairs, then treat exactly one geo per selected pair
all_pairs = pairs['pair_id'].unique()
num_treated_geos = int(0.20 * len(pairs))   # 100 treated geos
num_pairs_to_select = num_treated_geos      # one treated geo per selected pair
sel_pairs = np.random.choice(all_pairs, size=num_pairs_to_select, replace=False)

pairs['treatment'] = 0
mask = pairs['pair_id'].isin(sel_pairs)
# flip a coin within each selected pair: exactly 1 treated, 1 control
pairs.loc[mask, 'treatment'] = (
    pairs[mask].groupby('pair_id')['geo']
               .transform(lambda s: np.random.permutation([1, 0])[:len(s)])
)

assignment = pairs[['geo', 'pair_id', 'treatment']]

B) Stratified randomization (user‑level test)

# Suppose df_users has columns ['user_id','pre_spend','pre_visits','device','segment']
# 1) Build strata using quantile bins of a pre‑period composite score
score = (df_users[['pre_spend','pre_visits']].rank(pct=True).mean(axis=1))
df_users = df_users.assign(stratum=pd.qcut(score, q=10, labels=False))

# 2) Randomize exactly 50/50 within each stratum
np.random.seed(123)

def assign_half(s):
    # exactly len//2 treated per stratum (odd strata get one extra control)
    arr = np.zeros(len(s), dtype=int)
    arr[: len(s) // 2] = 1
    return np.random.permutation(arr)

df_users['treatment'] = df_users.groupby('stratum')['user_id'].transform(assign_half)

C) Sample‑size helper functions (proportions & means)

import math
from statistics import NormalDist  # stdlib normal quantiles (or use scipy.stats.norm.ppf)

def _z(p):
    """Standard normal quantile at probability p."""
    return NormalDist().inv_cdf(p)

def n_two_sample_proportions(p0, mde_abs, power=0.80, alpha_two_sided=0.05):
    z_alpha = _z(1 - alpha_two_sided / 2)
    z_beta = _z(power)
    p1, p2 = p0, p0 + mde_abs
    pbar = 0.5 * (p1 + p2)
    num = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    den = (p2 - p1) ** 2
    return math.ceil(num / den)  # per group

def n_two_sample_means(sigma, mde_abs, power=0.80, alpha_two_sided=0.05):
    z_alpha = _z(1 - alpha_two_sided / 2)
    z_beta = _z(power)
    num = 2 * (sigma ** 2) * (z_alpha + z_beta) ** 2
    den = mde_abs ** 2
    return math.ceil(num / den)  # per group

# Example: baseline 3%, want +0.3pp absolute lift
print(n_two_sample_proportions(0.03, 0.003, power=0.80))  # ≈ 53,000 per arm

D) Duration calculator

def days_needed(n_per_group, daily_traffic, alloc=0.5):
    per_arm = daily_traffic*alloc
    return math.ceil(n_per_group / max(per_arm, 1))

# Example: 8k eligible/day, 50/50 split
n = 53000
print(days_needed(n, 8000, 0.5))  # -> 14 days; pad to full weekly cycles in practice

E) Analysis skeletons

Paired geo analysis (post‑period totals)

# df_post has ['pair_id','geo','treatment','y_post'] ; df_pre has ['pair_id','geo','y_pre']
# 1) Pair‑level differences
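# A minimal completion sketch (assumes the df_post / df_pre schemas above;
# scipy and statsmodels are extra dependencies not used elsewhere in this chapter).
from scipy import stats
import statsmodels.formula.api as smf

# keep only the pairs that contain a treated geo (the experimental pairs)
treated_pairs = df_post.loc[df_post['treatment'] == 1, 'pair_id'].unique()
exp = df_post[df_post['pair_id'].isin(treated_pairs)]

# one row per pair: treated minus control post-period outcome
wide = exp.pivot(index='pair_id', columns='treatment', values='y_post')
d_post = wide[1] - wide[0]

# paired t-test: is the mean within-pair difference zero?
t_stat, p_val = stats.ttest_1samp(d_post, 0.0)
print(f"mean pair diff = {d_post.mean():.2f}, t = {t_stat:.2f}, p = {p_val:.3f}")

# 2) Regression version: pair fixed effects + cluster-robust (pair) SEs
m = smf.ols('y_post ~ treatment + C(pair_id)', data=exp).fit(
    cov_type='cluster', cov_kwds={'groups': exp['pair_id']})
print(m.params['treatment'], m.bse['treatment'])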
