Chapter B5: A/B Testing Playbook

Story → Concept → Insight → Practice

Story: “Did the campaign really move the needle?”

A national streaming service launches a new paid‑social creative. They target a subset of ZIP codes and see conversions rise 6%. Leadership asks: was that lift incremental, or would it have happened anyway?

Two problems pop up immediately:

  1. The targeted ZIPs skew wealthier and more urban than average (selection bias).

  2. Conversions are noisy week‑to‑week (variance problem).

To answer cleanly, we’ll design an experiment rather than rely on naïve before/after comparisons. We’ll match comparable geographies, randomize exposure within each matched pair, and size the test so we can detect the lift with high probability.


Concept: What good A/B tests look like (beyond the buzzwords)

1) Core principles

  • Randomization breaks the link between treatment and unobservables.

  • Stable measurement unit: choose the unit where spillovers are minimal (user/account/device/household, store, ZIP, DMA, etc.).

  • Pre‑treatment balance: improve precision by making treated and control units similar before treatment (match or stratify on outcome‑predictive covariates).

  • Analysis mirrors design: if you pair/stratify in design, respect that structure in analysis (paired tests, fixed effects, cluster‑robust SEs).

2) Matching & stratification (design‑time variance reduction)

Goal: create groups that look alike on variables that predict the outcome (Y). Common choices: historical conversions/revenue, traffic, paid/organic mix, demographics, store size, etc.

Matched pairs (blocking of size 2)

  • Build pairs of near‑twins using pre‑treatment covariates (e.g., Mahalanobis distance).

  • Randomize within each pair: 1 treated, 1 control.

  • Analyze with paired differences or a regression with pair fixed effects.

Stratification (blocking of size ≥2)

  • Partition units into wider strata (e.g., low/med/high baseline traffic; or k‑means clusters).

  • Randomize a fixed proportion within each stratum.

  • Analyze with stratum fixed effects or post‑stratification weights.

Practical matching options

  • Mahalanobis distance on standardized covariates (simple, effective).

  • Size‑constrained clustering (e.g., k‑means variants forced to clusters of size 2) for quick pairing when covariates are many.

  • Greedy sort‑and‑pair by a composite pre‑period score when you need something fast.

Tip: In observational DiD you prioritize covariates that predict treatment (to control selection). In randomized A/B tests you prioritize covariates that predict outcomes (to reduce variance).

3) Determining sample size (MDE ↔ power ↔ α)

Pick:

  • Metric (e.g., conversion rate, revenue per user),

  • Baseline level (p₀, μ₀, σ),

  • Minimum Detectable Effect (MDE) you care about,

  • α (Type I error, often 0.05 two‑sided) and power (1−β, often 0.8 or 0.9).

Binary outcome (conversion rate)

For equal allocation A/B with two‑sided α:

\( n_{\text{per group}} = \dfrac{(z_{1-\alpha/2} + z_{1-\beta})^2 \cdot 2\,\bar{p}\,(1-\bar{p})}{(p_2 - p_1)^2} \), with \( \bar{p} = (p_1 + p_2)/2 \). For planning, use \( p_1 = p_0 \) and \( p_2 = p_0 + \mathrm{MDE} \).

Continuous outcome (e.g., revenue/user)

\( n_{\text{per group}} = \dfrac{2\sigma^2 \,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2} \), where \( \sigma^2 \) is the outcome variance and \( \delta \) is the smallest mean difference worth detecting.

Clustered/geo tests

Power depends on the number of clusters and the intra‑cluster correlation (ICC). Blocking (pairs/strata) helps because analysis uses within‑block contrasts that cancel shared noise. For matched pairs with pair‑level outcomes, a back‑of‑envelope formula is \( m = \dfrac{(z_{1-\alpha/2} + z_{1-\beta})^2 \,\sigma_d^2}{\delta^2} \), where \( m \) is the number of treated pairs, \( \sigma_d^2 \) is the variance of pre‑period within‑pair differences, and \( \delta \) is the target post‑period mean difference.
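
As a quick illustration with assumed inputs (σ_d = 2 and δ = 1 in the outcome's units, chosen only for the arithmetic, not taken from the Story), α = 0.05 two‑sided and 80% power give:

\( m = \dfrac{(1.96 + 0.84)^2 \cdot 2^2}{1^2} = 7.84 \cdot 4 \approx 31.4 \), i.e. roughly 32 treated pairs.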

4) How long should you run the test?

Duration = sample size per arm ÷ daily eligible traffic per arm, rounded up to cover full weekly cycles. Add time for ramp‑up/learning and avoid known anomalies (holidays, product drops). For geo tests, ensure you span multiple business cycles (often ≥4 weeks) because cluster‑level noise is higher.

5) Analysis overview (respect the design)

  • User‑level tests: difference in means/proportions with robust SEs; optionally add CUPED/regression adjustment using pre‑period metrics.

  • Matched pairs: paired t‑test on pair‑level post‑period differences; or regression with pair FE and cluster‑robust (pair) SEs.

  • Stratified tests: include stratum FE; report a stratum‑weighted average treatment effect.

  • Always sanity‑check balance and randomization integrity; verify no peeking inflates Type I error (use sequential methods if you must monitor).


Insight: Apply the concept to the Story

Setup: 500 ZIPs total. We’ll create 250 matched pairs using pre‑period revenue, ad spend, and site visits. We plan to treat 20% of all ZIPs.

Correct assignment logic

  1. Match all 500 into 250 pairs.

  2. Randomly pick 100 pairs → in each, flip a coin so 1 treated, 1 control.

  3. Leave the other 150 pairs entirely control. Result: 100 treated ZIPs and 400 controls, while preserving the matched structure (a code sketch of these steps follows below).
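
A minimal pandas sketch of the three steps above (the `zip_code` identifiers and pair IDs are placeholders, not real geographies):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)                      # fixed seed for a reproducible assignment

# Assumed input: one row per ZIP, already matched into 250 pairs (pair_id = 0..249).
zips = pd.DataFrame({
    "zip_code": [f"{i:05d}" for i in range(500)],    # placeholder identifiers
    "pair_id": np.repeat(np.arange(250), 2),
})

# Step 2: randomly choose 100 of the 250 pairs to enter the experiment.
experiment_pairs = rng.choice(np.arange(250), size=100, replace=False)

# Step 3: within each chosen pair, flip a coin -> one treated, one control;
# the remaining 150 pairs stay entirely control.
zips["treated"] = False
for pid in experiment_pairs:
    members = zips.index[zips["pair_id"] == pid]
    zips.loc[rng.choice(members), "treated"] = True

print(zips["treated"].sum(), (~zips["treated"]).sum())   # 100 treated ZIPs, 400 controls
```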

Sizing & duration (binary example)

Suppose baseline conversion p₀ = 3.0% and we care about a 10% relative lift (MDE = 0.3 pp, i.e. p₁ = 0.030 vs. p₂ = 0.033). With α = 0.05 (two‑sided) and 80% power:

  • z‑values: 1.96 (α/2) and 0.84 (power).

  • Plugging in gives ≈ 53,000 visitors per group (≈ 106k total).

  • If you get ~8,000 eligible visitors/day and split 50/50 → ~4,000 per arm/day → ~14 days minimum (then pad to cover full weekly cycles; aim 2–3 weeks).

Analysis

Report pair‑level differences and a pooled ATE with cluster‑robust SEs. Add pre‑period adjustment (e.g., regress the post‑period outcome on treatment plus the pre‑period outcome and pair FE) for extra precision. Visualize treatment‑control gaps over time, by pair and overall.


Practice: Code & hands‑on tasks

A) Build matched pairs and assign treatment (Python)
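
A hedged sketch of one way to do this: greedy nearest‑neighbor pairing on Mahalanobis distance over standardized pre‑period covariates, then randomization within every pair. Column names (`revenue_pre`, `spend_pre`, `visits_pre`) are illustrative assumptions; the Story's variant, where only a subset of pairs enters the experiment, is sketched in the Insight section above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def build_matched_pairs(df: pd.DataFrame, covariates: list, seed: int = 0) -> pd.DataFrame:
    """Greedy Mahalanobis pairing on pre-period covariates, then 1:1 randomization within pairs."""
    rng = np.random.default_rng(seed)
    X = df[covariates].to_numpy(dtype=float)

    # Mahalanobis distance = Euclidean distance after whitening by the covariance matrix.
    VI = np.linalg.pinv(np.cov(X, rowvar=False))
    D = cdist(X, X, metric="mahalanobis", VI=VI)
    np.fill_diagonal(D, np.inf)

    # Greedy pairing: take each unmatched unit and pair it with its nearest unmatched neighbor.
    unmatched = set(range(len(df)))
    pair_id = np.full(len(df), -1)
    next_pair = 0
    while len(unmatched) >= 2:
        i = min(unmatched)
        candidates = [j for j in unmatched if j != i]
        j = candidates[int(np.argmin(D[i, candidates]))]
        pair_id[[i, j]] = next_pair
        unmatched -= {i, j}
        next_pair += 1          # an odd leftover unit keeps pair_id == -1 and stays control

    out = df.reset_index(drop=True).copy()
    out["pair_id"] = pair_id
    out["treated"] = False
    # Randomize within each pair: exactly one treated, one control.
    for pid, grp in out[out["pair_id"] >= 0].groupby("pair_id"):
        out.loc[rng.choice(grp.index), "treated"] = True
    return out

# Illustrative usage (column names are assumptions, not a fixed dataset):
# paired = build_matched_pairs(zip_df, ["revenue_pre", "spend_pre", "visits_pre"])
```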

B) Stratified randomization (user‑level test)
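
A minimal sketch of stratified (blocked) randomization for a user‑level test, assuming a DataFrame with a stratum column built from baseline traffic (the column names are illustrative):

```python
import numpy as np
import pandas as pd

def stratified_assign(df: pd.DataFrame, stratum_col: str, treat_share: float = 0.5,
                      seed: int = 0) -> pd.DataFrame:
    """Randomize a fixed proportion to treatment within every stratum."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["treated"] = False
    for _, idx in out.groupby(stratum_col).groups.items():
        shuffled = rng.permutation(np.asarray(idx))        # random order within the stratum
        n_treat = int(round(treat_share * len(shuffled)))  # fixed share per stratum
        out.loc[shuffled[:n_treat], "treated"] = True
    return out

# Illustrative usage: strata from baseline-traffic terciles (column names are assumptions).
# users["stratum"] = pd.qcut(users["visits_pre"], q=3, labels=["low", "med", "high"])
# assigned = stratified_assign(users, "stratum", treat_share=0.5)
```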

C) Sample‑size helper functions (proportions & means)
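
Hedged helpers implementing the two planning formulas from the Concept section (normal approximation, equal allocation, two‑sided α); scipy is assumed available for the z‑quantiles:

```python
from math import ceil
from scipy.stats import norm

def n_per_group_proportions(p0: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-sided test of two proportions (equal allocation)."""
    p1, p2 = p0, p0 + mde
    p_bar = (p1 + p2) / 2
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    n = (z_a + z_b) ** 2 * 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2
    return ceil(n)

def n_per_group_means(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-sided test of two means (equal allocation)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    n = 2 * sigma ** 2 * (z_a + z_b) ** 2 / mde ** 2
    return ceil(n)

# The Story's numbers: p0 = 3.0%, MDE = 0.3 pp.
print(n_per_group_proportions(0.03, 0.003))   # ≈ 53,000 per arm, matching the Insight section
```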

D) Duration calculator
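
A small duration helper following the rule of thumb in the Concept section (reach the per‑arm sample size, then round up to full weekly cycles); the parameter names and the two‑week floor are illustrative assumptions:

```python
from math import ceil

def test_duration_days(n_per_arm: int, daily_eligible: int, arms: int = 2,
                       min_weeks: int = 2) -> int:
    """Days needed to reach n_per_arm in every arm, rounded up to whole weekly cycles."""
    per_arm_daily = daily_eligible / arms            # assumes an even split across arms
    raw_days = ceil(n_per_arm / per_arm_daily)       # bare minimum to hit the sample size
    weeks = max(ceil(raw_days / 7), min_weeks)       # cover full weekly cycles, enforce a floor
    return weeks * 7

# The Story's numbers: 53,000 per arm, ~8,000 eligible visitors/day split 50/50.
print(test_duration_days(53_000, 8_000))  # 14 days raw -> 14 days (2 full weeks)
```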

E) Analysis skeletons
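
A hedged skeleton for the user‑level case: an unadjusted difference in means with robust SEs, plus regression adjustment on a pre‑period metric (in the spirit of CUPED rather than the exact transform). Column names `pre_metric`, `post_metric`, and `treated` (0/1) are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf

def user_level_ate(df: pd.DataFrame) -> None:
    """Unadjusted and pre-period-adjusted treatment effects with heteroskedasticity-robust SEs."""
    # Unadjusted difference in means (HC1 robust standard errors).
    unadj = smf.ols("post_metric ~ treated", data=df).fit(cov_type="HC1")
    # Regression adjustment: the pre-period metric soaks up predictable variance,
    # tightening the confidence interval on the `treated` coefficient.
    adj = smf.ols("post_metric ~ treated + pre_metric", data=df).fit(cov_type="HC1")
    print("unadjusted ATE:", unadj.params["treated"], "SE:", unadj.bse["treated"])
    print("adjusted ATE:  ", adj.params["treated"], "SE:", adj.bse["treated"])
```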

Paired geo analysis (post‑period totals)
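
A sketch of the paired geo analysis: collapse to pair‑level post‑period differences for a paired t‑test, then the equivalent regression with pair fixed effects and pair‑clustered SEs. Column names (`pair_id`, `treated` as 0/1, `conversions`) are assumptions:

```python
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

def paired_geo_analysis(df: pd.DataFrame) -> None:
    """df: one row per ZIP with 'pair_id', 'treated' (0/1), and post-period 'conversions'."""
    # Pair-level differences: treated ZIP minus its matched control.
    wide = (df.pivot_table(index="pair_id", columns="treated", values="conversions")
              .dropna())                 # keeps only pairs containing both a treated and a control unit
    diffs = wide[1] - wide[0]
    t, p = stats.ttest_1samp(diffs, popmean=0.0)
    print(f"mean pair difference = {diffs.mean():.2f}, t = {t:.2f}, p = {p:.3f}")

    # Equivalent regression: pair fixed effects, SEs clustered on pair.
    randomized = df[df["pair_id"].isin(wide.index)]
    fe = smf.ols("conversions ~ treated + C(pair_id)", data=randomized).fit(
        cov_type="cluster", cov_kwds={"groups": randomized["pair_id"]})
    print("ATE:", fe.params["treated"], "SE:", fe.bse["treated"])
```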
