Foundations · 12 Stages

Statistics & Experimentation.

From probability axioms to causal inference — every concept built for DS and AI engineers, with explicit math, real ML connections, and interview Q&A throughout.

scipy.statsstatsmodelsNumPypandaspingouinmatplotlib

Probability Foundations 01–02

Probability, Random Variables & Core Distributions

Every ML model is a probability machine — it outputs distributions, optimises expected losses, and makes decisions under uncertainty. Before touching a training loop, you need to understand what probability actually means and how random variables formalise the randomness in your data.

Probability Axioms, Sample Spaces & Events

Kolmogorov axioms, P(A∪B) = P(A)+P(B)−P(A∩B), independence, complement rule

Probability is a function P: events → [0,1] satisfying Kolmogorov's three axioms: 1. Non-negativity: P(A) ≥ 0 for any event A 2. Normalisation: P(Ω) = 1 (something must happen) 3. Additivity: P(A∪B) = P(A) + P(B) if A∩B = ∅ From these three axioms, all of probability theory follows. Key derived rules: Complement: P(Aᶜ) = 1 − P(A) Addition: P(A∪B) = P(A) + P(B) − P(A∩B) Independence: P(A∩B) = P(A)·P(B) iff A and B are independent Independence is crucial in ML. Naive Bayes assumes features are conditionally independent given the class: P(x₁,x₂,...,xₙ|y) = ∏ᵢ P(xᵢ|y). This assumption is almost never true in practice but the classifier often works well anyway.

  Sample Space Ω
  ┌─────────────────────────────┐
  │   ┌───────┐   ┌───────┐    │
  │   │   A   │░░░│   B   │    │
  │   │       │░A∩B       │    │
  │   └───────┘   └───────┘    │
  │                             │
  └─────────────────────────────┘
  P(A∪B) = P(A) + P(B) − P(A∩B)
  If A∩B = ∅ (mutually exclusive): P(A∪B) = P(A) + P(B)

Python — probability axioms, set operations, independence check with scipy/numpy

import numpy as np
from scipy import stats

np.random.seed(42)

# ── Simulating probability axioms ─────────────────────────
# Roll a fair die 100,000 times
n = 100_000
rolls = np.random.randint(1, 7, n)

A = rolls <= 3          # event A: {1, 2, 3}
B = rolls % 2 == 0      # event B: {2, 4, 6}
A_and_B = A & B         # A ∩ B = {2}
A_or_B  = A | B         # A ∪ B = {1, 2, 3, 4, 6}

print(f"P(A)      = {A.mean():.4f}   (expected 0.5)")
print(f"P(B)      = {B.mean():.4f}   (expected 0.5)")
print(f"P(A∩B)    = {A_and_B.mean():.4f}  (expected 0.1667)")
print(f"P(A∪B)    = {A_or_B.mean():.4f}  (expected 0.8333)")
print(f"Addition check: P(A)+P(B)-P(A∩B) = "
      f"{A.mean()+B.mean()-A_and_B.mean():.4f}  (should match P(A∪B))")

# ── Independence check ─────────────────────────────────────
# A and B are NOT independent here
print(f"\nIndependence: P(A)·P(B) = {A.mean()*B.mean():.4f}")
print(f"              P(A∩B)    = {A_and_B.mean():.4f}")
print(f"  → Independent? {np.isclose(A.mean()*B.mean(), A_and_B.mean(), atol=0.01)}")
# A={1,2,3} and B={2,4,6} share event 2 → NOT independent

# ── Complement rule ────────────────────────────────────────
A_comp = ~A
print(f"\nP(Aᶜ) = {A_comp.mean():.4f}  |  1-P(A) = {1-A.mean():.4f}")

# ── Conditional probability P(A|B) = P(A∩B)/P(B) ─────────
p_A_given_B = A_and_B.sum() / B.sum()
print(f"\nP(A|B) = {p_A_given_B:.4f}  (P(even number is also ≤3))")
# P(A|B) = P({2})/P({2,4,6}) = (1/6)/(3/6) = 1/3 ≈ 0.333

ML Link Naive Bayes classifiers; class prior P(y) in Bayesian inference; probability calibration in classification models; feature independence assumptions in generative models.

Trap Treating model output probabilities as true probabilities without calibration check

A Random Forest outputs 0.85 for a customer churn prediction. Many engineers treat this as "85% chance of churn" and build intervention logic around it. But tree ensemble probabilities are poorly calibrated — the true rate may be 60% or 95%. Decisions built on uncalibrated probabilities are unreliable.

Fix Always check calibration using a calibration curve (reliability diagram): plot mean predicted probability vs actual fraction positive in each bucket. Apply isotonic regression or Platt scaling (CalibratedClassifierCV in sklearn) to correct miscalibration before using model scores in downstream decisions.

Trap Assuming feature independence when it is clearly violated

Naive Bayes assumes P(x₁,...,xₙ|y) = ∏P(xᵢ|y). For text features this approximately holds. For user behaviour features (time_on_page and scroll_depth are correlated), it is violated. When features are highly correlated, Naive Bayes double-counts evidence, making it overconfident.

Fix Use Naive Bayes when features are roughly independent (text bag-of-words, sensor readings). For correlated features, use logistic regression (which models correlations through the weight vector) or tree models. Check pairwise correlations before choosing Naive Bayes.

The three axioms: (1) P(A) ≥ 0 — probabilities are non-negative. (2) P(Ω) = 1 — the total probability over all outcomes is 1. (3) P(A∪B) = P(A)+P(B) for mutually exclusive events. These matter for ML because every probabilistic model must satisfy them to produce valid outputs. A classifier whose softmax probabilities do not sum to 1 violates axiom 2. A model that produces negative confidence scores violates axiom 1. When you write a custom loss function or probabilistic layer, you are implicitly working within these constraints.

Mutually exclusive: A and B cannot both occur — P(A∩B) = 0. If A happens, B is impossible. Example: a single coin flip cannot be both heads and tails. Independent: knowing A occurred gives no information about B — P(A∩B) = P(A)·P(B). Example: two separate coin flips. Crucially, mutually exclusive events with P(A)>0 and P(B)>0 are NEVER independent — if A occurs, B definitely did not (P(B|A)=0 ≠ P(B)). In ML, feature independence (as assumed by Naive Bayes) means conditional independence: P(xᵢ,xⱼ|y) = P(xᵢ|y)·P(xⱼ|y), not marginal independence.

Random Variables, PMF, PDF & CDF

Discrete vs continuous RVs, E[X] = Σxᵢ·P(xᵢ), Var(X) = E[X²]−(E[X])², CDF properties

A random variable X is a function from the sample space Ω to real numbers. It maps outcomes to numeric values so we can do arithmetic on randomness. Discrete RV: takes countable values. Described by PMF: P(X=xᵢ) = pᵢ, where Σpᵢ = 1. Continuous RV: takes uncountable values. Described by PDF f(x) where P(a≤X≤b) = ∫ₐᵇ f(x)dx. CDF: F(x) = P(X≤x) — works for both discrete and continuous RVs. Key quantities: Expected value: E[X] = Σ xᵢ·P(X=xᵢ) (discrete) E[X] = ∫ x·f(x) dx (continuous) Variance: Var(X) = E[(X−μ)²] = E[X²] − (E[X])² Std deviation: σ = √Var(X) Linearity: E[aX+b] = a·E[X]+b, Var(aX+b) = a²·Var(X) The loss function in ML is always an expectation: L(θ) = E[(y−ŷ)²] or E[−y·log(ŷ)]. Minimising the loss = finding θ that minimises expected loss over the data distribution.

Python — PMF, PDF, CDF, E[X], Var(X) with scipy.stats and numpy

import numpy as np
from scipy import stats
import matplotlib
matplotlib.use('Agg')

# ── Discrete RV: fair die ─────────────────────────────────
outcomes = np.array([1, 2, 3, 4, 5, 6])
probs    = np.array([1/6] * 6)            # PMF

E_X   = np.sum(outcomes * probs)          # E[X] = 3.5
E_X2  = np.sum(outcomes**2 * probs)       # E[X²] = 15.1667
Var_X = E_X2 - E_X**2                     # Var(X) = 2.9167
std_X = np.sqrt(Var_X)

print(f"Die — E[X]={E_X:.4f}, Var(X)={Var_X:.4f}, σ={std_X:.4f}")

# CDF: P(X ≤ k) = cumulative sum of PMF
cdf = np.cumsum(probs)
print(f"P(X ≤ 3) = {cdf[2]:.4f}")    # 0.5

# ── Continuous RV: Normal distribution ───────────────────
mu, sigma = 0, 1
norm_rv = stats.norm(loc=mu, scale=sigma)

# PDF at x=0: f(0) = 1/√(2π) ≈ 0.3989
print(f"\nNormal PDF at x=0: {norm_rv.pdf(0):.4f}")
# P(-1 ≤ X ≤ 1) = CDF(1) - CDF(-1) ≈ 0.6827 (68% rule)
print(f"P(-1 ≤ X ≤ 1):     {norm_rv.cdf(1) - norm_rv.cdf(-1):.4f}")
# Inverse CDF (percentile): 95th percentile
print(f"95th percentile:   {norm_rv.ppf(0.95):.4f}")

# ── Linearity of expectation ──────────────────────────────
# If X ~ N(0,1), then Y = 2X + 3 ~ N(3, 4)
# E[Y] = 2·E[X]+3 = 2·0+3 = 3
# Var[Y] = 2²·Var[X] = 4·1 = 4
X_samples = np.random.normal(0, 1, 100_000)
Y_samples  = 2*X_samples + 3
print(f"\nE[2X+3] = {Y_samples.mean():.4f}  (expected 3)")
print(f"Var[2X+3]= {Y_samples.var():.4f}  (expected 4)")

# ── E[X] as foundation of ML loss ─────────────────────────
# MSE = E[(y - ŷ)²] estimated empirically as:
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.3, 2.8, 4.2, 4.9])
mse = np.mean((y_true - y_pred)**2)   # empirical expectation
print(f"\nMSE = E[(y-ŷ)²] ≈ {mse:.4f}")

ML Link Every ML loss function is an expectation: MSE = E[(y−ŷ)²], cross-entropy = E[−y·log(ŷ)]. Training = minimising empirical expected loss. Variance of predictions = model uncertainty. E[X] linearity is why batch gradient = average over mini-batch.

Trap Confusing E[f(X)] with f(E[X]) — Jensen's inequality violation

log(E[X]) ≠ E[log(X)] in general. In ML this appears as: the average of log-probabilities is not the log of the average probability. Confusion about this causes errors when computing ELBO in VAEs or when aggregating log-likelihoods across batches.

Fix Jensen's inequality: for convex f, f(E[X]) ≤ E[f(X)]. For log (concave): E[log X] ≤ log(E[X]). Always compute expectations in the original space or use log-sum-exp trick for numerical stability when working in log-space.

Trap Using variance when standard deviation is the interpretable quantity

Variance is in squared units — the variance of a salary in dollars is in dollars². Reporting Var(salary)=2,500,000 is technically correct but uninterpretable. In ML, reporting prediction variance in the squared output unit confuses stakeholders and makes threshold-setting harder.

Fix Report standard deviation (σ = √Var) for human-interpretable spread. Use variance in mathematical derivations (it adds linearly for independent variables: Var(X+Y) = Var(X)+Var(Y)). In uncertainty quantification, report ±1σ prediction intervals in original units.

PMF (Probability Mass Function) gives P(X=x) directly for discrete RVs — values are probabilities in [0,1] summing to 1. PDF (Probability Density Function) for continuous RVs gives probability per unit length — it is a density, not a probability. P(a≤X≤b) = ∫ₐᵇ f(x)dx. Since f(x) is a density, it can exceed 1 as long as the integral over all x equals 1. For example, a Uniform(0, 0.5) has f(x)=2 everywhere in [0,0.5], but ∫₀^0.5 2dx = 1. In code: scipy.stats.norm.pdf(0) = 0.3989, but scipy.stats.uniform(scale=0.5).pdf(0.25) = 2.0.

The expected value minimises mean squared error: E[X] = argmin_c E[(X−c)²]. This directly motivates MSE regression: predicting the conditional mean E[y|x] is optimal under squared loss. The expected value also has linearity — E[aX+bY] = aE[X]+bE[Y] — which makes the gradient of the expected loss equal to the expected gradient, justifying mini-batch training. However, the mean is not always the right summary: for asymmetric losses (e.g., understock vs overstock), the optimal prediction is a different quantile, not the mean. This leads to quantile regression and pinball loss.

Discrete Distributions: Bernoulli, Binomial & Poisson

When to use each, parameters, E[X] and Var(X), connection to ML losses and count modelling

Three discrete distributions appear everywhere in DS and AI: Bernoulli(p): single binary trial. X ∈ {0,1}. P(X=1) = p, P(X=0) = 1−p E[X] = p, Var(X) = p(1−p) → Binary cross-entropy loss = negative log-likelihood under Bernoulli Binomial(n, p): count of successes in n independent Bernoulli trials. X ∈ {0,...,n}. P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ E[X] = np, Var(X) = np(1−p) → Number of conversions in n sessions; click counts per user Poisson(λ): count of rare events in a fixed interval. X ∈ {0,1,2,...}. P(X=k) = e^(−λ)·λᵏ/k! E[X] = λ, Var(X) = λ (mean = variance — a key Poisson property) → Arrivals per second, defects per batch, word counts in documents When does Binomial → Poisson? When n is large and p is small (n≥20, p≤0.05): Binomial(n,p) ≈ Poisson(np).

Python — Bernoulli, Binomial, Poisson with scipy.stats, MLE, goodness-of-fit

import numpy as np
from scipy import stats

np.random.seed(42)

# ── Bernoulli(p=0.15) — e.g. email open rate ─────────────
p_open = 0.15
bern   = stats.bernoulli(p_open)
print(f"Bernoulli(0.15) — E[X]={bern.mean():.2f}, Var={bern.var():.4f}")
# Simulate 10,000 email sends
opens = bern.rvs(10_000)
print(f"Simulated open rate: {opens.mean():.4f}  (true p={p_open})")

# Negative log-likelihood = binary cross-entropy
# L = -[y·log(p) + (1-y)·log(1-p)]
p_hat = opens.mean()   # MLE: just the sample mean
nll = -np.mean(opens*np.log(p_hat) + (1-opens)*np.log(1-p_hat))
print(f"Binary cross-entropy (NLL): {nll:.4f}")

# ── Binomial(n=100, p=0.15) — clicks in sessions ─────────
binom_rv = stats.binom(n=100, p=0.15)
print(f"\nBinom(100,0.15) — E[X]={binom_rv.mean():.1f}, σ={binom_rv.std():.2f}")
print(f"P(X ≤ 10): {binom_rv.cdf(10):.4f}")
print(f"P(X ≥ 20): {1 - binom_rv.cdf(19):.4f}")

# ── Poisson(λ=3) — support tickets per hour ──────────────
lam = 3.0
pois = stats.poisson(lam)
print(f"\nPoisson(3) — E[X]={pois.mean():.1f}, Var[X]={pois.var():.1f}")
# Key property: E[X] = Var[X] = λ
tickets = pois.rvs(1000)
print(f"Simulated: mean={tickets.mean():.2f}, var={tickets.var():.2f}")

# MLE for Poisson: λ_hat = sample mean
lam_hat = tickets.mean()
print(f"MLE λ̂ = {lam_hat:.4f}  (true λ={lam})")

# ── Binomial → Poisson approximation ─────────────────────
n, p = 10_000, 0.0003       # rare event: n large, p small
binom_approx = stats.binom(n, p)
poisson_approx = stats.poisson(n*p)   # λ = np = 3
k = 2
print(f"\nBinom({n},{p}) P(X={k}): {binom_approx.pmf(k):.6f}")
print(f"Poisson({n*p}) P(X={k}): {poisson_approx.pmf(k):.6f}")
# Very close — Poisson is a good approximation

ML Link Bernoulli → binary cross-entropy loss (logistic regression NLL). Binomial → count regression. Poisson → Poisson regression for event counts (clicks, purchases, defects). All three are members of the exponential family — the basis of Generalized Linear Models.

Trap Using linear regression on count data (Poisson-distributed targets)

Linear regression assumes Gaussian residuals with constant variance. Count data (number of purchases, clicks, defects) are non-negative, integer-valued, and variance typically scales with the mean (Var ≈ mean for Poisson). Linear regression will predict negative counts and underfit because its constant-variance assumption is violated.

Fix Use Poisson regression (statsmodels GLM with Poisson family + log link) or negative binomial regression (for overdispersed counts where Var > mean). sklearn does not support GLMs natively — use statsmodels.api.GLM or XGBoost with objective="count:poisson".

Trap Treating Binomial events as Poisson when trials are not rare

Using a Poisson model for a click-through rate of 30% with n=100 visits is wrong — the Poisson approximation requires p to be small (≤0.05). With high p, the Binomial distribution is bounded at n and has lower variance than Poisson, so the Poisson model over-predicts extreme counts.

Fix Binomial(n, p) approximates Poisson(np) only when n ≥ 20 and p ≤ 0.05. For higher rates, model directly as Binomial. In practice, for a CTR of 30% on 100 impressions, use Beta-Binomial regression or a logistic regression framed as binary classification on individual events.

A logistic regression models P(y=1|x) = σ(wᵀx) = p. If we assume y ~ Bernoulli(p), the likelihood of observing a single example is p^y·(1−p)^(1−y). The log-likelihood is y·log(p) + (1−y)·log(1−p). Summing over n examples and negating gives the binary cross-entropy loss: L = −Σ[yᵢlog(pᵢ) + (1−yᵢ)log(1−pᵢ)]. Minimising cross-entropy = maximising the Bernoulli log-likelihood = MLE. This is why logistic regression is called a discriminative generative model.

Poisson arises from a memoryless arrival process where events occur at constant rate λ independently. This independence constraint forces Var(X) = λ = E[X]. In practice, count data is often overdispersed (Var >> mean) because events cluster — customers who buy once are more likely to buy again, so purchase counts have higher variance than Poisson predicts. When Var > mean, use Negative Binomial (adds a dispersion parameter). When Var < mean (rare), use Binomial. Always check: if var/mean >> 1 in your count target, Poisson regression will give poor fits.

When a logistic regression outputs 0.73, it is claiming P(y=1|x) = 0.73. That number is meaningless unless you understand probability axioms, conditional probability, and calibration. Statistics is the language ML is written in.

Key Distributions for DS & AI + Bayes' Theorem

Six distributions explain most of what you encounter in ML systems: Normal for continuous features, Beta for rates and priors, Exponential for wait times, and the trio of Uniform/Log-Normal/Gamma for the rest. Bayes' theorem ties them all together into a framework for updating beliefs with data.

The Normal Distribution — Properties & the 68-95-99.7 Rule

PDF formula, Z-score standardisation, why normality is assumed, when it breaks

The Normal (Gaussian) distribution N(μ, σ²) is defined by its PDF: f(x) = (1 / σ√(2π)) · exp(−(x−μ)² / 2σ²) Parameters: μ = mean (location), σ² = variance (spread). Standardisation: Z = (X−μ)/σ → Z ~ N(0,1). The 68-95-99.7 rule (empirical rule): P(μ−σ ≤ X ≤ μ+σ) ≈ 0.6827 (68%) P(μ−2σ ≤ X ≤ μ+2σ) ≈ 0.9545 (95%) P(μ−3σ ≤ X ≤ μ+3σ) ≈ 0.9973 (99.7%) The Normal distribution is ubiquitous because of the Central Limit Theorem (Stage 04): the sum of many independent random variables converges to Normal regardless of the original distribution. This justifies many ML assumptions: residuals in linear regression, weight initialisation noise, mini-batch gradient averages.

  Normal Distribution N(μ, σ²)

         68.27%
       ┌─────────┐
       │         │    95.45%
  ─────┼─────────┼──────────────
       │  ─────  │
  .....|▓▓▓▓▓▓▓▓▓|.....
  .....|▓▓▓▓▓▓▓▓▓|.....
  ....|▓▓▓▓▓▓▓▓▓▓▓|....
  ...|▓▓▓▓▓▓▓▓▓▓▓▓▓|...
  ..|▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓|..
  ___|___|___|___|___|___
  μ-3σ μ-2σ μ-σ  μ  μ+σ μ+2σ μ+3σ

Python — Normal distribution, Z-scores, normality tests, standardisation in ML

import numpy as np
from scipy import stats

np.random.seed(42)

# ── Normal distribution basics ────────────────────────────
mu, sigma = 170, 10    # height in cm
norm = stats.norm(loc=mu, scale=sigma)

print("68-95-99.7 Rule:")
print(f"  P(μ±1σ): {norm.cdf(mu+sigma) - norm.cdf(mu-sigma):.4f}")
print(f"  P(μ±2σ): {norm.cdf(mu+2*sigma) - norm.cdf(mu-2*sigma):.4f}")
print(f"  P(μ±3σ): {norm.cdf(mu+3*sigma) - norm.cdf(mu-3*sigma):.4f}")

# ── Z-score standardisation ────────────────────────────────
heights = np.array([160, 165, 170, 175, 180, 190, 195])
z_scores = (heights - mu) / sigma
print(f"\nZ-scores: {z_scores.round(2)}")
# How unusual is 195cm?
p_extreme = 2 * (1 - norm.cdf(195))   # two-tailed
print(f"P(|X-μ| ≥ 25): {p_extreme:.4f}")   # ~0.012 → rare

# ── Normality tests ───────────────────────────────────────
# Test if a sample is normally distributed
data_normal = np.random.normal(0, 1, 500)
data_skewed = np.random.exponential(1, 500)

# Shapiro-Wilk (best for n < 5000)
stat_n, p_n = stats.shapiro(data_normal)
stat_s, p_s = stats.shapiro(data_skewed)
print(f"\nShapiro-Wilk: Normal data  p={p_n:.4f}  (fail to reject H₀ → normal)")
print(f"Shapiro-Wilk: Skewed data  p={p_s:.4f}  (reject H₀ → not normal)")

# D'Agostino-Pearson (better for large n)
stat_d, p_d = stats.normaltest(data_skewed)
print(f"D'Agostino: Skewed data    p={p_d:.4f}")

# ── StandardScaler IS Z-score normalisation ────────────────
from sklearn.preprocessing import StandardScaler
X = np.random.normal(170, 10, (1000, 1))
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"\nAfter StandardScaler: mean={X_scaled.mean():.4f}, std={X_scaled.std():.4f}")
# Mean ≈ 0, Std ≈ 1 — equivalent to Z-score transform

ML Link OLS linear regression assumes N(0,σ²) residuals → MSE loss is the MLE. He/Xavier weight initialisation draws from N(0, 2/fan_in). Batch normalisation forces activations toward N(0,1). The z-score formula IS StandardScaler in sklearn.

Trap Assuming normality when the data is heavy-tailed or skewed

Financial returns, user session lengths, and NLP token counts are rarely normally distributed — they are typically right-skewed or heavy-tailed. Applying mean ± 1.96σ confidence intervals or t-tests assuming normality to these distributions gives wrong coverage. Outliers in heavy-tailed data inflate σ, widening CIs that do not reflect actual variability.

Fix Always plot the distribution first (histogram, QQ-plot). For skewed data: apply log transform (log1p for zero-inclusive), use median ± IQR instead of mean ± σ, or use non-parametric tests (Mann-Whitney instead of t-test). For heavy tails, use robust estimators (median, MAD).

Trap Using StandardScaler when the distribution has heavy outliers

StandardScaler uses mean and std — both are highly sensitive to outliers. A single outlier at 100σ from the mean shifts the mean and inflates the std, collapsing all other feature values near zero after scaling. This effectively hides the variation in the bulk of the distribution.

Fix Use RobustScaler (scales using IQR: (X − median) / IQR) when outliers are present. For features with extreme skew, apply log/Box-Cox transform first, then StandardScaler. Inspect feature distributions before choosing a scaler.

Three reasons: (1) Mathematical convenience — the Normal distribution is closed under addition and linear transformation, making algebra tractable. (2) Central Limit Theorem — if errors come from many small independent sources, their sum is approximately Normal regardless of each source's distribution. (3) Maximum Likelihood — assuming Gaussian errors leads to the MSE loss, which is easy to optimise. The assumption is often violated in practice (financial returns have fat tails, text data is discrete) — the key question is whether the violation is severe enough to matter for the task. For robust alternatives, use Laplace noise assumption → MAE loss, or Student-t → Huber loss.

This is a right-skewed distribution. Steps: (1) Plot histogram and check skewness. (2) Apply log1p transform: np.log1p(x) — compresses the long tail and brings the distribution closer to Normal. (3) Then apply StandardScaler on the transformed values. (4) Verify the transformed distribution looks more symmetric. This matters because: gradient-based models converge faster on well-scaled inputs; regularisation applies more uniformly; the Normal-assumption-based confidence intervals become more valid. Tree models (XGBoost, LightGBM) are scale-invariant and do not require this transformation.

Bayes' Theorem & Conditional Probability

P(A|B) = P(B|A)·P(A)/P(B), prior × likelihood → posterior, Bayesian updating step-by-step

Bayes' theorem derives from the definition of conditional probability: P(A|B) = P(A∩B) / P(B) and P(B|A) = P(A∩B) / P(A) Combining these: P(A|B) = P(B|A) · P(A) / P(B) In Bayesian inference, rewrite A=θ (parameters), B=D (data): P(θ|D) = P(D|θ) · P(θ) / P(D) posterior ∝ likelihood × prior Terms: P(θ) = prior — belief about θ before seeing data P(D|θ) = likelihood — how probable is the data given θ? P(θ|D) = posterior — updated belief after seeing data P(D) = evidence (normalising constant, often intractable) Intuition: You start with a prior belief. You observe data. The likelihood updates that belief. The posterior is your new, better-informed belief. Law of Total Probability: P(B) = Σᵢ P(B|Aᵢ)·P(Aᵢ) where {Aᵢ} partition Ω.

  Bayesian Updating

  Prior P(θ)          Likelihood P(D|θ)      Posterior P(θ|D)
  ─────────           ─────────────────       ──────────────────
  Belief before  ×    How probable           ÷  P(D)   =  Updated belief
  seeing data         is the data                          after data

  Example: Medical test
  P(disease)     = 0.001   ← prior (rare disease)
  P(test+|sick)  = 0.99    ← sensitivity (likelihood)
  P(test+|well)  = 0.01    ← false positive rate
  P(test+)       = ?       ← law of total probability

  P(test+) = 0.99×0.001 + 0.01×0.999 = 0.01098
  P(sick|test+) = (0.99×0.001) / 0.01098 ≈ 0.083  ← posterior!
  → Only 8.3% chance of disease even with a positive test!

Python — Bayes theorem, medical test example, Bayesian updating, Naive Bayes classifier

import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.datasets import make_classification

# ── Medical test example ──────────────────────────────────
p_disease      = 0.001       # P(disease) — prior
p_pos_given_sick  = 0.99     # P(test+|sick) — sensitivity
p_pos_given_well  = 0.01     # P(test+|well) — false positive rate

# Law of total probability
p_pos = p_pos_given_sick * p_disease + p_pos_given_well * (1 - p_disease)

# Bayes' theorem: P(sick|test+)
p_sick_given_pos = (p_pos_given_sick * p_disease) / p_pos
print(f"P(test+)       = {p_pos:.5f}")
print(f"P(sick|test+)  = {p_sick_given_pos:.4f}  ({p_sick_given_pos*100:.1f}%)")
# Counter-intuitive: 99% accurate test → only 8.3% true positive!

# ── Bayesian updating with sequential data ────────────────
# Beta-Binomial: updating click rate belief as data arrives
# Prior: Beta(α=1, β=1) = uniform (no prior knowledge)
from scipy.stats import beta as beta_dist

alpha_prior, beta_prior = 1, 1   # uniform prior
clicks   = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # observed clicks

print("\nBayesian CTR updating:")
alpha, beta = alpha_prior, beta_prior
for i, click in enumerate(clicks):
    alpha += click
    beta  += (1 - click)
    posterior_mean = alpha / (alpha + beta)
    ci_low, ci_high = beta_dist.ppf([0.025, 0.975], alpha, beta)
    print(f"  After {i+1:2d} obs: CTR={posterior_mean:.3f}  "
          f"95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
# As more data arrives, CI narrows and mean converges to true rate

# ── Gaussian Naive Bayes classifier ──────────────────────
X, y = make_classification(n_samples=1000, n_features=10,
                            n_informative=5, random_state=42)
gnb = GaussianNB()
gnb.fit(X, y)
print(f"\nNaive Bayes accuracy: {gnb.score(X, y):.4f}")
# gnb.class_prior_  = P(y=k) for each class (prior)
# gnb.theta_        = P(xᵢ|y=k) feature means per class (likelihood params)
print("Class priors:", gnb.class_prior_.round(3))

ML Link Naive Bayes classifier = Bayes theorem + conditional independence assumption. Bayesian hyperparameter optimisation (Optuna, BO) uses posterior P(params|results). L2 regularisation = MAP with Gaussian prior. Beta-Binomial bandits use Bayesian updating for explore-exploit.

Trap Ignoring the base rate (prior) when interpreting model predictions

A fraud detection model with 95% precision sounds impressive. But if the fraud rate is 0.1%, the model is flagging many legitimate transactions. The base rate is the prior P(fraud) — a model's posterior P(fraud|features) cannot be high unless either the likelihood ratio is enormous or the base rate is substantial. Engineers frequently report precision without reporting the base rate context.

Fix Always report the prior (base rate) alongside the model's precision/recall. Use a confusion matrix at realistic class proportions. Ask: "If I randomly flagged transactions at the model's predicted positive rate, how would that compare?" The base rate is the Bayesian prior your model is working against.

Trap Confusing P(model output | data) with P(data | model output)

A common error in probabilistic reasoning: P(positive test | sick) ≠ P(sick | positive test). Similarly: P(model predicts fraud | actual fraud) (recall) ≠ P(actual fraud | model predicts fraud) (precision). The base rate mediates the conversion between these — the medical test example above demonstrates how 99% sensitivity with 0.1% base rate gives only 8.3% posterior probability.

Fix Always frame predictions using Bayes' theorem: P(y=1|x) = P(x|y=1)·P(y=1) / P(x). When reporting classifier performance, distinguish between: sensitivity = P(ŷ=1|y=1), precision = P(y=1|ŷ=1). The latter is what the business cares about and requires knowing P(y=1) — the class prior.

The base rate fallacy is ignoring P(A) (the prior) when computing P(A|B). Example: a model for detecting churned users achieves 90% sensitivity. The churn rate is 2%. If the model flags 100 users: ~18 are true churners (90% of the 20 actual churners in ~1000 users) and 98 are false alarms (10% of 980 non-churners). Precision = 18/116 ≈ 15.5%. Despite 90% sensitivity, only 1 in 6 flagged users actually churned. The prior P(churn=2%) dominates. Fix: always compute P(true positive|flagged) = precision, not just sensitivity. Use Bayes' theorem explicitly when setting decision thresholds.

MLE: θ_MLE = argmax P(D|θ) — finds parameters that make the observed data most probable, ignoring any prior knowledge. MAP: θ_MAP = argmax P(θ|D) = argmax P(D|θ)·P(θ) — maximises the posterior, incorporating a prior. With a Gaussian prior N(0, 1/λ) on weights: MAP estimate = Ridge regression (L2 regularisation). With a Laplace prior: MAP = Lasso (L1). Prefer MAP when: you have genuine prior knowledge (e.g., CTR is usually between 1-10%), small data (prior regularises, preventing overfitting), or you want to encode domain constraints. With large data, MLE and MAP converge — the likelihood dominates the prior.

Beta, Exponential & Log-Normal Distributions

When each appears in ML: Beta for rates, Exponential for wait times, Log-Normal for prices

Three more distributions with direct DS/AI applications: Beta(α, β): continuous, supported on [0,1]. Natural distribution for probabilities and rates. PDF: f(x) ∝ x^(α−1)·(1−x)^(β−1) E[X] = α/(α+β), Mode = (α−1)/(α+β−2) for α,β>1 Conjugate prior for Bernoulli/Binomial → posterior is also Beta (Beta-Binomial model). Alpha=Beta=1 → Uniform. Higher α+β → more concentrated (more data = tighter belief). Exponential(λ): models time between events in a Poisson process. Memoryless property. PDF: f(x) = λe^(−λx), E[X] = 1/λ, Var(X) = 1/λ² Memoryless: P(X>s+t | X>s) = P(X>t) — past waiting time gives no info about future wait. Log-Normal(μ, σ): if log(X) ~ N(μ, σ²), then X is log-normal. X > 0, right-skewed. E[X] = e^(μ + σ²/2), Var(X) = (e^(σ²)−1)·e^(2μ+σ²) Models: income, house prices, stock prices, file sizes, LLM token counts.

Python — Beta, Exponential, Log-Normal with scipy.stats, Beta-Binomial CTR model

import numpy as np
from scipy import stats

np.random.seed(42)

# ── Beta distribution: CTR prior ─────────────────────────
# Prior belief: CTR is somewhere around 10-15%
# Start with Beta(2, 10) → E[X] = 2/12 ≈ 0.167
alpha, beta = 2, 10
beta_rv = stats.beta(alpha, beta)
print(f"Prior Beta({alpha},{beta}): E[CTR]={beta_rv.mean():.3f}, "
      f"95% CI=[{beta_rv.ppf(0.025):.3f}, {beta_rv.ppf(0.975):.3f}]")

# Observe 50 clicks from 400 impressions
clicks, impressions = 50, 400
# Bayesian update: Beta posterior = Beta(α+clicks, β+no_clicks)
alpha_post = alpha + clicks
beta_post  = beta  + (impressions - clicks)
post_rv    = stats.beta(alpha_post, beta_post)
print(f"Posterior Beta({alpha_post},{beta_post}): "
      f"E[CTR]={post_rv.mean():.3f}, "
      f"95% CI=[{post_rv.ppf(0.025):.3f}, {post_rv.ppf(0.975):.3f}]")
# Posterior is much tighter — 400 observations nail the CTR

# ── Exponential: server request latency ──────────────────
# Mean latency = 200ms → rate λ = 1/200 = 0.005
lam = 1/200
exp_rv = stats.expon(scale=1/lam)   # scipy uses scale=1/λ
print(f"\nExponential(λ=1/200ms):")
print(f"  E[latency]   = {exp_rv.mean():.0f} ms")
print(f"  P(> 500ms)   = {1-exp_rv.cdf(500):.4f}")   # slow requests
print(f"  P(> 1000ms)  = {1-exp_rv.cdf(1000):.6f}")  # very slow
# Memoryless: P(X>600|X>100) = P(X>500)
print(f"  Memoryless check: {(1-exp_rv.cdf(600))/(1-exp_rv.cdf(100)):.4f} "
      f"vs {1-exp_rv.cdf(500):.4f}")

# ── Log-Normal: house prices ──────────────────────────────
# If log(price) ~ N(12, 0.5²), price is log-normal
mu_log, sigma_log = 12, 0.5
lognorm_rv = stats.lognorm(s=sigma_log, scale=np.exp(mu_log))
print(f"\nLog-Normal(μ=12, σ=0.5) prices:")
print(f"  Median price = ${lognorm_rv.median():,.0f}")
print(f"  Mean price   = ${lognorm_rv.mean():,.0f}")   # mean > median (right skew)
print(f"  90th pct     = ${lognorm_rv.ppf(0.9):,.0f}")

# ── Why log-transform works for prices ────────────────────
prices = lognorm_rv.rvs(1000)
print(f"\nRaw prices:      skewness={stats.skew(prices):.2f}")
print(f"Log prices:      skewness={stats.skew(np.log(prices)):.2f}")
# Log transform makes the distribution symmetric → regression works better

ML Link Beta distribution → Thompson sampling in multi-armed bandits; Beta-Binomial Bayesian A/B testing. Exponential → survival analysis (time-to-churn), queuing models for API latency. Log-Normal → price/income features need log transform before regression.

Trap Not log-transforming right-skewed target variables before regression

Predicting house prices, salaries, or user lifetime value with linear regression on raw values causes the model to focus on the few high outliers (because MSE penalises large errors quadratically) and ignore the bulk of the distribution. Residuals are heteroscedastic — variance grows with the predicted value, violating OLS assumptions.

Fix Apply log (or log1p) transform to right-skewed targets before linear/tree regression. Train on log(y), then exponentiate predictions. Evaluate RMSE on the original scale. Note: predicting log(y) minimises relative error, not absolute error — choose based on the business metric (do you care about being off by $10k or by 10%?).

Trap Using the mean instead of median as a point estimate for right-skewed distributions

For log-normal salary data with a few very high earners, the mean is pulled far above the median by the long tail. Reporting "average salary = $120k" when 80% of employees earn under $80k is technically correct but misleading. ML models trained to predict the mean will systematically overpredict for most users.

Fix For skewed distributions: report median as the central tendency and IQR for spread. For ML: use quantile regression (predict the 50th percentile) if the median is the target; use MAE loss instead of MSE (MAE minimises expected absolute error, which equals the median for symmetric distributions and is more robust for skewed ones).

Three reasons: (1) Support — Beta is defined on [0,1], exactly the range of probabilities. (2) Flexibility — Beta(α,β) can represent a wide variety of shapes: uniform (α=β=1), bell-shaped (α,β>1), U-shaped (α,β<1), or skewed. (3) Conjugacy — Beta is the conjugate prior for the Bernoulli/Binomial likelihood. When you observe clicks (Binomial), the posterior is Beta(α+clicks, β+non-clicks) — analytically tractable without MCMC. This enables exact Bayesian updating: each click shifts α up by 1, each non-click shifts β up by 1. Thompson sampling in bandits works by sampling from this Beta posterior.

P(X > s+t | X > s) = P(X > t) — given that you have already waited s time units, the distribution of remaining wait time is identical to the original. The past wait gives zero information about the future. In ML systems: (1) HTTP request inter-arrival times — knowing a request just arrived doesn't change the expected time to the next one. (2) User session modelling — if sessions are Exponentially distributed, knowing a user has been active for 10 minutes doesn't predict when they'll leave better than if they just arrived. This fails in practice for human behaviour (hazard rate changes over time) → use Weibull distribution instead.

MLE assumes a distribution for your data and asks: what parameters make the observed data most likely? MAP adds a prior. The distribution you choose IS a modelling assumption — make it explicit.

Sampling & the Central Limit Theorem 03–04

Descriptive Statistics, Moments & Outlier Detection

Descriptive statistics are your first look at data — before any model, any feature engineering, any training loop. Getting this wrong shapes every decision that follows. Mean vs median, variance vs IQR, skewness — these are not textbook formalities, they are data quality tools.

Central Tendency — Mean, Median & Mode

When each is the right summary, how outliers affect each, median for skewed data

Three measures of where the "centre" of a distribution lies: Mean (arithmetic average): x̄ = (1/n) Σᵢ xᵢ → Minimises sum of squared deviations: argmin_c Σ(xᵢ−c)² → Sensitive to outliers — one extreme value shifts it substantially. Median: middle value when sorted. 50th percentile. → Minimises sum of absolute deviations: argmin_c Σ|xᵢ−c| → Robust to outliers. Preferred for skewed distributions (income, prices). Mode: most frequent value. → Only measure meaningful for categorical data. → Multi-modal distributions signal possible subpopulations (important for segmentation). Rule of thumb for skewness direction: Right-skewed (long right tail): mean > median > mode Left-skewed (long left tail): mean < median < mode Symmetric: mean ≈ median ≈ mode

Python — mean vs median comparison, outlier impact, pandas describe, trimmed mean

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)

# ── Symmetric distribution: mean ≈ median ────────────────
heights = np.random.normal(170, 10, 1000)
print(f"Heights — Mean: {heights.mean():.2f}, Median: {np.median(heights):.2f}")

# ── Right-skewed: salary data ──────────────────────────────
# 950 workers earn $30k-$80k, 50 executives earn $500k+
salaries = np.concatenate([
    np.random.lognormal(np.log(50_000), 0.3, 950),
    np.random.lognormal(np.log(500_000), 0.4, 50)
])
print(f"\nSalary distribution:")
print(f"  Mean:   ${salaries.mean():>10,.0f}  ← pulled up by executives")
print(f"  Median: ${np.median(salaries):>10,.0f}  ← better represents typical worker")
print(f"  Std:    ${salaries.std():>10,.0f}")

# ── Impact of a single outlier ─────────────────────────────
data = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
print(f"\nBefore outlier: mean={data.mean():.2f}, median={np.median(data):.2f}")
data_with_outlier = np.append(data, 1000.0)
print(f"After outlier:  mean={data_with_outlier.mean():.2f}, "
      f"median={np.median(data_with_outlier):.2f}")
# Mean jumps from 12.5 to ~154; median moves from 12.5 to 13 — robust!

# ── pandas describe: always run first ─────────────────────
df = pd.DataFrame({'salary': salaries, 'height': np.random.normal(170,10,1000)})
print("\n", df.describe().round(2))
# Check: is mean >> median? (right skew) → log transform needed
print(f"\nSalary skewness:  {stats.skew(salaries):.3f}  (>1 = strongly right-skewed)")
print(f"Height skewness:  {stats.skew(heights):.3f}  (≈0 = symmetric)")

# ── Trimmed mean: robust alternative ─────────────────────
# Removes top and bottom 10% before computing mean
trimmed = stats.trim_mean(salaries, 0.10)
print(f"\nTrimmed mean (10%): ${trimmed:,.0f}")
# Between mean and median — less influenced by extreme outliers

ML Link Mean imputation vs median imputation in sklearn SimpleImputer. MSE regression predicts the conditional mean E[y|x]; MAE regression predicts the conditional median. Multi-modal target distributions suggest clustering first. Outlier impact on gradient descent: MSE loss is dominated by large-residual examples.

Trap Using mean imputation for skewed features with outliers

Filling missing values with the feature mean when the feature has outliers introduces bias. For income data, mean imputation fills missing values with a figure pulled upward by high earners — most imputed values are unrealistically high, adding noise that degrades model performance.

Fix Use median imputation (SimpleImputer(strategy="median")) for skewed or outlier-prone features. For categorical features, use mode imputation. For features with missingness correlated with the target (MNAR — Missing Not At Random), consider creating a binary "was_missing" indicator feature alongside the imputed value.

Trap Reporting model performance metrics as mean without checking for high variance

Reporting "mean AUC = 0.87 across 5 folds" hides that fold scores were [0.70, 0.92, 0.88, 0.91, 0.84]. The single outlier fold at 0.70 signals data distribution issues (perhaps that fold has a different class balance) that the mean conceals.

Fix Always report mean ± std or median with IQR for cross-validation scores. If the distribution of fold scores is skewed, the median is more representative. A high standard deviation across folds is itself a signal — investigate the outlier fold before drawing conclusions.

Use median when: (1) The distribution is skewed (income, house prices, LLM latency) — the median is not pulled by extreme values. (2) Outliers are genuine and not data errors — they should not dominate the central tendency estimate. (3) You care about the "typical" value for a user or example, not the average. (4) The business question is about the 50th percentile: "what does the median customer earn?" Use mean when: (1) The distribution is symmetric. (2) You are computing expectations that enter formulas (variance, regression coefficients). (3) The business question is about totals: mean × count = total (e.g., revenue per user × users = total revenue).

The mean is ~2.7× the median — this is a strongly right-skewed distribution, likely driven by a small number of very high-value outliers. Steps: (1) Plot the histogram to confirm shape. (2) Check skewness with scipy.stats.skew(). (3) Apply log1p transform: log1p(salary) will compress the long tail and make the distribution more symmetric. (4) Check if outliers are data errors (salary = $1B) or genuine (CEO compensation) — if errors, cap or remove them. (5) Use median and IQR for descriptive statistics rather than mean and std. (6) Consider using RobustScaler instead of StandardScaler for model input.

Variance, Std Dev, Skewness & Kurtosis

Moments of a distribution, what each reveals, excess kurtosis, fat tails in ML

The moments of a distribution summarise its shape: 1st moment — Mean: μ = E[X] = Σxᵢ/n 2nd moment — Variance: σ² = E[(X−μ)²] = E[X²] − μ² 3rd moment — Skewness: γ₁ = E[(X−μ)³] / σ³ > 0: right tail longer (mass pulled right) < 0: left tail longer (mass pulled left) = 0: symmetric 4th moment — Kurtosis: γ₂ = E[(X−μ)⁴] / σ⁴ − 3 (excess kurtosis) = 0: Normal distribution (mesokurtic) > 0: heavier tails than Normal (leptokurtic) — e.g. financial returns, attention weights < 0: lighter tails than Normal (platykurtic) — e.g. Uniform distribution Important: Var(X+Y) = Var(X) + Var(Y) + 2·Cov(X,Y) For independent X,Y: Var(X+Y) = Var(X) + Var(Y) — variances add

Python — scipy.stats moments, skewness/kurtosis diagnostics, fat tail detection

import numpy as np
from scipy import stats
import pandas as pd

np.random.seed(42)

# ── Compute all four moments ──────────────────────────────
def describe_moments(data, label):
    print(f"\n{label}:")
    print(f"  Mean:     {np.mean(data):.4f}")
    print(f"  Variance: {np.var(data, ddof=1):.4f}")
    print(f"  Skewness: {stats.skew(data):.4f}")
    print(f"  Kurtosis: {stats.kurtosis(data):.4f}  (excess, Normal=0)")

# ── Symmetric (Normal) ────────────────────────────────────
normal_data   = np.random.normal(0, 1, 5000)
# ── Right-skewed (Exponential) ────────────────────────────
skewed_data   = np.random.exponential(1, 5000)
# ── Heavy-tailed (Student-t with 3 dof) ──────────────────
heavy_tail    = stats.t.rvs(df=3, size=5000)

describe_moments(normal_data,  "Normal(0,1)")
describe_moments(skewed_data,  "Exponential(1)")
describe_moments(heavy_tail,   "Student-t(df=3)")

# ── Variance: sample vs population ───────────────────────
# ddof=0: population variance (divide by n)   — biased estimator
# ddof=1: sample variance (divide by n-1)     — unbiased estimator (Bessel's correction)
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(f"\nPopulation variance (ddof=0): {np.var(data, ddof=0):.4f}")
print(f"Sample variance    (ddof=1): {np.var(data, ddof=1):.4f}")

# ── Variance addition for independent variables ───────────
# Var(X+Y) = Var(X) + Var(Y) when X, Y independent
x = np.random.normal(0, 2, 100_000)   # Var(X) = 4
y = np.random.normal(0, 3, 100_000)   # Var(Y) = 9
z = x + y                              # Var(Z) should ≈ 13
print(f"\nVar(X) + Var(Y) = {x.var():.2f} + {y.var():.2f} = {x.var()+y.var():.2f}")
print(f"Var(X+Y) empirically:        {z.var():.2f}")

# ── Kurtosis: detecting fat tails ─────────────────────────
# Returns in financial data are fat-tailed
returns = stats.t.rvs(df=4, size=10_000)   # simulate fat-tailed returns
print(f"\nFat-tail data kurtosis: {stats.kurtosis(returns):.2f}")
# High positive kurtosis → more extreme values than Normal predicts
# → MSE loss is heavily influenced by these extremes
# → Consider Huber loss or Student-t loss instead

ML Link High kurtosis features → Huber loss more appropriate than MSE. Skewness → log transform before regression. Variance addition used in dropout (dropout randomly zeros → reduces variance of output by (1-p)). Weight initialisation uses Var(W)=2/n to maintain activation variance across layers.

Trap Using population variance (ddof=0) when you have a sample

In pandas, df["col"].var() defaults to ddof=1 (sample variance). In numpy, np.var(arr) defaults to ddof=0 (population variance). When reporting variance of a sample or computing standard error, using ddof=0 underestimates the true population variance, leading to overly narrow confidence intervals.

Fix Always specify ddof explicitly: np.var(arr, ddof=1) for sample variance. In sklearn, StandardScaler uses ddof=0 by default (population std) — this is fine for scaling because the exact ddof correction matters less at scale, but be consistent when computing statistics for reports.

Trap Ignoring heavy tails when choosing a loss function

MSE = E[(y−ŷ)²] is heavily influenced by squared errors. If the target distribution has high kurtosis (fat tails), a few large outlier residuals dominate the gradient and push the model to fit them at the cost of typical examples. The model learns to chase outliers.

Fix Diagnose target kurtosis first: scipy.stats.kurtosis(y). For excess kurtosis > 3: use Huber loss (delta controls the transition from squared to linear). For extreme heavy tails: consider robust regression (statsmodels RLM) or quantile regression targeting the 50th percentile instead of the mean.

Excess kurtosis = E[(X−μ)⁴]/σ⁴ − 3, where the "−3" makes the Normal distribution the baseline at 0. Positive excess kurtosis means heavier tails than the Normal — more probability mass in the extremes and in the centre, less in the shoulders. Financial returns (excess kurtosis ~3–10) and neural network weight gradients can have fat tails. For ML: (1) Fat-tailed targets → MSE loss is dominated by rare large errors, Huber loss is better. (2) Fat-tailed gradient distributions → gradient clipping is essential to prevent exploding updates. (3) Fat-tailed features → outlier-robust scalers and tree models that do not require Gaussian assumptions are preferred.

The sample variance s² = Σ(xᵢ−x̄)²/(n−1) is unbiased: E[s²] = σ². Using n in the denominator gives a biased estimate that systematically underestimates σ². The intuition: the sample mean x̄ is computed from the same data and is closer to the sample than the true μ, so deviations from x̄ are systematically smaller than deviations from μ. Dividing by n−1 corrects this. The "−1" represents losing one degree of freedom by estimating the mean. In practice, for n > 30 the difference is small (1/n vs 1/(n−1)). For ML with millions of examples, it does not matter. It matters for small sample statistics (n < 50).

Outlier Detection — IQR Rule, Z-Score & MAD

Three detection methods, their sensitivity, which is robust, and what to do with outliers in ML

Outlier detection determines which data points deviate substantially from the bulk of the distribution. IQR (Interquartile Range) Rule — Tukey's fence: IQR = Q3 − Q1 Lower fence: Q1 − 1.5×IQR Upper fence: Q3 + 1.5×IQR Robust: based on quartiles, not affected by extreme values. Z-score Method: Z = (x − μ) / σ Flag |Z| > 3 (0.27% of data expected under Normality). NOT robust: μ and σ are themselves affected by outliers, masking extreme values ("masking effect"). MAD (Median Absolute Deviation) — most robust: MAD = median(|xᵢ − median(x)|) Modified Z-score: M = 0.6745 · (x − median) / MAD Flag |M| > 3.5 Breakdown point = 50%: resistant to up to 50% contamination. What to do with outliers: (1) Verify — is it a data error or a genuine extreme value? (2) If error: remove or correct. (3) If genuine: keep but use robust models/losses; or cap (winsorisation) at a percentile.

Python — IQR, Z-score, MAD outlier detection, winsorisation, comparison

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)

# ── Generate data with outliers ───────────────────────────
data = np.concatenate([
    np.random.normal(100, 15, 990),    # bulk of distribution
    np.array([250, 280, −50, 300, 270, 260, −40, 290, 270, 300])  # 10 outliers
])

# ── Method 1: IQR Rule ────────────────────────────────────
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
outliers_iqr = data[(data < lower) | (data > upper)]
print(f"IQR Method: Q1={Q1:.1f}, Q3={Q3:.1f}, IQR={IQR:.1f}")
print(f"  Fences: [{lower:.1f}, {upper:.1f}]")
print(f"  Detected {len(outliers_iqr)} outliers")

# ── Method 2: Z-score (NOT robust) ───────────────────────
mu, sigma = data.mean(), data.std()
z_scores = np.abs((data - mu) / sigma)
outliers_z = data[z_scores > 3]
print(f"\nZ-score: mu={mu:.1f}, σ={sigma:.1f}")
print(f"  Detected {len(outliers_z)} outliers")
# Note: outliers inflate σ, masking some extreme values

# ── Method 3: MAD (most robust) ──────────────────────────
median = np.median(data)
MAD = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / MAD
outliers_mad = data[np.abs(modified_z) > 3.5]
print(f"\nMAD Method: median={median:.1f}, MAD={MAD:.1f}")
print(f"  Detected {len(outliers_mad)} outliers")   # most accurate

# ── Winsorisation: cap instead of remove ─────────────────
from scipy.stats.mstats import winsorize
data_winsorized = winsorize(data, limits=[0.01, 0.01])  # clip top/bottom 1%
print(f"\nOriginal: max={data.max():.0f}, min={data.min():.0f}")
print(f"Winsorized: max={data_winsorized.max():.0f}, min={data_winsorized.min():.0f}")

# ── In pandas ─────────────────────────────────────────────
df = pd.DataFrame({'value': data})
df['is_outlier_iqr'] = (df['value'] < lower) | (df['value'] > upper)
print(f"\nDataFrame outlier flag: {df['is_outlier_iqr'].sum()} outliers flagged")

ML Link Outlier features cause gradient instability in neural networks (use clipping). Outlier targets inflate MSE loss (use Huber loss). IQR-based detection is used inside sklearn's RobustScaler. Isolation Forest and LOF are ML-based outlier detectors for high-dimensional data.

Trap Removing outliers from test data but not from training data (or vice versa)

A preprocessing pipeline removes outliers from the training set before fitting a model. At inference, the model receives raw data including outliers, and its predictions on outlier inputs are extrapolations into a region it never trained on. The opposite problem — removing outliers from test set — inflates reported metrics by hiding the hardest examples.

Fix The outlier treatment policy must be applied identically to training and serving data. Use sklearn Pipeline to encode the outlier handling: clip at training-time percentiles (fit RobustScaler on train, apply to test) or use winsorisation with percentiles computed on training data only.

Trap Treating all outliers as data errors without investigation

An automated pipeline flags and removes all values with |Z| > 3. But a user who spent 10 hours on your platform in a single day is a genuine power user, not a data error. Removing them from a churn model removes exactly the most valuable customers. Blind outlier removal discards business-critical signal.

Fix Always investigate outliers before removing: plot them, look up the raw events, check whether they concentrate in a specific segment. Add a boolean "is_outlier" feature alongside the capped value — this lets the model learn from both the scale and the outlier status. Create separate model segments for heavy users if their behaviour is qualitatively different.

MAD (Median Absolute Deviation) is the most robust with a breakdown point of 50% — it remains reliable even when up to 50% of the data is contaminated. The Z-score method uses the sample mean and std, both of which are pulled toward outliers (the masking effect — extreme outliers inflate σ, making other outliers look normal). IQR is more robust than Z-score (breakdown point 25%) because quartiles are less sensitive to extremes. For high-dimensional data, use Isolation Forest (isolates outliers by randomly partitioning features — outliers are isolated in fewer splits) or Local Outlier Factor (density-based).

Systematic diagnosis: (1) Identify the 5%: sort by residual magnitude (|y−ŷ|) and inspect the worst examples. (2) Check if they are outliers: IQR / MAD test on feature values and target. (3) Look for a common pattern: are they from a specific time window, user segment, or geography? (4) Check if they are out-of-distribution: compare their feature distribution to the training set (use KL divergence or a simple classifier to distinguish bad vs good examples). (5) Decide: if they are genuine extreme values, use Huber loss or a separate model for this segment. If they are data errors, fix the pipeline.

Every experienced DS engineer runs df.describe() on new data before writing a single model line. The skewness, the min/max gap, the 25th vs 75th percentile spread — these tell you more about what preprocessing is needed than any automated pipeline can.

Law of Large Numbers, Central Limit Theorem & Bootstrap

The Central Limit Theorem is arguably the most important theorem in applied statistics — it is the reason hypothesis tests work, the reason batch gradient descent converges, and the reason confidence intervals take the form they do. Understanding it deeply changes how you think about sampling, noise, and estimation.

Law of Large Numbers & Empirical Risk

Weak LLN: x̄ₙ → μ in probability; strong LLN; why more data improves models

The Law of Large Numbers (LLN) formalises the intuition that averages stabilise with more data. Weak LLN: For iid X₁,...,Xₙ with E[X]=μ: P(|X̄ₙ − μ| > ε) → 0 as n → ∞ for any ε > 0 The sample mean converges to the true mean in probability. Strong LLN: P(lim_{n→∞} X̄ₙ = μ) = 1 Almost sure convergence — a stronger statement. Rate of convergence: |X̄ₙ − μ| ≈ σ/√n (standard error) → Halving the error requires 4× the data. → 10× error reduction requires 100× data. For ML, the LLN justifies empirical risk minimisation (ERM): True risk: R(θ) = E_{(x,y)~P}[L(y, f_θ(x))] — unknown (requires knowing P) Empirical risk: R̂(θ) = (1/n) Σᵢ L(yᵢ, f_θ(xᵢ)) — computable from data LLN → R̂(θ) → R(θ) as n → ∞. Training = minimising empirical risk ≈ minimising true risk for large n.

Python — LLN convergence simulation, standard error, empirical risk vs true risk

import numpy as np
from scipy import stats
import matplotlib
matplotlib.use('Agg')

np.random.seed(42)

# ── Simulating LLN convergence ────────────────────────────
# True mean of Exponential(λ=2) = 1/λ = 0.5
true_mean = 0.5
samples   = np.random.exponential(0.5, 10_000)

print("LLN convergence — sample mean vs true mean:")
print(f"{'n':>8} | {'Sample Mean':>12} | {'Error':>12}")
print("-" * 38)
for n in [10, 50, 100, 500, 1000, 5000, 10000]:
    x_bar  = samples[:n].mean()
    error  = abs(x_bar - true_mean)
    print(f"{n:>8} | {x_bar:>12.6f} | {error:>12.6f}")
# Error decreases as ~1/√n

# ── Standard error: how fast does x̄ converge? ────────────
# SE = σ / √n
sigma = samples.std()
print(f"\nStandard Error at various n:")
for n in [100, 400, 1000, 10000]:
    se = sigma / np.sqrt(n)
    print(f"  n={n:6d}: SE={se:.5f}  "
          f"(95% margin: ±{1.96*se:.5f})")
# To halve SE: need 4× data; for 10× improvement: 100× data

# ── Empirical Risk ≈ True Risk (LLN in action) ────────────
# True risk for a regression model: E[(y - ŷ)²]
# Simulate: true f(x) = 2x + 1, model predictions ŷ = 2x + 0.8 (biased)
X_true = np.random.uniform(0, 1, 100_000)
y_true = 2*X_true + 1 + np.random.normal(0, 0.1, 100_000)
y_pred = 2*X_true + 0.8   # slightly biased model

# True risk (approximate with large n)
true_risk = np.mean((y_true - y_pred)**2)

print(f"\nTrue Risk  (n=100k): {true_risk:.6f}")
for n in [50, 200, 1000, 10000]:
    emp_risk = np.mean((y_true[:n] - y_pred[:n])**2)
    print(f"Empirical Risk (n={n:5d}): {emp_risk:.6f}  "
          f"(error: {abs(emp_risk-true_risk):.6f})")

ML Link ERM (Empirical Risk Minimisation) is the theoretical foundation of all supervised learning. Mini-batch gradient descent uses a small-n empirical risk estimate — LLN says it approximates true gradient as batch size grows. Generalization gap = |R̂(θ) − R(θ)| decreases with n by LLN.

Trap Expecting that more data always helps when the bottleneck is model bias

A linear model on non-linear data has high bias — its empirical risk and true risk are both high and converge to a value greater than the irreducible noise (the model is fundamentally wrong). Collecting 10× more data does not help: LLN means the empirical risk converges to the true risk faster, but the true risk is already high due to model misspecification.

Fix Plot learning curves: if train loss and val loss both plateau at a high value — high bias. Adding data barely helps. Fix: increase model expressiveness (polynomial features, deeper network, switch from linear to tree-based model). Only if the gap between train and val loss is large (high variance) will more data help.

Trap Using very small batch sizes without accounting for high gradient variance

Mini-batch gradient is an empirical estimate of the true gradient. With batch_size=4, the standard error of each gradient estimate is σ/√4 = σ/2 — very noisy. Training with small batches requires a lower learning rate to compensate for gradient variance. Using the same learning rate as a larger batch causes divergence.

Fix Scale learning rate with batch size: if you halve the batch size, halve the learning rate (linear scaling rule). Alternatively, use gradient accumulation to simulate large batch training on a small GPU: accumulate 8 steps of batch_size=32 = effective batch_size=256.

The true objective of ML is to minimise the expected loss over the data-generating distribution P(x,y): R(θ) = E[L(y, f_θ(x))]. We cannot compute this expectation directly because P is unknown. LLN guarantees that the empirical average (1/n)Σ L(yᵢ, f_θ(xᵢ)) converges to R(θ) as n→∞. For large n, minimising the empirical risk ≈ minimising the true risk. The gap |R̂(θ)−R(θ)| is controlled by learning theory (VC dimension, Rademacher complexity). In practice: more IID training data = better generalisation by LLN, unless model capacity is too low (bias bound) or too high (variance bound).

Standard deviation (σ): measures the spread of individual data points around the mean. It describes variability in the population. Standard error (SE = σ/√n): measures the spread of the sample mean across repeated samples. It describes uncertainty in the estimate of the mean. Key relationship: SE shrinks as √n grows — 4× more data halves SE. In ML: std of model predictions on a test set = variability of individual predictions. SE of model AUC across cross-validation folds = uncertainty in the mean AUC estimate. Always report SE (not std) when reporting average metrics across experiments.

Central Limit Theorem — Proof Intuition & Simulation

CLT statement, what "sampling distribution" means, n≥30 rule of thumb, exceptions

Central Limit Theorem (CLT): Let X₁, X₂, ..., Xₙ be iid with mean μ and variance σ² (finite). Then: √n · (X̄ₙ − μ) / σ → N(0, 1) as n → ∞ Equivalently: X̄ₙ ~ N(μ, σ²/n) approximately for large n. Key points: • X̄ₙ is the mean of n samples — NOT the individual Xᵢ values. • Works regardless of the shape of the original distribution (may be Uniform, Poisson, Exponential). • The original distribution only needs finite mean and variance. • Rate of convergence depends on skewness — more skewed → larger n needed. • Practical rule: n ≥ 30 for symmetric distributions; n ≥ 100 for moderately skewed. Why it matters: → Confidence intervals for means are valid for large n without knowing the original distribution. → t-tests are valid for large n even if the data is non-Normal. → Training loss averaged over a mini-batch is approximately Normally distributed. → Any metric that is a mean (AUC, accuracy, F1) has an approximately Normal sampling distribution.

  CLT Simulation: Exponential Population → Normal Sampling Distribution

  Population X ~ Exp(λ=1):          Sampling Dist. of X̄ (n=30):
       |                                      |     ▓▓
       |▓                                     |   ▓▓▓▓▓
       |▓▓                                    |  ▓▓▓▓▓▓▓▓
       |▓▓▓                                   | ▓▓▓▓▓▓▓▓▓▓
       |▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓                   |▓▓▓▓▓▓▓▓▓▓▓▓▓
       └──────────────────────                 └─────────────────
    0  1  2  3  4  5  6  7  →              0.6 0.8 1.0 1.2 1.4 →
  Right-skewed, not normal               Bell-shaped, approximately N(1, 1/30)

Python — CLT simulation from scratch, sampling distribution, convergence to Normal

import numpy as np
from scipy import stats
import matplotlib
matplotlib.use('Agg')

np.random.seed(42)

# ── CLT simulation ────────────────────────────────────────
# Population: Exponential(λ=1) — heavily right-skewed
# True mean = 1/λ = 1,  True std = 1/λ = 1
pop_mean, pop_std = 1.0, 1.0
B = 10_000   # number of simulated samples

print("CLT: Sampling distribution of X̄ approaches N(μ, σ²/n)")
print(f"{'n':>5} | {'E[X̄]':>8} | {'Std(X̄)':>8} | {'Skew(X̄)':>9} | {'Shapiro p':>10}")
print("-" * 55)

for n in [1, 5, 10, 30, 100]:
    # Draw B samples each of size n; compute mean of each
    sample_means = np.array([
        np.random.exponential(pop_mean, n).mean()
        for _ in range(B)
    ])
    expected_se = pop_std / np.sqrt(n)
    sw_stat, sw_p = stats.shapiro(sample_means[:500])  # Shapiro on 500
    print(f"{n:>5} | {sample_means.mean():>8.4f} | "
          f"{sample_means.std():>8.4f} | "
          f"{stats.skew(sample_means):>9.4f} | "
          f"{sw_p:>10.4f}")
# n=1: highly skewed (Exp distribution itself)
# n=30: Shapiro p > 0.05 → cannot reject normality
# n=100: very close to N(1, 1/100)

# ── What CLT enables: CI for any sample mean ─────────────
# Even if the original distribution is not normal,
# the 95% CI for the mean is: x̄ ± 1.96 · σ/√n
user_sessions = np.random.exponential(scale=5, size=200)  # skewed session data
n = len(user_sessions)
x_bar  = user_sessions.mean()
se     = user_sessions.std(ddof=1) / np.sqrt(n)
ci_low = x_bar - 1.96 * se
ci_hi  = x_bar + 1.96 * se
print(f"\nSession data (n={n}, Exp distributed):")
print(f"  Sample mean: {x_bar:.3f} min")
print(f"  95% CI (CLT): [{ci_low:.3f}, {ci_hi:.3f}]")
print(f"  True mean (λ=1/5=0.2 → mean=5):  5.000")

ML Link CLT justifies: (1) using z-test/t-test on any large-n data regardless of distribution. (2) The loss on a mini-batch is approximately Normal → gradient noise is approximately Normal → Adam's normalisation is well-motivated. (3) AUC/F1 confidence intervals via bootstrapping follow CLT for large test sets. (4) BatchNorm normalises activations to approximately N(0,1) per batch — CLT makes this meaningful even if individual activations are not Normal.

Trap Applying the CLT to small samples from heavily skewed distributions

The n≥30 rule assumes roughly symmetric populations. For a Pareto-distributed variable (income, wealth) with a power-law tail: even n=1000 may not be sufficient for the CLT to kick in. A/B tests on revenue per user — which is Pareto-distributed — often show spurious significance because the Normal approximation fails for small samples with high-value outliers.

Fix For skewed metrics: (1) Use bootstrap CIs instead of Normal-based CIs — bootstrap does not assume Normality. (2) Apply log transform to the metric before testing. (3) Use Mann-Whitney U test (non-parametric, no distributional assumption). (4) Pre-specify a minimum n based on the skewness of the metric, not just the generic n≥30 rule.

Trap Confusing the distribution of individual samples with the sampling distribution of the mean

Claim: "CLT says our data is normally distributed." This is wrong. CLT says the distribution of the sample mean is approximately Normal for large n. Individual data points from an Exponential distribution are still Exponential — the CLT does not apply to individual observations, only to their average.

Fix Be precise: CLT applies to the estimator (sample mean), not to the data. When using a t-test or z-test, you are testing whether the mean differs — the Normality assumption is on the sampling distribution of the mean, not on the raw data. For raw data normality, use Shapiro-Wilk or QQ-plots. For testing means of large samples, CLT gives you the freedom to use Normal-based tests even on non-Normal data.

CLT: Given iid X₁,...,Xₙ with mean μ and finite variance σ², the standardised sample mean √n(X̄−μ)/σ converges in distribution to N(0,1) as n→∞. What it does NOT say: (1) Individual observations are Normal — they follow the original distribution. (2) It works for any n — small samples from highly skewed distributions still require large n. (3) It works when the Xi have infinite variance — heavy-tailed distributions like Cauchy (undefined mean) or Pareto with tail index < 2 do not satisfy the CLT conditions. (4) The approximation is exact for any n — it is asymptotic. For moderate n with skewed distributions, the approximation may be poor.

Mini-batch gradient is the average of n per-sample gradients: ĝ = (1/n)Σᵢ ∇L(xᵢ,yᵢ). By CLT, this average is approximately N(∇R(θ), Σ/n) where Σ is the covariance of individual sample gradients and ∇R(θ) is the true gradient. Three consequences: (1) Mini-batch gradient is an unbiased estimator of the true gradient (by LLN). (2) Its variance decreases as 1/n — larger batches give less noisy gradient estimates. (3) Gradient noise is approximately Gaussian — this motivates Adam's use of first and second moment estimates (mean and variance of a Gaussian). The Gaussian gradient noise also adds implicit regularisation: model escapes sharp minima that true gradient descent would stay in.

Bootstrap Resampling — Non-Parametric CIs & Permutation Tests

Resample with replacement, percentile bootstrap CI, when to use over parametric CIs

Bootstrap is a simulation-based method to estimate the sampling distribution of any statistic, without needing to know the population distribution or derive a formula. Algorithm (Percentile Bootstrap CI): 1. From observed data x = {x₁,...,xₙ}, draw B bootstrap samples each of size n with replacement. 2. Compute the statistic θ̂* = T(x*) for each bootstrap sample. 3. The 95% CI is the [2.5th, 97.5th] percentile of θ̂*₁,...,θ̂*_B. Works for any statistic: mean, median, correlation, AUC, F1, NDCG — no formula required. Permutation test (two-sample): 1. Pool both groups. Randomly permute group labels B times. 2. Compute the test statistic (e.g. mean difference) for each permutation. 3. p-value = fraction of permutations with statistic ≥ observed value. → Non-parametric: makes no distributional assumption. Tests the null: labels are exchangeable. Limitations: Bootstrap assumes data is IID — fails for time-series without block bootstrap. B = 1000 is usually sufficient; B = 10,000 for very precise tail percentiles.

Python — bootstrap CI for median/AUC, permutation test, scipy bootstrap

import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

np.random.seed(42)

# ── Bootstrap CI for the median ───────────────────────────
# Data: right-skewed (parametric CI for mean won't work well)
data = np.random.lognormal(0, 1, 100)
B    = 5000

bootstrap_medians = np.array([
    np.median(np.random.choice(data, size=len(data), replace=True))
    for _ in range(B)
])

ci_low, ci_high = np.percentile(bootstrap_medians, [2.5, 97.5])
print(f"Sample median:       {np.median(data):.4f}")
print(f"Bootstrap 95% CI:  [{ci_low:.4f}, {ci_high:.4f}]")
print(f"Bootstrap SE:        {bootstrap_medians.std():.4f}")

# ── Bootstrap CI for AUC (model evaluation) ──────────────
# Simulate a binary classification result
n_test = 500
y_true  = np.random.randint(0, 2, n_test)
y_score = y_true * 0.7 + np.random.rand(n_test) * 0.3

def bootstrap_auc(y_true, y_score, B=2000):
    aucs = []
    idx  = np.arange(len(y_true))
    for _ in range(B):
        resample = np.random.choice(idx, size=len(idx), replace=True)
        if len(np.unique(y_true[resample])) < 2:
            continue   # skip if resample has only one class
        aucs.append(roc_auc_score(y_true[resample], y_score[resample]))
    return np.array(aucs)

auc_samples = bootstrap_auc(y_true, y_score)
print(f"\nAUC point estimate: {roc_auc_score(y_true, y_score):.4f}")
print(f"Bootstrap 95% CI:   [{np.percentile(auc_samples,2.5):.4f}, "
      f"{np.percentile(auc_samples,97.5):.4f}]")

# ── Permutation test: are two groups different? ───────────
group_a = np.random.normal(5.0, 1.5, 80)   # control
group_b = np.random.normal(5.4, 1.5, 80)   # treatment (small lift)
observed_diff = group_b.mean() - group_a.mean()

pooled = np.concatenate([group_a, group_b])
perm_diffs = np.array([
    np.random.permutation(pooled)[:80].mean() -
    np.random.permutation(pooled)[80:].mean()
    for _ in range(5000)
])
p_value = (np.abs(perm_diffs) >= np.abs(observed_diff)).mean()
print(f"\nPermutation test: observed diff={observed_diff:.4f}, p={p_value:.4f}")

# ── scipy built-in bootstrap ─────────────────────────────
result = stats.bootstrap(
    (data,), np.median, n_resamples=5000,
    confidence_level=0.95, method='percentile', random_state=42
)
print(f"\nscipy bootstrap CI: {result.confidence_interval}")

ML Link Bootstrap CI for AUC/F1/NDCG model metrics — standard in ML evaluation. Permutation tests for feature importance (shuffle feature → measure metric drop). Bootstrap for ensemble methods (Random Forest bagging IS bootstrap sampling). Cross-validation is related: it estimates the sampling distribution of model performance.

Trap Applying standard bootstrap to time-series or autocorrelated data

Standard bootstrap resamples observations independently with replacement, breaking temporal dependencies. For time-series data (user sessions over time, stock returns, sensor readings), consecutive observations are correlated — standard bootstrap destroys this structure and produces confidence intervals that are too narrow.

Fix Use block bootstrap for time-series: resample contiguous blocks of observations (e.g., blocks of 10 time steps) instead of individual points. sklearn.utils.resample with time-series caution. For A/B tests on time-series metrics, use paired bootstrap (compare the same time windows across treatment and control).

Trap Using too few bootstrap samples (B too small)

With B=100 bootstrap samples, the 2.5th and 97.5th percentiles are estimated from only 2-3 points each — the CI boundaries are unstable and change significantly with different random seeds. CI coverage is unreliable.

Fix Use B ≥ 1000 for standard CIs; B ≥ 5000 for precise tail percentiles (99% CI). scipy.stats.bootstrap defaults to 9999 resamples. The computational cost is usually trivial — a 1000-sample bootstrap on 500 data points completes in milliseconds.

Use bootstrap when: (1) The statistic has no closed-form sampling distribution (median, correlation, NDCG, AUC). (2) The data is clearly non-Normal and n is small (< 30). (3) You need a CI for a complex estimator (ratio, rank correlation). (4) You want a distribution-free guarantee without parametric assumptions. Use parametric CIs when: data is Normal or n is large enough for CLT (mean of n ≥ 30 from a not-too-skewed distribution). Parametric CIs are faster and analytically interpretable. Both approaches converge for large n. In ML evaluation: always use bootstrap for AUC, F1, NDCG confidence intervals — these metrics have no standard parametric CI formula.

Bagging (Bootstrap Aggregating) directly applies bootstrap resampling to ensemble learning. Each tree in a Random Forest is trained on a bootstrap sample of the training set (n observations drawn with replacement from n originals). Approximately 63.2% of original observations appear at least once in each bootstrap sample (1 − (1−1/n)ⁿ → 1−1/e ≈ 0.632). The remaining ~36.8% are out-of-bag (OOB) samples — held-out by chance — which enables free validation without a separate validation set. The bootstrap creates tree diversity (each tree sees different data), which is what drives the variance reduction in the ensemble.

The CLT does not say samples are normal. It says the sampling distribution of the mean is normal. This distinction is critical: you can apply t-tests to non-normal populations as long as n is large enough — because the sample mean itself will be normally distributed.

Estimation 05–06

Confidence Intervals — Frequentist, t, and Bootstrap

A confidence interval answers: given data, what range of parameter values is consistent with what we observed? CIs are central to ML evaluation — every AUC, NDCG, or F1 score should be reported with a CI, not just a point estimate. Understanding the difference between "95% CI" and "95% probability the parameter is in this interval" is one of the most important conceptual hurdles in statistics.

Frequentist CI Interpretation & z-Interval

What a CI actually means, z-interval formula: x̄ ± z_{α/2}·σ/√n, CI width and sample size trade-off

A 95% Confidence Interval is a range computed from sample data such that, if you repeated the experiment many times, 95% of computed intervals would contain the true parameter. z-interval (known σ, or large n by CLT): CI = x̄ ± z_{α/2} · σ/√n Where: x̄ = sample mean z_{α/2} = z-score for desired confidence (1.96 for 95%, 2.576 for 99%) σ/√n = standard error (margin of error shrinks with more data) CI width trade-offs: Wider CI → higher coverage but less precise To halve the margin of error: quadruple sample size (ME ∝ 1/√n) Common misconceptions: ✗ "There is 95% probability the true mean is in [a, b]" ✗ "95% of data points fall in [a, b]" ✓ "This procedure produces intervals containing μ 95% of the time" ✓ "The data is consistent with all parameter values in [a, b]"

  95% CI Interpretation (100 repeated experiments):

  True μ ─────────────────────────────── │
  ────[═══]──────────────── experiment 1 │ 95 of 100
  ────────[═══]────────────   ...        │ intervals
  ──[═══]────────────────── experiment N │ contain μ
  ──────────────[══]───────  ← this misses
  ─────────────────────────────────────────────────
  ME = z_{α/2} × σ/√n;  quadruple n → halve ME

Python — z-interval, CI for proportions, width vs sample size

import numpy as np
from scipy import stats

np.random.seed(42)

# ── z-interval (large n, σ known or use CLT) ──────────────
true_mean, true_std = 5.0, 2.0
n      = 200
sample = np.random.normal(true_mean, true_std, n)
x_bar  = sample.mean()
se     = true_std / np.sqrt(n)
z      = stats.norm.ppf(0.975)          # 1.96 for 95% CI

ci_low  = x_bar - z * se
ci_high = x_bar + z * se
print(f"Sample mean: {x_bar:.4f}")
print(f"95% z-CI:    ({ci_low:.4f}, {ci_high:.4f})")
print(f"Contains true mean? {ci_low < true_mean < ci_high}")

# ── CI for proportion ──────────────────────────────────────
clicks, total = 470, 1000
p_hat  = clicks / total
se_p   = np.sqrt(p_hat * (1 - p_hat) / total)
ci_p   = (p_hat - z*se_p, p_hat + z*se_p)
print(f"\nCTR: {p_hat:.3f}  95% CI: ({ci_p[0]:.3f}, {ci_p[1]:.3f})")

# ── CI width vs sample size ────────────────────────────────
print("\nSample size vs margin of error (σ=2, 95% CI):")
for n in [25, 100, 400, 1600]:
    me = z * true_std / np.sqrt(n)
    print(f"  n={n:5d}: ME = {me:.4f}")
# Quadrupling n halves ME — the square-root law

ML Link CI for AUC/NDCG/F1 is standard in ML evaluation. Bootstrap CI for AUC (DeLong method or scipy.stats.bootstrap). A/B test results: always report CI on the metric difference, not just p-value. CI width tells you if you have enough data to make a decision — a wide CI means more traffic or labels needed.

Trap Claiming "95% probability the parameter is in this interval"

This is a Bayesian credible interval statement, not a frequentist CI. The frequentist CI is about the procedure: 95% of CIs constructed this way contain the true value. Once you have a specific CI like (0.62, 0.71), the true AUC is either in it or not — there is no longer any probability involved.

Fix Use correct frequentist language: "We are 95% confident" (refers to the method). Or switch to Bayesian credible intervals (PyMC, Stan) if you want probability statements about the parameter. The distinction matters in Bayesian A/B testing.

Trap Not reporting CI for ML model metrics

Reporting "Model A AUC = 0.84, Model B AUC = 0.82" without CIs. The difference of 0.02 may or may not be statistically meaningful depending on the test-set size. With n=100 test samples, AUC CIs are wide (~±0.05); with n=10,000 they are tight (~±0.005).

Fix Use bootstrap CI for AUC/F1/NDCG: sklearn.utils.resample 1000 times, compute metric each resample, take percentile 2.5 and 97.5. Or use scipy.stats.bootstrap with method="BCa" for better small-sample accuracy.

0.83 falls inside (0.81, 0.87), so your model is statistically consistent with 0.83. But overlapping individual CIs do NOT mean the difference is non-significant — that is a common error. Compute the CI for the difference AUC_A − AUC_B using paired bootstrap (same test-set indices). If that CI excludes 0, the difference is significant. Never compare models by whether their individual CIs overlap.

ME = z·σ/√n, so going from n to 4n changes √n to 2√n, cutting ME in half. This square-root relationship means large samples are expensive: to cut uncertainty by 10× you need 100× more data. It is the statistical reason why ML experiments need large held-out test sets and why AUC CIs are wide with only a few hundred test examples.

t-Interval (Unknown σ, Small n) & Welch

t-distribution, heavier tails, degrees of freedom, Bessel's correction ddof=1, two-sample Welch CI

When σ is unknown (the typical case), estimating it with s introduces extra uncertainty — the test statistic follows a t-distribution, not Normal. t-interval formula: CI = x̄ ± t_{α/2, n-1} · s/√n Key properties of t-distribution vs Normal: - Heavier tails → wider CIs for small n (accounts for uncertainty in estimating σ) - Parameterised by df = n − 1 - As n → ∞: t_{n-1} → N(0,1) - t_{0.975, df=3} = 3.18 vs z_{0.975} = 1.96 Bessel's correction (ddof=1): s² = Σ(xᵢ − x̄)² / (n−1) ← unbiased estimator of σ² Using n in the denominator underestimates σ² → CIs too narrow np.std(x, ddof=0) = biased | np.std(x, ddof=1) = unbiased ← use this pandas .std() uses ddof=1 by default (inconsistent with numpy!) Welch's t-test (two-sample, unequal variance) — use as default: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂) df = Welch's approximation (Satterthwaite) Valid whether or not variances are equal

  t vs Normal (heavier tails for small df):

  Normal (z, df=∞)       t (df=3)
      ▄██▄                    ▄██▄
    ▄██████▄               ▄██████▄
  ▄██████████▄           ▄██████████▄
▄████████████████▄   ▄███████████████████▄
──────────────────   ─────────────────────
z_{0.975} = 1.96     t_{0.975,df=3} = 3.18
                     ← heavier tails, wider CI

Python — t-interval, Bessel's correction, Welch two-sample CI

import numpy as np
from scipy import stats

np.random.seed(42)

# ── One-sample t-interval ──────────────────────────────────
sample = np.random.normal(loc=5.0, scale=2.0, size=15)
n      = len(sample)
x_bar  = sample.mean()
s      = sample.std(ddof=1)      # Bessel's correction
se     = s / np.sqrt(n)

ci = stats.t.interval(0.95, df=n-1, loc=x_bar, scale=se)
print(f"n={n}, x̄={x_bar:.3f}, s={s:.3f}")
print(f"95% t-CI: ({ci[0]:.3f}, {ci[1]:.3f})")

t_crit = stats.t.ppf(0.975, df=n-1)
z_crit = stats.norm.ppf(0.975)
print(f"t_crit={t_crit:.3f}  z_crit={z_crit:.3f}  (t > z → wider CI)")

# ── Bessel's correction comparison ────────────────────────
print(f"\nddof=0 (biased):   {sample.var(ddof=0):.4f}")
print(f"ddof=1 (unbiased): {sample.var(ddof=1):.4f}")

# ── Two-sample Welch t-CI (A/B test) ──────────────────────
control   = np.random.normal(0.50, 0.05, 100)
treatment = np.random.normal(0.53, 0.06, 120)

result   = stats.ttest_ind(treatment, control, equal_var=False)  # Welch
diff     = treatment.mean() - control.mean()
se_diff  = np.sqrt(treatment.var(ddof=1)/len(treatment) +
                   control.var(ddof=1)/len(control))
df_welch = se_diff**4 / (
    (treatment.var(ddof=1)/len(treatment))**2/(len(treatment)-1) +
    (control.var(ddof=1)/len(control))**2/(len(control)-1)
)
t_w = stats.t.ppf(0.975, df=df_welch)
ci_diff = (diff - t_w*se_diff, diff + t_w*se_diff)
print(f"\nΔ mean: {diff:.4f}")
print(f"95% Welch CI: ({ci_diff[0]:.4f}, {ci_diff[1]:.4f})")
print(f"p = {result.pvalue:.4f}")

ML Link Welch's t-test is the default for A/B tests comparing metric means. t-distribution underlies Bayesian robust regression (Student-t likelihood for heavy-tailed noise). Small test-set evaluation: t-CIs are critical with only 50-200 examples — always report them alongside point metrics.

Trap Using np.std(x) without ddof=1 for sample statistics

numpy default is ddof=0 (population formula), which underestimates σ². Pandas .std() uses ddof=1. This inconsistency causes bugs where variance computed with numpy is slightly smaller than with pandas, making CIs too narrow and p-values too small. The bias is small for large n but matters for n < 30.

Fix Always explicitly set ddof=1 for sample statistics: np.std(x, ddof=1), np.var(x, ddof=1). Or use scipy.stats.sem(x) for standard error which uses ddof=1 by default. Document the convention in team code standards.

Trap Using Student's t-test (equal_var=True) when group variances differ

Student's t-test assumes equal variances. In A/B tests, treatment groups often have different variance (e.g., a new feature some users love and others ignore). Using equal_var=True when variances differ inflates Type I error — you get more false positives than your α would suggest.

Fix Default to Welch's (equal_var=False). It is valid whether or not variances are equal, with only marginally lower power when they are. Do not run Levene's test first to decide — that two-stage procedure inflates error rates.

The t-distribution accounts for uncertainty in estimating σ from data. With small n, s is a noisy estimator of σ, making extreme outcomes more likely — hence heavier tails. As df = n−1 increases, the estimate of σ improves and t_{df} → N(0,1). For n > 30 the difference is small; scipy always uses the exact t-distribution regardless of sample size.

Default to unequal variances (Welch's, equal_var=False). Treatment effects often change variance as well as mean (e.g., a new recommendation changes revenue for some segments but not others). Welch's controls Type I error correctly in both cases; Student's fails when variances differ. There is virtually no cost to using Welch's when variances are actually equal.

Bootstrap CI for ML Metrics (AUC, F1, NDCG)

When bootstrap beats parametric CI, percentile vs BCa, paired bootstrap for model comparison

Bootstrap CIs are the standard for ML metrics because: 1. AUC, F1, NDCG, Precision@K have no closed-form sampling distribution 2. Test sets are often small (n < 1000) making Normal approximations suspect 3. Bootstrap works for any statistic without distributional assumptions Percentile Bootstrap (simplest): 1. Resample test set with replacement B times (B = 1000–5000) 2. Compute metric on each resample → B values 3. CI = [percentile 2.5, percentile 97.5] BCa Bootstrap (preferred for production): Corrects for bias and non-constant variance in the bootstrap distribution More accurate for small n and skewed statistics Available: scipy.stats.bootstrap(method='BCa') DeLong's method (AUC specific): Analytical CI for AUC — no simulation needed Faster, exact, but AUC-only Paired bootstrap for model comparison: For each of B resamples compute Δ = metric_A − metric_B CI for Δ excluding 0 → statistically significant improvement

Python — bootstrap CI for AUC, BCa via scipy, paired bootstrap model comparison

import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample as sk_resample

np.random.seed(42)

n_test  = 500
y_true  = np.random.binomial(1, 0.3, n_test)
y_prob_A = np.clip(y_true*0.7 + np.random.normal(0, 0.2, n_test), 0, 1)
y_prob_B = np.clip(y_true*0.65 + np.random.normal(0, 0.2, n_test), 0, 1)

# ── Percentile bootstrap CI for AUC ───────────────────────
B = 2000
auc_boot = []
for _ in range(B):
    idx = sk_resample(np.arange(n_test))
    if 0 < y_true[idx].sum() < len(idx):
        auc_boot.append(roc_auc_score(y_true[idx], y_prob_A[idx]))
ci_pct = np.percentile(auc_boot, [2.5, 97.5])
print(f"AUC: {roc_auc_score(y_true, y_prob_A):.4f}")
print(f"95% Percentile CI: ({ci_pct[0]:.4f}, {ci_pct[1]:.4f})")

# ── BCa via scipy (preferred) ──────────────────────────────
res = stats.bootstrap(
    (y_true, y_prob_A),
    statistic=lambda yt, yp: roc_auc_score(yt, yp),
    n_resamples=2000, method='BCa',
    confidence_level=0.95, paired=True
)
print(f"BCa CI: ({res.confidence_interval.low:.4f}, {res.confidence_interval.high:.4f})")

# ── Paired bootstrap: Model A vs Model B ──────────────────
delta_boot = []
for _ in range(B):
    idx = sk_resample(np.arange(n_test))
    if 0 < y_true[idx].sum() < len(idx):
        a = roc_auc_score(y_true[idx], y_prob_A[idx])
        b = roc_auc_score(y_true[idx], y_prob_B[idx])
        delta_boot.append(a - b)
ci_delta = np.percentile(delta_boot, [2.5, 97.5])
print(f"\nΔ AUC (A−B): {roc_auc_score(y_true,y_prob_A)-roc_auc_score(y_true,y_prob_B):.4f}")
print(f"95% CI for Δ: ({ci_delta[0]:.4f}, {ci_delta[1]:.4f})")
print(f"Statistically significant? {ci_delta[0] > 0}")

ML Link Bootstrap CI is standard for leaderboard comparisons (Kaggle, NLP benchmarks). Paired bootstrap for model comparison avoids spurious significance of individual CI comparison. Recommendation systems: bootstrap CI for NDCG@K, MRR, MAP. NLP: bootstrap for BLEU, ROUGE, BERTScore. Feature importance: permutation test is a bootstrap-like procedure.

Trap Deciding model superiority by checking if individual CIs overlap

If Model A CI is (0.81, 0.87) and Model B CI is (0.83, 0.89), the CIs overlap, but this does NOT mean the difference is non-significant. Individual CI overlap is a conservative test — it fails to detect real differences. The CI for the difference (A−B) is what matters.

Fix Always compute CI for the DIFFERENCE using paired bootstrap (same test-set indices for both models). If the CI for Δ excludes 0, the difference is significant. This is more powerful because it accounts for correlation between model scores on the same test samples.

DeLong is faster (analytical, no simulation) and exact for AUC. Use it when comparing two correlated AUCs on the same test set — DeLong accounts for correlation and is more powerful than naive bootstrap. Use bootstrap when: comparing other metrics (F1, NDCG, Precision@K) that have no analytical CI; the test set is small and you want BCa correction; you want a consistent methodology across all metrics.

B = 1000 is sufficient for percentile CI in most ML applications. For BCa or very wide percentile ranges (99.9%), use B = 5000–10000. The variance of the bootstrap CI estimate decreases as 1/√B, so beyond B = 10000 gains are minimal. For fast development iteration, B = 500 is acceptable. Always use the same B when comparing models for reproducibility.

"A 95% confidence interval contains the true parameter 95% of the time" — this refers to the procedure, not the interval. Once computed, a specific CI either contains the true value or it doesn't. The confidence is in the repeated-sampling process, not in the specific numbers you computed.

MLE, MAP & Bayesian Estimation

Maximum Likelihood Estimation is the single most important estimation framework in ML — it underlies logistic regression, neural network training, GMMs, and virtually every model with a probabilistic interpretation. MAP adds a prior, and suddenly L2 and L1 regularisation become Bayesian principles in disguise.

Maximum Likelihood Estimation (MLE)

Likelihood function, log-likelihood, MLE for Bernoulli and Normal, cross-entropy = negative log-likelihood

MLE finds θ that makes the observed data most probable. Likelihood (iid observations): L(θ; x₁,...,xₙ) = ∏ᵢ P(xᵢ | θ) Log-likelihood (preferred — avoids numerical underflow, turns product → sum): ℓ(θ) = Σᵢ log P(xᵢ | θ) θ̂_MLE = argmax_θ ℓ(θ) MLE for Bernoulli (logistic regression, coin flip): ℓ(p) = Σᵢ [xᵢ log p + (1−xᵢ) log(1−p)] dℓ/dp = 0 → p̂_MLE = sample proportion = Σxᵢ/n ✓ MLE for Normal (mean estimation): ℓ(μ) = −n/2 log(2πσ²) − (1/2σ²) Σ(xᵢ−μ)² dℓ/dμ = 0 → μ̂_MLE = x̄ ✓ Cross-entropy = negative log-likelihood for Bernoulli: H(y, ŷ) = −Σᵢ [yᵢ log ŷᵢ + (1−yᵢ) log(1−ŷᵢ)] Minimising cross-entropy ≡ MLE for a Bernoulli model. Neural network training with cross-entropy loss IS MLE of weights.

Python — MLE via scipy.optimize, log-likelihood surface, cross-entropy = NLL

import numpy as np
from scipy import stats, optimize

np.random.seed(42)

# ── MLE for Normal: manual optimisation ───────────────────
data = np.random.normal(loc=3.5, scale=1.2, size=100)

def neg_log_likelihood(params, x):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)       # ensure σ > 0
    return -stats.norm.logpdf(x, mu, sigma).sum()

result    = optimize.minimize(neg_log_likelihood, x0=[0., 0.], args=(data,))
mu_mle    = result.x[0]
sigma_mle = np.exp(result.x[1])
print(f"MLE μ = {mu_mle:.4f}  (true 3.5, x̄ = {data.mean():.4f})")
print(f"MLE σ = {sigma_mle:.4f}  (true 1.2, s = {data.std(ddof=1):.4f})")

# ── scipy built-in fit ─────────────────────────────────────
mu_fit, std_fit = stats.norm.fit(data)
print(f"scipy.fit: μ={mu_fit:.4f}, σ={std_fit:.4f}")

# ── Cross-entropy = negative log-likelihood ───────────────
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([0.8, 0.3, 0.9, 0.7, 0.2, 0.6])

cross_entropy = -np.mean(
    y_true * np.log(y_pred + 1e-8) + (1-y_true) * np.log(1-y_pred + 1e-8)
)
neg_log_lik = -stats.bernoulli.logpmf(y_true, y_pred).mean()
print(f"\nCross-entropy    = {cross_entropy:.6f}")
print(f"Neg log-lik      = {neg_log_lik:.6f}")
print(f"Equal? {np.isclose(cross_entropy, neg_log_lik)}")
# → Identical: minimising CE IS MLE for Bernoulli model

ML Link MLE is the training objective for: logistic regression (Bernoulli MLE), linear regression with MSE (Gaussian MLE), neural networks with cross-entropy (Bernoulli MLE), GMMs via EM (Gaussian mixture MLE). Any model minimising −log P(y|x,θ) is doing MLE.

Trap MLE overfits with limited data — no regularisation

With small n, MLE for a complex model can fit training data perfectly but generalise poorly. For sparse categories (rare classes, small vocabularies), MLE assigns zero probability to unseen events. This is why naive Bayes classifiers predict 0% probability for unseen words without smoothing.

Fix Add a prior → MAP estimation. Laplace smoothing (add 1 to counts) is MAP with a Dirichlet prior. L2 weight decay is MAP with Gaussian prior. MLE is fine with large datasets; for small data, always regularise.

Three reasons: (1) Products of small probabilities underflow numerically — ∏ᵢ P(xᵢ|θ) with n=1000 at probability 0.9 each = 0.9^1000 ≈ 10⁻⁴⁶. Log converts this to a sum. (2) Gradient-based optimisers minimise by convention. (3) Log turns the product into a sum, which is differentiable and separable by term. The argmax of ℓ equals the argmax of L because log is monotonically increasing.

For yᵢ ∼ N(f(xᵢ;θ), σ²), log-likelihood is: ℓ(θ) = −n/2·log(2πσ²) − (1/2σ²)·Σ(yᵢ−f(xᵢ;θ))². Maximising over θ is equivalent to minimising Σ(yᵢ−f(xᵢ;θ))² — exactly MSE. The other terms are constants in θ. Conclusion: training a regressor with MSE = MLE under Gaussian noise.

MAP Estimation & Bayesian Credible Intervals

Prior × likelihood, Gaussian prior = L2, Laplace prior = L1, credible interval vs frequentist CI

MAP finds θ maximising the posterior P(θ|data) ∝ P(data|θ) · P(θ): θ̂_MAP = argmax_θ [ℓ(θ) + log P(θ)] ──────────────────── log-likelihood + log-prior Regularisation as MAP: Gaussian prior N(0, 1/λ): log P(θ) ∝ −λ||θ||² MAP objective: ℓ(θ) − λ||θ||² → L2 regularisation ✓ Laplace prior Lap(0, 1/λ): log P(θ) ∝ −λ||θ||₁ MAP objective: ℓ(θ) − λ||θ||₁ → L1 regularisation ✓ Laplace has a sharp peak at 0 → encourages sparse solutions ✓ Full Bayes vs MAP: MAP: single point estimate (mode of posterior) — fast Full Bayes: entire posterior P(θ|data) — uncertainty-aware Bayesian Credible Interval: Region R such that P(θ ∈ R | data) = 0.95 → This IS a probability statement about θ (unlike frequentist CI) Beta-Bernoulli conjugacy: Prior: θ ∼ Beta(α, β) Likelihood: k successes in n trials Posterior: θ|data ∼ Beta(α+k, β+n−k) [conjugate update] Posterior mean: (α+k)/(α+β+n) [shrinks toward prior mean]

  Frequentist CI vs Bayesian Credible Interval:

  Frequentist 95% CI          Bayesian 95% Credible
  ─────────────────────       ──────────────────────────
  Procedure produces          P(θ ∈ [a,b] | data) = 0.95
  intervals containing μ      Given THIS data, θ has 95%
  95% of the time.            posterior probability of
  θ is fixed; CI is random.   being in [a,b].
                              θ is random (has a prior).

Python — MAP as regularised logistic regression, Beta-Bernoulli Bayesian updating

import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

np.random.seed(42)

# ── MAP = L2-regularised logistic regression ───────────────
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
# C = 1/λ in sklearn; smaller C → stronger prior (more regularisation)
for C, label in [(0.1, 'Strong prior  (MAP C=0.1)'),
                 (1.0, 'Moderate prior (MAP C=1)'),
                 (1e6, 'No prior       (MLE C=1e6)')]:
    lr = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    lr.fit(X, y)
    print(f"{label}: max|w| = {np.abs(lr.coef_[0]).max():.4f}")

# ── Beta-Bernoulli Bayesian update ────────────────────────
alpha0, beta0 = 2, 5          # prior: Beta(2,5) → expect ~28% CTR
clicks, total = 47, 120       # observed data

alpha_post = alpha0 + clicks
beta_post  = beta0 + (total - clicks)
post       = stats.beta(alpha_post, beta_post)

map_est  = (alpha_post - 1) / (alpha_post + beta_post - 2)
post_mean = alpha_post / (alpha_post + beta_post)
mle_est  = clicks / total

print(f"\nPrior mean:     {alpha0/(alpha0+beta0):.4f}")
print(f"MLE:            {mle_est:.4f}")
print(f"MAP:            {map_est:.4f}")
print(f"Posterior mean: {post_mean:.4f}  (shrinks toward prior)")

# ── 95% Bayesian Credible Interval ─────────────────────────
cred = post.interval(0.95)
print(f"95% Credible Interval: ({cred[0]:.4f}, {cred[1]:.4f})")
print("→ P(CTR ∈ CI | data) = 0.95  [probability statement about θ]")

ML Link L2 weight decay (torch.optim weight_decay) = MAP with N(0,1/λ) prior. Dropout ≈ approximate Bayesian inference (MC Dropout for uncertainty). Thompson sampling in bandits uses Beta posterior exactly. Bayesian hyperparameter optimisation uses Gaussian Process posterior. Laplace approximation: fit Gaussian to posterior around MAP for neural network uncertainty.

Trap "P(θ is in the CI) = 0.95" when using frequentist CIs

In frequentist statistics, θ is a fixed constant — it does not have a probability distribution. The statement "P(θ ∈ CI) = 0.95" only makes sense in a Bayesian framework where θ has a prior. Using Bayesian language for frequentist CIs misleads stakeholders and can cause incorrect decisions in A/B testing (e.g., claiming 95% probability treatment is better when the CI is frequentist).

Fix For frequentist CIs: "We are 95% confident the true value lies in [a, b]" (confidence refers to the procedure). For probability statements about the parameter, use Bayesian A/B testing with PyMC, Stan, or Pyro.

C = 1/λ where λ is the L2 regularisation strength. The MAP interpretation: weights are assumed drawn from N(0, C). Small C (e.g., 0.01) → strong Gaussian prior → weights must stay near 0 → high bias, low variance. Large C (e.g., 1000) → weak prior → close to MLE → weights can be large → overfitting risk. Tuning C with cross-validation is Bayesian model selection: you are finding the prior variance that best fits the data.

MAP returns the mode of the posterior — a single point estimate. Full Bayes computes the entire posterior P(θ|data), enabling credible intervals and predictive distributions that propagate parameter uncertainty. MAP is fast and sufficient when the posterior is unimodal and roughly symmetric. Full Bayes is needed for multi-modal posteriors, small datasets, or when prediction uncertainty is critical (medical diagnosis, autonomous systems, Bayesian neural networks).

When you train a neural network with cross-entropy loss, you are doing MLE. When you add L2 weight decay, you are doing MAP with a Gaussian prior. When you add L1 regularisation, you are doing MAP with a Laplace prior. Regularisation is not a computational trick — it is a principled probabilistic statement about your beliefs about the weights before seeing data.

Hypothesis Testing 07–08

Hypothesis Tests, p-values & t-tests

Hypothesis testing is the formal framework for making decisions from data. p-values are misunderstood more than almost any other statistical concept — even by experienced researchers. In DS/AI, hypothesis tests are the backbone of A/B testing, feature significance analysis, model comparison, and drift detection.

Null/Alternative Hypothesis, p-value, Type I/II Errors

H₀ and H₁, p-value definition, α significance level, Type I (false positive) and Type II (false negative) errors, power

Hypothesis testing framework: H₀ (Null): default assumption — "no effect", "no difference" H₁ (Alternative): the claim you want to support with data p-value: P(observing data this extreme or more | H₀ is true) Small p (< α) → reject H₀ (data is surprising under H₀) Large p (≥ α) → fail to reject H₀ (insufficient evidence) α = significance level (Type I error rate), typically 0.05 Decision table: │ H₀ True │ H₀ False (H₁ True) ─────────────┼────────────────────┼───────────────────── Reject H₀ │ Type I Error (α) │ Correct (Power=1−β) (p < α) │ False Positive │ True Positive ─────────────┼────────────────────┼───────────────────── Fail to │ Correct │ Type II Error (β) reject H₀ │ True Negative │ False Negative (p ≥ α) │ │ Statistical ≠ practical significance: p < 0.05 with n = 1,000,000 can detect trivially small effects Always report effect size (Cohen's d, relative lift) with p-value p-value misconceptions: ✗ "p = 0.03 → 3% chance H₀ is true" ✗ "p = 0.06 → effect almost real" ✗ "p > 0.05 proves H₀" (absence of evidence ≠ evidence of absence) ✓ "p = 0.03 → if H₀ were true, we'd see this result only 3% of the time"

  Type I vs Type II Error:

  H₀ distribution (μ=0)     H₁ distribution (μ=δ)
       ┌──────┐                    ┌──────┐
      ░│      │░                  ░│      │░
     ░░│      │░░│               ░░│      │░░
    ░░░│      │░░│ critical     ░░░│      │░░░
   ░░░░│      │░░│ value       ░░░░│      │░░░
  ─────────────────┼─────────────────────────────
      Type I (α) → │ ← Type II (β)   Power = 1−β →

Python — p-value simulation under H₀, Type I rate verification, Cohen's d

import numpy as np
from scipy import stats

np.random.seed(42)
n_exp, n, alpha = 10_000, 50, 0.05

# ── Type I error rate under H₀ ────────────────────────────
# When H₀ is true, p-values are Uniform(0,1) → P(p<0.05) = 0.05
p_null = [
    stats.ttest_ind(np.random.normal(0,1,n), np.random.normal(0,1,n)).pvalue
    for _ in range(n_exp)
]
print(f"Type I error rate: {np.mean(np.array(p_null) < alpha):.4f}  (expected {alpha})")

# ── Power under H₁ (Cohen's d = 0.5) ─────────────────────
p_alt = [
    stats.ttest_ind(np.random.normal(0.5,1,n), np.random.normal(0,1,n)).pvalue
    for _ in range(n_exp)
]
power = np.mean(np.array(p_alt) < alpha)
print(f"Power (d=0.5, n={n}):  {power:.4f}  (target ≥ 0.80)")
# n=50, d=0.5 → power ≈ 0.70 → underpowered

# ── p-values under H₀ are Uniform(0,1) ─────────────────────
print(f"p<0.01 under H₀: {np.mean(np.array(p_null)<0.01):.4f}  (expected 0.01)")
print(f"p<0.10 under H₀: {np.mean(np.array(p_null)<0.10):.4f}  (expected 0.10)")

# ── Effect size (Cohen's d) ────────────────────────────────
g1 = np.random.normal(5.2, 1.5, 100)
g2 = np.random.normal(4.8, 1.5, 100)
pooled_std = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
d  = (g1.mean() - g2.mean()) / pooled_std
t_stat, p_val = stats.ttest_ind(g1, g2)
print(f"\nt={t_stat:.3f}, p={p_val:.4f}, Cohen's d={d:.3f}")
print(f"Effect size: {'small' if abs(d)<0.5 else 'medium' if abs(d)<0.8 else 'large'}"),

ML Link Type I error = false positive in feature importance (spurious correlations). Type II error = missing a real model improvement in an underpowered A/B test. In ML: p-value for feature significance (chi-squared, t-test), drift detection (KS test), model comparison. Peeking at results inflates Type I error — use sequential testing methods.

Trap p-hacking: testing multiple variants until p < 0.05

If you run an A/B test, get p=0.07, collect more data, and check again, the effective Type I error rate can reach 20% even though α=0.05. With 5 interim looks, P(at least one false positive) ≈ 1−0.95⁵ = 23%. This is endemic in A/B testing where analysts check dashboards daily.

Fix Pre-register the experiment: fix n, primary metric, and analysis plan BEFORE running. Analyse once at the pre-specified sample size. If early stopping is needed, use sequential testing (SPRT, mSPRT, or always-valid p-values) — these correct for multiple looks.

Trap Reporting p < 0.05 without effect size

With n=1,000,000 web users, even a 0.001% CTR improvement will be statistically significant. But a 0.001% absolute lift may be economically meaningless. "Statistically significant (p=0.002)" without effect size misleads stakeholders into shipping meaningless changes.

Fix Always report: p-value AND effect size (Cohen's d, relative lift %) AND CI for the effect. Stakeholders care about the magnitude and uncertainty, not just whether p crossed 0.05.

It tells you: if the null hypothesis were true, there is a 3% chance of observing data this extreme or more extreme by chance. It does NOT tell you: (a) the probability H₀ is true (requires a Bayesian prior); (b) the size of the effect; (c) whether the finding is practically significant; (d) whether it will replicate. Interpret p-values alongside effect size, sample size, and pre-registered protocol.

p = 0.08 means insufficient evidence to reject H₀ at α=0.05. Options: (1) If you ran the pre-specified sample size, accept the null — the experiment is done. (2) Report the CI: if lift CI is (−0.5%, +3%), the true effect might still be meaningful but uncertain. (3) If underpowered, specify a new stopping rule and re-run. Do not retroactively lower α or cherry-pick the analysis. The ship decision should be a business risk call, not statistical gymnastics.

One-sample, Two-sample & Paired t-tests

Test statistic formula for each variant, Welch vs Student, one-tailed vs two-tailed, assumptions

Three t-test variants — all test whether a mean (or mean difference) equals zero: 1. One-sample t-test H₀: μ = μ₀ (mean equals a known reference value) t = (x̄ − μ₀) / (s/√n), df = n−1 Use: "Is model inference latency significantly above 200ms?" 2. Two-sample Welch's t-test ← default for A/B tests H₀: μ₁ = μ₂ (two independent groups have equal means) t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), df = Welch's approx Use: comparing CTR between treatment and control groups 3. Paired t-test H₀: μ_d = 0 (mean difference = 0) dᵢ = xᵢ_after − xᵢ_before, t = d̄ / (s_d/√n) Use: before/after intervention on the same users More powerful than two-sample when within-pair correlation is positive Assumptions for all t-tests: Independence within each group Approximately Normal sampling distribution (CLT handles large n) Welch's: does NOT require equal variance — use by default Paired: requires matched observations (same units) One-tailed vs two-tailed: Two-tailed (default): H₁: μ ≠ μ₀ — tests both directions One-tailed: H₁: μ > μ₀ — pre-specify direction; lower p threshold but higher false positive risk

Python — one-sample, Welch two-sample, paired t-tests with scipy.stats

import numpy as np
from scipy import stats

np.random.seed(42)

# ── One-sample t-test: latency > 200ms? ───────────────────
latency = np.random.normal(210, 30, 50)
t_stat, p_two = stats.ttest_1samp(latency, popmean=200)
t_stat, p_one = stats.ttest_1samp(latency, popmean=200, alternative='greater')
print(f"One-sample (two-tailed): t={t_stat:.3f}, p={p_two:.4f}")
print(f"One-sample (one-tailed): t={t_stat:.3f}, p={p_one:.4f}")

# ── Two-sample Welch t-test: A/B test CTR ─────────────────
control   = np.random.binomial(1, 0.42, 500)
treatment = np.random.binomial(1, 0.46, 500)

t_stat, p_val = stats.ttest_ind(treatment, control, equal_var=False)
print(f"\nA/B: ctrl={control.mean():.4f}, treat={treatment.mean():.4f}")
print(f"Welch: t={t_stat:.3f}, p={p_val:.4f}, sig={p_val<0.05}")

# ── Paired t-test: before/after same users ────────────────
before = np.random.normal(50, 10, 100)
after  = before + np.random.normal(3, 5, 100)

t_stat, p_paired = stats.ttest_rel(after, before)
print(f"\nPaired: t={t_stat:.3f}, p={p_paired:.4f}")

# Equivalent: one-sample t-test on differences
diffs = after - before
t_stat2, p_val2 = stats.ttest_1samp(diffs, popmean=0)
print(f"Match: {np.isclose(t_stat, t_stat2)}")
print(f"Within-pair correlation: {np.corrcoef(before, after)[0,1]:.4f}")
print("High correlation → paired test more powerful than two-sample")

ML Link Two-sample Welch: standard for A/B tests. Paired: before/after feature engineering on same dataset split, cross-validation fold comparison. One-sample: is model calibration error significantly different from 0? McNemar's test is the paired analogue for comparing classifier predictions on the same test set.

Trap Using equal_var=True (Student's t-test) in scipy

scipy.stats.ttest_ind defaults to equal_var=True (Student's t-test). If A/B groups have unequal sample sizes (90/10 traffic split) or unequal variance (very common when the treatment changes user behaviour differently for different segments), Student's t-test is anti-conservative and inflates Type I error. This is a silent bug that can cause you to ship changes with no real effect.

Fix Always set equal_var=False: stats.ttest_ind(a, b, equal_var=False). There is virtually no reason to use Student's t-test in practice.

Trap Using two-sample test on paired data (same users before/after)

If the same 100 users are measured before and after an intervention, the observations are not independent. Using a two-sample t-test ignores within-user correlation, throwing away statistical power. You may miss an effect the paired test would detect clearly.

Fix Ask: are the two groups the same units measured twice (→ paired) or different units (→ two-sample)? High within-pair correlation amplifies the power advantage of the paired test.

In a standard A/B test, users are randomly assigned to one group — they are different people, so two-sample Welch's is correct. Paired applies when: testing the same user before and after an intervention (e.g., same user's revenue 7 days before vs 7 days after an email); crossover experiments where every user sees both variants sequentially; comparing two models on the same test examples (McNemar's test for classifiers). The key: is there a natural, meaningful pairing between observations?

Student's assumes equal variances (homoscedasticity). When variances differ — common in practice — Student's inflates Type I error while Welch's remains calibrated. When variances ARE equal, Welch's has only marginally less power (wider CI by a small amount). Since the asymmetric cost strongly favours Welch's (no downside when variances are equal, major downside when they're not), Welch's dominates as the default.

A p-value does not tell you the probability the null hypothesis is true. It tells you the probability of seeing data this extreme IF the null hypothesis were true. These are completely different statements. Confusing them leads to overconfidence in "statistically significant" results and underconfidence when p > 0.05.

Non-parametric Tests, Power Analysis & Multiple Testing

Not all data is continuous and normally distributed. Chi-squared handles categorical data; Mann-Whitney handles ordinal data or non-Normal distributions. Power analysis is what separates well-designed experiments from underpowered ones that waste resources. And multiple testing correction is what prevents you from finding "significance" in pure noise.

Chi-squared & Mann-Whitney U

χ² goodness-of-fit, χ² test of independence, Mann-Whitney for non-Normal/ordinal data, effect sizes

Chi-squared (χ²) — for categorical variables: Goodness-of-fit (does observed match expected distribution?): χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ, df = k − 1 H₀: data follows the expected distribution Use: are clicks uniformly distributed across 5 button positions? Test of independence (are two categorical variables independent?): χ² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ, df = (r−1)(c−1) Eᵢⱼ = (row total × col total) / n H₀: the two variables are independent Use: is user device type independent of subscription tier? Assumptions for χ²: Expected counts ≥ 5 in each cell (use Fisher's exact if < 5) Independent observations Mann-Whitney U (Wilcoxon rank-sum): Non-parametric alternative to two-sample t-test Tests whether one distribution is stochastically larger H₀: P(X > Y) = 0.5 Works for: ordinal data, non-Normal/skewed distributions, small n Effect size: rank-biserial correlation r = 1 − 2U/(n₁·n₂)

Python — chi-squared GOF, independence test, Mann-Whitney U with scipy

import numpy as np
from scipy import stats

np.random.seed(42)

# ── Chi-squared goodness-of-fit ────────────────────────────
# Are button clicks uniformly distributed across 5 positions?
observed = np.array([87, 103, 95, 78, 137])
expected = np.full(5, observed.sum() / 5)     # uniform expectation
chi2, p_gof = stats.chisquare(observed, expected)
print(f"GOF χ²={chi2:.3f}, p={p_gof:.4f}")
print(f"Uniform? {p_gof > 0.05}")

# ── Chi-squared test of independence ──────────────────────
# Device (mobile/desktop) × tier (free/paid)
contingency = np.array([[342, 158],
                         [271, 229]])
chi2, p_ind, dof, exp = stats.chi2_contingency(contingency)
print(f"\nIndependence χ²={chi2:.3f}, p={p_ind:.4f}, df={dof}")
print(f"Min expected count: {exp.min():.1f}  (need ≥ 5)")

# Cramér's V effect size
n    = contingency.sum()
v    = np.sqrt(chi2 / (n * (min(contingency.shape) - 1)))
print(f"Cramér's V = {v:.4f}  (0.1=small, 0.3=medium, 0.5=large)")

# ── Mann-Whitney U: skewed metric comparison ───────────────
# Session duration (exponential) — t-test is less appropriate
ctrl_dur = np.random.exponential(120, 200)
trt_dur  = np.random.exponential(140, 200)

_, p_t  = stats.ttest_ind(trt_dur, ctrl_dur)
u, p_mw = stats.mannwhitneyu(trt_dur, ctrl_dur, alternative='two-sided')
r_rb    = 1 - 2*u / (len(trt_dur)*len(ctrl_dur))   # rank-biserial r

print(f"\nt-test p={p_t:.4f}  |  Mann-Whitney p={p_mw:.4f}")
print(f"Rank-biserial r={r_rb:.4f}  (|r|>0.3 = medium effect)")

ML Link Chi-squared independence: feature selection (is categorical feature X independent of label Y?). Goodness-of-fit: is model output class distribution matching expected? KS test (Kolmogorov-Smirnov, a non-parametric goodness-of-fit test) is standard for feature drift detection in ML pipelines. Mann-Whitney: comparing model scores on skewed metric distributions.

Trap Using chi-squared when expected cell counts are < 5

The chi-squared approximation breaks down with sparse cells. scipy.stats.chi2_contingency may return significant results as a pure artefact of the approximation. This is common with rare user segments, infrequent events, or small datasets.

Fix Always check the expected counts array from chi2_contingency. If any cell < 5, use Fisher's exact test (stats.fisher_exact for 2×2) or combine sparse categories before testing.

Use Mann-Whitney when: (1) the metric is ordinal (satisfaction rating 1-5); (2) heavily skewed (session duration, revenue per user — a few whale users dominate); (3) severe outliers that a t-test would be sensitive to; (4) small sample and normality is questionable. For large-scale web A/B tests with continuous metrics and large n, t-tests are fine by CLT. Note: Mann-Whitney tests stochastic equality, not mean equality — they answer slightly different questions.

(1) Expected counts — any < 5 invalidates the test. (2) Effect size — Cramér's V = √(χ²/(n·(min(r,c)−1))). V < 0.1 is weak even if p is tiny. (3) Standardised residuals (O−E)/√E for each cell — which specific combinations drive the significance? (4) With large n, almost any dependency will be significant. Always ask: is this effect size meaningful for the decision at hand?

Power Analysis & Sample Size Calculation

Power = 1−β, Cohen's d, sample size formula, MDE, a priori vs post-hoc power, statsmodels

Power = P(reject H₀ | H₁ is true) = 1 − β (probability of detecting a real effect) Four linked quantities — fix three to solve for the fourth: α = significance level (0.05 typical) β = Type II error rate → power = 1 − β (0.80 typical) δ = effect size (standardised) n = sample size per group Cohen's d (standardised effect size): d = (μ₁ − μ₂) / σ_pooled Small: d=0.2 | Medium: d=0.5 | Large: d=0.8 Approximate n per group (two-sample t-test): n ≈ 2(z_{α/2} + z_β)² / d² α=0.05, power=0.80: n ≈ 2 × (1.96+0.84)² / d² = 15.68/d² d=0.5 (medium): n ≈ 63 per group MDE (Minimum Detectable Effect): Smallest effect you want to detect at given n, α, power Larger MDE → smaller n needed (cheaper but misses small effects) Smaller MDE → larger n (more sensitive but expensive) A priori (before experiment) vs post-hoc: A priori: use to determine n before running ← correct usage Post-hoc: circular — power computed from observed effect ≡ p-value in disguise

  Power vs n per group (α=0.05, two-sample t-test):

  Power
  1.0 ┤                             ●●●●●  d=0.8
  0.9 ┤                       ●●●●●
  0.8 ┤──────────────── target ──●────────  d=0.5
  0.7 ┤              ●●
  0.6 ┤         ●●●●
  0.5 ┤      ●●●
  0.4 ┤   ●●●                          d=0.2 (needs n≈393)
  0.3 ┤ ●●
      └────────────────────────────── n
       25  50 100 150 200   (d=0.5 hits 0.80 at n≈64)

Python — power analysis, sample size table, MDE calculation with statsmodels

import numpy as np
from statsmodels.stats.power import TTestIndPower, NormalIndPower

# ── Sample size for given power ────────────────────────────
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"n per group (d=0.5, 80% power): {n:.0f}")

# ── Power vs n table ───────────────────────────────────────
print(f"\n{'n':>6} | d=0.2 | d=0.5 | d=0.8")
print("-" * 33)
for n in [50, 100, 200, 400]:
    p = [analysis.solve_power(effect_size=d, nobs1=n, alpha=0.05)
         for d in [0.2, 0.5, 0.8]]
    print(f"{n:>6} | {p[0]:.3f} | {p[1]:.3f} | {p[2]:.3f}")

# ── MDE for proportion A/B test ────────────────────────────
p1 = 0.40      # baseline CTR
prop_analysis = NormalIndPower()
print("\nMDE vs n/group (baseline CTR=40%):")
for n_group in [500, 1000, 2000, 5000]:
    for delta in np.linspace(0.001, 0.10, 500):
        p2 = p1 + delta
        p_bar = (p1 + p2) / 2
        d = abs(p1 - p2) / np.sqrt(p_bar * (1 - p_bar))
        pwr = prop_analysis.solve_power(effect_size=d, nobs1=n_group, alpha=0.05)
        if pwr >= 0.80:
            print(f"  n={n_group:5d}: MDE={delta:.3f} ({delta/p1*100:.1f}% relative)")
            break

ML Link Pre-experiment power analysis is mandatory before any A/B test — it determines traffic allocation and experiment duration. MDE drives the business decision: "Can we detect a 1% CTR lift with current traffic in 2 weeks?" Power analysis for offline ML: determines the test-set size needed to detect a statistically significant improvement in AUC. In recommendation systems: power is often low because target metrics (purchases, revenue) are sparse and noisy.

Trap Running A/B tests without pre-specified stopping rule (continuous peeking)

Starting an experiment and stopping as soon as p < 0.05 dramatically inflates Type I error. Under repeated checking, the chance of seeing at least one p < 0.05 is much greater than α, even when H₀ is true. This is the "peeking problem" and it is widespread in growth teams.

Fix Fix n via a priori power analysis and analyse ONCE at that point. If early stopping is needed, use sequential testing methods (SPRT or always-valid confidence sequences) that maintain correct Type I error under repeated looks. Tools: Python's anytime_valid library or Evan Miller's sequential calculator.

Trap Post-hoc power analysis after a non-significant result

After getting p=0.15, computing "post-hoc power" using the observed effect and claiming "the study was underpowered." Post-hoc power computed from the observed effect is mathematically redundant — it is a monotone function of the observed p-value. A post-hoc power of 50% always corresponds to p=0.50.

Fix For non-significant results, report the CI: "95% CI for lift is (−0.5%, +2.3%), ruling out effects larger than 2.3% with 95% confidence." This is informative. Post-hoc power is not.

Steps: (1) baseline CTR p₁ = 0.40 from historical data; (2) target p₂ = 0.40 × 1.05 = 0.42; (3) compute Cohen's h = 2·arcsin(√p₂) − 2·arcsin(√p₁) ≈ 0.04; (4) NormalIndPower().solve_power(effect_size=0.04, alpha=0.05, power=0.80) → ~4900 per group. Business constraints often force smaller n → either accept a higher MDE (only detect larger lifts) or lower power (accept more false negatives). Always surface this trade-off explicitly.

α (Type I rate) = P(false positive). Power (1−β) = P(true positive). Lowering α (stricter threshold) reduces false positives but also reduces power (more false negatives). To maintain power while lowering α, you need more data. This trade-off matters in multiple testing: Bonferroni divides α by m, making each individual test stricter — you need a proportionally larger sample to maintain power for each individual hypothesis.

Multiple Testing Correction — Bonferroni & Benjamini-Hochberg FDR

FWER vs FDR, Bonferroni α/m, BH stepup procedure, when to use each, p-value adjustment

Problem: testing m hypotheses simultaneously. Expected false positives at α = m × 0.05 Testing 100 features → expect 5 spurious significant results on noise alone Framework 1 — FWER (Family-Wise Error Rate), Bonferroni: Controls P(any false positive) ≤ α Method: reject Hᵢ if pᵢ < α/m Example: m=100 tests → need p < 0.0005 to reject Very conservative; use when ANY false positive is unacceptable Framework 2 — FDR (False Discovery Rate), Benjamini-Hochberg: Controls E[false positives / total rejections] ≤ α Algorithm: 1. Sort p-values p_(1) ≤ p_(2) ≤ ... ≤ p_(m) 2. Find largest k where p_(k) ≤ (k/m) × α 3. Reject H_(1), ..., H_(k) Less conservative than Bonferroni — more discoveries at controlled contamination Use when some false positives are acceptable (feature selection, genomics) Holm-Bonferroni: Stepwise version of Bonferroni Uniformly more powerful than plain Bonferroni; same FWER guarantee

  Benjamini-Hochberg (m=10 tests, α=0.05):

  Rank k │ p-value │ threshold k/m×α │ Reject?
  ───────┼─────────┼─────────────────┼────────
    1    │ 0.001   │ 0.005           │ YES ✓
    2    │ 0.006   │ 0.010           │ YES ✓
    3    │ 0.011   │ 0.015           │ YES ✓
    4    │ 0.022   │ 0.020           │ NO  ← p > threshold
    5    │ 0.045   │ 0.025           │ NO
  Reject H_(1) … H_(3): largest k where p_(k) ≤ k×α/m

Python — Bonferroni, Holm, BH FDR comparison with statsmodels.multipletests

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(42)
m, n = 100, 80

# ── Simulate: 90 null features, 10 real (d=0.5) ───────────
p_null = [stats.ttest_ind(np.random.normal(0,1,n), np.random.normal(0,1,n)).pvalue
          for _ in range(90)]
p_alt  = [stats.ttest_ind(np.random.normal(0.5,1,n), np.random.normal(0,1,n)).pvalue
          for _ in range(10)]
p_vals  = np.array(p_null + p_alt)
is_null = np.array([True]*90 + [False]*10)

def summarise(name, reject):
    fp = (reject & is_null).sum()
    tp = (reject & ~is_null).sum()
    fdr = fp / max(reject.sum(), 1)
    print(f"{name:20s}: TP={tp}, FP={fp}, FDR={fdr:.3f}, Discoveries={reject.sum()}")

# ── No correction ──────────────────────────────────────────
summarise("No correction", p_vals < 0.05)

# ── Bonferroni ─────────────────────────────────────────────
reject_bf, *_ = multipletests(p_vals, alpha=0.05, method='bonferroni')
summarise("Bonferroni", reject_bf)

# ── Holm-Bonferroni ────────────────────────────────────────
reject_holm, *_ = multipletests(p_vals, alpha=0.05, method='holm')
summarise("Holm-Bonferroni", reject_holm)

# ── Benjamini-Hochberg FDR ─────────────────────────────────
reject_bh, *_ = multipletests(p_vals, alpha=0.05, method='fdr_bh')
summarise("BH FDR", reject_bh)

print("\nBonferroni controls FWER (zero false positives goal)")
print("BH FDR controls expected false discovery rate ≤ 5%")
print("BH finds more true effects at cost of a small FP rate")

ML Link BH FDR is standard for ML feature selection (testing all features for target association), NLP (all token n-grams), genomics (20,000 gene tests). Bonferroni for A/B tests with multiple primary metrics: testing 5 metrics → use α/5=0.01. ANOVA post-hoc (Tukey HSD) handles multiple model comparisons simultaneously.

Trap Testing multiple metrics in one A/B test without any correction

An A/B test checks CTR, conversion rate, revenue per user, and session duration. With 4 metrics at α=0.05, P(at least one false positive) = 1 − 0.95⁴ = 18.5%. Reporting "CTR improved (p=0.03)" while silently ignoring 3 non-significant metrics inflates the effective false positive rate to ~18%.

Fix Pre-specify a single primary metric. Apply Bonferroni or BH to secondary metrics. Report all metrics transparently, not just the significant ones. With 4 tests and Bonferroni, require p < 0.0125 per test.

Trap Bonferroni when tests are positively correlated

Bonferroni assumes independence. When testing CTR and conversion rate simultaneously (correlated — clickers are more likely to convert), Bonferroni is overly conservative, leading to too few discoveries. The actual FWER is less than α when tests are positively correlated.

Fix Use Holm-Bonferroni (uniformly more powerful, same FWER guarantee). For strongly correlated metrics, use permutation-based correction that uses the joint null distribution. BH FDR is also more robust to positive dependence.

FDR=0.05 means you expect at most 5% of discoveries to be false positives. With 7 discoveries: expected false positives = 0.05 × 7 = 0.35 — fewer than 1 on average. This is a guarantee on the expected proportion, valid when tests are independent or positively correlated. It does not guarantee zero false positives in this specific run — just that across repeated experiments the contamination rate is controlled.

Bonferroni (FWER): when any single false positive has serious consequences — medical device approval, drug trials, security alerts. You want essentially zero false alarms. BH FDR: when false positives are tolerable as long as most discoveries are real — ML feature selection, exploratory analysis, large-scale A/B tests with many secondary metrics. In practice: ML almost always uses FDR; clinical trials use Bonferroni. Business A/B testing sits between — pre-specify one primary metric (Bonferroni territory) and use FDR for exploratory secondary metrics.

Running 20 independent tests at α=0.05 guarantees at least one false positive with probability 1 − 0.95²⁰ = 64%. This is the multiple comparisons problem, and it underlies the replication crisis. Bonferroni is conservative; Benjamini-Hochberg FDR is the right tool for feature selection and large-scale ML experimentation.

Experimentation 09–10

A/B Test Design — Randomisation, MDE & Guardrails

A/B testing is the gold standard for causal inference in online systems. It is also the most widely misused tool in product analytics. Getting the design right — unit of randomisation, sample size, guardrail metrics, novelty effects — determines whether your experiment answers a causal question or produces noise that gets shipped.

Randomisation Unit, SUTVA & Network Effects

User vs session vs page randomisation, SUTVA (no interference), network effects violation, switchback experiments

Stable Unit Treatment Value Assumption (SUTVA): 1. No interference: unit i's outcome is not affected by unit j's assignment 2. No hidden variation in treatment: everyone in treatment gets the same treatment Why SUTVA matters: Violation 1 — Network effects: recommending to user A changes the content pool for user B. Social networks, marketplaces, ride-sharing — assigning treatment by user violates SUTVA. Violation 2 — Spillover: an email to user A changes user B's behaviour (shared household). Randomisation units: User-level: Most common. Consistent experience; respects SUTVA for non-social products Session-level: Higher variance (same user sees both variants). Use only for micro-interactions Geo/cluster: Randomise by geography or cohort to prevent interference on social graphs Time-based (switchback): alternate treatment/control by time slots — for shared inventory Unit of randomisation = Unit of analysis: Randomise by user → analyse by user-level outcomes (not sessions or page views) Deviation from this inflates Type I error and biases lift estimates Hash-based assignment: user_id → hash(user_id + experiment_salt) % 100 → bucket Guarantees consistent assignment and reproducibility without storing state

  SUTVA violation: social network interference

  Without network effect      With network effect
  ──────────────────────      ───────────────────────────
  A (control) → outcome_A    A (control) ──────────────┐
  B (treatment) → outcome_B  B (treatment) → B sees new│
                             content → B's friends see ↓
                             → A's outcome changes too  ×
                             → treatment effect bleeds into control

Python — hash-based user assignment, SUTVA check, session vs user analysis

import numpy as np
import hashlib
import pandas as pd

# ── Deterministic user assignment via hashing ─────────────
def assign_experiment(user_id: int, experiment: str, n_buckets: int = 100) -> str:
    key = f"{user_id}:{experiment}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % n_buckets
    return 'treatment' if bucket < 50 else 'control'

# Verify: ~50/50 split, same user always gets same assignment
users = np.arange(10_000)
assignments = [assign_experiment(u, "exp_ctr_v1") for u in users]
treatment_rate = sum(a == 'treatment' for a in assignments) / len(users)
print(f"Treatment rate: {treatment_rate:.4f}  (expected 0.50)")
# Same user always gets same group (reproducible)
print(f"User 42: {assign_experiment(42, 'exp_ctr_v1')} (call 1)")
print(f"User 42: {assign_experiment(42, 'exp_ctr_v1')} (call 2, same result)")

# ── Unit mismatch: user vs session analysis ────────────────
np.random.seed(42)
n_users = 500
# Each user has 1-5 sessions
sessions_per_user = np.random.randint(1, 6, n_users)
user_group = np.random.choice(['control', 'treatment'], n_users)

# True user-level effect: 0.04 lift
user_ctr = np.where(user_group == 'treatment',
                    np.random.beta(5, 10, n_users),      # treatment ~33%
                    np.random.beta(4.5, 10, n_users))    # control  ~31%

# Session-level data (correctly analysed at user level)
df = pd.DataFrame({
    'user_id': np.repeat(np.arange(n_users), sessions_per_user),
    'group': np.repeat(user_group, sessions_per_user),
    'session_ctr': np.repeat(user_ctr, sessions_per_user) +
                   np.random.normal(0, 0.02, sessions_per_user.sum())
})

# Correct: user-level analysis
user_agg = df.groupby('user_id').agg(
    group=('group','first'), ctr=('session_ctr','mean')
)
ctrl = user_agg[user_agg.group=='control']['ctr']
trt  = user_agg[user_agg.group=='treatment']['ctr']
print(f"\nUser-level lift: {trt.mean()-ctrl.mean():.4f}  n_users={n_users}")

# Wrong: session-level analysis (pseudo-replication, inflated n)
ctrl_s = df[df.group=='control']['session_ctr']
trt_s  = df[df.group=='treatment']['session_ctr']
print(f"Session-level lift: {trt_s.mean()-ctrl_s.mean():.4f}  n_sessions={len(df)}")
print("Session-level n is inflated → narrower CI, risk of spurious significance")

ML Link Hash-based assignment is how every major tech company runs A/B tests (Google Overlapping Experiments Framework, Facebook PlanOut). SUTVA violations require cluster randomisation or geo experiments for recommendation/social systems. Switchback experiments are standard for shared-inventory systems (Uber, DoorDash, Airbnb).

Trap Session-level randomisation with user-level analysis (or vice versa)

Randomising users but counting sessions as independent observations inflates the effective sample size. If user A has 10 sessions, those 10 sessions are correlated — treating them as independent observations understates variance and overstates significance.

Fix Always analyse at the same level as randomisation. For user-level experiments: aggregate to one row per user first, then run the t-test on user-level means. For ratio metrics (CTR = clicks/impressions), use the delta method or bootstrap CI — do not compute click_rate = total_clicks/total_impressions and treat it as a per-row metric.

Trap Ignoring network effects in social/marketplace experiments

Testing a new recommendation algorithm by assigning half the users to treatment. If treatment users interact with control users (messaging, following, competing for the same supply), treatment effects spill into the control group, understating the true lift. You might ship a feature that actually hurts users because the network effect compressed the observed difference.

Fix Use graph-based cluster randomisation (assign entire social clusters to one group), geo experiments (assign entire cities/regions), or synthetic control methods. For marketplace supply-side experiments (driver incentives, seller features), switchback or holdout experiments are the right design.

If you randomise by user but analyse by session, each user contributes multiple observations that are correlated. This violates the independence assumption of the t-test, inflating the effective sample size and deflating the standard error. The result: spuriously narrow CIs and inflated significance. The fix is to aggregate to one observation per randomisation unit (user-level mean CTR) before testing.

SUTVA requires that user A's outcome depends only on A's treatment, not on other users' assignments. It is violated in: social networks (A follows B — B's treatment changes A's feed), two-sided marketplaces (treatment drivers compete with control drivers for same rides), recommendation systems (treatment shifts content distribution for all users in collaborative filtering). When SUTVA is violated, standard A/B estimates are biased — use geo/cluster randomisation or causal graph methods.

MDE, Duration, Guardrail Metrics & Novelty Effect

Minimum detectable effect, experiment duration, pre-experiment checks (SRM), guardrail metrics, novelty bias

MDE (Minimum Detectable Effect): The smallest effect size your experiment can reliably detect at α, power Drives the sample size decision: n ∝ 1/MDE² Experiment duration: Duration = n / (daily_traffic_per_group) Minimum: ≥ 1–2 business cycles (week-over-week variation) For day-of-week effects: run for full weeks (7, 14, 21 days) Too short: day-of-week bias, novelty effects dominate Too long: chance of contamination, external events Guardrail metrics: Metrics that must NOT worsen even if the primary metric improves Examples: latency p99, error rate, crash rate, unsubscribe rate Fail-safe: ship only if guardrails pass, regardless of primary metric Novelty effect: Users interact more with a new feature simply because it is new (not better) Manifests as: initial CTR spike that decays over weeks to steady-state Mitigation: run experiment ≥ 2 full weeks; analyse last-week data separately Sample Ratio Mismatch (SRM): Assignment ratio ≠ expected ratio (e.g., 48% treatment vs expected 50%) Indicates: logging bugs, bot traffic, selection bias in assignment SRM check: chi-squared test on assignment counts H₀: observed ratio matches expected Always run SRM check before analysing results — SRM invalidates all metrics

  A/B Test Timeline:

  Day 0       Day 1-N              Day N+1
  │           │                    │
  ┌─────────────────────────────────┐
  │ Pre-checks │  Live experiment   │ Analysis
  │            │                   │
  │ SRM audit  │ Control ────────► │ Primary metric
  │ Baseline   │ Treatment ──────► │ Guardrails
  │ power calc │                   │ Novelty check
  └─────────────────────────────────┘
  ↑ Randomise  ↑ Expose            ↑ Measure & decide

Python — SRM detection, guardrail check, novelty effect detection

import numpy as np
from scipy import stats

np.random.seed(42)

# ── Sample Ratio Mismatch (SRM) check ─────────────────────
# Expected: 50/50 split, observed: 4800 control / 5200 treatment
expected_split = 0.5
n_control, n_treatment = 4800, 5200
n_total = n_control + n_treatment

expected_counts = np.array([n_total * expected_split,
                             n_total * (1 - expected_split)])
observed_counts = np.array([n_control, n_treatment])

chi2, p_srm = stats.chisquare(observed_counts, expected_counts)
print(f"SRM check: χ²={chi2:.2f}, p={p_srm:.4f}")
print(f"SRM detected? {p_srm < 0.01}  (use α=0.01 for SRM)")
# p < 0.01 → SRM present → DO NOT analyse results

# ── Guardrail metric evaluation ────────────────────────────
n = 1000
ctrl_latency = np.random.lognormal(4.6, 0.4, n)    # ~p50=100ms
trt_latency  = np.random.lognormal(4.7, 0.4, n)    # ~p50=110ms (10% regression)

# Primary metric: check if latency guardrail passes
latency_guardrail_pct = 0.10    # allow up to 10% p95 regression
p95_ctrl = np.percentile(ctrl_latency, 95)
p95_trt  = np.percentile(trt_latency, 95)
relative_change = (p95_trt - p95_ctrl) / p95_ctrl
print(f"\nLatency p95 ctrl={p95_ctrl:.0f}ms, trt={p95_trt:.0f}ms")
print(f"Relative change: {relative_change:.3f}")
print(f"Guardrail pass? {relative_change < latency_guardrail_pct}")

# ── Novelty effect detection ───────────────────────────────
# Compare week 1 CTR lift vs week 2 CTR lift
week1_ctrl = np.random.normal(0.40, 0.02, 500)
week1_trt  = np.random.normal(0.47, 0.02, 500)   # novelty spike
week2_ctrl = np.random.normal(0.40, 0.02, 500)
week2_trt  = np.random.normal(0.42, 0.02, 500)   # settled effect

lift_w1 = week1_trt.mean() - week1_ctrl.mean()
lift_w2 = week2_trt.mean() - week2_ctrl.mean()
print(f"\nWeek 1 lift: {lift_w1:.4f}  (novelty inflated)")
print(f"Week 2 lift: {lift_w2:.4f}  (steady-state)")
print(f"Novelty effect: {lift_w1 - lift_w2:.4f}")

ML Link SRM check is a required pre-analysis step at every major tech company before looking at any metric. Guardrail metrics operationalise the product constraints that a recommendation model must satisfy (latency, diversity, content safety scores). Novelty effects are especially severe for ML-generated content — users explore more when content is novel, inflating early engagement metrics.

Trap Ending the experiment as soon as the primary metric is significant

This is the peeking problem. If you stop after week 1 when the novelty effect inflates CTR, you ship a feature that decays back to baseline. The week-2 CTR is the steady-state effect you actually care about. Running for at least two business cycles also controls for day-of-week and seasonal variation.

Fix Pre-commit to experiment duration based on sample size calculation. Do not stop early based on metric values. If you need early stopping, use sequential testing methods (see Stage 10). Separately analyse first-week vs subsequent-week behaviour to identify and quantify the novelty effect.

Trap Ignoring SRM and analysing a contaminated experiment

SRM (Sample Ratio Mismatch) means the assignment mechanism is broken — some users are not being assigned randomly. Common causes: caching layers serving stale assignments, bots skewing to one variant, logging infrastructure dropping events unevenly. All downstream metric estimates are biased when SRM is present.

Fix Run SRM check first (chi-squared on assignment counts, α=0.01). If SRM is detected, investigate the root cause before looking at any metric. Never try to "correct for" SRM after the fact — the contamination is unfixable.

The week 1 spike is likely a novelty effect — users explore new UI/content because it's novel, not because it's better. The true steady-state effect is ~1% (week 2). Decision: (1) check if 1% lift is above your MDE (pre-specified); (2) check guardrail metrics pass; (3) if the 1% lift is statistically significant and economically meaningful, ship it — but set expectations correctly. Do not report the +6% to stakeholders as the expected long-run effect.

Guardrails should cover: performance (latency p95, p99 — model inference time), reliability (error rate, serving availability), user experience (unsubscribe/block rate, report rate, session duration floor), business constraints (content diversity index, fairness metrics if applicable). A new model that improves CTR by 5% but increases p99 latency by 50% is a bad trade. Guardrails are set by the team before the experiment and are non-negotiable — if they fail, the feature is not shipped regardless of primary metric.

A/B Analysis: Intention-to-Treat, Ratio Metrics & Delta Method

ITT vs per-protocol analysis, ratio metric (CTR = clicks/impressions) variance, delta method, CUPED preview

Intention-to-Treat (ITT) vs Per-Protocol: ITT: analyse all randomised users, regardless of whether they engaged with the feature Per-protocol: analyse only users who "activated" the treatment In online experiments: always use ITT — per-protocol introduces selection bias Ratio metrics — the correct way to handle CTR: CTR = total_clicks / total_impressions (NOT user-level mean of click_i/impression_i) Naive variance: Var(CTR) is not simply p(1-p)/n — it must account for the correlation between numerator and denominator within each user Delta method for ratio metric variance: Let R = Σclicksᵢ / Σimpressionsᵢ Δ = R_treatment − R_control Var(Δ) ≈ (1/n²)[Var(∑clicks) + R²·Var(∑impr) − 2R·Cov(∑clicks, ∑impr)] Simpler practical approach: compute per-user CTR = clicks_i/impressions_i Then t-test on per-user CTRs (after filtering users with ≥ min_impressions) This avoids the delta method while being approximately correct CUPED (Controlled-experiment Using Pre-Experiment Data): Reduce variance by regressing out the pre-experiment value of the metric Y_cuped = Y − θ · (X − E[X]) where X = pre-experiment metric value θ = Cov(Y, X) / Var(X) (regression coefficient) Can reduce variance by 30-70% → halves required sample size (or improves power)

Python — ITT analysis, ratio metric delta method, CUPED variance reduction

import numpy as np
from scipy import stats
import pandas as pd

np.random.seed(42)
n = 1000   # users per group

# ── Ratio metric: CTR = clicks/impressions ─────────────────
impressions_ctrl = np.random.poisson(10, n)
impressions_trt  = np.random.poisson(10, n)
clicks_ctrl = np.random.binomial(impressions_ctrl, 0.10)
clicks_trt  = np.random.binomial(impressions_trt,  0.12)

# Naive (wrong): t-test on per-user CTR (zero-impression users cause issues)
safe_ctrl = impressions_ctrl > 0
safe_trt  = impressions_trt  > 0
ctr_ctrl  = clicks_ctrl[safe_ctrl] / impressions_ctrl[safe_ctrl]
ctr_trt   = clicks_trt[safe_trt]   / impressions_trt[safe_trt]
_, p_naive = stats.ttest_ind(ctr_trt, ctr_ctrl)
print(f"Naive per-user CTR t-test: p={p_naive:.4f}")

# Aggregate ratio + delta method (more correct)
R_ctrl  = clicks_ctrl.sum() / impressions_ctrl.sum()
R_trt   = clicks_trt.sum()  / impressions_trt.sum()
delta   = R_trt - R_ctrl
# Delta method variance for difference in ratios
def ratio_var(clicks, impr):
    n_ = len(clicks)
    r  = clicks.sum() / impr.sum()
    return (1/n_**2) * (
        np.var(clicks)*n_ + r**2*np.var(impr)*n_ - 2*r*np.cov(clicks,impr)[0,1]*n_
    )
se_delta = np.sqrt(ratio_var(clicks_trt, impressions_trt) +
                   ratio_var(clicks_ctrl, impressions_ctrl))
z_stat   = delta / se_delta
p_delta  = 2 * (1 - stats.norm.cdf(abs(z_stat)))
print(f"Ratio + delta method:       p={p_delta:.4f}  (lift={delta:.4f})")

# ── CUPED: pre-experiment variance reduction ───────────────
# Pre-experiment CTR (covariate)
pre_ctr_ctrl = ctr_ctrl + np.random.normal(0, 0.01, len(ctr_ctrl))
pre_ctr_trt  = ctr_trt  + np.random.normal(0, 0.01, len(ctr_trt))

# CUPED adjustment
theta_ctrl = np.cov(ctr_ctrl, pre_ctr_ctrl)[0,1] / np.var(pre_ctr_ctrl)
theta_trt  = np.cov(ctr_trt,  pre_ctr_trt)[0,1]  / np.var(pre_ctr_trt)
cuped_ctrl = ctr_ctrl - theta_ctrl * (pre_ctr_ctrl - pre_ctr_ctrl.mean())
cuped_trt  = ctr_trt  - theta_trt  * (pre_ctr_trt  - pre_ctr_trt.mean())

_, p_cuped = stats.ttest_ind(cuped_trt, cuped_ctrl)
var_reduction = 1 - np.var(cuped_ctrl)/np.var(ctr_ctrl)
print(f"CUPED t-test:               p={p_cuped:.4f}")
print(f"CUPED variance reduction:   {var_reduction:.1%}")

ML Link CUPED (Microsoft Research, 2013) is now standard at most tech companies — it halves required experiment duration for the same power by regressing out pre-experiment noise. Delta method is used in online experimentation frameworks (Airbnb, LinkedIn, Netflix) for ratio metrics. ITT is the default in clinical trials and should be the default in online experiments to prevent selection bias from "activation" filters.

Trap Computing aggregate CTR without accounting for per-user correlation

total_clicks/total_impressions is not a simple proportion whose variance is p(1-p)/n. High-impression users (whales) dominate the numerator and denominator and introduce correlation. Using p(1-p)/n understates variance → too-narrow CI → inflated significance.

Fix Use the delta method for ratio metrics, or compute per-user CTR and run a t-test on those per-user values (filtering users with 0 impressions). For production systems, use an experimentation library that handles ratio metrics correctly (e.g., Statsig, Optimizely Stats Engine, or Netflix's Paired Bootstrap).

ITT analyses all randomised users. Per-protocol (only activated users) introduces selection bias: users who see and engage with a new feature are systematically different from those who don't (e.g., more active, less fatigued). Any lift observed among activated users may reflect this selection, not the treatment. ITT gives an unbiased estimate of the average treatment effect across the entire population — which is what you actually care about when deciding to ship.

CUPED uses pre-experiment data (pre-period value of the same metric) as a covariate to reduce post-experiment variance. The regression Y_cuped = Y − θ(X − E[X]) removes the component of Y that is explained by baseline behaviour X. Because baseline behaviour is correlated with post-experiment behaviour (users who clicked a lot before tend to click a lot after), removing it reduces noise without changing the expected value of Y_cuped. This variance reduction translates directly to tighter CIs and higher power for the same n.

The unit of randomisation must match the unit of analysis. If you randomise by user but analyse by session, users who see both treatments (session crossover) inflate variance and bias estimates. The most common silent error in A/B testing is a mismatch between these two units.

Sequential Testing, Bandits & Explore-Exploit

Classical A/B testing requires you to decide on sample size before running — and then commit to not looking at results until the end. Sequential testing breaks that constraint, allowing continuous monitoring with valid statistical guarantees. Multi-armed bandits go further: they learn and adapt during the experiment itself, trading slower statistical conclusions for faster convergence to the best option.

The Peeking Problem & Sequential Testing

Why continuous peeking inflates Type I error, SPRT, always-valid p-values, Bayesian sequential testing

The Peeking Problem: Running a t-test at every new data point and stopping when p < 0.05. Under repeated testing of H₀: P(ever seeing p<0.05) >> 0.05 With continuous peeking: effective Type I error rate can exceed 30-50% Why it happens: p-values fluctuate over time. In a random walk, the process will visit the "significant" region at some point just by chance, even if H₀ is true. Classical t-test p-values are only valid at a pre-specified stopping point. Sequential Probability Ratio Test (SPRT): Anytime-valid test — valid to check at any point and stop at any time At each step, compute likelihood ratio Λ = P(data|H₁)/P(data|H₀) Stop if Λ ≥ 1/α (reject H₀) or Λ ≤ β (accept H₀) Controls both α and β exactly, even under continuous monitoring Always-valid confidence intervals (e-values): CI that remains valid regardless of when you look or when you stop Based on the theory of e-values and anytime-valid inference Available: scipy-extensions, or compute via mixture sequential tests Bayesian sequential testing: Monitor P(treatment > control | data) — posterior probability of lift Stop when this probability exceeds a threshold (e.g., 0.95) or time runs out No multiple comparisons issue — posterior probability is valid at any time

Python — peeking inflation simulation, SPRT, Bayesian sequential monitoring

import numpy as np
from scipy import stats

np.random.seed(42)
n_sims, n_max, alpha = 5000, 500, 0.05

# ── Peeking inflation: check p at every step under H₀ ─────
false_positives_peeking = 0
false_positives_fixed   = 0

for _ in range(n_sims):
    ctrl = np.random.normal(0, 1, n_max)
    trt  = np.random.normal(0, 1, n_max)   # H₀ true: no difference
    # Peek at every 10 observations
    found = False
    for n in range(20, n_max, 10):
        _, p = stats.ttest_ind(trt[:n], ctrl[:n])
        if p < alpha:
            found = True
            break
    false_positives_peeking += found
    # Fixed sample: look once at end
    _, p_final = stats.ttest_ind(trt, ctrl)
    false_positives_fixed += (p_final < alpha)

print(f"Type I error — peeking: {false_positives_peeking/n_sims:.3f}  (expected {alpha})")
print(f"Type I error — fixed n: {false_positives_fixed/n_sims:.3f}   (should be ~{alpha})")

# ── SPRT: Wald Sequential Probability Ratio Test ───────────
def sprt_test(data_ctrl, data_trt, delta=0.2, alpha=0.05, beta=0.20):
    """Returns (decision, n_used) where decision in {H0, H1, continue}"""
    A = (1 - beta) / alpha        # upper boundary
    B = beta / (1 - alpha)        # lower boundary
    log_lambda = 0.0
    n = min(len(data_ctrl), len(data_trt))
    for i in range(1, n):
        # Log-likelihood ratio for normal shift delta
        x_diff = data_trt[i] - data_ctrl[i]
        log_lambda += (delta * x_diff - delta**2/2)
        if np.exp(log_lambda) >= A:
            return 'reject H0 (treatment better)', i
        if np.exp(log_lambda) <= B:
            return 'accept H0 (no difference)', i
    return 'continue (undecided)', n

ctrl_sample = np.random.normal(0.0, 1, 300)
trt_sample  = np.random.normal(0.3, 1, 300)   # real effect
decision, n_used = sprt_test(ctrl_sample, trt_sample, delta=0.3)
print(f"\nSPRT (real effect): {decision} at n={n_used}")

ctrl_null = np.random.normal(0, 1, 300)
trt_null  = np.random.normal(0, 1, 300)       # no effect
decision_null, n_used_null = sprt_test(ctrl_null, trt_null, delta=0.3)
print(f"SPRT (no effect):   {decision_null} at n={n_used_null}")

ML Link Sequential testing is standard in continuous deployment contexts where you cannot wait for a fixed end date. Netflix uses always-valid sequential tests. Bayesian sequential monitoring (posterior probability of lift) is the approach used by VWO, Adobe Target, and Google Optimize's Bayesian mode. SPRT is used in medical device monitoring (post-market surveillance) where continuous monitoring is required.

Trap Using a fixed-sample t-test while checking results daily

This is the peeking problem. A team launches an A/B test, checks the dashboard every morning, and stops when p < 0.05. Even if H₀ is true, this process will eventually cross the significance threshold. The actual false positive rate can be 20-50×% the nominal α.

Fix Either: (a) pre-specify sample size and look only once at the end; (b) use a sequential testing method (SPRT, mSPRT, e-values) that explicitly allows continuous monitoring; (c) use Bayesian monitoring with a posterior probability threshold (valid at any time). Most modern experimentation platforms now offer sequential testing by default.

Classical p-values are calibrated for a single pre-specified look at the data. Under H₀, a p-value is Uniform(0,1) at one observation — but if you take multiple looks, you accumulate multiple chances to see a small p-value just by chance. The process of p-values over time is like a random walk with a reflecting barrier at 1 — it will eventually cross 0.05 with probability 1 under continuous monitoring. Sequential methods correct for this by using a test statistic whose distribution accounts for the stopping rule.

SPRT is frequentist: it controls exact Type I and Type II error rates and gives a definitive decision (reject H₀ or accept H₀). Use it when you need hard error-rate guarantees. Bayesian sequential monitoring gives P(treatment > control | data) which is more intuitive for business stakeholders ("we are 94% confident treatment is better"). Use it when the business question is "how confident are we?" rather than "can we reject the null?" Bayesian sequential tests have no formal multiple comparisons issue — the posterior is valid at any time.

Multi-Armed Bandits — ε-Greedy, UCB & Thompson Sampling

Explore-exploit trade-off, regret minimisation, three bandit algorithms compared, when to use vs A/B

Multi-Armed Bandit (MAB): K arms (variants), each with unknown reward distribution μₖ Goal: maximise total reward over T rounds, not just decide which arm is best Regret = T·μ* − Σₜ rₜ (cumulative reward lost vs always pulling best arm) Three algorithms: 1. ε-Greedy: With prob ε: explore (random arm) With prob 1-ε: exploit (best known arm) Simple, explainable, widely used in industry Weakness: does not reduce ε over time → constant exploration cost 2. UCB (Upper Confidence Bound): Pull arm k with highest: x̄ₖ + √(2 ln t / nₖ) Optimism-in-face-of-uncertainty: try arms whose true value might be high Deterministic, no random exploration, provably optimal regret O(K ln T) 3. Thompson Sampling: Maintain Beta(αₖ, βₖ) posterior for each arm Each round: sample θₖ ∼ Beta(αₖ, βₖ) for each arm, pull argmax θₖ Update: if reward=1 → αₖ++, if reward=0 → βₖ++ Bayesian: naturally balances exploration and exploitation Often best empirically; handles non-stationary rewards well MAB vs A/B: A/B: split traffic 50/50 throughout → suboptimal if one arm is clearly worse MAB: adapts allocation toward better arms → lower regret during experiment A/B: clean statistical inference; MAB: muddies causal estimates (non-random allocation)

Python — epsilon-greedy, UCB1, Thompson sampling; regret comparison

import numpy as np
import matplotlib
matplotlib.use('Agg')

np.random.seed(42)

# ── Environment: 5 arms with true CTRs ────────────────────
true_ctrs = [0.10, 0.15, 0.12, 0.20, 0.08]   # arm 3 is best (0.20)
K, T = len(true_ctrs), 2000

def pull(arm):
    return int(np.random.random() < true_ctrs[arm])

# ── ε-Greedy ──────────────────────────────────────────────
def eps_greedy(epsilon=0.10):
    counts   = np.zeros(K)
    values   = np.zeros(K)
    regret   = []
    total    = 0
    for t in range(1, T+1):
        arm = np.random.randint(K) if np.random.random() < epsilon else np.argmax(values)
        r   = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
        total += r
        regret.append(t * max(true_ctrs) - total)
    return regret, values

# ── UCB1 ──────────────────────────────────────────────────
def ucb1():
    counts, values = np.zeros(K), np.zeros(K)
    regret = []
    total  = 0
    # Initialise: pull each arm once
    for arm in range(K):
        r = pull(arm); counts[arm] = 1; values[arm] = r; total += r
    for t in range(K+1, T+1):
        ucb = values + np.sqrt(2 * np.log(t) / counts)
        arm = np.argmax(ucb)
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
        total += r
        regret.append(t * max(true_ctrs) - total)
    return regret, values

# ── Thompson Sampling ─────────────────────────────────────
def thompson():
    alpha, beta_p = np.ones(K), np.ones(K)
    regret = []
    total  = 0
    for t in range(1, T+1):
        theta = np.random.beta(alpha, beta_p)
        arm   = np.argmax(theta)
        r     = pull(arm)
        if r == 1: alpha[arm] += 1
        else:       beta_p[arm] += 1
        total += r
        regret.append(t * max(true_ctrs) - total)
    return regret, alpha / (alpha + beta_p)

r_eps, v_eps = eps_greedy(0.10)
r_ucb, v_ucb = ucb1()
r_ts,  v_ts  = thompson()

print("Final regret at T=2000:")
print(f"  ε-Greedy:         {r_eps[-1]:.1f}")
print(f"  UCB1:             {r_ucb[-1]:.1f}")
print(f"  Thompson Sampling:{r_ts[-1]:.1f}")
print(f"\nEstimated CTRs (true best arm: {np.argmax(true_ctrs)}):")
for name, vals in [("ε-Greedy", v_eps), ("UCB1", v_ucb), ("Thompson", v_ts)]:
    print(f"  {name}: best arm = {np.argmax(vals)}, CTR = {max(vals):.4f}")

ML Link Thompson Sampling is used in recommendation systems (Spotify playlist personalisation), ad serving (Google Smart Bidding), and push notification timing. UCB is standard in tree search (UCT in AlphaGo's MCTS). ε-Greedy is the simplest explore-exploit strategy in tabular RL (Q-learning with ε-greedy action selection). Contextual bandits extend MAB to use features (LinUCB, Neural-bandit) — standard in news personalisation (Yahoo, Google News).

Trap Using Thompson sampling for causal inference (estimating treatment effect)

Thompson sampling's adaptive allocation muddies causal estimates. As it converges toward the best arm, the treatment and control groups are no longer comparable random samples — the group receiving less traffic systematically differs from the group receiving more. You cannot reliably estimate the ATE (average treatment effect) from a MAB experiment.

Fix Use bandits when the goal is to minimise regret during exploration (maximise cumulative reward). Use A/B tests when the goal is to estimate the causal effect of one variant for a binary ship/no-ship decision. Hybrid: run A/B first to establish significance, then switch to a bandit for ongoing optimisation (contextual bandit for personalisation).

Thompson Sampling maintains a Beta posterior for each arm's win rate. For a new arm with little data, Beta(1,1) is flat (high variance) — sampling from it gives high probability of generating a large θ value, so the arm gets explored. As an arm accumulates wins/losses, the posterior sharpens around the true CTR, reducing the chance of a spuriously large sample — natural exploitation. No tuning of ε needed. This Bayesian mechanism automatically allocates more exploration where uncertainty is high and more exploitation where the best arm is well-identified.

UCB1's expected regret is O(K log T) — logarithmic in the number of rounds and linear in the number of arms. This is provably optimal (matches a lower bound). In practice, logarithmic regret means: for T=1,000,000 rounds and K=10 arms, expected regret ≈ C·10·14 ≈ 140C — a fixed overhead amortised over 1M rounds. This compares favourably to ε-greedy whose regret grows linearly with T (constant fraction of rounds wasted on random exploration). UCB's regret guarantee makes it preferable when you need predictable performance bounds.

The choice between A/B testing and multi-armed bandits is not about statistics — it is about objectives. A/B tests maximise statistical certainty about which variant is better. Bandits maximise cumulative reward during the experiment. If you are optimising for a one-time launch decision, A/B is correct. If you are optimising for user experience throughout the learning process, bandits are better.

Causal Inference 11–12

Causal Reasoning, DAGs & Confounders

Correlation tells you what moves together. Causation tells you what would happen if you intervened. ML models trained on observational data absorb correlations — including spurious ones driven by confounders, selection bias, and feedback loops. Understanding the difference between association and causation is what separates an ML practitioner who deploys reliable systems from one who ships models that work in backtesting but fail in production.

Correlation vs Causation & Directed Acyclic Graphs (DAGs)

Causal DAG structure, do-calculus, d-separation, intervention vs observation

Observational (correlation): P(Y | X = x) — conditional probability Interventional (causal): P(Y | do(X = x)) — Pearl's do-calculus These are different unless X is randomised. Directed Acyclic Graph (DAG): Nodes = variables, directed edges = direct causal relationships No cycles (hence "acyclic") — time flows forward DAG encodes the data-generating process (DGP) Key path types: Direct cause: X → Y (X directly causes Y) Chain: X → Z → Y (Z mediates the effect) Fork: X ← Z → Y (Z is a common cause = confounder) Collider: X → Z ← Y (Z has two causes — conditioning on Z opens a path) d-separation (blocking paths): A set S blocks the path between X and Y if: - S contains a non-collider on the path (chain or fork node), OR - S does not contain a collider (and does not contain a descendant of a collider) If ALL paths between X and Y are blocked by S: X ⊥ Y | S (conditional independence) Backdoor criterion: To estimate the causal effect of X on Y, block all backdoor paths (paths from X to Y going through X's parents) by conditioning on set S If S satisfies the backdoor criterion: P(Y|do(X)) = Σₛ P(Y|X,S=s)·P(S=s)

  Confounder (fork): Z causes both X and Y

  Z (season)
  ↙         ↘
  X (ice cream)  →  Y (drowning)

  Observed: X and Y are correlated (both peak in summer)
  Causal: X does NOT cause Y — Z causes both
  Conditioning on Z blocks the backdoor path X←Z→Y

  DAG symbols:
  X → Y  : X directly causes Y
  X ← Z → Y : Z is a confounder (fork)
  X → Z ← Y : Z is a collider (conditioning opens spurious path)

Python — DAG simulation, backdoor adjustment, confounder identification

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)
n = 5000

# ── Simulate: Z confounds X→Y ─────────────────────────────
# True causal effect of X on Y: +1.0
# Z causes both X (positively) and Y (positively)
Z = np.random.normal(0, 1, n)           # confounder (e.g., user age)
X = 0.7 * Z + np.random.normal(0, 1, n) # treatment (e.g., feature exposure)
Y = 1.0 * X + 0.8 * Z + np.random.normal(0, 1, n)  # outcome (e.g., revenue)
# True causal effect of X on Y = 1.0
# Z positively confounds: both X and Y are high when Z is high

# ── Naive correlation: biased estimate ─────────────────────
slope_naive, _, _, _, _ = stats.linregress(X, Y)
print(f"Naive X→Y slope (biased):     {slope_naive:.4f}  (true: 1.0)")

# ── Backdoor adjustment: condition on Z ───────────────────
df = pd.DataFrame({'Y': Y, 'X': X, 'Z': Z})
# Multiple regression with Z included blocks the backdoor path
from numpy.linalg import lstsq
A = np.column_stack([X, Z, np.ones(n)])
coefs, *_ = lstsq(A, Y, rcond=None)
print(f"Adjusted X→Y slope:           {coefs[0]:.4f}  (true: 1.0, recovered!)")
print(f"Z coefficient:                {coefs[1]:.4f}  (true: 0.8)")

# ── Spurious correlation from a collider ───────────────────
# X and Y are independent; Z = X + Y (collider)
X2 = np.random.normal(0, 1, n)
Y2 = np.random.normal(0, 1, n)
Z2 = X2 + Y2 + np.random.normal(0, 0.1, n)   # collider

print(f"\nX⊥Y (true): corr = {np.corrcoef(X2, Y2)[0,1]:.4f}")
# Condition on Z2 (collider) — opens spurious association
Z2_high = Z2 > np.percentile(Z2, 75)
corr_cond = np.corrcoef(X2[Z2_high], Y2[Z2_high])[0,1]
print(f"X⊥Y|Z  (conditioning on collider): corr = {corr_cond:.4f}  (spurious!)")

ML Link DAGs formalise the assumptions behind ML feature engineering. A feature correlated with the label due to a confounder (not causally related) will degrade after deployment when the confounder distribution shifts. Causal feature selection: use DAG to identify features on causal paths vs spurious correlation paths. Counterfactual reasoning in RL: the agent must estimate "what would have happened if I had taken action A instead of B?"

Trap Conditioning on a collider introduces spurious associations

In a hiring dataset: talent (X) and luck (Y) are independent, but we only observe candidates who were hired (Z = "hired" is a collider on X→Z←Y). Conditioning on Z (analysing only hired people) makes talent and luck appear negatively correlated — the "talented people don't need luck" bias. This is a data issue that ML models absorb automatically when trained on selected/filtered data.

Fix Identify colliders in your DAG before filtering or stratifying training data. In survival bias scenarios (only observing "successful" outcomes), collider conditioning is automatic and introduces bias. Solutions: collect data from the unfiltered population, use inverse probability weighting, or use causal representation learning.

Trap Treating correlation in training data as evidence of causation

A fraud detection model uses "number of recent transactions" as a feature. High transaction count is correlated with fraud because fraudsters make many purchases quickly. But it is also correlated with power users making legitimate purchases. The model learns a spurious correlation that will surface as false positives for legitimate high-volume users.

Fix For each high-importance feature, ask: "Is this feature on a causal path to the outcome, or is it correlated through a confounder?" For fraud: transaction velocity is both causal (fraud mechanism) and spurious (correlated with power user). Use domain knowledge to design features that are causally motivated. Consider causal discovery algorithms (PC algorithm, LiNGAM) for large feature spaces.

P(Y|X=x) is the conditional probability — observed in data when X happens to be x, including all causal and confounded associations. P(Y|do(X=x)) is the interventional distribution — what Y would be if you externally forced X to be x, cutting all incoming edges to X in the DAG. They are equal only if X is randomised (no confounders). If Z confounds X→Y, then P(Y|X) absorbs the Z path and overestimates the causal effect. A/B testing computes P(Y|do(X)) by randomising X.

Berkeley admission bias: talent (T) and diversity (D) are independent causes of admission (A = collider). Analysing only admitted students shows talent and diversity negatively correlated — because conditioning on A opens the path T→A←D. In ML: analysing only users who churned (churn = collider on engagement and price sensitivity) will show engagement and price sensitivity as spuriously correlated within the churn group. Training a retention model on churn-only data would absorb this spurious correlation.

Selection Bias & Feedback Loops in ML Training Data

Logged data bias, position bias, feedback loop reinforcement, propensity weighting

ML models are trained on observational data — data generated by a previous system. That previous system had its own selection mechanism, which biases the training data. Position bias in recommendation/search: Users are more likely to click items shown at position 1 than position 5, regardless of item quality. A model trained on raw clicks learns: "items shown at position 1 are popular" → reinforces position 1 → feedback loop Selection bias: only observing accepted decisions: Loan approval model: only approved loans have repayment outcomes Hiring model: only hired candidates have performance outcomes Content moderation: only reviewed content has labels The model cannot learn from the counterfactual (rejected applicants who would have repaid) Feedback loop (reinforcement): Model A trains on data → Model A is deployed → Model A shapes new data → New data is used to train Model B → Model B is more similar to A than ground truth Inverse Propensity Scoring (IPS): Weight each observation by 1/P(selected | features) Counteracts the selection mechanism to produce unbiased estimates Used in counterfactual recommendation evaluation and off-policy RL

Python — position bias simulation, feedback loop detection, IPS weighting

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)
n_items, n_users = 100, 2000

# ── Position bias in recommendation ───────────────────────
# True item quality: uniform across items
true_quality = np.random.uniform(0, 1, n_items)

# Position bias: P(click | position) drops with rank
position_bias = 1.0 / np.arange(1, 11)   # 10 shown positions
position_bias /= position_bias[0]         # normalise to 1 at pos 1

# Simulate user behaviour: click ~ quality × position_bias
observations = []
for _ in range(n_users):
    shown = np.random.choice(n_items, 10, replace=False)
    for rank, item in enumerate(shown):
        p_click = true_quality[item] * position_bias[rank]
        click   = np.random.binomial(1, min(p_click, 1))
        observations.append({'item': item, 'rank': rank+1, 'click': click,
                              'true_quality': true_quality[item]})

df = pd.DataFrame(observations)

# Biased model: raw CTR without position correction
raw_ctr = df.groupby('item')['click'].mean()
biased_corr = np.corrcoef(raw_ctr, true_quality)[0,1]
print(f"Biased CTR correlation with true quality: {biased_corr:.4f}")

# IPS correction: weight each click by 1/position_bias
df['propensity'] = position_bias[df['rank']-1]
df['ips_click']  = df['click'] / df['propensity']
ips_ctr = df.groupby('item')['ips_click'].mean()
debiased_corr = np.corrcoef(ips_ctr, true_quality)[0,1]
print(f"IPS-corrected correlation with true quality: {debiased_corr:.4f}")
print("IPS debiasing recovers true quality ranking")

# ── Feedback loop detection ────────────────────────────────
# Train on round 1 data, check if round 2 distribution is distorted
round1_items_shown = df['item'].value_counts(normalize=True)
# Simulate: deployed model shows top items more → feedback loop
top20 = raw_ctr.nlargest(20).index
round2_boost = np.ones(n_items)
round2_boost[top20] = 3.0   # top items get 3× more exposure
round2_prob = round2_boost / round2_boost.sum()

print(f"\nEntropy of item distribution round 1: "
      f"{stats.entropy(round1_items_shown):.4f}")
print(f"Entropy of item distribution round 2: "
      f"{stats.entropy(round2_prob):.4f}")
print("Lower entropy in round 2 = more concentrated → feedback loop")

ML Link IPS weighting is standard in counterfactual learning-to-rank (Yahoo, Microsoft MSLR). DPO (Direct Preference Optimisation) for LLMs implicitly accounts for the reference policy's selection. Off-policy evaluation in RL uses IPS to evaluate policies using data collected under a different (behaviour) policy. Debiased training data for recommendation systems uses IPS to remove position and exposure bias.

Trap Training a recommendation model on implicit feedback without debiasing

Clicks are a function of item quality AND exposure probability (position, thumbnail, notification timing). Training on raw clicks trains the model to predict "what users clicked given our current exposure policy" not "what users would have engaged with given equal exposure." Deploying this model reinforces existing popular items, reducing diversity and harming long-tail creators and items.

Fix Use IPS-weighted training or exposure-debiased objectives. For evaluation: use Normalised Discounted Cumulative Gain (NDCG) with IPS-corrected relevance judgments. Monitor exposure distribution entropy — declining entropy over model versions signals feedback loop amplification.

Cycle: (1) Model M₁ makes predictions → shapes what users see → users only interact with M₁-selected content. (2) New training data only contains M₁-influenced interactions. (3) Model M₂ trained on this data learns to mimic M₁'s biases in addition to true user preferences. (4) M₂ deployment further concentrates exposure. Each cycle amplifies whatever M₁ initially got right or wrong. Popular items become more popular (rich get richer); items never shown starve of data. Solutions: exploration (ε-greedy in serving), diverse data collection, holdout evaluation against an unbiased baseline, periodic cold-start resets.

IPS weights each observation by 1/P(observation selected | context) to counteract the selection mechanism. Intuition: if an item has low exposure probability (rarely shown), its clicks are rare but informative — upweight them. If an item is always shown, its clicks are abundant but partly driven by exposure, not quality — downweight them. Use IPS when: training recommendation models on position-biased click data; off-policy evaluation (evaluating a new recommendation policy using logs from the old policy); evaluating loan/hiring models when you only observe outcomes for accepted cases.

Every feature in your training data has a causal story. If you train on data where users with high past engagement are shown better content (feedback loop), your model learns to predict engagement from past engagement — not from content quality. Deploying this model amplifies the feedback loop. Causal reasoning is how you identify which features reflect signal vs artefact.

Causal Methods: DiD, CUPED & Metric Design

When you cannot randomise — because it is unethical, impractical, or the intervention already happened — you need causal inference methods. Difference-in-differences exploits pre/post and treatment/control structure. CUPED reduces variance in controlled experiments. Goodhart's Law is the causal reason why optimising a metric destroys its value as a metric — the central challenge of AI product metric design.

Difference-in-Differences (DiD)

Natural experiment, parallel trends assumption, DiD estimator, two-way fixed effects

DiD estimates causal effects when randomisation is impossible but you have: - A treatment group and a control group - Pre-treatment and post-treatment observations for both DiD estimator: τ_DiD = (Ȳ_treated,post − Ȳ_treated,pre) − (Ȳ_control,post − Ȳ_control,pre) Intuition: The control group measures what the treated group would have done without treatment. Subtract this trend to isolate the treatment effect. Parallel trends assumption: In the absence of treatment, both groups would have followed the same trend This is the key (untestable in post-treatment period) assumption of DiD Validation: check pre-treatment trends are parallel (placebo test) Two-Way Fixed Effects (TWFE): yᵢₜ = αᵢ + γₜ + τ·Dᵢₜ + εᵢₜ αᵢ = unit fixed effects (remove time-invariant unit differences) γₜ = time fixed effects (remove time trends common to all units) τ = DiD treatment effect estimate Use cases in DS/AI: Measuring the effect of a platform change on a subset of users Estimating the revenue impact of a new feature rolled out to some regions Policy evaluation when A/B testing was not done

  Difference-in-Differences:

  Metric
    │         ┌── Treatment (fact)
    │    ╔════╪═════
    │    ║    │                    ← DiD estimate = τ
    │    ║    │         ╔══════════ Treatment (counterfactual)
    │    ║    │    ┌────╢
    │  ──┼────┼────┼────╚══════════ Control
    │    │    │    │
  ──┼────┼────┼────┼─────────── Time
  pre-period  │  post-period
              Treatment
              start

Python — DiD simulation, parallel trends test, TWFE regression

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)
n_units, T = 200, 6   # 200 units, 6 time periods; treatment at t=3

# ── Generate panel data ────────────────────────────────────
unit_effects = np.random.normal(0, 2, n_units)
time_trends  = np.linspace(0, 1, T)     # common time trend
treated      = np.random.binomial(1, 0.5, n_units)  # 50% treated
true_tau     = 2.5   # true treatment effect

records = []
for i in range(n_units):
    for t in range(T):
        post    = int(t >= 3)
        treat_active = treated[i] * post
        y = (unit_effects[i] + time_trends[t] +
             true_tau * treat_active +
             np.random.normal(0, 1))
        records.append({'unit': i, 't': t, 'y': y,
                         'treated': treated[i], 'post': post})

df = pd.DataFrame(records)

# ── Simple DiD estimator ───────────────────────────────────
pre  = df[df.post == 0]
post = df[df.post == 1]
did_est = ((post[post.treated==1]['y'].mean() - pre[post.treated==1]['y'].mean()) -
           (post[post.treated==0]['y'].mean() - pre[post.treated==0]['y'].mean()))
print(f"DiD estimate: {did_est:.4f}  (true effect: {true_tau})")

# ── Two-Way Fixed Effects regression ──────────────────────
# Create dummies (demean instead for simplicity)
df['y_dm'] = (df['y']
              - df.groupby('unit')['y'].transform('mean')
              - df.groupby('t')['y'].transform('mean')
              + df['y'].mean())
df['D_dm'] = (df['treat_active'] if 'treat_active' in df
              else df['treated']*df['post'])
df['treat_active'] = df['treated'] * df['post']
df['D_dm'] = (df['treat_active']
              - df.groupby('unit')['treat_active'].transform('mean')
              - df.groupby('t')['treat_active'].transform('mean')
              + df['treat_active'].mean())

slope, _, _, _, _ = stats.linregress(df['D_dm'], df['y_dm'])
print(f"TWFE estimate: {slope:.4f}  (true: {true_tau})")

# ── Parallel trends pre-test ───────────────────────────────
pre_only = df[df.t < 3].copy()
pre_only['t_rel'] = pre_only['t'] - 2
interaction = pre_only['treated'] * pre_only['t_rel']
slope_pre, _, _, p_pre, _ = stats.linregress(interaction, pre_only['y'])
print(f"\nParallel trends test: slope={slope_pre:.4f}, p={p_pre:.4f}")
print(f"Parallel trends hold? {p_pre > 0.05}")

ML Link DiD is standard for measuring the impact of algorithm changes that were rolled out without A/B testing ("we shipped it last Tuesday, what was the impact?"). Geo-based DiD: treatment cities vs control cities. TWFE regression is used in econometrics and DS for panel data analysis. Caution: with staggered adoption timing, TWFE estimates can be biased (Callaway-Sant'Anna estimator is more robust).

Trap Violating the parallel trends assumption without checking

DiD is only valid if the treated and control groups would have followed the same trend without treatment. If the treated group was already diverging before treatment (e.g., higher-growth regions got the treatment), DiD overestimates the causal effect. This is the most common error in quasi-experimental analysis.

Fix Run a pre-trends test: fit a regression of outcome on time for the treated group in the pre-treatment period and check if the trend differs from the control group. Alternatively, run a placebo test: pretend treatment occurred one period earlier and check if DiD shows a spurious effect.

DiD when: (a) the feature was already shipped without randomisation ("accidental rollout") and you need to estimate impact retrospectively; (b) the treatment cannot be randomised at user level (e.g., a legal change that applies to an entire geography); (c) network effects or SUTVA violations make user-level A/B testing invalid. DiD requires the parallel trends assumption — it is weaker than A/B's randomisation guarantee. When you can A/B test, do so; use DiD as a fallback for observational natural experiments.

Parallel trends: in the absence of treatment, the average outcome in the treatment and control groups would have followed the same time trend. Validation: (1) Pre-trends test — plot treated vs control group outcomes in the pre-period and test if their trends are statistically different (they should not be). (2) Placebo test — re-run DiD as if treatment occurred one period before it actually did; the DiD estimate should be near zero. If neither validates, consider matching (propensity score) or synthetic control methods that relax the parallel trends requirement.

CUPED — Variance Reduction in Controlled Experiments

CUPED derivation, θ estimator, variance reduction formula, practical application to AUC and CTR experiments

CUPED: Controlled-experiment Using Pre-Experiment Data (Deng et al., 2013, Microsoft Research) Core idea: remove the pre-experiment variability from the outcome metric to reduce noise. CUPED estimator: Ȳ_cuped = Ȳ − θ · (X̄ − E[X]) θ = Cov(Y, X) / Var(X) (regression coefficient of Y on pre-experiment covariate X) Variance reduction: Var(Ȳ_cuped) = Var(Ȳ) · (1 − ρ²) where ρ = Cor(Y, X) (correlation between post and pre-experiment metric) What it achieves: ρ = 0.5 → variance reduced by 25% ρ = 0.7 → variance reduced by 51% ρ = 0.8 → variance reduced by 36% ← note: ρ=0.8 → 1-0.64=36% ρ = 0.9 → variance reduced by 81% ← 90% day-to-day user correlation Equivalent to: regression-adjusted estimator Y_cuped = Y − θ·(X − X̄) This removes variation in Y that is predictable from X (pre-experiment baseline) CUPED in practice: For CTR experiments: X = user CTR in the 7 days before experiment start For revenue: X = user revenue in prior period For model evaluation: X = per-sample score from previous model version Requires: pre-experiment data covering the same users

Python — CUPED from scratch, optimal θ, variance reduction calculation

import numpy as np
from scipy import stats

np.random.seed(42)
n_ctrl = n_trt = 1000

# ── Simulate: users with stable baseline behaviour ─────────
# Pre-experiment CTR (covariate X) — correlated with post-experiment CTR
user_baseline = np.random.beta(3, 10, n_ctrl + n_trt)  # stable CTR per user

# Post-experiment outcomes (Y): baseline + treatment effect + noise
ctrl_pre = user_baseline[:n_ctrl] + np.random.normal(0, 0.02, n_ctrl)
trt_pre  = user_baseline[n_ctrl:] + np.random.normal(0, 0.02, n_trt)

ctrl_post = user_baseline[:n_ctrl] + np.random.normal(0, 0.02, n_ctrl)  # no effect
trt_post  = (user_baseline[n_ctrl:] + 0.01 +                            # +1% lift
             np.random.normal(0, 0.02, n_trt))

# ── Standard t-test (no CUPED) ────────────────────────────
_, p_standard = stats.ttest_ind(trt_post, ctrl_post)
print(f"Standard t-test:  p={p_standard:.4f}")

# ── CUPED adjustment ──────────────────────────────────────
# Compute theta for each group (or combined)
X_all  = np.concatenate([ctrl_pre, trt_pre])
Y_all  = np.concatenate([ctrl_post, trt_post])
theta  = np.cov(Y_all, X_all)[0,1] / np.var(X_all)
E_X    = X_all.mean()

ctrl_cuped = ctrl_post - theta * (ctrl_pre - E_X)
trt_cuped  = trt_post  - theta * (trt_pre  - E_X)

_, p_cuped = stats.ttest_ind(trt_cuped, ctrl_cuped)
print(f"CUPED t-test:     p={p_cuped:.4f}")

# ── Variance reduction achieved ────────────────────────────
rho      = np.corrcoef(Y_all, X_all)[0,1]
var_red  = 1 - np.var(ctrl_cuped) / np.var(ctrl_post)
print(f"\nCorrelation ρ(Y, X):  {rho:.4f}")
print(f"Theoretical reduction: {1-rho**2:.1%}  (= 1 - ρ²)")
print(f"Actual reduction:      {var_red:.1%}")
print(f"Effective n multiplier: {1/(1-rho**2):.2f}×  (equivalent to {n_ctrl/(1-rho**2):.0f} users)")

# ── CUPED benefit: sample size to detect 1% lift ──────────
from statsmodels.stats.power import TTestIndPower
pooled_std     = np.std(ctrl_post, ddof=1)
pooled_std_cup = np.std(ctrl_cuped, ddof=1)
d_std  = 0.01 / pooled_std
d_cup  = 0.01 / pooled_std_cup
anal = TTestIndPower()
n_std = anal.solve_power(effect_size=d_std, alpha=0.05, power=0.80)
n_cup = anal.solve_power(effect_size=d_cup, alpha=0.05, power=0.80)
print(f"\nSample size to detect 1% lift (80% power):")
print(f"  Without CUPED: {n_std:.0f} per group")
print(f"  With CUPED:    {n_cup:.0f} per group  ({100*(1-n_cup/n_std):.0f}% saving)")

ML Link CUPED is now standard at Microsoft (Bing, Teams), Airbnb, Netflix, LinkedIn, and Booking.com. It is most powerful when ρ(pre, post) is high — for daily-active-user metrics (DAU CTR, time spent) correlation is typically 0.7-0.9, giving 50-80% variance reduction. MLCUPED extends this to multivariate covariates (regression on multiple pre-experiment features). Equivalent to regression discontinuity design in some settings.

Trap Applying CUPED when the covariate is not available for all users

New users have no pre-experiment history — CUPED cannot be applied to them. If you apply CUPED only to existing users, the treatment effect estimate applies to a different population than the one you intended to test. Mixing CUPED and non-CUPED users in the same analysis invalidates the variance reduction calculation.

Fix Segment users: apply CUPED to existing users (high variance reduction), standard t-test to new users. Combine estimates using weighted averaging. Or restrict the experiment to users with sufficient history. Report the CUPED results with the caveat about which user segment they apply to.

CUPED subtracts θ·(X − E[X]) from Y where X is the pre-experiment covariate. Since E[X] is a constant and θ is computed from pooled data (not from treatment assignment), E[Y_cuped] = E[Y] − θ·(E[X] − E[X]) = E[Y]. The expected value is unchanged → no bias. The variance is reduced because the subtracted term θ·(X−E[X]) is correlated with Y — removing it reduces the residual noise. It is equivalent to the regression adjustment estimator in statistics.

Variance reduction = 1 − ρ² = 1 − 0.64 = 36%. Since sample size ∝ 1/Var, the required n is reduced by the same factor: n_cuped = n_standard × (1 − ρ²) = n × 0.64. This means you need only 64% as many users — a 36% reduction. For an experiment that needed 10,000 users per group, CUPED reduces it to ~6,400 per group, cutting experiment duration by 36% at the same traffic level.

Goodhart's Law & Metric Design for AI Products

Why optimised metrics degrade, proxy vs goal metrics, metric stack design, Instrumental Variables overview

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Charles Goodhart (1975) — originally about monetary policy; applies universally to ML metrics. Mechanism (causal): Productive path: quality → engagement → metric ↑ Goodhart path: metric-gaming → metric ↑ (without quality) Optimising the metric shifts traffic to the Goodhart path Classic ML examples: CTR optimisation → clickbait headlines (high CTR, low satisfaction) Watch time optimisation → addictive/outrage content → user dissatisfaction (YouTube, 2019) Engagement score → notification spam → unsubscribes Accuracy metric → class imbalance ignored → predicts majority class only Metric stack design (avoiding Goodhart): Primary metric: what you actually care about (satisfaction, retention, revenue) Proxy metric: what you can measure easily and correlates with primary Guardrail metrics: metrics that catch Goodhart paths (diversity, complaint rate, etc.) Counter-metrics: signal when the proxy is diverging from the primary Instrumental Variables (IV) — brief: When you cannot randomise and DiD doesn't apply, an instrument Z is: 1. Correlated with treatment X (relevance) 2. Only affects outcome Y through X (exclusion restriction) Z → X → Y (Z is not correlated with confounders) IV estimator: τ_IV = Cov(Z,Y)/Cov(Z,X) Use: natural experiments where Z is an exogenous shock (lottery assignment, geographic/time discontinuities, algorithmic assignment cutoffs)

  Goodhart's Law causal diagram:

  True goal (satisfaction)           Proxy metric (CTR)
       │                                   │
       ▼                                   ▼
  Quality content  ─── causes ───►  User clicks (aligned)
  Clickbait        ─── causes ───►  User clicks (Goodhart path)
                                          │
  When you optimise CTR:                 │
  system finds BOTH paths ──────────────►│
  CTR ↑, quality ↓, satisfaction ↓      ▼
                                    Metric target achieved,
                                    goal NOT achieved

Python — Goodhart simulation, metric divergence detection, IV estimator

import numpy as np
from scipy import stats

np.random.seed(42)
n = 1000

# ── Goodhart simulation: optimising CTR degrades quality ──
# Two content types: quality articles vs clickbait
# Optimise for CTR → system shifts to clickbait
quality_articles = {'ctr': 0.08, 'satisfaction': 0.80, 'share': 0.40}
clickbait        = {'ctr': 0.25, 'satisfaction': 0.15, 'share': 0.05}

print("Metric optimisation drift (% clickbait in feed over time):")
print(f"{'Round':>6} | {'% Clickbait':>11} | {'CTR':>6} | {'Satisfaction':>13} | {'Shares':>7}")
print("-" * 52)
pct_clickbait = 0.20
for round_i in range(5):
    ctr  = pct_clickbait*clickbait['ctr']  + (1-pct_clickbait)*quality_articles['ctr']
    sat  = pct_clickbait*clickbait['satisfaction'] + (1-pct_clickbait)*quality_articles['satisfaction']
    shr  = pct_clickbait*clickbait['share'] + (1-pct_clickbait)*quality_articles['share']
    print(f"{round_i+1:>6} | {pct_clickbait:>10.0%} | {ctr:>6.3f} | {sat:>12.3f} | {shr:>7.3f}")
    # System learns: clickbait has higher CTR → increase allocation
    pct_clickbait = min(pct_clickbait * 1.8, 0.95)
print("CTR improves but satisfaction and shares collapse → Goodhart")

# ── Metric divergence: CTR vs satisfaction correlation ─────
ctrs = np.random.normal(0.12, 0.03, n)
# Initially correlated: quality drives both
quality       = np.random.uniform(0, 1, n)
satisfaction  = 0.8 * quality + np.random.normal(0, 0.1, n)
# After optimising CTR, clickbait articles have high CTR but low satisfaction
is_clickbait = ctrs > np.percentile(ctrs, 70)
satisfaction[is_clickbait] *= 0.3   # Goodhart: high CTR, low satisfaction
corr = np.corrcoef(ctrs, satisfaction)[0,1]
print(f"\nCTR-Satisfaction correlation after optimisation: {corr:.4f}")
print("Near-zero correlation → proxy (CTR) has diverged from goal (satisfaction)")

# ── Simple IV estimator ────────────────────────────────────
# Z = algorithm assignment (exogenous), X = feature usage, Y = revenue
Z = np.random.binomial(1, 0.5, n)           # instrument (random assignment)
confounder = np.random.normal(0, 1, n)       # unobserved
X = 0.7*Z + 0.6*confounder + np.random.normal(0, 0.5, n)   # treatment
Y = 2.0*X + 0.8*confounder + np.random.normal(0, 1, n)     # outcome (true τ=2.0)

# Naive OLS: biased due to confounder
slope_ols, *_ = stats.linregress(X, Y)
print(f"\nOLS estimate (biased):  {slope_ols:.4f}  (true: 2.0)")

# IV (Two-Stage Least Squares, TSLS) — manual
# Stage 1: regress X on Z → get X_hat
slope_1, int_1, *_ = stats.linregress(Z, X)
X_hat = int_1 + slope_1 * Z
# Stage 2: regress Y on X_hat
slope_iv, *_ = stats.linregress(X_hat, Y)
print(f"IV estimate (unbiased): {slope_iv:.4f}  (true: 2.0)")

ML Link Goodhart's Law is why: YouTube moved from watch time to satisfaction surveys (2018); TikTok uses "not interested" signals alongside engagement; Google uses demotion signals alongside CTR. Metric stack design is critical for RLHF/RLAIF: the reward model proxy (human preference) optimised too hard leads to reward hacking. IV is used in tech companies for: estimating the effect of algorithmic changes using natural discontinuities (e.g., users above/below a ranking cutoff, A/B holdouts from historical logs).

Trap Using engagement metrics as the sole optimisation target for a recommendation model

Short-term engagement (clicks, watch time, likes) is easy to measure but subject to Goodhart dynamics. Optimising a recommendation model directly on clicks trains it to exploit psychological triggers (curiosity gaps, outrage, social comparison) rather than match genuine user preferences. Over months, user satisfaction declines even as the engagement metric improves.

Fix Build a metric stack: primary metric = satisfaction/retention/LTV; proxy = engagement + implicit signals; guardrails = complaint/unsubscribe/report rates; counter-metric = session-ending events. Monitor the correlation between proxy and primary metrics over time — declining correlation is the signal that Goodhart effects are active. Add satisfaction surveys, thumbs-up/down signals, and post-session satisfaction prompts as direct proxies for the primary metric.

Trap IV: weak instrument bias

If the instrument Z is only weakly correlated with treatment X (F-statistic < 10 in the first stage), IV estimates become severely biased and have huge variance. The F < 10 rule of thumb for instrument strength is widely used in econometrics.

Fix Check the first-stage F-statistic: regress X on Z and check F. If F < 10, the instrument is weak. Consider: a stronger instrument (larger exogenous shock), multiple instruments (GMM), or alternative identification strategies. Never use IV with a weak instrument — the bias can be worse than OLS.

This is a classic Goodhart signal: the model found content that gets clicks but does not deliver value (users click but leave quickly). CTR is a proxy for "user found this interesting enough to click"; watch time is a proxy for "user found this valuable after clicking." When these diverge, the model has likely learned to exploit clickbait features. Next steps: (1) inspect which content types gained share after retraining; (2) check thumbnail/title change rate (are recommendations triggering more aggressive thumbnails?); (3) add post-click satisfaction (watch fraction, return visit) as a co-objective or guardrail metric.

Relevance: Z must be correlated with treatment X. Exclusion restriction: Z must affect outcome Y only through X, not through any other path (Z ⊥ confounder). ML example: estimating the effect of feature recommendation on user retention. Instrument Z = whether the user's notification was sent during a platform outage (exogenous variation in feature exposure). Relevance: outage → less feature usage (Z correlated with X). Exclusion: outage only affects retention through reduced feature usage, not through any direct quality signal (debatable — outage might also affect trust). In practice, the exclusion restriction is always an assumption, never testable.

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." This is not a business aphorism — it is a causal statement. When you optimise engagement directly, the causal path to the metric changes: instead of high-quality content → engagement, you get engineered engagement triggers → engagement. The metric value rises but the underlying quantity (user satisfaction) does not.

End of Statistics & Experimentation

Numbers without context are noise.

Statistics is what turns data into evidence and evidence into decisions.

All Foundations AI/ML Foundations