Statistics & Experimentation — Sudheesh Knowledge Base

Why Statistics Belongs in AI Engineering

Statistics is the discipline that keeps engineering confidence honest. It tells you when a model improvement is real, when an observed difference is noise, and when a metric is too unstable to support a product decision.

Core Concepts

Estimation

An estimate is a summary of incomplete information. Always pair it with a measure of uncertainty:

Mean and variance
Confidence intervals
Bootstrap intervals
Bayesian credible intervals
Prediction intervals for future observations
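
As one illustration of the bootstrap intervals listed above, here is a minimal sketch of a percentile bootstrap for a mean, using only the Python standard library. The function name bootstrap_ci, its parameters, and the scores are hypothetical and chosen for illustration; the percentile method is one of several bootstrap variants and can be unreliable for very small samples.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (illustrative helper, not a library API)."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    estimates.sort()
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the resampled statistics.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return stat(values), estimates[lo_idx], estimates[hi_idx]

# Hypothetical per-query quality scores from an offline evaluation.
scores = [0.62, 0.71, 0.55, 0.80, 0.67, 0.73, 0.59, 0.76, 0.64, 0.69]
point, low, high = bootstrap_ci(scores)
print(f"mean={point:.3f}, 95% interval=({low:.3f}, {high:.3f})")
```

Reporting the interval next to the point estimate makes it harder to mistake a small, noisy difference for a real one.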

Hypothesis Testing

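Hypothesis tests are useful when the decision is binary: ship or do not ship, promote or roll back, investigate or ignore. The p-value is not the probability that the hypothesis is true. It is the probability of seeing data at least as extreme as what was observed, assuming the null hypothesis holds. A small p-value means the data would be surprising under the null, not that the alternative has been proven.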
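
To make the definition concrete, the sketch below estimates a two-sided p-value with a permutation test: it shuffles group labels many times and counts how often the shuffled difference in means is at least as large as the observed one. This is an illustration under simple assumptions (independent observations, difference in means as the test statistic); the function name permutation_p_value and the data are hypothetical.

```python
import random

def _mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(control, treatment, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means (illustrative sketch)."""
    rng = random.Random(seed)
    observed = _mean(treatment) - _mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null, group labels are exchangeable, so shuffle and re-split.
        rng.shuffle(pooled)
        diff = _mean(pooled[n_control:]) - _mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical task-success scores for two model variants.
control = [0.61, 0.58, 0.66, 0.63, 0.60, 0.59, 0.64, 0.62]
treatment = [0.65, 0.63, 0.69, 0.66, 0.62, 0.67, 0.68, 0.64]
diff, p = permutation_p_value(control, treatment)
print(f"observed difference={diff:.3f}, p-value={p:.4f}")
```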

Power

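Power answers a practical question: if the effect exists, how likely are we to detect it? A low-powered experiment usually ends inconclusively, and the occasional significant result it does produce tends to overstate the true effect, which encourages over-reading noise.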
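
As a rough illustration, the sketch below approximates the power of a two-sided z-test comparing two conversion rates, using the normal approximation with an unpooled standard error. The function name power_two_proportions and the example rates are assumptions made for illustration; real experiments with small samples or rare events may need an exact or simulation-based calculation.

```python
import math
from statistics import NormalDist

def power_two_proportions(p_control, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    z = NormalDist()
    # Standard error of the difference in sample proportions (unpooled).
    se = math.sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_treatment * (1 - p_treatment) / n_per_arm
    )
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(p_treatment - p_control) / se
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical scenario: 10% baseline conversion, true lift to 11%.
for n in (2000, 5000, 20000):
    print(n, round(power_two_proportions(0.10, 0.11, n), 3))
```

When the two rates are equal, the function returns alpha, which is the false positive rate of the test rather than useful power.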

A/B Testing Framework

Design experiments in this order:

Define the decision.
Choose the primary metric.
Choose guardrail metrics.
Estimate baseline variance.
Decide minimum detectable effect.
Compute sample size and duration.
Pre-register analysis choices.
Monitor data quality, not the result, during the test.
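
The variance, minimum detectable effect, and sample size steps above can be sketched for a conversion-rate metric as follows. This uses the standard normal-approximation formula for comparing two proportions; the function names, the 80% power default, and the traffic figures are illustrative assumptions, not values taken from this note.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Users needed per arm to detect an absolute lift of `mde_abs` (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    # For a binary metric the baseline variance follows from the rate: p * (1 - p).
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2)

def duration_days(n_per_arm, daily_eligible_users, n_arms=2):
    """Days needed to fill all arms, assuming steady traffic split evenly across arms."""
    return math.ceil(n_per_arm * n_arms / daily_eligible_users)

# Hypothetical inputs: 10% baseline, 1 percentage point MDE, 5,000 eligible users per day.
n = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.01)
days = duration_days(n, daily_eligible_users=5000)
print(f"users per arm: {n}, estimated duration: {days} days")
```

In practice it is common to round the duration up to whole weeks so that every day of the week is represented equally.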

Metric Design

Good metrics are sensitive, stable, and aligned with value. For AI products, use a metric stack:

Model metric: accuracy, NDCG, F1, hallucination rate.
System metric: latency, cost, availability.
Product metric: conversion, retention, task success.
Trust metric: escalation, user correction, safety flags.

Failure Modes

Peeking repeatedly and stopping when the result becomes significant.
Testing too many metrics without correction.
Measuring proxy metrics that do not match user value.
Running experiments during abnormal traffic windows.
Ignoring interference between users in marketplace, social, or recommendation systems.
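
The first failure mode is easy to demonstrate by simulation. The sketch below runs many A/A tests (no true effect) on normally distributed data with known unit variance, and compares the false positive rate when a result is declared at any interim look against the rate when only the final look counts. The function name simulate_peeking and all parameters are hypothetical, and the setup is deliberately simplified.

```python
import math
import random

def simulate_peeking(n_sims=2000, n_looks=10, batch_per_arm=100, z_crit=1.96, seed=0):
    """Return (false positive rate with peeking, false positive rate at the final look only)."""
    rng = random.Random(seed)
    any_look_hits = 0
    final_look_hits = 0
    for _sim in range(n_sims):
        diff_sum = 0.0  # running sum of (arm A - arm B) observations
        n_seen = 0      # observations per arm so far
        crossed = False
        z = 0.0
        for _look in range(n_looks):
            # Difference of two batch sums of N(0, 1) draws has variance 2 * batch size.
            diff_sum += rng.gauss(0.0, math.sqrt(2.0 * batch_per_arm))
            n_seen += batch_per_arm
            # z-statistic for the difference in means with known unit variance.
            z = diff_sum / math.sqrt(2.0 * n_seen)
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and declared a win here
        any_look_hits += crossed
        final_look_hits += abs(z) > z_crit
    return any_look_hits / n_sims, final_look_hits / n_sims

peek_rate, fixed_rate = simulate_peeking()
print(f"false positive rate with peeking: {peek_rate:.3f}")
print(f"false positive rate at final look only: {fixed_rate:.3f}")
```

The final-look rate should sit near the nominal 5%, while the peeking rate is noticeably higher, because every extra look is another chance for noise to cross the threshold.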

Research Habit

For every experiment, keep an experiment card:

Decision owner
Hypothesis
Primary metric
Guardrails
Sample size
Exclusion rules
Result
Interpretation
Follow-up
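
One possible way to keep experiment cards consistent is to store them as structured records. The sketch below mirrors the fields listed above as a Python dataclass; the class name ExperimentCard, the field types, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExperimentCard:
    """Illustrative record for one experiment; fields follow the card outline above."""
    decision_owner: str
    hypothesis: str
    primary_metric: str
    guardrails: List[str]
    sample_size: int
    exclusion_rules: List[str] = field(default_factory=list)
    # Filled in after the experiment ends.
    result: Optional[str] = None
    interpretation: Optional[str] = None
    follow_up: Optional[str] = None

    def is_complete(self) -> bool:
        """True once the post-experiment fields have all been filled in."""
        return all(
            value is not None
            for value in (self.result, self.interpretation, self.follow_up)
        )

# Hypothetical card, created before launch.
card = ExperimentCard(
    decision_owner="search-team lead",
    hypothesis="The new reranker increases task success.",
    primary_metric="task success rate",
    guardrails=["p95 latency", "cost per query"],
    sample_size=15000,
    exclusion_rules=["internal traffic", "bot sessions"],
)
print(card.is_complete())
```

Writing the pre-launch fields before the test starts doubles as a lightweight form of pre-registration.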