Probability, Random Variables & Core Distributions
Every ML model is a probability machine — it outputs distributions, optimises expected losses, and makes decisions under uncertainty. Before touching a training loop, you need to understand what probability actually means and how random variables formalise the randomness in your data.
Probability is a function P: events → [0,1] satisfying Kolmogorov's three axioms: 1. Non-negativity: P(A) ≥ 0 for any event A 2. Normalisation: P(Ω) = 1 (something must happen) 3. Additivity: P(A∪B) = P(A) + P(B) if A∩B = ∅ From these three axioms, all of probability theory follows. Key derived rules: Complement: P(Aᶜ) = 1 − P(A) Addition: P(A∪B) = P(A) + P(B) − P(A∩B) Independence: P(A∩B) = P(A)·P(B) iff A and B are independent Independence is crucial in ML. Naive Bayes assumes features are conditionally independent given the class: P(x₁,x₂,...,xₙ|y) = ∏ᵢ P(xᵢ|y). This assumption is almost never true in practice but the classifier often works well anyway.
Sample Space Ω ┌─────────────────────────────┐ │ ┌───────┐ ┌───────┐ │ │ │ A │░░░│ B │ │ │ │ │░A∩B │ │ │ └───────┘ └───────┘ │ │ │ └─────────────────────────────┘ P(A∪B) = P(A) + P(B) − P(A∩B) If A∩B = ∅ (mutually exclusive): P(A∪B) = P(A) + P(B)
import numpy as np
from scipy import stats
np.random.seed(42)
# ── Simulating probability axioms ─────────────────────────
# Roll a fair die 100,000 times
n = 100_000
rolls = np.random.randint(1, 7, n)
A = rolls <= 3 # event A: {1, 2, 3}
B = rolls % 2 == 0 # event B: {2, 4, 6}
A_and_B = A & B # A ∩ B = {2}
A_or_B = A | B # A ∪ B = {1, 2, 3, 4, 6}
print(f"P(A) = {A.mean():.4f} (expected 0.5)")
print(f"P(B) = {B.mean():.4f} (expected 0.5)")
print(f"P(A∩B) = {A_and_B.mean():.4f} (expected 0.1667)")
print(f"P(A∪B) = {A_or_B.mean():.4f} (expected 0.8333)")
print(f"Addition check: P(A)+P(B)-P(A∩B) = "
f"{A.mean()+B.mean()-A_and_B.mean():.4f} (should match P(A∪B))")
# ── Independence check ─────────────────────────────────────
# A and B are NOT independent here
print(f"\nIndependence: P(A)·P(B) = {A.mean()*B.mean():.4f}")
print(f" P(A∩B) = {A_and_B.mean():.4f}")
print(f" → Independent? {np.isclose(A.mean()*B.mean(), A_and_B.mean(), atol=0.01)}")
# A={1,2,3} and B={2,4,6} share event 2 → NOT independent
# ── Complement rule ────────────────────────────────────────
A_comp = ~A
print(f"\nP(Aᶜ) = {A_comp.mean():.4f} | 1-P(A) = {1-A.mean():.4f}")
# ── Conditional probability P(A|B) = P(A∩B)/P(B) ─────────
p_A_given_B = A_and_B.sum() / B.sum()
print(f"\nP(A|B) = {p_A_given_B:.4f} (P(even number is also ≤3))")
# P(A|B) = P({2})/P({2,4,6}) = (1/6)/(3/6) = 1/3 ≈ 0.333 A Random Forest outputs 0.85 for a customer churn prediction. Many engineers treat this as "85% chance of churn" and build intervention logic around it. But tree ensemble probabilities are poorly calibrated — the true rate may be 60% or 95%. Decisions built on uncalibrated probabilities are unreliable.
Naive Bayes assumes P(x₁,...,xₙ|y) = ∏P(xᵢ|y). For text features this approximately holds. For user behaviour features (time_on_page and scroll_depth are correlated), it is violated. When features are highly correlated, Naive Bayes double-counts evidence, making it overconfident.
The three axioms: (1) P(A) ≥ 0 — probabilities are non-negative. (2) P(Ω) = 1 — the total probability over all outcomes is 1. (3) P(A∪B) = P(A)+P(B) for mutually exclusive events. These matter for ML because every probabilistic model must satisfy them to produce valid outputs. A classifier whose softmax probabilities do not sum to 1 violates axiom 2. A model that produces negative confidence scores violates axiom 1. When you write a custom loss function or probabilistic layer, you are implicitly working within these constraints.
Mutually exclusive: A and B cannot both occur — P(A∩B) = 0. If A happens, B is impossible. Example: a single coin flip cannot be both heads and tails. Independent: knowing A occurred gives no information about B — P(A∩B) = P(A)·P(B). Example: two separate coin flips. Crucially, mutually exclusive events with P(A)>0 and P(B)>0 are NEVER independent — if A occurs, B definitely did not (P(B|A)=0 ≠ P(B)). In ML, feature independence (as assumed by Naive Bayes) means conditional independence: P(xᵢ,xⱼ|y) = P(xᵢ|y)·P(xⱼ|y), not marginal independence.
A random variable X is a function from the sample space Ω to real numbers. It maps outcomes to numeric values so we can do arithmetic on randomness. Discrete RV: takes countable values. Described by PMF: P(X=xᵢ) = pᵢ, where Σpᵢ = 1. Continuous RV: takes uncountable values. Described by PDF f(x) where P(a≤X≤b) = ∫ₐᵇ f(x)dx. CDF: F(x) = P(X≤x) — works for both discrete and continuous RVs. Key quantities: Expected value: E[X] = Σ xᵢ·P(X=xᵢ) (discrete) E[X] = ∫ x·f(x) dx (continuous) Variance: Var(X) = E[(X−μ)²] = E[X²] − (E[X])² Std deviation: σ = √Var(X) Linearity: E[aX+b] = a·E[X]+b, Var(aX+b) = a²·Var(X) The loss function in ML is always an expectation: L(θ) = E[(y−ŷ)²] or E[−y·log(ŷ)]. Minimising the loss = finding θ that minimises expected loss over the data distribution.
import numpy as np
from scipy import stats
import matplotlib
matplotlib.use('Agg')
# ── Discrete RV: fair die ─────────────────────────────────
outcomes = np.array([1, 2, 3, 4, 5, 6])
probs = np.array([1/6] * 6) # PMF
E_X = np.sum(outcomes * probs) # E[X] = 3.5
E_X2 = np.sum(outcomes**2 * probs) # E[X²] = 15.1667
Var_X = E_X2 - E_X**2 # Var(X) = 2.9167
std_X = np.sqrt(Var_X)
print(f"Die — E[X]={E_X:.4f}, Var(X)={Var_X:.4f}, σ={std_X:.4f}")
# CDF: P(X ≤ k) = cumulative sum of PMF
cdf = np.cumsum(probs)
print(f"P(X ≤ 3) = {cdf[2]:.4f}") # 0.5
# ── Continuous RV: Normal distribution ───────────────────
mu, sigma = 0, 1
norm_rv = stats.norm(loc=mu, scale=sigma)
# PDF at x=0: f(0) = 1/√(2π) ≈ 0.3989
print(f"\nNormal PDF at x=0: {norm_rv.pdf(0):.4f}")
# P(-1 ≤ X ≤ 1) = CDF(1) - CDF(-1) ≈ 0.6827 (68% rule)
print(f"P(-1 ≤ X ≤ 1): {norm_rv.cdf(1) - norm_rv.cdf(-1):.4f}")
# Inverse CDF (percentile): 95th percentile
print(f"95th percentile: {norm_rv.ppf(0.95):.4f}")
# ── Linearity of expectation ──────────────────────────────
# If X ~ N(0,1), then Y = 2X + 3 ~ N(3, 4)
# E[Y] = 2·E[X]+3 = 2·0+3 = 3
# Var[Y] = 2²·Var[X] = 4·1 = 4
X_samples = np.random.normal(0, 1, 100_000)
Y_samples = 2*X_samples + 3
print(f"\nE[2X+3] = {Y_samples.mean():.4f} (expected 3)")
print(f"Var[2X+3]= {Y_samples.var():.4f} (expected 4)")
# ── E[X] as foundation of ML loss ─────────────────────────
# MSE = E[(y - ŷ)²] estimated empirically as:
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.3, 2.8, 4.2, 4.9])
mse = np.mean((y_true - y_pred)**2) # empirical expectation
print(f"\nMSE = E[(y-ŷ)²] ≈ {mse:.4f}") log(E[X]) ≠ E[log(X)] in general. In ML this appears as: the average of log-probabilities is not the log of the average probability. Confusion about this causes errors when computing ELBO in VAEs or when aggregating log-likelihoods across batches.
Variance is in squared units — the variance of a salary in dollars is in dollars². Reporting Var(salary)=2,500,000 is technically correct but uninterpretable. In ML, reporting prediction variance in the squared output unit confuses stakeholders and makes threshold-setting harder.
PMF (Probability Mass Function) gives P(X=x) directly for discrete RVs — values are probabilities in [0,1] summing to 1. PDF (Probability Density Function) for continuous RVs gives probability per unit length — it is a density, not a probability. P(a≤X≤b) = ∫ₐᵇ f(x)dx. Since f(x) is a density, it can exceed 1 as long as the integral over all x equals 1. For example, a Uniform(0, 0.5) has f(x)=2 everywhere in [0,0.5], but ∫₀^0.5 2dx = 1. In code: scipy.stats.norm.pdf(0) = 0.3989, but scipy.stats.uniform(scale=0.5).pdf(0.25) = 2.0.
The expected value minimises mean squared error: E[X] = argmin_c E[(X−c)²]. This directly motivates MSE regression: predicting the conditional mean E[y|x] is optimal under squared loss. The expected value also has linearity — E[aX+bY] = aE[X]+bE[Y] — which makes the gradient of the expected loss equal to the expected gradient, justifying mini-batch training. However, the mean is not always the right summary: for asymmetric losses (e.g., understock vs overstock), the optimal prediction is a different quantile, not the mean. This leads to quantile regression and pinball loss.
Three discrete distributions appear everywhere in DS and AI: Bernoulli(p): single binary trial. X ∈ {0,1}. P(X=1) = p, P(X=0) = 1−p E[X] = p, Var(X) = p(1−p) → Binary cross-entropy loss = negative log-likelihood under Bernoulli Binomial(n, p): count of successes in n independent Bernoulli trials. X ∈ {0,...,n}. P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ E[X] = np, Var(X) = np(1−p) → Number of conversions in n sessions; click counts per user Poisson(λ): count of rare events in a fixed interval. X ∈ {0,1,2,...}. P(X=k) = e^(−λ)·λᵏ/k! E[X] = λ, Var(X) = λ (mean = variance — a key Poisson property) → Arrivals per second, defects per batch, word counts in documents When does Binomial → Poisson? When n is large and p is small (n≥20, p≤0.05): Binomial(n,p) ≈ Poisson(np).
import numpy as np
from scipy import stats
np.random.seed(42)
# ── Bernoulli(p=0.15) — e.g. email open rate ─────────────
p_open = 0.15
bern = stats.bernoulli(p_open)
print(f"Bernoulli(0.15) — E[X]={bern.mean():.2f}, Var={bern.var():.4f}")
# Simulate 10,000 email sends
opens = bern.rvs(10_000)
print(f"Simulated open rate: {opens.mean():.4f} (true p={p_open})")
# Negative log-likelihood = binary cross-entropy
# L = -[y·log(p) + (1-y)·log(1-p)]
p_hat = opens.mean() # MLE: just the sample mean
nll = -np.mean(opens*np.log(p_hat) + (1-opens)*np.log(1-p_hat))
print(f"Binary cross-entropy (NLL): {nll:.4f}")
# ── Binomial(n=100, p=0.15) — clicks in sessions ─────────
binom_rv = stats.binom(n=100, p=0.15)
print(f"\nBinom(100,0.15) — E[X]={binom_rv.mean():.1f}, σ={binom_rv.std():.2f}")
print(f"P(X ≤ 10): {binom_rv.cdf(10):.4f}")
print(f"P(X ≥ 20): {1 - binom_rv.cdf(19):.4f}")
# ── Poisson(λ=3) — support tickets per hour ──────────────
lam = 3.0
pois = stats.poisson(lam)
print(f"\nPoisson(3) — E[X]={pois.mean():.1f}, Var[X]={pois.var():.1f}")
# Key property: E[X] = Var[X] = λ
tickets = pois.rvs(1000)
print(f"Simulated: mean={tickets.mean():.2f}, var={tickets.var():.2f}")
# MLE for Poisson: λ_hat = sample mean
lam_hat = tickets.mean()
print(f"MLE λ̂ = {lam_hat:.4f} (true λ={lam})")
# ── Binomial → Poisson approximation ─────────────────────
n, p = 10_000, 0.0003 # rare event: n large, p small
binom_approx = stats.binom(n, p)
poisson_approx = stats.poisson(n*p) # λ = np = 3
k = 2
print(f"\nBinom({n},{p}) P(X={k}): {binom_approx.pmf(k):.6f}")
print(f"Poisson({n*p}) P(X={k}): {poisson_approx.pmf(k):.6f}")
# Very close — Poisson is a good approximation Linear regression assumes Gaussian residuals with constant variance. Count data (number of purchases, clicks, defects) are non-negative, integer-valued, and variance typically scales with the mean (Var ≈ mean for Poisson). Linear regression will predict negative counts and underfit because its constant-variance assumption is violated.
Using a Poisson model for a click-through rate of 30% with n=100 visits is wrong — the Poisson approximation requires p to be small (≤0.05). With high p, the Binomial distribution is bounded at n and has lower variance than Poisson, so the Poisson model over-predicts extreme counts.
A logistic regression models P(y=1|x) = σ(wᵀx) = p. If we assume y ~ Bernoulli(p), the likelihood of observing a single example is p^y·(1−p)^(1−y). The log-likelihood is y·log(p) + (1−y)·log(1−p). Summing over n examples and negating gives the binary cross-entropy loss: L = −Σ[yᵢlog(pᵢ) + (1−yᵢ)log(1−pᵢ)]. Minimising cross-entropy = maximising the Bernoulli log-likelihood = MLE. This is why logistic regression is called a discriminative generative model.
Poisson arises from a memoryless arrival process where events occur at constant rate λ independently. This independence constraint forces Var(X) = λ = E[X]. In practice, count data is often overdispersed (Var >> mean) because events cluster — customers who buy once are more likely to buy again, so purchase counts have higher variance than Poisson predicts. When Var > mean, use Negative Binomial (adds a dispersion parameter). When Var < mean (rare), use Binomial. Always check: if var/mean >> 1 in your count target, Poisson regression will give poor fits.
When a logistic regression outputs 0.73, it is claiming P(y=1|x) = 0.73. That number is meaningless unless you understand probability axioms, conditional probability, and calibration. Statistics is the language ML is written in.