ML Coding Interview · 45 Minutes

Write it.
Explain it.
Optimise it.

12 algorithms interviewers actually ask. Each one with a think-aloud script, production-quality code, and the Senior ✦ insight the top 10% know.

Page

What interviewers score

Correctness — does it produce the right output?
Complexity — can you name it before they ask?
Vectorisation — do you avoid Python loops?
Edge cases — empty clusters, division by zero, overflow
Think-aloud — do you narrate the why, not just the what?

👤 Senior ML Engineer · Interviewer

Time

Space

Difficulty

Code

Open with

"The formula is Attention(Q,K,V) = softmax(QK^T / √d_k) · V. Before writing code, let me explain why each component exists — the sqrt scaling prevents vanishing gradients through softmax on high-dimensional vectors."

Step 1 — Raw scores: QK^T via matmul → shape (batch, heads, seq, seq). O(n²d) per head.
Step 2 — Scale: divide by √d_k. Without this, dot products grow with d_k and softmax saturates.
Step 3 — Mask (optional): set padding/future positions to −∞ so softmax outputs ~0 weight.
Step 4 — Numerically stable softmax: subtract max before exp (log-sum-exp trick). Essential for large d_k.
Step 5 — Weighted sum: weights @ V → context vectors, shape (batch, heads, seq, d_v).

Name proactively

Masking: causal mask for decoder (upper triangle = −∞), padding mask for variable-length sequences
Numerical stability: subtract row-max before exp — prevents NaN/Inf on large dot products
Shape convention: swapaxes(-2,-1) not .T for batched transpose

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K : (batch, heads, seq, d_k)
    V    : (batch, heads, seq, d_v)
    mask : (batch, 1, seq, seq) — 0 where position should be masked
    Returns: context (same shape as V), attention_weights
    """
    d_k = Q.shape[-1]

    # Step 1 — raw scores (batch, heads, seq_q, seq_k)
    scores = np.matmul(Q, K.swapaxes(-2, -1)) / np.sqrt(d_k)

    # Step 2 — apply mask: set masked positions to -1e9 before softmax
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)

    # Step 3 — numerically stable softmax (subtract row max)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Step 4 — weighted sum of values
    context = np.matmul(weights, V)
    return context, weights


# ── Causal mask helper ───────────────────────────────────────────
def causal_mask(seq_len):
    """Lower-triangular mask: position i can only attend to j <= i."""
    return np.tril(np.ones((seq_len, seq_len)))[None, None, :, :]

✦ Senior ML Engineer

"Why √d_k? As d_k grows, the variance of QK^T grows proportionally. Without scaling, softmax inputs become large in magnitude, the output becomes near-one-hot, and gradients through softmax vanish. The √d_k normalises the variance back to ~1. State this explicitly — most candidates write the division without explaining it."

✦ Senior ML Engineer

"Flash Attention computes the same result in O(n) memory instead of O(n²) by processing attention in tiles and never materialising the full n×n score matrix. The key insight: softmax can be computed incrementally. At 2048 sequence length, the O(n²) matrix is 16MB per head. At 32K tokens it's 4GB. Mentioning Flash Attention immediately shows you understand the production bottleneck — even if they don't ask you to implement it."

Open with

"I'll cache all intermediate activations during the forward pass because the backward pass needs them for the chain rule. Let me state the shapes at each layer first so I don't lose track."

Forward: X → Z1=XW1+b1 → A1=relu(Z1) → Z2=A1W2+b2 → probs=softmax(Z2) → CE loss
Backward layer 2: dL/dZ2 = (probs − one_hot_y) / N. Closed form for softmax+CE combined.
Backward layer 1: dA1 = dZ2 @ W2.T; dZ1 = dA1 × relu'(Z1) where relu'(z) = (z > 0)
Weight gradients: dW2 = A1.T @ dZ2; dW1 = X.T @ dZ1. Shape-check before writing.
He initialisation: W ~ N(0, 2/fan_in) for ReLU layers. Always mention this.

Name proactively

Numerical stability: add ε = 1e-9 inside log to prevent log(0)
Softmax+CE gradient simplifies to (probs − y_onehot)/N — don't compute them separately
ReLU backward is a mask: gradient passes through where activation was positive

import numpy as np

def relu(x):     return np.maximum(0, x)
def relu_back(x): return (x > 0).astype(float)

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)  # stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

class TwoLayerNet:
    def __init__(self, d_in, d_h, d_out, lr=0.01):
        # He initialisation for ReLU layers
        self.W1 = np.random.randn(d_in, d_h)   * np.sqrt(2.0 / d_in)
        self.b1 = np.zeros(d_h)
        self.W2 = np.random.randn(d_h, d_out)  * np.sqrt(2.0 / d_h)
        self.b2 = np.zeros(d_out)
        self.lr  = lr

    def forward(self, X, y):
        N = X.shape[0]
        self.X   = X
        self.z1  = X @ self.W1 + self.b1        # (N, H)
        self.a1  = relu(self.z1)                # (N, H)
        self.z2  = self.a1 @ self.W2 + self.b2  # (N, C)
        self.p   = softmax(self.z2)             # (N, C)
        loss = -np.log(self.p[np.arange(N), y] + 1e-9).mean()
        return loss

    def backward(self, y):
        N = self.X.shape[0]

        # Softmax + CE gradient (closed form — don't compute separately)
        dz2 = self.p.copy()
        dz2[np.arange(N), y] -= 1
        dz2 /= N                                # (N, C)

        dW2 = self.a1.T @ dz2                   # (H, C)
        db2 = dz2.sum(axis=0)

        da1 = dz2 @ self.W2.T                   # (N, H)
        dz1 = da1 * relu_back(self.z1)          # (N, H)

        dW1 = self.X.T @ dz1                    # (D, H)
        db1 = dz1.sum(axis=0)

        # SGD update
        for p, g in [(self.W2,dW2),(self.b2,db2),(self.W1,dW1),(self.b1,db1)]:
            p -= self.lr * g

✦ Senior ML Engineer

"The combined softmax+CE gradient simplifies to (probs − y_onehot)/N. Don't derive softmax gradient and CE gradient separately and multiply them — that's technically correct but a red flag. Any engineer who's implemented this knows the closed form. State it upfront: 'The gradient through the combined softmax+CE layer has a clean closed form, so I'll use that directly.'"

✦ Senior ML Engineer

"He initialisation (σ = √(2/fan_in)) is non-negotiable for ReLU layers. Xavier (σ = √(1/fan_in)) is for tanh/sigmoid. Using Xavier with ReLU causes activations to halve in variance at every layer — a 10-layer network starts with variance 1/1024. Name the initialisation and the reason before writing the weight init line. It's a 5-second sentence that shows you understand training dynamics."

Open with

"Standard K-Means has three failure modes I'll address upfront: bad initialisation (K-Means++), empty clusters on reassignment (reinitialise from data), and slow convergence (use centroid shift as stopping criterion)."

Initialise (K-Means++): pick first centroid randomly, then each subsequent centroid with probability ∝ distance² to nearest existing centroid. Better spread than random.
Assign: vectorised distance (n, k) matrix via broadcasting, argmin per row → O(nkd).
Update: mean per cluster. If any cluster is empty → reinitialise that centroid to a random data point.
Converge: check ‖new_centroids − old_centroids‖ < tol, or max_iter reached.

Name proactively

Empty cluster: mask.sum() == 0 → reinit from random data point (not zeros)
Distance matrix: (n, k) via broadcasting, avoid a Python loop over k
Convergence: use centroid shift not label change — more stable

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=42):
    """X: (n, d). Returns centroids (k, d), labels (n,)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # ── K-Means++ initialisation ────────────────────────────────────
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Distance² from each point to nearest centroid
        dists = np.array([
            min(np.sum((x - c) ** 2) for c in centroids) for x in X
        ])
        probs = dists / dists.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    centroids = np.array(centroids)   # (k, d)

    for _ in range(max_iter):
        # ── Assign: vectorised (n, k) distance matrix ──────────────────
        dists = np.sum(
            (X[:, None, :] - centroids[None, :, :]) ** 2, axis=2
        )                              # (n, k)
        labels = dists.argmin(axis=1)  # (n,)

        # ── Update: mean per cluster, handle empty clusters ────────────
        new_c = np.zeros_like(centroids)
        for j in range(k):
            mask = labels == j
            if mask.sum() == 0:
                new_c[j] = X[rng.integers(n)]   # reinit empty cluster
            else:
                new_c[j] = X[mask].mean(axis=0)

        # ── Convergence: centroid shift ─────────────────────────────────
        shift = np.linalg.norm(new_c - centroids)
        centroids = new_c
        if shift < tol:
            break

    return centroids, labels

✦ Senior ML Engineer

"The (n, k) vectorised distance matrix — X[:, None, :] - centroids[None, :, :] — is the Python loop elimination that shows vectorisation fluency. Candidates who write 'for j in range(k): dists[:, j] = ...' are getting the right answer slowly. A 10,000-point, 20-cluster K-Means runs 15× faster with the broadcasting version. Always state: 'I'll use broadcasting to avoid a loop over clusters.'"

✦ Senior ML Engineer

"K-Means++ initialisation cuts iterations to convergence by 2-3× and improves final clustering quality — this is why sklearn defaults to it. The key idea: sample subsequent centroids with probability proportional to distance², which spreads them across the data. Mentioning this vs random initialisation, and explaining why, signals you know the algorithm's history, not just its mechanics."

Open with

"Naive softmax overflows with large logits — exp(1000) is inf. The fix is the log-sum-exp trick: subtract the row max before exponentiation. Mathematically equivalent, numerically stable."

Stable softmax: shift = x − max(x, axis=−1). Then exp(shift) / sum(exp(shift)). Shift doesn't change the ratio.
Cross-entropy: −log(p[true_class]). Add ε = 1e-9 inside log for safety.
Fused gradient: dL/dz = (softmax(z) − one_hot_y) / N. One pass, no chain rule needed.
Log-softmax + NLL: more numerically stable than softmax then log: log_softmax(x) = x − log(sum(exp(x−max))) + max.

Name proactively

Overflow: exp(1000) = inf without max subtraction
Underflow: exp(−1000) = 0 → log(0) = −inf → add ε
Gradient: use fused form, not softmax_grad × ce_grad

import numpy as np

# ── Numerically stable softmax ──────────────────────────────────
def softmax(x):
    """x: (N, C). Returns probabilities same shape."""
    shifted = x - x.max(axis=-1, keepdims=True)  # log-sum-exp trick
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# ── Cross-entropy loss ──────────────────────────────────────────
def cross_entropy(probs, y):
    """probs: (N, C), y: (N,) integer labels. Returns scalar loss."""
    N = probs.shape[0]
    log_p = np.log(probs[np.arange(N), y] + 1e-9)
    return -log_p.mean()

# ── Fused softmax + CE gradient ─────────────────────────────────
def softmax_ce_grad(logits, y):
    """
    Combined gradient dL/d(logits) for softmax + cross-entropy.
    Closed form: (softmax(logits) - one_hot(y)) / N
    """
    N = logits.shape[0]
    g = softmax(logits)            # (N, C)
    g[np.arange(N), y] -= 1       # subtract 1 at true class
    return g / N

# ── Log-softmax (more stable for loss computation) ───────────────
def log_softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

✦ Senior ML Engineer

"Log-softmax followed by NLL loss is strictly more stable than softmax followed by log(p). With softmax+log, you compute exp then immediately log — wasting precision. Log-softmax combines them: log(exp(x-max) / sum) = x − max − log(sum(exp(x−max))). PyTorch's F.cross_entropy uses this internally. Mentioning the equivalence and why log-softmax is preferred shows you understand numerical computing, not just the math."

Open with

"BatchNorm has two distinct modes: training uses batch statistics (mean/var of current mini-batch), inference uses running statistics accumulated during training via exponential moving average. I'll implement both."

Forward (train): μ = mean(x), σ² = var(x), x̂ = (x−μ)/√(σ²+ε), y = γx̂ + β
Running stats: update EMA: running_mean = (1−m)·running_mean + m·μ. Used at inference.
Forward (inference): use running_mean, running_var instead of batch stats.
Backward: chain rule through normalisation — dγ = Σ(dout × x̂), dβ = Σ(dout), dx requires the full computation graph.

Name proactively

ε in denominator: prevents division by zero when variance is 0
Training vs inference mode: running stats essential — batch stats at inference are wrong
Interaction with dropout: BN after dropout behaves differently from before

import numpy as np

class BatchNorm1d:
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)   # scale (learnable)
        self.beta  = np.zeros(num_features)  # shift (learnable)
        self.eps   = eps
        self.mom   = momentum
        # Running stats for inference
        self.running_mean = np.zeros(num_features)
        self.running_var  = np.ones(num_features)

    def forward(self, x, training=True):
        """x: (N, F)"""
        if training:
            self.mu  = x.mean(axis=0)        # (F,)
            self.var = x.var(axis=0)         # (F,)
            self.x_hat = (x - self.mu) / np.sqrt(self.var + self.eps)
            # Update running stats (EMA)
            self.running_mean = (1 - self.mom) * self.running_mean + self.mom * self.mu
            self.running_var  = (1 - self.mom) * self.running_var  + self.mom * self.var
        else:
            # Inference: use accumulated running stats
            self.x_hat = (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
        return self.gamma * self.x_hat + self.beta

    def backward(self, dout):
        """dout: (N, F). Returns dx, dgamma, dbeta."""
        N = dout.shape[0]
        dgamma = (dout * self.x_hat).sum(axis=0)
        dbeta  = dout.sum(axis=0)

        dx_hat = dout * self.gamma
        inv_std = 1.0 / np.sqrt(self.var + self.eps)
        dx = (1.0/N) * inv_std * (
            N * dx_hat
            - dx_hat.sum(axis=0)
            - self.x_hat * (dx_hat * self.x_hat).sum(axis=0)
        )
        return dx, dgamma, dbeta

✦ Senior ML Engineer

"The running stats update during training — and using them at inference — is what most candidates forget. A BatchNorm layer that always uses batch statistics at inference time will behave differently for batch size 1 (evaluation) vs batch size 32 (training). This is one of the most common production bugs in early PyTorch implementations. State upfront: 'There are two modes, and I need running statistics for inference.'"

✦ Senior ML Engineer

"Layer Norm normalises across the feature dimension per sample (not across the batch). This makes it sequence-length-independent and the default choice for Transformers — you can't batch-normalise variable-length sequences easily. If the interviewer says 'how would you modify this for a Transformer?' the answer is: change the normalisation axis from axis=0 (batch) to axis=-1 (features), and there are no running statistics needed."

Open with

"Adam maintains two running estimates per parameter: a first moment (mean of gradients) and a second moment (uncentred variance). Bias correction compensates for the fact that both start at zero — without it, early steps are severely underestimated."

First moment: m_t = β1·m_(t-1) + (1−β1)·g_t — smoothed gradient direction
Second moment: v_t = β2·v_(t-1) + (1−β2)·g_t² — smoothed gradient magnitude²
Bias correction: m̂ = m/(1−β1^t), v̂ = v/(1−β2^t). At t=1, β1=0.9: m̂ = m/0.1 — 10× amplification to compensate for cold start.
Update: θ = θ − α · m̂ / (√v̂ + ε)
Typical defaults: α=1e-3, β1=0.9, β2=0.999, ε=1e-8

Name proactively

Bias correction: without it, first few steps severely undershoot
AdamW: L2 regularisation applied directly to weights, not gradient — prevents weight decay from interacting with adaptive learning rates
AMSGrad: uses max(v̂) instead of v̂ to guarantee convergence

import numpy as np

class Adam:
    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr    = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps   = eps
        self.m     = {}  # first moments
        self.v     = {}  # second moments
        self.t     = 0   # global step counter

    def step(self, params, grads):
        """
        params, grads: dict {name: np.ndarray}
        Updates params in-place.
        """
        self.t += 1

        for name, g in grads.items():
            if name not in self.m:
                self.m[name] = np.zeros_like(g)
                self.v[name] = np.zeros_like(g)

            # Biased moment estimates
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * g
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * g ** 2

            # Bias-corrected estimates
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)

            # Parameter update
            params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

✦ Senior ML Engineer

"AdamW separates weight decay from the gradient update — this is the most important Adam variant to know. In vanilla Adam, L2 regularisation (adding λθ to the gradient) interacts with the adaptive learning rate: parameters with large gradients get less decay than parameters with small gradients, which is the opposite of what you want. AdamW applies the decay directly: θ ← θ × (1−λ), before the Adam step. This is why every modern large-model training uses AdamW, not Adam."

Open with

"Gradient clipping is primarily needed for RNNs and deep networks where gradients can explode during BPTT. Global norm clipping preserves gradient direction — it scales all gradients uniformly if the total norm exceeds a threshold. Value clipping clips each component independently and distorts the direction."

Global L2 norm: √(Σ_i ||g_i||²) across all parameter tensors concatenated.
Clip coefficient: max_norm / total_norm. If < 1, scale all grads by this factor.
In-place update: multiply each grad tensor by clip_coef — no copies needed.
Value clipping: np.clip(g, −v, v) — simpler but changes direction; only use if distribution of gradient magnitudes is known.

Name proactively

Division by zero: add 1e-6 to total_norm denominator
When not to clip: if grads are already small, clipping is a no-op — check total_norm first
PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

import numpy as np

def clip_grad_norm(grads, max_norm):
    """
    Global gradient norm clipping.
    grads   : list of np.ndarray (one per parameter group)
    max_norm: float clip threshold
    Returns : (clipped_grads, total_norm)
    """
    # Step 1 — global L2 norm across all parameter gradients
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))

    # Step 2 — clip coefficient (1.0 means no clipping)
    clip_coef = max_norm / (total_norm + 1e-6)

    # Step 3 — scale all gradients uniformly (preserves direction)
    if clip_coef < 1.0:
        for g in grads:
            g *= clip_coef   # in-place — no copy

    return grads, float(total_norm)


# ── Value clipping (component-wise — distorts direction) ─────────
def clip_grad_value(grads, clip_value):
    return [np.clip(g, -clip_value, clip_value) for g in grads]

✦ Senior ML Engineer

"Global norm clipping preserves gradient direction — it's a uniform rescaling of the entire gradient vector. Value clipping clips each component independently and produces a different vector direction. For RNNs where gradient magnitude matters but direction is critical, norm clipping is strictly better. The monitoring insight: log total_norm during training. If it regularly hits the clip threshold (max_norm), your learning rate is too high or your architecture has a pathological gradient flow problem."

Open with

"The default rolling().mean() includes the current row in the window. For a predictive feature, that's leakage — at inference time, you don't have the current row's value. The fix is shift(1) before rolling, which shifts values forward by one period so each row's window only contains past data."

Sort first: groupby + sort_values by date — rolling requires chronological order.
shift(1) before rolling: moves each value one period forward, excluding current row from its own window.
Temporal train/test split: cutoff date, not random. All train rows before cutoff, all test rows after.
Point-in-time join: for multi-table setups, join features using as-of lookups — only information available at prediction time.

Name proactively

random_state in train_test_split: wrong for time series — shuffles temporal order
min_periods=1: avoids NaN for users with fewer than 7 days of history
GroupBy leakage: ensure rolling is scoped per user, not across all users

import pandas as pd
import numpy as np

# ── BAD: leaks current-row value into its own window ─────────────
def leaky_features(df):
    df['roll7'] = df.groupby('user_id')['revenue'].transform(
        lambda x: x.rolling(7).mean()  # includes current row
    )
    return df

# ── GOOD: point-in-time correct ──────────────────────────────────
def pit_features(df):
    df = df.sort_values(['user_id', 'date']).copy()

    grp = df.groupby('user_id')['revenue']

    # shift(1) excludes current row from its own window
    df['roll7_avg']  = grp.transform(
        lambda x: x.shift(1).rolling(7, min_periods=1).mean()
    )
    df['lag_1d']    = grp.shift(1)
    df['lag_7d']    = grp.shift(7)
    df['roll7_std']  = grp.transform(
        lambda x: x.shift(1).rolling(7, min_periods=2).std()
    )
    return df

# ── Correct temporal train/test split ───────────────────────────
def temporal_split(df, cutoff, target='label'):
    # Never use train_test_split(shuffle=True) for time series
    train = df[df['date'] <  cutoff]
    test  = df[df['date'] >= cutoff]

    feat_cols = [c for c in df.columns if c not in [target, 'date']]
    return (train[feat_cols], train[target],
            test[feat_cols],  test[target])

✦ Senior ML Engineer

"Feature leakage via rolling without shift(1) is one of the most common production ML bugs. The model trains fine, validation AUC looks great, then production performance collapses. The diagnosis: production AUC significantly lower than validation AUC on a time-series problem. The fix is always shift(1) before rolling. Name this as the default defensive pattern: 'I always shift before rolling on time-series features.'"

✦ Senior ML Engineer

"Point-in-time correctness extends beyond rolling features. In multi-table joins — joining customer attributes to transactions — you must join on the attribute values as they existed at the transaction date, not their current values. Joining on current customer tier to historical transactions leaks future tier changes into past predictions. Production feature stores (Feast, Tecton) enforce PIT joins by design. Mentioning this shows you've thought about end-to-end pipeline correctness."

Open with

"Naive target encoding — replacing a category with its mean target value — leaks the label into the feature. The fix is out-of-fold encoding: compute the encoding for each row using only the folds it wasn't in. For rare categories, add Bayesian smoothing to shrink toward the global mean."

Out-of-fold loop: for each fold, compute mean target per category on training portion, apply to validation portion.
Smoothing: encoded = (count × cat_mean + smoothing × global_mean) / (count + smoothing). Rare categories get pulled toward global mean.
Inference encoding: fit on full training set with smoothing, store as a map for test-time lookup.
Fallback: unseen categories at inference → global mean.

Name proactively

shuffle=False in KFold: for time-series target encoding, preserve temporal order
Smoothing factor: 10–50 typical; higher = more regularisation for rare categories
Unseen categories at inference: fillna(global_mean)

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(X_train, y_train, col, n_folds=5, smooth=10):
    """
    Out-of-fold target encoding with Bayesian smoothing.
    Returns: (oof_encoded array, inference encoding map)
    """
    global_mean = y_train.mean()
    oof = np.zeros(len(X_train))

    kf = KFold(n_splits=n_folds, shuffle=False)  # no shuffle for temporal

    for tr_idx, val_idx in kf.split(X_train):
        X_tr  = X_train.iloc[tr_idx]
        y_tr  = y_train.iloc[tr_idx]
        X_val = X_train.iloc[val_idx]

        # Per-category stats on training fold
        stats = pd.DataFrame({'mean': y_tr.groupby(X_tr[col]).mean(),
                               'cnt':  y_tr.groupby(X_tr[col]).count()})

        # Bayesian smoothing: shrink toward global mean for small counts
        stats['enc'] = (
            (stats['cnt'] * stats['mean'] + smooth * global_mean)
            / (stats['cnt'] + smooth)
        )
        oof[val_idx] = (
            X_val[col].map(stats['enc']).fillna(global_mean).values
        )

    # Inference map: fit on full training set
    full_stats = pd.DataFrame({'mean': y_train.groupby(X_train[col]).mean(),
                                'cnt':  y_train.groupby(X_train[col]).count()})
    full_stats['enc'] = (
        (full_stats['cnt'] * full_stats['mean'] + smooth * global_mean)
        / (full_stats['cnt'] + smooth)
    )
    enc_map = full_stats['enc'].to_dict()

    return oof, enc_map, global_mean

✦ Senior ML Engineer

"Bayesian smoothing for rare categories — (count × cat_mean + λ × global_mean) / (count + λ) — is the production detail that separates target encoding done right from done naively. A category with 2 observations and a 100% positive rate shouldn't encode as 1.0. With λ=10, it encodes as (2×1.0 + 10×global_mean) / 12 — pulled significantly toward the global mean. λ is a hyperparameter; tune it on a holdout. State this before writing the formula."

Open with

"SMOTE generates synthetic minority samples by interpolating between existing minority samples and their k nearest neighbours. This is better than simple duplication because it adds new points along the feature manifold rather than exact copies."

Compute KNN: pairwise distance matrix (n×n), argsort each row, take first k non-self neighbours.
Generate: for each synthetic sample, pick a random minority point and a random neighbour, interpolate: x_new = x + α(x_nn − x) where α ~ Uniform(0, 1).
Result: n_synthetic new points, all within the convex hull of minority class pairs.

Name proactively

Distance matrix diagonal: set to inf to exclude self-distance in KNN
Class ratio: don't oversample to 50/50 blindly — tune the ratio to your metric
Categorical features: standard SMOTE works on continuous features; SMOTENC handles mixed types

import numpy as np

def smote(X_minority, n_synthetic, k=5, seed=42):
    """
    X_minority : (n, d) minority class samples
    n_synthetic: number of new samples to generate
    Returns    : (n_synthetic, d) synthetic samples
    """
    rng = np.random.default_rng(seed)
    n, d = X_minority.shape

    # Step 1 — pairwise squared distance matrix (n x n)
    diff  = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(dists, np.inf)   # exclude self

    # Step 2 — k nearest neighbour indices for each point
    knn_idx = np.argsort(dists, axis=1)[:, :k]  # (n, k)

    # Step 3 — generate synthetic samples via interpolation
    synthetic = np.zeros((n_synthetic, d))
    for i in range(n_synthetic):
        src    = rng.integers(n)             # random minority sample
        nn     = knn_idx[src, rng.integers(k)]  # random neighbour
        alpha  = rng.random()               # interpolation factor
        synthetic[i] = (
            X_minority[src] + alpha * (X_minority[nn] - X_minority[src])
        )
    return synthetic

# Usage:
# X_syn = smote(X_minority, n_synthetic=500)
# X_bal = np.vstack([X_majority, X_minority, X_syn])
# y_bal = np.hstack([y_maj, np.ones(len(X_minority) + len(X_syn))])

✦ Senior ML Engineer

"SMOTE is a training-set-only operation — apply it after your train/test split, never before. If you apply SMOTE before splitting, synthetic samples generated from real training points will leak into your test set. The correct pipeline: split first, then oversample the minority class in the training set only. This is the most common imbalanced-data pipeline bug. State it before coding: 'I'll apply this to the training set after splitting.'"

✦ Senior ML Engineer

"Class imbalance is primarily an evaluation problem, not just a training problem. AUC-PR is the correct metric for imbalanced classes — AUC-ROC is optimistic because it includes the trivially-correct true negative region. At 1% positive rate, a model that predicts all negative has AUC-ROC of 0.5 but AUC-PR near 0. Mention this when discussing imbalanced data: 'I'd also switch from AUC-ROC to AUC-PR as the primary offline metric.'"

Open with

"There are three complexity tiers: O(n log n) sort-then-slice (simplest), O(n log k) min-heap (optimal for streaming), and O(n) np.argpartition (optimal for in-memory arrays). For ANN retrieval, cosine similarity is a dot product after L2 normalisation — fully vectorisable."

Sort approach: O(n log n) — np.argsort then slice. Correct but wasteful; sorts more than needed.
Min-heap: O(n log k) time, O(k) space. Optimal when k << n or data streams in. Heap size stays at k.
np.argpartition: O(n) average. Guarantees top-k in output but not sorted within them. Fastest for in-memory arrays.
Cosine similarity: normalise corpus to unit vectors once, then at query time: dot product = cosine similarity. O(nd) per query.

Name proactively

argpartition is not sorted: apply argsort on the k returned indices to rank them
FAISS for scale: at 1M+ vectors, FAISS with IVF-PQ gives O(1) approximate retrieval
Zero-norm vectors: add ε to denominator in cosine normalisation

import numpy as np
import heapq

# ── Option 1: Min-heap — O(n log k), O(k) space ─────────────────
def top_k_heap(scores, k):
    heap = []
    for i, s in enumerate(scores):
        heapq.heappush(heap, (s, i))
        if len(heap) > k:
            heapq.heappop(heap)          # evict smallest
    return [i for _, i in sorted(heap, reverse=True)]

# ── Option 2: np.argpartition — O(n) avg, in-memory ─────────────
def top_k_partition(scores, k):
    indices = np.argpartition(scores, -k)[-k:]     # unordered top-k
    return indices[np.argsort(scores[indices])[::-1]] # sorted

# ── Option 3: Vectorised cosine ANN ─────────────────────────────
def cosine_top_k(query, corpus, k):
    """
    query  : (d,)    query vector
    corpus : (n, d)  pre-indexed item vectors
    Returns: top-k indices sorted by cosine similarity
    """
    # Normalise both to unit vectors — dot product = cosine similarity
    q_norm = query  / (np.linalg.norm(query)  + 1e-8)
    c_norm = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-8)
    scores = c_norm @ q_norm      # (n,) — vectorised dot products
    return top_k_partition(scores, k)

✦ Senior ML Engineer

"np.argpartition is the answer most candidates don't know. It runs in O(n) average time using introselect, guarantees the k largest are in the output (without sorting them), and runs entirely in C. At 10M documents, argpartition is 3× faster than argsort for top-100 retrieval. Use it for any in-memory top-k. The one gotcha: the returned indices are not sorted — apply argsort on the k results to rank them. State both the advantage and the gotcha."

Open with

"A model with AUC 0.92 can still output completely wrong probabilities — it ranks correctly but isn't calibrated. Platt scaling fits a sigmoid f(s) = 1/(1 + exp(As + B)) on top of raw model scores via MLE. It's a 2-parameter model trained on the validation set."

Platt scaling: fit A, B by minimising negative log-likelihood of sigmoid(A·score + B) against true labels on a held-out calibration set.
Reliability diagram: bin predictions [0,1] into n_bins buckets; plot predicted probability vs actual positive rate per bin. Perfect calibration = diagonal line.
ECE (Expected Calibration Error): weighted average of |predicted − actual| per bin. Target < 0.05.
Isotonic regression: non-parametric alternative — fits a monotone step function. More flexible than Platt, requires more data (use for >1000 calibration samples).

Name proactively

Calibration set must be held-out: fitting on training scores causes overfitting
Platt assumes sigmoid relationship: only valid if model outputs roughly linear log-odds
Temperature scaling: single parameter T (divide logits by T) — popular for deep learning

import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Fit Platt sigmoid: p = 1/(1 + exp(A*score + B)) via MLE."""
    def neg_log_likelihood(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        p = np.clip(p, 1e-7, 1 - 1e-7)
        return -np.mean(
            labels * np.log(p) + (1 - labels) * np.log(1 - p)
        )
    res = minimize(neg_log_likelihood, x0=[-1.0, 0.0], method='L-BFGS-B')
    return res.x   # (A, B)

def platt_predict(scores, A, B):
    return 1.0 / (1.0 + np.exp(A * scores + B))

# ── Reliability diagram + ECE ────────────────────────────────────
def calibration_stats(probs, labels, n_bins=10):
    """Returns (bin_pred, bin_actual, ECE)."""
    bins       = np.linspace(0, 1, n_bins + 1)
    bin_pred   = []
    bin_actual = []
    ece        = 0.0
    N          = len(labels)

    for i in range(n_bins):
        mask = (probs >= bins[i]) & (probs < bins[i + 1])
        if mask.sum() > 0:
            bp = probs[mask].mean()
            ba = labels[mask].mean()
            bin_pred.append(bp)
            bin_actual.append(ba)
            ece += (mask.sum() / N) * abs(bp - ba)

    return np.array(bin_pred), np.array(bin_actual), ece

✦ Senior ML Engineer

"Calibration matters as much as discrimination in production. A fraud model with AUC 0.95 but poor calibration will set the wrong transaction block threshold — if it says 0.9 probability but the actual rate is 0.3, your threshold is wrong. Temperature scaling is the simplest calibration method for neural networks: divide logits by a scalar T before softmax, fit T on a validation set. One parameter, no change to model ranking, dramatically improved probability estimates. It's why all modern LLMs use temperature at inference."

✦ Senior ML Engineer

"The reliability diagram is the correct way to present calibration — not just ECE. ECE is a single number that hides the shape of miscalibration. A model that systematically overestimates probabilities for high scores and underestimates for low scores has ECE near 0 (errors cancel) but is badly calibrated in practice. Plot the reliability diagram and look for the S-curve pattern — that's the signature of a model that needs Platt scaling."

Write it.Explain it.Optimise it.

Write it.
Explain it.
Optimise it.