ML Coding Interview · 45 Minutes
12 algorithms interviewers actually ask. Each one with a think-aloud script, production-quality code, and the Senior ✦ insight the top 10% know.
What interviewers score
"The formula is Attention(Q,K,V) = softmax(QKT / √dk) · V. Before writing code, let me explain why each component exists — the sqrt scaling prevents vanishing gradients through softmax on high-dimensional vectors."
import numpy as np def scaled_dot_product_attention(Q, K, V, mask=None): """ Q, K : (batch, heads, seq, d_k) V : (batch, heads, seq, d_v) mask : (batch, 1, seq, seq) — 0 where position should be masked Returns: context (same shape as V), attention_weights """ d_k = Q.shape[-1] # Step 1 — raw scores (batch, heads, seq_q, seq_k) scores = np.matmul(Q, K.swapaxes(-2, -1)) / np.sqrt(d_k) # Step 2 — apply mask: set masked positions to -1e9 before softmax if mask is not None: scores = np.where(mask == 0, -1e9, scores) # Step 3 — numerically stable softmax (subtract row max) scores -= scores.max(axis=-1, keepdims=True) weights = np.exp(scores) weights /= weights.sum(axis=-1, keepdims=True) # Step 4 — weighted sum of values context = np.matmul(weights, V) return context, weights # ── Causal mask helper ─────────────────────────────────────────── def causal_mask(seq_len): """Lower-triangular mask: position i can only attend to j <= i.""" return np.tril(np.ones((seq_len, seq_len)))[None, None, :, :]
"Why √dk? As dk grows, the variance of QKT grows proportionally. Without scaling, softmax inputs become large in magnitude, the output becomes near-one-hot, and gradients through softmax vanish. The √dk normalises the variance back to ~1. State this explicitly — most candidates write the division without explaining it."
"Flash Attention computes the same result in O(n) memory instead of O(n²) by processing attention in tiles and never materialising the full n×n score matrix. The key insight: softmax can be computed incrementally. At 2048 sequence length, the O(n²) matrix is 16MB per head. At 32K tokens it's 4GB. Mentioning Flash Attention immediately shows you understand the production bottleneck — even if they don't ask you to implement it."
"I'll cache all intermediate activations during the forward pass because the backward pass needs them for the chain rule. Let me state the shapes at each layer first so I don't lose track."
import numpy as np def relu(x): return np.maximum(0, x) def relu_back(x): return (x > 0).astype(float) def softmax(x): x = x - x.max(axis=1, keepdims=True) # stability e = np.exp(x) return e / e.sum(axis=1, keepdims=True) class TwoLayerNet: def __init__(self, d_in, d_h, d_out, lr=0.01): # He initialisation for ReLU layers self.W1 = np.random.randn(d_in, d_h) * np.sqrt(2.0 / d_in) self.b1 = np.zeros(d_h) self.W2 = np.random.randn(d_h, d_out) * np.sqrt(2.0 / d_h) self.b2 = np.zeros(d_out) self.lr = lr def forward(self, X, y): N = X.shape[0] self.X = X self.z1 = X @ self.W1 + self.b1 # (N, H) self.a1 = relu(self.z1) # (N, H) self.z2 = self.a1 @ self.W2 + self.b2 # (N, C) self.p = softmax(self.z2) # (N, C) loss = -np.log(self.p[np.arange(N), y] + 1e-9).mean() return loss def backward(self, y): N = self.X.shape[0] # Softmax + CE gradient (closed form — don't compute separately) dz2 = self.p.copy() dz2[np.arange(N), y] -= 1 dz2 /= N # (N, C) dW2 = self.a1.T @ dz2 # (H, C) db2 = dz2.sum(axis=0) da1 = dz2 @ self.W2.T # (N, H) dz1 = da1 * relu_back(self.z1) # (N, H) dW1 = self.X.T @ dz1 # (D, H) db1 = dz1.sum(axis=0) # SGD update for p, g in [(self.W2,dW2),(self.b2,db2),(self.W1,dW1),(self.b1,db1)]: p -= self.lr * g
"The combined softmax+CE gradient simplifies to (probs − y_onehot)/N. Don't derive softmax gradient and CE gradient separately and multiply them — that's technically correct but a red flag. Any engineer who's implemented this knows the closed form. State it upfront: 'The gradient through the combined softmax+CE layer has a clean closed form, so I'll use that directly.'"
"He initialisation (σ = √(2/fan_in)) is non-negotiable for ReLU layers. Xavier (σ = √(1/fan_in)) is for tanh/sigmoid. Using Xavier with ReLU causes activations to halve in variance at every layer — a 10-layer network starts with variance 1/1024. Name the initialisation and the reason before writing the weight init line. It's a 5-second sentence that shows you understand training dynamics."
"Standard K-Means has three failure modes I'll address upfront: bad initialisation (K-Means++), empty clusters on reassignment (reinitialise from data), and slow convergence (use centroid shift as stopping criterion)."
import numpy as np def kmeans(X, k, max_iter=100, tol=1e-4, seed=42): """X: (n, d). Returns centroids (k, d), labels (n,).""" rng = np.random.default_rng(seed) n, d = X.shape # ── K-Means++ initialisation ──────────────────────────────────── centroids = [X[rng.integers(n)]] for _ in range(k - 1): # Distance² from each point to nearest centroid dists = np.array([ min(np.sum((x - c) ** 2) for c in centroids) for x in X ]) probs = dists / dists.sum() centroids.append(X[rng.choice(n, p=probs)]) centroids = np.array(centroids) # (k, d) for _ in range(max_iter): # ── Assign: vectorised (n, k) distance matrix ────────────────── dists = np.sum( (X[:, None, :] - centroids[None, :, :]) ** 2, axis=2 ) # (n, k) labels = dists.argmin(axis=1) # (n,) # ── Update: mean per cluster, handle empty clusters ──────────── new_c = np.zeros_like(centroids) for j in range(k): mask = labels == j if mask.sum() == 0: new_c[j] = X[rng.integers(n)] # reinit empty cluster else: new_c[j] = X[mask].mean(axis=0) # ── Convergence: centroid shift ───────────────────────────────── shift = np.linalg.norm(new_c - centroids) centroids = new_c if shift < tol: break return centroids, labels
"The (n, k) vectorised distance matrix — X[:, None, :] - centroids[None, :, :] — is the Python loop elimination that shows vectorisation fluency. Candidates who write 'for j in range(k): dists[:, j] = ...' are getting the right answer slowly. A 10,000-point, 20-cluster K-Means runs 15× faster with the broadcasting version. Always state: 'I'll use broadcasting to avoid a loop over clusters.'"
"K-Means++ initialisation cuts iterations to convergence by 2-3× and improves final clustering quality — this is why sklearn defaults to it. The key idea: sample subsequent centroids with probability proportional to distance², which spreads them across the data. Mentioning this vs random initialisation, and explaining why, signals you know the algorithm's history, not just its mechanics."
"Naive softmax overflows with large logits — exp(1000) is inf. The fix is the log-sum-exp trick: subtract the row max before exponentiation. Mathematically equivalent, numerically stable."
import numpy as np # ── Numerically stable softmax ────────────────────────────────── def softmax(x): """x: (N, C). Returns probabilities same shape.""" shifted = x - x.max(axis=-1, keepdims=True) # log-sum-exp trick e = np.exp(shifted) return e / e.sum(axis=-1, keepdims=True) # ── Cross-entropy loss ────────────────────────────────────────── def cross_entropy(probs, y): """probs: (N, C), y: (N,) integer labels. Returns scalar loss.""" N = probs.shape[0] log_p = np.log(probs[np.arange(N), y] + 1e-9) return -log_p.mean() # ── Fused softmax + CE gradient ───────────────────────────────── def softmax_ce_grad(logits, y): """ Combined gradient dL/d(logits) for softmax + cross-entropy. Closed form: (softmax(logits) - one_hot(y)) / N """ N = logits.shape[0] g = softmax(logits) # (N, C) g[np.arange(N), y] -= 1 # subtract 1 at true class return g / N # ── Log-softmax (more stable for loss computation) ─────────────── def log_softmax(x): shifted = x - x.max(axis=-1, keepdims=True) return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
"Log-softmax followed by NLL loss is strictly more stable than softmax followed by log(p). With softmax+log, you compute exp then immediately log — wasting precision. Log-softmax combines them: log(exp(x-max) / sum) = x − max − log(sum(exp(x−max))). PyTorch's F.cross_entropy uses this internally. Mentioning the equivalence and why log-softmax is preferred shows you understand numerical computing, not just the math."
"BatchNorm has two distinct modes: training uses batch statistics (mean/var of current mini-batch), inference uses running statistics accumulated during training via exponential moving average. I'll implement both."
import numpy as np class BatchNorm1d: def __init__(self, num_features, eps=1e-5, momentum=0.1): self.gamma = np.ones(num_features) # scale (learnable) self.beta = np.zeros(num_features) # shift (learnable) self.eps = eps self.mom = momentum # Running stats for inference self.running_mean = np.zeros(num_features) self.running_var = np.ones(num_features) def forward(self, x, training=True): """x: (N, F)""" if training: self.mu = x.mean(axis=0) # (F,) self.var = x.var(axis=0) # (F,) self.x_hat = (x - self.mu) / np.sqrt(self.var + self.eps) # Update running stats (EMA) self.running_mean = (1 - self.mom) * self.running_mean + self.mom * self.mu self.running_var = (1 - self.mom) * self.running_var + self.mom * self.var else: # Inference: use accumulated running stats self.x_hat = (x - self.running_mean) / np.sqrt(self.running_var + self.eps) return self.gamma * self.x_hat + self.beta def backward(self, dout): """dout: (N, F). Returns dx, dgamma, dbeta.""" N = dout.shape[0] dgamma = (dout * self.x_hat).sum(axis=0) dbeta = dout.sum(axis=0) dx_hat = dout * self.gamma inv_std = 1.0 / np.sqrt(self.var + self.eps) dx = (1.0/N) * inv_std * ( N * dx_hat - dx_hat.sum(axis=0) - self.x_hat * (dx_hat * self.x_hat).sum(axis=0) ) return dx, dgamma, dbeta
"The running stats update during training — and using them at inference — is what most candidates forget. A BatchNorm layer that always uses batch statistics at inference time will behave differently for batch size 1 (evaluation) vs batch size 32 (training). This is one of the most common production bugs in early PyTorch implementations. State upfront: 'There are two modes, and I need running statistics for inference.'"
"Layer Norm normalises across the feature dimension per sample (not across the batch). This makes it sequence-length-independent and the default choice for Transformers — you can't batch-normalise variable-length sequences easily. If the interviewer says 'how would you modify this for a Transformer?' the answer is: change the normalisation axis from axis=0 (batch) to axis=-1 (features), and there are no running statistics needed."
"Adam maintains two running estimates per parameter: a first moment (mean of gradients) and a second moment (uncentred variance). Bias correction compensates for the fact that both start at zero — without it, early steps are severely underestimated."
import numpy as np class Adam: def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8): self.lr = lr self.beta1 = beta1 self.beta2 = beta2 self.eps = eps self.m = {} # first moments self.v = {} # second moments self.t = 0 # global step counter def step(self, params, grads): """ params, grads: dict {name: np.ndarray} Updates params in-place. """ self.t += 1 for name, g in grads.items(): if name not in self.m: self.m[name] = np.zeros_like(g) self.v[name] = np.zeros_like(g) # Biased moment estimates self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * g self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * g ** 2 # Bias-corrected estimates m_hat = self.m[name] / (1 - self.beta1 ** self.t) v_hat = self.v[name] / (1 - self.beta2 ** self.t) # Parameter update params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
"AdamW separates weight decay from the gradient update — this is the most important Adam variant to know. In vanilla Adam, L2 regularisation (adding λθ to the gradient) interacts with the adaptive learning rate: parameters with large gradients get less decay than parameters with small gradients, which is the opposite of what you want. AdamW applies the decay directly: θ ← θ × (1−λ), before the Adam step. This is why every modern large-model training uses AdamW, not Adam."
"Gradient clipping is primarily needed for RNNs and deep networks where gradients can explode during BPTT. Global norm clipping preserves gradient direction — it scales all gradients uniformly if the total norm exceeds a threshold. Value clipping clips each component independently and distorts the direction."
import numpy as np def clip_grad_norm(grads, max_norm): """ Global gradient norm clipping. grads : list of np.ndarray (one per parameter group) max_norm: float clip threshold Returns : (clipped_grads, total_norm) """ # Step 1 — global L2 norm across all parameter gradients total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads)) # Step 2 — clip coefficient (1.0 means no clipping) clip_coef = max_norm / (total_norm + 1e-6) # Step 3 — scale all gradients uniformly (preserves direction) if clip_coef < 1.0: for g in grads: g *= clip_coef # in-place — no copy return grads, float(total_norm) # ── Value clipping (component-wise — distorts direction) ───────── def clip_grad_value(grads, clip_value): return [np.clip(g, -clip_value, clip_value) for g in grads]
"Global norm clipping preserves gradient direction — it's a uniform rescaling of the entire gradient vector. Value clipping clips each component independently and produces a different vector direction. For RNNs where gradient magnitude matters but direction is critical, norm clipping is strictly better. The monitoring insight: log total_norm during training. If it regularly hits the clip threshold (max_norm), your learning rate is too high or your architecture has a pathological gradient flow problem."
"The default rolling().mean() includes the current row in the window. For a predictive feature, that's leakage — at inference time, you don't have the current row's value. The fix is shift(1) before rolling, which shifts values forward by one period so each row's window only contains past data."
import pandas as pd import numpy as np # ── BAD: leaks current-row value into its own window ───────────── def leaky_features(df): df['roll7'] = df.groupby('user_id')['revenue'].transform( lambda x: x.rolling(7).mean() # includes current row ) return df # ── GOOD: point-in-time correct ────────────────────────────────── def pit_features(df): df = df.sort_values(['user_id', 'date']).copy() grp = df.groupby('user_id')['revenue'] # shift(1) excludes current row from its own window df['roll7_avg'] = grp.transform( lambda x: x.shift(1).rolling(7, min_periods=1).mean() ) df['lag_1d'] = grp.shift(1) df['lag_7d'] = grp.shift(7) df['roll7_std'] = grp.transform( lambda x: x.shift(1).rolling(7, min_periods=2).std() ) return df # ── Correct temporal train/test split ─────────────────────────── def temporal_split(df, cutoff, target='label'): # Never use train_test_split(shuffle=True) for time series train = df[df['date'] < cutoff] test = df[df['date'] >= cutoff] feat_cols = [c for c in df.columns if c not in [target, 'date']] return (train[feat_cols], train[target], test[feat_cols], test[target])
"Feature leakage via rolling without shift(1) is one of the most common production ML bugs. The model trains fine, validation AUC looks great, then production performance collapses. The diagnosis: production AUC significantly lower than validation AUC on a time-series problem. The fix is always shift(1) before rolling. Name this as the default defensive pattern: 'I always shift before rolling on time-series features.'"
"Point-in-time correctness extends beyond rolling features. In multi-table joins — joining customer attributes to transactions — you must join on the attribute values as they existed at the transaction date, not their current values. Joining on current customer tier to historical transactions leaks future tier changes into past predictions. Production feature stores (Feast, Tecton) enforce PIT joins by design. Mentioning this shows you've thought about end-to-end pipeline correctness."
"Naive target encoding — replacing a category with its mean target value — leaks the label into the feature. The fix is out-of-fold encoding: compute the encoding for each row using only the folds it wasn't in. For rare categories, add Bayesian smoothing to shrink toward the global mean."
import numpy as np import pandas as pd from sklearn.model_selection import KFold def target_encode(X_train, y_train, col, n_folds=5, smooth=10): """ Out-of-fold target encoding with Bayesian smoothing. Returns: (oof_encoded array, inference encoding map) """ global_mean = y_train.mean() oof = np.zeros(len(X_train)) kf = KFold(n_splits=n_folds, shuffle=False) # no shuffle for temporal for tr_idx, val_idx in kf.split(X_train): X_tr = X_train.iloc[tr_idx] y_tr = y_train.iloc[tr_idx] X_val = X_train.iloc[val_idx] # Per-category stats on training fold stats = pd.DataFrame({'mean': y_tr.groupby(X_tr[col]).mean(), 'cnt': y_tr.groupby(X_tr[col]).count()}) # Bayesian smoothing: shrink toward global mean for small counts stats['enc'] = ( (stats['cnt'] * stats['mean'] + smooth * global_mean) / (stats['cnt'] + smooth) ) oof[val_idx] = ( X_val[col].map(stats['enc']).fillna(global_mean).values ) # Inference map: fit on full training set full_stats = pd.DataFrame({'mean': y_train.groupby(X_train[col]).mean(), 'cnt': y_train.groupby(X_train[col]).count()}) full_stats['enc'] = ( (full_stats['cnt'] * full_stats['mean'] + smooth * global_mean) / (full_stats['cnt'] + smooth) ) enc_map = full_stats['enc'].to_dict() return oof, enc_map, global_mean
"Bayesian smoothing for rare categories — (count × cat_mean + λ × global_mean) / (count + λ) — is the production detail that separates target encoding done right from done naively. A category with 2 observations and a 100% positive rate shouldn't encode as 1.0. With λ=10, it encodes as (2×1.0 + 10×global_mean) / 12 — pulled significantly toward the global mean. λ is a hyperparameter; tune it on a holdout. State this before writing the formula."
"SMOTE generates synthetic minority samples by interpolating between existing minority samples and their k nearest neighbours. This is better than simple duplication because it adds new points along the feature manifold rather than exact copies."
import numpy as np def smote(X_minority, n_synthetic, k=5, seed=42): """ X_minority : (n, d) minority class samples n_synthetic: number of new samples to generate Returns : (n_synthetic, d) synthetic samples """ rng = np.random.default_rng(seed) n, d = X_minority.shape # Step 1 — pairwise squared distance matrix (n x n) diff = X_minority[:, None, :] - X_minority[None, :, :] dists = np.sum(diff ** 2, axis=2) np.fill_diagonal(dists, np.inf) # exclude self # Step 2 — k nearest neighbour indices for each point knn_idx = np.argsort(dists, axis=1)[:, :k] # (n, k) # Step 3 — generate synthetic samples via interpolation synthetic = np.zeros((n_synthetic, d)) for i in range(n_synthetic): src = rng.integers(n) # random minority sample nn = knn_idx[src, rng.integers(k)] # random neighbour alpha = rng.random() # interpolation factor synthetic[i] = ( X_minority[src] + alpha * (X_minority[nn] - X_minority[src]) ) return synthetic # Usage: # X_syn = smote(X_minority, n_synthetic=500) # X_bal = np.vstack([X_majority, X_minority, X_syn]) # y_bal = np.hstack([y_maj, np.ones(len(X_minority) + len(X_syn))])
"SMOTE is a training-set-only operation — apply it after your train/test split, never before. If you apply SMOTE before splitting, synthetic samples generated from real training points will leak into your test set. The correct pipeline: split first, then oversample the minority class in the training set only. This is the most common imbalanced-data pipeline bug. State it before coding: 'I'll apply this to the training set after splitting.'"
"Class imbalance is primarily an evaluation problem, not just a training problem. AUC-PR is the correct metric for imbalanced classes — AUC-ROC is optimistic because it includes the trivially-correct true negative region. At 1% positive rate, a model that predicts all negative has AUC-ROC of 0.5 but AUC-PR near 0. Mention this when discussing imbalanced data: 'I'd also switch from AUC-ROC to AUC-PR as the primary offline metric.'"
"There are three complexity tiers: O(n log n) sort-then-slice (simplest), O(n log k) min-heap (optimal for streaming), and O(n) np.argpartition (optimal for in-memory arrays). For ANN retrieval, cosine similarity is a dot product after L2 normalisation — fully vectorisable."
import numpy as np import heapq # ── Option 1: Min-heap — O(n log k), O(k) space ───────────────── def top_k_heap(scores, k): heap = [] for i, s in enumerate(scores): heapq.heappush(heap, (s, i)) if len(heap) > k: heapq.heappop(heap) # evict smallest return [i for _, i in sorted(heap, reverse=True)] # ── Option 2: np.argpartition — O(n) avg, in-memory ───────────── def top_k_partition(scores, k): indices = np.argpartition(scores, -k)[-k:] # unordered top-k return indices[np.argsort(scores[indices])[::-1]] # sorted # ── Option 3: Vectorised cosine ANN ───────────────────────────── def cosine_top_k(query, corpus, k): """ query : (d,) query vector corpus : (n, d) pre-indexed item vectors Returns: top-k indices sorted by cosine similarity """ # Normalise both to unit vectors — dot product = cosine similarity q_norm = query / (np.linalg.norm(query) + 1e-8) c_norm = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-8) scores = c_norm @ q_norm # (n,) — vectorised dot products return top_k_partition(scores, k)
"np.argpartition is the answer most candidates don't know. It runs in O(n) average time using introselect, guarantees the k largest are in the output (without sorting them), and runs entirely in C. At 10M documents, argpartition is 3× faster than argsort for top-100 retrieval. Use it for any in-memory top-k. The one gotcha: the returned indices are not sorted — apply argsort on the k results to rank them. State both the advantage and the gotcha."
"A model with AUC 0.92 can still output completely wrong probabilities — it ranks correctly but isn't calibrated. Platt scaling fits a sigmoid f(s) = 1/(1 + exp(As + B)) on top of raw model scores via MLE. It's a 2-parameter model trained on the validation set."
import numpy as np from scipy.optimize import minimize def fit_platt(scores, labels): """Fit Platt sigmoid: p = 1/(1 + exp(A*score + B)) via MLE.""" def neg_log_likelihood(params): A, B = params p = 1.0 / (1.0 + np.exp(A * scores + B)) p = np.clip(p, 1e-7, 1 - 1e-7) return -np.mean( labels * np.log(p) + (1 - labels) * np.log(1 - p) ) res = minimize(neg_log_likelihood, x0=[-1.0, 0.0], method='L-BFGS-B') return res.x # (A, B) def platt_predict(scores, A, B): return 1.0 / (1.0 + np.exp(A * scores + B)) # ── Reliability diagram + ECE ──────────────────────────────────── def calibration_stats(probs, labels, n_bins=10): """Returns (bin_pred, bin_actual, ECE).""" bins = np.linspace(0, 1, n_bins + 1) bin_pred = [] bin_actual = [] ece = 0.0 N = len(labels) for i in range(n_bins): mask = (probs >= bins[i]) & (probs < bins[i + 1]) if mask.sum() > 0: bp = probs[mask].mean() ba = labels[mask].mean() bin_pred.append(bp) bin_actual.append(ba) ece += (mask.sum() / N) * abs(bp - ba) return np.array(bin_pred), np.array(bin_actual), ece
"Calibration matters as much as discrimination in production. A fraud model with AUC 0.95 but poor calibration will set the wrong transaction block threshold — if it says 0.9 probability but the actual rate is 0.3, your threshold is wrong. Temperature scaling is the simplest calibration method for neural networks: divide logits by a scalar T before softmax, fit T on a validation set. One parameter, no change to model ranking, dramatically improved probability estimates. It's why all modern LLMs use temperature at inference."
"The reliability diagram is the correct way to present calibration — not just ECE. ECE is a single number that hides the shape of miscalibration. A model that systematically overestimates probabilities for high scores and underestimates for low scores has ECE near 0 (errors cancel) but is badly calibrated in practice. Plot the reliability diagram and look for the S-curve pattern — that's the signature of a model that needs Platt scaling."