Interview Simulation · 45 minutes

ML System Design

You have 45 minutes. Let's build.

👤 Senior ML Engineer · Interviewer

Design a personalised recommendation system for a streaming platform with 100M users and 10M items.

SYSTEM ARCHITECTURE RECOMMENDATION ENGINE

  User Activity
  (plays, ratings, skips)
         │
         ▼
  ┌─ CANDIDATE GENERATION ──────────────────────────────────┐
  │  Two-Tower Model                                         │
  │  User Tower: user_id embed + history embed + context    │
  │  Item Tower: item_id embed + metadata embed             │
  │  ANN search (FAISS/ScaNN) → top-500 candidates          │
  └──────────────────────────────────────────────────────────┘
         │ 500 candidates
         ▼
  ┌─ RANKING ───────────────────────────────────────────────┐
  │  LTR Model (LightGBM / Wide & Deep / DCN-v2)            │
  │  Input: user×item cross-features, context, freshness    │
  │  Multi-task: P(play) + P(complete) + P(like)            │
  │  Output: ranked slate of 20                             │
  └──────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─ POST-RANKING ──────────────────────────────────────────┐
  │  Diversity injection (MMR / DPP)                        │
  │  Business rules (boost new content, block seen)         │
  │  Freshness decay                                        │
  └──────────────────────────────────────────────────────────┘

Clarify scope first: "Is this home-page ranking or you-may-also-like? Real-time or batch? How do we handle cold-start for new users and new items?"
Scale framing: 100M users × 1 daily session = 100M requests/day (≈ 1,160 QPS averaged over the day). A 10M-item catalogue rules out brute-force ranking, so retrieval needs ANN.
Propose two-stage architecture: retrieval (scale problem — narrow 10M to 500) → ranking (quality problem — score 500, return 20). This decouples the two fundamentally different challenges.
State latency constraints: retrieval ≤ 500ms p99, ranking ≤ 100ms p99, total budget 600ms.
State the output: a ranked slate of 20 items per request, re-ranked post-scoring for diversity and business rules.

Retrieval — Two-Tower model: user tower [user_id embed + watch history aggregation + context] → 256-dim vector. Item tower [item_id embed + metadata embed] → 256-dim vector. Inner product similarity. ANN search via FAISS IVF-PQ → top-500 candidates in < 50ms.
Ranking — Wide & Deep / DCN-v2: scores each of 500 candidates. Wide part: cross-product sparse features (memorisation of known user × genre patterns). Deep part: dense feature embeddings through MLP (generalisation to unseen combinations).
Multi-task ranking heads: P(play) + P(complete) + P(like), combined via learned weights α, β, γ. Prevents optimising clicks at the expense of satisfaction.
Post-ranking layer: MMR or DPP for diversity injection. Business rules: boost new content, suppress recently watched. Freshness decay applied to item scores.
Cold-start: new user → global popularity fallback → content-based (onboarding survey). New item → content-based embedding from metadata until 50+ interactions.
ANN at scale: 10M item embeddings (256-dim, INT8 quantised) = 2.56GB. FAISS IVF-PQ index in RAM. Rebuilt nightly during off-peak. Sub-50ms query p99.

✦ Senior ML Engineer

"Saying two-stage immediately signals you've built at scale. Candidates who propose a single ranking model for 100M users × 10M items fail the scale smell test — you cannot run your ranking model 10 million times per request. The retrieval step exists specifically to make ranking tractable. The two-stage architecture is what Netflix, Spotify, and YouTube all converged on independently."

✦ Senior ML Engineer

"The most common mistake: jumping to 'matrix factorisation' and stopping there. Collaborative filtering solves retrieval but has no quality-ranking mechanism. Two-stage is not just an optimisation — it's architecturally necessary. Name the split, name why each stage exists."

Labels from implicit feedback: play > 30s = positive, skip < 10s = strong negative. > 70% completion = strong positive (avoids auto-play contamination).
User features: watch history (last 50 items), demographics, device type, time-of-day, day-of-week, subscription tier.
Item features: genre, duration, recency, popularity (global + segment-level), language, content embeddings from title/description.
Cross features: user-genre affinity rolling 30d, user × item collaborative signal, session context (last played item, session length so far).
Cold-start strategy: new user → popularity fallback. New item → metadata embeddings until 50+ interactions accumulate.
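
As a rough illustration, the labelling rules above can be written as a small function. This is a minimal sketch: the function name, argument names, and label strings are hypothetical, and the order of the checks (completion first, then short skips) is an assumption rather than something the rules specify.

```python
def label_from_playback(seconds_played, duration_seconds):
    """Map one playback event to an implicit-feedback label (None = drop the event)."""
    if duration_seconds <= 0:
        return None  # malformed event; nothing to learn from
    completion = seconds_played / duration_seconds
    if completion > 0.70:
        return "strong_positive"  # > 70% completion
    if seconds_played < 10:
        return "strong_negative"  # skipped within 10 seconds
    if seconds_played > 30:
        return "positive"         # played past 30 seconds
    return None                   # 10-30s is ambiguous; omitted here (assumption)


# Example: a 60-minute title played for 50 minutes vs. skipped after 5 seconds.
print(label_from_playback(3000, 3600))  # strong_positive
print(label_from_playback(5, 3600))     # strong_negative
```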

Label construction detail: > 70% play completion = label 1 (strong). < 10% = label 0 (strong negative). 10–70% = omit or weak signal. Avoids labelling auto-play starts as positive.
Temporal features (high impact): hour-of-day bucketed into 6 windows, day-of-week, time-since-last-session, sequence position within session. User content preference shifts dramatically by context.
Popularity tiers: global popularity, segment-level (age cohort, region, device), and trending velocity (24h engagement change). Segment-level is typically 2× more predictive than global rank.
Data pipeline: Kafka → Spark Streaming → feature store. Online store: Redis (≤ 5ms latency). Offline store: Hive / BigQuery for batch training. Point-in-time correct joins: training labels at T, features from T − 2h to prevent leakage.
Negative sampling: in-batch negatives for Two-Tower training (efficient). Hard negatives (impressions the user saw but skipped) for ranking model — prevents model learning "popular = positive".
Feature freshness: user interaction features updated every 10 min. Item popularity counts hourly. Genre affinity rolling window recalculated daily.
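
The point-in-time join is easier to reason about with a concrete lookup. The sketch below assumes feature snapshots are stored as a time-sorted list of (timestamp, features) pairs; the function name and data layout are hypothetical, and a real feature store would do this as a join rather than a per-row search.

```python
from bisect import bisect_right

TWO_HOURS = 2 * 3600  # lag between label time T and the newest usable features


def point_in_time_features(snapshots, label_ts, lag_seconds=TWO_HOURS):
    """Return the latest snapshot written at or before label_ts - lag_seconds.

    snapshots: list of (unix_ts, feature_dict), sorted by unix_ts ascending.
    Returns None when no snapshot is old enough, so the row can be dropped
    instead of silently using features from the future.
    """
    cutoff = label_ts - lag_seconds
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_right(timestamps, cutoff)  # number of snapshots with ts <= cutoff
    return snapshots[idx - 1][1] if idx > 0 else None


snapshots = [(0, {"plays_7d": 1}), (3600, {"plays_7d": 2}), (9000, {"plays_7d": 3})]
# Label at T = 10800 may only see features written at or before T - 2h = 3600.
print(point_in_time_features(snapshots, 10800))  # {'plays_7d': 2}
```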

✦ Senior ML Engineer

"Temporal context is the most impactful feature category most candidates miss. A user at 11pm Friday wants fundamentally different content than the same user at 8am Tuesday. Hour-of-day and day-of-week as features immediately shows production experience. Candidates who list only demographics and item genre are describing a textbook system."

✦ Senior ML Engineer

"Point-in-time correct joins separate candidates who have shipped ML systems from those who haven't. Training data constructed with future leakage produces models that look great offline and degrade immediately in production. The fix is a 2-hour delay: labels at T, features from T − 2h. This is operational, not theoretical — mention it."

Retrieval: Two-Tower model trained with a contrastive loss (InfoNCE). User and item are encoded separately. FAISS IVF-PQ index at inference → top-500 in < 50ms.
Ranking: Wide & Deep or DCN-v2. Three multi-task heads: P(play), P(complete), P(like). Combined score = α·P(play) + β·P(complete) + γ·P(like).
Negative sampling strategy: in-batch negatives for Two-Tower (efficient, scales well). Hard negatives (saw, skipped) for ranking (critical for calibration quality).
Offline evaluation: Recall@100 for retrieval (target ≥ 80%). NDCG@10 for ranking. Coverage and popularity bias checks.
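
To make the in-batch negatives idea concrete, here is a minimal InfoNCE sketch in plain Python. It assumes row i of the user batch and row i of the item batch form a positive pair, so every other item in the batch acts as a negative. The temperature value is an assumption (it is not specified above), and a real trainer would use a tensor library rather than lists.

```python
import math


def info_nce_loss(user_vecs, item_vecs, temperature=0.1):
    """Mean InfoNCE loss over a batch with in-batch negatives.

    user_vecs[i] and item_vecs[i] are a positive pair; item_vecs[j], j != i,
    are treated as negatives for user i. Similarity is the inner product.
    """
    n = len(user_vecs)
    total = 0.0
    for i in range(n):
        logits = [
            sum(u * v for u, v in zip(user_vecs[i], item_vecs[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all items in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive
    return total / n


users = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[1.0, 0.0], [0.0, 1.0]]   # positives match their users
swapped_items = [[0.0, 1.0], [1.0, 0.0]]   # positives point the wrong way
print(info_nce_loss(users, aligned_items) < info_nce_loss(users, swapped_items))  # True
```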

Two-Tower architecture: user tower [user_id embed(256) + history mean-pool(256) + context(64)] → MLP → 256-dim output. Item tower [item_id embed(256) + metadata embed(128)] → MLP → 256-dim output. Trained with InfoNCE on positive (user, item) pairs with in-batch negatives.
DCN-v2 (preferred over Wide & Deep): explicit cross-network captures bounded-order feature interactions via learned cross layers (l cross layers model interactions up to order l + 1, so 6th-order needs five layers). Achieves the same memorisation + generalisation with fewer parameters and more interpretable cross-feature weights. Published 2021, now standard at Google.
Multi-task training: shared base network → three prediction heads. Combined ranking score uses Pareto-optimised scalarisation: maximise satisfaction (P(complete) + P(like)) subject to P(play) ≥ threshold. Prevents engagement-only optimisation.
Hard negative strategy: for ranking, sample 4 hard negatives per positive — items the user was shown but skipped. Mix 50/50 with random negatives. Hard negatives improve ranking calibration but too many → model learns "anything shown is hard to distinguish" and regresses.
Scale: 10M items × 256-dim INT8 = 2.56GB embedding table. FAISS IVF-PQ (nlist=1024, m=32) → 320MB compressed index, < 50ms query p99 on single CPU core.
Offline eval thresholds: Recall@100 ≥ 80% (retrieval). NDCG@10 ≥ 0.42 (ranking, based on A/B holdout calibration). AUC ≥ 0.78 per task head. Coverage: top-1000 items should not dominate > 30% of recommendations.
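
The memory figures in the scale line can be checked with a few lines of arithmetic. This sketch assumes 8 bits per PQ sub-vector code and ignores the coarse centroids, codebooks, and item IDs, which add some overhead on top of the raw codes. The function names are illustrative only.

```python
def embedding_table_bytes(num_items, dim, bytes_per_value=1):
    """Raw embedding table size; INT8 means one byte per dimension."""
    return num_items * dim * bytes_per_value


def pq_code_bytes(num_items, m, bits_per_code=8):
    """Size of the PQ codes alone: m sub-vector codes per item."""
    return num_items * m * bits_per_code // 8


raw = embedding_table_bytes(10_000_000, 256)  # 2,560,000,000 bytes = 2.56 GB
compressed = pq_code_bytes(10_000_000, 32)    # 320,000,000 bytes = 320 MB
print(raw / 1e9, compressed / 1e6)            # 2.56 320.0
```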

✦ Senior ML Engineer

"Multi-task ranking is the key senior signal. Single-objective models optimise clicks but degrade completion and satisfaction — YouTube's original watch-time algorithm maximised engagement, which led to increasingly extreme content recommendations. The algorithm was technically working; the objective was wrong. Multi-task with explicit satisfaction heads is how you avoid this. Name the trade-off proactively."

✦ Senior ML Engineer

"DCN-v2 over Wide & Deep is the current-state answer. Wide & Deep was published in 2016. DCN-v2 (2021) achieves the same memorisation + generalisation with fewer parameters and more explicit cross-feature learning. Using the 2016 architecture isn't wrong, but citing the 2021 update shows you follow the literature."

Offline retrieval: Recall@100 ≥ 80%. If retrieval misses the eventual positive, the ranker never gets a chance to surface it.
Offline ranking: NDCG@10 (primary), hit-rate@20, AUC per task head.
Online metrics: play rate (impressions → actual plays), completion rate, intra-list diversity (ILD), D7 retention.
North star: monthly active hours per user.
A/B design: user-level randomisation, ≥ 7 days (novelty effect peaks at day 1–2 and fades by day 4–5), power analysis at MDE = 0.5% on play rate.
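
A minimal sketch of the retrieval metric, assuming the retrieved candidates arrive as a ranked list and the eventual strong positives as a set. The function name and the choice to return None for users with no positives are assumptions.

```python
def recall_at_k(retrieved, positives, k=100):
    """Fraction of a user's eventual positives that appear in the top-k retrieved items."""
    if not positives:
        return None  # undefined for users with no positives; skip them when averaging
    top_k = set(retrieved[:k])
    hits = sum(1 for item in positives if item in top_k)
    return hits / len(positives)


# One of the three positives ("a") is inside the top-3 retrieved items.
print(recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=3))
```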

Recall@100 mechanics: of all eventual strong positive interactions (completion > 70% in following 24h), what % appear in retrieval's top-100? Target ≥ 80%. Below this, the ranking model is bottlenecked by retrieval quality, not its own quality.
A/B power analysis: MDE = 0.5% on play rate. Historical play rate = 18%, σ estimated from daily variance. At α = 0.05, power = 0.80: N ≈ 250K users per arm. At 100M DAU, this is achieved in < 4 hours — run for 7 days regardless to capture weekly patterns.
Guardrail metrics: revenue (subscription churn rate), content creator fairness (long-tail content impressions ≥ 15% of total), cross-genre discovery rate ≥ 25%. If any guardrail regresses > 1% relative, abort regardless of play rate lift.
Novelty effect: new recommendations feel interesting purely because they differ from history. Effect peaks at day 1–2, fades by day 4–5. Running for 7 days gives steady-state lift estimate — short experiments overstate improvement.
Long-term metrics: D7, D30 retention; subscriber LTV. Play rate can increase while D30 retention decreases if recommendations are locally satisfying but homogenising (filter bubble). Track cohort retention alongside engagement metrics.
Diversity metric — ILD: intra-list distance = average pairwise dissimilarity across the 20 recommended items. ILD ≥ 0.4 (genre-space distance) prevents "20 action movies" slates.
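
The per-arm sample size is sensitive to how the MDE and the variance are defined, so it is worth computing rather than quoting. The sketch below uses the textbook two-proportion normal approximation, which is an assumption: it is not the daily-variance estimate mentioned above, so its output differs from the figure quoted there. It shows both an absolute and a relative reading of "0.5%".

```python
import math
from statistics import NormalDist


def users_per_arm(p_base, p_treat, alpha=0.05, power=0.80):
    """Two-sided, two-proportion sample size per arm (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_treat - p_base) ** 2)


baseline = 0.18
absolute_mde = users_per_arm(baseline, baseline + 0.005)   # 18.0% -> 18.5%, roughly 94K
relative_mde = users_per_arm(baseline, baseline * 1.005)   # 18.0% -> 18.09%, roughly 2.9M
print(absolute_mde, relative_mde)
```

The two readings differ by about 30×, which is why the MDE definition should be stated explicitly before quoting a sample size.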

✦ Senior ML Engineer

"The candidate who connects online metrics to the business north star wins. 'We improved NDCG@10 by 3%' is table stakes. 'That 3% translated to +8% session depth and +2% D7 retention, worth an estimated $12M in annual subscriber LTV' closes the loop. Every ML metric you present should have a dollar figure or a retention number attached. This is how PMs and VPs hear it."

✦ Senior ML Engineer

"Guardrail metrics are the sophistication signal most candidates skip. Any metric can be improved by overfitting to it. Content creator fairness — a system concentrating 90% of impressions on 1% of content might have great NDCG but destroys the ecosystem. Name a guardrail. It immediately signals product-level thinking beyond the model itself."

Serving budget: retrieval ≤ 500ms p99 (precomputed embeddings + ANN). Ranking ≤ 100ms p99. Total 600ms end-to-end.
Feature serving: user features from Redis (≤ 5ms). Item features precomputed. Real-time session events via Kafka consumer.
Retraining cadence: retrieval weekly (stable long-term taste). Ranking daily (captures recent intent shifts — Friday vs Monday preferences differ).
Monitoring: CTR drift daily, NDCG on held-out test weekly, popularity bias monthly, training-serving skew via PSI.
Failure modes to name: embedding staleness, popularity collapse, cold-start regression after model update.

Retrieval serving path: user embedding precomputed nightly → stored in Redis (≤ 2ms read). FAISS IVF-PQ query → top-500 in < 50ms. Index rebuilt weekly during 2–4am off-peak. Version-controlled: rollback available within 60s.
Ranking serving path: batch score all 500 candidates in a single forward pass. LightGBM (CPU, < 10ms) or quantised DCN-v2 (CPU, < 40ms). Post-ranking (diversity + business rules) < 10ms. Return top-20 with scores.
Retraining pipeline: Kafka → Spark Streaming → feature store → daily training job (last 30 days, point-in-time correct) → offline eval gate (NDCG@10 must not regress > 2%) → shadow deployment → full rollout.
Online learning — cold-start bandit: ε-greedy (ε = 0.05): 5% of slots allocated to cold items. Outcome (play/skip) fed into daily ranking retrain as labelled examples. Warms up item embeddings in ~48h from first impression.
Failure mode — embedding staleness: user makes abrupt preference shift (binge new genre) but embedding updates weekly. Mitigation: inject last-5-interactions as real-time feature into ranking model bypassing embedding lag.
Training-serving skew monitoring: Population Stability Index (PSI) on top-10 features, computed daily. PSI > 0.10 = warning. PSI > 0.20 = halt new model deployment, investigate pipeline. Skew causes offline-online metric divergence.
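
A minimal PSI sketch, assuming each feature has already been bucketed into the same bins at training and serving time and that each input is a list of bin fractions summing to 1. The function names and the epsilon floor for empty bins are assumptions; the thresholds are the ones stated above.

```python
import math


def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index between training-time and serving-time bin fractions."""
    total = 0.0
    for expected, actual in zip(expected_fracs, actual_fracs):
        expected = max(expected, eps)  # floor empty bins so the log stays finite
        actual = max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total


def psi_status(value):
    """Map a PSI value to the alerting tiers described above."""
    if value > 0.20:
        return "halt"
    if value > 0.10:
        return "warning"
    return "ok"


print(psi_status(psi([0.5, 0.5], [0.5, 0.5])))  # ok
print(psi_status(psi([0.5, 0.5], [0.7, 0.3])))  # warning
print(psi_status(psi([0.5, 0.5], [0.8, 0.2])))  # halt
```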

✦ Senior ML Engineer

"Different retraining cadences for retrieval vs ranking is where depth shows. The wrong answer is 'retrain everything daily.' Retrieval embeddings capture stable long-term taste — daily retraining wastes compute and introduces noise. Ranking captures short-term intent — weekly is too slow (Monday preferences differ from Friday). Different cadences for different model components reflects real production architecture."

✦ Senior ML Engineer

"Training-serving skew is the failure mode behind most 'model was fine in eval but degraded in production' incidents. The feature distribution at training time does not match serving time — usually a pipeline bug or schema change. PSI monitoring on key features catches this before users see degraded recommendations. Mentioning skew monitoring immediately signals you've debugged production ML systems."

Design the feed ranking system for a social network with 500M users and 100M posts per day.

  ┌─ CANDIDATE SOURCES (parallel) ────────────────────────┐
  │  Social graph: posts from follows (1st + 2nd degree)  │
  │  Interest graph: posts matching engaged topics        │
  │  Promoted content: ads (separate bid/quality system)  │
  └───────────────────────────────────────────────────────┘
         │ ~2000 candidates
         ▼
  ┌─ LIGHTWEIGHT RANKER ──────────────────────────────────┐
  │  Fast GBDT on sparse features                         │
  │  Prunes 2000 → 500 in < 20ms                        │
  └───────────────────────────────────────────────────────┘
         │ 500 candidates
         ▼
  ┌─ HEAVY RANKER ────────────────────────────────────────┐
  │  Multi-task DNN                                       │
  │  Heads: P(like) P(comment) P(share) P(hide) P(report)│
  │  Ranking score = f(heads, weights, business rules)    │
  └───────────────────────────────────────────────────────┘
         │
         ▼
  ┌─ DIVERSITY + POLICY LAYER ────────────────────────────┐
  │  ≤ 2 consecutive posts per creator                    │
  │  Content freshness boost · Policy filters             │
  └───────────────────────────────────────────────────────┘

Clarify primary goal: "Is this optimising engagement, satisfaction, creator monetisation, or all three simultaneously? What is the tension between them?"
Frame as a multi-stakeholder problem — four different objectives: user satisfaction, creator reach, advertiser ROI, and platform health. A great design acknowledges all four.
Scale: 500M users × 10 feed loads/day = 5B ranking requests/day = ~58,000 QPS for the heavy ranker. This rules out any expensive per-request deep model at full candidate depth.
Propose three-stage cascade: retrieval (social + interest graph, ~2000 candidates) → light ranker (GBDT, prunes to 500) → heavy ranker (multi-task DNN, returns top-20) → diversity + policy layer.
State the wellbeing constraint upfront: optimising purely for engagement causes anxiety, outrage amplification, and long-term user churn.
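
The scale estimate is quick to reproduce. This sketch computes average QPS only (peak traffic would be higher) and compares the number of heavy-ranker candidate scorings per second with and without a pruning stage. The function and variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def average_qps(users, requests_per_user_per_day):
    """Average requests per second, assuming traffic is spread evenly over the day."""
    return users * requests_per_user_per_day / SECONDS_PER_DAY


feed_qps = average_qps(500_000_000, 10)  # ≈ 57,870, i.e. ~58k QPS
scorings_unpruned = feed_qps * 2000      # heavy ranker on all ~2000 candidates
scorings_pruned = feed_qps * 500         # heavy ranker after pruning to 500

print(round(feed_qps))                   # 57870
print(round(scorings_unpruned / 1e6))    # ~116 million scorings per second
print(round(scorings_pruned / 1e6))      # ~29 million scorings per second
```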

Candidate retrieval sources (parallel): social graph (posts from 1st and 2nd-degree connections, ~500 candidates), interest graph (posts matching topics the user frequently engages with, ~800 candidates), promoted content (ads with separate bid/quality scoring, ~200 slots). Deduplicated, total ~2000.
Light ranker: GBDT (LightGBM) on sparse features — user-creator affinity score, post age, engagement velocity, media type. Prunes 2000 → 500 in < 20ms CPU. Cheap enough to run at 58k QPS.
Heavy ranker: multi-task DNN with 6 prediction heads: P(like), P(comment), P(share), P(hide), P(report), P(time_spent). Ranking score = α·P(share) + β·P(comment) + γ·P(like) − δ·P(hide) − ε·P(report). Weights learned or tuned via constrained optimisation.
Diversity + policy layer: at most 2 consecutive posts from the same creator. Content freshness decay applied. Policy filters: remove violating content, apply regional restrictions. Recency boost for posts < 2h old.
Real-time vs near-real-time: new posts from close connections appear within 5 minutes. Interest-graph posts within 30 minutes. This drives feature freshness requirements for the Kafka pipeline.
Engagement vs wellbeing trade-off: pure engagement optimisation causes outrage amplification (angry comments drive engagement). Multi-task with P(hide) and P(report) as explicit negative signals prevents this. Value model weights P(share) and P(save) higher than P(like) to favour high-quality engagement over passive reactions.

✦ Senior ML Engineer

"The wellbeing question is the one that separates senior candidates. Optimising purely for engagement causes anxiety, outrage amplification, and long-term user churn. Bring up the tension between engagement and satisfaction proactively — before the interviewer asks. It shows you've thought about what you're building, not just how. Every major social platform has had to publicly walk back pure-engagement optimisation."

✦ Senior ML Engineer

"The four-stakeholder framing — users, creators, advertisers, platform — is what senior product engineers think about. Junior candidates optimise for user engagement and call it done. The creator side matters because a platform with no rewarded creators has no content. Naming all four stakeholders and their tensions shows systems thinking beyond the model."

Labels: implicit positives: like, comment, share, view-time > 5s, link click. Negatives: hide post, unfollow after viewing, report. Note: hide/report are rare but high-signal — upsample in training.
User features: social graph (follows, close friends, mutual connections), engagement history (topics, creators, format preference — video vs photo vs text), device, network speed.
Post features: creator affinity with user, post age (recency), media type, topic embeddings (NLP on caption), engagement velocity (early likes/comments as quality signal), comment sentiment.
Cross features: user-creator affinity (rolling 30d engagement rate), user-topic affinity (rolling 7d), user preference for format (% of time spent on video vs photo), user-post semantic similarity.

Engagement velocity (high-impact feature): a post with 1000 likes in 10 minutes is fundamentally different from one with 1000 likes over a week. Velocity = (engagements in last 30 min) / (impressions in last 30 min). This single feature enables surfacing viral content before it goes viral.
Training data construction: 30-day rolling window. Negatives = posts the user was shown but did not engage with (not all non-impressed items — only use impressed items as negatives to avoid selection bias). Point-in-time correct: labels at T, features at T − 1h.
Rare label upsampling: hide and report events are < 0.1% of impressions but critical for quality. Oversample at 10×. Use focal loss (γ = 2) to focus training on hard-to-classify examples. Track per-action AUC as a separate evaluation metric.
Semantic features: NLP embeddings on post caption (multilingual BERT, fine-tuned on in-domain data). Cosine similarity between user's topic interest vector and post embedding as a cross-feature. Captures content the user hasn't seen from creators they don't follow.
Close friends signal: posts from "close friends" (users the viewer interacts with most) get a 1.5× affinity boost in the light ranker. Prevents social graph dilution when following 2000+ accounts.
Format preference: compute per-user format affinity: (time spent on video / total time) vs (photos / total) over 30 days. Users who consistently scroll past video get lower P(time_spent) for video posts in their heavy ranker scoring.
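
For the rare hide/report heads, the effect of focal loss with γ = 2 can be seen on single examples. This is a minimal per-example sketch of binary focal loss without the optional class-balancing α term. The function name and the clipping epsilon are assumptions.

```python
import math


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    With gamma = 0 this reduces to ordinary binary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: loss scaled down by (0.1)^2
hard = focal_loss(0.1, 1)  # confident and wrong: loss scaled by (0.9)^2, stays large
print(easy < hard)         # True
```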

✦ Senior ML Engineer

"Engagement velocity is the most impactful feature category most candidates miss. A post with 1000 likes in the first 10 minutes is fundamentally different from one that accumulated 1000 likes over a week. Velocity features enable the system to surface viral content before it goes viral — which is the defining quality signal of a great feed algorithm. If you don't mention velocity, you're describing a system that's always behind the curve."

✦ Senior ML Engineer

"Negatives from impressions, not from all non-impressed items — this is a subtle but critical training data distinction. If you sample negatives from all un-shown posts, you're training the model to distinguish shown posts from everything else, which includes posts the algorithm already knew were bad. The correct negatives are the posts that were shown but ignored — that's the hard signal."

Three-stage cascade at 58k QPS: retrieval (~2000) → light ranker GBDT prunes to 500 in < 20ms → heavy ranker multi-task DNN scores 500 in < 100ms → diversity + policy layer returns top-20.
Light ranker (GBDT): LightGBM on sparse features. Fast CPU inference. Prunes 75% of candidates cheaply. The cost of heavy-ranker inference on 2000 candidates at 58k QPS is prohibitive without this stage.
Heavy ranker (multi-task DNN): 6 prediction heads sharing a common representation tower. Combined value score: upweight share/save/comment, downweight hide/report with learned negative weights.
Value model: score = α·P(share) + β·P(comment) + γ·P(like) − δ·P(hide) − ε·P(report). Weights tuned via constrained optimisation: maximise quality engagement subject to report rate ≤ threshold.
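
The value model is a weighted sum, so a small example shows how the negative terms change the ordering. The weights below are hypothetical placeholders chosen for illustration; in practice they would come from the constrained optimisation described above.

```python
# Hypothetical weights: alpha (share), beta (comment), gamma (like),
# delta (hide), epsilon (report). Not tuned values.
WEIGHTS = {"share": 3.0, "comment": 2.0, "like": 1.0, "hide": 5.0, "report": 20.0}


def value_score(p, w=WEIGHTS):
    """score = α·P(share) + β·P(comment) + γ·P(like) − δ·P(hide) − ε·P(report)."""
    return (
        w["share"] * p["share"]
        + w["comment"] * p["comment"]
        + w["like"] * p["like"]
        - w["hide"] * p["hide"]
        - w["report"] * p["report"]
    )


calm_post = {"share": 0.02, "comment": 0.05, "like": 0.30, "hide": 0.01, "report": 0.001}
outrage_post = {"share": 0.03, "comment": 0.15, "like": 0.20, "hide": 0.08, "report": 0.02}

# The outrage post draws more comments and shares, but its hide/report
# probabilities push its value score below the calm post.
print(value_score(calm_post) > value_score(outrage_post))  # True
```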

Light ranker architecture: LightGBM with 500 trees, max depth 6. Features: user-creator affinity score (precomputed), post age bucket, engagement velocity, media type one-hot, close-friend flag. Trained on 30-day data, retrained weekly. Serves 2000 candidates per request at < 20ms CPU.
Heavy ranker architecture: 3-layer MLP shared tower (512 → 256 → 128) → 6 separate prediction heads (1-layer each). Input: 200 dense features (user embedding 64-dim + post embedding 64-dim + 72 engineered features). Trained with task-specific binary cross-entropy per head, summed with loss weights tuned for task importance.
Value model calibration: the negative weights for P(hide) and P(report) are critical. If δ is too small, the model ignores explicit negative signals. If too large, it becomes overly conservative and filters real content. Tune via constrained optimisation: maximise P(share) subject to E[P(report)] ≤ 0.003 on validation set.
Training regime: daily incremental training on last 24h data (80% of iterations). Full retrain weekly on 30-day window (20% of cycles). Hourly mini-batch for trending topics only (bandit policy). Mixed-precision training (FP16) on 8× A100 GPUs, ~4h full retrain.
Diversity post-ranking: greedy MMR: iteratively select the next post that maximises relevance − λ·max_similarity_to_selected, where similarity is in topic-embedding space. λ = 0.3 (tuned). Prevents "10 posts from same creator" slates.
Scale napkin math: 58k QPS × 500 candidates × heavy ranker forward pass. Solution: run heavy ranker as a batch scoring microservice. 500 candidates × single forward pass = 500 rows × 200 features. On a quantised DNN (INT8, 4 CPU cores): ~80ms per batch. Fits the 100ms budget.
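
The greedy MMR step described above can be sketched as follows, using the same scoring rule (relevance − λ·max similarity to the already selected posts). The function name, the dictionary input, and the toy similarity function are assumptions. Ties are broken by item ID so the output is deterministic.

```python
def mmr_select(relevance, similarity, k, lam=0.3):
    """Greedy MMR: repeatedly pick the item maximising
    relevance - lam * (max similarity to anything already selected)."""
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def mmr_score(candidate):
            penalty = max((similarity(candidate, s) for s in selected), default=0.0)
            return relevance[candidate] - lam * penalty

        best = max(sorted(remaining), key=mmr_score)  # sorted() gives stable tie-breaks
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy similarity: posts whose IDs share a first letter come from the same creator.
same_creator = lambda a, b: 1.0 if a[0] == b[0] else 0.0
scores = {"a1": 0.90, "a2": 0.85, "b1": 0.60}

print(mmr_select(scores, same_creator, k=3, lam=0.3))  # ['a1', 'b1', 'a2']
print(mmr_select(scores, same_creator, k=3, lam=0.0))  # ['a1', 'a2', 'b1']
```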

✦ Senior ML Engineer

"The three-stage cascade is critical at 500M-user scale. A heavy DNN on 2000 candidates at 58k QPS = 116M candidate scorings per second, which is 100+ GPU cluster territory. The light ranker prunes 75% cheaply on CPU. Never propose a single model architecture at this scale without doing the napkin math. The napkin math is part of the answer."

✦ Senior ML Engineer

"Downweighting P(hide) and P(report) in the value model is the production insight most candidates miss. Pure engagement maximisation treats a hide as zero signal. It's actually -5× signal — that user explicitly said they don't want this. Incorporating explicit negative labels into the value model is the difference between a feed that feels good and one that feels extractive."

Offline: AUC per action type (separate AUC for like, share, hide, report). Composite NDCG using action quality weights. Creator diversity score (Gini coefficient of creator impressions).
Online: engagement rate, save rate, share rate (higher-intent than like), session depth (posts viewed per session), unfollow rate (guardrail), report rate (guardrail).
North star: weekly active engaged users — users who are actively posting, commenting, or sharing, not just passively consuming.
A/B design: user-level holdout. Social products require ≥ 2-week experiment duration (weekly usage patterns + novelty effect). Network effects mean user-level randomisation isn't always sufficient; consider cluster-level holdout.

Network effects in A/B testing: if your friend is in the treatment group and gets a better feed, your control-group experience is already contaminated (you see their better content when they reshare). Standard user-level randomisation understates the true effect. For large social networks, use geo-level holdout (different regions get treatment vs control) to isolate network effects.
2-week minimum for social products: novelty effect on social platforms is stronger and longer-lasting than e-commerce (7–10 days vs 3–4 days). Users explore new feed behaviour for 5–7 days before reverting to their natural pattern. Running for 14 days ensures you measure steady-state lift.
Wellbeing survey guardrail: weekly survey sample (1% of users) measuring "Did you feel good about time spent today?" Net Positive Score (NPS-style, 0–10). If NPS drops > 2 points in treatment, abort regardless of engagement lift. Engagement can go up while wellbeing goes down.
Creator fairness metric: Gini coefficient of impressions across all creators. Gini = 0 means equal impressions. Gini = 1 means one creator gets all impressions. Target: Gini ≤ 0.7. If a new model drives Gini > 0.75, it's concentrating impressions on a few creators, destroying ecosystem diversity.
Unfollow rate as guardrail: if a user unfollows a creator within 24h of seeing their content (especially if the content was recommended, not from their graph), that's a strong signal the recommendation was wrong. Unfollow rate per recommended content ≤ 0.8% per 1000 impressions.
Surrogate online metric: share rate is the highest-intent signal available at scale. A user sharing a post means they found it worth their social reputation. Save rate (bookmark) is second. These correlate more strongly with D30 retention than raw like rate.
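
A minimal sketch of the creator-fairness metric, assuming the input is one impression count per creator. The function name is illustrative. Note that with a finite number of creators n the maximum value is (n − 1)/n, which approaches 1 as n grows.

```python
def gini(impressions):
    """Gini coefficient of a list of non-negative impression counts.

    0 means every creator gets the same impressions; values near 1 mean
    impressions are concentrated on very few creators.
    """
    xs = sorted(impressions)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on the sorted values.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([10, 10, 10, 10]))  # 0.0  (perfectly equal)
print(gini([0, 0, 0, 100]))    # 0.75 (one of four creators gets everything)
```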

✦ Senior ML Engineer

"Unfollow rate and report rate as guardrail metrics are the interview signal that shows you think about failure modes. Every positive engagement metric can be gamed — outrage drives comments, clickbait drives clicks. The guardrails are what keep the system honest. Name them without being asked. Then explain why each guardrail connects to long-term retention, not just short-term brand protection."

✦ Senior ML Engineer

"Network effects in A/B testing is the advanced answer almost nobody gives. Standard randomisation assumes treatment and control users don't interact. On a social network, they do. Your friend in treatment reshares content into your control feed. The measured effect is diluted. Geo-level holdout is the production solution — mention it even briefly, it shows you understand the platform dynamics."

Scale: 500M users × 10 feed loads/day = 5B requests/day = ~58k QPS. Heavy ranker must serve at < 100ms p99 with 500 candidates per request.
Feature freshness: last engagement (< 1s old) from Kafka consumer → Redis. Engagement velocity updated every 60 seconds. Batch user features from feature store (nightly update).
Retraining: daily incremental on last 24h data. Full retrain weekly. Hourly mini-batch update for trending topic embeddings.
Feedback loop risk: popular content → more exposure → more engagement → more exposure. Mitigation: ε-greedy exploration budget (5% of slots for non-top-ranked content), periodic diversity injection.
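
One simple way to spend the exploration budget is to decide per slot whether to explore. The sketch below assumes a ranked list from the heavy ranker and a separate pool of exploration candidates; the function name, the pool, and the per-slot coin flip are assumptions, not a description of any particular production system.

```python
import random


def build_slate(ranked, explore_pool, slate_size=20, epsilon=0.05, rng=None):
    """Fill a slate slot by slot; each slot explores with probability epsilon."""
    rng = rng or random.Random()
    ranked_queue = list(ranked)
    pool = list(explore_pool)
    slate = []
    while len(slate) < slate_size and (ranked_queue or pool):
        # Explore on a coin flip, or when the ranked list has run out.
        explore = bool(pool) and (not ranked_queue or rng.random() < epsilon)
        if explore:
            item = pool.pop(rng.randrange(len(pool)))
        else:
            item = ranked_queue.pop(0)
        if item not in slate:  # an item may appear in both sources
            slate.append(item)
    return slate


ranked = [f"r{i}" for i in range(30)]
cold = ["c0", "c1", "c2"]
slate = build_slate(ranked, cold, epsilon=0.05, rng=random.Random(0))
print(len(slate))  # 20; in expectation one slot in twenty is exploratory
```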

Serving architecture: candidate retrieval microservice (social graph + interest graph queries in parallel, 50ms budget). Light ranker microservice (LightGBM, CPU, < 20ms). Heavy ranker microservice (quantised DNN, CPU or GPU, < 80ms). Post-ranking (business rules, < 10ms). Total budget 200ms p99.
Real-time feature pipeline: user engagement events → Kafka → Flink (1-min window aggregates for velocity features) → Redis. User features (social graph, long-term preferences) updated nightly in batch. Hybrid feature serving: real-time Redis + batch feature store per request.
Monitoring stack: per-feature PSI (Population Stability Index) daily. Prediction score distribution vs 7-day rolling baseline (KL divergence alert). Creator Gini coefficient weekly. Engagement rate by user cohort (new users, power users, at-risk users). Model performance on held-out test set weekly.
Feedback loop — detection and mitigation: detect via diversity collapse metric: if top-10 items for the median user have ILD (intra-list distance) < 0.25, the loop is collapsing. Mitigation: (1) ε-greedy: 5% of slots forced to diverse content. (2) New creator boost: content from creators with < 1000 followers gets a 1.2× relevance multiplier. (3) Exploration decay: reduce boost after 48h to avoid artificially propping up unengaging new content.
Incident response: shadow mode before deploying new model versions. New model runs in parallel with current model, scoring every request but not serving results. Compare model score distributions, engagement prediction calibration, and diversity metrics. Green-light only if all metrics within ±5% of current model on 48h of shadow traffic.
Online learning for trending topics: hourly mini-batch trains a lightweight topic-affinity adapter on top of the frozen heavy ranker base. New topics (breaking news, viral memes) are not represented in the weekly-trained base model. The adapter captures them within 1 hour of first appearance.
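
The diversity-collapse check relies on intra-list distance, which is the mean pairwise distance over a slate. The sketch below uses Jaccard distance over topic-tag sets as a stand-in distance; that choice, and the function names, are assumptions (an embedding-space distance would slot in the same way).

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two tag sets; 0.0 when both are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def intra_list_distance(items, distance=jaccard_distance):
    """Average pairwise distance across all item pairs in one slate."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


mixed = [{"sports"}, {"sports"}, {"cooking"}]
collapsed = [{"sports"}, {"sports"}, {"sports"}]
print(intra_list_distance(mixed) >= 0.25)      # True: healthy slate
print(intra_list_distance(collapsed) < 0.25)   # True: would trip the collapse alert
```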

✦ Senior ML Engineer

"The feedback loop problem is the most sophisticated production concern to mention. Popular content gets more exposure, gets more engagement, gets more exposure — the system converges to a handful of viral posts. TikTok's algorithm is famous for occasionally surfacing non-mainstream content — that's deliberate exploration breaking the loop. Mentioning ε-greedy exploration as a feedback loop mitigation immediately signals you understand the long-term dynamics, not just the short-term ML problem."

✦ Senior ML Engineer

"Shadow mode before production deployment is the operational discipline that separates companies with reliable ML from those with frequent regressions. You cannot catch a model that has subtle distribution shift issues with offline eval alone. 48 hours of shadow traffic shows you exactly how the new model behaves on real serving traffic before any user sees it. Build this into your deployment answer every time."

Design the search ranking system for a professional network with 900M users and 1B indexed documents (profiles, jobs, posts).

  User query: "senior ML engineer London open to work"
         │
         ▼
  ┌─ QUERY UNDERSTANDING ─────────────────────────────────┐
  │  Intent classification: navigational / informational  │
  │  NER: extract role, location, seniority, skills       │
  │  Query expansion: synonyms, related skills (BERT)     │
  │  Spell correction                                     │
  └───────────────────────────────────────────────────────┘
         │
         ▼
  ┌─ RETRIEVAL (parallel) ────────────────────────────────┐
  │  BM25 (keyword)   ──┐                                 │
  │  Dense retrieval   ──┼──▶ RRF fusion → top-1000       │
  │  (bi-encoder)      ──┘                                │
  └───────────────────────────────────────────────────────┘
         │ 1000 candidates
         ▼
  ┌─ RE-RANKING ──────────────────────────────────────────┐
  │  LambdaMART LTR → top-100                             │
  │  BERT cross-encoder re-ranks top-100 → top-20         │
  │  Features: query-doc relevance + user context         │
  │  + business signals (freshness, activity)             │
  └───────────────────────────────────────────────────────┘
         │
         ▼
  ┌─ PERSONALISATION LAYER ───────────────────────────────┐
  │  User preference re-rank (past clicks, connections)   │
  │  Diversity: vary company, seniority, location         │
  └───────────────────────────────────────────────────────┘

Distinguish entity types first: profiles, jobs, companies, and posts rank differently — each has its own relevance signal. Clarify which entity type is the primary focus.
Classify query intent: navigational (find a specific person) vs informational (find ML engineers in London) vs transactional (apply to a job). Each needs a different ranking strategy.
Scale: 900M users, 1B documents. Query latency budget: 200ms total end-to-end. This drives the cascade design — can't run expensive cross-encoder on 1B docs.
Propose five-stage pipeline: query understanding → hybrid retrieval (BM25 + dense, RRF fusion, top-1000) → LTR re-ranking (top-100) → BERT cross-encoder (top-20) → personalisation.
State that personalisation layer is A/B tested at user level; ranking layers are A/B tested at query level for consistency.

Query intent classification: fine-tuned BERT classifier trained on query logs with editor-labelled intent. Navigational queries (contain proper nouns, names) → exact match priority. Informational → semantic retrieval priority. Transactional (contain "apply", "hiring") → freshness + apply-rate features boosted.
NER on query: extract entities — job title, location, seniority level, skills, company. "Senior ML engineer London" → role=ML Engineer, location=London, seniority=Senior. Feed extracted entities as structured features into retrieval and ranking.
Hybrid retrieval (parallel): BM25 on inverted index for keyword precision (handles rare skill names, exact company names). Bi-encoder (BERT-based) dense retrieval for semantic understanding ("ML engineer" → "machine learning scientist"). Reciprocal Rank Fusion (RRF) merges both lists: RRF score = Σ 1/(k + rank_i). top-1000 candidates.
Separate indexes per entity type: profile index, job index, post index. Query routes to relevant index based on intent classification. Prevents job results contaminating a people-search query.
Query expansion: synonym expansion via fine-tuned word2vec on job title taxonomy (Python → "Python programming", "Python developer"). Related skills graph (ML → TensorFlow, PyTorch, Scikit-learn) broadens recall for skills queries.
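200ms budget breakdown (p99): query understanding 20ms + BM25 retrieval 30ms + dense retrieval 30ms + RRF fusion 10ms + LambdaMART 50ms + BERT cross-encoder 50ms + personalisation 10ms = 200ms if every stage runs serially. Because BM25 and dense retrieval run in parallel, the critical path is closer to 170ms, which leaves about 30ms of headroom for network and serialisation overhead.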
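
A minimal sketch of the Reciprocal Rank Fusion step described above, assuming each retriever returns an ordered list of document IDs. The constant k = 60 is a commonly used default and is an assumption here, as are the function name and the toy document IDs.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=1000):
    """Merge ranked lists with RRF: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Toy example: hypothetical BM25 and dense result lists.
bm25_results = ["d1", "d2", "d3"]
dense_results = ["d3", "d1", "d4"]
fused_results = rrf_fuse([bm25_results, dense_results])
```

Documents found by both retrievers (d1, d3) rise above documents found by only one, which is the behaviour the hybrid design relies on.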

✦ Senior ML Engineer

"The query-intent classification question is what senior candidates ask first — and what 90% skip. A navigational query ('find John Smith at Google') needs exact match and graph proximity. An informational query ('ML engineers in London') needs semantic understanding. Conflating them into one model produces a mediocre system for both. Intent classification gates the entire pipeline."

✦ Senior ML Engineer

"Hybrid retrieval (BM25 + dense + RRF) over either alone is the current-state answer. BM25 wins on rare terms, exact skill names, and company names. Dense retrieval wins on semantic generalisation. RRF fusion needs no training, only a smoothing constant k, and is a strong baseline against learned fusion when labelled data is sparse. Hybrid retrieval is now the common default for search at scale."

Training data: click-through logs with dwell-time signal. Long click (> 30s dwell) = strong positive. Bounce (< 5s) = negative. Skip (scrolled past) = weak negative.
Graded relevance labels: DCG-style (4=perfect, 3=excellent, 2=good, 1=fair, 0=bad) based on click + dwell + downstream action (apply, connect, message). Not binary — graded relevance enables LambdaMART to optimise NDCG directly.
Query features: query length, NER entities extracted, intent class score, historical CTR for this query pattern, query reformulation signal.
Document features: profile completeness %, connection graph distance to searcher, skill-query match score, recency of activity, endorsement count, seniority match.
Cross features: query × document semantic similarity (bi-encoder cosine), exact match on skills/title, 2nd-degree connection flag, company affinity score.

Label collection via SERP judgement: human raters score query-document pairs on the 0–4 graded scale. High-agreement pairs (κ > 0.7) used directly. Low-agreement pairs discarded: ambiguous queries add noise rather than training signal. Use click + dwell as weak labels for high-volume queries where human rating is infeasible.
Position bias correction in click data: results at position 1 get roughly 5× more clicks than results at position 5 regardless of quality. Estimate a per-position examination propensity and reweight clicks by its inverse (IPW, Inverse Propensity Weighting) to de-bias click labels. Without this, the LTR model learns "position 1 = high quality" instead of true document relevance.
Query features deep: query-in-session reformulation (user typed this query after seeing unsatisfying results — negative signal for previous query), session-level context (what entities has the user searched for this session — implied preference signal).
Document features deep: for job documents: post date, application rate (jobs with high apply rate = high intent signal), salary range (if disclosed), remote/hybrid flag. For profiles: open-to-work badge, connection count, recent activity (posted last 30 days = active), profile view-to-connection rate.
Cross features — connection graph: 1st-degree connection to searcher = strong boost. 2nd-degree = moderate boost. Mutual connections count. Former colleagues flag. These features are only feasible because the graph is precomputed offline and stored in a graph feature store.
Training set construction: sample query-document pairs from impressed SERPs. For each query: top-K ranked results + 5 randomly sampled from remainder (to capture diversity). Point-in-time correct: labels at T, features from T − 1h to prevent leakage.
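
The position-bias correction above can be illustrated with a small IPW sketch. It assumes per-position examination propensities have already been estimated (for example from a randomisation experiment); the propensity values, the weight cap, and the function name are all hypothetical.

```python
def ipw_click_weights(impressions, propensity_by_position, max_weight=10.0):
    """Return (doc_id, weight) pairs where each click is weighted by 1 / propensity.

    impressions: list of (doc_id, position, clicked) tuples.
    max_weight:  clip to keep rare deep-position clicks from dominating training.
    """
    weighted = []
    for doc_id, position, clicked in impressions:
        propensity = propensity_by_position[position]
        weight = min(1.0 / propensity, max_weight) if clicked else 0.0
        weighted.append((doc_id, weight))
    return weighted


# Hypothetical propensities matching the "position 1 gets 5x position 5" example.
propensities = {1: 1.0, 5: 0.2}
impressions = [("a", 1, True), ("b", 5, True), ("c", 5, False)]
weights = ipw_click_weights(impressions, propensities)
```

A click at position 5 counts five times as much as a click at position 1, which offsets the lower chance that the user ever looked at position 5.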

✦ Senior ML Engineer

"DCG labels over binary click/no-click is the depth signal that shows you understand Learning to Rank. Binary labels treat 'clicked and applied' the same as 'clicked and bounced in 2 seconds.' Graded relevance (4=applied, 3=long dwell, 2=short dwell, 1=impression skip) allows LambdaMART to directly optimise NDCG instead of AUC. The NDCG gain from graded labels typically exceeds the gain from any single model architecture change."

✦ Senior ML Engineer

"Position bias correction via IPW is the production detail most candidates miss. Without it, your LTR model learns 'documents at position 1 are high quality' because they receive disproportionate clicks. The model then ranks these documents higher in future, creating a feedback loop. IPW de-biases the labels so the model learns document quality, not position quality. Name it — it shows you've thought beyond naive click modelling."

LTR family for initial ranking: pointwise (XGBoost — treats docs independently), pairwise (LambdaRank — optimises relative order), listwise (LambdaMART — directly optimises NDCG). Use LambdaMART as the primary ranker for 1000 candidates.
BERT cross-encoder for re-ranking: applied to top-100 results from LambdaMART. Cross-encoder processes query and document jointly via cross-attention — 100× more expensive than bi-encoder but 20% more accurate on precision@1.
Cascade rationale: LambdaMART on 1000 (fast, ~50ms) → BERT cross-encoder on top-50 (slow, ~50ms). Total: ~100ms for two-stage ranking. At the same ~1ms per candidate, a cross-encoder over all 1000 would take ~1s, which is unacceptable.
Offline eval: NDCG@10 (primary), MRR (mean reciprocal rank), Recall@100 for retrieval stage.
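
For reference, a minimal sketch of the two ranking metrics named above. It assumes graded labels on the 0–4 scale and the common exponential gain 2^grade − 1; the gain choice and function names are assumptions, not something the text prescribes.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k graded results."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(first_relevant_ranks):
    """Mean of 1/rank over queries; rank is 1-based, None means no relevant result."""
    reciprocal = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal) if reciprocal else 0.0
```

NDCG rewards putting the highest grades first, while MRR only looks at where the first correct result lands, which is why MRR suits navigational queries.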

LambdaMART architecture: gradient-boosted decision trees (GBDT) trained with LambdaRank-style lambda gradients, a gradient approximation that weights each pairwise swap by its change in NDCG and so optimises NDCG directly. 500–1000 trees, max depth 6. Input: ~150 features per query-doc pair. Training: 30-day rolling window of impressed SERPs with graded labels. Retrain daily.
BERT cross-encoder: [CLS] + query_tokens + [SEP] + doc_tokens → pooled [CLS] representation → linear layer → relevance score. Query and document processed jointly — cross-attention between all query-doc token pairs. Fine-tuned on human-labelled query-document pairs (4-class relevance). BERT-base (110M params) quantised to INT8 for inference.
Why cross-encoder > bi-encoder for re-ranking: bi-encoder encodes query and document independently — no cross-attention. Misses interactions like "Python" in query matching "CPython" in document. Cross-encoder attends across all pairs — catches these interactions. But: cross-encoder is O(N) at inference (must run per candidate); bi-encoder is O(1) for documents (pre-encoded). Hence bi-encoder for retrieval, cross-encoder for re-ranking small set.
Personalisation layer: lightweight logistic regression or small MLP on top of LambdaMART scores + user-document interaction features (click history, connections). Trained per-user type (recruiter vs job seeker). Adds 10ms at inference, personalises results without retraining the heavy rankers.
Offline training cascade: train bi-encoder on query-document pairs → generate dense index → train LambdaMART on top-1000 retrieval output → train BERT cross-encoder on top-100 LambdaMART output. Each stage trained on the output of the previous to avoid distribution mismatch.
Offline targets: retrieval Recall@100 ≥ 85%. LambdaMART NDCG@10 ≥ 0.68. BERT cross-encoder NDCG@10 ≥ 0.74 on held-out test set. The +0.06 absolute NDCG lift (about 9% relative) from adding the cross-encoder on top of LambdaMART justifies the 50ms latency cost.

✦ Senior ML Engineer

"LambdaMART → BERT cross-encoder cascade is the answer that signals FAANG-level thinking. LambdaMART on 1000 candidates (50ms), BERT on top-50 (50ms). The cross-encoder is 100× more expensive than bi-encoder but 20% more accurate on precision@1. You can't run it on 1000 documents — you can on 50. The cascade is not a compromise; it's the right architecture for this latency budget."

✦ Senior ML Engineer

"Training the cascade in stages — train retrieval, generate candidates, train re-ranker on those candidates — prevents distribution mismatch. If you train LambdaMART on perfect oracle candidates but serve it on bi-encoder retrieval output, it sees a different input distribution at test time. Train each stage on the actual output of the previous stage."

Offline: NDCG@10 (primary), MRR (mean reciprocal rank — critical for navigational queries where rank 1 matters), Recall@100 for retrieval stage.
Online: CTR@1 (top result clicked), CTR@5, apply rate (for jobs), connection request rate (for profiles), zero-result rate (% queries returning < 5 results).
Business: job fill rate (posted job → hired), recruiter search satisfaction (NPS survey subset), candidate discovery (new profiles surfaced to recruiters).
A/B design: query-level randomisation (not user-level). Two users typing the same query should see consistent results — user-level A/B would show them different results, eroding trust.
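
Query-level randomisation can be sketched as a deterministic hash of the normalised query. This is an illustration only: the experiment name, the normalisation rule, and the 50/50 split are hypothetical choices.

```python
import hashlib


def normalise_query(query):
    """Lower-case and collapse whitespace so trivially different spellings share a bucket."""
    return " ".join(query.lower().split())


def assign_arm(query, experiment="ranker_v2", treatment_share=0.5):
    """Map a query to 'treatment' or 'control' using a stable hash, independent of the user."""
    key = f"{experiment}:{normalise_query(query)}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"
```

Salting the hash with the experiment name keeps bucket assignments independent across experiments, so the same query is not always in treatment.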

MRR for navigational queries: reciprocal rank = 1/rank_of_first_correct_result, and MRR is its mean over queries. For navigational queries ("find John Smith"), the user only wants one specific result. If it's at rank 3, the reciprocal rank is 0.33. MRR@1 (equivalently, the share of queries whose rank-1 result is correct) is the pure navigational quality metric. Target: MRR@1 ≥ 0.72 for navigational queries.
Zero-result rate: % of queries returning fewer than 5 results. Driven by over-filtering or under-recall in retrieval. Target < 2%. Zero-result queries have 0% CTR and cause user frustration — monitor separately by entity type.
Apply rate vs CTR: for job search, CTR measures relevance (did user click?). Apply rate measures intent (did user submit application?). High CTR + low apply rate = job description is misleading or requirements are mismatched. Track both, target apply rate ≥ 8% of job clicks.
Query-level A/B rationale: search is a shared experience — two users with the same query should see the same ranking. User-level randomisation means the same query returns different results for different users simultaneously. Inconsistency damages trust. Personalisation layer (user-level) is A/B tested separately, after the ranking models are validated.
Guardrail (result freshness): avg age of top-10 job results ≤ 14 days. Stale jobs (posted months ago) that are already filled are a major dissatisfaction driver. If a model change causes result freshness to regress, abort regardless of CTR lift.
Diversity metric: % of queries returning results from ≥ 3 different companies, ≥ 2 different seniority levels. Prevents "all Google results" slates for broad queries like "software engineer".

✦ Senior ML Engineer

"Query-level A/B randomisation is the search-specific insight that most candidates miss. In recommendation, user-level A/B is standard. In search, the same query from different users should return consistent results — user-level would show John and Jane different results for 'ML engineer London', which breaks the implicit contract of a search engine. Randomise at the query hash level, not the user level. This is a one-liner that shows you understand search system design specifically."

✦ Senior ML Engineer

"Result freshness as a guardrail metric is the domain-specific signal for professional search. A recommendation model that surfaces old content loses some quality. A job search model that surfaces filled positions causes user frustration and undermines the platform's core value proposition. Freshness ≤ 14 days for jobs is a hard guardrail — name specific thresholds, not just 'we should monitor freshness'."

Latency budget: 200ms total. Query understanding 20ms + retrieval 60ms + LambdaMART 50ms + BERT cross-encoder 50ms + personalisation 10ms + overhead 10ms.
Index architecture: inverted index (BM25) + vector index (FAISS IVF-PQ for dense retrieval). Separate indexes per entity type (profiles, jobs, posts). Real-time indexing for job posts (changes within 60s via Kafka).
Retraining: LambdaMART daily on last 7 days of click data. BERT cross-encoder weekly (expensive, requires GPU cluster). Bi-encoder monthly (stable representations).
Monitoring: query coverage (% queries with ≥ 5 results), result freshness (avg age of top-10), NDCG on held-out test weekly, zero-result rate daily.

Index update strategy: job posts need near-real-time indexing (posted → indexed within 60s). Kafka → index writer service. Profile updates batch-indexed (hourly). This asymmetric freshness matches business priority: stale job posts cost applicants; slightly stale profiles are acceptable.
BERT inference serving: cross-encoder on top-50 per query at scale. At 1M queries/day (LinkedIn-scale), 50M cross-encoder inferences/day. BERT-base (INT8 quantised) on 4 CPU cores: ~1ms/inference. 50 inferences/query = 50ms. Requires 30–50 CPU cores per serving instance for the 200ms budget. GPU reduces to ~5ms for the batch but adds operational complexity — CPU is operationally simpler at this scale.
Feature serving for ranking: precomputed graph features (2nd-degree connections, company affinity) from offline graph processing job → served from Redis with 24h TTL. Query-time features (NER results, intent class) computed inline. Document features (profile completeness, job freshness) precomputed and stored in document store, fetched with the document at retrieval time.
Retraining cadence rationale: LambdaMART daily — user behaviour and job inventory change daily (new job posts, people start/end job search). BERT weekly — cross-encoder needs expensive GPU fine-tuning; weekly captures major vocabulary shifts. Bi-encoder monthly — dense representations of job titles and skills are stable; monthly rebuild balances quality vs compute cost.
Failure mode — index lag: job post → Kafka → index writer → searchable. If Kafka consumer falls behind (high-traffic spike), new jobs don't appear in search results for minutes. Monitoring: lag metric on Kafka consumer offset. Alert if lag > 30s. Mitigation: consumer auto-scaling based on partition lag.
Failure mode — BERT regression: new cross-encoder version degrades precision@1. Shadow mode: new model runs in parallel, scores are logged but not served. Compare score distributions and offline metrics before cutover. Green-light requires NDCG@10 within ±1% of current model on 48h shadow traffic.
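
The napkin math behind the BERT serving estimate above can be written out as a small calculator. The inputs are the figures quoted in the text (1M queries/day, 50 candidates, ~1ms per inference on 4 cores); the peak_factor parameter and the function name are hypothetical, and the result is average load only, not a sizing recommendation.

```python
def cross_encoder_capacity(queries_per_day, candidates_per_query,
                           ms_per_inference, cores_per_inference, peak_factor=1.0):
    """Rough serving load for scoring candidates sequentially with a cross-encoder."""
    inferences_per_day = queries_per_day * candidates_per_query
    latency_ms = candidates_per_query * ms_per_inference  # per query, no batching
    avg_qps = queries_per_day / 86_400
    core_seconds_per_query = latency_ms / 1000 * cores_per_inference
    busy_cores = avg_qps * peak_factor * core_seconds_per_query
    return {
        "inferences_per_day": inferences_per_day,
        "latency_ms": latency_ms,
        "avg_qps": avg_qps,
        "busy_cores": busy_cores,
    }


estimate = cross_encoder_capacity(
    queries_per_day=1_000_000,
    candidates_per_query=50,
    ms_per_inference=1.0,
    cores_per_inference=4,
)
```

Raising peak_factor models traffic spikes; provisioned capacity would also need headroom for tail latency and failover.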

✦ Senior ML Engineer

"Asymmetric index freshness — job posts near-real-time, profiles hourly, posts daily — is the production insight that shows you understand the business priority behind each entity type. A uniform 'all indexes updated hourly' policy would let new job posts sit invisible for up to 60 minutes. That's a direct revenue impact. Different entity types need different freshness SLAs, and matching them to business priority is the systems engineering answer."

✦ Senior ML Engineer

"CPU vs GPU for BERT inference at this scale is a nuanced answer. GPU is 10× faster per inference but adds cost, complexity, and a GPU dependency in the serving stack. At 50 inferences/query with 50ms budget, 4 CPU cores is sufficient. The right answer isn't 'use GPU' — it's 'CPU at this query volume, GPU when query volume × candidate count × latency budget make CPU insufficient.' Show the napkin math."

Design a real-time fraud detection system for a payments platform processing 10,000 transactions per second.

  Transaction event (100ms total budget)
         │
         ▼
  ┌─ REAL-TIME FEATURE COMPUTATION ───────────────────────┐
  │  Velocity: txn count in last 1m / 5m / 1h            │
  │  User: avg_amount, usual_merchants, geo_pattern       │
  │  Device: device_id, IP geolocation, fingerprint       │
  │  Graph: card-device-merchant network signals          │
  └───────────────────────────────────────────────────────┘
         │
         ▼
  ┌─ RULE ENGINE (deterministic, < 5ms) ─────────────────┐
  │  Hard rules: new card + high value + foreign IP       │
  │  Velocity rules: > 5 txns/min → block               │
  └───────────────────────────────────────────────────────┘
         │ passed rules
         ▼
  ┌─ ML SCORING (< 50ms) ─────────────────────────────────┐
  │  LightGBM on 200+ engineered features                 │
  │  score > 0.90 → block · 0.70–0.90 → 3DS challenge   │
  │  score < 0.70 → approve                              │
  └───────────────────────────────────────────────────────┘
         │
         ▼
  ┌─ ASYNC DEEP ANALYSIS (post-transaction) ──────────────┐
  │  GNN: card-merchant-device ring fraud detection       │
  │  LSTM: account takeover sequence detection            │
  │  Human review queue for borderline cases              │
  └───────────────────────────────────────────────────────┘

Clarify cost asymmetry first: "What is the acceptable false positive rate? A blocked legitimate transaction is also harmful. What is the FP:FN cost ratio?"
Define three decision outputs — not binary: hard block (score > 0.90), soft challenge (3DS/OTP, 0.70–0.90), approve (< 0.70). Three-way decision minimises friction while catching fraud.
Hard latency constraint: 100ms total at 10k TPS. No GPU in the critical path. Synchronous CPU-only inference required.
Propose three-layer architecture: rule engine (deterministic, < 5ms) → ML scoring (LightGBM, < 50ms) → async deep analysis (GNN + LSTM, post-transaction).
Fraud rate is 0.1% — a naive "predict everything legitimate" classifier has 99.9% accuracy. The real metric is AUC-PR and precision at operating threshold, not accuracy.
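
The three-way decision can be sketched as a simple threshold function. The cut-offs are the ones given above (block above 0.90, challenge from 0.70 to 0.90, approve below 0.70); the function name and the handling of the exact boundary values are assumptions.

```python
def decide(score, block_threshold=0.90, challenge_threshold=0.70):
    """Map a calibrated fraud score in [0, 1] to block / challenge / approve."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if score > block_threshold:
        return "block"
    if score >= challenge_threshold:
        return "challenge"  # 3DS / OTP step-up
    return "approve"
```

Keeping the thresholds as parameters lets them be retuned to the FP:FN cost ratio without a code change.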

FP:FN cost ratio matters enormously: a false positive (blocking a legitimate transaction) costs customer friction, cart abandonment, potential churn, and a support ticket. Typical ratio: FP cost ≈ 1× the transaction's revenue. FN cost (missed fraud) ≈ 10–50× (chargeback + dispute + fee + reputational). So tolerate more FPs to reduce FNs, but not so many FPs that customers leave.
Three-decision design: hard block for high-confidence fraud (irrecoverable decision — use conservatively). Soft challenge (3DS, OTP, step-up auth) for uncertain zone — lets legitimate users complete transaction with extra friction. Approve with silent flag for async analysis. Three outputs reduce both revenue loss and fraud loss.
10k TPS × 100ms: by Little's law, 10,000/s × 0.1s ≈ 1,000 concurrent in-flight transactions. Each needs rule engine + LightGBM scoring within 100ms. Rule engine: pure deterministic logic on precomputed velocity counts from Redis. LightGBM: 200 trees, max depth 6, INT8 quantised, CPU inference < 10ms on 4 cores.
Async deep analysis: Graph Neural Network runs post-approval on the transaction graph (card-device-merchant edges). Detects fraud rings where cards are linked via shared devices. LSTM detects account takeover sequences (login from new device → profile change → high-value transaction). Results feed into next-transaction decision within 60s.
Compliance constraint: every block or challenge decision must be explainable (GDPR, PCI-DSS). SHAP values for top-5 features stored in immutable audit log per transaction. Model must be interpretable by compliance team.
Human review: borderline cases (score 0.60–0.75) route to analyst queue. Analyst sees: transaction details, SHAP explanation, user history, device fingerprint. Decision time target: < 4 hours. Outcome feeds back as labelled training example.

✦ Senior ML Engineer

"The cost ratio question is what separates senior candidates. Fraud rate is 0.1%, so naive accuracy is 99.9% — meaningless. The real question is: what does a false positive (blocked legitimate transaction) cost versus a false negative (missed fraud)? This ratio drives your precision-recall operating point. Candidates who jump to 'minimise fraud loss' without acknowledging FP cost have never seen a fraud team's business review."

✦ Senior ML Engineer

"The three-decision output (block / challenge / approve) is the design insight that 80% of candidates miss. Binary fraud classifiers force a hard tradeoff between FP and FN. The challenge tier (3DS step-up) recovers legitimate transactions that would otherwise be blocked, reducing FP cost while still catching fraud. It's not just a UX feature — it changes the ROC operating point economics entirely."

Velocity features (highest signal): transaction count from same card/device/IP in last 1min, 5min, 1h, 24h. Amount deviation from user's 30-day mean (z-score). These features compute in < 2ms from Redis counters.
Behavioural features: time since last transaction, usual transaction hours (3am flag), geographic consistency (merchant location vs user's typical area), device fingerprint match.
Graph features: has this device been used by multiple compromised cards? Has this merchant seen a chargeback spike in last 24h?
Labels: ground truth = chargeback confirmed + manual review confirmed fraud. Label delay: fraud confirmed T+14 days. Training window must lag by ≥ 14 days.
Class imbalance: 0.1% fraud rate. Strategies: cost-sensitive learning (FN weight = 10× FP), SMOTE or oversampling, focal loss.
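
A minimal in-memory sketch of the velocity counter described above. It stands in for the Redis counters only for illustration: the class name and API are hypothetical, and it assumes timestamps for a given key arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Exact sliding-window event counter per key (e.g. card, device, or IP)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, key, timestamp):
        """Register one transaction and return the count inside the window."""
        events = self._events[key]
        events.append(timestamp)
        self._evict(events, timestamp)
        return len(events)

    def count(self, key, now):
        """Return how many events for this key fall inside the window ending at now."""
        events = self._events[key]
        self._evict(events, now)
        return len(events)

    def _evict(self, events, now):
        cutoff = now - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
```

One instance per window (1min, 5min, 1h, 24h) would yield the per-window counts that feed the rule engine and the model.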

Velocity features in depth: Redis INCR + EXPIRE on per-window keys: key = vel:{card_id}:{window}, TTL = window duration. Count in O(1). Strictly, this gives a fixed-window counter that resets when the TTL expires; an exact sliding window needs time-bucketed keys or a sorted set of timestamps. Also: amount sum, unique merchant count, unique IP count in window. These 20 velocity features are the single highest-signal feature group in fraud detection.
Amount deviation: z-score = (current_amount − μ_30d) / σ_30d. A $5,000 transaction from a user who typically spends $40 is an extreme outlier: for example, with σ_30d = $20, z = (5,000 − 40) / 20 = 248. Precompute μ and σ per user daily, store in Redis.
Geographic velocity: distance from last transaction location / time since last transaction = implied travel speed. Speed > 900 km/h with real transaction locations = physically impossible → high fraud signal. Requires IP geolocation + merchant lat/lng.
Label delay — critical production detail: you cannot train on transactions from the last 14 days because labels don't exist yet. Training window: T − 90d to T − 14d. Inference: T − 0d. The 14-day lag means the model is always slightly stale relative to current fraud patterns. Mitigation: rule engine captures emergent patterns faster (can be updated in hours).
Merchant chargeback rate feature: rolling 7-day chargeback rate per merchant. A new merchant with 15% chargeback rate is a high-risk signal. Requires aggregating chargeback data across all cards → compute in batch job → serve from Redis at transaction time.
Device fingerprint: hash of browser/device attributes (user agent, screen resolution, fonts, timezone, WebGL renderer). Stable across sessions. If device fingerprint has been used by 5+ different cards in 30 days, it's likely a fraud device.
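
The geographic velocity feature above can be sketched with a haversine distance and an implied speed. The 900 km/h limit comes from the text; the ε of one minute, the tuple layout, and the function names are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def implied_speed_kmh(prev, curr, epsilon_hours=1 / 60):
    """prev and curr are (lat, lon, unix_seconds); epsilon avoids division by zero."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = max(curr[2] - prev[2], 0) / 3600
    return distance / (hours + epsilon_hours)


def is_impossible_travel(prev, curr, max_speed_kmh=900.0):
    """True when the implied speed exceeds what a commercial flight could cover."""
    return implied_speed_kmh(prev, curr) > max_speed_kmh
```

For example, a London transaction followed one hour later by one in New York implies a speed of several thousand km/h and is flagged, while London to Paris in two hours is not.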

✦ Senior ML Engineer

"Label delay is the production detail that trips candidates up. You cannot train on today's transactions because today's fraud labels don't exist: chargebacks arrive T+14 days. Your training window must lag by at least 14 days. Missing this causes severe label noise: fraud from the last two weeks has not been confirmed yet, so it enters training labelled as legitimate and teaches the model that the newest fraud patterns are safe. Name the lag explicitly."

✦ Senior ML Engineer

"Geographic velocity (impossible travel detection) is the signal candidates know but rarely implement correctly. The naive version checks 'distance > 500km in 1h.' The production version checks distance / (time_delta + ε) vs physical speed limits, handling same-device-different-location patterns and VPN detection. Mentioning the implementation detail — not just the concept — shows you've thought about edge cases."

Primary (synchronous): LightGBM. Reasons: < 10ms CPU inference (no GPU in critical path), handles missing values natively, excellent on tabular features, interpretable via SHAP for compliance.
Threshold strategy: not 0.5. Use precision-recall curve: operate at 95% precision for hard block (at most 5% of blocked transactions are legitimate). Challenge tier: score 0.70–0.90. Threshold tuned to business cost ratio, not accuracy.
Secondary (async): GNN for fraud ring detection (shared card-device-merchant graph). LSTM for account takeover sequence detection.
Offline eval: AUC-PR (primary for imbalanced classes), precision@1% FPR (business operating point), F1 at operating threshold.

LightGBM for critical path: 200 trees, max depth 8, learning rate 0.05. 200+ features: 20 velocity features, 15 behavioural, 10 device, 5 merchant risk, 150 interaction terms. Quantised INT8 for serving. Inference: < 10ms on 4 CPU cores (at most 40 core-ms per transaction, so on the order of 400 busy cores fleet-wide at 10k TPS), well within the 50ms ML scoring budget.
Why not neural net in critical path: FFNN/Transformer requires GPU for < 10ms inference at production scale. GPU adds: operational complexity, GPU memory management, cold-start latency, higher cost, and a single point of failure. LightGBM at < 10ms CPU is the production-proven choice for synchronous fraud scoring. Neural nets belong in async analysis where latency budget is relaxed.
Threshold calibration: run Platt scaling or isotonic regression to calibrate LightGBM output probabilities. Raw GBDT outputs are not well-calibrated probabilities. After calibration, roughly 90% of transactions scored near 0.90 should turn out to be fraud. Recalibrate monthly as fraud patterns shift.
GNN architecture (async): heterogeneous graph: nodes = {cards, devices, merchants, IPs}. Edges = transaction relationships. GraphSAGE for node embedding aggregation. Trained with contrastive loss: pull together nodes in confirmed fraud rings, push apart legitimate nodes. Run on transaction batches every 60s, not per-transaction.
LSTM for account takeover: sequence model on per-account event stream: login, profile change, address change, high-value transaction. Trained on confirmed account takeover sequences. Detects unusual action order patterns before the fraudulent transaction completes.
Model ensemble: final risk score = 0.6 × LightGBM_score + 0.2 × GNN_score + 0.2 × LSTM_score. GNN and LSTM scores from previous batch update (60s lag). Ensemble reduces false negatives by 15% over LightGBM alone on fraud ring and ATO cases.

✦ Senior ML Engineer

"LightGBM over neural nets in the synchronous critical path is the answer that shows production experience. Everyone says 'deep learning.' The correct answer for real-time fraud at 10k TPS is: LightGBM for latency (< 10ms CPU), deep models asynchronously for pattern detection. Neural nets in the 100ms budget require GPU, which adds operational complexity and a GPU dependency in a latency-critical payment path. Name the tradeoff explicitly."

✦ Senior ML Engineer

"AUC-PR over AUC-ROC for imbalanced classes is a litmus test question in fraud ML. AUC-ROC looks great (0.98+) when the positive class is 0.1% — it's dominated by the trivially large negative class. AUC-PR tells you how well the model retrieves the rare fraud cases without flooding the ops team with false positives. If a candidate cites only AUC-ROC for a 0.1% base rate problem, they haven't worked with imbalanced classes in production."

Offline primary: AUC-PR (not AUC-ROC — imbalanced classes). Precision@1% FPR (business operating point). F1 at the chosen threshold.
Online: fraud loss rate ($ fraud / $ processed), false positive rate (% legitimate transactions challenged/blocked), chargeback rate, 3DS challenge rate, customer friction score.
Business: revenue recovery via challenges (legitimate transactions saved by 3DS vs hard block), false positive cost (cart abandonment from unnecessary challenges).
A/B testing challenge: rare events — statistical significance on fraud rate requires enormous sample sizes. Use surrogate: model score distribution comparison + precision on high-score transactions.
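
How base rate, FPR, and recall together bound precision can be checked with a short calculation. This is a sketch of the arithmetic only; the function names are hypothetical, and the 0.1% base rate and 1% FPR are the figures used in the text.

```python
def precision_at_fpr(base_rate, fpr, recall):
    """Precision implied by a given fraud base rate, false positive rate, and recall."""
    true_positives = base_rate * recall
    false_positives = (1.0 - base_rate) * fpr
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged > 0 else 0.0


def max_fpr_for_precision(base_rate, precision, recall):
    """Largest FPR that still achieves the target precision at the given recall."""
    true_positives = base_rate * recall
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_positives = true_positives * (1.0 - precision) / precision
    return false_positives / (1.0 - base_rate)


ceiling = precision_at_fpr(base_rate=0.001, fpr=0.01, recall=1.0)
needed_fpr = max_fpr_for_precision(base_rate=0.001, precision=0.5, recall=1.0)
```

With a 0.1% base rate, even perfect recall at 1% FPR yields roughly 9% precision, and 50% precision needs an FPR of about 0.1%.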

Precision@1% FPR: of all transactions the model flags as fraud, what % are actually fraud, when we allow a 1% false positive rate? At 0.1% base rate, 1% FPR produces about 10 FPs for every fraud case even at perfect recall, which caps precision near 9%. A precision target of ≥ 0.50 is therefore only reachable at a much tighter operating point, around 0.1% FPR or below.
A/B framework for rare events: cannot A/B test fraud rate directly (need millions of transactions for statistical significance at 0.1% base rate). Instead: (1) compare model score distributions between treatment and control — KL divergence. (2) Measure precision on high-score transactions (score > 0.85): inspect for true positives. (3) Use revenue-based metrics: fraud loss rate as $-denominated metric has enough variance to detect 10% relative changes in reasonable sample.
Shadow mode evaluation: new model runs alongside production model, scoring every transaction but not acting on the score. After 7 days: compare precision@threshold, recall, score calibration. Avoids the problem of not being able to evaluate blocked transactions (which can't be confirmed as fraud or legitimate after the fact).
Fraud loss rate calculation: (sum of confirmed fraud $ in period) / (total transaction $ in period). Target: < 0.05% (5 basis points). Industry benchmark: Visa/Mastercard maintain ~0.06–0.08% fraud loss rates. Track by merchant category, card type, transaction channel to identify high-risk segments.
Customer friction metric: 3DS challenge completion rate — what % of challenged users complete the step-up auth? Completion rate < 60% signals the challenge is too burdensome (likely applied to legitimate users who abandon). Target: > 80% completion rate on challenged transactions, < 15% of total transactions challenged.
Model drift monitoring: fraud patterns evolve faster than almost any other ML domain. Monitor daily: (1) score distribution shift (PSI on score deciles). (2) Precision on high-confidence fraud decisions. (3) Feature distribution shift on top velocity features. Alert if any drift > 2 standard deviations from 30-day baseline.
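
The PSI check on score deciles can be sketched as follows. The function name, the example distribution and the 0.2 review threshold are assumptions for illustration (0.2 is a common rule of thumb, not a value derived from this system).

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions (fractions per bin)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)     # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

# Baseline deciles hold 10% of scores each by construction.
baseline = [0.10] * 10
# Hypothetical share of today's scores falling into the baseline decile edges.
today = [0.04, 0.06, 0.08, 0.10, 0.10, 0.10, 0.10, 0.12, 0.14, 0.16]

drift = psi(baseline, today)
needs_review = drift > 0.2   # assumed alert threshold
```

The same function works for feature drift if the velocity features are binned against their 30-day baseline.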

✦ Senior ML Engineer

"Shadow mode evaluation is the production sophistication answer for fraud. You can't evaluate a blocking model by looking at blocked transactions — they're gone, you can never know if they were legitimate. Shadow mode logs the model's decision without acting on it, letting you measure the true FP rate on a sample of approved transactions. This is standard practice at every mature fraud team, and almost nobody mentions it in interviews."

✦ Senior ML Engineer

"Daily drift monitoring is non-negotiable in fraud. Fraud patterns shift faster than any other ML domain — a new fraud ring's technique can make your feature distribution shift within hours. PSI > 0.2 on velocity features should trigger an immediate review, not a weekly report. The model that was 95% accurate last month might be 70% accurate today if a new card-testing bot pattern emerged. Mention the monitoring cadence — daily, not weekly."

Serving path: transaction event → Redis velocity lookup (< 2ms) → rule engine (< 5ms) → LightGBM feature assembly + scoring (< 10ms) → decision (< 1ms). Total: < 20ms synchronous. 80ms remaining for network, serialisation, response.
Feature streaming: Kafka consumer computes velocity counts → Redis INCR with TTL. Flink for 1-minute window aggregates. All within < 500ms of transaction event.
Model refresh: retrain weekly (fraud patterns evolve). Monitor precision daily via chargeback labels at T+7. Alert if precision drops > 3% from baseline.
Feedback loop from shadow mode: blocked transactions can't generate ground truth. Shadow mode on approved traffic measures true FP rate without taking action.

Velocity counter architecture: Redis INCR on per-window keys: vel:{card_id}:1m, vel:{card_id}:5m, etc. Expiry set to window length, which makes each key a fixed-window counter that approximates a sliding window. At 10k TPS × 20 velocity keys = 200k Redis operations/second. A single Redis node can reach 1M+ ops/sec with pipelining, so the load fits on one node. Cluster for HA.
LightGBM serving: model loaded in memory at process startup (~50MB). Feature vector constructed in-process from Redis lookups. LightGBM predict: < 1ms for single row with 200 features (INT8 quantised). No network hop for model inference — all in-process.
Async pipeline for GNN/LSTM: approved transactions → Kafka → Flink batch aggregator (60s window) → GNN scoring → risk score update in Redis per card/device. Next transaction from same card sees updated GNN risk score as an additional feature. Latency: < 60s for GNN risk to propagate.
Retraining pipeline: weekly full retrain on T − 90d to T − 14d confirmed labels. Daily incremental update with new confirmed chargebacks from T − 21d to T − 14d (rolling). Offline eval gate: precision@1% FPR must not regress > 2% vs current production model. Deploy via canary: 5% → 25% → 100% traffic over 48h.
Compliance infrastructure: every transaction decision: log {transaction_id, score, threshold, top-5 SHAP features, decision} to immutable audit store (append-only). Regulatory requirement: 5-year retention. Decision must be reproducible with same features at time of transaction. Feature values snapshotted at decision time.
Incident response — new fraud pattern: alert fires (PSI spike on velocity features). On-call analyst reviews: identifies new attack vector (e.g., micro-transaction testing: 100 × $0.10 transactions before $500 transaction). New rule added to rule engine within 2 hours (no model retrain needed). Rule engine is the fast-update layer; ML model is the high-precision layer.
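
The velocity counter keys described above can be illustrated without a Redis dependency. This sketch is an in-memory stand-in for INCR plus EXPIRE; the class name and the set of windows are assumptions, and it implements fixed windows that reset on expiry rather than a true sliding window.

```python
class VelocityCounters:
    """In-memory stand-in for Redis INCR + EXPIRE on keys like vel:{card_id}:1m."""

    WINDOWS = {"1m": 60, "5m": 300, "1h": 3600}   # label -> seconds (assumed set)

    def __init__(self):
        self._store = {}   # key -> (count, expires_at)

    def incr(self, card_id: str, now: float) -> dict:
        """Record one transaction and return the current count per window."""
        counts = {}
        for label, seconds in self.WINDOWS.items():
            key = f"vel:{card_id}:{label}"
            count, expires_at = self._store.get(key, (0, 0.0))
            if now >= expires_at:              # key expired: start a new window
                count, expires_at = 0, now + seconds
            count += 1
            self._store[key] = (count, expires_at)
            counts[label] = count
        return counts

counters = VelocityCounters()
first = counters.incr("card_42", now=0.0)
second = counters.incr("card_42", now=30.0)
third = counters.incr("card_42", now=61.0)    # the 1m key has expired by now
```

The returned counts are what the feature assembly step would place into the model's feature vector.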

✦ Senior ML Engineer

"Rule engine as the fast-update layer is the production architecture insight. Fraud patterns emerge in hours; ML model retraining takes a week. The rule engine can be updated by an analyst in 2 hours without a model deploy. It handles the 'known unknowns' — attack patterns we've seen and can describe. The ML model handles the 'unknown unknowns' — novel fraud patterns that don't match any rule. Both layers are necessary and serve different purposes."

✦ Senior ML Engineer

"Canary deployment for fraud models is not optional. A fraudulent model update (buggy feature, schema change, or model regression) that goes to 100% traffic immediately can allow millions of dollars of fraud or block thousands of legitimate transactions before anyone notices. 5% → 25% → 100% over 48 hours with precision monitoring at each stage. This is the deployment discipline that separates mature fraud teams from amateur ones."

Design the click-through rate (CTR) prediction system for a digital advertising platform serving 5M ad impressions per second.

  Ad auction request (user context + candidate ads)
         │
         ▼
  ┌─ FEATURE EXTRACTION ──────────────────────────────────┐
  │  User: demographics, interest vector, search history  │
  │  Ad: creative embed, historical CTR, bid price        │
  │  Context: query/page topic, device, time, position    │
  │  Cross: user×ad affinity, semantic similarity         │
  └───────────────────────────────────────────────────────┘
         │
         ▼
  ┌─ CTR MODEL ───────────────────────────────────────────┐
  │  Wide & Deep (Google) or DLRM (Meta)                  │
  │  Wide: sparse feature interactions (memorisation)     │
  │  Deep: dense embeddings through MLP (generalisation)  │
  │  Output: P(click | user, ad, context) = pCTR          │
  └───────────────────────────────────────────────────────┘
         │ pCTR
         ▼
  ┌─ AUCTION MECHANISM ───────────────────────────────────┐
  │  Quality score = pCTR × relevance_score               │
  │  Ad rank = bid × quality_score                        │
  │  Winner: highest ad rank                              │
  │  Price: Vickrey second-price auction                  │
  └───────────────────────────────────────────────────────┘

Frame correctly: CTR prediction is not just ML — pCTR feeds directly into the auction mechanism. Accuracy affects revenue AND advertiser fairness. A biased pCTR over-charges some advertisers and under-charges others.
Distinguish calibration from discrimination: calibration (does predicted 5% CTR match actual 5%?) is as important as discrimination (can you rank ads correctly?). For ad pricing, both are required.
Scale: 5M QPS × 10–50 candidate ads per request = 50M–250M model calls/second. Model must be < 1ms CPU inference. This is the most demanding serving environment in ML.
Propose architecture: feature hashing → Wide & Deep or DLRM → calibration layer → auction. Continuous (online) learning for real-time CTR pattern updates.
State that position bias correction in features is required — without it, the model learns position as a quality proxy instead of true ad relevance.

Scale napkin math: 5M QPS × 50 ads = 250M model calls/second, each with a < 1ms budget. On modern hardware: 1 CPU core handles ~10k scalar LightGBM or neural net predictions/second. Need 25,000 CPU cores, or batched GPU inference. Solution: quantised ONNX model on GPU serving cluster with batching, or extremely optimised INT8 CPU model. This is why ad CTR uses highly specialised serving infrastructure (for example NVIDIA's Triton Inference Server or TensorFlow Serving from Google).
Calibration vs discrimination — the pricing impact: if model predicts pCTR = 5% but actual CTR = 1%, the auction price for that ad is 5× too high. Advertiser overpays, depletes budget, stops bidding. If pCTR = 1% but actual = 5%, advertiser underpays — platform loses revenue. Both miscalibration directions have direct revenue consequences. Calibration is a first-class metric, not an afterthought.
Position bias in the auction context: ads at position 1 get 5× more clicks regardless of quality. Without correction, the model conflates high position with high quality. Corrected model serves better ads at position 1 (a virtuous cycle). Method: train with position as a feature, then at inference set position = 1 for all candidates to get unbiased quality scores.
Second-price auction (Vickrey): winner pays the minimum bid needed to win, not their own bid. This incentivises advertisers to bid their true value. Price = (second_highest_bid × second_highest_quality) / winner_quality + ε. Quality score = pCTR × relevance — the model directly determines how much advertisers pay per impression.
Continuous training necessity: CTR patterns shift within hours (trending topics, time of day, breaking news). A daily-retrained model misses the Monday 9am engagement spike. FTRL (Follow The Regularised Leader) optimiser enables online learning: model updates incrementally with each hour's data without full retraining.
Feature hashing for scale: billions of user IDs, ad IDs, advertiser IDs — one-hot encoding is impossible. Hash into fixed-size hash space (2^24 ≈ 16M buckets). Collision rate acceptable at this size. Embedding table: 16M buckets × 64-dim × 4 bytes = 4GB. Fits in GPU memory for serving.
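
The second-price rule above (ad rank = bid × quality score, price = runner-up's ad rank divided by the winner's quality, plus ε) can be sketched like this. The function name, the tuple layout and the example bids are illustrative assumptions.

```python
def run_auction(candidates, epsilon=0.01):
    """candidates: list of (ad_id, bid, quality_score), at least two entries.

    Returns (winner_id, price) using ad rank = bid * quality_score.
    """
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    # Minimum the winner must pay to keep its ad rank above the runner-up's.
    price = (runner_up[1] * runner_up[2]) / winner[2] + epsilon
    return winner[0], price

ads = [
    ("A", 2.00, 0.05),   # ad rank 0.10
    ("B", 3.00, 0.02),   # ad rank 0.06
    ("C", 1.00, 0.04),   # ad rank 0.04
]
winner_id, price = run_auction(ads)
```

Ad B bids the most but loses to A on quality, and A pays less than its own bid. A production auction would also cap the price at the winner's bid and handle the single-bidder case with a reserve price; both are omitted here.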

✦ Senior ML Engineer

"Calibration vs discrimination is the deep insight that almost nobody gives. AUC measures discrimination — can you rank ads correctly? Calibration measures whether predicted probabilities match actual rates. For ad pricing, you need both. A model with AUC 0.82 but poor calibration will systematically over-charge or under-charge advertisers. This is a revenue and legal problem, not just an accuracy problem. Name both, explain why both matter for an auction system."

✦ Senior ML Engineer

"The scale napkin math — 250M model calls/second, each < 1ms — is what separates candidates who understand the infrastructure reality from those who just propose a model. At this QPS, you cannot afford any network hop in the inference path. The model must be in-process, quantised, and batched on GPU. Doing the math out loud shows systems thinking, not just ML knowledge."

Feature hashing: billions of user/ad IDs → hash into 2^24 buckets. Fixed embedding table size regardless of new users or ads. Collision rate acceptable.
Key features: user historical CTR by ad category (rolling 7d), ad creative performance history (CTR by placement), position bias correction feature, user recency (time since last ad click).
Position bias correction: train with position as a feature → at inference, set position = 1 for all candidates. Model learns "this ad is good regardless of position."
Label: click = 1, impression without click = 0. Sparse: typical CTR 0.1–2%. Treat as a binary classification with extreme class imbalance.
Cross features: user interest vector × ad topic similarity, user device × ad format compatibility, user-advertiser affinity (has user engaged with this advertiser before?).
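
Feature hashing into 2^24 buckets can be sketched with the standard library. The helper name and the choice of MD5 are assumptions made to keep the example self-contained (Python's built-in hash() is salted per process, so it is not stable across servers); production systems typically use a faster non-cryptographic hash.

```python
import hashlib

NUM_BUCKETS = 2 ** 24   # about 16M buckets

def bucket(field: str, value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a categorical value to a stable bucket index; the field name salts the hash."""
    digest = hashlib.md5(f"{field}={value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets

idx = bucket("user_id", "u_12345")

# Embedding table size: buckets × 64 dims × 4 bytes (float32).
table_bytes = NUM_BUCKETS * 64 * 4
```

Salting with the field name keeps user_id 42 and ad_id 42 from always landing in the same bucket when fields share one table.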

Position bias correction — implementation: add position as a categorical feature (1–10) during training. At serving time, override position = 1 for all candidates. This removes position as a confound — the model's score reflects true quality, not rank position. Alternative: propensity model (IPW) to de-bias click labels. Both approaches used in production; feature-override is simpler.
User interest vector construction: aggregate user's last 30 days of ad clicks and page visits → topic categories (IAB taxonomy, ~500 categories). Represent as a sparse 500-dim interest vector. Update every 10 minutes from Kafka → Redis. Embedding lookup at serving time (< 1ms).
Ad creative embeddings: image embeddings from ResNet (for display ads), text embeddings from BERT (for search ads), combined into 128-dim creative vector. Precomputed when ad is uploaded, stored in ad feature store. Retrieved at serving time by ad ID.
Historical CTR feature: for established ads: rolling 7-day CTR by placement type (search, display, video). Smoothed with prior (global average CTR) to handle low-count ads: smoothed_CTR = (clicks + α × global_CTR) / (impressions + α), where α = 100. This prevents cold-start instability for new ads.
Contextual features: page topic (NLP on page URL/content), time of day (6-bucket), day of week, device type, browser, geography. Context determines ad relevance — a running shoe ad is more relevant on a fitness blog than a finance news page. Page topic × ad category match is a high-signal cross feature.
Negative sampling for training: all impressed ads have click/no-click labels — no need for negative sampling (unlike recommendation). Impression = exposure. This avoids selection bias that affects recommendation systems. However: only use impressed ads as training examples, not all ads (ads that lost the auction aren't negatives — they were never shown).
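
The smoothed historical CTR formula above is small enough to show directly. The 1% global CTR and the click counts are illustrative assumptions; α = 100 is the value given in the text.

```python
def smoothed_ctr(clicks: int, impressions: int, global_ctr: float, alpha: float = 100.0) -> float:
    """Pull a raw CTR toward the global average; alpha acts as a pseudo-impression count."""
    return (clicks + alpha * global_ctr) / (impressions + alpha)

GLOBAL_CTR = 0.01   # assumed platform-wide average

brand_new = smoothed_ctr(clicks=0, impressions=0, global_ctr=GLOBAL_CTR)
unlucky_start = smoothed_ctr(clicks=0, impressions=10, global_ctr=GLOBAL_CTR)
established = smoothed_ctr(clicks=500, impressions=10_000, global_ctr=GLOBAL_CTR)
```

A new ad starts at the prior, ten clickless impressions move it only slightly, and an established ad is dominated by its own data.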

✦ Senior ML Engineer

"Position bias correction is the single most important feature engineering insight for ad CTR. Without it, your model learns 'position 1 = high quality' rather than 'this ad is genuinely relevant.' All major ad platforms discovered this independently — Google calls it 'examination hypothesis,' Facebook had to retrain their models after discovering the bias. The feature-override method is elegant: train with position, serve with position=1. One line of inference code that fixes a fundamental bias."

✦ Senior ML Engineer

"Smoothed CTR for cold-start ads — (clicks + α × global_CTR) / (impressions + α) — is the production detail that shows you've handled new ad cold-start. A new ad with 0 impressions has undefined CTR. A new ad with 10 impressions and 0 clicks shouldn't be predicted as 0% CTR forever. Bayesian smoothing with a prior pulls the estimate toward the global average until sufficient data accumulates. Name the formula."

Wide & Deep (Google, 2016): Wide component = cross-product sparse feature interactions (memorisation of specific user × ad patterns). Deep component = dense embeddings through MLP (generalisation to unseen combinations). Two components trained jointly.
DLRM (Meta, 2019): embedding tables for categorical features + bottom MLP for dense features + dot product interaction layer + top MLP. Designed for trillion-parameter embedding tables via model parallelism across GPUs.
Online learning (FTRL): CTR patterns shift hourly. Incremental training on last 1h data using FTRL optimiser. No full retraining needed — model updates as new data arrives.
Calibration layer: Platt scaling or isotonic regression on top of model scores to ensure predicted probabilities match actual CTR rates. Applied post-training, recalibrated daily.
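
As a sketch of the calibration layer, isotonic regression can be fitted with the pool-adjacent-violators algorithm in a few lines. The function names and toy data are assumptions; a production system would more likely use a library implementation and far more data.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: returns a step function as (score_upper_bound, value) pairs."""
    blocks = []   # each block: [sum_of_labels, count, max_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the new one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, max_s = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = max_s
    return [(b[2], b[0] / b[1]) for b in blocks]

def isotonic_predict(model, score):
    """Calibrated value for a raw score: the first block whose upper bound covers it."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]

raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
clicks = [0, 0, 1, 0, 1, 1]
calibrator = isotonic_fit(raw_scores, clicks)
calibrated = isotonic_predict(calibrator, 0.35)
```

The fitted map is a non-decreasing step function, so it can correct non-linear miscalibration without changing the ranking order of ads.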

Wide & Deep architecture detail: Wide = single-layer logistic regression on cross-product features (user_category × ad_category combinations seen in training). Deep = embedding lookup for all categorical features → concatenate → 3 ReLU layers (1024 → 512 → 256) → sigmoid output. Joint training: loss = cross-entropy on click label. Wide handles memorisation of known patterns; Deep handles generalisation to new patterns.
DLRM architecture detail: separate embedding tables for each categorical feature (user_id, ad_id, advertiser_id, page_category) — each table independently partitioned across GPUs (model parallelism). Dense features (age, bid price) through bottom MLP. Dot product of all embedding pairs creates interaction features. Top MLP for final prediction. Enables trillion-parameter models that wouldn't fit on a single GPU.
FTRL online learning: FTRL (Follow The Regularised Leader) with L1 and L2 regularisation. L1 sparsity is critical: billions of features, most irrelevant. FTRL naturally produces sparse models (most weights zero). Per-feature adaptive learning rates handle the varying frequency of features (common features learn slower; rare features update more aggressively). Hourly mini-batch: 5M impressions/second × 3,600 s = 18B impressions/hour, so a ~100M-example hourly batch implies downsampling to under 1% of impressions.
Calibration necessity and method: raw sigmoid outputs from Wide & Deep are not guaranteed to be calibrated probabilities. Plot predicted score deciles vs actual CTR; the points should lie on the diagonal. If model predicts 2% CTR but actual is 0.5%, recalibrate. Platt scaling: fit logistic regression on (model_score, actual_click) pairs. Isotonic regression for non-linear corrections that Platt's sigmoid cannot fit (the fitted map is still monotonic). Recalibrate daily on last 24h impressions.
Model size and serving: embedding tables dominate model size. User_id: 1B users × 64-dim × 4 bytes = 256GB (sharded across servers). Ad_id: 100M ads × 64-dim = 25GB. MLP weights: ~100MB. Total serving: distributed embedding lookup via RPC + local MLP inference. Embedding lookup latency: ~0.5ms. MLP inference: ~0.1ms. Total: ~0.6ms per prediction — meets the 1ms budget.
Catastrophic forgetting in online learning: FTRL with too-high learning rate overwrites knowledge of long-term CTR patterns with recent noise. Mitigation: Elastic Weight Consolidation (EWC) — regularise recent model weights toward base model weights. Or: mixture training — 70% recent data + 30% historical sample in each FTRL batch. Prevents performance collapse on established ad patterns.
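
The per-coordinate FTRL-Proximal update behind the online learning described above can be sketched as follows. This is a minimal single-machine version for binary (present or absent) features; the class name, hyperparameter values and toy impressions are assumptions for illustration.

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Logistic regression trained with per-coordinate FTRL-Proximal on binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)   # accumulated adjusted gradients
        self.n = defaultdict(float)   # accumulated squared gradients

    def weight(self, f):
        z = self.z.get(f, 0.0)
        if abs(z) <= self.l1:         # L1 keeps weak features at exactly zero
            return 0.0
        n = self.n.get(f, 0.0)
        return -(z - math.copysign(self.l1, z)) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        s = sum(self.weight(f) for f in features)
        s = max(min(s, 35.0), -35.0)  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, features, click):
        p = self.predict(features)
        g = p - click                 # log-loss gradient for a feature with value 1
        for f in features:
            w, n = self.weight(f), self.n[f]
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[f] += g - sigma * w
            self.n[f] = n + g * g
        return p

model = FTRLProximal()
for _ in range(200):
    model.update(["ad:shoes", "page:fitness"], 1)   # clicked
    model.update(["ad:shoes", "page:finance"], 0)   # not clicked

p_fitness = model.predict(["ad:shoes", "page:fitness"])
p_finance = model.predict(["ad:shoes", "page:finance"])
```

In this toy stream the feature ad:shoes appears equally in clicked and unclicked impressions, so L1 leaves its weight at zero while the page features carry the signal. The per-feature term sqrt(n) is what gives frequently seen features a smaller effective learning rate.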

✦ Senior ML Engineer

"Online learning with FTRL is the production depth question. A daily-retrained model misses the Monday morning CTR pattern shift. The major platforms all run continuous training — hourly or finer — with FTRL. Mentioning FTRL specifically (not just 'online learning') shows you know the algorithm. The L1 sparsity it produces is critical at billion-feature scale — without it, the model becomes dense and unserveable."

✦ Senior ML Engineer

"Catastrophic forgetting in online learning is the failure mode that catches teams off guard. High learning rate + recency bias causes the model to forget 6-month patterns in 48 hours of noisy data. EWC or mixture training (recent + historical) is the fix. This shows you understand that online learning is not just 'train more often' — it requires explicit mechanisms to prevent knowledge loss."

Offline discrimination: AUC (primary for ranking quality). NE (Normalised Entropy) — measures calibration quality relative to baseline of predicting global mean CTR.
Offline calibration: plot predicted CTR vs actual CTR in 10 score buckets — should be on the diagonal. Calibration error = mean absolute deviation from diagonal.
Online: RPM (Revenue Per Mille impressions — business metric), CTR, advertiser ROI (ROAS — Return on Ad Spend), auction win rate.
A/B design: impression-level randomisation (not user-level — one user sees many ads per session). Statistical significance on RPM (high-variance metric requires large sample).
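
Normalised Entropy as listed above is the model's log-loss divided by the log-loss of always predicting the average CTR. A minimal sketch, with hypothetical function names and toy data:

```python
import math

def log_loss(preds, labels, eps=1e-12):
    """Mean binary cross-entropy."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def normalised_entropy(preds, labels):
    """log-loss(model) / log-loss(baseline predicting the empirical CTR)."""
    base_ctr = sum(labels) / len(labels)
    return log_loss(preds, labels) / log_loss([base_ctr] * len(labels), labels)

labels = [1, 0, 0, 0]
ne_baseline = normalised_entropy([0.25, 0.25, 0.25, 0.25], labels)
ne_model = normalised_entropy([0.70, 0.10, 0.10, 0.10], labels)
```

The labels must contain both clicks and non-clicks, otherwise the baseline log-loss is zero and the ratio is undefined.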

NE (Normalised Entropy) metric: NE = log-loss(model) / log-loss(baseline), where baseline = always predicting global average CTR. NE < 1 means the model is better than baseline. NE = 0.95 means 5% better than predicting the mean. NE is calibration-sensitive (unlike AUC) and is used by Meta's ads team as the primary offline metric. Target: NE < 0.85 (15% improvement over baseline).
Calibration plot analysis: split all predictions into 10 deciles. For each decile: compare mean predicted score vs actual CTR. Perfect calibration: all points on y = x line. Common failure: model over-predicts CTR at high scores (calibration curve flattens at top) — indicates model doesn't separate high-CTR ads well. Fix: isotonic regression recalibration on validation set.
RPM as the north star: Revenue Per Mille = (total revenue / total impressions) × 1000. A CTR model improvement that increases CTR but decreases bid competition (advertisers leave due to high prices) can decrease RPM. RPM captures the auction equilibrium effect that CTR alone misses. Track daily RPM vs 30-day rolling baseline.
A/B at impression level: for CTR prediction, treatment vs control must be at impression level (or higher) — not click level. One user sees multiple ads per session; randomising at user level means they might see worse-ranked ads for the entire session. Impression-level randomisation: each impression independently assigned to treatment or control bucket. Allows faster experiment turnaround with same statistical power.
Advertiser fairness metric: check that CTR prediction accuracy is consistent across advertiser size (large vs small advertisers). Large advertisers have more data → model learns their patterns better. Small advertiser CTR under-prediction leads to under-charging (revenue loss) or blocked auctions. Monitor per-advertiser-size AUC and calibration error. Target: calibration error ≤ 0.5% for all advertiser tiers.
Budget pacing guardrail: advertisers set daily budgets. If CTR is over-predicted, they win more auctions and exhaust budget early (no impressions in the afternoon). If under-predicted, they lose auctions and underspend (wasted budget capacity). Miscalibration of more than 5% is enough to affect budget pacing significantly. Track budget utilisation distribution as a guardrail metric.
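
The calibration plot and calibration error described above reduce to a small table. This sketch uses equal-count buckets and assumes at least as many predictions as buckets; the function names and toy data are hypothetical.

```python
def calibration_table(preds, labels, n_buckets=10):
    """Equal-count buckets by predicted score: (mean predicted, actual CTR) per bucket."""
    pairs = sorted(zip(preds, labels), key=lambda t: t[0])   # sort on score only
    size = len(pairs) // n_buckets
    rows = []
    for b in range(n_buckets):
        last = b == n_buckets - 1
        chunk = pairs[b * size:] if last else pairs[b * size:(b + 1) * size]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        actual = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, actual))
    return rows

def calibration_error(rows):
    """Mean absolute deviation of the bucket points from the y = x diagonal."""
    return sum(abs(p - a) for p, a in rows) / len(rows)

labels = [1, 0, 0, 0, 0] * 2                   # actual CTR of 20%
well_calibrated = calibration_error(calibration_table([0.2] * 10, labels, n_buckets=2))
over_predicting = calibration_error(calibration_table([0.4] * 10, labels, n_buckets=2))
```

Running the same table per advertiser tier gives the per-tier calibration error used for the fairness check.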

✦ Senior ML Engineer

"NE (Normalised Entropy) over AUC alone is the ad ML litmus test. AUC measures discrimination — can you rank ads correctly? NE measures calibration quality relative to baseline. For an auction system, both matter: AUC determines ad order, NE determines whether the prices are correct. Facebook's ads team introduced NE as the standard metric precisely because AUC was insufficient for measuring calibration quality. Name it, define it, explain why it matters for pricing."

✦ Senior ML Engineer

"RPM as the north star metric captures what AUC and NE miss: the auction equilibrium effect. A CTR model that improves AUC by 2% but increases prices above market clearing (advertisers leave) can actually decrease RPM. The auction is a dynamic system — model quality affects bidder behaviour, which affects revenue. Candidates who cite only AUC/NE as their success metric haven't thought about what the model's output is actually used for."

Serving: model must be < 1ms CPU inference. Quantised ONNX model. Hash-based feature lookup (no joins). Embedding tables sharded across GPU memory.
Feature freshness: user CTR history updated every 10 minutes from Kafka → Redis. Ad performance stats every 1 minute (critical for detecting bad creatives quickly).
Online learning (FTRL): hourly mini-batch retraining on last 1h impressions. Catastrophic forgetting prevention: 70% recent + 30% historical in each batch.
Monitoring: CTR by ad category (sudden drop = model or creative issue), RPM trend daily, model score distribution (detect distribution shift immediately).

Serving architecture at 5M QPS: request arrives → feature lookup microservice (Redis for user features, ad feature store for ad features, real-time context inline) → embedding table shard lookups (distributed, ~0.5ms) → MLP inference (local, ~0.1ms) → calibration layer (~0.05ms) → return pCTR. Total: ~0.65ms, within 1ms budget. 5M QPS requires ~5,000 serving pods with this latency.
Embedding table serving: user ID embeddings: 1B entries, sharded across 16 GPU servers (16GB each). Ad ID embeddings: 100M entries, sharded across 4 GPU servers. Lookup is a distributed key-value read: send user_id to its shard server, receive 64-dim embedding vector. Batch lookup: collect all ad IDs in the request, send to respective shards in parallel, merge results.
Feature freshness SLAs: user interest vector: 10min staleness tolerable (user interests are relatively stable within a session). Ad creative CTR: 1min (bad creatives need immediate detection and pausing). Position bias features: 1hr (stable intraday). Real-time context (page topic, time, device): inline at request time (no staleness).
FTRL hourly pipeline: Kafka consumer aggregates impressions in 1h tumbling window → batch write to training storage → FTRL optimiser runs for 30min on 1h data → model checkpoint saved → A/B validation on held-out 10% of traffic → deploy to shadow fleet → full deploy. Total pipeline latency: ~2 hours from impression to production model update.
Failure mode — model score collapse: FTRL with aggressive learning rate can drive all prediction scores toward 0.5 (maximum entropy collapse). Monitor: score variance (target σ > 0.08). If variance drops below threshold, rollback to previous checkpoint and halve the learning rate. Early detection via hourly score distribution check.
Failure mode — creative spamming: a fraudulent advertiser uploads many similar creatives, each with artificially inflated historical CTR (click farms). Detection: CTR velocity per advertiser (sudden spike in new creative CTR within 1h). Flag for manual review. Feature: ad age × CTR ratio — new ads with high CTR are a signal. This feeds into the fraud detection pipeline for the ad marketplace.
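
The hourly score distribution check for the score collapse failure mode can be sketched as a single function. The function name, return shape and sample scores are assumptions; the σ > 0.08 floor and the rollback-and-halve response come from the text.

```python
import statistics

def check_score_collapse(scores, learning_rate, min_sigma=0.08):
    """Decide whether to roll back based on the spread of recent prediction scores."""
    sigma = statistics.pstdev(scores)
    if sigma < min_sigma:
        # Collapse suspected: roll back and retry with half the learning rate.
        return {"sigma": sigma, "action": "rollback", "learning_rate": learning_rate / 2}
    return {"sigma": sigma, "action": "keep", "learning_rate": learning_rate}

healthy = check_score_collapse([0.01, 0.02, 0.05, 0.30, 0.60, 0.90], learning_rate=0.1)
collapsed = check_score_collapse([0.49, 0.50, 0.51, 0.50], learning_rate=0.1)
```

The check uses the standard deviation of scores, which is what the σ threshold refers to.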

✦ Senior ML Engineer

"2-hour pipeline latency (impression → production model) is the operational reality of continuous training. The data freshness you get from FTRL isn't real-time — it's 2 hours from impression to model update. This matters for new ad creatives (a new creative's CTR is unknown for 2 hours), for trending topics (an emerging trend takes 2 hours to surface), and for fraud (a click farm can do damage in the 2-hour window before the model updates). Name the latency, name what it affects."

✦ Senior ML Engineer

"Model score collapse — all predictions converging toward 0.5 — is the FTRL failure mode that's almost never mentioned in interviews. It happens when the learning rate is too high or the regularisation is too weak, causing the model to maximise entropy (be uncertain about everything) rather than maximise prediction accuracy. Score variance monitoring (σ > 0.08) catches it early. This is a production horror story from every major ad platform at least once."

Design the content moderation system for a video platform with 500 hours of video uploaded per minute.

  Video upload (500 hours/minute)
         │
         ▼
  ┌─ AUTOMATED PRE-SCREENING (< 1 min) ───────────────────┐
  │  Hash matching: known CSAM (PhotoDNA), spam hashes    │
  │  Audio: speech-to-text → text classifier              │
  │  Vision: frame sampling → image classifier            │
  │  Metadata: title, description, tags → text classifier │
  └───────────────────────────────────────────────────────┘
         │
         ├──▶ High confidence violation (score > 0.99) → Remove
         │
         ├──▶ Medium confidence (0.80–0.99) → Suppress + review
         │
         └──▶ Low signal (< 0.80) → Publish + background analysis
                                         │
                                         ▼
  ┌─ DEEP ANALYSIS (async, minutes) ──────────────────────┐
  │  Multi-modal: video + audio + transcript + metadata   │
  │  Fine-grained harm classification (14 categories)     │
  │  Context: satire? news? education?                    │
  └───────────────────────────────────────────────────────┘
         │
         ▼
  ┌─ HUMAN REVIEW QUEUE ──────────────────────────────────┐
  │  Priority: score × severity × creator tier            │
  │  Reviewer: video + model scores + context             │
  │  Decision feeds back as training signal               │
  └───────────────────────────────────────────────────────┘

Scale frame: 500 hours/minute = 720,000 hours of video per day. 100% human review is impossible — the system must be a hybrid: automation removes clear violations, humans handle edge cases.
Clarify harm categories: CSAM, terrorism/extremism, graphic violence, hate speech, spam, misinformation, copyright — each has a different cost of FP (wrongly removed legitimate content) vs FN (missed harmful content).
Clarify pre-publish vs post-publish decision boundary: CSAM and known terrorism warrant pre-publish blocking (severity too high to allow any exposure). Most categories: post-publish removal with appeals (lower FP cost).
Propose three-stage pipeline: hash matching (deterministic, near-zero latency) → per-modality fast classifiers (text, audio, video frames) → multi-modal deep model (async, for uncertain cases) → human review queue.
State that the system optimises for different precision/recall operating points per harm category — CSAM at 99.9% precision, hate speech at 85% precision (contextually complex).

Pre-publish vs post-publish design: pre-publish blocking holds the video and blocks it before any user sees it. Only appropriate for the highest-severity harms (CSAM, known terrorism content) where even momentary exposure is unacceptable. For everything else, post-publish removal is preferred: it keeps legitimate content available while review proceeds, because for these categories the false positive cost of holding a legitimate video for hours outweighs the false negative cost of a violation being briefly visible.
720,000 hours/day processing budget: automated pre-screening must complete in < 1 minute per video (to minimise time-to-exposure for violations). For a 10-minute video: hash matching 0ms + metadata classification 2s + audio transcription + text classification 30s + frame sampling + image classification 20s = < 1 minute total. Parallelised across services.
14 harm categories, separate classifiers: CSAM, terrorism, graphic violence, hate speech (race, gender, religion, sexual orientation), harassment, spam, misinformation (health, electoral), copyright infringement, dangerous acts, nudity (adult content), extremist propaganda, child safety (non-CSAM), privacy violations. Each category has different human reviewer SLA, legal liability, and precision/recall targets.
Hash matching architecture: PhotoDNA for CSAM (perceptual hash, handles minor edits such as resize and colour shift). Exact MD5 hashes for known spam content. Lookup time < 1ms via hash table. Near-100% precision (exact hashes do not false-match in practice; perceptual matches rarely do), high recall for known content. This handles the clearest cases before any ML is involved.
Cascade decision points: score > 0.99 → remove immediately (high-confidence automation). Score 0.80–0.99 → suppress (not shown to users) + queue for human review within 24h. Score < 0.80 → publish + async analysis. This three-tier decision reduces human review load by 60% while maintaining quality on borderline cases.
Appeals process: creator appeals removal → new reviewer re-examines with context. If appeal overturned, original removal decision is weighted negatively in reviewer quality scoring. Creates accountability on both sides — creators don't spam appeals, reviewers don't make careless removal decisions.
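
The cascade decision points and the review queue priority (score × severity × creator tier) can be sketched together. The 0.99 and 0.80 thresholds come from the text; the severity weights, creator tier weights and function names are hypothetical.

```python
# Hypothetical severity weights per harm category (not from the source).
SEVERITY = {"csam": 10.0, "terrorism": 9.0, "hate_speech": 5.0, "spam": 1.0}

def route(score: float, remove_at: float = 0.99, review_at: float = 0.80) -> str:
    """Three-tier cascade: remove, suppress and review, or publish with async analysis."""
    if score > remove_at:
        return "remove"
    if score >= review_at:
        return "suppress_and_review"
    return "publish_and_analyse"

def review_priority(score: float, category: str, creator_tier_weight: float = 1.0) -> float:
    """Queue priority = model score × category severity × creator tier weight."""
    return score * SEVERITY[category] * creator_tier_weight

uploads = [("v1", 0.85, "spam"), ("v2", 0.82, "hate_speech"), ("v3", 0.40, "spam")]
queue = sorted(
    (u for u in uploads if route(u[1]) == "suppress_and_review"),
    key=lambda u: review_priority(u[1], u[2]),
    reverse=True,
)
```

In practice the two thresholds would be set per harm category rather than shared, in line with the category-specific operating points discussed above.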

✦ Senior ML Engineer

"Pre-publish vs post-publish is the question that separates senior candidates — and most people ignore it entirely. Pre-publish blocking at high confidence is the right answer for the most severe harms. For ambiguous content, post-publish removal with appeals is better — the false positive cost of blocking legitimate content is underweighted by most candidates who focus only on catching bad content. The asymmetric FP cost by harm category is the key insight."

✦ Senior ML Engineer

"Different precision/recall operating points per harm category is what shows you understand this domain. CSAM: 99.9% precision required (false positive = legitimate family video blocked, legal liability). Hate speech: 85% precision acceptable (false positive = legitimate political commentary removed, free speech concern). The classifier is not one model with one threshold — it's a system with category-specific thresholds tuned to legal and policy requirements."

Multi-modal processing: audio (speech-to-text → text classifier), video frames (sample 1 frame/second → image classifier), metadata (title, description, tags → text classifier). Each modality contributes independent signal.
Training data: human reviewer labels across 14 harm categories. Inter-annotator agreement (IAA) critical — hate speech has κ ≈ 0.6–0.7 (ambiguous), CSAM has κ ≈ 1.0 (unambiguous).
Class imbalance: violations are rare (< 1% of uploads for most categories). Mitigations: focal loss, oversampling violations, or cost-sensitive learning with a high FN weight.
Cross-modal context: a spoken hateful statement over an innocuous image is only detectable with cross-modal understanding. Per-modality classifiers miss this — multi-modal fusion is required.
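
To make the focal loss option concrete, here is a minimal per-example sketch in plain Python. The function name and the default values of `gamma` and `alpha` are assumptions for illustration (they are commonly used starting points, not values from this design) and would be tuned per harm category.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one example.

    p: predicted probability of the positive (violation) class.
    y: true label, 1 for violation, 0 for benign.
    gamma: down-weights easy examples; gamma = 0 gives weighted cross-entropy.
    alpha: weight on the positive class (1 - alpha on the negative class).
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)        # avoid log(0)
    p_t = p if y == 1 else 1.0 - p         # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

A confidently correct benign example (p = 0.01, y = 0) contributes almost nothing, while a missed violation (p = 0.01, y = 1) dominates the loss, which is the behaviour wanted when violations are under 1% of uploads.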

Audio processing pipeline: speech-to-text via Whisper (multilingual, 100+ languages). Transcript → fine-tuned multilingual BERT for harm classification. Audio non-speech signals: tone of voice analysis (aggression detection), background sounds (gunshots, screams). Speech-to-text runs at 10× real-time speed, so a 10-minute video is transcribed in 60 seconds.
Video frame sampling: 1 frame/second for initial screening (fast, catches static violations). Adaptive sampling: scenes with high optical flow (rapid motion) → 5 frames/second. Key frames at scene boundaries. 10-minute video = 600 frames at 1fps. Image classifier (CLIP-based): ~10ms per frame. Total: 6 seconds for 600 frames on GPU.
Metadata classification: title + description + tags → zero-shot or fine-tuned text classifier. Fast (< 100ms), cheap, high precision for obvious violations (titles explicitly describing harm). High false negative rate for misleading titles — metadata alone is insufficient, but excellent as a first-pass filter and confidence booster.
Inter-annotator agreement (IAA) by category: CSAM: κ = 1.0 (clear, objective). Graphic violence: κ = 0.85. Spam: κ = 0.80. Hate speech: κ = 0.60–0.70 (policy-dependent, subjective). Misinformation: κ = 0.50–0.65 (requires fact-checking expertise). Low IAA categories need: clearer annotation guidelines, expert reviewers, more conservative (lower recall) operating points.
Temporal context in video: hate speech clip embedded in a 2-hour documentary about historical atrocities is different from a standalone hate speech video. Model must understand video-level context, not just frame-level or clip-level. Temporal context window: classify clips in the context of the surrounding 5 minutes. This requires video-level representation, not just per-frame classification.
Active learning for reviewer queue: route content near the classification boundary (score 0.40–0.60) to human review for labelling. This content has the highest information value: reviewer decisions improve model calibration in ambiguous regions. Content with score > 0.95 or < 0.05 doesn't need review for labelling purposes, because the model is already confident (enforcement review of suppressed content is a separate queue). Active learning reduces labelling cost by 40% while improving boundary-region accuracy.
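
The uncertainty-band routing described above can be sketched as follows. This is an illustrative helper, assuming scores are already available per item id; `select_for_labelling` and the `budget` parameter are hypothetical names, and ordering by distance from 0.5 is one simple choice of uncertainty measure.

```python
def select_for_labelling(scores: dict[str, float], low: float = 0.40,
                         high: float = 0.60, budget: int = 100) -> list[str]:
    """Return up to `budget` item ids inside the uncertainty band, most uncertain first."""
    in_band = [
        (abs(score - 0.5), item_id)
        for item_id, score in scores.items()
        if low <= score <= high
    ]
    in_band.sort()  # smallest distance from 0.5 first
    return [item_id for _, item_id in in_band[:budget]]
```

Items scoring above 0.95 or below 0.05 never enter the list, so reviewer hours go to the cases where a label changes the model most.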

✦ Senior ML Engineer

"Multi-modal fusion is where the depth shows. Concatenating embeddings from text + audio + video ignores the temporal structure of video. The best systems process content jointly with attention across modalities — a spoken hateful statement over a seemingly innocuous image is only detectable with cross-modal understanding. Per-modality classifiers have fundamental blind spots. A sentence is fine. The same sentence spoken over certain images is a violation. That requires joint cross-modal attention."

✦ Senior ML Engineer

"Active learning for reviewer queue prioritisation is the production sophistication signal. Randomly routing content to review wastes reviewer capacity on easy cases where the model is already confident (score > 0.95). Routing borderline cases (score 0.40–0.60) to review maximises information value per reviewer-hour. This reduces labelling cost and improves model quality simultaneously — it's not just a nice-to-have, it's how you scale a review operation that's already understaffed."

Cascade architecture: hash matching (deterministic, ~0ms lookup) → per-modality fast classifiers (< 60s per 10-minute video) → multi-modal deep model (async, only for uncertain cases).
Text classifier: fine-tuned multilingual encoder (multilingual BERT or XLM-RoBERTa, 100+ languages). Video: CLIP-based frame encoder. Audio: Wav2Vec 2.0 for acoustic features, plus a text classifier on the speech-to-text transcript.
Multi-modal fusion: transformer over concatenated modality embeddings with cross-attention. Trained on human-reviewed examples.
Multi-task learning: one shared base model → 14 separate output heads. Shared representations benefit all harm categories. Better than 14 separate models (shared features, less labelling needed per category).
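
A minimal sketch of the cascade control flow, under several loud assumptions: SHA-256 exact matching stands in for the perceptual hashing a real system would use, the fast classifiers are passed in as plain callables, their scores are combined with `max`, and the "uncertain" band defaults to the 0.50–0.85 borderline range this section uses for second-pass analysis. All names here are hypothetical.

```python
import hashlib
from typing import Callable


def cascade(content: bytes,
            known_bad_hashes: set[str],
            fast_classifiers: list[Callable[[bytes], float]],
            uncertain_band: tuple[float, float] = (0.50, 0.85)) -> dict:
    """Run the cheap stages and say whether the deep model is needed."""
    # Stage 1: deterministic hash match (exact match only in this sketch).
    digest = hashlib.sha256(content).hexdigest()
    if digest in known_bad_hashes:
        return {"stage": "hash", "score": 1.0, "needs_deep_model": False}

    # Stage 2: fast per-modality classifiers; keep the most alarming score.
    score = max(clf(content) for clf in fast_classifiers)

    # Stage 3 is asynchronous: only flag the item for it when the score is uncertain.
    low, high = uncertain_band
    return {"stage": "fast", "score": score, "needs_deep_model": low <= score <= high}
```

The point of the ordering is cost: each stage is more expensive than the one before it, so most content never reaches the multi-modal model.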

Per-modality classifiers (fast path): text classifier: XLM-RoBERTa fine-tuned on harm categories, multilingual. Image classifier: CLIP ViT-L/14 with linear probe for harm detection. Audio: Wav2Vec 2.0 + fine-tuned classifier on harmful audio patterns (weapon sounds, threatening speech). Each runs independently and in parallel. Total fast-path latency: < 60s per 10-minute video.
Multi-modal fusion architecture: per-modality encoders produce embeddings (text: 768-dim, video frame: 512-dim, audio: 256-dim), each linearly projected to a shared width so they can be concatenated into one sequence. Temporal aggregation: mean pooling across video frames and audio segments. Cross-modal transformer: 6-layer, 8-head attention over the concatenated [text; video; audio] sequence. Output: 512-dim joint representation. 14 linear heads for harm categories.
Multi-task training rationale: hate speech detection benefits from shared representations with harassment detection (adjacent harm type). CSAM detection benefits from nudity detection representations. Violence detection shares visual features with graphic content detection. Multi-task reduces labelling requirement per category: 10,000 examples per harm type instead of 100,000 (because shared layers learn generalisable features).
Context-aware classification: video-level context model: sliding window of 5-minute clips with temporal position encoding. A problematic 30-second clip within a 2-hour documentary context scores differently than the same clip in isolation. Context model runs as a second-pass after initial classification, only for borderline cases (score 0.50–0.85).
Model update policy: retrain monthly (or after significant policy changes). Regression test suite: 10,000 labeled examples per harm category, including known hard cases. New model must match or improve on all categories before deployment. Phased rollout: 10% traffic for 48h → check precision/recall per category → full deploy.
Adversarial robustness: bad actors manipulate content to evade detection (text in images, audio encoding of harmful content, temporal interleaving). Regular adversarial testing: red team submits evasion attempts, successful evasions become training examples. Model update cadence must match adversarial adaptation cadence — monthly retraining is the minimum.
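
The "match or improve on all categories" deployment rule can be expressed as a simple gate over per-category metrics. This is a sketch, assuming metrics are already computed on the regression suite and stored as nested dicts; the function name and the `tolerance` parameter are hypothetical.

```python
def passes_regression_gate(baseline: dict[str, dict[str, float]],
                           candidate: dict[str, dict[str, float]],
                           tolerance: float = 0.0) -> tuple[bool, list[str]]:
    """Compare per-category metrics; return (ok, human-readable regressions)."""
    regressions = []
    for category, base_metrics in baseline.items():
        cand_metrics = candidate.get(category)
        if cand_metrics is None:
            regressions.append(f"{category}: missing from candidate")
            continue
        for metric, base_value in base_metrics.items():
            # A metric missing from the candidate counts as a regression.
            cand_value = cand_metrics.get(metric, float("-inf"))
            if cand_value < base_value - tolerance:
                regressions.append(
                    f"{category}.{metric}: {base_value:.3f} -> {cand_value:.3f}"
                )
    return (not regressions, regressions)
```

The same check can be rerun on live metrics after the 48-hour 10% rollout before promoting the model to full traffic.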

✦ Senior ML Engineer

"Multi-task over 14 harm categories is the architectural insight. A model trained on hate speech also learns useful representations for harassment — adjacent harm types share features. Shared representations reduce the labelling burden per category: 10,000 examples instead of 100,000. This is not just a training efficiency argument — it's also a quality argument. Categories with sparse training data (rare harm types) benefit from representation transfer from data-rich categories."

✦ Senior ML Engineer

"Adversarial robustness — red team testing with evasion attempts feeding back into training — is the operational sophistication that shows you understand content moderation is an adversarial game, not a static classification problem. Bad actors adapt. Your model must adapt faster. Monthly retraining with red team inputs is the minimum cadence. If your retraining cycle is slower than the adversary's adaptation cycle, you will always be behind."

Per-category precision and recall: precision matters most for pre-publish blocking (avoid wrongly blocking legitimate content). Recall matters most for the most severe harms (missing CSAM is catastrophic).
Platform-level metric: Violative Content Rate (VCR) — % of content views that contain policy-violating content. Industry benchmark: major platforms target < 0.1–0.2% VCR.
Appeal overturn rate: % of removals successfully appealed = proxy for false positive rate (a lower bound, since not every wrongful removal is appealed). Target: < 0.5% of removed content successfully appealed.
Human review SLA: CSAM: < 1 hour to review. Other categories: < 24 hours. Track queue depth and time-to-review as operational metrics.
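
Per-category operating points come down to choosing a threshold per category on labelled validation data. The sketch below picks the lowest threshold whose precision meets a target, which maximises recall subject to that precision constraint. The function name is hypothetical, and an item is treated as flagged when its score is greater than or equal to the threshold.

```python
from typing import Optional


def threshold_for_precision(scores: list[float], labels: list[int],
                            target_precision: float) -> Optional[float]:
    """Lowest threshold with precision >= target, or None if unreachable."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    best = None
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Consume every item tied at this score before measuring precision.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / (tp + fp) >= target_precision:
            best = current  # keep lowering the threshold while the target holds
    return best
```

Run once per harm category with that category's target, for example 0.999 for CSAM and 0.85 for hate speech, and store the resulting thresholds alongside the model version.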

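VCR measurement: sample 1,000 videos from the active feed hourly, weighted by views so the estimate matches the per-view definition → human reviewer team evaluates for violations → compute % violating. VCR is the platform health metric that regulators and press monitor. It also feeds periodic transparency reporting under the EU DSA (Digital Services Act); confirm the exact cadence and required fields against the regulation, since they vary by platform size. Target: VCR < 0.1% for CSAM, < 0.2% for violence, < 0.5% for hate speech.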
Precision-recall tradeoff per category: CSAM: target precision ≥ 99.9%, recall ≥ 99.0% (near-perfect required). Hate speech: target precision ≥ 85%, recall ≥ 70% (balance between free expression and harm). Spam: target precision ≥ 95%, recall ≥ 80% (high-volume, lower individual severity). Different operating points reflect different FP/FN cost ratios per category.
Inter-reviewer consistency (IAA) monitoring: each week, sample 200 reviewed cases → send to a second reviewer blind → compute Cohen's kappa. Kappa < 0.6 for a category signals an annotation quality issue and triggers: reviewer calibration session, policy clarification, updated examples in the annotation guidelines.
Reviewer quality scoring: each reviewer tracked on: appeal overturn rate (% of their removals successfully appealed), IAA with senior reviewers on calibration samples, throughput vs quality tradeoff. Reviewers with high overturn rate flagged for retraining. This creates a quality feedback loop beyond model metrics.
Counterfactual measurement: you can observe false positives (removed content that was appealed). You cannot directly observe false negatives (harmful content you missed). Proxy: sample 1,000 published videos weekly → expert team audit for violations → estimate FN rate from audit. The sampling strategy needs care: at < 1% prevalence, a pure random sample of 1,000 yields fewer than 10 violations, so stratify or weight the sample toward higher-risk content and reweight when estimating the overall rate.
Regulatory reporting metrics: EU DSA requires: monthly active user count, number of pieces of content removed per category, number of accounts suspended, time-to-removal from report, appeal overturn rate. Design the moderation system to generate these metrics as first-class outputs, not audit afterthoughts.
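
For the weekly IAA check, Cohen's kappa over the double-reviewed sample can be computed directly. A minimal sketch, assuming each reviewer's decisions arrive as equal-length lists of category labels; returning 1.0 when both reviewers use a single identical label throughout is a convention chosen here, since kappa is undefined in that case.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout (convention)
    return (observed - expected) / (1.0 - expected)
```

Computing this per harm category rather than over the pooled sample is what lets a drop in one category trigger a targeted calibration session.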

✦ Senior ML Engineer

"VCR (Violative Content Rate) as the platform-level metric is the content moderation equivalent of the north star. Individual model precision/recall are table stakes. VCR measures whether harmful content is actually reaching users — which is what regulators, press, and users care about. A model with 95% precision but a VCR of 2% is failing its mission. Connect your per-category metrics to VCR, and VCR to regulatory requirements (EU DSA). That closes the loop."

✦ Senior ML Engineer

"You cannot directly measure false negatives in content moderation — harmful content you miss is invisible to your metrics until a reporter finds it. The counterfactual measurement approach — weekly expert audit of random published content — is the production answer. It's expensive (requires expert reviewers), but it's the only way to estimate your FN rate. Most candidates talk only about metrics they can observe. The metrics you can't observe are the ones that matter most."
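
Because audits find only a handful of violations, the estimate needs an interval, not just a point value. The sketch below uses a Wilson score interval for the rate of missed violations among audited published videos; it assumes a simple random sample (a stratified audit would need reweighting first), and the function name is hypothetical.

```python
import math


def wilson_interval(violations: int, sample_size: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval (z = 1.96) for a proportion from a small count."""
    if sample_size <= 0 or not 0 <= violations <= sample_size:
        raise ValueError("need 0 <= violations <= sample_size and sample_size > 0")
    p = violations / sample_size
    denom = 1.0 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

For instance, 3 violations found in 1,000 audited videos is compatible with a true rate anywhere from roughly 0.1% to 0.9%, which is why a single weekly audit cannot confirm a 0.1% target on its own.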

Processing budget: 500 hours/minute = 8.3 hours of video/second. Fast-path (metadata + audio + frames): < 60s per video. Parallelised per modality. Deep multi-modal analysis: < 15 minutes for 10-minute video.
Reviewer tools: model scores + contributing features + channel history shown to reviewers. Reduces review time from 8 min to 3 min per item.
Model update cycle: monthly retrain + regression test suite. Red team adversarial inputs feed into training. Category-specific model versions with independent deployment.
Regulatory infrastructure: audit logs, removal reason codes, appeal tracking, VCR reporting. Built as first-class pipeline outputs, not ad-hoc exports.
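
The processing budget can be turned into rough capacity numbers with the figures already quoted in this section (500 hours/minute of uploads, 1 frame/second sampling, ~10ms per frame, speech-to-text at 10× real time). This is back-of-envelope arithmetic, assuming steady-state load, one frame per GPU at a time with no batching, and no headroom for peaks.

```python
UPLOAD_HOURS_PER_MINUTE = 500        # from the processing budget
FRAME_SAMPLE_FPS = 1                 # initial screening rate
SECONDS_PER_FRAME_INFERENCE = 0.010  # ~10ms per frame on GPU
STT_SPEEDUP = 10                     # speech-to-text at 10x real time


def capacity() -> dict[str, float]:
    """Steady-state resources needed to keep up with the upload rate."""
    video_hours_per_second = UPLOAD_HOURS_PER_MINUTE / 60
    video_seconds_per_second = video_hours_per_second * 3600
    frames_per_second = video_seconds_per_second * FRAME_SAMPLE_FPS
    return {
        "video_hours_per_second": video_hours_per_second,
        "video_seconds_per_second": video_seconds_per_second,
        # Each GPU handles 1 / 0.010 = 100 frames per second in this sketch.
        "image_gpus": frames_per_second * SECONDS_PER_FRAME_INFERENCE,
        # Each transcription worker clears 10 seconds of audio per second.
        "stt_workers": video_seconds_per_second / STT_SPEEDUP,
    }
```

Under these assumptions, about 30,000 seconds of video arrive every second, which needs roughly 300 GPUs for frame classification and 3,000 concurrent transcription workers before any adaptive 5 fps sampling or deep multi-modal analysis is added.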

Processing parallelism: for a 10-minute video, services run in parallel. Stream 1: extract metadata → text classifier (2s). Stream 2: extract audio → speech-to-text → text classifier (30s if the audio is chunked and transcribed in parallel; a single pass at 10× real-time takes about 60s). Stream 3: extract frames (1fps) → image classifier (20s). All three streams run simultaneously. Decision aggregator receives results as they complete. The final decision is made when all streams complete, or immediately if any stream exceeds the 0.99 threshold.
Reviewer tool design: reviewer sees: (1) video with timestamps flagged by model. (2) Harm category scores per modality (text: 0.85, image: 0.72, audio: 0.31). (3) SHAP-like explanation of what triggered each signal. (4) Creator history (previous violations, account age, follower count). (5) Similar previously reviewed cases with decisions. This context reduces review time from 8 minutes to 3 minutes while improving consistency.
Queue prioritisation: review queue sorted by: (1) harm severity (CSAM always first). (2) Model confidence × severity score. (3) Creator tier (new creator with 0 followers = lower priority than established creator with policy violation). (4) Time in queue (SLA compliance). CSAM queue never exceeds 30 minutes. Standard queue SLA: 24 hours.
Cross-lingual challenges: 500 hours/minute includes content in 100+ languages. Multilingual models (XLM-RoBERTa, Whisper) handle major languages. For low-resource languages (< 1M speakers): machine translation → English classifier. Translation introduces latency (add 30s) and errors (translation artifacts cause false positives). Dedicated reviewers for low-resource languages hired from local markets.
Policy update propagation: when moderation policy changes (e.g., new hate speech definitions), all previously borderline content must be re-evaluated. Batch re-processing job: run new model over all content flagged in the last 90 days. Typically 10M–50M items. Job runs over 72 hours on dedicated GPU cluster. New removals notify creators with updated policy explanation.
Reviewer mental health infrastructure: reviewers are exposed to traumatic content at volume. Required: maximum 4-hour shifts with mandatory breaks, mental health support resources, content filtering (reviewers specialise in specific harm types to reduce exposure breadth), rotation to less-harmful queues every 2 weeks. This is an operational requirement, not a nice-to-have — high reviewer turnover degrades model quality.
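
The parallel streams and the decision aggregator can be sketched with `asyncio`. This is an illustration only: each stream is passed in as a hypothetical zero-argument coroutine function returning a score, and `fake_stream` simulates latency with a sleep instead of calling a real classifier.

```python
import asyncio
from typing import Awaitable, Callable


async def aggregate(streams: dict[str, Callable[[], Awaitable[float]]],
                    early_exit: float = 0.99) -> dict:
    """Run all streams concurrently; stop early if any score exceeds the threshold."""
    tasks = {asyncio.create_task(fn()): name for name, fn in streams.items()}
    scores: dict[str, float] = {}
    pending = set(tasks)
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            scores[tasks[task]] = task.result()
        if any(score > early_exit for score in scores.values()):
            # A high-confidence violation was found: cancel the slower streams.
            for task in pending:
                task.cancel()
            await asyncio.gather(*pending, return_exceptions=True)
            return {"scores": scores, "triggered_threshold": True}
    return {"scores": scores, "triggered_threshold": False}


def fake_stream(score: float, delay: float) -> Callable[[], Awaitable[float]]:
    """Stand-in for a real modality pipeline (assumption for this sketch)."""
    async def run() -> float:
        await asyncio.sleep(delay)
        return score
    return run
```

A call such as `asyncio.run(aggregate({"metadata": fake_stream(0.995, 0.01), "audio": fake_stream(0.3, 0.5)}))` returns as soon as the metadata stream reports, without waiting for audio. Cancelling the remaining work is a design choice; a production system might let it finish to collect training signal.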

✦ Senior ML Engineer

"Reviewer mental health infrastructure is the answer that shows you've thought about the human system, not just the ML system. Content moderation is one of the few ML applications where the human-in-the-loop infrastructure directly affects model quality. High reviewer turnover means inconsistent labels, which degrades model training. Reviewer burnout is not a human resources problem — it's a data quality problem. Mentioning this shows you understand that the ML system doesn't end at the model boundary."

✦ Senior ML Engineer

"Policy update re-processing — running the new model over all borderline content from the last 90 days when policy changes — is the operational depth answer. Policy changes retroactively change what was a violation. Without batch re-processing, you have inconsistent enforcement: content uploaded before the policy change is treated differently from content uploaded after. This 72-hour batch job is the operational cost of every policy update, and it's invisible unless you've had to run one."