AI Engineering · LLMs & NLP · Updated May 2025
16 stages covering language foundations, transformer internals, retrieval systems, agents, and production reliability — with code, pitfalls, and the production depth that separates engineers who ship from those who prototype.
What this section covers
Most language-system failures begin before the model call: bad parsing, weak labels, missing metadata, or a corpus that does not represent the real task. Fix data before tuning models.
| Question | Answer |
|---|---|
| What is a corpus? | A collection of text for training, retrieval, evaluation, or grounding |
| How to avoid leakage? | Split by the unit that repeats in production (doc, user, conversation, time) |
| Why does metadata matter? | Enables filtering, slicing, access control, debugging, recency, and fair eval |
| When does TF-IDF beat embeddings? | Exact IDs, rare identifiers, domain jargon, small labeled datasets |
| Stemming vs lemmatization? | Stemming chops to crude roots; lemmatization maps to dictionary forms |
| What to store after normalization? | Raw text, normalized text, source metadata, offsets, normalizer version |
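The leakage rule in the table (split by the unit that repeats in production) can be sketched as a deterministic, hash-based assignment. This is a minimal illustration, not a prescribed method: `assign_split`, the `user_id` field, and the 20% eval fraction are assumptions made for the example.

```python
import hashlib

def assign_split(group_id: str, eval_fraction: float = 0.2) -> str:
    """Map a whole group (doc, user, conversation) to one split, deterministically."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"

# Hypothetical records: two rows share a user, so they must share a split.
records = [
    {"user_id": "u1", "text": "reset my password"},
    {"user_id": "u1", "text": "password reset link expired"},
    {"user_id": "u2", "text": "invoice is missing a line item"},
]
for record in records:
    record["split"] = assign_split(record["user_id"])
```

Because the split depends only on the group key, re-running the pipeline or adding new rows for an existing user never moves that user across the train/eval boundary.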
A strong model trained on a misrepresentative corpus will underperform a weaker model trained on a well-constructed one. The eval corpus is the most critical artifact: if it does not match production distribution (language, source, length, domain, recency), you are optimizing for a test that does not exist. Teams that discover this late lose months. The fix: instrument production traffic early, sample from real traces for eval construction, and maintain domain slices as the corpus grows.
On many classification and routing tasks, a TF-IDF + logistic regression baseline reaches roughly 85–90% of LLM quality at around 1% of the cost and latency; measure the gap on your own task rather than assuming it. If the team ships an LLM solution without proving the baseline fails, they own the infrastructure and cost of the heavier system with no documented justification. The baseline also creates a stable lower bound: if the LLM degrades in production, the baseline is the fallback. Always build it, test it, and keep it running alongside the primary system.
LLMs often wrap classical NLP tasks in natural language, but the product still needs schemas, labeled examples, and measurable outputs. A model cannot learn a task that humans cannot label consistently.
| Question | Answer |
|---|---|
| Precision vs recall? | Precision: the fraction of predicted positives that were correct. Recall: the fraction of actual positives that were found |
| What is F1? | Harmonic mean of precision and recall — useful when both false positives and negatives matter |
| What is NER? | Named entity recognition identifies spans: people, companies, products, dates, amounts, IDs |
| When to use rule-based NLP? | Stable patterns, low false-positive risk, deterministic behavior more valuable than generalization |
| What is active learning? | Prioritizes uncertain or high-value examples for human annotation to maximize label efficiency |
| How to handle class imbalance? | Stratified sampling, class-weighted losses, threshold tuning, targeted data collection, per-class metrics |
Low inter-annotator agreement is not an annotator problem — it is a label definition problem. If trained humans cannot agree on a label, the model will not learn a stable decision boundary either. Cohen's kappa below 0.7 on a classification task is a signal to pause labeling and rewrite the guidelines, not to add more data. The most efficient fix is always clearer guidelines with worked examples and explicit tie-breaking rules, not more annotation hours on an ambiguous task.
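Cohen's kappa compares observed agreement with the agreement two annotators would reach by chance given their label frequencies. The function below is a small stdlib sketch of that standard formula; the example labels are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

annotator_a = ["spam", "spam", "ham", "ham"]
annotator_b = ["spam", "ham", "ham", "ham"]
kappa = cohens_kappa(annotator_a, annotator_b)  # 0.5: below the 0.7 bar discussed above
```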
A model with 94% overall accuracy can have 40% recall on the customer segment that drives 60% of revenue, or 20% precision on the language that represents your fastest-growing market. Aggregate metrics hide these failures. The fix: maintain eval slices by domain, language, length, recency, label, and customer tier from the start. Build them into CI — a regression on a slice is a regression, even if aggregate numbers hold. Companies that discover slice failures late often have to rebuild data pipelines and retrain from scratch.
Dense embeddings are powerful, but sparse search remains the fastest way to catch names, IDs, error codes, versions, and rare terms. Hybrid retrieval combines the precision of lexical search with the recall of semantic search.
| Question | Answer |
|---|---|
| BM25 vs TF-IDF? | Both use term rarity; BM25 adds better term-frequency saturation and length normalization |
| When does BM25 beat embeddings? | Exact IDs, rare names, error codes, quotes, legal clauses — lexical precision matters |
| What is IDF? | Inverse document frequency: more weight to terms appearing in fewer documents |
| What is RRF? | Combines multiple ranked lists by giving credit based on rank position — simple and effective |
| Why use hybrid search? | Improves recall across semantic queries and exact lexical queries simultaneously |
| Where does reranking fit? | After initial retrieval and filtering, before final prompt assembly |
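Reciprocal rank fusion, as described in the table, can be sketched in a few lines: each list contributes `1 / (k + rank)` for every document it contains. The constant `k=60` is the value commonly used in practice; the document IDs and the tie-break by ID are illustrative assumptions.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_results = ["a", "b", "c"]   # hypothetical lexical ranking
dense_results = ["b", "d", "a"]  # hypothetical vector ranking
fused = reciprocal_rank_fusion([bm25_results, dense_results])
# fused == ["b", "a", "d", "c"]
```

Documents that appear in both lists ("a" and "b") rise above documents that only one retriever found, which is the behavior hybrid search relies on.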
When engineers say "retrieval is not finding the right document," the root cause is almost never the embedding model or similarity metric. It is usually: a metadata filter removing the right document before scoring, a permission filter applied at the wrong stage, an analyzer tokenizing a product ID differently from the query, or a field boost misconfiguration. The debugging process should start with retrieval traces (what was returned, what was filtered), not with embedding quality. Vector index tuning is the last thing to touch, not the first.
A hybrid retrieval system has at least five tunable dimensions: BM25 weight, vector weight, RRF k-parameter, candidate pool size, and reranker depth. Each change can improve one query type and hurt another. The only way to safely tune is to have a labeled retrieval eval set: 100–200 representative queries with expected documents or chunks, measuring recall@k, MRR, and NDCG by query type (exact ID lookups, semantic questions, multi-hop, rare terms). Without this, tuning is guesswork and regressions are invisible.
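A labeled retrieval eval set can be scored per query type with very little code. The sketch below computes recall@k and reciprocal rank (the per-query term of MRR); the field names `type`, `retrieved`, and `relevant` and the two sample queries are assumptions for illustration.

```python
from collections import defaultdict

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant document")
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

eval_set = [
    {"type": "exact_id", "retrieved": ["d1", "d2", "d3"], "relevant": {"d1"}},
    {"type": "semantic", "retrieved": ["d4", "d5", "d6"], "relevant": {"d6", "d9"}},
]

by_type = defaultdict(list)
for case in eval_set:
    by_type[case["type"]].append(
        (recall_at_k(case["retrieved"], case["relevant"], k=3),
         reciprocal_rank(case["retrieved"], case["relevant"]))
    )
report = {
    query_type: {
        "recall@3": sum(r for r, _ in rows) / len(rows),
        "mrr": sum(rr for _, rr in rows) / len(rows),
    }
    for query_type, rows in by_type.items()
}
```

Reporting by query type is the point: a weight change that helps semantic questions while hurting exact ID lookups shows up as two numbers moving in opposite directions rather than one flat average.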
Embeddings are similarity features, not truth. They need versioning, evaluation, and retrieval guardrails. Nearest neighbor is not the same as correct answer.
| Question | Answer |
|---|---|
| What is cosine similarity? | The cosine of the angle between two vectors: it compares embedding direction, ignores magnitude, and equals the dot product once vectors are L2-normalized |
| What is ANN search? | Approximate nearest-neighbor search trades perfect recall for much faster vector lookup |
| What is HNSW? | Graph-based ANN index — common for fast high-recall vector search in production |
| Why do metadata filters matter? | Enforce permissions, recency, source type, tenant boundaries, and domain constraints |
| API LLM vs open-source? | API: less ops burden, capability lead. Open-source: control, privacy, customization, cost at scale |
| How to evaluate embeddings? | Retrieval recall@k, MRR, clustering quality, duplicate-detection F1, human relevance labels |
Every chunk in the index was produced by a specific model at a specific version. When you migrate to a new embedding model, old and new vectors live in different semantic spaces — cosine similarity scores between them are meaningless. This means: (1) the entire index must be rebuilt before mixing old and new retrievals, (2) the migration must be atomic from the retrieval system's perspective, (3) rollback must be possible by pointing back to the old index. Systems that mix embedding versions — even during a rolling migration — produce retrieval scores that cannot be debugged because the numbers are no longer comparable.
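One way to make the atomic swap, rollback, and "never mix versions" rules enforceable is a small router that owns the active index pointer. This is a sketch under assumptions: `VectorIndex`, `IndexRouter`, and the version strings are hypothetical names, and a real system would point at actual index handles rather than names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorIndex:
    name: str
    embedding_version: str

class IndexRouter:
    """Holds the single active index and refuses cross-version queries."""

    def __init__(self, active: VectorIndex):
        self._active = active
        self._previous = None

    def swap(self, new_index: VectorIndex) -> None:
        # Call only after the new index is fully built with the new model.
        self._previous, self._active = self._active, new_index

    def rollback(self) -> None:
        if self._previous is None:
            raise RuntimeError("no previous index to roll back to")
        self._active, self._previous = self._previous, self._active

    def index_for_query(self, query_embedding_version: str) -> VectorIndex:
        if query_embedding_version != self._active.embedding_version:
            raise ValueError("query and index embeddings come from different models")
        return self._active

router = IndexRouter(VectorIndex("docs_v1", "embed-v1"))
router.swap(VectorIndex("docs_v2", "embed-v2"))
```

Failing loudly on a version mismatch turns a silent quality regression into an error that shows up in traces.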
HNSW has three main parameters: M (links per node, which sets graph density and memory use), ef_construction (index-build quality), and ef (search-time quality, often exposed as ef_search). Increasing ef improves recall@k at the cost of query latency. The optimal setting depends on your latency budget and the cost of missed documents. Plot recall@10 vs p99 latency across a range of ef values on your actual corpus: the curve is usually non-linear, with large recall gains at low ef values and diminishing returns past a threshold. This measurement, done once per corpus size order of magnitude, tells you where to set the parameter instead of guessing.
Most LLM bugs are easier to debug when you can separate model limitation, prompt limitation, retrieval limitation, and product-contract limitation. Understand the inference machine before designing systems around it.
| Question | Answer |
|---|---|
| Why does tokenization matter for reliability? | Controls what the model can see — wrong truncation means confident answers from incomplete evidence |
| What is prefill? | Phase where model processes input prompt and builds KV cache before emitting first token |
| Why do long RAG prompts hurt quality? | Distractors, conflicting snippets, instruction dilution — more context is not better context |
| When should temperature be near zero? | Classification, extraction, compliance, routing — anywhere reproducibility matters more than variation |
| What to log for token debugging? | Prompt tokens, completion tokens, model name, truncation strategy, selected docs, prompt template version |
Teams that complain about "slow LLM responses" almost always conflate two distinct latency components: prefill (time-to-first-token, driven by prompt length and batch state) and decode (time-per-output-token, driven by output length and generation strategy). The fix for high prefill latency is shortening the prompt or caching stable prompt prefixes so their KV cache is reused. The fix for high decode latency is constraining output length or using speculative decoding; batching raises throughput but does not make an individual request decode faster. Optimizing one without understanding which component dominates is guesswork, so always measure TTFT and total latency separately in your traces.
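Separating the two components only requires timestamps around a token stream. The sketch below works on any Python iterator of tokens; `fake_stream` is a stand-in that simulates prefill and decode delays with `time.sleep`, not a real client.

```python
import time

def measure_stream(token_stream):
    """Return time-to-first-token and per-token decode time for a token iterator."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1
    end = time.perf_counter()
    if first_token_at is None:
        return {"ttft_s": None, "decode_s_per_token": None, "total_s": end - start, "tokens": 0}
    decode_tokens = tokens - 1
    per_token = (end - first_token_at) / decode_tokens if decode_tokens else None
    return {
        "ttft_s": first_token_at - start,
        "decode_s_per_token": per_token,
        "total_s": end - start,
        "tokens": tokens,
    }

def fake_stream(prefill_s=0.05, per_token_s=0.01, n=5):
    """Simulated model stream: one prefill delay, then a delay per extra token."""
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"

stats = measure_stream(fake_stream())
```

Logging `ttft_s` and `decode_s_per_token` as separate trace fields shows whether a slow request needs a shorter prompt or a shorter output.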
JSON mode and constrained decoding solve the syntax problem: JSON mode yields parseable JSON, and schema-constrained decoding yields output that matches the schema. They do not solve the semantic problem: the values in the fields may be hallucinated, unsupported by evidence, or logically inconsistent. A system that treats a valid-JSON response as a correct response is missing a layer of validation. The correct production contract after structured output: (1) parse and schema-validate, (2) check factual claims against retrieved evidence, (3) apply domain logic (amounts are positive, dates are in range, referenced IDs exist), (4) escalate if any check fails. Steps 2–4 require deterministic application code, not more prompting.
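The four-step contract can be sketched as one function that never trusts valid JSON on its own. Everything specific here is an assumption for illustration: the `order_id` / `amount` / `quote` schema, the verbatim-quote evidence check (a crude proxy for claim verification), and the failure codes.

```python
import json

EXPECTED_FIELDS = {"order_id": str, "amount": (int, float), "quote": str}

def check_extraction(raw_output, evidence, known_order_ids):
    """Parse, schema-check, evidence-check, and domain-check a model response."""
    try:
        data = json.loads(raw_output)                      # step 1a: parse
    except json.JSONDecodeError:
        return {"status": "escalate", "failures": ["invalid_json"]}
    if not isinstance(data, dict):
        return {"status": "escalate", "failures": ["schema:not_an_object"]}

    failures = []
    for field, expected_type in EXPECTED_FIELDS.items():   # step 1b: schema
        value = data.get(field)
        if not isinstance(value, expected_type) or isinstance(value, bool):
            failures.append(f"schema:{field}")
    if failures:
        return {"status": "escalate", "failures": failures}

    if data["quote"] not in evidence:                      # step 2: evidence
        failures.append("unsupported_quote")
    if data["amount"] <= 0:                                # step 3: domain logic
        failures.append("amount_not_positive")
    if data["order_id"] not in known_order_ids:
        failures.append("unknown_order_id")

    # Step 4: any failed check escalates instead of passing through.
    return {"status": "escalate" if failures else "accept", "failures": failures}

evidence = "Order A-17 was refunded 42.50 on request."
good = check_extraction(
    '{"order_id": "A-17", "amount": 42.5, "quote": "refunded 42.50"}', evidence, {"A-17"}
)
bad = check_extraction(
    '{"order_id": "Z-99", "amount": -1, "quote": "refunded 99"}', evidence, {"A-17"}
)
```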
A prompt is not magic words — it is an executable specification for a probabilistic worker. The best prompt is short enough to audit and explicit enough to test. Treat every prompt edit like a code change.
| Question | Answer |
|---|---|
| What makes a prompt production-ready? | Versioned, tested against eval set, clear input boundaries, defined failure behavior, validated output shape |
| Why avoid giant prompts? | Harder to review, more expensive, slower, more fragile, more likely to contain conflicting instructions |
| Where should authorization live? | Deterministic application code before and after the model call — not in prompts |
| What is prompt injection? | Untrusted text attempts to alter the model instruction hierarchy by pretending to be higher-priority instructions |
Direct prompt injection (a user saying "ignore previous instructions") is well-known and relatively easy to defend against. Indirect prompt injection — where the malicious instruction is embedded in retrieved content, an email, a web page, or a document — is structurally harder because the application itself fetches and passes the poisoned content. The only reliable defenses are: (1) treat all retrieved and user-provided text as untrusted data, not as instructions; (2) enforce permissions and tool calls in deterministic application code, not in model output; (3) test the system with adversarial retrieved documents as a CI step. No amount of system prompt wording fully protects against a well-crafted indirect injection if the boundary is not enforced in code.
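Defense (2) amounts to checking every model-proposed tool call against a policy the model cannot edit. A minimal sketch, with a hypothetical role-to-tool allowlist and tool names:

```python
# Hypothetical policy: which tools each role may invoke.
ALLOWED_TOOLS = {
    "viewer": {"search_docs"},
    "admin": {"search_docs", "delete_doc"},
}

def authorize_tool_call(user_role: str, proposed_call: dict) -> bool:
    """Decide in application code whether a model-proposed call may run."""
    tool = proposed_call.get("tool")
    return tool in ALLOWED_TOOLS.get(user_role, set())

# A poisoned document convinced the model to propose a destructive call.
proposed = {"tool": "delete_doc", "args": {"doc_id": "policy-7"}}
allowed_for_viewer = authorize_tool_call("viewer", proposed)  # False
allowed_for_admin = authorize_tool_call("admin", proposed)    # True
```

The check depends only on the authenticated user's role, so no wording inside retrieved text can widen what the call is allowed to do.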
A prompt change that ships without a version tag, eval gate, and rollback path is a silent production migration. Teams that store prompts as hardcoded strings in Python files have no way to answer: "what prompt was active during the incident?" or "which prompt change caused the quality regression last Tuesday?" The fix: store prompts in versioned YAML under source control, load by tag at runtime from a registry, run evals on each new version before promotion, and link the deployment tag to traces. This is the same discipline as software deployment — prompt changes just look less like code because they are text.
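The registry idea can be sketched with an in-memory stand-in for the versioned files. `PromptVersion`, `PromptRegistry`, and the sample prompt are hypothetical; a real setup would load the templates from source control as the text describes.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str

    @property
    def content_hash(self) -> str:
        """Short hash to attach to traces alongside the version tag."""
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

class PromptRegistry:
    def __init__(self):
        self._prompts = {}

    def register(self, prompt: PromptVersion) -> None:
        key = (prompt.name, prompt.version)
        if key in self._prompts:
            raise ValueError("published prompt versions are immutable")
        self._prompts[key] = prompt

    def load(self, name: str, version: str) -> PromptVersion:
        return self._prompts[(name, version)]

registry = PromptRegistry()
registry.register(PromptVersion("triage", "v3", "Classify the ticket: {ticket}"))
active = registry.load("triage", "v3")
trace_fields = {"prompt_name": active.name, "prompt_version": active.version,
                "prompt_hash": active.content_hash}
```

With `trace_fields` written on every request, "what prompt was active during the incident?" becomes a log query.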
BERT, GPT, and T5 are not just brand names — their architecture and training objective shape what they are naturally good at. Architecture choice is a product decision.
| Question | Answer |
|---|---|
| BERT vs GPT? | BERT: encoder-only, bidirectional, strong for understanding tasks. GPT: decoder-only, autoregressive, strong for generation |
| What is causal masking? | Prevents a decoder from attending to future tokens while predicting the next token |
| What is contrastive learning? | Brings matching pairs closer and pushes non-matching pairs apart in representation space |
| What is instruction tuning? | Supervised training on instruction-response examples to teach a base model to follow user tasks |
| Why use residual connections? | Help gradients flow and let each block refine rather than replace the representation |
| What are logits? | Raw vocabulary scores before softmax converts them to probabilities |
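The logits-to-probabilities step from the last row can be shown directly. This is the standard softmax with a temperature divisor; the sample logits are arbitrary.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs_default = softmax([2.0, 1.0, 0.0])
probs_sharp = softmax([2.0, 1.0, 0.0], temperature=0.5)
```

At a lower temperature the top token takes more of the probability mass, which is why near-zero temperature suits tasks where reproducibility matters more than variation.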
Choosing a decoder-only LLM for a classification task is not just a capability choice — it is a latency and cost contract. A decoder generates tokens sequentially (O(output_length) steps per request); an encoder produces a fixed-size representation in one forward pass. For a binary classification task running at 10,000 requests/second, the difference can be 50ms vs 5ms P99 latency and 20× cost difference. The habit of defaulting to the latest chat model for every task is one of the most common and most expensive engineering mistakes in LLM application development. Always benchmark the smallest model that meets the quality bar on your actual task.
A fundamental misconception: instruction tuning teaches a model how to follow tasks, not what facts to know. If you fine-tune a model on instruction-response pairs about your internal knowledge base, the model will learn to produce responses in the right format and tone — but it will not reliably acquire the factual content from the training examples, especially for specific entities, numbers, and policies. LLMs memorize facts unreliably from training, and they cannot cite or verify what they memorized. The correct architecture: instruction-tune for behavior, format, and routing; use RAG for knowledge that must be current, private, or auditable. Conflating the two leads to hallucination-prone systems that are difficult to debug.
Fine-tuning is rarely the first fix for missing knowledge. It is strongest for behavior, style, format, routing, and domain-specific decision boundaries — not for injecting facts that change or need citation.
| Question | Answer |
|---|---|
| When should you fine-tune an LLM? | Consistent behavior, style, formatting, or task policy — when prompting has become brittle or expensive |
| Why is fine-tuning not ideal for facts? | Facts change, are memorized unreliably, and cannot be cited — retrieval is better for current or auditable knowledge |
| What is LoRA? | Freezes the base model and trains small low-rank matrices added to selected layers |
| What is DPO? | Trains from chosen/rejected pairs without a separate online RL loop |
| What makes preference data good? | Clear rubrics, calibrated annotators, difficult pairs, domain coverage, disagreement review |
| When to merge an adapter into base weights? | Adapter is stable, always used with that base, merge testing shows no quality regression or deployment issue |
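The LoRA row is easier to appreciate with arithmetic. For a frozen weight matrix of shape d × k, LoRA trains two low-rank factors, B (d × r) and A (r × k), whose product is added to the frozen weights. The dimensions and rank below are illustrative choices, not values from any particular model.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 4096   # illustrative projection size
rank = 8       # illustrative LoRA rank
full_params = d * k
lora_params = lora_trainable_params(d, k, rank)
reduction = full_params // lora_params
# full_params == 16_777_216, lora_params == 65_536, reduction == 256
```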
A single team that creates one LoRA adapter to solve a problem creates a manageable artifact. Fifteen teams each creating adapters for their own domain over 18 months create an operations nightmare: adapters that are no longer tested, adapters trained on deprecated base model versions, adapters whose datasets have drifted from the use case they were designed for, and adapters with no documented owner. The fix is to treat adapter lifecycle the same as software lifecycle: mandatory eval results at creation, an owner per adapter, an expiry date, a deprecation process, and a central registry with deployment gates. This is not bureaucracy; it is the minimum governance to keep a multi-team fine-tuning program from becoming a liability.
RLHF and DPO optimize for human preference signals, which often correlate with confidence, fluency, and response length rather than factual accuracy. A model that sounds more helpful may be less accurate, and this regression often does not show up in preference win-rate metrics because the annotators are not equipped to fact-check the responses. The result: the preference-tuned model scores better in human evals but hallucinates more in production. The fix: run factuality benchmarks (TruthfulQA, domain-specific QA sets, citation accuracy on your corpus) alongside preference evals before promoting any preference-tuned checkpoint to production. Win rate alone is not a sufficient deployment gate.
RAG quality comes from the data pipeline as much as the model call. A vector database is an index, not a source of truth — keep raw documents, parsed text, chunks, embeddings, metadata, and answers separate and versioned.
| Question | Answer |
|---|---|
| Mandatory metadata for RAG? | Document ID, chunk ID, source, content hash, embedding version, chunking version, timestamps, ACL fields |
| Why keep raw source text? | Re-parse, re-chunk, audit citations, compare versions, recover from bad indexing choices |
| When is reranking worth it? | Recall is acceptable but top-k precision is weak, or final prompt budget is tight |
| What is query expansion? | Rewriting or expanding a user query into related terms or subquestions to improve recall |
When you migrate to a new embedding model mid-production, old chunks (embedded with model v1) and new queries (encoded with model v2) exist in incompatible vector spaces. Cosine similarity scores between them are not meaningful — the retrieval appears to work because numbers are returned, but the results are semantically random. The only correct approach is to re-embed the entire index with the new model before serving any queries through it, with a clean atomic swap. A rolling migration where some chunks are on the old model and some are on the new model is not a valid intermediate state — it is a silent quality regression that is extremely hard to diagnose without explicit version tracking in your traces.
When a RAG system returns a bad answer, the debugging path requires answering: what query was used, what chunks were retrieved, what were the scores, which chunks were filtered by ACL, what was the final context passed to the model, and what was the model's actual input? Without chunk-level lineage — stable chunk IDs, source offsets, version metadata, and per-request retrieval traces — you cannot answer any of these questions. The common failure is to build the retrieval and generation pipeline first, then try to add observability after something breaks in production. The correct order: define the trace schema before writing the first retrieval call, and treat trace completeness as a launch requirement.
RAG is not complete when the model answers. It is complete when the product can explain why that answer should be trusted. The answer layer must prove it used the right evidence.
| Question | Answer |
|---|---|
| What is groundedness? | The degree to which generated claims are supported by the supplied evidence |
| How to reduce hallucination in RAG? | Improve retrieval, constrain answers to context, require citations, validate claims, escalate when evidence is missing |
| What should ship before more prompt engineering? | An eval set that measures the behavior you are trying to improve |
| What is a golden set? | A curated set of representative cases with expected behavior used as regression test for all changes |
A RAG system that does not have an explicit missing-evidence path will hallucinate. The model's default behavior when context is absent or insufficient is to generate a plausible-sounding answer from parametric knowledge — which violates the product promise of private-knowledge grounding and produces confident wrong answers. The missing-evidence path should be: detect that no retrieved chunk supports the answer (using confidence thresholds, evidence-matching, or explicit model instructions), return a structured "cannot answer" response, and optionally route to a human reviewer or alternate data source. This path must be tested with eval cases where the answer genuinely does not exist in the corpus — not as a nice-to-have, but as a launch requirement.
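A missing-evidence path can be as simple as refusing to call the model when no chunk clears a score threshold. In this sketch the threshold value, the chunk fields, and the `generate` callable are assumptions; score thresholds in particular must be calibrated on your own retriever.

```python
def answer_with_evidence(question, chunks, generate, min_score=0.5):
    """Answer only when retrieval produced supporting chunks; otherwise abstain."""
    supported = [c for c in chunks if c["score"] >= min_score]
    if not supported:
        # Structured refusal: no generation from parametric knowledge.
        return {"status": "cannot_answer", "answer": None,
                "citations": [], "route": "human_review"}
    return {"status": "answered",
            "answer": generate(question, supported),
            "citations": [c["chunk_id"] for c in supported],
            "route": None}

calls = []
def stub_generate(question, supported):
    """Stand-in for the model call; records that it was invoked."""
    calls.append(question)
    return "Refunds are processed within 5 days."

weak = [{"chunk_id": "c9", "score": 0.12}]
strong = [{"chunk_id": "c1", "score": 0.83}, {"chunk_id": "c9", "score": 0.12}]
abstained = answer_with_evidence("What is the refund window?", weak, stub_generate)
answered = answer_with_evidence("What is the refund window?", strong, stub_generate)
```

The eval cases the paragraph calls for map directly onto this shape: questions with no answer in the corpus should return `cannot_answer` and never reach the generator.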
LLM-as-judge systems have three systematic failure modes: they over-reward fluent and confident-sounding answers regardless of factual accuracy; they correlate with the same model family biases as the generator (a GPT-4 judge may prefer GPT-4 outputs); and they drift when the judge model is upgraded, making historical comparisons invalid. The fix: use LLM-judge as one signal in an ensemble alongside deterministic checks (schema validation, citation presence, term matching), retrieval metrics (recall@k), and periodic human audits. Calibrate the judge against human labels on your specific domain before using it as a primary eval metric, and version the judge model alongside the generator model in your eval infrastructure.
Retrieval quality is usually capped by document quality, not model quality. Clean lineage beats clever prompting. Move beyond nearest neighbors when the corpus has entities, procedures, and multi-hop relationships.
| Question | Answer |
|---|---|
| How to pick chunk size? | Start from the unit needed to answer common questions; tune with retrieval recall and answer faithfulness evals |
| Why include source offsets? | Cite exact spans, highlight evidence, deduplicate chunks, and debug parsing issues |
| What is parent-child retrieval? | Index small child chunks for precise search, return larger parent sections for enough answer context |
| When does Graph RAG help? | Questions needing relationships, multi-hop reasoning, entity disambiguation, or aggregation across connected facts |
| What is a tombstone? | A marker that a prior chunk or document version is no longer active, even if stored for audit |
| How to handle document updates? | Detect content changes, version the document, tombstone old chunks, re-parse, re-embed, validate on update-sensitive evals |
Teams spend significant effort tuning embedding models, rerankers, and LLMs, but the most impactful RAG parameters are usually chunk size and overlap. Too small: precise ranking but insufficient context per chunk to answer a question independently. Too large: good context per chunk but noisy ranking because chunks cover too many topics. The correct process: build a 100-question golden eval set from your actual corpus, measure recall@5 and answer faithfulness at chunk sizes of 256, 512, and 1024 tokens with 20% overlap, and choose the configuration that optimizes faithfulness (not just recall). This takes about one day of work and avoids weeks of chasing model and prompt improvements that cannot overcome a bad chunking strategy.
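A fixed-size chunker with proportional overlap is enough to run the comparison described above. This sketch operates on an already-tokenized list and records each chunk's start offset; the function name and the toy token list are illustrative.

```python
def chunk_tokens(tokens, size, overlap_ratio=0.2):
    """Split a token list into windows of `size` tokens with proportional overlap."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    step = max(1, size - int(size * overlap_ratio))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append({"start": start, "tokens": tokens[start:start + size]})
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = list(range(10))  # stand-in for real token IDs
chunks = chunk_tokens(tokens, size=5, overlap_ratio=0.2)
starts = [c["start"] for c in chunks]
# starts == [0, 4, 8]; consecutive windows share one token
```

Running the same golden set against `size` values of 256, 512, and 1024 then becomes a loop over this function plus re-indexing.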
Graph RAG is only as good as its entity canonicalization. A knowledge graph where "IBM," "I.B.M.," "International Business Machines," and "IBM Corp" are four separate nodes does not answer any multi-hop question correctly: every relationship query returns partial results because the graph is fragmented. Entity resolution (alias mapping, ID assignment, confidence-weighted merge, human review for high-value entities) is the most labor-intensive part of building a production knowledge graph, and it is almost always underestimated. The correct investment: use a combination of rule-based normalization for known entity types (company names, product IDs, person names), probabilistic linking for ambiguous cases, and a human review queue for low-confidence or high-value merges before they propagate through the graph.
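The rule-based layer of entity resolution can be sketched as normalization plus an alias table, with anything unresolved sent to review. The alias table, the `ORG:IBM` identifier format, and the queue are hypothetical.

```python
import re

# Hypothetical alias table mapping normalized surface forms to canonical IDs.
ALIASES = {
    "ibm": "ORG:IBM",
    "ibm corp": "ORG:IBM",
    "international business machines": "ORG:IBM",
}

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name).lower()
    return re.sub(r"\s+", " ", cleaned).strip()

def resolve(name: str, review_queue: list):
    """Return a canonical ID, or queue the name for review when unknown."""
    canonical = ALIASES.get(normalize(name))
    if canonical is None:
        review_queue.append(name)
    return canonical

queue = []
mentions = ["IBM", "I.B.M.", "International Business Machines", "IBM Corp", "Initech"]
resolved = [resolve(m, queue) for m in mentions]
```

The four IBM variants collapse to one node, while the unknown name is held back instead of silently creating a new one.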
The safest agent architecture keeps the model creative in planning and deterministic in permission, state, and execution. Agents are useful when the model must choose actions, not when a simple chain is enough.
| Question | Answer |
|---|---|
| When should you use an agent? | Tool selection, stateful planning, conditional branching, or multi-step recovery — not for simple deterministic chains |
| What is the biggest agent risk? | Unbounded action — limit tools, permissions, loop count, budget, and state transitions |
| Why model agents as graphs? | Makes transitions, retries, branches, and checkpoints explicit and testable |
| What should be observable in an agent? | Steps, tools, inputs, outputs, model versions, costs, errors, state transitions, and final decision rationale |
Agent loops retry failed tool calls. Without idempotency keys, every retry is a potential duplicate action — duplicate tickets, duplicate emails, duplicate charges, duplicate records. The idempotency key is a stable identifier for a specific intended action (usually request_id + tool_name + a hash of the intended arguments) that the downstream service uses to deduplicate. If the same key arrives twice, the service returns the same result without re-executing the action. This is the same pattern used in payment systems and it is equally critical in agent systems. Any tool call that modifies state — create, update, delete, send, charge — must support idempotency keys before it is exposed to an agent loop.
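The key construction described above (request ID, tool name, and a hash of the arguments) and the service-side deduplication can be sketched together. `TicketService` is a toy in-memory stand-in for a downstream system.

```python
import hashlib
import json

def idempotency_key(request_id: str, tool_name: str, arguments: dict) -> str:
    """Stable key for one intended action, independent of argument ordering."""
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    payload = f"{request_id}:{tool_name}:{canonical}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class TicketService:
    """Toy downstream service that deduplicates on the idempotency key."""

    def __init__(self):
        self._results = {}
        self.created = 0

    def create_ticket(self, key: str, title: str) -> dict:
        if key in self._results:
            return self._results[key]  # retry: return the earlier result
        self.created += 1
        result = {"ticket_id": f"T-{self.created}", "title": title}
        self._results[key] = result
        return result

service = TicketService()
args = {"title": "Printer offline", "priority": "low"}
key = idempotency_key("req-42", "create_ticket", args)
first = service.create_ticket(key, args["title"])
retry = service.create_ticket(key, args["title"])  # agent loop retried the call
```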
Most agent bugs occur in systems where the agent state is implicit — managed by a growing list of messages or a string summary passed to each model call. When the state is implicit, it is impossible to write tests for specific agent behaviors, impossible to reason about what transitions are allowed, and impossible to detect when the agent is stuck or looping. The fix: define the agent state as an explicit typed object (a Pydantic model or dataclass), define the valid transitions as a state machine graph (this can be drawn on a whiteboard before writing code), and implement the reducer that applies an action to a state. This makes the agent debuggable, testable, and explainable — properties that are impossible with implicit state.
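The typed state, transition graph, and reducer can fit in a few dozen lines. This sketch uses a frozen dataclass; the phase names, the transition table, and the step budget are illustrative choices rather than a fixed design.

```python
from dataclasses import dataclass, replace
from enum import Enum

class Phase(Enum):
    PLANNING = "planning"
    ACTING = "acting"
    DONE = "done"
    FAILED = "failed"

# The whiteboard graph: which phases may follow which.
ALLOWED = {
    Phase.PLANNING: {Phase.ACTING, Phase.FAILED},
    Phase.ACTING: {Phase.PLANNING, Phase.DONE, Phase.FAILED},
    Phase.DONE: set(),
    Phase.FAILED: set(),
}

@dataclass(frozen=True)
class AgentState:
    phase: Phase = Phase.PLANNING
    steps: int = 0
    max_steps: int = 5

def apply_transition(state: AgentState, next_phase: Phase) -> AgentState:
    """Reducer: validate the transition and enforce the loop budget."""
    if next_phase not in ALLOWED[state.phase]:
        raise ValueError(f"illegal transition {state.phase.value} -> {next_phase.value}")
    if state.steps >= state.max_steps and next_phase not in (Phase.DONE, Phase.FAILED):
        next_phase = Phase.FAILED  # budget exhausted: stop instead of looping
    return replace(state, phase=next_phase, steps=state.steps + 1)

state = AgentState(max_steps=2)
state = apply_transition(state, Phase.ACTING)
state = apply_transition(state, Phase.PLANNING)
state = apply_transition(state, Phase.ACTING)  # over budget, forced to FAILED
```

Because the reducer is a pure function, "the agent cannot act after DONE" or "the agent stops after N steps" become ordinary unit tests.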
The product value of an agent comes from controlled state transitions, not from letting the model think forever. Make autonomy bounded, inspectable, and recoverable before making it more capable.
| Question | Answer |
|---|---|
| Working vs long-term memory? | Working: temporary task state. Long-term: persists across sessions — stricter consent, retention, and audit rules |
| How to prevent memory poisoning? | Store provenance, validate writes, separate facts from instructions, require approval for sensitive durable memory |
| What is ReAct? | Pattern where the model alternates between reasoning about the task and acting through tools or observations |
| When is plan-and-execute better than ReAct? | Task benefits from a visible checklist, predictable phases, or human approval before execution |
| When to use multiple agents? | Separate roles, tools, permissions, or review responsibilities that improve quality enough to justify complexity |
| Main risk of multi-agent systems? | Unbounded cost and unclear accountability — more agents can amplify errors instead of correcting them |
Any system that stores long-term memories about users must answer three questions: (1) what exactly is being stored and why? (2) can the user see and delete their memories? (3) what happens to downstream decisions when a memory is deleted? These are not engineering nice-to-haves — they are product and compliance requirements in any jurisdiction with data privacy laws. Systems that store raw conversation history "just in case" without a retention policy, user visibility, or deletion mechanism create regulatory liability. The correct design: store only what is needed for the task to work better, with explicit purpose and expiry, user-visible records, and a deletion path that propagates to derived inferences.
The most common multi-agent failure is not that individual agents produce bad outputs; it is that they work from conflicting versions of the task context, and there is no deterministic mechanism to resolve the conflict. Agent A produces a draft based on the customer record as it existed at T=0. Agent B reviews that draft against the customer record at T=2, after a concurrent update, and accepts it. The result is a decision based on stale state that neither agent detected. The fix: pass a consistent snapshot of the shared state to all agents at the beginning of the task, use a single reducer that owns all state updates, and validate state consistency before any agent output is accepted. This is the distributed systems problem of consistency, applied to LLM agent coordination.
Every prompt change without evals is a silent production migration. Serious LLM products are evaluated systems, not prompt experiments. The first investment is always measurement, not optimization.
| Question | Answer |
|---|---|
| What is a golden set? | Curated cases with expected behavior — regression test for model, prompt, retrieval, and tool changes |
| What should you cache in RAG? | Embeddings, parsed docs, retrieval for public data, rerank scores, stable summaries — be careful with final answers |
| Minimum useful LLM trace? | Request ID, prompt version, model, token counts, retrieved IDs, tool calls, latency, cost, validation result, status |
| How to detect drift? | Track eval scores, retrieval distributions, refusal rates, tool failures, cost, latency, and user correction signals |
| Why use response_model in FastAPI? | Validates and documents the returned shape — reduces accidental leakage of internal or malformed fields |
Teams that start prompt engineering without an eval set are optimizing blindly — they can observe one or two examples getting better while ten other cases silently regress. The minimum viable eval set for a RAG system is 50–100 cases with expected sources, required terms in the answer, and prohibited terms. This takes one or two days to build. Without it, every prompt change is a guess and every regression is discovered by users in production. With it, prompt engineering becomes a disciplined engineering process where each change has a measurable, reproducible quality signal. The eval set is not a nice-to-have — it is the precondition for any meaningful improvement work.
Multi-tenant RAG systems that apply ACL filters after retrieval have a leakage problem: the ranking step has already been influenced by unauthorized content before the filter removes it, and the filtered top-k may contain too few authorized results. The correct architecture: filter by tenant_id and ACL before scoring begins, so the retrieval model never sees cross-tenant content at all. The second common mistake is caching: a final answer cached for User A at T=0 must never be returned to User B, even if User B sends the same question. Caching at the level of retrieved documents is safe only when the key includes the permission scope (query + tenant_id, plus ACL groups where they differ within a tenant); caching at the level of final answers requires a cache key that encodes the full permission context, which is usually not practical. When in doubt, cache only neutral intermediates (embeddings, parsed documents), not final answers.
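Filtering before scoring can be illustrated with a toy retriever. The term-overlap scoring below is only a placeholder for BM25 or vector search, and the chunk fields (`tenant_id`, `acl`) and the cache-key format are assumptions.

```python
def search(query_terms, chunks, tenant_id, user_groups, top_k=3):
    """Filter by tenant and ACL first, then score only what the caller may see."""
    visible = [c for c in chunks
               if c["tenant_id"] == tenant_id and c["acl"] & user_groups]
    scored = [(sum(term in c["text"].lower() for term in query_terms), c["chunk_id"])
              for c in visible]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for score, chunk_id in scored[:top_k] if score > 0]

def retrieval_cache_key(query, tenant_id, user_groups):
    """Cache key that carries the permission scope alongside the query."""
    return f"{tenant_id}|{','.join(sorted(user_groups))}|{query.strip().lower()}"

chunks = [
    {"chunk_id": "a1", "tenant_id": "A", "acl": {"support"}, "text": "Refund policy for tenant A"},
    {"chunk_id": "a2", "tenant_id": "A", "acl": {"finance"}, "text": "Refund ledger export"},
    {"chunk_id": "b1", "tenant_id": "B", "acl": {"support"}, "text": "Refund policy for tenant B"},
]
results = search(["refund"], chunks, tenant_id="A", user_groups={"support"})
# results == ["a1"]: the tenant B chunk and the finance-only chunk were never scored
```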
A reliable AI product is not a single model call. It is a governed service with budgets, permissions, fallbacks, and audit trails. Production LLM systems need predictable behavior under load, abuse, and regulation.
| Question | Answer |
|---|---|
| What is continuous batching? | Dynamically batches tokens from multiple requests so GPU utilization stays high while requests enter and leave |
| What is speculative decoding? | A smaller draft model proposes several tokens; a larger model verifies them in one pass. Latency improves when most drafted tokens are accepted |
| What is indirect prompt injection? | Malicious instruction hidden in retrieved content, web pages, or emails that tries to control the model through context |
| Where should PII redaction happen? | Before sending to models when possible, and again before logging or displaying outputs |
| What belongs in an AI system card? | Purpose, intended users, limitations, data sources, eval results, safety constraints, monitoring plan, owner |
| How to handle an AI incident? | Contain feature, preserve traces, identify affected users, roll back configs, patch eval coverage, communicate impact |
LLM provider outages happen. GPU memory exhaustion happens. Rate limits happen. A system that has no degraded mode converts every infrastructure event into a user-facing total failure. Degraded modes worth designing upfront: (1) cached answers for the most common queries; (2) a smaller, self-hosted model that handles a reduced capability set; (3) async processing that queues requests and notifies users when results are ready; (4) a graceful UI that communicates the degradation honestly. The degraded mode contract (what the system can and cannot do during degradation) must be defined before launch, tested in staging, and written into a runbook that on-call teams know. Designing it after launch, under pressure, is too late.
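One way to sketch the degraded modes listed above is an ordered fallback chain that reports which mode served the request, so the UI can be honest about it. Everything here is hypothetical: `ProviderError`, `answer_with_fallback`, and the stub callables are illustrative, and the cache is assumed to hold only permission-neutral answers to common questions.

```python
from typing import Callable


class ProviderError(Exception):
    """Assumed umbrella error for outages, rate limits, and OOM failures."""


def answer_with_fallback(
    query: str,
    primary: Callable[[str], str],
    small_model: Callable[[str], str],
    cache: dict,
) -> dict:
    # Normal path: the primary hosted model.
    try:
        return {"mode": "primary", "answer": primary(query)}
    except ProviderError:
        pass
    # Degraded mode 1: cached answer for a common query.
    cached = cache.get(query.strip().lower())
    if cached is not None:
        return {"mode": "cached", "answer": cached}
    # Degraded mode 2: smaller self-hosted model with reduced capability.
    try:
        return {"mode": "small_model", "answer": small_model(query)}
    except ProviderError:
        # Degraded mode 3: queue for async processing and notify later.
        return {"mode": "queued", "answer": None}


def failing(query: str) -> str:
    raise ProviderError("simulated outage")


def small(query: str) -> str:
    return "short answer"


common = {"what are your hours?": "9 to 5"}
r_cached = answer_with_fallback("What are your hours?", failing, small, common)
r_small = answer_with_fallback("Unseen question", failing, small, common)
r_queued = answer_with_fallback("Unseen question", failing, failing, common)
```

The `mode` field is the hook for the fourth item in the list: the UI reads it and tells the user when they are getting a cached, reduced, or delayed response.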
Six months after launch, your team will need to answer: what model was running on this day, what prompt version was active, what data was in the corpus, what eval results justified the last release decision, and who approved it? Without governance artifacts (model cards, prompt changelogs with eval results, risk classifications, approval history), these questions are unanswerable, and so is every regulatory or customer inquiry that follows an incident. The governance overhead is not zero, but it is a small fraction of the cost of a compliance audit or an incident where nobody can explain what the system was doing. Build governance into the release process from the start: retrofitting it to a running system is much harder and never as complete.
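The questions above map naturally onto one record per release. A minimal sketch, assuming an append-only list of records; the `ReleaseRecord` fields and the `active_release` helper are hypothetical, and the values are placeholders rather than real model or report names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ReleaseRecord:
    # Hypothetical fields, one per question an auditor is likely to ask.
    released_on: date
    model: str
    prompt_version: str
    corpus_snapshot: str
    eval_report: str
    approved_by: str


def active_release(records: list, day: date) -> Optional[ReleaseRecord]:
    """Return the most recent release on or before `day`, if any."""
    past = [r for r in records if r.released_on <= day]
    return max(past, key=lambda r: r.released_on) if past else None


# Placeholder values only.
history = [
    ReleaseRecord(date(2025, 1, 10), "model-a", "prompt-v3", "corpus-2025-01", "eval-017", "alice"),
    ReleaseRecord(date(2025, 3, 2), "model-b", "prompt-v4", "corpus-2025-02", "eval-021", "bob"),
]
on_feb_1 = active_release(history, date(2025, 2, 1))
```

Writing the record as part of the deploy step, rather than by hand afterward, is what keeps the history complete.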
Strong engineers do not just name techniques — they identify failure modes, define metrics, and defend trade-offs. Every interview answer should tie a concept to latency, cost, reliability, evaluation, or user experience.
| Question | Answer (production-framed) |
|---|---|
| Temperature vs top-p? | Temperature reshapes the whole probability distribution; top-p samples from the smallest set reaching cumulative probability p. Both control randomness — tune with task evals |
| Why can a longer context window hurt quality? | Distractors, conflicting evidence, instruction dilution, higher cost, and slower time to first token (TTFT). More context is not better context |
| RAG vs fine-tuning? | RAG for changing or private knowledge needing citations; fine-tune for stable behavior, style, output format, or domain decision patterns |
| BERT vs GPT in one minute? | BERT: encoder, bidirectional, strong for understanding (classification, NER, reranking, embeddings). GPT: decoder, autoregressive, strong for generation (chat, code, agents) |
| Long context vs retrieval? | Long context simpler when evidence is small and already available. Retrieval better when knowledge is large, private, changing, permissioned, or needs citation and freshness |
| How to reduce hallucination? | Improve retrieval recall, rerank for precision, constrain answers to context, require citations, validate claims, escalate when evidence is missing |
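The temperature vs top-p row can be made concrete with a small numeric sketch. The function names and example logits are illustrative assumptions; real inference stacks apply these steps to full vocabulary logits and then sample from the result.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Divide logits by the temperature, then normalize to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list, p: float) -> list:
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


logits = [2.0, 1.0, 0.0]
base = softmax_with_temperature(logits, 1.0)   # every token keeps some mass
sharp = softmax_with_temperature(logits, 0.5)  # lower temperature concentrates mass
nucleus = top_p_filter(base, 0.9)              # the least likely token is cut entirely
```

Temperature changes the relative weight of every token, while top-p leaves the kept tokens in the same proportion and removes the tail. That difference is why the two are usually tuned together against task evals rather than in isolation.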
At L5, the expected answer to "How do you reduce hallucination in RAG?" is a list of correct techniques: improve retrieval, add citations, constrain answers to context, use an LLM judge. At L6+, the expected answer starts differently: "First, I need to know where the hallucination is coming from: is it retrieval failure (correct chunks not found), context failure (chunks retrieved but not used), or generation failure (answer generated despite contradicting context)? The diagnosis determines the fix, and the correct metric for each failure mode is different." The L6+ signal is always: diagnose first, prescribe second, and name how you would measure each thing.
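The diagnose-first approach can be sketched as a per-case classifier over labeled eval traces. This is an illustration under stated assumptions: `FailureMode` and `diagnose` are hypothetical names, cited chunk IDs are used as a proxy for "chunks the model used", and `answer_supported` is assumed to come from a human label or a judge.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    RETRIEVAL = "retrieval_failure"    # correct chunks not found
    CONTEXT = "context_failure"        # chunks retrieved but not used
    GENERATION = "generation_failure"  # answer contradicts the context it used
    NONE = "no_failure"


def diagnose(gold: set, retrieved: list, cited: list, answer_supported: bool) -> FailureMode:
    """Check the pipeline stages in order and report the first one that failed."""
    if not gold & set(retrieved):
        return FailureMode.RETRIEVAL
    if not gold & set(cited):
        return FailureMode.CONTEXT
    if not answer_supported:
        return FailureMode.GENERATION
    return FailureMode.NONE


# Illustrative traces: (gold chunk IDs, retrieved IDs, cited IDs, answer supported?)
traces = [
    ({"c1"}, ["c2", "c3"], [], False),
    ({"c1"}, ["c1", "c2"], ["c2"], False),
    ({"c1"}, ["c1"], ["c1"], False),
    ({"c1"}, ["c1"], ["c1"], True),
]
modes = [diagnose(*t) for t in traces]
tally = Counter(m.value for m in modes)
```

The tally shows where to spend effort: a retrieval-heavy tally points at recall and chunking, a context-heavy one at prompt structure or reranking, and a generation-heavy one at answer constraints and claim validation.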
A senior LLM systems design answer does not just cover architecture. It also covers who owns what, which failure mode pages on-call at 2am, how long it takes to roll back a bad change, and how you know the system is degrading before users tell you. These questions reveal whether you have shipped a system into production and lived with it, or only designed systems on paper. Interviewers at the L6+ level are evaluating whether you can own this end-to-end: the data pipeline, the model, the evaluation, the deployment, the monitoring, the incident response, and the stakeholder communication. Technical architecture is table stakes; organizational thinking is the differentiator.