AI Engineering · LLMs & NLP · Updated May 2025

Build it.
Ship it.
Explain it.

16 stages covering language foundations, transformer internals, retrieval systems, agents, and production reliability — with code, pitfalls, and the production depth that separates engineers who ship from those who prototype.

Page

What this section covers

Corpus design, tokenization, and retrieval fundamentals
Transformer internals: attention, KV cache, decoding
Prompting, fine-tuning, PEFT, and preference optimization
RAG pipelines: ingestion, hybrid retrieval, grounded generation
Agents: tool contracts, state machines, memory, and planning
Production: evaluation, serving, safety, guardrails, governance

🔧 Production Context

Select a stage from the left rail.

Layer —

Stage —

Depth

Production reality

Most language-system failures begin before the model call: bad parsing, weak labels, missing metadata, or a corpus that does not represent the real task. Fix data before tuning models.

Corpus Design: A corpus is the text a system learns from, searches over, evaluates on, or uses as grounding. Start with task boundaries: who writes the text, which domains matter, what must not leak into evaluation. Connect corpus quality to downstream metrics — bad data makes strong models look unstable.
Vocabulary & N-grams: Before neural embeddings, text was represented via vocabulary features. N-grams still matter for search baselines, spam detection, log analysis, and exact term matching — they catch IDs, error codes, product names, and rare terms that dense embeddings smooth away. Always build a lexical baseline before jumping to LLMs.
Text Normalization: Normalization creates consistency, but over-normalization destroys meaning. Preserve negation, IDs, section numbers, currency, code symbols, and capitalization when they matter. Modern LLMs prefer natural text — aggressive stemming and stop-word removal should be justified by evals, not habit.

Production pitfalls

Document leakage: Chunks from the same source appear in both train and test — split by document, user, account, or time window, not by individual row
Eval corpus mismatch: Test set uses clean English; production includes OCR, tables, abbreviations, multilingual text — sample eval data from real traces
Skipping the baseline: Teams ship expensive LLM classifiers without checking if TF-IDF + logistic regression would meet the SLA
Removing negation: "not eligible for refund" becomes "eligible refund" — use task-specific stop-word lists and test negation-heavy cases
English-only preprocessing: Whitespace rules and stemming fail on multilingual or code-mixed inputs — detect language first

Stratified split — document-level

from sklearn.model_selection import train_test_split train, test = train_test_split( examples, test_size=0.2, stratify=[x["label"] for x in examples], random_state=42, ) assert not set(x["doc_id"] for x in train) & set(x["doc_id"] for x in test)

TF-IDF baseline with n-grams

from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.linear_model import LogisticRegression vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=50000) X = vec.fit_transform([x["text"] for x in train]) clf = LogisticRegression(max_iter=1000).fit(X, [x["label"] for x in train])

Conservative text normalizer

import re, unicodedata def normalize_text(text: str) -> str: text = unicodedata.normalize("NFKC", text) text = re.sub(r"\s+", " ", text).strip() return text # avoid stemming unless justified by evals

Quick reference

Question	Answer
What is a corpus?	A collection of text for training, retrieval, evaluation, or grounding
How to avoid leakage?	Split by the unit that repeats in production (doc, user, conversation, time)
Why metadata matters?	Enables filtering, slicing, access control, debugging, recency, and fair eval
When does TF-IDF beat embeddings?	Exact IDs, rare identifiers, domain jargon, small labeled datasets
Stemming vs lemmatization?	Stemming chops to crude roots; lemmatization maps to dictionary forms
What to store after normalization?	Raw text, normalized text, source metadata, offsets, normalizer version

✦Corpus quality caps model quality

A strong model trained on a misrepresentative corpus will underperform a weaker model trained on a well-constructed one. The eval corpus is the most critical artifact: if it does not match production distribution (language, source, length, domain, recency), you are optimizing for a test that does not exist. Teams that discover this late lose months. The fix: instrument production traffic early, sample from real traces for eval construction, and maintain domain slices as the corpus grows.

✦The baseline is a production requirement, not a formality

A TF-IDF + logistic regression baseline is often 85–90% of the LLM quality at 1% of the cost and latency. If the team ships an LLM solution without proving the baseline fails, they own the infrastructure and cost of the heavier system with no documented justification. The baseline also creates a stable lower bound: if the LLM degrades in production, the baseline is the fallback. Always build it, test it, and keep it running alongside the primary system.

Production reality

LLMs often wrap classical NLP tasks in natural language, but the product still needs schemas, labeled examples, and measurable outputs. A model cannot learn a task that humans cannot label consistently.

Core NLP Tasks: Classification assigns labels. NER extracts entity spans. Relation extraction connects entities. POS tagging and parsing expose grammatical structure. Summarization compresses information. Clustering groups similar texts. LLMs perform many zero-shot, but production systems still need schemas, labeled examples, and measurable outputs.
Labeling Quality: Quality depends on clear guidelines, calibrated annotators, disagreement review, representative samples, and careful handling of rare classes. For LLM systems, labeled data is still needed for evals, prompt regression, fine-tuning, routing, and preference optimization. Disagreements are debugging signal — review them.
NLP Metrics: Metrics must match the task. Classification uses precision, recall, F1. Extraction uses span-level exact or partial match. Summarization may use ROUGE or human rubrics. Always inspect metrics by slice: source, language, label, length, recency, customer segment. A model can look good overall while failing on a specific subgroup.

Production pitfalls

Vague label definitions: Annotators cannot distinguish complaint, escalation, urgent, and churn-risk — write guidelines with examples, counterexamples, and conflict rules
Optimizing overall accuracy on imbalanced labels: 96% accuracy by always predicting majority class — track per-class precision, recall, F1 for high-risk labels
No disagreement workflow: Annotator conflicts are hidden in data and become confusing training signal — review disagreements, update guidelines, keep adjudicated labels separate
One metric for every task: Accuracy hides low recall on the class that matters most to the business
Using BLEU/ROUGE as truth: Surface overlap punishes good paraphrases and rewards copied but incomplete text — use as weak signals with human checks

Label distribution check

from collections import Counter counts = Counter(x["label"] for x in examples) total = sum(counts.values()) for label, n in counts.most_common(): print(label, n, round(n / total, 3))

Per-class metrics report

from sklearn.metrics import classification_report y_true = [x["label"] for x in test] y_pred = clf.predict(vec.transform([x["text"] for x in test])) print(classification_report(y_true, y_pred, digits=3))

Entity schema example

ticket = { "label": "billing_issue", "entities": [ {"type": "invoice_id", "text": "INV-2041", "start": 17, "end": 25} ], "sentiment": "frustrated" }

Quick reference

Question	Answer
Precision vs recall?	Precision: how many predicted positives were correct. Recall: how many actual positives were found
What is F1?	Harmonic mean of precision and recall — useful when both false positives and negatives matter
What is NER?	Named entity recognition identifies spans: people, companies, products, dates, amounts, IDs
When use rule-based NLP?	Stable patterns, low false-positive risk, deterministic behavior more valuable than generalization
What is active learning?	Prioritizes uncertain or high-value examples for human annotation to maximize label efficiency
How to handle class imbalance?	Stratified sampling, class-weighted losses, threshold tuning, targeted data collection, per-class metrics

✦Inter-annotator agreement is a product metric

Low inter-annotator agreement is not an annotator problem — it is a label definition problem. If trained humans cannot agree on a label, the model will not learn a stable decision boundary either. Cohen's kappa below 0.7 on a classification task is a signal to pause labeling and rewrite the guidelines, not to add more data. The most efficient fix is always clearer guidelines with worked examples and explicit tie-breaking rules, not more annotation hours on an ambiguous task.

✦Slice analysis is where models actually fail

A model with 94% overall accuracy can have 40% recall on the customer segment that drives 60% of revenue, or 20% precision on the language that represents your fastest-growing market. Aggregate metrics hide these failures. The fix: maintain eval slices by domain, language, length, recency, label, and customer tier from the start. Build them into CI — a regression on a slice is a regression, even if aggregate numbers hold. Companies that discover slice failures late often have to rebuild data pipelines and retrain from scratch.

Production reality

Dense embeddings are powerful, but sparse search remains the fastest way to catch names, IDs, error codes, versions, and rare terms. Hybrid retrieval combines the precision of lexical search with the recall of semantic search.

Inverted Index: Maps each term to the documents containing it — powers search engines, log search, compliance discovery, and the sparse side of hybrid RAG. It is explainable, fast, filter-friendly, and strong for exact terms. Learn it before vector search because many production retrieval bugs are lexical matching or metadata filtering problems.
TF-IDF and BM25: TF-IDF scores terms higher when frequent in a document but rare in the corpus. BM25 improves this with term-frequency saturation and document-length normalization. Both are strong baselines for search and classification. BM25 is the right contrast against embeddings: exactness and explainability versus semantic generalization.
Hybrid Search: Combines lexical precision with semantic recall. A production pattern retrieves candidates from BM25 and vector search, filters by permissions and metadata, fuses with Reciprocal Rank Fusion, reranks, then sends only the strongest evidence to the model.

Production pitfalls

Dense-only retrieval for identifiers: "ERR-8492" returns semantically similar docs but misses the exact error page — use BM25 or exact filters alongside vector search
No field-aware search: A footer match ranks above a title match — boost important fields: title, heading, ID, canonical entity fields
Ignoring analyzers: The search engine tokenizes hyphens, accents, and case differently than expected — test analyzer output for real queries
Fusion before permission filters: Unauthorized documents influence ranking or leak through traces — apply ACL filters before ranking results
Reranking too few candidates: The right document was candidate 27 but the app kept only top 10 — retrieve a wider pool then rerank and trim

Tiny inverted index (conceptual)

from collections import defaultdict index = defaultdict(set) for doc_id, text in docs.items(): for term in text.lower().split(): index[term].add(doc_id) matches = index["refund"] & index["policy"]

BM25 scoring intuition

score(query, doc) = sum( idf(term) * tf_saturation(term, doc) * length_normalization(doc) for term in query )

Reciprocal Rank Fusion

def rrf(rankings, k=60): scores = {} for ranking in rankings: for rank, doc_id in enumerate(ranking, start=1): scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank) return sorted(scores, key=scores.get, reverse=True)

Quick reference

Question	Answer
BM25 vs TF-IDF?	Both use term rarity; BM25 adds better term-frequency saturation and length normalization
When does BM25 beat embeddings?	Exact IDs, rare names, error codes, quotes, legal clauses — lexical precision matters
What is IDF?	Inverse document frequency: more weight to terms appearing in fewer documents
What is RRF?	Combines multiple ranked lists by giving credit based on rank position — simple and effective
Why use hybrid search?	Improves recall across semantic queries and exact lexical queries simultaneously
Where does reranking fit?	After initial retrieval and filtering, before final prompt assembly

✦Most retrieval bugs are metadata and filter bugs

When engineers say "retrieval is not finding the right document," the root cause is almost never the embedding model or similarity metric. It is usually: a metadata filter removing the right document before scoring, a permission filter applied at the wrong stage, an analyzer tokenizing a product ID differently from the query, or a field boost misconfiguration. The debugging process should start with retrieval traces (what was returned, what was filtered), not with embedding quality. Vector index tuning is the last thing to touch, not the first.

✦Retrieval quality must be measured, not assumed

A hybrid retrieval system has at least five tunable dimensions: BM25 weight, vector weight, RRF k-parameter, candidate pool size, and reranker depth. Each change can improve one query type and hurt another. The only way to safely tune is to have a labeled retrieval eval set: 100–200 representative queries with expected documents or chunks, measuring recall@k, MRR, and NDCG by query type (exact ID lookups, semantic questions, multi-hop, rare terms). Without this, tuning is guesswork and regressions are invisible.

Production reality

Embeddings are similarity features, not truth. They need versioning, evaluation, and retrieval guardrails. Nearest neighbor is not the same as correct answer.

Embedding Space: Embeddings represent text as vectors so semantically related items land near each other — useful for semantic search, duplicate detection, clustering, routing, recommendations, and RAG. Similarity is task-dependent: the nearest vector can be topically related but outdated, unauthorized, or not actually answerable. Always validate on task-specific recall metrics.
Vector Indexing & ANN: Vector databases use approximate nearest-neighbor indexes to find similar vectors quickly. Approximation creates a recall-latency trade-off: faster search may miss relevant documents. Production systems must evaluate retrieval quality after index tuning. Metadata filters and ACLs must be part of the retrieval plan — not an afterthought applied post-search.
Model Selection: Choose the simplest model that satisfies the product contract. Rules for stable exact policies. Classical ML for high-volume labeled tasks. Encoders for classification and embeddings. LLMs for nuanced language, reasoning, and flexible schemas. Open-source models help with privacy and control but add infrastructure burden. Always benchmark on your own eval set.

Production pitfalls

Comparing vectors from different models: Scores become meaningless when old and new embedding spaces are mixed — store model, dimension, preprocessing, and index build version
Assuming nearest means correct: The closest chunk is similar but not sufficient evidence — rerank, check answerability, validate generated claims against source text
Tuning for latency only: The index gets faster but misses expected evidence for hard questions — track recall@k and latency together during index tuning
Post-filtering after vector search: ACL filters remove top results, leaving too few candidates — prefer pre-filtering or retrieve a larger candidate pool
Choosing by benchmark leaderboard only: Best general model may be too slow, costly, or weak on private domain data — benchmark on your own eval set with latency, cost, safety criteria

Cosine similarity search

import numpy as np texts = ["refund policy", "invoice payment", "password reset"] vectors = np.array([[1.0, 0.1], [0.1, 1.0], [0.8, 0.2]]) query = np.array([0.9, 0.1]) def normalize(x): return x / np.linalg.norm(x, axis=-1, keepdims=True) scores = normalize(vectors) @ normalize(query).reshape(-1, 1) best = int(scores.argmax()) print(texts[best])

Vector index config checklist

vector_index = { "embedding_model": "text-embedding-model", "dimensions": 1536, "metric": "cosine", "ann": "hnsw", "filters": ["tenant_id", "acl", "doc_type", "created_at"] }

Model selection rubric

if exact_policy and stable_pattern: choice = "rules" elif many_labels and strict_latency: choice = "small classifier or encoder" elif needs_reasoning or changing_schema: choice = "LLM with evals" elif data_cannot_leave_vpc: choice = "self-hosted model"

Quick reference

Question	Answer
What is cosine similarity?	Measures angle between vectors — often used to compare embedding direction after normalization
What is ANN search?	Approximate nearest-neighbor search trades perfect recall for much faster vector lookup
What is HNSW?	Graph-based ANN index — common for fast high-recall vector search in production
Why do metadata filters matter?	Enforce permissions, recency, source type, tenant boundaries, and domain constraints
API LLM vs open-source?	API: less ops burden, capability lead. Open-source: control, privacy, customization, cost at scale
How to evaluate embeddings?	Retrieval recall@k, MRR, clustering quality, duplicate-detection F1, human relevance labels

✦Embedding versioning is a production requirement

Every chunk in the index was produced by a specific model at a specific version. When you migrate to a new embedding model, old and new vectors live in different semantic spaces — cosine similarity scores between them are meaningless. This means: (1) the entire index must be rebuilt before mixing old and new retrievals, (2) the migration must be atomic from the retrieval system's perspective, (3) rollback must be possible by pointing back to the old index. Systems that mix embedding versions — even during a rolling migration — produce retrieval scores that cannot be debugged because the numbers are no longer comparable.

✦The recall-latency tradeoff is tunable — measure it

HNSW has two key parameters: ef_construction (index-build quality) and ef (search-time quality). Increasing ef improves recall@k at the cost of query latency. The optimal setting depends on your latency budget and the cost of missed documents. Plot recall@10 vs p99 latency across a range of ef values on your actual corpus — the curve is usually non-linear, with large recall gains at low ef values and diminishing returns past a threshold. This measurement, done once per corpus size order of magnitude, tells you exactly where to set the parameter instead of guessing.

Production reality

Most LLM bugs are easier to debug when you can separate model limitation, prompt limitation, retrieval limitation, and product-contract limitation. Understand the inference machine before designing systems around it.

Tokenization: The compression boundary between text and model. Shapes cost, context length, latency, and behavior around code, rare names, IDs, tables, and non-English text. Measure token growth after full prompt assembly — schema instructions, citations, and tool traces silently consume the usable context window. Budget after assembly with the target tokenizer, not by character count.
Attention & KV Cache: Self-attention creates queries, keys, and values from token representations. During generation, the KV cache stores computed states so the model does not recompute the full prefix at each step. Long inputs increase prefill latency (time-to-first-token); long outputs increase decode latency. Measure both separately — optimizing only output length leaves prefill latency unaddressed.
Decoding: Converts model probabilities into text. Deterministic settings help extraction, classification, and tool calls; moderate randomness helps brainstorming. Structured outputs reduce parser failures but do not guarantee semantic correctness. The production contract: generate, validate, repair or escalate.

Production pitfalls

Counting characters instead of tokens: Short JSON schemas or code blocks tokenize far worse than prose — budget after final prompt assembly with the target model tokenizer
Tokenizer changes during model migration: A model switch can silently change token count, truncation behavior, and cost — add token-budget regression tests before changing model families
Optimizing only generation length: Teams cap max output but keep sending 40 retrieved chunks — users still experience slow first-token latency
Treating context window as free memory: A 128k context can still be too slow, too expensive, and too noisy — use the smallest context that preserves task accuracy, then prove it with evals
High temperature for factual QA: Randomness makes retrieval grounding less stable and regressions harder to reproduce — use low temperature for grounded answers and routing
Trusting JSON validity as truth: The object parses, but fields can still be unsupported by evidence — run schema validation, evidence checks, and domain rules after parsing

Token-budget guard

def reserve_budget(messages, max_context, max_output, tokenizer): used = sum(len(tokenizer.encode(m["content"])) for m in messages) available = max_context - max_output if used > available: raise ValueError(f"Prompt uses {used} tokens; budget is {available}") return {"prompt_tokens": used, "remaining": available - used}

Attention and latency model

Q = X @ Wq; K = X @ Wk; V = X @ Wv attention = softmax((Q @ K.T) / sqrt(d_k)) @ V # Latency decomposition total_latency = prefill(context_tokens) + decode(output_tokens) # prefill = time-to-first-token → driven by prompt length # decode = total generation time → driven by output length

Pydantic structured output with validation

from pydantic import BaseModel, Field, ValidationError class Invoice(BaseModel): invoice_id: str amount_due: float = Field(ge=0) currency: str try: invoice = Invoice.model_validate(model_json) except ValidationError as err: raise ValueError(f"invalid model output: {err}")

Quick reference

Question	Answer
Why does tokenization matter for reliability?	Controls what the model can see — wrong truncation means confident answers from incomplete evidence
What is prefill?	Phase where model processes input prompt and builds KV cache before emitting first token
Why do long RAG prompts hurt quality?	Distractors, conflicting snippets, instruction dilution — more context is not better context
When should temperature be near zero?	Classification, extraction, compliance, routing — anywhere reproducibility matters more than variation
What to log for token debugging?	Prompt tokens, completion tokens, model name, truncation strategy, selected docs, prompt template version

✦Prefill and decode are separate latency problems

Teams that complain about "slow LLM responses" almost always conflate two distinct latency components: prefill (time-to-first-token, driven by prompt length and batch state) and decode (time-per-output-token, driven by output length and generation strategy). The fix for high prefill latency is shortening the prompt, caching stable prefixes, or using prefix caching. The fix for high decode latency is constraining output length, using speculative decoding, or batching. Optimizing one without understanding which component dominates is guesswork — always measure TTFT and total latency separately in your traces.

✦Structured outputs fail at the semantic layer, not the syntax layer

JSON mode and constrained decoding solve the syntax problem: the output is valid JSON that matches the schema. They do not solve the semantic problem: the values in the fields may be hallucinated, unsupported by evidence, or logically inconsistent. A system that treats a valid-JSON response as a correct response is missing a layer of validation. The correct production contract after structured output: (1) parse and schema-validate, (2) check factual claims against retrieved evidence, (3) apply domain logic (amounts are positive, dates are in range, referenced IDs exist), (4) escalate if any check fails. Steps 2–4 require deterministic application code, not more prompting.

Production reality

A prompt is not magic words — it is an executable specification for a probabilistic worker. The best prompt is short enough to audit and explicit enough to test. Treat every prompt edit like a code change.

Prompt Contract: Production prompts should be versioned artifacts. Separate stable system policy from dynamic user data, retrieval context, and tool state. The contract defines: role, task, context, rules, output schema, refusal policy, and examples only when they remove ambiguity. Every prompt edit can alter behavior across thousands of requests — review and test accordingly.
Instruction Hierarchy: Prompt injection is an authority problem, not a string problem. Retrieved text and user text are data, not instructions. Tools should receive typed arguments from validated model output, not raw generated prose. The application must enforce permissions outside the model. Authorization belongs in deterministic code, before and after the model call.

Production pitfalls

Mixing policy, evidence, and user text: The model cannot reliably distinguish trusted instructions from untrusted retrieved content — use clear delimiters, typed sections, and explicit instruction hierarchy
Few-shot examples conflicting with schema: The model imitates examples and ignores the written contract when they disagree — keep examples minimal, valid, and covered by tests
Retrieved content overriding system prompt: A malicious document triggers prompt injection — label retrieved text as untrusted evidence and enforce data access in application code
Secrets in the prompt: Anything in context can be leaked by a bad prompt path — never send secrets to the model unless disclosure to the user is acceptable

Production prompt skeleton

System: You are a support analyst. Answer only from CONTEXT. If evidence is missing, say what is missing. Developer: Return JSON: {"answer":"...", "citations":[], "confidence":0.0, "escalation_reason":"..."}. User: QUESTION: {{question}} CONTEXT: {{retrieved_passages}}

Tool authorization pattern

if tool_name == "refund_user": assert current_user.can("refund") # auth in app code args = RefundArgs.model_validate(tool_args) # typed schema assert args.amount <= policy.max_refund # domain rule return refund_service.create(args)

Quick reference

Question	Answer
What makes a prompt production-ready?	Versioned, tested against eval set, clear input boundaries, defined failure behavior, validated output shape
Why avoid giant prompts?	Harder to review, more expensive, slower, more fragile, more likely to contain conflicting instructions
Where should authorization live?	Deterministic application code before and after the model call — not in prompts
What is prompt injection?	Untrusted text attempts to alter the model instruction hierarchy by pretending to be higher-priority instructions

✦Indirect prompt injection is the most dangerous attack vector

Direct prompt injection (a user saying "ignore previous instructions") is well-known and relatively easy to defend against. Indirect prompt injection — where the malicious instruction is embedded in retrieved content, an email, a web page, or a document — is structurally harder because the application itself fetches and passes the poisoned content. The only reliable defenses are: (1) treat all retrieved and user-provided text as untrusted data, not as instructions; (2) enforce permissions and tool calls in deterministic application code, not in model output; (3) test the system with adversarial retrieved documents as a CI step. No amount of system prompt wording fully protects against a well-crafted indirect injection if the boundary is not enforced in code.

✦Prompt versioning requires the same discipline as code versioning

A prompt change that ships without a version tag, eval gate, and rollback path is a silent production migration. Teams that store prompts as hardcoded strings in Python files have no way to answer: "what prompt was active during the incident?" or "which prompt change caused the quality regression last Tuesday?" The fix: store prompts in versioned YAML under source control, load by tag at runtime from a registry, run evals on each new version before promotion, and link the deployment tag to traces. This is the same discipline as software deployment — prompt changes just look less like code because they are text.

Production reality

BERT, GPT, and T5 are not just brand names — their architecture and training objective shape what they are naturally good at. Architecture choice is a product decision.

Encoder vs Decoder: Encoder-only models (BERT-style) read the whole input bidirectionally — efficient for classification, extraction, embeddings, and reranking. Decoder-only models (GPT-style) generate one token at a time with causal masking — dominant for chat, agents, and completion. Encoder-decoder models (T5/BART) read an input and generate an output — strong for translation, summarization, and sequence-to-sequence tasks.
Training Objectives: Masked language modeling predicts hidden tokens from both sides. Causal language modeling predicts the next token from prior tokens. Contrastive learning pulls related texts closer in embedding space. Instruction tuning teaches task following. Preference tuning teaches response ranking. Each objective shapes what the model is naturally good at post-training.
Architecture Blocks: A transformer block mixes token information with attention, transforms each token with a feed-forward network, stabilizes training with residual connections and normalization, then projects to vocabulary logits. This mental model explains why prompt length affects latency, why token position matters, and why decoding settings alter outputs without changing weights.

Production pitfalls

Using a chat LLM for every classification job: Pays generation latency and cost even though a small encoder could classify faster and cheaper — benchmark candidates first
Ignoring causal masking: A decoder cannot attend to future tokens during generation — explain latency and output behavior in terms of this constraint
Confusing pretraining with alignment: A pretrained model can produce fluent text but may not follow instructions or refuse unsafe requests — separate pretraining, instruction tuning, and preference tuning
Expecting chat-model embeddings to be ideal: A generative model is not optimized for similarity search — use embedding models tuned for retrieval and validate on recall metrics
Treating attention weights as faithful explanation: Attention weights can be inspected but are not guaranteed to explain model reasoning — use as one debugging signal, not as proof of causality

Model-family decision table

classification / NER / rerank -> encoder (e.g. BERT, RoBERTa) chat / code / agent planning -> decoder (e.g. GPT, LLaMA, Claude) translation / summarization -> encoder-decoder (e.g. T5, BART) semantic search embeddings -> embedding-specific model

Training objective intuition

masked_lm: "The capital of [MASK] is Paris" causal_lm: "The capital of France is" -> " Paris" contrastive: pull(query, positive_doc) > pull(query, negative_doc) instruction: (prompt, ideal_response) pairs preference: prefer(chosen_response) > prefer(rejected_response)

Transformer block pseudocode

x = token_embedding(tokens) + position_embedding(tokens) for block in transformer: x = x + multi_head_attention(layer_norm(x)) x = x + feed_forward(layer_norm(x)) logits = output_projection(layer_norm(x))

Quick reference

Question	Answer
BERT vs GPT?	BERT: encoder-only, bidirectional, strong for understanding tasks. GPT: decoder-only, autoregressive, strong for generation
What is causal masking?	Prevents a decoder from attending to future tokens while predicting the next token
What is contrastive learning?	Brings matching pairs closer and pushes non-matching pairs apart in representation space
What is instruction tuning?	Supervised training on instruction-response examples to teach a base model to follow user tasks
Why use residual connections?	Help gradients flow and let each block refine rather than replace the representation
What are logits?	Raw vocabulary scores before softmax converts them to probabilities

✦The encoder/decoder choice is a latency and cost contract

Choosing a decoder-only LLM for a classification task is not just a capability choice — it is a latency and cost contract. A decoder generates tokens sequentially (O(output_length) steps per request); an encoder produces a fixed-size representation in one forward pass. For a binary classification task running at 10,000 requests/second, the difference can be 50ms vs 5ms P99 latency and 20× cost difference. The habit of defaulting to the latest chat model for every task is one of the most common and most expensive engineering mistakes in LLM application development. Always benchmark the smallest model that meets the quality bar on your actual task.

✦Instruction tuning ≠ knowledge injection

A fundamental misconception: instruction tuning teaches a model how to follow tasks, not what facts to know. If you fine-tune a model on instruction-response pairs about your internal knowledge base, the model will learn to produce responses in the right format and tone — but it will not reliably acquire the factual content from the training examples, especially for specific entities, numbers, and policies. LLMs memorize facts unreliably from training, and they cannot cite or verify what they memorized. The correct architecture: instruction-tune for behavior, format, and routing; use RAG for knowledge that must be current, private, or auditable. Conflating the two leads to hallucination-prone systems that are difficult to debug.

Production reality

Fine-tuning is rarely the first fix for missing knowledge. It is strongest for behavior, style, format, routing, and domain-specific decision boundaries — not for injecting facts that change or need citation.

SFT & Instruction Tuning: Supervised fine-tuning teaches a model to imitate high-quality input-output examples. Useful when prompts are too long, behavior must be consistent, outputs have domain-specific style, or there are many examples. It is not a reliable way to insert large changing knowledge. Keep a held-out set, deduplicate training data, test refusals and edge cases, and compare against a strong prompt-only baseline before deciding SFT is necessary.
PEFT & LoRA: Parameter-efficient fine-tuning updates a small set of adapter weights instead of the full model. LoRA injects low-rank matrices into selected layers, reducing memory and training cost. Useful for domain behavior and specialist variants, but adapter sprawl becomes an operations problem. Track base model, adapter version, dataset version, eval results, and merge status in a registry.
Preference Optimization (DPO/RLHF): Trains a model to prefer one response over another — useful for subjective qualities like helpfulness, brevity, tone, refusal style, or rubric compliance. The quality of pairwise data matters more than the algorithm. Preference tuning can over-optimize for judge taste, reduce diversity, or damage factual behavior — run broad regression evals before and after.

Production pitfalls

Fine-tuning to fix retrieval gaps: The model still cannot know a newly updated policy unless that evidence is supplied — use RAG for changing knowledge and fine-tune only stable behavior
Training on low-quality assistant outputs: The model learns verbosity, unsupported claims, and schema mistakes — curate examples, include negative cases, and evaluate on held-out data
Adapter without base-model pinning: Loading a LoRA adapter on the wrong base model gives degraded or nonsensical behavior — pin and validate base model checksum, tokenizer, adapter rank
Too many unowned adapters: Every team trains a domain adapter, but nobody knows which is safe to serve — use a registry with owners, evals, expiry dates, and deployment gates
Optimizing for vague preference labels: Annotators choose the friendlier response, but the product needed the more grounded one — use rubrics with ranked criteria: correctness, evidence, completeness, tone
No capability regression tests: Tone improves while extraction accuracy or refusal behavior worsens — run task, safety, and calibration evals before promotion

SFT training example shape (JSONL)

{"messages":[ {"role":"system","content":"Extract support ticket fields as JSON."}, {"role":"user","content":"Customer cannot log in after SSO migration."}, {"role":"assistant","content":"{\"category\":\"auth\",\"severity\":\"high\"}"} ]}

Adapter registry schema

registry = { "legal_summarizer:v3": { "base": "llama-3.1-8b", "adapter": "s3://models/legal-lora-v3", "eval": {"faithfulness": 0.91, "json_valid": 0.99}, } }

DPO preference pair

{ "prompt": "Explain why retrieval failed.", "chosen": "The query used an exact invoice ID, but vector-only search missed it.", "rejected": "The system had a temporary issue. Try again later." }

Quick reference

Question	Answer
When should you fine-tune an LLM?	Consistent behavior, style, formatting, or task policy — when prompting has become brittle or expensive
Why is fine-tuning not ideal for facts?	Facts change, are memorized unreliably, and cannot be cited — retrieval is better for current or auditable knowledge
What is LoRA?	Freezes the base model and trains small low-rank matrices added to selected layers
What is DPO?	Trains from chosen/rejected pairs without a separate online RL loop
What makes preference data good?	Clear rubrics, calibrated annotators, difficult pairs, domain coverage, disagreement review
When merge an adapter into base weights?	Adapter is stable, always used with that base, merge testing shows no quality regression or deployment issue

✦Adapter sprawl is the hidden LLMOps debt

A team that creates one LoRA adapter to solve a problem creates a manageable artifact. A team of 15 teams each creating adapters for their domain over 18 months creates an operations nightmare: adapters that are no longer tested, adapters trained on deprecated base model versions, adapters whose datasets have drifted from the use case they were designed for, and adapters with no documented owner. The fix is to treat adapter lifecycle the same as software lifecycle: mandatory eval results at creation, an owner per adapter, an expiry date, a deprecation process, and a central registry with deployment gates. This is not bureaucracy — it is the minimum governance to keep a multi-team fine-tuning program from becoming a liability.

✦Preference optimization can silently damage factual capability

RLHF and DPO optimize for human preference signals, which often correlate with confidence, fluency, and brevity rather than factual accuracy. A model that sounds more helpful may be less accurate — and this regression often does not show up in preference win-rate metrics because the annotators are not equipped to fact-check the responses. The result: the preference-tuned model scores better in human evals but hallucinates more in production. The fix: run factuality benchmarks (TRUTHFULQA, domain-specific QA sets, citation accuracy on your corpus) alongside preference evals before promoting any preference-tuned checkpoint to production. Win rate alone is not a sufficient deployment gate.

Production reality

RAG quality comes from the data pipeline as much as the model call. A vector database is an index, not a source of truth — keep raw documents, parsed text, chunks, embeddings, metadata, and answers separate and versioned.

Ingestion: RAG starts before embeddings. Parsing should preserve headings, tables, captions, hierarchy, links, and source offsets. Chunk IDs must be stable. Every chunk needs source lineage, ACL metadata, content hash, embedding model version, and timestamps so re-indexing and debugging are possible. A system without this metadata cannot explain why it returned a specific result.
Hybrid Retrieval: Dense vectors capture semantic similarity; sparse search catches exact names, IDs, errors, and rare terms. The production pattern: retrieve candidates from BM25 and vector search, filter by permissions and metadata, rerank candidates with a stronger relevance model, then compress evidence for the final answer prompt. Reranking after too-small a candidate pool is the most common retrieval bug.

Production pitfalls

No embedding version column: Search mixes old and new vector spaces after model migration — store embedding model, dimensions, chunking version, and index build ID per chunk
Chunking by fixed characters only: Chunks split definitions, tables, procedures, and code blocks in the middle — use structure-aware chunking with overlap and source offsets
Vector-only retrieval for exact identifiers: "INV-2026-1049" returns semantically similar invoices instead of the exact one — use hybrid retrieval and boost exact matches
Reranking after truncation: Correct document was candidate 27 but the app kept only top 10 before reranking — retrieve a wider candidate pool first
No ACL filter before ranking: Unauthorized documents influence ranking or appear in traces — apply permission filters before returning any results

Chunk lineage schema

CREATE TABLE rag_chunks ( chunk_id TEXT PRIMARY KEY, document_id TEXT NOT NULL, source_uri TEXT NOT NULL, content_hash TEXT NOT NULL, embedding_model TEXT NOT NULL, acl JSONB NOT NULL, text TEXT NOT NULL, created_at TIMESTAMPTZ NOT NULL DEFAULT now() );

Hybrid retrieval pipeline

candidates = union( vector_search(query, k=40), keyword_search(query, k=40), ) candidates = filter_acl(candidates, user) # permissions first ranked = rerank(query, candidates)[:8] # then trim context = compress_for_answer(query, ranked)

Quick reference

Question	Answer
Mandatory metadata for RAG?	Document ID, chunk ID, source, content hash, embedding version, chunking version, timestamps, ACL fields
Why keep raw source text?	Re-parse, re-chunk, audit citations, compare versions, recover from bad indexing choices
When is reranking worth it?	Recall is acceptable but top-k precision is weak, or final prompt budget is tight
What is query expansion?	Rewriting or expanding a user query into related terms or subquestions to improve recall

✦Embedding version mismatch is an invisible failure mode

When you migrate to a new embedding model mid-production, old chunks (embedded with model v1) and new queries (encoded with model v2) exist in incompatible vector spaces. Cosine similarity scores between them are not meaningful — the retrieval appears to work because numbers are returned, but the results are semantically random. The only correct approach is to re-embed the entire index with the new model before serving any queries through it, with a clean atomic swap. A rolling migration where some chunks are on the old model and some are on the new model is not a valid intermediate state — it is a silent quality regression that is extremely hard to diagnose without explicit version tracking in your traces.

✦Chunk lineage is the foundation of RAG debuggability

When a RAG system returns a bad answer, the debugging path requires answering: what query was used, what chunks were retrieved, what were the scores, which chunks were filtered by ACL, what was the final context passed to the model, and what was the model's actual input? Without chunk-level lineage — stable chunk IDs, source offsets, version metadata, and per-request retrieval traces — you cannot answer any of these questions. The common failure is to build the retrieval and generation pipeline first, then try to add observability after something breaks in production. The correct order: define the trace schema before writing the first retrieval call, and treat trace completeness as a launch requirement.

Production reality

RAG is not complete when the model answers. It is complete when the product can explain why that answer should be trusted. The answer layer must prove it used the right evidence.

Citation Discipline: Citations should point to the exact evidence used, not just the source document. The model should distinguish: supported facts (direct evidence), inferred summaries (derived), and missing evidence (escalate). For high-stakes workflows, run a post-generation groundedness check that compares each important claim against the retrieved passages using deterministic span matching or an LLM judge calibrated on your domain.
Missing Evidence Handling: When retrieval returns no useful context, the model must not fall back to parametric knowledge and generate a plausible-sounding but unsupported answer. The system needs an explicit missing-evidence path: detect low-evidence answers, return a "cannot answer from available information" response, and route to a human or an alternate data source. Test this path as a first-class eval case.

Production pitfalls

Document-level citations only: Answer cites a 90-page PDF but the claim is nowhere visible to the user — cite chunk IDs, section names, page numbers, and quote spans
Answering from prior knowledge when context is missing: Model gives generic but plausible answer that violates the product's private-knowledge grounding promise — use an explicit missing-evidence path and test it
Using only LLM-as-judge for groundedness: Judges can miss retrieval failures, over-reward fluent answers, and drift with model upgrades — combine deterministic checks, human labels, retrieval metrics, and judge scores
No negative eval cases: The system answers when it should refuse or escalate — include unanswerable, adversarial, permission-denied, and conflicting-evidence cases in every eval run

Answer contract with citation schema

{ "answer": "...", "claims": [ {"text": "...", "citation_ids": ["chunk_14"], "support": "direct"}, {"text": "...", "citation_ids": [], "support": "missing"} ], "needs_escalation": true }

RAG eval assertion

def evaluate_rag(case, result): retrieved = set(result["retrieved_doc_ids"]) expected = set(case["expected_sources"]) recall = len(retrieved & expected) / max(len(expected), 1) assert recall >= case["min_recall"] assert all(t in result["answer"] for t in case["must_include"]) assert not any(t in result["answer"] for t in case["must_not_include"]) assert result["citations"], "answer must cite evidence"

Quick reference

Question	Answer
What is groundedness?	The degree to which generated claims are supported by the supplied evidence
How to reduce hallucination in RAG?	Improve retrieval, constrain answers to context, require citations, validate claims, escalate when evidence is missing
What should ship before more prompt engineering?	An eval set that measures the behavior you are trying to improve
What is a golden set?	A curated set of representative cases with expected behavior used as regression test for all changes

✦The missing-evidence path is a product requirement, not an edge case

A RAG system that does not have an explicit missing-evidence path will hallucinate. The model's default behavior when context is absent or insufficient is to generate a plausible-sounding answer from parametric knowledge — which violates the product promise of private-knowledge grounding and produces confident wrong answers. The missing-evidence path should be: detect that no retrieved chunk supports the answer (using confidence thresholds, evidence-matching, or explicit model instructions), return a structured "cannot answer" response, and optionally route to a human reviewer or alternate data source. This path must be tested with eval cases where the answer genuinely does not exist in the corpus — not as a nice-to-have, but as a launch requirement.

✦LLM-as-judge is a noisy signal that decays over time

LLM-as-judge systems have three systematic failure modes: they over-reward fluent and confident-sounding answers regardless of factual accuracy; they correlate with the same model family biases as the generator (a GPT-4 judge may prefer GPT-4 outputs); and they drift when the judge model is upgraded, making historical comparisons invalid. The fix: use LLM-judge as one signal in an ensemble alongside deterministic checks (schema validation, citation presence, term matching), retrieval metrics (recall@k), and periodic human audits. Calibrate the judge against human labels on your specific domain before using it as a primary eval metric, and version the judge model alongside the generator model in your eval infrastructure.

Production reality

Retrieval quality is usually capped by document quality, not model quality. Clean lineage beats clever prompting. Move beyond nearest neighbors when the corpus has entities, procedures, and multi-hop relationships.

Chunking Strategy: Chunking decides what evidence the model can see as a unit. Good chunks are large enough to answer a question and small enough to rank precisely. Preserve headings, tables, code blocks, page numbers, and parent sections. Use overlap when concepts span boundaries. Evaluate chunk sizes by answerability, recall@k, and final answer faithfulness — not by intuition.
Graph RAG: Adds structure when answers depend on relationships: customers linked to contracts, services linked to incidents, clauses linked to exceptions. The graph is most valuable when entities are canonicalized and edges have provenance. Use graph traversal to expand or constrain retrieval, then ground final answers in source text — not in graph-generated summaries.
Knowledge Freshness: Knowledge systems decay when documents change but indexes, summaries, and embeddings do not. A production pipeline needs document versioning, content hashes, tombstones for deleted chunks, re-embedding jobs, and freshness signals during retrieval. Answers should prefer current policy and disclose when evidence is stale or conflicting.

Production pitfalls

Chunks without parent context: A paragraph says "not covered" but the section title says which policy — include parent headings and source metadata in each chunk
Huge chunks for every document: Recall looks good but answer prompts include too many distractors — evaluate chunk sizes by answerability and faithfulness
Graph without provenance: The graph says a contract has a clause, but nobody can trace it to a document span — attach source, page, chunk ID, extractor version, confidence to every edge
LLM-extracted entities without normalization: IBM, I.B.M., and International Business Machines become separate nodes — use entity resolution with aliases and IDs
Deleted content remains searchable: A removed policy continues to appear because old chunks were never tombstoned — use soft deletes, rebuild jobs, and filters excluding inactive chunks
Summaries not regenerated after source updates: Cached summary still reflects last quarter policy even though the chunk is fresh — treat summaries as derived artifacts with dependency tracking

Structure-aware chunk with parent context

def chunk_sections(doc): for section in doc.sections: text = section.heading + "\n" + section.body for window in sliding_tokens(text, size=420, overlap=60): yield { "text": window.text, "section": section.heading, # parent context "page": window.page, "offsets": window.offsets, # for citation }

Graph RAG — entity neighborhood (Cypher)

MATCH (c:Customer {id: $customer_id})-[:HAS_CONTRACT]->(k) MATCH (k)-[:HAS_CLAUSE]->(clause) WHERE clause.topic IN $topics RETURN clause.text, clause.source_uri, clause.page LIMIT 12

Incremental index update (delta logic)

if new_hash != stored_hash: mark_old_chunks_deleted(document_id) # tombstone enqueue_parse(document_id, version=new_version) enqueue_embed(document_id, embedding_model=current_model) else: skip("unchanged document")

Quick reference

Question	Answer
How to pick chunk size?	Start from the unit needed to answer common questions; tune with retrieval recall and answer faithfulness evals
Why include source offsets?	Cite exact spans, highlight evidence, deduplicate chunks, and debug parsing issues
What is parent-child retrieval?	Index small child chunks for precise search, return larger parent sections for enough answer context
When does Graph RAG help?	Questions needing relationships, multi-hop reasoning, entity disambiguation, or aggregation across connected facts
What is a tombstone?	A marker that a prior chunk or document version is no longer active, even if stored for audit
How to handle document updates?	Detect content changes, version the document, tombstone old chunks, re-parse, re-embed, validate on update-sensitive evals

✦Chunk size is the most impactful RAG hyperparameter

Teams spend significant effort tuning embedding models, rerankers, and LLMs, but the single most impactful RAG parameter is usually chunk size and overlap. Too small: precise ranking but insufficient context per chunk to answer a question independently. Too large: good context per chunk but noisy ranking because chunks cover too many topics. The correct process: build a 100-question golden eval set from your actual corpus, measure recall@5 and answer faithfulness at chunk sizes of 256, 512, and 1024 tokens with 20% overlap, and choose the configuration that optimizes faithfulness (not just recall). This takes one day of work and avoids weeks of chasing model and prompt improvements that cannot overcome a bad chunking strategy.

✦Entity resolution determines Graph RAG quality

Graph RAG is only as good as its entity canonicalization. A knowledge graph where "IBM," "I.B.M.," "International Business Machines," and "IBM Corp" are four separate nodes does not answer any multi-hop question correctly — every relationship query returns partial results because the graph is fragmented. Entity resolution (alias mapping, ID assignment, confidence-weighted merge, human review for high-value entities) is the most labor-intensive part of building a production knowledge graph, and it is almost always underestimated. The correct investment: use a combination of rule-based normalization for known entity types (company names, product IDs, person names), probabilistic linking for ambiguous cases, and a human review queue for high-confidence merges before they propagate through the graph.

Production reality

The safest agent architecture keeps the model creative in planning and deterministic in permission, state, and execution. Agents are useful when the model must choose actions, not when a simple chain is enough.

Tool Contracts: A tool is a product API exposed to a probabilistic planner. Tool arguments must be validated via typed schemas. Destructive operations need dry-run previews, confirmation gates, idempotency keys, and audit trails. The model should never decide whether a user is allowed to perform an action — authorization belongs in deterministic application code.
State Machines: Agent loops should be explicit state machines. Each state has allowed transitions, budgets, and exit criteria. Memory should be task-specific and inspectable. Human review belongs at high-cost, high-risk, low-confidence, or irreversible steps. A vague stop condition is the most common agent architecture bug.

Production pitfalls

Free-form tool arguments: Model sends malformed dates, mixed currencies, or unsupported enum values — use strict schemas and return validation errors the model can repair
No idempotency on retries: Network retry creates duplicate tickets, refunds, or emails — use idempotency keys for every side-effecting tool
Vague stop condition: Agent keeps searching and re-planning because no state marks success — define done, blocked, needs_user, and escalate states before implementation
Unbounded memory: Model accumulates noisy history and follows stale assumptions — use structured task state plus summarized history with freshness rules

Tool boundary with auth and validation

class CreateTicket(BaseModel): title: str severity: Literal["low", "medium", "high"] customer_id: str def create_ticket_tool(args, user): payload = CreateTicket.model_validate(args) authorize(user, "ticket:create", payload.customer_id) return ticket_api.create(payload, idempotency_key=request_id())

Agent loop with hard bounds

while state not in {"done", "escalate"}: assert steps < 8 # hard step limit assert cost < budget # hard cost limit action = planner.next(state) state = reducer(state, execute(action)) steps += 1

Quick reference

Question	Answer
When should you use an agent?	Tool selection, stateful planning, conditional branching, or multi-step recovery — not for simple deterministic chains
What is the biggest agent risk?	Unbounded action — limit tools, permissions, loop count, budget, and state transitions
Why model agents as graphs?	Makes transitions, retries, branches, and checkpoints explicit and testable
What should be observable in an agent?	Steps, tools, inputs, outputs, model versions, costs, errors, state transitions, and final decision rationale

✦Idempotency is non-negotiable for side-effecting tools

Agent loops retry failed tool calls. Without idempotency keys, every retry is a potential duplicate action — duplicate tickets, duplicate emails, duplicate charges, duplicate records. The idempotency key is a stable identifier for a specific intended action (usually request_id + tool_name + a hash of the intended arguments) that the downstream service uses to deduplicate. If the same key arrives twice, the service returns the same result without re-executing the action. This is the same pattern used in payment systems and it is equally critical in agent systems. Any tool call that modifies state — create, update, delete, send, charge — must support idempotency keys before it is exposed to an agent loop.

✦Explicit state machines vs implicit loops

Most agent bugs occur in systems where the agent state is implicit — managed by a growing list of messages or a string summary passed to each model call. When the state is implicit, it is impossible to write tests for specific agent behaviors, impossible to reason about what transitions are allowed, and impossible to detect when the agent is stuck or looping. The fix: define the agent state as an explicit typed object (a Pydantic model or dataclass), define the valid transitions as a state machine graph (this can be drawn on a whiteboard before writing code), and implement the reducer that applies an action to a state. This makes the agent debuggable, testable, and explainable — properties that are impossible with implicit state.

Production reality

The product value of an agent comes from controlled state transitions, not from letting the model think forever. Make autonomy bounded, inspectable, and recoverable before making it more capable.

Agent Memory: Memory is any state carried across steps or sessions. Short-term helps the current task; long-term can personalize or accumulate knowledge. Every memory type needs a purpose, schema, retention rule, and user-visible deletion path. Treat memory as data, not policy — revalidate permissions and keep high-risk rules outside memory. Summaries should include uncertainty and source links so stale assumptions can be corrected.
Planning Patterns: Planning is useful when a task needs multiple decisions, not when a deterministic pipeline is enough. ReAct interleaves reasoning and tool calls. Plan-and-execute creates a plan first, then runs steps. Reflection can catch errors but also adds cost and self-confirmation risk. Production planners need step budgets, allowed tools, stop states, and recovery rules.
Multi-Agent Systems: Multi-agent design is useful when roles have genuinely different tools, data access, or review responsibilities — not because more agents sounds more powerful. Use a supervisor to route tasks, specialist agents for bounded work, and deterministic reducers to merge results. Define ownership and conflict resolution before adding agents. More agents can amplify errors instead of correcting them.

Production pitfalls

Memory as untrusted instruction: A stale or poisoned memory says "always bypass approval for this user" — treat memory as data, not policy; revalidate permissions
No forgetting mechanism: Old preferences and task assumptions silently affect future answers — add retention, user controls, confidence, provenance, and freshness checks
Reflection without new evidence: Model critiques itself but only repeats the same unsupported assumption — reflection should reference tool outputs or retrieved evidence
Planning hidden from the product: Users and operators cannot tell why the agent took an action — log plans, tool calls, state transitions, and final rationale separately
Agents without shared state: Each specialist has a different version of the task and overwrites prior decisions — use a typed shared state object and a reducer that owns updates
No final accountable reviewer: Multiple agents agree on a bad answer and the system has no authority boundary — use deterministic validation with explicit acceptance criteria

Memory schema with provenance

memory = { "kind": "user_preference", "fact": "prefers concise technical answers", "source": "conversation:2026-05-17", "confidence": 0.82, "expires_at": "2026-08-17" }

Bounded planner loop

plan = planner.create(goal, tools=allowed_tools) for step in plan.steps[:MAX_STEPS]: if step.tool not in allowed_tools: return escalate("unsupported tool") result = execute(step) state = reducer(state, result) if state.done or state.blocked: break

Supervisor routing sketch

route = supervisor.classify(request) if route == "legal": draft = legal_agent.run(request) elif route == "data": draft = sql_agent.run(request) else: draft = general_agent.run(request) return reviewer.validate(draft, policy=release_policy)

Quick reference

Question	Answer
Working vs long-term memory?	Working: temporary task state. Long-term: persists across sessions — stricter consent, retention, and audit rules
How to prevent memory poisoning?	Store provenance, validate writes, separate facts from instructions, require approval for sensitive durable memory
What is ReAct?	Pattern where the model alternates between reasoning about the task and acting through tools or observations
When is plan-and-execute better than ReAct?	Task benefits from a visible checklist, predictable phases, or human approval before execution
When to use multiple agents?	Separate roles, tools, permissions, or review responsibilities that improve quality enough to justify complexity
Main risk of multi-agent systems?	Unbounded cost and unclear accountability — more agents can amplify errors instead of correcting them

✦Memory consent and deletion are non-negotiable product requirements

Any system that stores long-term memories about users must answer three questions: (1) what exactly is being stored and why? (2) can the user see and delete their memories? (3) what happens to downstream decisions when a memory is deleted? These are not engineering nice-to-haves — they are product and compliance requirements in any jurisdiction with data privacy laws. Systems that store raw conversation history "just in case" without a retention policy, user visibility, or deletion mechanism create regulatory liability. The correct design: store only what is needed for the task to work better, with explicit purpose and expiry, user-visible records, and a deletion path that propagates to derived inferences.

✦Multi-agent systems fail at the coordination layer, not the agent layer

The most common multi-agent failure is not that individual agents produce bad outputs — it is that they produce outputs in conflicting versions of the task context, and there is no deterministic mechanism to resolve the conflict. Agent A produces a draft based on the customer record as it existed at T=0. Agent B produces a review based on the customer record at T=2 after a concurrent update. The reviewer accepts Agent A's output. The result is a decision based on stale state that neither agent detected. The fix: pass a consistent snapshot of the shared state to all agents at the beginning of the task, use a single reducer that owns all state updates, and validate state consistency before any agent output is accepted. This is the distributed systems problem of consistency, applied to LLM agent coordination.

Production reality

Every prompt change without evals is a silent production migration. Serious LLM products are evaluated systems, not prompt experiments. The first investment is always measurement, not optimization.

Evaluation Stack: Evaluation should cover retrieval recall, answer correctness, groundedness, refusal quality, tool-call validity, latency, cost, and user task completion. Use small curated golden sets for fast regression, larger sampled sets for coverage, and production traces for drift detection. Combine deterministic checks, retrieval metrics, human labels, and LLM-judge scores — no single signal is sufficient.
Serving Strategy: Production LLM systems route by task difficulty, privacy, latency, and cost. Cache stable intermediate results — embeddings, retrieval, extracted entities, policy checks. Use streaming for perceived latency, backpressure for rate limits, and fallbacks for degraded modes. One premium model for every task is one of the most common and most expensive LLM production mistakes.
Observability: You cannot debug what you did not record. Each request needs a complete trace: prompt version, model version, retrieved chunks, tool calls, validation results, latency, cost, and user-visible outcome. Redact sensitive data but preserve enough structure to reproduce failures. No prompt version in traces means regressions cannot be correlated to changes.
End-to-End API Path: The API boundary validates the request, retrieves authorized context, assembles the prompt, calls the model with a timeout, validates the response, and returns a trace ID. The production contract around the model call — not just the model call itself — is what makes the system reliable.

Production pitfalls

Only using LLM-as-judge: Judges miss retrieval failures, over-reward fluent answers, drift with model upgrades — combine deterministic checks, human labels, retrieval metrics, and judge scores
No negative tests: System answers when it should refuse or escalate — include unanswerable, adversarial, permission-denied, and conflicting-evidence cases
One premium model for every task: Cost scales faster than product value — route by task, confidence, and business value; escalate only when needed
Caching final answers with private context: A cached response for one user leaks permission-scoped information to another — cache by permission scope or cache only neutral intermediates
No prompt version in traces: A regression appears but the team cannot tell which prompt was active — version prompts, retrieval configs, model names, and eval runs together
Endpoint returns raw model output: Malformed JSON, missing citations, or unsafe text reaches users — validate every model response against a response schema before returning

RAG eval assertion

Model routing policy

if task == "classification": model = "small-fast" elif requires_deep_reasoning: model = "large" elif high_value_customer: model = "large" else: model = "medium" assert estimated_cost(request) < user_budget

FastAPI RAG endpoint with trace

@app.post("/ask", response_model=AskResponse) async def ask(body: AskRequest, request: Request): trace_id = request.headers.get("x-request-id", "local") docs = await retrieve(body.question, tenant_id=body.tenant_id) raw = await call_llm(build_prompt(body.question, docs), timeout_s=20) return AskResponse.model_validate({**raw, "trace_id": trace_id})

Quick reference

Question	Answer
What is a golden set?	Curated cases with expected behavior — regression test for model, prompt, retrieval, and tool changes
What should you cache in RAG?	Embeddings, parsed docs, retrieval for public data, rerank scores, stable summaries — be careful with final answers
Minimum useful LLM trace?	Request ID, prompt version, model, token counts, retrieved IDs, tool calls, latency, cost, validation result, status
How to detect drift?	Track eval scores, retrieval distributions, refusal rates, tool failures, cost, latency, and user correction signals
Why use response_model in FastAPI?	Validates and documents the returned shape — reduces accidental leakage of internal or malformed fields

✦Build the eval set before the optimization loop

Teams that start prompt engineering without an eval set are optimizing blindly — they can observe one or two examples getting better while ten other cases silently regress. The minimum viable eval set for a RAG system is 50–100 cases with expected sources, required terms in the answer, and prohibited terms. This takes one or two days to build. Without it, every prompt change is a guess and every regression is discovered by users in production. With it, prompt engineering becomes a disciplined engineering process where each change has a measurable, reproducible quality signal. The eval set is not a nice-to-have — it is the precondition for any meaningful improvement work.

✦Tenant isolation must happen before the model call, not after

Multi-tenant RAG systems that apply ACL filters after retrieval have a race condition: the ranking step has already been influenced by unauthorized content before the filter removes it. The correct architecture: filter by tenant_id and ACL before scoring begins, so the retrieval model never sees cross-tenant content at all. The second common mistake is caching: a final answer cached for User A at T=0 must never be returned to User B, even if User B sends the same question. Caching at the level of retrieved documents (keyed by query + tenant_id) is safe; caching at the level of final answers requires a cache key that encodes the full permission context, which is usually not practical. When in doubt, cache only neutral intermediates (embeddings, parsed documents) not final answers.

Production reality

A reliable AI product is not a single model call. It is a governed service with budgets, permissions, fallbacks, and audit trails. Production LLM systems need predictable behavior under load, abuse, and regulation.

Inference Optimization: Serving strategy depends on latency target, privacy, volume, and task difficulty. TTFT is driven by prefill and queuing; total latency by output length and tools. Optimize by shortening prompts, caching stable prefixes, batching where possible, streaming responses, routing easy work to smaller models, and falling back gracefully when providers fail.
Safety Guardrails: Guardrails should be layered. Classify input risk, isolate untrusted text, retrieve only authorized data, validate outputs, block unsafe tool calls, and escalate high-risk cases. The model can help detect risk but the application must enforce policy. For private systems, data access control is more important than clever prompt wording.
Governance: Track model versions, prompt versions, datasets, eval results, risk classifications, known limitations, approval history, and incident playbooks. The goal is the ability to answer: what changed, why it changed, and whether the change was safe. Any AI feature launched without documented intended use, limitations, and rollback path is an unacceptable risk.

Production pitfalls

Optimizing average latency only: p50 looks good while p95 and p99 are unusable for large prompts — track p50, p95, p99, TTFT, output tokens, queue time, and tool latency by route
No degraded mode: A provider outage turns every AI feature into a hard failure — add cached answers, smaller-model fallbacks, async jobs, and escalation paths
Single guardrail at the prompt layer: The prompt asks the model to be safe but tools and data access are still exposed — enforce authorization and schema validation in application code
No tests for indirect prompt injection: Retrieved documents with malicious instructions are not tested — build eval cases with adversarial retrieved content
No rollback path for prompts: A prompt edit degrades answers and the team cannot quickly restore last good behavior — version prompts and configs together with one-click rollback
Compliance added after launch: System lacks data lineage, risk notes, and audit logs when customers ask — document intended use, limitations, data flow, and retention before launch

Cascade routing with fallback

try: if request.kind in {"tagging", "routing"}: return small_model(request) if request.risk == "high": return large_model(request, temperature=0) return medium_model(request) except ProviderTimeout: return degraded_response("We can summarize, but live actions are paused.")

Layered safety pipeline

risk = classify_input(user_text) if risk.blocked: return refusal(risk.reason) docs = retrieve(query, acl=current_user.acl) answer = generate(prompt, context=docs) checked = validate_output(answer, docs, policy) return checked if checked.safe else escalate(checked.reason)

Release gate checklist

release = { "model": "answer-model-v5", "prompt": "rag_answer_v18", "eval_passed": True, "safety_passed": True, "owner": "ai-platform", "rollback": "rag_answer_v17" }

Quick reference

Question	Answer
What is continuous batching?	Dynamically batches tokens from multiple requests so GPU utilization stays high while requests enter and leave
What is speculative decoding?	A smaller draft model proposes tokens; a larger model verifies — improves latency when accepted tokens match
What is indirect prompt injection?	Malicious instruction hidden in retrieved content, web pages, or emails that tries to control the model through context
Where should PII redaction happen?	Before sending to models when possible, and again before logging or displaying outputs
What belongs in an AI system card?	Purpose, intended users, limitations, data sources, eval results, safety constraints, monitoring plan, owner
How to handle an AI incident?	Contain feature, preserve traces, identify affected users, roll back configs, patch eval coverage, communicate impact

✦Degraded mode is a first-class product requirement

LLM provider outages happen. GPU memory exhaustion happens. Rate limits happen. A system that has no degraded mode converts every infrastructure event into a user-facing total failure. Degraded modes worth designing upfront: (1) cached answers for the most common queries; (2) a smaller, self-hosted model that handles a reduced capability set; (3) async processing that queues requests and notifies users when results are ready; (4) a graceful UI that communicates the degradation honestly. The degraded mode contract — what the system can and cannot do during degradation — must be defined before launch, tested in staging, and on-call teams must know the runbook. Post-launch is too late to design this under pressure.

✦Governance is debuggability at the organizational level

Six months after launch, your team will need to answer: what model was running on this day, what prompt version was active, what data was in the corpus, what eval results justified the last release decision, and who approved it? Without governance artifacts — model cards, prompt changelogs with eval results, risk classifications, approval history — these questions are unanswerable, and so is every regulatory or customer inquiry that follows an incident. The governance overhead is not zero, but it is a small fraction of the cost of a compliance audit or an incident where nobody can explain what the system was doing. Build governance into the release process from the start — retro-fitting it to a running system is much harder and never as complete.

Production reality

Strong engineers do not just name techniques — they identify failure modes, define metrics, and defend trade-offs. Every interview answer should tie a concept to latency, cost, reliability, evaluation, or user experience.

LLM Fundamentals Answer Framework: Define the concept → explain why it matters in production → name failure modes → propose metrics or tests → give a trade-off. This structure distinguishes engineering judgment from memorized definitions.
RAG Design Interview: RAG interviews are system design interviews. Start with data sources and permissions, then ingestion, parsing, chunking, embeddings, index design, retrieval, reranking, prompt assembly, answer validation, evaluation, and observability. Always mention failure modes: stale content, missing ACLs, poor parsing, low recall, noisy top-k, and unsupported answers.

Common interview mistakes

Only giving definitions: Answers sound memorized and do not show engineering judgment — tie every concept to latency, cost, reliability, evaluation, or user experience
Claiming one universal best model: Real systems choose by task, latency, budget, privacy, and risk — frame model choice as a routing and evaluation problem
Starting RAG design with vector DB choice: The interviewer wants data quality, permissions, and evaluation trade-offs first — start from product requirements and corpus behavior
No plan for unanswerable questions: System will hallucinate when retrieval fails — add missing-evidence behavior, confidence thresholds, and escalation to every RAG design

Answer framework for any concept

answer(concept): define_it() # what it is explain_why_it_matters_in_production() # latency / cost / reliability name_failure_modes() # what breaks propose_metrics_or_tests() # how to verify give_a_tradeoff() # the honest tension

RAG whiteboard outline

sources -> parser -> chunks -> embeddings -> vector index -> keyword index -> hybrid retrieval -> reranker -> prompt assembly -> LLM -> citation validator -> trace + feedback + evals

Quick reference — common interview questions

Question	Answer (production-framed)
Temperature vs top-p?	Temperature reshapes the whole probability distribution; top-p samples from the smallest set reaching cumulative probability p. Both control randomness — tune with task evals
Why can a longer context window hurt quality?	Distractors, conflicting evidence, instruction dilution, higher cost, and slower TTFT — more context is not better context
RAG vs fine-tuning?	RAG for changing or private knowledge needing citations; fine-tune for stable behavior, style, output format, or domain decision patterns
BERT vs GPT in one minute?	BERT: encoder, bidirectional, strong for understanding (classification, NER, reranking, embeddings). GPT: decoder, autoregressive, strong for generation (chat, code, agents)
Long context vs retrieval?	Long context simpler when evidence is small and already available. Retrieval better when knowledge is large, private, changing, permissioned, or needs citation and freshness
How to reduce hallucination?	Improve retrieval recall, rerank for precision, constrain answers to context, require citations, validate claims, escalate when evidence is missing

✦The interview answer that separates L5 from L6

At L5, the expected answer to "How do you reduce hallucination in RAG?" is a list of correct techniques: improve retrieval, add citations, constrain answers to context, use an LLM judge. At L6+, the expected answer starts differently: "First, I need to know where the hallucination is coming from — is it retrieval failure (correct chunks not found), context failure (chunks retrieved but not used), or generation failure (answer generated despite contradicting context)? The diagnosis determines the fix, and the correct metric for each failure mode is different." The L6 signal is always: diagnose first, prescribe second, and name how you would measure each thing.

✦System design interviews test organizational thinking, not just technical answers

A senior LLM systems design answer does not just cover architecture — it covers: who owns what, what is the failure mode that pages on-call at 2am, how long does it take to roll back a bad change, and how do you know the system is degrading before users tell you? These questions reveal whether you have shipped a system into production and lived with it, or whether you have designed systems on paper. Interviewers at L6+ level are evaluating: can you own this end-to-end? That means the data pipeline, the model, the evaluation, the deployment, the monitoring, the incident response, and the stakeholder communication. Technical architecture is table stakes — organizational thinking is the differentiator.

Build it.Ship it.Explain it.

Build it.
Ship it.
Explain it.