Conversational AI Agent · Production Case Study

AI Engineer  ·  EdTech Platform  ·  New Delhi

Conversational AI Agent
for UPSC Aspirants

End-to-end production case study. Multi-agent RAG system, 8 system layers, full architecture with trade-offs, and production hardening from A to Z.

40K+ Active Students
95% Answer Accuracy
1–2s Avg Latency
8 System Layers

The Problem

A domain where hallucination is not an option

UPSC Civil Services is India's most competitive examination — roughly 1 million candidates compete annually for under 1,000 positions. Students at a leading EdTech platform in New Delhi needed round-the-clock access to accurate, domain-expert answers across a vast multi-domain syllabus: History, Geography, Polity, Economy, Science, Current Affairs, and Ethics. The challenge was not building a chatbot. It was building a system where answer quality could never silently degrade.

Constraint 1

Hallucination is unacceptable. A wrong answer about a constitutional article can directly damage a student's exam preparation. Generic LLM behaviour must be constrained to verified UPSC content.

Constraint 2

Scale is real. 40,000 concurrent users means the system must handle exam-season spikes without degradation. Thread-blocking inference is not viable.

Constraint 3

Cost must be controlled. GPT-4 at 40K users with unconstrained usage is financially unsustainable. Every architecture decision is a cost-quality trade-off.

Core Tension

Maximise answer quality while minimising GPT-4 API calls. This single constraint drives caching strategy, agent routing, retrieval design, and the evaluation gate in CI/CD.

GPT-4 · LangGraph · Pinecone · FastAPI · Celery · Redis · LangSmith · RAGAS · Docker · AWS ECS · GitHub Actions · Python 3.11

System Design

Eight layers, one responsibility each

No layer does two things. The gateway never touches GPT-4. The workers never handle auth. The agents never write to Redis. Separation of concerns at this scale makes debugging tractable and scaling decisions obvious.

Full System Request Flow
Gateway Layer
Student Query ──▶ FastAPI Gateway ──▶ JWT Auth ──▶ Redis Rate Limit
cache hit path: ──▶ Pinecone Semantic Cache ──▶ Stream to Client
cache miss: ──▶ Celery Task Queue ──▶ Celery Worker (AWS ECS)
Agent Layer — LangGraph
Supervisor Agent ── classifies intent ──▶ Factual | PYQ | Essay | Current Affairs
Retrieval + Generation
Pinecone Namespace Query ──▶ Parent Chunk Assembly ──▶ GPT-4 ──▶ Guardrail Check ──▶ Stream Answer
Session + Cache Write-Back
Redis Session Memory + Pinecone Cache Upsert ── async ──▶ Next query cheaper
Observability
LangSmith Traces + CloudWatch Metrics + User Feedback Loop + RAGAS Eval Gate (CI)
Infrastructure
Docker (API image) + Docker (Worker image) ──▶ AWS ECR ──▶ ECS Blue-Green Deploy
System Components

Eight layers, fully decomposed

Each component: what it does, why this approach over alternatives, the implementation with real code, and the key trade-off.

FastAPI is the system's only public surface. Four responsibilities and nothing else: authenticate the student via JWT, enforce Redis rate-limits, check the semantic cache, and dispatch work to Celery. It never touches GPT-4. FastAPI's async event loop means a 1–2s RAG pipeline call does not block the thread — at 40K concurrent users, the difference between a working system and thread starvation.

FastAPI async patterns

Django and Flask are synchronous by default. A 1.5s GPT-4 call blocks a worker thread for its full duration. A thread pool of 40 exhausts in under a second at peak load. FastAPI with uvicorn handles each request as a coroutine — the thread stays free while Celery executes the task. The two-step POST → stream pattern adds one round-trip but lets clients reconnect the stream on mobile network drops — essential for students on slow connections.

Python — FastAPI gateway: auth · rate-limit · cache · Celery dispatch
@app.post("/chat")
async def chat(req: ChatRequest, user: User = Depends(verify_jwt)):
    # Rate limit: 100 queries/student/day, counted with an atomic Redis INCR
    rate_key = f"rate:{user.id}:{today()}"
    count = await redis.incr(rate_key)
    if count == 1:
        await redis.expire(rate_key, 86400)  # set the TTL once, on the first query of the day
    if count > 100:
        raise HTTPException(429, "Daily query limit reached")

    # Semantic cache lookup — zero GPT-4 cost on hit
    cached = await semantic_cache.lookup(req.query)
    if cached:
        return StreamingResponse(stream_string(cached))

    # Dispatch to Celery — return task ID in <5ms, never block
    task = rag_pipeline.delay(req.query, req.session_id, user.id)
    return {"task_id": task.id}

@app.get("/stream/{task_id}")
async def stream_result(task_id: str, user: User = Depends(verify_jwt)):
    # SSE stream: tokens arrive as worker produces them
    return StreamingResponse(
        token_generator(task_id),
        media_type="text/event-stream",
    )
Trade-off

Two-step POST → stream adds one round-trip. Benefit: the gateway never blocks; clients reconnect the SSE stream on network drops. At 40K mobile users, reconnect resilience is not optional.
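
The gateway code calls `token_generator(task_id)` without showing it. The sketch below is one way the reconnect behaviour described above could work, assuming tokens are buffered per task and each SSE event carries an `id:` field so a client can resume from the standard `Last-Event-ID` header. The names `format_sse` and `resume_stream` are hypothetical and not taken from the production code.

```python
from typing import Iterable, Iterator, Optional


def format_sse(event_id: int, data: str) -> str:
    """Serialise one Server-Sent Event; the blank line terminates the event."""
    return f"id: {event_id}\ndata: {data}\n\n"


def resume_stream(
    tokens: Iterable[str], last_event_id: Optional[int] = None
) -> Iterator[str]:
    """Yield SSE frames, skipping events the client has already received."""
    start = 0 if last_event_id is None else last_event_id + 1
    for event_id, token in enumerate(tokens):
        if event_id >= start:
            yield format_sse(event_id, token)


# Hypothetical buffered tokens for one task.
buffered = ["Article", " 21", " protects", " life."]

# First connection drops after two events have been delivered.
first_attempt = list(resume_stream(buffered))[:2]

# The client reconnects and reports the last event it saw (id 1).
after_reconnect = list(resume_stream(buffered, last_event_id=1))
```

Tokens containing newlines would need one `data:` line per text line; the sketch ignores that case.
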

Three separate jobs run in this layer, and each is tuned, monitored, and reasoned about independently. Mixing them into one key namespace would make each harder to debug and optimize. Job 1: semantic response caching (primary cost lever; stored in a Pinecone namespace rather than Redis, for the reason given in the trade-off below). Job 2: per-user rate limiting in Redis (cost guard). Job 3: conversation session memory in Redis (context store for multi-turn coherence).

Data & caching patterns

At 40K students asking overlapping UPSC questions, cache hit rate is the single most impactful cost metric. A 40% hit rate means 40% of GPT-4 API spend simply does not happen. Threshold matters critically: at similarity 0.88 we had 6% false cache hits — wrong answers served. Moving to 0.92 reduced false hits to under 1% with only a 4% hit rate reduction. That trade-off is correct for high-stakes exam content.

Python — Redis: semantic cache · rate limit · bounded session memory
import asyncio
import json

# Job 1: Semantic cache: query the Pinecone "cache" namespace
async def cache_lookup(query: str, threshold: float = 0.92) -> str | None:
    embedding = await embed(query)
    # The Pinecone client call is blocking, so run it off the event loop
    results = await asyncio.to_thread(
        pinecone_index.query,
        vector=embedding, top_k=1, namespace="cache",
        include_metadata=True,
    )
    if results.matches and results.matches[0].score >= threshold:
        return results.matches[0].metadata["answer"]
    return None

# Job 2: Rate limiter — atomic, no race conditions
async def check_rate_limit(user_id: str) -> bool:
    key = f"rate:{user_id}:{today()}"
    count = await redis.incr(key)
    if count == 1:
        await redis.expire(key, 86400)  # reset daily
    return count <= 100

# Job 3: Session memory — bounded 10-turn window, 30-min TTL
async def get_history(session_id: str) -> list[dict]:
    raw = await redis.lrange(f"session:{session_id}", 0, 9)
    return [json.loads(m) for m in reversed(raw)]

async def append_turn(session_id: str, message: dict):
    key = f"session:{session_id}"
    await redis.lpush(key, json.dumps(message))
    await redis.ltrim(key, 0, 9)     # enforce 10-turn limit
    await redis.expire(key, 1800)    # 30-min inactivity TTL
Trade-off

Semantic cache lives in a Pinecone namespace (not Redis) — embedding similarity search is Pinecone's native operation. A Redis+FAISS alternative would require maintaining a second in-memory vector index with operational overhead.
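
The 0.88 versus 0.92 comparison implies a threshold sweep over labelled query pairs. A minimal sketch of such a sweep follows. The `Pair` type, the toy data, and the choice to report false hits as a share of cache hits are all assumptions for illustration; the case study does not say how its 6% figure was computed.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    similarity: float   # cosine score between a new query and a cached query
    same_answer: bool   # human label: would the cached answer be correct?


def sweep(
    pairs: list[Pair], thresholds: list[float]
) -> dict[float, tuple[float, float]]:
    """Return {threshold: (hit_rate, false_hit_rate)} over the labelled pairs."""
    report = {}
    for t in thresholds:
        hits = [p for p in pairs if p.similarity >= t]
        false_hits = [p for p in hits if not p.same_answer]
        hit_rate = len(hits) / len(pairs)
        # False-hit rate is measured against hits, not against all queries.
        false_rate = len(false_hits) / len(hits) if hits else 0.0
        report[t] = (hit_rate, false_rate)
    return report


# Invented toy data, not production measurements.
labelled = [
    Pair(0.97, True), Pair(0.95, True), Pair(0.93, True),
    Pair(0.90, False), Pair(0.89, True), Pair(0.80, False),
]
report = sweep(labelled, [0.88, 0.92])
```

Running the sweep on a held-out labelled set before changing the production threshold makes the hit-rate versus wrong-answer trade-off explicit.
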

Celery decouples the API layer from the inference layer entirely. FastAPI dispatches a task and returns in under 5ms. Workers execute the full RAG pipeline independently on AWS ECS and scale horizontally without touching the API servers. A lightweight Lambda publishes Celery queue depth to CloudWatch every 30 seconds — ECS scales workers up when depth exceeds 50 pending tasks, down below 10.

MLOps & async pipelines

Without Celery, every FastAPI request would block 1–2s while the RAG pipeline ran in-process. At 40K users, even with async FastAPI, you'd need 40K concurrent RAG pipelines running simultaneously — impossible. With Celery, workers scale as an independent fleet. Peak queue depth during exam season: ~380 tasks. Auto-scaling ECS absorbs this in seconds without over-provisioning year-round.

Python — Celery worker: restore context · invoke LangGraph · cache result
import openai

@celery.task(bind=True, max_retries=3)
def rag_pipeline(self, query: str, session_id: str, user_id: str):
    try:
        # Restore conversation context from Redis
        history = get_session_history_sync(session_id)

        # Invoke LangGraph supervisor agent
        result = supervisor_graph.invoke({
            "query": query,
            "history": history,
            "user_id": user_id,
        })

        # Cache the answer in the Pinecone cache namespace, but only when the
        # guardrail check passed, so a rejected answer is never replayed
        # (assumes the guardrail step sets guardrail_passed in the state)
        if result.get("guardrail_passed"):
            pinecone_index.upsert(
                vectors=[{
                    "id": cache_id(query),
                    "values": embed_sync(query),
                    "metadata": {"answer": result["answer"]},
                }],
                namespace="cache",
            )

        # Append both sides of the turn so follow-up questions keep their context
        append_turn_sync(session_id, {"role": "user", "content": query})
        append_turn_sync(session_id, {
            "role": "assistant", "content": result["answer"]
        })
        return result

    except openai.RateLimitError as e:
        # Back off longer when OpenAI is throttling
        raise self.retry(exc=e, countdown=5)
    except Exception as e:
        raise self.retry(exc=e, countdown=2)
Trade-off

Separating API and worker containers means each scales to a different signal. API scales by CPU/memory; workers scale by queue depth. During exam-season spikes, workers go 4 → 20+ while the API tier stays at 3 instances unchanged.
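
The queue-depth rule described above (scale up above 50 pending tasks, scale down below 10) can be sketched as a small decision function. The thresholds come from the text; the step size and the worker bounds are assumptions, and in practice this logic would live in an ECS auto-scaling policy driven by the CloudWatch metric rather than in application code.

```python
def desired_workers(
    current: int,
    queue_depth: int,
    scale_up_at: int = 50,     # from the case study
    scale_down_at: int = 10,   # from the case study
    step: int = 4,             # assumed step size
    min_workers: int = 4,      # assumed floor
    max_workers: int = 24,     # assumed ceiling
) -> int:
    """Return the worker count to request for the observed queue depth."""
    if queue_depth > scale_up_at:
        return min(current + step, max_workers)
    if queue_depth < scale_down_at:
        return max(current - step, min_workers)
    # Between the two thresholds nothing changes, which avoids flapping.
    return current


# Simulated queue depths over successive 30-second samples (invented values).
workers = 4
history = []
for depth in [20, 120, 380, 200, 60, 30, 8, 5]:
    workers = desired_workers(workers, depth)
    history.append(workers)
```

The gap between the two thresholds is what keeps the fleet from oscillating when the queue hovers around a single cut-off.
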

The agent follows a Supervisor → Specialist pattern. A supervisor node classifies query intent and routes to one of four domain specialists: Factual Lookup, PYQ Analysis, Essay Guidance, Current Affairs. Each specialist has its own Pinecone namespace, prompt template, and LangSmith trace — completely isolated. A regression in the PYQ agent does not require re-evaluating the full system.

Agentic workflows & LangGraph

A monolithic agent with a single system prompt for all UPSC query types consistently underperforms on edge cases. "What is the 42nd Constitutional Amendment?" needs a different retrieval namespace and prompt framing than "Give me an essay structure on India-China border disputes." Separate agents mean separate optimization loops, separate prompt versions, and isolated failure surfaces. The supervisor adds ~50ms classification latency, which is worth it for the per-domain accuracy gains.

Python — LangGraph: typed state · supervisor routing · specialist pattern
from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgentState(TypedDict):
    query:            str
    history:          list[dict]
    intent:           str           # set by supervisor
    retrieved_chunks: list[str]
    answer:           str
    sources:          list[str]
    guardrail_passed: bool
    token_usage:      dict          # per-node cost attribution

def supervisor_node(state: AgentState) -> AgentState:
    intent = classify_intent(state["query"], state["history"])
    return {**state, "intent": intent}

def route(state: AgentState) -> str:
    return state["intent"]  # "factual" | "pyq" | "essay" | "current_affairs"

def factual_agent(state: AgentState) -> AgentState:
    chunks = pinecone_query(state["query"], namespace="factual", top_k=5)
    answer = gpt4_generate(FACTUAL_PROMPT, chunks, state["history"])
    return {**state, "retrieved_chunks": chunks, "answer": answer}

graph = StateGraph(AgentState)
graph.add_node("supervisor",      supervisor_node)
graph.add_node("factual",         factual_agent)
graph.add_node("pyq",             pyq_agent)
graph.add_node("essay",           essay_agent)
graph.add_node("current_affairs", current_affairs_agent)
graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route, {
    "factual": "factual", "pyq": "pyq",
    "essay": "essay", "current_affairs": "current_affairs",
})
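# Every specialist is terminal: route its output to END
for name in ("factual", "pyq", "essay", "current_affairs"):
    graph.add_edge(name, END)
supervisor_graph = graph.compile()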
Trade-off

Four specialist agents add ~50ms for the classification step per request, because all queries pass through the supervisor. The latency cost is fixed and fits within the 1–2s budget, and the per-domain accuracy improvement justifies it.
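
`classify_intent` is not shown in the graph code. The sketch below covers only the parts around the model call: building a few-shot prompt and normalising the model's raw label so that an unexpected output cannot break the routing map. The example queries are taken from this case study; the function names, the prompt wording, and the choice of `factual` as the safe default are assumptions.

```python
VALID_INTENTS = ("factual", "pyq", "essay", "current_affairs")

# Few-shot examples; the production set is not published, these are illustrative.
FEW_SHOT = [
    ("What is the 42nd Constitutional Amendment?", "factual"),
    ("UPSC 2022 GS2 questions on judiciary", "pyq"),
    ("Give me an essay structure on India-China border disputes", "essay"),
]


def build_classifier_prompt(query: str) -> str:
    """Assemble instruction, few-shot examples, and the query to classify."""
    parts = ["Classify the query as one of: " + ", ".join(VALID_INTENTS) + "."]
    for example, label in FEW_SHOT:
        parts.append(f"Query: {example}\nIntent: {label}")
    parts.append(f"Query: {query}\nIntent:")
    return "\n\n".join(parts)


def parse_intent(raw: str, default: str = "factual") -> str:
    """Normalise the model's raw label; fall back to a default if it is unknown."""
    label = raw.strip().lower().replace(" ", "_").strip(".\"'")
    return label if label in VALID_INTENTS else default


prompt = build_classifier_prompt("Explain the doctrine of basic structure.")
intent = parse_intent(" Current Affairs.\n")
```

Without the fallback, a label outside the four expected values would have no matching entry in the conditional-edge mapping.
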

Three retrieval decisions define accuracy: (1) namespace-per-domain — each UPSC subject has its own Pinecone namespace, eliminating cross-domain retrieval noise; (2) hierarchical parent-child chunking — 128-token child chunks for precision retrieval, 512-token parent chunks assembled for GPT-4 context; (3) a dedicated cache namespace for semantic response caching, eliminating a second vector store.

Advanced RAG & vector search

LangSmith trace data exposed that without namespacing, the factual agent occasionally pulled irrelevant Current Affairs chunks when UPSC topic keywords overlapped with recent news. Hierarchical chunking addressed truncated-answer hallucinations — small chunks retrieve precisely but GPT-4 needs broader context to give complete answers. Reducing top-k from 8 to 5 (validated by RAGAS) cut token cost 22% with no accuracy loss.

Python — Pinecone: namespace design · hierarchical upsert · cache lookup
# Namespace-per-domain: each subject isolated
DOMAIN_NAMESPACES = {
    "factual": ["polity", "history", "economy", "geography"],
    "pyq":     ["pyq_prelims", "pyq_mains"],
    "essay":   ["essay_structure", "essay_content"],
    "current": ["current_affairs"],
    "cache":   ["cache"],
}

def upsert_document(doc: Document, domain: str):
    """Hierarchical: child chunks indexed, parent assembled for context."""
    vectors = []
    for i, parent in enumerate(chunk(doc.text, size=512, overlap=50)):
        for j, child in enumerate(chunk(parent, size=128, overlap=20)):
            vectors.append({
                "id": f"{doc.id}_p{i}_c{j}",
                "values": embed(child),
                "metadata": {
                    "child_text":  child,
                    "parent_text": parent,   # full context for GPT-4
                    "domain":      domain,
                    "updated_at":  doc.updated_at,
                },
            })
    # Batched upserts: one network call per 100 vectors, not one per chunk
    for start in range(0, len(vectors), 100):
        pinecone_index.upsert(vectors=vectors[start:start + 100], namespace=domain)

def retrieve(query: str, namespace: str, top_k: int = 5) -> list[str]:
    """Returns parent text (complete context) for matched child chunks."""
    results = pinecone_index.query(
        vector=embed(query), top_k=top_k,
        namespace=namespace, include_metadata=True,
    )
    # Several children can share one parent; keep each parent once, in rank order
    parents = [m.metadata["parent_text"] for m in results.matches]
    return list(dict.fromkeys(parents))
Trade-off

Hierarchical chunking doubles upsert complexity, and every vector carries two text fields (child + parent) in its payload. Benefit: retrieval precision from child chunks plus generation context from parent chunks. The 95% accuracy benchmark justifies the complexity.
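
The upsert code relies on a `chunk(text, size, overlap)` helper that is not shown. A minimal sliding-window sketch follows. It counts whitespace-separated words as tokens, which is an assumption made to keep the example dependency-free; a production version would more likely count tokens with the embedding model's tokenizer.

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` tokens that overlap by `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()          # assumption: whitespace words stand in for tokens
    step = size - overlap
    pieces = []
    for start in range(0, len(tokens), step):
        pieces.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break                  # this window already reaches the end of the text
    return pieces


# Ten numbered tokens, windows of four with an overlap of one.
sample = " ".join(str(i) for i in range(10))
pieces = chunk(sample, size=4, overlap=1)

# Parent/child nesting as in the upsert code, with small sizes for readability.
nested = [chunk(parent, size=2, overlap=0) for parent in pieces]
```

The early `break` prevents a trailing fragment that would only repeat tokens already covered by the previous window.
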

Three images, each with a single entrypoint: API gateway, Celery worker, Redis (standard image). API and worker share the same Python codebase but deploy as independent ECS services with different resource allocations and scaling policies. Multi-stage builds keep the API image at ~190MB (no ML dependencies) and the worker image at ~380MB (full stack). The same docker-compose.yml runs locally and in CI for environment parity.

MLOps & deployment

If API and workers run as the same process, they must scale together. During a traffic spike, you need inference capacity (workers), not gateway capacity (API). Separating into two ECS services means the API scales by CPU/request count and workers scale by queue depth — the correct signal for each. Independent scaling avoids over-provisioning the gateway tier during exam-season inference spikes.

Dockerfile + docker-compose — multi-stage API image + worker service
# Dockerfile.api — multi-stage, lean gateway image (~190MB)
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements/api.txt .
RUN pip install --no-cache-dir -r api.txt

FROM python:3.11-slim
WORKDIR /app
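COPY --from=builder /usr/local/lib/python3.11 /usr/local/lib/python3.11
# Console scripts such as uvicorn are installed to /usr/local/bin
COPY --from=builder /usr/local/bin /usr/local/bin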
COPY src/ .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

# ── docker-compose.yml (local = production parity) ────────────
services:
  api:
    build: { context: ., dockerfile: Dockerfile.api }
    env_file: .env.local
    ports: ["8000:8000"]
    depends_on: [redis]

  worker:
    build: { context: ., dockerfile: Dockerfile.worker }
    command: celery -A tasks worker --concurrency=8 --loglevel=info
    env_file: .env.local
    depends_on: [redis]

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
Trade-off

Two Dockerfiles mean two image definitions to maintain. Payoff: API image starts 40% faster (no ML stack import time), workers can use GPU-optimised base images without bloating the API container, and each can be updated independently without a joint deploy.

The pipeline enforces one rule: no code or prompt change reaches production without RAGAS eval sign-off. The eval suite runs first — before unit tests, before Docker build. A failing eval blocks the entire pipeline. Prompts are versioned in LangSmith Hub and pinned by tag in production config. ECS blue-green keeps the old version fully alive until the new one passes health checks.

MLOps & CI/CD for LLMs

Blue-green over rolling: with 40K active students, a broken rolling deploy where 50% of requests fail for 3 minutes is a serious incident — especially during exam season. Blue-green means zero downtime and instant rollback. The RAGAS gate runs first because a green test suite is meaningless if answer quality dropped. The 4-minute eval overhead is the correct trade-off for a system students depend on for exam preparation.

YAML — GitHub Actions: RAGAS gate → tests → Docker → ECR → ECS blue-green
# .github/workflows/deploy.yml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run RAGAS eval suite  # blocks deploy on threshold failure
        run: |
          python evals/run_ragas.py \
            --dataset evals/golden_set.json \
            --thresholds '{"faithfulness":0.90,"answer_relevancy":0.88,"accuracy":0.93}'

  test:
    needs: evaluate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/ -v --cov=app --cov-fail-under=80

  build-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push API + worker images to ECR
        run: |
          docker build -f Dockerfile.api    -t $ECR_REGISTRY/upsc-api:$GITHUB_SHA .
          docker build -f Dockerfile.worker -t $ECR_REGISTRY/upsc-worker:$GITHUB_SHA .
          docker push $ECR_REGISTRY/upsc-api:$GITHUB_SHA
          docker push $ECR_REGISTRY/upsc-worker:$GITHUB_SHA

  deploy:
    needs: build-push
    runs-on: ubuntu-latest
    steps:
      - name: Blue-green ECS deploy (old version stays live until health checks pass)
        run: |
          aws ecs update-service \
            --cluster upsc-prod --service upsc-api \
            --task-definition upsc-api:$NEW_REVISION \
            --deployment-configuration minimumHealthyPercent=100,maximumPercent=200
Trade-off

RAGAS eval adds ~4 minutes to every deploy. That is the correct trade-off: a 4-minute delay to ensure answer quality never drops below threshold is a bargain for a 40K-user live exam-prep system.
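
The workflow calls `evals/run_ragas.py`, which is not shown. The gating step at its end could look like the sketch below: compare each measured score with its threshold and exit non-zero if any falls short, which is what makes the CI job fail. The threshold string is the one from the workflow; the score values and the function name `failed_metrics` are invented for illustration, and the RAGAS evaluation itself is out of scope here.

```python
import json


def failed_metrics(
    scores: dict[str, float], thresholds: dict[str, float]
) -> dict[str, tuple[float, float]]:
    """Return {metric: (score, threshold)} for every metric below its threshold."""
    failures = {}
    for metric, minimum in thresholds.items():
        score = scores.get(metric, 0.0)   # a missing metric counts as a failure
        if score < minimum:
            failures[metric] = (score, minimum)
    return failures


# Thresholds as passed on the command line in the workflow.
thresholds = json.loads(
    '{"faithfulness":0.90,"answer_relevancy":0.88,"accuracy":0.93}'
)

# Invented scores standing in for one evaluation run.
example_scores = {"faithfulness": 0.94, "answer_relevancy": 0.85, "accuracy": 0.95}

failures = failed_metrics(example_scores, thresholds)
exit_code = 1 if failures else 0   # a real script would pass this to sys.exit()
```

Treating a missing metric as a failure keeps a broken evaluation run from passing the gate silently.
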

LangSmith makes the invisible visible. Every LangGraph node is traced with individual latency, token counts (input + output), retrieved chunk content with similarity scores, and the exact GPT-4 prompt for every production request. User feedback (thumbs up/down) feeds prompt refinement cycles. Custom CloudWatch dashboards track the four operational KPIs: queue depth, cache hit rate, per-specialist error rate, and P95 latency.

LLM system design & evaluation

Three production fixes came directly from LangSmith data: (1) 12% Supervisor misclassification rate discovered via intent trace analysis — fixed with 6 few-shot examples, dropped to under 2%; (2) Polity namespace latency anomaly found via per-node latency traces — namespace index size imbalance, resolved by splitting; (3) Token cost opportunity from token count traces — reducing chunks 8→5 cut cost 22%, validated zero accuracy loss via RAGAS. Operating without trace visibility is flying blind.

Python — LangSmith: prompt pinning · node tracing · cost attribution · feedback
import openai
from langsmith import Client, traceable
from langsmith.run_helpers import get_current_run_tree

ls = Client()

# Pin prompt versions — never "latest" in production
SUPERVISOR_PROMPT = ls.pull_prompt("upsc-supervisor:prod-v1.4")
FACTUAL_PROMPT    = ls.pull_prompt("upsc-factual:prod-v2.1")

@traceable(name="factual_agent", tags=["specialist", "factual"])
def factual_agent(state: AgentState) -> AgentState:
    run = get_current_run_tree()

    chunks = retrieve(state["query"], namespace="factual")
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=build_messages(FACTUAL_PROMPT, chunks, state["history"]),
    )

    # Token usage tracked per node for cost attribution
    if run is not None:  # there is no run tree when tracing is disabled
        run.extra["token_usage"] = response.usage.model_dump()
    return {**state, "answer": response.choices[0].message.content}

# User feedback captured per run for prompt refinement
def record_feedback(run_id: str, score: int, comment: str = ""):
    ls.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=score,          # 1 = thumbs up, 0 = thumbs down
        comment=comment,
    )
Trade-off

Storing every production trace in LangSmith is expensive: at 40K queries/day, trace volume is significant. Sampling strategy: 100% of errors are always traced and 10% of successes are sampled. This preserves full visibility on failures while keeping storage costs manageable.
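
The sampling rule above (all errors, 10% of successes) can be made deterministic by hashing the run ID, so the same run always gets the same decision regardless of which worker handles it. This is a sketch under that assumption; the function name `should_trace` is hypothetical and the case study does not say how sampling is actually implemented.

```python
import hashlib


def should_trace(run_id: str, is_error: bool, success_rate: float = 0.10) -> bool:
    """Keep every error trace and a fixed, deterministic share of successes."""
    if is_error:
        return True
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


# Share of successful runs kept over a batch of synthetic run IDs.
run_ids = [f"run-{i}" for i in range(10_000)]
sampled_share = sum(should_trace(r, is_error=False) for r in run_ids) / len(run_ids)
```

Hash-based sampling also makes it possible to reproduce later which successful runs were kept.
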

Design Decisions

Eight decisions that shaped the system

Every architectural decision made, the alternative considered, and the reasoning. This is the section that matters most in a system design interview.

| Decision | Chosen Approach | Alternative | Impact | Reasoning |
|---|---|---|---|---|
| Agent pattern | Supervisor + 4 specialists | Single monolithic agent | High | Isolated prompts, namespaces, traces per domain. A regression in one specialist does not contaminate others. Debugging stays tractable. |
| Inference execution | Celery async workers | Sync FastAPI endpoint | Critical | Prevents FastAPI thread exhaustion at 40K concurrent users. Workers scale horizontally independent of the API layer. |
| Semantic cache store | Pinecone cache namespace | Redis + FAISS in-memory | Medium | Pinecone already running. Cache is a namespace, not a second service. Eliminates a second vector store to maintain. |
| Chunking strategy | Hierarchical parent-child | Fixed 512-token chunks | High | Child chunks give retrieval precision. Parent chunks give generation context. Reduces truncated-answer hallucinations. |
| Retrieved chunks | 5 per query | 8–10 chunks | High | LangSmith traces showed no accuracy gain above 5 (RAGAS validated). Reducing cut latency 300ms and GPT-4 cost 22%. |
| Deployment strategy | ECS blue-green | Rolling update | Critical | 40K active users. A broken rolling deploy affects live traffic immediately. Blue-green keeps old version alive until health checks pass. |
| Prompt versioning | LangSmith Hub (pinned tags) | Git text files | High | LangSmith ties prompt versions to trace data. Every production request permanently linked to the exact prompt that generated it. |
| Cache threshold | 0.92 similarity | 0.88 (initial) | High | At 0.88, false cache hits were 6%. At 0.92, under 1%. Only a 4% hit rate reduction; correct trade-off for exam content. |

Production Lessons

Lesson 1

The Ingestion Pipeline Is Half the System

Most teams spend 90% of effort on the query path and 10% on ingestion. The quality of what is in Pinecone determines 80% of answer quality. UPSC content changes annually. Without a robust, versioned ingestion pipeline, accuracy silently degrades after every content update. This is the piece that keeps a 95% benchmark honest over time.

Lesson 2

LangSmith Pays for Itself in Week One

The first week of production tracing revealed three issues that would have taken days to reproduce from application logs. The Supervisor misclassification, the Polity namespace latency, and the token count opportunity were all found through traces, not logs. Operating an LLM pipeline in production without full trace visibility is flying blind.

Lesson 3

Cache Hit Rate Is Your Best Cost KPI

Tracking and optimising semantic cache hit rate is the highest-leverage cost intervention at 40K users. Tuning the similarity threshold from 0.88 to 0.92 reduced false cache hits from 6% to under 1% — the difference between occasionally serving wrong answers (unacceptable in an exam context) and reliably serving correct ones.

Production Hardening

Four additions that complete the system

What separates a production system from a prototype is how it handles edge cases, failures, and operational realities. These four additions are the gap between 90% complete and production-ready.

01 Cross-Encoder Reranker
Hardened

Problem: Pinecone returns top-K chunks by embedding similarity — cosine distance in embedding space does not always correlate with contextual relevance for GPT-4. Long-tail UPSC queries (specific Article numbers, Act citations, historical dates) suffered most because surface similarity diverged from semantic fit.

Solution: Cohere Rerank cross-encoder between Pinecone retrieval and GPT-4 generation. Cross-encoders score query-chunk pairs jointly — not independently — producing significantly more accurate relevance rankings than bi-encoder similarity alone.

Estimated +3–5% accuracy on factual domain queries. +150ms latency, well within the 2s budget.
Python — Cohere Rerank: broad retrieval → cross-encoder reranking → top-5
import cohere

co = cohere.Client(api_key=COHERE_API_KEY)

def retrieve_and_rerank(
    query: str, namespace: str, top_k: int = 5
) -> list[str]:
    # Step 1: broad candidate pool from Pinecone (20, not 5)
    raw_chunks = retrieve(query, namespace=namespace, top_k=20)

    # Step 2: cross-encoder reranking scores each (query, chunk) pair jointly
    results = co.rerank(
        query=query,
        documents=raw_chunks,
        top_n=top_k,
        model="rerank-english-v3.0",
        return_documents=True,
    )
    return [r.document.text for r in results.results]

# Replace in each specialist agent:
# Before: chunks = retrieve(state["query"], namespace="factual", top_k=5)
# After:  chunks = retrieve_and_rerank(state["query"], "factual", top_k=5)
Advanced RAG patterns
02 GPT-4 Fallback Chain
Hardened

Problem: OpenAI API outages during peak exam windows (Prelims in June, Mains in September) surface as 500 errors for 40K students with zero mitigation. No fallback tier existed — a single provider failure meant complete service outage at the most critical usage time.

Solution: A LangGraph conditional node implementing three fallback tiers: GPT-4 → GPT-3.5-turbo → semantic cache (relaxed threshold 0.85). Each tier has a timeout. The response includes a model quality signal so the client can surface a subtle "reduced quality" indicator when serving a degraded-tier answer.

Eliminates complete outages during OpenAI incidents. GPT-3.5 handles ~80% of factual queries adequately. Cache fallback covers an additional ~40% of degraded-mode requests.
Python — LangGraph fallback: GPT-4 → GPT-3.5 → cache → graceful error
def generation_with_fallback(state: AgentState) -> AgentState:
    """Three-tier fallback. Each tier has independent timeout."""

    for model, timeout in [("gpt-4", 10), ("gpt-3.5-turbo", 8)]:
        try:
            response = openai.chat.completions.create(
                model=model,
                messages=build_messages(state),
                timeout=timeout,
            )
            return {
                **state,
                "answer":      response.choices[0].message.content,
                "model_used":  model,
                "is_fallback": model != "gpt-4",
            }
        except (openai.APIError, openai.APITimeoutError, openai.APIConnectionError):
            continue

    # Tier 3: semantic cache with relaxed threshold (0.85 vs 0.92)
    cached = semantic_cache.lookup(state["query"], threshold=0.85)
    if cached:
        return {**state, "answer": cached, "model_used": "cache", "is_fallback": True}

    return {**state, "answer": SERVICE_DEGRADED_MSG, "model_used": "none", "is_fallback": True}
LLM system reliability
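
A fallback chain is easier to trust if it can be exercised without a real provider outage. The sketch below restates the tier logic with pluggable callables so that failures can be simulated in a unit test. `answer_with_fallback` and the stub tiers are hypothetical names; this is not the production node, only one way to make the same control flow testable.

```python
from typing import Callable, Optional

# A tier is a name plus a callable that returns an answer, or None on a miss.
Tier = tuple[str, Callable[[str], Optional[str]]]


def answer_with_fallback(query: str, tiers: list[Tier], primary: str) -> dict:
    """Try each tier in order and report which one produced the answer."""
    for name, generate in tiers:
        try:
            answer = generate(query)
        except Exception:
            continue               # tier failed: fall through to the next one
        if answer is not None:
            return {
                "answer": answer,
                "model_used": name,
                "is_fallback": name != primary,
            }
    return {"answer": None, "model_used": "none", "is_fallback": True}


# Stub tiers that simulate an outage, a healthy model, and a cache miss.
def failing_tier(query: str) -> Optional[str]:
    raise TimeoutError("simulated provider outage")


def healthy_tier(query: str) -> Optional[str]:
    return f"answer to: {query}"


def cache_miss_tier(query: str) -> Optional[str]:
    return None


degraded = answer_with_fallback(
    "What is the 73rd Constitutional Amendment?",
    [("gpt-4", failing_tier), ("gpt-3.5-turbo", healthy_tier), ("cache", cache_miss_tier)],
    primary="gpt-4",
)
total_outage = answer_with_fallback(
    "q", [("gpt-4", failing_tier), ("cache", cache_miss_tier)], primary="gpt-4"
)
```

Catching every exception is acceptable in a test harness; the production node should catch only the provider's error types so that programming errors still surface.
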
03 Embedding Model Migration Pipeline
Hardened

Problem: Updating the embedding model (e.g., text-embedding-ada-002 → text-embedding-3-large) silently invalidates all existing Pinecone vectors. New query embeddings are incompatible with old stored vectors — cosine distance comparisons return noise, answer quality collapses, and no error fires. No migration pipeline existed.

Solution: A Celery pipeline that creates a versioned namespace, re-embeds all documents with the new model, runs RAGAS eval against the new namespace, and performs an atomic config cutover only if the eval passes threshold. The old namespace is retained for a 7-day rollback window.

Enables safe embedding model upgrades with zero downtime. Dual-namespace atomic cutover means students never experience a degraded retrieval state during migration.
Python — Celery migration: re-embed → RAGAS gate → atomic namespace cutover
@celery.task(bind=True)
def migrate_embedding_model(self, domain: str, new_model: str):
    old_ns  = get_active_namespace(domain)
    new_ns  = f"{domain}_{model_version(new_model)}"

    try:
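        # 1. Fetch every vector from the old namespace (IDs and metadata are reused)
        old_vectors = list(fetch_all_vectors(namespace=old_ns))

        # 2. Re-embed the indexed child text with the new model (batched).
        #    Retrieval matches child chunks, so those are what must be re-embedded.
        texts = [v.metadata["child_text"] for v in old_vectors]
        embeddings = embed_batch(texts, model=new_model, batch_size=100)

        # 3. Upsert to new namespace, keeping each vector's ID and metadata
        new_vecs = [
            {"id": v.id, "values": emb, "metadata": v.metadata}
            for v, emb in zip(old_vectors, embeddings)
        ]
        upsert_namespace(new_vecs, namespace=new_ns)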

        # 4. RAGAS eval gate — must pass before cutover
        score = run_ragas_eval(namespace=new_ns,
                               dataset="evals/golden_set.json")

        if score["accuracy"] < 0.93:
            raise ValueError(
                f"Migration failed RAGAS gate: {score['accuracy']:.2%}. "
                f"Old namespace {old_ns} unchanged."
            )

        # 5. Atomic cutover — old namespace retained for rollback
        set_active_namespace(domain, new_ns)
        schedule_cleanup(old_ns, delay_days=7)

    except Exception as e:
        notify_team(f"Migration FAILED for {domain}: {e}")
        raise
MLOps & model lifecycle
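
The migration task relies on `get_active_namespace` and `set_active_namespace`, which are not shown. The cutover and rollback can be sketched as a pointer swap: queries always resolve a domain to its active namespace, and the previous value is remembered for the rollback window. The in-memory `NamespaceRegistry` below is a hypothetical stand-in; a real deployment would need a shared store such as Redis or a config service so that all workers see the switch.

```python
class NamespaceRegistry:
    """Maps each domain to its active Pinecone namespace, with one-step rollback."""

    def __init__(self) -> None:
        self._active: dict[str, str] = {}
        self._previous: dict[str, str] = {}

    def get_active(self, domain: str) -> str:
        # Before any migration, the namespace is simply the domain name.
        return self._active.get(domain, domain)

    def set_active(self, domain: str, namespace: str) -> None:
        """Cut over to a new namespace, remembering the old one."""
        self._previous[domain] = self.get_active(domain)
        self._active[domain] = namespace

    def rollback(self, domain: str) -> str:
        """Restore the namespace that was active before the last cutover."""
        if domain not in self._previous:
            raise KeyError(f"no previous namespace recorded for {domain!r}")
        self._active[domain] = self._previous.pop(domain)
        return self._active[domain]


registry = NamespaceRegistry()
before = registry.get_active("polity")
registry.set_active("polity", "polity_v3large")   # hypothetical versioned name
after_cutover = registry.get_active("polity")
after_rollback = registry.rollback("polity")
```

Because readers only ever see one namespace name at a time, a query never mixes vectors from two embedding models.
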
04 Pre-Season Load Testing
Hardened

Problem: The 1–2s latency claim was observed organically in production — never validated upfront under simulated peak concurrent load. Before each major UPSC exam window (Prelims June, Mains September), the system ran untested at peak concurrency. Auto-scaling configuration was never stress-verified.

Solution: Locust load tests with realistic student behaviour profiles (weighted mix of factual, PYQ, and essay queries with think time). Run against staging with production-equivalent ECS task counts. Target: P95 latency ≤ 2.5s at 500 concurrent users. Scheduled to run 7 days before each exam cutoff date.

Finds breaking points before students do. Validates auto-scaling triggers. Gives defensible confidence in latency SLAs for the highest-traffic windows of the year.
Python — Locust: realistic UPSC student behaviour under sustained load
from locust import HttpUser, task, between
import random, uuid

FACTUAL = ["What is the 73rd Constitutional Amendment?",
           "Explain the doctrine of basic structure."]
ESSAY   = ["Essay structure: climate diplomacy 250 words",
           "UPSC essay: federal governance challenges"]
PYQ     = ["Previous year questions on monetary policy 2023",
           "UPSC 2022 GS2 questions on judiciary"]

class UPSCStudent(HttpUser):
    wait_time = between(2, 8)    # realistic reading + thinking time

    def on_start(self):
        self.session_id = str(uuid.uuid4())
        # get_test_token() is assumed to be a project helper issuing a staging JWT
        self.headers = {"Authorization": f"Bearer {get_test_token()}"}

    def ask(self, queries):
        """POST the query, then follow the stream so full answer latency is measured."""
        r = self.client.post("/chat",
            json={"query": random.choice(queries),
                  "session_id": self.session_id},
            headers=self.headers)
        # Cache hits stream the answer directly; only cache misses return a task ID
        if (r.status_code == 200
                and r.headers.get("content-type", "").startswith("application/json")):
            self.client.get(f"/stream/{r.json()['task_id']}",
                            headers=self.headers,
                            name="/stream/[task_id]")  # group stats under one entry

    @task(4)  # 4x weight: factual is the most common query type
    def ask_factual(self):
        self.ask(FACTUAL)

    @task(2)
    def ask_pyq(self):
        self.ask(PYQ)

    @task(1)
    def ask_essay(self):
        self.ask(ESSAY)

# Run: locust -f locustfile.py --headless -u 500 -r 50 --run-time 600s
System design & reliability
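
The pass/fail criterion for the load test is a P95 latency of at most 2.5s. Locust reports percentiles itself; the sketch below only shows how the same check could be applied to exported latency samples, using the nearest-rank definition of a percentile. The sample values are invented and the function name `percentile` is an assumption.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling
    rank = math.ceil(round(pct / 100 * len(ordered), 9))
    return ordered[max(rank, 1) - 1]


# Invented end-to-end latencies in seconds, one per simulated request.
latencies = [1.1, 1.3, 0.9, 1.8, 2.9, 1.2, 1.4, 1.0, 1.6, 1.5]

p95 = percentile(latencies, 95)
meets_target = p95 <= 2.5   # target from the load-test plan
```

A single slow request in ten is enough to fail the check here, which is why the average latency alone is not a sufficient signal.
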