I'm J. Sudheesh Kumar Reddy — an AI Engineer with 4 years building systems that serve real users. From recommendation engines at 500K+ scale to production multi-agent LLM platforms, I build end-to-end with rigour.
I don't just build models — I build systems. Every line of code I write considers monitoring, evaluation, cost, and the engineer who maintains it at 2am.
Production multi-agent systems with LangGraph, Hybrid RAG pipelines, semantic caching, and LLM-as-Judge evaluation. Built for real user loads, not demos.
End-to-end systems from feature engineering and SVD++ training through to A/B testing, drift monitoring, and real-time Redis-cached serving.
CI/CD for models, automated drift detection (PSI/KS), Evidently AI dashboards, MLflow registry, canary deployments, and auto-retraining pipelines.
AWS (SageMaker, ECS Fargate, RDS, ElastiCache), Docker, JWT/OAuth2/RBAC, Celery async pipelines, and PostgreSQL + Redis at scale.
Full model lifecycle: problem framing → data curation → transformer fine-tuning → multi-metric evaluation → ONNX quantization → handoff documentation.
Designing for scale: request routing, caching strategies, async ingestion pipelines, auth layers, observability stacks, and cost optimisation frameworks.
Whether it's a new AI system, a collaboration, or just a technical conversation — I'm here.
A focused track record building AI infrastructure that serves hundreds of thousands of users. Progression from data science to AI engineering with clear ownership at each stage.
The overview below covers what kind of work I've done, the domains and scales involved, and the technologies I've worked with — without disclosing confidential client information.
Built and owned a content recommendation system end-to-end — from data pipeline design and model development through A/B testing, drift monitoring, and production deployment. Serving hundreds of thousands of users with real-time latency requirements.
Led the complete model development lifecycle for an abstractive text summarization system. Covered problem framing, domain-specific corpus curation, transformer fine-tuning, multi-metric evaluation, and production handoff documentation.
Architected and shipped an end-to-end conversational AI agent system with multi-agent orchestration, hybrid RAG, persistent state management, authentication, async ingestion pipelines, LLMOps observability, and nightly evaluation pipelines.
Focus on model development, experimentation, and statistical analysis. Owned full ML lifecycle for two major product features. Developed deep expertise in NLP, recommendation systems, and evaluation frameworks.
Expanded scope to full system architecture: authentication, async pipelines, caching, monitoring, evaluation, and cloud deployment. Led end-to-end delivery of a complex multi-agent production system used by hundreds of thousands.
I'll review your request and share the appropriate documentation — typically within 24 hours on working days.
Core concepts, algorithms, evaluation frameworks, and production practices from 2 years of hands-on data science work. Interview-ready notes with code, formulas, and mental models.
Supervised: labelled targets, learns mapping f(X)→y. Unsupervised: structure discovery without labels. Semi-supervised: sparse labels + unlabelled data. Self-supervised: labels from data itself (GPT pretraining).
MSE = Bias² + Variance + Irreducible Noise. High bias → underfitting (simple model). High variance → overfitting (complex model). Regularisation (L1/L2), dropout, cross-validation, and ensemble methods manage this tradeoff.
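Spelled out for squared-error loss (with f the true function, f̂ the fitted model, σ² the noise floor):

$$\mathbb{E}\big[(y-\hat f(x))^2\big]=\underbrace{\big(\mathbb{E}[\hat f(x)]-f(x)\big)^2}_{\text{Bias}^2}+\underbrace{\mathrm{Var}\big[\hat f(x)\big]}_{\text{Variance}}+\underbrace{\sigma^2}_{\text{Noise}}$$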
Bagging (Random Forest): parallel trees, reduce variance. Boosting (XGBoost, LightGBM): sequential, reduce bias. LightGBM: histogram-based, leaf-wise growth → 10x faster on large datasets. Stacking: meta-learner over base models.
When asked XGBoost vs LightGBM: LightGBM is ~10x faster on large data because it uses histogram binning + leaf-wise growth vs XGBoost's level-wise. But LightGBM is more prone to overfitting on small datasets.
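A hedged sketch of those overfitting levers in practice (parameter values are illustrative, not tuned; X_train/X_val are placeholders):

import lightgbm as lgb

model = lgb.LGBMClassifier(
    num_leaves=31,           # caps leaf-wise complexity
    min_data_in_leaf=50,     # raise on small data to resist overfitting
    learning_rate=0.05,
    n_estimators=500,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50)],  # stop when validation stalls
)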
| Metric | Best For | Caveat | Formula Hint |
|---|---|---|---|
| ROC-AUC | Imbalanced classification | Misleading if positives are very rare | Area under TPR vs FPR |
| PR-AUC | Rare positive events | Sensitive to threshold choice | Precision vs Recall |
| F1 | Balanced P/R needed | Doesn't distinguish FP/FN costs | 2·P·R / (P+R) |
| NDCG@K | Ranked retrieval | Needs relevance grades, not binary | DCG/IDCG |
| MRR | First relevant result quality | Only considers first hit | Mean(1/rank) |
| ROUGE-L | Text summarization | Surface form, not semantics | LCS-based recall |
| BERTScore | Semantic text similarity | Expensive at scale | Cosine sim of BERT embeddings |
OHE: low cardinality. Target encoding: high cardinality (risk of leakage — use K-fold target encoding). Hashing: very high cardinality with collisions. Embeddings: learned representations for neural models.
Never compute target encoding statistics using the full training set before cross-validation splits. Always compute within the fold to prevent target leakage.
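A minimal leakage-safe K-fold target encoding sketch (illustrative; df, col, and target are placeholder names):

import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, col, target, n_splits=5):
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for tr_idx, val_idx in kf.split(df):
        # Category means computed ONLY on the training fold...
        fold_means = df.iloc[tr_idx].groupby(col)[target].mean()
        # ...applied to the held-out fold; unseen categories get the global mean
        vals = df.iloc[val_idx][col].map(fold_means).fillna(global_mean)
        encoded.iloc[val_idx] = vals.to_numpy()
    return encoded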
Solves training-serving skew: offline store (S3/Parquet) for batch training, online store (Redis/DynamoDB) for real-time serving. Point-in-time correctness ensures no data leakage across time boundaries.
Extractive (BM25/TextRank, selects existing sentences) vs Abstractive (generates new text). PEGASUS is pre-trained on gap-sentence generation (masking whole sentences and regenerating them), which makes it a strong out-of-the-box choice for domain-specific summarisation with fine-tuning.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainingArguments)

model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-large")
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")
# fp16 + gradient checkpointing for memory efficiency
training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",        # required argument
    fp16=True, gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=256,
    per_device_train_batch_size=4,
)
# Constrained decoding to prevent entity hallucination
# (input_ids: tokenised source doc; entity_ids: token ids of key entities)
outputs = model.generate(
    input_ids, num_beams=4,            # force_words_ids needs beam search
    force_words_ids=entity_ids,        # enforce key entities
    no_repeat_ngram_size=3,
)
t-test (continuous, normal). Mann-Whitney U (continuous, non-normal — preferred for click rates). Chi-square (categorical). Fisher's exact (small counts). Always: pre-register, define MDE, compute required sample size before running.
CUPED (Controlled-experiment Using Pre-Experiment Data): use a pre-experiment covariate (e.g. prior-week CTR) to reduce metric variance, gaining the same statistical power with smaller samples. Used extensively at Airbnb, Booking.com.
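A minimal CUPED adjustment sketch in numpy (y is the experiment metric, x the pre-experiment covariate):

import numpy as np

def cuped_adjust(y, x):
    # theta = Cov(y, x) / Var(x) minimises Var(y - theta * x)
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Run the usual test on adjusted values for both arms; variance falls
# by roughly the squared correlation between metric and covariate.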
End-to-end architecture from candidate generation through ranking, serving, cold-start, A/B testing, and continuous monitoring. Based on production experience.
Matrix factorisation: decomposes user-item interaction matrix into latent factors. SVD++ adds implicit feedback (clicks, dwells) on top of explicit ratings. ALS (Alternating Least Squares) scales better for sparse data.
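To make the factorisation mechanics concrete, a toy numpy ALS sketch (simplifying assumptions: dense matrix, no implicit-feedback weighting; real systems use sparse solvers):

import numpy as np

def als_step(R, fixed, reg=0.1):
    # Closed-form least squares for the free factor matrix, holding the
    # other fixed: F = R @ fixed @ (fixed'fixed + reg*I)^-1
    k = fixed.shape[1]
    return R @ fixed @ np.linalg.inv(fixed.T @ fixed + reg * np.eye(k))

R = np.random.rand(100, 50)       # toy user-item matrix (100 users, 50 items)
items = np.random.rand(50, 16)    # 16 latent factors
for _ in range(10):               # alternate until convergence
    users = als_step(R, items)    # fix items, solve users
    items = als_step(R.T, users)  # fix users, solve items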
New user: popularity fallback + onboarding survey embeddings + demographic proxies. New item: content-based similarity from metadata embeddings. Exploration budget (ε-greedy or Thompson sampling) to gather signal fast.
Never show pure popularity to new users. Even basic onboarding questions (3-5 topics) dramatically improve first-session relevance and reduce immediate churn.
Offline: NDCG@K, MRR, Hit-Rate@K on held-out set. Critically: offline metrics and online business metrics often don't agree — always validate with A/B test. Use multi-armed bandit for faster winner selection. Track novelty and diversity too.
User preferences shift with content, seasons, and cohorts. Monitor feature PSI (interaction features, user activity rates). Monitor prediction distribution. Set PSI > 0.2 threshold for auto-retraining trigger via Airflow + Evidently AI.
From attention mechanisms to production multi-agent systems. Deep notes on RAG architectures, LangGraph orchestration, fine-tuning, evaluation, and cost optimisation.
Queries, keys, values come from the same sequence (self-attention) or different sequences (cross-attention). Scaling by √d_k prevents softmax saturation in high dimensions. Multi-head: run H attention heads in parallel, concatenate and project.
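A numpy sketch of single-head scaled dot-product attention, to make the √d_k scaling concrete:

import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Divide by sqrt(d_k): keeps logit variance ~1 so softmax
    # doesn't saturate into near one-hot weights at high dimension
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return w @ V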
KV-Cache: store computed key/value pairs across autoregressive steps — avoids recomputing O(n²) attention on prior tokens. GQA (Grouped Query Attention): multiple query heads share fewer KV heads → reduces memory bandwidth significantly at inference.
Without a KV-cache, every new token recomputes keys and values for all prior tokens: at a 4K context that is roughly 4,096 redundant computations per step. Caching makes per-token cost linear in context length rather than quadratic over the full generation.
Dense (vector): captures semantic intent, handles synonyms. BM25 (sparse): exact keyword match, good for rare terms and named entities. RRF (Reciprocal Rank Fusion) merges both ranked lists without needing score calibration. +10–20% NDCG over pure vector.
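RRF itself is tiny: each document's fused score is Σ 1/(k + rank) across the ranked lists, with k=60 as the conventional constant. A sketch:

def rrf_fuse(ranked_lists, k=60):
    # ranked_lists: e.g. [bm25_ids, dense_ids], best result first
    scores = {}
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)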
Embed the user query, search cache with cosine similarity threshold (e.g. 0.92). If similar query found → return cached answer. Cuts LLM API calls by 45–68% in production. Use Redis LangCache or custom implementation with pgvector.
At 15K queries/day, semantic caching at 45% hit rate saves ~6,750 LLM calls/day. At $0.01/call (GPT-4o-mini), that's ~$24K/year saved.
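A stripped-down lookup sketch (the in-memory list and embed_fn are stand-ins for Redis/pgvector and a real embedding model):

import numpy as np

CACHE = []  # (normalised embedding, answer) pairs; stand-in for Redis

def semantic_lookup(query, embed_fn, threshold=0.92):
    q = embed_fn(query)
    q = q / np.linalg.norm(q)
    for emb, answer in CACHE:
        if float(q @ emb) >= threshold:  # cosine sim on unit vectors
            return answer                # hit: skip the LLM call
    return None                          # miss: call the LLM, then store

def cache_store(query, answer, embed_fn):
    q = embed_fn(query)
    CACHE.append((q / np.linalg.norm(q), answer))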
from typing import TypedDict
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

class AgentState(TypedDict):
    messages: list
    next: str

# PostgresSaver: durable session state, survives failures
checkpointer = PostgresSaver.from_conn_string(DATABASE_URL)

def supervisor_router(state: AgentState):
    # classify_intent: project-specific intent classifier
    intent = classify_intent(state["messages"][-1])
    # Routes to: rag_agent | test_agent | affairs_agent
    return {"next": intent}

graph = StateGraph(AgentState)
graph.add_node("supervisor", supervisor_router)
graph.add_node("rag_agent", rag_node)
# Route on the "next" key written by the supervisor
graph.add_conditional_edges("supervisor", lambda s: s["next"])
app = graph.compile(checkpointer=checkpointer)
Use PostgresSaver over Redis for LangGraph checkpointing when you need durability across pod restarts and long-horizon sessions. Redis is fine for sub-session state but loses data on restart without AOF/RDB persistence.
| Metric | Measures | Ground Truth? | When to Use |
|---|---|---|---|
| ROUGE-L | Summarization recall | ✅ Required | Fine-tuning eval against reference summaries |
| BERTScore | Semantic similarity | ✅ Required | When wording varies but meaning matters |
| FactCC | Factual consistency | Source doc only | Summarization hallucination detection |
| RAGAS Faithfulness | Answer grounded in context | ❌ Reference-free | RAG pipeline evaluation |
| LLM-as-Judge | Holistic 4-axis quality | ❌ Reference-free | Production nightly eval on 500 samples |
import json

JUDGE = """Evaluate on 4 axes (1-5 each):
1. Factual Accuracy 2. Completeness
3. Hallucination (5=none) 4. Helpfulness
Return: {{"accuracy":X, "completeness":X, "hallucination":X, "helpful":X}}
Question: {q} Context: {ctx} Answer: {ans}"""

async def nightly_eval(sample_ids):
    results = []
    for sid in sample_ids:  # plain list, no async iterator needed
        q, ctx, ans = fetch_production_sample(sid)  # project helper
        resp = await llm.ainvoke(JUDGE.format(q=q, ctx=ctx, ans=ans))
        results.append(json.loads(resp.content))    # judge returns JSON
    # Alert if avg hallucination < 3.5 or accuracy < 4.0
    alert_if_regression(results)
Fine-tune when: domain vocabulary is highly specialized, you have 1K+ quality examples, latency is critical, and knowledge is relatively static. RAG when: knowledge updates frequently, source attribution is required, data is proprietary/can't be in weights, or you need to scale context.
LoRA: inject low-rank adapter matrices (rank 4-64) into attention projections. Trains only ~0.1% of parameters. QLoRA adds 4-bit quantization of the frozen base model → fine-tune 70B models on consumer GPUs. VRAM: a 7B model fits in ~6GB with QLoRA, versus ~28GB just to hold fp32 weights (and considerably more with gradients and optimiser states) for full fine-tuning.
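A minimal PEFT sketch; the rank and target modules are illustrative choices (module names differ across architectures):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=16, lora_alpha=32,                  # adapter rank + scaling
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically ~0.1% trainable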
Production ML operations: CI/CD pipelines, drift detection, automated retraining, LLMOps observability, async infrastructure, and model serving optimisation.
| Level | What's Automated | Trigger |
|---|---|---|
| Level 0 | Nothing. Manual train/deploy by DS. | Human only |
| Level 1 | Training pipeline automated. Model re-trains on schedule/drift. | Cron or drift alert |
| Level 2 | Full CI/CD: automated testing, eval gates, canary deploy, rollback. | Any code change or data drift |
Covariate (data) drift: P(X) changes, P(Y|X) stable. Feature distributions shift (e.g. user behaviour changes seasonally). Concept drift: P(Y|X) changes — the model's learned relationship becomes invalid. Label drift: P(Y) changes — class proportions shift.
PSI: Population Stability Index. Best for feature drift. >0.2 = significant. KS test: Kolmogorov-Smirnov for continuous features. p < 0.05 = drift. JS Divergence: for prediction distribution. Chi-square: categorical feature drift.
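PSI in code, as a reference sketch (10 quantile bins taken from the reference distribution):

import numpy as np

def psi(expected, actual, bins=10):
    # Quantile bin edges from the reference (training) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf    # catch out-of-range values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# psi(train_df["ctr"], prod_df["ctr"]) > 0.2 → trigger retraining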
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
result = report.as_dict()
# Share of drifted features from the preset's dataset-level drift metric
# (result path per the legacy Report API; verify against your version)
drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]
# Auto-trigger retraining if >20% of features show significant drift
if drift_share > 0.2:
    airflow_client.trigger_dag("retraining_pipeline")   # project wrapper
    alert_on_call("Significant drift detected — retraining triggered")
Infrastructure metrics: p50/p95/p99 latency per agent node, throughput, error rates. LLM-specific: token cost per session, cache hit rate, model routing distribution. Quality metrics: LLM-as-Judge scores, hallucination rate trend, self-resolution rate, user satisfaction proxy.
Every LangGraph node execution, LLM call, retrieval step, and tool invocation is traced with latency, token count, intermediate outputs, and cost. Enables root-cause diagnosis of failures in minutes. Alert on: agent node p99 > 3s, cost spike > 2× baseline, judge score drop.
from celery import Celery
app = Celery("ingest", broker="redis://redis:6379/0",
backend="redis://redis:6379/1")
@app.task(bind=True, max_retries=3, default_retry_delay=60)
def ingest_document(self, doc_id, s3_uri):
try:
text = parse_pdf_docling(s3_uri) # Docling
chunks = semantic_chunk(text) # ~300 token chunks
embs = embed_batch(chunks) # text-embedding-3
upsert_pgvector(embs, doc_id) # dense index
upsert_elasticsearch(chunks, doc_id) # BM25 index
except Exception as e:
raise self.retry(exc=e)
Always decouple ingestion from serving. Running document processing synchronously in the request path causes latency spikes and timeouts. Celery workers handle ingestion independently; serving always reads from the already-indexed store.
AWS services for ML engineers, Docker patterns, authentication design, PostgreSQL optimisation, and Redis caching strategies — grounded in production use.
| Service | ML Use Case | When to Prefer an Alternative |
|---|---|---|
| SageMaker Training | Managed training jobs, spot instances | ECS if you need full Docker control |
| ECS Fargate | Containerised API serving (no GPU) | SageMaker Endpoints for GPU inference |
| S3 | Data lake, model artefacts, training data | EFS for shared file access between pods |
| RDS PostgreSQL | Conversation store, pgvector search | Aurora Serverless for variable load |
| ElastiCache Redis | Semantic cache, session state, broker | DynamoDB for globally distributed KV |
| MSK (Kafka) | Real-time event streaming for feature store | Kinesis for lighter AWS-native setup |
# Stage 1: build — install heavy deps once
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime — lean image
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib /usr/local/lib
# Also copy console scripts, or the uvicorn binary won't exist
COPY --from=builder /usr/local/bin /usr/local/bin
COPY src/ .

# Health check before ECS considers container healthy
# (stdlib urllib: slim images don't ship curl)
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--workers", "4"]
Header.Payload.Signature (base64url). Access token: short-lived (15 min). Refresh token: long-lived (7 days), stored httpOnly cookie. On expiry, client uses refresh token silently. For AI APIs: embed user tier in JWT claims to enforce RBAC at gateway.
import jwt                       # PyJWT
from fastapi import Depends

async def get_current_user(token=Depends(oauth2_scheme)):
    payload = jwt.decode(token, SECRET, algorithms=["HS256"])
    role = payload["role"]       # "free"|"premium"|"admin"
    tier = payload["tier"]       # enforces rate limits + features
    return UserContext(id=payload["sub"], role=role, tier=tier)

Role matrix for EdTech AI:
Free: basic Q&A, 20 queries/day, no test analysis.
Premium: all agents, 200 queries/day, test feedback.
Admin: system management, eval dashboard access.
Enforce at FastAPI dependency level, not inside agent logic.
pgvector adds ANN (approximate nearest neighbour) search directly in Postgres. HNSW index for the best recall/speed balance; IVFFlat for lower memory and faster index builds. Pair with Postgres full-text search (tsvector/ts_rank) for BM25-style keyword ranking, plus pg_trgm for fuzzy matching, to get hybrid retrieval without a separate vector DB.
-- Create HNSW index (best for recall/speed balance)
CREATE INDEX ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Semantic search with metadata filter
SELECT chunk_text, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
WHERE user_id = $2 -- tenant isolation
ORDER BY embedding <=> $1 LIMIT 10;

| Pattern | How | Best For |
|---|---|---|
| Cache-aside | App checks cache, populates on miss | Read-heavy, tolerance for stale data |
| Semantic cache | Embed query, cosine search cache | LLM response caching (45–68% hit rate) |
| Write-through | Write to cache and DB simultaneously | Strong consistency required |
| TTL invalidation | Key expires after N seconds | Recommendation pre-compute (24h TTL) |
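Cache-aside in a few lines with redis-py (get_recs_from_db is a placeholder for the real read path):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_recommendations(user_id: str):
    key = f"recs:{user_id}"
    cached = r.get(key)                       # 1. check cache first
    if cached:
        return json.loads(cached)
    recs = get_recs_from_db(user_id)          # 2. miss: read source of truth
    r.setex(key, 86400, json.dumps(recs))     # 3. populate with 24h TTL
    return recs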
Production blueprints for ML systems — the thinking framework, common patterns, and trade-off analysis for designing recommendation, RAG, multi-agent, and inference systems.
Clarify task type, success metrics, constraints (latency, cost, scale). Separate business metrics from ML metrics.
Collection, labelling, validation, feature engineering, feature store, training/serving parity.
Simple baseline first. Scale complexity with data. Consider inference cost and latency budgets.
Online vs batch vs streaming. Caching strategy. Auth. Rate limiting. SLO definition.
A/B test plan, offline eval metrics, shadow mode, eval gates in CI/CD.
Drift detection, alerting, auto-retraining triggers, feedback loops.
Candidate gen (ANN/two-tower) → feature engineering → LightGBM ranker → A/B test layer → Redis cache → feature drift monitoring → auto-retraining. Cold-start: popularity fallback + onboarding survey embeddings. Scale: pre-compute nightly for active users.
Async Celery indexing + sync FastAPI serving. Hybrid BM25 + dense retrieval with RRF fusion. Cross-encoder reranking. Redis semantic cache. Guardrails (intent filter + output validator). Nightly LLM-as-Judge eval. LangSmith distributed tracing.
LangGraph supervisor with intent routing to specialised sub-agents. PostgresSaver for durable session state. Failure isolation: each agent retries independently. HITL interrupts for high-stakes decisions. Per-node latency and cost tracking via Prometheus.
FastAPI + uvicorn workers. Pre-computed batch predictions in Redis (TTL 24h). Cache-aside for real-time fallback. Auto-scaling ECS tasks on CPU/RPS. Canary rollout 5% → 20% → 100%. Shadow mode for new model validation. SLO: p99 < 200ms.
60+ curated questions across ML concepts, LLM/RAG, MLOps, system design, and behavioural. Organised by seniority level with answer frameworks.
I'm actively looking for high-impact roles and collaboration opportunities. If you're working on problems at the intersection of AI, scale, and real user value — let's talk.
Four years of building AI systems that serve real users at real scale. I think in systems, not models — every technical decision considers cost, monitoring, failure modes, and the engineer who maintains it.
Whether you're a hiring manager, a potential collaborator, or just want to discuss something technical — I read every message and respond thoughtfully.
Response time is typically within 24 hours on working days.
Thanks for reaching out. I'll review your message and get back to you at the email you provided — typically within 24 hours.
In the meantime, feel free to explore the knowledge notes.