Data Scientist · EdTech Platform, New Delhi · 2022 – 2023
Hybrid recommendation system (ALS collaborative filtering, TF-IDF content-based NLP, and a LightGBM learning-to-rank layer) personalising the student portal homepage for 40,000+ UPSC aspirants. Designed, built, and owned solo.
The Problem
The platform's content library held 2,400+ items — PDF notes, video lectures, practice tests, and previous-year question papers spanning 11 General Studies subjects and 30+ optional papers. Without personalisation, every student saw the same homepage carousel. High-value content was missed not because it was irrelevant, but because it was never surfaced. Six months of interaction data sat in PostgreSQL, unused.
Increase content engagement and completion rates by surfacing the right content for each student — without requiring explicit ratings, which students on exam-prep platforms never provide. Completion rate on premium content was the downstream metric the business cared about.
Sole data scientist on shared AWS infrastructure with no dedicated ML budget. Recommendations must serve in under 10ms (embedded in the login portal homepage on mobile connections). Nightly retraining acceptable — content doesn't change hourly, and student preferences shift over days, not minutes.
Students never rate content explicitly. Positive signals are inferred from behaviour: PDF scroll depth >40%, video watch time >60 seconds, bookmarks, and exercise completions. Each signal carries a different confidence weight — a bookmark is a much stronger positive than a passive view — which determines how the ALS model weighs each interaction in the training matrix.
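The thresholds above can be sketched as a small labelling function. This is a minimal illustration, not the production code: the function name, the `scroll_depth` field, and the `"other"` label are assumptions introduced here, and only the 60-second and 40% cut-offs come from the text.

```python
from typing import Optional

# Thresholds quoted in the text; field and function names are hypothetical.
MIN_WATCH_SECONDS = 60
MIN_SCROLL_DEPTH = 0.40


def classify_signal(event_type: str,
                    duration_s: Optional[int] = None,
                    scroll_depth: Optional[float] = None) -> str:
    """Map one raw event to the signal label used for confidence weighting."""
    if event_type in ("complete", "bookmark"):
        return event_type
    if event_type == "view":
        long_watch = duration_s is not None and duration_s > MIN_WATCH_SECONDS
        deep_scroll = scroll_depth is not None and scroll_depth > MIN_SCROLL_DEPTH
        return "view_long" if (long_watch or deep_scroll) else "view_short"
    # Anything else (for example 'skip') is not treated as a positive signal.
    return "other"
```

Each label would then be looked up in a weight table, so a bookmark contributes more to the training matrix than a short passive view.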
Two cold-start cases require separate strategies. New students have zero interaction history. New content items have zero engagement data. A 3-question onboarding survey (primary subjects + difficulty preference) solves user cold start from day one. New content gets a content-based similarity score computed synchronously on publish, making it recommendable before the next nightly batch runs.
System Architecture — 5 Layers
The content library updates once or twice daily. Student subject preferences shift over weeks, not minutes — a student who clicks three Economics videos in one session doesn't need their recommendations updated within the hour; they need better recommendations tomorrow. A nightly Celery task retrains all models on a 90-day rolling interaction window and writes pre-computed top-50 recommendations per user to Redis. The homepage then becomes a single Redis GET: under 8ms regardless of model complexity. Real-time re-ranking on every page load would have added Kafka, Flink, and stateful operator infrastructure for marginal signal-quality gain at this scale.
Before collaborative or content-based filtering could run, 2,400+ unstructured PDFs needed structured metadata — subject tags, difficulty level, content type. This metadata is the upstream dependency for three core RecSys components:
difficulty_match feature requires a per-item difficulty label to compare against the user's revealed difficulty preference.
A standalone NLP pipeline (Model 04) handled this: pdfplumber extracted text from the first 8 pages of each PDF, NLTK tokenised and stemmed it, and a TF-IDF + OneVsRestClassifier pipeline (trained on 400 manually-tagged items) assigned subject labels (multi-label, macro F1 = 0.91) and difficulty tier (macro F1 = 0.87). New uploads are tagged synchronously in under 2 seconds, so they enter the recommendation index before the next nightly batch.
Four-Model Architecture
No single model solved all cases cleanly. An NLP auto-tagger (Model 04) first turns raw PDFs into structured metadata — without it, content-based filtering has no reliable signal and the LightGBM ranker's difficulty feature has no ground truth. ALS collaborative filtering then dominates for warm users, TF-IDF handles cold start, and LightGBM learns to combine both in a nonlinear ranking function that manual blending cannot replicate.
implicit library · matrix factorisation · implicit feedback
Finds students with similar interaction patterns and surfaces what they engaged with. Primary signal for warm users who have at least 20 tracked interactions.
Sparse user-item interaction matrix of shape (40,000 × 2,400). Each cell is a confidence-weighted sum of interaction signals: completion = 3×, bookmark = 2×, view >60s = 1×, view <60s = 0.3×. Most cells are empty — the matrix is <2% dense, which is why standard SVD fails here.
64 latent factors (tuned via MLflow grid search over 32/64/128). Regularisation = 0.01, 25 iterations, confidence scale α = 40. Trained on EC2 m5.xlarge using the C++ backend of the implicit library — full retraining completes in ~4 minutes, well within the nightly batch window.
SVD treats every missing entry as a zero preference. At this scale, >98% of (user, item) pairs have no recorded interaction — not because the student dislikes that content, but because they've never encountered it. Recommending from a model that treats "unseen" as "disliked" is actively harmful. ALS with confidence weighting makes a critical distinction: high-confidence entries are observed interactions (the model is certain about the signal), low-confidence entries are unobserved (the model is uncertain, not negative). This matters significantly for a 2,400-item library where a student can only consume a fraction in a year.
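For reference, the distinction can be written in the standard implicit-feedback ALS formulation, where r_ui is the weighted interaction sum for user u and item i. This is the textbook form, assuming the confidence mapping C = 1 + αR used in this project; how a given version of the implicit library maps matrix values to confidence internally may differ.

```latex
p_{ui} =
\begin{cases}
  1 & \text{if } r_{ui} > 0 \\
  0 & \text{if } r_{ui} = 0
\end{cases},
\qquad
c_{ui} = 1 + \alpha \, r_{ui}

\min_{X,\,Y} \; \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

With α = 40, a single completion (r = 3.0) gets confidence 1 + 40 × 3.0 = 121, a short view (r = 0.3) gets 13, and an unobserved pair keeps the baseline confidence of 1. Unobserved pairs therefore pull the prediction towards zero only weakly.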
import implicit
import numpy as np
import pandas as pd
import scipy.sparse as sparse

CONF_WEIGHTS = {
    'complete': 3.0,
    'bookmark': 2.0,
    'view_long': 1.0,   # watch > 60s or scroll > 40%
    'view_short': 0.3,  # passive, noisy signal
}

def build_interaction_matrix(events: pd.DataFrame, n_users: int, n_items: int):
    # Unknown event types fall back to a small weight of 0.1.
    # assign() returns a copy, so the caller's DataFrame is not mutated.
    events = events.assign(weight=events['event_type'].map(CONF_WEIGHTS).fillna(0.1))
    agg = events.groupby(['student_idx', 'content_idx'])['weight'].sum()
    return sparse.csr_matrix(
        (agg.values,
         (agg.index.get_level_values(0), agg.index.get_level_values(1))),
        shape=(n_users, n_items)
    )

# events, n_users and n_items come from the extraction step
interaction_matrix = build_interaction_matrix(events, n_users, n_items)

# Train ALS: confidence matrix C = 1 + alpha * R
model = implicit.als.AlternatingLeastSquares(
    factors=64, regularization=0.01, iterations=25, use_gpu=False
)
model.fit(interaction_matrix * 40)  # alpha = 40

# Generate top-200 candidates for one user
ids, scores = model.recommend(
    userid=user_idx,
    user_items=interaction_matrix[user_idx],
    N=200,
    filter_already_liked_items=True,
)

scikit-learn TF-IDF · cosine similarity · metadata features
Recommends content similar to what a student has engaged with, based on subject, topic, difficulty, and content type. Essential for cold-start users and new content items.
Each item is represented as a concatenated metadata string: subject (History, Polity, Geography, Economy, Science, Environment, Art & Culture…) + topic + difficulty level (Prelims / Mains / Both / Optional) + content type (PDF note / video / PYQ / practice test) + title. This gives the TF-IDF vectoriser rich, structured text to work with.
TF-IDF with max_features = 5,000, ngram_range = (1, 2), sublinear_tf = True (log-scaled frequencies reduce the dominance of common words like "questions"). Item-item cosine similarity computed once and stored as a compressed sparse matrix in memory.
A user's CB profile is the mean TF-IDF vector of their top-10 most recent positively-interacted items (completions and bookmarks preferred). For cold-start users, the profile is built from the subjects and difficulty level they selected during onboarding.
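The profile construction can be sketched without any library dependencies. This is an illustration only: the production system uses scikit-learn's TF-IDF vectors, whereas here each vector is a plain term-to-weight dictionary, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

SparseVec = Dict[str, float]  # term -> TF-IDF weight


def mean_vector(vectors: List[SparseVec]) -> SparseVec:
    """Average sparse vectors; a term missing from a vector counts as 0."""
    if not vectors:
        return {}
    total: Dict[str, float] = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a: SparseVec, b: SparseVec) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_profile(recent_items: List[SparseVec],
                    catalog: Dict[str, SparseVec],
                    top_n: int = 10) -> List[Tuple[str, float]]:
    """Build the profile from the top_n most recent items, then score the catalog."""
    profile = mean_vector(recent_items[:top_n])
    scored = [(item_id, cosine(profile, vec)) for item_id, vec in catalog.items()]
    return sorted(scored, key=lambda pair: -pair[1])
```

For a cold-start user, `recent_items` would be replaced by a single pseudo-document built from the onboarding answers, so the same scoring path applies.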
1. Cold-start users (<20 tracked interactions): CB is the primary model.
2. New content (<50 total interactions): ALS has no signal; CB handles it using the item's metadata similarity to the user's profile.
3. Serendipity injection: 10% of final recommendations are CB-only, injected to prevent filter-bubble collapse in long-running users.
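The serendipity rule can be sketched as a slot-reservation step. This is one plausible reading, not the production logic: the function name is hypothetical, and placing the CB-only items at the end of the list is an assumption, since the text does not say where they appear in the carousel.

```python
from typing import List


def inject_cb_only(ranked: List[int], cb_only: List[int],
                   n: int = 50, cb_share: float = 0.10) -> List[int]:
    """Reserve a share of the final n slots for content-based-only items."""
    n_cb = round(n * cb_share)
    head = ranked[: n - n_cb]          # slice copy; the input is not mutated
    seen = set(head)

    extras: List[int] = []
    for item in cb_only:
        if len(extras) >= n_cb:
            break
        if item not in seen:
            extras.append(item)
            seen.add(item)

    # Backfill from the main ranking if CB supplied too few novel items.
    for item in ranked[n - n_cb:]:
        if len(head) + len(extras) >= n:
            break
        if item not in seen:
            head.append(item)
            seen.add(item)

    return head + extras
```

With the defaults, 45 slots come from the main ranking and 5 from the CB-only pool, with duplicates removed.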
When a new item is added to the content library, a PostgreSQL trigger fires a Celery task that computes that item's TF-IDF vector and its cosine similarity to all user profiles. This runs in under 1 second and stores the results immediately. The item is recommendable via CB before the next nightly ALS batch — there is no 24-hour blind spot for new content.
lambdarank objective (NDCG) · contextual features · nonlinear scoring
Takes 200 candidates from ALS + CB and re-scores them using a rich feature set. Learns the nonlinear interactions between CF score, CB score, recency, popularity, and user context — interactions that weighted linear blending cannot capture.
ALS score (dot product, normalised 0–1) · CB cosine similarity · item log-popularity (rolling 7-day view count, log-scaled) · recency score (exponential decay: 1.2× for items <30 days old, decaying to 0.8× at 90 days) · user subject affinity (last-14-day engagement share by subject) · difficulty match (binary: does item level match user's revealed preference).
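The recency description gives two anchor points: 1.2× under 30 days and 0.8× at 90 days. A minimal sketch of one curve that passes through both is shown below. The piecewise shape, the constant names, and holding the multiplier at 0.8 beyond 90 days are assumptions, not the author's confirmed formula.

```python
import math

BOOST = 1.2        # multiplier for items younger than 30 days (from the text)
FLOOR = 0.8        # multiplier reached at 90 days (from the text)
FRESH_DAYS = 30
FLOOR_DAYS = 90

# Decay rate chosen so the curve passes through both anchor points.
DECAY_RATE = math.log(BOOST / FLOOR) / (FLOOR_DAYS - FRESH_DAYS)


def recency_multiplier(days_old: float) -> float:
    """Flat boost while fresh, then exponential decay down to the floor."""
    if days_old <= FRESH_DAYS:
        return BOOST
    if days_old >= FLOOR_DAYS:
        return FLOOR  # assumption: hold at the floor beyond 90 days
    return BOOST * math.exp(-DECAY_RATE * (days_old - FRESH_DAYS))
```

Between the two anchors the multiplier falls smoothly, so a 60-day-old item sits just under 1.0.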
Positive labels: click events on model-surfaced items from A/B test traffic (Experiment 01 onwards). Using pre-recommendation editorial clicks as training labels would introduce selection bias — students only clicked from an editorially-curated item set, not the full catalog. Labels were collected from the 20% treatment group during Exp 01, then from 100% of traffic after Exp 01 shipped.
Objective: lambdarank (LightGBM's NDCG-optimising ranking objective). LightGBM with 200 trees, learning rate = 0.05, max_depth = 6, num_leaves = 31. Trained weekly (not nightly), because daily click labels need to accumulate before there is sufficient volume for stable ranking model updates. NDCG@10 on held-out 10%: 0.73. Outperformed weighted linear blending (0.7 ALS + 0.3 CB) by 12% NDCG@10 offline.
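For readers unfamiliar with the metric, NDCG@k can be computed as below. This is a minimal sketch using linear gain, which coincides with the exponential-gain variant when labels are binary clicks; the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

`relevances` is the list of labels in the order the model ranked the items, so a clicked item in position 1 contributes more than the same click in position 8.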
Linear blending assumes CF and CB contributions are additive and independent — they aren't. A new student with high subject affinity in History needs a different CF/CB balance than a veteran student with 12 months of interaction history. LightGBM learns these conditional relationships: subject affinity × content type × recency interacting together produces splits that a weighted average simply cannot express.
import lightgbm as lgb
import numpy as np
import pandas as pd

# get_subject_affinity, get_difficulty_preference, content_meta, item_views_7d,
# today and global_top20_pct are defined elsewhere in the project.

def build_ranking_features(
    user_id: int, candidates: list,
    cf_scores: np.ndarray, cb_scores: np.ndarray
) -> pd.DataFrame:
    rows = []
    user_affinity = get_subject_affinity(user_id, days=14)
    user_level = get_difficulty_preference(user_id)
    for item_id, cf, cb in zip(candidates, cf_scores, cb_scores):
        meta = content_meta[item_id]
        days_old = (today() - meta['published_at']).days
        rows.append({
            'cf_score': cf,
            'cb_score': cb,
            'log_popularity': np.log1p(item_views_7d.get(item_id, 0)),
            'recency_score': 1.2 * np.exp(-days_old / 30),
            'subject_affinity': user_affinity.get(meta['subject'], 0.0),
            'difficulty_match': int(meta['level'] == user_level),
        })
    return pd.DataFrame(rows)

def apply_debiasing(items, scores, n, penalty=0.4):
    """Penalise globally overrepresented items."""
    adjusted = []
    for item, score in zip(items, scores):
        if item in global_top20_pct:
            score *= (1 - penalty)  # penalty=0.4 scales the score by 0.6
        adjusted.append((item, score))
    return [i for i, _ in sorted(adjusted, key=lambda x: -x[1])[:n]]

# Score + rank candidates
ranker = lgb.Booster(model_file='ranker_weekly.lgb')
features = build_ranking_features(user_id, candidates, cf_scores, cb_scores)
scores = ranker.predict(features)

# Top-50 after popularity debiasing
top50 = apply_debiasing(candidates, scores, n=50)

pdfplumber · NLTK · TF-IDF · OneVsRestClassifier (scikit-learn)
Automatically extracts subject tags and difficulty level from raw PDFs. Upstream prerequisite that provides the structured metadata all three RecSys models depend on.
2,400+ content items had been added manually over years with inconsistent or missing metadata. Inconsistent labels meant the TF-IDF vectoriser was indexing noise instead of signal, and the LightGBM difficulty_match feature had no reliable per-item ground truth. Manual re-tagging was an editorial bottleneck: a new batch of 30 PDFs could sit unindexed for weeks. The auto-tagger reduced labelling latency to under 2 seconds per item, triggered synchronously on upload so new content enters the recommendation index before the next nightly batch runs.
UPSC GS content is structurally multi-label. River basin ecology spans Geography, Environment, and sometimes Polity. Historical treaties touch History, International Relations, and Map Work simultaneously. A single-label classifier systematically under-tags content, reducing content-based recall. OneVsRestClassifier trains an independent logistic regression per subject — 11 classifiers for 11 GS subjects — applying each independently so any combination of labels is possible. Ground truth: 400 items manually tagged by a subject-matter expert. Macro F1 on held-out 20%: 0.91.
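Macro F1 for a multi-label tagger averages the per-subject F1 scores without weighting by frequency, so rare subjects count as much as common ones. A dependency-free sketch of that calculation follows; the function name and the set-based input format are assumptions for illustration, and scikit-learn's `f1_score(average='macro')` would be the usual choice in practice.

```python
from typing import List, Set


def macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]],
             labels: List[str]) -> float:
    """Unweighted mean of per-label F1 over a multi-label test set."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(label in t and label in p for t, p in pairs)
        fp = sum(label not in t and label in p for t, p in pairs)
        fn = sum(label in t and label not in p for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A label with no true or predicted instances scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A tagger that misses the Environment tag on a river-basin note is penalised on that label alone, which is why under-tagging shows up clearly in this metric.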
Three-tier difficulty (Prelims / Mains / Advanced Optional) treated as single-label multi-class. Vocabulary complexity, question type, and abstract reasoning density correlate strongly with tier — TF-IDF bigrams (up to 5,000 features) capture these patterns well. A standalone logistic regression is trained separately from the subject classifiers, as difficulty is orthogonal to subject and shares no label structure. Macro F1 on held-out 20%: 0.87.
Content-Based Model (TF-IDF): Subject and topic tags are concatenated into the item's text representation before vectorisation. This dramatically sharpens cosine similarity — two Physics videos about thermodynamics now share a subject:physics token regardless of surface-level vocabulary differences, improving CB recall for cold-start users whose profiles are built from subject preferences.
LightGBM Ranker: The difficulty_match feature (binary: does item difficulty match the student's revealed preference tier?) is computed from auto-tag output. A student who consistently bookmarks and completes Mains-level content gets a 1.0 for Mains items and a 0.0 for Prelims — a signal the ranker learned to weight heavily in its top splits.
Onboarding Cold-Start: The 3-question survey asks students to select primary subjects and target tier. Those choices are stored as a preference vector keyed on subject labels — labels that are only meaningful if every content item carries accurate subject tags. Without reliable auto-tagging, the onboarding personalisation would have had nothing to map to.
import re

import joblib
import pdfplumber
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# NLTK tokenizer and stopword data must be downloaded once,
# e.g. nltk.download('punkt') and nltk.download('stopwords').
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def extract_text(pdf_path: str) -> str:
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages[:8]:  # first 8 pages capture subject well
            pages.append(page.extract_text() or '')
    return ' '.join(pages)

def preprocess(text: str) -> str:
    # Lowercase, drop non-letters, tokenise, remove stopwords/short tokens, stem.
    text = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())
    tokens = word_tokenize(text)
    return ' '.join(
        stemmer.stem(t) for t in tokens
        if t not in STOP_WORDS and len(t) > 2
    )

# ── Subject classifier: multi-label, 11 UPSC GS subjects ──────────────
# Ground truth: 400 items tagged manually by subject-matter expert
# train_texts is assumed to hold preprocess() output for those items.
mlb = MultiLabelBinarizer()
y_subject = mlb.fit_transform(train_labels_subject)  # (400, 11)
subject_clf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=8000,
                              sublinear_tf=True, min_df=2)),
    ('clf', OneVsRestClassifier(
        LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs'))),
])
subject_clf.fit(train_texts, y_subject)
# Macro F1 on held-out 20%: 0.91

# ── Difficulty classifier: single-label, 3 tiers ──────────────────────
# Prelims / Mains / Advanced Optional
difficulty_clf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000,
                              sublinear_tf=True, min_df=2)),
    ('clf', LogisticRegression(C=0.5, max_iter=1000, solver='lbfgs')),
])
difficulty_clf.fit(train_texts, train_labels_difficulty)
# Macro F1 on held-out 20%: 0.87

def tag_content_item(pdf_path: str) -> dict:
    """Tag a new PDF in <2s synchronously on upload."""
    clean = preprocess(extract_text(pdf_path))
    subjects = list(mlb.inverse_transform(subject_clf.predict([clean]))[0])
    difficulty = difficulty_clf.predict([clean])[0]
    return {'subjects': subjects, 'difficulty': difficulty}

joblib.dump(subject_clf, 'models/subject_clf.pkl')
joblib.dump(difficulty_clf, 'models/difficulty_clf.pkl')
joblib.dump(mlb, 'models/mlb.pkl')

Batch-First Architecture
The system is entirely batch-driven. A Celery task fires at 2 AM IST, retrains ALS, refreshes the TF-IDF model if content metadata has changed, loads the weekly-trained LightGBM ranker, computes top-50 recommendations for every active student, applies debiasing and recency adjustments, and writes results to Redis. The homepage serving layer is a single Redis GET, with no model inference at request time. Full pipeline runtime: ~22 minutes on a single EC2 m5.xlarge.
Pull 90 days of rolling interaction data from the content_events table for all students who were active in the last 30 days. The 90-day window (not full history) is intentional: training on the full 12-month history amplified popularity bias from the pre-personalisation era, when the same 30 items dominated the editorial homepage and accumulated disproportionate interaction counts. Tested against 30/60/90/180-day windows in MLflow; 90 days produced the best NDCG@10 with sufficient signal for sparse users.
Approximately 500,000 quality interaction events (completions, bookmarks, long views) in the training window at steady state — roughly 12–15 meaningful signals per active student per month.
Aggregate events into a confidence-weighted sparse interaction matrix (40K × 2,400) using Pandas groupby and scipy.sparse.csr_matrix. The full matrix with all non-zero values fits comfortably in the m5.xlarge's 16 GB RAM (peak memory usage: ~1.1 GB). No distributed processing (Spark/PySpark) is needed or appropriate at this scale; adding cluster infrastructure would have doubled build complexity for zero performance gain.
ALS is retrained from scratch every night (~4 minutes). Full retraining is cheap enough that incremental updates would add complexity without meaningful benefit. TF-IDF vectoriser and item-item similarity matrix are only recomputed when content metadata changes (checked via a version hash). LightGBM ranker is retrained weekly every Sunday night — daily click labels need ~7 days to accumulate enough ranking signal volume for stable gradient boosting.
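The version-hash check for TF-IDF can be sketched as follows. This is a minimal illustration: the function names, the metadata dictionary shape, and the choice of SHA-256 over canonical JSON are assumptions, since the text only states that a version hash is used.

```python
import hashlib
import json
from typing import Dict, Optional


def metadata_version(items: Dict[int, Dict[str, str]]) -> str:
    """Stable hash of all item metadata, independent of dictionary ordering."""
    canonical = json.dumps(items, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_recompute(items: Dict[int, Dict[str, str]],
                    stored_version: Optional[str]) -> bool:
    """True when the metadata differs from the version used for the last fit."""
    return metadata_version(items) != stored_version
```

The nightly task would store the hash next to the fitted vectoriser and skip the refit whenever `needs_recompute` returns False.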
For each active student: ALS generates top-200 CF candidates (filtered for already-seen items). CB adds up to 50 candidates for cold-start users or new content items. Duplicates are removed. LightGBM ranker scores all candidates using the 6-feature set. Popularity debiasing penalty (0.4×) is applied to items in the top-20% by platform-wide 7-day view count. Recency boost (1.2× for items <30 days old, exponential decay) is applied. Top-50 by final score become the recommendation list.
Top-50 item IDs are serialised as JSON and written to Redis with key recs:{student_id} and TTL = 86,400 seconds (24 hours). A Redis pipeline batches all writes in a single round-trip. ~28,000 active users are processed in the ~22-minute batch window. Cold-start users (no interaction history + no onboarding survey) receive a subject-popularity fallback list rather than empty recommendations.
The student portal homepage hits GET /recommendations/{student_id}. The handler reads recs:{student_id} from Redis and returns the ordered list. No model inference happens at request time. P95 latency: <8ms including network overhead. On cache miss (new student, expired TTL, or batch failure), the cold-start fallback endpoint returns content-based recommendations computed from the student's onboarding survey responses.
-- PostgreSQL: interaction logging
CREATE TABLE content_events (
    event_id    BIGSERIAL PRIMARY KEY,
    student_id  INTEGER NOT NULL REFERENCES students(id),
    content_id  INTEGER NOT NULL REFERENCES content_items(id),
    event_type  VARCHAR(20) NOT NULL,  -- 'view', 'bookmark', 'complete', 'skip'
    duration_s  INTEGER,               -- watch/read time in seconds
    created_at  TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_events_student ON content_events (student_id, created_at DESC);
CREATE INDEX idx_events_content ON content_events (content_id);
CREATE INDEX idx_events_type_day ON content_events (event_type, created_at DESC);
# Celery: nightly batch pipeline
# Assumes json, time, mlflow, lightgbm (as lgb), a redis client and the celery
# app are imported/configured elsewhere; helper functions are project-specific.
@celery.task(name='nightly_recs', max_retries=2, acks_late=True)
def run_nightly_recommendations():
    started = time.monotonic()
    # The context manager closes the MLflow run even if the batch raises.
    with mlflow.start_run(run_name='nightly_batch'):
        events = extract_events(days=90)
        # N_USERS / N_ITEMS: index sizes, assumed to be defined alongside
        # the ID-to-index mapping.
        matrix = build_interaction_matrix(events, N_USERS, N_ITEMS)
        als = train_als(matrix, factors=64, alpha=40)
        tfidf = load_or_recompute_tfidf()
        ranker = lgb.Booster(model_file='ranker_weekly.lgb')
        active_users = get_active_users(days=30)
        pipe = redis.pipeline(transaction=False)
        for user_id in active_users:
            cf_ids, cf_scores = als.recommend(user_id, matrix[user_id], N=200,
                                              filter_already_liked_items=True)
            cb_ids, cb_scores = get_cb_candidates(user_id, tfidf, N=50)
            # deduplicate() is assumed to return (item_ids, cf_scores, cb_scores)
            candidates = deduplicate(cf_ids, cf_scores, cb_ids, cb_scores)
            features = build_ranking_features(user_id, *candidates)
            scores = ranker.predict(features)
            top50 = apply_debiasing_and_recency(candidates[0], scores)
            # Cast to plain int: json.dumps cannot serialise NumPy integers.
            pipe.set(f'recs:{user_id}', json.dumps([int(i) for i in top50]), ex=86400)
        pipe.execute()
        mlflow.log_metrics({'users_processed': len(active_users),
                            'batch_duration_s': time.monotonic() - started})
# FastAPI: recommendation serving
@app.get('/recommendations/{student_id}')
async def get_recommendations(student_id: int, user: User = Depends(verify_jwt)):
    # Students may only read their own recommendations.
    if user.id != student_id:
        raise HTTPException(status_code=403)
    cached = await redis.get(f'recs:{student_id}')
    if cached:
        return {'items': json.loads(cached), 'source': 'model'}
    # Cache miss: new student, expired TTL, or failed batch.
    fallback = await get_cold_start_recs(student_id)
    return {'items': fallback, 'source': 'cold_start'}

Production Monitoring
A batch pipeline that fails silently serves stale recommendations for 24 hours before anyone notices. The monitoring layer is simple by design: three signals cover the failure modes that actually occur in production.
All 40+ offline experiments were tracked in a self-hosted MLflow instance on the same EC2 instance. Parameters tracked: ALS factor count (32/64/128), regularisation (0.005/0.01/0.05), confidence scale α (20/40/80), interaction window (30/60/90/180 days), LightGBM learning rate and max depth. Primary metric: NDCG@10 on a 10% held-out validation set. The 90-day window + 64 factors + α=40 combination gave the best NDCG@10 without overfitting to historical popularity. Models are promoted via the MLflow Model Registry: a model in Staging is promoted only if its offline validation metrics exceed the current Production threshold.
12 Months of Experimentation
Every model change shipped through an A/B experiment — 80/20 traffic split (80% treatment, 20% holdout), minimum 14-day run to capture weekly study patterns, primary metric: 14-day click-through rate on recommended items, secondary metric: 30-day content completion rate. Five experiments ran sequentially over 12 months, each treatment group becoming the new baseline for the next test.
| # | Hypothesis | Treatment | Control | Primary Result | Decision |
|---|---|---|---|---|---|
| 01 | Any collaborative personalisation beats editorial curation | ALS-only recommendations | Editorial curated homepage feed | +22% CTR, +14% completion rate | Shipped |
| 02 | Content-based outperforms ALS for cold-start users (<20 interactions) | CB recs for cold-start cohort; ALS for warm users | ALS for all users (sparse signal for cold-start) | +31% CTR in cold-start segment; no change in warm segment | Shipped — CB for cold-start, ALS for warm |
| 03 | LightGBM learning-to-rank outperforms weighted score blending | LightGBM ranker (ALS + CB + 4 contextual features) | Linear blend: 0.7 × CF score + 0.3 × CB score | +8% CTR, +12% NDCG@10 vs. linear blend | Shipped |
| 04 | Popularity debiasing improves catalog coverage without hurting CTR | 0.4× penalty on overrepresented items (top-20% by 7-day views) | No debiasing — popularity-dominated ranking | CTR: no significant change · 3× catalog coverage (distinct items recommended) | Shipped — catalog health justified despite neutral CTR |
| 05 | Recency boost increases engagement with newly published content | 1.2× score multiplier for content <30 days old, exponential decay | No recency adjustment | +7% CTR on recently-published content subset; overall CTR +2% | Shipped |
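Calls such as "no significant change" in Experiment 04 imply a significance test on CTR. The text does not say which test was used; a two-proportion z-test is one common choice and is sketched below under that assumption. The function name is hypothetical, and the test treats impressions as independent, which repeated views by the same student only approximate.

```python
import math
from typing import Tuple


def two_proportion_z_test(clicks_a: int, views_a: int,
                          clicks_b: int, views_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference CTR_b - CTR_a."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A result would be read as significant at the 5% level when the p-value falls below 0.05; any counts passed in are whatever the experiment logs provide.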
Experiment 04 was the most contested internally. Standard recommendation systems thinking: if CTR is neutral and you're adding complexity, don't ship. That reasoning fails here.
Before debiasing, the top 20 items out of 2,400 captured 78% of all recommendation slots. Students preparing for optional papers — Literature, Anthropology, Law, Agriculture — received the same mainstream General Studies content as everyone else, because those items had accumulated the most interactions during the pre-personalisation editorial era. The recommendation engine was perpetuating the bias it was supposed to replace.
After debiasing, the number of distinct items appearing in recommendations tripled. Optional-paper students started seeing subject-relevant content for the first time. The business case: optional-paper students convert to paid annual subscriptions at a higher rate than GS-only students. Improving their experience — even without a 14-day CTR signal — had long-term retention value that a short experiment window couldn't measure.
Lesson: short-run CTR is a measurement of what's easy to click, not what's good to recommend. Match the experiment metric to the business outcome that actually matters.
Across all 5 experiments over 12 months: +34% CTR and +18% completion rate on recommended content. Completion rate on premium (paid-tier) content — the downstream metric the business tracked for subscription value — sustained a measurable lift that persisted through the 6-month observation window post-launch. The recommendation system also reduced the support tickets asking "where do I find X subject notes" by surfacing relevant content proactively — an unmeasured but clearly visible qualitative improvement the product team noted.
What Actually Moved the Needle
A year in production surfaced a clear hierarchy: data quality decisions outperformed architecture decisions; product-side changes sometimes outperformed model changes. Infrastructure decisions were made correctly once and then left alone.
The first ALS model treated all interaction events equally. A student who left a PDF open in a browser tab while making tea registered the same training signal as a student who bookmarked a note after a full read-through. The model was dominated by passive view events — the noisiest, lowest-intent signal in the dataset.
Introducing confidence weights — completion = 3×, bookmark = 2×, view >60s = 1×, view <60s = 0.3× — improved offline NDCG@10 by 9% without changing model architecture, training time, or infrastructure. The weights weren't derived from theory; they came from a manual correlation analysis of which interaction types predicted repeat engagement on a 30-day holdout.
Training the ALS model on the full 12-month interaction history amplified popularity bias from the pre-personalisation era. During that period, every student saw the same 30 editorial items on the homepage. Those items accumulated enormous interaction counts — not because they were the best content, but because they were the only visible content. When the ALS model was trained on this history, those same 30 items dominated its output, perpetuating the bias the system was designed to replace.
A 90-day rolling window reduced the influence of legacy editorial dominance and improved model responsiveness to current student interests. MLflow experiment across four window lengths: 90 days was the Pareto-optimal point — fresh enough to reduce legacy bias, long enough to retain signal for students who study infrequently. The 30-day window had better recency but 40% fewer training signals for students who study 2–3 times per week rather than daily.
The initial cold-start strategy was pure model-side: content-based recommendations built from the student's enrolled subject (captured during course registration). It worked, but produced generic subject-level popular content for 2–3 weeks until the ALS model had enough signal. During those early weeks, cold-start students had the worst recommendation quality on the platform.
A product change — a 3-question onboarding survey shown at first login, asking for primary preparation subjects, current stage (Prelims/Mains/Both), and difficulty preference — dramatically improved cold-start quality from day one. With those signals, the content-based profile could filter to subject-appropriate difficulty levels immediately instead of defaulting to subject-popular items.
CTR for new students in the first 30 days improved by 19% after the survey was introduced — more improvement than any model change delivered for the same cohort during the same period. The lesson: when cold-start quality is the bottleneck, ask the user before you model the user.
Before popularity debiasing, the top 20 items out of 2,400 captured 78% of all recommendation slots. This wasn't a model failure — it was a structural feedback loop. Popular items get recommended → students click them → they become more popular in training data → they get recommended more. A better model trained on the same data would amplify the same loop, not break it.
Breaking the loop required an explicit intervention: a 0.4× penalty applied to items in the global top-20% by 7-day view count, applied after LightGBM scoring and before final ranking. Experiment 04 showed no CTR change (popular items are genuinely good) but tripled the number of distinct items appearing in recommendations across the platform. Optional-paper students — 20% of the user base — started receiving subject-relevant recommendations for the first time.
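Both numbers quoted for this experiment, the share of slots held by the top items and the count of distinct items recommended, can be computed directly from the stored recommendation lists. A minimal sketch follows; the function name and input shape are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def slot_concentration(recs: Dict[int, List[int]], top_k: int = 20,
                       catalog_size: int = 2400) -> Tuple[float, float]:
    """Return (share of slots taken by the top_k items, catalog coverage)."""
    counts = Counter(item for items in recs.values() for item in items)
    total_slots = sum(counts.values())
    if total_slots == 0:
        return 0.0, 0.0
    top_share = sum(c for _, c in counts.most_common(top_k)) / total_slots
    coverage = len(counts) / catalog_size
    return top_share, coverage
```

Tracking these two values per nightly batch would have made the feedback loop visible offline, before an A/B test was needed to expose it.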
The LightGBM ranker was evaluated offline on NDCG@10 (+12% vs. linear blend). Online, in Experiment 03, it produced +8% CTR. A reasonable correlation — but Experiment 04 showed a complete disconnect: debiasing produced 0% CTR change despite tripling catalog coverage, a metric that wasn't even being tracked offline.
The business outcome that mattered — completion rate on premium content, which correlated with subscription renewal — didn't move in lockstep with 14-day CTR either. Completion rate improvement (+18% cumulatively) was a slower signal, visible only over 30-day observation windows, and required a separate tracking query to measure.
Lesson: define the downstream business metric before designing the experiment. NDCG@10 is a good offline proxy for ranking quality, but it cannot tell you whether the right users are getting the right content — only A/B testing with a business-aligned metric can do that.
Content preferences for UPSC students shift over days, not minutes. A nightly batch pipeline matched the signal timescale correctly. Real-time re-ranking — Kafka, Flink, stateful stream operators — would have consumed the remaining infrastructure runway for marginal quality gain. The 22-minute nightly run on a single EC2 instance remains the most cost-effective decision made across the project.
The recommendation engine proved the platform could build and own production ML infrastructure — models trained nightly, evaluated rigorously, deployed with monitoring. That foundation created the conditions for the next problem.
The next problem was different in kind. Students weren't just discovering content; they were asking specific questions buried inside that content. A recommendation system surfaces items. It cannot answer "what did the 2019 UPSC Mains Essay paper ask about Ethics?" That question requires retrieval over unstructured knowledge, generation grounded in source material, and accuracy guarantees no collaborative filter can provide.
That distinction — discovery vs. synthesis — is what drove the move from recommendation systems to retrieval-augmented generation. The infrastructure habits (async pipelines, Redis caching, evaluation gates before deployment) carried over. The problem class changed entirely.