MLflow — Experiment Tracking & Model Registry
A team promotes a model to production with no record of which dataset version, hyperparameters, or code commit produced it. MLflow solves this by logging every run's full context and linking it to a versioned model registry entry.
mlflow.start_run() opens a run, the unit that groups parameters, metrics, and artifacts inside an experiment. Log scalars with log_metric(step=epoch) for time-series tracking, hyperparameters with log_param, and files with log_artifact. mlflow.autolog() integrates natively with sklearn, PyTorch (through PyTorch Lightning), and XGBoost to capture standard metrics automatically. Nested runs (parent = sweep, child = trial) structure hyperparameter searches hierarchically. Programmatic comparison via MlflowClient.search_runs(experiment_ids, filter_string="metrics.val_auc > 0.85") enables automated champion-challenger selection without touching the UI.
import mlflow, mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("churn-prediction-v2")
# Assumes X_train, y_train, X_val, y_val are already loaded.
with mlflow.start_run(run_name="gbm-lr0.05") as run:
    # Log hyperparameters before training so failed runs still record them
    mlflow.log_params({"n_estimators": 300, "learning_rate": 0.05, "max_depth": 5})
    model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=5)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", auc)
    # Log the model artifact and register it as a new version of "churn-gbm"
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-gbm")
    print(f"Run {run.info.run_id} | AUC {auc:.4f}")
mlflow.start_run() inside a loop without a context manager leaves orphan RUNNING runs that pollute the experiment view and break automated searches.
On large PyTorch models, autolog can log full model checkpoints every epoch: at 500MB+ per run, storage costs spiral within days.
log_metric logs a single key-value at an optional step (for epoch-level tracking in loops); log_metrics logs a dictionary at once for batch efficiency. Use log_metric(step=epoch) inside training loops, log_metrics({}) for final summary metrics at the end of a run.
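Tag runs at creation with mlflow.set_tag("status","active"). Filter active runs with MlflowClient.search_runs(experiment_ids, filter_string="tags.status = 'active'"). Soft-delete stale runs via MlflowClient.delete_run(): it moves runs to the deleted lifecycle stage rather than erasing them, and MlflowClient.restore_run() brings them back until they are purged. Open-source MLflow purges only when mlflow gc is run; managed deployments may purge automatically, for example after 30 days. Never hard-delete without confirming no Model Registry versions reference the run_id.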
The MLflow Model Registry decouples training (run artifacts) from deployment (versioned model). Transition stages programmatically via MlflowClient.transition_model_version_stage(); note that stages are deprecated since MLflow 2.9 in favor of aliases, though they remain common in existing deployments. Every Production model links back to the run that created it, which gives full provenance. Where the registry supports webhooks, stage transitions can trigger downstream CI (integration tests when entering Staging, canary deploy when entering Production). Aliases (client.set_registered_model_alias("churn-gbm","champion","12")) let serving code reference a stable name rather than a stage string.
from mlflow.tracking import MlflowClient
client = MlflowClient(tracking_uri="http://mlflow-server:5000")
# Promote challenger to Staging
client.transition_model_version_stage(
    name="churn-gbm", version="14", stage="Staging",
    archive_existing_versions=False
)
# Champion-challenger comparison: read val_auc from the run behind each version
prod = client.get_latest_versions("churn-gbm", stages=["Production"])[0]
champ_auc = float(client.get_run(prod.run_id).data.metrics["val_auc"])
chall_auc = float(client.get_run(
    client.get_model_version("churn-gbm", "14").run_id).data.metrics["val_auc"])
# Promote only if the challenger beats the champion by more than 0.005 AUC
if chall_auc - champ_auc > 0.005:
    client.transition_model_version_stage(
        "churn-gbm", "14", "Production", archive_existing_versions=True)
    print(f"Promoted: AUC delta {chall_auc - champ_auc:.4f}")
Teams load models from a raw S3 path: no version history, no rollback path, no audit trail. A regression requires manual S3 archaeology to find the previous artifact.
Model registry version 14 is in Production, but the serving pod has no way to identify which MLflow run produced it — debugging a prod regression requires manual UI searching.
The artifact store is a raw file store (S3/GCS) for anything logged during a run. The registry is a higher-level abstraction adding versioning, lifecycle stages, aliases, and lineage on top of specific model artifacts. A model can exist in the artifact store without being registered; registration is an explicit promotion step that signals deployment-readiness.
MLflow allows this intentionally for canary/A/B deployments. Set aliases explicitly (client.set_registered_model_alias("churn-gbm","champion","12") and client.set_registered_model_alias("churn-gbm","challenger","14")) and load by alias in serving code. Archive the older version once the newer one is fully validated and traffic has been fully cut over.
An MLflow Project is a directory with an MLproject YAML file defining entry points, parameters, and an execution environment (conda.yaml or Docker image). mlflow run . -P learning_rate=0.01 runs training in an isolated environment. Projects are also runnable from Git URIs, with --version selecting the commit, branch, or tag: mlflow run git@<host>:org/repo --version v1.2 -P learning_rate=0.01 (a # suffix on the URI selects a subdirectory, not a revision). This lets any engineer reproduce a historical training run with a single command. Combine with DVC for data: check out the git tag, dvc pull, then mlflow run.
name: churn-prediction
conda_env: conda.yaml
entry_points:
  train:
    parameters:
      learning_rate: {type: float, default: 0.05}
      n_estimators: {type: int, default: 300}
      data_version: {type: str, default: "v3"}
    command: >
      python src/train.py
      --lr {learning_rate}
      --n-estimators {n_estimators}
      --data-version {data_version}
  evaluate:
    parameters:
      run_id: {type: str}
    command: "python src/evaluate.py --run-id {run_id}"
scikit-learn>=1.0 installs different patch versions across runs, causing metric variance that looks like model regression but is just library drift.
Hardcoded /data/train.parquet breaks when the team moves to S3 storage, and the MLproject entry point offers no way to parameterize it.
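One way to avoid the hardcoded path is to accept it as a command-line argument in the training script. The sketch below shows what the argument parsing in src/train.py could look like. The --lr, --n-estimators, and --data-version flags mirror the MLproject command above; --data-path and the example bucket are hypothetical additions, and the MLproject file would need a matching parameter passed through in its command.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a training entry point (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Train the churn model")
    parser.add_argument("--lr", type=float, default=0.05)
    parser.add_argument("--n-estimators", type=int, default=300)
    parser.add_argument("--data-version", default="v3")
    # Hypothetical flag: the data location becomes a parameter, not a constant.
    parser.add_argument("--data-path", default="/data/train.parquet")
    return parser


# The same script now works for local files and for object storage URIs.
args = build_parser().parse_args(
    ["--lr", "0.01", "--data-path", "s3://example-bucket/churn/train.parquet"]
)
defaults = build_parser().parse_args([])
print(args.lr, args.n_estimators, args.data_path, defaults.data_path)
```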
Retrieve the MLflow run by run_id, read tags for git_commit and dvc_data_hash. git checkout that commit, dvc pull to restore the exact dataset, then mlflow run . with the logged parameters. The Model Registry version links to run_id which links to Git and DVC — full three-way provenance chain.
Git + DVC (S3 remote) + MLflow (SQLite backend to start). Enforce: always commit .dvc files alongside code changes; log git_commit tag in every run; use dvc.yaml pipelines so retraining is one command. This covers experiment reproducibility and data lineage with zero Kubernetes overhead.
Every production model should trace back to a single MLflow run_id — that run_id is the anchor for data provenance, code version, and metric history.