Experiment Tracking with MLflow
Production MLflow setup for experiment tracking, model versioning, and artifact management. Covers local and remote tracking servers, model registry, and CI/CD integration.
Experiment tracking is the version control of machine learning. Without it, data scientists lose track of which hyperparameters produced which results, which dataset version trained which model, and which model is actually running in production. MLflow has emerged as the standard open-source solution: it is framework-agnostic, integrates with every major ML library, and scales from a single laptop to enterprise deployments.
The cost of not tracking experiments is invisible until it isn’t. You’ll discover the pain the first time someone asks “can you reproduce the model from three months ago?” and you can’t.
Core Concepts
| Component | Purpose | Storage |
|---|---|---|
| Tracking | Log parameters, metrics, artifacts per run | SQLite / PostgreSQL |
| Projects | Package code for reproducible runs | Git repository |
| Models | Standard format for model packaging | Local / S3 / Azure Blob |
| Model Registry | Lifecycle management (staging → production) | Database + artifact store |
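To make the tracking component concrete, here is a minimal local run against the SQLite backend from the table. This is a sketch; the experiment name and values are illustrative.

```python
import mlflow

# Local backend store (SQLite file); artifacts default to ./mlruns.
# Swap the URI for a remote server in team settings (next section).
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("quickstart")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.1)  # illustrative values
    mlflow.log_metric("accuracy", 0.91)
```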
Production Tracking Server
For team use, deploy a persistent tracking server with remote storage:
```bash
mlflow server \
  --backend-store-uri postgresql://mlflow:pass@db:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/experiments \
  --host 0.0.0.0 \
  --port 5000
```
Architecture:
- Backend store (PostgreSQL): Stores experiment metadata, parameters, metrics
- Artifact store (S3/Azure Blob): Stores model files, datasets, plots
- Tracking UI: Web interface for comparing experiments
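Note that clients upload artifacts directly to the artifact store, so each client needs object-store credentials in addition to the tracking URI. A minimal client-side setup, assuming the server above; hostnames and credentials are placeholders:

```python
import os

import mlflow

# Artifact uploads go client -> S3, so standard AWS credentials are required
os.environ["AWS_ACCESS_KEY_ID"] = "..."        # or an IAM role / shared profile
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
# For S3-compatible stores such as MinIO:
# os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio:9000"

mlflow.set_tracking_uri("http://mlflow-server:5000")
```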
Logging Experiments
```python
import mlflow
import mlflow.xgboost
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("churn-prediction")

params = {
    "n_estimators": 500,
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
}

with mlflow.start_run(run_name="xgboost-v3"):
    # Log hyperparameters
    mlflow.log_params(params)

    # Train model and score the held-out set
    model = train_model(X_train, y_train, **params)
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]

    # Log metrics
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, preds),
        "f1_score": f1_score(y_test, preds),
        "auc_roc": roc_auc_score(y_test, probs),
    })

    # Log the model
    mlflow.xgboost.log_model(model, "model")

    # Log artifacts (plots, data profiles)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("feature_importance.csv")
```
Model Registry Workflow
The Model Registry provides a centralized model store with lifecycle stages:
None (development) → Staging → Production → Archived
Registering a Model
```python
# Register during training
mlflow.xgboost.log_model(
    model, "model",
    registered_model_name="churn-predictor",
)

# Or register an existing run's model
result = mlflow.register_model(
    model_uri="runs:/abc123/model",
    name="churn-predictor",
)
```
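Consumers then resolve the model by name and stage through a models:/ URI instead of hard-coding a run ID:

```python
import mlflow

# "models:/<name>/<stage>" (or "/<version>") resolves through the registry
model = mlflow.pyfunc.load_model("models:/churn-predictor/Staging")
predictions = model.predict(X_new)  # X_new: placeholder for inference data
```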
Promoting Models
```python
from mlflow import MlflowClient

client = MlflowClient()

# Transition to staging
client.transition_model_version_stage(
    name="churn-predictor",
    version=3,
    stage="Staging",
)

# After validation, promote to production
client.transition_model_version_stage(
    name="churn-predictor",
    version=3,
    stage="Production",
)
```
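If you want a single live version per stage, the same call can demote whatever currently holds Production by passing archive_existing_versions:

```python
# Promote v3 and archive the previous Production version in one step
client.transition_model_version_stage(
    name="churn-predictor",
    version=3,
    stage="Production",
    archive_existing_versions=True,
)
```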
CI/CD Integration
Automate model validation before promotion:
```yaml
# .github/workflows/model-promotion.yml
on:
  workflow_dispatch:
    inputs:
      model_name:
        required: true
      model_version:
        required: true

jobs:
  validate:
    runs-on: ubuntu-latest
    env:
      # Assumes the server URL is stored as a repository secret
      MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
    steps:
      - uses: actions/checkout@v4  # needed so scripts/ is available
      - name: Load model from registry
        run: |
          python -c "
          import mlflow
          model = mlflow.pyfunc.load_model(
              'models:/${{ inputs.model_name }}/${{ inputs.model_version }}'
          )
          # Run validation suite
          "
      - name: Performance regression check
        run: python scripts/check_model_performance.py
      - name: Promote to production
        if: success()
        run: python scripts/promote_model.py
```
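The promotion step defers to scripts/promote_model.py, whose contents the workflow leaves open. A minimal sketch, assuming the workflow passes the inputs through as MODEL_NAME and MODEL_VERSION environment variables; that interface is an assumption, only the filename comes from the workflow:

```python
# scripts/promote_model.py (sketch)
import os

from mlflow import MlflowClient

client = MlflowClient()  # reads MLFLOW_TRACKING_URI from the environment

# MODEL_NAME / MODEL_VERSION are assumed to be exported by the workflow
client.transition_model_version_stage(
    name=os.environ["MODEL_NAME"],
    version=os.environ["MODEL_VERSION"],
    stage="Production",
    archive_existing_versions=True,
)
```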
Best Practices
- Log everything: Parameters, metrics, artifacts, environment info, git hash
- Use consistent naming: `{project}-{model_type}-{version}` for run names
- Tag runs: Add tags for dataset version, feature set version, and team
- Compare before promoting: Always compare new model against current production baseline
- Automate artifact cleanup: Set retention policies on old experiment artifacts
- Use model signatures: Define input/output schemas for each registered model (see the sketch below)
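A signature can be inferred from sample data at logging time. A minimal sketch with mlflow.models.infer_signature, continuing the churn example:

```python
import mlflow.xgboost
from mlflow.models import infer_signature

# Capture input/output schemas so downstream consumers can validate payloads
signature = infer_signature(X_train, model.predict(X_train))

mlflow.xgboost.log_model(
    model, "model",
    signature=signature,
    registered_model_name="churn-predictor",
)
```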