Experiment Tracking with MLflow
Production MLflow setup for experiment tracking, model versioning, and artifact management. Covers local and remote tracking servers, model registry, and CI/CD integration.
Experiment tracking is the version control of machine learning. Without it, data scientists lose track of which hyperparameters produced which results, which dataset version trained which model, and which model is actually running in production. MLflow has emerged as the standard open-source solution: it is framework-agnostic, integrates with every major ML library, and scales from a single laptop to enterprise deployments.
The cost of not tracking experiments is invisible until it isn’t. You’ll discover the pain the first time someone asks “can you reproduce the model from three months ago?” and you can’t.
Core Concepts
| Component | Purpose | Storage |
|---|---|---|
| Tracking | Log parameters, metrics, artifacts per run | SQLite / PostgreSQL |
| Projects | Package code for reproducible runs | Git repository |
| Models | Standard format for model packaging | Local / S3 / Azure Blob |
| Model Registry | Lifecycle management (staging → production) | Database + artifact store |
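To make the tracking component concrete, here is a minimal local run against the SQLite backend from the table. This is a sketch; the experiment name and values are illustrative.

```python
import mlflow

# Local backend store (SQLite file); artifacts default to ./mlruns.
# Swap the URI for a remote server in team settings (next section).
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("quickstart")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.1)  # illustrative values
    mlflow.log_metric("accuracy", 0.91)
```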
Production Tracking Server
For team use, deploy a persistent tracking server with remote storage:
```bash
mlflow server \
  --backend-store-uri postgresql://mlflow:pass@db:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/experiments \
  --host 0.0.0.0 \
  --port 5000
```
Architecture:
- Backend store (PostgreSQL): Stores experiment metadata, parameters, metrics
- Artifact store (S3/Azure Blob): Stores model files, datasets, plots
- Tracking UI: Web interface for comparing experiments
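Note that clients upload artifacts directly to the artifact store, so each client needs object-store credentials in addition to the tracking URI. A minimal client-side setup, assuming the server above; hostnames and credentials are placeholders:

```python
import os

import mlflow

# Artifact uploads go client -> S3, so standard AWS credentials are required
os.environ["AWS_ACCESS_KEY_ID"] = "..."        # or an IAM role / shared profile
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
# For S3-compatible stores such as MinIO:
# os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio:9000"

mlflow.set_tracking_uri("http://mlflow-server:5000")
```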
Logging Experiments
```python
import mlflow
import mlflow.xgboost
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("churn-prediction")

params = {
    "n_estimators": 500,
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
}

with mlflow.start_run(run_name="xgboost-v3"):
    # Log hyperparameters
    mlflow.log_params(params)

    # Train model and score the held-out set
    model = train_model(X_train, y_train, **params)
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]

    # Log metrics
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, preds),
        "f1_score": f1_score(y_test, preds),
        "auc_roc": roc_auc_score(y_test, probs),
    })

    # Log the model
    mlflow.xgboost.log_model(model, "model")

    # Log artifacts (plots, data profiles)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("feature_importance.csv")
```
Model Registry Workflow
The Model Registry provides a centralized model store with lifecycle stages:
None (development) → Staging → Production → Archived
Registering a Model
```python
# Register during training
mlflow.xgboost.log_model(
    model, "model",
    registered_model_name="churn-predictor",
)

# Or register an existing run's model
result = mlflow.register_model(
    model_uri="runs:/abc123/model",
    name="churn-predictor",
)
```
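Consumers then resolve the model by name and stage through a models:/ URI instead of hard-coding a run ID:

```python
import mlflow

# "models:/<name>/<stage>" (or "/<version>") resolves through the registry
model = mlflow.pyfunc.load_model("models:/churn-predictor/Staging")
predictions = model.predict(X_new)  # X_new: placeholder for inference data
```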
Promoting Models
```python
from mlflow import MlflowClient

client = MlflowClient()

# Transition to staging
client.transition_model_version_stage(
    name="churn-predictor",
    version=3,
    stage="Staging",
)

# After validation, promote to production
client.transition_model_version_stage(
    name="churn-predictor",
    version=3,
    stage="Production",
)
```
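If you want a single live version per stage, the same call can demote whatever currently holds Production by passing archive_existing_versions:

```python
# Promote v3 and archive the previous Production version in one step
client.transition_model_version_stage(
    name="churn-predictor",
    version=3,
    stage="Production",
    archive_existing_versions=True,
)
```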
CI/CD Integration
Automate model validation before promotion:
```yaml
# .github/workflows/model-promotion.yml
on:
  workflow_dispatch:
    inputs:
      model_name:
        required: true
      model_version:
        required: true

jobs:
  validate:
    runs-on: ubuntu-latest
    env:
      # Assumes the server URL is stored as a repository secret
      MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
    steps:
      - uses: actions/checkout@v4  # needed so scripts/ is available
      - name: Load model from registry
        run: |
          python -c "
          import mlflow
          model = mlflow.pyfunc.load_model(
              'models:/${{ inputs.model_name }}/${{ inputs.model_version }}'
          )
          # Run validation suite
          "
      - name: Performance regression check
        run: python scripts/check_model_performance.py
      - name: Promote to production
        if: success()
        run: python scripts/promote_model.py
```
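The promotion step defers to scripts/promote_model.py, whose contents the workflow leaves open. A minimal sketch, assuming the workflow passes the inputs through as MODEL_NAME and MODEL_VERSION environment variables; that interface is an assumption, only the filename comes from the workflow:

```python
# scripts/promote_model.py (sketch)
import os

from mlflow import MlflowClient

client = MlflowClient()  # reads MLFLOW_TRACKING_URI from the environment

# MODEL_NAME / MODEL_VERSION are assumed to be exported by the workflow
client.transition_model_version_stage(
    name=os.environ["MODEL_NAME"],
    version=os.environ["MODEL_VERSION"],
    stage="Production",
    archive_existing_versions=True,
)
```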
Best Practices
- Log everything: Parameters, metrics, artifacts, environment info, git hash
- Use consistent naming: `{project}-{model_type}-{version}` for run names
- Tag runs: Add tags for dataset version, feature set version, and team
- Compare before promoting: Always compare new model against current production baseline
- Automate artifact cleanup: Set retention policies on old experiment artifacts
- Use model signatures: Define input/output schemas for each registered model (see the sketch below)
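A signature can be inferred from sample data at logging time. A minimal sketch with mlflow.models.infer_signature, continuing the churn example:

```python
import mlflow.xgboost
from mlflow.models import infer_signature

# Capture input/output schemas so downstream consumers can validate payloads
signature = infer_signature(X_train, model.predict(X_train))

mlflow.xgboost.log_model(
    model, "model",
    signature=signature,
    registered_model_name="churn-predictor",
)
```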