MLOps: From Jupyter Notebook to Production Pipeline

The 90% Problem

Getting a machine learning model to work in a Jupyter notebook is relatively easy. Getting it into production, keeping it running reliably, and maintaining it over time is where most AI projects fail. Widely cited industry estimates suggest that 80-90% of ML models never make it to production, and many of those that do degrade or fail within months because of data drift, infrastructure issues, or organizational problems.

MLOps (Machine Learning Operations) is the discipline that closes this gap — applying DevOps principles and practices to the ML lifecycle to build systems that actually work in production.

The Full ML Lifecycle

Production ML involves far more than model training:

  1. Data collection and versioning: Data changes over time. Without versioning (DVC, LakeFS), you can't reproduce results or debug regressions.
  2. Feature engineering: Feature stores (Feast, Tecton) provide consistent feature computation between training and serving. Inconsistent computation is a common source of training-serving skew.
  3. Training infrastructure: Experiment tracking (MLflow, Weights & Biases), distributed training, GPU resource management.
  4. Model registry: Version and stage management for trained models; audit trail of what was deployed when.
  5. Serving infrastructure: Model servers, autoscaling, load balancing, A/B testing.
  6. Monitoring: Data drift detection, model performance monitoring, alerting.
  7. Retraining: Automated triggering of retraining when drift is detected; CI/CD for ML.
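
Steps 6 and 7 can be made concrete with a small drift check. The sketch below computes the Population Stability Index (PSI) for one numeric feature and uses it as a retraining trigger. The function names, the ten equal-width bins, and the 0.2 threshold (a common rule of thumb, not a universal constant) are illustrative assumptions; a production system would more likely rely on a dedicated monitoring tool.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference (training) sample and a current (serving) sample."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 for a constant feature.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Serving values outside the reference range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


def should_retrain(psi: float, threshold: float = 0.2) -> bool:
    """Hypothetical trigger: flag retraining when PSI exceeds the threshold (assumed value)."""
    return psi > threshold
```

In practice a check like this would run per feature on a schedule, with the reference sample taken from the training set and the current sample from recent serving traffic.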

Training-Serving Skew: The Silent Killer

Training-serving skew, where the data distribution at serving time differs from the one seen during training, is one of the most common production ML failure modes. It occurs when features are computed differently in training and serving (timestamp handling, categorical encoding, scaling), when training data is not representative of production data, or when the production data distribution shifts over time.

Preventing skew requires using the same feature computation code in training and serving (the problem feature stores are designed to address), testing on held-out data drawn from the production distribution, and monitoring feature distributions in production.
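
The first of these measures, sharing feature code, can be illustrated without a feature store. In the minimal sketch below, a single function owns the feature logic, and both the training job and the serving handler call it. The record schema (event_time, amount) and all function names are hypothetical.

```python
import math
from datetime import datetime, timezone


def compute_features(raw: dict) -> dict:
    """Single source of truth for feature logic (hypothetical schema).

    Assumes raw["event_time"] is an ISO 8601 string that carries a UTC offset.
    """
    # Normalizing to UTC in one place avoids timestamp-handling skew.
    ts = datetime.fromisoformat(raw["event_time"]).astimezone(timezone.utc)
    return {
        "hour_utc": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),
        # The same scaling is applied in training and serving.
        "log_amount": math.log1p(max(raw["amount"], 0.0)),
    }


def build_training_rows(records: list[dict]) -> list[dict]:
    """Offline path: used by the training job."""
    return [compute_features(r) for r in records]


def features_for_request(request: dict) -> dict:
    """Online path: used by the model server for a single request."""
    return compute_features(request)
```

Because both paths delegate to compute_features, a change to timestamp handling or scaling cannot reach one path without the other. A feature store extends the same idea with storage, backfills, and low-latency lookup.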

Experiment Tracking

Every training run should be tracked, including hyperparameters, dataset version, training code version, evaluation metrics, and system metrics (GPU utilization, memory, time). MLflow and Weights & Biases are widely used tools for this. Without tracking, you can't reproduce previous results, compare experiments, or understand why model performance changed.
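
To show what a tracked run contains, here is a minimal sketch that uses only the Python standard library and appends one JSON record per run to a log file. It is a stand-in for illustration, not the MLflow or Weights & Biases API; the RunRecord fields and helper names are assumptions.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunRecord:
    """One training run (hypothetical minimal schema)."""
    params: dict
    dataset_sha256: str  # ties the run to an exact dataset version
    code_version: str    # e.g. a git commit hash
    metrics: dict = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)


def fingerprint(data: bytes) -> str:
    """Content hash used as a simple dataset version identifier."""
    return hashlib.sha256(data).hexdigest()


def log_run(record: RunRecord, log_path: Path) -> None:
    """Append the run as one JSON line so earlier runs are never overwritten."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def load_runs(log_path: Path) -> list[dict]:
    """Read every logged run back as a plain dict."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(runs: list[dict], metric: str) -> dict:
    """Return the run with the highest value of the given metric."""
    return max(runs, key=lambda r: r["metrics"][metric])
```

A dedicated tracker adds what this sketch lacks: a UI for comparing runs, artifact storage, system metrics, and concurrent writers.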

CI/CD for ML

Automated testing for ML pipelines must handle the stochastic nature of training: model performance will vary between runs. A good ML CI/CD pipeline runs automated tests on small data to detect code regressions, uses statistical tests to determine whether metric changes are significant, and automates deployment when all tests pass and performance meets thresholds. The Meridian AI MS in AI Engineering covers complete CI/CD pipeline implementation as part of MLE-440.
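
The statistical-test and threshold steps can be sketched as a deployment gate. The example below compares a metric collected over several training runs (different seeds) of the current model and of a candidate, using a one-sided permutation test. The choice of test, the function names, the 0.80 metric floor, and the 0.05 significance level are all illustrative assumptions rather than a prescribed method.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(
    baseline: Sequence[float],
    candidate: Sequence[float],
    n_permutations: int = 5000,
    seed: int = 0,
) -> float:
    """One-sided p-value for 'the candidate's mean metric is lower than the baseline's'."""
    rng = random.Random(seed)  # fixed seed keeps the CI check itself deterministic
    observed = mean(baseline) - mean(candidate)
    pooled = list(baseline) + list(candidate)
    n_base = len(baseline)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_base]) - mean(pooled[n_base:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


def passes_gate(
    baseline: Sequence[float],
    candidate: Sequence[float],
    min_metric: float = 0.80,  # assumed absolute floor
    alpha: float = 0.05,       # assumed significance level
) -> bool:
    """Hypothetical gate: meet the floor and show no significant regression."""
    if mean(candidate) < min_metric:
        return False
    return permutation_p_value(baseline, candidate) >= alpha
```

With this gate, ordinary run-to-run noise does not block a deployment, while a consistent drop across runs does. Collecting more runs per model makes the test more sensitive at the cost of CI time.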