In This Article
- The Notebook-to-Production Gap
- MLOps Platform Components
- ML CI/CD: From Experiment to Production
- Model Registry: Version, Stage, and Deploy
- Feature Store: Compute Once, Serve Everywhere
- Training Infrastructure: GPU Management and Cost
- MLOps Team Structure
- Platform Comparison: MLflow vs Vertex vs SageMaker
- MLOps Maturity Model: From Ad-Hoc to Automated
- MLOps for Regulated Industries
- Getting Started: MLOps Platform in 8 Weeks
- Go Deeper
The Notebook-to-Production Gap
A data science team builds 10 models per year. Of those 10: 3 make it to production. Of those 3: 1 still works accurately after 6 months. The attrition: 7 models die in notebooks (no engineering capacity to productionize), 2 models degrade undetected (no monitoring), and 1 survives because one engineer manually maintains it. The root cause isn't model quality — it's infrastructure. Without MLOps: there's no automated pipeline from training to deployment, no model registry tracking which version is in production, no feature store ensuring consistent features between training and serving, no monitoring detecting when the model's accuracy drops, and no automated retraining when performance degrades. MLOps is to ML what DevOps is to software — the engineering discipline that makes the science deployable, reliable, and maintainable.
MLOps Platform Components
| Component | Purpose | Without It |
|---|---|---|
| Experiment Tracking | Log every training run: params, metrics, artifacts | Scientists can't reproduce results or compare approaches |
| Model Registry | Version models, manage stage transitions (dev→staging→prod) | Nobody knows which model version is in production |
| Feature Store | Centralized feature computation and serving | Training/serving skew — model accuracy drops in production |
| ML CI/CD | Automated testing, validation, and deployment | Manual deployment — slow, error-prone, inconsistent |
| Model Serving | Serve predictions via API or batch | Ad-hoc serving — no scaling, no monitoring, no fallback |
| Model Monitoring | Track accuracy, drift, and data quality in production | Model degrades silently until business impact is visible |
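To make the first component concrete, here is a minimal experiment-tracking sketch using MLflow's logging API; the experiment name, parameter values, and metric values are placeholders rather than recommendations.

```python
# Minimal MLflow experiment-tracking sketch. The experiment name, parameters,
# and metric values are placeholders standing in for a real training run.
import mlflow

mlflow.set_experiment("churn-model")

with mlflow.start_run():
    # Log hyperparameters so any run can be reproduced and compared later.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_depth", 6)

    # ... training happens here ...

    # Log evaluation metrics; artifacts (the model file itself) are logged the same way.
    mlflow.log_metric("accuracy", 0.913)
    mlflow.log_metric("auc", 0.942)
    # mlflow.sklearn.log_model(model, "model")  # artifact logging for a scikit-learn model
```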
ML CI/CD: From Experiment to Production
ML CI/CD differs from software CI/CD because the artifact isn't just code — it's code + data + trained model. The ML CI/CD pipeline: code commit (data scientist pushes training code + feature definitions) → data validation (verify training data hasn't drifted from the expected distribution — schema checks, statistical tests, completeness validation) → training (execute the training pipeline on versioned data with logged hyperparameters) → model validation (validate the new model against the current production model's accuracy, a minimum accuracy threshold, and bias/fairness criteria) → staging deployment (deploy to the staging environment, run integration tests with production-like traffic) → approval gate (human review of accuracy metrics, fairness assessment, and resource requirements) → production deployment (canary deployment — route 5% of traffic to the new model, monitor, gradually increase to 100%) → post-deployment monitoring (accuracy, latency, and data drift tracked continuously).
Key difference from software CI/CD: Software tests are deterministic (the test passes or fails identically every time). ML tests are statistical (the model achieves 91.3% accuracy ± 0.5% depending on random initialization). ML CI/CD must handle: acceptable accuracy ranges (not exact thresholds), comparison against the current production model (not just absolute metrics), and data-dependent behavior (the same code produces different models with different training data). This statistical nature makes ML CI/CD more complex than software CI/CD — but the alternative (manual deployment with no validation) is far worse.
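As a rough illustration of the model-validation gate described above, the sketch below compares a candidate model against the current production model using an accuracy tolerance rather than an exact threshold. The function name, thresholds, and metric keys are hypothetical, not taken from any specific framework.

```python
# Hypothetical ML CI/CD validation gate: promote only if the candidate clears an
# absolute accuracy floor, matches or beats production within a statistical tolerance,
# and passes a simple fairness check. All names and values are illustrative.

MIN_ACCURACY = 0.85   # absolute floor the candidate must clear
TOLERANCE = 0.005     # allowance for run-to-run statistical noise

def validate_candidate(candidate: dict, production: dict) -> bool:
    """Return True if the candidate model may be promoted to staging."""
    # 1. Absolute threshold: never promote a model below the minimum bar.
    if candidate["accuracy"] < MIN_ACCURACY:
        return False
    # 2. Relative comparison: candidate must match or beat production within the tolerance.
    if candidate["accuracy"] < production["accuracy"] - TOLERANCE:
        return False
    # 3. Fairness gate: e.g. a demographic-parity-style gap must stay under 5 points.
    if candidate.get("fairness_gap", 0.0) > 0.05:
        return False
    return True

print(validate_candidate({"accuracy": 0.913, "fairness_gap": 0.02}, {"accuracy": 0.910}))
```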
Model Registry: Version, Stage, and Deploy
The model registry is the single source of truth for: which models exist, which version of each model is in production, and the lineage of each model (trained on what data, with what code, producing what metrics). Registry operations: register (after training, the model artifact + metadata are registered with: model name, version number, training data version, code commit hash, accuracy metrics, and training parameters), stage transition (models move through stages: None → Staging → Production → Archived. Each transition requires validation criteria met + approval), serve (the serving infrastructure pulls the "Production" stage model — when a new version is promoted, serving automatically switches), and rollback (if the new production model underperforms, roll back to the previous version in seconds — the previous version is still in the registry, ready to re-promote). MLflow Model Registry (integrated with Databricks and compatible with Fabric) is the most widely adopted open-source registry. Azure ML Model Registry provides the managed alternative for Azure-native deployments.
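A sketch of the register → promote → serve flow using MLflow's classic stage-based registry API (newer MLflow releases also offer alias-based promotion). The model name, the local SQLite tracking URI, and the toy model exist only so the sketch runs end to end; a real pipeline would register the artifact produced by its training step.

```python
# Register → stage transition → serve with the MLflow Model Registry.
# Assumptions for this sketch: a local SQLite backend (the registry needs a
# database-backed store) and a tiny scikit-learn model standing in for a real one.
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("sqlite:///mlflow.db")

# Train and log a toy model so there is an artifact to register.
X, y = make_classification(n_samples=200, random_state=0)
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(LogisticRegression().fit(X, y), "model")

# 1. Register: the artifact becomes a new version of the named model, with lineage to the run.
version = mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_model")

# 2. Stage transition: promote once validation criteria are met and approval is recorded.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_model",
    version=version.version,
    stage="Production",
    archive_existing_versions=True,  # the previous Production version moves to Archived
)

# 3. Serve: serving always loads whichever version holds the Production stage, which is
#    also why rollback is just re-promoting the previous registered version.
model = mlflow.pyfunc.load_model("models:/churn_model/Production")
```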
Feature Store: Compute Once, Serve Everywhere
The feature store solves three problems: training/serving skew (features computed differently in training notebooks vs production serving → model accuracy drops. The feature store ensures identical computation in both contexts), feature reuse (the "customer_90day_purchase_count" feature is needed by: churn model, recommendation model, CLV model. Without a feature store: each team computes it independently, with slightly different definitions. With a feature store: computed once, shared across all models), and point-in-time correctness (training data must use features as they existed at prediction time — not current values. The feature store handles this temporal join automatically). Feature store architecture: offline store (batch features for training — stored in the data platform lakehouse, updated on schedule), online store (low-latency features for real-time serving — stored in Redis or Cosmos DB, updated via streaming), and feature serving API (models request features by entity ID — the store returns the correct features from offline or online store based on the serving context). Databricks Feature Store and Fabric Feature Store provide managed implementations.
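Point-in-time correctness is easiest to see in code. The sketch below reproduces manually, with a pandas as-of join, what a feature store's temporal join does automatically; the table and column names are illustrative.

```python
# Manual point-in-time ("as of") join: each training label gets the feature value
# that existed at prediction time, never a later one. A feature store performs this
# temporal join for you; the data here is made up to show the mechanism.
import pandas as pd

# Label events: entity + the timestamp at which the prediction would have been made.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-03-01", "2024-06-01", "2024-05-15"]),
    "churned": [0, 1, 0],
})

# Feature snapshots: the value of customer_90day_purchase_count as of each compute date.
features = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "feature_ts": pd.to_datetime(["2024-02-01", "2024-04-01", "2024-05-20",
                                  "2024-04-10", "2024-05-10"]),
    "customer_90day_purchase_count": [3, 5, 7, 2, 4],
})

# Backward as-of join: take the most recent feature value at or before event_ts.
training_set = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    by="customer_id",
    left_on="event_ts",
    right_on="feature_ts",
    direction="backward",
)
print(training_set[["customer_id", "event_ts", "customer_90day_purchase_count", "churned"]])
```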
Training Infrastructure: GPU Management and Cost
ML training consumes GPU compute — the most expensive cloud resource. Cost management: spot/preemptible instances (60-80% cheaper than on-demand — use for training jobs that can checkpoint and resume after interruption. Most training frameworks support checkpointing natively), auto-scaling clusters (scale GPU nodes to zero when no training jobs are running — the training cluster costs $0/hour when idle), right-sized GPUs (not every model needs A100s. Tabular ML: CPU-only. Fine-tuning medium models: T4 or A10. Training large models from scratch: A100 or H100. Overprovisioning the GPU type is the most common source of waste), and experiment efficiency (hyperparameter search can consume 100x the compute of a single training run. Use Bayesian optimization instead of grid search — 5-10x fewer trials for equivalent results). For a team running 50 training experiments/month, optimized GPU management saves $3-8K/month vs unoptimized — significant at scale.
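Checkpointing is what makes spot/preemptible training safe to interrupt. Below is a minimal PyTorch-style sketch of the save-and-resume pattern; the model, epoch count, and checkpoint path are placeholders.

```python
# Save-and-resume pattern for spot/preemptible training: persist a checkpoint every
# epoch, and on restart resume from the latest one, so an interruption loses at most
# one epoch of work. The model and checkpoint path are placeholders.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters())
start_epoch = 0

# Resume: if the job was preempted, pick up where the last checkpoint left off.
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 20):
    # ... one epoch of training on the real data would run here ...

    # Checkpoint after each epoch so a spot interruption is cheap to recover from.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,
    )
```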
MLOps Team Structure
| Role | Focus | Count (per 5 data scientists) |
|---|---|---|
| ML Engineer | Productionize models, build serving infrastructure, CI/CD | 2-3 |
| Data Engineer | Feature pipelines, data quality, training data management | 2-3 |
| MLOps/DevOps Engineer | Platform infrastructure, GPU management, monitoring | 1 |
The critical ratio: 2 ML engineers per 5 data scientists. Organizations with more scientists than engineers produce notebooks. Organizations with balanced teams produce production models.
Platform Comparison: MLflow vs Vertex vs SageMaker
| Platform | Best For | Strengths |
|---|---|---|
| MLflow (Databricks) | Multi-cloud, open-source | Open-source, portable, largest community, feature store + registry integrated |
| Azure ML | Azure-native enterprise | Managed endpoints, responsible AI dashboard, Fabric integration |
| Vertex AI (GCP) | Google Cloud enterprise | AutoML, Gemini integration, managed pipelines |
| SageMaker (AWS) | AWS-native enterprise | Broadest feature set, SageMaker Studio, inference optimization |
MLOps Maturity Model: From Ad-Hoc to Automated
Five MLOps maturity levels: Level 0 — No MLOps (models in notebooks, manual deployment, no monitoring. 80% of organizations start here). Level 1 — Manual Pipeline (training scripts in Git, manual deployment with documented process, basic monitoring). Level 2 — Automated Training (automated training pipeline, model registry for versioning, experiment tracking. The model retrains on schedule without human intervention). Level 3 — Automated Deployment (ML CI/CD pipeline: training → validation → staging → production. Feature store for consistent features. Automated retraining triggered by monitoring). Level 4 — Full Automation (end-to-end automation: data drift triggers retraining → new model validated → deployed automatically. A/B testing for model comparison. Human oversight for governance, not for operations). Most enterprises target Level 2-3 within 12 months. Level 4 requires: mature data infrastructure, 5+ production models, and an established ML engineering team. The maturity assessment takes 1-2 days and produces: current level, target level, and the capability investments needed to advance.
MLOps for Regulated Industries
Regulated industries add governance requirements to every MLOps component: model documentation (model card for every production model: training data, performance metrics, bias assessment, intended use, and known limitations — required for FDA (medical devices), ECOA (credit decisions), and EU AI Act (high-risk AI)), reproducibility (given the same data + code + parameters → produce the identical model. Required for: audit, regulatory examination, and incident investigation. Achieved through: data versioning, code versioning, and deterministic training configurations), explainability (every prediction must be explainable — SHAP values for feature contribution, decision boundary visualization, and plain-language explanation generation. Required for: credit decisions, insurance underwriting, and clinical recommendations), and audit trail (every model version: who trained it, when, on what data, with what approval, and what it replaced. Required for: SOX (financial models), HIPAA (clinical models), and state insurance regulations). Budget 25-40% additional MLOps effort for regulated industries.
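The reproducibility requirement is largely a matter of pinning randomness and recording lineage. The sketch below shows one way to do that in plain Python; the helper names, the SHA-256 data fingerprint, and the config fields are illustrative assumptions, not a regulatory prescription.

```python
# Reproducibility controls: fixed seeds, a fingerprint of the exact training data,
# and a persisted training config. Helper names, hash choice, and config fields are
# illustrative; deep learning frameworks need their own seed/determinism settings too.
import hashlib
import json
import random

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Pin the sources of randomness so the same data + code + params yield the same model."""
    random.seed(seed)
    np.random.seed(seed)

def data_fingerprint(path: str) -> str:
    """Hash the training data file so the audit trail records exactly which bytes were used."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

training_config = {
    "seed": 42,
    "code_commit": "<git-commit-hash>",    # placeholder: filled in by the CI pipeline
    "data_version": "<data-snapshot-id>",  # placeholder: e.g. a Delta table version
    "params": {"learning_rate": 0.01, "max_depth": 6},
}

set_seeds(training_config["seed"])
# Persist the config alongside the model artifact for audit and regulatory examination.
print(json.dumps(training_config, indent=2))
```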
Getting Started: MLOps Platform in 8 Weeks
The 8-week MLOps bootstrap: Week 1-2: Deploy MLflow (experiment tracking + model registry). Every training run logged from day one — even before CI/CD exists, the team builds the habit of logging experiments. Week 3-4: Build the first ML CI/CD pipeline (training script → validation → staging deployment). Use GitHub Actions or Azure DevOps for orchestration. Start with the team's highest-priority model. Week 5-6: Add basic monitoring (Evidently AI for data drift + performance tracking). Set up alerting to Slack/Teams when drift exceeds threshold. Week 7-8: Build the feature store for the top 10 features used across models. Deploy the automated retraining pipeline triggered by monitoring alerts. After 8 weeks: the team has a functional MLOps platform that handles the critical path from training to production. It's not enterprise-grade yet — but it's operational, and every subsequent model benefits from the infrastructure. Scale the platform as model count grows.
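For the Week 5-6 monitoring step, a dedicated tool such as Evidently AI is the usual choice; as a stand-in, the sketch below shows the underlying mechanism with a per-feature two-sample Kolmogorov-Smirnov test. The threshold and feature names are placeholders.

```python
# Per-feature drift check: compare the serving-time distribution of each feature
# against the training (reference) distribution and flag drifted features. The
# p-value threshold and feature names are placeholder assumptions.
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01  # below this, treat the feature as drifted

def drifted_features(reference: dict, current: dict) -> list:
    """Return (feature, KS statistic) pairs for features whose distribution has shifted."""
    drifted = []
    for name, ref_values in reference.items():
        stat, p_value = ks_2samp(ref_values, current[name])
        if p_value < P_VALUE_THRESHOLD:
            drifted.append((name, round(stat, 3)))
    return drifted

rng = np.random.default_rng(0)
reference = {"purchase_count": rng.poisson(5, 10_000)}
current = {"purchase_count": rng.poisson(8, 10_000)}  # simulated drift in production traffic

alerts = drifted_features(reference, current)
if alerts:
    print(f"Drift detected: {alerts} -> alert the team / trigger the retraining pipeline")
```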
MLOps Team Structure and Responsibilities
The MLOps team bridges data science and engineering: ML engineers (build training pipelines, serving infrastructure, and monitoring — they build the systems that make models work in production. Skills: Python, Spark, Kubernetes, Docker, MLflow), data engineers (build feature pipelines that compute and serve features at scale. Skills: Spark, SQL, streaming), data scientists (build models, define features, set evaluation criteria — the MLOps platform's primary users, not its builders), and DevOps engineers (manage the underlying infrastructure: Kubernetes clusters, GPU instances, container registries). Team sizing: 1 ML engineer per 2-3 data scientists keeps models flowing from notebooks to production without a bottleneck.
The Xylity Approach
We build MLOps platforms with the 6-component architecture — experiment tracking, model registry, feature store, ML CI/CD, model serving, and monitoring. Our ML engineers, data engineers, and AI architects build the infrastructure that closes the notebook-to-production gap — turning data science experiments into production systems that drive business decisions.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Close the Notebook-to-Production Gap
CI/CD, model registry, feature store, monitoring. MLOps infrastructure that turns experiments into production systems.
Start Your MLOps Platform →