In This Article
- The Production Reality: What Changes When the Notebook Closes
- The Production Readiness Checklist: 12 Requirements
- Production Data Pipeline: From Source to Feature Vector
- Model Serving Architecture: Batch, Real-Time, and Streaming
- Testing ML Systems: Beyond Unit Tests
- Production Monitoring: Catching Degradation Before Users Do
- Retraining Strategy: Scheduled, Triggered, and Continuous
- Who Owns Production ML? Roles and Responsibilities
- Go Deeper
The Production Reality: What Changes When the Notebook Closes
A data scientist presents a churn prediction model to leadership. AUC: 0.84. The model identifies customers who will churn in the next 90 days with precision high enough to justify the retention intervention cost. Leadership approves deployment. The data scientist exports the model as a pickle file and hands it to the engineering team. Then reality arrives.
The notebook reads from a CSV file the data scientist created by running a SQL query against the data warehouse and joining it with an Excel export from the CRM team. In production, this data must flow automatically — every day, from live systems, through validated pipelines, with quality checks that catch missing values, schema changes, and data drift. The notebook imports 23 Python packages at specific versions. In production, these dependencies must be containerized, versioned, and reproducible across environments. The notebook's feature engineering runs in 47 cells with inline comments instead of documentation. In production, these transformations must be testable, monitored, and consistent between training and serving.
The model is 20% of a production ML system. The other 80% — data pipelines, serving infrastructure, monitoring, retraining automation, and operational runbooks — is what separates a notebook experiment from a production ML system.
The Production Readiness Checklist: 12 Requirements
Before any model reaches production, it must pass these 12 requirements. Models that skip them don't fail immediately — they fail silently over weeks and months as data shifts, pipelines degrade, and predictions become unreliable without anyone noticing.
| # | Requirement | What It Means | Common Shortcut (Don't) |
|---|---|---|---|
| 1 | Reproducible training | Anyone can rebuild the exact model from code + data + config | Training depends on notebook state or manual steps |
| 2 | Versioned artifacts | Model, data, code, and config all version-tracked together | Model pickle in a shared drive with no version info |
| 3 | Automated data pipeline | Features computed by production pipelines, not manual queries | Data scientist runs SQL manually before each training run |
| 4 | Training-serving consistency | Same feature transformations at training and inference time | Notebook computes features one way, serving code another |
| 5 | Data validation | Input data checked for schema, range, completeness before inference | Model accepts any input and produces confident garbage |
| 6 | Model validation gate | New model must outperform current model on test data before deploy | Retrained model deployed without comparison |
| 7 | Containerized serving | Model + dependencies packaged in a container with defined API | Model runs on a specific VM with manually installed packages |
| 8 | Latency SLA | Inference response time tested under production load | Latency tested with 1 request; production sends 100/second |
| 9 | Monitoring dashboard | Prediction distribution, input drift, latency, errors tracked | No monitoring — degradation discovered by business users |
| 10 | Alerting | Automated alerts when metrics exceed thresholds | Someone checks the dashboard "when they remember" |
| 11 | Rollback plan | Can revert to previous model version within minutes | No rollback — fix forward or suffer |
| 12 | Operational runbook | Documented procedures for common failures and incidents | Knowledge lives in the data scientist's head |
Requirements 1-7 are the minimum for first production deployment. Requirements 8-12 are the minimum for sustained production operations. Deploy with 1-7. Operate with 1-12. Skipping 8-12 works for 30 days. By day 90, the unmonitored model is silently wrong and nobody knows.
Production Data Pipeline: From Source to Feature Vector
The production data pipeline replaces the notebook's manual data preparation with an automated, monitored, tested system that delivers features reliably at the freshness the model requires.
Pipeline Architecture
Extraction Layer
Pulls raw data from source systems — CRM, ERP, product databases, event streams. The extraction must handle: source system outages (retry with backoff), schema changes (detect and alert), incremental extraction (only new/changed records), and rate limiting (don't overwhelm the source). Data engineering builds and maintains this layer.
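A minimal sketch of the retry-with-backoff and incremental-extraction behavior, assuming a hypothetical `fetch_changed_records` client function; the real layer would live inside the data engineering team's orchestration tool:

```python
import random
import time

def extract_incremental(fetch_changed_records, watermark, max_retries=5, base_delay=2.0):
    """Pull only records changed since the last successful run (the watermark),
    retrying transient source-system failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            # incremental extraction: the source filters on the watermark timestamp
            return fetch_changed_records(since=watermark)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the outage so the scheduler alerts
            # exponential backoff with jitter avoids hammering a struggling source
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```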
Feature Engineering Layer
Transforms raw data into the features the model consumes. Temporal features (days since last purchase, 30-day purchase frequency), aggregations (average basket size over 90 days), categorical encoding (one-hot, target encoding), and derived features (ratios, differences, interactions). This layer must produce identical features at training time and serving time — the most common and most damaging production ML bug is training-serving skew where features are computed differently.
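The simplest guard against that skew is to define each transformation once, in a module imported by both the training pipeline and the serving code. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

def build_features(raw: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Shared feature logic, imported by BOTH the training pipeline and the
    serving code, so the transformations are defined exactly once."""
    out = pd.DataFrame(index=raw.index)
    # temporal feature
    out["days_since_last_purchase"] = (as_of - raw["last_purchase_date"]).dt.days
    # aggregation-derived features
    out["purchase_freq_30d"] = raw["purchase_count_30d"] / 30.0
    out["avg_basket_size_90d"] = raw["revenue_90d"] / raw["purchase_count_90d"].clip(lower=1)
    return out
```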
Validation Layer
Checks the features before the model sees them. Schema validation (expected columns exist, correct data types). Range validation (feature values within expected bounds — age isn't negative, revenue isn't $10 trillion). Completeness validation (null rate within threshold). Distribution validation (feature distributions haven't shifted dramatically from training data). Invalid data routes to a dead-letter queue for investigation — not to the model for confident wrong predictions.
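A compact sketch of those checks, with assumed column names and thresholds; frameworks such as Great Expectations package the same idea with less custom code:

```python
import pandas as pd

EXPECTED_DTYPES = {"customer_id": "int64", "age": "int64", "revenue_90d": "float64"}
MAX_NULL_RATE = 0.02

def validate_batch(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (valid_rows, dead_letter_rows); raise on batch-level failures."""
    # schema validation: expected columns exist with the expected types
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns or str(df[col].dtype) != dtype:
            raise ValueError(f"schema check failed for column {col!r}")

    # completeness validation: batch-level null rate within threshold
    has_null = df[list(EXPECTED_DTYPES)].isna().any(axis=1)
    if has_null.mean() > MAX_NULL_RATE:
        raise ValueError(f"null rate {has_null.mean():.1%} exceeds {MAX_NULL_RATE:.0%}")

    # range validation: physically impossible values go to the dead-letter queue
    bad = (df["age"] < 0) | (df["age"] > 120) | (df["revenue_90d"] < 0) | has_null
    return df[~bad], df[bad]
```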
Feature Store
Stores computed features for both training (offline store — historical features with point-in-time correctness) and serving (online store — low-latency current features). The feature store eliminates training-serving skew by serving features through a single computation pipeline. Fabric Feature Store, Databricks Feature Store, or open-source Feast provide this capability.
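With open-source Feast, for example, the same registered feature view backs both sides. A minimal sketch, assuming a feature view named customer_features in a local repo and a labeled_customers entity DataFrame:

```python
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo/")   # assumed local Feast repo
feature_refs = [
    "customer_features:days_since_last_purchase",
    "customer_features:purchase_freq_30d",
]

# training: point-in-time correct historical features joined to labeled entities
training_df = store.get_historical_features(
    entity_df=labeled_customers,     # customer_id + event_timestamp + label columns
    features=feature_refs,
).to_df()

# serving: low-latency lookup of the current feature values for one customer
online_features = store.get_online_features(
    features=feature_refs,
    entity_rows=[{"customer_id": 42}],
).to_dict()
```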
Pipeline Testing
Production data pipelines need tests just like application code. Unit tests validate individual transformation functions (does the days-since-last-purchase calculation handle null dates correctly?). Integration tests validate end-to-end pipeline execution (does the pipeline produce the expected output given test input?). Data quality tests validate output against expectations (null rate below 2%, value ranges within bounds, row count within expected volume). These tests run in CI/CD — a pipeline change that breaks a test doesn't deploy to production.
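A sketch of a unit test for the shared transformation above, written in pytest style; the `features` module and column names are the same assumptions as before:

```python
import pandas as pd

from features import build_features   # hypothetical module holding the shared feature logic

def test_days_since_last_purchase_handles_null_dates():
    raw = pd.DataFrame({
        "last_purchase_date": [pd.Timestamp("2024-01-01"), pd.NaT],
        "purchase_count_30d": [3, 0],
        "revenue_90d": [120.0, 0.0],
        "purchase_count_90d": [5, 0],
    })
    out = build_features(raw, as_of=pd.Timestamp("2024-01-31"))
    # a customer with a known last purchase gets a correct, non-negative value
    assert out.loc[0, "days_since_last_purchase"] == 30
    # a customer with no purchase history yields null, not a negative number
    assert pd.isna(out.loc[1, "days_since_last_purchase"])
```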
Model Serving Architecture: Batch, Real-Time, and Streaming
How the model serves predictions depends on the decision it supports. Three serving patterns, each with different architecture, cost, and latency characteristics.
Batch Serving
The model scores the entire dataset on a schedule — nightly churn scores for all customers, weekly demand forecasts for all products, daily credit risk scores for the loan portfolio. Architecture: scheduled job (Spark, Azure ML pipeline, Databricks workflow) reads features from the offline store, runs inference, writes predictions to the data warehouse or Power BI semantic model. Cost-effective (compute runs only during the job), reliable (simple scheduling), and sufficient for decisions with daily or weekly cadence.
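A condensed sketch of the scheduled scoring step, assuming a scikit-learn churn model persisted with joblib and parquet feature tables; in practice this runs as a step inside the orchestrated pipeline rather than a standalone script:

```python
from datetime import datetime, timezone

import joblib
import pandas as pd

def score_batch(feature_path: str, model_path: str, output_path: str) -> None:
    """Nightly job: read current features from the offline store, score every
    customer, and write predictions for the warehouse or semantic model."""
    features = pd.read_parquet(feature_path)
    model = joblib.load(model_path)

    churn_prob = model.predict_proba(features.drop(columns=["customer_id"]))[:, 1]
    predictions = pd.DataFrame({
        "customer_id": features["customer_id"],
        "churn_probability": churn_prob,
        "scored_at": datetime.now(timezone.utc),
        "model_version": "churn_v3",           # assumed version tag for traceability
    })
    predictions.to_parquet(output_path, index=False)
```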
Real-Time Serving
The model responds to individual prediction requests in milliseconds. Architecture: containerized model (Docker) behind a REST API endpoint with auto-scaling, load balancing, and health monitoring. Azure ML Managed Online Endpoints, SageMaker Endpoints, or Kubernetes-based serving (Seldon Core, KServe, BentoML) provide the infrastructure. Cost is higher (always-on compute) but required for: fraud detection (score before transaction authorization), recommendations (personalize each page view), chatbots and AI agents (respond in real time), and dynamic pricing.
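A minimal sketch of the code inside such a container, using FastAPI; the model file name, request fields, and version tag are assumptions:

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("churn_model.pkl")   # loaded once at container start, not per request

class ChurnRequest(BaseModel):
    customer_id: int
    days_since_last_purchase: float
    purchase_freq_30d: float
    avg_basket_size_90d: float

@app.post("/predict")
def predict(req: ChurnRequest) -> dict:
    row = pd.DataFrame([{
        "days_since_last_purchase": req.days_since_last_purchase,
        "purchase_freq_30d": req.purchase_freq_30d,
        "avg_basket_size_90d": req.avg_basket_size_90d,
    }])
    prob = float(model.predict_proba(row)[0, 1])
    return {"customer_id": req.customer_id, "churn_probability": prob, "model_version": "churn_v3"}

@app.get("/health")
def health() -> dict:
    # liveness probe for the load balancer / orchestrator
    return {"status": "ok"}
```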
Streaming Serving
The model scores events as they arrive in a data stream — IoT sensor readings, user clickstream, transaction events. Architecture: model embedded in a stream processing job (Spark Structured Streaming, Flink, or custom Kafka consumer) that reads events, computes features, runs inference, and writes scored events downstream. Required when: events arrive continuously, batch scoring is too slow, and individual API calls are too expensive at the event volume.
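A condensed Spark Structured Streaming sketch, assuming a Kafka topic named transactions, a saved Spark ML PipelineModel that bundles feature stages with the model, and Delta output paths; all names are illustrative:

```python
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("stream_scoring").getOrCreate()
model = PipelineModel.load("models/churn_v3")            # assumed model path

event_schema = T.StructType([
    T.StructField("customer_id", T.LongType()),
    T.StructField("amount", T.DoubleType()),
    T.StructField("event_time", T.TimestampType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")    # assumed broker
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

scored = model.transform(events)                          # feature stages + model in one pipeline

query = (scored.writeStream
    .format("delta")                                      # write scored events downstream
    .option("checkpointLocation", "/checkpoints/stream_scoring")
    .start("/gold/scored_transactions"))
query.awaitTermination()
```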
| Pattern | Latency | Cost | Best For | Infrastructure |
|---|---|---|---|---|
| Batch | Hours | Low — runs on schedule | Daily/weekly scoring, dashboards, reports | Spark job, ML pipeline, scheduled notebook |
| Real-Time | Milliseconds | High — always-on | Transaction scoring, recommendations, chatbots | Managed endpoints, Kubernetes, API gateway |
| Streaming | Seconds | Medium — scales with volume | IoT, clickstream, continuous event scoring | Spark Streaming, Flink, Kafka consumer |
Testing ML Systems: Beyond Unit Tests
ML systems require testing beyond traditional software testing because the system's behavior depends on learned patterns in data — not just coded logic. A code change that introduces a bug produces a test failure. A data change that shifts the learned pattern produces a silently wrong model that passes all code tests.
Five Testing Layers
Unit Tests (Code Correctness)
Test individual functions: feature engineering logic, data transformations, preprocessing. These catch coding bugs — the days-since-last-purchase function returns negative values, the one-hot encoder creates wrong columns. Standard software testing practices apply.
Data Tests (Input Quality)
Test the data pipeline output: schema conformance, value ranges, null rates, row counts, distribution statistics. These catch data quality issues — a source system change that renames a column, a pipeline bug that drops 30% of records, a seasonal shift that changes feature distributions. Tools: Great Expectations, Deequ, custom validation scripts.
Model Tests (Prediction Quality)
Test model performance on held-out data: accuracy metrics (AUC, F1, RMSE), subgroup performance (does accuracy differ across segments?), edge cases (how does the model handle extreme values, rare categories, missing features?). The model validation gate: a retrained model must match or exceed the current production model's performance on a standard test set before deployment.
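A small sketch of that gate for a binary classifier, using ROC AUC; the minimum-gain margin is an assumption to tune per model:

```python
from sklearn.metrics import roc_auc_score

def run_validation_gate(candidate, production, X_test, y_test, min_gain=0.0):
    """Deployment gate: the retrained candidate must match or beat the current
    production model on the shared held-out test set."""
    candidate_auc = roc_auc_score(y_test, candidate.predict_proba(X_test)[:, 1])
    production_auc = roc_auc_score(y_test, production.predict_proba(X_test)[:, 1])
    passed = candidate_auc >= production_auc + min_gain
    return passed, {"candidate_auc": candidate_auc, "production_auc": production_auc}
```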
Integration Tests (System Correctness)
Test the end-to-end system: send a prediction request through the production pipeline and verify the response is correct, timely, and formatted as expected. Integration tests catch: feature store connectivity issues, model deserialization failures, API contract violations, and response formatting errors.
Shadow Tests (Production Validation)
Run the new model alongside the current model on live production data — comparing predictions without taking action on the new model's output. Shadow testing reveals: training-serving skew (model performs differently on live data than test data), edge cases not represented in test data, and latency under real production load. Run shadow tests for 1-2 weeks before promoting a new model to production.
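Inside a request handler, shadow mode can be as simple as scoring both models and logging the pair for offline comparison; a sketch with assumed model objects:

```python
import json
import logging
import time

logger = logging.getLogger("shadow")

def predict_with_shadow(features, production_model, shadow_model) -> float:
    """Serve the production model's prediction; score the shadow model on the
    same features and log both. Only the production output drives action."""
    start = time.perf_counter()
    live = float(production_model.predict_proba(features)[0, 1])
    mid = time.perf_counter()
    shadow = float(shadow_model.predict_proba(features)[0, 1])

    logger.info(json.dumps({
        "live_prediction": live,
        "shadow_prediction": shadow,
        "live_latency_ms": round((mid - start) * 1000, 2),
        "shadow_latency_ms": round((time.perf_counter() - mid) * 1000, 2),
    }))
    return live
```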
Traditional software testing pyramid: many unit tests, fewer integration tests, few end-to-end tests. ML testing adds data tests and model tests as first-class layers. A model that passes all code tests but fails data tests (input quality degraded) or model tests (accuracy dropped on new data) should not deploy. The ML testing pyramid: unit tests → data tests → model tests → integration tests → shadow tests.
Production Monitoring: Catching Degradation Before Users Do
Production ML monitoring tracks four categories — and most organizations only implement the first one.
Infrastructure monitoring (table stakes): Endpoint latency, error rate, throughput, CPU/memory utilization. Standard DevOps monitoring. Alert when latency exceeds SLA or error rate spikes. This catches system failures but not model failures — the model can return fast, wrong predictions while infrastructure metrics look healthy.
Data monitoring (critical): Input feature distributions compared to training distributions. Statistical tests (PSI, KS test, Jensen-Shannon divergence) detect when production data has drifted from what the model learned on. Data drift is the most common cause of model degradation — and the one that infrastructure monitoring completely misses. Alert when drift score exceeds threshold for any feature.
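A minimal PSI implementation for one numeric feature against its training baseline; the 0.1 / 0.2 bands are a common rule of thumb, not a universal standard:

```python
import numpy as np

def population_stability_index(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a feature's training distribution and its live distribution.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate."""
    # bin edges come from the training (baseline) distribution
    edges = np.unique(np.percentile(baseline, np.linspace(0, 100, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf

    baseline_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    current_pct = np.histogram(current, bins=edges)[0] / len(current)

    # clip empty bins so the log term stays finite
    baseline_pct = np.clip(baseline_pct, 1e-6, None)
    current_pct = np.clip(current_pct, 1e-6, None)
    return float(np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)))
```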
Prediction monitoring (essential): Distribution of prediction outputs — are the proportions of high/medium/low risk scores changing? Is the average predicted churn probability increasing? Prediction drift often precedes accuracy degradation and provides earlier warning than waiting for ground truth. Monitor prediction distributions daily and compare to the baseline established during initial deployment.
Business outcome monitoring (the ultimate measure): Do the predictions match reality? The churn model predicted 200 customers would churn this month — did they? The fraud model flagged 500 transactions — were they actually fraudulent? Outcome monitoring requires delayed feedback (churn is known 90 days later, fraud investigations take weeks) but is the definitive measure of whether the model is still useful. Track accuracy over rolling windows and alert when it degrades below the threshold that justified the model's deployment.
Retraining Strategy: Scheduled, Triggered, and Continuous
Every production model needs a retraining strategy — the question is when and how.
Scheduled retraining: Retrain on a calendar cadence — monthly, quarterly. Simple to implement and operate. Appropriate for models where data distribution changes slowly (annual customer behavior patterns, long-term trends). Risk: if drift happens between scheduled retraining, the model operates on stale patterns until the next scheduled retrain.
Triggered retraining: Retrain when monitoring detects drift or accuracy degradation. More responsive than scheduled — the model retrains when it needs to, not when the calendar says. Requires the monitoring infrastructure to detect triggers reliably. Appropriate for models operating in environments with unpredictable change (market conditions, seasonal shifts, external events).
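The trigger itself can be a small check run by the daily monitoring job; a sketch with placeholder thresholds and pipeline hooks:

```python
PSI_ALERT = 0.2        # per-feature drift threshold (assumed)
AUC_FLOOR = 0.78       # accuracy level that justified deployment (assumed)

def should_retrain(drift_scores: dict[str, float], rolling_auc: float | None) -> bool:
    """Kick off retraining when data monitoring or outcome monitoring says so."""
    drifted_features = [name for name, psi in drift_scores.items() if psi > PSI_ALERT]
    accuracy_degraded = rolling_auc is not None and rolling_auc < AUC_FLOOR
    return bool(drifted_features) or accuracy_degraded

# inside the daily monitoring job (latest_drift, latest_auc, and
# launch_training_pipeline are placeholders for the orchestrator's hooks)
if should_retrain(latest_drift(), latest_auc()):
    launch_training_pipeline()
```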
Continuous retraining: The model retrains on a continuous stream of new data — learning from the latest patterns as they arrive. Most sophisticated. Requires: streaming data infrastructure, automated training pipeline, automated validation and deployment, and guardrails that prevent a bad training batch from corrupting the production model. Appropriate for high-frequency models where data distribution changes daily (recommendation engines, dynamic pricing).
The Xylity Approach
We implement production ML through the MLOps framework — 12-point readiness checklist, automated data pipelines with feature store, containerized serving, 5-layer testing, 4-category monitoring, and retraining automation. Our ML engineers and data engineers build production ML alongside your team — transferring the operational capability so your organization sustains production ML independently.
Who Owns Production ML? Roles and Responsibilities
| Role | Owns | Contributes To |
|---|---|---|
| Data Scientist | Model development, feature selection, accuracy optimization | Feature engineering design, monitoring threshold definition |
| ML Engineer | Production pipeline, serving infrastructure, CI/CD, containerization | Model optimization for production (latency, memory), monitoring implementation |
| Data Engineer | Data pipeline, feature engineering pipeline, feature store | Data quality monitoring, pipeline reliability |
| DevOps/SRE | Infrastructure, deployment automation, incident response | Latency monitoring, scaling, uptime SLA |
| Domain Expert | Business validation, outcome interpretation | Feature definition, model output validation, threshold setting |
The most common staffing mistake: hiring data scientists to do production ML. Data scientists develop models. ML engineers deploy and operate them. These are different skills — statistical modeling vs. software engineering, notebook exploration vs. production reliability, accuracy optimization vs. latency optimization. Organizations that ask data scientists to do both get neither done well.
For organizations building their first production ML system, the minimum team: 1 data scientist (model development), 1 ML engineer (production infrastructure), 1 data engineer (data pipeline), and domain expert access (validation and interpretation). As the model portfolio grows, each role scales — but the separation of responsibilities remains.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Get ML Into Production — Not Just Notebooks
12-point readiness checklist, production pipelines, serving architecture, 5-layer testing, monitoring. ML engineering that makes models reliable at scale.
Start Your ML Production Engagement →