In This Article
- Why 85% of AI Projects Never Reach Production
- The Six-Phase AI Engagement Lifecycle
- Phase 1: Discovery and Problem Framing (Weeks 1-2)
- Phase 2: Data Assessment and Feasibility (Weeks 3-5)
- Phase 3: Proof of Concept (Weeks 6-10)
- Phase 4: Production Development (Weeks 11-18)
- Phase 5: Deployment and Integration (Weeks 19-22)
- Phase 6: Operations and Continuous Improvement
- Team Composition: Who You Need at Each Phase
- Engagement Pricing Models: Fixed, T&M, and Outcome-Based
- Build vs. Augment: When Internal Teams Need External Specialists
- Go Deeper
Why 85% of AI Projects Never Reach Production
The statistic that 85% of AI projects fail to reach production has been cited so often it's become background noise. But the reasons behind the statistic are specific, identifiable, and — if you structure the engagement correctly — preventable. In our experience across enterprise AI consulting engagements spanning 22 industries, the failures cluster into three patterns.
Pattern 1: The pilot that proved nothing. A data science team builds a model on a curated dataset, achieves impressive accuracy in a notebook, presents results to leadership, and gets approval for "production deployment." But the pilot skipped the questions that determine production viability: Can the data pipeline deliver this data reliably at the freshness the model needs? Can the model serve predictions at the latency the business process requires? Will the operational team actually change their workflow to use model outputs? The pilot proved the algorithm works. It proved nothing about production viability.
Pattern 2: The scope that expanded until it collapsed. An engagement starts with a clear use case — predict customer churn. Three weeks in, the VP of Sales asks to add lead scoring. The CMO wants campaign attribution. The CFO wants revenue forecasting. Each request is reasonable. Together they transform a focused engagement into an undefined program that tries to do everything, delivers nothing completely, and exhausts budget before the first model reaches production. Scope management in AI engagements is harder than in traditional IT because AI's promise invites expansion.
Pattern 3: The model nobody adopted. The model works. The accuracy is good. The deployment is clean. And the operations team continues doing exactly what they did before because nobody invested in the change management that connects model output to operational workflow. The loan officer who has 20 years of experience doesn't trust the model's risk assessment. The supply chain planner who built the forecasting spreadsheet doesn't want to be replaced by an algorithm. The model is technically deployed and operationally ignored.
The six-phase engagement model we describe here is designed to prevent all three patterns. Each phase has specific deliverables, decision gates, and kill criteria. The engagement can stop at any phase if the evidence doesn't support continuing — and stopping early with a clear answer is a better outcome than continuing on momentum into a failed deployment.
The Six-Phase AI Engagement Lifecycle
Each phase produces specific deliverables that feed the next phase. Decision gates between phases require explicit go/no-go decisions based on evidence, not optimism. The timeline below is typical for a first enterprise AI engagement; subsequent engagements compress because infrastructure and organizational muscle memory exist.
| Phase | Duration | Primary Deliverable | Decision Gate |
|---|---|---|---|
| 1. Discovery | Weeks 1-2 | Problem statement, success criteria, data inventory | Is this a well-defined ML problem with available data? |
| 2. Data Assessment | Weeks 3-5 | Data quality report, feasibility analysis | Does the data support the model at required accuracy? |
| 3. Proof of Concept | Weeks 6-10 | Working model, accuracy metrics, production plan | Does the model meet accuracy thresholds on held-out data? |
| 4. Production Dev | Weeks 11-18 | Production-grade model, pipeline, integration | Does the system meet SLA requirements end-to-end? |
| 5. Deployment | Weeks 19-22 | Live model, monitoring, runbooks | Is the model performing as expected on live data? |
| 6. Operations | Ongoing | Monitoring, retraining, continuous improvement | Is business value sustained or improving? |
Every phase has explicit conditions under which the engagement should stop. A clear "no" at Phase 2 saves 4-5 months of investment. A clear "no" at Phase 3 saves 3-4 months. Engagements that lack kill criteria continue on momentum into failed deployments. We define kill criteria at engagement start and revisit them at every decision gate.
Phase 1: Discovery and Problem Framing (Weeks 1-2)
Discovery is where most engagements go wrong first. The business says "we want to use AI for X." The data science team hears "build a model for X." Both skip the framing work that determines whether X is actually an ML problem, whether the data exists, and whether the organization can act on the model's output.
What Discovery Produces
Problem Statement (Machine-Readable)
Convert the business question into a specific ML formulation. "Reduce customer churn" becomes "predict which customers will cancel in the next 90 days, with sufficient accuracy to justify the retention intervention cost, using data available at the point of prediction." The formulation specifies the prediction target, the prediction horizon, the accuracy threshold (tied to intervention economics), and the data availability constraint.
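To make "machine-readable" concrete, here is a minimal sketch in Python of how that formulation could be captured as a structured object. The field names, and the 4% precision figure that anticipates the intervention economics in the next section, are illustrative rather than a prescribed template.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProblemStatement:
    """Illustrative, machine-readable framing of 'reduce customer churn'."""
    prediction_target: str   # what the model predicts
    horizon_days: int        # how far ahead the prediction applies
    min_precision: float     # accuracy threshold tied to intervention economics
    data_cutoff: str         # only data available at prediction time may be used

churn_statement = ProblemStatement(
    prediction_target="customer cancels within the horizon",
    horizon_days=90,
    min_precision=0.04,      # illustrative; derived from the economics in the next section
    data_cutoff="features computed as of the prediction date",
)
```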
Success Criteria (Business-Measurable)
Define success in business terms the executive sponsor cares about — revenue retained, cost avoided, risk reduced — not model accuracy metrics. Then map business success to model performance thresholds. If retaining a customer is worth $5,000 and the retention intervention costs $200, the model needs to be accurate enough that the intervention cost across all flagged customers is justified by the customers actually retained. These economics determine the accuracy threshold.
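As a simplified worked example of that calculation, assuming every flagged customer receives the intervention: the break-even precision is the intervention cost divided by the value of a retained customer. The success-rate parameter below is our addition to show how imperfect interventions raise the bar; it is not part of the figures above.

```python
def breakeven_precision(retained_value: float, intervention_cost: float,
                        intervention_success_rate: float = 1.0) -> float:
    """Minimum precision among flagged customers for the intervention to pay off.

    Expected value per flagged customer:
        precision * intervention_success_rate * retained_value - intervention_cost
    Setting this to zero and solving for precision gives the break-even point.
    """
    return intervention_cost / (retained_value * intervention_success_rate)

# Figures from the example above: $5,000 retained value, $200 intervention cost.
print(breakeven_precision(5_000, 200))        # 0.04 -> 4% precision breaks even
print(breakeven_precision(5_000, 200, 0.3))   # ~0.13 if only 30% of interventions succeed
```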
Data Inventory and Gap Analysis
Catalog every data source the model might need. For each source: what system does it live in, who owns it, what's the refresh frequency, what's the access mechanism (API, database, export), and what's the historical depth. Identify gaps between what the model needs and what exists. These gaps determine Phase 2 scope and potentially kill the engagement if critical data doesn't exist.
Stakeholder Map and Change Impact
Identify every person who will need to change their workflow if the model succeeds. The operations manager who will receive model alerts. The analyst who currently does the work manually. The VP who will present model-driven results to the board. Map their current workflow, how AI changes it, and what adoption support they'll need. This feeds Phase 5 change management planning.
Discovery typically involves 6-10 stakeholder interviews, a data source workshop, and a framing session that produces the problem statement. Two weeks is sufficient if the right stakeholders are available. Longer timelines usually indicate organizational alignment issues that Phase 1 should surface rather than absorb.
Phase 2: Data Assessment and Feasibility (Weeks 3-5)
Phase 2 answers the question Discovery raised: does the data actually support this model? This is where an engagement should be killed early if the answer is no. A clear "the data doesn't support this use case at the accuracy the business case requires" is a valuable deliverable that saves months of downstream investment.
Data Quality at ML Granularity
Data quality assessment for ML is more demanding than for reporting. We evaluate quality at the specific granularity the model requires — transaction-level, event-level, or sensor-level — not the aggregated level reporting typically uses. For each feature candidate: completeness (what percentage of rows have values?), accuracy (do the values match reality?), consistency (do related fields agree?), timeliness (is the data fresh enough for the prediction horizon?), and historical depth (is there enough history to train on?). We document quality issues with specific remediation estimates so the decision gate has concrete input.
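A minimal sketch of what those checks might look like for a single source table, assuming a pandas extract at the required granularity; the function, column handling, and thresholds are illustrative rather than a standard tool.

```python
import pandas as pd

def feature_quality_report(df: pd.DataFrame, timestamp_col: str,
                           max_staleness_days: int, min_history_days: int) -> dict:
    """Illustrative row-level quality checks for one feature-candidate table."""
    ts = pd.to_datetime(df[timestamp_col])
    now = pd.Timestamp.now()
    return {
        "completeness": df.notna().mean().to_dict(),   # share of non-null values per column
        "staleness_days": (now - ts.max()).days,       # timeliness vs. the prediction horizon
        "history_days": (ts.max() - ts.min()).days,    # enough history to train on?
        "fresh_enough": (now - ts.max()).days <= max_staleness_days,
        "deep_enough": (ts.max() - ts.min()).days >= min_history_days,
    }

# Hypothetical usage on a transaction-level extract:
# report = feature_quality_report(transactions, "event_ts",
#                                 max_staleness_days=1, min_history_days=730)
```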
Feature Engineering Feasibility
Can the raw data be transformed into features that the model can learn from? Some use cases require features that are technically possible but prohibitively expensive to engineer — joining data across systems with different identifiers, computing features that require complex temporal logic, or creating features from unstructured text or images that need preprocessing pipelines. We assess feature engineering feasibility for the top 20-30 feature candidates and estimate the data engineering effort required.
Baseline and Ceiling Analysis
Before building the real model, we establish two benchmarks. The baseline is the simplest possible approach — a rules-based heuristic, a simple average, or the current manual process accuracy. The model must beat this or there's no value. The ceiling is the theoretical maximum given the data's signal-to-noise ratio — estimated through exploratory analysis, feature correlation, and domain expert input on what's predictable versus random. If the ceiling is close to the baseline, ML won't add enough value to justify the investment.
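As one concrete illustration, a baseline for a churn use case might be nothing more than a rules heuristic like the sketch below; the column names and the 60-day cutoff are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score

def rules_baseline(customers: pd.DataFrame) -> np.ndarray:
    """Simplest possible churn heuristic: flag anyone inactive for 60+ days."""
    return (customers["days_since_last_login"] >= 60).astype(int).to_numpy()

# Hypothetical comparison on a held-out slice:
# y_true = holdout["churned_within_90d"].to_numpy()
# baseline_precision = precision_score(y_true, rules_baseline(holdout))
# Any candidate model has to clearly beat this number before ML adds value.
```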
Stop the engagement if: (1) Critical data doesn't exist and would take 6+ months to collect. (2) Data quality issues require remediation exceeding the use case's business value. (3) The ceiling analysis suggests the maximum achievable accuracy is below the business case threshold. (4) Feature engineering effort exceeds what the timeline and budget support. A clear "no" at Phase 2 saves 4-5 months of downstream investment and preserves credibility for the next use case.
Phase 3: Proof of Concept (Weeks 6-10)
The PoC builds the model on real data and measures whether it meets the accuracy thresholds the business case requires. This is NOT a production system — it's an evidence-gathering exercise that de-risks the production investment. The PoC runs in a controlled environment on a curated (but representative) dataset. Its purpose is to answer: can we build a model that's accurate enough to justify production deployment?
Model Development
We iterate through model approaches: starting simple (logistic regression, gradient boosted trees) before moving to complex (deep learning, ensemble methods). Simple models that meet accuracy thresholds are always preferred because they're faster to deploy, easier to explain, cheaper to operate, and more stable in production. We train on historical data with proper time-based splits (no data leakage — the model never sees future data during training) and evaluate on held-out test sets that represent the production distribution.
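A minimal sketch of the time-based split described above, assuming an event-level table with a timestamp column; the names are hypothetical.

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, timestamp_col: str, cutoff: str):
    """Split on a calendar cutoff so evaluation rows are strictly later than
    training rows. This is the leakage guard described above."""
    ts = pd.to_datetime(df[timestamp_col])
    return df[ts < pd.Timestamp(cutoff)], df[ts >= pd.Timestamp(cutoff)]

# Hypothetical usage: train on everything before 2024, evaluate on 2024 onward.
# train_df, test_df = time_based_split(events, "event_ts", cutoff="2024-01-01")
```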
What the PoC Report Contains
The PoC report is the decision document for the production investment. It contains: model accuracy on held-out test data against the business-case threshold, error analysis (what does the model get wrong, and are the errors acceptable?), feature importance (what drives predictions, and does it make domain sense?), data requirements for production (pipeline freshness, quality thresholds, volume), infrastructure requirements (compute, storage, serving latency), and the estimated effort for production development. The report makes a recommendation: proceed, proceed with conditions, or stop.
The PoC Trap
The most common Phase 3 failure is the PoC that succeeds in a way that can't translate to production. A model trained on a carefully curated dataset by a data scientist who hand-cleaned every anomaly achieves 94% accuracy. But production data has anomalies nobody will hand-clean, the dataset was static while production data shifts daily, and the features that drove accuracy require joins across three systems that the data pipeline can't perform at the required freshness. The PoC succeeded on a dataset that doesn't represent production. This is why PoC design must account for production constraints from the start.
Phase 4: Production Development (Weeks 11-18)
Production development transforms the PoC model into a production system. This is the phase most organizations underestimate because they see the PoC model as "almost done." In reality, the PoC is typically 20% of the total engineering effort. The remaining 80% is the infrastructure, integration, monitoring, and operational discipline that make the model reliable at scale.
Production Data Pipeline
The PoC ran on a static dataset. Production requires a live pipeline that extracts data from source systems, transforms it into features, validates quality, and delivers features to the model at the freshness and latency production requires. This pipeline must be monitored (alerting when it fails or when data quality degrades), tested (automated tests that catch feature drift), and documented (runbooks for operational teams). The data engineering work here is substantial — often the single largest effort in Phase 4. If the organization doesn't have mature data engineering capabilities, this is where the engagement stalls.
Model Hardening
The PoC model was optimized for accuracy. The production model is optimized for accuracy, latency, reliability, and operability. Model hardening includes: retraining on the full dataset (not just the PoC subset), optimizing inference latency (quantization, pruning, batch vs. real-time serving architecture), implementing input validation (rejecting or flagging inputs outside the training distribution), implementing output calibration (ensuring predicted probabilities match observed frequencies), and building the retraining pipeline that keeps the model current as data distribution shifts.
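To illustrate the input-validation piece only, here is a simplified sketch that flags rows falling outside the ranges seen during training. Real implementations typically compare full distributions rather than simple min/max ranges; the class below is an illustration, not a library API.

```python
import numpy as np

class RangeGuard:
    """Flags feature vectors that fall outside the ranges seen during training.
    A simplified stand-in for distribution-based input validation."""

    def __init__(self, train_features: np.ndarray, tolerance: float = 0.05):
        span = train_features.max(axis=0) - train_features.min(axis=0)
        self.low = train_features.min(axis=0) - tolerance * span
        self.high = train_features.max(axis=0) + tolerance * span

    def in_range(self, x: np.ndarray) -> np.ndarray:
        """Boolean mask per row: True if every feature is within the guarded range."""
        return ((x >= self.low) & (x <= self.high)).all(axis=1)

# Hypothetical usage at serving time: route out-of-range rows to a fallback or review queue.
# mask = guard.in_range(batch)
# predictions = np.where(mask, model_scores, fallback_scores)
```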
Integration Architecture
The model must integrate with the operational systems where decisions are made. A churn prediction model integrates with the CRM so account managers see churn risk scores. A demand forecasting model integrates with the supply chain planning system so planners see AI-enhanced forecasts. A fraud detection model integrates with the transaction processing system so fraudulent transactions are flagged in real time. Integration architecture determines whether model outputs reach the people who act on them — or sit in a dashboard nobody opens.
Testing Strategy
ML testing goes beyond traditional software testing. Unit tests verify feature engineering logic. Integration tests verify the pipeline from source to prediction. Model validation tests verify accuracy on fresh data against the PoC benchmarks. Shadow testing runs the model on live data alongside the current process (without taking action) to validate predictions match expectations. A/B testing compares model-driven decisions against the status quo to measure actual business impact. Each layer catches different failure modes.
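As one small example of the unit-test layer, the sketch below pairs a hypothetical feature function with a pytest-style test that guards against the leakage discussed in Phase 3. All names and values are illustrative.

```python
import pandas as pd

def days_since_last_purchase(purchases: pd.DataFrame, as_of: str) -> pd.Series:
    """Hypothetical feature: days between each customer's last purchase on or
    before the as-of date and the as-of date itself."""
    as_of_ts = pd.Timestamp(as_of)
    purchases = purchases.assign(purchase_ts=pd.to_datetime(purchases["purchase_ts"]))
    eligible = purchases[purchases["purchase_ts"] <= as_of_ts]
    last = eligible.groupby("customer_id")["purchase_ts"].max()
    return (as_of_ts - last).dt.days

def test_feature_ignores_purchases_after_as_of_date():
    purchases = pd.DataFrame({
        "customer_id": [1, 1],
        "purchase_ts": ["2024-01-10", "2024-03-01"],  # second row is in the "future"
    })
    result = days_since_last_purchase(purchases, as_of="2024-02-01")
    assert result.loc[1] == 22  # the March purchase must not leak into the feature
```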
The 80/20 Rule of AI Engineering
The PoC model is 20% of the engineering effort. The production pipeline, integration, monitoring, testing, and operational infrastructure are the other 80%. Organizations that budget for the 20% and discover the 80% mid-project are the organizations whose AI projects stall at "almost deployed" for months. Budget for the full lifecycle from day one.
Phase 5: Deployment and Integration (Weeks 19-22)
Deployment is the transition from development to operations. The model moves from a development environment to production infrastructure, integrations go live, monitoring activates, and — critically — the operational team begins using model outputs in their actual workflow.
Deployment Patterns
Shadow deployment runs the model on live data alongside the current process without taking action. It validates that the model performs on real production data as it did on test data. Shadow deployment typically runs for 2-4 weeks and surfaces issues that testing didn't catch: data distribution differences between test and production, feature values that fall outside training range, and latency under real load.
Canary deployment routes a small percentage of traffic (5-10%) to the new model while the majority continues on the existing process. It measures real business impact at limited risk. If the model underperforms, the blast radius is contained.
Blue-green deployment maintains two production environments — the current system (blue) and the new model (green). Traffic switches from blue to green with instant rollback capability. This is appropriate when the model replaces an existing system completely.
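A minimal sketch of how canary routing is often wired into the serving layer, assuming each request carries a stable identifier; the hashing scheme and the 5% share are illustrative choices.

```python
import hashlib

def routes_to_canary(request_id: str, canary_share: float = 0.05) -> bool:
    """Deterministic canary routing: the same request id always lands in the same
    bucket, so canary and control results can be compared consistently."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_share * 100

# Hypothetical usage in the serving layer:
# model = canary_model if routes_to_canary(request.id) else current_model
```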
Change Management Activation
This is where the stakeholder map from Phase 1 becomes operational. The operations manager who will receive model alerts gets trained on interpreting predictions and the confidence intervals around them. The analyst who currently does the work manually gets retrained on the new workflow — not replaced, but redirected to exception handling and model improvement. The VP who will present results gets a dashboard showing model impact in business terms they can communicate to the board. Change management is not a communication plan. It's a workflow redesign with training, support, and feedback channels.
Phase 6: Operations and Continuous Improvement
Deployment is not the finish line — it's the starting line for operations. A deployed model is a depreciating asset. Data distribution shifts. Customer behavior evolves. Market conditions change. Regulations update. Without active operations, model performance degrades silently until someone notices predictions don't match reality.
Monitoring Framework
Production model monitoring tracks four categories of metrics. Technical health: latency, throughput, error rates, resource utilization. Data quality: input feature distributions versus training distributions, missing values, outliers, drift detection. Model performance: prediction accuracy measured against ground truth (when available), calibration, confidence distribution. Business impact: the metrics from the Phase 1 success criteria — revenue retained, cost avoided, risk reduced. Technical health is monitored in real time. Data quality and model performance are monitored daily to weekly. Business impact is measured monthly to quarterly.
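One widely used drift check is the Population Stability Index. Below is a minimal sketch for a single numeric feature; the 0.2 rule of thumb mentioned in the comment is a common convention, not a universal threshold.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, observed: np.ndarray,
                               bins: int = 10) -> float:
    """Population Stability Index between the training (expected) and live
    (observed) distribution of one feature. A common rule of thumb treats
    PSI > 0.2 as significant drift, but thresholds should be set per feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    o_counts, _ = np.histogram(observed, bins=edges)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    o_pct = np.clip(o_counts / o_counts.sum(), 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))

# Hypothetical daily check: compare yesterday's feature values against the training snapshot.
# psi = population_stability_index(train_feature_values, live_feature_values)
```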
Retraining Triggers and Pipeline
The model should retrain when data drift exceeds thresholds, when performance metrics degrade below acceptable levels, or on a scheduled cadence (monthly for most enterprise use cases). The retraining pipeline automates: data extraction with current data, feature engineering, model training, validation against the previous model (the new model must outperform or match), and deployment through the same CI/CD pipeline as initial deployment. Manual retraining doesn't scale beyond one model. Automated retraining through MLOps is required for organizations operating multiple production models.
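A minimal sketch of how those three triggers might be combined inside a daily orchestration job; every threshold and function name here is a placeholder, not a recommendation.

```python
from datetime import date, timedelta

def should_retrain(drift_score: float, live_accuracy: float, last_trained: date,
                   drift_threshold: float = 0.2, accuracy_floor: float = 0.80,
                   max_age_days: int = 30) -> bool:
    """Combines the three triggers described above: data drift, performance
    degradation, and scheduled cadence. All thresholds are illustrative."""
    too_old = date.today() - last_trained > timedelta(days=max_age_days)
    return drift_score > drift_threshold or live_accuracy < accuracy_floor or too_old

# Hypothetical usage inside a daily orchestration job:
# if should_retrain(psi, rolling_accuracy, last_trained=date(2024, 5, 1)):
#     trigger_retraining_pipeline()
```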
Feedback Loops
The most valuable signal for model improvement comes from the operational teams using model outputs. The loan officer who overrides the model's risk assessment provides labeled data that reveals where the model is wrong. The supply chain planner who adjusts the AI forecast provides signal about factors the model missed. Building structured feedback mechanisms — not just "email the data team" but actual feedback interfaces that capture overrides with reasons — creates a continuous improvement cycle that makes the model better over time.
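As an illustration of what "capture overrides with reasons" can mean in practice, a minimal record structure might look like the sketch below; the fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PredictionOverride:
    """One structured override record: what the model said, what the operator did,
    and why. Collected overrides become labeled data for the next retraining cycle."""
    prediction_id: str
    model_output: float
    operator_decision: str
    reason_code: str          # chosen from a fixed list, not free text only
    comment: str
    overridden_at: datetime

# Example: a loan officer rejecting the model's low-risk assessment.
override = PredictionOverride(
    prediction_id="pred-0042",
    model_output=0.12,
    operator_decision="declined",
    reason_code="recent_adverse_event_not_in_data",
    comment="Applicant's employer announced layoffs last week.",
    overridden_at=datetime.now(),
)
```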
Budget 20-30% of the initial development cost annually for ongoing operations — monitoring, retraining, improvement, and the data engineering that keeps pipelines reliable. Organizations that budget zero for operations after deployment discover that their model accuracy degrades to baseline within 6-12 months. Operations is not optional; it's what makes the development investment compound rather than depreciate.
Team Composition: Who You Need at Each Phase
Different phases require different skill mixes. Staffing the entire engagement with data scientists is the most common team composition mistake — because data scientists are the right skill for Phase 3 (PoC) but not for Phase 2 (data assessment requires data engineering), Phase 4 (production development requires ML engineering and software engineering), or Phase 5 (deployment requires DevOps and change management).
| Phase | Lead Role | Supporting Roles | Domain Expert Involvement |
|---|---|---|---|
| 1. Discovery | AI Strategy Consultant | Data engineer, business analyst | Heavy — problem framing requires domain |
| 2. Data Assessment | Data Engineer | Data scientist, data quality analyst | Moderate — validating data interpretation |
| 3. PoC | Data Scientist | ML engineer, data engineer | Heavy — validating model outputs make sense |
| 4. Production | ML Engineer | Data engineer, software engineer, DevOps | Light — integration testing and validation |
| 5. Deployment | ML Engineer / DevOps | Change management, training | Heavy — workflow redesign and adoption |
| 6. Operations | MLOps Engineer | Data engineer, data scientist (retraining) | Periodic — feedback and improvement cycles |
The total team size varies: 3-5 people for a focused single-use-case engagement, 8-12 for a multi-model program. The key insight is that the lead role changes at each phase. Engagements that keep data scientists as lead through Phases 4-6 underperform because production engineering is a different discipline than model development. If your team lacks ML engineering or production AI specialists, Phase 4 is where to augment.
Engagement Pricing Models: Fixed, T&M, and Outcome-Based
How AI engagements are priced affects how they're managed, what risks each party bears, and whether incentives align between client and consulting partner.
Fixed-Price (Phases 1-2)
Discovery and Data Assessment are well-suited to fixed pricing because the scope is defined: specific stakeholder interviews, specific data sources to assess, specific deliverables. Typical range: $40,000-$120,000 depending on organizational complexity and data estate size. Fixed pricing here gives the client cost certainty for the decision phase — and the deliverable (go/no-go recommendation with evidence) has clear completion criteria.
Time and Materials (Phases 3-5)
PoC through Deployment is better suited to T&M because the scope adapts based on what each phase reveals. A PoC that discovers the model needs a feature the pipeline doesn't yet produce requires data engineering work that wasn't in the original scope. Production development timelines depend on integration complexity that becomes clear during Phase 4, not before. T&M with weekly reporting and monthly budget reviews gives flexibility while maintaining cost visibility.
Outcome-Based (Phase 6 — carefully)
Outcome-based pricing (payment tied to business results) is sometimes appropriate for Phase 6 operations — if the business metric is clearly attributable to the model. Revenue retained from churn prevention can be measured. Cost avoided from predictive maintenance can be measured. But attribution is tricky: did the customer stay because of the retention intervention the model triggered, or would they have stayed anyway? Outcome-based pricing works when attribution methodology is agreed upfront and measurement infrastructure exists.
Build vs. Augment: When Internal Teams Need External Specialists
Every AI engagement eventually confronts the build vs. augment question: should the organization build the entire capability internally, or bring in external specialists for specific phases?
Build internally when: The use case is a core competitive differentiator (the model IS the product). The organization has ongoing AI needs that justify permanent headcount. The domain expertise required is so specialized that external teams can't acquire it fast enough. The data sensitivity or regulatory constraints make external access impractical.
Augment with external specialists when: The team has data scientists but lacks ML engineers for production deployment. The engagement requires skills the team will need for 6 months but not permanently (MLOps setup, specific framework expertise). The timeline is compressed and hiring would take longer than the engagement. The organization needs to prove AI value before justifying permanent headcount investment.
The most common pattern we see: organizations build an internal data science team, run successful PoCs, then stall at production deployment because they lack the ML engineering and data engineering depth that Phases 4-6 require. This is where consulting-led augmentation fills the gap — pre-qualified ML engineers, data engineers, and AI solution architects who deploy alongside the internal team, transfer knowledge through the engagement, and exit when the internal team can sustain operations independently.
The Xylity Approach
We structure AI engagements with explicit knowledge transfer at every phase. Our consultants work alongside your team — not in a separate room. By Phase 6, your internal team operates the production system independently. We provide ongoing support for retraining cycles and model improvement, but the operational capability transfers to your organization. This is consulting-led engagement, not dependency creation.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Structure Your AI Engagement for Production
The six-phase model with kill criteria at every gate. Discovery through operations, structured for models that actually reach production.
Start Your AI Engagement →