The Platform Trap: Buying Before Knowing What You Need
An enterprise buys Databricks because the CTO saw a demo. Six months later, the data science team is using Azure ML because it integrates better with their Azure data estate. The ML engineers deployed a model on SageMaker because it was fastest for their specific use case. Now the organization operates three ML platforms — with three licensing costs, three skill requirements, and three operational models. Nobody planned this. It happened because platform selection preceded use case analysis.
Tech stack selection should follow this sequence: use cases define requirements → requirements narrow platform choices → platform choices determine tooling → tooling determines skills. Most organizations invert this: vendor demo → platform purchase → find use cases → discover the platform doesn't support them well → buy another platform. This guide follows the correct sequence.
The AI/ML Tech Stack: 7 Layers
| Layer | What It Provides | Key Choices |
|---|---|---|
| 1. Data Platform | Storage, processing, and serving of training data | Fabric, Databricks, Snowflake, BigQuery |
| 2. ML Frameworks | Model development and training | PyTorch, TensorFlow, scikit-learn, XGBoost, Hugging Face |
| 3. Experiment Tracking | Reproducible experiments, parameter logging | MLflow, Weights & Biases, Neptune, Comet |
| 4. ML Platform | Managed compute, pipelines, model registry | Azure ML, Databricks ML, SageMaker, Vertex AI |
| 5. LLM/GenAI | Large language models for generation and reasoning | Azure OpenAI, Anthropic API, open-source (Llama, Mistral) |
| 6. Vector & Search | Embedding storage and retrieval for RAG | Azure AI Search, Pinecone, Weaviate, pgvector, Qdrant |
| 7. MLOps | CI/CD, monitoring, retraining automation | Azure ML Pipelines, Databricks Workflows, Kubeflow, custom |
Layer 1: Data Platform — The Foundation Everything Depends On
The data platform decision is the most consequential because every other layer depends on it. The ML platform, feature store, training data, and serving infrastructure all connect to the data platform. Choosing an ML platform that doesn't integrate natively with your data platform creates integration friction that slows every project.
Microsoft Fabric: Unified analytics platform combining data lake, warehouse, data engineering, data science, and real-time analytics. Best for: Microsoft-ecosystem organizations that want one platform for data + ML. Native integration with Azure ML, Power BI, and Azure OpenAI. The Fabric lakehouse serves as both the analytical platform and the ML feature store.
Databricks: Lakehouse platform with strong ML capabilities. Best for: organizations with heavy Spark workloads, large-scale distributed training, and data science teams that prefer notebook-first development. MLflow is native. Unity Catalog provides governance. Feature Store is integrated. The trade-off: higher cost than Fabric for organizations not utilizing Spark at scale.
Snowflake: Cloud data warehouse with growing ML capabilities (Snowpark, Cortex). Best for: SQL-heavy organizations where the data team is stronger in SQL than Python. Snowpark ML enables model training in Snowflake using Python. The trade-off: less mature ML ecosystem than Databricks or Azure ML.
Choose the ML platform that integrates natively with your data platform. Fabric → Azure ML. Databricks → Databricks ML. Snowflake → Snowpark. GCP → Vertex AI. AWS → SageMaker. Cross-platform integration (e.g., SageMaker with Fabric) works but adds 30-50% overhead in data movement, credential management, and operational complexity.
Layer 2: ML Frameworks — What Your Team Already Knows Matters
ML framework selection is primarily a team skill decision, not a technology decision. A team proficient in PyTorch will be 3x more productive in PyTorch than in TensorFlow — regardless of which framework benchmarks 2% faster on a specific task.
| Framework | Best For | Ecosystem Strength |
|---|---|---|
| PyTorch | Research, deep learning, NLP, computer vision | Dominant in research, Hugging Face ecosystem, dynamic graphs |
| TensorFlow / Keras | Production ML, mobile/edge deployment | TFLite for mobile, TF Serving for production, Keras for rapid prototyping |
| scikit-learn | Classical ML (classification, regression, clustering) | Fastest prototyping for tabular data, excellent documentation |
| XGBoost / LightGBM | Tabular data, Kaggle-winning accuracy | Best accuracy on structured enterprise data, fast training |
| Hugging Face Transformers | NLP, generative AI, fine-tuning pre-trained models | Largest model hub, easy fine-tuning, active community |
The enterprise default: XGBoost/LightGBM for tabular problems (90% of enterprise ML). PyTorch + Hugging Face for NLP/vision/GenAI. scikit-learn for prototyping and simple models. This combination covers virtually every enterprise ML use case.
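To make the tabular default concrete, here is a minimal sketch using XGBoost with scikit-learn utilities. The dataset, column names, and hyperparameters are illustrative placeholders, not a recommendation.

```python
# Minimal sketch: the tabular "enterprise default" (XGBoost + scikit-learn).
# Dataset, column names, and hyperparameters are illustrative placeholders;
# assumes the features are already numeric.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("customer_churn.csv")                  # hypothetical tabular dataset
X, y = df.drop(columns=["churned"]), df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.05, eval_metric="auc"
)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Hold-out AUC: {auc:.3f}")
```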
Layer 5: LLM/GenAI — Build, Buy, or Open-Source
The LLM layer offers the widest range of options and the fastest-changing landscape. Three approaches:
Commercial API (Azure OpenAI, Anthropic, Google): Highest capability models (GPT-4o, Claude, Gemini) accessed via API. No infrastructure to manage. Pay per token. Best for: applications that need frontier model capability, don't require on-premises deployment, and where data can be sent to cloud APIs (check data residency requirements). Azure OpenAI provides enterprise security (data not used for training, VNet isolation, managed identity).
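A minimal sketch of the commercial-API route, assuming an Azure OpenAI resource with a chat model already deployed. The endpoint, deployment name, and API version are placeholders; check your resource for the values that apply.

```python
# Minimal sketch: calling a commercial LLM via the Azure OpenAI service.
# Endpoint, deployment name, and api_version are placeholder values.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",                            # check your resource's supported versions
)

response = client.chat.completions.create(
    model="gpt-4o-deployment",                           # your deployment name, not the model family name
    messages=[
        {"role": "system", "content": "You answer questions about internal policy documents."},
        {"role": "user", "content": "Summarize the travel reimbursement policy in three bullets."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```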
Open-source deployed (Llama 3, Mistral, Phi): Models deployed on your own infrastructure (Azure ML, Databricks, or Kubernetes). No data leaves your environment. Lower per-inference cost at scale. Best for: organizations with data sovereignty requirements, high-volume inference workloads, or a need for model customization beyond prompting. The trade-off: infrastructure management, GPU procurement, and lower capability than frontier commercial models.
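For the self-hosted route, a minimal sketch using the Hugging Face Transformers pipeline. The model ID and generation settings are illustrative; a production deployment would typically sit behind a dedicated inference server with batching and autoscaling.

```python
# Minimal sketch: self-hosted inference with an open-weight model.
# Model ID and generation settings are illustrative; production setups
# usually front this with a dedicated inference server.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumes the weights are accessible and a GPU is available
    device_map="auto",
)

prompt = "Classify the following support ticket as billing, technical, or account:\n..."
output = generator(prompt, max_new_tokens=64, do_sample=False)
print(output[0]["generated_text"])
```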
Fine-tuned models: Start with a commercial or open-source base model and fine-tune on domain-specific data. Produces a model that's more accurate for your specific use case than the base model. Best for: classification tasks with domain-specific categories, generation tasks requiring specific style/format, and RAG applications where retrieval alone doesn't achieve sufficient accuracy. Fine-tuning requires: 500-10,000 labeled examples, GPU compute for training, and MLOps infrastructure for versioning and deployment.
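A compressed sketch of the fine-tuning workflow for a domain-specific classification task, assuming a labeled CSV with text and label columns. The base model, file names, and training arguments are placeholders; real runs also need evaluation, versioning, and the MLOps plumbing described in Layer 7.

```python
# Minimal sketch: fine-tuning a small pre-trained model on domain-specific
# labeled data. Base model, label count, and file names are placeholders;
# the CSVs are assumed to contain "text" and "label" columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "distilbert-base-uncased"                     # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=4)

ds = load_dataset("csv", data_files={"train": "tickets_train.csv",
                                     "test": "tickets_test.csv"})  # ~500-10,000 labeled rows
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./ft-ticket-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()
trainer.save_model("./ft-ticket-classifier")          # version and register via your MLOps tooling
```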
Layer 6: Vector and Search — The RAG Foundation
Vector databases store document embeddings and enable similarity search — the retrieval component of RAG architectures. The selection depends on scale, latency requirements, and ecosystem integration.
| Vector Store | Best For | Scale | Managed? |
|---|---|---|---|
| Azure AI Search | Azure-native RAG, hybrid search (vector + keyword) | Millions of vectors | Fully managed |
| Pinecone | Dedicated vector search, simple API | Billions of vectors | Fully managed |
| Weaviate | Multi-modal (text + image), GraphQL API | Millions-billions | Cloud or self-hosted |
| pgvector | PostgreSQL-native, existing Postgres infrastructure | Millions of vectors | Via managed Postgres |
| Qdrant | High-performance, Rust-based, filtering support | Billions of vectors | Cloud or self-hosted |
Selection guidance: If on Azure → Azure AI Search (native integration, hybrid search, semantic ranking). If scale exceeds 100M vectors → Pinecone or Qdrant. If existing PostgreSQL → pgvector (zero new infrastructure). If multi-modal (text + images) → Weaviate.
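To illustrate the lowest-infrastructure option, here is a minimal pgvector sketch: store embeddings alongside document chunks and retrieve nearest neighbors with a distance operator. The table schema, embedding dimension, and embed() helper are placeholders.

```python
# Minimal sketch: similarity search with pgvector on an existing PostgreSQL
# instance. Table name, embedding dimension, and embed() are placeholders;
# the pgvector extension must be available on the server.
import psycopg2

def embed(text: str) -> list[float]:
    # Placeholder: replace with a call to your embedding model (1536-dim assumed).
    return [0.0] * 1536

conn = psycopg2.connect("dbname=rag user=app")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id        bigserial PRIMARY KEY,
            content   text,
            embedding vector(1536)
        );
    """)
    # Retrieve the 5 chunks closest to the query embedding (L2 distance).
    query_vec = embed("What is our data retention policy?")
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    cur.execute(
        "SELECT content FROM docs ORDER BY embedding <-> %s::vector LIMIT 5;",
        (vec_literal,),
    )
    for (content,) in cur.fetchall():
        print(content)
```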
The Decision Framework: 5 Questions
What's your data platform?
This determines your ML platform. Fabric → Azure ML. Databricks → Databricks ML. Don't fight the ecosystem — integration cost exceeds any feature advantage.
What problem types dominate?
Tabular classification/regression → XGBoost + scikit-learn. NLP/GenAI → PyTorch + Hugging Face + LLM API. Computer vision → PyTorch + pre-trained CNNs. The problem type narrows framework selection.
Can data go to cloud APIs?
Yes → Azure OpenAI or Anthropic for LLM workloads. No (data sovereignty, regulation) → deploy open-source models on-premises or in your VNet.
What does your team know?
A team skilled in PyTorch should use PyTorch. Retraining to TensorFlow costs 3-6 months of reduced productivity. Framework familiarity trumps marginal technical advantages.
What's your MLOps maturity?
Low maturity → use managed services (Azure ML managed endpoints, Databricks Model Serving). High maturity → consider open-source for flexibility (Kubeflow, Seldon). Managed services trade flexibility for operational simplicity.
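Purely as an illustration, the five questions can be reduced to a lookup that encodes the mappings above. Real selection involves judgment (team skills, existing contracts, roadmaps) that a lookup can't capture.

```python
# Illustrative only: the 5-question framework as a lookup, using the mappings
# described above. Team skills (question 4) act as a tie-breaker on frameworks.
ML_PLATFORM_FOR_DATA_PLATFORM = {
    "fabric": "Azure ML",
    "databricks": "Databricks ML",
    "snowflake": "Snowpark",
    "gcp": "Vertex AI",
    "aws": "SageMaker",
}

FRAMEWORKS_FOR_PROBLEM = {
    "tabular": ["XGBoost", "scikit-learn"],
    "nlp_genai": ["PyTorch", "Hugging Face", "LLM API"],
    "vision": ["PyTorch", "pre-trained CNNs"],
}

def recommend_stack(data_platform: str, problem_type: str,
                    data_can_leave: bool, mlops_mature: bool) -> dict:
    return {
        "ml_platform": ML_PLATFORM_FOR_DATA_PLATFORM[data_platform],
        "frameworks": FRAMEWORKS_FOR_PROBLEM[problem_type],
        "llm": "commercial API (Azure OpenAI / Anthropic)" if data_can_leave
               else "open-source model deployed in your VNet",
        "serving": "open-source (Kubeflow, Seldon)" if mlops_mature
                   else "managed endpoints",
    }

print(recommend_stack("fabric", "nlp_genai", data_can_leave=True, mlops_mature=False))
```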
Build vs Buy: The Platform Decision
Every layer has a build-vs-buy decision. For most enterprises, the answer is buy for infrastructure (ML platform, compute, monitoring) and build for domain-specific components (feature engineering, model logic, evaluation). The deciding question: is this a differentiator (build — your feature engineering captures competitive advantage) or commodity infrastructure (buy — experiment tracking, model serving, and compute management are solved problems that don't differentiate your business)? Building commodity infrastructure creates maintenance burden without competitive advantage. Buying domain-specific components creates vendor dependency without domain fit.
Migration Path: Moving Between Platforms
Tech stack decisions aren't permanent. Organizations grow, strategies change, and platforms evolve. The migration mitigation strategy: use open-source frameworks where possible (MLflow works across Azure ML, Databricks, and standalone), containerize models for portability (a Docker container deploys on any platform), and keep training code platform-agnostic (train with PyTorch/scikit-learn, deploy on any serving infrastructure). These practices reduce switching cost from 6 months to 6 weeks when a platform change becomes necessary.
Evaluation Infrastructure: The Forgotten Layer
Every stack needs evaluation infrastructure — automated testing that runs before and after every change. For traditional ML: model validation suites (accuracy, fairness, drift tests). For GenAI: response quality evaluation (relevance, groundedness, coherence scoring). For both: A/B testing infrastructure that compares model versions on live traffic. Evaluation infrastructure is often the last layer built — but it should be among the first, because without automated evaluation, every change is a guess about whether quality improved or degraded. Azure AI Studio, LangSmith, and custom evaluation pipelines provide this capability. Budget 15-20% of the total stack cost for evaluation infrastructure.
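A minimal sketch of what an evaluation gate can look like for traditional ML: the candidate model must clear an absolute quality floor and must not regress against the current production model on a fixed hold-out set. The thresholds and helpers are placeholders.

```python
# Minimal sketch of an automated evaluation gate for a traditional ML model:
# the candidate must not degrade relative to the production model on a fixed
# hold-out set. Thresholds and model/data loading are placeholders.
from sklearn.metrics import roc_auc_score

MIN_AUC = 0.80           # absolute quality floor (placeholder threshold)
MAX_REGRESSION = 0.01    # allowed drop vs. the production model (placeholder)

def evaluate_candidate(candidate, production, X_holdout, y_holdout):
    cand_auc = roc_auc_score(y_holdout, candidate.predict_proba(X_holdout)[:, 1])
    prod_auc = roc_auc_score(y_holdout, production.predict_proba(X_holdout)[:, 1])
    checks = {
        "meets_floor": cand_auc >= MIN_AUC,
        "no_regression": cand_auc >= prod_auc - MAX_REGRESSION,
    }
    return cand_auc, prod_auc, checks

# Wire this into CI so a failed check blocks promotion to the model registry.
```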
Vendor Lock-In Mitigation
Every platform creates some lock-in. The mitigation strategy: keep model training code in framework-native format (PyTorch, scikit-learn) independent of the ML platform's proprietary abstractions. Use MLflow for experiment tracking and model registry — it works on Azure ML, Databricks, and standalone. Containerize models for serving — Docker containers deploy anywhere regardless of the training platform. Store training data in open formats (Parquet, Delta Lake) accessible from any platform. These practices let you switch ML platforms (Azure ML → Databricks, SageMaker → Vertex) by changing the deployment infrastructure, not rewriting model code. The migration cost drops from months to weeks when you've maintained platform independence in the core components.
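As a sketch of what platform independence looks like in practice, the following trains a framework-native scikit-learn model and logs it to an MLflow tracking server and model registry. The tracking URI, experiment name, and metric value are placeholders; the same pattern works whether MLflow runs on Azure ML, Databricks, or standalone.

```python
# Minimal sketch: framework-native training logged to MLflow so the model can
# be registered and served on any MLflow-compatible platform. Tracking URI,
# experiment name, and the metric value are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_train, y_train = make_classification(n_samples=1000, random_state=0)  # stand-in training data

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # Azure ML, Databricks, or self-hosted
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    model.fit(X_train, y_train)
    mlflow.log_params({"n_estimators": 200, "max_depth": 3})
    mlflow.log_metric("val_auc", 0.87)                    # placeholder metric value
    # Log in MLflow's portable model format and register it by name.
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="churn-classifier")
```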
The Xylity Approach
We select AI/ML tech stacks through the 5-question framework — data platform drives ML platform, problem type drives frameworks, data residency drives LLM choice, team skills drive tooling, and MLOps maturity drives infrastructure model. Our AI architects and solution architects design the stack alongside your team — avoiding the platform trap of buying before knowing what you need.
The Right Stack for Your AI Ambition
Five questions, seven layers — the tech stack decision framework that prevents the platform trap and matches infrastructure to use cases.
Start Your Tech Stack Assessment →