The Platform Trap: Buying Before Knowing What You Need

An enterprise buys Databricks because the CTO saw a demo. Six months later, the data science team is using Azure ML because it integrates better with their Azure data estate. The ML engineers deployed a model on SageMaker because it was fastest for their specific use case. Now the organization operates three ML platforms — with three licensing costs, three skill requirements, and three operational models. Nobody planned this. It happened because platform selection preceded use case analysis.

Tech stack selection should follow this sequence: use cases define requirements → requirements narrow platform choices → platform choices determine tooling → tooling determines skills. Most organizations invert this: vendor demo → platform purchase → find use cases → discover the platform doesn't support them well → buy another platform. This guide follows the correct sequence.

"The right AI/ML tech stack is the one that serves your use cases, matches your data platform, and that your team can actually operate. Vendor features matter less than ecosystem fit." — Xylity AI Engineering Practice

The AI/ML Tech Stack: 7 Layers

| Layer | What It Provides | Key Choices |
| --- | --- | --- |
| 1. Data Platform | Storage, processing, and serving of training data | Fabric, Databricks, Snowflake, BigQuery |
| 2. ML Frameworks | Model development and training | PyTorch, TensorFlow, scikit-learn, XGBoost, Hugging Face |
| 3. Experiment Tracking | Reproducible experiments, parameter logging | MLflow, Weights & Biases, Neptune, Comet |
| 4. ML Platform | Managed compute, pipelines, model registry | Azure ML, Databricks ML, SageMaker, Vertex AI |
| 5. LLM/GenAI | Large language models for generation and reasoning | Azure OpenAI, Anthropic API, open-source (Llama, Mistral) |
| 6. Vector & Search | Embedding storage and retrieval for RAG | Azure AI Search, Pinecone, Weaviate, pgvector, Qdrant |
| 7. MLOps | CI/CD, monitoring, retraining automation | Azure ML Pipelines, Databricks Workflows, Kubeflow, custom |

Layer 1: Data Platform — The Foundation Everything Depends On

The data platform decision is the most consequential because every other layer depends on it. The ML platform, feature store, training data, and serving infrastructure all connect to the data platform. Choosing an ML platform that doesn't integrate natively with your data platform creates integration friction that slows every project.

Microsoft Fabric: Unified analytics platform combining data lake, warehouse, data engineering, data science, and real-time analytics. Best for: Microsoft-ecosystem organizations that want one platform for data + ML. Native integration with Azure ML, Power BI, and Azure OpenAI. The Fabric lakehouse serves as both the analytical platform and the ML feature store.

Databricks: Lakehouse platform with strong ML capabilities. Best for: organizations with heavy Spark workloads, large-scale distributed training, and data science teams that prefer notebook-first development. MLflow is native. Unity Catalog provides governance. Feature Store is integrated. The trade-off: higher cost than Fabric for organizations not utilizing Spark at scale.

Snowflake: Cloud data warehouse with growing ML capabilities (Snowpark, Cortex). Best for: SQL-heavy organizations where the data team is stronger in SQL than Python. Snowpark ML enables model training in Snowflake using Python. The trade-off: less mature ML ecosystem than Databricks or Azure ML.

The Platform Integration Rule

Choose the ML platform that integrates natively with your data platform. Fabric → Azure ML. Databricks → Databricks ML. Snowflake → Snowpark. GCP → Vertex AI. AWS → SageMaker. Cross-platform integration (e.g., SageMaker with Fabric) works but adds 30-50% overhead in data movement, credential management, and operational complexity.

Layer 2: ML Frameworks — What Your Team Already Knows Matters

ML framework selection is primarily a team skill decision, not a technology decision. A team proficient in PyTorch will be 3x more productive in PyTorch than in TensorFlow — regardless of which framework benchmarks 2% faster on a specific task.

| Framework | Best For | Ecosystem Strength |
| --- | --- | --- |
| PyTorch | Research, deep learning, NLP, computer vision | Dominant in research, Hugging Face ecosystem, dynamic graphs |
| TensorFlow / Keras | Production ML, mobile/edge deployment | TFLite for mobile, TF Serving for production, Keras for rapid prototyping |
| scikit-learn | Classical ML (classification, regression, clustering) | Fastest prototyping for tabular data, excellent documentation |
| XGBoost / LightGBM | Tabular data, Kaggle-winning accuracy | Best accuracy on structured enterprise data, fast training |
| Hugging Face Transformers | NLP, generative AI, fine-tuning pre-trained models | Largest model hub, easy fine-tuning, active community |

The enterprise default: XGBoost/LightGBM for tabular problems (90% of enterprise ML). PyTorch + Hugging Face for NLP/vision/GenAI. scikit-learn for prototyping and simple models. This combination covers virtually every enterprise ML use case.
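The tabular default above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, assuming scikit-learn is installed; a production stack would typically swap in XGBoost's `XGBClassifier` or LightGBM, which expose the same fit/predict API:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for enterprise features (e.g. churn signals)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Gradient boosting is the tabular workhorse; XGBoost/LightGBM are drop-in upgrades
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy: {acc:.2f}")
```

The same dozen lines cover the "90% of enterprise ML" case; the framework decision mostly changes which estimator class appears on one line.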

Layer 5: LLM/GenAI — Build, Buy, or Open-Source

The LLM layer offers the widest range of options and the fastest-changing landscape. Three approaches:

Commercial API (Azure OpenAI, Anthropic, Google): Highest capability models (GPT-4o, Claude, Gemini) accessed via API. No infrastructure to manage. Pay per token. Best for: applications that need frontier model capability, don't require on-premises deployment, and where data can be sent to cloud APIs (check data residency requirements). Azure OpenAI provides enterprise security (data not used for training, VNet isolation, managed identity).

Open-source deployed (Llama 3, Mistral, Phi): Models deployed on your own infrastructure (Azure ML, Databricks, or Kubernetes). No data leaves your environment. Lower per-inference cost at scale. Best for: organizations with data sovereignty requirements, high-volume inference workloads, or a need for model customization beyond prompting. The trade-off: infrastructure management, GPU procurement, and lower capability than frontier commercial models.

Fine-tuned models: Start with a commercial or open-source base model and fine-tune on domain-specific data. Produces a model that's more accurate for your specific use case than the base model. Best for: classification tasks with domain-specific categories, generation tasks requiring specific style/format, and RAG applications where retrieval alone doesn't achieve sufficient accuracy. Fine-tuning requires: 500-10,000 labeled examples, GPU compute for training, and MLOps infrastructure for versioning and deployment.
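The labeled-example requirement above is concrete: fine-tuning data is usually prepared as JSONL, one training record per line. A minimal sketch using the OpenAI-style chat format (the ticket categories and prompts here are hypothetical; other providers use similar but not identical schemas):

```python
import json

# Hypothetical labeled examples for a support-ticket classifier
examples = [
    {"prompt": "Invoice shows a duplicate charge for March", "label": "billing"},
    {"prompt": "App crashes when exporting the quarterly report", "label": "bug"},
]

# One chat-format JSON object per line — the common fine-tuning input shape
with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "system", "content": "Classify the ticket into a category."},
            {"role": "user", "content": ex["prompt"]},
            {"role": "assistant", "content": ex["label"]},
        ]}
        f.write(json.dumps(record) + "\n")

# Sanity-check: the file round-trips line by line
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows), "training records")
```

In practice the 500-10,000 examples come from labeled production data, and the same file is versioned alongside the resulting model in the registry.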

Layer 6: Vector and Search — The RAG Foundation

Vector databases store document embeddings and enable similarity search — the retrieval component of RAG architectures. The selection depends on scale, latency requirements, and ecosystem integration.

| Vector Store | Best For | Scale | Managed? |
| --- | --- | --- | --- |
| Azure AI Search | Azure-native RAG, hybrid search (vector + keyword) | Millions of vectors | Fully managed |
| Pinecone | Dedicated vector search, simple API | Billions of vectors | Fully managed |
| Weaviate | Multi-modal (text + image), GraphQL API | Millions-billions | Cloud or self-hosted |
| pgvector | PostgreSQL-native, existing Postgres infrastructure | Millions of vectors | Via managed Postgres |
| Qdrant | High-performance, Rust-based, filtering support | Billions of vectors | Cloud or self-hosted |

Selection guidance: If on Azure → Azure AI Search (native integration, hybrid search, semantic ranking). If scale exceeds 100M vectors → Pinecone or Qdrant. If existing PostgreSQL → pgvector (zero new infrastructure). If multi-modal (text + images) → Weaviate.
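Whichever store you pick, the core retrieval operation is the same: embed the query, then rank stored document vectors by cosine similarity. A toy sketch with hard-coded three-dimensional vectors (a real pipeline would get 768-3072-dimensional embeddings from an embedding model, and the store would handle the ranking):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings keyed by document title
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "api rate limits": [0.1, 0.9, 0.2],
    "office locations": [0.0, 0.2, 0.9],
}

# Pretend embedding of the query "how do I get my money back?"
query = [0.8, 0.2, 0.1]

# Rank documents by similarity; the top hit becomes the RAG context for the LLM
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # → refund policy
```

The vector stores in the table differ in how they scale, filter, and host this operation, not in what it computes.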

The Decision Framework: 5 Questions

1. What's your data platform?

This determines your ML platform. Fabric → Azure ML. Databricks → Databricks ML. Don't fight the ecosystem — integration cost exceeds any feature advantage.

2. What problem types dominate?

Tabular classification/regression → XGBoost + scikit-learn. NLP/GenAI → PyTorch + Hugging Face + LLM API. Computer vision → PyTorch + pre-trained CNNs. The problem type narrows framework selection.

3. Can data go to cloud APIs?

Yes → Azure OpenAI or Anthropic for LLM workloads. No (data sovereignty, regulation) → deploy open-source models on-premises or in your VNet.

4. What does your team know?

A team skilled in PyTorch should use PyTorch. Retraining to TensorFlow costs 3-6 months of reduced productivity. Framework familiarity trumps marginal technical advantages.

5. What's your MLOps maturity?

Low maturity → use managed services (Azure ML managed endpoints, Databricks Model Serving). High maturity → consider open-source for flexibility (Kubeflow, Seldon). Managed services trade flexibility for operational simplicity.

Build vs Buy: The Platform Decision

Every layer has a build-vs-buy decision. For most enterprises, the answer is buy for infrastructure (ML platform, compute, monitoring) and build for domain-specific components (feature engineering, model logic, evaluation). The decision criterion: is this a differentiator or commodity infrastructure? Build differentiators — your feature engineering captures competitive advantage. Buy commodities — experiment tracking, model serving, and compute management are solved problems that don't differentiate your business. Building commodity infrastructure creates maintenance burden without competitive advantage; buying domain-specific components creates vendor dependency without domain fit.

Migration Path: Moving Between Platforms

Tech stack decisions aren't permanent. Organizations grow, strategies change, and platforms evolve. The migration mitigation strategy: use open-source frameworks where possible (MLflow works across Azure ML, Databricks, and standalone), containerize models for portability (a Docker container deploys on any platform), and keep training code platform-agnostic (train with PyTorch/scikit-learn, deploy on any serving infrastructure). These practices reduce switching cost from 6 months to 6 weeks when a platform change becomes necessary.

Evaluation Infrastructure: The Forgotten Layer

Every stack needs evaluation infrastructure — automated testing that runs before and after every change. For traditional ML: model validation suites (accuracy, fairness, drift tests). For GenAI: response quality evaluation (relevance, groundedness, coherence scoring). For both: A/B testing infrastructure that compares model versions on live traffic. Evaluation infrastructure is often the last layer built — but it should be among the first, because without automated evaluation, every change is a guess about whether quality improved or degraded. Azure AI Studio, LangSmith, and custom evaluation pipelines provide this capability. Budget 15-20% of the total stack cost for evaluation infrastructure.

Vendor Lock-In Mitigation

Every platform creates some lock-in. The mitigation strategy: keep model training code in framework-native format (PyTorch, scikit-learn) independent of the ML platform's proprietary abstractions. Use MLflow for experiment tracking and model registry — it works on Azure ML, Databricks, and standalone. Containerize models for serving — Docker containers deploy anywhere regardless of the training platform. Store training data in open formats (Parquet, Delta Lake) accessible from any platform. These practices let you switch ML platforms (Azure ML → Databricks, SageMaker → Vertex) by changing the deployment infrastructure, not rewriting model code. The migration cost drops from months to weeks when you've maintained platform independence in the core components.
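The artifact-plus-open-metadata pattern behind this advice can be sketched with nothing but the standard library. This toy uses pickle and a hand-rolled model class purely for illustration; a real stack would serialize via MLflow's model format or ONNX, but the principle is the same — the artifact and its metadata travel independently of any ML platform:

```python
import json
import pickle

# Stand-in for a trained framework-native model (really a PyTorch/sklearn object)
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, score):
        return int(score >= self.threshold)

model = ThresholdModel(threshold=0.5)

# Persist the model plus open-format metadata, outside any platform's registry
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.json", "w") as f:
    json.dump({"framework": "custom", "version": "1.0"}, f)

# Any serving target (container, Azure ML endpoint, SageMaker) reloads and serves it
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored.predict(0.7))  # → 1
```

Because the serving side only needs the artifact and its metadata, switching platforms means repointing the deployment pipeline, not retraining or rewriting the model.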

The Xylity Approach

We select AI/ML tech stacks through the 5-question framework — data platform drives ML platform, problem type drives frameworks, data residency drives LLM choice, team skills drive tooling, and MLOps maturity drives infrastructure model. Our AI architects and solution architects design the stack alongside your team — avoiding the platform trap of buying before knowing what you need.


The Right Stack for Your AI Ambition

Five questions, seven layers — the tech stack decision framework that prevents the platform trap and matches infrastructure to use cases.

Start Your Tech Stack Assessment →