In This Article
- The First Decision: What Type of Problem Is This?
- Classification: When the Answer Is a Category
- Regression: When the Answer Is a Number
- Clustering: When There Is No Right Answer
- NLP: When the Input Is Text
- Computer Vision: When the Input Is an Image
- Time Series: When the Pattern Is Temporal
- The Complexity Ladder: Start Simple, Add Complexity Only When Needed
- The Selection Decision Matrix
- Go Deeper
The First Decision: What Type of Problem Is This?
Before selecting an algorithm, identify the problem type. This sounds obvious, but it's the most common source of ML project misdirection. "Predict customer churn" could be classification (will they churn? yes/no), regression (how many days until churn?), or survival analysis (what's the probability of churn in each future time period?). Each formulation leads to different algorithms, different evaluation metrics, and different integration patterns. The right formulation comes from the business question, not from the data.
| Problem Type | Output | Business Questions | Evaluation Metric |
|---|---|---|---|
| Classification | Category (yes/no, A/B/C) | Will this customer churn? Is this transaction fraudulent? Which segment? | AUC, F1, Precision, Recall |
| Regression | Continuous number | How much revenue? What price? How many units? | RMSE, MAE, MAPE, R² |
| Clustering | Group assignment | What segments exist? Which items are similar? What's anomalous? | Silhouette, business validation |
| NLP | Text understanding/generation | What's the sentiment? What's the intent? Summarize this. | Task-specific (accuracy, BLEU, ROUGE) |
| Computer Vision | Image understanding | What's in this image? Is this defective? Where is the damage? | mAP, IoU, accuracy |
| Time Series | Future value(s) | What will demand be? What's the forecast? When will it fail? | MAPE, RMSE, coverage |
Spend time getting the problem formulation right before touching algorithms. A misformulated problem — classification when the business needs regression, or point prediction when the business needs a probability distribution — wastes the entire model development cycle. The right formulation comes from three questions: What does the business need to know? What decision does it inform? What form does the answer need to take?
Classification: When the Answer Is a Category
Classification predicts which category an entity belongs to — churn/not-churn, fraud/legitimate, high-risk/medium-risk/low-risk. The output is a probability per class; the decision threshold converts probability to action.
Algorithm Selection for Classification
| Algorithm | Best For | Strengths | Limitations |
|---|---|---|---|
| Logistic Regression | Baseline, interpretable models, linear relationships | Fast training, fully interpretable, well-calibrated probabilities | Can't capture non-linear relationships without feature engineering |
| Random Forest | Moderate datasets, minimal tuning needed | Handles non-linearity, resistant to overfitting, feature importance | Slower inference than linear, less accurate than boosting |
| XGBoost / LightGBM | Most tabular classification problems | Best accuracy on tabular data, handles missing values, fast training | Requires tuning, less interpretable than linear |
| Neural Networks | High-dimensional data, complex interactions | Learns complex patterns, handles unstructured data (text, image) | Needs large data, expensive training, less interpretable |
The default for enterprise tabular classification: XGBoost or LightGBM. Start here. Move to logistic regression if interpretability is critical (regulatory requirement, explainable decisions). Move to neural networks only if tabular models plateau and you have >100K training examples with complex interaction patterns.
Threshold Selection
The model outputs probabilities. The threshold converts probability to decision — above 0.5, predict churn; below, predict retain. But 0.5 is rarely the right threshold. The optimal threshold depends on the cost of errors: if a false negative (a missed churner) costs $5,000 and a false positive (an unnecessary intervention) costs $200, the threshold should be much lower than 0.5 — catching more churners at the cost of some unnecessary interventions. The threshold is a business decision, not a statistical one.
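This cost logic can be made concrete with a simple threshold sweep on a validation set. The cost figures below are the hypothetical ones from the paragraph above, and the toy scores are illustrative:

```python
import numpy as np

# Hypothetical costs from the churn example: a missed churner is far more
# expensive than an unnecessary retention offer.
COST_FN = 5_000  # false negative: churner we failed to flag
COST_FP = 200    # false positive: retention offer sent needlessly

def optimal_threshold(y_true, proba, cost_fn=COST_FN, cost_fp=COST_FP):
    """Sweep candidate thresholds; return the one minimizing total error cost."""
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in thresholds:
        pred = proba >= t
        fn = np.sum((y_true == 1) & ~pred)  # churners we missed
        fp = np.sum((y_true == 0) & pred)   # non-churners we flagged
        costs.append(fn * cost_fn + fp * cost_fp)
    return thresholds[int(np.argmin(costs))]

# Toy validation labels and model scores (illustrative only).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
proba  = np.array([0.9, 0.4, 0.2, 0.3, 0.1, 0.05, 0.6, 0.15, 0.35, 0.25])

t = optimal_threshold(y_true, proba)
print(f"Cost-optimal threshold: {t:.2f}")
```

Because false negatives dominate the cost here, the sweep lands well below 0.5 — exactly the asymmetry the paragraph describes.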
Regression: When the Answer Is a Number
Regression predicts a continuous value — revenue forecast, customer lifetime value, estimated claim cost, property valuation, time to completion.
Algorithm Selection for Regression
| Algorithm | Best For | Key Characteristic |
|---|---|---|
| Linear / Ridge / Lasso | Baseline, interpretable, linear relationships | Coefficients are directly interpretable as "for each unit increase in X, Y changes by..." |
| XGBoost / LightGBM Regressor | Most tabular regression | Best accuracy on structured data, handles non-linearity automatically |
| Elastic Net | Many correlated features | Combines L1 and L2 regularization, performs feature selection |
| Quantile Regression | Prediction intervals needed | Predicts percentiles (10th, 50th, 90th) instead of mean — produces prediction ranges |
For regression, always report prediction intervals, not just point estimates. "Revenue forecast: $12.4M" is less useful than "Revenue forecast: $12.4M (80% prediction interval: $11.2M-$13.6M)." Decision-makers need to understand the uncertainty around a prediction to act on it appropriately. Quantile regression or conformal prediction provides these intervals.
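One way to produce such intervals is to fit one gradient boosting model per quantile. The sketch below uses scikit-learn's quantile loss on synthetic data (the data-generating process is an assumption for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (hypothetical): non-linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 2 * X[:, 0] + np.sin(X[:, 0]) + rng.normal(scale=1.0, size=1000)

# One model per quantile: the 10th and 90th percentiles bound an 80% interval,
# the 50th percentile is the point forecast.
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0).fit(X, y)
    for q in (0.1, 0.5, 0.9)
}

X_new = np.array([[5.0]])
low, mid, high = (models[q].predict(X_new)[0] for q in (0.1, 0.5, 0.9))
print(f"Forecast: {mid:.1f} (80% interval: {low:.1f} to {high:.1f})")
```

LightGBM and XGBoost support the same quantile objective; conformal prediction is an alternative that wraps any point-forecast model with calibrated intervals.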
Clustering: When There Is No Right Answer
Clustering finds natural groupings in data without a target variable — customer segmentation, document grouping, anomaly detection, market basket analysis. Unlike classification and regression (supervised — the model learns from labeled examples), clustering is unsupervised — the model discovers structure the analyst didn't define in advance.
Algorithm Selection for Clustering
| Algorithm | Best For | Requires # Clusters? | Handles Non-Spherical? |
|---|---|---|---|
| K-Means | Large datasets, spherical clusters | Yes (k must be specified) | No |
| DBSCAN | Arbitrary shapes, noise detection | No (density-based) | Yes |
| Hierarchical | Small-medium datasets, explore cluster structure | No (cut dendrogram at desired level) | Yes |
| Gaussian Mixture | Soft assignments, probabilistic clustering | Yes (k components) | Partially (ellipsoidal) |
The clustering trap: silhouette score isn't enough. Clustering quality must be validated by the business — do the segments make sense? Are they actionable? Can the marketing team design different strategies for each segment? A clustering with perfect silhouette score but no business interpretation is mathematically valid and practically useless.
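A minimal sketch of the statistical half of that validation — comparing candidate k values on silhouette score before any business review — assuming hypothetical RFM-style features with three planted segments:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM-style features (recency, frequency, monetary) for 300
# customers, drawn from three planted segments.
rng = np.random.default_rng(0)
segments = [rng.normal(loc=c, scale=0.5, size=(100, 3))
            for c in ((0, 0, 0), (3, 3, 0), (0, 3, 3))]
X = StandardScaler().fit_transform(np.vstack(segments))

# Compare k values on silhouette -- but treat the score as a starting point,
# not the final word: the winning k still needs a business interpretation.
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

The score can narrow the candidates; only the marketing team can confirm the winning segments are actionable.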
NLP: When the Input Is Text
NLP tasks in enterprise ML: text classification (email → complaint/inquiry/request), named entity recognition (contract → extract parties, dates, amounts), sentiment analysis (review → positive/negative/neutral), summarization (document → key points), and question answering (knowledge base → answer specific questions).
Model Selection for NLP
Traditional ML (TF-IDF + classification): For simple text classification with labeled training data. Fast, interpretable, sufficient for many enterprise use cases (email routing, ticket categorization). Works with 1,000-10,000 labeled examples.
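A TF-IDF + logistic regression pipeline fits in a few lines. The tiny routing dataset below is a hypothetical illustration — a real deployment needs on the order of the 1,000-10,000 labeled examples noted above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical email-routing dataset (illustrative only).
texts = [
    "my invoice is wrong please refund",          # complaint
    "you charged me twice this is unacceptable",  # complaint
    "what are your opening hours",                # inquiry
    "do you ship internationally",                # inquiry
    "please reset my account password",           # request
    "I need a copy of last month's statement",    # request
]
labels = ["complaint", "complaint", "inquiry", "inquiry", "request", "request"]

# TF-IDF turns each email into a sparse term-weight vector; logistic
# regression learns one weight per term per class -- fully inspectable.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

print(clf.predict(["refund me now, the invoice is wrong"]))
```

The learned per-term weights are what makes this approach interpretable: you can read off exactly which words drive each routing decision.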
Pre-trained transformers (BERT, RoBERTa) with fine-tuning: For nuanced text understanding — sentiment that depends on context, intent that requires world knowledge, classification that requires understanding relationships between words. Fine-tune on 500-5,000 domain-specific examples. Significantly more accurate than TF-IDF for complex text tasks.
Large Language Models (GPT, Claude) with prompting: For tasks where labeled training data is scarce or unavailable. Zero-shot and few-shot prompting can classify, extract, and summarize without fine-tuning. Best for: prototyping, low-volume tasks, and tasks that benefit from general world knowledge. Limitations: higher inference cost, higher latency, and the need for prompt engineering.
Computer Vision: When the Input Is an Image
Enterprise computer vision: quality inspection (detect defects on manufacturing line), document processing (extract data from scanned documents), damage assessment (evaluate insurance claims from photos), security (facial recognition, access control), and medical imaging (detect anomalies in X-rays, CT scans).
Pre-trained vision models with transfer learning (ResNet, EfficientNet, ViT) are the standard approach. Fine-tune a pre-trained model on 500-5,000 labeled images from your domain. For most enterprise computer vision, transfer learning achieves production-grade accuracy — training from scratch requires 100,000+ images and is rarely necessary.
Time Series: When the Pattern Is Temporal
Time series forecasting predicts future values based on historical patterns — demand forecasting, revenue projection, capacity planning, equipment degradation. Time series problems have unique challenges: seasonality (repeating patterns at fixed intervals), trend (long-term direction), and autocorrelation (each value depends on previous values).
Algorithm Selection for Time Series
| Algorithm | Best For | Handles Seasonality? | Handles External Variables? |
|---|---|---|---|
| ARIMA / SARIMA | Univariate, single series | SARIMA: yes | ARIMAX: yes |
| Prophet | Business time series with holidays, changepoints | Yes (multiple) | Yes (regressors) |
| XGBoost (lagged features) | Many series, cross-series features | Via features | Yes (any feature) |
| LSTM / Temporal Fusion | Complex dependencies, long horizons | Learned | Yes |
For enterprise demand forecasting across many SKUs/locations: XGBoost with engineered temporal features (lags, moving averages, calendar features) typically outperforms deep learning — faster to train, easier to explain, and more performant at the SKU level where each series has limited history. Reserve LSTM and transformer-based models for problems with long-range dependencies and large training sets.
The Complexity Ladder: Start Simple, Add Complexity Only When Needed
Baseline: Simple Rules or Heuristics
Before any ML: what does the simplest possible approach achieve? The average of the last 3 months for forecasting. Predicting the majority class ("no customer churns") as a naive classification baseline. Rules-based scoring for fraud. The ML model must beat this baseline — otherwise, the investment isn't justified.
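Computing such a baseline takes a few lines, and the resulting error number becomes the bar every model must clear. The monthly revenue series below is a hypothetical illustration:

```python
import numpy as np

# Hypothetical monthly revenue series (three years, drifting upward).
rng = np.random.default_rng(1)
revenue = 100 + np.cumsum(rng.normal(loc=1, scale=3, size=36))

# Naive baseline: the mean of the last 3 observed months, carried forward
# over a 6-month holdout.
train, test = revenue[:-6], revenue[-6:]
naive_forecast = np.full(6, train[-3:].mean())

mae = np.mean(np.abs(naive_forecast - test))
print(f"Naive baseline MAE: {mae:.2f}")
# Any ML forecaster must beat this number on the same holdout to justify itself.
```

Evaluate the baseline and the candidate model on the identical holdout window; otherwise the comparison is meaningless.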
Linear Models
Logistic regression, linear regression, elastic net. Fast to train, fully interpretable, sufficient for many enterprise use cases. If linear models achieve the accuracy threshold, stop here — the interpretability and operational simplicity are worth more than marginal accuracy improvements.
Gradient Boosting
XGBoost, LightGBM, CatBoost. The sweet spot for most tabular enterprise ML — high accuracy, handles non-linearity, reasonable training time, feature importance for interpretability. This is where 80% of enterprise tabular ML should land.
Deep Learning
Neural networks, transformers, CNNs. Use only for: unstructured data (text, images, audio), very large datasets (100K+ examples), or tabular problems where gradient boosting plateaus and the accuracy gap is worth the infrastructure investment. Deep learning costs 5-10x more to train and serve than gradient boosting — justify the cost with measured accuracy improvement.
The Selection Decision Matrix
Use this matrix to select the starting algorithm for each enterprise ML use case. Start at the recommended algorithm. Escalate to higher complexity only if the starting algorithm doesn't meet the accuracy threshold after thorough feature engineering.
| Use Case | Problem Type | Start With | Escalate To | Data Requirement |
|---|---|---|---|---|
| Customer churn | Binary classification | XGBoost + RFM features | Neural net if >500K customers | 12+ months behavioral data |
| Fraud detection | Anomaly + classification | XGBoost + isolation forest | Graph neural net for network fraud | Labeled fraud cases + transaction history |
| Demand forecast | Time series regression | XGBoost + temporal features | Temporal fusion transformer | 3+ years at required granularity |
| CLV prediction | Regression | XGBoost + RFM + tenure | Probabilistic (BG/NBD + Gamma-Gamma) | 2+ years transaction history |
| Customer segments | Clustering | K-Means on RFM | Gaussian mixture + behavioral features | Sufficient volume per segment (100+ per cluster) |
| Text classification | Multi-class classification | TF-IDF + logistic regression | Fine-tuned BERT | 1,000+ labeled examples per class |
| Quality inspection | Image classification | Transfer learning (ResNet) | Custom CNN + object detection | 500+ labeled images per defect type |
| Predictive maintenance | Time-to-event / classification | XGBoost + sensor features | LSTM if long temporal dependencies | Labeled failure events + sensor history |
The Xylity Approach
We select models through the complexity ladder — start simple, measure, escalate only when the accuracy gap justifies the infrastructure cost. Our data scientists develop models; our ML engineers deploy them to production. The output: the right model at the right complexity level, deployed and monitored for sustained accuracy. Machine learning consulting that produces production systems, not research papers.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.