ML Model Selection
Guides you through choosing the right machine learning model for a given problem.
Overview
Selecting the right machine learning model is one of the most consequential early decisions in an ML project. A poor choice leads to wasted compute, missed accuracy targets, and delayed timelines. This skill provides a systematic framework for matching problem characteristics to algorithm families, then narrowing to specific models based on data volume, feature types, latency requirements, and interpretability needs.
Use this skill when starting a new ML project, when an existing model underperforms and you suspect a fundamentally different approach is needed, or when stakeholders ask why a particular algorithm was chosen.
Core Framework
Problem Type Classification
- Supervised Learning - Labeled data available; predict a target variable.
  - Classification (binary, multiclass, multilabel)
  - Regression (continuous, count, ordinal)
- Unsupervised Learning - No labels; discover structure.
  - Clustering, dimensionality reduction, anomaly detection
- Reinforcement Learning - Sequential decision-making with rewards.
- Self-supervised / Semi-supervised - Limited labels augmented with unlabeled data.
Decision Criteria Matrix
| Criterion | Low | Medium | High |
|---|---|---|---|
| Data volume | Rule-based / linear | Tree ensembles | Deep learning |
| Feature interpretability need | Neural nets OK | SHAP-compatible | Linear / GAM |
| Latency strictness | Batch OK | Sub-second | Sub-10 ms |
| Dimensionality | Simple models | Regularized | Embeddings / PCA first |
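
As a rough illustration, two rows of the matrix (data volume and interpretability need) can be turned into a shortlisting helper. This is a minimal sketch, not a definitive rule set: the matrix only says Low / Medium / High, so the row-count cut-offs `SMALL_DATA_ROWS` and `LARGE_DATA_ROWS`, the `ProblemProfile` type, and the `shortlist_families` function are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumed cut-offs: the matrix does not define what counts as low,
# medium, or high data volume, so these numbers are illustrative only.
SMALL_DATA_ROWS = 1_000
LARGE_DATA_ROWS = 100_000


@dataclass
class ProblemProfile:
    n_rows: int
    interpretability_need: str  # "low", "medium", or "high"


def shortlist_families(profile: ProblemProfile) -> List[str]:
    """Return candidate algorithm families suggested by two matrix rows."""
    if profile.interpretability_need not in {"low", "medium", "high"}:
        raise ValueError("interpretability_need must be low, medium, or high")

    # Interpretability row: a high need points to linear models or GAMs,
    # treated here as overriding the data volume row (an assumption).
    if profile.interpretability_need == "high":
        return ["linear", "GAM"]

    # Data volume row: low -> rule-based / linear, medium -> tree
    # ensembles, high -> deep learning.
    if profile.n_rows < SMALL_DATA_ROWS:
        return ["rule-based", "linear"]
    if profile.n_rows < LARGE_DATA_ROWS:
        return ["tree ensemble"]
    return ["deep learning"]


print(shortlist_families(ProblemProfile(n_rows=50_000, interpretability_need="medium")))
```

Letting a high interpretability need override data volume is a design choice for this sketch; a real project would weigh all four rows together with the hard constraints identified during the process.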
Process
- Define the business objective and map it to a problem type (classification, regression, clustering, etc.).
- Profile the dataset: volume, feature count, feature types (numeric, categorical, text, image), label balance, missingness rate.
- Identify hard constraints: latency, memory, interpretability mandates, regulatory requirements.
- Select 2-3 candidate algorithm families using the decision criteria matrix.
- Establish a baseline with the simplest viable model (for example, linear or logistic regression, k-NN, or a decision tree).
- Train candidates with default hyperparameters on a consistent train/validation split.
- Compare candidates on the primary metric plus secondary metrics (fairness, calibration, inference speed).
- Select the best candidate and proceed to hyperparameter tuning.
- Document the selection rationale including rejected alternatives and why.
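
The baseline and comparison steps above might look like the following with scikit-learn. This is a sketch under stated assumptions: the data is a synthetic stand-in generated with `make_classification`, the three candidates and the F1 metric are arbitrary choices for illustration, and `compare_candidates` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=600, n_features=12, random_state=0)

# One fixed, stratified split shared by every candidate keeps the
# comparison consistent.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Near-default hyperparameters; max_iter is raised only so the
# baseline solver converges without warnings.
candidates = {
    "baseline_logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}


def compare_candidates(models, X_train, y_train, X_val, y_val):
    """Fit each model on the same split and score it on the primary metric."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_val, model.predict(X_val))
    return scores


scores = compare_candidates(candidates, X_train, y_train, X_val, y_val)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F1={score:.3f}")
```

Only the primary metric is computed here; secondary metrics such as calibration or inference speed would be added to the same loop so that every candidate is judged on identical data.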
Key Principles
- Always start with a simple baseline before reaching for complex models.
- More data often beats a better algorithm; verify data quality before model complexity.
- Match the model to the data type natively: tabular data favors tree ensembles; images favor CNNs; sequences favor transformers or RNNs.
- Gradient-boosted trees (XGBoost, LightGBM) are usually the strongest default choice for structured/tabular data.
- As a rule of thumb, deep learning needs at least tens of thousands of samples before it outperforms classical methods on tabular data.
- Interpretability is not optional in regulated domains (finance, healthcare); plan for it from the start.
- Ensemble methods can combine strengths but add deployment complexity.
Common Pitfalls
- Jumping straight to deep learning without establishing a simpler baseline.
- Ignoring inference-time constraints until deployment and discovering the model is too slow.
- Selecting a model because it scored well on benchmarks whose data distribution differs from your own.
- Confusing model flexibility with model quality; overfitting is the usual result.
- Neglecting categorical feature handling; many models need explicit encoding strategies.
- Choosing a model because it is trendy rather than because it fits the problem constraints.
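
One way to avoid the inference-time pitfall is to time single-item predictions for each candidate before committing to it. The sketch below uses only the standard library; `measure_latency_ms` and `within_budget` are hypothetical helper names, the toy `predict` function stands in for a real model, and the 10 ms budget is an assumed example value.

```python
import math
import statistics
import time
from typing import Any, Callable, Dict, Sequence


def measure_latency_ms(
    predict: Callable[[Any], Any], samples: Sequence[Any], warmup: int = 5
) -> Dict[str, float]:
    """Time single-item predictions; return median and p95 latency in ms."""
    if not samples:
        raise ValueError("samples must not be empty")

    # Warm-up calls are run but not timed.
    for sample in samples[:warmup]:
        predict(sample)

    timings = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    # Nearest-rank 95th percentile.
    p95_index = math.ceil(0.95 * len(timings)) - 1
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
    }


def within_budget(stats: Dict[str, float], budget_ms: float) -> bool:
    """Compare tail latency, not the median, against the budget."""
    return stats["p95_ms"] <= budget_ms


# Toy stand-in for model.predict on one row.
def predict(row):
    return sum(row) > 0


stats = measure_latency_ms(predict, [[0.1] * 20 for _ in range(200)])
print(stats, "meets 10 ms budget:", within_budget(stats, budget_ms=10.0))
```

Timings taken this way depend on the machine running them, so they are most useful when measured on hardware comparable to the deployment target.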
Output Format
When recommending a model, deliver:
- Problem Statement: One sentence mapping business goal to ML task type.
- Data Profile Summary: Key stats (rows, features, types, label distribution).
- Constraints: Latency, interpretability, compute budget.
- Recommended Model: Algorithm name with justification.
- Runner-up: Alternative model and the scenario where it would be preferred.
- Baseline Plan: Simplest model to implement first for comparison.
- Risks: Known failure modes of the selected approach.
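
If recommendations are produced programmatically, the fields above could be captured in a small record type. This is one possible shape, not a required schema: the `ModelRecommendation` class and its `render` method are hypothetical, and every value in the example is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelRecommendation:
    problem_statement: str
    data_profile_summary: str
    constraints: str
    recommended_model: str
    runner_up: str
    baseline_plan: str
    risks: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the record as a bullet list, one line per field."""
        lines = [
            f"- Problem Statement: {self.problem_statement}",
            f"- Data Profile Summary: {self.data_profile_summary}",
            f"- Constraints: {self.constraints}",
            f"- Recommended Model: {self.recommended_model}",
            f"- Runner-up: {self.runner_up}",
            f"- Baseline Plan: {self.baseline_plan}",
            "- Risks: " + "; ".join(self.risks),
        ]
        return "\n".join(lines)


# Entirely hypothetical example values.
rec = ModelRecommendation(
    problem_statement="Predict customer churn as binary classification.",
    data_profile_summary="120k rows, 40 features (numeric and categorical), 8% positive.",
    constraints="Sub-second latency; feature attributions required.",
    recommended_model="Gradient-boosted trees, suited to mixed tabular features.",
    runner_up="Logistic regression, preferred if full transparency is mandated.",
    baseline_plan="Logistic regression with one-hot encoded categoricals.",
    risks=["Overfitting on rare categories", "Label imbalance"],
)
print(rec.render())
```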