
scikit-learn

Expert guidance on scikit-learn for building, evaluating, and deploying machine learning models in Python.

Quick Summary
You are an expert in scikit-learn for data analysis and science.

## Key Points

- **Always use Pipelines** to prevent data leakage. Fitting a scaler on the full dataset before splitting leaks test information.
- **Stratify splits** when classes are imbalanced (`stratify=y`).
- **Use `cross_val_score`** instead of a single train/test split for more reliable estimates.
- **Set `random_state`** on models and splits for reproducibility.
- **Choose the right metric**: accuracy is misleading for imbalanced data. Prefer F1, AUC, or precision/recall as appropriate.
- **Use `n_jobs=-1`** to parallelize grid search and ensemble training.
- **Data leakage**: fitting transformers on the full dataset before splitting. Always transform inside a Pipeline.
- **Forgetting to scale features** for scale-sensitive models (SVM, KNN, and regularized linear models such as logistic regression).
- **Using accuracy on imbalanced classes**: a model that only predicts the majority class still scores high accuracy.
- **Over-tuning on test set**: tune on validation/CV only; evaluate on test set once at the end.
- **Ignoring `max_iter` warnings**: logistic regression and SVMs may not converge with default iterations. Increase `max_iter`.

## Quick Example

```python
import pandas as pd

# Assumes a fitted tree-based model (e.g. RandomForest) and a matching feature_names list
importances = model.feature_importances_
feature_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(feature_imp.head(10))
```

```python
import joblib

# Persist the fitted pipeline (preprocessing + model) as a single artifact
joblib.dump(full_pipe, "model.joblib")
loaded_pipe = joblib.load("model.joblib")
```

scikit-learn — Data Science

You are an expert in scikit-learn for data analysis and science.

Overview

scikit-learn is Python's most widely used library for classical machine learning. It provides a consistent API across dozens of algorithms for classification, regression, clustering, dimensionality reduction, and model selection. Its fit / predict / transform interface, combined with Pipeline and cross-validation utilities, makes it the standard toolkit for tabular ML.

Core Concepts

The Estimator API

Every scikit-learn object follows the same pattern:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

Train/Test Split and Cross-Validation

from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

Pipelines

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

Column Transformers

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])

full_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
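
After fitting, the ColumnTransformer can report the names of the columns it produced (one-hot dummies included), which the feature-importance pattern later relies on. A minimal sketch, assuming scikit-learn >= 1.0:

full_pipe.fit(X_train, y_train)

# Names of the transformed output columns, aligned with the model's inputs
feature_names = full_pipe.named_steps["preprocess"].get_feature_names_out()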

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    "clf__n_estimators": [100, 200, 500],
    "clf__max_depth": [5, 10, 20, None],
    "clf__min_samples_split": [2, 5, 10],
}

# scoring="f1" assumes a binary target; use "f1_macro" or "f1_weighted" for multiclass
search = GridSearchCV(full_pipe, param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)
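
For larger grids, the RandomizedSearchCV imported above samples a fixed number of parameter combinations instead of trying them all; a minimal sketch (n_iter=10 is an arbitrary budget):

param_dist = {
    "clf__n_estimators": [100, 200, 300, 500],
    "clf__max_depth": [5, 10, 20, None],
    "clf__min_samples_split": [2, 5, 10, 20],
}

# Sample 10 of the 64 possible combinations uniformly at random
rand_search = RandomizedSearchCV(
    full_pipe, param_dist, n_iter=10, cv=5, scoring="f1",
    n_jobs=-1, random_state=42,
)
rand_search.fit(X_train, y_train)
print(rand_search.best_params_)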

Implementation Patterns

Classification Evaluation

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# AUC for binary classification
y_prob = model.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")

Regression Evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_pred = model.predict(X_test)
print(f"MAE:  {mean_absolute_error(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"R²:   {r2_score(y_test, y_pred):.3f}")

Feature Importance

import pandas as pd

# feature_importances_ exists on tree ensembles such as RandomForest;
# feature_names comes from get_feature_names_out above
importances = model.feature_importances_
feature_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(feature_imp.head(10))
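
Impurity-based importances can overstate high-cardinality features and say nothing about held-out performance. permutation_importance instead measures how much the test score drops when a feature is shuffled; a sketch reusing model and feature_names from above:

from sklearn.inspection import permutation_importance

# Shuffle each feature n_repeats times and average the drop in test score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_imp = pd.Series(result.importances_mean, index=feature_names).sort_values(ascending=False)
print(perm_imp.head(10))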

Model Persistence

import joblib

# Persist the fitted pipeline (preprocessing + model) as a single artifact
joblib.dump(full_pipe, "model.joblib")
loaded_pipe = joblib.load("model.joblib")
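
Because the entire pipeline was persisted, the loaded object applies the exact training-time preprocessing at inference. Here X_new stands in for hypothetical raw input rows with the original column layout:

# X_new: raw, untransformed rows (hypothetical new data)
predictions = loaded_pipe.predict(X_new)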

Best Practices

  • Always use Pipelines to prevent data leakage. Fitting a scaler on the full dataset before splitting leaks test information.
  • Stratify splits when classes are imbalanced (stratify=y).
  • Use cross_val_score instead of a single train/test split for more reliable estimates.
  • Set random_state on models and splits for reproducibility.
  • Choose the right metric: accuracy is misleading for imbalanced data. Prefer F1, AUC, or precision/recall as appropriate.
  • Use n_jobs=-1 to parallelize grid search and ensemble training.

Core Philosophy

scikit-learn's greatest contribution is not any single algorithm but the consistent API contract: every estimator implements fit, plus predict or transform as appropriate. This uniformity means that once you understand how to use one model, you understand the interface for all of them. More importantly, it means that pipelines, cross-validation, and grid search work with any estimator without special-casing. Investing time in mastering Pipeline and ColumnTransformer pays dividends across every project.
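
Because of that contract, swapping models requires no other code changes. A minimal sketch (the three candidates are arbitrary picks):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The same evaluation loop works for any estimator, unchanged
for clf in [LogisticRegression(max_iter=1000), RandomForestClassifier(), SVC()]:
    cv_pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(cv_pipe, X, y, cv=5)
    print(f"{clf.__class__.__name__}: {scores.mean():.3f}")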

The Pipeline is not optional -- it is the primary defense against data leakage. When preprocessing steps (scaling, encoding, imputation) are performed outside the pipeline, it is trivially easy to accidentally fit them on the full dataset before splitting. Inside a pipeline, cross_val_score and GridSearchCV automatically fit transformers only on each training fold. Treating the Pipeline as the unit of model development, not just the estimator, is the single most important scikit-learn practice.
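
Concretely, the difference shows up in cross-validation. A sketch contrasting the leaky and leak-free versions, reusing pipe from the Pipelines section:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Leaky: the scaler sees every row, so each fold is scored on data whose
# statistics leaked into training
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Leak-free: the scaler is refit on each training fold only
clean_scores = cross_val_score(pipe, X, y, cv=5)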

Choose the right metric before choosing the right model. Accuracy is the default but is actively misleading for imbalanced classes, cost-sensitive decisions, or ranking tasks. Defining the evaluation metric upfront -- F1, AUC, precision at a given recall threshold, or a custom business metric -- focuses the entire modeling process on the outcome that actually matters. scikit-learn's scoring parameter makes this easy; use it from day one.
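
The scoring parameter accepts both built-in metric names and custom scorers built with make_scorer. A sketch that weights recall more heavily via an F-beta score (beta=2 is an arbitrary example):

from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# F2 weights recall twice as heavily as precision
f2_scorer = make_scorer(fbeta_score, beta=2)
scores = cross_val_score(pipe, X, y, cv=5, scoring=f2_scorer)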

Anti-Patterns

  • Preprocessing outside the pipeline: Fitting a StandardScaler or OneHotEncoder on the full dataset, then splitting into train and test. This leaks test-set statistics into training and inflates evaluation metrics. Always place transformers inside a Pipeline.

  • Evaluating on the test set repeatedly: Using the test set to make modeling decisions (feature selection, hyperparameter tuning, threshold selection) rather than reserving it for a single final evaluation. This turns the test set into a validation set and produces overfit estimates.

  • Relying on accuracy for imbalanced data: Reporting accuracy on a dataset where 95% of samples belong to one class. A model that always predicts the majority class achieves 95% accuracy while being completely useless. Use F1, precision-recall, or AUC instead.

  • Manual feature transformation before fit: Applying log transforms, binning, or encoding in a preprocessing script and saving the result, then loading it in a separate training script. This divorces the transformation from the model, making it impossible to guarantee that the same preprocessing is applied at inference time.

  • Ignoring convergence warnings: Dismissing ConvergenceWarning from LogisticRegression or SVM as noise. These warnings indicate that the optimizer did not find a solution, meaning the model's coefficients are unreliable. Increase max_iter or rescale features.
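
One way to make ConvergenceWarning impossible to dismiss during development is to escalate it to an error; a minimal sketch:

import warnings
from sklearn.exceptions import ConvergenceWarning

# Fail loudly instead of letting warnings scroll past
warnings.simplefilter("error", ConvergenceWarning)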

Common Pitfalls

  • Data leakage: fitting transformers on the full dataset before splitting. Always transform inside a Pipeline.
  • Forgetting to scale features for scale-sensitive models (SVM, KNN, and regularized linear models such as logistic regression).
  • Using accuracy on imbalanced classes: a model that only predicts the majority class still scores high accuracy.
  • Over-tuning on test set: tune on validation/CV only; evaluate on test set once at the end.
  • Ignoring max_iter warnings: logistic regression and SVMs may not converge with default iterations. Increase max_iter.
