# Scikit Learn

Expert guidance on scikit-learn for building, evaluating, and deploying machine learning models in Python.
You are an expert in scikit-learn for data analysis and science.
## Overview

scikit-learn is Python's most widely used library for classical machine learning. It provides a consistent API across dozens of algorithms for classification, regression, clustering, dimensionality reduction, and model selection. Its fit / predict / transform interface, combined with Pipeline and cross-validation utilities, makes it the standard toolkit for tabular ML.
## Core Concepts

### The Estimator API

Every scikit-learn object follows the same pattern:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
```
### Train/Test Split and Cross-Validation

```python
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```
### Pipelines

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
```
### Column Transformers

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
])
full_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
```
### Hyperparameter Tuning

```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    "clf__n_estimators": [100, 200, 500],
    "clf__max_depth": [5, 10, 20, None],
    "clf__min_samples_split": [2, 5, 10],
}
search = GridSearchCV(full_pipe, param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)
```
## Implementation Patterns

### Classification Evaluation

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# AUC for binary classification
y_prob = model.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")
```
### Regression Evaluation

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
```
### Feature Importance

```python
import pandas as pd

importances = model.feature_importances_
feature_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(feature_imp.head(10))
```
### Model Persistence

```python
import joblib

joblib.dump(full_pipe, "model.joblib")
loaded_pipe = joblib.load("model.joblib")
```
## Best Practices

- **Always use Pipelines** to prevent data leakage. Fitting a scaler on the full dataset before splitting leaks test information.
- **Stratify splits** when classes are imbalanced (`stratify=y`).
- **Use `cross_val_score`** instead of a single train/test split for more reliable estimates.
- **Set `random_state`** on models and splits for reproducibility.
- **Choose the right metric**: accuracy is misleading for imbalanced data. Prefer F1, AUC, or precision/recall as appropriate.
- **Use `n_jobs=-1`** to parallelize grid search and ensemble training.
## Core Philosophy
scikit-learn's greatest contribution is not any single algorithm but the consistent API contract: every estimator follows fit, predict, transform. This uniformity means that once you understand how to use one model, you understand the interface for all of them. More importantly, it means that pipelines, cross-validation, and grid search work with any estimator without special-casing. Investing time in mastering Pipeline and ColumnTransformer pays dividends across every project.
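The interchangeability described above can be sketched in a few lines. This is an illustrative example on synthetic data (the dataset and candidate models are arbitrary choices, not from the original text):

```python
# Because every estimator shares fit/predict, different models drop into
# the same pipeline and evaluation loop without any special-casing.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic dataset, purely for illustration
X, y = make_classification(n_samples=300, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "knn": KNeighborsClassifier(),
}

for name, clf in candidates.items():
    # The same Pipeline + cross_val_score code works for every estimator
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```

The loop body never inspects which model it holds; that is the API contract at work.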
The Pipeline is not optional -- it is the primary defense against data leakage. When preprocessing steps (scaling, encoding, imputation) are performed outside the pipeline, it is trivially easy to accidentally fit them on the full dataset before splitting. Inside a pipeline, cross_val_score and GridSearchCV automatically fit transformers only on each training fold. Treating the Pipeline as the unit of model development, not just the estimator, is the single most important scikit-learn practice.
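A minimal sketch of the mechanics on synthetic data (the dataset is invented for illustration; for `StandardScaler` the score gap is often small, but the per-fold refitting is the point, and the gap can be large for supervised transformers such as feature selection):

```python
# Leaky vs. correct preprocessing under cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Leaky: the scaler sees ALL rows, including future validation folds
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(), X_leaky, y, cv=5)

# Correct: inside a Pipeline, the scaler is refit on each training fold only
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
honest_scores = cross_val_score(pipe, X, y, cv=5)

print(f"leaky:  {leaky_scores.mean():.3f}")
print(f"honest: {honest_scores.mean():.3f}")
```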
Choose the right metric before choosing the right model. Accuracy is the default but is actively misleading for imbalanced classes, cost-sensitive decisions, or ranking tasks. Defining the evaluation metric upfront -- F1, AUC, precision at a given recall threshold, or a custom business metric -- focuses the entire modeling process on the outcome that actually matters. scikit-learn's scoring parameter makes this easy; use it from day one.
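As a sketch of using the `scoring` parameter from day one (synthetic imbalanced data; the metric names are standard scikit-learn scorer strings):

```python
# Declaring the metric up front via `scoring`, including a custom scorer.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic dataset: ~90% negative, ~10% positive
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# Built-in scorer names work anywhere a `scoring` argument is accepted
for metric in ["accuracy", "f1", "roc_auc"]:
    scores = cross_val_score(model, X, y, cv=5, scoring=metric)
    print(f"{metric}: {scores.mean():.3f}")

# Custom business metric: F2 weights recall higher than precision
f2_scorer = make_scorer(fbeta_score, beta=2)
f2_scores = cross_val_score(model, X, y, cv=5, scoring=f2_scorer)
print(f"f2: {f2_scores.mean():.3f}")
```

The same `scoring` argument works in `GridSearchCV`, so tuning optimizes the metric you chose rather than the default.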
## Anti-Patterns

- **Preprocessing outside the pipeline**: Fitting a `StandardScaler` or `OneHotEncoder` on the full dataset, then splitting into train and test. This leaks test-set statistics into training and inflates evaluation metrics. Always place transformers inside a `Pipeline`.
- **Evaluating on the test set repeatedly**: Using the test set to make modeling decisions (feature selection, hyperparameter tuning, threshold selection) rather than reserving it for a single final evaluation. This turns the test set into a validation set and produces overfit estimates.
- **Relying on accuracy for imbalanced data**: Reporting accuracy on a dataset where 95% of samples belong to one class. A model that always predicts the majority class achieves 95% accuracy while being completely useless. Use F1, precision-recall, or AUC instead.
- **Manual feature transformation before fit**: Applying log transforms, binning, or encoding in a preprocessing script and saving the result, then loading it in a separate training script. This divorces the transformation from the model, making it impossible to guarantee that the same preprocessing is applied at inference time.
- **Ignoring convergence warnings**: Dismissing `ConvergenceWarning` from `LogisticRegression` or `SVM` as noise. These warnings indicate that the optimizer did not find a solution, meaning the model's coefficients are unreliable. Increase `max_iter` or rescale features.
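One way to make the last anti-pattern impossible to commit is to promote `ConvergenceWarning` to an error during development. A sketch on synthetic data (the badly scaled features are contrived for illustration):

```python
# Turn silent convergence failures into hard errors, then fix them
# with scaling and a higher max_iter.
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X = X * 1000  # badly scaled features slow the optimizer down

with warnings.catch_warnings():
    # Any ConvergenceWarning now raises instead of scrolling past
    warnings.simplefilter("error", ConvergenceWarning)
    # Scaling plus a generous max_iter lets the solver converge cleanly
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=5000)),
    ])
    pipe.fit(X, y)

print(f"train accuracy: {pipe.score(X, y):.3f}")
```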
## Common Pitfalls

- **Data leakage**: fitting transformers on the full dataset before splitting. Always transform inside a Pipeline.
- **Forgetting to scale features** for scale-sensitive models (SVM, KNN, logistic regression).
- **Using accuracy on imbalanced classes**: a model predicting the majority class always gets high accuracy.
- **Over-tuning on the test set**: tune on validation/CV only; evaluate on the test set once at the end.
- **Ignoring `max_iter` warnings**: logistic regression and SVMs may not converge with default iterations. Increase `max_iter`.
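The imbalanced-accuracy pitfall above is easy to demonstrate with a majority-class baseline. A sketch on synthetic data (95/5 class split, invented for illustration):

```python
# A baseline that always predicts the majority class scores ~95%
# accuracy on a 95/5 dataset while its minority-class F1 is zero.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")  # high, but useless
print(f"f1:       {f1_score(y_test, y_pred, zero_division=0):.3f}")  # 0.000
```

Running a `DummyClassifier` baseline first puts any real model's metrics in context.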
## Related Skills

- **Data Cleaning**: Expert guidance on data cleaning and preprocessing techniques for preparing raw data for analysis and modeling.
- **Feature Engineering**: Expert guidance on feature engineering patterns for transforming raw data into predictive ML features.
- **Jupyter**: Expert guidance on Jupyter notebooks for interactive data exploration, documentation, and reproducible analysis.
- **Matplotlib**: Expert guidance on Matplotlib for creating static, animated, and interactive visualizations in Python.
- **NumPy**: Expert guidance on NumPy for numerical computing, array operations, and linear algebra in Python.
- **Pandas**: Expert guidance on Pandas for tabular data manipulation, transformation, and analysis in Python.