
Feature Engineering

Expert guidance on feature engineering patterns for transforming raw data into predictive ML features.

Quick Summary
You are an expert in feature engineering for data analysis and science.

## Key Points

- **Engineer features inside Pipelines** to prevent data leakage. Never compute target-based features before splitting.
- **Start simple**: ratios, counts, and aggregations often beat complex engineered features.
- **Domain knowledge matters**: a domain-informed feature (e.g., debt-to-income ratio in finance) is worth more than a hundred automated ones.
- **Check for leakage**: any feature that encodes the target directly or indirectly will inflate metrics.
- **Use cyclical encoding** for periodic features (hour, day of week, month) so that December and January are neighbors.
- **Log-transform** heavily skewed features before feeding to linear models.
- **Target leakage**: using future information or features derived from the label. Always ask, "Would I have this feature at prediction time?"
- **High-cardinality one-hot encoding**: one-hot encoding a column with 10,000 categories creates 10,000 sparse columns. Use target encoding or embeddings instead.
- **Ignoring missing value patterns**: the fact that a value is missing is itself informative. Add a binary `is_missing` indicator.
- **Scaling after split**: fitting a scaler on the full dataset leaks test statistics into training.
- **Over-engineering**: too many features increase overfitting risk. Use feature importance or selection to prune.

Feature Engineering — Data Science

You are an expert in feature engineering for data analysis and science.

Overview

Feature engineering is the process of transforming raw data into informative inputs for machine learning models. It is often the single highest-leverage activity in an ML project — good features can make a simple model outperform a complex one trained on raw data. This skill covers numeric, categorical, temporal, and text feature patterns.

Core Concepts

Numeric Transformations

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer

# Scaling (in a real workflow, fit on the training split only)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[["income", "age"]])

# Log transform for right-skewed distributions
df["log_income"] = np.log1p(df["income"])

# Power transform (Box-Cox / Yeo-Johnson) for normalization
# (Box-Cox requires strictly positive values; Yeo-Johnson does not)
pt = PowerTransformer(method="yeo-johnson")
df[["income_transformed"]] = pt.fit_transform(df[["income"]])

# Binning
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 55, 100], labels=["youth", "young_adult", "middle", "senior"])

Categorical Encoding

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, TargetEncoder

# One-hot (for low cardinality)
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded = ohe.fit_transform(df[["city"]])

# Ordinal (for ordered categories)
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["risk_encoded"] = oe.fit_transform(df[["risk_level"]]).ravel()

# Target encoding (for high cardinality; requires scikit-learn >= 1.3)
# fit_transform cross-fits internally to limit target leakage
te = TargetEncoder(smooth="auto")
df["city_target_enc"] = te.fit_transform(df[["city"]], y).ravel()

Temporal Features

df["date"] = pd.to_datetime(df["date"])

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["quarter"] = df["date"].dt.quarter
df["days_since_start"] = (df["date"] - df["date"].min()).dt.days

# Cyclical encoding for month/hour
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)

Interaction Features

# Arithmetic combinations
df["income_per_member"] = df["household_income"] / df["household_size"]
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)
df["price_volume"] = df["price"] * df["volume"]

# Polynomial interactions
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X[["feature_a", "feature_b", "feature_c"]])

Implementation Patterns

Aggregation Features (Entity-Level)

# Customer-level features from transaction data
customer_features = transactions.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    n_transactions=("amount", "count"),
    days_since_last=("date", lambda x: (pd.Timestamp.now() - x.max()).days),  # prefer a fixed snapshot date over now() for reproducibility
    n_unique_merchants=("merchant", "nunique"),
).reset_index()

Lag and Rolling Features (Time Series)

df = df.sort_values(["store_id", "date"])

# Lag features
df["sales_lag_1"] = df.groupby("store_id")["sales"].shift(1)
df["sales_lag_7"] = df.groupby("store_id")["sales"].shift(7)

# Rolling statistics
df["sales_roll_7_mean"] = df.groupby("store_id")["sales"].transform(
    lambda x: x.rolling(7, min_periods=1).mean()
)
df["sales_roll_7_std"] = df.groupby("store_id")["sales"].transform(
    lambda x: x.rolling(7, min_periods=1).std()
)

Text Features

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF
tfidf = TfidfVectorizer(max_features=500, stop_words="english", ngram_range=(1, 2))
X_text = tfidf.fit_transform(df["description"])

# Simple text statistics
df["text_len"] = df["description"].str.len()
df["word_count"] = df["description"].str.split().str.len()
df["has_question"] = df["description"].str.contains(r"\?", na=False).astype(int)

Feature Selection

from sklearn.feature_selection import mutual_info_classif, SelectKBest

# Mutual information
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_ranking = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)

# SelectKBest inside a pipeline, so selection is fit on training folds only
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
pipe = make_pipeline(SelectKBest(mutual_info_classif, k=20), LogisticRegression(max_iter=1000))

Best Practices

  • Engineer features inside Pipelines to prevent data leakage. Never compute target-based features before splitting.
  • Start simple: ratios, counts, and aggregations often beat complex engineered features.
  • Domain knowledge matters: a domain-informed feature (e.g., debt-to-income ratio in finance) is worth more than a hundred automated ones.
  • Check for leakage: any feature that encodes the target directly or indirectly will inflate metrics.
  • Use cyclical encoding for periodic features (hour, day of week, month) so that December and January are neighbors.
  • Log-transform heavily skewed features before feeding to linear models.
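
The first bullet can be sketched as follows (a minimal example; the column names and toy data are hypothetical). A ColumnTransformer inside a Pipeline means scaler statistics and encoder categories are learned from the training fold only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(10, 1, 200),
    "age": rng.integers(18, 80, 200),
    "city": rng.choice(["a", "b", "c"], 200),
})
y = rng.integers(0, 2, 200)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Split FIRST, then fit the whole pipeline on the training fold only:
# the test rows never influence scaling or encoding.
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=42)
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```

At prediction time the same fitted pipeline is applied end to end, so training and serving cannot drift apart.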

Core Philosophy

Feature engineering is where domain knowledge meets data science. The best features are not discovered by brute-force combinatorial search; they come from understanding the problem deeply enough to encode human intuition as computable signals. A single well-conceived feature, like a debt-to-income ratio in credit modeling, can outperform hundreds of automated interactions because it captures a causal relationship rather than a statistical accident.

The discipline of feature engineering demands constant vigilance against leakage. Every feature must answer the question: "Would I have this information at prediction time?" If the answer is no, the feature is cheating. This applies not only to obvious target leakage but also to subtler forms like using future timestamps, post-hoc labels, or statistics computed on data that includes the test set. Features must be engineered inside pipelines, not before them.

Start simple and measure. Ratios, counts, and aggregations are often sufficient. Complexity should be added incrementally and justified by measurable improvement on a held-out set. Over-engineering features creates overfitting risk, increases maintenance burden, and makes the model harder to interpret. The goal is signal, not noise disguised as sophistication.
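
To make "justified by measurable improvement" concrete, here is one hedged sketch on synthetic data: the same linear model is cross-validated with and without a domain-style ratio feature (BMI), and only the measured scores decide whether the feature earns its place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
height = rng.normal(1.70, 0.10, n)   # metres
weight = rng.normal(70.0, 12.0, n)   # kilograms
bmi = weight / height**2
y = (bmi > 25).astype(int)           # synthetic label driven by the ratio

X_raw = np.column_stack([height, weight])       # raw inputs only
X_eng = np.column_stack([height, weight, bmi])  # plus the engineered ratio

model = LogisticRegression(max_iter=1000)
raw_score = cross_val_score(model, X_raw, y, cv=5).mean()
eng_score = cross_val_score(model, X_eng, y, cv=5).mean()
```

On this synthetic setup the engineered column typically scores at least as well as the raw pair; in a real project, keep the feature only if the held-out score actually improves.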

Anti-Patterns

  • Feature leakage through temporal data: Using information from the future to predict the past, such as computing rolling averages that include the prediction date or using labels that were assigned after the event being predicted.
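
A small illustrative contrast (toy numbers): a rolling mean whose window includes the current row leaks the very value being predicted, while shifting first keeps the window strictly in the past.

```python
import numpy as np
import pandas as pd

sales = pd.Series([10, 12, 9, 14, 20, 18, 25], name="sales")

# LEAKY: the 3-day window includes the current day, so the
# feature already contains the value we are trying to predict.
leaky = sales.rolling(3, min_periods=1).mean()

# SAFE: shift(1) first, so the window ends the day BEFORE prediction.
safe = sales.shift(1).rolling(3, min_periods=1).mean()
```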

  • Exploding cardinality with one-hot encoding: Blindly one-hot encoding a high-cardinality column (e.g., zip codes, user IDs) creates thousands of sparse features that slow training, increase memory usage, and degrade model performance. Use target encoding, hashing, or embeddings instead.

  • Fitting transformers on full data before splitting: Computing scaling parameters, encoding mappings, or binning thresholds on the entire dataset rather than fitting only on the training fold. This is a subtle but pervasive form of data leakage.
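
For instance (a minimal sketch with random data), the fix is to fit the scaler on the training rows only and then apply it to both folds:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(50, 10, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# WRONG: statistics computed over all rows leak test-set info into training.
leaky_scaler = StandardScaler().fit(X)

# RIGHT: fit on the training fold only, then transform both folds.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```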

  • Feature engineering outside the pipeline: Computing features in a standalone script and saving them to a file, then loading them in a separate training script. This breaks reproducibility and makes it easy for the training and serving pipelines to drift apart.

  • Accumulating features without pruning: Adding feature after feature without periodically measuring importance and removing the ones that contribute nothing. Large feature sets increase overfitting, slow inference, and make the model opaque.

Common Pitfalls

  • Target leakage: using future information or features derived from the label. Always ask, "Would I have this feature at prediction time?"
  • High-cardinality one-hot encoding: one-hot encoding a column with 10,000 categories creates 10,000 sparse columns. Use target encoding or embeddings instead.
  • Ignoring missing value patterns: the fact that a value is missing is itself informative. Add a binary is_missing indicator.
  • Scaling after split: fitting a scaler on the full dataset leaks test statistics into training.
  • Over-engineering: too many features increase overfitting risk. Use feature importance or selection to prune.
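
The missing-value point above can be sketched in two lines (column name hypothetical): record missingness before imputing, so the signal survives the fill.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52_000, np.nan, 61_000, np.nan, 48_000]})

# Record missingness BEFORE imputing, then fill with the median.
df["income_is_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```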
