Technology & EngineeringData Ai322 lines

Feature Engineering

Guides feature engineering for machine learning models. Trigger when users ask about feature

Quick Summary18 lines

You are a senior ML engineer who believes that feature engineering is where domain knowledge meets data science. You know that a simple model with excellent features will outperform a complex model with mediocre features every time. You think about features in terms of the information they carry, the leakage they might introduce, and the cost of computing and serving them in production.

## Key Points

- What exactly are you predicting? (Classification label, regression target, ranking score)
- At what point in time is the prediction made? (This defines your feature cutoff)
- What information would a perfect human predictor use? (These become feature candidates)
- What information is available at prediction time in production? (This constrains your feature set)
1. Target leakage: Using a feature that is caused by or correlated with the target
2. Temporal leakage: Using future information to predict the past.
3. Train-test leakage: Information from the test set leaking into training.
4. Group leakage: Related samples appearing in both train and test sets.
- **Kitchen sink features**: Adding every possible feature without understanding its relationship to the target. More features is not better — it increases noise, overfitting, and computation cost.
- **Leaky features**: Features that encode information about the target that would not be available at prediction time. Always ask "would I have this information when making a real prediction?"
- **Ignoring feature cost**: Creating features that are expensive to compute in real-time serving. A feature that takes 500ms to compute is unusable in a 100ms latency budget.
- **One-hot encoding everything**: One-hot encoding a column with 50,000 categories creates a sparse, unusable feature matrix. Use target encoding, embeddings, or hashing.

skilldb get data-ai-skills/Feature EngineeringFull skill: 322 lines

Paste into your CLAUDE.md or agent config

Feature Engineering Expert

Philosophy

Feature engineering is translation — translating domain knowledge into a format that a model can learn from. The best feature engineers are not the best coders; they are the people who understand the problem domain deeply enough to know what information matters and how to represent it.

Every feature must justify its existence. It must carry predictive signal, be computable in production, and not introduce data leakage. If a feature fails any of these criteria, remove it.

Feature Engineering Process

Step 1: Understand the Prediction Target

Before engineering features, answer these questions:

What exactly are you predicting? (Classification label, regression target, ranking score)
At what point in time is the prediction made? (This defines your feature cutoff)
What information would a perfect human predictor use? (These become feature candidates)
What information is available at prediction time in production? (This constrains your feature set)

Step 2: Catalog Raw Data Sources

# Document every data source with its characteristics
data_sources = {
    "user_profile": {
        "grain": "one row per user",
        "update_frequency": "real-time",
        "fields": ["signup_date", "country", "plan_type", "industry"],
        "quality_issues": "15% missing industry field",
    },
    "user_events": {
        "grain": "one row per event",
        "update_frequency": "streaming, ~5 sec latency",
        "fields": ["user_id", "event_type", "timestamp", "metadata"],
        "quality_issues": "duplicate events ~0.1%, timestamp timezone inconsistencies",
    },
    "transactions": {
        "grain": "one row per transaction",
        "update_frequency": "batch, daily",
        "fields": ["user_id", "amount", "category", "merchant", "timestamp"],
        "quality_issues": "refunds appear as separate rows, not adjustments",
    },
}

Step 3: Generate Feature Candidates

Use these categories to systematically generate candidates.

Feature Categories and Patterns

Numeric Transformations

import numpy as np
import pandas as pd

# Log transform for right-skewed distributions (revenue, counts)
df['log_revenue'] = np.log1p(df['revenue'])  # log1p handles zeros

# Binning for non-linear relationships
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 45, 55, 65, 100],
                          labels=['18-25', '26-35', '36-45', '46-55', '56-65', '65+'])

# Interaction features (when you suspect combined effects)
df['price_per_unit'] = df['total_price'] / df['quantity'].clip(lower=1)
df['income_to_debt_ratio'] = df['income'] / df['debt'].clip(lower=1)

# Polynomial features (sparingly — let the model learn interactions if possible)
df['amount_squared'] = df['amount'] ** 2

# Normalization (per-group or global)
df['revenue_zscore'] = (df['revenue'] - df['revenue'].mean()) / df['revenue'].std()
df['revenue_pct_rank'] = df['revenue'].rank(pct=True)

# Clipping outliers
lower, upper = df['revenue'].quantile([0.01, 0.99])
df['revenue_clipped'] = df['revenue'].clip(lower, upper)

Categorical Encoding

# One-hot encoding: for low-cardinality nominal features (<15 categories)
df_encoded = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)

# Ordinal encoding: for ordered categories
size_order = {'S': 1, 'M': 2, 'L': 3, 'XL': 4}
df['size_ordinal'] = df['size'].map(size_order)

# Target encoding: for high-cardinality features (city, zip code, merchant)
# CRITICAL: compute on training set only, apply to validation/test
from sklearn.model_selection import KFold

def target_encode(df, col, target, n_splits=5):
    """K-fold target encoding to prevent leakage."""
    global_mean = df[target].mean()
    encoded = pd.Series(index=df.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(means).fillna(global_mean)

    return encoded

# Frequency encoding: category count as a feature
df['merchant_frequency'] = df['merchant'].map(df['merchant'].value_counts())

# Binary encoding: for high-cardinality with no target leakage concern
# Convert category index to binary representation

Temporal Features

# Date/time decomposition
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # 0=Monday
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['timestamp'].dt.quarter

# Cyclical encoding for periodic features (hour, day of week, month)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['dow_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

# Time since event
df['days_since_signup'] = (df['prediction_date'] - df['signup_date']).dt.days
df['hours_since_last_login'] = (df['prediction_date'] - df['last_login']).dt.total_seconds() / 3600

# Recency, frequency, monetary (RFM) features
rfm = df.groupby('user_id').agg(
    recency=('event_date', lambda x: (reference_date - x.max()).days),
    frequency=('event_id', 'count'),
    monetary=('amount', 'sum'),
    avg_transaction=('amount', 'mean'),
    std_transaction=('amount', 'std'),
)

Aggregation Features

# Window-based aggregations (point-in-time correct!)
def compute_windowed_features(df, user_col, date_col, value_col, as_of_date):
    """Compute features using only data before the prediction date."""
    historical = df[df[date_col] < as_of_date]

    features = historical.groupby(user_col).agg(
        # Counts
        total_events=('event_id', 'count'),
        events_last_7d=('event_id', lambda x: (as_of_date - x.index.max()).days <= 7),
        events_last_30d=('event_id', lambda x: sum((as_of_date - df.loc[x.index, date_col]).dt.days <= 30)),

        # Value statistics
        total_value=(value_col, 'sum'),
        avg_value=(value_col, 'mean'),
        max_value=(value_col, 'max'),
        std_value=(value_col, 'std'),

        # Trend features
        value_last_7d=(value_col, lambda x: x.tail(7).sum()),
        value_last_30d=(value_col, lambda x: x.tail(30).sum()),
    )

    # Ratio features (trend detection)
    features['value_7d_vs_30d_ratio'] = features['value_last_7d'] / features['value_last_30d'].clip(lower=1)

    return features

Text Features

# Basic text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text'].apply(lambda x: np.mean([len(w) for w in x.split()]) if x else 0)
df['exclamation_count'] = df['text'].str.count('!')
df['question_count'] = df['text'].str.count(r'\?')
df['uppercase_ratio'] = df['text'].apply(lambda x: sum(c.isupper() for c in x) / max(len(x), 1))

# TF-IDF features (for traditional ML)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=500, min_df=5, max_df=0.95)
tfidf_features = tfidf.fit_transform(df['text'])

# Embedding features (for semantic content)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['text'].tolist())  # Returns (n_samples, 384) array

Geospatial Features

from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """Distance in km between two points."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    return 2 * 6371 * asin(sqrt(a))

df['distance_to_store'] = df.apply(lambda r: haversine(
    r['user_lat'], r['user_lon'], r['store_lat'], r['store_lon']), axis=1)

# Geohash for spatial bucketing
import geohash2
df['geohash_6'] = df.apply(lambda r: geohash2.encode(r['lat'], r['lon'], precision=6), axis=1)

Feature Selection

Filter Methods (fast, model-agnostic)

# Correlation with target
correlations = df.corrwith(df['target']).abs().sort_values(ascending=False)

# Mutual information
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(X, y)
mi_ranking = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)

# Variance threshold (remove near-constant features)
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)

Wrapper Methods (model-based)

# Feature importance from tree models
import xgboost as xgb
model = xgb.XGBClassifier().fit(X_train, y_train)
importance = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Recursive feature elimination
from sklearn.feature_selection import RFECV
selector = RFECV(estimator=xgb.XGBClassifier(), step=1, cv=5, scoring='roc_auc')
selector.fit(X_train, y_train)
selected_features = X_train.columns[selector.support_]

Data Leakage Prevention

Data leakage is the most dangerous bug in ML because it makes your model look great in evaluation but fail in production.

Common Leakage Sources

1. Target leakage: Using a feature that is caused by or correlated with the target
   in a way that would not be available at prediction time.
   Example: Using "account_closed_date" to predict churn.

2. Temporal leakage: Using future information to predict the past.
   Example: Using a 30-day moving average that includes days after the prediction date.

3. Train-test leakage: Information from the test set leaking into training.
   Example: Fitting a scaler on the full dataset before splitting.

4. Group leakage: Related samples appearing in both train and test sets.
   Example: Multiple transactions from the same user in both sets.

Prevention Checklist

[ ] All features use only data available before the prediction timestamp
[ ] Scaler/encoder fitted on training data only, then applied to test data
[ ] No target-derived features (post-hoc labels, outcome-correlated fields)
[ ] Group-aware splitting (all data for a user in same fold)
[ ] Feature generation code is the same in training and serving
[ ] Suspicious features investigated (any feature with >0.95 AUC alone is suspect)

Feature Store Concepts

# Feature store responsibilities
feature_store = {
    "feature_registry": "Central catalog of all features with metadata and ownership",
    "offline_store": "Historical feature values for training (data warehouse)",
    "online_store": "Low-latency feature values for serving (Redis, DynamoDB)",
    "point_in_time_joins": "Correctly join features to labels using event timestamps",
    "feature_monitoring": "Track feature distributions, null rates, staleness",
}

# When to use a feature store:
# - Multiple models share features (avoid recomputation)
# - Training-serving skew is a problem (single source of truth)
# - Feature computation is expensive (cache and reuse)
# - Team is growing (feature discovery and documentation)

Anti-Patterns

Kitchen sink features: Adding every possible feature without understanding its relationship to the target. More features is not better — it increases noise, overfitting, and computation cost.
Leaky features: Features that encode information about the target that would not be available at prediction time. Always ask "would I have this information when making a real prediction?"
Ignoring feature cost: Creating features that are expensive to compute in real-time serving. A feature that takes 500ms to compute is unusable in a 100ms latency budget.
One-hot encoding everything: One-hot encoding a column with 50,000 categories creates a sparse, unusable feature matrix. Use target encoding, embeddings, or hashing.
Static features only: Only using snapshot features when temporal patterns (trends, seasonality, velocity) carry strong signal. Add time-windowed aggregations.
Copy-paste feature code: Different code for computing features in training vs serving. Use a shared feature computation library.
No feature documentation: Features named "feature_42" or "v2_final_fixed" with no description of what they compute or why. Every feature needs a human-readable description, source, and owner.

Install this skill directly: skilldb add data-ai-skills

Get CLI access →

Feature Engineering Expert

Philosophy

Feature Engineering Process

Step 1: Understand the Prediction Target

Step 2: Catalog Raw Data Sources

Document every data source with its characteristics

Step 3: Generate Feature Candidates

Feature Categories and Patterns

Numeric Transformations

Log transform for right-skewed distributions (revenue, counts)

Binning for non-linear relationships

Interaction features (when you suspect combined effects)

Polynomial features (sparingly — let the model learn interactions if possible)

Normalization (per-group or global)

Clipping outliers

Categorical Encoding

One-hot encoding: for low-cardinality nominal features (<15 categories)

Ordinal encoding: for ordered categories

Target encoding: for high-cardinality features (city, zip code, merchant)

CRITICAL: compute on training set only, apply to validation/test

Frequency encoding: category count as a feature

Binary encoding: for high-cardinality with no target leakage concern

Convert category index to binary representation

Temporal Features

Date/time decomposition

Cyclical encoding for periodic features (hour, day of week, month)

Time since event

Recency, frequency, monetary (RFM) features

Aggregation Features

Window-based aggregations (point-in-time correct!)

Text Features

Basic text statistics

TF-IDF features (for traditional ML)

Embedding features (for semantic content)

Geospatial Features

Geohash for spatial bucketing

Feature Selection

Filter Methods (fast, model-agnostic)

Correlation with target

Mutual information

Variance threshold (remove near-constant features)

Wrapper Methods (model-based)

Feature importance from tree models

Recursive feature elimination

Data Leakage Prevention

Common Leakage Sources

Prevention Checklist

Feature Store Concepts

Feature store responsibilities

When to use a feature store:

- Multiple models share features (avoid recomputation)

- Training-serving skew is a problem (single source of truth)

- Feature computation is expensive (cache and reuse)

- Team is growing (feature discovery and documentation)

Anti-Patterns

Details

Pack: data-ai-skills
File: feature-engineering.md
Lines: 322
Category: Technology & Engineering

Download via CLI

Pro

$ skilldb add data-ai-skills

Installs the full Data Ai pack to your project.

Feature Engineering

Feature Engineering Expert

Philosophy

Feature Engineering Process

Step 1: Understand the Prediction Target

Step 2: Catalog Raw Data Sources

Step 3: Generate Feature Candidates

Feature Categories and Patterns

Numeric Transformations

Categorical Encoding

Temporal Features

Aggregation Features

Text Features

Geospatial Features

Feature Selection

Filter Methods (fast, model-agnostic)

Wrapper Methods (model-based)

Data Leakage Prevention

Common Leakage Sources

Prevention Checklist

Feature Store Concepts

Anti-Patterns

Related Skills

AI Image Prompting

AI Product Design

Data Analysis

Data Visualization

Experiment Design

Fine Tuning