Feature Engineering for Machine Learning: From Raw Data to Predictive Power
A practitioner's guide to feature engineering — transforming raw data into features that improve model performance through encoding, scaling, creation, and selection techniques.
Feature engineering is the process of transforming raw data into features that better represent the underlying patterns for machine learning models. It’s often the difference between a mediocre model and a production-quality one. Raw data rarely arrives in a form that models can directly exploit.
Why Feature Engineering Matters
Models learn from features, not raw data. The quality of features determines the ceiling of model performance:
- Better features → simpler models → faster training → easier debugging
- Poor features → complex models → overfitting → mysterious failures
A well-engineered feature can be worth more than a fancier algorithm.
Numerical Features
Scaling
Most ML algorithms perform better when numerical features are on similar scales:
Standard Scaling (Z-score):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Results: mean=0, std=1
Use when: Features are approximately normally distributed.
Min-Max Scaling:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Results: values in [0, 1]
Use when: You need bounded values (e.g., neural networks).
Robust Scaling:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
# Uses median and IQR, robust to outliers
Use when: Data contains significant outliers.
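To make the outlier behavior concrete, here is a small comparison of the three scalers on a toy column with one extreme value (synthetic data, for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy column with one large outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std = StandardScaler().fit_transform(X)
mm = MinMaxScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# The outlier squeezes the standard- and min-max-scaled inliers together,
# while robust scaling (median/IQR) keeps the inliers spread out.
print(mm[:4].ravel())   # inliers crowd near 0
print(rob[:4].ravel())  # inliers remain distinguishable
```

Min-max maps the inliers to a tiny sliver near 0 because the outlier defines the range; robust scaling centers on the median (3.0), so the inliers stay well separated.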
Transformations
Log Transform: Compresses right-skewed distributions (income, prices, counts).
import numpy as np
df['log_income'] = np.log1p(df['income']) # log1p handles zeros
Power Transform (Box-Cox/Yeo-Johnson): Automatically finds the best transformation.
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X)
Binning: Converts continuous variables into categories.
import pandas as pd

df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 55, 100],
                         labels=['youth', 'young_adult', 'middle', 'senior'])
Categorical Features
Encoding Strategies
| Method | When to Use | Cardinality |
|---|---|---|
| One-Hot | Linear models, low cardinality | < 20 categories |
| Label/Ordinal | Ordered categories (low/med/high) | Any |
| Target Encoding | High cardinality, regression | > 20 categories |
| Frequency Encoding | When frequency matters | Any |
| Binary Encoding | High cardinality, memory-constrained | > 50 categories |
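Frequency encoding from the table above needs no external library; a minimal sketch with a toy `city` column (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF', 'NY', 'LA']})

# Frequency encoding: replace each category with its relative frequency
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)

print(df)
```

A city seen in half the rows encodes as 0.5; rare cities get small values, which can itself be a useful signal.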
Target Encoding (with regularization):
# Replace each category with the (smoothed) mean of the target variable.
# Encode out-of-fold in practice to prevent target leakage.
from category_encoders import TargetEncoder
encoder = TargetEncoder(smoothing=10)  # smoothing regularizes rare categories
X_encoded = encoder.fit_transform(X[['city']], y)
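Out-of-fold encoding can also be done by hand, which makes the leakage prevention explicit: each row is encoded using target means computed only from the other folds. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Each row's encoding uses target values from the *other* folds only
df = pd.DataFrame({
    'city': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'y':    [1,   0,   1,   0,   0,   1,   1,   0],
})
global_mean = df['y'].mean()
df['city_te'] = np.nan

for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Category means computed on the training folds only
    fold_means = df.iloc[train_idx].groupby('city')['y'].mean()
    df.loc[df.index[val_idx], 'city_te'] = (
        df['city'].iloc[val_idx].map(fold_means).fillna(global_mean).values
    )
```

Categories unseen in a training fold fall back to the global mean, which doubles as a sane default for unknown categories at serving time.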
Handling Unknown Categories
Production models will encounter categories not seen during training:
- Default category — Map unknown values to an “other” bucket
- Frequency threshold — Categories below N occurrences become “rare”
- Embedding lookup — Use nearest-neighbor in embedding space
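The frequency-threshold strategy is a one-liner in pandas; a sketch with a hypothetical `city` column where categories seen fewer than twice collapse into an "other" bucket:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY'] * 5 + ['LA'] * 3 + ['Boise', 'Tulsa']})

# Frequency threshold: categories seen fewer than N times become 'other'
counts = df['city'].value_counts()
rare = counts[counts < 2].index
df['city_clean'] = df['city'].where(~df['city'].isin(rare), 'other')
```

Applying the same mapping at inference time also absorbs brand-new categories, since anything absent from the kept set can be routed to 'other'.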
Temporal Features
Time-based data requires extracting meaningful signals:
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter
df['is_business_hour'] = df['hour'].between(9, 17).astype(int)
Cyclical Encoding
Hour 23 and hour 0 are close in time but far apart numerically. Use sine/cosine encoding:
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
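A quick check that the encoding actually fixes the wrap-around problem: in the encoded (sin, cos) space, hour 23 sits right next to hour 0, while hour 12 is maximally far away.

```python
import numpy as np

def hour_to_xy(h):
    # Map an hour onto the unit circle
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Raw difference: |23 - 0| = 23. Encoded distance: small.
d_23_0 = np.linalg.norm(hour_to_xy(23) - hour_to_xy(0))   # ~0.26
d_12_0 = np.linalg.norm(hour_to_xy(12) - hour_to_xy(0))   # 2.0 (opposite side)
```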
Lag Features
For time series forecasting:
df['sales_lag_1'] = df['sales'].shift(1)
df['sales_lag_7'] = df['sales'].shift(7)
df['sales_rolling_7'] = df['sales'].rolling(7).mean()
df['sales_rolling_30'] = df['sales'].rolling(30).mean()
# shift() and rolling() leave NaN in the earliest rows; drop or impute them before training
Text Features
Basic Extraction
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['capital_ratio'] = df['text'].str.count(r'[A-Z]') / df['text_length'].clip(lower=1)  # avoid divide-by-zero on empty strings
df['has_url'] = df['text'].str.contains(r'https?://').astype(int)
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
text_features = vectorizer.fit_transform(df['text'])
Embeddings
For modern NLP, use pre-trained embeddings:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['text'].tolist())
Feature Selection
Not all features help. Some add noise, cause overfitting, or slow down training.
Methods
Filter Methods — Statistical tests independent of the model:
- Correlation analysis (drop features correlated > 0.95 with each other)
- Mutual information
- Chi-squared test (for categorical targets)
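Mutual information fits naturally into scikit-learn's `SelectKBest`; a minimal sketch on synthetic classification data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

# Keep the k features with the highest mutual information with the target
selector = SelectKBest(mutual_info_classif, k=3)
X_top = selector.fit_transform(X, y)
print(X_top.shape)  # (300, 3)
```

Unlike correlation, mutual information also picks up non-linear dependence, at the cost of being slower to estimate.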
Wrapper Methods — Use model performance to select features:
- Forward selection (add features one at a time)
- Backward elimination (remove features one at a time)
- Recursive Feature Elimination (RFE)
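RFE is available directly in scikit-learn; a sketch with a logistic regression as the wrapped estimator on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

# Recursively drop the weakest feature until 4 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of kept features
```

Wrapper methods cost one model fit per elimination step, so they scale poorly past a few hundred features.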
Embedded Methods — Feature selection built into the model:
- L1 regularization (Lasso) — drives coefficients to zero
- Tree-based importance (Random Forest, XGBoost)
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
selector = SelectFromModel(RandomForestClassifier(n_estimators=100))
X_selected = selector.fit_transform(X, y)
print(f"Selected {X_selected.shape[1]} of {X.shape[1]} features")
Anti-Patterns
Data Leakage
Using information from the future or the target variable to create features. This inflates metrics during training but fails completely in production.
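One subtle and very common form of leakage is fitting a transformer on the full dataset before splitting, so test-set statistics leak into training. A sketch of the leaky and correct versions (random data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))

# LEAKY: the scaler sees test-set statistics before the split
X_leaky = StandardScaler().fit_transform(X)

# CORRECT: fit the scaler on the training data only, then apply it to test
X_train, X_test = train_test_split(X, random_state=0)
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
```

Wrapping the scaler and model in a scikit-learn Pipeline enforces this ordering automatically, including inside cross-validation.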
Feature Explosion
Creating thousands of features through aggressive interaction terms or polynomial expansion. More features ≠ better models.
Ignoring Feature Distribution Shifts
Features that work today may drift tomorrow. Monitor feature distributions in production and retrain when drift exceeds thresholds.
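One simple drift check is a two-sample Kolmogorov-Smirnov test comparing a feature's training-time distribution against recent production values. A sketch with synthetic data (the 0.5-sigma shift and the 0.01 threshold are illustrative choices, not a standard):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 5000)    # distribution at training time
live_feature = rng.normal(0.5, 1, 5000)   # shifted production distribution

# Two-sample KS test: a small p-value means the distributions differ
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic {stat:.3f}); consider retraining")
```

With large sample sizes the p-value becomes significant for tiny shifts, so in practice teams often alert on the KS statistic (or PSI) itself rather than the p-value.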
Manual Feature Engineering Without Baselines
Always establish a baseline with raw features before investing in feature engineering. Sometimes the raw data is good enough.
Not Storing Feature Logic
Feature transformations must be reproducible. Use feature stores or pipeline frameworks (scikit-learn Pipelines, Feast, Tecton) to version and serve features consistently.
The best features encode domain knowledge in a form that algorithms can exploit. Spend 80% of your time on understanding the data and 20% on modeling — not the other way around.