Feature Engineering for Machine Learning: From Raw Data to Predictive Power
A practitioner's guide to feature engineering — transforming raw data into features that improve model performance through encoding, scaling, creation, and selection techniques.
Feature engineering is the process of transforming raw data into features that better represent the underlying patterns for machine learning models. It’s often the difference between a mediocre model and a production-quality one. Raw data rarely arrives in a form that models can directly exploit.
Why Feature Engineering Matters
Models learn from features, not raw data. The quality of features determines the ceiling of model performance:
- Better features → simpler models → faster training → easier debugging
- Poor features → complex models → overfitting → mysterious failures
A well-engineered feature can be worth more than a fancier algorithm.
Numerical Features
Scaling
Most ML algorithms perform better when numerical features are on similar scales:
Standard Scaling (Z-score):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Results: mean=0, std=1
Use when: Features are approximately normally distributed.
Min-Max Scaling:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Results: values in [0, 1]
Use when: You need bounded values (e.g., neural networks).
Robust Scaling:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
# Uses median and IQR, robust to outliers
Use when: Data contains significant outliers.
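To make the outlier behavior concrete, here is a small comparison of the three scalers on a toy column with one extreme value (synthetic data, for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy column with one large outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std = StandardScaler().fit_transform(X)
mm = MinMaxScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# The outlier squeezes the standard- and min-max-scaled inliers together,
# while robust scaling (median/IQR) keeps the inliers spread out.
print(mm[:4].ravel())   # inliers crowd near 0
print(rob[:4].ravel())  # inliers remain distinguishable
```

Min-max maps the inliers to a tiny sliver near 0 because the outlier defines the range; robust scaling centers on the median (3.0), so the inliers stay well separated.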
Transformations
Log Transform: Compresses right-skewed distributions (income, prices, counts).
import numpy as np
df['log_income'] = np.log1p(df['income']) # log1p handles zeros
Power Transform (Box-Cox/Yeo-Johnson): Automatically finds the best transformation.
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X)
Binning: Converts continuous variables into categories.
import pandas as pd

df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 55, 100],
                         labels=['youth', 'young_adult', 'middle', 'senior'])
Categorical Features
Encoding Strategies
| Method | When to Use | Cardinality |
|---|---|---|
| One-Hot | Linear models, low cardinality | < 20 categories |
| Label/Ordinal | Ordered categories (low/med/high) | Any |
| Target Encoding | High cardinality, regression | > 20 categories |
| Frequency Encoding | When frequency matters | Any |
| Binary Encoding | High cardinality, memory-constrained | > 50 categories |
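Frequency encoding from the table above needs no external library; a minimal sketch with a toy `city` column (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF', 'NY', 'LA']})

# Frequency encoding: replace each category with its relative frequency
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)

print(df)
```

A city seen in half the rows encodes as 0.5; rare cities get small values, which can itself be a useful signal.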
Target Encoding (with regularization):
# Replace each category with the (smoothed) mean of the target variable.
# Encode out-of-fold in practice to prevent target leakage.
from category_encoders import TargetEncoder
encoder = TargetEncoder(smoothing=10)  # smoothing regularizes rare categories
X_encoded = encoder.fit_transform(X[['city']], y)
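Out-of-fold encoding can also be done by hand, which makes the leakage prevention explicit: each row is encoded using target means computed only from the other folds. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Each row's encoding uses target values from the *other* folds only
df = pd.DataFrame({
    'city': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'y':    [1,   0,   1,   0,   0,   1,   1,   0],
})
global_mean = df['y'].mean()
df['city_te'] = np.nan

for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Category means computed on the training folds only
    fold_means = df.iloc[train_idx].groupby('city')['y'].mean()
    df.loc[df.index[val_idx], 'city_te'] = (
        df['city'].iloc[val_idx].map(fold_means).fillna(global_mean).values
    )
```

Categories unseen in a training fold fall back to the global mean, which doubles as a sane default for unknown categories at serving time.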
Handling Unknown Categories
Production models will encounter categories not seen during training:
- Default category — Map unknown values to an “other” bucket
- Frequency threshold — Categories below N occurrences become “rare”
- Embedding lookup — Use nearest-neighbor in embedding space
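The frequency-threshold strategy is a one-liner in pandas; a sketch with a hypothetical `city` column where categories seen fewer than twice collapse into an "other" bucket:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY'] * 5 + ['LA'] * 3 + ['Boise', 'Tulsa']})

# Frequency threshold: categories seen fewer than N times become 'other'
counts = df['city'].value_counts()
rare = counts[counts < 2].index
df['city_clean'] = df['city'].where(~df['city'].isin(rare), 'other')
```

Applying the same mapping at inference time also absorbs brand-new categories, since anything absent from the kept set can be routed to 'other'.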
Temporal Features
Time-based data requires extracting meaningful signals:
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter
df['is_business_hour'] = df['hour'].between(9, 17).astype(int)
Cyclical Encoding
Hour 23 and hour 0 are close in time but far apart numerically. Use sine/cosine encoding:
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
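A quick check that the encoding actually fixes the wrap-around problem: in the encoded (sin, cos) space, hour 23 sits right next to hour 0, while hour 12 is maximally far away.

```python
import numpy as np

def hour_to_xy(h):
    # Map an hour onto the unit circle
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Raw difference: |23 - 0| = 23. Encoded distance: small.
d_23_0 = np.linalg.norm(hour_to_xy(23) - hour_to_xy(0))   # ~0.26
d_12_0 = np.linalg.norm(hour_to_xy(12) - hour_to_xy(0))   # 2.0 (opposite side)
```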
Lag Features
For time series forecasting:
df['sales_lag_1'] = df['sales'].shift(1)
df['sales_lag_7'] = df['sales'].shift(7)
df['sales_rolling_7'] = df['sales'].rolling(7).mean()
df['sales_rolling_30'] = df['sales'].rolling(30).mean()
# shift() and rolling() leave NaN in the earliest rows; drop or impute them before training
Text Features
Basic Extraction
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['capital_ratio'] = df['text'].str.count(r'[A-Z]') / df['text_length'].clip(lower=1)  # avoid divide-by-zero on empty strings
df['has_url'] = df['text'].str.contains(r'https?://').astype(int)
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
text_features = vectorizer.fit_transform(df['text'])
Embeddings
For modern NLP, use pre-trained embeddings:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['text'].tolist())
Feature Selection
Not all features help. Some add noise, cause overfitting, or slow down training.
Methods
Filter Methods — Statistical tests independent of the model:
- Correlation analysis (drop features correlated > 0.95 with each other)
- Mutual information
- Chi-squared test (for categorical targets)
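Mutual information fits naturally into scikit-learn's `SelectKBest`; a minimal sketch on synthetic classification data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

# Keep the k features with the highest mutual information with the target
selector = SelectKBest(mutual_info_classif, k=3)
X_top = selector.fit_transform(X, y)
print(X_top.shape)  # (300, 3)
```

Unlike correlation, mutual information also picks up non-linear dependence, at the cost of being slower to estimate.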
Wrapper Methods — Use model performance to select features:
- Forward selection (add features one at a time)
- Backward elimination (remove features one at a time)
- Recursive Feature Elimination (RFE)
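RFE is available directly in scikit-learn; a sketch with a logistic regression as the wrapped estimator on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

# Recursively drop the weakest feature until 4 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of kept features
```

Wrapper methods cost one model fit per elimination step, so they scale poorly past a few hundred features.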
Embedded Methods — Feature selection built into the model:
- L1 regularization (Lasso) — drives coefficients to zero
- Tree-based importance (Random Forest, XGBoost)
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
selector = SelectFromModel(RandomForestClassifier(n_estimators=100))
X_selected = selector.fit_transform(X, y)
print(f"Selected {X_selected.shape[1]} of {X.shape[1]} features")
Anti-Patterns
Data Leakage
Using information from the future or the target variable to create features. This inflates metrics during training but fails completely in production.
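One subtle and very common form of leakage is fitting a transformer on the full dataset before splitting, so test-set statistics leak into training. A sketch of the leaky and correct versions (random data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))

# LEAKY: the scaler sees test-set statistics before the split
X_leaky = StandardScaler().fit_transform(X)

# CORRECT: fit the scaler on the training data only, then apply it to test
X_train, X_test = train_test_split(X, random_state=0)
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
```

Wrapping the scaler and model in a scikit-learn Pipeline enforces this ordering automatically, including inside cross-validation.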
Feature Explosion
Creating thousands of features through aggressive interaction terms or polynomial expansion. More features ≠ better models.
Ignoring Feature Distribution Shifts
Features that work today may drift tomorrow. Monitor feature distributions in production and retrain when drift exceeds thresholds.
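One simple drift check is a two-sample Kolmogorov-Smirnov test comparing a feature's training-time distribution against recent production values. A sketch with synthetic data (the 0.5-sigma shift and the 0.01 threshold are illustrative choices, not a standard):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 5000)    # distribution at training time
live_feature = rng.normal(0.5, 1, 5000)   # shifted production distribution

# Two-sample KS test: a small p-value means the distributions differ
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic {stat:.3f}); consider retraining")
```

With large sample sizes the p-value becomes significant for tiny shifts, so in practice teams often alert on the KS statistic (or PSI) itself rather than the p-value.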
Manual Feature Engineering Without Baselines
Always establish a baseline with raw features before investing in feature engineering. Sometimes the raw data is good enough.
Not Storing Feature Logic
Feature transformations must be reproducible. Use feature stores or pipeline frameworks (scikit-learn Pipelines, Feast, Tecton) to version and serve features consistently.
The best features encode domain knowledge in a form that algorithms can exploit. Spend 80% of your time on understanding the data and 20% on modeling — not the other way around.