Feature Engineering for Machine Learning Pipelines
Production feature engineering patterns for ML pipelines. Covers feature stores, temporal features, automated feature selection, and data leakage prevention.
Feature engineering consumes 60-80% of a data scientist’s time — and it’s the single biggest lever for model performance. A well-engineered feature can improve model accuracy more than switching from logistic regression to a neural network. Yet most teams treat feature engineering as ad-hoc experimentation rather than systematic engineering.
The gap between notebook feature engineering and production feature engineering is enormous. In a notebook, you can compute features on the entire dataset. In production, you need to compute features in real-time, handle missing data gracefully, avoid data leakage, version your feature definitions, and serve features at sub-100ms latency.
The Feature Engineering Lifecycle
| Phase | Activities | Common Pitfalls |
|---|---|---|
| Discovery | Domain analysis, correlation studies | Ignoring domain knowledge |
| Engineering | Transform, combine, encode features | Data leakage via future data |
| Selection | Remove redundant/noisy features | Over-reliance on automated selection |
| Validation | Statistical tests, distribution analysis | Training/serving skew |
| Serving | Feature store, real-time computation | Latency and freshness trade-offs |
Temporal Feature Patterns
Most production ML involves time-series or event data. Temporal features are the highest-value feature family — and the easiest to leak.
Rolling Window Aggregations
```python
import pandas as pd

def compute_rolling_features(df, entity_col, timestamp_col, value_col, windows):
    """Per-entity rolling aggregations over trailing day-based windows.

    Assumes timestamp_col is a datetime64 column; windows is a list of
    integer day counts, e.g. [7, 30, 90].
    """
    # Sort by entity, then time, so groupby-rolling output aligns
    # positionally with the rows of df
    df = df.sort_values([entity_col, timestamp_col])
    grouped = df.set_index(timestamp_col).groupby(entity_col)[value_col]
    features = pd.DataFrame(index=df.index)
    for window in windows:
        # Time-based window ('7D' etc.) so gaps in the event stream are
        # handled correctly, unlike a fixed row count
        rolling = grouped.rolling(f'{window}D', min_periods=1)
        mean = rolling.mean().to_numpy()
        std = rolling.std().to_numpy()
        features[f'{value_col}_mean_{window}d'] = mean
        features[f'{value_col}_std_{window}d'] = std
        features[f'{value_col}_max_{window}d'] = rolling.max().to_numpy()
        # z-score of the current value against its own trailing window
        features[f'{value_col}_trend_{window}d'] = (
            df[value_col].to_numpy() - mean
        ) / (std + 1e-8)
    return features
```
Critical rule: Rolling windows must use only data up to the prediction point. Including future data is the most common source of data leakage. In production, use closed='left' or equivalent to ensure strict temporal boundaries.
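The effect of closed='left' can be seen on a toy series; a minimal sketch (the series and values here are made up for illustration):

```python
import pandas as pd

ts = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=ts)

# Default closed='right': the 3-day window includes the current row,
# so the value observed at prediction time leaks into its own feature
incl = s.rolling("3D").sum()

# closed='left': the window excludes the current row, using only
# strictly earlier observations
excl = s.rolling("3D", closed="left").sum()
```

On 2024-01-03 the default window sums 1 + 2 + 3 = 6.0, while the closed='left' window sums only the two prior days, 1 + 2 = 3.0; the very first row has no history at all and comes back NaN, which is exactly what a leakage-safe feature should do.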
Feature Store Architecture
A feature store centralizes feature computation, storage, and serving. Without one, every model trains on slightly different feature versions, leading to irreproducible results and training-serving skew.
Key Components
- Feature Registry: Catalog of all feature definitions with metadata, lineage, and ownership
- Offline Store: Historical features for training (typically Parquet/Delta Lake)
- Online Store: Low-latency features for inference (Redis, DynamoDB, or specialized stores)
- Transformation Engine: Compute features from raw data using registered definitions
- Serving Layer: API that retrieves features for both training and inference
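How these components fit together can be sketched with an in-memory toy (the class, method names, and feature names below are hypothetical; a real system backs the online store with Redis or DynamoDB and the registry with a metadata service):

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    registry: dict = field(default_factory=dict)   # name -> transformation fn
    online: dict = field(default_factory=dict)     # (entity_id, name) -> value

    def register(self, name, fn):
        """Feature registry: catalog a definition and how to compute it."""
        self.registry[name] = fn

    def materialize(self, entity_id, raw):
        """Transformation engine: run registered definitions, push online."""
        for name, fn in self.registry.items():
            self.online[(entity_id, name)] = fn(raw)

    def get_features(self, entity_id, names):
        """Serving layer: low-latency lookup at inference time."""
        return {n: self.online.get((entity_id, n)) for n in names}

store = FeatureStore()
store.register("txn_count_7d", lambda raw: len(raw["txns_7d"]))
store.register("txn_mean_7d", lambda raw: sum(raw["txns_7d"]) / len(raw["txns_7d"]))
store.materialize("user_42", {"txns_7d": [10.0, 20.0, 30.0]})
result = store.get_features("user_42", ["txn_count_7d", "txn_mean_7d"])
# -> {'txn_count_7d': 3, 'txn_mean_7d': 20.0}
```

The key property worth copying from real feature stores: the same registered definition produces both the offline (training) and online (serving) values, which is what eliminates training-serving skew.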
Implementation Decision
| Approach | When to Use | Examples |
|---|---|---|
| Managed | < 1000 features, want simplicity | Feast, Tecton, Databricks Feature Store |
| Custom | > 1000 features, complex pipelines | Redis + Airflow + custom API |
| Embedded | Single model, lightweight needs | Feature computation in model serving code |
Data Leakage Prevention
Data leakage inflates performance metrics during development and causes catastrophic failures in production. The three most dangerous forms:
1. Target Leakage
Features that directly encode the target variable. Example: including account_status=churned as a feature when predicting churn.
2. Temporal Leakage
Using information from the future to predict the past. Example: using a 30-day rolling average that includes days after the prediction date.
3. Train-Test Contamination
Information from the test set influencing training decisions. Example: fitting a scaler on the entire dataset before splitting.
Prevention Protocol:
- Split data temporally, never randomly, for time-series problems
- Compute all features using only data available at prediction time
- Fit all transformations (scaling, encoding) on training data only
- Validate feature distributions between train and test sets
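The first three rules can be demonstrated in a few lines with scikit-learn; a sketch on synthetic data (the cutoff and shapes are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
timestamps = np.arange(n)           # stand-in for event times
X = rng.normal(size=(n, 3))

# 1. Split temporally, never randomly: everything before the cutoff
#    trains, everything after tests
cutoff = 800
train, test = X[timestamps < cutoff], X[timestamps >= cutoff]

# 2. Fit transformations on the training slice ONLY -- fitting on all
#    of X would leak test-set statistics into training
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # transform, never fit, on test
```

A random split of the same data would scatter future rows into the training set; the temporal cutoff is what keeps the evaluation honest for time-ordered problems.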
Automated Feature Selection
Manual feature selection doesn’t scale beyond 50 features. Automated approaches:
Filter Methods (fast, independent of model):
- Mutual information scoring for non-linear relationships
- Chi-squared test for categorical features
- Correlation matrix analysis for redundancy removal
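A sketch of the first and third filter methods on synthetic data (the three constructed features are contrived to make the scores easy to interpret):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)
X = np.column_stack([
    signal,                                    # informative
    signal + rng.normal(scale=0.05, size=n),   # near-duplicate (redundant)
    rng.normal(size=n),                        # pure noise
])
y = (signal > 0).astype(int)

# Mutual information scores dependence with the target, linear or not
mi = mutual_info_classif(X, y, random_state=0)

# Correlation matrix flags redundant pairs for removal
corr = np.corrcoef(X, rowvar=False)
redundant = abs(corr[0, 1]) > 0.95   # columns 0 and 1 are near-duplicates
```

The informative column scores far above the noise column, and the correlation check catches the near-duplicate that mutual information alone would happily keep.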
Wrapper Methods (accurate, expensive):
- Recursive Feature Elimination (RFE) with cross-validation
- Sequential Feature Selection (forward/backward)
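RFE with cross-validation is available directly in scikit-learn; a minimal sketch on a synthetic problem:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic problem: 5 informative features among 15
X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5,
    n_redundant=0, random_state=0,
)

# RFECV repeatedly drops the weakest feature (by coefficient magnitude
# here) and keeps the subset size with the best cross-validated score
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5).fit(X, y)
selected = selector.support_   # boolean mask of kept features
```

The expense is visible in the structure: each elimination step refits the model across every CV fold, which is why wrapper methods belong at the end of the funnel, not the start.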
Embedded Methods (balanced):
- L1 regularization (Lasso) for automatic feature zeroing
- Tree-based feature importance (Random Forest, XGBoost)
- Permutation importance for model-agnostic ranking
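The "automatic feature zeroing" of L1 regularization is easy to see on synthetic data; a sketch where only the first two of ten features carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
# Target depends only on the first two features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# The L1 penalty drives irrelevant coefficients exactly to zero,
# so the surviving nonzero coefficients ARE the selected features
model = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_)
```

Unlike tree importances, which rank every feature with some nonzero score, Lasso produces a hard in-or-out decision, which is what makes it usable as a selector rather than just a ranker.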
The practical recipe: start with filter methods to remove obvious noise (> 500 features → 100), then use embedded methods to rank remaining features, then validate the top-k with wrapper methods.
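The first two stages of that funnel compose naturally in a scikit-learn Pipeline; a sketch with arbitrary stage parameters (k=20 and C=0.5 are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)

# Stage 1 (filter): keep the 20 highest mutual-information features
# Stage 2 (embedded): an L1-penalized model zeroes remaining noise
pipeline = Pipeline([
    ("filter", SelectKBest(mutual_info_classif, k=20)),
    ("model", LogisticRegression(penalty="l1", solver="liblinear", C=0.5)),
]).fit(X, y)
```

Wrapping both stages in one Pipeline also keeps the selection honest under cross-validation: the filter is refit inside each fold instead of peeking at the full dataset.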
Production Checklist
- Build a feature store before your second model goes to production
- Version all feature definitions alongside model code
- Validate temporal boundaries to prevent data leakage
- Monitor feature distributions in production for drift
- Document feature lineage — every feature should trace back to raw data sources
- Compute training and serving features from the same code path to prevent skew
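The distribution-monitoring item on the checklist can be sketched with a two-sample Kolmogorov-Smirnov test; the helper name and alpha threshold below are illustrative choices, not a standard API:

```python
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference, live, alpha=0.01):
    """Two-sample KS test: a small p-value means the live feature's
    distribution differs from the training-time reference."""
    result = ks_2samp(reference, live)
    return result.pvalue < alpha

rng = np.random.default_rng(7)
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)  # training distribution
live_stable = rng.normal(loc=0.0, scale=1.0, size=2000)    # same distribution
live_drifted = rng.normal(loc=0.8, scale=1.0, size=2000)   # mean shift in prod

has_drifted(train_feature, live_drifted)   # -> True (the shift is detected)
```

In practice this runs per feature on a schedule, comparing a recent serving window against the training snapshot, with alerts wired to the result rather than a hard fail.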