Feature Engineering for Machine Learning Pipelines

Production feature engineering patterns for ML pipelines. Covers feature stores, temporal features, automated feature selection, and data leakage prevention.

Feature engineering consumes 60-80% of a data scientist’s time — and it’s the single biggest lever for model performance. A well-engineered feature can improve model accuracy more than switching from logistic regression to a neural network. Yet most teams treat feature engineering as ad-hoc experimentation rather than systematic engineering.

The gap between notebook feature engineering and production feature engineering is enormous. In a notebook, you can compute features on the entire dataset. In production, you need to compute features in real-time, handle missing data gracefully, avoid data leakage, version your feature definitions, and serve features at sub-100ms latency.


The Feature Engineering Lifecycle

Phase | Activities | Common Pitfalls
Discovery | Domain analysis, correlation studies | Ignoring domain knowledge
Engineering | Transform, combine, encode features | Data leakage via future data
Selection | Remove redundant/noisy features | Over-reliance on automated selection
Validation | Statistical tests, distribution analysis | Training/serving skew
Serving | Feature store, real-time computation | Latency and freshness trade-offs

Temporal Feature Patterns

Most production ML involves time-series or event data. Temporal features are the highest-value feature family — and the easiest to leak.

Rolling Window Aggregations

import pandas as pd

def compute_rolling_features(df, entity_col, timestamp_col, value_col, windows):
    """Per-entity trailing-window aggregates; window sizes are row counts,
    and the feature names assume one row per entity per day."""
    # Sort by entity first, then time: groupby-rolling emits rows grouped by
    # entity, so this ordering keeps the .values arrays aligned with the frame.
    # Sorting by timestamp alone silently misaligns features across entities.
    df = df.sort_values([entity_col, timestamp_col])
    features = {}

    for window in windows:
        rolling = df.groupby(entity_col)[value_col].rolling(
            window=window, min_periods=1
        )
        mean = rolling.mean().values
        std = rolling.std().values
        features[f'{value_col}_mean_{window}d'] = mean
        features[f'{value_col}_std_{window}d'] = std
        features[f'{value_col}_max_{window}d'] = rolling.max().values
        # Distance of the current value from its rolling mean, in std units;
        # the epsilon guards against division by zero on flat windows.
        features[f'{value_col}_trend_{window}d'] = (
            df[value_col].values - mean
        ) / (std + 1e-8)

    return pd.DataFrame(features, index=df.index)

Critical rule: Rolling windows must use only data up to the prediction point. Including future data is the most common source of data leakage. In production, use closed='left' or equivalent to ensure strict temporal boundaries.
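As a concrete illustration, here is a minimal sketch of a strictly-past window using pandas' closed='left' option (the events frame is made-up daily data):

import pandas as pd

# Toy event data: one row per day.
events = pd.DataFrame(
    {"amount": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# closed='left' makes the window [t - 3 days, t): the row at time t is
# excluded from its own feature, so same-row leakage is impossible.
past_only_mean = events["amount"].rolling("3D", closed="left").mean()
# 2024-01-03 -> mean of Jan 1 and Jan 2 only (15.0); 2024-01-01 -> NaN (no history).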


Feature Store Architecture

A feature store centralizes feature computation, storage, and serving. Without one, every model trains on slightly different feature versions, leading to irreproducible results and training-serving skew.

Key Components

  1. Feature Registry: Catalog of all feature definitions with metadata, lineage, and ownership (a minimal sketch follows this list)
  2. Offline Store: Historical features for training (typically Parquet/Delta Lake)
  3. Online Store: Low-latency features for inference (Redis, DynamoDB, or specialized stores)
  4. Transformation Engine: Compute features from raw data using registered definitions
  5. Serving Layer: API that retrieves features for both training and inference
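
To make the registry concrete, here is a minimal sketch assuming a plain dataclass-based catalog; FeatureDefinition, REGISTRY, and register are illustrative names, not any specific feature store's API:

from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    entity: str             # e.g. "customer_id"
    source: str             # lineage: the raw table or stream it derives from
    transform: str          # registered transformation, referenced by name
    owner: str
    version: int = 1
    ttl_seconds: int = 86_400   # freshness bound for the online store

REGISTRY: dict[str, FeatureDefinition] = {}

def register(feature: FeatureDefinition) -> None:
    # Refuse silent redefinitions: changing a feature means bumping its version.
    key = f"{feature.name}:v{feature.version}"
    if key in REGISTRY:
        raise ValueError(f"{key} already registered; bump the version instead")
    REGISTRY[key] = feature

register(FeatureDefinition(
    name="txn_amount_mean_7d",
    entity="customer_id",
    source="payments.transactions",
    transform="compute_rolling_features",
    owner="risk-ml-team",
))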

Implementation Decision

Approach | When to Use | Examples
Managed | < 1000 features, want simplicity | Feast, Tecton, Databricks Feature Store
Custom | > 1000 features, complex pipelines | Redis + Airflow + custom API
Embedded | Single model, lightweight needs | Feature computation in model serving code
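
For the Custom approach, the online read/write path can be as small as the sketch below, which assumes redis-py and a running local Redis instance; the key layout and function names are illustrative:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_features(entity_id: str, features: dict) -> None:
    # One hash per entity; values JSON-encoded so types survive the round trip.
    r.hset(f"features:customer:{entity_id}",
           mapping={k: json.dumps(v) for k, v in features.items()})

def read_features(entity_id: str, names: list[str]) -> dict:
    # hmget returns None for missing fields, which the model code must handle.
    raw = r.hmget(f"features:customer:{entity_id}", names)
    return {n: (json.loads(v) if v is not None else None)
            for n, v in zip(names, raw)}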

Data Leakage Prevention

Data leakage inflates performance metrics during development and causes catastrophic failures in production. The three most dangerous forms:

1. Target Leakage

Features that directly encode the target variable. Example: including account_status=churned as a feature when predicting churn.
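
One cheap guard is to score each feature alone against a binary target and flag near-perfect separators. The function below is a sketch; the 0.98 AUC threshold is an illustrative choice, and it assumes no missing values:

import pandas as pd
from sklearn.metrics import roc_auc_score

def flag_leaky_features(df: pd.DataFrame, target: str, auc_threshold: float = 0.98):
    # A feature that alone almost perfectly separates the target is a
    # classic target-leakage smell and deserves manual review.
    y = df[target]
    suspects = []
    for col in df.columns.drop(target):
        x = df[col]
        if not pd.api.types.is_numeric_dtype(x):
            x = x.astype("category").cat.codes  # crude ordinal encoding
        auc = roc_auc_score(y, x)
        auc = max(auc, 1.0 - auc)   # direction of the relationship is irrelevant
        if auc >= auc_threshold:
            suspects.append((col, auc))
    return sorted(suspects, key=lambda t: -t[1])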

2. Temporal Leakage

Using information from the future to predict the past. Example: using a 30-day rolling average that includes days after the prediction date.

3. Train-Test Contamination

Information from the test set influencing training decisions. Example: fitting a scaler on the entire dataset before splitting.
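
A minimal scikit-learn sketch of the fix, on synthetic stand-in data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # stand-in feature matrix
y = rng.integers(0, 2, size=1000)

# Contaminated: StandardScaler().fit(X) before splitting would leak
# test-set mean/std into the training transform.
# Correct: split first, then fit the scaler on the training fold only.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
scaler = StandardScaler().fit(X_train)  # statistics come from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # test set scaled with train statistics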

Prevention Protocol:

  • Split data temporally, never randomly, for time-series problems
  • Compute all features using only data available at prediction time
  • Fit all transformations (scaling, encoding) on training data only
  • Validate feature distributions between train and test sets (see the sketch below)
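
A lightweight implementation of the last bullet is a per-feature two-sample Kolmogorov-Smirnov test; the function below is a sketch, and alpha=0.01 is an illustrative choice:

from scipy.stats import ks_2samp

def check_train_test_skew(train_df, test_df, alpha=0.01):
    # Flags numeric features whose train and test distributions differ
    # significantly; a hit means skew or drift worth investigating.
    shifted = []
    for col in train_df.select_dtypes("number").columns:
        stat, p_value = ks_2samp(train_df[col].dropna(), test_df[col].dropna())
        if p_value < alpha:
            shifted.append((col, stat, p_value))
    return sorted(shifted, key=lambda t: -t[1])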

Automated Feature Selection

Manual feature selection doesn’t scale beyond 50 features. Automated approaches:

Filter Methods (fast, independent of model):

  • Mutual information scoring for non-linear relationships
  • Chi-squared test for categorical features
  • Correlation matrix analysis for redundancy removal

Wrapper Methods (accurate, expensive):

  • Recursive Feature Elimination (RFE) with cross-validation
  • Sequential Feature Selection (forward/backward)

Embedded Methods (balanced):

  • L1 regularization (Lasso) for automatic feature zeroing
  • Tree-based feature importance (Random Forest, XGBoost)
  • Permutation importance for model-agnostic ranking

The practical recipe: start with filter methods to remove obvious noise (> 500 features → 100), then use embedded methods to rank remaining features, then validate the top-k with wrapper methods.
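
A compact scikit-learn sketch of that recipe on synthetic data; the cut-offs of 100 and 20 features mirror the example numbers above and are not recommendations:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=25, random_state=0)

# Stage 1 (filter): mutual information keeps the 100 most informative features.
mi_filter = SelectKBest(mutual_info_classif, k=100).fit(X, y)
X_filtered = mi_filter.transform(X)

# Stage 2 (embedded): rank survivors by random-forest importance, keep the top 20.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_filtered, y)
top_idx = np.argsort(forest.feature_importances_)[::-1][:20]
X_ranked = X_filtered[:, top_idx]

# Stage 3 (wrapper): cross-validated RFE validates the final subset.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=5).fit(X_ranked, y)
print(f"selected {rfecv.n_features_} of {X_ranked.shape[1]} candidate features")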


Production Checklist

  1. Build a feature store before your second model goes to production
  2. Version all feature definitions alongside model code
  3. Validate temporal boundaries to prevent data leakage
  4. Monitor feature distributions in production for drift
  5. Document feature lineage — every feature should trace back to raw data sources
  6. Compute training and serving features from the same code path to prevent skew
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
