# Feature Engineering at Scale
Transform raw data into predictive features for machine learning at production scale. Covers feature stores, feature pipelines, temporal features, encoding strategies, feature drift detection, and the patterns that make feature engineering systematic rather than ad-hoc.
Feature engineering is where domain knowledge meets data. A well-engineered feature can improve model accuracy more than any architecture change. But ad-hoc feature engineering — scattered SQL queries, notebook-only features, no version control — creates a mess that cannot be reproduced, monitored, or scaled.
## Feature Engineering Pipeline

```
Raw Data → Cleaning → Transformation → Feature Store → Model Training
                                            ↓
                                     Model Serving (online features)
```
**Offline features (batch):**
- Computed daily or hourly from the data warehouse
- Examples: 30-day purchase count, lifetime value, average session duration
- Latency: minutes to hours

**Online features (real-time):**
- Computed at request time or from streaming data
- Examples: last page viewed, cart contents, time since last login
- Latency: milliseconds

**Point-in-time features:**
- Historical features computed AS OF a specific timestamp
- Critical for training (prevents data leakage)
- Example: "What was the user's purchase count at the time of this event?"
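Point-in-time correctness can be sketched in pandas with `merge_asof`, which joins each training event to the latest feature snapshot at or before it. The frames and values below are illustrative, not from any real dataset:

```python
import pandas as pd

# Hypothetical training events and daily feature snapshots
events = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [1, 0, 1],
})
snapshots = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "snapshot_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
    "purchases_30d": [3, 5, 2],
})

# For each event, take the latest snapshot at or before the event time,
# so no future information leaks into training rows.
train = pd.merge_asof(
    events.sort_values("event_time"),
    snapshots.sort_values("snapshot_time"),
    left_on="event_time",
    right_on="snapshot_time",
    by="customer_id",
)
```

With the default `direction="backward"`, every joined snapshot is guaranteed to predate its event, which is exactly the point-in-time guarantee training data needs.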
## Common Feature Patterns
```python
import pandas as pd


class FeatureTransformer:
    """Standard feature engineering patterns."""

    def temporal_features(self, df, timestamp_col):
        """Extract time-based features from timestamps."""
        df["hour_of_day"] = df[timestamp_col].dt.hour
        df["day_of_week"] = df[timestamp_col].dt.dayofweek
        df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
        df["month"] = df[timestamp_col].dt.month
        df["is_business_hours"] = df["hour_of_day"].between(9, 17).astype(int)
        return df

    def aggregation_features(self, df, entity_col, value_col, windows):
        """Rolling window aggregations per entity.

        Assumes df has a sorted DatetimeIndex, which pandas requires
        for time-offset rolling windows.
        """
        features = {}
        grouped = df.groupby(entity_col)[value_col]
        for window in windows:  # e.g., [7, 30, 90] days
            rolled = grouped.rolling(f"{window}D")
            features[f"{value_col}_{window}d_count"] = rolled.count()
            features[f"{value_col}_{window}d_sum"] = rolled.sum()
            features[f"{value_col}_{window}d_avg"] = rolled.mean()
            features[f"{value_col}_{window}d_std"] = rolled.std()
        return pd.DataFrame(features)

    def ratio_features(self, df):
        """Compute ratios that capture relative behavior."""
        # +1 in each denominator avoids division by zero
        df["purchase_to_visit_ratio"] = df["purchases_30d"] / (df["visits_30d"] + 1)
        df["weekend_vs_weekday_ratio"] = df["weekend_sessions"] / (df["weekday_sessions"] + 1)
        df["recent_vs_historical"] = df["purchases_7d"] / (df["purchases_90d"] + 1)
        return df

    def target_encoding(self, df, category_col, target_col, smoothing=10):
        """Encode categories by their target mean (with regularization).

        Fit on training folds only: applying it to the same rows it was
        computed from leaks the target into the feature.
        """
        global_mean = df[target_col].mean()
        stats = df.groupby(category_col)[target_col].agg(["mean", "count"])
        # Smoothing: blend category mean with global mean;
        # more samples → trust the category mean more
        stats["smoothed"] = (
            (stats["count"] * stats["mean"] + smoothing * global_mean)
            / (stats["count"] + smoothing)
        )
        return df[category_col].map(stats["smoothed"])
```
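To make the smoothing concrete, here is a standalone worked example of the blended mean; the `city` and `converted` columns and all values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "NY", "SF", "LA"],
    "converted": [1, 1, 0, 1, 0],
})
smoothing = 10
global_mean = df["converted"].mean()  # 3/5 = 0.6
stats = df.groupby("city")["converted"].agg(["mean", "count"])
stats["smoothed"] = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
    stats["count"] + smoothing
)
# NY: 3 samples, mean 2/3 → (3 * 2/3 + 10 * 0.6) / 13 = 8/13 ≈ 0.615
# SF: 1 sample, mean 1.0 → (1 + 6) / 11 = 7/11 ≈ 0.636, pulled toward 0.6
encoded = df["city"].map(stats["smoothed"])
```

Note how SF, with a single positive sample, is pulled most of the way back toward the global mean — exactly the regularization the smoothing term provides.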
## Feature Store
```python
from datetime import timedelta

from feast import BigQuerySource, Entity, Feature, FeatureStore, FeatureView, ValueType

# Note: Feast's API has changed across versions; this sketch follows an older
# API (Feature/ValueType — newer releases use Field and feast.types).
# Table and column names below are illustrative.

# Define entity (the "key" for feature lookup)
customer = Entity(
    name="customer_id",
    value_type=ValueType.STRING,
    description="Unique customer identifier",
)

# Batch source backing the feature view
bigquery_source = BigQuerySource(
    table_ref="project.dataset.customer_features",
    event_timestamp_column="event_timestamp",
)

# Define feature view (a table of features)
customer_features = FeatureView(
    name="customer_features",
    entities=["customer_id"],
    features=[
        Feature("purchases_30d", ValueType.INT64),
        Feature("lifetime_value", ValueType.DOUBLE),
        Feature("days_since_last_order", ValueType.INT64),
        Feature("avg_order_value", ValueType.DOUBLE),
        Feature("preferred_category", ValueType.STRING),
    ],
    ttl=timedelta(days=1),  # Served values are considered fresh for one day
    batch_source=bigquery_source,
)

# Online serving (real-time inference)
store = FeatureStore(repo_path="feature_repo/")
features = store.get_online_features(
    features=["customer_features:purchases_30d", "customer_features:lifetime_value"],
    entity_rows=[{"customer_id": "cust-12345"}],
)
# Online lookups typically return in < 10 ms for real-time scoring
```
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Features in notebooks only | Cannot reproduce or deploy to production | Feature store with version control |
| Data leakage in training | Model uses future information → inflated metrics | Point-in-time correct feature computation |
| One-hot encoding high cardinality | 10,000+ columns for categories | Target encoding or embedding |
| No feature monitoring | Feature drift goes undetected | Feature distribution monitoring |
| Compute features at serving time | High latency, duplicated logic | Pre-compute and store in feature store |
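The "feature distribution monitoring" fix can be sketched with a population stability index (PSI), a common drift score. The function and thresholds below are conventional rules of thumb, not from any particular library, and the data is synthetic:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample.

    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift.
    """
    # Bin edges come from the baseline (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip live values into the baseline range so nothing falls outside the bins
    actual = np.clip(actual, edges[0], edges[-1])
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) and division by zero in empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)       # feature values at training time
live_ok = rng.normal(0, 1, 10_000)        # serving traffic, same distribution
live_shifted = rng.normal(1, 1, 10_000)   # serving traffic after a shift
stable_score = psi(baseline, live_ok)     # small → no alert
drift_score = psi(baseline, live_shifted) # large → investigate the feature
```

Running a check like this per feature on a schedule, against the training-time baseline, is what turns "no feature monitoring" into an alertable signal.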
Feature engineering is the highest-leverage activity in machine learning. A systematic approach — feature stores, standardized transformations, drift monitoring — turns ad-hoc experimentation into a reliable, scalable engineering practice.