# Feature Engineering at Scale
Transform raw data into predictive features for machine learning at production scale. Covers feature stores, feature pipelines, temporal features, encoding strategies, feature drift detection, and the patterns that make feature engineering systematic rather than ad-hoc.
Feature engineering is where domain knowledge meets data. A well-engineered feature can improve model accuracy more than any architecture change. But ad-hoc feature engineering — scattered SQL queries, notebook-only features, no version control — creates a mess that cannot be reproduced, monitored, or scaled.
## Feature Engineering Pipeline

```
Raw Data → Cleaning → Transformation → Feature Store → Model Training
                                            ↓
                                     Model Serving (online features)
```
**Offline features (batch):**
- Computed daily or hourly from the data warehouse
- Examples: 30-day purchase count, lifetime value, average session duration
- Latency: minutes to hours

**Online features (real-time):**
- Computed at request time or from streaming data
- Examples: last page viewed, cart contents, time since last login
- Latency: milliseconds

**Point-in-time features:**
- Historical features computed AS OF a specific timestamp
- Critical for training (prevents data leakage)
- Example: "What was the user's purchase count at the time of this event?"
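Point-in-time correctness can be sketched in pandas with `merge_asof`, which joins each training event to the latest feature snapshot at or before it. The frames and values below are illustrative, not from any real dataset:

```python
import pandas as pd

# Hypothetical training events and daily feature snapshots
events = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [1, 0, 1],
})
snapshots = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "snapshot_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
    "purchases_30d": [3, 5, 2],
})

# For each event, take the latest snapshot at or before the event time,
# so no future information leaks into training rows.
train = pd.merge_asof(
    events.sort_values("event_time"),
    snapshots.sort_values("snapshot_time"),
    left_on="event_time",
    right_on="snapshot_time",
    by="customer_id",
)
```

With the default `direction="backward"`, every joined snapshot is guaranteed to predate its event, which is exactly the point-in-time guarantee training data needs.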
## Common Feature Patterns
```python
import pandas as pd


class FeatureTransformer:
    """Standard feature engineering patterns."""

    def temporal_features(self, df, timestamp_col):
        """Extract time-based features from timestamps."""
        df["hour_of_day"] = df[timestamp_col].dt.hour
        df["day_of_week"] = df[timestamp_col].dt.dayofweek
        df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
        df["month"] = df[timestamp_col].dt.month
        df["is_business_hours"] = df["hour_of_day"].between(9, 17).astype(int)
        return df

    def aggregation_features(self, df, entity_col, value_col, windows):
        """Rolling window aggregations per entity.

        Assumes df has a sorted DatetimeIndex, which pandas requires
        for time-offset rolling windows.
        """
        features = {}
        grouped = df.groupby(entity_col)[value_col]
        for window in windows:  # e.g., [7, 30, 90] days
            rolled = grouped.rolling(f"{window}D")
            features[f"{value_col}_{window}d_count"] = rolled.count()
            features[f"{value_col}_{window}d_sum"] = rolled.sum()
            features[f"{value_col}_{window}d_avg"] = rolled.mean()
            features[f"{value_col}_{window}d_std"] = rolled.std()
        return pd.DataFrame(features)

    def ratio_features(self, df):
        """Compute ratios that capture relative behavior."""
        # +1 in each denominator avoids division by zero
        df["purchase_to_visit_ratio"] = df["purchases_30d"] / (df["visits_30d"] + 1)
        df["weekend_vs_weekday_ratio"] = df["weekend_sessions"] / (df["weekday_sessions"] + 1)
        df["recent_vs_historical"] = df["purchases_7d"] / (df["purchases_90d"] + 1)
        return df

    def target_encoding(self, df, category_col, target_col, smoothing=10):
        """Encode categories by their target mean (with regularization).

        Fit on training folds only: applying it to the same rows it was
        computed from leaks the target into the feature.
        """
        global_mean = df[target_col].mean()
        stats = df.groupby(category_col)[target_col].agg(["mean", "count"])
        # Smoothing: blend category mean with global mean;
        # more samples → trust the category mean more
        stats["smoothed"] = (
            (stats["count"] * stats["mean"] + smoothing * global_mean)
            / (stats["count"] + smoothing)
        )
        return df[category_col].map(stats["smoothed"])
```
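To make the smoothing concrete, here is a standalone worked example of the blended mean; the `city` and `converted` columns and all values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "NY", "SF", "LA"],
    "converted": [1, 1, 0, 1, 0],
})
smoothing = 10
global_mean = df["converted"].mean()  # 3/5 = 0.6
stats = df.groupby("city")["converted"].agg(["mean", "count"])
stats["smoothed"] = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
    stats["count"] + smoothing
)
# NY: 3 samples, mean 2/3 → (3 * 2/3 + 10 * 0.6) / 13 = 8/13 ≈ 0.615
# SF: 1 sample, mean 1.0 → (1 + 6) / 11 = 7/11 ≈ 0.636, pulled toward 0.6
encoded = df["city"].map(stats["smoothed"])
```

Note how SF, with a single positive sample, is pulled most of the way back toward the global mean — exactly the regularization the smoothing term provides.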
## Feature Store
```python
from datetime import timedelta

from feast import BigQuerySource, Entity, Feature, FeatureStore, FeatureView, ValueType

# Note: Feast's API has changed across versions; this sketch follows an older
# API (Feature/ValueType — newer releases use Field and feast.types).
# Table and column names below are illustrative.

# Define entity (the "key" for feature lookup)
customer = Entity(
    name="customer_id",
    value_type=ValueType.STRING,
    description="Unique customer identifier",
)

# Batch source backing the feature view
bigquery_source = BigQuerySource(
    table_ref="project.dataset.customer_features",
    event_timestamp_column="event_timestamp",
)

# Define feature view (a table of features)
customer_features = FeatureView(
    name="customer_features",
    entities=["customer_id"],
    features=[
        Feature("purchases_30d", ValueType.INT64),
        Feature("lifetime_value", ValueType.DOUBLE),
        Feature("days_since_last_order", ValueType.INT64),
        Feature("avg_order_value", ValueType.DOUBLE),
        Feature("preferred_category", ValueType.STRING),
    ],
    ttl=timedelta(days=1),  # Served values are considered fresh for one day
    batch_source=bigquery_source,
)

# Online serving (real-time inference)
store = FeatureStore(repo_path="feature_repo/")
features = store.get_online_features(
    features=["customer_features:purchases_30d", "customer_features:lifetime_value"],
    entity_rows=[{"customer_id": "cust-12345"}],
)
# Online lookups typically return in < 10 ms for real-time scoring
```
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Features in notebooks only | Cannot reproduce or deploy to production | Feature store with version control |
| Data leakage in training | Model uses future information → inflated metrics | Point-in-time correct feature computation |
| One-hot encoding high cardinality | 10,000+ columns for categories | Target encoding or embedding |
| No feature monitoring | Feature drift goes undetected | Feature distribution monitoring |
| Compute features at serving time | High latency, duplicated logic | Pre-compute and store in feature store |
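The "feature distribution monitoring" fix can be sketched with a population stability index (PSI), a common drift score. The function and thresholds below are conventional rules of thumb, not from any particular library, and the data is synthetic:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample.

    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift.
    """
    # Bin edges come from the baseline (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip live values into the baseline range so nothing falls outside the bins
    actual = np.clip(actual, edges[0], edges[-1])
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) and division by zero in empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)       # feature values at training time
live_ok = rng.normal(0, 1, 10_000)        # serving traffic, same distribution
live_shifted = rng.normal(1, 1, 10_000)   # serving traffic after a shift
stable_score = psi(baseline, live_ok)     # small → no alert
drift_score = psi(baseline, live_shifted) # large → investigate the feature
```

Running a check like this per feature on a schedule, against the training-time baseline, is what turns "no feature monitoring" into an alertable signal.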
Feature engineering is the highest-leverage activity in machine learning. A systematic approach — feature stores, standardized transformations, drift monitoring — turns ad-hoc experimentation into a reliable, scalable engineering practice.