
Feature Engineering at Scale

Transform raw data into predictive features for machine learning at production scale. Covers feature stores, feature pipelines, temporal features, encoding strategies, feature drift detection, and the patterns that make feature engineering systematic rather than ad-hoc.

Feature engineering is where domain knowledge meets data. A well-engineered feature can improve model accuracy more than any architecture change. But ad-hoc feature engineering — scattered SQL queries, notebook-only features, no version control — creates a mess that cannot be reproduced, monitored, or scaled.


Feature Engineering Pipeline

Raw Data → Cleaning → Transformation → Feature Store → Model Training
                                            ↓
                                      Model Serving (online features)

Offline features (batch):
  Computed daily/hourly from data warehouse
  Examples: 30-day purchase count, lifetime value, avg session duration
  Latency: Minutes to hours
  
Online features (real-time):
  Computed at request time or from streaming data
  Examples: Last page viewed, cart contents, time since last login
  Latency: Milliseconds
  
Point-in-time features:
  Historical features computed AS OF a specific timestamp
  Critical for training (prevent data leakage)
  Example: "What was the user's purchase count at the time of this event?"
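The point-in-time idea above can be sketched in plain Python: given feature snapshots sorted by timestamp, each training event may only see the latest snapshot taken at or before its own timestamp. The function name `as_of_join` and its inputs are illustrative, not from any library:

```python
from bisect import bisect_right

def as_of_join(snapshots, event_times):
    """For each event, return the latest feature value known at or before it.

    snapshots: list of (timestamp, feature_value) pairs sorted by timestamp.
    Never reads a snapshot from after the event, so no future leakage.
    """
    times = [t for t, _ in snapshots]
    values = [v for _, v in snapshots]
    joined = []
    for event_time in event_times:
        idx = bisect_right(times, event_time) - 1  # rightmost snapshot <= event
        joined.append(values[idx] if idx >= 0 else None)
    return joined

# Purchase-count snapshots taken at t=1, 5, 9; events at t=0, 5, 8, 10
print(as_of_join([(1, 3), (5, 7), (9, 12)], [0, 5, 8, 10]))
# → [None, 7, 7, 12]
```

The event at t=0 gets None because no snapshot existed yet; using the t=1 value there would be exactly the leakage the point-in-time rule prevents.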

Common Feature Patterns

import pandas as pd


class FeatureTransformer:
    """Standard feature engineering patterns."""
    
    def temporal_features(self, df, timestamp_col):
        """Extract time-based features from timestamps."""
        df["hour_of_day"] = df[timestamp_col].dt.hour
        df["day_of_week"] = df[timestamp_col].dt.dayofweek
        df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
        df["month"] = df[timestamp_col].dt.month
        df["is_business_hours"] = df["hour_of_day"].between(9, 17).astype(int)
        return df
    
    def aggregation_features(self, df, entity_col, value_col, windows):
        """Rolling window aggregations per entity.

        Assumes df has a DatetimeIndex; time-based windows like "30d" require one.
        """
        features = {}
        for window in windows:  # e.g., [7, 30, 90] days
            grouped = df.groupby(entity_col)[value_col]
            features[f"{value_col}_{window}d_count"] = grouped.rolling(f"{window}d").count()
            features[f"{value_col}_{window}d_sum"] = grouped.rolling(f"{window}d").sum()
            features[f"{value_col}_{window}d_avg"] = grouped.rolling(f"{window}d").mean()
            features[f"{value_col}_{window}d_std"] = grouped.rolling(f"{window}d").std()
        return pd.DataFrame(features)
    
    def ratio_features(self, df):
        """Compute ratios that capture relative behavior."""
        df["purchase_to_visit_ratio"] = df["purchases_30d"] / (df["visits_30d"] + 1)
        df["weekend_vs_weekday_ratio"] = df["weekend_sessions"] / (df["weekday_sessions"] + 1)
        df["recent_vs_historical"] = df["purchases_7d"] / (df["purchases_90d"] + 1)
        return df
    
    def target_encoding(self, df, category_col, target_col, smoothing=10):
        """Encode categories by their target mean (with regularization).

        Fit on training folds only; computing the encoding on the same
        rows the model trains on leaks the target.
        """
        global_mean = df[target_col].mean()
        
        stats = df.groupby(category_col)[target_col].agg(["mean", "count"])
        
        # Smoothing: blend category mean with global mean
        # More samples → trust category mean more
        stats["smoothed"] = (
            (stats["count"] * stats["mean"] + smoothing * global_mean) /
            (stats["count"] + smoothing)
        )
        
        return df[category_col].map(stats["smoothed"])
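The smoothing formula can be checked by hand without pandas: since count × mean is just the category's target sum, the blend reduces to (sum + smoothing × global_mean) / (count + smoothing). A plain-Python version (illustrative names, not part of the class above):

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, smoothing=10):
    """Plain-Python check of the smoothed target-mean encoding.

    (count * mean + smoothing * global_mean) / (count + smoothing)
    == (sum + smoothing * global_mean) / (count + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return [encoding[cat] for cat in categories]

# Rare category "b" (one sample, target 0) is pulled toward the global mean (2/3)
print(smoothed_target_encoding(["a", "a", "b"], [1, 1, 0], smoothing=1))
```

With smoothing=1, "b" encodes to 1/3 rather than its raw mean of 0: a single observation is not trusted on its own, which is the regularization the docstring describes.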

Feature Store

from datetime import timedelta

from feast import Entity, Feature, FeatureStore, FeatureView, ValueType

# Note: Feast's API has changed across versions; this sketch follows the
# older Feature/ValueType style.

# Define entity (the "key" for feature lookup)
customer = Entity(
    name="customer_id",
    value_type=ValueType.STRING,
    description="Unique customer identifier",
)

# Define feature view (a table of features)
customer_features = FeatureView(
    name="customer_features",
    entities=["customer_id"],
    features=[
        Feature("purchases_30d", ValueType.INT64),
        Feature("lifetime_value", ValueType.DOUBLE),
        Feature("days_since_last_order", ValueType.INT64),
        Feature("avg_order_value", ValueType.DOUBLE),
        Feature("preferred_category", ValueType.STRING),
    ],
    ttl=timedelta(days=1),  # Refresh daily
    source=bigquery_source,  # a BigQuerySource defined elsewhere
)

# Online serving (real-time inference)
store = FeatureStore(repo_path="feature_repo/")

features = store.get_online_features(
    features=["customer_features:purchases_30d", "customer_features:lifetime_value"],
    entity_rows=[{"customer_id": "cust-12345"}],
)
# The online store serves lookups in milliseconds, fast enough for real-time scoring

Anti-Patterns

Anti-Pattern                          Consequence                                        Fix
Features in notebooks only            Cannot reproduce or deploy to production           Feature store with version control
Data leakage in training              Model uses future information → inflated metrics   Point-in-time correct feature computation
One-hot encoding high cardinality     10,000+ columns for categories                     Target encoding or embedding
No feature monitoring                 Feature drift goes undetected                      Feature distribution monitoring
Compute features at serving time      High latency, duplicated logic                     Pre-compute and store in feature store
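For the feature distribution monitoring fix, one widely used drift signal is the Population Stability Index (PSI), which compares a feature's binned distribution in production against a training-time baseline. A minimal plain-Python sketch (the function name and the commonly cited 0.25 alert threshold are conventions, not from the text above):

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a current sample of a numeric feature.

    0 means identical distributions; values above ~0.25 are a common
    rule of thumb for significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against all-equal values

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # floor at a tiny epsilon so empty buckets don't produce log(0)
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_fractions(expected)
    a = bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Running this nightly per feature against the training baseline turns "feature drift goes undetected" into an alertable metric.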

Feature engineering is the highest-leverage activity in machine learning. A systematic approach — feature stores, standardized transformations, drift monitoring — turns ad-hoc experimentation into a reliable, scalable engineering practice.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
