Time Series Forecasting in Production
End-to-end time series forecasting for production systems. Covers classical methods, modern ML approaches, forecast evaluation, and building automated forecasting pipelines.
Time series forecasting powers demand planning, capacity management, financial projections, and anomaly detection. Yet most forecasting projects fail in production — not because the model is wrong, but because the pipeline around it is fragile. A production forecasting system needs automated retraining, forecast evaluation, uncertainty quantification, and graceful handling of missing data and regime changes.
Method Selection Guide
| Data Characteristics | Recommended Method | When It Fails |
|---|---|---|
| Strong seasonality, stable trend | SARIMA / ETS | Regime changes, multiple seasonalities |
| Multiple seasonalities (daily+weekly) | Prophet / MSTL | Complex non-linear patterns |
| Many related series | LightGBM / XGBoost | Short series (< 100 points) |
| Long history, complex patterns | N-BEATS / TFT | Limited data, need interpretability |
| Hierarchical data | Reconciliation + any base model | Incoherent base forecasts |
The practical rule: Start with a seasonal naive baseline (repeat last year’s pattern). If you can’t beat seasonal naive, your model isn’t learning anything useful. Then try ETS and Prophet. Only reach for deep learning if simpler methods plateau and you have > 1,000 data points per series.
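A minimal sketch of that baseline, assuming a pandas Series with a regularly spaced DatetimeIndex; the function name and `season_length` parameter are illustrative (7 for a weekly cycle, 365 for a yearly one):

import numpy as np
import pandas as pd

def seasonal_naive_forecast(series: pd.Series, horizon: int, season_length: int = 7) -> pd.Series:
    """Repeat the last observed seasonal cycle forward for `horizon` steps."""
    last_cycle = series.iloc[-season_length:].to_numpy()
    # Tile the last cycle until it covers the horizon, then trim
    values = np.tile(last_cycle, horizon // season_length + 1)[:horizon]
    freq = series.index.freq or pd.infer_freq(series.index)
    future_index = pd.date_range(series.index[-1], periods=horizon + 1, freq=freq)[1:]
    return pd.Series(values, index=future_index)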
Production Pipeline Architecture
Data Ingestion → Preprocessing → Feature Engineering →
Model Training → Forecast Generation → Evaluation →
Reconciliation → Storage → API / Dashboard
Preprocessing for Real-World Data
Real time series data is messy: missing values, outliers, calendar effects, and structural breaks.
import pandas as pd


class TimeSeriesPreprocessor:
    def __init__(self, freq='D'):
        self.freq = freq

    def process(self, series: pd.Series) -> pd.Series:
        # 1. Enforce a regular frequency (missing timestamps become NaN)
        series = series.asfreq(self.freq)
        # 2. Interpolate missing values (at most 3 consecutive points)
        series = series.interpolate(method='time', limit=3)
        # 3. Cap outliers with a wide IQR fence (3x IQR)
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
        series = series.clip(lower, upper)
        # 4. Fill any remaining gaps with the day-of-week average
        if series.isna().any():
            seasonal_avg = series.groupby(series.index.dayofweek).transform('mean')
            series = series.fillna(seasonal_avg)
        return series
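For example, on a synthetic daily series (illustrative data only):

import numpy as np

idx = pd.date_range('2024-01-01', periods=120, freq='D')
raw = pd.Series(100 + 10 * np.sin(np.arange(120) * 2 * np.pi / 7), index=idx)
raw.iloc[[10, 11, 50]] = np.nan   # simulate missing days
raw.iloc[30] = 500                # simulate a spike
clean = TimeSeriesPreprocessor(freq='D').process(raw)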
Forecast Evaluation
Metrics That Matter
| Metric | Full Name | Use When |
|---|---|---|
| MAPE | Mean Absolute Percentage Error | Comparing across different scales |
| RMSSE | Root Mean Squared Scaled Error | M5/M6 competition standard |
| WAPE | Weighted Absolute Percentage Error | Aggregated business reporting |
| Coverage | % of actuals within prediction interval | Evaluating uncertainty |
Never use MAPE alone. It's undefined when actuals are zero and it rewards under-forecasting, because over-forecasts can incur unbounded percentage errors while under-forecasts are capped at 100%. Use WAPE for business reporting and RMSSE for model comparison.
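For reference, a minimal sketch of WAPE and RMSSE on plain NumPy arrays; the formulas follow the standard definitions, with RMSSE scaled by the in-sample lag-`m` naive errors (m=1 gives the non-seasonal variant used in the M-competitions):

import numpy as np

def wape(actual, forecast):
    """Weighted Absolute Percentage Error: sum of absolute errors over sum of actuals."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

def rmsse(train, actual, forecast, m=1):
    """Root Mean Squared Scaled Error: RMSE scaled by the in-sample naive (lag-m) RMSE."""
    train, actual, forecast = (np.asarray(a, float) for a in (train, actual, forecast))
    naive_mse = np.mean((train[m:] - train[:-m]) ** 2)
    return np.sqrt(np.mean((actual - forecast) ** 2) / naive_mse)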
Backtesting Protocol
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error

def time_series_cv(series, model_fn, n_splits=5, horizon=30, gap=7):
    """Walk-forward cross-validation with a gap between train and test."""
    results = []
    for i in range(n_splits):
        # Training window: all data up to this fold's cutoff
        cutoff = len(series) - (n_splits - i) * horizon - gap
        train = series.iloc[:cutoff]
        # Gap: skip `gap` periods to simulate the production data delay
        test_start = cutoff + gap
        test = series.iloc[test_start:test_start + horizon]
        # Fit on the training window, then forecast the test horizon
        model = model_fn(train)
        forecast = model.predict(horizon)
        results.append({
            'fold': i,
            'mape': mean_absolute_percentage_error(test, forecast),
            # prediction_interval_coverage is a user-supplied helper
            'coverage': prediction_interval_coverage(test, forecast),
        })
    return pd.DataFrame(results)
The gap parameter simulates the real-world delay between data availability and forecast generation. If your data pipeline has a 2-day lag, set gap=2. Omitting the gap overestimates accuracy.
Uncertainty Quantification
Point forecasts are almost always wrong. Prediction intervals communicate the range of likely outcomes, which is far more useful for decision-making.
Methods:
- Parametric: Assume residuals follow a distribution (normal, Student-t)
- Bootstrap: Resample residuals and regenerate forecasts
- Quantile regression: Directly model quantiles (10th, 50th, 90th)
- Conformal prediction: Distribution-free, guaranteed coverage
For business use: Report the 10th, 50th, and 90th percentile forecasts. The 10th percentile is your conservative plan, the 50th is the expected outcome, and the 90th is the optimistic scenario. This maps cleanly to “worst case / base case / best case” planning.
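Of the methods above, split conformal prediction is usually the easiest to add to an existing point forecaster. A minimal sketch, assuming you hold out a recent calibration window with both actuals and point forecasts (strictly, conformal coverage guarantees assume exchangeability, which autocorrelated series only approximate):

import numpy as np

def conformal_interval(calib_actual, calib_forecast, new_forecast, alpha=0.1):
    """Split conformal interval: widen the point forecast by the (1 - alpha)
    quantile of absolute calibration residuals."""
    residuals = np.abs(np.asarray(calib_actual, float) - np.asarray(calib_forecast, float))
    n = len(residuals)
    # Finite-sample corrected quantile level, capped at 1.0 for small n
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, q_level)
    new_forecast = np.asarray(new_forecast, float)
    return new_forecast - q, new_forecast + q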
Monitoring and Retraining
Forecast Monitoring
Track these metrics in production (a sketch of the bias and coverage checks follows the list):
- Forecast accuracy decay: Does accuracy degrade over the forecast horizon?
- Bias detection: Are forecasts systematically high or low?
- Coverage calibration: Do 90% prediction intervals actually contain 90% of actuals?
- Concept drift: Has the underlying data distribution changed?
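A minimal sketch of the bias and coverage checks over a recent window of actuals, point forecasts, and interval bounds; the alert thresholds here are illustrative, not standard values:

import numpy as np

def monitoring_report(actual, forecast, lower, upper, target_coverage=0.9):
    """Rolling-window checks for systematic bias and interval calibration."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    bias = np.mean(forecast - actual)                     # > 0 means over-forecasting
    relative_bias = bias / np.mean(np.abs(actual))
    covered = (actual >= np.asarray(lower, float)) & (actual <= np.asarray(upper, float))
    coverage = np.mean(covered)
    return {
        'relative_bias': relative_bias,
        'coverage': coverage,
        'bias_alert': abs(relative_bias) > 0.05,          # illustrative 5% threshold
        'coverage_alert': coverage < target_coverage - 0.05,
    }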
Retraining Triggers
- Scheduled: Weekly or monthly, depending on data velocity
- Performance-based: Retrain when rolling MAPE exceeds threshold
- Event-driven: After known structural changes (new product launch, policy change)
- Drift-detected: When statistical drift tests fail (PSI > 0.2; see the sketch after this list)
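For the drift trigger, a minimal PSI sketch comparing a recent window of the series against a reference window; values falling outside the reference bins are simply ignored in this simplified version:

import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between a reference window and a recent window of the same series."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero and log(0)
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))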
The best forecasting systems aren’t the ones with the most sophisticated models — they’re the ones that detect when their models are wrong and adapt automatically.