Synthetic Data Generation for ML Training
Generate high-quality synthetic data for machine learning. Covers statistical methods, GANs, LLM-based generation, privacy preservation, quality validation, and production pipelines.
Real data is expensive, sensitive, and never enough. Synthetic data generation solves the three biggest bottlenecks in ML development: insufficient training data, privacy constraints that block data sharing, and class imbalance that cripples model performance. Gartner estimates that by 2026, 75% of enterprises will use synthetic data in AI development, up from less than 10% in 2023.
But synthetic data is not automatically high quality. Generate it poorly and you train models on noise, teaching them to hallucinate patterns that don’t exist in the real world. This guide covers when to use synthetic data, how to generate it effectively, how to validate quality, and how to avoid the failure modes that make it worse than useless.
When to Use Synthetic Data
| Scenario | Why Synthetic Data Helps | Risk If Done Badly |
|---|---|---|
| Insufficient training data | Bootstrap small datasets to trainable size | Models overfit to synthetic artifacts, not real patterns |
| Privacy compliance | Train on synthetic versions of PII/PHI data | Synthetic data can still leak private information if poorly generated |
| Class imbalance | Generate minority class examples for better recall | Synthetic minority samples may not represent real distribution |
| Edge case coverage | Create rare scenarios (fraud, failures, anomalies) | Unrealistic edge cases teach the model wrong signals |
| Data sharing | Share synthetic data across teams/partners without privacy risk | Loss of statistical fidelity makes shared data useless |
| Testing & QA | Generate test data for CI/CD pipelines | Test data that doesn’t match production distributions gives false confidence |
Generation Methods
Statistical Methods
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
# Fit to real data distribution
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# Generate synthetic samples
synthetic_data = synthesizer.sample(num_rows=10000)
# Validate statistical similarity
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(real_data, synthetic_data, metadata)
print(f"Overall quality score: {quality_report.get_score()}")
Best for: Tabular data with well-defined statistical distributions. Fast, interpretable, good privacy properties.
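To build intuition for what a Gaussian copula synthesizer does internally, here is a minimal sketch using only NumPy and SciPy: model each column's marginal with its empirical CDF, capture dependence with a multivariate normal, then map samples back through the inverse empirical CDFs. This is an illustration of the idea, not SDV's actual implementation.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(real, n_rows, seed=0):
    """Sketch of the Gaussian copula idea for numeric tabular data."""
    rng = np.random.default_rng(seed)
    real = np.asarray(real, dtype=float)
    # 1. Map each column to (0, 1) via its empirical CDF (rank transform)
    ranks = np.argsort(np.argsort(real, axis=0), axis=0)
    u = (ranks + 1) / (real.shape[0] + 1)
    # 2. Map uniforms to standard normals and estimate their correlation
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # 3. Sample correlated normals, map back through inverse empirical CDFs
    z_new = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n_rows)
    u_new = stats.norm.cdf(z_new)
    synth = np.column_stack([
        np.quantile(real[:, j], u_new[:, j]) for j in range(real.shape[1])
    ])
    return synth
```

Because step 3 maps through quantiles of the real data, samples stay inside each column's observed range, which is one reason copula methods adhere well to boundaries.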
GAN-Based Generation
import torch
from torch import nn
class Generator(nn.Module):
def __init__(self, noise_dim, output_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(noise_dim, 256),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.Linear(256, 512),
nn.BatchNorm1d(512),
nn.ReLU(),
nn.Linear(512, output_dim),
nn.Tanh(),
)
def forward(self, z):
return self.net(z)
class Discriminator(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, 512),
nn.LeakyReLU(0.2),
nn.Dropout(0.3),
nn.Linear(512, 256),
nn.LeakyReLU(0.2),
nn.Dropout(0.3),
nn.Linear(256, 1),
nn.Sigmoid(),
)
def forward(self, x):
return self.net(x)
# Training loop (sketch — optimizers assumed defined, e.g. Adam for each network)
bce = nn.BCELoss()
for epoch in range(num_epochs):
    for real_batch in dataloader:
        b = real_batch.size(0)
        real_labels = torch.ones(b, 1)
        fake_labels = torch.zeros(b, 1)
        # Train discriminator: push real -> 1, fake -> 0
        noise = torch.randn(b, noise_dim)
        fake = generator(noise)
        d_loss = (bce(discriminator(real_batch), real_labels) +
                  bce(discriminator(fake.detach()), fake_labels))
        d_optimizer.zero_grad(); d_loss.backward(); d_optimizer.step()
        # Train generator: push the discriminator's score on fakes toward 1
        g_loss = bce(discriminator(fake), real_labels)
        g_optimizer.zero_grad(); g_loss.backward(); g_optimizer.step()
Best for: Complex distributions, image data, time-series data. Higher quality but harder to train and validate.
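Because the generator above ends in `Tanh`, real data must be scaled into [-1, 1] before training and generated samples scaled back afterwards. A minimal min-max scaler pair (assuming columns with non-zero range; in practice fit the bounds on training data only):

```python
import numpy as np

def to_tanh_range(x, lo, hi):
    """Scale values from [lo, hi] into the generator's [-1, 1] output range."""
    return 2 * (x - lo) / (hi - lo) - 1

def from_tanh_range(x, lo, hi):
    """Invert the scaling to recover values in the original units."""
    return (x + 1) / 2 * (hi - lo) + lo
```

Forgetting this scaling is a common silent failure: the discriminator trivially separates unscaled real data from bounded fakes and the generator never learns.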
LLM-Based Generation
import json

def generate_synthetic_text_data(schema, n_samples, examples=None):
    """Use an LLM client (assumed available as `llm`) to generate structured synthetic data."""
prompt = f"""Generate {n_samples} realistic synthetic records following this schema.
Each record should be unique, realistic, and diverse.
Do NOT copy the examples — use them only to understand the format and style.
Schema:
{json.dumps(schema, indent=2)}
{"Examples (for reference only):" + json.dumps(examples[:3], indent=2) if examples else ""}
Generate {n_samples} records as a JSON array:"""
response = llm.generate(prompt, temperature=0.9, max_tokens=4000)
records = json.loads(response)
# Validate against schema
validated = [r for r in records if validate_schema(r, schema)]
return validated
# Example: Generate synthetic support tickets
schema = {
"ticket_id": "string (format: TKT-NNNN)",
"subject": "string (20-80 chars)",
"body": "string (50-500 chars)",
"category": "billing | technical | feature_request | account",
"priority": "low | medium | high | critical",
"customer_tier": "free | pro | enterprise",
"sentiment": "positive | neutral | negative | frustrated",
}
synthetic_tickets = generate_synthetic_text_data(schema, n_samples=500)
Best for: Text data, structured records, augmenting labeled datasets. Fast iteration, highly controllable.
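The generator above relies on a `validate_schema` helper that is not shown. A minimal, hypothetical version for the spec format used here — required keys present, and enum-style fields ("a | b | c") restricted to allowed values — might look like this; a production system should use JSON Schema or pydantic instead:

```python
def validate_schema(record, schema):
    """Check required keys and enum-style field values against the spec strings."""
    for field, spec in schema.items():
        if field not in record:
            return False
        # Treat specs like "low | medium | high" (no parenthesized hints) as enums
        if "|" in spec and "(" not in spec:
            allowed = {v.strip() for v in spec.split("|")}
            if record[field] not in allowed:
                return False
    return True
```

Free-text fields ("string (20-80 chars)") pass through unchecked here; add length and format checks as your schema language firms up.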
Differential Privacy + Synthetic Data
# Module path varies by smartnoise-sdk version; newer releases expose
# synthesizers via the `snsynth` package
from smartnoise.synthesizers import MWEMSynthesizer
# Generate synthetic data with formal privacy guarantees
synthesizer = MWEMSynthesizer(epsilon=1.0, split_factor=3)
synthesizer.fit(sensitive_data, preprocessor_eps=0.1)
private_synthetic = synthesizer.sample(n_samples=5000)
# Privacy guarantee: epsilon=1.0 means any single record's
# presence or absence changes output probability by at most e^1 ≈ 2.7x
Best for: Healthcare, financial, and government data where formal privacy guarantees are legally required.
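To build intuition for what epsilon buys, here is the classic Laplace mechanism on a count query, sketched from first principles: a single record changes a count by at most 1 (sensitivity 1), so adding Laplace noise with scale 1/epsilon yields epsilon-differential privacy for that query. Smaller epsilon means more noise and stronger privacy.

```python
import numpy as np

def private_count(values, predicate, epsilon, seed=None):
    """Release a count with epsilon-DP via the Laplace mechanism (sensitivity 1)."""
    rng = np.random.default_rng(seed)
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise
```

At epsilon=1.0 the noise standard deviation is about 1.4 counts; at epsilon=0.1 it is about 14, which is why very strict budgets can make small-population statistics unusable.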
Quality Validation
Statistical Fidelity Tests
| Test | What It Measures | Acceptable Threshold |
|---|---|---|
| Column distribution (KS test) | Per-column statistical similarity | p-value > 0.05 |
| Correlation preservation | Pairwise column correlations | Mean absolute error < 0.1 |
| Joint distribution | Multi-column joint probability | Likelihood ratio > 0.8 |
| Cardinality match | Unique values per categorical column | Within 10% of real data |
| Boundary adherence | Min/max/range of numerical columns | No out-of-range values |
| Temporal patterns | Time-series autocorrelation | ACF difference < 0.15 |
from scipy.stats import ks_2samp

def validate_synthetic_quality(real_data, synthetic_data):
    report = {"overall_score": 0, "column_scores": {}}
    scores = []
    for col in real_data.columns:
        if real_data[col].dtype in ['float64', 'int64']:
            # KS test for numerical columns (p-value: higher = more similar)
            stat, p_value = ks_2samp(real_data[col], synthetic_data[col])
            col_score = p_value
        else:
            # Total variation distance between categorical distributions
            real_dist = real_data[col].value_counts(normalize=True)
            synth_dist = synthetic_data[col].value_counts(normalize=True)
            synth_dist = synth_dist.reindex(real_dist.index, fill_value=0)
            col_score = 1 - 0.5 * (real_dist - synth_dist).abs().sum()
        report["column_scores"][col] = round(col_score, 3)
        scores.append(col_score)
    report["overall_score"] = round(sum(scores) / len(scores), 3)
    return report
ML Utility Test
The ultimate test: does a model trained on synthetic data perform similarly to one trained on real data?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def ml_utility_test(real_data, synthetic_data, target_column):
    """Train on synthetic, test on real. Compare to a train-on-real baseline."""
X_real = real_data.drop(columns=[target_column])
y_real = real_data[target_column]
X_synth = synthetic_data.drop(columns=[target_column])
y_synth = synthetic_data[target_column]
X_train_real, X_test, y_train_real, y_test = train_test_split(
X_real, y_real, test_size=0.2, random_state=42
)
# Baseline: train on real
model_real = RandomForestClassifier(n_estimators=100)
model_real.fit(X_train_real, y_train_real)
score_real = model_real.score(X_test, y_test)
# Test: train on synthetic
model_synth = RandomForestClassifier(n_estimators=100)
model_synth.fit(X_synth, y_synth)
score_synth = model_synth.score(X_test, y_test)
utility_ratio = score_synth / score_real
return {
"baseline_accuracy": score_real,
"synthetic_accuracy": score_synth,
"utility_ratio": utility_ratio, # Target: > 0.9
"verdict": "PASS" if utility_ratio > 0.9 else "FAIL",
}
Privacy Leakage Test
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def membership_inference_attack(real_data, synthetic_data):
    """Simplified proxy for membership inference risk: can a classifier tell
    real rows from synthetic ones? (Assumes numeric features; a full attack
    would also compare against holdout records never used for generation.)"""
    # Train a classifier to distinguish real from synthetic
    combined = pd.concat([
        real_data.assign(label=1),
        synthetic_data.assign(label=0)
    ])
    X = combined.drop(columns=['label'])
    y = combined['label']
    attack_model = RandomForestClassifier(n_estimators=50)
scores = cross_val_score(attack_model, X, y, cv=5, scoring='roc_auc')
# AUC close to 0.5 = attacker can't distinguish = good privacy
# AUC close to 1.0 = attacker can distinguish = privacy leak
mean_auc = scores.mean()
return {
"attack_auc": round(mean_auc, 3),
"privacy_verdict": "SAFE" if mean_auc < 0.6 else "LEAKING",
"recommendation": (
"Synthetic data preserves privacy" if mean_auc < 0.6
else "Increase noise or use differential privacy"
),
}
Production Pipeline
Source Data → Privacy Review → Schema Extraction
↓
Generation (Statistical / GAN / LLM)
↓
Quality Validation (Stats + ML Utility + Privacy)
↓
Pass? → Register in Data Catalog
Fail? → Adjust parameters, regenerate
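The pass/fail gate above can be sketched as a small orchestration function. The `generate` and check callables here are hypothetical placeholders; wire in your real synthesizer plus the fidelity, utility, and privacy tests from the earlier sections.

```python
def synthetic_data_gate(generate, checks, max_attempts=3):
    """Regenerate with adjusted parameters until every check passes or we give up."""
    for attempt in range(1, max_attempts + 1):
        data = generate(attempt)  # the attempt number can drive parameter tweaks
        results = {name: check(data) for name, check in checks.items()}
        if all(results.values()):
            return {"status": "PASS", "attempt": attempt, "data": data}
    return {"status": "FAIL", "attempt": max_attempts, "data": None}
```

A FAIL result should block catalog registration; bounded retries keep a misconfigured synthesizer from looping forever in CI.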
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No quality validation | Synthetic data doesn’t match real distribution | Always run statistical fidelity + ML utility tests |
| Overfitting to real data | Synthetic data memorizes real records | Test with membership inference attacks, add noise |
| Using synthetic data without labels | Generated labels may be wrong | Validate label accuracy against business rules or expert review |
| One-time generation | Stale synthetic data as real-world distribution shifts | Regenerate periodically tied to data pipeline refreshes |
| Ignoring domain constraints | Generated data violates business rules (negative ages, future dates) | Add constraint validation layer post-generation |
| "Synthetic = safe" assumption | Synthetic data can still leak private info without differential privacy | Always test with membership inference before sharing |
Synthetic Data Checklist
- Use case identified: augmentation, privacy, testing, or sharing
- Generation method selected based on data type and requirements
- Statistical fidelity validated (KS test, correlation, distribution)
- ML utility tested: synthetic-trained model ≥ 90% of real-trained baseline
- Privacy tested: membership inference AUC < 0.6
- Domain constraints enforced (value ranges, business rules, referential integrity)
- Labeled data validated against expert ground truth
- Generation pipeline automated and versioned
- Data catalog updated with synthetic dataset metadata
- Periodic regeneration scheduled to match distribution drift
- Legal review: synthetic data compliant with data sharing agreements
- Documentation: generation parameters, quality scores, known limitations
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For synthetic data engineering consulting, visit garnetgrid.com. :::