
Synthetic Data Generation for ML Training

Generate high-quality synthetic data for machine learning. Covers statistical methods, GANs, LLM-based generation, privacy preservation, quality validation, and production pipelines.

Real data is expensive, sensitive, and never enough. Synthetic data generation solves the three biggest bottlenecks in ML development: insufficient training data, privacy constraints that block data sharing, and class imbalance that cripples model performance. Gartner estimates that by 2026, 75% of enterprises will use synthetic data in AI development, up from less than 10% in 2023.

But synthetic data does not come with quality for free. Generate it poorly and you train models on noise, teaching them to hallucinate patterns that don't exist in the real world. This guide covers when to use synthetic data, how to generate it effectively, how to validate quality, and how to avoid the failure modes that make it worse than useless.


When to Use Synthetic Data

| Scenario | Why Synthetic Data Helps | Risk If Done Badly |
| --- | --- | --- |
| Insufficient training data | Bootstrap small datasets to trainable size | Models overfit to synthetic artifacts, not real patterns |
| Privacy compliance | Train on synthetic versions of PII/PHI data | Synthetic data can still leak private information if poorly generated |
| Class imbalance | Generate minority class examples for better recall | Synthetic minority samples may not represent real distribution |
| Edge case coverage | Create rare scenarios (fraud, failures, anomalies) | Unrealistic edge cases teach the model wrong signals |
| Data sharing | Share synthetic data across teams/partners without privacy risk | Loss of statistical fidelity makes shared data useless |
| Testing & QA | Generate test data for CI/CD pipelines | Test data that doesn't match production distributions gives false confidence |

Generation Methods

Statistical Methods

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

# Fit to real data distribution
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic samples
synthetic_data = synthesizer.sample(num_rows=10000)

# Validate statistical similarity
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(real_data, synthetic_data, metadata)
print(f"Overall quality score: {quality_report.get_score()}")

Best for: Tabular data with well-defined statistical distributions. Fast, interpretable, good privacy properties.
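For intuition, the copula trick can be sketched by hand: model each column's marginal distribution with its empirical CDF, and the dependence between columns with a multivariate normal in rank space. A toy NumPy sketch of the idea (illustrative only, numeric columns assumed; this is not the SDV implementation):

```python
import numpy as np
from scipy.stats import norm, spearmanr

rng = np.random.default_rng(0)

# Toy "real" table: two dependent columns, one heavily skewed
z = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=1000)
real = np.column_stack([z[:, 0], np.exp(z[:, 1])])

# 1. Map each column into standard-normal "copula space" via its empirical CDF
def to_gauss(col):
    ranks = col.argsort().argsort() + 1          # 1..n rank of each value
    return norm.ppf(ranks / (len(col) + 1))      # uniform -> standard normal

gauss = np.column_stack([to_gauss(real[:, i]) for i in range(2)])

# 2. Fit the dependence structure: correlation in copula space
corr = np.corrcoef(gauss, rowvar=False)

# 3. Sample correlated normals, then invert through each empirical quantile
samples = rng.multivariate_normal([0, 0], corr, size=2000)
u = norm.cdf(samples)
synthetic = np.column_stack(
    [np.quantile(real[:, i], u[:, i]) for i in range(2)]
)

# Rank correlation survives the round trip
print(round(spearmanr(real[:, 0], real[:, 1])[0], 2),
      round(spearmanr(synthetic[:, 0], synthetic[:, 1])[0], 2))
```

Because marginals and dependence are modeled separately, the synthetic column stays within the observed value range while preserving the rank correlation of the original.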

GAN-Based Generation

import torch
from torch import nn

class Generator(nn.Module):
    def __init__(self, noise_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, output_dim),
            nn.Tanh(),
        )
    
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )
    
    def forward(self, x):
        return self.net(x)

# Training loop (optimizers, labels, and backprop added for completeness)
generator = Generator(noise_dim, output_dim)
discriminator = Discriminator(output_dim)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
bce = nn.BCELoss()

for epoch in range(num_epochs):
    for real_batch in dataloader:
        batch_size = real_batch.size(0)
        noise = torch.randn(batch_size, noise_dim)
        fake = generator(noise)

        # Train discriminator: push real toward 1, fake toward 0
        opt_d.zero_grad()
        d_loss = (bce(discriminator(real_batch), torch.ones(batch_size, 1))
                  + bce(discriminator(fake.detach()), torch.zeros(batch_size, 1)))
        d_loss.backward()
        opt_d.step()

        # Train generator: make the discriminator score fakes as real
        opt_g.zero_grad()
        g_loss = bce(discriminator(fake), torch.ones(batch_size, 1))
        g_loss.backward()
        opt_g.step()

Best for: Complex distributions, image data, time-series data. Higher quality but harder to train and validate.
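One GAN-specific failure worth screening for is mode collapse: the generator emits near-identical samples that individually look plausible. A cheap diagnostic is to compare sample diversity in fake vs. real batches; the ratio and thresholds below are illustrative choices, not a standard metric:

```python
import numpy as np

def mean_pairwise_dist(x):
    # Average Euclidean distance over all distinct pairs in a batch
    diffs = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    return d[np.triu_indices(len(x), k=1)].mean()

def collapse_ratio(real_batch, fake_batch):
    # Ratio << 1.0 suggests the generator is emitting near-identical samples
    return mean_pairwise_dist(fake_batch) / mean_pairwise_dist(real_batch)
```

Run this periodically during training; a ratio that trends toward zero is an early warning to adjust learning rates or loss formulation before wasting epochs.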

LLM-Based Generation

import json

def generate_synthetic_text_data(schema, n_samples, examples=None):
    """Use an LLM to generate structured synthetic data.
    Assumes `llm` is your LLM client and `validate_schema` your schema checker."""
    
    prompt = f"""Generate {n_samples} realistic synthetic records following this schema.
Each record should be unique, realistic, and diverse.
Do NOT copy the examples — use them only to understand the format and style.

Schema:
{json.dumps(schema, indent=2)}

{"Examples (for reference only):" + json.dumps(examples[:3], indent=2) if examples else ""}

Generate {n_samples} records as a JSON array:"""
    
    response = llm.generate(prompt, temperature=0.9, max_tokens=4000)
    records = json.loads(response)
    
    # Validate against schema
    validated = [r for r in records if validate_schema(r, schema)]
    
    return validated

# Example: Generate synthetic support tickets
schema = {
    "ticket_id": "string (format: TKT-NNNN)",
    "subject": "string (20-80 chars)",
    "body": "string (50-500 chars)",
    "category": "billing | technical | feature_request | account",
    "priority": "low | medium | high | critical",
    "customer_tier": "free | pro | enterprise",
    "sentiment": "positive | neutral | negative | frustrated",
}

synthetic_tickets = generate_synthetic_text_data(schema, n_samples=500)

Best for: Text data, structured records, augmenting labeled datasets. Fast iteration, highly controllable.
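LLMs asked for "unique, diverse" records still tend to repeat themselves across batches, so deduplicate before the records enter a training set. A minimal sketch using token-set Jaccard similarity; the `body` key and 0.8 threshold are illustrative choices for the ticket schema above:

```python
def jaccard(a: str, b: str) -> float:
    # Token-set overlap between two strings, in [0, 1]
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def drop_near_duplicates(records, key="body", threshold=0.8):
    # Keep a record only if it is sufficiently different from all kept so far
    kept = []
    for rec in records:
        if all(jaccard(rec[key], k[key]) < threshold for k in kept):
            kept.append(rec)
    return kept
```

For large batches, swap the quadratic scan for MinHash or embedding-based near-duplicate detection; the quality gate is the same.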

Differential Privacy + Synthetic Data

# MWEM synthesizer from the SmartNoise SDK; the exact import path varies
# by version (e.g. `snsynth` in recent smartnoise-synth releases)
from smartnoise.synthesizers import MWEMSynthesizer

# Generate synthetic data with formal privacy guarantees
synthesizer = MWEMSynthesizer(epsilon=1.0, split_factor=3)
synthesizer.fit(sensitive_data, preprocessor_eps=0.1)

private_synthetic = synthesizer.sample(n_samples=5000)

# Privacy guarantee: epsilon=1.0 means any single record's 
# presence or absence changes output probability by at most e^1 ≈ 2.7x

Best for: Healthcare, financial, and government data where formal privacy guarantees are legally required.
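The epsilon guarantee is easiest to see on the simplest DP primitive, the Laplace mechanism: adding or removing one record changes a count by at most 1 (sensitivity 1), so noise of scale 1/epsilon bounds any single record's influence on the released value. A minimal sketch, not tied to any particular DP library:

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_count(data, predicate, epsilon):
    """Release a count with epsilon-DP via the Laplace mechanism.
    Sensitivity of a counting query is 1, so noise scale = 1/epsilon."""
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Example: how many patients are over 40? (toy data)
ages = [34, 45, 29, 61, 38, 52, 47]
noisy = dp_count(ages, lambda a: a > 40, epsilon=1.0)
```

Smaller epsilon means more noise and stronger privacy; synthesizers like MWEM compose many such noisy measurements under a total epsilon budget.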


Quality Validation

Statistical Fidelity Tests

| Test | What It Measures | Acceptable Threshold |
| --- | --- | --- |
| Column distribution (KS test) | Per-column statistical similarity | p-value > 0.05 |
| Correlation preservation | Pairwise column correlations | Mean absolute error < 0.1 |
| Joint distribution | Multi-column joint probability | Likelihood ratio > 0.8 |
| Cardinality match | Unique values per categorical column | Within 10% of real data |
| Boundary adherence | Min/max/range of numerical columns | No out-of-range values |
| Temporal patterns | Time-series autocorrelation | ACF difference < 0.15 |

from scipy.stats import ks_2samp, wasserstein_distance

def validate_synthetic_quality(real_data, synthetic_data):
    report = {"overall_score": 0, "column_scores": {}}
    
    scores = []
    for col in real_data.columns:
        if real_data[col].dtype in ['float64', 'int64']:
            # KS test for numerical columns
            stat, p_value = ks_2samp(real_data[col], synthetic_data[col])
            col_score = p_value  # Higher = more similar
        else:
            # Compare category frequency distributions for categorical columns
            real_dist = real_data[col].value_counts(normalize=True)
            synth_dist = synthetic_data[col].value_counts(normalize=True)
            col_score = 1 - wasserstein_distance(
                real_dist.values, synth_dist.reindex(real_dist.index, fill_value=0).values
            )
        
        report["column_scores"][col] = round(col_score, 3)
        scores.append(col_score)
    
    report["overall_score"] = round(sum(scores) / len(scores), 3)
    return report

ML Utility Test

The ultimate test: does a model trained on synthetic data perform similarly to one trained on real data?

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def ml_utility_test(real_data, synthetic_data, target_column):
    """Train on synthetic, test on real. Compare to train-on-real baseline."""
    
    X_real = real_data.drop(columns=[target_column])
    y_real = real_data[target_column]
    X_synth = synthetic_data.drop(columns=[target_column])
    y_synth = synthetic_data[target_column]
    
    X_train_real, X_test, y_train_real, y_test = train_test_split(
        X_real, y_real, test_size=0.2, random_state=42
    )
    
    # Baseline: train on real
    model_real = RandomForestClassifier(n_estimators=100)
    model_real.fit(X_train_real, y_train_real)
    score_real = model_real.score(X_test, y_test)
    
    # Test: train on synthetic
    model_synth = RandomForestClassifier(n_estimators=100)
    model_synth.fit(X_synth, y_synth)
    score_synth = model_synth.score(X_test, y_test)
    
    utility_ratio = score_synth / score_real
    
    return {
        "baseline_accuracy": score_real,
        "synthetic_accuracy": score_synth,
        "utility_ratio": utility_ratio,  # Target: > 0.9
        "verdict": "PASS" if utility_ratio > 0.9 else "FAIL",
    }

Privacy Leakage Test

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def membership_inference_attack(real_data, synthetic_data):
    """Rough screen for privacy leakage: can an attacker determine whether
    a specific real record was used to generate the synthetic data?"""
    
    # Proxy: train a classifier to distinguish real from synthetic records.
    # If the classifier can't tell them apart, individual records are
    # unlikely to be exposed through the synthetic data.
    combined = pd.concat([
        real_data.assign(label=1),
        synthetic_data.assign(label=0)
    ])
    
    X = combined.drop(columns=['label'])
    y = combined['label']
    
    from sklearn.model_selection import cross_val_score
    attack_model = RandomForestClassifier(n_estimators=50)
    scores = cross_val_score(attack_model, X, y, cv=5, scoring='roc_auc')
    
    # AUC close to 0.5 = attacker can't distinguish = good privacy
    # AUC close to 1.0 = attacker can distinguish = privacy leak
    mean_auc = scores.mean()
    
    return {
        "attack_auc": round(mean_auc, 3),
        "privacy_verdict": "SAFE" if mean_auc < 0.6 else "LEAKING",
        "recommendation": (
            "Synthetic data preserves privacy" if mean_auc < 0.6
            else "Increase noise or use differential privacy"
        ),
    }

Production Pipeline

Source Data → Privacy Review → Schema Extraction
                    ↓
    Generation (Statistical / GAN / LLM)
                    ↓
    Quality Validation (Stats + ML Utility + Privacy)
                    ↓
        Pass? → Register in Data Catalog
        Fail? → Adjust parameters, regenerate
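The pass/fail loop above can be sketched as a retry wrapper. `generate` and `validate` are illustrative stand-ins for your generation and validation steps, and the noise-doubling adjustment is one hypothetical tuning strategy:

```python
def generation_pipeline(generate, validate, max_attempts=3):
    """Regenerate with adjusted parameters until the quality gate passes."""
    params = {"noise_scale": 0.1}
    for attempt in range(max_attempts):
        data = generate(params)
        report = validate(data)
        if report["pass"]:
            return data, report          # proceed to data catalog registration
        params["noise_scale"] *= 2       # adjust parameters, regenerate
    raise RuntimeError(f"Quality gate failed after {max_attempts} attempts")
```

The important property is that failure is loud: data that never passes the gate raises instead of silently flowing downstream.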

Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| No quality validation | Synthetic data doesn't match real distribution | Always run statistical fidelity + ML utility tests |
| Overfitting to real data | Synthetic data memorizes real records | Test with membership inference attacks, add noise |
| Using synthetic data without labels | Generated labels may be wrong | Validate label accuracy against business rules or expert review |
| One-time generation | Stale synthetic data as real-world distribution shifts | Regenerate periodically, tied to data pipeline refreshes |
| Ignoring domain constraints | Generated data violates business rules (negative ages, future dates) | Add constraint validation layer post-generation |
| "Synthetic = safe" assumption | Synthetic data can still leak private info without differential privacy | Always test with membership inference before sharing |

Synthetic Data Checklist

  • Use case identified: augmentation, privacy, testing, or sharing
  • Generation method selected based on data type and requirements
  • Statistical fidelity validated (KS test, correlation, distribution)
  • ML utility tested: synthetic-trained model ≥ 90% of real-trained baseline
  • Privacy tested: membership inference AUC < 0.6
  • Domain constraints enforced (value ranges, business rules, referential integrity)
  • Labeled data validated against expert ground truth
  • Generation pipeline automated and versioned
  • Data catalog updated with synthetic dataset metadata
  • Periodic regeneration scheduled to match distribution drift
  • Legal review: synthetic data compliant with data sharing agreements
  • Documentation: generation parameters, quality scores, known limitations

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For synthetic data engineering consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
