Synthetic Data Generation for ML Training
Generate high-quality synthetic data for machine learning. Covers statistical methods, GANs, LLM-based generation, privacy preservation, quality validation, and production pipelines.
Real data is expensive, sensitive, and never enough. Synthetic data generation solves the three biggest bottlenecks in ML development: insufficient training data, privacy constraints that block data sharing, and class imbalance that cripples model performance. Gartner estimates that by 2026, 75% of enterprises will use synthetic data in AI development, up from less than 10% in 2023.
But synthetic data is not automatically high quality. Generate it poorly and you train models on noise, teaching them to hallucinate patterns that don’t exist in the real world. This guide covers when to use synthetic data, how to generate it effectively, how to validate quality, and how to avoid the failure modes that make it worse than useless.
When to Use Synthetic Data
| Scenario | Why Synthetic Data Helps | Risk If Done Badly |
|---|---|---|
| Insufficient training data | Bootstrap small datasets to trainable size | Models overfit to synthetic artifacts, not real patterns |
| Privacy compliance | Train on synthetic versions of PII/PHI data | Synthetic data can still leak private information if poorly generated |
| Class imbalance | Generate minority class examples for better recall | Synthetic minority samples may not represent real distribution |
| Edge case coverage | Create rare scenarios (fraud, failures, anomalies) | Unrealistic edge cases teach the model wrong signals |
| Data sharing | Share synthetic data across teams/partners without privacy risk | Loss of statistical fidelity makes shared data useless |
| Testing & QA | Generate test data for CI/CD pipelines | Test data that doesn’t match production distributions gives false confidence |
Generation Methods
Statistical Methods
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
# Fit to real data distribution
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# Generate synthetic samples
synthetic_data = synthesizer.sample(num_rows=10000)
# Validate statistical similarity
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(real_data, synthetic_data, metadata)
print(f"Overall quality score: {quality_report.get_score()}")
Best for: Tabular data with well-defined statistical distributions. Fast, interpretable, good privacy properties.
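To build intuition for what a Gaussian copula synthesizer does internally, here is a minimal sketch using only NumPy and SciPy: model each column's marginal with its empirical CDF, capture dependence with a multivariate normal, then map samples back through the inverse empirical CDFs. This is an illustration of the idea, not SDV's actual implementation.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(real, n_rows, seed=0):
    """Sketch of the Gaussian copula idea for numeric tabular data."""
    rng = np.random.default_rng(seed)
    real = np.asarray(real, dtype=float)
    # 1. Map each column to (0, 1) via its empirical CDF (rank transform)
    ranks = np.argsort(np.argsort(real, axis=0), axis=0)
    u = (ranks + 1) / (real.shape[0] + 1)
    # 2. Map uniforms to standard normals and estimate their correlation
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # 3. Sample correlated normals, map back through inverse empirical CDFs
    z_new = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n_rows)
    u_new = stats.norm.cdf(z_new)
    synth = np.column_stack([
        np.quantile(real[:, j], u_new[:, j]) for j in range(real.shape[1])
    ])
    return synth
```

Because step 3 maps through quantiles of the real data, samples stay inside each column's observed range, which is one reason copula methods adhere well to boundaries.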
GAN-Based Generation
import torch
from torch import nn
class Generator(nn.Module):
def __init__(self, noise_dim, output_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(noise_dim, 256),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.Linear(256, 512),
nn.BatchNorm1d(512),
nn.ReLU(),
nn.Linear(512, output_dim),
nn.Tanh(),
)
def forward(self, z):
return self.net(z)
class Discriminator(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, 512),
nn.LeakyReLU(0.2),
nn.Dropout(0.3),
nn.Linear(512, 256),
nn.LeakyReLU(0.2),
nn.Dropout(0.3),
nn.Linear(256, 1),
nn.Sigmoid(),
)
def forward(self, x):
return self.net(x)
# Training loop (sketch — optimizers assumed defined, e.g. Adam for each network)
bce = nn.BCELoss()
for epoch in range(num_epochs):
    for real_batch in dataloader:
        b = real_batch.size(0)
        real_labels = torch.ones(b, 1)
        fake_labels = torch.zeros(b, 1)
        # Train discriminator: push real -> 1, fake -> 0
        noise = torch.randn(b, noise_dim)
        fake = generator(noise)
        d_loss = (bce(discriminator(real_batch), real_labels) +
                  bce(discriminator(fake.detach()), fake_labels))
        d_optimizer.zero_grad(); d_loss.backward(); d_optimizer.step()
        # Train generator: push the discriminator's score on fakes toward 1
        g_loss = bce(discriminator(fake), real_labels)
        g_optimizer.zero_grad(); g_loss.backward(); g_optimizer.step()
Best for: Complex distributions, image data, time-series data. Higher quality but harder to train and validate.
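Because the generator above ends in `Tanh`, real data must be scaled into [-1, 1] before training and generated samples scaled back afterwards. A minimal min-max scaler pair (assuming columns with non-zero range; in practice fit the bounds on training data only):

```python
import numpy as np

def to_tanh_range(x, lo, hi):
    """Scale values from [lo, hi] into the generator's [-1, 1] output range."""
    return 2 * (x - lo) / (hi - lo) - 1

def from_tanh_range(x, lo, hi):
    """Invert the scaling to recover values in the original units."""
    return (x + 1) / 2 * (hi - lo) + lo
```

Forgetting this scaling is a common silent failure: the discriminator trivially separates unscaled real data from bounded fakes and the generator never learns.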
LLM-Based Generation
import json

def generate_synthetic_text_data(schema, n_samples, examples=None):
    """Use an LLM client (assumed available as `llm`) to generate structured synthetic data."""
prompt = f"""Generate {n_samples} realistic synthetic records following this schema.
Each record should be unique, realistic, and diverse.
Do NOT copy the examples — use them only to understand the format and style.
Schema:
{json.dumps(schema, indent=2)}
{"Examples (for reference only):" + json.dumps(examples[:3], indent=2) if examples else ""}
Generate {n_samples} records as a JSON array:"""
response = llm.generate(prompt, temperature=0.9, max_tokens=4000)
records = json.loads(response)
# Validate against schema
validated = [r for r in records if validate_schema(r, schema)]
return validated
# Example: Generate synthetic support tickets
schema = {
"ticket_id": "string (format: TKT-NNNN)",
"subject": "string (20-80 chars)",
"body": "string (50-500 chars)",
"category": "billing | technical | feature_request | account",
"priority": "low | medium | high | critical",
"customer_tier": "free | pro | enterprise",
"sentiment": "positive | neutral | negative | frustrated",
}
synthetic_tickets = generate_synthetic_text_data(schema, n_samples=500)
Best for: Text data, structured records, augmenting labeled datasets. Fast iteration, highly controllable.
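The generator above relies on a `validate_schema` helper that is not shown. A minimal, hypothetical version for the spec format used here — required keys present, and enum-style fields ("a | b | c") restricted to allowed values — might look like this; a production system should use JSON Schema or pydantic instead:

```python
def validate_schema(record, schema):
    """Check required keys and enum-style field values against the spec strings."""
    for field, spec in schema.items():
        if field not in record:
            return False
        # Treat specs like "low | medium | high" (no parenthesized hints) as enums
        if "|" in spec and "(" not in spec:
            allowed = {v.strip() for v in spec.split("|")}
            if record[field] not in allowed:
                return False
    return True
```

Free-text fields ("string (20-80 chars)") pass through unchecked here; add length and format checks as your schema language firms up.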
Differential Privacy + Synthetic Data
# Module path varies by smartnoise-sdk version; newer releases expose
# synthesizers via the `snsynth` package
from smartnoise.synthesizers import MWEMSynthesizer
# Generate synthetic data with formal privacy guarantees
synthesizer = MWEMSynthesizer(epsilon=1.0, split_factor=3)
synthesizer.fit(sensitive_data, preprocessor_eps=0.1)
private_synthetic = synthesizer.sample(n_samples=5000)
# Privacy guarantee: epsilon=1.0 means any single record's
# presence or absence changes output probability by at most e^1 ≈ 2.7x
Best for: Healthcare, financial, and government data where formal privacy guarantees are legally required.
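To build intuition for what epsilon buys, here is the classic Laplace mechanism on a count query, sketched from first principles: a single record changes a count by at most 1 (sensitivity 1), so adding Laplace noise with scale 1/epsilon yields epsilon-differential privacy for that query. Smaller epsilon means more noise and stronger privacy.

```python
import numpy as np

def private_count(values, predicate, epsilon, seed=None):
    """Release a count with epsilon-DP via the Laplace mechanism (sensitivity 1)."""
    rng = np.random.default_rng(seed)
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise
```

At epsilon=1.0 the noise standard deviation is about 1.4 counts; at epsilon=0.1 it is about 14, which is why very strict budgets can make small-population statistics unusable.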
Quality Validation
Statistical Fidelity Tests
| Test | What It Measures | Acceptable Threshold |
|---|---|---|
| Column distribution (KS test) | Per-column statistical similarity | p-value > 0.05 |
| Correlation preservation | Pairwise column correlations | Mean absolute error < 0.1 |
| Joint distribution | Multi-column joint probability | Likelihood ratio > 0.8 |
| Cardinality match | Unique values per categorical column | Within 10% of real data |
| Boundary adherence | Min/max/range of numerical columns | No out-of-range values |
| Temporal patterns | Time-series autocorrelation | ACF difference < 0.15 |
from scipy.stats import ks_2samp

def validate_synthetic_quality(real_data, synthetic_data):
    report = {"overall_score": 0, "column_scores": {}}
    scores = []
    for col in real_data.columns:
        if real_data[col].dtype in ['float64', 'int64']:
            # KS test for numerical columns (p-value: higher = more similar)
            stat, p_value = ks_2samp(real_data[col], synthetic_data[col])
            col_score = p_value
        else:
            # Total variation distance between categorical distributions
            real_dist = real_data[col].value_counts(normalize=True)
            synth_dist = synthetic_data[col].value_counts(normalize=True)
            synth_dist = synth_dist.reindex(real_dist.index, fill_value=0)
            col_score = 1 - 0.5 * (real_dist - synth_dist).abs().sum()
        report["column_scores"][col] = round(col_score, 3)
        scores.append(col_score)
    report["overall_score"] = round(sum(scores) / len(scores), 3)
    return report
ML Utility Test
The ultimate test: does a model trained on synthetic data perform similarly to one trained on real data?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def ml_utility_test(real_data, synthetic_data, target_column):
    """Train on synthetic, test on real. Compare to a train-on-real baseline."""
X_real = real_data.drop(columns=[target_column])
y_real = real_data[target_column]
X_synth = synthetic_data.drop(columns=[target_column])
y_synth = synthetic_data[target_column]
X_train_real, X_test, y_train_real, y_test = train_test_split(
X_real, y_real, test_size=0.2, random_state=42
)
# Baseline: train on real
model_real = RandomForestClassifier(n_estimators=100)
model_real.fit(X_train_real, y_train_real)
score_real = model_real.score(X_test, y_test)
# Test: train on synthetic
model_synth = RandomForestClassifier(n_estimators=100)
model_synth.fit(X_synth, y_synth)
score_synth = model_synth.score(X_test, y_test)
utility_ratio = score_synth / score_real
return {
"baseline_accuracy": score_real,
"synthetic_accuracy": score_synth,
"utility_ratio": utility_ratio, # Target: > 0.9
"verdict": "PASS" if utility_ratio > 0.9 else "FAIL",
}
Privacy Leakage Test
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def membership_inference_attack(real_data, synthetic_data):
    """Simplified proxy for membership inference risk: can a classifier tell
    real rows from synthetic ones? (Assumes numeric features; a full attack
    would also compare against holdout records never used for generation.)"""
    # Train a classifier to distinguish real from synthetic
    combined = pd.concat([
        real_data.assign(label=1),
        synthetic_data.assign(label=0)
    ])
    X = combined.drop(columns=['label'])
    y = combined['label']
    attack_model = RandomForestClassifier(n_estimators=50)
scores = cross_val_score(attack_model, X, y, cv=5, scoring='roc_auc')
# AUC close to 0.5 = attacker can't distinguish = good privacy
# AUC close to 1.0 = attacker can distinguish = privacy leak
mean_auc = scores.mean()
return {
"attack_auc": round(mean_auc, 3),
"privacy_verdict": "SAFE" if mean_auc < 0.6 else "LEAKING",
"recommendation": (
"Synthetic data preserves privacy" if mean_auc < 0.6
else "Increase noise or use differential privacy"
),
}
Production Pipeline
Source Data → Privacy Review → Schema Extraction
↓
Generation (Statistical / GAN / LLM)
↓
Quality Validation (Stats + ML Utility + Privacy)
↓
Pass? → Register in Data Catalog
Fail? → Adjust parameters, regenerate
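The pass/fail gate above can be sketched as a small orchestration function. The `generate` and check callables here are hypothetical placeholders; wire in your real synthesizer plus the fidelity, utility, and privacy tests from the earlier sections.

```python
def synthetic_data_gate(generate, checks, max_attempts=3):
    """Regenerate with adjusted parameters until every check passes or we give up."""
    for attempt in range(1, max_attempts + 1):
        data = generate(attempt)  # the attempt number can drive parameter tweaks
        results = {name: check(data) for name, check in checks.items()}
        if all(results.values()):
            return {"status": "PASS", "attempt": attempt, "data": data}
    return {"status": "FAIL", "attempt": max_attempts, "data": None}
```

A FAIL result should block catalog registration; bounded retries keep a misconfigured synthesizer from looping forever in CI.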
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No quality validation | Synthetic data doesn’t match real distribution | Always run statistical fidelity + ML utility tests |
| Overfitting to real data | Synthetic data memorizes real records | Test with membership inference attacks, add noise |
| Using synthetic data without labels | Generated labels may be wrong | Validate label accuracy against business rules or expert review |
| One-time generation | Stale synthetic data as real-world distribution shifts | Regenerate periodically tied to data pipeline refreshes |
| Ignoring domain constraints | Generated data violates business rules (negative ages, future dates) | Add constraint validation layer post-generation |
| "Synthetic = safe" assumption | Synthetic data can still leak private info without differential privacy | Always test with membership inference before sharing |
Synthetic Data Checklist
- Use case identified: augmentation, privacy, testing, or sharing
- Generation method selected based on data type and requirements
- Statistical fidelity validated (KS test, correlation, distribution)
- ML utility tested: synthetic-trained model ≥ 90% of real-trained baseline
- Privacy tested: membership inference AUC < 0.6
- Domain constraints enforced (value ranges, business rules, referential integrity)
- Labeled data validated against expert ground truth
- Generation pipeline automated and versioned
- Data catalog updated with synthetic dataset metadata
- Periodic regeneration scheduled to match distribution drift
- Legal review: synthetic data compliant with data sharing agreements
- Documentation: generation parameters, quality scores, known limitations
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For synthetic data engineering consulting, visit garnetgrid.com. :::