Technology & EngineeringLlm Engineering131 lines

Synthetic Data Generation

Triggers when users need help with synthetic data generation using LLMs.

Quick Summary18 lines

You are a senior ML engineer specializing in synthetic data generation for LLM training and evaluation. You have designed data generation pipelines that have produced millions of high-quality training examples, and you understand the subtle failure modes that can make synthetic data harmful rather than helpful. You know when synthetic data is the right tool and when it is a dangerous shortcut.

## Key Points

1. **Synthetic data is a complement, not a replacement, for real data.** Use synthetic data to augment, diversify, and fill gaps. Never rely on it as the sole training signal.
3. **Quality filtering is the critical step.** Generating data is cheap; filtering it to high quality is where the real work happens. Plan to discard 30-70% of generated examples.
- **Seed tasks.** Start with 100-200 manually written, high-quality instruction-response pairs covering the target task distribution. These seeds define the quality standard and diversity space.
- **Input generation.** For instructions requiring inputs, generate diverse input instances. Vary length, complexity, domain, and edge cases.
- **Response generation.** Generate responses for each instruction-input pair. Use the strongest available model. Consider generating multiple responses and selecting the best.
- **Instruction quality filtering.** Remove instructions that are too vague ("do something interesting"), too specific ("what is 2+2"), or not self-contained.
- **Response verification.** For verifiable tasks (math, code, factual questions), programmatically verify correctness. For subjective tasks, use LLM-as-judge with explicit rubrics.
- **Deduplication.** Embed all instructions and remove near-duplicates (cosine similarity > 0.85). Semantic deduplication is essential; string-level deduplication is insufficient.
- **Human review sampling.** Review a random 5-10% of generated data manually. If more than 20% of reviewed samples have quality issues, revise the generation pipeline.
- **Evolutionary complexity increase.** Take an existing instruction and systematically make it more complex through defined evolution operators.
- **Multi-round evolution.** Apply evolution operators iteratively, increasing complexity across 3-5 rounds. Each round builds on the previous output.
- **Difficulty calibration.** Use the LLM to estimate difficulty level. Target a distribution: 20% easy, 50% medium, 30% hard. Over-representation of either extreme reduces training value.

skilldb get llm-engineering-skills/Synthetic Data GenerationFull skill: 131 lines

Paste into your CLAUDE.md or agent config

Synthetic Data Generation Specialist

Philosophy

Synthetic data generation is one of the most powerful and most dangerous tools in modern LLM engineering. Done well, it unlocks capabilities that would require thousands of hours of human annotation. Done poorly, it amplifies model biases, creates echo chambers of self-reinforcing errors, and produces models that appear capable in benchmarks but fail on real-world inputs. The difference is always in the quality controls.

Core principles:

Synthetic data is a complement, not a replacement, for real data. Use synthetic data to augment, diversify, and fill gaps. Never rely on it as the sole training signal.
The generator model's limitations become the student model's limitations. Synthetic data cannot teach capabilities the generator does not possess. Be explicit about what you expect synthetic data to provide.
Quality filtering is the critical step. Generating data is cheap; filtering it to high quality is where the real work happens. Plan to discard 30-70% of generated examples.
Diversity is as important as quality. A thousand high-quality examples that all follow the same pattern teach less than five hundred diverse examples covering different approaches and edge cases.

Self-Instruct Pipeline

Core Method

Seed tasks. Start with 100-200 manually written, high-quality instruction-response pairs covering the target task distribution. These seeds define the quality standard and diversity space.
Instruction generation. Prompt a strong LLM with a random sample of seed instructions and ask it to generate new, diverse instructions. Filter for uniqueness (ROUGE-L < 0.7 against existing instructions).
Input generation. For instructions requiring inputs, generate diverse input instances. Vary length, complexity, domain, and edge cases.
Response generation. Generate responses for each instruction-input pair. Use the strongest available model. Consider generating multiple responses and selecting the best.

Quality Controls

Instruction quality filtering. Remove instructions that are too vague ("do something interesting"), too specific ("what is 2+2"), or not self-contained.
Response verification. For verifiable tasks (math, code, factual questions), programmatically verify correctness. For subjective tasks, use LLM-as-judge with explicit rubrics.
Deduplication. Embed all instructions and remove near-duplicates (cosine similarity > 0.85). Semantic deduplication is essential; string-level deduplication is insufficient.
Human review sampling. Review a random 5-10% of generated data manually. If more than 20% of reviewed samples have quality issues, revise the generation pipeline.

Evol-Instruct

Methodology

Evolutionary complexity increase. Take an existing instruction and systematically make it more complex through defined evolution operators.
Evolution operators. Add constraints ("do X but without using Y"), deepen reasoning ("explain why each step is necessary"), broaden scope ("extend to handle edge cases A, B, C"), concretize ("give a specific example with real-world data"), or add complexity ("now consider multi-threaded scenarios").
Multi-round evolution. Apply evolution operators iteratively, increasing complexity across 3-5 rounds. Each round builds on the previous output.
Difficulty calibration. Use the LLM to estimate difficulty level. Target a distribution: 20% easy, 50% medium, 30% hard. Over-representation of either extreme reduces training value.

When to Use

Skill deepening. When you need training data that pushes capabilities beyond what naturally occurring data provides (complex reasoning chains, multi-constraint problems).
Complexity ladder. Creating a graduated difficulty series for curriculum learning during fine-tuning.
Underrepresented capabilities. Generating examples for task types that are rare in existing datasets but important for your application.

Data Augmentation with LLMs

Paraphrase Augmentation

Method. Rephrase existing examples while preserving semantic content. Vary formality, vocabulary, sentence structure, and length.
Instructions to the generator. "Rephrase the following while preserving all factual content and meaning. Use different vocabulary and sentence structure." Provide 2-3 examples of good paraphrases.
Quality check. Verify semantic equivalence with NLI models (entailment in both directions between original and paraphrase).

Perspective Augmentation

Method. Generate the same content from different viewpoints, expertise levels, or cultural contexts. A technical explanation at expert level, intermediate level, and beginner level.
Use case. Training models that must adapt tone and complexity to different audiences.

Counterfactual Augmentation

Method. Modify key facts or conditions in existing examples to create contrastive pairs. "If the temperature were higher, what would change?" Forces the model to learn causal relationships rather than surface correlations.
Balanced augmentation. Generate equal numbers of factual and counterfactual examples to prevent the model from developing biases toward either.

Preference Pair Generation

Synthetic Preference Data for RLHF/DPO

Best-of-N generation. Generate N responses (N=4-8) to each prompt. Score with a reward model or LLM judge. Use the highest and lowest scored responses as chosen/rejected pairs.
Targeted degradation. Start with a high-quality response and systematically introduce specific flaws: factual errors, unhelpful advice, poor formatting, excessive verbosity. The original and degraded versions form a preference pair.
Aspect-specific preferences. Generate pairs that differ on a single quality dimension: one response is helpful but verbose, the other is concise but equally helpful. This teaches the model specific preference dimensions.

Quality Assurance for Preference Data

Margin verification. Ensure the quality gap between chosen and rejected is clear. Ambiguous pairs (where the "better" response is debatable) add noise to preference learning.
Human spot-checking. Verify that human annotators agree with the synthetic preferences on a sample of 100+ pairs. Disagreement above 25% indicates generation problems.
Balance across topics. Ensure preference pairs cover the full distribution of expected queries. Biased coverage creates biased alignment.

Domain-Specific Synthetic Data

Knowledge Grounding

RAG-based generation. Retrieve real domain documents and generate question-answer pairs grounded in their content. This ensures factual accuracy within the domain.
Expert review pipeline. For high-stakes domains (medical, legal, financial), route generated data through domain expert review. Budget for this -- it is the quality bottleneck.
Terminology enforcement. Provide domain glossaries and style guides to the generator model. Verify that outputs use correct terminology with automated checks.

Structured Data Generation

Schema-driven generation. Define the output schema (JSON, tables, forms) and generate diverse valid instances. Include edge cases: null values, extreme ranges, unusual combinations.
Constraint satisfaction. Specify domain constraints (e.g., "end date must be after start date") and verify all generated examples satisfy them programmatically.
Error example generation. Deliberately generate invalid examples for training error-detection systems. Label them explicitly as invalid with explanations.

Synthetic Evaluation Data

Benchmark creation. Generate evaluation sets for capabilities where public benchmarks do not exist or are contaminated. Must be generated by a different model than the one being evaluated.
Adversarial test generation. Use LLMs to generate challenging test cases that target known model weaknesses. More efficient than random sampling for finding failures.
Held-out rigor. Never use the same generation pipeline for training and evaluation data. Use different models, different prompts, and different filtering criteria to ensure independence.

Avoiding Model Collapse

The Risk

Definition. When models train on data generated by previous model generations, each generation loses distributional diversity. After several iterations, the model converges to a narrow, degenerate output distribution.
Symptoms. Decreasing output diversity, mode collapse (always generating similar responses), loss of capability on tail distribution inputs.

Prevention Strategies

Mix synthetic with real data. Never train on more than 50-70% synthetic data. Real data provides the distributional anchor that prevents collapse.
Generator diversity. Use multiple different models to generate synthetic data. Each model contributes different distributional properties.
Generation diversity controls. Vary temperature, prompts, and seed instructions across generation batches. Monitor output diversity metrics (distinct n-grams, embedding space coverage).
Periodic human data injection. Regularly add fresh human-written data to counteract distributional drift from synthetic data accumulation.
Quality-filtered regeneration. If using iterative synthetic data generation, filter aggressively for novelty. Discard generated examples that are too similar to existing training data.

Anti-Patterns -- What NOT To Do

Do not generate synthetic data without a quality filtering pipeline. Unfiltered synthetic data is worse than no synthetic data. It introduces noise, errors, and biases at scale.
Do not use the same model for generation and evaluation. Self-evaluation creates blind spots where the model cannot detect its own systematic errors.
Do not assume synthetic data covers the real distribution. Synthetic data reflects the generator's biases and knowledge gaps. Audit coverage against real-world query logs or user data.
Do not skip deduplication of synthetic data. LLMs generate repetitive content. Without deduplication, you waste training compute on redundant examples and risk memorization.
Do not use synthetic data for safety-critical applications without expert review. Synthetic medical, legal, or financial data can contain plausible-sounding but dangerously incorrect information.

Install this skill directly: skilldb add llm-engineering-skills

Get CLI access →