Tutorials
- End-to-end tutorials provide detailed step-by-step explanations and the code used for end-to-end workflows.
- Paper implementations provide reproductions of fundamental papers in the synthetic data domain.
- Examples show code for different tasks without detailed explanations.
End-to-end tutorials
- 
Generate a preference dataset 
 Learn about synthetic data generation for ORPO and DPO. 
- 
Clean an existing preference dataset 
 Learn how to use AI feedback to clean an existing dataset. 
- 
Retrieval and reranking models 
 Learn about synthetic data generation for fine-tuning custom retrieval and reranking models. 
- 
Generate text classification data 
 Learn how synthetic data generation for text classification can help address data imbalance or scarcity. 
Paper implementations
- 
Deepseek Prover 
 Learn about an approach to generate mathematical proofs for theorems generated from informal math problems. 
- 
DEITA 
 Learn about tuning prompts and responses for complexity and quality, and about using LLMs as judges for automatic data selection. 
- 
Instruction Backtranslation 
 Learn about automatically labeling human-written text with corresponding instructions. 
- 
Prometheus 2 
 Learn about using open-source models as judges for direct assessment and pairwise ranking. 
- 
UltraFeedback 
 Learn about a large-scale, fine-grained, diverse preference dataset, used for training powerful reward and critic models. 
- 
APIGen 
 Learn how to create verifiable, high-quality datasets for function-calling applications. 
- 
CLAIR 
 Learn about Contrastive Learning from AI Revisions (CLAIR), a data-creation method that produces more contrastive preference pairs. 
- 
Math Shepherd 
 Learn about Math-Shepherd, a framework for generating datasets to train process reward models (PRMs), which assign a reward score to each step of a math problem solution. 
Examples
- 
Benchmarking with distilabel 
 Learn about reproducing the Arena Hard benchmark with distilabel. 
- 
Structured generation with outlines 
 Learn about generating RPG characters following a pydantic.BaseModel with outlines in distilabel. 
- 
Structured generation with instructor 
 Learn about answering instructions with knowledge graphs defined as pydantic.BaseModel objects using instructor in distilabel. 
- 
Create a social network with FinePersonas 
 Learn how to leverage FinePersonas to create a synthetic social network and fine-tune adapters for Multi-LoRA. 
- 
Create questions and answers for an exam 
 Learn how to generate questions and answers for an exam, using a raw Wikipedia page and structured generation. 
- 
Image generation with distilabel 
 Generate synthetic images using distilabel. 
- 
Text generation with images in distilabel 
 Ask questions about images using distilabel.