Skip to content

Distilabel

Distilabel Logo Distilabel Logo

Synthesize data for AI and add feedback on the fly!

CI CI

Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!

Why use distilabel?

Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time achieving and keeping high-quality standards for your synthetic data.

Take control of your data and models

Ownership of data for fine-tuning your own LLMs is not easy but distilabel can help you to get started. We integrate AI feedback from any LLM provider out there using one unified API.

Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with latest research papers while ensuring flexibility, scalability and fault tolerance. So you can focus on improving your data and training your models.

What do people build with distilabel?

The Argilla community uses distilabel to create amazing datasets and models.

  • The 1M OpenHermesPreference is a dataset of ~1 million AI preferences that have been generated using the teknium/OpenHermes-2.5 LLM. It is a great example of how you can use distilabel to scale and increase dataset development.
  • distilabeled Intel Orca DPO dataset used to fine-tune the improved OpenHermes model. This dataset was built by combining human curation in Argilla with AI feedback from distilabel, leading to an improved version of the Intel Orca dataset and outperforming models fine-tuned on the original dataset.
  • The haiku DPO data is an example of how anyone can create a synthetic dataset for a specific task, which after curation and evaluation can be used for fine-tuning custom LLMs.