Distilabel

Synthesize data for AI and add feedback on the fly!

Distilabel is the framework for synthetic data and AI feedback, built for AI engineers who require high-quality outputs, full data ownership, and overall efficiency.

If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!

Why use Distilabel?

Whether you are working on a predictive model that computes semantic similarity or the next generative model that is going to beat the LLM benchmarks, our framework ensures that the hard data work pays off. Distilabel is the missing piece that helps you synthesize data and provide AI feedback.

Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both problems at once. Distilabel helps you synthesize and judge data, so you can spend your valuable time on achieving and maintaining high quality standards for your data.

Take control of your data and models

Owning the data used to fine-tune your own LLMs is not easy, but Distilabel can help you get started. We integrate AI feedback from any LLM provider through one unified API.

Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with techniques from the latest research papers while ensuring flexibility, scalability, and fault tolerance, so you can focus on improving your data and training your models.

What do people build with Distilabel?

Distilabel is a tool for synthesizing data and providing AI feedback. Our community uses Distilabel to create amazing datasets and models, and we love contributing to open source ourselves too.

  • The 1M OpenHermesPreference is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to synthesize data on an immense scale.
  • Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
  • The haiku DPO data outlines how anyone can create a dataset for a specific task, using the latest research papers to improve the quality of the dataset.
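
To make the idea of filtering a preference dataset through AI feedback more concrete, here is a minimal sketch in plain Python. It does not use the Distilabel API and does not reproduce the criteria used for the datasets listed above: the `PreferencePair` record, the `filter_pairs` helper, the rating scale, and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Hypothetical record layout; real datasets will use their own columns.
    prompt: str
    chosen: str
    rejected: str
    chosen_rating: float    # assumed score given by an AI judge
    rejected_rating: float  # assumed score given by an AI judge


def filter_pairs(pairs, min_chosen=8.0, min_margin=1.0):
    """Keep pairs whose chosen answer is rated highly and clearly beats the rejected one."""
    return [
        p for p in pairs
        if p.chosen_rating >= min_chosen
        and p.chosen_rating - p.rejected_rating >= min_margin
    ]


pairs = [
    PreferencePair("q1", "a", "b", 9.0, 6.0),   # kept
    PreferencePair("q2", "a", "b", 8.0, 8.0),   # dropped: tie between answers
    PreferencePair("q3", "a", "b", 6.0, 2.0),   # dropped: chosen answer rated too low
    PreferencePair("q4", "a", "b", 10.0, 7.5),  # kept
]

kept = filter_pairs(pairs)
print(f"kept {len(kept)} of {len(pairs)} pairs")
```

In this toy input, half of the pairs are discarded. The point is that the judge's ratings, rather than manual review, decide which examples remain in the training set.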