distilabel

AI Feedback (AIF) framework to build datasets with and for LLMs:

  • Integrations with the most popular libraries and APIs for LLMs: HF Transformers, OpenAI, vLLM, etc.
  • Multiple tasks for Self-Instruct, Preference datasets and more.
  • Dataset export to Argilla for easy data exploration and further annotation.

Installation

pip install distilabel
Requires Python 3.8+

In addition, the following extras are available:

  • hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
  • hf-inference-endpoints: for using Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
  • openai: for using OpenAI API models via the OpenAILLM integration.
  • vllm: for using the vLLM serving engine via the vLLM integration.
  • llama-cpp: for using llama-cpp-python, the Python bindings for llama.cpp.
  • together: for using Together Inference via its Python client.
  • argilla: for exporting the generated datasets to Argilla.
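Extras are combined in a single install by listing them in brackets after the package name. The sketch below is illustrative only: `installed_extras` and `install_command` are hypothetical helpers, and the mapping from extra names to importable module names is an assumption rather than something stated by distilabel.

```python
from importlib.util import find_spec

# ASSUMPTION: the top-level module each extra is expected to provide.
# Adjust the mapping if the real module names differ.
EXTRA_MODULES = {
    "hf-transformers": "transformers",
    "hf-inference-endpoints": "huggingface_hub",
    "openai": "openai",
    "vllm": "vllm",
    "llama-cpp": "llama_cpp",
    "together": "together",
    "argilla": "argilla",
}


def installed_extras(modules=EXTRA_MODULES):
    """Return the extras whose assumed module can be found in this environment."""
    return [extra for extra, module in modules.items() if find_spec(module) is not None]


def install_command(extras):
    """Build the pip command that installs distilabel with the given extras."""
    if not extras:
        return "pip install distilabel"
    joined = ",".join(extras)
    return f'pip install "distilabel[{joined}]"'
```

For instance, `install_command(["openai", "argilla"])` builds the command for the setup used in the quick example, plus export to Argilla.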

Quick example

from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import pipeline
from distilabel.tasks import TextGenerationTask

dataset = (
    load_dataset("HuggingFaceH4/instruction-dataset", split="test[:10]")
    .remove_columns(["completion", "meta"])
    .rename_column("prompt", "input")
)

task = TextGenerationTask()  # (1)

generator = OpenAILLM(task=task, max_new_tokens=512)  # (2)

# Use a distinct name so the imported pipeline function is not shadowed.
pipe = pipeline("preference", "instruction-following", generator=generator)  # (3)

dataset = pipe.generate(dataset)
  1. Create a Task for generating text given an instruction.
  2. Create an LLM that generates text using the Task from step 1. Because this LLM generates text, it acts as the generator.
  3. Create a pre-defined Pipeline using the pipeline function and the generator from step 2. The pipeline function creates a labeller LLM using OpenAILLM with the UltraFeedback task for instruction-following assessment.
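The two roles in the steps above, a generator that produces text and a labeller that assesses it, can be pictured with a toy sketch in plain Python. This is not distilabel's implementation: `run_preference_pipeline`, the stand-in callables, and the `generation` and `rating` field names are all hypothetical and exist only to show the flow of data.

```python
def run_preference_pipeline(rows, generate, label):
    """Generate a response for each input, then attach a rating to it."""
    results = []
    for row in rows:
        generation = generate(row["input"])        # generator step
        rating = label(row["input"], generation)   # labeller step
        results.append({**row, "generation": generation, "rating": rating})
    return results


# Hypothetical stand-ins; a real run would call LLMs at both steps.
def echo_generator(prompt):
    return f"Answer to: {prompt}"


def length_labeller(prompt, generation):
    # Toy score between 1 and 5 based only on the response length.
    return min(5, 1 + len(generation) // 20)
```

Each output row keeps its original `input` and gains the generated text and its score, which is roughly the shape a preference dataset needs.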

Note

To run the script successfully, ensure you have assigned your OpenAI API key to the OPENAI_API_KEY environment variable.
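A missing key is easier to diagnose if the script fails early with a clear message. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper, not part of distilabel or the OpenAI client.

```python
import os


def require_env(name, env=os.environ):
    """Return the value of `name`, or raise a clear error if it is unset or blank."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running the example."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of the script surfaces the problem before any dataset is loaded or any request is made.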

For a more complete example, check out our notebook on Google Colab.