distilabel
AI Feedback (AIF) framework to build datasets with and for LLMs:
- Integrations with the most popular libraries and APIs for LLMs: HF Transformers, OpenAI, vLLM, etc.
- Multiple tasks for Self-Instruct, Preference datasets and more.
- Dataset export to Argilla for easy data exploration and further annotation.
Installation
Requires Python 3.8+. In addition, the following extras are available:
- hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
- hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
- openai: for using OpenAI API models via the OpenAILLM integration.
- vllm: for using the vLLM serving engine via the vLLM integration.
- argilla: for exporting the generated datasets to Argilla.
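Both the base package and any combination of extras install from PyPI with standard pip syntax (openai and argilla below are just an example pairing):

pip install distilabel
pip install "distilabel[openai,argilla]"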
Quick example
from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import pipeline
from distilabel.tasks import TextGenerationTask

# Load 10 example instructions and rename the prompt column to the
# `input` field expected by the task
dataset = (
    load_dataset("HuggingFaceH4/instruction-dataset", split="test[:10]")
    .remove_columns(["completion", "meta"])
    .rename_column("prompt", "input")
)
task = TextGenerationTask() # (1)
generator = OpenAILLM(task=task, max_new_tokens=512) # (2)
pipeline = pipeline("preference", "instruction-following", generator=generator) # (3)
dataset = pipeline.generate(dataset)
1. Create a Task for generating text given an instruction.
2. Create an LLM for generating text using the Task created in the first step. As the LLM will generate text, it will be a generator.
3. Create a pre-defined Pipeline using the pipeline function and the generator created in step 2. The pipeline function will create a labeller LLM using OpenAILLM with the UltraFeedback task for instruction-following assessment.
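Once the pipeline has run, the generated dataset can be exported to Argilla for exploration and annotation, as mentioned in the feature list above. A minimal sketch, assuming the argilla extra is installed, a reachable Argilla instance, and a to_argilla() conversion helper on the generated dataset:

import argilla as rg

# Connect to your Argilla instance (URL and key are placeholders)
rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

# Convert the generated dataset to an Argilla FeedbackDataset and push it
rg_dataset = dataset.to_argilla()
rg_dataset.push_to_argilla(name="preference-dataset")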
Note
To run the script successfully, ensure the OPENAI_API_KEY environment variable is set to your OpenAI API key.
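For example, in a Unix-like shell (the key below is a placeholder):

export OPENAI_API_KEY="sk-..."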
For a more complete example, check out our awesome notebook on Google Colab: