distilabel
AI Feedback (AIF) framework to build datasets with and for LLMs:
- Integrations with the most popular libraries and APIs for LLMs: HF Transformers, OpenAI, vLLM, etc.
- Multiple tasks for Self-Instruct, Preference datasets and more.
- Dataset export to Argilla for easy data exploration and further annotation.
Installation
Requires Python 3.8+. In addition, the following extras are available:
- hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
- hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
- openai: for using OpenAI API models via the OpenAILLM integration.
- vllm: for using the vLLM serving engine via the vLLM integration.
- argilla: for exporting the generated datasets to Argilla.
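Both the base package and any combination of extras install from PyPI with standard pip syntax (openai and argilla below are just an example pairing):

pip install distilabel
pip install "distilabel[openai,argilla]"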
Quick example
from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import pipeline
from distilabel.tasks import TextGenerationTask

# Load 10 example instructions and rename the prompt column to the
# `input` field expected by the task
dataset = (
    load_dataset("HuggingFaceH4/instruction-dataset", split="test[:10]")
    .remove_columns(["completion", "meta"])
    .rename_column("prompt", "input")
)
task = TextGenerationTask() # (1)
generator = OpenAILLM(task=task, max_new_tokens=512) # (2)
pipeline = pipeline("preference", "instruction-following", generator=generator) # (3)
dataset = pipeline.generate(dataset)
1. Create a Task for generating text given an instruction.
2. Create an LLM for generating text using the Task created in the first step. As the LLM will generate text, it will be a generator.
3. Create a pre-defined Pipeline using the pipeline function and the generator created in step 2. The pipeline function will create a labeller LLM using OpenAILLM with the UltraFeedback task for instruction-following assessment.
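Once the pipeline has run, the generated dataset can be exported to Argilla for exploration and annotation, as mentioned in the feature list above. A minimal sketch, assuming the argilla extra is installed, a reachable Argilla instance, and a to_argilla() conversion helper on the generated dataset:

import argilla as rg

# Connect to your Argilla instance (URL and key are placeholders)
rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

# Convert the generated dataset to an Argilla FeedbackDataset and push it
rg_dataset = dataset.to_argilla()
rg_dataset.push_to_argilla(name="preference-dataset")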
Note
To run the script successfully, ensure the OPENAI_API_KEY environment variable is set to your OpenAI API key.
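For example, in a Unix-like shell (the key below is a placeholder):

export OPENAI_API_KEY="sk-..."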
For a more complete example, check out our awesome notebook on Google Colab: