distilabel¶
distilabel is an AI Feedback (AIF) framework to build datasets with and for LLMs:
- Integrations with the most popular libraries and APIs for LLMs: HF Transformers, OpenAI, vLLM, etc.
- Multiple tasks for Self-Instruct, Preference datasets and more.
- Dataset export to Argilla for easy data exploration and further annotation.
Installation¶
Requires Python 3.8+. In addition, the following extras are available:
- `anthropic`: for using models available in the Anthropic API via the `AnthropicLLM` integration.
- `argilla`: for exporting the generated datasets to Argilla.
- `cohere`: for using models available in Cohere via the `CohereLLM` integration.
- `hf-inference-endpoints`: for using the Hugging Face Inference Endpoints via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the `transformers` package via the `TransformersLLM` integration.
- `litellm`: for using `LiteLLM` to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the `llama-cpp-python` Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the Mistral AI API via the `MistralAILLM` integration.
- `ollama`: for using Ollama and their available models via the `OllamaLLM` integration.
- `openai`: for using OpenAI API models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using Google Vertex AI proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the `vllm` serving engine via the `vLLM` integration.
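Extras follow the standard pip syntax, so several can be combined in one install. As a sketch (the exact extra names are the ones listed above; combine whichever your pipeline needs):

```shell
# Install distilabel with the OpenAI integration plus Argilla export support.
pip install "distilabel[openai,argilla]"
```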
Quick example¶
To run the following example you must install distilabel with the `openai` extra:
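The install command, assuming a standard pip setup:

```shell
pip install "distilabel[openai]"
```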
Then run:
```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    # Load prompts from a dataset on the Hugging Face Hub, renaming
    # the "prompt" column to "instruction" for the downstream task.
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # Generate a completion for each instruction with gpt-3.5-turbo.
    generate_with_openai = TextGeneration(
        name="generate_with_gpt35", llm=OpenAILLM(model="gpt-3.5-turbo")
    )

    # Connect the steps: loaded rows feed the generation task.
    load_dataset.connect(generate_with_openai)

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            "generate_with_gpt35": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
```
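The `output_mappings={"prompt": "instruction"}` argument above renames columns as rows pass between steps. The renaming semantics can be sketched with a plain dictionary, independent of distilabel (`apply_output_mappings` is a hypothetical helper written here for illustration, not part of the distilabel API):

```python
def apply_output_mappings(row: dict, mappings: dict) -> dict:
    """Return a copy of `row` with keys renamed according to `mappings`.

    Keys not present in `mappings` are kept as-is, mirroring how a
    step's output columns pass through unchanged unless remapped.
    """
    return {mappings.get(key, key): value for key, value in row.items()}


# A row as loaded from the Hub dataset: the prompt column is renamed
# to "instruction", which is the input column TextGeneration expects.
row = {"prompt": "What is 2 + 2?", "id": 7}
renamed = apply_output_mappings(row, {"prompt": "instruction"})
print(renamed)  # {'instruction': 'What is 2 + 2?', 'id': 7}
```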