distilabel¶
distilabel is an AI Feedback (AIF) framework to build datasets with and for LLMs:
- Integrations with the most popular libraries and APIs for LLMs: HF Transformers, OpenAI, vLLM, etc.
- Multiple tasks for Self-Instruct, Preference datasets and more.
- Dataset export to Argilla for easy data exploration and further annotation.
Installation¶
Requires Python 3.8+. In addition, the following extras are available:
- `anthropic`: for using models available in the Anthropic API via the `AnthropicLLM` integration.
- `argilla`: for exporting the generated datasets to Argilla.
- `cohere`: for using models available in Cohere via the `CohereLLM` integration.
- `hf-inference-endpoints`: for using the Hugging Face Inference Endpoints via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the `transformers` package via the `TransformersLLM` integration.
- `litellm`: for using `LiteLLM` to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the `llama-cpp-python` Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the Mistral AI API via the `MistralAILLM` integration.
- `ollama`: for using Ollama and their available models via the `OllamaLLM` integration.
- `openai`: for using OpenAI API models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using Google Vertex AI proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the `vllm` serving engine via the `vLLM` integration.
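Extras follow the standard pip syntax, so several can be combined in one install. As a sketch (the exact extra names are the ones listed above; combine whichever your pipeline needs):

```shell
# Install distilabel with the OpenAI integration plus Argilla export support.
pip install "distilabel[openai,argilla]"
```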
Quick example¶
To run the following example you must install distilabel with the `openai` extra:
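The install command, assuming a standard pip setup:

```shell
pip install "distilabel[openai]"
```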
Then run:
```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    # Load prompts from a dataset on the Hugging Face Hub, renaming
    # the "prompt" column to "instruction" for the downstream task.
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # Generate a completion for each instruction with gpt-3.5-turbo.
    generate_with_openai = TextGeneration(
        name="generate_with_gpt35", llm=OpenAILLM(model="gpt-3.5-turbo")
    )

    # Connect the steps: loaded rows feed the generation task.
    load_dataset.connect(generate_with_openai)

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            "generate_with_gpt35": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
```
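The `output_mappings={"prompt": "instruction"}` argument above renames columns as rows pass between steps. The renaming semantics can be sketched with a plain dictionary, independent of distilabel (`apply_output_mappings` is a hypothetical helper written here for illustration, not part of the distilabel API):

```python
def apply_output_mappings(row: dict, mappings: dict) -> dict:
    """Return a copy of `row` with keys renamed according to `mappings`.

    Keys not present in `mappings` are kept as-is, mirroring how a
    step's output columns pass through unchanged unless remapped.
    """
    return {mappings.get(key, key): value for key, value in row.items()}


# A row as loaded from the Hub dataset: the prompt column is renamed
# to "instruction", which is the input column TextGeneration expects.
row = {"prompt": "What is 2 + 2?", "id": 7}
renamed = apply_output_mappings(row, {"prompt": "instruction"})
print(renamed)  # {'instruction': 'What is 2 + 2?', 'id': 7}
```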