Installation
You will need Python 3.9 or higher, up to Python 3.12, since support for the latter is still a work in progress.
To install the latest release of the package from PyPI, you can use the following command:
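```bash
# install (or upgrade to) the latest release from PyPI
pip install distilabel --upgrade
```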
Alternatively, you may want to install it from source, i.e. the latest unreleased version. Assuming the repository lives at `github.com/argilla-io/distilabel`, you can use the following command:
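```bash
# install the latest unreleased version straight from the develop branch
pip install "distilabel @ git+https://github.com/argilla-io/distilabel.git@develop"
```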
Note

We are installing from `develop` since that's the branch we use to collect all the features, bug fixes, and improvements that will be part of the next release. If you want to install from a specific branch, you can replace `develop` with the branch name.
Extras
Additionally, `distilabel` provides some extra dependencies, mainly to add support for the LLM integrations. Here's a list of the available extras; an example of installing them follows the lists below:
LLMs
- `anthropic`: for using models available in the Anthropic API via the `AnthropicLLM` integration.
- `argilla`: for exporting the generated datasets to Argilla.
- `cohere`: for using models available in Cohere via the `CohereLLM` integration.
- `groq`: for using models available in Groq using the `groq` Python client via the `GroqLLM` integration.
- `hf-inference-endpoints`: for using the Hugging Face Inference Endpoints via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the `transformers` package via the `TransformersLLM` integration.
- `litellm`: for using `LiteLLM` to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the `llama-cpp-python` Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the Mistral AI API via the `MistralAILLM` integration.
- `ollama`: for using Ollama and their available models via the `OllamaLLM` integration.
- `openai`: for using OpenAI API models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using Google Vertex AI proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the `vllm` serving engine via the `vLLM` integration.
- `sentence-transformers`: for generating sentence embeddings using `sentence-transformers`.
Data processing
- `ray`: for scaling and distributing a pipeline with Ray.
- `faiss-cpu` and `faiss-gpu`: for generating sentence embeddings using `faiss`.
- `minhash`: for using MinHash for duplicate detection with `datasketch` and `nltk`.
- `text-clustering`: for using text clustering with UMAP and scikit-learn.
Structured generation
- `outlines`: for using structured generation of LLMs with `outlines`.
- `instructor`: for using structured generation of LLMs with Instructor.
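As an example, installing `distilabel` together with extras uses pip's standard bracket syntax; any of the extra names from the lists above can be combined in a single install:

```bash
# install distilabel along with the openai and vllm extras
pip install "distilabel[openai,vllm]" --upgrade
```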
Recommendations / Notes
The `mistralai` dependency requires Python 3.9 or higher, so if you want to use the `distilabel.models.llms.MistralLLM` implementation, you will need Python 3.9 or higher.
In some cases, like `transformers` and `vllm`, installing `flash-attn` is recommended if you are using a GPU accelerator, since it speeds up inference; however, it needs to be installed separately, as it's not included in the `distilabel` dependencies.
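As a sketch, the separate install typically looks like the following; `--no-build-isolation` is the flag the `flash-attn` project recommends at the time of writing, and a CUDA build toolchain is assumed to be available:

```bash
# assumes a CUDA-capable GPU and build toolchain, since flash-attn compiles native kernels
pip install flash-attn --no-build-isolation
```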
Also, if you want to use the `llama-cpp-python` integration for running local LLMs, note that the installation process may get a bit trickier depending on which OS you are using, so we recommend reading through the Installation section of their docs.
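For illustration only, hardware acceleration for `llama-cpp-python` is typically enabled via CMake flags at install time; the exact flag names vary across versions and platforms, so treat this as a sketch and defer to their docs:

```bash
# example: build llama-cpp-python with CUDA support (flag name may differ between versions)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```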