Installation
You will need Python 3.9 or higher, up to Python 3.12, since support for the latter is still a work in progress.
To install the latest release of the package from PyPI, you can use the following command:
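```bash
# install (or upgrade to) the latest release from PyPI
pip install distilabel --upgrade
```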
Alternatively, you may want to install it from source, i.e. the latest unreleased version. Assuming the repository lives at `github.com/argilla-io/distilabel`, you can use the following command:
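```bash
# install the latest unreleased version straight from the develop branch
pip install "distilabel @ git+https://github.com/argilla-io/distilabel.git@develop"
```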
Note

We are installing from `develop` since that's the branch we use to collect all the features, bug fixes, and improvements that will be part of the next release. If you want to install from a specific branch, you can replace `develop` with the branch name.
Extras
Additionally, `distilabel` provides some extra dependencies, mainly to add support for the LLM integrations. Here's a list of the available extras; an example of installing them follows the lists below:
LLMs
- `anthropic`: for using models available in the Anthropic API via the `AnthropicLLM` integration.
- `argilla`: for exporting the generated datasets to Argilla.
- `cohere`: for using models available in Cohere via the `CohereLLM` integration.
- `groq`: for using models available in Groq using the `groq` Python client via the `GroqLLM` integration.
- `hf-inference-endpoints`: for using the Hugging Face Inference Endpoints via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the `transformers` package via the `TransformersLLM` integration.
- `litellm`: for using `LiteLLM` to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the `llama-cpp-python` Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the Mistral AI API via the `MistralAILLM` integration.
- `ollama`: for using Ollama and their available models via the `OllamaLLM` integration.
- `openai`: for using OpenAI API models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using Google Vertex AI proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the `vllm` serving engine via the `vLLM` integration.
- `sentence-transformers`: for generating sentence embeddings using `sentence-transformers`.
Data processing
- `ray`: for scaling and distributing a pipeline with Ray.
- `faiss-cpu` and `faiss-gpu`: for generating sentence embeddings using `faiss`.
- `minhash`: for using MinHash for duplicate detection with `datasketch` and `nltk`.
- `text-clustering`: for using text clustering with UMAP and scikit-learn.
Structured generation
- `outlines`: for using structured generation of LLMs with `outlines`.
- `instructor`: for using structured generation of LLMs with Instructor.
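As an example, installing `distilabel` together with extras uses pip's standard bracket syntax; any of the extra names from the lists above can be combined in a single install:

```bash
# install distilabel along with the openai and vllm extras
pip install "distilabel[openai,vllm]" --upgrade
```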
Recommendations / Notes
The `mistralai` dependency requires Python 3.9 or higher, so if you want to use the `distilabel.models.llms.MistralLLM` implementation, you will need Python 3.9 or higher.
In some cases, like `transformers` and `vllm`, installing `flash-attn` is recommended if you are using a GPU accelerator, since it speeds up inference; however, it needs to be installed separately, as it's not included in the `distilabel` dependencies.
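As a sketch, the separate install typically looks like the following; `--no-build-isolation` is the flag the `flash-attn` project recommends at the time of writing, and a CUDA build toolchain is assumed to be available:

```bash
# assumes a CUDA-capable GPU and build toolchain, since flash-attn compiles native kernels
pip install flash-attn --no-build-isolation
```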
Also, if you want to use the `llama-cpp-python` integration for running local LLMs, note that the installation process may get a bit trickier depending on which OS you are using, so we recommend reading through the Installation section of their docs.
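For illustration only, hardware acceleration for `llama-cpp-python` is typically enabled via CMake flags at install time; the exact flag names vary across versions and platforms, so treat this as a sketch and defer to their docs:

```bash
# example: build llama-cpp-python with CUDA support (flag name may differ between versions)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```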