LLM Gallery
This section contains the existing LLM subclasses implemented in distilabel.
llms
AnthropicLLM
Bases: AsyncLLM
Anthropic LLM implementation running the Async API client.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the name of the model to use for the LLM e.g. "claude-3-opus-20240229", "claude-3-sonnet-20240229", etc. Available models can be checked here: Anthropic: Models overview. |
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the Anthropic API. If not provided,
it will be read from |
base_url |
Optional[RuntimeParameter[str]]
|
the base URL to use for the Anthropic API. Defaults to |
timeout |
RuntimeParameter[float]
|
the maximum time in seconds to wait for a response. Defaults to |
max_retries |
RuntimeParameter[int]
|
The maximum number of times to retry the request before failing. Defaults
to |
http_client |
Optional[AsyncClient]
|
if provided, an alternative HTTP client to use for calling Anthropic
API. Defaults to |
structured_output |
Optional[RuntimeParameter[InstructorStructuredOutputType]]
|
a dictionary containing the structured output configuration
using |
_api_key_env_var |
str
|
the name of the environment variable to use for the API key. It is meant to be used internally. |
_aclient |
Optional[AsyncAnthropic]
|
the |
Runtime parameters
- api_key: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from the ANTHROPIC_API_KEY environment variable.
- base_url: the base URL to use for the Anthropic API. Defaults to "https://api.anthropic.com".
- timeout: the maximum time in seconds to wait for a response. Defaults to 600.0.
- max_retries: the maximum number of times to retry the request before failing. Defaults to 6.
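The `api_key` fallback described above (explicit value first, then the `ANTHROPIC_API_KEY` environment variable) can be sketched with the standard library alone. `resolve_api_key` is a hypothetical helper written for illustration; it is not part of distilabel, which handles this internally, and the `ValueError` raised when no key is found is an assumption.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str], env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the explicit key if given, else the value of `env_var`.

    Hypothetical helper mirroring the documented fallback, not distilabel's code.
    """
    if explicit:
        return explicit
    value = os.environ.get(env_var)
    if not value:
        # Assumption: fail loudly when neither source provides a key.
        raise ValueError(f"No API key passed and {env_var} is not set.")
    return value


# An explicit key takes precedence over the environment variable.
os.environ["ANTHROPIC_API_KEY"] = "key-from-env"
print(resolve_api_key("key-from-arg"))
print(resolve_api_key(None))
```

Keeping the key in the environment rather than in code avoids committing secrets alongside a pipeline definition.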
Examples:
Generate text:
from distilabel.models.llms import AnthropicLLM
llm = AnthropicLLM(model="claude-3-opus-20240229", api_key="api.key")
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
from pydantic import BaseModel
from distilabel.models.llms import AnthropicLLM
class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = AnthropicLLM(
    model="claude-3-opus-20240229",
    api_key="api.key",
    structured_output={"schema": User}
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
Source code in src/distilabel/models/llms/anthropic.py
model_name
property
Returns the model name used for the LLM.
_check_model_exists()
Checks if the specified model exists in the available models.
Source code in src/distilabel/models/llms/anthropic.py
load()
Loads the AsyncAnthropic client to use the Anthropic async API.
Source code in src/distilabel/models/llms/anthropic.py
agenerate(input, max_tokens=128, stop_sequences=None, temperature=1.0, top_p=None, top_k=None)
async
Generates a response asynchronously, using the Anthropic Async API definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | FormattedInput | a single input in chat format to generate responses for. | required |
max_tokens | int | the maximum number of new tokens that the model will generate. | 128 |
stop_sequences | Union[List[str], None] | custom text sequences that will cause the model to stop generating. | None |
temperature | float | the temperature to use for the generation. Set only if top_p is None. | 1.0 |
top_p | Union[float, None] | the top-p value to use for the generation. | None |
top_k | Union[int, None] | the top-k value to use for the generation. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/anthropic.py
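Because `agenerate` is a coroutine that handles a single input, several inputs can be awaited concurrently. The sketch below shows that pattern with `asyncio.gather`. `fake_agenerate` is a stand-in stub (an assumption for illustration) so the example runs without network access or an API key; it does not reproduce the real return type.

```python
import asyncio
from typing import Dict, List

ChatInput = List[Dict[str, str]]


async def fake_agenerate(input: ChatInput, max_tokens: int = 128) -> List[str]:
    """Stub standing in for an async LLM call; echoes the last message."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [input[-1]["content"][:max_tokens]]


async def generate_all(inputs: List[ChatInput]) -> List[List[str]]:
    # Schedule one coroutine per input and wait for all of them;
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(fake_agenerate(i) for i in inputs))


inputs = [
    [{"role": "user", "content": "Hello world!"}],
    [{"role": "user", "content": "Hi again!"}],
]
outputs = asyncio.run(generate_all(inputs))
print(outputs)
```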
AnyscaleLLM
Bases: OpenAILLM
Anyscale LLM implementation running the async API client of OpenAI.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model name to use for the LLM, e.g., |
base_url |
Optional[RuntimeParameter[str]]
|
the base URL to use for the Anyscale API requests. Defaults to |
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the Anyscale API. Defaults to |
_api_key_env_var |
str
|
the name of the environment variable to use for the API key. It is meant to be used internally. |
Examples:
Generate text:
from distilabel.models.llms import AnyscaleLLM
llm = AnyscaleLLM(model="google/gemma-7b-it", api_key="api.key")
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Source code in src/distilabel/models/llms/anyscale.py
AzureOpenAILLM
Bases: OpenAILLM
Azure OpenAI LLM implementation running the async API client.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model name to use for the LLM i.e. the name of the Azure deployment. |
base_url |
Optional[RuntimeParameter[str]]
|
the base URL to use for the Azure OpenAI API can be set with |
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the Azure OpenAI API. Defaults to |
api_version |
Optional[RuntimeParameter[str]]
|
the API version to use for the Azure OpenAI API. Defaults to |
Icon
:material-microsoft-azure:
Examples:
Generate text:
from distilabel.models.llms import AzureOpenAILLM
llm = AzureOpenAILLM(model="gpt-4-turbo", api_key="api.key")
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate text from a custom endpoint following the OpenAI API:
from distilabel.models.llms import AzureOpenAILLM
llm = AzureOpenAILLM(
    model="prometheus-eval/prometheus-7b-v2.0",
    base_url=r"http://localhost:8080/v1"
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
from pydantic import BaseModel
from distilabel.models.llms import AzureOpenAILLM
class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = AzureOpenAILLM(
    model="gpt-4-turbo",
    api_key="api.key",
    structured_output={"schema": User}
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
Source code in src/distilabel/models/llms/azure.py
load()
Loads the AsyncAzureOpenAI client to benefit from async requests.
Source code in src/distilabel/models/llms/azure.py
CohereLLM
Bases: AsyncLLM
Cohere API implementation using the async client for concurrent text generation.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the name of the model from the Cohere API to use for the generation. |
base_url |
Optional[RuntimeParameter[str]]
|
the base URL to use for the Cohere API requests. Defaults to
|
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the Cohere API. Defaults to
the value of the |
timeout |
RuntimeParameter[int]
|
the maximum time in seconds to wait for a response from the API. Defaults
to |
client_name |
RuntimeParameter[str]
|
the name of the client to use for the API requests. Defaults to
|
structured_output |
Optional[RuntimeParameter[InstructorStructuredOutputType]]
|
a dictionary containing the structured output configuration
using |
_ChatMessage |
Type[ChatMessage]
|
the |
_aclient |
AsyncClient
|
the |
Runtime parameters
- base_url: the base URL to use for the Cohere API requests. Defaults to "https://api.cohere.ai/v1".
- api_key: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY environment variable.
- timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
- client_name: the name of the client to use for the API requests. Defaults to "distilabel".
Examples:
Generate text:
from distilabel.models.llms import CohereLLM
llm = CohereLLM(model="CohereForAI/c4ai-command-r-plus")
llm.load()
# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
```python
from pydantic import BaseModel
from distilabel.models.llms import CohereLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = CohereLLM(
    model="CohereForAI/c4ai-command-r-plus",
    api_key="api.key",
    structured_output={"schema": User}
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
```
Source code in src/distilabel/models/llms/cohere.py
model_name
property
Returns the model name used for the LLM.
load()
Loads the AsyncClient client from the cohere package.
Source code in src/distilabel/models/llms/cohere.py
_format_chat_to_cohere(input)
Formats the chat input to the Cohere Chat API conversational format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
FormattedInput
|
The chat input to format. |
required |
Returns:
Type | Description |
---|---|
Tuple[Union[str, None], List[ChatMessage], str]
|
A tuple containing the system, chat history, and message. |
Source code in src/distilabel/models/llms/cohere.py
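One way to picture the split into system, chat history, and message is the sketch below. It is an illustrative reimplementation under assumptions, not the library's code: plain dictionaries stand in for Cohere's `ChatMessage`, the role labels `"USER"` and `"CHATBOT"` are assumed, and the final user turn is taken as the message. `split_chat` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

Chat = List[Dict[str, str]]


def split_chat(chat: Chat) -> Tuple[Optional[str], List[Dict[str, str]], str]:
    """Split an OpenAI-style chat into (system, history, last user message)."""
    system: Optional[str] = None
    history: List[Dict[str, str]] = []
    message: Optional[str] = None
    for turn in chat:
        role, content = turn["role"], turn["content"]
        if role == "system":
            system = content
        elif role == "user":
            # A pending user message that was never answered moves to history.
            if message is not None:
                history.append({"role": "USER", "message": message})
            message = content
        elif role == "assistant":
            if message is not None:
                history.append({"role": "USER", "message": message})
                message = None
            history.append({"role": "CHATBOT", "message": content})
        else:
            raise ValueError(f"Unsupported role: {role!r}")
    if message is None:
        raise ValueError("The chat must end with a user message.")
    return system, history, message


chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "How are you?"},
]
system, history, message = split_chat(chat)
print(system, history, message)
```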
agenerate(input, temperature=None, max_tokens=None, k=None, p=None, seed=None, stop_sequences=None, frequency_penalty=None, presence_penalty=None, raw_prompting=None)
async
Generates a response from the LLM given an input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | FormattedInput | a single input in chat format to generate responses for. | required |
temperature | Optional[float] | the temperature to use for the generation. | None |
max_tokens | Optional[int] | the maximum number of new tokens that the model will generate. | None |
k | Optional[int] | the number of highest probability vocabulary tokens to keep for the generation. | None |
p | Optional[float] | the nucleus sampling probability to use for the generation. | None |
seed | Optional[float] | the seed to use for the generation. | None |
stop_sequences | Optional[Sequence[str]] | a list of sequences to use as stopping criteria for the generation. | None |
frequency_penalty | Optional[float] | the frequency penalty to use for the generation. | None |
presence_penalty | Optional[float] | the presence penalty to use for the generation. | None |
raw_prompting | Optional[bool] | a flag to use raw prompting for the generation. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | The generated response from the Cohere API model. |
Source code in src/distilabel/models/llms/cohere.py
GroqLLM
Bases: AsyncLLM
Groq API implementation using the async client for concurrent text generation.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the name of the model from the Groq API to use for the generation. |
base_url |
Optional[RuntimeParameter[str]]
|
the base URL to use for the Groq API requests. Defaults to
|
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the Groq API. Defaults to
the value of the |
max_retries |
RuntimeParameter[int]
|
the maximum number of times to retry the request to the API before
failing. Defaults to |
timeout |
RuntimeParameter[int]
|
the maximum time in seconds to wait for a response from the API. Defaults
to |
structured_output |
Optional[RuntimeParameter[InstructorStructuredOutputType]]
|
a dictionary containing the structured output configuration
using |
_api_key_env_var |
str
|
the name of the environment variable to use for the API key. |
_aclient |
Optional[AsyncGroq]
|
the |
Runtime parameters
- base_url: the base URL to use for the Groq API requests. Defaults to "https://api.groq.com".
- api_key: the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY environment variable.
- max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 2.
- timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
Examples:
Generate text:
from distilabel.models.llms import GroqLLM
llm = GroqLLM(model="llama3-70b-8192")
llm.load()
# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
```python
from pydantic import BaseModel
from distilabel.models.llms import GroqLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = GroqLLM(
    model="llama3-70b-8192",
    api_key="api.key",
    structured_output={"schema": User}
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
```
Source code in src/distilabel/models/llms/groq.py
model_name
property
Returns the model name used for the LLM.
load()
Loads the AsyncGroq client to benefit from async requests.
Source code in src/distilabel/models/llms/groq.py
agenerate(input, seed=None, max_new_tokens=128, temperature=1.0, top_p=1.0, stop=None)
async
Generates num_generations responses for the given input using the Groq async client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | FormattedInput | a single input in chat format to generate responses for. | required |
seed | Optional[int] | the seed to use for the generation. | None |
max_new_tokens | int | the maximum number of new tokens that the model will generate. | 128 |
temperature | float | the temperature to use for the generation. | 1.0 |
top_p | float | the top-p value to use for the generation. | 1.0 |
stop | Optional[str] | the stop sequence to use for the generation. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
References
- https://console.groq.com/docs/text-chat
Source code in src/distilabel/models/llms/groq.py
InferenceEndpointsLLM
Bases: InferenceEndpointsBaseClient, AsyncLLM, MagpieChatTemplateMixin
InferenceEndpoints LLM implementation running the async API client.
This LLM will internally use huggingface_hub.AsyncInferenceClient.
Attributes:
Name | Type | Description |
---|---|---|
model_id |
Optional[str]
|
the model ID to use for the LLM as available in the Hugging Face Hub, which
will be used to resolve the base URL for the serverless Inference Endpoints API requests.
Defaults to |
endpoint_name |
Optional[RuntimeParameter[str]]
|
the name of the Inference Endpoint to use for the LLM. Defaults to |
endpoint_namespace |
Optional[RuntimeParameter[str]]
|
the namespace of the Inference Endpoint to use for the LLM. Defaults to |
base_url |
Optional[RuntimeParameter[str]]
|
the base URL to use for the Inference Endpoints API requests. |
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the Inference Endpoints API. |
tokenizer_id |
Optional[str]
|
the tokenizer ID to use for the LLM as available in the Hugging Face Hub.
Defaults to |
model_display_name |
Optional[str]
|
the model display name to use for the LLM. Defaults to |
use_magpie_template |
bool
|
a flag used to enable/disable applying the Magpie pre-query
template. Defaults to |
magpie_pre_query_template |
Union[MagpieAvailablePreQueryTemplates, str, None]
|
the pre-query template to be applied to the prompt or
sent to the LLM to generate an instruction or a follow up user message. Valid
values are "llama3", "qwen2" or another pre-query template provided. Defaults
to |
structured_output |
Optional[RuntimeParameter[StructuredOutputType]]
|
a dictionary containing the structured output configuration or
if more fine-grained control is needed, an instance of |
Icon
:hugging:
Examples:
Free serverless Inference API (set the input_batch_size of the Task that uses this LLM to avoid "Model is overloaded" errors):
from distilabel.models.llms.huggingface import InferenceEndpointsLLM
llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Dedicated Inference Endpoints:
from distilabel.models.llms.huggingface import InferenceEndpointsLLM
llm = InferenceEndpointsLLM(
    endpoint_name="<ENDPOINT_NAME>",
    api_key="<HF_API_KEY>",
    endpoint_namespace="<USER|ORG>",
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Dedicated Inference Endpoints or TGI:
from distilabel.models.llms.huggingface import InferenceEndpointsLLM
llm = InferenceEndpointsLLM(
    api_key="<HF_API_KEY>",
    base_url="<BASE_URL>",
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
from pydantic import BaseModel
from distilabel.models.llms import InferenceEndpointsLLM
class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    api_key="api.key",
    structured_output={"format": "json", "schema": User.model_json_schema()}
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the Tour De France"}]])
Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py
only_one_of_model_id_endpoint_name_or_base_url_provided()
Validates that only one of model_id or endpoint_name is provided. If base_url is also provided, a warning is shown informing the user that the provided base_url will be ignored in favour of the dynamically calculated one.
Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py
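The rule enforced by this validator can be sketched as a plain function. `check_endpoint_args` is a hypothetical name, and the exception and warning types (`ValueError`, `UserWarning`) as well as the error raised when nothing is provided are assumptions; the source only states the rule itself.

```python
import warnings
from typing import Optional


def check_endpoint_args(
    model_id: Optional[str] = None,
    endpoint_name: Optional[str] = None,
    base_url: Optional[str] = None,
) -> Optional[str]:
    """Return the base_url that would be kept (None if it is ignored)."""
    if model_id and endpoint_name:
        raise ValueError("Provide only one of `model_id` or `endpoint_name`.")
    if not (model_id or endpoint_name or base_url):
        # Assumption: at least one way of locating the endpoint is required.
        raise ValueError("Provide `model_id`, `endpoint_name` or `base_url`.")
    if base_url and (model_id or endpoint_name):
        # The URL is resolved dynamically, so the explicit one is dropped.
        warnings.warn("`base_url` will be ignored.", UserWarning)
        return None
    return base_url


print(check_endpoint_args(base_url="http://localhost:8080"))
```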
prepare_input(input)
Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
StandardInput
|
the input list containing chat items. |
required |
Returns:
Type | Description |
---|---|
str
|
The prompt to send to the LLM. |
Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py
_get_structured_output(input)
Gets the structured output (if any) for the given input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
FormattedInput
|
a single input in chat format to generate responses for. |
required |
Returns:
Type | Description |
---|---|
StandardInput
|
The input and the structured output that will be passed as |
Union[Dict[str, Any], None]
|
inference endpoint or |
Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py
_check_stop_sequences(stop_sequences=None)
Checks that no more than 4 stop sequences are provided.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
stop_sequences
|
Optional[Union[str, List[str]]]
|
the stop sequences to be checked. |
None
|
Returns:
Type | Description |
---|---|
Union[List[str], None]
|
The stop sequences. |
Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py
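A check of this kind can be sketched as follows. The limit of 4 comes from the description above; normalising a single string to a list follows from the documented parameter and return types. Truncating to the first 4 items with a warning, rather than raising, is an assumption, and `check_stop_sequences` is a hypothetical standalone version of the method.

```python
import warnings
from typing import List, Optional, Union

MAX_STOP_SEQUENCES = 4


def check_stop_sequences(
    stop_sequences: Optional[Union[str, List[str]]] = None,
) -> Optional[List[str]]:
    """Normalise stop sequences to a list of at most MAX_STOP_SEQUENCES items."""
    if stop_sequences is None:
        return None
    if isinstance(stop_sequences, str):
        stop_sequences = [stop_sequences]
    if len(stop_sequences) > MAX_STOP_SEQUENCES:
        # Assumption: keep the first items instead of failing.
        warnings.warn(
            f"Only up to {MAX_STOP_SEQUENCES} stop sequences are allowed; "
            "keeping the first ones."
        )
        stop_sequences = stop_sequences[:MAX_STOP_SEQUENCES]
    return stop_sequences


print(check_stop_sequences("###"))
```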
agenerate(input, max_new_tokens=128, frequency_penalty=None, logit_bias=None, logprobs=False, presence_penalty=None, seed=None, stop_sequences=None, temperature=1.0, tool_choice=None, tool_prompt=None, tools=None, top_logprobs=None, top_n_tokens=None, top_p=None, do_sample=False, repetition_penalty=None, return_full_text=False, top_k=None, typical_p=None, watermark=False, num_generations=1)
async
Generates completions for the given input using the async client. This method uses two methods of the huggingface_hub.AsyncInferenceClient: chat_completion and text_generation. The chat_completion method will be used only if no tokenizer_id has been specified. Some arguments of this function are specific to the text_generation method, while others are specific to the chat_completion method.
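The routing rule stated above (chat_completion only when no tokenizer_id is set) reduces to a single condition. `select_client_method` is a hypothetical helper used only to make the rule concrete; it is not a distilabel function.

```python
from typing import Optional


def select_client_method(tokenizer_id: Optional[str]) -> str:
    """Name of the client method the documented rule would pick."""
    # With a tokenizer, the prompt is built locally and sent as raw text;
    # without one, the chat messages are sent as they are.
    return "text_generation" if tokenizer_id else "chat_completion"


print(select_client_method(None))
print(select_client_method("meta-llama/Meta-Llama-3-70B-Instruct"))
```

In practice this means that arguments documented as exclusive to one method have no effect when the other method is selected.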
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
FormattedInput
|
a single input in chat format to generate responses for. |
required |
max_new_tokens
|
int
|
the maximum number of new tokens that the model will generate.
Defaults to |
128
|
frequency_penalty
|
Optional[Annotated[float, Field(ge=-2.0, le=2.0)]]
|
a value between |
None
|
logit_bias
|
Optional[List[float]]
|
modify the likelihood of specified tokens appearing in the completion.
This argument is exclusive to the |
None
|
logprobs
|
bool
|
whether to return the log probabilities or not. This argument is exclusive
to the |
False
|
presence_penalty
|
Optional[Annotated[float, Field(ge=-2.0, le=2.0)]]
|
a value between |
None
|
seed
|
Optional[int]
|
the seed to use for the generation. Defaults to |
None
|
stop_sequences
|
Optional[List[str]]
|
either a single string or a list of strings containing the sequences
to stop the generation at. Defaults to |
None
|
temperature
|
float
|
the temperature to use for the generation. Defaults to |
1.0
|
tool_choice
|
Optional[Union[Dict[str, str], Literal['auto']]]
|
the name of the tool the model should call. It can be a dictionary
like |
None
|
tool_prompt
|
Optional[str]
|
A prompt to be appended before the tools. This argument is exclusive
to the |
None
|
tools
|
Optional[List[Dict[str, Any]]]
|
a list of tools definitions that the LLM can use.
This argument is exclusive to the |
None
|
top_logprobs
|
Optional[PositiveInt]
|
the number of top log probabilities to return per output token
generated. This argument is exclusive to the |
None
|
top_n_tokens
|
Optional[PositiveInt]
|
the number of top log probabilities to return per output token
generated. This argument is exclusive of the |
None
|
top_p
|
Optional[float]
|
the top-p value to use for the generation. Defaults to |
None
|
do_sample
|
bool
|
whether to use sampling for the generation. This argument is exclusive
of the |
False
|
repetition_penalty
|
Optional[float]
|
the repetition penalty to use for the generation. This argument
is exclusive of the |
None
|
return_full_text
|
bool
|
whether to return the full text of the completion or just
the generated text. Defaults to |
False
|
top_k
|
Optional[int]
|
the top-k value to use for the generation. This argument is exclusive
of the |
None
|
typical_p
|
Optional[float]
|
the typical-p value to use for the generation. This argument is exclusive
of the |
None
|
watermark
|
bool
|
whether to add the watermark to the generated text. This argument
is exclusive of the |
False
|
num_generations
|
int
|
the number of generations to generate. Defaults to |
1
|
Returns:
Type | Description |
---|---|
GenerateOutput
|
A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py
TransformersLLM
Bases: LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin
Hugging Face transformers library LLM implementation using the text generation pipeline.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files. |
revision |
str
|
if |
torch_dtype |
str
|
the torch dtype to use for the model e.g. "float16", "float32", etc.
Defaults to |
trust_remote_code |
bool
|
whether to allow fetching and executing remote code fetched
from the repository in the Hub. Defaults to |
model_kwargs |
Optional[Dict[str, Any]]
|
additional dictionary of keyword arguments that will be passed to
the |
tokenizer |
Optional[str]
|
the tokenizer Hugging Face Hub repo id or a path to a directory containing
the tokenizer config files. If not provided, the one associated to the |
use_fast |
bool
|
whether to use a fast tokenizer or not. Defaults to |
chat_template |
Optional[str]
|
a chat template that will be used to build the prompts before
sending them to the model. If not provided, the chat template defined in the
tokenizer config will be used. If not provided and the tokenizer doesn't have
a chat template, then ChatML template will be used. Defaults to |
device |
Optional[Union[str, int]]
|
the name or index of the device where the model will be loaded. Defaults
to |
device_map |
Optional[Union[str, Dict[str, Any]]]
|
a dictionary mapping each layer of the model to a device, or a mode
like |
token |
Optional[SecretStr]
|
the Hugging Face Hub token that will be used to authenticate to the Hugging
Face Hub. If not provided, the |
structured_output |
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
|
a dictionary containing the structured output configuration or if more
fine-grained control is needed, an instance of |
use_magpie_template |
bool
|
a flag used to enable/disable applying the Magpie pre-query
template. Defaults to |
magpie_pre_query_template |
Union[MagpieAvailablePreQueryTemplates, str, None]
|
the pre-query template to be applied to the prompt or
sent to the LLM to generate an instruction or a follow up user message. Valid
values are "llama3", "qwen2" or another pre-query template provided. Defaults
to |
Icon
:hugging:
Examples:
Generate text:
from distilabel.models.llms import TransformersLLM
llm = TransformersLLM(model="microsoft/Phi-3-mini-4k-instruct")
llm.load()
# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Source code in src/distilabel/models/llms/huggingface/transformers.py
model_name
property
Returns the model name used for the LLM.
load()
Loads the model and tokenizer and creates the text generation pipeline. In addition, it will configure the tokenizer chat template.
Source code in src/distilabel/models/llms/huggingface/transformers.py
unload()
prepare_input(input)
Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
StandardInput
|
the input list containing chat items. |
required |
Returns:
Type | Description |
---|---|
str
|
The prompt to send to the LLM. |
Source code in src/distilabel/models/llms/huggingface/transformers.py
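When neither chat_template nor the tokenizer config supplies a template, the attribute description says ChatML is used. The sketch below renders a chat in the commonly used ChatML layout to show what such a prompt string looks like. It is a hand-written approximation: the real prompt is produced by the tokenizer's template, so exact whitespace and the trailing generation prompt are assumptions, and `to_chatml` is a hypothetical name.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages as a single ChatML-style prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([{"role": "user", "content": "Hello world!"}])
print(prompt)
```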
generate(inputs, num_generations=1, max_new_tokens=128, temperature=0.1, repetition_penalty=1.1, top_p=1.0, top_k=0, do_sample=True)
Generates num_generations responses for each input using the text generation pipeline.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[StandardInput] | a list of inputs in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. | 1 |
max_new_tokens | int | the maximum number of new tokens that the model will generate. | 128 |
temperature | float | the temperature to use for the generation. | 0.1 |
repetition_penalty | float | the repetition penalty to use for the generation. | 1.1 |
top_p | float | the top-p value to use for the generation. | 1.0 |
top_k | int | the top-k value to use for the generation. | 0 |
do_sample | bool | whether to use sampling or not. | True |
Returns:
Type | Description |
---|---|
List[GenerateOutput] | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/huggingface/transformers.py
get_last_hidden_states(inputs)
Gets the last hidden_states of the model for the given inputs. It doesn't execute the task head.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
List[StandardInput]
|
a list of inputs in chat format to generate the embeddings for. |
required |
Returns:
Type | Description |
---|---|
List[HiddenState] | A list containing the last hidden state for each sequence, as a NumPy array with shape [num_tokens, hidden_size]. |
Source code in src/distilabel/models/llms/huggingface/transformers.py
_prepare_structured_output(structured_output=None)
Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
structured_output
|
Optional[OutlinesStructuredOutputType]
|
the configuration dict to prepare the structured output. |
None
|
Returns:
Type | Description |
---|---|
Union[Callable, List[Callable]]
|
The callable that will be used to guide the generation of the model. |
Source code in src/distilabel/models/llms/huggingface/transformers.py
LiteLLM
Bases: AsyncLLM
LiteLLM implementation running the async API client.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model name to use for the LLM e.g. "gpt-3.5-turbo" or "mistral/mistral-large", etc. |
verbose |
RuntimeParameter[bool]
|
whether to log the LiteLLM client's logs. Defaults to |
structured_output |
Optional[RuntimeParameter[InstructorStructuredOutputType]]
|
a dictionary containing the structured output configuration
using |
Runtime parameters
- verbose: whether to log the LiteLLM client's logs. Defaults to False.
Examples:
Generate text:
from distilabel.models.llms import LiteLLM
llm = LiteLLM(model="gpt-3.5-turbo")
llm.load()
# Call the model
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
```python
from pydantic import BaseModel
from distilabel.models.llms import LiteLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = LiteLLM(
    model="gpt-3.5-turbo",
    api_key="api.key",
    structured_output={"schema": User}
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
```
Source code in src/distilabel/models/llms/litellm.py
model_name
property
Returns the model name used for the LLM.
load()
Loads the acompletion LiteLLM client to benefit from async requests.
Source code in src/distilabel/models/llms/litellm.py
agenerate(input, num_generations=1, functions=None, function_call=None, temperature=1.0, top_p=1.0, stop=None, max_tokens=None, presence_penalty=None, frequency_penalty=None, logit_bias=None, user=None, metadata=None, api_base=None, api_version=None, api_key=None, model_list=None, mock_response=None, force_timeout=600, custom_llm_provider=None)
async
Generates num_generations responses for the given input using the LiteLLM async client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | FormattedInput | a single input in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. | 1 |
functions | Optional[List] | a list of functions to apply to the conversation messages. | None |
function_call | Optional[str] | the name of the function to call within the conversation. | None |
temperature | Optional[float] | the temperature to use for the generation. | 1.0 |
top_p | Optional[float] | the top-p value to use for the generation. | 1.0 |
stop | Optional[Union[str, list]] | up to 4 sequences where the LLM API will stop generating further tokens. | None |
max_tokens | Optional[int] | the maximum number of tokens in the generated completion. | None |
presence_penalty | Optional[float] | used to penalize new tokens based on their existence in the text so far. | None |
frequency_penalty | Optional[float] | used to penalize new tokens based on their frequency in the text so far. | None |
logit_bias | Optional[dict] | used to modify the probability of specific tokens appearing in the completion. | None |
user | Optional[str] | a unique identifier representing your end-user. This can help the LLM provider to monitor and detect abuse. | None |
metadata | Optional[dict] | additional metadata to tag your completion calls, e.g. prompt version, details, etc. | None |
api_base | Optional[str] | base URL for the API. | None |
api_version | Optional[str] | API version. | None |
api_key | Optional[str] | API key. | None |
model_list | Optional[list] | list of API bases, versions, and keys. | None |
mock_response | Optional[str] | if provided, return a mock completion response for testing or debugging purposes. | None |
force_timeout | Optional[int] | the maximum execution time in seconds for the completion request. | 600 |
custom_llm_provider | Optional[str] | used for non-OpenAI LLMs. For example, for Bedrock, set model="amazon.titan-tg1-large" and custom_llm_provider="bedrock". | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/litellm.py
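The presence_penalty and frequency_penalty parameters above follow the OpenAI-style convention, in which both are subtracted from a token's logit depending on how often that token has already been generated. The following is a minimal sketch of that adjustment, assuming the formula given in the OpenAI API documentation; the function name apply_penalties and the toy vocabulary are illustrative and are not part of LiteLLM or distilabel.

```python
from collections import Counter
from typing import Dict, List


def apply_penalties(
    logits: Dict[str, float],
    generated_tokens: List[str],
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
) -> Dict[str, float]:
    """Return a copy of `logits` with OpenAI-style penalties applied."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        # The frequency penalty scales with the number of occurrences;
        # the presence penalty is a flat cost once the token has appeared.
        adjusted[token] = (
            logit
            - count * frequency_penalty
            - (1.0 if count > 0 else 0.0) * presence_penalty
        )
    return adjusted


logits = {"run": 2.0, "walk": 1.5, "swim": 1.0}
penalized = apply_penalties(
    logits, ["run", "run", "walk"], presence_penalty=0.5, frequency_penalty=0.25
)
print(penalized)
```

Tokens that have not been generated yet ("swim" here) keep their original logit, so positive penalties shift probability mass toward unseen tokens.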
LlamaCppLLM
¶
Bases: LLM
, MagpieChatTemplateMixin
llama.cpp LLM implementation running the Python bindings for the C++ code.
Attributes:
Name | Type | Description |
---|---|---|
model_path |
RuntimeParameter[FilePath]
|
contains the path to the GGUF quantized model, compatible with the
installed version of the |
n_gpu_layers |
RuntimeParameter[int]
|
the number of layers to use for the GPU. Defaults to |
chat_format |
Optional[RuntimeParameter[str]]
|
the chat format to use for the model. Defaults to |
n_ctx |
int
|
the context size to use for the model. Defaults to |
n_batch |
int
|
the prompt processing maximum batch size to use for the model. Defaults to |
seed |
int
|
random seed to use for the generation. Defaults to |
verbose |
RuntimeParameter[bool]
|
whether to print verbose output. Defaults to |
structured_output |
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
|
a dictionary containing the structured output configuration or if more
fine-grained control is needed, an instance of |
extra_kwargs |
Optional[RuntimeParameter[Dict[str, Any]]]
|
additional dictionary of keyword arguments that will be passed to the
|
tokenizer_id |
Optional[RuntimeParameter[str]]
|
the tokenizer Hugging Face Hub repo id or a path to a directory containing
the tokenizer config files. If not provided, the one associated to the |
use_magpie_template |
bool
|
a flag used to enable/disable applying the Magpie pre-query
template. Defaults to |
magpie_pre_query_template |
Union[MagpieAvailablePreQueryTemplates, str, None]
|
the pre-query template to be applied to the prompt or
sent to the LLM to generate an instruction or a follow up user message. Valid
values are "llama3", "qwen2" or another pre-query template provided. Defaults
to |
_model |
Optional[Llama]
|
the Llama model instance. This attribute is meant to be used internally and
should not be accessed directly. It will be set in the |
Runtime parameters
model_path: the path to the GGUF quantized model.
n_gpu_layers: the number of layers to use for the GPU. Defaults to -1.
chat_format: the chat format to use for the model. Defaults to None.
verbose: whether to print verbose output. Defaults to False.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the Llama class of the llama_cpp library. Defaults to {}.
References
Examples:
Generate text:
from pathlib import Path
from distilabel.models.llms import LlamaCppLLM
# You can follow along this example downloading the following model running the following
# command in the terminal, that will download the model to the `Downloads` folder:
# curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
model_path = "Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf"
llm = LlamaCppLLM(
model_path=str(Path.home() / model_path),
n_gpu_layers=-1, # To use the GPU if available
n_ctx=1024, # Set the context size
)
llm.load()
# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
from pathlib import Path
from pydantic import BaseModel
from distilabel.models.llms import LlamaCppLLM

model_path = "Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf"

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = LlamaCppLLM(
    model_path=str(Path.home() / model_path),  # type: ignore
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "json", "schema": User},
)
llm.load()
# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
Source code in src/distilabel/models/llms/llamacpp.py
model_name
property
¶
Returns the model name used for the LLM.
validate_magpie_usage()
¶
Validates that magpie usage is valid.
Source code in src/distilabel/models/llms/llamacpp.py
load()
¶
Loads the Llama model from the model_path.
Source code in src/distilabel/models/llms/llamacpp.py
prepare_input(input)
¶
Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | StandardInput | the input list containing chat items. | required |
Returns:
Type | Description |
---|---|
str | The prompt to send to the LLM. |
Source code in src/distilabel/models/llms/llamacpp.py
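To make the idea of applying a chat template concrete, here is a minimal sketch that renders a list of chat items into a single prompt string. It assumes a ChatML-style template (the format documented for models such as OpenHermes 2.5); the function render_chatml is hypothetical, and the template actually applied by prepare_input depends on the configured tokenizer.

```python
from typing import Dict, List


def render_chatml(chat: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat items into a ChatML-style prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in chat]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Hello world!"}])
print(prompt)
```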
generate(inputs, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, extra_generation_kwargs=None)
¶
Generates num_generations responses for the given input using the Llama model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[FormattedInput] | a list of inputs in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
max_new_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
frequency_penalty | float | the repetition penalty to use for the generation. Defaults to 0.0. | 0.0 |
presence_penalty | float | the presence penalty to use for the generation. Defaults to 0.0. | 0.0 |
temperature | float | the temperature to use for the generation. Defaults to 1.0. | 1.0 |
top_p | float | the top-p value to use for the generation. Defaults to 1.0. | 1.0 |
extra_generation_kwargs | Optional[Dict[str, Any]] | dictionary with additional arguments to be passed to the underlying Llama generation call. | None |
Returns:
Type | Description |
---|---|
List[GenerateOutput] | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/llamacpp.py
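The temperature and top_p parameters above interact: temperature rescales the logits before the softmax, and top-p (nucleus) sampling then keeps only the most probable tokens whose cumulative probability reaches top_p. The sketch below illustrates that general technique on a toy vocabulary; it is not llama.cpp's implementation, and the function name top_p_filter is made up for this example.

```python
import math
from typing import Dict


def top_p_filter(
    logits: Dict[str, float], temperature: float = 1.0, top_p: float = 1.0
) -> Dict[str, float]:
    """Return renormalized probabilities of the tokens kept by nucleus sampling.

    Assumes temperature > 0.
    """
    # Temperature rescales the logits before the softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


candidates = top_p_filter({"a": 2.0, "b": 1.0, "c": -1.0}, temperature=1.0, top_p=0.9)
print(candidates)
```

With these toy logits the least likely token "c" falls outside the 0.9 nucleus, so only "a" and "b" remain as sampling candidates.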
_prepare_structured_output(structured_output=None)
¶
Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
structured_output | Optional[OutlinesStructuredOutputType] | the configuration dict to prepare the structured output. | None |
Returns:
Type | Description |
---|---|
Union[LogitsProcessorList, LogitsProcessor] | The callable that will be used to guide the generation of the model. |
Source code in src/distilabel/models/llms/llamacpp.py
MistralLLM
¶
Bases: AsyncLLM
Mistral LLM implementation running the async API client.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model name to use for the LLM e.g. "mistral-tiny", "mistral-large", etc. |
endpoint |
str
|
the endpoint to use for the Mistral API. Defaults to "https://api.mistral.ai". |
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the Mistral API. Defaults to |
max_retries |
RuntimeParameter[int]
|
the maximum number of retries to attempt when a request fails. Defaults to |
timeout |
RuntimeParameter[int]
|
the maximum time in seconds to wait for a response. Defaults to |
max_concurrent_requests |
RuntimeParameter[int]
|
the maximum number of concurrent requests to send. Defaults
to |
structured_output |
Optional[RuntimeParameter[InstructorStructuredOutputType]]
|
a dictionary containing the structured output configuration
using |
_api_key_env_var |
str
|
the name of the environment variable to use for the API key. It is meant to be used internally. |
_aclient |
Optional[Mistral]
|
the |
Runtime parameters
api_key: the API key to authenticate the requests to the Mistral API.
max_retries: the maximum number of retries to attempt when a request fails. Defaults to 5.
timeout: the maximum time in seconds to wait for a response. Defaults to 120.
max_concurrent_requests: the maximum number of concurrent requests to send. Defaults to 64.
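A cap such as max_concurrent_requests is commonly enforced with a semaphore around each request. The following is a minimal sketch of that pattern using only asyncio; fake_request is a stand-in for a real API call, and nothing here is taken from the MistralLLM implementation, which may limit concurrency differently.

```python
import asyncio
from typing import List


async def fake_request(prompt: str) -> str:
    # Stand-in for a real API call; it only simulates network latency.
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


async def generate_all(prompts: List[str], max_concurrent_requests: int = 64) -> List[str]:
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def bounded(prompt: str) -> str:
        # At most `max_concurrent_requests` coroutines run this body at once.
        async with semaphore:
            return await fake_request(prompt)

    # gather returns the results in the same order as the prompts.
    return await asyncio.gather(*(bounded(p) for p in prompts))


responses = asyncio.run(
    generate_all([f"prompt {i}" for i in range(10)], max_concurrent_requests=3)
)
print(responses[0])
```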
Examples:
Generate text:
from distilabel.models.llms import MistralLLM
llm = MistralLLM(model="open-mixtral-8x22b")
llm.load()
# Call the model
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
```python
from pydantic import BaseModel
from distilabel.models.llms import MistralLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = MistralLLM(
    model="open-mixtral-8x22b",
    api_key="api.key",
    structured_output={"schema": User}
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
```
Source code in src/distilabel/models/llms/mistral.py
model_name
property
¶
Returns the model name used for the LLM.
load()
¶
Loads the Mistral client to benefit from async requests.
Source code in src/distilabel/models/llms/mistral.py
agenerate(input, max_new_tokens=None, temperature=None, top_p=None)
async
¶
Generates num_generations responses for the given input using the MistralAI async client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | FormattedInput | a single input in chat format to generate responses for. | required |
max_new_tokens | Optional[int] | the maximum number of new tokens that the model will generate. Defaults to None. | None |
temperature | Optional[float] | the temperature to use for the generation. Defaults to None. | None |
top_p | Optional[float] | the top-p value to use for the generation. Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/mistral.py
MlxLLM
¶
Bases: LLM
, MagpieChatTemplateMixin
Apple MLX LLM implementation.
Attributes:
Name | Type | Description |
---|---|---|
path_or_hf_repo |
str
|
the path to the model or the Hugging Face Hub repo id. |
tokenizer_config |
Dict[str, Any]
|
the tokenizer configuration. |
model_config |
Dict[str, Any]
|
the model configuration. |
adapter_path |
Optional[str]
|
the path to the adapter. |
use_magpie_template |
bool
|
a flag used to enable/disable applying the Magpie pre-query
template. Defaults to |
magpie_pre_query_template |
Union[MagpieAvailablePreQueryTemplates, str, None]
|
the pre-query template to be applied to the prompt or
sent to the LLM to generate an instruction or a follow up user message. Valid
values are "llama3", "qwen2" or another pre-query template provided. Defaults
to |
Examples:
Generate text:
from distilabel.models.llms import MlxLLM
llm = MlxLLM(path_or_hf_repo="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
llm.load()
# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Source code in src/distilabel/models/llms/mlx.py
model_name
property
¶
Returns the model name used for the LLM.
load()
¶
Loads the model and tokenizer and creates the text generation pipeline. In addition, it will configure the tokenizer chat template.
Source code in src/distilabel/models/llms/mlx.py
prepare_input(input)
¶
Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | StandardInput | the input list containing chat items. | required |
Returns:
Type | Description |
---|---|
str | The prompt to send to the LLM. |
Source code in src/distilabel/models/llms/mlx.py
generate(inputs, num_generations=1, max_tokens=256, sampler=None, logits_processors=None, max_kv_size=None, prompt_cache=None, prefill_step_size=512, kv_bits=None, kv_group_size=64, quantized_kv_start=0, prompt_progress_callback=None, temp=None, repetition_penalty=None, repetition_context_size=None, top_p=None, min_p=None, min_tokens_to_keep=None)
¶
Generates num_generations responses for each input using the text generation pipeline.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[StandardInput] | the inputs to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
max_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 256. | 256 |
sampler | Optional[Callable] | the sampler to use for the generation. Defaults to None. | None |
logits_processors | Optional[List[Callable]] | the logits processors to use for the generation. Defaults to None. | None |
max_kv_size | Optional[int] | the maximum size of the key-value cache. Defaults to None. | None |
prompt_cache | Optional[Any] | the prompt cache to use for the generation. Defaults to None. | None |
prefill_step_size | int | the prefill step size. Defaults to 512. | 512 |
kv_bits | Optional[int] | the number of bits to use for the key-value cache. Defaults to None. | None |
kv_group_size | int | the group size for the key-value cache. Defaults to 64. | 64 |
quantized_kv_start | int | the start of the quantized key-value cache. Defaults to 0. | 0 |
prompt_progress_callback | Optional[Callable[[int, int], None]] | the callback to use for the generation. Defaults to None. | None |
temp | Optional[float] | the temperature to use for the generation. Defaults to None. | None |
repetition_penalty | Optional[float] | the repetition penalty to use for the generation. Defaults to None. | None |
repetition_context_size | Optional[int] | the context size for the repetition penalty. Defaults to None. | None |
top_p | Optional[float] | the top-p value to use for the generation. Defaults to None. | None |
min_p | Optional[float] | the minimum p value to use for the generation. Defaults to None. | None |
min_tokens_to_keep | Optional[int] | the minimum number of tokens to keep. Defaults to None. | None |
Returns:
Type | Description |
---|---|
List[GenerateOutput] | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/mlx.py
MixtureOfAgentsLLM
¶
Bases: AsyncLLM
Mixture-of-Agents implementation.
An LLM class that leverages the collective strengths of several LLMs to generate a response, as described in the "Mixture-of-Agents Enhances Large Language Model Capabilities" paper. A list of proposer LLMs generates outputs that the LLMs in the next round/layer can use as auxiliary information. Finally, an aggregator LLM combines the outputs to generate the final response.
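The round-based flow described above can be sketched with plain callables standing in for the LLMs. This is a simplified illustration under stated assumptions, not the class's actual implementation: the Model type, the mixture_of_agents function, and the wording of the reference prompt are all hypothetical.

```python
from typing import Callable, List

# A "model" here is any callable mapping a prompt to a response (hypothetical stand-in).
Model = Callable[[str], str]


def with_references(prompt: str, prev_outputs: List[str]) -> str:
    """Prepend the previous round's outputs to the prompt as numbered references."""
    if not prev_outputs:
        return prompt
    references = "\n".join(f"{i + 1}. {out}" for i, out in enumerate(prev_outputs))
    return f"Previous responses:\n{references}\n\nQuestion: {prompt}"


def mixture_of_agents(
    prompt: str, proposers: List[Model], aggregator: Model, rounds: int = 1
) -> str:
    prev_outputs: List[str] = []
    for _ in range(rounds):
        # Every proposer sees the prompt plus the outputs of the previous round.
        context = with_references(prompt, prev_outputs)
        prev_outputs = [propose(context) for propose in proposers]
    # The aggregator turns the last round's proposals into the final response.
    return aggregator(with_references(prompt, prev_outputs))


proposers = [lambda p: "answer A", lambda p: "answer B"]
aggregator = lambda p: f"aggregated {p.count('answer')} answers"
final = mixture_of_agents("What is 2 + 2?", proposers, aggregator, rounds=2)
print(final)
```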
Attributes:
Name | Type | Description |
---|---|---|
aggregator_llm |
LLM
|
The |
proposers_llms |
List[AsyncLLM]
|
The list of |
rounds |
int
|
The number of layers or rounds that the |
Examples:
Generate text:
from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)
llm.load()
output = llm.generate_outputs(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
Source code in src/distilabel/models/llms/moa.py
runtime_parameters_names
property
¶
Returns the runtime parameters of the LLM, which are a combination of the RuntimeParameters of the LLM, the aggregator_llm and the proposers_llms.
Returns:
Type | Description |
---|---|
RuntimeParametersNames
|
The runtime parameters of the |
model_name
property
¶
Returns the aggregated model name.
load()
¶
Loads all the LLMs in the MixtureOfAgents.
Source code in src/distilabel/models/llms/moa.py
get_generation_kwargs()
¶
Returns the generation kwargs of the MixtureOfAgents as a dictionary.
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
The generation kwargs of the |
Source code in src/distilabel/models/llms/moa.py
_build_moa_system_prompt(prev_outputs)
¶
Builds the Mixture-of-Agents system prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prev_outputs
|
List[str]
|
The list of previous outputs to use as references. |
required |
Returns:
Type | Description |
---|---|
str
|
The Mixture-of-Agents system prompt. |
Source code in src/distilabel/models/llms/moa.py
_inject_moa_system_prompt(input, prev_outputs)
¶
Injects the Mixture-of-Agents system prompt into the input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
StandardInput
|
The input to inject the system prompt into. |
required |
prev_outputs
|
List[str]
|
The list of previous outputs to use as references. |
required |
Returns:
Type | Description |
---|---|
StandardInput
|
The input with the Mixture-of-Agents system prompt injected. |
Source code in src/distilabel/models/llms/moa.py
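Building and injecting a system prompt into a chat-format input can be sketched as follows. The prompt wording, the ChatInput alias, and both function names are assumptions made for illustration; the text used by _build_moa_system_prompt and the exact merge behaviour of _inject_moa_system_prompt may differ.

```python
from typing import Dict, List

ChatInput = List[Dict[str, str]]


def build_system_prompt(prev_outputs: List[str]) -> str:
    # The wording is illustrative; the library's actual prompt may differ.
    numbered = "\n".join(f"{i + 1}. {out}" for i, out in enumerate(prev_outputs))
    return f"Synthesize the following responses into a single answer:\n{numbered}"


def inject_system_prompt(chat: ChatInput, prev_outputs: List[str]) -> ChatInput:
    if not prev_outputs:
        return chat
    moa_prompt = build_system_prompt(prev_outputs)
    if chat and chat[0]["role"] == "system":
        # Merge with the existing system message rather than adding a second one.
        merged = {"role": "system", "content": f"{moa_prompt}\n\n{chat[0]['content']}"}
        return [merged, *chat[1:]]
    return [{"role": "system", "content": moa_prompt}, *chat]


chat = [{"role": "user", "content": "Hello world!"}]
injected = inject_system_prompt(chat, ["Hi there!", "Hello!"])
print(injected[0]["content"])
```

The function returns a new list instead of mutating its argument, so the same input can be reused across rounds.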
_agenerate(inputs, num_generations=1, **kwargs)
async
¶
Internal function to concurrently generate responses for a list of inputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
List[FormattedInput]
|
the list of inputs to generate responses for. |
required |
num_generations
|
int
|
the number of generations to generate per input. |
1
|
**kwargs
|
Any
|
the additional kwargs to be used for the generation. |
{}
|
Returns:
Type | Description |
---|---|
List[GenerateOutput]
|
A list containing the generations for each input. |
Source code in src/distilabel/models/llms/moa.py
OllamaLLM
¶
Bases: AsyncLLM
, MagpieChatTemplateMixin
Ollama LLM implementation running the Async API client.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model name to use for the LLM e.g. "notus". |
host |
Optional[RuntimeParameter[str]]
|
the Ollama server host. |
timeout |
RuntimeParameter[int]
|
the timeout for the LLM. Defaults to |
follow_redirects |
bool
|
whether to follow redirects. Defaults to |
structured_output |
Optional[RuntimeParameter[InstructorStructuredOutputType]]
|
a dictionary containing the structured output configuration or if more
fine-grained control is needed, an instance of |
tokenizer_id |
Optional[RuntimeParameter[str]]
|
the tokenizer Hugging Face Hub repo id or a path to a directory containing
the tokenizer config files. If not provided, the one associated to the |
use_magpie_template |
bool
|
a flag used to enable/disable applying the Magpie pre-query
template. Defaults to |
magpie_pre_query_template |
Union[MagpieAvailablePreQueryTemplates, str, None]
|
the pre-query template to be applied to the prompt or
sent to the LLM to generate an instruction or a follow up user message. Valid
values are "llama3", "qwen2" or another pre-query template provided. Defaults
to |
_aclient |
Optional[AsyncClient]
|
the |
Runtime parameters
host: the Ollama server host.
timeout: the client timeout for the Ollama API. Defaults to 120.
Examples:
Generate text:
from distilabel.models.llms import OllamaLLM
llm = OllamaLLM(model="llama3")
llm.load()
# Call the model
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
Source code in src/distilabel/models/llms/ollama.py
model_name
property
¶
Returns the model name used for the LLM.
validate_magpie_usage()
¶
Validates that magpie usage is valid.
Source code in src/distilabel/models/llms/ollama.py
load()
¶
Loads the AsyncClient to use Ollama async API.
Source code in src/distilabel/models/llms/ollama.py
prepare_input(input)
¶
Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | StandardInput | the input list containing chat items. | required |
Returns:
Type | Description |
---|---|
str | The prompt to send to the LLM. |
Source code in src/distilabel/models/llms/ollama.py
agenerate(input, format='', options=None, keep_alive=None)
async
¶
Generates a response asynchronously, using the Ollama Async API definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | StandardInput | the input to use for the generation. | required |
format | Literal['', 'json'] | the format to use for the generation. Defaults to ''. | '' |
options | Union[Options, None] | the options to use for the generation. Defaults to None. | None |
keep_alive | Union[bool, None] | whether to keep the connection alive. Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of strings as completion for the given input. |
Source code in src/distilabel/models/llms/ollama.py
OpenAILLM
¶
Bases: OpenAIBaseClient
, AsyncLLM
OpenAI LLM implementation running the async API client.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model name to use for the LLM e.g. "gpt-3.5-turbo", "gpt-4", etc. Supported models can be found here. |
base_url |
Optional[RuntimeParameter[str]]
|
the base URL to use for the OpenAI API requests. Defaults to |
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the OpenAI API. Defaults to
|
default_headers |
Optional[RuntimeParameter[Dict[str, str]]]
|
the default headers to use for the OpenAI API requests. |
max_retries |
RuntimeParameter[int]
|
the maximum number of times to retry the request to the API before
failing. Defaults to |
timeout |
RuntimeParameter[int]
|
the maximum time in seconds to wait for a response from the API. Defaults
to |
structured_output |
Optional[RuntimeParameter[InstructorStructuredOutputType]]
|
a dictionary containing the structured output configuration
using |
Runtime parameters
base_url: the base URL to use for the OpenAI API requests. Defaults to None.
api_key: the API key to authenticate the requests to the OpenAI API. Defaults to None.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
Examples:
Generate text:
from distilabel.models.llms import OpenAILLM
llm = OpenAILLM(model="gpt-4-turbo", api_key="api.key")
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate text from a custom endpoint following the OpenAI API:
from distilabel.models.llms import OpenAILLM
llm = OpenAILLM(
model="prometheus-eval/prometheus-7b-v2.0",
base_url=r"http://localhost:8080/v1"
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
from pydantic import BaseModel
from distilabel.models.llms import OpenAILLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = OpenAILLM(
    model="gpt-4-turbo",
    api_key="api.key",
    structured_output={"schema": User}
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
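Once a structured generation comes back, it usually still needs to be parsed and checked against the schema. The sketch below assumes the generation is a JSON string with the fields of the User schema; it uses a stdlib dataclass as a stand-in for the pydantic model so that it runs without extra dependencies, and the sample JSON string is invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class User:
    # Stdlib mirror of the pydantic `User` schema used in the example above.
    name: str
    last_name: str
    id: int


def parse_user(generation: str) -> User:
    """Parse a JSON generation and coerce its fields to the schema's types."""
    data = json.loads(generation)
    return User(
        name=str(data["name"]),
        last_name=str(data["last_name"]),
        id=int(data["id"]),
    )


user = parse_user('{"name": "Ada", "last_name": "Lovelace", "id": 1}')
print(user)
```

A missing field raises KeyError and malformed JSON raises json.JSONDecodeError, which makes invalid generations easy to filter out.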
Generate with Batch API (offline batch generation):
from distilabel.models.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-3.5-turbo",
    use_offline_batch_generation=True,
    offline_batch_generation_block_until_done=5,  # poll for results every 5 seconds
)
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
# [['Hello! How can I assist you today?']]
Source code in src/distilabel/models/llms/openai.py
agenerate(input, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, echo=False, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, extra_body=None)
async
¶
Generates num_generations responses for the given input using the OpenAI async client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | FormattedInput | a single input in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
max_new_tokens | NonNegativeInt | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
logprobs | bool | whether to return the log probabilities or not. Defaults to False. | False |
top_logprobs | Optional[PositiveInt] | the number of top log probabilities to return per output token generated. Defaults to None. | None |
echo | bool | whether to echo the input in the response or not. It's only used if the input argument is a str. Defaults to False. | False |
frequency_penalty | float | the repetition penalty to use for the generation. Defaults to 0.0. | 0.0 |
presence_penalty | float | the presence penalty to use for the generation. Defaults to 0.0. | 0.0 |
temperature | float | the temperature to use for the generation. Defaults to 1.0. | 1.0 |
top_p | float | the top-p value to use for the generation. Defaults to 1.0. | 1.0 |
stop | Optional[Union[str, List[str]]] | a string or a list of strings to use as a stop sequence for the generation. Defaults to None. | None |
response_format | Optional[Dict[str, str]] | the format of the response to return. Must be one of "text" or "json". Read the documentation here for more information on how to use the JSON model from OpenAI. Defaults to None which returns text. To return JSON, use {"type": "json_object"}. | None |
extra_body | Optional[Dict[str, Any]] | an optional dictionary containing extra body parameters that will be sent to the OpenAI API endpoint. Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/openai.py
_generations_from_openai_completion(completion)
¶
Get the generations from the OpenAI Chat Completion object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
completion
|
ChatCompletion
|
the completion object to get the generations from. |
required |
Returns:
Type | Description |
---|---|
GenerateOutput
|
A list of strings containing the generated responses for the input. |
Source code in src/distilabel/models/llms/openai.py
offline_batch_generate(inputs=None, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, **kwargs)
¶
Uses the OpenAI batch API to generate num_generations responses for the given inputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[List[FormattedInput], None] | a list of inputs in chat format to generate responses for. | None |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
max_new_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
logprobs | bool | whether to return the log probabilities or not. Defaults to False. | False |
top_logprobs | Optional[PositiveInt] | the number of top log probabilities to return per output token generated. Defaults to None. | None |
frequency_penalty | float | the repetition penalty to use for the generation. Defaults to 0.0. | 0.0 |
presence_penalty | float | the presence penalty to use for the generation. Defaults to 0.0. | 0.0 |
temperature | float | the temperature to use for the generation. Defaults to 1.0. | 1.0 |
top_p | float | the top-p value to use for the generation. Defaults to 1.0. | 1.0 |
stop | Optional[Union[str, List[str]]] | a string or a list of strings to use as a stop sequence for the generation. Defaults to None. | None |
response_format | Optional[str] | the format of the response to return. Must be one of "text" or "json". Read the documentation here for more information on how to use the JSON model from OpenAI. Defaults to None. | None |
Returns:
Type | Description |
---|---|
List[GenerateOutput] | A list of lists of strings containing the generated responses for each input in inputs. |
Raises:
Type | Description |
---|---|
DistilabelOfflineBatchGenerationNotFinishedException | if the batch generation is not finished yet. |
ValueError | if no job IDs were found to retrieve the results from. |
Source code in src/distilabel/models/llms/openai.py
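Because the method raises an exception while the batch is still running, a caller typically retries until results are available. The following is a minimal polling sketch under stated assumptions: BatchNotFinished is a local stand-in for DistilabelOfflineBatchGenerationNotFinishedException, and wait_for_results and fake_fetch are hypothetical helpers, not distilabel functions.

```python
import time
from typing import Callable, Dict, List


class BatchNotFinished(Exception):
    """Local stand-in for DistilabelOfflineBatchGenerationNotFinishedException."""


def wait_for_results(
    fetch: Callable[[], List[str]], poll_seconds: float, max_polls: int
) -> List[str]:
    """Call `fetch` until it stops raising BatchNotFinished or polls run out."""
    for _ in range(max_polls):
        try:
            return fetch()
        except BatchNotFinished:
            # The batch is still running: wait before asking again.
            time.sleep(poll_seconds)
    raise TimeoutError("batch did not finish in time")


attempts: Dict[str, int] = {"count": 0}


def fake_fetch() -> List[str]:
    # Simulates a batch that finishes on the third status check.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise BatchNotFinished()
    return ["Hello! How can I assist you today?"]


batch_results = wait_for_results(fake_fetch, poll_seconds=0.01, max_polls=5)
print(batch_results)
```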
_check_and_get_batch_results()
¶
Checks the status of the batch jobs and retrieves the results from the OpenAI Batch API.
Returns:
Type | Description |
---|---|
List[GenerateOutput]
|
A list of lists of strings containing the generated responses for each input. |
Raises:
Type | Description |
---|---|
ValueError
|
if no job IDs were found to retrieve the results from. |
DistilabelOfflineBatchGenerationNotFinishedException
|
if the batch generation is not finished yet. |
RuntimeError
|
if the only batch job found failed. |
Source code in src/distilabel/models/llms/openai.py
_parse_output(output)
¶
Parses the output from the OpenAI Batch API into a list of strings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output
|
Dict[str, Any]
|
the output to parse. |
required |
Returns:
Type | Description |
---|---|
GenerateOutput
|
A list of strings containing the generated responses for the input. |
Source code in src/distilabel/models/llms/openai.py
_get_openai_batch(batch_id)
¶
Gets a batch from the OpenAI Batch API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_id
|
str
|
the ID of the batch to retrieve. |
required |
Returns:
Type | Description |
---|---|
Batch
|
The batch retrieved from the OpenAI Batch API. |
Raises:
Type | Description |
---|---|
OpenAIError
|
if there was an error while retrieving the batch from the OpenAI Batch API. |
Source code in src/distilabel/models/llms/openai.py
_retrieve_batch_results(batch)
¶
Retrieves the results of a batch from its output file, parsing the JSONL content into a list of dictionaries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch
|
Batch
|
the batch to retrieve the results from. |
required |
Returns:
Type | Description |
---|---|
List[Dict[str, Any]]
|
A list of dictionaries containing the results of the batch. |
Raises:
Type | Description |
---|---|
AssertionError
|
if no output file ID was found in the batch. |
Source code in src/distilabel/models/llms/openai.py
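The JSONL parsing mentioned above can be sketched with the standard library alone. The helper name `parse_jsonl` and the sample content are hypothetical; downloading the output file from the API is deliberately left out.

```python
import json
from typing import Any, Dict, List


def parse_jsonl(content: str) -> List[Dict[str, Any]]:
    """Parse JSONL text (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in content.split("\n") if line.strip()]


# Hypothetical file content with two result rows.
sample = '{"custom_id": "0", "error": null}\n{"custom_id": "1", "error": null}\n'
results = parse_jsonl(sample)
```

Splitting on "\n" rather than using `str.splitlines` avoids breaking a row on other Unicode line separators that could appear unescaped inside a JSON string.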
_create_jobs(inputs, **kwargs)
¶
Creates jobs in the OpenAI Batch API to generate responses for the given inputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
List[FormattedInput]
|
a list of inputs in chat format to generate responses for. |
required |
kwargs
|
Any
|
the keyword arguments to use for the generation. |
{}
|
Returns:
Type | Description |
---|---|
Tuple[str, ...]
|
A tuple of job IDs created in the OpenAI Batch API. |
Source code in src/distilabel/models/llms/openai.py
_create_batch_api_job(batch_input_file)
¶
Creates a job in the OpenAI Batch API to generate responses for the given input file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_input_file
|
FileObject
|
the input file to generate responses for. |
required |
Returns:
Type | Description |
---|---|
Union[Batch, None]
|
The batch job created in the OpenAI Batch API. |
Source code in src/distilabel/models/llms/openai.py
_create_batch_files(inputs, **kwargs)
¶
Creates the input files that the Batch API needs to generate responses. The OpenAI Batch API can process files of at most 100MB each, so the inputs are split into multiple files if necessary.
More information: https://platform.openai.com/docs/api-reference/files/create
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
List[FormattedInput]
|
a list of inputs in chat format to generate responses for, optionally including structured output. |
required |
kwargs
|
Any
|
the keyword arguments to use for the generation. |
{}
|
Returns:
Type | Description |
---|---|
List[FileObject]
|
The list of file objects created for the OpenAI Batch API. |
Raises:
Type | Description |
---|---|
OpenAIError
|
if there was an error while creating the batch input file in the OpenAI Batch API. |
Source code in src/distilabel/models/llms/openai.py
_create_jsonl_buffers(inputs, **kwargs)
¶
Creates a generator of buffers containing the JSONL formatted inputs to be used by the OpenAI Batch API. The buffers created are of size 100MB or less.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
List[FormattedInput]
|
a list of inputs in chat format to generate responses for, optionally including structured output. |
required |
kwargs
|
Any
|
the keyword arguments to use for the generation. |
{}
|
Yields:
Type | Description |
---|---|
BytesIO
|
A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch API. |
Source code in src/distilabel/models/llms/openai.py
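The size-bounded buffering described above can be sketched as a generator. This is only an illustration under stated assumptions: the name `jsonl_buffers`, the `max_size` parameter, and the use of plain dictionaries as rows are hypothetical and do not mirror the actual method signature.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator

MAX_BUFFER_SIZE = 100 * 1024 * 1024  # the 100MB limit mentioned above


def jsonl_buffers(
    rows: Iterable[Dict[str, Any]], max_size: int = MAX_BUFFER_SIZE
) -> Iterator[io.BytesIO]:
    """Yield in-memory buffers of JSONL lines, each at most ``max_size`` bytes.

    A single line larger than ``max_size`` is still emitted alone in its own
    buffer; this sketch does not try to split or reject it.
    """
    buffer = io.BytesIO()
    size = 0
    for row in rows:
        line = json.dumps(row).encode("utf-8") + b"\n"
        # Start a new buffer when adding this line would exceed the limit.
        if size > 0 and size + len(line) > max_size:
            buffer.seek(0)
            yield buffer
            buffer, size = io.BytesIO(), 0
        buffer.write(line)
        size += len(line)
    if size > 0:
        buffer.seek(0)
        yield buffer


# Five 19-byte lines with a 40-byte limit produce buffers of 2, 2 and 1 lines.
rows = [{"custom_id": str(i)} for i in range(5)]
chunks = [buf.getvalue() for buf in jsonl_buffers(rows, max_size=40)]
```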
_create_jsonl_row(input, custom_id, **kwargs)
¶
Creates a JSONL formatted row to be used by the OpenAI Batch API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
FormattedInput
|
a single input in chat format to generate responses for, optionally including structured output. |
required |
custom_id
|
str
|
a custom ID to use for the row. |
required |
kwargs
|
Any
|
the keyword arguments to use for the generation. |
{}
|
Returns:
Type | Description |
---|---|
bytes
|
A JSONL formatted row to be used by the OpenAI Batch API. |
Source code in src/distilabel/models/llms/openai.py
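As a rough illustration of what such a row can look like, the sketch below builds one request line in the shape the OpenAI Batch API expects for chat completions (`custom_id`, `method`, `url`, `body`). The helper name `create_jsonl_row`, the explicit `model` argument, and the model name used in the example are assumptions made for this sketch, not the distilabel implementation.

```python
import json
from typing import Any, Dict, List


def create_jsonl_row(
    messages: List[Dict[str, str]], custom_id: str, model: str, **kwargs: Any
) -> bytes:
    """Serialize one chat completion request as a newline-terminated JSONL row."""
    request = {
        "custom_id": custom_id,  # used to match the result back to the input
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": model, "messages": messages, **kwargs},
    }
    return json.dumps(request).encode("utf-8") + b"\n"


# The model name below is only a placeholder for the example.
row = create_jsonl_row(
    [{"role": "user", "content": "Hello world!"}],
    custom_id="0",
    model="gpt-4o-mini",
    temperature=0.7,
)
decoded = json.loads(row)
```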
TogetherLLM
¶
Bases: OpenAILLM
TogetherLLM implementation running the async API client of OpenAI.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model name to use for the LLM e.g. "mistralai/Mixtral-8x7B-Instruct-v0.1". Supported models can be found here. |
base_url |
Optional[RuntimeParameter[str]]
|
the base URL to use for the Together API can be set with |
api_key |
Optional[RuntimeParameter[SecretStr]]
|
the API key to authenticate the requests to the Together API. Defaults to |
_api_key_env_var |
str
|
the name of the environment variable to use for the API key. It is meant to be used internally. |
Examples:
Generate text:
from distilabel.models.llms import TogetherLLM
llm = TogetherLLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", api_key="api.key")
llm.load()
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Source code in src/distilabel/models/llms/together.py
VertexAILLM
¶
Bases: AsyncLLM
VertexAI LLM implementation running the async API clients for Gemini.
- Gemini API: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini
To use the VertexAILLM, Google Cloud authentication must be configured using one of these methods:
- Setting the GOOGLE_CLOUD_CREDENTIALS environment variable
- Using the gcloud auth application-default login command
- Using the vertexai.init function from the google-cloud-aiplatform library
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model name to use for the LLM e.g. "gemini-1.0-pro". Supported models. |
_aclient |
Optional[GenerativeModel]
|
the |
Examples:
Generate text:
from distilabel.models.llms import VertexAILLM
llm = VertexAILLM(model="gemini-1.5-pro")
llm.load()
# Call the model
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
Source code in src/distilabel/models/llms/vertexai.py
model_name
property
¶
Returns the model name used for the LLM.
load()
¶
Loads the GenerativeModel
class which has access to generate_content_async
to benefit from async requests.
Source code in src/distilabel/models/llms/vertexai.py
_chattype_to_content(input)
¶
Converts a chat type to a list of content items expected by the API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
StandardInput
|
the chat type to be converted. |
required |
Returns:
Type | Description |
---|---|
List[Content]
|
a list of content items expected by the API. |
Source code in src/distilabel/models/llms/vertexai.py
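The conversion can be pictured with a small sketch that uses plain dictionaries as a stand-in for the Content objects. The helper name `chat_to_contents`, the restriction to the "user" and "model" roles, and the dictionary layout (a role plus a list of text parts) are assumptions for illustration only.

```python
from typing import Any, Dict, List

# Assumption for this sketch: only these two roles are accepted.
SUPPORTED_ROLES = ("user", "model")


def chat_to_contents(chat: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Convert chat messages into role/parts items (plain-dict stand-ins)."""
    contents = []
    for message in chat:
        role = message["role"]
        if role not in SUPPORTED_ROLES:
            raise ValueError(
                f"Unsupported role {role!r}; expected one of {SUPPORTED_ROLES}."
            )
        contents.append({"role": role, "parts": [{"text": message["content"]}]})
    return contents


contents = chat_to_contents(
    [
        {"role": "user", "content": "Hello world!"},
        {"role": "model", "content": "Hi, how can I help?"},
    ]
)
```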
agenerate(input, temperature=None, top_p=None, top_k=None, max_output_tokens=None, stop_sequences=None, safety_settings=None, tools=None)
async
¶
Generates num_generations
responses for the given input using the VertexAI async client definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
VertexChatType
|
a single input in chat format to generate responses for. |
required |
temperature
|
Optional[float]
|
Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to |
None
|
top_p
|
Optional[float]
|
If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to |
None
|
top_k
|
Optional[int]
|
If specified, top-k sampling will be used. Defaults to |
None
|
max_output_tokens
|
Optional[int]
|
The maximum number of output tokens to generate per message. Defaults to |
None
|
stop_sequences
|
Optional[List[str]]
|
A list of stop sequences. Defaults to |
None
|
safety_settings
|
Optional[Dict[str, Any]]
|
Safety configuration for returned content from the API. Defaults to |
None
|
tools
|
Optional[List[Dict[str, Any]]]
|
A potential list of tools that can be used by the API. Defaults to |
None
|
Returns:
Type | Description |
---|---|
GenerateOutput
|
A list of strings containing the generated responses for the input. |
Source code in src/distilabel/models/llms/vertexai.py
ClientvLLM
¶
Bases: OpenAILLM, MagpieChatTemplateMixin
A client for the vLLM
server implementing the OpenAI API specification.
Attributes:
Name | Type | Description |
---|---|---|
base_url |
Optional[RuntimeParameter[str]]
|
the base URL of the |
max_retries |
RuntimeParameter[int]
|
the maximum number of times to retry the request to the API before
failing. Defaults to |
timeout |
RuntimeParameter[int]
|
the maximum time in seconds to wait for a response from the API. Defaults
to |
httpx_client_kwargs |
RuntimeParameter[int]
|
extra kwargs that will be passed to the |
tokenizer |
Optional[str]
|
the Hugging Face Hub repo id or path of the tokenizer that will be used
to apply the chat template and tokenize the inputs before sending them to the
server. Defaults to |
tokenizer_revision |
Optional[str]
|
the revision of the tokenizer to load. Defaults to |
_aclient |
AsyncOpenAI
|
the |
Runtime parameters
- base_url: the base URL of the vLLM server. Defaults to "http://localhost:8000".
- max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6.
- timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
- httpx_client_kwargs: extra kwargs that will be passed to the httpx.AsyncClient created to communicate with the vLLM server. Defaults to None.
Examples:
Generate text:
from distilabel.models.llms import ClientvLLM
llm = ClientvLLM(
    base_url="http://localhost:8000/v1",
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct"
)
llm.load()
results = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Hello, how are you?"}]],
    temperature=0.7,
    top_p=1.0,
    max_new_tokens=256,
)
# [
#     [
#         "I'm functioning properly, thank you for asking. How can I assist you today?",
#         "I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?",
#         "I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?"
#     ]
# ]
Source code in src/distilabel/models/llms/vllm.py
model_name
cached
property
¶
Returns the name of the model served with vLLM server.
load()
¶
Creates an httpx.AsyncClient
to connect to the vLLM server and, optionally, a tokenizer.
Source code in src/distilabel/models/llms/vllm.py
_prepare_input(input)
¶
Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
StandardInput
|
the input list containing chat items. |
required |
Returns:
Type | Description |
---|---|
str
|
The prompt to send to the LLM. |
Source code in src/distilabel/models/llms/vllm.py
agenerate(input, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, logit_bias=None, presence_penalty=0.0, temperature=1.0, top_p=1.0)
async
¶
Generates num_generations
responses for each input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
FormattedInput
|
a single input in chat format to generate responses for. |
required |
num_generations
|
int
|
the number of generations to create per input. Defaults to
|
1
|
max_new_tokens
|
int
|
the maximum number of new tokens that the model will generate.
Defaults to |
128
|
frequency_penalty
|
float
|
the frequency penalty to use for the generation. Defaults
to |
0.0
|
logit_bias
|
Optional[Dict[str, int]]
|
modify the likelihood of specified tokens appearing in the completion. Defaults to |
None
|
presence_penalty
|
float
|
the presence penalty to use for the generation. Defaults to
|
0.0
|
temperature
|
float
|
the temperature to use for the generation. Defaults to |
1.0
|
top_p
|
float
|
nucleus sampling. The value refers to the top-p tokens that should be
considered for sampling. Defaults to |
1.0
|
Returns:
Type | Description |
---|---|
GenerateOutput
|
A list of strings containing the generated responses for the input. |
Source code in src/distilabel/models/llms/vllm.py
vLLM
¶
Bases: LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin
vLLM
library LLM implementation.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files. |
dtype |
str
|
the data type to use for the model. Defaults to |
trust_remote_code |
bool
|
whether to trust the remote code when loading the model. Defaults
to |
quantization |
Optional[str]
|
the quantization mode to use for the model. Defaults to |
revision |
Optional[str]
|
the revision of the model to load. Defaults to |
tokenizer |
Optional[str]
|
the tokenizer Hugging Face Hub repo id or a path to a directory containing
the tokenizer files. If not provided, the tokenizer will be loaded from the
model directory. Defaults to |
tokenizer_mode |
Literal['auto', 'slow']
|
the mode to use for the tokenizer. Defaults to |
tokenizer_revision |
Optional[str]
|
the revision of the tokenizer to load. Defaults to |
skip_tokenizer_init |
bool
|
whether to skip the initialization of the tokenizer. Defaults
to |
chat_template |
Optional[str]
|
a chat template that will be used to build the prompts before
sending them to the model. If not provided, the chat template defined in the
tokenizer config will be used. If not provided and the tokenizer doesn't have
a chat template, then the ChatML template will be used. Defaults to |
structured_output |
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
|
a dictionary containing the structured output configuration or if more
fine-grained control is needed, an instance of |
seed |
int
|
the seed to use for the random number generator. Defaults to |
extra_kwargs |
Optional[RuntimeParameter[Dict[str, Any]]]
|
additional dictionary of keyword arguments that will be passed to the
|
_model |
LLM
|
the |
_tokenizer |
PreTrainedTokenizer
|
the tokenizer instance used to format the prompt before passing it to
the |
use_magpie_template |
bool
|
a flag used to enable/disable applying the Magpie pre-query
template. Defaults to |
magpie_pre_query_template |
Union[MagpieAvailablePreQueryTemplates, str, None]
|
the pre-query template to be applied to the prompt or
sent to the LLM to generate an instruction or a follow up user message. Valid
values are "llama3", "qwen2" or another pre-query template provided. Defaults
to |
References
- https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py
Runtime parameters
extra_kwargs
: additional dictionary of keyword arguments that will be passed to theLLM
class ofvllm
library.
Examples:
Generate text:
from distilabel.models.llms import vLLM
# You can pass a custom chat_template to the model
llm = vLLM(
    model="prometheus-eval/prometheus-7b-v2.0",
    chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
)
llm.load()
# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
Generate structured data:
from pydantic import BaseModel
from distilabel.models.llms import vLLM
class User(BaseModel):
    name: str
    last_name: str
    id: int
llm = vLLM(
    model="prometheus-eval/prometheus-7b-v2.0",
    structured_output={"format": "json", "schema": User},
)
llm.load()
# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
Source code in src/distilabel/models/llms/vllm.py
model_name
property
¶
Returns the model name used for the LLM.
load()
¶
Loads the vLLM
model using either the path or the Hugging Face Hub repository id.
This method also sets the chat_template
for the tokenizer so that the list of OpenAI-formatted inputs is parsed
in the format expected by the model. If no chat template is provided explicitly and the
tokenizer does not define one, the ChatML format is used by default.
Source code in src/distilabel/models/llms/vllm.py
unload()
¶
Unloads the vLLM
model.
prepare_input(input)
¶
Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
Union[StandardInput, str]
|
the input list containing chat items. |
required |
Returns:
Type | Description |
---|---|
str
|
The prompt to send to the LLM. |
Source code in src/distilabel/models/llms/vllm.py
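For intuition about the ChatML fallback mentioned for this class, the sketch below renders a chat into a ChatML-style prompt by hand. It is only an approximation: the helper name `to_chatml` and the `add_generation_prompt` flag are hypothetical, and the real method relies on the tokenizer's chat template rather than on string formatting like this.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages as a ChatML-style prompt string."""
    prompt = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        prompt += "<|im_start|>assistant\n"
    return prompt


prompt = to_chatml([{"role": "user", "content": "Hello world!"}])
```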
_prepare_batches(inputs)
¶
Prepares the inputs by grouping them by the structured output.
When generating structured outputs with schemas obtained from a dataset, the inputs are
grouped by structured output so that batches of inputs, rather than single inputs, can be
sent to the model to take advantage of the engine. The grouped inputs are then passed to the
generate method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
List[StructuredInput]
|
The batch of inputs passed to the generate method. As we expect to be generating structured outputs, each element will be a tuple containing the instruction and the structured output. |
required |
Returns:
Type | Description |
---|---|
List[Tuple[List[str], OutlinesStructuredOutputType]]
|
The prepared batches (sub-batches) to be passed to the generate method. |
List[int]
|
Each new tuple contains a list of instructions instead of a single instruction. |
Source code in src/distilabel/models/llms/vllm.py
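The grouping idea can be sketched as follows. This is an illustration, not the library code: the helper name `group_by_structured_output` is hypothetical, instructions are treated as plain strings, and the returned list of integers is assumed to hold the original position of each grouped instruction so that outputs could later be put back in input order.

```python
import json
from typing import Any, Dict, List, Tuple


def group_by_structured_output(
    inputs: List[Tuple[str, Dict[str, Any]]],
) -> Tuple[List[Tuple[List[str], Dict[str, Any]]], List[int]]:
    """Group instructions that share the same structured output configuration."""
    groups: Dict[str, List[str]] = {}
    positions: Dict[str, List[int]] = {}
    for index, (instruction, structured_output) in enumerate(inputs):
        # Dictionaries are not hashable, so a JSON dump serves as the group key.
        key = json.dumps(structured_output, sort_keys=True)
        groups.setdefault(key, []).append(instruction)
        positions.setdefault(key, []).append(index)
    batches = [(instructions, json.loads(key)) for key, instructions in groups.items()]
    # Assumption: original positions, listed in the grouped order.
    original_indices = [index for key in groups for index in positions[key]]
    return batches, original_indices


schema_a = {"format": "json", "schema": "A"}
schema_b = {"format": "json", "schema": "B"}
batches, original_indices = group_by_structured_output(
    [("first", schema_a), ("second", schema_b), ("third", schema_a)]
)
```

With the sample inputs, the two instructions sharing `schema_a` end up in one sub-batch, so the engine can process them together instead of one at a time.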
generate(inputs, num_generations=1, max_new_tokens=128, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, logprobs=None, stop=None, stop_token_ids=None, include_stop_str_in_output=False, skip_special_tokens=True, logits_processors=None, extra_sampling_params=None, echo=False)
¶
Generates num_generations
responses for each input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
List[FormattedInput]
|
a list of inputs in chat format to generate responses for. |
required |
num_generations
|
int
|
the number of generations to create per input. Defaults to
|
1
|
max_new_tokens
|
int
|
the maximum number of new tokens that the model will generate.
Defaults to |
128
|
presence_penalty
|
float
|
the presence penalty to use for the generation. Defaults to
|
0.0
|
frequency_penalty
|
float
|
the frequency penalty to use for the generation. Defaults
to |
0.0
|
repetition_penalty
|
float
|
the repetition penalty to use for the generation. Defaults to
|
1.0
|
temperature
|
float
|
the temperature to use for the generation. Defaults to |
1.0
|
top_p
|
float
|
the top-p value to use for the generation. Defaults to |
1.0
|
top_k
|
int
|
the top-k value to use for the generation. Defaults to |
-1
|
min_p
|
float
|
the minimum probability to use for the generation. Defaults to |
0.0
|
logprobs
|
Optional[PositiveInt]
|
number of log probabilities to return per output token. If |
None
|
stop
|
Optional[List[str]]
|
a list of strings that will be used to stop the generation when found.
Defaults to |
None
|
stop_token_ids
|
Optional[List[int]]
|
a list of token ids that will be used to stop the generation
when found. Defaults to |
None
|
include_stop_str_in_output
|
bool
|
whether to include the stop string in the output.
Defaults to |
False
|
skip_special_tokens
|
bool
|
whether to exclude special tokens from the output. Defaults
to |
True
|
logits_processors
|
Optional[LogitsProcessors]
|
a list of functions to process the logits before sampling.
Defaults to |
None
|
extra_sampling_params
|
Optional[Dict[str, Any]]
|
dictionary with additional arguments to be passed to
the |
None
|
echo
|
bool
|
whether to include the prompt in the response or not. Defaults
to |
False
|
Returns:
Type | Description |
---|---|
List[GenerateOutput]
|
A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/models/llms/vllm.py
_prepare_structured_output(structured_output)
¶
Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
structured_output
|
OutlinesStructuredOutputType
|
the configuration dict to prepare the structured output. |
required |
Returns:
Type | Description |
---|---|
Union[Callable, None]
|
The callable that will be used to guide the generation of the model. |
Source code in src/distilabel/models/llms/vllm.py
CudaDevicePlacementMixin
¶
Bases: BaseModel
Mixin class to assign CUDA devices to the LLM based on the cuda_devices attribute
and the device placement information provided in _device_llm_placement_map. Providing
the device placement information is optional, but if it is provided, it will be used to
assign CUDA devices to the LLMs, trying to avoid using the same device for different LLMs.
Attributes:
Name | Type | Description |
---|---|---|
cuda_devices |
RuntimeParameter[Union[List[int], Literal['auto']]]
|
a list with the ID of the CUDA devices to be used by the |
disable_cuda_device_placement |
RuntimeParameter[bool]
|
Whether to disable the CUDA device placement logic
or not. Defaults to |
_llm_identifier |
Union[str, None]
|
the identifier of the |
_device_llm_placement_map |
Generator[Dict[str, List[int]], None, None]
|
a dictionary with the device placement information for each
|
Source code in src/distilabel/models/mixins/cuda_device_placement.py
load()
¶
Assign CUDA devices to the LLM based on the device placement information provided
in _device_llm_placement_map
.
Source code in src/distilabel/models/mixins/cuda_device_placement.py
unload()
¶
Unloads the LLM and removes the CUDA devices assigned to it from the device
placement information provided in _device_llm_placement_map
.
Source code in src/distilabel/models/mixins/cuda_device_placement.py
_device_llm_placement_map()
¶
Reads the content of the device placement file of the node with a lock, yields the content, and writes the content back to the file after the context manager is closed. If the file doesn't exist, an empty dictionary will be yielded.
Yields:
Type | Description |
---|---|
Dict[str, List[int]]
|
The content of the device placement file. |
Source code in src/distilabel/models/mixins/cuda_device_placement.py
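The read, yield, write-back pattern described above can be sketched with a context manager. Everything named here is an assumption for illustration: the function `device_placement_map`, the JSON file format, and especially the `threading.Lock`, which only stands in for whatever cross-process file lock the real implementation uses.

```python
import json
import tempfile
import threading
from contextlib import contextmanager
from pathlib import Path
from typing import Dict, Iterator, List

# Stand-in for a cross-process file lock (assumption for this sketch).
_LOCK = threading.Lock()


@contextmanager
def device_placement_map(path: Path) -> Iterator[Dict[str, List[int]]]:
    """Yield the placement map stored at ``path`` and persist changes on exit."""
    with _LOCK:
        # A missing file is treated as an empty placement map.
        content = json.loads(path.read_text()) if path.exists() else {}
        yield content
        # Reached only if the ``with`` body finished without raising.
        path.write_text(json.dumps(content))


placement_file = Path(tempfile.mkdtemp()) / "device_placement.json"
with device_placement_map(placement_file) as placement:
    placement["llm-a"] = [0]
with device_placement_map(placement_file) as placement:
    placement["llm-b"] = [1]
    snapshot = dict(placement)
```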
_assign_cuda_devices()
¶
Assigns CUDA devices to the LLM based on the device placement information provided
in _device_llm_placement_map
. If the cuda_devices
attribute is set to "auto", it
will be set to the first available CUDA device that is not going to be used by any
other LLM. If the cuda_devices
attribute is set to a list of devices, it will be
checked if the devices are available to be used by the LLM. If not, a warning will be
logged.
Source code in src/distilabel/models/mixins/cuda_device_placement.py
_check_cuda_devices(device_map)
¶
Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device_map
|
Dict[str, List[int]]
|
a dictionary with the device placement information for each LLM. |
required |
Source code in src/distilabel/models/mixins/cuda_device_placement.py
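A minimal sketch of such a check is shown below. The helper name `find_shared_devices` and its return value are hypothetical; the documented method takes only the device map, and how it reports conflicts is not specified here.

```python
from typing import Dict, List


def find_shared_devices(
    llm_id: str, cuda_devices: List[int], device_map: Dict[str, List[int]]
) -> Dict[int, List[str]]:
    """Map each requested device to the other LLMs that already use it."""
    shared: Dict[int, List[str]] = {}
    for other_id, other_devices in device_map.items():
        if other_id == llm_id:
            continue  # an LLM never conflicts with itself
        for device in cuda_devices:
            if device in other_devices:
                shared.setdefault(device, []).append(other_id)
    return shared


conflicts = find_shared_devices(
    "llm-a", [0, 1], {"llm-a": [0, 1], "llm-b": [1], "llm-c": [2]}
)
```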
_get_cuda_device(device_map)
¶
Returns the first available CUDA device to be used by the LLM that is not going to be used by any other LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device_map
|
Dict[str, List[int]]
|
a dictionary with the device placement information for each LLM. |
required |
Returns:
Type | Description |
---|---|
Union[int, None]
|
The first available CUDA device to be used by the LLM. |
Raises:
Type | Description |
---|---|
RuntimeError
|
if there is no available CUDA device to be used by the LLM. |
Source code in src/distilabel/models/mixins/cuda_device_placement.py
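The selection logic can be sketched as a scan over device IDs. The helper name `first_free_device` and the explicit `device_count` argument are assumptions made for this sketch; how the real method learns the number of available devices is not shown here.

```python
from typing import Dict, List


def first_free_device(device_map: Dict[str, List[int]], device_count: int) -> int:
    """Return the lowest device ID that no LLM in ``device_map`` is using."""
    used = {device for devices in device_map.values() for device in devices}
    for device in range(device_count):
        if device not in used:
            return device
    raise RuntimeError("No CUDA device is available: all devices are already assigned.")


free_device = first_free_device({"llm-a": [0], "llm-b": [2]}, device_count=4)
```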
_set_cuda_visible_devices()
¶
Sets the CUDA_VISIBLE_DEVICES
environment variable to the list of CUDA devices
to be used by the LLM.
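As a small sketch of this step, the environment variable takes a comma-separated list of device IDs. The helper name `set_cuda_visible_devices` and its return value are hypothetical. Note that the variable only takes effect if it is set before CUDA is initialized in the process.

```python
import os
from typing import List


def set_cuda_visible_devices(cuda_devices: List[int]) -> str:
    """Expose only the given devices to CUDA and return the value that was set."""
    value = ",".join(str(device) for device in cuda_devices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


visible = set_cuda_visible_devices([0, 2])
```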