Index
AnthropicLLM
Bases: AsyncLLM
Anthropic LLM implementation running the Async API client.
Attributes:
Name | Type | Description |
---|---|---|
model | str | the name of the model to use for the LLM, e.g. "claude-3-opus-20240229", "claude-3-sonnet-20240229", etc. Available models can be checked here: Anthropic: Models overview. |
api_key | Optional[RuntimeParameter[SecretStr]] | the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from the ANTHROPIC_API_KEY environment variable. |
base_url | Optional[RuntimeParameter[str]] | the base URL to use for the Anthropic API. Defaults to "https://api.anthropic.com". |
timeout | RuntimeParameter[float] | the maximum time in seconds to wait for a response. Defaults to 600.0. |
max_retries | RuntimeParameter[int] | the maximum number of times to retry the request before failing. Defaults to 6. |
http_client | Optional[AsyncClient] | if provided, an alternative HTTP client to use for calling the Anthropic API. Defaults to |
_api_key_env_var | str | the name of the environment variable to use for the API key. It is meant to be used internally. |
_aclient | Optional[AsyncAnthropic] | the |
Runtime parameters
- api_key: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from the ANTHROPIC_API_KEY environment variable.
- base_url: the base URL to use for the Anthropic API. Defaults to "https://api.anthropic.com".
- timeout: the maximum time in seconds to wait for a response. Defaults to 600.0.
- max_retries: the maximum number of times to retry the request before failing. Defaults to 6.
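The fallback from an explicit api_key to an environment variable can be pictured with the minimal sketch below. resolve_api_key is a hypothetical helper written for illustration only; it is not part of distilabel, and the error raised when no key is found is an assumption.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None, env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Hypothetical helper: prefer the explicit key, else read the environment variable."""
    key = api_key or os.environ.get(env_var)
    if not key:
        # Assumption: failing loudly is preferable to sending unauthenticated requests.
        raise ValueError(f"pass api_key or set the {env_var} environment variable")
    return key
```

An explicit value always wins, so a key passed at runtime overrides whatever is set in the environment.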
Source code in src/distilabel/llms/anthropic.py
model_name: str
property
Returns the model name used for the LLM.
agenerate(input, max_tokens=128, stop_sequences=None, temperature=1.0, top_p=None, top_k=None)
async
Generates a response asynchronously, using the Anthropic Async API definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | ChatType | a single input in chat format to generate responses for. | required |
max_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
stop_sequences | Union[List[str], None] | custom text sequences that will cause the model to stop generating. Defaults to None. | None |
temperature | float | the temperature to use for the generation. Set only if top_p is None. Defaults to 1.0. | 1.0 |
top_p | Union[float, None] | the top-p value to use for the generation. Defaults to None. | None |
top_k | Union[int, None] | the top-k value to use for the generation. Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/anthropic.py
generate(inputs, num_generations=1, **kwargs)
Generates a list of responses asynchronously and returns the output synchronously, awaiting the response for each input sent to agenerate.
Source code in src/distilabel/llms/anthropic.py
load()
Loads the AsyncAnthropic client to use the Anthropic async API.
Source code in src/distilabel/llms/anthropic.py
AnyscaleLLM
Bases: OpenAILLM
Anyscale LLM implementation running the OpenAI async API client, since the Anyscale API replicates the OpenAI API behavior.
Attributes:
Name | Type | Description |
---|---|---|
model | | the model name to use for the LLM, e.g., |
base_url | Optional[RuntimeParameter[str]] | the base URL to use for the Anyscale API requests. Defaults to |
api_key | Optional[RuntimeParameter[SecretStr]] | the API key to authenticate the requests to the Anyscale API. Defaults to |
_api_key_env_var | str | the name of the environment variable to use for the API key. It is meant to be used internally. |
Source code in src/distilabel/llms/anyscale.py
AsyncLLM
Bases: LLM
Abstract class for asynchronous LLMs, so as to benefit from the async capabilities of each LLM implementation. This class is meant to be subclassed by each LLM, and the agenerate method needs to be implemented to provide the asynchronous generation of responses.
Attributes:
Name | Type | Description |
---|---|---|
_event_loop | AbstractEventLoop | the event loop to be used for the asynchronous generation of responses. |
Source code in src/distilabel/llms/base.py
generate_parameters: List[inspect.Parameter]
property
Returns the parameters of the agenerate method.
Returns:
Type | Description |
---|---|
List[Parameter] | A list containing the parameters of the agenerate method. |
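As a rough illustration of what such a property can expose, the parameters of a coroutine method can be listed with the standard inspect module. ExampleLLM and list_parameters below are hypothetical stand-ins, not distilabel code, and skipping self is an assumption about what a caller would want.

```python
import inspect
from typing import List, Optional


class ExampleLLM:
    """Hypothetical stand-in for an AsyncLLM subclass (not part of distilabel)."""

    async def agenerate(
        self, input: list, max_tokens: int = 128, top_p: Optional[float] = None
    ) -> List[str]:
        return []


def list_parameters(method) -> List[inspect.Parameter]:
    """Return the parameters of `method`, skipping `self` if present."""
    signature = inspect.signature(method)
    return [param for name, param in signature.parameters.items() if name != "self"]


params = list_parameters(ExampleLLM.agenerate)
param_names = [p.name for p in params]
# Keep only the parameters that declare a default value.
param_defaults = {
    p.name: p.default for p in params if p.default is not inspect.Parameter.empty
}
```

Each inspect.Parameter carries the name, default, and annotation, which is enough to describe the generation arguments without calling the method.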
generate_parsed_docstring: Docstring
cached
property
Returns the parsed docstring of the agenerate method.
Returns:
Type | Description |
---|---|
Docstring | The parsed docstring of the agenerate method. |
__del__()
agenerate(input, num_generations=1, **kwargs)
abstractmethod
async
Generates num_generations responses for a given input asynchronously. The calls are executed concurrently in the generate method.
Source code in src/distilabel/llms/base.py
generate(inputs, num_generations=1, **kwargs)
Generates a list of responses asynchronously and returns the output synchronously, awaiting the response for each input sent to agenerate.
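The relationship between generate and agenerate can be sketched as follows. EchoAsyncLLM is a hypothetical toy class, not the library's implementation, and asyncio.run is used for brevity, whereas the class above keeps its own _event_loop attribute.

```python
import asyncio
from typing import Any, Dict, List

ChatType = List[Dict[str, str]]


class EchoAsyncLLM:
    """Hypothetical stand-in; a real subclass would call a remote API."""

    async def agenerate(
        self, input: ChatType, num_generations: int = 1, **kwargs: Any
    ) -> List[str]:
        await asyncio.sleep(0)  # yield control, as a network call would
        return [input[-1]["content"].upper() for _ in range(num_generations)]

    def generate(
        self, inputs: List[ChatType], num_generations: int = 1, **kwargs: Any
    ) -> List[List[str]]:
        async def run_all() -> List[List[str]]:
            tasks = [
                self.agenerate(input=chat, num_generations=num_generations, **kwargs)
                for chat in inputs
            ]
            # gather runs the coroutines concurrently and keeps the input order
            return await asyncio.gather(*tasks)

        # Assumption: no event loop is already running in this thread.
        return asyncio.run(run_all())


outputs = EchoAsyncLLM().generate(
    inputs=[
        [{"role": "user", "content": "hello"}],
        [{"role": "user", "content": "bye"}],
    ],
    num_generations=2,
)
```

The caller sees a plain synchronous method, while all inputs are in flight at the same time underneath.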
Source code in src/distilabel/llms/base.py
AzureOpenAILLM
Bases: OpenAILLM
Azure OpenAI LLM implementation running the OpenAI async API client, since the Azure OpenAI API replicates the OpenAI API behavior, but with Azure-specific parameters.
Attributes:
Name | Type | Description |
---|---|---|
model | | the model name to use for the LLM, i.e. the name of the Azure deployment. |
base_url | Optional[RuntimeParameter[str]] | the base URL to use for the Azure OpenAI API can be set with |
api_key | Optional[RuntimeParameter[SecretStr]] | the API key to authenticate the requests to the Azure OpenAI API. Defaults to |
api_version | Optional[RuntimeParameter[str]] | the API version to use for the Azure OpenAI API. Defaults to |
Source code in src/distilabel/llms/azure.py
load()
Loads the AsyncAzureOpenAI client to benefit from async requests.
Source code in src/distilabel/llms/azure.py
CohereLLM
Bases: AsyncLLM
Cohere API implementation using the async client for concurrent text generation.
Attributes:
Name | Type | Description |
---|---|---|
model | str | the name of the model from the Cohere API to use for the generation. |
base_url | Optional[RuntimeParameter[str]] | the base URL to use for the Cohere API requests. Defaults to "https://api.cohere.ai/v1". |
api_key | Optional[RuntimeParameter[SecretStr]] | the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY environment variable. |
timeout | RuntimeParameter[int] | the maximum time in seconds to wait for a response from the API. Defaults to 120. |
client_name | RuntimeParameter[str] | the name of the client to use for the API requests. Defaults to "distilabel". |
_ChatMessage | Type[ChatMessage] | the |
_aclient | AsyncClient | the |
Runtime parameters
- base_url: the base URL to use for the Cohere API requests. Defaults to "https://api.cohere.ai/v1".
- api_key: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY environment variable.
- timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
- client_name: the name of the client to use for the API requests. Defaults to "distilabel".
Source code in src/distilabel/llms/cohere.py
model_name: str
property
Returns the model name used for the LLM.
agenerate(input, temperature=None, max_tokens=None, k=None, p=None, seed=None, stop_sequences=None, frequency_penalty=None, presence_penalty=None, raw_prompting=None)
async
Generates a response from the LLM given an input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | ChatType | a single input in chat format to generate responses for. | required |
temperature | Optional[float] | the temperature to use for the generation. Defaults to None. | None |
max_tokens | Optional[int] | the maximum number of new tokens that the model will generate. Defaults to None. | None |
k | Optional[int] | the number of highest probability vocabulary tokens to keep for the generation. Defaults to None. | None |
p | Optional[float] | the nucleus sampling probability to use for the generation. Defaults to None. | None |
seed | Optional[float] | the seed to use for the generation. Defaults to None. | None |
stop_sequences | Optional[Sequence[str]] | a list of sequences to use as stopping criteria for the generation. Defaults to None. | None |
frequency_penalty | Optional[float] | the frequency penalty to use for the generation. Defaults to None. | None |
presence_penalty | Optional[float] | the presence penalty to use for the generation. Defaults to None. | None |
raw_prompting | Optional[bool] | a flag to use raw prompting for the generation. Defaults to None. | None |
Returns:
Type | Description |
---|---|
Union[str, None] | The generated response from the Cohere API model. |
Source code in src/distilabel/llms/cohere.py
generate(inputs, num_generations=1, **kwargs)
Generates a list of responses asynchronously and returns the output synchronously, awaiting the response for each input sent to agenerate.
Source code in src/distilabel/llms/cohere.py
load()
Loads the AsyncClient client from the cohere package.
Source code in src/distilabel/llms/cohere.py
CudaDevicePlacementMixin
Bases: BaseModel
Mixin class to assign CUDA devices to the LLM based on the cuda_devices attribute and the device placement information provided in _device_llm_placement_map. Providing the device placement information is optional, but if it is provided, it will be used to assign CUDA devices to the LLMs, trying to avoid using the same device for different LLMs.
Attributes:
Name | Type | Description |
---|---|---|
cuda_devices | Union[List[int], Literal['auto']] | a list with the ID of the CUDA devices to be used by the |
_llm_identifier | Union[str, None] | the identifier of the |
_device_llm_placement_map | Union[DictProxy[str, Any], None] | a dictionary with the device placement information for each |
Source code in src/distilabel/llms/mixins.py
load()
Assigns CUDA devices to the LLM based on the device placement information provided in _device_llm_placement_map.
Source code in src/distilabel/llms/mixins.py
set_device_placement_info(llm_identifier, device_llm_placement_map, device_llm_placement_lock)
Sets the value of _device_llm_placement_map to be used to assign CUDA devices to the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
llm_identifier | str | the identifier of the LLM to be used as key in the device placement information. | required |
device_llm_placement_map | DictProxy[str, Any] | a dictionary with the device placement information for each LLM. It should have two keys. The first key is "lock" and its value is a lock object to be used to synchronize the access to the device placement information. The second key is "value" and its value is a dictionary with the device placement information for each LLM. | required |
device_llm_placement_lock | Lock | a lock object to be used to synchronize the access to | required |
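The idea of a shared placement map guarded by a lock can be sketched as below. This is a simplified illustration under stated assumptions: a plain dict and threading.Lock replace the DictProxy and Lock described above, assign_device is a hypothetical helper, and the "first free device" policy is an assumption rather than the mixin's documented behavior.

```python
import threading
from typing import Dict, List


def assign_device(
    llm_identifier: str,
    placement_map: Dict[str, List[int]],
    lock: threading.Lock,
    available_devices: List[int],
) -> int:
    """Pick the first device not claimed by another LLM and record the choice."""
    with lock:  # serialize access to the shared placement map
        taken = {
            device
            for llm, devices in placement_map.items()
            if llm != llm_identifier
            for device in devices
        }
        for device in available_devices:
            if device not in taken:
                own = placement_map.setdefault(llm_identifier, [])
                if device not in own:
                    own.append(device)
                return device
    raise RuntimeError(f"no free CUDA device left for {llm_identifier!r}")


placement: Dict[str, List[int]] = {}
placement_lock = threading.Lock()
first = assign_device("llm-a", placement, placement_lock, [0, 1])
second = assign_device("llm-b", placement, placement_lock, [0, 1])
```

Because every reader and writer takes the same lock, two LLMs loading at the same time cannot both claim the same device.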
Source code in src/distilabel/llms/mixins.py
InferenceEndpointsLLM
Bases: AsyncLLM
InferenceEndpoints LLM implementation running the async API client via either the huggingface_hub.AsyncInferenceClient or via openai.AsyncOpenAI.
Attributes:
Name | Type | Description |
---|---|---|
model_id | Optional[str] | the model ID to use for the LLM as available in the Hugging Face Hub, which will be used to resolve the base URL for the serverless Inference Endpoints API requests. Defaults to |
endpoint_name | Optional[RuntimeParameter[str]] | the name of the Inference Endpoint to use for the LLM. Defaults to |
endpoint_namespace | Optional[RuntimeParameter[str]] | the namespace of the Inference Endpoint to use for the LLM. Defaults to |
base_url | Optional[RuntimeParameter[str]] | the base URL to use for the Inference Endpoints API requests. |
api_key | Optional[RuntimeParameter[SecretStr]] | the API key to authenticate the requests to the Inference Endpoints API. |
tokenizer_id | Optional[str] | the tokenizer ID to use for the LLM as available in the Hugging Face Hub. Defaults to |
model_display_name | Optional[str] | the model display name to use for the LLM. Defaults to |
use_openai_client | bool | whether to use the OpenAI client instead of the Hugging Face client. |
Examples:
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Free serverless Inference API
llm = InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
)
# Dedicated Inference Endpoints
llm = InferenceEndpointsLLM(
endpoint_name="<ENDPOINT_NAME>",
api_key="<HF_API_KEY>",
endpoint_namespace="<USER|ORG>",
)
# Dedicated Inference Endpoints or TGI
llm = InferenceEndpointsLLM(
api_key="<HF_API_KEY>",
base_url="<BASE_URL>",
)
llm.load()
# Synchronous request
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
# Asynchronous request (await only works inside a coroutine or an async-aware REPL)
output = await llm.agenerate(input=[{"role": "user", "content": "Hello world!"}])
Source code in src/distilabel/llms/huggingface/inference_endpoints.py
model_name: Union[str, None]
property
Returns the model name used for the LLM.
agenerate(input, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=None, temperature=1.0, do_sample=False, top_k=None, top_p=None, typical_p=None, stop_sequences=None)
async
Generates completions for the given input using the OpenAI async client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | ChatType | a single input in chat format to generate responses for. | required |
max_new_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
frequency_penalty | float | the frequency penalty to use for the generation. Defaults to 0.0. | 0.0 |
presence_penalty | float | the presence penalty to use for the generation. Defaults to 0.0. | 0.0 |
repetition_penalty | Optional[float] | the repetition penalty to use for the generation. Defaults to None. | None |
temperature | float | the temperature to use for the generation. Defaults to 1.0. | 1.0 |
do_sample | bool | whether to use sampling for the generation. Defaults to False. | False |
top_k | Optional[int] | the top-k value to use for the generation. Defaults to None. | None |
top_p | Optional[float] | the top-p value to use for the generation. Defaults to None. | None |
typical_p | Optional[float] | the typical-p value to use for the generation. Defaults to None. | None |
stop_sequences | Optional[Union[str, List[str]]] | either a single string or a list of strings containing the sequences to stop the generation at. Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/huggingface/inference_endpoints.py
generate(inputs, num_generations=1, **kwargs)
Generates a list of responses asynchronously and returns the output synchronously, awaiting the response for each input sent to agenerate.
Source code in src/distilabel/llms/huggingface/inference_endpoints.py
load()
Loads either the AsyncInferenceClient or the AsyncOpenAI client to benefit from async requests, running the Hugging Face Inference Endpoint underneath via the /v1/chat/completions endpoint, exposed for the models running on TGI using the text-generation task.
Raises:
Type | Description |
---|---|
ImportError | if the |
ImportError | if the |
ValueError | if the model is not currently deployed or is not running the TGI framework. |
ImportError | if the |
Source code in src/distilabel/llms/huggingface/inference_endpoints.py
only_one_of_model_id_endpoint_name_or_base_url_provided()
Validates that only one of model_id or endpoint_name is provided. If base_url is also provided, a warning is shown informing the user that the provided base_url will be ignored in favour of the dynamically calculated one.
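The rule described above can be approximated with the sketch below. check_endpoint_args is a hypothetical standalone function, not the validator from the library, and the final check that at least one of the three arguments is present is an assumption drawn only from the validator's name.

```python
import warnings
from typing import Optional


def check_endpoint_args(
    model_id: Optional[str] = None,
    endpoint_name: Optional[str] = None,
    base_url: Optional[str] = None,
) -> None:
    """Hypothetical helper mirroring the validation rule described above."""
    if model_id and endpoint_name:
        raise ValueError("provide only one of model_id or endpoint_name")
    if base_url and (model_id or endpoint_name):
        # The URL resolved from model_id / endpoint_name takes precedence.
        warnings.warn(
            "base_url will be ignored in favour of the dynamically calculated one"
        )
    if not (model_id or endpoint_name or base_url):
        # Assumption: with nothing provided there is no endpoint to call.
        raise ValueError("provide one of model_id, endpoint_name or base_url")


# Passes silently: exactly one way of locating the endpoint is given.
check_endpoint_args(model_id="mistralai/Mistral-7B-Instruct-v0.2")
```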
Source code in src/distilabel/llms/huggingface/inference_endpoints.py
LLM
Bases: RuntimeParametersMixin, BaseModel, _Serializable, ABC
Base class for LLMs to be used in the distilabel framework.
To implement an LLM subclass, you need to subclass this class and implement:
- load method to load the LLM if needed. Don't forget to call super().load(), so the _logger attribute is initialized.
- model_name property to return the model name used for the LLM.
- generate method to generate num_generations per input in inputs.
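The three requirements listed above can be seen together in the sketch below. BaseLLM is a simplified stand-in written with the standard library so the example runs on its own; it is an assumption about the shape of the contract, not the real distilabel base class, and ReverseLLM is a toy subclass.

```python
import logging
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

ChatType = List[Dict[str, str]]


class BaseLLM(ABC):
    """Simplified stand-in for the LLM base class (assumption, not distilabel code)."""

    _logger: Optional[logging.Logger] = None

    def load(self) -> None:
        self._logger = logging.getLogger("example.llm")

    @property
    @abstractmethod
    def model_name(self) -> str: ...

    @abstractmethod
    def generate(
        self, inputs: List[ChatType], num_generations: int = 1, **kwargs: Any
    ) -> List[List[str]]: ...


class ReverseLLM(BaseLLM):
    """Toy subclass that reverses the last message instead of calling a model."""

    def load(self) -> None:
        super().load()  # initializes _logger before it is used below
        self._logger.info("nothing else to load")

    @property
    def model_name(self) -> str:
        return "reverse-toy"

    def generate(self, inputs, num_generations=1, **kwargs):
        # One inner list per input, holding num_generations strings.
        return [[chat[-1]["content"][::-1]] * num_generations for chat in inputs]


llm = ReverseLLM()
llm.load()
result = llm.generate([[{"role": "user", "content": "abc"}]], num_generations=2)
```

Skipping the super().load() call in this sketch would leave _logger as None and make the logging line fail, which is the situation the note above warns about.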
Attributes:
Name | Type | Description |
---|---|---|
generation_kwargs | Optional[RuntimeParameter[Dict[str, Any]]] | the kwargs to be propagated to either |
_logger | Union[Logger, None] | the logger to be used for the |
Source code in src/distilabel/llms/base.py
generate_parameters: List[inspect.Parameter]
property
Returns the parameters of the generate method.
Returns:
Type | Description |
---|---|
List[Parameter] | A list containing the parameters of the generate method. |
generate_parsed_docstring: Docstring
cached
property
Returns the parsed docstring of the generate method.
Returns:
Type | Description |
---|---|
Docstring | The parsed docstring of the generate method. |
model_name: str
abstractmethod
property
Returns the model name used for the LLM.
runtime_parameters_names: RuntimeParametersNames
property
Returns the runtime parameters of the LLM, which are a combination of the attributes of the LLM type hinted with RuntimeParameter and the parameters of the generate method other than input and num_generations.
Returns:
Type | Description |
---|---|
RuntimeParametersNames | A dictionary with the name of the runtime parameters as keys and a boolean indicating if the parameter is optional or not. |
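A minimal sketch of the generate-method half of this mapping is shown below. It covers only the method parameters, not the attributes hinted with RuntimeParameter. Both generate and runtime_parameter_names are hypothetical, and treating "has a default value" as "optional" is an assumption.

```python
import inspect
from typing import Dict, Optional


def generate(
    inputs,
    num_generations=1,
    max_new_tokens: int = 128,
    seed: Optional[int] = None,
    *,
    stop,
):
    """Hypothetical generate signature used only for this illustration."""


def runtime_parameter_names(func) -> Dict[str, bool]:
    """Map each generation argument to True when it declares a default value."""
    skipped = {"self", "inputs", "input", "num_generations"}
    return {
        name: param.default is not inspect.Parameter.empty
        for name, param in inspect.signature(func).parameters.items()
        if name not in skipped
    }


names = runtime_parameter_names(generate)
```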
generate(inputs, num_generations=1, **kwargs)
abstractmethod
Abstract method to be implemented by each LLM to generate num_generations per input in inputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[ChatType] | the list of inputs to generate responses for which follows OpenAI's API format: | required |
num_generations | int | the number of generations to generate per input. | 1 |
**kwargs | Any | the additional kwargs to be used for the generation. | {} |
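For orientation, a list of inputs in chat format might look like the sketch below, modelled on the "Hello world!" example used elsewhere on this page. The ChatType alias and the is_chat helper are assumptions made for illustration, not the library's own type definition or validation.

```python
from typing import Dict, List

# Assumed shape: each conversation is a list of {"role", "content"} messages.
ChatType = List[Dict[str, str]]


def is_chat(conversation: object) -> bool:
    """Check that every message is a dict with string "role" and "content" keys."""
    return isinstance(conversation, list) and all(
        isinstance(message, dict)
        and isinstance(message.get("role"), str)
        and isinstance(message.get("content"), str)
        for message in conversation
    )


inputs: List[ChatType] = [
    [{"role": "user", "content": "Hello world!"}],
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "What is TGI?"},
    ],
]
valid = [is_chat(chat) for chat in inputs]
```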
Source code in src/distilabel/llms/base.py
get_last_hidden_states(inputs)
Method to get the last hidden states of the model for a list of inputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[ChatType] | the list of inputs to get the last hidden states from. | required |
Returns:
Type | Description |
---|---|
List[HiddenState] | A list containing the last hidden state for each sequence using a NumPy array with shape [num_tokens, hidden_size]. |
Source code in src/distilabel/llms/base.py
get_runtime_parameters_info()
Gets the information of the runtime parameters of the LLM, such as the name and the description. This function is meant to include the information of the runtime parameters in the serialized data of the LLM.
Returns:
Type | Description |
---|---|
List[Dict[str, Any]] | A list containing the information for each runtime parameter of the LLM. |
Source code in src/distilabel/llms/base.py
LiteLLM
Bases: AsyncLLM
LiteLLM implementation running the async API client.
Attributes:
Name | Type | Description |
---|---|---|
model | str | the model name to use for the LLM, e.g. "gpt-3.5-turbo" or "mistral/mistral-large", etc. |
verbose | RuntimeParameter[bool] | whether to log the LiteLLM client's logs. Defaults to False. |
Runtime parameters
- verbose: whether to log the LiteLLM client's logs. Defaults to False.
Source code in src/distilabel/llms/litellm.py
model_name: str
property
Returns the model name used for the LLM.
agenerate(input, num_generations=1, functions=None, function_call=None, temperature=1.0, top_p=1.0, stop=None, max_tokens=None, presence_penalty=None, frequency_penalty=None, logit_bias=None, user=None, metadata=None, api_base=None, api_version=None, api_key=None, model_list=None, mock_response=None, force_timeout=600, custom_llm_provider=None)
async
Generates num_generations responses for the given input using the LiteLLM async client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | ChatType | a single input in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
functions | Optional[List] | a list of functions to apply to the conversation messages. Defaults to None. | None |
function_call | Optional[str] | the name of the function to call within the conversation. Defaults to None. | None |
temperature | Optional[float] | the temperature to use for the generation. Defaults to 1.0. | 1.0 |
top_p | Optional[float] | the top-p value to use for the generation. Defaults to 1.0. | 1.0 |
stop | Optional[Union[str, list]] | Up to 4 sequences where the LLM API will stop generating further tokens. Defaults to None. | None |
max_tokens | Optional[int] | The maximum number of tokens in the generated completion. Defaults to None. | None |
presence_penalty | Optional[float] | It is used to penalize new tokens based on their existence in the text so far. Defaults to None. | None |
frequency_penalty | Optional[float] | It is used to penalize new tokens based on their frequency in the text so far. Defaults to None. | None |
logit_bias | Optional[dict] | Used to modify the probability of specific tokens appearing in the completion. Defaults to None. | None |
user | Optional[str] | A unique identifier representing your end-user. This can help the LLM provider to monitor and detect abuse. Defaults to None. | None |
metadata | Optional[dict] | Additional metadata to tag your completion calls, e.g. prompt version, details, etc. Defaults to None. | None |
api_base | Optional[str] | Base URL for the API. Defaults to None. | None |
api_version | Optional[str] | API version. Defaults to None. | None |
api_key | Optional[str] | API key. Defaults to None. | None |
model_list | Optional[list] | List of api base, version, keys. Defaults to None. | None |
mock_response | Optional[str] | If provided, return a mock completion response for testing or debugging purposes. Defaults to None. | None |
force_timeout | Optional[int] | The maximum execution time in seconds for the completion request. Defaults to 600. | 600 |
custom_llm_provider | Optional[str] | Used for non-OpenAI LLMs. For example, for Bedrock, set model="amazon.titan-tg1-large" and custom_llm_provider="bedrock". Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/litellm.py
load()
Loads the acompletion LiteLLM client to benefit from async requests.
Source code in src/distilabel/llms/litellm.py
LlamaCppLLM
Bases: LLM
llama.cpp LLM implementation running the Python bindings for the C++ code.
Attributes:
Name | Type | Description |
---|---|---|
chat_format | str | the chat format to use for the model. Defaults to |
model_path | RuntimeParameter[FilePath] | contains the path to the GGUF quantized model, compatible with the installed version of the |
n_gpu_layers | RuntimeParameter[int] | the number of layers to use for the GPU. Defaults to -1. |
verbose | RuntimeParameter[bool] | whether to print verbose output. Defaults to False. |
_model | Optional[Llama] | the Llama model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the |
Runtime parameters
- model_path: the path to the GGUF quantized model.
- n_gpu_layers: the number of layers to use for the GPU. Defaults to -1.
- verbose: whether to print verbose output. Defaults to False.
Source code in src/distilabel/llms/llamacpp.py
model_name: str
property
Returns the model name used for the LLM.
generate(inputs, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0)
Generates num_generations responses for the given input using the Llama model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[ChatType] | a list of inputs in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
max_new_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
frequency_penalty | float | the frequency penalty to use for the generation. Defaults to 0.0. | 0.0 |
presence_penalty | float | the presence penalty to use for the generation. Defaults to 0.0. | 0.0 |
temperature | float | the temperature to use for the generation. Defaults to 1.0. | 1.0 |
top_p | float | the top-p value to use for the generation. Defaults to 1.0. | 1.0 |
Returns:
Type | Description |
---|---|
List[GenerateOutput] | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/llamacpp.py
load()
Loads the Llama model from the model_path.
Source code in src/distilabel/llms/llamacpp.py
MistralLLM
Bases: AsyncLLM
Mistral LLM implementation running the async API client.
Attributes:
Name | Type | Description |
---|---|---|
model | str | the model name to use for the LLM, e.g. "mistral-tiny", "mistral-large", etc. |
endpoint | str | the endpoint to use for the Mistral API. Defaults to "https://api.mistral.ai". |
api_key | Optional[RuntimeParameter[SecretStr]] | the API key to authenticate the requests to the Mistral API. Defaults to |
max_retries | RuntimeParameter[int] | the maximum number of retries to attempt when a request fails. Defaults to 5. |
timeout | RuntimeParameter[int] | the maximum time in seconds to wait for a response. Defaults to 120. |
max_concurrent_requests | RuntimeParameter[int] | the maximum number of concurrent requests to send. Defaults to 64. |
_api_key_env_var | str | the name of the environment variable to use for the API key. It is meant to be used internally. |
_aclient | Optional[MistralAsyncClient] | the |
Runtime parameters
- api_key: the API key to authenticate the requests to the Mistral API.
- max_retries: the maximum number of retries to attempt when a request fails. Defaults to 5.
- timeout: the maximum time in seconds to wait for a response. Defaults to 120.
- max_concurrent_requests: the maximum number of concurrent requests to send. Defaults to 64.
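One common way to cap the number of requests in flight is an asyncio.Semaphore, sketched below. The source does not say how max_concurrent_requests is enforced, so this is an illustration of the concept rather than the client's mechanism. run_bounded and the sleep that stands in for the API call are hypothetical.

```python
import asyncio
from typing import List, Tuple


async def run_bounded(
    payloads: List[str], max_concurrent_requests: int = 64
) -> Tuple[List[str], int]:
    """Process all payloads, never running more than the allowed number at once."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0

    async def call(payload: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for the real API request
            in_flight -= 1
            return payload.upper()

    results = await asyncio.gather(*(call(p) for p in payloads))
    return list(results), peak


responses, peak_in_flight = asyncio.run(
    run_bounded(["a", "b", "c", "d"], max_concurrent_requests=2)
)
```

The returned peak makes it easy to confirm that the limit was respected while results still come back in input order.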
Source code in src/distilabel/llms/mistral.py
model_name: str
property
Returns the model name used for the LLM.
agenerate(input, max_new_tokens=None, temperature=None, top_p=None)
async
Generates num_generations responses for the given input using the MistralAI async client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | ChatType | a single input in chat format to generate responses for. | required |
max_new_tokens | Optional[int] | the maximum number of new tokens that the model will generate. Defaults to None. | None |
temperature | Optional[float] | the temperature to use for the generation. Defaults to None. | None |
top_p | Optional[float] | the top-p value to use for the generation. Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/mistral.py
generate(inputs, num_generations=1, **kwargs)
Generates a list of responses asynchronously and returns the output synchronously, awaiting the response for each input sent to agenerate.
Source code in src/distilabel/llms/mistral.py
load()
Loads the MistralAsyncClient client to benefit from async requests.
Source code in src/distilabel/llms/mistral.py
OllamaLLM
Bases: AsyncLLM
Ollama LLM implementation running the Async API client.
Attributes:
Name | Type | Description |
---|---|---|
model | str | the model name to use for the LLM, e.g. "notus". |
host | Optional[RuntimeParameter[str]] | the Ollama server host. |
timeout | RuntimeParameter[int] | the timeout for the LLM. Defaults to 120. |
_aclient | Optional[AsyncClient] | the |
Runtime parameters
- host: the Ollama server host.
- timeout: the client timeout for the Ollama API. Defaults to 120.
Source code in src/distilabel/llms/ollama.py
model_name: str
property
Returns the model name used for the LLM.
agenerate(input, num_generations=1, format='', options=None, keep_alive=None)
async
Generates a response asynchronously, using the Ollama Async API definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | ChatType | the input to use for the generation. | required |
num_generations | int | the number of generations to produce. Defaults to 1. | 1 |
format | Literal['', 'json'] | the format to use for the generation. Defaults to ''. | '' |
options | Union[Options, None] | the options to use for the generation. Defaults to None. | None |
keep_alive | Union[bool, None] | whether to keep the connection alive. Defaults to None. | None |
Returns:
Type | Description |
---|---|
List[str] | A list of strings as completion for the given input. |
Source code in src/distilabel/llms/ollama.py
load()
Loads the AsyncClient to use the Ollama async API.
Source code in src/distilabel/llms/ollama.py
OpenAILLM
Bases: AsyncLLM
OpenAI LLM implementation running the async API client.
Attributes:
Name | Type | Description |
---|---|---|
model | str | the model name to use for the LLM e.g. "gpt-3.5-turbo", "gpt-4", etc. Supported models can be found here. |
base_url | Optional[RuntimeParameter[str]] | the base URL to use for the OpenAI API requests. Defaults to |
api_key | Optional[RuntimeParameter[SecretStr]] | the API key to authenticate the requests to the OpenAI API. Defaults to |
max_retries | RuntimeParameter[int] | the maximum number of times to retry the request to the API before failing. Defaults to |
timeout | RuntimeParameter[int] | the maximum time in seconds to wait for a response from the API. Defaults to |
Runtime parameters
- base_url: the base URL to use for the OpenAI API requests. Defaults to None.
- api_key: the API key to authenticate the requests to the OpenAI API. Defaults to None.
- max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6.
- timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
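To make the interplay of `max_retries` and `timeout` concrete, here is a minimal sketch of a bounded retry loop with a per-attempt timeout. It is an assumption-laden illustration only: in practice the retries are handled by the underlying API client, and `call_with_retries`, `FlakyRequest`, and `base_delay` are hypothetical names invented for this example.

```python
import asyncio


async def call_with_retries(make_request, max_retries=6, timeout=120.0, base_delay=0.0):
    """Await make_request() at most 1 + max_retries times.

    Each attempt is bounded by `timeout` seconds. Between attempts the loop
    sleeps base_delay * 2**attempt seconds (exponential backoff).
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(make_request(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as error:
            last_error = error
            if attempt < max_retries:
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_error


class FlakyRequest:
    """Hypothetical request that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated transient failure")
        return "ok"


flaky = FlakyRequest(failures=2)
result = asyncio.run(call_with_retries(flaky, max_retries=6, timeout=1.0))
print(result, flaky.calls)
```

With `max_retries=6` the request is attempted up to seven times in total; the simulated request above succeeds on its third attempt.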
Source code in src/distilabel/llms/openai.py
model_name: str
property
¶
Returns the model name used for the LLM.
agenerate(input, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None)
async
Generates num_generations responses for the given input using the OpenAI async client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | ChatType | a single input in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
max_new_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
frequency_penalty | float | the repetition penalty to use for the generation. Defaults to 0.0. | 0.0 |
presence_penalty | float | the presence penalty to use for the generation. Defaults to 0.0. | 0.0 |
temperature | float | the temperature to use for the generation. Defaults to 1.0. | 1.0 |
top_p | float | the top-p value to use for the generation. Defaults to 1.0. | 1.0 |
stop | Optional[Union[str, List[str]]] | a string or a list of strings to use as a stop sequence for the generation. Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/openai.py
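Since `stop` accepts either a single string or a list of strings, it may help to see what a stop sequence does to a completion. The sketch below is a client-side illustration only, assuming that generation ends at the earliest stop sequence and that the stop text itself is not returned; the API applies stop sequences on the server, and `normalize_stop` and `truncate_at_stop` are hypothetical helpers.

```python
from typing import List, Optional, Union


def normalize_stop(stop: Optional[Union[str, List[str]]]) -> List[str]:
    """Turn None, a single string, or a list of strings into a list."""
    if stop is None:
        return []
    if isinstance(stop, str):
        return [stop]
    return list(stop)


def truncate_at_stop(text: str, stop: Optional[Union[str, List[str]]] = None) -> str:
    """Cut text at the earliest occurrence of any non-empty stop sequence."""
    cut = len(text)
    for sequence in normalize_stop(stop):
        if not sequence:
            continue  # an empty stop string would match at position 0
        index = text.find(sequence)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


completion = "Hello world. END more"
print(repr(truncate_at_stop(completion, "END")))
print(repr(truncate_at_stop(completion, ["more", "world"])))
```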
load()
Loads the AsyncOpenAI client to benefit from async requests.
Source code in src/distilabel/llms/openai.py
TogetherLLM
Bases: OpenAILLM
Together LLM implementation that reuses the OpenAI async API client, because the Together API duplicates the OpenAI API behavior.
Attributes:
Name | Type | Description |
---|---|---|
model | | the model name to use for the LLM e.g. "mistralai/Mixtral-8x7B-Instruct-v0.1". Supported models can be found here. |
base_url | Optional[RuntimeParameter[str]] | the base URL to use for the Together API can be set with |
api_key | Optional[RuntimeParameter[SecretStr]] | the API key to authenticate the requests to the Together API. Defaults to |
_api_key_env_var | str | the name of the environment variable to use for the API key. It is meant to be used internally. |
Source code in src/distilabel/llms/together.py
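The idea behind an attribute such as `_api_key_env_var` can be sketched as a simple fallback: an explicitly passed key wins, otherwise the named environment variable is read. The helper `resolve_api_key` and the variable name `EXAMPLE_API_KEY` below are hypothetical and chosen only for illustration; they are not the names used by the library.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None, env_var: str = "EXAMPLE_API_KEY") -> str:
    """Return the explicit key if given, otherwise read it from env_var."""
    if api_key is not None:
        return api_key
    value = os.environ.get(env_var)
    if value is None:
        raise ValueError(
            f"No API key was provided and the environment variable {env_var} is not set."
        )
    return value


# Simulate a key exported in the environment (hypothetical variable name).
os.environ["EXAMPLE_API_KEY"] = "from-env"
print(resolve_api_key())            # falls back to the environment
print(resolve_api_key("explicit"))  # the explicit argument takes precedence
```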
TransformersLLM
Bases: LLM, CudaDevicePlacementMixin
Hugging Face transformers library LLM implementation using the text generation pipeline.
Attributes:
Name | Type | Description |
---|---|---|
model | str | the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files. |
revision | str | if |
torch_dtype | str | the torch dtype to use for the model e.g. "float16", "float32", etc. Defaults to |
trust_remote_code | bool | whether to trust remote code (code in the Hugging Face Hub repository) when loading the model. Defaults to |
model_kwargs | Optional[Dict[str, Any]] | additional dictionary of keyword arguments that will be passed to the |
tokenizer | Optional[str] | the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the |
use_fast | bool | whether to use a fast tokenizer or not. Defaults to |
chat_template | Optional[str] | a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to |
device | Optional[Union[str, int]] | the name or index of the device where the model will be loaded. Defaults to |
device_map | Optional[Union[str, Dict[str, Any]]] | a dictionary mapping each layer of the model to a device, or a mode like |
token | Optional[str] | the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the |
Source code in src/distilabel/llms/huggingface/transformers.py
model_name: str
property
Returns the model name used for the LLM.
generate(inputs, num_generations=1, max_new_tokens=128, temperature=0.1, repetition_penalty=1.1, top_p=1.0, top_k=0, do_sample=True)
Generates num_generations responses for each input using the text generation pipeline.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[ChatType] | a list of inputs in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
max_new_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
temperature | float | the temperature to use for the generation. Defaults to 0.1. | 0.1 |
repetition_penalty | float | the repetition penalty to use for the generation. Defaults to 1.1. | 1.1 |
top_p | float | the top-p value to use for the generation. Defaults to 1.0. | 1.0 |
top_k | int | the top-k value to use for the generation. Defaults to 0. | 0 |
do_sample | bool | whether to use sampling or not. Defaults to True. | True |
Returns:
Type | Description |
---|---|
List[GenerateOutput] | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/huggingface/transformers.py
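The sampling parameters above (`temperature`, `top_k`, `top_p`) all reshape the next-token distribution before a token is drawn. The sketch below shows the general technique on a toy vocabulary in pure Python. It is an illustration, not the pipeline's code: `filter_distribution` is a hypothetical helper, and the ordering (temperature, then top-k, then top-p) and the convention that `top_k <= 0` disables top-k filtering are assumptions made for this example.

```python
import math
from typing import Dict


def filter_distribution(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
) -> Dict[str, float]:
    """Return the renormalized token probabilities left after filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {token: logit / temperature for token, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {token: math.exp(value - max_logit) for token, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((token, value / total) for token, value in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: keep only the k most likely tokens (assumed disabled when top_k <= 0).
    if top_k > 0:
        ranked = ranked[:top_k]
        mass = sum(prob for _, prob in ranked)
        ranked = [(token, prob / mass) for token, prob in ranked]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    mass = sum(prob for _, prob in kept)
    return {token: prob / mass for token, prob in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}
print(filter_distribution(toy_logits, top_k=2))
print(filter_distribution(toy_logits, top_p=0.5))
print(filter_distribution(toy_logits, temperature=0.1))
```

A low temperature such as 0.1 concentrates almost all of the probability mass on the most likely token, which is why it produces near-deterministic output even with sampling enabled.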
get_last_hidden_states(inputs)
Gets the last hidden_states of the model for the given inputs. It doesn't execute the task head.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[ChatType] | a list of inputs in chat format to generate the embeddings for. | required |
Returns:
Type | Description |
---|---|
List[HiddenState] | A list containing the last hidden state for each sequence using a NumPy array with shape [num_tokens, hidden_size]. |
Source code in src/distilabel/llms/huggingface/transformers.py
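Because each returned hidden state has shape [num_tokens, hidden_size], sequences of different lengths yield arrays with a different number of rows. One common way to turn such an array into a fixed-size embedding is mean pooling over the token axis. The sketch below assumes that use case and uses nested Python lists as a stand-in for the NumPy arrays so that it runs without extra dependencies; `mean_pool` is a hypothetical helper, not part of the library.

```python
from typing import List

# Stand-in for a NumPy array of shape [num_tokens, hidden_size].
HiddenState = List[List[float]]


def mean_pool(hidden_state: HiddenState) -> List[float]:
    """Average over the token axis, giving one vector of length hidden_size."""
    num_tokens = len(hidden_state)
    hidden_size = len(hidden_state[0])
    return [
        sum(row[i] for row in hidden_state) / num_tokens
        for i in range(hidden_size)
    ]


# Two toy sequences with hidden_size = 2 and different token counts.
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0]],              # 2 tokens
    [[0.0, 0.0], [3.0, 6.0], [6.0, 3.0]],  # 3 tokens
]
embeddings = [mean_pool(state) for state in hidden_states]
print(embeddings)
```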
load()
Loads the model and tokenizer and creates the text generation pipeline. In addition, it will configure the tokenizer chat template.
Source code in src/distilabel/llms/huggingface/transformers.py
prepare_input(input)
Prepares the input by applying the chat template to the input, which is formatted as an OpenAI conversation, and adding the generation prompt.
Source code in src/distilabel/llms/huggingface/transformers.py
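To illustrate what applying a chat template and adding the generation prompt produces, the sketch below renders an OpenAI-style conversation in the ChatML layout, which is the fallback mentioned for tokenizers without their own template. It is a hand-written approximation for illustration: the actual method relies on the tokenizer's chat template, and `to_chatml` is a hypothetical helper.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages as ChatML, optionally opening the assistant turn."""
    prompt = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # The generation prompt opens the assistant turn for the model to complete.
        prompt += "<|im_start|>assistant\n"
    return prompt


conversation = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
]
prompt = to_chatml(conversation)
print(prompt)
```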
VertexAILLM
Bases: AsyncLLM
VertexAI LLM implementation running the async API clients for Gemini.
- Gemini API: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini
To use the VertexAILLM, it is necessary to have configured the Google Cloud authentication using one of these methods:
- Setting the GOOGLE_CLOUD_CREDENTIALS environment variable
- Using the gcloud auth application-default login command
- Using the vertexai.init function from the google-cloud-aiplatform library
Attributes:
Name | Type | Description |
---|---|---|
model | str | the model name to use for the LLM e.g. "gemini-1.0-pro". Supported models. |
_aclient | Optional[GenerativeModel] | the |
Source code in src/distilabel/llms/vertexai.py
model_name: str
property
Returns the model name used for the LLM.
agenerate(input, num_generations=1, temperature=None, top_p=None, top_k=None, max_output_tokens=None, stop_sequences=None, safety_settings=None, tools=None)
async
Generates num_generations responses for the given input using the VertexAI async client definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | ChatType | a single input in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
temperature | Optional[float] | Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to None. | None |
top_p | Optional[float] | If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to None. | None |
top_k | Optional[int] | If specified, top-k sampling will be used. Defaults to None. | None |
max_output_tokens | Optional[int] | The maximum number of output tokens to generate per message. Defaults to None. | None |
stop_sequences | Optional[List[str]] | A list of stop sequences. Defaults to None. | None |
safety_settings | Optional[Dict[str, Any]] | Safety configuration for returned content from the API. Defaults to None. | None |
tools | Optional[List[Dict[str, Any]]] | A potential list of tools that can be used by the API. Defaults to None. | None |
Returns:
Type | Description |
---|---|
GenerateOutput | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/vertexai.py
load()
Loads the GenerativeModel class which has access to generate_content_async to benefit from async requests.
Source code in src/distilabel/llms/vertexai.py
vLLM
Bases: LLM, CudaDevicePlacementMixin
vLLM library LLM implementation.
Attributes:
Name | Type | Description |
---|---|---|
model | str | the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files. |
model_kwargs | Optional[RuntimeParameter[Dict[str, Any]]] | additional dictionary of keyword arguments that will be passed to the |
chat_template | Optional[str] | a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to |
_model | Optional[LLM] | the |
_tokenizer | Optional[PreTrainedTokenizer] | the tokenizer instance used to format the prompt before passing it to the |
Runtime parameters
- model_kwargs: additional dictionary of keyword arguments that will be passed to the LLM class of vllm library.
Source code in src/distilabel/llms/vllm.py
model_name: str
property
Returns the model name used for the LLM.
generate(inputs, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, top_k=-1, extra_sampling_params=None)
Generates num_generations responses for each input using the text generation pipeline.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[ChatType] | a list of inputs in chat format to generate responses for. | required |
num_generations | int | the number of generations to create per input. Defaults to 1. | 1 |
max_new_tokens | int | the maximum number of new tokens that the model will generate. Defaults to 128. | 128 |
frequency_penalty | float | the repetition penalty to use for the generation. Defaults to 0.0. | 0.0 |
presence_penalty | float | the presence penalty to use for the generation. Defaults to 0.0. | 0.0 |
temperature | float | the temperature to use for the generation. Defaults to 1.0. | 1.0 |
top_p | float | the top-p value to use for the generation. Defaults to 1.0. | 1.0 |
top_k | int | the top-k value to use for the generation. Defaults to -1. | -1 |
extra_sampling_params | Optional[Dict[str, Any]] | dictionary with additional arguments to be passed to the | None |
Returns:
Type | Description |
---|---|
List[GenerateOutput] | A list of lists of strings containing the generated responses for each input. |
Source code in src/distilabel/llms/vllm.py
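The difference between `frequency_penalty` and `presence_penalty` is easiest to see on a toy example. The sketch below follows the adjustment described in the OpenAI API documentation, where a token's logit is lowered by its count times the frequency penalty, plus the presence penalty once if the token has appeared at all. Whether a given backend applies exactly this formula is an assumption here, and `apply_penalties` is a hypothetical helper written for illustration.

```python
from collections import Counter
from typing import Dict, List


def apply_penalties(
    logits: Dict[str, float],
    generated_tokens: List[str],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
) -> Dict[str, float]:
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]  # 0 for tokens not generated so far
        presence = presence_penalty if count > 0 else 0.0
        adjusted[token] = logit - count * frequency_penalty - presence
    return adjusted


toy_logits = {"the": 1.0, "cat": 1.0, "sat": 1.0}
history = ["the", "the", "cat"]
penalized = apply_penalties(toy_logits, history, frequency_penalty=0.5, presence_penalty=0.25)
print(penalized)
```

The frequency penalty grows with every repetition ("the" is penalized twice as much as "cat"), while the presence penalty is a flat, one-time deduction for any token that has already been used.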
load()
Loads the vLLM model using either the path or the Hugging Face Hub repository id.
This method also sets the chat_template for the tokenizer, so that the list of OpenAI-formatted inputs is parsed into the format expected by the model. If no chat template is explicitly provided and the tokenizer does not define one, the ChatML format is used by default.
Source code in src/distilabel/llms/vllm.py
prepare_input(input)
Prepares the input by applying the chat template to the input, which is formatted as an OpenAI conversation, and adding the generation prompt.