LLM Gallery¶

This section contains the existing LLM subclasses implemented in distilabel.

`llms` ¶

`AnthropicLLM` ¶

Bases: AsyncLLM

Anthropic LLM implementation running the Async API client.

Attributes:

Name	Type	Description
`model`	`str`	the name of the model to use for the LLM e.g. "claude-3-opus-20240229", "claude-3-sonnet-20240229", etc. Available models can be checked here: Anthropic: Models overview.
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from `ANTHROPIC_API_KEY` environment variable.
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL to use for the Anthropic API. Defaults to `None` which means that `https://api.anthropic.com` will be used internally.
`timeout`	`RuntimeParameter[float]`	the maximum time in seconds to wait for a response. Defaults to `600.0`.
`max_retries`	`RuntimeParameter[int]`	The maximum number of times to retry the request before failing. Defaults to `6`.
`http_client`	`Optional[AsyncClient]`	if provided, an alternative HTTP client to use for calling Anthropic API. Defaults to `None`.
`structured_output`	`Optional[RuntimeParameter[InstructorStructuredOutputType]]`	a dictionary containing the structured output configuration configuration using `instructor`. You can take a look at the dictionary structure in `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.
`_api_key_env_var`	`str`	the name of the environment variable to use for the API key. It is meant to be used internally.
`_aclient`	`Optional[AsyncAnthropic]`	the `AsyncAnthropic` client to use for the Anthropic API. It is meant to be used internally. Set in the `load` method.

Runtime parameters

api_key: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY environment variable.
base_url: the base URL to use for the Anthropic API. Defaults to "https://api.anthropic.com".
timeout: the maximum time in seconds to wait for a response. Defaults to 600.0.
max_retries: the maximum number of times to retry the request before failing. Defaults to 6.

Examples:

Generate text:

from distilabel.models.llms import AnthropicLLM

llm = AnthropicLLM(model="claude-3-opus-20240229", api_key="api.key")

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

from pydantic import BaseModel
from distilabel.models.llms import AnthropicLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = AnthropicLLM(
    model="claude-3-opus-20240229",
    api_key="api.key",
    structured_output={"schema": User}
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/anthropic.py

class AnthropicLLM(AsyncLLM):
    """Anthropic LLM implementation running the Async API client.

    Attributes:
        model: the name of the model to use for the LLM e.g. "claude-3-opus-20240229",
            "claude-3-sonnet-20240229", etc. Available models can be checked here:
            [Anthropic: Models overview](https://docs.anthropic.com/claude/docs/models-overview).
        api_key: the API key to authenticate the requests to the Anthropic API. If not provided,
            it will be read from `ANTHROPIC_API_KEY` environment variable.
        base_url: the base URL to use for the Anthropic API. Defaults to `None` which means
            that `https://api.anthropic.com` will be used internally.
        timeout: the maximum time in seconds to wait for a response. Defaults to `600.0`.
        max_retries: The maximum number of times to retry the request before failing. Defaults
            to `6`.
        http_client: if provided, an alternative HTTP client to use for calling Anthropic
            API. Defaults to `None`.
        structured_output: a dictionary containing the structured output configuration configuration
            using `instructor`. You can take a look at the dictionary structure in
            `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.
        _api_key_env_var: the name of the environment variable to use for the API key. It
            is meant to be used internally.
        _aclient: the `AsyncAnthropic` client to use for the Anthropic API. It is meant
            to be used internally. Set in the `load` method.

    Runtime parameters:
        - `api_key`: the API key to authenticate the requests to the Anthropic API. If not
            provided, it will be read from `ANTHROPIC_API_KEY` environment variable.
        - `base_url`: the base URL to use for the Anthropic API. Defaults to `"https://api.anthropic.com"`.
        - `timeout`: the maximum time in seconds to wait for a response. Defaults to `600.0`.
        - `max_retries`: the maximum number of times to retry the request before failing.
            Defaults to `6`.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import AnthropicLLM

        llm = AnthropicLLM(model="claude-3-opus-20240229", api_key="api.key")

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate structured data:

        ```python
        from pydantic import BaseModel
        from distilabel.models.llms import AnthropicLLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = AnthropicLLM(
            model="claude-3-opus-20240229",
            api_key="api.key",
            structured_output={"schema": User}
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    model: str
    base_url: Optional[RuntimeParameter[str]] = Field(
        default_factory=lambda: os.getenv(
            "ANTHROPIC_BASE_URL", "https://api.anthropic.com"
        ),
        description="The base URL to use for the Anthropic API.",
    )
    api_key: Optional[RuntimeParameter[SecretStr]] = Field(
        default_factory=lambda: os.getenv(_ANTHROPIC_API_KEY_ENV_VAR_NAME),
        description="The API key to authenticate the requests to the Anthropic API.",
    )
    timeout: RuntimeParameter[float] = Field(
        default=600.0,
        description="The maximum time in seconds to wait for a response from the API.",
    )
    max_retries: RuntimeParameter[int] = Field(
        default=6,
        description="The maximum number of times to retry the request to the API before"
        " failing.",
    )
    http_client: Optional[AsyncClient] = Field(default=None, exclude=True)
    structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (
        Field(
            default=None,
            description="The structured output format to use across all the generations.",
        )
    )

    _num_generations_param_supported = False

    _api_key_env_var: str = PrivateAttr(default=_ANTHROPIC_API_KEY_ENV_VAR_NAME)
    _aclient: Optional["AsyncAnthropic"] = PrivateAttr(...)

    def _check_model_exists(self) -> None:
        """Checks if the specified model exists in the available models."""
        from anthropic import AsyncAnthropic

        annotation = get_type_hints(AsyncAnthropic().messages.create).get("model", None)
        models = [
            value
            for type_ in get_args(annotation)
            if get_origin(type_) is Literal
            for value in get_args(type_)
        ]

        if self.model not in models:
            raise ValueError(
                f"Model {self.model} does not exist among available models. "
                f"The available models are {', '.join(models)}"
            )

    def load(self) -> None:
        """Loads the `AsyncAnthropic` client to use the Anthropic async API."""
        super().load()

        try:
            from anthropic import AsyncAnthropic
        except ImportError as ie:
            raise ImportError(
                "Anthropic Python client is not installed. Please install it using"
                " `pip install 'distilabel[anthropic]'`."
            ) from ie

        if self.api_key is None:
            raise ValueError(
                f"To use `{self.__class__.__name__}` an API key must be provided via `api_key`"
                f" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`."
            )

        self._check_model_exists()

        self._aclient = AsyncAnthropic(
            api_key=self.api_key.get_secret_value(),
            base_url=self.base_url,
            timeout=self.timeout,
            http_client=self.http_client,
            max_retries=self.max_retries,
        )
        if self.structured_output:
            result = self._prepare_structured_output(
                structured_output=self.structured_output,
                client=self._aclient,
                framework="anthropic",
            )
            self._aclient = result.get("client")
            if structured_output := result.get("structured_output"):
                self.structured_output = structured_output

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: FormattedInput,
        max_tokens: int = 128,
        stop_sequences: Union[List[str], None] = None,
        temperature: float = 1.0,
        top_p: Union[float, None] = None,
        top_k: Union[int, None] = None,
    ) -> GenerateOutput:
        """Generates a response asynchronously, using the [Anthropic Async API definition](https://github.com/anthropics/anthropic-sdk-python).

        Args:
            input: a single input in chat format to generate responses for.
            max_tokens: the maximum number of new tokens that the model will generate. Defaults to `128`.
            stop_sequences: custom text sequences that will cause the model to stop generating. Defaults to `NOT_GIVEN`.
            temperature: the temperature to use for the generation. Set only if top_p is None. Defaults to `1.0`.
            top_p: the top-p value to use for the generation. Defaults to `NOT_GIVEN`.
            top_k: the top-k value to use for the generation. Defaults to `NOT_GIVEN`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """
        from anthropic._types import NOT_GIVEN

        structured_output = None
        if isinstance(input, tuple):
            input, structured_output = input
            result = self._prepare_structured_output(
                structured_output=structured_output,
                client=self._aclient,
                framework="anthropic",
            )
            self._aclient = result.get("client")

        if structured_output is None and self.structured_output is not None:
            structured_output = self.structured_output

        kwargs = {
            "messages": input,  # type: ignore
            "model": self.model,
            "system": (
                input.pop(0)["content"]
                if input and input[0]["role"] == "system"
                else NOT_GIVEN
            ),
            "max_tokens": max_tokens,
            "stream": False,
            "stop_sequences": NOT_GIVEN if stop_sequences is None else stop_sequences,
            "temperature": temperature,
            "top_p": NOT_GIVEN if top_p is None else top_p,
            "top_k": NOT_GIVEN if top_k is None else top_k,
        }

        if structured_output:
            kwargs = self._prepare_kwargs(kwargs, structured_output)

        completion: Union["Message", "BaseModel"] = await self._aclient.messages.create(
            **kwargs
        )  # type: ignore
        if structured_output:
            # raw_response = completion._raw_response
            return prepare_output(
                [completion.model_dump_json()],
                **self._get_llm_statistics(completion._raw_response),
            )

        if (content := completion.content[0].text) is None:
            self._logger.warning(
                f"Received no response using Anthropic client (model: '{self.model}')."
                f" Finish reason was: {completion.stop_reason}"
            )
        return prepare_output([content], **self._get_llm_statistics(completion))

    @staticmethod
    def _get_llm_statistics(completion: "Message") -> "LLMStatistics":
        return {
            "input_tokens": [completion.usage.input_tokens],
            "output_tokens": [completion.usage.output_tokens],
        }

`model_name` `property` ¶

Returns the model name used for the LLM.

`_check_model_exists()` ¶

Checks if the specified model exists in the available models.

Source code in src/distilabel/models/llms/anthropic.py

def _check_model_exists(self) -> None:
    """Checks if the specified model exists in the available models."""
    from anthropic import AsyncAnthropic

    annotation = get_type_hints(AsyncAnthropic().messages.create).get("model", None)
    models = [
        value
        for type_ in get_args(annotation)
        if get_origin(type_) is Literal
        for value in get_args(type_)
    ]

    if self.model not in models:
        raise ValueError(
            f"Model {self.model} does not exist among available models. "
            f"The available models are {', '.join(models)}"
        )

`load()` ¶

Loads the AsyncAnthropic client to use the Anthropic async API.

Source code in src/distilabel/models/llms/anthropic.py

def load(self) -> None:
    """Loads the `AsyncAnthropic` client to use the Anthropic async API."""
    super().load()

    try:
        from anthropic import AsyncAnthropic
    except ImportError as ie:
        raise ImportError(
            "Anthropic Python client is not installed. Please install it using"
            " `pip install 'distilabel[anthropic]'`."
        ) from ie

    if self.api_key is None:
        raise ValueError(
            f"To use `{self.__class__.__name__}` an API key must be provided via `api_key`"
            f" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`."
        )

    self._check_model_exists()

    self._aclient = AsyncAnthropic(
        api_key=self.api_key.get_secret_value(),
        base_url=self.base_url,
        timeout=self.timeout,
        http_client=self.http_client,
        max_retries=self.max_retries,
    )
    if self.structured_output:
        result = self._prepare_structured_output(
            structured_output=self.structured_output,
            client=self._aclient,
            framework="anthropic",
        )
        self._aclient = result.get("client")
        if structured_output := result.get("structured_output"):
            self.structured_output = structured_output

`agenerate(input, max_tokens=128, stop_sequences=None, temperature=1.0, top_p=None, top_k=None)` `async` ¶

Generates a response asynchronously, using the Anthropic Async API definition.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`max_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`stop_sequences`	`Union[List[str], None]`	custom text sequences that will cause the model to stop generating. Defaults to `NOT_GIVEN`.	`None`
`temperature`	`float`	the temperature to use for the generation. Set only if top_p is None. Defaults to `1.0`.	`1.0`
`top_p`	`Union[float, None]`	the top-p value to use for the generation. Defaults to `NOT_GIVEN`.	`None`
`top_k`	`Union[int, None]`	the top-k value to use for the generation. Defaults to `NOT_GIVEN`.	`None`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/anthropic.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: FormattedInput,
    max_tokens: int = 128,
    stop_sequences: Union[List[str], None] = None,
    temperature: float = 1.0,
    top_p: Union[float, None] = None,
    top_k: Union[int, None] = None,
) -> GenerateOutput:
    """Generates a response asynchronously, using the [Anthropic Async API definition](https://github.com/anthropics/anthropic-sdk-python).

    Args:
        input: a single input in chat format to generate responses for.
        max_tokens: the maximum number of new tokens that the model will generate. Defaults to `128`.
        stop_sequences: custom text sequences that will cause the model to stop generating. Defaults to `NOT_GIVEN`.
        temperature: the temperature to use for the generation. Set only if top_p is None. Defaults to `1.0`.
        top_p: the top-p value to use for the generation. Defaults to `NOT_GIVEN`.
        top_k: the top-k value to use for the generation. Defaults to `NOT_GIVEN`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """
    from anthropic._types import NOT_GIVEN

    structured_output = None
    if isinstance(input, tuple):
        input, structured_output = input
        result = self._prepare_structured_output(
            structured_output=structured_output,
            client=self._aclient,
            framework="anthropic",
        )
        self._aclient = result.get("client")

    if structured_output is None and self.structured_output is not None:
        structured_output = self.structured_output

    kwargs = {
        "messages": input,  # type: ignore
        "model": self.model,
        "system": (
            input.pop(0)["content"]
            if input and input[0]["role"] == "system"
            else NOT_GIVEN
        ),
        "max_tokens": max_tokens,
        "stream": False,
        "stop_sequences": NOT_GIVEN if stop_sequences is None else stop_sequences,
        "temperature": temperature,
        "top_p": NOT_GIVEN if top_p is None else top_p,
        "top_k": NOT_GIVEN if top_k is None else top_k,
    }

    if structured_output:
        kwargs = self._prepare_kwargs(kwargs, structured_output)

    completion: Union["Message", "BaseModel"] = await self._aclient.messages.create(
        **kwargs
    )  # type: ignore
    if structured_output:
        # raw_response = completion._raw_response
        return prepare_output(
            [completion.model_dump_json()],
            **self._get_llm_statistics(completion._raw_response),
        )

    if (content := completion.content[0].text) is None:
        self._logger.warning(
            f"Received no response using Anthropic client (model: '{self.model}')."
            f" Finish reason was: {completion.stop_reason}"
        )
    return prepare_output([content], **self._get_llm_statistics(completion))

`AnyscaleLLM` ¶

Bases: OpenAILLM

Anyscale LLM implementation running the async API client of OpenAI.

Attributes:

Name	Type	Description
`model`	`str`	the model name to use for the LLM, e.g., `google/gemma-7b-it`. See the supported models under the "Text Generation -> Supported Models" section here.
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL to use for the Anyscale API requests. Defaults to `None`, which means that the value set for the environment variable `ANYSCALE_BASE_URL` will be used, or "https://api.endpoints.anyscale.com/v1" if not set.
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the Anyscale API. Defaults to `None` which means that the value set for the environment variable `ANYSCALE_API_KEY` will be used, or `None` if not set.
`_api_key_env_var`	`str`	the name of the environment variable to use for the API key. It is meant to be used internally.

Examples:

Generate text:

from distilabel.models.llms import AnyscaleLLM

llm = AnyscaleLLM(model="google/gemma-7b-it", api_key="api.key")

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Source code in src/distilabel/models/llms/anyscale.py

class AnyscaleLLM(OpenAILLM):
    """Anyscale LLM implementation running the async API client of OpenAI.

    Attributes:
        model: the model name to use for the LLM, e.g., `google/gemma-7b-it`. See the
            supported models under the "Text Generation -> Supported Models" section
            [here](https://docs.endpoints.anyscale.com/).
        base_url: the base URL to use for the Anyscale API requests. Defaults to `None`, which
            means that the value set for the environment variable `ANYSCALE_BASE_URL` will be used, or
            "https://api.endpoints.anyscale.com/v1" if not set.
        api_key: the API key to authenticate the requests to the Anyscale API. Defaults to `None` which
            means that the value set for the environment variable `ANYSCALE_API_KEY` will be used, or
            `None` if not set.
        _api_key_env_var: the name of the environment variable to use for the API key.
            It is meant to be used internally.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import AnyscaleLLM

        llm = AnyscaleLLM(model="google/gemma-7b-it", api_key="api.key")

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```
    """

    base_url: Optional[RuntimeParameter[str]] = Field(
        default_factory=lambda: os.getenv(
            "ANYSCALE_BASE_URL", "https://api.endpoints.anyscale.com/v1"
        ),
        description="The base URL to use for the Anyscale API requests.",
    )
    api_key: Optional[RuntimeParameter[SecretStr]] = Field(
        default_factory=lambda: os.getenv(_ANYSCALE_API_KEY_ENV_VAR_NAME),
        description="The API key to authenticate the requests to the Anyscale API.",
    )

    _api_key_env_var: str = PrivateAttr(_ANYSCALE_API_KEY_ENV_VAR_NAME)

`AzureOpenAILLM` ¶

Bases: OpenAILLM

Azure OpenAI LLM implementation running the async API client.

Attributes:

Name	Type	Description
`model`	`str`	the model name to use for the LLM i.e. the name of the Azure deployment.
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL to use for the Azure OpenAI API can be set with `AZURE_OPENAI_ENDPOINT`. Defaults to `None` which means that the value set for the environment variable `AZURE_OPENAI_ENDPOINT` will be used, or `None` if not set.
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the Azure OpenAI API. Defaults to `None` which means that the value set for the environment variable `AZURE_OPENAI_API_KEY` will be used, or `None` if not set.
`api_version`	`Optional[RuntimeParameter[str]]`	the API version to use for the Azure OpenAI API. Defaults to `None` which means that the value set for the environment variable `OPENAI_API_VERSION` will be used, or `None` if not set.

Icon

:material-microsoft-azure:

Examples:

Generate text:

from distilabel.models.llms import AzureOpenAILLM

llm = AzureOpenAILLM(model="gpt-4-turbo", api_key="api.key")

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate text from a custom endpoint following the OpenAI API:

from distilabel.models.llms import AzureOpenAILLM

llm = AzureOpenAILLM(
    model="prometheus-eval/prometheus-7b-v2.0",
    base_url=r"http://localhost:8080/v1"
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

from pydantic import BaseModel
from distilabel.models.llms import AzureOpenAILLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = AzureOpenAILLM(
    model="gpt-4-turbo",
    api_key="api.key",
    structured_output={"schema": User}
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/azure.py

class AzureOpenAILLM(OpenAILLM):
    """Azure OpenAI LLM implementation running the async API client.

    Attributes:
        model: the model name to use for the LLM i.e. the name of the Azure deployment.
        base_url: the base URL to use for the Azure OpenAI API can be set with `AZURE_OPENAI_ENDPOINT`.
            Defaults to `None` which means that the value set for the environment variable
            `AZURE_OPENAI_ENDPOINT` will be used, or `None` if not set.
        api_key: the API key to authenticate the requests to the Azure OpenAI API. Defaults to `None`
            which means that the value set for the environment variable `AZURE_OPENAI_API_KEY` will be
            used, or `None` if not set.
        api_version: the API version to use for the Azure OpenAI API. Defaults to `None` which means
            that the value set for the environment variable `OPENAI_API_VERSION` will be used, or
            `None` if not set.

    Icon:
        `:material-microsoft-azure:`

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import AzureOpenAILLM

        llm = AzureOpenAILLM(model="gpt-4-turbo", api_key="api.key")

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate text from a custom endpoint following the OpenAI API:

        ```python
        from distilabel.models.llms import AzureOpenAILLM

        llm = AzureOpenAILLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            base_url=r"http://localhost:8080/v1"
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate structured data:

        ```python
        from pydantic import BaseModel
        from distilabel.models.llms import AzureOpenAILLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = AzureOpenAILLM(
            model="gpt-4-turbo",
            api_key="api.key",
            structured_output={"schema": User}
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    base_url: Optional[RuntimeParameter[str]] = Field(
        default_factory=lambda: os.getenv(_AZURE_OPENAI_ENDPOINT_ENV_VAR_NAME),
        description="The base URL to use for the Azure OpenAI API requests i.e. the Azure OpenAI endpoint.",
    )
    api_key: Optional[RuntimeParameter[SecretStr]] = Field(
        default_factory=lambda: os.getenv(_AZURE_OPENAI_API_KEY_ENV_VAR_NAME),
        description="The API key to authenticate the requests to the Azure OpenAI API.",
    )

    api_version: Optional[RuntimeParameter[str]] = Field(
        default_factory=lambda: os.getenv("OPENAI_API_VERSION"),
        description="The API version to use for the Azure OpenAI API.",
    )

    _base_url_env_var: str = PrivateAttr(_AZURE_OPENAI_ENDPOINT_ENV_VAR_NAME)
    _api_key_env_var: str = PrivateAttr(_AZURE_OPENAI_API_KEY_ENV_VAR_NAME)
    _aclient: Optional["AsyncAzureOpenAI"] = PrivateAttr(...)  # type: ignore

    @override
    def load(self) -> None:
        """Loads the `AsyncAzureOpenAI` client to benefit from async requests."""
        # This is a workaround to avoid the `OpenAILLM` calling the _prepare_structured_output
        # in the load method before we have the proper client.
        with patch(
            "distilabel.models.llms.openai.OpenAILLM._prepare_structured_output",
            lambda *args, **kwargs: {"client": None, "structured_output": None},
        ):
            super().load()

        try:
            from openai import AsyncAzureOpenAI
        except ImportError as ie:
            raise ImportError(
                "OpenAI Python client is not installed. Please install it using"
                " `pip install 'distilabel[openai]'`."
            ) from ie

        if self.api_key is None:
            raise ValueError(
                f"To use `{self.__class__.__name__}` an API key must be provided via `api_key`"
                f" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`."
            )

        # TODO: May be worth adding the AD auth too? Also the `organization`?
        self._aclient = AsyncAzureOpenAI(  # type: ignore
            azure_endpoint=self.base_url,  # type: ignore
            azure_deployment=self.model,
            api_version=self.api_version,
            api_key=self.api_key.get_secret_value(),
            max_retries=self.max_retries,  # type: ignore
            timeout=self.timeout,
        )

        if self.structured_output:
            result = self._prepare_structured_output(
                structured_output=self.structured_output,
                client=self._aclient,
                framework="azure_openai",
            )
            self._aclient = result.get("client")  # type: ignore

`load()` ¶

Loads the AsyncAzureOpenAI client to benefit from async requests.

Source code in src/distilabel/models/llms/azure.py

@override
def load(self) -> None:
    """Loads the `AsyncAzureOpenAI` client to benefit from async requests."""
    # This is a workaround to avoid the `OpenAILLM` calling the _prepare_structured_output
    # in the load method before we have the proper client.
    with patch(
        "distilabel.models.llms.openai.OpenAILLM._prepare_structured_output",
        lambda *args, **kwargs: {"client": None, "structured_output": None},
    ):
        super().load()

    try:
        from openai import AsyncAzureOpenAI
    except ImportError as ie:
        raise ImportError(
            "OpenAI Python client is not installed. Please install it using"
            " `pip install 'distilabel[openai]'`."
        ) from ie

    if self.api_key is None:
        raise ValueError(
            f"To use `{self.__class__.__name__}` an API key must be provided via `api_key`"
            f" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`."
        )

    # TODO: May be worth adding the AD auth too? Also the `organization`?
    self._aclient = AsyncAzureOpenAI(  # type: ignore
        azure_endpoint=self.base_url,  # type: ignore
        azure_deployment=self.model,
        api_version=self.api_version,
        api_key=self.api_key.get_secret_value(),
        max_retries=self.max_retries,  # type: ignore
        timeout=self.timeout,
    )

    if self.structured_output:
        result = self._prepare_structured_output(
            structured_output=self.structured_output,
            client=self._aclient,
            framework="azure_openai",
        )
        self._aclient = result.get("client")  # type: ignore

`CohereLLM` ¶

Bases: AsyncLLM

Cohere API implementation using the async client for concurrent text generation.

Attributes:

Name	Type	Description
`model`	`str`	the name of the model from the Cohere API to use for the generation.
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL to use for the Cohere API requests. Defaults to `"https://api.cohere.ai/v1"`.
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the Cohere API. Defaults to the value of the `COHERE_API_KEY` environment variable.
`timeout`	`RuntimeParameter[int]`	the maximum time in seconds to wait for a response from the API. Defaults to `120`.
`client_name`	`RuntimeParameter[str]`	the name of the client to use for the API requests. Defaults to `"distilabel"`.
`structured_output`	`Optional[RuntimeParameter[InstructorStructuredOutputType]]`	a dictionary containing the structured output configuration configuration using `instructor`. You can take a look at the dictionary structure in `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.
`_ChatMessage`	`Type[ChatMessage]`	the `ChatMessage` class from the `cohere` package.
`_aclient`	`AsyncClient`	the `AsyncClient` client from the `cohere` package.

Runtime parameters

base_url: the base URL to use for the Cohere API requests. Defaults to "https://api.cohere.ai/v1".
api_key: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY environment variable.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
client_name: the name of the client to use for the API requests. Defaults to "distilabel".

Examples:

Generate text:

from distilabel.models.llms import CohereLLM

llm = CohereLLM(model="CohereForAI/c4ai-command-r-plus")

llm.load()

# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

```python
from pydantic import BaseModel
from distilabel.models.llms import CohereLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = CohereLLM(
    model="CohereForAI/c4ai-command-r-plus",
    api_key="api.key",
    structured_output={"schema": User}
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/cohere.py

class CohereLLM(AsyncLLM):
    """Cohere API implementation using the async client for concurrent text generation.

    Attributes:
        model: the name of the model from the Cohere API to use for the generation.
        base_url: the base URL to use for the Cohere API requests. Defaults to
            `"https://api.cohere.ai/v1"`.
        api_key: the API key to authenticate the requests to the Cohere API. Defaults to
            the value of the `COHERE_API_KEY` environment variable.
        timeout: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.
        client_name: the name of the client to use for the API requests. Defaults to
            `"distilabel"`.
        structured_output: a dictionary containing the structured output configuration configuration
            using `instructor`. You can take a look at the dictionary structure in
            `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.
        _ChatMessage: the `ChatMessage` class from the `cohere` package.
        _aclient: the `AsyncClient` client from the `cohere` package.

    Runtime parameters:
        - `base_url`: the base URL to use for the Cohere API requests. Defaults to
            `"https://api.cohere.ai/v1"`.
        - `api_key`: the API key to authenticate the requests to the Cohere API. Defaults
            to the value of the `COHERE_API_KEY` environment variable.
        - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.
        - `client_name`: the name of the client to use for the API requests. Defaults to
            `"distilabel"`.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import CohereLLM

        llm = CohereLLM(model="CohereForAI/c4ai-command-r-plus")

        llm.load()

        # Call the model
        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

        Generate structured data:

        ```python
        from pydantic import BaseModel
        from distilabel.models.llms import CohereLLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = CohereLLM(
            model="CohereForAI/c4ai-command-r-plus",
            api_key="api.key",
            structured_output={"schema": User}
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    model: str
    base_url: Optional[RuntimeParameter[str]] = Field(
        default_factory=lambda: os.getenv(
            "COHERE_BASE_URL", "https://api.cohere.ai/v1"
        ),
        description="The base URL to use for the Cohere API requests.",
    )
    api_key: Optional[RuntimeParameter[SecretStr]] = Field(
        default_factory=lambda: os.getenv(_COHERE_API_KEY_ENV_VAR_NAME),
        description="The API key to authenticate the requests to the Cohere API.",
    )
    timeout: RuntimeParameter[int] = Field(
        default=120,
        description="The maximum time in seconds to wait for a response from the API.",
    )
    client_name: RuntimeParameter[str] = Field(
        default="distilabel",
        description="The name of the client to use for the API requests.",
    )
    structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (
        Field(
            default=None,
            description="The structured output format to use across all the generations.",
        )
    )

    _num_generations_param_supported = False

    _ChatMessage: Type["ChatMessage"] = PrivateAttr(...)
    _aclient: "AsyncClient" = PrivateAttr(...)
    _tokenizer: "Tokenizer" = PrivateAttr(...)

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    def load(self) -> None:
        """Loads the `AsyncClient` client from the `cohere` package."""

        super().load()

        try:
            from cohere import AsyncClient, ChatMessage
        except ImportError as ie:
            raise ImportError(
                "The `cohere` package is required to use the `CohereLLM` class."
            ) from ie

        self._ChatMessage = ChatMessage

        self._aclient = AsyncClient(
            api_key=self.api_key.get_secret_value(),  # type: ignore
            client_name=self.client_name,
            base_url=self.base_url,
            timeout=self.timeout,
        )

        if self.structured_output:
            result = self._prepare_structured_output(
                structured_output=self.structured_output,
                client=self._aclient,
                framework="cohere",
            )
            self._aclient = result.get("client")  # type: ignore
            if structured_output := result.get("structured_output"):
                self.structured_output = structured_output

        from cohere.manually_maintained.tokenizers import get_hf_tokenizer

        self._tokenizer: "Tokenizer" = get_hf_tokenizer(self._aclient, self.model)

    def _format_chat_to_cohere(
        self, input: "FormattedInput"
    ) -> Tuple[Union[str, None], List["ChatMessage"], str]:
        """Formats the chat input to the Cohere Chat API conversational format.

        Args:
            input: The chat input to format.

        Returns:
            A tuple containing the system, chat history, and message.
        """
        system = None
        message = None
        chat_history = []
        for item in input:
            role = item["role"]
            content = item["content"]
            if role == "system":
                system = content
            elif role == "user":
                message = content
            elif role == "assistant":
                if message is None:
                    raise ValueError(
                        "An assistant message but be preceded by a user message."
                    )
                chat_history.append(self._ChatMessage(role="USER", message=message))  # type: ignore
                chat_history.append(self._ChatMessage(role="CHATBOT", message=content))  # type: ignore
                message = None

        if message is None:
            raise ValueError("The chat input must end with a user message.")

        return system, chat_history, message

    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: FormattedInput,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        k: Optional[int] = None,
        p: Optional[float] = None,
        seed: Optional[float] = None,
        stop_sequences: Optional[Sequence[str]] = None,
        frequency_penalty: Optional[float] = None,
        presence_penalty: Optional[float] = None,
        raw_prompting: Optional[bool] = None,
    ) -> GenerateOutput:
        """Generates a response from the LLM given an input.

        Args:
            input: a single input in chat format to generate responses for.
            temperature: the temperature to use for the generation. Defaults to `None`.
            max_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `None`.
            k: the number of highest probability vocabulary tokens to keep for the generation.
                Defaults to `None`.
            p: the nucleus sampling probability to use for the generation. Defaults to
                `None`.
            seed: the seed to use for the generation. Defaults to `None`.
            stop_sequences: a list of sequences to use as stopping criteria for the generation.
                Defaults to `None`.
            frequency_penalty: the frequency penalty to use for the generation. Defaults
                to `None`.
            presence_penalty: the presence penalty to use for the generation. Defaults to
                `None`.
            raw_prompting: a flag to use raw prompting for the generation. Defaults to
                `None`.

        Returns:
            The generated response from the Cohere API model.
        """
        structured_output = None
        if isinstance(input, tuple):
            input, structured_output = input
            result = self._prepare_structured_output(
                structured_output=structured_output,  # type: ignore
                client=self._aclient,
                framework="cohere",
            )
            self._aclient = result.get("client")  # type: ignore

        if structured_output is None and self.structured_output is not None:
            structured_output = self.structured_output

        system, chat_history, message = self._format_chat_to_cohere(input)

        kwargs = {
            "message": message,
            "model": self.model,
            "preamble": system,
            "chat_history": chat_history,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "k": k,
            "p": p,
            "seed": seed,
            "stop_sequences": stop_sequences,
            "frequency_penalty": frequency_penalty,
            "presence_penalty": presence_penalty,
            "raw_prompting": raw_prompting,
        }
        if structured_output:
            kwargs = self._prepare_kwargs(kwargs, structured_output)  # type: ignore

        response: Union["Message", "BaseModel"] = await self._aclient.chat(**kwargs)  # type: ignore

        if structured_output:
            return prepare_output(
                [response.model_dump_json()],
                **self._get_llm_statistics(
                    input, orjson.dumps(response.model_dump_json()).decode("utf-8")
                ),  # type: ignore
            )

        if (text := response.text) == "":
            self._logger.warning(  # type: ignore
                f"Received no response using Cohere client (model: '{self.model}')."
                f" Finish reason was: {response.finish_reason}"
            )
            return prepare_output(
                [None],
                **self._get_llm_statistics(input, ""),
            )

        return prepare_output(
            [text],
            **self._get_llm_statistics(input, text),
        )

    def _get_llm_statistics(
        self, input: FormattedInput, output: str
    ) -> "LLMStatistics":
        return {
            "input_tokens": [compute_tokens(input, self._tokenizer.encode)],
            "output_tokens": [compute_tokens(output, self._tokenizer.encode)],
        }

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the AsyncClient client from the cohere package.

Source code in src/distilabel/models/llms/cohere.py

def load(self) -> None:
    """Loads the `AsyncClient` client from the `cohere` package."""

    super().load()

    try:
        from cohere import AsyncClient, ChatMessage
    except ImportError as ie:
        raise ImportError(
            "The `cohere` package is required to use the `CohereLLM` class."
        ) from ie

    self._ChatMessage = ChatMessage

    self._aclient = AsyncClient(
        api_key=self.api_key.get_secret_value(),  # type: ignore
        client_name=self.client_name,
        base_url=self.base_url,
        timeout=self.timeout,
    )

    if self.structured_output:
        result = self._prepare_structured_output(
            structured_output=self.structured_output,
            client=self._aclient,
            framework="cohere",
        )
        self._aclient = result.get("client")  # type: ignore
        if structured_output := result.get("structured_output"):
            self.structured_output = structured_output

    from cohere.manually_maintained.tokenizers import get_hf_tokenizer

    self._tokenizer: "Tokenizer" = get_hf_tokenizer(self._aclient, self.model)

`_format_chat_to_cohere(input)` ¶

Formats the chat input to the Cohere Chat API conversational format.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	The chat input to format.	required

Returns:

Type	Description
`Tuple[Union[str, None], List[ChatMessage], str]`	A tuple containing the system, chat history, and message.

Source code in src/distilabel/models/llms/cohere.py

def _format_chat_to_cohere(
    self, input: "FormattedInput"
) -> Tuple[Union[str, None], List["ChatMessage"], str]:
    """Formats the chat input to the Cohere Chat API conversational format.

    Args:
        input: The chat input to format.

    Returns:
        A tuple containing the system, chat history, and message.
    """
    system = None
    message = None
    chat_history = []
    for item in input:
        role = item["role"]
        content = item["content"]
        if role == "system":
            system = content
        elif role == "user":
            message = content
        elif role == "assistant":
            if message is None:
                raise ValueError(
                    "An assistant message but be preceded by a user message."
                )
            chat_history.append(self._ChatMessage(role="USER", message=message))  # type: ignore
            chat_history.append(self._ChatMessage(role="CHATBOT", message=content))  # type: ignore
            message = None

    if message is None:
        raise ValueError("The chat input must end with a user message.")

    return system, chat_history, message

`agenerate(input, temperature=None, max_tokens=None, k=None, p=None, seed=None, stop_sequences=None, frequency_penalty=None, presence_penalty=None, raw_prompting=None)` `async` ¶

Generates a response from the LLM given an input.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`temperature`	`Optional[float]`	the temperature to use for the generation. Defaults to `None`.	`None`
`max_tokens`	`Optional[int]`	the maximum number of new tokens that the model will generate. Defaults to `None`.	`None`
`k`	`Optional[int]`	the number of highest probability vocabulary tokens to keep for the generation. Defaults to `None`.	`None`
`p`	`Optional[float]`	the nucleus sampling probability to use for the generation. Defaults to `None`.	`None`
`seed`	`Optional[float]`	the seed to use for the generation. Defaults to `None`.	`None`
`stop_sequences`	`Optional[Sequence[str]]`	a list of sequences to use as stopping criteria for the generation. Defaults to `None`.	`None`
`frequency_penalty`	`Optional[float]`	the frequency penalty to use for the generation. Defaults to `None`.	`None`
`presence_penalty`	`Optional[float]`	the presence penalty to use for the generation. Defaults to `None`.	`None`
`raw_prompting`	`Optional[bool]`	a flag to use raw prompting for the generation. Defaults to `None`.	`None`

Returns:

Type	Description
`GenerateOutput`	The generated response from the Cohere API model.

Source code in src/distilabel/models/llms/cohere.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: FormattedInput,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    k: Optional[int] = None,
    p: Optional[float] = None,
    seed: Optional[float] = None,
    stop_sequences: Optional[Sequence[str]] = None,
    frequency_penalty: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    raw_prompting: Optional[bool] = None,
) -> GenerateOutput:
    """Generates a response from the LLM given an input.

    Args:
        input: a single input in chat format to generate responses for.
        temperature: the temperature to use for the generation. Defaults to `None`.
        max_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `None`.
        k: the number of highest probability vocabulary tokens to keep for the generation.
            Defaults to `None`.
        p: the nucleus sampling probability to use for the generation. Defaults to
            `None`.
        seed: the seed to use for the generation. Defaults to `None`.
        stop_sequences: a list of sequences to use as stopping criteria for the generation.
            Defaults to `None`.
        frequency_penalty: the frequency penalty to use for the generation. Defaults
            to `None`.
        presence_penalty: the presence penalty to use for the generation. Defaults to
            `None`.
        raw_prompting: a flag to use raw prompting for the generation. Defaults to
            `None`.

    Returns:
        The generated response from the Cohere API model.
    """
    structured_output = None
    if isinstance(input, tuple):
        input, structured_output = input
        result = self._prepare_structured_output(
            structured_output=structured_output,  # type: ignore
            client=self._aclient,
            framework="cohere",
        )
        self._aclient = result.get("client")  # type: ignore

    if structured_output is None and self.structured_output is not None:
        structured_output = self.structured_output

    system, chat_history, message = self._format_chat_to_cohere(input)

    kwargs = {
        "message": message,
        "model": self.model,
        "preamble": system,
        "chat_history": chat_history,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "k": k,
        "p": p,
        "seed": seed,
        "stop_sequences": stop_sequences,
        "frequency_penalty": frequency_penalty,
        "presence_penalty": presence_penalty,
        "raw_prompting": raw_prompting,
    }
    if structured_output:
        kwargs = self._prepare_kwargs(kwargs, structured_output)  # type: ignore

    response: Union["Message", "BaseModel"] = await self._aclient.chat(**kwargs)  # type: ignore

    if structured_output:
        return prepare_output(
            [response.model_dump_json()],
            **self._get_llm_statistics(
                input, orjson.dumps(response.model_dump_json()).decode("utf-8")
            ),  # type: ignore
        )

    if (text := response.text) == "":
        self._logger.warning(  # type: ignore
            f"Received no response using Cohere client (model: '{self.model}')."
            f" Finish reason was: {response.finish_reason}"
        )
        return prepare_output(
            [None],
            **self._get_llm_statistics(input, ""),
        )

    return prepare_output(
        [text],
        **self._get_llm_statistics(input, text),
    )

`GroqLLM` ¶

Bases: AsyncLLM

Groq API implementation using the async client for concurrent text generation.

Attributes:

Name	Type	Description
`model`	`str`	the name of the model from the Groq API to use for the generation.
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL to use for the Groq API requests. Defaults to `"https://api.groq.com"`.
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the Groq API. Defaults to the value of the `GROQ_API_KEY` environment variable.
`max_retries`	`RuntimeParameter[int]`	the maximum number of times to retry the request to the API before failing. Defaults to `2`.
`timeout`	`RuntimeParameter[int]`	the maximum time in seconds to wait for a response from the API. Defaults to `120`.
`structured_output`	`Optional[RuntimeParameter[InstructorStructuredOutputType]]`	a dictionary containing the structured output configuration configuration using `instructor`. You can take a look at the dictionary structure in `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.
`_api_key_env_var`	`str`	the name of the environment variable to use for the API key.
`_aclient`	`Optional[AsyncGroq]`	the `AsyncGroq` client from the `groq` package.

Runtime parameters

base_url: the base URL to use for the Groq API requests. Defaults to "https://api.groq.com".
api_key: the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY environment variable.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 2.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.

Examples:

Generate text:

from distilabel.models.llms import GroqLLM

llm = GroqLLM(model="llama3-70b-8192")

llm.load()

# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

```python
from pydantic import BaseModel
from distilabel.models.llms import GroqLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = GroqLLM(
    model="llama3-70b-8192",
    api_key="api.key",
    structured_output={"schema": User}
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/groq.py

class GroqLLM(AsyncLLM):
    """Groq API implementation using the async client for concurrent text generation.

    Attributes:
        model: the name of the model from the Groq API to use for the generation.
        base_url: the base URL to use for the Groq API requests. Defaults to
            `"https://api.groq.com"`.
        api_key: the API key to authenticate the requests to the Groq API. Defaults to
            the value of the `GROQ_API_KEY` environment variable.
        max_retries: the maximum number of times to retry the request to the API before
            failing. Defaults to `2`.
        timeout: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.
        structured_output: a dictionary containing the structured output configuration configuration
            using `instructor`. You can take a look at the dictionary structure in
            `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.
        _api_key_env_var: the name of the environment variable to use for the API key.
        _aclient: the `AsyncGroq` client from the `groq` package.

    Runtime parameters:
        - `base_url`: the base URL to use for the Groq API requests. Defaults to
            `"https://api.groq.com"`.
        - `api_key`: the API key to authenticate the requests to the Groq API. Defaults to
            the value of the `GROQ_API_KEY` environment variable.
        - `max_retries`: the maximum number of times to retry the request to the API before
            failing. Defaults to `2`.
        - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import GroqLLM

        llm = GroqLLM(model="llama3-70b-8192")

        llm.load()

        # Call the model
        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

        Generate structured data:

        ```python
        from pydantic import BaseModel
        from distilabel.models.llms import GroqLLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = GroqLLM(
            model="llama3-70b-8192",
            api_key="api.key",
            structured_output={"schema": User}
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    model: str

    base_url: Optional[RuntimeParameter[str]] = Field(
        default_factory=lambda: os.getenv(
            _GROQ_API_BASE_URL_ENV_VAR_NAME, "https://api.groq.com"
        ),
        description="The base URL to use for the Groq API requests.",
    )
    api_key: Optional[RuntimeParameter[SecretStr]] = Field(
        default_factory=lambda: os.getenv(_GROQ_API_KEY_ENV_VAR_NAME),
        description="The API key to authenticate the requests to the Groq API.",
    )
    max_retries: RuntimeParameter[int] = Field(
        default=2,
        description="The maximum number of times to retry the request to the API before"
        " failing.",
    )
    timeout: RuntimeParameter[int] = Field(
        default=120,
        description="The maximum time in seconds to wait for a response from the API.",
    )
    structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (
        Field(
            default=None,
            description="The structured output format to use across all the generations.",
        )
    )

    _num_generations_param_supported = False

    _api_key_env_var: str = PrivateAttr(_GROQ_API_KEY_ENV_VAR_NAME)
    _aclient: Optional["AsyncGroq"] = PrivateAttr(...)

    def load(self) -> None:
        """Loads the `AsyncGroq` client to benefit from async requests."""
        super().load()

        try:
            from groq import AsyncGroq
        except ImportError as ie:
            raise ImportError(
                "Groq Python client is not installed. Please install it using"
                ' `pip install "distilabel[groq]"`.'
            ) from ie

        if self.api_key is None:
            raise ValueError(
                f"To use `{self.__class__.__name__}` an API key must be provided via `api_key`"
                f" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`."
            )

        self._aclient = AsyncGroq(
            base_url=self.base_url,
            api_key=self.api_key.get_secret_value(),
            max_retries=self.max_retries,  # type: ignore
            timeout=self.timeout,
        )

        if self.structured_output:
            result = self._prepare_structured_output(
                structured_output=self.structured_output,
                client=self._aclient,
                framework="groq",
            )
            self._aclient = result.get("client")  # type: ignore
            if structured_output := result.get("structured_output"):
                self.structured_output = structured_output

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: FormattedInput,
        seed: Optional[int] = None,
        max_new_tokens: int = 128,
        temperature: float = 1.0,
        top_p: float = 1.0,
        stop: Optional[str] = None,
    ) -> "GenerateOutput":
        """Generates `num_generations` responses for the given input using the Groq async
        client.

        Args:
            input: a single input in chat format to generate responses for.
            seed: the seed to use for the generation. Defaults to `None`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            stop: the stop sequence to use for the generation. Defaults to `None`.

        Returns:
            A list of lists of strings containing the generated responses for each input.

        References:
            - https://console.groq.com/docs/text-chat
        """
        structured_output = None
        if isinstance(input, tuple):
            input, structured_output = input
            result = self._prepare_structured_output(
                structured_output=structured_output,
                client=self._aclient,
                framework="groq",
            )
            self._aclient = result.get("client")

        if structured_output is None and self.structured_output is not None:
            structured_output = self.structured_output

        kwargs = {
            "messages": input,  # type: ignore
            "model": self.model,
            "seed": seed,
            "temperature": temperature,
            "max_tokens": max_new_tokens,
            "top_p": top_p,
            "stream": False,
            "stop": stop,
        }
        if structured_output:
            kwargs = self._prepare_kwargs(kwargs, structured_output)

        completion = await self._aclient.chat.completions.create(**kwargs)  # type: ignore
        if structured_output:
            return prepare_output(
                [completion.model_dump_json()],
                **self._get_llm_statistics(completion._raw_response),
            )

        generations = []
        for choice in completion.choices:
            if (content := choice.message.content) is None:
                self._logger.warning(  # type: ignore
                    f"Received no response using the Groq client (model: '{self.model}')."
                    f" Finish reason was: {choice.finish_reason}"
                )
            generations.append(content)
        return prepare_output(generations, **self._get_llm_statistics(completion))

    @staticmethod
    def _get_llm_statistics(completion: "ChatCompletion") -> "LLMStatistics":
        return {
            "input_tokens": [completion.usage.prompt_tokens if completion else 0],
            "output_tokens": [completion.usage.completion_tokens if completion else 0],
        }

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the AsyncGroq client to benefit from async requests.

Source code in src/distilabel/models/llms/groq.py

def load(self) -> None:
    """Loads the `AsyncGroq` client to benefit from async requests."""
    super().load()

    try:
        from groq import AsyncGroq
    except ImportError as ie:
        raise ImportError(
            "Groq Python client is not installed. Please install it using"
            ' `pip install "distilabel[groq]"`.'
        ) from ie

    if self.api_key is None:
        raise ValueError(
            f"To use `{self.__class__.__name__}` an API key must be provided via `api_key`"
            f" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`."
        )

    self._aclient = AsyncGroq(
        base_url=self.base_url,
        api_key=self.api_key.get_secret_value(),
        max_retries=self.max_retries,  # type: ignore
        timeout=self.timeout,
    )

    if self.structured_output:
        result = self._prepare_structured_output(
            structured_output=self.structured_output,
            client=self._aclient,
            framework="groq",
        )
        self._aclient = result.get("client")  # type: ignore
        if structured_output := result.get("structured_output"):
            self.structured_output = structured_output

`agenerate(input, seed=None, max_new_tokens=128, temperature=1.0, top_p=1.0, stop=None)` `async` ¶

Generates num_generations responses for the given input using the Groq async client.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`seed`	`Optional[int]`	the seed to use for the generation. Defaults to `None`.	`None`
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`1.0`
`top_p`	`float`	the top-p value to use for the generation. Defaults to `1.0`.	`1.0`
`stop`	`Optional[str]`	the stop sequence to use for the generation. Defaults to `None`.	`None`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

References

https://console.groq.com/docs/text-chat

Source code in src/distilabel/models/llms/groq.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: FormattedInput,
    seed: Optional[int] = None,
    max_new_tokens: int = 128,
    temperature: float = 1.0,
    top_p: float = 1.0,
    stop: Optional[str] = None,
) -> "GenerateOutput":
    """Generates `num_generations` responses for the given input using the Groq async
    client.

    Args:
        input: a single input in chat format to generate responses for.
        seed: the seed to use for the generation. Defaults to `None`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        stop: the stop sequence to use for the generation. Defaults to `None`.

    Returns:
        A list of lists of strings containing the generated responses for each input.

    References:
        - https://console.groq.com/docs/text-chat
    """
    structured_output = None
    if isinstance(input, tuple):
        input, structured_output = input
        result = self._prepare_structured_output(
            structured_output=structured_output,
            client=self._aclient,
            framework="groq",
        )
        self._aclient = result.get("client")

    if structured_output is None and self.structured_output is not None:
        structured_output = self.structured_output

    kwargs = {
        "messages": input,  # type: ignore
        "model": self.model,
        "seed": seed,
        "temperature": temperature,
        "max_tokens": max_new_tokens,
        "top_p": top_p,
        "stream": False,
        "stop": stop,
    }
    if structured_output:
        kwargs = self._prepare_kwargs(kwargs, structured_output)

    completion = await self._aclient.chat.completions.create(**kwargs)  # type: ignore
    if structured_output:
        return prepare_output(
            [completion.model_dump_json()],
            **self._get_llm_statistics(completion._raw_response),
        )

    generations = []
    for choice in completion.choices:
        if (content := choice.message.content) is None:
            self._logger.warning(  # type: ignore
                f"Received no response using the Groq client (model: '{self.model}')."
                f" Finish reason was: {choice.finish_reason}"
            )
        generations.append(content)
    return prepare_output(generations, **self._get_llm_statistics(completion))

`InferenceEndpointsLLM` ¶

Bases: InferenceEndpointsBaseClient, AsyncLLM, MagpieChatTemplateMixin

InferenceEndpoints LLM implementation running the async API client.

This LLM will internally use huggingface_hub.AsyncInferenceClient.

Attributes:

Name	Type	Description
`model_id`	`Optional[str]`	the model ID to use for the LLM as available in the Hugging Face Hub, which will be used to resolve the base URL for the serverless Inference Endpoints API requests. Defaults to `None`.
`endpoint_name`	`Optional[RuntimeParameter[str]]`	the name of the Inference Endpoint to use for the LLM. Defaults to `None`.
`endpoint_namespace`	`Optional[RuntimeParameter[str]]`	the namespace of the Inference Endpoint to use for the LLM. Defaults to `None`.
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL to use for the Inference Endpoints API requests.
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the Inference Endpoints API.
`provider`	`Optional[str]`	the name of the provider to use for inference. Defaults to `None`.
`tokenizer_id`	`Optional[str]`	the tokenizer ID to use for the LLM as available in the Hugging Face Hub. Defaults to `None`, but defining one is recommended to properly format the prompt.
`model_display_name`	`Optional[str]`	the model display name to use for the LLM. Defaults to `None`.
`use_magpie_template`	`bool`	a flag used to enable/disable applying the Magpie pre-query template. Defaults to `False`.
`magpie_pre_query_template`	`Union[MagpieAvailablePreQueryTemplates, str, None]`	the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are "llama3", "qwen2" or another pre-query template provided. Defaults to `None`.
`structured_output`	`Optional[RuntimeParameter[StructuredOutputType]]`	a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.

Icon

:hugging:

Examples:

Free serverless Inference API, set the input_batch_size of the Task that uses this to avoid Model is overloaded:

from distilabel.models.llms.huggingface import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Dedicated Inference Endpoints:

from distilabel.models.llms.huggingface import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    endpoint_name="<ENDPOINT_NAME>",
    api_key="<HF_API_KEY>",
    endpoint_namespace="<USER|ORG>",
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Dedicated Inference Endpoints or TGI:

from distilabel.models.llms.huggingface import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    api_key="<HF_API_KEY>",
    base_url="<BASE_URL>",
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

from pydantic import BaseModel
from distilabel.models.llms import InferenceEndpointsLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    api_key="api.key",
    structured_output={"format": "json", "schema": User.model_json_schema()}
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the Tour De France"}]])

Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py

class InferenceEndpointsLLM(
    InferenceEndpointsBaseClient, AsyncLLM, MagpieChatTemplateMixin
):
    """InferenceEndpoints LLM implementation running the async API client.

    This LLM will internally use `huggingface_hub.AsyncInferenceClient`.

    Attributes:
        model_id: the model ID to use for the LLM as available in the Hugging Face Hub, which
            will be used to resolve the base URL for the serverless Inference Endpoints API requests.
            Defaults to `None`.
        endpoint_name: the name of the Inference Endpoint to use for the LLM. Defaults to `None`.
        endpoint_namespace: the namespace of the Inference Endpoint to use for the LLM. Defaults to `None`.
        base_url: the base URL to use for the Inference Endpoints API requests.
        api_key: the API key to authenticate the requests to the Inference Endpoints API.
        provider: the name of the provider to use for inference. Defaults to `None`.
        tokenizer_id: the tokenizer ID to use for the LLM as available in the Hugging Face Hub.
            Defaults to `None`, but defining one is recommended to properly format the prompt.
        model_display_name: the model display name to use for the LLM. Defaults to `None`.
        use_magpie_template: a flag used to enable/disable applying the Magpie pre-query
            template. Defaults to `False`.
        magpie_pre_query_template: the pre-query template to be applied to the prompt or
            sent to the LLM to generate an instruction or a follow up user message. Valid
            values are "llama3", "qwen2" or another pre-query template provided. Defaults
            to `None`.
        structured_output: a dictionary containing the structured output configuration or
            if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`.
            Defaults to None.

    Icon:
        `:hugging:`

    Examples:
        Free serverless Inference API, set the input_batch_size of the Task that uses this to avoid Model is overloaded:

        ```python
        from distilabel.models.llms.huggingface import InferenceEndpointsLLM

        llm = InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Dedicated Inference Endpoints:

        ```python
        from distilabel.models.llms.huggingface import InferenceEndpointsLLM

        llm = InferenceEndpointsLLM(
            endpoint_name="<ENDPOINT_NAME>",
            api_key="<HF_API_KEY>",
            endpoint_namespace="<USER|ORG>",
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Dedicated Inference Endpoints or TGI:

        ```python
        from distilabel.models.llms.huggingface import InferenceEndpointsLLM

        llm = InferenceEndpointsLLM(
            api_key="<HF_API_KEY>",
            base_url="<BASE_URL>",
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate structured data:

        ```python
        from pydantic import BaseModel
        from distilabel.models.llms import InferenceEndpointsLLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            api_key="api.key",
            structured_output={"format": "json", "schema": User.model_json_schema()}
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the Tour De France"}]])
        ```
    """

    def load(self) -> None:
        # Sets the logger and calls the load method of the BaseClient
        self._num_generations_param_supported = False
        AsyncLLM.load(self)
        InferenceEndpointsBaseClient.load(self)

    @model_validator(mode="after")  # type: ignore
    def only_one_of_model_id_endpoint_name_or_base_url_provided(
        self,
    ) -> "InferenceEndpointsLLM":
        """Validates that only one of `model_id` or `endpoint_name` is provided; and if `base_url` is also
        provided, a warning will be shown informing the user that the provided `base_url` will be ignored in
        favour of the dynamically calculated one.."""

        if self.base_url and (self.model_id or self.endpoint_name):
            self._logger.warning(  # type: ignore
                f"Since the `base_url={self.base_url}` is available and either one of `model_id`"
                " or `endpoint_name` is also provided, the `base_url` will either be ignored"
                " or overwritten with the one generated from either of those args, for serverless"
                " or dedicated inference endpoints, respectively."
            )

        if self.use_magpie_template and self.tokenizer_id is None:
            raise ValueError(
                "`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,"
                " set a `tokenizer_id` and try again."
            )

        if (
            self.model_id
            and self.tokenizer_id is None
            and self.structured_output is not None
        ):
            self.tokenizer_id = self.model_id

        if self.base_url and not (self.model_id or self.endpoint_name):
            return self

        if self.model_id and not self.endpoint_name:
            return self

        if self.endpoint_name and not self.model_id:
            return self

        raise ValidationError(
            f"Only one of `model_id` or `endpoint_name` must be provided. If `base_url` is"
            f" provided too, it will be overwritten instead. Found `model_id`={self.model_id},"
            f" `endpoint_name`={self.endpoint_name}, and `base_url`={self.base_url}."
        )

    def prepare_input(self, input: "StandardInput") -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        prompt: str = (
            self._tokenizer.apply_chat_template(  # type: ignore
                conversation=input,  # type: ignore
                tokenize=False,
                add_generation_prompt=True,
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    def _get_structured_output(
        self, input: FormattedInput
    ) -> Tuple["StandardInput", Union[Dict[str, Any], None]]:
        """Gets the structured output (if any) for the given input.

        Args:
            input: a single input in chat format to generate responses for.

        Returns:
            The input and the structured output that will be passed as `grammar` to the
            inference endpoint or `None` if not required.
        """
        structured_output = None

        # Specific structured output per input
        if isinstance(input, tuple):
            input, structured_output = input
            structured_output = {
                "type": structured_output["format"],  # type: ignore
                "value": structured_output["schema"],  # type: ignore
            }

        # Same structured output for all the inputs
        if structured_output is None and self.structured_output is not None:
            try:
                structured_output = {
                    "type": self.structured_output["format"],  # type: ignore
                    "value": self.structured_output["schema"],  # type: ignore
                }
            except KeyError as e:
                raise ValueError(
                    "To use the structured output you have to inform the `format` and `schema` in "
                    "the `structured_output` attribute."
                ) from e

        if structured_output:
            if isinstance(structured_output["value"], ModelMetaclass):
                structured_output["value"] = structured_output[
                    "value"
                ].model_json_schema()

        return input, structured_output

    async def _generate_with_text_generation(
        self,
        input: str,
        max_new_tokens: int = 128,
        repetition_penalty: Optional[float] = None,
        frequency_penalty: Optional[float] = None,
        temperature: float = 1.0,
        do_sample: bool = False,
        top_n_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
        top_k: Optional[int] = None,
        typical_p: Optional[float] = None,
        stop_sequences: Union[List[str], None] = None,
        return_full_text: bool = False,
        seed: Optional[int] = None,
        watermark: bool = False,
        structured_output: Union[Dict[str, Any], None] = None,
    ) -> GenerateOutput:
        generation: Union["TextGenerationOutput", None] = None
        try:
            generation = await self._aclient.text_generation(  # type: ignore
                prompt=input,
                max_new_tokens=max_new_tokens,
                do_sample=do_sample,
                typical_p=typical_p,
                repetition_penalty=repetition_penalty,
                frequency_penalty=frequency_penalty,
                temperature=temperature,
                top_n_tokens=top_n_tokens,
                top_p=top_p,
                top_k=top_k,
                stop_sequences=stop_sequences,
                return_full_text=return_full_text,
                # NOTE: here to ensure that the cache is not used and a different response is
                # generated every time
                seed=seed or random.randint(0, sys.maxsize),
                watermark=watermark,
                grammar=structured_output,  # type: ignore
                details=True,
            )
        except Exception as e:
            self._logger.warning(  # type: ignore
                f"⚠️ Received no response using Inference Client (model: '{self.model_name}')."
                f" Finish reason was: {e}"
            )
        return prepare_output(
            generations=[generation.generated_text] if generation else [None],
            input_tokens=[
                compute_tokens(input, self._tokenizer.encode) if self._tokenizer else -1
            ],
            output_tokens=[
                generation.details.generated_tokens
                if generation and generation.details
                else 0
            ],
            logprobs=self._get_logprobs_from_text_generation(generation)
            if generation
            else None,  # type: ignore
        )

    def _get_logprobs_from_text_generation(
        self, generation: "TextGenerationOutput"
    ) -> Union[List[List[List["Logprob"]]], None]:
        if generation.details is None or generation.details.top_tokens is None:
            return None

        return [
            [
                [
                    {"token": top_logprob["text"], "logprob": top_logprob["logprob"]}
                    for top_logprob in token_logprobs
                ]
                for token_logprobs in generation.details.top_tokens
            ]
        ]

    async def _generate_with_chat_completion(
        self,
        input: "StandardInput",
        max_new_tokens: int = 128,
        frequency_penalty: Optional[float] = None,
        logit_bias: Optional[List[float]] = None,
        logprobs: bool = False,
        presence_penalty: Optional[float] = None,
        seed: Optional[int] = None,
        stop_sequences: Optional[List[str]] = None,
        temperature: float = 1.0,
        tool_choice: Optional[Union[Dict[str, str], Literal["auto"]]] = None,
        tool_prompt: Optional[str] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        top_logprobs: Optional[PositiveInt] = None,
        top_p: Optional[float] = None,
    ) -> GenerateOutput:
        message = None
        completion: Union["ChatCompletionOutput", None] = None
        output_logprobs = None
        try:
            completion = await self._aclient.chat_completion(  # type: ignore
                messages=input,  # type: ignore
                max_tokens=max_new_tokens,
                frequency_penalty=frequency_penalty,
                logit_bias=logit_bias,
                logprobs=logprobs,
                presence_penalty=presence_penalty,
                # NOTE: here to ensure that the cache is not used and a different response is
                # generated every time
                seed=seed or random.randint(0, sys.maxsize),
                stop=stop_sequences,
                temperature=temperature,
                tool_choice=tool_choice,  # type: ignore
                tool_prompt=tool_prompt,
                tools=tools,  # type: ignore
                top_logprobs=top_logprobs,
                top_p=top_p,
            )
            choice = completion.choices[0]  # type: ignore
            if (message := choice.message.content) is None:
                self._logger.warning(  # type: ignore
                    f"⚠️ Received no response using Inference Client (model: '{self.model_name}')."
                    f" Finish reason was: {choice.finish_reason}"
                )
            if choice_logprobs := self._get_logprobs_from_choice(choice):
                output_logprobs = [choice_logprobs]
        except Exception as e:
            self._logger.warning(  # type: ignore
                f"⚠️ Received no response using Inference Client (model: '{self.model_name}')."
                f" Finish reason was: {e}"
            )
        return prepare_output(
            generations=[message],
            input_tokens=[completion.usage.prompt_tokens] if completion else None,
            output_tokens=[completion.usage.completion_tokens] if completion else None,
            logprobs=output_logprobs,
        )

    def _get_logprobs_from_choice(
        self, choice: "ChatCompletionOutputComplete"
    ) -> Union[List[List["Logprob"]], None]:
        if choice.logprobs is None:
            return None

        return [
            [
                {"token": top_logprob.token, "logprob": top_logprob.logprob}
                for top_logprob in token_logprobs.top_logprobs
            ]
            for token_logprobs in choice.logprobs.content
        ]

    def _check_stop_sequences(
        self,
        stop_sequences: Optional[Union[str, List[str]]] = None,
    ) -> Union[List[str], None]:
        """Checks that no more than 4 stop sequences are provided.

        Args:
            stop_sequences: the stop sequences to be checked.

        Returns:
            The stop sequences.
        """
        if stop_sequences is not None:
            if isinstance(stop_sequences, str):
                stop_sequences = [stop_sequences]
            if len(stop_sequences) > 4:
                warnings.warn(
                    "Only up to 4 stop sequences are allowed, so keeping the first 4 items only.",
                    UserWarning,
                    stacklevel=2,
                )
                stop_sequences = stop_sequences[:4]
        return stop_sequences

    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: FormattedInput,
        max_new_tokens: int = 128,
        frequency_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,
        logit_bias: Optional[List[float]] = None,
        logprobs: bool = False,
        presence_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,
        seed: Optional[int] = None,
        stop_sequences: Optional[List[str]] = None,
        temperature: float = 1.0,
        tool_choice: Optional[Union[Dict[str, str], Literal["auto"]]] = None,
        tool_prompt: Optional[str] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        top_logprobs: Optional[PositiveInt] = None,
        top_n_tokens: Optional[PositiveInt] = None,
        top_p: Optional[float] = None,
        do_sample: bool = False,
        repetition_penalty: Optional[float] = None,
        return_full_text: bool = False,
        top_k: Optional[int] = None,
        typical_p: Optional[float] = None,
        watermark: bool = False,
        num_generations: int = 1,
    ) -> GenerateOutput:
        """Generates completions for the given input using the async client. This method
        uses two methods of the `huggingface_hub.AsyncClient`: `chat_completion` and `text_generation`.
        `chat_completion` method will be used only if no `tokenizer_id` has been specified.
        Some arguments of this function are specific to the `text_generation` method, while
        some others are specific to the `chat_completion` method.

        Args:
            input: a single input in chat format to generate responses for.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            frequency_penalty: a value between `-2.0` and `2.0`. Positive values penalize
                new tokens based on their existing frequency in the text so far, decreasing
                model's likelihood to repeat the same line verbatim. Defauls to `None`.
            logit_bias: modify the likelihood of specified tokens appearing in the completion.
                This argument is exclusive to the `chat_completion` method and will be used
                only if `tokenizer_id` is `None`.
                Defaults to `None`.
            logprobs: whether to return the log probabilities or not. This argument is exclusive
                to the `chat_completion` method and will be used only if `tokenizer_id`
                is `None`. Defaults to `False`.
            presence_penalty: a value between `-2.0` and `2.0`. Positive values penalize
                new tokens based on whether they appear in the text so far, increasing the
                model likelihood to talk about new topics. This argument is exclusive to
                the `chat_completion` method and will be used only if `tokenizer_id` is
                `None`. Defauls to `None`.
            seed: the seed to use for the generation. Defaults to `None`.
            stop_sequences: either a single string or a list of strings containing the sequences
                to stop the generation at. Defaults to `None`, but will be set to the
                `tokenizer.eos_token` if available.
            temperature: the temperature to use for the generation. Defaults to `1.0`.
            tool_choice: the name of the tool the model should call. It can be a dictionary
                like `{"function_name": "my_tool"}` or "auto". If not provided, then the
                model won't use any tool. This argument is exclusive to the `chat_completion`
                method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.
            tool_prompt: A prompt to be appended before the tools. This argument is exclusive
                to the `chat_completion` method and will be used only if `tokenizer_id`
                is `None`. Defauls to `None`.
            tools: a list of tools definitions that the LLM can use.
                This argument is exclusive to the `chat_completion` method and will be used
                only if `tokenizer_id` is `None`. Defaults to `None`.
            top_logprobs: the number of top log probabilities to return per output token
                generated. This argument is exclusive to the `chat_completion` method and
                will be used only if `tokenizer_id` is `None`. Defaults to `None`.
            top_n_tokens: the number of top log probabilities to return per output token
                generated. This argument is exclusive of the `text_generation` method and
                will be only used if `tokenizer_id` is not `None`. Defaults to `None`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            do_sample: whether to use sampling for the generation. This argument is exclusive
                of the `text_generation` method and will be only used if `tokenizer_id` is not
                `None`. Defaults to `False`.
            repetition_penalty: the repetition penalty to use for the generation. This argument
                is exclusive of the `text_generation` method and will be only used if `tokenizer_id`
                is not `None`. Defaults to `None`.
            return_full_text: whether to return the full text of the completion or just
                the generated text. Defaults to `False`, meaning that only the generated
                text will be returned. This argument is exclusive of the `text_generation`
                method and will be only used if `tokenizer_id` is not `None`.
            top_k: the top-k value to use for the generation. This argument is exclusive
                of the `text_generation` method and will be only used if `tokenizer_id`
                is not `None`. Defaults to `0.8`, since neither `0.0` nor `1.0` are valid
                values in TGI.
            typical_p: the typical-p value to use for the generation. This argument is exclusive
                of the `text_generation` method and will be only used if `tokenizer_id`
                is not `None`. Defaults to `None`.
            watermark: whether to add the watermark to the generated text. This argument
                is exclusive of the `text_generation` method and will be only used if `tokenizer_id`
                is not `None`. Defaults to `None`.
            num_generations: the number of generations to generate. Defaults to `1`. It's here to ensure
                the validation succeds.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """
        stop_sequences = self._check_stop_sequences(stop_sequences)

        if isinstance(input, str) or self.tokenizer_id is not None:
            structured_output = None
            if not isinstance(input, str):
                input, structured_output = self._get_structured_output(input)
                input = self.prepare_input(input)

            return await self._generate_with_text_generation(
                input=input,
                max_new_tokens=max_new_tokens,
                do_sample=do_sample,
                typical_p=typical_p,
                repetition_penalty=repetition_penalty,
                frequency_penalty=frequency_penalty,
                temperature=temperature,
                top_n_tokens=top_n_tokens,
                top_p=top_p,
                top_k=top_k,
                stop_sequences=stop_sequences,
                return_full_text=return_full_text,
                seed=seed,
                watermark=watermark,
                structured_output=structured_output,
            )

        return await self._generate_with_chat_completion(
            input=input,  # type: ignore
            max_new_tokens=max_new_tokens,
            frequency_penalty=frequency_penalty,
            logit_bias=logit_bias,
            logprobs=logprobs,
            presence_penalty=presence_penalty,
            seed=seed,
            stop_sequences=stop_sequences,
            temperature=temperature,
            tool_choice=tool_choice,
            tool_prompt=tool_prompt,
            tools=tools,
            top_logprobs=top_logprobs,
            top_p=top_p,
        )

`only_one_of_model_id_endpoint_name_or_base_url_provided()` ¶

Validates that only one of model_id or endpoint_name is provided; and if base_url is also provided, a warning will be shown informing the user that the provided base_url will be ignored in favour of the dynamically calculated one..

Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py

@model_validator(mode="after")  # type: ignore
def only_one_of_model_id_endpoint_name_or_base_url_provided(
    self,
) -> "InferenceEndpointsLLM":
    """Validates that only one of `model_id` or `endpoint_name` is provided; and if `base_url` is also
    provided, a warning will be shown informing the user that the provided `base_url` will be ignored in
    favour of the dynamically calculated one.."""

    if self.base_url and (self.model_id or self.endpoint_name):
        self._logger.warning(  # type: ignore
            f"Since the `base_url={self.base_url}` is available and either one of `model_id`"
            " or `endpoint_name` is also provided, the `base_url` will either be ignored"
            " or overwritten with the one generated from either of those args, for serverless"
            " or dedicated inference endpoints, respectively."
        )

    if self.use_magpie_template and self.tokenizer_id is None:
        raise ValueError(
            "`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,"
            " set a `tokenizer_id` and try again."
        )

    if (
        self.model_id
        and self.tokenizer_id is None
        and self.structured_output is not None
    ):
        self.tokenizer_id = self.model_id

    if self.base_url and not (self.model_id or self.endpoint_name):
        return self

    if self.model_id and not self.endpoint_name:
        return self

    if self.endpoint_name and not self.model_id:
        return self

    raise ValidationError(
        f"Only one of `model_id` or `endpoint_name` must be provided. If `base_url` is"
        f" provided too, it will be overwritten instead. Found `model_id`={self.model_id},"
        f" `endpoint_name`={self.endpoint_name}, and `base_url`={self.base_url}."
    )

`prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py

def prepare_input(self, input: "StandardInput") -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    prompt: str = (
        self._tokenizer.apply_chat_template(  # type: ignore
            conversation=input,  # type: ignore
            tokenize=False,
            add_generation_prompt=True,
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`_get_structured_output(input)` ¶

Gets the structured output (if any) for the given input.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required

Returns:

Type	Description
`StandardInput`	The input and the structured output that will be passed as `grammar` to the
`Union[Dict[str, Any], None]`	inference endpoint or `None` if not required.

Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py

def _get_structured_output(
    self, input: FormattedInput
) -> Tuple["StandardInput", Union[Dict[str, Any], None]]:
    """Gets the structured output (if any) for the given input.

    Args:
        input: a single input in chat format to generate responses for.

    Returns:
        The input and the structured output that will be passed as `grammar` to the
        inference endpoint or `None` if not required.
    """
    structured_output = None

    # Specific structured output per input
    if isinstance(input, tuple):
        input, structured_output = input
        structured_output = {
            "type": structured_output["format"],  # type: ignore
            "value": structured_output["schema"],  # type: ignore
        }

    # Same structured output for all the inputs
    if structured_output is None and self.structured_output is not None:
        try:
            structured_output = {
                "type": self.structured_output["format"],  # type: ignore
                "value": self.structured_output["schema"],  # type: ignore
            }
        except KeyError as e:
            raise ValueError(
                "To use the structured output you have to inform the `format` and `schema` in "
                "the `structured_output` attribute."
            ) from e

    if structured_output:
        if isinstance(structured_output["value"], ModelMetaclass):
            structured_output["value"] = structured_output[
                "value"
            ].model_json_schema()

    return input, structured_output

`_check_stop_sequences(stop_sequences=None)` ¶

Checks that no more than 4 stop sequences are provided.

Parameters:

Name	Type	Description	Default
`stop_sequences`	`Optional[Union[str, List[str]]]`	the stop sequences to be checked.	`None`

Returns:

Type	Description
`Union[List[str], None]`	The stop sequences.

Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py

def _check_stop_sequences(
    self,
    stop_sequences: Optional[Union[str, List[str]]] = None,
) -> Union[List[str], None]:
    """Checks that no more than 4 stop sequences are provided.

    Args:
        stop_sequences: the stop sequences to be checked.

    Returns:
        The stop sequences.
    """
    if stop_sequences is not None:
        if isinstance(stop_sequences, str):
            stop_sequences = [stop_sequences]
        if len(stop_sequences) > 4:
            warnings.warn(
                "Only up to 4 stop sequences are allowed, so keeping the first 4 items only.",
                UserWarning,
                stacklevel=2,
            )
            stop_sequences = stop_sequences[:4]
    return stop_sequences

`agenerate(input, max_new_tokens=128, frequency_penalty=None, logit_bias=None, logprobs=False, presence_penalty=None, seed=None, stop_sequences=None, temperature=1.0, tool_choice=None, tool_prompt=None, tools=None, top_logprobs=None, top_n_tokens=None, top_p=None, do_sample=False, repetition_penalty=None, return_full_text=False, top_k=None, typical_p=None, watermark=False, num_generations=1)` `async` ¶

Generates completions for the given input using the async client. This method uses two methods of the huggingface_hub.AsyncClient: chat_completion and text_generation. chat_completion method will be used only if no tokenizer_id has been specified. Some arguments of this function are specific to the text_generation method, while some others are specific to the chat_completion method.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`frequency_penalty`	`Optional[Annotated[float, Field(ge=-2.0, le=2.0)]]`	a value between `-2.0` and `2.0`. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing model's likelihood to repeat the same line verbatim. Defauls to `None`.	`None`
`logit_bias`	`Optional[List[float]]`	modify the likelihood of specified tokens appearing in the completion. This argument is exclusive to the `chat_completion` method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.	`None`
`logprobs`	`bool`	whether to return the log probabilities or not. This argument is exclusive to the `chat_completion` method and will be used only if `tokenizer_id` is `None`. Defaults to `False`.	`False`
`presence_penalty`	`Optional[Annotated[float, Field(ge=-2.0, le=2.0)]]`	a value between `-2.0` and `2.0`. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model likelihood to talk about new topics. This argument is exclusive to the `chat_completion` method and will be used only if `tokenizer_id` is `None`. Defauls to `None`.	`None`
`seed`	`Optional[int]`	the seed to use for the generation. Defaults to `None`.	`None`
`stop_sequences`	`Optional[List[str]]`	either a single string or a list of strings containing the sequences to stop the generation at. Defaults to `None`, but will be set to the `tokenizer.eos_token` if available.	`None`
`temperature`	`float`	the temperature to use for the generation. Defaults to `1.0`.	`1.0`
`tool_choice`	`Optional[Union[Dict[str, str], Literal['auto']]]`	the name of the tool the model should call. It can be a dictionary like `{"function_name": "my_tool"}` or "auto". If not provided, then the model won't use any tool. This argument is exclusive to the `chat_completion` method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.	`None`
`tool_prompt`	`Optional[str]`	A prompt to be appended before the tools. This argument is exclusive to the `chat_completion` method and will be used only if `tokenizer_id` is `None`. Defauls to `None`.	`None`
`tools`	`Optional[List[Dict[str, Any]]]`	a list of tools definitions that the LLM can use. This argument is exclusive to the `chat_completion` method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.	`None`
`top_logprobs`	`Optional[PositiveInt]`	the number of top log probabilities to return per output token generated. This argument is exclusive to the `chat_completion` method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.	`None`
`top_n_tokens`	`Optional[PositiveInt]`	the number of top log probabilities to return per output token generated. This argument is exclusive of the `text_generation` method and will be only used if `tokenizer_id` is not `None`. Defaults to `None`.	`None`
`top_p`	`Optional[float]`	the top-p value to use for the generation. Defaults to `1.0`.	`None`
`do_sample`	`bool`	whether to use sampling for the generation. This argument is exclusive of the `text_generation` method and will be only used if `tokenizer_id` is not `None`. Defaults to `False`.	`False`
`repetition_penalty`	`Optional[float]`	the repetition penalty to use for the generation. This argument is exclusive of the `text_generation` method and will be only used if `tokenizer_id` is not `None`. Defaults to `None`.	`None`
`return_full_text`	`bool`	whether to return the full text of the completion or just the generated text. Defaults to `False`, meaning that only the generated text will be returned. This argument is exclusive of the `text_generation` method and will be only used if `tokenizer_id` is not `None`.	`False`
`top_k`	`Optional[int]`	the top-k value to use for the generation. This argument is exclusive of the `text_generation` method and will be only used if `tokenizer_id` is not `None`. Defaults to `0.8`, since neither `0.0` nor `1.0` are valid values in TGI.	`None`
`typical_p`	`Optional[float]`	the typical-p value to use for the generation. This argument is exclusive of the `text_generation` method and will be only used if `tokenizer_id` is not `None`. Defaults to `None`.	`None`
`watermark`	`bool`	whether to add the watermark to the generated text. This argument is exclusive of the `text_generation` method and will be only used if `tokenizer_id` is not `None`. Defaults to `None`.	`False`
`num_generations`	`int`	the number of generations to generate. Defaults to `1`. It's here to ensure the validation succeds.	`1`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: FormattedInput,
    max_new_tokens: int = 128,
    frequency_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,
    logit_bias: Optional[List[float]] = None,
    logprobs: bool = False,
    presence_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,
    seed: Optional[int] = None,
    stop_sequences: Optional[List[str]] = None,
    temperature: float = 1.0,
    tool_choice: Optional[Union[Dict[str, str], Literal["auto"]]] = None,
    tool_prompt: Optional[str] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
    top_logprobs: Optional[PositiveInt] = None,
    top_n_tokens: Optional[PositiveInt] = None,
    top_p: Optional[float] = None,
    do_sample: bool = False,
    repetition_penalty: Optional[float] = None,
    return_full_text: bool = False,
    top_k: Optional[int] = None,
    typical_p: Optional[float] = None,
    watermark: bool = False,
    num_generations: int = 1,
) -> GenerateOutput:
    """Generates completions for the given input using the async client. This method
    uses two methods of the `huggingface_hub.AsyncClient`: `chat_completion` and `text_generation`.
    `chat_completion` method will be used only if no `tokenizer_id` has been specified.
    Some arguments of this function are specific to the `text_generation` method, while
    some others are specific to the `chat_completion` method.

    Args:
        input: a single input in chat format to generate responses for.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        frequency_penalty: a value between `-2.0` and `2.0`. Positive values penalize
            new tokens based on their existing frequency in the text so far, decreasing
            model's likelihood to repeat the same line verbatim. Defauls to `None`.
        logit_bias: modify the likelihood of specified tokens appearing in the completion.
            This argument is exclusive to the `chat_completion` method and will be used
            only if `tokenizer_id` is `None`.
            Defaults to `None`.
        logprobs: whether to return the log probabilities or not. This argument is exclusive
            to the `chat_completion` method and will be used only if `tokenizer_id`
            is `None`. Defaults to `False`.
        presence_penalty: a value between `-2.0` and `2.0`. Positive values penalize
            new tokens based on whether they appear in the text so far, increasing the
            model likelihood to talk about new topics. This argument is exclusive to
            the `chat_completion` method and will be used only if `tokenizer_id` is
            `None`. Defauls to `None`.
        seed: the seed to use for the generation. Defaults to `None`.
        stop_sequences: either a single string or a list of strings containing the sequences
            to stop the generation at. Defaults to `None`, but will be set to the
            `tokenizer.eos_token` if available.
        temperature: the temperature to use for the generation. Defaults to `1.0`.
        tool_choice: the name of the tool the model should call. It can be a dictionary
            like `{"function_name": "my_tool"}` or "auto". If not provided, then the
            model won't use any tool. This argument is exclusive to the `chat_completion`
            method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.
        tool_prompt: A prompt to be appended before the tools. This argument is exclusive
            to the `chat_completion` method and will be used only if `tokenizer_id`
            is `None`. Defauls to `None`.
        tools: a list of tools definitions that the LLM can use.
            This argument is exclusive to the `chat_completion` method and will be used
            only if `tokenizer_id` is `None`. Defaults to `None`.
        top_logprobs: the number of top log probabilities to return per output token
            generated. This argument is exclusive to the `chat_completion` method and
            will be used only if `tokenizer_id` is `None`. Defaults to `None`.
        top_n_tokens: the number of top log probabilities to return per output token
            generated. This argument is exclusive of the `text_generation` method and
            will be only used if `tokenizer_id` is not `None`. Defaults to `None`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        do_sample: whether to use sampling for the generation. This argument is exclusive
            of the `text_generation` method and will be only used if `tokenizer_id` is not
            `None`. Defaults to `False`.
        repetition_penalty: the repetition penalty to use for the generation. This argument
            is exclusive of the `text_generation` method and will be only used if `tokenizer_id`
            is not `None`. Defaults to `None`.
        return_full_text: whether to return the full text of the completion or just
            the generated text. Defaults to `False`, meaning that only the generated
            text will be returned. This argument is exclusive of the `text_generation`
            method and will be only used if `tokenizer_id` is not `None`.
        top_k: the top-k value to use for the generation. This argument is exclusive
            of the `text_generation` method and will be only used if `tokenizer_id`
            is not `None`. Defaults to `0.8`, since neither `0.0` nor `1.0` are valid
            values in TGI.
        typical_p: the typical-p value to use for the generation. This argument is exclusive
            of the `text_generation` method and will be only used if `tokenizer_id`
            is not `None`. Defaults to `None`.
        watermark: whether to add the watermark to the generated text. This argument
            is exclusive of the `text_generation` method and will be only used if `tokenizer_id`
            is not `None`. Defaults to `None`.
        num_generations: the number of generations to generate. Defaults to `1`. It's here to ensure
            the validation succeds.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """
    stop_sequences = self._check_stop_sequences(stop_sequences)

    if isinstance(input, str) or self.tokenizer_id is not None:
        structured_output = None
        if not isinstance(input, str):
            input, structured_output = self._get_structured_output(input)
            input = self.prepare_input(input)

        return await self._generate_with_text_generation(
            input=input,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            typical_p=typical_p,
            repetition_penalty=repetition_penalty,
            frequency_penalty=frequency_penalty,
            temperature=temperature,
            top_n_tokens=top_n_tokens,
            top_p=top_p,
            top_k=top_k,
            stop_sequences=stop_sequences,
            return_full_text=return_full_text,
            seed=seed,
            watermark=watermark,
            structured_output=structured_output,
        )

    return await self._generate_with_chat_completion(
        input=input,  # type: ignore
        max_new_tokens=max_new_tokens,
        frequency_penalty=frequency_penalty,
        logit_bias=logit_bias,
        logprobs=logprobs,
        presence_penalty=presence_penalty,
        seed=seed,
        stop_sequences=stop_sequences,
        temperature=temperature,
        tool_choice=tool_choice,
        tool_prompt=tool_prompt,
        tools=tools,
        top_logprobs=top_logprobs,
        top_p=top_p,
    )

`TransformersLLM` ¶

Bases: LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin

Hugging Face transformers library LLM implementation using the text generation pipeline.

Attributes:

Name	Type	Description
`model`	`str`	the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
`revision`	`str`	if `model` refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to `"main"`.
`torch_dtype`	`str`	the torch dtype to use for the model e.g. "float16", "float32", etc. Defaults to `"auto"`.
`trust_remote_code`	`bool`	whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to `False`.
`model_kwargs`	`Optional[Dict[str, Any]]`	additional dictionary of keyword arguments that will be passed to the `from_pretrained` method of the model.
`tokenizer`	`Optional[str]`	the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the `model` will be used. Defaults to `None`.
`use_fast`	`bool`	whether to use a fast tokenizer or not. Defaults to `True`.
`chat_template`	`Optional[str]`	a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to `None`.
`device`	`Optional[Union[str, int]]`	the name or index of the device where the model will be loaded. Defaults to `None`.
`device_map`	`Optional[Union[str, Dict[str, Any]]]`	a dictionary mapping each layer of the model to a device, or a mode like `"sequential"` or `"auto"`. Defaults to `None`.
`token`	`Optional[SecretStr]`	the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the `HF_TOKEN` environment or `huggingface_hub` package local configuration will be used. Defaults to `None`.
`structured_output`	`Optional[RuntimeParameter[OutlinesStructuredOutputType]]`	a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
`use_magpie_template`	`bool`	a flag used to enable/disable applying the Magpie pre-query template. Defaults to `False`.
`magpie_pre_query_template`	`Union[MagpieAvailablePreQueryTemplates, str, None]`	the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are "llama3", "qwen2" or another pre-query template provided. Defaults to `None`.

Icon

:hugging:

Examples:

Generate text:

from distilabel.models.llms import TransformersLLM

llm = TransformersLLM(model="microsoft/Phi-3-mini-4k-instruct")

llm.load()

# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Source code in src/distilabel/models/llms/huggingface/transformers.py

class TransformersLLM(LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin):
    """Hugging Face `transformers` library LLM implementation using the text generation
    pipeline.

    Attributes:
        model: the model Hugging Face Hub repo id or a path to a directory containing the
            model weights and configuration files.
        revision: if `model` refers to a Hugging Face Hub repository, then the revision
            (e.g. a branch name or a commit id) to use. Defaults to `"main"`.
        torch_dtype: the torch dtype to use for the model e.g. "float16", "float32", etc.
            Defaults to `"auto"`.
        trust_remote_code: whether to allow fetching and executing remote code fetched
            from the repository in the Hub. Defaults to `False`.
        model_kwargs: additional dictionary of keyword arguments that will be passed to
            the `from_pretrained` method of the model.
        tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing
            the tokenizer config files. If not provided, the one associated to the `model`
            will be used. Defaults to `None`.
        use_fast: whether to use a fast tokenizer or not. Defaults to `True`.
        chat_template: a chat template that will be used to build the prompts before
            sending them to the model. If not provided, the chat template defined in the
            tokenizer config will be used. If not provided and the tokenizer doesn't have
            a chat template, then ChatML template will be used. Defaults to `None`.
        device: the name or index of the device where the model will be loaded. Defaults
            to `None`.
        device_map: a dictionary mapping each layer of the model to a device, or a mode
            like `"sequential"` or `"auto"`. Defaults to `None`.
        token: the Hugging Face Hub token that will be used to authenticate to the Hugging
            Face Hub. If not provided, the `HF_TOKEN` environment or `huggingface_hub` package
            local configuration will be used. Defaults to `None`.
        structured_output: a dictionary containing the structured output configuration or if more
            fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
        use_magpie_template: a flag used to enable/disable applying the Magpie pre-query
            template. Defaults to `False`.
        magpie_pre_query_template: the pre-query template to be applied to the prompt or
            sent to the LLM to generate an instruction or a follow up user message. Valid
            values are "llama3", "qwen2" or another pre-query template provided. Defaults
            to `None`.

    Icon:
        `:hugging:`

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import TransformersLLM

        llm = TransformersLLM(model="microsoft/Phi-3-mini-4k-instruct")

        llm.load()

        # Call the model
        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```
    """

    model: str
    revision: str = "main"
    torch_dtype: str = "auto"
    trust_remote_code: bool = False
    model_kwargs: Optional[Dict[str, Any]] = None
    tokenizer: Optional[str] = None
    use_fast: bool = True
    chat_template: Optional[str] = None
    device: Optional[Union[str, int]] = None
    device_map: Optional[Union[str, Dict[str, Any]]] = None
    token: Optional[SecretStr] = Field(
        default_factory=lambda: os.getenv(HF_TOKEN_ENV_VAR)
    )
    structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(
        default=None,
        description="The structured output format to use across all the generations.",
    )

    _pipeline: Optional["Pipeline"] = PrivateAttr(...)
    _prefix_allowed_tokens_fn: Union[Callable, None] = PrivateAttr(default=None)
    _logits_processor: Union[Callable, None] = PrivateAttr(default=None)

    def load(self) -> None:
        """Loads the model and tokenizer and creates the text generation pipeline. In addition,
        it will configure the tokenizer chat template."""
        if self.device == "cuda":
            CudaDevicePlacementMixin.load(self)

        try:
            from transformers import pipeline
        except ImportError as ie:
            raise ImportError(
                "Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`."
            ) from ie

        token = self.token.get_secret_value() if self.token is not None else self.token

        self._pipeline = pipeline(
            "text-generation",
            model=self.model,
            revision=self.revision,
            torch_dtype=self.torch_dtype,
            trust_remote_code=self.trust_remote_code,
            model_kwargs=self.model_kwargs or {},
            tokenizer=self.tokenizer or self.model,
            use_fast=self.use_fast,
            device=self.device,
            device_map=self.device_map,
            token=token,
            return_full_text=False,
        )

        if self.chat_template is not None:
            self._pipeline.tokenizer.chat_template = self.chat_template  # type: ignore

        if self._pipeline.tokenizer.pad_token is None:  # type: ignore
            self._pipeline.tokenizer.pad_token = self._pipeline.tokenizer.eos_token  # type: ignore

        if self.structured_output:
            processor = self._prepare_structured_output(self.structured_output)
            if _is_outlines_version_below_0_1_0():
                self._prefix_allowed_tokens_fn = processor
            else:
                self._logits_processor = [processor]

        super().load()

    def unload(self) -> None:
        """Unloads the `vLLM` model."""
        CudaDevicePlacementMixin.unload(self)
        super().unload()

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    def prepare_input(self, input: "StandardInput") -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        if self._pipeline.tokenizer.chat_template is None:  # type: ignore
            return input[0]["content"]

        prompt: str = (
            self._pipeline.tokenizer.apply_chat_template(  # type: ignore
                input,  # type: ignore
                tokenize=False,
                add_generation_prompt=True,
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    @validate_call
    def generate(  # type: ignore
        self,
        inputs: List[StandardInput],
        num_generations: int = 1,
        max_new_tokens: int = 128,
        temperature: float = 0.1,
        repetition_penalty: float = 1.1,
        top_p: float = 1.0,
        top_k: int = 0,
        do_sample: bool = True,
    ) -> List[GenerateOutput]:
        """Generates `num_generations` responses for each input using the text generation
        pipeline.

        Args:
            inputs: a list of inputs in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            repetition_penalty: the repetition penalty to use for the generation. Defaults
                to `1.1`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            top_k: the top-k value to use for the generation. Defaults to `0`.
            do_sample: whether to use sampling or not. Defaults to `True`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """
        prepared_inputs = [self.prepare_input(input=input) for input in inputs]

        outputs: List[List[Dict[str, str]]] = self._pipeline(  # type: ignore
            prepared_inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            repetition_penalty=repetition_penalty,
            top_p=top_p,
            top_k=top_k,
            do_sample=do_sample,
            num_return_sequences=num_generations,
            prefix_allowed_tokens_fn=self._prefix_allowed_tokens_fn,
            pad_token_id=self._pipeline.tokenizer.eos_token_id,
            logits_processor=self._logits_processor,
        )
        llm_output = [
            [generation["generated_text"] for generation in output]
            for output in outputs
        ]

        result = []
        for input, output in zip(inputs, llm_output):
            result.append(
                prepare_output(
                    output,
                    input_tokens=[
                        compute_tokens(input, self._pipeline.tokenizer.encode)
                    ],
                    output_tokens=[
                        compute_tokens(row, self._pipeline.tokenizer.encode)
                        for row in output
                    ],
                )
            )

        return result

    def get_last_hidden_states(
        self, inputs: List["StandardInput"]
    ) -> List["HiddenState"]:
        """Gets the last `hidden_states` of the model for the given inputs. It doesn't
        execute the task head.

        Args:
            inputs: a list of inputs in chat format to generate the embeddings for.

        Returns:
            A list containing the last hidden state for each sequence using a NumPy array
            with shape [num_tokens, hidden_size].
        """
        model: "PreTrainedModel" = (
            self._pipeline.model.model  # type: ignore
            if hasattr(self._pipeline.model, "model")  # type: ignore
            else next(self._pipeline.model.children())  # type: ignore
        )
        tokenizer: "PreTrainedTokenizer" = self._pipeline.tokenizer  # type: ignore
        input_ids = tokenizer(
            [self.prepare_input(input) for input in inputs],  # type: ignore
            return_tensors="pt",
            padding=True,
        ).to(model.device)
        last_hidden_states = model(**input_ids)["last_hidden_state"]

        return [
            seq_last_hidden_state[attention_mask.bool(), :].detach().cpu().numpy()
            for seq_last_hidden_state, attention_mask in zip(
                last_hidden_states,
                input_ids["attention_mask"],  # type: ignore
            )
        ]

    def _prepare_structured_output(
        self, structured_output: Optional[OutlinesStructuredOutputType] = None
    ) -> Union[Callable, List[Callable]]:
        """Creates the appropriate function to filter tokens to generate structured outputs.

        Args:
            structured_output: the configuration dict to prepare the structured output.

        Returns:
            The callable that will be used to guide the generation of the model.
        """
        from distilabel.steps.tasks.structured_outputs.outlines import (
            prepare_guided_output,
        )

        result = prepare_guided_output(
            structured_output, "transformers", self._pipeline
        )
        if schema := result.get("schema"):
            self.structured_output["schema"] = schema
        return result["processor"]

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the model and tokenizer and creates the text generation pipeline. In addition, it will configure the tokenizer chat template.

Source code in src/distilabel/models/llms/huggingface/transformers.py

def load(self) -> None:
    """Loads the model and tokenizer and creates the text generation pipeline. In addition,
    it will configure the tokenizer chat template."""
    if self.device == "cuda":
        CudaDevicePlacementMixin.load(self)

    try:
        from transformers import pipeline
    except ImportError as ie:
        raise ImportError(
            "Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`."
        ) from ie

    token = self.token.get_secret_value() if self.token is not None else self.token

    self._pipeline = pipeline(
        "text-generation",
        model=self.model,
        revision=self.revision,
        torch_dtype=self.torch_dtype,
        trust_remote_code=self.trust_remote_code,
        model_kwargs=self.model_kwargs or {},
        tokenizer=self.tokenizer or self.model,
        use_fast=self.use_fast,
        device=self.device,
        device_map=self.device_map,
        token=token,
        return_full_text=False,
    )

    if self.chat_template is not None:
        self._pipeline.tokenizer.chat_template = self.chat_template  # type: ignore

    if self._pipeline.tokenizer.pad_token is None:  # type: ignore
        self._pipeline.tokenizer.pad_token = self._pipeline.tokenizer.eos_token  # type: ignore

    if self.structured_output:
        processor = self._prepare_structured_output(self.structured_output)
        if _is_outlines_version_below_0_1_0():
            self._prefix_allowed_tokens_fn = processor
        else:
            self._logits_processor = [processor]

    super().load()

`unload()` ¶

Unloads the vLLM model.

Source code in src/distilabel/models/llms/huggingface/transformers.py

def unload(self) -> None:
    """Unloads the `vLLM` model."""
    CudaDevicePlacementMixin.unload(self)
    super().unload()

`prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/huggingface/transformers.py

def prepare_input(self, input: "StandardInput") -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    if self._pipeline.tokenizer.chat_template is None:  # type: ignore
        return input[0]["content"]

    prompt: str = (
        self._pipeline.tokenizer.apply_chat_template(  # type: ignore
            input,  # type: ignore
            tokenize=False,
            add_generation_prompt=True,
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`generate(inputs, num_generations=1, max_new_tokens=128, temperature=0.1, repetition_penalty=1.1, top_p=1.0, top_k=0, do_sample=True)` ¶

Generates num_generations responses for each input using the text generation pipeline.

Parameters:

Name	Type	Description	Default
`inputs`	`List[StandardInput]`	a list of inputs in chat format to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`0.1`
`repetition_penalty`	`float`	the repetition penalty to use for the generation. Defaults to `1.1`.	`1.1`
`top_p`	`float`	the top-p value to use for the generation. Defaults to `1.0`.	`1.0`
`top_k`	`int`	the top-k value to use for the generation. Defaults to `0`.	`0`
`do_sample`	`bool`	whether to use sampling or not. Defaults to `True`.	`True`

Returns:

Type	Description
`List[GenerateOutput]`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/huggingface/transformers.py

@validate_call
def generate(  # type: ignore
    self,
    inputs: List[StandardInput],
    num_generations: int = 1,
    max_new_tokens: int = 128,
    temperature: float = 0.1,
    repetition_penalty: float = 1.1,
    top_p: float = 1.0,
    top_k: int = 0,
    do_sample: bool = True,
) -> List[GenerateOutput]:
    """Generates `num_generations` responses for each input using the text generation
    pipeline.

    Args:
        inputs: a list of inputs in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        repetition_penalty: the repetition penalty to use for the generation. Defaults
            to `1.1`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        top_k: the top-k value to use for the generation. Defaults to `0`.
        do_sample: whether to use sampling or not. Defaults to `True`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """
    prepared_inputs = [self.prepare_input(input=input) for input in inputs]

    outputs: List[List[Dict[str, str]]] = self._pipeline(  # type: ignore
        prepared_inputs,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        top_p=top_p,
        top_k=top_k,
        do_sample=do_sample,
        num_return_sequences=num_generations,
        prefix_allowed_tokens_fn=self._prefix_allowed_tokens_fn,
        pad_token_id=self._pipeline.tokenizer.eos_token_id,
        logits_processor=self._logits_processor,
    )
    llm_output = [
        [generation["generated_text"] for generation in output]
        for output in outputs
    ]

    result = []
    for input, output in zip(inputs, llm_output):
        result.append(
            prepare_output(
                output,
                input_tokens=[
                    compute_tokens(input, self._pipeline.tokenizer.encode)
                ],
                output_tokens=[
                    compute_tokens(row, self._pipeline.tokenizer.encode)
                    for row in output
                ],
            )
        )

    return result

`get_last_hidden_states(inputs)` ¶

Gets the last hidden_states of the model for the given inputs. It doesn't execute the task head.

Parameters:

Name	Type	Description	Default
`inputs`	`List[StandardInput]`	a list of inputs in chat format to generate the embeddings for.	required

Returns:

Type	Description
`List[HiddenState]`	A list containing the last hidden state for each sequence using a NumPy array
`List[HiddenState]`	with shape [num_tokens, hidden_size].

Source code in src/distilabel/models/llms/huggingface/transformers.py

def get_last_hidden_states(
    self, inputs: List["StandardInput"]
) -> List["HiddenState"]:
    """Gets the last `hidden_states` of the model for the given inputs. It doesn't
    execute the task head.

    Args:
        inputs: a list of inputs in chat format to generate the embeddings for.

    Returns:
        A list containing the last hidden state for each sequence using a NumPy array
        with shape [num_tokens, hidden_size].
    """
    model: "PreTrainedModel" = (
        self._pipeline.model.model  # type: ignore
        if hasattr(self._pipeline.model, "model")  # type: ignore
        else next(self._pipeline.model.children())  # type: ignore
    )
    tokenizer: "PreTrainedTokenizer" = self._pipeline.tokenizer  # type: ignore
    input_ids = tokenizer(
        [self.prepare_input(input) for input in inputs],  # type: ignore
        return_tensors="pt",
        padding=True,
    ).to(model.device)
    last_hidden_states = model(**input_ids)["last_hidden_state"]

    return [
        seq_last_hidden_state[attention_mask.bool(), :].detach().cpu().numpy()
        for seq_last_hidden_state, attention_mask in zip(
            last_hidden_states,
            input_ids["attention_mask"],  # type: ignore
        )
    ]

`_prepare_structured_output(structured_output=None)` ¶

Creates the appropriate function to filter tokens to generate structured outputs.

Parameters:

Name	Type	Description	Default
`structured_output`	`Optional[OutlinesStructuredOutputType]`	the configuration dict to prepare the structured output.	`None`

Returns:

Type	Description
`Union[Callable, List[Callable]]`	The callable that will be used to guide the generation of the model.

Source code in src/distilabel/models/llms/huggingface/transformers.py

def _prepare_structured_output(
    self, structured_output: Optional[OutlinesStructuredOutputType] = None
) -> Union[Callable, List[Callable]]:
    """Creates the appropriate function to filter tokens to generate structured outputs.

    Args:
        structured_output: the configuration dict to prepare the structured output.

    Returns:
        The callable that will be used to guide the generation of the model.
    """
    from distilabel.steps.tasks.structured_outputs.outlines import (
        prepare_guided_output,
    )

    result = prepare_guided_output(
        structured_output, "transformers", self._pipeline
    )
    if schema := result.get("schema"):
        self.structured_output["schema"] = schema
    return result["processor"]

`LiteLLM` ¶

Bases: AsyncLLM

LiteLLM implementation running the async API client.

Attributes:

Name	Type	Description
`model`	`str`	the model name to use for the LLM e.g. "gpt-3.5-turbo" or "mistral/mistral-large", etc.
`verbose`	`RuntimeParameter[bool]`	whether to log the LiteLLM client's logs. Defaults to `False`.
`structured_output`	`Optional[RuntimeParameter[InstructorStructuredOutputType]]`	a dictionary containing the structured output configuration configuration using `instructor`. You can take a look at the dictionary structure in `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.

Runtime parameters

verbose: whether to log the LiteLLM client's logs. Defaults to False.

Examples:

Generate text:

from distilabel.models.llms import LiteLLM

llm = LiteLLM(model="gpt-3.5-turbo")

llm.load()

# Call the model
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

```python
from pydantic import BaseModel
from distilabel.models.llms import LiteLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = LiteLLM(
    model="gpt-3.5-turbo",
    api_key="api.key",
    structured_output={"schema": User}
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/litellm.py

class LiteLLM(AsyncLLM):
    """LiteLLM implementation running the async API client.

    Attributes:
        model: the model name to use for the LLM e.g. "gpt-3.5-turbo" or "mistral/mistral-large",
            etc.
        verbose: whether to log the LiteLLM client's logs. Defaults to `False`.
        structured_output: a dictionary containing the structured output configuration configuration
            using `instructor`. You can take a look at the dictionary structure in
            `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.

    Runtime parameters:
        - `verbose`: whether to log the LiteLLM client's logs. Defaults to `False`.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import LiteLLM

        llm = LiteLLM(model="gpt-3.5-turbo")

        llm.load()

        # Call the model
        output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

        Generate structured data:

        ```python
        from pydantic import BaseModel
        from distilabel.models.llms import LiteLLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = LiteLLM(
            model="gpt-3.5-turbo",
            api_key="api.key",
            structured_output={"schema": User}
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    model: str
    verbose: RuntimeParameter[bool] = Field(
        default=False, description="Whether to log the LiteLLM client's logs."
    )
    structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (
        Field(
            default=None,
            description="The structured output format to use across all the generations.",
        )
    )

    _aclient: Optional[Callable] = PrivateAttr(...)

    def load(self) -> None:
        """
        Loads the `acompletion` LiteLLM client to benefit from async requests.
        """
        super().load()

        try:
            import litellm

            litellm.telemetry = False
        except ImportError as e:
            raise ImportError(
                "LiteLLM Python client is not installed. Please install it using"
                " `pip install 'distilabel[litellm]'`."
            ) from e
        self._aclient = litellm.acompletion

        if not self.verbose:
            litellm.suppress_debug_info = True
            for key in logging.Logger.manager.loggerDict.keys():
                if "litellm" not in key.lower():
                    continue
                logging.getLogger(key).setLevel(logging.CRITICAL)

        if self.structured_output:
            result = self._prepare_structured_output(
                structured_output=self.structured_output,
                client=self._aclient,
                framework="litellm",
            )
            self._aclient = result.get("client").messages.create
            if structured_output := result.get("structured_output"):
                self.structured_output = structured_output

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    @validate_call
    async def agenerate(  # type: ignore # noqa: C901
        self,
        input: FormattedInput,
        num_generations: int = 1,
        functions: Optional[List] = None,
        function_call: Optional[str] = None,
        temperature: Optional[float] = 1.0,
        top_p: Optional[float] = 1.0,
        stop: Optional[Union[str, list]] = None,
        max_tokens: Optional[int] = None,
        presence_penalty: Optional[float] = None,
        frequency_penalty: Optional[float] = None,
        logit_bias: Optional[dict] = None,
        user: Optional[str] = None,
        metadata: Optional[dict] = None,
        api_base: Optional[str] = None,
        api_version: Optional[str] = None,
        api_key: Optional[str] = None,
        model_list: Optional[list] = None,
        mock_response: Optional[str] = None,
        force_timeout: Optional[int] = 600,
        custom_llm_provider: Optional[str] = None,
    ) -> GenerateOutput:
        """Generates `num_generations` responses for the given input using the [LiteLLM async client](https://github.com/BerriAI/litellm).

        Args:
            input: a single input in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            functions: a list of functions to apply to the conversation messages. Defaults to
                `None`.
            function_call: the name of the function to call within the conversation. Defaults
                to `None`.
            temperature: the temperature to use for the generation. Defaults to `1.0`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            stop: Up to 4 sequences where the LLM API will stop generating further tokens.
                Defaults to `None`.
            max_tokens: The maximum number of tokens in the generated completion. Defaults to
                `None`.
            presence_penalty: It is used to penalize new tokens based on their existence in the
                text so far. Defaults to `None`.
            frequency_penalty: It is used to penalize new tokens based on their frequency in the
                text so far. Defaults to `None`.
            logit_bias: Used to modify the probability of specific tokens appearing in the
                completion. Defaults to `None`.
            user: A unique identifier representing your end-user. This can help the LLM provider
                to monitor and detect abuse. Defaults to `None`.
            metadata: Pass in additional metadata to tag your completion calls - eg. prompt
                version, details, etc. Defaults to `None`.
            api_base: Base URL for the API. Defaults to `None`.
            api_version: API version. Defaults to `None`.
            api_key: API key. Defaults to `None`.
            model_list: List of api base, version, keys. Defaults to `None`.
            mock_response: If provided, return a mock completion response for testing or debugging
                purposes. Defaults to `None`.
            force_timeout: The maximum execution time in seconds for the completion request.
                Defaults to `600`.
            custom_llm_provider: Used for Non-OpenAI LLMs, Example usage for bedrock, set(iterable)
                model="amazon.titan-tg1-large" and custom_llm_provider="bedrock". Defaults to
                `None`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """
        import litellm
        from litellm import token_counter

        structured_output = None
        if isinstance(input, tuple):
            input, structured_output = input
            result = self._prepare_structured_output(
                structured_output=structured_output,
                client=self._aclient,
                framework="litellm",
            )
            self._aclient = result.get("client").messages.create

        if structured_output is None and self.structured_output is not None:
            structured_output = self.structured_output

        kwargs = {
            "model": self.model,
            "messages": input,
            "n": num_generations,
            "functions": functions,
            "function_call": function_call,
            "temperature": temperature,
            "top_p": top_p,
            "stream": False,
            "stop": stop,
            "max_tokens": max_tokens,
            "presence_penalty": presence_penalty,
            "frequency_penalty": frequency_penalty,
            "logit_bias": logit_bias,
            "user": user,
            "metadata": metadata,
            "api_base": api_base,
            "api_version": api_version,
            "api_key": api_key,
            "model_list": model_list,
            "mock_response": mock_response,
            "force_timeout": force_timeout,
            "custom_llm_provider": custom_llm_provider,
        }
        if structured_output:
            kwargs = self._prepare_kwargs(kwargs, structured_output)

        async def _call_aclient_until_n_choices() -> List["Choices"]:
            choices = []
            while len(choices) < num_generations:
                completion: Union["ModelResponse", "BaseModel"] = await self._aclient(
                    **kwargs
                )  # type: ignore
                if self.structured_output:
                    # Prevent pydantic model from being cast to list during list extension
                    completion = [completion]
                else:
                    completion = completion.choices
                choices.extend(completion)
            return choices

        # litellm.drop_params is used to en/disable sending **kwargs parameters to the API if they cannot be used
        try:
            litellm.drop_params = False
            choices = await _call_aclient_until_n_choices()
        except litellm.exceptions.APIError as e:
            if "does not support parameters" in str(e):
                litellm.drop_params = True
                choices = await _call_aclient_until_n_choices()
            else:
                raise e

        generations = []
        input_tokens = [
            token_counter(model=self.model, messages=input)
        ] * num_generations
        output_tokens = []

        if self.structured_output:
            for choice in choices:
                generations.append(choice.model_dump_json())
                output_tokens.append(
                    token_counter(
                        model=self.model,
                        text=orjson.dumps(choice.model_dump_json()).decode("utf-8"),
                    )
                )
            return prepare_output(
                generations,
                input_tokens=input_tokens,
                output_tokens=output_tokens,
            )

        for choice in choices:
            if (content := choice.message.content) is None:
                self._logger.warning(  # type: ignore
                    f"Received no response using LiteLLM client (model: '{self.model}')."
                    f" Finish reason was: {choice.finish_reason}"
                )
            generations.append(content)
            output_tokens.append(token_counter(model=self.model, text=content))

        return prepare_output(
            generations, input_tokens=input_tokens, output_tokens=output_tokens
        )

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the acompletion LiteLLM client to benefit from async requests.

Source code in src/distilabel/models/llms/litellm.py

def load(self) -> None:
    """
    Loads the `acompletion` LiteLLM client to benefit from async requests.
    """
    super().load()

    try:
        import litellm

        litellm.telemetry = False
    except ImportError as e:
        raise ImportError(
            "LiteLLM Python client is not installed. Please install it using"
            " `pip install 'distilabel[litellm]'`."
        ) from e
    self._aclient = litellm.acompletion

    if not self.verbose:
        litellm.suppress_debug_info = True
        for key in logging.Logger.manager.loggerDict.keys():
            if "litellm" not in key.lower():
                continue
            logging.getLogger(key).setLevel(logging.CRITICAL)

    if self.structured_output:
        result = self._prepare_structured_output(
            structured_output=self.structured_output,
            client=self._aclient,
            framework="litellm",
        )
        self._aclient = result.get("client").messages.create
        if structured_output := result.get("structured_output"):
            self.structured_output = structured_output

`agenerate(input, num_generations=1, functions=None, function_call=None, temperature=1.0, top_p=1.0, stop=None, max_tokens=None, presence_penalty=None, frequency_penalty=None, logit_bias=None, user=None, metadata=None, api_base=None, api_version=None, api_key=None, model_list=None, mock_response=None, force_timeout=600, custom_llm_provider=None)` `async` ¶

Generates num_generations responses for the given input using the LiteLLM async client.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`functions`	`Optional[List]`	a list of functions to apply to the conversation messages. Defaults to `None`.	`None`
`function_call`	`Optional[str]`	the name of the function to call within the conversation. Defaults to `None`.	`None`
`temperature`	`Optional[float]`	the temperature to use for the generation. Defaults to `1.0`.	`1.0`
`top_p`	`Optional[float]`	the top-p value to use for the generation. Defaults to `1.0`.	`1.0`
`stop`	`Optional[Union[str, list]]`	Up to 4 sequences where the LLM API will stop generating further tokens. Defaults to `None`.	`None`
`max_tokens`	`Optional[int]`	The maximum number of tokens in the generated completion. Defaults to `None`.	`None`
`presence_penalty`	`Optional[float]`	It is used to penalize new tokens based on their existence in the text so far. Defaults to `None`.	`None`
`frequency_penalty`	`Optional[float]`	It is used to penalize new tokens based on their frequency in the text so far. Defaults to `None`.	`None`
`logit_bias`	`Optional[dict]`	Used to modify the probability of specific tokens appearing in the completion. Defaults to `None`.	`None`
`user`	`Optional[str]`	A unique identifier representing your end-user. This can help the LLM provider to monitor and detect abuse. Defaults to `None`.	`None`
`metadata`	`Optional[dict]`	Pass in additional metadata to tag your completion calls - eg. prompt version, details, etc. Defaults to `None`.	`None`
`api_base`	`Optional[str]`	Base URL for the API. Defaults to `None`.	`None`
`api_version`	`Optional[str]`	API version. Defaults to `None`.	`None`
`api_key`	`Optional[str]`	API key. Defaults to `None`.	`None`
`model_list`	`Optional[list]`	List of api base, version, keys. Defaults to `None`.	`None`
`mock_response`	`Optional[str]`	If provided, return a mock completion response for testing or debugging purposes. Defaults to `None`.	`None`
`force_timeout`	`Optional[int]`	The maximum execution time in seconds for the completion request. Defaults to `600`.	`600`
`custom_llm_provider`	`Optional[str]`	Used for Non-OpenAI LLMs, Example usage for bedrock, set(iterable) model="amazon.titan-tg1-large" and custom_llm_provider="bedrock". Defaults to `None`.	`None`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/litellm.py

@validate_call
async def agenerate(  # type: ignore # noqa: C901
    self,
    input: FormattedInput,
    num_generations: int = 1,
    functions: Optional[List] = None,
    function_call: Optional[str] = None,
    temperature: Optional[float] = 1.0,
    top_p: Optional[float] = 1.0,
    stop: Optional[Union[str, list]] = None,
    max_tokens: Optional[int] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    logit_bias: Optional[dict] = None,
    user: Optional[str] = None,
    metadata: Optional[dict] = None,
    api_base: Optional[str] = None,
    api_version: Optional[str] = None,
    api_key: Optional[str] = None,
    model_list: Optional[list] = None,
    mock_response: Optional[str] = None,
    force_timeout: Optional[int] = 600,
    custom_llm_provider: Optional[str] = None,
) -> GenerateOutput:
    """Generates `num_generations` responses for the given input using the [LiteLLM async client](https://github.com/BerriAI/litellm).

    Args:
        input: a single input in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        functions: a list of functions to apply to the conversation messages. Defaults to
            `None`.
        function_call: the name of the function to call within the conversation. Defaults
            to `None`.
        temperature: the temperature to use for the generation. Defaults to `1.0`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        stop: Up to 4 sequences where the LLM API will stop generating further tokens.
            Defaults to `None`.
        max_tokens: The maximum number of tokens in the generated completion. Defaults to
            `None`.
        presence_penalty: It is used to penalize new tokens based on their existence in the
            text so far. Defaults to `None`.
        frequency_penalty: It is used to penalize new tokens based on their frequency in the
            text so far. Defaults to `None`.
        logit_bias: Used to modify the probability of specific tokens appearing in the
            completion. Defaults to `None`.
        user: A unique identifier representing your end-user. This can help the LLM provider
            to monitor and detect abuse. Defaults to `None`.
        metadata: Pass in additional metadata to tag your completion calls - eg. prompt
            version, details, etc. Defaults to `None`.
        api_base: Base URL for the API. Defaults to `None`.
        api_version: API version. Defaults to `None`.
        api_key: API key. Defaults to `None`.
        model_list: List of api base, version, keys. Defaults to `None`.
        mock_response: If provided, return a mock completion response for testing or debugging
            purposes. Defaults to `None`.
        force_timeout: The maximum execution time in seconds for the completion request.
            Defaults to `600`.
        custom_llm_provider: Used for Non-OpenAI LLMs, Example usage for bedrock, set(iterable)
            model="amazon.titan-tg1-large" and custom_llm_provider="bedrock". Defaults to
            `None`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """
    import litellm
    from litellm import token_counter

    structured_output = None
    if isinstance(input, tuple):
        input, structured_output = input
        result = self._prepare_structured_output(
            structured_output=structured_output,
            client=self._aclient,
            framework="litellm",
        )
        self._aclient = result.get("client").messages.create

    if structured_output is None and self.structured_output is not None:
        structured_output = self.structured_output

    kwargs = {
        "model": self.model,
        "messages": input,
        "n": num_generations,
        "functions": functions,
        "function_call": function_call,
        "temperature": temperature,
        "top_p": top_p,
        "stream": False,
        "stop": stop,
        "max_tokens": max_tokens,
        "presence_penalty": presence_penalty,
        "frequency_penalty": frequency_penalty,
        "logit_bias": logit_bias,
        "user": user,
        "metadata": metadata,
        "api_base": api_base,
        "api_version": api_version,
        "api_key": api_key,
        "model_list": model_list,
        "mock_response": mock_response,
        "force_timeout": force_timeout,
        "custom_llm_provider": custom_llm_provider,
    }
    if structured_output:
        kwargs = self._prepare_kwargs(kwargs, structured_output)

    async def _call_aclient_until_n_choices() -> List["Choices"]:
        choices = []
        while len(choices) < num_generations:
            completion: Union["ModelResponse", "BaseModel"] = await self._aclient(
                **kwargs
            )  # type: ignore
            if self.structured_output:
                # Prevent pydantic model from being cast to list during list extension
                completion = [completion]
            else:
                completion = completion.choices
            choices.extend(completion)
        return choices

    # litellm.drop_params is used to en/disable sending **kwargs parameters to the API if they cannot be used
    try:
        litellm.drop_params = False
        choices = await _call_aclient_until_n_choices()
    except litellm.exceptions.APIError as e:
        if "does not support parameters" in str(e):
            litellm.drop_params = True
            choices = await _call_aclient_until_n_choices()
        else:
            raise e

    generations = []
    input_tokens = [
        token_counter(model=self.model, messages=input)
    ] * num_generations
    output_tokens = []

    if self.structured_output:
        for choice in choices:
            generations.append(choice.model_dump_json())
            output_tokens.append(
                token_counter(
                    model=self.model,
                    text=orjson.dumps(choice.model_dump_json()).decode("utf-8"),
                )
            )
        return prepare_output(
            generations,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
        )

    for choice in choices:
        if (content := choice.message.content) is None:
            self._logger.warning(  # type: ignore
                f"Received no response using LiteLLM client (model: '{self.model}')."
                f" Finish reason was: {choice.finish_reason}"
            )
        generations.append(content)
        output_tokens.append(token_counter(model=self.model, text=content))

    return prepare_output(
        generations, input_tokens=input_tokens, output_tokens=output_tokens
    )

`LlamaCppLLM` ¶

Bases: LLM, MagpieChatTemplateMixin

llama.cpp LLM implementation running the Python bindings for the C++ code.

Attributes:

Name	Type	Description
`model_path`	`RuntimeParameter[FilePath]`	contains the path to the GGUF quantized model, compatible with the installed version of the `llama.cpp` Python bindings.
`n_gpu_layers`	`RuntimeParameter[int]`	the number of layers to use for the GPU. Defaults to `-1`, meaning that the available GPU device will be used.
`chat_format`	`Optional[RuntimeParameter[str]]`	the chat format to use for the model. Defaults to `None`, which means the Llama format will be used.
`n_ctx`	`int`	the context size to use for the model. Defaults to `512`.
`n_batch`	`int`	the prompt processing maximum batch size to use for the model. Defaults to `512`.
`seed`	`int`	random seed to use for the generation. Defaults to `4294967295`.
`verbose`	`RuntimeParameter[bool]`	whether to print verbose output. Defaults to `False`.
`structured_output`	`Optional[RuntimeParameter[OutlinesStructuredOutputType]]`	a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
`extra_kwargs`	`Optional[RuntimeParameter[Dict[str, Any]]]`	additional dictionary of keyword arguments that will be passed to the `Llama` class of `llama_cpp` library. Defaults to `{}`.
`tokenizer_id`	`Optional[RuntimeParameter[str]]`	the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the `model` will be used. Defaults to `None`.
`use_magpie_template`	`bool`	a flag used to enable/disable applying the Magpie pre-query template. Defaults to `False`.
`magpie_pre_query_template`	`Union[MagpieAvailablePreQueryTemplates, str, None]`	the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are "llama3", "qwen2" or another pre-query template provided. Defaults to `None`.
`_model`	`Optional[Llama]`	the Llama model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the `load` method.

Runtime parameters

model_path: the path to the GGUF quantized model.
n_gpu_layers: the number of layers to use for the GPU. Defaults to -1.
chat_format: the chat format to use for the model. Defaults to None.
verbose: whether to print verbose output. Defaults to False.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the Llama class of llama_cpp library. Defaults to {}.

References

Examples:

Generate text:

from pathlib import Path
from distilabel.models.llms import LlamaCppLLM

# You can follow along this example downloading the following model running the following
# command in the terminal, that will download the model to the `Downloads` folder:
# curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf

model_path = "Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf"

llm = LlamaCppLLM(
    model_path=str(Path.home() / model_path),
    n_gpu_layers=-1,  # To use the GPU if available
    n_ctx=1024,       # Set the context size
)

llm.load()

# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

from pathlib import Path
from distilabel.models.llms import LlamaCppLLM

model_path = "Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf"

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = LlamaCppLLM(
    model_path=str(Path.home() / model_path),  # type: ignore
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "json", "schema": Character},
)

llm.load()

# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/llamacpp.py

class LlamaCppLLM(LLM, MagpieChatTemplateMixin):
    """llama.cpp LLM implementation running the Python bindings for the C++ code.

    Attributes:
        model_path: contains the path to the GGUF quantized model, compatible with the
            installed version of the `llama.cpp` Python bindings.
        n_gpu_layers: the number of layers to use for the GPU. Defaults to `-1`, meaning that
            the available GPU device will be used.
        chat_format: the chat format to use for the model. Defaults to `None`, which means the
            Llama format will be used.
        n_ctx: the context size to use for the model. Defaults to `512`.
        n_batch: the prompt processing maximum batch size to use for the model. Defaults to `512`.
        seed: random seed to use for the generation. Defaults to `4294967295`.
        verbose: whether to print verbose output. Defaults to `False`.
        structured_output: a dictionary containing the structured output configuration or if more
            fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
        extra_kwargs: additional dictionary of keyword arguments that will be passed to the
            `Llama` class of `llama_cpp` library. Defaults to `{}`.
        tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing
            the tokenizer config files. If not provided, the one associated to the `model`
            will be used. Defaults to `None`.
        use_magpie_template: a flag used to enable/disable applying the Magpie pre-query
            template. Defaults to `False`.
        magpie_pre_query_template: the pre-query template to be applied to the prompt or
            sent to the LLM to generate an instruction or a follow up user message. Valid
            values are "llama3", "qwen2" or another pre-query template provided. Defaults
            to `None`.
        _model: the Llama model instance. This attribute is meant to be used internally and
            should not be accessed directly. It will be set in the `load` method.

    Runtime parameters:
        - `model_path`: the path to the GGUF quantized model.
        - `n_gpu_layers`: the number of layers to use for the GPU. Defaults to `-1`.
        - `chat_format`: the chat format to use for the model. Defaults to `None`.
        - `verbose`: whether to print verbose output. Defaults to `False`.
        - `extra_kwargs`: additional dictionary of keyword arguments that will be passed to the
            `Llama` class of `llama_cpp` library. Defaults to `{}`.

    References:
        - [`llama.cpp`](https://github.com/ggerganov/llama.cpp)
        - [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python)

    Examples:
        Generate text:

        ```python
        from pathlib import Path
        from distilabel.models.llms import LlamaCppLLM

        # You can follow along this example downloading the following model running the following
        # command in the terminal, that will download the model to the `Downloads` folder:
        # curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf

        model_path = "Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf"

        llm = LlamaCppLLM(
            model_path=str(Path.home() / model_path),
            n_gpu_layers=-1,  # To use the GPU if available
            n_ctx=1024,       # Set the context size
        )

        llm.load()

        # Call the model
        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate structured data:

        ```python
        from pathlib import Path
        from distilabel.models.llms import LlamaCppLLM

        model_path = "Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf"

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = LlamaCppLLM(
            model_path=str(Path.home() / model_path),  # type: ignore
            n_gpu_layers=-1,
            n_ctx=1024,
            structured_output={"format": "json", "schema": Character},
        )

        llm.load()

        # Call the model
        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    model_path: RuntimeParameter[FilePath] = Field(
        default=None, description="The path to the GGUF quantized model.", exclude=True
    )
    n_gpu_layers: RuntimeParameter[int] = Field(
        default=-1,
        description="The number of layers that will be loaded in the GPU.",
    )
    chat_format: Optional[RuntimeParameter[str]] = Field(
        default=None,
        description="The chat format to use for the model. Defaults to `None`, which means the Llama format will be used.",
    )

    n_ctx: int = 512
    n_batch: int = 512
    seed: int = 4294967295
    verbose: RuntimeParameter[bool] = Field(
        default=False,
        description="Whether to print verbose output from llama.cpp library.",
    )
    extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(
        default_factory=dict,
        description="Additional dictionary of keyword arguments that will be passed to the"
        " `Llama` class of `llama_cpp` library. See all the supported arguments at: "
        "https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__",
    )
    structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(
        default=None,
        description="The structured output format to use across all the generations.",
    )
    tokenizer_id: Optional[RuntimeParameter[str]] = Field(
        default=None,
        description="The Hugging Face Hub repo id or a path to a directory containing"
        " the tokenizer config files. If not provided, the one associated to the `model`"
        " will be used.",
    )
    _logits_processor: Optional["LogitsProcessorList"] = PrivateAttr(default=None)
    _model: Optional["Llama"] = PrivateAttr(...)

    @model_validator(mode="after")
    def validate_magpie_usage(
        self,
    ) -> "LlamaCppLLM":
        """Validates that magpie usage is valid."""

        if self.use_magpie_template and self.tokenizer_id is None:
            raise ValueError(
                "`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,"
                " set a `tokenizer_id` and try again."
            )

    def load(self) -> None:
        """Loads the `Llama` model from the `model_path`."""
        try:
            from llama_cpp import Llama
        except ImportError as ie:
            raise ImportError(
                "The `llama_cpp` package is required to use the `LlamaCppLLM` class."
            ) from ie

        self._model = Llama(
            model_path=self.model_path.as_posix(),
            seed=self.seed,
            n_ctx=self.n_ctx,
            n_batch=self.n_batch,
            chat_format=self.chat_format,
            n_gpu_layers=self.n_gpu_layers,
            verbose=self.verbose,
            **self.extra_kwargs,
        )

        if self.structured_output:
            self._logits_processor = self._prepare_structured_output(
                self.structured_output
            )

        if self.use_magpie_template or self.magpie_pre_query_template:
            if not self.tokenizer_id:
                raise ValueError(
                    "The Hugging Face Hub repo id or a path to a directory containing"
                    " the tokenizer config files is required when using the `use_magpie_template`"
                    " or `magpie_pre_query_template` runtime parameters."
                )

        if self.tokenizer_id:
            try:
                from transformers import AutoTokenizer
            except ImportError as ie:
                raise ImportError(
                    "Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`."
                ) from ie
            self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)
            if self._tokenizer.chat_template is None:
                raise ValueError(
                    "The tokenizer does not have a chat template. Please use a tokenizer with a chat template."
                )

        # NOTE: Here because of the custom `logging` interface used, since it will create the logging name
        # out of the model name, which won't be available until the `Llama` instance is created.
        super().load()

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self._model.model_path  # type: ignore

    def _generate_chat_completion(
        self,
        input: FormattedInput,
        max_new_tokens: int = 128,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
        extra_generation_kwargs: Optional[Dict[str, Any]] = None,
    ) -> "CreateChatCompletionResponse":
        return self._model.create_chat_completion(  # type: ignore
            messages=input,  # type: ignore
            max_tokens=max_new_tokens,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty,
            temperature=temperature,
            top_p=top_p,
            logits_processor=self._logits_processor,
            **(extra_generation_kwargs or {}),
        )

    def prepare_input(self, input: "StandardInput") -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        prompt: str = (
            self._tokenizer.apply_chat_template(  # type: ignore
                conversation=input,  # type: ignore
                tokenize=False,
                add_generation_prompt=True,
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    def _generate_with_text_generation(
        self,
        input: FormattedInput,
        max_new_tokens: int = 128,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
        extra_generation_kwargs: Optional[Dict[str, Any]] = None,
    ) -> "CreateChatCompletionResponse":
        prompt = self.prepare_input(input)
        return self._model.create_completion(
            prompt=prompt,
            max_tokens=max_new_tokens,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty,
            temperature=temperature,
            top_p=top_p,
            logits_processor=self._logits_processor,
            **(extra_generation_kwargs or {}),
        )

    @validate_call
    def generate(  # type: ignore
        self,
        inputs: List[FormattedInput],
        num_generations: int = 1,
        max_new_tokens: int = 128,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
        extra_generation_kwargs: Optional[Dict[str, Any]] = None,
    ) -> List[GenerateOutput]:
        """Generates `num_generations` responses for the given input using the Llama model.

        Args:
            inputs: a list of inputs in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            frequency_penalty: the repetition penalty to use for the generation. Defaults
                to `0.0`.
            presence_penalty: the presence penalty to use for the generation. Defaults to
                `0.0`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            extra_generation_kwargs: dictionary with additional arguments to be passed to
                the `create_chat_completion` method. Reference at
                https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """
        structured_output = None
        batch_outputs = []
        for input in inputs:
            if isinstance(input, tuple):
                input, structured_output = input
            elif self.structured_output:
                structured_output = self.structured_output

            outputs = []
            output_tokens = []
            for _ in range(num_generations):
                # NOTE(plaguss): There seems to be a bug in how the logits processor
                # is used. Basically it consumes the FSM internally, and it isn't reinitialized
                # after each generation, so subsequent calls yield nothing. This is a workaround
                # until is fixed in the `llama_cpp` or `outlines` libraries.
                if structured_output:
                    self._logits_processor = self._prepare_structured_output(
                        structured_output
                    )
                if self.tokenizer_id is None:
                    completion = self._generate_chat_completion(
                        input,
                        max_new_tokens,
                        frequency_penalty,
                        presence_penalty,
                        temperature,
                        top_p,
                        extra_generation_kwargs,
                    )
                    outputs.append(completion["choices"][0]["message"]["content"])
                    output_tokens.append(completion["usage"]["completion_tokens"])
                else:
                    completion: "CreateChatCompletionResponse" = (
                        self._generate_with_text_generation(  # type: ignore
                            input,
                            max_new_tokens,
                            frequency_penalty,
                            presence_penalty,
                            temperature,
                            top_p,
                            extra_generation_kwargs,
                        )
                    )
                    outputs.append(completion["choices"][0]["text"])
                    output_tokens.append(completion["usage"]["completion_tokens"])
            batch_outputs.append(
                prepare_output(
                    outputs,
                    input_tokens=[completion["usage"]["prompt_tokens"]]
                    * num_generations,
                    output_tokens=output_tokens,
                )
            )

        return batch_outputs

    def _prepare_structured_output(
        self, structured_output: Optional[OutlinesStructuredOutputType] = None
    ) -> Union["LogitsProcessorList", "LogitsProcessor"]:
        """Creates the appropriate function to filter tokens to generate structured outputs.

        Args:
            structured_output: the configuration dict to prepare the structured output.

        Returns:
            The callable that will be used to guide the generation of the model.
        """
        from distilabel.steps.tasks.structured_outputs.outlines import (
            prepare_guided_output,
        )

        result = prepare_guided_output(structured_output, "llamacpp", self._model)
        if (schema := result.get("schema")) and self.structured_output:
            self.structured_output["schema"] = schema
        return [result["processor"]]

`model_name` `property` ¶

Returns the model name used for the LLM.

`validate_magpie_usage()` ¶

Validates that magpie usage is valid.

Source code in src/distilabel/models/llms/llamacpp.py

@model_validator(mode="after")
def validate_magpie_usage(
    self,
) -> "LlamaCppLLM":
    """Validates that magpie usage is valid."""

    if self.use_magpie_template and self.tokenizer_id is None:
        raise ValueError(
            "`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,"
            " set a `tokenizer_id` and try again."
        )

`load()` ¶

Loads the Llama model from the model_path.

Source code in src/distilabel/models/llms/llamacpp.py

def load(self) -> None:
    """Loads the `Llama` model from the `model_path`."""
    try:
        from llama_cpp import Llama
    except ImportError as ie:
        raise ImportError(
            "The `llama_cpp` package is required to use the `LlamaCppLLM` class."
        ) from ie

    self._model = Llama(
        model_path=self.model_path.as_posix(),
        seed=self.seed,
        n_ctx=self.n_ctx,
        n_batch=self.n_batch,
        chat_format=self.chat_format,
        n_gpu_layers=self.n_gpu_layers,
        verbose=self.verbose,
        **self.extra_kwargs,
    )

    if self.structured_output:
        self._logits_processor = self._prepare_structured_output(
            self.structured_output
        )

    if self.use_magpie_template or self.magpie_pre_query_template:
        if not self.tokenizer_id:
            raise ValueError(
                "The Hugging Face Hub repo id or a path to a directory containing"
                " the tokenizer config files is required when using the `use_magpie_template`"
                " or `magpie_pre_query_template` runtime parameters."
            )

    if self.tokenizer_id:
        try:
            from transformers import AutoTokenizer
        except ImportError as ie:
            raise ImportError(
                "Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`."
            ) from ie
        self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)
        if self._tokenizer.chat_template is None:
            raise ValueError(
                "The tokenizer does not have a chat template. Please use a tokenizer with a chat template."
            )

    # NOTE: Here because of the custom `logging` interface used, since it will create the logging name
    # out of the model name, which won't be available until the `Llama` instance is created.
    super().load()

`prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/llamacpp.py

def prepare_input(self, input: "StandardInput") -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    prompt: str = (
        self._tokenizer.apply_chat_template(  # type: ignore
            conversation=input,  # type: ignore
            tokenize=False,
            add_generation_prompt=True,
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`generate(inputs, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, extra_generation_kwargs=None)` ¶

Generates num_generations responses for the given input using the Llama model.

Parameters:

Name	Type	Description	Default
`inputs`	`List[FormattedInput]`	a list of inputs in chat format to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`frequency_penalty`	`float`	the repetition penalty to use for the generation. Defaults to `0.0`.	`0.0`
`presence_penalty`	`float`	the presence penalty to use for the generation. Defaults to `0.0`.	`0.0`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`1.0`
`top_p`	`float`	the top-p value to use for the generation. Defaults to `1.0`.	`1.0`
`extra_generation_kwargs`	`Optional[Dict[str, Any]]`	dictionary with additional arguments to be passed to the `create_chat_completion` method. Reference at https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion	`None`

Returns:

Type	Description
`List[GenerateOutput]`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/llamacpp.py

@validate_call
def generate(  # type: ignore
    self,
    inputs: List[FormattedInput],
    num_generations: int = 1,
    max_new_tokens: int = 128,
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    temperature: float = 1.0,
    top_p: float = 1.0,
    extra_generation_kwargs: Optional[Dict[str, Any]] = None,
) -> List[GenerateOutput]:
    """Generates `num_generations` responses for the given input using the Llama model.

    Args:
        inputs: a list of inputs in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        frequency_penalty: the repetition penalty to use for the generation. Defaults
            to `0.0`.
        presence_penalty: the presence penalty to use for the generation. Defaults to
            `0.0`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        extra_generation_kwargs: dictionary with additional arguments to be passed to
            the `create_chat_completion` method. Reference at
            https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """
    structured_output = None
    batch_outputs = []
    for input in inputs:
        if isinstance(input, tuple):
            input, structured_output = input
        elif self.structured_output:
            structured_output = self.structured_output

        outputs = []
        output_tokens = []
        for _ in range(num_generations):
            # NOTE(plaguss): There seems to be a bug in how the logits processor
            # is used. Basically it consumes the FSM internally, and it isn't reinitialized
            # after each generation, so subsequent calls yield nothing. This is a workaround
            # until is fixed in the `llama_cpp` or `outlines` libraries.
            if structured_output:
                self._logits_processor = self._prepare_structured_output(
                    structured_output
                )
            if self.tokenizer_id is None:
                completion = self._generate_chat_completion(
                    input,
                    max_new_tokens,
                    frequency_penalty,
                    presence_penalty,
                    temperature,
                    top_p,
                    extra_generation_kwargs,
                )
                outputs.append(completion["choices"][0]["message"]["content"])
                output_tokens.append(completion["usage"]["completion_tokens"])
            else:
                completion: "CreateChatCompletionResponse" = (
                    self._generate_with_text_generation(  # type: ignore
                        input,
                        max_new_tokens,
                        frequency_penalty,
                        presence_penalty,
                        temperature,
                        top_p,
                        extra_generation_kwargs,
                    )
                )
                outputs.append(completion["choices"][0]["text"])
                output_tokens.append(completion["usage"]["completion_tokens"])
        batch_outputs.append(
            prepare_output(
                outputs,
                input_tokens=[completion["usage"]["prompt_tokens"]]
                * num_generations,
                output_tokens=output_tokens,
            )
        )

    return batch_outputs

`_prepare_structured_output(structured_output=None)` ¶

Creates the appropriate function to filter tokens to generate structured outputs.

Parameters:

Name	Type	Description	Default
`structured_output`	`Optional[OutlinesStructuredOutputType]`	the configuration dict to prepare the structured output.	`None`

Returns:

Type	Description
`Union[LogitsProcessorList, LogitsProcessor]`	The callable that will be used to guide the generation of the model.

Source code in src/distilabel/models/llms/llamacpp.py

def _prepare_structured_output(
    self, structured_output: Optional[OutlinesStructuredOutputType] = None
) -> Union["LogitsProcessorList", "LogitsProcessor"]:
    """Creates the appropriate function to filter tokens to generate structured outputs.

    Args:
        structured_output: the configuration dict to prepare the structured output.

    Returns:
        The callable that will be used to guide the generation of the model.
    """
    from distilabel.steps.tasks.structured_outputs.outlines import (
        prepare_guided_output,
    )

    result = prepare_guided_output(structured_output, "llamacpp", self._model)
    if (schema := result.get("schema")) and self.structured_output:
        self.structured_output["schema"] = schema
    return [result["processor"]]

`MistralLLM` ¶

Bases: AsyncLLM

Mistral LLM implementation running the async API client.

Attributes:

Name	Type	Description
`model`	`str`	the model name to use for the LLM e.g. "mistral-tiny", "mistral-large", etc.
`endpoint`	`str`	the endpoint to use for the Mistral API. Defaults to "https://api.mistral.ai".
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the Mistral API. Defaults to `None` which means that the value set for the environment variable `OPENAI_API_KEY` will be used, or `None` if not set.
`max_retries`	`RuntimeParameter[int]`	the maximum number of retries to attempt when a request fails. Defaults to `5`.
`timeout`	`RuntimeParameter[int]`	the maximum time in seconds to wait for a response. Defaults to `120`.
`max_concurrent_requests`	`RuntimeParameter[int]`	the maximum number of concurrent requests to send. Defaults to `64`.
`structured_output`	`Optional[RuntimeParameter[InstructorStructuredOutputType]]`	a dictionary containing the structured output configuration configuration using `instructor`. You can take a look at the dictionary structure in `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.
`_api_key_env_var`	`str`	the name of the environment variable to use for the API key. It is meant to be used internally.
`_aclient`	`Optional[Mistral]`	the `Mistral` to use for the Mistral API. It is meant to be used internally. Set in the `load` method.

Runtime parameters

api_key: the API key to authenticate the requests to the Mistral API.
max_retries: the maximum number of retries to attempt when a request fails. Defaults to 5.
timeout: the maximum time in seconds to wait for a response. Defaults to 120.
max_concurrent_requests: the maximum number of concurrent requests to send. Defaults to 64.

Examples:

Generate text:

from distilabel.models.llms import MistralLLM

llm = MistralLLM(model="open-mixtral-8x22b")

llm.load()

# Call the model
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

```python
from pydantic import BaseModel
from distilabel.models.llms import MistralLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = MistralLLM(
    model="open-mixtral-8x22b",
    api_key="api.key",
    structured_output={"schema": User}
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/mistral.py

class MistralLLM(AsyncLLM):
    """Mistral LLM implementation running the async API client.

    Attributes:
        model: the model name to use for the LLM e.g. "mistral-tiny", "mistral-large", etc.
        endpoint: the endpoint to use for the Mistral API. Defaults to "https://api.mistral.ai".
        api_key: the API key to authenticate the requests to the Mistral API. Defaults to `None` which
            means that the value set for the environment variable `OPENAI_API_KEY` will be used, or
            `None` if not set.
        max_retries: the maximum number of retries to attempt when a request fails. Defaults to `5`.
        timeout: the maximum time in seconds to wait for a response. Defaults to `120`.
        max_concurrent_requests: the maximum number of concurrent requests to send. Defaults
            to `64`.
        structured_output: a dictionary containing the structured output configuration configuration
            using `instructor`. You can take a look at the dictionary structure in
            `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.
        _api_key_env_var: the name of the environment variable to use for the API key. It is meant to
            be used internally.
        _aclient: the `Mistral` to use for the Mistral API. It is meant to be used internally.
            Set in the `load` method.

    Runtime parameters:
        - `api_key`: the API key to authenticate the requests to the Mistral API.
        - `max_retries`: the maximum number of retries to attempt when a request fails.
            Defaults to `5`.
        - `timeout`: the maximum time in seconds to wait for a response. Defaults to `120`.
        - `max_concurrent_requests`: the maximum number of concurrent requests to send.
            Defaults to `64`.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import MistralLLM

        llm = MistralLLM(model="open-mixtral-8x22b")

        llm.load()

        # Call the model
        output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

        Generate structured data:

        ```python
        from pydantic import BaseModel
        from distilabel.models.llms import MistralLLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = MistralLLM(
            model="open-mixtral-8x22b",
            api_key="api.key",
            structured_output={"schema": User}
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    model: str
    endpoint: str = "https://api.mistral.ai"
    api_key: Optional[RuntimeParameter[SecretStr]] = Field(
        default_factory=lambda: os.getenv(_MISTRALAI_API_KEY_ENV_VAR_NAME),
        description="The API key to authenticate the requests to the Mistral API.",
    )
    max_retries: RuntimeParameter[int] = Field(
        default=6,
        description="The maximum number of times to retry the request to the API before"
        " failing.",
    )
    timeout: RuntimeParameter[int] = Field(
        default=120,
        description="The maximum time in seconds to wait for a response from the API.",
    )
    max_concurrent_requests: RuntimeParameter[int] = Field(
        default=64, description="The maximum number of concurrent requests to send."
    )
    structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (
        Field(
            default=None,
            description="The structured output format to use across all the generations.",
        )
    )

    _num_generations_param_supported = False

    _api_key_env_var: str = PrivateAttr(_MISTRALAI_API_KEY_ENV_VAR_NAME)
    _aclient: Optional["Mistral"] = PrivateAttr(...)

    def load(self) -> None:
        """Loads the `Mistral` client to benefit from async requests."""
        super().load()

        try:
            from mistralai import Mistral
        except ImportError as ie:
            raise ImportError(
                "MistralAI Python client is not installed. Please install it using"
                " `pip install 'distilabel[mistralai]'`."
            ) from ie

        if self.api_key is None:
            raise ValueError(
                f"To use `{self.__class__.__name__}` an API key must be provided via `api_key`"
                f" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`."
            )

        self._aclient = Mistral(
            api_key=self.api_key.get_secret_value(),
            endpoint=self.endpoint,
            max_retries=self.max_retries,  # type: ignore
            timeout=self.timeout,  # type: ignore
            max_concurrent_requests=self.max_concurrent_requests,  # type: ignore
        )

        if self.structured_output:
            result = self._prepare_structured_output(
                structured_output=self.structured_output,
                client=self._aclient,
                framework="mistral",
            )
            self._aclient = result.get("client")  # type: ignore
            if structured_output := result.get("structured_output"):
                self.structured_output = structured_output

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    # TODO: add `num_generations` parameter once Mistral client allows `n` parameter
    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: FormattedInput,
        max_new_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
    ) -> GenerateOutput:
        """Generates `num_generations` responses for the given input using the MistralAI async
        client.

        Args:
            input: a single input in chat format to generate responses for.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """
        structured_output = None
        if isinstance(input, tuple):
            input, structured_output = input
            result = self._prepare_structured_output(
                structured_output=structured_output,
                client=self._aclient,
                framework="mistral",
            )
            self._aclient = result.get("client")

        if structured_output is None and self.structured_output is not None:
            structured_output = self.structured_output

        kwargs = {
            "messages": input,  # type: ignore
            "model": self.model,
            "max_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        }
        generations = []
        if structured_output:
            kwargs = self._prepare_kwargs(kwargs, structured_output)
            # TODO: This should work just with the _aclient.chat method, but it's not working.
            # We need to check instructor and see if we can create a PR.
            completion = await self._aclient.chat.completions.create(**kwargs)  # type: ignore
        else:
            # completion = await self._aclient.chat(**kwargs)  # type: ignore
            completion = await self._aclient.chat.complete_async(**kwargs)  # type: ignore

        if structured_output:
            return prepare_output(
                [completion.model_dump_json()],
                **self._get_llm_statistics(completion._raw_response),
            )

        for choice in completion.choices:
            if (content := choice.message.content) is None:
                self._logger.warning(  # type: ignore
                    f"Received no response using MistralAI client (model: '{self.model}')."
                    f" Finish reason was: {choice.finish_reason}"
                )
            generations.append(content)

        return prepare_output(generations, **self._get_llm_statistics(completion))

    @staticmethod
    def _get_llm_statistics(completion: "ChatCompletionResponse") -> "LLMStatistics":
        return {
            "input_tokens": [completion.usage.prompt_tokens],
            "output_tokens": [completion.usage.completion_tokens],
        }

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the Mistral client to benefit from async requests.

Source code in src/distilabel/models/llms/mistral.py

def load(self) -> None:
    """Loads the `Mistral` client to benefit from async requests."""
    super().load()

    try:
        from mistralai import Mistral
    except ImportError as ie:
        raise ImportError(
            "MistralAI Python client is not installed. Please install it using"
            " `pip install 'distilabel[mistralai]'`."
        ) from ie

    if self.api_key is None:
        raise ValueError(
            f"To use `{self.__class__.__name__}` an API key must be provided via `api_key`"
            f" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`."
        )

    self._aclient = Mistral(
        api_key=self.api_key.get_secret_value(),
        endpoint=self.endpoint,
        max_retries=self.max_retries,  # type: ignore
        timeout=self.timeout,  # type: ignore
        max_concurrent_requests=self.max_concurrent_requests,  # type: ignore
    )

    if self.structured_output:
        result = self._prepare_structured_output(
            structured_output=self.structured_output,
            client=self._aclient,
            framework="mistral",
        )
        self._aclient = result.get("client")  # type: ignore
        if structured_output := result.get("structured_output"):
            self.structured_output = structured_output

`agenerate(input, max_new_tokens=None, temperature=None, top_p=None)` `async` ¶

Generates num_generations responses for the given input using the MistralAI async client.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`max_new_tokens`	`Optional[int]`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`None`
`temperature`	`Optional[float]`	the temperature to use for the generation. Defaults to `0.1`.	`None`
`top_p`	`Optional[float]`	the top-p value to use for the generation. Defaults to `1.0`.	`None`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/mistral.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: FormattedInput,
    max_new_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> GenerateOutput:
    """Generates `num_generations` responses for the given input using the MistralAI async
    client.

    Args:
        input: a single input in chat format to generate responses for.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """
    structured_output = None
    if isinstance(input, tuple):
        input, structured_output = input
        result = self._prepare_structured_output(
            structured_output=structured_output,
            client=self._aclient,
            framework="mistral",
        )
        self._aclient = result.get("client")

    if structured_output is None and self.structured_output is not None:
        structured_output = self.structured_output

    kwargs = {
        "messages": input,  # type: ignore
        "model": self.model,
        "max_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    generations = []
    if structured_output:
        kwargs = self._prepare_kwargs(kwargs, structured_output)
        # TODO: This should work just with the _aclient.chat method, but it's not working.
        # We need to check instructor and see if we can create a PR.
        completion = await self._aclient.chat.completions.create(**kwargs)  # type: ignore
    else:
        # completion = await self._aclient.chat(**kwargs)  # type: ignore
        completion = await self._aclient.chat.complete_async(**kwargs)  # type: ignore

    if structured_output:
        return prepare_output(
            [completion.model_dump_json()],
            **self._get_llm_statistics(completion._raw_response),
        )

    for choice in completion.choices:
        if (content := choice.message.content) is None:
            self._logger.warning(  # type: ignore
                f"Received no response using MistralAI client (model: '{self.model}')."
                f" Finish reason was: {choice.finish_reason}"
            )
        generations.append(content)

    return prepare_output(generations, **self._get_llm_statistics(completion))

`MlxLLM` ¶

Bases: LLM, MagpieChatTemplateMixin

Apple MLX LLM implementation.

Attributes:

Name	Type	Description
`path_or_hf_repo`	`str`	the path to the model or the Hugging Face Hub repo id.
`tokenizer_config`	`Dict[str, Any]`	the tokenizer configuration.
`mlx_model_config`	`Dict[str, Any]`	the MLX model configuration.
`adapter_path`	`Optional[str]`	the path to the adapter.
`use_magpie_template`	`bool`	a flag used to enable/disable applying the Magpie pre-query template. Defaults to `False`.
`magpie_pre_query_template`	`Union[MagpieAvailablePreQueryTemplates, str, None]`	the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are "llama3", "qwen2" or another pre-query template provided. Defaults to `None`.

Icon

:apple:

Examples:

Generate text:

from distilabel.models.llms import MlxLLM

llm = MlxLLM(path_or_hf_repo="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

llm.load()

# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Source code in src/distilabel/models/llms/mlx.py

class MlxLLM(LLM, MagpieChatTemplateMixin):
    """Apple MLX LLM implementation.

    Attributes:
        path_or_hf_repo: the path to the model or the Hugging Face Hub repo id.
        tokenizer_config: the tokenizer configuration.
        mlx_model_config: the MLX model configuration.
        adapter_path: the path to the adapter.
        use_magpie_template: a flag used to enable/disable applying the Magpie pre-query
            template. Defaults to `False`.
        magpie_pre_query_template: the pre-query template to be applied to the prompt or
            sent to the LLM to generate an instruction or a follow up user message. Valid
            values are "llama3", "qwen2" or another pre-query template provided. Defaults
            to `None`.

    Icon:
        `:apple:`

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import MlxLLM

        llm = MlxLLM(path_or_hf_repo="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

        llm.load()

        # Call the model
        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```
    """

    path_or_hf_repo: str
    tokenizer_config: Dict[str, Any] = Field(default_factory=dict)
    mlx_model_config: Dict[str, Any] = Field(default_factory=dict)
    adapter_path: Optional[str] = None

    _model: Optional["nn.Module"] = PrivateAttr(None)
    _tokenizer: Optional["TokenizerWrapper"] = PrivateAttr(None)
    _mlx_generate: Optional[Callable] = PrivateAttr(None)
    _make_sampler: Optional[Callable] = PrivateAttr(None)

    def load(self) -> None:
        """Loads the model and tokenizer and creates the text generation pipeline. In addition,
        it will configure the tokenizer chat template."""
        try:
            import mlx  # noqa
            from mlx_lm.utils import generate, load
            from mlx_lm.sample_utils import make_sampler
        except ImportError as ie:
            raise ImportError(
                "MLX is not installed. Please install it using `pip install 'distilabel[mlx]'`."
            ) from ie

        self._model, self._tokenizer = load(
            self.path_or_hf_repo,
            tokenizer_config=self.tokenizer_config,
            model_config=self.mlx_model_config,
            adapter_path=self.adapter_path,
        )

        if self._tokenizer.pad_token is None:
            self._tokenizer.pad_token = self._tokenizer.eos_token

        self._mlx_generate = generate
        self._make_sampler = make_sampler
        super().load()

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.path_or_hf_repo

    def prepare_input(self, input: Union["StandardInput", str]) -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        if isinstance(input, str):
            return input

        prompt: str = (
            self._tokenizer.apply_chat_template(  # type: ignore
                input,
                tokenize=False,
                add_generation_prompt=True,
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    @validate_call
    def generate(  # type: ignore
        self,
        inputs: List[Union[StandardInput, str]],
        num_generations: int = 1,
        max_tokens: int = 256,
        logits_processors: Optional[List[Callable]] = None,
        max_kv_size: Optional[int] = None,
        prompt_cache: Optional[Any] = None,
        prefill_step_size: int = 512,
        kv_bits: Optional[int] = None,
        kv_group_size: int = 64,
        quantized_kv_start: int = 0,
        prompt_progress_callback: Optional[Callable[[int, int], None]] = None,
        temp: float = 0.0,
        top_p: float = 0.0,
        min_p: float = 0.0,
        min_tokens_to_keep: int = 1,
        top_k: int = -1,
    ) -> List[GenerateOutput]:
        """Generates `num_generations` responses for each input using the text generation
        pipeline.

        Args:
            inputs: the inputs to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            logits_processors: the logits processors to use for the generation. Defaults to
                `None`.
            max_kv_size: the maximum size of the key-value cache. Defaults to `None`.
            prompt_cache: the prompt cache to use for the generation. Defaults to `None`.
            prefill_step_size: the prefill step size. Defaults to `512`.
            kv_bits: the number of bits to use for the key-value cache. Defaults to `None`.
            kv_group_size: the group size for the key-value cache. Defaults to `64`.
            quantized_kv_start: the start of the quantized key-value cache. Defaults to `0`.
            prompt_progress_callback: the callback to use for the generation. Defaults to
                `None`.
            temp: The temperature for text generation. Defaults to `0.0`.
            top_p: The top-p value used for the generation. Defaults to `0.0`.
            min_p: The min-p value used for the generation. Defaults to `0.0`.
            min_tokens_to_keep: Minimum number of tokens to keep for sampling after
                filtering. Must be at least 1. Defaults to `1`.
            top_k: The top-k value used for the generation. Defaults to `-1`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """

        sampler = self._make_sampler(  # type: ignore
            temp=temp,
            top_p=top_p,
            min_p=min_p,
            min_tokens_to_keep=min_tokens_to_keep,
            top_k=top_k,
        )
        structured_output = None
        result = []
        for input in inputs:
            if isinstance(input, tuple):
                input, structured_output = input

            output: List[str] = []
            for _ in range(num_generations):
                if structured_output:  # will raise a NotImplementedError
                    self._prepare_structured_output(structured_output)
                prompt = self.prepare_input(input)
                generation = self._mlx_generate(  # type: ignore
                    prompt=prompt,
                    model=self._model,
                    tokenizer=self._tokenizer,
                    logits_processors=logits_processors,
                    max_tokens=max_tokens,
                    sampler=sampler,
                    max_kv_size=max_kv_size,
                    prompt_cache=prompt_cache,
                    prefill_step_size=prefill_step_size,
                    kv_bits=kv_bits,
                    kv_group_size=kv_group_size,
                    quantized_kv_start=quantized_kv_start,
                    prompt_progress_callback=prompt_progress_callback,
                )

                output.append(generation)

            result.append(
                prepare_output(
                    generations=output,
                    input_tokens=[compute_tokens(input, self._tokenizer.encode)],  # type: ignore
                    output_tokens=[
                        compute_tokens(
                            text_or_messages=generation,
                            tokenizer=self._tokenizer.encode,  # type: ignore
                        )
                        for generation in output
                    ],
                )
            )
        return result

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the model and tokenizer and creates the text generation pipeline. In addition, it will configure the tokenizer chat template.

Source code in src/distilabel/models/llms/mlx.py

def load(self) -> None:
    """Loads the model and tokenizer and creates the text generation pipeline. In addition,
    it will configure the tokenizer chat template."""
    try:
        import mlx  # noqa
        from mlx_lm.utils import generate, load
        from mlx_lm.sample_utils import make_sampler
    except ImportError as ie:
        raise ImportError(
            "MLX is not installed. Please install it using `pip install 'distilabel[mlx]'`."
        ) from ie

    self._model, self._tokenizer = load(
        self.path_or_hf_repo,
        tokenizer_config=self.tokenizer_config,
        model_config=self.mlx_model_config,
        adapter_path=self.adapter_path,
    )

    if self._tokenizer.pad_token is None:
        self._tokenizer.pad_token = self._tokenizer.eos_token

    self._mlx_generate = generate
    self._make_sampler = make_sampler
    super().load()

`prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`Union[StandardInput, str]`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/mlx.py

def prepare_input(self, input: Union["StandardInput", str]) -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    if isinstance(input, str):
        return input

    prompt: str = (
        self._tokenizer.apply_chat_template(  # type: ignore
            input,
            tokenize=False,
            add_generation_prompt=True,
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`generate(inputs, num_generations=1, max_tokens=256, logits_processors=None, max_kv_size=None, prompt_cache=None, prefill_step_size=512, kv_bits=None, kv_group_size=64, quantized_kv_start=0, prompt_progress_callback=None, temp=0.0, top_p=0.0, min_p=0.0, min_tokens_to_keep=1, top_k=-1)` ¶

Generates num_generations responses for each input using the text generation pipeline.

Parameters:

Name	Type	Description	Default
`inputs`	`List[Union[StandardInput, str]]`	the inputs to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`256`
`logits_processors`	`Optional[List[Callable]]`	the logits processors to use for the generation. Defaults to `None`.	`None`
`max_kv_size`	`Optional[int]`	the maximum size of the key-value cache. Defaults to `None`.	`None`
`prompt_cache`	`Optional[Any]`	the prompt cache to use for the generation. Defaults to `None`.	`None`
`prefill_step_size`	`int`	the prefill step size. Defaults to `512`.	`512`
`kv_bits`	`Optional[int]`	the number of bits to use for the key-value cache. Defaults to `None`.	`None`
`kv_group_size`	`int`	the group size for the key-value cache. Defaults to `64`.	`64`
`quantized_kv_start`	`int`	the start of the quantized key-value cache. Defaults to `0`.	`0`
`prompt_progress_callback`	`Optional[Callable[[int, int], None]]`	the callback to use for the generation. Defaults to `None`.	`None`
`temp`	`float`	The temperature for text generation. Defaults to `0.0`.	`0.0`
`top_p`	`float`	The top-p value used for the generation. Defaults to `0.0`.	`0.0`
`min_p`	`float`	The min-p value used for the generation. Defaults to `0.0`.	`0.0`
`min_tokens_to_keep`	`int`	Minimum number of tokens to keep for sampling after filtering. Must be at least 1. Defaults to `1`.	`1`
`top_k`	`int`	The top-k value used for the generation. Defaults to `-1`.	`-1`

Returns:

Type	Description
`List[GenerateOutput]`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/mlx.py

@validate_call
def generate(  # type: ignore
    self,
    inputs: List[Union[StandardInput, str]],
    num_generations: int = 1,
    max_tokens: int = 256,
    logits_processors: Optional[List[Callable]] = None,
    max_kv_size: Optional[int] = None,
    prompt_cache: Optional[Any] = None,
    prefill_step_size: int = 512,
    kv_bits: Optional[int] = None,
    kv_group_size: int = 64,
    quantized_kv_start: int = 0,
    prompt_progress_callback: Optional[Callable[[int, int], None]] = None,
    temp: float = 0.0,
    top_p: float = 0.0,
    min_p: float = 0.0,
    min_tokens_to_keep: int = 1,
    top_k: int = -1,
) -> List[GenerateOutput]:
    """Generates `num_generations` responses for each input using the text generation
    pipeline.

    Args:
        inputs: the inputs to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        logits_processors: the logits processors to use for the generation. Defaults to
            `None`.
        max_kv_size: the maximum size of the key-value cache. Defaults to `None`.
        prompt_cache: the prompt cache to use for the generation. Defaults to `None`.
        prefill_step_size: the prefill step size. Defaults to `512`.
        kv_bits: the number of bits to use for the key-value cache. Defaults to `None`.
        kv_group_size: the group size for the key-value cache. Defaults to `64`.
        quantized_kv_start: the start of the quantized key-value cache. Defaults to `0`.
        prompt_progress_callback: the callback to use for the generation. Defaults to
            `None`.
        temp: The temperature for text generation. Defaults to `0.0`.
        top_p: The top-p value used for the generation. Defaults to `0.0`.
        min_p: The min-p value used for the generation. Defaults to `0.0`.
        min_tokens_to_keep: Minimum number of tokens to keep for sampling after
            filtering. Must be at least 1. Defaults to `1`.
        top_k: The top-k value used for the generation. Defaults to `-1`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """

    sampler = self._make_sampler(  # type: ignore
        temp=temp,
        top_p=top_p,
        min_p=min_p,
        min_tokens_to_keep=min_tokens_to_keep,
        top_k=top_k,
    )
    structured_output = None
    result = []
    for input in inputs:
        if isinstance(input, tuple):
            input, structured_output = input

        output: List[str] = []
        for _ in range(num_generations):
            if structured_output:  # will raise a NotImplementedError
                self._prepare_structured_output(structured_output)
            prompt = self.prepare_input(input)
            generation = self._mlx_generate(  # type: ignore
                prompt=prompt,
                model=self._model,
                tokenizer=self._tokenizer,
                logits_processors=logits_processors,
                max_tokens=max_tokens,
                sampler=sampler,
                max_kv_size=max_kv_size,
                prompt_cache=prompt_cache,
                prefill_step_size=prefill_step_size,
                kv_bits=kv_bits,
                kv_group_size=kv_group_size,
                quantized_kv_start=quantized_kv_start,
                prompt_progress_callback=prompt_progress_callback,
            )

            output.append(generation)

        result.append(
            prepare_output(
                generations=output,
                input_tokens=[compute_tokens(input, self._tokenizer.encode)],  # type: ignore
                output_tokens=[
                    compute_tokens(
                        text_or_messages=generation,
                        tokenizer=self._tokenizer.encode,  # type: ignore
                    )
                    for generation in output
                ],
            )
        )
    return result

`MixtureOfAgentsLLM` ¶

Bases: AsyncLLM

Mixture-of-Agents implementation.

An LLM class that leverages LLMs collective strenghts to generate a response, as described in the "Mixture-of-Agents Enhances Large Language model Capabilities" paper. There is a list of LLMs proposing/generating outputs that LLMs from the next round/layer can use as auxiliary information. Finally, there is an LLM that aggregates the outputs to generate the final response.

Attributes:

Name	Type	Description
`aggregator_llm`	`LLM`	The `LLM` that aggregates the outputs of the proposer `LLM`s.
`proposers_llms`	`List[AsyncLLM]`	The list of `LLM`s that propose outputs to be aggregated.
`rounds`	`int`	The number of layers or rounds that the `proposers_llms` will generate outputs. Defaults to `1`.

References

Mixture-of-Agents Enhances Large Language Model Capabilities

Examples:

Generate text:

from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate_outputs(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)

Source code in src/distilabel/models/llms/moa.py

class MixtureOfAgentsLLM(AsyncLLM):
    """`Mixture-of-Agents` implementation.

    An `LLM` class that leverages `LLM`s collective strenghts to generate a response,
    as described in the "Mixture-of-Agents Enhances Large Language model Capabilities"
    paper. There is a list of `LLM`s proposing/generating outputs that `LLM`s from the next
    round/layer can use as auxiliary information. Finally, there is an `LLM` that aggregates
    the outputs to generate the final response.

    Attributes:
        aggregator_llm: The `LLM` that aggregates the outputs of the proposer `LLM`s.
        proposers_llms: The list of `LLM`s that propose outputs to be aggregated.
        rounds: The number of layers or rounds that the `proposers_llms` will generate
            outputs. Defaults to `1`.

    References:
        - [Mixture-of-Agents Enhances Large Language Model Capabilities](https://arxiv.org/abs/2406.04692)

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

        llm = MixtureOfAgentsLLM(
            aggregator_llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            ),
            proposers_llms=[
                InferenceEndpointsLLM(
                    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
                    tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
                ),
                InferenceEndpointsLLM(
                    model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
                    tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
                ),
                InferenceEndpointsLLM(
                    model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
                    tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
                ),
            ],
            rounds=2,
        )

        llm.load()

        output = llm.generate_outputs(
            inputs=[
                [
                    {
                        "role": "user",
                        "content": "My favorite witty review of The Rings of Power series is this: Input:",
                    }
                ]
            ]
        )
        ```
    """

    aggregator_llm: LLM
    proposers_llms: List[AsyncLLM] = Field(default_factory=list)
    rounds: int = 1

    @property
    def runtime_parameters_names(self) -> "RuntimeParametersNames":
        """Returns the runtime parameters of the `LLM`, which are a combination of the
        `RuntimeParameter`s of the `LLM`, the `aggregator_llm` and the `proposers_llms`.

        Returns:
            The runtime parameters of the `LLM`.
        """
        runtime_parameters_names = super().runtime_parameters_names
        del runtime_parameters_names["generation_kwargs"]
        return runtime_parameters_names

    def load(self) -> None:
        """Loads all the `LLM`s in the `MixtureOfAgents`."""
        super().load()

        for llm in self.proposers_llms:
            self._logger.debug(f"Loading proposer LLM in MoA: {llm}")  # type: ignore
            llm.load()

        self._logger.debug(f"Loading aggregator LLM in MoA: {self.aggregator_llm}")  # type: ignore
        self.aggregator_llm.load()

    @property
    def model_name(self) -> str:
        """Returns the aggregated model name."""
        return f"moa-{self.aggregator_llm.model_name}-{'-'.join([llm.model_name for llm in self.proposers_llms])}"

    def get_generation_kwargs(self) -> Dict[str, Any]:
        """Returns the generation kwargs of the `MixtureOfAgents` as a dictionary.

        Returns:
            The generation kwargs of the `MixtureOfAgents`.
        """
        return {
            "aggregator_llm": self.aggregator_llm.get_generation_kwargs(),
            "proposers_llms": [
                llm.get_generation_kwargs() for llm in self.proposers_llms
            ],
        }

    # `abstractmethod`, had to be implemented but not used
    async def agenerate(
        self, input: "FormattedInput", num_generations: int = 1, **kwargs: Any
    ) -> List[Union[str, None]]:
        raise NotImplementedError(
            "`agenerate` method is not implemented for `MixtureOfAgents`"
        )

    def _build_moa_system_prompt(self, prev_outputs: List[str]) -> str:
        """Builds the Mixture-of-Agents system prompt.

        Args:
            prev_outputs: The list of previous outputs to use as references.

        Returns:
            The Mixture-of-Agents system prompt.
        """
        moa_system_prompt = MOA_SYSTEM_PROMPT
        for i, prev_output in enumerate(prev_outputs):
            if prev_output is not None:
                moa_system_prompt += f"\n{i + 1}. {prev_output}"
        return moa_system_prompt

    def _inject_moa_system_prompt(
        self, input: "StandardInput", prev_outputs: List[str]
    ) -> "StandardInput":
        """Injects the Mixture-of-Agents system prompt into the input.

        Args:
            input: The input to inject the system prompt into.
            prev_outputs: The list of previous outputs to use as references.

        Returns:
            The input with the Mixture-of-Agents system prompt injected.
        """
        if len(prev_outputs) == 0:
            return input

        moa_system_prompt = self._build_moa_system_prompt(prev_outputs)

        system = next((item for item in input if item["role"] == "system"), None)
        if system:
            original_system_prompt = system["content"]
            system["content"] = f"{moa_system_prompt}\n\n{original_system_prompt}"
        else:
            input.insert(0, {"role": "system", "content": moa_system_prompt})

        return input

    async def _agenerate(
        self,
        inputs: List["FormattedInput"],
        num_generations: int = 1,
        **kwargs: Any,
    ) -> List["GenerateOutput"]:
        """Internal function to concurrently generate responses for a list of inputs.

        Args:
            inputs: the list of inputs to generate responses for.
            num_generations: the number of generations to generate per input.
            **kwargs: the additional kwargs to be used for the generation.

        Returns:
            A list containing the generations for each input.
        """
        aggregator_llm_kwargs: Dict[str, Any] = kwargs.get("aggregator_llm", {})
        proposers_llms_kwargs: List[Dict[str, Any]] = kwargs.get(
            "proposers_llms", [{}] * len(self.proposers_llms)
        )

        prev_outputs = []
        for round in range(self.rounds):
            self._logger.debug(f"Generating round {round + 1}/{self.rounds} in MoA")  # type: ignore

            # Generate `num_generations` with each proposer LLM for each input
            tasks = [
                asyncio.create_task(
                    llm._agenerate(
                        inputs=[
                            self._inject_moa_system_prompt(
                                cast("StandardInput", input), prev_input_outputs
                            )
                            for input, prev_input_outputs in itertools.zip_longest(
                                inputs, prev_outputs, fillvalue=[]
                            )
                        ],
                        num_generations=1,
                        **generation_kwargs,
                    )
                )
                for llm, generation_kwargs in zip(
                    self.proposers_llms, proposers_llms_kwargs
                )
            ]

            # Group generations per input
            outputs: List[List["GenerateOutput"]] = await asyncio.gather(*tasks)
            prev_outputs = [
                list(itertools.chain(*input_outputs)) for input_outputs in zip(*outputs)
            ]

        self._logger.debug("Aggregating outputs in MoA")  # type: ignore
        if isinstance(self.aggregator_llm, AsyncLLM):
            return await self.aggregator_llm._agenerate(
                inputs=[
                    self._inject_moa_system_prompt(
                        cast("StandardInput", input), prev_input_outputs
                    )
                    for input, prev_input_outputs in zip(inputs, prev_outputs)
                ],
                num_generations=num_generations,
                **aggregator_llm_kwargs,
            )

        return self.aggregator_llm.generate(
            inputs=[
                self._inject_moa_system_prompt(
                    cast("StandardInput", input), prev_input_outputs
                )
                for input, prev_input_outputs in zip(inputs, prev_outputs)
            ],
            num_generations=num_generations,
            **aggregator_llm_kwargs,
        )

`runtime_parameters_names` `property` ¶

Returns the runtime parameters of the LLM, which are a combination of the RuntimeParameters of the LLM, the aggregator_llm and the proposers_llms.

Returns:

Type	Description
`RuntimeParametersNames`	The runtime parameters of the `LLM`.

`model_name` `property` ¶

Returns the aggregated model name.

`load()` ¶

Loads all the LLMs in the MixtureOfAgents.

Source code in src/distilabel/models/llms/moa.py

def load(self) -> None:
    """Loads all the `LLM`s in the `MixtureOfAgents`."""
    super().load()

    for llm in self.proposers_llms:
        self._logger.debug(f"Loading proposer LLM in MoA: {llm}")  # type: ignore
        llm.load()

    self._logger.debug(f"Loading aggregator LLM in MoA: {self.aggregator_llm}")  # type: ignore
    self.aggregator_llm.load()

`get_generation_kwargs()` ¶

Returns the generation kwargs of the MixtureOfAgents as a dictionary.

Returns:

Type	Description
`Dict[str, Any]`	The generation kwargs of the `MixtureOfAgents`.

Source code in src/distilabel/models/llms/moa.py

def get_generation_kwargs(self) -> Dict[str, Any]:
    """Returns the generation kwargs of the `MixtureOfAgents` as a dictionary.

    Returns:
        The generation kwargs of the `MixtureOfAgents`.
    """
    return {
        "aggregator_llm": self.aggregator_llm.get_generation_kwargs(),
        "proposers_llms": [
            llm.get_generation_kwargs() for llm in self.proposers_llms
        ],
    }

`_build_moa_system_prompt(prev_outputs)` ¶

Builds the Mixture-of-Agents system prompt.

Parameters:

Name	Type	Description	Default
`prev_outputs`	`List[str]`	The list of previous outputs to use as references.	required

Returns:

Type	Description
`str`	The Mixture-of-Agents system prompt.

Source code in src/distilabel/models/llms/moa.py

def _build_moa_system_prompt(self, prev_outputs: List[str]) -> str:
    """Builds the Mixture-of-Agents system prompt.

    Args:
        prev_outputs: The list of previous outputs to use as references.

    Returns:
        The Mixture-of-Agents system prompt.
    """
    moa_system_prompt = MOA_SYSTEM_PROMPT
    for i, prev_output in enumerate(prev_outputs):
        if prev_output is not None:
            moa_system_prompt += f"\n{i + 1}. {prev_output}"
    return moa_system_prompt

`_inject_moa_system_prompt(input, prev_outputs)` ¶

Injects the Mixture-of-Agents system prompt into the input.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	The input to inject the system prompt into.	required
`prev_outputs`	`List[str]`	The list of previous outputs to use as references.	required

Returns:

Type	Description
`StandardInput`	The input with the Mixture-of-Agents system prompt injected.

Source code in src/distilabel/models/llms/moa.py

def _inject_moa_system_prompt(
    self, input: "StandardInput", prev_outputs: List[str]
) -> "StandardInput":
    """Injects the Mixture-of-Agents system prompt into the input.

    Args:
        input: The input to inject the system prompt into.
        prev_outputs: The list of previous outputs to use as references.

    Returns:
        The input with the Mixture-of-Agents system prompt injected.
    """
    if len(prev_outputs) == 0:
        return input

    moa_system_prompt = self._build_moa_system_prompt(prev_outputs)

    system = next((item for item in input if item["role"] == "system"), None)
    if system:
        original_system_prompt = system["content"]
        system["content"] = f"{moa_system_prompt}\n\n{original_system_prompt}"
    else:
        input.insert(0, {"role": "system", "content": moa_system_prompt})

    return input

`_agenerate(inputs, num_generations=1, **kwargs)` `async` ¶

Internal function to concurrently generate responses for a list of inputs.

Parameters:

Name	Type	Description	Default
`inputs`	`List[FormattedInput]`	the list of inputs to generate responses for.	required
`num_generations`	`int`	the number of generations to generate per input.	`1`
`**kwargs`	`Any`	the additional kwargs to be used for the generation.	`{}`

Returns:

Type	Description
`List[GenerateOutput]`	A list containing the generations for each input.

Source code in src/distilabel/models/llms/moa.py

async def _agenerate(
    self,
    inputs: List["FormattedInput"],
    num_generations: int = 1,
    **kwargs: Any,
) -> List["GenerateOutput"]:
    """Internal function to concurrently generate responses for a list of inputs.

    Args:
        inputs: the list of inputs to generate responses for.
        num_generations: the number of generations to generate per input.
        **kwargs: the additional kwargs to be used for the generation.

    Returns:
        A list containing the generations for each input.
    """
    aggregator_llm_kwargs: Dict[str, Any] = kwargs.get("aggregator_llm", {})
    proposers_llms_kwargs: List[Dict[str, Any]] = kwargs.get(
        "proposers_llms", [{}] * len(self.proposers_llms)
    )

    prev_outputs = []
    for round in range(self.rounds):
        self._logger.debug(f"Generating round {round + 1}/{self.rounds} in MoA")  # type: ignore

        # Generate `num_generations` with each proposer LLM for each input
        tasks = [
            asyncio.create_task(
                llm._agenerate(
                    inputs=[
                        self._inject_moa_system_prompt(
                            cast("StandardInput", input), prev_input_outputs
                        )
                        for input, prev_input_outputs in itertools.zip_longest(
                            inputs, prev_outputs, fillvalue=[]
                        )
                    ],
                    num_generations=1,
                    **generation_kwargs,
                )
            )
            for llm, generation_kwargs in zip(
                self.proposers_llms, proposers_llms_kwargs
            )
        ]

        # Group generations per input
        outputs: List[List["GenerateOutput"]] = await asyncio.gather(*tasks)
        prev_outputs = [
            list(itertools.chain(*input_outputs)) for input_outputs in zip(*outputs)
        ]

    self._logger.debug("Aggregating outputs in MoA")  # type: ignore
    if isinstance(self.aggregator_llm, AsyncLLM):
        return await self.aggregator_llm._agenerate(
            inputs=[
                self._inject_moa_system_prompt(
                    cast("StandardInput", input), prev_input_outputs
                )
                for input, prev_input_outputs in zip(inputs, prev_outputs)
            ],
            num_generations=num_generations,
            **aggregator_llm_kwargs,
        )

    return self.aggregator_llm.generate(
        inputs=[
            self._inject_moa_system_prompt(
                cast("StandardInput", input), prev_input_outputs
            )
            for input, prev_input_outputs in zip(inputs, prev_outputs)
        ],
        num_generations=num_generations,
        **aggregator_llm_kwargs,
    )

`OllamaLLM` ¶

Bases: AsyncLLM, MagpieChatTemplateMixin

Ollama LLM implementation running the Async API client.

Attributes:

Name	Type	Description
`model`	`str`	the model name to use for the LLM e.g. "notus".
`host`	`Optional[RuntimeParameter[str]]`	the Ollama server host.
`timeout`	`RuntimeParameter[int]`	the timeout for the LLM. Defaults to `120`.
`follow_redirects`	`bool`	whether to follow redirects. Defaults to `True`.
`structured_output`	`Optional[RuntimeParameter[InstructorStructuredOutputType]]`	a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
`tokenizer_id`	`Optional[RuntimeParameter[str]]`	the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the `model` will be used. Defaults to `None`.
`use_magpie_template`	`bool`	a flag used to enable/disable applying the Magpie pre-query template. Defaults to `False`.
`magpie_pre_query_template`	`Union[MagpieAvailablePreQueryTemplates, str, None]`	the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are "llama3", "qwen2" or another pre-query template provided. Defaults to `None`.
`_aclient`	`Optional[AsyncClient]`	the `AsyncClient` to use for the Ollama API. It is meant to be used internally. Set in the `load` method.

Runtime parameters

host: the Ollama server host.
timeout: the client timeout for the Ollama API. Defaults to 120.

Examples:

Generate text:

from distilabel.models.llms import OllamaLLM

llm = OllamaLLM(model="llama3")

llm.load()

# Call the model
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

Source code in src/distilabel/models/llms/ollama.py

class OllamaLLM(AsyncLLM, MagpieChatTemplateMixin):
    """Ollama LLM implementation running the Async API client.

    Attributes:
        model: the model name to use for the LLM e.g. "notus".
        host: the Ollama server host.
        timeout: the timeout for the LLM. Defaults to `120`.
        follow_redirects: whether to follow redirects. Defaults to `True`.
        structured_output: a dictionary containing the structured output configuration or if more
            fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
        tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing
            the tokenizer config files. If not provided, the one associated to the `model`
            will be used. Defaults to `None`.
        use_magpie_template: a flag used to enable/disable applying the Magpie pre-query
            template. Defaults to `False`.
        magpie_pre_query_template: the pre-query template to be applied to the prompt or
            sent to the LLM to generate an instruction or a follow up user message. Valid
            values are "llama3", "qwen2" or another pre-query template provided. Defaults
            to `None`.
        _aclient: the `AsyncClient` to use for the Ollama API. It is meant to be used internally.
            Set in the `load` method.

    Runtime parameters:
        - `host`: the Ollama server host.
        - `timeout`: the client timeout for the Ollama API. Defaults to `120`.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import OllamaLLM

        llm = OllamaLLM(model="llama3")

        llm.load()

        # Call the model
        output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```
    """

    model: str
    host: Optional[RuntimeParameter[str]] = Field(
        default=None, description="The host of the Ollama API."
    )
    timeout: RuntimeParameter[int] = Field(
        default=120, description="The timeout for the Ollama API."
    )
    follow_redirects: bool = True
    structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (
        Field(
            default=None,
            description="The structured output format to use across all the generations.",
        )
    )
    tokenizer_id: Optional[RuntimeParameter[str]] = Field(
        default=None,
        description="The Hugging Face Hub repo id or a path to a directory containing"
        " the tokenizer config files. If not provided, the one associated to the `model`"
        " will be used.",
    )
    _num_generations_param_supported = False
    _aclient: Optional["AsyncClient"] = PrivateAttr(...)  # type: ignore

    @model_validator(mode="after")  # type: ignore
    def validate_magpie_usage(
        self,
    ) -> "OllamaLLM":
        """Validates that magpie usage is valid."""

        if self.use_magpie_template and self.tokenizer_id is None:
            raise ValueError(
                "`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,"
                " set a `tokenizer_id` and try again."
            )

    def load(self) -> None:
        """Loads the `AsyncClient` to use Ollama async API."""
        super().load()

        try:
            from ollama import AsyncClient

            self._aclient = AsyncClient(
                host=self.host,
                timeout=self.timeout,
                follow_redirects=self.follow_redirects,
            )
        except ImportError as e:
            raise ImportError(
                "Ollama Python client is not installed. Please install it using"
                " `pip install 'distilabel[ollama]'`."
            ) from e

        if self.tokenizer_id:
            try:
                from transformers import AutoTokenizer
            except ImportError as ie:
                raise ImportError(
                    "Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`."
                ) from ie
            self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)
            if self._tokenizer.chat_template is None:
                raise ValueError(
                    "The tokenizer does not have a chat template. Please use a tokenizer with a chat template."
                )

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    async def _generate_chat_completion(
        self,
        input: "StandardInput",
        format: Literal["", "json"] = "",
        options: Union[Options, None] = None,
        keep_alive: Union[bool, None] = None,
    ) -> "ChatResponse":
        return await self._aclient.chat(
            model=self.model,
            messages=input,
            stream=False,
            format=format,
            options=options,
            keep_alive=keep_alive,
        )

    def prepare_input(self, input: "StandardInput") -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        prompt: str = (
            self._tokenizer.apply_chat_template(
                conversation=input,
                tokenize=False,
                add_generation_prompt=True,
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    async def _generate_with_text_generation(
        self,
        input: "StandardInput",
        format: Literal["", "json"] = None,
        options: Union[Options, None] = None,
        keep_alive: Union[bool, None] = None,
    ) -> "GenerateResponse":
        input = self.prepare_input(input)
        return await self._aclient.generate(
            model=self.model,
            prompt=input,
            format=format,
            options=options,
            keep_alive=keep_alive,
            raw=True,
        )

    @validate_call
    async def agenerate(
        self,
        input: StandardInput,
        format: Literal["", "json"] = "",
        # TODO: include relevant options from `Options` in `agenerate` method.
        options: Union[Options, None] = None,
        keep_alive: Union[bool, None] = None,
    ) -> GenerateOutput:
        """
        Generates a response asynchronously, using the [Ollama Async API definition](https://github.com/ollama/ollama-python).

        Args:
            input: the input to use for the generation.
            format: the format to use for the generation. Defaults to `""`.
            options: the options to use for the generation. Defaults to `None`.
            keep_alive: whether to keep the connection alive. Defaults to `None`.

        Returns:
            A list of strings as completion for the given input.
        """
        text = None
        try:
            if not format:
                format = None
            if self.tokenizer_id is None:
                completion = await self._generate_chat_completion(
                    input, format, options, keep_alive
                )
                text = completion["message"]["content"]
            else:
                completion = await self._generate_with_text_generation(
                    input, format, options, keep_alive
                )
                text = completion.response
        except Exception as e:
            self._logger.warning(  # type: ignore
                f"⚠️ Received no response using Ollama client (model: '{self.model_name}')."
                f" Finish reason was: {e}"
            )

        return prepare_output([text], **self._get_llm_statistics(completion))

    @staticmethod
    def _get_llm_statistics(completion: Dict[str, Any]) -> "LLMStatistics":
        return {
            "input_tokens": [completion["prompt_eval_count"]],
            "output_tokens": [completion["eval_count"]],
        }

`model_name` `property` ¶

Returns the model name used for the LLM.

`validate_magpie_usage()` ¶

Validates that magpie usage is valid.

Source code in src/distilabel/models/llms/ollama.py

@model_validator(mode="after")  # type: ignore
def validate_magpie_usage(
    self,
) -> "OllamaLLM":
    """Validates that magpie usage is valid."""

    if self.use_magpie_template and self.tokenizer_id is None:
        raise ValueError(
            "`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,"
            " set a `tokenizer_id` and try again."
        )

`load()` ¶

Loads the AsyncClient to use Ollama async API.

Source code in src/distilabel/models/llms/ollama.py

def load(self) -> None:
    """Loads the `AsyncClient` to use Ollama async API."""
    super().load()

    try:
        from ollama import AsyncClient

        self._aclient = AsyncClient(
            host=self.host,
            timeout=self.timeout,
            follow_redirects=self.follow_redirects,
        )
    except ImportError as e:
        raise ImportError(
            "Ollama Python client is not installed. Please install it using"
            " `pip install 'distilabel[ollama]'`."
        ) from e

    if self.tokenizer_id:
        try:
            from transformers import AutoTokenizer
        except ImportError as ie:
            raise ImportError(
                "Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`."
            ) from ie
        self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)
        if self._tokenizer.chat_template is None:
            raise ValueError(
                "The tokenizer does not have a chat template. Please use a tokenizer with a chat template."
            )

`prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/ollama.py

def prepare_input(self, input: "StandardInput") -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    prompt: str = (
        self._tokenizer.apply_chat_template(
            conversation=input,
            tokenize=False,
            add_generation_prompt=True,
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`agenerate(input, format='', options=None, keep_alive=None)` `async` ¶

Generates a response asynchronously, using the Ollama Async API definition.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	the input to use for the generation.	required
`format`	`Literal['', 'json']`	the format to use for the generation. Defaults to `""`.	`''`
`options`	`Union[Options, None]`	the options to use for the generation. Defaults to `None`.	`None`
`keep_alive`	`Union[bool, None]`	whether to keep the connection alive. Defaults to `None`.	`None`

Returns:

Type	Description
`GenerateOutput`	A list of strings as completion for the given input.

Source code in src/distilabel/models/llms/ollama.py

@validate_call
async def agenerate(
    self,
    input: StandardInput,
    format: Literal["", "json"] = "",
    # TODO: include relevant options from `Options` in `agenerate` method.
    options: Union[Options, None] = None,
    keep_alive: Union[bool, None] = None,
) -> GenerateOutput:
    """
    Generates a response asynchronously, using the [Ollama Async API definition](https://github.com/ollama/ollama-python).

    Args:
        input: the input to use for the generation.
        format: the format to use for the generation. Defaults to `""`.
        options: the options to use for the generation. Defaults to `None`.
        keep_alive: whether to keep the connection alive. Defaults to `None`.

    Returns:
        A list of strings as completion for the given input.
    """
    text = None
    try:
        if not format:
            format = None
        if self.tokenizer_id is None:
            completion = await self._generate_chat_completion(
                input, format, options, keep_alive
            )
            text = completion["message"]["content"]
        else:
            completion = await self._generate_with_text_generation(
                input, format, options, keep_alive
            )
            text = completion.response
    except Exception as e:
        self._logger.warning(  # type: ignore
            f"⚠️ Received no response using Ollama client (model: '{self.model_name}')."
            f" Finish reason was: {e}"
        )

    return prepare_output([text], **self._get_llm_statistics(completion))

`OpenAILLM` ¶

Bases: OpenAIBaseClient, AsyncLLM

OpenAI LLM implementation running the async API client.

Attributes:

Name	Type	Description
`model`	`str`	the model name to use for the LLM e.g. "gpt-3.5-turbo", "gpt-4", etc. Supported models can be found here.
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL to use for the OpenAI API requests. Defaults to `None`, which means that the value set for the environment variable `OPENAI_BASE_URL` will be used, or "https://api.openai.com/v1" if not set.
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the OpenAI API. Defaults to `None` which means that the value set for the environment variable `OPENAI_API_KEY` will be used, or `None` if not set.
`default_headers`	`Optional[RuntimeParameter[Dict[str, str]]]`	the default headers to use for the OpenAI API requests.
`max_retries`	`RuntimeParameter[int]`	the maximum number of times to retry the request to the API before failing. Defaults to `6`.
`timeout`	`RuntimeParameter[int]`	the maximum time in seconds to wait for a response from the API. Defaults to `120`.
`structured_output`	`Optional[RuntimeParameter[InstructorStructuredOutputType]]`	a dictionary containing the structured output configuration configuration using `instructor`. You can take a look at the dictionary structure in `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.

Runtime parameters

base_url: the base URL to use for the OpenAI API requests. Defaults to None.
api_key: the API key to authenticate the requests to the OpenAI API. Defaults to None.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.

Icon

:simple-openai:

Examples:

Generate text:

from distilabel.models.llms import OpenAILLM

llm = OpenAILLM(model="gpt-4-turbo", api_key="api.key")

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate text from a custom endpoint following the OpenAI API:

from distilabel.models.llms import OpenAILLM

llm = OpenAILLM(
    model="prometheus-eval/prometheus-7b-v2.0",
    base_url=r"http://localhost:8080/v1"
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

from pydantic import BaseModel
from distilabel.models.llms import OpenAILLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = OpenAILLM(
    model="gpt-4-turbo",
    api_key="api.key",
    structured_output={"schema": User}
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Generate with Batch API (offline batch generation):

from distilabel.models.llms import OpenAILLM

load = llm = OpenAILLM(
    model="gpt-3.5-turbo",
    use_offline_batch_generation=True,
    offline_batch_generation_block_until_done=5,  # poll for results every 5 seconds
)

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
# [['Hello! How can I assist you today?']]

Source code in src/distilabel/models/llms/openai.py

class OpenAILLM(OpenAIBaseClient, AsyncLLM):
    """OpenAI LLM implementation running the async API client.

    Attributes:
        model: the model name to use for the LLM e.g. "gpt-3.5-turbo", "gpt-4", etc.
            Supported models can be found [here](https://platform.openai.com/docs/guides/text-generation).
        base_url: the base URL to use for the OpenAI API requests. Defaults to `None`, which
            means that the value set for the environment variable `OPENAI_BASE_URL` will
            be used, or "https://api.openai.com/v1" if not set.
        api_key: the API key to authenticate the requests to the OpenAI API. Defaults to
            `None` which means that the value set for the environment variable `OPENAI_API_KEY`
            will be used, or `None` if not set.
        default_headers: the default headers to use for the OpenAI API requests.
        max_retries: the maximum number of times to retry the request to the API before
            failing. Defaults to `6`.
        timeout: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.
        structured_output: a dictionary containing the structured output configuration configuration
            using `instructor`. You can take a look at the dictionary structure in
            `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.

    Runtime parameters:
        - `base_url`: the base URL to use for the OpenAI API requests. Defaults to `None`.
        - `api_key`: the API key to authenticate the requests to the OpenAI API. Defaults
            to `None`.
        - `max_retries`: the maximum number of times to retry the request to the API before
            failing. Defaults to `6`.
        - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.

    Icon:
        `:simple-openai:`

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import OpenAILLM

        llm = OpenAILLM(model="gpt-4-turbo", api_key="api.key")

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate text from a custom endpoint following the OpenAI API:

        ```python
        from distilabel.models.llms import OpenAILLM

        llm = OpenAILLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            base_url=r"http://localhost:8080/v1"
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate structured data:

        ```python
        from pydantic import BaseModel
        from distilabel.models.llms import OpenAILLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = OpenAILLM(
            model="gpt-4-turbo",
            api_key="api.key",
            structured_output={"schema": User}
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```

        Generate with Batch API (offline batch generation):

        ```python
        from distilabel.models.llms import OpenAILLM

        load = llm = OpenAILLM(
            model="gpt-3.5-turbo",
            use_offline_batch_generation=True,
            offline_batch_generation_block_until_done=5,  # poll for results every 5 seconds
        )

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        # [['Hello! How can I assist you today?']]
        ```
    """

    def load(self) -> None:
        AsyncLLM.load(self)
        OpenAIBaseClient.load(self)

    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: FormattedInput,
        num_generations: int = 1,
        max_new_tokens: NonNegativeInt = 128,
        logprobs: bool = False,
        top_logprobs: Optional[PositiveInt] = None,
        echo: bool = False,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
        stop: Optional[Union[str, List[str]]] = None,
        response_format: Optional[Dict[str, str]] = None,
        extra_body: Optional[Dict[str, Any]] = None,
        seed: Optional[int] = None,
    ) -> GenerateOutput:
        """Generates `num_generations` responses for the given input using the OpenAI async
        client.

        Args:
            input: a single input in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            logprobs: whether to return the log probabilities or not. Defaults to `False`.
            top_logprobs: the number of top log probabilities to return per output token
                generated. Defaults to `None`.
            echo: whether to echo the input in the response or not. It's only used if the
                `input` argument is an `str`. Defaults to `False`.
            frequency_penalty: the repetition penalty to use for the generation. Defaults
                to `0.0`.
            presence_penalty: the presence penalty to use for the generation. Defaults to
                `0.0`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            stop: a string or a list of strings to use as a stop sequence for the generation.
                Defaults to `None`.
            response_format: the format of the response to return. Must be one of
                "text" or "json". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)
                for more information on how to use the JSON model from OpenAI. Defaults to None
                which returns text. To return JSON, use {"type": "json_object"}.
            extra_body: an optional dictionary containing extra body parameters that will
                be sent to the OpenAI API endpoint. Defaults to `None`.
            seed: an optional integer to seed the generation process for deterministic
                outcomes. Defaults to `None`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """

        if seed is not None:
            if extra_body is None:
                extra_body = {}
            extra_body["seed"] = seed

        if isinstance(input, str):
            return await self._generate_completion(
                input=input,
                num_generations=num_generations,
                max_new_tokens=max_new_tokens,
                echo=echo,
                top_logprobs=top_logprobs,
                frequency_penalty=frequency_penalty,
                presence_penalty=presence_penalty,
                temperature=temperature,
                top_p=top_p,
                extra_body=extra_body,
            )

        return await self._generate_chat_completion(
            input=input,
            num_generations=num_generations,
            max_new_tokens=max_new_tokens,
            logprobs=logprobs,
            top_logprobs=top_logprobs,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty,
            temperature=temperature,
            top_p=top_p,
            stop=stop,
            response_format=response_format,
            extra_body=extra_body,
        )

    async def _generate_completion(
        self,
        input: str,
        num_generations: int = 1,
        max_new_tokens: int = 128,
        echo: bool = False,
        top_logprobs: Optional[PositiveInt] = None,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
        extra_body: Optional[Dict[str, Any]] = None,
    ) -> GenerateOutput:
        completion = await self._aclient.completions.create(
            prompt=input,
            echo=echo,
            model=self.model,
            n=num_generations,
            max_tokens=max_new_tokens,
            logprobs=top_logprobs,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty,
            temperature=temperature,
            top_p=top_p,
            extra_body=extra_body,
        )

        generations = []
        logprobs = []
        for choice in completion.choices:
            generations.append(choice.text)
            if choice_logprobs := self._get_logprobs_from_completion_choice(choice):
                logprobs.append(choice_logprobs)

        statistics = self._get_llm_statistics(completion)
        return prepare_output(
            generations=generations,
            input_tokens=statistics["input_tokens"],
            output_tokens=statistics["output_tokens"],
            logprobs=logprobs,
        )

    def _get_logprobs_from_completion_choice(
        self, choice: "OpenAICompletionChoice"
    ) -> Union[List[Union[List["Logprob"], None]], None]:
        if choice.logprobs is None or choice.logprobs.top_logprobs is None:
            return None

        return [
            [
                {"token": token, "logprob": token_logprob}
                for token, token_logprob in logprobs.items()
            ]
            if logprobs is not None
            else None
            for logprobs in choice.logprobs.top_logprobs
        ]

    async def _generate_chat_completion(
        self,
        input: Union["StandardInput", "StructuredInput"],
        num_generations: int = 1,
        max_new_tokens: int = 128,
        logprobs: bool = False,
        top_logprobs: Optional[PositiveInt] = None,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
        stop: Optional[Union[str, List[str]]] = None,
        response_format: Optional[Dict[str, str]] = None,
        extra_body: Optional[Dict[str, Any]] = None,
    ) -> GenerateOutput:
        structured_output = None
        if isinstance(input, tuple):
            input, structured_output = input
            result = self._prepare_structured_output(
                structured_output=structured_output,  # type: ignore
                client=self._aclient,
                framework="openai",
            )
            self._aclient = result.get("client")  # type: ignore

        if structured_output is None and self.structured_output is not None:
            structured_output = self.structured_output

        kwargs = {
            "messages": input,  # type: ignore
            "model": self.model,
            "logprobs": logprobs,
            "top_logprobs": top_logprobs,
            "max_tokens": max_new_tokens,
            "n": num_generations,
            "frequency_penalty": frequency_penalty,
            "presence_penalty": presence_penalty,
            "temperature": temperature,
            "top_p": top_p,
            "stop": stop,
            "extra_body": extra_body,
        }

        # Checks if any message contains an image, in that case "stop" cannot be used or
        # raises an error in the API.
        if isinstance(
            [row for row in input if row["role"] == "user"][0]["content"], list
        ):
            kwargs.pop("stop")

        if response_format is not None:
            kwargs["response_format"] = response_format

        if structured_output:
            kwargs = self._prepare_kwargs(kwargs, structured_output)  # type: ignore

        completion = await self._aclient.chat.completions.create(**kwargs)  # type: ignore

        if structured_output:
            # NOTE: `instructor` doesn't work with `n` parameter, so it will always return
            # only 1 choice.
            statistics = self._get_llm_statistics(completion._raw_response)
            if choice_logprobs := self._get_logprobs_from_chat_completion_choice(
                completion._raw_response.choices[0]
            ):
                output_logprobs = [choice_logprobs]
            else:
                output_logprobs = None
            return prepare_output(
                generations=[completion.model_dump_json()],
                input_tokens=statistics["input_tokens"],
                output_tokens=statistics["output_tokens"],
                logprobs=output_logprobs,
            )

        return self._generations_from_openai_completion(completion)

    def _generations_from_openai_completion(
        self, completion: "OpenAIChatCompletion"
    ) -> "GenerateOutput":
        """Get the generations from the OpenAI Chat Completion object.

        Args:
            completion: the completion object to get the generations from.

        Returns:
            A list of strings containing the generated responses for the input.
        """
        generations = []
        logprobs = []
        for choice in completion.choices:
            if (content := choice.message.content) is None:
                self._logger.warning(  # type: ignore
                    f"Received no response using OpenAI client (model: '{self.model}')."
                    f" Finish reason was: {choice.finish_reason}"
                )
            generations.append(content)
            if choice_logprobs := self._get_logprobs_from_chat_completion_choice(
                choice
            ):
                logprobs.append(choice_logprobs)

        statistics = self._get_llm_statistics(completion)
        return prepare_output(
            generations=generations,
            input_tokens=statistics["input_tokens"],
            output_tokens=statistics["output_tokens"],
            logprobs=logprobs,
        )

    def _get_logprobs_from_chat_completion_choice(
        self, choice: "OpenAIChatCompletionChoice"
    ) -> Union[List[List["Logprob"]], None]:
        if choice.logprobs is None or choice.logprobs.content is None:
            return None

        return [
            [
                {"token": top_logprob.token, "logprob": top_logprob.logprob}
                for top_logprob in token_logprobs.top_logprobs
            ]
            for token_logprobs in choice.logprobs.content
        ]

    def offline_batch_generate(
        self,
        inputs: Union[List["FormattedInput"], None] = None,
        num_generations: int = 1,
        max_new_tokens: int = 128,
        logprobs: bool = False,
        top_logprobs: Optional[PositiveInt] = None,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
        stop: Optional[Union[str, List[str]]] = None,
        response_format: Optional[str] = None,
        **kwargs: Any,
    ) -> List["GenerateOutput"]:
        """Uses the OpenAI batch API to generate `num_generations` responses for the given
        inputs.

        Args:
            inputs: a list of inputs in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            logprobs: whether to return the log probabilities or not. Defaults to `False`.
            top_logprobs: the number of top log probabilities to return per output token
                generated. Defaults to `None`.
            frequency_penalty: the repetition penalty to use for the generation. Defaults
                to `0.0`.
            presence_penalty: the presence penalty to use for the generation. Defaults to
                `0.0`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            stop: a string or a list of strings to use as a stop sequence for the generation.
                Defaults to `None`.
            response_format: the format of the response to return. Must be one of
                "text" or "json". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)
                for more information on how to use the JSON model from OpenAI. Defaults to `text`.

        Returns:
            A list of lists of strings containing the generated responses for each input
            in `inputs`.

        Raises:
            DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation
                is not finished yet.
            ValueError: if no job IDs were found to retrieve the results from.
        """
        if self.jobs_ids:
            return self._check_and_get_batch_results()

        if inputs:
            self.jobs_ids = self._create_jobs(
                inputs=inputs,
                **{
                    "model": self.model,
                    "logprobs": logprobs,
                    "top_logprobs": top_logprobs,
                    "max_tokens": max_new_tokens,
                    "n": num_generations,
                    "frequency_penalty": frequency_penalty,
                    "presence_penalty": presence_penalty,
                    "temperature": temperature,
                    "top_p": top_p,
                    "stop": stop,
                    "response_format": response_format,
                },
            )
            raise DistilabelOfflineBatchGenerationNotFinishedException(
                jobs_ids=self.jobs_ids
            )

        raise ValueError("No `inputs` were provided and no `jobs_ids` were found.")

    def _check_and_get_batch_results(self) -> List["GenerateOutput"]:
        """Checks the status of the batch jobs and retrieves the results from the OpenAI
        Batch API.

        Returns:
            A list of lists of strings containing the generated responses for each input.

        Raises:
            ValueError: if no job IDs were found to retrieve the results from.
            DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation
                is not finished yet.
            RuntimeError: if the only batch job found failed.
        """
        if not self.jobs_ids:
            raise ValueError("No job IDs were found to retrieve the results from.")

        outputs = []
        for batch_id in self.jobs_ids:
            batch = self._get_openai_batch(batch_id)

            if batch.status in ("validating", "in_progress", "finalizing"):
                raise DistilabelOfflineBatchGenerationNotFinishedException(
                    jobs_ids=self.jobs_ids
                )

            if batch.status in ("failed", "expired", "cancelled", "cancelling"):
                self._logger.error(  # type: ignore
                    f"OpenAI API batch with ID '{batch_id}' failed with status '{batch.status}'."
                )
                if len(self.jobs_ids) == 1:
                    self.jobs_ids = None
                    raise RuntimeError(
                        f"The only OpenAI API Batch that was created with ID '{batch_id}'"
                        f" failed with status '{batch.status}'."
                    )

                continue

            outputs.extend(self._retrieve_batch_results(batch))

        # sort by `custom_id` to return the results in the same order as the inputs
        outputs = sorted(outputs, key=lambda x: int(x["custom_id"]))
        return [self._parse_output(output) for output in outputs]

    def _parse_output(self, output: Dict[str, Any]) -> "GenerateOutput":
        """Parses the output from the OpenAI Batch API into a list of strings.

        Args:
            output: the output to parse.

        Returns:
            A list of strings containing the generated responses for the input.
        """
        from openai.types.chat import ChatCompletion as OpenAIChatCompletion

        if "response" not in output:
            return []

        if output["response"]["status_code"] != 200:
            return []

        return self._generations_from_openai_completion(
            OpenAIChatCompletion(**output["response"]["body"])
        )

    def _get_openai_batch(self, batch_id: str) -> "OpenAIBatch":
        """Gets a batch from the OpenAI Batch API.

        Args:
            batch_id: the ID of the batch to retrieve.

        Returns:
            The batch retrieved from the OpenAI Batch API.

        Raises:
            openai.OpenAIError: if there was an error while retrieving the batch from the
                OpenAI Batch API.
        """
        import openai

        try:
            return self._client.batches.retrieve(batch_id)
        except openai.OpenAIError as e:
            self._logger.error(  # type: ignore
                f"Error while retrieving batch '{batch_id}' from OpenAI: {e}"
            )
            raise e

    def _retrieve_batch_results(self, batch: "OpenAIBatch") -> List[Dict[str, Any]]:
        """Retrieves the results of a batch from its output file, parsing the JSONL content
        into a list of dictionaries.

        Args:
            batch: the batch to retrieve the results from.

        Returns:
            A list of dictionaries containing the results of the batch.

        Raises:
            AssertionError: if no output file ID was found in the batch.
        """
        import openai

        assert batch.output_file_id, "No output file ID was found in the batch."

        try:
            file_response = self._client.files.content(batch.output_file_id)
            return [orjson.loads(line) for line in file_response.text.splitlines()]
        except openai.OpenAIError as e:
            self._logger.error(  # type: ignore
                f"Error while retrieving batch results from file '{batch.output_file_id}': {e}"
            )
            return []

    def _create_jobs(
        self, inputs: List["FormattedInput"], **kwargs: Any
    ) -> Tuple[str, ...]:
        """Creates jobs in the OpenAI Batch API to generate responses for the given inputs.

        Args:
            inputs: a list of inputs in chat format to generate responses for.
            kwargs: the keyword arguments to use for the generation.

        Returns:
            A list of job IDs created in the OpenAI Batch API.
        """
        batch_input_files = self._create_batch_files(inputs=inputs, **kwargs)
        jobs = []
        for batch_input_file in batch_input_files:
            if batch := self._create_batch_api_job(batch_input_file):
                jobs.append(batch.id)
        return tuple(jobs)

    def _create_batch_api_job(
        self, batch_input_file: "OpenAIFileObject"
    ) -> Union["OpenAIBatch", None]:
        """Creates a job in the OpenAI Batch API to generate responses for the given input
        file.

        Args:
            batch_input_file: the input file to generate responses for.

        Returns:
            The batch job created in the OpenAI Batch API.
        """
        import openai

        metadata = {"description": "distilabel"}

        if distilabel_pipeline_name := envs.DISTILABEL_PIPELINE_NAME:
            metadata["distilabel_pipeline_name"] = distilabel_pipeline_name

        if distilabel_pipeline_cache_id := envs.DISTILABEL_PIPELINE_CACHE_ID:
            metadata["distilabel_pipeline_cache_id"] = distilabel_pipeline_cache_id

        batch = None
        try:
            batch = self._client.batches.create(
                completion_window="24h",
                endpoint="/v1/chat/completions",
                input_file_id=batch_input_file.id,
                metadata=metadata,
            )
        except openai.OpenAIError as e:
            self._logger.error(  # type: ignore
                f"Error while creating OpenAI Batch API job for file with ID"
                f" '{batch_input_file.id}': {e}."
            )
            raise e
        return batch

    def _create_batch_files(
        self, inputs: List["FormattedInput"], **kwargs: Any
    ) -> List["OpenAIFileObject"]:
        """Creates the necessary input files for the batch API to generate responses. The
        maximum size of each file so the OpenAI Batch API can process it is 100MB, so we
        need to split the inputs into multiple files if necessary.

        More information: https://platform.openai.com/docs/api-reference/files/create

        Args:
            inputs: a list of inputs in chat format to generate responses for, optionally
                including structured output.
            kwargs: the keyword arguments to use for the generation.

        Returns:
            The list of file objects created for the OpenAI Batch API.

        Raises:
            openai.OpenAIError: if there was an error while creating the batch input file
                in the OpenAI Batch API.
        """
        import openai

        files = []
        for file_no, buffer in enumerate(
            self._create_jsonl_buffers(inputs=inputs, **kwargs)
        ):
            try:
                # TODO: add distilabel pipeline name and id
                batch_input_file = self._client.files.create(
                    file=(self._name_for_openai_files(file_no), buffer),
                    purpose="batch",
                )
                files.append(batch_input_file)
            except openai.OpenAIError as e:
                self._logger.error(  # type: ignore
                    f"Error while creating OpenAI batch input file: {e}"
                )
                raise e
        return files

    def _create_jsonl_buffers(
        self, inputs: List["FormattedInput"], **kwargs: Any
    ) -> Generator[io.BytesIO, None, None]:
        """Creates a generator of buffers containing the JSONL formatted inputs to be
        used by the OpenAI Batch API. The buffers created are of size 100MB or less.

        Args:
            inputs: a list of inputs in chat format to generate responses for, optionally
                including structured output.
            kwargs: the keyword arguments to use for the generation.

        Yields:
            A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch
            API.
        """
        buffer = io.BytesIO()
        buffer_current_size = 0
        for i, input in enumerate(inputs):
            # We create the smallest `custom_id` so we don't  increase the size of the file
            # to much, but we can still sort the results with the order of the inputs.
            row = self._create_jsonl_row(input=input, custom_id=str(i), **kwargs)
            row_size = len(row)
            if row_size + buffer_current_size > _OPENAI_BATCH_API_MAX_FILE_SIZE:
                buffer.seek(0)
                yield buffer
                buffer = io.BytesIO()
                buffer_current_size = 0
            buffer.write(row)
            buffer_current_size += row_size

        if buffer_current_size > 0:
            buffer.seek(0)
            yield buffer

    def _create_jsonl_row(
        self, input: "FormattedInput", custom_id: str, **kwargs: Any
    ) -> bytes:
        """Creates a JSONL formatted row to be used by the OpenAI Batch API.

        Args:
            input: a list of inputs in chat format to generate responses for, optionally
                including structured output.
            custom_id: a custom ID to use for the row.
            kwargs: the keyword arguments to use for the generation.

        Returns:
            A JSONL formatted row to be used by the OpenAI Batch API.
        """
        # TODO: depending on the format of the input, add `response_format` to the kwargs
        row = {
            "custom_id": custom_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"messages": input, **kwargs},
        }
        json_row = orjson.dumps(row)
        return json_row + b"\n"

    def _name_for_openai_files(self, file_no: int) -> str:
        if (
            envs.DISTILABEL_PIPELINE_NAME is None
            or envs.DISTILABEL_PIPELINE_CACHE_ID is None
        ):
            return f"distilabel-pipeline-fileno-{file_no}.jsonl"

        return f"distilabel-pipeline-{envs.DISTILABEL_PIPELINE_NAME}-{envs.DISTILABEL_PIPELINE_CACHE_ID}-fileno-{file_no}.jsonl"

    @staticmethod
    def _get_llm_statistics(
        completion: Union["OpenAIChatCompletion", "OpenAICompletion"],
    ) -> "LLMStatistics":
        return {
            "output_tokens": [
                completion.usage.completion_tokens if completion.usage else 0
            ],
            "input_tokens": [completion.usage.prompt_tokens if completion.usage else 0],
        }

`agenerate(input, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, echo=False, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, extra_body=None, seed=None)` `async` ¶

Generates num_generations responses for the given input using the OpenAI async client.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_new_tokens`	`NonNegativeInt`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`logprobs`	`bool`	whether to return the log probabilities or not. Defaults to `False`.	`False`
`top_logprobs`	`Optional[PositiveInt]`	the number of top log probabilities to return per output token generated. Defaults to `None`.	`None`
`echo`	`bool`	whether to echo the input in the response or not. It's only used if the `input` argument is an `str`. Defaults to `False`.	`False`
`frequency_penalty`	`float`	the repetition penalty to use for the generation. Defaults to `0.0`.	`0.0`
`presence_penalty`	`float`	the presence penalty to use for the generation. Defaults to `0.0`.	`0.0`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`1.0`
`top_p`	`float`	the top-p value to use for the generation. Defaults to `1.0`.	`1.0`
`stop`	`Optional[Union[str, List[str]]]`	a string or a list of strings to use as a stop sequence for the generation. Defaults to `None`.	`None`
`response_format`	`Optional[Dict[str, str]]`	the format of the response to return. Must be one of "text" or "json". Read the documentation here for more information on how to use the JSON model from OpenAI. Defaults to None which returns text. To return JSON, use {"type": "json_object"}.	`None`
`extra_body`	`Optional[Dict[str, Any]]`	an optional dictionary containing extra body parameters that will be sent to the OpenAI API endpoint. Defaults to `None`.	`None`
`seed`	`Optional[int]`	an optional integer to seed the generation process for deterministic outcomes. Defaults to `None`.	`None`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/openai.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: FormattedInput,
    num_generations: int = 1,
    max_new_tokens: NonNegativeInt = 128,
    logprobs: bool = False,
    top_logprobs: Optional[PositiveInt] = None,
    echo: bool = False,
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    temperature: float = 1.0,
    top_p: float = 1.0,
    stop: Optional[Union[str, List[str]]] = None,
    response_format: Optional[Dict[str, str]] = None,
    extra_body: Optional[Dict[str, Any]] = None,
    seed: Optional[int] = None,
) -> GenerateOutput:
    """Generates `num_generations` responses for the given input using the OpenAI async
    client.

    Args:
        input: a single input in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        logprobs: whether to return the log probabilities or not. Defaults to `False`.
        top_logprobs: the number of top log probabilities to return per output token
            generated. Defaults to `None`.
        echo: whether to echo the input in the response or not. It's only used if the
            `input` argument is an `str`. Defaults to `False`.
        frequency_penalty: the repetition penalty to use for the generation. Defaults
            to `0.0`.
        presence_penalty: the presence penalty to use for the generation. Defaults to
            `0.0`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        stop: a string or a list of strings to use as a stop sequence for the generation.
            Defaults to `None`.
        response_format: the format of the response to return. Must be one of
            "text" or "json". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)
            for more information on how to use the JSON model from OpenAI. Defaults to None
            which returns text. To return JSON, use {"type": "json_object"}.
        extra_body: an optional dictionary containing extra body parameters that will
            be sent to the OpenAI API endpoint. Defaults to `None`.
        seed: an optional integer to seed the generation process for deterministic
            outcomes. Defaults to `None`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """

    if seed is not None:
        if extra_body is None:
            extra_body = {}
        extra_body["seed"] = seed

    if isinstance(input, str):
        return await self._generate_completion(
            input=input,
            num_generations=num_generations,
            max_new_tokens=max_new_tokens,
            echo=echo,
            top_logprobs=top_logprobs,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty,
            temperature=temperature,
            top_p=top_p,
            extra_body=extra_body,
        )

    return await self._generate_chat_completion(
        input=input,
        num_generations=num_generations,
        max_new_tokens=max_new_tokens,
        logprobs=logprobs,
        top_logprobs=top_logprobs,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty,
        temperature=temperature,
        top_p=top_p,
        stop=stop,
        response_format=response_format,
        extra_body=extra_body,
    )

`_generations_from_openai_completion(completion)` ¶

Get the generations from the OpenAI Chat Completion object.

Parameters:

Name	Type	Description	Default
`completion`	`ChatCompletion`	the completion object to get the generations from.	required

Returns:

Type	Description
`GenerateOutput`	A list of strings containing the generated responses for the input.

Source code in src/distilabel/models/llms/openai.py

def _generations_from_openai_completion(
    self, completion: "OpenAIChatCompletion"
) -> "GenerateOutput":
    """Get the generations from the OpenAI Chat Completion object.

    Args:
        completion: the completion object to get the generations from.

    Returns:
        A list of strings containing the generated responses for the input.
    """
    generations = []
    logprobs = []
    for choice in completion.choices:
        if (content := choice.message.content) is None:
            self._logger.warning(  # type: ignore
                f"Received no response using OpenAI client (model: '{self.model}')."
                f" Finish reason was: {choice.finish_reason}"
            )
        generations.append(content)
        if choice_logprobs := self._get_logprobs_from_chat_completion_choice(
            choice
        ):
            logprobs.append(choice_logprobs)

    statistics = self._get_llm_statistics(completion)
    return prepare_output(
        generations=generations,
        input_tokens=statistics["input_tokens"],
        output_tokens=statistics["output_tokens"],
        logprobs=logprobs,
    )

`offline_batch_generate(inputs=None, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, **kwargs)` ¶

Uses the OpenAI batch API to generate num_generations responses for the given inputs.

Parameters:

Name	Type	Description	Default
`inputs`	`Union[List[FormattedInput], None]`	a list of inputs in chat format to generate responses for.	`None`
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`logprobs`	`bool`	whether to return the log probabilities or not. Defaults to `False`.	`False`
`top_logprobs`	`Optional[PositiveInt]`	the number of top log probabilities to return per output token generated. Defaults to `None`.	`None`
`frequency_penalty`	`float`	the repetition penalty to use for the generation. Defaults to `0.0`.	`0.0`
`presence_penalty`	`float`	the presence penalty to use for the generation. Defaults to `0.0`.	`0.0`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`1.0`
`top_p`	`float`	the top-p value to use for the generation. Defaults to `1.0`.	`1.0`
`stop`	`Optional[Union[str, List[str]]]`	a string or a list of strings to use as a stop sequence for the generation. Defaults to `None`.	`None`
`response_format`	`Optional[str]`	the format of the response to return. Must be one of "text" or "json". Read the documentation here for more information on how to use the JSON model from OpenAI. Defaults to `text`.	`None`

Returns:

Type	Description
`List[GenerateOutput]`	A list of lists of strings containing the generated responses for each input
`List[GenerateOutput]`	in `inputs`.

Raises:

Type	Description
`DistilabelOfflineBatchGenerationNotFinishedException`	if the batch generation is not finished yet.
`ValueError`	if no job IDs were found to retrieve the results from.

Source code in src/distilabel/models/llms/openai.py

def offline_batch_generate(
    self,
    inputs: Union[List["FormattedInput"], None] = None,
    num_generations: int = 1,
    max_new_tokens: int = 128,
    logprobs: bool = False,
    top_logprobs: Optional[PositiveInt] = None,
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    temperature: float = 1.0,
    top_p: float = 1.0,
    stop: Optional[Union[str, List[str]]] = None,
    response_format: Optional[str] = None,
    **kwargs: Any,
) -> List["GenerateOutput"]:
    """Uses the OpenAI batch API to generate `num_generations` responses for the given
    inputs.

    Args:
        inputs: a list of inputs in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        logprobs: whether to return the log probabilities or not. Defaults to `False`.
        top_logprobs: the number of top log probabilities to return per output token
            generated. Defaults to `None`.
        frequency_penalty: the repetition penalty to use for the generation. Defaults
            to `0.0`.
        presence_penalty: the presence penalty to use for the generation. Defaults to
            `0.0`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        stop: a string or a list of strings to use as a stop sequence for the generation.
            Defaults to `None`.
        response_format: the format of the response to return. Must be one of
            "text" or "json". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)
            for more information on how to use the JSON model from OpenAI. Defaults to `text`.

    Returns:
        A list of lists of strings containing the generated responses for each input
        in `inputs`.

    Raises:
        DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation
            is not finished yet.
        ValueError: if no job IDs were found to retrieve the results from.
    """
    if self.jobs_ids:
        return self._check_and_get_batch_results()

    if inputs:
        self.jobs_ids = self._create_jobs(
            inputs=inputs,
            **{
                "model": self.model,
                "logprobs": logprobs,
                "top_logprobs": top_logprobs,
                "max_tokens": max_new_tokens,
                "n": num_generations,
                "frequency_penalty": frequency_penalty,
                "presence_penalty": presence_penalty,
                "temperature": temperature,
                "top_p": top_p,
                "stop": stop,
                "response_format": response_format,
            },
        )
        raise DistilabelOfflineBatchGenerationNotFinishedException(
            jobs_ids=self.jobs_ids
        )

    raise ValueError("No `inputs` were provided and no `jobs_ids` were found.")

`_check_and_get_batch_results()` ¶

Checks the status of the batch jobs and retrieves the results from the OpenAI Batch API.

Returns:

Type	Description
`List[GenerateOutput]`	A list of lists of strings containing the generated responses for each input.

Raises:

Type	Description
`ValueError`	if no job IDs were found to retrieve the results from.
`DistilabelOfflineBatchGenerationNotFinishedException`	if the batch generation is not finished yet.
`RuntimeError`	if the only batch job found failed.

Source code in src/distilabel/models/llms/openai.py

def _check_and_get_batch_results(self) -> List["GenerateOutput"]:
    """Checks the status of the batch jobs and retrieves the results from the OpenAI
    Batch API.

    Returns:
        A list of lists of strings containing the generated responses for each input.

    Raises:
        ValueError: if no job IDs were found to retrieve the results from.
        DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation
            is not finished yet.
        RuntimeError: if the only batch job found failed.
    """
    if not self.jobs_ids:
        raise ValueError("No job IDs were found to retrieve the results from.")

    outputs = []
    for batch_id in self.jobs_ids:
        batch = self._get_openai_batch(batch_id)

        if batch.status in ("validating", "in_progress", "finalizing"):
            raise DistilabelOfflineBatchGenerationNotFinishedException(
                jobs_ids=self.jobs_ids
            )

        if batch.status in ("failed", "expired", "cancelled", "cancelling"):
            self._logger.error(  # type: ignore
                f"OpenAI API batch with ID '{batch_id}' failed with status '{batch.status}'."
            )
            if len(self.jobs_ids) == 1:
                self.jobs_ids = None
                raise RuntimeError(
                    f"The only OpenAI API Batch that was created with ID '{batch_id}'"
                    f" failed with status '{batch.status}'."
                )

            continue

        outputs.extend(self._retrieve_batch_results(batch))

    # sort by `custom_id` to return the results in the same order as the inputs
    outputs = sorted(outputs, key=lambda x: int(x["custom_id"]))
    return [self._parse_output(output) for output in outputs]

`_parse_output(output)` ¶

Parses the output from the OpenAI Batch API into a list of strings.

Parameters:

Name	Type	Description	Default
`output`	`Dict[str, Any]`	the output to parse.	required

Returns:

Type	Description
`GenerateOutput`	A list of strings containing the generated responses for the input.

Source code in src/distilabel/models/llms/openai.py

def _parse_output(self, output: Dict[str, Any]) -> "GenerateOutput":
    """Parses the output from the OpenAI Batch API into a list of strings.

    Args:
        output: the output to parse.

    Returns:
        A list of strings containing the generated responses for the input.
    """
    from openai.types.chat import ChatCompletion as OpenAIChatCompletion

    if "response" not in output:
        return []

    if output["response"]["status_code"] != 200:
        return []

    return self._generations_from_openai_completion(
        OpenAIChatCompletion(**output["response"]["body"])
    )

`_get_openai_batch(batch_id)` ¶

Gets a batch from the OpenAI Batch API.

Parameters:

Name	Type	Description	Default
`batch_id`	`str`	the ID of the batch to retrieve.	required

Returns:

Type	Description
`Batch`	The batch retrieved from the OpenAI Batch API.

Raises:

Type	Description
`OpenAIError`	if there was an error while retrieving the batch from the OpenAI Batch API.

Source code in src/distilabel/models/llms/openai.py

def _get_openai_batch(self, batch_id: str) -> "OpenAIBatch":
    """Gets a batch from the OpenAI Batch API.

    Args:
        batch_id: the ID of the batch to retrieve.

    Returns:
        The batch retrieved from the OpenAI Batch API.

    Raises:
        openai.OpenAIError: if there was an error while retrieving the batch from the
            OpenAI Batch API.
    """
    import openai

    try:
        return self._client.batches.retrieve(batch_id)
    except openai.OpenAIError as e:
        self._logger.error(  # type: ignore
            f"Error while retrieving batch '{batch_id}' from OpenAI: {e}"
        )
        raise e

`_retrieve_batch_results(batch)` ¶

Retrieves the results of a batch from its output file, parsing the JSONL content into a list of dictionaries.

Parameters:

Name	Type	Description	Default
`batch`	`Batch`	the batch to retrieve the results from.	required

Returns:

Type	Description
`List[Dict[str, Any]]`	A list of dictionaries containing the results of the batch.

Raises:

Type	Description
`AssertionError`	if no output file ID was found in the batch.

Source code in src/distilabel/models/llms/openai.py

def _retrieve_batch_results(self, batch: "OpenAIBatch") -> List[Dict[str, Any]]:
    """Retrieves the results of a batch from its output file, parsing the JSONL content
    into a list of dictionaries.

    Args:
        batch: the batch to retrieve the results from.

    Returns:
        A list of dictionaries containing the results of the batch.

    Raises:
        AssertionError: if no output file ID was found in the batch.
    """
    import openai

    assert batch.output_file_id, "No output file ID was found in the batch."

    try:
        file_response = self._client.files.content(batch.output_file_id)
        return [orjson.loads(line) for line in file_response.text.splitlines()]
    except openai.OpenAIError as e:
        self._logger.error(  # type: ignore
            f"Error while retrieving batch results from file '{batch.output_file_id}': {e}"
        )
        return []

`_create_jobs(inputs, **kwargs)` ¶

Creates jobs in the OpenAI Batch API to generate responses for the given inputs.

Parameters:

Name	Type	Description	Default
`inputs`	`List[FormattedInput]`	a list of inputs in chat format to generate responses for.	required
`kwargs`	`Any`	the keyword arguments to use for the generation.	`{}`

Returns:

Type	Description
`Tuple[str, ...]`	A list of job IDs created in the OpenAI Batch API.

Source code in src/distilabel/models/llms/openai.py

def _create_jobs(
    self, inputs: List["FormattedInput"], **kwargs: Any
) -> Tuple[str, ...]:
    """Creates jobs in the OpenAI Batch API to generate responses for the given inputs.

    Args:
        inputs: a list of inputs in chat format to generate responses for.
        kwargs: the keyword arguments to use for the generation.

    Returns:
        A list of job IDs created in the OpenAI Batch API.
    """
    batch_input_files = self._create_batch_files(inputs=inputs, **kwargs)
    jobs = []
    for batch_input_file in batch_input_files:
        if batch := self._create_batch_api_job(batch_input_file):
            jobs.append(batch.id)
    return tuple(jobs)

`_create_batch_api_job(batch_input_file)` ¶

Creates a job in the OpenAI Batch API to generate responses for the given input file.

Parameters:

Name	Type	Description	Default
`batch_input_file`	`FileObject`	the input file to generate responses for.	required

Returns:

Type	Description
`Union[Batch, None]`	The batch job created in the OpenAI Batch API.

Source code in src/distilabel/models/llms/openai.py

def _create_batch_api_job(
    self, batch_input_file: "OpenAIFileObject"
) -> Union["OpenAIBatch", None]:
    """Creates a job in the OpenAI Batch API to generate responses for the given input
    file.

    Args:
        batch_input_file: the input file to generate responses for.

    Returns:
        The batch job created in the OpenAI Batch API.
    """
    import openai

    metadata = {"description": "distilabel"}

    if distilabel_pipeline_name := envs.DISTILABEL_PIPELINE_NAME:
        metadata["distilabel_pipeline_name"] = distilabel_pipeline_name

    if distilabel_pipeline_cache_id := envs.DISTILABEL_PIPELINE_CACHE_ID:
        metadata["distilabel_pipeline_cache_id"] = distilabel_pipeline_cache_id

    batch = None
    try:
        batch = self._client.batches.create(
            completion_window="24h",
            endpoint="/v1/chat/completions",
            input_file_id=batch_input_file.id,
            metadata=metadata,
        )
    except openai.OpenAIError as e:
        self._logger.error(  # type: ignore
            f"Error while creating OpenAI Batch API job for file with ID"
            f" '{batch_input_file.id}': {e}."
        )
        raise e
    return batch

`_create_batch_files(inputs, **kwargs)` ¶

Creates the necessary input files for the batch API to generate responses. The maximum size of each file so the OpenAI Batch API can process it is 100MB, so we need to split the inputs into multiple files if necessary.

More information: https://platform.openai.com/docs/api-reference/files/create

Parameters:

Name	Type	Description	Default
`inputs`	`List[FormattedInput]`	a list of inputs in chat format to generate responses for, optionally including structured output.	required
`kwargs`	`Any`	the keyword arguments to use for the generation.	`{}`

Returns:

Type	Description
`List[FileObject]`	The list of file objects created for the OpenAI Batch API.

Raises:

Type	Description
`OpenAIError`	if there was an error while creating the batch input file in the OpenAI Batch API.

Source code in src/distilabel/models/llms/openai.py

def _create_batch_files(
    self, inputs: List["FormattedInput"], **kwargs: Any
) -> List["OpenAIFileObject"]:
    """Creates the necessary input files for the batch API to generate responses. The
    maximum size of each file so the OpenAI Batch API can process it is 100MB, so we
    need to split the inputs into multiple files if necessary.

    More information: https://platform.openai.com/docs/api-reference/files/create

    Args:
        inputs: a list of inputs in chat format to generate responses for, optionally
            including structured output.
        kwargs: the keyword arguments to use for the generation.

    Returns:
        The list of file objects created for the OpenAI Batch API.

    Raises:
        openai.OpenAIError: if there was an error while creating the batch input file
            in the OpenAI Batch API.
    """
    import openai

    files = []
    for file_no, buffer in enumerate(
        self._create_jsonl_buffers(inputs=inputs, **kwargs)
    ):
        try:
            # TODO: add distilabel pipeline name and id
            batch_input_file = self._client.files.create(
                file=(self._name_for_openai_files(file_no), buffer),
                purpose="batch",
            )
            files.append(batch_input_file)
        except openai.OpenAIError as e:
            self._logger.error(  # type: ignore
                f"Error while creating OpenAI batch input file: {e}"
            )
            raise e
    return files

`_create_jsonl_buffers(inputs, **kwargs)` ¶

Creates a generator of buffers containing the JSONL formatted inputs to be used by the OpenAI Batch API. The buffers created are of size 100MB or less.

Parameters:

Name	Type	Description	Default
`inputs`	`List[FormattedInput]`	a list of inputs in chat format to generate responses for, optionally including structured output.	required
`kwargs`	`Any`	the keyword arguments to use for the generation.	`{}`

Yields:

Type	Description
`BytesIO`	A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch
`BytesIO`	API.

Source code in src/distilabel/models/llms/openai.py

def _create_jsonl_buffers(
    self, inputs: List["FormattedInput"], **kwargs: Any
) -> Generator[io.BytesIO, None, None]:
    """Creates a generator of buffers containing the JSONL formatted inputs to be
    used by the OpenAI Batch API. The buffers created are of size 100MB or less.

    Args:
        inputs: a list of inputs in chat format to generate responses for, optionally
            including structured output.
        kwargs: the keyword arguments to use for the generation.

    Yields:
        A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch
        API.
    """
    buffer = io.BytesIO()
    buffer_current_size = 0
    for i, input in enumerate(inputs):
        # We create the smallest `custom_id` so we don't  increase the size of the file
        # to much, but we can still sort the results with the order of the inputs.
        row = self._create_jsonl_row(input=input, custom_id=str(i), **kwargs)
        row_size = len(row)
        if row_size + buffer_current_size > _OPENAI_BATCH_API_MAX_FILE_SIZE:
            buffer.seek(0)
            yield buffer
            buffer = io.BytesIO()
            buffer_current_size = 0
        buffer.write(row)
        buffer_current_size += row_size

    if buffer_current_size > 0:
        buffer.seek(0)
        yield buffer

`_create_jsonl_row(input, custom_id, **kwargs)` ¶

Creates a JSONL formatted row to be used by the OpenAI Batch API.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a list of inputs in chat format to generate responses for, optionally including structured output.	required
`custom_id`	`str`	a custom ID to use for the row.	required
`kwargs`	`Any`	the keyword arguments to use for the generation.	`{}`

Returns:

Type	Description
`bytes`	A JSONL formatted row to be used by the OpenAI Batch API.

Source code in src/distilabel/models/llms/openai.py

def _create_jsonl_row(
    self, input: "FormattedInput", custom_id: str, **kwargs: Any
) -> bytes:
    """Creates a JSONL formatted row to be used by the OpenAI Batch API.

    Args:
        input: a list of inputs in chat format to generate responses for, optionally
            including structured output.
        custom_id: a custom ID to use for the row.
        kwargs: the keyword arguments to use for the generation.

    Returns:
        A JSONL formatted row to be used by the OpenAI Batch API.
    """
    # TODO: depending on the format of the input, add `response_format` to the kwargs
    row = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"messages": input, **kwargs},
    }
    json_row = orjson.dumps(row)
    return json_row + b"\n"

`ClientSGLang` ¶

Bases: OpenAILLM, MagpieChatTemplateMixin

A client for the SGLang server implementing the OpenAI API specification.

Attributes:

Name	Type	Description
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL of the `SGLang` server. Defaults to `"http://localhost:30000"`.
`max_retries`	`RuntimeParameter[int]`	the maximum number of times to retry the request to the API before failing. Defaults to `6`.
`timeout`	`RuntimeParameter[int]`	the maximum time in seconds to wait for a response from the API. Defaults to `120`.
`httpx_client_kwargs`	`RuntimeParameter[int]`	extra kwargs that will be passed to the `httpx.AsyncClient` created to comunicate with the `vLLM` server. Defaults to `None`.
`tokenizer`	`Optional[str]`	the Hugging Face Hub repo id or path of the tokenizer that will be used to apply the chat template and tokenize the inputs before sending it to the server. Defaults to `None`.
`tokenizer_revision`	`Optional[str]`	the revision of the tokenizer to load. Defaults to `None`.
`_aclient`	`AsyncOpenAI`	the `httpx.AsyncClient` used to comunicate with the `vLLM` server. Defaults to `None`.

Runtime parameters

base_url: the base url of the vLLM server. Defaults to "http://localhost:30000".
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
httpx_client_kwargs: extra kwargs that will be passed to the httpx.AsyncClient created to comunicate with the vLLM server. Defaults to None.

Examples:

Generate text:

from distilabel.models.llms import ClientSGLang

llm = ClientSGLang(
    base_url="http://localhost:30000/v1",
    tokenizer="Qwen/Qwen2-7B-Instruct"
)

llm.load()

results = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Hello, how are you?"}]],
    temperature=0.7,
    top_p=1.0,
    max_new_tokens=256,
)

Source code in src/distilabel/models/llms/sglang.py

class ClientSGLang(OpenAILLM, MagpieChatTemplateMixin):
    """A client for the `SGLang` server implementing the OpenAI API specification.

    Attributes:
        base_url: the base URL of the `SGLang` server. Defaults to `"http://localhost:30000"`.
        max_retries: the maximum number of times to retry the request to the API before
            failing. Defaults to `6`.
        timeout: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.
        httpx_client_kwargs: extra kwargs that will be passed to the `httpx.AsyncClient`
            created to comunicate with the `vLLM` server. Defaults to `None`.
        tokenizer: the Hugging Face Hub repo id or path of the tokenizer that will be used
            to apply the chat template and tokenize the inputs before sending it to the
            server. Defaults to `None`.
        tokenizer_revision: the revision of the tokenizer to load. Defaults to `None`.
        _aclient: the `httpx.AsyncClient` used to comunicate with the `vLLM` server. Defaults
            to `None`.

    Runtime parameters:
        - `base_url`: the base url of the `vLLM` server. Defaults to `"http://localhost:30000"`.
        - `max_retries`: the maximum number of times to retry the request to the API before
            failing. Defaults to `6`.
        - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.
        - `httpx_client_kwargs`: extra kwargs that will be passed to the `httpx.AsyncClient`
            created to comunicate with the `vLLM` server. Defaults to `None`.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import ClientSGLang

        llm = ClientSGLang(
            base_url="http://localhost:30000/v1",
            tokenizer="Qwen/Qwen2-7B-Instruct"
        )

        llm.load()

        results = llm.generate_outputs(
            inputs=[[{"role": "user", "content": "Hello, how are you?"}]],
            temperature=0.7,
            top_p=1.0,
            max_new_tokens=256,
        )
        ```
    """

    model: str = ""  # Default value so it's not needed to `ClientvLLM(model="...")`
    tokenizer: Optional[str] = None
    tokenizer_revision: Optional[str] = None

    # We need the sync client to get the list of models
    _client: "OpenAI" = PrivateAttr(None)
    _tokenizer: "PreTrainedTokenizer" = PrivateAttr(None)

    def load(self) -> None:
        """Creates an `httpx.AsyncClient` to connect to the vLLM server and a tokenizer
        optionally."""

        self.api_key = SecretStr("EMPTY")

        # We need to first create the sync client to get the model name that will be used
        # in the `super().load()` when creating the logger.
        try:
            from openai import OpenAI
        except ImportError as ie:
            raise ImportError(
                "OpenAI Python client is not installed. Please install it using"
                " `pip install 'distilabel[openai]'`."
            ) from ie

        self._client = OpenAI(
            base_url=self.base_url,
            api_key=self.api_key.get_secret_value(),  # type: ignore
            max_retries=self.max_retries,  # type: ignore
            timeout=self.timeout,
        )

        super().load()

        try:
            from transformers import AutoTokenizer
        except ImportError as ie:
            raise ImportError(
                "To use `ClientvLLM` you need to install `transformers`."
                "Please install it using `pip install 'distilabel[hf-transformers]'`."
            ) from ie

        self._tokenizer = AutoTokenizer.from_pretrained(
            self.tokenizer, revision=self.tokenizer_revision
        )

    @cached_property
    def model_name(self) -> str:  # type: ignore
        """Returns the name of the model served with vLLM server."""
        models = self._client.models.list()
        return models.data[0].id

    def _prepare_input(self, input: "StandardInput") -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        prompt: str = (
            self._tokenizer.apply_chat_template(  # type: ignore
                input,  # type: ignore
                tokenize=False,
                add_generation_prompt=True,  # type: ignore
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: FormattedInput,
        num_generations: int = 1,
        max_new_tokens: int = 128,
        frequency_penalty: float = 0.0,
        logit_bias: Optional[Dict[str, int]] = None,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
    ) -> GenerateOutput:
        """Generates `num_generations` responses for each input.

        Args:
            input: a single input in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            frequency_penalty: the repetition penalty to use for the generation. Defaults
                to `0.0`.
            logit_bias: modify the likelihood of specified tokens appearing in the completion.
                Defaults to ``
            presence_penalty: the presence penalty to use for the generation. Defaults to
                `0.0`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: nucleus sampling. The value refers to the top-p tokens that should be
                considered for sampling. Defaults to `1.0`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """

        completion = await self._aclient.completions.create(
            model=self.model_name,
            prompt=self._prepare_input(input),  # type: ignore
            n=num_generations,
            max_tokens=max_new_tokens,
            frequency_penalty=frequency_penalty,
            logit_bias=logit_bias,
            presence_penalty=presence_penalty,
            temperature=temperature,
            top_p=top_p,
        )

        generations = []
        for choice in completion.choices:
            text = choice.text
            if text == "":
                self._logger.warning(  # type: ignore
                    f"Received no response from SGLang server (model: '{self.model_name}')."
                    f" Finish reason was: {choice.finish_reason}"
                )
            generations.append(text)

        return prepare_output(generations, **self._get_llm_statistics(completion))

`model_name` `cached` `property` ¶

Returns the name of the model served with vLLM server.

`load()` ¶

Creates an httpx.AsyncClient to connect to the vLLM server and a tokenizer optionally.

Source code in src/distilabel/models/llms/sglang.py

def load(self) -> None:
    """Creates an `httpx.AsyncClient` to connect to the vLLM server and a tokenizer
    optionally."""

    self.api_key = SecretStr("EMPTY")

    # We need to first create the sync client to get the model name that will be used
    # in the `super().load()` when creating the logger.
    try:
        from openai import OpenAI
    except ImportError as ie:
        raise ImportError(
            "OpenAI Python client is not installed. Please install it using"
            " `pip install 'distilabel[openai]'`."
        ) from ie

    self._client = OpenAI(
        base_url=self.base_url,
        api_key=self.api_key.get_secret_value(),  # type: ignore
        max_retries=self.max_retries,  # type: ignore
        timeout=self.timeout,
    )

    super().load()

    try:
        from transformers import AutoTokenizer
    except ImportError as ie:
        raise ImportError(
            "To use `ClientvLLM` you need to install `transformers`."
            "Please install it using `pip install 'distilabel[hf-transformers]'`."
        ) from ie

    self._tokenizer = AutoTokenizer.from_pretrained(
        self.tokenizer, revision=self.tokenizer_revision
    )

`_prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/sglang.py

def _prepare_input(self, input: "StandardInput") -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    prompt: str = (
        self._tokenizer.apply_chat_template(  # type: ignore
            input,  # type: ignore
            tokenize=False,
            add_generation_prompt=True,  # type: ignore
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`agenerate(input, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, logit_bias=None, presence_penalty=0.0, temperature=1.0, top_p=1.0)` `async` ¶

Generates num_generations responses for each input.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`frequency_penalty`	`float`	the repetition penalty to use for the generation. Defaults to `0.0`.	`0.0`
`logit_bias`	`Optional[Dict[str, int]]`	modify the likelihood of specified tokens appearing in the completion. Defaults to ``	`None`
`presence_penalty`	`float`	the presence penalty to use for the generation. Defaults to `0.0`.	`0.0`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`1.0`
`top_p`	`float`	nucleus sampling. The value refers to the top-p tokens that should be considered for sampling. Defaults to `1.0`.	`1.0`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/sglang.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: FormattedInput,
    num_generations: int = 1,
    max_new_tokens: int = 128,
    frequency_penalty: float = 0.0,
    logit_bias: Optional[Dict[str, int]] = None,
    presence_penalty: float = 0.0,
    temperature: float = 1.0,
    top_p: float = 1.0,
) -> GenerateOutput:
    """Generates `num_generations` responses for each input.

    Args:
        input: a single input in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        frequency_penalty: the repetition penalty to use for the generation. Defaults
            to `0.0`.
        logit_bias: modify the likelihood of specified tokens appearing in the completion.
            Defaults to ``
        presence_penalty: the presence penalty to use for the generation. Defaults to
            `0.0`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: nucleus sampling. The value refers to the top-p tokens that should be
            considered for sampling. Defaults to `1.0`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """

    completion = await self._aclient.completions.create(
        model=self.model_name,
        prompt=self._prepare_input(input),  # type: ignore
        n=num_generations,
        max_tokens=max_new_tokens,
        frequency_penalty=frequency_penalty,
        logit_bias=logit_bias,
        presence_penalty=presence_penalty,
        temperature=temperature,
        top_p=top_p,
    )

    generations = []
    for choice in completion.choices:
        text = choice.text
        if text == "":
            self._logger.warning(  # type: ignore
                f"Received no response from SGLang server (model: '{self.model_name}')."
                f" Finish reason was: {choice.finish_reason}"
            )
        generations.append(text)

    return prepare_output(generations, **self._get_llm_statistics(completion))

`SGLang` ¶

Bases: LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin

SGLang library LLM implementation.

Attributes:

Name	Type	Description
`model`	`str`	the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
`dtype`	`str`	the data type to use for the model. Defaults to `auto`.
`trust_remote_code`	`bool`	whether to trust the remote code when loading the model. Defaults to `False`.
`quantization`	`Optional[str]`	the quantization mode to use for the model. Defaults to `None`.
`revision`	`Optional[str]`	the revision of the model to load. Defaults to `None`.
`tokenizer`	`Optional[str]`	the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer files. Defaults to `model`.
`tokenizer_mode`	`Literal['auto', 'slow']`	the mode to use for the tokenizer. Defaults to `auto`.
`skip_tokenizer_init`	`bool`	whether to skip the initialization of the tokenizer. Defaults to `False`.
`chat_template`	`Optional[str]`	a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to `None`.
`structured_output`	`Optional[RuntimeParameter[OutlinesStructuredOutputType]]`	a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
`seed`	`int`	the seed to use for the random number generator. Defaults to `0`.
`extra_kwargs`	`Optional[RuntimeParameter[Dict[str, Any]]]`	additional dictionary of keyword arguments that will be passed to the `Engine` class of `sglang` library. Defaults to `{}`.
`_model`	`Engine`	the `SGLang` model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the `load` method.
`_tokenizer`	`PreTrainedTokenizer`	the tokenizer instance used to format the prompt before passing it to the `LLM`. It will be set in the `load` method.
`use_magpie_template`	`bool`	a flag used to enable/disable applying the Magpie pre-query template. Defaults to `False`.
`magpie_pre_query_template`	`Union[MagpieAvailablePreQueryTemplates, str, None]`	the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are "llama3", "qwen2" or another pre-query template provided. Defaults to `None`.

References

https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py

Runtime parameters

extra_kwargs: additional dictionary of keyword arguments that will be passed to the LLM class of SGLang library.

Examples:

Generate text:

from distilabel.models.llms import SGLang
if __name__ == "__main__":
    llm = SGLang(
        model="Qwen/Qwen2.5-Coder-3B-Instruct",
        chat_template="[INST] {{ messages[0]['content']}} [/INST]"
    )

    llm.load()
    # Call the model
    output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

from distilabel.models.llms import SGLang
from pydantic import BaseModel

if __name__ == "__main__":

    class User(BaseModel):
        name: str
        last_name: str
        id: int

    llm = SGLang(
        model="Qwen/Qwen2.5-Coder-3B-Instruct",
        structured_output={"format": "json", "schema": User},
    )

    llm.load()
    # Call the model
    output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/sglang.py

class SGLang(LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin):
    """`SGLang` library LLM implementation.

    Attributes:
        model: the model Hugging Face Hub repo id or a path to a directory containing the
            model weights and configuration files.
        dtype: the data type to use for the model. Defaults to `auto`.
        trust_remote_code: whether to trust the remote code when loading the model. Defaults
            to `False`.
        quantization: the quantization mode to use for the model. Defaults to `None`.
        revision: the revision of the model to load. Defaults to `None`.
        tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing
            the tokenizer files. Defaults to `model`.
        tokenizer_mode: the mode to use for the tokenizer. Defaults to `auto`.
        skip_tokenizer_init: whether to skip the initialization of the tokenizer. Defaults
            to `False`.
        chat_template: a chat template that will be used to build the prompts before
            sending them to the model. If not provided, the chat template defined in the
            tokenizer config will be used. If not provided and the tokenizer doesn't have
            a chat template, then ChatML template will be used. Defaults to `None`.
        structured_output: a dictionary containing the structured output configuration or if more
            fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
        seed: the seed to use for the random number generator. Defaults to `0`.
        extra_kwargs: additional dictionary of keyword arguments that will be passed to the
            `Engine` class of `sglang` library. Defaults to `{}`.
        _model: the `SGLang` model instance. This attribute is meant to be used internally
            and should not be accessed directly. It will be set in the `load` method.
        _tokenizer: the tokenizer instance used to format the prompt before passing it to
            the `LLM`. It will be set in the `load` method.
        use_magpie_template: a flag used to enable/disable applying the Magpie pre-query
            template. Defaults to `False`.
        magpie_pre_query_template: the pre-query template to be applied to the prompt or
            sent to the LLM to generate an instruction or a follow up user message. Valid
            values are "llama3", "qwen2" or another pre-query template provided. Defaults
            to `None`.

    References:
        - https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py

    Runtime parameters:
        - `extra_kwargs`: additional dictionary of keyword arguments that will be passed to
            the `LLM` class of `SGLang` library.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import SGLang
        if __name__ == "__main__":
            llm = SGLang(
                model="Qwen/Qwen2.5-Coder-3B-Instruct",
                chat_template="[INST] {{ messages[0]['content']}} [/INST]"
            )

            llm.load()
            # Call the model
            output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate structured data:

        ```python
        from distilabel.models.llms import SGLang
        from pydantic import BaseModel

        if __name__ == "__main__":

            class User(BaseModel):
                name: str
                last_name: str
                id: int

            llm = SGLang(
                model="Qwen/Qwen2.5-Coder-3B-Instruct",
                structured_output={"format": "json", "schema": User},
            )

            llm.load()
            # Call the model
            output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    model: str
    dtype: str = "auto"
    trust_remote_code: bool = False
    quantization: Optional[str] = None
    revision: Optional[str] = None

    tokenizer: Optional[str] = None
    tokenizer_mode: Literal["auto", "slow"] = "auto"
    skip_tokenizer_init: bool = False
    chat_template: Optional[str] = None

    seed: int = 0

    extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(
        default_factory=dict,
        description="Additional dictionary of keyword arguments that will be passed to the"
        " `Engine` class of `sglang` library. See all the supported arguments at: "
        "https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/engine.py",
    )
    structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(
        default=None,
        description="The structured output format to use across all the generations.",
    )

    _model: "Engine" = PrivateAttr(None)
    _tokenizer: "PreTrainedTokenizer" = PrivateAttr(None)

    def load(self) -> None:
        """Loads the `SGLang` model using either the path or the Hugging Face Hub repository id.
        Additionally, this method also sets the `chat_template` for the tokenizer, so as to properly
        parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the
        default value is ChatML format, unless explicitly provided.
        """
        super().load()

        CudaDevicePlacementMixin.load(self)

        try:
            from sglang import Engine
        except ImportError as err:
            raise ImportError(
                "sglang is not installed. Please install it with sglang document https://docs.sglang.ai/start/install.html."
            ) from err

        self._model = Engine(
            model_path=self.model,
            dtype=self.dtype,
            trust_remote_code=self.trust_remote_code,
            quantization=self.quantization,
            revision=self.revision,
            tokenizer_path=self.tokenizer,
            tokenizer_mode=self.tokenizer_mode,
            skip_tokenizer_init=self.skip_tokenizer_init,
            random_seed=self.seed,
            **self.extra_kwargs,  # type: ignore
        )
        from sglang.srt.hf_transformers_utils import get_tokenizer

        self._tokenizer = get_tokenizer(
            self.model,
            tokenizer_mode=self.tokenizer_mode,
            trust_remote_code=self.trust_remote_code,
            tokenizer_revision="main",
        )  # type: ignore
        if self.chat_template is not None:
            self._tokenizer.chat_template = self.chat_template  # type: ignore

    def unload(self) -> None:
        """Unloads the `SGLang` model."""
        self._cleanup_sglang_model()
        self._model = None  # type: ignore
        self._tokenizer = None  # type: ignore
        CudaDevicePlacementMixin.unload(self)
        super().unload()

    def _cleanup_sglang_model(self) -> None:
        if self._model is None:
            return

        import torch  # noqa

        self._model.shutdown()
        with contextlib.suppress(AssertionError):
            torch.distributed.destroy_process_group()
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    def prepare_input(self, input: Union["StandardInput", str]) -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        if isinstance(input, str):
            return input

        prompt: str = (
            self._tokenizer.apply_chat_template(
                input,  # type: ignore
                tokenize=False,
                add_generation_prompt=True,  # type: ignore
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    def _prepare_batches(
        self, inputs: List["StructuredInput"]
    ) -> Tuple[List[Tuple[List[str], "OutlinesStructuredOutputType"]], List[int]]:
        """Prepares the inputs by grouping them by the structured output.

        When we generate structured outputs with schemas obtained from a dataset, we need to
        prepare the data to try to send batches of inputs instead of single inputs to the model
        to take advante of the engine. So we group the inputs by the structured output to be
        passed in the `generate` method.

        Args:
            inputs: The batch of inputs passed to the generate method. As we expect to be generating
                structured outputs, each element will be a tuple containing the instruction and the
                structured output.

        Returns:
            The prepared batches (sub-batches let's say) to be passed to the `generate` method.
            Each new tuple will contain instead of the single instruction, a list of instructions
        """
        instruction_order = {}
        batches: Dict[str, List[str]] = {}
        for i, (instruction, structured_output) in enumerate(inputs):
            instruction = self.prepare_input(instruction)
            instruction_order[instruction] = i

            structured_output = json.dumps(structured_output)
            if structured_output not in batches:
                batches[structured_output] = [instruction]
            else:
                batches[structured_output].append(instruction)

        # Built a list with instructions sorted by structured output
        flat_instructions = [
            instruction for _, group in batches.items() for instruction in group
        ]

        # Generate the list of indices based on the original order
        sorted_indices = [
            instruction_order[instruction] for instruction in flat_instructions
        ]

        return [
            (batch, json.loads(schema)) for schema, batch in batches.items()
        ], sorted_indices

    @validate_call
    def generate(  # noqa: C901 # type: ignore
        self,
        inputs: List[FormattedInput],
        num_generations: int = 1,
        max_new_tokens: int = 128,
        stop: Optional[Union[str, List[str]]] = None,
        stop_token_ids: Optional[List[int]] = None,
        temperature: float = 1.0,
        top_p: float = 1.0,
        top_k: int = -1,
        min_p: float = 0.0,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        repetition_penalty: float = 1.0,
        min_new_tokens: int = 0,
        spaces_between_special_tokens: bool = True,
        json_schema: Optional[str] = None,
        regex: Optional[str] = None,
        no_stop_trim: bool = False,
        ignore_eos: bool = False,
        skip_special_tokens: bool = True,
        custom_params: Optional[Dict[str, Any]] = None,  # no use?
        return_logprob: bool = False,
        top_logprobs_num: int = 0,
        echo: bool = False,
    ) -> List[GenerateOutput]:
        """Generates `num_generations` responses for each input.

        Args:
            inputs: a list of inputs in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            stop: a list of strings that will be used to stop the generation when found.
                Defaults to `None`.
            stop_token_ids: a list of token ids that will be used to stop the generation
                when found. Defaults to `None`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            top_k: the top-k value to use for the generation. Defaults to `0`.
            min_p: the minimum probability to use for the generation. Defaults to `0.0`.
            frequency_penalty: the repetition penalty to use for the generation. Defaults
                to `0.0`.
            presence_penalty: the presence penalty to use for the generation. Defaults to
                `0.0`.
            repetition_penalty: the repetition penalty to use for the generation Defaults to
                `1.0`.
            min_new_tokens: Forces the model to generate at least `min_new_tokens` until
                a stop word or EOS token is sampled. Defaults to `0`.
            spaces_between_special_tokens:  Whether or not to add spaces between special
                tokens during detokenization. Defaults to `True`.
            json_schema: json structure output. Defaults to `None`.
            regex: regex structure output. Defaults to `None`.
            no_stop_trim: Don't trim stop words or EOS token from the generated text.
                Defaults to `False`.
            ignore_eos: Don't stop generation when EOS token is sampled. Defaults to `False`.
            skip_special_tokens: Whether to exclude special tokens from the output. Defaults
                to `False`.
            custom_params: Used when employing `CustomLogitProcessor`. Defaults to `None`.
            return_logprob: Whether to return log probabilities for tokens. Defaults to `False`.
            top_logprobs_num: If returning log probabilities, specifies the number of
                top logprobs to return at each position. Defaults to `0`.
            echo: Whether to echo the include the prompt in the response or not. Defaults
                to `False`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """

        if isinstance(inputs[0], tuple):
            # Prepare the batches for structured generation
            prepared_batches, sorted_indices = self._prepare_batches(inputs)  # type: ignore
        else:
            # Simulate a batch without the structured output content
            prepared_batches = [([self.prepare_input(input) for input in inputs], None)]  # type: ignore
            sorted_indices = None

        batched_outputs: List["LLMOutput"] = []
        generations = []

        for prepared_inputs, structured_output in prepared_batches:
            if self.structured_output is not None and structured_output is not None:
                self._logger.warning(
                    "An `structured_output` was provided in the model configuration, but"
                    " one was also provided in the input. The input structured output will"
                    " be used."
                )

            temp_structure = None
            if structured_output is not None:
                temp_structure = structured_output
            elif self.structured_output is not None:
                temp_structure = self.structured_output

            if temp_structure is not None:
                format = temp_structure.get("format")
                schema = temp_structure.get("schema")
                if not format:
                    if isinstance(schema, dict) or inspect.isclass(schema):
                        format = "json"
                    elif isinstance(schema, str):
                        format = "regex"

                if format == "json":
                    json_schema = json.dumps(schema_as_dict(schema))
                elif format == "regex":
                    regex = schema

            sampling_params = {
                "max_new_tokens": max_new_tokens,
                "stop": stop,
                "stop_token_ids": stop_token_ids,
                "temperature": temperature,
                "top_p": top_p,
                "top_k": top_k,
                "min_p": min_p,
                "frequency_penalty": frequency_penalty,
                "presence_penalty": presence_penalty,
                "repetition_penalty": repetition_penalty,
                "min_new_tokens": min_new_tokens,
                "spaces_between_special_tokens": spaces_between_special_tokens,
                "n": num_generations,
                "json_schema": json_schema,
                "regex": regex,
                "no_stop_trim": no_stop_trim,
                "ignore_eos": ignore_eos,
                "skip_special_tokens": skip_special_tokens,
                "custom_params": custom_params,
            }

            batch_outputs = self._model.generate(
                prompt=prepared_inputs,
                sampling_params=sampling_params,
                return_logprob=return_logprob,
                logprob_start_len=0 if echo else -1,
                top_logprobs_num=top_logprobs_num if return_logprob else 0,
            )

            group = int(len(batch_outputs) / len(prepared_inputs))
            for num in range(len(prepared_inputs)):
                input = prepared_inputs[num]
                outputs = batch_outputs[num * group : (num + 1) * group]
                processed_prompt_logprobs = []
                meta_info = outputs[0]["meta_info"]
                if "input_top_logprobs" in meta_info:
                    processed_prompt_logprobs = self._get_llm_logprobs(
                        top_logprob=meta_info["input_top_logprobs"],
                        choose_logprob=meta_info["input_token_logprobs"],
                    )
                texts, statistics, outputs_logprobs = self._process_outputs(
                    input=input,
                    outputs=outputs,
                    echo=echo,
                    prompt_logprobs=processed_prompt_logprobs,
                )
                batched_outputs.append(texts)
                generation = prepare_output(
                    generations=texts,
                    input_tokens=statistics["input_tokens"],
                    output_tokens=statistics["output_tokens"],
                    logprobs=outputs_logprobs,
                )

                generations.append(generation)

        if sorted_indices is not None:
            pairs = list(enumerate(sorted_indices))
            pairs.sort(key=lambda x: x[1])
            generations = [generations[original_idx] for original_idx, _ in pairs]

        return generations

    def _process_outputs(
        self,
        input: str,
        outputs,
        prompt_logprobs: List[List["Logprob"]],
        echo: bool = False,
    ) -> Tuple["LLMOutput", "LLMStatistics", "LLMLogprobs"]:
        texts = []
        outputs_logprobs = []
        lens = 1
        if isinstance(outputs, list):
            lens = len(outputs)
        statistics = {
            "input_tokens": [compute_tokens(input, self._tokenizer.encode)] * lens,
            "output_tokens": [],
        }

        for output in outputs:
            text = output["text"]
            if echo:
                text = input + text
            texts.append(text)
            statistics["output_tokens"].append(output["meta_info"]["completion_tokens"])
            if "output_top_logprobs" in output["meta_info"]:
                processed_output_logprobs = self._get_llm_logprobs(
                    output["meta_info"]["output_top_logprobs"]
                )
                outputs_logprobs.append(prompt_logprobs + processed_output_logprobs)

        return texts, statistics, outputs_logprobs

    def _get_llm_logprobs(
        self,
        top_logprob,
        choose_logprob=None,
    ) -> List[List["Logprob"]]:
        processed_logprobs = []
        if choose_logprob is not None:
            token_logprobs = []
            for num in range(len(choose_logprob)):
                if choose_logprob[num][0] is None:
                    processed_logprobs.append(None)
                    continue
                else:
                    token_logprobs.append(
                        {
                            "token": self._tokenizer.decode(choose_logprob[num][1]),
                            "logprob": choose_logprob[num][0],
                        }
                    )
                for top_num in range(len(top_logprob[num]) - 1):
                    token_logprobs.append(
                        {
                            "token": self._tokenizer.decode(
                                top_logprob[num][top_num][1]
                            ),
                            "logprob": top_logprob[num][top_num][0],
                        }
                    )
                processed_logprobs.append(token_logprobs)
        else:
            for probs in top_logprob:
                token_logprobs = []
                for item in probs:
                    token_logprobs.append(
                        {"token": self._tokenizer.decode(item[1]), "logprob": item[0]}
                    )
                processed_logprobs.append(token_logprobs)
        return processed_logprobs

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the SGLang model using either the path or the Hugging Face Hub repository id. Additionally, this method also sets the chat_template for the tokenizer, so as to properly parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the default value is ChatML format, unless explicitly provided.

Source code in src/distilabel/models/llms/sglang.py

def load(self) -> None:
    """Loads the `SGLang` model using either the path or the Hugging Face Hub repository id.
    Additionally, this method also sets the `chat_template` for the tokenizer, so as to properly
    parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the
    default value is ChatML format, unless explicitly provided.
    """
    super().load()

    CudaDevicePlacementMixin.load(self)

    try:
        from sglang import Engine
    except ImportError as err:
        raise ImportError(
            "sglang is not installed. Please install it with sglang document https://docs.sglang.ai/start/install.html."
        ) from err

    self._model = Engine(
        model_path=self.model,
        dtype=self.dtype,
        trust_remote_code=self.trust_remote_code,
        quantization=self.quantization,
        revision=self.revision,
        tokenizer_path=self.tokenizer,
        tokenizer_mode=self.tokenizer_mode,
        skip_tokenizer_init=self.skip_tokenizer_init,
        random_seed=self.seed,
        **self.extra_kwargs,  # type: ignore
    )
    from sglang.srt.hf_transformers_utils import get_tokenizer

    self._tokenizer = get_tokenizer(
        self.model,
        tokenizer_mode=self.tokenizer_mode,
        trust_remote_code=self.trust_remote_code,
        tokenizer_revision="main",
    )  # type: ignore
    if self.chat_template is not None:
        self._tokenizer.chat_template = self.chat_template  # type: ignore

`unload()` ¶

Unloads the SGLang model.

Source code in src/distilabel/models/llms/sglang.py

def unload(self) -> None:
    """Unloads the `SGLang` model."""
    self._cleanup_sglang_model()
    self._model = None  # type: ignore
    self._tokenizer = None  # type: ignore
    CudaDevicePlacementMixin.unload(self)
    super().unload()

`prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`Union[StandardInput, str]`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/sglang.py

def prepare_input(self, input: Union["StandardInput", str]) -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    if isinstance(input, str):
        return input

    prompt: str = (
        self._tokenizer.apply_chat_template(
            input,  # type: ignore
            tokenize=False,
            add_generation_prompt=True,  # type: ignore
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`_prepare_batches(inputs)` ¶

Prepares the inputs by grouping them by the structured output.

When we generate structured outputs with schemas obtained from a dataset, we need to prepare the data to try to send batches of inputs instead of single inputs to the model to take advante of the engine. So we group the inputs by the structured output to be passed in the generate method.

Parameters:

Name	Type	Description	Default
`inputs`	`List[StructuredInput]`	The batch of inputs passed to the generate method. As we expect to be generating structured outputs, each element will be a tuple containing the instruction and the structured output.	required

Returns:

Type	Description
`List[Tuple[List[str], OutlinesStructuredOutputType]]`	The prepared batches (sub-batches let's say) to be passed to the `generate` method.
`List[int]`	Each new tuple will contain instead of the single instruction, a list of instructions

Source code in src/distilabel/models/llms/sglang.py

def _prepare_batches(
    self, inputs: List["StructuredInput"]
) -> Tuple[List[Tuple[List[str], "OutlinesStructuredOutputType"]], List[int]]:
    """Prepares the inputs by grouping them by the structured output.

    When we generate structured outputs with schemas obtained from a dataset, we need to
    prepare the data to try to send batches of inputs instead of single inputs to the model
    to take advante of the engine. So we group the inputs by the structured output to be
    passed in the `generate` method.

    Args:
        inputs: The batch of inputs passed to the generate method. As we expect to be generating
            structured outputs, each element will be a tuple containing the instruction and the
            structured output.

    Returns:
        The prepared batches (sub-batches let's say) to be passed to the `generate` method.
        Each new tuple will contain instead of the single instruction, a list of instructions
    """
    instruction_order = {}
    batches: Dict[str, List[str]] = {}
    for i, (instruction, structured_output) in enumerate(inputs):
        instruction = self.prepare_input(instruction)
        instruction_order[instruction] = i

        structured_output = json.dumps(structured_output)
        if structured_output not in batches:
            batches[structured_output] = [instruction]
        else:
            batches[structured_output].append(instruction)

    # Built a list with instructions sorted by structured output
    flat_instructions = [
        instruction for _, group in batches.items() for instruction in group
    ]

    # Generate the list of indices based on the original order
    sorted_indices = [
        instruction_order[instruction] for instruction in flat_instructions
    ]

    return [
        (batch, json.loads(schema)) for schema, batch in batches.items()
    ], sorted_indices

`generate(inputs, num_generations=1, max_new_tokens=128, stop=None, stop_token_ids=None, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, frequency_penalty=0.0, presence_penalty=0.0, repetition_penalty=1.0, min_new_tokens=0, spaces_between_special_tokens=True, json_schema=None, regex=None, no_stop_trim=False, ignore_eos=False, skip_special_tokens=True, custom_params=None, return_logprob=False, top_logprobs_num=0, echo=False)` ¶

Generates num_generations responses for each input.

Parameters:

Name	Type	Description	Default
`inputs`	`List[FormattedInput]`	a list of inputs in chat format to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`stop`	`Optional[Union[str, List[str]]]`	a list of strings that will be used to stop the generation when found. Defaults to `None`.	`None`
`stop_token_ids`	`Optional[List[int]]`	a list of token ids that will be used to stop the generation when found. Defaults to `None`.	`None`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`1.0`
`top_p`	`float`	the top-p value to use for the generation. Defaults to `1.0`.	`1.0`
`top_k`	`int`	the top-k value to use for the generation. Defaults to `0`.	`-1`
`min_p`	`float`	the minimum probability to use for the generation. Defaults to `0.0`.	`0.0`
`frequency_penalty`	`float`	the repetition penalty to use for the generation. Defaults to `0.0`.	`0.0`
`presence_penalty`	`float`	the presence penalty to use for the generation. Defaults to `0.0`.	`0.0`
`repetition_penalty`	`float`	the repetition penalty to use for the generation Defaults to `1.0`.	`1.0`
`min_new_tokens`	`int`	Forces the model to generate at least `min_new_tokens` until a stop word or EOS token is sampled. Defaults to `0`.	`0`
`spaces_between_special_tokens`	`bool`	Whether or not to add spaces between special tokens during detokenization. Defaults to `True`.	`True`
`json_schema`	`Optional[str]`	json structure output. Defaults to `None`.	`None`
`regex`	`Optional[str]`	regex structure output. Defaults to `None`.	`None`
`no_stop_trim`	`bool`	Don't trim stop words or EOS token from the generated text. Defaults to `False`.	`False`
`ignore_eos`	`bool`	Don't stop generation when EOS token is sampled. Defaults to `False`.	`False`
`skip_special_tokens`	`bool`	Whether to exclude special tokens from the output. Defaults to `False`.	`True`
`custom_params`	`Optional[Dict[str, Any]]`	Used when employing `CustomLogitProcessor`. Defaults to `None`.	`None`
`return_logprob`	`bool`	Whether to return log probabilities for tokens. Defaults to `False`.	`False`
`top_logprobs_num`	`int`	If returning log probabilities, specifies the number of top logprobs to return at each position. Defaults to `0`.	`0`
`echo`	`bool`	Whether to echo the include the prompt in the response or not. Defaults to `False`.	`False`

Returns:

Type	Description
`List[GenerateOutput]`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/sglang.py

@validate_call
def generate(  # noqa: C901 # type: ignore
    self,
    inputs: List[FormattedInput],
    num_generations: int = 1,
    max_new_tokens: int = 128,
    stop: Optional[Union[str, List[str]]] = None,
    stop_token_ids: Optional[List[int]] = None,
    temperature: float = 1.0,
    top_p: float = 1.0,
    top_k: int = -1,
    min_p: float = 0.0,
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
    min_new_tokens: int = 0,
    spaces_between_special_tokens: bool = True,
    json_schema: Optional[str] = None,
    regex: Optional[str] = None,
    no_stop_trim: bool = False,
    ignore_eos: bool = False,
    skip_special_tokens: bool = True,
    custom_params: Optional[Dict[str, Any]] = None,  # no use?
    return_logprob: bool = False,
    top_logprobs_num: int = 0,
    echo: bool = False,
) -> List[GenerateOutput]:
    """Generates `num_generations` responses for each input.

    Args:
        inputs: a list of inputs in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        stop: a list of strings that will be used to stop the generation when found.
            Defaults to `None`.
        stop_token_ids: a list of token ids that will be used to stop the generation
            when found. Defaults to `None`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        top_k: the top-k value to use for the generation. Defaults to `0`.
        min_p: the minimum probability to use for the generation. Defaults to `0.0`.
        frequency_penalty: the repetition penalty to use for the generation. Defaults
            to `0.0`.
        presence_penalty: the presence penalty to use for the generation. Defaults to
            `0.0`.
        repetition_penalty: the repetition penalty to use for the generation Defaults to
            `1.0`.
        min_new_tokens: Forces the model to generate at least `min_new_tokens` until
            a stop word or EOS token is sampled. Defaults to `0`.
        spaces_between_special_tokens:  Whether or not to add spaces between special
            tokens during detokenization. Defaults to `True`.
        json_schema: json structure output. Defaults to `None`.
        regex: regex structure output. Defaults to `None`.
        no_stop_trim: Don't trim stop words or EOS token from the generated text.
            Defaults to `False`.
        ignore_eos: Don't stop generation when EOS token is sampled. Defaults to `False`.
        skip_special_tokens: Whether to exclude special tokens from the output. Defaults
            to `False`.
        custom_params: Used when employing `CustomLogitProcessor`. Defaults to `None`.
        return_logprob: Whether to return log probabilities for tokens. Defaults to `False`.
        top_logprobs_num: If returning log probabilities, specifies the number of
            top logprobs to return at each position. Defaults to `0`.
        echo: Whether to echo the include the prompt in the response or not. Defaults
            to `False`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """

    if isinstance(inputs[0], tuple):
        # Prepare the batches for structured generation
        prepared_batches, sorted_indices = self._prepare_batches(inputs)  # type: ignore
    else:
        # Simulate a batch without the structured output content
        prepared_batches = [([self.prepare_input(input) for input in inputs], None)]  # type: ignore
        sorted_indices = None

    batched_outputs: List["LLMOutput"] = []
    generations = []

    for prepared_inputs, structured_output in prepared_batches:
        if self.structured_output is not None and structured_output is not None:
            self._logger.warning(
                "An `structured_output` was provided in the model configuration, but"
                " one was also provided in the input. The input structured output will"
                " be used."
            )

        temp_structure = None
        if structured_output is not None:
            temp_structure = structured_output
        elif self.structured_output is not None:
            temp_structure = self.structured_output

        if temp_structure is not None:
            format = temp_structure.get("format")
            schema = temp_structure.get("schema")
            if not format:
                if isinstance(schema, dict) or inspect.isclass(schema):
                    format = "json"
                elif isinstance(schema, str):
                    format = "regex"

            if format == "json":
                json_schema = json.dumps(schema_as_dict(schema))
            elif format == "regex":
                regex = schema

        sampling_params = {
            "max_new_tokens": max_new_tokens,
            "stop": stop,
            "stop_token_ids": stop_token_ids,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
            "min_p": min_p,
            "frequency_penalty": frequency_penalty,
            "presence_penalty": presence_penalty,
            "repetition_penalty": repetition_penalty,
            "min_new_tokens": min_new_tokens,
            "spaces_between_special_tokens": spaces_between_special_tokens,
            "n": num_generations,
            "json_schema": json_schema,
            "regex": regex,
            "no_stop_trim": no_stop_trim,
            "ignore_eos": ignore_eos,
            "skip_special_tokens": skip_special_tokens,
            "custom_params": custom_params,
        }

        batch_outputs = self._model.generate(
            prompt=prepared_inputs,
            sampling_params=sampling_params,
            return_logprob=return_logprob,
            logprob_start_len=0 if echo else -1,
            top_logprobs_num=top_logprobs_num if return_logprob else 0,
        )

        group = int(len(batch_outputs) / len(prepared_inputs))
        for num in range(len(prepared_inputs)):
            input = prepared_inputs[num]
            outputs = batch_outputs[num * group : (num + 1) * group]
            processed_prompt_logprobs = []
            meta_info = outputs[0]["meta_info"]
            if "input_top_logprobs" in meta_info:
                processed_prompt_logprobs = self._get_llm_logprobs(
                    top_logprob=meta_info["input_top_logprobs"],
                    choose_logprob=meta_info["input_token_logprobs"],
                )
            texts, statistics, outputs_logprobs = self._process_outputs(
                input=input,
                outputs=outputs,
                echo=echo,
                prompt_logprobs=processed_prompt_logprobs,
            )
            batched_outputs.append(texts)
            generation = prepare_output(
                generations=texts,
                input_tokens=statistics["input_tokens"],
                output_tokens=statistics["output_tokens"],
                logprobs=outputs_logprobs,
            )

            generations.append(generation)

    if sorted_indices is not None:
        pairs = list(enumerate(sorted_indices))
        pairs.sort(key=lambda x: x[1])
        generations = [generations[original_idx] for original_idx, _ in pairs]

    return generations

`TogetherLLM` ¶

Bases: OpenAILLM

TogetherLLM LLM implementation running the async API client of OpenAI.

Attributes:

Name	Type	Description
`model`	`str`	the model name to use for the LLM e.g. "mistralai/Mixtral-8x7B-Instruct-v0.1". Supported models can be found here.
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL to use for the Together API can be set with `TOGETHER_BASE_URL`. Defaults to `None` which means that the value set for the environment variable `TOGETHER_BASE_URL` will be used, or "https://api.together.xyz/v1" if not set.
`api_key`	`Optional[RuntimeParameter[SecretStr]]`	the API key to authenticate the requests to the Together API. Defaults to `None` which means that the value set for the environment variable `TOGETHER_API_KEY` will be used, or `None` if not set.
`_api_key_env_var`	`str`	the name of the environment variable to use for the API key. It is meant to be used internally.

Examples:

Generate text:

from distilabel.models.llms import AnyscaleLLM

llm = TogetherLLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", api_key="api.key")

llm.load()

output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Source code in src/distilabel/models/llms/together.py

class TogetherLLM(OpenAILLM):
    """TogetherLLM LLM implementation running the async API client of OpenAI.

    Attributes:
        model: the model name to use for the LLM e.g. "mistralai/Mixtral-8x7B-Instruct-v0.1".
            Supported models can be found [here](https://api.together.xyz/models).
        base_url: the base URL to use for the Together API can be set with `TOGETHER_BASE_URL`.
            Defaults to `None` which means that the value set for the environment variable
            `TOGETHER_BASE_URL` will be used, or "https://api.together.xyz/v1" if not set.
        api_key: the API key to authenticate the requests to the Together API. Defaults to `None`
            which means that the value set for the environment variable `TOGETHER_API_KEY` will be
            used, or `None` if not set.
        _api_key_env_var: the name of the environment variable to use for the API key. It
            is meant to be used internally.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import AnyscaleLLM

        llm = TogetherLLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", api_key="api.key")

        llm.load()

        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```
    """

    base_url: Optional[RuntimeParameter[str]] = Field(
        default_factory=lambda: os.getenv(
            "TOGETHER_BASE_URL", "https://api.together.xyz/v1"
        ),
        description="The base URL to use for the Together API requests.",
    )
    api_key: Optional[RuntimeParameter[SecretStr]] = Field(
        default_factory=lambda: os.getenv(_TOGETHER_API_KEY_ENV_VAR_NAME),
        description="The API key to authenticate the requests to the Together API.",
    )

    _api_key_env_var: str = PrivateAttr(_TOGETHER_API_KEY_ENV_VAR_NAME)

`VertexAILLM` ¶

Bases: AsyncLLM

VertexAI LLM implementation running the async API clients for Gemini.

Gemini API: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini

To use the VertexAILLM is necessary to have configured the Google Cloud authentication using one of these methods:

Setting GOOGLE_CLOUD_CREDENTIALS environment variable
Using gcloud auth application-default login command
Using vertexai.init function from the google-cloud-aiplatform library

Attributes:

Name	Type	Description
`model`	`str`	the model name to use for the LLM e.g. "gemini-1.0-pro". Supported models.
`_aclient`	`Optional[GenerativeModel]`	the `GenerativeModel` to use for the Vertex AI Gemini API. It is meant to be used internally. Set in the `load` method.

Icon

:simple-googlecloud:

Examples:

Generate text:

from distilabel.models.llms import VertexAILLM

llm = VertexAILLM(model="gemini-1.5-pro")

llm.load()

# Call the model
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

Source code in src/distilabel/models/llms/vertexai.py

class VertexAILLM(AsyncLLM):
    """VertexAI LLM implementation running the async API clients for Gemini.

    - Gemini API: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini

    To use the `VertexAILLM` is necessary to have configured the Google Cloud authentication
    using one of these methods:

    - Setting `GOOGLE_CLOUD_CREDENTIALS` environment variable
    - Using `gcloud auth application-default login` command
    - Using `vertexai.init` function from the `google-cloud-aiplatform` library

    Attributes:
        model: the model name to use for the LLM e.g. "gemini-1.0-pro". [Supported models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models).
        _aclient: the `GenerativeModel` to use for the Vertex AI Gemini API. It is meant
            to be used internally. Set in the `load` method.

    Icon:
        `:simple-googlecloud:`

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import VertexAILLM

        llm = VertexAILLM(model="gemini-1.5-pro")

        llm.load()

        # Call the model
        output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```
    """

    model: str

    _num_generations_param_supported = False

    _aclient: Optional["GenerativeModel"] = PrivateAttr(...)

    def load(self) -> None:
        """Loads the `GenerativeModel` class which has access to `generate_content_async` to benefit from async requests."""
        super().load()

        try:
            from vertexai.generative_models import GenerationConfig, GenerativeModel

            self._generation_config_class = GenerationConfig
        except ImportError as e:
            raise ImportError(
                "vertexai is not installed. Please install it using"
                " `pip install 'distilabel[vertexai]'`."
            ) from e

        if _is_gemini_model(self.model):
            self._aclient = GenerativeModel(model_name=self.model)
        else:
            raise NotImplementedError(
                "`VertexAILLM` is only implemented for `gemini` models that allow for `ChatType` data."
            )

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    def _chattype_to_content(self, input: "StandardInput") -> List["Content"]:
        """Converts a chat type to a list of content items expected by the API.

        Args:
            input: the chat type to be converted.

        Returns:
            List[str]: a list of content items expected by the API.
        """
        from vertexai.generative_models import Content, Part

        contents = []
        for message in input:
            if message["role"] not in ["user", "model"]:
                raise ValueError(
                    "`VertexAILLM only supports the roles 'user' or 'model'."
                )
            contents.append(
                Content(
                    role=message["role"], parts=[Part.from_text(message["content"])]
                )
            )
        return contents

    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: VertexChatType,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        top_k: Optional[int] = None,
        max_output_tokens: Optional[int] = None,
        stop_sequences: Optional[List[str]] = None,
        safety_settings: Optional[Dict[str, Any]] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
    ) -> GenerateOutput:
        """Generates `num_generations` responses for the given input using the [VertexAI async client definition](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini).

        Args:
            input: a single input in chat format to generate responses for.
            temperature: Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to `None`.
            top_p: If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to `None`.
            top_k: If specified, top-k sampling will be used. Defaults to `None`.
            max_output_tokens: The maximum number of output tokens to generate per message. Defaults to `None`.
            stop_sequences: A list of stop sequences. Defaults to `None`.
            safety_settings: Safety configuration for returned content from the API. Defaults to `None`.
            tools: A potential list of tools that can be used by the API. Defaults to `None`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """
        from vertexai.generative_models import GenerationConfig

        content: "GenerationResponse" = await self._aclient.generate_content_async(  # type: ignore
            contents=self._chattype_to_content(input),
            generation_config=GenerationConfig(
                candidate_count=1,  # only one candidate allowed per call
                temperature=temperature,
                top_k=top_k,
                top_p=top_p,
                max_output_tokens=max_output_tokens,
                stop_sequences=stop_sequences,
            ),
            safety_settings=safety_settings,  # type: ignore
            tools=tools,  # type: ignore
            stream=False,
        )

        text = None
        try:
            text = content.candidates[0].text
        except ValueError:
            self._logger.warning(  # type: ignore
                f"Received no response using VertexAI client (model: '{self.model}')."
                f" Finish reason was: '{content.candidates[0].finish_reason}'."
            )
        return prepare_output([text], **self._get_llm_statistics(content))

    @staticmethod
    def _get_llm_statistics(content: "GenerationResponse") -> "LLMStatistics":
        return {
            "input_tokens": [content.usage_metadata.prompt_token_count],
            "output_tokens": [content.usage_metadata.candidates_token_count],
        }

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the GenerativeModel class which has access to generate_content_async to benefit from async requests.

Source code in src/distilabel/models/llms/vertexai.py

def load(self) -> None:
    """Loads the `GenerativeModel` class which has access to `generate_content_async` to benefit from async requests."""
    super().load()

    try:
        from vertexai.generative_models import GenerationConfig, GenerativeModel

        self._generation_config_class = GenerationConfig
    except ImportError as e:
        raise ImportError(
            "vertexai is not installed. Please install it using"
            " `pip install 'distilabel[vertexai]'`."
        ) from e

    if _is_gemini_model(self.model):
        self._aclient = GenerativeModel(model_name=self.model)
    else:
        raise NotImplementedError(
            "`VertexAILLM` is only implemented for `gemini` models that allow for `ChatType` data."
        )

`_chattype_to_content(input)` ¶

Converts a chat type to a list of content items expected by the API.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	the chat type to be converted.	required

Returns:

Type	Description
`List[Content]`	List[str]: a list of content items expected by the API.

Source code in src/distilabel/models/llms/vertexai.py

def _chattype_to_content(self, input: "StandardInput") -> List["Content"]:
    """Converts a chat type to a list of content items expected by the API.

    Args:
        input: the chat type to be converted.

    Returns:
        List[str]: a list of content items expected by the API.
    """
    from vertexai.generative_models import Content, Part

    contents = []
    for message in input:
        if message["role"] not in ["user", "model"]:
            raise ValueError(
                "`VertexAILLM only supports the roles 'user' or 'model'."
            )
        contents.append(
            Content(
                role=message["role"], parts=[Part.from_text(message["content"])]
            )
        )
    return contents

`agenerate(input, temperature=None, top_p=None, top_k=None, max_output_tokens=None, stop_sequences=None, safety_settings=None, tools=None)` `async` ¶

Generates num_generations responses for the given input using the VertexAI async client definition.

Parameters:

Name	Type	Description	Default
`input`	`VertexChatType`	a single input in chat format to generate responses for.	required
`temperature`	`Optional[float]`	Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to `None`.	`None`
`top_p`	`Optional[float]`	If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to `None`.	`None`
`top_k`	`Optional[int]`	If specified, top-k sampling will be used. Defaults to `None`.	`None`
`max_output_tokens`	`Optional[int]`	The maximum number of output tokens to generate per message. Defaults to `None`.	`None`
`stop_sequences`	`Optional[List[str]]`	A list of stop sequences. Defaults to `None`.	`None`
`safety_settings`	`Optional[Dict[str, Any]]`	Safety configuration for returned content from the API. Defaults to `None`.	`None`
`tools`	`Optional[List[Dict[str, Any]]]`	A potential list of tools that can be used by the API. Defaults to `None`.	`None`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/vertexai.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: VertexChatType,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
    max_output_tokens: Optional[int] = None,
    stop_sequences: Optional[List[str]] = None,
    safety_settings: Optional[Dict[str, Any]] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
) -> GenerateOutput:
    """Generates `num_generations` responses for the given input using the [VertexAI async client definition](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini).

    Args:
        input: a single input in chat format to generate responses for.
        temperature: Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to `None`.
        top_p: If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to `None`.
        top_k: If specified, top-k sampling will be used. Defaults to `None`.
        max_output_tokens: The maximum number of output tokens to generate per message. Defaults to `None`.
        stop_sequences: A list of stop sequences. Defaults to `None`.
        safety_settings: Safety configuration for returned content from the API. Defaults to `None`.
        tools: A potential list of tools that can be used by the API. Defaults to `None`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """
    from vertexai.generative_models import GenerationConfig

    content: "GenerationResponse" = await self._aclient.generate_content_async(  # type: ignore
        contents=self._chattype_to_content(input),
        generation_config=GenerationConfig(
            candidate_count=1,  # only one candidate allowed per call
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            max_output_tokens=max_output_tokens,
            stop_sequences=stop_sequences,
        ),
        safety_settings=safety_settings,  # type: ignore
        tools=tools,  # type: ignore
        stream=False,
    )

    text = None
    try:
        text = content.candidates[0].text
    except ValueError:
        self._logger.warning(  # type: ignore
            f"Received no response using VertexAI client (model: '{self.model}')."
            f" Finish reason was: '{content.candidates[0].finish_reason}'."
        )
    return prepare_output([text], **self._get_llm_statistics(content))

`ClientvLLM` ¶

Bases: OpenAILLM, MagpieChatTemplateMixin

A client for the vLLM server implementing the OpenAI API specification.

Attributes:

Name	Type	Description
`base_url`	`Optional[RuntimeParameter[str]]`	the base URL of the `vLLM` server. Defaults to `"http://localhost:8000"`.
`max_retries`	`RuntimeParameter[int]`	the maximum number of times to retry the request to the API before failing. Defaults to `6`.
`timeout`	`RuntimeParameter[int]`	the maximum time in seconds to wait for a response from the API. Defaults to `120`.
`httpx_client_kwargs`	`RuntimeParameter[int]`	extra kwargs that will be passed to the `httpx.AsyncClient` created to comunicate with the `vLLM` server. Defaults to `None`.
`tokenizer`	`Optional[str]`	the Hugging Face Hub repo id or path of the tokenizer that will be used to apply the chat template and tokenize the inputs before sending it to the server. Defaults to `None`.
`tokenizer_revision`	`Optional[str]`	the revision of the tokenizer to load. Defaults to `None`.
`_aclient`	`AsyncOpenAI`	the `httpx.AsyncClient` used to comunicate with the `vLLM` server. Defaults to `None`.

Runtime parameters

base_url: the base url of the vLLM server. Defaults to "http://localhost:8000".
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120.
httpx_client_kwargs: extra kwargs that will be passed to the httpx.AsyncClient created to comunicate with the vLLM server. Defaults to None.

Examples:

Generate text:

from distilabel.models.llms import ClientvLLM

llm = ClientvLLM(
    base_url="http://localhost:8000/v1",
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct"
)

llm.load()

results = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Hello, how are you?"}]],
    temperature=0.7,
    top_p=1.0,
    max_new_tokens=256,
)
# [
#     [
#         "I'm functioning properly, thank you for asking. How can I assist you today?",
#         "I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?",
#         "I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?"
#     ]
# ]

Source code in src/distilabel/models/llms/vllm.py

class ClientvLLM(OpenAILLM, MagpieChatTemplateMixin):
    """A client for the `vLLM` server implementing the OpenAI API specification.

    Attributes:
        base_url: the base URL of the `vLLM` server. Defaults to `"http://localhost:8000"`.
        max_retries: the maximum number of times to retry the request to the API before
            failing. Defaults to `6`.
        timeout: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.
        httpx_client_kwargs: extra kwargs that will be passed to the `httpx.AsyncClient`
            created to comunicate with the `vLLM` server. Defaults to `None`.
        tokenizer: the Hugging Face Hub repo id or path of the tokenizer that will be used
            to apply the chat template and tokenize the inputs before sending it to the
            server. Defaults to `None`.
        tokenizer_revision: the revision of the tokenizer to load. Defaults to `None`.
        _aclient: the `httpx.AsyncClient` used to comunicate with the `vLLM` server. Defaults
            to `None`.

    Runtime parameters:
        - `base_url`: the base url of the `vLLM` server. Defaults to `"http://localhost:8000"`.
        - `max_retries`: the maximum number of times to retry the request to the API before
            failing. Defaults to `6`.
        - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults
            to `120`.
        - `httpx_client_kwargs`: extra kwargs that will be passed to the `httpx.AsyncClient`
            created to comunicate with the `vLLM` server. Defaults to `None`.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import ClientvLLM

        llm = ClientvLLM(
            base_url="http://localhost:8000/v1",
            tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct"
        )

        llm.load()

        results = llm.generate_outputs(
            inputs=[[{"role": "user", "content": "Hello, how are you?"}]],
            temperature=0.7,
            top_p=1.0,
            max_new_tokens=256,
        )
        # [
        #     [
        #         "I'm functioning properly, thank you for asking. How can I assist you today?",
        #         "I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?",
        #         "I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?"
        #     ]
        # ]
        ```
    """

    model: str = ""  # Default value so it's not needed to `ClientvLLM(model="...")`
    tokenizer: Optional[str] = None
    tokenizer_revision: Optional[str] = None

    # We need the sync client to get the list of models
    _client: "OpenAI" = PrivateAttr(None)
    _tokenizer: "PreTrainedTokenizer" = PrivateAttr(None)

    def load(self) -> None:
        """Creates an `httpx.AsyncClient` to connect to the vLLM server and a tokenizer
        optionally."""

        self.api_key = SecretStr("EMPTY")

        # We need to first create the sync client to get the model name that will be used
        # in the `super().load()` when creating the logger.
        try:
            from openai import OpenAI
        except ImportError as ie:
            raise ImportError(
                "OpenAI Python client is not installed. Please install it using"
                " `pip install 'distilabel[openai]'`."
            ) from ie

        self._client = OpenAI(
            base_url=self.base_url,
            api_key=self.api_key.get_secret_value(),  # type: ignore
            max_retries=self.max_retries,  # type: ignore
            timeout=self.timeout,
        )

        super().load()

        try:
            from transformers import AutoTokenizer
        except ImportError as ie:
            raise ImportError(
                "To use `ClientvLLM` you need to install `transformers`."
                "Please install it using `pip install 'distilabel[hf-transformers]'`."
            ) from ie

        self._tokenizer = AutoTokenizer.from_pretrained(
            self.tokenizer, revision=self.tokenizer_revision
        )

    @cached_property
    def model_name(self) -> str:  # type: ignore
        """Returns the name of the model served with vLLM server."""
        models = self._client.models.list()
        return models.data[0].id

    def _prepare_input(self, input: "StandardInput") -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        prompt: str = (
            self._tokenizer.apply_chat_template(  # type: ignore
                input,  # type: ignore
                tokenize=False,
                add_generation_prompt=True,  # type: ignore
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    @validate_call
    async def agenerate(  # type: ignore
        self,
        input: FormattedInput,
        num_generations: int = 1,
        max_new_tokens: int = 128,
        frequency_penalty: float = 0.0,
        logit_bias: Optional[Dict[str, int]] = None,
        presence_penalty: float = 0.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
    ) -> GenerateOutput:
        """Generates `num_generations` responses for each input.

        Args:
            input: a single input in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            frequency_penalty: the repetition penalty to use for the generation. Defaults
                to `0.0`.
            logit_bias: modify the likelihood of specified tokens appearing in the completion.
                Defaults to ``
            presence_penalty: the presence penalty to use for the generation. Defaults to
                `0.0`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: nucleus sampling. The value refers to the top-p tokens that should be
                considered for sampling. Defaults to `1.0`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """

        completion = await self._aclient.completions.create(
            model=self.model_name,
            prompt=self._prepare_input(input),  # type: ignore
            n=num_generations,
            max_tokens=max_new_tokens,
            frequency_penalty=frequency_penalty,
            logit_bias=logit_bias,
            presence_penalty=presence_penalty,
            temperature=temperature,
            top_p=top_p,
        )

        generations = []
        for choice in completion.choices:
            text = choice.text
            if text == "":
                self._logger.warning(  # type: ignore
                    f"Received no response from vLLM server (model: '{self.model_name}')."
                    f" Finish reason was: {choice.finish_reason}"
                )
            generations.append(text)

        return prepare_output(generations, **self._get_llm_statistics(completion))

`model_name` `cached` `property` ¶

Returns the name of the model served with vLLM server.

`load()` ¶

Creates an httpx.AsyncClient to connect to the vLLM server and a tokenizer optionally.

Source code in src/distilabel/models/llms/vllm.py

def load(self) -> None:
    """Creates an `httpx.AsyncClient` to connect to the vLLM server and a tokenizer
    optionally."""

    self.api_key = SecretStr("EMPTY")

    # We need to first create the sync client to get the model name that will be used
    # in the `super().load()` when creating the logger.
    try:
        from openai import OpenAI
    except ImportError as ie:
        raise ImportError(
            "OpenAI Python client is not installed. Please install it using"
            " `pip install 'distilabel[openai]'`."
        ) from ie

    self._client = OpenAI(
        base_url=self.base_url,
        api_key=self.api_key.get_secret_value(),  # type: ignore
        max_retries=self.max_retries,  # type: ignore
        timeout=self.timeout,
    )

    super().load()

    try:
        from transformers import AutoTokenizer
    except ImportError as ie:
        raise ImportError(
            "To use `ClientvLLM` you need to install `transformers`."
            "Please install it using `pip install 'distilabel[hf-transformers]'`."
        ) from ie

    self._tokenizer = AutoTokenizer.from_pretrained(
        self.tokenizer, revision=self.tokenizer_revision
    )

`_prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`StandardInput`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/vllm.py

def _prepare_input(self, input: "StandardInput") -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    prompt: str = (
        self._tokenizer.apply_chat_template(  # type: ignore
            input,  # type: ignore
            tokenize=False,
            add_generation_prompt=True,  # type: ignore
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`agenerate(input, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, logit_bias=None, presence_penalty=0.0, temperature=1.0, top_p=1.0)` `async` ¶

Generates num_generations responses for each input.

Parameters:

Name	Type	Description	Default
`input`	`FormattedInput`	a single input in chat format to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`frequency_penalty`	`float`	the repetition penalty to use for the generation. Defaults to `0.0`.	`0.0`
`logit_bias`	`Optional[Dict[str, int]]`	modify the likelihood of specified tokens appearing in the completion. Defaults to ``	`None`
`presence_penalty`	`float`	the presence penalty to use for the generation. Defaults to `0.0`.	`0.0`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`1.0`
`top_p`	`float`	nucleus sampling. The value refers to the top-p tokens that should be considered for sampling. Defaults to `1.0`.	`1.0`

Returns:

Type	Description
`GenerateOutput`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/vllm.py

@validate_call
async def agenerate(  # type: ignore
    self,
    input: FormattedInput,
    num_generations: int = 1,
    max_new_tokens: int = 128,
    frequency_penalty: float = 0.0,
    logit_bias: Optional[Dict[str, int]] = None,
    presence_penalty: float = 0.0,
    temperature: float = 1.0,
    top_p: float = 1.0,
) -> GenerateOutput:
    """Generates `num_generations` responses for each input.

    Args:
        input: a single input in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        frequency_penalty: the repetition penalty to use for the generation. Defaults
            to `0.0`.
        logit_bias: modify the likelihood of specified tokens appearing in the completion.
            Defaults to ``
        presence_penalty: the presence penalty to use for the generation. Defaults to
            `0.0`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: nucleus sampling. The value refers to the top-p tokens that should be
            considered for sampling. Defaults to `1.0`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """

    completion = await self._aclient.completions.create(
        model=self.model_name,
        prompt=self._prepare_input(input),  # type: ignore
        n=num_generations,
        max_tokens=max_new_tokens,
        frequency_penalty=frequency_penalty,
        logit_bias=logit_bias,
        presence_penalty=presence_penalty,
        temperature=temperature,
        top_p=top_p,
    )

    generations = []
    for choice in completion.choices:
        text = choice.text
        if text == "":
            self._logger.warning(  # type: ignore
                f"Received no response from vLLM server (model: '{self.model_name}')."
                f" Finish reason was: {choice.finish_reason}"
            )
        generations.append(text)

    return prepare_output(generations, **self._get_llm_statistics(completion))

`vLLM` ¶

Bases: LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin

vLLM library LLM implementation.

Attributes:

Name	Type	Description
`model`	`str`	the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
`dtype`	`str`	the data type to use for the model. Defaults to `auto`.
`trust_remote_code`	`bool`	whether to trust the remote code when loading the model. Defaults to `False`.
`quantization`	`Optional[str]`	the quantization mode to use for the model. Defaults to `None`.
`revision`	`Optional[str]`	the revision of the model to load. Defaults to `None`.
`tokenizer`	`Optional[str]`	the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer files. If not provided, the tokenizer will be loaded from the model directory. Defaults to `None`.
`tokenizer_mode`	`Literal['auto', 'slow']`	the mode to use for the tokenizer. Defaults to `auto`.
`tokenizer_revision`	`Optional[str]`	the revision of the tokenizer to load. Defaults to `None`.
`skip_tokenizer_init`	`bool`	whether to skip the initialization of the tokenizer. Defaults to `False`.
`chat_template`	`Optional[str]`	a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to `None`.
`structured_output`	`Optional[RuntimeParameter[OutlinesStructuredOutputType]]`	a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
`seed`	`int`	the seed to use for the random number generator. Defaults to `0`.
`extra_kwargs`	`Optional[RuntimeParameter[Dict[str, Any]]]`	additional dictionary of keyword arguments that will be passed to the `LLM` class of `vllm` library. Defaults to `{}`.
`_model`	`LLM`	the `vLLM` model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the `load` method.
`_tokenizer`	`PreTrainedTokenizer`	the tokenizer instance used to format the prompt before passing it to the `LLM`. This attribute is meant to be used internally and should not be accessed directly. It will be set in the `load` method.
`use_magpie_template`	`bool`	a flag used to enable/disable applying the Magpie pre-query template. Defaults to `False`.
`magpie_pre_query_template`	`Union[MagpieAvailablePreQueryTemplates, str, None]`	the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are "llama3", "qwen2" or another pre-query template provided. Defaults to `None`.

References

https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py

Runtime parameters

extra_kwargs: additional dictionary of keyword arguments that will be passed to the LLM class of vllm library.

Examples:

Generate text:

from distilabel.models.llms import vLLM

# You can pass a custom chat_template to the model
llm = vLLM(
    model="prometheus-eval/prometheus-7b-v2.0",
    chat_template="[INST] {{ messages[0]"content" }}\n{{ messages[1]"content" }}[/INST]",
)

llm.load()

# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])

Generate structured data:

from pathlib import Path
from distilabel.models.llms import vLLM

class User(BaseModel):
    name: str
    last_name: str
    id: int

llm = vLLM(
    model="prometheus-eval/prometheus-7b-v2.0"
    structured_output={"format": "json", "schema": Character},
)

llm.load()

# Call the model
output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])

Source code in src/distilabel/models/llms/vllm.py

class vLLM(LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin):
    """`vLLM` library LLM implementation.

    Attributes:
        model: the model Hugging Face Hub repo id or a path to a directory containing the
            model weights and configuration files.
        dtype: the data type to use for the model. Defaults to `auto`.
        trust_remote_code: whether to trust the remote code when loading the model. Defaults
            to `False`.
        quantization: the quantization mode to use for the model. Defaults to `None`.
        revision: the revision of the model to load. Defaults to `None`.
        tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing
            the tokenizer files. If not provided, the tokenizer will be loaded from the
            model directory. Defaults to `None`.
        tokenizer_mode: the mode to use for the tokenizer. Defaults to `auto`.
        tokenizer_revision: the revision of the tokenizer to load. Defaults to `None`.
        skip_tokenizer_init: whether to skip the initialization of the tokenizer. Defaults
            to `False`.
        chat_template: a chat template that will be used to build the prompts before
            sending them to the model. If not provided, the chat template defined in the
            tokenizer config will be used. If not provided and the tokenizer doesn't have
            a chat template, then ChatML template will be used. Defaults to `None`.
        structured_output: a dictionary containing the structured output configuration or if more
            fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.
        seed: the seed to use for the random number generator. Defaults to `0`.
        extra_kwargs: additional dictionary of keyword arguments that will be passed to the
            `LLM` class of `vllm` library. Defaults to `{}`.
        _model: the `vLLM` model instance. This attribute is meant to be used internally
            and should not be accessed directly. It will be set in the `load` method.
        _tokenizer: the tokenizer instance used to format the prompt before passing it to
            the `LLM`. This attribute is meant to be used internally and should not be
            accessed directly. It will be set in the `load` method.
        use_magpie_template: a flag used to enable/disable applying the Magpie pre-query
            template. Defaults to `False`.
        magpie_pre_query_template: the pre-query template to be applied to the prompt or
            sent to the LLM to generate an instruction or a follow up user message. Valid
            values are "llama3", "qwen2" or another pre-query template provided. Defaults
            to `None`.

    References:
        - https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py

    Runtime parameters:
        - `extra_kwargs`: additional dictionary of keyword arguments that will be passed to
            the `LLM` class of `vllm` library.

    Examples:
        Generate text:

        ```python
        from distilabel.models.llms import vLLM

        # You can pass a custom chat_template to the model
        llm = vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]",
        )

        llm.load()

        # Call the model
        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Hello world!"}]])
        ```

        Generate structured data:

        ```python
        from pathlib import Path
        from distilabel.models.llms import vLLM

        class User(BaseModel):
            name: str
            last_name: str
            id: int

        llm = vLLM(
            model="prometheus-eval/prometheus-7b-v2.0"
            structured_output={"format": "json", "schema": Character},
        )

        llm.load()

        # Call the model
        output = llm.generate_outputs(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
        ```
    """

    model: str
    dtype: str = "auto"
    trust_remote_code: bool = False
    quantization: Optional[str] = None
    revision: Optional[str] = None

    tokenizer: Optional[str] = None
    tokenizer_mode: Literal["auto", "slow"] = "auto"
    tokenizer_revision: Optional[str] = None
    skip_tokenizer_init: bool = False
    chat_template: Optional[str] = None

    seed: int = 0

    extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(
        default_factory=dict,
        description="Additional dictionary of keyword arguments that will be passed to the"
        " `vLLM` class of `vllm` library. See all the supported arguments at: "
        "https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py",
    )
    structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(
        default=None,
        description="The structured output format to use across all the generations.",
    )

    _model: "_vLLM" = PrivateAttr(None)
    _tokenizer: "PreTrainedTokenizer" = PrivateAttr(None)
    _structured_output_logits_processor: Optional[Callable] = PrivateAttr(default=None)

    def load(self) -> None:
        """Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.
        Additionally, this method also sets the `chat_template` for the tokenizer, so as to properly
        parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the
        default value is ChatML format, unless explicitly provided.
        """
        super().load()

        CudaDevicePlacementMixin.load(self)

        try:
            from vllm import LLM as _vLLM
        except ImportError as ie:
            raise ImportError(
                "vLLM is not installed. Please install it using `pip install 'distilabel[vllm]'`."
            ) from ie

        self._model = _vLLM(
            self.model,
            dtype=self.dtype,
            trust_remote_code=self.trust_remote_code,
            quantization=self.quantization,
            revision=self.revision,
            tokenizer=self.tokenizer,
            tokenizer_mode=self.tokenizer_mode,
            tokenizer_revision=self.tokenizer_revision,
            skip_tokenizer_init=self.skip_tokenizer_init,
            seed=self.seed,
            **self.extra_kwargs,  # type: ignore
        )

        self._tokenizer = self._model.get_tokenizer()  # type: ignore
        if self.chat_template is not None:
            self._tokenizer.chat_template = self.chat_template  # type: ignore

        if self.structured_output:
            self._structured_output_logits_processor = self._prepare_structured_output(
                self.structured_output
            )

    def unload(self) -> None:
        """Unloads the `vLLM` model."""
        self._cleanup_vllm_model()
        self._model = None  # type: ignore
        self._tokenizer = None  # type: ignore
        CudaDevicePlacementMixin.unload(self)
        super().unload()

    def _cleanup_vllm_model(self) -> None:
        if self._model is None:
            return

        import torch  # noqa
        from vllm.distributed.parallel_state import (
            destroy_distributed_environment,
            destroy_model_parallel,
        )

        destroy_model_parallel()
        destroy_distributed_environment()

        # Don't delete model_executor if it does not exist, e.g. when VLLM_USE_V1 is set
        if hasattr(self._model.llm_engine, "model_executor"):
            del self._model.llm_engine.model_executor
        del self._model
        with contextlib.suppress(AssertionError):
            torch.distributed.destroy_process_group()
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()

    @property
    def model_name(self) -> str:
        """Returns the model name used for the LLM."""
        return self.model

    def prepare_input(self, input: Union["StandardInput", str]) -> str:
        """Prepares the input (applying the chat template and tokenization) for the provided
        input.

        Args:
            input: the input list containing chat items.

        Returns:
            The prompt to send to the LLM.
        """
        if isinstance(input, str):
            return input

        prompt: str = (
            self._tokenizer.apply_chat_template(
                input,  # type: ignore
                tokenize=False,
                add_generation_prompt=True,  # type: ignore
            )
            if input
            else ""
        )
        return super().apply_magpie_pre_query_template(prompt, input)

    def _prepare_batches(
        self, inputs: List["StructuredInput"]
    ) -> Tuple[List[Tuple[List[str], "OutlinesStructuredOutputType"]], List[int]]:
        """Prepares the inputs by grouping them by the structured output.

        When we generate structured outputs with schemas obtained from a dataset, we need to
        prepare the data to try to send batches of inputs instead of single inputs to the model
        to take advante of the engine. So we group the inputs by the structured output to be
        passed in the `generate` method.

        Args:
            inputs: The batch of inputs passed to the generate method. As we expect to be generating
                structured outputs, each element will be a tuple containing the instruction and the
                structured output.

        Returns:
            The prepared batches (sub-batches let's say) to be passed to the `generate` method.
            Each new tuple will contain instead of the single instruction, a list of instructions
        """
        instruction_order = {}
        batches: Dict[str, List[str]] = {}
        for i, (instruction, structured_output) in enumerate(inputs):
            instruction = self.prepare_input(instruction)
            instruction_order[instruction] = i

            structured_output = json.dumps(structured_output)
            if structured_output not in batches:
                batches[structured_output] = [instruction]
            else:
                batches[structured_output].append(instruction)

        # Built a list with instructions sorted by structured output
        flat_instructions = [
            instruction for _, group in batches.items() for instruction in group
        ]

        # Generate the list of indices based on the original order
        sorted_indices = [
            instruction_order[instruction] for instruction in flat_instructions
        ]

        return [
            (batch, json.loads(schema)) for schema, batch in batches.items()
        ], sorted_indices

    @validate_call
    def generate(  # noqa: C901 # type: ignore
        self,
        inputs: List[FormattedInput],
        num_generations: int = 1,
        max_new_tokens: int = 128,
        presence_penalty: float = 0.0,
        frequency_penalty: float = 0.0,
        repetition_penalty: float = 1.0,
        temperature: float = 1.0,
        top_p: float = 1.0,
        top_k: int = -1,
        min_p: float = 0.0,
        logprobs: Optional[PositiveInt] = None,
        stop: Optional[List[str]] = None,
        stop_token_ids: Optional[List[int]] = None,
        include_stop_str_in_output: bool = False,
        skip_special_tokens: bool = True,
        logits_processors: Optional[LogitsProcessors] = None,
        extra_sampling_params: Optional[Dict[str, Any]] = None,
        echo: bool = False,
    ) -> List[GenerateOutput]:
        """Generates `num_generations` responses for each input.

        Args:
            inputs: a list of inputs in chat format to generate responses for.
            num_generations: the number of generations to create per input. Defaults to
                `1`.
            max_new_tokens: the maximum number of new tokens that the model will generate.
                Defaults to `128`.
            presence_penalty: the presence penalty to use for the generation. Defaults to
                `0.0`.
            frequency_penalty: the repetition penalty to use for the generation. Defaults
                to `0.0`.
            repetition_penalty: the repetition penalty to use for the generation Defaults to
                `1.0`.
            temperature: the temperature to use for the generation. Defaults to `0.1`.
            top_p: the top-p value to use for the generation. Defaults to `1.0`.
            top_k: the top-k value to use for the generation. Defaults to `0`.
            min_p: the minimum probability to use for the generation. Defaults to `0.0`.
            logprobs: number of log probabilities to return per output token. If `None`,
                then no log probability won't be returned. Defaults to `None`.
            stop: a list of strings that will be used to stop the generation when found.
                Defaults to `None`.
            stop_token_ids: a list of token ids that will be used to stop the generation
                when found. Defaults to `None`.
            include_stop_str_in_output: whether to include the stop string in the output.
                Defaults to `False`.
            skip_special_tokens: whether to exclude special tokens from the output. Defaults
                to `False`.
            logits_processors: a list of functions to process the logits before sampling.
                Defaults to `None`.
            extra_sampling_params: dictionary with additional arguments to be passed to
                the `SamplingParams` class from `vllm`.
            echo: whether to echo the include the prompt in the response or not. Defaults
                to `False`.

        Returns:
            A list of lists of strings containing the generated responses for each input.
        """
        from vllm import SamplingParams

        if not logits_processors:
            logits_processors = []

        if extra_sampling_params is None:
            extra_sampling_params = {}

        structured_output = None

        if isinstance(inputs[0], tuple):
            # Prepare the batches for structured generation
            prepared_batches, sorted_indices = self._prepare_batches(inputs)  # type: ignore
        else:
            # Simulate a batch without the structured output content
            prepared_batches = [([self.prepare_input(input) for input in inputs], None)]  # type: ignore
            sorted_indices = None

        # Case in which we have a single structured output for the dataset
        if self._structured_output_logits_processor:
            logits_processors.append(self._structured_output_logits_processor)

        batched_outputs: List["LLMOutput"] = []
        generations = []

        for prepared_inputs, structured_output in prepared_batches:
            if self.structured_output is not None and structured_output is not None:
                self._logger.warning(
                    "An `structured_output` was provided in the model configuration, but"
                    " one was also provided in the input. The input structured output will"
                    " be used."
                )

            if structured_output is not None:
                logits_processors.append(
                    self._prepare_structured_output(structured_output)  # type: ignore
                )

            sampling_params = SamplingParams(  # type: ignore
                n=num_generations,
                presence_penalty=presence_penalty,
                frequency_penalty=frequency_penalty,
                repetition_penalty=repetition_penalty,
                temperature=temperature,
                top_p=top_p,
                top_k=top_k,
                min_p=min_p,
                max_tokens=max_new_tokens,
                prompt_logprobs=logprobs if echo else None,
                logprobs=logprobs,
                stop=stop,
                stop_token_ids=stop_token_ids,
                include_stop_str_in_output=include_stop_str_in_output,
                skip_special_tokens=skip_special_tokens,
                logits_processors=logits_processors,
                **extra_sampling_params,
            )

            batch_outputs: List["RequestOutput"] = self._model.generate(
                prompts=prepared_inputs,
                sampling_params=sampling_params,
                use_tqdm=False,
            )

            # Remove structured output logit processor to avoid stacking structured output
            # logits processors that leads to non-sense generations
            if structured_output is not None:
                logits_processors.pop(-1)

            for input, outputs in zip(prepared_inputs, batch_outputs):
                processed_prompt_logprobs = []
                if outputs.prompt_logprobs is not None:
                    processed_prompt_logprobs = self._get_llm_logprobs(
                        outputs.prompt_logprobs
                    )
                texts, statistics, outputs_logprobs = self._process_outputs(
                    input=input,
                    outputs=outputs,
                    echo=echo,
                    prompt_logprobs=processed_prompt_logprobs,
                )
                batched_outputs.append(texts)
                generation = prepare_output(
                    generations=texts,
                    input_tokens=statistics["input_tokens"],
                    output_tokens=statistics["output_tokens"],
                    logprobs=outputs_logprobs,
                )

                generations.append(generation)

        if sorted_indices is not None:
            pairs = list(enumerate(sorted_indices))
            pairs.sort(key=lambda x: x[1])
            generations = [generations[original_idx] for original_idx, _ in pairs]

        return generations

    def _process_outputs(
        self,
        input: str,
        outputs: "RequestOutput",
        prompt_logprobs: List[List["Logprob"]],
        echo: bool = False,
    ) -> Tuple["LLMOutput", "LLMStatistics", "LLMLogprobs"]:
        texts = []
        outputs_logprobs = []
        statistics = {
            "input_tokens": [compute_tokens(input, self._tokenizer.encode)]
            * len(outputs.outputs),
            "output_tokens": [],
        }
        for output in outputs.outputs:
            text = output.text
            if echo:
                text = input + text
            texts.append(text)
            statistics["output_tokens"].append(len(output.token_ids))
            if output.logprobs is not None:
                processed_output_logprobs = self._get_llm_logprobs(output.logprobs)
                outputs_logprobs.append(prompt_logprobs + processed_output_logprobs)
        return texts, statistics, outputs_logprobs

    def _prepare_structured_output(  # type: ignore
        self, structured_output: "OutlinesStructuredOutputType"
    ) -> Union[Callable, None]:
        """Creates the appropriate function to filter tokens to generate structured outputs.

        Args:
            structured_output: the configuration dict to prepare the structured output.

        Returns:
            The callable that will be used to guide the generation of the model.
        """
        from distilabel.steps.tasks.structured_outputs.outlines import (
            prepare_guided_output,
        )

        assert structured_output is not None, "`structured_output` cannot be `None`"

        result = prepare_guided_output(structured_output, "vllm", self._model)
        if (schema := result.get("schema")) and self.structured_output:
            self.structured_output["schema"] = schema
        return result["processor"]

    def _get_llm_logprobs(
        self, logprobs: Union["PromptLogprobs", "SampleLogprobs"]
    ) -> List[List["Logprob"]]:
        processed_logprobs = []
        for token_logprob in logprobs:  # type: ignore
            token_logprobs = []
            if token_logprob is None:
                processed_logprobs.append(None)
                continue
            for logprob in token_logprob.values():
                token_logprobs.append(
                    {"token": logprob.decoded_token, "logprob": logprob.logprob}
                )
            processed_logprobs.append(token_logprobs)
        return processed_logprobs

`model_name` `property` ¶

Returns the model name used for the LLM.

`load()` ¶

Loads the vLLM model using either the path or the Hugging Face Hub repository id. Additionally, this method also sets the chat_template for the tokenizer, so as to properly parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the default value is ChatML format, unless explicitly provided.

Source code in src/distilabel/models/llms/vllm.py

def load(self) -> None:
    """Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.
    Additionally, this method also sets the `chat_template` for the tokenizer, so as to properly
    parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the
    default value is ChatML format, unless explicitly provided.
    """
    super().load()

    CudaDevicePlacementMixin.load(self)

    try:
        from vllm import LLM as _vLLM
    except ImportError as ie:
        raise ImportError(
            "vLLM is not installed. Please install it using `pip install 'distilabel[vllm]'`."
        ) from ie

    self._model = _vLLM(
        self.model,
        dtype=self.dtype,
        trust_remote_code=self.trust_remote_code,
        quantization=self.quantization,
        revision=self.revision,
        tokenizer=self.tokenizer,
        tokenizer_mode=self.tokenizer_mode,
        tokenizer_revision=self.tokenizer_revision,
        skip_tokenizer_init=self.skip_tokenizer_init,
        seed=self.seed,
        **self.extra_kwargs,  # type: ignore
    )

    self._tokenizer = self._model.get_tokenizer()  # type: ignore
    if self.chat_template is not None:
        self._tokenizer.chat_template = self.chat_template  # type: ignore

    if self.structured_output:
        self._structured_output_logits_processor = self._prepare_structured_output(
            self.structured_output
        )

`unload()` ¶

Unloads the vLLM model.

Source code in src/distilabel/models/llms/vllm.py

def unload(self) -> None:
    """Unloads the `vLLM` model."""
    self._cleanup_vllm_model()
    self._model = None  # type: ignore
    self._tokenizer = None  # type: ignore
    CudaDevicePlacementMixin.unload(self)
    super().unload()

`prepare_input(input)` ¶

Prepares the input (applying the chat template and tokenization) for the provided input.

Parameters:

Name	Type	Description	Default
`input`	`Union[StandardInput, str]`	the input list containing chat items.	required

Returns:

Type	Description
`str`	The prompt to send to the LLM.

Source code in src/distilabel/models/llms/vllm.py

def prepare_input(self, input: Union["StandardInput", str]) -> str:
    """Prepares the input (applying the chat template and tokenization) for the provided
    input.

    Args:
        input: the input list containing chat items.

    Returns:
        The prompt to send to the LLM.
    """
    if isinstance(input, str):
        return input

    prompt: str = (
        self._tokenizer.apply_chat_template(
            input,  # type: ignore
            tokenize=False,
            add_generation_prompt=True,  # type: ignore
        )
        if input
        else ""
    )
    return super().apply_magpie_pre_query_template(prompt, input)

`_prepare_batches(inputs)` ¶

Prepares the inputs by grouping them by the structured output.

When we generate structured outputs with schemas obtained from a dataset, we need to prepare the data to try to send batches of inputs instead of single inputs to the model to take advante of the engine. So we group the inputs by the structured output to be passed in the generate method.

Parameters:

Name	Type	Description	Default
`inputs`	`List[StructuredInput]`	The batch of inputs passed to the generate method. As we expect to be generating structured outputs, each element will be a tuple containing the instruction and the structured output.	required

Returns:

Type	Description
`List[Tuple[List[str], OutlinesStructuredOutputType]]`	The prepared batches (sub-batches let's say) to be passed to the `generate` method.
`List[int]`	Each new tuple will contain instead of the single instruction, a list of instructions

Source code in src/distilabel/models/llms/vllm.py

def _prepare_batches(
    self, inputs: List["StructuredInput"]
) -> Tuple[List[Tuple[List[str], "OutlinesStructuredOutputType"]], List[int]]:
    """Prepares the inputs by grouping them by the structured output.

    When we generate structured outputs with schemas obtained from a dataset, we need to
    prepare the data to try to send batches of inputs instead of single inputs to the model
    to take advante of the engine. So we group the inputs by the structured output to be
    passed in the `generate` method.

    Args:
        inputs: The batch of inputs passed to the generate method. As we expect to be generating
            structured outputs, each element will be a tuple containing the instruction and the
            structured output.

    Returns:
        The prepared batches (sub-batches let's say) to be passed to the `generate` method.
        Each new tuple will contain instead of the single instruction, a list of instructions
    """
    instruction_order = {}
    batches: Dict[str, List[str]] = {}
    for i, (instruction, structured_output) in enumerate(inputs):
        instruction = self.prepare_input(instruction)
        instruction_order[instruction] = i

        structured_output = json.dumps(structured_output)
        if structured_output not in batches:
            batches[structured_output] = [instruction]
        else:
            batches[structured_output].append(instruction)

    # Built a list with instructions sorted by structured output
    flat_instructions = [
        instruction for _, group in batches.items() for instruction in group
    ]

    # Generate the list of indices based on the original order
    sorted_indices = [
        instruction_order[instruction] for instruction in flat_instructions
    ]

    return [
        (batch, json.loads(schema)) for schema, batch in batches.items()
    ], sorted_indices

`generate(inputs, num_generations=1, max_new_tokens=128, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, logprobs=None, stop=None, stop_token_ids=None, include_stop_str_in_output=False, skip_special_tokens=True, logits_processors=None, extra_sampling_params=None, echo=False)` ¶

Generates num_generations responses for each input.

Parameters:

Name	Type	Description	Default
`inputs`	`List[FormattedInput]`	a list of inputs in chat format to generate responses for.	required
`num_generations`	`int`	the number of generations to create per input. Defaults to `1`.	`1`
`max_new_tokens`	`int`	the maximum number of new tokens that the model will generate. Defaults to `128`.	`128`
`presence_penalty`	`float`	the presence penalty to use for the generation. Defaults to `0.0`.	`0.0`
`frequency_penalty`	`float`	the repetition penalty to use for the generation. Defaults to `0.0`.	`0.0`
`repetition_penalty`	`float`	the repetition penalty to use for the generation Defaults to `1.0`.	`1.0`
`temperature`	`float`	the temperature to use for the generation. Defaults to `0.1`.	`1.0`
`top_p`	`float`	the top-p value to use for the generation. Defaults to `1.0`.	`1.0`
`top_k`	`int`	the top-k value to use for the generation. Defaults to `0`.	`-1`
`min_p`	`float`	the minimum probability to use for the generation. Defaults to `0.0`.	`0.0`
`logprobs`	`Optional[PositiveInt]`	number of log probabilities to return per output token. If `None`, then no log probability won't be returned. Defaults to `None`.	`None`
`stop`	`Optional[List[str]]`	a list of strings that will be used to stop the generation when found. Defaults to `None`.	`None`
`stop_token_ids`	`Optional[List[int]]`	a list of token ids that will be used to stop the generation when found. Defaults to `None`.	`None`
`include_stop_str_in_output`	`bool`	whether to include the stop string in the output. Defaults to `False`.	`False`
`skip_special_tokens`	`bool`	whether to exclude special tokens from the output. Defaults to `False`.	`True`
`logits_processors`	`Optional[LogitsProcessors]`	a list of functions to process the logits before sampling. Defaults to `None`.	`None`
`extra_sampling_params`	`Optional[Dict[str, Any]]`	dictionary with additional arguments to be passed to the `SamplingParams` class from `vllm`.	`None`
`echo`	`bool`	whether to echo the include the prompt in the response or not. Defaults to `False`.	`False`

Returns:

Type	Description
`List[GenerateOutput]`	A list of lists of strings containing the generated responses for each input.

Source code in src/distilabel/models/llms/vllm.py

@validate_call
def generate(  # noqa: C901 # type: ignore
    self,
    inputs: List[FormattedInput],
    num_generations: int = 1,
    max_new_tokens: int = 128,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
    temperature: float = 1.0,
    top_p: float = 1.0,
    top_k: int = -1,
    min_p: float = 0.0,
    logprobs: Optional[PositiveInt] = None,
    stop: Optional[List[str]] = None,
    stop_token_ids: Optional[List[int]] = None,
    include_stop_str_in_output: bool = False,
    skip_special_tokens: bool = True,
    logits_processors: Optional[LogitsProcessors] = None,
    extra_sampling_params: Optional[Dict[str, Any]] = None,
    echo: bool = False,
) -> List[GenerateOutput]:
    """Generates `num_generations` responses for each input.

    Args:
        inputs: a list of inputs in chat format to generate responses for.
        num_generations: the number of generations to create per input. Defaults to
            `1`.
        max_new_tokens: the maximum number of new tokens that the model will generate.
            Defaults to `128`.
        presence_penalty: the presence penalty to use for the generation. Defaults to
            `0.0`.
        frequency_penalty: the repetition penalty to use for the generation. Defaults
            to `0.0`.
        repetition_penalty: the repetition penalty to use for the generation Defaults to
            `1.0`.
        temperature: the temperature to use for the generation. Defaults to `0.1`.
        top_p: the top-p value to use for the generation. Defaults to `1.0`.
        top_k: the top-k value to use for the generation. Defaults to `0`.
        min_p: the minimum probability to use for the generation. Defaults to `0.0`.
        logprobs: number of log probabilities to return per output token. If `None`,
            then no log probability won't be returned. Defaults to `None`.
        stop: a list of strings that will be used to stop the generation when found.
            Defaults to `None`.
        stop_token_ids: a list of token ids that will be used to stop the generation
            when found. Defaults to `None`.
        include_stop_str_in_output: whether to include the stop string in the output.
            Defaults to `False`.
        skip_special_tokens: whether to exclude special tokens from the output. Defaults
            to `False`.
        logits_processors: a list of functions to process the logits before sampling.
            Defaults to `None`.
        extra_sampling_params: dictionary with additional arguments to be passed to
            the `SamplingParams` class from `vllm`.
        echo: whether to echo the include the prompt in the response or not. Defaults
            to `False`.

    Returns:
        A list of lists of strings containing the generated responses for each input.
    """
    from vllm import SamplingParams

    if not logits_processors:
        logits_processors = []

    if extra_sampling_params is None:
        extra_sampling_params = {}

    structured_output = None

    if isinstance(inputs[0], tuple):
        # Prepare the batches for structured generation
        prepared_batches, sorted_indices = self._prepare_batches(inputs)  # type: ignore
    else:
        # Simulate a batch without the structured output content
        prepared_batches = [([self.prepare_input(input) for input in inputs], None)]  # type: ignore
        sorted_indices = None

    # Case in which we have a single structured output for the dataset
    if self._structured_output_logits_processor:
        logits_processors.append(self._structured_output_logits_processor)

    batched_outputs: List["LLMOutput"] = []
    generations = []

    for prepared_inputs, structured_output in prepared_batches:
        if self.structured_output is not None and structured_output is not None:
            self._logger.warning(
                "An `structured_output` was provided in the model configuration, but"
                " one was also provided in the input. The input structured output will"
                " be used."
            )

        if structured_output is not None:
            logits_processors.append(
                self._prepare_structured_output(structured_output)  # type: ignore
            )

        sampling_params = SamplingParams(  # type: ignore
            n=num_generations,
            presence_penalty=presence_penalty,
            frequency_penalty=frequency_penalty,
            repetition_penalty=repetition_penalty,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            min_p=min_p,
            max_tokens=max_new_tokens,
            prompt_logprobs=logprobs if echo else None,
            logprobs=logprobs,
            stop=stop,
            stop_token_ids=stop_token_ids,
            include_stop_str_in_output=include_stop_str_in_output,
            skip_special_tokens=skip_special_tokens,
            logits_processors=logits_processors,
            **extra_sampling_params,
        )

        batch_outputs: List["RequestOutput"] = self._model.generate(
            prompts=prepared_inputs,
            sampling_params=sampling_params,
            use_tqdm=False,
        )

        # Remove structured output logit processor to avoid stacking structured output
        # logits processors that leads to non-sense generations
        if structured_output is not None:
            logits_processors.pop(-1)

        for input, outputs in zip(prepared_inputs, batch_outputs):
            processed_prompt_logprobs = []
            if outputs.prompt_logprobs is not None:
                processed_prompt_logprobs = self._get_llm_logprobs(
                    outputs.prompt_logprobs
                )
            texts, statistics, outputs_logprobs = self._process_outputs(
                input=input,
                outputs=outputs,
                echo=echo,
                prompt_logprobs=processed_prompt_logprobs,
            )
            batched_outputs.append(texts)
            generation = prepare_output(
                generations=texts,
                input_tokens=statistics["input_tokens"],
                output_tokens=statistics["output_tokens"],
                logprobs=outputs_logprobs,
            )

            generations.append(generation)

    if sorted_indices is not None:
        pairs = list(enumerate(sorted_indices))
        pairs.sort(key=lambda x: x[1])
        generations = [generations[original_idx] for original_idx, _ in pairs]

    return generations

`_prepare_structured_output(structured_output)` ¶

Creates the appropriate function to filter tokens to generate structured outputs.

Parameters:

Name	Type	Description	Default
`structured_output`	`OutlinesStructuredOutputType`	the configuration dict to prepare the structured output.	required

Returns:

Type	Description
`Union[Callable, None]`	The callable that will be used to guide the generation of the model.

Source code in src/distilabel/models/llms/vllm.py

def _prepare_structured_output(  # type: ignore
    self, structured_output: "OutlinesStructuredOutputType"
) -> Union[Callable, None]:
    """Creates the appropriate function to filter tokens to generate structured outputs.

    Args:
        structured_output: the configuration dict to prepare the structured output.

    Returns:
        The callable that will be used to guide the generation of the model.
    """
    from distilabel.steps.tasks.structured_outputs.outlines import (
        prepare_guided_output,
    )

    assert structured_output is not None, "`structured_output` cannot be `None`"

    result = prepare_guided_output(structured_output, "vllm", self._model)
    if (schema := result.get("schema")) and self.structured_output:
        self.structured_output["schema"] = schema
    return result["processor"]

`CudaDevicePlacementMixin` ¶

Bases: BaseModel

Mixin class to assign CUDA devices to the LLM based on the cuda_devices attribute and the device placement information provided in _device_llm_placement_map. Providing the device placement information is optional, but if it is provided, it will be used to assign CUDA devices to the LLMs, trying to avoid using the same device for different LLMs.

Attributes:

Name	Type	Description
`cuda_devices`	`RuntimeParameter[Union[List[int], Literal['auto']]]`	a list with the ID of the CUDA devices to be used by the `LLM`. If set to "auto", the devices will be automatically assigned based on the device placement information provided in `_device_llm_placement_map`. If set to a list of devices, it will be checked if the devices are available to be used by the `LLM`. If not, a warning will be logged.
`disable_cuda_device_placement`	`RuntimeParameter[bool]`	Whether to disable the CUDA device placement logic or not. Defaults to `False`.
`_llm_identifier`	`Union[str, None]`	the identifier of the `LLM` to be used as key in `_device_llm_placement_map`.
`_device_llm_placement_map`	`Generator[Dict[str, List[int]], None, None]`	a dictionary with the device placement information for each `LLM`.

Source code in src/distilabel/models/mixins/cuda_device_placement.py

class CudaDevicePlacementMixin(BaseModel):
    """Mixin class to assign CUDA devices to the `LLM` based on the `cuda_devices` attribute
    and the device placement information provided in `_device_llm_placement_map`. Providing
    the device placement information is optional, but if it is provided, it will be used to
    assign CUDA devices to the `LLM`s, trying to avoid using the same device for different
    `LLM`s.

    Attributes:
        cuda_devices: a list with the ID of the CUDA devices to be used by the `LLM`. If set
            to "auto", the devices will be automatically assigned based on the device
            placement information provided in `_device_llm_placement_map`. If set to a list
            of devices, it will be checked if the devices are available to be used by the
            `LLM`. If not, a warning will be logged.
        disable_cuda_device_placement: Whether to disable the CUDA device placement logic
            or not. Defaults to `False`.
        _llm_identifier: the identifier of the `LLM` to be used as key in `_device_llm_placement_map`.
        _device_llm_placement_map: a dictionary with the device placement information for each
            `LLM`.
    """

    cuda_devices: RuntimeParameter[Union[List[int], Literal["auto"]]] = Field(
        default="auto", description="A list with the ID of the CUDA devices to be used."
    )
    disable_cuda_device_placement: RuntimeParameter[bool] = Field(
        default=False,
        description="Whether to disable the CUDA device placement logic or not.",
    )

    _llm_identifier: Union[str, None] = PrivateAttr(default=None)
    _desired_num_gpus: PositiveInt = PrivateAttr(default=1)
    _available_cuda_devices: List[int] = PrivateAttr(default_factory=list)
    _can_check_cuda_devices: bool = PrivateAttr(default=False)

    _logger: "Logger" = PrivateAttr(None)

    def load(self) -> None:
        """Assign CUDA devices to the LLM based on the device placement information provided
        in `_device_llm_placement_map`."""

        if self.disable_cuda_device_placement:
            return

        try:
            import pynvml

            pynvml.nvmlInit()
            device_count = pynvml.nvmlDeviceGetCount()
            self._available_cuda_devices = list(range(device_count))
            self._can_check_cuda_devices = True
        except ImportError as ie:
            if self.cuda_devices == "auto":
                raise ImportError(
                    "The 'pynvml' library is not installed. It is required to automatically"
                    " assign CUDA devices to the `LLM`s. Please, install it and try again."
                ) from ie

            if self.cuda_devices:
                self._logger.warning(  # type: ignore
                    "The 'pynvml' library is not installed. It is recommended to install it"
                    " to check if the CUDA devices assigned to the LLM are available."
                )

        self._assign_cuda_devices()

    def unload(self) -> None:
        """Unloads the LLM and removes the CUDA devices assigned to it from the device
        placement information provided in `_device_llm_placement_map`."""
        if self.disable_cuda_device_placement:
            return

        with self._device_llm_placement_map() as device_map:
            if self._llm_identifier in device_map:
                self._logger.debug(  # type: ignore
                    f"Removing '{self._llm_identifier}' from the CUDA device map file"
                    f" '{_CUDA_DEVICE_PLACEMENT_MIXIN_FILE}'."
                )
                del device_map[self._llm_identifier]

    @contextmanager
    def _device_llm_placement_map(self) -> Generator[Dict[str, List[int]], None, None]:
        """Reads the content of the device placement file of the node with a lock, yields
        the content, and writes the content back to the file after the context manager is
        closed. If the file doesn't exist, an empty dictionary will be yielded.

        Yields:
            The content of the device placement file.
        """
        _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.parent.mkdir(parents=True, exist_ok=True)
        _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.touch()
        with portalocker.Lock(
            _CUDA_DEVICE_PLACEMENT_MIXIN_FILE,
            "r+",
            flags=portalocker.LockFlags.EXCLUSIVE,
        ) as f:
            try:
                content = json.load(f)
            except json.JSONDecodeError:
                content = {}
            yield content
            f.seek(0)
            f.truncate()
            f.write(json.dumps(content))

    def _assign_cuda_devices(self) -> None:
        """Assigns CUDA devices to the LLM based on the device placement information provided
        in `_device_llm_placement_map`. If the `cuda_devices` attribute is set to "auto", it
        will be set to the first available CUDA device that is not going to be used by any
        other LLM. If the `cuda_devices` attribute is set to a list of devices, it will be
        checked if the devices are available to be used by the LLM. If not, a warning will be
        logged."""

        # Take the lock and read the device placement information for each LLM.
        with self._device_llm_placement_map() as device_map:
            if self.cuda_devices == "auto":
                self.cuda_devices = []
                for _ in range(self._desired_num_gpus):
                    if (device_id := self._get_cuda_device(device_map)) is not None:
                        self.cuda_devices.append(device_id)
                        device_map[self._llm_identifier] = self.cuda_devices  # type: ignore
                if len(self.cuda_devices) != self._desired_num_gpus:
                    self._logger.warning(  # type: ignore
                        f"Could not assign the desired number of GPUs {self._desired_num_gpus}"
                        f" for LLM with identifier '{self._llm_identifier}'."
                    )
            else:
                self._check_cuda_devices(device_map)

            device_map[self._llm_identifier] = self.cuda_devices  # type: ignore

        # `_device_llm_placement_map` was not provided and user didn't set the `cuda_devices`
        # attribute. In this case, the `cuda_devices` attribute will be set to an empty list.
        if self.cuda_devices == "auto":
            self.cuda_devices = []

        self._set_cuda_visible_devices()

    def _check_cuda_devices(self, device_map: Dict[str, List[int]]) -> None:
        """Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.

        Args:
            device_map: a dictionary with the device placement information for each LLM.
        """
        for device in self.cuda_devices:  # type: ignore
            for llm, devices in device_map.items():
                if device in devices:
                    self._logger.warning(  # type: ignore
                        f"LLM with identifier '{llm}' is also going to use CUDA device "
                        f"'{device}'. This may lead to performance issues or running out"
                        " of memory depending on the device capabilities and the loaded"
                        " models."
                    )

    def _get_cuda_device(self, device_map: Dict[str, List[int]]) -> Union[int, None]:
        """Returns the first available CUDA device to be used by the LLM that is not going
        to be used by any other LLM.

        Args:
            device_map: a dictionary with the device placement information for each LLM.

        Returns:
            The first available CUDA device to be used by the LLM.

        Raises:
            RuntimeError: if there is no available CUDA device to be used by the LLM.
        """
        for device in self._available_cuda_devices:
            if all(device not in devices for devices in device_map.values()):
                return device

        return None

    def _set_cuda_visible_devices(self) -> None:
        """Sets the `CUDA_VISIBLE_DEVICES` environment variable to the list of CUDA devices
        to be used by the LLM.
        """
        if not self.cuda_devices:
            return

        if self._can_check_cuda_devices and not all(
            device in self._available_cuda_devices for device in self.cuda_devices
        ):
            raise RuntimeError(
                f"Invalid CUDA devices for LLM '{self._llm_identifier}': {self.cuda_devices}."
                f" The available devices are: {self._available_cuda_devices}. Please, review"
                " the 'cuda_devices' attribute and try again."
            )

        cuda_devices = ",".join([str(device) for device in self.cuda_devices])
        self._logger.info(  # type: ignore
            f"🎮 LLM '{self._llm_identifier}' is going to use the following CUDA devices:"
            f" {self.cuda_devices}."
        )
        os.environ["CUDA_VISIBLE_DEVICES"] = cuda_devices

`load()` ¶

Assign CUDA devices to the LLM based on the device placement information provided in _device_llm_placement_map.

Source code in src/distilabel/models/mixins/cuda_device_placement.py

def load(self) -> None:
    """Assign CUDA devices to the LLM based on the device placement information provided
    in `_device_llm_placement_map`."""

    if self.disable_cuda_device_placement:
        return

    try:
        import pynvml

        pynvml.nvmlInit()
        device_count = pynvml.nvmlDeviceGetCount()
        self._available_cuda_devices = list(range(device_count))
        self._can_check_cuda_devices = True
    except ImportError as ie:
        if self.cuda_devices == "auto":
            raise ImportError(
                "The 'pynvml' library is not installed. It is required to automatically"
                " assign CUDA devices to the `LLM`s. Please, install it and try again."
            ) from ie

        if self.cuda_devices:
            self._logger.warning(  # type: ignore
                "The 'pynvml' library is not installed. It is recommended to install it"
                " to check if the CUDA devices assigned to the LLM are available."
            )

    self._assign_cuda_devices()

`unload()` ¶

Unloads the LLM and removes the CUDA devices assigned to it from the device placement information provided in _device_llm_placement_map.

Source code in src/distilabel/models/mixins/cuda_device_placement.py

def unload(self) -> None:
    """Unloads the LLM and removes the CUDA devices assigned to it from the device
    placement information provided in `_device_llm_placement_map`."""
    if self.disable_cuda_device_placement:
        return

    with self._device_llm_placement_map() as device_map:
        if self._llm_identifier in device_map:
            self._logger.debug(  # type: ignore
                f"Removing '{self._llm_identifier}' from the CUDA device map file"
                f" '{_CUDA_DEVICE_PLACEMENT_MIXIN_FILE}'."
            )
            del device_map[self._llm_identifier]

`_device_llm_placement_map()` ¶

Reads the content of the device placement file of the node with a lock, yields the content, and writes the content back to the file after the context manager is closed. If the file doesn't exist, an empty dictionary will be yielded.

Yields:

Type	Description
`Dict[str, List[int]]`	The content of the device placement file.

Source code in src/distilabel/models/mixins/cuda_device_placement.py

@contextmanager
def _device_llm_placement_map(self) -> Generator[Dict[str, List[int]], None, None]:
    """Reads the content of the device placement file of the node with a lock, yields
    the content, and writes the content back to the file after the context manager is
    closed. If the file doesn't exist, an empty dictionary will be yielded.

    Yields:
        The content of the device placement file.
    """
    _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.parent.mkdir(parents=True, exist_ok=True)
    _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.touch()
    with portalocker.Lock(
        _CUDA_DEVICE_PLACEMENT_MIXIN_FILE,
        "r+",
        flags=portalocker.LockFlags.EXCLUSIVE,
    ) as f:
        try:
            content = json.load(f)
        except json.JSONDecodeError:
            content = {}
        yield content
        f.seek(0)
        f.truncate()
        f.write(json.dumps(content))

`_assign_cuda_devices()` ¶

Assigns CUDA devices to the LLM based on the device placement information provided in _device_llm_placement_map. If the cuda_devices attribute is set to "auto", it will be set to the first available CUDA device that is not going to be used by any other LLM. If the cuda_devices attribute is set to a list of devices, it will be checked if the devices are available to be used by the LLM. If not, a warning will be logged.

Source code in src/distilabel/models/mixins/cuda_device_placement.py

def _assign_cuda_devices(self) -> None:
    """Assigns CUDA devices to the LLM based on the device placement information provided
    in `_device_llm_placement_map`. If the `cuda_devices` attribute is set to "auto", it
    will be set to the first available CUDA device that is not going to be used by any
    other LLM. If the `cuda_devices` attribute is set to a list of devices, it will be
    checked if the devices are available to be used by the LLM. If not, a warning will be
    logged."""

    # Take the lock and read the device placement information for each LLM.
    with self._device_llm_placement_map() as device_map:
        if self.cuda_devices == "auto":
            self.cuda_devices = []
            for _ in range(self._desired_num_gpus):
                if (device_id := self._get_cuda_device(device_map)) is not None:
                    self.cuda_devices.append(device_id)
                    device_map[self._llm_identifier] = self.cuda_devices  # type: ignore
            if len(self.cuda_devices) != self._desired_num_gpus:
                self._logger.warning(  # type: ignore
                    f"Could not assign the desired number of GPUs {self._desired_num_gpus}"
                    f" for LLM with identifier '{self._llm_identifier}'."
                )
        else:
            self._check_cuda_devices(device_map)

        device_map[self._llm_identifier] = self.cuda_devices  # type: ignore

    # `_device_llm_placement_map` was not provided and user didn't set the `cuda_devices`
    # attribute. In this case, the `cuda_devices` attribute will be set to an empty list.
    if self.cuda_devices == "auto":
        self.cuda_devices = []

    self._set_cuda_visible_devices()

`_check_cuda_devices(device_map)` ¶

Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.

Parameters:

Name	Type	Description	Default
`device_map`	`Dict[str, List[int]]`	a dictionary with the device placement information for each LLM.	required

Source code in src/distilabel/models/mixins/cuda_device_placement.py

def _check_cuda_devices(self, device_map: Dict[str, List[int]]) -> None:
    """Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.

    Args:
        device_map: a dictionary with the device placement information for each LLM.
    """
    for device in self.cuda_devices:  # type: ignore
        for llm, devices in device_map.items():
            if device in devices:
                self._logger.warning(  # type: ignore
                    f"LLM with identifier '{llm}' is also going to use CUDA device "
                    f"'{device}'. This may lead to performance issues or running out"
                    " of memory depending on the device capabilities and the loaded"
                    " models."
                )

`_get_cuda_device(device_map)` ¶

Returns the first available CUDA device to be used by the LLM that is not going to be used by any other LLM.

Parameters:

Name	Type	Description	Default
`device_map`	`Dict[str, List[int]]`	a dictionary with the device placement information for each LLM.	required

Returns:

Type	Description
`Union[int, None]`	The first available CUDA device to be used by the LLM.

Raises:

Type	Description
`RuntimeError`	if there is no available CUDA device to be used by the LLM.

Source code in src/distilabel/models/mixins/cuda_device_placement.py

def _get_cuda_device(self, device_map: Dict[str, List[int]]) -> Union[int, None]:
    """Returns the first available CUDA device to be used by the LLM that is not going
    to be used by any other LLM.

    Args:
        device_map: a dictionary with the device placement information for each LLM.

    Returns:
        The first available CUDA device to be used by the LLM.

    Raises:
        RuntimeError: if there is no available CUDA device to be used by the LLM.
    """
    for device in self._available_cuda_devices:
        if all(device not in devices for devices in device_map.values()):
            return device

    return None

`_set_cuda_visible_devices()` ¶

Sets the CUDA_VISIBLE_DEVICES environment variable to the list of CUDA devices to be used by the LLM.

Source code in src/distilabel/models/mixins/cuda_device_placement.py

def _set_cuda_visible_devices(self) -> None:
    """Sets the `CUDA_VISIBLE_DEVICES` environment variable to the list of CUDA devices
    to be used by the LLM.
    """
    if not self.cuda_devices:
        return

    if self._can_check_cuda_devices and not all(
        device in self._available_cuda_devices for device in self.cuda_devices
    ):
        raise RuntimeError(
            f"Invalid CUDA devices for LLM '{self._llm_identifier}': {self.cuda_devices}."
            f" The available devices are: {self._available_cuda_devices}. Please, review"
            " the 'cuda_devices' attribute and try again."
        )

    cuda_devices = ",".join([str(device) for device in self.cuda_devices])
    self._logger.info(  # type: ignore
        f"🎮 LLM '{self._llm_identifier}' is going to use the following CUDA devices:"
        f" {self.cuda_devices}."
    )
    os.environ["CUDA_VISIBLE_DEVICES"] = cuda_devices

LLM Gallery¶

llms ¶

AnthropicLLM ¶

model_name property ¶

_check_model_exists() ¶

load() ¶

agenerate(input, max_tokens=128, stop_sequences=None, temperature=1.0, top_p=None, top_k=None) async ¶

AnyscaleLLM ¶

AzureOpenAILLM ¶

load() ¶

CohereLLM ¶

model_name property ¶

load() ¶

_format_chat_to_cohere(input) ¶

agenerate(input, temperature=None, max_tokens=None, k=None, p=None, seed=None, stop_sequences=None, frequency_penalty=None, presence_penalty=None, raw_prompting=None) async ¶

GroqLLM ¶

model_name property ¶

load() ¶

agenerate(input, seed=None, max_new_tokens=128, temperature=1.0, top_p=1.0, stop=None) async ¶

InferenceEndpointsLLM ¶

only_one_of_model_id_endpoint_name_or_base_url_provided() ¶

prepare_input(input) ¶

_get_structured_output(input) ¶

_check_stop_sequences(stop_sequences=None) ¶

TransformersLLM ¶

model_name property ¶

load() ¶

unload() ¶

prepare_input(input) ¶

generate(inputs, num_generations=1, max_new_tokens=128, temperature=0.1, repetition_penalty=1.1, top_p=1.0, top_k=0, do_sample=True) ¶

get_last_hidden_states(inputs) ¶

_prepare_structured_output(structured_output=None) ¶

LiteLLM ¶

model_name property ¶

load() ¶

LlamaCppLLM ¶

model_name property ¶

validate_magpie_usage() ¶

load() ¶

prepare_input(input) ¶

generate(inputs, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, extra_generation_kwargs=None) ¶

_prepare_structured_output(structured_output=None) ¶

MistralLLM ¶

model_name property ¶

load() ¶

agenerate(input, max_new_tokens=None, temperature=None, top_p=None) async ¶

MlxLLM ¶

model_name property ¶

load() ¶

prepare_input(input) ¶

generate(inputs, num_generations=1, max_tokens=256, logits_processors=None, max_kv_size=None, prompt_cache=None, prefill_step_size=512, kv_bits=None, kv_group_size=64, quantized_kv_start=0, prompt_progress_callback=None, temp=0.0, top_p=0.0, min_p=0.0, min_tokens_to_keep=1, top_k=-1) ¶

MixtureOfAgentsLLM ¶

runtime_parameters_names property ¶

model_name property ¶

load() ¶

get_generation_kwargs() ¶

_build_moa_system_prompt(prev_outputs) ¶

_inject_moa_system_prompt(input, prev_outputs) ¶

_agenerate(inputs, num_generations=1, **kwargs) async ¶

OllamaLLM ¶

model_name property ¶

validate_magpie_usage() ¶

load() ¶

prepare_input(input) ¶

agenerate(input, format='', options=None, keep_alive=None) async ¶

OpenAILLM ¶

agenerate(input, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, echo=False, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, extra_body=None, seed=None) async ¶

_generations_from_openai_completion(completion) ¶

offline_batch_generate(inputs=None, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, **kwargs) ¶

_check_and_get_batch_results() ¶

_parse_output(output) ¶

_get_openai_batch(batch_id) ¶

_retrieve_batch_results(batch) ¶

_create_jobs(inputs, **kwargs) ¶

_create_batch_api_job(batch_input_file) ¶

_create_batch_files(inputs, **kwargs) ¶

_create_jsonl_buffers(inputs, **kwargs) ¶

_create_jsonl_row(input, custom_id, **kwargs) ¶

ClientSGLang ¶

model_name cached property ¶

`llms` ¶

`AnthropicLLM` ¶

`model_name` `property` ¶

`_check_model_exists()` ¶

`load()` ¶

`agenerate(input, max_tokens=128, stop_sequences=None, temperature=1.0, top_p=None, top_k=None)` `async` ¶

`AnyscaleLLM` ¶

`AzureOpenAILLM` ¶

`load()` ¶

`CohereLLM` ¶

`model_name` `property` ¶

`load()` ¶

`_format_chat_to_cohere(input)` ¶

`agenerate(input, temperature=None, max_tokens=None, k=None, p=None, seed=None, stop_sequences=None, frequency_penalty=None, presence_penalty=None, raw_prompting=None)` `async` ¶

`GroqLLM` ¶

`model_name` `property` ¶

`load()` ¶

`agenerate(input, seed=None, max_new_tokens=128, temperature=1.0, top_p=1.0, stop=None)` `async` ¶

`InferenceEndpointsLLM` ¶

`only_one_of_model_id_endpoint_name_or_base_url_provided()` ¶

`prepare_input(input)` ¶

`_get_structured_output(input)` ¶

`_check_stop_sequences(stop_sequences=None)` ¶

`TransformersLLM` ¶

`model_name` `property` ¶

`load()` ¶

`unload()` ¶

`prepare_input(input)` ¶

`generate(inputs, num_generations=1, max_new_tokens=128, temperature=0.1, repetition_penalty=1.1, top_p=1.0, top_k=0, do_sample=True)` ¶

`get_last_hidden_states(inputs)` ¶

`_prepare_structured_output(structured_output=None)` ¶

`LiteLLM` ¶

`model_name` `property` ¶

`load()` ¶

`LlamaCppLLM` ¶

`model_name` `property` ¶

`validate_magpie_usage()` ¶

`load()` ¶

`prepare_input(input)` ¶

`generate(inputs, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, extra_generation_kwargs=None)` ¶

`_prepare_structured_output(structured_output=None)` ¶

`MistralLLM` ¶

`model_name` `property` ¶

`load()` ¶

`agenerate(input, max_new_tokens=None, temperature=None, top_p=None)` `async` ¶

`MlxLLM` ¶

`model_name` `property` ¶

`load()` ¶

`prepare_input(input)` ¶

`generate(inputs, num_generations=1, max_tokens=256, logits_processors=None, max_kv_size=None, prompt_cache=None, prefill_step_size=512, kv_bits=None, kv_group_size=64, quantized_kv_start=0, prompt_progress_callback=None, temp=0.0, top_p=0.0, min_p=0.0, min_tokens_to_keep=1, top_k=-1)` ¶

`MixtureOfAgentsLLM` ¶

`runtime_parameters_names` `property` ¶

`model_name` `property` ¶

`load()` ¶

`get_generation_kwargs()` ¶

`_build_moa_system_prompt(prev_outputs)` ¶

`_inject_moa_system_prompt(input, prev_outputs)` ¶

`_agenerate(inputs, num_generations=1, **kwargs)` `async` ¶

`OllamaLLM` ¶

`model_name` `property` ¶

`validate_magpie_usage()` ¶

`load()` ¶

`prepare_input(input)` ¶

`agenerate(input, format='', options=None, keep_alive=None)` `async` ¶

`OpenAILLM` ¶

`agenerate(input, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, echo=False, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, extra_body=None, seed=None)` `async` ¶

`_generations_from_openai_completion(completion)` ¶

`offline_batch_generate(inputs=None, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, **kwargs)` ¶

`_check_and_get_batch_results()` ¶

`_parse_output(output)` ¶

`_get_openai_batch(batch_id)` ¶

`_retrieve_batch_results(batch)` ¶

`_create_jobs(inputs, **kwargs)` ¶

`_create_batch_api_job(batch_input_file)` ¶

`_create_batch_files(inputs, **kwargs)` ¶

`_create_jsonl_buffers(inputs, **kwargs)` ¶

`_create_jsonl_row(input, custom_id, **kwargs)` ¶

`ClientSGLang` ¶

`model_name` `cached` `property` ¶

`load()` ¶