Skip to content

InferenceEndpointsLLM

InferenceEndpoints LLM implementation running the async API client.

This LLM will internally use huggingface_hub.AsyncInferenceClient or openai.AsyncOpenAI depending on the use_openai_client attribute.

Attributes

  • model_id: the model ID to use for the LLM as available in the Hugging Face Hub, which will be used to resolve the base URL for the serverless Inference Endpoints API requests. Defaults to None.

  • endpoint_name: the name of the Inference Endpoint to use for the LLM. Defaults to None.

  • endpoint_namespace: the namespace of the Inference Endpoint to use for the LLM. Defaults to None.

  • base_url: the base URL to use for the Inference Endpoints API requests.

  • api_key: the API key to authenticate the requests to the Inference Endpoints API.

  • tokenizer_id: the tokenizer ID to use for the LLM as available in the Hugging Face Hub. Defaults to None, but defining one is recommended to properly format the prompt.

  • model_display_name: the model display name to use for the LLM. Defaults to None.

  • use_openai_client: whether to use the OpenAI client instead of the Hugging Face client.

Examples

Free serverless Inference API

from distilabel.llms.huggingface import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="mistralai/Mistral-7B-Instruct-v0.2",
)

llm.load()

# Synchrounous request
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Hello world!"}])

Dedicated Inference Endpoints

from distilabel.llms.huggingface import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    endpoint_name="<ENDPOINT_NAME>",
    api_key="<HF_API_KEY>",
    endpoint_namespace="<USER|ORG>",
)

llm.load()

# Synchrounous request
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Hello world!"}])

Dedicated Inference Endpoints or TGI

from distilabel.llms.huggingface import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    api_key="<HF_API_KEY>",
    base_url="<BASE_URL>",
)

llm.load()

# Synchrounous request
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])

# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Hello world!"}])