
LLMs

In this section we will see what an LLM is and the different LLM implementations available in distilabel.

LLM

The LLM class encapsulates the functionality for interacting with a large language model.

It distinguishes between task specifications and configurable parameters that influence the LLM behavior.

For illustration purposes, we use the TextGenerationTask in this section, and refer you to the dedicated Tasks section for comprehensive details.

LLM classes share several general parameters and define implementation-specific ones. Let's first explain the general parameters and the generate method, and then the specifics of each class.

General parameters

Let's briefly introduce the general parameters we may find1:

  • max_new_tokens: this parameter controls the maximum number of new tokens the LLM is allowed to generate.

  • temperature: parameter associated with the creativity of the model. A value close to 0 makes the model more deterministic, while higher values make it more "creative".

  • top_k and top_p: top_k restricts the choice of the next token to the k most probable tokens, while top_p restricts it to the smallest set of tokens whose cumulative probability reaches p.

  • frequency_penalty and presence_penalty: frequency_penalty penalizes tokens proportionally to how often they have already appeared in the generated text, making repetitions less likely, while presence_penalty penalizes tokens that have appeared at least once, regardless of how often.

  • prompt_format and prompt_formatting_fn: these two parameters allow tweaking the prompt sent to the model. prompt_format directs the LLM to format the prompt according to one of the predefined formats, while prompt_formatting_fn lets you pass a function that is applied to the prompt before generation, for extra control over what we feed to the model (see the short sketch after this list).
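
Most of these parameters can be passed directly when instantiating an LLM. The snippet below is a minimal sketch using OpenAILLM; the exact set of available parameters and their defaults vary slightly across integrations (for example, top_k is not exposed by the OpenAI API):

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import TextGenerationTask

llm = OpenAILLM(
    task=TextGenerationTask(),
    model="gpt-3.5-turbo",
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    max_new_tokens=256,      # cap on the number of generated tokens
    temperature=0.3,         # lower values -> more deterministic output
    top_p=1.0,               # nucleus sampling over the cumulative probability mass
    frequency_penalty=0.0,   # penalize tokens proportionally to how often they appeared
    presence_penalty=0.0,    # penalize tokens that have appeared at all
    prompt_format="openai",  # alternatively, pass a callable via `prompt_formatting_fn`
)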

generate method

Once you create an LLM, you use the generate method to interact with it. This method accepts two parameters:

  • inputs: which is a list of dictionaries containing the inputs for the LLM and the Task. Each dictionary must have all the keys required by the Task.

    inputs = [
        {"input": "Write a letter for my friend Bob..."},
        {"input": "Give me a summary of the following text:..."},
        ...
    ]
    
  • num_generations: which is an integer used to specify how many text generations we want to obtain for each element in inputs.

The output of the method will be a list containing lists of LLMOutput. Each inner list corresponds to one input in inputs, and each LLMOutput within it corresponds to one of the num_generations for that input.

>>> llm.generate(inputs=[...], num_generations=2)
[ # (1)
    [ # (2)
        { # (3)
            "model_name": "notus-7b-v1",
            "prompt_used": "Write a letter for my friend Bob...",
            "raw_output": "Dear Bob, ...",
            "parsed_output": {
                "generations":  "Dear Bob, ...",
            }
        },
        {
            "model_name": "notus-7b-v1",
            "prompt_used": "Write a letter for my friend Bob...",
            "raw_output": "Dear Bob, ...",
            "parsed_output": {
                "generations":  "Dear Bob, ...",
            }
        },
    ],
    [...],
]
  1. The outer list will contain as many lists as elements in inputs.
  2. The inner lists will contain as many LLMOutputs as specified in num_generations.
  3. Each LLMOutput is a dictionary.

The LLMOutput is a TypedDict containing the keys model_name, prompt_used, raw_output and parsed_output. The parsed_output key is a dictionary that will contain all the Task outputs.

{
    "model_name": "notus-7b-v1",
    "prompt_used": "Write a letter for my friend Bob...",
    "raw_output": "Dear Bob, ...",
    "parsed_output": { # (1)
        "generations":  "Dear Bob, ...",
    }
},
  1. The keys contained in parsed_output will depend on the Task used. In this case, we used TextGenerationTask, so the key generations is present.

If the LLM uses a thread pool, then the output of the generate method will be a Future whose result is a list of lists of LLMOutput, as described above.
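
For instance, a minimal sketch of handling both cases (assuming llm and inputs are defined as above, and that the returned object is a concurrent.futures.Future):

from concurrent.futures import Future

outputs = llm.generate(inputs=inputs, num_generations=2)
if isinstance(outputs, Future):
    # Block until the thread pool finishes and retrieve the list of lists of LLMOutput
    outputs = outputs.result()

for generations_for_input in outputs:
    for generation in generations_for_input:
        print(generation["parsed_output"])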

Integrations

OpenAI

OpenAI models may be the default choice for your more ambitious tasks.

For the API reference visit OpenAILLM.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import TextGenerationTask

openaillm = OpenAILLM(
    model="gpt-3.5-turbo",
    task=TextGenerationTask(),
    prompt_format="openai",
    max_new_tokens=256,
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    temperature=0.3,
)
result = openaillm.generate([{"input": "What is OpenAI?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# OpenAI is an artificial intelligence research laboratory and company. It was founded
# with the goal of ensuring that artificial general intelligence (AGI) benefits all of
# humanity. OpenAI conducts cutting-edge research in various fields of AI ...

Llama.cpp

Suitable for running LLMs locally. Use this integration when you have access to the quantized weights of the model you want to interact with.

Let's see an example using notus-7b-v1. First, you will need to download the quantized GGUF weights of the model locally (the example below expects a notus-7b-v1.q4_k_m.gguf file).
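
For instance, a quick sketch using huggingface_hub to fetch the GGUF file (the repository id below is an assumption; adjust it to wherever the quantized weights are hosted):

from huggingface_hub import hf_hub_download

# Hypothetical repository hosting the quantized GGUF weights
model_path = hf_hub_download(
    repo_id="TheBloke/notus-7B-v1-GGUF",
    filename="notus-7b-v1.q4_k_m.gguf",
)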

from distilabel.llm import LlamaCppLLM
from distilabel.tasks import TextGenerationTask
from llama_cpp import Llama

# Instantiate our LLM with them:
llm = LlamaCppLLM(
    model=Llama(model_path="./notus-7b-v1.q4_k_m.gguf", n_gpu_layers=-1),
    task=TextGenerationTask(),
    max_new_tokens=128,
    temperature=0.3,
    prompt_format="notus",
)

result = llm.generate([{"input": "What is the capital of Spain?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# The capital of Spain is Madrid. It is located in the center of the country and
# is known for its vibrant culture, beautiful architecture, and delicious food.
# Madrid is home to many famous landmarks such as the Prado Museum, Retiro Park,
# and the Royal Palace of Madrid. I hope this information helps!

For the API reference visit LlamaCppLLM.

vLLM

Highly recommended to use if you have a GPU available, as it is the fastest solution out there for batch generation. Find more information about it in vLLM docs.

from distilabel.llm import vLLM
from distilabel.tasks import TextGenerationTask
from vllm import LLM

llm = vLLM(
    vllm=LLM(model="argilla/notus-7b-v1"),
    task=TextGenerationTask(),
    max_new_tokens=512,
    temperature=0.3,
    prompt_format="notus",
)
result_vllm = llm.generate([{"input": "What's a large language model?"}])
# >>> print(result_vllm[0][0]["parsed_output"]["generations"])
# A large language model is a type of artificial intelligence (AI) system that is designed
# to understand and interpret human language. It is called "large" because it uses a vast
# amount of data, typically billions of words or more, to learn and make predictions about
# language. Large language models are ...

For the API reference visit vLLM.

Ollama

Highly recommended if you have a GPU available, as it is one of the fastest solutions out there, and it also has Metal support for Apple Silicon chips (M1 and later) on macOS. Find more information about it in the Ollama GitHub.

Before being able to use Ollama you first need to install it. After that, you can select one of the models from their model library and use it as follows:

ollama serve
ollama run notus # or other model name

Note

The ollama run <model_name> command will also set pre-defined generation parameters for the model. These can be found in their library and overridden by passing them as arguments to the command as shown here.
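
As a hedged sketch, one way to pin custom generation parameters is to define them in a Modelfile (the model name and parameter values below are illustrative):

FROM notus
PARAMETER temperature 0.3
PARAMETER num_predict 128

and then create and run the customized model:

ollama create notus-custom -f Modelfile
ollama run notus-custom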

We can then re-use this model name as a reference within distilabel through our OllamaLLM implementation:

from distilabel.llm import OllamaLLM
from distilabel.tasks import TextGenerationTask

llm = OllamaLLM(
    model="notus",  # should be deployed via `ollama notus:7b-v1-q5_K_M`
    task=TextGenerationTask(),
    prompt_format="openai",
)
result = llm.generate([{"input": "What's a large language model?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# A large language model is a type of artificial intelligence (AI) system that has been trained
# on a vast amount of text data to generate human-like language. These models are capable of
# understanding and generating complex sentences, and can be used for tasks such as language
# translation, text summarization, and natural language generation. They are typically very ...

HuggingFace LLMs

This section explains two different ways to use HuggingFace models:

Transformers

Use this option to run a model from the HuggingFace Hub locally. Load the model and tokenizer in the standard way, and then instantiate the class with them.

For the API reference visit TransformersLLM.

Let's see an example using notus-7b-v1:

from distilabel.llm import TransformersLLM
from distilabel.tasks import TextGenerationTask
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the models from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")
model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1", device_map="auto")

# Instantiate our LLM with them:
llm = TransformersLLM(
    model=model,
    tokenizer=tokenizer,
    task=TextGenerationTask(),
    max_new_tokens=128,
    temperature=0.3,
    prompt_format="notus",
)

result = llm.generate([{"input": "What's a large language model?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# A large language model is a type of machine learning algorithm that is designed to analyze
# and understand large amounts of text data. It is called "large" because it requires a
# vast amount of data to train and improve its accuracy. These models are ...

Inference Endpoints

HuggingFace provides a streamlined approach for deploying models through Inference Endpoints on their infrastructure. Opt for this solution if your model is hosted on the HuggingFace Hub.

For the API reference visit InferenceEndpointsLLM.

Let's see how to interact with these LLMs:

import os

from distilabel.llm import InferenceEndpointsLLM
from distilabel.tasks import TextGenerationTask

endpoint_name = os.getenv("HF_INFERENCE_ENDPOINT_NAME", "aws-notus-7b-v1-4052")  # the env var takes precedence
endpoint_namespace = os.getenv("HF_NAMESPACE", "argilla")
token = os.getenv("HF_TOKEN")  # hf_...

llm = InferenceEndpointsLLM(
    endpoint_name_or_model_id=endpoint_name,
    endpoint_namespace=endpoint_namespace,
    token=token,
    task=TextGenerationTask(),
    max_new_tokens=512,
    prompt_format="notus",
)
result = llm.generate([{"input": "What are critique LLMs?"}])
# print(result[0][0]["parsed_output"]["generations"])
# Critique LLMs (Long Land Moore Machines) are artificial intelligence models designed specifically for analyzing and evaluating the quality or worth of a particular subject or object. These models can be trained on a large dataset of reviews, ratings, or commentary related to a product, service, artwork, or any other topic of interest.
# The training data can include both positive and negative feedback, helping the LLM to understand the nuanced aspects of quality and value. The model uses natural language processing (NLP) techniques to extract meaningful insights, including sentiment analysis, entity recognition, and text classification.
# Once the model is trained, it can be used to analyze new input data and provide a critical assessment based on its learned understanding of quality and value. For example, a critique LLM for movies could evaluate a new film and generate a detailed review highlighting its strengths, weaknesses, and overall rating.
# Critique LLMs are becoming increasingly useful in various industries, such as e-commerce, education, and entertainment, where they can provide objective and reliable feedback to help guide decision-making processes. They can also aid in content optimization by highlighting areas of improvement or recommending strategies for enhancing user engagement.
# In summary, critique LLMs are powerful tools for analyzing and evaluating the quality or worth of different subjects or objects, helping individuals and organizations make informed decisions with confidence.

Together Inference

Together offers a product named Together Inference, which exposes models for diverse tasks such as chat, text generation, code, or image generation, served through their API either as serverless endpoints or as dedicated instances.

See their release post with more details at Announcing Together Inference Engine – the fastest inference available.

from distilabel.llm import TogetherInferenceLLM
from distilabel.tasks import TextGenerationTask

llm = TogetherInferenceLLM(
    model="togethercomputer/llama-2-70b-chat",
    task=TextGenerationTask(),
    max_new_tokens=512,
    temperature=0.3,
    prompt_format="llama2",
)
output = llm.generate(
    [{"input": "Explain me the theory of relativity as if you were a pirate."}]
)
# >>> print(output[0][0]["parsed_output"]["generations"])
# Ahoy matey! Yer lookin' fer a tale of the theory of relativity, eh? Well,
# settle yerself down with a pint o' grog and listen close, for this be a story
# of the sea of time and space!
# Ye see, matey, the theory of relativity be tellin' us that time and space ain't
# fixed things, like the deck o' a ship or the stars in the sky. Nay, they be like
# the ocean itself, always changin' and flowin' like the tides.
# Now, imagine ...

Vertex AI LLMs

The Google Cloud Vertex AI platform allows you to use Google proprietary models and to deploy other models for online prediction. distilabel integrates with Vertex AI through the VertexAILLM and VertexAIEndpointLLM classes.

To use one of these classes you will need to have configured Google Cloud authentication using one of the following methods:

  • Setting the GOOGLE_CLOUD_CREDENTIALS environment variable
  • Running the gcloud auth application-default login command
  • Calling the vertexai.init function from the google-cloud-aiplatform Python SDK before instantiating the LLM (see the sketch below)
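
For the last option, a minimal sketch of initializing the SDK before creating the LLM (project and location are illustrative):

import vertexai

# Illustrative values; replace with your own GCP project and region
vertexai.init(project="my-gcp-project", location="us-central1")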

Proprietary models (Gemini and PaLM)

VertexAILLM allows you to use Google proprietary models such as Gemini and PaLM. These models are served through Vertex AI and its different APIs:

  • Gemini API: which offers models from the Gemini family such as gemini-pro and gemini-pro-vision models. More information: Vertex AI - Gemini API.
  • Text Generation API: which offers models from the PaLM family such as text-bison. More information: Vertex AI - PaLM 2 for text.
  • Code Generation API: which offers models from the PaLM family for code generation, such as code-bison. More information: Vertex AI - Codey for code generation.

from distilabel.llm import VertexAILLM
from distilabel.tasks import TextGenerationTask

llm = VertexAILLM(
    task=TextGenerationTask(), model="gemini-pro", max_new_tokens=512, temperature=0.3
)

results = llm.generate(
    inputs=[
        {"input": "Write a short summary about the Gemini astrological sign"},
    ],
)
# >>> print(results[0][0]["parsed_output"]["generations"])
# Gemini, the third astrological sign in the zodiac, is associated with the element of
# air and is ruled by the planet Mercury. People born under the Gemini sign are often
# characterized as being intelligent, curious, and communicative. They are known for their
# quick wit, adaptability, and versatility. Geminis are often drawn to learning and enjoy
# exploring new ideas and concepts. They are also known for their social nature and ability
# to connect with others easily. However, Geminis can also be seen as indecisive, restless,
# and superficial at times. They may struggle with commitment and may have difficulty focusing
# on one thing for too long. Overall, Geminis are known for their intelligence, curiosity,
# and social nature.

Endpoints for online prediction

The VertexAIEndpointLLM class allows you to use a model deployed in a Vertex AI Endpoint for online prediction to generate text. Unlike the rest of the LLM classes, which come with a set of pre-defined arguments in their __init__ method, VertexAIEndpointLLM requires the generation arguments to be provided in a dictionary that will be passed to the generation_kwargs argument. This is because the generation parameters will differ, and have different names, depending on the Docker image deployed on the Vertex AI Endpoint.

from distilabel.llm import VertexAIEndpointLLM
from distilabel.tasks import TextGenerationTask

llm = VertexAIEndpointLLM(
    task=TextGenerationTask(),
    endpoint_id="3466410517680095232",
    project="experiments-404412",
    location="us-central1",
    generation_kwargs={
        "temperature": 1.0,
        "max_tokens": 128,
        "top_p": 1.0,
        "top_k": 10,
    },
)

results = llm.generate(
    inputs=[
        {"input": "Write a short summary about the Gemini astrological sign"},
    ],
)
# >>> print(results[0][0]["parsed_output"]["generations"])
# Geminis are known for their curiosity, adaptability, and love of knowledge. They are
# also known for their tendency to be indecisive, impulsive and prone to arguing. They
# are ruled by the planet Mercury, which is associated with communication, quick thinking,
# and change.

Anyscale

Anyscale Endpoints offers open source large language models (LLMs) as fully managed API endpoints. Interact with open source models just as you would with OpenAI:

import os

from distilabel.llm import AnyscaleLLM
from distilabel.tasks import TextGenerationTask

anyscale_llm = AnyscaleLLM(
    model="HuggingFaceH4/zephyr-7b-beta",
    task=TextGenerationTask(),
    api_key=os.environ.get("ANYSCALE_API_KEY"),
)
result = anyscale_llm.generate([{"input": "What is Anyscale?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# 'Anyscale is a machine learning (ML) software company that provides tools and platforms
# for scalable distributed ML workflows. Their offerings enable data scientists and engineers
# to easily and efficiently deploy ML models at scale, both on-premise and on the cloud.
# Anyscale's core technology, Ray, is an open-source framework for distributed Python computation 
# that provides a unified interface for distributed computing, resource management, and task scheduling.
# With Anyscale's solutions, businesses can accelerate their ML development and deployment cycles and drive
# greater value from their ML investments.'

For the API reference visit AnyscaleLLM.

ProcessLLM and LLMPool

By default, distilabel uses a single process, so the generation loop is usually bottlenecked by the model inference time and the Python GIL. To overcome this limitation, we provide the ProcessLLM class, which loads an LLM in a different process, avoiding the GIL and making it possible to parallelize the generation loop. Creating a ProcessLLM is as easy as:

from distilabel.llm import LLM, ProcessLLM
from distilabel.tasks import Task, TextGenerationTask


def load_gpt_4(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-4",
        task=task,
        num_threads=4,
    )


llm = ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4)
future = llm.generate(
    inputs=[{"input": "Write a letter for Bob"}], num_generations=1
)  # (1)
llm.teardown()  # (2)
result = future.result()
# >>> print(result[0][0]["parsed_output"]["generations"])
# Dear Bob,
# I hope this letter finds you in good health and high spirits. I know it's been a while since we last caught up, and I wanted to take the time to connect and share a few updates.
# Life has been keeping me pretty busy lately. [Provide a brief overview of what you've been up to: work, school, family, hobbies, etc.]
# I've often found myself reminiscing about the good old days, like when we [include a memorable moment or shared experience with Bob].
  1. The ProcessLLM returns a Future containing a list of lists of LLMOutputs.
  2. The ProcessLLM needs to be terminated after usage. If the ProcessLLM is used by a Pipeline, it will be terminated automatically.

You can directly use a ProcessLLM as the generator or labeller in a Pipeline (see the sketch at the end of this section). Apart from that, there may be situations in which you would like to generate texts using several LLMs in parallel. For this purpose, we provide the LLMPool class:

from distilabel.llm import LLM, LLMPool, ProcessLLM
from distilabel.tasks import Task, TextGenerationTask


def load_gpt_3(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-3.5-turbo",
        task=task,
        num_threads=4,
    )


def load_gpt_4(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-4",
        task=task,
        num_threads=4,
    )


pool = LLMPool(
    llms=[
        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_3),
        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4),
    ]
)
result = pool.generate(inputs=[{"input": "Write a letter for Bob"}], num_generations=2)
pool.teardown()
# >>> print(result[0][0]["parsed_output"]["generations"])
# Dear Bob,
# I hope this letter finds you in good health and high spirits. I know it's been a while since we last caught up, and I wanted to take the time to connect and share a few updates.
# Life has been keeping me pretty busy lately. [Provide a brief overview of what you've been up to: work, school, family, hobbies, etc.]
# I've often found myself reminiscing about the good old days, like when we [include a memorable moment or shared experience with Bob].
# >>> print(result[0][1]["parsed_output"]["generations"])
# Of course, I'd be happy to draft a sample letter for you. However, I would need some additional
# information including who "Bob" is, the subject matter of the letter, the tone (formal or informal),
# and any specific details or points you'd like to include. Please provide some more context and I'll do my best to assist you.
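
As mentioned above, a ProcessLLM (or an LLMPool) can be plugged directly into a Pipeline, which also takes care of terminating it. Below is a minimal sketch reusing the load_gpt_4 function defined above; it assumes a datasets.Dataset with the input column expected by TextGenerationTask:

from datasets import Dataset

from distilabel.llm import ProcessLLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask

pipeline = Pipeline(
    generator=ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4),
)

dataset = Dataset.from_dict({"input": ["Write a letter for Bob"]})
# The Pipeline calls `teardown` on the ProcessLLM once generation finishes
dataset = pipeline.generate(dataset=dataset, num_generations=2)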

  1. You can take a look at this blog post from Cohere for a thorough explanation of the different parameters.