
LLMs

In this section we will see what an LLM is and the different LLM implementations available in distilabel.

LLM

The LLM class encapsulates the functionality for interacting with a large language model.

It distinguishes between task specifications and configurable parameters that influence the LLM's behavior.

For illustration purposes, we use the TextGenerationTask in this section and refer you to the dedicated Tasks section for comprehensive details.

LLM classes share several general parameters and define implementation-specific ones. Let's explain the general parameters and the generate method first, and then the specifics of each class.

General parameters

Let's briefly introduce the general parameters we may find1 (a short instantiation sketch follows the list):

  • max_new_tokens: this parameter controls the maximum number of new tokens the LLM is allowed to generate.

  • temperature: parameter associated with the creativity of the model; a value close to 0 makes the model more deterministic, while higher values make the output more "creative".

  • top_k and top_p: top_k restricts sampling of the next token to the k most probable tokens, while top_p restricts it to the smallest set of tokens whose cumulative probability reaches p.

  • frequency_penalty and presence_penalty: the frequency penalty penalizes tokens that have already appeared in the generated text, lowering the probability of them appearing again, while the presence penalty applies regardless of how often a token has already appeared.

  • prompt_format and prompt_formatting_fn: these two parameters allow tweaking the prompt sent to the model. prompt_format directs the LLM to format the prompt according to one of the predefined formats, while prompt_formatting_fn allows passing a function that will be applied to the prompt before generation, for extra control over what we feed to the model.
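
As a rough sketch of how these come together (treat it as an assumption rather than a definitive signature, since the exact set of accepted keyword arguments varies per integration), most of these parameters are simply passed when the LLM is instantiated:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import OpenAITextGenerationTask

# Hypothetical instantiation: max_new_tokens and temperature appear in the
# examples below; top_p, frequency_penalty and presence_penalty are assumed
# to be accepted here and may not be available in every integration.
llm = OpenAILLM(
    model="gpt-3.5-turbo",
    task=OpenAITextGenerationTask(),
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    max_new_tokens=256,
    temperature=0.3,
    top_p=0.9,
    frequency_penalty=0.0,
    presence_penalty=0.0,
)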

generate method

Once you create an LLM, you use the generate method to interact with it. This method accepts two parameters:

  • inputs: which is a list of dictionaries containing the inputs for the LLM and the Task. Each dictionary must have all the keys required by the Task.

    inputs = [
        {"input": "Write a letter for my friend Bob..."},
        {"input": "Give me a summary of the following text:..."},
        ...
    ]
    
  • num_generations: which is an integer used to specify how many text generations we want to obtain for each element in inputs.

The output of the method will be a list containing lists of LLMOutput. Each inner list is associated with the corresponding input in inputs, and each LLMOutput is associated with one of the num_generations for each input.

>>> llm.generate(inputs=[...], num_generations=2)
[ # (1)
    [ # (2)
        { # (3)
            "model_name": "notus-7b-v1",
            "prompt_used": "Write a letter for my friend Bob...",
            "raw_output": "Dear Bob, ...",
            "parsed_output": {
                "generations":  "Dear Bob, ...",
            }
        }, 
        {
            "model_name": "notus-7b-v1",
            "prompt_used": "Write a letter for my friend Bob...",
            "raw_output": "Dear Bob, ...",
            "parsed_output": {
                "generations":  "Dear Bob, ...",
            }
        }, 
    ],
    [...],
]
  1. The outer list will contain as many lists as elements in inputs.
  2. The inner lists will contain as many LLMOutputs as specified in num_generations.
  3. Each LLMOutput is a dictionary.

The LLMOutput is a TypedDict containing the keys model_name, prompt_used, raw_output and parsed_output. The parsed_output key is a dictionary that will contain all the Task outputs.

{
    "model_name": "notus-7b-v1",
    "prompt_used": "Write a letter for my friend Bob...",
    "raw_output": "Dear Bob, ...",
    "parsed_output": { # (1)
        "generations":  "Dear Bob, ...",
    }
}
  1. The keys contained in parsed_output will depend on the Task used. In this case, we used TextGenerationTask, so the key generations is present.
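
As a small consumption sketch (assuming result holds the output of generate for a TextGenerationTask, as in the examples below), the nested structure can be flattened into a plain list of generated texts:

# result[i] holds the outputs for inputs[i]; result[i][j] is the j-th
# LLMOutput generated for that input.
all_generations = [
    output["parsed_output"]["generations"]
    for outputs_for_input in result
    for output in outputs_for_input
]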

If the LLM uses a thread pool, then the output of the generate method will be a Future whose result is the list of lists of LLMOutput described above.
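
In that case the call itself does not block. A minimal sketch, assuming llm was created with a thread pool (such as the ProcessLLM shown later):

future = llm.generate(inputs=inputs, num_generations=2)
outputs = future.result()  # blocks until the list of lists of LLMOutput is ready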

Integrations

OpenAI

OpenAI models may be the default choice for your more demanding tasks.

For the API reference visit OpenAILLM.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import OpenAITextGenerationTask

openaillm = OpenAILLM(
    model="gpt-3.5-turbo",
    task=OpenAITextGenerationTask(),
    max_new_tokens=256,
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    temperature=0.3,
)
result = openaillm.generate([{"input": "What is OpenAI?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# OpenAI is an artificial intelligence research laboratory and company. It was founded
# with the goal of ensuring that artificial general intelligence (AGI) benefits all of
# humanity. OpenAI conducts cutting-edge research in various fields of AI ...

Llama.cpp

Applicable for local execution of LLMs. Use this integration when you have the quantized weights of your chosen model available locally.

Let's see an example using notus-7b-v1. First, download the quantized weights of the model (e.g. notus-7b-v1.q4_k_m.gguf) from the Hugging Face Hub:

from distilabel.llm import LlamaCppLLM
from distilabel.tasks import TextGenerationTask
from llama_cpp import Llama

# Instantiate our LLM with them:
llm = LlamaCppLLM(
    model=Llama(model_path="./notus-7b-v1.q4_k_m.gguf", n_gpu_layers=-1),
    task=TextGenerationTask(),
    max_new_tokens=128,
    temperature=0.3,
    prompt_format="notus",
)

result = llm.generate([{"input": "What is the capital of Spain?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# The capital of Spain is Madrid. It is located in the center of the country and
# is known for its vibrant culture, beautiful architecture, and delicious food.
# Madrid is home to many famous landmarks such as the Prado Museum, Retiro Park,
# and the Royal Palace of Madrid. I hope this information helps!

For the API reference visit LlamaCppLLM.

vLLM

Highly recommended if you have a GPU available, as it is among the fastest solutions out there for batch generation. Find more information in the vLLM docs.

from distilabel.tasks import TextGenerationTask
from distilabel.llm import vLLM
from vllm import LLM

llm = vLLM(
    vllm=LLM(model="argilla/notus-7b-v1"),
    task=TextGenerationTask(),
    max_new_tokens=512,
    temperature=0.3,
    prompt_format="notus",
)
result = llm.generate([{"input": "What's a large language model?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# A large language model is a type of artificial intelligence (AI) system that is designed
# to understand and interpret human language. It is called "large" because it uses a vast
# amount of data, typically billions of words or more, to learn and make predictions about
# language. Large language models are ...

For the API reference visit vLLM.

HuggingFace LLMs

This section explains two different ways to use HuggingFace models:

Transformers

This is the option to use a model hosted on the Hugging Face Hub. Load the model and tokenizer in the standard way, as you would locally, and then instantiate the class with them.

For the API reference visit TransformersLLM.

Let's see an example using notus-7b-v1:

from distilabel.llm import TransformersLLM
from distilabel.tasks import TextGenerationTask
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the models from huggingface hub:
tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")
model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1", device_map="auto")

# Instantiate our LLM with them:
llm = TransformersLLM(
    model=model,
    tokenizer=tokenizer,
    task=TextGenerationTask(),
    max_new_tokens=128,
    temperature=0.3,
    prompt_format="notus",
)

result = llm.generate([{"input": "What's a large language model?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# A large language model is a type of machine learning algorithm that is designed to analyze
# and understand large amounts of text data. It is called "large" because it requires a
# vast amount of data to train and improve its accuracy. These models are ...

Inference Endpoints

Hugging Face provides a streamlined approach for deploying models through Inference Endpoints on their infrastructure. Opt for this solution if your model is deployed behind a Hugging Face Inference Endpoint.

For the API reference visit InferenceEndpointsLLM.

Let's see how to interact with these LLMs:

import os

from distilabel.llm import InferenceEndpointsLLM
from distilabel.tasks import TextGenerationTask

endpoint_name = "aws-notus-7b-v1-4052" or os.getenv("HF_INFERENCE_ENDPOINT_NAME")
endpoint_namespace = "argilla" or os.getenv("HF_NAMESPACE")
token = os.getenv("HF_TOKEN")  # hf_...

llm = InferenceEndpointsLLM(
    endpoint_name=endpoint_name,
    endpoint_namespace=endpoint_namespace,
    token=token,
    task=TextGenerationTask(),
    max_new_tokens=512,
    prompt_format="notus",
)
result = llm.generate([{"input": "What are critique LLMs?"}])
# print(result[0][0]["parsed_output"]["generations"])
# Critique LLMs (Long Land Moore Machines) are artificial intelligence models designed specifically for analyzing and evaluating the quality or worth of a particular subject or object. These models can be trained on a large dataset of reviews, ratings, or commentary related to a product, service, artwork, or any other topic of interest.
# The training data can include both positive and negative feedback, helping the LLM to understand the nuanced aspects of quality and value. The model uses natural language processing (NLP) techniques to extract meaningful insights, including sentiment analysis, entity recognition, and text classification.
# Once the model is trained, it can be used to analyze new input data and provide a critical assessment based on its learned understanding of quality and value. For example, a critique LLM for movies could evaluate a new film and generate a detailed review highlighting its strengths, weaknesses, and overall rating.
# Critique LLMs are becoming increasingly useful in various industries, such as e-commerce, education, and entertainment, where they can provide objective and reliable feedback to help guide decision-making processes. They can also aid in content optimization by highlighting areas of improvement or recommending strategies for enhancing user engagement.
# In summary, critique LLMs are powerful tools for analyzing and evaluating the quality or worth of different subjects or objects, helping individuals and organizations make informed decisions with confidence.

ProcessLLM and LLMPool

By default, distilabel uses a single process, so the generation loop is usually bottlenecked by the model inference time and the Python GIL. To overcome this limitation, we provide the ProcessLLM class, which allows loading an LLM in a different process, avoiding the GIL and allowing the generation loop to be parallelized. Creating a ProcessLLM is as easy as:

from distilabel.tasks import TextGenerationTask, Task
from distilabel.llm import ProcessLLM, LLM


def load_gpt_4(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-4",
        task=task,
        num_threads=4,
    )


llm = ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4)
future = llm.generate(
    inputs=[{"input": "Write a letter for Bob"}], num_generations=1
)  # (1)
llm.teardown()  # (2)
result = future.result()
# >>> print(result[0][0]["parsed_output"]["generations"])
# Dear Bob,
# I hope this letter finds you in good health and high spirits. I know it's been a while since we last caught up, and I wanted to take the time to connect and share a few updates.
# Life has been keeping me pretty busy lately. [Provide a brief overview of what you've been up to: work, school, family, hobbies, etc.]
# I've often found myself reminiscing about the good old days, like when we [include a memorable moment or shared experience with Bob].
  1. The ProcessLLM returns a Future containing a list of lists of LLMOutputs.
  2. The ProcessLLM needs to be terminated after usage. If the ProcessLLM is used by a Pipeline, it will be terminated automatically.

You can directly use a ProcessLLM as the generator or labeller in a Pipeline (a minimal sketch of that is shown after the LLMPool example below). Apart from that, there may be situations in which you would like to generate texts using several LLMs in parallel. For this purpose, we provide the LLMPool class:

from distilabel.tasks import TextGenerationTask, Task
from distilabel.llm import ProcessLLM, LLM, LLMPool

def load_gpt_3(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-3.5-turbo",
        task=task,
        num_threads=4,
    )

def load_gpt_4(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-4",
        task=task,
        num_threads=4,
    )


pool = LLMPool(llms=[
    ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_3),
    ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4),
])
result = pool.generate(
    inputs=[{"input": "Write a letter for Bob"}], num_generations=2
)
pool.teardown()
# >>> print(result[0][0]["parsed_output"]["generations"])
# Dear Bob,
# I hope this letter finds you in good health and high spirits. I know it's been a while since we last caught up, and I wanted to take the time to connect and share a few updates.
# Life has been keeping me pretty busy lately. [Provide a brief overview of what you've been up to: work, school, family, hobbies, etc.]
# I've often found myself reminiscing about the good old days, like when we [include a memorable moment or shared experience with Bob].
# >>> print(result[0][1]["parsed_output"]["generations"])
# Of course, I'd be happy to draft a sample letter for you. However, I would need some additional 
# information including who "Bob" is, the subject matter of the letter, the tone (formal or informal), 
# and any specific details or points you'd like to include. Please provide some more context and I'll do my best to assist you.
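
As mentioned above, a ProcessLLM can also be plugged directly into a Pipeline. A minimal sketch, assuming the load_gpt_4 function defined earlier and that Pipeline accepts a generator argument:

from distilabel.pipeline import Pipeline

# Hedged sketch: a ProcessLLM used as the generator of a Pipeline
# (a labeller could be passed in the same way).
pipeline = Pipeline(
    generator=ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4),
)
# The pipeline can then be used as usual, e.g. pipeline.generate(dataset, num_generations=2),
# where dataset is assumed to be a datasets.Dataset with an "input" column.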

  1. You can take a look at this blog post from Cohere for a thorough explanation of the different parameters.