# LLMs

In this section we will see what an `LLM` is and the different `LLM` implementations available in `distilabel`.
## LLM

The `LLM` class encapsulates the functionality for interacting with a large language model. It distinguishes between task specifications and configurable parameters that influence the LLM's behavior. For illustration purposes, we employ the `TextGenerationTask` in this section and refer you to the dedicated `Tasks` section for comprehensive details.
LLM classes share several general parameters and define implementation-specific ones. Let's explain the general parameters and the `generate` method first, and then the specifics of each class.
### General parameters

Let's briefly introduce the general parameters we may find:
- `max_new_tokens`: this parameter controls the maximum number of tokens the LLM is allowed to generate.
- `temperature`: parameter associated with the creativity of the model. A value close to 0 makes the model more deterministic, while higher values make it more "creative".
- `top_k` and `top_p`: `top_k` limits the number of tokens the model is allowed to consider when generating the next token, sorted by probability, while `top_p` limits the candidate tokens in terms of the sum of their probabilities.
- `frequency_penalty` and `presence_penalty`: the frequency penalty penalizes tokens that have already appeared in the generated text, limiting the possibility of those appearing again, while the presence penalty penalizes them regardless of their frequency.
- `prompt_format` and `prompt_formatting_fn`: these two parameters allow tweaking the prompt sent to the model. `prompt_format` directs the `LLM` to format the prompt according to one of the predefined formats, while `prompt_formatting_fn` allows passing a function that will be applied to the prompt before generation, for extra control over what we feed to the model (see the sketch after this list).
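As a minimal sketch of `prompt_formatting_fn` (illustrative only: the helper function is ours, and we assume the function receives the task-built prompt as a string and returns the final string to send to the model):

```python
import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import TextGenerationTask


def add_preamble(prompt: str) -> str:
    # Hypothetical helper: prepend a fixed instruction to whatever prompt the task built
    return f"You are a concise assistant.\n\n{prompt}"


llm = OpenAILLM(
    model="gpt-3.5-turbo",
    task=TextGenerationTask(),
    max_new_tokens=256,
    temperature=0.3,
    prompt_formatting_fn=add_preamble,  # applied to every prompt before generation
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
)
```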
### `generate` method

Once you create an `LLM`, you use the `generate` method to interact with it. This method accepts two parameters:

- `inputs`: a list of dictionaries containing the inputs for the `LLM` and the `Task`. Each dictionary must have all the keys required by the `Task`.
- `num_generations`: an integer used to specify how many text generations we want to obtain for each element in `inputs`.
The output of the method will be a list containing lists of `LLMOutput`. Each inner list is associated with the corresponding input in `inputs`, and each `LLMOutput` is associated with one of the `num_generations` for that input.
```python
>>> llm.generate(inputs=[...], num_generations=2)
[ # (1)
    [ # (2)
        { # (3)
            "model_name": "notus-7b-v1",
            "prompt_used": "Write a letter for my friend Bob...",
            "raw_output": "Dear Bob, ...",
            "parsed_output": {
                "generations": "Dear Bob, ...",
            }
        },
        {
            "model_name": "notus-7b-v1",
            "prompt_used": "Write a letter for my friend Bob...",
            "raw_output": "Dear Bob, ...",
            "parsed_output": {
                "generations": "Dear Bob, ...",
            }
        },
    ],
    [...],
]
```
1. The outer list will contain as many lists as elements in `inputs`.
2. The inner lists will contain as many `LLMOutput`s as specified in `num_generations`.
3. Each `LLMOutput` is a dictionary.
The `LLMOutput` is a `TypedDict` containing the keys `model_name`, `prompt_used`, `raw_output` and `parsed_output`. The `parsed_output` key is a dictionary that will contain all the `Task` outputs.
```python
{
    "model_name": "notus-7b-v1",
    "prompt_used": "Write a letter for my friend Bob...",
    "raw_output": "Dear Bob, ...",
    "parsed_output": { # (1)
        "generations": "Dear Bob, ...",
    }
}
```
1. The keys contained in `parsed_output` will depend on the `Task` used. In this case, we used `TextGenerationTask`, so the key `generations` is present.
If the `LLM` uses a thread pool, then the output of the `generate` method will be a `Future` whose result is a list of lists of `LLMOutput` as described above.
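For instance, retrieving the generations in that case looks like the following sketch (illustrative only: the `llm` instance and its input are assumptions; any thread-pooled `LLM`, e.g. one created with `num_threads`, would behave the same way):

```python
# `generate` returns a Future when the LLM uses a thread pool
future = llm.generate(inputs=[{"input": "Write a haiku about data"}], num_generations=1)
outputs = future.result()  # blocks until done; a list of lists of LLMOutput
print(outputs[0][0]["parsed_output"]["generations"])
```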
## Integrations

### OpenAI

These may be the default choice for your ambitious tasks. For the API reference visit OpenAILLM.
```python
import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import OpenAITextGenerationTask

openaillm = OpenAILLM(
    model="gpt-3.5-turbo",
    task=OpenAITextGenerationTask(),
    max_new_tokens=256,
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    temperature=0.3,
)
result = openaillm.generate([{"input": "What is OpenAI?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# OpenAI is an artificial intelligence research laboratory and company. It was founded
# with the goal of ensuring that artificial general intelligence (AGI) benefits all of
# humanity. OpenAI conducts cutting-edge research in various fields of AI ...
```
### Llama.cpp

Applicable for local execution of language models. Use this LLM when you have access to the quantized weights of the model you want to interact with.

Let's see an example using notus-7b-v1. First, download the quantized weights of the model (the `.gguf` file used below):
```python
from distilabel.llm import LlamaCppLLM
from distilabel.tasks import TextGenerationTask
from llama_cpp import Llama

# Instantiate our LLM with them:
llm = LlamaCppLLM(
    model=Llama(model_path="./notus-7b-v1.q4_k_m.gguf", n_gpu_layers=-1),
    task=TextGenerationTask(),
    max_new_tokens=128,
    temperature=0.3,
    prompt_format="notus",
)

result = llm.generate([{"input": "What is the capital of Spain?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# The capital of Spain is Madrid. It is located in the center of the country and
# is known for its vibrant culture, beautiful architecture, and delicious food.
# Madrid is home to many famous landmarks such as the Prado Museum, Retiro Park,
# and the Royal Palace of Madrid. I hope this information helps!
```
For the API reference visit LlamaCppLLM.
### vLLM

Highly recommended if you have a GPU available, as it is the fastest solution out there for batch generation. Find more information in the vLLM docs.
```python
from distilabel.tasks import TextGenerationTask
from distilabel.llm import vLLM
from vllm import LLM

llm = vLLM(
    vllm=LLM(model="argilla/notus-7b-v1"),
    task=TextGenerationTask(),
    max_new_tokens=512,
    temperature=0.3,
    prompt_format="notus",
)
result = llm.generate([{"input": "What's a large language model?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# A large language model is a type of artificial intelligence (AI) system that is designed
# to understand and interpret human language. It is called "large" because it uses a vast
# amount of data, typically billions of words or more, to learn and make predictions about
# language. Large language models are ...
```
For the API reference visit vLLM.
### HuggingFace LLMs

This section explains two different ways to use HuggingFace models:
#### Transformers

This is the option to use a model hosted on the Hugging Face Hub. Load the model and tokenizer in the standard manner, as you would locally, and proceed to instantiate your class. For the API reference visit TransformersLLM.
Let's see an example using notus-7b-v1:
```python
from distilabel.llm import TransformersLLM
from distilabel.tasks import TextGenerationTask
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the models from the Hugging Face Hub:
tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")
model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1", device_map="auto")

# Instantiate our LLM with them:
llm = TransformersLLM(
    model=model,
    tokenizer=tokenizer,
    task=TextGenerationTask(),
    max_new_tokens=128,
    temperature=0.3,
    prompt_format="notus",
)

result = llm.generate([{"input": "What's a large language model?"}])
# >>> print(result[0][0]["parsed_output"]["generations"])
# A large language model is a type of machine learning algorithm that is designed to analyze
# and understand large amounts of text data. It is called "large" because it requires a
# vast amount of data to train and improve its accuracy. These models are ...
```
#### Inference Endpoints

Hugging Face provides a streamlined approach for deploying models through Inference Endpoints on their infrastructure. Opt for this solution if your model is hosted on the Hugging Face Hub. For the API reference visit InferenceEndpointsLLM.
Let's see how to interact with these LLMs:
```python
import os

from distilabel.llm import InferenceEndpointsLLM
from distilabel.tasks import TextGenerationTask

endpoint_name = "aws-notus-7b-v1-4052" or os.getenv("HF_INFERENCE_ENDPOINT_NAME")
endpoint_namespace = "argilla" or os.getenv("HF_NAMESPACE")
token = os.getenv("HF_TOKEN")  # hf_...

llm = InferenceEndpointsLLM(
    endpoint_name=endpoint_name,
    endpoint_namespace=endpoint_namespace,
    token=token,
    task=TextGenerationTask(),
    max_new_tokens=512,
    prompt_format="notus",
)
result = llm.generate([{"input": "What are critique LLMs?"}])
# print(result[0][0]["parsed_output"]["generations"])
# Critique LLMs (Long Land Moore Machines) are artificial intelligence models designed specifically for analyzing and evaluating the quality or worth of a particular subject or object. These models can be trained on a large dataset of reviews, ratings, or commentary related to a product, service, artwork, or any other topic of interest.
# The training data can include both positive and negative feedback, helping the LLM to understand the nuanced aspects of quality and value. The model uses natural language processing (NLP) techniques to extract meaningful insights, including sentiment analysis, entity recognition, and text classification.
# Once the model is trained, it can be used to analyze new input data and provide a critical assessment based on its learned understanding of quality and value. For example, a critique LLM for movies could evaluate a new film and generate a detailed review highlighting its strengths, weaknesses, and overall rating.
# Critique LLMs are becoming increasingly useful in various industries, such as e-commerce, education, and entertainment, where they can provide objective and reliable feedback to help guide decision-making processes. They can also aid in content optimization by highlighting areas of improvement or recommending strategies for enhancing user engagement.
# In summary, critique LLMs are powerful tools for analyzing and evaluating the quality or worth of different subjects or objects, helping individuals and organizations make informed decisions with confidence.
```
## `ProcessLLM` and `LLMPool`

By default, `distilabel` uses a single process, so the generation loop is usually bottlenecked by the model inference time and the Python GIL. To overcome this limitation, we provide the `ProcessLLM` class, which allows loading an `LLM` in a different process, avoiding the GIL and allowing the generation loop to be parallelized. Creating a `ProcessLLM` is as easy as:
```python
from distilabel.tasks import TextGenerationTask, Task
from distilabel.llm import ProcessLLM, LLM


def load_gpt_4(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-4",
        task=task,
        num_threads=4,
    )


llm = ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4)
future = llm.generate(
    inputs=[{"input": "Write a letter for Bob"}], num_generations=1
)  # (1)
llm.teardown()  # (2)
result = future.result()
# >>> print(result[0][0]["parsed_output"]["generations"])
# Dear Bob,
# I hope this letter finds you in good health and high spirits. I know it's been a while since we last caught up, and I wanted to take the time to connect and share a few updates.
# Life has been keeping me pretty busy lately. [Provide a brief overview of what you've been up to: work, school, family, hobbies, etc.]
# I've often found myself reminiscing about the good old days, like when we [include a memorable moment or shared experience with Bob].
```
1. The `ProcessLLM` returns a `Future` containing a list of lists of `LLMOutput`s.
2. The `ProcessLLM` needs to be terminated after usage. If the `ProcessLLM` is used by a `Pipeline`, it will be terminated automatically.
You can directly use a `ProcessLLM` as the `generator` or `labeller` in a `Pipeline`, as shown in the sketch below.
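A minimal sketch of that usage (illustrative only: it reuses the `load_gpt_4` function defined above and assumes a `Pipeline` class accepting a `generator` argument, as covered in the Pipeline documentation):

```python
from distilabel.llm import ProcessLLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask

# The Pipeline takes ownership of the ProcessLLM and tears it down automatically
pipeline = Pipeline(
    generator=ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4),
)
```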
Apart from that, there are situations in which you may want to generate texts using several `LLM`s in parallel. For this purpose, we provide the `LLMPool` class:
```python
from distilabel.tasks import TextGenerationTask, Task
from distilabel.llm import ProcessLLM, LLM, LLMPool


def load_gpt_3(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-3.5-turbo",
        task=task,
        num_threads=4,
    )


def load_gpt_4(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-4",
        task=task,
        num_threads=4,
    )


pool = LLMPool(llms=[
    ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_3),
    ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt_4),
])
result = pool.generate(
    inputs=[{"input": "Write a letter for Bob"}], num_generations=2
)
pool.teardown()
# >>> print(result[0][0]["parsed_output"]["generations"])
# Dear Bob,
# I hope this letter finds you in good health and high spirits. I know it's been a while since we last caught up, and I wanted to take the time to connect and share a few updates.
# Life has been keeping me pretty busy lately. [Provide a brief overview of what you've been up to: work, school, family, hobbies, etc.]
# I've often found myself reminiscing about the good old days, like when we [include a memorable moment or shared experience with Bob].
# >>> print(result[0][1]["parsed_output"]["generations"])
# Of course, I'd be happy to draft a sample letter for you. However, I would need some additional
# information including who "Bob" is, the subject matter of the letter, the tone (formal or informal),
# and any specific details or points you'd like to include. Please provide some more context and I'll do my best to assist you.
```