# Concepts

This page familiarizes you with the basic concepts of the framework, describing its most important components or classes and how they work together. The following sections will guide you through the primary components of the framework: the `Pipeline`, the `LLM` (both generator and labeller), and the `Task`.
## Components

### Task
The `Task` class is the one in charge of defining the behaviour of the `LLM`, and therefore it determines whether an `LLM` acts as a generator or as a labeller. To do so, the `Task` class generates the prompt that will be sent to the `LLM` from a template. It also defines which input arguments are required to generate the prompt, and which output arguments will be extracted from the `LLM` response. It's worth mentioning that the `Task` class doesn't return a `str`, but a `Prompt` class which will generate the `str` format depending on the `LLM` that is going to be used (Zephyr, Llama, OpenAI, etc.).
```python
from distilabel.tasks import UltraJudgeTask

task = UltraJudgeTask()

input = (
    "Can you provide a corrected version of the following sentence using proper "
    'English grammar? "We going to the beach" Additionally, could you please '
    "provide your correction in an Excel table format with the following columns: "
    "| Incorrect Sentence | Corrected Sentence | |-------------------|--------------------|"
)

generations = [
    (
        "| Incorrect Sentence | Corrected Sentence |\n|-------------------|-------------------"
        '-----|\n| "We going to the beach" | "We are going to the beach" |\n\nCorrectio'
        'n: The verb in the second sentence ("are") changes to reflect the subject\'s ("w'
        'e") agreement with the verb "be." This is called subject-verb agreement. In the '
        'first sentence, the verb "going" infers that the action is ongoing or in a contin'
        "uous state, which is not the case. Therefore, the second sentence is grammatically "
        "correct."
    ),
    (
        "| Incorrect Sentence | Corrected Sentence |\n|-------------------|-------------------"
        "-----|\n| We going to the beach | We are going to the beach | \n\nHere's a breakdo"
        'wn of the correction:\n\n- "We going to the beach" is an example of a subject-ve'
        'rb agreement error. The verb changing from the third person singular ("is") to t'
        'he third person plural ("are") in this instance, as there are multiple people go'
        'ing to the beach.\n- The "g" in "going" changes to an "e" due to a hard "g"'
        ' sound being followed by an "e," which is a common spelling rule in English.'
    ),
]

prompt = task.generate_prompt(input, generations)
print(prompt.format_as("default"))  # format as "openai", "zephyr", "llama", ...
```
### LLM
The `LLM` class represents a language model and implements the way to interact with it. It also defines the generation parameters that can be passed to the model to tweak the generations. As mentioned above, the `LLM` will have a `Task` associated that it will use to generate the prompt and to extract the output from the generation.
```python
from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraJudgeTask

labeller = OpenAILLM(
    model="gpt-3.5-turbo",
    task=UltraJudgeTask(),
    prompt_format="openai",
    max_new_tokens=2048,
    temperature=0.0,
)

outputs = labeller.generate(
    inputs=[
        {
            "input": "Here's a math problem that you need to resolve: 2 + 2 * 3. What's the result of this problem? Explain it",
            "generations": [
                (
                    "The output of the math problem 2 + 2 * 3 is calculated by following "
                    "the order of operations (PEMDAS). First, perform the multiplication: "
                    "2 * 3 = 6. Then, perform the addition: 2 + 6 = 8. Therefore, the "
                    "output of the problem is 8."
                ),
                (
                    "The correct solution to the math problem is 8. To get the correct "
                    "answer, we follow the order of operations (PEMDAS) and perform "
                    "multiplication before addition. So, first, we solve 2 * 3 = 6, "
                    "then we add 2 to 6 to get 8."
                ),
            ],
        }
    ]
)

print(outputs[0][0]["parsed_output"])
```
Note

To run the script successfully, ensure you have assigned your OpenAI API key to the `OPENAI_API_KEY` environment variable.
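For example, on a Unix-like shell the variable can be set for the current session as follows (the key value shown is a placeholder, not a real key):

```shell
# Export the OpenAI API key for the current shell session.
# "sk-your-key-here" is a placeholder; replace it with your actual key.
export OPENAI_API_KEY="sk-your-key-here"
```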
### Pipeline
The `Pipeline` class orchestrates the whole generation and labelling process, and it is in charge of batching the input dataset, as well as reporting the generation progress. It's worth mentioning that it is not mandatory to pass both a generator `LLM` and a labeller `LLM` to the `Pipeline` class, as it can also be used only for generation or only for labelling.
The following example uses both a generator and a labeller:

```python
from datasets import load_dataset
from distilabel.llm import LlamaCppLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask, UltraJudgeTask
from llama_cpp import Llama

dataset = load_dataset("argilla/distilabel-docs", split="train")
dataset = dataset.remove_columns(
    [
        column
        for column in dataset.column_names
        if column not in ["input", "generations"]
    ]
)

pipeline = Pipeline(
    generator=LlamaCppLLM(
        model=Llama(
            model_path="./llama-2-7b-chat.Q4_0.gguf",
            verbose=False,
            n_ctx=1024,
        ),
        task=TextGenerationTask(),
        max_new_tokens=512,
        prompt_format="llama2",
    ),
    labeller=OpenAILLM(
        model="gpt-3.5-turbo",
        task=UltraJudgeTask(),
        prompt_format="openai",
        max_new_tokens=1024,
        num_threads=1,
        temperature=0.0,
    ),
)

dataset = pipeline.generate(dataset, num_generations=2, batch_size=5)
```
Note

To run the script successfully, ensure you have assigned your OpenAI API key to the `OPENAI_API_KEY` environment variable and that you have downloaded the file `llama-2-7b-chat.Q4_0.gguf` to the same folder as the script.
If only a generator `LLM` is passed, the `Pipeline` will just generate responses:

```python
from datasets import load_dataset
from distilabel.llm import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask
from llama_cpp import Llama

dataset = load_dataset("argilla/distilabel-docs", split="train")
dataset = dataset.remove_columns(
    [column for column in dataset.column_names if column not in ["input"]]
)

pipeline = Pipeline(
    generator=LlamaCppLLM(
        model=Llama(
            model_path="./llama-2-7b-chat.Q4_0.gguf",
            verbose=False,
            n_ctx=1024,
        ),
        task=TextGenerationTask(),
        max_new_tokens=512,
        prompt_format="llama2",
    ),
)

dataset = pipeline.generate(dataset, num_generations=2, batch_size=5)
```
If only a labeller `LLM` is passed, the `Pipeline` will just label the existing generations:

```python
from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import UltraJudgeTask

dataset = load_dataset("argilla/distilabel-docs", split="train")
dataset = dataset.remove_columns(
    [
        column
        for column in dataset.column_names
        if column not in ["input", "generations"]
    ]
)

pipeline = Pipeline(
    labeller=OpenAILLM(
        model="gpt-3.5-turbo",
        task=UltraJudgeTask(),
        prompt_format="openai",
        max_new_tokens=1024,
        num_threads=1,
        temperature=0.0,
    ),
)

dataset = pipeline.generate(dataset, num_generations=2, batch_size=5)
```