
Concepts

This page familiarizes you with the basic concepts of the framework, describing its most important components or classes and how they work together. The following sections walk you through the primary components: the Pipeline, the LLM (both generator and labeller), and the Task.


distilabel flow diagram

Components

Task

The Task class is the one in charge of defining the behaviour of the LLM, and therefore it determines whether an LLM acts as a generator or as a labeller. To do so, the Task class generates the prompt that will be sent to the LLM from a template. It also defines which input arguments are required to build the prompt, and which output arguments will be extracted from the LLM response. It's worth mentioning that the Task class doesn't return a str, but a Prompt class which will generate the str format depending on the LLM that is going to be used (Zephyr, Llama, OpenAI, etc.).

from distilabel.tasks import UltraJudgeTask

task = UltraJudgeTask()

input = (
    "Can you provide a corrected version of the following sentence using proper "
    'English grammar? "We going to the beach" Additionally, could you please '
    "provide your correction in an Excel table format with the following columns: "
    "| Incorrect Sentence | Corrected Sentence | |-------------------|--------------------|"
)

generations = [
    (
        "| Incorrect Sentence | Corrected Sentence |\n|-------------------|-------------------"
        '-----|\n| "We going to the beach" | "We are going to the beach" |\n\nCorrectio'
        'n: The verb in the second sentence ("are") changes to reflect the subject\'s ("w'
        'e") agreement with the verb "be." This is called subject-verb agreement. In the '
        'first sentence, the verb "going" infers that the action is ongoing or in a contin'
        "uous state, which is not the case. Therefore, the second sentence is grammatically "
        "correct."
    ),
    (
        "| Incorrect Sentence | Corrected Sentence |\n|-------------------|-------------------"
        "-----|\n| We going to the beach | We are going to the beach | \n\nHere's a breakdo"
        'wn of the correction:\n\n- "We going to the beach" is an example of a subject-ve'
        'rb agreement error. The verb changing from the third person singular ("is") to t'
        'he third person plural ("are") in this instance, as there are multiple people go'
        'ing to the beach.\n- The "g" in "going" changes to an "e" due to a hard "g"'
        ' sound being followed by an "e," which is a common spelling rule in English.'
    ),
]


prompt = task.generate_prompt(input, generations)
print(prompt.format_as("default"))  # format as "openai", "zephyr", "llama", ...
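
To make that input/output contract concrete, here is a minimal sketch of a custom task. It assumes the base Task and Prompt classes are importable from distilabel.tasks and that a task only needs to provide these four members (check the base class for any extra requirements in your version). The class name, prompt wording, and the score output field are hypothetical:

from typing import Dict, List

from distilabel.tasks import Prompt, Task


class SimpleScoreTask(Task):  # hypothetical task, not part of distilabel
    @property
    def input_args_names(self) -> List[str]:
        # Dataset columns the task needs to build the prompt
        return ["input"]

    @property
    def output_args_names(self) -> List[str]:
        # Keys that parse_output is expected to return
        return ["score"]

    def generate_prompt(self, input: str) -> Prompt:
        # A Task returns a Prompt, not a str; the LLM later renders it
        # with format_as("openai"), format_as("llama2"), etc.
        return Prompt(
            system_prompt="You are a strict grader.",
            formatted_prompt=f"Rate the following text from 1 to 10:\n{input}",
        )

    def parse_output(self, output: str) -> Dict[str, int]:
        # Extract the declared output args from the raw completion
        return {"score": int(output.strip().split()[0])}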

LLM

The LLM class represents a language model and implements the way to interact with it. It also defines the generation parameters that can be passed to the model to tweak the generations. As mentioned above, each LLM has an associated Task that it will use to generate the prompt and to extract the output from the generation.

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraJudgeTask

labeller = OpenAILLM(
    model="gpt-3.5-turbo",
    task=UltraJudgeTask(),
    prompt_format="openai",
    max_new_tokens=2048,
    temperature=0.0,
)

outputs = labeller.generate(
    inputs=[
        {
            "input": "Here's a math problem that you need to resolve: 2 + 2 * 3. What's the result of this problem? Explain it",
            "generations": [
                (
                    "The output of the math problem 2 + 2 * 3 is calculated by following "
                    "the order of operations (PEMDAS). First, perform the multiplication: "
                    "2 * 3 = 6. Then, perform the addition: 2 + 6 = 8. Therefore, the "
                    "output of the problem is 8."
                ),
                (
                    "The correct solution to the math problem is 8. To get the correct "
                    "answer, we follow the order of operations (PEMDAS) and perform "
                    "multiplication before addition. So, first, we solve 2 * 3 = 6, "
                    "then we add 2 to 6 to get 8."
                ),
            ],
        }
    ]
)

print(outputs[0][0]["parsed_output"])

Note

To run the script successfully, ensure you have assigned your OpenAI API key to the OPENAI_API_KEY environment variable.
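
The nested indexing outputs[0][0]["parsed_output"] above reflects the shape of the result: generate returns one list per input, each containing one output dict per generation. A minimal sketch of walking that structure (parsed_output appears in the example above; a raw_output key holding the unparsed completion is an assumption about this version's output dict):

# outputs holds one list per input, with one dict per generation
for input_outputs in outputs:
    for generation in input_outputs:
        # "parsed_output" holds the fields extracted by the task's parse_output;
        # "raw_output" is assumed to hold the unparsed LLM completion
        print(generation["parsed_output"])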

Pipeline

The Pipeline class orchestrates the whole generation and labelling process; it's in charge of batching the input dataset, as well as reporting the generation progress. It's worth mentioning that it is not mandatory to pass both a generator LLM and a labeller LLM to the Pipeline class, as it can also be used for generation or labelling only.

Pipelines

Generator and labeller

from datasets import load_dataset
from distilabel.llm import LlamaCppLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask, UltraJudgeTask
from llama_cpp import Llama

dataset = load_dataset("argilla/distilabel-docs", split="train")
dataset = dataset.remove_columns(
    [
        column
        for column in dataset.column_names
        if column not in ["input", "generations"]
    ]
)

pipeline = Pipeline(
    generator=LlamaCppLLM(
        model=Llama(
            model_path="./llama-2-7b-chat.Q4_0.gguf",
            verbose=False,
            n_ctx=1024,
        ),
        task=TextGenerationTask(),
        max_new_tokens=512,
        prompt_format="llama2",
    ),
    labeller=OpenAILLM(
        model="gpt-3.5-turbo",
        task=UltraJudgeTask(),
        prompt_format="openai",
        max_new_tokens=1024,
        num_threads=1,
        temperature=0.0,
    ),
)


dataset = pipeline.generate(dataset, num_generations=2, batch_size=5)

Note

To run the script successfully, ensure you have assigned your OpenAI API key to the OPENAI_API_KEY environment variable and that you have downloaded the llama-2-7b-chat.Q4_0.gguf model file into the same folder as the script.

Only generator

from datasets import load_dataset
from distilabel.llm import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask
from llama_cpp import Llama

dataset = load_dataset("argilla/distilabel-docs", split="train")
dataset = dataset.remove_columns(
    [column for column in dataset.column_names if column not in ["input"]]
)

pipeline = Pipeline(
    generator=LlamaCppLLM(
        model=Llama(
            model_path="./llama-2-7b-chat.Q4_0.gguf",
            verbose=False,
            n_ctx=1024,
        ),
        task=TextGenerationTask(),
        max_new_tokens=512,
        prompt_format="llama2",
    ),
)


dataset = pipeline.generate(dataset, num_generations=2, batch_size=5)

Only labeller

from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import UltraJudgeTask

dataset = load_dataset("argilla/distilabel-docs", split="train")
dataset = dataset.remove_columns(
    [
        column
        for column in dataset.column_names
        if column not in ["input", "generations"]
    ]
)

pipeline = Pipeline(
    labeller=OpenAILLM(
        model="gpt-3.5-turbo",
        task=UltraJudgeTask(),
        prompt_format="openai",
        max_new_tokens=1024,
        num_threads=1,
        temperature=0.0,
    ),
)


dataset = pipeline.generate(dataset, num_generations=2, batch_size=5)
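
Whichever combination of generator and labeller you use, pipeline.generate returns a Hugging Face datasets-style dataset extended with the columns produced by the tasks, so the usual datasets API still applies. A small sketch (the repo id is hypothetical, and the exact added column names depend on the tasks used):

# Inspect the columns the pipeline added (exact names depend on the tasks)
print(dataset.column_names)

# Standard `datasets` methods keep working, e.g. sharing the result on the Hub
dataset.push_to_hub("my-org/my-distilabel-dataset")  # hypothetical repo id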