
Tasks

In this section we will see what a Task is and the list of tasks available in distilabel.

Task

The Task class takes charge of setting how the LLM behaves, deciding whether it acts as a generator or a labeller. To accomplish this, the Task class creates a prompt using a template that will be sent to the LLM. It specifies the necessary input arguments for generating the prompt and identifies the output arguments to be extracted from the LLM response. The Task class yields a Prompt that can generate a string in the format needed, depending on the specific LLM used.
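
For instance, the snippet below is a minimal sketch of this flow. The format_as method and the "llama2" format name are assumptions based on the Prompt class mentioned later in this section; check the API reference for the exact supported formats.

from distilabel.tasks import TextGenerationTask

task = TextGenerationTask()
# Build a `Prompt` from the task's system prompt and the given input.
prompt = task.generate_prompt(input="What is the capital of France?")
# Assumed: `format_as` renders the prompt as a string in the format
# expected by a given model family, here the Llama2 chat format.
print(prompt.format_as("llama2"))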

All the Tasks define a system_prompt, which serves as the initial instruction given to the LLM, guiding it on what kind of information or output is expected, as well as the following methods:

  • generate_prompt: This method will be used by the LLM to create the prompts that will be fed to the model.
  • parse_output: After the LLM has generated the content, this method will be called on the raw outputs of the model to extract the relevant content (scores, rationales, etc).
  • input_args_names and output_args_names: These methods are used in the Pipeline to process the datasets. The first one defines the columns that will be extracted from the dataset to build the prompt, either when the LLM acts as a generator or labeller alone, or the columns that should be placed in the dataset to be processed by the labeller LLM when the Pipeline has both a generator and a labeller. The second one is in charge of inserting the defined fields as columns of the generated dataset. A custom implementation of these methods is sketched further below.

After defining a task, the only action required is to pass it to the corresponding LLM. All the intricate processes are then handled internally:

from distilabel.llm import TransformersLLM
from distilabel.tasks import TextGenerationTask

# This snippet uses `TransformersLLM`, but is the same for every other `LLM`.
generator = TransformersLLM(
    model=...,
    tokenizer=...,
    task=TextGenerationTask(),
)
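
If none of the built-in tasks fits your use case, the methods above can also be implemented in a custom subclass. The following is a minimal, hypothetical sketch of a sentiment-labelling task, assuming that Task is a dataclass and that Prompt lives in distilabel.tasks.prompt; refer to the API reference for the exact base class contract.

from dataclasses import dataclass
from typing import List

from distilabel.tasks import Task
from distilabel.tasks.prompt import Prompt


@dataclass
class SentimentTask(Task):
    # Hypothetical task that labels the sentiment of a given text.
    system_prompt: str = "You are a sentiment classifier."

    def generate_prompt(self, input: str, **kwargs) -> Prompt:
        # Build the prompt that will be sent to the LLM.
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=f"Classify the sentiment of: {input}",
        )

    def parse_output(self, output: str) -> dict:
        # Extract the relevant content from the raw LLM output.
        return {"sentiment": output.strip().lower()}

    @property
    def input_args_names(self) -> List[str]:
        # Columns extracted from the dataset to build the prompt.
        return ["input"]

    @property
    def output_args_names(self) -> List[str]:
        # Columns inserted into the generated dataset.
        return ["sentiment"]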

Given this explanation, distilabel distinguishes between two primary categories of tasks: those focused on text generation and those centered around labelling. These Task classes define the LLM's behaviour, be it the creation of textual content or the assignment of labels to text, each with precise guidelines tailored to their respective functionalities. Users can leverage these distinct task types to tailor the LLM's behaviour to their specific application needs.

Text Generation

These classes are designed to steer an LLM in generating text with specific guidelines. They provide a structured approach for instructing the LLM to generate content tailored to predefined criteria.

TextGenerationTask

This is the base class for text generation, and includes the following fields for guiding the generation process:

  • system_prompt, which serves as the initial instruction or query given to the LLM, guiding it on what kind of information or output is expected.
  • A list of principles to inject into the system_prompt, which by default correspond to those defined in the UltraFeedback paper1,
  • and lastly, a distribution for these principles, so the LLM can be directed towards the different principles with a more customized behaviour (see the sketch below).
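
As a hedged sketch, a custom principles distribution could look as follows. The principle names and the principles_distribution argument are assumptions based on the description above; check the API reference for the exact names and defaults.

from distilabel.tasks import TextGenerationTask

task = TextGenerationTask(
    system_prompt="You are an expert writing assistant.",
    # Assumed principle names: inject the "helpfulness" principles 80% of
    # the time and the "truthfulness" ones the remaining 20%.
    principles_distribution={"helpfulness": 0.8, "truthfulness": 0.2},
)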

For the API reference visit TextGenerationTask.

Llama2TextGenerationTask

This class inherits from the TextGenerationTask and is specially prepared to deal with prompts in the format expected by the Llama2 model, so it should be the go-to task for text generation LLMs that were trained using this prompt format. The specific prompt formats can be found in the source code of the Prompt class.

from distilabel.llm import TransformersLLM
from distilabel.tasks import Llama2TextGenerationTask

# This snippet uses `TransformersLLM`, but is the same for every other `LLM`.
generator = TransformersLLM(
    model=...,
    tokenizer=...,
    task=Llama2TextGenerationTask(),
)

For the API reference visit Llama2TextGenerationTask.

OpenAITextGenerationTask

The OpenAI task for text generation is similar to the Llama2TextGenerationTask, but with the specific prompt format expected by the chat completion task from OpenAI.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import OpenAITextGenerationTask

generator = OpenAILLM(
    task=OpenAITextGenerationTask(), openai_api_key=os.getenv("OPENAI_API_KEY")
)

For the API reference visit OpenAITextGenerationTask.

SelfInstructTask

This task is specially designed to build prompts following the Self-Instruct paper: SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions.

From the original repository: The Self-Instruct process is an iterative bootstrapping algorithm that starts with a seed set of manually-written instructions and uses them to prompt the language model to generate new instructions and corresponding input-output instances. This makes the Task specially interesting for generating new datasets from a set of predefined topics.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import SelfInstructTask

generator = OpenAILLM(
    task=SelfInstructTask(
        system_prompt="You are a question-answering assistant for...",
        application_description="AI assistant",
        num_instructions=3,
    ),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

For the API reference visit SelfInstructTask.

Labelling

Instead of generating text, you can instruct the LLM to label datasets. The existing tasks are designed specifically for creating Preference datasets.

Preference

Preference datasets for Large Language Models (LLMs) are collections of examples that show how people rank or prefer one response over another in a straightforward and clear manner. These datasets help train language models to understand and generate content that aligns with user preferences, enhancing the model's ability to produce contextually relevant and preferred outputs.

Unlike the TextGenerationTask, the PreferenceTask is not intended for direct use. It implements the default input_args_names and output_args_names methods, but generate_prompt and parse_output are specific to each PreferenceTask. Examining the output_args_names reveals that the generation will encompass both the rating and the rationale that influenced that rating.
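
For instance, inspecting one of the predefined preference tasks shows this split. The exact column names printed below are assumptions; verify them against the API reference.

from distilabel.tasks import UltraFeedbackTask

task = UltraFeedbackTask.for_text_quality()
# Columns read from the dataset: the instruction and the generations to rate.
print(task.input_args_names)   # e.g. ["input", "generations"]
# Columns added to the labelled dataset: the rating and its rationale.
print(task.output_args_names)  # e.g. ["rating", "rationale"]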

UltraFeedbackTask

This task is specifically designed to build the prompts following the format defined in the "UltraFeedback: Boosting Language Models With High Quality Feedback" paper.

From the original repository: To collect high-quality preference and textual feedback, we design a fine-grained annotation instruction, which contains 4 different aspects, namely instruction-following, truthfulness, honesty and helpfulness. This Task is designed to label datasets following the different aspects defined for the UltraFeedback dataset creation.

The following snippet can be used as a simplified UltraFeedback Task, for which we define 3 different ratings, but note that the predefined versions are intended to be used out of the box:

from textwrap import dedent

from distilabel.tasks.preference.ultrafeedback import Rating, UltraFeedbackTask

task_description = dedent(
    """
    # General Text Quality Assessment
    Evaluate the model's outputs based on various criteria:
    1. **Correctness & Informativeness**: Does the output provide accurate and helpful information?
    2. **Honesty & Uncertainty**: How confidently does the model convey its information, and does it express uncertainty appropriately?
    3. **Truthfulness & Hallucination**: Does the model introduce misleading or fabricated details?
    4. **Instruction Following**: Does the model's output align with given instructions and the user's intent?
    Your role is to provide a holistic assessment considering all the above factors.

    **Scoring**: Rate outputs 1 to 3 based on the overall quality, considering all aspects:
    """
)

ratings = [
    Rating(value=1, description="Low Quality"),
    Rating(value=2, description="Moderate Quality"),
    Rating(value=3, description="Good Quality"),
]

ultrafeedback_task = UltraFeedbackTask(
    system_prompt="Your role is to evaluate text quality based on given criteria",
    task_description=task_description,
    ratings=ratings,
)

The following example uses an LLM to examine the data against the text quality criteria, which include the different criteria from UltraFeedback (Correctness & Informativeness, Honesty & Uncertainty, Truthfulness & Hallucination, and Instruction Following):

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_text_quality(),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

The following example creates an UltraFeedback task to emphasize helpfulness, that is, the overall quality and correctness of the output:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_helpfulness(), openai_api_key=os.getenv("OPENAI_API_KEY")
)

The following example creates an UltraFeedback task to emphasize truthfulness and hallucination assessment:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_truthfulness(),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

The following example creates an UltraFeedback task to emphasize honesty and uncertainty expression assessment:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_honesty(), openai_api_key=os.getenv("OPENAI_API_KEY")
)

The following example creates an UltraFeedback task to emphasize the evaluation of alignment between output and intent:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_instruction_following(),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

For the API reference visit UltraFeedbackTask.

JudgeLMTask

This task is specially designed to build prompts following the JudgeLM paper: JudgeLM: Fine-tuned Large Language Models Are Scalable Judges. It is intended to evaluate the performance of AI assistants.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import JudgeLMTask

labeller = OpenAILLM(task=JudgeLMTask(), openai_api_key=os.getenv("OPENAI_API_KEY"))

For the API reference visit JudgeLMTask.

UltraJudgeTask

This class implements a PreferenceTask specifically for a better evaluation using AI Feedback. The task is defined based on both UltraFeedback and JudgeLM, but with several improvements and modifications.

It introduces an additional argument to differentiate various areas for processing. While these areas can be customized, the default values are as follows:

from distilabel.tasks import UltraJudgeTask

# To see the complete system_prompt and task_description please take a look at the UltraJudgeTask definition
ultrajudge_task = UltraJudgeTask(
    system_prompt="You are an evaluator tasked with assessing AI assistants' responses from the perspective of typical user preferences...",
    task_description="Your task is to rigorously evaluate the performance of...",
    areas=[
        "Practical Accuracy",
        "Clarity & Transparency",
        "Authenticity & Reliability",
        "Compliance with Intent",
    ],
)

It can be used directly in the following way:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraJudgeTask

labeller = OpenAILLM(task=UltraJudgeTask(), openai_api_key=os.getenv("OPENAI_API_KEY"))

For the API reference visit UltraJudgeTask.


  1. The principles can be found in the codebase. More information on the Principle Sampling can be found in the UltraFeedback repository.