Tasks

In this section we will see what a Task is and the list of tasks available in distilabel.

Task

The Task class takes charge of setting how the LLM behaves, deciding whether it acts as a generator or a labeller. To accomplish this, the Task class creates a prompt using a template that will be sent to the LLM. It specifies the necessary input arguments for generating the prompt and identifies the output arguments to be extracted from the LLM response. The Task class yields a Prompt that can generate a string in the format needed, depending on the specific LLM used.

All the Tasks define a system_prompt, which serves as the initial instruction given to the LLM, guiding it on what kind of information or output is expected, and the following methods:

  • generate_prompt: This method will be used by the LLM to create the prompts that will be fed to the model.
  • parse_output: After the LLM has generated the content, this method will be called on the raw outputs of the model to extract the relevant content (scores, rationales, etc).
  • input_args_names and output_args_names: These methods are used in the Pipeline to process the datasets. The first one defines the columns that will be extracted from the dataset to build the prompt, either when the LLM acts as a generator or labeller alone, or the columns that should be placed in the dataset to be processed by the labeller LLM, when the Pipeline has both a generator and a labeller. The second one is in charge of inserting the defined fields as columns of the generated dataset.
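
To make these methods concrete, here is a minimal sketch of a hypothetical custom task; the SummaryTask name, its prompt wording and the summary output column are illustrative assumptions, not part of distilabel:

from dataclasses import dataclass
from typing import Dict, List

from distilabel.tasks import TextGenerationTask
from distilabel.tasks.prompt import Prompt


@dataclass
class SummaryTask(TextGenerationTask):
    # Initial instruction that guides the LLM's behaviour.
    system_prompt: str = "You are a helpful summarization assistant."

    def generate_prompt(self, input: str) -> Prompt:
        # Builds the templated prompt that will be sent to the LLM.
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=f"Summarize the following text:\n\n{input}",
        )

    def parse_output(self, output: str) -> Dict[str, str]:
        # Extracts the relevant content from the raw LLM response.
        return {"summary": output.strip()}

    @property
    def input_args_names(self) -> List[str]:
        # Columns extracted from the dataset to build the prompt.
        return ["input"]

    @property
    def output_args_names(self) -> List[str]:
        # Columns inserted into the generated dataset.
        return ["summary"]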

After defining a task, the only action required is to pass it to the corresponding LLM. All the intricate processes are then handled internally:

from distilabel.llm import TransformersLLM
from distilabel.tasks import TextGenerationTask

# This snippet uses `TransformersLLM`, but it is the same for every other `LLM`.
generator = TransformersLLM(
    model=...,
    tokenizer=...,
    task=TextGenerationTask(),
)

With this in mind, distilabel distinguishes between two primary categories of tasks: those focused on text generation and those centered around labelling. These Task classes define the LLM's behaviour, be it the creation of textual content or the assignment of labels to text, each with precise guidelines tailored to their respective functionalities. Users can seamlessly leverage these distinct task types to tailor the LLM's behaviour according to their specific application needs.

Text Generation

This set of classes is designed to steer an LLM in generating text with specific guidelines. They provide a structured approach to instruct the LLM to generate content in a manner tailored to predefined criteria.

TextGenerationTask

This is the base class for text generation, and includes the following fields for guiding the generation process:

  • system_prompt, which serves as the initial instruction or query given to the LLM, guiding it on what kind of information or output is expected.
  • A list of principles to inject into the system_prompt, which by default correspond to those defined in the UltraFeedback paper1,
  • and lastly a distribution for these principles, so the LLM can be directed towards the different principles with a more customized behaviour.
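
For instance, these fields can be customized when instantiating the task. A minimal sketch follows; the principles and principles_distribution argument names are assumptions to verify against the API reference, and the principle texts are illustrative:

from distilabel.tasks import TextGenerationTask

# Assumed argument names (check the API reference): a mapping of principle
# groups to principle texts, plus a sampling distribution over those groups.
task = TextGenerationTask(
    system_prompt="You are a helpful, harmless assistant.",
    principles={
        "harmlessness": ["Avoid generating harmful or unsafe content."],
        "helpfulness": ["Provide useful, complete and accurate answers."],
    },
    principles_distribution={"harmlessness": 0.5, "helpfulness": 0.5},
)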

For the API reference visit TextGenerationTask.

SelfInstructTask

This task is specially designed to build the prompts following the Self-Instruct paper: SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions.

From the original repository:

The Self-Instruct process is an iterative bootstrapping algorithm that starts with a seed set of manually-written instructions and uses them to prompt the language model to generate new instructions and corresponding input-output instances.

This makes the Task especially interesting for generating new datasets from a set of predefined topics.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import SelfInstructTask

generator = OpenAILLM(
    task=SelfInstructTask(
        system_prompt="You are a question-answering assistant for...",
        application_description="AI assistant",
        num_instructions=3,
        criteria_for_query_generation="Design queries to be... ",
    ),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

For the API reference visit SelfInstructTask.

You can personalize the way in which your SelfInstructTask behaves by changing the default values of the parameters to something that suits your use case. Let's go through them:

  • System Prompt: you can control the overall behaviour and expectations of your model.
  • Application Description: a description of the AI application. By default, we use "AI Assistant".
  • Number of instructions: number of instructions in the prompt.
  • Criteria for Query Generation: the criteria for query generation that we want our model to follow. The default value covers the default behaviour of SelfInstructTask. This value is passed to the .jinja template, where extra instructions are added to ensure the correct output format.

You can see an example of how to customise a SelfInstructTask to create Haikus in the snippet in the subsection Custom TextGenerationTask.

EvolInstructTask

This task is specially designed to build the prompts following the Evol-Instruct strategy proposed in WizardLM: Empowering Large Language Models to Follow Complex Instructions.

From the original repository:

Evol-Instruct is a novel method using LLMs instead of humans to automatically mass-produce open-domain instructions of various difficulty levels and skill range, to improve the performance of LLMs.

Use this Task to build more complete and complex datasets starting from simple ones.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import EvolInstructTask

generator = OpenAILLM(
    task=EvolInstructTask(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

You can take a look at a sample dataset generated using EvolInstructTask.

Note

The original definition of Evol-Instruct considers an elimination evolving step with different situations to remove the responses considered failures; Section 3.2, Elimination Evolving, of the WizardLM paper shows these steps. We have implemented steps 2-4 as part of this task, but not step 1, which can be implemented using labellers in distilabel, as sketched below.
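
For instance, step 1 (detecting whether an evolved instruction adds no new information compared to its source) can be approximated with an equality-judging labeller such as the WizardLMEqualPrompts custom task defined later in this page; a minimal sketch:

import os

from distilabel.llm import OpenAILLM

# WizardLMEqualPrompts is the custom task defined in the
# "Custom TextGenerationTask" section below; it judges whether two
# instructions are equal, covering step 1 of the elimination process.
labeller = OpenAILLM(
    task=WizardLMEqualPrompts(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)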

For the API reference visit EvolInstructTask.

EvolComplexity

The Deita framework presents a data selection pipeline whose two initial steps consist of adopting an evolution-based approach as defined in WizardLM. The first evolution step, related to complexity, is the same as the EvolInstruct task, exposed under the name given in the paper:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import EvolComplexityTask

generator = OpenAILLM(
    task=EvolComplexityTask(),
    api_key=os.getenv("OPENAI_API_KEY"),
)

For the API reference visit EvolComplexityTask.

EvolQuality

The second step of the Deita framework consists of enhancing the quality of the instructions, in the same spirit as the EvolComplexityTask. The EvolQualityTask can be used to augment the quality of the instructions by enhancing helpfulness, augmenting relevance, enriching depth, fostering creativity, and supplying additional details, following the Deita implementation:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import EvolQualityTask

generator = OpenAILLM(
    task=EvolQualityTask(),
    api_key=os.getenv("OPENAI_API_KEY"),
)

The following example shows a dataset after evolving its quality with this task. For the initial dataset we have chosen argilla/distilabel-intel-orca-dpo-pairs for simplicity; the dataset needs to have both "input" and "generation" columns in order to allow for the evolution strategy proposed in the Deita paper:

import os
from datasets import Dataset

from distilabel.tasks import EvolQualityTask
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline


# Create a sample dataset (this one is inspired from the distilabel-intel-orca-dpo-pairs)
sample_evol_quality = Dataset.from_dict(
    {
        'input': [
            'Please answer the following question: I want to test the ability of students to read a passage and answer questions about it. Could you please come up with a good question for the passage "In 1901, the Federation of Australia was the process by which the six separate British self-governing colonies of New South Wales, Queensland, South Australia, Tasmania, Victoria and Western Australia formed one nation. They kept the systems of government that they had developed as separate colonies but also would have a federal government that was responsible for matters concerning the whole nation. When the Constitution of Australia came into force, the colonies collectively became states of the Commonwealth of Australia."?\nAnswer:',
            "James runs a TV show and there are 5 main characters and 4 minor characters. He pays the minor characters $15,000 each episode. He paid the major characters three times as much. How much does he pay per episode? Let's be accurate as possible."
        ],
        'generation': [
            "Certainly! Here's a detailed and long answer to the question you provided:\n\nQuestion: Read the passage below and answer the following questions.\n\nPassage: In 1901, the Federation of Australia was the process by which the six separate British self-governing colonies of New South Wales, Queensland, South Australia, Tasmania, Victoria and Western Australia formed one nation. They kept the systems of government that they had developed as separate colonies but also would have a federal government that was responsible for matters concerning the whole nation. When the Constitution of Australia came into force, the colonies collectively became states of the Commonwealth of Australia.\n\n1. What event did the passage describe?\n2. How many colonies were involved in the Federation of Australia?\n3. What did the colonies keep after forming one nation?\n4. What was the purpose of the federal government established by the Constitution of Australia?\n5. What were the colonies called after they became states of the Commonwealth of Australia?\n\nAnswers:\n\n1. The passage describes the Federation of Australia, which was the process by which six separate British self-governing colonies formed one nation.\n2. Six colonies were involved in the Federation of Australia: New South Wales, Queensland, South Australia, Tasmania, Victoria, and Western Australia.\n3. The colonies kept the systems of government that they had developed as separate colonies.\n4. The federal government established by the Constitution of Australia was responsible for matters concerning the whole nation.\n5. After becoming states of the Commonwealth of Australia, the colonies were called states.\n\nIn-depth analysis:\n\nThe passage provides a brief overview of the Federation of Australia, which was the process by which six separate British self-governing colonies formed one nation in 1901. The colonies kept the systems of government that they had developed as separate colonies, but they also established a federal government that was responsible for matters concerning the whole nation. This federal government was established by the Constitution of Australia, which came into force on January 1, 1901.\n\nThe passage highlights the significance of the Federation of Australia by noting that it was a process by which the colonies collectively became states of the Commonwealth of Australia. This transformation marked a significant milestone in the history of Australia, as it created a unified nation out of six separate colonies.\n\nThe passage also provides information about the systems of government that were kept by the colonies after forming one nation. Specifically, it notes that the colonies kept the systems of government that they had developed as separate colonies. This suggests that the Federation of Australia did not result in a complete overhaul of the existing systems of government, but rather a reorganization of these systems to create a federal government that was responsible for matters concerning the whole nation.\n\nFinally, the passage uses descriptive language to emphasize the importance of the Constitution of Australia in the Federation of Australia. It notes that the Constitution came into force on January 1, 1901, and that it established the federal government that was responsible for matters concerning the whole nation. 
This language emphasizes the significance of the Constitution in the formation of the Federation of Australia and the creation of the Commonwealth of Australia.\n\nIn conclusion, the passage provides a detailed overview of the Federation of Australia, highlighting the significance of the Constitution of Australia in the formation of the Commonwealth of Australia. It also provides information about the systems of government that were kept by the colonies after forming one nation, and emphasizes the importance of the Constitution in the creation of the federal government responsible for matters concerning the whole nation.",
            'James pays the minor characters $15,000 each episode. Since there are 4 minor characters, he pays them a total of 4 * $15,000 = $60,000 per episode.\n\nThe major characters are paid three times as much. So, each major character gets paid 3 * $15,000 = $45,000 per episode.\n\nThere are 5 main characters, so he pays them a total of 5 * $45,000 = $225,000 per episode.\n\nIn total, James pays $225,000 (major characters) + $60,000 (minor characters) = $285,000 per episode.'
        ]
    }
)

# Create the pipeline
pipe = Pipeline(
    generator=OpenAILLM(
        task=EvolQualityTask(),
        api_key=os.getenv("OPENAI_API_KEY"),
        temperature=1
    ),
)

# Run the pipeline on the sample dataset
sample_quality_dataset = pipe.generate(sample_evol_quality)

print(sample_quality_dataset.select_columns(["input", "generation", "generations"])[0])
# {
#   "input": "What happens next in this paragraph?\n\nShe then rubs a needle on a cotton ball then pushing it onto a pencil and wrapping thread around it. She then holds up a box of a product and then pouring several liquids into a bowl. she\nChoose your answer from: A. adds saucepan and shakes up the product in a grinder. B. pinches the thread to style a cigarette, and then walks away. C. then dips the needle in ink and using the pencil to draw a design on her leg, rubbing it off with a rag in the end. D. begins to style her hair and cuts it several times before parting the ends of it to show the hairstyle she has created.",
#   "generation": "C. She then dips the needle in ink and using the pencil to draw a design on her leg, rubbing it off with a rag in the end. In this option, she is continuing the process of using the needle, pencil, and thread, which is most related to what she was doing in the previous sentence.",
#   "generations": [
#     "C. Then, to everyone's surprise, she dips the needle in ink and starts using the pencil to draw an intricate design on her leg. The creativity in her actions is truly unparalleled. After showcasing her artwork, she skillfully rubs it off with a rag, leaving everyone in awe of her talent."
#   ]
# }

For the API reference visit EvolQualityTask.

Custom TextGenerationTask

The following examples show different ways of creating your own custom TextGenerationTask. Inherit from TextGenerationTask and implement generate_prompt and parse_output to customise the behaviour of the LLM.

This first task implements OSS-Instruct from Magicoder: Source Code Is All You Need. It generates problems and solutions from seed code snippets, following the paper's implementation:

from dataclasses import dataclass
from typing import Dict

from distilabel.tasks import TextGenerationTask
from distilabel.tasks.prompt import Prompt

oss_instruct_prompt = """Please gain inspiration from the following random code snippet to create a high-quality programming problem. Present your output in two distinct sections:
[Problem Description] and [Solution].
Code snippet for inspiration:

{code}

Guidelines for each section:
1. [Problem Description]: This should be **completely self-contained**, providing
all the contextual information one needs to understand and solve the problem.
Assume common programming knowledge, but ensure that any specific context,
variables, or code snippets pertinent to this problem are explicitly included.
2. [Solution]: Offer a comprehensive, **correct** solution that accurately
addresses the [Problem Description] you provided.
"""


@dataclass
class OSSInstruct(TextGenerationTask):
    system_prompt: str = "You are exceptionally skilled at crafting high-quality programming problems and offering precise solutions."

    def generate_prompt(self, input: str) -> Prompt:
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=oss_instruct_prompt.format(code=input),
        )

    def parse_output(self, output: str) -> Dict[str, str]:
        problem, solution = output.split("[Solution]")
        return {
            "problem": problem.replace("[Problem Description]", "").strip(),
            "solution": solution.strip(),
        }

Here you can see an example of how to customise a SelfInstructTask to create Haikus. The Haiku DPO dataset contains more information on how it was created.

from dataclasses import dataclass
from typing import Dict

from distilabel.tasks import SelfInstructTask


@dataclass
class CustomTask(SelfInstructTask):
    system_prompt: str = "You are an expert Haiku writer, writing the best and most diverse Haikus given topics as inputs."
    application_description: str = (
        "An AI assistant adept at writing Haiku.\n"
        "It expects complete suggestions from users providing details of the kind of haiku they want.\n"
        "The AI assistant will help users write haiku about particular topics and is willing to accept requests related to a specific subject or object or a more abstract request"
        "based on an emotion, theme or vibe.\n"
    )

    criteria_queries: str = (
        "Incorporate a diverse range of verbs, avoiding repetition.\n"
        "Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
        "Design queries to be self-contained and standalone."
    )

    def define_task(self):
        instruction_task = SelfInstructTask(
            num_instructions=15,
            application_description=self.application_description,
            criteria_for_query_generation=self.criteria_queries,
        )

        return instruction_task

    def parse_output(self, output: str) -> Dict[str, str]:
        return {"output": output}

The following task, obtained from WizardLM: Empowering Large Language Models to Follow Complex Instructions, can be used to check whether two instructions are equal, to decide whether to keep them in your dataset or remove redundant instructions:

from dataclasses import dataclass
import string
from typing import Dict

from distilabel.tasks import Prompt, TextGenerationTask

# Prompt from the WizardLM paper for the Equal Prompts task:
wizardllm_equal_prompt = """Here are two Instructions, do you think they are equal to each other and meet the following requirements?:
1. They have the same constraints and requirements.
2. They have the same depth and breadth of the inquiry.
The First Prompt: {first_instruction}
The Second Prompt: {second_instruction}
Your Judgement (Just answer: Equal or Not Equal. No need to explain the reason):"""


@dataclass
class WizardLMEqualPrompts(TextGenerationTask):
    """Task to check for the equality of two instructions following the Appendix G in
    [WizardLM paper](https://arxiv.org/abs/2304.12244).
    """

    system_prompt: str = "You are an AI judge in charge of determining the equality of two instructions. "

    def generate_prompt(self, input: List[str]) -> Prompt:
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=wizardllm_equal_prompt.format(
                first_instruction=input[0], second_instruction=input[1]
            ),
        )

    def parse_output(self, output: str) -> Dict[str, str]:
        """Remove punctuation from the string."""
        return {
            "generations": output.translate(str.maketrans("", "", string.punctuation))
        }

Labelling

Instead of generating text, you can instruct the LLM to label datasets. The existing tasks are designed specifically for creating both preference and critique datasets, via PreferenceTask and CritiqueTask.

Preference

Preference datasets for Large Language Models (LLMs) are sets of information that show how people rank or prefer one response over another in a straightforward and clear manner. These datasets help train language models to understand and generate content that aligns with user preferences, enhancing the model's ability to generate contextually relevant and preferred outputs.

Unlike the TextGenerationTask, the PreferenceTask is not intended for direct use. It implements the default methods input_args_names and output_args_names, but generate_prompt and parse_output are specific to each PreferenceTask. Examining the output_args_names reveals that the generation will encompass both the rating and the rationale that influenced that rating.
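
As a quick check, the following sketch inspects those arguments on one of the predefined preference tasks; the column names shown in the comments are assumptions, so verify them against the API reference:

from distilabel.tasks import UltraFeedbackTask

task = UltraFeedbackTask.for_overall_quality()
print(task.input_args_names)   # e.g. ["instruction", "generations"]
print(task.output_args_names)  # e.g. ["rating", "rationale"]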

UltraFeedbackTask

This task is specifically designed to build the prompts following the format defined in the "UltraFeedback: Boosting Language Models With High Quality Feedback" paper.

From the original repository:

To collect high-quality preference and textual feedback, we design a fine-grained annotation instruction, which contains 4 different aspects, namely instruction-following, truthfulness, honesty and helpfulness.

This Task is designed to label datasets following the different aspects defined for the UltraFeedback dataset creation.

The following snippet can be used as a simplified UltraFeedback task, for which we define 3 different ratings; take into account that the predefined versions are intended to be used out of the box:

from textwrap import dedent

from distilabel.tasks.preference.ultrafeedback import Rating, UltraFeedbackTask

task_description = dedent(
    """
    # General Text Quality Assessment
    Evaluate the model's outputs based on various criteria:
    1. **Correctness & Informativeness**: Does the output provide accurate and helpful information?
    2. **Honesty & Uncertainty**: How confidently does the model convey its information, and does it express uncertainty appropriately?
    3. **Truthfulness & Hallucination**: Does the model introduce misleading or fabricated details?
    4. **Instruction Following**: Does the model's output align with given instructions and the user's intent?
    Your role is to provide a holistic assessment considering all the above factors.

    **Scoring**: Rate outputs 1 to 3 based on the overall quality, considering all aspects:
    """
)

ratings = [
    Rating(value=1, description="Low Quality"),
    Rating(value=2, description="Moderate Quality"),
    Rating(value=3, description="Good Quality"),
]

ultrafeedback_task = UltraFeedbackTask(
    system_prompt="Your role is to evaluate text quality based on given criteria",
    task_description=task_description,
    ratings=ratings,
)

The following example creates an UltraFeedback task to emphasize helpfulness, that is, the overall quality and correctness of the output:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_helpfulness(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

The following example creates an UltraFeedback task to emphasize truthfulness and hallucination assessment:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_truthfulness(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

The following example creates an UltraFeedback task to emphasize honesty and uncertainty expression assessment:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_honesty(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

The following example creates an UltraFeedback task to emphasize the evaluation of alignment between output and intent:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_instruction_following(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

Additionally, we at Argilla created a custom subtask for UltraFeedback that generates an overall score evaluating all the aspects mentioned above within a single subtask. Otherwise, to get an overall score, all the subtasks above would have to be run and the average of their scores calculated.

The following example uses an LLM to examine the data against our custom overall quality criteria, which include the different criteria from UltraFeedback (Correctness & Informativeness, Honesty & Uncertainty, Truthfulness & Hallucination, and Instruction Following):

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraFeedbackTask

labeller = OpenAILLM(
    task=UltraFeedbackTask.for_overall_quality(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

For the API reference visit UltraFeedbackTask.

JudgeLMTask

This task is specially designed to build the prompts following the JudgeLM paper: JudgeLM: Fine-tuned Large Language Models Are Scalable Judges. It is designed to evaluate the performance of AI assistants.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import JudgeLMTask

labeller = OpenAILLM(task=JudgeLMTask(), api_key=os.getenv("OPENAI_API_KEY", None))

For the API reference visit JudgeLMTask.

UltraJudgeTask

This class implements a PreferenceTask specifically for a better evaluation using AI Feedback. The task is defined based on both UltraFeedback and JudgeLM, but with several improvements and modifications.

It introduces an additional argument to differentiate various areas for processing. While these areas can be customized, the default values are as follows:

from distilabel.tasks import UltraJudgeTask

# To see the complete system_prompt and task_description please take a look at the UltraJudgeTask definition
ultrajudge_task = UltraJudgeTask(
    system_prompt="You are an evaluator tasked with assessing AI assistants' responses from the perspective of typical user preferences...",
    task_description="Your task is to rigorously evaluate the performance of...",
    areas=[
        "Practical Accuracy",
        "Clarity & Transparency",
        "Authenticity & Reliability",
        "Compliance with Intent",
    ],
)

It can be used directly in the following way:

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import UltraJudgeTask

labeller = OpenAILLM(task=UltraJudgeTask(), api_key=os.getenv("OPENAI_API_KEY", None))

For the API reference visit UltraJudgeTask.

ComplexityScorerTask

This class implements a PreferenceTask to rate a list of instructions according to their complexity or difficulty. Defined in the Deita framework, its intended use is scoring instructions whose complexity has been enhanced by means of the EvolComplexity method, itself inspired by the EvolInstruct method from WizardLM.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import ComplexityScorerTask

labeller = OpenAILLM(
    task=ComplexityScorerTask(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

Unlike other tasks, this one doesn't need a reference input to evaluate the complexity of the instructions. The following example shows how it works with a small sample dataset:

import os
from datasets import Dataset

from distilabel.tasks import ComplexityScorerTask
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline


# Create a sample dataset (this one is inspired from the distilabel-sample-evol-complexity dataset)
sample_evol_complexity = Dataset.from_dict(
    {
        'generations': [
            [
                'Generate a catchy tagline for a new high-end clothing brand\n',
                "Devise a captivating and thought-provoking tagline that effectively represents the unique essence and luxurious nature of an upcoming luxury fashion label. Additionally, ensure that the tagline encapsulates the brand's core values and resonates with the discerning tastes of its exclusive clientele."
            ],
            [
                'How can I create a healthier lifestyle for myself?\n',
                'What are some innovative ways to optimize physical and mental wellness while incorporating sustainable practices into daily routines?'
            ]
        ]
    }
)

# Create the pipeline
pipe = Pipeline(
    labeller=OpenAILLM(
        task=ComplexityScorerTask(),
        api_key=os.getenv("OPENAI_API_KEY"),
        temperature=0.1
    )
)

# Run the pipeline on the sample dataset
new_dataset = pipe.generate(sample_evol_complexity)

print(new_dataset.select_columns(["generations", "rating"])[:])
# {
#   "generations": [
#     [
#       "Generate a catchy tagline for a new high-end clothing brand\n",
#       "Devise a captivating and thought-provoking tagline that effectively represents the unique essence and luxurious nature of an upcoming luxury fashion label. Additionally, ensure that the tagline encapsulates the brand's core values and resonates with the discerning tastes of its exclusive clientele."
#     ],
#     [
#       "How can I create a healthier lifestyle for myself?\n",
#       "What are some innovative ways to optimize physical and mental wellness while incorporating sustainable practices into daily routines?"
#     ]
#   ],
#   "rating": [
#     [1.0, 3.0],
#     [1.0, 2.0]
#   ]
# }

For the API reference visit ComplexityScorerTask.

QualityScorerTask

This class implements a PreferenceTask to rate a list of instructions according to their quality. It follows the same idea as the ComplexityScorerTask from the Deita framework, but in this case it rates the instructions in terms of helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response.

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import QualityScorerTask

labeller = OpenAILLM(
    task=QualityScorerTask(),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

By default, quality is defined following the paper prompt, but it can be modified by updating the task_description as in the following example (keep in mind that the default task_description corresponds to the EvolQuality criteria used to evolve the initial instructions, so this should be taken into account):

import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import QualityScorerTask

labeller = OpenAILLM(
    task=QualityScorerTask(
        task_description="Take into account the expressiveness of the answers."
    ),
    api_key=os.getenv("OPENAI_API_KEY", None),
)

The following is an example pipeline to label a dataset using QualityScorerTask:

import os

from datasets import Dataset
from distilabel.tasks import QualityScorerTask
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline


# Create a sample dataset (this one is inspired from the distilabel-intel-orca-dpo-pairs)
sample_evol_quality = Dataset.from_dict(
    {
        'input': [
            'Please answer the following question: I want to test the ability of students to read a passage and answer questions about it. Could you please come up with a good question for the passage "In 1901, the Federation of Australia was the process by which the six separate British self-governing colonies of New South Wales, Queensland, South Australia, Tasmania, Victoria and Western Australia formed one nation. They kept the systems of government that they had developed as separate colonies but also would have a federal government that was responsible for matters concerning the whole nation. When the Constitution of Australia came into force, the colonies collectively became states of the Commonwealth of Australia."?\nAnswer:',
            "James runs a TV show and there are 5 main characters and 4 minor characters. He pays the minor characters $15,000 each episode. He paid the major characters three times as much. How much does he pay per episode? Let's be accurate as possible."
        ],
        'generations': [
            [
                " Certainly! Here's a detailed and long answer to the question you provided:\n\nQuestion: Read the passage below and answer the following questions.\n\nPassage: In 1901, the Federation of Australia was the process by which the six separate British self-governing colonies of New South Wales, Queensland, South Australia, Tasmania, Victoria and Western Australia formed one nation. They kept the systems of government that they had developed as separate colonies but also would have a federal government that was responsible for matters concerning the whole nation. When the Constitution of Australia came into force, the colonies collectively became states of the Commonwealth of Australia.\n\n1. What event did the passage describe?\n2. How many colonies were involved in the Federation of Australia?\n3. What did the colonies keep after forming one nation?\n4. What was the purpose of the federal government established by the Constitution of Australia?\n5. What were the colonies called after they became states of the Commonwealth of Australia?\n\nAnswers:\n\n1. The passage describes the Federation of Australia, which was the process by which six separate British self-governing colonies formed one nation.\n2. Six colonies were involved in the Federation of Australia: New South Wales, Queensland, South Australia, Tasmania, Victoria, and Western Australia.\n3. The colonies kept the systems of government that they had developed as separate colonies.\n4. The federal government established by the Constitution of Australia was responsible for matters concerning the whole nation.\n5. After becoming states of the Commonwealth of Australia, the colonies were called states.\n\nIn-depth analysis:\n\nThe passage provides a brief overview of the Federation of Australia, which was the process by which six separate British self-governing colonies formed one nation in 1901. The colonies kept the systems of government that they had developed as separate colonies, but they also established a federal government that was responsible for matters concerning the whole nation. This federal government was established by the Constitution of Australia, which came into force on January 1, 1901.\n\nThe passage highlights the significance of the Federation of Australia by noting that it was a process by which the colonies collectively became states of the Commonwealth of Australia. This transformation marked a significant milestone in the history of Australia, as it created a unified nation out of six separate colonies.\n\nThe passage also provides information about the systems of government that were kept by the colonies after forming one nation. Specifically, it notes that the colonies kept the systems of government that they had developed as separate colonies. This suggests that the Federation of Australia did not result in a complete overhaul of the existing systems of government, but rather a reorganization of these systems to create a federal government that was responsible for matters concerning the whole nation.\n\nFinally, the passage uses descriptive language to emphasize the importance of the Constitution of Australia in the Federation of Australia. It notes that the Constitution came into force on January 1, 1901, and that it established the federal government that was responsible for matters concerning the whole nation. 
This language emphasizes the significance of the Constitution in the formation of the Federation of Australia and the creation of the Commonwealth of Australia.\n\nIn conclusion, the passage provides a detailed overview of the Federation of Australia, highlighting the significance of the Constitution of Australia in the formation of the Commonwealth of Australia. It also provides information about the systems of government that were kept by the colonies after forming one nation, and emphasizes the importance of the Constitution in the creation of the federal government responsible for matters concerning the whole nation.",
                "Certainly! Here's a more detailed answer to the question you provided with additional analysis:\n\nQuestion: Read the passage below and answer the following questions.\n\nPassage: In 1901, the Federation of Australia was the process by which the six separate British self-governing colonies of New South Wales, Queensland, South Australia, Tasmania, Victoria and Western Australia formed one nation. They kept the systems of government that they had developed as separate colonies but also would have a federal government that was responsible for matters concerning the whole nation. When the Constitution of Australia came into force, the colonies collectively became states of the Commonwealth of Australia.\n\n1. What"
            ],
            [
                'James pays the minor characters $15,000 each episode. Since there are 4 minor characters, he pays them a total of 4 * $15,000 = $60,000 per episode.\n\nThe major characters are paid three times as much. So, each major character gets paid 3 * $15,000 = $45,000 per episode.\n\nThere are 5 main characters, so he pays them a total of 5 * $45,000 = $225,000 per episode.\n\nIn total, James pays $225,000 (major characters) + $60,000 (minor characters) = $285,000 per episode.', 
                "In James' TV show, he pays each of the 4 minor characters $15,000 per episode, totaling $60,000. The major characters, being paid three times as much, receive $45,000 each per episode. With 5 main characters, James pays a total of $225,000 for them. Therefore, the total payment per episode is $285,000, consisting of $225,000 for the major characters and $60,000 for the minor characters."
            ]
        ]
   }
)

# Create the pipeline to label the dataset with the QualityScorerTask
pipe_labeller = Pipeline(
    labeller=OpenAILLM(
        task=QualityScorerTask(),
        api_key=os.getenv("OPENAI_API_KEY"),
        temperature=0.1,
        max_new_tokens=1024
    )
)

# Run the pipeline to get the scoring for the dataset
quality_labelled_dataset = pipe_labeller.generate(sample_evol_quality)
print(quality_labelled_dataset.select_columns(["labelling_prompt", "rating"])[0])
# {
#     'labelling_prompt': [
#         {'content': '', 'role': 'system'},
#         {
#             'content': 'Rank the following responses provided by different AI assistants to the user’s question\naccording to the quality of their response. Score each response from 1 to 2, with 3\nreserved for responses that are already very well written and cannot be improved further.\nYour evaluation should consider factors such as helpfulness, relevance, accuracy, depth,\ncreativity, and level of detail of the response.\nUse the following format:\n[Response 1] Score:\n[Response 2] Score:\n...\n#Question#: You will be given a definition of a task first, then some input of the task.\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\n\nAFC Ajax (amateurs)\'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\nOutput:\n#Response List#:\n\n[Response 1] [\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]\n[Response 2] The RDF triplets generated from the input sentence "AFC Ajax (amateurs)\'s ground is Sportpark De Toekomst where Ajax Youth Academy also play" accurately capture the relationships present. The output is a list of triplets that includes ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"] and ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]. These triplets represent the structure and semantics of the sentence.',
#             'role': 'user'
#         }
#     ],
#     'rating': [2.0, 3.0]
# }

For the API reference visit QualityScorerTask.

Critique

The CritiqueTask is designed to be a labeller for generated text: it not only adds scores based on a rubric, but also critiques explaining the reasons why those scores have been provided. The critique can either use a reference answer (gold answer), as e.g. Prometheus does, or just generate a critique for each of the N provided generations.

The resulting datasets after running a pipeline with the CritiqueTask are useful either for training a model to generate critiques, based on the critiques generated by a more powerful model such as OpenAI's GPT-4, or for direct use in DPO fine-tuning. Since a critique and a score are generated per response, a balanced dataset can be built from the individual critiques and their scores: we can e.g. define a threshold on what's considered chosen and rejected, and then run DPO fine-tunes, as sketched below.

While the CritiqueTask may seem fairly similar to the PreferenceTask, there is a core difference: the critiques are provided per response, even for a single response, with no need to compare or rate the responses against each other.
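
A minimal sketch of the thresholding idea mentioned above, assuming each labelled row exposes a numeric score; the row structure and the "score" key are illustrative assumptions, not the exact columns produced by CritiqueTask:

def split_by_threshold(rows, threshold=4.0):
    # Responses scoring at or above the threshold become "chosen" candidates,
    # the rest become "rejected" candidates for building DPO pairs.
    chosen = [r for r in rows if r["score"] >= threshold]
    rejected = [r for r in rows if r["score"] < threshold]
    return chosen, rejected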

UltraCMTask

This task is specifically designed to build the prompts following the format defined in the "UltraFeedback: Boosting Language Models With High Quality Feedback" paper.

UltraCM is a model that has been fine-tuned using the UltraFeedback dataset, so as to produce critiques for the generated content, as the authors claim in their paper: "Moreover, since ULTRAFEEDBACK provides detailed textual feedback, we also fine-tune a model that could critique model responses automatically. Our critique model, UltraCM, generates reasonable and detailed comments on various tasks.".

Ideally, the UltraCMTask will be more consistent when used with either the fine-tuned UltraCM model or with OpenAI, as both have been proven to successfully produce structured content following the prompt formatting; content that is not only structured, but also meaningful and reasonable.

The following snippet shows how to instantiate the UltraCMTask, which only requires the system prompt. It can be modified based on how the critique is intended to be formulated; the system prompt shown below is the default one from the UltraFeedback paper.

from distilabel.tasks import UltraCMTask

task = UltraCMTask(
    system_prompt="User: A one-turn chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, very detailed, and polite answers to the user's questions.</s>",
)
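
The task can then be passed to an LLM like any other labelling task; a minimal sketch following the same pattern as the examples above:

import os

from distilabel.llm import OpenAILLM

labeller = OpenAILLM(
    task=task,
    api_key=os.getenv("OPENAI_API_KEY", None),
)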

PrometheusTask

This task is specifically designed to build the prompts following the format defined in the "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models" paper.

Ideally, the PrometheusTask should only be used to format the prompts for the Prometheus models, as those are the ones that have been fine-tuned to follow this formatting and will produce consistent results compared to other base models or models fine-tuned with different formats. Since the formatting used by Prometheus follows the Llama 2 format, Llama 2 based models are recommended. Otherwise, OpenAI has also been proven to produce consistent results.

The following snippet can be used out of the box to define a simple PrometheusTask with the system prompt, the scoring criteria and the score descriptions. These can be modified, keeping in mind that Prometheus always expects 5 scores from 1-5, each with a meaningful description, as well as a criteria relevant to those scores.

from distilabel.tasks import PrometheusTask

task = PrometheusTask(
    system_prompt="You are a fair evaluator language model.",
    scoring_criteria="Relevance, Grammar, Informativeness, Engagement",
    score_descriptions={
        1: "The response is not relevant to the prompt.",
        2: "The response is relevant to the prompt, but it is not grammatical.",
        3: "The response is relevant to the prompt and it is grammatical, but it is not informative.",
        4: "The response is relevant to the prompt, it is grammatical, and it is informative, but it is not engaging.",
        5: "The response is relevant to the prompt, it is grammatical, it is informative, and it is engaging.",
    },
)
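
Since Prometheus models follow the Llama 2 format, the task would typically be paired with one of those checkpoints; a minimal sketch, where the checkpoint name is an assumption to verify on the Hugging Face Hub:

from transformers import AutoModelForCausalLM, AutoTokenizer

from distilabel.llm import TransformersLLM

# Assumed checkpoint name; any Prometheus fine-tune following the
# Llama 2 chat format should behave similarly.
model_id = "kaist-ai/prometheus-13b-v1.0"

labeller = TransformersLLM(
    model=AutoModelForCausalLM.from_pretrained(model_id),
    tokenizer=AutoTokenizer.from_pretrained(model_id),
    task=task,
)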

  1. The principles can be found here in the codebase. More information on Principle Sampling can be found in the UltraFeedback repository.