## Special Tasks

This section covers some tasks that don't implement the `Task` API but can still be thought of as tasks; instead, they inherit from `Step`.
### Embedding Generation
The DEITA paper tackles the challenge of ensuring diversity in the final dataset, and proposes an embedding-based method to filter it. To this end, the `GenerateEmbeddings` step is in charge of generating embeddings for the dataset's text.
```python
from distilabel.llms.huggingface.transformers import TransformersLLM
from distilabel.pipeline.local import Pipeline
from distilabel.steps.tasks.generate_embeddings import GenerateEmbeddings

llm = TransformersLLM(
    model="TaylorAI/bge-micro-v2",
    model_kwargs={"is_decoder": True},
)
# Remember to call the load method when working outside of a Pipeline context
llm.load()

task = GenerateEmbeddings(
    name="task",
    llm=llm,
    pipeline=Pipeline(name="unit-test-pipeline"),
)
```
This step needs an `LLM` to generate the embeddings; in this case we have chosen to use a `TransformersLLM` with `TaylorAI/bge-micro-v2`. Upon call, this step will compute the embedding for the input text and add it to the row:
```python
result = next(task.process([{"text": "Hello, how are you?"}]))
print(result[0]["embedding"])
# [-8.12729941, -5.24642847, -6.34003029, ...]
```
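These embeddings exist to support the diversity filtering described later in this section, which reasons about distances between rows. As a quick illustration of that idea (plain NumPy, not part of distilabel; the vectors are the toy values that reappear in the filtering example below), you can compare two embeddings with cosine similarity:

```python
import numpy as np

# Two hypothetical row embeddings, as produced by GenerateEmbeddings
emb_a = np.array([-8.12729941, -5.24642847, -6.34003029])
emb_b = np.array([2.99329242, 0.7800932, 0.7799726])

# Cosine similarity: close to 1.0 for near-duplicates, lower for diverse rows
cos_sim = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(round(float(cos_sim), 4))  # -0.9043
```

Note that 1 minus this value is exactly the `nearest_neighbor_distance` that shows up in the filtering example at the end of this section.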
### Ranking LLM Responses
Jiang et al. present in their paper LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion a "small" model that is able to take an instruction and a pair of output candidates, and output a score for each candidate measuring their relative quality, hence ranking the responses. You can use the `PairRM` step in distilabel to accomplish this task; let's see how it works:
```python
from distilabel.pipeline.local import Pipeline
from distilabel.steps.tasks.pair_rm import PairRM

ranker = PairRM(
    name="pair_rm_ranker", pipeline=Pipeline(name="ranking-pipeline")
)
# NOTE: Keep in mind this call will automatically try to load an LLM internally
ranker.load()
```
Contrary to other steps, the model for this step is fixed by default, as the implementation relies entirely on LLM-Blender to work. To ingest data for this task you need an `input`, which corresponds to the instruction, and a list of `candidates` to compare, which the model will rank by working on pairs:
```python
result = next(
    ranker.process(
        [
            {"input": "Hello, how are you?", "candidates": ["fine", "good", "bad"]},
        ]
    )
)
```
Let's see what the result looks like:
```python
import json

print(json.dumps(result, indent=2))
# [
#   {
#     "input": "Hello, how are you?",
#     "candidates": [
#       "fine",
#       "good",
#       "bad"
#     ],
#     "ranks": [
#       2,
#       1,
#       3
#     ],
#     "ranked_candidates": [
#       "good",
#       "fine",
#       "bad"
#     ]
#   }
# ]
```
We see we have both the `ranks`, which determine the position each entry of the `candidates` field would be sorted into, and the `ranked_candidates`, in case you want to use them directly.
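To make the relationship between these fields explicit, sorting the candidates by their rank reproduces `ranked_candidates` (a plain-Python illustration using the `result` from above, not something the step requires):

```python
entry = result[0]
# Each rank is the 1-based position of the corresponding candidate,
# so sorting the (rank, candidate) pairs recovers ranked_candidates
reordered = [candidate for _, candidate in sorted(zip(entry["ranks"], entry["candidates"]))]
print(reordered)  # ['good', 'fine', 'bad']
```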
### Filtering data to ensure diversity
We already mentioned a global step in the Global Steps section that was too specific to introduce at that time. That step is `DeitaFiltering`.
It's a special type of step developed to reproduce the DEITA paper, in charge of filtering responses according to a predefined score. Let's see how it is defined:
```python
from distilabel.pipeline.local import Pipeline
from distilabel.steps.deita import DeitaFiltering

deita_filtering = DeitaFiltering(
    name="deita_filtering",
    data_budget=1,
    pipeline=Pipeline(name="deita-filtering-pipeline"),
)
# Remember to call the load method if working outside of a Pipeline context
deita_filtering.load()
```
This step is prepared to work on DEITA outputs: it expects instructions evolved following the Evol Instruct procedure, with a score assigned to the complexity of the instruction and the quality of the response (by `ComplexityScorer` and `QualityScorer` respectively), and embeddings computed on the responses. The following is a random example following the structure of the input expected by the process method:
```python
result = next(
    deita_filtering.process(
        [
            {
                "evol_instruction_score": 0.5,
                "evol_response_score": 0.5,
                "embedding": [-8.12729941, -5.24642847, -6.34003029],
            },
            {
                "evol_instruction_score": 0.6,
                "evol_response_score": 0.6,
                "embedding": [2.99329242, 0.7800932, 0.7799726],
            },
            {
                "evol_instruction_score": 0.7,
                "evol_response_score": 0.7,
                "embedding": [10.29041806, 14.33088073, 13.00557506],
            },
        ]
    )
)
```
And this is what we could expect from the output:
```python
import json

print(json.dumps(result, indent=2))
# [
#   {
#     "evol_instruction_score": 0.5,
#     "evol_response_score": 0.5,
#     "embedding": [
#       -8.12729941,
#       -5.24642847,
#       -6.34003029
#     ],
#     "deita_score": 0.25,
#     "deita_score_computed_with": [
#       "evol_instruction_score",
#       "evol_response_score"
#     ],
#     "nearest_neighbor_distance": 1.9042812683723933
#   }
# ]
```
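The numbers in this toy output can be reproduced by hand, which helps clarify what the step computes: the `deita_score` matches the product of the two evol scores, and the `nearest_neighbor_distance` matches the cosine distance (1 minus cosine similarity) from the first embedding to its closest neighbour. A minimal NumPy sketch of that arithmetic (an illustration only, not distilabel's actual implementation):

```python
import numpy as np

# Toy rows from the example above
emb = np.array([
    [-8.12729941, -5.24642847, -6.34003029],
    [2.99329242, 0.7800932, 0.7799726],
    [10.29041806, 14.33088073, 13.00557506],
])

# deita_score: consistent with multiplying the instruction and response scores
print(0.5 * 0.5)  # 0.25

# nearest_neighbor_distance: cosine distance from row 0 to its closest other row
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
distances = 1 - unit @ unit[0]
print(np.sort(distances)[1])  # ~1.9042812683723933 (index 0 is row 0 itself)
```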
We would obtain a dataset of the size expected for the `data_budget` and `diversity_threshold` we set. For more information on how this task works, take a look at the API Reference.