Clean an existing preference dataset¶
- Goal: Clean an existing preference dataset by providing AI feedback on the quality of the data.
- Libraries: argilla, hf-inference-endpoints
- Components: LoadDataFromDicts, UltraFeedback, KeepColumns, PreferenceToArgilla, InferenceEndpointsLLM, GlobalStep
Getting Started¶
Install the dependencies¶
To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. This tutorial uses the free but rate-limited Hugging Face serverless Inference API, so its client has to be installed as a distilabel extra (hf-inference-endpoints). You can install them by running the following command:
Let's make the required imports:
You'll need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.
(optional) Deploy Argilla¶
You can skip this step or replace Argilla with any other data evaluation tool. However, the quality of your model depends on the quality of your data, so we recommend inspecting your data. If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
The dataset¶
In this case, we will clean a preference dataset, so we will use the Intel/orca_dpo_pairs dataset from the Hugging Face Hub.
Next, we will shuffle the chosen and rejected columns to avoid any bias in the dataset.
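The shuffle can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the tutorial's exact code: the `shuffle_and_track` helper, the seeded `rng`, and the two toy rows are hypothetical stand-ins for records of Intel/orca_dpo_pairs. Each response is shuffled together with its label, so the `order` column always records which position holds the originally chosen response.

```python
import random
from typing import Dict, List


def shuffle_and_track(chosen: str, rejected: str, rng: random.Random) -> Dict[str, List[str]]:
    """Shuffle a (chosen, rejected) pair and remember which slot holds which."""
    # Keep each text attached to its label while shuffling.
    labelled = [("chosen", chosen), ("rejected", rejected)]
    rng.shuffle(labelled)
    return {
        "generations": [text for _, text in labelled],
        "order": [label for label, _ in labelled],
    }


# Toy rows standing in for dataset records (illustrative data only).
rows = [
    {"question": "What's the capital of Spain?", "chosen": "Madrid", "rejected": "Barcelona"},
    {"question": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
]

rng = random.Random(42)  # fixed seed so the shuffle is reproducible
dataset = [
    {**row, **shuffle_and_track(row["chosen"], row["rejected"], rng)}
    for row in rows
]
```

The resulting `dataset` is a list of dictionaries that keeps the original columns and adds `generations` and `order`.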
As a custom step
You can also create a custom step in a separate module, import it and add it to the pipeline after loading the orca_dpo_pairs dataset using the LoadDataFromHub step.
import random
from typing import TYPE_CHECKING, List

from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.typing import StepOutput


class ShuffleStep(GlobalStep):
    """Shuffle each chosen/rejected pair and record the resulting order."""

    @property
    def inputs(self) -> List[str]:
        """The columns this step requires."""
        return ["instruction", "chosen", "rejected"]

    @property
    def outputs(self) -> List[str]:
        """The columns this step produces."""
        return ["instruction", "generations", "order"]

    def process(self, inputs: StepInput) -> "StepOutput":
        """Yield one batch with the shuffled pairs and their labels."""
        outputs = []
        for row in inputs:
            # Shuffle the labels together with the texts so that `order` stays
            # correct even when chosen and rejected are identical strings.
            labelled = [("chosen", row["chosen"]), ("rejected", row["rejected"])]
            random.shuffle(labelled)
            outputs.append(
                {
                    "instruction": row["instruction"],
                    "generations": [text for _, text in labelled],
                    "order": [label for label, _ in labelled],
                }
            )
        yield outputs
Define the pipeline¶
To clean an existing preference dataset, we will need to define a Pipeline with all the necessary steps. A similar workflow can also be used to clean an SFT dataset. Below, we will go over each step in detail.
Load the dataset¶
We will use the dataset we just shuffled as source data.
- Component: LoadDataFromDicts
- Input columns: system, question, chosen, rejected, generations and order, the same keys as in the loaded list of dictionaries.
- Output columns: system, instruction, chosen, rejected, generations and order. We will use output_mappings to rename the columns.
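Conceptually, the renaming done through output_mappings amounts to swapping dictionary keys while leaving the values untouched. The sketch below illustrates this in plain Python. Note that `apply_output_mappings` is a hypothetical helper written for this example, not a distilabel function, and the row is made-up data.

```python
from typing import Any, Dict, List


def apply_output_mappings(
    rows: List[Dict[str, Any]], mappings: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Rename columns according to `mappings`; unmapped columns keep their name."""
    return [
        {mappings.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# One illustrative row with the input columns listed above.
rows = [
    {
        "system": "",
        "question": "What's the capital of Spain?",
        "chosen": "Madrid",
        "rejected": "Barcelona",
        "generations": ["Barcelona", "Madrid"],
        "order": ["rejected", "chosen"],
    }
]

renamed = apply_output_mappings(rows, {"question": "instruction"})
```

After the rename, downstream steps such as UltraFeedback can find the `instruction` column they expect.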
Evaluate the responses¶
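To evaluate the quality of the responses, we will use meta-llama/Meta-Llama-3.1-70B-Instruct, applying the UltraFeedback task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness). The code in this section sets aspect="overall-rating", which asks the judge for a single overall score per response rather than one of the individual dimensions. For an SFT dataset, you can use PrometheusEval instead.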
- Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
- Input columns: instruction, generations
- Output columns: ratings, rationales, distilabel_metadata, model_name
To improve the results for your use case, you can swap in any other LLM of your choice.
evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
Keep only the required columns¶
We will get rid of the unneeded columns.
- Component: KeepColumns
- Input columns: system, instruction, chosen, rejected, generations, order, ratings, rationales, distilabel_metadata and model_name
- Output columns: instruction, generations, order, ratings, rationales and model_name
keep_columns = KeepColumns(
    columns=[
        "instruction",
        "generations",
        "order",
        "ratings",
        "rationales",
        "model_name",
    ],
    pipeline=Pipeline(name="showcase-pipeline"),
)
keep_columns.load()
next(
    keep_columns.process(
        [
            {
                "system": "",
                "instruction": "What's the capital of Spain?",
                "chosen": "Madrid",
                "rejected": "Barcelona",
                "generations": ["Madrid", "Barcelona"],
                "order": ["chosen", "rejected"],
                "ratings": [5, 1],
                "rationales": ["", ""],
                "model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            }
        ]
    )
)
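With `order` and `ratings` side by side, the AI feedback can be compared against the original labels to find rows that may need cleaning. The following is only a sketch under stated assumptions: `compare_with_labels` and its status strings are hypothetical, the rows are made-up, each `order` list is assumed to contain both labels, and a rating is assumed to be possibly missing (None), for example when the judge's output could not be parsed.

```python
from typing import Any, Dict, List


def compare_with_labels(row: Dict[str, Any]) -> str:
    """Compare the judge's ratings with the original chosen/rejected labels."""
    ratings = row["ratings"]
    if ratings is None or any(rating is None for rating in ratings):
        return "unrated"
    # Map each original label to the rating its response received.
    by_label = dict(zip(row["order"], ratings))
    if by_label["chosen"] > by_label["rejected"]:
        return "agree"
    if by_label["chosen"] < by_label["rejected"]:
        return "disagree"
    return "tie"


# Illustrative rows shaped like the output of the KeepColumns step.
rows: List[Dict[str, Any]] = [
    {"order": ["chosen", "rejected"], "ratings": [5, 1]},
    {"order": ["rejected", "chosen"], "ratings": [4, 2]},
    {"order": ["chosen", "rejected"], "ratings": [3, 3]},
    {"order": ["rejected", "chosen"], "ratings": [None, 4]},
]

statuses = [compare_with_labels(row) for row in rows]
# Rows where the judge disagrees or cannot decide are candidates for review.
to_review = [row for row, status in zip(rows, statuses) if status != "agree"]
```

Rows flagged this way are good candidates for manual inspection rather than automatic relabeling, since the judge model can also be wrong.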
(Optional) Further data curation¶
You can use Argilla to further curate your data.
- Component: PreferenceToArgilla step
- Input columns: instruction, generations, generation_models, ratings
- Output columns: instruction, generations, generation_models, ratings
Run the pipeline¶
Below, you can see the full pipeline definition:
with Pipeline(name="clean-dataset") as pipeline:
    load_dataset = LoadDataFromDicts(
        data=dataset, output_mappings={"question": "instruction"}
    )
    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
    )
    keep_columns = KeepColumns(
        columns=[
            "instruction",
            "generations",
            "order",
            "ratings",
            "rationales",
            "model_name",
        ]
    )
    to_argilla = PreferenceToArgilla(
        dataset_name="cleaned-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2,
    )
    load_dataset.connect(evaluate_responses)
    evaluate_responses.connect(keep_columns)
    keep_columns.connect(to_argilla)
Let's now run the pipeline and clean our preference dataset.
Let's check it! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
Conclusions¶
In this tutorial, we showcased the detailed steps to build a pipeline for cleaning a preference dataset using distilabel. However, you can customize this pipeline for your own use cases, such as cleaning an SFT dataset or adding custom steps.
We used a preference dataset as our starting point and shuffled the data to avoid any bias. Next, we evaluated the responses using a model through the serverless Hugging Face Inference API, following the UltraFeedback standards. Finally, we kept the needed columns and used Argilla for further curation.
