Generate a preference dataset¶
- Goal: Generate a synthetic preference dataset for DPO/ORPO.
- Libraries: argilla, hf-inference-endpoints
- Components: LoadDataFromHub, TextGeneration, UltraFeedback, GroupColumns, FormatTextGenerationDPO, PreferenceToArgilla, InferenceEndpointsLLM
Getting started¶
Install the dependencies¶
To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API, so we need to install its support as an extra distilabel dependency. You can install everything by running the following command:
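A minimal sketch of that command, assuming the hf-inference-endpoints extra listed at the top of this tutorial (the exact extra names may vary between distilabel versions):

!pip install "distilabel[hf-inference-endpoints]" --upgrade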
Let's make the required imports:
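The imports below are a sketch based on the components listed at the top of this tutorial; the module paths assume a distilabel 1.x layout and may differ in your version:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    LoadDataFromHub,
    GroupColumns,
    FormatTextGenerationDPO,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback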
You'll need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.
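One way to log in from the notebook, assuming the token is exposed as the HF_TOKEN environment variable (the huggingface_hub login helper also supports interactive input if no token is passed):

import os
from huggingface_hub import login

# Authenticate against the Hugging Face Hub so the Inference API calls work.
login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)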
(optional) Deploy Argilla¶
You can skip this step or replace it with any other data evaluation tool, but skipping data review will hurt the quality of your model, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
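A possible install command for this, assuming the argilla extra is available alongside hf-inference-endpoints:

!pip install "distilabel[argilla, hf-inference-endpoints]" --upgrade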
Define the pipeline¶
To generate our preference dataset, we will need to define a Pipeline with all the necessary steps. Below, we will go over each step in detail.
Load the dataset¶
We will use the argilla/10Kprompts-mini dataset from the Hugging Face Hub as our source data; a quick standalone check of the loading step is shown after the component summary below.
- Component: LoadDataFromHub
- Input columns: instruction and topic, the same as in the loaded dataset
- Output columns: instruction and topic
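A quick standalone check of this step, assuming LoadDataFromHub accepts a repo_id and an optional num_examples limit (run outside the full pipeline just to inspect the first batch):

load_dataset = LoadDataFromHub(
    repo_id="argilla/10Kprompts-mini",
    num_examples=1,
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())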
Generate responses¶
We need to generate the responses for the given instructions. We will use two different models available on the Hugging Face Hub through the Serverless Inference API: meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1. We will also indicate the generation parameters for each model.
- Component: TextGeneration task with LLMs using InferenceEndpointsLLM
- Input columns: instruction
- Output columns: generation, distilabel_metadata, model_name for each model
For your use case and to improve the results, you can use any other LLM of your choice.
generate_responses = [
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
        pipeline=Pipeline(name="showcase-pipeline"),
    ),
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
        pipeline=Pipeline(name="showcase-pipeline"),
    ),
]
for task in generate_responses:
    task.load()
    print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))
Group the responses¶
The task that evaluates the responses needs a list of generations as input. However, each model's response was saved in the generation column of its own subset, text_generation_0 and text_generation_1. We will combine these two columns into a single column under the default subset.
- Component: GroupColumns
- Input columns: generation and model_name from text_generation_0 and text_generation_1
- Output columns: generations and model_names
group_responses = GroupColumns(
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
    pipeline=Pipeline(name="showcase-pipeline"),
)
next(
    group_responses.process(
        [
            {
                "generation": "Madrid",
                "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
        ],
        [
            {
                "generation": "Barcelona",
                "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            }
        ],
    )
)
Evaluate the responses¶
To build our preference dataset, we need to evaluate the responses generated by the models. We will use meta-llama/Meta-Llama-3-70B-Instruct for this, applying the UltraFeedback task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness).
- Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
- Input columns: instruction, generations
- Output columns: ratings, rationales, distilabel_metadata, model_name
For your use case and to improve the results, you can use any other LLM of your choice.
evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
Convert to a preference dataset¶
- You can automatically convert it to a preference dataset with the chosen and rejected columns.
- Component: FormatTextGenerationDPO step
- Input columns: instruction, generations, generation_models, ratings
- Output columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
    format_dpo.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
                "generation_models": [
                    "Meta-Llama-3-8B-Instruct",
                    "Mixtral-8x7B-Instruct-v0.1",
                ],
                "ratings": [5, 1],
            }
        ]
    )
)
- Or you can use Argilla to manually label the data and convert it to a preference dataset (a sketch of this step follows the list).
- Component: PreferenceToArgilla step
- Input columns: instruction, generations, generation_models, ratings
- Output columns: instruction, generations, generation_models, ratings
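A sketch of how this step could be instantiated, mirroring the arguments used in the full pipeline below; the api_url and api_key placeholders must point at your own Argilla deployment:

to_argilla = PreferenceToArgilla(
    dataset_name="preference-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2,
)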
Run the pipeline¶
Below, you can see the full pipeline definition:
with Pipeline(name="generate-dataset") as pipeline:

    load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")

    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
    ]

    group_responses = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
    )

    format_dpo = FormatTextGenerationDPO()

    to_argilla = PreferenceToArgilla(
        dataset_name="preference-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2,
    )

    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)
    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo, to_argilla)
Let's now run the pipeline and generate the preference dataset.
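A minimal way to trigger the run; Pipeline.run() returns a Distiset object holding the generated data (run parameters are left at their defaults here):

distiset = pipeline.run()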
Let's check the preference dataset! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
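A quick way to peek at the result, assuming the DPO-formatted data ends up in a configuration named "default" with a train split; with several leaf steps the configuration may instead be named after the leaf step, so inspect the printed structure first:

print(distiset)                  # show which configurations and splits were produced
distiset["default"]["train"][0]  # peek at the first row (configuration name may differ)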
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
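Pushing the resulting Distiset to the Hub could look like this; the repository id placeholder follows the same bracket convention used above and is not a real repository:

distiset.push_to_hub("[your-owner-name]/example-preference-dataset")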
Conclusions¶
In this tutorial, we showcased the detailed steps to build a pipeline for generating a preference dataset using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub, or use them to train a model for DPO or ORPO.
We used a dataset containing prompts to generate responses using two different models through the serverless Hugging Face Inference API. Next, we evaluated the responses using a third model, following the UltraFeedback standards. Finally, we converted the data to a preference dataset and used Argilla for further curation.
