Generate a preference dataset¶
- Goal: Generate a synthetic preference dataset for DPO/ORPO.
- Libraries: argilla, hf-inference-endpoints
- Components: `LoadDataFromHub`, `TextGeneration`, `UltraFeedback`, `GroupColumns`, `FormatTextGenerationDPO`, `PreferenceToArgilla`, `InferenceEndpointsLLM`
Getting started¶
Install the dependencies¶
To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
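```python
!pip install "distilabel[hf-inference-endpoints]"
```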
Let's make the required imports:
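```python
# Import paths follow the distilabel 1.x module layout
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    FormatTextGenerationDPO,
    GroupColumns,
    LoadDataFromHub,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback
```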
You'll need an `HF_TOKEN` to use the HF Inference Endpoints. Log in to use it directly within this notebook.
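A simple way to do this is with the `login` helper from `huggingface_hub`:

```python
import os
from huggingface_hub import login

# Reads the token from the HF_TOKEN environment variable
login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)
```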
(optional) Deploy Argilla¶
You can skip this step or replace it with any other data evaluation tool, but your model's quality will suffer from poor data quality, so we do recommend looking at your data. If you have already deployed Argilla, you can skip this step; otherwise, you can quickly deploy it following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
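For example:

```python
!pip install "distilabel[argilla, hf-inference-endpoints]"
```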
Define the pipeline¶
To generate our preference dataset, we will need to define a `Pipeline` with all the necessary steps. Below, we will go over each step in detail.
Load the dataset¶
We will use the `argilla/10Kprompts-mini` dataset from the Hugging Face Hub as the source data.
- Component: `LoadDataFromHub`
- Input columns: `instruction` and `topic`, the same as in the loaded dataset
- Output columns: `instruction` and `topic`
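As a quick standalone check, mirroring the pattern used for the other steps below, you can instantiate the step with a throwaway pipeline and pull a single batch; `num_examples` is an optional argument used here just to keep the output small:

```python
load_dataset = LoadDataFromHub(
    repo_id="argilla/10Kprompts-mini",
    num_examples=1,  # keep the standalone check small
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())
```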
Generate responses¶
We need to generate the responses for the given instructions. We will use two different models available on the Hugging Face Hub through the Serverless Inference API: `meta-llama/Meta-Llama-3-8B-Instruct` and `mistralai/Mixtral-8x7B-Instruct-v0.1`. We will also indicate the generation parameters for each model.
- Component: `TextGeneration` task with LLMs using `InferenceEndpointsLLM`
- Input columns: `instruction`
- Output columns: `generation`, `distilabel_metadata`, `model_name` for each model
For your use case and to improve the results, you can use any other LLM of your choice.
```python
generate_responses = [
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
        pipeline=Pipeline(name="showcase-pipeline"),
    ),
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
        pipeline=Pipeline(name="showcase-pipeline"),
    ),
]
for task in generate_responses:
    task.load()
    print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))
```
Group the responses¶
The task that evaluates the responses needs a list of generations as input. However, each model's response was saved in the `generation` column of the subsets `text_generation_0` and `text_generation_1`. We will combine these two columns into a single `generations` column in the `default` subset.
- Component: `GroupColumns`
- Input columns: `generation` and `model_name` from `text_generation_0` and `text_generation_1`
- Output columns: `generations` and `model_names`
```python
group_responses = GroupColumns(
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
    pipeline=Pipeline(name="showcase-pipeline"),
)
next(
    group_responses.process(
        [
            {
                "generation": "Madrid",
                "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
        ],
        [
            {
                "generation": "Barcelona",
                "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            }
        ],
    )
)
```
Evaluate the responses¶
To build our preference dataset, we need to evaluate the responses generated by the models. We will use `meta-llama/Meta-Llama-3-70B-Instruct` for this, applying the `UltraFeedback` task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness).
- Component: `UltraFeedback` task with LLMs using `InferenceEndpointsLLM`
- Input columns: `instruction`, `generations`
- Output columns: `ratings`, `rationales`, `distilabel_metadata`, `model_name`
For your use case and to improve the results, you can use any other LLM of your choice.
```python
evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
```
Convert to a preference dataset¶
You can automatically convert it to a preference dataset with the `chosen` and `rejected` columns.

- Component: `FormatTextGenerationDPO` step
- Input columns: `instruction`, `generations`, `generation_models`, `ratings`
- Output columns: `prompt`, `prompt_id`, `chosen`, `chosen_model`, `chosen_rating`, `rejected`, `rejected_model`, `rejected_rating`
```python
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
    format_dpo.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
                "generation_models": [
                    "Meta-Llama-3-8B-Instruct",
                    "Mixtral-8x7B-Instruct-v0.1",
                ],
                "ratings": [5, 1],
            }
        ]
    )
)
```
Or you can use Argilla to manually label the data and convert it to a preference dataset.

- Component: `PreferenceToArgilla` step
- Input columns: `instruction`, `generations`, `generation_models`, `ratings`
- Output columns: `instruction`, `generations`, `generation_models`, `ratings`
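The standalone configuration below mirrors the one used in the full pipeline at the end of this tutorial; replace the `api_url` and `api_key` placeholders with the values of your own Argilla deployment:

```python
to_argilla = PreferenceToArgilla(
    dataset_name="preference-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",  # your Argilla Space URL
    api_key="[your-api-key]",
    num_generations=2,
    pipeline=Pipeline(name="showcase-pipeline"),
)
```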
Run the pipeline¶
Below, you can see the full pipeline definition:
```python
with Pipeline(name="generate-dataset") as pipeline:
    load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")

    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
    ]

    group_responses = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
    )

    format_dpo = FormatTextGenerationDPO()

    to_argilla = PreferenceToArgilla(
        dataset_name="preference-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2,
    )

    # Connect the steps: fan out to both generators, then merge back
    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)
    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo, to_argilla)
```
Let's now run the pipeline and generate the preference dataset.
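```python
# Run the pipeline; returns a Distiset holding the generated subsets
distiset = pipeline.run()
```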
Let's check the preference dataset! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
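Assuming the run above stored its output in `distiset`, you can push it with a repository name of your choice (the one below is a placeholder):

```python
distiset.push_to_hub("[your-owner-name]/example-preference-dataset")
```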
Conclusions¶
In this tutorial, we showcased the detailed steps to build a pipeline for generating a preference dataset using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub, or use them to fine-tune a model with DPO or ORPO.
We used a dataset containing prompts to generate responses using two different models through the serverless Hugging Face Inference API. Next, we evaluated the responses using a third model, following the UltraFeedback standards. Finally, we converted the data to a preference dataset and used Argilla for further curation.