✨ Clean a Preference Dataset with JudgeLMTask and GPT4-turbo¶
In this tutorial, we will explain how to use distilabel to clean the well-known DPO dataset orca_dpo_pairs. If you want a spoiler, you can check the cleaned dataset.
We will follow these steps:
- Prepare the original dataset for cleaning.
- Create and run the distilabel pipeline.
- Optionally, post-process the cleaned dataset.
- Analyze the distilabelled dataset.
Introduction¶
Many open-source datasets are widely used to train and evaluate NLP models. However, many of them can still be improved in terms of quality, as we did with the UltraFeedback, Dolly, and Alpaca datasets.
In this case, the main intuition was that the original dataset assumes the gpt4/3.5-turbo response is always the best one, which is not always true. Moreover, DPO fine-tuning benefits from the diversity of preference pairs.
To address this issue, we used distilabel, an AI Feedback (AIF) framework that can generate and label datasets using LLMs and can be used for many different use cases.
Getting Started¶
Running Argilla¶
For this tutorial, you can use Argilla to visualize and annotate the dataset cleaned by distilabel. There are two main options for deploying and running Argilla:
Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:
For details about configuring your deployment, check the official Hugging Face Hub guide.
Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.
For more information on deployment options, please check the Deployment section of the documentation.
Install Dependencies¶
Let’s start by installing the required dependencies to run distilabel, Argilla, and the remainder of this tutorial.
Then we can import the required libraries.
import os
import random
import nltk
import numpy as np
import openai
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import argilla as rg
from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import JudgeLMTask
nltk.download('punkt')
If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API_KEY:
If you’re running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"
# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# # Replace workspace with the name of your workspace
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="owner.apikey",
#     workspace="admin",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )
Finally, we will also need to provide an HF_TOKEN and the OPENAI_API_KEY to run the distilabel pipeline.
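One way to make sure both credentials are available before starting a long run is to read them from the environment and fail early if one is missing. The helper below is a minimal sketch; `require_env` is a hypothetical name, not part of distilabel or Argilla.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running the pipeline."
        )
    return value
```

Calling `require_env("HF_TOKEN")` and `require_env("OPENAI_API_KEY")` at the top of the notebook would surface a missing key immediately rather than halfway through the labelling step.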
Prepare the Dataset¶
First, we will load the original orca_dpo_pairs, which consists of 12,859 preference pairs.
> Note: To shorten the run time while following this tutorial, consider selecting a subset of samples from the original dataset:
> subsample = dataset.select(range(500))
In order to avoid positional bias and keep track of the order, we will create and apply the function shuffle_and_track to the dataset. This function takes chosen and rejected, shuffles them randomly, and then returns a dictionary that includes the shuffled pair as generations and the order in which they were shuffled based on their original identification.
# Shuffle 'chosen' and 'rejected'
def shuffle_and_track(chosen, rejected):
    pair = [chosen, rejected]
    random.shuffle(pair)
    order = ["chosen" if x == chosen else "rejected" for x in pair]
    return {"generations": pair, "order": order}
# Apply the function to the dataset
dataset = dataset.map(lambda x: shuffle_and_track(x["chosen"], x["rejected"]))
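To make the role of the order column concrete, the following self-contained sketch runs the same function on a single toy pair (the two answer strings are made up) and shows how the original chosen and rejected responses can be recovered after shuffling.

```python
import random


def shuffle_and_track(chosen, rejected):
    pair = [chosen, rejected]
    random.shuffle(pair)
    order = ["chosen" if x == chosen else "rejected" for x in pair]
    return {"generations": pair, "order": order}


random.seed(0)  # only to make this toy run repeatable
row = shuffle_and_track("A detailed answer.", "A short answer.")

# Whatever the shuffle did, `order` tells us where each response ended up.
recovered_chosen = row["generations"][row["order"].index("chosen")]
recovered_rejected = row["generations"][row["order"].index("rejected")]
```

Note that the comparison is done by value: if a row had identical chosen and rejected texts, both entries of order would be labelled "chosen".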
Moreover, to indicate which column will be used as the input for our pipeline, we will rename the question column to input. This dataset is already binarized, but if you are not familiar with binarization or want to know how to binarize a dataset, you can take a look here.
Create the Pipeline¶
In this case, we will only need to include a labeller in our pipeline. The labeller will rate the generations according to the input and will add the rationale behind its score. So, we will start by initializing it using the OpenAI integration, which will take the following arguments:
- task: Specifies the usage of the LLM as a labeller by creating a prompt using a standard template. The JudgeLMTask is designed to evaluate the performance of AI assistants.
- model: We use gpt-4-1106-preview as the model to be used for generation.
- num_threads: 16 is the number of threads to be used for parallel generation.
- max_new_tokens: 512 is the maximum number of tokens to be generated.
> For more information about the LLM integrations, tasks and different components of the pipeline, please check the documentation.
Then, we will add the labeller to the pipeline. We can check that no generator was added and the labeller takes the arguments we specified before.
Finally, we will run the pipeline using the generate method. This method takes the input dataset and the desired number of generations to be performed for each input. In our case, we will indicate 2: one to rate chosen and the other to rate rejected, both of which were added to the generations column.
> Remember that the labelling process can take a while depending on the number of generations and the number of threads specified.
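The labeller setup and the generate call described in this section might be wired together roughly as follows. This is a sketch, not the exact code of the tutorial: it assumes the distilabel API imported earlier (OpenAILLM, Pipeline, JudgeLMTask), the function names `build_labelling_pipeline` and `label_dataset` are hypothetical, and `num_generations` is an assumed keyword name for the number of generations per input. Running it requires a valid OPENAI_API_KEY.

```python
def build_labelling_pipeline():
    """Build a pipeline that contains only a labeller (no generator)."""
    # Imported lazily so the sketch can be read without distilabel installed.
    from distilabel.llm import OpenAILLM
    from distilabel.pipeline import Pipeline
    from distilabel.tasks import JudgeLMTask

    labeller = OpenAILLM(
        task=JudgeLMTask(),           # prompt template for judging AI assistants
        model="gpt-4-1106-preview",   # model named in this tutorial
        num_threads=16,               # parallel requests
        max_new_tokens=512,           # cap on the length of each judgement
    )
    return Pipeline(labeller=labeller)


def label_dataset(dataset):
    """Rate the two responses stored in the `generations` column of each row."""
    pipeline = build_labelling_pipeline()
    # Assumption: `num_generations` is the keyword for the number of generations.
    return pipeline.generate(dataset, num_generations=2)
```

With this layout, the prepared dataset would be passed as `disti_dataset = label_dataset(dataset)`.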
Now, we can inspect the dataset again, as the ratings and the rationale behind them were added to the original dataset as the rating and rationale columns.
Optionally, if you want to further filter and curate the dataset, you can push the dataset to Argilla as follows:
Optional: Post-process the dataset¶
Even if the dataset was already curated, we can still improve it by adding more information. Thus, we will swap rejected and chosen, and add chosen scores and status.
The add_status function assesses the status of a pair of responses based on their ratings and order. If there are no ratings, or if both ratings are equal, it sets the status to tie. Otherwise, if the highest-rated response is not the originally chosen one, the status is set to swapped. In every other case, the status remains unchanged.
The swap function returns a dictionary with the current and original chosen and rejected items and the score of the chosen item.
# Define the add_status function
def add_status(r):
    status = "unchanged"
    # Check for missing ratings first, so np.argmax is never called on None
    if r['rating'] is None or r['rating'][0] == r['rating'][1]:
        status = "tie"
    # The highest-rated response is not the one originally marked as chosen
    elif r['order'][np.argmax(r['rating'])] != 'chosen':
        status = "swapped"
    return {"status": status}
# Define the swap function
def swap(r):
    chosen = r["chosen"]
    rejected = r["rejected"]
    if r['rating'] is not None:
        chosen_score = r['rating'][np.argmax(r['rating'])]
    else:
        chosen_score = None
    if r['status'] == "swapped":
        chosen = r["rejected"]
        rejected = r["chosen"]
    return {
        "chosen": chosen,
        "rejected": rejected,
        "original_chosen": r["chosen"],
        "original_rejected": r["rejected"],
        "chosen_score": chosen_score
    }
# Apply the functions to the dataset
updated_disti_dataset = disti_dataset.map(add_status).map(swap)
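To see the three statuses in action, the sketch below condenses the logic of add_status and swap into one pure-Python function and applies it to a few toy records. The function name `relabel` and the records are hypothetical and only serve as an illustration.

```python
def relabel(record):
    """Condensed add_status + swap logic for a single record."""
    ratings = record["rating"]
    if ratings is None or ratings[0] == ratings[1]:
        status = "tie"
    elif record["order"][ratings.index(max(ratings))] != "chosen":
        status = "swapped"
    else:
        status = "unchanged"

    chosen, rejected = record["chosen"], record["rejected"]
    if status == "swapped":
        chosen, rejected = rejected, chosen

    return {
        "status": status,
        "chosen": chosen,
        "rejected": rejected,
        "chosen_score": max(ratings) if ratings is not None else None,
    }


records = [
    # The original chosen response got the highest rating.
    {"chosen": "A", "rejected": "B", "order": ["chosen", "rejected"], "rating": [9.0, 4.0]},
    # The original rejected response (shown first) got the highest rating.
    {"chosen": "A", "rejected": "B", "order": ["rejected", "chosen"], "rating": [8.0, 6.0]},
    # Equal ratings.
    {"chosen": "A", "rejected": "B", "order": ["chosen", "rejected"], "rating": [7.0, 7.0]},
    # No ratings were returned.
    {"chosen": "A", "rejected": "B", "order": ["chosen", "rejected"], "rating": None},
]

results = [relabel(r) for r in records]
statuses = [r["status"] for r in results]
```

Here `statuses` is `["unchanged", "swapped", "tie", "tie"]`, and only the second record ends up with "B" as its chosen response.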
Optional: Find duplicated examples¶
When training a model, it is essential to verify that your training samples are not duplicated in your test set; otherwise, evaluation results cannot be trusted. In our case, we will use our dataset as an example and compare it with the gsm8k train split, which comprises 7,473 samples in each subset.
Then, we will extract the questions from both datasets and preprocess them by lowercasing and tokenizing them.
# Function to preprocess the text
def preprocess(text):
    return nltk.word_tokenize(text.lower())
# Preprocess the questions
source_questions_processed = [preprocess(q) for q in source_questions]
# Add the socratic questions to the processed list (not to the raw one)
source_questions_processed.extend([preprocess(q) for q in source_questions_socratic])
target_questions_processed = [preprocess(q) for q in target_questions]
Finally, we will compare the gsm8k questions with the ones from our dataset and check if there are any duplicated samples. To do so, we will vectorize the questions and calculate the cosine similarity between them. The threshold was set to 0.8, a value we checked manually to avoid false positives.
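The comparison can be sketched as follows with TF-IDF vectors and cosine similarity from scikit-learn. The toy questions, the variable names, and the "Source Question" and "Similarity Score" keys are assumptions made for illustration; only the 0.8 threshold and the "Target Question" name come from this tutorial. The sketch passes lowercase strings directly, so token lists produced by a tokenizer would first need to be joined with spaces.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the gsm8k questions (source) and our dataset (target).
source_questions = [
    "natalia sold clips to 48 of her friends in april",
    "weng earns 12 dollars an hour for babysitting",
]
target_questions = [
    "natalia sold clips to 48 of her friends in april",
    "which planet is closest to the sun",
]

# Fit the vocabulary on the source questions and reuse it for the targets.
vectorizer = TfidfVectorizer()
source_vectors = vectorizer.fit_transform(source_questions)
target_vectors = vectorizer.transform(target_questions)

# similarity[i, j] compares source question i with target question j.
similarity = cosine_similarity(source_vectors, target_vectors)

THRESHOLD = 0.8
source_idx, target_idx = np.where(similarity > THRESHOLD)
matches = [
    {
        "Source Question": source_questions[s],
        "Target Question": target_questions[t],
        "Similarity Score": float(similarity[s, t]),
    }
    for s, t in zip(source_idx, target_idx)
]
```

In this toy run, only the first target question exceeds the threshold. A list of records like `matches` can be turned into a dataframe for inspection.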
We can inspect the results by creating a dataframe.
We can also add a new column to our dataset indicating whether each question is matched.
# Create a set of matching target questions (a set makes the membership test fast)
matching_target_questions = set(similarity_df['Target Question'])
# Add a column to the target dataset indicating whether each question is matched
target_dataset = target_dataset.map(lambda example: {"in_gsm8k_train": example['input'] in matching_target_questions})
target_dataset
Analyze our cleaned dataset¶
This dataset is well suited for preference fine-tuning, and it is a better choice than the original one. It uses the easy-to-understand "chosen, rejected" format and comes with extra details for more experiments and filtering. The updated dataset is particularly useful because it shows which responses are preferred (according to gpt-4-turbo), points out the responses with low scores, and even includes explanations in natural language.
The main changes are:
- ~2K pairs have been swapped: rejected becomes the chosen response. We have kept the original chosen and rejected in two new columns, original_*, for reproducibility purposes.
- 4K pairs have been identified as tie: equally bad or good.
- Chosen scores have been added: you can now filter based on a threshold (see our distilabelled Hermes 2.5 model for an example).
- We have kept the ratings and rationales generated with gpt-4-turbo and distilabel so you can prepare the data differently if you want.
- We have added a column to indicate if the input is part of the gsm8k train set.
This results in 5,922 instead of 12,859 samples (54% reduction) and leads to better performance than the same model tuned with 100% of the samples in the original dataset.
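Since every row now carries a status, a chosen score, and the gsm8k flag, a training subset can be selected with a simple predicate. The sketch below uses hypothetical toy records and a hypothetical threshold (`min_score=8.0`); it illustrates the kind of filter these columns enable and is not necessarily the exact filter behind the figures above.

```python
def keep_for_training(record, min_score=8.0):
    """Keep clear preferences with a high chosen score that are not in gsm8k train."""
    return (
        record["status"] != "tie"
        and record["chosen_score"] is not None
        and record["chosen_score"] >= min_score
        and not record["in_gsm8k_train"]
    )


records = [
    {"id": 0, "status": "unchanged", "chosen_score": 9.0, "in_gsm8k_train": False},
    {"id": 1, "status": "swapped", "chosen_score": 8.0, "in_gsm8k_train": False},
    {"id": 2, "status": "tie", "chosen_score": 9.0, "in_gsm8k_train": False},
    {"id": 3, "status": "unchanged", "chosen_score": 6.0, "in_gsm8k_train": False},
    {"id": 4, "status": "unchanged", "chosen_score": 9.0, "in_gsm8k_train": True},
]

kept_ids = [r["id"] for r in records if keep_for_training(r)]
```

With a Hugging Face dataset, the same predicate could be passed to its `filter` method.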
Conclusions¶
In summary, we've demonstrated the process of cleaning a preference dataset using distilabel. Additionally, we've illustrated how to employ Argilla for visualizing and annotating the dataset that has been cleaned with distilabel. Lastly, we've covered the steps for post-processing the dataset and provided an analysis of the key changes that were made.
Now the next question is: can we build better models with this new knowledge? The answer is the distilabeled Hermes model. Check it out!
Have a look at these resources if you want to go further: