🧮 Create a mathematical preference dataset¶
distilabel is a new AI Feedback (AIF) framework created by Argilla that leverages the power of LLMs to generate synthetic datasets for preference tuning or self-instruct. You can find more information in the links below:
- GitHub: argilla-io/distilabel
- Docs: distilabel.argilla.io
Also, don't forget to follow us on social media to keep up to date with the latest news about Argilla and distilabel:
- Twitter: @argilla_io
- LinkedIn: Argilla
Demo¶
In this demo, we will create a preference dataset that can later be used to fine-tune an LLM using DPO. First, we will define a list of math topics and create a pipeline for generating instructions using self-instruct and OpenAI's gpt-3.5-turbo. After that, we will create another pipeline in which we ask gpt-3.5-turbo to generate 3 responses for each instruction and, finally, ask it again to rate these responses, giving us our preference dataset.
Setup¶
For this tutorial, you will need an API key associated with your OpenAI account. After that, you will need to create a Google Colab secret by clicking the key icon in the left sidebar and creating a secret called api_key with your OpenAI API key as its value.
> Google Colab secrets were released a few weeks ago and are very useful for reusing your API keys without leaking them!
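Reading the secret could look like the sketch below; the `api_key` secret name comes from this tutorial, while the `OPENAI_API_KEY` environment-variable fallback is an assumption for running outside Colab:

```python
import os

# Retrieve the `api_key` Colab secret; fall back to an environment
# variable when the notebook runs outside Google Colab.
try:
    from google.colab import userdata  # only importable inside Colab
    openai_api_key = userdata.get("api_key")
except ImportError:
    openai_api_key = os.environ.get("OPENAI_API_KEY", "")
```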
Installing distilabel¶
We will install distilabel with the openai and argilla extras, to also install the openai and argilla clients that we will need later. In addition, we will install an extension for timing some cells.
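A possible install command is sketched below; `ipython-autotime` is an assumption for the cell-timing extension mentioned above:

```shell
# Install distilabel with the openai and argilla extras, plus the
# (assumed) ipython-autotime extension for timing cells.
pip install "distilabel[openai,argilla]" ipython-autotime
```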
Instruction generation¶
As mentioned above, we will first create a Pipeline
for generating instructions using self-instruct
and gpt-3.5-turbo
. For that, we will create an instance of SelfInstructTask
, which defines a prompt template for generating instructions given an application description. We will also create an instance of OpenAILLM
for using gpt-3.5-turbo
and we will pass it to the SelfInstructTask
instance that we created before.
> As we're passing a Task for generating texts to the OpenAILLM, we can refer to this LLM as a generator.
First of all, we will create a Hugging Face 🤗 dataset
that will contain a single column called input
. This column will contain the math topics from which we want our LLM to generate instructions.
> It's important that the column is called input, because the Task that we will create later expects an input argument called input.
from datasets import Dataset
math_topics = [
"Algebraic Expressions",
"Linear Equations",
"Quadratic Equations",
"Polynomial Functions",
"Rational Expressions",
"Exponential Functions",
"Logarithmic Functions",
"Sequences and Series",
"Matrices",
"Determinants",
"Complex Numbers",
"Trigonometry",
"Geometry",
"Coordinate Geometry",
"Vector Algebra",
"Statistics",
"Probability",
"Calculus",
"Differential Calculus",
"Integral Calculus",
"Limits and Continuity",
"Differentiation",
"Integration",
"Theorems of Calculus",
"Mathematical Reasoning",
"Set Theory",
"Number Theory",
"Permutations and Combinations",
"Binomial Theorem",
"Arithmetic Progressions",
"Geometric Progressions",
"Harmonic Progressions",
"Trigonometric Ratios",
"Trigonometric Identities",
"Inverse Trigonometric Functions",
"Hyperbolic Functions",
"Conic Sections",
"Circle Geometry",
"Ellipse Geometry",
"Parabola Geometry",
"Hyperbola Geometry",
"Function Theory",
"Graph Theory",
"Differential Equations",
"Mathematical Induction",
"Discrete Mathematics",
]
dataset = Dataset.from_dict({
"input": math_topics
})
Next, we will create a SelfInstructTask that will use a prompt template to guide the LLM to generate instructions from the given list of inputs.
> All Tasks have two properties, input_args_names and output_args_names, that indicate which arguments the task expects as input and which outputs it will generate, respectively.
application_description = (
"An AI assistant adept at answering a wide array of math, logic, and reasoning puzzles, trivia, "
"and general questions. Users of this assistant love to ask the assistant to think and outline "
"the solutions step by step. It expects complete questions from users providing all the details "
"to solve the proposed problem or respond to general knowledge questions. It covers general "
"knowledge about math, puzzles, reasoning exercises, and real-life scenarios where math and "
"reasoning are important."
)
from distilabel.tasks import SelfInstructTask

# by default `SelfInstructTask` will generate 5 instructions, but we can tweak
# this behaviour by passing the `num_instructions` argument.
instruction_task = SelfInstructTask(
    application_description=application_description
)
print(f"`SelfInstructTask`\n - Input arguments: {instruction_task.input_args_names}\n - Output arguments: {instruction_task.output_args_names}")
As the next step, we will create an LLM, in this case an instance of OpenAILLM, as we want to use gpt-3.5-turbo to generate the instructions. We will pass it the instruction_task, which builds the prompts needed to generate instructions from the inputs of our dataset.
Finally, we will create a Pipeline to orchestrate the whole generation process. In this case, we will only pass a generator.
Then we trigger the generation process by calling the pipeline's generate method, specifying that we want 10 generations for each input.
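The steps just described could be sketched as follows; the argument names (`num_threads`, `max_new_tokens`, `num_generations`, `batch_size`) and their values are assumptions based on the distilabel 0.x API, and the call requires a valid OpenAI API key:

```python
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline

# LLM that wraps gpt-3.5-turbo and builds self-instruct prompts
# from the task defined above.
instruction_generator = OpenAILLM(
    task=instruction_task,
    num_threads=4,       # parallel requests (assumed value)
    max_new_tokens=1024,
)

# A pipeline with only a generator (no labeller at this stage).
pipeline = Pipeline(generator=instruction_generator)

# Ask for 10 generations per input topic.
distiset = pipeline.generate(dataset=dataset, num_generations=10, batch_size=4)
```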
import re
def transform(inst: str) -> str:
    """Remove the leading "1.", "2.", ... enumeration from an instruction."""
    return re.sub(r"^\d+\.\s*", "", inst)
instructions = [
transform(instruction)
for generations in distiset["instructions"]
for generation in generations
for instruction in generation
if instruction != ""
]
print(f"Number of generated instructions: {len(instructions)}")
Preference dataset¶
We have the instructions, but we still need the responses for these instructions and, more importantly, an evaluation of how good these responses are.
To do so, we will create a new Pipeline
for generating and labeling the generated texts:
- We will create a generator LLM using OpenAILLM and the TextGenerationTask to generate responses for a given instruction.
- We will create a labeler LLM using OpenAILLM and the UltraFeedbackTask to generate a rating telling us how good a response was for a given instruction.
We create a generator that will use gpt-3.5-turbo for generating text. We also use the principles feature of the TextGenerationTask, which injects a principle into the generated prompt so that the LLM focuses its generation on that principle, allowing us to generate a more heterogeneous dataset.
from distilabel.tasks import TextGenerationTask
text_generation_task = TextGenerationTask(
principles_distribution={
"harmlessness": 0.4,
"helpfulness": 0.2,
"truthfulness": 0.2,
"honesty": 0.1,
"verbalized_calibration": 0.1
}
)
from distilabel.llm import OpenAILLM

generator = OpenAILLM(
task=text_generation_task,
num_threads=8,
max_new_tokens=1024
)
Next, we create a labeler that will evaluate how good the texts produced by the generator are. In this case, we use the UltraFeedbackTask, which defines a prompt template for building preference datasets.
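A minimal sketch of the labeler and the combined pipeline, assuming the distilabel 0.x API (the `for_instruction_following` preset, the `labeller` keyword, and the numeric arguments are assumptions):

```python
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import UltraFeedbackTask

# Rate each response for how well it follows the instruction;
# `for_instruction_following` is one of several UltraFeedback presets.
labeler = OpenAILLM(
    task=UltraFeedbackTask.for_instruction_following(),
    num_threads=4,       # assumed value
    max_new_tokens=512,
)

# Pipeline that first generates responses, then labels them.
preference_pipeline = Pipeline(generator=generator, labeller=labeler)
```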
Human Feedback with Argilla¶
You can use the AI Feedback created by distilabel directly, but we have seen that enhancing it with human feedback improves the quality of your LLM. We provide a to_argilla method that creates an Argilla dataset, along with out-of-the-box tailored metadata filters and semantic search, to let you provide human feedback as quickly and engagingly as possible. You can check the Argilla docs to get it up and running.
First, install it.
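Installing the Argilla client could look like this (a plain pip install, version unpinned by assumption):

```shell
# Install the Argilla client used by `to_argilla` and `push_to_argilla`.
pip install argilla
```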
If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API key:
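A sketch of the client initialization; the URL and API key below are placeholder assumptions and must be replaced with the values of your own Argilla instance:

```python
import argilla as rg

# Point the client at your Argilla instance; these placeholder values
# are assumptions and will differ in your deployment.
rg.init(
    api_url="http://localhost:6900",
    api_key="admin.apikey",
)
```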
Now we can convert our dataset to a formatted Argilla dataset and push it.
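A minimal sketch of this step, assuming the preference pipeline's output is stored in a variable called `distiset`; the dataset name and workspace are hypothetical placeholders:

```python
# Convert the labelled distilabel dataset to an Argilla dataset
# and push it to the UI for human review.
rg_dataset = distiset.to_argilla()
rg_dataset.push_to_argilla(name="math-preference-dataset", workspace="admin")
```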
We can now jump into the UI and start providing human feedback to improve the quality of the synthetic dataset.