🧮 Create a mathematical preference dataset
distilabel is a new AI Feedback (AIF) framework created by Argilla that leverages the power of LLMs to generate synthetic datasets for preference tuning or self-instruct fine-tuning. You can find more information in the links below:
- GitHub: argilla-io/distilabel
- Docs: distilabel.argilla.io
Also, don't forget to follow us on social media to stay up to date with the latest news about Argilla and distilabel:
- Twitter: @argilla_io
- LinkedIn: Argilla
Demo
In this demo, we will create a preference dataset that can later be used to fine-tune an LLM with DPO. First, we will define a list of math topics and create a pipeline that generates a list of instructions using self-instruct and OpenAI's gpt-3.5-turbo. After that, we will create another pipeline in which we ask gpt-3.5-turbo to generate 3 responses for each instruction, and finally we will ask it again to rate these responses, giving us our preference dataset.
Setup
For this tutorial you will need an API key associated with your OpenAI account. Then, create a Google Colab secret by clicking the key icon in the left sidebar and adding a secret called api_key with your OpenAI API key as its value.
> Google Colab secrets were released a few weeks ago and are very useful for reusing your API keys without leaking them!
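Later in the notebook, the secret can be read with Colab's userdata API. This snippet only works inside Google Colab:

```python
from google.colab import userdata

# Read the secret named "api_key" that you created in the sidebar;
# access to the secret must be granted to this notebook when prompted
openai_api_key = userdata.get("api_key")
```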
Installing distilabel
We will install distilabel with the openai and argilla extras, which also installs the openai and argilla clients that we will need later. In addition, we will install an extension for timing some cells.
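For example, with pip (ipython-autotime is one option for timing cells; adjust to your preferred extension):

```shell
pip install "distilabel[openai,argilla]" ipython-autotime
```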
Instruction generation
As mentioned above, we will first create a Pipeline for generating instructions using self-instruct and gpt-3.5-turbo. For that, we will create an instance of SelfInstructTask, which defines a prompt template for generating instructions given an application description. We will also create an instance of OpenAILLM to use gpt-3.5-turbo, passing it the SelfInstructTask instance we created before.
> As we're passing a Task for generating texts to the OpenAILLM, we can refer to this LLM as a generator.
First of all, we will create a Hugging Face 🤗 dataset containing a single column called input. This column will hold the math topics from which we want our LLM to generate instructions.
> It's important that the column is called input, because the Task that we will create later expects an input argument called input.
from datasets import Dataset
math_topics = [
"Algebraic Expressions",
"Linear Equations",
"Quadratic Equations",
"Polynomial Functions",
"Rational Expressions",
"Exponential Functions",
"Logarithmic Functions",
"Sequences and Series",
"Matrices",
"Determinants",
"Complex Numbers",
"Trigonometry",
"Geometry",
"Coordinate Geometry",
"Vector Algebra",
"Statistics",
"Probability",
"Calculus",
"Differential Calculus",
"Integral Calculus",
"Limits and Continuity",
"Differentiation",
"Integration",
"Theorems of Calculus",
"Mathematical Reasoning",
"Set Theory",
"Number Theory",
"Permutations and Combinations",
"Binomial Theorem",
"Arithmetic Progressions",
"Geometric Progressions",
"Harmonic Progressions",
"Trigonometric Ratios",
"Trigonometric Identities",
"Inverse Trigonometric Functions",
"Hyperbolic Functions",
"Conic Sections",
"Circle Geometry",
"Ellipse Geometry",
"Parabola Geometry",
"Hyperbola Geometry",
"Function Theory",
"Graph Theory",
"Differential Equations",
"Mathematical Induction",
"Discrete Mathematics",
]
dataset = Dataset.from_dict({
"input": math_topics
})
Next, we will create a SelfInstructTask that will guide the LLM, using its prompt template to generate instructions from the given list of inputs.
> All the Tasks have two properties, input_args_names and output_args_names, that indicate which arguments the task expects as inputs and which outputs it will generate, respectively.
application_description = (
"An AI assistant adept at answering a wide array of math, logic, and reasoning puzzles, trivia, "
"and general questions. Users of this assistant love to ask it to think and outline "
"the solutions step by step. It expects complete questions from users providing all the details "
"to solve the proposed problem or respond to general knowledge questions. It covers general "
"knowledge about math, puzzles, reasoning exercises, and real-life scenarios where math and "
"reasoning are important."
)
from distilabel.tasks import SelfInstructTask

# by default `SelfInstructTask` will generate 5 instructions, but we can tweak
# this behaviour by passing the `num_instructions` argument.
instruction_task = SelfInstructTask(
    application_description=application_description
)
print(f"`SelfInstructTask`\n - Input arguments: {instruction_task.input_args_names}\n - Output arguments: {instruction_task.output_args_names}")
Next, we will create an LLM, in this case an instance of OpenAILLM, since we want to use gpt-3.5-turbo to generate the instructions. We will pass it the instruction_task so it can build the prompts needed to generate instructions from the inputs of our dataset.
Finally, we will create a Pipeline to orchestrate the whole generation process. In this case, we will only pass it a generator.
Then we trigger the generation process by calling the generate method of the pipeline, specifying that we want 10 generations for each input.
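Putting these steps together, a minimal sketch could look like the following. Note that this assumes distilabel's pre-1.0 API; the exact argument names (num_threads, max_new_tokens, batch_size, openai_api_key) are assumptions, so check the docs for your installed version.

```python
import os

from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline

# Generator LLM: gpt-3.5-turbo guided by the self-instruct task defined above
instructions_generator = OpenAILLM(
    task=instruction_task,
    model="gpt-3.5-turbo",
    num_threads=4,
    max_new_tokens=1024,
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

# The pipeline only needs a generator at this stage
pipeline = Pipeline(generator=instructions_generator)

# 10 generations per input topic; each generation contains several instructions
distiset = pipeline.generate(dataset=dataset, num_generations=10, batch_size=4)
```

Since self-instruct returns each generation as an enumerated list ("1. …", "2. …"), the helper below strips the numbering and flattens everything into a single list of instructions.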
import re
def transform(inst: str) -> str:
"""Remove 1., 2., ... from the instruction."""
clean_inst = re.sub(r'^\d+\.\s*', '', inst)
    return clean_inst
instructions = [
transform(instruction)
for generations in distiset["instructions"]
for generation in generations
for instruction in generation
if instruction != ""
]
print(f"Number of generated instructions: {len(instructions)}")
Preference dataset
We have the instructions, but we still need the responses to these instructions and, more importantly, an evaluation of how good those responses are.
To do so, we will create a new Pipeline for generating and labelling the generated texts:
- We will create a generator LLM, using OpenAILLM with the TextGenerationTask, to generate responses for a given instruction.
- We will create a labeller LLM, using OpenAILLM with the UltraFeedbackTask, to produce a rating telling us how good a response was for a given instruction.
We create a generator that will use gpt-3.5-turbo to generate text. We also use the principles feature of the TextGenerationTask, which injects a principle into the generated prompt so the LLM focuses on it, allowing us to generate a more heterogeneous dataset.
from distilabel.tasks import TextGenerationTask
text_generation_task = TextGenerationTask(
principles_distribution={
"harmlessness": 0.4,
"helpfulness": 0.2,
"truthfulness": 0.2,
"honesty": 0.1,
"verbalized_calibration": 0.1
}
)
from distilabel.llm import OpenAILLM

generator = OpenAILLM(
    task=text_generation_task,
    num_threads=8,
    max_new_tokens=1024
)
Next, we create a labeller that will evaluate how good the texts produced by the generator are. In this case, we use the UltraFeedbackTask, which defines a prompt template for generating preference datasets.
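A sketch of the labeller and the combined pipeline, assuming distilabel's pre-1.0 API (the for_instruction_following constructor and the labeller argument of Pipeline are taken from that version's docs; verify against your installed version):

```python
from datasets import Dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import UltraFeedbackTask

# Labeller LLM: rates each response for a given instruction
labeller = OpenAILLM(
    task=UltraFeedbackTask.for_instruction_following(),
    num_threads=4,
    max_new_tokens=512,
)

# This pipeline both generates and labels
pipeline = Pipeline(generator=generator, labeller=labeller)

# Build a dataset from the instructions gathered earlier and generate
# 3 responses per instruction, each rated by the labeller
instructions_dataset = Dataset.from_dict({"input": instructions})
pipe_dataset = pipeline.generate(
    dataset=instructions_dataset,
    num_generations=3,
    batch_size=8,
)
```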
Human Feedback with Argilla
You can use the AI Feedback created by distilabel directly, but we have seen that enhancing it with human feedback will improve the quality of your LLM. We provide a to_argilla method which creates a dataset for Argilla, along with out-of-the-box tailored metadata filters and semantic search, to allow you to provide human feedback as quickly and engagingly as possible. You can check the Argilla docs to get it up and running.
First, install it.
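For example, with pip:

```shell
pip install argilla
```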
If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API key:
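A minimal sketch, assuming Argilla's v1 Python client (the URL and API key below are quickstart-style placeholders; replace them with your own):

```python
import argilla as rg

# Point the client at your running Argilla instance
rg.init(
    api_url="http://localhost:6900",
    api_key="owner.apikey",
)
```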
Now we can convert our dataset to a formatted Argilla dataset.
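Assuming the preference pipeline's output dataset is stored in a variable (here called pipe_dataset, a hypothetical name), the conversion and push could look like:

```python
# Convert the distilabel output into an Argilla FeedbackDataset
rg_dataset = pipe_dataset.to_argilla()

# Push it to the Argilla instance configured via rg.init
rg_dataset.push_to_argilla(
    name="math-preference-dataset",  # hypothetical dataset name
    workspace="admin",               # adjust to your Argilla workspace
)
```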
We can now jump into the UI and start providing human feedback to improve the quality of the synthetic dataset.