🧮 Create a mathematical preference dataset

distilabel is a new AI Feedback (AIF) framework created by Argilla that uses LLMs to generate synthetic datasets for preference tuning or self-instruct.

Demo

In this demo, we will create a preference dataset that can be later used to fine-tune an LLM using DPO. First, we will define a list of math topics and we will create a pipeline for generating a list of instructions using self-instruct and OpenAI gpt-3.5-turbo. After that, we will create another pipeline in which we will ask gpt-3.5-turbo to generate 3 texts for each instruction, and finally, we will ask it again to rate these responses, giving us our preference dataset.

Setup

For this tutorial, you will need an API key associated with your OpenAI account. After that, you will need to create a Google Colab secret by clicking the key icon in the left sidebar and creating a secret called OPENAI_API_KEY with your OpenAI API key as its value.

> Google Colab secrets were released a few weeks ago, and they are very useful for reusing your API keys without leaking them!

import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
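
Outside Colab the google.colab module is not available, so the cell above fails on import. A minimal sketch of a fallback, assuming the key has been exported as an environment variable named OPENAI_API_KEY. The helper name resolve_api_key is hypothetical and is not part of distilabel or the openai client:

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    name: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the API key stored under `name`, or raise a clear error.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to try out without touching real credentials.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return key


# Demo with a fake environment so no real key is needed.
fake_env = {"OPENAI_API_KEY": "  sk-not-a-real-key  "}
print(resolve_api_key(env=fake_env))
```

Failing early with an explicit message is usually easier to debug than an authentication error raised in the middle of a long generation run.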

Installing distilabel

We will install distilabel with the openai and argilla extras, which also installs the openai and argilla clients that we will need later. In addition, we will install an extension for timing cells.

%pip install distilabel[openai,argilla] ipython-autotime -qqq
%load_ext autotime

Instruction generation

As mentioned above, we will first create a Pipeline for generating instructions using self-instruct and gpt-3.5-turbo. For that, we will create an instance of SelfInstructTask, which defines a prompt template for generating instructions given an application description. We will also create an instance of OpenAILLM for using gpt-3.5-turbo and we will pass it to the SelfInstructTask instance that we created before.

> Because the Task we pass to the OpenAILLM generates text, we can refer to this LLM as a generator.

from distilabel.tasks import SelfInstructTask
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
time: 13.3 s (started: 2023-11-28 13:21:22 +00:00)

First of all, we will create a Hugging Face 🤗 dataset that will contain a single column called input. This column will contain the math topics from which we want our LLM to generate instructions.

> It's important that the column is called input because the Task that we will create later expects an input argument called input.

from datasets import Dataset


math_topics = [
    "Algebraic Expressions",
    "Linear Equations",
    "Quadratic Equations",
    "Polynomial Functions",
    "Rational Expressions",
    "Exponential Functions",
    "Logarithmic Functions",
    "Sequences and Series",
    "Matrices",
    "Determinants",
    "Complex Numbers",
    "Trigonometry",
    "Geometry",
    "Coordinate Geometry",
    "Vector Algebra",
    "Statistics",
    "Probability",
    "Calculus",
    "Differential Calculus",
    "Integral Calculus",
    "Limits and Continuity",
    "Differentiation",
    "Integration",
    "Theorems of Calculus",
    "Mathematical Reasoning",
    "Set Theory",
    "Number Theory",
    "Permutations and Combinations",
    "Binomial Theorem",
    "Arithmetic Progressions",
    "Geometric Progressions",
    "Harmonic Progressions",
    "Trigonometric Ratios",
    "Trigonometric Identities",
    "Inverse Trigonometric Functions",
    "Hyperbolic Functions",
    "Conic Sections",
    "Circle Geometry",
    "Ellipse Geometry",
    "Parabola Geometry",
    "Hyperbola Geometry",
    "Function Theory",
    "Graph Theory",
    "Differential Equations",
    "Mathematical Induction",
    "Discrete Mathematics",
]

dataset = Dataset.from_dict({
    "input": math_topics
})
time: 19.3 ms (started: 2023-11-28 13:21:46 +00:00)

Next, we will create a SelfInstructTask that will guide the LLM, through its prompt, to generate instructions from the given list of inputs.

> All Tasks have two properties, input_args_names and output_args_names, which indicate the arguments they expect as inputs and the arguments they generate as outputs, respectively.

application_description = (
    "An AI assistant adept at answering a wide array of math, logic, and reasoning puzzles, trivia, "
    "and general questions. Users of this assistant love to ask the assistant to think and outlines "
    "the solutions step by step. It expects complete questions from users providing all the details "
    "to solve the proposed problem or respond to general knowledge questions. It covers general "
    "knowledge about math, puzzles, reasoning exercises, and real-life scenarios where math and "
    "reasoning are important."
)

# by default `SelfInstructTask` will generate 5 instructions, but we can tweak
# this behaviour passing the `num_instructions` argument.
instruction_task = SelfInstructTask(
    application_description=application_description
)

print(f"`SelfInstructTask`\n   - Input arguments: {instruction_task.input_args_names}\n   - Output arguments: {instruction_task.output_args_names}")
`SelfInstructTask`
   - Input arguments: ['input']
   - Output arguments: ['generations']
time: 692 µs (started: 2023-11-28 13:21:50 +00:00)
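
Since the task reads its inputs by column name, it can be worth checking the dataset against input_args_names before spending API calls. A minimal sketch using plain lists; missing_input_columns is a hypothetical helper, not a distilabel function:

```python
from typing import List, Sequence


def missing_input_columns(
    column_names: Sequence[str],
    input_args_names: Sequence[str],
) -> List[str]:
    """Return the task input arguments that have no matching dataset column."""
    return [name for name in input_args_names if name not in column_names]


# A dataset with the expected column passes the check...
print(missing_input_columns(["input"], ["input"]))

# ...while a dataset whose column is named differently is flagged.
print(missing_input_columns(["topic"], ["input"]))
```

In the notebook, the same check could be called with dataset.column_names and instruction_task.input_args_names.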

Next, we will create an LLM, in this case an instance of OpenAILLM, since we want to use gpt-3.5-turbo to generate the instructions. We will pass it the instruction_task, which builds the prompts needed to generate instructions from the inputs in our dataset.

instruction_generator = OpenAILLM(
    task=instruction_task,
    num_threads=8,
    max_new_tokens=1024,
    temperature=0.7
)
time: 497 ms (started: 2023-11-28 13:21:54 +00:00)

Finally, we will create a Pipeline to orchestrate the whole generation process. In this case, we will only pass a generator.

pipeline = Pipeline(generator=instruction_generator)
time: 576 µs (started: 2023-11-28 13:21:56 +00:00)

Then we trigger the generation process by calling the generate method of the pipeline. We specify that we want 10 generations for each input.

distiset = pipeline.generate(
    dataset=dataset,
    num_generations=10,
    batch_size=4
)

import re

def transform(inst: str) -> str:
    """Remove a leading '1.', '2.', ... enumeration from the instruction."""
    return re.sub(r'^\d+\.\s*', '', inst)

instructions = [
    transform(instruction)
    for generations in distiset["instructions"]
    for generation in generations
    for instruction in generation
    if instruction != ""
]
print(f"Number of generated instructions: {len(instructions)}")
Number of generated instructions: 4637
time: 60.1 ms (started: 2023-11-28 13:28:33 +00:00)
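
To make the flattening step easier to follow, here is the same logic applied to a small hand-written sample. It assumes the nesting implied by the three loops above: one list per input row, one list per generation, and one string per instruction. The sample data and the names clean_instruction and flatten_instructions are made up for illustration:

```python
import re
from typing import List


def clean_instruction(inst: str) -> str:
    """Strip a leading enumeration marker such as '1. ' or '10. '."""
    return re.sub(r"^\d+\.\s*", "", inst)


def flatten_instructions(rows: List[List[List[str]]]) -> List[str]:
    """Flatten rows -> generations -> instructions, skipping empty strings."""
    return [
        clean_instruction(instruction)
        for generations in rows
        for generation in generations
        for instruction in generation
        if instruction != ""
    ]


# Two input rows: the first has one generation, the second has two.
toy_rows = [
    [["1. Solve 2x + 3 = 7.", "2. Factor x^2 - 9.", ""]],
    [["What is a matrix?"], ["10. Define a determinant."]],
]

flat = flatten_instructions(toy_rows)
print(flat)
# ['Solve 2x + 3 = 7.', 'Factor x^2 - 9.', 'What is a matrix?', 'Define a determinant.']
```

The pattern is anchored with `^`, so only a number at the very start of an instruction is removed; numbers inside the text are left untouched.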

import random

samples = random.sample(instructions, 5)

for sample in samples:
    print(sample)
How can the concept of probability be applied in real-life scenarios? 
Could you outline the process to solve a quadratic equation using the quadratic formula?
Explain the process of expanding the binomial expression (x + 3)^2 step by step.
How can I find the sum of an arithmetic series?
Explain the concept of factorial and provide an example of its application in real-life scenarios.
time: 8.4 ms (started: 2023-11-28 14:38:11 +00:00)

dataset = Dataset.from_dict({"instructions": instructions})
time: 17.8 ms (started: 2023-11-28 13:28:37 +00:00)

dataset.push_to_hub("argilla/distilabel-math-instructions")

Preference dataset

We have the instructions, but we still need responses to them and, more importantly, a way to evaluate how good those responses are.

To do so, we will create a new Pipeline for generating and labeling the generated texts:

  1. We will create a generator LLM using OpenAILLM and the TextGenerationTask to generate responses for a given instruction.
  2. We will create a labeler LLM using OpenAILLM and the UltraFeedbackTask to generate a rating that tells us how good a response is for a given instruction.

from datasets import load_dataset

dataset = load_dataset("argilla/distilabel-math-instructions", split="train")
dataset = dataset.rename_column("instructions", "input")
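
Conceptually, the two steps listed above chain together as sketched below. This is a toy illustration with plain Python functions standing in for the two LLM calls; it is not how distilabel implements its Pipeline, and run_toy_pipeline, fake_generate, and fake_rate are hypothetical names:

```python
from typing import Callable, Dict, List


def run_toy_pipeline(
    instructions: List[str],
    generate: Callable[[str, int], str],
    rate: Callable[[str, str], float],
    num_generations: int = 3,
) -> List[Dict[str, object]]:
    """Generate several responses per instruction, then rate each response."""
    rows = []
    for instruction in instructions:
        # Step 1: the "generator" produces num_generations candidate responses.
        generations = [generate(instruction, i) for i in range(num_generations)]
        # Step 2: the "labeller" scores every candidate for that instruction.
        ratings = [rate(instruction, response) for response in generations]
        rows.append(
            {"input": instruction, "generations": generations, "rating": ratings}
        )
    return rows


def fake_generate(instruction: str, attempt: int) -> str:
    """Stand-in for an LLM call that writes a response."""
    return f"Answer {attempt + 1} to: {instruction}"


def fake_rate(instruction: str, response: str) -> float:
    """Stand-in for an LLM call that scores a response (arbitrary rule)."""
    return float(1 + len(response) % 5)


rows = run_toy_pipeline(["What is a prime number?"], fake_generate, fake_rate)
print(rows[0]["generations"])
print(rows[0]["rating"])
```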

We create a generator that will use gpt-3.5-turbo for generating text. We also use the principles feature of the TextGenerationTask, which injects a principle into the generated prompt so that the LLM focuses on that principle when generating text. This gives us a more heterogeneous dataset.

from distilabel.tasks import TextGenerationTask

text_generation_task = TextGenerationTask(
    principles_distribution={
        "harmlessness": 0.4,
        "helpfulness": 0.2,
        "truthfulness": 0.2,
        "honesty": 0.1,
        "verbalized_calibration": 0.1
    }
)

generator = OpenAILLM(
    task=text_generation_task,
    num_threads=8,
    max_new_tokens=1024
)
time: 392 ms (started: 2023-11-28 13:29:01 +00:00)
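
The weights in principles_distribution can be read as the probability of each principle being picked for a prompt. A minimal sketch of weighted sampling with the standard library, assuming that interpretation; it is not distilabel's implementation, and sample_principle is a hypothetical helper:

```python
import random
from collections import Counter
from typing import Dict

principles_distribution = {
    "harmlessness": 0.4,
    "helpfulness": 0.2,
    "truthfulness": 0.2,
    "honesty": 0.1,
    "verbalized_calibration": 0.1,
}


def sample_principle(distribution: Dict[str, float], rng: random.Random) -> str:
    """Pick one principle name with probability proportional to its weight."""
    names = list(distribution)
    weights = list(distribution.values())
    return rng.choices(names, weights=weights, k=1)[0]


# Draw many samples and compare observed frequencies with the target weights.
rng = random.Random(0)
num_samples = 10_000
counts = Counter(
    sample_principle(principles_distribution, rng) for _ in range(num_samples)
)

for name, weight in principles_distribution.items():
    observed = counts[name] / num_samples
    print(f"{name:<24} target={weight:.2f} observed={observed:.2f}")
```

With this reading, roughly 40% of the prompts would emphasize harmlessness, which is worth keeping in mind when inspecting the generated responses.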

Next, we create a labeler that will evaluate how good the generator's texts are. In this case, we have decided to use the UltraFeedbackTask, which defines a prompt template for generating preference datasets.

from distilabel.tasks import UltraFeedbackTask

preference_labeller = OpenAILLM(
    task=UltraFeedbackTask.for_instruction_following(),
    num_threads=8,
    max_new_tokens=1024,
)
time: 374 ms (started: 2023-11-28 13:29:04 +00:00)

pipeline = Pipeline(
    generator=generator,
    labeller=preference_labeller
)
time: 558 µs (started: 2023-11-28 13:29:05 +00:00)

distiset_pref = pipeline.generate(
    dataset=dataset.shuffle().select(range(100)),
    num_generations=3,
    batch_size=8
)

distiset_pref.column_names
['input',
 'generation_model',
 'generation_prompt',
 'raw_generation_responses',
 'generations',
 'labelling_model',
 'labelling_prompt',
 'raw_labelling_response',
 'rating',
 'rationale']
time: 4.16 ms (started: 2023-11-23 16:36:11 +00:00)

print("Instruction:", distiset_pref[0]["input"])
for generation in distiset_pref[0]["generations"]:
    print(generation, end="\n\n\n\n")
Instruction: What is the significance of graph theory in computer science and network analysis?
Graph theory is a fundamental field in mathematics that has significant applications in computer science and network analysis. Here are a few ways it is significant:

1. Network analysis: Graph theory provides a powerful framework for analyzing and understanding complex networks. It helps uncover key properties, such as connectivity, centrality, clustering, and community structure. These insights are essential for designing efficient and reliable networks, managing traffic, identifying vulnerabilities, and optimizing network performance.

2. Data structures and algorithms: Graphs are commonly used as data structures in computer science. They are particularly useful for representing relationships and connections between entities. Many algorithms, such as Dijkstra's algorithm for shortest paths, the traveling salesman problem, and depth-first search, are based on graph theory and are widely employed in various computer science applications.

3. Social networks and web analysis: Graph theory plays a crucial role in analyzing social networks and the structure of the web. It enables us to study patterns of interactions, identify influencers, detect communities, and understand information flow in complex systems like social media platforms and the World Wide Web.

4. Circuit design and optimization: Graph theory aids in circuit design and optimization, especially in electronic and digital systems. By modeling circuits as graphs, engineers can analyze circuit behavior, optimize layouts, and solve various design problems that arise in areas like VLSI (Very Large Scale Integration) and hardware design.

5. Computer graphics and image processing: Graph theory has applications in computer graphics and image processing. Techniques like graph coloring, graph matching, and graph-based segmentation are used for tasks such as color allocation, object recognition, and image compression.

These are just a few examples of how graph theory is significant in computer science and network analysis. Its applications are vast and continue to grow as technology advances.



Graph theory is highly significant in computer science and network analysis for a variety of reasons:

1. Modeling Relationships: In computer science, graph theory allows us to model and represent complex relationships and dependencies between entities using graphs. This is particularly useful in network analysis, as it helps us understand the connections between nodes in a network, such as the internet or social networks.

2. Algorithm Design: Graph algorithms play a crucial role in solving various computational problems efficiently. Many important algorithms, like Dijkstra's algorithm for finding the shortest path or Prim's algorithm for finding a minimum spanning tree, are based on graph theory. These algorithms are used extensively in network analysis, routing problems, optimizing resource allocation, and many other areas in computer science.

3. Network Analysis: Graph theory provides a powerful framework for analyzing networks, both in theoretical and practical terms. It enables the identification of network structures, such as cliques, communities, or hubs, which helps in understanding network behavior, predicting patterns, and identifying vulnerabilities or bottlenecks.

4. Optimization Problems: Graph theory offers tools for solving various optimization problems that arise in computer science and network analysis. For example, the traveling salesman problem, which seeks the shortest route through a set of cities, can be modeled as a graph problem, allowing the application of graph algorithms to find efficient solutions.

5. Data Structures: Many data structures in computer science, such as trees and adjacency lists, are derived from graph theory. These data structures are fundamental for efficient storage and retrieval of information, and are widely used in implementing algorithms related to networks and other computational domains.

Overall, graph theory provides a theoretical foundation and practical techniques to analyze and solve problems related to networks, computation, and data structures, making it an essential field within computer science and network analysis.



Graph theory has significant applications in computer science and network analysis. Here are a few key areas where graph theory plays a crucial role:

1. Network Analysis and Social Networks: Graph theory is fundamental to the analysis of complex networks, such as social networks, computer networks, biological networks, and transportation networks. It helps in understanding relationships, connectivity, and patterns within these networks.

2. Data Structures and Algorithms: Many fundamental data structures and algorithms in computer science are based on graphs. Graphs are used to represent and model a variety of problems, including searching, sorting, shortest path algorithms, spanning trees, and flow optimization.

3. Database Systems: Graph databases use graph theory to store and query data in a network-like structure. They enable efficient retrieval and analysis of interconnected data, making them suitable for domains like social media, recommendation systems, and fraud detection.

4. Compiler Design and Code Optimization: Graph theory is employed in compiler design for optimizing code and representing dependencies between program components. Techniques like control flow graphs and data flow analysis utilize graph representations to optimize program execution.

5. Artificial Intelligence and Machine Learning: Graphs are widely used in AI and machine learning to represent complex relationships and dependencies within data. Graph-based algorithms help in tasks like clustering, recommendation systems, community detection, and knowledge representation.

6. Internet and Web Design: Web pages and links can be represented as graphs, enabling algorithms like PageRank to determine the importance and ranking of web pages. Graph theory is also used in studying internet topology, routing algorithms, and network flow optimization.

Graph theory provides a powerful framework for modeling, analyzing, and solving problems in these areas and more. Its concepts and algorithms are essential tools for computer scientists and network analysts.



time: 1.99 ms (started: 2023-11-23 16:38:05 +00:00)

distiset_pref[0]["rationale"]
['Text 1 fully aligns with the task goal and restrictions. It provides a comprehensive explanation of the significance of graph theory in computer science and network analysis by discussing multiple applications and how they are relevant in each area.',
 'Text 2 almost fully aligns with the task goal and restrictions. It covers most of the significant ways graph theory is used in computer science and network analysis, but it could benefit from providing more specific examples or details in some areas.',
 'Text 3 partially aligns with the task goal and restrictions. While it touches on some key areas where graph theory is significant, it lacks detailed explanations and examples in certain domains such as algorithm design and network analysis. It would benefit from providing a more comprehensive discussion in order to fully meet the requirements.']
time: 4.56 ms (started: 2023-11-23 16:39:01 +00:00)
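
Since the goal is a dataset that can be used for DPO, each record eventually has to be reduced to one preferred and one rejected response. A minimal sketch, assuming that the rating column holds one numeric score per entry in generations (in the same order) and that the trainer expects prompt, chosen, and rejected keys. Both are assumptions, and to_preference_pair is a hypothetical helper, not a distilabel function:

```python
from typing import Dict, List, Optional, Union

Record = Dict[str, Union[str, List[str], List[float]]]


def to_preference_pair(record: Record) -> Optional[Dict[str, str]]:
    """Turn one rated record into a prompt/chosen/rejected triple.

    Returns None when ratings are missing, misaligned with the generations,
    or all equal (no preference can be derived from a tie).
    """
    generations = record["generations"]
    ratings = record["rating"]
    if not ratings or len(ratings) != len(generations):
        return None

    best = max(range(len(ratings)), key=ratings.__getitem__)
    worst = min(range(len(ratings)), key=ratings.__getitem__)
    if ratings[best] == ratings[worst]:
        return None

    return {
        "prompt": record["input"],
        "chosen": generations[best],
        "rejected": generations[worst],
    }


# Hand-written record mimicking the column names shown above.
toy_record = {
    "input": "What is 2 + 2?",
    "generations": ["4", "It is 5.", "Two plus two equals four."],
    "rating": [5.0, 1.0, 4.0],
}

pair = to_preference_pair(toy_record)
print(pair)
# {'prompt': 'What is 2 + 2?', 'chosen': '4', 'rejected': 'It is 5.'}
```

Picking the highest- and lowest-rated responses is only one possible pairing strategy; records with tied ratings are skipped here rather than resolved arbitrarily.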

Human Feedback with Argilla

You can use the AI Feedback created by distilabel directly, but we have seen that enhancing it with human feedback improves the quality of your LLM. We provide a to_argilla method, which creates a dataset for Argilla along with out-of-the-box tailored metadata filters and semantic search, so that you can provide human feedback as quickly and engagingly as possible. You can check the Argilla docs to get it up and running.

First, install distilabel with the argilla extra (skip this if you already installed it in the setup step).

!pip install "distilabel[argilla]"

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API key:

import argilla as rg

# Replace api_url with your HF Spaces URL if you are using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="owner.apikey",
    workspace="admin"
)

Now we can convert our dataset to a formatted Argilla dataset and push it.

rg_dataset = distiset_pref.to_argilla()
time: 347 ms (started: 2023-11-23 15:36:31 +00:00)

rg_dataset.push_to_argilla(name="math-preference-dataset", workspace="admin")
RemoteFeedbackDataset(
   id=4232d7ac-eaff-49b0-88b8-3a384b76efbb
   name=math-preference-dataset
   workspace=Workspace(id=2fc2ebed-8d20-41b0-b33a-5c5f3712da53, name=admin, inserted_at=2023-11-23 15:00:40.160242, updated_at=2023-11-23 15:00:40.160242)
   url=https://gabrielmbmb-distilabel.hf.space/dataset/4232d7ac-eaff-49b0-88b8-3a384b76efbb/annotation-mode
   fields=[RemoteTextField(id=UUID('dc965a9c-ac85-449b-ae16-bda998c88c1c'), client=None, name='input', title='Input', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('0d29f518-a3bf-4642-a7d7-1324329555b7'), client=None, name='generations-1', title='Generations-1', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('c0541a45-8892-49fb-8b85-fcaa0cd147e0'), client=None, name='generations-2', title='Generations-2', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('4da8a4b5-553d-4b1d-a14f-b936a313797f'), client=None, name='generations-3', title='Generations-3', required=True, type='text', use_markdown=False)]
   questions=[RemoteRatingQuestion(id=UUID('fb012ff3-5fb7-40c0-b623-d4195c1508c8'), client=None, name='generations-1-rating', title="What's the rating for generations-1?", description=None, required=True, type='rating', values=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]), RemoteRatingQuestion(id=UUID('a7f02428-74e4-4497-8f26-803988e5c336'), client=None, name='generations-2-rating', title="What's the rating for generations-2?", description=None, required=True, type='rating', values=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]), RemoteRatingQuestion(id=UUID('cbab782f-b0c0-4ff6-8c64-c58ad3ea476a'), client=None, name='generations-3-rating', title="What's the rating for generations-3?", description=None, required=True, type='rating', values=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]), RemoteTextQuestion(id=UUID('405620c5-8aea-4fda-b92b-ee72e90629a9'), client=None, name='ratings-rationale', title="What's the rationale behind the ratings?", description=None, required=True, type='text', use_markdown=False)]
   guidelines=None)
time: 8.63 s (started: 2023-11-23 15:36:51 +00:00)

We can now jump into the UI and start providing human feedback to improve the quality of the synthetic dataset.