🧙 Create an evol-instruct dataset


In this tutorial, we'll build an evol-instruct dataset by employing the approaches outlined in "WizardLM: Empowering Large Language Models to Follow Complex Instructions" and "What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning", using distilabel. In the next sections, we will describe the process in detail. So, let's get started! 🪄

Introduction

The WizardLM paper proposes a new method, Evol-Instruct, to synthetically create a dataset of open-domain instructions of varying complexity using gpt-3.5-turbo. The resulting dataset, combined with the original one, was used to fine-tune LLaMA, leading to the creation of WizardLM. In both human and automatic evaluations, this model proved highly competitive with ChatGPT, reaching more than 90% of ChatGPT's capacity in 17 out of 29 skills.

In this tutorial, we will focus only on the Evol-Instruct approach to create a more complex dataset. Starting from an initial dataset that acts as the seed for the evolution process, the steps for each epoch (the paper sets M=4) are as follows:

  1. Instruction Evolving: Use gpt-3.5-turbo with predefined prompts to generate the evolved instructions. These prompts can be of two types: in-depth evolving (adding constraints, deepening, concretizing, increasing reasoning, and complicating the input) and in-breadth evolving (mutation). The complicating prompt is the only one not applied, as it needs in-context examples; one of the remaining five is then selected at random and applied to the input instruction (see the sketch after this list). You can check the original code here.
  2. Elimination Evolving: The instruction evolving step may fail, so the new instructions are filtered according to the following criteria:
    1. The evolved instruction does not provide any information gain over the original one. This is automatically evaluated with ChatGPT.
    2. The response generated for the evolved instruction contains "sorry" and is shorter than 80 words, which indicates the LLM struggled to answer it.
    3. The response generated for the evolved instruction only contains punctuation and stop words.
    4. The evolved instruction copies words from the evolving prompt.
  3. If the evolved instruction passes these criteria, it is added to the pool of new instructions and will also be used as input for the next iteration. If not, it is dropped and the original instruction is used for the next iteration instead.
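
To make step 1 and the cheap filters concrete, here is a minimal sketch in plain Python. The prompt templates, the toy stop-word list, and the helper names below are illustrative placeholders rather than the paper's exact prompts or code (criterion 1 needs an LLM judge, which we build with distilabel later on).

import random
import string

# Illustrative evolving prompts: the five strategies applied in this setup
# (four in-depth strategies plus the in-breadth mutation). Placeholders,
# not the exact prompts from the paper.
EVOLVING_PROMPTS = [
    "Add one more constraint or requirement to #Given Prompt#: {instruction}",
    "Increase the depth and breadth of the inquiry in #Given Prompt#: {instruction}",
    "Replace general concepts with more specific ones in #Given Prompt#: {instruction}",
    "Rewrite #Given Prompt# so it explicitly requires multi-step reasoning: {instruction}",
    "Create a brand new prompt in the same domain as #Given Prompt#: {instruction}",
]

def pick_evolving_prompt(instruction: str) -> str:
    """Step 1: randomly select one of the five applicable evolving strategies."""
    return random.choice(EVOLVING_PROMPTS).format(instruction=instruction)

def passes_cheap_filters(evolved_instruction: str, response: str) -> bool:
    """Elimination criteria 2-4 (criterion 1 needs an LLM judge)."""
    # Criterion 2: the response contains "sorry" and is under 80 words.
    if "sorry" in response.lower() and len(response.split()) < 80:
        return False
    # Criterion 3: the response only contains punctuation and stop words (toy list).
    stop_words = {"the", "a", "an", "and", "or", "of", "to", "is", "are"}
    tokens = response.translate(str.maketrans("", "", string.punctuation)).lower().split()
    if all(token in stop_words for token in tokens):
        return False
    # Criterion 4: the evolved instruction copies words from the evolving prompt.
    markers = ("#Given Prompt#", "#Rewritten Prompt#", "given prompt", "rewritten prompt")
    if any(marker in evolved_instruction for marker in markers):
        return False
    return True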

Once the evolved instructions are generated, the same LLM is used to generate the corresponding responses. Finally, the resulting dataset is the combination of the original instructions and the new ones generated in each epoch.
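
As a rough sketch of that response-generation step, assuming the same distilabel API used throughout this tutorial and a hypothetical evolved_dataset holding the evolved instructions in an "input" column:

import os

from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask

# Generate responses for the evolved instructions with the same LLM
response_llm = OpenAILLM(
    task=TextGenerationTask(),
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-3.5-turbo",
)
response_pipe = Pipeline(generator=response_llm)
# `evolved_dataset` is a hypothetical datasets.Dataset with an "input" column:
# responses_dataset = response_pipe.generate(evolved_dataset, batch_size=8)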


On the other hand, the Deita paper proposes further strategies to select the best data for alignment. It reuses the Evol-Instruct approach, but without the in-breadth evolving step, calling the result Evol-Complexity. It also applies the Evol-Quality and Data Selection strategies:

  • Evol-Quality is similar to Evol-Complexity, but it uses a different prompt focused on improving the quality of the responses (enhancing helpfulness, augmenting relevance, enriching depth, fostering creativity, and supplying additional details) to generate new pairs.
  • Data Selection filters the new instructions using embeddings and cosine similarity to keep the best and most diverse ones (a rough sketch of the idea follows below).
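
As a rough illustration of that selection idea, here is a minimal greedy variant, assuming a hypothetical embed function that maps an instruction to a unit-norm NumPy vector; Deita's actual implementation differs in its details.

from typing import Callable, List

import numpy as np

# Minimal sketch of a diversity-aware selection pass: greedily keep an
# instruction only if it is not too similar to anything already kept.
# `embed` is a hypothetical function returning unit-norm vectors, so the
# dot product below equals the cosine similarity.
def select_diverse(
    instructions: List[str],
    embed: Callable[[str], np.ndarray],
    threshold: float = 0.9,
) -> List[str]:
    selected, selected_vecs = [], []
    for instruction in instructions:
        vec = embed(instruction)
        # Keep the instruction if its cosine similarity to every already
        # selected instruction stays below the threshold.
        if all(float(np.dot(vec, other)) < threshold for other in selected_vecs):
            selected.append(instruction)
            selected_vecs.append(vec)
    return selected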

In the next sections, we will see how to use these approaches to build our dataset using distilabel.

Getting started

Install dependencies

Let's start by installing the required dependencies to run distilabel. You can also install argilla for better visualization and curation of the results.

%pip install -q -U "distilabel[openai,argilla]"

Then we can import the required libraries.

import os
import string
from dataclasses import dataclass
from typing import Dict, List

import pandas as pd
from datasets import Dataset, load_dataset

from distilabel.dataset import CustomDataset
from distilabel.llm import LLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import EvolComplexityTask, Prompt, EvolQualityTask, TextGenerationTask
# Set the OpenAI API Key
os.environ["OPENAI_API_KEY"] = 'sk-...'

Prepare the initial dataset

The first step is to prepare the initial dataset that will be used as the seed for the evolution process. Following the same idea as the example shown in the paper, we will use the well-known alpaca dataset available on Hugging Face. For the sake of this tutorial, we will use only 5 samples.

It is worth mentioning that other datasets, such as distilabel-intel-orca-dpo-pairs, a "distilabeled" version of orca_dpo_pairs for preference tuning with 12.9K samples, were also tried as the seed dataset. However, their instructions were already too complex, so the evolution process generated only a small number of new instructions, and those were of poor quality or contained hallucinations.

# Load the dataset
hf_dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Get our initial dataset
initial_dataset = (
    hf_dataset
    .select_columns(["instruction", "output"])
    .rename_column("instruction", "input")
    .rename_column("output", "response")
)

# Select a subset
initial_dataset = initial_dataset.shuffle(seed=5).select(range(5))
initial_dataset[0]
{'input': 'Generate a list of three ingredients for a chocolate cake.',
 'response': '- Flour\n- Cocoa powder\n- Sugar'}

The Evol-Complexity approach

For our case, we will need to set up two different LLMs with their corresponding tasks: one for the instruction evolving and another for the first criterion of the elimination evolving step (the equality check).

Instruction Evolving LLM

The next step is to define the LLM that will be used to generate the evolved instructions. We will use gpt-3.5-turbo as the language model with the EvolComplexityTask, setting the generation parameters from Section 4.3 of WizardLM. Take into account that EvolComplexityTask performs the random selection of the evolving prompt and the filtering of the evolved instructions, except for the first elimination criterion related to equal prompts, which we will handle separately.

# Define our LLM
complexity_llm = OpenAILLM(
    task=EvolComplexityTask(),
    api_key=os.getenv("OPENAI_API_KEY"),
    model= "gpt-3.5-turbo",
    num_threads=4,
    max_new_tokens=2048,
    temperature=1,
    frequency_penalty=0.0,
    top_p=0.9,
)

Elimination Evolving LLM

As part of the elimination step, the paper asks ChatGPT whether the original prompt and the evolved one from the current epoch are equal. In order to do so, we will need to define an LLM with the corresponding task. As this task does not exist in distilabel, we will create a custom one based on TextGenerationTask, indicating how to generate the prompt and parse the output.

# Indicate the prompt (Appendix G from WizardLM)
elimination_equal_prompt = """Here are two Instructions, do you think they are equal to each other and meet the following requirements?:
    1. They have the same constraints and requirements.
    2. They have the same depth and breadth of the inquiry.
    The First Prompt: {first_instruction}
    The Second Prompt: {second_instruction}
    Your Judgement (Just answer: Equal or Not Equal. No need to explain the reason):"""
# Define our distilabel class
@dataclass
class EliminationEqualPrompts(TextGenerationTask):

    system_prompt: str = "You are an AI judge in charge of determining the equality of two instructions."

    def generate_prompt(self, input: List[str]) -> Prompt:
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=elimination_equal_prompt.format(
                first_instruction=input[0], second_instruction=input[1]
            ),
        )

    def parse_output(self, output: str) -> Dict[str, str]:
        """Remove punctuation from the output and lowercase it."""
        return {
            "generations": output.translate(
                str.maketrans("", "", string.punctuation)).lower()
        }
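
Before wiring the task into an LLM, we can sanity-check it on a pair of made-up instructions; both the inputs and the simulated "Not Equal." output below are just for illustration.

# Quick check of the custom task with made-up inputs
task = EliminationEqualPrompts()
prompt = task.generate_prompt(
    ["List three cake ingredients.", "List three ingredients for a vegan chocolate cake."]
)
print(prompt.formatted_prompt)
# parse_output strips punctuation and lowercases the judgement, e.g.:
print(task.parse_output("Not Equal."))  # {'generations': 'not equal'}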

We will use this task in our LLM definition. Following the paper, the generation parameters will be the same as the ones used in the previous section.

# Define our second LLM
elimination_llm = OpenAILLM(
    task=EliminationEqualPrompts(),
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-3.5-turbo",
    num_threads=4,
    max_new_tokens=2048,
    temperature=1,
    frequency_penalty=0.0,
    top_p=0.9,
)

The Evol-Quality approach

Following the Deita paper, we will run the Evol-Quality approach to generate new, higher-quality responses for the instructions evolved in the previous section. Similarly, we will define the LLM, this time with the EvolQualityTask, to generate the new responses.

# Define our LLM
quality_llm = OpenAILLM(
    task=EvolQualityTask(),
    api_key=os.getenv("OPENAI_API_KEY"),
    model= "gpt-4-turbo-preview",
    num_threads=4,
    max_new_tokens=2048,
    temperature=1,
    frequency_penalty=0.0,
    top_p=0.9,
)

Run the evolution process

To run the evolution process, we will create a make_evol_instruct_dataset function that takes the defined LLMs, the initial dataset, and the number of evolution steps. Our approach follows the steps from WizardLM, but uses the Evol-Complexity task with its number of epochs, as well as Evol-Quality. To clarify, each complexity step works as follows:

  • Run the complexity pipe to generate new instructions from the previous ones. Deita: For each instruction sample $I^{(0)}_k$, we use the In-Depth Evolving Prompt [...]. After $M$ iterations, we obtain a set of instructions across different complexities for $I_k$, $\{I^{(0)}_k, \ldots, I^{(M)}_k\}$.
  • Execute the elimination pipe to filter the new instructions. WizardLM: The evolved instruction does not provide any information gain. Automatically evaluated with ChatGPT.
  • Inside the current epoch, run a loop to generate the new responses for each successful new instruction. The generated samples will be saved for the final dataset. Deita: After $M$ iterations, for the same instruction $I^{(0)}_k$, we procure a set of responses spanning various qualities for $R_k$, denoted as $\{R^{(0)}_k, \ldots, R^{(M)}_k\}$.
  • The input for the next complexity step will be the successful instructions with their associated initial responses.
# Helper functions to generate the evol-instruct dataset
def prepare_for_equal_prompts(example):
    """If the evolved instruction is None, pair the original instruction with itself (to make sure it gets removed)."""
    if example["instructions"][0] is None:
        return {"input": [example["input"], example["input"]]}
    else:
        return {"input": [example["input"], example["instructions"][0]]}


def prepare_for_evol_quality(example):
    """Prepare the (instruction, response) pair expected by the quality pipe."""
    return {"input": example["instructions"][0], "generation": example["response"]}


def make_evol_instruct_dataset(
    complexity_llm: LLM,
    elimination_llm: LLM,
    quality_llm: LLM,
    dataset: Dataset,
    instruction_steps: int = 4,
    responses_steps: int = 4,
) -> Dataset:

    # Set the pipelines
    complexity_pipe = Pipeline(generator=complexity_llm)
    elimination_pipe = Pipeline(generator=elimination_llm)
    quality_pipe = Pipeline(generator=quality_llm)

    # Set the initial dataset
    input_complexity = dataset
    successful_samples = []

    # Start the evolution process
    for step in range(1, instruction_steps + 1):
        print(f"Evolving instruction step: {step}/{instruction_steps}")

        # Run the complexity pipe to generate new instructions
        instruction_dataset = complexity_pipe.generate(input_complexity, batch_size=8)

        # Run the elimination pipe to determine if the instructions are equal
        prepared_dataset = (
            instruction_dataset
            .map(prepare_for_equal_prompts)
            .select_columns(["input"])
        )
        elimination_dataset = elimination_pipe.generate(prepared_dataset, batch_size=8)

        # Save the successful instructions to be used for quality evol and prepare the inputs for the next iteration
        new_instructions = []
        responses = []
        successful_instructions = []

        for row_evolved, row_elimination in zip(instruction_dataset, elimination_dataset):
            if (row_evolved['instructions'][0] is not None) and (row_elimination['generations'][0] != "equal"):
                new_instructions.append(row_evolved['instructions'][0])
                responses.append(row_evolved['response'])
                successful_instructions.append(row_evolved)
            else:
                new_instructions.append(row_evolved['input'])
                responses.append(row_evolved['response'])

        input_complexity = Dataset.from_dict({"input": new_instructions, "response": responses})

        # Run the quality pipe to generate new responses
        complexity_dataset = pd.DataFrame(successful_instructions)
        input_quality = Dataset.from_pandas(complexity_dataset).map(prepare_for_evol_quality).select_columns(["input", "generation"])

        for q_step in range(1, responses_steps + 1):
            print(f"Evolving response step: {q_step}/{responses_steps}")

            # Generate new responses
            response_dataset = quality_pipe.generate(input_quality, batch_size=8)

            # Save the successful responses in the pool and prepare the inputs for the next iteration
            inputs = []
            new_responses = []

            for row in response_dataset:
                inputs.append(row['input'])
                new_responses.append(row['generations'][0])
                successful_samples.append(row)

            input_quality = Dataset.from_dict({"input": inputs, "generation": new_responses})

    # Prepare the final dataset
    df_final_dataset = pd.DataFrame(successful_samples)
    final_dataset = Dataset.from_pandas(df_final_dataset)
    final_dataset.__class__ = CustomDataset
    final_dataset.task = TextGenerationTask()  # or EvolQualityTask()

    return final_dataset

So, let's make our first evol-instruct dataset! 🧙

ds_evol_instruct = make_evol_instruct_dataset(
    complexity_llm=complexity_llm,
    elimination_llm=elimination_llm,
    quality_llm=quality_llm,
    dataset=initial_dataset,
    instruction_steps=5,
    responses_steps=5,
)
ds_evol_instruct
ds_evol_instruct[0]
{'input': 'Provide a selection of three specific ingredients for a decadent dark chocolate raspberry cake.',
 'generation': '- Flour\n- Cocoa powder\n- Sugar',
 'generation_model': ['gpt-4-turbo-preview'],
 'generation_prompt': [[{'content': '', 'role': 'system'},
   {'content': "I want you to act as a Response Rewriter\nYour goal is to enhance the quality of the response given by an AI assistant\nto the #Given Prompt# through rewriting.\nBut the rewritten response must be reasonable and must be understood by humans.\nYour rewriting cannot omit the non-text parts such as the table and code in\n#Given Prompt# and #Given Response#. Also, please do not omit the input\nin #Given Prompt#.\nYou Should enhance the quality of the response using the following method:\nPlease make the Response more in-depth.\nYou should try your best not to make the #Rewritten Response# become verbose,\n#Rewritten Response# can only add 10 to 20 words into #Given Response#.\n'#Given Response#', '#Rewritten Response#', 'given response' and 'rewritten response'\nare not allowed to appear in #Rewritten Response#\n#Given Prompt#:\nProvide a selection of three specific ingredients for a decadent dark chocolate raspberry cake.\n#Given Response#:\n- Flour\n- Cocoa powder\n- Sugar\n#Rewritten Response#:",
    'role': 'user'}]],
 'raw_generation_responses': ['- High-quality all-purpose flour\n- Unsweetened dark cocoa powder\n- Granulated white sugar'],
 'generations': ['- High-quality all-purpose flour\n- Unsweetened dark cocoa powder\n- Granulated white sugar']}

Optionally, we can push the dataset to the Hugging Face Hub to share it with the community, thanks to the push_to_hub method.

# Push to Hugging Face
HF_REPO_ID = "argilla/distilabel-evol-instruct-dataset"
ds_evol_instruct.push_to_hub(
    HF_REPO_ID,
    split="train",
    private=False,
    token=os.getenv("HF_TOKEN", None),
)
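
Once pushed, anyone can load the dataset back with load_dataset:

# Load the published dataset from the Hub
from datasets import load_dataset

evol_dataset = load_dataset("argilla/distilabel-evol-instruct-dataset", split="train")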

Human Feedback with Argilla

You can use the AI feedback created by distilabel directly, but we have seen that enhancing it with human feedback will improve the quality of your LLM. So, we provide a to_argilla method which creates a dataset for Argilla, along with out-of-the-box tailored metadata filters and semantic search, to allow you to provide human feedback as quickly and engagingly as possible. You can check the Argilla docs to get it up and running.

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API key:

import argilla as rg

# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="argilla.apikey",
    workspace="argilla"
)

You can now push the dataset to Argilla as follows and further curate the evolved instructions:

# Convert the dataset to Argilla format adding questions and metadata
rg_dataset = ds_evol_instruct.to_argilla(vector_strategy=False, metric_strategy=False)

# Push the dataset to Argilla
remote_rg_dataset = rg_dataset.push_to_argilla(name="distilabel-evol-instructions", workspace="argilla")


Conclusions

In this tutorial, we adapted the methods from WizardLM and Deita to develop an evolved-instruction dataset. Using distilabel, we generated and evaluated new instructions, creating a dataset of successful instructions after applying Evol-Complexity to make them more complex and Evol-Quality to improve the quality of their responses. Optionally, we employed Argilla to verify their quality using human feedback.

We hope you found this tutorial helpful! 👍

Explore different ways to create new datasets by checking out these tutorials!