🤗 Use Notus on inference endpoints to create a legal preference dataset¶
In this tutorial, you will learn how to use the Notus model on Inference Endpoints to create a legal preference dataset based on RAG instructions from the European AI Act. A full end-to-end example of how to use distilabel to leverage LLMs!
distilabel is an AI Feedback (AIF) framework that can generate and label datasets using LLMs, and can be used for many different use cases. Implemented with robustness, efficiency and scalability in mind, it allows anyone to build synthetic datasets that can be used in many different scenarios. This tutorial shows an end-to-end example in which we will create a model that is an expert in the new AI Act, to which we can pose different types of questions and requests.
The LLM that we will fine-tune for this is Notus 7B, a completely open-source, fine-tuned version of Zephyr 7B that uses Direct Preference Optimization (DPO) and AIF techniques to outperform its base model on several benchmarks.
This tutorial includes the following steps:
- Defining a custom generating task for a distilabel pipeline.
- Creating a RAG pipeline using Haystack for the EU AI Act.
- Generating an instruction dataset with SelfInstructTask.
- Generating a preference dataset using an UltraFeedback text quality task.
You can use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
Introduction¶
Let's start by installing the required dependencies to run distilabel, Argilla and the rest of the packages used in the tutorial; most notably, Haystack.
Running Argilla¶
For this tutorial, you can use Argilla to visualize and annotate the different datasets created by distilabel. There are two main options for deploying and running Argilla:
Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:
For details about configuring your deployment, check the official Hugging Face Hub guide.
Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.
For more information on deployment options, please check the Deployment section of the documentation.
Import dependencies¶
The main dependencies for this tutorial are distilabel for creating the synthetic datasets and Argilla for visualizing and annotating these datasets, and also for fine-tuning our model. The package Haystack is used to create batches from the original PDF document we want to build our datasets from.
import os
from typing import Dict
import argilla as rg
from distilabel.llm import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline, pipeline
from distilabel.tasks import TextGenerationTask, SelfInstructTask, Prompt
from datasets import Dataset
from haystack.nodes import PDFToTextConverter, PreProcessor
If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the URL and API_KEY:
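A minimal initialization sketch is shown below; the URL and API key are placeholders that you should replace with the values of your own deployment (the Space URL, or http://localhost:6900 for the quickstart image).

# Placeholders: replace with the URL and API key of your Argilla instance.
rg.init(
    api_url="http://localhost:6900",
    api_key="<your-api-key>",
)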
Additionally, we need to provide our Hugging Face and OpenAI access tokens. To later instantiate an InferenceEndpointsLLM object, we also need to pass the HF Inference Endpoint name and the HF namespace as parameters. One very convenient way to do so is through environment variables.
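For example (the values below are illustrative placeholders; the endpoint name in particular depends on how you named your deployment):

# Illustrative placeholders: set these to your own values.
os.environ["HF_TOKEN"] = "hf_..."                      # Hugging Face access token
os.environ["HF_INFERENCE_ENDPOINT_NAME"] = "notus-7b"  # name of your Inference Endpoint (assumed)
os.environ["HF_NAMESPACE"] = "your-username-or-org"    # HF user or organization owning the endpoint
os.environ["OPENAI_API_KEY"] = "sk-..."                # OpenAI token, used later for UltraFeedback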
Setting up an inference endpoint with Notus¶
Inference endpoints are a solution, managed by Hugging Face, to easily deploy any Transformer-like model. They are built from models on the Hugging Face Hub. Inference endpoints are really handy for running inference on LLMs without the hassle of trying to run the models locally. In this tutorial, we will use inference endpoints to generate text using our Notus model, as part of the distilabel workflow. The endpoint of choice has a Notus 7B instance running.
Defining a custom generating task for a distilabel pipeline¶
To kickstart this tutorial, let's see how to set up an endpoint for our Notus model. It's not part of the end-to-end example we'll see later, but rather an example of how to connect to a Hugging Face endpoint and a quick test of the distilabel pipeline.
Let's dive into this quick example of how to use an inference endpoint. We have prepared a simple TextGenerationTask to ask questions to the model, in much the same way as we talk to LLMs through chatbots. First, we define a class for the question-answering task, with methods showing distilabel how the model should generate the prompts, parse the input and the output, etc.
class QuestionAnsweringTask(TextGenerationTask):
    def generate_prompt(self, question: str) -> str:
        # Build the prompt from the task's system prompt and the user question,
        # formatted for the expected chat template.
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=question,
        ).format_as(
            "llama2"
        )  # type: ignore

    def parse_output(self, output: str) -> Dict[str, str]:
        # Return the raw generation as the "answer" field.
        return {"answer": output.strip()}

    @property
    def input_args_names(self) -> list[str]:
        return ["question"]

    @property
    def output_args_names(self) -> list[str]:
        return ["answer"]
llm is an object of the InferenceEndpointsLLM class; by using it we can start generating answers to our questions with the llm.generate() method. With the object defined using the endpoint information and the task, we can go ahead and start generating text. Let's ask this LLM, for example, what the second most populated city in Denmark is. The answer should be Aarhus.
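The snippet below is a minimal sketch of this step: it instantiates the LLM against the endpoint configured in the environment variables above and asks the test question. The exact structure of the returned outputs may vary slightly across distilabel versions.

llm = InferenceEndpointsLLM(
    endpoint_name_or_model_id=os.getenv("HF_INFERENCE_ENDPOINT_NAME"),  # type: ignore
    endpoint_namespace=os.getenv("HF_NAMESPACE"),  # type: ignore
    token=os.getenv("HF_TOKEN") or None,
    task=QuestionAnsweringTask(),
)

# Generate an answer for a single question; the result is a list (one entry per input)
# of lists (one entry per generation) of outputs with a "parsed_output" field.
result = llm.generate([{"question": "What is the second most populated city in Denmark?"}])
print(result[0][0]["parsed_output"]["answer"])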
The endpoint is working correctly! We have successfully set up a custom generating task for a distilabel pipeline.
Creating a RAG pipeline using Haystack for the European AI Act¶
For this end-to-end example, we would like to create an expert model capable of answering questions and providing information about the new AI Act promoted by the European Union, the first comprehensive regulation on artificial intelligence. As part of its digital strategy, the EU wants to regulate AI to ensure better conditions for the development and use of this innovative technology. This act is a regulatory framework for AI, with different risk levels implying more or less regulation; these are the world's first rules of their kind on AI.
The RAG pipeline that we want to create downloads the PDF file, converts it to plain text and preprocesses it, creating batches that we can feed to distilabel to start creating instructions from them. Let's see this first part of the pipeline and get the input data. Note that this RAG part of the pipeline is not an active retrieval pipeline driven by queries or semantic similarity, but a more brute-force approach in which we download the PDF and preprocess its contents.
Downloading the AI Act PDF¶
Firstly, we need to download the PDF document itself. We'll place it in our working directory, if it's not there already.
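A simple download sketch is shown below; the URL is an assumption pointing at a publicly hosted copy of the Act, so adjust it if you keep the PDF elsewhere.

import urllib.request

pdf_path = "The-AI-Act.pdf"
if not os.path.exists(pdf_path):
    # Assumed public copy of the AI Act proposal; replace with your own source if needed.
    url = "https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf"
    urllib.request.urlretrieve(url, pdf_path)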
Once we have it in our working directory, we can use Haystack's Converter and Pipeline features to extract the textual data, clean it and divide it into batches. Afterwards, these batches will be used to start creating synthetic instructions.
# The converter turns the PDF into text we can process easily
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
# Preprocessing pipelines can have several steps.
# Ours cleans empty lines, headers, footers and whitespace,
# and splits the text into batches of up to 150 words,
# respecting where sentences naturally end and begin.
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=150,
    split_respect_sentence_boundary=True,
)
doc = converter.convert(file_path="The-AI-Act.pdf", meta=None)[0]
docs = preprocessor.process([doc])
print(f"Documents: 1\nBatches: {len(docs)}")
Let's take a quick look at the batches we just generated.
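For instance, printing the first one (Haystack exposes the text through the content attribute of each document):

# Inspect the first preprocessed batch.
print(docs[0].content)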
The document has been correctly batched, going from one big document to 355 strings of at most 150 words each. This list of strings can now be used as input to generate an instruction dataset using distilabel.
Generating instructions with SelfInstructTask¶
With our Inference Endpoint up and running, we should be able to generate instructions with distilabel. These instructions, made by the LLM through our endpoint, will form an instruction dataset, with instructions created from the data we just extracted.
For this example, we are using a subset of 50 batches generated in the section above, to be gentle on performance.
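A minimal way to do this is to keep the raw text of the first 50 batches:

# Keep only the text of the first 50 batches to keep the generation step light.
inputs = [doc.content for doc in docs[:50]]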
With the SelfInstructTask class we can generate a Self-Instruct specification for building the prompts, as done in the Self-Instruct paper. distilabel will start from human-made input, in this case the batches we created from the AI Act PDF, and it will generate instructions based on it. These instructions can then be reviewed using Argilla to keep the best ones.
An application description can be passed as a parameter to specify the behaviour of the model; we want a model capable of answering our questions about the AI Act.
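A sketch of the task definition is shown below; the exact wording of the application description is up to you, and num_instructions=4 is an assumption chosen to match the four instructions per input we will see later.

application_description = (
    "An assistant that answers questions and provides information about "
    "the European Union's Artificial Intelligence Act (AI Act)."
)

# Self-Instruct task that will turn each text batch into several instructions.
instructions_task = SelfInstructTask(
    application_description=application_description,
    num_instructions=4,  # assumed; adjust to generate more or fewer instructions per input
)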
Let's now define a generator, passing the SelfInstructTask object, and create a Pipeline object.
# Generator LLM: the Notus endpoint, driven by the Self-Instruct task defined above.
instructions_generator = InferenceEndpointsLLM(
    endpoint_name_or_model_id=os.getenv("HF_INFERENCE_ENDPOINT_NAME"),  # type: ignore
    endpoint_namespace=os.getenv("HF_NAMESPACE"),  # type: ignore
    token=os.getenv("HF_TOKEN") or None,
    task=instructions_task,
)

instructions_pipeline = Pipeline(generator=instructions_generator)
Our pipeline is ready to be used to generate instructions. Let's do it!
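A sketch of the generation step, assuming the task expects its input under an "input" column; the generation parameters are illustrative.

# Wrap the 50 selected batches in a datasets.Dataset and run the pipeline.
instructions_dataset = instructions_pipeline.generate(
    dataset=Dataset.from_dict({"input": inputs}),
    num_generations=1,
    batch_size=8,
)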
The pipeline has successfully generated instructions given the topics and the behaviour passed as input. Let's gather all those instructions and see how they look.
These initial instructions form our instruction dataset. Following the human-in-the-loop approach, we should push the instructions to Argilla to visualize them and rank them in terms of quality. Those annotations are essential for quality data, ensuring better performance of the final model. Nevertheless, this step is optional.
Pushing the instruction dataset to Argilla to visualize and annotate.¶
Let's take a quick look at the instructions generated by SelfInstructTask.
For each input, i.e., each batch of the AI Act PDF file, we have a generator prompt with general guidelines on how to behave, as well as the application description parameter. Four instructions have been generated per input.
Now it's the perfect time to upload the instruction dataset to Argilla, review it and manually annotate it.
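As with any distilabel dataset, we can convert it to an Argilla FeedbackDataset and push it; the dataset name below is just an example.

# Convert the generated dataset to an Argilla FeedbackDataset and upload it.
instructions_rg_dataset = instructions_dataset.to_argilla()
instructions_rg_dataset.push_to_argilla(name="notus_AI_instructions")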
In the Argilla UI, each tuple input-instruction is visualized individually, and can be individually annotated.
Generate a Preference Dataset using an UltraFeedback text quality task.¶
Once we have our instruction dataset, we are going to create a preference dataset through the UltraFeedback text quality task. This type of task is used in NLP to evaluate the quality of generated text; our goal is to provide detailed feedback on the quality of the generated text, beyond a binary label.
The pipeline() method allows us to create a Pipeline instance with the provided LLMs for a given task, which is useful whenever you want to use a pre-defined or custom Pipeline for a given task. We will specify our task and subtask, the generator we want to use (in this case, one based on a TextGenerationTask) and our OpenAI API key.
# Pre-defined "preference" pipeline: the Notus endpoint generates the responses and
# an OpenAI model acts as the UltraFeedback labeller (hence the OpenAI API key).
preference_pipeline = pipeline(
    "preference",
    "instruction-following",
    generator=InferenceEndpointsLLM(
        endpoint_name_or_model_id=os.getenv("HF_INFERENCE_ENDPOINT_NAME"),  # type: ignore
        endpoint_namespace=os.getenv("HF_NAMESPACE", None),
        task=TextGenerationTask(),
        max_new_tokens=256,
        num_threads=2,
        temperature=0.3,
    ),
    max_new_tokens=256,
    num_threads=2,
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    temperature=0.0,
)
We also need to retrieve our instruction dataset from Argilla, as it will be the input of this pipeline.
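For example (the dataset name and workspace are assumptions; use the values from your own push):

# Pull the (optionally annotated) instruction dataset back from Argilla.
remote_instructions_dataset = rg.FeedbackDataset.from_argilla(
    name="notus_AI_instructions", workspace="admin"
)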
Before generating text based on our instructions, we need to reshape the dataset a little. From the previous section, we still have our old input, the batches from the PDF; we have to replace it with the instructions that we generated.
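A minimal reshaping sketch, working directly on the generated distilabel dataset (if you annotated in Argilla, you could instead keep only the approved records); the "instructions" and "input" column names are assumptions about the dataset schema.

# Flatten the generated instructions into a single list and expose them as the new
# "input" column, which is what the preference pipeline's text-generation task expects.
instructions = []
for row in instructions_dataset:
    for generation in row["instructions"]:
        # Each generation may itself be a list of instructions, depending on the task.
        instructions.extend(generation if isinstance(generation, list) else [generation])

preference_inputs = Dataset.from_dict({"input": instructions})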
Now, let's build a dataset by using the pipeline we just created, and the topics from which our instructions were generated.
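A sketch of the generation call, producing two responses per instruction so that UltraFeedback has generations to compare; the parameters are illustrative.

# Generate two completions per instruction and let the labeller rate them.
preference_dataset = preference_pipeline.generate(
    preference_inputs,
    num_generations=2,
    batch_size=8,
)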
Let's take a look at an instance of the preference dataset.
Upload the preference dataset to Argilla to annotate.¶
Once our preference dataset has been correctly generated, the Argilla UI is the best tool at our disposal to visualize it and annotate it. As for the instruction dataset, we just have to convert it to an Argilla Feedback Dataset, and push it to Argilla.
# Uploading the preference dataset
preference_rg_dataset = preference_dataset.to_argilla()

# Adding the context as a metadata property in the new Feedback dataset, as this
# information will be useful later.
for record_feedback, record_huggingface in zip(
    preference_rg_dataset, preference_dataset
):
    record_feedback.metadata["context"] = record_huggingface["context"]

preference_rg_dataset.push_to_argilla(name="notus_AI_preference")
In the Argilla UI, we can see the input (an instruction), and the two generations that the LLM created out of it.
Conclusions¶
To conclude, we have gone through an end-to-end example of distilabel. We set up an Inference Endpoint, defined a distilabel pipeline that extracts information from a PDF, and created and manually reviewed the instruction and preference datasets built from that input. The final preference dataset is perfect for fine-tuning, which you can easily do using the ArgillaTrainer from Argilla. Have a look at these resources if you want to go further: