Argilla¶
Besides restoring the dataset from the Pipeline output as a Distiset (which is a datasets.DatasetDict with multiple configurations depending on the leaf nodes of the Pipeline), you can also include a Step within the Pipeline to export the datasets to Argilla with a pre-defined configuration suited to annotation.
Exporting the generated synthetic datasets to Argilla was one of the core features we wanted to integrate within distilabel, because we believe in the potential of synthetic data, but not at the expense of the impact that a human annotator or group of annotators can bring. The Argilla integration makes it straightforward to push a dataset to Argilla while the Pipeline is running, so that you can follow the generation process in Argilla's UI and annotate the records on the fly.
Before using any of the steps described below, you need an Argilla instance up and running, so that the data can be uploaded to it. The easiest way to deploy Argilla is via the Argilla Template in Hugging Face Spaces, following the steps there.
Additionally, Argilla offers multiple deployment options, listed in the Argilla Documentation - Installation page.
Text Generation¶
For text generation scenarios, i.e. when the Pipeline contains a TextGeneration step, we have designed the step TextGenerationToArgilla, which pushes the generated data to Argilla and lets the annotator review the records.
The dataset will be pushed with the following configuration:
- Fields: instruction and generation, both of type argilla.TextField, plus the automatically generated id for the given instruction, which makes it possible to search for other records with the same instruction in the dataset. The field instruction must always be a string, while the field generation can be either a single string or a list of strings (useful when there are multiple parent nodes of type TextGeneration). Even so, each record will always contain at most one instruction-generation pair.
- Questions: quality is the only question for the annotators to answer, and it is an argilla.LabelQuestion referring to the quality of the provided generation for the given instruction. It can be annotated as either 👎 (bad) or 👍 (good).
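The field rules above can be sketched in plain Python. This is only an illustration, not distilabel's implementation: the helper name to_records is hypothetical, and deriving the id from a SHA-256 hash of the instruction is an assumption (the text only states that the id is generated automatically for the given instruction).

```python
import hashlib
from typing import Dict, List, Union


def to_records(
    instruction: str, generation: Union[str, List[str]]
) -> List[Dict[str, str]]:
    """Hypothetical helper: expand one row into instruction-generation records."""
    # Assumption: the id is a deterministic hash of the instruction, so records
    # sharing an instruction also share an id and can be searched together.
    record_id = hashlib.sha256(instruction.encode("utf-8")).hexdigest()
    # A single string is treated as a one-element list.
    generations = [generation] if isinstance(generation, str) else list(generation)
    # Each record holds exactly one instruction-generation pair.
    return [
        {"id": record_id, "instruction": instruction, "generation": text}
        for text in generations
    ]


records = to_records("Name a color.", ["Blue.", "Green."])
for record in records:
    print(record["id"][:8], record["generation"])
```

Both records carry the same id because they share the instruction, which is what makes searching by instruction possible.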
Note
The TextGenerationToArgilla step will only work as is if the Pipeline contains one or multiple TextGeneration steps, or if the columns instruction and generation are available within the batch data. Otherwise, the variable input_mappings will need to be set so that instruction, generation, or both are mapped to existing columns in the batch data.
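To make the mapping idea concrete, here is a minimal plain-Python sketch of what such a mapping expresses. It is not the library's code: apply_input_mappings is a hypothetical helper, the column names prompt and response are made up, and the sketch assumes the mapping keys are the names the step expects while the values are the columns that already exist in the batch.

```python
from typing import Any, Dict, List


def apply_input_mappings(
    batch: List[Dict[str, Any]], input_mappings: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Hypothetical helper: expose existing columns under the expected names."""
    mapped = []
    for row in batch:
        new_row = dict(row)  # copy, so the original batch is left untouched
        for expected, existing in input_mappings.items():
            new_row[expected] = row[existing]
        mapped.append(new_row)
    return mapped


# The batch has no instruction or generation columns, only prompt and response.
batch = [{"prompt": "Name a color.", "response": "Blue."}]
mapped = apply_input_mappings(
    batch, {"instruction": "prompt", "generation": "response"}
)
print(mapped[0]["instruction"], mapped[0]["generation"])
```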
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, TextGenerationToArgilla
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="my-pipeline") as pipeline:
    # Load the instructions to generate completions for.
    load_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "instruction": "Write a short story about a dragon that saves a princess from a tower.",
            },
        ],
    )

    # Generate one completion per instruction.
    text_generation = TextGeneration(
        name="text_generation",
        llm=OpenAILLM(model="gpt-4"),
    )

    # Push the instruction-generation pairs to Argilla for review.
    to_argilla = TextGenerationToArgilla(
        dataset_name="my-dataset",
        dataset_workspace="admin",
        api_url="<ARGILLA_API_URL>",
        api_key="<ARGILLA_API_KEY>",
    )

    load_dataset >> text_generation >> to_argilla

pipeline.run()
Preference¶
For preference scenarios, i.e. when the Pipeline contains multiple TextGeneration steps, we have designed the step PreferenceToArgilla, which pushes the generated data to Argilla and lets the annotator review the records.
The dataset will be pushed with the following configuration:
- Fields: instruction and generations, both of type argilla.TextField, plus the automatically generated id for the given instruction, which makes it possible to search for other records with the same instruction in the dataset. The field instruction must always be a string, while the field generations must be a list of strings containing the generated texts for the given instruction, so that there are at least two generations to compare. The number of generation fields within each record in Argilla is defined by the value of the variable num_generations provided to the PreferenceToArgilla step.
- Questions: rating and rationale are the pair of questions defined for each generation, i.e. for each index from 0 to num_generations - 1, and they are of types argilla.RatingQuestion and argilla.TextQuestion, respectively. Only the first pair of questions is mandatory, since only one generation is guaranteed to be within the batch data. The ratings range from 1 to 5, since Argilla only supports rating values above 0.
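The layout described above can be sketched as a plain data structure. This is a rough illustration under stated assumptions, not the dataset definition distilabel actually builds: preference_layout is a hypothetical helper, the names generations-0, generations-0-rating and generations-0-rationale are made up for the example, and the argilla types appear only as strings so that the sketch runs without any dependency.

```python
from typing import Any, Dict, List


def preference_layout(num_generations: int) -> Dict[str, List[Dict[str, Any]]]:
    """Hypothetical helper: describe the fields and questions per generation."""
    fields: List[Dict[str, Any]] = [
        {"name": "id", "type": "TextField"},
        {"name": "instruction", "type": "TextField"},
    ]
    questions: List[Dict[str, Any]] = []
    for idx in range(num_generations):
        # Only the first pair of questions is mandatory, since only one
        # generation is guaranteed to be present in the batch data.
        required = idx == 0
        fields.append({"name": f"generations-{idx}", "type": "TextField"})
        questions.append(
            {
                "name": f"generations-{idx}-rating",
                "type": "RatingQuestion",
                "values": [1, 2, 3, 4, 5],  # ratings must be above 0
                "required": required,
            }
        )
        questions.append(
            {
                "name": f"generations-{idx}-rationale",
                "type": "TextQuestion",
                "required": required,
            }
        )
    return {"fields": fields, "questions": questions}


layout = preference_layout(num_generations=4)
for question in layout["questions"]:
    print(question["name"], "required" if question["required"] else "optional")
```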
Note
The PreferenceToArgilla step will only work if the Pipeline contains multiple TextGeneration steps, or if the columns instruction and generations are available within the batch data. Otherwise, the variable input_mappings will need to be set so that instruction, generations, or both are mapped to existing columns in the batch data.
Note
Additionally, if the Pipeline contains an UltraFeedback step, the ratings and rationales will also be available. In that case, they are automatically injected as suggestions into the dataset, so that the annotator only needs to review them instead of filling them in from scratch.
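As a rough sketch of how ratings and rationales could be turned into per-generation suggestions, consider the plain-Python helper below. Everything in it is an assumption made for illustration: build_suggestions is hypothetical, the question names follow a made-up generations-N-rating and generations-N-rationale pattern, and skipping ratings outside the 1 to 5 range is one possible way to honor the constraint that Argilla only supports values above 0.

```python
from typing import Any, Dict, List, Optional


def build_suggestions(
    ratings: List[Optional[int]], rationales: List[Optional[str]]
) -> List[Dict[str, Any]]:
    """Hypothetical helper: pair each generation with its suggested answers."""
    suggestions: List[Dict[str, Any]] = []
    for idx, (rating, rationale) in enumerate(zip(ratings, rationales)):
        # Assumption: ratings outside the 1 to 5 range are dropped rather
        # than sent, because only values above 0 are supported.
        if rating is not None and 1 <= rating <= 5:
            suggestions.append(
                {"question_name": f"generations-{idx}-rating", "value": rating}
            )
        # Empty or missing rationales produce no suggestion.
        if rationale:
            suggestions.append(
                {"question_name": f"generations-{idx}-rationale", "value": rationale}
            )
    return suggestions


suggestions = build_suggestions([5, 0], ["Clear and complete.", ""])
for suggestion in suggestions:
    print(suggestion["question_name"], suggestion["value"])
```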
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, PreferenceToArgilla
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="my-pipeline") as pipeline:
    # Load the instructions to generate completions for.
    load_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "instruction": "Write a short story about a dragon that saves a princess from a tower.",
            },
        ],
    )

    # Generate four completions per instruction, grouped into a single row.
    text_generation = TextGeneration(
        name="text_generation",
        llm=OpenAILLM(model="gpt-4"),
        num_generations=4,
        group_generations=True,
    )

    # Push the instruction and its generations to Argilla for rating.
    to_argilla = PreferenceToArgilla(
        dataset_name="my-dataset",
        dataset_workspace="admin",
        api_url="<ARGILLA_API_URL>",
        api_key="<ARGILLA_API_KEY>",
        num_generations=4,
    )

    load_dataset >> text_generation >> to_argilla

pipeline.run()
Note
If you also want to add the suggestions, see "UltraFeedback: Boosting Language Models with High-quality Feedback", where the UltraFeedback task is used to generate both ratings and rationales for each of the generations of a given instruction.