Quickstart¶
Distilabel provides all the tools you need to build scalable and reliable pipelines for synthetic data generation and AI feedback. Pipelines are used to generate data, evaluate models, manipulate data, or any other general task. They are made up of different components: Steps, Tasks and LLMs, which are chained together in a directed acyclic graph (DAG).
- Steps: These are the building blocks of your pipeline. Normal steps are used for basic operations like loading data, applying transformations, or any other general task.
- Tasks: These are steps that rely on LLMs and prompts to perform generative tasks. For example, they can be used to generate data, evaluate models or manipulate data.
- LLMs: These are the models that will perform the task. They can be local or remote models, and open-source or commercial models.
Pipelines are designed to be scalable and reliable. They can be executed in a distributed manner, and they can be cached and recovered. This is useful when dealing with large datasets or when you want to ensure that your pipeline is reproducible.
Besides that, pipelines are designed to be modular and flexible. You can easily add new steps, tasks, or LLMs to your pipeline, and you can also easily modify or remove them. A typical example is a pipeline to generate a dataset of preferences, where a data-loading step feeds several text generation tasks whose outputs are then scored by a judging task.
Installation¶
To install the latest release of the package from PyPI with the `hf-inference-endpoints` extra, you can use the following command:
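```bash
pip install "distilabel[hf-inference-endpoints]" --upgrade
```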
Use a generic pipeline¶
To use a generic pipeline for an ML task, you can use the `InstructionResponsePipeline` class. This is a generic pipeline that can be used to generate data for supervised fine-tuning tasks. It uses the `InferenceEndpointsLLM` class to generate data based on the input data and the model.
```python
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline()
dataset = pipeline.run()
```
The `InstructionResponsePipeline` class will use the `InferenceEndpointsLLM` class with the model `meta-llama/Meta-Llama-3.1-8B-Instruct` to generate data based on the system prompt. The output data will be a dataset with the columns `instruction` and `response`. The class uses a generic system prompt, but you can customize it by passing the `system_prompt` parameter to the class, as shown below.
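For instance, to steer the generated pairs toward a specific domain, you can pass your own system prompt. A minimal sketch (the prompt text below is just an illustrative placeholder):

```python
from distilabel.pipeline import InstructionResponsePipeline

# Hypothetical domain-specific system prompt; any string will do.
pipeline = InstructionResponsePipeline(
    system_prompt="You generate instructions and responses about Python programming.",
)
dataset = pipeline.run()
```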
Note
We're actively building more pipelines for different tasks, currently including classification, Direct Preference Optimization, and Information Retrieval. If you have any suggestions or requests, please let us know!
Define a Custom pipeline¶
In this guide we will walk you through the process of creating a simple pipeline that uses the `InferenceEndpointsLLM` class to generate text. The `Pipeline` will load a dataset that contains a column named `prompt` from the Hugging Face Hub via the step `LoadDataFromHub` and then use the `InferenceEndpointsLLM` class to generate text based on the dataset using the `TextGeneration` task.
You can check the available models in the Hugging Face Model Hub and filter by `Inference status`.
```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(  # (1)
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:  # (2)
    load_dataset = LoadDataFromHub(  # (3)
        output_mappings={"prompt": "instruction"},
    )

    text_generation = TextGeneration(  # (4)
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),  # (5)
        system_prompt="You are a creative AI Assistant writer.",
        template="Follow the following instruction: {{ instruction }}",  # (6)
    )

    load_dataset >> text_generation  # (7)

if __name__ == "__main__":
    distiset = pipeline.run(  # (8)
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # (9)
```
1. We define a `Pipeline` with the name `simple-text-generation-pipeline` and the description `A simple text generation pipeline`. Note that the `name` is mandatory and will be used to calculate the cache signature path, so changing the name will change the cache path and the pipeline will be identified as a different one.
2. We use the `Pipeline` context manager, meaning that every `Step` subclass defined within the context manager will be added to the pipeline automatically.
3. We define a `LoadDataFromHub` step named `load_dataset` that will load a dataset from the Hugging Face Hub, as provided via runtime parameters in the `pipeline.run` method below, but it can also be defined within the class instance via the arg `repo_id=...`. This step will produce output batches with the rows from the dataset, and the column `prompt` will be mapped to the `instruction` field.
4. We define a `TextGeneration` task named `text_generation` that will generate text based on the `instruction` field from the dataset. This task will use the `InferenceEndpointsLLM` class with the model `Meta-Llama-3.1-8B-Instruct`.
5. We define the `InferenceEndpointsLLM` class with the model `Meta-Llama-3.1-8B-Instruct` that will be used by the `TextGeneration` task. In this case, since `InferenceEndpointsLLM` is used, we assume that the `HF_TOKEN` environment variable is set.
6. Both `system_prompt` and `template` are optional fields. The `template` must be provided as a string following the Jinja2 template format, and the fields that appear in it ("instruction" in this case, which corresponds to the default) must be declared in the `columns` attribute. The component gallery for `TextGeneration` has examples to get you started.
7. We connect the `load_dataset` step to the `text_generation` task using the `>>` (rshift) operator, meaning that the output from the `load_dataset` step will be used as input for the `text_generation` task.
8. We run the pipeline with the parameters for the `load_dataset` and `text_generation` steps. The `load_dataset` step will use the repository `distilabel-internal-testing/instruction-dataset-mini` and the `test` split, and the `text_generation` task will use the `generation_kwargs` with the `temperature` set to `0.7` and the `max_new_tokens` set to `512`.
9. Optionally, we can push the generated `Distiset` to the Hugging Face Hub repository `distilabel-example`. This will allow you to share the generated dataset with others and use it in other pipelines, as sketched after this list.
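Once pushed, the data can be loaded back with the `datasets` library like any other Hub dataset. A minimal sketch, assuming the repository ends up under your own namespace and that the pipeline's single leaf step is exposed under the `default` configuration name (both may differ in your setup):

```python
from datasets import load_dataset

# Hypothetical namespace; replace with the account the Distiset was pushed to.
dataset = load_dataset("<your-username>/distilabel-example", "default", split="train")
```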