Quickstart
Distilabel provides all the tools you need to build scalable and reliable pipelines for synthetic data generation and AI feedback. Pipelines are used to generate data, evaluate models, manipulate data, or perform any other general task. They are made up of different components: Steps, Tasks and LLMs, which are chained together in a directed acyclic graph (DAG).
- Steps: These are the building blocks of your pipeline. Normal steps are used for basic executions like loading data, applying some transformations, or any other general task.
- Tasks: These are steps that rely on LLMs and prompts to perform generative tasks. For example, they can be used to generate data, evaluate models or manipulate data.
- LLMs: These are the models that will perform the task. They can be local or remote models, and open-source or commercial models.
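To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use distilabel and is not how distilabel executes pipelines; the step functions, their names, and the sample row are all hypothetical. It only illustrates that steps declare their predecessors and run in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical steps: each takes a list of rows (dicts) and returns a new list.
def load_data(_rows):
    return [{"prompt": "Write a haiku about rivers."}]


def rename_column(rows):
    return [{"instruction": row["prompt"]} for row in rows]


def fake_generate(rows):
    # Stand-in for an LLM-backed task; no model is called here.
    return [{**row, "generation": f"(response to: {row['instruction']})"} for row in rows]


steps = {
    "load_data": load_data,
    "rename_column": rename_column,
    "fake_generate": fake_generate,
}

# Each step maps to the set of steps it depends on (its predecessors in the DAG).
dependencies = {
    "load_data": set(),
    "rename_column": {"load_data"},
    "fake_generate": {"rename_column"},
}


def run_dag(steps, dependencies):
    """Run the steps in topological order, feeding each output to the next step.

    Passing a single `rows` value along only works for a linear chain like this
    one; a branching DAG would need per-edge routing of batches.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    rows = []
    for name in order:
        rows = steps[name](rows)
    return order, rows


order, rows = run_dag(steps, dependencies)
print(order)
print(rows)
```

Because the graph is acyclic, a valid execution order always exists; adding a cycle to `dependencies` would make `TopologicalSorter` raise `graphlib.CycleError`.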
Pipelines are designed to be scalable and reliable. They can be executed in a distributed manner, and they can be cached and recovered. This is useful when dealing with large datasets or when you want to ensure that your pipeline is reproducible.
Besides that, pipelines are designed to be modular and flexible. You can easily add new steps, tasks, or LLMs to your pipeline, and you can also easily modify or remove them. An example architecture of a pipeline to generate a dataset of preferences is the following:
Installation
To install the latest release of the package from PyPI with the hf-inference-endpoints extra, you can use the following command:

pip install "distilabel[hf-inference-endpoints]" --upgrade
Define a pipeline
In this guide we will walk you through the process of creating a simple pipeline that uses the InferenceEndpointsLLM class to generate text. The pipeline will load a dataset that contains a column named prompt from the Hugging Face Hub via the LoadDataFromHub step, and then use the InferenceEndpointsLLM class to generate text based on the dataset using the TextGeneration task.
You can check the available models in the Hugging Face Model Hub and filter by Inference status.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(  # (1)
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:  # (2)
    load_dataset = LoadDataFromHub(  # (3)
        output_mappings={"prompt": "instruction"},
    )

    text_generation = TextGeneration(  # (4)
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),  # (5)
        system_prompt="You are a creative AI Assistant writer.",
        template="Follow the following instruction: {{ instruction }}"  # (6)
    )

    load_dataset >> text_generation  # (7)

if __name__ == "__main__":
    distiset = pipeline.run(  # (8)
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # (9)
1. We define a Pipeline with the name simple-text-generation-pipeline and the description "A simple text generation pipeline". Note that the name is mandatory and is used to calculate the cache signature path, so changing the name changes the cache path and the pipeline will be identified as a different one.
2. We use the Pipeline context manager, meaning that every Step subclass defined within the context manager is added to the pipeline automatically.
3. We define a LoadDataFromHub step named load_dataset that will load a dataset from the Hugging Face Hub. The dataset is provided via runtime parameters in the pipeline.run method below, but it can also be defined within the class instance via the arg repo_id=.... This step will produce output batches with the rows from the dataset, and the column prompt will be mapped to the instruction field.
4. We define a TextGeneration task named text_generation that will generate text based on the instruction field from the dataset. This task will use the InferenceEndpointsLLM class with the model Meta-Llama-3.1-8B-Instruct.
5. We define the InferenceEndpointsLLM class with the model Meta-Llama-3.1-8B-Instruct that will be used by the TextGeneration task. In this case, since InferenceEndpointsLLM is used, we assume that the HF_TOKEN environment variable is set.
6. Both system_prompt and template are optional fields. The template must be provided as a string following the Jinja2 template format, and the fields that appear in it ("instruction" in this case, which corresponds to the default) must be listed in the columns attribute. The component gallery for TextGeneration has examples to get you started.
7. We connect the load_dataset step to the text_generation task using the rshift operator (>>), meaning that the output from the load_dataset step will be used as input for the text_generation task.
8. We run the pipeline with the parameters for the load_dataset and text_generation steps. The load_dataset step will use the repository distilabel-internal-testing/instruction-dataset-mini and the test split, and the text_generation task will use the generation_kwargs with the temperature set to 0.7 and the max_new_tokens set to 512.
9. Optionally, we can push the generated Distiset to the Hugging Face Hub repository distilabel-example. This will allow you to share the generated dataset with others and use it in other pipelines.
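The context-manager registration and the >> connection described in the notes above can be pictured with a toy model. This is only a sketch of the general Python mechanisms (`__enter__`/`__exit__` and `__rshift__`), not distilabel's implementation; `ToyPipeline` and `ToyStep` are hypothetical names invented for the illustration.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not distilabel's Pipeline class."""

    _active = None  # the pipeline currently open in a `with` block, if any

    def __init__(self, name):
        self.name = name
        self.steps = {}   # step name -> step object, in creation order
        self.edges = []   # (upstream name, downstream name) pairs

    def __enter__(self):
        ToyPipeline._active = self
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyPipeline._active = None
        return False  # do not swallow exceptions raised inside the block


class ToyStep:
    """Hypothetical step that registers itself with the active pipeline."""

    def __init__(self, name):
        self.name = name
        self.pipeline = ToyPipeline._active
        if self.pipeline is None:
            raise RuntimeError("ToyStep must be created inside a `with ToyPipeline(...)` block")
        self.pipeline.steps[name] = self

    def __rshift__(self, other):
        # `a >> b` records an edge from a to b in the pipeline's graph.
        self.pipeline.edges.append((self.name, other.name))
        return other  # returning `other` allows chaining: a >> b >> c


with ToyPipeline(name="toy") as pipeline:
    load = ToyStep("load_dataset")
    generate = ToyStep("text_generation")
    load >> generate

print(list(pipeline.steps))
print(pipeline.edges)
```

Nothing is passed to the pipeline explicitly: creating a step inside the `with` block is enough to register it, and `>>` only records the connection. Execution would happen later, which is why the real example calls pipeline.run separately.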