Quickstart
distilabel is a framework for building pipelines that generate synthetic data with LLMs. A Pipeline orchestrates the execution of Step subclasses, which are connected as nodes in a Directed Acyclic Graph (DAG).
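To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use distilabel and is not how the library schedules steps internally; it only shows how a graph of step names yields a valid execution order. The step name postprocess is hypothetical and added for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step name; the value is the set of steps it depends on.
# "postprocess" is a hypothetical step used only for this illustration.
dependencies = {
    "load_dataset": set(),
    "text_generation": {"load_dataset"},
    "postprocess": {"text_generation"},
}


def execution_order(deps):
    """Return one valid execution order for a graph of step names."""
    return list(TopologicalSorter(deps).static_order())


def is_valid_dag(deps):
    """Return True if the graph has no cycles, False otherwise."""
    try:
        execution_order(deps)
    except CycleError:
        return False
    return True


order = execution_order(dependencies)
print(order)
print(is_valid_dag({"a": {"b"}, "b": {"a"}}))  # a cycle is not a DAG
```

Because every step only depends on steps that come before it, a graph with a cycle cannot be ordered and is rejected.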
Installation
To install the latest release of the package from PyPI with the hf-inference-endpoints extra, use the following command:
pip install "distilabel[hf-inference-endpoints]" --upgrade
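After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata. This is a small sketch using only the standard library; the helper name installed_version is made up for this example, and it checks only the base package, not which extras were installed.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("distilabel")
if version:
    print(f"distilabel version: {version}")
else:
    print("distilabel is not installed in this environment")
```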
Define a pipeline
This guide walks through creating a simple pipeline that uses the InferenceEndpointsLLM class to generate text. The Pipeline loads a dataset containing a column named prompt from the Hugging Face Hub via the LoadDataFromHub step, then uses the InferenceEndpointsLLM class to generate text from that dataset with the TextGeneration task.
You can check the available models in the Hugging Face Model Hub and filter by Inference status.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(  # (1)
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:  # (2)
    load_dataset = LoadDataFromHub(  # (3)
        output_mappings={"prompt": "instruction"},
    )

    text_generation = TextGeneration(  # (4)
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),  # (5)
    )

    load_dataset >> text_generation  # (6)

if __name__ == "__main__":
    distiset = pipeline.run(  # (7)
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # (8)
1. We define a Pipeline with the name simple-text-generation-pipeline and a description A simple text generation pipeline. Note that the name is mandatory and will be used to calculate the cache signature path, so changing the name will change the cache path and will be identified as a different pipeline.
2. We are using the Pipeline context manager, meaning that every Step subclass that is defined within the context manager will be added to the pipeline automatically.
3. We define a LoadDataFromHub step named load_dataset that will load a dataset from the Hugging Face Hub, as provided via runtime parameters in the pipeline.run method below, but it can also be defined within the class instance via the arg repo_id=... . This step will produce output batches with the rows from the dataset, and the column prompt will be mapped to the instruction field.
4. We define a TextGeneration task named text_generation that will generate text based on the instruction field from the dataset. This task will use the InferenceEndpointsLLM class with the model Meta-Llama-3.1-8B-Instruct.
5. We define the InferenceEndpointsLLM class with the model Meta-Llama-3.1-8B-Instruct that will be used by the TextGeneration task. In this case, since the InferenceEndpointsLLM is used, we assume that the HF_TOKEN environment variable is set.
6. We connect the load_dataset step to the text_generation task using the rshift operator, meaning that the output from the load_dataset step will be used as input for the text_generation task.
7. We run the pipeline with the parameters for the load_dataset and text_generation steps. The load_dataset step will use the repository distilabel-internal-testing/instruction-dataset-mini and the test split, and the text_generation task will use the generation_kwargs with the temperature set to 0.7 and the max_new_tokens set to 512.
8. Optionally, we can push the generated Distiset to the Hugging Face Hub repository distilabel-example. This will allow you to share the generated dataset with others and use it in other pipelines.
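Three mechanisms from the notes above can be illustrated without the library: automatic registration inside a context manager, wiring steps with the rshift operator, and renaming output columns through output_mappings. The sketch below is a toy model written for this guide, not distilabel's implementation; ToyPipeline, ToyStep, and the uppercase "generation" function are all hypothetical stand-ins.

```python
class ToyPipeline:
    """Illustrative stand-in for a pipeline; NOT distilabel's implementation."""

    _active = None  # pipeline currently open in a `with` block, if any

    def __init__(self, name):
        self.name = name
        self.steps = {}  # step name -> step object
        self.edges = []  # (upstream name, downstream name) pairs

    def __enter__(self):
        ToyPipeline._active = self
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyPipeline._active = None
        return False  # do not swallow exceptions


class ToyStep:
    """A step that registers itself with the pipeline that is currently open."""

    def __init__(self, name, fn, output_mappings=None):
        self.name = name
        self.fn = fn
        self.output_mappings = output_mappings or {}
        self.pipeline = ToyPipeline._active
        if self.pipeline is None:
            raise RuntimeError("ToyStep must be created inside a ToyPipeline context")
        self.pipeline.steps[name] = self  # automatic registration

    def __rshift__(self, other):
        # `a >> b` records an edge and returns b so that chains like a >> b >> c work.
        self.pipeline.edges.append((self.name, other.name))
        return other

    def process(self, rows):
        # Apply the step function, then rename output columns per output_mappings.
        outputs = [self.fn(dict(row)) for row in rows]
        return [
            {self.output_mappings.get(key, key): value for key, value in row.items()}
            for row in outputs
        ]


with ToyPipeline(name="toy") as pipeline:
    load = ToyStep(
        "load_dataset",
        lambda row: row,
        output_mappings={"prompt": "instruction"},
    )
    generate = ToyStep(
        "text_generation",
        # Stand-in for an LLM call: upper-case the instruction.
        lambda row: {**row, "generation": row["instruction"].upper()},
    )
    load >> generate

rows = load.process([{"prompt": "hello"}])
result = generate.process(rows)
print(list(pipeline.steps), pipeline.edges)
print(result)
```

The downstream step only ever sees the instruction column, which is why the mapping on the loading step matters when the dataset column is named differently from what the task expects.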
Minimal example
distilabel gives you a lot of flexibility to create your pipelines, but to get started right away, you can omit many of the details and rely on the default values:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
from datasets import load_dataset

dataset = load_dataset("distilabel-internal-testing/instruction-dataset-mini", split="test")

with Pipeline() as pipeline:  # (1)
    TextGeneration(llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"))  # (2)

if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)  # (3)
    distiset.push_to_hub(repo_id="distilabel-example")
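Since this guide assumes the HF_TOKEN environment variable is set when InferenceEndpointsLLM is used, it can help to fail early with a clear message before starting a run. The following is a small sketch using only the standard library; the helper name require_env is hypothetical, and the token value in the demo is a placeholder.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or raise a clear error if unset."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running the pipeline."
        )
    return value


# Demonstration with an explicit mapping so the example does not depend on your shell.
demo_env = {"HF_TOKEN": "hf_placeholder"}  # placeholder value for illustration only
token = require_env("HF_TOKEN", demo_env)
print("HF_TOKEN found:", bool(token))
```

In a real script you would call require_env("HF_TOKEN") with no second argument just before pipeline.run, so that the check reads from the process environment.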