Quickstart
distilabel is a framework for building pipelines that generate synthetic data with LLMs. It defines a Pipeline that orchestrates the execution of Step subclasses, which are connected as nodes in a Directed Acyclic Graph (DAG).
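To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard graphlib module. It is not distilabel code: the step names (including the third one, push_results) and the helper execution_order are hypothetical. It only illustrates how a DAG fixes a valid execution order in which every step runs after the steps it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical step names, for illustration only (not distilabel API).
# Each key maps to the set of steps it depends on (its predecessors).
dependencies = {
    "load_dataset": set(),
    "text_generation": {"load_dataset"},
    "push_results": {"text_generation"},
}


def execution_order(deps):
    """Return the nodes ordered so that predecessors always come first."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # ['load_dataset', 'text_generation', 'push_results']

# A graph with a cycle is not a DAG, so no valid order exists.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```

Because the example graph is a simple chain, there is exactly one valid order. With branching graphs several orders can be valid.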
This guide walks through creating a simple pipeline that uses the OpenAILLM class to generate text. The Pipeline loads a dataset containing a column named prompt from the Hugging Face Hub via the LoadDataFromHub step, then uses the OpenAILLM class to generate text from that dataset with the TextGeneration task.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(  # (1)
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:  # (2)
    load_dataset = LoadDataFromHub(  # (3)
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    text_generation = TextGeneration(  # (4)
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),  # (5)
    )

    load_dataset >> text_generation  # (6)

if __name__ == "__main__":
    distiset = pipeline.run(  # (7)
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # (8)
1. We define a Pipeline with the name simple-text-generation-pipeline and the description "A simple text generation pipeline". Note that the name is mandatory and is used to calculate the cache signature path, so changing the name changes the cache path and the pipeline is identified as a different one.
2. We use the Pipeline context manager, meaning that every Step subclass defined within the context manager is added to the pipeline automatically.
3. We define a LoadDataFromHub step named load_dataset that loads a dataset from the Hugging Face Hub. The dataset is provided via runtime parameters in the pipeline.run method below, but it can also be defined within the class instance via the arg repo_id=.... This step produces output batches with the rows from the dataset, and the column prompt is mapped to the instruction field.
4. We define a TextGeneration task named text_generation that generates text based on the instruction field from the dataset. This task uses the OpenAILLM class with the model gpt-3.5-turbo.
5. We define the OpenAILLM class with the model gpt-3.5-turbo that is used by the TextGeneration task. Since OpenAILLM is used, we assume that the OPENAI_API_KEY environment variable is set, and the OpenAI API is used to generate the text.
6. We connect the load_dataset step to the text_generation task using the rshift operator (>>), meaning that the output from the load_dataset step is used as input for the text_generation task.
7. We run the pipeline with the parameters for the load_dataset and text_generation steps. The load_dataset step uses the repository distilabel-internal-testing/instruction-dataset-mini and the test split, and the text_generation task uses the generation_kwargs with temperature set to 0.7 and max_new_tokens set to 512.
8. Optionally, we can push the generated Distiset to the Hugging Face Hub repository distilabel-example. This allows you to share the generated dataset with others and use it in other pipelines.
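Three mechanics described above can be reproduced in plain Python: steps registering themselves inside a context manager, the prompt column being renamed to instruction, and the >> operator connecting steps. The sketch below is a toy model with no dependency on distilabel. ToyPipeline, ToyStep and rename_columns are hypothetical names invented for illustration, and this is not how distilabel itself is implemented.

```python
class ToyPipeline:
    """Toy stand-in for a pipeline; NOT distilabel's implementation."""

    _active = None  # pipeline whose `with` block is currently open

    def __init__(self, name):
        self.name = name
        self.steps = {}  # step name -> step object
        self.edges = []  # (upstream name, downstream name) pairs

    def __enter__(self):
        ToyPipeline._active = self
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyPipeline._active = None
        return False  # do not swallow exceptions


class ToyStep:
    """Toy stand-in for a step; registers itself with the open pipeline."""

    def __init__(self, name, output_mappings=None):
        self.name = name
        self.output_mappings = output_mappings or {}
        self.pipeline = ToyPipeline._active
        if self.pipeline is None:
            raise RuntimeError("create steps inside a `with ToyPipeline(...)` block")
        self.pipeline.steps[name] = self

    def __rshift__(self, other):
        # `a >> b` records an edge a -> b and returns b so chains work.
        self.pipeline.edges.append((self.name, other.name))
        return other

    def rename_columns(self, row):
        # Apply output_mappings: rename mapped keys, keep the other columns.
        return {self.output_mappings.get(k, k): v for k, v in row.items()}


with ToyPipeline(name="toy") as pipeline:
    load_dataset = ToyStep("load_dataset", output_mappings={"prompt": "instruction"})
    text_generation = ToyStep("text_generation")
    load_dataset >> text_generation

print(sorted(pipeline.steps))  # ['load_dataset', 'text_generation']
print(pipeline.edges)          # [('load_dataset', 'text_generation')]
print(load_dataset.rename_columns({"prompt": "Say hi"}))  # {'instruction': 'Say hi'}
```

Returning the right-hand step from __rshift__ is a design choice that lets connections be chained, as in a >> b >> c, with each pair recorded as one edge.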