How-to Guide
To start off, distilabel is a framework for building pipelines that generate synthetic data using LLMs. It defines a Pipeline that orchestrates the execution of Step subclasses, which are connected as nodes in a Directed Acyclic Graph (DAG).
With that in mind, this guide walks you through creating a simple pipeline that uses the OpenAILLM class to generate text. The Pipeline will load a dataset containing a column named prompt from the Hugging Face Hub via the LoadHubDataset step, and then use the OpenAILLM class to generate text based on that dataset with the TextGeneration task.
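Before running the pipeline you need access to the OpenAI API. As a minimal sketch (assuming, as is common for OpenAI clients, that OpenAILLM falls back to the OPENAI_API_KEY environment variable when no explicit api_key argument is passed), you can set the key before starting the script:

import os

# Assumption: OpenAILLM reads the key from the OPENAI_API_KEY environment
# variable when no explicit api_key is provided; the value below is a placeholder.
os.environ["OPENAI_API_KEY"] = "sk-..."

The full pipeline then looks as follows: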
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import TextGeneration
with Pipeline(  # Create the Pipeline that will orchestrate the steps
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:  # Steps defined inside this block are added to the pipeline
    load_dataset = LoadHubDataset(  # Step that loads a dataset from the Hugging Face Hub
        name="load_dataset",
        output_mappings={"prompt": "instruction"},  # Rename the "prompt" column to "instruction"
    )

    text_generation = TextGeneration(  # Task that generates text for each input instruction
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),  # The LLM used by the task
    )

    load_dataset >> text_generation  # Connect the steps: the loaded rows feed the task
if __name__ == "__main__":
    distiset = pipeline.run(  # Run the pipeline, providing runtime parameters per step
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # Push the generated data to the Hub
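pipeline.run returns a Distiset, a dictionary-like object grouping the datasets produced by the pipeline, which push_to_hub then uploads to the Hugging Face Hub. As a minimal sketch for inspecting the result locally before pushing it (the exact subset and split names depend on the pipeline, so this assumes a single subset with a "train" split):

    # Print the Distiset to see which subsets and splits it contains, then look
    # at the first generated row. Assumption: one subset with a "train" split.
    print(distiset)
    subset_name = next(iter(distiset.keys()))
    print(distiset[subset_name]["train"][0])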