How to Guide

To start off, distilabel is a framework for building pipelines that generate synthetic data using LLMs: it defines a Pipeline that orchestrates the execution of Step subclasses, which are connected as nodes in a Directed Acyclic Graph (DAG).

With that in mind, this guide walks you through the process of creating a simple pipeline that uses the OpenAILLM class to generate text. The Pipeline will load a dataset containing a column named prompt from the Hugging Face Hub via the LoadHubDataset step, and then use the OpenAILLM class to generate text based on that dataset with the TextGeneration task.
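
Before running it, note that the OpenAILLM class expects a valid OpenAI API key: unless you pass one explicitly via its api_key argument, it is read from the OPENAI_API_KEY environment variable. A minimal, illustrative way to set it from Python (the key below is just a placeholder to replace with your own) would be:

import os

# OpenAILLM falls back to the environment variable when no api_key is given;
# exporting OPENAI_API_KEY in your shell works just as well.
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key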

from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import TextGeneration

with Pipeline(  # create a Pipeline that will hold the steps as a DAG
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    load_dataset = LoadHubDataset(  # step that loads a dataset from the Hugging Face Hub
        name="load_dataset",
        output_mappings={"prompt": "instruction"},  # rename the `prompt` column to `instruction`, which TextGeneration expects
    )

    text_generation = TextGeneration(  # task that generates text for each instruction
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),  # LLM used to generate the completions
    )

    load_dataset >> text_generation  # connect the steps: dataset rows feed the generation task

if __name__ == "__main__":
    distiset = pipeline.run(  # run the pipeline, providing the runtime parameters for each step
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # push the resulting Distiset to the Hugging Face Hub
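
The pipeline.run call returns a Distiset, a dictionary-like container built on top of Hugging Face datasets that groups the outputs of the pipeline's leaf steps, and that is what push_to_hub uploads above. As a rough sketch of inspecting it locally first (the subset key is an assumption: with a single leaf step it is typically exposed as "default", otherwise under the leaf step's name):

# Illustrative only: inspect the generated data before pushing it to the Hub.
print(distiset)  # shows the available subsets and splits
subset = distiset.get("default") or distiset.get("text_generation")
if subset is not None:
    print(subset["train"][0])  # first row: the instruction plus the generated text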