Using the Distiset dataset object

A Pipeline in distilabel returns a special type of Hugging Face datasets.DatasetDict which is called Distiset.

The Distiset is a dictionary-like object that contains the different configurations generated by the Pipeline, one for each leaf step in the DAG built by the Pipeline. Each configuration corresponds to a different subset of the dataset. This concept is taken from 🤗 datasets: it lets you upload several configurations of the same dataset to the same repository, each of which can contain different columns, and all of them can be seamlessly pushed to the Hugging Face Hub.

Below you can find an example of how to create a Distiset object that resembles a datasets.DatasetDict:

from datasets import Dataset
from distilabel.distiset import Distiset

distiset = Distiset(
    {
        "leaf_step_1": Dataset.from_dict({"instruction": [1, 2, 3]}),
        "leaf_step_2": Dataset.from_dict(
            {"instruction": [1, 2, 3, 4], "generation": [5, 6, 7, 8]}
        ),
    }
)

Note

If there's only one leaf node, i.e., only one step at the end of the Pipeline, then the configuration name won't be the name of the last step, but it will be set to "default" instead, as that's more aligned with standard datasets within the Hugging Face Hub.
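The naming rule from the note can be sketched in a few lines of plain Python. `configuration_names` is a hypothetical helper written only for illustration; it is not part of distilabel and only mirrors the behaviour described above.

```python
from typing import List


def configuration_names(leaf_step_names: List[str]) -> List[str]:
    """Hypothetical helper: map leaf step names to configuration names."""
    # A single leaf step is exposed as "default", as described in the note.
    if len(leaf_step_names) == 1:
        return ["default"]
    # With several leaf steps, each configuration keeps its step name.
    return list(leaf_step_names)


print(configuration_names(["leaf_step_1", "leaf_step_2"]))
print(configuration_names(["text_generation"]))
```

Here `"text_generation"` is just a placeholder step name; with a single leaf step the only configuration you would expect to index is `"default"`.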

Distiset methods

We can interact with the different pieces generated by the Pipeline and treat them as different configurations. The Distiset contains just two methods:

Train/Test split

Create a train/test split partition of the dataset for the different configurations or subsets.

>>> distiset.train_test_split(train_size=0.9)
Distiset({
    leaf_step_1: DatasetDict({
        train: Dataset({
            features: ['instruction'],
            num_rows: 2
        })
        test: Dataset({
            features: ['instruction'],
            num_rows: 1
        })
    })
    leaf_step_2: DatasetDict({
        train: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 3
        })
        test: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 1
        })
    })
})
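The row counts in the output above are consistent with taking the floor of `train_size * num_rows` for the train partition and leaving the remaining rows for the test partition. The sketch below reproduces that arithmetic under this assumption; `split_sizes` is a hypothetical helper, not a distilabel or datasets function.

```python
import math
from typing import Tuple


def split_sizes(num_rows: int, train_size: float) -> Tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional train_size.

    Assumption: the train partition gets floor(train_size * num_rows) rows
    and the test partition gets whatever is left.
    """
    train_rows = math.floor(train_size * num_rows)
    return train_rows, num_rows - train_rows


print(split_sizes(3, 0.9))  # leaf_step_1 has 3 rows -> (2, 1)
print(split_sizes(4, 0.9))  # leaf_step_2 has 4 rows -> (3, 1)
```

With very small subsets like these, the effective split can be far from 90/10, which is worth keeping in mind when inspecting toy examples.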

Push to Hugging Face Hub

Push the Distiset to a Hugging Face repository, where each one of the subsets will correspond to a different configuration:

import os

distiset.push_to_hub(
    "my-org/my-dataset",
    commit_message="Initial commit",
    private=False,
    token=os.getenv("HF_TOKEN"),
    generate_card=True,
    include_script=False
)

New since version 1.3.0

Since version 1.3.0 you can automatically push the script that created your pipeline to the same repository. For example, assuming you have a file like the following:

sample_pipe.py
with Pipeline() as pipe:
    ...
distiset = pipe.run()
distiset.push_to_hub(
    "my-org/my-dataset",
    include_script=True
)

After running the script, the file sample_pipe.py will be stored in the repository, which makes it simpler to share your pipeline with the community.

Custom Docstrings

distilabel contains a custom plugin that automatically generates a gallery for the different components. The information is extracted by parsing the Step's docstrings. You can take a look at the docstrings in the source code of UltraFeedback, and at the corresponding entry in the components gallery, to see an example of how the docstrings are rendered.

If you create your own components and want the citations automatically rendered in the README card (in case you are sharing your final distiset on the Hugging Face Hub), you may want to add the Citations section. This is an example for the MagpieGenerator Task:

class MagpieGenerator(GeneratorTask, MagpieBase):
    r"""Generator task that generates instructions or conversations using Magpie.
    ...

    Citations:

        ```
        @misc{xu2024magpiealignmentdatasynthesis,
            title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
            author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
            year={2024},
            eprint={2406.08464},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2406.08464},
        }
        ```
    """

The Citations section can include any number of BibTeX references. To define them, add as many elements as needed, just like in the example: each citation is a block of the form ```@misc{...}```. This information will be automatically used in the README of your Distiset if you decide to call distiset.push_to_hub. Alternatively, if the Citations section is not found but the References section contains any URLs pointing to https://arxiv.org/, we will try to obtain the BibTeX equivalent automatically. This way, Hugging Face can automatically track the paper for you, and it becomes easier to find other datasets citing the same paper or to visit the paper page directly.
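To make the docstring convention more concrete, here is a minimal sketch of how fenced BibTeX blocks and arXiv links could be pulled out of a docstring with the standard library. It is not distilabel's actual parser: `extract_citations`, `find_arxiv_urls`, and the toy docstring are hypothetical names and data used only for illustration.

```python
import re
from typing import List

# Built programmatically so the fence marker does not clash with this code block.
FENCE = "`" * 3


def extract_citations(docstring: str) -> List[str]:
    """Return the fenced blocks found after the `Citations:` header."""
    _, header, section = docstring.partition("Citations:")
    if not header:
        return []  # no Citations section in this docstring
    pattern = re.compile(re.escape(FENCE) + r"(.*?)" + re.escape(FENCE), re.DOTALL)
    return [block.strip() for block in pattern.findall(section)]


def find_arxiv_urls(docstring: str) -> List[str]:
    """Return the arXiv abstract URLs mentioned anywhere in the docstring."""
    return re.findall(r"https://arxiv\.org/abs/[\w.\-/]+", docstring)


toy_docstring = f"""Toy task used only for this example.

    References:
        - https://arxiv.org/abs/2406.08464

    Citations:

        {FENCE}
        @misc{{example2024, title={{An Example}}, year={{2024}}}}
        {FENCE}

        {FENCE}
        @misc{{other2023, title={{Another Example}}, year={{2023}}}}
        {FENCE}
    """

citations = extract_citations(toy_docstring)
arxiv_urls = find_arxiv_urls(toy_docstring)
print(len(citations), arxiv_urls)
```

Running a check like this on your own component's docstring is a quick way to confirm that each citation sits in its own fenced block before pushing the dataset.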

Image Datasets

Keep reading if you are interested in Image datasets

The Distiset object has a new method, transform_columns_to_image, specifically to transform the images to PIL.Image.Image before pushing the dataset to the Hugging Face Hub.

Since version 1.5.0, the ImageGeneration task can generate images from text. By default, the whole process works internally with a string representation of the images, which keeps processing simple. However, if the generated dataset is going to be stored on the Hugging Face Hub, a proper Image object may be preferable, for example so that the images can be seen in the dataset viewer. Let's take a look at the following pipeline, extracted from "examples/image_generation.py" at the root of the repository, to see how to do it:

# Assume all the imports are already done; only the relevant parts are shown
igm = InferenceEndpointsImageGeneration(model_id="black-forest-labs/FLUX.1-schnell")

with Pipeline(name="image_generation_pipeline") as pipeline:
    img_generation = ImageGeneration(
        name="flux_schnell",
        llm=igm,
    )
    ...

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False, dataset=ds)
    # Save the images as `PIL.Image.Image`
    distiset = distiset.transform_columns_to_image("image")
    distiset.push_to_hub(...)

Calling transform_columns_to_image converts the image columns we may have generated to PIL.Image.Image (in this case we only want to transform the "image" column, but a list of columns can be passed). The transformation applies to every leaf node in the pipeline, so if we have different subsets, the "image" column will be looked for in all of them.

Save and load from disk

Take into account that these methods work like datasets.load_from_disk and datasets.Dataset.save_to_disk, so the arguments are passed directly to those methods. This means you can also use the storage_options argument to save your Distiset in your cloud provider, including the distilabel artifacts (pipeline.yaml, pipeline.log and the README.md with the dataset card). You can read more in the datasets documentation.

Save the Distiset to disk and, optionally (this is done by default), save the dataset card, the pipeline config file and the logs:

distiset.save_to_disk(
    "my-dataset",
    save_card=True,
    save_pipeline_config=True,
    save_pipeline_log=True
)

Load a Distiset that was saved using Distiset.save_to_disk just the same way:

distiset = Distiset.load_from_disk("my-dataset")

Load a Distiset from a remote location, such as S3 or GCS. You can pass the storage_options argument to authenticate with the cloud provider:

import os

distiset = Distiset.load_from_disk(
    "s3://path/to/my_dataset",  # gcs:// or any filesystem supported by fsspec
    storage_options={
        "key": os.environ["S3_ACCESS_KEY"],
        "secret": os.environ["S3_SECRET_KEY"],
        # ...
    }
)
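Since the credentials are read from environment variables, it can help to validate them before touching the remote filesystem. The following is a minimal sketch, assuming the same `S3_ACCESS_KEY` and `S3_SECRET_KEY` variables as above; `s3_storage_options` is a hypothetical helper and not part of distilabel.

```python
import os
from typing import Dict, Mapping, Optional


def s3_storage_options(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: build the storage_options dict from env variables."""
    env = os.environ if env is None else env
    required = ("S3_ACCESS_KEY", "S3_SECRET_KEY")
    missing = [name for name in required if name not in env]
    if missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"key": env["S3_ACCESS_KEY"], "secret": env["S3_SECRET_KEY"]}


# Example with an explicit mapping, so the snippet does not depend on the real environment.
options = s3_storage_options({"S3_ACCESS_KEY": "my-key", "S3_SECRET_KEY": "my-secret"})
print(sorted(options))
```

The resulting dictionary could then be passed as `storage_options=s3_storage_options()` to the save and load calls shown in this section.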

Take a look at the remaining arguments at Distiset.save_to_disk and Distiset.load_from_disk.

Dataset card

Having this special type of dataset comes with an added advantage when calling Distiset.push_to_hub, which is the automatically generated dataset card in the Hugging Face Hub. Note that it is enabled by default, but can be disabled by setting generate_card=False:

distiset.push_to_hub("my-org/my-dataset", generate_card=True)

We will have an automatic dataset card (an example can be seen here) with some handy information like reproducing the Pipeline with the CLI, or examples of the records from the different subsets.

create_distiset helper

Lastly, the caching section introduced the create_distiset function. Take a look at that section to see how to create a Distiset from the cache folder, using the helper function to automatically include all the relevant data.