Distiset

A Pipeline in distilabel returns a Distiset, a special type of Hugging Face datasets.DatasetDict whose name combines distilabel and dataset. This object is a wrapper around datasets.Dataset with extra functionality for handling the dataset pieces that a Pipeline can generate.

The Distiset is a dictionary-like object that contains the different configurations generated by the Pipeline, one for each leaf step in the DAG built by the Pipeline. Each configuration is a different subset of the dataset. This concept comes from 🤗 datasets: it lets you upload several configurations of the same dataset to the same repository, each of which can contain different columns, and push them to the Hugging Face Hub straight away.

Below is an example of how to create a Distiset object, in the same way as a datasets.DatasetDict. You do not need to do this yourself in distilabel, since the Pipeline builds the Distiset internally and returns it as the output of the run method:

from datasets import Dataset
from distilabel.distiset import Distiset

distiset = Distiset(
    {
        "leaf_step_1": Dataset.from_dict({"instruction": [1, 2, 3]}),
        "leaf_step_2": Dataset.from_dict(
            {"instruction": [1, 2, 3, 4], "generation": [5, 6, 7, 8]}
        ),
    }
)

Note

If there's only one leaf node, i.e., only one step at the end of the Pipeline, then the configuration name won't be the name of the last step, but it will be set to "default" instead, as that's more aligned with standard datasets within the Hugging Face Hub.
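The naming rule in the note above can be sketched in plain Python. This is only an illustration of the rule as described here: configuration_names is a hypothetical helper, not a distilabel function, and the step names are made up.

```python
def configuration_names(leaf_steps):
    """Return the configuration names for a list of leaf step names.

    Hypothetical helper that mirrors the rule described in the note;
    it is not part of distilabel.
    """
    # A single leaf step is exposed as "default" instead of its step name.
    if len(leaf_steps) == 1:
        return ["default"]
    # With several leaf steps, each configuration keeps its step name.
    return list(leaf_steps)


print(configuration_names(["leaf_step_1"]))
print(configuration_names(["leaf_step_1", "leaf_step_2"]))
```

The first call prints ['default'] and the second prints ['leaf_step_1', 'leaf_step_2'].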

Distiset methods

We can interact with the different pieces generated by the Pipeline and treat them as different configurations. The Distiset contains just two methods:

Train/Test split

Performs the train/test split of the dataset for each of the configurations or subsets:

>>> distiset.train_test_split(train_size=0.9)
Distiset({
    leaf_step_1: DatasetDict({
        train: Dataset({
            features: ['instruction'],
            num_rows: 2
        })
        test: Dataset({
            features: ['instruction'],
            num_rows: 1
        })
    })
    leaf_step_2: DatasetDict({
        train: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 3
        })
        test: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 1
        })
    })
})
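The row counts in the output above follow from simple arithmetic. The sketch below reproduces them under the assumption that a fractional train_size is rounded down to a whole number of rows and the remaining rows go to the test split. split_sizes is a hypothetical helper written for this illustration, not part of distilabel or datasets.

```python
import math


def split_sizes(num_rows, train_size):
    """Return (train_rows, test_rows) for a fractional train_size.

    Hypothetical helper. Assumption: the train count is rounded down
    and the test split receives whatever rows are left.
    """
    train_rows = math.floor(train_size * num_rows)
    return train_rows, num_rows - train_rows


# Row counts of the two configurations used in the example above.
for name, rows in {"leaf_step_1": 3, "leaf_step_2": 4}.items():
    print(name, split_sizes(rows, train_size=0.9))
```

This prints (2, 1) for leaf_step_1 and (3, 1) for leaf_step_2, matching the num_rows values shown above. With datasets this small the effective ratio is far from 90/10, so the split is mostly useful on larger outputs.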

Push to Hugging Face Hub

Pushes the Distiset to a Hugging Face repository, where each one of the subsets will correspond to a different configuration:

import os

distiset.push_to_hub(
    "my-org/my-dataset",
    commit_message="Initial commit",
    private=False,
    token=os.getenv("HF_TOKEN"),
)

Dataset card

An added advantage of this special type of dataset is that calling Distiset.push_to_hub automatically generates a dataset card in the Hugging Face Hub. This is enabled by default and can be disabled by setting generate_card=False:

distiset.push_to_hub("my-org/my-dataset", generate_card=True)

The automatically generated dataset card (an example can be seen here) includes handy information such as how to reproduce the Pipeline with the CLI and example records from the different subsets.

create_distiset helper

Lastly, the caching section introduced the create_distiset function. Take a look at that section to see how to create a Distiset from the cache folder, using the helper function to automatically include all the relevant data.