Using the Distiset dataset object¶
A Pipeline in distilabel returns a special type of Hugging Face datasets.DatasetDict, which is called Distiset.
The Distiset is a dictionary-like object that contains the different configurations generated by the Pipeline, where each configuration corresponds to one leaf step in the DAG built by the Pipeline. Each configuration is a different subset of the dataset. This concept is taken from 🤗 datasets: it lets you upload several configurations of the same dataset to the same repository, each of which can contain different columns, and push them seamlessly to the Hugging Face Hub.
Below you can find an example of how to create a Distiset object that resembles a datasets.DatasetDict:
from datasets import Dataset
from distilabel.distiset import Distiset

distiset = Distiset(
    {
        "leaf_step_1": Dataset.from_dict({"instruction": [1, 2, 3]}),
        "leaf_step_2": Dataset.from_dict(
            {"instruction": [1, 2, 3, 4], "generation": [5, 6, 7, 8]}
        ),
    }
)
Note
If there's only one leaf node, i.e., only one step at the end of the Pipeline, the configuration name won't be the name of the last step. It is set to "default" instead, as that's more aligned with standard datasets on the Hugging Face Hub.
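The naming rule can be illustrated with a small sketch in plain Python. The configuration_names helper below is hypothetical and is not part of distilabel; it only restates the rule from the note, assuming the leaf step names are available as a list of strings.

```python
def configuration_names(leaf_step_names: list[str]) -> list[str]:
    """Return the configuration names expected for the given leaf steps.

    Hypothetical helper: it only restates the rule from the note above.
    """
    if len(leaf_step_names) == 1:
        # A single leaf step is exposed as "default", like most Hub datasets.
        return ["default"]
    # Several leaf steps: each configuration keeps the name of its step.
    return list(leaf_step_names)


print(configuration_names(["leaf_step_1"]))  # ['default']
print(configuration_names(["leaf_step_1", "leaf_step_2"]))  # ['leaf_step_1', 'leaf_step_2']
```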
Distiset methods¶
We can interact with the different pieces generated by the Pipeline and treat them as different configurations. The Distiset provides the following methods:
Train/Test split¶
Create a train/test split partition of the dataset for the different configurations or subsets.
>>> distiset.train_test_split(train_size=0.9)
Distiset({
    leaf_step_1: DatasetDict({
        train: Dataset({
            features: ['instruction'],
            num_rows: 2
        })
        test: Dataset({
            features: ['instruction'],
            num_rows: 1
        })
    })
    leaf_step_2: DatasetDict({
        train: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 3
        })
        test: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 1
        })
    })
})
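The row counts in this output can be checked by hand. The sketch below assumes that, when only train_size is given as a fraction, the train split receives the floor of train_size times the number of rows and the test split receives the remainder. That assumption matches the output shown here, but split_sizes is an illustrative helper, not a datasets or distilabel function.

```python
import math


def split_sizes(num_rows: int, train_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional train_size."""
    # Assumption: the train count is rounded down and the rest goes to test.
    n_train = math.floor(train_size * num_rows)
    return n_train, num_rows - n_train


print(split_sizes(3, 0.9))  # leaf_step_1 has 3 rows -> (2, 1)
print(split_sizes(4, 0.9))  # leaf_step_2 has 4 rows -> (3, 1)
```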
Push to Hugging Face Hub¶
Push the Distiset to a Hugging Face repository, where each of the subsets corresponds to a different configuration:
import os

distiset.push_to_hub(
    "my-org/my-dataset",
    commit_message="Initial commit",
    private=False,
    token=os.getenv("HF_TOKEN"),
)
Save and load from disk¶
Take into account that these methods work like datasets.load_from_disk and datasets.Dataset.save_to_disk, so the arguments are passed directly to those methods. This means you can also use the storage_options argument to save your Distiset to your cloud provider, including the distilabel artifacts (pipeline.yaml, pipeline.log and the README.md with the dataset card). You can read more in the datasets documentation.
Save the Distiset to disk with Distiset.save_to_disk. By default, this also saves the dataset card, the pipeline config file and the logs.
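A minimal sketch of the save call, wrapped in a helper so the arguments are easy to see. The keyword names save_card, save_pipeline_config and save_pipeline_log are assumptions about the distilabel API and should be checked against the Distiset.save_to_disk reference; the helper name and the "my-dataset" path are placeholders.

```python
def save_distiset(distiset, path: str = "my-dataset") -> None:
    """Save a Distiset and its artifacts under `path`."""
    distiset.save_to_disk(
        path,
        # Assumed keyword names; the text describes all three as on by default.
        save_card=True,  # README.md with the dataset card
        save_pipeline_config=True,  # pipeline.yaml
        save_pipeline_log=True,  # pipeline.log
    )
```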
Load a Distiset that was saved using Distiset.save_to_disk in the same way, with Distiset.load_from_disk.
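As a sketch, loading mirrors saving: Distiset.load_from_disk is called on the class with the path that was used when saving. The wrapper name and the default path are placeholders, and the import sits inside the function only so that the helper can be defined without distilabel being importable.

```python
def load_distiset(path: str = "my-dataset"):
    """Load a Distiset previously written with Distiset.save_to_disk."""
    # Imported lazily so defining this helper does not require distilabel.
    from distilabel.distiset import Distiset

    return Distiset.load_from_disk(path)
```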
Load a Distiset from a remote location, such as S3 or GCS. You can pass the storage_options argument to authenticate with the cloud provider.
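A hedged sketch for the remote case. The "key" and "secret" entries are the credential names accepted by the s3fs filesystem; the environment variable names S3_ACCESS_KEY and S3_SECRET_KEY and both helper names are placeholders chosen for illustration. Other providers expect different storage_options keys.

```python
import os


def s3_storage_options(env=os.environ) -> dict:
    """Build storage_options for an S3 location from environment variables."""
    # Placeholder variable names; use whatever your deployment defines.
    return {"key": env["S3_ACCESS_KEY"], "secret": env["S3_SECRET_KEY"]}


def load_remote_distiset(path: str, storage_options: dict):
    """Load a Distiset from a remote filesystem such as S3 or GCS."""
    # Imported lazily so defining this helper does not require distilabel.
    from distilabel.distiset import Distiset

    return Distiset.load_from_disk(path, storage_options=storage_options)
```

A call would then look like load_remote_distiset("s3://path/to/my_dataset", s3_storage_options()), where the bucket path is again a placeholder.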
Take a look at the remaining arguments at Distiset.save_to_disk and Distiset.load_from_disk.
Dataset card¶
Having this special type of dataset comes with an added advantage when calling Distiset.push_to_hub: a dataset card is generated automatically in the Hugging Face Hub. Note that it is enabled by default, but can be disabled by setting generate_card=False.
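For instance, the card can be skipped as in the sketch below. The helper name and the repository id are placeholders; generate_card, private and token are the push_to_hub arguments already mentioned in this guide.

```python
import os


def push_without_card(distiset, repo_id: str = "my-org/my-dataset") -> None:
    """Push a Distiset to the Hub without generating a dataset card."""
    distiset.push_to_hub(
        repo_id,
        private=False,
        token=os.getenv("HF_TOKEN"),
        generate_card=False,  # skip the automatically generated dataset card
    )
```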
The automatic dataset card (an example can be seen here) includes some handy information, such as how to reproduce the Pipeline with the CLI, or examples of the records from the different subsets.
create_distiset helper¶
Lastly, the caching section presents the create_distiset function. Take a look at that section to see how to create a Distiset from the cache folder, using the helper function to automatically include all the relevant data.