dataset
CustomDataset
Bases: Dataset
A custom dataset class that extends datasets.Dataset and is used to generate an Argilla FeedbackDataset instance from the pre-defined configuration within the task provided to Pipeline.generate.
Source code in src/distilabel/dataset.py
load_from_disk(dataset_path, **kwargs)
classmethod
Load a CustomDataset from disk, also reading the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_path | PathLike | Path to the dataset, as you would do with a standard Dataset. | required |
Returns:
Type | Description |
---|---|
The loaded dataset. |
Source code in src/distilabel/dataset.py
save_to_disk(dataset_path, **kwargs)
Saves the dataset to disk, also saving the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_path | PathLike | Path to the dataset. | required |
**kwargs | Any | Additional arguments to be passed to | {} |
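The save and load methods above describe a round trip in which the task travels with the records. A minimal sketch of that pattern follows, using only the standard library. The class `SketchDataset`, the file names `rows.json` and `task.json`, and the use of JSON for the task are all assumptions made for illustration; they are not distilabel's storage layout or implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


class SketchDataset:
    """Hypothetical stand-in for a dataset that carries a task configuration."""

    def __init__(
        self, rows: List[Dict[str, Any]], task: Optional[Dict[str, Any]] = None
    ) -> None:
        self.rows = rows
        self.task = task

    def save_to_disk(self, dataset_path: str) -> None:
        path = Path(dataset_path)
        path.mkdir(parents=True, exist_ok=True)
        # Persist the records first, then the task next to them.
        (path / "rows.json").write_text(json.dumps(self.rows), encoding="utf-8")
        (path / "task.json").write_text(json.dumps(self.task), encoding="utf-8")

    @classmethod
    def load_from_disk(cls, dataset_path: str) -> "SketchDataset":
        path = Path(dataset_path)
        rows = json.loads((path / "rows.json").read_text(encoding="utf-8"))
        task_file = path / "task.json"
        # Tolerate a missing task file so a plain dataset directory still loads.
        task = None
        if task_file.exists():
            task = json.loads(task_file.read_text(encoding="utf-8"))
        return cls(rows, task)


def round_trip(dataset: SketchDataset) -> SketchDataset:
    """Save the dataset to a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "my-dataset"
        dataset.save_to_disk(str(target))
        return SketchDataset.load_from_disk(str(target))
```

Keeping the task in its own file beside the records means the record files stay readable by tools that know nothing about the task, which is one reason to prefer this layout in a sketch like this.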
Source code in src/distilabel/dataset.py
to_argilla()
Converts the dataset to an Argilla FeedbackDataset instance, based on the task defined in the dataset as part of Pipeline.generate.
Raises:
Type | Description |
---|---|
ImportError | if the argilla library is not installed. |
ValueError | if the task is not set. |
Returns:
Name | Type | Description |
---|---|---|
FeedbackDataset | FeedbackDataset | the Argilla |
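The two documented failure modes can be illustrated with a small guard function. This is a sketch only: the helper name `check_conversion_preconditions` and its `module_name` parameter are hypothetical, and checking the library before the task simply follows the order of the Raises table, which may not match the order used in the actual method.

```python
import importlib.util
from typing import Any, Optional


def check_conversion_preconditions(
    task: Optional[Any], module_name: str = "argilla"
) -> None:
    """Raise the documented errors before attempting a conversion.

    Hypothetical helper: not part of distilabel or argilla.
    """
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"The '{module_name}' library is required but is not installed."
        )
    # Without a task there is no configuration to build the target dataset from.
    if task is None:
        raise ValueError("The task is not set, so the dataset cannot be converted.")
```

Callers that want to degrade gracefully can catch ImportError to suggest installing the optional dependency, and treat ValueError as a sign that the dataset was not produced by a pipeline run with a task attached.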
Source code in src/distilabel/dataset.py
DatasetCheckpoint
dataclass
A checkpoint class that contains the information of a checkpoint.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | The path to the checkpoint. | cwd() / 'ckpt' |
save_frequency | int | The frequency at which the checkpoint should be saved. Defaults to -1 (no checkpoint is saved to disk, but the dataset is returned upon failure). | -1 |
extra_kwargs | dict[str, Any] | Additional kwargs to be passed to the | field(default_factory=dict) |
Examples:
>>> from distilabel.dataset import DatasetCheckpoint
>>> # Save the dataset every 10% of the records generated.
>>> checkpoint = DatasetCheckpoint(save_frequency=len(dataset) // 10)
>>> # Afterwards, we can access the checkpoint's checkpoint.path.
Source code in src/distilabel/dataset.py
do_checkpoint(step)
Determines if a checkpoint should be done.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
step | int | The number of records generated. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | Whether a checkpoint should be done. |
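One plausible way to decide when a checkpoint is due, given a save_frequency and the number of records generated so far, is sketched below. The class `SketchCheckpoint` and its private counter `_saved` are assumptions made for illustration; the library's actual bookkeeping may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SketchCheckpoint:
    """Hypothetical stand-in, not the library's DatasetCheckpoint."""

    save_frequency: int = -1
    # Number of checkpoints already taken (an assumed bookkeeping field).
    _saved: int = field(default=0, repr=False)

    def do_checkpoint(self, step: int) -> bool:
        # -1 disables checkpointing entirely.
        if self.save_frequency == -1:
            return False
        # Fire once at least `save_frequency` records have been generated
        # past the last checkpoint boundary.
        if step - self._saved * self.save_frequency >= self.save_frequency:
            self._saved += 1
            return True
        return False
```

Tracking how many checkpoints have already fired, rather than testing `step % save_frequency == 0`, keeps the check meaningful when records arrive in batches and the step count jumps over an exact multiple.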