LoadDataFromDisk¶
Load a dataset that was previously saved to disk.
If you previously saved your dataset using the save_to_disk method, or Distiset.save_to_disk, you can load it again to build a new pipeline using this class.
Attributes¶
- dataset_path: The path to the dataset or distiset.
- split: The split of the dataset to load (typically train, test or validation).
- config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
Runtime Parameters¶
- batch_size: The batch size to use when processing the data.
- dataset_path: The path to the dataset or distiset.
- is_distiset: Whether the dataset to load is a Distiset or not. Defaults to False.
- split: The split of the dataset to load. Defaults to 'train'.
- config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
- num_examples: The number of examples to load from the dataset. By default, all examples are loaded.
- storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None.
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph New columns
OCOL0[dynamic]
end
end
subgraph LoadDataFromDisk
StepOutput[Output Columns: dynamic]
end
StepOutput --> OCOL0
Outputs¶
- dynamic (all): The columns that will be generated by this step, based on the dataset loaded from disk.
Examples¶
Load data from a Hugging Face Dataset¶
from distilabel.steps import LoadDataFromDisk
loader = LoadDataFromDisk(dataset_path="path/to/dataset")
loader.load()
# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
Load data from a distilabel Distiset¶
from distilabel.steps import LoadDataFromDisk
# Specify the configuration to load.
loader = LoadDataFromDisk(
dataset_path="path/to/dataset",
is_distiset=True,
config="leaf_step_1"
)
loader.load()
# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'a': 1}, {'a': 2}, {'a': 3}], True)
Load data from a Hugging Face Dataset or Distiset in your cloud provider¶
from distilabel.steps import LoadDataFromDisk
loader = LoadDataFromDisk(
dataset_path="gcs://path/to/dataset",
storage_options={"project": "experiments-0001"}
)
loader.load()
# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)