LoadDataFromDisk

Load a dataset that was previously saved to disk.

If you previously saved your dataset using the save_to_disk method of a Dataset, or using Distiset.save_to_disk, you can load it again to build a new pipeline with this class.

Attributes

  • dataset_path: The path to the dataset or distiset.

  • split: The split of the dataset to load (typically train, test or validation).

  • config: The configuration of the dataset to load. Defaults to default; if the dataset has multiple configurations, this must be supplied or an error is raised.

Runtime Parameters

  • batch_size: The batch size to use when processing the data.

  • dataset_path: The path to the dataset or distiset.

  • is_distiset: Whether the dataset to load is a Distiset or not. Defaults to False.

  • split: The split of the dataset to load. Defaults to 'train'.

  • config: The configuration of the dataset to load. Defaults to default; if the dataset has multiple configurations, this must be supplied or an error is raised.

  • num_examples: The number of examples to load from the dataset. By default will load all examples.

  • storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None.

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[dynamic]
        end
    end

    subgraph LoadDataFromDisk
        StepOutput[Output Columns: dynamic]
    end

    StepOutput --> OCOL0

Outputs

  • dynamic (all): The columns that will be generated by this step, based on the dataset loaded from disk.

Examples

Load data from a Hugging Face Dataset

from distilabel.steps import LoadDataFromDisk

loader = LoadDataFromDisk(dataset_path="path/to/dataset")
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)

Load data from a distilabel Distiset

from distilabel.steps import LoadDataFromDisk

# Specify the configuration to load.
loader = LoadDataFromDisk(
    dataset_path="path/to/dataset",
    is_distiset=True,
    config="leaf_step_1"
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'a': 1}, {'a': 2}, {'a': 3}], True)

Load data from a Hugging Face Dataset or Distiset in your cloud provider

from distilabel.steps import LoadDataFromDisk

loader = LoadDataFromDisk(
    dataset_path="gcs://path/to/dataset",
    storage_options={"project": "experiments-0001"}
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)