LoadDataFromFileSystem

Loads a dataset from a file in your filesystem.

A GeneratorStep that creates a dataset from a file in the filesystem using the Hugging Face datasets library. Take a look at the Hugging Face Datasets documentation for more information on the supported file types.

Attributes

  • data_files: The path to the file, or the directory containing the files, that make up the dataset.

  • split: The split of the dataset to load (typically train, test, or validation).

Runtime Parameters

  • batch_size: The batch size to use when processing the data.

  • data_files: The path to the file, or the directory containing the files, that make up the dataset.

  • split: The split of the dataset to load. Defaults to 'train'.

  • streaming: Whether to load the dataset in streaming mode or not. Defaults to False.

  • num_examples: The number of examples to load from the dataset. By default, all examples are loaded.

  • storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None.

  • filetype: The expected filetype. If not provided, it will be inferred from the file extension. When more than one file is provided, it is inferred from the first file.
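The filetype-inference rule above can be sketched in plain Python. This is an illustration of the documented behavior, not distilabel's actual implementation; the function name `infer_filetype` is hypothetical:

from os.path import splitext

def infer_filetype(data_files, filetype=None):
    """Mimic the documented rule: an explicit filetype wins; otherwise
    the extension of the (first) file is used."""
    if filetype is not None:
        return filetype
    # data_files may be a single path or a list of paths
    first = data_files[0] if isinstance(data_files, list) else data_files
    ext = splitext(first)[1].lstrip(".")
    return ext or None

print(infer_filetype("path/to/dataset.jsonl"))       # jsonl
print(infer_filetype(["a.csv", "b.txt"]))            # csv (first file wins)
print(infer_filetype("data.txtr", filetype="csv"))   # csv (explicit filetype wins)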

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[dynamic]
        end
    end

    subgraph LoadDataFromFileSystem
        StepOutput[Output Columns: dynamic]
    end

    StepOutput --> OCOL0

Outputs

  • dynamic (all): The columns that will be generated by this step, based on the files loaded from the filesystem.
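Because the columns are dynamic, they are only known once a file has been read. A minimal, self-contained sketch of how column names fall out of a JSONL file (illustrative only; distilabel delegates the actual loading to the datasets library):

import json
import os
import tempfile

# Write a tiny JSONL file so the example is self-contained.
rows = [
    {"instruction": "Say hi", "response": "Hi!"},
    {"instruction": "Add 2+2", "response": "4"},
]
path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# The output columns are simply the keys of the loaded records.
with open(path) as f:
    records = [json.loads(line) for line in f]
columns = list(records[0].keys())
print(columns)  # ['instruction', 'response']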

Examples

Load data from a Hugging Face dataset in your file system

from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(data_files="path/to/dataset.jsonl")
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)

Specify a filetype if the file extension is not expected

from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(filetype="csv", data_files="path/to/dataset.txtr")
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)

Load data from a file in your cloud provider

from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(
    data_files="gcs://path/to/dataset",
    storage_options={"project": "experiments-0001"}
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)

Load data passing a glob pattern

from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(
    data_files="path/to/dataset/*.jsonl",
    streaming=True
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
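Streaming with a glob pattern amounts to expanding the pattern and reading matching files lazily, instead of materializing the whole dataset up front. A rough stand-in for that behavior in plain Python (distilabel and the datasets library handle this internally; `stream_records` is a hypothetical helper used only to illustrate the lazy semantics):

import glob
import json
import os
import tempfile

# Create two small JSONL shards for the glob to match.
base = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(base, f"shard-{i}.jsonl"), "w") as f:
        f.write(json.dumps({"shard": i}) + "\n")

def stream_records(pattern):
    """Yield records one at a time rather than loading every file eagerly."""
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                yield json.loads(line)

records = list(stream_records(os.path.join(base, "*.jsonl")))
print(records)  # [{'shard': 0}, {'shard': 1}]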