# LoadDataFromFileSystem

Loads a dataset from a file in your filesystem.

`GeneratorStep` that creates a dataset from a file in the filesystem using the Hugging Face `datasets` library. Take a look at the Hugging Face Datasets documentation for more information about the supported file types.
## Attributes

- `data_files`: The path to the file, or directory containing the files, that make up the dataset.
- `split`: The split of the dataset to load (typically `train`, `test`, or `validation`).
## Runtime Parameters

- `batch_size`: The batch size to use when processing the data.
- `data_files`: The path to the file, or directory containing the files, that make up the dataset.
- `split`: The split of the dataset to load. Defaults to `'train'`.
- `streaming`: Whether to load the dataset in streaming mode or not. Defaults to `False`.
- `num_examples`: The number of examples to load from the dataset. By default, all examples are loaded.
- `storage_options`: Key/value pairs to be passed on to the file-system backend, if any. Defaults to `None`.
- `filetype`: The expected filetype. If not provided, it will be inferred from the file extension. If more than one file is given, it will be inferred from the first file.
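The extension-based inference described for `filetype` can be sketched with the standard library alone. Note that `infer_filetype` is a hypothetical helper for illustration, not part of distilabel's API:

```python
from pathlib import Path


def infer_filetype(data_files):
    """Infer the filetype from the file extension.

    Mirrors the rule above: when several files are given, only the
    first file's extension is consulted. Hypothetical helper, not
    the actual distilabel implementation.
    """
    files = [data_files] if isinstance(data_files, str) else list(data_files)
    suffix = Path(files[0]).suffix.lstrip(".")
    return suffix or None


infer_filetype("path/to/dataset.jsonl")  # 'jsonl'
infer_filetype(["a.csv", "b.jsonl"])     # 'csv' (first file wins)
```

When the extension carries no information (or is wrong, as in the `filetype="csv"` example below), passing `filetype` explicitly overrides this inference.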
## Input & Output Columns

```mermaid
graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[dynamic]
        end
    end

    subgraph LoadDataFromFileSystem
        StepOutput[Output Columns: dynamic]
    end

    StepOutput --> OCOL0
```
## Outputs

- **dynamic** (`all`): The columns that will be generated by this step, based on the files loaded from the file system.
## Examples

### Load data from a Hugging Face dataset in your file system

```python
from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(data_files="path/to/dataset.jsonl")
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
```
### Specify a filetype if the file extension is not expected

```python
from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(filetype="csv", data_files="path/to/dataset.txtr")
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
```
### Load data from a file in your cloud provider

```python
from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(
    data_files="gcs://path/to/dataset",
    storage_options={"project": "experiments-0001"}
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
```
### Load data passing a glob pattern

```python
from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(
    data_files="path/to/dataset/*.jsonl",
    streaming=True
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
```
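A glob pattern like the one above simply selects the matching files before loading. A minimal standard-library sketch of that expansion, using a throwaway temporary directory (no distilabel involved; the shard names are made up for illustration):

```python
import glob
import os
import tempfile

# Create a throwaway dataset directory with a few files.
tmpdir = tempfile.mkdtemp()
for name in ("shard-00.jsonl", "shard-01.jsonl", "notes.txt"):
    with open(os.path.join(tmpdir, name), "w") as f:
        f.write('{"text": "example"}\n')

# The same kind of pattern passed to `data_files` above:
matches = sorted(glob.glob(os.path.join(tmpdir, "*.jsonl")))

# Only the two .jsonl shards match; notes.txt is ignored.
print([os.path.basename(m) for m in matches])
# ['shard-00.jsonl', 'shard-01.jsonl']
```

Combined with `streaming=True`, the matched files are then read lazily instead of being loaded into memory up front.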