LoadDataFromFileSystem¶
Loads a dataset from a file in your filesystem.
`GeneratorStep` that creates a dataset from a file in the filesystem using the Hugging Face `datasets`
    library. Take a look at Hugging Face Datasets
    for more information on the supported file types.
Attributes¶
- data_files: The path to the file, or the directory containing the files that make up the dataset.
- split: The split of the dataset to load (typically `train`, `test` or `validation`).
Runtime Parameters¶
- batch_size: The batch size to use when processing the data.
- data_files: The path to the file, or the directory containing the files that make up the dataset.
- split: The split of the dataset to load. Defaults to 'train'.
- streaming: Whether to load the dataset in streaming mode or not. Defaults to `False`.
- num_examples: The number of examples to load from the dataset. By default it will load all examples.
- storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to `None`.
- filetype: The expected filetype. If not provided, it will be inferred from the file extension. For more than one file, it will be inferred from the first file.
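The interplay between `batch_size` and `num_examples` can be illustrated with a plain-Python sketch (the function below is illustrative, not distilabel's internal implementation): rows are read, truncated to `num_examples`, and yielded in batches of `batch_size` together with a flag that marks the final batch, mirroring the `(batch, done)` tuples shown in the examples below.

```python
import json
import tempfile
from pathlib import Path

def load_jsonl_in_batches(path, batch_size=2, num_examples=None):
    """Illustrative generator-step sketch: yields (batch, is_last) tuples."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    if num_examples is not None:
        rows = rows[:num_examples]  # mimic the `num_examples` runtime parameter
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        is_last = start + batch_size >= len(rows)  # True only for the final batch
        yield batch, is_last

# Build a small throwaway JSONL dataset to run the sketch against.
tmp = Path(tempfile.mkdtemp()) / "dataset.jsonl"
tmp.write_text("\n".join(json.dumps({"instruction": f"q{i}"}) for i in range(5)))

batches = list(load_jsonl_in_batches(tmp, batch_size=2, num_examples=4))
# Two batches of 2 rows each; only the second one is flagged as the last.
```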
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[dynamic]
        end
    end
    subgraph LoadDataFromFileSystem
        StepOutput[Output Columns: dynamic]
    end
    StepOutput --> OCOL0
Outputs¶
- dynamic (all): The columns that will be generated by this step, based on the files loaded from the filesystem.
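Because the output columns are dynamic, they are simply the keys found in the loaded files. A minimal stdlib sketch of that idea (not distilabel's actual inference logic): read the first row of a JSONL file and take its keys as the columns.

```python
import json
import tempfile
from pathlib import Path

# Write a tiny JSONL file; its keys become the step's output columns.
tmp = Path(tempfile.mkdtemp()) / "dataset.jsonl"
tmp.write_text(json.dumps({"instruction": "What is 2+2?", "response": "4"}))

with open(tmp) as f:
    first_row = json.loads(f.readline())

columns = list(first_row)  # the "dynamic" output columns
```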
Examples¶
Load data from a Hugging Face dataset in your file system¶
from distilabel.steps import LoadDataFromFileSystem
loader = LoadDataFromFileSystem(data_files="path/to/dataset.jsonl")
loader.load()
# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
Specify a filetype if the file extension is not expected¶
from distilabel.steps import LoadDataFromFileSystem
loader = LoadDataFromFileSystem(filetype="csv", data_files="path/to/dataset.txtr")
loader.load()
# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
Load data from a file in your cloud provider¶
from distilabel.steps import LoadDataFromFileSystem
loader = LoadDataFromFileSystem(
    data_files="gcs://path/to/dataset",
    storage_options={"project": "experiments-0001"}
)
loader.load()
# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
Load data passing a glob pattern¶
from distilabel.steps import LoadDataFromFileSystem
loader = LoadDataFromFileSystem(
    data_files="path/to/dataset/*.jsonl",
    streaming=True
)
loader.load()
# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)