# LoadDataFromFileSystem

Loads a dataset from a file in your filesystem.

`GeneratorStep` that creates a dataset from a file in the filesystem using the Hugging Face `datasets` library. Take a look at the Hugging Face Datasets documentation for more information about the supported file types.
## Attributes

- `data_files`: The path to the file, or directory containing the files, that make up the dataset.
- `split`: The split of the dataset to load (typically `train`, `test`, or `validation`).
## Runtime Parameters

- `batch_size`: The batch size to use when processing the data.
- `data_files`: The path to the file, or directory containing the files, that make up the dataset.
- `split`: The split of the dataset to load. Defaults to `'train'`.
- `streaming`: Whether to load the dataset in streaming mode or not. Defaults to `False`.
- `num_examples`: The number of examples to load from the dataset. By default, all examples are loaded.
- `storage_options`: Key/value pairs to be passed on to the file-system backend, if any. Defaults to `None`.
- `filetype`: The expected filetype. If not provided, it will be inferred from the file extension. If more than one file is given, it will be inferred from the first file.
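The extension-based inference described for `filetype` can be sketched with the standard library alone. Note that `infer_filetype` is a hypothetical helper for illustration, not part of distilabel's API:

```python
from pathlib import Path


def infer_filetype(data_files):
    """Infer the filetype from the file extension.

    Mirrors the rule above: when several files are given, only the
    first file's extension is consulted. Hypothetical helper, not
    the actual distilabel implementation.
    """
    files = [data_files] if isinstance(data_files, str) else list(data_files)
    suffix = Path(files[0]).suffix.lstrip(".")
    return suffix or None


infer_filetype("path/to/dataset.jsonl")  # 'jsonl'
infer_filetype(["a.csv", "b.jsonl"])     # 'csv' (first file wins)
```

When the extension carries no information (or is wrong, as in the `filetype="csv"` example below), passing `filetype` explicitly overrides this inference.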
## Input & Output Columns

```mermaid
graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[dynamic]
        end
    end

    subgraph LoadDataFromFileSystem
        StepOutput[Output Columns: dynamic]
    end

    StepOutput --> OCOL0
```
## Outputs

- **dynamic** (`all`): The columns that will be generated by this step, based on the files loaded from the file system.
## Examples

### Load data from a Hugging Face dataset in your file system

```python
from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(data_files="path/to/dataset.jsonl")
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
```
### Specify a filetype if the file extension is not expected

```python
from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(filetype="csv", data_files="path/to/dataset.txtr")
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
```
### Load data from a file in your cloud provider

```python
from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(
    data_files="gcs://path/to/dataset",
    storage_options={"project": "experiments-0001"}
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
```
### Load data passing a glob pattern

```python
from distilabel.steps import LoadDataFromFileSystem

loader = LoadDataFromFileSystem(
    data_files="path/to/dataset/*.jsonl",
    streaming=True
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)
```
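A glob pattern like the one above simply selects the matching files before loading. A minimal standard-library sketch of that expansion, using a throwaway temporary directory (no distilabel involved; the shard names are made up for illustration):

```python
import glob
import os
import tempfile

# Create a throwaway dataset directory with a few files.
tmpdir = tempfile.mkdtemp()
for name in ("shard-00.jsonl", "shard-01.jsonl", "notes.txt"):
    with open(os.path.join(tmpdir, name), "w") as f:
        f.write('{"text": "example"}\n')

# The same kind of pattern passed to `data_files` above:
matches = sorted(glob.glob(os.path.join(tmpdir, "*.jsonl")))

# Only the two .jsonl shards match; notes.txt is ignored.
print([os.path.basename(m) for m in matches])
# ['shard-00.jsonl', 'shard-01.jsonl']
```

Combined with `streaming=True`, the matched files are then read lazily instead of being loaded into memory up front.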