LoadDataFromHub¶
Loads a dataset from the Hugging Face Hub.
GeneratorStep that loads a dataset from the Hugging Face Hub using the datasets library.
Attributes¶
- repo_id: The Hugging Face Hub repository ID of the dataset to load.
- split: The split of the dataset to load.
- config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
Runtime Parameters¶
- batch_size: The batch size to use when processing the data.
- repo_id: The Hugging Face Hub repository ID of the dataset to load.
- split: The split of the dataset to load. Defaults to 'train'.
- config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
- revision: The revision of the dataset to load. Defaults to the latest revision.
- streaming: Whether to load the dataset in streaming mode. Defaults to False.
- num_examples: The number of examples to load from the dataset. By default, all examples are loaded.
- storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None.
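To make the interaction between batch_size and num_examples concrete, here is a minimal pure-Python sketch of batching with an optional row limit. It is not distilabel's implementation: the function name `iter_batches` is hypothetical, and the shape of each yielded item (a list of rows plus a flag for the final batch) is assumed from the example output shown on this page.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple

Row = Dict[str, Any]


def iter_batches(
    rows: Iterable[Row],
    batch_size: int,
    num_examples: Optional[int] = None,
) -> Iterator[Tuple[List[Row], bool]]:
    """Yield (batch, is_last) pairs from `rows` (illustrative stand-in only)."""
    # islice(rows, None) applies no limit, mirroring "load all examples".
    limited = islice(rows, num_examples)
    current = list(islice(limited, batch_size))
    while current:
        # Look one batch ahead so the final batch can be flagged.
        upcoming = list(islice(limited, batch_size))
        yield current, not upcoming
        current = upcoming


if __name__ == "__main__":
    sample = [{"prompt": f"question {i}"} for i in range(5)]
    for batch, is_last in iter_batches(sample, batch_size=2, num_examples=3):
        print(len(batch), is_last)
```

With five rows, batch_size=2 and num_examples=3, only the first three rows are batched, and the second batch carries the final-batch flag.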
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[dynamic]
        end
    end
    subgraph LoadDataFromHub
        StepOutput[Output Columns: dynamic]
    end
    StepOutput --> OCOL0
Outputs¶
- dynamic (all): The columns generated by this step, which depend on the columns of the dataset loaded from the Hugging Face Hub.
Examples¶
Load data from a dataset in Hugging Face Hub¶
from distilabel.steps import LoadDataFromHub

# Configure the step; each yielded batch will contain `batch_size` rows.
loader = LoadDataFromHub(
    repo_id="distilabel-internal-testing/instruction-dataset-mini",
    split="test",
    batch_size=2,
)

# Load the dataset before calling `process`.
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'prompt': 'Arianna has 12...', False)
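The example above fetches only the first batch. To gather every row, the generator can be drained in a loop. The sketch below assumes that the boolean in each yielded tuple marks the final batch, which the output above suggests but this page does not state. `collect_rows` is a hypothetical helper, not part of distilabel, and it is exercised here with stand-in data so that it runs without network access.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def collect_rows(batches: Iterable[Tuple[List[Row], bool]]) -> List[Row]:
    """Flatten (batch, is_last) pairs into a single list of rows."""
    rows: List[Row] = []
    for batch, is_last in batches:
        rows.extend(batch)
        if is_last:
            # Assumption: a True flag means no further batches will follow.
            break
    return rows


if __name__ == "__main__":
    # Stand-in for `loader.process()`; with a loaded step this would be
    # `collect_rows(loader.process())`.
    fake_batches = [
        ([{"prompt": "a"}, {"prompt": "b"}], False),
        ([{"prompt": "c"}], True),
    ]
    print(len(collect_rows(fake_batches)))
```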