Global Steps
Global steps are steps that need access to all the data at once in order to do their processing. Examples include creating a dataset to be pushed to the Hugging Face Hub, or a filtering step in a Pipeline.
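To make the distinction concrete, here is a minimal, library-independent sketch in plain Python. It does not use distilabel's API: the names `normal_step`, `global_step`, and `run` are hypothetical and only illustrate the idea. A normal step can handle each batch in isolation, whereas a global step such as deduplication only gives the right answer once every batch has been collected.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def normal_step(batch: List[Row]) -> List[Row]:
    """Batch-wise step: each row is processed independently of the others."""
    return [{**row, "text": row["text"].strip().lower()} for row in batch]


def global_step(rows: List[Row]) -> List[Row]:
    """Global step: deduplication is only correct when it sees every row."""
    seen = set()
    unique = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            unique.append(row)
    return unique


def run(batches: Iterable[List[Row]]) -> List[Row]:
    """Apply the normal step per batch, then the global step on everything."""
    collected: List[Row] = []
    for batch in batches:
        collected.extend(normal_step(batch))
    # The global step runs once, after all batches have been accumulated.
    return global_step(collected)


batches = [
    [{"text": " Hello "}, {"text": "world"}],
    [{"text": "hello"}, {"text": "World "}],
]
result = run(batches)
print(result)  # [{'text': 'hello'}, {'text': 'world'}]
```

Running the deduplication on each batch separately would keep all four rows, because the duplicates sit in different batches. That is the reason a step of this kind has to wait for the full dataset.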
Push data to Hugging Face Hub in batches
The first example of a global step is PushToHub:
import os

from distilabel.pipeline.local import Pipeline
from distilabel.steps.globals.huggingface import PushToHub

push_to_hub = PushToHub(
    name="push_to_hub",
    repo_id="org/dataset-name",
    split="train",
    private=False,
    # Read the token from the environment instead of hard-coding it.
    token=os.getenv("HF_API_TOKEN"),
    pipeline=Pipeline(name="push-pipeline"),
)
This step can be used to push batches of the dataset to the Hugging Face Hub as the process advances, enabling a checkpoint strategy in your pipeline.
Data Filtering
Some pipelines need to filter data according to some criteria. For example, DeitaFiltering decides which examples to keep so that the final dataset has enough diversity. This step is covered in its own section because it may be difficult to follow out of context.
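As a rough illustration of why diversity filtering is a global operation, the sketch below greedily keeps the highest-scoring examples whose embedding is far enough from everything already kept. This is a simplified, assumed version of the general idea and not the DeitaFiltering implementation: the function names, the `threshold` and `max_size` parameters, and the toy embeddings and scores are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def diversity_filter(
    embeddings: Sequence[Sequence[float]],
    scores: Sequence[float],
    threshold: float,
    max_size: int,
) -> List[int]:
    """Greedy selection: visit examples from best to worst score and keep one
    only if it is at least `threshold` away from every example kept so far."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) >= max_size:
            break
        if all(
            cosine_distance(embeddings[i], embeddings[j]) >= threshold
            for j in kept
        ):
            kept.append(i)
    return kept


# Toy data (hypothetical): example 1 is a near-duplicate of example 0.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
scores = [0.9, 0.8, 0.7, 0.6]

kept = diversity_filter(embeddings, scores, threshold=0.1, max_size=3)
print(kept)  # [0, 2, 3]
```

Each decision depends on every example kept before it, and the ranking depends on all scores, so the selection cannot be made one batch at a time.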