HuggingFaceHubCheckpointer

Special type of step that uploads the data to a Hugging Face Hub dataset.

A Step that uploads the data to a Hugging Face Hub dataset while it is being generated. The data is uploaded in JSONL format to a Hugging Face Dataset, which can be different from the one where the main distiset of the pipeline is saved. The data is checkpointed every input_batch_size inputs, and each checkpoint creates a new file in the repo_id repository. There will be a different config per leaf step of the pipeline, and the files for each config are numbered sequentially. Since a write happens every input_batch_size inputs, it is advisable not to set this value too low, as frequent uploads will slow down the process.
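Because each write creates a new, sequentially numbered file, the progress of a run can be inspected by listing the repository contents. A minimal sketch using huggingface_hub (the username/streaming_checkpoint repo_id is hypothetical):

from huggingface_hub import list_repo_files

# List the checkpoint files pushed so far (hypothetical repo_id).
# Expect sequentially numbered JSONL files, one group per leaf step.
files = list_repo_files("username/streaming_checkpoint", repo_type="dataset")
print(files)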

Attributes

  • repo_id: The ID of the repository to push to, in the format <user>/<dataset_name> or <org>/<dataset_name>. A bare <dataset_name> is also accepted, in which case it defaults to the namespace of the logged-in user.

  • private: Whether the dataset repository should be private. Only affects repository creation: an existing repository is not affected by this parameter.

  • token: An optional authentication token for the Hugging Face Hub. If not passed, defaults to the token saved locally when logging in with huggingface-cli login. An error is raised if no token is passed and the user is not logged in.
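As noted for token, a locally saved token is picked up automatically, so logging in once beforehand is enough. A minimal sketch of logging in programmatically instead of via the CLI:

from huggingface_hub import login

# Prompts for a token and saves it locally, equivalent to running
# `huggingface-cli login`; the checkpointer then falls back to it
# when no `token` is passed explicitly.
login()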

Input & Output Columns

This step does not declare any specific input or output columns, as the diagram below reflects.

graph TD
    subgraph Dataset
    end

    subgraph HuggingFaceHubCheckpointer
    end

Examples

Checkpoint the data generated by a pipeline to a Hugging Face Hub dataset

from datasets import Dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import HuggingFaceHubCheckpointer
from distilabel.steps.tasks import TextGeneration

# Create a dummy dataset
dataset = Dataset.from_dict({"instruction": ["tell me lies"] * 100})

with Pipeline(name="pipeline-with-checkpoints") as pipeline:
    text_generation = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),
        template="Follow the following instruction: {{ instruction }}"
    )
    checkpoint = HuggingFaceHubCheckpointer(
        repo_id="username/streaming_checkpoint",
        private=True,
        input_batch_size=50  # Pushes a new file to the Hub every 50 inputs
    )
    text_generation >> checkpoint

if __name__ == "__main__":
    # Run the pipeline on the dummy dataset; checkpoints are pushed
    # to the Hub while the data is being generated.
    distiset = pipeline.run(dataset=dataset)
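Once a few batches have been pushed, the checkpoints can be loaded back like any other Hub dataset. A minimal sketch, reusing the hypothetical username/streaming_checkpoint repository from the example (the "train" split name is an assumption; JSONL files pushed to the Hub are typically exposed under a train split):

from datasets import load_dataset

# Load the checkpointed rows back from the Hub (hypothetical repo_id;
# the "train" split name is an assumption for auto-detected JSONL files).
checkpoints = load_dataset("username/streaming_checkpoint", split="train")
print(checkpoints)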