Global Steps

PushToHub

Bases: GlobalStep

A GlobalStep which creates a datasets.Dataset with the input data and pushes it to the Hugging Face Hub.

Attributes:
  • repo_id (RuntimeParameter[str]): The Hugging Face Hub repository ID where the dataset will be uploaded.
  • split (RuntimeParameter[str]): The split of the dataset that will be pushed. Defaults to "train".
  • private (RuntimeParameter[bool]): Whether the pushed dataset should be private. Defaults to False.
  • token (Optional[RuntimeParameter[str]]): The token used to authenticate with the Hub. If not provided, it will be read from the HF_TOKEN environment variable; failing that, the huggingface_hub library will fall back to the token stored by the local Hugging Face CLI configuration. Defaults to None.

Runtime parameters
  • repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.
  • split: The split of the dataset that will be pushed.
  • private: Whether the dataset to be pushed should be private or not.
  • token: The token that will be used to authenticate in the Hub.
Input columns
  • dynamic, based on the existing data within inputs
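
A minimal usage sketch, assuming the standard distilabel pipeline API: the repository ID and the input rows below are placeholders, and actually pushing requires a valid Hub token (e.g. via the HF_TOKEN environment variable).

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, PushToHub

with Pipeline(name="push-to-hub-example") as pipeline:
    # Placeholder input data; any columns work, as the step's inputs are dynamic.
    load = LoadDataFromDicts(
        data=[
            {"instruction": "What is the capital of France?", "response": "Paris."},
        ]
    )
    push = PushToHub(repo_id="my-username/my-dataset")  # placeholder repo ID
    load >> push

if __name__ == "__main__":
    # `repo_id`, `split`, `private` and `token` are runtime parameters,
    # so they can also be set when running the pipeline:
    pipeline.run(parameters={push.name: {"split": "train", "private": False}})
```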
Source code in src/distilabel/steps/globals/huggingface.py
class PushToHub(GlobalStep):
    """A `GlobalStep` which creates a `datasets.Dataset` with the input data and pushes
    it to the Hugging Face Hub.

    Attributes:
        repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.
        split: The split of the dataset that will be pushed. Defaults to `"train"`.
        private: Whether the dataset to be pushed should be private or not. Defaults to
            `False`.
        token: The token that will be used to authenticate in the Hub. If not provided,
            it will be read from the `HF_TOKEN` environment variable; failing that, the
            `huggingface_hub` library will try to use the token stored by the local
            Hugging Face CLI configuration. Defaults to `None`.

    Runtime parameters:
        - `repo_id`: The Hugging Face Hub repository ID where the dataset will be uploaded.
        - `split`: The split of the dataset that will be pushed.
        - `private`: Whether the dataset to be pushed should be private or not.
        - `token`: The token that will be used to authenticate in the Hub.

    Input columns:
        - dynamic, based on the existing data within inputs
    """

    repo_id: RuntimeParameter[str] = Field(
        default=None,
        description="The Hugging Face Hub repository ID where the dataset will be uploaded.",
    )
    split: RuntimeParameter[str] = Field(
        default="train",
        description="The split of the dataset that will be pushed. Defaults to 'train'.",
    )
    private: RuntimeParameter[bool] = Field(
        default=False,
        description="Whether the dataset to be pushed should be private or not. Defaults"
        " to `False`.",
    )
    token: Optional[RuntimeParameter[str]] = Field(
        default=None,
        description="The token that will be used to authenticate in the Hub. If not provided,"
        " it will be read from the `HF_TOKEN` environment variable; failing that, the"
        " `huggingface_hub` library will try to use the token stored by the local Hugging"
        " Face CLI configuration. Defaults to `None`.",
    )

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Method that processes the input data, respecting the `datasets.Dataset` formatting,
        and pushes it to the Hugging Face Hub based on the `RuntimeParameter`s attributes.

        Args:
            inputs: The input data, received as a single batch (as it's a `GlobalStep`),
                that will be transformed into a `datasets.Dataset`.

        Yields:
            Propagates the received inputs so that the `Distiset` can be generated if this
            is the last step of the `Pipeline`, or if this is not a leaf step and has
            follow-up steps.
        """
        dataset_dict = defaultdict(list)
        for input in inputs:
            for key, value in input.items():
                dataset_dict[key].append(value)
        dataset_dict = dict(dataset_dict)
        dataset = Dataset.from_dict(dataset_dict)
        dataset.push_to_hub(
            self.repo_id,  # type: ignore
            split=self.split,
            private=self.private,
            token=self.token or os.getenv("HF_TOKEN"),
        )
        yield inputs
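
The row-to-column pivot that `process` performs before calling `Dataset.from_dict` can be reproduced with the standard library alone; this sketch uses made-up input rows.

```python
from collections import defaultdict

# Hypothetical row-oriented inputs, as a GlobalStep would receive them.
inputs = [
    {"instruction": "Summarize the article", "generation": "The article says..."},
    {"instruction": "Translate to French", "generation": "Bonjour"},
]

# Pivot rows into columns: each key maps to the list of its values,
# which is the format `datasets.Dataset.from_dict` expects.
dataset_dict = defaultdict(list)
for row in inputs:
    for key, value in row.items():
        dataset_dict[key].append(value)

dataset_dict = dict(dataset_dict)
print(dataset_dict["instruction"])  # → ['Summarize the article', 'Translate to French']
```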

process(inputs)

Method that processes the input data, respecting the datasets.Dataset formatting, and pushes it to the Hugging Face Hub based on the RuntimeParameter attributes.

Parameters:
  • inputs (StepInput, required): The input data, received as a single batch (as it's a GlobalStep), that will be transformed into a datasets.Dataset.

Yields:
  • StepOutput: Propagates the received inputs so that the Distiset can be generated if this is the last step of the Pipeline, or if this is not a leaf step and has follow-up steps.
