Command Line Interface (CLI)

Distilabel offers a CLI to explore and re-run existing Pipeline dumps: an existing dump can be inspected to see its steps, how they are connected, and the runtime parameters used, and it can be re-run with the same or different runtime parameters.

Available commands

The only available command as of the current version of distilabel is distilabel pipeline.

$ distilabel pipeline --help

 Usage: distilabel pipeline [OPTIONS] COMMAND [ARGS]...

 Commands to run and inspect Distilabel pipelines.

╭─ Options ───────────────────────────────────────────────────────────────────────────────╮
 --help          Show this message and exit.
╰─────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────╮
 info      Get information about a Distilabel pipeline.
 run       Run a Distilabel pipeline.
╰─────────────────────────────────────────────────────────────────────────────────────────╯

So, distilabel pipeline has two subcommands: info and run, described below. Note that for testing purposes we will be using the distilabel-internal-testing/instruction-dataset-mini-with-generations dataset from the Hugging Face Hub, whose pipeline.yaml is referenced in the examples below.

distilabel pipeline info

$ distilabel pipeline info --help

 Usage: distilabel pipeline info [OPTIONS]

 Get information about a Distilabel pipeline.

╭─ Options ───────────────────────────────────────────────────────────────────────────╮
 *  --config        TEXT  Path or URL to the Distilabel pipeline configuration file.
                          [default: None]
                          [required]
    --help                Show this message and exit.
╰─────────────────────────────────────────────────────────────────────────────────────╯

As we can see from the help message, we need to pass either a path or a URL to --config. The second option comes in handy for pipelines stored on the Hugging Face Hub, for example:

distilabel pipeline info --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"

If we take a look:

[Screenshot: output of distilabel pipeline info for the example pipeline]

The pipeline information includes the steps used in the Pipeline, the runtime parameters that were used, a description of each of them, and the connections between the steps. This can be helpful for exploring the Pipeline locally.

distilabel pipeline run

We can also run a Pipeline from the CLI by calling distilabel pipeline run and pointing to the same pipeline.yaml file, either as a local path or as a URL. Alternatively, a URL pointing to a Python script containing a distilabel pipeline can be used:

$ distilabel pipeline run --help

 Usage: distilabel pipeline run [OPTIONS]

 Run a Distilabel pipeline.

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
 --param                                  PARSE_RUNTIME_PARAM  [default: (dynamic)]
 --config                                 TEXT                 Path or URL to the Distilabel pipeline configuration file.
                                                               [default: None]
 --script                                 TEXT                 URL pointing to a python script containing a distilabel pipeline.
                                                               [default: None]
 --pipeline-variable-name                 TEXT                 Name of the pipeline in a script. I.e. the 'pipeline' variable in
                                                               `with Pipeline(...) as pipeline:...`.
                                                               [default: pipeline]
 --ignore-cache        --no-ignore-cache                       Whether to ignore the cache and re-run the pipeline from scratch.
                                                               [default: no-ignore-cache]
 --repo-id                                TEXT                 The Hugging Face Hub repository ID to push the resulting dataset to.
                                                               [default: None]
 --commit-message                         TEXT                 The commit message to use when pushing the dataset.
                                                               [default: None]
 --private             --no-private                            Whether to make the resulting dataset private on the Hub.
                                                               [default: no-private]
 --token                                  TEXT                 The Hugging Face Hub API token to use when pushing the dataset.
                                                               [default: None]
 --help                                                        Show this message and exit.
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

With the --config option, we must pass a path or URL to a pipeline.yaml file. To specify the runtime parameters of the steps, we use the --param option, passing each parameter and its value in a dotted step_name.param_name=value format, where nested parameters are separated by additional dots:

distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml" \
    --param load_dataset.repo_id=distilabel-internal-testing/instruction-dataset-mini \
    --param load_dataset.split=test \
    --param generate_with_gpt35.llm.generation_kwargs.max_new_tokens=512 \
    --param generate_with_gpt35.llm.generation_kwargs.temperature=0.7 \
    --param to_argilla.dataset_name=text_generation_with_gpt35 \
    --param to_argilla.dataset_workspace=admin
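
Each dot in a --param key descends one level into a step's runtime parameters. For reference, the command above is roughly equivalent to calling pipeline.run() in Python with the following nested parameters dict; this is only a sketch, assuming the corresponding Pipeline object is already available as pipeline, and the step names are the ones from the example pipeline:

# Nested runtime parameters mirroring the dotted --param keys above.
parameters = {
    "load_dataset": {
        "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
        "split": "test",
    },
    "generate_with_gpt35": {
        "llm": {
            "generation_kwargs": {"max_new_tokens": 512, "temperature": 0.7},
        },
    },
    "to_argilla": {
        "dataset_name": "text_generation_with_gpt35",
        "dataset_workspace": "admin",
    },
}
# pipeline.run(parameters=parameters)  # `pipeline` being the corresponding Pipeline object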

Or, using --script, we can directly pass a remote Python script (keep in mind --config and --script are mutually exclusive):

distilabel pipeline run --script "https://huggingface.co/datasets/distilabel-internal-testing/pipe_nothing_test/raw/main/pipe_nothing.py"

You can also pass runtime parameters to the Python script with --param, just as we did with the --config option.
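
To illustrate what such a script can look like, here is a minimal sketch (not the contents of the pipe_nothing.py example above). The CLI imports the script and looks up the module-level variable named by --pipeline-variable-name, which defaults to pipeline. The specific step and LLM classes used below are assumptions for illustration and may vary across distilabel versions:

# pipeline_script.py -- illustrative sketch of a script that --script can point to.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="cli-script-example") as pipeline:
    # A generator step providing a tiny in-memory dataset.
    load_data = LoadDataFromDicts(
        name="load_data",
        data=[{"instruction": "Write a haiku about the sea."}],
    )
    # A task step that generates a response with an LLM.
    text_generation = TextGeneration(
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    load_data.connect(text_generation)

if __name__ == "__main__":
    # Also allows running the script directly, outside the CLI.
    pipeline.run()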

Again, this helps with the reproducibility of the results and simplifies sharing not only the final dataset but also the process used to generate it.