Caching
Distilabel Pipelines automatically save all intermediate steps to avoid losing any data in case of error.
Cache directory
Out of the box, the Pipeline will use the ~/.cache/distilabel/pipelines directory to store the different pipelines.
This directory can be changed by setting the DISTILABEL_CACHE_DIR environment variable (export DISTILABEL_CACHE_DIR=my_cache_dir) or by explicitly passing the cache_dir argument to the Pipeline constructor.
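As a rough illustration of the two options above, the sketch below resolves a cache directory from an explicit argument first, then from the DISTILABEL_CACHE_DIR environment variable, and finally from the default path. The helper name `resolve_cache_dir` is hypothetical, and the precedence order (explicit argument wins over the environment variable) is an assumption made for the example, not distilabel's own code.

```python
import os
from pathlib import Path
from typing import Optional, Union

# Default location mentioned in this section.
DEFAULT_CACHE_DIR = Path("~/.cache/distilabel/pipelines").expanduser()


def resolve_cache_dir(cache_dir: Optional[Union[str, Path]] = None) -> Path:
    """Hypothetical helper: pick the directory a pipeline would cache into.

    Assumed precedence: explicit argument, then DISTILABEL_CACHE_DIR,
    then the default directory.
    """
    if cache_dir is not None:
        return Path(cache_dir).expanduser()
    env_value = os.environ.get("DISTILABEL_CACHE_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_CACHE_DIR
```

A path resolved this way could then be handed to the Pipeline constructor through its cache_dir argument.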
How does it work?
Let's take a look at the logging messages from a sample pipeline. When we run a Pipeline for the first time, there is nothing in the cache yet, so every step is executed from the start.
If we decide to stop the pipeline (say we kill the run altogether via CTRL + C), we will see in the logs that the signal is sent to the different workers.
After this step, when we run the pipeline again, the first log message we see corresponds to "Load pipeline from cache", and processing restarts from where it stopped.
Finally, if we decide to run the same Pipeline after it has finished completely, it won't process anything again. As we already have all the data processed, the results are simply loaded from the cache.
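The resume behaviour described above can be pictured with a toy model: results are written to disk batch by batch, and a rerun skips every batch that already has a file. This is only a sketch of the idea, using JSON files and hypothetical names (`run_with_cache`, `double`); distilabel's real mechanism relies on its batch manager and Parquet files.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def run_with_cache(
    batches: List[List[int]],
    process_batch: Callable[[List[int]], List[int]],
    cache_dir: Path,
) -> List[List[int]]:
    """Process batches in order, reusing any result already on disk."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for index, batch in enumerate(batches):
        batch_file = cache_dir / f"{index:05d}.json"
        if batch_file.exists():
            # Written by an earlier (possibly interrupted) run: load it, don't recompute.
            results.append(json.loads(batch_file.read_text()))
            continue
        output = process_batch(batch)
        batch_file.write_text(json.dumps(output))
        results.append(output)
    return results


calls = []  # records which batches were actually computed


def double(batch: List[int]) -> List[int]:
    calls.append(batch)
    return [value * 2 for value in batch]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_pipeline"
    first = run_with_cache([[1, 2], [3, 4]], double, cache)
    second = run_with_cache([[1, 2], [3, 4]], double, cache)  # served from the cache
```

In this toy run `double` is only invoked during the first call; the second call returns the same results purely from the files on disk.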
Serialization
Let's see what gets serialized by looking at a sample Pipeline's cached folder:
$ tree ~/.cache/distilabel/pipelines/73ca3f6b7a613fb9694db7631cc038d379f1f533
├── batch_manager.json
├── batch_manager_steps
│   ├── generate_response.json
│   └── rename_columns.json
├── data
│   └── generate_response
│       ├── 00001.parquet
│       └── 00002.parquet
└── pipeline.yaml
The Pipeline will have a signature created from the arguments that define it, which names its cache folder so we can find it afterwards. The contents of the folder are the following:
- batch_manager.json
  File that stores the content of the internal batch manager to keep track of the data. Along with the batch_manager_steps/ folder, it stores the information needed to restart the Pipeline. We shouldn't need to deal with these files directly.
- pipeline.yaml
  This file contains a representation of the Pipeline in YAML format. If we push a Distiset to the Hugging Face Hub as obtained from calling Pipeline.run, this file will be stored in our dataset's repository, allowing us to reproduce the Pipeline using the CLI.
- data/
  Folder that stores the generated data, with a dedicated sub-folder to keep track of each leaf_step separately. We can recreate a Distiset from the contents of this folder (Parquet files), as we will see next.
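To make the idea of a signature more concrete, the following sketch hashes the arguments that define a pipeline into a stable hexadecimal string that could serve as a folder name. The helper `pipeline_signature`, the choice of SHA-1, the JSON encoding, and the example definition are all assumptions made for illustration; distilabel's actual signature may be computed differently.

```python
import hashlib
import json
from typing import Any, Dict


def pipeline_signature(definition: Dict[str, Any]) -> str:
    """Hypothetical helper: derive a stable folder name from a pipeline definition."""
    # sort_keys makes the encoding independent of dict insertion order.
    encoded = json.dumps(definition, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


# Made-up definition, loosely based on the step names in the listing above.
definition = {
    "name": "toy_pipeline",
    "steps": [
        {"name": "generate_response"},
        {"name": "rename_columns"},
    ],
}
signature = pipeline_signature(definition)
```

Because the hash depends only on the definition, the same arguments always map to the same folder, while changing any of them yields a different one and therefore a fresh cache.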
create_distiset
If we want to regenerate the dataset from the cache, we can do so using the create_distiset function, passing the path to the data/ folder inside our Pipeline's cache folder:
from pathlib import Path

from distilabel.distiset import create_distiset

# Path does not expand "~" on its own, so call expanduser() explicitly
path = Path("~/.cache/distilabel/pipelines/73ca3f6b7a613fb9694db7631cc038d379f1f533/data").expanduser()
ds = create_distiset(path)
ds
# Distiset({
#     generate_response: DatasetDict({
#         train: Dataset({
#             features: ['instruction', 'response'],
#             num_rows: 80
#         })
#     })
# })
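Before rebuilding the dataset, it can help to check which leaf steps actually have cached data. The sketch below only uses the standard library and assumes the data/ layout shown in the folder listing earlier (one sub-folder per leaf step, containing Parquet files); `list_cached_batches` is a hypothetical helper, not part of distilabel.

```python
from pathlib import Path
from typing import Dict, List


def list_cached_batches(data_dir: Path) -> Dict[str, List[str]]:
    """Map each leaf-step folder under data/ to its sorted Parquet file names."""
    cached = {}
    for step_dir in sorted(data_dir.iterdir()):
        if step_dir.is_dir():
            cached[step_dir.name] = sorted(p.name for p in step_dir.glob("*.parquet"))
    return cached
```

Each key of the returned dictionary is the name of a leaf-step folder, and its value lists the Parquet files found inside it.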
Note
Internally, the function will try to inject the pipeline_path variable if it's not passed as an argument, assuming it's a file called pipeline.yaml in the parent directory of the data folder. If the file doesn't exist, no error is raised, but take into account that if the Distiset is pushed to the Hugging Face Hub, the pipeline.yaml won't be generated.
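The lookup described in the note can be sketched as follows. It mirrors the behaviour as stated above (use the explicit path if given, otherwise look for pipeline.yaml next to the data/ folder, and carry on quietly if it is missing). `find_pipeline_yaml` is a hypothetical name, and the body is an illustration rather than distilabel's source.

```python
from pathlib import Path
from typing import Optional, Union


def find_pipeline_yaml(
    data_dir: Union[str, Path],
    pipeline_path: Optional[Union[str, Path]] = None,
) -> Optional[Path]:
    """Return the pipeline.yaml to attach to the dataset, or None if there is none."""
    if pipeline_path is not None:
        # An explicitly passed path always takes priority.
        return Path(pipeline_path)
    candidate = Path(data_dir).parent / "pipeline.yaml"
    # A missing file is not an error: the caller simply gets None.
    return candidate if candidate.exists() else None
```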