Using a file system to pass data of batches between steps
In some situations, the batches contain so much data that it is faster to write it to disk and read it back in the next step than to pass it through the queue. To handle this case, distilabel uses fsspec: the run method of a distilabel pipeline accepts a file system configuration and a flag indicating whether that file system should be used to pass data between steps, as the example below shows.
Warning
To use a specific file system or cloud storage, you need to install the package that provides the fsspec implementation for it. For instance, to use Google Cloud Storage you need to install gcsfs (for example, with pip install gcsfs).
For the available implementations, see "Other known implementations" in the fsspec documentation.
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(
        ...,
        storage_parameters={"path": "gcs://my-bucket"},
        use_fs_to_pass_data=True
    )
The code above sets up a file system (in this case Google Cloud Storage) and sets the use_fs_to_pass_data flag so that the data of the batches is passed to the steps through the file system. The storage_parameters argument is optional: if it is not provided but use_fs_to_pass_data==True, distilabel will use the local file system.
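To make the idea concrete, the following is a minimal sketch of the general technique: write a batch to disk and put only its path on the queue, so the next step loads the data itself. It uses only the Python standard library and is not distilabel's implementation. The helpers send_batch, receive_batch and demo, the JSON file format and the file naming are all assumptions made for illustration.

```python
from __future__ import annotations

import json
import queue
import tempfile
from pathlib import Path


def send_batch(q: queue.Queue, batch: list[dict], base_dir: Path, batch_id: int) -> None:
    """Hypothetical helper: write the batch to disk and enqueue only its path."""
    path = base_dir / f"batch_{batch_id}.json"
    path.write_text(json.dumps(batch))
    q.put(str(path))  # the queue carries a short string, not the batch itself


def receive_batch(q: queue.Queue) -> list[dict]:
    """Hypothetical helper: dequeue a path and load the batch it points to."""
    path = Path(q.get())
    batch = json.loads(path.read_text())
    path.unlink()  # the file is no longer needed once the batch is loaded
    return batch


def demo() -> list[dict]:
    """Round-trip one small batch through a temporary directory."""
    q: queue.Queue = queue.Queue()
    batch = [{"instruction": f"task {i}"} for i in range(3)]
    with tempfile.TemporaryDirectory() as tmp:
        send_batch(q, batch, Path(tmp), batch_id=0)
        return receive_batch(q)
```

In this sketch the cost of moving a batch through the queue stays constant regardless of batch size, since only the path is enqueued. A remote file system would presumably play the role of the temporary directory when steps do not share a local disk.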
Note
Because a GlobalStep receives all the data from the previous steps in one single accumulated batch, that batch is very likely to be too big to pass through the queue. In this case, even if use_fs_to_pass_data==False, distilabel will use the file system to pass the data to the GlobalStep.