Frequently Asked Questions (FAQ)

How can I rename the columns in a batch?

Every Step has both input_mappings and output_mappings attributes, which can be used to rename the columns in each batch.

input_mappings only maps column names for the Step it is set on: if you have a batch with the column A and you want to rename it to B, use input_mappings={"A": "B"}. The rename is applied to that specific Step alone, so the next step in the pipeline will still have the column A instead of B.

output_mappings, in contrast, does apply the rename to the data passed on: if the Step produces the column A and you want to rename it to B, use output_mappings={"A": "B"}, and the next Step in the pipeline will receive the column B.
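The difference between the two attributes can be illustrated with a small, library-free sketch. It is a toy model of the behaviour described in this answer, not distilabel's implementation: the helpers rename_columns and run_step and the column names are hypothetical, and the direction of the dictionaries simply follows the description in this answer.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def rename_columns(rows: List[Row], mapping: Dict[str, str]) -> List[Row]:
    """Return copies of the rows with columns renamed according to mapping."""
    return [{mapping.get(name, name): value for name, value in row.items()} for row in rows]


def run_step(
    batch: List[Row],
    process: Callable[[List[Row]], List[Row]],
    input_mappings: Dict[str, str],
    output_mappings: Dict[str, str],
) -> List[Row]:
    """Toy model of one step: returns the batch handed to the next step."""
    # input_mappings: the rename is visible only to this step's `process`.
    step_view = rename_columns(batch, input_mappings)
    produced = process(step_view)
    # output_mappings: the rename is applied to what this step produces.
    produced = rename_columns(produced, output_mappings)
    # The next step receives the original columns plus the renamed outputs.
    return [{**original, **new} for original, new in zip(batch, produced)]


def shout(rows: List[Row]) -> List[Row]:
    # Hypothetical step logic: reads column "B", produces column "A_out".
    return [{"A_out": str(row["B"]).upper()} for row in rows]


batch = [{"A": "hello"}]
next_batch = run_step(
    batch,
    shout,
    input_mappings={"A": "B"},
    output_mappings={"A_out": "result"},
)
print(next_batch)  # [{'A': 'hello', 'result': 'HELLO'}]
```

In this sketch the next step still sees the column A (the input rename was local to the step), while the produced column arrives under its new name result.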

Will the API Keys be exposed when sharing the pipeline?

No, API keys are masked using pydantic.SecretStr, so they won't be exposed when sharing the pipeline.

This also means that if you want to re-run your own pipeline and the API keys were provided via an attribute or a runtime parameter rather than an environment variable, you will need to provide them again.
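As a minimal sketch of the masking behaviour, the snippet below uses pydantic.SecretStr on its own. The key value is a made-up placeholder, and how distilabel wires SecretStr into its own classes is not shown here.

```python
from pydantic import SecretStr

# Placeholder value for illustration only, not a real key.
api_key = SecretStr("sk-not-a-real-key")

# Converting to a string (printing, logging) yields a masked value.
masked = str(api_key)

# The raw value is only returned through an explicit call.
raw = api_key.get_secret_value()

print(masked)
print(raw == "sk-not-a-real-key")
```

Because only the masked form is written out, a shared pipeline does not carry the real value, which is also why it has to be supplied again when the pipeline is re-run.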

Does it work on Windows?

Yes, but you may need to set the multiprocessing context in advance, to ensure that the spawn method is used, since the default method fork is not available on Windows.

import multiprocessing as mp

mp.set_start_method("spawn")

Will the custom Steps / Tasks / LLMs be serialized too?

No, at the moment only the references to the classes within the distilabel library will be serialized. If you define a custom class and use it within the pipeline, serialization won't break, but deserialization will fail because the class won't be available, unless it is used from the same file.

What happens if the pipeline fails? Do I lose all the data?

No, we use a cache mechanism to store all the intermediate results on disk, so that if a Step fails, the pipeline can be re-run from that point without losing the data, provided that nothing has changed in the Pipeline.

All the data will be stored in .cache/distilabel, but the only data that will persist at the end of the execution is the data from the leaf step(s), so bear that in mind.

For more information on the caching mechanism in distilabel, you can check the Learn - Advanced - Caching section.

Also note that when running a Step or a Task standalone, the cache mechanism won't be used, so if you want to use that, you should use the Pipeline context manager.

How can I use the same LLM across several tasks without having to load it several times?

You can serve the LLM using a solution like TGI or vLLM, and then connect to it using an AsyncLLM client like InferenceEndpointsLLM or OpenAILLM. Please refer to the Serving LLMs guide for more information.