Skip to content

Steps Gallery

Category Overview

The gallery page showcases the different types of components within distilabel.

Icon Category Description
text-generation Text generation steps are used to generate text based on a given prompt.
chat-generation Chat generation steps are used to generate text based on a conversation.
text-classification Text classification steps are used to classify text into a category.
text-manipulation Text manipulation steps are used to manipulate or rewrite an input text.
evol Evol steps are used to rewrite input text and evolve it to a higher quality.
critique Critique steps are used to provide feedback on the quality of the data with a written explanation.
scorer Scorer steps are used to evaluate and score the data with a numerical value.
preference Preference steps are used to collect preferences on the data with numerical values or ranks.
embedding Embedding steps are used to generate embeddings for the data.
clustering Clustering steps are used to group similar data points together.
columns Columns steps are used to manipulate columns in the data.
filtering Filtering steps are used to filter the data based on some criteria.
format Format steps are used to format the data.
load Load steps are used to load the data.
save Save steps are used to save the data.
  • PreferenceToArgilla


    Creates a preference dataset in Argilla.

    PreferenceToArgilla

  • TextGenerationToArgilla


    Creates a text generation dataset in Argilla.

    TextGenerationToArgilla

  • CombineColumns


    CombineColumns is deprecated and will be removed in version 1.5.0, use GroupColumns instead.

    CombineColumns

  • PushToHub


    Push data to a Hugging Face Hub dataset.

    PushToHub

  • LoadDataFromDicts


    Loads a dataset from a list of dictionaries.

    LoadDataFromDicts

  • LoadDataFromHub


    Loads a dataset from the Hugging Face Hub.

    LoadDataFromHub

  • LoadDataFromFileSystem


    Loads a dataset from a file in your filesystem.

    LoadDataFromFileSystem

  • LoadDataFromDisk


    Load a dataset that was previously saved to disk.

    LoadDataFromDisk

  • ConversationTemplate


    Generate a conversation template from an instruction and a response.

    ConversationTemplate

  • FormatTextGenerationDPO


    Format the output of your LLMs for Direct Preference Optimization (DPO).

    FormatTextGenerationDPO

  • FormatChatGenerationDPO


    Format the output of a combination of a ChatGeneration + a preference task for Direct Preference Optimization (DPO).

    FormatChatGenerationDPO

  • FormatTextGenerationSFT


    Format the output of a TextGeneration task for Supervised Fine-Tuning (SFT).

    FormatTextGenerationSFT

  • FormatChatGenerationSFT


    Format the output of a ChatGeneration task for Supervised Fine-Tuning (SFT).

    FormatChatGenerationSFT

  • DeitaFiltering


    Filter dataset rows using DEITA filtering strategy.

    DeitaFiltering

  • EmbeddingDedup


    Deduplicates text using embeddings.

    EmbeddingDedup

  • MinHashDedup


    Deduplicates text using MinHash and MinHashLSH.

    MinHashDedup

  • CombineOutputs


    Combine the outputs of several upstream steps.

    CombineOutputs

  • ExpandColumns


    Expand columns that contain lists into multiple rows.

    ExpandColumns

  • GroupColumns


    Combines columns from a list of StepInput.

    GroupColumns

  • KeepColumns


    Keeps selected columns in the dataset.

    KeepColumns

  • MergeColumns


    Merge columns from a row.

    MergeColumns

  • DBSCAN


    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core

    DBSCAN

  • UMAP


    UMAP is a general purpose manifold learning and dimension reduction algorithm.

    UMAP

  • FaissNearestNeighbour


    Create a faiss index to get the nearest neighbours.

    FaissNearestNeighbour

  • EmbeddingGeneration


    Generate embeddings using an Embeddings model.

    EmbeddingGeneration

  • RewardModelScore


    Assign a score to a response using a Reward Model.

    RewardModelScore

  • TruncateTextColumn


    Truncate a row using a tokenizer or the number of characters.

    TruncateTextColumn