Steps Gallery
Category Overview
The gallery page showcases the different types of components within distilabel.
| Category | Description |
|---|---|
| text-generation | Text generation steps are used to generate text based on a given prompt. |
| chat-generation | Chat generation steps are used to generate text based on a conversation. |
| text-classification | Text classification steps are used to classify text into a category. |
| text-manipulation | Text manipulation steps are used to manipulate or rewrite an input text. |
| evol | Evol steps are used to rewrite input text and evolve it to a higher quality. |
| critique | Critique steps are used to provide feedback on the quality of the data with a written explanation. |
| scorer | Scorer steps are used to evaluate and score the data with a numerical value. |
| preference | Preference steps are used to collect preferences on the data with numerical values or ranks. |
| embedding | Embedding steps are used to generate embeddings for the data. |
| clustering | Clustering steps are used to group similar data points together. |
| columns | Columns steps are used to manipulate columns in the data. |
| filtering | Filtering steps are used to filter the data based on some criteria. |
| format | Format steps are used to format the data. |
| load | Load steps are used to load the data. |
| execution | Executes Python functions. |
| save | Save steps are used to save the data. |
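To make the categories above more concrete, here is a minimal sketch of how load, filtering, and columns steps might be chained. It assumes a step can be modelled as a function from a batch (a list of row dictionaries) to a new batch. The names `load_rows`, `filter_rows`, `keep_columns`, and `run` are hypothetical and are not distilabel's API.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Batch = List[Row]
Step = Callable[[Batch], Batch]


def load_rows() -> Batch:
    """Stand-in for a load step: returns an in-memory batch of rows."""
    return [
        {"instruction": "Name a prime number.", "response": "7", "score": 0.9},
        {"instruction": "Name a colour.", "response": "", "score": 0.2},
    ]


def filter_rows(predicate: Callable[[Row], bool]) -> Step:
    """Stand-in for a filtering step: keeps rows that satisfy `predicate`."""
    def step(batch: Batch) -> Batch:
        return [row for row in batch if predicate(row)]
    return step


def keep_columns(columns: List[str]) -> Step:
    """Stand-in for a columns step: keeps only the named columns."""
    def step(batch: Batch) -> Batch:
        return [{name: row[name] for name in columns} for row in batch]
    return step


def run(batch: Batch, steps: List[Step]) -> Batch:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        batch = step(batch)
    return batch


result = run(
    load_rows(),
    [
        filter_rows(lambda row: row["score"] >= 0.5),
        keep_columns(["instruction", "response"]),
    ],
)
```

Here `result` holds only the first row, reduced to its `instruction` and `response` columns. The steps listed in the rest of this gallery are the library's own components for these roles.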
-
PreferenceToArgilla
Creates a preference dataset in Argilla.
-
TextGenerationToArgilla
Creates a text generation dataset in Argilla.
-
CombineColumns
`CombineColumns` is deprecated and will be removed in version 1.5.0, use `GroupColumns` instead.
-
PushToHub
Push data to a Hugging Face Hub dataset.
-
LoadDataFromDicts
Loads a dataset from a list of dictionaries.
-
DataSampler
Step to sample from a dataset.
-
LoadDataFromHub
Loads a dataset from the Hugging Face Hub.
-
LoadDataFromFileSystem
Loads a dataset from a file in your filesystem.
-
LoadDataFromDisk
Load a dataset that was previously saved to disk.
-
PrepareExamples
Helper step to create examples from `query` and `answers` pairs used as few-shot examples in APIGen.
-
ConversationTemplate
Generate a conversation template from an instruction and a response.
-
FormatTextGenerationDPO
Format the output of your LLMs for Direct Preference Optimization (DPO).
-
FormatChatGenerationDPO
Format the output of a combination of a `ChatGeneration` + a preference task for Direct Preference Optimization (DPO).
-
FormatTextGenerationSFT
Format the output of a `TextGeneration` task for Supervised Fine-Tuning (SFT).
-
FormatChatGenerationSFT
Format the output of a `ChatGeneration` task for Supervised Fine-Tuning (SFT).
-
DeitaFiltering
Filter dataset rows using the DEITA filtering strategy.
-
EmbeddingDedup
Deduplicates text using embeddings.
-
APIGenExecutionChecker
Executes the generated function calls.
-
MinHashDedup
Deduplicates text using `MinHash` and `MinHashLSH`.
-
CombineOutputs
Combine the outputs of several upstream steps.
-
ExpandColumns
Expand columns that contain lists into multiple rows.
-
GroupColumns
Combines columns from a list of `StepInput`.
-
KeepColumns
Keeps selected columns in the dataset.
-
MergeColumns
Merge columns from a row.
-
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core
-
UMAP
UMAP is a general purpose manifold learning and dimension reduction algorithm.
-
FaissNearestNeighbour
Create a `faiss` index to get the nearest neighbours.
-
EmbeddingGeneration
Generate embeddings using an `Embeddings` model.
-
RewardModelScore
Assign a score to a response using a Reward Model.
-
TruncateTextColumn
Truncate a row using a tokenizer or the number of characters.
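
The two truncation modes mentioned for `TruncateTextColumn` can be illustrated with a small plain-Python sketch. This is not the step's implementation: `truncate_row` and its parameters are hypothetical names, and `str.split` stands in for a real tokenizer purely for illustration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def truncate_row(
    row: Row,
    column: str,
    max_length: int,
    tokenize: Optional[Callable[[str], List[str]]] = None,
) -> Row:
    """Return a copy of `row` with `column` truncated.

    Without `tokenize`, keep the first `max_length` characters.
    With `tokenize`, keep the first `max_length` tokens, re-joined with spaces.
    """
    text = row[column]
    if tokenize is None:
        truncated = text[:max_length]
    else:
        truncated = " ".join(tokenize(text)[:max_length])
    # Build a new dict so the input row is left untouched.
    return {**row, column: truncated}


row = {"text": "This is a sample text that is longer than ten characters"}
by_chars = truncate_row(row, "text", max_length=10)
by_tokens = truncate_row(row, "text", max_length=3, tokenize=str.split)
```

With this input, `by_chars["text"]` is the first ten characters of the text and `by_tokens["text"]` is its first three whitespace-separated words. A real tokenizer would usually split on subword units, so token-based limits would differ from this word-based stand-in.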