Steps Gallery
Category Overview
The gallery page showcases the different types of components within distilabel.
| Category | Description |
|---|---|
| text-generation | Text generation steps are used to generate text based on a given prompt. |
| chat-generation | Chat generation steps are used to generate text based on a conversation. |
| text-classification | Text classification steps are used to classify text into a category. |
| text-manipulation | Text manipulation steps are used to manipulate or rewrite an input text. |
| evol | Evol steps are used to rewrite input text and evolve it to a higher quality. |
| critique | Critique steps are used to provide feedback on the quality of the data with a written explanation. |
| scorer | Scorer steps are used to evaluate and score the data with a numerical value. |
| preference | Preference steps are used to collect preferences on the data with numerical values or ranks. |
| embedding | Embedding steps are used to generate embeddings for the data. |
| clustering | Clustering steps are used to group similar data points together. |
| columns | Columns steps are used to manipulate columns in the data. |
| filtering | Filtering steps are used to filter the data based on some criteria. |
| format | Format steps are used to format the data. |
| load | Load steps are used to load the data. |
| execution | Execution steps are used to execute Python functions. |
| save | Save steps are used to save the data. |
| image-generation | Image generation steps are used to generate images based on a given prompt. |
| labelling | Labelling steps are used to label the data. |
- 
PreferenceToArgilla 
 Creates a preference dataset in Argilla. 
- 
TextGenerationToArgilla 
 Creates a text generation dataset in Argilla. 
- 
PushToHub 
 Push data to a Hugging Face Hub dataset. 
- 
LoadDataFromDicts 
 Loads a dataset from a list of dictionaries. 
- 
DataSampler 
 Step to sample from a dataset. 
- 
LoadDataFromHub 
 Loads a dataset from the Hugging Face Hub. 
- 
LoadDataFromFileSystem 
 Loads a dataset from a file in your filesystem. 
- 
LoadDataFromDisk 
 Load a dataset that was previously saved to disk. 
- 
PrepareExamples 
 Helper step to create examples from `query` and `answers` pairs used as few-shot examples in APIGen.
- 
ConversationTemplate 
 Generate a conversation template from an instruction and a response. 
- 
FormatTextGenerationDPO 
 Format the output of your LLMs for Direct Preference Optimization (DPO). 
- 
FormatChatGenerationDPO 
 Format the output of a combination of a `ChatGeneration` task and a preference task for Direct Preference Optimization (DPO).
- 
FormatTextGenerationSFT 
 Format the output of a `TextGeneration` task for Supervised Fine-Tuning (SFT).
- 
FormatChatGenerationSFT 
 Format the output of a `ChatGeneration` task for Supervised Fine-Tuning (SFT).
- 
DeitaFiltering 
 Filter dataset rows using DEITA filtering strategy. 
- 
EmbeddingDedup 
 Deduplicates text using embeddings. 
- 
APIGenExecutionChecker 
 Executes the generated function calls. 
- 
MinHashDedup 
 Deduplicates text using `MinHash` and `MinHashLSH`.
- 
CombineOutputs 
 Combine the outputs of several upstream steps. 
- 
ExpandColumns 
 Expand columns that contain lists into multiple rows. 
- 
GroupColumns 
 Combines columns from a list of StepInput.
- 
KeepColumns 
 Keeps selected columns in the dataset. 
- 
MergeColumns 
 Merge columns from a row. 
- 
DBSCAN 
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them.
- 
UMAP 
 UMAP is a general purpose manifold learning and dimension reduction algorithm. 
- 
FaissNearestNeighbour 
 Create a `faiss` index to get the nearest neighbours.
- 
EmbeddingGeneration 
 Generate embeddings using an `Embeddings` model.
- 
RewardModelScore 
 Assign a score to a response using a Reward Model. 
- 
FormatPRM 
 Helper step to transform the data into the format expected by the PRM model. 
- 
TruncateTextColumn 
 Truncate a row using a tokenizer or the number of characters.
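
To make two of the column steps listed above more concrete, here is a minimal plain-Python sketch of the behaviour their descriptions suggest: `ExpandColumns` turning a list-valued column into multiple rows, and `KeepColumns` retaining only selected columns. It does not import distilabel and is not the library's implementation. The helper names `expand_column` and `keep_columns`, the sample rows, and the choice to pass non-list values through unchanged are all assumptions made for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def expand_column(rows: List[Row], column: str) -> List[Row]:
    """Hypothetical helper: emit one row per element of a list-valued column."""
    expanded: List[Row] = []
    for row in rows:
        value = row[column]
        # Assumption: a non-list value is treated as a single-element list.
        items = value if isinstance(value, list) else [value]
        for item in items:
            # Copy the other columns and replace the list with one element.
            expanded.append({**row, column: item})
    return expanded


def keep_columns(rows: List[Row], columns: List[str]) -> List[Row]:
    """Hypothetical helper: keep only the selected columns, in the given order."""
    return [{name: row[name] for name in columns} for row in rows]


# Made-up sample data, not taken from any real dataset.
rows = [
    {"instruction": "Name a colour.", "generations": ["red", "blue"], "model": "m1"},
    {"instruction": "Name a fruit.", "generations": ["apple"], "model": "m1"},
]

expanded = expand_column(rows, "generations")
result = keep_columns(expanded, ["instruction", "generations"])

for row in result:
    print(row)
```

In this sketch the first input row becomes two output rows, one per generation, and the `model` column is dropped by the final selection. Note that a row whose list is empty produces no output rows here; whether the real step behaves the same way is not stated in this page.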