Steps Gallery
- 
DBSCAN 
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core 
- 
UMAP 
 UMAP is a general purpose manifold learning and dimension reduction algorithm. 
- 
DeitaFiltering 
 Filter dataset rows using DEITA filtering strategy. 
- 
FaissNearestNeighbour 
 Create a faiss index to get the nearest neighbours.
- 
EmbeddingDedup 
 Deduplicates text using embeddings. 
- 
PushToHub 
 Push data to a Hugging Face Hub dataset. 
- 
PreferenceToArgilla 
 Creates a preference dataset in Argilla. 
- 
TextGenerationToArgilla 
 Creates a text generation dataset in Argilla. 
- 
CombineOutputs 
 Combine the outputs of several upstream steps. 
- 
ExpandColumns 
 Expand columns that contain lists into multiple rows. 
- 
GroupColumns 
 Combines columns from a list of StepInput.
- 
CombineColumns 
 CombineColumns is deprecated and will be removed in version 1.5.0; use GroupColumns instead.
- 
KeepColumns 
 Keeps selected columns in the dataset. 
- 
MergeColumns 
 Merge columns from a row. 
- 
EmbeddingGeneration 
 Generate embeddings using an Embeddings model.
- 
MinHashDedup 
 Deduplicates text using MinHash and MinHashLSH.
- 
ConversationTemplate 
 Generate a conversation template from an instruction and a response. 
- 
FormatTextGenerationDPO 
 Format the output of your LLMs for Direct Preference Optimization (DPO). 
- 
FormatChatGenerationDPO 
 Format the output of a combination of a ChatGeneration task and a preference task for Direct Preference Optimization (DPO).
- 
FormatTextGenerationSFT 
 Format the output of a TextGeneration task for Supervised Fine-Tuning (SFT).
- 
FormatChatGenerationSFT 
 Format the output of a ChatGeneration task for Supervised Fine-Tuning (SFT).
- 
RewardModelScore 
 Assign a score to a response using a Reward Model. 
- 
TruncateTextColumn 
 Truncate a row using a tokenizer or the number of characters. 
- 
LoadDataFromDicts 
 Loads a dataset from a list of dictionaries. 
- 
LoadDataFromHub 
 Loads a dataset from the Hugging Face Hub. 
- 
LoadDataFromFileSystem 
 Loads a dataset from a file in your filesystem. 
- 
LoadDataFromDisk 
 Load a dataset that was previously saved to disk.
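
To make the row-level effect of two of the steps listed above more concrete, here is a minimal plain-Python sketch of what ExpandColumns ("expand columns that contain lists into multiple rows") and KeepColumns ("keeps selected columns") describe. It does not use the library: the helper names `expand_column` and `keep_columns`, the column names, and the sample row are all hypothetical, and the actual steps may differ in signature and behaviour.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def expand_column(rows: List[Row], column: str) -> List[Row]:
    """Hypothetical helper: emit one row per element of the list in `column`."""
    expanded: List[Row] = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)      # shallow copy so the input row is untouched
            new_row[column] = item   # replace the list with a single element
            expanded.append(new_row)
    return expanded


def keep_columns(rows: List[Row], columns: List[str]) -> List[Row]:
    """Hypothetical helper: keep only the named columns, in the given order."""
    return [{name: row[name] for name in columns} for row in rows]


# Illustrative data only; not taken from any real dataset.
rows = [
    {
        "instruction": "Name two colours.",
        "generations": ["red", "blue"],
        "model": "example-model",
    }
]

expanded = expand_column(rows, "generations")
kept = keep_columns(expanded, ["instruction", "generations"])

for row in kept:
    print(row)
```

Each input row with a list of two generations becomes two rows, and the `model` column is dropped afterwards. In a real pipeline the equivalent steps would be connected to each other rather than called as functions.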