Steps Gallery
Category Overview
The gallery page showcases the different types of components within distilabel.
Category | Description |
---|---|
text-generation | Text generation steps are used to generate text based on a given prompt. |
chat-generation | Chat generation steps are used to generate text based on a conversation. |
text-classification | Text classification steps are used to classify text into a category. |
text-manipulation | Text manipulation steps are used to manipulate or rewrite an input text. |
evol | Evol steps are used to rewrite input text and evolve it to a higher quality. |
critique | Critique steps are used to provide feedback on the quality of the data with a written explanation. |
scorer | Scorer steps are used to evaluate and score the data with a numerical value. |
preference | Preference steps are used to collect preferences on the data with numerical values or ranks. |
embedding | Embedding steps are used to generate embeddings for the data. |
clustering | Clustering steps are used to group similar data points together. |
columns | Columns steps are used to manipulate columns in the data. |
filtering | Filtering steps are used to filter the data based on some criteria. |
format | Format steps are used to format the data. |
load | Load steps are used to load the data. |
execution | Execution steps are used to execute Python functions. |
save | Save steps are used to save the data. |
image-generation | Image generation steps are used to generate images based on a given prompt. |
labelling | Labelling steps are used to label the data. |
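To make a few of these categories concrete, here is a minimal plain-Python sketch of how steps from the load, filtering, and columns categories might chain together. It does not use distilabel's API: the function names (`load_rows`, `filter_rows`, `keep_columns`), the `score` threshold, and the sample rows are all hypothetical, and each row is modeled as a plain dictionary.

```python
from typing import Any, Callable, Iterable, Iterator

Row = dict[str, Any]


def load_rows(data: list[Row]) -> Iterator[Row]:
    """Hypothetical 'load' step: yield rows from an in-memory list."""
    yield from data


def filter_rows(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Hypothetical 'filtering' step: keep rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def keep_columns(rows: Iterable[Row], columns: list[str]) -> Iterator[Row]:
    """Hypothetical 'columns' step: keep only the selected keys of each row."""
    for row in rows:
        yield {name: row[name] for name in columns}


# Made-up sample data for illustration only.
data = [
    {"instruction": "Summarize the text.", "score": 0.9, "source": "a"},
    {"instruction": "Translate to French.", "score": 0.2, "source": "b"},
]

# Chain the three stages: load, then filter, then select columns.
rows = load_rows(data)
rows = filter_rows(rows, lambda row: row["score"] >= 0.5)
result = list(keep_columns(rows, ["instruction", "score"]))
print(result)
```

Each stage consumes and produces rows, which is why steps from different categories can be composed in sequence.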
-
PreferenceToArgilla
Creates a preference dataset in Argilla.
-
TextGenerationToArgilla
Creates a text generation dataset in Argilla.
-
PushToHub
Push data to a Hugging Face Hub dataset.
-
LoadDataFromDicts
Loads a dataset from a list of dictionaries.
-
DataSampler
Step to sample from a dataset.
-
LoadDataFromHub
Loads a dataset from the Hugging Face Hub.
-
LoadDataFromFileSystem
Loads a dataset from a file in your filesystem.
-
LoadDataFromDisk
Load a dataset that was previously saved to disk.
-
PrepareExamples
Helper step to create examples from `query` and `answers` pairs used as Few Shots in APIGen.
-
ConversationTemplate
Generate a conversation template from an instruction and a response.
-
FormatTextGenerationDPO
Format the output of your LLMs for Direct Preference Optimization (DPO).
-
FormatChatGenerationDPO
Format the output of a combination of a `ChatGeneration` + a preference task for Direct Preference Optimization (DPO).
-
FormatTextGenerationSFT
Format the output of a `TextGeneration` task for Supervised Fine-Tuning (SFT).
-
FormatChatGenerationSFT
Format the output of a `ChatGeneration` task for Supervised Fine-Tuning (SFT).
-
DeitaFiltering
Filter dataset rows using DEITA filtering strategy.
-
EmbeddingDedup
Deduplicates text using embeddings.
-
APIGenExecutionChecker
Executes the generated function calls.
-
MinHashDedup
Deduplicates text using `MinHash` and `MinHashLSH`.
-
CombineOutputs
Combine the outputs of several upstream steps.
-
ExpandColumns
Expand columns that contain lists into multiple rows.
-
GroupColumns
Combines columns from a list of `StepInput`.
-
KeepColumns
Keeps selected columns in the dataset.
-
MergeColumns
Merge columns from a row.
-
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them.
-
UMAP
UMAP is a general purpose manifold learning and dimension reduction algorithm.
-
FaissNearestNeighbour
Create a `faiss` index to get the nearest neighbours.
-
EmbeddingGeneration
Generate embeddings using an `Embeddings` model.
-
RewardModelScore
Assign a score to a response using a Reward Model.
-
FormatPRM
Helper step to transform the data into the format expected by the PRM model.
-
TruncateTextColumn
Truncate a row using a tokenizer or the number of characters.
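
The choice between truncating by tokens and truncating by characters can be sketched in plain Python. This is an illustration of the idea only, not the TruncateTextColumn implementation: the function `truncate_column`, its parameters, and the use of `str.split` as a stand-in tokenizer are assumptions made for the example.

```python
from typing import Any, Callable, Optional


def truncate_column(
    rows: list[dict[str, Any]],
    column: str,
    max_length: int,
    tokenize: Optional[Callable[[str], list[str]]] = None,
) -> list[dict[str, Any]]:
    """Return copies of the rows with `column` truncated.

    Hypothetical helper: if `tokenize` is given, keep the first
    `max_length` tokens and rejoin them with spaces; otherwise keep
    the first `max_length` characters.
    """
    truncated = []
    for row in rows:
        text = row[column]
        if tokenize is None:
            new_text = text[:max_length]
        else:
            new_text = " ".join(tokenize(text)[:max_length])
        # Copy the row so the input data is left unchanged.
        truncated.append({**row, column: new_text})
    return truncated


rows = [{"text": "This is a sample text that is longer than ten characters"}]

# Character-based truncation: keep the first 10 characters.
by_chars = truncate_column(rows, "text", max_length=10)

# Token-based truncation, using whitespace splitting as a stand-in tokenizer.
by_tokens = truncate_column(rows, "text", max_length=3, tokenize=str.split)

print(by_chars[0]["text"])
print(by_tokens[0]["text"])
```

A real tokenizer would usually split text into subword units rather than whitespace-separated words, so token counts from this sketch will not match a model's own limits.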