Extra
steps
DBSCAN
Bases: GlobalStep
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them. This algorithm is good for data which contains clusters of similar density.
This is a `GlobalStep` that clusters the embeddings using the DBSCAN algorithm from `sklearn`. Visit the `TextClustering` step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
Input columns
- projection (`List[float]`): Vector representation of the text to cluster, normally the output from the `UMAP` step.
Output columns
- cluster_label (`int`): Integer representing the label of a given cluster. -1 means it wasn't clustered.
Categories
- clustering
- text-classification
Attributes:
Name | Type | Description |
---|---|---|
eps | - | The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function. |
min_samples | - | The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If `min_samples` is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse. |
metric | - | The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by `sklearn.metrics.pairwise_distances` for its metric parameter. |
n_jobs | - | The number of parallel jobs to run. |
Runtime parameters
- `eps`: The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
- `min_samples`: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If `min_samples` is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.
- `metric`: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by `sklearn.metrics.pairwise_distances` for its metric parameter.
- `n_jobs`: The number of parallel jobs to run.
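To make the roles of `eps` and `min_samples` concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is not the `sklearn` implementation that the step wraps; the helper names (`region_query`, `dbscan`) are made up for illustration and the Euclidean metric is an assumption.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label given to points that end up in no cluster


def region_query(points: Sequence[Sequence[float]], i: int, eps: float) -> List[int]:
    """Indices of all points within `eps` of point `i` (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    labels: List[Optional[int]] = [None] * len(points)  # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
labels = dbscan(points, eps=0.3, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at `(10, 10)` has fewer than `min_samples` neighbours within `eps`, so it keeps the label -1, which matches the meaning of -1 in the `cluster_label` output column.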
Source code in src/distilabel/steps/clustering/dbscan.py
TextClustering
Bases: TextClassification, GlobalTask
Task that clusters a set of texts and generates summary labels for each cluster.
This is a `GlobalTask` that inherits from `TextClassification`, which means that all the attributes from that class are available here. In this case we deal with all the inputs at once instead of using batches; `input_batch_size` is used here to send the examples to the LLM in batches (a subtle difference from the more common `Task` definitions).
The task looks in each cluster for a given number of representative examples (set by the `samples_per_cluster` attribute) and sends them to the LLM to get one or more labels that represent the cluster. The labels are then assigned to each text in the cluster. The clusters and projections used in the step are assumed to come from the `UMAP` + `DBSCAN` steps, but they could be generated by similar steps, as long as they represent the same concepts.
This step runs a pipeline like the one in this repository:
https://github.com/huggingface/text-clustering
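The sampling and label assignment described above can be sketched in plain Python. This is only an illustration under assumptions: the helper names are hypothetical, the random sampling strategy is a guess, skipping rows with `cluster_label == -1` is assumed, and the `summaries` dictionary stands in for what the LLM would return.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_per_cluster(rows: List[dict], samples_per_cluster: int, seed: int = 0) -> Dict[int, List[str]]:
    """Group texts by `cluster_label` and pick a few examples for each cluster."""
    rng = random.Random(seed)
    by_label: Dict[int, List[str]] = defaultdict(list)
    for row in rows:
        if row["cluster_label"] == -1:  # unclustered rows are skipped (assumption)
            continue
        by_label[row["cluster_label"]].append(row["text"])
    return {
        label: rng.sample(texts, min(samples_per_cluster, len(texts)))
        for label, texts in by_label.items()
    }


def assign_labels(rows: List[dict], summaries: Dict[int, str], default_label: str = "None") -> List[dict]:
    """Copy the cluster-level label onto every text of that cluster."""
    return [
        {**row, "summary_label": summaries.get(row["cluster_label"], default_label)}
        for row in rows
    ]


rows = [
    {"text": "A piano teacher", "cluster_label": 0},
    {"text": "A guitar instructor", "cluster_label": 0},
    {"text": "A chemistry student", "cluster_label": 1},
    {"text": "Something unrelated", "cluster_label": -1},
]
samples = sample_per_cluster(rows, samples_per_cluster=1)
# Hypothetical labels, standing in for the LLM output for each cluster.
summaries = {0: "Music education", 1: "Chemistry"}
labelled = assign_labels(rows, summaries)
```

Only the sampled examples would be sent to the LLM, which keeps the prompt small even when a cluster holds thousands of texts.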
Input columns
- text (`str`): The reference text we want to obtain labels for.
- projection (`List[float]`): Vector representation of the text to cluster, normally the output from the `UMAP` step.
- cluster_label (`int`): Integer representing the label of a given cluster. -1 means it wasn't clustered.
Output columns
- summary_label (`str`): The label or list of labels for the text.
- model_name (`str`): The name of the model used to generate the label/s.
Categories
- clustering
- text-classification
Attributes:
Name | Type | Description |
---|---|---|
savefig | - | Whether to generate and save a figure with the clustering of the texts. |
samples_per_cluster | - | The number of examples to use in the LLM as a sample of the cluster. |
Examples:
Generate labels for a set of texts using clustering:

    from datasets import load_dataset

    from distilabel.llms import InferenceEndpointsLLM
    from distilabel.pipeline import Pipeline
    from distilabel.steps import UMAP, DBSCAN, TextClustering, make_generator_step

    ds_name = "argilla-warehouse/personahub-fineweb-edu-4-clustering-100k"

    with Pipeline(name="Text clustering dataset") as pipeline:
        batch_size = 500

        # Load a subset of the dataset and feed it to the pipeline in batches.
        ds = load_dataset(ds_name, split="train").select(range(10000))
        loader = make_generator_step(ds, batch_size=batch_size, repo_id=ds_name)

        umap = UMAP(n_components=2, metric="cosine")
        dbscan = DBSCAN(eps=0.3, min_samples=30)

        text_clustering = TextClustering(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            n=3,  # 3 labels per example
            query_title="Examples of Personas",
            samples_per_cluster=10,
            context=(
                "Describe the main themes, topics, or categories that could describe the "
                "following types of personas. All the examples of personas must share "
                "the same set of labels."
            ),
            default_label="None",
            savefig=True,
            input_batch_size=8,
            input_mappings={"text": "persona"},
            use_default_structured_output=True,
        )

        loader >> umap >> dbscan >> text_clustering

Source code in src/distilabel/steps/clustering/text_clustering.py
inputs: List[str] property
The inputs for the task are the same as those for `TextClassification`, plus the `projection` and `cluster_label` columns (which can be obtained from the UMAP + DBSCAN steps).
outputs: List[str] property
The outputs for the task are `summary_label` and `model_name`.
_save_figure(data, cluster_centers, cluster_summaries)
Saves the figure starting from the dataframe, using matplotlib.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | pd.DataFrame with the columns 'X', 'Y' and 'labels' representing the projections and the label of each text respectively. | required |
cluster_centers | Dict[str, Tuple[float, float]] | Dictionary mapping each label to the center of its cluster, to help with the placement of the annotations. | required |
cluster_summaries | Dict[int, str] | The summaries of the clusters, obtained from the LLM. | required |
Source code in src/distilabel/steps/clustering/text_clustering.py
_create_figure(inputs, label2docs, cluster_summaries)
Creates a figure of the clustered texts and saves it as an artifact.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | The inputs of the step, as we will extract information from them again. | required |
label2docs | Dict[int, List[str]] | Map from each label to the list of documents (texts) that belong to that cluster. | required |
cluster_summaries | Dict[int, str] | The summaries of the clusters, obtained from the LLM. | required |
Source code in src/distilabel/steps/clustering/text_clustering.py
_prepare_input_texts(inputs, label2docs, unique_labels)
Prepares a batch of inputs to send to the LLM, with the examples of each cluster.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | Inputs from the step. | required |
label2docs | Dict[int, List[int]] | Map from each label to the list of documents (texts) that belong to that cluster. | required |
unique_labels | List[int] | The unique labels of the clusters. | required |
Returns:
Type | Description |
---|---|
List[Dict[str, Union[str, int]]] | The input texts to send to the LLM, with the examples of each cluster prepared to be used in the prompt, and an additional key to store the labels (that will be needed to find the data after the batches are returned from the LLM). |
Source code in src/distilabel/steps/clustering/text_clustering.py
UMAP
Bases: GlobalStep
UMAP is a general purpose manifold learning and dimension reduction algorithm.
This is a `GlobalStep` that reduces the dimensionality of the embeddings using UMAP. Visit the `TextClustering` step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
Input columns
- embedding (`List[float]`): The original embeddings whose dimensionality we want to reduce.
Output columns
- projection (`List[float]`): Embedding reduced to the number of components specified; the size of the new embeddings is determined by `n_components`.
Categories
- clustering
- text-classification
Attributes:
Name | Type | Description |
---|---|---|
n_components | - | The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100. |
metric | - | The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to `euclidean`. |
n_jobs | - | The number of parallel jobs to run. Defaults to `8`. |
random_state | - | The random state to use for the UMAP algorithm. |
Runtime parameters
- `n_components`: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
- `metric`: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to `euclidean`.
- `n_jobs`: The number of parallel jobs to run. Defaults to `8`.
- `random_state`: The random state to use for the UMAP algorithm.
Citations
@misc{mcinnes2020umapuniformmanifoldapproximation,
title={UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction},
author={Leland McInnes and John Healy and James Melville},
year={2020},
eprint={1802.03426},
archivePrefix={arXiv},
primaryClass={stat.ML},
url={https://arxiv.org/abs/1802.03426},
}
Source code in src/distilabel/steps/clustering/umap.py
CombineOutputs
Bases: Step
Combine the outputs of several upstream steps.
`CombineOutputs` is a `Step` that takes the outputs of several upstream steps and combines them to generate a new dictionary with all keys/columns of the upstream steps' outputs.
Input columns
- dynamic (based on the upstream `Step`s): All the columns of the upstream steps' outputs.
Output columns
- dynamic (based on the upstream `Step`s): All the columns of the upstream steps' outputs.
Categories
- columns
Examples:
Combine dictionaries of a dataset:
```python
from distilabel.steps import CombineOutputs

combine_outputs = CombineOutputs()
combine_outputs.load()

# Each positional argument is the batch coming from one upstream step.
result = next(
    combine_outputs.process(
        [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
        [{"c": 5, "d": 6}, {"c": 7, "d": 8}],
    )
)
# [
#   {"a": 1, "b": 2, "c": 5, "d": 6},
#   {"a": 3, "b": 4, "c": 7, "d": 8},
# ]
```
Combine upstream steps outputs in a pipeline:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineOutputs

with Pipeline() as pipeline:
    step_1 = ...
    step_2 = ...
    step_3 = ...
    combine = CombineOutputs()

    [step_1, step_2, step_3] >> combine
```
Source code in src/distilabel/steps/columns/combine.py
DeitaFiltering
Bases: GlobalStep
Filter dataset rows using DEITA filtering strategy.
Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
data_budget | RuntimeParameter[int] | The desired size of the dataset after filtering. |
diversity_threshold | RuntimeParameter[float] | If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset. Defaults to |
normalize_embeddings | RuntimeParameter[bool] | Whether to normalize the embeddings before computing the cosine distance. Defaults to |
Runtime parameters
- `data_budget`: The desired size of the dataset after filtering.
- `diversity_threshold`: If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset.
Input columns
- evol_instruction_score (`float`): The score of the instruction generated by the `ComplexityScorer` step.
- evol_response_score (`float`): The score of the response generated by the `QualityScorer` step.
- embedding (`List[float]`): The embedding generated for the conversation of the instruction-response pair using the `GenerateEmbeddings` step.
Output columns
- deita_score (`float`): The DEITA score for the instruction-response pair.
- deita_score_computed_with (`List[str]`): The scores used to compute the DEITA score.
- nearest_neighbor_distance (`float`): The cosine distance between the embedding of the instruction-response pair and its nearest neighbor.
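The two quantities this step relies on can be sketched in plain Python. This is an illustration, not the step's code: the helper names are hypothetical, the nearest neighbor is searched among all other rows (an assumption), and the final comparison only shows the `diversity_threshold` check, not how `data_budget` is enforced.

```python
import math
from typing import Dict, List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def annotate(rows: List[Dict]) -> List[Dict]:
    """Add the DEITA score and the distance to the closest other row (needs >= 2 rows)."""
    annotated = []
    for i, row in enumerate(rows):
        others = [r["embedding"] for j, r in enumerate(rows) if j != i]
        annotated.append(
            {
                **row,
                # The DEITA score is the product of the instruction and response scores.
                "deita_score": row["evol_instruction_score"] * row["evol_response_score"],
                "nearest_neighbor_distance": min(
                    cosine_distance(row["embedding"], other) for other in others
                ),
            }
        )
    return annotated


rows = [
    {"evol_instruction_score": 0.5, "evol_response_score": 0.5, "embedding": [1.0, 0.0]},
    {"evol_instruction_score": 0.5, "evol_response_score": 1.0, "embedding": [2.0, 0.0]},
    {"evol_instruction_score": 1.0, "evol_response_score": 1.0, "embedding": [0.0, 3.0]},
]
annotated = annotate(rows)
# Rows that are far enough from their nearest neighbor pass the diversity check.
diverse = [r for r in annotated if r["nearest_neighbor_distance"] > 0.9]
```

The first two rows point in the same direction, so their distance is 0 and neither passes a threshold of 0.9; the third row is orthogonal to both and does.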
Categories
- filtering
Examples:
Filter the dataset based on the DEITA score and the cosine distance between the embeddings:

    from distilabel.steps import DeitaFiltering

    deita_filtering = DeitaFiltering(data_budget=1)
    deita_filtering.load()

    result = next(
        deita_filtering.process(
            [
                {
                    "evol_instruction_score": 0.5,
                    "evol_response_score": 0.5,
                    "embedding": [-8.12729941, -5.24642847, -6.34003029],
                },
                {
                    "evol_instruction_score": 0.6,
                    "evol_response_score": 0.6,
                    "embedding": [2.99329242, 0.7800932, 0.7799726],
                },
                {
                    "evol_instruction_score": 0.7,
                    "evol_response_score": 0.7,
                    "embedding": [10.29041806, 14.33088073, 13.00557506],
                },
            ],
        )
    )
    # >>> result
    # [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]

Citations
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
Source code in src/distilabel/steps/deita.py
process(inputs)
Filter the dataset based on the DEITA score and the cosine distance between the embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | The input data. | required |
Returns:
Type | Description |
---|---|
StepOutput | The filtered dataset. |
Source code in src/distilabel/steps/deita.py
_compute_deita_score(inputs)
Computes the DEITA score for each instruction-response pair. The DEITA score is the product of the instruction score and the response score.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | The input data. | required |
Returns:
Type | Description |
---|---|
StepInput | The input data with the DEITA score computed. |
Source code in src/distilabel/steps/deita.py
_compute_nearest_neighbor(inputs)
Computes the cosine distance between the embeddings of the instruction-response pairs and the nearest neighbor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | The input data. | required |
Returns:
Type | Description |
---|---|
StepInput | The input data with the cosine distance computed. |
Source code in src/distilabel/steps/deita.py
_normalize_embeddings(embeddings)
Normalize the embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeddings | ndarray | The embeddings to normalize. | required |
Returns:
Type | Description |
---|---|
ndarray | The normalized embeddings. |
Source code in src/distilabel/steps/deita.py
_cosine_distance(embeddings)
Computes the cosine distance between the embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeddings | array | The embeddings. | required |
Returns:
Type | Description |
---|---|
array | The cosine distance between the embeddings. |
Source code in src/distilabel/steps/deita.py
_manhattan_distance(embeddings)
Computes the manhattan distance between the embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeddings | array | The embeddings. | required |
Returns:
Type | Description |
---|---|
array | The manhattan distance between the embeddings. |
Source code in src/distilabel/steps/deita.py
EmbeddingGeneration
Bases: Step
Generate embeddings using an `Embeddings` model.
`EmbeddingGeneration` is a `Step` that uses an `Embeddings` model to generate sentence embeddings for the provided input texts.
Attributes:
Name | Type | Description |
---|---|---|
embeddings | Embeddings | the |
Input columns
- text (`str`): The text for which the sentence embedding has to be generated.
Output columns
- embedding (`List[Union[float, int]]`): the generated sentence embedding.
Categories
- embedding
Examples:
Generate sentence embeddings with Sentence Transformers:

    from distilabel.embeddings import SentenceTransformerEmbeddings
    from distilabel.steps import EmbeddingGeneration

    embedding_generation = EmbeddingGeneration(
        embeddings=SentenceTransformerEmbeddings(
            model="mixedbread-ai/mxbai-embed-large-v1",
        )
    )
    embedding_generation.load()

    result = next(embedding_generation.process([{"text": "Hello, how are you?"}]))
    # [{'text': 'Hello, how are you?', 'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]

Source code in src/distilabel/steps/embeddings/embedding_generation.py
FaissNearestNeighbour
Bases: GlobalStep
Create a `faiss` index to get the nearest neighbours.
`FaissNearestNeighbour` is a `GlobalStep` that creates a `faiss` index using the Hugging Face `datasets` library integration, and then gets the nearest neighbours and the scores or distance of the nearest neighbours for each input row.
Attributes:
Name | Type | Description |
---|---|---|
device | Optional[RuntimeParameter[Union[int, List[int]]]] | the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to `None`. |
string_factory | Optional[RuntimeParameter[str]] | the name of the factory to be used to build the `faiss` index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to `None`. |
metric_type | Optional[RuntimeParameter[int]] | the metric to be used to measure the distance between the points. It's an integer and the recommended way to pass it is importing `faiss` and then passing one of the `faiss.METRIC_x` variables. Defaults to `None`. |
k | Optional[RuntimeParameter[int]] | the number of nearest neighbours to search for each input row. Defaults to `1`. |
search_batch_size | Optional[RuntimeParameter[int]] | the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to `50`. |
train_size | Optional[RuntimeParameter[int]] | If the index needs a training step, specifies how many vectors will be used to train the index. |
Runtime parameters
- `device`: the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to `None`.
- `string_factory`: the name of the factory to be used to build the `faiss` index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to `None`.
- `metric_type`: the metric to be used to measure the distance between the points. It's an integer and the recommended way to pass it is importing `faiss` and then passing one of the `faiss.METRIC_x` variables. Defaults to `None`.
- `k`: the number of nearest neighbours to search for each input row. Defaults to `1`.
- `search_batch_size`: the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to `50`.
- `train_size`: If the index needs a training step, specifies how many vectors will be used to train the index.
Input columns
- embedding (`List[Union[float, int]]`): a sentence embedding.
Output columns
- nn_indices (`List[int]`): a list containing the indices of the `k` nearest neighbours in the inputs for the row.
- nn_scores (`List[float]`): a list containing the score or distance to each `k` nearest neighbour in the inputs.
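To show what the `nn_indices` and `nn_scores` columns contain, here is a brute-force sketch in plain Python. It does not use `faiss` (which builds an index instead of comparing every pair); the inner-product score and the choice to exclude each row from its own neighbour list are assumptions made for the illustration.

```python
from typing import Dict, List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbours(embeddings: List[Sequence[float]], k: int = 1) -> List[Dict[str, list]]:
    """For each row, return the indices and scores of its `k` most similar other rows."""
    results = []
    for i, query in enumerate(embeddings):
        scored = [
            (inner_product(query, other), j)
            for j, other in enumerate(embeddings)
            if j != i  # skip the row itself (assumption)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
        top = scored[:k]
        results.append(
            {
                "nn_indices": [j for _, j in top],
                "nn_scores": [score for score, _ in top],
            }
        )
    return results


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbours = nearest_neighbours(embeddings, k=1)
print(neighbours[0])  # {'nn_indices': [1], 'nn_scores': [0.9]}
```

With a distance metric instead of a similarity (for example L2), the sort order would be ascending, which is why the meaning of `nn_scores` depends on the chosen `metric_type`.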
Categories
- embedding
Examples:
Generating embeddings and getting the nearest neighbours:

    from distilabel.embeddings.sentence_transformers import SentenceTransformerEmbeddings
    from distilabel.pipeline import Pipeline
    from distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub

    with Pipeline(name="hello") as pipeline:
        load_data = LoadDataFromHub(output_mappings={"prompt": "text"})

        embeddings = EmbeddingGeneration(
            embeddings=SentenceTransformerEmbeddings(
                model="mixedbread-ai/mxbai-embed-large-v1"
            )
        )

        nearest_neighbours = FaissNearestNeighbour()

        load_data >> embeddings >> nearest_neighbours

    if __name__ == "__main__":
        distiset = pipeline.run(
            parameters={
                load_data.name: {
                    "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                    "split": "test",
                },
            },
            use_cache=False,
        )

Citations
@misc{douze2024faisslibrary,
title={The Faiss library},
author={Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre-Emmanuel Mazaré and Maria Lomeli and Lucas Hosseini and Hervé Jégou},
year={2024},
eprint={2401.08281},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2401.08281},
}
Source code in src/distilabel/steps/embeddings/nearest_neighbour.py
_build_index(inputs)
Builds a `faiss` index using `datasets` integration.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | List[Dict[str, Any]] | a list of dictionaries. | required |
Returns:
Type | Description |
---|---|
Dataset | The build |
Source code in src/distilabel/steps/embeddings/nearest_neighbour.py
_save_index(dataset)
Save the generated Faiss index as an artifact of the step.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | the dataset with the | required |
Source code in src/distilabel/steps/embeddings/nearest_neighbour.py
_search(dataset)
Search the top `k` nearest neighbours for each row in the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | the dataset with the | required |
Returns:
Type | Description |
---|---|
Dataset | The updated dataset containing the top `k` nearest neighbours as well as the score or distance. |
Source code in src/distilabel/steps/embeddings/nearest_neighbour.py
EmbeddingDedup
Bases: GlobalStep
Deduplicates text using embeddings.
`EmbeddingDedup` is a Step that detects near-duplicates in datasets, using embeddings to compare the similarity between the texts. The typical workflow with this step starts from a dataset with precomputed embeddings and then (possibly using `FaissNearestNeighbour`) uses the `nn_indices` and `nn_scores` to determine which texts are duplicates.
Attributes:
Name | Type | Description |
---|---|---|
threshold | Optional[RuntimeParameter[float]] | the threshold to consider 2 examples as duplicates. It's dependent on the type of index that was used to generate the embeddings. For example, if the embeddings were generated using cosine similarity, a threshold of |
Runtime parameters
- `threshold`: the threshold to consider 2 examples as duplicates.
Input columns
- nn_indices (`List[int]`): a list containing the indices of the `k` nearest neighbours in the inputs for the row.
- nn_scores (`List[float]`): a list containing the score or distance to each `k` nearest neighbour in the inputs.
Output columns
- keep_row_after_embedding_filtering (`bool`): boolean indicating if the piece of `text` is not a duplicate, i.e. this text should be kept.
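The way `nn_indices`, `nn_scores` and `threshold` can combine into a keep/drop flag is sketched below. This is a simplified reading, not the step's actual logic: it assumes the scores are similarities (higher means closer), that rows are visited in order, and that a row is dropped only when a sufficiently close neighbour has already been kept. The function name is hypothetical.

```python
from typing import Dict, List


def mark_duplicates(rows: List[Dict[str, list]], threshold: float) -> List[bool]:
    """Return one flag per row: True to keep it, False if it is a near-duplicate."""
    kept = set()
    flags = []
    for idx, row in enumerate(rows):
        is_duplicate = any(
            score > threshold and neighbour in kept
            for neighbour, score in zip(row["nn_indices"], row["nn_scores"])
        )
        if not is_duplicate:
            kept.add(idx)  # first representative of its group of near-duplicates
        flags.append(not is_duplicate)
    return flags


rows = [
    {"nn_indices": [1], "nn_scores": [0.95]},  # very close to row 1
    {"nn_indices": [0], "nn_scores": [0.95]},  # very close to row 0, already kept
    {"nn_indices": [1], "nn_scores": [0.20]},  # far from everything
]
keep = mark_duplicates(rows, threshold=0.8)
print(keep)  # [True, False, True]
```

Keeping the first row of each group avoids dropping both members of a duplicate pair, which would happen if every row with a close neighbour were removed.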
Categories
- filtering
Examples:
Deduplicate a list of texts using embedding information:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingDedup
from distilabel.steps import LoadDataFromDicts

with Pipeline() as pipeline:
    batch_size = 3  # assumed value; the three rows fit in a single batch

    data = LoadDataFromDicts(
        data=[
            {
                "persona": "A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.",
                "embedding": [
                    0.018477669046149742,
                    -0.03748236608841726,
                    0.001919870620352492,
                    0.024918478063770535,
                    0.02348063521315178,
                    0.0038251285566308375,
                    -0.01723884983037716,
                    0.02881971942372201,
                ],
                "nn_indices": [0, 1],
                "nn_scores": [
                    0.9164746999740601,
                    0.782106876373291,
                ],
            },
            {
                "persona": "A music teacher or instructor focused on theoretical and practical piano lessons.",
                "embedding": [
                    -0.0023464179614082125,
                    -0.07325472251663565,
                    -0.06058678419516501,
                    -0.02100326928586996,
                    -0.013462744792362657,
                    0.027368447064244242,
                    -0.003916070100455717,
                    0.01243614518480423,
                ],
                "nn_indices": [0, 2],
                "nn_scores": [
                    0.7552462220191956,
                    0.7261884808540344,
                ],
            },
            {
                "persona": "A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.",
                "embedding": [
                    -0.01630817942328242,
                    -0.023760151552345232,
                    -0.014249650090627883,
                    -0.005713686451446624,
                    -0.016033059279131567,
                    0.0071440908501058786,
                    -0.05691099643425161,
                    0.01597412704817784,
                ],
                "nn_indices": [1, 2],
                "nn_scores": [
                    0.8107735514640808,
                    0.7172299027442932,
                ],
            },
        ],
        batch_size=batch_size,
    )
    # In general you should do something like this before the deduplication step, to obtain the
    # `nn_indices` and `nn_scores`. In this case the embeddings are already normalized, so there's
    # no need for it.
    # nn = FaissNearestNeighbour(
    #     k=30,
    #     metric_type=faiss.METRIC_INNER_PRODUCT,
    #     search_batch_size=50,
    #     train_size=len(dataset),  # The number of embeddings to use for training
    #     string_factory="IVF300_HNSW32,Flat"  # To use an index (optional, maybe required for big datasets)
    # )
    # Read more about the `string_factory` here:
    # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

    embedding_dedup = EmbeddingDedup(
        threshold=0.8,
        input_batch_size=batch_size,
    )

    data >> embedding_dedup

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    ds = distiset["default"]["train"]
    # Filter out the duplicates
    ds_dedup = ds.filter(lambda x: x["keep_row_after_embedding_filtering"])
```
Source code in src/distilabel/steps/filtering/embedding.py
MinHashDedup
Bases: Step
Deduplicates text using `MinHash` and `MinHashLSH`.
`MinHashDedup` is a Step that detects near-duplicates in datasets. The idea roughly translates to the following steps:
1. Tokenize the text into words or ngrams.
2. Create a `MinHash` for each text.
3. Store the `MinHashes` in a `MinHashLSH`.
4. Check if the `MinHash` is already in the `LSH`; if so, it is a duplicate.
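The four steps above can be sketched with the standard library only. This is a simplified illustration, not the step's implementation: signatures are compared with a linear scan instead of a `MinHashLSH` index, the salted `blake2b` hash family and the function names are assumptions, and texts are assumed to be non-empty.

```python
import hashlib
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Step 1: split the text into lowercase words."""
    return set(text.lower().split())


def _hash(token: str, seed: int) -> int:
    """One member of a family of hash functions, selected by `seed`."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(tokens: Set[str], num_perm: int = 64) -> List[int]:
    """Step 2: the signature keeps the minimum hash value under each hash function."""
    return [min(_hash(token, seed) for token in tokens) for seed in range(num_perm)]


def similarity(a: List[int], b: List[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(texts: List[str], threshold: float = 0.9, num_perm: int = 64) -> List[bool]:
    """Steps 3 and 4: store the signatures seen so far and flag texts that match one."""
    seen: List[List[int]] = []
    keep = []
    for text in texts:
        signature = minhash(tokenize(text), num_perm)
        is_duplicate = any(similarity(signature, other) >= threshold for other in seen)
        keep.append(not is_duplicate)
        if not is_duplicate:
            seen.append(signature)
    return keep


texts = [
    "This is a test document.",
    "this is a TEST document.",
    "Completely different words here",
]
keep = dedup(texts, threshold=0.9)
print(keep)  # [True, False, True]
```

The linear scan costs one comparison per stored signature; an LSH index replaces it with a bucket lookup, which is what makes the approach practical for large datasets.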
Attributes:
Name | Type | Description |
---|---|---|
num_perm | int | the number of permutations to use. Defaults to |
seed | int | the seed to use for the MinHash. This seed must be the same used for |
tokenizer | Literal['words', 'ngrams'] | the tokenizer to use. Available ones are |
n | Optional[int] | the size of the ngrams to use. Only relevant if |
threshold | float | the threshold to consider two MinHashes as duplicates. Values closer to 0 detect more duplicates. Defaults to |
storage | Literal['dict', 'disk'] | the storage to use for the LSH. Can be |
Input columns
- text (`str`): the texts to be filtered.
Output columns
- keep_row_after_minhash_filtering (`bool`): boolean indicating if the piece of `text` is not a duplicate, i.e. this text should be kept.
Categories
- filtering
Examples:
Deduplicate a list of texts using MinHash and MinHashLSH:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import MinHashDedup
from distilabel.steps import LoadDataFromDicts

with Pipeline() as pipeline:
    ds_size = 1000
    batch_size = 500  # Bigger batch sizes work better for this step
    data = LoadDataFromDicts(
        data=[
            {"text": "This is a test document."},
            {"text": "This document is a test."},
            {"text": "Test document for duplication."},
            {"text": "Document for duplication test."},
            {"text": "This is another unique document."},
        ]
        * (ds_size // 5),
        batch_size=batch_size,
    )
    minhash_dedup = MinHashDedup(
        tokenizer="words",
        threshold=0.9,  # lower values will increase the number of duplicates
        storage="dict",  # or "disk" for bigger datasets
    )
    data >> minhash_dedup

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    ds = distiset["default"]["train"]
    # Filter out the duplicates
    ds_dedup = ds.filter(lambda x: x["keep_row_after_minhash_filtering"])
```
Source code in src/distilabel/steps/filtering/minhash.py
ConversationTemplate
¶
Bases: Step
Generate a conversation template from an instruction and a response.
Input columns
- instruction (`str`): The instruction to be used in the conversation.
- response (`str`): The response to be used in the conversation.
Output columns
- conversation (`ChatType`): The conversation template.
Categories
- format
- chat
- template
Examples:
Create a conversation from an instruction and a response:
from distilabel.steps import ConversationTemplate

conv_template = ConversationTemplate()
conv_template.load()
result = next(
    conv_template.process(
        [
            {
                "instruction": "Hello",
                "response": "Hi",
            }
        ],
    )
)
# >>> result
# [{'instruction': 'Hello', 'response': 'Hi', 'conversation': [{'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hi'}]}]
Source code in src/distilabel/steps/formatting/conversation.py
inputs: StepColumns
property
¶
The instruction and response.
outputs: StepColumns
property
¶
The conversation template.
process(inputs)
¶
Generate a conversation template from an instruction and a response.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | The input data. | required |
Yields:
Type | Description |
---|---|
StepOutput | The input data with the conversation template. |
Source code in src/distilabel/steps/formatting/conversation.py
FormatChatGenerationDPO
¶
Bases: Step
Format the output of a combination of a `ChatGeneration` + a preference task for Direct Preference Optimization (DPO).
`FormatChatGenerationDPO` is a `Step` that formats the output of the combination of a `ChatGeneration` task with a preference `Task`, i.e. a task generating `ratings` such as `UltraFeedback`, following the standard formatting from frameworks such as `axolotl` or `alignment-handbook`. The `ratings` are used to rank the existing generations and provide the `chosen` and `rejected` generations.
Note
The `messages` column should contain at least one message from the user, the `generations` column should contain at least two generations, and the `ratings` column should contain the same number of ratings as generations.
Input columns
- messages (`List[Dict[str, str]]`): The conversation messages.
- generations (`List[str]`): The generations produced by the `LLM`.
- generation_models (`List[str]`, optional): The model names used to generate the `generations`, only available if the `model_name` from the `ChatGeneration` task/s is combined into a single column named this way; otherwise, it will be ignored.
- ratings (`List[float]`): The ratings for each of the `generations`, produced by a preference task such as `UltraFeedback`.
Output columns
- prompt (`str`): The user message used to generate the `generations` with the `LLM`.
- prompt_id (`str`): The `SHA256` hash of the `prompt`.
- chosen (`List[Dict[str, str]]`): The `chosen` generation based on the `ratings`.
- chosen_model (`str`, optional): The model name used to generate the `chosen` generation, if the `generation_models` are available.
- chosen_rating (`float`): The rating of the `chosen` generation.
- rejected (`List[Dict[str, str]]`): The `rejected` generation based on the `ratings`.
- rejected_model (`str`, optional): The model name used to generate the `rejected` generation, if the `generation_models` are available.
- rejected_rating (`float`): The rating of the `rejected` generation.
Categories
- format
- chat-generation
- preference
- messages
- generations
Examples:
Format your dataset for DPO fine tuning:
from distilabel.steps import FormatChatGenerationDPO

format_dpo = FormatChatGenerationDPO()
format_dpo.load()
# NOTE: "generation_models" can be added optionally.
result = next(
    format_dpo.process(
        [
            {
                "messages": [{"role": "user", "content": "What's 2+2?"}],
                "generations": ["4", "5", "6"],
                "ratings": [1, 0, -1],
            }
        ]
    )
)
# >>> result
# [
#    {
#        'messages': [{'role': 'user', 'content': "What's 2+2?"}],
#        'generations': ['4', '5', '6'],
#        'ratings': [1, 0, -1],
#        'prompt': "What's 2+2?",
#        'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',
#        'chosen': [{'role': 'user', 'content': "What's 2+2?"}, {'role': 'assistant', 'content': '4'}],
#        'chosen_rating': 1,
#        'rejected': [{'role': 'user', 'content': "What's 2+2?"}, {'role': 'assistant', 'content': '6'}],
#        'rejected_rating': -1
#    }
# ]
Source code in src/distilabel/steps/formatting/dpo.py
inputs: StepColumns
property
¶
List of inputs required by the `Step`, which in this case are: `messages`, `generations`, and `ratings`.
optional_inputs: List[str]
property
¶
List of optional inputs, which are not required by the `Step` but used if available, which in this case is: `generation_models`.
outputs: StepColumns
property
¶
List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `chosen`, `chosen_model`, `chosen_rating`, `rejected`, `rejected_model`, `rejected_rating`. Both `chosen_model` and `rejected_model` are optional and only used if `generation_models` is available.
Reference
- Format inspired by https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
process(*inputs)
¶
The `process` method formats the received `StepInput` or list of `StepInput` according to the DPO formatting standard.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs |
StepInput
|
A list of |
()
|
Yields:
Type | Description |
---|---|
StepOutput
|
A |
Source code in src/distilabel/steps/formatting/dpo.py
FormatTextGenerationDPO
¶
Bases: Step
Format the output of your LLMs for Direct Preference Optimization (DPO).
`FormatTextGenerationDPO` is a `Step` that formats the output of the combination of a `TextGeneration` task with a preference `Task`, i.e. a task generating `ratings`, so that those are used to rank the existing generations and provide the `chosen` and `rejected` generations based on the `ratings`.
Use this step to transform the output of a combination of a `TextGeneration` + a preference task such as `UltraFeedback` following the standard formatting from frameworks such as `axolotl` or `alignment-handbook`.
Note
The `generations` column should contain at least two generations, and the `ratings` column should contain the same number of ratings as generations.
Input columns
- system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the `generations`, if available.
- instruction (`str`): The instruction used to generate the `generations` with the `LLM`.
- generations (`List[str]`): The generations produced by the `LLM`.
- generation_models (`List[str]`, optional): The model names used to generate the `generations`, only available if the `model_name` from the `TextGeneration` task/s is combined into a single column named this way; otherwise, it will be ignored.
- ratings (`List[float]`): The ratings for each of the `generations`, produced by a preference task such as `UltraFeedback`.
Output columns
- prompt (`str`): The instruction used to generate the `generations` with the `LLM`.
- prompt_id (`str`): The `SHA256` hash of the `prompt`.
- chosen (`List[Dict[str, str]]`): The `chosen` generation based on the `ratings`.
- chosen_model (`str`, optional): The model name used to generate the `chosen` generation, if the `generation_models` are available.
- chosen_rating (`float`): The rating of the `chosen` generation.
- rejected (`List[Dict[str, str]]`): The `rejected` generation based on the `ratings`.
- rejected_model (`str`, optional): The model name used to generate the `rejected` generation, if the `generation_models` are available.
- rejected_rating (`float`): The rating of the `rejected` generation.
Categories
- format
- text-generation
- preference
- instruction
- generations
Examples:
Format your dataset for DPO fine tuning:
from distilabel.steps import FormatTextGenerationDPO

format_dpo = FormatTextGenerationDPO()
format_dpo.load()
# NOTE: Both "system_prompt" and "generation_models" can be added optionally.
result = next(
    format_dpo.process(
        [
            {
                "instruction": "What's 2+2?",
                "generations": ["4", "5", "6"],
                "ratings": [1, 0, -1],
            }
        ]
    )
)
# >>> result
# [
#    {
#        'instruction': "What's 2+2?",
#        'generations': ['4', '5', '6'],
#        'ratings': [1, 0, -1],
#        'prompt': "What's 2+2?",
#        'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',
#        'chosen': [{'role': 'user', 'content': "What's 2+2?"}, {'role': 'assistant', 'content': '4'}],
#        'chosen_rating': 1,
#        'rejected': [{'role': 'user', 'content': "What's 2+2?"}, {'role': 'assistant', 'content': '6'}],
#        'rejected_rating': -1
#    }
# ]
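To make the selection logic concrete, here is a minimal sketch that reproduces the shape of the output above without distilabel. It assumes the `chosen` generation is the one with the highest rating and the `rejected` one has the lowest, which is consistent with the example output but is not taken from the library source. The helper name `format_dpo_row` is hypothetical, and ties are resolved by taking the first occurrence, which is also an assumption.

```python
import hashlib
from typing import Any, Dict, List


def format_dpo_row(
    instruction: str, generations: List[str], ratings: List[float]
) -> Dict[str, Any]:
    """Hypothetical helper: pick chosen/rejected by highest/lowest rating."""
    if len(generations) < 2 or len(generations) != len(ratings):
        raise ValueError("need at least two generations and one rating per generation")

    # Index of the best and worst rated generation (first occurrence on ties).
    best = max(range(len(ratings)), key=ratings.__getitem__)
    worst = min(range(len(ratings)), key=ratings.__getitem__)

    user_message = {"role": "user", "content": instruction}
    return {
        "prompt": instruction,
        # SHA256 hex digest of the prompt, as described for `prompt_id`.
        "prompt_id": hashlib.sha256(instruction.encode("utf-8")).hexdigest(),
        "chosen": [user_message, {"role": "assistant", "content": generations[best]}],
        "chosen_rating": ratings[best],
        "rejected": [user_message, {"role": "assistant", "content": generations[worst]}],
        "rejected_rating": ratings[worst],
    }


row = format_dpo_row("What's 2+2?", ["4", "5", "6"], [1, 0, -1])
```

With the ratings `[1, 0, -1]`, the sketch selects `"4"` as chosen and `"6"` as rejected, matching the documented example.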
Source code in src/distilabel/steps/formatting/dpo.py
inputs: StepColumns
property
¶
List of inputs required by the `Step`, which in this case are: `instruction`, `generations`, and `ratings`.
optional_inputs: List[str]
property
¶
List of optional inputs, which are not required by the `Step` but used if available, which in this case are: `system_prompt` and `generation_models`.
outputs: StepColumns
property
¶
List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `chosen`, `chosen_model`, `chosen_rating`, `rejected`, `rejected_model`, `rejected_rating`. Both `chosen_model` and `rejected_model` are optional and only used if `generation_models` is available.
Reference
- Format inspired by https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
process(*inputs)
¶
The `process` method formats the received `StepInput` or list of `StepInput` according to the DPO formatting standard.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs |
StepInput
|
A list of |
()
|
Yields:
Type | Description |
---|---|
StepOutput
|
A |
Source code in src/distilabel/steps/formatting/dpo.py
FormatChatGenerationSFT
¶
Bases: Step
Format the output of a `ChatGeneration` task for Supervised Fine-Tuning (SFT).
`FormatChatGenerationSFT` is a `Step` that formats the output of a `ChatGeneration` task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as `axolotl` or `alignment-handbook`. The output of the `ChatGeneration` task is formatted into a chat-like conversation with the `instruction` as the user message and the `generation` as the assistant message. Optionally, if the `system_prompt` is available, it is included as the first message in the conversation.
Input columns
- system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the `generation`, if available.
- instruction (`str`): The instruction used to generate the `generation` with the `LLM`.
- generation (`str`): The generation produced by the `LLM`.
Output columns
- prompt (`str`): The instruction used to generate the `generation` with the `LLM`.
- prompt_id (`str`): The `SHA256` hash of the `prompt`.
- messages (`List[Dict[str, str]]`): The chat-like conversation with the `instruction` as the user message and the `generation` as the assistant message.
Categories
- format
- chat-generation
- instruction
- generation
Examples:
Format your dataset for SFT:
from distilabel.steps import FormatChatGenerationSFT

format_sft = FormatChatGenerationSFT()
format_sft.load()
# NOTE: "system_prompt" can be added optionally.
result = next(
    format_sft.process(
        [
            {
                "messages": [{"role": "user", "content": "What's 2+2?"}],
                "generation": "4"
            }
        ]
    )
)
# >>> result
# [
#    {
#        'messages': [{'role': 'user', 'content': "What's 2+2?"}, {'role': 'assistant', 'content': '4'}],
#        'generation': '4',
#        'prompt': "What's 2+2?",
#        'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',
#    }
# ]
Source code in src/distilabel/steps/formatting/sft.py
inputs: StepColumns
property
¶
List of inputs required by the `Step`, which in this case are: `instruction`, and `generation`.
outputs: StepColumns
property
¶
List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `messages`.
Reference
- Format inspired by https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
process(*inputs)
¶
The `process` method formats the received `StepInput` or list of `StepInput` according to the SFT formatting standard.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs |
StepInput
|
A list of |
()
|
Yields:
Type | Description |
---|---|
StepOutput
|
A |
Source code in src/distilabel/steps/formatting/sft.py
FormatTextGenerationSFT
¶
Bases: Step
Format the output of a `TextGeneration` task for Supervised Fine-Tuning (SFT).
`FormatTextGenerationSFT` is a `Step` that formats the output of a `TextGeneration` task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as `axolotl` or `alignment-handbook`. The output of the `TextGeneration` task is formatted into a chat-like conversation with the `instruction` as the user message and the `generation` as the assistant message. Optionally, if the `system_prompt` is available, it is included as the first message in the conversation.
Input columns
- system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the `generation`, if available.
- instruction (`str`): The instruction used to generate the `generation` with the `LLM`.
- generation (`str`): The generation produced by the `LLM`.
Output columns
- prompt (`str`): The instruction used to generate the `generation` with the `LLM`.
- prompt_id (`str`): The `SHA256` hash of the `prompt`.
- messages (`List[Dict[str, str]]`): The chat-like conversation with the `instruction` as the user message and the `generation` as the assistant message.
Categories
- format
- text-generation
- instruction
- generation
Examples:
Format your dataset for SFT fine tuning:
from distilabel.steps import FormatTextGenerationSFT

format_sft = FormatTextGenerationSFT()
format_sft.load()
# NOTE: "system_prompt" can be added optionally.
result = next(
    format_sft.process(
        [
            {
                "instruction": "What's 2+2?",
                "generation": "4"
            }
        ]
    )
)
# >>> result
# [
#    {
#        'instruction': "What's 2+2?",
#        'generation': '4',
#        'prompt': "What's 2+2?",
#        'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',
#        'messages': [{'role': 'user', 'content': "What's 2+2?"}, {'role': 'assistant', 'content': '4'}]
#    }
# ]
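The transformation described above (instruction as the user message, generation as the assistant message, optional system prompt first) can be sketched in plain Python. The helper `format_sft_row` is a hypothetical name used only for illustration and is not part of distilabel.

```python
import hashlib
from typing import Any, Dict, List, Optional


def format_sft_row(
    instruction: str, generation: str, system_prompt: Optional[str] = None
) -> Dict[str, Any]:
    """Hypothetical helper: build the chat-like `messages` list for SFT."""
    messages: List[Dict[str, str]] = []
    if system_prompt is not None:
        # The system prompt, when available, goes first in the conversation.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": instruction})
    messages.append({"role": "assistant", "content": generation})
    return {
        "prompt": instruction,
        "prompt_id": hashlib.sha256(instruction.encode("utf-8")).hexdigest(),
        "messages": messages,
    }


plain = format_sft_row("What's 2+2?", "4")
with_system = format_sft_row("What's 2+2?", "4", system_prompt="You are a calculator.")
```

Without a system prompt the conversation has two messages; with one, it has three and the system message comes first.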
Source code in src/distilabel/steps/formatting/sft.py
inputs: StepColumns
property
¶
List of inputs required by the `Step`, which in this case are: `instruction`, and `generation`.
optional_inputs: List[str]
property
¶
List of optional inputs, which are not required by the `Step` but used if available, which in this case is: `system_prompt`.
outputs: StepColumns
property
¶
List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `messages`.
Reference
- Format inspired by https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
process(*inputs)
¶
The `process` method formats the received `StepInput` or list of `StepInput` according to the SFT formatting standard.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs |
StepInput
|
A list of |
()
|
Yields:
Type | Description |
---|---|
StepOutput
|
A |
Source code in src/distilabel/steps/formatting/sft.py
LoadDataFromDicts
¶
Bases: GeneratorStep
Loads a dataset from a list of dictionaries.
`GeneratorStep` that loads a dataset from a list of dictionaries and yields it in batches.
Attributes:
Name | Type | Description |
---|---|---|
data | List[Dict[str, Any]] | The list of dictionaries to load the data from. |
Runtime parameters
- `batch_size`: The batch size to use when processing the data.
Output columns
- dynamic (based on the keys found on the first dictionary of the list): The columns of the dataset.
Categories
- load
Examples:
Load data from a list of dictionaries:
from distilabel.steps import LoadDataFromDicts

loader = LoadDataFromDicts(
    data=[{"instruction": "What are 2+2?"}] * 5,
    batch_size=2
)
loader.load()
result = next(loader.process())
# >>> result
# ([{'instruction': 'What are 2+2?'}, {'instruction': 'What are 2+2?'}], False)
Source code in src/distilabel/steps/generators/data.py
outputs: List[str]
property
¶
Returns a list of strings with the names of the columns that the step will generate.
process(offset=0)
¶
Yields batches from a list of dictionaries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset to start the generation from. Defaults to `0`. | 0 |
Yields:
Type | Description |
---|---|
GeneratorStepOutput | A list of Python dictionaries as read from the inputs (propagated in batches) and a flag indicating whether the yielded batch is the last one. |
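The batching contract described above (a batch plus a last-batch flag, starting from `offset`) can be sketched as a plain generator. `iter_batches` is a hypothetical name for illustration and is not the step's implementation.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_batches(
    data: List[Dict[str, Any]], batch_size: int, offset: int = 0
) -> Iterator[Tuple[List[Dict[str, Any]], bool]]:
    """Yield (batch, is_last_batch) tuples, skipping the first `offset` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(offset, len(data), batch_size):
        batch = data[start : start + batch_size]
        # The batch is the last one when no rows remain after it.
        yield batch, start + batch_size >= len(data)


rows = [{"instruction": "What are 2+2?"}] * 5
batches = list(iter_batches(rows, batch_size=2))
resumed = list(iter_batches(rows, batch_size=2, offset=4))
```

Five rows with a batch size of 2 give three batches, and only the third carries `True` as its flag. Resuming with `offset=4` yields a single final batch with the remaining row.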
Source code in src/distilabel/steps/generators/data.py
DataSampler
¶
Bases: GeneratorStep
Step to sample from a dataset.
`GeneratorStep` that samples from a dataset and yields it in batches.
This step is useful when a pipeline can benefit from including examples in the prompts,
for instance few-shot examples that change on each row.
For example, you can pass a list of dictionaries with N examples and generate M samples
from it (assuming you have another step loading data, M should match the number of rows
loaded in that step). The `size` argument S is the number of samples drawn per row,
so each generated row contains S examples.
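The relationship between N, M and S can be sketched as follows. This is an illustration only: `sample_rows` is a hypothetical helper, and sampling with replacement via `random.choices` is an assumption, since the text above does not say how the step draws its samples.

```python
import random
from typing import Any, Dict, List, Optional


def sample_rows(
    data: List[Dict[str, Any]], size: int, samples: int, seed: Optional[int] = None
) -> List[Dict[str, List[Any]]]:
    """Generate `samples` rows (M), each holding `size` examples (S) drawn from `data` (N)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(samples):
        # Assumption: draw with replacement; the real step may behave differently.
        picked = rng.choices(data, k=size)
        # Merge the picked dictionaries into one dictionary of lists, keyed by column.
        rows.append({key: [item[key] for item in picked] for key in picked[0]})
    return rows


pool = [{"sample": f"sample {i}"} for i in range(30)]  # N = 30
rows = sample_rows(pool, size=2, samples=10, seed=0)   # S = 2, M = 10
```

Each generated row has the same keys as the source dictionaries, with every value turned into a list of S sampled items.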
Attributes:
Name | Type | Description |
---|---|---|
data | List[Dict[str, Any]] | The list of dictionaries to sample from. |
size | int | Number of samples per example. For example in a few-shot learning scenario, the number of few-shot examples that will be generated per example. Defaults to 2. |
samples | int | Number of examples that will be generated by the step in total. If used with another loader step, this should be the same as the number of samples in the loader step. Defaults to 100. |
Output columns
- dynamic (based on the keys found on the first dictionary of the list): The columns of the dataset.
Categories
- load
Examples:
Sample data from a list of dictionaries:
from distilabel.steps import DataSampler

sampler = DataSampler(
    data=[{"sample": f"sample {i}"} for i in range(30)],
    samples=10,
    size=2,
    batch_size=4
)
sampler.load()
result = next(sampler.process())
# >>> result
# ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)
Pipeline with a loader and a sampler combined in a single stream:
from datasets import load_dataset
from distilabel.steps import CombineOutputs, DataSampler, LoadDataFromDicts
from distilabel.steps.tasks.apigen.utils import PrepareExamples
from distilabel.pipeline import Pipeline

ds = (
    load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    .shuffle(seed=42)
    .select(range(500))
    .to_list()
)
data = [
    {
        "func_name": "final_velocity",
        "func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
    },
    {
        "func_name": "permutation_count",
        "func_desc": "Calculates the number of permutations of k elements from a set of n elements.",
    },
    {
        "func_name": "getdivision",
        "func_desc": "Divides two numbers by making an API call to a division service.",
    },
]
with Pipeline(name="APIGenPipeline") as pipeline:
    loader_seeds = LoadDataFromDicts(data=data)
    sampler = DataSampler(
        data=ds,
        size=2,
        samples=len(data),
        batch_size=8,
    )
    prep_examples = PrepareExamples()
    # The original snippet used `combine_steps` without defining it;
    # `CombineOutputs` is assumed here as the step that merges both streams.
    combine_steps = CombineOutputs()
    sampler >> prep_examples
    (
        [loader_seeds, prep_examples]
        >> combine_steps
    )
# Now we have a single stream of data with the loader and the sampler data
Source code in src/distilabel/steps/generators/data_sampler.py
process(offset=0)
¶
Yields batches from a list of dictionaries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset to start the generation from. Defaults to `0`. | 0 |
Yields:
Type | Description |
---|---|
GeneratorStepOutput | A list of Python dictionaries as read from the inputs (propagated in batches) and a flag indicating whether the yielded batch is the last one. |
Source code in src/distilabel/steps/generators/data_sampler.py
RewardModelScore
¶
Bases: Step, CudaDevicePlacementMixin
Assign a score to a response using a Reward Model.
`RewardModelScore` is a `Step` that uses a Reward Model (RM), loaded with `transformers`,
to assign a score to a response generated for an instruction, or a score to a multi-turn
conversation.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files. |
revision |
str
|
if |
torch_dtype |
str
|
the torch dtype to use for the model e.g. "float16", "float32", etc.
Defaults to |
trust_remote_code |
bool
|
whether to allow fetching and executing remote code fetched
from the repository in the Hub. Defaults to |
device_map |
Union[str, Dict[str, Any], None]
|
a dictionary mapping each layer of the model to a device, or a mode like |
token |
Union[SecretStr, None]
|
the Hugging Face Hub token that will be used to authenticate to the Hugging
Face Hub. If not provided, the |
truncation |
bool
|
whether to truncate sequences at the maximum length. Defaults to |
max_length |
Union[int, None]
|
maximum length to use for padding or truncation. Defaults to |
Input columns
- instruction (`str`, optional): the instruction used to generate a `response`. If provided, then `response` must be provided too.
- response (`str`, optional): the response generated for `instruction`. If provided, then `instruction` must be provided too.
- conversation (`ChatType`, optional): a multi-turn conversation. If not provided, then the `instruction` and `response` columns must be provided.
Output columns
- score (`float`): the score given by the reward model for the instruction-response pair or the conversation.
Categories
- scorer
Examples:
Assigning a score for an instruction-response pair:
from distilabel.steps import RewardModelScore

step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1", device_map="auto", trust_remote_code=True
)
step.load()
result = next(
    step.process(
        inputs=[
            {
                "instruction": "How much is 2+2?",
                "response": "The output of 2+2 is 4",
            },
            {"instruction": "How much is 2+2?", "response": "4"},
        ]
    )
)
# [
#   {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},
#   {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}
# ]
Assigning a score for a multi-turn conversation:
from distilabel.steps import RewardModelScore

step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1", device_map="auto", trust_remote_code=True
)
step.load()
result = next(
    step.process(
        inputs=[
            {
                "conversation": [
                    {"role": "user", "content": "How much is 2+2?"},
                    {"role": "assistant", "content": "The output of 2+2 is 4"},
                ],
            },
            {
                "conversation": [
                    {"role": "user", "content": "How much is 2+2?"},
                    {"role": "assistant", "content": "4"},
                ],
            },
        ]
    )
)
# [
#   {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},
#   {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': '4'}], 'score': 0.10300665348768234}
# ]
Source code in src/distilabel/steps/reward_model.py
TruncateTextColumn
¶
Bases: Step
Truncate a row using a tokenizer or the number of characters.
`TruncateTextColumn` is a `Step` that truncates a row according to the max length. If
the `tokenizer` is provided, the row is truncated using the tokenizer and `max_length`
is the maximum number of tokens; otherwise `max_length` is the maximum number of
characters. The `TruncateTextColumn` step is useful for capping a row at a certain
length to avoid downstream model errors caused by overly long inputs.
Attributes:
Name | Type | Description |
---|---|---|
column |
str
|
the column to truncate. Defaults to |
max_length |
int
|
the maximum length to use for truncation.
If a |
tokenizer |
Optional[str]
|
the name of the tokenizer to use. If provided, the row will be
truncated using the tokenizer. Defaults to |
Input columns
- dynamic (determined by `column` attribute): The columns to be truncated, defaults to "text".
Output columns
- dynamic (determined by `column` attribute): The truncated column.
Categories
- text-manipulation
Examples:
Truncating a row to a given number of tokens:
from distilabel.steps import TruncateTextColumn

trunc = TruncateTextColumn(
    tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",
    max_length=4,
    column="text"
)
trunc.load()
result = next(
    trunc.process(
        [
            {"text": "This is a sample text that is longer than 10 characters"}
        ]
    )
)
# result
# [{'text': 'This is a sample'}]
Truncating a row to a given number of characters:
from distilabel.steps import TruncateTextColumn

trunc = TruncateTextColumn(max_length=10)
trunc.load()
result = next(
    trunc.process(
        [
            {"text": "This is a sample text that is longer than 10 characters"}
        ]
    )
)
# result
# [{'text': 'This is a '}]
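The two truncation modes can be contrasted with a small standalone sketch. The function names are hypothetical, and whitespace splitting is only a stand-in for a real tokenizer: a subword tokenizer will generally cut at different boundaries than this sketch does.

```python
def truncate_chars(text: str, max_length: int) -> str:
    """Character mode: keep at most `max_length` characters."""
    return text[:max_length]


def truncate_tokens(text: str, max_length: int) -> str:
    """Token mode: keep at most `max_length` tokens.

    Assumption: tokens are whitespace-separated words, which only
    approximates what a real tokenizer would produce.
    """
    return " ".join(text.split()[:max_length])


sample = "This is a sample text that is longer than 10 characters"
by_chars = truncate_chars(sample, 10)
by_tokens = truncate_tokens(sample, 4)
```

Character mode counts spaces toward the limit, so the result can end mid-phrase with a trailing space, whereas token mode always cuts at a token boundary.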
Source code in src/distilabel/steps/truncate.py