EmbeddingDedup¶
Deduplicates text using embeddings.
EmbeddingDedup is a Step that detects near-duplicates in datasets, using embeddings to compare the similarity between texts. The typical workflow with this step starts from a dataset with precomputed embeddings; the nn_indices and nn_scores columns (obtained, for example, with FaissNearestNeighbour) are then used to determine which texts are duplicates.
Attributes¶
- threshold: the threshold above which two examples are considered duplicates. It depends on the type of index that was used to generate the embeddings. For example, if the embeddings were generated using cosine similarity, a threshold of 0.9 would mark as duplicates all texts with a cosine similarity above that value. Higher values detect fewer duplicates with such an index, which should be taken into account when building it. Defaults to 0.9.

Runtime Parameters:
- threshold: the threshold above which two examples are considered duplicates.
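In practice this means comparing each row's nn_scores against the threshold. A minimal sketch of the idea, assuming cosine-similarity scores where higher means more similar (this is not the step's internal implementation):

from typing import List

def is_duplicate(nn_scores: List[float], threshold: float = 0.9) -> bool:
    # A row is flagged as a duplicate when any of its nearest
    # neighbours scores above the threshold.
    return any(score > threshold for score in nn_scores)

print(is_duplicate([0.92, 0.78]))  # True: the first neighbour exceeds 0.9
print(is_duplicate([0.75, 0.72]))  # False: no neighbour reaches the threshold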
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[nn_indices]
            ICOL1[nn_scores]
        end
        subgraph New columns
            OCOL0[keep_row_after_embedding_filtering]
        end
    end
    subgraph EmbeddingDedup
        StepInput[Input Columns: nn_indices, nn_scores]
        StepOutput[Output Columns: keep_row_after_embedding_filtering]
    end
    ICOL0 --> StepInput
    ICOL1 --> StepInput
    StepOutput --> OCOL0
    StepInput --> StepOutput
Inputs¶
- nn_indices (List[int]): a list containing the indices of the k nearest neighbours in the inputs for the row.
- nn_scores (List[float]): a list containing the score or distance to each k nearest neighbour in the inputs.
Outputs¶
- keep_row_after_embedding_filtering (bool): boolean indicating whether the piece of text is not a duplicate, i.e. whether this text should be kept.
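To make the column contract concrete, a single row could look like this before and after the step (illustrative values, assuming cosine-similarity scores and threshold=0.9):

# Illustrative row before the step:
row_in = {"nn_indices": [0, 1], "nn_scores": [0.92, 0.78]}

# The first neighbour scores above 0.9, so the step could flag
# this row as a duplicate and mark it for removal:
row_out = {**row_in, "keep_row_after_embedding_filtering": False}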
Examples¶
Deduplicate a list of texts using embedding information¶
from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingDedup, LoadDataFromDicts

batch_size = 10  # batch size shared by both steps in this example

with Pipeline() as pipeline:
    data = LoadDataFromDicts(
        data=[
            {
                "persona": "A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.",
                "embedding": [
                    0.018477669046149742,
                    -0.03748236608841726,
                    0.001919870620352492,
                    0.024918478063770535,
                    0.02348063521315178,
                    0.0038251285566308375,
                    -0.01723884983037716,
                    0.02881971942372201,
                ],
                "nn_indices": [0, 1],
                "nn_scores": [
                    0.9164746999740601,
                    0.782106876373291,
                ],
            },
            {
                "persona": "A music teacher or instructor focused on theoretical and practical piano lessons.",
                "embedding": [
                    -0.0023464179614082125,
                    -0.07325472251663565,
                    -0.06058678419516501,
                    -0.02100326928586996,
                    -0.013462744792362657,
                    0.027368447064244242,
                    -0.003916070100455717,
                    0.01243614518480423,
                ],
                "nn_indices": [0, 2],
                "nn_scores": [
                    0.7552462220191956,
                    0.7261884808540344,
                ],
            },
            {
                "persona": "A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.",
                "embedding": [
                    -0.01630817942328242,
                    -0.023760151552345232,
                    -0.014249650090627883,
                    -0.005713686451446624,
                    -0.016033059279131567,
                    0.0071440908501058786,
                    -0.05691099643425161,
                    0.01597412704817784,
                ],
                "nn_indices": [1, 2],
                "nn_scores": [
                    0.8107735514640808,
                    0.7172299027442932,
                ],
            },
        ],
        batch_size=batch_size,
    )
    # In general you should run a step like the following before the deduplication
    # step, to obtain the `nn_indices` and `nn_scores` columns. It's skipped here
    # because the example data already includes them (the embeddings are already
    # normalized, so METRIC_INNER_PRODUCT behaves like cosine similarity).
    # nn = FaissNearestNeighbour(
    #     k=30,
    #     metric_type=faiss.METRIC_INNER_PRODUCT,
    #     search_batch_size=50,
    #     train_size=len(dataset),  # The number of embeddings to use for training
    #     string_factory="IVF300_HNSW32,Flat",  # To use an index (optional, maybe required for big datasets)
    # )
    # Read more about the `string_factory` here:
    # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

    embedding_dedup = EmbeddingDedup(
        threshold=0.8,
        input_batch_size=batch_size,
    )

    data >> embedding_dedup

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    ds = distiset["default"]["train"]
    # Filter out the duplicates
    ds_dedup = ds.filter(lambda x: x["keep_row_after_embedding_filtering"])
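When the nn_indices and nn_scores columns are not precomputed, the nearest-neighbour step can be chained directly in front of the deduplication step. A minimal sketch under that assumption, reusing the parameters from the comment above (the data and embedding values here are purely illustrative):

import faiss
from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingDedup, FaissNearestNeighbour, LoadDataFromDicts

with Pipeline() as pipeline:
    # Rows only need an `embedding` column; `nn_indices` and
    # `nn_scores` are computed by the nearest-neighbour step.
    data = LoadDataFromDicts(
        data=[{"persona": "...", "embedding": [0.1, 0.2, 0.3, 0.4]}],  # illustrative
    )
    nn = FaissNearestNeighbour(
        k=30,
        metric_type=faiss.METRIC_INNER_PRODUCT,  # inner product == cosine similarity on normalized vectors
        search_batch_size=50,
    )
    embedding_dedup = EmbeddingDedup(threshold=0.9)
    data >> nn >> embedding_dedup

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)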