MinHashDedup¶
Deduplicates text using MinHash and MinHashLSH.
MinHashDedup is a Step that detects near-duplicates in datasets. The idea roughly translates
to the following steps:
1. Tokenize the text into words or ngrams.
2. Create a MinHash for each text.
3. Store the MinHashes in a MinHashLSH.
4. Check if the MinHash is already in the LSH; if so, it is a duplicate.
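For intuition, here is a minimal standalone sketch of those four steps using `datasketch`'s `MinHash` and `MinHashLSH` directly (the library the step builds on). The tokenization is simplified to `str.split` here, whereas `MinHashDedup` uses nltk's word tokenizer, and the index keys are just row positions:

from datasketch import MinHash, MinHashLSH

texts = [
    "This is a test document.",
    "This document is a test.",
    "This is another unique document.",
]

lsh = MinHashLSH(threshold=0.9, num_perm=128)
keep = []
for i, text in enumerate(texts):
    # 1. Tokenize the text into words (simplified to str.split here).
    tokens = text.split()
    # 2. Create a MinHash for the text.
    mh = MinHash(num_perm=128)
    for token in tokens:
        mh.update(token.encode("utf-8"))
    # 4. If a similar MinHash is already in the LSH, the text is a duplicate.
    keep.append(len(lsh.query(mh)) == 0)
    # 3. Store the MinHash in the MinHashLSH index.
    lsh.insert(str(i), mh)

# keep[i] is False for texts whose MinHash matched an earlier entry.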
Attributes¶
- `num_perm`: the number of permutations to use. Defaults to `128`.
- `seed`: the seed to use for the MinHash. Defaults to `1`.
- `tokenizer`: the tokenizer to use. Available ones are `words` or `ngrams`. If `words` is selected, the text is tokenized into words using nltk's word tokenizer. If `ngrams` is selected, the text is tokenized into ngrams of size `n`. Defaults to `words`.
- `n`: the size of the ngrams to use. Only relevant if `tokenizer="ngrams"`. Defaults to `5`.
- `threshold`: the threshold to consider two MinHashes as duplicates. Values closer to 0 detect more duplicates. Defaults to `0.9`.
- `storage`: the storage to use for the LSH index. Can be `dict` to store the index in memory, or `disk`. Keep in mind that `disk` is an experimental feature not defined in `datasketch`; it is based on DiskCache's `Index` class, which should work like a `dict` backed by disk, but depending on the system it can be slower. Defaults to `dict`. A configuration sketch using these attributes follows this list.
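As an illustration of how these attributes combine, the following sketch configures the step for ngram tokenization with the experimental disk-backed index; the values shown are the documented defaults except for `tokenizer` and `storage`, and are illustrative rather than recommendations:

from distilabel.steps import MinHashDedup

minhash_dedup = MinHashDedup(
    num_perm=128,        # permutations per MinHash (default)
    seed=1,              # seed for the MinHash (default)
    tokenizer="ngrams",  # tokenize into ngrams instead of words
    n=5,                 # ngram size, only used with tokenizer="ngrams" (default)
    threshold=0.9,       # values closer to 0 detect more duplicates (default)
    storage="disk",      # experimental disk-backed index for bigger datasets
)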
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[text]
end
subgraph New columns
OCOL0[keep_row_after_minhash_filtering]
end
end
subgraph MinHashDedup
StepInput[Input Columns: text]
StepOutput[Output Columns: keep_row_after_minhash_filtering]
end
ICOL0 --> StepInput
StepOutput --> OCOL0
StepInput --> StepOutput
Inputs¶
- `text` (`str`): the texts to be filtered.
Outputs¶
- `keep_row_after_minhash_filtering` (`bool`): boolean indicating whether the piece of `text` is not a duplicate, i.e., whether this text should be kept.
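As an illustration with hypothetical rows, assuming the second text is detected as a near-duplicate of the first, the step adds the boolean column like this:

rows = [
    {"text": "This is a test document.", "keep_row_after_minhash_filtering": True},
    {"text": "This document is a test.", "keep_row_after_minhash_filtering": False},
]

# Keep only the rows flagged as non-duplicates.
unique_rows = [row for row in rows if row["keep_row_after_minhash_filtering"]]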
Examples¶
Deduplicate a list of texts using MinHash and MinHashLSH¶
from distilabel.pipeline import Pipeline
from distilabel.steps import MinHashDedup
from distilabel.steps import LoadDataFromDicts

with Pipeline() as pipeline:
    ds_size = 1000
    batch_size = 500  # Bigger batch sizes work better for this step
    data = LoadDataFromDicts(
        data=[
            {"text": "This is a test document."},
            {"text": "This document is a test."},
            {"text": "Test document for duplication."},
            {"text": "Document for duplication test."},
            {"text": "This is another unique document."},
        ]
        * (ds_size // 5),
        batch_size=batch_size,
    )
    minhash_dedup = MinHashDedup(
        tokenizer="words",
        threshold=0.9,  # lower values will increase the number of duplicates
        storage="dict",  # or "disk" for bigger datasets
    )
    data >> minhash_dedup

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    ds = distiset["default"]["train"]
    # Filter out the duplicates
    ds_dedup = ds.filter(lambda x: x["keep_row_after_minhash_filtering"])
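Note that `MinHashDedup` only adds the `keep_row_after_minhash_filtering` column; it does not drop rows itself, which is why the final `filter` call is needed. For datasets that do not fit comfortably in memory, `storage="disk"` can be used instead of `dict`, at the cost of potentially slower lookups as noted in the attributes above.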