MinHashDedup¶
Deduplicates text using MinHash and MinHashLSH.
`MinHashDedup` is a `Step` that detects near-duplicates in datasets. The idea roughly translates to the following steps:
1. Tokenize the text into words or ngrams.
2. Create a `MinHash` for each text.
3. Store the `MinHash`es in a `MinHashLSH`.
4. Check if the `MinHash` is already in the LSH; if so, it is a duplicate.
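For intuition, here is a minimal sketch of those four steps using `datasketch` directly. This illustrates the idea rather than `MinHashDedup`'s exact internals: the whitespace tokenization, the `doc-{i}` keys, and the sample texts are assumptions.

from datasketch import MinHash, MinHashLSH

texts = [
    "This is a test document.",
    "This document is a test.",
    "A completely different sentence.",
]

lsh = MinHashLSH(threshold=0.9, num_perm=128)  # step 3's index
duplicate_indices = []

for i, text in enumerate(texts):
    # 1. Tokenize the text (here: a naive whitespace split).
    tokens = text.lower().split()
    # 2. Create a MinHash for the text.
    mh = MinHash(num_perm=128, seed=1)
    for token in tokens:
        mh.update(token.encode("utf-8"))
    # 4. Check whether a similar MinHash is already in the index...
    if lsh.query(mh):
        duplicate_indices.append(i)
    else:
        # 3. ...and if not, store it.
        lsh.insert(f"doc-{i}", mh)

print(duplicate_indices)  # rows flagged as near-duplicates

Rows that end up in duplicate_indices correspond to those the step would mark with `keep_row_after_minhash_filtering=False`.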
Attributes¶
- `num_perm`: the number of permutations to use. Defaults to `128`.
- `seed`: the seed to use for the MinHash. Defaults to `1`.
- `tokenizer`: the tokenizer to use. Available ones are `words` or `ngrams`. If `words` is selected, the text is tokenized into words using nltk's word tokenizer; `ngrams` computes the ngrams of the text (together with the size `n`), as sketched after this list. Defaults to `words`.
- `n`: the size of the ngrams to use. Only relevant if `tokenizer="ngrams"`. Defaults to `5`.
- `threshold`: the threshold to consider two MinHashes as duplicates. Values closer to `0` detect more duplicates. Defaults to `0.9`.
- `storage`: the storage to use for the LSH. Can be `dict` to store the index in memory, or `disk`. Keep in mind that `disk` is an experimental feature not defined in `datasketch`; it is based on DiskCache's `Index` class. It should work like a `dict` backed by disk, but depending on the system it can be slower. Defaults to `dict`.
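To make the `tokenizer` and `n` attributes concrete, the sketch below shows roughly what the two modes produce, using nltk directly. Treat it as an approximation: whether distilabel applies the ngrams at the character level exactly as shown is an assumption.

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

text = "This is a test document."

# tokenizer="words": split into word tokens with nltk
# (may require a one-time nltk.download("punkt")).
print(word_tokenize(text))
# ['This', 'is', 'a', 'test', 'document', '.']

# tokenizer="ngrams" with n=5: sliding 5-grams over the text,
# shown here at the character level.
print(["".join(gram) for gram in ngrams(text, 5)][:3])
# ['This ', 'his i', 'is is']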
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[text]
        end
        subgraph New columns
            OCOL0[keep_row_after_minhash_filtering]
        end
    end
    subgraph MinHashDedup
        StepInput[Input Columns: text]
        StepOutput[Output Columns: keep_row_after_minhash_filtering]
    end
    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepInput --> StepOutput
Inputs¶
- `text` (`str`): the texts to be filtered.
Outputs¶
- `keep_row_after_minhash_filtering` (`bool`): boolean indicating whether the piece of `text` is not a duplicate, i.e., whether this text should be kept.
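Note that the step itself does not drop any rows; it only appends the boolean column, and filtering is left to you (as in the example below). A hypothetical output could look like this, with made-up values:

# Hypothetical output rows (values invented for illustration).
rows = [
    {"text": "This is a test document.", "keep_row_after_minhash_filtering": True},
    {"text": "This document is a test.", "keep_row_after_minhash_filtering": False},
]
# Downstream, keep only the non-duplicates:
kept = [row for row in rows if row["keep_row_after_minhash_filtering"]]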
Examples¶
Deduplicate a list of texts using MinHash and MinHashLSH¶
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, MinHashDedup
with Pipeline() as pipeline:
    ds_size = 1000
    batch_size = 500  # Bigger batch sizes work better for this step
    data = LoadDataFromDicts(
        data=[
            {"text": "This is a test document."},
            {"text": "This document is a test."},
            {"text": "Test document for duplication."},
            {"text": "Document for duplication test."},
            {"text": "This is another unique document."},
        ]
        * (ds_size // 5),
        batch_size=batch_size,
    )
    minhash_dedup = MinHashDedup(
        tokenizer="words",
        threshold=0.9,      # lower values will increase the number of duplicates
        storage="dict",     # or "disk" for bigger datasets
    )
    data >> minhash_dedup
if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    ds = distiset["default"]["train"]
    # Filter out the duplicates
    ds_dedup = ds.filter(lambda x: x["keep_row_after_minhash_filtering"])