DeitaFiltering¶
Filter dataset rows using the DEITA filtering strategy.
Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes¶
- data_budget: The desired size of the dataset after filtering.
- diversity_threshold: If a row's cosine distance to its nearest neighbor is greater than this value, the row is included in the filtered dataset. Defaults to 0.9.
- normalize_embeddings: Whether to normalize the embeddings before computing the cosine distance. Defaults to True.
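As an illustration of what normalize_embeddings might do, here is a minimal sketch that assumes "normalize" means scaling each embedding to unit L2 norm. The helper names `l2_normalize` and `cosine_distance_unit` are hypothetical and not part of distilabel. Cosine distance depends only on direction, so normalizing should not change the result beyond floating-point rounding. Its main effect is that the distance reduces to one minus a plain dot product.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm (hypothetical helper, not a distilabel API)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_distance_unit(a: List[float], b: List[float]) -> float:
    """Cosine distance for vectors that are already unit length: 1 - dot(a, b)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([-6.0, -8.0])  # same line as `a`, opposite direction
distance = cosine_distance_unit(a, b)
print(distance)  # approximately 2.0, the maximum cosine distance
```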
Runtime Parameters¶
- data_budget: The desired size of the dataset after filtering.
- diversity_threshold: If a row's cosine distance to its nearest neighbor is greater than this value, the row is included in the filtered dataset.
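The interaction between data_budget and diversity_threshold can be sketched as follows. This is a minimal illustration, not the library's implementation. It assumes each row already carries the deita_score and nearest_neighbor_distance columns described on this page, and that rows are considered from the highest score downward until the budget is reached. The function name `select_rows`, the `id` field, and the sample values are hypothetical.

```python
from typing import Any, Dict, List


def select_rows(
    rows: List[Dict[str, Any]],
    data_budget: int,
    diversity_threshold: float = 0.9,
) -> List[Dict[str, Any]]:
    """Keep at most `data_budget` rows, preferring high scores (hypothetical helper)."""
    # Assumption: candidates are visited in descending order of DEITA score.
    ranked = sorted(rows, key=lambda row: row["deita_score"], reverse=True)
    selected: List[Dict[str, Any]] = []
    for row in ranked:
        if len(selected) >= data_budget:
            break
        # A row qualifies only if it is far enough from its nearest neighbor.
        if row["nearest_neighbor_distance"] > diversity_threshold:
            selected.append(row)
    return selected


rows = [
    {"id": "a", "deita_score": 0.49, "nearest_neighbor_distance": 0.25},
    {"id": "b", "deita_score": 0.36, "nearest_neighbor_distance": 0.25},
    {"id": "c", "deita_score": 0.25, "nearest_neighbor_distance": 1.90},
]
kept = select_rows(rows, data_budget=1)
print([row["id"] for row in kept])  # ['c']
```

In this sketch the two highest-scoring rows are skipped because they sit too close to a neighbor, so a lower-scoring but more distinct row fills the budget.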
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[evol_instruction_score]
            ICOL1[evol_response_score]
            ICOL2[embedding]
        end
        subgraph New columns
            OCOL0[deita_score]
            OCOL1[deita_score_computed_with]
            OCOL2[nearest_neighbor_distance]
        end
    end
    subgraph DeitaFiltering
        StepInput[Input Columns: evol_instruction_score, evol_response_score, embedding]
        StepOutput[Output Columns: deita_score, deita_score_computed_with, nearest_neighbor_distance]
    end
    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput
Inputs¶
- evol_instruction_score (float): The score of the instruction generated by the ComplexityScorer step.
- evol_response_score (float): The score of the response generated by the QualityScorer step.
- embedding (List[float]): The embedding generated for the conversation of the instruction-response pair using the GenerateEmbeddings step.
Outputs¶
- deita_score (float): The DEITA score for the instruction-response pair.
- deita_score_computed_with (List[str]): The scores used to compute the DEITA score.
- nearest_neighbor_distance (float): The cosine distance between the embedding of the instruction-response pair and the embedding of its nearest neighbor.
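The output columns can be approximated with the sketch below, which rests on two assumptions: the DEITA score is the product of the two input scores, and the nearest neighbor is searched among all other rows in the input. Both are consistent with the values in the documented example on this page (0.5 × 0.5 = 0.25), but neither is confirmed as the library's actual implementation. The helper names `cosine_distance` and `add_deita_columns` are hypothetical.

```python
import math
from typing import Any, Dict, List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def add_deita_columns(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach the three output columns to each row (hypothetical helper)."""
    enriched = []
    for i, row in enumerate(rows):
        # Assumption: the DEITA score is the product of the two input scores.
        score = row["evol_instruction_score"] * row["evol_response_score"]
        # Assumption: the neighbor search covers every other input row.
        nearest = min(
            (
                cosine_distance(row["embedding"], other["embedding"])
                for j, other in enumerate(rows)
                if j != i
            ),
            default=math.inf,  # a lone row has no neighbor
        )
        enriched.append(
            {
                **row,
                "deita_score": score,
                "deita_score_computed_with": [
                    "evol_instruction_score",
                    "evol_response_score",
                ],
                "nearest_neighbor_distance": nearest,
            }
        )
    return enriched


rows = [
    {"evol_instruction_score": 0.5, "evol_response_score": 0.5,
     "embedding": [-8.12729941, -5.24642847, -6.34003029]},
    {"evol_instruction_score": 0.6, "evol_response_score": 0.6,
     "embedding": [2.99329242, 0.7800932, 0.7799726]},
    {"evol_instruction_score": 0.7, "evol_response_score": 0.7,
     "embedding": [10.29041806, 14.33088073, 13.00557506]},
]
enriched = add_deita_columns(rows)
print(enriched[0]["deita_score"])  # 0.25
print(round(enriched[0]["nearest_neighbor_distance"], 4))  # 1.9043
```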
Examples¶
Filter the dataset based on the DEITA score and the cosine distance between the embeddings¶
from distilabel.steps import DeitaFiltering
deita_filtering = DeitaFiltering(data_budget=1)
deita_filtering.load()
result = next(
    deita_filtering.process(
        [
            {
                "evol_instruction_score": 0.5,
                "evol_response_score": 0.5,
                "embedding": [-8.12729941, -5.24642847, -6.34003029],
            },
            {
                "evol_instruction_score": 0.6,
                "evol_response_score": 0.6,
                "embedding": [2.99329242, 0.7800932, 0.7799726],
            },
            {
                "evol_instruction_score": 0.7,
                "evol_response_score": 0.7,
                "embedding": [10.29041806, 14.33088073, 13.00557506],
            },
        ],
    )
)
# >>> result
# [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]