DeitaFiltering

Filter dataset rows using the DEITA filtering strategy.

Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes

  • data_budget: The desired size of the dataset after filtering.

  • diversity_threshold: If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset. Defaults to 0.9.

  • normalize_embeddings: Whether to normalize the embeddings before computing the cosine distance. Defaults to True.

Runtime Parameters

  • data_budget: The desired size of the dataset after filtering.

  • diversity_threshold: If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset.
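
The interaction between data_budget and diversity_threshold can be illustrated with a minimal, library-independent sketch. It is not the step's actual implementation: the helper name `select_rows` is hypothetical, and ranking rows by `deita_score` before applying the threshold is an assumption based on the selection procedure described in the DEITA paper.

```python
from typing import Dict, List


def select_rows(
    rows: List[Dict],
    data_budget: int,
    diversity_threshold: float = 0.9,
) -> List[Dict]:
    """Hypothetical helper: keep at most `data_budget` rows that are diverse enough.

    Assumption: rows are visited from highest to lowest `deita_score`.
    """
    ranked = sorted(rows, key=lambda row: row["deita_score"], reverse=True)
    selected: List[Dict] = []
    for row in ranked:
        if len(selected) >= data_budget:
            break  # the budget is exhausted, stop selecting
        # Include the row only if it is far enough from its nearest neighbor.
        if row["nearest_neighbor_distance"] > diversity_threshold:
            selected.append(row)
    return selected


rows = [
    {"id": "a", "deita_score": 9.0, "nearest_neighbor_distance": 0.95},
    {"id": "b", "deita_score": 8.0, "nearest_neighbor_distance": 0.50},
    {"id": "c", "deita_score": 7.0, "nearest_neighbor_distance": 0.92},
    {"id": "d", "deita_score": 6.0, "nearest_neighbor_distance": 0.99},
]
kept = select_rows(rows, data_budget=2)
print([row["id"] for row in kept])
```

In this sketch, row `b` is skipped despite its high score because its nearest-neighbor distance is below the threshold, and row `d` is dropped only because the budget of 2 is already filled.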

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[evol_instruction_score]
            ICOL1[evol_response_score]
            ICOL2[embedding]
        end
        subgraph New columns
            OCOL0[deita_score]
            OCOL1[deita_score_computed_with]
            OCOL2[nearest_neighbor_distance]
        end
    end

    subgraph DeitaFiltering
        StepInput[Input Columns: evol_instruction_score, evol_response_score, embedding]
        StepOutput[Output Columns: deita_score, deita_score_computed_with, nearest_neighbor_distance]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput

Inputs

  • evol_instruction_score (float): The score of the instruction generated by the ComplexityScorer step.

  • evol_response_score (float): The score of the response generated by the QualityScorer step.

  • embedding (List[float]): The embedding generated for the conversation of the instruction-response pair using the GenerateEmbeddings step.

Outputs

  • deita_score (float): The DEITA score for the instruction-response pair.

  • deita_score_computed_with (List[str]): The scores used to compute the DEITA score.

  • nearest_neighbor_distance (float): The cosine distance between the embedding of the instruction-response pair and the embedding of its nearest neighbor.
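
As a rough illustration of how these output columns relate to the inputs, the sketch below derives them in plain Python. It is a sketch under stated assumptions rather than the step's real code: the helper names `cosine_distance` and `add_deita_columns` are hypothetical, the DEITA score is taken as the product of the two input scores (as defined in the DEITA paper), and the nearest neighbor is searched among all other rows by brute force.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def add_deita_columns(rows: List[Dict]) -> List[Dict]:
    """Hypothetical helper that adds the three output columns to each row."""
    result = []
    for i, row in enumerate(rows):
        # Distances from this row to every other row (brute force).
        distances = [
            cosine_distance(row["embedding"], other["embedding"])
            for j, other in enumerate(rows)
            if j != i
        ]
        result.append(
            {
                **row,
                # Assumption: the score is the product of both input scores.
                "deita_score": row["evol_instruction_score"]
                * row["evol_response_score"],
                "deita_score_computed_with": [
                    "evol_instruction_score",
                    "evol_response_score",
                ],
                # Assumption: a lone row has no neighbor, so use infinity.
                "nearest_neighbor_distance": min(distances, default=math.inf),
            }
        )
    return result


rows = [
    {"evol_instruction_score": 2.0, "evol_response_score": 3.0, "embedding": [1.0, 0.0]},
    {"evol_instruction_score": 1.0, "evol_response_score": 4.0, "embedding": [0.0, 1.0]},
    {"evol_instruction_score": 5.0, "evol_response_score": 1.0, "embedding": [1.0, 0.0]},
]
scored = add_deita_columns(rows)
for row in scored:
    print(row["deita_score"], row["nearest_neighbor_distance"])
```

Because cosine distance depends only on the direction of the vectors, normalizing the embeddings beforehand does not change the values this sketch produces.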

References