DeitaFiltering
Filter dataset rows using the DEITA filtering strategy.
Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes
- data_budget: The desired size of the dataset after filtering.
- diversity_threshold: If the cosine distance between a row and its nearest neighbor is greater than this value, the row is included in the filtered dataset. Defaults to 0.9.
- normalize_embeddings: Whether to normalize the embeddings before computing the cosine distance. Defaults to True.
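To illustrate why normalization matters for the cosine distance, here is a minimal sketch in plain Python. It assumes the distance is computed as 1 minus the dot product, which equals the cosine distance only when the vectors have unit length. The helper names `cosine_distance` and `nearest_neighbor_distances` are hypothetical and not part of distilabel.

```python
import math


def cosine_distance(u, v, normalize=True):
    """Return 1 minus the dot product of u and v (assumed distance formula)."""
    if normalize:
        # Scale both vectors to unit length so the dot product
        # becomes the cosine similarity.
        u_norm = math.sqrt(sum(x * x for x in u))
        v_norm = math.sqrt(sum(x * x for x in v))
        u = [x / u_norm for x in u]
        v = [x / v_norm for x in v]
    return 1.0 - sum(a * b for a, b in zip(u, v))


def nearest_neighbor_distances(embeddings, normalize=True):
    """For each embedding, return the smallest distance to any other one.

    Needs at least two embeddings, otherwise there is no neighbor to compare.
    """
    return [
        min(
            cosine_distance(current, other, normalize)
            for j, other in enumerate(embeddings)
            if j != i
        )
        for i, current in enumerate(embeddings)
    ]


# The three embeddings used in the example on this page.
embeddings = [
    [-8.12729941, -5.24642847, -6.34003029],
    [2.99329242, 0.7800932, 0.7799726],
    [10.29041806, 14.33088073, 13.00557506],
]
distances = nearest_neighbor_distances(embeddings)
print(distances)
```

For these embeddings the first distance is roughly 1.904, in line with the nearest_neighbor_distance reported in the example on this page. The other two rows are only about 0.25 apart, so both fall below the default diversity_threshold of 0.9. With `normalize=False` the same formula returns values outside the usual 0 to 2 range, because the raw dot product is no longer a cosine similarity.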
Runtime Parameters
- data_budget: The desired size of the dataset after filtering.
- diversity_threshold: If the cosine distance between a row and its nearest neighbor is greater than this value, the row is included in the filtered dataset.
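The interaction between the two parameters can be sketched as a simple selection loop. This is an illustration under stated assumptions, not the library's implementation: it assumes rows are visited in descending deita_score order, and the `select_rows` helper and the `id` field are hypothetical names used only here.

```python
def select_rows(rows, data_budget, diversity_threshold=0.9):
    """Pick up to data_budget rows that are far enough from their nearest neighbor."""
    # Assumption: higher-scoring rows are considered first.
    ranked = sorted(rows, key=lambda row: row["deita_score"], reverse=True)
    selected = []
    for row in ranked:
        if len(selected) >= data_budget:
            # The budget is exhausted, so stop looking at further rows.
            break
        if row["nearest_neighbor_distance"] > diversity_threshold:
            selected.append(row)
    return selected


rows = [
    {"id": "a", "deita_score": 0.25, "nearest_neighbor_distance": 1.90},
    {"id": "b", "deita_score": 0.36, "nearest_neighbor_distance": 0.25},
    {"id": "c", "deita_score": 0.49, "nearest_neighbor_distance": 0.25},
]
kept = select_rows(rows, data_budget=1)
print([row["id"] for row in kept])  # ['a']
```

Rows "b" and "c" score higher but sit too close to each other, so only "a" passes the diversity check.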
Input & Output Columns
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[evol_instruction_score]
            ICOL1[evol_response_score]
            ICOL2[embedding]
        end
        subgraph New columns
            OCOL0[deita_score]
            OCOL1[deita_score_computed_with]
            OCOL2[nearest_neighbor_distance]
        end
    end
    subgraph DeitaFiltering
        StepInput[Input Columns: evol_instruction_score, evol_response_score, embedding]
        StepOutput[Output Columns: deita_score, deita_score_computed_with, nearest_neighbor_distance]
    end
    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput
Inputs
- evol_instruction_score (float): The score of the instruction generated by the ComplexityScorer step.
- evol_response_score (float): The score of the response generated by the QualityScorer step.
- embedding (List[float]): The embedding generated for the conversation of the instruction-response pair using the GenerateEmbeddings step.
Outputs
- deita_score (float): The DEITA score for the instruction-response pair.
- deita_score_computed_with (List[str]): The scores used to compute the DEITA score.
- nearest_neighbor_distance (float): The cosine distance between the embedding of the instruction-response pair and that of its nearest neighbor.
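As a rough sketch of how the first two output columns relate to the inputs: in the example on this page, scores of 0.5 and 0.5 produce a deita_score of 0.25, which is consistent with multiplying the two scores. The fallback to a single score when the other is missing, the error raised when both are missing, and the helper name `compute_deita_score` are assumptions made for illustration only.

```python
SCORE_COLUMNS = ["evol_instruction_score", "evol_response_score"]


def compute_deita_score(row):
    """Return a copy of row with deita_score and deita_score_computed_with added."""
    # Assumption: only the scores that are actually present are used.
    available = [name for name in SCORE_COLUMNS if row.get(name) is not None]
    if not available:
        # Hypothetical choice: refuse rows that carry no score at all.
        raise ValueError("row has neither an instruction score nor a response score")
    score = 1.0
    for name in available:
        score *= row[name]
    return {**row, "deita_score": score, "deita_score_computed_with": available}


scored = compute_deita_score(
    {"evol_instruction_score": 0.5, "evol_response_score": 0.5}
)
print(scored["deita_score"])  # 0.25
print(scored["deita_score_computed_with"])
```

Recording which columns went into the score makes it possible to tell a product of two scores apart from a single-score fallback later on.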
Examples
Filter the dataset based on the DEITA score and the cosine distance between the embeddings
from distilabel.steps import DeitaFiltering

# Keep at most one row after filtering.
deita_filtering = DeitaFiltering(data_budget=1)
deita_filtering.load()

# process() is a generator; take the first batch of filtered rows.
result = next(
    deita_filtering.process(
        [
            {
                "evol_instruction_score": 0.5,
                "evol_response_score": 0.5,
                "embedding": [-8.12729941, -5.24642847, -6.34003029],
            },
            {
                "evol_instruction_score": 0.6,
                "evol_response_score": 0.6,
                "embedding": [2.99329242, 0.7800932, 0.7799726],
            },
            {
                "evol_instruction_score": 0.7,
                "evol_response_score": 0.7,
                "embedding": [10.29041806, 14.33088073, 13.00557506],
            },
        ],
    )
)
# >>> result
# [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]