BitextRetrievalGenerator
Generate bitext retrieval data with an LLM to later train an embedding model.
BitextRetrievalGenerator is a GeneratorTask that generates bitext retrieval data with an LLM, which can later be used to train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models". The data is generated from the provided attributes, and any attribute that is not provided is randomly sampled.
Attributes
- source_language: The source language of the data to be generated, which can be any of the languages listed for XLM-R in Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
- target_language: The target language of the data to be generated, which can be any of the languages listed for XLM-R in Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
- unit: The unit of the data to be generated, which can be sentence, phrase, or passage. Defaults to None, meaning that it will be randomly sampled.
- difficulty: The difficulty of the query to be generated, which can be elementary school, high school, or college. Defaults to None, meaning that it will be randomly sampled.
- high_score: The high score of the query to be generated, which can be 4, 4.5, or 5. Defaults to None, meaning that it will be randomly sampled.
- low_score: The low score of the query to be generated, which can be 2.5, 3, or 3.5. Defaults to None, meaning that it will be randomly sampled.
- seed: The random seed to be set in case there's any sampling within the format_input method.
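The "randomly sampled if not provided" behavior can be sketched in plain Python. This is a minimal illustration under stated assumptions, not distilabel's actual implementation: the helper name resolve_attributes, the ALLOWED mapping, and the use of a local random.Random instance are hypothetical. Only the allowed values themselves come from the attribute list above, written as strings to match the usage example.

```python
import random
from typing import Dict, Optional

# Allowed values copied from the attribute list; the mapping itself is hypothetical.
ALLOWED = {
    "unit": ["sentence", "phrase", "passage"],
    "difficulty": ["elementary school", "high school", "college"],
    "high_score": ["4", "4.5", "5"],
    "low_score": ["2.5", "3", "3.5"],
}


def resolve_attributes(seed: int = 42, **given: Optional[str]) -> Dict[str, str]:
    """Hypothetical helper: keep provided values and sample the missing ones."""
    rng = random.Random(seed)  # local RNG, so the same seed gives the same result
    resolved = {}
    for name, choices in ALLOWED.items():
        value = given.get(name)
        if value is None:
            # Attribute not provided: pick one of the allowed values at random.
            value = rng.choice(choices)
        elif value not in choices:
            raise ValueError(f"{name} must be one of {choices}, got {value!r}")
        resolved[name] = value
    return resolved


if __name__ == "__main__":
    # unit is fixed by the caller; the other three attributes are sampled.
    print(resolve_attributes(seed=7, unit="phrase"))
```

Using a dedicated random.Random(seed) instead of the module-level functions keeps the sampling reproducible without touching global random state. Whether the real task does the same is not stated in this page.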
Input & Output Columns
graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[S1]
            OCOL1[S2]
            OCOL2[S3]
            OCOL3[model_name]
        end
    end
    subgraph BitextRetrievalGenerator
        StepOutput[Output Columns: S1, S2, S3, model_name]
    end
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepOutput --> OCOL3
Outputs
- S1 (str): the first sentence generated by the LLM.
- S2 (str): the second sentence generated by the LLM.
- S3 (str): the third sentence generated by the LLM.
- model_name (str): the name of the model used to generate the bitext retrieval data.
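To make the output columns concrete, here is a sketch of how a raw LLM response could be mapped onto S1, S2, and S3. It assumes the LLM answers with a JSON object keyed by those three names. The function name parse_bitext_output and the fall-back-to-None behavior are hypothetical, and the task's real parsing logic may differ.

```python
import json
from typing import Dict, Optional

OUTPUT_KEYS = ("S1", "S2", "S3")


def parse_bitext_output(raw: Optional[str]) -> Dict[str, Optional[str]]:
    """Hypothetical parser: map a JSON response onto the S1/S2/S3 columns."""
    empty: Dict[str, Optional[str]] = {key: None for key in OUTPUT_KEYS}
    if raw is None:
        return empty
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed response: keep the columns but leave them empty.
        return empty
    if not isinstance(data, dict):
        return empty
    # Keys missing from the response stay None; extra keys are ignored.
    return {key: data.get(key) for key in OUTPUT_KEYS}


if __name__ == "__main__":
    response = '{"S1": "Good morning.", "S2": "Buenos días.", "S3": "Buenas noches."}'
    print(parse_bitext_output(response))
```

Returning a dictionary with all three keys even on failure keeps every row of the resulting dataset aligned with the same columns.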
Examples
Generate bitext retrieval data for training embedding models
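from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import BitextRetrievalGenerator

with Pipeline("my-pipeline") as pipeline:
    # Every attribute is set explicitly here; attributes left as None are sampled.
    task = BitextRetrievalGenerator(
        source_language="English",
        target_language="Spanish",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,  # placeholder: pass an LLM instance here
    )

    ...

    # Connect the task to the next step in the pipeline.
    task >> ...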