GenerateTextRetrievalData¶
Generate text retrieval data with an LLM to later train an embedding model.
GenerateTextRetrievalData is a Task that generates text retrieval data with an LLM;
the generated data can later be used to train an embedding model. The task is based
on the paper "Improving Text Embeddings with Large Language Models". The data is
generated according to the provided attributes, and any attribute that is not
provided is randomly sampled.
Note¶
Ideally, this task should be used together with EmbeddingTaskGenerator, configured with
flatten_tasks=True and category="text-retrieval", so that the LLM generates a list of
tasks for the text-retrieval category and the list is flattened so that each row contains
a single task.
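To make the flattening concrete, here is a minimal sketch in plain Python. It is not distilabel's implementation: the `flatten_task_rows` helper, the `tasks` key for the unflattened column, and the sample task strings are all hypothetical. Only the `task` column name comes from this page.

```python
from typing import Dict, List


def flatten_task_rows(rows: List[Dict[str, List[str]]]) -> List[Dict[str, str]]:
    """Expand rows that hold a list under "tasks" into one row per task.

    Hypothetical helper for illustration only: the "tasks" input key is an
    assumption, while "task" matches the input column of this task.
    """
    flattened = []
    for row in rows:
        for task in row["tasks"]:
            flattened.append({"task": task})
    return flattened


# One row containing two made-up task descriptions.
rows = [
    {
        "tasks": [
            "Retrieve recipes that use a given ingredient.",
            "Find court rulings relevant to a legal question.",
        ]
    }
]

flat_rows = flatten_task_rows(rows)
for row in flat_rows:
    print(row["task"])
```

After flattening, each row carries exactly one `task` value, which is the shape GenerateTextRetrievalData expects as input.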
Attributes¶
- language: The language of the data to be generated, which can be any of the languages in the XLM-R list in Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
- query_type: The type of query to be generated, which can be "extremely long-tail", "long-tail", or "common". Defaults to None, meaning that it will be randomly sampled.
- query_length: The length of the query to be generated, which can be "less than 5 words", "5 to 15 words", or "at least 10 words". Defaults to None, meaning that it will be randomly sampled.
- difficulty: The difficulty of the query to be generated, which can be "high school", "college", or "PhD". Defaults to None, meaning that it will be randomly sampled.
- clarity: The clarity of the query to be generated, which can be "clear", "understandable with some effort", or "ambiguous". Defaults to None, meaning that it will be randomly sampled.
- num_words: The number of words in the query to be generated, which can be 50, 100, 200, 300, 400, or 500. Defaults to None, meaning that it will be randomly sampled.
- seed: The random seed to be set in case there's any sampling within the format_input method.
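The "randomly sampled when None" behavior can be sketched as follows. This is an illustration under stated assumptions, not the code distilabel runs inside format_input: the `resolve_attributes` helper and the `CHOICES` table are hypothetical, and only the allowed values are taken from the attribute list above.

```python
import random
from typing import Dict, Optional

# Allowed values as listed in the attributes above.
CHOICES = {
    "query_type": ["extremely long-tail", "long-tail", "common"],
    "query_length": ["less than 5 words", "5 to 15 words", "at least 10 words"],
    "difficulty": ["high school", "college", "PhD"],
    "clarity": ["clear", "understandable with some effort", "ambiguous"],
    "num_words": [50, 100, 200, 300, 400, 500],
}


def resolve_attributes(
    provided: Dict[str, object], seed: Optional[int] = None
) -> Dict[str, object]:
    """Keep every provided value and sample any attribute left as None.

    Hypothetical helper: it only mirrors the documented defaults.
    """
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    resolved = {}
    for name, options in CHOICES.items():
        value = provided.get(name)
        resolved[name] = value if value is not None else rng.choice(options)
    return resolved


# query_type is fixed; everything else is sampled reproducibly from the seed.
first = resolve_attributes({"query_type": "common", "difficulty": None}, seed=42)
second = resolve_attributes({"query_type": "common", "difficulty": None}, seed=42)
print(first)
print(first == second)
```

Using the same seed yields the same sampled attributes in this sketch, which is the reason a seed attribute is useful when some values are left unset.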
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[task]
        end
        subgraph New columns
            OCOL0[user_query]
            OCOL1[positive_document]
            OCOL2[hard_negative_document]
            OCOL3[model_name]
        end
    end
    subgraph GenerateTextRetrievalData
        StepInput[Input Columns: task]
        StepOutput[Output Columns: user_query, positive_document, hard_negative_document, model_name]
    end
    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepOutput --> OCOL3
    StepInput --> StepOutput
Inputs¶
- task (str): The task description to be used in the generation.
Outputs¶
- user_query (str): The user query generated by the LLM.
- positive_document (str): The positive document generated by the LLM.
- hard_negative_document (str): The hard negative document generated by the LLM.
- model_name (str): The name of the model used to generate the text retrieval data.
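As a rough illustration of how a raw LLM response might be mapped onto these columns, the sketch below assumes the model replies with a JSON object whose keys match the output column names. Whether distilabel parses responses exactly this way is an assumption. The `parse_generation` helper, the sample strings, and the "example-model" name are hypothetical.

```python
import json
from typing import Dict, Optional

OUTPUT_KEYS = ("user_query", "positive_document", "hard_negative_document")


def parse_generation(raw: Optional[str], model_name: str) -> Dict[str, Optional[str]]:
    """Map a raw JSON response to the output columns, using None on failure.

    Hypothetical helper for illustration only.
    """
    row: Dict[str, Optional[str]] = {key: None for key in OUTPUT_KEYS}
    row["model_name"] = model_name
    if raw is None:
        return row
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return row  # malformed JSON: leave the generated columns as None
    if isinstance(data, dict):
        for key in OUTPUT_KEYS:
            value = data.get(key)
            row[key] = value if isinstance(value, str) else None
    return row


# A made-up, well-formed response and a malformed one.
raw_response = json.dumps(
    {
        "user_query": "how do vaccines train the immune system",
        "positive_document": "Vaccines expose the body to a harmless antigen...",
        "hard_negative_document": "The immune system has innate and adaptive parts...",
    }
)
parsed = parse_generation(raw_response, model_name="example-model")
broken = parse_generation("not json", model_name="example-model")
print(parsed["user_query"])
print(broken["user_query"])
```

Returning None for columns that cannot be parsed keeps one output row per input row, so a single bad generation does not break the rest of the batch.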
Examples¶
Generate synthetic text retrieval data for training embedding models¶
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-retrieval",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextRetrievalData(
        language="English",
        query_type="common",
        query_length="5 to 15 words",
        difficulty="high school",
        clarity="clear",
        num_words=100,
        llm=...,  # LLM instance
    )

    task >> generate