EmbeddingTaskGenerator¶
Generate task descriptions for embedding-related tasks using an `LLM`.
`EmbeddingTaskGenerator` is a `GeneratorTask` that receives no input other than the provided attributes. It generates task descriptions for embedding-related tasks using a pre-defined prompt selected by the `category` attribute. The `category` attribute should be one of the following:
- `text-retrieval`: Generate task descriptions for text retrieval tasks.
- `text-matching-short`: Generate task descriptions for short text matching tasks.
- `text-matching-long`: Generate task descriptions for long text matching tasks.
- `text-classification`: Generate task descriptions for text classification tasks.
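As a minimal sketch of guarding against a mistyped category before building the task, the four values listed above can be captured in a `Literal` type and checked up front. The names `Category` and `check_category` are hypothetical helpers for illustration, not part of distilabel.

```python
from typing import Literal, get_args

# The four category values listed above (hypothetical alias, not a distilabel type).
Category = Literal[
    "text-retrieval",
    "text-matching-short",
    "text-matching-long",
    "text-classification",
]


def check_category(value: str) -> str:
    """Return `value` unchanged if it is a supported category, else raise."""
    allowed = get_args(Category)
    if value not in allowed:
        raise ValueError(f"category must be one of {allowed}, got {value!r}")
    return value
```

Calling `check_category("text-retrieval")` returns the string unchanged, while an unsupported value raises a `ValueError` that lists the accepted options.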
Attributes¶
- `category`: The category of the task to be generated, which must be one of `text-retrieval`, `text-matching-short`, `text-matching-long`, or `text-classification`.
- `flatten_tasks`: Whether to flatten the list of tasks generated by the `LLM`. Defaults to `False`, meaning that running this task with `num_generations=1` returns a `distilabel.Distiset` with a single row containing a list of around 20 tasks. If set to `True`, it returns a `distilabel.Distiset` with around 20 rows, each containing one task.
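The difference between the two `flatten_tasks` settings can be pictured with plain dictionaries. This is only an illustration of the row shapes described above, not distilabel's implementation. The function `flatten_task_rows`, the sample task strings, and the model name are made up for the example.

```python
from typing import Any, Dict, List


def flatten_task_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn rows holding a `tasks` list into one row per task."""
    flat: List[Dict[str, Any]] = []
    for row in rows:
        for task in row["tasks"]:
            flat.append({"task": task, "model_name": row["model_name"]})
    return flat


# Shape with flatten_tasks=False: one row holding the whole list.
# (Invented sample with 3 tasks; a real generation holds around 20.)
unflattened = [
    {
        "tasks": [
            "Retrieve news articles relevant to a headline.",
            "Find forum answers that address a technical question.",
            "Retrieve recipes matching a list of ingredients.",
        ],
        "model_name": "example-model",
    }
]

# Shape with flatten_tasks=True: one row per task.
flattened = flatten_task_rows(unflattened)
```

Here `flattened` holds three rows, each with a single `task` string and the same `model_name` as the row it came from.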
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph New columns
OCOL0[tasks]
OCOL1[task]
OCOL2[model_name]
end
end
subgraph EmbeddingTaskGenerator
StepOutput[Output Columns: tasks, task, model_name]
end
StepOutput --> OCOL0
StepOutput --> OCOL1
StepOutput --> OCOL2
Outputs¶
- `tasks` (`List[str]`): the list of tasks generated by the `LLM`.
- `task` (`str`): the task generated by the `LLM` if `flatten_tasks=True`.
- `model_name` (`str`): the name of the model used to generate the tasks.
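Because the populated column depends on `flatten_tasks`, downstream code may want to read task descriptions without caring which mode produced a row. The following is a rough sketch under the assumption that each row is available as a plain dictionary keyed by the column names above. `tasks_from_row` is a hypothetical helper, not a distilabel function.

```python
from typing import Any, Dict, List


def tasks_from_row(row: Dict[str, Any]) -> List[str]:
    """Return the task descriptions in a row, whichever column holds them."""
    if row.get("tasks") is not None:
        # Unflattened row: the `tasks` column already holds a list.
        return list(row["tasks"])
    if row.get("task") is not None:
        # Flattened row: wrap the single `task` string in a list.
        return [row["task"]]
    # Neither column is populated, e.g. a failed generation.
    return []
```

A row with neither column populated yields an empty list, so callers can iterate over the result without extra checks.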
Examples¶
Generate embedding tasks for text retrieval¶
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-retrieval",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    ...

    task >> ...