EmbeddingTaskGenerator

Generate task descriptions for embedding-related tasks using an LLM.

EmbeddingTaskGenerator is a GeneratorTask that receives no input other than its attributes. It generates task descriptions for embedding-related tasks using a predefined prompt selected by the category attribute, which must be one of the following:

- `text-retrieval`: Generate task descriptions for text retrieval tasks.
- `text-matching-short`: Generate task descriptions for short text matching tasks.
- `text-matching-long`: Generate task descriptions for long text matching tasks.
- `text-classification`: Generate task descriptions for text classification tasks.
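
Because the category is a plain string, a typo such as `text_retrieval` is easy to make. The snippet below is a minimal sketch of checking the value before building the task. `VALID_CATEGORIES` and `check_category` are hypothetical names introduced for illustration and are not part of distilabel, which may perform its own validation.

```python
# Category values copied from the list above.
VALID_CATEGORIES = (
    "text-retrieval",
    "text-matching-short",
    "text-matching-long",
    "text-classification",
)


def check_category(category: str) -> str:
    """Return `category` unchanged if it is one of the documented values.

    Hypothetical helper for illustration; not part of distilabel.
    """
    if category not in VALID_CATEGORIES:
        raise ValueError(
            f"Unknown category {category!r}; expected one of {VALID_CATEGORIES}"
        )
    return category
```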

Attributes

category: The category of the task to be generated. Must be one of text-retrieval, text-matching-short, text-matching-long, or text-classification.
flatten_tasks: Whether to flatten the list of tasks generated by the LLM. Defaults to False, in which case running this task with num_generations=1 returns a distilabel.Distiset with a single row containing a list of around 20 tasks. If set to True, it returns a distilabel.Distiset with around 20 rows, each containing one task.
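
The effect of flatten_tasks on the row layout can be pictured with plain dictionaries. This is only a sketch of the reshaping described above, not distilabel's implementation: `flatten_task_rows` is a hypothetical helper, and the task strings and model name are placeholder data.

```python
from typing import Any, Dict, List


def flatten_task_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn rows holding a `tasks` list into one row per task.

    Hypothetical helper for illustration; not part of distilabel.
    """
    flattened = []
    for row in rows:
        for task in row["tasks"]:
            # Each output row keeps the model name and carries a single task.
            flattened.append({"task": task, "model_name": row["model_name"]})
    return flattened


# Shape with flatten_tasks=False: one row containing the whole list.
# The values here are made-up placeholders.
unflattened = [
    {
        "tasks": ["placeholder task A", "placeholder task B"],
        "model_name": "placeholder-model",
    }
]

# Shape with flatten_tasks=True: one row per task.
flattened = flatten_task_rows(unflattened)
```

Downstream steps that expect one task per row would consume the second shape. The first shape keeps everything produced by a single generation together.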

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[tasks]
            OCOL1[task]
            OCOL2[model_name]
        end
    end

    subgraph EmbeddingTaskGenerator
        StepOutput[Output Columns: tasks, task, model_name]
    end

    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2

Outputs

tasks (List[str]): the list of tasks generated by the LLM.
task (str): the task generated by the LLM if flatten_tasks=True.
model_name (str): the name of the model used to generate the tasks.

Examples

Generate embedding tasks for text retrieval

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator

with Pipeline("my-pipeline") as pipeline:
    # Generator task: takes no input columns, only the attributes below.
    task = EmbeddingTaskGenerator(
        category="text-retrieval",
        flatten_tasks=True,  # one row per generated task
        llm=...,  # LLM instance
    )

    ...

    task >> ...

References

Improving Text Embeddings with Large Language Models