Task Gallery

This section contains the existing Task subclasses implemented in distilabel.

BitextRetrievalGenerator

Bases: _EmbeddingDataGenerator

Generate bitext retrieval data with an LLM to later train an embedding model.

BitextRetrievalGenerator is a GeneratorTask that generates bitext retrieval data with an LLM, which can later be used to train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models". The data is generated from the provided attributes; any attribute that is not provided is randomly sampled.

Attributes:

  • source_language (str): The source language of the data to be generated, which can be any of the XLM-R languages listed in Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
  • target_language (str): The target language of the data to be generated, which can be any of the XLM-R languages listed in Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
  • unit (Optional[Literal['sentence', 'phrase', 'passage']]): The unit of the data to be generated, which can be sentence, phrase, or passage. Defaults to None, meaning that it will be randomly sampled.
  • difficulty (Optional[Literal['elementary school', 'high school', 'college']]): The difficulty of the query to be generated, which can be elementary school, high school, or college. Defaults to None, meaning that it will be randomly sampled.
  • high_score (Optional[Literal['4', '4.5', '5']]): The high score of the query to be generated, which can be 4, 4.5, or 5. Defaults to None, meaning that it will be randomly sampled.
  • low_score (Optional[Literal['2.5', '3', '3.5']]): The low score of the query to be generated, which can be 2.5, 3, or 3.5. Defaults to None, meaning that it will be randomly sampled.
  • seed (int): The random seed to be set in case there's any sampling within the format_input method.
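
The "randomly sampled if not provided" behaviour can be sketched in a few lines. This is a minimal illustration rather than the library's code: `resolve_attribute` and `ALLOWED_VALUES` are hypothetical names, and seeding through a `random.Random` instance is an assumption made here so that the sampling is reproducible.

```python
import random
from typing import Dict, List, Optional

# Allowed values copied from the attribute descriptions above.
ALLOWED_VALUES: Dict[str, List[str]] = {
    "unit": ["sentence", "phrase", "passage"],
    "difficulty": ["elementary school", "high school", "college"],
    "high_score": ["4", "4.5", "5"],
    "low_score": ["2.5", "3", "3.5"],
}


def resolve_attribute(name: str, value: Optional[str], rng: random.Random) -> str:
    """Return `value` if it was provided, otherwise sample one allowed value."""
    if value is not None:
        return value
    return rng.choice(ALLOWED_VALUES[name])


rng = random.Random(42)  # hypothetical seeding, only for reproducibility
resolved = {
    "unit": resolve_attribute("unit", "sentence", rng),        # provided, kept as-is
    "difficulty": resolve_attribute("difficulty", None, rng),  # not provided, sampled
}
print(resolved)
```

Passing explicit values for every attribute makes the generated prompt deterministic; leaving them as None trades that for more varied data.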

Examples:

Generate bitext retrieval data for training embedding models:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import BitextRetrievalGenerator

with Pipeline("my-pipeline") as pipeline:
    task = BitextRetrievalGenerator(
        source_language="English",
        target_language="Spanish",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,
    )

    ...

    task >> ...
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class BitextRetrievalGenerator(_EmbeddingDataGenerator):
    """Generate bitext retrieval data with an `LLM` to later on train an embedding model.

    `BitextRetrievalGenerator` is a `GeneratorTask` that generates bitext retrieval data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Attributes:
        source_language: The source language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        target_language: The target language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        unit: The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`.
            Defaults to `None`, meaning that it will be randomly sampled.
        difficulty: The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`.
            Defaults to `None`, meaning that it will be randomly sampled.
        high_score: The high score of the query to be generated, which can be `4`, `4.5`, or `5`.
            Defaults to `None`, meaning that it will be randomly sampled.
        low_score: The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`.
            Defaults to `None`, meaning that it will be randomly sampled.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.

    Examples:

        Generate bitext retrieval data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import BitextRetrievalGenerator

        with Pipeline("my-pipeline") as pipeline:
            task = BitextRetrievalGenerator(
                source_language="English",
                target_language="Spanish",
                unit="sentence",
                difficulty="elementary school",
                high_score="4",
                low_score="2.5",
                llm=...,
            )

            ...

            task >> ...
        ```
    """

    source_language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )
    target_language: str = Field(
        default=...,
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    unit: Optional[Literal["sentence", "phrase", "passage"]] = None
    difficulty: Optional[Literal["elementary school", "high school", "college"]] = None
    high_score: Optional[Literal["4", "4.5", "5"]] = None
    low_score: Optional[Literal["2.5", "3", "3.5"]] = None

    _template_name: str = PrivateAttr(default="bitext-retrieval")

    @property
    def prompt(self) -> ChatType:
        """Contains the `prompt` to be used in the `process` method, rendering the `_template`; and
        formatted as an OpenAI formatted chat i.e. a `ChatType`, assuming that there's only one turn,
        being from the user with the content being the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    source_language=self.source_language,
                    target_language=self.target_language,
                    unit=self.unit or random.choice(["sentence", "phrase", "passage"]),
                    difficulty=self.difficulty
                    or random.choice(["elementary school", "high school", "college"]),
                    high_score=self.high_score or random.choice(["4", "4.5", "5"]),
                    low_score=self.low_score or random.choice(["2.5", "3", "3.5"]),
                ).strip(),
            }
        ]  # type: ignore

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["S1", "S2", "S3"]

keys: List[str] property

Contains the keys that will be parsed from the LLM output into a Python dict.
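
As a rough illustration of what "parsed from the LLM output into a Python dict" can look like, the sketch below assumes the LLM replies with a JSON object and keeps only the expected keys. The JSON assumption, the helper name `parse_keys`, the fallback to None, and the sample sentences are all illustrative and not confirmed library behaviour.

```python
import json
from typing import Dict, List, Optional

KEYS: List[str] = ["S1", "S2", "S3"]


def parse_keys(output: Optional[str], keys: List[str]) -> Dict[str, Optional[str]]:
    """Extract `keys` from a JSON-formatted LLM output; anything missing becomes None."""
    empty = {key: None for key in keys}
    if output is None:
        return empty
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {key: data.get(key) for key in keys}


# Made-up output, only to exercise the parser.
raw = '{"S1": "The cat sleeps.", "S2": "El gato duerme.", "S3": "El perro corre."}'
parsed = parse_keys(raw, KEYS)
print(parsed)
```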

prompt: ChatType property

Contains the prompt to be used in the process method, rendering the _template; and formatted as an OpenAI formatted chat i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.

ChatGeneration

Bases: Task

Generates text based on a conversation.

ChatGeneration is a pre-defined task that defines the messages as the input and generation as the output. This task is used to generate text based on a conversation. The model_name is also returned as part of the output in order to enhance it.

Input columns
  • messages (List[Dict[Literal["role", "content"], str]]): The messages to generate the follow up completion from.
Output columns
  • generation (str): The generated text from the assistant.
  • model_name (str): The model name used to generate the text.
Categories
  • chat-generation
Icon

:material-chat:

Examples:

Generate text from a conversation in OpenAI chat format:

```python
from distilabel.steps.tasks import ChatGeneration
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
chat = ChatGeneration(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

chat.load()

result = next(
    chat.process(
        [
            {
                "messages": [
                    {"role": "user", "content": "How much is 2+2?"},
                ]
            }
        ]
    )
)
# result
# [
#     {
#         'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'generation': '4',
#     }
# ]
```
Source code in src/distilabel/steps/tasks/text_generation.py
class ChatGeneration(Task):
    """Generates text based on a conversation.

    `ChatGeneration` is a pre-defined task that defines the `messages` as the input
    and `generation` as the output. This task is used to generate text based on a conversation.
    The `model_name` is also returned as part of the output in order to enhance it.

    Input columns:
        - messages (`List[Dict[Literal["role", "content"], str]]`): The messages to generate the
            follow up completion from.

    Output columns:
        - generation (`str`): The generated text from the assistant.
        - model_name (`str`): The model name used to generate the text.

    Categories:
        - chat-generation

    Icon:
        `:material-chat:`

    Examples:

        Generate text from a conversation in OpenAI chat format:

        ```python
        from distilabel.steps.tasks import ChatGeneration
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        chat = ChatGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )

        chat.load()

        result = next(
            chat.process(
                [
                    {
                        "messages": [
                            {"role": "user", "content": "How much is 2+2?"},
                        ]
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
        #         'generation': '4',
        #     }
        # ]
        ```
    """

    @property
    def inputs(self) -> List[str]:
        """The input for the task are the `messages`."""
        return ["messages"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """The input is formatted as a `ChatType` assuming that the messages provided
        are already formatted that way i.e. following the OpenAI chat format."""

        if not is_openai_format(input["messages"]):
            raise ValueError(
                "Input `messages` must be an OpenAI chat-like format conversation. "
                f"Got: {input['messages']}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'."
            )

        if input["messages"][-1]["role"] != "user":
            raise ValueError(
                "The last message must be from the user. Please check: "
                "'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'."
            )

        return input["messages"]

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`."""
        return {"generation": output}

inputs: List[str] property

The input for the task is messages.

outputs: List[str] property

The output for the task is the generation and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the messages provided are already formatted that way i.e. following the OpenAI chat format.
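
The two checks performed by format_input can be tried out in isolation with the sketch below. `looks_like_openai_chat` is a hypothetical stand-in for `is_openai_format`, whose implementation is not shown on this page, and the set of accepted roles is an assumption.

```python
from typing import Any, Dict, List

VALID_ROLES = {"system", "user", "assistant"}  # assumed set of accepted roles


def looks_like_openai_chat(messages: Any) -> bool:
    """Hypothetical stand-in for `is_openai_format`: a list of role/content dicts."""
    if not isinstance(messages, list):
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in VALID_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )


def validate_messages(row: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the messages unchanged if they pass both checks, else raise ValueError."""
    messages = row["messages"]
    if not looks_like_openai_chat(messages):
        raise ValueError("Input `messages` must be an OpenAI chat-like conversation.")
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("The last message must be from the user.")
    return messages


ok = validate_messages({"messages": [{"role": "user", "content": "How much is 2+2?"}]})
print(ok)
```

A conversation that ends with an assistant turn is rejected because the task generates the follow-up to the latest user message.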

Source code in src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """The input is formatted as a `ChatType` assuming that the messages provided
    are already formatted that way i.e. following the OpenAI chat format."""

    if not is_openai_format(input["messages"]):
        raise ValueError(
            "Input `messages` must be an OpenAI chat-like format conversation. "
            f"Got: {input['messages']}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'."
        )

    if input["messages"][-1]["role"] != "user":
        raise ValueError(
            "The last message must be from the user. Please check: "
            "'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'."
        )

    return input["messages"]

format_output(output, input)

The output is formatted as a dictionary with the generation. The model_name will be automatically included within the process method of Task.

Source code in src/distilabel/steps/tasks/text_generation.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`."""
    return {"generation": output}

ComplexityScorer

Bases: Task

Score instructions based on their complexity using an LLM.

ComplexityScorer is a pre-defined task used to rank a list of instructions based on their complexity. It's an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:

  • _template (Union[Template, None]): A Jinja2 template used to format the input for the LLM.

Input columns
  • instructions (List[str]): The list of instructions to be scored.
Output columns
  • scores (List[float]): The score for each instruction.
  • model_name (str): The model name used to generate the scores.
Categories
  • scorer
  • complexity
  • instruction
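References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685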

Examples:

Evaluate the complexity of your instructions:

```python
from distilabel.steps.tasks import ComplexityScorer
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
scorer = ComplexityScorer(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

scorer.load()

result = next(
    scorer.process(
        [{"instructions": ["plain instruction", "highly complex instruction"]}]
    )
)
# result
# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]
```

Citations:

```
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
```
Source code in src/distilabel/steps/tasks/complexity_scorer.py
class ComplexityScorer(Task):
    """Score instructions based on their complexity using an `LLM`.

    `ComplexityScorer` is a pre-defined task used to rank a list of instructions based on
    their complexity. It's an implementation of the complexity score task from the paper
    'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection
    in Instruction Tuning'.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - instructions (`List[str]`): The list of instructions to be scored.

    Output columns:
        - scores (`List[float]`): The score for each instruction.
        - model_name (`str`): The model name used to generate the scores.

    Categories:
        - scorer
        - complexity
        - instruction

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)

    Examples:

        Evaluate the complexity of your instructions:

        ```python
        from distilabel.steps.tasks import ComplexityScorer
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        scorer = ComplexityScorer(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )

        scorer.load()

        result = next(
            scorer.process(
                [{"instructions": ["plain instruction", "highly complex instruction"]}]
            )
        )
        # result
        # [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]
        ```

    Citations:

        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```
    """

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "complexity-scorer.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are the `instructions`."""
        return ["instructions"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(instructions=input["instructions"]),  # type: ignore
            }
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are: a list of `scores` containing the complexity score for each
        instruction in `instructions`, and the `model_name`."""
        return ["scores", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the key `scores` containing the scores for each instruction.
        """
        if output is None:
            return {"scores": [None] * len(input["instructions"])}

        scores = []
        score_lines = output.split("\n")
        for i, line in enumerate(score_lines):
            match = _PARSE_SCORE_LINE_REGEX.match(line)
            score = float(match.group(1)) if match else None
            scores.append(score)
            if i == len(input["instructions"]) - 1:
                break
        return {"scores": scores}

inputs: List[str] property

The inputs for the task are the instructions.

outputs: List[str] property

The outputs for the task are a list of scores, containing the complexity score for each instruction in instructions, and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/complexity_scorer.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(instructions=input["instructions"]),  # type: ignore
        }
    ]

format_output(output, input)

The output is formatted as a list with the score of each instruction.

Parameters:

  • output (Union[str, None]): The raw output of the LLM. Required.
  • input (Dict[str, Any]): The input to the task, used to obtain the number of instructions. Required.

Returns:

  • Dict[str, Any]: A dict with the key scores containing the scores for each instruction.
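
The parsing logic can be exercised on its own with the sketch below. The definition of `_PARSE_SCORE_LINE_REGEX` is not shown on this page, so the pattern used here (lines such as "[1] Score: 3") is an assumption, and `parse_scores` is a hypothetical standalone helper.

```python
import re
from typing import Dict, List, Optional

# Assumed line format, e.g. "[1] Score: 3"; the real regex is not shown on this page.
SCORE_LINE = re.compile(r"\[\d+\] score: (\d+)", re.IGNORECASE)


def parse_scores(
    output: Optional[str], num_instructions: int
) -> Dict[str, List[Optional[float]]]:
    """Parse one score per line, reading at most `num_instructions` lines."""
    if output is None:
        return {"scores": [None] * num_instructions}
    scores: List[Optional[float]] = []
    for line in output.split("\n")[:num_instructions]:
        match = SCORE_LINE.match(line)
        # Lines that do not match the expected format yield None.
        scores.append(float(match.group(1)) if match else None)
    return {"scores": scores}


print(parse_scores("[1] Score: 1\n[2] Score: 5", num_instructions=2))
```

As in the library listing, an output with fewer lines than instructions produces a shorter scores list, so downstream code should not assume the two lengths always match.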

Source code in src/distilabel/steps/tasks/complexity_scorer.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the key `scores` containing the scores for each instruction.
    """
    if output is None:
        return {"scores": [None] * len(input["instructions"])}

    scores = []
    score_lines = output.split("\n")
    for i, line in enumerate(score_lines):
        match = _PARSE_SCORE_LINE_REGEX.match(line)
        score = float(match.group(1)) if match else None
        scores.append(score)
        if i == len(input["instructions"]) - 1:
            break
    return {"scores": scores}

load()

Loads the Jinja2 template.
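
The template is located as a packaged resource instead of through a hard-coded filesystem path. The sketch below shows the same lookup pattern with the standard library's `importlib.resources`, using a file from the stdlib `email` package as a stand-in so that it runs without distilabel installed. The stand-in path and the use of `read_text` are choices made for this illustration.

```python
from importlib import resources

# `email` / "mime" / "text.py" stands in for
# "distilabel" / "steps" / "tasks" / "templates" / "complexity-scorer.jinja2".
resource = resources.files("email") / "mime" / "text.py"

# Reading through the returned object avoids leaving a file handle open.
content = resource.read_text(encoding="utf-8")
print(f"Loaded {len(content)} characters from {resource.name}")
```

Resolving the path relative to the installed package keeps the lookup independent of the current working directory.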

Source code in src/distilabel/steps/tasks/complexity_scorer.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "complexity-scorer.jinja2"
    )

    self._template = Template(open(_path).read())

EvolComplexity

Bases: EvolInstruct

Evolve instructions to make them more complex using an LLM.

EvolComplexity is a task that evolves instructions to make them more complex. It is based on the EvolInstruct task and uses slightly different prompts, but the exact same evolutionary approach.

Attributes:

  • num_instructions: The number of instructions to be generated.
  • generate_answers: Whether to generate answers for the instructions or not. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for the generation of the instructions.
  • min_length (int): Defines the length (in bytes) that the generated instruction must exceed to be considered valid. Defaults to 512.
  • max_length (int): Defines the length (in bytes) that the generated instruction must stay below to be considered valid. Defaults to 1024.
  • seed (int): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
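
To make the length bounds and the seed concrete, here is a minimal sketch. It assumes the length is measured in UTF-8 bytes, as the parameter descriptions state, and it uses the standard library's `random` as a stand-in for numpy. The function names and mutation names are hypothetical.

```python
import random
from typing import List

MIN_LENGTH = 512   # documented default
MAX_LENGTH = 1024  # documented default


def has_valid_length(
    text: str, min_length: int = MIN_LENGTH, max_length: int = MAX_LENGTH
) -> bool:
    """True if the UTF-8 byte length lies strictly between the two bounds."""
    return min_length < len(text.encode("utf-8")) < max_length


def pick_mutations(names: List[str], n: int, seed: int = 42) -> List[str]:
    """Pick `n` mutation names reproducibly; stdlib `random` stands in for numpy."""
    rng = random.Random(seed)
    return [rng.choice(names) for _ in range(n)]


mutation_names = ["CONSTRAINTS", "DEEPENING", "CONCRETIZING"]  # hypothetical names
print(has_valid_length("a" * 600))
print(pick_mutations(mutation_names, n=2))
```

With a fixed seed the same sequence of mutations is picked on every run, which makes an evolution run repeatable.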
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • evolved_instruction (str): The evolved instruction.
  • answer (str, optional): The answer to the instruction if generate_answers=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
Categories
  • evol
  • instruction
  • deita
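References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244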

Examples:

Evolve an instruction using an LLM:

```python
from distilabel.steps.tasks import EvolComplexity
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_complexity = EvolComplexity(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_complexity.load()

result = next(evol_complexity.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
```

Citations:

```
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
```

```
@misc{xu2023wizardlmempoweringlargelanguage,
    title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
    author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
    year={2023},
    eprint={2304.12244},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2304.12244},
}
```
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
class EvolComplexity(EvolInstruct):
    """Evolve instructions to make them more complex using an `LLM`.

    `EvolComplexity` is a task that evolves instructions to make them more complex.
    It is based on the `EvolInstruct` task and uses slightly different prompts, but the
    exact same evolutionary approach.

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - evolved_instruction (`str`): The evolved instruction.
        - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.

    Categories:
        - evol
        - instruction
        - deita

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)

    Examples:

        Evolve an instruction using an LLM:

        ```python
        from distilabel.steps.tasks import EvolComplexity
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_complexity = EvolComplexity(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
        )

        evol_complexity.load()

        result = next(evol_complexity.process([{"instruction": "common instruction"}]))
        # result
        # [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
        ```

    Citations:

        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```

        ```
        @misc{xu2023wizardlmempoweringlargelanguage,
            title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
            author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
            year={2023},
            eprint={2304.12244},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2304.12244},
        }
        ```
    """

    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

EvolComplexityGenerator

Bases: EvolInstructGenerator

Generate evolved instructions with increased complexity using an LLM.

EvolComplexityGenerator is a generation task that evolves instructions to make them more complex. It is based on the EvolInstruct task and uses slightly different prompts, but the exact same evolutionary approach.

Attributes:

  • num_instructions: The number of instructions to be generated.
  • generate_answers: Whether to generate answers for the instructions or not. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for the generation of the instructions.
  • min_length (int): Defines the length (in bytes) that the generated instruction must exceed to be considered valid. Defaults to 512.
  • max_length (int): Defines the length (in bytes) that the generated instruction must stay below to be considered valid. Defaults to 1024.
  • seed (int): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
  • instruction (str): The evolved instruction.
  • answer (str, optional): The answer to the instruction if generate_answers=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
Categories
  • evol
  • instruction
  • generation
  • deita
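References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244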

Examples:

Generate evolved instructions without initial instructions:

```python
from distilabel.steps.tasks import EvolComplexityGenerator
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_complexity_generator = EvolComplexityGenerator(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=2,
)

evol_complexity_generator.load()

result = next(evol_complexity_generator.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
```
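
The `min_length` and `max_length` attributes act as a validity filter on the generated instructions. As a rough illustration only, the sketch below assumes the bounds are inclusive and are compared against the UTF-8 byte length of the text. `is_valid_length` is a hypothetical helper, not part of distilabel, and the library's actual check may differ.

```python
def is_valid_length(
    instruction: str, min_length: int = 512, max_length: int = 1024
) -> bool:
    """Hypothetical helper: True when the UTF-8 byte length is within the bounds.

    Assumption: both bounds are inclusive.
    """
    n_bytes = len(instruction.encode("utf-8"))
    return min_length <= n_bytes <= max_length


candidates = ["too short", "x" * 600, "y" * 2000]

# Keep only the candidates whose byte length falls inside [512, 1024].
kept = [text for text in candidates if is_valid_length(text)]
print([len(text) for text in kept])
```

Measuring bytes rather than characters matters for non-ASCII text: 256 copies of "é" already occupy 512 bytes in UTF-8.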

Citations:

```
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
```

```
@misc{xu2023wizardlmempoweringlargelanguage,
    title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
    author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
    year={2023},
    eprint={2304.12244},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2304.12244},
}
```
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
class EvolComplexityGenerator(EvolInstructGenerator):
    """Generate evolved instructions with increased complexity using an `LLM`.

    `EvolComplexityGenerator` is a generation task that evolves instructions to make
    them more complex. It is based on the EvolInstruct task and follows the exact same
    evolutionary approach, but uses slightly different prompts.

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Output columns:
        - instruction (`str`): The evolved instruction.
        - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.

    Categories:
        - evol
        - instruction
        - generation
        - deita

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)

    Examples:

        Generate evolved instructions without initial instructions:

        ```python
        from distilabel.steps.tasks import EvolComplexityGenerator
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_complexity_generator = EvolComplexityGenerator(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_instructions=2,
        )

        evol_complexity_generator.load()

        result = next(evol_complexity_generator.process())
        # result
        # [{'instruction': 'generated instruction', 'model_name': 'test'}]
        ```

    Citations:

        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```

        ```
        @misc{xu2023wizardlmempoweringlargelanguage,
            title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
            author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
            year={2023},
            eprint={2304.12244},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2304.12244},
        }
        ```
    """

    mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES

EvolInstruct

Bases: Task

Evolve instructions using an LLM.

Based on the paper "WizardLM: Empowering Large Language Models to Follow Complex Instructions" (https://arxiv.org/abs/2304.12244).

Attributes:

Name Type Description
num_evolutions int

The number of evolutions to be performed.

store_evolutions bool

Whether to store all the evolutions or just the last one. Defaults to False.

generate_answers bool

Whether to generate answers for the evolved instructions. Defaults to False.

include_original_instruction bool

Whether to include the original instruction in the evolved_instructions output column. Defaults to False.

mutation_templates Dict[str, str]

The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the utils.py file.

seed RuntimeParameter[int]

The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • evolved_instruction (str): The evolved instruction if store_evolutions=False.
  • evolved_instructions (List[str]): The evolved instructions if store_evolutions=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
  • answer (str): The answer to the evolved instruction if generate_answers=True and store_evolutions=False.
  • answers (List[str]): The answers to the evolved instructions if generate_answers=True and store_evolutions=True.
Categories
  • evol
  • instruction
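References
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions (https://arxiv.org/abs/2304.12244)
  • GitHub: h2oai/h2o-wizardlm (https://github.com/h2oai/h2o-wizardlm)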

Examples:

Evolve an instruction using an LLM:

```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
```

Keep the iterations of the evolutions:

```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
    store_evolutions=True,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
#     {
#         'instruction': 'common instruction',
#         'evolved_instructions': ['initial evolution', 'final evolution'],
#         'model_name': 'model_name'
#     }
# ]
```

Generate answers for the instructions in a single step:

```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
    generate_answers=True,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
#     {
#         'instruction': 'common instruction',
#         'evolved_instruction': 'evolved instruction',
#         'answer': 'answer to the instruction',
#         'model_name': 'model_name'
#     }
# ]
```

Citations:

```
@misc{xu2023wizardlmempoweringlargelanguage,
    title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
    author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
    year={2023},
    eprint={2304.12244},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2304.12244},
}
```
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
class EvolInstruct(Task):
    """Evolve instructions using an `LLM`.

    Based on the paper "WizardLM: Empowering Large Language Models to Follow Complex Instructions".

    Attributes:
        num_evolutions: The number of evolutions to be performed.
        store_evolutions: Whether to store all the evolutions or just the last one. Defaults
            to `False`.
        generate_answers: Whether to generate answers for the evolved instructions. Defaults
            to `False`.
        include_original_instruction: Whether to include the original instruction in the
            `evolved_instructions` output column. Defaults to `False`.
        mutation_templates: The mutation templates to be used for evolving the instructions.
            Defaults to the ones provided in the `utils.py` file.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - evolved_instruction (`str`): The evolved instruction if `store_evolutions=False`.
        - evolved_instructions (`List[str]`): The evolved instructions if `store_evolutions=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.
        - answer (`str`): The answer to the evolved instruction if `generate_answers=True`
            and `store_evolutions=False`.
        - answers (`List[str]`): The answers to the evolved instructions if `generate_answers=True`
            and `store_evolutions=True`.

    Categories:
        - evol
        - instruction

    References:
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
        - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)

    Examples:

        Evolve an instruction using an LLM:

        ```python
        from distilabel.steps.tasks import EvolInstruct
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_instruct = EvolInstruct(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
        )

        evol_instruct.load()

        result = next(evol_instruct.process([{"instruction": "common instruction"}]))
        # result
        # [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
        ```

        Keep the iterations of the evolutions:

        ```python
        from distilabel.steps.tasks import EvolInstruct
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_instruct = EvolInstruct(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
            store_evolutions=True,
        )

        evol_instruct.load()

        result = next(evol_instruct.process([{"instruction": "common instruction"}]))
        # result
        # [
        #     {
        #         'instruction': 'common instruction',
        #         'evolved_instructions': ['initial evolution', 'final evolution'],
        #         'model_name': 'model_name'
        #     }
        # ]
        ```

        Generate answers for the instructions in a single step:

        ```python
        from distilabel.steps.tasks import EvolInstruct
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_instruct = EvolInstruct(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
            generate_answers=True,
        )

        evol_instruct.load()

        result = next(evol_instruct.process([{"instruction": "common instruction"}]))
        # result
        # [
        #     {
        #         'instruction': 'common instruction',
        #         'evolved_instruction': 'evolved instruction',
        #         'answer': 'answer to the instruction',
        #         'model_name': 'model_name'
        #     }
        # ]
        ```

    Citations:

        ```
        @misc{xu2023wizardlmempoweringlargelanguage,
            title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
            author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
            year={2023},
            eprint={2304.12244},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2304.12244},
        }
        ```
    """

    num_evolutions: int
    store_evolutions: bool = False
    generate_answers: bool = False
    include_original_instruction: bool = False
    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is used to randomly pick a mutation method, it is good practice to set a random seed.",
    )

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`."""
        return ["instruction"]

    def format_input(self, input: str) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation. And the
        `system_prompt` is added as the first message if it exists."""
        return [{"role": "user", "content": input}]

    @property
    def outputs(self) -> List[str]:
        """The outputs for the task are the `evolved_instruction/s`, the `answer/s` if `generate_answers=True`,
        and the `model_name`."""
        # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,
        # this could be handled always and the value could be included within the DAG validation when
        # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.
        _outputs = [
            (
                "evolved_instruction"
                if not self.store_evolutions
                else "evolved_instructions"
            ),
            "model_name",
        ]
        if self.generate_answers:
            _outputs.append("answer" if not self.store_evolutions else "answers")
        return _outputs

    @override
    def format_output(  # type: ignore
        self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None
    ) -> Dict[str, Any]:  # type: ignore
        """The output for the task is a dict with `evolved_instruction` or `evolved_instructions`,
        depending on whether `store_evolutions` is `False` or `True`, respectively; `answer` or
        `answers` if `generate_answers=True`; and, finally, the `model_name`.

        Args:
            instructions: The instructions to be included within the output.
            answers: The answers to be included within the output if `generate_answers=True`.

        Returns:
            If `store_evolutions=False` and `generate_answers=True` return {"evolved_instruction": ..., "model_name": ..., "answer": ...};
            if `store_evolutions=True` and `generate_answers=True` return {"evolved_instructions": ..., "model_name": ..., "answers": ...};
            if `store_evolutions=False` and `generate_answers=False` return {"evolved_instruction": ..., "model_name": ...};
            if `store_evolutions=True` and `generate_answers=False` return {"evolved_instructions": ..., "model_name": ...}.
        """
        _output = {}
        if not self.store_evolutions:
            _output["evolved_instruction"] = instructions[-1]
        else:
            _output["evolved_instructions"] = instructions

        if self.generate_answers and answers:
            if not self.store_evolutions:
                _output["answer"] = answers[-1]
            else:
                _output["answers"] = answers

        _output["model_name"] = self.llm.model_name
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates`."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, instruction: str) -> str:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        dictionary, and returns the mutation prompt with the provided instruction inserted.

        Args:
            instruction: The instruction to be included within the mutation prompt.

        Returns:
            A random mutation prompt with the provided instruction.
        """
        mutation = np.random.choice(self.mutation_templates_names)
        return self.mutation_templates[mutation].replace("<PROMPT>", instruction)  # type: ignore

    def _evolve_instructions(self, inputs: "StepInput") -> List[List[str]]:
        """Evolves the instructions provided as part of the inputs of the task.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list where each item is a list with either the last evolved instruction if
            `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
        """

        instructions: List[List[str]] = [[input["instruction"]] for input in inputs]

        for iter_no in range(self.num_evolutions):
            formatted_prompts = []
            for instruction in instructions:
                formatted_prompts.append(self._apply_random_mutation(instruction[-1]))

            formatted_prompts = [
                self.format_input(prompt) for prompt in formatted_prompts
            ]
            generated_prompts = flatten_responses(
                self.llm.generate(
                    formatted_prompts,
                    **self.llm.generation_kwargs,  # type: ignore
                )
            )

            evolved_instructions = []
            for generated_prompt in generated_prompts:
                generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
                evolved_instructions.append(generated_prompt)

            if self.store_evolutions:
                instructions = [
                    instruction + [evolved_instruction]
                    for instruction, evolved_instruction in zip(
                        instructions, evolved_instructions
                    )
                ]
            else:
                instructions = [
                    [evolved_instruction]
                    for evolved_instruction in evolved_instructions
                ]

            self._logger.info(
                f"🔄 Ran iteration {iter_no} evolving {len(instructions)} instructions!"
            )

        return instructions

    def _generate_answers(
        self, evolved_instructions: List[List[str]]
    ) -> List[List[str]]:
        """Generates the answers for the instructions in `evolved_instructions`.

        Args:
            evolved_instructions: A list of lists where each item is a list with either the last
                evolved instruction if `store_evolutions=False` or all the evolved instructions
                if `store_evolutions=True`.

        Returns:
            A list of answers for each instruction.
        """
        formatted_instructions = [
            self.format_input(instruction)
            for instructions in evolved_instructions
            for instruction in instructions
        ]

        responses = self.llm.generate(
            formatted_instructions,
            num_generations=1,
            **self.llm.generation_kwargs,  # type: ignore
        )

        step = (
            self.num_evolutions
            if not self.include_original_instruction
            else self.num_evolutions + 1
        )
        return [
            flatten_responses(responses[i : i + step])
            for i in range(0, len(responses), step)
        ]

    @override
    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """

        evolved_instructions = self._evolve_instructions(inputs)

        if self.store_evolutions:
            # Remove the input instruction from the `evolved_instructions` list
            from_ = 1 if not self.include_original_instruction else 0
            evolved_instructions = [
                instruction[from_:] for instruction in evolved_instructions
            ]

        if not self.generate_answers:
            for input, instruction in zip(inputs, evolved_instructions):
                input.update(self.format_output(instruction))
            yield inputs

        self._logger.info(
            f"🎉 Finished evolving {len(evolved_instructions)} instructions!"
        )

        if self.generate_answers:
            self._logger.info(
                f"🧠 Generating answers for the {len(evolved_instructions)} evolved instructions!"
            )

            answers = self._generate_answers(evolved_instructions)

            self._logger.info(
                f"🎉 Finished generating answers for the {len(evolved_instructions)} evolved"
                " instructions!"
            )

            for idx, (input, instruction) in enumerate(
                zip(inputs, evolved_instructions)
            ):
                input.update(self.format_output(instruction, answers[idx]))
            yield inputs

inputs: List[str] property

The input for the task is the instruction.

mutation_templates_names: List[str] property

Returns the names i.e. keys of the provided mutation_templates.

outputs: List[str] property

The outputs for the task are the evolved_instruction/s, the answer/s if generate_answers=True, and the model_name.

_apply_random_mutation(instruction)

Applies a random mutation from the ones provided as part of the mutation_templates dictionary, and returns the mutation prompt with the provided instruction inserted.

Parameters:

Name Type Description Default
instruction str

The instruction to be included within the mutation prompt.

required

Returns:

Type Description
str

A random mutation prompt with the provided instruction.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def _apply_random_mutation(self, instruction: str) -> str:
    """Applies a random mutation from the ones provided as part of the `mutation_templates`
    dictionary, and returns the mutation prompt with the provided instruction inserted.

    Args:
        instruction: The instruction to be included within the mutation prompt.

    Returns:
        A random mutation prompt with the provided instruction.
    """
    mutation = np.random.choice(self.mutation_templates_names)
    return self.mutation_templates[mutation].replace("<PROMPT>", instruction)  # type: ignore
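
The template substitution above can be tried in isolation. This sketch uses two hypothetical templates and the standard library's `random.Random` in place of `numpy`, so it illustrates the idea (a seeded choice followed by `<PROMPT>` replacement) rather than reproducing the library's exact sampling.

```python
import random
from typing import Dict, List

# Hypothetical templates for illustration; the library ships its own set.
TEMPLATES: Dict[str, str] = {
    "CONSTRAINTS": "Add one more constraint to the prompt.\n#Given Prompt#:\n<PROMPT>",
    "DEEPENING": "Increase the depth of the prompt.\n#Given Prompt#:\n<PROMPT>",
}


def apply_random_mutation(
    instruction: str, templates: Dict[str, str], rng: random.Random
) -> str:
    """Pick one template at random and insert the instruction into it."""
    name = rng.choice(list(templates))
    return templates[name].replace("<PROMPT>", instruction)


def mutate_n(seed: int, n: int) -> List[str]:
    """Apply `n` random mutations using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [apply_random_mutation("Sort a list.", TEMPLATES, rng) for _ in range(n)]


first_run = mutate_n(seed=42, n=3)
second_run = mutate_n(seed=42, n=3)
print(first_run == second_run)  # same seed, same sequence of templates
```

Seeding is what makes the sequence of chosen templates repeatable between runs, which is the purpose of the `seed` runtime parameter.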

_evolve_instructions(inputs)

Evolves the instructions provided as part of the inputs of the task.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Returns:

Type Description
List[List[str]]

A list where each item is a list with either the last evolved instruction if store_evolutions=False or all the evolved instructions if store_evolutions=True.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def _evolve_instructions(self, inputs: "StepInput") -> List[List[str]]:
    """Evolves the instructions provided as part of the inputs of the task.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Returns:
        A list where each item is a list with either the last evolved instruction if
        `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
    """

    instructions: List[List[str]] = [[input["instruction"]] for input in inputs]

    for iter_no in range(self.num_evolutions):
        formatted_prompts = []
        for instruction in instructions:
            formatted_prompts.append(self._apply_random_mutation(instruction[-1]))

        formatted_prompts = [
            self.format_input(prompt) for prompt in formatted_prompts
        ]
        generated_prompts = flatten_responses(
            self.llm.generate(
                formatted_prompts,
                **self.llm.generation_kwargs,  # type: ignore
            )
        )

        evolved_instructions = []
        for generated_prompt in generated_prompts:
            generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
            evolved_instructions.append(generated_prompt)

        if self.store_evolutions:
            instructions = [
                instruction + [evolved_instruction]
                for instruction, evolved_instruction in zip(
                    instructions, evolved_instructions
                )
            ]
        else:
            instructions = [
                [evolved_instruction]
                for evolved_instruction in evolved_instructions
            ]

        self._logger.info(
            f"🔄 Ran iteration {iter_no} evolving {len(instructions)} instructions!"
        )

    return instructions
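
To see how `store_evolutions` changes what the loop accumulates, here is a minimal standalone sketch. `fake_llm` is a hypothetical stand-in for the LLM call, and the random mutation step is left out for brevity. Only the history handling and the `Prompt#:` parsing mirror the method above.

```python
from typing import Callable, List


def parse_response(response: str) -> str:
    """Keep only the text after the last 'Prompt#:' marker, as the method above does."""
    return response.split("Prompt#:")[-1].strip()


def evolve(
    seeds: List[str],
    num_evolutions: int,
    store_evolutions: bool,
    generate: Callable[[str], str],
) -> List[List[str]]:
    """Evolve each seed `num_evolutions` times, optionally keeping the history."""
    histories: List[List[str]] = [[seed] for seed in seeds]
    for _ in range(num_evolutions):
        # Always evolve from the most recent version of each instruction.
        evolved = [parse_response(generate(history[-1])) for history in histories]
        if store_evolutions:
            histories = [h + [e] for h, e in zip(histories, evolved)]
        else:
            histories = [[e] for e in evolved]
    return histories


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: echoes the prompt with a marker prefix.
    return f"#Rewritten Prompt#: {prompt}!"


print(evolve(["a"], 2, True, fake_llm))   # full history, seed included
print(evolve(["a"], 2, False, fake_llm))  # last evolution only
```

With `store_evolutions=True` the returned lists still contain the seed instruction at index 0, which is why `process` slices it off unless `include_original_instruction` is set.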

_generate_answers(evolved_instructions)

Generates the answers for the instructions in evolved_instructions.

Parameters:

Name Type Description Default
evolved_instructions List[List[str]]

A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False or all the evolved instructions if store_evolutions=True.

required

Returns:

Type Description
List[List[str]]

A list of answers for each instruction.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def _generate_answers(
    self, evolved_instructions: List[List[str]]
) -> List[List[str]]:
    """Generates the answers for the instructions in `evolved_instructions`.

    Args:
        evolved_instructions: A list of lists where each item is a list with either the last
            evolved instruction if `store_evolutions=False` or all the evolved instructions
            if `store_evolutions=True`.

    Returns:
        A list of answers for each instruction.
    """
    formatted_instructions = [
        self.format_input(instruction)
        for instructions in evolved_instructions
        for instruction in instructions
    ]

    responses = self.llm.generate(
        formatted_instructions,
        num_generations=1,
        **self.llm.generation_kwargs,  # type: ignore
    )

    step = (
        self.num_evolutions
        if not self.include_original_instruction
        else self.num_evolutions + 1
    )
    return [
        flatten_responses(responses[i : i + step])
        for i in range(0, len(responses), step)
    ]
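
The final slicing step regroups a flat list of responses into one list per input. Below is a minimal sketch of that regrouping, assuming each response is already a plain string (the method above additionally flattens each chunk with `flatten_responses`). `group_answers` is a hypothetical name used only here.

```python
from typing import List


def group_answers(flat_answers: List[str], step: int) -> List[List[str]]:
    """Split a flat list into consecutive chunks of `step` items."""
    return [flat_answers[i : i + step] for i in range(0, len(flat_answers), step)]


# Two inputs, each with two evolved instructions (num_evolutions=2).
flat = ["a1", "a2", "b1", "b2"]
grouped = group_answers(flat, step=2)
print(grouped)
```

The chunks line up with the inputs only when every input contributed exactly `step` instructions to the flat list.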

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation. And the system_prompt is added as the first message if it exists.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def format_input(self, input: str) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation. And the
    `system_prompt` is added as the first message if it exists."""
    return [{"role": "user", "content": input}]
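
The returned `ChatType` is a list of role/content dictionaries. The sketch below shows that shape with a hypothetical optional `system_prompt` argument. The method shown above emits only the user message, so the system message handling here is an assumption made for illustration.

```python
from typing import Dict, List, Optional


def to_chat(
    instruction: str, system_prompt: Optional[str] = None
) -> List[Dict[str, str]]:
    """Build a chat-style message list with the instruction as the user turn."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        # Hypothetical: prepend a system message when one is provided.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": instruction})
    return messages


print(to_chat("Write a haiku."))
print(to_chat("Write a haiku.", system_prompt="You are a poet."))
```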

format_output(instructions, answers=None)

The output for the task is a dict with evolved_instruction or evolved_instructions, depending on whether store_evolutions is False or True, respectively; answer or answers if generate_answers=True; and, finally, the model_name.

Parameters:

Name Type Description Default
instructions Union[str, List[str]]

The instructions to be included within the output.

required
answers Optional[List[str]]

The answers to be included within the output if generate_answers=True.

None

Returns:

Type Description
Dict[str, Any]

If store_evolutions=False and generate_answers=True, returns {"evolved_instruction": ..., "model_name": ..., "answer": ...}; if store_evolutions=True and generate_answers=True, returns {"evolved_instructions": ..., "model_name": ..., "answers": ...}; if store_evolutions=False and generate_answers=False, returns {"evolved_instruction": ..., "model_name": ...}; if store_evolutions=True and generate_answers=False, returns {"evolved_instructions": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override
def format_output(  # type: ignore
    self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None
) -> Dict[str, Any]:  # type: ignore
    """The output for the task is a dict with `evolved_instruction` or `evolved_instructions`,
    depending on whether `store_evolutions` is `False` or `True`, respectively; `answer` or
    `answers` if `generate_answers=True`; and, finally, the `model_name`.

    Args:
        instructions: The instructions to be included within the output.
        answers: The answers to be included within the output if `generate_answers=True`.

    Returns:
        If `store_evolutions=False` and `generate_answers=True` return {"evolved_instruction": ..., "model_name": ..., "answer": ...};
        if `store_evolutions=True` and `generate_answers=True` return {"evolved_instructions": ..., "model_name": ..., "answers": ...};
        if `store_evolutions=False` and `generate_answers=False` return {"evolved_instruction": ..., "model_name": ...};
        if `store_evolutions=True` and `generate_answers=False` return {"evolved_instructions": ..., "model_name": ...}.
    """
    _output = {}
    if not self.store_evolutions:
        _output["evolved_instruction"] = instructions[-1]
    else:
        _output["evolved_instructions"] = instructions

    if self.generate_answers and answers:
        if not self.store_evolutions:
            _output["answer"] = answers[-1]
        else:
            _output["answers"] = answers

    _output["model_name"] = self.llm.model_name
    return _output
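
The return shapes can be checked with a standalone version of the same branching. `format_output_sketch` is a hypothetical free function that takes the two flags and the model name as arguments instead of reading them from the task instance.

```python
from typing import Any, Dict, List, Optional


def format_output_sketch(
    instructions: List[str],
    answers: Optional[List[str]] = None,
    *,
    store_evolutions: bool,
    generate_answers: bool,
    model_name: str = "model_name",
) -> Dict[str, Any]:
    """Mirror the branching of `format_output` without needing a task instance."""
    output: Dict[str, Any] = {}
    if store_evolutions:
        output["evolved_instructions"] = instructions
    else:
        # Only the last evolution is kept.
        output["evolved_instruction"] = instructions[-1]

    if generate_answers and answers:
        if store_evolutions:
            output["answers"] = answers
        else:
            output["answer"] = answers[-1]

    output["model_name"] = model_name
    return output


last_only = format_output_sketch(
    ["v1", "v2"], ["a1", "a2"], store_evolutions=False, generate_answers=True
)
full_history = format_output_sketch(
    ["v1", "v2"], ["a1", "a2"], store_evolutions=True, generate_answers=True
)
print(last_only)
print(full_history)
```

Note that the plural keys (`evolved_instructions`, `answers`) appear together, and only when `store_evolutions=True`.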

process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Yields:

Type Description
StepOutput

A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """

    evolved_instructions = self._evolve_instructions(inputs)

    if self.store_evolutions:
        # Remove the input instruction from the `evolved_instructions` list
        from_ = 1 if not self.include_original_instruction else 0
        evolved_instructions = [
            instruction[from_:] for instruction in evolved_instructions
        ]

    if not self.generate_answers:
        for input, instruction in zip(inputs, evolved_instructions):
            input.update(self.format_output(instruction))
        yield inputs

    self._logger.info(
        f"🎉 Finished evolving {len(evolved_instructions)} instructions!"
    )

    if self.generate_answers:
        self._logger.info(
            f"🧠 Generating answers for the {len(evolved_instructions)} evolved instructions!"
        )

        answers = self._generate_answers(evolved_instructions)

        self._logger.info(
            f"🎉 Finished generating answers for the {len(evolved_instructions)} evolved"
            " instructions!"
        )

        for idx, (input, instruction) in enumerate(
            zip(inputs, evolved_instructions)
        ):
            input.update(self.format_output(instruction, answers[idx]))
        yield inputs
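
The `from_` slicing in `process` decides whether the input instruction stays in each stored evolution chain. The sketch below isolates that step; it assumes, as the inline comment in the source suggests, that each chain starts with the original instruction. The helper name `drop_original` is hypothetical.

```python
from typing import List


def drop_original(
    evolutions: List[List[str]], include_original_instruction: bool
) -> List[List[str]]:
    """Slice each chain so the first item (the input instruction) is kept or dropped."""
    from_ = 0 if include_original_instruction else 1
    return [chain[from_:] for chain in evolutions]


chains = [["seed A", "A v1", "A v2"], ["seed B", "B v1", "B v2"]]
without_seed = drop_original(chains, include_original_instruction=False)
with_seed = drop_original(chains, include_original_instruction=True)
```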

EvolInstructGenerator

Bases: GeneratorTask

Generate evolved instructions using an LLM.

WizardLM: Empowering Large Language Models to Follow Complex Instructions

Attributes:

Name Type Description
num_instructions int

The number of instructions to be generated.

generate_answers bool

Whether to generate answers for the instructions or not. Defaults to False.

mutation_templates Dict[str, str]

The mutation templates to be used for the generation of the instructions.

min_length RuntimeParameter[int]

Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.

max_length RuntimeParameter[int]

Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.

seed RuntimeParameter[int]

The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
  • instruction (str): The generated instruction if generate_answers=False.
  • answer (str): The generated answer if generate_answers=True.
  • instructions (List[str]): The generated instructions if generate_answers=True.
  • model_name (str): The name of the LLM used to generate and evolve the instructions.
Categories
  • evol
  • instruction
  • generation
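References
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions (https://arxiv.org/abs/2304.12244)
  • GitHub: h2oai/h2o-wizardlm (https://github.com/h2oai/h2o-wizardlm)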

Examples:

Generate evolved instructions without initial instructions:

```python
from distilabel.steps.tasks import EvolInstructGenerator
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct_generator = EvolInstructGenerator(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=2,
)

evol_instruct_generator.load()

result = next(evol_instruct_generator.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
```
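
According to its signature, `process` yields pairs of a batch and a boolean flag that marks the last batch. The following sketch shows one way to consume such a generator; `fake_process` is a hypothetical stand-in so the snippet runs without an LLM.

```python
from typing import Dict, Iterator, List, Tuple

Batch = List[Dict[str, str]]


def fake_process() -> Iterator[Tuple[Batch, bool]]:
    """Hypothetical stand-in that yields (batch, is_last_batch) pairs."""
    yield [{"instruction": "first", "model_name": "stub"}], False
    yield [{"instruction": "second", "model_name": "stub"}], True


collected: Batch = []
for batch, is_last in fake_process():
    collected.extend(batch)
    if is_last:
        # Stop once the generator signals the final batch.
        break
```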

Citations:

```
@misc{xu2023wizardlmempoweringlargelanguage,
    title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
    author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
    year={2023},
    eprint={2304.12244},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2304.12244},
}
```
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
class EvolInstructGenerator(GeneratorTask):
    """Generate evolved instructions using an `LLM`.

    WizardLM: Empowering Large Language Models to Follow Complex Instructions

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs
            to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs
            to be lower than, to be considered valid.
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Output columns:
        - instruction (`str`): The generated instruction if `generate_answers=False`.
        - answer (`str`): The generated answer if `generate_answers=True`.
        - instructions (`List[str]`): The generated instructions if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to generate and evolve the instructions.

    Categories:
        - evol
        - instruction
        - generation

    References:
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
        - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)

    Examples:

        Generate evolved instructions without initial instructions:

        ```python
        from distilabel.steps.tasks import EvolInstructGenerator
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_instruct_generator = EvolInstructGenerator(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_instructions=2,
        )

        evol_instruct_generator.load()

        result = next(evol_instruct_generator.process())
        # result
        # [{'instruction': 'generated instruction', 'model_name': 'test'}]
        ```

    Citations:

        ```
        @misc{xu2023wizardlmempoweringlargelanguage,
            title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
            author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
            year={2023},
            eprint={2304.12244},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2304.12244},
        }
        ```
    """

    num_instructions: int
    generate_answers: bool = False
    mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES

    min_length: RuntimeParameter[int] = Field(
        default=512,
        description="Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.",
    )
    max_length: RuntimeParameter[int] = Field(
        default=1024,
        description="Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.",
    )

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is used to randomly pick a mutation method, it is useful to set a random seed.",
    )
    _seed_texts: Optional[List[str]] = PrivateAttr(default_factory=list)
    _prompts: Optional[List[str]] = PrivateAttr(default_factory=list)

    def _generate_seed_texts(self) -> List[str]:
        """Generates a list of seed texts to be used as part of the starting prompts for the task.

        It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and
        a list of English words will be used to generate the seed texts that will be provided to the
        mutation method and included within the prompt.

        Returns:
            A list of seed texts to be used as part of the starting prompts for the task.
        """
        seed_texts = []
        for _ in range(self.num_instructions * 10):
            num_words = np.random.choice([1, 2, 3, 4])
            seed_texts.append(
                self.mutation_templates["FRESH_START"].replace(  # type: ignore
                    "<PROMPT>",
                    ", ".join(
                        [
                            np.random.choice(self._english_nouns).strip()
                            for _ in range(num_words)
                        ]
                    ),
                )
            )
        return seed_texts

    @override
    def model_post_init(self, __context: Any) -> None:
        """Override this method to perform additional initialization after `__init__` and `model_construct`.
        This is useful if you want to do some validation that requires the entire model to be initialized.
        """
        super().model_post_init(__context)

        np.random.seed(self.seed)

        self._seed_texts = self._generate_seed_texts()
        self._prompts = [
            np.random.choice(self._seed_texts) for _ in range(self.num_instructions)
        ]

    @cached_property
    def _english_nouns(self) -> List[str]:
        """A list of English nouns to be used as part of the starting prompts for the task.

        References:
            - https://github.com/h2oai/h2o-wizardlm
        """
        _path = str(
            importlib_resources.files("distilabel")
            / "steps/tasks/evol_instruct/english_nouns.txt"
        )
        with open(_path, mode="r") as f:
            return [line.strip() for line in f.readlines()]

    @property
    def outputs(self) -> List[str]:
        """The outputs of the task are the `instruction`, the `answer` if `generate_answers=True`,
        and the `model_name`."""
        _outputs = ["instruction", "model_name"]
        if self.generate_answers:
            _outputs.append("answer")
        return _outputs

    def format_output(  # type: ignore
        self, instruction: str, answer: Optional[str] = None
    ) -> Dict[str, Any]:
        """The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;
        and, finally, the `model_name`.

        Args:
            instruction: The instruction to be included within the output.
            answer: The answer to be included within the output if `generate_answers=True`.

        Returns:
            If `generate_answers=True` return {"instruction": ..., "answer": ..., "model_name": ...};
            if `generate_answers=False` return {"instruction": ..., "model_name": ...}.
        """
        _output = {
            "instruction": instruction,
            "model_name": self.llm.model_name,
        }
        if self.generate_answers and answer is not None:
            _output["answer"] = answer
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates`."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, iter_no: int) -> List["ChatType"]:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            iter_no: The iteration number to be used to check whether the iteration is the
                first one i.e. FRESH_START, or not.

        Returns:
            A random mutation prompt with the provided instruction formatted as an OpenAI conversation.
        """
        prompts = []
        for idx in range(self.num_instructions):
            if (
                iter_no == 0
                or "Write one question or request containing" in self._prompts[idx]  # type: ignore
            ):
                mutation = "FRESH_START"
            else:
                mutation = np.random.choice(self.mutation_templates_names)
                if mutation == "FRESH_START":
                    self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore

            prompt_with_template = (
                self.mutation_templates[mutation].replace(  # type: ignore
                    "<PROMPT>",
                    self._prompts[idx],  # type: ignore
                )  # type: ignore
                if iter_no != 0
                else self._prompts[idx]  # type: ignore
            )
            prompts.append([{"role": "user", "content": prompt_with_template}])
        return prompts

    def _generate_answers(self, instructions: List[List[str]]) -> List[str]:
        """Generates the answer for the last instruction in `instructions`.

        Args:
            instructions: A list of lists where each item is a list with either the last
                evolved instruction if `store_evolutions=False` or all the evolved instructions
                if `store_evolutions=True`.

        Returns:
            A list of answers for the last instruction in `instructions`.
        """
        # TODO: update to generate answers for all the instructions
        _formatted_instructions = [
            [{"role": "user", "content": instruction[-1]}]
            for instruction in instructions
        ]
        responses = self.llm.generate(
            _formatted_instructions,
            **self.llm.generation_kwargs,  # type: ignore
        )
        return flatten_responses(responses)

    @override
    def process(self, offset: int = 0) -> "GeneratorStepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            offset: The offset to start the generation from. Defaults to 0.

        Yields:
            A list of Python dictionaries with the outputs of the task, and a boolean
            flag indicating whether the task has finished or not i.e. is the last batch.
        """
        instructions = []
        mutation_no = 0

        # TODO: update to take into account `offset`
        iter_no = 0
        while len(instructions) < self.num_instructions:
            prompts = self._apply_random_mutation(iter_no=iter_no)

            generated_prompts = flatten_responses(
                self.llm.generate(prompts, **self.llm.generation_kwargs)  # type: ignore
            )
            for idx, generated_prompt in enumerate(generated_prompts):
                generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
                if self.max_length >= len(generated_prompt) >= self.min_length:  # type: ignore
                    instructions.append(generated_prompt)
                    self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore
                else:
                    self._prompts[idx] = generated_prompt  # type: ignore

            self._logger.info(
                f"🔄 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!"
            )
            iter_no += 1

            if len(instructions) > self.num_instructions:
                instructions = instructions[: self.num_instructions]
            if len(instructions) > mutation_no:
                mutation_no = len(instructions) - mutation_no

            if not self.generate_answers and len(instructions[-mutation_no:]) > 0:
                yield (
                    [
                        self.format_output(mutated_instruction)
                        for mutated_instruction in instructions[-mutation_no:]
                    ],
                    len(instructions) >= self.num_instructions,
                )

        self._logger.info(f"🎉 Finished evolving {len(instructions)} instructions!")

        if self.generate_answers:
            self._logger.info(
                f"🧠 Generating answers for the {len(instructions)} evolved instructions!"
            )

            answers = self._generate_answers(instructions)

            self._logger.info(
                f"🎉 Finished generating answers for the {len(instructions)} evolved instructions!"
            )

            yield (
                [
                    self.format_output(instruction, answer)
                    for instruction, answer in zip(instructions, answers)
                ],
                True,
            )

_english_nouns: List[str] cached property

A list of English nouns to be used as part of the starting prompts for the task.

References
  • https://github.com/h2oai/h2o-wizardlm

mutation_templates_names: List[str] property

Returns the names i.e. keys of the provided mutation_templates.

outputs: List[str] property

The outputs of the task are the instruction, the answer if generate_answers=True, and the model_name.

_apply_random_mutation(iter_no)

Applies a random mutation from the ones provided as part of the mutation_templates enum, and returns the provided instruction within the mutation prompt.

Parameters:

Name Type Description Default
iter_no int

The iteration number to be used to check whether the iteration is the first one i.e. FRESH_START, or not.

required

Returns:

Type Description
List[ChatType]

A random mutation prompt with the provided instruction formatted as an OpenAI conversation.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def _apply_random_mutation(self, iter_no: int) -> List["ChatType"]:
    """Applies a random mutation from the ones provided as part of the `mutation_templates`
    enum, and returns the provided instruction within the mutation prompt.

    Args:
        iter_no: The iteration number to be used to check whether the iteration is the
            first one i.e. FRESH_START, or not.

    Returns:
        A random mutation prompt with the provided instruction formatted as an OpenAI conversation.
    """
    prompts = []
    for idx in range(self.num_instructions):
        if (
            iter_no == 0
            or "Write one question or request containing" in self._prompts[idx]  # type: ignore
        ):
            mutation = "FRESH_START"
        else:
            mutation = np.random.choice(self.mutation_templates_names)
            if mutation == "FRESH_START":
                self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore

        prompt_with_template = (
            self.mutation_templates[mutation].replace(  # type: ignore
                "<PROMPT>",
                self._prompts[idx],  # type: ignore
            )  # type: ignore
            if iter_no != 0
            else self._prompts[idx]  # type: ignore
        )
        prompts.append([{"role": "user", "content": prompt_with_template}])
    return prompts
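
The core of the method is a `<PROMPT>` placeholder substitution followed by wrapping the text as a single-turn chat. The sketch below is a simplified illustration: it omits the `FRESH_START` resampling branch, uses the standard library `random` module instead of `numpy`, and the two templates are hypothetical stand-ins for the library's mutation templates.

```python
import random
from typing import Dict, List

# Hypothetical templates; the real ones ship with the library.
TEMPLATES: Dict[str, str] = {
    "FRESH_START": "Write one question or request containing one or more of the following words: <PROMPT>",
    "CONSTRAINTS": "Add one more constraint to the following prompt:\n<PROMPT>",
}


def build_chat_prompts(
    prompts: List[str], templates: Dict[str, str], iter_no: int, rng: random.Random
) -> List[List[Dict[str, str]]]:
    """Wrap each prompt in a randomly chosen template, as a one-message chat."""
    chats = []
    for prompt in prompts:
        if iter_no == 0:
            # On the first iteration the seed text is sent unchanged.
            content = prompt
        else:
            name = rng.choice(sorted(templates))
            content = templates[name].replace("<PROMPT>", prompt)
        chats.append([{"role": "user", "content": content}])
    return chats


rng = random.Random(0)
first_round = build_chat_prompts(["seed text"], TEMPLATES, iter_no=0, rng=rng)
later_round = build_chat_prompts(["Explain recursion."], TEMPLATES, iter_no=1, rng=rng)
```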

_generate_answers(instructions)

Generates the answer for the last instruction in instructions.

Parameters:

Name Type Description Default
instructions List[List[str]]

A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False or all the evolved instructions if store_evolutions=True.

required

Returns:

Type Description
List[str]

A list of answers for the last instruction in instructions.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def _generate_answers(self, instructions: List[List[str]]) -> List[str]:
    """Generates the answer for the last instruction in `instructions`.

    Args:
        instructions: A list of lists where each item is a list with either the last
            evolved instruction if `store_evolutions=False` or all the evolved instructions
            if `store_evolutions=True`.

    Returns:
        A list of answers for the last instruction in `instructions`.
    """
    # TODO: update to generate answers for all the instructions
    _formatted_instructions = [
        [{"role": "user", "content": instruction[-1]}]
        for instruction in instructions
    ]
    responses = self.llm.generate(
        _formatted_instructions,
        **self.llm.generation_kwargs,  # type: ignore
    )
    return flatten_responses(responses)
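
The formatting step can be looked at in isolation: each item contributes only its last element, wrapped as a single user message. A minimal sketch, with the hypothetical helper name `to_chat_inputs`:

```python
from typing import Dict, List


def to_chat_inputs(instructions: List[List[str]]) -> List[List[Dict[str, str]]]:
    """Turn the last instruction of each chain into a one-message chat."""
    return [[{"role": "user", "content": chain[-1]}] for chain in instructions]


chat_inputs = to_chat_inputs([["v1", "v2"], ["only one"]])
```

Because `[-1]` is applied to each item, every item is expected to be a list of strings; indexing a bare string this way would select only its final character.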

_generate_seed_texts()

Generates a list of seed texts to be used as part of the starting prompts for the task.

It will use the FRESH_START mutation template, as it needs to generate text from scratch; and a list of English words will be used to generate the seed texts that will be provided to the mutation method and included within the prompt.

Returns:

Type Description
List[str]

A list of seed texts to be used as part of the starting prompts for the task.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def _generate_seed_texts(self) -> List[str]:
    """Generates a list of seed texts to be used as part of the starting prompts for the task.

    It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and
    a list of English words will be used to generate the seed texts that will be provided to the
    mutation method and included within the prompt.

    Returns:
        A list of seed texts to be used as part of the starting prompts for the task.
    """
    seed_texts = []
    for _ in range(self.num_instructions * 10):
        num_words = np.random.choice([1, 2, 3, 4])
        seed_texts.append(
            self.mutation_templates["FRESH_START"].replace(  # type: ignore
                "<PROMPT>",
                ", ".join(
                    [
                        np.random.choice(self._english_nouns).strip()
                        for _ in range(num_words)
                    ]
                ),
            )
        )
    return seed_texts
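
To see what the seed texts look like, the same loop can be run against a tiny word list. This is a sketch under stated assumptions: `FRESH_START` and `NOUNS` are hypothetical stand-ins for the library's template and `english_nouns.txt`, and the standard library `random.Random` replaces the `numpy` calls used in the source.

```python
import random
from typing import List

# Hypothetical stand-ins for the real template and noun file.
FRESH_START = "Write one question or request containing one or more of the following words: <PROMPT>"
NOUNS = ["river", "engine", "market", "violin", "planet"]


def generate_seed_texts(num_instructions: int, rng: random.Random) -> List[str]:
    """Build 10 seed texts per requested instruction from 1 to 4 random nouns."""
    seed_texts = []
    for _ in range(num_instructions * 10):
        num_words = rng.choice([1, 2, 3, 4])
        words = ", ".join(rng.choice(NOUNS) for _ in range(num_words))
        seed_texts.append(FRESH_START.replace("<PROMPT>", words))
    return seed_texts


seed_texts = generate_seed_texts(num_instructions=2, rng=random.Random(42))
```

Generating ten times more seed texts than requested instructions gives the later sampling steps a pool to draw from when a prompt has to be restarted.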

format_output(instruction, answer=None)

The output for the task is a dict with: instruction; answer if generate_answers=True; and, finally, the model_name.

Parameters:

Name Type Description Default
instruction str

The instruction to be included within the output.

required
answer Optional[str]

The answer to be included within the output if generate_answers=True.

None

Returns:

Type Description
Dict[str, Any]

If generate_answers=True return {"instruction": ..., "answer": ..., "model_name": ...}; if generate_answers=False return {"instruction": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def format_output(  # type: ignore
    self, instruction: str, answer: Optional[str] = None
) -> Dict[str, Any]:
    """The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;
    and, finally, the `model_name`.

    Args:
        instruction: The instruction to be included within the output.
        answer: The answer to be included within the output if `generate_answers=True`.

    Returns:
        If `generate_answers=True` return {"instruction": ..., "answer": ..., "model_name": ...};
        if `generate_answers=False` return {"instruction": ..., "model_name": ...}.
    """
    _output = {
        "instruction": instruction,
        "model_name": self.llm.model_name,
    }
    if self.generate_answers and answer is not None:
        _output["answer"] = answer
    return _output

model_post_init(__context)

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
@override
def model_post_init(self, __context: Any) -> None:
    """Override this method to perform additional initialization after `__init__` and `model_construct`.
    This is useful if you want to do some validation that requires the entire model to be initialized.
    """
    super().model_post_init(__context)

    np.random.seed(self.seed)

    self._seed_texts = self._generate_seed_texts()
    self._prompts = [
        np.random.choice(self._seed_texts) for _ in range(self.num_instructions)
    ]
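
Seeding before sampling is what makes the initial prompts reproducible across runs. The sketch below illustrates the idea with a local `random.Random` instance from the standard library, which is a swapped-in technique: the source seeds the global `numpy` generator instead. The `pool` contents and the helper name `pick_prompts` are hypothetical.

```python
import random
from typing import List


def pick_prompts(seed_texts: List[str], num_instructions: int, seed: int) -> List[str]:
    """Sample the starting prompts from the seed-text pool with a fixed seed."""
    rng = random.Random(seed)
    return [rng.choice(seed_texts) for _ in range(num_instructions)]


pool = ["seed text 1", "seed text 2", "seed text 3", "seed text 4"]
first = pick_prompts(pool, num_instructions=3, seed=42)
second = pick_prompts(pool, num_instructions=3, seed=42)
```

With the same seed, `first` and `second` contain the same prompts in the same order.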

process(offset=0)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:

Name Type Description Default
offset int

The offset to start the generation from. Defaults to 0.

0

Yields:

Type Description
GeneratorStepOutput

A list of Python dictionaries with the outputs of the task, and a boolean flag indicating whether the task has finished or not i.e. is the last batch.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
@override
def process(self, offset: int = 0) -> "GeneratorStepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        offset: The offset to start the generation from. Defaults to 0.

    Yields:
        A list of Python dictionaries with the outputs of the task, and a boolean
        flag indicating whether the task has finished or not i.e. is the last batch.
    """
    instructions = []
    mutation_no = 0

    # TODO: update to take into account `offset`
    iter_no = 0
    while len(instructions) < self.num_instructions:
        prompts = self._apply_random_mutation(iter_no=iter_no)

        generated_prompts = flatten_responses(
            self.llm.generate(prompts, **self.llm.generation_kwargs)  # type: ignore
        )
        for idx, generated_prompt in enumerate(generated_prompts):
            generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
            if self.max_length >= len(generated_prompt) >= self.min_length:  # type: ignore
                instructions.append(generated_prompt)
                self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore
            else:
                self._prompts[idx] = generated_prompt  # type: ignore

        self._logger.info(
            f"🔄 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!"
        )
        iter_no += 1

        if len(instructions) > self.num_instructions:
            instructions = instructions[: self.num_instructions]
        if len(instructions) > mutation_no:
            mutation_no = len(instructions) - mutation_no

        if not self.generate_answers and len(instructions[-mutation_no:]) > 0:
            yield (
                [
                    self.format_output(mutated_instruction)
                    for mutated_instruction in instructions[-mutation_no:]
                ],
                len(instructions) >= self.num_instructions,
            )

    self._logger.info(f"🎉 Finished evolving {len(instructions)} instructions!")

    if self.generate_answers:
        self._logger.info(
            f"🧠 Generating answers for the {len(instructions)} evolved instructions!"
        )

        answers = self._generate_answers(instructions)

        self._logger.info(
            f"🎉 Finished generating answers for the {len(instructions)} evolved instructions!"
        )

        yield (
            [
                self.format_output(instruction, answer)
                for instruction, answer in zip(instructions, answers)
            ],
            True,
        )
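
Inside the loop, each generation is cleaned by keeping the text after the last `Prompt#:` marker and is then accepted only if its length falls within `[min_length, max_length]`. The sketch below isolates that check; `clean_and_filter` is a hypothetical helper, and the small bounds are chosen only to keep the example short.

```python
from typing import List, Tuple


def clean_and_filter(
    generations: List[str], min_length: int, max_length: int
) -> Tuple[List[str], List[str]]:
    """Split generations into accepted instructions and ones to evolve again."""
    accepted, retry = [], []
    for text in generations:
        # Keep only what follows the last "Prompt#:" marker, if present.
        cleaned = text.split("Prompt#:")[-1].strip()
        if min_length <= len(cleaned) <= max_length:
            accepted.append(cleaned)
        else:
            retry.append(cleaned)
    return accepted, retry


accepted, retry = clean_and_filter(
    ["#Rewritten Prompt#: Explain how tides work.", "Too short", "x" * 50],
    min_length=10,
    max_length=40,
)
```

Note that `len` on a Python `str` counts characters, so for non-ASCII text the value can differ from the byte length mentioned in the parameter descriptions.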

EvolQuality

Bases: Task

Evolve the quality of the responses using an LLM.

EvolQuality task is used to evolve the quality of the responses given a prompt, by generating a new response with a language model. This step implements the evolution quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:

Name Type Description
num_evolutions int

The number of evolutions to be performed on the responses.

store_evolutions bool

Whether to store all the evolved responses or just the last one. Defaults to False.

include_original_response bool

Whether to include the original response within the evolved responses. Defaults to False.

mutation_templates Dict[str, str]

The mutation templates to be used to evolve the responses.

seed RuntimeParameter[int]

The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction that was used to generate the responses.
  • response (str): The response to be rewritten.
Output columns
  • evolved_response (str): The evolved response if store_evolutions=False.
  • evolved_responses (List[str]): The evolved responses if store_evolutions=True.
  • model_name (str): The name of the LLM used to evolve the responses.
Categories
  • evol
  • response
  • deita
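References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)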

Examples:

Evolve the quality of the responses given a prompt:

```python
from distilabel.steps.tasks import EvolQuality
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_quality = EvolQuality(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_quality.load()

result = next(
    evol_quality.process(
        [
            {"instruction": "common instruction", "response": "a response"},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'common instruction',
#         'response': 'a response',
#         'evolved_response': 'evolved response',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2'
#     }
# ]
```

Citations:

```
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
```
Source code in src/distilabel/steps/tasks/evol_quality/base.py
class EvolQuality(Task):
    """Evolve the quality of the responses using an `LLM`.

    `EvolQuality` task is used to evolve the quality of the responses given a prompt,
    by generating a new response with a language model. This step implements the evolution
    quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
    Automatic Data Selection in Instruction Tuning'.

    Attributes:
        num_evolutions: The number of evolutions to be performed on the responses.
        store_evolutions: Whether to store all the evolved responses or just the last one.
            Defaults to `False`.
        include_original_response: Whether to include the original response within the evolved
            responses. Defaults to `False`.
        mutation_templates: The mutation templates to be used to evolve the responses.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction that was used to generate the `responses`.
        - response (`str`): The responses to be rewritten.

    Output columns:
        - evolved_response (`str`): The evolved response if `store_evolutions=False`.
        - evolved_responses (`List[str]`): The evolved responses if `store_evolutions=True`.
        - model_name (`str`): The name of the LLM used to evolve the responses.

    Categories:
        - evol
        - response
        - deita

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)

    Examples:

        Evolve the quality of the responses given a prompt:

        ```python
        from distilabel.steps.tasks import EvolQuality
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_quality = EvolQuality(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
        )

        evol_quality.load()

        result = next(
            evol_quality.process(
                [
                    {"instruction": "common instruction", "response": "a response"},
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'common instruction',
        #         'response': 'a response',
        #         'evolved_response': 'evolved response',
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2'
        #     }
        # ]
        ```

    Citations:

        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```
    """

    num_evolutions: int
    store_evolutions: bool = False
    include_original_response: bool = False
    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is used to randomly pick a mutation method, it is useful to set a random seed.",
    )

    @override
    def model_post_init(self, __context: Any) -> None:
        """Override this method to perform additional initialization after `__init__` and `model_construct`.
        This is useful if you want to do some validation that requires the entire model to be initialized.
        """
        super().model_post_init(__context)

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are the `instruction` and `response`."""
        return ["instruction", "response"]

    def format_input(self, input: str) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation. No
        `system_prompt` is added by this method."""
        return [{"role": "user", "content": input}]

    @property
    def outputs(self) -> List[str]:
        """The outputs for the task are the `evolved_response/s` and the `model_name`."""
        # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,
        # this could be handled always and the value could be included within the DAG validation when
        # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.
        _outputs = [
            ("evolved_response" if not self.store_evolutions else "evolved_responses"),
            "model_name",
        ]

        return _outputs

    def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]:  # type: ignore
        """The output for the task is a dict with: `evolved_response` or `evolved_responses`,
        depending on whether `store_evolutions` is `False` or `True`, respectively;
        and, finally, the `model_name`.

        Args:
            responses: The responses to be included within the output.

        Returns:
            if `store_evolutions=False` return {"evolved_response": ..., "model_name": ...};
            if `store_evolutions=True` return {"evolved_responses": ..., "model_name": ...}.
        """
        _output = {}

        if not self.store_evolutions:
            _output["evolved_response"] = responses[-1]
        else:
            _output["evolved_responses"] = responses

        _output["model_name"] = self.llm.model_name
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names, i.e. the keys, of the provided `mutation_templates` dictionary."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, instruction: str, response: str) -> str:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        dictionary, and returns the mutation prompt with the provided instruction and response.

        Args:
            instruction: The instruction to be included within the mutation prompt.
            response: The response to be included within the mutation prompt.

        Returns:
            A random mutation prompt with the provided instruction and response.
        """
        mutation = np.random.choice(self.mutation_templates_names)
        return (
            self.mutation_templates[mutation]
            .replace("<PROMPT>", instruction)
            .replace("<RESPONSE>", response)
        )

    def _evolve_reponses(self, inputs: "StepInput") -> List[List[str]]:
        """Evolves the responses provided as part of the inputs of the task.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list where each item is a list with either the last evolved response if
            `store_evolutions=False` or all the evolved responses if `store_evolutions=True`.
        """
        np.random.seed(self.seed)
        instructions: List[List[str]] = [[input["instruction"]] for input in inputs]
        responses: List[List[str]] = [[input["response"]] for input in inputs]

        for iter_no in range(self.num_evolutions):
            formatted_prompts = []
            for instruction, response in zip(instructions, responses):
                formatted_prompts.append(
                    self._apply_random_mutation(instruction[-1], response[-1])
                )

            formatted_prompts = [
                self.format_input(prompt) for prompt in formatted_prompts
            ]

            generated_responses = self.llm.generate(
                formatted_prompts,
                **self.llm.generation_kwargs,  # type: ignore
            )

            if self.store_evolutions:
                responses = [
                    response + [evolved_response[0]]
                    for response, evolved_response in zip(
                        responses, generated_responses
                    )
                ]
            else:
                responses = [
                    [evolved_response[0]] for evolved_response in generated_responses
                ]

            self._logger.info(
                f"🔄 Ran iteration {iter_no} evolving {len(responses)} responses!"
            )

        return responses

    @override
    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list of Python dictionaries with the outputs of the task.
        """

        responses = self._evolve_reponses(inputs)

        if self.store_evolutions:
            # Remove the original response from the `evolved_responses` list unless it was requested
            from_ = 1 if not self.include_original_response else 0
            responses = [response[from_:] for response in responses]

        for input, response in zip(inputs, responses):
            input.update(self.format_output(response))
        yield inputs

        self._logger.info(f"🎉 Finished evolving {len(responses)} instructions!")

inputs: List[str] property

The inputs for the task are the instruction and response.

mutation_templates_names: List[str] property

Returns the names, i.e. the keys, of the provided mutation_templates dictionary.

outputs: List[str] property

The outputs for the task are the evolved_response/s and the model_name.

_apply_random_mutation(instruction, response)

Applies a random mutation from the ones provided as part of the mutation_templates dictionary, and returns the mutation prompt with the provided instruction and response.

Parameters:

Name Type Description Default
instruction str

The instruction to be included within the mutation prompt.

required
response str

The response to be included within the mutation prompt.

required

Returns:

Type Description
str

A random mutation prompt with the provided instruction and response.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
def _apply_random_mutation(self, instruction: str, response: str) -> str:
    """Applies a random mutation from the ones provided as part of the `mutation_templates`
    dictionary, and returns the mutation prompt with the provided instruction and response.

    Args:
        instruction: The instruction to be included within the mutation prompt.
        response: The response to be included within the mutation prompt.

    Returns:
        A random mutation prompt with the provided instruction and response.
    """
    mutation = np.random.choice(self.mutation_templates_names)
    return (
        self.mutation_templates[mutation]
        .replace("<PROMPT>", instruction)
        .replace("<RESPONSE>", response)
    )
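The placeholder substitution performed by this method can be tried in isolation. The following is a minimal sketch, not the library's implementation: the two templates are hypothetical (the real `MUTATION_TEMPLATES` are not shown on this page), and the standard-library `random.Random` is used as a stand-in for the seeded `np.random.choice` call.

```python
import random

# Hypothetical templates for illustration only; the templates shipped with
# distilabel are not reproduced here.
TEMPLATES = {
    "HELPFULNESS": "Make the response more helpful.\nPrompt: <PROMPT>\nResponse: <RESPONSE>",
    "DEPTH": "Make the response more in-depth.\nPrompt: <PROMPT>\nResponse: <RESPONSE>",
}


def apply_random_mutation(templates, instruction, response, rng):
    """Pick one template at random and fill in both placeholders."""
    name = rng.choice(sorted(templates))
    return (
        templates[name]
        .replace("<PROMPT>", instruction)
        .replace("<RESPONSE>", response)
    )


rng = random.Random(42)  # stand-in for `np.random.seed(42)`
prompt = apply_random_mutation(TEMPLATES, "What is 2 + 2?", "4", rng)
print(prompt)
```

Whichever template is picked, both placeholders are replaced, so the resulting prompt always carries the current instruction and the response to be rewritten.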

_evolve_reponses(inputs)

Evolves the responses provided as part of the inputs of the task.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Returns:

Type Description
List[List[str]]

A list where each item is a list with either the last evolved response if store_evolutions=False or all the evolved responses if store_evolutions=True.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
def _evolve_reponses(self, inputs: "StepInput") -> List[List[str]]:
    """Evolves the responses provided as part of the inputs of the task.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Returns:
        A list where each item is a list with either the last evolved response if
        `store_evolutions=False` or all the evolved responses if `store_evolutions=True`.
    """
    np.random.seed(self.seed)
    instructions: List[List[str]] = [[input["instruction"]] for input in inputs]
    responses: List[List[str]] = [[input["response"]] for input in inputs]

    for iter_no in range(self.num_evolutions):
        formatted_prompts = []
        for instruction, response in zip(instructions, responses):
            formatted_prompts.append(
                self._apply_random_mutation(instruction[-1], response[-1])
            )

        formatted_prompts = [
            self.format_input(prompt) for prompt in formatted_prompts
        ]

        generated_responses = self.llm.generate(
            formatted_prompts,
            **self.llm.generation_kwargs,  # type: ignore
        )

        if self.store_evolutions:
            responses = [
                response + [evolved_response[0]]
                for response, evolved_response in zip(
                    responses, generated_responses
                )
            ]
        else:
            responses = [
                [evolved_response[0]] for evolved_response in generated_responses
            ]

        self._logger.info(
            f"🔄 Ran iteration {iter_no} evolving {len(responses)} responses!"
        )

    return responses
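The effect of `store_evolutions` on the loop above is easier to see with the LLM replaced by a deterministic function. The sketch below is a simplified illustration under stated assumptions: `evolve` and `fake_llm` are hypothetical names, prompts are plain strings rather than `ChatType` messages, and each generation is a single string instead of a list of candidates.

```python
from typing import Callable, List


def evolve(
    responses: List[str],
    generate: Callable[[List[str]], List[str]],
    num_evolutions: int,
    store_evolutions: bool,
) -> List[List[str]]:
    """Rewrite each response `num_evolutions` times, feeding the latest version back in."""
    history = [[response] for response in responses]
    for _ in range(num_evolutions):
        latest = [versions[-1] for versions in history]
        generated = generate(latest)
        if store_evolutions:
            # Keep every version, including the original at index 0.
            history = [versions + [new] for versions, new in zip(history, generated)]
        else:
            # Keep only the most recent version.
            history = [[new] for new in generated]
    return history


def fake_llm(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in for `llm.generate`: appends a marker to each prompt."""
    return [prompt + "+" for prompt in prompts]


kept = evolve(["a"], fake_llm, num_evolutions=2, store_evolutions=True)
last = evolve(["a"], fake_llm, num_evolutions=2, store_evolutions=False)
print(kept)  # [['a', 'a+', 'a++']]
print(last)  # [['a++']]
```

With `store_evolutions=True` the original response stays at index 0 of each list, which is why `process` later decides whether to slice it off.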

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation. No system_prompt is added by this method.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
def format_input(self, input: str) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation. No
    `system_prompt` is added by this method."""
    return [{"role": "user", "content": input}]

format_output(responses)

The output for the task is a dict with: evolved_response or evolved_responses, depending on whether store_evolutions is False or True, respectively; and, finally, the model_name.

Parameters:

Name Type Description Default
responses Union[str, List[str]]

The responses to be included within the output.

required

Returns:

Type Description
Dict[str, Any]

if store_evolutions=False return {"evolved_response": ..., "model_name": ...}; if store_evolutions=True return {"evolved_responses": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]:  # type: ignore
    """The output for the task is a dict with: `evolved_response` or `evolved_responses`,
    depending on whether `store_evolutions` is `False` or `True`, respectively;
    and, finally, the `model_name`.

    Args:
        responses: The responses to be included within the output.

    Returns:
        if `store_evolutions=False` return {"evolved_response": ..., "model_name": ...};
        if `store_evolutions=True` return {"evolved_responses": ..., "model_name": ...}.
    """
    _output = {}

    if not self.store_evolutions:
        _output["evolved_response"] = responses[-1]
    else:
        _output["evolved_responses"] = responses

    _output["model_name"] = self.llm.model_name
    return _output
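The two output shapes can be compared side by side with a standalone function. This is only a sketch of the branching shown above: the free function and its `store_evolutions` and `model_name` parameters are assumptions that replace the instance attributes `self.store_evolutions` and `self.llm.model_name`.

```python
from typing import Any, Dict, List


def format_output(
    responses: List[str], store_evolutions: bool, model_name: str
) -> Dict[str, Any]:
    """Return either the last response or the whole list, plus the model name."""
    if store_evolutions:
        return {"evolved_responses": responses, "model_name": model_name}
    return {"evolved_response": responses[-1], "model_name": model_name}


single = format_output(["v1", "v2"], store_evolutions=False, model_name="my-model")
stored = format_output(["v1", "v2"], store_evolutions=True, model_name="my-model")
print(single)
print(stored)
```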

model_post_init(__context)

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
@override
def model_post_init(self, __context: Any) -> None:
    """Override this method to perform additional initialization after `__init__` and `model_construct`.
    This is useful if you want to do some validation that requires the entire model to be initialized.
    """
    super().model_post_init(__context)

process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Returns:

Type Description
StepOutput

A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
@override
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Returns:
        A list of Python dictionaries with the outputs of the task.
    """

    responses = self._evolve_reponses(inputs)

    if self.store_evolutions:
        # Remove the original response from the `evolved_responses` list unless it was requested
        from_ = 1 if not self.include_original_response else 0
        responses = [response[from_:] for response in responses]

    for input, response in zip(inputs, responses):
        input.update(self.format_output(response))
    yield inputs

    self._logger.info(f"🎉 Finished evolving {len(responses)} instructions!")
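The slicing step in `process` only matters when `store_evolutions=True`, where index 0 of each list still holds the original response. A small sketch of that step, using the hypothetical helper name `trim_history`:

```python
from typing import List


def trim_history(
    history: List[List[str]], include_original_response: bool
) -> List[List[str]]:
    """Drop the original response (index 0) unless the caller wants to keep it."""
    start = 0 if include_original_response else 1
    return [versions[start:] for versions in history]


history = [["original", "evolved 1", "evolved 2"]]
without_original = trim_history(history, include_original_response=False)
with_original = trim_history(history, include_original_response=True)
print(without_original)  # [['evolved 1', 'evolved 2']]
print(with_original)     # [['original', 'evolved 1', 'evolved 2']]
```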

GenerateEmbeddings

Bases: Step

Generate embeddings using the last hidden state of an LLM.

Generate embeddings for a text input using the last hidden state of an LLM, as described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:

Name Type Description
llm LLM

The LLM to use to generate the embeddings.

Input columns
  • text (str, List[Dict[str, str]]): The input text or conversation to generate embeddings for.
Output columns
  • embedding (List[float]): The embedding of the input text or conversation.
  • model_name (str): The model name used to generate the embeddings.
Categories
  • embedding
  • llm
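References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)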

Examples:

Generate embeddings for a text input:

```python
from distilabel.steps.tasks import GenerateEmbeddings
from distilabel.llms.huggingface import TransformersLLM

# Consider this as a placeholder for your actual LLM.
embedder = GenerateEmbeddings(
    llm=TransformersLLM(
        model="TaylorAI/bge-micro-v2",
        model_kwargs={"is_decoder": True},
        cuda_devices=[],
    )
)
embedder.load()

result = next(
    embedder.process(
        [
            {"text": "Hello, how are you?"},
        ]
    )
)
```

Citations:

```
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
```
Source code in src/distilabel/steps/tasks/generate_embeddings.py
class GenerateEmbeddings(Step):
    """Generate embeddings using the last hidden state of an `LLM`.

    Generate embeddings for a text input using the last hidden state of an `LLM`, as
    described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
    Automatic Data Selection in Instruction Tuning'.

    Attributes:
        llm: The `LLM` to use to generate the embeddings.

    Input columns:
        - text (`str`, `List[Dict[str, str]]`): The input text or conversation to generate
            embeddings for.

    Output columns:
        - embedding (`List[float]`): The embedding of the input text or conversation.
        - model_name (`str`): The model name used to generate the embeddings.

    Categories:
        - embedding
        - llm

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)

    Examples:

        Generate embeddings for a text input:

        ```python
        from distilabel.steps.tasks import GenerateEmbeddings
        from distilabel.llms.huggingface import TransformersLLM

        # Consider this as a placeholder for your actual LLM.
        embedder = GenerateEmbeddings(
            llm=TransformersLLM(
                model="TaylorAI/bge-micro-v2",
                model_kwargs={"is_decoder": True},
                cuda_devices=[],
            )
        )
        embedder.load()

        result = next(
            embedder.process(
                [
                    {"text": "Hello, how are you?"},
                ]
            )
        )
        ```

    Citations:

        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```
    """

    llm: LLM

    def load(self) -> None:
        """Loads the `LLM` used to generate the embeddings."""
        super().load()

        self.llm.load()

    @property
    def inputs(self) -> List[str]:
        """The input for the task is a `text` column containing either a string or a
        list of dictionaries in OpenAI chat-like format."""
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        """The outputs for the task are an `embedding` column containing the embedding of
        the `text` input, and a `model_name` column."""
        return ["embedding", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """Formats the input to be used by the LLM to generate the embeddings. The input
        can be in `ChatType` format or a string. If a string, it will be converted to a
        list of dictionaries in OpenAI chat-like format.

        Args:
            input: The input to format.

        Returns:
            The OpenAI chat-like format of the input.
        """
        text = input["text"]

        # a plain string is wrapped in a single user message; anything else
        # must already be in `ChatType` format
        if isinstance(text, str):
            return [{"role": "user", "content": text}]

        if is_openai_format(text):
            return text

        raise ValueError(
            f"Couldn't format input for step {self.name}. The `text` input column has to"
            " be a string or a list of dictionaries in OpenAI chat-like format."
        )

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Generates an embedding for each input using the last hidden state of the `LLM`.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """
        formatted_inputs = [self.format_input(input) for input in inputs]
        last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)
        for input, hidden_state in zip(inputs, last_hidden_states):
            input["embedding"] = hidden_state[-1].tolist()
            input["model_name"] = self.llm.model_name
        yield inputs

inputs: List[str] property

The input for the task is a text column containing either a string or a list of dictionaries in OpenAI chat-like format.

outputs: List[str] property

The outputs for the task are an embedding column containing the embedding of the text input, and a model_name column.

format_input(input)

Formats the input to be used by the LLM to generate the embeddings. The input can be in ChatType format or a string. If a string, it will be converted to a list of dictionaries in OpenAI chat-like format.

Parameters:

Name Type Description Default
input Dict[str, Any]

The input to format.

required

Returns:

Type Description
ChatType

The OpenAI chat-like format of the input.

Source code in src/distilabel/steps/tasks/generate_embeddings.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """Formats the input to be used by the LLM to generate the embeddings. The input
    can be in `ChatType` format or a string. If a string, it will be converted to a
    list of dictionaries in OpenAI chat-like format.

    Args:
        input: The input to format.

    Returns:
        The OpenAI chat-like format of the input.
    """
    text = input["text"]

    # a plain string is wrapped in a single user message; anything else
    # must already be in `ChatType` format
    if isinstance(text, str):
        return [{"role": "user", "content": text}]

    if is_openai_format(text):
        return text

    raise ValueError(
        f"Couldn't format input for step {self.name}. The `text` input column has to"
        " be a string or a list of dictionaries in OpenAI chat-like format."
    )
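The normalization above accepts two input shapes and rejects everything else. The sketch below mirrors that behavior without distilabel installed; `looks_like_chat` is a hypothetical stand-in for `is_openai_format`, whose exact checks are not shown on this page, and `to_chat` is an illustrative name.

```python
from typing import Any, Dict, List


def looks_like_chat(value: Any) -> bool:
    """Hypothetical stand-in for distilabel's `is_openai_format` check."""
    return isinstance(value, list) and all(
        isinstance(message, dict) and "role" in message and "content" in message
        for message in value
    )


def to_chat(row: Dict[str, Any]) -> List[Dict[str, str]]:
    """Wrap a plain string as one user message; pass chat-like input through."""
    text = row["text"]
    if isinstance(text, str):
        return [{"role": "user", "content": text}]
    if looks_like_chat(text):
        return text
    raise ValueError("`text` must be a string or a list of chat messages")


from_string = to_chat({"text": "Hello, how are you?"})
conversation = [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]
from_chat = to_chat({"text": conversation})

try:
    to_chat({"text": 123})
except ValueError as err:
    error_message = str(err)
```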

load()

Loads the LLM used to generate the embeddings.

Source code in src/distilabel/steps/tasks/generate_embeddings.py
def load(self) -> None:
    """Loads the `LLM` used to generate the embeddings."""
    super().load()

    self.llm.load()

process(inputs)

Generates an embedding for each input using the last hidden state of the LLM.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Yields:

Type Description
StepOutput

A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/generate_embeddings.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Generates an embedding for each input using the last hidden state of the `LLM`.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """
    formatted_inputs = [self.format_input(input) for input in inputs]
    last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)
    for input, hidden_state in zip(inputs, last_hidden_states):
        input["embedding"] = hidden_state[-1].tolist()
        input["model_name"] = self.llm.model_name
    yield inputs
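The indexing `hidden_state[-1]` keeps one vector per input. The sketch below assumes each hidden state is laid out as (sequence length, hidden size), so that the last row belongs to the final token; that layout is an assumption, since the return type of `get_last_hidden_states` is not shown here. Nested Python lists with made-up numbers stand in for the arrays.

```python
from typing import List

Vector = List[float]


def last_token_embeddings(last_hidden_states: List[List[Vector]]) -> List[Vector]:
    """For each input, keep only the hidden-state vector of the final position."""
    return [list(state[-1]) for state in last_hidden_states]


# Two inputs with different sequence lengths and a hidden size of 3.
states = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[1.0, 1.1, 1.2]],
]
embeddings = last_token_embeddings(states)
print(embeddings)  # [[0.4, 0.5, 0.6], [1.0, 1.1, 1.2]]
```

Every embedding has the same length regardless of how many tokens the input had, which is what makes the result usable as a fixed-size representation.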

GenerateLongTextMatchingData

Bases: _EmbeddingDataGeneration

Generate long text matching data with an LLM to later on train an embedding model.

GenerateLongTextMatchingData is a Task that generates long text matching data with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.

Note

Ideally, this task should be used after an EmbeddingTaskGenerator configured with flatten_tasks=True and category="text-matching-long", so that the LLM generates a list of tasks that is flattened into one task per row for the text-matching-long category.

Attributes:

Name Type Description
language str

The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.

seed str

The random seed to be set in case there's any sampling within the format_input method. Note that in this task the seed has no effect since there are no sampling params.

References
  • Improving Text Embeddings with Large Language Models (https://arxiv.org/abs/2401.00368)

Examples:

Generate synthetic long text matching data for training embedding models:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-matching-long",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateLongTextMatchingData(
        language="English",
        llm=...,  # LLM instance
    )

    task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateLongTextMatchingData(_EmbeddingDataGeneration):
    """Generate long text matching data with an `LLM` to later on train an embedding model.

    `GenerateLongTextMatchingData` is a `Task` that generates long text matching data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Note:
        Ideally, this task should be used after an `EmbeddingTaskGenerator` configured with
        `flatten_tasks=True` and `category="text-matching-long"`, so that the `LLM` generates a list
        of tasks that is flattened into one task per row for the text-matching-long category.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.
            Note that in this task the `seed` has no effect since there are no sampling params.

    References:
        - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

    Examples:

        Generate synthetic long text matching data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData

        with Pipeline("my-pipeline") as pipeline:
            task = EmbeddingTaskGenerator(
                category="text-matching-long",
                flatten_tasks=True,
                llm=...,  # LLM instance
            )

            generate = GenerateLongTextMatchingData(
                language="English",
                llm=...,  # LLM instance
            )

            task >> generate
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    _template_name: str = PrivateAttr(default="long-text-matching")

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """Method to format the input based on the `task` and the provided attributes, or just
        randomly sampling those if not provided. This method will render the `_template` with
        the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
        there's only one turn, being from the user with the content being the rendered `_template`.

        Args:
            input: The input dictionary containing the `task` to be used in the `_template`.

        Returns:
            A list with a single chat containing the user's message with the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    task=input["task"],
                    language=self.language,
                ).strip(),
            }
        ]

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["input", "positive_document"]

keys: List[str] property

Contains the keys that will be parsed from the LLM output into a Python dict.
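How the keys are extracted is not shown in this section. As a rough sketch, assuming the LLM is prompted to answer with a JSON object, the parsing could look like the following; `parse_output` is a hypothetical helper, and filling missing keys with `None` is a design choice of this example rather than documented behavior.

```python
import json
from typing import Any, Dict, List, Optional

KEYS = ["input", "positive_document"]


def parse_output(raw: Optional[str], keys: List[str] = KEYS) -> Dict[str, Any]:
    """Keep only the expected keys from a JSON answer; fall back to None values."""
    try:
        data = json.loads(raw)  # raises TypeError when `raw` is None
    except (TypeError, json.JSONDecodeError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {key: data.get(key) for key in keys}


parsed = parse_output('{"input": "q", "positive_document": "d", "extra": 1}')
failed = parse_output("not json")
print(parsed)  # {'input': 'q', 'positive_document': 'd'}
print(failed)  # {'input': None, 'positive_document': None}
```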

format_input(input)

Method to format the input based on the task and the provided attributes, or just randomly sampling those if not provided. This method will render the _template with the provided arguments and return an OpenAI formatted chat i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.

Parameters:

Name Type Description Default
input Dict[str, Any]

The input dictionary containing the task to be used in the _template.

required

Returns:

Type Description
ChatType

A list with a single chat containing the user's message with the rendered _template.

Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """Method to format the input based on the `task` and the provided attributes, or just
    randomly sampling those if not provided. This method will render the `_template` with
    the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
    there's only one turn, being from the user with the content being the rendered `_template`.

    Args:
        input: The input dictionary containing the `task` to be used in the `_template`.

    Returns:
        A list with a single chat containing the user's message with the rendered `_template`.
    """
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                task=input["task"],
                language=self.language,
            ).strip(),
        }
    ]
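The render-then-wrap pattern above can be reproduced with the standard library. In this sketch `string.Template` is a stand-in for the template object behind `self._template`, and the template text is hypothetical; the real `long-text-matching` template is not reproduced on this page.

```python
from string import Template
from typing import Any, Dict, List

# Hypothetical template text, used only to show the mechanics.
TEMPLATE = Template(
    """
Task: $task
Write the example in $language.
"""
)


def format_input(row: Dict[str, Any], language: str = "English") -> List[Dict[str, str]]:
    """Render the template and wrap it as a single-turn user message."""
    content = TEMPLATE.substitute(task=row["task"], language=language).strip()
    return [{"role": "user", "content": content}]


chat = format_input({"task": "Match a news article to its summary"})
print(chat[0]["content"])
```

The `.strip()` call removes the blank lines that surround the template body, so the message starts directly with the task description.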

GenerateSentencePair

Bases: Task

Generate a positive and, optionally, a negative sentence given an anchor sentence.

GenerateSentencePair is a pre-defined task that, given an anchor sentence, generates a positive sentence related to the anchor and, optionally, a negative sentence that is either unrelated to the anchor or similar to it. Optionally, you can give a context to guide the LLM towards more specific behavior. This task is useful for generating datasets to train embedding models.

Attributes:

Name Type Description
triplet bool

a flag to indicate if the task should generate a triplet of sentences (anchor, positive, negative). Defaults to False.

action GenerationAction

the action to perform to generate the positive sentence.

context str

the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default.

hard_negative bool

A flag to indicate if the negative should be a hard negative. Hard negatives are semantically closer to the positive, which makes them harder for a model to tell apart. Defaults to False.

Input columns
  • anchor (str): The anchor sentence to generate the positive and negative sentences.
Output columns
  • positive (str): The positive sentence related to the anchor.
  • negative (str): The negative sentence, only generated if triplet=True. It is unrelated to the anchor by default, or closer in meaning to the positive (and therefore harder for a model to distinguish) if hard_negative=True.
  • model_name (str): The name of the model that was used to generate the sentences.
Categories
  • embedding

Examples:

Paraphrasing:

```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="paraphrase",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
```

Generating semantically similar sentences:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="semantically-similar",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])
```

Generating queries:

```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])
```

Generating answers:

```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="answer",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
```

Generating queries with context (**applies to every action**):

```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    context="Argilla is an open-source data curation platform for LLMs.",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
```

Generating Hard-negatives (**applies to every action**):

```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    context="Argilla is an open-source data curation platform for LLMs.",
    hard_negative=True,
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
```
Source code in src/distilabel/steps/tasks/sentence_transformers.py
class GenerateSentencePair(Task):
    """Generate a positive and negative (optionally) sentences given an anchor sentence.

    `GenerateSentencePair` is a pre-defined task that given an anchor sentence generates
    a positive sentence related to the anchor and optionally a negative sentence unrelated
    to the anchor or similar to it. Optionally, you can give a context to guide the LLM
    towards more specific behavior. This task is useful to generate training datasets for
    training embeddings models.

    Attributes:
        triplet: a flag to indicate if the task should generate a triplet of sentences
            (anchor, positive, negative). Defaults to `False`.
        action: the action to perform to generate the positive sentence.
        context: the context to use for the generation. Can be helpful to guide the LLM
            towards more specific context. Not used by default.
        hard_negative: A flag to indicate if the negative should be a hard-negative or not.
            Hard negatives make it hard for the model to distinguish against the positive,
            with a higher degree of semantic similarity.

    Input columns:
        - anchor (`str`): The anchor sentence to generate the positive and negative sentences.

    Output columns:
        - positive (`str`): The positive sentence related to the `anchor`.
        - negative (`str`): The negative sentence unrelated to the `anchor` if `triplet=True`,
            or more similar to the positive to make it more challenging for a model to distinguish
            in case `hard_negative=True`.
        - model_name (`str`): The name of the model that was used to generate the sentences.

    Categories:
        - embedding

    Examples:

        Paraphrasing:

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.llms import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="paraphrase",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
        ```

        Generating semantically similar sentences:

        ```python
        from distilabel.llms import InferenceEndpointsLLM
        from distilabel.steps.tasks import GenerateSentencePair

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="semantically-similar",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])
        ```

        Generating queries:

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.llms import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="query",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])
        ```

        Generating answers:

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.llms import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="answer",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
        ```

        Generating queries with context (**applies to every action**):

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.llms import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="query",
            context="Argilla is an open-source data curation platform for LLMs.",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
        ```

        Generating Hard-negatives (**applies to every action**):

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.llms import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="query",
            context="Argilla is an open-source data curation platform for LLMs.",
            hard_negative=True,
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
        ```

    """

    triplet: bool = False
    action: GenerationAction
    hard_negative: bool = False
    context: str = ""

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "generate-sentence-pair.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task is the `anchor` sentence."""
        return ["anchor"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The inputs are formatted as a `ChatType`, with a system prompt describing the
        task of generating a positive and negative sentences for the anchor sentence. The
        anchor is provided as the first user interaction in the conversation.

        Args:
            input: The input containing the `anchor` sentence.

        Returns:
            A list of dictionaries containing the system and user interactions.
        """
        action_sentence = GENERATION_ACTION_SENTENCES[self.action]

        format_system_prompt = {
            "action_sentence": action_sentence,
            "context": CONTEXT_INTRO if self.context else "",
        }
        if self.triplet:
            format_system_prompt["negative_style"] = NEGATIVE_STYLE[
                "hard-negative" if self.hard_negative else "negative"
            ]

        system_prompt = (
            POSITIVE_NEGATIVE_SYSTEM_PROMPT if self.triplet else POSITIVE_SYSTEM_PROMPT
        ).format(**format_system_prompt)

        return [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": self._template.render(
                    anchor=input["anchor"],
                    context=self.context if self.context else None,
                ),
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The outputs for the task are the `positive` and `negative` sentences, as well
        as the `model_name` used to generate the sentences."""
        columns = ["positive", "negative"] if self.triplet else ["positive"]
        columns += ["model_name"]
        return columns

    def format_output(
        self, output: Union[str, None], input: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        """Formats the output of the LLM, to extract the `positive` and `negative` sentences
        generated. If the output is `None` or the regex doesn't match, then the outputs
        will be set to `None` as well.

        Args:
            output: The output of the LLM.
            input: The input used to generate the output.

        Returns:
            The formatted output containing the `positive` and `negative` sentences.
        """
        if output is None:
            return {"positive": None, "negative": None}

        match = POSITIVE_NEGATIVE_PAIR_REGEX.match(output)
        if match is None:
            formatted_output = {"positive": None}
            if self.triplet:
                formatted_output["negative"] = None
            return formatted_output

        groups = match.groups()
        if self.triplet:
            return {
                "positive": groups[0].strip(),
                "negative": groups[1].strip()
                if len(groups) > 1 and groups[1] is not None
                else None,
            }

        return {"positive": groups[0].strip()}

inputs: List[str] property

The input for the task is the anchor sentence.

outputs: List[str] property

The outputs for the task are the positive sentence, the negative sentence (only if triplet=True), and the model_name used to generate the sentences.

format_input(input)

The inputs are formatted as a ChatType, with a system prompt describing the task of generating a positive and, if triplet=True, a negative sentence for the anchor sentence. The anchor is provided as the first user interaction in the conversation.

Parameters:

Name Type Description Default
input Dict[str, Any]

The input containing the anchor sentence.

required

Returns:

Type Description
ChatType

A list of dictionaries containing the system and user interactions.

Source code in src/distilabel/steps/tasks/sentence_transformers.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The inputs are formatted as a `ChatType`, with a system prompt describing the
    task of generating a positive and negative sentences for the anchor sentence. The
    anchor is provided as the first user interaction in the conversation.

    Args:
        input: The input containing the `anchor` sentence.

    Returns:
        A list of dictionaries containing the system and user interactions.
    """
    action_sentence = GENERATION_ACTION_SENTENCES[self.action]

    format_system_prompt = {
        "action_sentence": action_sentence,
        "context": CONTEXT_INTRO if self.context else "",
    }
    if self.triplet:
        format_system_prompt["negative_style"] = NEGATIVE_STYLE[
            "hard-negative" if self.hard_negative else "negative"
        ]

    system_prompt = (
        POSITIVE_NEGATIVE_SYSTEM_PROMPT if self.triplet else POSITIVE_SYSTEM_PROMPT
    ).format(**format_system_prompt)

    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": self._template.render(
                anchor=input["anchor"],
                context=self.context if self.context else None,
            ),
        },
    ]
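
The way the system prompt is assembled can be sketched without distilabel. In the example below, every prompt string is a placeholder (the wording of the real `GENERATION_ACTION_SENTENCES`, `CONTEXT_INTRO`, `NEGATIVE_STYLE` and system prompt constants differs), and plain string formatting stands in for the Jinja2 template, so treat it as an illustration of the control flow only.

```python
from typing import Dict, List

# Placeholder prompt pieces; the wording of distilabel's real constants differs.
ACTION_SENTENCES = {
    "paraphrase": "Paraphrase the anchor sentence.",
    "query": "Write a query that the anchor sentence answers.",
}
CONTEXT_INTRO = " Use the provided context."
NEGATIVE_STYLE = {
    "negative": "unrelated to the anchor",
    "hard-negative": "similar to the positive but not a correct match",
}
POSITIVE_PROMPT = "{action_sentence}{context}"
TRIPLET_PROMPT = "{action_sentence}{context} Also write a negative that is {negative_style}."


def build_chat(
    anchor: str,
    action: str,
    triplet: bool = False,
    hard_negative: bool = False,
    context: str = "",
) -> List[Dict[str, str]]:
    """Assemble the system and user messages following the listing above."""
    fields = {
        "action_sentence": ACTION_SENTENCES[action],
        "context": CONTEXT_INTRO if context else "",
    }
    if triplet:
        # The negative style is only needed by the triplet prompt.
        fields["negative_style"] = NEGATIVE_STYLE[
            "hard-negative" if hard_negative else "negative"
        ]
    system_prompt = (TRIPLET_PROMPT if triplet else POSITIVE_PROMPT).format(**fields)
    # Plain string formatting stands in for the Jinja2 template here.
    user = f"## Anchor\n{anchor}"
    if context:
        user = f"## Context\n{context}\n\n{user}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user},
    ]


chat = build_chat("How does 3D printing work?", "paraphrase", triplet=True, hard_negative=True)
```

Note that `hard_negative` only changes the prompt when `triplet=True`; with `triplet=False` the flag is ignored because no negative is requested.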

format_output(output, input=None)

Formats the output of the LLM, to extract the positive and negative sentences generated. If the output is None or the regex doesn't match, then the outputs will be set to None as well.

Parameters:

Name Type Description Default
output Union[str, None]

The output of the LLM.

required
input Optional[Dict[str, Any]]

The input used to generate the output.

None

Returns:

Type Description
Dict[str, Any]

The formatted output containing the positive and negative sentences.

Source code in src/distilabel/steps/tasks/sentence_transformers.py
def format_output(
    self, output: Union[str, None], input: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Formats the output of the LLM, to extract the `positive` and `negative` sentences
    generated. If the output is `None` or the regex doesn't match, then the outputs
    will be set to `None` as well.

    Args:
        output: The output of the LLM.
        input: The input used to generate the output.

    Returns:
        The formatted output containing the `positive` and `negative` sentences.
    """
    if output is None:
        return {"positive": None, "negative": None}

    match = POSITIVE_NEGATIVE_PAIR_REGEX.match(output)
    if match is None:
        formatted_output = {"positive": None}
        if self.triplet:
            formatted_output["negative"] = None
        return formatted_output

    groups = match.groups()
    if self.triplet:
        return {
            "positive": groups[0].strip(),
            "negative": groups[1].strip()
            if len(groups) > 1 and groups[1] is not None
            else None,
        }

    return {"positive": groups[0].strip()}
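
The parsing step can be tried in isolation with the sketch below. The pattern `PAIR_REGEX` and the `## Positive` / `## Negative` headings are assumptions made for this example; the actual `POSITIVE_NEGATIVE_PAIR_REGEX` is defined in distilabel and may differ. This sketch also returns only the columns listed in `outputs` for every failure case.

```python
import re
from typing import Any, Dict, Optional

# Hypothetical pattern: assumes the LLM answers under "## Positive" and
# "## Negative" headings. The negative part is optional.
PAIR_REGEX = re.compile(
    r"\s*## Positive\s*\n(.*?)(?:\n\s*## Negative\s*\n(.*))?\s*$",
    re.DOTALL,
)


def parse_pair(output: Optional[str], triplet: bool) -> Dict[str, Any]:
    """Extract the positive (and, for triplets, the negative) sentence."""
    keys = ["positive", "negative"] if triplet else ["positive"]
    if output is None:
        return {key: None for key in keys}

    match = PAIR_REGEX.match(output)
    if match is None:
        # The LLM did not follow the expected format.
        return {key: None for key in keys}

    positive, negative = match.groups()
    result: Dict[str, Any] = {"positive": positive.strip()}
    if triplet:
        result["negative"] = negative.strip() if negative is not None else None
    return result


raw = "## Positive\nHow do 3D printers build objects?\n\n## Negative\nWhat is the capital of France?"
parsed = parse_pair(raw, triplet=True)
```

Returning `None` values instead of raising keeps a batch alive when a single generation is malformed; the affected rows can be filtered out afterwards.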

load()

Loads the Jinja2 template.

Source code in src/distilabel/steps/tasks/sentence_transformers.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "generate-sentence-pair.jinja2"
    )

    self._template = Template(open(_path).read())

GenerateShortTextMatchingData

Bases: _EmbeddingDataGeneration

Generate short text matching data with an LLM to later on train an embedding model.

GenerateShortTextMatchingData is a Task that generates short text matching data with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.

Note

Ideally, this task should be used together with EmbeddingTaskGenerator configured with flatten_tasks=True and category="text-matching-short", so that the list of tasks generated by the LLM is flattened and each row contains a single task for the text-matching-short category.

Attributes:

Name Type Description
language str

The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.

seed int

The random seed to be set in case there's any sampling within the format_input method. Note that in this task the seed has no effect since there are no sampling params.

References

  • Improving Text Embeddings with Large Language Models (https://arxiv.org/abs/2401.00368)

Examples:

Generate synthetic short text matching data for training embedding models:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-matching-short",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateShortTextMatchingData(
        language="English",
        llm=...,  # LLM instance
    )

    task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateShortTextMatchingData(_EmbeddingDataGeneration):
    """Generate short text matching data with an `LLM` to later on train an embedding model.

    `GenerateShortTextMatchingData` is a `Task` that generates short text matching data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Note:
        Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`
        with the `category="text-matching-short"`; so that the `LLM` generates a list of tasks that
        are flattened so that each row contains a single task for the text-matching-short category.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.
            Note that in this task the `seed` has no effect since there are no sampling params.

    References:
        - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

    Examples:

        Generate synthetic short text matching data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData

        with Pipeline("my-pipeline") as pipeline:
            task = EmbeddingTaskGenerator(
                category="text-matching-short",
                flatten_tasks=True,
                llm=...,  # LLM instance
            )

            generate = GenerateShortTextMatchingData(
                language="English",
                llm=...,  # LLM instance
            )

            task >> generate
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    _template_name: str = PrivateAttr(default="short-text-matching")

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """Method to format the input based on the `task` and the provided attributes, or just
        randomly sampling those if not provided. This method will render the `_template` with
        the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
        there's only one turn, being from the user with the content being the rendered `_template`.

        Args:
            input: The input dictionary containing the `task` to be used in the `_template`.

        Returns:
            A list with a single chat containing the user's message with the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    task=input["task"],
                    language=self.language,
                ).strip(),
            }
        ]

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["input", "positive_document"]

keys: List[str] property

Contains the keys that will be parsed from the LLM output into a Python dict.
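
A minimal sketch of what parsing these keys from the LLM output could look like, assuming the LLM replies with a JSON object. The helper `parse_keys` and the sample strings are hypothetical; the formatter distilabel actually uses may handle more edge cases.

```python
import json
from typing import Any, Dict, List, Optional


def parse_keys(output: Optional[str], keys: List[str]) -> Dict[str, Any]:
    """Keep only the expected keys from a JSON object; fall back to None values."""
    empty = {key: None for key in keys}
    if output is None:
        return empty
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        # A JSON list or scalar cannot be mapped onto the expected keys.
        return empty
    return {key: data.get(key) for key in keys}


raw = '{"input": "cheap flights to Rome", "positive_document": "Low-cost airfare to Rome", "extra": 1}'
parsed = parse_keys(raw, ["input", "positive_document"])
```

Keys that the LLM adds on its own (`extra` above) are dropped, and keys it omits come back as `None`, so every row ends up with the same columns.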

format_input(input)

Method to format the input based on the task and the provided attributes, or just randomly sampling those if not provided. This method will render the _template with the provided arguments and return an OpenAI formatted chat i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.

Parameters:

Name Type Description Default
input Dict[str, Any]

The input dictionary containing the task to be used in the _template.

required

Returns:

Type Description
ChatType

A list with a single chat containing the user's message with the rendered _template.

Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """Method to format the input based on the `task` and the provided attributes, or just
    randomly sampling those if not provided. This method will render the `_template` with
    the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
    there's only one turn, being from the user with the content being the rendered `_template`.

    Args:
        input: The input dictionary containing the `task` to be used in the `_template`.

    Returns:
        A list with a single chat containing the user's message with the rendered `_template`.
    """
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                task=input["task"],
                language=self.language,
            ).strip(),
        }
    ]

GenerateTextClassificationData

Bases: _EmbeddingDataGeneration

Generate text classification data with an LLM to later on train an embedding model.

GenerateTextClassificationData is a Task that generates text classification data with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.

Note

Ideally, this task should be used together with EmbeddingTaskGenerator configured with flatten_tasks=True and category="text-classification", so that the list of tasks generated by the LLM is flattened and each row contains a single task for the text-classification category.

Attributes:

Name Type Description
language str

The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.

difficulty Optional[Literal['high school', 'college', 'PhD']]

The difficulty of the query to be generated, which can be high school, college, or PhD. Defaults to None, meaning that it will be randomly sampled.

clarity Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]

The clarity of the query to be generated, which can be clear, understandable with some effort, or ambiguous. Defaults to None, meaning that it will be randomly sampled.

seed int

The random seed to be set in case there's any sampling within the format_input method.

References

  • Improving Text Embeddings with Large Language Models (https://arxiv.org/abs/2401.00368)

Examples:

Generate synthetic text classification data for training embedding models:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-classification",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextClassificationData(
        language="English",
        difficulty="high school",
        clarity="clear",
        llm=...,  # LLM instance
    )

    task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateTextClassificationData(_EmbeddingDataGeneration):
    """Generate text classification data with an `LLM` to later on train an embedding model.

    `GenerateTextClassificationData` is a `Task` that generates text classification data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Note:
        Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`
        with the `category="text-classification"`; so that the `LLM` generates a list of tasks that
        are flattened so that each row contains a single task for the text-classification category.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        difficulty: The difficulty of the query to be generated, which can be `high school`, `college`, or `PhD`.
            Defaults to `None`, meaning that it will be randomly sampled.
        clarity: The clarity of the query to be generated, which can be `clear`, `understandable with some effort`,
            or `ambiguous`. Defaults to `None`, meaning that it will be randomly sampled.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.

    References:
        - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

    Examples:

        Generate synthetic text classification data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData

        with Pipeline("my-pipeline") as pipeline:
            task = EmbeddingTaskGenerator(
                category="text-classification",
                flatten_tasks=True,
                llm=...,  # LLM instance
            )

            generate = GenerateTextClassificationData(
                language="English",
                difficulty="high school",
                clarity="clear",
                llm=...,  # LLM instance
            )

            task >> generate
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    difficulty: Optional[Literal["high school", "college", "PhD"]] = None
    clarity: Optional[
        Literal["clear", "understandable with some effort", "ambiguous"]
    ] = None

    _template_name: str = PrivateAttr(default="text-classification")

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """Method to format the input based on the `task` and the provided attributes, or just
        randomly sampling those if not provided. This method will render the `_template` with
        the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
        there's only one turn, being from the user with the content being the rendered `_template`.

        Args:
            input: The input dictionary containing the `task` to be used in the `_template`.

        Returns:
            A list with a single chat containing the user's message with the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    task=input["task"],
                    language=self.language,
                    difficulty=self.difficulty
                    or random.choice(["high school", "college", "PhD"]),
                    clarity=self.clarity
                    or random.choice(
                        ["clear", "understandable with some effort", "ambiguous"]
                    ),
                ).strip(),
            }
        ]

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["input_text", "label", "misleading_label"]

keys: List[str] property

Contains the keys that will be parsed from the LLM output into a Python dict.

format_input(input)

Method to format the input based on the task and the provided attributes, or just randomly sampling those if not provided. This method will render the _template with the provided arguments and return an OpenAI formatted chat i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.

Parameters:

Name Type Description Default
input Dict[str, Any]

The input dictionary containing the task to be used in the _template.

required

Returns:

Type Description
ChatType

A list with a single chat containing the user's message with the rendered _template.

Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """Method to format the input based on the `task` and the provided attributes, or just
    randomly sampling those if not provided. This method will render the `_template` with
    the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
    there's only one turn, being from the user with the content being the rendered `_template`.

    Args:
        input: The input dictionary containing the `task` to be used in the `_template`.

    Returns:
        A list with a single chat containing the user's message with the rendered `_template`.
    """
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                task=input["task"],
                language=self.language,
                difficulty=self.difficulty
                or random.choice(["high school", "college", "PhD"]),
                clarity=self.clarity
                or random.choice(
                    ["clear", "understandable with some effort", "ambiguous"]
                ),
            ).strip(),
        }
    ]
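
The `value or random.choice(...)` fallback used in `format_input` can be tried in isolation. The sketch below is a minimal illustration rather than distilabel code: `resolve` and `DIFFICULTIES` are hypothetical names, and seeding the global RNG with `random.seed` is an assumption about how a seed would make the sampling reproducible.

```python
import random
from typing import List, Optional

DIFFICULTIES = ["high school", "college", "PhD"]


def resolve(value: Optional[str], choices: List[str]) -> str:
    """Return `value` if it is set, otherwise sample one of `choices` (hypothetical helper)."""
    return value or random.choice(choices)


# An explicitly provided attribute is always used as-is.
fixed = resolve("college", DIFFICULTIES)

# With the attribute left as None, the value is sampled; seeding the
# global RNG first makes the sampled value reproducible across runs.
random.seed(42)
first = resolve(None, DIFFICULTIES)
random.seed(42)
second = resolve(None, DIFFICULTIES)

print(fixed, first, second)
```

Because `or` tests truthiness, a falsy but valid value would also trigger sampling; none of the literal options listed for these attributes is falsy, so the shortcut is safe here.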

GenerateTextRetrievalData

Bases: _EmbeddingDataGeneration

Generate text retrieval data with an LLM to later on train an embedding model.

GenerateTextRetrievalData is a Task that generates text retrieval data with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.

Note

Ideally, use this task together with EmbeddingTaskGenerator configured with flatten_tasks=True and category="text-retrieval", so that the LLM generates a list of tasks and each row ends up containing a single task for the text-retrieval category.

Attributes:

Name Type Description
language str

The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.

query_type Optional[Literal['extremely long-tail', 'long-tail', 'common']]

The type of query to be generated, which can be extremely long-tail, long-tail, or common. Defaults to None, meaning that it will be randomly sampled.

query_length Optional[Literal['less than 5 words', '5 to 15 words', 'at least 10 words']]

The length of the query to be generated, which can be less than 5 words, 5 to 15 words, or at least 10 words. Defaults to None, meaning that it will be randomly sampled.

difficulty Optional[Literal['high school', 'college', 'PhD']]

The difficulty of the query to be generated, which can be high school, college, or PhD. Defaults to None, meaning that it will be randomly sampled.

clarity Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]

The clarity of the query to be generated, which can be clear, understandable with some effort, or ambiguous. Defaults to None, meaning that it will be randomly sampled.

num_words Optional[Literal[50, 100, 200, 300, 400, 500]]

The number of words in the query to be generated, which can be 50, 100, 200, 300, 400, or 500. Defaults to None, meaning that it will be randomly sampled.

seed int

The random seed to be set in case there's any sampling within the format_input method.

References
  • Improving Text Embeddings with Large Language Models (https://arxiv.org/abs/2401.00368)

Examples:

Generate synthetic text retrieval data for training embedding models:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-retrieval",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextRetrievalData(
        language="English",
        query_type="common",
        query_length="5 to 15 words",
        difficulty="high school",
        clarity="clear",
        num_words=100,
        llm=...,  # LLM instance
    )

    task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateTextRetrievalData(_EmbeddingDataGeneration):
    """Generate text retrieval data with an `LLM` to later on train an embedding model.

    `GenerateTextRetrievalData` is a `Task` that generates text retrieval data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Note:
        Ideally, use this task together with `EmbeddingTaskGenerator` configured with
        `flatten_tasks=True` and `category="text-retrieval"`, so that the `LLM` generates a list
        of tasks and each row ends up containing a single task for the text-retrieval category.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        query_type: The type of query to be generated, which can be `extremely long-tail`, `long-tail`,
            or `common`. Defaults to `None`, meaning that it will be randomly sampled.
        query_length: The length of the query to be generated, which can be `less than 5 words`, `5 to 15 words`,
            or `at least 10 words`. Defaults to `None`, meaning that it will be randomly sampled.
        difficulty: The difficulty of the query to be generated, which can be `high school`, `college`, or `PhD`.
            Defaults to `None`, meaning that it will be randomly sampled.
        clarity: The clarity of the query to be generated, which can be `clear`, `understandable with some effort`,
            or `ambiguous`. Defaults to `None`, meaning that it will be randomly sampled.
        num_words: The number of words in the query to be generated, which can be `50`, `100`, `200`, `300`, `400`, or `500`.
            Defaults to `None`, meaning that it will be randomly sampled.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.

    References:
        - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

    Examples:

        Generate synthetic text retrieval data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData

        with Pipeline("my-pipeline") as pipeline:
            task = EmbeddingTaskGenerator(
                category="text-retrieval",
                flatten_tasks=True,
                llm=...,  # LLM instance
            )

            generate = GenerateTextRetrievalData(
                language="English",
                query_type="common",
                query_length="5 to 15 words",
                difficulty="high school",
                clarity="clear",
                num_words=100,
                llm=...,  # LLM instance
            )

            task >> generate
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    query_type: Optional[Literal["extremely long-tail", "long-tail", "common"]] = None
    query_length: Optional[
        Literal["less than 5 words", "5 to 15 words", "at least 10 words"]
    ] = None
    difficulty: Optional[Literal["high school", "college", "PhD"]] = None
    clarity: Optional[
        Literal["clear", "understandable with some effort", "ambiguous"]
    ] = None
    num_words: Optional[Literal[50, 100, 200, 300, 400, 500]] = None

    _template_name: str = PrivateAttr(default="text-retrieval")

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """Method to format the input based on the `task` and the provided attributes, or just
        randomly sampling those if not provided. This method will render the `_template` with
        the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
        there's only one turn, being from the user with the content being the rendered `_template`.

        Args:
            input: The input dictionary containing the `task` to be used in the `_template`.

        Returns:
            A list with a single chat containing the user's message with the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    task=input["task"],
                    language=self.language,
                    query_type=self.query_type
                    or random.choice(["extremely long-tail", "long-tail", "common"]),
                    query_length=self.query_length
                    or random.choice(
                        ["less than 5 words", "5 to 15 words", "at least 10 words"]
                    ),
                    difficulty=self.difficulty
                    or random.choice(["high school", "college", "PhD"]),
                    clarity=self.clarity
                    or random.choice(
                        ["clear", "understandable with some effort", "ambiguous"]
                    ),
                    num_words=self.num_words
                    or random.choice([50, 100, 200, 300, 400, 500]),
                ).strip(),
            }
        ]

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return [
            "user_query",
            "positive_document",
            "hard_negative_document",
        ]

keys: List[str] property

Contains the keys that will be parsed from the LLM output into a Python dict.
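
How the keys are pulled out of the LLM output is not shown in this section. As a rough sketch, assuming the LLM returns a JSON object, the parsing step could look like the following; `parse_output` is a hypothetical helper, not the library's method.

```python
import json
from typing import Any, Dict, List, Optional

KEYS = ["user_query", "positive_document", "hard_negative_document"]


def parse_output(output: Optional[str], keys: List[str]) -> Dict[str, Any]:
    """Keep only the expected keys from a JSON-formatted LLM output (hypothetical helper)."""
    empty = {key: None for key in keys}
    if output is None:
        return empty
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        # Malformed output: fall back to None for every expected key.
        return empty
    if not isinstance(data, dict):
        return empty
    # Missing keys become None; unexpected keys are dropped.
    return {key: data.get(key) for key in keys}


raw = '{"user_query": "q", "positive_document": "p", "extra": 1}'
parsed = parse_output(raw, KEYS)
print(parsed)
```

Returning `None` for every key on failure keeps the row shape stable, so downstream steps can filter out failed generations instead of crashing.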

format_input(input)

Method to format the input based on the task and the provided attributes, or just randomly sampling those if not provided. This method will render the _template with the provided arguments and return an OpenAI formatted chat i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.

Parameters:

Name Type Description Default
input Dict[str, Any]

The input dictionary containing the task to be used in the _template.

required

Returns:

Type Description
ChatType

A list with a single chat containing the user's message with the rendered _template.

Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """Method to format the input based on the `task` and the provided attributes, or just
    randomly sampling those if not provided. This method will render the `_template` with
    the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
    there's only one turn, being from the user with the content being the rendered `_template`.

    Args:
        input: The input dictionary containing the `task` to be used in the `_template`.

    Returns:
        A list with a single chat containing the user's message with the rendered `_template`.
    """
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                task=input["task"],
                language=self.language,
                query_type=self.query_type
                or random.choice(["extremely long-tail", "long-tail", "common"]),
                query_length=self.query_length
                or random.choice(
                    ["less than 5 words", "5 to 15 words", "at least 10 words"]
                ),
                difficulty=self.difficulty
                or random.choice(["high school", "college", "PhD"]),
                clarity=self.clarity
                or random.choice(
                    ["clear", "understandable with some effort", "ambiguous"]
                ),
                num_words=self.num_words
                or random.choice([50, 100, 200, 300, 400, 500]),
            ).strip(),
        }
    ]

Genstruct

Bases: Task

Generate a pair of instruction-response from a document using an LLM.

Genstruct is a pre-defined task designed to generate valid instructions from a given raw document, with the title and the content, enabling the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is inspired by the Ada-Instruct paper.

Note

The Genstruct prompt, i.e. the task, can be used with any model, but the recommended option is to provide NousResearch/Genstruct-7B as the LLM, since it was trained for this specific task.

Attributes:

Name Type Description
_template Union[Template, None]

a Jinja2 template used to format the input for the LLM.

Input columns
  • title (str): The title of the document.
  • content (str): The content of the document.
Output columns
  • user (str): The user's instruction based on the document.
  • assistant (str): The assistant's response based on the user's instruction.
  • model_name (str): The model name used to generate the user instruction and the assistant response.
Categories
  • text-generation
  • instruction
  • response
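References
  • Genstruct 7B by Nous Research (https://huggingface.co/NousResearch/Genstruct-7B)
  • Ada-Instruct: Adapting Instruction Generators for Complex Reasoning (https://arxiv.org/abs/2310.04484)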

Examples:

Generate instructions from raw documents using the title and content:

```python
from distilabel.steps.tasks import Genstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
genstruct = Genstruct(
    llm=InferenceEndpointsLLM(
        model_id="NousResearch/Genstruct-7B",
    ),
)

genstruct.load()

result = next(
    genstruct.process(
        [
            {"title": "common instruction", "content": "content of the document"},
        ]
    )
)
# result
# [
#     {
#         'title': 'common instruction',
#         'content': 'content of the document',
#         'model_name': 'test',
#         'user': 'An instruction',
#         'assistant': 'content of the document',
#     }
# ]
```

Citations:

```
@misc{cui2023adainstructadaptinginstructiongenerators,
    title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},
    author={Wanyun Cui and Qianle Wang},
    year={2023},
    eprint={2310.04484},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2310.04484},
}
```
Source code in src/distilabel/steps/tasks/genstruct.py
class Genstruct(Task):
    """Generate a pair of instruction-response from a document using an `LLM`.

    `Genstruct` is a pre-defined task designed to generate valid instructions from a given raw document,
    with the title and the content, enabling the creation of new, partially synthetic instruction finetuning
    datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is
    inspired by the Ada-Instruct paper.

    Note:
        The Genstruct prompt, i.e. the task, can be used with any model, but the recommended
        option is to provide `NousResearch/Genstruct-7B` as the LLM, since it was trained
        for this specific task.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - title (`str`): The title of the document.
        - content (`str`): The content of the document.

    Output columns:
        - user (`str`): The user's instruction based on the document.
        - assistant (`str`): The assistant's response based on the user's instruction.
        - model_name (`str`): The model name used to generate the `user` instruction and the `assistant` response.

    Categories:
        - text-generation
        - instruction
        - response

    References:
        - [Genstruct 7B by Nous Research](https://huggingface.co/NousResearch/Genstruct-7B)
        - [Ada-Instruct: Adapting Instruction Generators for Complex Reasoning](https://arxiv.org/abs/2310.04484)

    Examples:

        Generate instructions from raw documents using the title and content:

        ```python
        from distilabel.steps.tasks import Genstruct
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        genstruct = Genstruct(
            llm=InferenceEndpointsLLM(
                model_id="NousResearch/Genstruct-7B",
            ),
        )

        genstruct.load()

        result = next(
            genstruct.process(
                [
                    {"title": "common instruction", "content": "content of the document"},
                ]
            )
        )
        # result
        # [
        #     {
        #         'title': 'common instruction',
        #         'content': 'content of the document',
        #         'model_name': 'test',
        #         'user': 'An instruction',
        #         'assistant': 'content of the document',
        #     }
        # ]
        ```

    Citations:

        ```
        @misc{cui2023adainstructadaptinginstructiongenerators,
            title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},
            author={Wanyun Cui and Qianle Wang},
            year={2023},
            eprint={2310.04484},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2310.04484},
        }
        ```
    """

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "genstruct.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are the `title` and the `content`."""
        return ["title", "content"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    title=input["title"], content=input["content"]
                ),
            }
        ]

    @property
    def outputs(self) -> List[str]:
        """The outputs for the task are the `user` instruction based on the provided document
        and the `assistant` response based on the user's instruction."""
        return ["user", "assistant", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted so that both the user and the assistant messages are
        captured.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the keys `user` and `assistant` containing the content for each role.
        """
        if output is None:
            return {"user": None, "assistant": None}

        matches = re.search(_PARSE_GENSTRUCT_OUTPUT_REGEX, output, re.DOTALL)
        if not matches:
            return {"user": None, "assistant": None}

        return {
            "user": matches.group(1).strip(),
            "assistant": matches.group(2).strip(),
        }

inputs: List[str] property

The inputs for the task are the title and the content.

outputs: List[str] property

The outputs for the task are the user instruction based on the provided document and the assistant response based on the user's instruction.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/genstruct.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                title=input["title"], content=input["content"]
            ),
        }
    ]

format_output(output, input)

The output is formatted so that both the user and the assistant messages are captured.

Parameters:

Name Type Description Default
output Union[str, None]

the raw output of the LLM.

required
input Dict[str, Any]

the input to the task. Used for obtaining the number of responses.

required

Returns:

Type Description
Dict[str, Any]

A dict with the keys user and assistant containing the content for each role.
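
The value of `_PARSE_GENSTRUCT_OUTPUT_REGEX` is not shown in this section. The sketch below therefore uses an assumed stand-in pattern that splits the output on a `[[[Assistant]]]` marker, only to illustrate how `re.search` with two capture groups yields the `user` and `assistant` parts; `ASSUMED_PATTERN` and `split_turns` are hypothetical names.

```python
import re
from typing import Dict, Optional

# Assumed stand-in pattern; the library's actual regex is not shown here.
ASSUMED_PATTERN = r"(.+?)\[\[\[Assistant\]\]\](.*)$"


def split_turns(output: Optional[str]) -> Dict[str, Optional[str]]:
    """Split a raw completion into the user and assistant parts (hypothetical helper)."""
    if output is None:
        return {"user": None, "assistant": None}
    # re.DOTALL lets `.` match newlines, so multi-line turns are captured whole.
    match = re.search(ASSUMED_PATTERN, output, re.DOTALL)
    if not match:
        return {"user": None, "assistant": None}
    return {"user": match.group(1).strip(), "assistant": match.group(2).strip()}


example = "What is the capital of France?\n[[[Assistant]]] Paris."
turns = split_turns(example)
print(turns)
```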

Source code in src/distilabel/steps/tasks/genstruct.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted so that both the user and the assistant messages are
    captured.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the keys `user` and `assistant` containing the content for each role.
    """
    if output is None:
        return {"user": None, "assistant": None}

    matches = re.search(_PARSE_GENSTRUCT_OUTPUT_REGEX, output, re.DOTALL)
    if not matches:
        return {"user": None, "assistant": None}

    return {
        "user": matches.group(1).strip(),
        "assistant": matches.group(2).strip(),
    }

load()

Loads the Jinja2 template.

Source code in src/distilabel/steps/tasks/genstruct.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "genstruct.jinja2"
    )

    self._template = Template(open(_path).read())
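
The path lookup in `load()` can be tried against any installed package. The sketch below assumes `importlib_resources` is an alias for the standard library's `importlib.resources`, and it reads a file from the stdlib `json` package instead of distilabel's templates so that it runs without distilabel installed.

```python
import importlib.resources as importlib_resources

# `files()` returns a traversable rooted at the package directory; the
# `/` operator joins path segments, as in `load()` above.
resource = importlib_resources.files("json") / "decoder.py"

# Reading through a context manager closes the file handle deterministically.
with resource.open("r", encoding="utf-8") as f:
    source = f.read()

print(len(source) > 0)
```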

InstructionBacktranslation

Bases: Task

Self-Alignment with Instruction Backtranslation.

Attributes:

Name Type Description
_template Optional[Template]

the Jinja2 template to use for the Instruction Backtranslation task.

Input columns
  • instruction (str): The reference instruction to evaluate the text output.
  • generation (str): The text output to evaluate for the given instruction.
Output columns
  • score (int): The score for the generation based on the given instruction.
  • reason (str): The reason for the provided score.
  • model_name (str): The model name used to score the generation.
Categories
  • critique
References
  • Self-Alignment with Instruction Backtranslation (https://arxiv.org/abs/2308.06259)

Citations:

```
@misc{li2024selfalignmentinstructionbacktranslation,
    title={Self-Alignment with Instruction Backtranslation},
    author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},
    year={2024},
    eprint={2308.06259},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2308.06259},
}
```
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
class InstructionBacktranslation(Task):
    """Self-Alignment with Instruction Backtranslation.

    Attributes:
        _template: the Jinja2 template to use for the Instruction Backtranslation task.

    Input columns:
        - instruction (`str`): The reference instruction to evaluate the text output.
        - generation (`str`): The text output to evaluate for the given instruction.

    Output columns:
        - score (`int`): The score for the generation based on the given instruction.
        - reason (`str`): The reason for the provided score.
        - model_name (`str`): The model name used to score the generation.

    Categories:
        - critique

    References:
        - [`Self-Alignment with Instruction Backtranslation`](https://arxiv.org/abs/2308.06259)

    Citations:

        ```
        @misc{li2024selfalignmentinstructionbacktranslation,
            title={Self-Alignment with Instruction Backtranslation},
            author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},
            year={2024},
            eprint={2308.06259},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2308.06259},
        }
        ```
    """

    _template: Optional["Template"] = PrivateAttr(default=...)

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "instruction-backtranslation.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`, and the `generation` for it."""
        return ["instruction", "generation"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    instruction=input["instruction"], generation=input["generation"]
                ),
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `score`, `reason` and the `model_name`."""
        return ["score", "reason", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `score` and `reason`. The
        `model_name` will be automatically included within the `process` method of `Task`.

        Args:
            output: a string representing the output of the LLM via the `process` method.
            input: the input to the task, as required by some tasks to format the output.

        Returns:
            A dictionary containing the `score` and the `reason` for the provided `score`.
        """
        pattern = r"(.+?)Score: (\d)"

        matches = None
        if output is not None:
            matches = re.findall(pattern, output, re.DOTALL)
        # `re.findall` returns an empty list (never `None`) when nothing matches.
        if not matches:
            return {"score": None, "reason": None}

        return {
            "score": int(matches[0][1]),
            "reason": matches[0][0].strip(),
        }

inputs: List[str] property

The input for the task is the instruction, and the generation for it.

outputs: List[str] property

The output for the task is the score, reason and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                instruction=input["instruction"], generation=input["generation"]
            ),
        },
    ]

format_output(output, input)

The output is formatted as a dictionary with the score and reason. The model_name will be automatically included within the process method of Task.

Parameters:

Name Type Description Default
output Union[str, None]

a string representing the output of the LLM via the process method.

required
input Dict[str, Any]

the input to the task, as required by some tasks to format the output.

required

Returns:

Type Description
Dict[str, Any]

A dictionary containing the score and the reason for the provided score.
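
The parsing step can be checked in isolation with the same pattern. This is a small sketch with made-up LLM outputs; note that `re.findall` returns an empty list when nothing matches, so an emptiness check is what guards the `matches[0]` lookup.

```python
import re

pattern = r"(.+?)Score: (\d)"

# Made-up LLM output that ends with a single-digit score.
output = "The answer is accurate and well structured.\nScore: 4"
matches = re.findall(pattern, output, re.DOTALL)

# With two capture groups, each match is a (reason, score) tuple.
reason, score = matches[0][0].strip(), int(matches[0][1])

# When the pattern is absent, `re.findall` returns an empty list, not None.
no_matches = re.findall(pattern, "No rating given.", re.DOTALL)

print(reason, score, no_matches)
```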

Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `score` and `reason`. The
    `model_name` will be automatically included within the `process` method of `Task`.

    Args:
        output: a string representing the output of the LLM via the `process` method.
        input: the input to the task, as required by some tasks to format the output.

    Returns:
        A dictionary containing the `score` and the `reason` for the provided `score`.
    """
    pattern = r"(.+?)Score: (\d)"

    matches = None
    if output is not None:
        matches = re.findall(pattern, output, re.DOTALL)
    # `re.findall` returns an empty list (never `None`) when nothing matches.
    if not matches:
        return {"score": None, "reason": None}

    return {
        "score": int(matches[0][1]),
        "reason": matches[0][0].strip(),
    }

load()

Loads the Jinja2 template.

Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "instruction-backtranslation.jinja2"
    )

    self._template = Template(open(_path).read())

Magpie

Bases: Task, MagpieBase

Generates conversations using an instruct fine-tuned LLM.

Magpie is a method for generating user instructions with no seed data or specific system prompt, relying on the autoregressive behaviour of instruct fine-tuned LLMs. Because these models were fine-tuned with a chat template composed of a user message followed by the desired assistant output, they learn that an instruction follows the pre-query (or pre-instruct) tokens. If only the pre-query tokens are sent to the LLM, without any user message, the LLM continues generating tokens as if it were the user. This trick makes it possible to "extract" instructions from the instruct fine-tuned LLM. Once an instruction has been generated, it can be sent back to the LLM to generate an assistant response. Repeating this process N times builds a multi-turn conversation. The method is described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
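
The pre-query trick can be sketched without a real model. In the sketch below, the Llama 3 style header strings, `fake_llm`, and `magpie_turn` are all illustrative assumptions rather than distilabel's implementation; `fake_llm` stands in for any raw text-completion call.

```python
from typing import Callable, Dict, List

# Assumed chat-template fragments (Llama 3 style), for illustration only.
PRE_QUERY = "<|start_header_id|>user<|end_header_id|>\n\n"
PRE_RESPONSE = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"


def fake_llm(prompt: str) -> str:
    """Stand-in completion function: 'continues' the prompt deterministically."""
    if prompt.endswith(PRE_QUERY):
        return "How do I reverse a list in Python?"
    return "Use `reversed(items)` or `items[::-1]`."


def magpie_turn(llm: Callable[[str], str], history: str = "") -> List[Dict[str, str]]:
    """Generate one user/assistant turn (hypothetical helper)."""
    # 1. Send only the pre-query tokens: the model writes the user message.
    instruction = llm(history + PRE_QUERY)
    # 2. Send the instruction back, followed by the assistant header, to get the response.
    response = llm(history + PRE_QUERY + instruction + PRE_RESPONSE)
    return [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": response},
    ]


conversation = magpie_turn(fake_llm)
print(conversation)
```

For a multi-turn conversation, the rendered text of the previous turns would be passed as `history` on each subsequent call.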

Attributes:

Name Type Description
n_turns

the number of turns that the generated conversation will have. Defaults to 1.

end_with_user

whether the conversation should end with a user message. Defaults to False.

include_system_prompt

whether to include the system prompt used in the generated conversation. Defaults to False.

only_instruction

whether to generate only the instruction. If this argument is True, then n_turns will be ignored. Defaults to False.

system_prompt

an optional system prompt or list of system prompts that can be used to steer the LLM to generate content on a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a system_prompt column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to None.

Runtime parameters
  • n_turns: the number of turns that the generated conversation will have. Defaults to 1.
  • end_with_user: whether the conversation should end with a user message. Defaults to False.
  • include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False.
  • only_instruction: whether to generate only the instruction. If this argument is True, then n_turns will be ignored. Defaults to False.
  • system_prompt: an optional system prompt or list of system prompts that can be used to steer the LLM to generate content on a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a system_prompt column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to None.
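
The precedence rules for `system_prompt` can be sketched as follows. This is a simplified illustration, not the library's code: `resolve_system_prompts` is a hypothetical helper, and it applies the column override row by row, which is an assumption about how the column is consulted.

```python
import random
from typing import Any, Dict, List, Optional, Union


def resolve_system_prompts(
    batch: List[Dict[str, Any]],
    system_prompt: Optional[Union[str, List[str]]] = None,
    rng: Optional[random.Random] = None,
) -> List[Optional[str]]:
    """Hypothetical helper: pick the system prompt used for each row of a batch."""
    rng = rng or random.Random()
    # One random draw per batch when a list of prompts is configured.
    if isinstance(system_prompt, list):
        batch_prompt = rng.choice(system_prompt)
    else:
        batch_prompt = system_prompt
    # A `system_prompt` value coming from the inputs wins over the runtime parameter.
    return [row.get("system_prompt", batch_prompt) for row in batch]


print(resolve_system_prompts([{"system_prompt": "from column"}, {}], "from parameter"))
```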
Input columns
  • system_prompt (str, optional): an optional system prompt that can be provided to guide the generation of the instruct LLM and steer it to generate instructions on a certain topic.
Output columns
  • conversation (ChatType): the generated conversation which is a list of chat items with a role and a message. Only if only_instruction=False.
  • instruction (str): the generated instructions if only_instruction=True or n_turns==1.
  • response (str): the generated response if n_turns==1.
  • model_name (str): The model name used to generate the conversation or instruction.
Categories
  • text-generation
  • instruction
References
  • Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing (https://arxiv.org/abs/2406.08464)

Examples:

Generating instructions with Llama 3 8B Instruct and TransformersLLM:

```python
from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import Magpie

magpie = Magpie(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 64,
        },
        device="mps",
    ),
    only_instruction=True,
)

magpie.load()

result = next(
    magpie.process(
        inputs=[
            {
                "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
            },
            {
                "system_prompt": "You're an expert florist AI assistant that helps users to eradicate pests in their crops."
            },
        ]
    )
)
# [
#     {'instruction': "That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?"},
#     {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}
# ]
```

Generating conversations with Llama 3 8B Instruct and TransformersLLM:

```python
from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import Magpie

magpie = Magpie(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 256,
        },
        device="mps",
    ),
    n_turns=2,
)

magpie.load()

result = next(
    magpie.process(
        inputs=[
            {
                "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
            },
            {
                "system_prompt": "You're an expert florist AI assistant that helps users to eradicate pests in their crops."
            },
        ]
    )
)
# [
#     {
#         'conversation': [
#             {'role': 'system', 'content': "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."},
#             {
#                 'role': 'user',
#                 'content': 'I\'m having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x→a f(x) or lim x→a [f(x)]. It is read as "the limit as x approaches a of f
# of x".'
#             },
#             {
#                 'role': 'assistant',
#                 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don\'t worry, I\'m here to help! The notation lim x→a f(x) indeed means "the limit as x approaches a of f of
# x". What it\'s asking us to do is find the'
#             }
#         ]
#     },
#     {
#         'conversation': [
#             {'role': 'system', 'content': "You're an expert florist AI assistant that helps users to eradicate pests in their crops."},
#             {
#                 'role': 'user',
#                 'content': "As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it
# might be pests or diseases, but I'm not sure which."
#             },
#             {
#                 'role': 'assistant',
#                 'content': "I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.
# **Aphids**: These small, soft-bodied insects can secrete a sticky substance called"
#             }
#         ]
#     }
# ]
```
Source code in src/distilabel/steps/tasks/magpie/base.py
class Magpie(Task, MagpieBase):
    """Generates conversations using an instruct fine-tuned LLM.

    Magpie is a neat method that allows generating user instructions with no seed data
    or specific system prompt thanks to the autoregressive capabilities of the instruct
    fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message
    and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query
    or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the
    LLM without any user message, then the LLM will continue generating tokens as if it was
    the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM.
    After this instruct is generated, it can be sent again to the LLM to generate this time
    an assistant response. This process can be repeated N times allowing to build a multi-turn
    conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from
    Scratch by Prompting Aligned LLMs with Nothing'.

    Attributes:
        n_turns: the number of turns that the generated conversation will have.
            Defaults to `1`.
        end_with_user: whether the conversation should end with a user message.
            Defaults to `False`.
        include_system_prompt: whether to include the system prompt used in the generated
            conversation. Defaults to `False`.
        only_instruction: whether to generate only the instruction. If this argument is
            `True`, then `n_turns` will be ignored. Defaults to `False`.
        system_prompt: an optional system prompt or list of system prompts that can
            be used to steer the LLM to generate content on a certain topic, guide the style,
            etc. If it's a list of system prompts, then a random system prompt will be chosen
            per input/output batch. If the provided inputs contain a `system_prompt` column,
            then this runtime parameter will be ignored and the one from the column will
            be used. Defaults to `None`.

    Runtime parameters:
        - `n_turns`: the number of turns that the generated conversation will have. Defaults
            to `1`.
        - `end_with_user`: whether the conversation should end with a user message.
            Defaults to `False`.
        - `include_system_prompt`: whether to include the system prompt used in the generated
            conversation. Defaults to `False`.
        - `only_instruction`: whether to generate only the instruction. If this argument is
            `True`, then `n_turns` will be ignored. Defaults to `False`.
        - `system_prompt`: an optional system prompt or list of system prompts that can
            be used to steer the LLM to generate content on a certain topic, guide the style,
            etc. If it's a list of system prompts, then a random system prompt will be chosen
            per input/output batch. If the provided inputs contain a `system_prompt` column,
            then this runtime parameter will be ignored and the one from the column will
            be used. Defaults to `None`.

    Input columns:
        - system_prompt (`str`, optional): an optional system prompt that can be provided
            to guide the generation of the instruct LLM and steer it to generate instructions
            on a certain topic.

    Output columns:
        - conversation (`ChatType`): the generated conversation which is a list of chat
            items with a role and a message. Only if `only_instruction=False`.
        - instruction (`str`): the generated instructions if `only_instruction=True` or `n_turns==1`.
        - response (`str`): the generated response if `n_turns==1`.
        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.

    Categories:
        - text-generation
        - instruction

    References:
        - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)

    Examples:

        Generating instructions with Llama 3 8B Instruct and TransformersLLM:

        ```python
        from distilabel.llms import TransformersLLM
        from distilabel.steps.tasks import Magpie

        magpie = Magpie(
            llm=TransformersLLM(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 1.0,
                    "max_new_tokens": 64,
                },
                device="mps",
            ),
            only_instruction=True,
        )

        magpie.load()

        result = next(
            magpie.process(
                inputs=[
                    {
                        "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
                    },
                    {
                        "system_prompt": "You're an expert florist AI assistant that helps users to eradicate pests in their crops."
                    },
                ]
            )
        )
        # [
        #     {'instruction': "That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?"},
        #     {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}
        # ]
        ```

        Generating conversations with Llama 3 8B Instruct and TransformersLLM:

        ```python
        from distilabel.llms import TransformersLLM
        from distilabel.steps.tasks import Magpie

        magpie = Magpie(
            llm=TransformersLLM(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 1.0,
                    "max_new_tokens": 256,
                },
                device="mps",
            ),
            n_turns=2,
        )

        magpie.load()

        result = next(
            magpie.process(
                inputs=[
                    {
                        "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
                    },
                    {
                        "system_prompt": "You're an expert florist AI assistant that helps users to eradicate pests in their crops."
                    },
                ]
            )
        )
        # [
        #     {
        #         'conversation': [
        #             {'role': 'system', 'content': "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."},
        #             {
        #                 'role': 'user',
        #                 'content': 'I\'m having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x→a f(x) or lim x→a [f(x)]. It is read as "the limit as x approaches a of f
        # of x".'
        #             },
        #             {
        #                 'role': 'assistant',
        #                 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don\'t worry, I\'m here to help! The notation lim x→a f(x) indeed means "the limit as x approaches a of f of
        # x". What it\'s asking us to do is find the'
        #             }
        #         ]
        #     },
        #     {
        #         'conversation': [
        #             {'role': 'system', 'content': "You're an expert florist AI assistant that helps users to eradicate pests in their crops."},
        #             {
        #                 'role': 'user',
        #                 'content': "As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it
        # might be pests or diseases, but I'm not sure which."
        #             },
        #             {
        #                 'role': 'assistant',
        #                 'content': "I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.
        # **Aphids**: These small, soft-bodied insects can secrete a sticky substance called"
        #             }
        #         ]
        #     }
        # ]
        ```
    """

    def model_post_init(self, __context: Any) -> None:
        """Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`."""
        super().model_post_init(__context)

        if not isinstance(self.llm, MagpieChatTemplateMixin):
            raise ValueError(
                "`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`. "
                f"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin."
            )

        self.llm.use_magpie_template = True

    @property
    def inputs(self) -> List[str]:
        return []

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """Does nothing."""
        return []

    @property
    def outputs(self) -> List[str]:
        """Either a multi-turn conversation or the instruction generated."""
        if self.only_instruction:
            return ["instruction", "model_name"]
        if self.n_turns == 1:
            return ["instruction", "response", "model_name"]
        return ["conversation", "model_name"]

    def format_output(
        self,
        output: Union[str, None],
        input: Union[Dict[str, Any], None] = None,
    ) -> Dict[str, Any]:
        """Does nothing."""
        return {}

    def process(self, inputs: StepInput) -> "StepOutput":
        """Generate a list of instructions or conversations of the specified number of turns.

        Args:
            inputs: a list of dictionaries that can contain a `system_prompt` key.

        Yields:
            The list of generated conversations.
        """
        yield self._generate_with_pre_query_template(inputs)

outputs: List[str] property

Either a multi-turn conversation or the instruction generated.
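
How the output columns depend on the configuration can be restated as a plain function. This is a sketch for illustration: `magpie_output_columns` is a hypothetical name that mirrors the branching of the `outputs` property shown in the class source.

```python
from typing import List


def magpie_output_columns(only_instruction: bool = False, n_turns: int = 1) -> List[str]:
    """Hypothetical mirror of the `outputs` property as a standalone function."""
    if only_instruction:
        # `n_turns` is ignored when only the instruction is requested.
        return ["instruction", "model_name"]
    if n_turns == 1:
        return ["instruction", "response", "model_name"]
    return ["conversation", "model_name"]


print(magpie_output_columns(only_instruction=True, n_turns=3))
print(magpie_output_columns(n_turns=1))
print(magpie_output_columns(n_turns=2))
```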

format_input(input)

Does nothing.

Source code in src/distilabel/steps/tasks/magpie/base.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """Does nothing."""
    return []

format_output(output, input=None)

Does nothing.

Source code in src/distilabel/steps/tasks/magpie/base.py
def format_output(
    self,
    output: Union[str, None],
    input: Union[Dict[str, Any], None] = None,
) -> Dict[str, Any]:
    """Does nothing."""
    return {}

model_post_init(__context)

Checks that the provided LLM uses the MagpieChatTemplateMixin.

Source code in src/distilabel/steps/tasks/magpie/base.py
def model_post_init(self, __context: Any) -> None:
    """Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`."""
    super().model_post_init(__context)

    if not isinstance(self.llm, MagpieChatTemplateMixin):
        raise ValueError(
            "`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`. "
            f"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin."
        )

    self.llm.use_magpie_template = True
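
The validate-then-enable pattern used here can be shown with stand-in classes. All names in this sketch (`ChatTemplateMixin`, `TemplatedLLM`, `PlainLLM`, `enable_magpie`) are hypothetical and exist only for the illustration; they are not distilabel classes.

```python
class ChatTemplateMixin:
    """Hypothetical stand-in for `MagpieChatTemplateMixin`."""

    use_magpie_template: bool = False


class TemplatedLLM(ChatTemplateMixin):
    """Hypothetical LLM that supports the pre-query template."""


class PlainLLM:
    """Hypothetical LLM without the mixin."""


def enable_magpie(llm: object) -> object:
    # Same shape as `model_post_init`: validate first, then flip the flag.
    if not isinstance(llm, ChatTemplateMixin):
        raise ValueError(
            f"`{llm.__class__.__name__}` doesn't use the chat template mixin."
        )
    llm.use_magpie_template = True
    return llm


print(enable_magpie(TemplatedLLM()).use_magpie_template)  # True

try:
    enable_magpie(PlainLLM())
except ValueError as error:
    print(error)
```

Checking with `isinstance` at construction time means an incompatible LLM fails immediately rather than later during generation.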

process(inputs)

Generate a list of instructions or conversations of the specified number of turns.

Parameters:

Name Type Description Default
inputs StepInput

a list of dictionaries that can contain a system_prompt key.

required

Yields:

Type Description
StepOutput

The list of generated conversations.

Source code in src/distilabel/steps/tasks/magpie/base.py
def process(self, inputs: StepInput) -> "StepOutput":
    """Generate a list of instructions or conversations of the specified number of turns.

    Args:
        inputs: a list of dictionaries that can contain a `system_prompt` key.

    Yields:
        The list of generated conversations.
    """
    yield self._generate_with_pre_query_template(inputs)

MagpieGenerator

Bases: GeneratorTask, MagpieBase

Generator task that generates instructions or conversations using Magpie.

Magpie is a method for generating user instructions without seed data or a specific system prompt, relying on the autoregressive capabilities of instruct fine-tuned LLMs. Because these models were fine-tuned with a chat template composed of a user message and a desired assistant output, they learn that an instruction follows the pre-query (or pre-instruct) tokens. If these pre-query tokens are sent to the LLM without any user message, the LLM continues generating tokens as if it were the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM. Once an instruction has been generated, it can be sent back to the LLM to generate an assistant response. Repeating this process N times builds a multi-turn conversation. The method is described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.

Attributes:

Name Type Description
n_turns

the number of turns that the generated conversation will have. Defaults to 1.

end_with_user

whether the conversation should end with a user message. Defaults to False.

include_system_prompt

whether to include the system prompt used in the generated conversation. Defaults to False.

only_instruction

whether to generate only the instruction. If this argument is True, then n_turns will be ignored. Defaults to False.

system_prompt

an optional system prompt or list of system prompts that can be used to steer the LLM to generate content on a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a system_prompt column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to None.

num_rows RuntimeParameter[int]

the number of rows to be generated.

Runtime parameters
  • n_turns: the number of turns that the generated conversation will have. Defaults to 1.
  • end_with_user: whether the conversation should end with a user message. Defaults to False.
  • include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False.
  • only_instruction: whether to generate only the instruction. If this argument is True, then n_turns will be ignored. Defaults to False.
  • system_prompt: an optional system prompt or list of system prompts that can be used to steer the LLM to generate content on a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a system_prompt column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to None.
  • num_rows: the number of rows to be generated.
Output columns
  • conversation (ChatType): the generated conversation which is a list of chat items with a role and a message.
  • instruction (str): the generated instructions if only_instruction=True.
  • response (str): the generated response if n_turns==1.
  • model_name (str): The model name used to generate the conversation or instruction.
Categories
  • text-generation
  • instruction
  • generator
References
  • Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing (https://arxiv.org/abs/2406.08464)

Examples:

Generating instructions with Llama 3 8B Instruct and TransformersLLM:

```python
from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator

generator = MagpieGenerator(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 256,
        },
        device="mps",
    ),
    only_instruction=True,
    num_rows=5,
)

generator.load()

result = next(generator.process())
# (
#       [
#           {"instruction": "I've just bought a new phone and I're excited to start using it."},
#           {"instruction": "What are the most common types of companies that use digital signage?"}
#       ],
#       True
# )
```
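
The commented output above is a tuple of a list of rows and a boolean. A minimal sketch of a generator with that shape follows, assuming the boolean marks the last batch; `generate_rows` and its `batch_size` argument are hypothetical, and the real batching in `MagpieGenerator` may differ.

```python
from typing import Any, Dict, Iterator, List, Tuple


def generate_rows(
    num_rows: int, batch_size: int
) -> Iterator[Tuple[List[Dict[str, Any]], bool]]:
    """Hypothetical generator yielding `(rows, is_last_batch)` tuples."""
    generated = 0
    while generated < num_rows:
        size = min(batch_size, num_rows - generated)
        # Placeholder rows; a real task would ask the LLM for each instruction.
        rows = [{"instruction": f"instruction {generated + i}"} for i in range(size)]
        generated += size
        yield rows, generated >= num_rows


for rows, is_last in generate_rows(num_rows=5, batch_size=2):
    print(len(rows), is_last)
```

Calling `next(...)` on such a generator, as the examples do, retrieves only the first batch.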

Generating a conversation with Llama 3 8B Instruct and TransformersLLM:

```python
from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator

generator = MagpieGenerator(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 64,
        },
        device="mps",
    ),
    n_turns=3,
    num_rows=5,
)

generator.load()

result = next(generator.process())
# (
#     [
#         {
#             'conversation': [
#                 {
#                     'role': 'system',
#                     'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
#                 },
#                 {'role': 'user', 'content': "I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?"},
#                 {
#                     'role': 'assistant',
#                     'content': "Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,
# let's break down the basics. First, we need to identify your goals and target audience. What do"
#                 },
#                 {
#                     'role': 'user',
#                     'content': "Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main
# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating"
#                 },
#                 {
#                     'role': 'assistant',
#                     'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or
# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'
#                 }
#             ]
#         },
#         {
#             'conversation': [
#                 {
#                     'role': 'system',
#                     'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
#                 },
#                 {'role': 'user', 'content': "I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!"},
#                 {
#                     'role': 'assistant',
#                     'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.
# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'
#                 },
#                 {
#                     'role': 'user',
#                     'content': 'Let me stop you there. Let\'s explore this "purpose" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I're primarily using my
# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'
#                 },
#                 {
#                     'role': 'assistant',
#                     'content': "Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly
# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* "
#                 }
#             ]
#         }
#     ],
#     True
# )
```

Citations:

```
@misc{xu2024magpiealignmentdatasynthesis,
    title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
    author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
    year={2024},
    eprint={2406.08464},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2406.08464},
}
```
Source code in src/distilabel/steps/tasks/magpie/generator.py
class MagpieGenerator(GeneratorTask, MagpieBase):
    """Generator task that generates instructions or conversations using Magpie.

    Magpie is a neat method that allows generating user instructions with no seed data
    or specific system prompt thanks to the autoregressive capabilities of the instruct
    fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message
    and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query
    or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the
    LLM without any user message, then the LLM will continue generating tokens as if it was
    the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM.
    After this instruct is generated, it can be sent again to the LLM to generate this time
    an assistant response. This process can be repeated N times allowing to build a multi-turn
    conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from
    Scratch by Prompting Aligned LLMs with Nothing'.

    Attributes:
        n_turns: the number of turns that the generated conversation will have.
            Defaults to `1`.
        end_with_user: whether the conversation should end with a user message.
            Defaults to `False`.
        include_system_prompt: whether to include the system prompt used in the generated
            conversation. Defaults to `False`.
        only_instruction: whether to generate only the instruction. If this argument is
            `True`, then `n_turns` will be ignored. Defaults to `False`.
        system_prompt: an optional system prompt or list of system prompts that can
            be used to steer the LLM to generate content on a certain topic, guide the style,
            etc. If it's a list of system prompts, then a random system prompt will be chosen
            per input/output batch. If the provided inputs contain a `system_prompt` column,
            then this runtime parameter will be ignored and the one from the column will
            be used. Defaults to `None`.
        num_rows: the number of rows to be generated.

    Runtime parameters:
        - `n_turns`: the number of turns that the generated conversation will have. Defaults
            to `1`.
        - `end_with_user`: whether the conversation should end with a user message.
            Defaults to `False`.
        - `include_system_prompt`: whether to include the system prompt used in the generated
            conversation. Defaults to `False`.
        - `only_instruction`: whether to generate only the instruction. If this argument is
            `True`, then `n_turns` will be ignored. Defaults to `False`.
        - `system_prompt`: an optional system prompt or list of system prompts that can
            be used to steer the LLM to generate content on a certain topic, guide the style,
            etc. If it's a list of system prompts, then a random system prompt will be chosen
            per input/output batch. If the provided inputs contain a `system_prompt` column,
            then this runtime parameter will be ignored and the one from the column will
            be used. Defaults to `None`.
        - `num_rows`: the number of rows to be generated.

    Output columns:
        - conversation (`ChatType`): the generated conversation which is a list of chat
            items with a role and a message.
        - instruction (`str`): the generated instructions if `only_instruction=True`.
        - response (`str`): the generated response if `n_turns==1`.
        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.

    Categories:
        - text-generation
        - instruction
        - generator

    References:
        - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)

    Examples:

        Generating instructions with Llama 3 8B Instruct and TransformersLLM:

        ```python
        from distilabel.llms import TransformersLLM
        from distilabel.steps.tasks import MagpieGenerator

        generator = MagpieGenerator(
            llm=TransformersLLM(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 1.0,
                    "max_new_tokens": 256,
                },
                device="mps",
            ),
            only_instruction=True,
            num_rows=5,
        )

        generator.load()

        result = next(generator.process())
        # (
        #       [
        #           {"instruction": "I've just bought a new phone and I're excited to start using it."},
        #           {"instruction": "What are the most common types of companies that use digital signage?"}
        #       ],
        #       True
        # )
        ```

        Generating a conversation with Llama 3 8B Instruct and TransformersLLM:

        ```python
        from distilabel.llms import TransformersLLM
        from distilabel.steps.tasks import MagpieGenerator

        generator = MagpieGenerator(
            llm=TransformersLLM(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 1.0,
                    "max_new_tokens": 64,
                },
                device="mps",
            ),
            n_turns=3,
            num_rows=5,
        )

        generator.load()

        result = next(generator.process())
        # (
        #     [
        #         {
        #             'conversation': [
        #                 {
        #                     'role': 'system',
        #                     'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
        # insightful responses to help the user with their queries.'
        #                 },
        #                 {'role': 'user', 'content': "I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?"},
        #                 {
        #                     'role': 'assistant',
        #                     'content': "Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,
        # let's break down the basics. First, we need to identify your goals and target audience. What do"
        #                 },
        #                 {
        #                     'role': 'user',
        #                     'content': "Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main
        # expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating"
        #                 },
        #                 {
        #                     'role': 'assistant',
        #                     'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or
        # agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'
        #                 }
        #             ]
        #         },
        #         {
        #             'conversation': [
        #                 {
        #                     'role': 'system',
        #                     'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
        # insightful responses to help the user with their queries.'
        #                 },
        #                 {'role': 'user', 'content': "I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!"},
        #                 {
        #                     'role': 'assistant',
        #                     'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.
        # **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'
        #                 },
        #                 {
        #                     'role': 'user',
        #                     'content': 'Let me stop you there. Let\'s explore this "purpose" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I\'re primarily using my
        # laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'
        #                 },
        #                 {
        #                     'role': 'assistant',
        #                     'content': "Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly
        # option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* "
        #                 }
        #             ]
        #         }
        #     ],
        #     True
        # )
        ```

    Citations:

        ```
        @misc{xu2024magpiealignmentdatasynthesis,
            title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
            author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
            year={2024},
            eprint={2406.08464},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2406.08464},
        }
        ```
    """

    # TODO: move this to `GeneratorTask`
    num_rows: RuntimeParameter[int] = Field(
        default=None, description="The number of rows to generate."
    )

    def model_post_init(self, __context: Any) -> None:
        """Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`."""
        super().model_post_init(__context)

        if not isinstance(self.llm, MagpieChatTemplateMixin):
            raise ValueError(
                "`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`. "
                f"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin."
            )

        self.llm.use_magpie_template = True

    def format_output(
        self,
        output: Union[str, None],
        input: Union[Dict[str, Any], None] = None,
    ) -> Dict[str, Any]:
        """Does nothing."""
        return {}

    @property
    def outputs(self) -> List[str]:
        """Either a multi-turn conversation or the instruction generated."""
        if self.only_instruction:
            return ["instruction", "model_name"]
        if self.n_turns == 1:
            return ["instruction", "response", "model_name"]
        return ["conversation", "model_name"]

    def process(self, offset: int = 0) -> "GeneratorStepOutput":
        """Generates the desired number of instructions or conversations using Magpie.

        Args:
            offset: The offset to start the generation from. Defaults to `0`.

        Yields:
            The generated instructions or conversations.
        """
        generated = offset

        while generated <= self.num_rows:  # type: ignore
            rows_to_generate = (
                self.num_rows if self.num_rows < self.batch_size else self.batch_size  # type: ignore
            )
            conversations = self._generate_with_pre_query_template(
                inputs=[{} for _ in range(rows_to_generate)]  # type: ignore
            )
            generated += rows_to_generate  # type: ignore
            yield (conversations, generated == self.num_rows)

outputs: List[str] property

Either a multi-turn conversation or the instruction generated.
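
The column selection performed by the `outputs` property can be sketched as a standalone function. This is a minimal illustration that assumes the same precedence as the property in the source listing; `magpie_output_columns` is a hypothetical helper name, not part of distilabel.

```python
from typing import List


def magpie_output_columns(only_instruction: bool, n_turns: int) -> List[str]:
    """Return the output column names for a configuration (hypothetical helper)."""
    # `only_instruction` takes precedence, so `n_turns` is ignored when it is set.
    if only_instruction:
        return ["instruction", "model_name"]
    # A single turn is flattened into an instruction/response pair.
    if n_turns == 1:
        return ["instruction", "response", "model_name"]
    # Anything longer is kept as a chat-style conversation.
    return ["conversation", "model_name"]


print(magpie_output_columns(only_instruction=True, n_turns=3))
print(magpie_output_columns(only_instruction=False, n_turns=1))
print(magpie_output_columns(only_instruction=False, n_turns=3))
```

Checking `only_instruction` first mirrors the runtime parameter description, which states that `n_turns` is ignored when `only_instruction` is `True`.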

format_output(output, input=None)

Does nothing.

Source code in src/distilabel/steps/tasks/magpie/generator.py
def format_output(
    self,
    output: Union[str, None],
    input: Union[Dict[str, Any], None] = None,
) -> Dict[str, Any]:
    """Does nothing."""
    return {}

model_post_init(__context)

Checks that the provided LLM uses the MagpieChatTemplateMixin.

Source code in src/distilabel/steps/tasks/magpie/generator.py
def model_post_init(self, __context: Any) -> None:
    """Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`."""
    super().model_post_init(__context)

    if not isinstance(self.llm, MagpieChatTemplateMixin):
        raise ValueError(
            "`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`. "
            f"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin."
        )

    self.llm.use_magpie_template = True
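
The mixin check can be tried in isolation with plain Python classes. The sketch below is only an illustration of the `isinstance` guard pattern: `ChatTemplateMixin`, `CompatibleLLM`, `PlainLLM`, and `enable_template` are hypothetical stand-ins and do not exist in distilabel.

```python
class ChatTemplateMixin:
    """Hypothetical stand-in for a mixin such as `MagpieChatTemplateMixin`."""

    use_magpie_template: bool = False


class CompatibleLLM(ChatTemplateMixin):
    """Hypothetical LLM class that inherits the mixin."""


class PlainLLM:
    """Hypothetical LLM class that does not inherit the mixin."""


def enable_template(llm: object) -> None:
    """Reject objects without the mixin, otherwise switch the template flag on."""
    if not isinstance(llm, ChatTemplateMixin):
        raise ValueError(
            f"`{llm.__class__.__name__}` doesn't use `ChatTemplateMixin`."
        )
    llm.use_magpie_template = True


llm = CompatibleLLM()
enable_template(llm)
print(llm.use_magpie_template)

try:
    enable_template(PlainLLM())
except ValueError as error:
    print(error)
```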

process(offset=0)

Generates the desired number of instructions or conversations using Magpie.

Parameters:

Name Type Description Default
offset int

The offset to start the generation from. Defaults to 0.

0

Yields:

Type Description
GeneratorStepOutput

The generated instructions or conversations.

Source code in src/distilabel/steps/tasks/magpie/generator.py
def process(self, offset: int = 0) -> "GeneratorStepOutput":
    """Generates the desired number of instructions or conversations using Magpie.

    Args:
        offset: The offset to start the generation from. Defaults to `0`.

    Yields:
        The generated instructions or conversations.
    """
    generated = offset

    while generated <= self.num_rows:  # type: ignore
        rows_to_generate = (
            self.num_rows if self.num_rows < self.batch_size else self.batch_size  # type: ignore
        )
        conversations = self._generate_with_pre_query_template(
            inputs=[{} for _ in range(rows_to_generate)]  # type: ignore
        )
        generated += rows_to_generate  # type: ignore
        yield (conversations, generated == self.num_rows)
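
To make the batching arithmetic concrete, here is a minimal standalone sketch of splitting `num_rows` into batches and flagging the final one. It is an illustration rather than a copy of the listing above: `batch_sizes` is a hypothetical helper, and capping the last batch at the remaining rows and stopping once `num_rows` is reached are assumptions about the intended behavior.

```python
from typing import Iterator, Tuple


def batch_sizes(
    num_rows: int, batch_size: int, offset: int = 0
) -> Iterator[Tuple[int, bool]]:
    """Yield `(rows_in_batch, is_last_batch)` pairs (hypothetical helper)."""
    generated = offset
    while generated < num_rows:
        # Never request more rows than are still missing.
        rows_to_generate = min(batch_size, num_rows - generated)
        generated += rows_to_generate
        # The flag is True only for the batch that reaches `num_rows`.
        yield rows_to_generate, generated >= num_rows


print(list(batch_sizes(num_rows=5, batch_size=2)))
# [(2, False), (2, False), (1, True)]
```

With this shape the last-batch flag is raised exactly once, including when `num_rows` is not a multiple of `batch_size`.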

MonolingualTripletGenerator

Bases: _EmbeddingDataGenerator

Generate monolingual triplets with an LLM to later on train an embedding model.

MonolingualTripletGenerator is a GeneratorTask that generates monolingual triplets with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.

Attributes:

Name Type Description
language str

The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.

unit Optional[Literal['sentence', 'phrase', 'passage']]

The unit of the data to be generated, which can be sentence, phrase, or passage. Defaults to None, meaning that it will be randomly sampled.

difficulty Optional[Literal['elementary school', 'high school', 'college']]

The difficulty of the query to be generated, which can be elementary school, high school, or college. Defaults to None, meaning that it will be randomly sampled.

high_score Optional[Literal['4', '4.5', '5']]

The high score of the query to be generated, which can be 4, 4.5, or 5. Defaults to None, meaning that it will be randomly sampled.

low_score Optional[Literal['2.5', '3', '3.5']]

The low score of the query to be generated, which can be 2.5, 3, or 3.5. Defaults to None, meaning that it will be randomly sampled.

seed int

The random seed to be set in case there's any sampling within the format_input method.

Examples:

Generate monolingual triplets for training embedding models:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MonolingualTripletGenerator

with Pipeline("my-pipeline") as pipeline:
    task = MonolingualTripletGenerator(
        language="English",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,
    )

    ...

    task >> ...
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class MonolingualTripletGenerator(_EmbeddingDataGenerator):
    """Generate monolingual triplets with an `LLM` to later on train an embedding model.

    `MonolingualTripletGenerator` is a `GeneratorTask` that generates monolingual triplets with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        unit: The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`.
            Defaults to `None`, meaning that it will be randomly sampled.
        difficulty: The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`.
            Defaults to `None`, meaning that it will be randomly sampled.
        high_score: The high score of the query to be generated, which can be `4`, `4.5`, or `5`.
            Defaults to `None`, meaning that it will be randomly sampled.
        low_score: The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`.
            Defaults to `None`, meaning that it will be randomly sampled.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.

    Examples:

        Generate monolingual triplets for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import MonolingualTripletGenerator

        with Pipeline("my-pipeline") as pipeline:
            task = MonolingualTripletGenerator(
                language="English",
                unit="sentence",
                difficulty="elementary school",
                high_score="4",
                low_score="2.5",
                llm=...,
            )

            ...

            task >> ...
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    unit: Optional[Literal["sentence", "phrase", "passage"]] = None
    difficulty: Optional[Literal["elementary school", "high school", "college"]] = None
    high_score: Optional[Literal["4", "4.5", "5"]] = None
    low_score: Optional[Literal["2.5", "3", "3.5"]] = None

    _template_name: str = PrivateAttr(default="monolingual-triplet")

    @property
    def prompt(self) -> ChatType:
        """The `prompt` used in the `process` method: the rendered `_template`, formatted as an
        OpenAI-style chat (a `ChatType`) with a single user turn whose content is the rendered
        `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    language=self.language,
                    unit=self.unit or random.choice(["sentence", "phrase", "passage"]),
                    difficulty=self.difficulty
                    or random.choice(["elementary school", "high school", "college"]),
                    high_score=self.high_score or random.choice(["4", "4.5", "5"]),
                    low_score=self.low_score or random.choice(["2.5", "3", "3.5"]),
                ).strip(),
            }
        ]  # type: ignore

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["S1", "S2", "S3"]

keys: List[str] property

Contains the keys that will be parsed from the LLM output into a Python dict.
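
As a rough illustration of how such keys might be pulled out of a model response, the sketch below assumes the LLM was prompted to answer with a JSON object. `parse_keys` is a hypothetical helper, and falling back to `None` values on malformed output is an assumption rather than documented behavior.

```python
import json
from typing import Any, Dict, List, Optional


def parse_keys(output: Optional[str], keys: List[str]) -> Dict[str, Any]:
    """Extract `keys` from a JSON-formatted LLM output (hypothetical helper)."""
    empty = {key: None for key in keys}
    if output is None:
        return empty
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        # Malformed JSON: keep the columns but leave them empty.
        return empty
    if not isinstance(data, dict):
        return empty
    # Missing keys become None instead of raising a KeyError.
    return {key: data.get(key) for key in keys}


raw = '{"S1": "A cat sleeps.", "S2": "A cat is napping.", "S3": "A dog barks."}'
print(parse_keys(raw, ["S1", "S2", "S3"]))
```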

prompt: ChatType property

The prompt used in the process method: the rendered _template, formatted as an OpenAI-style chat (a ChatType) with a single user turn whose content is the rendered _template.
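
The "use the provided value, otherwise sample one" behavior of the attributes can be sketched as follows. This is a minimal illustration: `resolve_attributes` is a hypothetical helper, and using a local `random.Random(seed)` instance for reproducibility is an assumption, not necessarily how the task applies its `seed`.

```python
import random
from typing import Dict, Optional

UNITS = ["sentence", "phrase", "passage"]
DIFFICULTIES = ["elementary school", "high school", "college"]
HIGH_SCORES = ["4", "4.5", "5"]
LOW_SCORES = ["2.5", "3", "3.5"]


def resolve_attributes(
    unit: Optional[str] = None,
    difficulty: Optional[str] = None,
    high_score: Optional[str] = None,
    low_score: Optional[str] = None,
    seed: Optional[int] = None,
) -> Dict[str, str]:
    """Fill in every attribute left as None with a random choice (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return {
        "unit": unit or rng.choice(UNITS),
        "difficulty": difficulty or rng.choice(DIFFICULTIES),
        "high_score": high_score or rng.choice(HIGH_SCORES),
        "low_score": low_score or rng.choice(LOW_SCORES),
    }


# Explicit values are kept; the remaining ones are sampled.
print(resolve_attributes(unit="sentence", difficulty="college", seed=42))
```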

PairRM

Bases: Step

Rank the candidates based on the input using the LLM model.

Attributes:

Name Type Description
model str

The model to use for the ranking. Defaults to "llm-blender/PairRM".

instructions Optional[str]

The instructions to use for the model. Defaults to None.

Input columns
  • input (List[Dict[str, Any]]): The input text or conversation to rank the candidates for.
  • candidates (List[Dict[str, Any]]): The candidates to rank.
Output columns
  • ranks (List[int]): The ranks of the candidates based on the input.
  • ranked_candidates (List[Dict[str, Any]]): The candidates ranked based on the input.
  • model_name (str): The model name used to rank the candidate responses. Defaults to "llm-blender/PairRM".
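References
  • LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion (https://arxiv.org/abs/2306.02561)
  • Pair Ranking Model (https://huggingface.co/llm-blender/PairRM)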
Categories
  • preference
Note

This step differs from other tasks because there is currently a single implementation of this model, so a specific LLM is used.

Examples:

Rank LLM candidates:

```python
from distilabel.steps.tasks import PairRM

# Consider this as a placeholder for your actual LLM.
pair_rm = PairRM()

pair_rm.load()

result = next(
    pair_rm.process(
        [
            {"input": "Hello, how are you?", "candidates": ["fine", "good", "bad"]},
        ]
    )
)
# result
# [
#     {
#         'input': 'Hello, how are you?',
#         'candidates': ['fine', 'good', 'bad'],
#         'ranks': [2, 1, 3],
#         'ranked_candidates': ['good', 'fine', 'bad'],
#         'model_name': 'llm-blender/PairRM',
#     }
# ]
```

Citations:

```
@misc{jiang2023llmblenderensemblinglargelanguage,
    title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
    author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},
    year={2023},
    eprint={2306.02561},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2306.02561},
}
```
Source code in src/distilabel/steps/tasks/pair_rm.py
class PairRM(Step):
    """Rank the candidates based on the input using the `LLM` model.

    Attributes:
        model: The model to use for the ranking. Defaults to `"llm-blender/PairRM"`.
        instructions: The instructions to use for the model. Defaults to `None`.

    Input columns:
        - input (`List[Dict[str, Any]]`): The input text or conversation to rank the candidates for.
        - candidates (`List[Dict[str, Any]]`): The candidates to rank.

    Output columns:
        - ranks (`List[int]`): The ranks of the candidates based on the input.
        - ranked_candidates (`List[Dict[str, Any]]`): The candidates ranked based on the input.
        - model_name (`str`): The model name used to rank the candidate responses. Defaults to `"llm-blender/PairRM"`.

    References:
        - [LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion](https://arxiv.org/abs/2306.02561).
        - [Pair Ranking Model](https://huggingface.co/llm-blender/PairRM).

    Categories:
        - preference

    Note:
        This step differs from other tasks because there is currently a single implementation
        of this model, so a specific `LLM` is used.

    Examples:

        Rank LLM candidates:

        ```python
        from distilabel.steps.tasks import PairRM

        # Consider this as a placeholder for your actual LLM.
        pair_rm = PairRM()

        pair_rm.load()

        result = next(
            pair_rm.process(
                [
                    {"input": "Hello, how are you?", "candidates": ["fine", "good", "bad"]},
                ]
            )
        )
        # result
        # [
        #     {
        #         'input': 'Hello, how are you?',
        #         'candidates': ['fine', 'good', 'bad'],
        #         'ranks': [2, 1, 3],
        #         'ranked_candidates': ['good', 'fine', 'bad'],
        #         'model_name': 'llm-blender/PairRM',
        #     }
        # ]
        ```

    Citations:

        ```
        @misc{jiang2023llmblenderensemblinglargelanguage,
            title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
            author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},
            year={2023},
            eprint={2306.02561},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2306.02561},
        }
        ```
    """

    model: str = "llm-blender/PairRM"
    instructions: Optional[str] = None

    def load(self) -> None:
        """Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the
        custom library for running the inference for the PairRM models."""
        try:
            import llm_blender
        except ImportError as e:
            raise ImportError(
                "The `llm_blender` package is required to use the `PairRM` class. "
                "Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`."
            ) from e

        self._blender = llm_blender.Blender()
        self._blender.loadranker(self.model)

    @property
    def inputs(self) -> List[str]:
        """The input columns correspond to the two required arguments from `Blender.rank`:
        `inputs` and `candidates`."""
        return ["input", "candidates"]

    @property
    def outputs(self) -> List[str]:
        """The outputs will include the `ranks` and the `ranked_candidates`."""
        return ["ranks", "ranked_candidates", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:
        """The input is expected to be a dictionary with the keys `input` and `candidates`,
        where the `input` corresponds to the instruction of a model and `candidates` are a
        list of responses to be ranked.
        """
        return {"input": input["input"], "candidates": input["candidates"]}

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Generates the ranks for the candidates based on the input.

        The ranks are the positions of the candidates, where lower is better,
        and the ranked candidates correspond to the candidates sorted according to the
        ranks obtained.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.
        """
        input_texts = []
        candidates = []
        for input in inputs:
            formatted_input = self.format_input(input)
            input_texts.append(formatted_input["input"])
            candidates.append(formatted_input["candidates"])

        instructions = (
            [self.instructions] * len(input_texts) if self.instructions else None
        )

        ranks = self._blender.rank(
            input_texts,
            candidates,
            instructions=instructions,
            return_scores=False,
            batch_size=self.input_batch_size,
        )
        # Sort the candidates based on the ranks
        ranked_candidates = np.take_along_axis(
            np.array(candidates), ranks - 1, axis=1
        ).tolist()
        ranks = ranks.tolist()
        for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):
            input["ranks"] = rank
            input["ranked_candidates"] = ranked_candidate
            input["model_name"] = self.model

        yield inputs

inputs: List[str] property

The input columns correspond to the two required arguments from Blender.rank: inputs and candidates.

outputs: List[str] property

The outputs will include the ranks and the ranked_candidates.

format_input(input)

The input is expected to be a dictionary with the keys input and candidates, where the input corresponds to the instruction of a model and candidates are a list of responses to be ranked.

Source code in src/distilabel/steps/tasks/pair_rm.py
def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:
    """The input is expected to be a dictionary with the keys `input` and `candidates`,
    where the `input` corresponds to the instruction of a model and `candidates` are a
    list of responses to be ranked.
    """
    return {"input": input["input"], "candidates": input["candidates"]}

load()

Loads the PairRM model provided via model with llm_blender.Blender, which is the custom library for running the inference for the PairRM models.

Source code in src/distilabel/steps/tasks/pair_rm.py
def load(self) -> None:
    """Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the
    custom library for running the inference for the PairRM models."""
    try:
        import llm_blender
    except ImportError as e:
        raise ImportError(
            "The `llm_blender` package is required to use the `PairRM` class. "
            "Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`."
        ) from e

    self._blender = llm_blender.Blender()
    self._blender.loadranker(self.model)

process(inputs)

Generates the ranks for the candidates based on the input.

The ranks are the positions of the candidates, where lower is better, and the ranked candidates correspond to the candidates sorted according to the ranks obtained.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Yields:

Type Description
StepOutput

An iterator with the inputs containing the ranks, ranked_candidates, and model_name.

Source code in src/distilabel/steps/tasks/pair_rm.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Generates the ranks for the candidates based on the input.

    The ranks are the positions of the candidates, where lower is better,
    and the ranked candidates correspond to the candidates sorted according to the
    ranks obtained.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.
    """
    input_texts = []
    candidates = []
    for input in inputs:
        formatted_input = self.format_input(input)
        input_texts.append(formatted_input["input"])
        candidates.append(formatted_input["candidates"])

    instructions = (
        [self.instructions] * len(input_texts) if self.instructions else None
    )

    ranks = self._blender.rank(
        input_texts,
        candidates,
        instructions=instructions,
        return_scores=False,
        batch_size=self.input_batch_size,
    )
    # Sort the candidates based on the ranks
    ranked_candidates = np.take_along_axis(
        np.array(candidates), ranks - 1, axis=1
    ).tolist()
    ranks = ranks.tolist()
    for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):
        input["ranks"] = rank
        input["ranked_candidates"] = ranked_candidate
        input["model_name"] = self.model

    yield inputs
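
The relationship between rank positions and the sorted candidates can be illustrated with plain Python. The sketch assumes that `ranks[j]` is the 1-based rank of `candidates[j]` and that lower is better, as the docstring describes; `sort_by_rank` is a hypothetical helper that orders candidates with `sorted`, which is a different technique from the NumPy indexing in the listing above.

```python
from typing import List, Sequence


def sort_by_rank(candidates: Sequence[str], ranks: Sequence[int]) -> List[str]:
    """Return the candidates ordered from best (rank 1) to worst (hypothetical helper)."""
    if len(candidates) != len(ranks):
        raise ValueError("`candidates` and `ranks` must have the same length.")
    # Pair every candidate with its rank, then order the pairs by rank only.
    paired = sorted(zip(ranks, candidates), key=lambda pair: pair[0])
    return [candidate for _, candidate in paired]


print(sort_by_rank(["fine", "good", "bad"], [2, 1, 3]))
# ['good', 'fine', 'bad']
print(sort_by_rank(["a", "b", "c"], [2, 3, 1]))
# ['c', 'a', 'b']
```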

PrometheusEval

Bases: Task

Critique and rank the quality of generations from an LLM using Prometheus 2.0.

`PrometheusEval` is a task created for Prometheus 2.0, covering both the absolute and relative
evaluations.

- The absolute evaluation i.e. `mode="absolute"` is used to evaluate a single generation from
    an LLM for a given instruction.
- The relative evaluation i.e. `mode="relative"` is used to evaluate two generations from an LLM
    for a given instruction.

Both evaluations can optionally compare against a reference answer, enabled via the `reference`
attribute. Both are based on a score rubric that critiques the generation/s on one of the following
default aspects: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`. The
available rubrics can be overridden via `rubrics`, and the rubric to apply is selected via the
`rubric` attribute.

Note:
    The `PrometheusEval` task is intended to be used with the Prometheus 2.0 models released by
    KAIST AI: https://huggingface.co/prometheus-eval/prometheus-7b-v2.0 and
    https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The formatting and quality of the
    critique are not guaranteed with any other model, although some may follow the format correctly
    and generate insightful critiques too.

Attributes:
    mode: the evaluation mode to use, either `absolute` or `relative`. It defines whether the task
        will evaluate one or two generations.
    rubric: the score rubric to use within the prompt to run the critique based on different aspects.
        Can be any existing key in the `rubrics` attribute, which by default means that it can be:
        `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, or `reasoning`. Those will only
        work if using the default `rubrics`, otherwise, the provided `rubrics` should be used.
    rubrics: a dictionary containing the different rubrics to use for the critique, where the keys are
        the rubric names and the values are the rubric descriptions. The default rubrics are the following:
        `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.
    reference: a boolean flag to indicate whether a reference answer / completion will be provided, so
        that the model critique is based on the comparison with it. It implies that the column `reference`
        needs to be provided within the input data in addition to the rest of the inputs.
    _template: a Jinja2 template used to format the input for the LLM.

Input columns:
    - instruction (`str`): The instruction to use as reference.
    - generation (`str`, optional): The generated text from the given `instruction`. This column is required
        if `mode=absolute`.
    - generations (`List[str]`, optional): The generated texts from the given `instruction`. It should
        contain 2 generations only. This column is required if `mode=relative`.
    - reference (`str`, optional): The reference / golden answer for the `instruction`, to be used by the LLM
        for comparison against.

Output columns:
    - feedback (`str`): The feedback explaining the result below, as critiqued by the LLM using the
        pre-defined score rubric, compared against `reference` if provided.
    - result (`Union[int, Literal["A", "B"]]`): If `mode=absolute`, then the result contains the score for the
        `generation` in a likert-scale from 1-5, otherwise, if `mode=relative`, then the result contains either
        "A" or "B", the "winning" one being the generation in the index 0 of `generations` if `result='A'` or the
        index 1 if `result='B'`.
    - model_name (`str`): The model name used to generate the `feedback` and `result`.
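
How `feedback` and `result` could be separated from a raw completion is sketched below. This is a minimal illustration that assumes the model ends its answer with a `[RESULT]` marker followed by the verdict, which is the convention of the Prometheus 2 prompts; the marker handling and the helper name `parse_prometheus_output` are assumptions and are not taken from the task's implementation.

```python
import re
from typing import Dict, Optional, Union

Parsed = Dict[str, Optional[Union[int, str]]]


def parse_prometheus_output(output: str, mode: str) -> Parsed:
    """Split `<feedback> [RESULT] <verdict>` into its parts (hypothetical helper)."""
    if mode not in ("absolute", "relative"):
        raise ValueError("`mode` must be 'absolute' or 'relative'.")
    failed: Parsed = {"feedback": None, "result": None}
    parts = output.split("[RESULT]")
    if len(parts) != 2:
        return failed
    feedback, verdict = parts[0].strip(), parts[1].strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    if mode == "absolute":
        # Absolute grading uses a 1-5 Likert score.
        if not re.fullmatch(r"[1-5]", verdict):
            return failed
        return {"feedback": feedback, "result": int(verdict)}
    # Relative grading picks one of the two generations.
    if verdict not in ("A", "B"):
        return failed
    return {"feedback": feedback, "result": verdict}


print(parse_prometheus_output("Feedback: Accurate and concise. [RESULT] 4", "absolute"))
print(parse_prometheus_output("Feedback: B is more honest. [RESULT] B", "relative"))
```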

Categories:
    - critique
    - preference

References:
    - [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
    - [prometheus-eval: Evaluate your LLM's response with Prometheus 💯](https://github.com/prometheus-eval/prometheus-eval)

Examples:

    Critique and evaluate LLM generation quality using Prometheus 2.0:

    ```python
    from distilabel.steps.tasks import PrometheusEval
    from distilabel.llms import vLLM

    # Consider this as a placeholder for your actual LLM.
    prometheus = PrometheusEval(
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
    )

    prometheus.load()

    result = next(
        prometheus.process(
            [
                {"instruction": "make something", "generation": "something done"},
            ]
        )
    )
    # result
    # [
    #     {
    #         'instruction': 'make something',
    #         'generation': 'something done',
    #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
    #         'feedback': 'the feedback',
    #         'result': 5,
    #     }
    # ]
    ```

    Critique for relative evaluation:

    ```python
    from distilabel.steps.tasks import PrometheusEval
    from distilabel.llms import vLLM

    # Consider this as a placeholder for your actual LLM.
    prometheus = PrometheusEval(
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="relative",
        rubric="honesty",
    )

    prometheus.load()

    result = next(
        prometheus.process(
            [
                {"instruction": "make something", "generations": ["something done", "other thing"]},
            ]
        )
    )
    # result
    # [
    #     {
    #         'instruction': 'make something',
    #         'generations': ['something done', 'other thing'],
    #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
    #         'feedback': 'the feedback',
    #         'result': 'A',
    #     }
    # ]
    ```

    Critique with a custom rubric:

    ```python
    from distilabel.steps.tasks import PrometheusEval
    from distilabel.llms import vLLM

    # Consider this as a placeholder for your actual LLM.
    prometheus = PrometheusEval(
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="custom",
        rubrics={
            "custom": "[A]\nScore 1: A\nScore 2: B\nScore 3: C\nScore 4: D\nScore 5: E"
        },
    )

    prometheus.load()

    result = next(
        prometheus.process(
            [
                {"instruction": "make something", "generation": "something done"},
            ]
        )
    )
    # result
    # [
    #     {
    #         'instruction': 'make something',
    #         'generation': 'something done',
    #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
    #         'feedback': 'the feedback',
    #         'result': 5,
    #     }
    # ]
    ```

    Critique using a reference answer:

    ```python
    from distilabel.steps.tasks import PrometheusEval
    from distilabel.llms import vLLM

    # Consider this as a placeholder for your actual LLM.
    prometheus = PrometheusEval(
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="helpfulness",
        reference=True,
    )

    prometheus.load()

    result = next(
        prometheus.process(
            [
                {
                    "instruction": "make something",
                    "generation": "something done",
                    "reference": "this is a reference answer",
                },
            ]
        )
    )
    # result
    # [
    #     {
    #         'instruction': 'make something',
    #         'generation': 'something done',
    #         'reference': 'this is a reference answer',
    #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
    #         'feedback': 'the feedback',
    #         'result': 5,
    #     }
    # ]
    ```

Citations:

    ```
    @misc{kim2024prometheus2opensource,
        title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
        author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
        year={2024},
        eprint={2405.01535},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2405.01535},
    }
    ```
Source code in src/distilabel/steps/tasks/prometheus_eval.py
class PrometheusEval(Task):
    """Critique and rank the quality of generations from an `LLM` using Prometheus 2.0.

    `PrometheusEval` is a task created for Prometheus 2.0, covering both the absolute and relative
    evaluations.

    - The absolute evaluation i.e. `mode="absolute"` is used to evaluate a single generation from
        an LLM for a given instruction.
    - The relative evaluation i.e. `mode="relative"` is used to evaluate two generations from an LLM
        for a given instruction.

    Both modes can optionally compare against a reference answer, enabled via the `reference`
    attribute. Both rely on a score rubric that critiques the generation(s) on one of the
    following default aspects: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`,
    or `reasoning`. The available rubrics can be overridden via `rubrics`, and the rubric to
    apply is selected via the attribute `rubric`.

    Note:
        The `PrometheusEval` task is intended to be used with the Prometheus 2.0 models released
        by KAIST AI: https://huggingface.co/prometheus-eval/prometheus-7b-v2.0 and
        https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The formatting and quality
        of the critique are not guaranteed with other models, although some may still follow the
        format correctly and generate insightful critiques.

    Attributes:
        mode: the evaluation mode to use, either `absolute` or `relative`. It defines whether the task
            will evaluate one or two generations.
        rubric: the score rubric to use within the prompt to run the critique based on different aspects.
            Can be any existing key in the `rubrics` attribute, which by default means that it can be:
            `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, or `reasoning`. These names are only
            valid with the default `rubrics`; when custom `rubrics` are provided, `rubric` must be one of their keys.
        rubrics: a dictionary containing the different rubrics to use for the critique, where the keys are
            the rubric names and the values are the rubric descriptions. The default rubrics are the following:
            `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.
        reference: a boolean flag to indicate whether a reference answer / completion will be provided, so
            that the model critique is based on the comparison with it. It implies that the column `reference`
            needs to be provided within the input data in addition to the rest of the inputs.
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - instruction (`str`): The instruction to use as reference.
        - generation (`str`, optional): The generated text from the given `instruction`. This column is required
            if `mode=absolute`.
        - generations (`List[str]`, optional): The generated texts from the given `instruction`. It should
            contain 2 generations only. This column is required if `mode=relative`.
        - reference (`str`, optional): The reference / golden answer for the `instruction`, to be used by the LLM
            for comparison against.

    Output columns:
        - feedback (`str`): The feedback explaining the result below, as critiqued by the LLM using the
            pre-defined score rubric, compared against `reference` if provided.
        - result (`Union[int, Literal["A", "B"]]`): If `mode=absolute`, the score for the `generation` on a
            Likert scale from 1 to 5. If `mode=relative`, either "A" or "B", where "A" means the generation
            at index 0 of `generations` wins and "B" means the one at index 1 wins.
        - model_name (`str`): The model name used to generate the `feedback` and `result`.

    Categories:
        - critique
        - preference

    References:
        - [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
        - [prometheus-eval: Evaluate your LLM's response with Prometheus 💯](https://github.com/prometheus-eval/prometheus-eval)

    Examples:

        Critique and evaluate LLM generation quality using Prometheus 2.0:

        ```python
        from distilabel.steps.tasks import PrometheusEval
        from distilabel.llms import vLLM

        # Consider this as a placeholder for your actual LLM.
        prometheus = PrometheusEval(
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="absolute",
            rubric="factual-validity"
        )

        prometheus.load()

        result = next(
            prometheus.process(
                [
                    {"instruction": "make something", "generation": "something done"},
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'make something',
        #         'generation': 'something done',
        #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
        #         'feedback': 'the feedback',
        #         'result': 5,
        #     }
        # ]
        ```

        Critique for relative evaluation:

        ```python
        from distilabel.steps.tasks import PrometheusEval
        from distilabel.llms import vLLM

        # Consider this as a placeholder for your actual LLM.
        prometheus = PrometheusEval(
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="relative",
            rubric="honesty"
        )

        prometheus.load()

        result = next(
            prometheus.process(
                [
                    {"instruction": "make something", "generations": ["something done", "other thing"]},
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'make something',
        #         'generations': ['something done', 'other thing'],
        #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
        #         'feedback': 'the feedback',
        #         'result': 'A',
        #     }
        # ]
        ```

        Critique with a custom rubric:

        ```python
        from distilabel.steps.tasks import PrometheusEval
        from distilabel.llms import vLLM

        # Consider this as a placeholder for your actual LLM.
        prometheus = PrometheusEval(
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="absolute",
            rubric="custom",
            rubrics={
                "custom": "[A]\\nScore 1: A\\nScore 2: B\\nScore 3: C\\nScore 4: D\\nScore 5: E"
            }
        )

        prometheus.load()

        result = next(
            prometheus.process(
                [
                    {"instruction": "make something", "generation": "something done"},
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'make something',
        #         'generation': 'something done',
        #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
        #         'feedback': 'the feedback',
        #         'result': 5,
        #     }
        # ]
        ```

        Critique using a reference answer:

        ```python
        from distilabel.steps.tasks import PrometheusEval
        from distilabel.llms import vLLM

        # Consider this as a placeholder for your actual LLM.
        prometheus = PrometheusEval(
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="absolute",
            rubric="helpfulness",
            reference=True,
        )

        prometheus.load()

        result = next(
            prometheus.process(
                [
                    {
                        "instruction": "make something",
                        "generation": "something done",
                        "reference": "this is a reference answer",
                    },
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'make something',
        #         'generation': 'something done',
        #         'reference': 'this is a reference answer',
        #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
        #         'feedback': 'the feedback',
        #         'result': 5,
        #     }
        # ]
        ```

    Citations:

        ```
        @misc{kim2024prometheus2opensource,
            title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
            author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
            year={2024},
            eprint={2405.01535},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2405.01535},
        }
        ```
    """

    mode: Literal["absolute", "relative"]
    rubric: str
    rubrics: Optional[Dict[str, str]] = Field(default=_DEFAULT_RUBRICS)
    reference: bool = False

    _template: Union[Template, None] = PrivateAttr(...)

    @model_validator(mode="after")
    def validate_rubric_and_rubrics(self) -> Self:
        if not isinstance(self.rubrics, dict) or len(self.rubrics) < 1:
            raise ValueError(
                "Provided `rubrics` must be a Python dictionary with string keys and string values."
            )

        def rubric_matches_pattern(rubric: str) -> bool:
            """Checks if the provided rubric matches the pattern of the default rubrics."""
            pattern = r"^\[.*?\]\n(?:Score [1-4]: .*?\n){4}(?:Score 5: .*?)"
            return bool(re.match(pattern, rubric, re.MULTILINE))

        if not all(rubric_matches_pattern(value) for value in self.rubrics.values()):
            raise ValueError(
                "Provided rubrics should match the format of the default rubrics, which"
                " is as follows: `[<scoring criteria>]\nScore 1: <description>\nScore 2: <description>\n"
                "Score 3: <description>\nScore 4: <description>\nScore 5: <description>`; replacing"
                " `<scoring criteria>` and `<description>` with the actual criteria and description"
                " for each of the scores, respectively."
            )

        if self.rubric not in self.rubrics:
            raise ValueError(
                f"Provided rubric '{self.rubric}' is not among the available rubrics: {', '.join(self.rubrics.keys())}."
            )

        return self

    def load(self) -> None:
        """Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation
        depending on the `mode` value, and either with or without reference, depending on the
        value of `reference`."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "prometheus"
            / (
                f"{self.mode}_without_reference.jinja2"
                if self.reference is False
                else f"{self.mode}_with_reference.jinja2"
            )
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are the `instruction` and either the `generation`
        (if `mode="absolute"`) or the `generations` (if `mode="relative"`), plus the
        `reference` if `reference=True`."""
        if self.mode == "absolute":
            if self.reference:
                return ["instruction", "generation", "reference"]
            return ["instruction", "generation"]
        else:
            if self.reference:
                return ["instruction", "generations", "reference"]
            return ["instruction", "generations"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` where the prompt is formatted according
        to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction
        from the user, including a pre-defined system prompt."""
        template_kwargs = {
            "instruction": input["instruction"],
            "rubric": self.rubrics[self.rubric],
        }
        if self.reference:
            template_kwargs["reference"] = input["reference"]

        if self.mode == "absolute":
            if not isinstance(input["generation"], str):
                raise ValueError(
                    f"Provided `generation` is of type {type(input['generation'])} but a string"
                    " should be provided instead.",
                )

            template_kwargs["generation"] = input["generation"]
            system_message = (
                "You are a fair judge assistant tasked with providing clear, objective feedback based"
                " on specific criteria, ensuring each assessment reflects the absolute standards set"
                " for performance."
            )
        else:  # self.mode == "relative"
            if (
                not isinstance(input["generations"], list)
                or not all(
                    isinstance(generation, str) for generation in input["generations"]
                )
                or len(input["generations"]) != 2
            ):
                raise ValueError(
                    f"Provided `generations` is of type {type(input['generations'])} but a list of strings with length 2 should be provided instead."
                )

            template_kwargs["generations"] = input["generations"]
            system_message = (
                "You are a fair judge assistant assigned to deliver insightful feedback that compares"
                " individual performances, highlighting how each stands relative to others within the"
                " same cohort."
            )

        return [
            {
                "role": "system",
                "content": system_message,
            },
            {
                "role": "user",
                "content": self._template.render(**template_kwargs),  # type: ignore
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The outputs of the task are the `feedback` and the `result` generated by Prometheus,
        as well as the `model_name`, which is automatically included based on the `LLM` used.
        """
        return ["feedback", "result", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dict with the keys `feedback` and `result`, extracted
        by splitting the Prometheus output on the `[RESULT]` marker.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Optionally provided in case it's useful to build the output.

        Returns:
            A dict with the keys `feedback` and `result` generated by the LLM.
        """
        if output is None:
            return {"feedback": None, "result": None}

        parts = output.split("[RESULT]")
        if len(parts) != 2:
            return {"feedback": None, "result": None}

        feedback, result = parts[0].strip(), parts[1].strip()
        if feedback.startswith("Feedback:"):
            feedback = feedback[len("Feedback:") :].strip()
        if self.mode == "absolute":
            if not result.isdigit() or result not in ["1", "2", "3", "4", "5"]:
                return {"feedback": None, "result": None}
            return {"feedback": feedback, "result": int(result)}
        else:  # self.mode == "relative"
            if result not in ["A", "B"]:
                return {"feedback": None, "result": None}
            return {"feedback": feedback, "result": result}
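
The rubric check performed by `validate_rubric_and_rubrics` in the listing above can be tried in isolation. The following is a minimal sketch that reuses the same regular expression outside the class; the rubric text and the names `RUBRIC_PATTERN`, `custom_rubric`, `ok`, and `flattened_ok` are illustrative assumptions, not part of distilabel.

```python
import re

# Same pattern as in `validate_rubric_and_rubrics`: a bracketed criteria line,
# four "Score N: ..." lines with N in 1-4, then a "Score 5: ..." line.
RUBRIC_PATTERN = r"^\[.*?\]\n(?:Score [1-4]: .*?\n){4}(?:Score 5: .*?)"


def rubric_matches_pattern(rubric: str) -> bool:
    """Return True if the rubric follows the layout of the default rubrics."""
    return bool(re.match(RUBRIC_PATTERN, rubric, re.MULTILINE))


# Hypothetical rubric text, for illustration only.
custom_rubric = (
    "[Is the answer concise?]\n"
    "Score 1: Extremely verbose.\n"
    "Score 2: Mostly verbose.\n"
    "Score 3: Mixed.\n"
    "Score 4: Mostly concise.\n"
    "Score 5: Fully concise."
)

# Well-formed: each score sits on its own line.
ok = rubric_matches_pattern(custom_rubric)

# Same text with the newlines replaced by spaces: the pattern no longer matches.
flattened_ok = rubric_matches_pattern(custom_rubric.replace("\n", " "))
```

The pattern depends on real newline characters between the lines, so a rubric written on a single line with spaces as separators would be rejected by the validator.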

inputs: List[str] property

The inputs for the task are the instruction and either the generation (if mode="absolute") or the generations (if mode="relative"), plus the reference if reference=True.
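
As the source listing shows, the columns returned by `inputs` depend on both `mode` and `reference`. A minimal standalone sketch of that branching, where `required_inputs` is a hypothetical helper and not a distilabel function:

```python
from typing import List


def required_inputs(mode: str, reference: bool) -> List[str]:
    """Return the input columns expected for a given mode and reference flag."""
    # `generation` (one string) in absolute mode, `generations` (two strings) in relative mode.
    columns = ["instruction", "generation" if mode == "absolute" else "generations"]
    if reference:
        columns.append("reference")
    return columns


absolute_columns = required_inputs("absolute", reference=False)
relative_with_reference = required_inputs("relative", reference=True)
```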

outputs: List[str] property

The outputs of the task are the feedback and the result generated by Prometheus, as well as the model_name, which is automatically included based on the LLM used.

format_input(input)

The input is formatted as a ChatType where the prompt is formatted according to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction from the user, including a pre-defined system prompt.

Source code in src/distilabel/steps/tasks/prometheus_eval.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` where the prompt is formatted according
    to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction
    from the user, including a pre-defined system prompt."""
    template_kwargs = {
        "instruction": input["instruction"],
        "rubric": self.rubrics[self.rubric],
    }
    if self.reference:
        template_kwargs["reference"] = input["reference"]

    if self.mode == "absolute":
        if not isinstance(input["generation"], str):
            raise ValueError(
                f"Provided `generation` is of type {type(input['generation'])} but a string"
                " should be provided instead.",
            )

        template_kwargs["generation"] = input["generation"]
        system_message = (
            "You are a fair judge assistant tasked with providing clear, objective feedback based"
            " on specific criteria, ensuring each assessment reflects the absolute standards set"
            " for performance."
        )
    else:  # self.mode == "relative"
        if (
            not isinstance(input["generations"], list)
            or not all(
                isinstance(generation, str) for generation in input["generations"]
            )
            or len(input["generations"]) != 2
        ):
            raise ValueError(
                f"Provided `generations` is of type {type(input['generations'])} but a list of strings with length 2 should be provided instead."
            )

        template_kwargs["generations"] = input["generations"]
        system_message = (
            "You are a fair judge assistant assigned to deliver insightful feedback that compares"
            " individual performances, highlighting how each stands relative to others within the"
            " same cohort."
        )

    return [
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": self._template.render(**template_kwargs),  # type: ignore
        },
    ]
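
The shape of the value returned by `format_input` in absolute mode can be illustrated without loading the task. This is a rough sketch only: `PROMPT` is a hypothetical stand-in built with the standard library `string.Template`, whereas the task renders a packaged Jinja2 template whose text is not shown here, and `build_messages` and the shortened system message are assumptions made for the example.

```python
from string import Template
from typing import Any, Dict, List

# Hypothetical stand-in for the packaged Jinja2 template; the real prompt text differs.
PROMPT = Template("Instruction: $instruction\nResponse: $generation\nRubric: $rubric")


def build_messages(
    row: Dict[str, Any], rubric: str, system_message: str
) -> List[Dict[str, str]]:
    """Build a two-message chat: a system prompt followed by the rendered user prompt."""
    if not isinstance(row["generation"], str):
        raise ValueError("`generation` must be a string in absolute mode.")
    user_prompt = PROMPT.substitute(
        instruction=row["instruction"],
        generation=row["generation"],
        rubric=rubric,
    )
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    {"instruction": "make something", "generation": "something done"},
    rubric="[A]\nScore 1: A\nScore 2: B\nScore 3: C\nScore 4: D\nScore 5: E",
    system_message="You are a fair judge assistant.",  # shortened for the example
)
```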

format_output(output, input)

The output is formatted as a dict with the keys feedback and result, extracted by splitting the Prometheus output on the [RESULT] marker.

Parameters:

Name Type Description Default
output Union[str, None]

the raw output of the LLM.

required
input Dict[str, Any]

the input to the task. Optionally provided in case it's useful to build the output.

required

Returns:

Type Description
Dict[str, Any]

A dict with the keys feedback and result generated by the LLM.

Source code in src/distilabel/steps/tasks/prometheus_eval.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dict with the keys `feedback` and `result`, extracted
    by splitting the Prometheus output on the `[RESULT]` marker.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Optionally provided in case it's useful to build the output.

    Returns:
        A dict with the keys `feedback` and `result` generated by the LLM.
    """
    if output is None:
        return {"feedback": None, "result": None}

    parts = output.split("[RESULT]")
    if len(parts) != 2:
        return {"feedback": None, "result": None}

    feedback, result = parts[0].strip(), parts[1].strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:") :].strip()
    if self.mode == "absolute":
        if not result.isdigit() or result not in ["1", "2", "3", "4", "5"]:
            return {"feedback": None, "result": None}
        return {"feedback": feedback, "result": int(result)}
    else:  # self.mode == "relative"
        if result not in ["A", "B"]:
            return {"feedback": None, "result": None}
        return {"feedback": feedback, "result": result}
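
The parsing logic above can be exercised without an LLM. The sketch below mirrors it as a free function and adds a small helper that maps a relative result back to the winning generation; `parse_prometheus_output`, `winning_generation`, and the sample output strings are hypothetical and only meant to illustrate the expected format.

```python
from typing import Dict, List, Optional, Union

Parsed = Dict[str, Union[int, str, None]]


def parse_prometheus_output(output: Optional[str], mode: str) -> Parsed:
    """Split a raw completion on `[RESULT]` into feedback and result."""
    if output is None:
        return {"feedback": None, "result": None}
    parts = output.split("[RESULT]")
    if len(parts) != 2:
        # Missing or repeated marker: treat the completion as unparseable.
        return {"feedback": None, "result": None}
    feedback, result = parts[0].strip(), parts[1].strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    if mode == "absolute":
        if result not in {"1", "2", "3", "4", "5"}:
            return {"feedback": None, "result": None}
        return {"feedback": feedback, "result": int(result)}
    if result not in {"A", "B"}:
        return {"feedback": None, "result": None}
    return {"feedback": feedback, "result": result}


def winning_generation(generations: List[str], result: str) -> str:
    """Map a relative result to a generation: "A" is index 0, "B" is index 1."""
    return generations[{"A": 0, "B": 1}[result]]


absolute = parse_prometheus_output("Feedback: Accurate and complete. [RESULT] 4", "absolute")
out_of_range = parse_prometheus_output("Feedback: Great. [RESULT] 6", "absolute")
relative = parse_prometheus_output("Feedback: The second is clearer. [RESULT] B", "relative")
winner = winning_generation(["something done", "other thing"], "B")
```

A score outside the range 1 to 5, or a completion without exactly one `[RESULT]` marker, yields `None` for both keys, so downstream steps may want to filter such rows.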

load()

Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation depending on the mode value, and either with or without reference, depending on the value of reference.

Source code in src/distilabel/steps/tasks/prometheus_eval.py
def load(self) -> None:
    """Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation
    depending on the `mode` value, and either with or without reference, depending on the
    value of `reference`."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "prometheus"
        / (
            f"{self.mode}_without_reference.jinja2"
            if self.reference is False
            else f"{self.mode}_with_reference.jinja2"
        )
    )

    self._template = Template(open(_path).read())
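
The file name chosen in `load()` follows from `mode` and `reference` alone, giving four possible templates. A minimal sketch of that naming rule, where `template_filename` is a hypothetical helper used only for illustration:

```python
def template_filename(mode: str, reference: bool) -> str:
    """Return the template file name selected for a given mode and reference flag."""
    suffix = "with_reference" if reference else "without_reference"
    return f"{mode}_{suffix}.jinja2"


# Enumerate the four combinations of mode and reference.
all_templates = [
    template_filename(mode, reference)
    for mode in ("absolute", "relative")
    for reference in (False, True)
]
```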

QualityScorer

Bases: Task

Score responses based on their quality using an LLM.

QualityScorer is a pre-defined task that takes an instruction and its responses as input and returns scores as output. It rates the quality of each instruction-response pair and implements the quality score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'. The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs are scored in terms of quality, obtaining one quality score per response.

Attributes:

Name Type Description
_template Union[Template, None]

a Jinja2 template used to format the input for the LLM.

Input columns
  • instruction (str): The instruction that was used to generate the responses.
  • responses (List[str]): The responses to be scored. Each response forms a pair with the instruction.
Output columns
  • scores (List[float]): The score for each response.
  • model_name (str): The model name used to generate the scores.
Categories
  • scorer
  • quality
  • response
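References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)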

Examples:

Evaluate the quality of your instructions:

```python
from distilabel.steps.tasks import QualityScorer
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
scorer = QualityScorer(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

scorer.load()

result = next(
    scorer.process(
        [
            {
                "instruction": "instruction",
                "responses": ["good response", "weird response", "bad response"]
            }
        ]
    )
)
# result
# [
#     {
#         'instruction': 'instruction',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'scores': [5, 3, 1],
#     }
# ]
```
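
Since `scores` holds one value per entry in `responses`, the two lists can be zipped to rank the responses. A minimal sketch under that assumption; `rank_responses` is a hypothetical helper, and the values reuse the placeholder data from the example above.

```python
from typing import List, Tuple


def rank_responses(responses: List[str], scores: List[float]) -> List[Tuple[str, float]]:
    """Pair each response with its score and sort from best to worst."""
    if len(responses) != len(scores):
        raise ValueError("Expected exactly one score per response.")
    return sorted(zip(responses, scores), key=lambda pair: pair[1], reverse=True)


ranked = rank_responses(
    ["weird response", "good response", "bad response"],
    [3.0, 5.0, 1.0],
)
best_response, best_score = ranked[0]
```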

Citations:

```
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
```
Source code in src/distilabel/steps/tasks/quality_scorer.py
class QualityScorer(Task):
    """Score responses based on their quality using an `LLM`.

    `QualityScorer` is a pre-defined task that takes an `instruction` and its `responses` as
    input and returns `scores` as output. It rates the quality of each instruction-response
    pair and implements the quality score task from the paper 'What Makes Good Data
    for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
    The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs
    are scored in terms of quality, obtaining one quality score per response.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - instruction (`str`): The instruction that was used to generate the `responses`.
        - responses (`List[str]`): The responses to be scored. Each response forms a pair with the instruction.

    Output columns:
        - scores (`List[float]`): The score for each response.
        - model_name (`str`): The model name used to generate the scores.

    Categories:
        - scorer
        - quality
        - response

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)

    Examples:

        Evaluate the quality of your instructions:

        ```python
        from distilabel.steps.tasks import QualityScorer
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        scorer = QualityScorer(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )

        scorer.load()

        result = next(
            scorer.process(
                [
                    {
                        "instruction": "instruction",
                        "responses": ["good response", "weird response", "bad response"]
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'instruction',
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
        #         'scores': [5.0, 3.0, 1.0],
        #     }
        # ]
        ```

    Citations:

        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```
    """

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "quality-scorer.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are `instruction` and `responses`."""
        return ["instruction", "responses"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    instruction=input["instruction"], responses=input["responses"]
                ),
            }
        ]

    @property
    def outputs(self):
        """The output for the task is a list of `scores` containing the quality score for each
        response in `responses`."""
        return ["scores", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction-response pair.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the key `scores` containing the scores for each instruction-response pair.
        """
        if output is None:
            return {"scores": [None] * len(input["responses"])}

        scores = []
        score_lines = output.split("\n")

        for i, line in enumerate(score_lines):
            match = _PARSE_SCORE_LINE_REGEX.match(line)
            score = float(match.group(1)) if match else None
            scores.append(score)
            if i == len(input["responses"]) - 1:
                break
        return {"scores": scores}

inputs: List[str] property

The inputs for the task are instruction and responses.

outputs property

The output for the task is a list of scores containing the quality score for each response in responses.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/quality_scorer.py
def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                instruction=input["instruction"], responses=input["responses"]
            ),
        }
    ]

format_output(output, input)

The output is formatted as a list with the score of each instruction-response pair.

Parameters:

Name Type Description Default
output Union[str, None]

the raw output of the LLM.

required
input Dict[str, Any]

the input to the task. Used for obtaining the number of responses.

required

Returns:

Type Description
Dict[str, Any]

A dict with the key scores containing the scores for each instruction-response pair.

Source code in src/distilabel/steps/tasks/quality_scorer.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction-response pair.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the key `scores` containing the scores for each instruction-response pair.
    """
    if output is None:
        return {"scores": [None] * len(input["responses"])}

    scores = []
    score_lines = output.split("\n")

    for i, line in enumerate(score_lines):
        match = _PARSE_SCORE_LINE_REGEX.match(line)
        score = float(match.group(1)) if match else None
        scores.append(score)
        if i == len(input["responses"]) - 1:
            break
    return {"scores": scores}
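
The parsing loop above can be tried in isolation. The following is a minimal sketch rather than the library's code: `_PARSE_SCORE_LINE_REGEX` is not shown on this page, so the `SCORE_LINE` pattern below is an assumption (lines such as `[1] score: 5`), and `parse_scores` is a hypothetical helper name.

```python
import re
from typing import List, Optional

# Assumed pattern: the real `_PARSE_SCORE_LINE_REGEX` is not shown on this page.
SCORE_LINE = re.compile(r"\[\d+\] score: (\d+)", re.IGNORECASE)


def parse_scores(output: Optional[str], num_responses: int) -> List[Optional[float]]:
    """Hypothetical stand-in mirroring the loop in `format_output`."""
    if output is None:
        # No LLM output: one `None` per response.
        return [None] * num_responses

    scores: List[Optional[float]] = []
    for i, line in enumerate(output.split("\n")):
        match = SCORE_LINE.match(line)
        # Lines that do not match the pattern are recorded as `None`.
        scores.append(float(match.group(1)) if match else None)
        # Stop once one line per response has been read.
        if i == num_responses - 1:
            break
    return scores


raw = "[1] Score: 5\n[2] Score: 3\n[3] Score: 1\nextra line"
print(parse_scores(raw, 3))
print(parse_scores("no scores here", 3))
print(parse_scores(None, 3))
```

Note that the loop reads at most one line per response, so an output with fewer lines than responses yields a shorter list of scores.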

load()

Loads the Jinja2 template.

Source code in src/distilabel/steps/tasks/quality_scorer.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "quality-scorer.jinja2"
    )

    self._template = Template(open(_path).read())

SelfInstruct

Bases: Task

Generate instructions based on a given input using an LLM.

SelfInstruct is a pre-defined task that, given a number of instructions, certain criteria for query generation, an application description, and an input, generates a number of instructions related to the given input, following the criteria for query generation and the application description. It is based on the SelfInstruct framework from the paper "Self-Instruct: Aligning Language Models with Self-Generated Instructions".

Attributes:

Name Type Description
num_instructions int

The number of instructions to be generated. Defaults to 5.

criteria_for_query_generation str

The criteria for the query generation. Defaults to the criteria defined within the paper.

application_description str

The description of the AI application that one wants to build with these instructions. Defaults to AI assistant.

Input columns
  • input (str): The input to generate the instructions. It's also called seed in the paper.
Output columns
  • instructions (List[str]): The generated instructions.
  • model_name (str): The model name used to generate the instructions.
Categories
  • text-generation
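Reference
  • Self-Instruct: Aligning Language Models with Self-Generated Instructions (https://arxiv.org/abs/2212.10560)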

Examples:

Generate instructions based on a given input:

```python
from distilabel.steps.tasks import SelfInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

self_instruct = SelfInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=5,  # This is the default value
)

self_instruct.load()

result = next(self_instruct.process([{"input": "instruction"}]))
# result
# [
#     {
#         'input': 'instruction',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'instructions': ["instruction 1", "instruction 2", "instruction 3", "instruction 4", "instruction 5"],
#     }
# ]
```

Citations:

```
@misc{wang2023selfinstructaligninglanguagemodels,
    title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
    author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},
    year={2023},
    eprint={2212.10560},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2212.10560},
}
```
Source code in src/distilabel/steps/tasks/self_instruct.py
class SelfInstruct(Task):
    """Generate instructions based on a given input using an `LLM`.

    `SelfInstruct` is a pre-defined task that, given a number of instructions,
    certain criteria for query generation, an application description, and an input,
    generates a number of instructions related to the given input, following the
    criteria for query generation and the application description.
    It is based on the SelfInstruct framework from the paper "Self-Instruct: Aligning
    Language Models with Self-Generated Instructions".

    Attributes:
        num_instructions: The number of instructions to be generated. Defaults to 5.
        criteria_for_query_generation: The criteria for the query generation. Defaults
            to the criteria defined within the paper.
        application_description: The description of the AI application that one wants
            to build with these instructions. Defaults to `AI assistant`.

    Input columns:
        - input (`str`): The input to generate the instructions. It's also called seed in
            the paper.

    Output columns:
        - instructions (`List[str]`): The generated instructions.
        - model_name (`str`): The model name used to generate the instructions.

    Categories:
        - text-generation

    Reference:
        - [`Self-Instruct: Aligning Language Models with Self-Generated Instructions`](https://arxiv.org/abs/2212.10560)

    Examples:

        Generate instructions based on a given input:

        ```python
        from distilabel.steps.tasks import SelfInstruct
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        self_instruct = SelfInstruct(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_instructions=5,  # This is the default value
        )

        self_instruct.load()

        result = next(self_instruct.process([{"input": "instruction"}]))
        # result
        # [
        #     {
        #         'input': 'instruction',
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
        #         'instructions': ["instruction 1", "instruction 2", "instruction 3", "instruction 4", "instruction 5"],
        #     }
        # ]
        ```

    Citations:

        ```
        @misc{wang2023selfinstructaligninglanguagemodels,
            title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
            author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},
            year={2023},
            eprint={2212.10560},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2212.10560},
        }
        ```
    """

    num_instructions: int = 5
    criteria_for_query_generation: str = (
        "Incorporate a diverse range of verbs, avoiding repetition.\n"
        "Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
        "Design queries to be self-contained and standalone.\n"
        'Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.'
    )
    application_description: str = "AI assistant"

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "self-instruct.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `input` i.e. seed text."""
        return ["input"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(
                    input=input["input"],
                    application_description=self.application_description,
                    criteria_for_query_generation=self.criteria_for_query_generation,
                    num_instructions=self.num_instructions,
                ),
            }
        ]

    @property
    def outputs(self):
        """The output for the task is a list of `instructions` containing the generated instructions."""
        return ["instructions", "model_name"]

    def format_output(
        self,
        output: Union[str, None],
        input: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the generated instructions.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Not used by this method.

        Returns:
            A dict containing the generated instructions.
        """
        if output is None:
            return {"instructions": []}
        return {"instructions": [line for line in output.split("\n") if line != ""]}

inputs: List[str] property

The input for the task is the input i.e. seed text.

outputs property

The output for the task is a list of instructions containing the generated instructions.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/self_instruct.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(
                input=input["input"],
                application_description=self.application_description,
                criteria_for_query_generation=self.criteria_for_query_generation,
                num_instructions=self.num_instructions,
            ),
        }
    ]

format_output(output, input=None)

The output is formatted as a list with the generated instructions.

Parameters:

Name Type Description Default
output Union[str, None]

the raw output of the LLM.

required
input Optional[Dict[str, Any]]

the input to the task. Not used by this method.

None

Returns:

Type Description
Dict[str, Any]

A dict containing the generated instructions.

Source code in src/distilabel/steps/tasks/self_instruct.py
def format_output(
    self,
    output: Union[str, None],
    input: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """The output is formatted as a list with the generated instructions.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Not used by this method.

    Returns:
        A dict containing the generated instructions.
    """
    if output is None:
        return {"instructions": []}
    return {"instructions": [line for line in output.split("\n") if line != ""]}
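
To see what this splitting does to a raw completion, here is a small self-contained sketch. `split_instructions` is a hypothetical stand-in that repeats the logic of `format_output` outside the class, and the sample text is made up.

```python
from typing import Dict, List, Optional


def split_instructions(output: Optional[str]) -> Dict[str, List[str]]:
    """Hypothetical stand-in repeating the logic of `SelfInstruct.format_output`."""
    if output is None:
        return {"instructions": []}
    # Only completely empty lines are dropped; nothing else is stripped.
    return {"instructions": [line for line in output.split("\n") if line != ""]}


raw = "Explain photosynthesis.\n\nWhat is the role of chlorophyll?\n"
print(split_instructions(raw))
print(split_instructions(None))
```

Because only empty strings are filtered out, whitespace-only lines and any numbering the LLM adds (for example `1. `) are kept as part of the instructions.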

load()

Loads the Jinja2 template.

Source code in src/distilabel/steps/tasks/self_instruct.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "self-instruct.jinja2"
    )

    self._template = Template(open(_path).read())

StructuredGeneration

Bases: Task

Generate structured content for a given instruction using an LLM.

StructuredGeneration is a pre-defined task that defines the instruction and the structured_output as the inputs, and generation as the output. This task is used to generate structured content based on the input instruction and following the schema provided within the structured_output column for each instruction. The model_name is also returned as part of the output.

Attributes:

Name Type Description
use_system_prompt bool

Whether to use the system prompt in the generation. Defaults to False. When set to True, the system_prompt column is used if it is defined within the input batch; otherwise, it is ignored.

Input columns
  • instruction (str): The instruction to generate structured content from.
  • structured_output (Dict[str, Any]): The structured_output to generate structured content from. It should be a Python dictionary with the keys format and schema, where format should be one of json or regex, and the schema should be either the JSON schema or the regex pattern, respectively.
Output columns
  • generation (str): The generated text matching the provided schema, if possible.
  • model_name (str): The name of the model used to generate the text.
Categories
  • outlines
  • structured-generation

Examples:

Generate structured output from a JSON schema:

```python
from distilabel.steps.tasks import StructuredGeneration
from distilabel.llms import InferenceEndpointsLLM

structured_gen = StructuredGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

structured_gen.load()

result = next(
    structured_gen.process(
        [
            {
                "instruction": "Create an RPG character",
                "structured_output": {
                    "format": "json",
                    "schema": {
                        "properties": {
                            "name": {
                                "title": "Name",
                                "type": "string"
                            },
                            "description": {
                                "title": "Description",
                                "type": "string"
                            },
                            "role": {
                                "title": "Role",
                                "type": "string"
                            },
                            "weapon": {
                                "title": "Weapon",
                                "type": "string"
                            }
                        },
                        "required": [
                            "name",
                            "description",
                            "role",
                            "weapon"
                        ],
                        "title": "Character",
                        "type": "object"
                    }
                },
            }
        ]
    )
)
```

Generate structured output from a regex pattern:

```python
from distilabel.steps.tasks import StructuredGeneration
from distilabel.llms import InferenceEndpointsLLM

structured_gen = StructuredGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

structured_gen.load()

result = next(
    structured_gen.process(
        [
            {
                "instruction": "What's the weather like today in Seattle in Celsius degrees?",
                "structured_output": {
                    "format": "regex",
                    "schema": r"(\d{1,2})°C"
                },
            }
        ]
    )
)
```
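
Since the `generation` column is returned as a plain string, a downstream step may want to check it against the same pattern. The snippet below is a sketch under assumptions: `extract_degrees` is a hypothetical helper, and the sample generations are made up rather than produced by the model.

```python
import re
from typing import Optional

PATTERN = r"(\d{1,2})°C"  # same pattern as in the example above


def extract_degrees(generation: Optional[str]) -> Optional[int]:
    """Hypothetical helper: returns the temperature, or None if there is no match."""
    if generation is None:
        return None
    # `fullmatch` requires the whole string to follow the pattern.
    match = re.fullmatch(PATTERN, generation)
    return int(match.group(1)) if match else None


print(extract_degrees("18°C"))        # made-up generation
print(extract_degrees("quite warm"))  # made-up generation that ignores the pattern
```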
Source code in src/distilabel/steps/tasks/structured_generation.py
class StructuredGeneration(Task):
    """Generate structured content for a given `instruction` using an `LLM`.

    `StructuredGeneration` is a pre-defined task that defines the `instruction` and the `structured_output`
    as the inputs, and `generation` as the output. This task is used to generate structured content based on
    the input instruction and following the schema provided within the `structured_output` column per each
    `instruction`. The `model_name` is also returned as part of the output.

    Attributes:
        use_system_prompt: Whether to use the system prompt in the generation. Defaults to `False`.
            When set to `True`, the `system_prompt` column is used if it is defined within the
            input batch; otherwise, it is ignored.

    Input columns:
        - instruction (`str`): The instruction to generate structured content from.
        - structured_output (`Dict[str, Any]`): The structured_output to generate structured content from. It should be a
            Python dictionary with the keys `format` and `schema`, where `format` should be one of `json` or
            `regex`, and the `schema` should be either the JSON schema or the regex pattern, respectively.

    Output columns:
        - generation (`str`): The generated text matching the provided schema, if possible.
        - model_name (`str`): The name of the model used to generate the text.

    Categories:
        - outlines
        - structured-generation

    Examples:

        Generate structured output from a JSON schema:

        ```python
        from distilabel.steps.tasks import StructuredGeneration
        from distilabel.llms import InferenceEndpointsLLM

        structured_gen = StructuredGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            ),
        )

        structured_gen.load()

        result = next(
            structured_gen.process(
                [
                    {
                        "instruction": "Create an RPG character",
                        "structured_output": {
                            "format": "json",
                            "schema": {
                                "properties": {
                                    "name": {
                                        "title": "Name",
                                        "type": "string"
                                    },
                                    "description": {
                                        "title": "Description",
                                        "type": "string"
                                    },
                                    "role": {
                                        "title": "Role",
                                        "type": "string"
                                    },
                                    "weapon": {
                                        "title": "Weapon",
                                        "type": "string"
                                    }
                                },
                                "required": [
                                    "name",
                                    "description",
                                    "role",
                                    "weapon"
                                ],
                                "title": "Character",
                                "type": "object"
                            }
                        },
                    }
                ]
            )
        )
        ```

        Generate structured output from a regex pattern:

        ```python
        from distilabel.steps.tasks import StructuredGeneration
        from distilabel.llms import InferenceEndpointsLLM

        structured_gen = StructuredGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            ),
        )

        structured_gen.load()

        result = next(
            structured_gen.process(
                [
                    {
                        "instruction": "What's the weather like today in Seattle in Celsius degrees?",
                        "structured_output": {
                            "format": "regex",
                            "schema": r"(\\d{1,2})°C"
                        },
                    }
                ]
            )
        )
        ```
    """

    use_system_prompt: bool = False

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are the `instruction` and the `structured_output`.
        Optionally, if the `use_system_prompt` flag is set to True, then the
        `system_prompt` will be used too."""
        columns = ["instruction", "structured_output"]
        if self.use_system_prompt:
            columns = ["system_prompt"] + columns
        return columns

    def format_input(self, input: Dict[str, Any]) -> StructuredInput:
        """The input is formatted as a `StructuredInput`, i.e. a tuple with the chat messages
        (the instruction being the first user interaction) and the `structured_output`, if provided."""
        if not isinstance(input["instruction"], str):
            raise ValueError(
                f"Input `instruction` must be a string. Got: {input['instruction']}."
            )

        messages = [{"role": "user", "content": input["instruction"]}]
        if self.use_system_prompt:
            if "system_prompt" in input:
                messages.insert(
                    0, {"role": "system", "content": input["system_prompt"]}
                )
            else:
                warnings.warn(
                    "`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.",
                    UserWarning,
                    stacklevel=2,
                )

        return (messages, input.get("structured_output", None))  # type: ignore

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`. Note that even
        if the `structured_output` is defined to produce a JSON schema, this method will return the raw
        output i.e. a string without any parsing."""
        return {"generation": output}

inputs: List[str] property

The inputs for the task are the instruction and the structured_output. Optionally, if the use_system_prompt flag is set to True, then the system_prompt will be used too.

outputs: List[str] property

The output for the task is the generation and the model_name.

format_input(input)

The input is formatted as a StructuredInput, i.e. a tuple with the chat messages (the instruction being the first user interaction) and the structured_output, if provided.

Source code in src/distilabel/steps/tasks/structured_generation.py
def format_input(self, input: Dict[str, Any]) -> StructuredInput:
    """The input is formatted as a `StructuredInput`, i.e. a tuple with the chat messages
    (the instruction being the first user interaction) and the `structured_output`, if provided."""
    if not isinstance(input["instruction"], str):
        raise ValueError(
            f"Input `instruction` must be a string. Got: {input['instruction']}."
        )

    messages = [{"role": "user", "content": input["instruction"]}]
    if self.use_system_prompt:
        if "system_prompt" in input:
            messages.insert(
                0, {"role": "system", "content": input["system_prompt"]}
            )
        else:
            warnings.warn(
                "`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.",
                UserWarning,
                stacklevel=2,
            )

    return (messages, input.get("structured_output", None))  # type: ignore
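
The shape of the value returned by `format_input` can be illustrated without an LLM. This is a minimal sketch, not the library's implementation: `build_structured_input` and the `Messages` alias are hypothetical names, the warning text is shortened, and the sample row is made up.

```python
import warnings
from typing import Any, Dict, List, Optional, Tuple

Messages = List[Dict[str, str]]


def build_structured_input(
    row: Dict[str, Any], use_system_prompt: bool = False
) -> Tuple[Messages, Optional[Dict[str, Any]]]:
    """Hypothetical stand-in mirroring `format_input` above."""
    if not isinstance(row["instruction"], str):
        raise ValueError(
            f"Input `instruction` must be a string. Got: {row['instruction']}."
        )

    messages: Messages = [{"role": "user", "content": row["instruction"]}]
    if use_system_prompt:
        if "system_prompt" in row:
            # The system message is placed before the user message.
            messages.insert(0, {"role": "system", "content": row["system_prompt"]})
        else:
            warnings.warn("No `system_prompt` in the row, so it is ignored.", UserWarning)

    # Rows without a `structured_output` yield `None` as the second element.
    return messages, row.get("structured_output")


row = {
    "system_prompt": "You are a game designer.",
    "instruction": "Create an RPG character",
    "structured_output": {"format": "regex", "schema": r"[A-Z][a-z]+"},
}
messages, structured_output = build_structured_input(row, use_system_prompt=True)
print(messages)
print(structured_output)
```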

format_output(output, input)

The output is formatted as a dictionary with the generation. The model_name will be automatically included within the process method of Task. Note that even if the structured_output is defined to produce a JSON schema, this method will return the raw output i.e. a string without any parsing.

Source code in src/distilabel/steps/tasks/structured_generation.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`. Note that even
    if the `structured_output` is defined to produce a JSON schema, this method will return the raw
    output i.e. a string without any parsing."""
    return {"generation": output}
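
Because `generation` is returned as an unparsed string, consumers that requested a JSON schema typically decode it themselves. The following is a sketch of one way to do that with the standard library; `parse_generation` is a hypothetical helper and the sample string is made up, not real model output.

```python
import json
from typing import Any, Dict, Optional


def parse_generation(generation: Optional[str]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: returns the decoded JSON object, or None if unavailable."""
    if generation is None:
        return None
    try:
        parsed = json.loads(generation)
    except json.JSONDecodeError:
        # The LLM did not produce valid JSON.
        return None
    # Only JSON objects are accepted, matching a schema of type "object".
    return parsed if isinstance(parsed, dict) else None


raw_generation = (
    '{"name": "Aria", "description": "A wandering bard", '
    '"role": "support", "weapon": "lute"}'
)
character = parse_generation(raw_generation)
print(character)
print(parse_generation("not json"))
```

Returning `None` on failure keeps the example short; a real pipeline might prefer to log or re-raise so that malformed generations are not silently dropped.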

TextGeneration

Bases: Task

Simple text generation with an LLM given an instruction.

TextGeneration is a pre-defined task that defines the instruction as the input and generation as the output. This task is used to generate text based on the input instruction. The model_name is also returned as part of the output in order to enhance it.

Attributes:

Name Type Description
use_system_prompt bool

Whether to use the system prompt in the generation. Defaults to True, which means that if the column system_prompt is defined within the input batch, then the system_prompt will be used, otherwise, it will be ignored.

Input columns
  • instruction (str): The instruction to generate text from.
Output columns
  • generation (str): The generated text.
  • model_name (str): The name of the model used to generate the text.
Categories
  • text-generation

Examples:

Generate text from an instruction:

```python
from distilabel.steps.tasks import TextGeneration
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

text_gen.load()

result = next(
    text_gen.process(
        [{"instruction": "your instruction"}]
    )
)
# result
# [
#     {
#         'instruction': 'your instruction',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'generation': 'generation',
#     }
# ]
```
Source code in src/distilabel/steps/tasks/text_generation.py
class TextGeneration(Task):
    """Simple text generation with an `LLM` given an instruction.

    `TextGeneration` is a pre-defined task that defines the `instruction` as the input
    and `generation` as the output. This task is used to generate text based on the input
    instruction. The model_name is also returned as part of the output in order to enhance it.

    Attributes:
        use_system_prompt: Whether to use the system prompt in the generation. Defaults to `True`,
            which means that if the column `system_prompt` is defined within the input batch, then
            the `system_prompt` will be used, otherwise, it will be ignored.

    Input columns:
        - instruction (`str`): The instruction to generate text from.

    Output columns:
        - generation (`str`): The generated text.
        - model_name (`str`): The name of the model used to generate the text.

    Categories:
        - text-generation

    Examples:

        Generate text from an instruction:

        ```python
        from distilabel.steps.tasks import TextGeneration
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        text_gen = TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )

        text_gen.load()

        result = next(
            text_gen.process(
                [{"instruction": "your instruction"}]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'your instruction',
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
        #         'generation': 'generation',
        #     }
        # ]
        ```
    """

    use_system_prompt: bool = True

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`."""
        return ["instruction"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""

        if is_openai_format(input["instruction"]):
            raise ValueError(
                "Providing `instruction` formatted as an OpenAI chat / conversation is"
                " deprecated, you should use `ChatGeneration` with `messages` as input instead.",
            )

        if not isinstance(input["instruction"], str):
            raise ValueError(
                f"Input `instruction` must be a string. Got: {input['instruction']}."
            )

        messages = [{"role": "user", "content": input["instruction"]}]
        if self.use_system_prompt:
            if "system_prompt" in input:
                messages.insert(
                    0, {"role": "system", "content": input["system_prompt"]}
                )
            else:
                warnings.warn(
                    "`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.",
                    UserWarning,
                    stacklevel=2,
                )
        return messages  # type: ignore

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`."""
        return {"generation": output}

inputs: List[str] property

The input for the task is the instruction.

outputs: List[str] property

The output for the task is the generation and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""

    if is_openai_format(input["instruction"]):
        raise ValueError(
            "Providing `instruction` formatted as an OpenAI chat / conversation is"
            " deprecated, you should use `ChatGeneration` with `messages` as input instead.",
        )

    if not isinstance(input["instruction"], str):
        raise ValueError(
            f"Input `instruction` must be a string. Got: {input['instruction']}."
        )

    messages = [{"role": "user", "content": input["instruction"]}]
    if self.use_system_prompt:
        if "system_prompt" in input:
            messages.insert(
                0, {"role": "system", "content": input["system_prompt"]}
            )
        else:
            warnings.warn(
                "`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.",
                UserWarning,
                stacklevel=2,
            )
    return messages  # type: ignore
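
The system-prompt branch above can be tried without an LLM. The following is a minimal sketch, assuming a hypothetical helper `build_messages` that is not part of distilabel and only mirrors the message-building logic; the validation checks are left out.

```python
import warnings
from typing import Any, Dict, List


def build_messages(row: Dict[str, Any], use_system_prompt: bool = True) -> List[Dict[str, str]]:
    """Hypothetical helper mirroring the system-prompt branch of `format_input`."""
    messages = [{"role": "user", "content": row["instruction"]}]
    if use_system_prompt:
        if "system_prompt" in row:
            # The system message is inserted first so it precedes the user turn.
            messages.insert(0, {"role": "system", "content": row["system_prompt"]})
        else:
            warnings.warn(
                "no `system_prompt` in input, so it will be ignored",
                UserWarning,
                stacklevel=2,
            )
    return messages


with_system = build_messages({"instruction": "Say hi.", "system_prompt": "Be brief."})
# The flag is off here, so the missing `system_prompt` does not trigger a warning.
without_system = build_messages({"instruction": "Say hi."}, use_system_prompt=False)

print([m["role"] for m in with_system])     # ['system', 'user']
print([m["role"] for m in without_system])  # ['user']
```

Setting `use_system_prompt=False` is one way to avoid the warning when the input rows never carry a `system_prompt` column.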

format_output(output, input=None)

The output is formatted as a dictionary with the generation. The model_name will be automatically included within the process method of Task.

Source code in src/distilabel/steps/tasks/text_generation.py
def format_output(
    self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`."""
    return {"generation": output}

UltraFeedback

Bases: Task

Rank generations focusing on different aspects using an LLM.

UltraFeedback: Boosting Language Models with High-quality Feedback.

Attributes:

Name Type Description
aspect Literal['helpfulness', 'honesty', 'instruction-following', 'truthfulness', 'overall-rating']

The aspect to evaluate with the UltraFeedback model. The available aspects are:
  • helpfulness: Evaluate text outputs based on helpfulness.
  • honesty: Evaluate text outputs based on honesty.
  • instruction-following: Evaluate text outputs based on given instructions.
  • truthfulness: Evaluate text outputs based on truthfulness.
Additionally, Argilla has defined a custom aspect that evaluates the text outputs as a whole within a single prompt:
  • overall-rating: Evaluate text outputs based on an overall assessment.
Defaults to "overall-rating".

Input columns
  • instruction (str): The reference instruction to evaluate the text outputs.
  • generations (List[str]): The text outputs to evaluate for the given instruction.
Output columns
  • ratings (List[float]): The ratings for each of the provided text outputs.
  • rationales (List[str]): The rationales for each of the provided text outputs.
  • model_name (str): The name of the model used to generate the ratings and rationales.
Categories
  • preference
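References
  • UltraFeedback: Boosting Language Models with High-quality Feedback (https://arxiv.org/abs/2310.01377)
  • UltraFeedback - GitHub Repository (https://github.com/OpenBMB/UltraFeedback)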

Examples:

Rate generations from different LLMs based on the selected aspect:

```python
from distilabel.steps.tasks import UltraFeedback
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

ultrafeedback.load()

result = next(
    ultrafeedback.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generations": ["4", "and a car"],
            }
        ]
    )
)
# result
# [
#     {
#         'instruction': 'How much is 2+2?',
#         'generations': ['4', 'and a car'],
#         'ratings': [1, 2],
#         'rationales': ['explanation for 4', 'explanation for and a car'],
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#     }
# ]
```

Citations:

```
@misc{cui2024ultrafeedbackboostinglanguagemodels,
    title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},
    author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},
    year={2024},
    eprint={2310.01377},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2310.01377},
}
```
Source code in src/distilabel/steps/tasks/ultrafeedback.py
class UltraFeedback(Task):
    """Rank generations focusing on different aspects using an `LLM`.

    UltraFeedback: Boosting Language Models with High-quality Feedback.

    Attributes:
        aspect: The aspect to perform with the `UltraFeedback` model. The available aspects are:
            - `helpfulness`: Evaluate text outputs based on helpfulness.
            - `honesty`: Evaluate text outputs based on honesty.
            - `instruction-following`: Evaluate text outputs based on given instructions.
            - `truthfulness`: Evaluate text outputs based on truthfulness.
            Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall
            assessment of the text outputs within a single prompt. The custom aspect is:
            - `overall-rating`: Evaluate text outputs based on an overall assessment.
            Defaults to `"overall-rating"`.

    Input columns:
        - instruction (`str`): The reference instruction to evaluate the text outputs.
        - generations (`List[str]`): The text outputs to evaluate for the given instruction.

    Output columns:
        - ratings (`List[float]`): The ratings for each of the provided text outputs.
        - rationales (`List[str]`): The rationales for each of the provided text outputs.
        - model_name (`str`): The name of the model used to generate the ratings and rationales.

    Categories:
        - preference

    References:
        - [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377)
        - [`UltraFeedback - GitHub Repository`](https://github.com/OpenBMB/UltraFeedback)

    Examples:

        Rate generations from different LLMs based on the selected aspect:

        ```python
        from distilabel.steps.tasks import UltraFeedback
        from distilabel.llms.huggingface import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        ultrafeedback = UltraFeedback(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )

        ultrafeedback.load()

        result = next(
            ultrafeedback.process(
                [
                    {
                        "instruction": "How much is 2+2?",
                        "generations": ["4", "and a car"],
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'How much is 2+2?',
        #         'generations': ['4', 'and a car'],
        #         'ratings': [1, 2],
        #         'rationales': ['explanation for 4', 'explanation for and a car'],
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
        #     }
        # ]
        ```

    Citations:

        ```
        @misc{cui2024ultrafeedbackboostinglanguagemodels,
            title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},
            author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},
            year={2024},
            eprint={2310.01377},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2310.01377},
        }
        ```
    """

    aspect: Literal[
        "helpfulness",
        "honesty",
        "instruction-following",
        "truthfulness",
        # Custom aspects
        "overall-rating",
    ] = "overall-rating"

    _system_prompt: str = PrivateAttr(
        default=(
            "Your role is to evaluate text quality based on given criteria.\n"
            'You\'ll receive an instructional description ("Instruction") and {no_texts} text outputs ("Text").\n'
            "Understand and interpret instructions to evaluate effectively.\n"
            "Provide annotations for each text with a rating and rationale.\n"
            "The {no_texts} texts given are independent, and should be evaluated separately.\n"
        )
    )
    _template: Optional["Template"] = PrivateAttr(default=...)

    def load(self) -> None:
        """Loads the Jinja2 template for the given `aspect`."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "ultrafeedback"
            / f"{self.aspect}.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`, and the `generations` for it."""
        return ["instruction", "generations"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """The input is formatted as a `ChatType` with a system message stating how many
        texts to evaluate, and a user message that renders the `instruction` and the
        `generations` with the template of the selected `aspect`."""
        return [
            {
                "role": "system",
                "content": self._system_prompt.format(
                    no_texts=len(input["generations"])
                ),
            },
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    instruction=input["instruction"], generations=input["generations"]
                ),
            },
        ]

    @property
    def outputs(self) -> List[str]:
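        """The output columns depend on the `aspect`: either `ratings` and `rationales`, or
        `types`, `rationales`, `ratings`, and `rationales-for-ratings`; plus the `model_name`."""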
        columns = []
        if self.aspect in ["honesty", "instruction-following", "overall-rating"]:
            columns = ["ratings", "rationales"]
        elif self.aspect in ["helpfulness", "truthfulness"]:
            columns = ["types", "rationales", "ratings", "rationales-for-ratings"]
        return columns + ["model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `ratings` and `rationales` for
        each of the provided `generations` for the given `instruction`. The `model_name`
        will be automatically included within the `process` method of `Task`.

        Args:
            output: a string representing the output of the LLM via the `process` method.
            input: the input to the task, as required by some tasks to format the output.

        Returns:
            A dictionary containing either the `ratings` and `rationales` for each of the provided
            `generations` for the given `instruction` if the provided aspect is either `honesty`,
            `instruction-following`, or `overall-rating`; or the `types`, `rationales`,
            `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the
            given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.
        """
        if self.aspect in [
            "honesty",
            "instruction-following",
            "overall-rating",
        ]:
            return self._format_ratings_rationales_output(output, input)
        return self._format_types_ratings_rationales_output(output, input)

    def _format_ratings_rationales_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, List[Any]]:
        """Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`."""
        if output is None:
            return {
                "ratings": [None] * len(input["generations"]),
                "rationales": [None] * len(input["generations"]),
            }

        pattern = r"Rating: (.+?)\nRationale: (.+)"
        sections = output.split("\n\n")

        formatted_outputs = []
        for section in sections:
            matches = None
            if section is not None and section != "":
                matches = re.search(pattern, section, re.DOTALL)
            if not matches:
                formatted_outputs.append({"ratings": None, "rationales": None})
                continue

            formatted_outputs.append(
                {
                    "ratings": int(re.findall(r"\b\d+\b", matches.group(1))[0])
                    if matches.group(1) not in ["None", "N/A"]
                    else None,
                    "rationales": matches.group(2),
                }
            )
        return group_dicts(*formatted_outputs)

    def _format_types_ratings_rationales_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, List[Any]]:
        """Formats the output when the aspect is either `helpfulness` or `truthfulness`."""
        if output is None:
            return {
                "types": [None] * len(input["generations"]),
                "rationales": [None] * len(input["generations"]),
                "ratings": [None] * len(input["generations"]),
                "rationales-for-ratings": [None] * len(input["generations"]),
            }

        pattern = r"Type: (.+?)\nRationale: (.+?)\nRating: (.+?)\nRationale: (.+)"

        sections = output.split("\n\n")

        formatted_outputs = []
        for section in sections:
            matches = None
            if section is not None and section != "":
                matches = re.search(pattern, section, re.DOTALL)
            if not matches:
                formatted_outputs.append(
                    {
                        "types": None,
                        "rationales": None,
                        "ratings": None,
                        "rationales-for-ratings": None,
                    }
                )
                continue

            formatted_outputs.append(
                {
                    "types": int(re.findall(r"\b\d+\b", matches.group(1))[0])
                    if matches.group(1) not in ["None", "N/A"]
                    else None,
                    "rationales": matches.group(2),
                    "ratings": int(re.findall(r"\b\d+\b", matches.group(3))[0])
                    if matches.group(3) not in ["None", "N/A"]
                    else None,
                    "rationales-for-ratings": matches.group(4),
                }
            )
        return group_dicts(*formatted_outputs)

inputs: List[str] property

The input for the task is the instruction, and the generations for it.

outputs: List[str] property

The output columns depend on the aspect: either ratings and rationales, or types, rationales, ratings, and rationales-for-ratings; plus the model_name.
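
According to the `outputs` property in the class source, the returned columns vary with `aspect`. The following is a minimal sketch of that mapping, assuming a hypothetical standalone function `output_columns` (not part of distilabel) with the aspect groups copied from the property.

```python
from typing import List

# Aspect groups as they appear in the `outputs` property of the class source.
RATING_ASPECTS = ["honesty", "instruction-following", "overall-rating"]
TYPED_ASPECTS = ["helpfulness", "truthfulness"]


def output_columns(aspect: str) -> List[str]:
    """Hypothetical standalone version of `UltraFeedback.outputs`."""
    columns: List[str] = []
    if aspect in RATING_ASPECTS:
        columns = ["ratings", "rationales"]
    elif aspect in TYPED_ASPECTS:
        columns = ["types", "rationales", "ratings", "rationales-for-ratings"]
    # `model_name` is always appended, whatever the aspect.
    return columns + ["model_name"]


print(output_columns("overall-rating"))  # ['ratings', 'rationales', 'model_name']
print(output_columns("truthfulness"))
```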

_format_ratings_rationales_output(output, input)

Formats the output when the aspect is either honesty, instruction-following, or overall-rating.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def _format_ratings_rationales_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, List[Any]]:
    """Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`."""
    if output is None:
        return {
            "ratings": [None] * len(input["generations"]),
            "rationales": [None] * len(input["generations"]),
        }

    pattern = r"Rating: (.+?)\nRationale: (.+)"
    sections = output.split("\n\n")

    formatted_outputs = []
    for section in sections:
        matches = None
        if section is not None and section != "":
            matches = re.search(pattern, section, re.DOTALL)
        if not matches:
            formatted_outputs.append({"ratings": None, "rationales": None})
            continue

        formatted_outputs.append(
            {
                "ratings": int(re.findall(r"\b\d+\b", matches.group(1))[0])
                if matches.group(1) not in ["None", "N/A"]
                else None,
                "rationales": matches.group(2),
            }
        )
    return group_dicts(*formatted_outputs)
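
The parsing steps above (split on blank lines, match `Rating:` and `Rationale:`, extract the first integer) can be reproduced on a hand-written string. The following is a minimal sketch, assuming a hypothetical function `parse_ratings` that is not part of distilabel; it builds the per-key lists directly as a stand-in for `group_dicts`, and the sample text is invented for illustration.

```python
import re
from typing import Any, Dict, List, Optional

PATTERN = r"Rating: (.+?)\nRationale: (.+)"


def parse_ratings(output: str) -> Dict[str, List[Any]]:
    """Hypothetical standalone parser following the same steps as the method above."""
    ratings: List[Optional[int]] = []
    rationales: List[Optional[str]] = []
    for section in output.split("\n\n"):
        match = re.search(PATTERN, section, re.DOTALL) if section else None
        if not match:
            # Sections that do not match still occupy a slot, filled with None.
            ratings.append(None)
            rationales.append(None)
            continue
        raw = match.group(1)
        ratings.append(None if raw in ["None", "N/A"] else int(re.findall(r"\b\d+\b", raw)[0]))
        rationales.append(match.group(2))
    # Stand-in for `group_dicts`: one list per key, aligned by section.
    return {"ratings": ratings, "rationales": rationales}


sample = (
    "Rating: 5\nRationale: Correct answer.\n\n"
    "Rating: 1\nRationale: Unrelated to the question."
)
parsed = parse_ratings(sample)
print(parsed["ratings"])  # [5, 1]
print(parsed["rationales"])
```

Because sections are split on blank lines, a rationale that itself contains a blank line is cut in two in this sketch, and the trailing part yields a `None` entry.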

_format_types_ratings_rationales_output(output, input)

Formats the output when the aspect is either helpfulness or truthfulness.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def _format_types_ratings_rationales_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, List[Any]]:
    """Formats the output when the aspect is either `helpfulness` or `truthfulness`."""
    if output is None:
        return {
            "types": [None] * len(input["generations"]),
            "rationales": [None] * len(input["generations"]),
            "ratings": [None] * len(input["generations"]),
            "rationales-for-ratings": [None] * len(input["generations"]),
        }

    pattern = r"Type: (.+?)\nRationale: (.+?)\nRating: (.+?)\nRationale: (.+)"

    sections = output.split("\n\n")

    formatted_outputs = []
    for section in sections:
        matches = None
        if section is not None and section != "":
            matches = re.search(pattern, section, re.DOTALL)
        if not matches:
            formatted_outputs.append(
                {
                    "types": None,
                    "rationales": None,
                    "ratings": None,
                    "rationales-for-ratings": None,
                }
            )
            continue

        formatted_outputs.append(
            {
                "types": int(re.findall(r"\b\d+\b", matches.group(1))[0])
                if matches.group(1) not in ["None", "N/A"]
                else None,
                "rationales": matches.group(2),
                "ratings": int(re.findall(r"\b\d+\b", matches.group(3))[0])
                if matches.group(3) not in ["None", "N/A"]
                else None,
                "rationales-for-ratings": matches.group(4),
            }
        )
    return group_dicts(*formatted_outputs)
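
The four-field pattern above relies on lazy groups so that the first `Rationale:` stops at `Rating:`. The following is a minimal sketch on a single invented section, assuming a hypothetical helper `to_int` (not part of distilabel) that applies the same "None" / "N/A" rule as the method.

```python
import re
from typing import Optional

PATTERN = r"Type: (.+?)\nRationale: (.+?)\nRating: (.+?)\nRationale: (.+)"


def to_int(raw: str) -> Optional[int]:
    """Hypothetical helper: first integer in `raw`, or None for "None" / "N/A"."""
    if raw in ["None", "N/A"]:
        return None
    return int(re.findall(r"\b\d+\b", raw)[0])


section = (
    "Type: N/A\n"
    "Rationale: No hallucination found.\n"
    "Rating: 5\n"
    "Rationale: Fully accurate."
)
match = re.search(PATTERN, section, re.DOTALL)
assert match is not None

row = {
    "types": to_int(match.group(1)),
    "rationales": match.group(2),
    "ratings": to_int(match.group(3)),
    "rationales-for-ratings": match.group(4),
}
print(row["types"], row["ratings"])  # None 5
```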

format_input(input)

The input is formatted as a ChatType with a system message stating how many texts to evaluate, and a user message that renders the instruction and the generations with the template of the selected aspect.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """The input is formatted as a `ChatType` with a system message stating how many
    texts to evaluate, and a user message that renders the `instruction` and the
    `generations` with the template of the selected `aspect`."""
    return [
        {
            "role": "system",
            "content": self._system_prompt.format(
                no_texts=len(input["generations"])
            ),
        },
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                instruction=input["instruction"], generations=input["generations"]
            ),
        },
    ]
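
The system message above is a plain `str.format` call on the `_system_prompt` string, with `no_texts` set to the number of generations. The following is a minimal sketch that copies the prompt from the class source into a module-level constant named `SYSTEM_PROMPT` (a name chosen here for illustration); the user message is omitted because it depends on the packaged Jinja2 templates.

```python
# Prompt text copied from the `_system_prompt` default in the class source.
SYSTEM_PROMPT = (
    "Your role is to evaluate text quality based on given criteria.\n"
    'You\'ll receive an instructional description ("Instruction") and {no_texts} text outputs ("Text").\n'
    "Understand and interpret instructions to evaluate effectively.\n"
    "Provide annotations for each text with a rating and rationale.\n"
    "The {no_texts} texts given are independent, and should be evaluated separately.\n"
)

row = {"instruction": "How much is 2+2?", "generations": ["4", "and a car"]}

system_message = {
    "role": "system",
    # Both `{no_texts}` placeholders receive the number of generations.
    "content": SYSTEM_PROMPT.format(no_texts=len(row["generations"])),
}

print(system_message["content"].splitlines()[1])
# You'll receive an instructional description ("Instruction") and 2 text outputs ("Text").
```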

format_output(output, input)

The output is formatted as a dictionary with the ratings and rationales for each of the provided generations for the given instruction. The model_name will be automatically included within the process method of Task.

Parameters:

Name Type Description Default
output Union[str, None]

a string representing the output of the LLM via the process method.

required
input Dict[str, Any]

the input to the task, as required by some tasks to format the output.

required

Returns:

Type Description
Dict[str, Any]

A dictionary containing either the ratings and rationales for each of the provided generations for the given instruction if the provided aspect is either honesty, instruction-following, or overall-rating; or the types, rationales, ratings, and rationales-for-ratings for each of the provided generations for the given instruction if the provided aspect is either helpfulness or truthfulness.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `ratings` and `rationales` for
    each of the provided `generations` for the given `instruction`. The `model_name`
    will be automatically included within the `process` method of `Task`.

    Args:
        output: a string representing the output of the LLM via the `process` method.
        input: the input to the task, as required by some tasks to format the output.

    Returns:
        A dictionary containing either the `ratings` and `rationales` for each of the provided
        `generations` for the given `instruction` if the provided aspect is either `honesty`,
        `instruction-following`, or `overall-rating`; or the `types`, `rationales`,
        `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the
        given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.
    """
    if self.aspect in [
        "honesty",
        "instruction-following",
        "overall-rating",
    ]:
        return self._format_ratings_rationales_output(output, input)
    return self._format_types_ratings_rationales_output(output, input)

load()

Loads the Jinja2 template for the given aspect.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def load(self) -> None:
    """Loads the Jinja2 template for the given `aspect`."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "ultrafeedback"
        / f"{self.aspect}.jinja2"
    )

    self._template = Template(open(_path).read())