
Tasks

This section contains the API reference for the distilabel tasks. For an example of how to create and use a task, see the Tutorial - Tasks.

GeneratorTask

Bases: _Task, GeneratorStep

GeneratorTask is a class that implements the _Task abstract class and adds the GeneratorStep interface to be used as a step in the pipeline.

Attributes:
  • llm: the LLM to be used to generate the outputs of the task.
  • group_generations: whether to group the num_generations generated per input in a list or create a row per generation. Defaults to False.
  • num_generations: The number of generations to be produced per input.

Source code in src/distilabel/steps/tasks/base.py
class GeneratorTask(_Task, GeneratorStep):
    """GeneratorTask is a class that implements the `_Task` abstract class and adds the
    `GeneratorStep` interface to be used as a step in the pipeline.

    Attributes:
        llm: the `LLM` to be used to generate the outputs of the task.
        group_generations: whether to group the `num_generations` generated per input in
            a list or create a row per generation. Defaults to `False`.
        num_generations: The number of generations to be produced per input.
    """

    pass
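As an illustration, a minimal sketch of a GeneratorTask subclass is shown below. The JokeGeneratorTask name, the joke output column, and the prompt are hypothetical, and the import path is assumed from the source location above; the key point is that a generator task defines outputs, format_output, and a process(offset) method that yields batches together with a flag marking the last one.

from typing import Any, Dict, List, Optional

from distilabel.steps.tasks.base import GeneratorTask


class JokeGeneratorTask(GeneratorTask):
    """Hypothetical generator task that asks the LLM for jokes from scratch."""

    @property
    def outputs(self) -> List[str]:
        return ["joke", "model_name"]

    def format_output(
        self, output: str, input: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        return {"joke": output}

    def process(self, offset: int = 0):
        # Generator tasks build their own prompts instead of formatting upstream inputs
        generations = self.llm.generate(
            inputs=[[{"role": "user", "content": "Tell me a joke."}]],
            num_generations=self.num_generations,
            **self.llm.generation_kwargs,
        )
        rows = [
            {**self.format_output(generation), "model_name": self.llm.model_name}
            for generation in generations[0]
        ]
        yield rows, True  # single batch, so flag it as the last one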

Task

Bases: _Task, Step

Task is a class that implements the _Task abstract class and adds the Step interface to be used as a step in the pipeline.

Attributes:
  • llm: the LLM to be used to generate the outputs of the task.
  • group_generations: whether to group the num_generations generated per input in a list or create a row per generation. Defaults to False.
  • num_generations: The number of generations to be produced per input.

Source code in src/distilabel/steps/tasks/base.py
class Task(_Task, Step):
    """Task is a class that implements the `_Task` abstract class and adds the `Step`
    interface to be used as a step in the pipeline.

    Attributes:
        llm: the `LLM` to be used to generate the outputs of the task.
        group_generations: whether to group the `num_generations` generated per input in
            a list or create a row per generation. Defaults to `False`.
        num_generations: The number of generations to be produced per input.
    """

    @abstractmethod
    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """Abstract method to format the inputs of the task. It needs to receive an input
        as a Python dictionary, and generates an OpenAI chat-like list of dicts."""
        pass

    def _format_inputs(self, inputs: List[Dict[str, Any]]) -> List["ChatType"]:
        """Formats the inputs of the task using the `format_input` method.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list containing the formatted inputs, which are `ChatType`-like following
            the OpenAI formatting.
        """
        return [self.format_input(input) for input in inputs]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """

        formatted_inputs = self._format_inputs(inputs)
        outputs = self.llm.generate(
            inputs=formatted_inputs,
            num_generations=self.num_generations,  # type: ignore
            **self.llm.generation_kwargs,  # type: ignore
        )

        task_outputs = []
        for input, input_outputs in zip(inputs, outputs):
            formatted_outputs = self._format_outputs(input_outputs, inputs)

            if self.group_generations:
                combined = combine_dicts(*formatted_outputs)
                task_outputs.append(
                    {**input, "model_name": self.llm.model_name, **combined}
                )
                continue

            # Create a row per generation
            for formatted_output in formatted_outputs:
                task_outputs.append(
                    {**input, "model_name": self.llm.model_name, **formatted_output}
                )

        yield task_outputs
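As an illustration, a minimal sketch of a Task subclass is shown below. The SummarizationTask name and the text/summary columns are hypothetical, and the ChatType import path is assumed from the distilabel source layout; the key point is that a task defines inputs, format_input, outputs, and format_output, while Task.process handles calling the LLM and adding model_name.

from typing import Any, Dict, List

from distilabel.steps.tasks.base import Task
from distilabel.steps.tasks.typing import ChatType  # assumed import path for `ChatType`


class SummarizationTask(Task):
    """Hypothetical task that summarizes a `text` column into a `summary` column."""

    @property
    def inputs(self) -> List[str]:
        return ["text"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        # Turn the input row into an OpenAI chat-like list of messages
        return [
            {"role": "user", "content": f"Summarize the following text:\n{input['text']}"}
        ]

    @property
    def outputs(self) -> List[str]:
        return ["summary", "model_name"]

    def format_output(self, output: str, input: Dict[str, Any]) -> Dict[str, Any]:
        # `model_name` is added automatically within `Task.process`
        return {"summary": output}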

format_input(input) abstractmethod

Abstract method to format the inputs of the task. It needs to receive an input as a Python dictionary, and generates an OpenAI chat-like list of dicts.

Source code in src/distilabel/steps/tasks/base.py
@abstractmethod
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """Abstract method to format the inputs of the task. It needs to receive an input
    as a Python dictionary, and generates an OpenAI chat-like list of dicts."""
    pass

process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:
  • inputs (StepInput): A list of Python dictionaries with the inputs of the task. Required.

Yields:
  • StepOutput: A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/base.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """

    formatted_inputs = self._format_inputs(inputs)
    outputs = self.llm.generate(
        inputs=formatted_inputs,
        num_generations=self.num_generations,  # type: ignore
        **self.llm.generation_kwargs,  # type: ignore
    )

    task_outputs = []
    for input, input_outputs in zip(inputs, outputs):
        formatted_outputs = self._format_outputs(input_outputs, inputs)

        if self.group_generations:
            combined = combine_dicts(*formatted_outputs)
            task_outputs.append(
                {**input, "model_name": self.llm.model_name, **combined}
            )
            continue

        # Create a row per generation
        for formatted_output in formatted_outputs:
            task_outputs.append(
                {**input, "model_name": self.llm.model_name, **formatted_output}
            )

    yield task_outputs
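To make the group_generations behaviour concrete, the following commented sketch shows the two output shapes, assuming num_generations=2 and that combine_dicts groups the values of each output key into a list:

# With `group_generations=False` (default), each generation becomes its own row:
# [
#     {"instruction": "...", "generation": "first generation", "model_name": "..."},
#     {"instruction": "...", "generation": "second generation", "model_name": "..."},
# ]
#
# With `group_generations=True`, the generations are grouped into a single row:
# [
#     {"instruction": "...", "generation": ["first generation", "second generation"], "model_name": "..."},
# ]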

General Text Generation

TextGeneration

Bases: Task

TextGeneration is a pre-defined task that takes instruction as its input column and produces generation as its output column. It is used to generate text based on the input instruction, and the model_name is also returned as part of the output so that each row records which model produced it.

Input columns
  • instruction (str): The instruction to generate text from.
Output columns
  • generation (str): The generated text.
  • model_name (str): The model name used to generate the text.
Source code in src/distilabel/steps/tasks/text_generation.py
class TextGeneration(Task):
    """TextGeneration is a pre-defined task that defines the `instruction` as the input
    and `generation` as the output. This task is used to generate text based on the input
    instruction. The model_name is also returned as part of the output in order to enhance it.

    Input columns:
        - instruction (`str`): The instruction to generate text from.

    Output columns:
        - generation (`str`): The generated text.
        - model_name (`str`): The model name used to generate the text.
    """

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`."""
        return ["instruction"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""

        instruction = input["instruction"]

        if isinstance(instruction, str):
            return [{"role": "user", "content": input["instruction"]}]

        if not is_openai_format(instruction):
            raise ValueError(
                f"Input `instruction` must be a string or an OpenAI chat-like format. "
                f"Got: {instruction}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'."
            )

        return instruction

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`."""
        return {"generation": output}

inputs: List[str] property

The input for the task is the instruction.

outputs: List[str] property

The output for the task is the generation and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""

    instruction = input["instruction"]

    if isinstance(instruction, str):
        return [{"role": "user", "content": input["instruction"]}]

    if not is_openai_format(instruction):
        raise ValueError(
            f"Input `instruction` must be a string or an OpenAI chat-like format. "
            f"Got: {instruction}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'."
        )

    return instruction
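For reference, both accepted input shapes are shown below, assuming task is a loaded TextGeneration instance as in the sketch above:

# A plain string instruction is wrapped as a single user turn:
task.format_input({"instruction": "What is the capital of France?"})
# -> [{"role": "user", "content": "What is the capital of France?"}]

# An already OpenAI-formatted conversation is validated and returned as-is:
task.format_input(
    {"instruction": [{"role": "user", "content": "What is the capital of France?"}]}
)
# -> [{"role": "user", "content": "What is the capital of France?"}]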

format_output(output, input)

The output is formatted as a dictionary with the generation. The model_name will be automatically included within the process method of Task.

Source code in src/distilabel/steps/tasks/text_generation.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`."""
    return {"generation": output}

Evol Instruct

EvolInstruct

Bases: Task

EvolInstruct is a task that evolves instructions following the Evol-Instruct approach introduced in "WizardLM: Empowering Large Language Models to Follow Complex Instructions".

Attributes:
  • num_evolutions (int): The number of evolutions to be performed.
  • store_evolutions (bool): Whether to store all the evolutions or just the last one. Defaults to False.
  • generate_answers (bool): Whether to generate answers for the evolved instructions. Defaults to False.
  • include_original_instruction (bool): Whether to include the original instruction in the evolved_instructions output column. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the utils.py file.
  • seed (RuntimeParameter[int]): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • evolved_instruction (str): The evolved instruction if store_evolutions=False.
  • evolved_instructions (List[str]): The evolved instructions if store_evolutions=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
  • answer (str): The answer to the evolved instruction if generate_answers=True and store_evolutions=False.
  • answers (List[str]): The answers to the evolved instructions if generate_answers=True and store_evolutions=True.
References
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions (https://arxiv.org/abs/2304.12244)
  • GitHub: h2oai/h2o-wizardlm (https://github.com/h2oai/h2o-wizardlm)
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
class EvolInstruct(Task):
    """WizardLM: Empowering Large Language Models to Follow Complex Instructions

    Attributes:
        num_evolutions: The number of evolutions to be performed.
        store_evolutions: Whether to store all the evolutions or just the last one. Defaults
            to `False`.
        generate_answers: Whether to generate answers for the evolved instructions. Defaults
            to `False`.
        include_original_instruction: Whether to include the original instruction in the
            `evolved_instructions` output column. Defaults to `False`.
        mutation_templates: The mutation templates to be used for evolving the instructions.
            Defaults to the ones provided in the `utils.py` file.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - evolved_instruction (`str`): The evolved instruction if `store_evolutions=False`.
        - evolved_instructions (`List[str]`): The evolved instructions if `store_evolutions=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.
        - answer (`str`): The answer to the evolved instruction if `generate_answers=True`
            and `store_evolutions=False`.
        - answers (`List[str]`): The answers to the evolved instructions if `generate_answers=True`
            and `store_evolutions=True`.

    References:
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
        - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)
    """

    num_evolutions: int
    store_evolutions: bool = False
    generate_answers: bool = False
    include_original_instruction: bool = False
    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is being used in order to randomly pick a mutation method, then is nice to seed a random seed.",
    )

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`."""
        return ["instruction"]

    def format_input(self, input: str) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation. And the
        `system_prompt` is added as the first message if it exists."""
        return [{"role": "user", "content": input}]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `evolved_instruction/s`, the `answer` if `generate_answers=True`
        and the `model_name`."""
        # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,
        # this could be handled always and the value could be included within the DAG validation when
        # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.
        _outputs = [
            (
                "evolved_instruction"
                if not self.store_evolutions
                else "evolved_instructions"
            ),
            "model_name",
        ]
        if self.generate_answers:
            _outputs.append("answer" if not self.store_evolutions else "answers")
        return _outputs

    @override
    def format_output(  # type: ignore
        self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None
    ) -> Dict[str, Any]:  # type: ignore
        """The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,
        depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
        `answer` if `generate_answers=True`; and, finally, the `model_name`.

        Args:
            instructions: The instructions to be included within the output.
            answers: The answers to be included within the output if `generate_answers=True`.

        Returns:
            If `store_evolutions=False` and `generate_answers=True` return {"evolved_instruction": ..., "model_name": ..., "answer": ...};
            if `store_evolutions=True` and `generate_answers=True` return {"evolved_instructions": ..., "model_name": ..., "answer": ...};
            if `store_evolutions=False` and `generate_answers=False` return {"evolved_instruction": ..., "model_name": ...};
            if `store_evolutions=True` and `generate_answers=False` return {"evolved_instructions": ..., "model_name": ...}.
        """
        _output = {}
        if not self.store_evolutions:
            _output["evolved_instruction"] = instructions[-1]
        else:
            _output["evolved_instructions"] = instructions

        if self.generate_answers and answers:
            if not self.store_evolutions:
                _output["answer"] = answers[-1]
            else:
                _output["answers"] = answers

        _output["model_name"] = self.llm.model_name
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates`."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, instruction: str) -> str:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            instruction: The instruction to be included within the mutation prompt.

        Returns:
            A random mutation prompt with the provided instruction.
        """
        mutation = np.random.choice(self.mutation_templates_names)
        return self.mutation_templates[mutation].replace("<PROMPT>", instruction)  # type: ignore

    def _evolve_instructions(self, inputs: "StepInput") -> List[List[str]]:
        """Evolves the instructions provided as part of the inputs of the task.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list where each item is a list with either the last evolved instruction if
            `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
        """

        instructions: List[List[str]] = [[input["instruction"]] for input in inputs]

        for iter_no in range(self.num_evolutions):
            formatted_prompts = []
            for instruction in instructions:
                formatted_prompts.append(self._apply_random_mutation(instruction[-1]))

            formatted_prompts = [
                self.format_input(prompt) for prompt in formatted_prompts
            ]
            generated_prompts = flatten_responses(
                self.llm.generate(
                    formatted_prompts,
                    **self.llm.generation_kwargs,  # type: ignore
                )
            )

            evolved_instructions = []
            for generated_prompt in generated_prompts:
                generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
                evolved_instructions.append(generated_prompt)

            if self.store_evolutions:
                instructions = [
                    instruction + [evolved_instruction]
                    for instruction, evolved_instruction in zip(
                        instructions, evolved_instructions
                    )
                ]
            else:
                instructions = [
                    [evolved_instruction]
                    for evolved_instruction in evolved_instructions
                ]

            self._logger.info(
                f"🔄 Ran iteration {iter_no} evolving {len(instructions)} instructions!"
            )

        return instructions

    def _generate_answers(
        self, evolved_instructions: List[List[str]]
    ) -> List[List[str]]:
        """Generates the answer for the instructions in `instructions`.

        Args:
            evolved_instructions: A list of lists where each item is a list with either the last
                evolved instruction if `store_evolutions=False` or all the evolved instructions
                if `store_evolutions=True`.

        Returns:
            A list of answers for each instruction.
        """
        formatted_instructions = [
            self.format_input(instruction)
            for instructions in evolved_instructions
            for instruction in instructions
        ]

        responses = self.llm.generate(
            formatted_instructions,
            num_generations=1,
            **self.llm.generation_kwargs,  # type: ignore
        )

        step = (
            self.num_evolutions
            if not self.include_original_instruction
            else self.num_evolutions + 1
        )
        return [
            flatten_responses(responses[i : i + step])
            for i in range(0, len(responses), step)
        ]

    @override
    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """

        evolved_instructions = self._evolve_instructions(inputs)

        if self.store_evolutions:
            # Remove the input instruction from the `evolved_instructions` list
            from_ = 1 if not self.include_original_instruction else 0
            evolved_instructions = [
                instruction[from_:] for instruction in evolved_instructions
            ]

        if not self.generate_answers:
            for input, instruction in zip(inputs, evolved_instructions):
                input.update(self.format_output(instruction))
            yield inputs

        self._logger.info(
            f"🎉 Finished evolving {len(evolved_instructions)} instructions!"
        )

        if self.generate_answers:
            self._logger.info(
                f"🧠 Generating answers for the {len(evolved_instructions)} evolved instructions!"
            )

            answers = self._generate_answers(evolved_instructions)

            self._logger.info(
                f"🎉 Finished generating answers for the {len(evolved_instructions)} evolved"
                " instructions!"
            )

            for idx, (input, instruction) in enumerate(
                zip(inputs, evolved_instructions)
            ):
                input.update(self.format_output(instruction, answers[idx]))
            yield inputs
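A minimal usage sketch, assuming an OpenAILLM is available in distilabel.llms and an API key is configured (the model name and instruction are illustrative):

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolInstruct

evol_instruct = EvolInstruct(
    name="evol_instruct",
    llm=OpenAILLM(model="gpt-4"),
    num_evolutions=2,
    store_evolutions=True,
    generate_answers=True,
)
evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "Explain what a data pipeline is."}]))
# [{"instruction": "...", "evolved_instructions": [...], "answers": [...], "model_name": "..."}]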

inputs: List[str] property

The input for the task is the instruction.

mutation_templates_names: List[str] property

Returns the names i.e. keys of the provided mutation_templates.

outputs: List[str] property

The output for the task are the evolved_instruction/s, the answer if generate_answers=True and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation. And the system_prompt is added as the first message if it exists.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def format_input(self, input: str) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation. And the
    `system_prompt` is added as the first message if it exists."""
    return [{"role": "user", "content": input}]

format_output(instructions, answers=None)

The output for the task is a dict with: evolved_instruction or evolved_instructions, depending whether the value is either False or True for store_evolutions, respectively; answer if generate_answers=True; and, finally, the model_name.

Parameters:
  • instructions (Union[str, List[str]]): The instructions to be included within the output. Required.
  • answers (Optional[List[str]]): The answers to be included within the output if generate_answers=True. Defaults to None.

Returns:
  • Dict[str, Any]: The formatted output, depending on the configuration:
    - store_evolutions=False and generate_answers=True: {"evolved_instruction": ..., "model_name": ..., "answer": ...}
    - store_evolutions=True and generate_answers=True: {"evolved_instructions": ..., "model_name": ..., "answers": ...}
    - store_evolutions=False and generate_answers=False: {"evolved_instruction": ..., "model_name": ...}
    - store_evolutions=True and generate_answers=False: {"evolved_instructions": ..., "model_name": ...}

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override
def format_output(  # type: ignore
    self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None
) -> Dict[str, Any]:  # type: ignore
    """The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,
    depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
    `answer` if `generate_answers=True`; and, finally, the `model_name`.

    Args:
        instructions: The instructions to be included within the output.
        answers: The answers to be included within the output if `generate_answers=True`.

    Returns:
        If `store_evolutions=False` and `generate_answers=True` return {"evolved_instruction": ..., "model_name": ..., "answer": ...};
        if `store_evolutions=True` and `generate_answers=True` return {"evolved_instructions": ..., "model_name": ..., "answer": ...};
        if `store_evolutions=False` and `generate_answers=False` return {"evolved_instruction": ..., "model_name": ...};
        if `store_evolutions=True` and `generate_answers=False` return {"evolved_instructions": ..., "model_name": ...}.
    """
    _output = {}
    if not self.store_evolutions:
        _output["evolved_instruction"] = instructions[-1]
    else:
        _output["evolved_instructions"] = instructions

    if self.generate_answers and answers:
        if not self.store_evolutions:
            _output["answer"] = answers[-1]
        else:
            _output["answers"] = answers

    _output["model_name"] = self.llm.model_name
    return _output

process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:
  • inputs (StepInput): A list of Python dictionaries with the inputs of the task. Required.

Yields:
  • StepOutput: A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """

    evolved_instructions = self._evolve_instructions(inputs)

    if self.store_evolutions:
        # Remove the input instruction from the `evolved_instructions` list
        from_ = 1 if not self.include_original_instruction else 0
        evolved_instructions = [
            instruction[from_:] for instruction in evolved_instructions
        ]

    if not self.generate_answers:
        for input, instruction in zip(inputs, evolved_instructions):
            input.update(self.format_output(instruction))
        yield inputs

    self._logger.info(
        f"🎉 Finished evolving {len(evolved_instructions)} instructions!"
    )

    if self.generate_answers:
        self._logger.info(
            f"🧠 Generating answers for the {len(evolved_instructions)} evolved instructions!"
        )

        answers = self._generate_answers(evolved_instructions)

        self._logger.info(
            f"🎉 Finished generating answers for the {len(evolved_instructions)} evolved"
            " instructions!"
        )

        for idx, (input, instruction) in enumerate(
            zip(inputs, evolved_instructions)
        ):
            input.update(self.format_output(instruction, answers[idx]))
        yield inputs

EvolInstructGenerator

Bases: GeneratorTask

EvolInstructGenerator is a task that generates evolved instructions from scratch, following the Evol-Instruct approach introduced in "WizardLM: Empowering Large Language Models to Follow Complex Instructions".

Attributes:
  • num_instructions (int): The number of instructions to be generated.
  • generate_answers (bool): Whether to generate answers for the instructions or not. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for the generation of the instructions.
  • min_length (RuntimeParameter[int]): Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.
  • max_length (RuntimeParameter[int]): Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.
  • seed (RuntimeParameter[int]): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
  • instruction (str): The generated instruction if generate_answers=False.
  • answer (str): The generated answer if generate_answers=True.
  • instructions (List[str]): The generated instructions if generate_answers=True.
  • model_name (str): The name of the LLM used to generate and evolve the instructions.
References
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions (https://arxiv.org/abs/2304.12244)
  • GitHub: h2oai/h2o-wizardlm (https://github.com/h2oai/h2o-wizardlm)
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
class EvolInstructGenerator(GeneratorTask):
    """WizardLM: Empowering Large Language Models to Follow Complex Instructions

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs
            to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs
            to be lower than, to be considered valid.
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Output columns:
        - instruction (`str`): The generated instruction if `generate_answers=False`.
        - answer (`str`): The generated answer if `generate_answers=True`.
        - instructions (`List[str]`): The generated instructions if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to generate and evolve the instructions.

    References:
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
        - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)
    """

    num_instructions: int
    generate_answers: bool = False
    mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES

    min_length: RuntimeParameter[int] = Field(
        default=512,
        description="Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.",
    )
    max_length: RuntimeParameter[int] = Field(
        default=1024,
        description="Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.",
    )

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is being used in order to randomly pick a mutation method, then is nice to seed a random seed.",
    )
    _seed_texts: Optional[List[str]] = PrivateAttr(default_factory=list)
    _prompts: Optional[List[str]] = PrivateAttr(default_factory=list)

    def _generate_seed_texts(self) -> List[str]:
        """Generates a list of seed texts to be used as part of the starting prompts for the task.

        It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and
        a list of English words will be used to generate the seed texts that will be provided to the
        mutation method and included within the prompt.

        Returns:
            A list of seed texts to be used as part of the starting prompts for the task.
        """
        seed_texts = []
        for _ in range(self.num_instructions * 10):
            num_words = np.random.choice([1, 2, 3, 4])
            seed_texts.append(
                self.mutation_templates["FRESH_START"].replace(  # type: ignore
                    "<PROMPT>",
                    ", ".join(
                        [
                            np.random.choice(self._english_nouns).strip()
                            for _ in range(num_words)
                        ]
                    ),
                )
            )
        return seed_texts

    @override
    def model_post_init(self, __context: Any) -> None:
        """Override this method to perform additional initialization after `__init__` and `model_construct`.
        This is useful if you want to do some validation that requires the entire model to be initialized.
        """
        super().model_post_init(__context)

        np.random.seed(self.seed)

        self._seed_texts = self._generate_seed_texts()
        self._prompts = [
            np.random.choice(self._seed_texts) for _ in range(self.num_instructions)
        ]

    @cached_property
    def _english_nouns(self) -> List[str]:
        """A list of English nouns to be used as part of the starting prompts for the task.

        References:
            - https://github.com/h2oai/h2o-wizardlm
        """
        _path = str(
            importlib_resources.files("distilabel")
            / "steps/tasks/evol_instruct/english_nouns.txt"
        )
        with open(_path, mode="r") as f:
            return [line.strip() for line in f.readlines()]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `instruction`, the `answer` if `generate_answers=True`
        and the `model_name`."""
        _outputs = ["instruction", "model_name"]
        if self.generate_answers:
            _outputs.append("answer")
        return _outputs

    def format_output(  # type: ignore
        self, instruction: str, answer: Optional[str] = None
    ) -> Dict[str, Any]:
        """The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;
        and, finally, the `model_name`.

        Args:
            instruction: The instruction to be included within the output.
            answer: The answer to be included within the output if `generate_answers=True`.

        Returns:
            If `generate_answers=True` return {"instruction": ..., "answer": ..., "model_name": ...};
            if `generate_answers=False` return {"instruction": ..., "model_name": ...};
        """
        _output = {
            "instruction": instruction,
            "model_name": self.llm.model_name,
        }
        if self.generate_answers and answer is not None:
            _output["answer"] = answer
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates`."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, iter_no: int) -> List["ChatType"]:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            iter_no: The iteration number to be used to check whether the iteration is the
                first one i.e. FRESH_START, or not.

        Returns:
            A random mutation prompt with the provided instruction formatted as an OpenAI conversation.
        """
        prompts = []
        for idx in range(self.num_instructions):
            if (
                iter_no == 0
                or "Write one question or request containing" in self._prompts[idx]  # type: ignore
            ):
                mutation = "FRESH_START"
            else:
                mutation = np.random.choice(self.mutation_templates_names)
                if mutation == "FRESH_START":
                    self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore

            prompt_with_template = (
                self.mutation_templates[mutation].replace(  # type: ignore
                    "<PROMPT>",
                    self._prompts[idx],  # type: ignore
                )  # type: ignore
                if iter_no != 0
                else self._prompts[idx]  # type: ignore
            )
            prompts.append([{"role": "user", "content": prompt_with_template}])
        return prompts

    def _generate_answers(self, instructions: List[List[str]]) -> List[str]:
        """Generates the answer for the last instruction in `instructions`.

        Args:
            instructions: A list of lists where each item is a list with either the last
                evolved instruction if `store_evolutions=False` or all the evolved instructions
                if `store_evolutions=True`.

        Returns:
            A list of answers for the last instruction in `instructions`.
        """
        # TODO: update to generate answers for all the instructions
        _formatted_instructions = [
            [{"role": "user", "content": instruction[-1]}]
            for instruction in instructions
        ]
        responses = self.llm.generate(
            _formatted_instructions,
            **self.llm.generation_kwargs,  # type: ignore
        )
        return flatten_responses(responses)

    @override
    def process(self, offset: int = 0) -> "GeneratorStepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            offset: The offset to start the generation from. Defaults to 0.

        Yields:
            A list of Python dictionaries with the outputs of the task, and a boolean
            flag indicating whether the task has finished or not i.e. is the last batch.
        """
        instructions = []
        mutation_no = 0

        iter_no = 0
        while len(instructions) < self.num_instructions:
            prompts = self._apply_random_mutation(iter_no=iter_no)

            generated_prompts = flatten_responses(
                self.llm.generate(prompts, **self.llm.generation_kwargs)  # type: ignore
            )
            for idx, generated_prompt in enumerate(generated_prompts):
                generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
                if self.max_length >= len(generated_prompt) >= self.min_length:  # type: ignore
                    instructions.append(generated_prompt)
                    self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore
                else:
                    self._prompts[idx] = generated_prompt  # type: ignore

            self._logger.info(
                f"🔄 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!"
            )
            iter_no += 1

            if len(instructions) > self.num_instructions:
                instructions = instructions[: self.num_instructions]
            if len(instructions) > mutation_no:
                mutation_no = len(instructions) - mutation_no

            if not self.generate_answers and len(instructions[-mutation_no:]) > 0:
                yield (
                    [
                        self.format_output(mutated_instruction)
                        for mutated_instruction in instructions[-mutation_no:]
                    ],
                    len(instructions) >= self.num_instructions,
                )

        self._logger.info(f"🎉 Finished evolving {len(instructions)} instructions!")

        if self.generate_answers:
            self._logger.info(
                f"🧠 Generating answers for the {len(instructions)} evolved instructions!"
            )

            answers = self._generate_answers(instructions)

            self._logger.info(
                f"🎉 Finished generating answers for the {len(instructions)} evolved instructions!"
            )

            yield (
                [
                    self.format_output(instruction, answer)
                    for instruction, answer in zip(instructions, answers)
                ],
                True,
            )
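A minimal usage sketch, assuming an OpenAILLM is available in distilabel.llms and an API key is configured (the model name is illustrative); being a generator task, process takes no input rows and yields batches together with a last-batch flag:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolInstructGenerator

generator = EvolInstructGenerator(
    name="evol_instruct_generator",
    llm=OpenAILLM(model="gpt-4"),
    num_instructions=10,
    generate_answers=False,
)
generator.load()

for batch, is_last_batch in generator.process():
    ...  # each batch is a list of {"instruction": ..., "model_name": ...} rows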

mutation_templates_names: List[str] property

Returns the names i.e. keys of the provided mutation_templates.

outputs: List[str] property

The output for the task are the instruction, the answer if generate_answers=True and the model_name.

format_output(instruction, answer=None)

The output for the task is a dict with: instruction; answer if generate_answers=True; and, finally, the model_name.

Parameters:
  • instruction (str): The instruction to be included within the output. Required.
  • answer (Optional[str]): The answer to be included within the output if generate_answers=True. Defaults to None.

Returns:
  • Dict[str, Any]: If generate_answers=True, {"instruction": ..., "answer": ..., "model_name": ...}; if generate_answers=False, {"instruction": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def format_output(  # type: ignore
    self, instruction: str, answer: Optional[str] = None
) -> Dict[str, Any]:
    """The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;
    and, finally, the `model_name`.

    Args:
        instruction: The instruction to be included within the output.
        answer: The answer to be included within the output if `generate_answers=True`.

    Returns:
        If `generate_answers=True` return {"instruction": ..., "answer": ..., "model_name": ...};
        if `generate_answers=False` return {"instruction": ..., "model_name": ...};
    """
    _output = {
        "instruction": instruction,
        "model_name": self.llm.model_name,
    }
    if self.generate_answers and answer is not None:
        _output["answer"] = answer
    return _output

model_post_init(__context)

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
@override
def model_post_init(self, __context: Any) -> None:
    """Override this method to perform additional initialization after `__init__` and `model_construct`.
    This is useful if you want to do some validation that requires the entire model to be initialized.
    """
    super().model_post_init(__context)

    np.random.seed(self.seed)

    self._seed_texts = self._generate_seed_texts()
    self._prompts = [
        np.random.choice(self._seed_texts) for _ in range(self.num_instructions)
    ]

process(offset=0)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:
  • offset (int): The offset to start the generation from. Defaults to 0.

Yields:
  • GeneratorStepOutput: A list of Python dictionaries with the outputs of the task, and a boolean flag indicating whether the task has finished or not, i.e. whether it is the last batch.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
@override
def process(self, offset: int = 0) -> "GeneratorStepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        offset: The offset to start the generation from. Defaults to 0.

    Yields:
        A list of Python dictionaries with the outputs of the task, and a boolean
        flag indicating whether the task has finished or not i.e. is the last batch.
    """
    instructions = []
    mutation_no = 0

    iter_no = 0
    while len(instructions) < self.num_instructions:
        prompts = self._apply_random_mutation(iter_no=iter_no)

        generated_prompts = flatten_responses(
            self.llm.generate(prompts, **self.llm.generation_kwargs)  # type: ignore
        )
        for idx, generated_prompt in enumerate(generated_prompts):
            generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
            if self.max_length >= len(generated_prompt) >= self.min_length:  # type: ignore
                instructions.append(generated_prompt)
                self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore
            else:
                self._prompts[idx] = generated_prompt  # type: ignore

        self._logger.info(
            f"🔄 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!"
        )
        iter_no += 1

        if len(instructions) > self.num_instructions:
            instructions = instructions[: self.num_instructions]
        if len(instructions) > mutation_no:
            mutation_no = len(instructions) - mutation_no

        if not self.generate_answers and len(instructions[-mutation_no:]) > 0:
            yield (
                [
                    self.format_output(mutated_instruction)
                    for mutated_instruction in instructions[-mutation_no:]
                ],
                len(instructions) >= self.num_instructions,
            )

    self._logger.info(f"🎉 Finished evolving {len(instructions)} instructions!")

    if self.generate_answers:
        self._logger.info(
            f"🧠 Generating answers for the {len(instructions)} evolved instructions!"
        )

        answers = self._generate_answers(instructions)

        self._logger.info(
            f"🎉 Finished generating answers for the {len(instructions)} evolved instructions!"
        )

        yield (
            [
                self.format_output(instruction, answer)
                for instruction, answer in zip(instructions, answers)
            ],
            True,
        )

Evol Complexity

EvolComplexity

Bases: EvolInstruct

EvolComplexity is a task that evolves instructions to make them more complex. It is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.

Attributes:
  • num_instructions: The number of instructions to be generated.
  • generate_answers: Whether to generate answers for the instructions or not. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for the generation of the instructions.
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • evolved_instruction (str): The evolved instruction.
  • answer (str, optional): The answer to the instruction if generate_answers=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions (https://arxiv.org/abs/2304.12244)
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
class EvolComplexity(EvolInstruct):
    """EvolComplexity is a task that evolves instructions to make them more complex,
    and it is based in the EvolInstruct task, but using slight different prompts, but the
    exact same evolutionary approach.

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
        - `seed`: The number of evolutions to be run.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - evolved_instruction (`str`): The evolved instruction.
        - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
    """

    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES
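A minimal usage sketch, assuming EvolComplexity is importable from distilabel.steps.tasks and an OpenAILLM with an API key is available (the model name and instruction are illustrative):

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolComplexity

evol_complexity = EvolComplexity(
    name="evol_complexity",
    llm=OpenAILLM(model="gpt-4"),
    num_evolutions=2,
)
evol_complexity.load()

result = next(evol_complexity.process([{"instruction": "Explain what a data pipeline is."}]))
# [{"instruction": "...", "evolved_instruction": "...", "model_name": "..."}]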

EvolComplexityGenerator

Bases: EvolInstructGenerator

EvolComplexityGenerator is a task that generates instructions from scratch and evolves them to make them more complex. It is based on the EvolInstructGenerator task, using slightly different prompts but the exact same evolutionary approach.

Attributes:
  • num_instructions: The number of instructions to be generated.
  • generate_answers: Whether to generate answers for the instructions or not. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for the generation of the instructions.
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • instruction (str): The evolved instruction.
  • answer (str, optional): The answer to the instruction if generate_answers=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions (https://arxiv.org/abs/2304.12244)
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
class EvolComplexityGenerator(EvolInstructGenerator):
    """EvolComplexity is a task that evolves instructions to make them more complex,
    and it is based in the EvolInstruct task, but using slight different prompts, but the
    exact same evolutionary approach.

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
        - `seed`: The number of evolutions to be run.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - instruction (`str`): The evolved instruction.
        - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
    """

    mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES
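A minimal usage sketch, assuming EvolComplexityGenerator is importable from distilabel.steps.tasks and an OpenAILLM with an API key is available (the model name is illustrative):

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolComplexityGenerator

generator = EvolComplexityGenerator(
    name="evol_complexity_generator",
    llm=OpenAILLM(model="gpt-4"),
    num_instructions=10,
)
generator.load()

for batch, is_last_batch in generator.process():
    ...  # each batch is a list of {"instruction": ..., "model_name": ...} rows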

Evol Quality

EvolQuality

Bases: Task

The EvolQuality task is used to evolve the quality of the responses given a prompt, by generating a new response with a language model. This step implements the evolution quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:
  • num_evolutions (int): The number of evolutions to be performed on the responses.
  • store_evolutions (bool): Whether to store all the evolved responses or just the last one. Defaults to False.
  • include_original_response (bool): Whether to include the original response within the evolved responses. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used to evolve the responses.
  • seed (RuntimeParameter[int]): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction that was used to generate the responses.
  • response (str): The response to be rewritten.
Output columns
  • evolved_response (str): The evolved response if store_evolutions=False.
  • evolved_responses (List[str]): The evolved responses if store_evolutions=True.
  • model_name (str): The name of the LLM used to evolve the responses.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)
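A minimal usage sketch, assuming an OpenAILLM is available in distilabel.llms and an API key is configured (the model name, instruction, and response are illustrative):

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolQuality

evol_quality = EvolQuality(
    name="evol_quality",
    llm=OpenAILLM(model="gpt-4"),
    num_evolutions=2,
)
evol_quality.load()

result = next(
    evol_quality.process(
        [{"instruction": "What is a data pipeline?", "response": "A series of steps."}]
    )
)
# [{"instruction": "...", "response": "...", "evolved_response": "...", "model_name": "..."}]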
Source code in src/distilabel/steps/tasks/evol_quality/base.py
class EvolQuality(Task):
    """The `EvolQuality` task is used to evolve the quality of the responses given a prompt,
    by generating a new response with a language model. This step implements the evolution
    quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
    Automatic Data Selection in Instruction Tuning'.

    Attributes:
        num_evolutions: The number of evolutions to be performed on the responses.
        store_evolutions: Whether to store all the evolved responses or just the last one.
            Defaults to `False`.
        include_original_response: Whether to include the original response within the evolved
            responses. Defaults to `False`.
        mutation_templates: The mutation templates to be used to evolve the responses.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction that was used to generate the `responses`.
        - response (`str`): The responses to be rewritten.

    Output columns:
        - evolved_response (`str`): The evolved response if `store_evolutions=False`.
        - evolved_responses (`List[str]`): The evolved responses if `store_evolutions=True`.
        - model_name (`str`): The name of the LLM used to evolve the responses.

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)
    """

    num_evolutions: int
    store_evolutions: bool = False
    include_original_response: bool = False
    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is being used in order to randomly pick a mutation method, then is nice to set a random seed.",
    )

    @override
    def model_post_init(self, __context: Any) -> None:
        """Override this method to perform additional initialization after `__init__` and `model_construct`.
        This is useful if you want to do some validation that requires the entire model to be initialized.
        """
        super().model_post_init(__context)

    @property
    def inputs(self) -> List[str]:
        """The input for the task are the `instruction` and `response`."""
        return ["instruction", "response"]

    def format_input(self, input: str) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [{"role": "user", "content": input}]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `evolved_response/s` and the `model_name`."""
        # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,
        # this could be handled always and the value could be included within the DAG validation when
        # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.
        _outputs = [
            ("evolved_response" if not self.store_evolutions else "evolved_responses"),
            "model_name",
        ]

        return _outputs

    def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]:  # type: ignore
        """The output for the task is a dict with: `evolved_response` or `evolved_responses`,
        depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
        and, finally, the `model_name`.

        Args:
            responses: The responses to be included within the output.

        Returns:
            if `store_evolutions=False` return {"evolved_response": ..., "model_name": ...};
            if `store_evolutions=True` return {"evolved_responses": ..., "model_name": ...}.
        """
        _output = {}

        if not self.store_evolutions:
            _output["evolved_response"] = responses[-1]
        else:
            _output["evolved_responses"] = responses

        _output["model_name"] = self.llm.model_name
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates` enum."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, instruction: str, response: str) -> str:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction and response within the mutation prompt.

        Args:
            instruction: The instruction to be included within the mutation prompt.
            response: The response to be included within the mutation prompt.

        Returns:
            A random mutation prompt with the provided instruction and response.
        """
        mutation = np.random.choice(self.mutation_templates_names)
        return (
            self.mutation_templates[mutation]
            .replace("<PROMPT>", instruction)
            .replace("<RESPONSE>", response)
        )

    def _evolve_reponses(self, inputs: "StepInput") -> List[List[str]]:
        """Evolves the responses provided as part of the inputs of the task.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list where each item is a list with either the last evolved response if
            `store_evolutions=False` or all the evolved responses if `store_evolutions=True`.
        """
        np.random.seed(self.seed)
        instructions: List[List[str]] = [[input["instruction"]] for input in inputs]
        responses: List[List[str]] = [[input["response"]] for input in inputs]

        for iter_no in range(self.num_evolutions):
            formatted_prompts = []
            for instruction, response in zip(instructions, responses):
                formatted_prompts.append(
                    self._apply_random_mutation(instruction[-1], response[-1])
                )

            formatted_prompts = [
                self.format_input(prompt) for prompt in formatted_prompts
            ]

            generated_responses = self.llm.generate(
                formatted_prompts,
                **self.llm.generation_kwargs,  # type: ignore
            )

            if self.store_evolutions:
                responses = [
                    response + [evolved_response[0]]
                    for response, evolved_response in zip(
                        responses, generated_responses
                    )
                ]
            else:
                responses = [
                    [evolved_response[0]] for evolved_response in generated_responses
                ]

            self._logger.info(
                f"🔄 Ran iteration {iter_no} evolving {len(responses)} responses!"
            )

        return responses

    @override
    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list of Python dictionaries with the outputs of the task.
        """

        responses = self._evolve_reponses(inputs)

        if self.store_evolutions:
            # Remove the input instruction from the `evolved_responses` list
            from_ = 1 if not self.include_original_response else 0
            responses = [response[from_:] for response in responses]

        for input, response in zip(inputs, responses):
            input.update(self.format_output(response))
        yield inputs

        self._logger.info(f"🎉 Finished evolving {len(responses)} responses!")

inputs: List[str] property

The input for the task are the instruction and response.

mutation_templates_names: List[str] property

Returns the names i.e. keys of the provided mutation_templates enum.

outputs: List[str] property

The output for the task are the evolved_response/s and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
def format_input(self, input: str) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [{"role": "user", "content": input}]

format_output(responses)

The output for the task is a dict with: evolved_response or evolved_responses, depending whether the value is either False or True for store_evolutions, respectively; and, finally, the model_name.

Parameters:

Name Type Description Default
responses Union[str, List[str]]

The responses to be included within the output.

required

Returns:

Type Description
Dict[str, Any]

if store_evolutions=False return {"evolved_response": ..., "model_name": ...};

Dict[str, Any]

if store_evolutions=True return {"evolved_responses": ..., "model_name": ...}.
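
For illustration only (the values below are invented), the two return shapes look like this; `model_name` comes from the task's `llm`:

# store_evolutions=False: only the last element of the evolution history is kept
single = {"evolved_response": "final rewrite of the response", "model_name": "gpt-4"}

# store_evolutions=True: the whole evolution history is kept
history = {"evolved_responses": ["first rewrite", "final rewrite"], "model_name": "gpt-4"}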

Source code in src/distilabel/steps/tasks/evol_quality/base.py
def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]:  # type: ignore
    """The output for the task is a dict with: `evolved_response` or `evolved_responses`,
    depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
    and, finally, the `model_name`.

    Args:
        responses: The responses to be included within the output.

    Returns:
        if `store_evolutions=False` return {"evolved_response": ..., "model_name": ...};
        if `store_evolutions=True` return {"evolved_responses": ..., "model_name": ...}.
    """
    _output = {}

    if not self.store_evolutions:
        _output["evolved_response"] = responses[-1]
    else:
        _output["evolved_responses"] = responses

    _output["model_name"] = self.llm.model_name
    return _output

model_post_init(__context)

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
@override
def model_post_init(self, __context: Any) -> None:
    """Override this method to perform additional initialization after `__init__` and `model_construct`.
    This is useful if you want to do some validation that requires the entire model to be initialized.
    """
    super().model_post_init(__context)

process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Returns:

Type Description
StepOutput

A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
@override
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Returns:
        A list of Python dictionaries with the outputs of the task.
    """

    responses = self._evolve_reponses(inputs)

    if self.store_evolutions:
        # Remove the input instruction from the `evolved_responses` list
        from_ = 1 if not self.include_original_response else 0
        responses = [response[from_:] for response in responses]

    for input, response in zip(inputs, responses):
        input.update(self.format_output(response))
    yield inputs

    self._logger.info(f"🎉 Finished evolving {len(responses)} responses!")

DEITA Scorers

ComplexityScorer

Bases: Task

This task is used to rank a list of instructions based on their complexity. It's an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:

Name Type Description
_template Union[Template, None]

The Jinja2 template used to format the input data.

Input columns
  • instructions (List[str]): The list of instructions to be scored.
Output columns
  • complexity_score (List[float]): The complexity score for each instruction.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
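
A minimal usage sketch under the same assumptions as above (an `OpenAILLM` importable from `distilabel.llms`; any other `LLM` implementation works the same way):

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import ComplexityScorer

scorer = ComplexityScorer(llm=OpenAILLM(model="gpt-4"))
scorer.load()

result = next(
    scorer.process(
        [{"instructions": ["Say hello.", "Implement quicksort and analyse its complexity."]}]
    )
)
# Each row gains a `scores` key with one float (or None, if a line could not
# be parsed) per instruction.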
Source code in src/distilabel/steps/tasks/complexity_scorer.py
class ComplexityScorer(Task):
    """This task is used to rank a list of instructions based on their complexity. It's
    an implementation of the complexity score task from the paper 'What Makes Good Data
    for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

    Attributes:
        _template: The Jinja2 template used to format the input data.

    Input columns:
        - instructions (`List[str]`): The list of instructions to be scored.

    Output columns:
        - complexity_score (`List[float]`): The complexity score for each instruction.

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)
    """

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        super().load()
        self._template = Template(_COMPLEXITY_SCORER_TEMPLATE)

    @property
    def inputs(self) -> List[str]:
        return ["instructions"]

    @property
    def outputs(self) -> List[str]:
        return ["scores"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        return [{"role": "user", "content": self._template.render(**input)}]  # type: ignore

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        if output is None:
            return {"scores": [None] * len(input["instructions"])}

        scores = []
        score_lines = output.split("\n")
        for i, line in enumerate(score_lines):
            match = _PARSE_SCORE_LINE_REGEX.match(line)
            score = float(match.group(1)) if match else None
            scores.append(score)
            if i == len(input["instructions"]) - 1:
                break

        return {"scores": scores}

QualityScorer

Bases: Task

QualityScorer is a pre-defined task that defines the instruction as the input and score as the output. This task is used to rate the quality of instructions and responses. It's an implementation of the quality score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'. The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs are scored in terms of quality, obtaining a quality score for each instruction.

Attributes:

Name Type Description
_template Union[Template, None]

a Jinja2 template used to format the input for the LLM.

Input columns
  • instruction (str): The instruction that was used to generate the responses.
  • responses (List[str]): The responses to be scored. Each response forms a pair with the instruction.
Output columns
  • quality_score (List[float]): The quality score for each instruction.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
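
A minimal usage sketch under the same assumptions as for `ComplexityScorer` above:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import QualityScorer

scorer = QualityScorer(llm=OpenAILLM(model="gpt-4"))
scorer.load()

result = next(
    scorer.process(
        [
            {
                "instruction": "Explain recursion to a beginner.",
                "responses": [
                    "Recursion is when a function calls itself on a smaller input.",
                    "It is a kind of loop.",
                ],
            }
        ]
    )
)
# Each row gains a `scores` key with one float (or None) per response.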
Source code in src/distilabel/steps/tasks/quality_scorer.py
class QualityScorer(Task):
    """QualityScorer is a pre-defined task that defines the `instruction` as the input
    and `score` as the output. This task is used to rate the quality of instructions and responses.
    It's an implementation of the quality score task from the paper 'What Makes Good Data
    for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
    The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs
    are scored in terms of quality, obtaining a quality score for each instruction.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - instruction (`str`): The instruction that was used to generate the `responses`.
        - responses (`List[str]`): The responses to be scored. Each response forms a pair with the instruction.

    Output columns:
        - quality_score (`List[float]`): The quality score for each instruction.

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)
    """

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        super().load()
        self._template = Template(_QUALITY_SCORER_TEMPLATE)

    @property
    def inputs(self) -> List[str]:
        """The input for the task are `instruction` and `responses`."""
        return ["instruction", "responses"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [{"role": "user", "content": self._template.render(**input)}]  # type: ignore

    @property
    def outputs(self):
        """The output for the task is a list of `quality_scores` containing the quality score for each
        response in `responses`."""
        return ["scores"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction-response pair.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict containing the scores for each instruction-response pair.
        """

        if output is None:
            return {self.outputs[0]: [None] * len(input["responses"])}

        scores = []
        score_lines = output.split("\n")

        for i, line in enumerate(score_lines):
            match = _PARSE_SCORE_LINE_REGEX.match(line)
            score = float(match.group(1)) if match else None
            scores.append(score)
            if i == len(input["responses"]) - 1:
                break

        return {self.outputs[0]: scores}

inputs: List[str] property

The input for the task are instruction and responses.

outputs property

The output for the task is a list of quality_scores containing the quality score for each response in responses.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/quality_scorer.py
def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [{"role": "user", "content": self._template.render(**input)}]  # type: ignore

format_output(output, input)

The output is formatted as a list with the score of each instruction-response pair.

Parameters:

Name Type Description Default
output Union[str, None]

the raw output of the LLM.

required
input Dict[str, Any]

the input to the task. Used for obtaining the number of responses.

required

Returns:

Type Description
Dict[str, Any]

A dict containing the scores for each instruction-response pair.

Source code in src/distilabel/steps/tasks/quality_scorer.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction-response pair.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict containing the scores for each instruction-response pair.
    """

    if output is None:
        return {self.outputs[0]: [None] * len(input["responses"])}

    scores = []
    score_lines = output.split("\n")

    for i, line in enumerate(score_lines):
        match = _PARSE_SCORE_LINE_REGEX.match(line)
        score = float(match.group(1)) if match else None
        scores.append(score)
        if i == len(input["responses"]) - 1:
            break

    return {self.outputs[0]: scores}