Index

ChatType = List[ChatItem] module-attribute

ChatType is a type alias for a list of dicts following the OpenAI conversational format.
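
For reference, a ChatType value is a standard OpenAI-style conversation, e.g.:

chat: "ChatType" = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]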

ComplexityScorer

Bases: Task

This task is used to rank a list of instructions based on their complexity. It's an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:
  • _template (Union[Template, None]): The Jinja2 template used to format the input data.

Input columns
  • instructions (List[str]): The list of instructions to be scored.
Output columns
  • scores (List[float]): The complexity score for each instruction.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
Source code in src/distilabel/steps/tasks/complexity_scorer.py
class ComplexityScorer(Task):
    """This task is used to rank a list of instructions based on their complexity. It's
    an implementation of the complexity score task from the paper 'What Makes Good Data
    for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

    Attributes:
        _template: The Jinja2 template used to format the input data.

    Input columns:
        - instructions (`List[str]`): The list of instructions to be scored.

    Output columns:
        - scores (`List[float]`): The complexity score for each instruction.

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)
    """

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        super().load()
        self._template = Template(_COMPLEXITY_SCORER_TEMPLATE)

    @property
    def inputs(self) -> List[str]:
        return ["instructions"]

    @property
    def outputs(self) -> List[str]:
        return ["scores"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        return [{"role": "user", "content": self._template.render(**input)}]  # type: ignore

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        if output is None:
            return {"scores": [None] * len(input["instructions"])}

        scores = []
        score_lines = output.split("\n")
        for i, line in enumerate(score_lines):
            match = _PARSE_SCORE_LINE_REGEX.match(line)
            score = float(match.group(1)) if match else None
            scores.append(score)
            if i == len(input["instructions"]) - 1:
                break

        return {"scores": scores}
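
A minimal usage sketch follows. The `OpenAILLM` wrapper and import paths are illustrative assumptions; any `LLM` subclass provided by distilabel should work:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import ComplexityScorer

# `OpenAILLM` is illustrative; swap in whichever `LLM` subclass you use.
scorer = ComplexityScorer(llm=OpenAILLM(model="gpt-4"))
scorer.load()  # loads the LLM and compiles the Jinja2 template

result = next(
    scorer.process([{"instructions": ["Say hello.", "Explain quantum field theory."]}])
)
# Each row gains a `scores` column: one float per instruction, or None
# where a score line could not be parsed from the model output.
print(result[0]["scores"])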

EvolComplexity

Bases: EvolInstruct

EvolComplexity is a task that evolves instructions to make them more complex. It is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.

Attributes:
  • num_evolutions: The number of evolutions to be performed.
  • generate_answers: Whether to generate answers for the instructions or not. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for the generation of the instructions.
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.
  • seed (RuntimeParameter[int]): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • evolved_instruction (str): The evolved instruction.
  • answer (str, optional): The answer to the instruction if generate_answers=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
class EvolComplexity(EvolInstruct):
    """EvolComplexity is a task that evolves instructions to make them more complex,
    and it is based in the EvolInstruct task, but using slight different prompts, but the
    exact same evolutionary approach.

    Attributes:
        num_evolutions: The number of evolutions to be performed.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - evolved_instruction (`str`): The evolved instruction.
        - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
    """

    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES
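
A minimal usage sketch, again with an illustrative `OpenAILLM` wrapper. Since `EvolComplexity` subclasses `EvolInstruct`, the evolution count is configured through the inherited `num_evolutions` attribute:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolComplexity

# `OpenAILLM` is illustrative; `num_evolutions` is inherited from `EvolInstruct`.
evol = EvolComplexity(
    llm=OpenAILLM(model="gpt-4"),
    num_evolutions=2,        # two mutation rounds per instruction
    generate_answers=True,   # also emit an `answer` column
)
evol.load()

result = next(evol.process([{"instruction": "Write a haiku about the sea."}]))
# -> each row gains `evolved_instruction`, `answer` and `model_name` columns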

EvolComplexityGenerator

Bases: EvolInstructGenerator

EvolComplexityGenerator is a generation task that evolves instructions to make them more complex. It is based on the EvolInstructGenerator task, using slightly different prompts but the exact same evolutionary approach.

Attributes:
  • num_instructions (int): The number of instructions to be generated.
  • generate_answers (bool): Whether to generate answers for the instructions or not. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for the generation of the instructions.
  • min_length (RuntimeParameter[int]): Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.
  • max_length (RuntimeParameter[int]): Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.
  • seed (RuntimeParameter[int]): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • instruction (str): The evolved instruction.
  • answer (str, optional): The answer to the instruction if generate_answers=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
class EvolComplexityGenerator(EvolInstructGenerator):
    """EvolComplexity is a task that evolves instructions to make them more complex,
    and it is based in the EvolInstruct task, but using slight different prompts, but the
    exact same evolutionary approach.

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - instruction (`str`): The evolved instruction.
        - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
    """

    mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES
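
A minimal usage sketch with an illustrative LLM wrapper. As a generator task it takes no input columns, and `process()` yields `(batch, is_last_batch)` tuples:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolComplexityGenerator

generator = EvolComplexityGenerator(
    llm=OpenAILLM(model="gpt-4"),  # illustrative; any `LLM` subclass works
    num_instructions=5,            # required by the `EvolInstructGenerator` base
)
generator.load()

for batch, is_last_batch in generator.process():
    for row in batch:
        print(row["instruction"], row["model_name"])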

EvolInstruct

Bases: Task

WizardLM: Empowering Large Language Models to Follow Complex Instructions

Attributes:
  • num_evolutions (int): The number of evolutions to be performed.
  • store_evolutions (bool): Whether to store all the evolutions or just the last one. Defaults to False.
  • generate_answers (bool): Whether to generate answers for the evolved instructions. Defaults to False.
  • include_original_instruction (bool): Whether to include the original instruction in the evolved_instructions output column. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the utils.py file.
  • seed (RuntimeParameter[int]): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • evolved_instruction (str): The evolved instruction if store_evolutions=False.
  • evolved_instructions (List[str]): The evolved instructions if store_evolutions=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
  • answer (str): The answer to the evolved instruction if generate_answers=True and store_evolutions=False.
  • answers (List[str]): The answers to the evolved instructions if generate_answers=True and store_evolutions=True.
References
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244
  • GitHub: h2oai/h2o-wizardlm: https://github.com/h2oai/h2o-wizardlm
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
class EvolInstruct(Task):
    """WizardLM: Empowering Large Language Models to Follow Complex Instructions

    Attributes:
        num_evolutions: The number of evolutions to be performed.
        store_evolutions: Whether to store all the evolutions or just the last one. Defaults
            to `False`.
        generate_answers: Whether to generate answers for the evolved instructions. Defaults
            to `False`.
        include_original_instruction: Whether to include the original instruction in the
            `evolved_instructions` output column. Defaults to `False`.
        mutation_templates: The mutation templates to be used for evolving the instructions.
            Defaults to the ones provided in the `utils.py` file.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - evolved_instruction (`str`): The evolved instruction if `store_evolutions=False`.
        - evolved_instructions (`List[str]`): The evolved instructions if `store_evolutions=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.
        - answer (`str`): The answer to the evolved instruction if `generate_answers=True`
            and `store_evolutions=False`.
        - answers (`List[str]`): The answers to the evolved instructions if `generate_answers=True`
            and `store_evolutions=True`.

    References:
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
        - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)
    """

    num_evolutions: int
    store_evolutions: bool = False
    generate_answers: bool = False
    include_original_instruction: bool = False
    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is used to randomly pick a mutation method, it is convenient to set a random seed.",
    )

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`."""
        return ["instruction"]

    def format_input(self, input: str) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation. And the
        `system_prompt` is added as the first message if it exists."""
        return [{"role": "user", "content": input}]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `evolved_instruction/s`, the `answer` if `generate_answers=True`
        and the `model_name`."""
        # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,
        # this could be handled always and the value could be included within the DAG validation when
        # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.
        _outputs = [
            (
                "evolved_instruction"
                if not self.store_evolutions
                else "evolved_instructions"
            ),
            "model_name",
        ]
        if self.generate_answers:
            _outputs.append("answer" if not self.store_evolutions else "answers")
        return _outputs

    @override
    def format_output(  # type: ignore
        self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None
    ) -> Dict[str, Any]:  # type: ignore
        """The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,
        depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
        `answer` if `generate_answers=True`; and, finally, the `model_name`.

        Args:
            instructions: The instructions to be included within the output.
            answers: The answers to be included within the output if `generate_answers=True`.

        Returns:
            If `store_evolutions=False` and `generate_answers=True` return {"evolved_instruction": ..., "model_name": ..., "answer": ...};
            if `store_evolutions=True` and `generate_answers=True` return {"evolved_instructions": ..., "model_name": ..., "answers": ...};
            if `store_evolutions=False` and `generate_answers=False` return {"evolved_instruction": ..., "model_name": ...};
            if `store_evolutions=True` and `generate_answers=False` return {"evolved_instructions": ..., "model_name": ...}.
        """
        _output = {}
        if not self.store_evolutions:
            _output["evolved_instruction"] = instructions[-1]
        else:
            _output["evolved_instructions"] = instructions

        if self.generate_answers and answers:
            if not self.store_evolutions:
                _output["answer"] = answers[-1]
            else:
                _output["answers"] = answers

        _output["model_name"] = self.llm.model_name
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates`."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, instruction: str) -> str:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            instruction: The instruction to be included within the mutation prompt.

        Returns:
            A random mutation prompt with the provided instruction.
        """
        mutation = np.random.choice(self.mutation_templates_names)
        return self.mutation_templates[mutation].replace("<PROMPT>", instruction)  # type: ignore

    def _evolve_instructions(self, inputs: "StepInput") -> List[List[str]]:
        """Evolves the instructions provided as part of the inputs of the task.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list where each item is a list with either the last evolved instruction if
            `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
        """

        instructions: List[List[str]] = [[input["instruction"]] for input in inputs]

        for iter_no in range(self.num_evolutions):
            formatted_prompts = []
            for instruction in instructions:
                formatted_prompts.append(self._apply_random_mutation(instruction[-1]))

            formatted_prompts = [
                self.format_input(prompt) for prompt in formatted_prompts
            ]
            generated_prompts = flatten_responses(
                self.llm.generate(
                    formatted_prompts,
                    **self.llm.generation_kwargs,  # type: ignore
                )
            )

            evolved_instructions = []
            for generated_prompt in generated_prompts:
                generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
                evolved_instructions.append(generated_prompt)

            if self.store_evolutions:
                instructions = [
                    instruction + [evolved_instruction]
                    for instruction, evolved_instruction in zip(
                        instructions, evolved_instructions
                    )
                ]
            else:
                instructions = [
                    [evolved_instruction]
                    for evolved_instruction in evolved_instructions
                ]

            self._logger.info(
                f"🔄 Ran iteration {iter_no} evolving {len(instructions)} instructions!"
            )

        return instructions

    def _generate_answers(
        self, evolved_instructions: List[List[str]]
    ) -> List[List[str]]:
        """Generates the answer for the instructions in `instructions`.

        Args:
            evolved_instructions: A list of lists where each item is a list with either the last
                evolved instruction if `store_evolutions=False` or all the evolved instructions
                if `store_evolutions=True`.

        Returns:
            A list of answers for each instruction.
        """
        formatted_instructions = [
            self.format_input(instruction)
            for instructions in evolved_instructions
            for instruction in instructions
        ]

        responses = self.llm.generate(
            formatted_instructions,
            num_generations=1,
            **self.llm.generation_kwargs,  # type: ignore
        )

        step = (
            self.num_evolutions
            if not self.include_original_instruction
            else self.num_evolutions + 1
        )
        return [
            flatten_responses(responses[i : i + step])
            for i in range(0, len(responses), step)
        ]

    @override
    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """

        evolved_instructions = self._evolve_instructions(inputs)

        if self.store_evolutions:
            # Remove the input instruction from the `evolved_instructions` list
            from_ = 1 if not self.include_original_instruction else 0
            evolved_instructions = [
                instruction[from_:] for instruction in evolved_instructions
            ]

        if not self.generate_answers:
            for input, instruction in zip(inputs, evolved_instructions):
                input.update(self.format_output(instruction))
            yield inputs

        self._logger.info(
            f"🎉 Finished evolving {len(evolved_instructions)} instructions!"
        )

        if self.generate_answers:
            self._logger.info(
                f"🧠 Generating answers for the {len(evolved_instructions)} evolved instructions!"
            )

            answers = self._generate_answers(evolved_instructions)

            self._logger.info(
                f"🎉 Finished generating answers for the {len(evolved_instructions)} evolved"
                " instructions!"
            )

            for idx, (input, instruction) in enumerate(
                zip(inputs, evolved_instructions)
            ):
                input.update(self.format_output(instruction, answers[idx]))
            yield inputs
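
A minimal usage sketch, assuming an illustrative `OpenAILLM` wrapper; with `store_evolutions=True` every intermediate evolution round is kept:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolInstruct

evol = EvolInstruct(
    llm=OpenAILLM(model="gpt-4"),  # illustrative; any `LLM` subclass works
    num_evolutions=2,
    store_evolutions=True,  # keep all rounds in `evolved_instructions`
)
evol.load()

result = next(evol.process([{"instruction": "Summarize the plot of Hamlet."}]))
# -> each row gains `evolved_instructions` (one entry per round) and `model_name`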

inputs: List[str] property

The input for the task is the instruction.

mutation_templates_names: List[str] property

Returns the names i.e. keys of the provided mutation_templates.

outputs: List[str] property

The outputs of the task are the evolved_instruction/s, the answer if generate_answers=True, and the model_name.

format_input(input)

The input is formatted as a ChatType, assuming that the instruction is the first interaction from the user within a conversation; the system_prompt is added as the first message if it exists.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def format_input(self, input: str) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation. And the
    `system_prompt` is added as the first message if it exists."""
    return [{"role": "user", "content": input}]

format_output(instructions, answers=None)

The output for the task is a dict with: evolved_instruction or evolved_instructions, depending on whether store_evolutions is False or True, respectively; answer if generate_answers=True; and, finally, the model_name.

Parameters:
  • instructions (Union[str, List[str]], required): The instructions to be included within the output.
  • answers (Optional[List[str]], default: None): The answers to be included within the output if generate_answers=True.

Returns:
  • Dict[str, Any]: If store_evolutions=False and generate_answers=True, return {"evolved_instruction": ..., "model_name": ..., "answer": ...}; if store_evolutions=True and generate_answers=True, return {"evolved_instructions": ..., "model_name": ..., "answers": ...}; if store_evolutions=False and generate_answers=False, return {"evolved_instruction": ..., "model_name": ...}; if store_evolutions=True and generate_answers=False, return {"evolved_instructions": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override
def format_output(  # type: ignore
    self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None
) -> Dict[str, Any]:  # type: ignore
    """The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,
    depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
    `answer` if `generate_answers=True`; and, finally, the `model_name`.

    Args:
        instructions: The instructions to be included within the output.
        answers: The answers to be included within the output if `generate_answers=True`.

    Returns:
        If `store_evolutions=False` and `generate_answers=True` return {"evolved_instruction": ..., "model_name": ..., "answer": ...};
        if `store_evolutions=True` and `generate_answers=True` return {"evolved_instructions": ..., "model_name": ..., "answers": ...};
        if `store_evolutions=False` and `generate_answers=False` return {"evolved_instruction": ..., "model_name": ...};
        if `store_evolutions=True` and `generate_answers=False` return {"evolved_instructions": ..., "model_name": ...}.
    """
    _output = {}
    if not self.store_evolutions:
        _output["evolved_instruction"] = instructions[-1]
    else:
        _output["evolved_instructions"] = instructions

    if self.generate_answers and answers:
        if not self.store_evolutions:
            _output["answer"] = answers[-1]
        else:
            _output["answers"] = answers

    _output["model_name"] = self.llm.model_name
    return _output
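
To make the four return shapes above concrete, here are illustrative values (placeholder strings, not real generations):

# store_evolutions=False, generate_answers=False
out = {"evolved_instruction": "...", "model_name": "gpt-4"}

# store_evolutions=True, generate_answers=False
out = {"evolved_instructions": ["...", "..."], "model_name": "gpt-4"}

# store_evolutions=False, generate_answers=True
out = {"evolved_instruction": "...", "answer": "...", "model_name": "gpt-4"}

# store_evolutions=True, generate_answers=True
out = {"evolved_instructions": ["...", "..."], "answers": ["...", "..."], "model_name": "gpt-4"}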

process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:
  • inputs (StepInput, required): A list of Python dictionaries with the inputs of the task.

Yields:
  • StepOutput: A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """

    evolved_instructions = self._evolve_instructions(inputs)

    if self.store_evolutions:
        # Remove the input instruction from the `evolved_instructions` list
        from_ = 1 if not self.include_original_instruction else 0
        evolved_instructions = [
            instruction[from_:] for instruction in evolved_instructions
        ]

    if not self.generate_answers:
        for input, instruction in zip(inputs, evolved_instructions):
            input.update(self.format_output(instruction))
        yield inputs

    self._logger.info(
        f"🎉 Finished evolving {len(evolved_instructions)} instructions!"
    )

    if self.generate_answers:
        self._logger.info(
            f"🧠 Generating answers for the {len(evolved_instructions)} evolved instructions!"
        )

        answers = self._generate_answers(evolved_instructions)

        self._logger.info(
            f"🎉 Finished generating answers for the {len(evolved_instructions)} evolved"
            " instructions!"
        )

        for idx, (input, instruction) in enumerate(
            zip(inputs, evolved_instructions)
        ):
            input.update(self.format_output(instruction, answers[idx]))
        yield inputs

EvolInstructGenerator

Bases: GeneratorTask

WizardLM: Empowering Large Language Models to Follow Complex Instructions

Attributes:
  • num_instructions (int): The number of instructions to be generated.
  • generate_answers (bool): Whether to generate answers for the instructions or not. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used for the generation of the instructions.
  • min_length (RuntimeParameter[int]): Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.
  • max_length (RuntimeParameter[int]): Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.
  • seed (RuntimeParameter[int]): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
  • instruction (str): The generated instruction if generate_answers=False.
  • answer (str): The generated answer if generate_answers=True.
  • instructions (List[str]): The generated instructions if generate_answers=True.
  • model_name (str): The name of the LLM used to generate and evolve the instructions.
References
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244
  • GitHub: h2oai/h2o-wizardlm: https://github.com/h2oai/h2o-wizardlm
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
class EvolInstructGenerator(GeneratorTask):
    """WizardLM: Empowering Large Language Models to Follow Complex Instructions

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs
            to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs
            to be lower than, to be considered valid.
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Output columns:
        - instruction (`str`): The generated instruction if `generate_answers=False`.
        - answer (`str`): The generated answer if `generate_answers=True`.
        - instructions (`List[str]`): The generated instructions if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to generate and evolve the instructions.

    References:
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
        - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)
    """

    num_instructions: int
    generate_answers: bool = False
    mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES

    min_length: RuntimeParameter[int] = Field(
        default=512,
        description="Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.",
    )
    max_length: RuntimeParameter[int] = Field(
        default=1024,
        description="Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.",
    )

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is used to randomly pick a mutation method, it is convenient to set a random seed.",
    )
    _seed_texts: Optional[List[str]] = PrivateAttr(default_factory=list)
    _prompts: Optional[List[str]] = PrivateAttr(default_factory=list)

    def _generate_seed_texts(self) -> List[str]:
        """Generates a list of seed texts to be used as part of the starting prompts for the task.

        It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and
        a list of English words will be used to generate the seed texts that will be provided to the
        mutation method and included within the prompt.

        Returns:
            A list of seed texts to be used as part of the starting prompts for the task.
        """
        seed_texts = []
        for _ in range(self.num_instructions * 10):
            num_words = np.random.choice([1, 2, 3, 4])
            seed_texts.append(
                self.mutation_templates["FRESH_START"].replace(  # type: ignore
                    "<PROMPT>",
                    ", ".join(
                        [
                            np.random.choice(self._english_nouns).strip()
                            for _ in range(num_words)
                        ]
                    ),
                )
            )
        return seed_texts

    @override
    def model_post_init(self, __context: Any) -> None:
        """Override this method to perform additional initialization after `__init__` and `model_construct`.
        This is useful if you want to do some validation that requires the entire model to be initialized.
        """
        super().model_post_init(__context)

        np.random.seed(self.seed)

        self._seed_texts = self._generate_seed_texts()
        self._prompts = [
            np.random.choice(self._seed_texts) for _ in range(self.num_instructions)
        ]

    @cached_property
    def _english_nouns(self) -> List[str]:
        """A list of English nouns to be used as part of the starting prompts for the task.

        References:
            - https://github.com/h2oai/h2o-wizardlm
        """
        _path = str(
            importlib_resources.files("distilabel")
            / "steps/tasks/evol_instruct/english_nouns.txt"
        )
        with open(_path, mode="r") as f:
            return [line.strip() for line in f.readlines()]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `instruction`, the `answer` if `generate_answers=True`
        and the `model_name`."""
        _outputs = ["instruction", "model_name"]
        if self.generate_answers:
            _outputs.append("answer")
        return _outputs

    def format_output(  # type: ignore
        self, instruction: str, answer: Optional[str] = None
    ) -> Dict[str, Any]:
        """The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;
        and, finally, the `model_name`.

        Args:
            instruction: The instruction to be included within the output.
            answer: The answer to be included within the output if `generate_answers=True`.

        Returns:
            If `generate_answers=True` return {"instruction": ..., "answer": ..., "model_name": ...};
            if `generate_answers=False` return {"instruction": ..., "model_name": ...};
        """
        _output = {
            "instruction": instruction,
            "model_name": self.llm.model_name,
        }
        if self.generate_answers and answer is not None:
            _output["answer"] = answer
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates`."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, iter_no: int) -> List["ChatType"]:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            iter_no: The iteration number to be used to check whether the iteration is the
                first one i.e. FRESH_START, or not.

        Returns:
            A random mutation prompt with the provided instruction formatted as an OpenAI conversation.
        """
        prompts = []
        for idx in range(self.num_instructions):
            if (
                iter_no == 0
                or "Write one question or request containing" in self._prompts[idx]  # type: ignore
            ):
                mutation = "FRESH_START"
            else:
                mutation = np.random.choice(self.mutation_templates_names)
                if mutation == "FRESH_START":
                    self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore

            prompt_with_template = (
                self.mutation_templates[mutation].replace(  # type: ignore
                    "<PROMPT>",
                    self._prompts[idx],  # type: ignore
                )  # type: ignore
                if iter_no != 0
                else self._prompts[idx]  # type: ignore
            )
            prompts.append([{"role": "user", "content": prompt_with_template}])
        return prompts

    def _generate_answers(self, instructions: List[List[str]]) -> List[str]:
        """Generates the answer for the last instruction in `instructions`.

        Args:
            instructions: A list of lists where each item is a list with either the last
                evolved instruction if `store_evolutions=False` or all the evolved instructions
                if `store_evolutions=True`.

        Returns:
            A list of answers for the last instruction in `instructions`.
        """
        # TODO: update to generate answers for all the instructions
        _formatted_instructions = [
            [{"role": "user", "content": instruction[-1]}]
            for instruction in instructions
        ]
        responses = self.llm.generate(
            _formatted_instructions,
            **self.llm.generation_kwargs,  # type: ignore
        )
        return flatten_responses(responses)

    @override
    def process(self, offset: int = 0) -> "GeneratorStepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            offset: The offset to start the generation from. Defaults to 0.

        Yields:
            A list of Python dictionaries with the outputs of the task, and a boolean
            flag indicating whether the task has finished or not i.e. is the last batch.
        """
        instructions = []
        mutation_no = 0

        iter_no = 0
        while len(instructions) < self.num_instructions:
            prompts = self._apply_random_mutation(iter_no=iter_no)

            generated_prompts = flatten_responses(
                self.llm.generate(prompts, **self.llm.generation_kwargs)  # type: ignore
            )
            for idx, generated_prompt in enumerate(generated_prompts):
                generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
                if self.max_length >= len(generated_prompt) >= self.min_length:  # type: ignore
                    instructions.append(generated_prompt)
                    self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore
                else:
                    self._prompts[idx] = generated_prompt  # type: ignore

            self._logger.info(
                f"🔄 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!"
            )
            iter_no += 1

            if len(instructions) > self.num_instructions:
                instructions = instructions[: self.num_instructions]
            if len(instructions) > mutation_no:
                mutation_no = len(instructions) - mutation_no

            if not self.generate_answers and len(instructions[-mutation_no:]) > 0:
                yield (
                    [
                        self.format_output(mutated_instruction)
                        for mutated_instruction in instructions[-mutation_no:]
                    ],
                    len(instructions) >= self.num_instructions,
                )

        self._logger.info(f"🎉 Finished evolving {len(instructions)} instructions!")

        if self.generate_answers:
            self._logger.info(
                f"🧠 Generating answers for the {len(instructions)} evolved instructions!"
            )

            answers = self._generate_answers(instructions)

            self._logger.info(
                f"🎉 Finished generating answers for the {len(instructions)} evolved instructions!"
            )

            yield (
                [
                    self.format_output(instruction, answer)
                    for instruction, answer in zip(instructions, answers)
                ],
                True,
            )
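
A minimal usage sketch with an illustrative LLM wrapper; `min_length`, `max_length` and `seed` are runtime parameters, so they can also be supplied when running a Pipeline rather than at construction time:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolInstructGenerator

generator = EvolInstructGenerator(
    llm=OpenAILLM(model="gpt-4"),  # illustrative; any `LLM` subclass works
    num_instructions=10,
    min_length=256,   # accept instructions longer than 256 bytes...
    max_length=1024,  # ...and shorter than 1024 bytes
)
generator.load()

for batch, is_last_batch in generator.process():
    for row in batch:
        print(row["instruction"])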

mutation_templates_names: List[str] property

Returns the names i.e. keys of the provided mutation_templates.

outputs: List[str] property

The outputs of the task are the instruction, the answer if generate_answers=True, and the model_name.

format_output(instruction, answer=None)

The output for the task is a dict with: instruction; answer if generate_answers=True; and, finally, the model_name.

Parameters:
  • instruction (str, required): The instruction to be included within the output.
  • answer (Optional[str], default: None): The answer to be included within the output if generate_answers=True.

Returns:
  • Dict[str, Any]: If generate_answers=True, return {"instruction": ..., "answer": ..., "model_name": ...}; if generate_answers=False, return {"instruction": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def format_output(  # type: ignore
    self, instruction: str, answer: Optional[str] = None
) -> Dict[str, Any]:
    """The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;
    and, finally, the `model_name`.

    Args:
        instruction: The instruction to be included within the output.
        answer: The answer to be included within the output if `generate_answers=True`.

    Returns:
        If `generate_answers=True` return {"instruction": ..., "answer": ..., "model_name": ...};
        if `generate_answers=False` return {"instruction": ..., "model_name": ...};
    """
    _output = {
        "instruction": instruction,
        "model_name": self.llm.model_name,
    }
    if self.generate_answers and answer is not None:
        _output["answer"] = answer
    return _output

model_post_init(__context)

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
@override
def model_post_init(self, __context: Any) -> None:
    """Override this method to perform additional initialization after `__init__` and `model_construct`.
    This is useful if you want to do some validation that requires the entire model to be initialized.
    """
    super().model_post_init(__context)

    np.random.seed(self.seed)

    self._seed_texts = self._generate_seed_texts()
    self._prompts = [
        np.random.choice(self._seed_texts) for _ in range(self.num_instructions)
    ]

process(offset=0)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:
  • offset (int, default: 0): The offset to start the generation from.

Yields:
  • GeneratorStepOutput: A list of Python dictionaries with the outputs of the task, and a boolean flag indicating whether the task has finished or not, i.e. is the last batch.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
@override
def process(self, offset: int = 0) -> "GeneratorStepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        offset: The offset to start the generation from. Defaults to 0.

    Yields:
        A list of Python dictionaries with the outputs of the task, and a boolean
        flag indicating whether the task has finished or not i.e. is the last batch.
    """
    instructions = []
    mutation_no = 0

    iter_no = 0
    while len(instructions) < self.num_instructions:
        prompts = self._apply_random_mutation(iter_no=iter_no)

        generated_prompts = flatten_responses(
            self.llm.generate(prompts, **self.llm.generation_kwargs)  # type: ignore
        )
        for idx, generated_prompt in enumerate(generated_prompts):
            generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
            if self.max_length >= len(generated_prompt) >= self.min_length:  # type: ignore
                instructions.append(generated_prompt)
                self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore
            else:
                self._prompts[idx] = generated_prompt  # type: ignore

        self._logger.info(
            f"🔄 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!"
        )
        iter_no += 1

        if len(instructions) > self.num_instructions:
            instructions = instructions[: self.num_instructions]
        if len(instructions) > mutation_no:
            mutation_no = len(instructions) - mutation_no

        if not self.generate_answers and len(instructions[-mutation_no:]) > 0:
            yield (
                [
                    self.format_output(mutated_instruction)
                    for mutated_instruction in instructions[-mutation_no:]
                ],
                len(instructions) >= self.num_instructions,
            )

    self._logger.info(f"🎉 Finished evolving {len(instructions)} instructions!")

    if self.generate_answers:
        self._logger.info(
            f"🧠 Generating answers for the {len(instructions)} evolved instructions!"
        )

        answers = self._generate_answers(instructions)

        self._logger.info(
            f"🎉 Finished generating answers for the {len(instructions)} evolved instructions!"
        )

        yield (
            [
                self.format_output(instruction, answer)
                for instruction, answer in zip(instructions, answers)
            ],
            True,
        )

EvolQuality

Bases: Task

The EvolQuality task is used to evolve the quality of the responses given a prompt, by generating a new response with a language model. This step implements the evolution quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:
  • num_evolutions (int): The number of evolutions to be performed on the responses.
  • store_evolutions (bool): Whether to store all the evolved responses or just the last one. Defaults to False.
  • include_original_response (bool): Whether to include the original response within the evolved responses. Defaults to False.
  • mutation_templates (Dict[str, str]): The mutation templates to be used to evolve the responses.
  • seed (RuntimeParameter[int]): The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction that was used to generate the responses.
  • response (str): The response to be rewritten.
Output columns
  • evolved_response (str): The evolved response if store_evolutions=False.
  • evolved_responses (List[str]): The evolved responses if store_evolutions=True.
  • model_name (str): The name of the LLM used to evolve the responses.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
Source code in src/distilabel/steps/tasks/evol_quality/base.py
class EvolQuality(Task):
    """The `EvolQuality` task is used to evolve the quality of the responses given a prompt,
    by generating a new response with a language model. This step implements the evolution
    quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
    Automatic Data Selection in Instruction Tuning'.

    Attributes:
        num_evolutions: The number of evolutions to be performed on the responses.
        store_evolutions: Whether to store all the evolved responses or just the last one.
            Defaults to `False`.
        include_original_response: Whether to include the original response within the evolved
            responses. Defaults to `False`.
        mutation_templates: The mutation templates to be used to evolve the responses.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction that was used to generate the `responses`.
        - response (`str`): The response to be rewritten.

    Output columns:
        - evolved_response (`str`): The evolved response if `store_evolutions=False`.
        - evolved_responses (`List[str]`): The evolved responses if `store_evolutions=True`.
        - model_name (`str`): The name of the LLM used to evolve the responses.

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)
    """

    num_evolutions: int
    store_evolutions: bool = False
    include_original_response: bool = False
    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is used to randomly pick a mutation method, it is convenient to set a random seed.",
    )

    @override
    def model_post_init(self, __context: Any) -> None:
        """Override this method to perform additional initialization after `__init__` and `model_construct`.
        This is useful if you want to do some validation that requires the entire model to be initialized.
        """
        super().model_post_init(__context)

    @property
    def inputs(self) -> List[str]:
        """The input for the task are the `instruction` and `response`."""
        return ["instruction", "response"]

    def format_input(self, input: str) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation. And the
        `system_prompt` is added as the first message if it exists."""
        return [{"role": "user", "content": input}]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `evolved_response/s` and the `model_name`."""
        # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,
        # this could be handled always and the value could be included within the DAG validation when
        # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.
        _outputs = [
            ("evolved_response" if not self.store_evolutions else "evolved_responses"),
            "model_name",
        ]

        return _outputs

    def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]:  # type: ignore
        """The output for the task is a dict with: `evolved_response` or `evolved_responses`,
        depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
        and, finally, the `model_name`.

        Args:
            responses: The responses to be included within the output.

        Returns:
            if `store_evolutions=False` return {"evolved_response": ..., "model_name": ...};
            if `store_evolutions=True` return {"evolved_responses": ..., "model_name": ...}.
        """
        _output = {}

        if not self.store_evolutions:
            _output["evolved_response"] = responses[-1]
        else:
            _output["evolved_responses"] = responses

        _output["model_name"] = self.llm.model_name
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates` enum."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, instruction: str, response: str) -> str:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            instruction: The instruction to be included within the mutation prompt.
            response: The response to be included within the mutation prompt.

        Returns:
            A random mutation prompt with the provided instruction and response.
        """
        mutation = np.random.choice(self.mutation_templates_names)
        return (
            self.mutation_templates[mutation]
            .replace("<PROMPT>", instruction)
            .replace("<RESPONSE>", response)
        )

    def _evolve_responses(self, inputs: "StepInput") -> List[List[str]]:
        """Evolves the instructions provided as part of the inputs of the task.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list where each item is a list with either the last evolved instruction if
            `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
        """
        np.random.seed(self.seed)
        instructions: List[List[str]] = [[input["instruction"]] for input in inputs]
        responses: List[List[str]] = [[input["response"]] for input in inputs]

        for iter_no in range(self.num_evolutions):
            formatted_prompts = []
            for instruction, response in zip(instructions, responses):
                formatted_prompts.append(
                    self._apply_random_mutation(instruction[-1], response[-1])
                )

            formatted_prompts = [
                self.format_input(prompt) for prompt in formatted_prompts
            ]

            generated_responses = self.llm.generate(
                formatted_prompts,
                **self.llm.generation_kwargs,  # type: ignore
            )

            if self.store_evolutions:
                responses = [
                    response + [evolved_response[0]]
                    for response, evolved_response in zip(
                        responses, generated_responses
                    )
                ]
            else:
                responses = [
                    [evolved_response[0]] for evolved_response in generated_responses
                ]

            self._logger.info(
                f"🔄 Ran iteration {iter_no} evolving {len(responses)} responses!"
            )

        return responses

    @override
    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """

        responses = self._evolve_responses(inputs)

        if self.store_evolutions:
            # Remove the input instruction from the `evolved_responses` list
            from_ = 1 if not self.include_original_response else 0
            responses = [response[from_:] for response in responses]

        for input, response in zip(inputs, responses):
            input.update(self.format_output(response))
        yield inputs

        self._logger.info(f"🎉 Finished evolving {len(responses)} responses!")
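
A minimal usage sketch with an illustrative LLM wrapper; each input row must provide both an `instruction` and the `response` to rewrite:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import EvolQuality

evol = EvolQuality(
    llm=OpenAILLM(model="gpt-4"),  # illustrative; any `LLM` subclass works
    num_evolutions=1,
)
evol.load()

result = next(
    evol.process([{"instruction": "What is 2 + 2?", "response": "4"}])
)
# -> each row gains `evolved_response` and `model_name` columns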

inputs: List[str] property

The inputs for the task are the instruction and response.

mutation_templates_names: List[str] property

Returns the names i.e. keys of the provided mutation_templates enum.

outputs: List[str] property

The outputs of the task are the evolved_response/s and the model_name.

format_input(input)

The input is formatted as a ChatType, assuming that the instruction is the first interaction from the user within a conversation; the system_prompt is added as the first message if it exists.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
def format_input(self, input: str) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType`, with the provided prompt as the
    single user message of a conversation."""
    return [{"role": "user", "content": input}]

format_output(responses)

The output for the task is a dict with evolved_response (when store_evolutions=False) or evolved_responses (when store_evolutions=True), plus the model_name.

Parameters:

Name Type Description Default
responses Union[str, List[str]]

The responses to be included within the output.

required

Returns:

Type Description
Dict[str, Any]

if store_evolutions=False, {"evolved_response": ..., "model_name": ...}; if store_evolutions=True, {"evolved_responses": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]:  # type: ignore
    """The output for the task is a dict with: `evolved_response` or `evolved_responses`,
    depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
    and, finally, the `model_name`.

    Args:
        responses: The responses to be included within the output.

    Returns:
        if `store_evolutions=False` return {"evolved_response": ..., "model_name": ...};
        if `store_evolutions=True` return {"evolved_responses": ..., "model_name": ...}.
    """
    _output = {}

    if not self.store_evolutions:
        _output["evolved_response"] = responses[-1]
    else:
        _output["evolved_responses"] = responses

    _output["model_name"] = self.llm.model_name
    return _output

model_post_init(__context)

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
@override
def model_post_init(self, __context: Any) -> None:
    """Override this method to perform additional initialization after `__init__` and `model_construct`.
    This is useful if you want to do some validation that requires the entire model to be initialized.
    """
    super().model_post_init(__context)

process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Returns:

Type Description
StepOutput

A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/evol_quality/base.py
@override
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Returns:
        A list of Python dictionaries with the outputs of the task.
    """

    responses = self._evolve_responses(inputs)

    if self.store_evolutions:
        # Drop the original response from `evolved_responses` unless
        # `include_original_response=True`
        from_ = 1 if not self.include_original_response else 0
        responses = [response[from_:] for response in responses]

    for input, response in zip(inputs, responses):
        input.update(self.format_output(response))
    yield inputs

    self._logger.info(f"🎉 Finished evolving {len(responses)} instructions!")

GenerateEmbeddings

Bases: Step

Generate embeddings for a text input using the last hidden state of an LLM, as described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:

Name Type Description
llm LLM

The LLM to use to generate the embeddings.

Input columns
  • text (str, List[Dict[str, str]]): The input text or conversation to generate embeddings for.
Output columns
  • embedding (List[float]): The embedding of the input text or conversation.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)
Source code in src/distilabel/steps/tasks/generate_embeddings.py
class GenerateEmbeddings(Step):
    """Generate embeddings for a text input using the last hidden state of an `LLM`, as
    described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
    Automatic Data Selection in Instruction Tuning'.

    Attributes:
        llm: The `LLM` to use to generate the embeddings.

    Input columns:
        - text (`str`, `List[Dict[str, str]]`): The input text or conversation to generate
            embeddings for.

    Output columns:
        - embedding (`List[float]`): The embedding of the input text or conversation.

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
    """

    llm: LLM

    def load(self) -> None:
        """Loads the `LLM` used to generate the embeddings."""
        super().load()
        self.llm.load()

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task is a `text` column containing either a string or a
        list of dictionaries in OpenAI chat-like format."""
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        """The outputs for the task is an `embedding` column containing the embedding of
        the `text` input."""
        return ["embedding"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """Formats the input to be used by the LLM to generate the embeddings. The input
        can be in `ChatType` format or a string. If a string, it will be converted to a
        list of dictionaries in OpenAI chat-like format.

        Args:
            input: The input to format.

        Returns:
            The OpenAI chat-like format of the input.
        """
        text = input["text"] = input["text"]

        # input is in `ChatType` format
        if isinstance(text, str):
            return [{"role": "user", "content": text}]

        if is_openai_format(text):
            return text

        raise ValueError(
            f"Couldn't format input for step {self.name}. The `text` input column has to"
            " be a string or a list of dictionaries in OpenAI chat-like format."
        )

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Generates an embedding for each input using the last hidden state of the `LLM`.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """
        formatted_inputs = [self.format_input(input) for input in inputs]
        last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)
        for input, hidden_state in zip(inputs, last_hidden_states):
            input["embedding"] = hidden_state[-1].tolist()
        yield inputs

inputs: List[str] property

The inputs for the task is a text column containing either a string or a list of dictionaries in OpenAI chat-like format.

outputs: List[str] property

The outputs for the task is an embedding column containing the embedding of the text input.

format_input(input)

Formats the input to be used by the LLM to generate the embeddings. The input can be in ChatType format or a string. If a string, it will be converted to a list of dictionaries in OpenAI chat-like format.

Parameters:

Name Type Description Default
input Dict[str, Any]

The input to format.

required

Returns:

Type Description
ChatType

The OpenAI chat-like format of the input.

Source code in src/distilabel/steps/tasks/generate_embeddings.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """Formats the input to be used by the LLM to generate the embeddings. The input
    can be in `ChatType` format or a string. If a string, it will be converted to a
    list of dictionaries in OpenAI chat-like format.

    Args:
        input: The input to format.

    Returns:
        The OpenAI chat-like format of the input.
    """
    text = input["text"] = input["text"]

    # input is in `ChatType` format
    if isinstance(text, str):
        return [{"role": "user", "content": text}]

    if is_openai_format(text):
        return text

    raise ValueError(
        f"Couldn't format input for step {self.name}. The `text` input column has to"
        " be a string or a list of dictionaries in OpenAI chat-like format."
    )

load()

Loads the LLM used to generate the embeddings.

Source code in src/distilabel/steps/tasks/generate_embeddings.py
def load(self) -> None:
    """Loads the `LLM` used to generate the embeddings."""
    super().load()
    self.llm.load()

process(inputs)

Generates an embedding for each input using the last hidden state of the LLM.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Yields:

Type Description
StepOutput

A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/generate_embeddings.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Generates an embedding for each input using the last hidden state of the `LLM`.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """
    formatted_inputs = [self.format_input(input) for input in inputs]
    last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)
    for input, hidden_state in zip(inputs, last_hidden_states):
        input["embedding"] = hidden_state[-1].tolist()
    yield inputs
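
A minimal usage sketch, assuming TransformersLLM is the LLM integration that exposes get_last_hidden_states and that the shown import paths match your distilabel version (the model name is only illustrative):

from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import GenerateEmbeddings

embedder = GenerateEmbeddings(
    name="generate_embeddings",
    llm=TransformersLLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
)
embedder.load()

result = next(embedder.process([{"text": "Hello, how are you?"}]))
# result[0]["embedding"] is a List[float]: the hidden state of the last token,
# as returned by `get_last_hidden_states`.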

GeneratorTask

Bases: _Task, GeneratorStep

GeneratorTask is a class that implements the _Task abstract class and adds the GeneratorStep interface to be used as a step in the pipeline.

Attributes:

Name Type Description
llm

the LLM to be used to generate the outputs of the task.

group_generations

whether to group the num_generations generated per input in a list or create a row per generation. Defaults to False.

num_generations

The number of generations to be produced per input.

Source code in src/distilabel/steps/tasks/base.py
class GeneratorTask(_Task, GeneratorStep):
    """GeneratorTask is a class that implements the `_Task` abstract class and adds the
    `GeneratorStep` interface to be used as a step in the pipeline.

    Attributes:
        llm: the `LLM` to be used to generate the outputs of the task.
        group_generations: whether to group the `num_generations` generated per input in
            a list or create a row per generation. Defaults to `False`.
        num_generations: The number of generations to be produced per input.
    """

    pass

InstructionBacktranslation

Bases: Task

Self-Alignment with Instruction Backtranslation.

Attributes:

Name Type Description
_template Optional[Template]

the Jinja2 template to use for the Instruction Backtranslation task.

Input columns
  • instruction (str): The reference instruction to evaluate the text output.
  • generation (str): The text output to evaluate for the given instruction.
Output columns
  • score (str): The score for the generation based on the given instruction.
References
  • Self-Alignment with Instruction Backtranslation (https://arxiv.org/abs/2308.06259)
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
class InstructionBacktranslation(Task):
    """Self-Alignment with Instruction Backtranslation.

    Attributes:
        _template: the Jinja2 template to use for the Instruction Backtranslation task.

    Input columns:
        - instruction (`str`): The reference instruction to evaluate the text output.
        - generation (`str`): The text output to evaluate for the given instruction.

    Output columns:
        - score (`str`): The score for the generation based on the given instruction.

    References:
        - [`Self-Alignment with Instruction Backtranslation`](https://arxiv.org/abs/2308.06259)
    """

    _template: Optional["Template"] = PrivateAttr(default=...)

    def load(self) -> None:
        """Loads the Jinja2 template with the Instruction Backtranslation prompt."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "instruction-backtranslation.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`, and the `generation` for it."""
        return ["instruction", "generation"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    instruction=input["instruction"], generation=input["generation"]
                ),
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `score`, `reason` and the `model_name`."""
        return ["score", "reason", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `score` and `reason`. The
        `model_name` will be automatically included within the `process` method of `Task`.

        Args:
            output: a string representing the output of the LLM via the `process` method.
            input: the input to the task, as required by some tasks to format the output.

        Returns:
            A dictionary containing the `score` and the `reason` for the provided `score`.
        """
        pattern = r"(.+?)Score: (\d)"

        matches = None
        if output is not None:
            matches = re.findall(pattern, output, re.DOTALL)
        if matches is None:
            return {"score": None, "reason": None}

        return {
            "score": int(matches[0][1]),
            "reason": matches[0][0].strip(),
        }

inputs: List[str] property

The input for the task is the instruction, and the generation for it.

outputs: List[str] property

The output for the task is the score, reason and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                instruction=input["instruction"], generation=input["generation"]
            ),
        },
    ]

format_output(output, input)

The output is formatted as a dictionary with the score and reason. The model_name will be automatically included within the process method of Task.

Parameters:

Name Type Description Default
output Union[str, None]

a string representing the output of the LLM via the process method.

required
input Dict[str, Any]

the input to the task, as required by some tasks to format the output.

required

Returns:

Type Description
Dict[str, Any]

A dictionary containing the score and the reason for the provided score.

Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `score` and `reason`. The
    `model_name` will be automatically included within the `process` method of `Task`.

    Args:
        output: a string representing the output of the LLM via the `process` method.
        input: the input to the task, as required by some tasks to format the output.

    Returns:
        A dictionary containing the `score` and the `reason` for the provided `score`.
    """
    pattern = r"(.+?)Score: (\d)"

    matches = None
    if output is not None:
        matches = re.findall(pattern, output, re.DOTALL)
    if matches is None:
        return {"score": None, "reason": None}

    return {
        "score": int(matches[0][1]),
        "reason": matches[0][0].strip(),
    }

load()

Loads the Jinja2 template with the Instruction Backtranslation prompt.

Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
def load(self) -> None:
    """Loads the Jinja2 template with the Instruction Backtranslation prompt."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "instruction-backtranslation.jinja2"
    )

    self._template = Template(open(_path).read())
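
A minimal usage sketch for scoring a generation against its instruction (the import paths and the model name are assumptions to adapt to your setup):

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import InstructionBacktranslation

scorer = InstructionBacktranslation(
    name="instruction_backtranslation",
    llm=OpenAILLM(model="gpt-4"),
)
scorer.load()

result = next(
    scorer.process(
        [
            {
                "instruction": "Explain photosynthesis in one sentence.",
                "generation": "Photosynthesis is how plants turn light into chemical energy.",
            }
        ]
    )
)
# result[0] contains `score` (the digit parsed after "Score: "), the free-text
# `reason` preceding it, and the `model_name`.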

PairRM

Bases: Step

Rank the candidates based on the input using the LLM model.

Attributes:

Name Type Description
model str

The model to use for the ranking. Defaults to "llm-blender/PairRM".

input_batch_size int

The batch size to use when processing the input. Defaults to 8.

instructions Optional[str]

The instructions to use for the model. Defaults to None.

Input columns
  • input (List[Dict[str, Any]]): The input text or conversation to rank the candidates for.
  • candidates (List[Dict[str, Any]]): The candidates to rank.
Output columns
  • ranks (List[int]): The ranks of the candidates based on the input.
  • ranked_candidates (List[Dict[str, Any]]): The candidates ranked based on the input.
References
  • LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion (https://arxiv.org/abs/2306.02561)
  • Pair Ranking Model (https://huggingface.co/llm-blender/PairRM)
Note

This step differs from other tasks since there is currently a single implementation of this model, and it uses a specific LLM.

Source code in src/distilabel/steps/tasks/pair_rm.py
class PairRM(Step):
    """Rank the candidates based on the input using the `LLM` model.

    Attributes:
        model: The model to use for the ranking. Defaults to `"llm-blender/PairRM"`.
        input_batch_size: The batch size to use when processing the input. Defaults to `8`.
        instructions: The instructions to use for the model. Defaults to `None`.

    Input columns:
        - input (`List[Dict[str, Any]]`): The input text or conversation to rank the candidates for.
        - candidates (`List[Dict[str, Any]]`): The candidates to rank.

    Output columns:
        - ranks (`List[int]`): The ranks of the candidates based on the input.
        - ranked_candidates (`List[Dict[str, Any]]`): The candidates ranked based on the input.

    References:
        - [LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion](https://arxiv.org/abs/2306.02561).
        - [Pair Ranking Model](https://huggingface.co/llm-blender/PairRM).

    Note:
        This step differs from other tasks since there is currently a single
        implementation of this model, and it uses a specific `LLM`.
    """

    model: str = "llm-blender/PairRM"
    input_batch_size: int = 8
    instructions: Optional[str] = None

    def load(self) -> None:
        try:
            import llm_blender
        except ImportError as e:
            raise ImportError(
                "The `llm_blender` package is required to use the `PairRM` class."
                "Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`."
            ) from e
        self._blender = llm_blender.Blender()
        self._blender.loadranker(self.model)

    @property
    def inputs(self) -> List[str]:
        """The input columns correspond to the two required arguments from `Blender.rank`:
        `inputs` and `candidates`."""
        return ["input", "candidates"]

    @property
    def outputs(self) -> List[str]:
        """The outputs will include the `ranks` and the `ranked_candidates`."""
        return ["ranks", "ranked_candidates"]

    def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:
        """The input is expected to be a dictionary with the keys `input` and `candidates`,
        where the `input` corresponds to the instruction of a model and `candidates` are a
        list of responses to be ranked.
        """
        return input

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Generates the ranks for the candidates based on the input.

        The ranks are the positions of the candidates, where lower is better,
        and the ranked candidates correspond to the candidates sorted according to the
        ranks obtained.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            An iterator with the inputs containing the `ranks` and the `ranked_candidates`.
        """
        input_texts = []
        candidates = []
        for input in inputs:
            formatted_input = self.format_input(input)
            input_texts.append(formatted_input["input"])
            candidates.append(formatted_input["candidates"])

        instructions = (
            [self.instructions] * len(input_texts) if self.instructions else None
        )

        ranks = self._blender.rank(
            input_texts,
            candidates,
            instructions=instructions,
            return_scores=False,
            batch_size=self.input_batch_size,
        )
        # Sort the candidates based on the ranks
        ranked_candidates = np.take_along_axis(
            np.array(candidates), ranks - 1, axis=1
        ).tolist()
        ranks = ranks.tolist()
        for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):
            input["ranks"] = rank
            input["ranked_candidates"] = ranked_candidate

        yield inputs

inputs: List[str] property

The input columns correspond to the two required arguments from Blender.rank: inputs and candidates.

outputs: List[str] property

The outputs will include the ranks and the ranked_candidates.

format_input(input)

The input is expected to be a dictionary with the keys input and candidates, where the input corresponds to the instruction of a model and candidates are a list of responses to be ranked.

Source code in src/distilabel/steps/tasks/pair_rm.py
def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:
    """The input is expected to be a dictionary with the keys `input` and `candidates`,
    where the `input` corresponds to the instruction of a model and `candidates` are a
    list of responses to be ranked.
    """
    return input

process(inputs)

Generates the ranks for the candidates based on the input.

The ranks are the positions of the candidates, where lower is better, and the ranked candidates correspond to the candidates sorted according to the ranks obtained.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Yields:

Type Description
StepOutput

An iterator with the inputs containing the ranks and the ranked_candidates.

Source code in src/distilabel/steps/tasks/pair_rm.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Generates the ranks for the candidates based on the input.

    The ranks are the positions of the candidates, where lower is better,
    and the ranked candidates correspond to the candidates sorted according to the
    ranks obtained.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        An iterator with the inputs containing the `ranks` and the `ranked_candidates`.
    """
    input_texts = []
    candidates = []
    for input in inputs:
        formatted_input = self.format_input(input)
        input_texts.append(formatted_input["input"])
        candidates.append(formatted_input["candidates"])

    instructions = (
        [self.instructions] * len(input_texts) if self.instructions else None
    )

    ranks = self._blender.rank(
        input_texts,
        candidates,
        instructions=instructions,
        return_scores=False,
        batch_size=self.input_batch_size,
    )
    # Sort the candidates based on the ranks
    ranked_candidates = np.take_along_axis(
        np.array(candidates), ranks - 1, axis=1
    ).tolist()
    ranks = ranks.tolist()
    for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):
        input["ranks"] = rank
        input["ranked_candidates"] = ranked_candidate

    yield inputs
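
A minimal usage sketch (it requires the llm_blender package mentioned in load; no LLM argument is needed since the model is fixed, and the import path is an assumption):

from distilabel.steps.tasks import PairRM

ranker = PairRM(name="pair_rm")  # downloads "llm-blender/PairRM" on `load`
ranker.load()

result = next(
    ranker.process(
        [
            {
                "input": "What is the capital of France?",
                "candidates": ["London.", "Paris.", "I don't know."],
            }
        ]
    )
)
# result[0]["ranks"] holds one rank per candidate (lower is better) and
# result[0]["ranked_candidates"] the same candidates sorted by those ranks.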

QualityScorer

Bases: Task

QualityScorer is a pre-defined task that defines the instruction as the input and score as the output. This task is used to rate the quality of instructions and responses. It's an implementation of the quality score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'. The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs are scored in terms of quality, obtaining a quality score for each response.

Attributes:

Name Type Description
_template Union[Template, None]

a Jinja2 template used to format the input for the LLM.

Input columns
  • instruction (str): The instruction that was used to generate the responses.
  • responses (List[str]): The responses to be scored. Each response forms a pair with the instruction.
Output columns
  • quality_score (List[float]): The quality score for each response.
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)
Source code in src/distilabel/steps/tasks/quality_scorer.py
class QualityScorer(Task):
    """QualityScorer is a pre-defined task that defines the `instruction` as the input
    and `score` as the output. This task is used to rate the quality of instructions and responses.
    It's an implementation of the quality score task from the paper 'What Makes Good Data
    for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
    The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs
    are scored in terms of quality, obtaining a quality score for each response.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - instruction (`str`): The instruction that was used to generate the `responses`.
        - responses (`List[str]`): The responses to be scored. Each response forms a pair with the instruction.

    Output columns:
        - quality_score (`List[float]`): The quality score for each response.

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)
    """

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        super().load()
        self._template = Template(_QUALITY_SCORER_TEMPLATE)

    @property
    def inputs(self) -> List[str]:
        """The input for the task are `instruction` and `responses`."""
        return ["instruction", "responses"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [{"role": "user", "content": self._template.render(**input)}]  # type: ignore

    @property
    def outputs(self):
        """The output for the task is a list of `quality_scores` containing the quality score for each
        response in `responses`."""
        return ["scores"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction-response pair.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict containing the scores for each instruction-response pair.
        """

        if output is None:
            return {self.outputs[0]: [None] * len(input["responses"])}

        scores = []
        score_lines = output.split("\n")

        for i, line in enumerate(score_lines):
            match = _PARSE_SCORE_LINE_REGEX.match(line)
            score = float(match.group(1)) if match else None
            scores.append(score)
            if i == len(input["responses"]) - 1:
                break

        return {self.outputs[0]: scores}

inputs: List[str] property

The inputs for the task are instruction and responses.

outputs property

The output for the task is a list of scores containing the quality score for each response in responses.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/quality_scorer.py
def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [{"role": "user", "content": self._template.render(**input)}]  # type: ignore

format_output(output, input)

The output is formatted as a list with the score of each instruction-response pair.

Parameters:

Name Type Description Default
output Union[str, None]

the raw output of the LLM.

required
input Dict[str, Any]

the input to the task. Used for obtaining the number of responses.

required

Returns:

Type Description
Dict[str, Any]

A dict containing the scores for each instruction-response pair.

Source code in src/distilabel/steps/tasks/quality_scorer.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction-response pair.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict containing the scores for each instruction-response pair.
    """

    if output is None:
        return {self.outputs[0]: [None] * len(input["responses"])}

    scores = []
    score_lines = output.split("\n")

    for i, line in enumerate(score_lines):
        match = _PARSE_SCORE_LINE_REGEX.match(line)
        score = float(match.group(1)) if match else None
        scores.append(score)
        if i == len(input["responses"]) - 1:
            break

    return {self.outputs[0]: scores}
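
A minimal usage sketch scoring several responses to one instruction (the import paths and the model name are assumptions to adapt to your setup):

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import QualityScorer

scorer = QualityScorer(name="quality_scorer", llm=OpenAILLM(model="gpt-4"))
scorer.load()

result = next(
    scorer.process(
        [
            {
                "instruction": "Name a primary color.",
                "responses": ["Red.", "A primary color is red.", "Banana."],
            }
        ]
    )
)
# result[0]["scores"] is a list with one float per response (or None when a
# line of the LLM output could not be parsed), in the same order as `responses`.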

SelfInstruct

Bases: Task

SelfInstruct is a pre-defined task that, given a number of instructions, certain criteria for query generation, an application description, and an input, generates a number of instructions related to the given input, following what is stated in the criteria for query generation and the application description. It is based on the SelfInstruct framework from the paper "Self-Instruct: Aligning Language Models with Self-Generated Instructions".

Attributes:

Name Type Description
num_instructions int

The number of instructions to be generated. Defaults to 5.

criteria_for_query_generation str

The criteria for the query generation. Defaults to the criteria defined within the paper.

application_description str

The description of the AI application that one wants to build with these instructions. Defaults to AI assistant.

Input columns
  • input (str): The input to generate the instructions. It's also called seed in the paper.
Output columns
  • instructions (List[str]): The generated instructions.
Reference
  • Self-Instruct: Aligning Language Models with Self-Generated Instructions (https://arxiv.org/abs/2212.10560)
Source code in src/distilabel/steps/tasks/self_instruct.py
class SelfInstruct(Task):
    """SelfInstruct is a pre-defined task that, given a number of instructions, a
    certain criteria for query generations, an application description, and an input,
    generates a number of instruction related to the given input and following what
    is stated in the criteria for query generation and the application description.
    It is based in the SelfInstruct framework from the paper "Self-Instruct: Aligning
    Language Models with Self-Generated Instructions".

    Attributes:
        num_instructions: The number of instructions to be generated. Defaults to 5.
        criteria_for_query_generation: The criteria for the query generation. Defaults
            to the criteria defined within the paper.
        application_description: The description of the AI application that one wants
            to build with these instructions. Defaults to `AI assistant`.

    Input columns:
        - input (`str`): The input to generate the instructions. It's also called seed in the paper.

    Output columns:
        - instructions (`List[str]`): The generated instructions.

    Reference:
        - [`Self-Instruct: Aligning Language Models with Self-Generated Instructions`](https://arxiv.org/abs/2212.10560)
    """

    num_instructions: int = 5

    criteria_for_query_generation: str = (
        "Incorporate a diverse range of verbs, avoiding repetition.\n"
        "Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
        "Design queries to be self-contained and standalone.\n"
        'Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.'
    )

    application_description: str = "AI assistant"

    _template: Template = PrivateAttr(default=...)

    def load(self) -> None:
        """Loads the Jinja2 template for SelfInstruct."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "self-instruct.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `input` i.e. seed text."""
        return ["input"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""

        return [
            {
                "role": "user",
                "content": self._template.render(
                    input=input["input"],
                    application_description=self.application_description,
                    criteria_for_query_generation=self.criteria_for_query_generation,
                    num_instructions=self.num_instructions,
                ),
            }
        ]

    @property
    def outputs(self):
        """The output for the task is a list of `instructions` containing the generated instructions."""
        return ["instructions"]

    def format_output(
        self,
        output: Union[str, None],
        input: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the generated instructions.

        Args:
            output: the raw output of the LLM.
            input: the input to the task; optional and not used by this task.

        Returns:
            A dict containing the generated instructions.
        """

        if output is None:
            return {"instructions": []}

        lines = [line for line in output.split("\n") if line != ""]
        return {"instructions": lines}

inputs: List[str] property

The input for the task is the input i.e. seed text.

outputs property

The output for the task is a list of instructions containing the generated instructions.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/self_instruct.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""

    return [
        {
            "role": "user",
            "content": self._template.render(
                input=input["input"],
                application_description=self.application_description,
                criteria_for_query_generation=self.criteria_for_query_generation,
                num_instructions=self.num_instructions,
            ),
        }
    ]

format_output(output, input=None)

The output is formatted as a list with the generated instructions.

Parameters:

Name Type Description Default
output Union[str, None]

the raw output of the LLM.

required
input Optional[Dict[str, Any]]

the input to the task; optional and not used by this task.

None

Returns:

Type Description
Dict[str, Any]

A dict containing the generated instructions.

Source code in src/distilabel/steps/tasks/self_instruct.py
def format_output(
    self,
    output: Union[str, None],
    input: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """The output is formatted as a list with the generated instructions.

    Args:
        output: the raw output of the LLM.
        input: the input to the task; optional and not used by this task.

    Returns:
        A dict containing the generated instructions.
    """

    if output is None:
        return {"instructions": []}

    lines = [line for line in output.split("\n") if line != ""]
    return {"instructions": lines}

load()

Loads the Jinja2 template for SelfInstruct.

Source code in src/distilabel/steps/tasks/self_instruct.py
def load(self) -> None:
    """Loads the Jinja2 template for SelfInstruct."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "self-instruct.jinja2"
    )

    self._template = Template(open(_path).read())
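
A minimal usage sketch generating instructions from a seed input (the import paths, the model name, and the application description are assumptions to adapt to your setup):

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import SelfInstruct

self_instruct = SelfInstruct(
    name="self_instruct",
    llm=OpenAILLM(model="gpt-4"),
    num_instructions=3,
    application_description="An AI assistant for Python beginners.",
)
self_instruct.load()

result = next(self_instruct.process([{"input": "list comprehensions"}]))
# result[0]["instructions"] contains the generated instructions, one per
# non-empty line of the raw LLM output.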

Task

Bases: _Task, Step

Task is a class that implements the _Task abstract class and adds the Step interface to be used as a step in the pipeline.

Attributes:

Name Type Description
llm

the LLM to be used to generate the outputs of the task.

group_generations

whether to group the num_generations generated per input in a list or create a row per generation. Defaults to False.

num_generations

The number of generations to be produced per input.

Source code in src/distilabel/steps/tasks/base.py
class Task(_Task, Step):
    """Task is a class that implements the `_Task` abstract class and adds the `Step`
    interface to be used as a step in the pipeline.

    Attributes:
        llm: the `LLM` to be used to generate the outputs of the task.
        group_generations: whether to group the `num_generations` generated per input in
            a list or create a row per generation. Defaults to `False`.
        num_generations: The number of generations to be produced per input.
    """

    @abstractmethod
    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """Abstract method to format the inputs of the task. It needs to receive an input
        as a Python dictionary, and generates an OpenAI chat-like list of dicts."""
        pass

    def _format_inputs(self, inputs: List[Dict[str, Any]]) -> List["ChatType"]:
        """Formats the inputs of the task using the `format_input` method.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list containing the formatted inputs, which are `ChatType`-like following
            the OpenAI formatting.
        """
        return [self.format_input(input) for input in inputs]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """

        formatted_inputs = self._format_inputs(inputs)
        outputs = self.llm.generate(
            inputs=formatted_inputs,
            num_generations=self.num_generations,  # type: ignore
            **self.llm.generation_kwargs,  # type: ignore
        )

        task_outputs = []
        for input, input_outputs in zip(inputs, outputs):
            formatted_outputs = self._format_outputs(input_outputs, input)

            if self.group_generations:
                combined = combine_dicts(*formatted_outputs)
                task_outputs.append(
                    {**input, "model_name": self.llm.model_name, **combined}
                )
                continue

            # Create a row per generation
            for formatted_output in formatted_outputs:
                task_outputs.append(
                    {**input, "model_name": self.llm.model_name, **formatted_output}
                )

        yield task_outputs

format_input(input) abstractmethod

Abstract method to format the inputs of the task. It needs to receive an input as a Python dictionary, and generates an OpenAI chat-like list of dicts.

Source code in src/distilabel/steps/tasks/base.py
@abstractmethod
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """Abstract method to format the inputs of the task. It needs to receive an input
    as a Python dictionary, and generates an OpenAI chat-like list of dicts."""
    pass

process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters:

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Yields:

Type Description
StepOutput

A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/base.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """

    formatted_inputs = self._format_inputs(inputs)
    outputs = self.llm.generate(
        inputs=formatted_inputs,
        num_generations=self.num_generations,  # type: ignore
        **self.llm.generation_kwargs,  # type: ignore
    )

    task_outputs = []
    for input, input_outputs in zip(inputs, outputs):
        formatted_outputs = self._format_outputs(input_outputs, input)

        if self.group_generations:
            combined = combine_dicts(*formatted_outputs)
            task_outputs.append(
                {**input, "model_name": self.llm.model_name, **combined}
            )
            continue

        # Create a row per generation
        for formatted_output in formatted_outputs:
            task_outputs.append(
                {**input, "model_name": self.llm.model_name, **formatted_output}
            )

    yield task_outputs
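
Since Task is abstract, a concrete subclass must define the inputs and outputs properties plus format_input and format_output; process then handles formatting, generation, and output formatting. A hypothetical sketch (the Summarization task, its column names, and the import path of the ChatType alias are illustrative, not part of the library):

from typing import Any, Dict, List, Union

from distilabel.steps.tasks import Task
from distilabel.steps.tasks.typing import ChatType  # assumed location of the alias


class Summarization(Task):
    """Toy task that asks the LLM to summarize a `text` column."""

    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["summary", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        # Build the OpenAI chat-like list of dicts expected by the LLM.
        return [{"role": "user", "content": f"Summarize: {input['text']}"}]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        # `model_name` is added automatically within `process`.
        return {"summary": output}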

TextGeneration

Bases: Task

TextGeneration is a pre-defined task that defines the instruction as the input and generation as the output. This task is used to generate text based on the input instruction. The model_name is also returned as part of the output, to keep track of which model produced the generation.

Input columns
  • instruction (str): The instruction to generate text from.
Output columns
  • generation (str): The generated text.
  • model_name (str): The model name used to generate the text.
Source code in src/distilabel/steps/tasks/text_generation.py
class TextGeneration(Task):
    """TextGeneration is a pre-defined task that defines the `instruction` as the input
    and `generation` as the output. This task is used to generate text based on the input
    instruction. The `model_name` is also returned as part of the output, to keep
    track of which model produced the generation.

    Input columns:
        - instruction (`str`): The instruction to generate text from.

    Output columns:
        - generation (`str`): The generated text.
        - model_name (`str`): The model name used to generate the text.
    """

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`."""
        return ["instruction"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""

        instruction = input["instruction"]

        if isinstance(instruction, str):
            return [{"role": "user", "content": input["instruction"]}]

        if not is_openai_format(instruction):
            raise ValueError(
                f"Input `instruction` must be a string or an OpenAI chat-like format. "
                f"Got: {instruction}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'."
            )

        return instruction

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`."""
        return {"generation": output}

inputs: List[str] property

The input for the task is the instruction.

outputs: List[str] property

The output for the task is the generation and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""

    instruction = input["instruction"]

    if isinstance(instruction, str):
        return [{"role": "user", "content": input["instruction"]}]

    if not is_openai_format(instruction):
        raise ValueError(
            f"Input `instruction` must be a string or an OpenAI chat-like format. "
            f"Got: {instruction}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'."
        )

    return instruction

format_output(output, input)

The output is formatted as a dictionary with the generation. The model_name will be automatically included within the process method of Task.

Source code in src/distilabel/steps/tasks/text_generation.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`."""
    return {"generation": output}

UltraFeedback

Bases: Task

UltraFeedback: Boosting Language Models with High-quality Feedback.

Attributes:

Name Type Description
aspect Literal['helpfulness', 'honesty', 'instruction-following', 'truthfulness', 'overall-rating']

The aspect to perform with the UltraFeedback model. The available aspects are:
  • helpfulness: Evaluate text outputs based on helpfulness.
  • honesty: Evaluate text outputs based on honesty.
  • instruction-following: Evaluate text outputs based on given instructions.
  • truthfulness: Evaluate text outputs based on truthfulness.
Additionally, a custom aspect has been defined by Argilla to evaluate the overall assessment of the text outputs within a single prompt:
  • overall-rating: Evaluate text outputs based on an overall assessment.

Input columns
  • instruction (str): The reference instruction to evaluate the text outputs.
  • generations (List[str]): The text outputs to evaluate for the given instruction.
Output columns
  • ratings (List[float]): The ratings for each of the provided text outputs.
  • rationales (List[str]): The rationales for each of the provided text outputs.
  • model_name (str): The name of the model used to generate the ratings and rationales.
References
  • UltraFeedback: Boosting Language Models with High-quality Feedback (https://arxiv.org/abs/2310.01377)
  • UltraFeedback - GitHub Repository (https://github.com/OpenBMB/UltraFeedback)
Source code in src/distilabel/steps/tasks/ultrafeedback.py
class UltraFeedback(Task):
    """UltraFeedback: Boosting Language Models with High-quality Feedback.

    Attributes:
        aspect: The aspect to perform with the `UltraFeedback` model. The available aspects are:
            - `helpfulness`: Evaluate text outputs based on helpfulness.
            - `honesty`: Evaluate text outputs based on honesty.
            - `instruction-following`: Evaluate text outputs based on given instructions.
            - `truthfulness`: Evaluate text outputs based on truthfulness.
            Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall
            assessment of the text outputs within a single prompt. The custom aspect is:
            - `overall-rating`: Evaluate text outputs based on an overall assessment.

    Input columns:
        - instruction (`str`): The reference instruction to evaluate the text outputs.
        - generations (`List[str]`): The text outputs to evaluate for the given instruction.

    Output columns:
        - ratings (`List[float]`): The ratings for each of the provided text outputs.
        - rationales (`List[str]`): The rationales for each of the provided text outputs.
        - model_name (`str`): The name of the model used to generate the ratings and rationales.

    References:
        - [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377)
        - [`UltraFeedback - GitHub Repository`](https://github.com/OpenBMB/UltraFeedback)
    """

    aspect: Literal[
        "helpfulness",
        "honesty",
        "instruction-following",
        "truthfulness",
        # Custom aspects
        "overall-rating",
    ]

    _system_prompt: str = PrivateAttr(
        default=(
            "Your role is to evaluate text quality based on given criteria.\n"
            'You\'ll receive an instructional description ("Instruction") and {no_texts} text outputs ("Text").\n'
            "Understand and interpret instructions to evaluate effectively.\n"
            "Provide annotations for each text with a rating and rationale.\n"
            "The {no_texts} texts given are independent, and should be evaluated separately.\n"
        )
    )
    _template: Optional["Template"] = PrivateAttr(default=...)

    def load(self) -> None:
        """Loads the Jinja2 template for the given `aspect`."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "ultrafeedback"
            / f"{self.aspect}.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`, and the `generations` for it."""
        return ["instruction", "generations"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "system",
                "content": self._system_prompt.format(
                    no_texts=len(input["generations"])
                ),
            },
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    instruction=input["instruction"], generations=input["generations"]
                ),
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        columns = []
        if self.aspect in ["honesty", "instruction-following", "overall-rating"]:
            columns = ["ratings", "rationales"]
        elif self.aspect in ["helpfulness", "truthfulness"]:
            columns = ["types", "rationales", "ratings", "rationales-for-ratings"]
        return columns + ["model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `ratings` and `rationales` for
        each of the provided `generations` for the given `instruction`. The `model_name`
        will be automatically included within the `process` method of `Task`.

        Args:
            output: a string representing the output of the LLM via the `process` method.
            input: the input to the task, as required by some tasks to format the output.

        Returns:
            A dictionary containing either the `ratings` and `rationales` for each of the provided
            `generations` for the given `instruction` if the provided aspect is either `honesty`,
            `instruction-following`, or `overall-rating`; or the `types`, `rationales`,
            `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the
            given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.
        """
        if self.aspect in [
            "honesty",
            "instruction-following",
            "overall-rating",
        ]:
            return self._format_ratings_rationales_output(output, input)
        else:
            return self._format_types_ratings_rationales_output(output, input)

    def _format_ratings_rationales_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, List[Any]]:
        """Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`."""
        if output is None:
            return {
                "ratings": [None] * len(input["generations"]),
                "rationales": [None] * len(input["generations"]),
            }

        pattern = r"Rating: (.+?)\nRationale: (.+)"
        sections = output.split("\n\n")

        formatted_outputs = []
        for section in sections:
            matches = None
            if section is not None and section != "":
                matches = re.search(pattern, section, re.DOTALL)
            if not matches:
                formatted_outputs.append({"ratings": None, "rationales": None})
                continue

            formatted_outputs.append(
                {
                    "ratings": int(re.findall(r"\b\d+\b", matches.group(1))[0])
                    if matches.group(1) not in ["None", "N/A"]
                    else None,
                    "rationales": matches.group(2),
                }
            )
        return combine_dicts(*formatted_outputs)

    def _format_types_ratings_rationales_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, List[Any]]:
        """Formats the output when the aspect is either `helpfulness` or `truthfulness`."""
        if output is None:
            return {
                "types": [None] * len(input["generations"]),
                "rationales": [None] * len(input["generations"]),
                "ratings": [None] * len(input["generations"]),
                "rationales-for-ratings": [None] * len(input["generations"]),
            }

        pattern = r"Type: (.+?)\nRationale: (.+?)\nRating: (.+?)\nRationale: (.+)"

        sections = output.split("\n\n")

        formatted_outputs = []
        for section in sections:
            matches = None
            if section is not None and section != "":
                matches = re.search(pattern, section, re.DOTALL)
            if not matches:
                formatted_outputs.append(
                    {
                        "types": None,
                        "rationales": None,
                        "ratings": None,
                        "rationales-for-ratings": None,
                    }
                )
                continue

            formatted_outputs.append(
                {
                    "types": int(re.findall(r"\b\d+\b", matches.group(1))[0])
                    if matches.group(1) not in ["None", "N/A"]
                    else None,
                    "rationales": matches.group(2),
                    "ratings": int(re.findall(r"\b\d+\b", matches.group(3))[0])
                    if matches.group(3) not in ["None", "N/A"]
                    else None,
                    "rationales-for-ratings": matches.group(4),
                }
            )
        return combine_dicts(*formatted_outputs)
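
Both formatters boil down to the same pattern: split the raw completion on blank lines, regex-match each section, and merge the per-section dicts into columns. The sketch below exercises the `helpfulness`/`truthfulness` layout in isolation; the response text is hand-written for illustration, and `combine_dicts` is replaced with a minimal equivalent assumed to merge the per-section dicts into one dict of lists, matching the documented return shape.

import re

# Hand-written stand-in for an LLM response covering two generations;
# blocks are separated by a blank line, as `output.split("\n\n")` expects.
output = (
    "Type: 1\nRationale: The claims are verifiable.\n"
    "Rating: 5\nRationale: Fully truthful and grounded.\n\n"
    "Type: 2\nRationale: One claim is unsupported.\n"
    "Rating: 3\nRationale: Partially truthful."
)

pattern = r"Type: (.+?)\nRationale: (.+?)\nRating: (.+?)\nRationale: (.+)"

def combine_dicts(*dicts):
    # Minimal assumed equivalent of the helper used above: merge the
    # per-section dicts into a single dict of lists, keyed by column name.
    combined = {}
    for d in dicts:
        for key, value in d.items():
            combined.setdefault(key, []).append(value)
    return combined

formatted_outputs = []
for section in output.split("\n\n"):
    matches = re.search(pattern, section, re.DOTALL)
    formatted_outputs.append(
        {
            # The inputs here are clean, so plain int() stands in for the
            # stricter extraction used in the real formatter.
            "types": int(matches.group(1)),
            "rationales": matches.group(2),
            "ratings": int(matches.group(3)),
            "rationales-for-ratings": matches.group(4),
        }
    )

print(combine_dicts(*formatted_outputs))
# {'types': [1, 2], 'rationales': ['The claims are verifiable.',
#  'One claim is unsupported.'], 'ratings': [5, 3],
#  'rationales-for-ratings': ['Fully truthful and grounded.', 'Partially truthful.']}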

inputs: List[str] property

The input for the task is the instruction and the generations for it.

outputs: List[str] property

The outputs for the task are the ratings and rationales (plus the types and rationales-for-ratings when the aspect is helpfulness or truthfulness), and the model_name.

format_input(input)

The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "system",
            "content": self._system_prompt.format(
                no_texts=len(input["generations"])
            ),
        },
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                instruction=input["instruction"], generations=input["generations"]
            ),
        },
    ]
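
As a rough sketch of what this produces, the snippet below mimics the rendering with stand-in texts. The real system prompt and Jinja2 template ship with distilabel and are selected by the configured aspect, so both strings here are hypothetical placeholders that only match the interfaces used above.

from jinja2 import Template

# Hypothetical stand-ins for the private `_system_prompt` and `_template`
# attributes; only their interfaces match the code above.
system_prompt = "You will be shown {no_texts} texts to evaluate."
template = Template(
    "Instruction: {{ instruction }}\n"
    "{% for generation in generations %}"
    "Text {{ loop.index }}: {{ generation }}\n"
    "{% endfor %}"
)

input = {
    "instruction": "Summarize the paragraph.",
    "generations": ["A terse summary.", "A rambling summary."],
}

chat = [
    {
        "role": "system",
        "content": system_prompt.format(no_texts=len(input["generations"])),
    },
    {"role": "user", "content": template.render(**input)},
]
print(chat[1]["content"])
# Instruction: Summarize the paragraph.
# Text 1: A terse summary.
# Text 2: A rambling summary.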

format_output(output, input)

The output is formatted as a dictionary with the ratings and rationales for each of the provided generations for the given instruction. The model_name will be automatically included within the process method of Task.

Parameters:

Name    Type              Description                                                               Default
output  Union[str, None]  a string representing the output of the LLM via the `process` method.    required
input   Dict[str, Any]    the input to the task, as required by some tasks to format the output.   required

Returns:

Type            Description
Dict[str, Any]  A dictionary containing either the `ratings` and `rationales` for each of the provided `generations` for the given `instruction` if the provided aspect is either `honesty`, `instruction-following`, or `overall-rating`; or the `types`, `rationales`, `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `ratings` and `rationales` for
    each of the provided `generations` for the given `instruction`. The `model_name`
    will be automatically included within the `process` method of `Task`.

    Args:
        output: a string representing the output of the LLM via the `process` method.
        input: the input to the task, as required by some tasks to format the output.

    Returns:
        A dictionary containing either the `ratings` and `rationales` for each of the provided
        `generations` for the given `instruction` if the provided aspect is either `honesty`,
        `instruction-following`, or `overall-rating`; or the `types`, `rationales`,
        `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the
        given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.
    """
    if self.aspect in [
        "honesty",
        "instruction-following",
        "overall-rating",
    ]:
        return self._format_ratings_rationales_output(output, input)
    else:
        return self._format_types_ratings_rationales_output(output, input)
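
For the ratings-only aspects, the parsing can be seen end to end with a hand-written response that follows the expected Rating:/Rationale: layout; real model output may deviate from it, which is exactly when the None fallbacks in _format_ratings_rationales_output kick in. The response text below is purely illustrative.

import re

# Illustrative LLM response covering two generations, one block each,
# separated by a blank line.
output = (
    "Rating: 4\nRationale: Follows the instruction closely.\n\n"
    "Rating: 2\nRationale: Ignores the requested format."
)

pattern = r"Rating: (.+?)\nRationale: (.+)"
for section in output.split("\n\n"):
    matches = re.search(pattern, section, re.DOTALL)
    # Same extraction as in the source: first integer in the rating group.
    rating = int(re.findall(r"\b\d+\b", matches.group(1))[0])
    print(rating, "->", matches.group(2))
# 4 -> Follows the instruction closely.
# 2 -> Ignores the requested format.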

load()

Loads the Jinja2 template for the given aspect.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def load(self) -> None:
    """Loads the Jinja2 template for the given `aspect`."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "ultrafeedback"
        / f"{self.aspect}.jinja2"
    )

    with open(_path) as template_file:
        self._template = Template(template_file.read())
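
Putting it together, each supported aspect resolves to a bundled template of the same name. The loop below just prints that mapping; the relative paths mirror the lookup in load, while the absolute location depends on where distilabel is installed in your environment.

# The five aspects handled by `format_output` above, each mapping to a
# packaged Jinja2 template named after the aspect.
for aspect in [
    "honesty",
    "instruction-following",
    "overall-rating",
    "helpfulness",
    "truthfulness",
]:
    print(f"{aspect} -> distilabel/steps/tasks/templates/ultrafeedback/{aspect}.jinja2")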