Task Gallery
This section contains the existing Task subclasses implemented in distilabel.
BitextRetrievalGenerator
Bases: _EmbeddingDataGenerator
Generate bitext retrieval data with an LLM to later train an embedding model.
BitextRetrievalGenerator is a GeneratorTask that generates bitext retrieval data with an LLM to later train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models". The data is generated from the provided attributes, and any attribute that is not provided is randomly sampled.
Attributes:
Name | Type | Description |
---|---|---|
source_language | str | The source language of the data to be generated, which can be any of the XLM-R languages listed in Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
target_language | str | The target language of the data to be generated, which can be any of the XLM-R languages listed in Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
unit | Optional[Literal['sentence', 'phrase', 'passage']] | The unit of the data to be generated, which can be sentence, phrase, or passage. |
difficulty | Optional[Literal['elementary school', 'high school', 'college']] | The difficulty of the query to be generated, which can be elementary school, high school, or college. |
high_score | Optional[Literal['4', '4.5', '5']] | The high score of the query to be generated, which can be 4, 4.5, or 5. |
low_score | Optional[Literal['2.5', '3', '3.5']] | The low score of the query to be generated, which can be 2.5, 3, or 3.5. |
seed | int | The random seed to be set in case any sampling is performed. |
Examples:
Generate bitext retrieval data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import BitextRetrievalGenerator

with Pipeline("my-pipeline") as pipeline:
    task = BitextRetrievalGenerator(
        source_language="English",
        target_language="Spanish",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,  # placeholder: pass your actual LLM here
    )

    ...

    task >> ...
```
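The description above says that attributes left unset are randomly sampled. The following is a minimal sketch of that idea, not the library's implementation: `fill_missing_attributes` and `ALLOWED` are hypothetical names, the allowed values are copied from the Attributes table, and the standard-library `random` module is an assumption about how the sampling could be seeded.

```python
import random
from typing import Dict, List, Optional

# Allowed values, copied from the Attributes table above.
ALLOWED: Dict[str, List[str]] = {
    "unit": ["sentence", "phrase", "passage"],
    "difficulty": ["elementary school", "high school", "college"],
    "high_score": ["4", "4.5", "5"],
    "low_score": ["2.5", "3", "3.5"],
}


def fill_missing_attributes(
    provided: Dict[str, Optional[str]], seed: int
) -> Dict[str, str]:
    """Hypothetical helper: keep provided values, sample the missing ones."""
    rng = random.Random(seed)  # local RNG so the sampling is reproducible
    filled: Dict[str, str] = {}
    for name, choices in ALLOWED.items():
        value = provided.get(name)
        if value is None:
            value = rng.choice(choices)  # attribute not provided: sample it
        elif value not in choices:
            raise ValueError(f"{name}={value!r} is not one of {choices}")
        filled[name] = value
    return filled
```

With the same seed, two calls return the same sampled values, which is the usual reason for exposing a seed attribute on a task like this.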
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys: List[str] (property)
Contains the keys that will be parsed from the LLM output into a Python dict.
prompt: ChatType (property)
Contains the prompt to be used in the process method: the rendered _template, formatted as an OpenAI chat (i.e. a ChatType) with a single turn from the user whose content is the rendered _template.
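As a rough illustration of the prompt property, the sketch below renders a template and wraps it as a single user turn. It is only a sketch under stated assumptions: `build_prompt` and the template text are hypothetical, and `string.Template` stands in for the template engine the task actually uses.

```python
from string import Template
from typing import Dict, List

# ChatType modelled as a list of role/content dicts (OpenAI chat format).
ChatType = List[Dict[str, str]]

# Hypothetical template text; the real task renders its own _template.
_TEMPLATE = Template(
    "Write a $unit in $source_language and its translation into $target_language."
)


def build_prompt(**attributes: str) -> ChatType:
    """Render the template and wrap it as a single user turn."""
    rendered = _TEMPLATE.substitute(**attributes)
    return [{"role": "user", "content": rendered}]
```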
ChatGeneration
Bases: Task
Generates text based on a conversation.
ChatGeneration is a pre-defined task that defines messages as the input and generation as the output. This task is used to generate text based on a conversation. The model_name is also returned as part of the output in order to enhance it.
Input columns
- messages (List[Dict[Literal["role", "content"], str]]): The messages to generate the follow up completion from.
Output columns
- generation (str): The generated text from the assistant.
- model_name (str): The model name used to generate the text.
Categories
- chat-generation
Icon
:material-chat:
Examples:
Generate text from a conversation in OpenAI chat format:
```python
from distilabel.steps.tasks import ChatGeneration
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
chat = ChatGeneration(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

chat.load()

result = next(
    chat.process(
        [
            {
                "messages": [
                    {"role": "user", "content": "How much is 2+2?"},
                ]
            }
        ]
    )
)
# result
# [
#     {
#         'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'generation': '4',
#     }
# ]
```
Source code in src/distilabel/steps/tasks/text_generation.py
inputs: List[str] (property)
The input for the task is the messages.
outputs: List[str] (property)
The outputs for the task are the generation and the model_name.
format_input(input)
The input is formatted as a ChatType, assuming that the messages provided are already formatted that way, i.e. following the OpenAI chat format.
Source code in src/distilabel/steps/tasks/text_generation.py
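Since format_input assumes the messages already follow the OpenAI chat format, a caller may want to check that shape before running the task. The sketch below is one possible check, not the library's code: the function body and the exact validation rules (a non-empty list of dicts with exactly the keys role and content) are assumptions.

```python
from typing import Any, Dict, List

ChatType = List[Dict[str, str]]


def format_input(input: Dict[str, Any]) -> ChatType:
    """Return input["messages"] as-is after a basic OpenAI-format check."""
    messages = input["messages"]
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty list")
    for message in messages:
        # Assumed rule: each message carries exactly 'role' and 'content'.
        if not isinstance(message, dict) or set(message) != {"role", "content"}:
            raise ValueError(f"Not an OpenAI-style message: {message!r}")
    return messages
```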
format_output(output, input)
The output is formatted as a dictionary with the generation. The model_name will be automatically included within the process method of Task.
Source code in src/distilabel/steps/tasks/text_generation.py
ComplexityScorer
Bases: Task
Score instructions based on their complexity using an LLM.
ComplexityScorer is a pre-defined task used to rank a list of instructions based on their complexity. It is an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
_template | Union[Template, None] | A Jinja2 template used to format the input for the LLM. |
Input columns
- instructions (List[str]): The list of instructions to be scored.
Output columns
- scores (List[float]): The score for each instruction.
- model_name (str): The model name used to generate the scores.
Categories
- scorer
- complexity
- instruction
Examples:
Evaluate the complexity of your instructions:
```python
from distilabel.steps.tasks import ComplexityScorer
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
scorer = ComplexityScorer(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

scorer.load()

result = next(
    scorer.process(
        [{"instructions": ["plain instruction", "highly complex instruction"]}]
    )
)
# result
# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]
```
Citations:
```
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
```
Source code in src/distilabel/steps/tasks/complexity_scorer.py
inputs: List[str] (property)
The inputs for the task are the instructions.
outputs: List[str] (property)
The outputs for the task are: a list of scores containing the complexity score for each instruction in instructions, and the model_name.
format_input(input)
The input is formatted as a ChatType, assuming that the instruction is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
format_output(output, input)
The output is formatted as a list with the score of each instruction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The raw output of the LLM. | required |
input | Dict[str, Any] | The input to the task. Used for obtaining the number of responses. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with the key scores, containing the score for each instruction. |
Source code in src/distilabel/steps/tasks/complexity_scorer.py
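To make the role of the two format_output parameters concrete, here is a minimal sketch of turning a raw LLM output into one score per instruction. It rests on assumptions not stated in this page: the raw output is taken to contain lines such as "[1] Score: 3", missing scores are padded with None, and `parse_scores` is a hypothetical name.

```python
import re
from typing import Any, Dict, List, Optional, Union

# Assumed line format of the raw LLM output, e.g. "[1] Score: 3".
_SCORE_PATTERN = re.compile(r"\[\d+\]\s*Score:\s*(\d+(?:\.\d+)?)")


def parse_scores(output: Union[str, None], input: Dict[str, Any]) -> Dict[str, Any]:
    """Return {"scores": [...]} with one entry per input instruction."""
    expected = len(input["instructions"])
    if output is None:
        # No generation at all: one None per instruction.
        return {"scores": [None] * expected}
    scores: List[Optional[float]] = [
        float(match) for match in _SCORE_PATTERN.findall(output)
    ]
    # Truncate extra scores and pad missing ones so lengths always match.
    scores = scores[:expected] + [None] * (expected - len(scores))
    return {"scores": scores}
```

Using the input to fix the expected length keeps the scores column aligned with the instructions column even when the model returns fewer or more lines than requested.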
load()
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
EvolComplexity
Bases: EvolInstruct
Evolve instructions to make them more complex using an LLM.
EvolComplexity is a task that evolves instructions to make them more complex. It is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.
Attributes:
Name | Type | Description |
---|---|---|
num_instructions | | The number of instructions to be generated. |
generate_answers | | Whether to generate answers for the instructions or not. |
mutation_templates | Dict[str, str] | The mutation templates to be used for the generation of the instructions. |
min_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction must exceed to be considered valid. |
max_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction must stay below to be considered valid. |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- min_length: Defines the length (in bytes) that the generated instruction must exceed to be considered valid.
- max_length: Defines the length (in bytes) that the generated instruction must stay below to be considered valid.
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
- instruction (str): The instruction to evolve.
Output columns
- evolved_instruction (str): The evolved instruction.
- answer (str, optional): The answer to the instruction if generate_answers=True.
- model_name (str): The name of the LLM used to evolve the instructions.
Categories
- evol
- instruction
- deita
Examples:
Evolve an instruction using an LLM:
```python
from distilabel.steps.tasks import EvolComplexity
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_complexity = EvolComplexity(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_complexity.load()

result = next(evol_complexity.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
```
Citations:
```
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
```
```
@misc{xu2023wizardlmempoweringlargelanguage,
title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
year={2023},
eprint={2304.12244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2304.12244},
}
```
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
EvolComplexityGenerator
Bases: EvolInstructGenerator
Generate evolved instructions with increased complexity using an LLM.
EvolComplexityGenerator is a generation task that evolves instructions to make them more complex. It is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.
Attributes:
Name | Type | Description |
---|---|---|
num_instructions | int | The number of instructions to be generated. |
generate_answers | bool | Whether to generate answers for the instructions or not. |
mutation_templates | Dict[str, str] | The mutation templates to be used for the generation of the instructions. |
min_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction must exceed to be considered valid. |
max_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction must stay below to be considered valid. |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- min_length: Defines the length (in bytes) that the generated instruction must exceed to be considered valid.
- max_length: Defines the length (in bytes) that the generated instruction must stay below to be considered valid.
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
- instruction (str): The evolved instruction.
- answer (str, optional): The answer to the instruction if generate_answers=True.
- model_name (str): The name of the LLM used to evolve the instructions.
Categories
- evol
- instruction
- generation
- deita
Examples:
Generate evolved instructions without initial instructions:
```python
from distilabel.steps.tasks import EvolComplexityGenerator
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_complexity_generator = EvolComplexityGenerator(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=2,
)

evol_complexity_generator.load()

# Call process() on the generator defined above.
result = next(evol_complexity_generator.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
```
Citations:
```
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
```
```
@misc{xu2023wizardlmempoweringlargelanguage,
title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
year={2023},
eprint={2304.12244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2304.12244},
}
```
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
EvolInstruct
Bases: Task
Evolve instructions using an LLM.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name | Type | Description |
---|---|---|
num_evolutions | int | The number of evolutions to be performed. |
store_evolutions | bool | Whether to store all the evolutions or just the last one. |
generate_answers | bool | Whether to generate answers for the evolved instructions. |
include_original_instruction | bool | Whether to include the original instruction in the evolved_instructions output column. |
mutation_templates | Dict[str, str] | The mutation templates to be used for evolving the instructions. |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
- instruction (str): The instruction to evolve.
Output columns
- evolved_instruction (str): The evolved instruction if store_evolutions=False.
- evolved_instructions (List[str]): The evolved instructions if store_evolutions=True.
- model_name (str): The name of the LLM used to evolve the instructions.
- answer (str): The answer to the evolved instruction if generate_answers=True and store_evolutions=False.
- answers (List[str]): The answers to the evolved instructions if generate_answers=True and store_evolutions=True.
Categories
- evol
- instruction
Examples:
Evolve an instruction using an LLM:
```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
```
Keep the iterations of the evolutions:
```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
    store_evolutions=True,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
#     {
#         'instruction': 'common instruction',
#         'evolved_instructions': ['initial evolution', 'final evolution'],
#         'model_name': 'model_name'
#     }
# ]
```
Generate answers for the instructions in a single step:
```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
    generate_answers=True,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
#     {
#         'instruction': 'common instruction',
#         'evolved_instruction': 'evolved instruction',
#         'answer': 'answer to the instruction',
#         'model_name': 'model_name'
#     }
# ]
```
Citations:
```
@misc{xu2023wizardlmempoweringlargelanguage,
title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
year={2023},
eprint={2304.12244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2304.12244},
}
```
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
inputs: List[str] (property)
The input for the task is the instruction.
mutation_templates_names: List[str] (property)
Returns the names, i.e. the keys, of the provided mutation_templates.
outputs: List[str] (property)
The outputs for the task are the evolved_instruction/s, the answer if generate_answers=True, and the model_name.
_apply_random_mutation(instruction)
Applies a random mutation from the ones provided as part of the mutation_templates enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction | str | The instruction to be included within the mutation prompt. | required |
Returns:
Type | Description |
---|---|
str | A random mutation prompt with the provided instruction. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
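The random mutation step can be pictured as picking one template at random and placing the instruction inside it. The sketch below is illustrative only: the template texts, the "<PROMPT>" placeholder token, and the function name are hypothetical, and the standard-library `random` module is swapped in for numpy to keep the example dependency-free.

```python
import random
from typing import Dict

# Hypothetical mutation templates; "<PROMPT>" is an assumed placeholder token.
MUTATION_TEMPLATES: Dict[str, str] = {
    "CONSTRAINTS": "Add one more constraint to the following prompt:\n<PROMPT>",
    "DEEPENING": "Increase the depth of the following prompt:\n<PROMPT>",
}


def apply_random_mutation(instruction: str, rng: random.Random) -> str:
    """Pick a template at random and embed the instruction in it."""
    # Sort the keys so the choice depends only on the RNG state.
    name = rng.choice(sorted(MUTATION_TEMPLATES))
    return MUTATION_TEMPLATES[name].replace("<PROMPT>", instruction)
```

Passing a seeded `random.Random` instance plays the role of the seed runtime parameter: the same seed yields the same sequence of mutation choices.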
_evolve_instructions(inputs)
Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | A list of Python dictionaries with the inputs of the task. | required |
Returns:
Type | Description |
---|---|
List[List[str]] | A list where each item is a list with either the last evolved instruction if store_evolutions=False, or all the evolved instructions if store_evolutions=True. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
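The shape of the value returned by _evolve_instructions can be sketched as follows. This is a simplified illustration, not the library's code: `evolve_once` is a hypothetical callable standing in for the mutation prompt plus the LLM call, and the loop runs row by row for readability.

```python
from typing import Callable, Dict, List


def evolve_instructions(
    inputs: List[Dict[str, str]],
    evolve_once: Callable[[str], str],
    num_evolutions: int,
    store_evolutions: bool = False,
) -> List[List[str]]:
    """Evolve each instruction num_evolutions times."""
    results: List[List[str]] = []
    for row in inputs:
        current = row["instruction"]
        history: List[str] = []
        for _ in range(num_evolutions):
            current = evolve_once(current)  # each step builds on the previous one
            history.append(current)
        # Keep every step, or only the last one wrapped in a list.
        results.append(history if store_evolutions else history[-1:])
    return results
```

For example, with `evolve_once=lambda text: text + "!"` and `num_evolutions=2`, the input instruction "a" gives `["a!!"]` by default and `["a!", "a!!"]` when `store_evolutions=True`.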
_generate_answers(evolved_instructions)
Generates the answers for the instructions in evolved_instructions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
evolved_instructions | List[List[str]] | A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False, or all the evolved instructions if store_evolutions=True. | required |
Returns:
Type | Description |
---|---|
List[List[str]] | A list of answers for each instruction. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
format_input(input)
The input is formatted as a ChatType, assuming that the instruction is the first interaction from the user within a conversation. The system_prompt is added as the first message if it exists.
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
format_output(instructions, answers=None)
The output for the task is a dict with: evolved_instruction or evolved_instructions, depending on whether store_evolutions is False or True, respectively; answer if generate_answers=True; and, finally, the model_name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instructions | Union[str, List[str]] | The instructions to be included within the output. | required |
answers | Optional[List[str]] | The answers to be included within the output if generate_answers=True. | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with evolved_instruction (if store_evolutions=False) or evolved_instructions (if store_evolutions=True); answer or answers, respectively, if generate_answers=True; and model_name. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
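The four possible output shapes follow from the two flags. A minimal sketch, assuming the flags and the model name are passed in explicitly (in the task they are attributes of the instance) and using the column names listed under Output columns:

```python
from typing import Any, Dict, List, Optional, Union


def format_output(
    instructions: Union[str, List[str]],
    answers: Optional[List[str]] = None,
    *,
    store_evolutions: bool,
    generate_answers: bool,
    model_name: str,
) -> Dict[str, Any]:
    """Build the output dict for the four flag combinations."""
    output: Dict[str, Any] = {}
    if store_evolutions:
        # Plural keys: every evolution (and every answer) is kept.
        output["evolved_instructions"] = instructions
        if generate_answers and answers is not None:
            output["answers"] = answers
    else:
        # Singular keys: only the last evolution (and its answer) is kept.
        last = instructions[-1] if isinstance(instructions, list) else instructions
        output["evolved_instruction"] = last
        if generate_answers and answers:
            output["answer"] = answers[-1]
    output["model_name"] = model_name
    return output
```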
process(inputs)
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | A list of Python dictionaries with the inputs of the task. | required |
Yields:
Type | Description |
---|---|
StepOutput | A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
EvolInstructGenerator
Bases: GeneratorTask
Generate evolved instructions using an LLM.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name | Type | Description |
---|---|---|
num_instructions | int | The number of instructions to be generated. |
generate_answers | bool | Whether to generate answers for the instructions or not. |
mutation_templates | Dict[str, str] | The mutation templates to be used for the generation of the instructions. |
min_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction must exceed to be considered valid. |
max_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction must stay below to be considered valid. |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- min_length: Defines the length (in bytes) that the generated instruction must exceed to be considered valid.
- max_length: Defines the length (in bytes) that the generated instruction must stay below to be considered valid.
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
- instruction (str): The generated instruction if generate_answers=False.
- answer (str): The generated answer if generate_answers=True.
- instructions (List[str]): The generated instructions if generate_answers=True.
- model_name (str): The name of the LLM used to generate and evolve the instructions.
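The min_length and max_length runtime parameters describe a validity window measured in bytes. A minimal sketch of such a check, assuming UTF-8 encoding and strict bounds (both are assumptions; `is_valid_length` is a hypothetical helper, not part of the library):

```python
def is_valid_length(instruction: str, min_length: int, max_length: int) -> bool:
    """True if the UTF-8 byte length lies strictly between the two bounds."""
    n_bytes = len(instruction.encode("utf-8"))  # bytes, not characters
    return min_length < n_bytes < max_length
```

Measuring bytes rather than characters matters for non-ASCII text: "héllo" has five characters but six bytes in UTF-8, so it passes a lower bound of 5 while "hello" does not.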
Categories
- evol
- instruction
- generation
Examples:
Generate evolved instructions without initial instructions:
```python
from distilabel.steps.tasks import EvolInstructGenerator
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct_generator = EvolInstructGenerator(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=2,
)

evol_instruct_generator.load()

# Call process() on the generator defined above.
result = next(evol_instruct_generator.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
```
Citations:
```
@misc{xu2023wizardlmempoweringlargelanguage,
title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
year={2023},
eprint={2304.12244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2304.12244},
}
```
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
_english_nouns: List[str] (cached property)
A list of English nouns to be used as part of the starting prompts for the task.
References
- https://github.com/h2oai/h2o-wizardlm
mutation_templates_names: List[str] (property)
Returns the names, i.e. the keys, of the provided mutation_templates.
outputs: List[str] (property)
The outputs for the task are the instruction, the answer if generate_answers=True, and the model_name.
_apply_random_mutation(iter_no)
Applies a random mutation from the ones provided as part of the mutation_templates enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
iter_no | int | The iteration number, used to check whether the iteration is the first one (i.e. FRESH_START) or not. | required |
Returns:
Type | Description |
---|---|
List[ChatType] | A random mutation prompt with the provided instruction, formatted as an OpenAI conversation. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
_generate_answers(instructions)
Generates the answer for the last instruction in instructions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instructions | List[List[str]] | A list of lists of instructions; the answer is generated for the last instruction of each list. | required |
Returns:
Type | Description |
---|---|
List[str] | A list of answers for the last instruction in instructions. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
_generate_seed_texts()
Generates a list of seed texts to be used as part of the starting prompts for the task.
It will use the FRESH_START mutation template, as it needs to generate text from scratch; a list of English words will be used to generate the seed texts that will be provided to the mutation method and included within the prompt.
Returns:
Type | Description |
---|---|
List[str] | A list of seed texts to be used as part of the starting prompts for the task. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
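One way to picture seed-text generation is to draw a few words from the noun list and join them into a short text. The sketch below is an assumption-laden illustration: the number of words per text, the `max_words` parameter, and the use of the standard-library `random` module are not taken from the library.

```python
import random
from typing import List, Sequence


def generate_seed_texts(
    nouns: Sequence[str],
    num_texts: int,
    rng: random.Random,
    max_words: int = 3,  # assumed upper bound on words per seed text
) -> List[str]:
    """Build num_texts seed texts, each made of 1..max_words distinct nouns."""
    texts: List[str] = []
    for _ in range(num_texts):
        k = rng.randint(1, min(max_words, len(nouns)))
        texts.append(" ".join(rng.sample(list(nouns), k)))
    return texts
```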
format_output(instruction, answer=None)
The output for the task is a dict with: instruction; answer if generate_answers=True; and, finally, the model_name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction | str | The instruction to be included within the output. | required |
answer | Optional[str] | The answer to be included within the output if generate_answers=True. | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with instruction; answer if generate_answers=True; and model_name. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
model_post_init(__context)
Override this method to perform additional initialization after __init__ and model_construct.
This is useful if you want to do some validation that requires the entire model to be initialized.
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
process(offset=0)
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset to start the generation from. Defaults to 0. | 0 |
Yields:
Type | Description |
---|---|
GeneratorStepOutput | A list of Python dictionaries with the outputs of the task, and a boolean flag indicating whether the task has finished or not, i.e. whether this is the last batch. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
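The yielded value pairs a batch of rows with a "last batch" flag. The sketch below shows that pattern in isolation, under assumptions: `batch_size` is a hypothetical parameter, offset is interpreted as the number of rows to skip, and the instructions are given up front instead of being generated by an LLM.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Assumed shape of the yielded value: (rows, is_last_batch).
GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]]


def process(
    instructions: List[str], batch_size: int, offset: int = 0
) -> GeneratorStepOutput:
    """Yield batches of rows together with a flag marking the last batch."""
    remaining = instructions[offset:]  # skip rows that were already produced
    if not remaining:
        yield [], True
        return
    for start in range(0, len(remaining), batch_size):
        batch = remaining[start : start + batch_size]
        is_last = start + batch_size >= len(remaining)
        yield [{"instruction": text} for text in batch], is_last
```

A consumer can call `next()` on the generator, as the examples in this page do, or iterate until the flag is True.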
EvolQuality
Bases: Task
Evolve the quality of the responses using an LLM.
The EvolQuality task is used to evolve the quality of the responses given a prompt, by generating a new response with a language model. This step implements the evolution quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
num_evolutions | int | The number of evolutions to be performed on the responses. |
store_evolutions | bool | Whether to store all the evolved responses or just the last one. |
include_original_response | bool | Whether to include the original response within the evolved responses. |
mutation_templates | Dict[str, str] | The mutation templates to be used to evolve the responses. |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
- instruction (str): The instruction that was used to generate the responses.
- response (str): The responses to be rewritten.
Output columns
- evolved_response (str): The evolved response if store_evolutions=False.
- evolved_responses (List[str]): The evolved responses if store_evolutions=True.
- model_name (str): The name of the LLM used to evolve the responses.
Categories
- evol
- response
- deita
Examples:
Evolve the quality of the responses given a prompt:
```python
from distilabel.steps.tasks import EvolQuality
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_quality = EvolQuality(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_quality.load()

result = next(
    evol_quality.process(
        [
            {"instruction": "common instruction", "response": "a response"},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'common instruction',
#         'response': 'a response',
#         'evolved_response': 'evolved response',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2'
#     }
# ]
```
Citations:
```
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
```
Source code in src/distilabel/steps/tasks/evol_quality/base.py
inputs: List[str]
property
¶
The inputs for the task are the `instruction` and the `response`.
mutation_templates_names: List[str]
property
¶
Returns the names, i.e. the keys, of the provided `mutation_templates` dictionary.
outputs: List[str]
property
¶
The outputs for the task are the `evolved_response` or `evolved_responses`, and the `model_name`.
_apply_random_mutation(instruction, response)
¶
Applies a random mutation from the ones provided as part of the `mutation_templates` dictionary, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction | str | The instruction to be included within the mutation prompt. | required |
Returns:
Type | Description |
---|---|
str | A random mutation prompt with the provided instruction. |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
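As a rough illustration of what applying a random mutation involves, the sketch below picks one template at random and fills in the instruction and the response. The template texts, the `<PROMPT>` and `<RESPONSE>` placeholders, and the use of Python's `random` module (the task itself seeds `numpy`) are assumptions made for this example, not the library's implementation.

```python
import random

# Hypothetical mutation templates; the real task ships its own prompts.
MUTATION_TEMPLATES = {
    "HELPFULNESS": "Make the response more helpful.\nPrompt: <PROMPT>\nResponse: <RESPONSE>",
    "DEPTH": "Make the response more in-depth.\nPrompt: <PROMPT>\nResponse: <RESPONSE>",
}


def apply_random_mutation(instruction, response, templates, rng):
    """Pick one template at random and substitute the instruction and response."""
    name = rng.choice(sorted(templates))  # sorted so the choice only depends on the seed
    prompt = templates[name]
    return prompt.replace("<PROMPT>", instruction).replace("<RESPONSE>", response)


rng = random.Random(42)  # fixed seed so the pick is reproducible
mutation_prompt = apply_random_mutation(
    "common instruction", "a response", MUTATION_TEMPLATES, rng
)
print(mutation_prompt)
```

Passing the random generator in explicitly keeps the function deterministic under a fixed seed, which mirrors the purpose of the `seed` runtime parameter.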
_evolve_reponses(inputs)
¶
Evolves the responses provided as part of the inputs of the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | A list of Python dictionaries with the inputs of the task. | required |
Returns:
Type | Description |
---|---|
List[List[str]]
|
A list where each item is a list with either the last evolved instruction if |
List[List[str]]
|
|
Source code in src/distilabel/steps/tasks/evol_quality/base.py
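The evolution loop can be pictured roughly as follows. This is a minimal sketch, assuming a hypothetical `generate` callable that stands in for the LLM call and assuming `num_evolutions` is at least 1; it only illustrates how `store_evolutions` and `include_original_response` shape the returned lists.

```python
def evolve_responses(
    rows,
    generate,
    num_evolutions,
    store_evolutions=False,
    include_original_response=False,
):
    """Return, per row, either every evolved response or only the last one."""
    results = []
    for row in rows:
        # Optionally keep the original response as the first entry.
        history = [row["response"]] if include_original_response else []
        current = row["response"]
        for _ in range(num_evolutions):
            current = generate(row["instruction"], current)
            history.append(current)
        results.append(history if store_evolutions else [history[-1]])
    return results


def fake_generate(instruction, response):
    # Stand-in for the LLM call: marks the response as evolved once more.
    return response + "+"


rows = [{"instruction": "common instruction", "response": "a response"}]
last_only = evolve_responses(rows, fake_generate, num_evolutions=2)
all_steps = evolve_responses(
    rows,
    fake_generate,
    num_evolutions=2,
    store_evolutions=True,
    include_original_response=True,
)
```

Each item in the result is itself a list, which matches the `List[List[str]]` return type documented for this method.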
format_input(input)
¶
The input is formatted as a `ChatType`, assuming that the `instruction` is the first interaction from the user within a conversation. The `system_prompt` is added as the first message if it exists.
Source code in src/distilabel/steps/tasks/evol_quality/base.py
format_output(responses)
¶
The output for the task is a dict with `evolved_response` or `evolved_responses`, depending on whether `store_evolutions` is `False` or `True`, respectively, and, finally, the `model_name`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
responses | Union[str, List[str]] | The responses to be included within the output. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
if |
Dict[str, Any]
|
if |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
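The switch between the two output keys can be sketched as below. The standalone function and its extra `store_evolutions` and `model_name` parameters are assumptions for this example; in the task these values come from the instance rather than from arguments.

```python
from typing import Any, Dict, List, Union


def format_output(
    responses: Union[str, List[str]],
    store_evolutions: bool,
    model_name: str,
) -> Dict[str, Any]:
    """Build the output row; the key name depends on `store_evolutions`."""
    key = "evolved_responses" if store_evolutions else "evolved_response"
    return {key: responses, "model_name": model_name}


single = format_output("evolved response", store_evolutions=False, model_name="my-model")
many = format_output(["first", "second"], store_evolutions=True, model_name="my-model")
```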
model_post_init(__context)
¶
Override this method to perform additional initialization after `__init__` and `model_construct`. This is useful if you want to do some validation that requires the entire model to be initialized.
Source code in src/distilabel/steps/tasks/evol_quality/base.py
process(inputs)
¶
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | A list of Python dictionaries with the inputs of the task. | required |
Returns:
Type | Description |
---|---|
StepOutput | A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
GenerateEmbeddings
¶
Bases: Step
Generate embeddings using the last hidden state of an `LLM`.
Generate embeddings for a text input using the last hidden state of an `LLM`, as described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
llm |
LLM
|
The |
Input columns
- text (`str`, `List[Dict[str, str]]`): The input text or conversation to generate embeddings for.
Output columns
- embedding (`List[float]`): The embedding of the input text or conversation.
- model_name (`str`): The model name used to generate the embeddings.
Categories
- embedding
- llm
Examples:
Generate embeddings for a text input:
```python
from distilabel.steps.tasks import GenerateEmbeddings
from distilabel.llms.huggingface import TransformersLLM
# Consider this as a placeholder for your actual LLM.
embedder = GenerateEmbeddings(
llm=TransformersLLM(
model="TaylorAI/bge-micro-v2",
model_kwargs={"is_decoder": True},
cuda_devices=[],
)
)
embedder.load()
result = next(
embedder.process(
[
{"text": "Hello, how are you?"},
]
)
)
```
Citations:
```
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
```
Source code in src/distilabel/steps/tasks/generate_embeddings.py
inputs: List[str]
property
¶
The input for the task is a `text` column containing either a string or a list of dictionaries in OpenAI chat-like format.
outputs: List[str]
property
¶
The output for the task is an `embedding` column containing the embedding of the `text` input.
format_input(input)
¶
Formats the input to be used by the LLM to generate the embeddings. The input can be in `ChatType` format or a string. If a string, it will be converted to a list of dictionaries in OpenAI chat-like format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input to format. | required |
Returns:
Type | Description |
---|---|
ChatType | The OpenAI chat-like format of the input. |
Source code in src/distilabel/steps/tasks/generate_embeddings.py
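The string-to-chat conversion described above might look like the following sketch. The local `ChatType` alias and the choice of the `user` role for plain strings are assumptions for this example.

```python
from typing import Any, Dict, List

ChatType = List[Dict[str, str]]  # local alias used only in this sketch


def format_input(input: Dict[str, Any]) -> ChatType:
    """Return the `text` column as a chat, wrapping plain strings in one turn."""
    text = input["text"]
    if isinstance(text, str):
        # Assumption: a plain string becomes a single user message.
        return [{"role": "user", "content": text}]
    # Already a list of chat-like dictionaries: pass it through unchanged.
    return text


from_string = format_input({"text": "Hello, how are you?"})
from_chat = format_input({"text": [{"role": "user", "content": "Hi"}]})
```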
load()
¶
process(inputs)
¶
Generates an embedding for each input using the last hidden state of the `LLM`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | A list of Python dictionaries with the inputs of the task. | required |
Yields:
Type | Description |
---|---|
StepOutput | A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/generate_embeddings.py
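One common way to turn a last hidden state into a single embedding is to keep the vector of the final token. The sketch below uses plain Python lists and assumes that pooling choice; the helper names are hypothetical and the library may pool differently.

```python
from typing import Any, Dict, List


def last_token_embedding(hidden_state: List[List[float]]) -> List[float]:
    """`hidden_state` holds one vector per token; return the last token's vector."""
    if not hidden_state:
        raise ValueError("hidden_state must contain at least one token vector")
    return list(hidden_state[-1])


def build_output(hidden_state: List[List[float]], model_name: str) -> Dict[str, Any]:
    # Output row with the two documented columns: `embedding` and `model_name`.
    return {"embedding": last_token_embedding(hidden_state), "model_name": model_name}


# Toy last hidden state: 3 tokens, hidden size 2.
hidden_state = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
row = build_output(hidden_state, model_name="my-model")
```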
GenerateLongTextMatchingData
¶
Bases: _EmbeddingDataGeneration
Generate long text matching data with an `LLM` to later on train an embedding model.
`GenerateLongTextMatchingData` is a `Task` that generates long text matching data with an `LLM` to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Note
Ideally, this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True` and `category="text-matching-long"`, so that the `LLM` generates a list of tasks that are flattened so that each row contains a single task for the text-matching-long category.
Attributes:
Name | Type | Description |
---|---|---|
language |
str
|
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
seed |
int
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate synthetic long text matching data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-matching-long",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateLongTextMatchingData(
        language="English",
        llm=...,  # LLM instance
    )

    task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys: List[str]
property
¶
Contains the `keys` that will be parsed from the `LLM` output into a Python dict.
format_input(input)
¶
Method to format the input based on the `task` and the provided attributes, or just randomly sampling those if not provided. This method will render the `_template` with the provided arguments and return an OpenAI formatted chat, i.e. a `ChatType`, assuming that there's only one turn, being from the user with the content being the rendered `_template`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input dictionary containing the `task` to be used in the `_template`. | required |
Returns:
Type | Description |
---|---|
ChatType | A list with a single chat containing the user's message with the rendered `_template`. |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
GenerateSentencePair
¶
Bases: Task
Generate a positive and, optionally, a negative sentence given an anchor sentence.
`GenerateSentencePair` is a pre-defined task that, given an anchor sentence, generates a positive sentence related to the anchor and optionally a negative sentence unrelated to the anchor or similar to it. Optionally, you can give a context to guide the LLM towards more specific behavior. This task is useful to generate training datasets for training embedding models.
Attributes:
Name | Type | Description |
---|---|---|
triplet |
bool
|
a flag to indicate if the task should generate a triplet of sentences
(anchor, positive, negative). Defaults to |
action |
GenerationAction
|
the action to perform to generate the positive sentence. |
context |
str
|
the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default. |
hard_negative |
bool
|
A flag to indicate if the negative should be a hard-negative or not. Hard negatives make it hard for the model to distinguish against the positive, with a higher degree of semantic similarity. |
Input columns
- anchor (`str`): The anchor sentence to generate the positive and negative sentences.
Output columns
- positive (`str`): The positive sentence related to the `anchor`.
- negative (`str`): The negative sentence unrelated to the `anchor` if `triplet=True`, or more similar to the positive to make it more challenging for a model to distinguish in case `hard_negative=True`.
- model_name (`str`): The name of the model that was used to generate the sentences.
Categories
- embedding
Examples:
Paraphrasing:
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="paraphrase",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}]))
```
Generating semantically similar sentences:
```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="semantically-similar",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}]))
```
Generating queries:
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}]))
```
Generating answers:
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="answer",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}]))
```
Generating queries with context (**applies to every action**):
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
context="Argilla is an open-source data curation platform for LLMs.",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}]))
```
Generating Hard-negatives (**applies to every action**):
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
context="Argilla is an open-source data curation platform for LLMs.",
hard_negative=True,
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}]))
```
Source code in src/distilabel/steps/tasks/sentence_transformers.py
inputs: List[str]
property
¶
The input for the task is the `anchor` sentence.
outputs: List[str]
property
¶
The outputs for the task are the `positive` and `negative` sentences, as well as the `model_name` used to generate the sentences.
format_input(input)
¶
The inputs are formatted as a `ChatType`, with a system prompt describing the task of generating positive and negative sentences for the anchor sentence. The anchor is provided as the first user interaction in the conversation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input |
Dict[str, Any]
|
The input containing the |
required |
Returns:
Type | Description |
---|---|
ChatType
|
A list of dictionaries containing the system and user interactions. |
Source code in src/distilabel/steps/tasks/sentence_transformers.py
format_output(output, input=None)
¶
Formats the output of the LLM to extract the `positive` and `negative` sentences generated. If the output is `None` or the regex doesn't match, then the outputs will be set to `None` as well.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The output of the LLM. | required |
input | Optional[Dict[str, Any]] | The input used to generate the output. | None |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
The formatted output containing the |
Source code in src/distilabel/steps/tasks/sentence_transformers.py
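The regex-based extraction with a `None` fallback can be sketched as follows. The `## Positive` / `## Negative` layout of the LLM output and the pattern itself are assumptions chosen for this example; only the fallback behaviour is taken from the description above.

```python
import re
from typing import Dict, Optional

# Assumed output layout for this sketch: "## Positive" and "## Negative" sections.
TRIPLET_PATTERN = re.compile(r"## Positive\s+(.*?)\s*## Negative\s+(.*)", re.DOTALL)


def format_output(output: Optional[str]) -> Dict[str, Optional[str]]:
    """Extract both sentences, or return None values when nothing can be parsed."""
    empty = {"positive": None, "negative": None}
    if output is None:
        return empty
    match = TRIPLET_PATTERN.search(output)
    if match is None:
        return empty
    return {
        "positive": match.group(1).strip(),
        "negative": match.group(2).strip(),
    }


parsed = format_output(
    "## Positive\n\nA related sentence.\n\n## Negative\n\nAn unrelated sentence."
)
unparsed = format_output("text without the expected sections")
missing = format_output(None)
```

Returning `None` values instead of raising keeps a single malformed generation from failing the whole batch.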
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/sentence_transformers.py
GenerateShortTextMatchingData
¶
Bases: _EmbeddingDataGeneration
Generate short text matching data with an `LLM` to later on train an embedding model.
`GenerateShortTextMatchingData` is a `Task` that generates short text matching data with an `LLM` to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Note
Ideally, this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True` and `category="text-matching-short"`, so that the `LLM` generates a list of tasks that are flattened so that each row contains a single task for the text-matching-short category.
Attributes:
Name | Type | Description |
---|---|---|
language |
str
|
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
seed |
int
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate synthetic short text matching data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-matching-short",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateShortTextMatchingData(
        language="English",
        llm=...,  # LLM instance
    )

    task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys: List[str]
property
¶
Contains the `keys` that will be parsed from the `LLM` output into a Python dict.
format_input(input)
¶
Method to format the input based on the `task` and the provided attributes, or just randomly sampling those if not provided. This method will render the `_template` with the provided arguments and return an OpenAI formatted chat, i.e. a `ChatType`, assuming that there's only one turn, being from the user with the content being the rendered `_template`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input dictionary containing the `task` to be used in the `_template`. | required |
Returns:
Type | Description |
---|---|
ChatType | A list with a single chat containing the user's message with the rendered `_template`. |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
GenerateTextClassificationData
¶
Bases: _EmbeddingDataGeneration
Generate text classification data with an `LLM` to later on train an embedding model.
`GenerateTextClassificationData` is a `Task` that generates text classification data with an `LLM` to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Note
Ideally, this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True` and `category="text-classification"`, so that the `LLM` generates a list of tasks that are flattened so that each row contains a single task for the text-classification category.
Attributes:
Name | Type | Description |
---|---|---|
language |
str
|
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
difficulty |
Optional[Literal['high school', 'college', 'PhD']]
|
The difficulty of the query to be generated, which can be |
clarity |
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
|
The clarity of the query to be generated, which can be |
seed |
int
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate synthetic text classification data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-classification",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextClassificationData(
        language="English",
        difficulty="high school",
        clarity="clear",
        llm=...,  # LLM instance
    )

    task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys: List[str]
property
¶
Contains the `keys` that will be parsed from the `LLM` output into a Python dict.
format_input(input)
¶
Method to format the input based on the `task` and the provided attributes, or just randomly sampling those if not provided. This method will render the `_template` with the provided arguments and return an OpenAI formatted chat, i.e. a `ChatType`, assuming that there's only one turn, being from the user with the content being the rendered `_template`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input dictionary containing the `task` to be used in the `_template`. | required |
Returns:
Type | Description |
---|---|
ChatType | A list with a single chat containing the user's message with the rendered `_template`. |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
GenerateTextRetrievalData
¶
Bases: _EmbeddingDataGeneration
Generate text retrieval data with an `LLM` to later on train an embedding model.
`GenerateTextRetrievalData` is a `Task` that generates text retrieval data with an `LLM` to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Note
Ideally, this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True` and `category="text-retrieval"`, so that the `LLM` generates a list of tasks that are flattened so that each row contains a single task for the text-retrieval category.
Attributes:
Name | Type | Description |
---|---|---|
language |
str
|
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
query_type |
Optional[Literal['extremely long-tail', 'long-tail', 'common']]
|
The type of query to be generated, which can be |
query_length |
Optional[Literal['less than 5 words', '5 to 15 words', 'at least 10 words']]
|
The length of the query to be generated, which can be |
difficulty |
Optional[Literal['high school', 'college', 'PhD']]
|
The difficulty of the query to be generated, which can be |
clarity |
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
|
The clarity of the query to be generated, which can be |
num_words |
Optional[Literal[50, 100, 200, 300, 400, 500]]
|
The number of words in the query to be generated, which can be |
seed |
int
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate synthetic text retrieval data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-retrieval",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextRetrievalData(
        language="English",
        query_type="common",
        query_length="5 to 15 words",
        difficulty="high school",
        clarity="clear",
        num_words=100,
        llm=...,  # LLM instance
    )

    task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys: List[str]
property
¶
Contains the `keys` that will be parsed from the `LLM` output into a Python dict.
format_input(input)
¶
Method to format the input based on the `task` and the provided attributes, or just randomly sampling those if not provided. This method will render the `_template` with the provided arguments and return an OpenAI formatted chat, i.e. a `ChatType`, assuming that there's only one turn, being from the user with the content being the rendered `_template`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input dictionary containing the `task` to be used in the `_template`. | required |
Returns:
Type | Description |
---|---|
ChatType | A list with a single chat containing the user's message with the rendered `_template`. |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
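The "use the provided attribute, otherwise sample one" behaviour can be sketched as below. The prompt text, the standalone function signature, and the use of `str.format` with Python's `random` module (instead of the task's Jinja2 `_template`) are assumptions for this example; the option lists come from the attribute types listed for this task.

```python
import random
from typing import Any, Dict, List, Optional

QUERY_TYPES = ["extremely long-tail", "long-tail", "common"]
DIFFICULTIES = ["high school", "college", "PhD"]

# Hypothetical prompt; the task renders its own Jinja2 `_template`.
TEMPLATE = "Task: {task}\nWrite a {query_type} query at {difficulty} level in {language}."


def format_input(
    input: Dict[str, Any],
    language: str = "English",
    query_type: Optional[str] = None,
    difficulty: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, str]]:
    """Render the prompt, sampling any attribute that was not provided."""
    rng = rng or random.Random()
    rendered = TEMPLATE.format(
        task=input["task"],
        language=language,
        query_type=query_type or rng.choice(QUERY_TYPES),
        difficulty=difficulty or rng.choice(DIFFICULTIES),
    )
    # Single-turn chat: one user message holding the rendered prompt.
    return [{"role": "user", "content": rendered}]


fixed = format_input({"task": "Find docs"}, query_type="common", difficulty="college")
sampled = format_input({"task": "Find docs"}, rng=random.Random(42))
```

Seeding the generator, as the `seed` attribute is meant to allow, makes the sampled attributes reproducible across runs.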
Genstruct
¶
Bases: Task
Generate an instruction-response pair from a document using an `LLM`.
`Genstruct` is a pre-defined task designed to generate valid instructions from a given raw document, with the title and the content, enabling the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is inspired by the Ada-Instruct paper.
Note
The Genstruct prompt, i.e. the task, can be used with any model, but the safest and recommended option is to use `NousResearch/Genstruct-7B` as the LLM provided to the task, since it was trained for this specific task.
Attributes:
Name | Type | Description |
---|---|---|
_template | Union[Template, None] | a Jinja2 template used to format the input for the LLM. |
Input columns
- title (`str`): The title of the document.
- content (`str`): The content of the document.
Output columns
- user (`str`): The user's instruction based on the document.
- assistant (`str`): The assistant's response based on the user's instruction.
- model_name (`str`): The model name used to generate the `user` and `assistant` messages.
Categories
- text-generation
- instruction
- response
References
Examples:
Generate instructions from raw documents using the title and content:
```python
from distilabel.steps.tasks import Genstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
genstruct = Genstruct(
    llm=InferenceEndpointsLLM(
        model_id="NousResearch/Genstruct-7B",
    ),
)

genstruct.load()

result = next(
    genstruct.process(
        [
            {"title": "common instruction", "content": "content of the document"},
        ]
    )
)
# result
# [
#     {
#         'title': 'common instruction',
#         'content': 'content of the document',
#         'model_name': 'NousResearch/Genstruct-7B',
#         'user': 'An instruction',
#         'assistant': 'content of the document',
#     }
# ]
```
Citations:
```
@misc{cui2023adainstructadaptinginstructiongenerators,
title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},
author={Wanyun Cui and Qianle Wang},
year={2023},
eprint={2310.04484},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2310.04484},
}
```
Source code in src/distilabel/steps/tasks/genstruct.py
inputs: List[str]
property
¶
The inputs for the task are the `title` and the `content`.
outputs: List[str]
property
¶
The outputs for the task are the `user` instruction based on the provided document and the `assistant` response based on the user's instruction.
format_input(input)
¶
The input is formatted as a `ChatType`, assuming that the instruction is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/genstruct.py
format_output(output, input)
¶
The output is formatted so that both the user and the assistant messages are captured.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | the raw output of the LLM. | required |
input | Dict[str, Any] | the input to the task. Used for obtaining the number of responses. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
A dict with the keys |
Source code in src/distilabel/steps/tasks/genstruct.py
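Capturing both messages from a single completion can be sketched like this. The example assumes the completion contains the user message, then an `[[[Assistant]]]` marker, then the answer; that layout and the regex are assumptions for illustration, and unparseable outputs simply yield `None` values.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<user message> [[[Assistant]]] <assistant message>".
PATTERN = re.compile(r"(.+?)\[\[\[Assistant\]\]\](.+)$", re.DOTALL)


def format_output(output: Optional[str]) -> Dict[str, Optional[str]]:
    """Split the completion into the user and assistant messages."""
    empty = {"user": None, "assistant": None}
    if output is None:
        return empty
    match = PATTERN.search(output)
    if match is None:
        return empty
    return {"user": match.group(1).strip(), "assistant": match.group(2).strip()}


parsed = format_output("What is X? [[[Assistant]]] X is Y.")
unparsed = format_output("no marker in this text")
```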
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/genstruct.py
InstructionBacktranslation
¶
Bases: Task
Self-Alignment with Instruction Backtranslation.
Attributes:
Name | Type | Description |
---|---|---|
_template | Optional[Template] | the Jinja2 template to use for the Instruction Backtranslation task. |
Input columns
- instruction (`str`): The reference instruction to evaluate the text output.
- generation (`str`): The text output to evaluate for the given instruction.
Output columns
- score (`str`): The score for the generation based on the given instruction.
- reason (`str`): The reason for the provided score.
- model_name (`str`): The model name used to score the generation.
Categories
- critique
Citations:
```
@misc{li2024selfalignmentinstructionbacktranslation,
title={Self-Alignment with Instruction Backtranslation},
author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},
year={2024},
eprint={2308.06259},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2308.06259},
}
```
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
inputs: List[str]
property
¶
The inputs for the task are the `instruction` and the `generation` for it.
outputs: List[str]
property
¶
The outputs for the task are the `score`, the `reason`, and the `model_name`.
format_input(input)
¶
The input is formatted as a `ChatType` assuming that the instruction is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
format_output(output, input)
¶
The output is formatted as a dictionary with the `score` and `reason`. The `model_name` will be automatically included within the `process` method of `Task`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | `Union[str, None]` | a string representing the output of the LLM via the | required |
input | `Dict[str, Any]` | the input to the task, as required by some tasks to format the output. | required |
Returns:
Type | Description |
---|---|
`Dict[str, Any]` | A dictionary containing the |
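The parsing step described above can be sketched as follows. This is only an illustration, not the code in `instruction_backtranslation.py`: it assumes the raw LLM output is free-text reasoning followed by a line such as `Score: 4`, and both the helper name `parse_score_and_reason` and the regular expression are hypothetical.

```python
import re
from typing import Any, Dict, Optional

# Assumed output shape: "<free-text reason> Score: <single digit>".
SCORE_PATTERN = re.compile(r"(.+?)Score:\s*(\d)", re.DOTALL)


def parse_score_and_reason(output: Optional[str]) -> Dict[str, Any]:
    """Split a raw LLM answer into the `score` and `reason` columns."""
    empty = {"score": None, "reason": None}
    if output is None:
        # The LLM produced nothing, so there is nothing to parse.
        return empty
    match = SCORE_PATTERN.search(output)
    if match is None:
        # The answer did not follow the assumed format.
        return empty
    # The score is kept as a string, matching the `str` type listed
    # under "Output columns".
    return {"score": match.group(2), "reason": match.group(1).strip()}


parsed = parse_score_and_reason("The answer is complete and accurate.\nScore: 4")
```

Returning `None` for both keys on a missing or malformed answer keeps every row the same shape, which is convenient when the rows are later written to a dataset.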
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
Magpie
¶
Bases: Task, MagpieBase
Generates conversations using an instruct fine-tuned LLM.
Magpie is a method for generating user instructions with no seed data or specific system prompt, relying on the autoregressive capabilities of instruct fine-tuned LLMs. Because these models were fine-tuned using a chat template composed of a user message and a desired assistant output, they learn that an instruction follows the pre-query (or pre-instruct) tokens. If these pre-query tokens are sent to the LLM without any user message, the LLM continues generating tokens as if it were the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM. Once an instruction has been generated, it can be sent back to the LLM to generate an assistant response. This process can be repeated N times to build a multi-turn conversation. The method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
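The pre-query trick can be illustrated with plain string building. The sketch below is not distilabel's implementation: it writes out the Llama 3 chat-template tokens by hand, and the helper names `header` and `build_magpie_prompt` are hypothetical. The point is only that the prompt ends with an open header for whichever role speaks next, so the model completes that message.

```python
from typing import Dict, List, Optional

# Llama 3 chat-template pieces, written out by hand for illustration.
BOS = "<|begin_of_text|>"
EOT = "<|eot_id|>"


def header(role: str) -> str:
    """Return the Llama 3 header that opens a message for `role`."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def build_magpie_prompt(
    conversation: List[Dict[str, str]], system_prompt: Optional[str] = None
) -> str:
    """Render the conversation so far and leave the next message open."""
    parts = [BOS]
    if system_prompt is not None:
        parts.append(header("system") + system_prompt + EOT)
    for message in conversation:
        parts.append(header(message["role"]) + message["content"] + EOT)
    # If the last message came from the user, the assistant speaks next;
    # otherwise (empty conversation or assistant last) the user does.
    last_role = conversation[-1]["role"] if conversation else "assistant"
    next_role = "assistant" if last_role == "user" else "user"
    parts.append(header(next_role))
    return "".join(parts)


# No user message at all: the model is left to write the instruction itself.
first_prompt = build_magpie_prompt([])
# After the instruction exists, the same helper asks for the assistant reply.
second_prompt = build_magpie_prompt([{"role": "user", "content": "What is a limit?"}])
```

Calling the helper repeatedly, appending each completion to `conversation`, alternates between user and assistant turns, which is how a multi-turn conversation would be built up under these assumptions.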
Attributes:
Name | Type | Description |
---|---|---|
n_turns | | the number of turns that the generated conversation will have. Defaults to `1`. |
end_with_user | | whether the conversation should end with a user message. Defaults to `False`. |
include_system_prompt | | whether to include the system prompt used in the generated conversation. Defaults to `False`. |
only_instruction | | whether to generate only the instruction. If this argument is `True`, then `n_turns` will be ignored. Defaults to `False`. |
system_prompt | | an optional system prompt or list of system prompts that can be used to steer the LLM to generate content of a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a `system_prompt` column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to `None`. |
Runtime parameters
- `n_turns`: the number of turns that the generated conversation will have. Defaults to `1`.
- `end_with_user`: whether the conversation should end with a user message. Defaults to `False`.
- `include_system_prompt`: whether to include the system prompt used in the generated conversation. Defaults to `False`.
- `only_instruction`: whether to generate only the instruction. If this argument is `True`, then `n_turns` will be ignored. Defaults to `False`.
- `system_prompt`: an optional system prompt or list of system prompts that can be used to steer the LLM to generate content of a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a `system_prompt` column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to `None`.
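The precedence rules for `system_prompt` can be summarised in a few lines. This is a minimal sketch of one reading of the description above, not the library's code: the helper name `resolve_system_prompts` is hypothetical, and it assumes the column is checked row by row while a list of prompts is sampled once per batch.

```python
import random
from typing import Dict, List, Optional, Union


def resolve_system_prompts(
    inputs: List[Dict[str, str]],
    system_prompt: Optional[Union[str, List[str]]] = None,
    rng: Optional[random.Random] = None,
) -> List[Optional[str]]:
    """Return the system prompt that would apply to each row of a batch."""
    rng = rng or random.Random()
    # A list of prompts means one random draw for the whole batch.
    if isinstance(system_prompt, list):
        batch_prompt = rng.choice(system_prompt)
    else:
        batch_prompt = system_prompt
    # A `system_prompt` column in the row overrides the runtime parameter.
    return [row.get("system_prompt", batch_prompt) for row in inputs]


rows = [{"system_prompt": "You are a math tutor."}, {}]
resolved = resolve_system_prompts(rows, ["prompt A", "prompt B"], random.Random(0))
```

Passing a seeded `random.Random` makes the per-batch draw reproducible, which helps when debugging a pipeline.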
Input columns
- system_prompt (`str`, optional): an optional system prompt that can be provided to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.
Output columns
- conversation (`ChatType`): the generated conversation which is a list of chat items with a role and a message. Only if `only_instruction=False`.
- instruction (`str`): the generated instructions if `only_instruction=True` or `n_turns==1`.
- response (`str`): the generated response if `n_turns==1`.
- model_name (`str`): The model name used to generate the `conversation` or `instruction`.
Categories
- text-generation
- instruction
Examples:
Generating instructions with Llama 3 8B Instruct and TransformersLLM:
```python
from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import Magpie

magpie = Magpie(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 64,
        },
        device="mps",
    ),
    only_instruction=True,
)

magpie.load()

result = next(
    magpie.process(
        inputs=[
            {
                "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
            },
            {
                "system_prompt": "You're an expert florist AI assistant that helps user to erradicate pests in their crops."
            },
        ]
    )
)
# [
# {'instruction': "That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?"},
# {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}
# ]
```
Generating conversations with Llama 3 8B Instruct and TransformersLLM:
```python
from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import Magpie

magpie = Magpie(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 256,
        },
        device="mps",
    ),
    n_turns=2,
)

magpie.load()

result = next(
    magpie.process(
        inputs=[
            {
                "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
            },
            {
                "system_prompt": "You're an expert florist AI assistant that helps user to erradicate pests in their crops."
            },
        ]
    )
)
# [
# {
# 'conversation': [
# {'role': 'system', 'content': "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."},
# {
# 'role': 'user',
# 'content': 'I'm having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x→a f(x) or lim x→a [f(x)]. It is read as "the limit as x approaches a of f
# of x".'
# },
# {
# 'role': 'assistant',
# 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don't worry, I'm here to help! The notation lim x→a f(x) indeed means "the limit as x approaches a of f of
# x". What it's asking us to do is find the'
# }
# ]
# },
# {
# 'conversation': [
# {'role': 'system', 'content': "You're an expert florist AI assistant that helps user to erradicate pests in their crops."},
# {
# 'role': 'user',
# 'content': "As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it
# might be pests or diseases, but I'm not sure which."
# },
# {
# 'role': 'assistant',
# 'content': "I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.
# **Aphids**: These small, soft-bodied insects can secrete a sticky substance called"
# }
# ]
# }
# ]
```
Source code in src/distilabel/steps/tasks/magpie/base.py
outputs: List[str]
property
¶
Either a multi-turn conversation or the instruction generated.
format_input(input)
¶
format_output(output, input=None)
¶
model_post_init(__context)
¶
Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.
Source code in src/distilabel/steps/tasks/magpie/base.py
process(inputs)
¶
Generate a list of instructions or conversations of the specified number of turns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | `StepInput` | a list of dictionaries that can contain a `system_prompt` key. | required |
Yields:
Type | Description |
---|---|
`StepOutput` | The list of generated conversations. |
Source code in src/distilabel/steps/tasks/magpie/base.py
MagpieGenerator
¶
Bases: GeneratorTask, MagpieBase
Generator task that generates instructions or conversations using Magpie.
Magpie is a method for generating user instructions with no seed data or specific system prompt, relying on the autoregressive capabilities of instruct fine-tuned LLMs. Because these models were fine-tuned using a chat template composed of a user message and a desired assistant output, they learn that an instruction follows the pre-query (or pre-instruct) tokens. If these pre-query tokens are sent to the LLM without any user message, the LLM continues generating tokens as if it were the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM. Once an instruction has been generated, it can be sent back to the LLM to generate an assistant response. This process can be repeated N times to build a multi-turn conversation. The method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
Attributes:
Name | Type | Description |
---|---|---|
n_turns | | the number of turns that the generated conversation will have. Defaults to `1`. |
end_with_user | | whether the conversation should end with a user message. Defaults to `False`. |
include_system_prompt | | whether to include the system prompt used in the generated conversation. Defaults to `False`. |
only_instruction | | whether to generate only the instruction. If this argument is `True`, then `n_turns` will be ignored. Defaults to `False`. |
system_prompt | | an optional system prompt or list of system prompts that can be used to steer the LLM to generate content of a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a `system_prompt` column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to `None`. |
num_rows | `RuntimeParameter[int]` | the number of rows to be generated. |
Runtime parameters
- `n_turns`: the number of turns that the generated conversation will have. Defaults to `1`.
- `end_with_user`: whether the conversation should end with a user message. Defaults to `False`.
- `include_system_prompt`: whether to include the system prompt used in the generated conversation. Defaults to `False`.
- `only_instruction`: whether to generate only the instruction. If this argument is `True`, then `n_turns` will be ignored. Defaults to `False`.
- `system_prompt`: an optional system prompt or list of system prompts that can be used to steer the LLM to generate content of a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a `system_prompt` column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to `None`.
- `num_rows`: the number of rows to be generated.
Output columns
- conversation (`ChatType`): the generated conversation which is a list of chat items with a role and a message.
- instruction (`str`): the generated instructions if `only_instruction=True`.
- response (`str`): the generated response if `n_turns==1`.
- model_name (`str`): The model name used to generate the `conversation` or `instruction`.
Categories
- text-generation
- instruction
- generator
Examples:
Generating instructions with Llama 3 8B Instruct and TransformersLLM:
```python
from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator

generator = MagpieGenerator(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 256,
        },
        device="mps",
    ),
    only_instruction=True,
    num_rows=5,
)

generator.load()

result = next(generator.process())
# (
# [
# {"instruction": "I've just bought a new phone and I're excited to start using it."},
# {"instruction": "What are the most common types of companies that use digital signage?"}
# ],
# True
# )
```
Generating a conversation with Llama 3 8B Instruct and TransformersLLM:
```python
from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator

generator = MagpieGenerator(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 64,
        },
        device="mps",
    ),
    n_turns=3,
    num_rows=5,
)

generator.load()

result = next(generator.process())
# (
# [
# {
# 'conversation': [
# {
# 'role': 'system',
# 'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
# },
# {'role': 'user', 'content': "I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?"},
# {
# 'role': 'assistant',
# 'content': "Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,
# let's break down the basics. First, we need to identify your goals and target audience. What do"
# },
# {
# 'role': 'user',
# 'content': "Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main
# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating"
# },
# {
# 'role': 'assistant',
# 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or
# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'
# }
# ]
# },
# {
# 'conversation': [
# {
# 'role': 'system',
# 'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
# },
# {'role': 'user', 'content': "I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!"},
# {
# 'role': 'assistant',
# 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.
# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'
# },
# {
# 'role': 'user',
# 'content': 'Let me stop you there. Let's explore this "purpose" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I're primarily using my
# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'
# },
# {
# 'role': 'assistant',
# 'content': "Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly
# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* "
# }
# ]
# }
# ],
# True
# )
```
Citations:
```
@misc{xu2024magpiealignmentdatasynthesis,
title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
year={2024},
eprint={2406.08464},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.08464},
}
```
Source code in src/distilabel/steps/tasks/magpie/generator.py
outputs: List[str]
property
¶
Either a multi-turn conversation or the instruction generated.
format_output(output, input=None)
¶
model_post_init(__context)
¶
Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.
Source code in src/distilabel/steps/tasks/magpie/generator.py
process(offset=0)
¶
Generates the desired number of instructions or conversations using Magpie.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | `int` | The offset to start the generation from. Defaults to `0`. | `0` |
Yields:
Type | Description |
---|---|
`GeneratorStepOutput` | The generated instructions or conversations. |
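The examples above show each yielded item as a tuple of a batch of rows and a boolean. The sketch below illustrates that shape with a plain Python generator. It is not the code in `generator.py`: the function name `generate_rows`, the `batch_size` argument, and the placeholder rows are hypothetical, and treating `offset` as the number of rows already produced is an assumption.

```python
from typing import Dict, Iterator, List, Tuple

# A batch of rows plus a flag telling the consumer whether it is the last one.
Batch = Tuple[List[Dict[str, str]], bool]


def generate_rows(num_rows: int, batch_size: int, offset: int = 0) -> Iterator[Batch]:
    """Yield batches until `num_rows` rows exist, starting from `offset`."""
    generated = offset
    while generated < num_rows:
        size = min(batch_size, num_rows - generated)
        # A real task would call the LLM here; placeholders keep this runnable.
        rows = [
            {"instruction": f"placeholder instruction {generated + i}"}
            for i in range(size)
        ]
        generated += size
        yield rows, generated >= num_rows


batches = list(generate_rows(num_rows=5, batch_size=2))
```

The boolean lets a downstream consumer know when to stop waiting for more data without having to know `num_rows` itself.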
Source code in src/distilabel/steps/tasks/magpie/generator.py
MonolingualTripletGenerator
¶
Bases: _EmbeddingDataGenerator
Generate monolingual triplets with an `LLM` to later train an embedding model.
`MonolingualTripletGenerator` is a `GeneratorTask` that generates monolingual triplets with an `LLM` to later train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models", and the data is generated based on the provided attributes, or randomly sampled if not provided.
Attributes:
Name | Type | Description |
---|---|---|
language | `str` | The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
unit | `Optional[Literal['sentence', 'phrase', 'passage']]` | The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`. |
difficulty | `Optional[Literal['elementary school', 'high school', 'college']]` | The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`. |
high_score | `Optional[Literal['4', '4.5', '5']]` | The high score of the query to be generated, which can be `4`, `4.5`, or `5`. |
low_score | `Optional[Literal['2.5', '3', '3.5']]` | The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`. |
seed | `int` | The random seed to be set in case there's any sampling within the |
Examples:
Generate monolingual triplets for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MonolingualTripletGenerator

with Pipeline("my-pipeline") as pipeline:
    task = MonolingualTripletGenerator(
        language="English",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,
    )

    ...

    task >> ...
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys: List[str]
property
¶
Contains the `keys` that will be parsed from the `LLM` output into a Python dict.
prompt: ChatType
property
¶
Contains the `prompt` to be used in the `process` method. It is the rendered `_template`, formatted as an OpenAI-style chat (i.e. a `ChatType`) with a single user turn whose content is the rendered `_template`.
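The shape of such a single-turn prompt can be sketched as follows. This is an illustration only: it uses `string.Template` from the standard library as a stand-in for the Jinja2 `_template`, and the template text, the `ChatType` alias, and the helper name `build_prompt` are hypothetical rather than taken from distilabel.

```python
from string import Template
from typing import Dict, List

# An OpenAI-style chat: a list of messages, each with a role and a content.
ChatType = List[Dict[str, str]]

# Hypothetical stand-in for the task's Jinja2 template.
PROMPT_TEMPLATE = Template(
    "Write a $unit in $language suitable for $difficulty readers."
)


def build_prompt(language: str, unit: str, difficulty: str) -> ChatType:
    """Render the template and wrap it as a single user message."""
    content = PROMPT_TEMPLATE.substitute(
        language=language, unit=unit, difficulty=difficulty
    )
    return [{"role": "user", "content": content}]


prompt = build_prompt(language="English", unit="sentence", difficulty="high school")
```

Because the task takes no input rows, the whole prompt is determined by the attributes, so it can be built once and reused for every generated row.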
PairRM
¶
Bases: Step
Rank the candidates based on the input using the `LLM` model.
Attributes:
Name | Type | Description |
---|---|---|
model | `str` | The model to use for the ranking. Defaults to `"llm-blender/PairRM"`. |
instructions | `Optional[str]` | The instructions to use for the model. Defaults to `None`. |
Input columns
- inputs (`List[Dict[str, Any]]`): The input text or conversation to rank the candidates for.
- candidates (`List[Dict[str, Any]]`): The candidates to rank.
Output columns
- ranks (`List[int]`): The ranks of the candidates based on the input.
- ranked_candidates (`List[Dict[str, Any]]`): The candidates ranked based on the input.
- model_name (`str`): The model name used to rank the candidate responses. Defaults to `"llm-blender/PairRM"`.
References
Categories
- preference
Note
This step differs from other tasks in that there is currently a single implementation of this model, so it uses a specific `LLM`.
Examples:
Rank LLM candidates:
```python
from distilabel.steps.tasks import PairRM

# Consider this as a placeholder for your actual LLM.
pair_rm = PairRM()

pair_rm.load()

result = next(
    pair_rm.process(
        [
            {"input": "Hello, how are you?", "candidates": ["fine", "good", "bad"]},
        ]
    )
)
# result
# [
# {
# 'input': 'Hello, how are you?',
# 'candidates': ['fine', 'good', 'bad'],
# 'ranks': [2, 1, 3],
# 'ranked_candidates': ['good', 'fine', 'bad'],
# 'model_name': 'llm-blender/PairRM',
# }
# ]
```
Citations:
```
@misc{jiang2023llmblenderensemblinglargelanguage,
title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},
year={2023},
eprint={2306.02561},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2306.02561},
}
```
Source code in src/distilabel/steps/tasks/pair_rm.py
inputs: List[str]
property
¶
The input columns correspond to the two required arguments from `Blender.rank`: `inputs` and `candidates`.
outputs: List[str]
property
¶
The outputs will include the `ranks` and the `ranked_candidates`.
format_input(input)
¶
The input is expected to be a dictionary with the keys `input` and `candidates`, where the `input` corresponds to the instruction of a model and `candidates` are a list of responses to be ranked.
Source code in src/distilabel/steps/tasks/pair_rm.py
load()
¶
Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the custom library for running the inference for the PairRM models.
Source code in src/distilabel/steps/tasks/pair_rm.py
process(inputs)
¶
Generates the ranks for the candidates based on the input.
The ranks are the positions of the candidates, where lower is better, and the ranked candidates correspond to the candidates sorted according to the ranks obtained.
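The relationship between `ranks` and `ranked_candidates` can be shown with a small sorting helper. This is a sketch of the ordering rule described above, not the code in `pair_rm.py`; the function name `sort_by_ranks` is hypothetical.

```python
from typing import List, Sequence


def sort_by_ranks(candidates: Sequence[str], ranks: Sequence[int]) -> List[str]:
    """Order candidates by their rank, where a lower rank is better."""
    if len(candidates) != len(ranks):
        raise ValueError("candidates and ranks must have the same length")
    # Pair each candidate with its rank and sort on the rank only.
    paired = sorted(zip(ranks, candidates), key=lambda pair: pair[0])
    return [candidate for _, candidate in paired]


# Same data as the "Rank LLM candidates" example.
ranked = sort_by_ranks(["fine", "good", "bad"], [2, 1, 3])
```

With the ranks `[2, 1, 3]` from the example, `"good"` has rank 1 and therefore comes first.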
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | `StepInput` | A list of Python dictionaries with the inputs of the task. | required |
Yields:
Type | Description |
---|---|
`StepOutput` | An iterator with the inputs containing the |
Source code in src/distilabel/steps/tasks/pair_rm.py
PrometheusEval
¶
Bases: Task
Critique and rank the quality of generations from an `LLM` using Prometheus 2.0.
`PrometheusEval` is a task created for Prometheus 2.0, covering both the absolute and relative
evaluations.
- The absolute evaluation i.e. `mode="absolute"` is used to evaluate a single generation from
an LLM for a given instruction.
- The relative evaluation i.e. `mode="relative"` is used to evaluate two generations from an LLM
for a given instruction.
Both evaluations can optionally compare against a reference answer, enabled via the `reference`
attribute, and both are based on a score rubric that critiques the generation/s based on the
following default aspects: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and
`reasoning`. These can be overridden via `rubrics`, and the rubric to apply is selected via the
`rubric` attribute.
Note:
The `PrometheusEval` task is better suited and intended to be used with any of the Prometheus 2.0
models released by Kaist AI, being: https://huggingface.co/prometheus-eval/prometheus-7b-v2.0,
and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The critique assessment formatting
and quality is not guaranteed if using another model, even though some other models may be able to
correctly follow the formatting and generate insightful critiques too.
Attributes:
mode: the evaluation mode to use, either `absolute` or `relative`. It defines whether the task
will evaluate one or two generations.
rubric: the score rubric to use within the prompt to run the critique based on different aspects.
Can be any existing key in the `rubrics` attribute, which by default means that it can be:
`helpfulness`, `harmlessness`, `honesty`, `factual-validity`, or `reasoning`. Those will only
work if using the default `rubrics`, otherwise, the provided `rubrics` should be used.
rubrics: a dictionary containing the different rubrics to use for the critique, where the keys are
the rubric names and the values are the rubric descriptions. The default rubrics are the following:
`helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.
reference: a boolean flag to indicate whether a reference answer / completion will be provided, so
that the model critique is based on the comparison with it. It implies that the column `reference`
needs to be provided within the input data in addition to the rest of the inputs.
_template: a Jinja2 template used to format the input for the LLM.
Input columns:
- instruction (`str`): The instruction to use as reference.
- generation (`str`, optional): The generated text from the given `instruction`. This column is required
if `mode=absolute`.
- generations (`List[str]`, optional): The generated texts from the given `instruction`. It should
contain 2 generations only. This column is required if `mode=relative`.
- reference (`str`, optional): The reference / golden answer for the `instruction`, to be used by the LLM
for comparison against.
Output columns:
- feedback (`str`): The feedback explaining the result below, as critiqued by the LLM using the
pre-defined score rubric, compared against `reference` if provided.
- result (`Union[int, Literal["A", "B"]]`): If `mode=absolute`, then the result contains the score for the
`generation` in a likert-scale from 1-5, otherwise, if `mode=relative`, then the result contains either
"A" or "B", the "winning" one being the generation in the index 0 of `generations` if `result='A'` or the
index 1 if `result='B'`.
- model_name (`str`): The model name used to generate the `feedback` and `result`.
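For `mode=relative`, the mapping from `result` back to the winning text can be written in a few lines. This is a minimal sketch of the index rule described above; the helper name `winning_generation` is hypothetical and is not part of `PrometheusEval`.

```python
from typing import List

# "A" refers to index 0 of `generations`, "B" to index 1.
RESULT_TO_INDEX = {"A": 0, "B": 1}


def winning_generation(generations: List[str], result: str) -> str:
    """Return the generation that the relative evaluation preferred."""
    if len(generations) != 2:
        raise ValueError("relative mode expects exactly 2 generations")
    if result not in RESULT_TO_INDEX:
        raise ValueError(f"result must be 'A' or 'B', got {result!r}")
    return generations[RESULT_TO_INDEX[result]]


winner = winning_generation(["something done", "other thing"], "B")
```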
Categories:
- critique
- preference
References:
- [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
- [prometheus-eval: Evaluate your LLM's response with Prometheus 💯](https://github.com/prometheus-eval/prometheus-eval)
Examples:
Critique and evaluate LLM generation quality using Prometheus 2.0:
```python
from distilabel.steps.tasks import PrometheusEval
from distilabel.llms import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="absolute",
    rubric="factual-validity",
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generation": "something done"},
        ]
    )
)
# result
# [
# {
# 'instruction': 'make something',
# 'generation': 'something done',
# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',
# 'feedback': 'the feedback',
# 'result': 6,
# }
# ]
```
Critique for relative evaluation:
```python
from distilabel.steps.tasks import PrometheusEval
from distilabel.llms import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="relative",
    rubric="honesty",
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generations": ["something done", "other thing"]},
        ]
    )
)
# result
# [
# {
# 'instruction': 'make something',
# 'generations': ['something done', 'other thing'],
# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',
# 'feedback': 'the feedback',
# 'result': 'A',
# }
# ]
```
Critique with a custom rubric:
```python
from distilabel.steps.tasks import PrometheusEval
from distilabel.llms import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="absolute",
    rubric="custom",
    rubrics={
        "custom": "[A] Score 1: A Score 2: B Score 3: C Score 4: D Score 5: E"
    },
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generation": "something done"},
        ]
    )
)
# result
# [
# {
# 'instruction': 'make something',
# 'generation': 'something done',
# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',
# 'feedback': 'the feedback',
# 'result': 6,
# }
# ]
```
Critique using a reference answer:
```python
from distilabel.steps.tasks import PrometheusEval
from distilabel.llms import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="absolute",
    rubric="helpfulness",
    reference=True,
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {
                "instruction": "make something",
                "generation": "something done",
                "reference": "this is a reference answer",
            },
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generation': 'something done',
#         'reference': 'this is a reference answer',
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 5,
#     }
# ]
```
Citations:
```
@misc{kim2024prometheus2opensource,
title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
year={2024},
eprint={2405.01535},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.01535},
}
```
Source code in src/distilabel/steps/tasks/prometheus_eval.py
inputs: List[str] property
The default inputs for the task are the `instruction` and the `generation` if `reference=False`; otherwise, the inputs are `instruction`, `generation`, and `reference`.
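The column selection described above can be sketched as a small function. This is only an illustration: `prometheus_inputs` is a hypothetical helper, not part of distilabel, and the use of `generations` for relative mode is an assumption taken from the relative-evaluation example in this section.

```python
from typing import List


def prometheus_inputs(mode: str = "absolute", reference: bool = False) -> List[str]:
    """Hypothetical helper mirroring the column selection described above."""
    # Assumption: absolute mode scores a single `generation`, while relative
    # mode compares the entries of `generations`.
    columns = ["instruction", "generation" if mode == "absolute" else "generations"]
    if reference:
        # The reference answer is only required when `reference=True`.
        columns.append("reference")
    return columns


default_columns = prometheus_inputs()
with_reference = prometheus_inputs(reference=True)
```

A row passed to `process` would then need to carry exactly those keys.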
outputs: List[str] property
The outputs for the task are the `feedback` and the `result` generated by Prometheus, as well as the `model_name`, which is automatically included based on the `LLM` used.
format_input(input)
The input is formatted as a `ChatType`, where the prompt is rendered with the selected Jinja2 template for Prometheus 2.0, assuming that it is the first interaction from the user and including a pre-defined system prompt.
format_output(output, input)
The output is formatted as a dict with the keys `feedback` and `result`, captured using a regex from the Prometheus output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The raw output of the LLM. | required |
input | Dict[str, Any] | The input to the task. Optionally provided in case it's useful to build the output. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with the keys |
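To make the regex capture concrete, here is a minimal sketch of such a parser. It assumes the raw completion looks like `Feedback: <text> [RESULT] <value>`; the pattern, the function name `parse_prometheus_output`, and the fallback to `None` values are illustrative assumptions rather than the library's actual implementation.

```python
import re
from typing import Any, Dict, Optional

# Assumed shape of the raw completion: "<feedback text> [RESULT] <value>".
_RESULT_PATTERN = re.compile(
    r"^(?P<feedback>.*?)\s*\[RESULT\]\s*(?P<result>\S+)\s*$", re.DOTALL
)


def parse_prometheus_output(output: Optional[str]) -> Dict[str, Any]:
    """Hypothetical parser; returns None values when nothing can be captured."""
    empty = {"feedback": None, "result": None}
    if output is None:
        return empty
    match = _RESULT_PATTERN.match(output.strip())
    if match is None:
        return empty
    feedback = match.group("feedback")
    # Drop the optional "Feedback:" prefix emitted by the model.
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    result = match.group("result")
    # Numeric results become integers; anything else is kept as a string.
    return {"feedback": feedback, "result": int(result) if result.isdigit() else result}


parsed = parse_prometheus_output("Feedback: The answer is correct. [RESULT] 4")
```

Returning a dict with `None` values instead of raising keeps a single malformed completion from aborting a whole batch.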
load()
Loads the Jinja2 template for Prometheus 2.0, for either absolute or relative evaluation depending on the `mode` value, and either with or without a reference depending on the value of `reference`.
QualityScorer
Bases: Task
Score responses based on their quality using an `LLM`.
`QualityScorer` is a pre-defined task that defines the `instruction` and `responses` as the inputs and `scores` as the output. This task is used to rate the quality of instructions and responses. It's an implementation of the quality score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'. The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs are scored in terms of quality, obtaining a quality score for each response.
Attributes:
Name | Type | Description |
---|---|---|
_template | Union[Template, None] | A Jinja2 template used to format the input for the LLM. |
Input columns
- instruction (`str`): The instruction that was used to generate the `responses`.
- responses (`List[str]`): The responses to be scored. Each response forms a pair with the instruction.
Output columns
- scores (`List[float]`): The score for each response.
- model_name (`str`): The model name used to generate the scores.
Categories
- scorer
- quality
- response
Examples:
Evaluate the quality of your instructions:
```python
from distilabel.steps.tasks import QualityScorer
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
scorer = QualityScorer(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

scorer.load()

result = next(
    scorer.process(
        [
            {
                "instruction": "instruction",
                "responses": ["good response", "weird response", "bad response"],
            }
        ]
    )
)
# result
# [
#     {
#         'instruction': 'instruction',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'scores': [5, 3, 1],
#     }
# ]
```
Citations:
```
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
```
Source code in src/distilabel/steps/tasks/quality_scorer.py
inputs: List[str] property
The inputs for the task are `instruction` and `responses`.
outputs property
The output for the task is a list of `scores` containing the quality score for each response in `responses`.
format_input(input)
The input is formatted as a `ChatType`, assuming that the instruction is the first interaction from the user within a conversation.
format_output(output, input)
The output is formatted as a list with the score of each instruction-response pair.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The raw output of the LLM. | required |
input | Dict[str, Any] | The input to the task. Used for obtaining the number of responses. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with the key |
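The role of `input` here (providing the number of responses) can be illustrated with a short sketch. The line format `[1] Score: 5`, the regex, and the function name `parse_quality_scores` are assumptions made for the example; the real prompt template may ask for a different layout.

```python
import re
from typing import Any, Dict, List, Optional

# Assumption: the LLM answers with one line per response, e.g. "[1] Score: 5".
_SCORE_LINE = re.compile(r"\[\d+\]\s*score:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_quality_scores(output: Optional[str], input: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical parser returning one score (or None) per response."""
    expected = len(input["responses"])
    if output is None:
        return {"scores": [None] * expected}
    scores: List[Optional[float]] = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between scores
        match = _SCORE_LINE.match(line)
        scores.append(float(match.group(1)) if match else None)
        if len(scores) == expected:
            break
    # Pad so the list always lines up with `responses`.
    scores += [None] * (expected - len(scores))
    return {"scores": scores}


row = {"instruction": "instruction", "responses": ["good", "weird", "bad"]}
parsed = parse_quality_scores("[1] Score: 5\n[2] Score: 3\n[3] Score: 1", row)
```

Padding with `None` keeps `scores` aligned with `responses` even when the model returns fewer lines than expected.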
load()
Loads the Jinja2 template.
SelfInstruct
Bases: Task
Generate instructions based on a given input using an `LLM`.
`SelfInstruct` is a pre-defined task that, given a number of instructions, certain criteria for query generation, an application description, and an input, generates a number of instructions related to the given input, following what is stated in the criteria for query generation and the application description. It is based on the SelfInstruct framework from the paper "Self-Instruct: Aligning Language Models with Self-Generated Instructions".
Attributes:
Name | Type | Description |
---|---|---|
num_instructions | int | The number of instructions to be generated. Defaults to 5. |
criteria_for_query_generation | str | The criteria for the query generation. Defaults to the criteria defined within the paper. |
application_description | str | The description of the AI application that one wants to build with these instructions. Defaults to |
Input columns
- input (`str`): The input to generate the instructions. It's also called seed in the paper.
Output columns
- instructions (`List[str]`): The generated instructions.
- model_name (`str`): The model name used to generate the instructions.
Categories
- text-generation
Examples:
Generate instructions based on a given input:
```python
from distilabel.steps.tasks import SelfInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM

self_instruct = SelfInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=5,  # This is the default value
)

self_instruct.load()

result = next(self_instruct.process([{"input": "instruction"}]))
# result
# [
#     {
#         'input': 'instruction',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'instructions': ["instruction 1", "instruction 2", "instruction 3", "instruction 4", "instruction 5"],
#     }
# ]
```
Citations:
```
@misc{wang2023selfinstructaligninglanguagemodels,
title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},
year={2023},
eprint={2212.10560},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2212.10560},
}
```
Source code in src/distilabel/steps/tasks/self_instruct.py
inputs: List[str] property
The input for the task is the `input`, i.e. the seed text.
outputs property
The output for the task is a list of `instructions` containing the generated instructions.
format_input(input)
The input is formatted as a `ChatType`, assuming that the instruction is the first interaction from the user within a conversation.
format_output(output, input=None)
The output is formatted as a list with the generated instructions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The raw output of the LLM. | required |
input | Optional[Dict[str, Any]] | The input to the task. Used for obtaining the number of responses. | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict containing the generated instructions. |
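One plausible way to turn a raw completion into the `instructions` list is to treat each non-empty line as one instruction. The sketch below assumes exactly that layout; `parse_instructions` is a hypothetical name and the line-per-instruction convention is an assumption, not something stated in this page.

```python
from typing import Dict, List, Optional


def parse_instructions(output: Optional[str]) -> Dict[str, List[str]]:
    """Hypothetical parser: one generated instruction per non-empty line."""
    if output is None:
        return {"instructions": []}
    lines = [line.strip() for line in output.split("\n")]
    # Blank lines carry no instruction and are dropped.
    return {"instructions": [line for line in lines if line]}


parsed = parse_instructions("Write a haiku about rain.\n\nSummarize this article.\n")
```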
load()
Loads the Jinja2 template.
StructuredGeneration
Bases: Task
Generate structured content for a given `instruction` using an `LLM`.
`StructuredGeneration` is a pre-defined task that defines the `instruction` and the `structured_output` as the inputs, and `generation` as the output. This task is used to generate structured content based on the input instruction, following the schema provided within the `structured_output` column for each `instruction`. The `model_name` is also returned as part of the output in order to enhance it.
Attributes:
Name | Type | Description |
---|---|---|
use_system_prompt | bool | Whether to use the system prompt in the generation. Defaults to |
Input columns
- instruction (`str`): The instruction to generate structured content from.
- structured_output (`Dict[str, Any]`): The structured_output to generate structured content from. It should be a Python dictionary with the keys `format` and `schema`, where `format` should be one of `json` or `regex`, and the `schema` should be either the JSON schema or the regex pattern, respectively.
Output columns
- generation (`str`): The generated text matching the provided schema, if possible.
- model_name (`str`): The name of the model used to generate the text.
Categories
- outlines
- structured-generation
Examples:
Generate structured output from a JSON schema:
```python
from distilabel.steps.tasks import StructuredGeneration
from distilabel.llms import InferenceEndpointsLLM

structured_gen = StructuredGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

structured_gen.load()

result = next(
    structured_gen.process(
        [
            {
                "instruction": "Create an RPG character",
                # Keys follow the `format` / `schema` convention of the input columns.
                "structured_output": {
                    "format": "json",
                    "schema": {
                        "properties": {
                            "name": {"title": "Name", "type": "string"},
                            "description": {"title": "Description", "type": "string"},
                            "role": {"title": "Role", "type": "string"},
                            "weapon": {"title": "Weapon", "type": "string"},
                        },
                        "required": ["name", "description", "role", "weapon"],
                        "title": "Character",
                        "type": "object",
                    },
                },
            }
        ]
    )
)
```
Generate structured output from a regex pattern:
```python
from distilabel.steps.tasks import StructuredGeneration
from distilabel.llms import InferenceEndpointsLLM

structured_gen = StructuredGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

structured_gen.load()

result = next(
    structured_gen.process(
        [
            {
                "instruction": "What's the weather like today in Seattle in Celsius degrees?",
                # Keys follow the `format` / `schema` convention of the input columns.
                "structured_output": {
                    "format": "regex",
                    "schema": r"(\d{1,2})°C",
                },
            }
        ]
    )
)
```
Source code in src/distilabel/steps/tasks/structured_generation.py
inputs: List[str] property
The inputs for the task are the `instruction` and the `structured_output`. Optionally, if the `use_system_prompt` flag is set to True, then the `system_prompt` will be used too.
outputs: List[str] property
The outputs for the task are the `generation` and the `model_name`.
format_input(input)
The input is formatted as a `ChatType`, assuming that the instruction is the first interaction from the user within a conversation.
format_output(output, input)
The output is formatted as a dictionary with the `generation`. The `model_name` will be automatically included within the `process` method of `Task`. Note that even if the `structured_output` is defined to produce a JSON schema, this method will return the raw output, i.e. a string without any parsing.
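Because `generation` is returned as a raw string, any JSON decoding has to happen downstream. The following sketch shows one way to do that with the standard library; `parse_generation` and the sample row are hypothetical and only shaped like the task's output columns.

```python
import json
from typing import Any, Dict, Optional


def parse_generation(row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Decode the raw `generation` string of an output row, if it is a JSON object."""
    try:
        parsed = json.loads(row["generation"])
    except (TypeError, json.JSONDecodeError):
        # `generation` was None or did not contain valid JSON.
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical output row, shaped like the task's output columns.
row = {
    "generation": '{"name": "Aria", "description": "A ranger", "role": "scout", "weapon": "bow"}',
    "model_name": "meta-llama/Meta-Llama-3-70B-Instruct",
}
character = parse_generation(row)
```

Returning `None` on failure lets a later step filter out rows whose generation did not match the schema.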
TextGeneration
Bases: Task
Simple text generation with an `LLM` given an instruction.
`TextGeneration` is a pre-defined task that defines the `instruction` as the input and `generation` as the output. This task is used to generate text based on the input instruction. The `model_name` is also returned as part of the output in order to enhance it.
Attributes:
Name | Type | Description |
---|---|---|
use_system_prompt | bool | Whether to use the system prompt in the generation. Defaults to |
Input columns
- instruction (`str`): The instruction to generate text from.
Output columns
- generation (`str`): The generated text.
- model_name (`str`): The name of the model used to generate the text.
Categories
- text-generation
Examples:
Generate text from an instruction:
```python
from distilabel.steps.tasks import TextGeneration
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

text_gen.load()

result = next(
    text_gen.process(
        [{"instruction": "your instruction"}]
    )
)
# result
# [
#     {
#         'instruction': 'your instruction',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'generation': 'generation',
#     }
# ]
```
Source code in src/distilabel/steps/tasks/text_generation.py
inputs: List[str] property
The input for the task is the `instruction`.
outputs: List[str] property
The outputs for the task are the `generation` and the `model_name`.
format_input(input)
The input is formatted as a `ChatType`, assuming that the instruction is the first interaction from the user within a conversation.
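As a rough picture of what "first interaction from the user" means, the sketch below builds a chat as a list of role/content messages. The `ChatMessages` alias, the `build_chat` function, and the handling of a `system_prompt` column are assumptions for illustration; they stand in for `ChatType` and `format_input` without claiming to match their implementation.

```python
from typing import Any, Dict, List

# Assumption: a chat is a list of {"role": ..., "content": ...} messages.
ChatMessages = List[Dict[str, str]]


def build_chat(input: Dict[str, Any], use_system_prompt: bool = False) -> ChatMessages:
    """Hypothetical formatter placing the instruction as the first user turn."""
    messages: ChatMessages = []
    # Only prepend a system message when requested and available in the row.
    if use_system_prompt and "system_prompt" in input:
        messages.append({"role": "system", "content": input["system_prompt"]})
    messages.append({"role": "user", "content": input["instruction"]})
    return messages


chat = build_chat({"instruction": "your instruction"})
```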
format_output(output, input=None)
The output is formatted as a dictionary with the `generation`. The `model_name` will be automatically included within the `process` method of `Task`.
UltraFeedback
Bases: Task
Rank generations focusing on different aspects using an `LLM`.
UltraFeedback: Boosting Language Models with High-quality Feedback.
Attributes:
Name | Type | Description |
---|---|---|
aspect | Literal['helpfulness', 'honesty', 'instruction-following', 'truthfulness', 'overall-rating'] | The aspect to perform with the |
Input columns
- instruction (`str`): The reference instruction to evaluate the text outputs.
- generations (`List[str]`): The text outputs to evaluate for the given instruction.
Output columns
- ratings (`List[float]`): The ratings for each of the provided text outputs.
- rationales (`List[str]`): The rationales for each of the provided text outputs.
- model_name (`str`): The name of the model used to generate the ratings and rationales.
Categories
- preference
References
Examples:
Rate generations from different LLMs based on the selected aspect:
```python
from distilabel.steps.tasks import UltraFeedback
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

ultrafeedback.load()

result = next(
    ultrafeedback.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generations": ["4", "and a car"],
            }
        ]
    )
)
# result
# [
#     {
#         'instruction': 'How much is 2+2?',
#         'generations': ['4', 'and a car'],
#         'ratings': [1, 2],
#         'rationales': ['explanation for 4', 'explanation for and a car'],
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#     }
# ]
```
Citations:
```
@misc{cui2024ultrafeedbackboostinglanguagemodels,
title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},
author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},
year={2024},
eprint={2310.01377},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2310.01377},
}
```
Source code in src/distilabel/steps/tasks/ultrafeedback.py
inputs: List[str] property
The inputs for the task are the `instruction` and the `generations` for it.
outputs: List[str] property
The outputs for the task are the `ratings` and `rationales` for each generation, and the `model_name`.
_format_ratings_rationales_output(output, input)
Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`.
_format_types_ratings_rationales_output(output, input)
Formats the output when the aspect is either `helpfulness` or `truthfulness`.
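The split between the two private formatters amounts to a dispatch on `aspect`. The sketch below only returns the name of the method that would handle each aspect; `formatter_name` is a hypothetical helper, and how `format_output` actually routes the call is an assumption based on the two descriptions above.

```python
from typing import Literal

Aspect = Literal[
    "helpfulness", "honesty", "instruction-following", "truthfulness", "overall-rating"
]


def formatter_name(aspect: Aspect) -> str:
    """Hypothetical dispatcher returning which private formatter handles `aspect`."""
    # `helpfulness` and `truthfulness` also carry a type, so they use the
    # formatter that handles types in addition to ratings and rationales.
    if aspect in ("helpfulness", "truthfulness"):
        return "_format_types_ratings_rationales_output"
    return "_format_ratings_rationales_output"


chosen = formatter_name("honesty")
```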
format_input(input)
The input is formatted as a `ChatType`, assuming that the instruction is the first interaction from the user within a conversation.
format_output(output, input)
The output is formatted as a dictionary with the `ratings` and `rationales` for each of the provided `generations` for the given `instruction`. The `model_name` will be automatically included within the `process` method of `Task`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | A string representing the output of the LLM via the | required |
input | Dict[str, Any] | The input to the task, as required by some tasks to format the output. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
A dictionary containing either the |
Dict[str, Any]
|
|
Dict[str, Any]
|
|
Dict[str, Any]
|
|
Dict[str, Any]
|
given |
load()
Loads the Jinja2 template for the given `aspect`.