Task Gallery¶
This section contains the existing Task
subclasses implemented in distilabel
.
BitextRetrievalGenerator
¶
Bases: _EmbeddingDataGenerator
Generate bitext retrieval data with an LLM
to later on train an embedding model.
BitextRetrievalGenerator
is a GeneratorTask
that generates bitext retrieval data with an
LLM
to later on train an embedding model. The task is based on the paper "Improving
Text Embeddings with Large Language Models" and the data is generated based on the
provided attributes, or randomly sampled if not provided.
Attributes:
Name | Type | Description |
---|---|---|
source_language |
str
|
The source language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
target_language |
str
|
The target language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
unit |
Optional[Literal['sentence', 'phrase', 'passage']]
|
The unit of the data to be generated, which can be |
difficulty |
Optional[Literal['elementary school', 'high school', 'college']]
|
The difficulty of the query to be generated, which can be |
high_score |
Optional[Literal['4', '4.5', '5']]
|
The high score of the query to be generated, which can be |
low_score |
Optional[Literal['2.5', '3', '3.5']]
|
The low score of the query to be generated, which can be |
seed |
Optional[Literal['2.5', '3', '3.5']]
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate bitext retrieval data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import BitextRetrievalGenerator
with Pipeline("my-pipeline") as pipeline:
task = BitextRetrievalGenerator(
source_language="English",
target_language="Spanish",
unit="sentence",
difficulty="elementary school",
high_score="4",
low_score="2.5",
llm=...,
)
...
task >> ...
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 |
|
keys: List[str]
property
¶
Contains the keys
that will be parsed from the LLM
output into a Python dict.
prompt: ChatType
property
¶
Contains the prompt
to be used in the process
method, rendering the _template
; and
formatted as an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn,
being from the user with the content being the rendered _template
.
ChatGeneration
¶
Bases: Task
Generates text based on a conversation.
ChatGeneration
is a pre-defined task that defines the messages
as the input
and generation
as the output. This task is used to generate text based on a conversation.
The model_name
is also returned as part of the output in order to enhance it.
Input columns
- messages (
List[Dict[Literal["role", "content"], str]]
): The messages to generate the follow up completion from.
Output columns
- generation (
str
): The generated text from the assistant. - model_name (
str
): The model name used to generate the text.
Categories
- chat-generation
Icon
:material-chat:
Examples:
Generate text from a conversation in OpenAI chat format:
```python
from distilabel.steps.tasks import ChatGeneration
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
chat = ChatGeneration(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
)
)
chat.load()
result = next(
chat.process(
[
{
"messages": [
{"role": "user", "content": "How much is 2+2?"},
]
}
]
)
)
# result
# [
# {
# 'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],
# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
# 'generation': '4',
# }
# ]
```
Source code in src/distilabel/steps/tasks/text_generation.py
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 |
|
inputs: List[str]
property
¶
The input for the task are the messages
.
outputs: List[str]
property
¶
The output for the task is the generation
and the model_name
.
format_input(input)
¶
The input is formatted as a ChatType
assuming that the messages provided
are already formatted that way i.e. following the OpenAI chat format.
Source code in src/distilabel/steps/tasks/text_generation.py
format_output(output, input)
¶
The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
Source code in src/distilabel/steps/tasks/text_generation.py
ComplexityScorer
¶
Bases: Task
Score instructions based on their complexity using an LLM
.
ComplexityScorer
is a pre-defined task used to rank a list of instructions based in
their complexity. It's an implementation of the complexity score task from the paper
'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection
in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
_template |
Union[Template, None]
|
a Jinja2 template used to format the input for the LLM. |
Input columns
- instructions (
List[str]
): The list of instructions to be scored.
Output columns
- scores (
List[float]
): The score for each instruction. - model_name (
str
): The model name used to generate the scores.
Categories
- scorer
- complexity
- instruction
Examples:
Evaluate the complexity of your instructions:
```python
from distilabel.steps.tasks import ComplexityScorer
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
scorer = ComplexityScorer(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
)
)
scorer.load()
result = next(
scorer.process(
[{"instructions": ["plain instruction", "highly complex instruction"]}]
)
)
# result
# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]
```
Source code in src/distilabel/steps/tasks/complexity_scorer.py
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 |
|
inputs: List[str]
property
¶
The inputs for the task are the instructions
.
outputs: List[str]
property
¶
The output for the task are: a list of scores
containing the complexity score for each
instruction in instructions
, and the model_name
.
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
format_output(output, input)
¶
The output is formatted as a list with the score of each instruction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output |
Union[str, None]
|
the raw output of the LLM. |
required |
input |
Dict[str, Any]
|
the input to the task. Used for obtaining the number of responses. |
required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
A dict with the key |
Source code in src/distilabel/steps/tasks/complexity_scorer.py
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
EvolComplexity
¶
Bases: EvolInstruct
Evolve instructions to make them more complex using an LLM
.
EvolComplexity
is a task that evolves instructions to make them more complex,
and it is based in the EvolInstruct task, using slight different prompts, but the
exact same evolutionary approach.
Attributes:
Name | Type | Description |
---|---|---|
num_instructions |
The number of instructions to be generated. |
|
generate_answers |
Whether to generate answers for the instructions or not. Defaults
to |
|
mutation_templates |
Dict[str, str]
|
The mutation templates to be used for the generation of the instructions. |
min_length |
Dict[str, str]
|
Defines the length (in bytes) that the generated instruction needs to
be higher than, to be considered valid. Defaults to |
max_length |
Dict[str, str]
|
Defines the length (in bytes) that the generated instruction needs to
be lower than, to be considered valid. Defaults to |
seed |
Dict[str, str]
|
The seed to be set for |
Runtime parameters
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The number of evolutions to be run.
Input columns
- instruction (
str
): The instruction to evolve.
Output columns
- evolved_instruction (
str
): The evolved instruction. - answer (
str
, optional): The answer to the instruction ifgenerate_answers=True
. - model_name (
str
): The name of the LLM used to evolve the instructions.
Categories
- evol
- instruction
- deita
References
Examples:
Evolve an instruction using an LLM:
```python
from distilabel.steps.tasks import EvolComplexity
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_complexity = EvolComplexity(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
)
evol_complexity.load()
result = next(evol_complexity.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
```
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
EvolComplexityGenerator
¶
Bases: EvolInstructGenerator
Generate evolved instructions with increased complexity using an LLM
.
EvolComplexityGenerator
is a generation task that evolves instructions to make
them more complex, and it is based in the EvolInstruct task, but using slight different
prompts, but the exact same evolutionary approach.
Attributes:
Name | Type | Description |
---|---|---|
num_instructions |
The number of instructions to be generated. |
|
generate_answers |
Whether to generate answers for the instructions or not. Defaults
to |
|
mutation_templates |
Dict[str, str]
|
The mutation templates to be used for the generation of the instructions. |
min_length |
Dict[str, str]
|
Defines the length (in bytes) that the generated instruction needs to
be higher than, to be considered valid. Defaults to |
max_length |
Dict[str, str]
|
Defines the length (in bytes) that the generated instruction needs to
be lower than, to be considered valid. Defaults to |
seed |
Dict[str, str]
|
The seed to be set for |
Runtime parameters
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The number of evolutions to be run.
Output columns
- instruction (
str
): The evolved instruction. - answer (
str
, optional): The answer to the instruction ifgenerate_answers=True
. - model_name (
str
): The name of the LLM used to evolve the instructions.
Categories
- evol
- instruction
- generation
- deita
References
Examples:
Generate evolved instructions without initial instructions:
```python
from distilabel.steps.tasks import EvolComplexityGenerator
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_complexity_generator = EvolComplexityGenerator(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_instructions=2,
)
evol_complexity_generator.load()
result = next(scorer.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
```
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
EvolInstruct
¶
Bases: Task
Evolve instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name | Type | Description |
---|---|---|
num_evolutions |
int
|
The number of evolutions to be performed. |
store_evolutions |
bool
|
Whether to store all the evolutions or just the last one. Defaults
to |
generate_answers |
bool
|
Whether to generate answers for the evolved instructions. Defaults
to |
include_original_instruction |
bool
|
Whether to include the original instruction in the
|
mutation_templates |
Dict[str, str]
|
The mutation templates to be used for evolving the instructions.
Defaults to the ones provided in the |
seed |
RuntimeParameter[int]
|
The seed to be set for |
Runtime parameters
seed
: The seed to be set fornumpy
in order to randomly pick a mutation method.
Input columns
- instruction (
str
): The instruction to evolve.
Output columns
- evolved_instruction (
str
): The evolved instruction ifstore_evolutions=False
. - evolved_instructions (
List[str]
): The evolved instructions ifstore_evolutions=True
. - model_name (
str
): The name of the LLM used to evolve the instructions. - answer (
str
): The answer to the evolved instruction ifgenerate_answers=True
andstore_evolutions=False
. - answers (
List[str]
): The answers to the evolved instructions ifgenerate_answers=True
andstore_evolutions=True
.
Categories
- evol
- instruction
References
Examples:
Evolve an instruction using an LLM:
```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
)
evol_instruct.load()
result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
```
Keep the iterations of the evolutions:
```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
store_evolutions=True,
)
evol_instruct.load()
result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
# {
# 'instruction': 'common instruction',
# 'evolved_instructions': ['initial evolution', 'final evolution'],
# 'model_name': 'model_name'
# }
# ]
```
Generate answers for the instructions in a single step:
```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
generate_answers=True,
)
evol_instruct.load()
result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
# {
# 'instruction': 'common instruction',
# 'evolved_instruction': 'evolved instruction',
# 'answer': 'answer to the instruction',
# 'model_name': 'model_name'
# }
# ]
```
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 |
|
inputs: List[str]
property
¶
The input for the task is the instruction
.
mutation_templates_names: List[str]
property
¶
Returns the names i.e. keys of the provided mutation_templates
.
outputs: List[str]
property
¶
The output for the task are the evolved_instruction/s
, the answer
if generate_answers=True
and the model_name
.
_apply_random_mutation(instruction)
¶
Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction |
str
|
The instruction to be included within the mutation prompt. |
required |
Returns:
Type | Description |
---|---|
str
|
A random mutation prompt with the provided instruction. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
_evolve_instructions(inputs)
¶
Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs |
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Returns:
Type | Description |
---|---|
List[List[str]]
|
A list where each item is a list with either the last evolved instruction if |
List[List[str]]
|
|
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
_generate_answers(evolved_instructions)
¶
Generates the answer for the instructions in instructions
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
evolved_instructions |
List[List[str]]
|
A list of lists where each item is a list with either the last
evolved instruction if |
required |
Returns:
Type | Description |
---|---|
List[List[str]]
|
A list of answers for each instruction. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation. And the
system_prompt
is added as the first message if it exists.
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
format_output(instructions, answers=None)
¶
The output for the task is a dict with: evolved_instruction
or evolved_instructions
,
depending whether the value is either False
or True
for store_evolutions
, respectively;
answer
if generate_answers=True
; and, finally, the model_name
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instructions |
Union[str, List[str]]
|
The instructions to be included within the output. |
required |
answers |
Optional[List[str]]
|
The answers to be included within the output if |
None
|
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
If |
Dict[str, Any]
|
if |
Dict[str, Any]
|
if |
Dict[str, Any]
|
if |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
process(inputs)
¶
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs |
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Yields:
Type | Description |
---|---|
StepOutput
|
A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
EvolInstructGenerator
¶
Bases: GeneratorTask
Generate evolved instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name | Type | Description |
---|---|---|
num_instructions |
int
|
The number of instructions to be generated. |
generate_answers |
bool
|
Whether to generate answers for the instructions or not. Defaults
to |
mutation_templates |
Dict[str, str]
|
The mutation templates to be used for the generation of the instructions. |
min_length |
RuntimeParameter[int]
|
Defines the length (in bytes) that the generated instruction needs to
be higher than, to be considered valid. Defaults to |
max_length |
RuntimeParameter[int]
|
Defines the length (in bytes) that the generated instruction needs to
be lower than, to be considered valid. Defaults to |
seed |
RuntimeParameter[int]
|
The seed to be set for |
Runtime parameters
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The seed to be set fornumpy
in order to randomly pick a mutation method.
Output columns
- instruction (
str
): The generated instruction ifgenerate_answers=False
. - answer (
str
): The generated answer ifgenerate_answers=True
. - instructions (
List[str]
): The generated instructions ifgenerate_answers=True
. - model_name (
str
): The name of the LLM used to generate and evolve the instructions.
Categories
- evol
- instruction
- generation
References
Examples:
Generate evolved instructions without initial instructions:
```python
from distilabel.steps.tasks import EvolInstructGenerator
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_instruct_generator = EvolInstructGenerator(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_instructions=2,
)
evol_instruct_generator.load()
result = next(scorer.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
```
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 |
|
_english_nouns: List[str]
cached
property
¶
A list of English nouns to be used as part of the starting prompts for the task.
References
- https://github.com/h2oai/h2o-wizardlm
mutation_templates_names: List[str]
property
¶
Returns the names i.e. keys of the provided mutation_templates
.
outputs: List[str]
property
¶
The output for the task are the instruction
, the answer
if generate_answers=True
and the model_name
.
_apply_random_mutation(iter_no)
¶
Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
iter_no |
int
|
The iteration number to be used to check whether the iteration is the first one i.e. FRESH_START, or not. |
required |
Returns:
Type | Description |
---|---|
List[ChatType]
|
A random mutation prompt with the provided instruction formatted as an OpenAI conversation. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
_generate_answers(instructions)
¶
Generates the answer for the last instruction in instructions
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instructions |
List[List[str]]
|
A list of lists where each item is a list with either the last
evolved instruction if |
required |
Returns:
Type | Description |
---|---|
List[str]
|
A list of answers for the last instruction in |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
_generate_seed_texts()
¶
Generates a list of seed texts to be used as part of the starting prompts for the task.
It will use the FRESH_START
mutation template, as it needs to generate text from scratch; and
a list of English words will be used to generate the seed texts that will be provided to the
mutation method and included within the prompt.
Returns:
Type | Description |
---|---|
List[str]
|
A list of seed texts to be used as part of the starting prompts for the task. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
format_output(instruction, answer=None)
¶
The output for the task is a dict with: instruction
; answer
if generate_answers=True
;
and, finally, the model_name
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction |
str
|
The instruction to be included within the output. |
required |
answer |
Optional[str]
|
The answer to be included within the output if |
None
|
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
If |
Dict[str, Any]
|
if |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
model_post_init(__context)
¶
Override this method to perform additional initialization after __init__
and model_construct
.
This is useful if you want to do some validation that requires the entire model to be initialized.
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
process(offset=0)
¶
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset |
int
|
The offset to start the generation from. Defaults to 0. |
0
|
Yields:
Type | Description |
---|---|
GeneratorStepOutput
|
A list of Python dictionaries with the outputs of the task, and a boolean |
GeneratorStepOutput
|
flag indicating whether the task has finished or not i.e. is the last batch. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
EvolQuality
¶
Bases: Task
Evolve the quality of the responses using an LLM
.
EvolQuality
task is used to evolve the quality of the responses given a prompt,
by generating a new response with a language model. This step implements the evolution
quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
num_evolutions |
int
|
The number of evolutions to be performed on the responses. |
store_evolutions |
bool
|
Whether to store all the evolved responses or just the last one.
Defaults to |
include_original_response |
bool
|
Whether to include the original response within the evolved
responses. Defaults to |
mutation_templates |
Dict[str, str]
|
The mutation templates to be used to evolve the responses. |
seed |
RuntimeParameter[int]
|
The seed to be set for |
Runtime parameters
seed
: The seed to be set fornumpy
in order to randomly pick a mutation method.
Input columns
- instruction (
str
): The instruction that was used to generate theresponses
. - response (
str
): The responses to be rewritten.
Output columns
- evolved_response (
str
): The evolved response ifstore_evolutions=False
. - evolved_responses (
List[str]
): The evolved responses ifstore_evolutions=True
. - model_name (
str
): The name of the LLM used to evolve the responses.
Categories
- evol
- response
- deita
Examples:
Evolve the quality of the responses given a prompt:
```python
from distilabel.steps.tasks import EvolQuality
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_quality = EvolQuality(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
)
evol_quality.load()
result = next(
evol_quality.process(
[
{"instruction": "common instruction", "response": "a response"},
]
)
)
# result
# [
# {
# 'instruction': 'common instruction',
# 'response': 'a response',
# 'evolved_response': 'evolved response',
# 'model_name': '"mistralai/Mistral-7B-Instruct-v0.2"'
# }
# ]
```
Source code in src/distilabel/steps/tasks/evol_quality/base.py
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 |
|
inputs: List[str]
property
¶
The input for the task are the instruction
and response
.
mutation_templates_names: List[str]
property
¶
Returns the names i.e. keys of the provided mutation_templates
enum.
outputs: List[str]
property
¶
The output for the task are the evolved_response/s
and the model_name
.
_apply_random_mutation(instruction, response)
¶
Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction |
str
|
The instruction to be included within the mutation prompt. |
required |
Returns:
Type | Description |
---|---|
str
|
A random mutation prompt with the provided instruction. |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
_evolve_reponses(inputs)
¶
Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs |
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Returns:
Type | Description |
---|---|
List[List[str]]
|
A list where each item is a list with either the last evolved instruction if |
List[List[str]]
|
|
Source code in src/distilabel/steps/tasks/evol_quality/base.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation. And the
system_prompt
is added as the first message if it exists.
Source code in src/distilabel/steps/tasks/evol_quality/base.py
format_output(responses)
¶
The output for the task is a dict with: evolved_response
or evolved_responses
,
depending whether the value is either False
or True
for store_evolutions
, respectively;
and, finally, the model_name
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
responses |
Union[str, List[str]]
|
The responses to be included within the output. |
required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
if |
Dict[str, Any]
|
if |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
model_post_init(__context)
¶
Override this method to perform additional initialization after __init__
and model_construct
.
This is useful if you want to do some validation that requires the entire model to be initialized.
Source code in src/distilabel/steps/tasks/evol_quality/base.py
process(inputs)
¶
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs |
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Returns:
Type | Description |
---|---|
StepOutput
|
A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
GenerateEmbeddings
¶
Bases: Step
Generate embeddings using the last hidden state of an LLM
.
Generate embeddings for a text input using the last hidden state of an LLM
, as
described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
llm |
LLM
|
The |
Input columns
- text (
str
,List[Dict[str, str]]
): The input text or conversation to generate embeddings for.
Output columns
- embedding (
List[float]
): The embedding of the input text or conversation. - model_name (
str
): The model name used to generate the embeddings.
Categories
- embedding
- llm
Examples:
Rank LLM candidates:
```python
from distilabel.steps.tasks import GenerateEmbeddings
from distilabel.llms.huggingface import TransformersLLM
# Consider this as a placeholder for your actual LLM.
embedder = GenerateEmbeddings(
llm=TransformersLLM(
model="TaylorAI/bge-micro-v2",
model_kwargs={"is_decoder": True},
cuda_devices=[],
)
)
embedder.load()
result = next(
embedder.process(
[
{"text": "Hello, how are you?"},
]
)
)
```
Source code in src/distilabel/steps/tasks/generate_embeddings.py
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 |
|
inputs: List[str]
property
¶
The inputs for the task is a text
column containing either a string or a
list of dictionaries in OpenAI chat-like format.
outputs: List[str]
property
¶
The outputs for the task is an embedding
column containing the embedding of
the text
input.
format_input(input)
¶
Formats the input to be used by the LLM to generate the embeddings. The input
can be in ChatType
format or a string. If a string, it will be converted to a
list of dictionaries in OpenAI chat-like format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input |
Dict[str, Any]
|
The input to format. |
required |
Returns:
Type | Description |
---|---|
ChatType
|
The OpenAI chat-like format of the input. |
Source code in src/distilabel/steps/tasks/generate_embeddings.py
load()
¶
process(inputs)
¶
Generates an embedding for each input using the last hidden state of the LLM
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs |
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Yields:
Type | Description |
---|---|
StepOutput
|
A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/generate_embeddings.py
GenerateLongTextMatchingData
¶
Bases: _EmbeddingDataGeneration
Generate long text matching data with an LLM
to later on train an embedding model.
GenerateLongTextMatchingData
is a Task
that generates long text matching data with an
LLM
to later on train an embedding model. The task is based on the paper "Improving
Text Embeddings with Large Language Models" and the data is generated based on the
provided attributes, or randomly sampled if not provided.
Note
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category="text-matching-long"
; so that the LLM
generates a list of tasks that
are flattened so that each row contains a single task for the text-matching-long category.
Attributes:
Name | Type | Description |
---|---|---|
language |
str
|
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
seed |
str
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate synthetic long text matching data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData
with Pipeline("my-pipeline") as pipeline:
task = EmbeddingTaskGenerator(
category="text-matching-long",
flatten_tasks=True,
llm=..., # LLM instance
)
generate = GenerateLongTextMatchingData(
language="English",
llm=..., # LLM instance
)
task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 |
|
keys: List[str]
property
¶
Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
¶
Method to format the input based on the task
and the provided attributes, or just
randomly sampling those if not provided. This method will render the _template
with
the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that
there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input |
Dict[str, Any]
|
The input dictionary containing the |
required |
Returns:
Type | Description |
---|---|
ChatType
|
A list with a single chat containing the user's message with the rendered |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
GenerateSentencePair
¶
Bases: Task
Generate a positive and negative (optionally) sentences given an anchor sentence.
GenerateSentencePair
is a pre-defined task that given an anchor sentence generates
a positive sentence related to the anchor and optionally a negative sentence unrelated
to the anchor. Optionally, you can give a context to guide the LLM towards more specific
behavior. This task is useful to generate training datasets for training embeddings
models.
Attributes:
Name | Type | Description |
---|---|---|
triplet |
bool
|
a flag to indicate if the task should generate a triplet of sentences
(anchor, positive, negative). Defaults to |
action |
GenerationAction
|
the action to perform to generate the positive sentence. |
context |
str
|
the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default. |
Input columns
- anchor (
str
): The anchor sentence to generate the positive and negative sentences.
Output columns
- positive (
str
): The positive sentence related to theanchor
. - negative (
str
): The negative sentence unrelated to theanchor
iftriplet=True
. - model_name (
str
): The name of the model that was used to generate the sentences.
Categories
- embedding
Examples:
Paraphrasing:
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="paraphrase",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
```
Generating semantically similar sentences:
```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="semantically-similar",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])
```
Generating queries:
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])
```
Generating answers:
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="answer",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
```
Generating queries with context (**applies to every action**):
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
context="Argilla is an open-source data curation platform for LLMs.",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
```
Source code in src/distilabel/steps/tasks/sentence_transformers.py
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 |
|
inputs: List[str]
property
¶
The inputs for the task is the anchor
sentence.
outputs: List[str]
property
¶
The outputs for the task are the positive
and negative
sentences, as well
as the model_name
used to generate the sentences.
format_input(input)
¶
The inputs are formatted as a ChatType
, with a system prompt describing the
task of generating a positive and negative sentences for the anchor sentence. The
anchor is provided as the first user interaction in the conversation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input |
Dict[str, Any]
|
The input containing the |
required |
Returns:
Type | Description |
---|---|
ChatType
|
A list of dictionaries containing the system and user interactions. |
Source code in src/distilabel/steps/tasks/sentence_transformers.py
format_output(output, input=None)
¶
Formats the output of the LLM, to extract the positive
and negative
sentences
generated. If the output is None
or the regex doesn't match, then the outputs
will be set to None
as well.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output |
Union[str, None]
|
The output of the LLM. |
required |
input |
Optional[Dict[str, Any]]
|
The input used to generate the output. |
None
|
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
The formatted output containing the |
Source code in src/distilabel/steps/tasks/sentence_transformers.py
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/sentence_transformers.py
GenerateShortTextMatchingData
¶
Bases: _EmbeddingDataGeneration
Generate short text matching data with an LLM
to later on train an embedding model.
GenerateShortTextMatchingData
is a Task
that generates short text matching data with an
LLM
to later on train an embedding model. The task is based on the paper "Improving
Text Embeddings with Large Language Models" and the data is generated based on the
provided attributes, or randomly sampled if not provided.
Note
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category="text-matching-short"
; so that the LLM
generates a list of tasks that
are flattened so that each row contains a single task for the text-matching-short category.
Attributes:
Name | Type | Description |
---|---|---|
language |
str
|
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
seed |
str
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate synthetic short text matching data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData
with Pipeline("my-pipeline") as pipeline:
task = EmbeddingTaskGenerator(
category="text-matching-short",
flatten_tasks=True,
llm=..., # LLM instance
)
generate = GenerateShortTextMatchingData(
language="English",
llm=..., # LLM instance
)
task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 |
|
keys: List[str]
property
¶
Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
¶
Method to format the input based on the task
and the provided attributes, or just
randomly sampling those if not provided. This method will render the _template
with
the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that
there's only one turn, being from the user with the content being the rendered _template
.
Args:
input: The input dictionary containing the `task` to be used in the `_template`.
Returns:
A list with a single chat containing the user's message with the rendered `_template`.
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
GenerateTextClassificationData
¶
Bases: _EmbeddingDataGeneration
Generate text classification data with an LLM
to later on train an embedding model.
GenerateTextClassificationData
is a Task
that generates text classification data with an
LLM
to later on train an embedding model. The task is based on the paper "Improving
Text Embeddings with Large Language Models" and the data is generated based on the
provided attributes, or randomly sampled if not provided.
Note
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category="text-classification"
; so that the LLM
generates a list of tasks that
are flattened so that each row contains a single task for the text-classification category.
Attributes:
Name | Type | Description |
---|---|---|
language |
str
|
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
difficulty |
Optional[Literal['high school', 'college', 'PhD']]
|
The difficulty of the query to be generated, which can be |
clarity |
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
|
The clarity of the query to be generated, which can be |
seed |
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate synthetic text classification data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData
with Pipeline("my-pipeline") as pipeline:
task = EmbeddingTaskGenerator(
category="text-classification",
flatten_tasks=True,
llm=..., # LLM instance
)
generate = GenerateTextClassificationData(
language="English",
difficulty="high school",
clarity="clear",
llm=..., # LLM instance
)
task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 |
|
keys: List[str]
property
¶
Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
¶
Method to format the input based on the task
and the provided attributes, or just
randomly sampling those if not provided. This method will render the _template
with
the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that
there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input |
Dict[str, Any]
|
The input dictionary containing the |
required |
Returns:
Type | Description |
---|---|
ChatType
|
A list with a single chat containing the user's message with the rendered |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
GenerateTextRetrievalData
¶
Bases: _EmbeddingDataGeneration
Generate text retrieval data with an LLM
to later on train an embedding model.
GenerateTextRetrievalData
is a Task
that generates text retrieval data with an
LLM
to later on train an embedding model. The task is based on the paper "Improving
Text Embeddings with Large Language Models" and the data is generated based on the
provided attributes, or randomly sampled if not provided.
Note
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category="text-retrieval"
; so that the LLM
generates a list of tasks that
are flattened so that each row contains a single task for the text-retrieval category.
Attributes:
Name | Type | Description |
---|---|---|
language |
str
|
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
query_type |
Optional[Literal['extremely long-tail', 'long-tail', 'common']]
|
The type of query to be generated, which can be |
query_length |
Optional[Literal['less than 5 words', '5 to 15 words', 'at least 10 words']]
|
The length of the query to be generated, which can be |
difficulty |
Optional[Literal['high school', 'college', 'PhD']]
|
The difficulty of the query to be generated, which can be |
clarity |
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
|
The clarity of the query to be generated, which can be |
num_words |
Optional[Literal[50, 100, 200, 300, 400, 500]]
|
The number of words in the query to be generated, which can be |
seed |
Optional[Literal[50, 100, 200, 300, 400, 500]]
|
The random seed to be set in case there's any sampling within the |
Examples:
Generate synthetic text retrieval data for training embedding models:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData
with Pipeline("my-pipeline") as pipeline:
task = EmbeddingTaskGenerator(
category="text-retrieval",
flatten_tasks=True,
llm=..., # LLM instance
)
generate = GenerateTextRetrievalData(
language="English",
query_type="common",
query_length="5 to 15 words",
difficulty="high school",
clarity="clear",
num_words=100,
llm=..., # LLM instance
)
task >> generate
```
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 |
|
keys: List[str]
property
¶
Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
¶
Method to format the input based on the task
and the provided attributes, or just
randomly sampling those if not provided. This method will render the _template
with
the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that
there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input |
Dict[str, Any]
|
The input dictionary containing the |
required |
Returns:
Type | Description |
---|---|
ChatType
|
A list with a single chat containing the user's message with the rendered |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
Genstruct
¶
Bases: Task
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
is a pre-defined task designed to generate valid instructions from a given raw document,
with the title and the content, enabling the creation of new, partially synthetic instruction finetuning
datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is
inspired in the Ada-Instruct paper.
Note
The Genstruct prompt i.e. the task, can be used with any model really, but the safest / recommended
option is to use NousResearch/Genstruct-7B
as the LLM provided to the task, since it was trained
for this specific task.
Attributes:
Name | Type | Description |
---|---|---|
_template |
Union[Template, None]
|
a Jinja2 template used to format the input for the LLM. |
Input columns
- title (
str
): The title of the document. - content (
str
): The content of the document.
Output columns
- user (
str
):