Task Gallery
This section contains the existing Task subclasses implemented in distilabel.
tasks
APIGenExecutionChecker
Bases: Step
Executes the generated function calls.
This step checks whether an answer generated by a model through APIGenGenerator can be executed against the given library (specified by libpath, a string pointing to a Python .py file with functions).
Attributes:
Name | Type | Description |
---|---|---|
libpath | str | The path to the library from which the functions are retrieved. It can also point to a folder; in that case, the folder should contain .py files, each defining a single function whose name matches the filename. |
check_is_dangerous | bool | Whether to exclude potentially dangerous functions, based on heuristics found while testing. These functions can run subprocesses, interact with the OS, or perform other potentially dangerous operations. Defaults to True. |
Input columns
- answers (str): List with arguments to be passed to the function, dumped as a string from a list of dictionaries. Should be loaded using json.loads.
Output columns
- keep_row_after_execution_check (bool): Whether the function should be kept or not.
- execution_result (str): The result from executing the function.
Categories
- filtering
- execution
References
Examples:
Execute a function from a given library with the answer from an LLM:
from distilabel.steps.tasks import APIGenExecutionChecker
# For the libpath you can use as an example the file at the tests folder:
# ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py
task = APIGenExecutionChecker(
    libpath="../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py",
)
task.load()
res = next(
    task.process(
        [
            {
                "answers": [
                    {
                        "arguments": {
                            "initial_velocity": 0.2,
                            "acceleration": 0.1,
                            "time": 0.5,
                        },
                        "name": "final_velocity",
                    }
                ],
            }
        ]
    )
)
res
# [{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]
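As a rough illustration of what checking an answer against a library file involves, the sketch below writes a one-function module to a temporary file, imports it from its path, and calls the function with the arguments from an answer. The module contents, the `load_module_from_path` helper, and the temporary file are assumptions made for this example; they are not taken from `_sample_module.py` or from the distilabel implementation.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical library file: one function whose name matches the call in the answer.
MODULE_SOURCE = '''
def final_velocity(initial_velocity, acceleration, time):
    """Return the final velocity under constant acceleration."""
    return initial_velocity + acceleration * time
'''


def load_module_from_path(path):
    """Import a .py file from an arbitrary path (illustrative helper)."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp_dir:
    libpath = Path(tmp_dir) / "sample_module.py"
    libpath.write_text(MODULE_SOURCE)
    toolbox = load_module_from_path(libpath)

# An answer in the same shape as the example above.
answer = {
    "name": "final_velocity",
    "arguments": {"initial_velocity": 0.2, "acceleration": 0.1, "time": 0.5},
}
function = getattr(toolbox, answer["name"])
result = function(**answer["arguments"])
print(result)
```

Looking the function up by name and unpacking the arguments dictionary mirrors the shape of the `answers` column, where each entry carries a `name` and an `arguments` mapping.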
Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
inputs property
The inputs for the task are those found in the original dataset.
outputs property
The outputs are the columns required by the APIGenGenerator task.
load()
Loads the library where the functions will be extracted from.
_get_function(function_name)
Retrieves the function from the toolbox.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
function_name | str | The name of the function to retrieve. | required |
Returns:
Name | Type | Description |
---|---|---|
Callable | Callable | The function to be executed. |
Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
_is_dangerous(function)
Checks whether a function is dangerous, so that it can be removed. Contains a list of heuristics to avoid executing possibly dangerous functions.
Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
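The heuristics themselves are not listed here. As a purely hypothetical sketch of how such a check could work, the snippet below flags a function when its compiled code references names associated with the OS or subprocesses. The `RISKY_NAMES` set and the `looks_dangerous` helper are assumptions for illustration and should not be read as the heuristics distilabel uses.

```python
# Names treated as risky in this sketch; the real heuristics may differ.
RISKY_NAMES = {"os", "subprocess", "shutil", "system", "popen", "eval", "exec"}


def looks_dangerous(function):
    """Flag a function whose bytecode references any risky name."""
    referenced = set(function.__code__.co_names)
    return bool(referenced & RISKY_NAMES)


def add(a, b):
    return a + b


def list_files(path):
    import os

    return os.listdir(path)


# Keep only the functions that pass the check.
safe_functions = [f for f in (add, list_files) if not looks_dangerous(f)]
print([f.__name__ for f in safe_functions])
```

A name-based check like this is cheap but coarse: it can miss indirect calls and can flag harmless code, which is why such checks are best treated as a first filter rather than a guarantee.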
process(inputs)
Checks the answer to see if it can be executed. Captures the possible errors and returns them.
If a single example is provided, it is copied to avoid raising an error.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | A list of dictionaries with the input data. | required |
Yields:
Type | Description |
---|---|
StepOutput | A list of dictionaries with the output data. |
Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
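The idea of executing each call while capturing errors can be sketched as follows. This is a minimal, self-contained approximation that assumes the answers arrive as a JSON string and the functions live in a plain dictionary; `check_execution`, `toolbox`, and `divide` are hypothetical names, and the error message format is invented for the example.

```python
import json


def check_execution(answers_json, toolbox):
    """Run every call in a JSON-encoded answer list and report the outcome."""
    results = []
    keep = True
    for call in json.loads(answers_json):
        function = toolbox.get(call["name"])
        if function is None:
            results.append(f"Function '{call['name']}' not found.")
            keep = False
            continue
        try:
            results.append(str(function(**call["arguments"])))
        except Exception as error:  # capture the error instead of raising
            results.append(f"{type(error).__name__}: {error}")
            keep = False
    return {"keep_row_after_execution_check": keep, "execution_result": results}


def divide(a, b):
    return a / b


toolbox = {"divide": divide}
good = check_execution(
    json.dumps([{"name": "divide", "arguments": {"a": 6, "b": 3}}]), toolbox
)
bad = check_execution(
    json.dumps([{"name": "divide", "arguments": {"a": 1, "b": 0}}]), toolbox
)
print(good)
print(bad)
```

Returning the error text as the execution result, rather than raising, keeps the failing row available for inspection or for later filtering on `keep_row_after_execution_check`.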
APIGenGenerator
Bases: Task
Generate queries and answers for the given functions in JSON format.
The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate
verifiable and diverse function-calling datasets. The task generates a set of diverse queries
and corresponding answers for the given functions in JSON format.
Attributes:
system_prompt: The system prompt to guide the user in the generation of queries and answers.
use_tools: Whether to use the tools available in the prompt to generate the queries and answers.
In case the tools are given in the input, they will be added to the prompt.
number: The number of queries to generate. It can be a list, where each number will be
chosen randomly, or a dictionary with the number of queries and the probability of each.
For example, `number=1`, `number=[1, 2, 3]`, and `number={1: 0.5, 2: 0.3, 3: 0.2}` are all valid inputs.
It corresponds to the number of parallel queries to generate.
use_default_structured_output: Whether to use the default structured output or not.
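The three accepted shapes of `number` can be resolved to a single integer roughly as shown below. This is a sketch under the assumption that a list is sampled uniformly and a dictionary maps each count to its probability; `sample_number` is a hypothetical helper, not the method used by the task.

```python
import random


def sample_number(number, rng=random):
    """Resolve `number` (int, list, or {count: probability} dict) to one integer."""
    if isinstance(number, int):
        return number
    if isinstance(number, list):
        # Each candidate is equally likely.
        return rng.choice(number)
    if isinstance(number, dict):
        counts = list(number.keys())
        weights = list(number.values())
        # Weighted draw: the probabilities act as relative weights.
        return rng.choices(counts, weights=weights, k=1)[0]
    raise ValueError(f"Unsupported value for number: {number!r}")


rng = random.Random(0)  # seeded only to make the demo repeatable
print(sample_number(1, rng))
print(sample_number([1, 2, 3], rng))
print(sample_number({1: 0.5, 2: 0.3, 3: 0.2}, rng))
```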
Input columns:
- examples (`str`): Examples used as few shots to guide the model.
- func_name (`str`): Name for the function to generate.
- func_desc (`str`): Description of what the function should do.
- tools (`str`): JSON formatted string containing the tool representation of the function.
Output columns:
- query (`str`): The list of queries.
- answers (`str`): JSON formatted string with the list of answers, containing the info as
a dictionary to be passed to the functions.
Categories:
- text-generation
References:
- [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)
- [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)
Examples:
Generate without structured output (original implementation):
```python
from distilabel.steps.tasks import APIGenGenerator
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
apigen = APIGenGenerator(
    use_default_structured_output=False,
    llm=llm
)
apigen.load()

res = next(
    apigen.process(
        [
            {
                "examples": 'QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
                "func_name": "getrandommovie",
                "func_desc": "Returns a list of random movies from a database by calling an external API."
            }
        ]
    )
)
res
# [{'examples': 'QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
# 'number': 1,
# 'func_name': 'getrandommovie',
# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',
# 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?',
# 'Give me 5 random movie suggestions from your database to plan my weekend.'],
# 'answers': [[{'name': 'getrandommovie', 'arguments': {}}],
# [{'name': 'getrandommovie', 'arguments': {}},
# {'name': 'getrandommovie', 'arguments': {}},
# {'name': 'getrandommovie', 'arguments': {}},
# {'name': 'getrandommovie', 'arguments': {}},
# {'name': 'getrandommovie', 'arguments': {}}]],
# 'raw_input_api_gen_generator_0': [{'role': 'system',
# 'content': "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.
# Construct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.
# Ensure the query: - Is clear and concise - Demonstrates typical use cases - Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words - Across a variety level of difficulties, ranging from beginner and advanced use cases - The corresponding result's parameter types and ranges match with the function's descriptions
# Ensure the answer: - Is a list of function calls in JSON format - The length of the answer list should be equal to the number of requests in the query - Can solve all the requests in the query effectively"},
# {'role': 'user',
# 'content': 'Here are examples of queries and the corresponding answers for similar functions: QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]
# Note that the query could be interpreted as a combination of several independent requests.
# Based on these examples, generate 2 diverse query and answer pairs for the function getrandommovie
# The detailed function description is the following:
# Returns a list of random movies from a database by calling an external API.
# The output MUST strictly adhere to the following JSON format, and NO other text MUST be included:
# [
#   {
#     "query": "The generated query.",
#     "answers": [
#       {
#         "name": "api_name",
#         "arguments": {
#           "arg_name": "value"
#           ... (more arguments as required)
#         }
#       },
#       ... (more API calls as required)
#     ]
#   }
# ]
# Now please generate 2 diverse query and answer pairs following the above format.'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
```
Generate with structured output:
```python
from distilabel.steps.tasks import APIGenGenerator
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
apigen = APIGenGenerator(
    use_default_structured_output=True,
    llm=llm
)
apigen.load()

res_struct = next(
    apigen.process(
        [
            {
                "examples": 'QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
                "func_name": "getrandommovie",
                "func_desc": "Returns a list of random movies from a database by calling an external API."
            }
        ]
    )
)
res_struct
# [{'examples': 'QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
# 'number': 1,
# 'func_name': 'getrandommovie',
# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',
# 'queries': ["I'm bored and want to watch a movie. Can you suggest some movies?",
# "My family and I are planning a movie night. We can't decide on what to watch. Can you suggest some random movie titles?"],
# 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}],
# [{'arguments': {}, 'name': 'getrandommovie'}]],
# 'raw_input_api_gen_generator_0': [{'role': 'system',
# 'content': "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.
# Construct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.
# Ensure the query: - Is clear and concise - Demonstrates typical use cases - Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words - Across a variety level of difficulties, ranging from beginner and advanced use cases - The corresponding result's parameter types and ranges match with the function's descriptions
# Ensure the answer: - Is a list of function calls in JSON format - The length of the answer list should be equal to the number of requests in the query - Can solve all the requests in the query effectively"},
# {'role': 'user',
# 'content': 'Here are examples of queries and the corresponding answers for similar functions: QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]
# Note that the query could be interpreted as a combination of several independent requests.
# Based on these examples, generate 2 diverse query and answer pairs for the function getrandommovie
# The detailed function description is the following:
# Returns a list of random movies from a database by calling an external API.
# Now please generate 2 diverse query and answer pairs following the above format.'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
```
Source code in src/distilabel/steps/tasks/apigen/generator.py
inputs property
The inputs for the task.
outputs property
The outputs for the task are the queries and corresponding answers.
load()
Loads the template for the generator prompt.
Source code in src/distilabel/steps/tasks/apigen/generator.py
_parallel_queries(number)
Prepares the function to update the parallel queries guide in the prompt.
Raises:
Type | Description |
---|---|
ValueError | if |
Returns:
Type | Description |
---|---|
Callable[[int], str] | The function to generate the parallel queries guide. |
Source code in src/distilabel/steps/tasks/apigen/generator.py
_get_number()
Generates the number of queries to generate in a single call.
The number must be set to _number to avoid changing the original value when calling _default_error.
Source code in src/distilabel/steps/tasks/apigen/generator.py
_set_format_inst()
Prepares the function to generate the formatted instructions for the prompt.
If the default structured output is used, returns an empty string because nothing else is needed. Otherwise, returns the original addition to the prompt, which guides the model to generate a formatted JSON.
Source code in src/distilabel/steps/tasks/apigen/generator.py
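The switch described for the format instructions can be sketched in a few lines. The instruction text below is a shortened, hypothetical stand-in loosely modelled on the prompt shown in the examples, and `format_instructions` is an illustrative helper rather than the task's own method.

```python
# Hypothetical format instructions, loosely modelled on the prompt shown in the examples.
FORMAT_INSTRUCTIONS = (
    "The output MUST strictly adhere to the following JSON format, "
    "and NO other text MUST be included:\n"
    '[{"query": "The generated query.", "answers": [{"name": "api_name", "arguments": {}}]}]'
)


def format_instructions(use_default_structured_output):
    """Return the extra prompt text, or nothing when a schema is already enforced."""
    return "" if use_default_structured_output else FORMAT_INSTRUCTIONS


prompt_tail = "Now please generate 2 diverse query and answer pairs following the above format."
for structured in (True, False):
    extra = format_instructions(structured)
    # Empty parts are skipped so the structured prompt has no blank section.
    print(repr("\n".join(part for part in (extra, prompt_tail) if part)))
```

This matches what the two generator examples show: the prompt without structured output carries the JSON template, while the structured-output prompt omits it.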
_get_func_desc(input)
If available and required, uses the info from the tools in the prompt for extra information. Otherwise, uses just the function description.
Source code in src/distilabel/steps/tasks/apigen/generator.py
format_input(input)
The input is formatted as a ChatType.
Source code in src/distilabel/steps/tasks/apigen/generator.py
format_output(output, input)
The output is formatted as a dictionary with the queries and their corresponding answers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The raw output of the LLM. | required |
input | Dict[str, Any] | The input to the task. Used for obtaining the number of responses. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with the queries and answers pairs. The answers are an array of answers corresponding to the query. Each answer is represented as an object with the following properties: - name (string): The name of the tool used to generate the answer. - arguments (object): An object representing the arguments passed to the tool to generate the answer. Each argument is represented as a key-value pair, where the key is the parameter name and the value is the corresponding value. |
Source code in src/distilabel/steps/tasks/apigen/generator.py
_format_output(pairs, input)
Parses the response, returning a dictionary with queries and answers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pairs | Dict[str, Any] | The parsed dictionary from the LLM's output. | required |
input
|
Dict[str, Any]
|
The input from the |
required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
Formatted output, where the |
Dict[str, Any]
|
are a list of objects. |
Source code in src/distilabel/steps/tasks/apigen/generator.py
_default_error(input)
Returns a default error output, to fill the responses in case of failure.
Source code in src/distilabel/steps/tasks/apigen/generator.py
get_structured_output()
Creates the JSON schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a Python dictionary.
The schema corresponds to the following:
import json
from typing import Dict, List

from pydantic import BaseModel

class Answer(BaseModel):
    name: str
    arguments: Dict[str, str]

class QueryAnswer(BaseModel):
    query: str
    answers: List[Answer]

class QueryAnswerPairs(BaseModel):
    pairs: List[QueryAnswer]

json.dumps(QueryAnswerPairs.model_json_schema(), indent=4)
Returns:
Type | Description |
---|---|
Dict[str, Any] | JSON Schema of the response to enforce. |
Source code in src/distilabel/steps/tasks/apigen/generator.py
APIGenSemanticChecker
Bases: Task
Checks the semantic alignment between a query, the generated function calls, and their execution results.
The APIGenSemanticChecker is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task uses an LLM to assess whether the generated function calls and their execution results accurately reflect the intentions of the user query.
Attributes:
Name | Type | Description |
---|---|---|
system_prompt | str | System prompt for the task. Has a default one. |
exclude_failed_execution | str | Whether to exclude failed executions (won't run on those rows that have a False in |
Input columns
- func_desc (str): Description of what the function should do.
- query (str): Instruction from the user.
- answers (str): JSON encoded list with arguments to be passed to the function/API. Should be loaded using json.loads.
- execution_result (str): Result of the function/API executed.
Output columns
- thought (str): Reasoning for the output on whether to keep this output or not.
- keep_row_after_semantic_check (bool): True or False, can be used to filter afterwards.
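Filtering on the output column afterwards can be as simple as a list comprehension over the resulting rows. The rows below are made-up sample data, and the `None` value for a row whose check could not be completed is an assumption for illustration.

```python
# Made-up rows in the shape produced by the checker.
rows = [
    {"query": "q1", "thought": "", "keep_row_after_semantic_check": True},
    {
        "query": "q2",
        "thought": "The call does not match the query.",
        "keep_row_after_semantic_check": False,
    },
    # Assumption: a row whose check failed to produce a decision carries None.
    {"query": "q3", "thought": None, "keep_row_after_semantic_check": None},
]

# Keep only rows explicitly marked as passing.
kept = [row for row in rows if row["keep_row_after_semantic_check"] is True]
print([row["query"] for row in kept])
```

Comparing with `is True` rather than relying on truthiness makes the treatment of undecided rows explicit: they are dropped along with the failing ones.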
Categories
- filtering
- text-generation
References
Examples:
Semantic checker for generated function calls (original implementation):
```python
import json

from distilabel.steps.tasks import APIGenSemanticChecker
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
semantic_checker = APIGenSemanticChecker(
    use_default_structured_output=False,
    llm=llm
)
semantic_checker.load()

res = next(
    semantic_checker.process(
        [
            {
                "func_desc": "Fetch information about a specific cat breed from the Cat Breeds API.",
                "query": "What information can be obtained about the Maine Coon cat breed?",
                "answers": json.dumps([{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]),
                "execution_result": "The Maine Coon is a big and hairy breed of cat",
            }
        ]
    )
)
res
# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',
# 'query': 'What information can be obtained about the Maine Coon cat breed?',
# 'answers': [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}],
# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',
# 'thought': '',
# 'keep_row_after_semantic_check': True,
# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',
# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user’s intentions.\n\nDo not pass if:\n1. The function call does not align with the query’s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user’s intentions.\n4. The execution results are irrelevant and do not match the function’s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\n'},
# {'role': 'user',
# 'content': 'Given Information:\n- All Available Functions:\nFetch information about a specific cat breed from the Cat Breeds API.\n- User Query: What information can be obtained about the Maine Coon cat breed?\n- Generated Function Calls: [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]\n- Execution Results: The Maine Coon is a big and hairy breed of cat\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query\'s intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\n```\n{\n "thought": "Concisely describe your reasoning here",\n "pass": "yes" or "no"\n}\n```\n'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
```
Semantic checker for generated function calls (structured output):
```python
import json

from distilabel.steps.tasks import APIGenSemanticChecker
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
semantic_checker = APIGenSemanticChecker(
    use_default_structured_output=True,
    llm=llm
)
semantic_checker.load()

res = next(
    semantic_checker.process(
        [
            {
                "func_desc": "Fetch information about a specific cat breed from the Cat Breeds API.",
                "query": "What information can be obtained about the Maine Coon cat breed?",
                "answers": json.dumps([{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]),
                "execution_result": "The Maine Coon is a big and hairy breed of cat",
            }
        ]
    )
)
res
# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',
# 'query': 'What information can be obtained about the Maine Coon cat breed?',
# 'answers': [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}],
# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',
# 'keep_row_after_semantic_check': True,
# 'thought': '',
# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',
# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user’s intentions.\n\nDo not pass if:\n1. The function call does not align with the query’s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user’s intentions.\n4. The execution results are irrelevant and do not match the function’s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\n'},
# {'role': 'user',
# 'content': 'Given Information:\n- All Available Functions:\nFetch information about a specific cat breed from the Cat Breeds API.\n- User Query: What information can be obtained about the Maine Coon cat breed?\n- Generated Function Calls: [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]\n- Execution Results: The Maine Coon is a big and hairy breed of cat\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query\'s intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
```
Source code in src/distilabel/steps/tasks/apigen/semantic_checker.py
inputs property
The inputs for the task.
outputs property
The outputs for the task are the thought and the keep_row_after_semantic_check flag.
load()
Loads the template for the generator prompt.
Source code in src/distilabel/steps/tasks/apigen/semantic_checker.py
_set_format_inst()
Prepares the function to generate the formatted instructions for the prompt.
If the default structured output is used, returns an empty string because nothing else is needed. Otherwise, returns the original addition to the prompt, which guides the model to generate a formatted JSON.
Source code in src/distilabel/steps/tasks/apigen/semantic_checker.py
format_input(input)
The input is formatted as a ChatType.
Source code in src/distilabel/steps/tasks/apigen/semantic_checker.py
format_output(output, input)
The output is formatted as a dictionary with the thought and the decision on whether to keep the row.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The raw output of the LLM. | required |
input | Dict[str, Any] | The input to the task. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with the thought and the keep_row_after_semantic_check value. |
Source code in src/distilabel/steps/tasks/apigen/semantic_checker.py
_default_error(input)
Default error message for the task.
get_structured_output()
Creates the JSON schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a Python dictionary.
The schema corresponds to the following:
from typing import Literal
from pydantic import BaseModel
import json

class Checker(BaseModel):
    thought: str
    passes: Literal["yes", "no"]

json.dumps(Checker.model_json_schema(), indent=4)
Returns:
Type | Description |
---|---|
Dict[str, Any] | JSON Schema of the response to enforce. |
Source code in src/distilabel/steps/tasks/apigen/semantic_checker.py
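A response that follows the `Checker` schema can be turned into the two output columns along these lines. This is a sketch that assumes the reply is a JSON object with `thought` and `passes` keys, as in the schema above; `parse_checker_response` and its fallback values are hypothetical and not the task's actual parsing code.

```python
import json


def parse_checker_response(raw_output):
    """Turn the model's JSON reply into the two output columns."""
    failed = {"thought": None, "keep_row_after_semantic_check": None}
    if raw_output is None:
        return failed
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return failed
    if not isinstance(data, dict):
        return failed
    return {
        "thought": data.get("thought", ""),
        # "yes" keeps the row; any other value rejects it.
        "keep_row_after_semantic_check": data.get("passes") == "yes",
    }


print(parse_checker_response('{"thought": "", "passes": "yes"}'))
print(parse_checker_response('{"thought": "Wrong arguments.", "passes": "no"}'))
print(parse_checker_response("not json"))
```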
ArgillaLabeller
Bases: Task
Annotate Argilla records based on input fields, example records and question settings.
This task is designed to facilitate the annotation of Argilla records by leveraging a pre-trained LLM. It uses a system prompt that guides the LLM to understand the input fields, the question type, and the question settings. The task then formats the input data and generates a response based on the question. The response is validated against the question's value model, and the final suggestion is prepared for annotation.
Attributes:
Name | Type | Description |
---|---|---|
_template | Union[Template, None] | A Jinja2 template used to format the input for the LLM. |
Input columns
- record (argilla.Record): The record to be annotated.
- fields (Optional[List[Dict[str, Any]]]): The list of field settings for the input fields.
- question (Optional[Dict[str, Any]]): The question settings for the question to be answered.
- example_records (Optional[List[Dict[str, Any]]]): The few shot example records with responses to be used to answer the question.
- guidelines (Optional[str]): The guidelines for the annotation task.
Output columns
- suggestion (Dict[str, Any]): The final suggestion for annotation.
Categories
- text-classification
- scorer
- text-generation
Examples:
Annotate a record with the same dataset and question:
import argilla as rg
from argilla import Suggestion
from distilabel.steps.tasks import ArgillaLabeller
from distilabel.models import InferenceEndpointsLLM
# Get information from Argilla dataset definition
dataset = rg.Dataset("my_dataset")
pending_records_filter = rg.Filter(("status", "==", "pending"))
completed_records_filter = rg.Filter(("status", "==", "completed"))
pending_records = list(
    dataset.records(
        query=rg.Query(filter=pending_records_filter),
        limit=5,
    )
)
example_records = list(
    dataset.records(
        query=rg.Query(filter=completed_records_filter),
        limit=5,
    )
)
field = dataset.settings.fields["text"]
question = dataset.settings.questions["label"]
# Initialize the labeller with the model and fields
labeller = ArgillaLabeller(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    fields=[field],
    question=question,
    example_records=example_records,
    guidelines=dataset.guidelines
)
labeller.load()
# Process the pending records
result = next(
    labeller.process(
        [
            {
                "record": record
            } for record in pending_records
        ]
    )
)
# Add the suggestions to the records
for record, suggestion in zip(pending_records, result):
    record.suggestions.add(Suggestion(**suggestion["suggestion"]))
# Log the updated records
dataset.records.log(pending_records)
Annotate a record with alternating datasets and questions:
import argilla as rg
from distilabel.steps.tasks import ArgillaLabeller
from distilabel.models import InferenceEndpointsLLM
# Get information from Argilla dataset definition
dataset = rg.Dataset("my_dataset")
field = dataset.settings.fields["text"]
question = dataset.settings.questions["label"]
question2 = dataset.settings.questions["label2"]
# Initialize the labeller with the model and fields
labeller = ArgillaLabeller(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
)
)
labeller.load()
# Process the record
record = next(dataset.records())
result = next(
labeller.process(
[
{
"record": record,
"fields": [field],
"question": question,
},
{
"record": record,
"fields": [field],
"question": question2,
}
]
)
)
# Add the suggestions to the record
for suggestion in result:
    record.suggestions.add(rg.Suggestion(**suggestion["suggestion"]))
# Log the updated record
dataset.records.log([record])
Overwrite default prompts and instructions:
import argilla as rg
from distilabel.steps.tasks import ArgillaLabeller
from distilabel.models import InferenceEndpointsLLM
# Overwrite default prompts and instructions
labeller = ArgillaLabeller(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
system_prompt="You are an expert annotator and labelling assistant that understands complex domains and natural language processing.",
question_to_label_instruction={
"label_selection": "Select the appropriate label from the list of provided labels.",
"multi_label_selection": "Select none, one or multiple labels from the list of provided labels.",
"text": "Provide a text response to the question.",
"rating": "Provide a rating for the question.",
},
)
labeller.load()
Source code in src/distilabel/steps/tasks/argilla_labeller.py
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/argilla_labeller.py
_format_record(record, fields)
¶
Format the record fields into a string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record | Dict[str, Any] | The record to format. | required |
fields | List[Dict[str, Any]] | The fields to format. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The formatted record fields. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
_get_label_instruction(question)
¶
Get the label instruction for the question.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
question | Dict[str, Any] | The question to get the label instruction for. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The label instruction for the question. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
_format_question(question)
¶
Format the question settings into a string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
question | Dict[str, Any] | The question to format. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The formatted question. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
_format_example_records(records, fields, question)
¶
Format the example records into a string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
records | List[Dict[str, Any]] | The records to format. | required |
fields | List[Dict[str, Any]] | The fields to format. | required |
question | Dict[str, Any] | The question to format. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The formatted example records. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
format_input(input)
¶
Format the input into a chat message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Union[Dict[str, Any], Record, TextField, MultiLabelQuestion, LabelQuestion, RatingQuestion, TextQuestion]] | The input to format. | required |
Returns:
Type | Description |
---|---|
ChatType | The formatted chat message. |
Raises:
Type | Description |
---|---|
ValueError | If question or fields are not provided. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
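The examples earlier show two ways of supplying `fields` and `question`: once on the labeller itself, or per input row. The sketch below illustrates how such a lookup and the documented `ValueError` could fit together. It is not the library's implementation: `resolve_settings` is a hypothetical helper, and the rule that a per-row value takes precedence over the labeller's default is an assumption.

```python
from typing import Any, Dict, List, Optional, Tuple


def resolve_settings(
    row: Dict[str, Any],
    default_fields: Optional[List[Dict[str, Any]]] = None,
    default_question: Optional[Dict[str, Any]] = None,
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Hypothetical helper: choose the fields and question for one input row."""
    # Assumption: a non-empty value in the row wins over the default.
    fields = row.get("fields") or default_fields
    question = row.get("question") or default_question
    if not fields or not question:
        raise ValueError("Both `fields` and `question` must be provided.")
    return fields, question


# Usage: the row carries only the record, so the defaults are used.
fields, question = resolve_settings(
    {"record": {"text": "some text"}},
    default_fields=[{"name": "text"}],
    default_question={"name": "label"},
)
```

Keeping the check in one place means a missing setting fails before any prompt is built, regardless of which of the two usage patterns is in play.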
format_output(output, input)
¶
Format the output into a dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The output to format. | required |
input | Dict[str, Any] | The input to format. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | The formatted output. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
process(inputs)
¶
Process the input through the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | The input to process. | required |
Returns:
Name | Type | Description |
---|---|---|
StepOutput | StepOutput | The output of the task. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
_get_value_from_question_value_model(question_value_model)
¶
Get the value from the question value model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
question_value_model | BaseModel | The question value model to get the value from. | required |
Returns:
Name | Type | Description |
---|---|---|
Any | Any | The value from the question value model. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
_assign_value_to_question_value_model(value, question)
¶
Assign the value to the question value model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
value | Any | The value to assign. | required |
question | Dict[str, Any] | The question to assign the value to. | required |
Returns:
Name | Type | Description |
---|---|---|
BaseModel | BaseModel | The question value model with the assigned value. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
_get_pydantic_model_of_structured_output(question)
¶
Get the Pydantic model of the structured output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
question | Dict[str, Any] | The question to get the Pydantic model of the structured output for. | required |
Returns:
Name | Type | Description |
---|---|---|
BaseModel | BaseModel | The Pydantic model of the structured output. |
Source code in src/distilabel/steps/tasks/argilla_labeller.py
CLAIR
¶
Bases: Task
Contrastive Learning from AI Revisions (CLAIR).
CLAIR uses an AI system to minimally revise a solution A→A´ such that the resulting
preference A preferred A´ is much more contrastive and precise.
Input columns
- task (str): The task or instruction.
- student_solution (str): An answer to the task that is to be revised.
Output columns
- revision (str): The revised text.
- rational (str): The rationale for the provided revision.
- model_name (str): The name of the model used to generate the revision and rationale.
Categories
- preference
- text-generation
References
Examples:
Create contrastive preference pairs:
from distilabel.steps.tasks import CLAIR
from distilabel.models import InferenceEndpointsLLM
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
generation_kwargs={
"temperature": 0.7,
"max_new_tokens": 4096,
},
)
clair_task = CLAIR(llm=llm)
clair_task.load()
result = next(
clair_task.process(
[
{
"task": "How many gaps are there between the earth and the moon?",
"student_solution": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'
}
]
)
)
# result
# [{'task': 'How many gaps are there between the earth and the moon?',
# 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',
# 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\'s orbital path, not the presence of any gaps.\n\nIn summary, the Moon\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',
# 'rational': 'The student\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term "gaps" can be misleading in this context. The student should clarify what they mean by "gaps." Secondly, the student provides some additional information about the Moon\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\'s conclusion could be more concise.',
# 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term "gaps" can be misleading in this context. The student should clarify what they mean by "gaps." Secondly, the student provides some additional information about the Moon\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\'s conclusion could be more concise.\n\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\'s orbital path, not the presence of any gaps.\n\nIn summary, the Moon\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',
# 'raw_input_c_l_a_i_r_0': [{'role': 'system',
# 'content': "You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution."},
# {'role': 'user',
# 'content': '{task}: How many gaps are there between the earth and the moon?\n\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\n\n-----------------\n\nLet\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Citations:
```
@misc{doosterlinck2024anchoredpreferenceoptimizationcontrastive,
title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment},
author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},
year={2024},
eprint={2408.06266},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2408.06266},
}
```
Source code in src/distilabel/steps/tasks/clair.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/clair.py
format_output(output, input)
¶
The output is formatted as a dictionary with the revision and the rational parsed from the raw output of the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | the raw output of the LLM. | required |
input | Dict[str, Any] | the input to the task. Used for obtaining the number of responses. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with the key |
Source code in src/distilabel/steps/tasks/clair.py
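In the example result shown earlier, the raw LLM output carries the `{teacher_reasoning}` and `{corrected_student_solution}` identifiers, while the final row exposes `rational` and `revision`. The following is a minimal sketch of how such a raw string could be split into those two columns. `split_clair_output` is a hypothetical helper written for illustration, and returning `None` for both keys on unparsable output is an assumption rather than documented behaviour.

```python
from typing import Dict, Optional

REASONING_TAG = "{teacher_reasoning}"
SOLUTION_TAG = "{corrected_student_solution}"


def split_clair_output(output: Optional[str]) -> Dict[str, Optional[str]]:
    """Hypothetical parser: split the raw text into rationale and revision."""
    # Assumption: without the solution identifier nothing can be recovered.
    if output is None or SOLUTION_TAG not in output:
        return {"revision": None, "rational": None}
    # Everything before the solution identifier is the teacher's reasoning.
    reasoning, _, solution = output.partition(SOLUTION_TAG)
    reasoning = reasoning.replace(REASONING_TAG, "", 1)
    return {
        # Drop the ": " that follows each identifier, then trim whitespace.
        "revision": solution.lstrip(": ").strip(),
        "rational": reasoning.lstrip(": ").strip(),
    }


raw = (
    "{teacher_reasoning}: Clarify the term.\n\n"
    "{corrected_student_solution}: There are no gaps."
)
parsed = split_clair_output(raw)
```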
ComplexityScorer
¶
Bases: Task
Score instructions based on their complexity using an LLM
.
ComplexityScorer
is a pre-defined task used to rank a list of instructions based on
their complexity. It's an implementation of the complexity score task from the paper
'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection
in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
_template | Union[Template, None] | a Jinja2 template used to format the input for the LLM. |
Input columns
- instructions (List[str]): The list of instructions to be scored.
Output columns
- scores (List[float]): The score for each instruction.
- model_name (str): The model name used to generate the scores.
Categories
- scorer
- complexity
- instruction
Examples:
Evaluate the complexity of your instructions:
from distilabel.steps.tasks import ComplexityScorer
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
scorer = ComplexityScorer(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
)
)
scorer.load()
result = next(
scorer.process(
[{"instructions": ["plain instruction", "highly complex instruction"]}]
)
)
# result
# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]
Generate structured output with default schema:
from distilabel.steps.tasks import ComplexityScorer
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
scorer = ComplexityScorer(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
use_default_structured_output=True,
)
scorer.load()
result = next(
scorer.process(
[{"instructions": ["plain instruction", "highly complex instruction"]}]
)
)
# result
# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \n "scores": [\n 1, \n 2\n ]\n}'}}]
Citations
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
Source code in src/distilabel/steps/tasks/complexity_scorer.py
inputs
property
¶
The inputs for the task are the instructions
.
outputs
property
¶
The outputs for the task are a list of scores
containing the complexity score for each
instruction in instructions
, and the model_name
.
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
format_output(output, input)
¶
The output is formatted as a list with the score of each instruction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | the raw output of the LLM. | required |
input | Dict[str, Any] | the input to the task. Used for obtaining the number of responses. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dict with the key |
Source code in src/distilabel/steps/tasks/complexity_scorer.py
get_structured_output()
¶
Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel
from typing import List
class SchemaComplexityScorer(BaseModel):
    scores: List[int]
Returns:
Type | Description |
---|---|
Dict[str, Any] | JSON Schema of the response to enforce. |
Source code in src/distilabel/steps/tasks/complexity_scorer.py
_format_structured_output(output, input)
¶
Parses the structured response, which should correspond to a dictionary with a scores key.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | str | The output from the | required |
Returns:
Type | Description |
---|---|
Dict[str, str] | Formatted output. |
Source code in src/distilabel/steps/tasks/complexity_scorer.py
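The structured-output example earlier shows a raw reply of the form `{"scores": [1, 2]}`, matching the `SchemaComplexityScorer` schema. As a rough illustration of parsing such a reply, the sketch below uses only the standard `json` module. `parse_structured_scores` is a hypothetical helper, and falling back to one `None` per instruction when parsing fails is an assumption made here, not documented behaviour.

```python
import json
from typing import Any, Dict, List, Optional


def parse_structured_scores(
    output: Optional[str], num_instructions: int
) -> Dict[str, Any]:
    """Hypothetical parser for a JSON reply shaped like {"scores": [...]}."""
    # Assumption: on failure, return one None per input instruction.
    fallback: List[Optional[int]] = [None] * num_instructions
    if output is None:
        return {"scores": fallback}
    try:
        scores = json.loads(output)["scores"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"scores": fallback}
    return {"scores": scores}


# The raw reply from the structured-output example above.
raw_reply = '{ \n "scores": [\n 1, \n 2\n ]\n}'
parsed_scores = parse_structured_scores(raw_reply, num_instructions=2)
```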
_sample_input()
¶
Returns a sample input to be used in the print
method.
Tasks that don't adhere to a format input that returns a map of the type
str -> str should override this method to return a sample input.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
EvolInstruct
¶
Bases: Task
Evolve instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name | Type | Description |
---|---|---|
num_evolutions | int | The number of evolutions to be performed. |
store_evolutions | bool | Whether to store all the evolutions or just the last one. Defaults to |
generate_answers | bool | Whether to generate answers for the evolved instructions. Defaults to |
include_original_instruction | bool | Whether to include the original instruction in the |
mutation_templates | Dict[str, str] | The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the |
seed | RuntimeParameter[int] | The seed to be set for |
Runtime parameters
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
- instruction (str): The instruction to evolve.
Output columns
- evolved_instruction (str): The evolved instruction if store_evolutions=False.
- evolved_instructions (List[str]): The evolved instructions if store_evolutions=True.
- model_name (str): The name of the LLM used to evolve the instructions.
- answer (str): The answer to the evolved instruction if generate_answers=True and store_evolutions=False.
- answers (List[str]): The answers to the evolved instructions if generate_answers=True and store_evolutions=True.
Categories
- evol
- instruction
References
Examples:
Evolve an instruction using an LLM:
from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
)
evol_instruct.load()
result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
Keep the iterations of the evolutions:
from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
store_evolutions=True,
)
evol_instruct.load()
result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
# {
# 'instruction': 'common instruction',
# 'evolved_instructions': ['initial evolution', 'final evolution'],
# 'model_name': 'model_name'
# }
# ]
Generate answers for the instructions in a single step:
from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
generate_answers=True,
)
evol_instruct.load()
result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
# {
# 'instruction': 'common instruction',
# 'evolved_instruction': 'evolved instruction',
# 'answer': 'answer to the instruction',
# 'model_name': 'model_name'
# }
# ]
Citations
@misc{xu2023wizardlmempoweringlargelanguage,
title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
year={2023},
eprint={2304.12244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2304.12244},
}
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
inputs
property
¶
The input for the task is the instruction
.
outputs
property
¶
The outputs for the task are the evolved_instruction/s
, the answer
if generate_answers=True
and the model_name
.
mutation_templates_names
property
¶
Returns the names, i.e. the keys, of the provided mutation_templates
.
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation. And the
system_prompt
is added as the first message if it exists.
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
format_output(instructions, answers=None)
¶
The output for the task is a dict with: evolved_instruction or evolved_instructions, depending on whether store_evolutions is False or True, respectively; answer if generate_answers=True; and, finally, the model_name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instructions | Union[str, List[str]] | The instructions to be included within the output. | required |
answers | Optional[List[str]] | The answers to be included within the output if | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | If |
Dict[str, Any] | if |
Dict[str, Any] | if |
Dict[str, Any] | if |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
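The way the output keys depend on `store_evolutions` and `generate_answers` can be sketched as a small standalone function. `build_evol_output` is a hypothetical stand-in that mirrors the documented output columns, not the method itself. Taking the last answer for the single `answer` column and using a placeholder `model_name` are assumptions.

```python
from typing import Any, Dict, List, Optional, Union


def build_evol_output(
    instructions: Union[str, List[str]],
    answers: Optional[List[str]] = None,
    *,
    store_evolutions: bool = False,
    generate_answers: bool = False,
    model_name: str = "model_name",  # placeholder value
) -> Dict[str, Any]:
    """Hypothetical stand-in mirroring the documented output columns."""
    output: Dict[str, Any] = {}
    # The instruction column is plural only when every evolution is kept.
    if store_evolutions:
        output["evolved_instructions"] = instructions
    else:
        output["evolved_instruction"] = instructions
    if generate_answers and answers:
        # The answer column follows the same singular/plural rule.
        if store_evolutions:
            output["answers"] = answers
        else:
            output["answer"] = answers[-1]  # assumption: keep the last answer
    output["model_name"] = model_name
    return output


single = build_evol_output("evolved instruction")
kept = build_evol_output(
    ["initial evolution", "final evolution"],
    ["first answer", "second answer"],
    store_evolutions=True,
    generate_answers=True,
)
```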
_apply_random_mutation(instruction)
¶
Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction | str | The instruction to be included within the mutation prompt. | required |
Returns:
Type | Description |
---|---|
str | A random mutation prompt with the provided instruction. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
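A random mutation of this kind can be pictured with a short sketch. The two templates and the `<PROMPT>` placeholder below are hypothetical and only stand in for whatever `mutation_templates` holds, and Python's `random.Random` is used in place of the seeded numpy generator mentioned above so the sketch needs no third-party packages.

```python
import random
from typing import Dict

# Hypothetical templates; the library's defaults are not reproduced here.
MUTATION_TEMPLATES: Dict[str, str] = {
    "CONSTRAINTS": "Add one more constraint to the following prompt:\n<PROMPT>",
    "DEEPENING": "Increase the depth of the following prompt:\n<PROMPT>",
}


def apply_random_mutation(instruction: str, rng: random.Random) -> str:
    """Pick one template at random and embed the instruction in it."""
    # Sorting the keys keeps the choice reproducible for a given seed.
    name = rng.choice(sorted(MUTATION_TEMPLATES))
    return MUTATION_TEMPLATES[name].replace("<PROMPT>", instruction)


mutation_prompt = apply_random_mutation("common instruction", random.Random(42))
```

Passing the generator in explicitly, rather than relying on global state, makes it easy to reproduce a run by reusing the same seed.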
_evolve_instructions(inputs)
¶
Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | A list of Python dictionaries with the inputs of the task. | required |
Returns:
Type | Description |
---|---|
List[List[str]] | A list where each item is a list with either the last evolved instruction if |
List[List[str]] | |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
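The effect of `num_evolutions` and `store_evolutions` on the returned lists can be illustrated with a toy loop. This is a sketch under stated assumptions, not the library's code: `evolve` is a hypothetical function, `mutate` stands in for building a mutation prompt and calling the LLM, and each round is assumed to start from the previous round's result.

```python
from typing import Callable, List


def evolve(
    instruction: str,
    mutate: Callable[[str], str],
    num_evolutions: int,
    store_evolutions: bool = False,
) -> List[str]:
    """Evolve one instruction, feeding each result into the next round."""
    history: List[str] = []
    current = instruction
    for _ in range(num_evolutions):
        current = mutate(current)  # stand-in for mutation prompt + LLM call
        history.append(current)
    # Keep every round, or only the final one (still wrapped in a list).
    return history if store_evolutions else history[-1:]


def add_mark(text: str) -> str:
    """Toy mutation used only for this illustration."""
    return text + "!"


last_only = evolve("common instruction", add_mark, num_evolutions=2)
all_rounds = evolve("common instruction", add_mark, num_evolutions=2, store_evolutions=True)
```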
_generate_answers(evolved_instructions)
¶
Generates the answers for the instructions in evolved_instructions
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
evolved_instructions | List[List[str]] | A list of lists where each item is a list with either the last evolved instruction if | required |
Returns:
Type | Description |
---|---|
Tuple[List[List[str]], LLMStatistics] | A list of answers for each instruction. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
process(inputs)
¶
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | A list of Python dictionaries with the inputs of the task. | required |
Yields:
Type | Description |
---|---|
StepOutput | A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
EvolComplexity
¶
Bases: EvolInstruct
Evolve instructions to make them more complex using an LLM
.
EvolComplexity
is a task that evolves instructions to make them more complex,
and it is based on the EvolInstruct task, using slightly different prompts, but the
exact same evolutionary approach.
Attributes:
Name | Type | Description |
---|---|---|
num_instructions | int | The number of instructions to be generated. |
generate_answers | bool | Whether to generate answers for the instructions or not. Defaults to |
mutation_templates | Dict[str, str] | The mutation templates to be used for the generation of the instructions. |
min_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to |
max_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
- max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
- instruction (str): The instruction to evolve.
Output columns
- evolved_instruction (str): The evolved instruction.
- answer (str, optional): The answer to the instruction if generate_answers=True.
- model_name (str): The name of the LLM used to evolve the instructions.
Categories
- evol
- instruction
- deita
References
Examples:
Evolve an instruction using an LLM:
from distilabel.steps.tasks import EvolComplexity
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_complexity = EvolComplexity(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
)
evol_complexity.load()
result = next(evol_complexity.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
Citations
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
@misc{xu2023wizardlmempoweringlargelanguage,
title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
year={2023},
eprint={2304.12244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2304.12244},
}
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
EvolComplexityGenerator
¶
Bases: EvolInstructGenerator
Generate evolved instructions with increased complexity using an LLM
.
EvolComplexityGenerator
is a generation task that evolves instructions to make
them more complex. It is based on the EvolInstruct task and uses slightly different
prompts, but the exact same evolutionary approach.
Attributes:
Name | Type | Description |
---|---|---|
num_instructions | int | The number of instructions to be generated. |
generate_answers | bool | Whether to generate answers for the instructions or not. Defaults to |
mutation_templates | Dict[str, str] | The mutation templates to be used for the generation of the instructions. |
min_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to |
max_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
- max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
- instruction (str): The evolved instruction.
- answer (str, optional): The answer to the instruction if generate_answers=True.
- model_name (str): The name of the LLM used to evolve the instructions.
Categories
- evol
- instruction
- generation
- deita
References
Examples:
Generate evolved instructions without initial instructions:
from distilabel.steps.tasks import EvolComplexityGenerator
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_complexity_generator = EvolComplexityGenerator(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_instructions=2,
)
evol_complexity_generator.load()
result = next(evol_complexity_generator.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
Citations
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
@misc{xu2023wizardlmempoweringlargelanguage,
title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
year={2023},
eprint={2304.12244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2304.12244},
}
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
EvolInstructGenerator
¶
Bases: GeneratorTask
Generate evolved instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name | Type | Description |
---|---|---|
num_instructions | int | The number of instructions to be generated. |
generate_answers | bool | Whether to generate answers for the instructions or not. Defaults to |
mutation_templates | Dict[str, str] | The mutation templates to be used for the generation of the instructions. |
min_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to |
max_length | RuntimeParameter[int] | Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
- max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
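The min_length and max_length runtime parameters describe a byte-length window for generated instructions. A minimal sketch of such a check follows, assuming UTF-8 encoding and strict inequalities ("higher than" / "lower than"); the function name and the threshold values are hypothetical and not part of distilabel.

```python
def is_valid_length(instruction: str, min_length: int, max_length: int) -> bool:
    """Return True when the byte length lies strictly between the two bounds.

    Assumption: length is measured on the UTF-8 encoding of the text.
    """
    n_bytes = len(instruction.encode("utf-8"))
    return min_length < n_bytes < max_length


# Hypothetical thresholds, chosen only for illustration.
ascii_ok = is_valid_length("a" * 10, min_length=5, max_length=20)
# Each "é" takes two bytes in UTF-8, so ten of them reach the upper bound.
accented_ok = is_valid_length("é" * 10, min_length=5, max_length=20)
```

Measuring bytes rather than characters means that non-ASCII text reaches the upper bound sooner than its character count suggests.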
Output columns
- instruction (str): The generated instruction if generate_answers=False.
- answer (str): The generated answer if generate_answers=True.
- instructions (List[str]): The generated instructions if generate_answers=True.
- model_name (str): The name of the LLM used to generate and evolve the instructions.
Categories
- evol
- instruction
- generation
References
Examples:
Generate evolved instructions without initial instructions:
from distilabel.steps.tasks import EvolInstructGenerator
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_instruct_generator = EvolInstructGenerator(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_instructions=2,
)
evol_instruct_generator.load()
result = next(evol_instruct_generator.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
Citations
@misc{xu2023wizardlmempoweringlargelanguage,
title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
year={2023},
eprint={2304.12244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2304.12244},
}
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
_english_nouns
cached
property
¶
A list of English nouns to be used as part of the starting prompts for the task.
References
- https://github.com/h2oai/h2o-wizardlm
outputs
property
¶
The outputs of the task are the instruction
, the answer
if generate_answers=True
and the model_name
.
mutation_templates_names
property
¶
Returns the names i.e. keys of the provided mutation_templates
.
_generate_seed_texts()
¶
Generates a list of seed texts to be used as part of the starting prompts for the task.
It will use the FRESH_START
mutation template, as it needs to generate text from scratch; and
a list of English words will be used to generate the seed texts that will be provided to the
mutation method and included within the prompt.
Returns:
Type | Description |
---|---|
List[str]
|
A list of seed texts to be used as part of the starting prompts for the task. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
model_post_init(__context)
¶
Override this method to perform additional initialization after __init__
and model_construct
.
This is useful if you want to do some validation that requires the entire model to be initialized.
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
format_output(instruction, answer=None)
¶
The output for the task is a dict with: instruction
; answer
if generate_answers=True
;
and, finally, the model_name
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction
|
str
|
The instruction to be included within the output. |
required |
answer
|
Optional[str]
|
The answer to be included within the output if |
None
|
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
If |
Dict[str, Any]
|
if |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
_apply_random_mutation(iter_no)
¶
Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
iter_no
|
int
|
The iteration number to be used to check whether the iteration is the first one i.e. FRESH_START, or not. |
required |
Returns:
Type | Description |
---|---|
List[ChatType]
|
A random mutation prompt with the provided instruction formatted as an OpenAI conversation. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
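The mutation step described above can be sketched in plain Python. This is an illustration only: the template texts, the `<PROMPT>` placeholder, and the function signature are hypothetical, the standard-library `random` module stands in for numpy, and excluding FRESH_START after the first iteration is an assumption that the description does not confirm.

```python
import random
from typing import Dict, List

# Hypothetical templates; only the FRESH_START key name comes from the docs.
MUTATION_TEMPLATES: Dict[str, str] = {
    "FRESH_START": "Write one question or request about: <PROMPT>",
    "CONSTRAINTS": "Rewrite the prompt adding one more constraint: <PROMPT>",
    "DEEPENING": "Rewrite the prompt increasing its depth: <PROMPT>",
}


def apply_random_mutation(
    instruction: str, iter_no: int, rng: random.Random
) -> List[Dict[str, str]]:
    """Wrap the instruction in a mutation prompt as a one-turn chat."""
    if iter_no == 0:
        # The first iteration starts from scratch.
        name = "FRESH_START"
    else:
        # Assumption: later iterations pick among the remaining templates.
        candidates = [n for n in MUTATION_TEMPLATES if n != "FRESH_START"]
        name = rng.choice(candidates)
    prompt = MUTATION_TEMPLATES[name].replace("<PROMPT>", instruction)
    return [{"role": "user", "content": prompt}]


rng = random.Random(42)
first_prompt = apply_random_mutation("cats", iter_no=0, rng=rng)
later_prompt = apply_random_mutation("Explain cats", iter_no=3, rng=rng)
```

Passing a seeded generator keeps the choice of mutation reproducible, which is the role the seed runtime parameter plays in the task.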
_generate_answers(instructions)
¶
Generates the answer for the last instruction in instructions
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instructions
|
List[List[str]]
|
A list of lists where each item is a list with either the last
evolved instruction if |
required |
Returns:
Type | Description |
---|---|
Tuple[List[str], LLMStatistics]
|
A list of answers for the last instruction in |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
process(offset=0)
¶
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset
|
int
|
The offset to start the generation from. Defaults to 0. |
0
|
Yields:
Type | Description |
---|---|
GeneratorStepOutput
|
A list of Python dictionaries with the outputs of the task, and a boolean flag indicating whether the task has finished or not, i.e. whether it is the last batch. |
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
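The yielded value pairs a batch of rows with a last-batch flag. The sketch below shows that shape with a stand-in generator; `generate_batches`, `batch_size`, and the placeholder model name are hypothetical, and treating `offset` as the number of items to skip is an assumption.

```python
from typing import Any, Dict, Iterator, List, Tuple


def generate_batches(
    instructions: List[str], batch_size: int, offset: int = 0
) -> Iterator[Tuple[List[Dict[str, Any]], bool]]:
    """Yield (rows, is_last_batch) tuples, starting from `offset`."""
    remaining = instructions[offset:]
    for start in range(0, len(remaining), batch_size):
        batch = remaining[start : start + batch_size]
        is_last = start + batch_size >= len(remaining)
        rows = [
            {"instruction": text, "model_name": "placeholder-model"}
            for text in batch
        ]
        yield rows, is_last


batches = list(generate_batches(["a", "b", "c"], batch_size=2))
resumed = list(generate_batches(["a", "b", "c"], batch_size=2, offset=2))
```

A consumer can stop iterating as soon as the flag is True instead of waiting for the generator to be exhausted.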
EvolQuality
¶
Bases: Task
Evolve the quality of the responses using an LLM
.
EvolQuality
task is used to evolve the quality of the responses given a prompt,
by generating a new response with a language model. This step implements the evolution
quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
num_evolutions | int | The number of evolutions to be performed on the responses. |
store_evolutions | bool | Whether to store all the evolved responses or just the last one. Defaults to |
include_original_response | bool | Whether to include the original response within the evolved responses. Defaults to |
mutation_templates | Dict[str, str] | The mutation templates to be used to evolve the responses. |
seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. |
Runtime parameters
- seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
- instruction (str): The instruction that was used to generate the responses.
- response (str): The responses to be rewritten.
Output columns
- evolved_response (str): The evolved response if store_evolutions=False.
- evolved_responses (List[str]): The evolved responses if store_evolutions=True.
- model_name (str): The name of the LLM used to evolve the responses.
Categories
- evol
- response
- deita
Examples:
Evolve the quality of the responses given a prompt:
from distilabel.steps.tasks import EvolQuality
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
evol_quality = EvolQuality(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_evolutions=2,
)
evol_quality.load()
result = next(
evol_quality.process(
[
{"instruction": "common instruction", "response": "a response"},
]
)
)
# result
# [
# {
# 'instruction': 'common instruction',
# 'response': 'a response',
# 'evolved_response': 'evolved response',
# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2'
# }
# ]
Citations
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
Source code in src/distilabel/steps/tasks/evol_quality/base.py
inputs
property
¶
The inputs of the task are the instruction
and response
.
outputs
property
¶
The outputs of the task are the evolved_response/s
and the model_name
.
mutation_templates_names
property
¶
Returns the names i.e. keys of the provided mutation_templates
enum.
model_post_init(__context)
¶
Override this method to perform additional initialization after __init__
and model_construct
.
This is useful if you want to do some validation that requires the entire model to be initialized.
Source code in src/distilabel/steps/tasks/evol_quality/base.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation. And the
system_prompt
is added as the first message if it exists.
Source code in src/distilabel/steps/tasks/evol_quality/base.py
format_output(responses)
¶
The output for the task is a dict with: evolved_response
or evolved_responses
,
depending whether the value is either False
or True
for store_evolutions
, respectively;
and, finally, the model_name
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
responses
|
Union[str, List[str]]
|
The responses to be included within the output. |
required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
if |
Dict[str, Any]
|
if |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
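How num_evolutions, store_evolutions, and include_original_response could combine into the output key can be sketched as follows. This is not the library code: `evolve_responses`, the `evolve` callable (a stand-in for the LLM call), and the helper signatures are hypothetical, and the sketch assumes at least one evolution.

```python
from typing import Any, Callable, Dict, List, Union


def evolve_responses(
    response: str,
    evolve: Callable[[str], str],
    num_evolutions: int,
    store_evolutions: bool = False,
    include_original_response: bool = False,
) -> Union[str, List[str]]:
    """Evolve a response repeatedly; keep the history or only the last one."""
    evolved: List[str] = [response] if include_original_response else []
    current = response
    for _ in range(num_evolutions):
        current = evolve(current)  # stand-in for one LLM rewrite
        evolved.append(current)
    return evolved if store_evolutions else evolved[-1]


def format_output(
    responses: Union[str, List[str]], store_evolutions: bool, model_name: str
) -> Dict[str, Any]:
    """Pick the output key according to store_evolutions."""
    key = "evolved_responses" if store_evolutions else "evolved_response"
    return {key: responses, "model_name": model_name}


def add_mark(text: str) -> str:
    return text + "!"


last = evolve_responses("a response", add_mark, num_evolutions=2)
history = evolve_responses(
    "a response", add_mark, 2, store_evolutions=True, include_original_response=True
)
```

Only the key name and the value type change between the two modes; model_name is reported either way.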
_apply_random_mutation(instruction, response)
¶
Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction
|
str
|
The instruction to be included within the mutation prompt. |
required |
Returns:
Type | Description |
---|---|
str
|
A random mutation prompt with the provided instruction. |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
_evolve_reponses(inputs)
¶
Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Returns:
Type | Description |
---|---|
List[List[str]]
|
A list where each item is a list with either the last evolved instruction if |
Dict[str, Any]
|
|
Source code in src/distilabel/steps/tasks/evol_quality/base.py
process(inputs)
¶
Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Returns:
Type | Description |
---|---|
StepOutput
|
A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/evol_quality/base.py
GenerateEmbeddings
¶
Bases: Step
Generate embeddings using the last hidden state of an LLM
.
Generate embeddings for a text input using the last hidden state of an LLM
, as
described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
llm |
LLM
|
The |
Input columns
- text (str, List[Dict[str, str]]): The input text or conversation to generate embeddings for.
Output columns
- embedding (List[float]): The embedding of the input text or conversation.
- model_name (str): The model name used to generate the embeddings.
Categories
- embedding
- llm
Examples:
Generate embeddings for a text input:
from distilabel.steps.tasks import GenerateEmbeddings
from distilabel.models.llms.huggingface import TransformersLLM
# Consider this as a placeholder for your actual LLM.
embedder = GenerateEmbeddings(
llm=TransformersLLM(
model="TaylorAI/bge-micro-v2",
model_kwargs={"is_decoder": True},
cuda_devices=[],
)
)
embedder.load()
result = next(
embedder.process(
[
{"text": "Hello, how are you?"},
]
)
)
Citations
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
Source code in src/distilabel/steps/tasks/generate_embeddings.py
inputs
property
¶
The input for the task is a text
column containing either a string or a
list of dictionaries in OpenAI chat-like format.
outputs
property
¶
The output for the task is an embedding
column containing the embedding of
the text
input.
load()
¶
format_input(input)
¶
Formats the input to be used by the LLM to generate the embeddings. The input
can be in ChatType
format or a string. If a string, it will be converted to a
list of dictionaries in OpenAI chat-like format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input
|
Dict[str, Any]
|
The input to format. |
required |
Returns:
Type | Description |
---|---|
ChatType
|
The OpenAI chat-like format of the input. |
Source code in src/distilabel/steps/tasks/generate_embeddings.py
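The conversion described for format_input can be sketched as below. The function here is a standalone stand-in, and using the "user" role for a plain string is an assumption; the description above only says the string becomes a list of dictionaries in OpenAI chat-like format.

```python
from typing import Any, Dict, List


def format_input(row: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the `text` column as an OpenAI chat-like list of messages."""
    text = row["text"]
    if isinstance(text, str):
        # Assumption: a bare string is treated as a single user turn.
        return [{"role": "user", "content": text}]
    # Already a conversation, so it is passed through unchanged.
    return text


from_string = format_input({"text": "Hello, how are you?"})
from_chat = format_input(
    {"text": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}
)
```

Normalising both input kinds to the same message list lets the embedding step apply a single code path afterwards.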
process(inputs)
¶
Generates an embedding for each input using the last hidden state of the LLM
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Yields:
Type | Description |
---|---|
StepOutput
|
A list of Python dictionaries with the outputs of the task. |
Source code in src/distilabel/steps/tasks/generate_embeddings.py
Genstruct
¶
Bases: Task
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
is a pre-defined task designed to generate valid instructions from a given raw document,
with the title and the content, enabling the creation of new, partially synthetic instruction finetuning
datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is
inspired by the Ada-Instruct paper.
Note
The Genstruct prompt i.e. the task, can be used with any model really, but the safest / recommended
option is to use NousResearch/Genstruct-7B
as the LLM provided to the task, since it was trained
for this specific task.
Attributes:
Name | Type | Description |
---|---|---|
_template | Union[Template, None] | a Jinja2 template used to format the input for the LLM. |
Input columns
- title (str): The title of the document.
- content (str): The content of the document.
Output columns
- user (str): The user's instruction based on the document.
- assistant (str): The assistant's response based on the user's instruction.
- model_name (str): The model name used to generate the user and assistant messages.
Categories
- text-generation
- instruction
- response
References
Examples:
Generate instructions from raw documents using the title and content:
from distilabel.steps.tasks import Genstruct
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
genstruct = Genstruct(
llm=InferenceEndpointsLLM(
model_id="NousResearch/Genstruct-7B",
),
)
genstruct.load()
result = next(
genstruct.process(
[
{"title": "common instruction", "content": "content of the document"},
]
)
)
# result
# [
# {
# 'title': 'common instruction',
# 'content': 'content of the document',
# 'model_name': 'test',
# 'user': 'An instruction',
# 'assistant': 'content of the document',
# }
# ]
Citations
Source code in src/distilabel/steps/tasks/genstruct.py
inputs
property
¶
The inputs for the task are the title
and the content
.
outputs
property
¶
The outputs of the task are the user
instruction based on the provided document
and the assistant
response based on the user's instruction.
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/genstruct.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/genstruct.py
format_output(output, input)
¶
The output is formatted so that both the user and the assistant messages are captured.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output
|
Union[str, None]
|
the raw output of the LLM. |
required |
input
|
Dict[str, Any]
|
the input to the task. Used for obtaining the number of responses. |
required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
A dict with the keys |
Source code in src/distilabel/steps/tasks/genstruct.py
ImageGeneration
¶
Bases: ImageTask
Image generation with a text-to-image model given a prompt.
ImageGeneration
is a pre-defined task that allows generating images from a prompt.
It works with any of the image generation models defined under distilabel.models.image_generation,
i.e. the implemented models that allow image generation.
By default, the images are generated as a base64 string format, and after the dataset
has been generated, the images can be automatically transformed to PIL.Image.Image
using
Distiset.transform_columns_to_image
. Take a look at the Image Generation with distilabel
example in the documentation for more information.
Using the save_artifacts
attribute, the images can be saved in the artifacts folder of the
Hugging Face Hub repository.
Attributes:
Name | Type | Description |
---|---|---|
save_artifacts | bool | Bool value to save the image artifacts on its folder. Otherwise, the base64 representation of the image will be saved as a string. Defaults to False. |
image_format | str | Any of the formats supported by PIL. Defaults to JPEG. |
Input columns
- prompt (str): A column named prompt with the prompts to generate the images.
Output columns
- image (str): The generated image. It is initially a base64 string, for simplicity during the pipeline run, but it can be transformed to an Image object after the distiset is returned at the end of a pipeline by calling distiset.transform_columns_to_image(<IMAGE_COLUMN>).
- image_path (str): The path where the image is saved. Only available if save_artifacts is True.
- model_name (str): The name of the model used to generate the image.
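Because the image column holds a base64 string, the raw image bytes can be recovered with the standard library alone. The sketch below uses made-up bytes in place of a real generated image; the helper name and the sample row are hypothetical.

```python
import base64


def image_bytes_from_base64(encoded: str) -> bytes:
    """Decode the base64 string stored in the `image` column into raw bytes."""
    return base64.b64decode(encoded)


# Made-up payload standing in for the bytes of a generated image.
fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"
row = {
    "prompt": "a white siamese cat",
    "image": base64.b64encode(fake_image).decode("ascii"),
}

decoded = image_bytes_from_base64(row["image"])
```

The decoded bytes can be written to a file or handed to an imaging library; within distilabel, the transform_columns_to_image helper mentioned above covers that conversion for a whole dataset.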
Categories
- image-generation
Examples:
Generate an image from a prompt:
from distilabel.steps.tasks import ImageGeneration
from distilabel.models.image_generation import InferenceEndpointsImageGeneration
igm = InferenceEndpointsImageGeneration(
model_id="black-forest-labs/FLUX.1-schnell"
)
# save_artifacts defaults to False, so the image is returned as a base64 string; set it to True to save it as an artifact (JPEG by default).
image_gen = ImageGeneration(image_generation_model=igm)
image_gen.load()
result = next(
image_gen.process(
[{"prompt": "a white siamese cat"}]
)
)
Generate images and save them as artifacts in a Hugging Face Hub repository:
from distilabel.steps.tasks import ImageGeneration
# Select the Image Generation model to use
from distilabel.models.image_generation import OpenAIImageGeneration
igm = OpenAIImageGeneration(
model="dall-e-3",
api_key="api.key",
generation_kwargs={
"size": "1024x1024",
"quality": "standard",
"style": "natural"
}
)
# With save_artifacts=True the image is saved as an artifact (JPEG by default); with False it is kept as a base64 string.
image_gen = ImageGeneration(
image_generation_model=igm,
save_artifacts=True,
image_format="JPEG" # By default will use JPEG, the options available can be seen in PIL documentation.
)
image_gen.load()
result = next(
image_gen.process(
[{"prompt": "a white siamese cat"}]
)
)
Source code in src/distilabel/steps/tasks/image_generation.py
BitextRetrievalGenerator
¶
Bases: _EmbeddingDataGenerator
Generate bitext retrieval data with an LLM
to later on train an embedding model.
BitextRetrievalGenerator
is a GeneratorTask
that generates bitext retrieval data with an
LLM
to later on train an embedding model. The task is based on the paper "Improving
Text Embeddings with Large Language Models" and the data is generated based on the
provided attributes, or randomly sampled if not provided.
Attributes:
Name | Type | Description |
---|---|---|
source_language | str | The source language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
target_language | str | The target language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
unit | Optional[Literal['sentence', 'phrase', 'passage']] | The unit of the data to be generated, which can be |
difficulty | Optional[Literal['elementary school', 'high school', 'college']] | The difficulty of the query to be generated, which can be |
high_score | Optional[Literal['4', '4.5', '5']] | The high score of the query to be generated, which can be |
low_score | Optional[Literal['2.5', '3', '3.5']] | The low score of the query to be generated, which can be |
seed | int | The random seed to be set in case there's any sampling within the |
Output columns
- S1 (str): the first sentence generated by the LLM.
- S2 (str): the second sentence generated by the LLM.
- S3 (str): the third sentence generated by the LLM.
- model_name (str): the name of the model used to generate the bitext retrieval data.
Examples:
Generate bitext retrieval data for training embedding models:
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import BitextRetrievalGenerator

with Pipeline("my-pipeline") as pipeline:
    task = BitextRetrievalGenerator(
        source_language="English",
        target_language="Spanish",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,
    )

    ...

    task >> ...
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
prompt
property
¶
Contains the prompt to be used in the process method, rendering the _template, and formatted as an OpenAI formatted chat, i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.
keys
property
¶
Contains the keys that will be parsed from the LLM output into a Python dict.
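As a rough illustration of parsing the LLM output into a Python dict, the sketch below assumes the LLM replies with a JSON object whose keys match the output columns S1, S2 and S3. The helper `parse_output` and the fallback to None values are assumptions for illustration, not the library's code.

```python
import json
from typing import Dict, List, Optional

# Keys expected in the LLM output; taken from the output columns S1, S2, S3.
KEYS: List[str] = ["S1", "S2", "S3"]


def parse_output(output: Optional[str], keys: List[str]) -> Dict[str, Optional[str]]:
    """Hypothetical helper: extract `keys` from a JSON string, or return None values."""
    empty: Dict[str, Optional[str]] = {key: None for key in keys}
    if output is None:
        return empty
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    # Missing keys are filled with None instead of raising.
    return {key: data.get(key) for key in keys}


raw = '{"S1": "A cat sleeps.", "S2": "Un gato duerme.", "S3": "Un perro corre."}'
parsed = parse_output(raw, KEYS)
```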
GenerateLongTextMatchingData
¶
Bases: _EmbeddingDataGeneration
Generate long text matching data with an LLM to later on train an embedding model.
GenerateLongTextMatchingData is a Task that generates long text matching data with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Note
Ideally this task should be used with EmbeddingTaskGenerator with flatten_tasks=True and category="text-matching-long", so that the LLM generates a list of tasks that are flattened so that each row contains a single task for the text-matching-long category.
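To make the flattening described in the note concrete, here is a minimal sketch of turning one row that holds a list of tasks into one row per task. The column name `tasks` for the unflattened list and the helper `flatten_tasks` are assumptions for illustration; only the `task` column name comes from the input columns of this task.

```python
from typing import Any, Dict, List


def flatten_tasks(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: emit one row per task instead of one row per task list."""
    flat: List[Dict[str, Any]] = []
    for row in rows:
        for task in row["tasks"]:
            # Copy every other column and replace the list with a single task.
            new_row = {key: value for key, value in row.items() if key != "tasks"}
            new_row["task"] = task
            flat.append(new_row)
    return flat


rows = [
    {
        "tasks": ["Match a legal question to a statute.", "Match a bug report to a commit."],
        "model_name": "my-model",
    }
]
flat_rows = flatten_tasks(rows)
```

Each flattened row then carries exactly the `task` column that the downstream generation task expects.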
Attributes:
Name | Type | Description |
---|---|---|
language | str | The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
seed | int | The random seed to be set in case there's any sampling within the |
Input columns
- task (str): The task description to be used in the generation.
Output columns
- input (str): the input generated by the LLM.
- positive_document (str): the positive document generated by the LLM.
- model_name (str): the name of the model used to generate the long text matching data.
Examples:
Generate synthetic long text matching data for training embedding models:
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-matching-long",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateLongTextMatchingData(
        language="English",
        llm=...,  # LLM instance
    )

    task >> generate
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys
property
¶
Contains the keys that will be parsed from the LLM output into a Python dict.
format_input(input)
¶
Method to format the input based on the task and the provided attributes, or just randomly sampling those if not provided. This method will render the _template with the provided arguments and return an OpenAI formatted chat, i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input dictionary containing the task to be used in the _template. | required |
Returns:
Type | Description |
---|---|
ChatType | A list with a single chat containing the user's message with the rendered _template. |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
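The shape of what format_input returns can be sketched as follows. The template text below is invented for illustration and uses plain `str.format` to stay dependency-free; the task's real `_template` and its rendering are not reproduced here.

```python
from typing import Any, Dict, List

# Hypothetical template text; the task's real `_template` is not shown in this page.
TEMPLATE = "You have been assigned a text matching task: {task}\nWrite the example in {language}."


def format_input(input: Dict[str, Any], language: str = "English") -> List[Dict[str, str]]:
    """Render the template and wrap it as a single user turn (OpenAI chat format)."""
    content = TEMPLATE.format(task=input["task"], language=language)
    return [{"role": "user", "content": content}]


chat = format_input({"task": "Match a paper abstract to a related abstract."})
```

The result is a list with one message whose role is `user`, which is the single-turn chat structure described above.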
GenerateShortTextMatchingData
¶
Bases: _EmbeddingDataGeneration
Generate short text matching data with an LLM to later on train an embedding model.
GenerateShortTextMatchingData is a Task that generates short text matching data with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Note
Ideally this task should be used with EmbeddingTaskGenerator with flatten_tasks=True and category="text-matching-short", so that the LLM generates a list of tasks that are flattened so that each row contains a single task for the text-matching-short category.
Attributes:
Name | Type | Description |
---|---|---|
language | str | The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
seed | int | The random seed to be set in case there's any sampling within the |
Input columns
- task (str): The task description to be used in the generation.
Output columns
- input (str): the input generated by the LLM.
- positive_document (str): the positive document generated by the LLM.
- model_name (str): the name of the model used to generate the short text matching data.
Examples:
Generate synthetic short text matching data for training embedding models:
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-matching-short",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateShortTextMatchingData(
        language="English",
        llm=...,  # LLM instance
    )

    task >> generate
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys
property
¶
Contains the keys that will be parsed from the LLM output into a Python dict.
format_input(input)
¶
Method to format the input based on the task and the provided attributes, or just randomly sampling those if not provided. This method will render the _template with the provided arguments and return an OpenAI formatted chat, i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input dictionary containing the task to be used in the _template. | required |
Returns:
Type | Description |
---|---|
ChatType | A list with a single chat containing the user's message with the rendered _template. |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
GenerateTextClassificationData
¶
Bases: _EmbeddingDataGeneration
Generate text classification data with an LLM to later on train an embedding model.
GenerateTextClassificationData is a Task that generates text classification data with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Note
Ideally this task should be used with EmbeddingTaskGenerator with flatten_tasks=True and category="text-classification", so that the LLM generates a list of tasks that are flattened so that each row contains a single task for the text-classification category.
Attributes:
Name | Type | Description |
---|---|---|
language | str | The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
difficulty | Optional[Literal['high school', 'college', 'PhD']] | The difficulty of the query to be generated, which can be 'high school', 'college', or 'PhD'. |
clarity | Optional[Literal['clear', 'understandable with some effort', 'ambiguous']] | The clarity of the query to be generated, which can be 'clear', 'understandable with some effort', or 'ambiguous'. |
seed | int | The random seed to be set in case there's any sampling within the |
Input columns
- task (str): The task description to be used in the generation.
Output columns
- input_text (str): the input text generated by the LLM.
- label (str): the label generated by the LLM.
- misleading_label (str): the misleading label generated by the LLM.
- model_name (str): the name of the model used to generate the text classification data.
Examples:
Generate synthetic text classification data for training embedding models:
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-classification",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextClassificationData(
        language="English",
        difficulty="high school",
        clarity="clear",
        llm=...,  # LLM instance
    )

    task >> generate
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys
property
¶
Contains the keys that will be parsed from the LLM output into a Python dict.
format_input(input)
¶
Method to format the input based on the task and the provided attributes, or just randomly sampling those if not provided. This method will render the _template with the provided arguments and return an OpenAI formatted chat, i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input dictionary containing the task to be used in the _template. | required |
Returns:
Type | Description |
---|---|
ChatType | A list with a single chat containing the user's message with the rendered _template. |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
GenerateTextRetrievalData
¶
Bases: _EmbeddingDataGeneration
Generate text retrieval data with an LLM to later on train an embedding model.
GenerateTextRetrievalData is a Task that generates text retrieval data with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Note
Ideally this task should be used with EmbeddingTaskGenerator with flatten_tasks=True and category="text-retrieval", so that the LLM generates a list of tasks that are flattened so that each row contains a single task for the text-retrieval category.
Attributes:
Name | Type | Description |
---|---|---|
language | str | The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
query_type | Optional[Literal['extremely long-tail', 'long-tail', 'common']] | The type of query to be generated, which can be 'extremely long-tail', 'long-tail', or 'common'. |
query_length | Optional[Literal['less than 5 words', '5 to 15 words', 'at least 10 words']] | The length of the query to be generated, which can be 'less than 5 words', '5 to 15 words', or 'at least 10 words'. |
difficulty | Optional[Literal['high school', 'college', 'PhD']] | The difficulty of the query to be generated, which can be 'high school', 'college', or 'PhD'. |
clarity | Optional[Literal['clear', 'understandable with some effort', 'ambiguous']] | The clarity of the query to be generated, which can be 'clear', 'understandable with some effort', or 'ambiguous'. |
num_words | Optional[Literal[50, 100, 200, 300, 400, 500]] | The number of words in the query to be generated, which can be 50, 100, 200, 300, 400, or 500. |
seed | int | The random seed to be set in case there's any sampling within the |
Input columns
- task (str): The task description to be used in the generation.
Output columns
- user_query (str): the user query generated by the LLM.
- positive_document (str): the positive document generated by the LLM.
- hard_negative_document (str): the hard negative document generated by the LLM.
- model_name (str): the name of the model used to generate the text retrieval data.
Examples:
Generate synthetic text retrieval data for training embedding models:
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-retrieval",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextRetrievalData(
        language="English",
        query_type="common",
        query_length="5 to 15 words",
        difficulty="high school",
        clarity="clear",
        num_words=100,
        llm=...,  # LLM instance
    )

    task >> generate
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
keys
property
¶
Contains the keys that will be parsed from the LLM output into a Python dict.
format_input(input)
¶
Method to format the input based on the task and the provided attributes, or just randomly sampling those if not provided. This method will render the _template with the provided arguments and return an OpenAI formatted chat, i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input dictionary containing the task to be used in the _template. | required |
Returns:
Type | Description |
---|---|
ChatType | A list with a single chat containing the user's message with the rendered _template. |
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
MonolingualTripletGenerator
¶
Bases: _EmbeddingDataGenerator
Generate monolingual triplets with an LLM to later on train an embedding model.
MonolingualTripletGenerator is a GeneratorTask that generates monolingual triplets with an LLM to later on train an embedding model. The task is based on the paper "Improving Text Embeddings with Large Language Models" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Attributes:
Name | Type | Description |
---|---|---|
language | str | The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf. |
unit | Optional[Literal['sentence', 'phrase', 'passage']] | The unit of the data to be generated, which can be 'sentence', 'phrase', or 'passage'. |
difficulty | Optional[Literal['elementary school', 'high school', 'college']] | The difficulty of the query to be generated, which can be 'elementary school', 'high school', or 'college'. |
high_score | Optional[Literal['4', '4.5', '5']] | The high score of the query to be generated, which can be '4', '4.5', or '5'. |
low_score | Optional[Literal['2.5', '3', '3.5']] | The low score of the query to be generated, which can be '2.5', '3', or '3.5'. |
seed | int | The random seed to be set in case there's any sampling within the |
Output columns
- S1 (str): the first sentence generated by the LLM.
- S2 (str): the second sentence generated by the LLM.
- S3 (str): the third sentence generated by the LLM.
- model_name (str): the name of the model used to generate the monolingual triplets.
Examples:
Generate monolingual triplets for training embedding models:
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MonolingualTripletGenerator

with Pipeline("my-pipeline") as pipeline:
    task = MonolingualTripletGenerator(
        language="English",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,
    )

    ...

    task >> ...
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
prompt
property
¶
Contains the prompt to be used in the process method, rendering the _template, and formatted as an OpenAI formatted chat, i.e. a ChatType, assuming that there's only one turn, being from the user with the content being the rendered _template.
keys
property
¶
Contains the keys that will be parsed from the LLM output into a Python dict.
InstructionBacktranslation
¶
Bases: Task
Self-Alignment with Instruction Backtranslation.
Attributes:
Name | Type | Description |
---|---|---|
_template | Optional[Template] | the Jinja2 template to use for the Instruction Backtranslation task. |
Input columns
- instruction (str): The reference instruction to evaluate the text output.
- generation (str): The text output to evaluate for the given instruction.
Output columns
- score (str): The score for the generation based on the given instruction.
- reason (str): The reason for the provided score.
- model_name (str): The model name used to score the generation.
Categories
- critique
Examples:
Generate a score and reason for a given instruction and generation:
from distilabel.steps.tasks import InstructionBacktranslation

# `llm` is assumed to be an already instantiated distilabel LLM.
instruction_backtranslation = InstructionBacktranslation(
    name="instruction_backtranslation",
    llm=llm,
    input_batch_size=10,
    output_mappings={"model_name": "scoring_model"},
)
instruction_backtranslation.load()

result = next(
    instruction_backtranslation.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generation": "4",
            }
        ]
    )
)
# result
# [
#     {
#         "instruction": "How much is 2+2?",
#         "generation": "4",
#         "score": 3,
#         "reason": "Reason for the generation.",
#         "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
#     }
# ]
Citations
@misc{li2024selfalignmentinstructionbacktranslation,
title={Self-Alignment with Instruction Backtranslation},
author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},
year={2024},
eprint={2308.06259},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2308.06259},
}
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
inputs
property
¶
The input for the task is the instruction, and the generation for it.
outputs
property
¶
The output for the task is the score, reason and the model_name.
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
format_input(input)
¶
The input is formatted as a ChatType assuming that the instruction is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
format_output(output, input)
¶
The output is formatted as a dictionary with the score and reason. The model_name will be automatically included within the process method of Task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | a string representing the output of the LLM via the | required |
input | Dict[str, Any] | the input to the task, as required by some tasks to format the output. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dictionary containing the score and reason. |
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
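A minimal sketch of turning a raw LLM reply into the score and reason dictionary is shown below. The reply layout ("Score: <integer>" followed by "Reasons: <text>"), the regular expression and the None fallback are all assumptions for illustration; the actual parsing lives in the source file referenced above.

```python
import re
from typing import Any, Dict, Optional

# Assumed reply layout: "Score: <integer>" on one line, "Reasons: <text>" on the next.
PATTERN = re.compile(r"Score:\s*(\d+)\s*\nReasons?:\s*(.*)", re.IGNORECASE | re.DOTALL)


def format_output(output: Optional[str]) -> Dict[str, Any]:
    """Hypothetical parser: return score and reason, or None values if parsing fails."""
    if output is None:
        return {"score": None, "reason": None}
    match = PATTERN.search(output)
    if match is None:
        return {"score": None, "reason": None}
    return {"score": int(match.group(1)), "reason": match.group(2).strip()}


parsed = format_output("Score: 4\nReasons: The answer is correct and concise.")
```

Returning None values instead of raising keeps a batch going when a single reply does not follow the expected layout.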
Magpie
¶
Bases: Task, MagpieBase
Generates conversations using an instruct fine-tuned LLM.
Magpie is a neat method that allows generating user instructions with no seed data or specific system prompt thanks to the autoregressive capabilities of the instruct fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the LLM without any user message, then the LLM will continue generating tokens as if it was the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM. After this instruct is generated, it can be sent again to the LLM to generate this time an assistant response. This process can be repeated N times allowing to build a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
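The trick can be sketched with a stand-in for the LLM. The header strings below follow a Llama 3 style chat format and are assumptions, as are the names `fake_llm` and `magpie_turn`; a real run would call an actual completion endpoint instead of returning canned text.

```python
from typing import Callable, Dict

# Assumed pre-query tokens for a Llama 3 style chat format; other models use other tokens.
PRE_QUERY = "<|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT_HEADER = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"


def fake_llm(prompt: str) -> str:
    """Stand-in for a real text-completion call; returns canned continuations."""
    if prompt.endswith(PRE_QUERY):
        return "How do I compute the derivative of x^2?"
    return "The derivative of x^2 is 2x."


def magpie_turn(complete: Callable[[str], str], system_prompt: str = "") -> Dict[str, str]:
    # 1. Send only the pre-query tokens: the model continues as if it were the user.
    instruction = complete(system_prompt + PRE_QUERY)
    # 2. Send the instruction back, followed by the assistant header, to get a response.
    response = complete(system_prompt + PRE_QUERY + instruction + ASSISTANT_HEADER)
    return {"instruction": instruction, "response": response}


row = magpie_turn(fake_llm)
```

Repeating the two steps while appending each message to the prompt is what yields a multi-turn conversation.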
Attributes:
Name | Type | Description |
---|---|---|
n_turns | RuntimeParameter[PositiveInt] | the number of turns that the generated conversation will have. Defaults to 1. |
end_with_user | RuntimeParameter[bool] | whether the conversation should end with a user message. Defaults to False. |
include_system_prompt | RuntimeParameter[bool] | whether to include the system prompt used in the generated conversation. Defaults to False. |
only_instruction | RuntimeParameter[bool] | whether to generate only the instruction. If this argument is True, then n_turns will be ignored. Defaults to False. |
system_prompt | Optional[RuntimeParameter[Union[List[str], Dict[str, str], Dict[str, Tuple[str, float]], str]]] | an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None. |
Runtime parameters
- n_turns: the number of turns that the generated conversation will have. Defaults to 1.
- end_with_user: whether the conversation should end with a user message. Defaults to False.
- include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False.
- only_instruction: whether to generate only the instruction. If this argument is True, then n_turns will be ignored. Defaults to False.
- system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to steer the LLM to generate content of a certain topic, guide the style, etc. If the provided inputs contain a system_prompt column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to None.
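The different shapes accepted by system_prompt can be handled along the following lines. This is a sketch under stated assumptions: `pick_system_prompt` is a hypothetical helper, and it returns a (key, prompt) pair where the key is None unless a dictionary was given.

```python
import random
from typing import Dict, List, Optional, Tuple, Union

SystemPrompt = Union[str, List[str], Dict[str, str], Dict[str, Tuple[str, float]]]


def pick_system_prompt(system_prompt: SystemPrompt, rng: random.Random) -> Tuple[Optional[str], str]:
    """Hypothetical helper: return (key, prompt); key is None unless a dict was given."""
    if isinstance(system_prompt, str):
        return None, system_prompt
    if isinstance(system_prompt, list):
        return None, rng.choice(system_prompt)
    keys = list(system_prompt.keys())
    values = list(system_prompt.values())
    if isinstance(values[0], tuple):
        # Dict of key -> (prompt, probability): weighted choice.
        weights = [probability for _, probability in values]
        key = rng.choices(keys, weights=weights, k=1)[0]
        return key, system_prompt[key][0]
    # Dict of key -> prompt: uniform choice.
    key = rng.choice(keys)
    return key, system_prompt[key]


rng = random.Random(0)
key, prompt = pick_system_prompt(
    {"math": ("You are a math tutor.", 0.9), "code": ("You are a coding tutor.", 0.1)},
    rng,
)
```

Calling the helper once per batch, rather than once per row, would match the "chosen per input/output batch" behaviour described above.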
Input columns
- system_prompt (str, optional): an optional system prompt that can be provided to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.
Output columns
- conversation (ChatType): the generated conversation which is a list of chat items with a role and a message. Only if only_instruction=False.
- instruction (str): the generated instructions if only_instruction=True or n_turns==1.
- response (str): the generated response if n_turns==1.
- system_prompt_key (str, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt is a dictionary.
- model_name (str): The model name used to generate the conversation or instruction.
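One reading of the conditions listed above can be written as a small function. It assumes that a single-turn run yields instruction and response instead of conversation, which the list does not state explicitly; `output_columns` is a hypothetical helper, not part of the library.

```python
from typing import List, Union


def output_columns(
    only_instruction: bool,
    n_turns: int,
    system_prompt: Union[None, str, list, dict] = None,
) -> List[str]:
    """Hypothetical helper mirroring the output column conditions listed above."""
    if only_instruction:
        columns = ["instruction"]
    elif n_turns == 1:
        # Assumption: a single turn is reported as instruction plus response.
        columns = ["instruction", "response"]
    else:
        columns = ["conversation"]
    if isinstance(system_prompt, dict):
        columns.append("system_prompt_key")
    columns.append("model_name")
    return columns


single_turn = output_columns(only_instruction=False, n_turns=1)
```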
Categories
- text-generation
- instruction
Examples:
Generating instructions with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM
from distilabel.steps.tasks import Magpie

magpie = Magpie(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 64,
        },
        device="mps",
    ),
    only_instruction=True,
)
magpie.load()

result = next(
    magpie.process(
        inputs=[
            {
                "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
            },
            {
                "system_prompt": "You're an expert florist AI assistant that helps user to eradicate pests in their crops."
            },
        ]
    )
)
# [
#     {'instruction': "That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?"},
#     {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}
# ]
Generating conversations with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM
from distilabel.steps.tasks import Magpie

magpie = Magpie(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 256,
        },
        device="mps",
    ),
    n_turns=2,
)
magpie.load()

result = next(
    magpie.process(
        inputs=[
            {
                "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
            },
            {
                "system_prompt": "You're an expert florist AI assistant that helps user to eradicate pests in their crops."
            },
        ]
    )
)
# [
#     {
#         'conversation': [
#             {'role': 'system', 'content': "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."},
#             {
#                 'role': 'user',
#                 'content': 'I'm having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x→a f(x) or lim x→a [f(x)]. It is read as "the limit as x approaches a of f of x".'
#             },
#             {
#                 'role': 'assistant',
#                 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don't worry, I'm here to help! The notation lim x→a f(x) indeed means "the limit as x approaches a of f of x". What it's asking us to do is find the'
#             }
#         ]
#     },
#     {
#         'conversation': [
#             {'role': 'system', 'content': "You're an expert florist AI assistant that helps user to eradicate pests in their crops."},
#             {
#                 'role': 'user',
#                 'content': "As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it might be pests or diseases, but I'm not sure which."
#             },
#             {
#                 'role': 'assistant',
#                 'content': "I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1. **Aphids**: These small, soft-bodied insects can secrete a sticky substance called"
#             }
#         ]
#     }
# ]
Source code in src/distilabel/steps/tasks/magpie/base.py
outputs
property
¶
Either a multi-turn conversation or the instruction generated.
model_post_init(__context)
¶
Checks that the provided LLM uses the MagpieChatTemplateMixin.
Source code in src/distilabel/steps/tasks/magpie/base.py
format_input(input)
¶
format_output(output, input=None)
¶
process(inputs)
¶
Generate a list of instructions or conversations of the specified number of turns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | a list of dictionaries that can contain a system_prompt key. | required |
Yields:
Type | Description |
---|---|
StepOutput | The list of generated conversations. |
Source code in src/distilabel/steps/tasks/magpie/base.py
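How n_turns and end_with_user might shape a conversation can be sketched as below. The helper `build_conversation` and the two callables are hypothetical, and the sketch assumes that with end_with_user the last turn keeps only the user message.

```python
from typing import Callable, Dict, List

Chat = List[Dict[str, str]]


def build_conversation(
    next_user: Callable[[Chat], str],
    next_assistant: Callable[[Chat], str],
    n_turns: int,
    end_with_user: bool = False,
) -> Chat:
    """Alternate user and assistant messages for `n_turns` turns."""
    conversation: Chat = []
    for turn in range(n_turns):
        conversation.append({"role": "user", "content": next_user(conversation)})
        # Assumption: with end_with_user, the last turn keeps only the user message.
        if end_with_user and turn == n_turns - 1:
            break
        conversation.append({"role": "assistant", "content": next_assistant(conversation)})
    return conversation


# Canned generators standing in for real LLM calls.
chat = build_conversation(
    next_user=lambda c: f"question {len(c) // 2 + 1}",
    next_assistant=lambda c: f"answer {len(c) // 2 + 1}",
    n_turns=2,
)
```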
MagpieGenerator
¶
Bases: GeneratorTask, MagpieBase
Generator task that generates instructions or conversations using Magpie.
Magpie is a neat method that allows generating user instructions with no seed data or specific system prompt thanks to the autoregressive capabilities of the instruct fine-tuned LLMs. As they were fine-tuned using a chat template composed of a user message and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the LLM without any user message, then the LLM will continue generating tokens as if it were the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM. After this instruction is generated, it can be sent again to the LLM to generate an assistant response this time. This process can be repeated N times, which allows building a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
Attributes:
Name | Type | Description |
---|---|---|
n_turns | RuntimeParameter[PositiveInt] | the number of turns that the generated conversation will have. Defaults to 1. |
end_with_user | RuntimeParameter[bool] | whether the conversation should end with a user message. Defaults to False. |
include_system_prompt | RuntimeParameter[bool] | whether to include the system prompt used in the generated conversation. Defaults to False. |
only_instruction | RuntimeParameter[bool] | whether to generate only the instruction. If this argument is True, then n_turns will be ignored. Defaults to False. |
system_prompt | Optional[RuntimeParameter[Union[List[str], Dict[str, str], Dict[str, Tuple[str, float]], str]]] | an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None. |
num_rows | RuntimeParameter[int] | the number of rows to be generated. |
Runtime parameters
- n_turns: the number of turns that the generated conversation will have. Defaults to 1.
- end_with_user: whether the conversation should end with a user message. Defaults to False.
- include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False.
- only_instruction: whether to generate only the instruction. If this argument is True, then n_turns will be ignored. Defaults to False.
- system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.
- num_rows: the number of rows to be generated.
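The four accepted shapes of system_prompt can be handled with a single selection helper. The following is a minimal sketch under stated assumptions: pick_system_prompt is a hypothetical name, not the function distilabel uses, and the returned key mirrors the system_prompt_key output column, which is only meaningful for dictionaries.

```python
import random
from typing import Dict, List, Optional, Tuple, Union

SystemPrompt = Union[str, List[str], Dict[str, str], Dict[str, Tuple[str, float]]]


def pick_system_prompt(
    system_prompt: Optional[SystemPrompt], rng: random.Random
) -> Tuple[Optional[str], Optional[str]]:
    """Return (key, prompt); key is only set when a dictionary is given."""
    if system_prompt is None:
        return None, None
    if isinstance(system_prompt, str):
        return None, system_prompt
    if isinstance(system_prompt, list):
        return None, rng.choice(system_prompt)
    keys = list(system_prompt)
    values = list(system_prompt.values())
    if isinstance(values[0], tuple):
        # Dictionary of (prompt, probability) pairs: weighted draw.
        weights = [weight for _, weight in values]
        key = rng.choices(keys, weights=weights, k=1)[0]
        return key, system_prompt[key][0]
    # Plain dictionary: uniform draw over the keys.
    key = rng.choice(keys)
    return key, system_prompt[key]


prompts = {
    "math": ("You're an expert AI assistant.", 0.8),
    "writing": ("You're an expert writing assistant.", 0.2),
}
rng = random.Random(0)
# One draw per batch, as described for the system_prompt parameter.
key, prompt = pick_system_prompt(prompts, rng)
```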
Output columns
- conversation (ChatType): the generated conversation, which is a list of chat items with a role and a message.
- instruction (str): the generated instruction if only_instruction=True.
- response (str): the generated response if n_turns==1.
- system_prompt_key (str, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt is a dictionary.
- model_name (str): the model name used to generate the conversation or instruction.
Categories
- text-generation
- instruction
- generator
Examples:
Generating instructions with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator

generator = MagpieGenerator(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 256,
        },
        device="mps",
    ),
    only_instruction=True,
    num_rows=5,
)
generator.load()
result = next(generator.process())
# (
# [
# {"instruction": "I've just bought a new phone and I're excited to start using it."},
# {"instruction": "What are the most common types of companies that use digital signage?"}
# ],
# True
# )
Generating a conversation with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator

generator = MagpieGenerator(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 64,
        },
        device="mps",
    ),
    n_turns=3,
    num_rows=5,
)
generator.load()
result = next(generator.process())
# (
# [
# {
# 'conversation': [
# {
# 'role': 'system',
# 'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
# },
# {'role': 'user', 'content': "I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?"},
# {
# 'role': 'assistant',
# 'content': "Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,
# let's break down the basics. First, we need to identify your goals and target audience. What do"
# },
# {
# 'role': 'user',
# 'content': "Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main
# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating"
# },
# {
# 'role': 'assistant',
# 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or
# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'
# }
# ]
# },
# {
# 'conversation': [
# {
# 'role': 'system',
# 'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
# },
# {'role': 'user', 'content': "I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!"},
# {
# 'role': 'assistant',
# 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.
# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'
# },
# {
# 'role': 'user',
# 'content': 'Let me stop you there. Let's explore this "purpose" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I're primarily using my
# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'
# },
# {
# 'role': 'assistant',
# 'content': "Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly
# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* "
# }
# ]
# }
# ],
# True
# )
Generating with system prompts with probabilities:
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import MagpieGenerator

magpie = MagpieGenerator(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-8B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 0.8,
            "max_new_tokens": 256,
        },
    ),
    n_turns=2,
    system_prompt={
        "math": ("You're an expert AI assistant.", 0.8),
        "writing": ("You're an expert writing assistant.", 0.2),
    },
)
magpie.load()
result = next(magpie.process())
Citations
@misc{xu2024magpiealignmentdatasynthesis,
title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
year={2024},
eprint={2406.08464},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.08464},
}
Source code in src/distilabel/steps/tasks/magpie/generator.py
outputs
property
¶
Either a multi-turn conversation or the instruction generated.
model_post_init(__context)
¶
Checks that the provided LLM uses the MagpieChatTemplateMixin.
format_output(output, input=None)
¶
process(offset=0)
¶
Generates the desired number of instructions or conversations using Magpie.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset to start the generation from. Defaults to 0. | 0 |
Yields:
Type | Description |
---|---|
GeneratorStepOutput | The generated instructions or conversations. |
MathShepherdCompleter
¶
Bases: Task
Math Shepherd Completer and auto-labeller task.
Given an instruction, a list of candidate solutions, and a golden solution used as reference, this task generates completions for the solutions and labels their steps against the golden solution using the hard estimation method (Figure 2 and Eq. 3 of the reference paper). The attributes make the task flexible enough to be used with different types of datasets and LLMs, and allow different fields to be used to modify the system and user prompts. Before modifying them, review the current defaults to ensure the completions are generated correctly.
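The hard estimation rule can be sketched in a few lines: a step receives the positive tag if at least one of the N completions sampled from that step reaches the golden answer, and the negative tag otherwise. The function names below are hypothetical, and final answers are compared as plain strings, which is a simplification of whatever answer matching the task performs.

```python
from typing import List, Sequence


def hard_estimation_tag(
    completion_answers: Sequence[str],
    golden_answer: str,
    tags: Sequence[str] = ("+", "-"),
) -> str:
    """Return the positive tag if any completion reaches the golden answer."""
    positive, negative = tags
    reached = any(
        answer.strip() == golden_answer.strip() for answer in completion_answers
    )
    return positive if reached else negative


def label_steps(
    steps: Sequence[str],
    answers_per_step: Sequence[Sequence[str]],
    golden_answer: str,
    tags: Sequence[str] = ("+", "-"),
) -> List[str]:
    """Append the estimated tag to every step of a solution."""
    return [
        f"{step} {hard_estimation_tag(answers, golden_answer, tags)}"
        for step, answers in zip(steps, answers_per_step)
    ]


steps = ["Step 1: 16 - 3 - 4 = 9 eggs.", "Step 2: 9 * 3 = 27 dollars."]
# Final answers of the N=3 completions sampled after each step (made-up values).
answers_per_step = [["18", "18", "27"], ["27", "27", "27"]]
labelled = label_steps(steps, answers_per_step, golden_answer="18")
```

In this made-up case the first step can still lead to the golden answer, so it is tagged as positive, while no completion recovers from the second step.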
Attributes:
Name | Type | Description |
---|---|---|
system_prompt | Optional[str] | The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with 8B and 70B, but it can be modified to adapt it to the model and dataset selected. |
extra_rules | Optional[str] | This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset. |
few_shots | Optional[str] | Few shots to help the model generate the completions. Write them in the format of the type of solutions wanted for your dataset. |
N | PositiveInt | Number of completions to generate for each step; corresponds to N in the paper. They used 8 in the paper, but it can be adjusted. |
tags | list[str] | List of tags to be used in the completions. The default ones are ["+", "-"] as in the paper, where the first is used as a positive label, and the second as a negative one. This can be updated, but it MUST be a list with 2 elements, where the first is the positive one, and the second the negative one. |
Input columns
- instruction (str): The task or instruction.
- solutions (List[str]): List of solutions to the task.
- golden_solution (str): The reference solution to the task, used to annotate the candidate solutions.
Output columns
- solutions (List[str]): The same column that was used as input, with its steps modified to include the labels.
- model_name (str): The name of the model used to generate the revision.
Categories
- text-generation
- labelling
Examples:
Annotate your steps with the Math Shepherd Completer using the structured outputs (the preferred way):
from distilabel.steps.tasks import MathShepherdCompleter
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    generation_kwargs={
        "temperature": 0.6,
        "max_new_tokens": 1024,
    },
)
task = MathShepherdCompleter(
    llm=llm,
    N=3,
    use_default_structured_output=True,
)
task.load()
result = next(
    task.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                "golden_solution": ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                "solutions": [
                    ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                    ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],
                ]
            },
        ]
    )
)
# [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# 'golden_solution': ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"],
# 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]
Annotate your steps with the Math Shepherd Completer:
from distilabel.steps.tasks import MathShepherdCompleter
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    generation_kwargs={
        "temperature": 0.6,
        "max_new_tokens": 1024,
    },
)
task = MathShepherdCompleter(
    llm=llm,
    N=3,
)
task.load()
result = next(
    task.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                "golden_solution": ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                "solutions": [
                    ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                    ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],
                ]
            },
        ]
    )
)
# [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# 'golden_solution': ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"],
# 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]
Citations:
```
@misc{wang2024mathshepherdverifyreinforcellms,
title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},
author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},
year={2024},
eprint={2312.08935},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2312.08935},
}
```
Source code in src/distilabel/steps/tasks/math_shepherd/completer.py
format_output(output, input=None)
¶
process(inputs)
¶
Generates the completions for the solutions and annotates each step following the logic in Figure 2 of the paper, using hard estimation (Eq. (3)).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | Inputs to the step | required |
Yields:
Type | Description |
---|---|
StepOutput | Annotated inputs with the completions. |
_prepare_completions(instruction, steps)
¶
Helper method that, given an instruction and a solution (a list of steps), creates the texts to be completed by the LLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruction | str | Instruction of the problem. | required |
steps | list[str] | List of steps that are part of the solution. | required |
Returns:
Type | Description |
---|---|
List[ChatType] | List of ChatType, where each ChatType is the prompt corresponding to one of the steps to be completed. |
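One way to picture the returned list is as one prompt per step, each containing the instruction followed by the solution truncated at that step. The sketch below follows that idea. The prompt layout, the single user message, and the function name are assumptions for illustration, since the exact text sent to the LLM (including the system prompt) is not shown in this section.

```python
from typing import Dict, List

ChatType = List[Dict[str, str]]


def prepare_completions(instruction: str, steps: List[str]) -> List[ChatType]:
    """Build one chat prompt per step, each ending at that step."""
    prompts: List[ChatType] = []
    for last in range(1, len(steps) + 1):
        # Keep the steps up to and including the current one.
        partial_solution = "\n".join(steps[:last])
        prompts.append(
            [{"role": "user", "content": f"{instruction}\n{partial_solution}"}]
        )
    return prompts


prompts = prepare_completions(
    "What is 16 - 3 - 4, doubled?",
    ["Step 1: 16 - 3 - 4 = 9.", "Step 2: 9 * 2 = 18.", "The answer is: 18"],
)
```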
_auto_label(inputs, final_outputs, input_positions, golden_answers, statistics, raw_outputs, raw_inputs)
¶
Labels the steps in place (in the inputs), and returns the inputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | StepInput | The original inputs | required |
final_outputs | list[Completions] | List of generations from the LLM. It's organized as a list where the elements sent to the LLM are grouped together, then each element contains the completions, and each completion is a list of steps. | required |
input_positions | list[tuple[int, int, int]] | A list with tuples generated in the process method that contains (i, j, k) where i is the index of the input, j is the index of the solution, and k is the index of the completion. | required |
golden_answers | list[str] | List of golden answers for each input. | required |
statistics | list[LLMStatistics] | List of statistics from the LLM. | required |
raw_outputs | list[str] | List of raw outputs from the LLM. | required |
raw_inputs | list[str] | List of raw inputs to the LLM. | required |
Returns:
Type | Description |
---|---|
StepInput | Inputs annotated. |
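The role of input_positions is easier to see on a toy example: the nested inputs are flattened into one list for the LLM, and the (i, j, k) tuples record where each element came from so the labels can be written back in place. This is a simplified, hypothetical illustration. Here the third index identifies the step whose prompt was sent, and the labels are supplied directly instead of being derived from completions.

```python
from typing import Any, Dict, List, Tuple

Position = Tuple[int, int, int]


def collect_positions(inputs: List[Dict[str, Any]]) -> List[Position]:
    """Record (input index, solution index, step index) for every step."""
    positions: List[Position] = []
    for i, row in enumerate(inputs):
        for j, solution in enumerate(row["solutions"]):
            for k in range(len(solution)):
                positions.append((i, j, k))
    return positions


def write_labels(
    inputs: List[Dict[str, Any]], positions: List[Position], labels: List[str]
) -> List[Dict[str, Any]]:
    """Append each label to its step, mutating the inputs in place."""
    for (i, j, k), label in zip(positions, labels):
        inputs[i]["solutions"][j][k] += f" {label}"
    return inputs


inputs = [{"solutions": [["Step 1", "Step 2"], ["Step 1"]]}]
positions = collect_positions(inputs)
# One label per flattened element, in the same order as the positions.
labelled = write_labels(inputs, positions, ["+", "-", "+"])
```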
_add_metadata(input, statistics, raw_output, raw_input)
¶
Adds the distilabel_metadata to the input.
This method comes for free in the general Tasks, but as we have reimplemented the process method, we have to repeat it here.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | dict[str, Any] | The input to add the metadata to. | required |
statistics | list[LLMStatistics] | The statistics from the LLM. | required |
raw_output | Union[str, None] | The raw output from the LLM. | required |
raw_input | Union[list[dict[str, Any]], None] | The raw input to the LLM. | required |
Returns:
Type | Description |
---|---|
dict[str, Any] | The input with the metadata added, if applicable. |
get_structured_output()
¶
Creates the JSON schema to be passed to the LLM, to enforce generating an output that can be directly parsed as a Python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel, Field

class Solution(BaseModel):
    solution: str = Field(..., description="Step by step solution leading to the final answer")

class MathShepherdCompleter(BaseModel):
    solutions: list[Solution] = Field(..., description="List of solutions")

MathShepherdCompleter.model_json_schema()
Returns:
Type | Description |
---|---|
dict[str, Any] | JSON Schema of the response to enforce. |
MathShepherdGenerator
¶
Bases: Task
Math Shepherd solution generator.
This task is in charge of generating completions for a given instruction, in the format expected
by the Math Shepherd Completer task. The attributes make the task flexible to be used with different
types of dataset and LLMs, but we provide examples for the GSM8K and MATH datasets as presented
in the original paper. Before modifying them, review the current defaults to ensure the completions
are generated correctly. This task can be used to generate the golden solutions for a given problem if
not provided, as well as possible solutions to be then labeled by the Math Shepherd Completer.
Only one of solutions or golden_solution will be generated, depending on the value of M.
Attributes:
Name | Type | Description |
---|---|---|
system_prompt | Optional[str] | The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with 8B and 70B, but it can be modified to adapt it to the model and dataset selected. Take into account that the system prompt includes 2 variables in the Jinja2 template, {{extra_rules}} and {{few_shot}}. These variables are used to include extra rules, for example to steer the model towards a specific type of response, and few shots to add examples. They can be modified to adapt the system prompt to the dataset and model used without needing to change the full system prompt. |
extra_rules | Optional[str] | This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset. |
few_shots | Optional[str] | Few shots to help the model generate the completions. Write them in the format of the type of solutions wanted for your dataset. |
M | Optional[PositiveInt] | Number of completions to generate for each step. By default it is set to 1, which will generate the "golden_solution". In this case select a stronger model, as it will be used as the source of truth during labelling. If M is set to a number greater than 1, the task will generate a list of completions to be labeled by the Math Shepherd Completer task. |
Input columns
- instruction (str): The task or instruction.
Output columns
- golden_solution (str): The step by step solution to the instruction. It will be generated if M is equal to 1.
- solutions (List[List[str]]): A list of possible solutions to the instruction. It will be generated if M is greater than 1.
- model_name (str): The name of the model used to generate the revision.
Categories
- text-generation
Examples:
Generate the solution for a given instruction (prefer a stronger model here):
from distilabel.steps.tasks import MathShepherdGenerator
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.6,
        "max_new_tokens": 1024,
    },
)
task = MathShepherdGenerator(
    name="golden_solution_generator",
    llm=llm,
)
task.load()
result = next(
    task.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
            },
        ]
    )
)
# [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# 'golden_solution': '["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"]'}]]
Generate M completions for a given instruction (using structured output generation):
from distilabel.steps.tasks import MathShepherdGenerator
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 2048,
    },
)
task = MathShepherdGenerator(
    name="solution_generator",
    llm=llm,
    M=2,
    use_default_structured_output=True,
)
task.load()
result = next(
    task.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
            },
        ]
    )
)
# [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]
Source code in src/distilabel/steps/tasks/math_shepherd/generator.py
get_structured_output()
¶
Creates the JSON schema to be passed to the LLM, to enforce generating an output that can be directly parsed as a Python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel, Field

class Solution(BaseModel):
    solution: str = Field(..., description="Step by step solution leading to the final answer")

class MathShepherdGenerator(BaseModel):
    solutions: list[Solution] = Field(..., description="List of solutions")

MathShepherdGenerator.model_json_schema()
Returns:
Type | Description |
---|---|
dict[str, Any] | JSON Schema of the response to enforce. |
FormatPRM
¶
Bases: Step
Helper step to transform the data into the format expected by the PRM model.
This step can be used to format the data in one of 2 formats:
- Following the format presented in peiyi9979/Math-Shepherd, in which case this step creates the columns input and label, where the input is the instruction with the solution (and the tag replaced by a token), and the label is the instruction with the solution, both separated by a newline.
- Following TRL's format for training, which generates the columns prompt, completions, and labels. The labels correspond to the original tags replaced by boolean values, where True represents correct steps.
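The two transformations can be sketched on a list of tagged steps as follows. This is a hedged illustration rather than the step's implementation: the function names are hypothetical, the placeholder step token is made up, each step is assumed to end with a space followed by its tag, and joining the instruction and the steps with newlines is one reading of the description above.

```python
from typing import List, Sequence, Tuple


def split_tagged_steps(
    steps: Sequence[str], tags: Sequence[str] = ("+", "-")
) -> Tuple[List[str], List[bool]]:
    """TRL-style pieces: steps without tags, plus one boolean per step."""
    completions: List[str] = []
    labels: List[bool] = []
    for step in steps:
        # Assumes the tag is the last whitespace-separated token of the step.
        text, _, tag = step.rpartition(" ")
        completions.append(text)
        labels.append(tag == tags[0])
    return completions, labels


def to_math_shepherd(
    instruction: str,
    steps: Sequence[str],
    step_token: str = "<STEP>",  # hypothetical placeholder token
    tags: Sequence[str] = ("+", "-"),
) -> Tuple[str, str]:
    """Math-Shepherd-style pair: tags replaced by a token, and tags kept."""
    completions, _ = split_tagged_steps(steps, tags)
    masked = [f"{text} {step_token}" for text in completions]
    input_text = "\n".join([instruction, *masked])
    label_text = "\n".join([instruction, *steps])
    return input_text, label_text


steps = [
    "Step 1: 2 / 2 = 1 bolt of white fiber. +",
    "Step 2: 2 + 1 = 4 bolts in total. -",
]
completions, labels = split_tagged_steps(steps)
input_text, label_text = to_math_shepherd("How many bolts?", steps)
```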
Attributes:
Name | Type | Description |
---|---|---|
format | Literal['math-shepherd', 'trl'] | The format to use for the PRM model. "math-shepherd" corresponds to the original paper, while "trl" is a format prepared to train the model using TRL. |
step_token | str | String that serves as a unique token denoting the position for predicting the step score. |
tags | list[str] | List of tags that represent the correct and incorrect steps. This only needs to be set if it differs from the default used by MathShepherdCompleter. |
Input columns
- instruction (str): The task or instruction.
- solutions (list[str]): List of steps with a solution to the task.
Output columns
- input (str): The instruction with the solutions, where the label tags are replaced by a token.
- label (str): The instruction with the solutions.
- prompt (str): The instruction with the solutions, where the label tags are replaced by a token.
- completions (List[str]): The solution represented as a list of steps.
- labels (List[bool]): The labels, as a list of booleans, where True represents a good response.
Categories
- text-manipulation
- columns
References
Examples:
Prepare your data to train a PRM model with the Math-Shepherd format:
from distilabel.steps.tasks import FormatPRM
from distilabel.steps import ExpandColumns
expand_columns = ExpandColumns(columns=["solutions"])
expand_columns.load()
# Define our PRM formatter
formatter = FormatPRM()
formatter.load()
# Expand the solutions column as it comes from the MathShepherdCompleter
result = next(
expand_columns.process(
[
{
"instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
"solutions": [["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"]]
},
]
)
)
result = next(formatter.process(result))
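To make the Math-Shepherd layout concrete, the sketch below mimics the transformation on a single, already expanded row. It is not distilabel's implementation: the helper name `to_math_shepherd`, the default `step_token` value, and the way the instruction and steps are joined (a single space, then newline-separated steps) are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def to_math_shepherd(
    instruction: str,
    solution: List[str],
    step_token: str = "ки",  # assumed placeholder token; pass your own if it differs
    tags: Sequence[str] = ("+", "-"),
) -> Dict[str, str]:
    """Build `input` (trailing tags replaced by `step_token`) and `label` (tags kept)."""
    replaced = []
    for step in solution:
        step = step.rstrip()
        for tag in tags:
            if step.endswith(tag):
                # Swap the trailing tag for the step token.
                step = step[: -len(tag)] + step_token
                break
        replaced.append(step)
    return {
        # Assumption: instruction and steps are joined by a space, steps by newlines.
        "input": f"{instruction} " + "\n".join(replaced),
        "label": f"{instruction} " + "\n".join(s.rstrip() for s in solution),
    }


row = to_math_shepherd(
    "What is 1 + 1?",
    ["Step 1: 1 + 1 = 2. +", "Step 2: The answer is: 3 -"],
)
print(row["input"])
print(row["label"])
```

The `label` column keeps the original tags, so a trainer can compare it position by position with `input` to recover which step tokens should be scored as correct.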
Prepare your data to train a PRM model with the TRL format:
from distilabel.steps.tasks import FormatPRM
from distilabel.steps import ExpandColumns
expand_columns = ExpandColumns(columns=["solutions"])
expand_columns.load()
# Define our PRM formatter
formatter = FormatPRM(format="trl")
formatter.load()
# Expand the solutions column as it comes from the MathShepherdCompleter
result = next(
expand_columns.process(
[
{
"instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
"solutions": [["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"]]
},
]
)
)
result = next(formatter.process(result))
# {
# "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# "solutions": [
# "Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +",
# "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +",
# "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"
# ],
# "prompt": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# "completions": [
# "Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.",
# "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.",
# "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3"
# ],
# "labels": [
# true,
# true,
# true
# ]
# }
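The TRL output shown above can be reproduced by hand with a few lines of Python. This is only a sketch of the mapping visible in that example (trailing tag removed from each step, tag converted to a boolean); the helper name `to_trl` and the default tags `("+", "-")` are assumptions, not the library's code.

```python
from typing import Any, Dict, List, Tuple


def to_trl(
    instruction: str,
    solution: List[str],
    tags: Tuple[str, str] = ("+", "-"),  # assumed (correct, incorrect) tags
) -> Dict[str, Any]:
    """Split tagged steps into `completions` and boolean `labels`."""
    correct_tag, incorrect_tag = tags
    completions: List[str] = []
    labels: List[bool] = []
    for step in solution:
        step = step.rstrip()
        if step.endswith(correct_tag):
            labels.append(True)
            step = step[: -len(correct_tag)]
        elif step.endswith(incorrect_tag):
            labels.append(False)
            step = step[: -len(incorrect_tag)]
        else:
            raise ValueError(f"Step has no trailing tag: {step!r}")
        # Drop the whitespace that separated the step text from its tag.
        completions.append(step.rstrip())
    return {"prompt": instruction, "completions": completions, "labels": labels}


out = to_trl("Q", ["Step 1: a. +", "Step 2: b. -"])
print(out)
```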
Citations:
```
@misc{wang2024mathshepherdverifyreinforcellms,
title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},
author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},
year={2024},
eprint={2312.08935},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2312.08935},
}
```
Source code in src/distilabel/steps/tasks/math_shepherd/utils.py
process(inputs)
¶
The process prepares the data for the APIGenGenerator
task.
If a single example is provided, it is copied to avoid raising an error.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
StepInput
|
A list of dictionaries with the input data. |
required |
Yields:
Type | Description |
---|---|
StepOutput
|
A list of dictionaries with the output data. |
Source code in src/distilabel/steps/tasks/math_shepherd/utils.py
PairRM
¶
Bases: Step
Rank the candidates based on the input using the LLM
model.
Attributes:
Name | Type | Description |
---|---|---|
model |
str
|
The model to use for the ranking. Defaults to |
instructions |
Optional[str]
|
The instructions to use for the model. Defaults to |
Input columns
- inputs (
List[Dict[str, Any]]
): The input text or conversation to rank the candidates for. - candidates (
List[Dict[str, Any]]
): The candidates to rank.
Output columns
- ranks (
List[int]
): The ranks of the candidates based on the input. - ranked_candidates (
List[Dict[str, Any]]
): The candidates ranked based on the input. - model_name (
str
): The model name used to rank the candidate responses. Defaults to "llm-blender/PairRM"
.
References
Categories
- preference
Note
This step differs from other tasks in that there is currently a single implementation of this model,
and we will use a specific LLM
.
Examples:
Rank LLM candidates:
from distilabel.steps.tasks import PairRM
# Consider this as a placeholder for your actual LLM.
pair_rm = PairRM()
pair_rm.load()
result = next(
pair_rm.process(
[
{"input": "Hello, how are you?", "candidates": ["fine", "good", "bad"]},
]
)
)
# result
# [
# {
# 'input': 'Hello, how are you?',
# 'candidates': ['fine', 'good', 'bad'],
# 'ranks': [2, 1, 3],
# 'ranked_candidates': ['good', 'fine', 'bad'],
# 'model_name': 'llm-blender/PairRM',
# }
# ]
Citations
@misc{jiang2023llmblenderensemblinglargelanguage,
title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},
year={2023},
eprint={2306.02561},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2306.02561},
}
Source code in src/distilabel/steps/tasks/pair_rm.py
inputs
property
¶
The input columns correspond to the two required arguments from Blender.rank
:
inputs
and candidates
.
outputs
property
¶
The outputs will include the ranks
and the ranked_candidates
.
load()
¶
Loads the PairRM model provided via model
with llm_blender.Blender
, which is the
custom library for running the inference for the PairRM models.
Source code in src/distilabel/steps/tasks/pair_rm.py
format_input(input)
¶
The input is expected to be a dictionary with the keys input
and candidates
,
where the input
corresponds to the instruction of a model and candidates
are a
list of responses to be ranked.
Source code in src/distilabel/steps/tasks/pair_rm.py
process(inputs)
¶
Generates the ranks for the candidates based on the input.
The ranks are the positions of the candidates, where lower is better, and the ranked candidates correspond to the candidates sorted according to the ranks obtained.
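The relationship between ranks and ranked candidates can be sketched as a plain sort. The helper name `sort_by_rank` is hypothetical and this is not the step's actual code; it only illustrates "lower rank is better" on the data from the example above.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sort_by_rank(candidates: Sequence[T], ranks: Sequence[int]) -> List[T]:
    """Return the candidates ordered from best (lowest rank) to worst."""
    if len(candidates) != len(ranks):
        raise ValueError("candidates and ranks must have the same length")
    # Sort candidate indices by their rank, then pick the candidates in that order.
    order = sorted(range(len(candidates)), key=lambda i: ranks[i])
    return [candidates[i] for i in order]


ranked = sort_by_rank(["fine", "good", "bad"], [2, 1, 3])
print(ranked)  # ['good', 'fine', 'bad']
```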
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs
|
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Yields:
Type | Description |
---|---|
StepOutput
|
An iterator with the inputs containing the |
Source code in src/distilabel/steps/tasks/pair_rm.py
PrometheusEval
¶
Bases: Task
Critique and rank the quality of generations from an LLM
using Prometheus 2.0.
PrometheusEval
is a task created for Prometheus 2.0, covering both the absolute and relative
evaluations. The absolute evaluation i.e. mode="absolute"
is used to evaluate a single generation from
an LLM for a given instruction. The relative evaluation i.e. mode="relative"
is used to evaluate two generations from an LLM
for a given instruction.
Both evaluations can be run with or without a reference answer to compare against, as controlled by
the reference
attribute, and both are based on a score rubric that critiques the generation/s
based on the following default aspects: helpfulness
, harmlessness
, honesty
, factual-validity
,
and reasoning
, that can be overridden via rubrics
, and the selected rubric is set via the attribute
rubric
.
Note
The PrometheusEval
task is better suited and intended to be used with any of the Prometheus 2.0
models released by KAIST AI, namely https://huggingface.co/prometheus-eval/prometheus-7b-v2.0
and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The critique assessment formatting
and quality is not guaranteed if using another model, even though some other models may be able to
correctly follow the formatting and generate insightful critiques too.
Attributes:
Name | Type | Description |
---|---|---|
mode |
Literal['absolute', 'relative']
|
the evaluation mode to use, either |
rubric |
str
|
the score rubric to use within the prompt to run the critique based on different aspects.
Can be any existing key in the |
rubrics |
Optional[Dict[str, str]]
|
a dictionary containing the different rubrics to use for the critique, where the keys are
the rubric names and the values are the rubric descriptions. The default rubrics are the following:
|
reference |
bool
|
a boolean flag to indicate whether a reference answer / completion will be provided, so
that the model critique is based on the comparison with it. It implies that the column |
_template |
Union[Template, None]
|
a Jinja2 template used to format the input for the LLM. |
Input columns
- instruction (
str
): The instruction to use as reference. - generation (
str
, optional): The generated text from the given instruction
. This column is required if mode=absolute
. - generations (
List[str]
, optional): The generated texts from the given instruction
. It should contain 2 generations only. This column is required if mode=relative
. - reference (
str
, optional): The reference / golden answer for the instruction
, to be used by the LLM for comparison against.
Output columns
- feedback (
str
): The feedback explaining the result below, as critiqued by the LLM using the pre-defined score rubric, compared against reference
if provided. - result (
Union[int, Literal["A", "B"]]
): If mode=absolute
, then the result contains the score for the generation
on a Likert scale from 1 to 5; otherwise, if mode=relative
, then the result contains either "A" or "B", the "winning" one being the generation at index 0 of generations
if result='A'
or at index 1 if result='B'
. - model_name (
str
): The model name used to generate the feedback
and result
.
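For relative evaluations, the "A"/"B" result described above can be mapped back to the winning text with a small helper. The function name `pick_winner` is hypothetical and not part of distilabel; it only encodes the index convention stated in the output columns.

```python
from typing import List


def pick_winner(generations: List[str], result: str) -> str:
    """Map a relative-mode result ("A" or "B") to the winning generation."""
    if len(generations) != 2:
        raise ValueError("relative mode expects exactly 2 generations")
    if result == "A":
        return generations[0]
    if result == "B":
        return generations[1]
    raise ValueError(f"unexpected result: {result!r}")


winner = pick_winner(["something done", "other thing"], "A")
print(winner)  # something done
```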
Categories
- critique
- preference
References
Examples:
Critique and evaluate LLM generation quality using Prometheus 2.0:
from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM
# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
llm=vLLM(
model="prometheus-eval/prometheus-7b-v2.0",
chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
),
mode="absolute",
rubric="factual-validity"
)
prometheus.load()
result = next(
prometheus.process(
[
{"instruction": "make something", "generation": "something done"},
]
)
)
# result
# [
# {
# 'instruction': 'make something',
# 'generation': 'something done',
# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',
# 'feedback': 'the feedback',
# 'result': 5,
# }
# ]
Critique for relative evaluation:
from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM
# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
llm=vLLM(
model="prometheus-eval/prometheus-7b-v2.0",
chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
),
mode="relative",
rubric="honesty"
)
prometheus.load()
result = next(
prometheus.process(
[
{"instruction": "make something", "generations": ["something done", "other thing"]},
]
)
)
# result
# [
# {
# 'instruction': 'make something',
# 'generations': ['something done', 'other thing'],
# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',
# 'feedback': 'the feedback',
# 'result': 'A',
# }
# ]
Critique with a custom rubric:
from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM
# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
llm=vLLM(
model="prometheus-eval/prometheus-7b-v2.0",
chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
),
mode="absolute",
rubric="custom",
rubrics={
"custom": "[A]\nScore 1: A\nScore 2: B\nScore 3: C\nScore 4: D\nScore 5: E"
}
)
prometheus.load()
result = next(
prometheus.process(
[
{"instruction": "make something", "generation": "something done"},
]
)
)
# result
# [
# {
# 'instruction': 'make something',
# 'generation': 'something done',
# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',
# 'feedback': 'the feedback',
# 'result': 5,
# }
# ]
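The custom rubric string in the example above follows a simple layout: a bracketed criterion on the first line, then one `Score N:` line per level from 1 to 5. A small helper can assemble such a string; the name `build_rubric` is hypothetical, and the layout is inferred only from that example.

```python
from typing import List


def build_rubric(criterion: str, score_descriptions: List[str]) -> str:
    """Assemble a rubric string: "[criterion]" followed by five "Score N: ..." lines."""
    if len(score_descriptions) != 5:
        raise ValueError("expected one description per score from 1 to 5")
    lines = [f"[{criterion}]"]
    lines += [
        f"Score {score}: {description}"
        for score, description in enumerate(score_descriptions, start=1)
    ]
    return "\n".join(lines)


rubric = build_rubric("A", ["A", "B", "C", "D", "E"])
print(rubric)
```

The returned string could then be passed as a value of the `rubrics` dictionary, with its key selected through `rubric`.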
Critique using a reference answer:
from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM
# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
llm=vLLM(
model="prometheus-eval/prometheus-7b-v2.0",
chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
),
mode="absolute",
rubric="helpfulness",
reference=True,
)
prometheus.load()
result = next(
prometheus.process(
[
{
"instruction": "make something",
"generation": "something done",
"reference": "this is a reference answer",
},
]
)
)
# result
# [
# {
# 'instruction': 'make something',
# 'generation': 'something done',
# 'reference': 'this is a reference answer',
# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',
# 'feedback': 'the feedback',
# 'result': 5,
# }
# ]
Citations
@misc{kim2024prometheus2opensource,
title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
year={2024},
eprint={2405.01535},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.01535},
}
Source code in src/distilabel/steps/tasks/prometheus_eval.py
inputs
property
¶
The default inputs for the task are the instruction
and the generation
if reference=False
, otherwise, the inputs are instruction
, generation
, and
reference
.
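Combining this property with the input columns listed earlier (generation for mode="absolute", generations for mode="relative", reference only when enabled), the expected columns can be sketched as follows. The function name `expected_input_columns` is hypothetical; it is a reading of the documentation, not the property's source.

```python
from typing import List


def expected_input_columns(mode: str, reference: bool) -> List[str]:
    """List the columns a row needs for the given mode and reference flag."""
    if mode == "absolute":
        columns = ["instruction", "generation"]
    elif mode == "relative":
        columns = ["instruction", "generations"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if reference:
        columns.append("reference")
    return columns


print(expected_input_columns("absolute", reference=True))
```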
outputs
property
¶
The outputs for the task are the feedback
and the result
generated by Prometheus,
as well as the model_name
which is automatically included based on the LLM
used.
load()
¶
Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation
depending on the mode
value, and either with or without reference, depending on the
value of reference
.
Source code in src/distilabel/steps/tasks/prometheus_eval.py
format_input(input)
¶
The input is formatted as a ChatType
where the prompt is formatted according
to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction
from the user, including a pre-defined system prompt.
Source code in src/distilabel/steps/tasks/prometheus_eval.py
format_output(output, input)
¶
The output is formatted as a dict with the keys feedback
and result
captured
using a regex from the Prometheus output.
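As an illustration of this kind of regex-based parsing, the sketch below assumes the raw text looks like `Feedback: <critique> [RESULT] <score or A/B>`. That layout, the pattern, and the helper name `parse_prometheus_output` are assumptions for the example and may differ from the actual implementation.

```python
import re
from typing import Dict, Optional, Union

# Assumed layout of the raw text: "Feedback: <critique> [RESULT] <score or A/B>".
_OUTPUT_RE = re.compile(
    r"^(?:Feedback:)?\s*(?P<feedback>.*?)\s*\[RESULT\]\s*(?P<result>\S+)\s*$",
    re.DOTALL,
)

Parsed = Dict[str, Optional[Union[str, int]]]


def parse_prometheus_output(output: Optional[str], mode: str) -> Parsed:
    """Extract `feedback` and `result`; both are None if parsing fails."""
    empty: Parsed = {"feedback": None, "result": None}
    if output is None:
        return empty
    match = _OUTPUT_RE.match(output.strip())
    if match is None:
        return empty
    raw = match.group("result")
    # Absolute mode expects a score from 1 to 5, relative mode expects "A" or "B".
    if mode == "absolute" and raw in {"1", "2", "3", "4", "5"}:
        return {"feedback": match.group("feedback"), "result": int(raw)}
    if mode == "relative" and raw in {"A", "B"}:
        return {"feedback": match.group("feedback"), "result": raw}
    return empty


parsed = parse_prometheus_output("Feedback: The answer is accurate. [RESULT] 4", "absolute")
print(parsed)
```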
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output
|
Union[str, None]
|
the raw output of the LLM. |
required |
input
|
Dict[str, Any]
|
the input to the task. Optionally provided in case it's useful to build the output. |
required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
A dict with the keys |
Source code in src/distilabel/steps/tasks/prometheus_eval.py
QualityScorer
¶
Bases: Task
Score responses based on their quality using an LLM
.
QualityScorer
is a pre-defined task that defines the instruction
as the input
and score
as the output. This task is used to rate the quality of instructions and responses.
It's an implementation of the quality score task from the paper 'What Makes Good Data
for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs
are scored in terms of quality, obtaining a quality score for each instruction.
Attributes:
Name | Type | Description |
---|---|---|
_template |
Union[Template, None]
|
a Jinja2 template used to format the input for the LLM. |
Input columns
- instruction (
str
): The instruction that was used to generate the responses
. - responses (
List[str]
): The responses to be scored. Each response forms a pair with the instruction.
Output columns
- scores (
List[float]
): The score for each response. - model_name (
str
): The model name used to generate the scores.
Categories
- scorer
- quality
- response
Examples:
Evaluate the quality of your instructions:
from distilabel.steps.tasks import QualityScorer
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
scorer = QualityScorer(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
)
)
scorer.load()
result = next(
scorer.process(
[
{
"instruction": "instruction",
"responses": ["good response", "weird response", "bad response"]
}
]
)
)
# result
# [
# {
# 'instruction': 'instruction',
# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
# 'scores': [5, 3, 1],
# }
# ]
Generate structured output with default schema:
from distilabel.steps.tasks import QualityScorer
from distilabel.models import InferenceEndpointsLLM
scorer = QualityScorer(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
use_default_structured_output=True
)
scorer.load()
result = next(
scorer.process(
[
{
"instruction": "instruction",
"responses": ["good response", "weird response", "bad response"]
}
]
)
)
# result
# [{'instruction': 'instruction',
# 'responses': ['good response', 'weird response', 'bad response'],
# 'scores': [1, 2, 3],
# 'distilabel_metadata': {'raw_output_quality_scorer_0': '{ "scores": [1, 2, 3] }'},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Citations
@misc{liu2024makesgooddataalignment,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2024},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15685},
}
Source code in src/distilabel/steps/tasks/quality_scorer.py
inputs
property
¶
The inputs for the task are instruction
and responses
.
outputs
property
¶
The output for the task is a list of scores
containing the quality score for each
response in responses
.
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/quality_scorer.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/quality_scorer.py
format_output(output, input)
¶
The output is formatted as a list with the score of each instruction-response pair.
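A sketch of how such a score list might be recovered from free-form text is shown below. The line format `[N] score: X`, the regex, and the helper name `parse_scores` are assumptions for illustration; the number of responses is used to pad or truncate the list so that it always lines up with `responses`.

```python
import re
from typing import Dict, List, Optional

# Assumed line format in the raw LLM output, e.g. "[1] score: 5".
_SCORE_LINE = re.compile(r"\[\d+\]\s*score:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_scores(output: Optional[str], num_responses: int) -> Dict[str, List[Optional[float]]]:
    """Return one score per response, using None where no score was found."""
    if output is None:
        return {"scores": [None] * num_responses}
    scores: List[Optional[float]] = [float(value) for value in _SCORE_LINE.findall(output)]
    # Keep at most one score per response and pad the rest with None.
    scores = scores[:num_responses]
    scores += [None] * (num_responses - len(scores))
    return {"scores": scores}


print(parse_scores("[1] score: 5\n[2] score: 3\n[3] score: 1", num_responses=3))
```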
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output
|
Union[str, None]
|
the raw output of the LLM. |
required |
input
|
Dict[str, Any]
|
the input to the task. Used for obtaining the number of responses. |
required |
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
A dict with the key |
Source code in src/distilabel/steps/tasks/quality_scorer.py
get_structured_output()
¶
Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel
from typing import List

class SchemaQualityScorer(BaseModel):
    scores: List[int]
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
JSON Schema of the response to enforce. |
Source code in src/distilabel/steps/tasks/quality_scorer.py
_format_structured_output(output, input)
¶
Parses the structured response, which should correspond to a dictionary whose scores key holds the list of scores.
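Because the structured response is JSON (see the `raw_output_quality_scorer_0` value in the example above), parsing it can be sketched with the standard library alone. The helper name `parse_structured_scores` and the None-padding fallback are assumptions, not the method's actual behavior.

```python
import json
from typing import Any, Dict, List


def parse_structured_scores(output: str, num_responses: int) -> Dict[str, List[Any]]:
    """Load the JSON response and return its scores, or Nones if it is malformed."""
    try:
        scores = json.loads(output)["scores"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Assumed fallback: one None per response when the output cannot be parsed.
        return {"scores": [None] * num_responses}
    return {"scores": scores}


print(parse_structured_scores('{ "scores": [1, 2, 3] }', num_responses=3))
```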
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output
|
str
|
The output from the |
required |
Returns:
Type | Description |
---|---|
Dict[str, str]
|
Formatted output. |
Source code in src/distilabel/steps/tasks/quality_scorer.py
SelfInstruct
¶
Bases: Task
Generate instructions based on a given input using an LLM
.
SelfInstruct
is a pre-defined task that, given a number of instructions,
criteria for query generation, an application description, and an input,
generates a number of instructions related to the given input, following what
is stated in the criteria for query generation and the application description.
It is based on the SelfInstruct framework from the paper "Self-Instruct: Aligning
Language Models with Self-Generated Instructions".
Attributes:
Name | Type | Description |
---|---|---|
num_instructions |
int
|
The number of instructions to be generated. Defaults to 5. |
criteria_for_query_generation |
str
|
The criteria for the query generation. Defaults to the criteria defined within the paper. |
application_description |
str
|
The description of the AI application that one wants
to build with these instructions. Defaults to |
Input columns
- input (
str
): The input to generate the instructions. It's also called seed in the paper.
Output columns
- instructions (
List[str]
): The generated instructions. - model_name (
str
): The model name used to generate the instructions.
Categories
- text-generation
Examples:
Generate instructions based on a given input:
from distilabel.steps.tasks import SelfInstruct
from distilabel.models import InferenceEndpointsLLM
self_instruct = SelfInstruct(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
num_instructions=5, # This is the default value
)
self_instruct.load()
result = next(self_instruct.process([{"input": "instruction"}]))
# result
# [
# {
# 'input': 'instruction',
# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
# 'instructions': ["instruction 1", "instruction 2", "instruction 3", "instruction 4", "instruction 5"],
# }
# ]
Citations
@misc{wang2023selfinstructaligninglanguagemodels,
title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},
year={2023},
eprint={2212.10560},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2212.10560},
}
Source code in src/distilabel/steps/tasks/self_instruct.py
inputs
property
¶
The input for the task is the input
i.e. seed text.
outputs
property
¶
The output for the task is a list of instructions
containing the generated instructions.
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/self_instruct.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/self_instruct.py
format_output(output, input=None)
¶
The output is formatted as a list with the generated instructions.
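One simple way to obtain such a list, assuming the LLM writes one instruction per line, is to split on newlines and drop empty lines. Both that assumption and the helper name `parse_instructions` are illustrative and may not match the actual parsing logic.

```python
from typing import Dict, List, Optional


def parse_instructions(output: Optional[str]) -> Dict[str, List[str]]:
    """Split the raw output into non-empty, stripped lines."""
    if output is None:
        return {"instructions": []}
    lines = [line.strip() for line in output.split("\n")]
    return {"instructions": [line for line in lines if line]}


print(parse_instructions("instruction 1\n\ninstruction 2\n"))
```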
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output
|
Union[str, None]
|
the raw output of the LLM. |
required |
input
|
Optional[Dict[str, Any]]
|
the input to the task. Used for obtaining the number of responses. |
None
|
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
A dict containing the generated instructions. |
Source code in src/distilabel/steps/tasks/self_instruct.py
GenerateSentencePair
¶
Bases: Task
Generate a positive and (optionally) a negative sentence given an anchor sentence.
GenerateSentencePair
is a pre-defined task that given an anchor sentence generates
a positive sentence related to the anchor and optionally a negative sentence unrelated
to the anchor or similar to it. Optionally, you can give a context to guide the LLM
towards more specific behavior. This task is useful for generating datasets for
training embedding models.
Attributes:
Name | Type | Description |
---|---|---|
triplet |
bool
|
a flag to indicate if the task should generate a triplet of sentences
(anchor, positive, negative). Defaults to |
action |
GenerationAction
|
the action to perform to generate the positive sentence. |
context |
str
|
the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default. |
hard_negative |
bool
|
A flag to indicate if the negative should be a hard-negative or not. Hard negatives make it hard for the model to distinguish against the positive, with a higher degree of semantic similarity. |
Input columns
- anchor (
str
): The anchor sentence to generate the positive and negative sentences.
Output columns
- positive (
str
): The positive sentence related to the anchor
. - negative (
str
): The negative sentence unrelated to the anchor
if triplet=True
, or more similar to the positive to make it more challenging for a model to distinguish in case hard_negative=True
. - model_name (
str
): The name of the model that was used to generate the sentences.
Categories
- embedding
Examples:
Paraphrasing:
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="paraphrase",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}]))
Generating semantically similar sentences:
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="semantically-similar",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}]))
Generating queries:
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}]))
Generating answers:
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="answer",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}]))
Generating queries with context (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
context="Argilla is an open-source data curation platform for LLMs.",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}]))
Generating Hard-negatives (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
context="Argilla is an open-source data curation platform for LLMs.",
hard_negative=True,
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}]))
Generating structured data with default schema (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
context="Argilla is an open-source data curation platform for LLMs.",
hard_negative=True,
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
input_batch_size=10,
use_default_structured_output=True
)
generate_sentence_pair.load()
result = next(generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}]))
Source code in src/distilabel/steps/tasks/sentence_transformers.py
inputs
property
¶
The input for the task is the anchor
sentence.
outputs
property
¶
The outputs for the task are the positive
and negative
sentences, as well
as the model_name
used to generate the sentences.
load()
¶
Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/sentence_transformers.py
format_input(input)
¶
The inputs are formatted as a ChatType
, with a system prompt describing the
task of generating positive and negative sentences for the anchor sentence. The
anchor is provided as the first user interaction in the conversation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | Dict[str, Any] | The input containing the | required |
Returns:
Type | Description |
---|---|
ChatType | A list of dictionaries containing the system and user interactions. |
Source code in src/distilabel/steps/tasks/sentence_transformers.py
format_output(output, input=None)
¶
Formats the output of the LLM, to extract the positive
and negative
sentences
generated. If the output is None
or the regex doesn't match, then the outputs
will be set to None
as well.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | The output of the LLM. | required |
input | Optional[Dict[str, Any]] | The input used to generate the output. | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | The formatted output containing the |
Source code in src/distilabel/steps/tasks/sentence_transformers.py
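The fallback described above (outputs set to None when the response is missing or the regex does not match) can be sketched as follows. This is a minimal illustration under stated assumptions: the `## Positive` / `## Negative` headings, the `PAIR_REGEX` pattern, and the `extract_pair` helper are hypothetical and may not match the prompt format or regex the task actually uses.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: assumes the model answers under "## Positive" and
# "## Negative" headings. The real prompt format may differ.
PAIR_REGEX = re.compile(
    r"## Positive\s+(?P<positive>.+?)(?:\s+## Negative\s+(?P<negative>.+?))?\s*$",
    re.DOTALL,
)


def extract_pair(output: Optional[str], triplet: bool) -> Dict[str, Optional[str]]:
    """Extract the sentences, returning None values when nothing can be parsed."""
    keys = ["positive", "negative"] if triplet else ["positive"]
    empty: Dict[str, Optional[str]] = {key: None for key in keys}
    if output is None:
        return empty
    match = PAIR_REGEX.search(output)
    if match is None:
        return empty
    return {key: match.group(key) for key in keys}


raw = "## Positive\n\nA cat sleeps.\n\n## Negative\n\nStocks fell."
print(extract_pair(raw, triplet=True))
```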
get_structured_output()
¶
Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
Returns:
Type | Description |
---|---|
Dict[str, Any] | JSON Schema of the response to enforce. |
Source code in src/distilabel/steps/tasks/sentence_transformers.py
_format_structured_output(output)
¶
Parses the structured response, which should correspond to a dictionary
with either positive
, or positive
and negative
keys.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | str | The output from the | required |
Returns:
Type | Description |
---|---|
Dict[str, str] | Formatted output. |
Source code in src/distilabel/steps/tasks/sentence_transformers.py
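A structured response of the kind described above is a JSON string, so parsing it amounts to decoding it and picking the expected keys. The sketch below shows one way to do that. The helper name `parse_structured_pair` and the choice to fall back to None values on malformed JSON are assumptions, not confirmed library behaviour.

```python
import json
from typing import Dict, Optional


def parse_structured_pair(output: str, triplet: bool) -> Dict[str, Optional[str]]:
    """Decode a JSON response with `positive` (and optionally `negative`) keys."""
    keys = ["positive", "negative"] if triplet else ["positive"]
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        # Assumption: malformed JSON yields empty values rather than an error.
        return {key: None for key in keys}
    if not isinstance(data, dict):
        return {key: None for key in keys}
    return {key: data.get(key) for key in keys}


print(parse_structured_pair('{"positive": "A query", "negative": "Noise"}', triplet=True))
```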
StructuredGeneration
¶
Bases: Task
Generate structured content for a given instruction
using an LLM
.
StructuredGeneration
is a pre-defined task that defines the instruction
and the structured_output
as the inputs, and generation
as the output. This task is used to generate structured content based on
the input instruction and following the schema provided within the structured_output
column for each
instruction
. The model_name
is also returned as part of the output in order to enhance it.
Attributes:
Name | Type | Description |
---|---|---|
use_system_prompt | bool | Whether to use the system prompt in the generation. Defaults to |
Input columns
- instruction (str): The instruction to generate structured content from.
- structured_output (Dict[str, Any]): The structured_output to generate structured content from. It should be a Python dictionary with the keys format and schema, where format should be one of json or regex, and the schema should be either the JSON schema or the regex pattern, respectively.
Output columns
- generation (str): The generated text matching the provided schema, if possible.
- model_name (str): The name of the model used to generate the text.
Categories
- outlines
- structured-generation
Examples:
Generate structured output from a JSON schema:
from distilabel.steps.tasks import StructuredGeneration
from distilabel.models import InferenceEndpointsLLM
structured_gen = StructuredGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
)
structured_gen.load()
result = next(
structured_gen.process(
[
{
"instruction": "Create an RPG character",
"structured_output": {
"format": "json",
"schema": {
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"description": {
"title": "Description",
"type": "string"
},
"role": {
"title": "Role",
"type": "string"
},
"weapon": {
"title": "Weapon",
"type": "string"
}
},
"required": [
"name",
"description",
"role",
"weapon"
],
"title": "Character",
"type": "object"
}
},
}
]
)
)
Generate structured output from a regex pattern (only works with LLMs that support regex, i.e. providers using outlines):
from distilabel.steps.tasks import StructuredGeneration
from distilabel.models import InferenceEndpointsLLM
structured_gen = StructuredGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
)
structured_gen.load()
result = next(
structured_gen.process(
[
{
"instruction": "What's the weather like today in Seattle in Celsius degrees?",
"structured_output": {
"format": "regex",
"schema": r"(\d{1,2})°C"
},
}
]
)
)
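Because the generation is returned as plain text, it can be useful to confirm that it really matches the pattern before using it. The sketch below reuses the regex from the example above. The helper `parse_celsius` is a hypothetical name and is not part of distilabel.

```python
import re
from typing import Optional

# Same pattern as the `schema` passed in the regex example.
PATTERN = r"(\d{1,2})°C"


def parse_celsius(generation: str) -> Optional[int]:
    """Return the temperature as an int, or None if the text does not match."""
    match = re.fullmatch(PATTERN, generation.strip())
    return int(match.group(1)) if match else None


print(parse_celsius("18°C"))
print(parse_celsius("around 18 degrees"))
```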
Source code in src/distilabel/steps/tasks/structured_generation.py
inputs
property
¶
The inputs for the task are the instruction
and the structured_output
.
Optionally, if the use_system_prompt
flag is set to True, then the
system_prompt
will be used too.
outputs
property
¶
The output for the task is the generation
and the model_name
.
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/structured_generation.py
format_output(output, input)
¶
The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
. Note that even
if the structured_output
is defined to produce a JSON schema, this method will return the raw
output i.e. a string without any parsing.
Source code in src/distilabel/steps/tasks/structured_generation.py
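Since the raw string is returned without parsing, any JSON decoding has to happen downstream. The following sketch shows one way a consumer might decode the generation and check the keys listed under `required` in the schema. The helper `parse_generation` is hypothetical and only performs this shallow check, not full JSON Schema validation.

```python
import json
from typing import Any, Dict, List, Optional


def parse_generation(generation: Optional[str], required: List[str]) -> Optional[Dict[str, Any]]:
    """Decode a JSON generation and verify that the required keys are present."""
    if generation is None:
        return None
    try:
        data = json.loads(generation)
    except json.JSONDecodeError:
        return None
    # Shallow check only: the value types are not validated here.
    if not isinstance(data, dict) or any(key not in data for key in required):
        return None
    return data


raw = '{"name": "Aria", "description": "Elf", "role": "Mage", "weapon": "Staff"}'
print(parse_generation(raw, ["name", "description", "role", "weapon"]))
```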
TextClassification
¶
Bases: Task
Classifies text into one or more categories or labels.
This task can be used for text classification problems, where the goal is to assign one or multiple labels to a given text. By default, it uses structured generation as described in the reference paper, which can help to generate more concise labels. See section 4.1 in the reference.
Input columns
- text (str): The reference text we want to obtain labels for.
Output columns
- labels (Union[str, List[str]]): The label or list of labels for the text.
- model_name (str): The name of the model used to generate the label/s.
Categories
- text-classification
Attributes:
Name | Type | Description |
---|---|---|
system_prompt | Optional[str] | A prompt to display to the user before the task starts. Contains a default message to make the model behave like a classifier specialist. |
n | PositiveInt | Number of labels to generate. If only 1 is required, the task corresponds to a single-label classification problem; if >1, it will return the "n" labels most representative of the text. Defaults to 1. |
context | Optional[str] | Context to use when generating the labels. By default contains a generic message, but can be used to customize the context for the task. |
examples | Optional[List[str]] | List of examples to help the model understand the task (few-shot). |
available_labels | Optional[Union[List[str], Dict[str, str]]] | List of available labels to choose from when classifying the text, or a dictionary with the labels and their descriptions. |
default_label | Optional[Union[str, List[str]]] | Default label to use when the text is ambiguous or lacks sufficient information for classification. Can be a list in case of multiple labels (n>1). |
Examples:
Assigning a sentiment to a text:
from distilabel.steps.tasks import TextClassification
from distilabel.models import InferenceEndpointsLLM
llm = InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
)
text_classification = TextClassification(
llm=llm,
context="You are an AI system specialized in assigning sentiment to movies.",
available_labels=["positive", "negative"],
)
text_classification.load()
result = next(
text_classification.process(
[{"text": "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."}]
)
)
# result
# [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',
# 'labels': 'positive',
# 'distilabel_metadata': {'raw_output_text_classification_0': '{\n "labels": "positive"\n}',
# 'raw_input_text_classification_0': [{'role': 'system',
# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
# {'role': 'user',
# 'content': '# Instruction\nPlease classify the user query by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide the label that best describes the text.\nYou are an AI system specialized in assigning sentiment to movie the user queries.\n## Labeling the user input\nUse the available labels to classify the user query. Analyze the context of each label specifically:\navailable_labels = [\n "positive", # The text shows positive sentiment\n "negative", # The text shows negative sentiment\n]\n\n\n## User Query\n```\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n "labels": "label"\n}\n```'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Assigning predefined labels with specified descriptions:
from distilabel.steps.tasks import TextClassification
text_classification = TextClassification(
llm=llm,
n=1,
context="Determine the intent of the text.",
available_labels={
"complaint": "A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.",
"inquiry": "A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.",
"feedback": "A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.",
"praise": "A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending."
},
query_title="Customer Query",
)
text_classification.load()
result = next(
text_classification.process(
[{"text": "Can you tell me more about your return policy?"}]
)
)
# result
# [{'text': 'Can you tell me more about your return policy?',
# 'labels': 'inquiry',
# 'distilabel_metadata': {'raw_output_text_classification_0': '{\n "labels": "inquiry"\n}',
# 'raw_input_text_classification_0': [{'role': 'system',
# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
# {'role': 'user',
# 'content': '# Instruction\nPlease classify the customer query by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide the label that best describes the text.\nDetermine the intent of the text.\n## Labeling the user input\nUse the available labels to classify the user query. Analyze the context of each label specifically:\navailable_labels = [\n "complaint", # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\n "inquiry", # A question or request for information about a product, service, or situation. It\'s a neutral or curious expression seeking clarification or details.\n "feedback", # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\n "praise", # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\n]\n\n\n## Customer Query\n```\nCan you tell me more about your return policy?\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n "labels": "label"\n}\n```'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Free multi-label classification without predefined labels:
from distilabel.steps.tasks import TextClassification
text_classification = TextClassification(
llm=llm,
n=3,
context=(
"Describe the main themes, topics, or categories that could describe the "
"following type of persona."
),
query_title="Example of Persona",
)
text_classification.load()
result = next(
text_classification.process(
[{"text": "A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States."}]
)
)
# result
# [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',
# 'labels': ['Historical Researcher',
# 'Cultural Specialist',
# 'Ethnic Studies Expert'],
# 'distilabel_metadata': {'raw_output_text_classification_0': '{\n "labels": ["Historical Researcher", "Cultural Specialist", "Ethnic Studies Expert"]\n}',
# 'raw_input_text_classification_0': [{'role': 'system',
# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
# {'role': 'user',
# 'content': '# Instruction\nPlease classify the example of persona by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide a list of 3 labels that best describe the text.\nDescribe the main themes, topics, or categories that could describe the following type of persona.\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\n\n\n## Example of Persona\n```\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n "labels": ["label_0", "label_1", "label_2"]\n}\n```'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Source code in src/distilabel/steps/tasks/text_classification.py
inputs
property
¶
The input for the task is the text
.
outputs
property
¶
The outputs for the task are the labels
and the model_name
.
_get_available_labels_message()
¶
Prepares the message to display depending on the available labels (if any), and whether the labels have a specific context.
Source code in src/distilabel/steps/tasks/text_classification.py
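To make the three cases concrete (no labels, a plain list, or a dictionary with descriptions), here is a simplified sketch that builds a message resembling the prompts shown in the examples. The function name `available_labels_message` and the exact wording are assumptions, and the library's real template may differ.

```python
from typing import Dict, List, Optional, Union

Labels = Optional[Union[List[str], Dict[str, str]]]


def available_labels_message(available_labels: Labels) -> str:
    """Build the part of the prompt that describes which labels may be used."""
    if available_labels is None:
        # Free labelling: only give general guidance.
        return (
            "Use clear, widely understood terms for labels. "
            "Avoid overly specific or obscure labels unless the text demands it."
        )
    if isinstance(available_labels, dict):
        # Each description becomes a trailing comment next to its label.
        lines = [f'    "{label}",  # {desc}' for label, desc in available_labels.items()]
    else:
        lines = [f'    "{label}",' for label in available_labels]
    body = "\n".join(lines)
    return (
        "Use the available labels to classify the user query:\n"
        f"available_labels = [\n{body}\n]"
    )


print(available_labels_message({"inquiry": "A question or request for information."}))
```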
_get_examples_message()
¶
Prepares the message to display depending on the examples provided.
Source code in src/distilabel/steps/tasks/text_classification.py
format_input(input)
¶
The input is formatted as a ChatType
assuming that the instruction
is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/text_classification.py
format_output(output, input=None)
¶
The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
Source code in src/distilabel/steps/tasks/text_classification.py
get_structured_output()
¶
Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
Returns:
Type | Description |
---|---|
Dict[str, Any] | JSON Schema of the response to enforce. |
Source code in src/distilabel/steps/tasks/text_classification.py
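As an illustration of what such a schema could look like, the sketch below builds a JSON Schema whose `labels` property is a string when `n` is 1 and an array of `n` strings otherwise. The function `labels_schema` and the use of `enum`, `minItems` and `maxItems` are assumptions made for this example, and the schema the task actually generates may be structured differently.

```python
from typing import Any, Dict, List, Optional


def labels_schema(n: int = 1, available_labels: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build a JSON Schema for a response of the form {"labels": ...}."""
    label: Dict[str, Any] = {"type": "string"}
    if available_labels:
        # Restrict the answer to the predefined labels.
        label["enum"] = list(available_labels)
    if n == 1:
        labels: Dict[str, Any] = label
    else:
        labels = {"type": "array", "items": label, "minItems": n, "maxItems": n}
    return {
        "type": "object",
        "properties": {"labels": labels},
        "required": ["labels"],
    }


print(labels_schema(n=1, available_labels=["positive", "negative"]))
```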
_format_structured_output(output)
¶
Parses the structured response, which should correspond to a dictionary
with the labels
, and either a string or a list of strings with the labels.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | str | The output from the | required |
Returns:
Type | Description |
---|---|
Dict[str, Union[str, List[str]]] | Formatted output. |
Source code in src/distilabel/steps/tasks/text_classification.py
ChatGeneration
¶
Bases: Task
Generates text based on a conversation.
ChatGeneration
is a pre-defined task that defines the messages
as the input
and generation
as the output. This task is used to generate text based on a conversation.
The model_name
is also returned as part of the output in order to enhance it.
Input columns
- messages (List[Dict[Literal["role", "content"], str]]): The messages to generate the follow-up completion from.
Output columns
- generation (str): The generated text from the assistant.
- model_name (str): The model name used to generate the text.
Categories
- chat-generation
Examples:
Generate text from a conversation in OpenAI chat format:
from distilabel.steps.tasks import ChatGeneration
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
chat = ChatGeneration(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
)
)
chat.load()
result = next(
chat.process(
[
{
"messages": [
{"role": "user", "content": "How much is 2+2?"},
]
}
]
)
)
# result
# [
# {
# 'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],
# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
# 'generation': '4',
# }
# ]
Source code in src/distilabel/steps/tasks/text_generation.py
inputs
property
¶
The input for the task is the messages
.
outputs
property
¶
The output for the task is the generation
and the model_name
.
format_input(input)
¶
The input is formatted as a ChatType
assuming that the messages provided
are already formatted that way i.e. following the OpenAI chat format.
Source code in src/distilabel/steps/tasks/text_generation.py
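Since the messages are passed through as they are, it may help to check them before running the task. The sketch below validates a conversation in the OpenAI chat format. The `validate_messages` helper, the set of accepted roles, and the rule that the last message should not come from the assistant are assumptions for illustration, not documented behaviour.

```python
from typing import Dict, List

ChatType = List[Dict[str, str]]  # local alias used only in this sketch
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(messages: ChatType) -> ChatType:
    """Check that each message has a known role and a content field."""
    if not messages:
        raise ValueError("`messages` must contain at least one message.")
    for message in messages:
        if set(message) != {"role", "content"}:
            raise ValueError(f"Expected keys 'role' and 'content', got {sorted(message)}.")
        if message["role"] not in VALID_ROLES:
            raise ValueError(f"Unknown role: {message['role']!r}.")
    # Assumption: there must be something left for the assistant to answer.
    if messages[-1]["role"] == "assistant":
        raise ValueError("The last message should not come from the assistant.")
    return messages


print(validate_messages([{"role": "user", "content": "How much is 2+2?"}]))
```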
format_output(output, input=None)
¶
The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
Source code in src/distilabel/steps/tasks/text_generation.py
TextGeneration
¶
Bases: Task
Text generation with an LLM
given a prompt.
TextGeneration
is a pre-defined task that allows passing a custom prompt using the
Jinja2 syntax. By default, an instruction
is expected in the inputs, but using the
template
and columns
attributes one can define a custom prompt and columns expected
from the text. This task should be good enough for tasks that don't need post-processing
of the responses generated by the LLM.
Attributes:
Name | Type | Description |
---|---|---|
system_prompt | Union[str, None] | The system prompt to use in the generation. If not provided, then it will check if the input row has a column named |
template | str | The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template. |
columns | Union[str, List[str]] | A string with the column, or a list with columns expected in the template. Take a look at the examples for more information. Defaults to |
use_system_prompt | bool | DEPRECATED. To be removed in 1.5.0. Whether to use the system prompt in the generation. Defaults to |
Input columns
- dynamic (determined by columns attribute): By default will be set to instruction. The columns can point both to a str or a List[str] to be used in the template.
Output columns
- generation (str): The generated text.
- model_name (str): The name of the model used to generate the text.
Categories
- text-generation
References
Examples:
Generate text from an instruction:
from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
text_gen = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
)
)
text_gen.load()
result = next(
text_gen.process(
[{"instruction": "your instruction"}]
)
)
# result
# [
# {
# 'instruction': 'your instruction',
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
# 'generation': 'generation',
# }
# ]
Use a custom template to generate text:
from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM
CUSTOM_TEMPLATE = '''Document:
{{ document }}
Question: {{ question }}
Please provide a clear and concise answer to the question based on the information in the document and your general knowledge:
'''.rstrip()
text_gen = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
system_prompt="You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.",
template=CUSTOM_TEMPLATE,
columns=["document", "question"],
)
text_gen.load()
result = next(
text_gen.process(
[
{
"document": "The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.",
"question": "What is the main threat to the Great Barrier Reef mentioned in the document?"
}
]
)
)
# result
# [
# {
# 'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',
# 'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
# 'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',
# }
# ]
Few shot learning with different system prompts:
from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM
CUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:
{% for example in examples %}
Example {{ loop.index }}:
Instruction: {{ example }}
{% endfor %}
Now, generate a new instruction in a similar style:
'''.rstrip()
text_gen = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
template=CUSTOM_TEMPLATE,
columns="examples",
)
text_gen.load()
result = next(
text_gen.process(
[
{
"examples": ["This is an example", "Another relevant example"],
"system_prompt": "You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations."
}
]
)
)
# result
# [
# {
# 'examples': ['This is an example', 'Another relevant example'],
# 'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
# 'generation': 'Disable the firewall on the router',
# }
# ]
Source code in src/distilabel/steps/tasks/text_generation.py
inputs
property
¶
The input for the task is the `instruction` by default, or the `columns` given as input.
outputs
property
¶
The output for the task is the `generation` and the `model_name`.
_prepare_message_content(input)
¶
Prepares the content for the template and returns the formatted messages.
Source code in src/distilabel/steps/tasks/text_generation.py
format_input(input)
¶
The input is formatted as a `ChatType`, assuming that the `instruction` is the first interaction from the user within a conversation.
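As a rough illustration of what such a chat looks like, the sketch below builds the message list from a single row. The `build_chat` helper and the `ChatMessages` alias are hypothetical names, and the rule that a per-row `system_prompt` column takes precedence over a task-level one is an assumption suggested by the few-shot example above, not a statement about distilabel's internals.

```python
from typing import Dict, List, Optional

# Hypothetical alias: a chat is a list of {"role": ..., "content": ...} dicts.
ChatMessages = List[Dict[str, str]]


def build_chat(
    row: Dict[str, str],
    task_system_prompt: Optional[str] = None,
    column: str = "instruction",
) -> ChatMessages:
    """Build a chat with an optional system message followed by one user message."""
    messages: ChatMessages = [{"role": "user", "content": row[column]}]
    # Assumption: a per-row "system_prompt" column wins over the task-level one.
    system_prompt = row.get("system_prompt") or task_system_prompt
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})
    return messages


chat = build_chat({"instruction": "How much is 2+2?"}, task_system_prompt="Be brief.")
```

Here `chat` holds a system message followed by the user message; without any system prompt, only the user message is produced.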
Source code in src/distilabel/steps/tasks/text_generation.py
format_output(output, input=None)
¶
The output is formatted as a dictionary with the `generation`. The `model_name` will be automatically included within the `process` method of `Task`.
Source code in src/distilabel/steps/tasks/text_generation.py
TextGenerationWithImage
¶
Bases: TextGeneration
Text generation with images with an LLM
given a prompt.
`TextGenerationWithImage` is a pre-defined task that allows passing a custom prompt using the
Jinja2 syntax. By default, an `instruction` is expected in the inputs, but using the
`template` and `columns` attributes one can define a custom prompt and the columns expected
from the text. Additionally, an `image` column is expected, containing a URL, a
base64-encoded image, or a PIL image. This task inherits from `TextGeneration`,
so all the prompt-related functionality available in that task is available
here too.
Attributes:
system_prompt: The system prompt to use in the generation.
If not provided, no system prompt will be used. Defaults to `None`.
template: The template to use for the generation. It must follow the Jinja2 template
syntax. If not provided, it will assume the text passed is an instruction and
construct the appropriate template.
columns: A string with the column, or a list with columns expected in the template.
Take a look at the examples for more information. Defaults to `instruction`.
image_type: The type of the image provided, this will be used to preprocess if necessary.
Must be one of "url", "base64" or "PIL".
Input columns:
- dynamic (determined by `columns` attribute): By default will be set to `instruction`.
The columns can point both to a `str` or a `list[str]` to be used in the template.
- image: The column containing the image URL, base64 encoded image or PIL image.
Output columns:
- generation (`str`): The generated text.
- model_name (`str`): The name of the model used to generate the text.
Categories:
- text-generation
References:
- [Jinja2 Template Designer Documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/)
- [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- [OpenAI Vision](https://platform.openai.com/docs/guides/vision)
Examples:
Answer questions from an image:
```python
from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM
vision = TextGenerationWithImage(
name="vision_gen",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
),
image_type="url"
)
vision.load()
result = next(
vision.process(
[
{
"instruction": "What’s in this image?",
"image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
]
)
)
# result
# [
# {
# "instruction": "What’s in this image?",
# "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
# "generation": "Based on the visual cues in the image...",
# "model_name": "meta-llama/Llama-3.2-11B-Vision-Instruct"
# ... # distilabel_metadata would be here
# }
# ]
# result[0]["generation"]
# "Based on the visual cues in the image, here are some possible story points:
# - The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.
# Analysis and Ideas: * The abundance of green grass and trees suggests a healthy ecosystem or habitat. * The presence of wildlife, such as birds or deer, is possible based on the surroundings. * A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.
# Additional Questions to Ask: * Why is a footbridge present in this area? * What kind of wildlife inhabits this region"
```
Answer questions from an image stored as base64:
```python
# For this example we will assume that we have the string representation of the image
# stored, but will just take the image and transform it to base64 to illustrate the example.
import requests
import base64
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
img = requests.get(image_url).content
base64_image = base64.b64encode(img).decode("utf-8")
from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM
vision = TextGenerationWithImage(
name="vision_gen",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
),
image_type="base64"
)
vision.load()
result = next(
vision.process(
[
{
"instruction": "What’s in this image?",
"image": base64_image
}
]
)
)
```
Source code in src/distilabel/steps/tasks/text_generation_with_image.py
_transform_image(image)
¶
Transforms the image based on the `image_type` attribute.
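A minimal sketch of such a transformation is shown below, covering the three documented `image_type` values. The `to_image_url` name, the choice of a JPEG data URL, and the duck-typed handling of the PIL case are assumptions made for illustration rather than the library's actual code.

```python
import base64
import io
from typing import Any


def to_image_url(image: Any, image_type: str) -> str:
    """Return a URL-like string a vision model can consume (hypothetical helper)."""
    if image_type == "url":
        # Already a URL: pass it through untouched.
        return image
    if image_type == "base64":
        # Assumption: the payload is JPEG data, wrapped here as a data URL.
        return f"data:image/jpeg;base64,{image}"
    if image_type == "PIL":
        # Any object exposing PIL's `save(fp, format=...)` works here.
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
        return f"data:image/jpeg;base64,{encoded}"
    raise ValueError(f"Unsupported image_type: {image_type!r}")
```

Raising on unknown values keeps a mistyped `image_type` from silently sending an unusable payload to the model.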
Source code in src/distilabel/steps/tasks/text_generation_with_image.py
_prepare_message_content(input)
¶
Prepares the content for the template and returns the formatted messages.
Source code in src/distilabel/steps/tasks/text_generation_with_image.py
format_input(input)
¶
The input is formatted as a `ChatType`, assuming that the `instruction` is the first interaction from the user within a conversation.
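For a vision model, the user turn usually carries both the text and the image. The sketch below follows the content-parts layout described in the OpenAI Vision guide listed in the references; the `build_vision_message` helper is a hypothetical name, and whether the task emits exactly this structure is an assumption.

```python
from typing import Any, Dict, List


def build_vision_message(instruction: str, image_url: str) -> List[Dict[str, Any]]:
    """Build a single user message mixing a text part and an image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                # `image_url` may be a regular URL or a base64 data URL.
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


messages = build_vision_message("What's in this image?", "https://example.com/boardwalk.jpg")
```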
Source code in src/distilabel/steps/tasks/text_generation_with_image.py
UltraFeedback
¶
Bases: Task
Rank generations focusing on different aspects using an `LLM`.
UltraFeedback: Boosting Language Models with High-quality Feedback.
Attributes:
Name | Type | Description |
---|---|---|
aspect | Literal['helpfulness', 'honesty', 'instruction-following', 'truthfulness', 'overall-rating'] | The aspect on which to rate the generations. |
Input columns
- instruction (`str`): The reference instruction to evaluate the text outputs.
- generations (`List[str]`): The text outputs to evaluate for the given instruction.
Output columns
- ratings (`List[float]`): The ratings for each of the provided text outputs.
- rationales (`List[str]`): The rationales for each of the provided text outputs.
- model_name (`str`): The name of the model used to generate the ratings and rationales.
Categories
- preference
References
Examples:
Rate generations from different LLMs based on the selected aspect:
from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
use_default_structured_output=False
)
ultrafeedback.load()
result = next(
ultrafeedback.process(
[
{
"instruction": "How much is 2+2?",
"generations": ["4", "and a car"],
}
]
)
)
# result
# [
# {
# 'instruction': 'How much is 2+2?',
# 'generations': ['4', 'and a car'],
# 'ratings': [1, 2],
# 'rationales': ['explanation for 4', 'explanation for and a car'],
# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
# }
# ]
Rate generations from different LLMs based on the honesty, using the default structured output:
from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
aspect="honesty"
)
ultrafeedback.load()
result = next(
ultrafeedback.process(
[
{
"instruction": "How much is 2+2?",
"generations": ["4", "and a car"],
}
]
)
)
# result
# [{'instruction': 'How much is 2+2?',
# 'generations': ['4', 'and a car'],
# 'ratings': [5, 1],
# 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',
# "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."],
# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{"ratings": [\n 5,\n 1\n] \n\n,"rationales": [\n "The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.",\n "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."\n] }'},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Rate generations from different LLMs based on the helpfulness, using the default structured output:
from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
generation_kwargs={"max_new_tokens": 512},
),
aspect="helpfulness"
)
ultrafeedback.load()
result = next(
ultrafeedback.process(
[
{
"instruction": "How much is 2+2?",
"generations": ["4", "and a car"],
}
]
)
)
# result
# [{'instruction': 'How much is 2+2?',
# 'generations': ['4', 'and a car'],
# 'ratings': [1, 5],
# 'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',
# 'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],
# 'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',
# 'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],
# 'types': [1, 3, 1],
# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \n "ratings": [\n 1,\n 5\n ]\n ,\n "rationales": [\n "Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.",\n "Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question."\n ]\n ,\n "rationales_for_rating": [\n "Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.",\n "Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question."\n ]\n ,\n "types": [\n 1, 3,\n 1\n ]\n }'},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Citations
@misc{cui2024ultrafeedbackboostinglanguagemodels,
title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},
author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},
year={2024},
eprint={2310.01377},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2310.01377},
}
Source code in src/distilabel/steps/tasks/ultrafeedback.py
inputs
property
¶
The input for the task is the `instruction` and the `generations` for it.
outputs
property
¶
The output for the task is the `ratings` and `rationales` for the given generations, plus the `model_name`.
load()
¶
Loads the Jinja2 template for the given `aspect`.
Source code in src/distilabel/steps/tasks/ultrafeedback.py
format_input(input)
¶
The input is formatted as a `ChatType`, assuming that the `instruction` is the first interaction from the user within a conversation.
Source code in src/distilabel/steps/tasks/ultrafeedback.py
format_output(output, input=None)
¶
The output is formatted as a dictionary with the `ratings` and `rationales` for each of the provided `generations` for the given `instruction`. The `model_name` will be automatically included within the `process` method of `Task`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, None] | A string representing the output of the LLM. | required |
input | Union[Dict[str, Any], None] | The input to the task, as required by some tasks to format the output. | None |
Returns:
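Type | Description |
---|---|
Dict[str, Any] | A dictionary containing either the `ratings` and `rationales` (for the `honesty`, `instruction-following`, and `overall-rating` aspects) or the `types`, `ratings`, `rationales`, and `rationales_for_rating` (for the `helpfulness` and `truthfulness` aspects) for each of the provided `generations` for the given `instruction`. |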
Source code in src/distilabel/steps/tasks/ultrafeedback.py
_format_ratings_rationales_output(output, input)
¶
Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`.
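To make this formatting step concrete, the sketch below parses a plain-text LLM answer into the two output lists. The `Rating:` / `Rationale:` layout of the raw text, the `parse_ratings` name, and the padding with `None` for unparsable entries are all assumptions for illustration; the actual prompt format and fallback behavior may differ.

```python
import re
from typing import Dict, List, Optional

# Assumed layout: one "Rating: N" / "Rationale: ..." pair per generation,
# with the pairs separated by a blank line.
PAIR = re.compile(r"Rating:\s*(\d+(?:\.\d+)?)\s*\nRationale:\s*(.+)", re.DOTALL)


def parse_ratings(output: Optional[str], num_generations: int) -> Dict[str, List]:
    """Extract one rating and one rationale per generation from raw text."""
    ratings: List[Optional[float]] = []
    rationales: List[Optional[str]] = []
    blocks = output.split("\n\n") if output else []
    for block in blocks[:num_generations]:
        match = PAIR.search(block)
        if match:
            ratings.append(float(match.group(1)))
            rationales.append(match.group(2).strip())
        else:
            ratings.append(None)
            rationales.append(None)
    # Pad so that every generation has an entry, even if parsing failed.
    missing = num_generations - len(ratings)
    ratings.extend([None] * missing)
    rationales.extend([None] * missing)
    return {"ratings": ratings, "rationales": rationales}


raw = "Rating: 5\nRationale: Correct and confident.\n\nRating: 1\nRationale: Unrelated answer."
parsed = parse_ratings(raw, num_generations=2)
```

Keeping the lists aligned with the generations means a row is never dropped just because one answer could not be parsed.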
Source code in src/distilabel/steps/tasks/ultrafeedback.py
_format_types_ratings_rationales_output(output, input)
¶
Formats the output when the aspect is either `helpfulness` or `truthfulness`.
Source code in src/distilabel/steps/tasks/ultrafeedback.py
get_structured_output()
¶
Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from typing import List, Optional
from pydantic import BaseModel
class SchemaUltraFeedback(BaseModel):
    ratings: List[int]
    rationales: List[str]
class SchemaUltraFeedbackWithType(BaseModel):
    types: List[Optional[int]]
    ratings: List[int]
    rationales: List[str]
    rationales_for_rating: List[str]
Returns:
Type | Description |
---|---|
Dict[str, Any] | JSON Schema of the response to enforce. |
Source code in src/distilabel/steps/tasks/ultrafeedback.py
_format_structured_output(output, input)
¶
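Parses the structured response, which should correspond to a dictionary with either `ratings` and `rationales` keys, or `types`, `ratings`, `rationales`, and `rationales_for_rating` keys, matching the schemas returned by `get_structured_output`.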
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | str | The output from the LLM. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Formatted output. |
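When the LLM is constrained to emit JSON, the parsing step reduces to decoding the string and picking the expected keys. The sketch below assumes the key names from the schemas documented under `get_structured_output`; the `parse_structured` helper, its `with_types` flag, and the choice to return `None` values on malformed output are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def parse_structured(output: Optional[str], with_types: bool = False) -> Dict[str, Any]:
    """Decode a JSON response and keep only the expected keys."""
    keys = ["ratings", "rationales"]
    if with_types:
        keys += ["types", "rationales_for_rating"]
    try:
        data = json.loads(output)
    except (TypeError, json.JSONDecodeError):
        # Missing or malformed output: assumption is to fall back to None values.
        return {key: None for key in keys}
    if not isinstance(data, dict):
        return {key: None for key in keys}
    return {key: data.get(key) for key in keys}


parsed = parse_structured('{"ratings": [5, 1], "rationales": ["ok", "off-topic"]}')
```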
Source code in src/distilabel/steps/tasks/ultrafeedback.py
URIAL
¶
Bases: Task
Generates a response using a non-instruct fine-tuned model.
`URIAL` is a pre-defined task that generates a response using a non-instruct fine-tuned model. This task is used to generate a response based on the conversation provided as input.
Input columns
- instruction (`str`, optional): The instruction to generate a response from.
- conversation (`List[Dict[str, str]]`, optional): The conversation to generate a response from (the last message must be from the user).
Output columns
- generation (`str`): The generated response.
- model_name (`str`): The name of the model used to generate the response.
Categories
- text-generation
Examples:
Generate text from an instruction:
from distilabel.models import vLLM
from distilabel.steps.tasks import URIAL
step = URIAL(
llm=vLLM(
model="meta-llama/Meta-Llama-3.1-8B",
generation_kwargs={"temperature": 0.7},
),
)
step.load()
results = next(
step.process(inputs=[{"instruction": "What's the most common type of cloud?"}])
)
# [
# {
# 'instruction': "What's the most common type of cloud?",
# 'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',
# 'distilabel_metadata': {...},
# 'model_name': 'meta-llama/Meta-Llama-3.1-8B'
# }
# ]
Source code in src/distilabel/steps/tasks/urial.py
load()
¶
Loads the Jinja2 template for the task.
Source code in src/distilabel/steps/tasks/urial.py
task(inputs=None, outputs=None)
¶
Creates a `Task` from a formatting output function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[StepColumns, None] | A list containing the names of the input columns/keys required by the step, or a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not. If not provided, the default is an empty list. | None |
outputs | Union[StepColumns, None] | A list containing the names of the output columns/keys, or a dictionary where the keys are the columns and the values are booleans indicating whether the column will be generated or not. If not provided, the default is an empty list. | None |