Steps
This section contains the API reference for the distilabel steps. For an example of how to create and use a step, see the Tutorial - Steps.
StepInput = Annotated[List[Dict[str, Any]], _STEP_INPUT_ANNOTATION]
module-attribute
StepInput is an Annotated alias of the type List[Dict[str, Any]] with extra metadata that allows distilabel to perform validations over the process method defined in each Step.
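How metadata attached through Annotated can be read back for validation is sketched below with the standard library only. The marker value, the alias name BatchInput, and the helper annotated_params are hypothetical; the actual value of _STEP_INPUT_ANNOTATION and the checks distilabel performs are not shown in this reference.

```python
from typing import Annotated, Any, Callable, Dict, List, get_args, get_origin, get_type_hints

# Hypothetical marker, standing in for distilabel's _STEP_INPUT_ANNOTATION.
MARKER = "step_input_marker"

# Same shape as StepInput: a list of rows plus a piece of metadata.
BatchInput = Annotated[List[Dict[str, Any]], MARKER]


def process(inputs: BatchInput) -> None:
    """Example method whose argument carries the marker."""


def other(x: int) -> None:
    """Example function without the marker."""


def annotated_params(func: Callable[..., Any]) -> List[str]:
    """Return the parameter names whose annotation carries MARKER."""
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    names = []
    for name, hint in hints.items():
        if name == "return":
            continue
        # get_args on an Annotated type returns (base_type, *metadata).
        if get_origin(hint) is Annotated and MARKER in get_args(hint)[1:]:
            names.append(name)
    return names
```

A validator built this way could, for instance, reject a process method in which no parameter carries the marker.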
GeneratorStep
Bases: _Step, ABC
A special kind of Step that is able to generate data, i.e. it doesn't receive any input from the previous steps.
Attributes:
Name | Type | Description |
---|---|---|
batch_size | RuntimeParameter[int] | The number of rows that each batch generated by the step will contain. Defaults to 50. |
Runtime parameters
batch_size: The number of rows that each batch generated by the step will contain. Defaults to 50.
Source code in src/distilabel/steps/base.py
process(offset=0)
abstractmethod
Method that defines the generation logic of the step. It should yield the output rows and a boolean indicating if it's the last batch or not.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset to start the generation from. Defaults to 0. | 0 |
Yields:
Type | Description |
---|---|
GeneratorStepOutput | The output rows and a boolean indicating if it's the last batch or not. |
Source code in src/distilabel/steps/base.py
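Independently of the library source, the contract described above (yield a batch of rows together with a last-batch flag, starting from an offset) can be illustrated with a plain generator. This is a minimal sketch that does not use distilabel: the function name generate, the Batch alias, and the assumption that offset is counted in rows are all illustrative.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Stand-in for GeneratorStepOutput: (rows, is_last_batch).
Batch = Tuple[List[Dict[str, Any]], bool]


def generate(
    rows: List[Dict[str, Any]], batch_size: int = 50, offset: int = 0
) -> Iterator[Batch]:
    """Yield (batch, is_last_batch) pairs, skipping the first `offset` rows."""
    remaining = rows[offset:]
    if not remaining:
        # Nothing left to generate: emit a single empty, final batch.
        yield [], True
        return
    for start in range(0, len(remaining), batch_size):
        batch = remaining[start : start + batch_size]
        is_last = start + batch_size >= len(remaining)
        yield batch, is_last


data = [{"instruction": f"task {i}"} for i in range(5)]
batches = list(generate(data, batch_size=2, offset=1))
```

With five rows, an offset of 1 and a batch size of 2, the generator yields two batches and only the second one is flagged as the last.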
process_applying_mappings(offset=0)
Runs the process method of the step applying the outputs_mappings to the output rows. This is the function that should be used to run the generation logic of the step.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset to start the generation from. Defaults to 0. | 0 |
Yields:
Type | Description |
---|---|
GeneratorStepOutput | The output rows and a boolean indicating if it's the last batch or not. |
Source code in src/distilabel/steps/base.py
GlobalStep
Bases: Step, ABC
A special kind of Step whose process method receives all the data processed by its previous steps at once, instead of receiving it in batches. This kind of step is useful when the processing logic requires all the data at once, for example to train a model or to perform a global aggregation.
Source code in src/distilabel/steps/base.py
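Why a global aggregation needs all rows at once can be seen in a small plain-Python sketch, which does not use distilabel. The function normalize_scores and the score column are hypothetical; the point is only that the same logic gives different results when applied per batch.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def normalize_scores(rows: List[Row]) -> List[Row]:
    """Divide every score by the maximum score found in `rows`."""
    max_score = max(row["score"] for row in rows)
    return [{**row, "normalized": row["score"] / max_score} for row in rows]


rows = [{"score": 1}, {"score": 2}, {"score": 4}]

# Global processing: the maximum is computed over the whole dataset.
global_result = normalize_scores(rows)

# Batched processing (batch size 2): each batch only sees its own maximum.
batched_result = normalize_scores(rows[:2]) + normalize_scores(rows[2:])
```

Here the first row is normalized to 0.25 when all rows are available, but to 0.5 when only its own batch is visible.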
Step
Bases: _Step, ABC
Base class for the steps that can be included in a Pipeline.
Attributes:
Name | Type | Description |
---|---|---|
input_batch_size | RuntimeParameter[PositiveInt] | The number of rows that each batch processed by the step will contain. Defaults to 50. |
Runtime parameters
input_batch_size: The number of rows that each batch processed by the step will contain. Defaults to 50.
Source code in src/distilabel/steps/base.py
process(*inputs)
abstractmethod
Method that defines the processing logic of the step. It should yield the output rows.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs | StepInput | An argument used to receive the outputs of the previous steps. The number of arguments depends on the number of previous steps. It doesn't need to be an | () |
Source code in src/distilabel/steps/base.py
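Independently of the library source, the shape of such a method (one positional argument per previous step, output rows yielded back) can be sketched as a plain generator function. This example does not use distilabel; the text and length columns and the choice to yield one output batch per upstream batch are assumptions made for illustration.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


def process(*inputs: List[Row]) -> Iterator[List[Row]]:
    """Yield one output batch per upstream batch, adding a `length` column."""
    for upstream_batch in inputs:
        yield [{**row, "length": len(row["text"])} for row in upstream_batch]


# Two hypothetical previous steps, each contributing one batch of rows.
from_step_a = [{"text": "hello"}]
from_step_b = [{"text": "hi"}, {"text": "hey"}]
outputs = list(process(from_step_a, from_step_b))
```

Called with no arguments, the generator yields nothing, which matches the empty-tuple default listed for *inputs.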
process_applying_mappings(*args)
Runs the process method of the step applying the input_mappings to the input rows and the outputs_mappings to the output rows. This is the function that should be used to run the processing logic of the step.
Yields:
Type | Description |
---|---|
StepOutput | The output rows. |
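The idea of renaming columns on the way in and on the way out can be sketched in plain Python, without distilabel. The sketch assumes that input_mappings maps the column name the step expects to the name found in the incoming rows, and that outputs_mappings maps the name the step produces to the name to emit; both directions, and the helper names rename, run_with_mappings and shout, are assumptions for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]


def rename(row: Row, mapping: Dict[str, str]) -> Row:
    """Return a copy of `row` with keys renamed according to `mapping`."""
    return {mapping.get(key, key): value for key, value in row.items()}


def run_with_mappings(
    process: Callable[[List[Row]], Iterator[List[Row]]],
    batch: List[Row],
    input_mappings: Dict[str, str],
    outputs_mappings: Dict[str, str],
) -> Iterator[List[Row]]:
    """Rename input columns, run `process`, then rename output columns."""
    # Invert the mapping: incoming column name -> name the step expects.
    incoming_to_expected = {
        actual: expected for expected, actual in input_mappings.items()
    }
    renamed_batch = [rename(row, incoming_to_expected) for row in batch]
    for output_rows in process(renamed_batch):
        yield [rename(row, outputs_mappings) for row in output_rows]


def shout(rows: List[Row]) -> Iterator[List[Row]]:
    """Hypothetical step logic: expects `instruction`, produces `generation`."""
    yield [{**row, "generation": row["instruction"].upper()} for row in rows]


result = list(
    run_with_mappings(
        shout,
        [{"prompt": "hello"}],
        input_mappings={"instruction": "prompt"},
        outputs_mappings={"generation": "response"},
    )
)
```

The step logic itself only ever sees instruction and generation, so it can be reused with datasets that name those columns differently.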