General Steps
This section shows some steps that don't belong to the special categories of global or generator steps.
Processing columns of the dataset
The following steps correspond to some common operations that can be helpful during dataset generation.
Note
We will use a sample dataset loaded from a list of dicts. Keep in mind that, because we are working with iterators (note the calls to the next function), you may need to recreate the dataset to execute all the steps at once.
from distilabel.pipeline.local import Pipeline
from distilabel.steps.generators.data import LoadDataFromDicts

# Generator step that serves two sample rows with "instruction" and "completion" columns
load_data = LoadDataFromDicts(
    name="load_data",
    data=[
        {
            "instruction": "What if the Beatles had never formed as a band?",
            "completion": "The Beatles are widely credited with starting the British Invasion, a wave of rock and roll bands from the United Kingdom that became popular in America in the 1960s. If the Beatles had never formed, this musical movement may have never happened, and the world may have been exposed to a different kind of music. So, without the Beatles\u2019 fame and success, other bands wouldn\u2019t have been able to break into the American music scene and create a wider range of sounds. We could have ended up with a less interesting mix of songs playing on the radio."
        },
        {
            "instruction": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).",
            "completion": "The problem is asking us to find the value of the function f(x) = 5x^3 - 2x + 3 at the point x = 2. \n\nStep 1: Substitute x with 2 in the function\nWe replace every x in the function with 2. This gives us:\nf(2) = 5(2)^3 - 2(2) + 3\n\nStep 2: Simplify the expression\nNext, we simplify the expression by performing the operations in order from left to right.\n\nFirst, calculate the cube of 2, which is 8. Substitute this back into the expression:\nf(2) = 5(8) - 4 + 3\n\nThen, multiply 5 by 8 which gives us 40:\nf(2) = 40 - 4 + 3\n\nFinally, subtract 4 from 40 which gives us 36, and then add 3 to that:\nf(2) = 36 + 3\n\nStep 3: Final calculation\nNow, add 36 and 3 together:\nf(2) = 39\n\nSo, the value of the function f(x) = 5x^3 - 2x + 3 at the point x = 2 is 39."
        }
    ],
    pipeline=Pipeline(name="data-pipeline")
)
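The note above mentions iterators. As a plain-Python illustration of why that matters (this is not distilabel code: fake_process is a hypothetical stand-in for a generator step's process method, and the (batch, is_last_batch) shape of what it yields is an assumption based on the [0] indexing used in this section), a generator object can only be consumed once:

```python
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, str]


def fake_process(data: List[Row]) -> Iterator[Tuple[List[Row], bool]]:
    """Hypothetical stand-in: yields a single (batch, is_last_batch) pair."""
    yield data, True


rows = [{"instruction": "a"}, {"instruction": "b"}]

gen = fake_process(rows)
first_batch = next(gen)[0]            # consumes the only batch
exhausted = next(gen, None) is None   # nothing is left on the same generator

# Creating a new generator makes the data available again.
fresh_batch = next(fake_process(rows))[0]
```

Reusing an exhausted generator raises StopIteration when next is called without a default, which is the situation the note warns about.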
Keep Columns
There is a special step to keep only the specified columns after a processing step: KeepColumns. Let's use it to keep only the instruction column from the previous dataset:
from distilabel.pipeline.local import Pipeline
from distilabel.steps.keep import KeepColumns
keep_columns = KeepColumns(
    name="keep-columns",
    columns=["instruction"],
    pipeline=Pipeline(name="keeper-pipeline"),
)
And to see it in action, let's grab the first batch of data:
import json

# The generator step yields a tuple; the batch of rows is its first element
batch = next(load_data.process())[0]
print(json.dumps(next(keep_columns.process(batch)), indent=2))
# [
# {
# "instruction": "What if the Beatles had never formed as a band?"
# },
# {
# "instruction": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2)."
# }
# ]
After this step has processed the batch, the completion column is gone. This step is useful for keeping only the relevant columns, for example after a step that generates intermediate columns.
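To make the behaviour concrete, here is a minimal plain-Python sketch of the column filtering described above. It is not distilabel's implementation; keep_only is a hypothetical helper, and it assumes every row contains all requested columns:

```python
from typing import Any, Dict, List


def keep_only(rows: List[Dict[str, Any]], columns: List[str]) -> List[Dict[str, Any]]:
    """Return new rows that contain only the requested columns."""
    return [{col: row[col] for col in columns} for row in rows]


sample_rows = [
    {"instruction": "q1", "completion": "a1"},
    {"instruction": "q2", "completion": "a2"},
]

kept = keep_only(sample_rows, ["instruction"])
print(kept)
# [{'instruction': 'q1'}, {'instruction': 'q2'}]
```

The input rows are left untouched; only the returned copies drop the completion column.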
Combine Columns
This next step allows us to merge the output from multiple steps into a single row for further processing. Let's take a look at CombineColumns:
from distilabel.pipeline.local import Pipeline
from distilabel.steps.combine import CombineColumns
combine_columns = CombineColumns(
    name="combine_columns",
    columns=["instruction", "completion"],
    pipeline=Pipeline(name="combine-pipeline"),
)
To see the step in action, we pass the rows of the previous batch as separate single-row lists, mimicking what happens in a pipeline that combines the output of two different steps. Each [batch[i]] can be understood as the result of a different step generating data:
batch = next(load_data.process())[0]
combined = next(combine_columns.process([batch[0]], [batch[1]]))
print(json.dumps(combined, indent=2))
# [
# {
# "merged_instruction": [
# "What if the Beatles had never formed as a band?",
# "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2)."
# ],
# "merged_completion": [
# "The Beatles are widely credited with starting the British Invasion, a wave of rock and roll bands from the United Kingdom that became popular in America in the 1960s. If the Beatles had never formed, this musical movement may have never happened, and the world may have been exposed to a different kind of music. So, without the Beatles\u2019 fame and success, other bands wouldn\u2019t have been able to break into the American music scene and create a wider range of sounds. We could have ended up with a less interesting mix of songs playing on the radio.",
# "The problem is asking us to find the value of the function f(x) = 5x^3 - 2x + 3 at the point x = 2. \n\nStep 1: Substitute x with 2 in the function\nWe replace every x in the function with 2. This gives us:\nf(2) = 5(2)^3 - 2(2) + 3\n\nStep 2: Simplify the expression\nNext, we simplify the expression by performing the operations in order from left to right.\n\nFirst, calculate the cube of 2, which is 8. Substitute this back into the expression:\nf(2) = 5(8) - 4 + 3\n\nThen, multiply 5 by 8 which gives us 40:\nf(2) = 40 - 4 + 3\n\nFinally, subtract 4 from 40 which gives us 36, and then add 3 to that:\nf(2) = 36 + 3\n\nStep 3: Final calculation\nNow, add 36 and 3 together:\nf(2) = 39\n\nSo, the value of the function f(x) = 5x^3 - 2x + 3 at the point x = 2 is 39."
# ]
# }
# ]
We now have both instruction and completion from the two different lists merged into single columns: merged_instruction and merged_completion, respectively.
This step is necessary to build more complicated pipelines like UltraFeedback, where we need the merged outputs of multiple LLMs in order to rate them.
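The merging shown above can be approximated in plain Python. This is only a sketch of the observed behaviour, not distilabel's implementation: combine_rows is a hypothetical helper, it assumes all batches have the same number of rows, and it drops any column that is not listed:

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def combine_rows(*batches: List[Row], columns: List[str]) -> List[Row]:
    """Merge the i-th row of every batch into one row of merged_<column> lists."""
    return [
        {f"merged_{col}": [row[col] for row in aligned_rows] for col in columns}
        for aligned_rows in zip(*batches)  # zip pairs rows by position
    ]


step_a = [{"instruction": "q1", "completion": "a1"}]
step_b = [{"instruction": "q2", "completion": "a2"}]

merged = combine_rows(step_a, step_b, columns=["instruction", "completion"])
print(merged)
# [{'merged_instruction': ['q1', 'q2'], 'merged_completion': ['a1', 'a2']}]
```

Each merged list keeps one entry per input batch, in the order the batches were passed.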
Expand Columns
Just as we may need to merge the output from different steps, we may also want to expand the current columns so that they behave as multiple rows. Let's see ExpandColumns work on the output from the previous step:
from distilabel.pipeline.local import Pipeline
from distilabel.steps.expand import ExpandColumns
expand_columns = ExpandColumns(
    name="expand_columns",
    columns=["merged_instruction", "merged_completion"],
    pipeline=Pipeline(name="expand-pipeline"),
)
We can pass the combined variable, which is the output from the previous step, directly to the process method:
print(json.dumps(next(expand_columns.process(combined)), indent=2))
# [
# {
# "merged_instruction": "What if the Beatles had never formed as a band?",
# "merged_completion": "The Beatles are widely credited with starting the British Invasion, a wave of rock and roll bands from the United Kingdom that became popular in America in the 1960s. If the Beatles had never formed, this musical movement may have never happened, and the world may have been exposed to a different kind of music. So, without the Beatles\u2019 fame and success, other bands wouldn\u2019t have been able to break into the American music scene and create a wider range of sounds. We could have ended up with a less interesting mix of songs playing on the radio."
# },
# {
# "merged_instruction": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).",
# "merged_completion": "The problem is asking us to find the value of the function f(x) = 5x^3 - 2x + 3 at the point x = 2. \n\nStep 1: Substitute x with 2 in the function\nWe replace every x in the function with 2. This gives us:\nf(2) = 5(2)^3 - 2(2) + 3\n\nStep 2: Simplify the expression\nNext, we simplify the expression by performing the operations in order from left to right.\n\nFirst, calculate the cube of 2, which is 8. Substitute this back into the expression:\nf(2) = 5(8) - 4 + 3\n\nThen, multiply 5 by 8 which gives us 40:\nf(2) = 40 - 4 + 3\n\nFinally, subtract 4 from 40 which gives us 36, and then add 3 to that:\nf(2) = 36 + 3\n\nStep 3: Final calculation\nNow, add 36 and 3 together:\nf(2) = 39\n\nSo, the value of the function f(x) = 5x^3 - 2x + 3 at the point x = 2 is 39."
# }
# ]
We obtain the columns as a list of rows, which can then be processed by a further step that requires the data in that format.
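As a rough illustration of the expansion above, the following plain-Python sketch turns list-valued columns back into one row per element. It is not distilabel's implementation: expand_rows is a hypothetical helper, and it assumes the listed columns hold lists of equal length:

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def expand_rows(rows: List[Row], columns: List[str]) -> List[Row]:
    """Emit one output row per position in the list-valued columns."""
    expanded: List[Row] = []
    for row in rows:
        # zip walks the listed columns in parallel, one position at a time
        for values in zip(*(row[col] for col in columns)):
            new_row = dict(row)                   # keep any other columns as they are
            new_row.update(zip(columns, values))  # replace each list with a single item
            expanded.append(new_row)
    return expanded


merged_rows = [
    {"merged_instruction": ["q1", "q2"], "merged_completion": ["a1", "a2"]}
]

expanded_rows = expand_rows(merged_rows, ["merged_instruction", "merged_completion"])
print(expanded_rows)
# [{'merged_instruction': 'q1', 'merged_completion': 'a1'}, {'merged_instruction': 'q2', 'merged_completion': 'a2'}]
```

Note that the column names stay the same (merged_instruction, merged_completion), matching the output shown above; only the values change from lists to single items.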