KeepColumns¶
Keeps selected columns in the dataset.
KeepColumns is a Step that implements the process method that keeps only the columns
    specified in the columns attribute. Also KeepColumns provides an attribute columns to
    specify the columns to keep which will override the default value for the properties inputs
    and outputs.
Note¶
The order in which the columns are provided is important, as the output will be sorted
using the provided order, which is useful before pushing either a dataset.Dataset via
the PushToHub step or a distilabel.Distiset via the Pipeline.run output variable.
Attributes¶
- columns: List of strings with the names of the columns to keep.
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[dynamic]
        end
        subgraph New columns
            OCOL0[dynamic]
        end
    end
    subgraph KeepColumns
        StepInput[Input Columns: dynamic]
        StepOutput[Output Columns: dynamic]
    end
    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepInput --> StepOutput
Inputs¶
- dynamic (determined by columnsattribute): The columns to keep.
Outputs¶
- dynamic (determined by columnsattribute): The columns that were kept.
Examples¶
Select the columns to keep¶
from distilabel.steps import KeepColumns
keep_columns = KeepColumns(
    columns=["instruction", "generation"],
)
keep_columns.load()
result = next(
    keep_columns.process(
        [{"instruction": "What's the brightest color?", "generation": "white", "model_name": "my_model"}],
    )
)
# >>> result
# [{'instruction': "What's the brightest color?", 'generation': 'white'}]