TruncateTextColumn¶
Truncate a row using a tokenizer or the number of characters.
TruncateTextColumn
is a Step
that truncates a row according to the max length. If
the tokenizer
is provided, then the row will be truncated using the tokenizer,
and the max_length
will be used as the maximum number of tokens, otherwise it will
be used as the maximum number of characters. The TruncateTextColumn
step is useful when one
wants to truncate a row to a certain length, to avoid posterior errors in the model due
to the length.
Attributes¶
-
column: the column to truncate. Defaults to
"text"
. -
max_length: the maximum length to use for truncation. If a
tokenizer
is given, corresponds to the number of tokens, otherwise corresponds to the number of characters. Defaults to8192
. -
tokenizer: the name of the tokenizer to use. If provided, the row will be truncated using the tokenizer. Defaults to
None
.
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[dynamic]
end
subgraph New columns
OCOL0[dynamic]
end
end
subgraph TruncateTextColumn
StepInput[Input Columns: dynamic]
StepOutput[Output Columns: dynamic]
end
ICOL0 --> StepInput
StepOutput --> OCOL0
StepInput --> StepOutput
Inputs¶
- dynamic (determined by
column
attribute): The columns to be truncated, defaults to "text".
Outputs¶
- dynamic (determined by
column
attribute): The truncated column.
Examples¶
Truncating a row to a given number of tokens¶
from distilabel.steps import TruncateTextColumn
trunc = TruncateTextColumn(
tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",
max_length=4,
column="text"
)
trunc.load()
result = next(
trunc.process(
[
{"text": "This is a sample text that is longer than 10 characters"}
]
)
)
# result
# [{'text': 'This is a sample'}]