Index
GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]]
module-attribute

GeneratorStepOutput is an alias of the typing Iterator[Tuple[List[Dict[str, Any]], bool]].

StepInput = Annotated[List[Dict[str, Any]], _STEP_INPUT_ANNOTATION]
module-attribute

StepInput is an Annotated alias of the typing List[Dict[str, Any]], with extra metadata that allows distilabel to perform validations over the process method defined in each Step.

StepOutput = Iterator[List[Dict[str, Any]]]
module-attribute

StepOutput is an alias of the typing Iterator[List[Dict[str, Any]]].
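As a quick illustration of how these aliases fit together, here is a minimal sketch (not the distilabel source; the Annotated metadata on StepInput is omitted):

```python
from typing import Any, Dict, Iterator, List, Tuple

# The three aliases, as documented above (StepInput's Annotated metadata omitted).
StepInput = List[Dict[str, Any]]
StepOutput = Iterator[List[Dict[str, Any]]]
GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]]

def process(inputs: StepInput) -> StepOutput:
    # A regular step: receives a batch of rows, yields batches of rows.
    yield [{**row, "seen": True} for row in inputs]

def generate(offset: int = 0) -> GeneratorStepOutput:
    # A generator step: yields (batch, last_batch) tuples instead.
    data = [{"n": n} for n in range(4)]
    yield data[offset:], True

batch = next(process([{"a": 1}]))
rows, last_batch = next(generate(offset=2))
```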
CombineColumns

Bases: Step

Combines columns from a list of StepInput.

CombineColumns is a Step that implements a process method which calls the combine_dicts function to handle and combine a list of StepInput. CombineColumns also provides two attributes, columns and output_columns, to specify the columns to merge and the output columns; these override the default values for the inputs and outputs properties, respectively.

Attributes:

Name | Type | Description
---|---|---
columns | List[str] | List of strings with the names of the columns to merge.
output_columns | Optional[List[str]] | Optional list of strings with the names of the output columns.

Input columns
- dynamic (determined by the columns attribute): The columns to merge.

Output columns
- dynamic (determined by the columns and output_columns attributes): The columns that were merged.

Source code in src/distilabel/steps/combine.py
inputs: List[str]
property

The inputs for the task are the column names in columns.

outputs: List[str]
property

The outputs for the task are the column names in output_columns, or merged_{column} for each column in columns.

process(*inputs)

The process method calls the combine_dicts function to handle and combine a list of StepInput.

Parameters:

Name | Type | Description | Default
---|---|---|---
*inputs | StepInput | A list of StepInput to be combined. | ()

Yields:

Type | Description
---|---
StepOutput | A StepOutput with the combined rows.
Source code in src/distilabel/steps/combine.py
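To make the merging behaviour concrete, here is a hypothetical re-implementation of what combine_dicts does with two upstream batches. The function name and exact semantics are assumptions for illustration, not the library code:

```python
from typing import Any, Dict, List

def combine_rows(
    batches: List[List[Dict[str, Any]]], columns: List[str]
) -> List[Dict[str, Any]]:
    """Zip rows from several upstream steps and merge the listed columns
    into lists under merged_{column}, mirroring the default outputs above."""
    combined = []
    for rows in zip(*batches):
        # Non-merged columns are copied from the first upstream row.
        merged = {k: v for k, v in rows[0].items() if k not in columns}
        for column in columns:
            merged[f"merged_{column}"] = [row[column] for row in rows]
        combined.append(merged)
    return combined

batch_a = [{"id": 1, "generation": "foo"}]
batch_b = [{"id": 1, "generation": "bar"}]
result = combine_rows([batch_a, batch_b], columns=["generation"])
# result[0] keeps "id" and gains "merged_generation" == ["foo", "bar"]
```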
ConversationTemplate

Bases: Step

Generate a conversation template from an instruction and a response.

Input columns
- instruction (str): The instruction to be used in the conversation.
- response (str): The response to be used in the conversation.

Output columns
- conversation (ChatType): The conversation template.

Categories
- format
- chat
- template

Source code in src/distilabel/steps/formatting/conversation.py

inputs: List[str]
property

The instruction and response.

outputs: List[str]
property

The conversation template.

process(inputs)

Generate a conversation template from an instruction and a response.

Parameters:

Name | Type | Description | Default
---|---|---|---
inputs | StepInput | The input data. | required

Yields:

Type | Description
---|---
StepOutput | The input data with the conversation template.

Source code in src/distilabel/steps/formatting/conversation.py
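The transformation itself is simple enough to sketch; the ChatType shape is assumed here to be the usual OpenAI-style list of role/content messages:

```python
from typing import Any, Dict

def to_conversation(row: Dict[str, Any]) -> Dict[str, Any]:
    # Build the conversation column from the instruction/response pair.
    return {
        **row,
        "conversation": [
            {"role": "user", "content": row["instruction"]},
            {"role": "assistant", "content": row["response"]},
        ],
    }

row = to_conversation({"instruction": "Hi", "response": "Hello!"})
```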
DeitaFiltering

Bases: GlobalStep

Filter dataset rows using the DEITA filtering strategy.

Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes:

Name | Type | Description
---|---|---
data_budget | RuntimeParameter[int] | The desired size of the dataset after filtering.
diversity_threshold | RuntimeParameter[float] | If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset. Defaults to 0.9.
normalize_embeddings | RuntimeParameter[bool] | Whether to normalize the embeddings before computing the cosine distance. Defaults to True.

Runtime parameters
- data_budget: The desired size of the dataset after filtering.
- diversity_threshold: If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset.

Input columns
- evol_instruction_score (float): The score of the instruction generated by the ComplexityScorer step.
- evol_response_score (float): The score of the response generated by the QualityScorer step.
- embedding (List[float]): The embedding generated for the conversation of the instruction-response pair using the GenerateEmbeddings step.

Output columns
- deita_score (float): The DEITA score for the instruction-response pair.
- deita_score_computed_with (List[str]): The scores used to compute the DEITA score.
- nearest_neighbor_distance (float): The cosine distance between the embeddings of the instruction-response pair.

Categories
- filtering

Source code in src/distilabel/steps/deita.py
process(inputs)

Filter the dataset based on the DEITA score and the cosine distance between the embeddings.

Parameters:

Name | Type | Description | Default
---|---|---|---
inputs | StepInput | The input data. | required

Returns:

Type | Description
---|---
StepOutput | The filtered dataset.
Source code in src/distilabel/steps/deita.py
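A rough sketch of the selection loop described above: rank rows by DEITA score, then greedily keep diverse ones until the budget is met. The scoring formula (a simple product of the two evol scores) and the column handling are simplifying assumptions, not the paper's or the library's exact implementation:

```python
from typing import Any, Dict, List

def deita_filter(
    rows: List[Dict[str, Any]],
    data_budget: int,
    diversity_threshold: float = 0.9,
) -> List[Dict[str, Any]]:
    # Simplified DEITA score: product of the two evol scores.
    ranked = sorted(
        rows,
        key=lambda r: r["evol_instruction_score"] * r["evol_response_score"],
        reverse=True,
    )
    selected: List[Dict[str, Any]] = []
    for row in ranked:
        if len(selected) >= data_budget:
            break
        # Keep only rows far enough from their nearest neighbor.
        if row["nearest_neighbor_distance"] > diversity_threshold:
            selected.append(row)
    return selected

rows = [
    {"evol_instruction_score": 0.9, "evol_response_score": 0.9, "nearest_neighbor_distance": 0.95},
    {"evol_instruction_score": 0.8, "evol_response_score": 0.8, "nearest_neighbor_distance": 0.2},
    {"evol_instruction_score": 0.5, "evol_response_score": 0.5, "nearest_neighbor_distance": 0.99},
]
kept = deita_filter(rows, data_budget=2)
# the second row is dropped: it is too close to its nearest neighbor
```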
ExpandColumns

Bases: Step

Expand columns that contain lists into multiple rows.

ExpandColumns is a Step that takes a list of columns and expands them into multiple rows. The new rows will have the same data as the original row, except for the expanded column, which will contain a single item from the original list.

Attributes:

Name | Type | Description
---|---|---
columns | Union[Dict[str, str], List[str]] | A dictionary that maps the column to be expanded to the new column name, or a list of columns to be expanded. If a list is provided, the new column name will be the same as the column name.

Input columns
- dynamic (determined by the columns attribute): The columns to be expanded into multiple rows.

Output columns
- dynamic (determined by the columns attribute): The expanded columns.

Source code in src/distilabel/steps/expand.py
inputs: List[str]
property

The columns to be expanded.

outputs: List[str]
property

The expanded columns.

always_dict(value)
classmethod

Ensure that the columns are always a dictionary.

Parameters:

Name | Type | Description | Default
---|---|---|---
value | Union[Dict[str, str], List[str]] | The columns to be expanded. | required

Returns:

Type | Description
---|---
Dict[str, str] | The columns to be expanded as a dictionary.

Source code in src/distilabel/steps/expand.py

process(inputs)

Expand the columns in the input data.

Parameters:

Name | Type | Description | Default
---|---|---|---
inputs | StepInput | The input data. | required

Yields:

Type | Description
---|---
StepOutput | The expanded rows.
Source code in src/distilabel/steps/expand.py
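In plain Python, the expansion amounts to the following sketch, assuming a single column for simplicity (the real step can handle several at once via the columns mapping):

```python
from typing import Any, Dict, List

def expand_rows(
    rows: List[Dict[str, Any]], column: str, new_column: str
) -> List[Dict[str, Any]]:
    expanded = []
    for row in rows:
        for item in row[column]:
            # Copy every other value; the expanded column holds one item.
            new_row = {k: v for k, v in row.items() if k != column}
            new_row[new_column] = item
            expanded.append(new_row)
    return expanded

out = expand_rows([{"id": 1, "texts": ["a", "b"]}], "texts", "texts")
# two rows, one per list item, both keeping id == 1
```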
FormatChatGenerationDPO

Bases: Step

Format the output of a combination of a ChatGeneration task plus a preference task such as UltraFeedback, for Direct Preference Optimization (DPO), following the standard formatting from frameworks such as axolotl or alignment-handbook.

FormatChatGenerationDPO is a Step that formats the output of the combination of a ChatGeneration task with a preference Task, i.e. a task that generates ratings, so that those ratings are used to rank the existing generations and provide the chosen and rejected generations.

Note
The messages column should contain at least one message from the user, the generations column should contain at least two generations, and the ratings column should contain the same number of ratings as generations.

Input columns
- messages (List[Dict[str, str]]): The conversation messages.
- generations (List[str]): The generations produced by the LLM.
- generation_models (List[str], optional): The model names used to generate the generations; only available if the model_name from the ChatGeneration task/s is combined into a single column named this way, otherwise it will be ignored.
- ratings (List[float]): The ratings for each of the generations, produced by a preference task such as UltraFeedback.

Output columns
- prompt (str): The user message used to generate the generations with the LLM.
- prompt_id (str): The SHA256 hash of the prompt.
- chosen (List[Dict[str, str]]): The chosen generation based on the ratings.
- chosen_model (str, optional): The model name used to generate the chosen generation, if the generation_models are available.
- chosen_rating (float): The rating of the chosen generation.
- rejected (List[Dict[str, str]]): The rejected generation based on the ratings.
- rejected_model (str, optional): The model name used to generate the rejected generation, if the generation_models are available.
- rejected_rating (float): The rating of the rejected generation.

Categories
- format
- chat-generation
- preference
- messages
- generations

Source code in src/distilabel/steps/formatting/dpo.py
inputs: List[str]
property

List of inputs required by the Step, which in this case are: messages, generations, and ratings.

optional_inputs: List[str]
property

List of optional inputs, which are not required by the Step but used if available, which in this case is: generation_models.

outputs: List[str]
property

List of outputs generated by the Step, which are: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating. Both chosen_model and rejected_model are optional and only used if generation_models is available.

Reference
- Format inspired by https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k

process(*inputs)

The process method formats the received StepInput or list of StepInput according to the DPO formatting standard.

Parameters:

Name | Type | Description | Default
---|---|---|---
*inputs | StepInput | A list of StepInput to format. | ()

Yields:

Type | Description
---|---
StepOutput | A StepOutput with the formatted batches.
Source code in src/distilabel/steps/formatting/dpo.py
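The core of the DPO formatting can be sketched as follows: pick the best- and worst-rated generations and append each to the conversation. The field handling (e.g. taking the last user message as the prompt) is an assumption for illustration, not the library's exact logic:

```python
import hashlib
from typing import Any, Dict

def format_dpo(row: Dict[str, Any]) -> Dict[str, Any]:
    ranked = sorted(zip(row["ratings"], row["generations"]))
    worst_rating, worst = ranked[0]
    best_rating, best = ranked[-1]
    prompt = row["messages"][-1]["content"]  # assume last message is the user's
    return {
        "prompt": prompt,
        "prompt_id": hashlib.sha256(prompt.encode()).hexdigest(),
        "chosen": row["messages"] + [{"role": "assistant", "content": best}],
        "chosen_rating": best_rating,
        "rejected": row["messages"] + [{"role": "assistant", "content": worst}],
        "rejected_rating": worst_rating,
    }

dpo = format_dpo(
    {
        "messages": [{"role": "user", "content": "Hi"}],
        "generations": ["meh", "great"],
        "ratings": [2.0, 5.0],
    }
)
```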
FormatChatGenerationSFT

Bases: Step

Format the output of a ChatGeneration task for Supervised Fine-Tuning (SFT), following the standard formatting from frameworks such as axolotl or alignment-handbook.

FormatChatGenerationSFT is a Step that formats the output of a ChatGeneration task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl or alignment-handbook. The output of the ChatGeneration task is formatted into a chat-like conversation with the instruction as the user message and the generation as the assistant message. Optionally, if the system_prompt is available, it is included as the first message in the conversation.

Input columns
- system_prompt (str, optional): The system prompt used within the LLM to generate the generation, if available.
- instruction (str): The instruction used to generate the generation with the LLM.
- generation (str): The generation produced by the LLM.

Output columns
- prompt (str): The instruction used to generate the generation with the LLM.
- prompt_id (str): The SHA256 hash of the prompt.
- messages (List[Dict[str, str]]): The chat-like conversation with the instruction as the user message and the generation as the assistant message.

Categories
- format
- chat-generation
- instruction
- generation

Source code in src/distilabel/steps/formatting/sft.py
inputs: List[str]
property

List of inputs required by the Step, which in this case are: instruction and generation.

outputs: List[str]
property

List of outputs generated by the Step, which are: prompt, prompt_id, and messages.

Reference
- Format inspired by https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k

process(*inputs)

The process method formats the received StepInput or list of StepInput according to the SFT formatting standard.

Parameters:

Name | Type | Description | Default
---|---|---|---
*inputs | StepInput | A list of StepInput to format. | ()

Yields:

Type | Description
---|---
StepOutput | A StepOutput with the formatted batches.
Source code in src/distilabel/steps/formatting/sft.py
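A minimal sketch of the SFT formatting, with column names taken from the lists above; the hex SHA256 digest for prompt_id is an assumption for illustration:

```python
import hashlib
from typing import Any, Dict

def format_sft(row: Dict[str, Any]) -> Dict[str, Any]:
    messages = []
    if row.get("system_prompt"):
        # The system prompt, when present, leads the conversation.
        messages.append({"role": "system", "content": row["system_prompt"]})
    messages.append({"role": "user", "content": row["instruction"]})
    messages.append({"role": "assistant", "content": row["generation"]})
    return {
        "prompt": row["instruction"],
        "prompt_id": hashlib.sha256(row["instruction"].encode()).hexdigest(),
        "messages": messages,
    }

sft = format_sft({"instruction": "Say hi", "generation": "Hi!"})
```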
FormatTextGenerationDPO

Bases: Step

Format the output of your LLMs for Direct Preference Optimization (DPO).

FormatTextGenerationDPO is a Step that formats the output of the combination of a TextGeneration task with a preference Task, i.e. a task that generates ratings, so that those ratings are used to rank the existing generations and provide the chosen and rejected generations.

Use this step to transform the output of a combination of a TextGeneration task plus a preference task such as UltraFeedback, following the standard formatting from frameworks such as axolotl or alignment-handbook.

Note
The generations column should contain at least two generations, and the ratings column should contain the same number of ratings as generations.

Input columns
- system_prompt (str, optional): The system prompt used within the LLM to generate the generations, if available.
- instruction (str): The instruction used to generate the generations with the LLM.
- generations (List[str]): The generations produced by the LLM.
- generation_models (List[str], optional): The model names used to generate the generations; only available if the model_name from the TextGeneration task/s is combined into a single column named this way, otherwise it will be ignored.
- ratings (List[float]): The ratings for each of the generations, produced by a preference task such as UltraFeedback.

Output columns
- prompt (str): The instruction used to generate the generations with the LLM.
- prompt_id (str): The SHA256 hash of the prompt.
- chosen (List[Dict[str, str]]): The chosen generation based on the ratings.
- chosen_model (str, optional): The model name used to generate the chosen generation, if the generation_models are available.
- chosen_rating (float): The rating of the chosen generation.
- rejected (List[Dict[str, str]]): The rejected generation based on the ratings.
- rejected_model (str, optional): The model name used to generate the rejected generation, if the generation_models are available.
- rejected_rating (float): The rating of the rejected generation.

Categories
- format
- text-generation
- preference
- instruction
- generations

Source code in src/distilabel/steps/formatting/dpo.py
inputs: List[str]
property

List of inputs required by the Step, which in this case are: instruction, generations, and ratings.

optional_inputs: List[str]
property

List of optional inputs, which are not required by the Step but used if available, which in this case are: system_prompt and generation_models.

outputs: List[str]
property

List of outputs generated by the Step, which are: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating. Both chosen_model and rejected_model are optional and only used if generation_models is available.

Reference
- Format inspired by https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k

process(*inputs)

The process method formats the received StepInput or list of StepInput according to the DPO formatting standard.

Parameters:

Name | Type | Description | Default
---|---|---|---
*inputs | StepInput | A list of StepInput to format. | ()

Yields:

Type | Description
---|---
StepOutput | A StepOutput with the formatted batches.
Source code in src/distilabel/steps/formatting/dpo.py
FormatTextGenerationSFT

Bases: Step

Format the output of a TextGeneration task for Supervised Fine-Tuning (SFT).

FormatTextGenerationSFT is a Step that formats the output of a TextGeneration task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl or alignment-handbook. The output of the TextGeneration task is formatted into a chat-like conversation with the instruction as the user message and the generation as the assistant message. Optionally, if the system_prompt is available, it is included as the first message in the conversation.

Input columns
- system_prompt (str, optional): The system prompt used within the LLM to generate the generation, if available.
- instruction (str): The instruction used to generate the generation with the LLM.
- generation (str): The generation produced by the LLM.

Output columns
- prompt (str): The instruction used to generate the generation with the LLM.
- prompt_id (str): The SHA256 hash of the prompt.
- messages (List[Dict[str, str]]): The chat-like conversation with the instruction as the user message and the generation as the assistant message.

Categories
- format
- text-generation
- instruction
- generation

Source code in src/distilabel/steps/formatting/sft.py
inputs: List[str]
property

List of inputs required by the Step, which in this case are: instruction and generation.

optional_inputs: List[str]
property

List of optional inputs, which are not required by the Step but used if available, which in this case is: system_prompt.

outputs: List[str]
property

List of outputs generated by the Step, which are: prompt, prompt_id, and messages.

Reference
- Format inspired by https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k

process(*inputs)

The process method formats the received StepInput or list of StepInput according to the SFT formatting standard.

Parameters:

Name | Type | Description | Default
---|---|---|---
*inputs | StepInput | A list of StepInput to format. | ()

Yields:

Type | Description
---|---
StepOutput | A StepOutput with the formatted batches.
Source code in src/distilabel/steps/formatting/sft.py
GeneratorStep

Bases: _Step, ABC

A special kind of Step that is able to generate data, i.e. it doesn't receive any input from the previous steps.

Attributes:

Name | Type | Description
---|---|---
batch_size | RuntimeParameter[int] | The number of rows that the batches generated by the step will contain. Defaults to 50.

Runtime parameters
- batch_size: The number of rows that the batches generated by the step will contain. Defaults to 50.

Source code in src/distilabel/steps/base.py
process(offset=0)
abstractmethod

Method that defines the generation logic of the step. It should yield the output rows and a boolean indicating if it's the last batch or not.

Parameters:

Name | Type | Description | Default
---|---|---|---
offset | int | The offset to start the generation from. Defaults to 0. | 0

Yields:

Type | Description
---|---
GeneratorStepOutput | The output rows and a boolean indicating if it's the last batch or not.

Source code in src/distilabel/steps/base.py
process_applying_mappings(offset=0)

Runs the process method of the step applying the outputs_mappings to the output rows. This is the function that should be used to run the generation logic of the step.

Parameters:

Name | Type | Description | Default
---|---|---|---
offset | int | The offset to start the generation from. Defaults to 0. | 0

Yields:

Type | Description
---|---
GeneratorStepOutput | The output rows and a boolean indicating if it's the last batch or not.
Source code in src/distilabel/steps/base.py
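A concrete generator step's process can be sketched like this: yield (batch, last_batch) tuples starting at offset, batch_size rows at a time (the function name here is hypothetical):

```python
from typing import Any, Dict, Iterator, List, Tuple

def generate_batches(
    data: List[Dict[str, Any]], batch_size: int = 50, offset: int = 0
) -> Iterator[Tuple[List[Dict[str, Any]], bool]]:
    for start in range(offset, len(data), batch_size):
        batch = data[start : start + batch_size]
        # The flag marks whether this is the last batch.
        yield batch, start + batch_size >= len(data)

batches = list(generate_batches([{"n": i} for i in range(5)], batch_size=2))
flags = [last for _, last in batches]
# three batches of 2, 2, and 1 rows; only the final one carries last=True
```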
GlobalStep

Bases: Step, ABC

A special kind of Step whose process method receives all the data processed by the previous steps at once, instead of receiving it in batches. These steps are useful when the processing logic requires having all the data at once, for example to train a model, to perform a global aggregation, etc.

Source code in src/distilabel/steps/base.py
KeepColumns

Bases: Step

Keeps selected columns in the dataset.

KeepColumns is a Step that implements a process method which keeps only the columns specified in the columns attribute. KeepColumns also provides the columns attribute to specify the columns to keep, which overrides the default values for the inputs and outputs properties.

Note
The order in which the columns are provided is important, as the output will be sorted using the provided order. This is useful before pushing either a datasets.Dataset via the PushToHub step or a distilabel.Distiset via the Pipeline.run output variable.

Attributes:

Name | Type | Description
---|---|---
columns | List[str] | List of strings with the names of the columns to keep.

Input columns
- dynamic (determined by the columns attribute): The columns to keep.

Output columns
- dynamic (determined by the columns attribute): The columns that were kept.

Source code in src/distilabel/steps/keep.py
inputs: List[str]
property

The inputs for the task are the column names in columns.

outputs: List[str]
property

The outputs for the task are the column names in columns.

process(*inputs)

The process method keeps only the columns specified in the columns attribute.

Parameters:

Name | Type | Description | Default
---|---|---|---
*inputs | StepInput | A list of dictionaries with the input data. | ()

Yields:

Type | Description
---|---
StepOutput | A list of dictionaries with the output data.
Source code in src/distilabel/steps/keep.py
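The projection this step performs is essentially the following; note how the output dict preserves the order given in columns, which is what makes the ordering behaviour described above work:

```python
from typing import Any, Dict, List

def keep_columns(
    rows: List[Dict[str, Any]], columns: List[str]
) -> List[Dict[str, Any]]:
    # Python dicts preserve insertion order, so iterating `columns`
    # fixes the column order of every output row.
    return [{column: row[column] for column in columns} for row in rows]

rows = keep_columns([{"b": 2, "a": 1, "c": 3}], columns=["a", "b"])
# "c" is dropped, and "a" now comes before "b"
```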
LoadDataFromDicts

Bases: GeneratorStep

Loads a dataset from a list of dictionaries.

GeneratorStep that loads a dataset from a list of dictionaries and yields it in batches.

Attributes:

Name | Type | Description
---|---|---
data | List[Dict[str, Any]] | The list of dictionaries to load the data from.

Runtime parameters
- batch_size: The batch size to use when processing the data.

Output columns
- dynamic (based on the keys found in the first dictionary of the list): The columns of the dataset.

Categories
- load

Source code in src/distilabel/steps/generators/data.py

outputs: List[str]
property

Returns a list of strings with the names of the columns that the step will generate.

process(offset=0)

Yields batches from a list of dictionaries.

Parameters:

Name | Type | Description | Default
---|---|---|---
offset | int | The offset to start the generation from. Defaults to 0. | 0

Yields:

Type | Description
---|---
GeneratorStepOutput | A list of Python dictionaries as read from the inputs (propagated in batches), and a flag indicating whether the yielded batch is the last one.

Source code in src/distilabel/steps/generators/data.py
LoadHubDataset

Bases: GeneratorStep

Loads a dataset from the Hugging Face Hub.

GeneratorStep that loads a dataset from the Hugging Face Hub using the datasets library.

Attributes:

Name | Type | Description
---|---|---
repo_id | RuntimeParameter[str] | The Hugging Face Hub repository ID of the dataset to load.
split | RuntimeParameter[str] | The split of the dataset to load.
config | Optional[RuntimeParameter[str]] | The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.

Runtime parameters
- batch_size: The batch size to use when processing the data.
- repo_id: The Hugging Face Hub repository ID of the dataset to load.
- split: The split of the dataset to load. Defaults to 'train'.
- config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
- streaming: Whether to load the dataset in streaming mode or not. Defaults to False.
- num_examples: The number of examples to load from the dataset. By default will load all examples.

Output columns
- dynamic (all): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.

Categories
- load

Source code in src/distilabel/steps/generators/huggingface.py
outputs: List[str]
property

The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.

Returns:

Type | Description
---|---
List[str] | The columns that will be generated by this step.

load()

Load the dataset from the Hugging Face Hub.

Source code in src/distilabel/steps/generators/huggingface.py

process(offset=0)

Yields batches from the loaded dataset from the Hugging Face Hub.

Parameters:

Name | Type | Description | Default
---|---|---|---
offset | int | The offset to start yielding the data from. Will be used during the caching process to help skip already processed data. | 0

Yields:

Type | Description
---|---
GeneratorStepOutput | A tuple containing a batch of rows and a boolean indicating if the batch is the last one.

Source code in src/distilabel/steps/generators/huggingface.py
PreferenceToArgilla

Bases: Argilla

Creates a preference dataset in Argilla.

Step that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a preference dataset, where there's one field for the instruction and one extra field per generation within the same record, and then a rating question per generation field. The rating question asks the annotator to set a rating from 1 to 5 for each of the provided generations.

Note
This step is meant to be used in conjunction with the UltraFeedback step, or any other step generating both ratings and responses for a given set of instruction and generations for the given instruction. Alternatively, it can also be used with any other task or step generating only the instruction and generations, as the ratings and rationales are optional.

Attributes:

Name | Type | Description
---|---|---
num_generations | int | The number of generations to include in the dataset.
dataset_name | str | The name of the dataset in Argilla.
dataset_workspace | str | The workspace where the dataset will be created in Argilla. Defaults to the default workspace if not provided.
api_url | str | The URL of the Argilla API. Defaults to the value of the corresponding environment variable, if set.
api_key | str | The API key to authenticate with Argilla. Defaults to the value of the corresponding environment variable, if set.

Runtime parameters
- api_url: The base URL to use for the Argilla API requests.
- api_key: The API key to authenticate the requests to the Argilla API.

Input columns
- instruction (str): The instruction that was used to generate the completion.
- generations (List[str]): The completions that were generated based on the input instruction.
- ratings (List[str], optional): The ratings for the generations. If not provided, the generated ratings won't be pushed to Argilla.
- rationales (List[str], optional): The rationales for the ratings. If not provided, the generated rationales won't be pushed to Argilla.

Source code in src/distilabel/steps/argilla/preference.py
inputs: List[str]
property

The inputs for the step are the instruction and the generations. Optionally, one could also provide the ratings and the rationales for the generations.

load()

Sets the _instruction and _generations attributes based on the inputs_mapping, otherwise uses the default values; then uses those values to create a FeedbackDataset suited for the text-generation scenario, and pushes it to Argilla.

Source code in src/distilabel/steps/argilla/preference.py

process(inputs)

Creates and pushes the records as FeedbackRecords to the Argilla dataset.

Parameters:

Name | Type | Description | Default
---|---|---|---
inputs | StepInput | A list of Python dictionaries with the inputs of the task. | required

Returns:

Type | Description
---|---
StepOutput | A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/argilla/preference.py
PushToHub

Bases: GlobalStep

Push data to a Hugging Face Hub dataset.

A GlobalStep which creates a datasets.Dataset with the input data and pushes it to the Hugging Face Hub.

Attributes:

Name | Type | Description
---|---|---
repo_id | RuntimeParameter[str] | The Hugging Face Hub repository ID where the dataset will be uploaded.
split | RuntimeParameter[str] | The split of the dataset that will be pushed. Defaults to 'train'.
private | RuntimeParameter[bool] | Whether the dataset to be pushed should be private or not. Defaults to False.
token | Optional[RuntimeParameter[str]] | The token that will be used to authenticate in the Hub. If not provided, an attempt will be made to obtain it from the HF_TOKEN environment variable.

Runtime parameters
- repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.
- split: The split of the dataset that will be pushed.
- private: Whether the dataset to be pushed should be private or not.
- token: The token that will be used to authenticate in the Hub.

Input columns
- dynamic (all): All columns from the input will be used to create the dataset.

Categories
- save
- dataset
- huggingface

Source code in src/distilabel/steps/globals/huggingface.py
process(inputs)

Method that processes the input data, respecting the datasets.Dataset formatting, and pushes it to the Hugging Face Hub based on the RuntimeParameter attributes.

Parameters:

Name | Type | Description | Default
---|---|---|---
inputs | StepInput | The input data within a single object (as it's a GlobalStep) that will be transformed into a datasets.Dataset. | required

Yields:

Type | Description
---|---
StepOutput | Propagates the received inputs so that the Distiset can be generated if this is the last step of the Pipeline, or so they can be used by follow-up steps.
Source code in src/distilabel/steps/globals/huggingface.py
Step
¶
Bases: _Step
, ABC
Base class for the steps that can be included in a Pipeline
.
Attributes:

Name | Type | Description
---|---|---
`input_batch_size` | `RuntimeParameter[PositiveInt]` | The number of rows that the batches processed by the step will contain. Defaults to `50`.
Runtime parameters
- `input_batch_size`: The number of rows that the batches processed by the step will contain. Defaults to `50`.
Source code in src/distilabel/steps/base.py
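The effect of `input_batch_size` can be sketched with stdlib-only code: the pipeline splits the incoming rows into chunks of at most that many rows before handing each chunk to the step's `process` method. This is an illustrative sketch, not distilabel's actual batching code:

```python
# Sketch (stdlib only): splitting incoming rows into batches of
# `input_batch_size` rows, as a pipeline would before calling a
# Step's `process` method. distilabel's default is 50.
from typing import Any, Dict, Iterator, List


def batch_rows(
    rows: List[Dict[str, Any]], input_batch_size: int = 50
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive batches of at most `input_batch_size` rows."""
    for start in range(0, len(rows), input_batch_size):
        yield rows[start : start + input_batch_size]


rows = [{"num": i} for i in range(120)]
sizes = [len(batch) for batch in batch_rows(rows, input_batch_size=50)]
print(sizes)  # [50, 50, 20]
```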
process(*inputs)
abstractmethod
¶
Method that defines the processing logic of the step. It should yield the output rows.
Parameters:

Name | Type | Description | Default
---|---|---|---
`*inputs` | `StepInput` | An argument used to receive the outputs of the previous steps. The number of arguments depends on the number of previous steps. It doesn't need to be an `*args` argument. | `()`
Source code in src/distilabel/steps/base.py
process_applying_mappings(*args)
¶
Runs the process
method of the step applying the input_mappings
to the input
rows and the outputs_mappings
to the output rows. This is the function that
should be used to run the processing logic of the step.
Yields:

Type | Description
---|---
`StepOutput` | The output rows.
Source code in src/distilabel/steps/base.py
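The mapping step described above can be sketched with plain dicts. This is a stdlib-only illustration of the idea, not distilabel's actual implementation: rename incoming columns per `input_mappings`, run the step's logic, then rename produced columns per `output_mappings`.

```python
# Sketch (not distilabel's actual code): applying column mappings around a
# step's processing logic, as `process_applying_mappings` does.
from typing import Any, Dict, List


def apply_mappings(
    rows: List[Dict[str, Any]], mappings: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Rename keys in each row dict; keys without a mapping pass through."""
    return [{mappings.get(k, k): v for k, v in row.items()} for row in rows]


# e.g. input_mappings={"prompt": "instruction"} lets a step that expects an
# "instruction" column consume data that provides "prompt" instead.
rows = [{"prompt": "Say hi"}]
print(apply_mappings(rows, {"prompt": "instruction"}))
# [{'instruction': 'Say hi'}]
```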
TextGenerationToArgilla
¶
Bases: Argilla
Creates a text generation dataset in Argilla.
A Step
that creates a dataset in Argilla during the load phase, and then pushes the input
batches into it as records. This dataset is a text-generation dataset, where there's one field
per input, plus a label question to rate the quality of each completion as either bad
(represented with 👎) or good (represented with 👍).
Note
This step is meant to be used in conjunction with a TextGeneration
step and no column mapping
is needed, as it will use the default values for the instruction
and generation
columns.
Attributes:

Name | Type | Description
---|---|---
`dataset_name` | | The name of the dataset in Argilla.
`dataset_workspace` | | The workspace where the dataset will be created in Argilla. Defaults to `None`.
`api_url` | | The URL of the Argilla API. Defaults to the value of the `ARGILLA_API_URL` environment variable.
`api_key` | | The API key to authenticate with Argilla. Defaults to the value of the `ARGILLA_API_KEY` environment variable.
Runtime parameters
- `api_url`: The base URL to use for the Argilla API requests.
- `api_key`: The API key to authenticate the requests to the Argilla API.
Input columns
- instruction (`str`): The instruction that was used to generate the completion.
- generation (`str` or `List[str]`): The completions that were generated based on the input instruction.
Source code in src/distilabel/steps/argilla/text_generation.py
inputs: List[str]
property
¶
The inputs for the step are the instruction
and the generation
.
load()
¶
Sets the `_instruction` and `_generation` attributes based on the `inputs_mapping`
(falling back to the default values otherwise), uses those values to create a
`FeedbackDataset` suited for the text-generation scenario, and then pushes it to Argilla.
Source code in src/distilabel/steps/argilla/text_generation.py
process(inputs)
¶
Creates the records as FeedbackRecords and pushes them to the Argilla dataset.
Parameters:

Name | Type | Description | Default
---|---|---|---
`inputs` | `StepInput` | A list of Python dictionaries with the inputs of the task. | required
Returns:

Type | Description
---|---
`StepOutput` | A list of Python dictionaries with the outputs of the task.
Source code in src/distilabel/steps/argilla/text_generation.py
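The shape of the records can be sketched without the Argilla SDK. The dict layout below is hypothetical (the real step builds `FeedbackRecord` objects), but it shows the core logic: one record per completion, with a field per input column, and the `generation` column accepted as either a single string or a list of strings.

```python
# Sketch (hypothetical record shape, not the Argilla SDK): turning input
# rows into record-like dicts with one field per input column.
from typing import Any, Dict, List


def to_records(inputs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    records = []
    for row in inputs:
        generations = row["generation"]
        # `generation` may be a single string or a list of strings.
        if isinstance(generations, str):
            generations = [generations]
        for generation in generations:
            records.append(
                {
                    "fields": {
                        "instruction": row["instruction"],
                        "generation": generation,
                    }
                }
            )
    return records


rows = [{"instruction": "Say hi", "generation": ["Hi!", "Hello!"]}]
print(len(to_records(rows)))  # 2
```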
step(inputs=None, outputs=None, step_type='normal')
¶
Creates a Step
from a processing function.
Parameters:

Name | Type | Description | Default
---|---|---|---
`inputs` | `Union[List[str], None]` | A list containing the names of the input columns/keys expected by this step. If not provided, the default will be an empty list `[]`. | `None`
`outputs` | `Union[List[str], None]` | A list containing the names of the output columns/keys that the step will generate. If not provided, the default will be an empty list `[]`. | `None`
`step_type` | `Literal['normal', 'global', 'generator']` | The kind of step to create. Valid choices are: `"normal"` (`Step`), `"global"` (`GlobalStep`) and `"generator"` (`GeneratorStep`). | `'normal'`
Returns:

Type | Description
---|---
`Callable[..., Type[_Step]]` | A callable that will generate the type given the processing function.
Example:
# Imports shown here assume distilabel's public API; exact module paths
# may vary between distilabel versions.
from distilabel.steps import StepInput, step
from distilabel.steps.typing import GeneratorStepOutput, StepOutput
from distilabel.mixins.runtime_parameters import RuntimeParameter

# Normal step
@step(inputs=["instruction"], outputs=["generation"])
def GenerationStep(inputs: StepInput, dummy_generation: RuntimeParameter[str]) -> StepOutput:
    for input in inputs:
        input["generation"] = dummy_generation
    yield inputs

# Global step
@step(inputs=["instruction"], step_type="global")
def FilteringStep(inputs: StepInput, max_length: RuntimeParameter[int] = 256) -> StepOutput:
    yield [
        input
        for input in inputs
        if len(input["instruction"]) <= max_length
    ]

# Generator step
@step(outputs=["num"], step_type="generator")
def RowGenerator(num_rows: RuntimeParameter[int] = 500) -> GeneratorStepOutput:
    data = list(range(num_rows))
    for i in range(0, len(data), 100):
        last_batch = i + 100 >= len(data)
        yield [{"num": num} for num in data[i : i + 100]], last_batch
Source code in src/distilabel/steps/decorator.py
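The `step` decorator is a decorator factory: it takes configuration and returns a decorator that wraps the processing function in a dynamically created class. The toy version below (stdlib only, nothing like distilabel's real `_Step` machinery) sketches that pattern:

```python
# Sketch (stdlib only): the decorator-factory pattern behind `step`.
# The real decorator returns a `_Step` subclass; this toy version just
# builds a class that records the metadata and exposes the function.
from typing import Callable, List, Optional


def step(inputs: Optional[List[str]] = None, outputs: Optional[List[str]] = None):
    def decorator(func: Callable) -> type:
        # Build a class named after the decorated function, carrying the
        # declared input/output columns and the processing logic.
        return type(
            func.__name__,
            (),
            {
                "inputs": inputs or [],
                "outputs": outputs or [],
                "process": staticmethod(func),
            },
        )

    return decorator


@step(inputs=["instruction"], outputs=["generation"])
def EchoStep(rows):
    for row in rows:
        row["generation"] = row["instruction"]
    yield rows


print(EchoStep.inputs, EchoStep.outputs)  # ['instruction'] ['generation']
```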