Embeddings¶
GenerateEmbeddings
¶
Bases: Step
Generate embeddings for a text input using the last hidden state of an LLM
, as
described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
Automatic Data Selection in Instruction Tuning'.
Attributes:
Name | Type | Description |
---|---|---|
llm |
LLM
|
The |
Input columns
- text (
str
,List[Dict[str, str]]
): The input text or conversation to generate embeddings for.
Output columns
- embedding (
List[float]
): The embedding of the input text or conversation.
Source code in src/distilabel/steps/tasks/generate_embeddings.py
inputs: List[str]
property
¶
The inputs for the task is a text
column containing either a string or a
list of dictionaries in OpenAI chat-like format.
outputs: List[str]
property
¶
The outputs for the task is an embedding
column containing the embedding of
the text
input.
format_input(input)
¶
Formats the input to be used by the LLM to generate the embeddings. The input
can be in ChatType
format or a string. If a string, it will be converted to a
list of dictionaries in OpenAI chat-like format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input |
Dict[str, Any]
|
The input to format. |
required |
Returns:
Type | Description |
---|---|
ChatType
|
The OpenAI chat-like format of the input. |
Source code in src/distilabel/steps/tasks/generate_embeddings.py
load()
¶
process(inputs)
¶
Generates an embedding for each input using the last hidden state of the LLM
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs |
StepInput
|
A list of Python dictionaries with the inputs of the task. |
required |
Yields:
Type | Description |
---|---|
StepOutput
|
A list of Python dictionaries with the outputs of the task. |