GenerateSentencePair¶
Generate a positive and negative (optionally) sentences given an anchor sentence.
GenerateSentencePair
is a pre-defined task that given an anchor sentence generates
a positive sentence related to the anchor and optionally a negative sentence unrelated
to the anchor. Optionally, you can give a context to guide the LLM towards more specific
behavior. This task is useful to generate training datasets for training embeddings
models.
Attributes¶
-
triplet: a flag to indicate if the task should generate a triplet of sentences (anchor, positive, negative). Defaults to
False
. -
action: the action to perform to generate the positive sentence.
-
context: the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default.
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[anchor]
end
subgraph New columns
OCOL0[positive]
OCOL1[negative]
OCOL2[model_name]
end
end
subgraph GenerateSentencePair
StepInput[Input Columns: anchor]
StepOutput[Output Columns: positive, negative, model_name]
end
ICOL0 --> StepInput
StepOutput --> OCOL0
StepOutput --> OCOL1
StepOutput --> OCOL2
StepInput --> StepOutput
Inputs¶
- anchor (
str
): The anchor sentence to generate the positive and negative sentences.
Outputs¶
-
positive (
str
): The positive sentence related to theanchor
. -
negative (
str
): The negative sentence unrelated to theanchor
iftriplet=True
. -
model_name (
str
): The name of the model that was used to generate the sentences.
Examples¶
Paraphrasing¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="paraphrase",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
Generating semantically similar sentences¶
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="semantically-similar",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])
Generating queries¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="query",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])
Generating answers¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.llms import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
triplet=True, # `False` to generate only positive
action="answer",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])