GenerateSentencePair¶
Generate a positive sentence and, optionally, a negative sentence given an anchor sentence.
GenerateSentencePair is a pre-defined task that, given an anchor sentence, generates a positive sentence related to the anchor and, optionally, a negative sentence that is either unrelated to the anchor or similar to it. You can also provide a context to guide the LLM towards more specific behavior. This task is useful for generating datasets to train embedding models.
Attributes¶
- triplet: a flag indicating whether the task should generate a triplet of sentences (anchor, positive, negative). Defaults to False.
- action: the action to perform to generate the positive sentence. The examples on this page use "paraphrase", "semantically-similar", "query", and "answer".
- context: the context to use for the generation. It can help guide the LLM towards a more specific domain. Not used by default.
- hard_negative: a flag indicating whether the negative should be a hard negative. Hard negatives are semantically closer to the positive, which makes them harder for a model to tell apart from it.
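As a rough illustration of how these attributes combine, the sketch below models them with a plain dataclass. `SentencePairSettings`, `validate`, and `output_columns` are hypothetical names that are not part of distilabel. The action list contains only the four values used in the examples on this page, and the rule that `hard_negative` requires `triplet` is an assumption of this sketch; the library itself may behave differently.

```python
from dataclasses import dataclass
from typing import List, Optional

# Action values that appear in the examples on this page (assumed, not exhaustive).
KNOWN_ACTIONS = ("paraphrase", "semantically-similar", "query", "answer")


@dataclass
class SentencePairSettings:
    """Hypothetical container mirroring the documented attributes."""

    action: str
    triplet: bool = False
    context: Optional[str] = None
    hard_negative: bool = False

    def validate(self) -> None:
        if self.action not in KNOWN_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        # Assumption of this sketch: a hard negative only makes sense
        # when a negative sentence is generated at all.
        if self.hard_negative and not self.triplet:
            raise ValueError("hard_negative=True requires triplet=True")

    def output_columns(self) -> List[str]:
        """Columns the task is expected to add for these settings."""
        columns = ["positive"]
        if self.triplet:
            columns.append("negative")
        return columns + ["model_name"]


settings = SentencePairSettings(action="query", triplet=True, hard_negative=True)
settings.validate()
print(settings.output_columns())
```

Checking the combination up front, before any LLM call is made, is a cheap way to catch a mistyped action or an inconsistent flag pair.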
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[anchor]
end
subgraph New columns
OCOL0[positive]
OCOL1[negative]
OCOL2[model_name]
end
end
subgraph GenerateSentencePair
StepInput[Input Columns: anchor]
StepOutput[Output Columns: positive, negative, model_name]
end
ICOL0 --> StepInput
StepOutput --> OCOL0
StepOutput --> OCOL1
StepOutput --> OCOL2
StepInput --> StepOutput
Inputs¶
- anchor (str): The anchor sentence used to generate the positive and negative sentences.
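The task therefore expects a list of rows, each carrying an `anchor` string. A minimal sketch of preparing such rows from raw sentences follows; `build_inputs` and `batched` are hypothetical helpers, not distilabel functions, and the batch size of 10 simply mirrors the `input_batch_size=10` used in the examples on this page.

```python
from typing import Dict, Iterable, Iterator, List


def build_inputs(sentences: Iterable[str]) -> List[Dict[str, str]]:
    """Wrap each non-empty sentence in a row with the `anchor` column."""
    return [{"anchor": s.strip()} for s in sentences if s and s.strip()]


def batched(
    rows: List[Dict[str, str]], batch_size: int = 10
) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start : start + batch_size]


# Sample sentences; blank entries are dropped.
rows = build_inputs(["How does 3D printing work?", "  ", "What is an embedding model?"])
for batch in batched(rows, batch_size=10):
    print(batch)
```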
Outputs¶
- positive (str): The positive sentence related to the anchor.
- negative (str): The negative sentence, generated if triplet=True. It is unrelated to the anchor by default; if hard_negative=True, it is more similar to the positive, which makes it more challenging for a model to distinguish.
- model_name (str): The name of the model that was used to generate the sentences.
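To turn these columns into training data for an embedding model, the rows can be reduced to (anchor, positive, negative) tuples. The sketch below is one possible way to do that; `to_triplets` is a hypothetical helper, the sample rows contain placeholder strings rather than real model output, and treating a failed generation as a `None` value is an assumption.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def to_triplets(rows: List[Row]) -> List[Tuple[str, str, str]]:
    """Keep only rows where anchor, positive, and negative are all present."""
    triplets = []
    for row in rows:
        anchor = row.get("anchor")
        positive = row.get("positive")
        negative = row.get("negative")
        # Skip incomplete rows (assumed to hold None or an empty string).
        if not (anchor and positive and negative):
            continue
        triplets.append((anchor, positive, negative))
    return triplets


# Placeholder rows shaped like the documented output columns.
sample_rows: List[Row] = [
    {
        "anchor": "How does 3D printing work?",
        "positive": "<generated positive>",
        "negative": "<generated negative>",
        "model_name": "<model id>",
    },
    {
        "anchor": "What Game of Thrones villain would be the most likely to give you mercy?",
        "positive": None,
        "negative": None,
        "model_name": "<model id>",
    },
]

print(to_triplets(sample_rows))
```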
Examples¶
Paraphrasing¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="paraphrase",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

# `process` is a generator; `next` retrieves the first batch of results.
result = next(generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}]))
Generating semantically similar sentences¶
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="semantically-similar",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

# `process` is a generator; `next` retrieves the first batch of results.
result = next(generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}]))
Generating queries¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="query",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

# `process` is a generator; `next` retrieves the first batch of results.
result = next(generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}]))
Generating answers¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="answer",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

# `process` is a generator; `next` retrieves the first batch of results.
result = next(generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}]))
Generating queries with context (applies to every action)¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="query",
    # The context steers generation towards a specific domain.
    context="Argilla is an open-source data curation platform for LLMs.",
    # Ask for a negative that is harder to distinguish from the positive.
    hard_negative=True,
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
    use_default_structured_output=True,
)

generate_sentence_pair.load()

# `process` is a generator; `next` retrieves the first batch of results.
result = next(generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}]))