GenerateSentencePair¶
Generate a positive and negative (optionally) sentences given an anchor sentence.
GenerateSentencePair is a pre-defined task that given an anchor sentence generates
    a positive sentence related to the anchor and optionally a negative sentence unrelated
    to the anchor or similar to it. Optionally, you can give a context to guide the LLM
    towards more specific behavior. This task is useful to generate training datasets for
    training embeddings models.
Attributes¶
- 
triplet: a flag to indicate if the task should generate a triplet of sentences (anchor, positive, negative). Defaults to False.
- 
action: the action to perform to generate the positive sentence. 
- 
context: the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default. 
- 
hard_negative: A flag to indicate if the negative should be a hard-negative or not. Hard negatives make it hard for the model to distinguish against the positive, with a higher degree of semantic similarity. 
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[anchor]
        end
        subgraph New columns
            OCOL0[positive]
            OCOL1[negative]
            OCOL2[model_name]
        end
    end
    subgraph GenerateSentencePair
        StepInput[Input Columns: anchor]
        StepOutput[Output Columns: positive, negative, model_name]
    end
    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput
Inputs¶
- anchor (str): The anchor sentence to generate the positive and negative sentences.
Outputs¶
- 
positive ( str): The positive sentence related to theanchor.
- 
negative ( str): The negative sentence unrelated to theanchoriftriplet=True, or more similar to the positive to make it more challenging for a model to distinguish in casehard_negative=True.
- 
model_name ( str): The name of the model that was used to generate the sentences.
Examples¶
Paraphrasing¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="paraphrase",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
Generating semantically similar sentences¶
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair
generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="semantically-similar",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])
Generating queries¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])
Generating answers¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="answer",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
)¶
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM
generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    context="Argilla is an open-source data curation platform for LLMs.",
    hard_negative=True,
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
    use_default_structured_output=True
)
generate_sentence_pair.load()
result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])