TextGeneration

Text generation with an LLM given a prompt.

TextGeneration is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction is expected in the inputs, but using the template and columns attributes one can define a custom prompt and the columns expected from the input. This task should suffice for use cases that don't require post-processing of the responses generated by the LLM.
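
When neither template nor columns is customized, the rendered prompt is effectively the value of the instruction column verbatim, equivalent to this one-line Jinja2 template (a simplification of the internal behavior):

{{ instruction }}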

Attributes

  • system_prompt: The system prompt to use in the generation. If not provided, then it will check whether the input row has a column named system_prompt and use it; if not, no system prompt will be used. Defaults to None. See the sketch after this list for the resolution order.

  • template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.

  • columns: A string with the column name, or a list of column names, expected in the template. Take a look at the examples for more information. Defaults to instruction.

  • use_system_prompt: DEPRECATED. To be removed in 1.5.0. Whether to use the system prompt in the generation. Defaults to True, which means that if the column system_prompt is defined within the input batch, then the system_prompt will be used; otherwise, it will be ignored.
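
The following is a minimal sketch of that resolution order, assuming a simplified standalone helper (resolve_system_prompt is illustrative, not part of the distilabel API):

def resolve_system_prompt(task_system_prompt, row):
    # The task-level system_prompt attribute takes precedence when set.
    if task_system_prompt is not None:
        return task_system_prompt
    # Otherwise fall back to a per-row "system_prompt" column, if present;
    # None means no system prompt is used at all.
    return row.get("system_prompt")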

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[dynamic]
        end
        subgraph New columns
            OCOL0[generation]
            OCOL1[model_name]
        end
    end

    subgraph TextGeneration
        StepInput[Input Columns: dynamic]
        StepOutput[Output Columns: generation, model_name]
    end

    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput

Inputs

  • dynamic (determined by the columns attribute): By default it will be set to instruction. The columns can point to either a str or a List[str] to be used in the template, as shown in the sketch below.
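
For instance, a column holding a List[str] can be iterated over in the template, as in this minimal standalone Jinja2 sketch (the notes column name is hypothetical):

from jinja2 import Template

# Render a list-valued column with a Jinja2 for loop.
template = Template("Notes:\n{% for note in notes %}- {{ note }}\n{% endfor %}")
print(template.render(notes=["first note", "second note"]))
# Notes:
# - first note
# - second note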

Outputs

  • generation (str): The generated text.

  • model_name (str): The name of the model used to generate the text.

Examples

Generate text from an instruction

from distilabel.steps.tasks import TextGeneration
from distilabel.llms.huggingface import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    )
)

text_gen.load()

result = next(
    text_gen.process(
        [{"instruction": "your instruction"}]
    )
)
# result
# [
#     {
#         'instruction': 'your instruction',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'generation',
#     }
# ]

Use a custom template to generate text

from distilabel.steps.tasks import TextGeneration
from distilabel.llms.huggingface import InferenceEndpointsLLM

CUSTOM_TEMPLATE = '''Document:
{{ document }}

Question: {{ question }}

Please provide a clear and concise answer to the question based on the information in the document and your general knowledge:
'''.rstrip()

text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    system_prompt="You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.",
    template=CUSTOM_TEMPLATE,
    columns=["document", "question"],
)

text_gen.load()

result = next(
    text_gen.process(
        [
            {
                "document": "The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.",
                "question": "What is the main threat to the Great Barrier Reef mentioned in the document?"
            }
        ]
    )
)
# result
# [
#     {
#         'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',
#         'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',
#     }
# ]

Few-shot learning with different system prompts

from distilabel.steps.tasks import TextGeneration
from distilabel.llms.huggingface import InferenceEndpointsLLM

CUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:

{% for example in examples %}
Example {{ loop.index }}:
Instruction: {{ example }}

{% endfor %}
Now, generate a new instruction in a similar style:
'''.rstrip()

text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    template=CUSTOM_TEMPLATE,
    columns="examples",
)

text_gen.load()

result = next(
    text_gen.process(
        [
            {
                "examples": ["This is an example", "Another relevant example"],
                "system_prompt": "You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations."
            }
        ]
    )
)
# result
# [
#     {
#         'examples': ['This is an example', 'Another relevant example'],
#         'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'Disable the firewall on the router',
#     }
# ]
