TextGeneration¶
Text generation with an LLM
given a prompt.
TextGeneration
is a pre-defined task that allows passing a custom prompt using the
Jinja2 syntax. By default, a instruction
is expected in the inputs, but the using
template
and columns
attributes one can define a custom prompt and columns expected
from the text. This task should be good enough for tasks that don't need post-processing
of the responses generated by the LLM.
Attributes¶
-
system_prompt: The system prompt to use in the generation. If not provided, then it will check if the input row has a column named
system_prompt
and use it. If not, then no system prompt will be used. Defaults toNone
. -
template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.
-
columns: A string with the column, or a list with columns expected in the template. Take a look at the examples for more information. Defaults to
instruction
. -
use_system_prompt: DEPRECATED. To be removed in 1.5.0. Whether to use the system prompt in the generation. Defaults to
True
, which means that if the columnsystem_prompt
is defined within the input batch, then thesystem_prompt
will be used, otherwise, it will be ignored.
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[dynamic]
end
subgraph New columns
OCOL0[generation]
OCOL1[model_name]
end
end
subgraph TextGeneration
StepInput[Input Columns: dynamic]
StepOutput[Output Columns: generation, model_name]
end
ICOL0 --> StepInput
StepOutput --> OCOL0
StepOutput --> OCOL1
StepInput --> StepOutput
Inputs¶
- dynamic (determined by
columns
attribute): By default will be set toinstruction
. The columns can point both to astr
or aList[str]
to be used in the template.
Outputs¶
-
generation (
str
): The generated text. -
model_name (
str
): The name of the model used to generate the text.
Examples¶
Generate text from an instruction¶
from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
text_gen = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
)
)
text_gen.load()
result = next(
text_gen.process(
[{"instruction": "your instruction"}]
)
)
# result
# [
# {
# 'instruction': 'your instruction',
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
# 'generation': 'generation',
# }
# ]
Use a custom template to generate text¶
from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM
CUSTOM_TEMPLATE = '''Document:
{{ document }}
Question: {{ question }}
Please provide a clear and concise answer to the question based on the information in the document and your general knowledge:
'''.rstrip()
text_gen = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
system_prompt="You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.",
template=CUSTOM_TEMPLATE,
columns=["document", "question"],
)
text_gen.load()
result = next(
text_gen.process(
[
{
"document": "The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.",
"question": "What is the main threat to the Great Barrier Reef mentioned in the document?"
}
]
)
)
# result
# [
# {
# 'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',
# 'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
# 'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',
# }
# ]
Few shot learning with different system prompts¶
from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM
CUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:
{% for example in examples %}
Example {{ loop.index }}:
Instruction: {{ example }}
{% endfor %}
Now, generate a new instruction in a similar style:
'''.rstrip()
text_gen = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
template=CUSTOM_TEMPLATE,
columns="examples",
)
text_gen.load()
result = next(
text_gen.process(
[
{
"examples": ["This is an example", "Another relevant example"],
"system_prompt": "You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations."
}
]
)
)
# result
# [
# {
# 'examples': ['This is an example', 'Another relevant example'],
# 'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
# 'generation': 'Disable the firewall on the router',
# }
# ]