Genstruct¶
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
is a pre-defined task designed to generate valid instructions from a given raw document,
with the title and the content, enabling the creation of new, partially synthetic instruction finetuning
datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is
inspired in the Ada-Instruct paper.
Note¶
The Genstruct prompt i.e. the task, can be used with any model really, but the safest / recommended
option is to use NousResearch/Genstruct-7B
as the LLM provided to the task, since it was trained
for this specific task.
Attributes¶
- _template: a Jinja2 template used to format the input for the LLM.
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[title]
ICOL1[content]
end
subgraph New columns
OCOL0[user]
OCOL1[assistant]
OCOL2[model_name]
end
end
subgraph Genstruct
StepInput[Input Columns: title, content]
StepOutput[Output Columns: user, assistant, model_name]
end
ICOL0 --> StepInput
ICOL1 --> StepInput
StepOutput --> OCOL0
StepOutput --> OCOL1
StepOutput --> OCOL2
StepInput --> StepOutput
Inputs¶
-
title (
str
): The title of the document. -
content (
str
): The content of the document.
Outputs¶
-
user (
str
): The user's instruction based on the document. -
assistant (
str
): The assistant's response based on the user's instruction. -
model_name (
str
): The model name used to generate thefeedback
andresult
.
Examples¶
Generate instructions from raw documents using the title and content¶
from distilabel.steps.tasks import Genstruct
from distilabel.llms.huggingface import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
genstruct = Genstruct(
llm=InferenceEndpointsLLM(
model_id="NousResearch/Genstruct-7B",
),
)
genstruct.load()
result = next(
genstruct.process(
[
{"title": "common instruction", "content": "content of the document"},
]
)
)
# result
# [
# {
# 'title': 'An instruction',
# 'content': 'content of the document',
# 'model_name': 'test',
# 'user': 'An instruction',
# 'assistant': 'content of the document',
# }
# ]