Structured data generation

Distilabel integrates with relevant libraries to generate structured text, i.e. to guide the LLM towards generating structured outputs that follow a JSON schema, a regex, etc.

Outlines

Distilabel integrates outlines within some LLM subclasses. At the moment, the following LLMs integrated with outlines are supported in distilabel: TransformersLLM, vLLM, and LlamaCppLLM, so anyone can generate structured outputs in the form of JSON or text constrained by a regular expression.

The LLM has an argument named structured_output [1] that determines how we can generate structured outputs with it; let's see an example using LlamaCppLLM.

Note

For the outlines integration to work, you may need to install the corresponding dependencies:

pip install distilabel[outlines]

JSON

We will start with a JSON example, where we initially define a pydantic.BaseModel schema to guide the generation of the structured output.

Note

Take a look at StructuredOutputType to see the expected format of the structured_output dict variable.
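
For illustration, the dict passed in this section takes one of two shapes, which we can sketch as a TypedDict. Note that this sketch is inferred from the examples below, not distilabel's actual definition:

from typing import Literal, Type, TypedDict, Union

from pydantic import BaseModel

class SketchedStructuredOutput(TypedDict):
    # "json" pairs with a pydantic.BaseModel subclass as schema,
    # "regex" pairs with a pattern string.
    format: Literal["json", "regex"]
    schema: Union[str, Type[BaseModel]]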

from pydantic import BaseModel

class User(BaseModel):
    name: str
    last_name: str
    id: int

And then we provide that schema to the structured_output argument of the LLM.

from distilabel.llms import LlamaCppLLM

llm = LlamaCppLLM(
    model_path="./openhermes-2.5-mistral-7b.Q4_K_M.gguf"  # (1)
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "json", "schema": User},
)
llm.load()
  1. We have previously downloaded a GGUF model, i.e. one compatible with llama.cpp, from the Hugging Face Hub using curl [2], but any model can be used as a replacement, as long as the model_path argument is updated.

And we are ready to pass our instruction as usual:

import json

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
    max_new_tokens=50
)

data = json.loads(result[0][0])
data
# {'name': 'Kathy', 'last_name': 'Smith', 'id': 4539210}
User(**data)
# User(name='Kathy', last_name='Smith', id=4539210)

We get back a JSON string that we can parse using json.loads, or validate directly against the User class, which is a pydantic.BaseModel subclass.
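
Alternatively, since User is a pydantic model, the parse and validation steps can be collapsed into one; a minimal sketch, assuming pydantic v2 (on v1 the equivalent method is User.parse_raw):

user = User.model_validate_json(result[0][0])
user
# User(name='Kathy', last_name='Smith', id=4539210)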

Regex

The following example shows text generation whose output adheres to a regular expression:

pattern = r"<name>(.*?)</name>.*?<grade>(.*?)</grade>"  # the same pattern for re.compile

llm = LlamaCppLLM(
    model_path=model_path,  # reuse the GGUF model downloaded for the JSON example above
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "regex", "schema": pattern},
)
llm.load()

result = llm.generate(
    [
        [
            {"role": "system", "content": "You are Simpsons' fans who loves assigning grades from A to E, where A is the best and E is the worst."},
            {"role": "user", "content": "What's up with Homer Simpson?"}
        ]
    ],
    max_new_tokens=200
)

We can check the output by parsing the content using the same pattern we required from the LLM.

import re
match = re.search(pattern, result[0][0])

if match:
    name = match.group(1)
    grade = match.group(2)
    print(f"Name: {name}, Grade: {grade}")
# Name: Homer Simpson, Grade: C+

These were some simple examples, but they give an idea of the options that structured generation opens up.
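
For instance, nothing prevents the schema from being nested; the hypothetical Character and Report models below are ours for illustration (reusing the model_path from the examples above), but they are passed to structured_output exactly like the User schema, keeping in mind that larger schemas need more tokens to complete:

from typing import List

from pydantic import BaseModel

from distilabel.llms import LlamaCppLLM

class Character(BaseModel):
    name: str
    grade: str

class Report(BaseModel):
    characters: List[Character]

llm = LlamaCppLLM(
    model_path=model_path,
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "json", "schema": Report},  # same mechanism as before
)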

Tip

A full pipeline example can be seen in the following script: examples/structured_generation_with_outlines.py

Instructor

When working with model providers behind an API, there's no direct way of accessing the internal logits processor as outlines does, but thanks to instructor we can generate structured outputs from LLM providers based on pydantic.BaseModel objects. We have integrated instructor with the AsyncLLM subclasses, so you can work with the following LLMs: OpenAILLM, AzureOpenAILLM, CohereLLM, GroqLLM, LiteLLM, and MistralLLM.

Note

For the instructor integration to work, you may need to install the corresponding dependencies:

pip install distilabel[instructor]

Note

Take a look at InstructorStructuredOutputType to see the expected format of the structured_output dict variable.

The following is the same example shown in the outlines JSON section, reproduced here for comparison:

from pydantic import BaseModel

class User(BaseModel):
    name: str
    last_name: str
    id: int

And then we provide that schema to the structured_output argument of the LLM:

Note

In this example we are using open-mixtral-8x22b; keep in mind that not all models support the function-calling capabilities required for this example to work.

from distilabel.llms import MistralLLM

llm = MistralLLM(
    model="open-mixtral-8x22b",
    structured_output={"schema": User}
)
llm.load()

And we are ready to pass our instructions as usual:

import json

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
    max_new_tokens=256
)

data = json.loads(result[0][0])
data
# {'name': 'John', 'last_name': 'Doe', 'id': 12345}
User(**data)
# User(name='John', last_name='Doe', id=12345)

As before, we get back a JSON string that we can parse with json.loads, or validate directly against the User class, which is a pydantic.BaseModel subclass.

Tip

A full pipeline example can be seen in the following script: examples/structured_generation_with_instructor.py

OpenAI JSON

OpenAI offers a JSON mode to deal with structured outputs via their API; let's see how to make use of it. JSON mode instructs the model to always return a valid JSON object.

Warning

Bear in mind, for this to work, you must instruct the model in some way to generate JSON, either in the system message or in the instruction, as can be seen in the API reference.

Unlike the outlines integration, JSON mode will not guarantee that the output matches any specific schema, only that it is valid JSON and parses without errors. More information can be found in the OpenAI documentation.

Aside from referencing JSON generation in the prompt, to ensure the model generates parseable JSON we can pass the argument response_format="json" [3]:

from distilabel.llms import OpenAILLM

llm = OpenAILLM(model="gpt-4-turbo", api_key="api.key")
llm.load()
llm.generate(..., response_format="json")
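
Since JSON mode only guarantees syntactically valid JSON, it is worth validating the result against your expected schema yourself; a minimal sketch, reusing the User model from the sections above:

import json

from pydantic import ValidationError

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile. Respond only with JSON."}]],
    response_format="json"
)

try:
    user = User(**json.loads(result[0][0]))
except (json.JSONDecodeError, ValidationError) as e:
    # JSON mode enforces valid JSON, not our schema, so handle mismatches explicitly
    print(f"Output did not match the expected schema: {e}")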

  1. You can check the variable type by importing it from:

    from distilabel.steps.tasks.structured_outputs.outlines import StructuredOutputType

  2. Download the model with curl:

    curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf

  3. Keep in mind that to interact with this response_format argument in a pipeline, you will have to pass it via the generation_kwargs:

    # Assuming a pipeline is already defined, and we have a task using OpenAILLM called `task_with_openai`:
    pipeline.run(
        parameters={
            "task_with_openai": {
                "llm": {
                    "generation_kwargs": {
                        "response_format": "json"
                    }
                }
            }
        }
    )