Prometheus 2

"Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models" presents Prometheus 2, a more powerful evaluator LLM than its predecessor, Prometheus, which was introduced in "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models". GPT-4 and other proprietary LLMs are commonly used to assess the quality of responses from other LLMs, but concerns about transparency, controllability, and affordability motivate the need for open-source LLMs specialized in evaluation.

Existing open evaluator LMs exhibit critical shortcomings:

  1. They issue scores that significantly diverge from those assigned by humans.
  2. They lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment.

Additionally, they cannot evaluate against custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. Prometheus 2 can process both direct assessment and pairwise ranking formats, paired with user-defined evaluation criteria.

Prometheus 2 was released in two variants:

  • prometheus-eval/prometheus-7b-v2.0

  • prometheus-eval/prometheus-8x7b-v2.0

Both models have been fine-tuned for direct assessment and pairwise ranking. Direct assessment scores a single, isolated response to a given instruction; pairwise ranking compares two responses to the same instruction. Both tasks can be run with or without a reference answer.
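To make the two formats concrete, here is a minimal sketch of how a raw evaluator completion could be split into feedback and verdict. It assumes the "Feedback: ... [RESULT] <verdict>" output convention of the Prometheus prompts, with an integer from 1 to 5 for direct assessment and A or B for pairwise ranking. The function name and regular expressions are illustrative and are not distilabel's implementation.

```python
import re
from typing import Optional, Tuple


def parse_evaluation(output: str, mode: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a raw completion into (feedback, result).

    Returns (None, None) when the completion does not follow the assumed
    "Feedback: ... [RESULT] <verdict>" convention.
    """
    if mode not in ("absolute", "relative"):
        raise ValueError("mode must be 'absolute' or 'relative'")
    if "[RESULT]" not in output:
        return None, None

    feedback, _, verdict = output.partition("[RESULT]")
    feedback = feedback.strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()

    # Direct assessment yields a 1-5 score; pairwise ranking yields "A" or "B".
    pattern = r"\s*([1-5])\b" if mode == "absolute" else r"\s*([AB])\b"
    match = re.match(pattern, verdict)
    if match is None:
        return None, None
    return feedback, match.group(1)


print(parse_evaluation("Feedback: Accurate and complete. [RESULT] 4", "absolute"))
print(parse_evaluation("Feedback: B is more precise. [RESULT] B", "relative"))
```

Returning the verdict as a string keeps both modes uniform; a caller that needs a numeric score can convert it in the direct assessment case.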

On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 achieves the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. The models, code, and data are all publicly available at prometheus-eval/prometheus-eval.
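As a rough illustration of what "correlation" and "agreement" can mean here, the sketch below computes a Pearson correlation between evaluator scores and human scores (direct assessment) and a simple match rate between evaluator and human preferences (pairwise ranking). The choice of Pearson correlation is an assumption, and the numbers are toy values made up for the example, not results from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def agreement(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of pairwise verdicts that match the human preference."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy values for illustration only.
evaluator_scores = [5, 3, 4, 1, 2]
human_scores = [4, 3, 5, 1, 2]
print(f"correlation: {pearson(evaluator_scores, human_scores):.2f}")
print(f"agreement: {agreement(['A', 'B', 'A', 'B'], ['A', 'B', 'B', 'B']):.2f}")
```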



Replication

This section is named Replication, but we are not replicating the Prometheus 2 paper per se. Instead, we show how to use the PrometheusEval task implemented in distilabel to evaluate the quality of responses to a given instruction with a Prometheus 2 model.

To showcase Prometheus 2, we will use the PrometheusEval task implemented in distilabel and a small dataset created by the Hugging Face H4 team, HuggingFaceH4/instruction-dataset, for testing purposes.


To reproduce the code below, install distilabel as follows:

pip install "distilabel[vllm]>=1.1.0"

Additionally, it's recommended to install Dao-AILab/flash-attention to benefit from Flash Attention 2 speedups during inference via vLLM.

pip install flash-attn --no-build-isolation


The installation notes above assume a VM with one GPU accelerator and enough VRAM to fit prometheus-eval/prometheus-7b-v2.0 in bfloat16 (roughly 15GB for the weights alone, plus headroom for the KV cache). If you have enough VRAM to fit the 8x7B model in bfloat16 (~90GB), you can use prometheus-eval/prometheus-8x7b-v2.0 instead.
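As a back-of-the-envelope check on memory figures like these, weight memory is approximately the parameter count times the bytes per parameter (2 for bfloat16). The sketch below covers the weights only; inference also needs room for the KV cache and activations. The parameter counts are approximate values assumed for the underlying Mistral 7B and Mixtral 8x7B architectures and do not come from this page.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Approximate parameter counts (assumptions, not taken from this page).
MODELS = {
    "prometheus-eval/prometheus-7b-v2.0": 7.24e9,
    "prometheus-eval/prometheus-8x7b-v2.0": 46.7e9,
}

for name, num_params in MODELS.items():
    print(f"{name}: ~{weight_memory_gb(num_params):.1f} GB of weights in bfloat16")
```

Passing bytes_per_param=4 gives the float32 footprint, which is twice as large.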

Building blocks

  • LoadDataFromHub: GeneratorStep to load a dataset from the Hugging Face Hub.

  • PrometheusEval: Task that assesses the quality of a response for a given instruction using any of the Prometheus 2 models.


    Since the Prometheus 2 models use a slightly different chat template than mistralai/Mistral-7B-Instruct-v0.2, we need to set the chat_template parameter to [INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST] so as to properly format the input for Prometheus 2.

  • (Optional) KeepColumns: Step that keeps only the specified columns in the dataset, used here to drop the columns we don't need.
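To see what the chat_template above produces, here is a plain-Python stand-in for the Jinja rendering. It assumes the task sends exactly two messages, a system message followed by a user message. The actual rendering is done by the template engine behind vLLM, so this function is only an illustration of the resulting string.

```python
from typing import Dict, List


def render_prometheus_prompt(messages: List[Dict[str, str]]) -> str:
    """Mimic the template: [INST] <first message>, newline, <second message>[/INST]."""
    if len(messages) < 2:
        raise ValueError("expected a system message followed by a user message")
    # Both messages are placed inside a single [INST] ... [/INST] block.
    return f"[INST] {messages[0]['content']}\n{messages[1]['content']}[/INST]"


# Hypothetical message contents, for illustration only.
prompt = render_prometheus_prompt(
    [
        {"role": "system", "content": "You are a fair judge assistant."},
        {"role": "user", "content": "###Task Description: ..."},
    ]
)
print(prompt)
```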


We will now put the building blocks above together to see how Prometheus 2 can be used via distilabel.

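from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromHub
from distilabel.steps.tasks import PrometheusEval

if __name__ == "__main__":
    with Pipeline(name="prometheus") as pipeline:
        # Load the test dataset and rename its columns to the names PrometheusEval expects.
        load_dataset = LoadDataFromHub(
            name="load_dataset",
            repo_id="HuggingFaceH4/instruction-dataset",
            split="test",
            output_mappings={"prompt": "instruction", "completion": "generation"},
        )

        # Direct assessment without a reference answer; the rubric is an assumed choice.
        task = PrometheusEval(
            name="task",
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="absolute",
            rubric="factual-validity",
            reference=False,
        )

        # Keep only the columns of interest in the final dataset.
        keep_columns = KeepColumns(
            name="keep_columns",
            columns=["instruction", "generation", "feedback", "result", "model_name"],
        )

        load_dataset >> task >> keep_columns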
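The data flow through these three steps can be pictured with plain dictionaries. This is a hedged sketch that needs neither distilabel nor a GPU: the helper functions and the example row are hypothetical, and the evaluator is replaced by a placeholder that only adds the output columns the pipeline keeps.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def rename_columns(row: Row, mappings: Dict[str, str]) -> Row:
    """Rename columns the way output_mappings does: old name -> new name."""
    return {mappings.get(key, key): value for key, value in row.items()}


def fake_evaluate(row: Row) -> Row:
    """Placeholder for the evaluation task; real values come from the model."""
    return {
        **row,
        "feedback": None,
        "result": None,
        "model_name": "prometheus-eval/prometheus-7b-v2.0",
    }


def keep_columns(row: Row, columns: List[str]) -> Row:
    """Keep only the listed columns, in the listed order."""
    return {column: row[column] for column in columns}


# Hypothetical input row, for illustration only.
row = {"prompt": "What is 2 + 2?", "completion": "4", "meta": {"source": "hypothetical"}}
row = rename_columns(row, {"prompt": "instruction", "completion": "generation"})
row = fake_evaluate(row)
row = keep_columns(row, ["instruction", "generation", "feedback", "result", "model_name"])
print(row)
```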

Then we need to call pipeline.run with the runtime parameters so that the pipeline can be launched.

distiset = pipeline.run(
    parameters={
        task.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 1024,
                    "temperature": 0.7,
                },
            },
        },
    },
)

Finally, we can optionally push the generated dataset, named Distiset, to the Hugging Face Hub via the push_to_hub method, so that each subset generated in the leaf steps is pushed to the Hub.