PrometheusEval¶
Critique and rank the quality of generations from an LLM
using Prometheus 2.0.
`PrometheusEval` is a task created for Prometheus 2.0, covering both the absolute and relative
evaluations.
- The absolute evaluation, i.e. `mode="absolute"`, evaluates a single generation from
an LLM for a given instruction.
- The relative evaluation, i.e. `mode="relative"`, compares two generations from an LLM
for a given instruction.
Both modes can optionally compare against a reference answer, which is enabled via the `reference`
attribute. Both are based on a score rubric that critiques the generation(s) on one of the following
default aspects: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.
The available rubrics can be overridden via `rubrics`, and the rubric to apply is selected via the
`rubric` attribute.
Note¶
The `PrometheusEval` task is intended to be used with any of the Prometheus 2.0
models released by KAIST AI: https://huggingface.co/prometheus-eval/prometheus-7b-v2.0,
and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The formatting and quality of the
critique are not guaranteed with another model, even though some other models may be able to
follow the formatting correctly and generate insightful critiques too.
Attributes¶
- `mode`: the evaluation mode to use, either `absolute` or `relative`. It defines whether the task will evaluate one or two generations.
- `rubric`: the score rubric to use within the prompt to run the critique based on different aspects. Can be any existing key in the `rubrics` attribute, which by default means that it can be: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, or `reasoning`. Those will only work if using the default `rubrics`, otherwise, the provided `rubrics` should be used.
- `rubrics`: a dictionary containing the different rubrics to use for the critique, where the keys are the rubric names and the values are the rubric descriptions. The default rubrics are the following: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.
- `reference`: a boolean flag to indicate whether a reference answer / completion will be provided, so that the model critique is based on the comparison with it. It implies that the column `reference` needs to be provided within the input data in addition to the rest of the inputs.
- `_template`: a Jinja2 template used to format the input for the LLM.
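The relationship between `rubric` and `rubrics` described above can be sketched in plain Python. This is a minimal illustration, not the task's actual implementation: `select_rubric`, `DEFAULT_RUBRICS`, and `custom_rubrics` are hypothetical names, and the description strings are placeholders rather than the rubric texts shipped with the task.

```python
from typing import Dict

# Names of the default rubrics listed above. The descriptions are
# placeholders for illustration only, not the real rubric texts.
DEFAULT_RUBRICS: Dict[str, str] = {
    "helpfulness": "<placeholder description>",
    "harmlessness": "<placeholder description>",
    "honesty": "<placeholder description>",
    "factual-validity": "<placeholder description>",
    "reasoning": "<placeholder description>",
}


def select_rubric(rubric: str, rubrics: Dict[str, str] = DEFAULT_RUBRICS) -> str:
    """Hypothetical helper: return the description stored under `rubric`.

    Raises ValueError when `rubric` is not a key of `rubrics`, mirroring the
    rule that the selected rubric must exist in the provided dictionary.
    """
    if rubric not in rubrics:
        raise ValueError(
            f"rubric={rubric!r} is not a key of rubrics: {sorted(rubrics)}"
        )
    return rubrics[rubric]


# Overriding the defaults: only keys of the custom dictionary are valid now.
custom_rubrics = {"conciseness": "<placeholder description of conciseness>"}
description = select_rubric("conciseness", custom_rubrics)
```

With `custom_rubrics` in place, selecting a default name such as `helpfulness` would fail, since that key no longer exists in the dictionary being used.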
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[instruction]
            ICOL1[generation]
            ICOL2[generations]
            ICOL3[reference]
        end
        subgraph New columns
            OCOL0[feedback]
            OCOL1[result]
            OCOL2[model_name]
        end
    end
    subgraph PrometheusEval
        StepInput[Input Columns: instruction, generation, generations, reference]
        StepOutput[Output Columns: feedback, result, model_name]
    end
    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    ICOL3 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput
Inputs¶
- `instruction` (`str`): The instruction to use as reference.
- `generation` (`str`, optional): The generated text from the given `instruction`. This column is required if `mode=absolute`.
- `generations` (`List[str]`, optional): The generated texts from the given `instruction`. It should contain 2 generations only. This column is required if `mode=relative`.
- `reference` (`str`, optional): The reference / golden answer for the `instruction`, to be used by the LLM for comparison against.
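The column requirements above depend on `mode` and on the `reference` flag. The following sketch spells out that dependency; `required_columns` and `validate_row` are hypothetical helpers written for illustration and are not part of the library.

```python
from typing import Any, Dict, List


def required_columns(mode: str, reference: bool = False) -> List[str]:
    """Hypothetical helper: list the input columns needed for a given setup."""
    if mode not in ("absolute", "relative"):
        raise ValueError(f"mode must be 'absolute' or 'relative', got {mode!r}")
    columns = ["instruction"]
    # One generation for absolute mode, a list of two for relative mode.
    columns.append("generation" if mode == "absolute" else "generations")
    if reference:
        columns.append("reference")
    return columns


def validate_row(row: Dict[str, Any], mode: str, reference: bool = False) -> None:
    """Hypothetical helper: raise ValueError if `row` does not fit the setup."""
    missing = [c for c in required_columns(mode, reference) if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if mode == "relative" and len(row["generations"]) != 2:
        raise ValueError("'generations' should contain exactly 2 generations")


# A row suitable for mode="relative" with reference=True.
row = {
    "instruction": "What is 2 + 2?",
    "generations": ["2 + 2 = 4", "2 + 2 = 5"],
    "reference": "4",
}
validate_row(row, mode="relative", reference=True)
```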
Outputs¶
- `feedback` (`str`): The feedback explaining the result below, as critiqued by the LLM using the pre-defined score rubric, compared against `reference` if provided.
- `result` (`Union[int, Literal["A", "B"]]`): If `mode=absolute`, then the result contains the score for the `generation` on a Likert scale from 1 to 5. Otherwise, if `mode=relative`, then the result contains either "A" or "B", the "winning" one being the generation at index 0 of `generations` if `result='A'` or at index 1 if `result='B'`.
- `model_name` (`str`): The model name used to generate the `feedback` and `result`.
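As a rough illustration of how the `result` column might be consumed downstream, the sketch below maps a relative result back to the winning generation and checks that a result fits the mode. Both functions are hypothetical helpers, not library code.

```python
from typing import List, Union


def winning_generation(generations: List[str], result: str) -> str:
    """Hypothetical helper: 'A' selects index 0 and 'B' selects index 1."""
    index = {"A": 0, "B": 1}[result]
    return generations[index]


def is_valid_result(mode: str, result: Union[int, str]) -> bool:
    """Hypothetical helper: check that `result` matches what `mode` produces."""
    if mode == "absolute":
        # A score from 1 to 5; bool is excluded since it subclasses int.
        return (
            isinstance(result, int)
            and not isinstance(result, bool)
            and 1 <= result <= 5
        )
    if mode == "relative":
        return result in ("A", "B")
    return False


generations = ["first answer", "second answer"]
winner = winning_generation(generations, "B")  # the generation at index 1
```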