ultrajudge
Area
UltraJudgeOutput
UltraJudgeTask
dataclass
Bases: PreferenceTask
A PreferenceTask for the UltraJudge task. The UltraJudge task was defined at Argilla specifically to improve evaluation with AI Feedback. It builds on both UltraFeedback and JudgeLM, with several improvements and modifications.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
system_prompt | str | the system prompt to be used for generation. | "You are an evaluator tasked with assessing AI assistants' responses from the perspective of typical user preferences. Your critical analysis should focus on human-like engagement, solution effectiveness, accuracy, clarity, and creativity. Approach each response as if you were the user, considering how well the response meets your needs and expectations in a real-world scenario. Provide detailed feedback that highlights strengths and areas for improvement in each response, keeping in mind the goal of simulating a human's preferred choice. Your evaluation should be impartial and thorough, reflecting a human's perspective in preferring responses that are practical, clear, authentic, and aligned with their intent. Avoid bias, and focus on the content and quality of the responses." |
task_description | Union[str, None] | the description of the task. | "Your task is to rigorously evaluate the performance of {num_responses} AI assistants, simulating a human's perspective. You will assess each response based on four key domains, reflecting aspects that are typically valued by humans: {areas}. First provide a score between 0 and 10 and write a detailed feedback for each area and assistant. Finally, provide a list of {num_responses} scores, each separated by a space, to reflect the performance of Assistants 1 to {num_responses}." |
areas | List[str] | the areas to be used for the task. Defaults to a list of four areas: "Practical Accuracy", "Clarity & Transparency", "Authenticity & Reliability", and "Compliance with Intent". | field(default_factory=lambda: ['Practical Accuracy', 'Clarity & Transparency', 'Authenticity & Reliability', 'Compliance with Intent']) |
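The default `task_description` is a template with `{num_responses}` and `{areas}` placeholders. The following is a minimal sketch of how such a template could be filled. It assumes plain `str.format` substitution and a comma-joined areas string, neither of which is confirmed by this page. `TEMPLATE` is a shortened, hypothetical stand-in for the real default, and `render_task_description` is an illustrative helper, not part of distilabel.

```python
from typing import List

# Hypothetical, shortened stand-in for the default task_description.
TEMPLATE = (
    "Your task is to rigorously evaluate the performance of {num_responses} "
    "AI assistants. You will assess each response based on: {areas}."
)

# The four default areas listed in the table above.
AREAS = [
    "Practical Accuracy",
    "Clarity & Transparency",
    "Authenticity & Reliability",
    "Compliance with Intent",
]


def render_task_description(template: str, areas: List[str], num_responses: int) -> str:
    """Fill the placeholders, assuming they are ordinary str.format fields."""
    # Assumption: areas are joined with ", "; the library may format them differently.
    return template.format(num_responses=num_responses, areas=", ".join(areas))


rendered = render_task_description(TEMPLATE, AREAS, num_responses=2)
print(rendered)
```

Passing a custom `areas` list would change only the `{areas}` portion of the rendered text under these assumptions.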
Source code in src/distilabel/tasks/preference/ultrajudge.py
areas_str: str
property
Returns a string representation of the areas.
extract_area_score_and_rationale_regex: str
property
Returns a regex to extract the area, score, and rationale from the output.
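This page does not show the regex itself or the exact output layout it targets. As a rough sketch only, suppose the model writes one line per area in the form `<Area>: <score> - <rationale>`. That layout, the `build_area_regex` helper, and the `extract_areas` helper are all assumptions made for illustration, not the library's implementation.

```python
import re
from typing import List, Tuple


def build_area_regex(areas: List[str]) -> str:
    """Build a pattern for lines like '<Area>: <score> - <rationale>' (assumed layout)."""
    # Escape each area name so characters such as '&' are matched literally.
    alternatives = "|".join(re.escape(area) for area in areas)
    return (
        rf"(?P<area>{alternatives})\s*:\s*"
        r"(?P<score>\d+(?:\.\d+)?)\s*-\s*"
        r"(?P<rationale>.+)"
    )


def extract_areas(output: str, areas: List[str]) -> List[Tuple[str, float, str]]:
    """Return (area, score, rationale) tuples found in the output."""
    pattern = re.compile(build_area_regex(areas))
    return [
        (m.group("area"), float(m.group("score")), m.group("rationale").strip())
        for m in pattern.finditer(output)
    ]


sample_output = (
    "Practical Accuracy: 8 - Correct sequence.\n"
    "Clarity & Transparency: 6.5 - Terse, no explanation."
)
parsed = extract_areas(sample_output, ["Practical Accuracy", "Clarity & Transparency"])
print(parsed)
```

Named groups keep the three captured pieces readable, and because `.` does not match newlines by default, each rationale stops at the end of its line.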
extract_final_scores_regex: str
property
Returns a regex to extract the final scores from the output.
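The default task description asks the model to finish with a list of `{num_responses}` scores separated by spaces. The sketch below shows one way such a line could be picked out. It assumes the scores sit alone on a line near the end of the output; `FINAL_SCORES_REGEX` and `extract_final_scores` are hypothetical names, and the actual regex returned by this property may differ.

```python
import re
from typing import List

# Assumed shape: a line containing only numbers separated by spaces or tabs.
FINAL_SCORES_REGEX = r"^\s*(\d+(?:\.\d+)?(?:[ \t]+\d+(?:\.\d+)?)*)\s*$"


def extract_final_scores(output: str, num_responses: int) -> List[float]:
    """Return the last line of space-separated scores with the expected count."""
    # Scan from the end, since the scores are requested after the feedback.
    for line in reversed(output.strip().splitlines()):
        match = re.match(FINAL_SCORES_REGEX, line)
        if match:
            scores = [float(value) for value in match.group(1).split()]
            if len(scores) == num_responses:
                return scores
    raise ValueError(f"no line with {num_responses} scores found")


sample_output = "Assistant 1 was accurate but terse.\nAssistant 2 was clearer.\n\n7 9"
final_scores = extract_final_scores(sample_output, num_responses=2)
print(final_scores)
```

Checking the count against `num_responses` guards against accepting a stray numeric line that is not the final score list.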
output_args_names: List[str]
property
Returns the names of the output arguments of the task.
generate_prompt(input, generations, **_)
Generates a prompt following the UltraJudge specification.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | str | the input to be used for the prompt. | required |
generations | List[str] | the generations to be used for the prompt. | required |
Returns:
Name | Type | Description |
---|---|---|
Prompt | Prompt | the generated prompt. |
Examples:
>>> from distilabel.tasks.preference import UltraJudgeTask
>>> task = UltraJudgeTask(system_prompt="You are a helpful assistant.")
>>> task.generate_prompt("What are the first 5 Fibonacci numbers?", ["0 1 1 2 3", "0 1 1 2 3"])
Prompt(
    system_prompt="You are a helpful assistant.",
    formatted_prompt="Your task is to rigorously evaluate the performance of ...",
)
Source code in src/distilabel/tasks/preference/ultrajudge.py
parse_output(output)
Parses the output of the model into the desired format.