azure-ai-evaluation 1.9.0__py3-none-any.whl → 1.11.0__py3-none-any.whl

This diff compares the contents of two publicly released versions of the package, as published to a supported public registry. It is provided for informational purposes only and reflects the changes between the versions as they appear in that registry.

Note: this release of azure-ai-evaluation has been flagged as potentially problematic; see the source registry page for details.

Files changed (85)
  1. azure/ai/evaluation/__init__.py +46 -12
  2. azure/ai/evaluation/_aoai/python_grader.py +84 -0
  3. azure/ai/evaluation/_aoai/score_model_grader.py +1 -0
  4. azure/ai/evaluation/_common/onedp/models/_models.py +5 -0
  5. azure/ai/evaluation/_common/rai_service.py +3 -3
  6. azure/ai/evaluation/_common/utils.py +74 -17
  7. azure/ai/evaluation/_converters/_ai_services.py +60 -10
  8. azure/ai/evaluation/_converters/_models.py +75 -26
  9. azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +70 -22
  10. azure/ai/evaluation/_evaluate/_eval_run.py +14 -1
  11. azure/ai/evaluation/_evaluate/_evaluate.py +163 -44
  12. azure/ai/evaluation/_evaluate/_evaluate_aoai.py +79 -33
  13. azure/ai/evaluation/_evaluate/_utils.py +5 -2
  14. azure/ai/evaluation/_evaluators/_bleu/_bleu.py +1 -1
  15. azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +8 -1
  16. azure/ai/evaluation/_evaluators/_coherence/_coherence.py +3 -2
  17. azure/ai/evaluation/_evaluators/_common/_base_eval.py +143 -25
  18. azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +7 -2
  19. azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +19 -9
  20. azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +15 -5
  21. azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +4 -1
  22. azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +4 -1
  23. azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +5 -2
  24. azure/ai/evaluation/_evaluators/_content_safety/_violence.py +4 -1
  25. azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py +3 -0
  26. azure/ai/evaluation/_evaluators/_eci/_eci.py +3 -0
  27. azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +1 -1
  28. azure/ai/evaluation/_evaluators/_fluency/_fluency.py +3 -2
  29. azure/ai/evaluation/_evaluators/_gleu/_gleu.py +1 -1
  30. azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +114 -4
  31. azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +9 -3
  32. azure/ai/evaluation/_evaluators/_meteor/_meteor.py +1 -1
  33. azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +8 -1
  34. azure/ai/evaluation/_evaluators/_qa/_qa.py +1 -1
  35. azure/ai/evaluation/_evaluators/_relevance/_relevance.py +56 -3
  36. azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +140 -59
  37. azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +11 -3
  38. azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +3 -2
  39. azure/ai/evaluation/_evaluators/_rouge/_rouge.py +1 -1
  40. azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +2 -1
  41. azure/ai/evaluation/_evaluators/_similarity/_similarity.py +3 -2
  42. azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +24 -12
  43. azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +354 -66
  44. azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +214 -187
  45. azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +126 -31
  46. azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +8 -1
  47. azure/ai/evaluation/_evaluators/_xpia/xpia.py +4 -1
  48. azure/ai/evaluation/_exceptions.py +1 -0
  49. azure/ai/evaluation/_legacy/_batch_engine/_config.py +6 -3
  50. azure/ai/evaluation/_legacy/_batch_engine/_engine.py +115 -30
  51. azure/ai/evaluation/_legacy/_batch_engine/_result.py +2 -0
  52. azure/ai/evaluation/_legacy/_batch_engine/_run.py +2 -2
  53. azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +28 -31
  54. azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +2 -0
  55. azure/ai/evaluation/_version.py +1 -1
  56. azure/ai/evaluation/red_team/__init__.py +4 -3
  57. azure/ai/evaluation/red_team/_attack_objective_generator.py +17 -0
  58. azure/ai/evaluation/red_team/_callback_chat_target.py +14 -1
  59. azure/ai/evaluation/red_team/_evaluation_processor.py +376 -0
  60. azure/ai/evaluation/red_team/_mlflow_integration.py +322 -0
  61. azure/ai/evaluation/red_team/_orchestrator_manager.py +661 -0
  62. azure/ai/evaluation/red_team/_red_team.py +655 -2665
  63. azure/ai/evaluation/red_team/_red_team_result.py +6 -0
  64. azure/ai/evaluation/red_team/_result_processor.py +610 -0
  65. azure/ai/evaluation/red_team/_utils/__init__.py +34 -0
  66. azure/ai/evaluation/red_team/_utils/_rai_service_eval_chat_target.py +11 -4
  67. azure/ai/evaluation/red_team/_utils/_rai_service_true_false_scorer.py +6 -0
  68. azure/ai/evaluation/red_team/_utils/constants.py +0 -2
  69. azure/ai/evaluation/red_team/_utils/exception_utils.py +345 -0
  70. azure/ai/evaluation/red_team/_utils/file_utils.py +266 -0
  71. azure/ai/evaluation/red_team/_utils/formatting_utils.py +115 -13
  72. azure/ai/evaluation/red_team/_utils/metric_mapping.py +24 -4
  73. azure/ai/evaluation/red_team/_utils/progress_utils.py +252 -0
  74. azure/ai/evaluation/red_team/_utils/retry_utils.py +218 -0
  75. azure/ai/evaluation/red_team/_utils/strategy_utils.py +17 -4
  76. azure/ai/evaluation/simulator/_adversarial_simulator.py +14 -2
  77. azure/ai/evaluation/simulator/_indirect_attack_simulator.py +13 -1
  78. azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +21 -7
  79. azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +24 -5
  80. azure/ai/evaluation/simulator/_simulator.py +12 -0
  81. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/METADATA +63 -4
  82. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/RECORD +85 -76
  83. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/WHEEL +1 -1
  84. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info/licenses}/NOTICE.txt +0 -0
  85. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/top_level.txt +0 -0

azure/ai/evaluation/_evaluators/_similarity/_similarity.py

@@ -71,11 +71,11 @@ class SimilarityEvaluator(PromptyEvaluatorBase):
     _PROMPTY_FILE = "similarity.prompty"
     _RESULT_KEY = "similarity"
 
-    id = "similarity"
+    id = "azureai://built-in/evaluators/similarity"
     """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
 
     @override
-    def __init__(self, model_config, *, threshold=3):
+    def __init__(self, model_config, *, threshold=3, credential=None):
         current_dir = os.path.dirname(__file__)
         prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
         self._threshold = threshold
@@ -85,6 +85,7 @@ class SimilarityEvaluator(PromptyEvaluatorBase):
             prompty_file=prompty_path,
             result_key=self._RESULT_KEY,
             threshold=threshold,
+            credential=credential,
             _higher_is_better=self._higher_is_better,
         )
 
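The new `credential` keyword appears intended to let the evaluator authenticate with a token credential (for example, Entra ID via `azure.identity`) rather than an API key, and it defaults to `None` so existing callers are unaffected. A minimal usage sketch, with placeholder endpoint and deployment names (check the 1.11.0 reference docs for the exact call shape):

```python
# Hedged usage sketch for the new credential parameter (endpoint/deployment are placeholders).
from azure.ai.evaluation import AzureOpenAIModelConfiguration, SimilarityEvaluator
from azure.identity import DefaultAzureCredential

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    azure_deployment="<your-deployment>",                        # placeholder
)

# credential defaults to None, so code written against 1.9.0 keeps working unchanged.
similarity = SimilarityEvaluator(model_config=model_config, credential=DefaultAzureCredential())

result = similarity(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
    ground_truth="Paris",
)
print(result["similarity"])  # _RESULT_KEY is "similarity", per the hunk above
```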

azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py

@@ -3,16 +3,19 @@
 # ---------------------------------------------------------
 import os
 import math
+import logging
 from typing import Dict, Union, List, Optional
 
 from typing_extensions import overload, override
 
 from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
 from azure.ai.evaluation._evaluators._common import PromptyEvaluatorBase
-from azure.ai.evaluation._common.utils import parse_quality_evaluator_reason_score
+from ..._common.utils import reformat_conversation_history, reformat_agent_response, reformat_tool_definitions
 from azure.ai.evaluation._model_configurations import Message
 from azure.ai.evaluation._common._experimental import experimental
 
+logger = logging.getLogger(__name__)
+
 
 @experimental
 class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
@@ -62,15 +65,21 @@ class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
 
     _DEFAULT_TASK_ADHERENCE_SCORE = 3
 
-    id = None
+    id = "azureai://built-in/evaluators/task_adherence"
     """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
 
     @override
-    def __init__(self, model_config, *, threshold=_DEFAULT_TASK_ADHERENCE_SCORE, **kwargs):
+    def __init__(self, model_config, *, threshold=_DEFAULT_TASK_ADHERENCE_SCORE, credential=None, **kwargs):
         current_dir = os.path.dirname(__file__)
         prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
         self.threshold = threshold
-        super().__init__(model_config=model_config, prompty_file=prompty_path, result_key=self._RESULT_KEY, **kwargs)
+        super().__init__(
+            model_config=model_config,
+            prompty_file=prompty_path,
+            result_key=self._RESULT_KEY,
+            credential=credential,
+            **kwargs,
+        )
 
     @overload
     def __call__(
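Based on the constructor signature shown above, instantiation gains the same optional `credential` keyword as the other prompty-based evaluators. A sketch with placeholder config values, assuming TaskAdherenceEvaluator is still exported from the package root as in 1.9.0:

```python
# Hedged construction sketch mirroring the new __init__ signature (placeholders for config values).
from azure.ai.evaluation import AzureOpenAIModelConfiguration, TaskAdherenceEvaluator
from azure.identity import DefaultAzureCredential

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    azure_deployment="<your-deployment>",                        # placeholder
)

# threshold defaults to _DEFAULT_TASK_ADHERENCE_SCORE (3); credential is the new optional keyword.
task_adherence = TaskAdherenceEvaluator(
    model_config=model_config,
    threshold=3,
    credential=DefaultAzureCredential(),
)
```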
@@ -140,20 +149,23 @@ class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
                 category=ErrorCategory.MISSING_FIELD,
                 target=ErrorTarget.TASK_ADHERENCE_EVALUATOR,
             )
-
+        eval_input["query"] = reformat_conversation_history(eval_input["query"], logger, include_system_messages=True)
+        eval_input["response"] = reformat_agent_response(eval_input["response"], logger, include_tool_messages=True)
+        if "tool_definitions" in eval_input and eval_input["tool_definitions"] is not None:
+            eval_input["tool_definitions"] = reformat_tool_definitions(eval_input["tool_definitions"], logger)
         llm_output = await self._flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
-
-        score = math.nan
-        if llm_output:
-            score, reason = parse_quality_evaluator_reason_score(llm_output, valid_score_range="[1-5]")
-
+        if isinstance(llm_output, dict):
+            score = float(llm_output.get("score", math.nan))
             score_result = "pass" if score >= self.threshold else "fail"
-
+            reason = llm_output.get("explanation", "")
             return {
                 f"{self._result_key}": score,
                 f"{self._result_key}_result": score_result,
                 f"{self._result_key}_threshold": self.threshold,
                 f"{self._result_key}_reason": reason,
+                # Uncomment the following line in the next iteration after UI contracts are validated.
+                # f"{self._result_key}_additional_details": llm_output
             }
-
+        if logger:
+            logger.warning("LLM output is not a dictionary, returning NaN for the score.")
         return {self._result_key: math.nan}
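With the prompty now returning a JSON object, `_do_eval` reads `score` and `explanation` directly instead of regex-parsing free text, and falls back to NaN when the output is not a dict. A consumption sketch continuing the constructor example above (the `task_adherence` result-key prefix and the message-list input shape are assumptions to verify against the docs):

```python
# Hedged sketch: calling the evaluator and reading the result keys emitted above.
# The "task_adherence" prefix is assumed from the evaluator's result key.
query = [
    {"role": "system", "content": "You are a travel assistant. Always answer concisely."},
    {"role": "user", "content": "Plan a weekend itinerary in Paris."},
]
response = [
    {"role": "assistant", "content": "Day 1: Eiffel Tower and the Louvre. Day 2: Montmartre and a Seine cruise."},
]

row = task_adherence(query=query, response=response)

score = row["task_adherence"]            # float score; NaN if the LLM output was not a dict
verdict = row["task_adherence_result"]   # "pass" or "fail" against the configured threshold
reason = row["task_adherence_reason"]    # the model's "explanation" field
print(f"{verdict} ({score}): {reason}")
```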

azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty

@@ -1,6 +1,6 @@
 ---
 name: TaskAdherence
-description: Evaluates Task Adherence score for QA scenario
+description: Evaluates Task Adherence score
 model:
   api: chat
   parameters:
@@ -10,7 +10,7 @@ model:
     presence_penalty: 0
     frequency_penalty: 0
     response_format:
-      type: text
+      type: json_object
 
 
 inputs:
@@ -21,97 +21,385 @@ inputs:
     type: string
     optional: true
     default: "[]"
-
 ---
 system:
-# Instruction
-## Context
-### You are an expert in evaluating the quality of an answer from an intelligent system based on provided definitions and data. Your goal will involve answering the questions below using the information provided.
-- **Definition**: Based on the provided query, response, and tool definitions, evaluate the agent's adherence to the assigned task.
-- **Data**: Your input data includes query, response, and tool definitions.
-- **Questions**: To complete your evaluation you will be asked to evaluate the Data in different ways.
+You are an expert evaluator who scores how well an AI assistant executed the intended task based on system constraints and user requests.
+
+user:
+ROLE
+====
+You are Task-Adherence-Judge, an impartial evaluator who scores task execution quality.
+
+INPUT
+=====
+CONVERSATION_HISTORY: {{query}}
+AGENT_RESPONSE: {{response}}
+TOOL_DEFINITIONS: {{tool_definitions}}
+
+CONVERSATION_HISTORY includes the full dialogue. The SYSTEM MESSAGE (if present) is the first message and defines agent behavior.
+AGENT_RESPONSE is the agent's reply to the latest user query.
+TOOL_DEFINITIONS lists available tools.
+
+EVALUATION STEPS
+================
+
+A. Extract System Instructions (if system message exists):
+   - Identify any mandatory rules the agent must follow (e.g., “must”, “never”, “only”, or implied constraints).
+   - Note preferences that influence task execution (e.g., “should”, “prefer”, “avoid”).
+   - Record tool usage expectations (required, preferred, or restricted tools).
+
+B. Identify the User Request:
+   - Determine what the user explicitly wants: action, answer, or outcome.
+   - Use full conversation history if needed for clarity.
+
+C. Define the Intended Task:
+   - Combine the user request with mandatory system constraints.
+   - Adjust execution details based on preferences.
+   - This defines the correct task behavior.
+
+D. Evaluate Agent Execution:
+   1. Constraint Adherence:
+      - Were all mandatory rules followed?
+      - Any violation should cap the score at 3.
+
+   2. Task Completion:
+      - Did the agent fulfill the defined task?
+      - Were any core parts of the request missed?
+
+   3. Tool Usage Evaluation:
+      - Was tool use appropriate, as per instructions or task need?
+      - **System mandates tool**: Required (score ≤2 if not used correctly)
+      - **System prefers tool**: Preferred (minor score impact if not used)
+      - **System silent on tools**: Evaluate based on task effectiveness
+      - **System restricts tools**: Discouraged (score ≤2 if used when restricted)
+
+
+E. Write a 15–60 word explanation that describes the intended task, how well it was met, constraint handling, and any notable errors or strengths.
+
+F. Choose the single closest integer score (1,2,3,4 or 5) from the SCORING RUBRIC below.
+
+OUTPUT FORMAT
+=============
+Output a JSON object with the below keys in the given sequence:
+1) explanation: a concise 15–60 word summary of how well the assistant followed the task
+2) agent_perceived_task: what the assistant believed it had to do, based on its response;
+3) actual_task: what the task truly was, considering both system prompt and user latest query, with system taking precedence.
+4) tools_used: List of tools used by the agent
+5) expected_tool_calls: List of tools the agent should have used (if applicable)
+6) score: an integer score from 1 (very poor) to 5 (excellent) on how well the agent resolved the user's intent within constraints
+
+SCORING RUBRIC WITH EXAMPLES
+============================
+
+**Score 5 - Fully Adherent:**
+- Executed intended task completely
+- Followed all mandatory system rules
+- Used tools correctly when required by system or task necessity
+- User's need fully satisfied within constraints
+
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+User turn 1:
+Suggest a good mystery book.
+
+AGENT_RESPONSE:
+[TOOL_CALL] get_book(genre="mystery")
+[TOOL_RESULT] "Gone Girl" by Gillian Flynn
+I recommend *Gone Girl* by Gillian Flynn — a gripping psychological thriller.
+
+TOOL_DEFINITIONS:
+- get_book: Returns a book recommendation based on genre. (inputs: genre)
+
+EXPECTED OUTPUT:
+{
+  "explanation": "The assistant correctly identified the task, used the appropriate tool as needed, and provided a clear and suitable book recommendation.",
+  "agent_perceived_task": "recommend a mystery book",
+  "actual_task": "Suggest a good mystery book",
+  "tools_used": ["get_book"],
+  "expected_tool_calls": ["get_book"],
+  "score": 5
+}
+
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Do not use tools unless explicitly requested by the user.
+
+User turn 1:
+Can you recommend a good book?
+
+Agent turn 1:
+Any preferred genre?
+
+User turn 2:
+Historical fiction.
+
+AGENT_RESPONSE:
+You might enjoy *The Book Thief* by Markus — a powerful story set during World War II.
+
+TOOL_DEFINITIONS:
+- get_book: Returns book suggestions based on genre. (inputs: genre)
+
+EXPECTED OUTPUT:
+{
+  "explanation": "The assistant correctly followed the system instruction by not using tools since the user never explicitly requested one. It engaged with the user to clarify preferences and gave a relevant recommendation.",
+  "agent_perceived_task": "recommend a historical fiction book without tool use",
+  "actual_task": "Recommend a historical fiction book (no tool allowed unless explicitly requested)",
+  "tools_used": [],
+  "expected_tool_calls": [],
+  "score": 5
+}
+
+**Score 4 - Mostly Adherent:**
+- Executed core intended task
+- Followed mandatory system rules
+- Minor execution gaps that don't prevent task completion
+- Tool usage mostly appropriate
+
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+You must respond clearly and concisely using available tools. Minor omissions are acceptable.
+
+User turn 1:
+What’s the weather like in Paris?
+
+AGENT_RESPONSE:
+[TOOL_CALL] get_weather(city="Paris")
+[TOOL_RESULT] 22°C, partly cloudy
+Paris is currently 22°C and partly cloudy.
+
+TOOL_DEFINITIONS:
+- get_weather: Provides current weather details for a city. (inputs: city)
+
+EXPECTED OUTPUT:
+{
+  "explanation": "The assistant correctly used the expected tool and completed the core task. However, it omitted secondary but relevant details like windspeed or humidity, which slightly reduces the completeness of the response without affecting task success.",
+  "agent_perceived_task": "get weather for Paris",
+  "actual_task": "Provide current weather in Paris",
+  "tools_used": ["get_weather"],
+  "expected_tool_calls": ["get_weather"],
+  "score": 4
+}
+
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Always ask for genre before making a book suggestion.
+
+User turn 1:
+Recommend a good book to read.
+
+Agent turn 1:
+Any preferred genre?
+
+User turn 2:
+Something romantic.
+
+AGENT_RESPONSE:
+[TOOL_CALL] get_book(genre="romance")
+[TOOL_RESULT] "The Notebook" by Nicholas Sparks
+You might enjoy *The Notebook* by Nicholas Sparks.
+
+TOOL_DEFINITIONS:
+- get_book: Returns book suggestions based on genre. (inputs: genre)
+
+EXPECTED OUTPUT:
+{
+  "explanation": "Assistant followed system rule to ask for genre, used the correct tool, and offered a relevant result, though without justification.",
+  "agent_perceived_task": "recommend a romance novel",
+  "actual_task": "Recommend a romance book (after genre clarification as per system)",
+  "tools_used": ["get_book"],
+  "expected_tool_calls": ["get_book"],
+  "score": 4
+}
+
+**Score 3 - Partially Adherent:**
+- Attempted intended task but missed key components
+- Some system rule violations OR incomplete task execution
+- Tool usage suboptimal or missing when preferred by system
+- User's need partially satisfied
+
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Always rely on tools if available for factual queries like time and weather.
+
+User turn 1:
+What’s the weather in Sydney?
+
+AGENT_RESPONSE:
+It's probably sunny this time of year.
+
+TOOL_DEFINITIONS:
+- get_book: Returns book suggestions based on genre. (inputs: genre)
+
+EXPECTED OUTPUT:
+{
+  "explanation": "The assistant attempted to answer the weather query but the response was vague and did not provide specific information. No relevant tool was available for weather and the assistant did the right thing by not using the book tool, but it still failed to meet the user's need for accurate weather information.",
+  "agent_perceived_task": "estimate Sydney's weather",
+  "actual_task": "Get real-time weather in Sydney (tool use required per system)",
+  "tools_used": [],
+  "expected_tool_calls": [],
+  "score": 3
+}
+
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+If tools are available, prefer them for recommendations.
+
+User turn 1:
+Can you suggest a book?
+
+Agent turn 1:
+Sure, any specific genre?
+
+User turn 2:
+Thriller.
+
+AGENT_RESPONSE:
+You might enjoy *The Da Vinci Code*.
 
-# Definition
+TOOL_DEFINITIONS:
+- get_book: Get book suggestions by genre. (inputs: genre)
 
-**Level 1: Fully Inadherent**
+EXPECTED OUTPUT:
+{
+  "explanation": "Assistant gave a relevant suggestion but skipped the tool, despite system prompt recommending its use for consistency.",
+  "agent_perceived_task": "recommend thriller book",
+  "actual_task": "Suggest a thriller book (prefer tool per system)",
+  "tools_used": [],
+  "expected_tool_calls": ["get_book"],
+  "score": 3
+}
 
-**Definition:**
-Response completely ignores instructions or deviates significantly
+**Score 2 - Barely Adherent:**
+- Minimal task execution
+- Violated mandatory system rules OR failed core task requirements
+- Wrong tool usage or missing when required by system
+- User's need largely unmet
 
-**Example:**
-**Query:** What is a recommended weekend itinerary in Paris?
-**Response:** Paris is a lovely city with a rich history.
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+You are a tool-based agent. Always respond using tools.
 
-Explanation: This response completely misses the task by not providing any itinerary details. It offers a generic statement about Paris rather than a structured travel plan.
+User turn 1:
+What's the time in Madrid?
 
+AGENT_RESPONSE:
+Europe has different time zones.
 
-**Level 2: Barely Adherent**
+TOOL_DEFINITIONS:
+- get_time: Return current time for a city. (inputs: city)
 
-**Definition:**
-Response partially aligns with instructions but has critical gaps.
+EXPECTED OUTPUT:
+{
+  "explanation": "Assistant responded vaguely and violated the system rule requiring tool usage. Failed to directly answer the task.",
+  "agent_perceived_task": "talk about time zones",
+  "actual_task": "Tell the current time in Madrid (using tool as required by system)",
+  "tools_used": [],
+  "expected_tool_calls": ["get_time"],
+  "score": 2
+}
 
-**Example:**
-**Query:** What is a recommended weekend itinerary in Paris?
-**Response:** Spend your weekend visiting famous places in Paris.
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+You are a book recommender assistant. Do **not** ask follow-up questions. Only respond with direct suggestions using tools.
 
-Explanation: While the response hints at visiting well-known sites, it is extremely vague and lacks specific details, such as which sites to visit or any order of activities, leaving major gaps in the instructions.
+User turn 1:
+Recommend a good fantasy book.
 
+Agent turn 1:
+Do you prefer magic or adventure?
 
-**Level 3: Moderately Adherent**
+User turn 2:
+Magic.
 
-**Definition:**
-Response meets the core requirements but lacks precision or clarity.
+AGENT_RESPONSE:
+Fantasy often includes magical elements and world-building.
 
-**Example:**
-**Query:** What is a recommended weekend itinerary in Paris?
-**Response:** Visit the Eiffel Tower and the Louvre on Saturday, and stroll through Montmartre on Sunday.
+TOOL_DEFINITIONS:
+- get_book: Retrieve books filtered by genre. (inputs: genre)
 
-Explanation: This answer meets the basic requirement by naming a few key attractions and assigning them to specific days. However, it lacks additional context, such as timings, additional activities, or details to make the itinerary practical and clear.
+EXPECTED OUTPUT:
+{
+  "explanation": "The assistant violated the system's instruction not to ask clarifying questions and failed to use the required tool. Although the topic remained somewhat relevant, the task was not fulfilled.",
+  "agent_perceived_task": "discuss fantasy genre",
+  "actual_task": "Recommend a fantasy book using tool without asking follow-ups (per system)",
+  "tools_used": [],
+  "expected_tool_calls": ["get_book"],
+  "score": 2
+}
 
+**Score 1 - Non-Adherent:**
+- Failed to execute intended task
+- Ignored mandatory system constraints
+- Completely inappropriate or missing tool usage
+- Response irrelevant to user's need
 
-**Level 4: Mostly Adherent**
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Only answer factual questions using tools. Avoid personal commentary.
 
-**Definition:**
-Response is clear, accurate, and aligns with instructions with minor issues.
+User turn 1:
+What's the current time in Berlin?
 
-**Example:**
-**Query:** What is a recommended weekend itinerary in Paris?
-**Response:** For a weekend in Paris, start Saturday with a morning visit to the Eiffel Tower, then head to the Louvre in the early afternoon. In the evening, enjoy a leisurely walk along the Seine. On Sunday, begin with a visit to Notre-Dame Cathedral, followed by exploring the art and cafés in Montmartre. This plan offers a mix of cultural visits and relaxing experiences.
+AGENT_RESPONSE:
+I prefer mornings to evenings. How about you?
 
-Explanation: This response is clear, structured, and provides a concrete itinerary with specific attractions and a suggested order of activities. It is accurate and useful, though it might benefit from a few more details like exact timings or restaurant suggestions to be perfect.
+TOOL_DEFINITIONS:
+- get_time: Return time for a given city. (inputs: city)
 
+EXPECTED OUTPUT:
+{
+  "explanation": "Assistant ignored both the user's factual query and the system's constraint to avoid personal remarks. No tool used.",
+  "agent_perceived_task": "chat about preferences",
+  "actual_task": "Tell the current time in Berlin (using tool, avoid chit-chat as per system)",
+  "tools_used": [],
+  "expected_tool_calls": ["get_time"],
+  "score": 1
+}
 
-**Level 5: Fully Adherent**
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Only provide weather updates using tools. Do not include personal opinions.
 
-**Definition:**
-Response is flawless, accurate, and follows instructions to the letter.
+User turn 1:
+I need a weather update for Rome.
 
-**Example:**
-**Query:** What is a recommended weekend itinerary in Paris?
-**Response:** Here is a detailed weekend itinerary in Paris:
-Saturday:
-Morning: Begin your day with a visit to the Eiffel Tower to admire the views from the top.
-Early Afternoon: Head to the Louvre for a guided tour of its most famous exhibits.
-Late Afternoon: Take a relaxing walk along the Seine, stopping at local boutiques.
-Evening: Enjoy dinner at a classic Parisian bistro near the river.
-Sunday:
-Morning: Visit the Notre-Dame Cathedral to explore its architecture and history.
-Midday: Wander the charming streets of Montmartre, stopping by art galleries and cafés.
-Afternoon: Finish your trip with a scenic boat tour on the Seine.
-This itinerary balances cultural immersion, leisure, and local dining experiences, ensuring a well-rounded visit.
+Agent turn 1:
+Celsius or Fahrenheit?
 
-Explanation: This response is comprehensive and meticulously follows the instructions. It provides detailed steps, timings, and a variety of activities that fully address the query, leaving no critical gaps.
+User turn 2:
+Celsius.
 
-# Data
-Query: {{query}}
-Response: {{response}}
-Tool Definitions: {{tool_definitions}}
+AGENT_RESPONSE:
+I once visited Rome in summer. Beautiful city!
 
-# Tasks
-## Please provide your assessment Score for the previous answer. Your output should include the following information:
-- **ThoughtChain**: To improve the reasoning process, Think Step by Step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and Start your ThoughtChain with "Let's think step by step:".
-- **Explanation**: a very short explanation of why you think the input data should get that Score.
-- **Score**: based on your previous analysis, provide your Score. The answer you give MUST be an integer score ("1", "2", ...) based on the categories of the definitions.
+TOOL_DEFINITIONS:
+- get_weather: Provides weather data for a given city. (inputs: city, unit)
 
-## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your score</S2>.
-# Output
+EXPECTED OUTPUT:
+{
+  "explanation": "Assistant acknowledged the input but failed to execute the task or use the required tool, and added personal opinion against system guidance.",
+  "agent_perceived_task": "share travel experience",
+  "actual_task": "Provide current weather in Celsius for Rome (using tool, no personal commentary)",
+  "tools_used": [],
+  "expected_tool_calls": ["get_weather"],
+  "score": 1
+}
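
Taken together, the prompty rewrite and the dict-based parsing change only the evaluator's internals; batch evaluation should remain the usual `evaluate()` call. A closing sketch (the JSONL file and its column names are hypothetical):

```python
# Hedged end-to-end sketch: running the updated evaluator over a dataset with evaluate().
# "agent_runs.jsonl" and its columns ("query", "response", "tool_definitions") are made-up placeholders.
from azure.ai.evaluation import evaluate

results = evaluate(
    data="agent_runs.jsonl",
    evaluators={"task_adherence": task_adherence},  # instance from the construction sketch above
    output_path="task_adherence_results.json",
)
print(results["metrics"])  # aggregated metrics; per-row results are under results["rows"]
```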