azure-ai-evaluation 1.8.0__py3-none-any.whl → 1.10.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Note: this release of azure-ai-evaluation has been flagged as potentially problematic.
- azure/ai/evaluation/__init__.py +51 -6
- azure/ai/evaluation/_aoai/__init__.py +1 -1
- azure/ai/evaluation/_aoai/aoai_grader.py +21 -11
- azure/ai/evaluation/_aoai/label_grader.py +3 -2
- azure/ai/evaluation/_aoai/python_grader.py +84 -0
- azure/ai/evaluation/_aoai/score_model_grader.py +91 -0
- azure/ai/evaluation/_aoai/string_check_grader.py +3 -2
- azure/ai/evaluation/_aoai/text_similarity_grader.py +3 -2
- azure/ai/evaluation/_azure/_envs.py +9 -10
- azure/ai/evaluation/_azure/_token_manager.py +7 -1
- azure/ai/evaluation/_common/constants.py +11 -2
- azure/ai/evaluation/_common/evaluation_onedp_client.py +32 -26
- azure/ai/evaluation/_common/onedp/__init__.py +32 -32
- azure/ai/evaluation/_common/onedp/_client.py +136 -139
- azure/ai/evaluation/_common/onedp/_configuration.py +70 -73
- azure/ai/evaluation/_common/onedp/_patch.py +21 -21
- azure/ai/evaluation/_common/onedp/_utils/__init__.py +6 -0
- azure/ai/evaluation/_common/onedp/_utils/model_base.py +1232 -0
- azure/ai/evaluation/_common/onedp/_utils/serialization.py +2032 -0
- azure/ai/evaluation/_common/onedp/_validation.py +50 -50
- azure/ai/evaluation/_common/onedp/_version.py +9 -9
- azure/ai/evaluation/_common/onedp/aio/__init__.py +29 -29
- azure/ai/evaluation/_common/onedp/aio/_client.py +138 -143
- azure/ai/evaluation/_common/onedp/aio/_configuration.py +70 -75
- azure/ai/evaluation/_common/onedp/aio/_patch.py +21 -21
- azure/ai/evaluation/_common/onedp/aio/operations/__init__.py +37 -39
- azure/ai/evaluation/_common/onedp/aio/operations/_operations.py +4832 -4494
- azure/ai/evaluation/_common/onedp/aio/operations/_patch.py +21 -21
- azure/ai/evaluation/_common/onedp/models/__init__.py +168 -142
- azure/ai/evaluation/_common/onedp/models/_enums.py +230 -162
- azure/ai/evaluation/_common/onedp/models/_models.py +2685 -2228
- azure/ai/evaluation/_common/onedp/models/_patch.py +21 -21
- azure/ai/evaluation/_common/onedp/operations/__init__.py +37 -39
- azure/ai/evaluation/_common/onedp/operations/_operations.py +6106 -5657
- azure/ai/evaluation/_common/onedp/operations/_patch.py +21 -21
- azure/ai/evaluation/_common/rai_service.py +88 -52
- azure/ai/evaluation/_common/raiclient/__init__.py +1 -1
- azure/ai/evaluation/_common/raiclient/operations/_operations.py +14 -1
- azure/ai/evaluation/_common/utils.py +188 -10
- azure/ai/evaluation/_constants.py +2 -1
- azure/ai/evaluation/_converters/__init__.py +1 -1
- azure/ai/evaluation/_converters/_ai_services.py +9 -8
- azure/ai/evaluation/_converters/_models.py +46 -0
- azure/ai/evaluation/_converters/_sk_services.py +495 -0
- azure/ai/evaluation/_eval_mapping.py +2 -2
- azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +73 -25
- azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +2 -2
- azure/ai/evaluation/_evaluate/_evaluate.py +210 -94
- azure/ai/evaluation/_evaluate/_evaluate_aoai.py +132 -89
- azure/ai/evaluation/_evaluate/_telemetry/__init__.py +0 -1
- azure/ai/evaluation/_evaluate/_utils.py +25 -17
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +4 -4
- azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +20 -12
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +6 -6
- azure/ai/evaluation/_evaluators/_common/_base_eval.py +45 -11
- azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +24 -9
- azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +24 -9
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +28 -18
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +11 -8
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +11 -8
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +12 -9
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +10 -7
- azure/ai/evaluation/_evaluators/_document_retrieval/__init__.py +1 -5
- azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py +37 -64
- azure/ai/evaluation/_evaluators/_eci/_eci.py +6 -3
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +5 -5
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +3 -3
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +4 -4
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +12 -8
- azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +31 -26
- azure/ai/evaluation/_evaluators/_intent_resolution/intent_resolution.prompty +210 -96
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +3 -4
- azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +14 -7
- azure/ai/evaluation/_evaluators/_qa/_qa.py +5 -5
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +62 -15
- azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +140 -59
- azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +21 -26
- azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +5 -5
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +22 -22
- azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +7 -6
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +4 -4
- azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +27 -24
- azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +354 -66
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +175 -183
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +99 -21
- azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +20 -12
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +10 -7
- azure/ai/evaluation/_exceptions.py +10 -0
- azure/ai/evaluation/_http_utils.py +3 -3
- azure/ai/evaluation/_legacy/_batch_engine/_config.py +6 -3
- azure/ai/evaluation/_legacy/_batch_engine/_engine.py +117 -32
- azure/ai/evaluation/_legacy/_batch_engine/_openai_injector.py +5 -2
- azure/ai/evaluation/_legacy/_batch_engine/_result.py +2 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run.py +2 -2
- azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +33 -41
- azure/ai/evaluation/_legacy/_batch_engine/_utils.py +1 -4
- azure/ai/evaluation/_legacy/_common/_async_token_provider.py +12 -19
- azure/ai/evaluation/_legacy/_common/_thread_pool_executor_with_context.py +2 -0
- azure/ai/evaluation/_legacy/prompty/_prompty.py +11 -5
- azure/ai/evaluation/_safety_evaluation/__init__.py +1 -1
- azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +195 -111
- azure/ai/evaluation/_user_agent.py +32 -1
- azure/ai/evaluation/_version.py +1 -1
- azure/ai/evaluation/red_team/__init__.py +3 -1
- azure/ai/evaluation/red_team/_agent/__init__.py +1 -1
- azure/ai/evaluation/red_team/_agent/_agent_functions.py +68 -71
- azure/ai/evaluation/red_team/_agent/_agent_tools.py +103 -145
- azure/ai/evaluation/red_team/_agent/_agent_utils.py +26 -6
- azure/ai/evaluation/red_team/_agent/_semantic_kernel_plugin.py +62 -71
- azure/ai/evaluation/red_team/_attack_objective_generator.py +94 -52
- azure/ai/evaluation/red_team/_attack_strategy.py +2 -1
- azure/ai/evaluation/red_team/_callback_chat_target.py +4 -9
- azure/ai/evaluation/red_team/_default_converter.py +1 -1
- azure/ai/evaluation/red_team/_red_team.py +1947 -1040
- azure/ai/evaluation/red_team/_red_team_result.py +49 -38
- azure/ai/evaluation/red_team/_utils/__init__.py +1 -1
- azure/ai/evaluation/red_team/_utils/_rai_service_eval_chat_target.py +39 -34
- azure/ai/evaluation/red_team/_utils/_rai_service_target.py +163 -138
- azure/ai/evaluation/red_team/_utils/_rai_service_true_false_scorer.py +14 -14
- azure/ai/evaluation/red_team/_utils/constants.py +1 -13
- azure/ai/evaluation/red_team/_utils/formatting_utils.py +41 -44
- azure/ai/evaluation/red_team/_utils/logging_utils.py +17 -17
- azure/ai/evaluation/red_team/_utils/metric_mapping.py +31 -4
- azure/ai/evaluation/red_team/_utils/strategy_utils.py +33 -25
- azure/ai/evaluation/simulator/_adversarial_scenario.py +2 -0
- azure/ai/evaluation/simulator/_adversarial_simulator.py +31 -17
- azure/ai/evaluation/simulator/_conversation/__init__.py +2 -2
- azure/ai/evaluation/simulator/_direct_attack_simulator.py +8 -8
- azure/ai/evaluation/simulator/_indirect_attack_simulator.py +18 -6
- azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +54 -24
- azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +7 -1
- azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +30 -10
- azure/ai/evaluation/simulator/_model_tools/_rai_client.py +19 -31
- azure/ai/evaluation/simulator/_model_tools/_template_handler.py +20 -6
- azure/ai/evaluation/simulator/_model_tools/models.py +1 -1
- azure/ai/evaluation/simulator/_simulator.py +21 -8
- {azure_ai_evaluation-1.8.0.dist-info → azure_ai_evaluation-1.10.0.dist-info}/METADATA +46 -3
- {azure_ai_evaluation-1.8.0.dist-info → azure_ai_evaluation-1.10.0.dist-info}/RECORD +141 -136
- azure/ai/evaluation/_common/onedp/aio/_vendor.py +0 -40
- {azure_ai_evaluation-1.8.0.dist-info → azure_ai_evaluation-1.10.0.dist-info}/NOTICE.txt +0 -0
- {azure_ai_evaluation-1.8.0.dist-info → azure_ai_evaluation-1.10.0.dist-info}/WHEEL +0 -0
- {azure_ai_evaluation-1.8.0.dist-info → azure_ai_evaluation-1.10.0.dist-info}/top_level.txt +0 -0
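The hunks reproduced below cover three of the changed files: the similarity evaluator, the task adherence evaluator, and its prompty template. As a quick orientation after upgrading, the sketch below checks the installed version with the standard library and reads the new cloud evaluator identifiers introduced in those hunks; it assumes both evaluator classes are importable from the package root, as their paths in the list above suggest.

from importlib.metadata import version

# Confirm which build is installed (uses importlib.metadata, not this package's own API).
print(version("azure-ai-evaluation"))  # expect "1.10.0" after upgrading

# Assumed to be re-exported from the package root, as their module paths suggest.
from azure.ai.evaluation import SimilarityEvaluator, TaskAdherenceEvaluator

# The new cloud-evaluation identifiers set in the hunks below.
print(SimilarityEvaluator.id)     # azureai://built-in/evaluators/similarity
print(TaskAdherenceEvaluator.id)  # azureai://built-in/evaluators/task_adherence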
azure/ai/evaluation/_evaluators/_similarity/_similarity.py

@@ -41,13 +41,13 @@ class SimilarityEvaluator(PromptyEvaluatorBase):
 :caption: Initialize and call a SimilarityEvaluator with a four-gram rouge type.

 .. admonition:: Example using Azure AI Project URL:
-
+
 .. literalinclude:: ../samples/evaluation_samples_evaluate_fdp.py
 :start-after: [START similarity_evaluator]
 :end-before: [END similarity_evaluator]
 :language: python
 :dedent: 8
-:caption: Initialize and call SimilarityEvaluator using Azure AI Project URL in the following format
+:caption: Initialize and call SimilarityEvaluator using Azure AI Project URL in the following format
 https://{resource_name}.services.ai.azure.com/api/projects/{project_name}

 .. admonition:: Example:

@@ -71,7 +71,7 @@ class SimilarityEvaluator(PromptyEvaluatorBase):
 _PROMPTY_FILE = "similarity.prompty"
 _RESULT_KEY = "similarity"

-id = "similarity"
+id = "azureai://built-in/evaluators/similarity"
 """Evaluator identifier, experimental and to be used only with evaluation in cloud."""

 @override

@@ -85,7 +85,7 @@ class SimilarityEvaluator(PromptyEvaluatorBase):
 prompty_file=prompty_path,
 result_key=self._RESULT_KEY,
 threshold=threshold,
-_higher_is_better=self._higher_is_better
+_higher_is_better=self._higher_is_better,
 )

 # Ignoring a mypy error about having only 1 overload function.
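The only functional change in this file is the new cloud evaluator identifier; local usage is unchanged. A minimal sketch of that local usage, assuming the documented call shape (a model configuration dict at construction, then query/response/ground_truth keywords) and the existing default threshold of 3:

from azure.ai.evaluation import SimilarityEvaluator

# Assumed Azure OpenAI model configuration; replace the placeholders with real values.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

similarity = SimilarityEvaluator(model_config, threshold=3)
result = similarity(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital.",
)
# Expect a "similarity" score plus threshold and pass/fail fields alongside it.
print(result)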
azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py

@@ -3,16 +3,20 @@
 # ---------------------------------------------------------
 import os
 import math
+import logging
 from typing import Dict, Union, List, Optional

 from typing_extensions import overload, override

 from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
 from azure.ai.evaluation._evaluators._common import PromptyEvaluatorBase
-from
+from ..._common.utils import reformat_conversation_history, reformat_agent_response, reformat_tool_definitions
 from azure.ai.evaluation._model_configurations import Message
 from azure.ai.evaluation._common._experimental import experimental

+logger = logging.getLogger(__name__)
+
+
 @experimental
 class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
 """The Task Adherence evaluator assesses how well an AI-generated response follows the assigned task based on:

@@ -42,15 +46,15 @@ class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
 :language: python
 :dedent: 8
 :caption: Initialize and call an TaskAdherenceEvaluator with a query and response.
-
+
 .. admonition:: Example using Azure AI Project URL:
-
+
 .. literalinclude:: ../samples/evaluation_samples_evaluate_fdp.py
 :start-after: [START task_adherence_evaluator]
 :end-before: [END task_adherence_evaluator]
 :language: python
 :dedent: 8
-:caption: Initialize and call TaskAdherenceEvaluator using Azure AI Project URL in the following format
+:caption: Initialize and call TaskAdherenceEvaluator using Azure AI Project URL in the following format
 https://{resource_name}.services.ai.azure.com/api/projects/{project_name}

 """

@@ -61,18 +65,15 @@ class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):

 _DEFAULT_TASK_ADHERENCE_SCORE = 3

-id =
+id = "azureai://built-in/evaluators/task_adherence"
 """Evaluator identifier, experimental and to be used only with evaluation in cloud."""

 @override
-def __init__(self, model_config, *, threshold=_DEFAULT_TASK_ADHERENCE_SCORE,
-**kwargs):
+def __init__(self, model_config, *, threshold=_DEFAULT_TASK_ADHERENCE_SCORE, **kwargs):
 current_dir = os.path.dirname(__file__)
 prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
 self.threshold = threshold
-super().__init__(model_config=model_config, prompty_file=prompty_path,
-result_key=self._RESULT_KEY,
-**kwargs)
+super().__init__(model_config=model_config, prompty_file=prompty_path, result_key=self._RESULT_KEY, **kwargs)

 @overload
 def __call__(

@@ -85,7 +86,7 @@ class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
 """Evaluate task adherence for a given query, response, and optional tool defintions.
 The query and response can be either a string or a list of messages.

-
+
 Example with string inputs and no tools:
 evaluator = TaskAdherenceEvaluator(model_config)
 query = "What is the weather today?"

@@ -113,9 +114,9 @@ class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):

 @override
 def __call__(  # pylint: disable=docstring-missing-param
-
-
-
+self,
+*args,
+**kwargs,
 ):
 """
 Invokes the instance using the overloaded __call__ signature.

@@ -142,21 +143,23 @@ class TaskAdherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
 category=ErrorCategory.MISSING_FIELD,
 target=ErrorTarget.TASK_ADHERENCE_EVALUATOR,
 )
-
+eval_input["query"] = reformat_conversation_history(eval_input["query"], logger, include_system_messages=True)
+eval_input["response"] = reformat_agent_response(eval_input["response"], logger, include_tool_messages=True)
+if "tool_definitions" in eval_input and eval_input["tool_definitions"] is not None:
+eval_input["tool_definitions"] = reformat_tool_definitions(eval_input["tool_definitions"], logger)
 llm_output = await self._flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
-
-
-
-
-
-score_result = 'pass' if score >= self.threshold else 'fail'
-
+if isinstance(llm_output, dict):
+score = float(llm_output.get("score", math.nan))
+score_result = "pass" if score >= self.threshold else "fail"
+reason = llm_output.get("explanation", "")
 return {
 f"{self._result_key}": score,
 f"{self._result_key}_result": score_result,
 f"{self._result_key}_threshold": self.threshold,
 f"{self._result_key}_reason": reason,
+# Uncomment the following line in the next iteration after UI contracts are validated.
+# f"{self._result_key}_additional_details": llm_output
 }
-
+if logger:
+logger.warning("LLM output is not a dictionary, returning NaN for the score.")
 return {self._result_key: math.nan}
-
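With the prompty now returning a JSON object (see the template diff that follows), the evaluator parses "score" and "explanation" out of that dictionary and falls back to NaN otherwise. A sketch of the call pattern from the docstring above, with the result shape implied by the return block; the "task_adherence" key prefix is assumed from the class's _RESULT_KEY:

from azure.ai.evaluation import TaskAdherenceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

task_adherence = TaskAdherenceEvaluator(model_config)  # threshold defaults to 3
result = task_adherence(
    query="What is the weather today?",
    response="The weather is sunny with a high of 25C.",  # illustrative response text
)

# Implied result shape (keys mirror the return block in the hunk above):
# {
#     "task_adherence": 5.0,
#     "task_adherence_result": "pass",       # "pass" when score >= threshold
#     "task_adherence_threshold": 3,
#     "task_adherence_reason": "<explanation returned by the judge model>",
# }
print(result)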
azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty

@@ -1,6 +1,6 @@
 ---
 name: TaskAdherence
-description: Evaluates Task Adherence score
+description: Evaluates Task Adherence score
 model:
 api: chat
 parameters:

@@ -10,7 +10,7 @@ model:
 presence_penalty: 0
 frequency_penalty: 0
 response_format:
-type:
+type: json_object

 inputs:
 query:

@@ -21,97 +21,385 @@ inputs:
 type: string
 optional: true
 default: "[]"
-
 ---
 system:
-
-
-
-
-
-
+You are an expert evaluator who scores how well an AI assistant executed the intended task based on system constraints and user requests.
+
+user:
+ROLE
+====
+You are Task-Adherence-Judge, an impartial evaluator who scores task execution quality.
+
+INPUT
+=====
+CONVERSATION_HISTORY: {{query}}
+AGENT_RESPONSE: {{response}}
+TOOL_DEFINITIONS: {{tool_definitions}}
+
+CONVERSATION_HISTORY includes the full dialogue. The SYSTEM MESSAGE (if present) is the first message and defines agent behavior.
+AGENT_RESPONSE is the agent's reply to the latest user query.
+TOOL_DEFINITIONS lists available tools.
+
+EVALUATION STEPS
+================
+
+A. Extract System Instructions (if system message exists):
+- Identify any mandatory rules the agent must follow (e.g., “must”, “never”, “only”, or implied constraints).
+- Note preferences that influence task execution (e.g., “should”, “prefer”, “avoid”).
+- Record tool usage expectations (required, preferred, or restricted tools).
+
+B. Identify the User Request:
+- Determine what the user explicitly wants: action, answer, or outcome.
+- Use full conversation history if needed for clarity.
+
+C. Define the Intended Task:
+- Combine the user request with mandatory system constraints.
+- Adjust execution details based on preferences.
+- This defines the correct task behavior.
+
+D. Evaluate Agent Execution:
+1. Constraint Adherence:
+- Were all mandatory rules followed?
+- Any violation should cap the score at 3.
+
+2. Task Completion:
+- Did the agent fulfill the defined task?
+- Were any core parts of the request missed?
+
+3. Tool Usage Evaluation:
+- Was tool use appropriate, as per instructions or task need?
+- **System mandates tool**: Required (score ≤2 if not used correctly)
+- **System prefers tool**: Preferred (minor score impact if not used)
+- **System silent on tools**: Evaluate based on task effectiveness
+- **System restricts tools**: Discouraged (score ≤2 if used when restricted)
+
+
+E. Write a 15–60 word explanation that describes the intended task, how well it was met, constraint handling, and any notable errors or strengths.
+
+F. Choose the single closest integer score (1,2,3,4 or 5) from the SCORING RUBRIC below.
+
+OUTPUT FORMAT
+=============
+Output a JSON object with the below keys in the given sequence:
+1) explanation: a concise 15–60 word summary of how well the assistant followed the task
+2) agent_perceived_task: what the assistant believed it had to do, based on its response;
+3) actual_task: what the task truly was, considering both system prompt and user latest query, with system taking precedence.
+4) tools_used: List of tools used by the agent
+5) expected_tool_calls: List of tools the agent should have used (if applicable)
+6) score: an integer score from 1 (very poor) to 5 (excellent) on how well the agent resolved the user's intent within constraints
+
+SCORING RUBRIC WITH EXAMPLES
+============================
+
+**Score 5 - Fully Adherent:**
+- Executed intended task completely
+- Followed all mandatory system rules
+- Used tools correctly when required by system or task necessity
+- User's need fully satisfied within constraints
+
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+User turn 1:
+Suggest a good mystery book.
+
+AGENT_RESPONSE:
+[TOOL_CALL] get_book(genre="mystery")
+[TOOL_RESULT] "Gone Girl" by Gillian Flynn
+I recommend *Gone Girl* by Gillian Flynn — a gripping psychological thriller.
+
+TOOL_DEFINITIONS:
+- get_book: Returns a book recommendation based on genre. (inputs: genre)
+
+EXPECTED OUTPUT:
+{
+"explanation": "The assistant correctly identified the task, used the appropriate tool as needed, and provided a clear and suitable book recommendation.",
+"agent_perceived_task": "recommend a mystery book",
+"actual_task": "Suggest a good mystery book",
+"tools_used": ["get_book"],
+"expected_tool_calls": ["get_book"],
+"score": 5
+}
+
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Do not use tools unless explicitly requested by the user.
+
+User turn 1:
+Can you recommend a good book?
+
+Agent turn 1:
+Any preferred genre?
+
+User turn 2:
+Historical fiction.
+
+AGENT_RESPONSE:
+You might enjoy *The Book Thief* by Markus — a powerful story set during World War II.
+
+TOOL_DEFINITIONS:
+- get_book: Returns book suggestions based on genre. (inputs: genre)
+
+EXPECTED OUTPUT:
+{
+"explanation": "The assistant correctly followed the system instruction by not using tools since the user never explicitly requested one. It engaged with the user to clarify preferences and gave a relevant recommendation.",
+"agent_perceived_task": "recommend a historical fiction book without tool use",
+"actual_task": "Recommend a historical fiction book (no tool allowed unless explicitly requested)",
+"tools_used": [],
+"expected_tool_calls": [],
+"score": 5
+}
+
+**Score 4 - Mostly Adherent:**
+- Executed core intended task
+- Followed mandatory system rules
+- Minor execution gaps that don't prevent task completion
+- Tool usage mostly appropriate
+
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+You must respond clearly and concisely using available tools. Minor omissions are acceptable.
+
+User turn 1:
+What’s the weather like in Paris?
+
+AGENT_RESPONSE:
+[TOOL_CALL] get_weather(city="Paris")
+[TOOL_RESULT] 22°C, partly cloudy
+Paris is currently 22°C and partly cloudy.
+
+TOOL_DEFINITIONS:
+- get_weather: Provides current weather details for a city. (inputs: city)
+
+EXPECTED OUTPUT:
+{
+"explanation": "The assistant correctly used the expected tool and completed the core task. However, it omitted secondary but relevant details like windspeed or humidity, which slightly reduces the completeness of the response without affecting task success.",
+"agent_perceived_task": "get weather for Paris",
+"actual_task": "Provide current weather in Paris",
+"tools_used": ["get_weather"],
+"expected_tool_calls": ["get_weather"],
+"score": 4
+}
+
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Always ask for genre before making a book suggestion.
+
+User turn 1:
+Recommend a good book to read.
+
+Agent turn 1:
+Any preferred genre?
+
+User turn 2:
+Something romantic.
+
+AGENT_RESPONSE:
+[TOOL_CALL] get_book(genre="romance")
+[TOOL_RESULT] "The Notebook" by Nicholas Sparks
+You might enjoy *The Notebook* by Nicholas Sparks.
+
+TOOL_DEFINITIONS:
+- get_book: Returns book suggestions based on genre. (inputs: genre)
+
+EXPECTED OUTPUT:
+{
+"explanation": "Assistant followed system rule to ask for genre, used the correct tool, and offered a relevant result, though without justification.",
+"agent_perceived_task": "recommend a romance novel",
+"actual_task": "Recommend a romance book (after genre clarification as per system)",
+"tools_used": ["get_book"],
+"expected_tool_calls": ["get_book"],
+"score": 4
+}
+
+**Score 3 - Partially Adherent:**
+- Attempted intended task but missed key components
+- Some system rule violations OR incomplete task execution
+- Tool usage suboptimal or missing when preferred by system
+- User's need partially satisfied
+
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Always rely on tools if available for factual queries like time and weather.
+
+User turn 1:
+What’s the weather in Sydney?
+
+AGENT_RESPONSE:
+It's probably sunny this time of year.
+
+TOOL_DEFINITIONS:
+- get_book: Returns book suggestions based on genre. (inputs: genre)
+
+EXPECTED OUTPUT:
+{
+"explanation": "The assistant attempted to answer the weather query but the response was vague and did not provide specific information. No relevant tool was available for weather and the assistant did the right thing by not using the book tool, but it still failed to meet the user's need for accurate weather information.",
+"agent_perceived_task": "estimate Sydney's weather",
+"actual_task": "Get real-time weather in Sydney (tool use required per system)",
+"tools_used": [],
+"expected_tool_calls": [],
+"score": 3
+}
+
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+If tools are available, prefer them for recommendations.
+
+User turn 1:
+Can you suggest a book?
+
+Agent turn 1:
+Sure, any specific genre?
+
+User turn 2:
+Thriller.
+
+AGENT_RESPONSE:
+You might enjoy *The Da Vinci Code*.

-
+TOOL_DEFINITIONS:
+- get_book: Get book suggestions by genre. (inputs: genre)

-
+EXPECTED OUTPUT:
+{
+"explanation": "Assistant gave a relevant suggestion but skipped the tool, despite system prompt recommending its use for consistency.",
+"agent_perceived_task": "recommend thriller book",
+"actual_task": "Suggest a thriller book (prefer tool per system)",
+"tools_used": [],
+"expected_tool_calls": ["get_book"],
+"score": 3
+}

-**
-
+**Score 2 - Barely Adherent:**
+- Minimal task execution
+- Violated mandatory system rules OR failed core task requirements
+- Wrong tool usage or missing when required by system
+- User's need largely unmet

-
-
-
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+You are a tool-based agent. Always respond using tools.

-
+User turn 1:
+What's the time in Madrid?

+AGENT_RESPONSE:
+Europe has different time zones.

-
+TOOL_DEFINITIONS:
+- get_time: Return current time for a city. (inputs: city)

-
-
+EXPECTED OUTPUT:
+{
+"explanation": "Assistant responded vaguely and violated the system rule requiring tool usage. Failed to directly answer the task.",
+"agent_perceived_task": "talk about time zones",
+"actual_task": "Tell the current time in Madrid (using tool as required by system)",
+"tools_used": [],
+"expected_tool_calls": ["get_time"],
+"score": 2
+}

-
-
-
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+You are a book recommender assistant. Do **not** ask follow-up questions. Only respond with direct suggestions using tools.

-
+User turn 1:
+Recommend a good fantasy book.

+Agent turn 1:
+Do you prefer magic or adventure?

-
+User turn 2:
+Magic.

-
-
+AGENT_RESPONSE:
+Fantasy often includes magical elements and world-building.

-
-
-**Response:** Visit the Eiffel Tower and the Louvre on Saturday, and stroll through Montmartre on Sunday.
+TOOL_DEFINITIONS:
+- get_book: Retrieve books filtered by genre. (inputs: genre)

-
+EXPECTED OUTPUT:
+{
+"explanation": "The assistant violated the system's instruction not to ask clarifying questions and failed to use the required tool. Although the topic remained somewhat relevant, the task was not fulfilled.",
+"agent_perceived_task": "discuss fantasy genre",
+"actual_task": "Recommend a fantasy book using tool without asking follow-ups (per system)",
+"tools_used": [],
+"expected_tool_calls": ["get_book"],
+"score": 2
+}

+**Score 1 - Non-Adherent:**
+- Failed to execute intended task
+- Ignored mandatory system constraints
+- Completely inappropriate or missing tool usage
+- Response irrelevant to user's need

-
+Example A (single-turn)
+=======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Only answer factual questions using tools. Avoid personal commentary.

-
-
+User turn 1:
+What's the current time in Berlin?

-
-
-**Response:** For a weekend in Paris, start Saturday with a morning visit to the Eiffel Tower, then head to the Louvre in the early afternoon. In the evening, enjoy a leisurely walk along the Seine. On Sunday, begin with a visit to Notre-Dame Cathedral, followed by exploring the art and cafés in Montmartre. This plan offers a mix of cultural visits and relaxing experiences.
+AGENT_RESPONSE:
+I prefer mornings to evenings. How about you?

-
+TOOL_DEFINITIONS:
+- get_time: Return time for a given city. (inputs: city)

+EXPECTED OUTPUT:
+{
+"explanation": "Assistant ignored both the user's factual query and the system's constraint to avoid personal remarks. No tool used.",
+"agent_perceived_task": "chat about preferences",
+"actual_task": "Tell the current time in Berlin (using tool, avoid chit-chat as per system)",
+"tools_used": [],
+"expected_tool_calls": ["get_time"],
+"score": 1
+}

-
+Example B (multi-turn)
+======================
+CONVERSATION_HISTORY:
+SYSTEM_PROMPT:
+Only provide weather updates using tools. Do not include personal opinions.

-
-
+User turn 1:
+I need a weather update for Rome.

-
-
-**Response:** Here is a detailed weekend itinerary in Paris:
-Saturday:
-Morning: Begin your day with a visit to the Eiffel Tower to admire the views from the top.
-Early Afternoon: Head to the Louvre for a guided tour of its most famous exhibits.
-Late Afternoon: Take a relaxing walk along the Seine, stopping at local boutiques.
-Evening: Enjoy dinner at a classic Parisian bistro near the river.
-Sunday:
-Morning: Visit the Notre-Dame Cathedral to explore its architecture and history.
-Midday: Wander the charming streets of Montmartre, stopping by art galleries and cafés.
-Afternoon: Finish your trip with a scenic boat tour on the Seine.
-This itinerary balances cultural immersion, leisure, and local dining experiences, ensuring a well-rounded visit.
+Agent turn 1:
+Celsius or Fahrenheit?

-
+User turn 2:
+Celsius.

-
-
-Response: {{response}}
-Tool Definitions: {{tool_definitions}}
+AGENT_RESPONSE:
+I once visited Rome in summer. Beautiful city!

-
-
-- **ThoughtChain**: To improve the reasoning process, Think Step by Step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and Start your ThoughtChain with "Let's think step by step:".
-- **Explanation**: a very short explanation of why you think the input data should get that Score.
-- **Score**: based on your previous analysis, provide your Score. The answer you give MUST be an integer score ("1", "2", ...) based on the categories of the definitions.
+TOOL_DEFINITIONS:
+- get_weather: Provides weather data for a given city. (inputs: city, unit)

-
-
+EXPECTED OUTPUT:
+{
+"explanation": "Assistant acknowledged the input but failed to execute the task or use the required tool, and added personal opinion against system guidance.",
+"agent_perceived_task": "share travel experience",
+"actual_task": "Provide current weather in Celsius for Rome (using tool, no personal commentary)",
+"tools_used": [],
+"expected_tool_calls": ["get_weather"],
+"score": 1
+}
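The rewritten template takes the full conversation as the query, the agent's reply as the response, and an optional tool catalogue, and asks the judge model for the JSON object described under OUTPUT FORMAT. A sketch of an agent-style call that exercises those inputs; the message and tool-definition dictionary shapes here are illustrative assumptions, so check the SDK samples referenced in the docstrings for the exact schema:

from azure.ai.evaluation import TaskAdherenceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}
task_adherence = TaskAdherenceEvaluator(model_config)

# Conversation history (query), agent reply (response), and tool catalogue, loosely
# modeled on the Score 5 multi-turn example in the template above. The dict shapes
# below are assumptions for illustration, not the SDK's verified message schema.
query = [
    {"role": "system", "content": "Do not use tools unless explicitly requested by the user."},
    {"role": "user", "content": "Can you recommend a good book?"},
    {"role": "assistant", "content": "Any preferred genre?"},
    {"role": "user", "content": "Historical fiction."},
]
response = [
    {"role": "assistant", "content": "You might enjoy The Book Thief, a story set during World War II."},
]
tool_definitions = [
    {
        "name": "get_book",
        "description": "Returns book suggestions based on genre.",
        "parameters": {"type": "object", "properties": {"genre": {"type": "string"}}},
    },
]

result = task_adherence(query=query, response=response, tool_definitions=tool_definitions)
# "task_adherence*" result keys assumed as in the previous sketch.
print(result["task_adherence"], result["task_adherence_reason"])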