azure-ai-evaluation 1.0.0b3__py3-none-any.whl → 1.0.0b5__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of azure-ai-evaluation might be problematic.
- azure/ai/evaluation/__init__.py +23 -1
- azure/ai/evaluation/{simulator/_helpers → _common}/_experimental.py +20 -9
- azure/ai/evaluation/_common/constants.py +9 -2
- azure/ai/evaluation/_common/math.py +29 -0
- azure/ai/evaluation/_common/rai_service.py +222 -93
- azure/ai/evaluation/_common/utils.py +328 -19
- azure/ai/evaluation/_constants.py +16 -8
- azure/ai/evaluation/_evaluate/{_batch_run_client → _batch_run}/__init__.py +3 -2
- azure/ai/evaluation/_evaluate/{_batch_run_client → _batch_run}/code_client.py +33 -17
- azure/ai/evaluation/_evaluate/{_batch_run_client/batch_run_context.py → _batch_run/eval_run_context.py} +14 -7
- azure/ai/evaluation/_evaluate/{_batch_run_client → _batch_run}/proxy_client.py +22 -4
- azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +35 -0
- azure/ai/evaluation/_evaluate/_eval_run.py +47 -14
- azure/ai/evaluation/_evaluate/_evaluate.py +370 -188
- azure/ai/evaluation/_evaluate/_telemetry/__init__.py +15 -16
- azure/ai/evaluation/_evaluate/_utils.py +77 -25
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +1 -1
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +16 -10
- azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +76 -34
- azure/ai/evaluation/_evaluators/_common/_base_eval.py +76 -46
- azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +26 -19
- azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +62 -25
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +68 -36
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +67 -46
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +33 -4
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +33 -4
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +33 -4
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +33 -4
- azure/ai/evaluation/_evaluators/_eci/_eci.py +7 -5
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +14 -6
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +22 -21
- azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +66 -36
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +1 -1
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +51 -16
- azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +113 -0
- azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +99 -0
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +3 -7
- azure/ai/evaluation/_evaluators/_multimodal/__init__.py +20 -0
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +130 -0
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +57 -0
- azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +96 -0
- azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +120 -0
- azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +96 -0
- azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +96 -0
- azure/ai/evaluation/_evaluators/_multimodal/_violence.py +96 -0
- azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +46 -13
- azure/ai/evaluation/_evaluators/_qa/_qa.py +11 -6
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +23 -20
- azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +78 -42
- azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +126 -80
- azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +74 -24
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +2 -2
- azure/ai/evaluation/_evaluators/_service_groundedness/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +150 -0
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +32 -15
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +36 -10
- azure/ai/evaluation/_exceptions.py +26 -6
- azure/ai/evaluation/_http_utils.py +203 -132
- azure/ai/evaluation/_model_configurations.py +23 -6
- azure/ai/evaluation/_vendor/__init__.py +3 -0
- azure/ai/evaluation/_vendor/rouge_score/__init__.py +14 -0
- azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +328 -0
- azure/ai/evaluation/_vendor/rouge_score/scoring.py +63 -0
- azure/ai/evaluation/_vendor/rouge_score/tokenize.py +63 -0
- azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +53 -0
- azure/ai/evaluation/_version.py +1 -1
- azure/ai/evaluation/simulator/__init__.py +2 -1
- azure/ai/evaluation/simulator/_adversarial_scenario.py +5 -0
- azure/ai/evaluation/simulator/_adversarial_simulator.py +88 -60
- azure/ai/evaluation/simulator/_conversation/__init__.py +13 -12
- azure/ai/evaluation/simulator/_conversation/_conversation.py +4 -4
- azure/ai/evaluation/simulator/_data_sources/__init__.py +3 -0
- azure/ai/evaluation/simulator/_data_sources/grounding.json +1150 -0
- azure/ai/evaluation/simulator/_direct_attack_simulator.py +24 -66
- azure/ai/evaluation/simulator/_helpers/__init__.py +1 -2
- azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +26 -5
- azure/ai/evaluation/simulator/_indirect_attack_simulator.py +98 -95
- azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +67 -21
- azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +28 -11
- azure/ai/evaluation/simulator/_model_tools/_template_handler.py +68 -24
- azure/ai/evaluation/simulator/_model_tools/models.py +10 -10
- azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +4 -9
- azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +6 -5
- azure/ai/evaluation/simulator/_simulator.py +222 -169
- azure/ai/evaluation/simulator/_tracing.py +4 -4
- azure/ai/evaluation/simulator/_utils.py +6 -6
- {azure_ai_evaluation-1.0.0b3.dist-info → azure_ai_evaluation-1.0.0b5.dist-info}/METADATA +237 -52
- azure_ai_evaluation-1.0.0b5.dist-info/NOTICE.txt +70 -0
- azure_ai_evaluation-1.0.0b5.dist-info/RECORD +120 -0
- {azure_ai_evaluation-1.0.0b3.dist-info → azure_ai_evaluation-1.0.0b5.dist-info}/WHEEL +1 -1
- azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +0 -49
- azure_ai_evaluation-1.0.0b3.dist-info/RECORD +0 -98
- {azure_ai_evaluation-1.0.0b3.dist-info → azure_ai_evaluation-1.0.0b5.dist-info}/top_level.txt +0 -0
azure/ai/evaluation/_evaluate/_telemetry/__init__.py

@@ -6,7 +6,7 @@ import functools
 import inspect
 import json
 import logging
-from typing import Callable, Dict,
+from typing import Callable, Dict, Literal, Optional, Union, cast

 import pandas as pd
 from promptflow._sdk.entities._flows import FlexFlow as flex_flow

@@ -16,31 +16,30 @@ from promptflow.client import PFClient
 from promptflow.core import Prompty as prompty_core
 from typing_extensions import ParamSpec

+from azure.ai.evaluation._model_configurations import AzureAIProject, EvaluationResult
+
 from ..._user_agent import USER_AGENT
 from .._utils import _trace_destination_from_project_scope

 LOGGER = logging.getLogger(__name__)

 P = ParamSpec("P")
-R = TypeVar("R")


-def _get_evaluator_type(evaluator: Dict[str, Callable]):
+def _get_evaluator_type(evaluator: Dict[str, Callable]) -> Literal["content-safety", "built-in", "custom"]:
     """
     Get evaluator type for telemetry.

     :param evaluator: The evaluator object
     :type evaluator: Dict[str, Callable]
     :return: The evaluator type. Possible values are "built-in", "custom", and "content-safety".
-    :rtype:
+    :rtype: Literal["content-safety", "built-in", "custom"]
     """
-    built_in = False
-    content_safety = False
-
     module = inspect.getmodule(evaluator)
-
-
-
+    module_name = module.__name__ if module else ""
+
+    built_in = module_name.startswith("azure.ai.evaluation._evaluators.")
+    content_safety = built_in and module_name.startswith("azure.ai.evaluation._evaluators._content_safety")

     if content_safety:
         return "content-safety"

@@ -98,22 +97,22 @@ def _get_evaluator_properties(evaluator, evaluator_name):


 # cspell:ignore isna
-def log_evaluate_activity(func: Callable[P,
+def log_evaluate_activity(func: Callable[P, EvaluationResult]) -> Callable[P, EvaluationResult]:
     """Decorator to log evaluate activity

     :param func: The function to be decorated
     :type func: Callable
     :returns: The decorated function
-    :rtype: Callable[P,
+    :rtype: Callable[P, EvaluationResult]
     """

     @functools.wraps(func)
-    def wrapper(*args: P.args, **kwargs: P.kwargs) ->
+    def wrapper(*args: P.args, **kwargs: P.kwargs) -> EvaluationResult:
         from promptflow._sdk._telemetry import ActivityType, log_activity
         from promptflow._sdk._telemetry.telemetry import get_telemetry_logger

-        evaluators = kwargs.get("evaluators",
-        azure_ai_project = kwargs.get("azure_ai_project", None)
+        evaluators = cast(Optional[Dict[str, Callable]], kwargs.get("evaluators", {})) or {}
+        azure_ai_project = cast(Optional[AzureAIProject], kwargs.get("azure_ai_project", None))

         pf_client = PFClient(
             config=(

@@ -127,7 +126,7 @@ def log_evaluate_activity(func: Callable[P, R]) -> Callable[P, R]:
         track_in_cloud = bool(pf_client._config.get_trace_destination())  # pylint: disable=protected-access
         evaluate_target = bool(kwargs.get("target", None))
         evaluator_config = bool(kwargs.get("evaluator_config", None))
-        custom_dimensions = {
+        custom_dimensions: Dict[str, Union[str, bool]] = {
            "track_in_cloud": track_in_cloud,
            "evaluate_target": evaluate_target,
            "evaluator_config": evaluator_config,
azure/ai/evaluation/_evaluate/_utils.py

@@ -6,15 +6,24 @@ import logging
 import os
 import re
 import tempfile
-from collections import namedtuple
 from pathlib import Path
-from typing import Dict
+from typing import Any, Dict, NamedTuple, Optional, Tuple, Union
+import uuid
+import base64

 import pandas as pd
-
-from
+from promptflow.client import PFClient
+from promptflow.entities import Run
+
+from azure.ai.evaluation._constants import (
+    DEFAULT_EVALUATION_RESULTS_FILE_NAME,
+    DefaultOpenEncoding,
+    EvaluationRunProperties,
+    Prefixes,
+)
 from azure.ai.evaluation._evaluate._eval_run import EvalRun
 from azure.ai.evaluation._exceptions import ErrorBlame, ErrorCategory, ErrorTarget, EvaluationException
+from azure.ai.evaluation._model_configurations import AzureAIProject

 LOGGER = logging.getLogger(__name__)

@@ -23,14 +32,20 @@ AZURE_WORKSPACE_REGEX_FORMAT = (
     "(/providers/Microsoft.MachineLearningServices)?/workspaces/([^/]+)$"
 )

-
+
+class AzureMLWorkspace(NamedTuple):
+    subscription_id: str
+    resource_group_name: str
+    workspace_name: str


-def is_none(value):
+def is_none(value) -> bool:
     return value is None or str(value).lower() == "none"


-def extract_workspace_triad_from_trace_provider(
+def extract_workspace_triad_from_trace_provider(  # pylint: disable=name-too-long
+    trace_provider: str,
+) -> AzureMLWorkspace:
     match = re.match(AZURE_WORKSPACE_REGEX_FORMAT, trace_provider)
     if not match or len(match.groups()) != 5:
         raise EvaluationException(

@@ -47,7 +62,7 @@ def extract_workspace_triad_from_trace_provider(trace_provider: str): # pylint:
     subscription_id = match.group(1)
     resource_group_name = match.group(3)
     workspace_name = match.group(5)
-    return
+    return AzureMLWorkspace(subscription_id, resource_group_name, workspace_name)


 def load_jsonl(path):

@@ -55,7 +70,7 @@ def load_jsonl(path):
         return [json.loads(line) for line in f.readlines()]


-def _azure_pf_client_and_triad(trace_destination):
+def _azure_pf_client_and_triad(trace_destination) -> Tuple[PFClient, AzureMLWorkspace]:
     from promptflow.azure._cli._utils import _get_azure_pf_client

     ws_triad = extract_workspace_triad_from_trace_provider(trace_destination)

@@ -68,15 +83,43 @@ def _azure_pf_client_and_triad(trace_destination):
     return azure_pf_client, ws_triad


+def _store_multimodal_content(messages, tmpdir: str):
+    # verify if images folder exists
+    images_folder_path = os.path.join(tmpdir, "images")
+    os.makedirs(images_folder_path, exist_ok=True)
+
+    # traverse all messages and replace base64 image data with new file name.
+    for message in messages:
+        if isinstance(message.get("content", []), list):
+            for content in message.get("content", []):
+                if content.get("type") == "image_url":
+                    image_url = content.get("image_url")
+                    if image_url and "url" in image_url and image_url["url"].startswith("data:image/jpg;base64,"):
+                        # Extract the base64 string
+                        base64image = image_url["url"].replace("data:image/jpg;base64,", "")
+
+                        # Generate a unique filename
+                        image_file_name = f"{str(uuid.uuid4())}.jpg"
+                        image_url["url"] = f"images/{image_file_name}"  # Replace the base64 URL with the file path
+
+                        # Decode the base64 string to binary image data
+                        image_data_binary = base64.b64decode(base64image)
+
+                        # Write the binary image data to the file
+                        image_file_path = os.path.join(images_folder_path, image_file_name)
+                        with open(image_file_path, "wb") as f:
+                            f.write(image_data_binary)
+
+
 def _log_metrics_and_instance_results(
-    metrics,
-    instance_results,
-    trace_destination,
-    run,
-    evaluation_name,
-) -> str:
+    metrics: Dict[str, Any],
+    instance_results: pd.DataFrame,
+    trace_destination: Optional[str],
+    run: Run,
+    evaluation_name: Optional[str],
+) -> Optional[str]:
     if trace_destination is None:
-        LOGGER.
+        LOGGER.debug("Skip uploading evaluation results to AI Studio since no trace destination was provided.")
         return None

     azure_pf_client, ws_triad = _azure_pf_client_and_triad(trace_destination)

@@ -94,10 +137,18 @@ def _log_metrics_and_instance_results(
         ml_client=azure_pf_client.ml_client,
         promptflow_run=run,
     ) as ev_run:
-
         artifact_name = EvalRun.EVALUATION_ARTIFACT if run else EvalRun.EVALUATION_ARTIFACT_DUMMY_RUN

         with tempfile.TemporaryDirectory() as tmpdir:
+            # storing multi_modal images if exists
+            col_name = "inputs.conversation"
+            if col_name in instance_results.columns:
+                for item in instance_results[col_name].items():
+                    value = item[1]
+                    if "messages" in value:
+                        _store_multimodal_content(value["messages"], tmpdir)
+
+            # storing artifact result
             tmp_path = os.path.join(tmpdir, artifact_name)

             with open(tmp_path, "w", encoding=DefaultOpenEncoding.WRITE) as f:

@@ -112,7 +163,8 @@ def _log_metrics_and_instance_results(
         if run is None:
             ev_run.write_properties_to_run_history(
                 properties={
-
+                    EvaluationRunProperties.RUN_TYPE: "eval_run",
+                    EvaluationRunProperties.EVALUATION_RUN: "azure-ai-generative-parent",
                    "_azureml.evaluate_artifacts": json.dumps([{"path": artifact_name, "type": "table"}]),
                    "isEvaluatorRun": "true",
                }

@@ -138,7 +190,7 @@ def _get_ai_studio_url(trace_destination: str, evaluation_id: str) -> str:
     return studio_url


-def _trace_destination_from_project_scope(project_scope:
+def _trace_destination_from_project_scope(project_scope: AzureAIProject) -> str:
     subscription_id = project_scope["subscription_id"]
     resource_group_name = project_scope["resource_group_name"]
     workspace_name = project_scope["project_name"]

@@ -151,9 +203,9 @@ def _trace_destination_from_project_scope(project_scope: dict) -> str:
     return trace_destination


-def _write_output(path, data_dict):
+def _write_output(path: Union[str, os.PathLike], data_dict: Any) -> None:
     p = Path(path)
-    if
+    if p.is_dir():
         p = p / DEFAULT_EVALUATION_RESULTS_FILE_NAME

     with open(p, "w", encoding=DefaultOpenEncoding.WRITE) as f:

@@ -161,7 +213,7 @@ def _write_output(path, data_dict):


 def _apply_column_mapping(
-    source_df: pd.DataFrame, mapping_config: Dict[str, str], inplace: bool = False
+    source_df: pd.DataFrame, mapping_config: Optional[Dict[str, str]], inplace: bool = False
 ) -> pd.DataFrame:
     """
     Apply column mapping to source_df based on mapping_config.

@@ -211,7 +263,7 @@ def _apply_column_mapping(
     return result_df


-def _has_aggregator(evaluator):
+def _has_aggregator(evaluator: object) -> bool:
     return hasattr(evaluator, "__aggregate__")


@@ -234,11 +286,11 @@ def get_int_env_var(env_var_name: str, default_value: int) -> int:
     return default_value


-def set_event_loop_policy():
+def set_event_loop_policy() -> None:
     import asyncio
     import platform

     if platform.system().lower() == "windows":
         # Reference: https://stackoverflow.com/questions/45600579/asyncio-event-loop-is-closed-when-getting-loop
         # On Windows seems to be a problem with EventLoopPolicy, use this snippet to work around it
-        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
+        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())  # type: ignore[attr-defined]
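As an aside, the following is a minimal, illustrative sketch (not part of the diff) of what the new private _store_multimodal_content helper does to a conversation message before results are uploaded: base64 image_url content is written to an images/ folder and replaced with a relative file path. The import path and the placeholder JPEG bytes are assumptions for illustration only.

    import base64
    import json
    import os
    import tempfile

    # Private helper shown in the diff above; not part of the public API.
    from azure.ai.evaluation._evaluate._utils import _store_multimodal_content

    # Placeholder payload standing in for a real base64-encoded JPEG.
    fake_jpeg = base64.b64encode(b"\xff\xd8\xff\xd9").decode()
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{fake_jpeg}"}},
            ],
        }
    ]

    with tempfile.TemporaryDirectory() as tmpdir:
        _store_multimodal_content(messages, tmpdir)
        # The inline data URL is now a relative path such as "images/<uuid>.jpg",
        # and the decoded bytes live under <tmpdir>/images/.
        print(json.dumps(messages, indent=2))
        print(os.listdir(os.path.join(tmpdir, "images")))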
azure/ai/evaluation/_evaluators/_bleu/_bleu.py

@@ -63,7 +63,7 @@ class BleuScoreEvaluator:
         :keyword ground_truth: The ground truth to be compared against.
         :paramtype ground_truth: str
         :return: The BLEU score.
-        :rtype:
+        :rtype: Dict[str, float]
         """
         return async_run_allowing_running_loop(
             self._async_evaluator, response=response, ground_truth=ground_truth, **kwargs
azure/ai/evaluation/_evaluators/_coherence/_coherence.py

@@ -3,6 +3,7 @@
 # ---------------------------------------------------------
 import os
 from typing import Optional
+
 from typing_extensions import override

 from azure.ai.evaluation._evaluators._common import PromptyEvaluatorBase

@@ -30,18 +31,23 @@ class CoherenceEvaluator(PromptyEvaluatorBase):
     .. code-block:: python

         {
-            "
+            "coherence": 1.0,
+            "gpt_coherence": 1.0,
         }
+
+    Note: To align with our support of a diverse set of models, a key without the `gpt_` prefix has been added.
+    To maintain backwards compatibility, the old key with the `gpt_` prefix is still be present in the output;
+    however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
     """

-
-
+    _PROMPTY_FILE = "coherence.prompty"
+    _RESULT_KEY = "coherence"

     @override
-    def __init__(self, model_config
+    def __init__(self, model_config):
         current_dir = os.path.dirname(__file__)
-        prompty_path = os.path.join(current_dir, self.
-        super().__init__(model_config=model_config, prompty_file=prompty_path, result_key=self.
+        prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
+        super().__init__(model_config=model_config, prompty_file=prompty_path, result_key=self._RESULT_KEY)

     @override
     def __call__(

@@ -49,8 +55,8 @@ class CoherenceEvaluator(PromptyEvaluatorBase):
         *,
         query: Optional[str] = None,
         response: Optional[str] = None,
-        conversation
-        **kwargs
+        conversation=None,
+        **kwargs,
     ):
         """Evaluate coherence. Accepts either a query and response for a single evaluation,
         or a conversation for a potentially multi-turn evaluation. If the conversation has more than one pair of

@@ -63,8 +69,8 @@ class CoherenceEvaluator(PromptyEvaluatorBase):
         :keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
             key "messages". Conversation turns are expected
             to be dictionaries with keys "content" and "role".
-        :paramtype conversation: Optional[
+        :paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
         :return: The relevance score.
-        :rtype:
+        :rtype: Union[Dict[str, float], Dict[str, Union[float, Dict[str, List[float]]]]]
         """
         return super().__call__(query=query, response=response, conversation=conversation, **kwargs)
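As a usage illustration (not taken from the diff), here is a minimal sketch of calling the updated evaluator; the endpoint, deployment, and key values are placeholders, and the printed result is expected to carry both the new coherence key and the legacy gpt_coherence key described in the docstring above.

    from azure.ai.evaluation import CoherenceEvaluator

    # Placeholder Azure OpenAI configuration; substitute real values before running.
    model_config = {
        "azure_endpoint": "https://<your-resource>.openai.azure.com",
        "azure_deployment": "<your-deployment>",
        "api_key": "<your-api-key>",
    }

    coherence = CoherenceEvaluator(model_config)
    result = coherence(
        query="What are the benefits of renewable energy?",
        response="Renewable energy reduces emissions, lowers long-term costs, and improves energy security.",
    )
    print(result)  # e.g. {"coherence": 5.0, "gpt_coherence": 5.0}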
azure/ai/evaluation/_evaluators/_coherence/coherence.prompty

@@ -5,7 +5,7 @@ model:
   api: chat
   parameters:
     temperature: 0.0
-    max_tokens:
+    max_tokens: 800
     top_p: 1.0
     presence_penalty: 0
     frequency_penalty: 0

@@ -20,38 +20,80 @@ inputs:

 ---
 system:
-
+# Instruction
+## Goal
+### You are an expert in evaluating the quality of a RESPONSE from an intelligent system based on provided definition and data. Your goal will involve answering the questions below using the information provided.
+- **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
+- **Data**: Your input data include a QUERY and a RESPONSE.
+- **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways.

 user:
- [previous prompt body removed; its content is truncated in the source diff view]
+# Definition
+**Coherence** refers to the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent answer directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas.
+
+# Ratings
+## [Coherence: 1] (Incoherent Response)
+**Definition:** The response lacks coherence entirely. It consists of disjointed words or phrases that do not form complete or meaningful sentences. There is no logical connection to the question, making the response incomprehensible.
+
+**Examples:**
+**Query:** What are the benefits of renewable energy?
+**Response:** Wind sun green jump apple silence over.
+
+**Query:** Explain the process of photosynthesis.
+**Response:** Plants light water flying blue music.
+
+## [Coherence: 2] (Poorly Coherent Response)
+**Definition:** The response shows minimal coherence with fragmented sentences and limited connection to the question. It contains some relevant keywords but lacks logical structure and clear relationships between ideas, making the overall message difficult to understand.
+
+**Examples:**
+**Query:** How does vaccination work?
+**Response:** Vaccines protect disease. Immune system fight. Health better.
+
+**Query:** Describe how a bill becomes a law.
+**Response:** Idea proposed. Congress discuss vote. President signs.
+
+## [Coherence: 3] (Partially Coherent Response)
+**Definition:** The response partially addresses the question with some relevant information but exhibits issues in the logical flow and organization of ideas. Connections between sentences may be unclear or abrupt, requiring the reader to infer the links. The response may lack smooth transitions and may present ideas out of order.
+
+**Examples:**
+**Query:** What causes earthquakes?
+**Response:** Earthquakes happen when tectonic plates move suddenly. Energy builds up then releases. Ground shakes and can cause damage.
+
+**Query:** Explain the importance of the water cycle.
+**Response:** The water cycle moves water around Earth. Evaporation, then precipitation occurs. It supports life by distributing water.
+
+## [Coherence: 4] (Coherent Response)
+**Definition:** The response is coherent and effectively addresses the question. Ideas are logically organized with clear connections between sentences and paragraphs. Appropriate transitions are used to guide the reader through the response, which flows smoothly and is easy to follow.
+
+**Examples:**
+**Query:** What is the water cycle and how does it work?
+**Response:** The water cycle is the continuous movement of water on Earth through processes like evaporation, condensation, and precipitation. Water evaporates from bodies of water, forms clouds through condensation, and returns to the surface as precipitation. This cycle is essential for distributing water resources globally.
+
+**Query:** Describe the role of mitochondria in cellular function.
+**Response:** Mitochondria are organelles that produce energy for the cell. They convert nutrients into ATP through cellular respiration. This energy powers various cellular activities, making mitochondria vital for cell survival.
+
+## [Coherence: 5] (Highly Coherent Response)
+**Definition:** The response is exceptionally coherent, demonstrating sophisticated organization and flow. Ideas are presented in a logical and seamless manner, with excellent use of transitional phrases and cohesive devices. The connections between concepts are clear and enhance the reader's understanding. The response thoroughly addresses the question with clarity and precision.
+
+**Examples:**
+**Query:** Analyze the economic impacts of climate change on coastal cities.
+**Response:** Climate change significantly affects the economies of coastal cities through rising sea levels, increased flooding, and more intense storms. These environmental changes can damage infrastructure, disrupt businesses, and lead to costly repairs. For instance, frequent flooding can hinder transportation and commerce, while the threat of severe weather may deter investment and tourism. Consequently, cities may face increased expenses for disaster preparedness and mitigation efforts, straining municipal budgets and impacting economic growth.
+
+**Query:** Discuss the significance of the Monroe Doctrine in shaping U.S. foreign policy.
+**Response:** The Monroe Doctrine was a pivotal policy declared in 1823 that asserted U.S. opposition to European colonization in the Americas. By stating that any intervention by external powers in the Western Hemisphere would be viewed as a hostile act, it established the U.S. as a protector of the region. This doctrine shaped U.S. foreign policy by promoting isolation from European conflicts while justifying American influence and expansion in the hemisphere. Its long-term significance lies in its enduring influence on international relations and its role in defining the U.S. position in global affairs.
+
+
+# Data
+QUERY: {{query}}
+RESPONSE: {{response}}
+
+
+# Tasks
+## Please provide your assessment Score for the previous RESPONSE in relation to the QUERY based on the Definitions above. Your output should include the following information:
+- **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
+- **Explanation**: a very short explanation of why you think the input Data should get that Score.
+- **Score**: based on your previous analysis, provide your Score. The Score you give MUST be a integer score (i.e., "1", "2"...) based on the levels of the definitions.
+
+
+## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
+# Output