azure-ai-evaluation 0.0.0b0__py3-none-any.whl → 1.0.0b1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of azure-ai-evaluation might be problematic.
- azure/ai/evaluation/__init__.py +60 -0
- azure/ai/evaluation/_common/__init__.py +16 -0
- azure/ai/evaluation/_common/constants.py +65 -0
- azure/ai/evaluation/_common/rai_service.py +452 -0
- azure/ai/evaluation/_common/utils.py +87 -0
- azure/ai/evaluation/_constants.py +50 -0
- azure/ai/evaluation/_evaluate/__init__.py +3 -0
- azure/ai/evaluation/_evaluate/_batch_run_client/__init__.py +8 -0
- azure/ai/evaluation/_evaluate/_batch_run_client/batch_run_context.py +72 -0
- azure/ai/evaluation/_evaluate/_batch_run_client/code_client.py +150 -0
- azure/ai/evaluation/_evaluate/_batch_run_client/proxy_client.py +61 -0
- azure/ai/evaluation/_evaluate/_eval_run.py +494 -0
- azure/ai/evaluation/_evaluate/_evaluate.py +689 -0
- azure/ai/evaluation/_evaluate/_telemetry/__init__.py +174 -0
- azure/ai/evaluation/_evaluate/_utils.py +237 -0
- azure/ai/evaluation/_evaluators/__init__.py +3 -0
- azure/ai/evaluation/_evaluators/_bleu/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +73 -0
- azure/ai/evaluation/_evaluators/_chat/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_chat/_chat.py +350 -0
- azure/ai/evaluation/_evaluators/_chat/retrieval/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_chat/retrieval/_retrieval.py +163 -0
- azure/ai/evaluation/_evaluators/_chat/retrieval/retrieval.prompty +48 -0
- azure/ai/evaluation/_evaluators/_coherence/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +122 -0
- azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +62 -0
- azure/ai/evaluation/_evaluators/_content_safety/__init__.py +21 -0
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +108 -0
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_base.py +66 -0
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +296 -0
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +78 -0
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +76 -0
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +76 -0
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +76 -0
- azure/ai/evaluation/_evaluators/_eci/__init__.py +0 -0
- azure/ai/evaluation/_evaluators/_eci/_eci.py +99 -0
- azure/ai/evaluation/_evaluators/_f1_score/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +141 -0
- azure/ai/evaluation/_evaluators/_fluency/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +122 -0
- azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +61 -0
- azure/ai/evaluation/_evaluators/_gleu/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +71 -0
- azure/ai/evaluation/_evaluators/_groundedness/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +123 -0
- azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +54 -0
- azure/ai/evaluation/_evaluators/_meteor/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +96 -0
- azure/ai/evaluation/_evaluators/_protected_material/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +104 -0
- azure/ai/evaluation/_evaluators/_protected_materials/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_protected_materials/_protected_materials.py +104 -0
- azure/ai/evaluation/_evaluators/_qa/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_qa/_qa.py +111 -0
- azure/ai/evaluation/_evaluators/_relevance/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +131 -0
- azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +69 -0
- azure/ai/evaluation/_evaluators/_rouge/__init__.py +10 -0
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +98 -0
- azure/ai/evaluation/_evaluators/_similarity/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +130 -0
- azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +71 -0
- azure/ai/evaluation/_evaluators/_xpia/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +140 -0
- azure/ai/evaluation/_exceptions.py +107 -0
- azure/ai/evaluation/_http_utils.py +395 -0
- azure/ai/evaluation/_model_configurations.py +27 -0
- azure/ai/evaluation/_user_agent.py +6 -0
- azure/ai/evaluation/_version.py +5 -0
- azure/ai/evaluation/py.typed +0 -0
- azure/ai/evaluation/simulator/__init__.py +15 -0
- azure/ai/evaluation/simulator/_adversarial_scenario.py +27 -0
- azure/ai/evaluation/simulator/_adversarial_simulator.py +450 -0
- azure/ai/evaluation/simulator/_constants.py +17 -0
- azure/ai/evaluation/simulator/_conversation/__init__.py +315 -0
- azure/ai/evaluation/simulator/_conversation/_conversation.py +178 -0
- azure/ai/evaluation/simulator/_conversation/constants.py +30 -0
- azure/ai/evaluation/simulator/_direct_attack_simulator.py +252 -0
- azure/ai/evaluation/simulator/_helpers/__init__.py +4 -0
- azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +17 -0
- azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +93 -0
- azure/ai/evaluation/simulator/_indirect_attack_simulator.py +207 -0
- azure/ai/evaluation/simulator/_model_tools/__init__.py +23 -0
- azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +147 -0
- azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +228 -0
- azure/ai/evaluation/simulator/_model_tools/_rai_client.py +157 -0
- azure/ai/evaluation/simulator/_model_tools/_template_handler.py +157 -0
- azure/ai/evaluation/simulator/_model_tools/models.py +616 -0
- azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +69 -0
- azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +36 -0
- azure/ai/evaluation/simulator/_tracing.py +92 -0
- azure/ai/evaluation/simulator/_utils.py +111 -0
- azure/ai/evaluation/simulator/simulator.py +579 -0
- azure_ai_evaluation-1.0.0b1.dist-info/METADATA +377 -0
- azure_ai_evaluation-1.0.0b1.dist-info/RECORD +97 -0
- {azure_ai_evaluation-0.0.0b0.dist-info → azure_ai_evaluation-1.0.0b1.dist-info}/WHEEL +1 -1
- azure_ai_evaluation-1.0.0b1.dist-info/top_level.txt +1 -0
- azure_ai_evaluation-0.0.0b0.dist-info/METADATA +0 -7
- azure_ai_evaluation-0.0.0b0.dist-info/RECORD +0 -4
- azure_ai_evaluation-0.0.0b0.dist-info/top_level.txt +0 -1
azure/ai/evaluation/_evaluators/_qa/_qa.py
@@ -0,0 +1,111 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+
+from concurrent.futures import as_completed
+from typing import Union
+
+from promptflow.tracing import ThreadPoolExecutorWithContext as ThreadPoolExecutor
+
+from .._coherence import CoherenceEvaluator
+from .._f1_score import F1ScoreEvaluator
+from .._fluency import FluencyEvaluator
+from .._groundedness import GroundednessEvaluator
+from ..._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
+from .._relevance import RelevanceEvaluator
+from .._similarity import SimilarityEvaluator
+
+
+class QAEvaluator:
+    """
+    Initialize a question-answer evaluator configured for a specific Azure OpenAI model.
+
+    :param model_config: Configuration for the Azure OpenAI model.
+    :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+        ~azure.ai.evaluation.OpenAIModelConfiguration]
+    :return: A function that evaluates and generates metrics for "question-answering" scenario.
+    :rtype: Callable
+
+    **Usage**
+
+    .. code-block:: python
+
+        eval_fn = QAEvaluator(model_config)
+        result = qa_eval(
+            query="Tokyo is the capital of which country?",
+            response="Japan",
+            context="Tokyo is the capital of Japan.",
+            ground_truth="Japan"
+        )
+
+    **Output format**
+
+    .. code-block:: python
+
+        {
+            "gpt_groundedness": 3.5,
+            "gpt_relevance": 4.0,
+            "gpt_coherence": 1.5,
+            "gpt_fluency": 4.0,
+            "gpt_similarity": 3.0,
+            "f1_score": 0.42
+        }
+    """
+
+    def __init__(
+        self, model_config: dict, parallel: bool = True
+    ):
+        self._parallel = parallel
+
+        self._evaluators = [
+            GroundednessEvaluator(model_config),
+            RelevanceEvaluator(model_config),
+            CoherenceEvaluator(model_config),
+            FluencyEvaluator(model_config),
+            SimilarityEvaluator(model_config),
+            F1ScoreEvaluator(),
+        ]
+
+    def __call__(self, *, query: str, response: str, context: str, ground_truth: str, **kwargs):
+        """
+        Evaluates question-answering scenario.
+
+        :keyword query: The query to be evaluated.
+        :paramtype query: str
+        :keyword response: The response to be evaluated.
+        :paramtype response: str
+        :keyword context: The context to be evaluated.
+        :paramtype context: str
+        :keyword ground_truth: The ground truth to be evaluated.
+        :paramtype ground_truth: str
+        :keyword parallel: Whether to evaluate in parallel. Defaults to True.
+        :paramtype parallel: bool
+        :return: The scores for QA scenario.
+        :rtype: dict
+        """
+        results = {}
+        if self._parallel:
+            with ThreadPoolExecutor() as executor:
+                futures = {
+                    executor.submit(
+                        evaluator,
+                        query=query,
+                        response=response,
+                        context=context,
+                        ground_truth=ground_truth,
+                        **kwargs
+                    ): evaluator
+                    for evaluator in self._evaluators
+                }
+
+                # Collect results as they complete
+                for future in as_completed(futures):
+                    results.update(future.result())
+        else:
+            for evaluator in self._evaluators:
+                result = evaluator(
+                    query=query, response=response, context=context, ground_truth=ground_truth, **kwargs
+                )
+                results.update(result)
+
+        return results
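The _qa.py hunk above composes six sub-evaluators and merges their result dicts, running them on a promptflow thread pool by default or serially when parallel=False. A minimal usage sketch follows, assuming the package's top-level __init__ re-exports QAEvaluator (its contents are not shown in this section) and that model_config follows the AzureOpenAIModelConfiguration shape; the endpoint, key, and deployment values are placeholders.

    from azure.ai.evaluation import QAEvaluator

    # Hypothetical Azure OpenAI settings; replace with real values.
    model_config = {
        "azure_endpoint": "https://<your-resource>.openai.azure.com",
        "api_key": "<your-api-key>",
        "azure_deployment": "<your-deployment>",
    }

    # parallel=True (the default) fans the six sub-evaluators out to a thread pool;
    # parallel=False runs them one after another in the calling thread.
    qa_eval = QAEvaluator(model_config, parallel=False)

    scores = qa_eval(
        query="Tokyo is the capital of which country?",
        response="Japan",
        context="Tokyo is the capital of Japan.",
        ground_truth="Japan",
    )
    print(scores)  # merged dict: gpt_groundedness, gpt_relevance, ..., f1_score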
azure/ai/evaluation/_evaluators/_relevance/__init__.py
@@ -0,0 +1,9 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+
+from ._relevance import RelevanceEvaluator
+
+__all__ = [
+    "RelevanceEvaluator",
+]
azure/ai/evaluation/_evaluators/_relevance/_relevance.py
@@ -0,0 +1,131 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+
+import os
+import re
+from typing import Union
+
+import numpy as np
+
+from promptflow._utils.async_utils import async_run_allowing_running_loop
+from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+from promptflow.core import AsyncPrompty
+
+from ..._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
+from ..._common.utils import (
+    check_and_add_api_version_for_aoai_model_config,
+    check_and_add_user_agent_for_aoai_model_config,
+)
+
+try:
+    from ..._user_agent import USER_AGENT
+except ImportError:
+    USER_AGENT = None
+
+
+class _AsyncRelevanceEvaluator:
+    # Constants must be defined within eval's directory to be save/loadable
+    PROMPTY_FILE = "relevance.prompty"
+    LLM_CALL_TIMEOUT = 600
+    DEFAULT_OPEN_API_VERSION = "2024-02-15-preview"
+
+    def __init__(self, model_config: dict):
+        check_and_add_api_version_for_aoai_model_config(model_config, self.DEFAULT_OPEN_API_VERSION)
+
+        prompty_model_config = {"configuration": model_config, "parameters": {"extra_headers": {}}}
+
+        # Handle "RuntimeError: Event loop is closed" from httpx AsyncClient
+        # https://github.com/encode/httpx/discussions/2959
+        prompty_model_config["parameters"]["extra_headers"].update({"Connection": "close"})
+
+        check_and_add_user_agent_for_aoai_model_config(
+            model_config,
+            prompty_model_config,
+            USER_AGENT,
+        )
+
+        current_dir = os.path.dirname(__file__)
+        prompty_path = os.path.join(current_dir, self.PROMPTY_FILE)
+        self._flow = AsyncPrompty.load(source=prompty_path, model=prompty_model_config)
+
+    async def __call__(self, *, query: str, response: str, context: str, **kwargs):
+        # Validate input parameters
+        query = str(query or "")
+        response = str(response or "")
+        context = str(context or "")
+
+        if not (query.strip() and response.strip() and context.strip()):
+            msg = "'query', 'response' and 'context' must be non-empty strings."
+            raise EvaluationException(
+                message=msg,
+                internal_message=msg,
+                error_category=ErrorCategory.MISSING_FIELD,
+                error_blame=ErrorBlame.USER_ERROR,
+                error_target=ErrorTarget.RELEVANCE_EVALUATOR,
+            )
+
+        # Run the evaluation flow
+        llm_output = await self._flow(
+            query=query, response=response, context=context, timeout=self.LLM_CALL_TIMEOUT, **kwargs
+        )
+
+        score = np.nan
+        if llm_output:
+            match = re.search(r"\d", llm_output)
+            if match:
+                score = float(match.group())
+
+        return {"gpt_relevance": float(score)}
+
+
+class RelevanceEvaluator:
+    """
+    Initialize a relevance evaluator configured for a specific Azure OpenAI model.
+
+    :param model_config: Configuration for the Azure OpenAI model.
+    :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+        ~azure.ai.evaluation.OpenAIModelConfiguration]
+
+    **Usage**
+
+    .. code-block:: python
+
+        eval_fn = RelevanceEvaluator(model_config)
+        result = eval_fn(
+            query="What is the capital of Japan?",
+            response="The capital of Japan is Tokyo.",
+            context="Tokyo is Japan's capital, known for its blend of traditional culture \
+                and technological advancements.")
+
+    **Output format**
+
+    .. code-block:: python
+
+        {
+            "gpt_relevance": 3.0
+        }
+    """
+
+    def __init__(self, model_config: dict):
+        self._async_evaluator = _AsyncRelevanceEvaluator(model_config)
+
+    def __call__(self, *, query: str, response: str, context: str, **kwargs):
+        """
+        Evaluate relevance.
+
+        :keyword query: The query to be evaluated.
+        :paramtype query: str
+        :keyword response: The response to be evaluated.
+        :paramtype response: str
+        :keyword context: The context to be evaluated.
+        :paramtype context: str
+        :return: The relevance score.
+        :rtype: dict
+        """
+        return async_run_allowing_running_loop(
+            self._async_evaluator, query=query, response=response, context=context, **kwargs
+        )
+
+    def _to_async(self):
+        return self._async_evaluator
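Note how _AsyncRelevanceEvaluator parses the score: it takes the first digit found in the model reply and otherwise falls back to np.nan, so gpt_relevance can come back as NaN instead of raising. A hedged sketch of guarding against that case, with the same hypothetical model_config placeholders as in the earlier sketch:

    import math

    from azure.ai.evaluation import RelevanceEvaluator

    model_config = {  # hypothetical Azure OpenAI settings
        "azure_endpoint": "https://<your-resource>.openai.azure.com",
        "api_key": "<your-api-key>",
        "azure_deployment": "<your-deployment>",
    }

    relevance = RelevanceEvaluator(model_config)
    result = relevance(
        query="What is the capital of Japan?",
        response="The capital of Japan is Tokyo.",
        context="Tokyo is Japan's capital.",
    )

    score = result["gpt_relevance"]
    if math.isnan(score):
        # The LLM reply contained no digit, so the row is effectively unscored.
        print("relevance could not be parsed")
    else:
        print(f"relevance: {score:.1f}")  # an integer rating 1-5, returned as a float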
azure/ai/evaluation/_evaluators/_relevance/relevance.prompty
@@ -0,0 +1,69 @@
+---
+name: Relevance
+description: Evaluates relevance score for QA scenario
+model:
+  api: chat
+  configuration:
+    type: azure_openai
+    azure_deployment: ${env:AZURE_DEPLOYMENT}
+    api_key: ${env:AZURE_OPENAI_API_KEY}
+    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
+  parameters:
+    temperature: 0.0
+    max_tokens: 1
+    top_p: 1.0
+    presence_penalty: 0
+    frequency_penalty: 0
+    response_format:
+      type: text
+
+inputs:
+  query:
+    type: string
+  response:
+    type: string
+  context:
+    type: string
+
+---
+system:
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
+user:
+Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one to five stars using the following rating scale:
+One star: the answer completely lacks relevance
+Two stars: the answer mostly lacks relevance
+Three stars: the answer is partially relevant
+Four stars: the answer is mostly relevant
+Five stars: the answer has perfect relevance
+
+This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
+
+context: Marie Curie was a Polish-born physicist and chemist who pioneered research on radioactivity and was the first woman to win a Nobel Prize.
+question: What field did Marie Curie excel in?
+answer: Marie Curie was a renowned painter who focused mainly on impressionist styles and techniques.
+stars: 1
+
+context: The Beatles were an English rock band formed in Liverpool in 1960, and they are widely regarded as the most influential music band in history.
+question: Where were The Beatles formed?
+answer: The band The Beatles began their journey in London, England, and they changed the history of music.
+stars: 2
+
+context: The recent Mars rover, Perseverance, was launched in 2020 with the main goal of searching for signs of ancient life on Mars. The rover also carries an experiment called MOXIE, which aims to generate oxygen from the Martian atmosphere.
+question: What are the main goals of Perseverance Mars rover mission?
+answer: The Perseverance Mars rover mission focuses on searching for signs of ancient life on Mars.
+stars: 3
+
+context: The Mediterranean diet is a commonly recommended dietary plan that emphasizes fruits, vegetables, whole grains, legumes, lean proteins, and healthy fats. Studies have shown that it offers numerous health benefits, including a reduced risk of heart disease and improved cognitive health.
+question: What are the main components of the Mediterranean diet?
+answer: The Mediterranean diet primarily consists of fruits, vegetables, whole grains, and legumes.
+stars: 4
+
+context: The Queen's Royal Castle is a well-known tourist attraction in the United Kingdom. It spans over 500 acres and contains extensive gardens and parks. The castle was built in the 15th century and has been home to generations of royalty.
+question: What are the main attractions of the Queen's Royal Castle?
+answer: The main attractions of the Queen's Royal Castle are its expansive 500-acre grounds, extensive gardens, parks, and the historical castle itself, which dates back to the 15th century and has housed generations of royalty.
+stars: 5
+
+context: {{context}}
+question: {{query}}
+answer: {{response}}
+stars:
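The front matter above reads its deployment, key, and endpoint from environment variables and caps the completion at a single token, so the model can only emit the rating digit; when _relevance.py loads it through AsyncPrompty.load, that default configuration is overridden by the model_config passed in. Purely as a sketch, running the template standalone on its env-based defaults might look like the following, assuming promptflow's synchronous Prompty loader accepts the same source path:

    import os

    from promptflow.core import Prompty

    # Hypothetical values; the variable names come from the prompty front matter.
    os.environ["AZURE_DEPLOYMENT"] = "<your-deployment>"
    os.environ["AZURE_OPENAI_API_KEY"] = "<your-api-key>"
    os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<your-resource>.openai.azure.com"

    # Path to the relevance.prompty file shown above.
    flow = Prompty.load(source="relevance.prompty")
    print(flow(
        query="What is the capital of Japan?",
        response="The capital of Japan is Tokyo.",
        context="Tokyo is Japan's capital.",
    ))  # expected to print a single digit between 1 and 5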
azure/ai/evaluation/_evaluators/_rouge/__init__.py
@@ -0,0 +1,10 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+
+from ._rouge import RougeScoreEvaluator, RougeType
+
+__all__ = [
+    "RougeScoreEvaluator",
+    "RougeType",
+]
azure/ai/evaluation/_evaluators/_rouge/_rouge.py
@@ -0,0 +1,98 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+from enum import Enum
+
+from rouge_score import rouge_scorer
+
+from promptflow._utils.async_utils import async_run_allowing_running_loop
+
+
+class RougeType(str, Enum):
+    """
+    Enumeration of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) types.
+    """
+
+    ROUGE_1 = "rouge1"
+    """Overlap of unigrams (single words) between generated and reference text."""
+
+    ROUGE_2 = "rouge2"
+    """Overlap of bigrams (two consecutive words) between generated and reference text."""
+
+    ROUGE_3 = "rouge3"
+    """Overlap of trigrams (three consecutive words) between generated and reference text."""
+
+    ROUGE_4 = "rouge4"
+    """Overlap of four-grams (four consecutive words) between generated and reference text."""
+
+    ROUGE_5 = "rouge5"
+    """Overlap of five-grams (five consecutive words) between generated and reference text."""
+
+    ROUGE_L = "rougeL"
+    """Overlap of L-grams (L consecutive words) between generated and reference text."""
+
+
+class _AsyncRougeScoreEvaluator:
+    def __init__(self, rouge_type: RougeType):
+        self._rouge_type = rouge_type
+
+    async def __call__(self, *, ground_truth: str, response: str, **kwargs):
+        scorer = rouge_scorer.RougeScorer(rouge_types=[self._rouge_type])
+        metrics = scorer.score(ground_truth, response)[self._rouge_type]
+        return {
+            "rouge_precision": metrics.precision,
+            "rouge_recall": metrics.recall,
+            "rouge_f1_score": metrics.fmeasure,
+        }
+
+
+class RougeScoreEvaluator:
+    """
+    Evaluator for computes the ROUGE scores between two strings.
+
+    ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic
+    summarization and machine translation. It measures the overlap between generated text and reference summaries.
+    ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text
+    summarization and document comparison are among optimal use cases for ROUGE, particularly in scenarios where text
+    coherence and relevance are critical.
+
+    **Usage**
+
+    .. code-block:: python
+
+        eval_fn = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)
+        result = eval_fn(
+            response="Tokyo is the capital of Japan.",
+            ground_truth="The capital of Japan is Tokyo.")
+
+    **Output format**
+
+    .. code-block:: python
+
+        {
+            "rouge_precision": 1.0,
+            "rouge_recall": 1.0,
+            "rouge_f1_score": 1.0
+        }
+    """
+
+    def __init__(self, rouge_type: RougeType):
+        self._async_evaluator = _AsyncRougeScoreEvaluator(rouge_type)
+
+    def __call__(self, *, ground_truth: str, response: str, **kwargs):
+        """
+        Evaluate the ROUGE score between the response and the ground truth.
+
+        :keyword response: The response to be evaluated.
+        :paramtype response: str
+        :keyword ground_truth: The ground truth to be compared against.
+        :paramtype ground_truth: str
+        :return: The ROUGE score.
+        :rtype: dict
+        """
+        return async_run_allowing_running_loop(
+            self._async_evaluator, ground_truth=ground_truth, response=response, **kwargs
+        )
+
+    def _to_async(self):
+        return self._async_evaluator
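RougeScoreEvaluator is a thin wrapper over the rouge_score package: it builds a RougeScorer for the single requested RougeType and returns the precision, recall, and F-measure of scorer.score(ground_truth, response). A short sketch comparing granularities, assuming the top-level package re-exports RougeScoreEvaluator and RougeType as the _rouge __init__.py suggests:

    from azure.ai.evaluation import RougeScoreEvaluator, RougeType

    ground_truth = "The capital of Japan is Tokyo."
    response = "Tokyo is the capital of Japan."

    for rouge_type in (RougeType.ROUGE_1, RougeType.ROUGE_2, RougeType.ROUGE_L):
        eval_fn = RougeScoreEvaluator(rouge_type=rouge_type)
        scores = eval_fn(response=response, ground_truth=ground_truth)
        # Unigram overlap is perfect here, while word-order differences lower
        # the bigram (ROUGE-2) and longest-common-subsequence (ROUGE-L) scores.
        print(rouge_type.value, scores)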
azure/ai/evaluation/_evaluators/_similarity/__init__.py
@@ -0,0 +1,9 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+
+from ._similarity import SimilarityEvaluator
+
+__all__ = [
+    "SimilarityEvaluator",
+]
azure/ai/evaluation/_evaluators/_similarity/_similarity.py
@@ -0,0 +1,130 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+
+import os
+import re
+from typing import Union
+
+import numpy as np
+
+from promptflow._utils.async_utils import async_run_allowing_running_loop
+from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+from promptflow.core import AsyncPrompty
+
+from ..._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
+from ..._common.utils import (
+    check_and_add_api_version_for_aoai_model_config,
+    check_and_add_user_agent_for_aoai_model_config,
+)
+
+try:
+    from ..._user_agent import USER_AGENT
+except ImportError:
+    USER_AGENT = None
+
+
+class _AsyncSimilarityEvaluator:
+    # Constants must be defined within eval's directory to be save/loadable
+    PROMPTY_FILE = "similarity.prompty"
+    LLM_CALL_TIMEOUT = 600
+    DEFAULT_OPEN_API_VERSION = "2024-02-15-preview"
+
+    def __init__(self, model_config: dict):
+        check_and_add_api_version_for_aoai_model_config(model_config, self.DEFAULT_OPEN_API_VERSION)
+
+        prompty_model_config = {"configuration": model_config, "parameters": {"extra_headers": {}}}
+
+        # Handle "RuntimeError: Event loop is closed" from httpx AsyncClient
+        # https://github.com/encode/httpx/discussions/2959
+        prompty_model_config["parameters"]["extra_headers"].update({"Connection": "close"})
+
+        check_and_add_user_agent_for_aoai_model_config(
+            model_config,
+            prompty_model_config,
+            USER_AGENT,
+        )
+
+        current_dir = os.path.dirname(__file__)
+        prompty_path = os.path.join(current_dir, self.PROMPTY_FILE)
+        self._flow = AsyncPrompty.load(source=prompty_path, model=prompty_model_config)
+
+    async def __call__(self, *, query: str, response: str, ground_truth: str, **kwargs):
+        # Validate input parameters
+        query = str(query or "")
+        response = str(response or "")
+        ground_truth = str(ground_truth or "")
+
+        if not (query.strip() and response.strip() and ground_truth.strip()):
+            msg = "'query', 'response' and 'ground_truth' must be non-empty strings."
+            raise EvaluationException(
+                message=msg,
+                internal_message=msg,
+                error_category=ErrorCategory.MISSING_FIELD,
+                error_blame=ErrorBlame.USER_ERROR,
+                error_target=ErrorTarget.SIMILARITY_EVALUATOR,
+            )
+
+        # Run the evaluation flow
+        llm_output = await self._flow(
+            query=query, response=response, ground_truth=ground_truth, timeout=self.LLM_CALL_TIMEOUT, **kwargs
+        )
+
+        score = np.nan
+        if llm_output:
+            match = re.search(r"\d", llm_output)
+            if match:
+                score = float(match.group())
+
+        return {"gpt_similarity": float(score)}
+
+
+class SimilarityEvaluator:
+    """
+    Initialize a similarity evaluator configured for a specific Azure OpenAI model.
+
+    :param model_config: Configuration for the Azure OpenAI model.
+    :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+        ~azure.ai.evaluation.OpenAIModelConfiguration]
+
+    **Usage**
+
+    .. code-block:: python
+
+        eval_fn = SimilarityEvaluator(model_config)
+        result = eval_fn(
+            query="What is the capital of Japan?",
+            response="The capital of Japan is Tokyo.",
+            ground_truth="Tokyo is Japan's capital.")
+
+    **Output format**
+
+    .. code-block:: python
+
+        {
+            "gpt_similarity": 3.0
+        }
+    """
+
+    def __init__(self, model_config: dict):
+        self._async_evaluator = _AsyncSimilarityEvaluator(model_config)
+
+    def __call__(self, *, query: str, response: str, ground_truth: str, **kwargs):
+        """
+        Evaluate similarity.
+
+        :keyword query: The query to be evaluated.
+        :paramtype query: str
+        :keyword response: The response to be evaluated.
+        :paramtype response: str
+        :keyword ground_truth: The ground truth to be evaluated.
+        :paramtype ground_truth: str
+        :return: The similarity score.
+        :rtype: dict
+        """
+        return async_run_allowing_running_loop(
+            self._async_evaluator, query=query, response=response, ground_truth=ground_truth, **kwargs
+        )
+
+    def _to_async(self):
+        return self._async_evaluator
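Like the other prompt-based evaluators, SimilarityEvaluator wraps its async core with async_run_allowing_running_loop for synchronous callers, while _to_async() hands back the _AsyncSimilarityEvaluator directly. A sketch of the async path follows; _to_async is a private helper, so treat this as illustrative only, with the same hypothetical model_config placeholders as before:

    import asyncio

    from azure.ai.evaluation import SimilarityEvaluator

    model_config = {  # hypothetical Azure OpenAI settings
        "azure_endpoint": "https://<your-resource>.openai.azure.com",
        "api_key": "<your-api-key>",
        "azure_deployment": "<your-deployment>",
    }

    async def main() -> None:
        similarity = SimilarityEvaluator(model_config)
        async_eval = similarity._to_async()  # private helper returning the awaitable evaluator
        result = await async_eval(
            query="What is the capital of Japan?",
            response="The capital of Japan is Tokyo.",
            ground_truth="Tokyo is Japan's capital.",
        )
        print(result)  # {"gpt_similarity": <float 1.0-5.0, or nan if unparsable>}

    asyncio.run(main())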
azure/ai/evaluation/_evaluators/_similarity/similarity.prompty
@@ -0,0 +1,71 @@
+---
+name: Similarity
+description: Evaluates similarity score for QA scenario
+model:
+  api: chat
+  configuration:
+    type: azure_openai
+    azure_deployment: ${env:AZURE_DEPLOYMENT}
+    api_key: ${env:AZURE_OPENAI_API_KEY}
+    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
+  parameters:
+    temperature: 0.0
+    max_tokens: 1
+    top_p: 1.0
+    presence_penalty: 0
+    frequency_penalty: 0
+    response_format:
+      type: text
+
+inputs:
+  query:
+    type: string
+  response:
+    type: string
+  ground_truth:
+    type: string
+
+---
+system:
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
+user:
+Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale:
+One star: the predicted answer is not at all similar to the correct answer
+Two stars: the predicted answer is mostly not similar to the correct answer
+Three stars: the predicted answer is somewhat similar to the correct answer
+Four stars: the predicted answer is mostly similar to the correct answer
+Five stars: the predicted answer is completely similar to the correct answer
+
+This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
+
+The examples below show the Equivalence score for a question, a correct answer, and a predicted answer.
+
+question: What is the role of ribosomes?
+correct answer: Ribosomes are cellular structures responsible for protein synthesis. They interpret the genetic information carried by messenger RNA (mRNA) and use it to assemble amino acids into proteins.
+predicted answer: Ribosomes participate in carbohydrate breakdown by removing nutrients from complex sugar molecules.
+stars: 1
+
+question: Why did the Titanic sink?
+correct answer: The Titanic sank after it struck an iceberg during its maiden voyage in 1912. The impact caused the ship's hull to breach, allowing water to flood into the vessel. The ship's design, lifeboat shortage, and lack of timely rescue efforts contributed to the tragic loss of life.
+predicted answer: The sinking of the Titanic was a result of a large iceberg collision. This caused the ship to take on water and eventually sink, leading to the death of many passengers due to a shortage of lifeboats and insufficient rescue attempts.
+stars: 2
+
+question: What causes seasons on Earth?
+correct answer: Seasons on Earth are caused by the tilt of the Earth's axis and its revolution around the Sun. As the Earth orbits the Sun, the tilt causes different parts of the planet to receive varying amounts of sunlight, resulting in changes in temperature and weather patterns.
+predicted answer: Seasons occur because of the Earth's rotation and its elliptical orbit around the Sun. The tilt of the Earth's axis causes regions to be subjected to different sunlight intensities, which leads to temperature fluctuations and alternating weather conditions.
+stars: 3
+
+question: How does photosynthesis work?
+correct answer: Photosynthesis is a process by which green plants and some other organisms convert light energy into chemical energy. This occurs as light is absorbed by chlorophyll molecules, and then carbon dioxide and water are converted into glucose and oxygen through a series of reactions.
+predicted answer: In photosynthesis, sunlight is transformed into nutrients by plants and certain microorganisms. Light is captured by chlorophyll molecules, followed by the conversion of carbon dioxide and water into sugar and oxygen through multiple reactions.
+stars: 4
+
+question: What are the health benefits of regular exercise?
+correct answer: Regular exercise can help maintain a healthy weight, increase muscle and bone strength, and reduce the risk of chronic diseases. It also promotes mental well-being by reducing stress and improving overall mood.
+predicted answer: Routine physical activity can contribute to maintaining ideal body weight, enhancing muscle and bone strength, and preventing chronic illnesses. In addition, it supports mental health by alleviating stress and augmenting general mood.
+stars: 5
+
+question: {{query}}
+correct answer:{{ground_truth}}
+predicted answer: {{response}}
+stars:
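Both prompty templates share the same scoring contract: the system message demands a single integer from 1 to 5, max_tokens: 1 enforces it, and the Python side recovers the value with re.search(r"\d", llm_output), falling back to NaN (np.nan in the evaluators) when no digit appears. A tiny sketch of that parsing convention in isolation:

    import math
    import re

    def parse_star_rating(llm_output: str) -> float:
        """Mirror the evaluators' digit extraction: first digit wins, otherwise NaN."""
        match = re.search(r"\d", llm_output or "")
        return float(match.group()) if match else float("nan")

    print(parse_star_rating("4"))             # 4.0
    print(parse_star_rating("stars: 3"))      # 3.0; stray text before the digit is ignored
    print(math.isnan(parse_star_rating("")))  # True; no digit means an unscored row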