azure-ai-evaluation 0.0.0b0__py3-none-any.whl → 1.0.0b1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of azure-ai-evaluation might be problematic.

Files changed (100)
  1. azure/ai/evaluation/__init__.py +60 -0
  2. azure/ai/evaluation/_common/__init__.py +16 -0
  3. azure/ai/evaluation/_common/constants.py +65 -0
  4. azure/ai/evaluation/_common/rai_service.py +452 -0
  5. azure/ai/evaluation/_common/utils.py +87 -0
  6. azure/ai/evaluation/_constants.py +50 -0
  7. azure/ai/evaluation/_evaluate/__init__.py +3 -0
  8. azure/ai/evaluation/_evaluate/_batch_run_client/__init__.py +8 -0
  9. azure/ai/evaluation/_evaluate/_batch_run_client/batch_run_context.py +72 -0
  10. azure/ai/evaluation/_evaluate/_batch_run_client/code_client.py +150 -0
  11. azure/ai/evaluation/_evaluate/_batch_run_client/proxy_client.py +61 -0
  12. azure/ai/evaluation/_evaluate/_eval_run.py +494 -0
  13. azure/ai/evaluation/_evaluate/_evaluate.py +689 -0
  14. azure/ai/evaluation/_evaluate/_telemetry/__init__.py +174 -0
  15. azure/ai/evaluation/_evaluate/_utils.py +237 -0
  16. azure/ai/evaluation/_evaluators/__init__.py +3 -0
  17. azure/ai/evaluation/_evaluators/_bleu/__init__.py +9 -0
  18. azure/ai/evaluation/_evaluators/_bleu/_bleu.py +73 -0
  19. azure/ai/evaluation/_evaluators/_chat/__init__.py +9 -0
  20. azure/ai/evaluation/_evaluators/_chat/_chat.py +350 -0
  21. azure/ai/evaluation/_evaluators/_chat/retrieval/__init__.py +9 -0
  22. azure/ai/evaluation/_evaluators/_chat/retrieval/_retrieval.py +163 -0
  23. azure/ai/evaluation/_evaluators/_chat/retrieval/retrieval.prompty +48 -0
  24. azure/ai/evaluation/_evaluators/_coherence/__init__.py +7 -0
  25. azure/ai/evaluation/_evaluators/_coherence/_coherence.py +122 -0
  26. azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +62 -0
  27. azure/ai/evaluation/_evaluators/_content_safety/__init__.py +21 -0
  28. azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +108 -0
  29. azure/ai/evaluation/_evaluators/_content_safety/_content_safety_base.py +66 -0
  30. azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +296 -0
  31. azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +78 -0
  32. azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +76 -0
  33. azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +76 -0
  34. azure/ai/evaluation/_evaluators/_content_safety/_violence.py +76 -0
  35. azure/ai/evaluation/_evaluators/_eci/__init__.py +0 -0
  36. azure/ai/evaluation/_evaluators/_eci/_eci.py +99 -0
  37. azure/ai/evaluation/_evaluators/_f1_score/__init__.py +9 -0
  38. azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +141 -0
  39. azure/ai/evaluation/_evaluators/_fluency/__init__.py +9 -0
  40. azure/ai/evaluation/_evaluators/_fluency/_fluency.py +122 -0
  41. azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +61 -0
  42. azure/ai/evaluation/_evaluators/_gleu/__init__.py +9 -0
  43. azure/ai/evaluation/_evaluators/_gleu/_gleu.py +71 -0
  44. azure/ai/evaluation/_evaluators/_groundedness/__init__.py +9 -0
  45. azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +123 -0
  46. azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +54 -0
  47. azure/ai/evaluation/_evaluators/_meteor/__init__.py +9 -0
  48. azure/ai/evaluation/_evaluators/_meteor/_meteor.py +96 -0
  49. azure/ai/evaluation/_evaluators/_protected_material/__init__.py +5 -0
  50. azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +104 -0
  51. azure/ai/evaluation/_evaluators/_protected_materials/__init__.py +5 -0
  52. azure/ai/evaluation/_evaluators/_protected_materials/_protected_materials.py +104 -0
  53. azure/ai/evaluation/_evaluators/_qa/__init__.py +9 -0
  54. azure/ai/evaluation/_evaluators/_qa/_qa.py +111 -0
  55. azure/ai/evaluation/_evaluators/_relevance/__init__.py +9 -0
  56. azure/ai/evaluation/_evaluators/_relevance/_relevance.py +131 -0
  57. azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +69 -0
  58. azure/ai/evaluation/_evaluators/_rouge/__init__.py +10 -0
  59. azure/ai/evaluation/_evaluators/_rouge/_rouge.py +98 -0
  60. azure/ai/evaluation/_evaluators/_similarity/__init__.py +9 -0
  61. azure/ai/evaluation/_evaluators/_similarity/_similarity.py +130 -0
  62. azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +71 -0
  63. azure/ai/evaluation/_evaluators/_xpia/__init__.py +5 -0
  64. azure/ai/evaluation/_evaluators/_xpia/xpia.py +140 -0
  65. azure/ai/evaluation/_exceptions.py +107 -0
  66. azure/ai/evaluation/_http_utils.py +395 -0
  67. azure/ai/evaluation/_model_configurations.py +27 -0
  68. azure/ai/evaluation/_user_agent.py +6 -0
  69. azure/ai/evaluation/_version.py +5 -0
  70. azure/ai/evaluation/py.typed +0 -0
  71. azure/ai/evaluation/simulator/__init__.py +15 -0
  72. azure/ai/evaluation/simulator/_adversarial_scenario.py +27 -0
  73. azure/ai/evaluation/simulator/_adversarial_simulator.py +450 -0
  74. azure/ai/evaluation/simulator/_constants.py +17 -0
  75. azure/ai/evaluation/simulator/_conversation/__init__.py +315 -0
  76. azure/ai/evaluation/simulator/_conversation/_conversation.py +178 -0
  77. azure/ai/evaluation/simulator/_conversation/constants.py +30 -0
  78. azure/ai/evaluation/simulator/_direct_attack_simulator.py +252 -0
  79. azure/ai/evaluation/simulator/_helpers/__init__.py +4 -0
  80. azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +17 -0
  81. azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +93 -0
  82. azure/ai/evaluation/simulator/_indirect_attack_simulator.py +207 -0
  83. azure/ai/evaluation/simulator/_model_tools/__init__.py +23 -0
  84. azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +147 -0
  85. azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +228 -0
  86. azure/ai/evaluation/simulator/_model_tools/_rai_client.py +157 -0
  87. azure/ai/evaluation/simulator/_model_tools/_template_handler.py +157 -0
  88. azure/ai/evaluation/simulator/_model_tools/models.py +616 -0
  89. azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +69 -0
  90. azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +36 -0
  91. azure/ai/evaluation/simulator/_tracing.py +92 -0
  92. azure/ai/evaluation/simulator/_utils.py +111 -0
  93. azure/ai/evaluation/simulator/simulator.py +579 -0
  94. azure_ai_evaluation-1.0.0b1.dist-info/METADATA +377 -0
  95. azure_ai_evaluation-1.0.0b1.dist-info/RECORD +97 -0
  96. {azure_ai_evaluation-0.0.0b0.dist-info → azure_ai_evaluation-1.0.0b1.dist-info}/WHEEL +1 -1
  97. azure_ai_evaluation-1.0.0b1.dist-info/top_level.txt +1 -0
  98. azure_ai_evaluation-0.0.0b0.dist-info/METADATA +0 -7
  99. azure_ai_evaluation-0.0.0b0.dist-info/RECORD +0 -4
  100. azure_ai_evaluation-0.0.0b0.dist-info/top_level.txt +0 -1
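
The listing is dominated by new evaluator and simulator modules; the hunks below show a representative subset (the QA, relevance, ROUGE, and similarity evaluators plus their prompty templates). As a rough orientation only, here is a minimal sketch of how these evaluators are typically constructed and called. It assumes the package's top-level __init__.py re-exports the evaluator classes and that an Azure OpenAI model configuration is a mapping with the endpoint/key/deployment fields referenced by the code below; all values are placeholders.

    # Hedged sketch, not from the diff: construct a model configuration and run one
    # of the new evaluators. Assumes `azure.ai.evaluation` re-exports RelevanceEvaluator.
    from azure.ai.evaluation import RelevanceEvaluator

    model_config = {
        "azure_endpoint": "https://<your-resource>.openai.azure.com",  # placeholder endpoint
        "api_key": "<api-key>",                                        # placeholder credential
        "azure_deployment": "<deployment-name>",                       # placeholder deployment
    }

    relevance = RelevanceEvaluator(model_config)
    score = relevance(
        query="What is the capital of Japan?",
        response="The capital of Japan is Tokyo.",
        context="Tokyo is Japan's capital.",
    )
    print(score)  # e.g. {"gpt_relevance": 5.0}

The same model_config shape is consumed by QAEvaluator and the other prompty-backed evaluators in the listing.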
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_qa/_qa.py
@@ -0,0 +1,111 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ from concurrent.futures import as_completed
+ from typing import Union
+
+ from promptflow.tracing import ThreadPoolExecutorWithContext as ThreadPoolExecutor
+
+ from .._coherence import CoherenceEvaluator
+ from .._f1_score import F1ScoreEvaluator
+ from .._fluency import FluencyEvaluator
+ from .._groundedness import GroundednessEvaluator
+ from ..._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
+ from .._relevance import RelevanceEvaluator
+ from .._similarity import SimilarityEvaluator
+
+
+ class QAEvaluator:
+     """
+     Initialize a question-answer evaluator configured for a specific Azure OpenAI model.
+
+     :param model_config: Configuration for the Azure OpenAI model.
+     :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+         ~azure.ai.evaluation.OpenAIModelConfiguration]
+     :return: A function that evaluates and generates metrics for the question-answering scenario.
+     :rtype: Callable
+
+     **Usage**
+
+     .. code-block:: python
+
+         eval_fn = QAEvaluator(model_config)
+         result = eval_fn(
+             query="Tokyo is the capital of which country?",
+             response="Japan",
+             context="Tokyo is the capital of Japan.",
+             ground_truth="Japan"
+         )
+
+     **Output format**
+
+     .. code-block:: python
+
+         {
+             "gpt_groundedness": 3.5,
+             "gpt_relevance": 4.0,
+             "gpt_coherence": 1.5,
+             "gpt_fluency": 4.0,
+             "gpt_similarity": 3.0,
+             "f1_score": 0.42
+         }
+     """
+
+     def __init__(
+         self, model_config: dict, parallel: bool = True
+     ):
+         self._parallel = parallel
+
+         self._evaluators = [
+             GroundednessEvaluator(model_config),
+             RelevanceEvaluator(model_config),
+             CoherenceEvaluator(model_config),
+             FluencyEvaluator(model_config),
+             SimilarityEvaluator(model_config),
+             F1ScoreEvaluator(),
+         ]
+
+     def __call__(self, *, query: str, response: str, context: str, ground_truth: str, **kwargs):
+         """
+         Evaluate the question-answering scenario.
+
+         :keyword query: The query to be evaluated.
+         :paramtype query: str
+         :keyword response: The response to be evaluated.
+         :paramtype response: str
+         :keyword context: The context to be evaluated.
+         :paramtype context: str
+         :keyword ground_truth: The ground truth to be evaluated.
+         :paramtype ground_truth: str
+         :keyword parallel: Whether the sub-evaluators run in parallel; set via the constructor, defaults to True.
+         :paramtype parallel: bool
+         :return: The scores for the QA scenario.
+         :rtype: dict
+         """
+         results = {}
+         if self._parallel:
+             with ThreadPoolExecutor() as executor:
+                 futures = {
+                     executor.submit(
+                         evaluator,
+                         query=query,
+                         response=response,
+                         context=context,
+                         ground_truth=ground_truth,
+                         **kwargs
+                     ): evaluator
+                     for evaluator in self._evaluators
+                 }
+
+                 # Collect results as they complete
+                 for future in as_completed(futures):
+                     results.update(future.result())
+         else:
+             for evaluator in self._evaluators:
+                 result = evaluator(
+                     query=query, response=response, context=context, ground_truth=ground_truth, **kwargs
+                 )
+                 results.update(result)
+
+         return results
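
QAEvaluator.__call__ composes its result by submitting every sub-evaluator to a thread pool and merging each returned metric dict as the futures complete. Below is a self-contained sketch of that fan-out/merge pattern using plain concurrent.futures (the real code uses promptflow's context-propagating ThreadPoolExecutorWithContext) and two hypothetical stand-in evaluators.

    # Simplified sketch of the fan-out/merge pattern in QAEvaluator.__call__.
    # fake_relevance and fake_f1 are stand-ins, not the real evaluators.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def fake_relevance(**kwargs):
        return {"gpt_relevance": 4.0}

    def fake_f1(**kwargs):
        return {"f1_score": 0.42}

    evaluators = [fake_relevance, fake_f1]
    inputs = {"query": "q", "response": "r", "context": "c", "ground_truth": "g"}

    results = {}
    with ThreadPoolExecutor() as executor:
        # Submit every evaluator with the same inputs, then merge dicts as they finish.
        futures = {executor.submit(ev, **inputs): ev for ev in evaluators}
        for future in as_completed(futures):
            results.update(future.result())

    print(results)  # {'gpt_relevance': 4.0, 'f1_score': 0.42}

Because each sub-evaluator returns a dict keyed by its own metric names (gpt_relevance, f1_score, and so on), dict.update merges them without collisions regardless of completion order.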
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_relevance/__init__.py
@@ -0,0 +1,9 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ from ._relevance import RelevanceEvaluator
+
+ __all__ = [
+     "RelevanceEvaluator",
+ ]
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_relevance/_relevance.py
@@ -0,0 +1,131 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ import os
+ import re
+ from typing import Union
+
+ import numpy as np
+
+ from promptflow._utils.async_utils import async_run_allowing_running_loop
+ from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+ from promptflow.core import AsyncPrompty
+
+ from ..._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
+ from ..._common.utils import (
+     check_and_add_api_version_for_aoai_model_config,
+     check_and_add_user_agent_for_aoai_model_config,
+ )
+
+ try:
+     from ..._user_agent import USER_AGENT
+ except ImportError:
+     USER_AGENT = None
+
+
+ class _AsyncRelevanceEvaluator:
+     # Constants must be defined within eval's directory to be save/loadable
+     PROMPTY_FILE = "relevance.prompty"
+     LLM_CALL_TIMEOUT = 600
+     DEFAULT_OPEN_API_VERSION = "2024-02-15-preview"
+
+     def __init__(self, model_config: dict):
+         check_and_add_api_version_for_aoai_model_config(model_config, self.DEFAULT_OPEN_API_VERSION)
+
+         prompty_model_config = {"configuration": model_config, "parameters": {"extra_headers": {}}}
+
+         # Handle "RuntimeError: Event loop is closed" from httpx AsyncClient
+         # https://github.com/encode/httpx/discussions/2959
+         prompty_model_config["parameters"]["extra_headers"].update({"Connection": "close"})
+
+         check_and_add_user_agent_for_aoai_model_config(
+             model_config,
+             prompty_model_config,
+             USER_AGENT,
+         )
+
+         current_dir = os.path.dirname(__file__)
+         prompty_path = os.path.join(current_dir, self.PROMPTY_FILE)
+         self._flow = AsyncPrompty.load(source=prompty_path, model=prompty_model_config)
+
+     async def __call__(self, *, query: str, response: str, context: str, **kwargs):
+         # Validate input parameters
+         query = str(query or "")
+         response = str(response or "")
+         context = str(context or "")
+
+         if not (query.strip() and response.strip() and context.strip()):
+             msg = "'query', 'response' and 'context' must be non-empty strings."
+             raise EvaluationException(
+                 message=msg,
+                 internal_message=msg,
+                 error_category=ErrorCategory.MISSING_FIELD,
+                 error_blame=ErrorBlame.USER_ERROR,
+                 error_target=ErrorTarget.RELEVANCE_EVALUATOR,
+             )
+
+         # Run the evaluation flow
+         llm_output = await self._flow(
+             query=query, response=response, context=context, timeout=self.LLM_CALL_TIMEOUT, **kwargs
+         )
+
+         score = np.nan
+         if llm_output:
+             match = re.search(r"\d", llm_output)
+             if match:
+                 score = float(match.group())
+
+         return {"gpt_relevance": float(score)}
+
+
+ class RelevanceEvaluator:
+     """
+     Initialize a relevance evaluator configured for a specific Azure OpenAI model.
+
+     :param model_config: Configuration for the Azure OpenAI model.
+     :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+         ~azure.ai.evaluation.OpenAIModelConfiguration]
+
+     **Usage**
+
+     .. code-block:: python
+
+         eval_fn = RelevanceEvaluator(model_config)
+         result = eval_fn(
+             query="What is the capital of Japan?",
+             response="The capital of Japan is Tokyo.",
+             context="Tokyo is Japan's capital, known for its blend of traditional culture \
+                 and technological advancements.")
+
+     **Output format**
+
+     .. code-block:: python
+
+         {
+             "gpt_relevance": 3.0
+         }
+     """
+
+     def __init__(self, model_config: dict):
+         self._async_evaluator = _AsyncRelevanceEvaluator(model_config)
+
+     def __call__(self, *, query: str, response: str, context: str, **kwargs):
+         """
+         Evaluate relevance.
+
+         :keyword query: The query to be evaluated.
+         :paramtype query: str
+         :keyword response: The response to be evaluated.
+         :paramtype response: str
+         :keyword context: The context to be evaluated.
+         :paramtype context: str
+         :return: The relevance score.
+         :rtype: dict
+         """
+         return async_run_allowing_running_loop(
+             self._async_evaluator, query=query, response=response, context=context, **kwargs
+         )
+
+     def _to_async(self):
+         return self._async_evaluator
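
The async evaluator turns the model output into a score by taking the first digit it finds and falling back to NaN when the output is empty or digit-free. A standalone sketch of that parsing step:

    # Self-contained sketch of the score-parsing logic in _AsyncRelevanceEvaluator.__call__.
    import re
    import numpy as np

    def parse_score(llm_output: str) -> float:
        score = np.nan
        if llm_output:
            match = re.search(r"\d", llm_output)  # first digit anywhere in the output
            if match:
                score = float(match.group())
        return float(score)

    print(parse_score("4"))         # 4.0
    print(parse_score("stars: 3"))  # 3.0 -- the first digit wins
    print(parse_score(""))          # nan -- empty output falls back to NaN

Since the prompty below caps max_tokens at 1 and instructs the model to answer with a single integer from 1 to 5, a one-digit search is normally sufficient; the NaN fallback keeps a malformed response from raising.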
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty
@@ -0,0 +1,69 @@
+ ---
+ name: Relevance
+ description: Evaluates relevance score for QA scenario
+ model:
+   api: chat
+   configuration:
+     type: azure_openai
+     azure_deployment: ${env:AZURE_DEPLOYMENT}
+     api_key: ${env:AZURE_OPENAI_API_KEY}
+     azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
+   parameters:
+     temperature: 0.0
+     max_tokens: 1
+     top_p: 1.0
+     presence_penalty: 0
+     frequency_penalty: 0
+     response_format:
+       type: text
+
+ inputs:
+   query:
+     type: string
+   response:
+     type: string
+   context:
+     type: string
+
+ ---
+ system:
+ You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 and 5 representing the evaluation metric. You will include no other text or information.
+ user:
+ Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one and five stars using the following rating scale:
+ One star: the answer completely lacks relevance
+ Two stars: the answer mostly lacks relevance
+ Three stars: the answer is partially relevant
+ Four stars: the answer is mostly relevant
+ Five stars: the answer has perfect relevance
+
+ This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
+
+ context: Marie Curie was a Polish-born physicist and chemist who pioneered research on radioactivity and was the first woman to win a Nobel Prize.
+ question: What field did Marie Curie excel in?
+ answer: Marie Curie was a renowned painter who focused mainly on impressionist styles and techniques.
+ stars: 1
+
+ context: The Beatles were an English rock band formed in Liverpool in 1960, and they are widely regarded as the most influential music band in history.
+ question: Where were The Beatles formed?
+ answer: The band The Beatles began their journey in London, England, and they changed the history of music.
+ stars: 2
+
+ context: The recent Mars rover, Perseverance, was launched in 2020 with the main goal of searching for signs of ancient life on Mars. The rover also carries an experiment called MOXIE, which aims to generate oxygen from the Martian atmosphere.
+ question: What are the main goals of Perseverance Mars rover mission?
+ answer: The Perseverance Mars rover mission focuses on searching for signs of ancient life on Mars.
+ stars: 3
+
+ context: The Mediterranean diet is a commonly recommended dietary plan that emphasizes fruits, vegetables, whole grains, legumes, lean proteins, and healthy fats. Studies have shown that it offers numerous health benefits, including a reduced risk of heart disease and improved cognitive health.
+ question: What are the main components of the Mediterranean diet?
+ answer: The Mediterranean diet primarily consists of fruits, vegetables, whole grains, and legumes.
+ stars: 4
+
+ context: The Queen's Royal Castle is a well-known tourist attraction in the United Kingdom. It spans over 500 acres and contains extensive gardens and parks. The castle was built in the 15th century and has been home to generations of royalty.
+ question: What are the main attractions of the Queen's Royal Castle?
+ answer: The main attractions of the Queen's Royal Castle are its expansive 500-acre grounds, extensive gardens, parks, and the historical castle itself, which dates back to the 15th century and has housed generations of royalty.
+ stars: 5
+
+ context: {{context}}
+ question: {{query}}
+ answer: {{response}}
+ stars:
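
The frontmatter above resolves its default connection from ${env:...} placeholders; when the file is loaded by _AsyncRelevanceEvaluator, that configuration block is replaced by the model_config the evaluator passes to AsyncPrompty.load. If the prompty were exercised on its own (for prompt debugging, say), the referenced variables would need to be set, roughly as in this sketch; values are placeholders.

    # Hedged sketch: environment variables referenced by the prompty's default configuration.
    import os

    os.environ["AZURE_DEPLOYMENT"] = "<deployment-name>"                           # placeholder
    os.environ["AZURE_OPENAI_API_KEY"] = "<api-key>"                               # placeholder
    os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<resource>.openai.azure.com"    # placeholder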
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_rouge/__init__.py
@@ -0,0 +1,10 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ from ._rouge import RougeScoreEvaluator, RougeType
+
+ __all__ = [
+     "RougeScoreEvaluator",
+     "RougeType",
+ ]
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_rouge/_rouge.py
@@ -0,0 +1,98 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+ from enum import Enum
+
+ from rouge_score import rouge_scorer
+
+ from promptflow._utils.async_utils import async_run_allowing_running_loop
+
+
+ class RougeType(str, Enum):
+     """
+     Enumeration of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) types.
+     """
+
+     ROUGE_1 = "rouge1"
+     """Overlap of unigrams (single words) between generated and reference text."""
+
+     ROUGE_2 = "rouge2"
+     """Overlap of bigrams (two consecutive words) between generated and reference text."""
+
+     ROUGE_3 = "rouge3"
+     """Overlap of trigrams (three consecutive words) between generated and reference text."""
+
+     ROUGE_4 = "rouge4"
+     """Overlap of four-grams (four consecutive words) between generated and reference text."""
+
+     ROUGE_5 = "rouge5"
+     """Overlap of five-grams (five consecutive words) between generated and reference text."""
+
+     ROUGE_L = "rougeL"
+     """Overlap based on the longest common subsequence between generated and reference text."""
+
+
+ class _AsyncRougeScoreEvaluator:
+     def __init__(self, rouge_type: RougeType):
+         self._rouge_type = rouge_type
+
+     async def __call__(self, *, ground_truth: str, response: str, **kwargs):
+         scorer = rouge_scorer.RougeScorer(rouge_types=[self._rouge_type])
+         metrics = scorer.score(ground_truth, response)[self._rouge_type]
+         return {
+             "rouge_precision": metrics.precision,
+             "rouge_recall": metrics.recall,
+             "rouge_f1_score": metrics.fmeasure,
+         }
+
+
+ class RougeScoreEvaluator:
+     """
+     Evaluator that computes ROUGE scores between two strings.
+
+     ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic
+     summarization and machine translation. It measures the overlap between generated text and reference summaries.
+     ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text
+     summarization and document comparison are among the optimal use cases for ROUGE, particularly in scenarios where
+     text coherence and relevance are critical.
+
+     **Usage**
+
+     .. code-block:: python
+
+         eval_fn = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)
+         result = eval_fn(
+             response="Tokyo is the capital of Japan.",
+             ground_truth="The capital of Japan is Tokyo.")
+
+     **Output format**
+
+     .. code-block:: python
+
+         {
+             "rouge_precision": 1.0,
+             "rouge_recall": 1.0,
+             "rouge_f1_score": 1.0
+         }
+     """
+
+     def __init__(self, rouge_type: RougeType):
+         self._async_evaluator = _AsyncRougeScoreEvaluator(rouge_type)
+
+     def __call__(self, *, ground_truth: str, response: str, **kwargs):
+         """
+         Evaluate the ROUGE score between the response and the ground truth.
+
+         :keyword response: The response to be evaluated.
+         :paramtype response: str
+         :keyword ground_truth: The ground truth to be compared against.
+         :paramtype ground_truth: str
+         :return: The ROUGE score.
+         :rtype: dict
+         """
+         return async_run_allowing_running_loop(
+             self._async_evaluator, ground_truth=ground_truth, response=response, **kwargs
+         )
+
+     def _to_async(self):
+         return self._async_evaluator
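
Under the hood the async evaluator delegates to the rouge-score package: it builds a RougeScorer for the selected type and reads precision, recall, and F-measure from the resulting Score tuple. A small sketch of that call for ROUGE-1, where tokens are unigrams and order is ignored:

    # Sketch of what _AsyncRougeScoreEvaluator does with rouge-score for ROUGE-1.
    from rouge_score import rouge_scorer

    ground_truth = "the capital of japan is tokyo"
    response = "tokyo is the capital of japan"

    scorer = rouge_scorer.RougeScorer(rouge_types=["rouge1"])
    metrics = scorer.score(ground_truth, response)["rouge1"]  # score(target, prediction)

    # Same six unigrams on both sides, so all three values are 1.0 here.
    print(metrics.precision, metrics.recall, metrics.fmeasure)

With RougeType.ROUGE_2 the same pair scores lower, since bigram order starts to matter; this mirrors the 1.0/1.0/1.0 output shown in the docstring example above.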
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_similarity/__init__.py
@@ -0,0 +1,9 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ from ._similarity import SimilarityEvaluator
+
+ __all__ = [
+     "SimilarityEvaluator",
+ ]
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_similarity/_similarity.py
@@ -0,0 +1,130 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ import os
+ import re
+ from typing import Union
+
+ import numpy as np
+
+ from promptflow._utils.async_utils import async_run_allowing_running_loop
+ from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+ from promptflow.core import AsyncPrompty
+
+ from ..._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
+ from ..._common.utils import (
+     check_and_add_api_version_for_aoai_model_config,
+     check_and_add_user_agent_for_aoai_model_config,
+ )
+
+ try:
+     from ..._user_agent import USER_AGENT
+ except ImportError:
+     USER_AGENT = None
+
+
+ class _AsyncSimilarityEvaluator:
+     # Constants must be defined within eval's directory to be save/loadable
+     PROMPTY_FILE = "similarity.prompty"
+     LLM_CALL_TIMEOUT = 600
+     DEFAULT_OPEN_API_VERSION = "2024-02-15-preview"
+
+     def __init__(self, model_config: dict):
+         check_and_add_api_version_for_aoai_model_config(model_config, self.DEFAULT_OPEN_API_VERSION)
+
+         prompty_model_config = {"configuration": model_config, "parameters": {"extra_headers": {}}}
+
+         # Handle "RuntimeError: Event loop is closed" from httpx AsyncClient
+         # https://github.com/encode/httpx/discussions/2959
+         prompty_model_config["parameters"]["extra_headers"].update({"Connection": "close"})
+
+         check_and_add_user_agent_for_aoai_model_config(
+             model_config,
+             prompty_model_config,
+             USER_AGENT,
+         )
+
+         current_dir = os.path.dirname(__file__)
+         prompty_path = os.path.join(current_dir, self.PROMPTY_FILE)
+         self._flow = AsyncPrompty.load(source=prompty_path, model=prompty_model_config)
+
+     async def __call__(self, *, query: str, response: str, ground_truth: str, **kwargs):
+         # Validate input parameters
+         query = str(query or "")
+         response = str(response or "")
+         ground_truth = str(ground_truth or "")
+
+         if not (query.strip() and response.strip() and ground_truth.strip()):
+             msg = "'query', 'response' and 'ground_truth' must be non-empty strings."
+             raise EvaluationException(
+                 message=msg,
+                 internal_message=msg,
+                 error_category=ErrorCategory.MISSING_FIELD,
+                 error_blame=ErrorBlame.USER_ERROR,
+                 error_target=ErrorTarget.SIMILARITY_EVALUATOR,
+             )
+
+         # Run the evaluation flow
+         llm_output = await self._flow(
+             query=query, response=response, ground_truth=ground_truth, timeout=self.LLM_CALL_TIMEOUT, **kwargs
+         )
+
+         score = np.nan
+         if llm_output:
+             match = re.search(r"\d", llm_output)
+             if match:
+                 score = float(match.group())
+
+         return {"gpt_similarity": float(score)}
+
+
+ class SimilarityEvaluator:
+     """
+     Initialize a similarity evaluator configured for a specific Azure OpenAI model.
+
+     :param model_config: Configuration for the Azure OpenAI model.
+     :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+         ~azure.ai.evaluation.OpenAIModelConfiguration]
+
+     **Usage**
+
+     .. code-block:: python
+
+         eval_fn = SimilarityEvaluator(model_config)
+         result = eval_fn(
+             query="What is the capital of Japan?",
+             response="The capital of Japan is Tokyo.",
+             ground_truth="Tokyo is Japan's capital.")
+
+     **Output format**
+
+     .. code-block:: python
+
+         {
+             "gpt_similarity": 3.0
+         }
+     """
+
+     def __init__(self, model_config: dict):
+         self._async_evaluator = _AsyncSimilarityEvaluator(model_config)
+
+     def __call__(self, *, query: str, response: str, ground_truth: str, **kwargs):
+         """
+         Evaluate similarity.
+
+         :keyword query: The query to be evaluated.
+         :paramtype query: str
+         :keyword response: The response to be evaluated.
+         :paramtype response: str
+         :keyword ground_truth: The ground truth to be evaluated.
+         :paramtype ground_truth: str
+         :return: The similarity score.
+         :rtype: dict
+         """
+         return async_run_allowing_running_loop(
+             self._async_evaluator, query=query, response=response, ground_truth=ground_truth, **kwargs
+         )
+
+     def _to_async(self):
+         return self._async_evaluator
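
The synchronous wrapper above delegates to the async evaluator through async_run_allowing_running_loop. In an application that is already running an event loop, the coroutine can instead be awaited directly via the private _to_async() hook. A hedged sketch, assuming the top-level package re-exports SimilarityEvaluator and bearing in mind that _to_async is an internal API that may change between betas:

    # Hedged sketch: awaiting the similarity evaluator inside an async application.
    import asyncio

    from azure.ai.evaluation import SimilarityEvaluator  # assumed top-level re-export

    async def main(model_config: dict) -> dict:
        evaluator = SimilarityEvaluator(model_config)._to_async()  # internal hook
        return await evaluator(
            query="What is the capital of Japan?",
            response="The capital of Japan is Tokyo.",
            ground_truth="Tokyo is Japan's capital.",
        )

    # asyncio.run(main(model_config))  # expects {"gpt_similarity": <score between 1.0 and 5.0>}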
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_similarity/similarity.prompty
@@ -0,0 +1,71 @@
+ ---
+ name: Similarity
+ description: Evaluates similarity score for QA scenario
+ model:
+   api: chat
+   configuration:
+     type: azure_openai
+     azure_deployment: ${env:AZURE_DEPLOYMENT}
+     api_key: ${env:AZURE_OPENAI_API_KEY}
+     azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
+   parameters:
+     temperature: 0.0
+     max_tokens: 1
+     top_p: 1.0
+     presence_penalty: 0
+     frequency_penalty: 0
+     response_format:
+       type: text
+
+ inputs:
+   query:
+     type: string
+   response:
+     type: string
+   ground_truth:
+     type: string
+
+ ---
+ system:
+ You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 and 5 representing the evaluation metric. You will include no other text or information.
+ user:
+ Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of the Equivalence metric using the following rating scale:
+ One star: the predicted answer is not at all similar to the correct answer
+ Two stars: the predicted answer is mostly not similar to the correct answer
+ Three stars: the predicted answer is somewhat similar to the correct answer
+ Four stars: the predicted answer is mostly similar to the correct answer
+ Five stars: the predicted answer is completely similar to the correct answer
+
+ This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
+
+ The examples below show the Equivalence score for a question, a correct answer, and a predicted answer.
+
+ question: What is the role of ribosomes?
+ correct answer: Ribosomes are cellular structures responsible for protein synthesis. They interpret the genetic information carried by messenger RNA (mRNA) and use it to assemble amino acids into proteins.
+ predicted answer: Ribosomes participate in carbohydrate breakdown by removing nutrients from complex sugar molecules.
+ stars: 1
+
+ question: Why did the Titanic sink?
+ correct answer: The Titanic sank after it struck an iceberg during its maiden voyage in 1912. The impact caused the ship's hull to breach, allowing water to flood into the vessel. The ship's design, lifeboat shortage, and lack of timely rescue efforts contributed to the tragic loss of life.
+ predicted answer: The sinking of the Titanic was a result of a large iceberg collision. This caused the ship to take on water and eventually sink, leading to the death of many passengers due to a shortage of lifeboats and insufficient rescue attempts.
+ stars: 2
+
+ question: What causes seasons on Earth?
+ correct answer: Seasons on Earth are caused by the tilt of the Earth's axis and its revolution around the Sun. As the Earth orbits the Sun, the tilt causes different parts of the planet to receive varying amounts of sunlight, resulting in changes in temperature and weather patterns.
+ predicted answer: Seasons occur because of the Earth's rotation and its elliptical orbit around the Sun. The tilt of the Earth's axis causes regions to be subjected to different sunlight intensities, which leads to temperature fluctuations and alternating weather conditions.
+ stars: 3
+
+ question: How does photosynthesis work?
+ correct answer: Photosynthesis is a process by which green plants and some other organisms convert light energy into chemical energy. This occurs as light is absorbed by chlorophyll molecules, and then carbon dioxide and water are converted into glucose and oxygen through a series of reactions.
+ predicted answer: In photosynthesis, sunlight is transformed into nutrients by plants and certain microorganisms. Light is captured by chlorophyll molecules, followed by the conversion of carbon dioxide and water into sugar and oxygen through multiple reactions.
+ stars: 4
+
+ question: What are the health benefits of regular exercise?
+ correct answer: Regular exercise can help maintain a healthy weight, increase muscle and bone strength, and reduce the risk of chronic diseases. It also promotes mental well-being by reducing stress and improving overall mood.
+ predicted answer: Routine physical activity can contribute to maintaining ideal body weight, enhancing muscle and bone strength, and preventing chronic illnesses. In addition, it supports mental health by alleviating stress and augmenting general mood.
+ stars: 5
+
+ question: {{query}}
+ correct answer: {{ground_truth}}
+ predicted answer: {{response}}
+ stars:
--- /dev/null
+++ b/azure/ai/evaluation/_evaluators/_xpia/__init__.py
@@ -0,0 +1,5 @@
+ from .xpia import IndirectAttackEvaluator
+
+ __all__ = [
+     "IndirectAttackEvaluator",
+ ]