azure-ai-evaluation 1.0.0b5__py3-none-any.whl → 1.1.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- azure/ai/evaluation/_azure/__init__.py +3 -0
- azure/ai/evaluation/_azure/_clients.py +188 -0
- azure/ai/evaluation/_azure/_models.py +227 -0
- azure/ai/evaluation/_azure/_token_manager.py +118 -0
- azure/ai/evaluation/_common/_experimental.py +4 -0
- azure/ai/evaluation/_common/math.py +62 -2
- azure/ai/evaluation/_common/rai_service.py +110 -50
- azure/ai/evaluation/_common/utils.py +50 -16
- azure/ai/evaluation/_constants.py +2 -0
- azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +9 -0
- azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +13 -3
- azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +12 -1
- azure/ai/evaluation/_evaluate/_eval_run.py +38 -43
- azure/ai/evaluation/_evaluate/_evaluate.py +62 -131
- azure/ai/evaluation/_evaluate/_telemetry/__init__.py +2 -1
- azure/ai/evaluation/_evaluate/_utils.py +72 -38
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +16 -17
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +60 -29
- azure/ai/evaluation/_evaluators/_common/_base_eval.py +88 -6
- azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +16 -3
- azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +39 -10
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +58 -52
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +79 -34
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +73 -34
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +74 -33
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +76 -34
- azure/ai/evaluation/_evaluators/_eci/_eci.py +28 -3
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +20 -13
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +57 -26
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +13 -15
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +68 -30
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +17 -20
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +10 -8
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +0 -2
- azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +10 -6
- azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_violence.py +6 -2
- azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +57 -34
- azure/ai/evaluation/_evaluators/_qa/_qa.py +25 -37
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +63 -29
- azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +76 -161
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +24 -25
- azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +65 -67
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +26 -20
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +74 -40
- azure/ai/evaluation/_exceptions.py +2 -0
- azure/ai/evaluation/_http_utils.py +6 -4
- azure/ai/evaluation/_model_configurations.py +65 -14
- azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +0 -4
- azure/ai/evaluation/_vendor/rouge_score/scoring.py +0 -4
- azure/ai/evaluation/_vendor/rouge_score/tokenize.py +0 -4
- azure/ai/evaluation/_version.py +1 -1
- azure/ai/evaluation/simulator/_adversarial_scenario.py +17 -1
- azure/ai/evaluation/simulator/_adversarial_simulator.py +57 -47
- azure/ai/evaluation/simulator/_constants.py +11 -1
- azure/ai/evaluation/simulator/_conversation/__init__.py +128 -7
- azure/ai/evaluation/simulator/_conversation/_conversation.py +0 -1
- azure/ai/evaluation/simulator/_direct_attack_simulator.py +16 -8
- azure/ai/evaluation/simulator/_indirect_attack_simulator.py +12 -1
- azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +3 -1
- azure/ai/evaluation/simulator/_model_tools/_rai_client.py +48 -4
- azure/ai/evaluation/simulator/_model_tools/_template_handler.py +1 -0
- azure/ai/evaluation/simulator/_simulator.py +54 -45
- azure/ai/evaluation/simulator/_utils.py +25 -7
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/METADATA +240 -327
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/RECORD +71 -68
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +0 -322
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/NOTICE.txt +0 -0
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/WHEEL +0 -0
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/top_level.txt +0 -0
--- azure_ai_evaluation-1.0.0b5.dist-info/METADATA
+++ azure_ai_evaluation-1.1.0.dist-info/METADATA
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: azure-ai-evaluation
-Version: 1.0.0b5
+Version: 1.1.0
 Summary: Microsoft Azure Evaluation Library for Python
 Home-page: https://github.com/Azure/azure-sdk-for-python
 Author: Microsoft Corporation
@@ -9,7 +9,7 @@ License: MIT License
 Project-URL: Bug Reports, https://github.com/Azure/azure-sdk-for-python/issues
 Project-URL: Source, https://github.com/Azure/azure-sdk-for-python
 Keywords: azure,azure sdk
-Classifier: Development Status ::
+Classifier: Development Status :: 5 - Production/Stable
 Classifier: Programming Language :: Python
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
@@ -28,13 +28,20 @@ Requires-Dist: pyjwt >=2.8.0
 Requires-Dist: azure-identity >=1.16.0
 Requires-Dist: azure-core >=1.30.2
 Requires-Dist: nltk >=3.9.1
-
-Requires-Dist: promptflow-azure <2.0.0,>=1.15.0 ; extra == 'remote'
-Requires-Dist: azure-ai-inference >=1.0.0b4 ; extra == 'remote'
+Requires-Dist: azure-storage-blob >=12.10.0
 
 # Azure AI Evaluation client library for Python
 
-
+Use Azure AI Evaluation SDK to assess the performance of your generative AI applications. Generative AI application generations are quantitatively measured with mathematical based metrics, AI-assisted quality and safety metrics. Metrics are defined as `evaluators`. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.
+
+Use Azure AI Evaluation SDK to:
+- Evaluate existing data from generative AI applications
+- Evaluate generative AI applications
+- Evaluate by generating mathematical, AI-assisted quality and safety metrics
+
+Azure AI SDK provides following to evaluate Generative AI Applications:
+- [Evaluators][evaluators] - Generate scores individually or when used together with `evaluate` API.
+- [Evaluate API][evaluate_api] - Python API to evaluate dataset or application using built-in or custom evaluators.
 
 [Source code][source_code]
 | [Package (PyPI)][evaluation_pypi]
@@ -42,272 +49,177 @@ We are excited to introduce the public preview of the Azure AI Evaluation SDK.
 | [Product documentation][product_documentation]
 | [Samples][evaluation_samples]
 
-This package has been tested with Python 3.8, 3.9, 3.10, 3.11, and 3.12.
-
-For a more complete set of Azure libraries, see https://aka.ms/azsdk/python/all
 
 ## Getting started
 
 ### Prerequisites
 
 - Python 3.8 or later is required to use this package.
+- [Optional] You must have [Azure AI Project][ai_project] or [Azure Open AI][azure_openai] to use AI-assisted evaluators
 
 ### Install the package
 
-Install the Azure AI Evaluation
+Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:
 
 ```bash
 pip install azure-ai-evaluation
 ```
+If you want to track results in [AI Studio][ai_studio], install `remote` extra:
+```python
+pip install azure-ai-evaluation[remote]
+```
 
 ## Key concepts
 
-Evaluators
+### Evaluators
 
-
+Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.
 
-
+#### Built-in evaluators
+
+Built-in evaluators are out of box evaluators provided by Microsoft:
+| Category | Evaluator class |
+|-----------|------------------------------------------------------------------------------------------------------------------------------------|
+| [Performance and quality][performance_and_quality_evaluators] (AI-assisted) | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |
+| [Performance and quality][performance_and_quality_evaluators] (NLP) | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`|
+| [Risk and safety][risk_and_safety_evaluators] (AI-assisted) | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator` |
+| [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator` |
 
-
+For more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].
 
 ```python
 import os
-from pprint import pprint
 
-from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator
+from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator
 
+# NLP bleu score evaluator
+bleu_score_evaluator = BleuScoreEvaluator()
+result = bleu_score(
+    response="Tokyo is the capital of Japan.",
+    ground_truth="The capital of Japan is Tokyo."
+)
 
-
-
-
+# AI assisted quality evaluator
+model_config = {
+    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
+    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
+}
 
-
-
-
-
-
-    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
-    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
-}
+relevance_evaluator = RelevanceEvaluator(model_config)
+result = relevance_evaluator(
+    query="What is the capital of Japan?",
+    response="The capital of Japan is Tokyo."
+)
 
-
-
+# AI assisted safety evaluator
+azure_ai_project = {
+    "subscription_id": "<subscription_id>",
+    "resource_group_name": "<resource_group_name>",
+    "project_name": "<project_name>",
+}
 
-
-
-
-
-
+violence_evaluator = ViolenceEvaluator(azure_ai_project)
+result = violence_evaluator(
+    query="What is the capital of France?",
+    response="Paris."
+)
+```
 
-
-# {'gpt_relevance': 5.0}
+#### Custom evaluators
 
-
+Built-in evaluators are great out of the box to start evaluating your application's generations. However you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.
 
-
-azure_ai_project = {
-    "subscription_id": <subscription_id>,
-    "resource_group_name": <resource_group_name>,
-    "project_name": <project_name>
-}
+```python
 
-
-
-
-# {'violence': 'Very low',
-# 'violence_reason': "The system's response is a straightforward factual response "
-#    'to a geography question. There is no violent content or '
-#    'language present.',
-# 'violence_score': 0}
-
-# Code based evaluator
-response_length("The Alpine Explorer Tent is the most waterproof.")
-# {'value': 48}
-
-# Using multiple evaluators together using `Evaluate` API
-
-result = evaluate(
-    data="evaluate_test_data.jsonl",
-    evaluators={
-        "response_length": response_length,
-        "violence": violence_eval,
-    },
-)
+# Custom evaluator as a function to calculate response length
+def response_length(response, **kwargs):
+    return len(response)
 
-
-
-
+# Custom class based evaluator to check for blocked words
+class BlocklistEvaluator:
+    def __init__(self, blocklist):
+        self._blocklist = blocklist
 
+    def __call__(self, *, response: str, **kwargs):
+        score = any([word in answer for word in self._blocklist])
+        return {"score": score}
 
-
-their AI application.
+blocklist_evaluator = BlocklistEvaluator(blocklist=["bad, worst, terrible"])
 
-
+result = response_length("The capital of Japan is Tokyo.")
+result = blocklist_evaluator(answer="The capital of Japan is Tokyo.")
 
-```
----
-name: ApplicationPrompty
-description: Simulates an application
-model:
-  api: chat
-  parameters:
-    temperature: 0.0
-    top_p: 1.0
-    presence_penalty: 0
-    frequency_penalty: 0
-    response_format:
-      type: text
+```
 
-
-
-    type: dict
+### Evaluate API
+The package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate generative AI application response.
 
-
-system:
-You are a helpful assistant and you're helping with the user's query. Keep the conversation engaging and interesting.
+#### Evaluate existing dataset
 
-
-
+```python
+from azure.ai.evaluation import evaluate
 
+result = evaluate(
+    data="data.jsonl", # provide your data here
+    evaluators={
+        "blocklist": blocklist_evaluator,
+        "relevance": relevance_evaluator
+    },
+    # column mapping
+    evaluator_config={
+        "relevance": {
+            "column_mapping": {
+                "query": "${data.queries}"
+                "ground_truth": "${data.ground_truth}"
+                "response": "${outputs.response}"
+            }
+        }
+    }
+    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
+    azure_ai_project = azure_ai_project,
+    # Optionally provide an output path to dump a json of metric summary, row level data and metric and studio URL
+    output_path="./evaluation_results.json"
+)
 ```
+For more details refer to [Evaluate on test dataset using evaluate()][evaluate_dataset]
 
-
-
-
-
-
-
-
-
-
-    temperature: 0.0
-    top_p: 1.0
-    presence_penalty: 0
-    frequency_penalty: 0
-    response_format:
-      type: json_schema
-      json_schema:
-        name: QRJsonSchema
-        schema:
-          type: object
-          properties:
-            items:
-              type: array
-              items:
-                type: object
-                properties:
-                  q:
-                    type: string
-                  r:
-                    type: string
-                required:
-                  - q
-                  - r
-
-inputs:
-  text:
-    type: string
-  num_queries:
-    type: integer
-
-
----
-system:
-You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
-Both Questions and Answers MUST BE extracted from given Text
-Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
-RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
-A sentence should contribute multiple QnAs if it has more info in it
-Answer must not be more than 5 words
-Answer must be picked from Text as is
-Question should be as descriptive as possible and must include as much context as possible from Text
-Output must always have the provided number of QnAs
-Output must be in JSON format.
-Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
-Text:
-<|text_start|>
-On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
-Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
-<|text_end|>
-Output with 5 QnAs:
-{
-    "qna": [{
-        "q": "When did the former Apple CEO Steve Jobs introduced the first Macintosh?",
-        "r": "January 24, 1984"
-    },
-    {
-        "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
-        "r": "Steve Jobs"
-    },
-    {
-        "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
-        "r": "2.06 percent"
-    },
-    {
-        "q": "What were the research firms that reported on Apple's market share in the U.S.?",
-        "r": "IDC and Gartner"
+#### Evaluate generative AI application
+```python
+from askwiki import askwiki
+
+result = evaluate(
+    data="data.jsonl",
+    target=askwiki,
+    evaluators={
+        "relevance": relevance_eval
     },
-    {
-        "
-
-
-}
-
-
-
-
-
+    evaluator_config={
+        "default": {
+            "column_mapping": {
+                "query": "${data.queries}"
+                "context": "${outputs.context}"
+                "response": "${outputs.response}"
+            }
+        }
+    }
+)
 ```
+Above code snippet refers to askwiki application in this [sample][evaluate_app].
 
-
+For more details refer to [Evaluate on a target][evaluate_target]
 
-
-import json
-import asyncio
-from typing import Any, Dict, List, Optional
-from azure.ai.evaluation.simulator import Simulator
-from promptflow.client import load_flow
-import os
-import wikipedia
+### Simulator
 
-# Set up the model configuration without api_key, using DefaultAzureCredential
-model_config = {
-    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
-    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
-    # not providing key would make the SDK pick up `DefaultAzureCredential`
-    # use "api_key": "<your API key>"
-    "api_version": "2024-08-01-preview" # keep this for gpt-4o
-}
 
-
-
-    wiki_title = wikipedia.search(wiki_search_term)[0]
-    wiki_page = wikipedia.page(wiki_title)
-    text = wiki_page.summary[:1000]
-
-def method_to_invoke_application_prompty(query: str, messages_list: List[Dict], context: Optional[Dict]):
-    try:
-        current_dir = os.path.dirname(__file__)
-        prompty_path = os.path.join(current_dir, "application.prompty")
-        _flow = load_flow(
-            source=prompty_path,
-            model=model_config,
-            credential=DefaultAzureCredential()
-        )
-        response = _flow(
-            query=query,
-            context=context,
-            conversation_history=messages_list
-        )
-        return response
-    except Exception as e:
-        print(f"Something went wrong invoking the prompty: {e}")
-        return "something went wrong"
+Simulators allow users to generate synthentic data using their application. Simulator expects the user to have a callback method that invokes their AI application. The intergration between your AI application and the simulator happens at the callback method. Here's how a sample callback would look like:
+
 
+```python
 async def callback(
     messages: Dict[str, List[Dict]],
     stream: bool = False,
-    session_state: Any = None,
+    session_state: Any = None,
     context: Optional[Dict[str, Any]] = None,
 ) -> dict:
     messages_list = messages["messages"]
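Note on the new README snippets in the hunk above: as published they contain a few slips — `result = bleu_score(...)` calls a name that is never defined (the object created is `bleu_score_evaluator`), `BlocklistEvaluator.__call__` accepts `response` but reads `answer`, the blocklist is a single string `"bad, worst, terrible"`, and the `column_mapping` dictionaries are missing commas. A minimal corrected sketch follows; it assumes the documented `BleuScoreEvaluator`, `RelevanceEvaluator`, and `evaluate` APIs, and the dataset path, column names, and environment variables are placeholders.

```python
import os

from azure.ai.evaluation import BleuScoreEvaluator, RelevanceEvaluator, evaluate

# NLP metric: no model deployment needed
bleu_score_evaluator = BleuScoreEvaluator()
bleu_result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.",
)

# AI-assisted metric: needs an Azure OpenAI deployment
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}
relevance_evaluator = RelevanceEvaluator(model_config)


# Custom class-based evaluator: the keyword parameter is `response`, not `answer`
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        score = any(word in response for word in self._blocklist)
        return {"score": score}


blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

# Run several evaluators over a JSONL dataset; note the commas inside column_mapping.
# Without a `target`, responses are read from the dataset itself.
result = evaluate(
    data="data.jsonl",  # placeholder path; one JSON object per line
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator,
    },
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}",
                "ground_truth": "${data.ground_truth}",
                "response": "${data.response}",
            }
        }
    },
    output_path="./evaluation_results.json",
)
```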
@@ -315,8 +227,8 @@ async def callback(
     latest_message = messages_list[-1]
     query = latest_message["content"]
     # Call your endpoint or AI application here
-    response
-
+    # response should be a string
+    response = call_to_your_application(query, messages_list, context)
     formatted_response = {
         "content": response,
         "role": "assistant",
@@ -324,33 +236,32 @@ async def callback(
     }
     messages["messages"].append(formatted_response)
     return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
+```
 
-
-
-
-
-
-
-
-
-
-
-
-
-
+The simulator initialization and invocation looks like this:
+```python
+from azure.ai.evaluation.simulator import Simulator
+model_config = {
+    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
+    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
+    "api_version": os.environ.get("AZURE_API_VERSION"),
+}
+custom_simulator = Simulator(model_config=model_config)
+outputs = asyncio.run(custom_simulator(
+    target=callback,
+    conversation_turns=[
+        [
+            "What should I know about the public gardens in the US?",
         ],
-
-
-
-
-
-
-
-
-
-    asyncio.run(main())
-    print("done!")
-
+        [
+            "How do I simulate data against LLMs",
+        ],
+    ],
+    max_conversation_turns=2,
+))
+with open("simulator_output.jsonl", "w") as f:
+    for output in outputs:
+        f.write(output.to_eval_qr_json_lines())
 ```
 
 #### Adversarial Simulator
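The simulator snippets in the hunks above omit their imports and a concrete target. Below is a self-contained sketch under the callback contract shown there; `call_to_your_application` is a placeholder for your own application, and the environment variable names mirror the README (`AZURE_ENDPOINT`, `AZURE_DEPLOYMENT_NAME`, `AZURE_API_VERSION`).

```python
import asyncio
import os
from typing import Any, Dict, List, Optional

from azure.ai.evaluation.simulator import Simulator


def call_to_your_application(query: str, messages_list: List[Dict], context: Optional[Dict]) -> str:
    # Placeholder: return whatever your application would answer with.
    return f"Echo: {query}"


async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    query = messages_list[-1]["content"]
    response = call_to_your_application(query, messages_list, context)
    messages_list.append({"content": response, "role": "assistant", "context": context})
    return {"messages": messages_list, "stream": stream, "session_state": session_state, "context": context}


model_config = {
    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_API_VERSION"),
}

custom_simulator = Simulator(model_config=model_config)
outputs = asyncio.run(
    custom_simulator(
        target=callback,
        conversation_turns=[
            ["What should I know about the public gardens in the US?"],
            ["How do I simulate data against LLMs"],
        ],
        max_conversation_turns=2,
    )
)

# Persist the simulated conversations as query/response JSON lines for later evaluation.
with open("simulator_output.jsonl", "w") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())
```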
@@ -358,73 +269,11 @@ if __name__ == "__main__":
 ```python
 from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
 from azure.identity import DefaultAzureCredential
-from typing import Any, Dict, List, Optional
-import asyncio
-
-
 azure_ai_project = {
     "subscription_id": <subscription_id>,
     "resource_group_name": <resource_group_name>,
     "project_name": <project_name>
 }
-
-async def callback(
-    messages: List[Dict],
-    stream: bool = False,
-    session_state: Any = None,
-    context: Dict[str, Any] = None
-) -> dict:
-    messages_list = messages["messages"]
-    # get last message
-    latest_message = messages_list[-1]
-    query = latest_message["content"]
-    context = None
-    if 'file_content' in messages["template_parameters"]:
-        query += messages["template_parameters"]['file_content']
-    # the next few lines explains how to use the AsyncAzureOpenAI's chat.completions
-    # to respond to the simulator. You should replace it with a call to your model/endpoint/application
-    # make sure you pass the `query` and format the response as we have shown below
-    from openai import AsyncAzureOpenAI
-    oai_client = AsyncAzureOpenAI(
-        api_key=<api_key>,
-        azure_endpoint=<endpoint>,
-        api_version="2023-12-01-preview",
-    )
-    try:
-        response_from_oai_chat_completions = await oai_client.chat.completions.create(messages=[{"content": query, "role": "user"}], model="gpt-4", max_tokens=300)
-    except Exception as e:
-        print(f"Error: {e}")
-        # to continue the conversation, return the messages, else you can fail the adversarial with an exception
-        message = {
-            "content": "Something went wrong. Check the exception e for more details.",
-            "role": "assistant",
-            "context": None,
-        }
-        messages["messages"].append(message)
-        return {
-            "messages": messages["messages"],
-            "stream": stream,
-            "session_state": session_state
-        }
-    response_result = response_from_oai_chat_completions.choices[0].message.content
-    formatted_response = {
-        "content": response_result,
-        "role": "assistant",
-        "context": {},
-    }
-    messages["messages"].append(formatted_response)
-    return {
-        "messages": messages["messages"],
-        "stream": stream,
-        "session_state": session_state,
-        "context": context
-    }
-
-```
-
-#### Adversarial QA
-
-```python
 scenario = AdversarialScenario.ADVERSARIAL_QA
 simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
 
@@ -437,30 +286,30 @@ outputs = asyncio.run(
     )
 )
 
-print(outputs.
+print(outputs.to_eval_qr_json_lines())
 ```
-#### Direct Attack Simulator
 
-
-
-
+For more details about the simulator, visit the following links:
+- [Adversarial Simulation docs][adversarial_simulation_docs]
+- [Adversarial scenarios][adversarial_simulation_scenarios]
+- [Simulating jailbreak attacks][adversarial_jailbreak]
 
-
-
-
-
-
-
-
-
+## Examples
+
+In following section you will find examples of:
+- [Evaluate an application][evaluate_app]
+- [Evaluate different models][evaluate_models]
+- [Custom Evaluators][custom_evaluators]
+- [Adversarial Simulation][adversarial_simulation]
+- [Simulate with conversation starter][simulate_with_conversation_starter]
+
+More examples can be found [here][evaluate_samples].
 
-print(outputs)
-```
 ## Troubleshooting
 
 ### General
 
-
+Please refer to [troubleshooting][evaluation_tsg] for common issues.
 
 ### Logging
 
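The adversarial example in the hunks above shows only part of the call. A self-contained sketch of the same flow is below; the `callback` echo target is a stand-in for your application, and the `max_conversation_turns`/`max_simulation_results` keywords are assumptions carried over from the non-adversarial `Simulator` usage rather than confirmed by this diff.

```python
import asyncio
from typing import Any, Dict, List, Optional

from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}


async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    # Stand-in target: echo the adversarial query back; replace with a call to your application.
    query = messages["messages"][-1]["content"]
    messages["messages"].append({"content": f"Echo: {query}", "role": "assistant", "context": context})
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}


simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,
        target=callback,
        max_conversation_turns=1,   # assumed keyword, mirroring the Simulator example above
        max_simulation_results=3,   # assumed keyword limiting how many samples are generated
    )
)

# The adversarial output supports the same query/response JSON-lines helper used earlier.
print(outputs.to_eval_qr_json_lines())
```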
@@ -505,10 +354,74 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 [code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
 [coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/
 [coc_contact]: mailto:opencode@microsoft.com
-
+[evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target
+[evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate
+[evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
+[evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate
+[evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint
+[evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
+[ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio
+[ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio
+[azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/
+[evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint
+[custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators
+[evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate
+[evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in
+[performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators
+[risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators
+[composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators
+[adversarial_simulation_docs]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#generate-adversarial-simulations-for-safety-evaluation
+[adversarial_simulation_scenarios]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#supported-adversarial-simulation-scenarios
+[adversarial_simulation]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Adversarial_Data
+[simulate_with_conversation_starter]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter
+[adversarial_jailbreak]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#simulating-jailbreak-attacks
 
 # Release History
 
+## 1.1.0 (2024-12-12)
+
+### Bugs Fixed
+- Removed `[remote]` extra. This is no longer needed when tracking results in Azure AI Studio.
+- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results
+
+## 1.0.1 (2024-11-15)
+
+### Bugs Fixed
+- Removing `azure-ai-inference` as dependency.
+- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results
+
+## 1.0.0 (2024-11-13)
+
+### Breaking Changes
+- The `parallel` parameter has been removed from composite evaluators: `QAEvaluator`, `ContentSafetyChatEvaluator`, and `ContentSafetyMultimodalEvaluator`. To control evaluator parallelism, you can now use the `_parallel` keyword argument, though please note that this private parameter may change in the future.
+- Parameters `query_response_generating_prompty_kwargs` and `user_simulator_prompty_kwargs` have been renamed to `query_response_generating_prompty_options` and `user_simulator_prompty_options` in the Simulator's __call__ method.
+
+### Bugs Fixed
+- Fixed an issue where the `output_path` parameter in the `evaluate` API did not support relative path.
+- Output of adversarial simulators are of type `JsonLineList` and the helper function `to_eval_qr_json_lines` now outputs context from both user and assistant turns along with `category` if it exists in the conversation
+- Fixed an issue where during long-running simulations, API token expires causing "Forbidden" error. Instead, users can now set an environment variable `AZURE_TOKEN_REFRESH_INTERVAL` to refresh the token more frequently to prevent expiration and ensure continuous operation of the simulation.
+- Fixed an issue with the `ContentSafetyEvaluator` that caused parallel execution of sub-evaluators to fail. Parallel execution is now enabled by default again, but can still be disabled via the '_parallel' boolean keyword argument during class initialization.
+- Fix `evaluate` function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or
+  otherwise difficult to process. Such values are ignored fully, so the aggregated metric of `[1, 2, 3, NaN]`
+  would be 2, not 1.5.
+
+### Other Changes
+- Refined error messages for serviced-based evaluators and simulators.
+- Tracing has been disabled due to Cosmos DB initialization issue.
+- Introduced environment variable `AI_EVALS_DISABLE_EXPERIMENTAL_WARNING` to disable the warning message for experimental features.
+- Changed the randomization pattern for `AdversarialSimulator` such that there is an almost equal number of Adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the `AdversarialSimulator` outputs. Previously, for 200 `max_simulation_results` a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, user will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
+- For the `DirectAttackSimulator`, the prompt templates used to generate simulated outputs for each Adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass `randomize_order=True` when you call the `DirectAttackSimulator`, for example:
+  ```python
+  adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+  outputs = asyncio.run(
+      adversarial_simulator(
+          scenario=scenario,
+          target=callback,
+          randomize_order=True
+      )
+  )
+  ```
+
 ## 1.0.0b5 (2024-10-28)
 
 ### Features Added
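The 1.0.0 notes above introduce two environment variables. A minimal sketch of setting them before a long-running simulation is shown below; the values are assumptions, since the changelog does not document the expected format.

```python
import os

# Refresh the service token more often during long-running simulations.
# "600" is an assumed value; the changelog does not state the unit or default.
os.environ["AZURE_TOKEN_REFRESH_INTERVAL"] = "600"

# Suppress the warning emitted by experimental (preview) features.
# "true" is an assumed accepted value; check the package docs for the exact flag format.
os.environ["AI_EVALS_DISABLE_EXPERIMENTAL_WARNING"] = "true"
```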
@@ -565,8 +478,8 @@ outputs = asyncio.run(custom_simulator(
 - `SimilarityEvaluator`
 - `RetrievalEvaluator`
 - The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
-
-| Evaluator | New
+
+| Evaluator | New `max_token` for Generation |
 | --- | --- |
 | `CoherenceEvaluator` | 800 |
 | `RelevanceEvaluator` | 800 |