azure-ai-evaluation 1.0.0b5__py3-none-any.whl → 1.0.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- azure/ai/evaluation/_common/_experimental.py +4 -0
- azure/ai/evaluation/_common/math.py +62 -2
- azure/ai/evaluation/_common/rai_service.py +80 -29
- azure/ai/evaluation/_common/utils.py +50 -16
- azure/ai/evaluation/_constants.py +1 -0
- azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +9 -0
- azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +13 -3
- azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +11 -0
- azure/ai/evaluation/_evaluate/_eval_run.py +34 -10
- azure/ai/evaluation/_evaluate/_evaluate.py +59 -103
- azure/ai/evaluation/_evaluate/_telemetry/__init__.py +2 -1
- azure/ai/evaluation/_evaluate/_utils.py +6 -4
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +16 -17
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +60 -29
- azure/ai/evaluation/_evaluators/_common/_base_eval.py +17 -5
- azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +4 -2
- azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +6 -9
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +56 -50
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +79 -34
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +73 -34
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +74 -33
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +76 -34
- azure/ai/evaluation/_evaluators/_eci/_eci.py +28 -3
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +20 -13
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +57 -26
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +13 -15
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +68 -30
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +17 -20
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +10 -8
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +0 -2
- azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +10 -6
- azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_violence.py +6 -2
- azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +57 -34
- azure/ai/evaluation/_evaluators/_qa/_qa.py +25 -37
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +63 -29
- azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +76 -161
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +24 -25
- azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +65 -67
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +26 -20
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +74 -40
- azure/ai/evaluation/_exceptions.py +2 -0
- azure/ai/evaluation/_model_configurations.py +65 -14
- azure/ai/evaluation/_version.py +1 -1
- azure/ai/evaluation/simulator/_adversarial_scenario.py +15 -1
- azure/ai/evaluation/simulator/_adversarial_simulator.py +25 -34
- azure/ai/evaluation/simulator/_constants.py +11 -1
- azure/ai/evaluation/simulator/_direct_attack_simulator.py +16 -8
- azure/ai/evaluation/simulator/_indirect_attack_simulator.py +11 -1
- azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +3 -1
- azure/ai/evaluation/simulator/_model_tools/_rai_client.py +8 -4
- azure/ai/evaluation/simulator/_simulator.py +51 -45
- azure/ai/evaluation/simulator/_utils.py +25 -7
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/METADATA +232 -324
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/RECORD +60 -61
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +0 -322
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/NOTICE.txt +0 -0
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/WHEEL +0 -0
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/top_level.txt +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: azure-ai-evaluation
-Version: 1.0.0b5
+Version: 1.0.1
 Summary: Microsoft Azure Evaluation Library for Python
 Home-page: https://github.com/Azure/azure-sdk-for-python
 Author: Microsoft Corporation
@@ -9,7 +9,7 @@ License: MIT License
 Project-URL: Bug Reports, https://github.com/Azure/azure-sdk-for-python/issues
 Project-URL: Source, https://github.com/Azure/azure-sdk-for-python
 Keywords: azure,azure sdk
-Classifier: Development Status :: 4 - Beta
+Classifier: Development Status :: 5 - Production/Stable
 Classifier: Programming Language :: Python
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
@@ -30,11 +30,19 @@ Requires-Dist: azure-core >=1.30.2
 Requires-Dist: nltk >=3.9.1
 Provides-Extra: remote
 Requires-Dist: promptflow-azure <2.0.0,>=1.15.0 ; extra == 'remote'
-Requires-Dist: azure-ai-inference >=1.0.0b4 ; extra == 'remote'

 # Azure AI Evaluation client library for Python

-
+Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. Generative AI application generations are quantitatively measured with mathematics-based metrics and AI-assisted quality and safety metrics. Metrics are defined as `evaluators`. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.
+
+Use the Azure AI Evaluation SDK to:
+- Evaluate existing data from generative AI applications
+- Evaluate generative AI applications
+- Evaluate by generating mathematical, AI-assisted quality and safety metrics
+
+The Azure AI Evaluation SDK provides the following to evaluate generative AI applications:
+- [Evaluators][evaluators] - Generate scores individually or when used together with the `evaluate` API.
+- [Evaluate API][evaluate_api] - Python API to evaluate a dataset or an application using built-in or custom evaluators.

 [Source code][source_code]
 | [Package (PyPI)][evaluation_pypi]
@@ -42,272 +50,177 @@ We are excited to introduce the public preview of the Azure AI Evaluation SDK.
 | [Product documentation][product_documentation]
 | [Samples][evaluation_samples]

-This package has been tested with Python 3.8, 3.9, 3.10, 3.11, and 3.12.
-
-For a more complete set of Azure libraries, see https://aka.ms/azsdk/python/all

 ## Getting started

 ### Prerequisites

 - Python 3.8 or later is required to use this package.
+- [Optional] You must have an [Azure AI Project][ai_project] or [Azure OpenAI][azure_openai] resource to use AI-assisted evaluators

 ### Install the package

-Install the Azure AI Evaluation
+Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:

 ```bash
 pip install azure-ai-evaluation
 ```
+If you want to track results in [AI Studio][ai_studio], install the `remote` extra:
+```bash
+pip install azure-ai-evaluation[remote]
+```

 ## Key concepts

-Evaluators
+### Evaluators

-
+Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.

-
+#### Built-in evaluators

-
+Built-in evaluators are out-of-the-box evaluators provided by Microsoft:
+| Category | Evaluator class |
+|-----------|------------------------------------------------------------------------------------------------------------------------------------|
+| [Performance and quality][performance_and_quality_evaluators] (AI-assisted) | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |
+| [Performance and quality][performance_and_quality_evaluators] (NLP) | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`|
+| [Risk and safety][risk_and_safety_evaluators] (AI-assisted) | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator` |
+| [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator` |
+
+For more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].

 ```python
 import os
-from pprint import pprint
-
-from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator

+from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

-
-
+# NLP bleu score evaluator
+bleu_score_evaluator = BleuScoreEvaluator()
+result = bleu_score_evaluator(
+    response="Tokyo is the capital of Japan.",
+    ground_truth="The capital of Japan is Tokyo."
+)

+# AI assisted quality evaluator
+model_config = {
+    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
+    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
+}

-
-
-
-
-
-    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
-    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
-}
+relevance_evaluator = RelevanceEvaluator(model_config)
+result = relevance_evaluator(
+    query="What is the capital of Japan?",
+    response="The capital of Japan is Tokyo."
+)

-
-
+# AI assisted safety evaluator
+azure_ai_project = {
+    "subscription_id": "<subscription_id>",
+    "resource_group_name": "<resource_group_name>",
+    "project_name": "<project_name>",
+}

-
-
-
-
-
+violence_evaluator = ViolenceEvaluator(azure_ai_project)
+result = violence_evaluator(
+    query="What is the capital of France?",
+    response="Paris."
+)
+```

-
-# {'gpt_relevance': 5.0}
+#### Custom evaluators

-
+Built-in evaluators are great out of the box to start evaluating your application's generations. However, you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

-
-azure_ai_project = {
-    "subscription_id": <subscription_id>,
-    "resource_group_name": <resource_group_name>,
-    "project_name": <project_name>
-}
+```python

-
-
-
-# {'violence': 'Very low',
-# 'violence_reason': "The system's response is a straightforward factual response "
-#    'to a geography question. There is no violent content or '
-#    'language present.',
-# 'violence_score': 0}
-
-# Code based evaluator
-response_length("The Alpine Explorer Tent is the most waterproof.")
-# {'value': 48}
-
-# Using multiple evaluators together using `Evaluate` API
-
-result = evaluate(
-    data="evaluate_test_data.jsonl",
-    evaluators={
-        "response_length": response_length,
-        "violence": violence_eval,
-    },
-)
+# Custom evaluator as a function to calculate response length
+def response_length(response, **kwargs):
+    return len(response)

-
-
-
+# Custom class based evaluator to check for blocked words
+class BlocklistEvaluator:
+    def __init__(self, blocklist):
+        self._blocklist = blocklist

+    def __call__(self, *, response: str, **kwargs):
+        score = any(word in response for word in self._blocklist)
+        return {"score": score}

-
-their AI application.
+blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

-
+result = response_length("The capital of Japan is Tokyo.")
+result = blocklist_evaluator(response="The capital of Japan is Tokyo.")

-```
----
-name: ApplicationPrompty
-description: Simulates an application
-model:
-  api: chat
-  parameters:
-    temperature: 0.0
-    top_p: 1.0
-    presence_penalty: 0
-    frequency_penalty: 0
-    response_format:
-      type: text
+```

-
-
-  type: dict
+### Evaluate API
+The package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate a generative AI application's responses.

-
-system:
-You are a helpful assistant and you're helping with the user's query. Keep the conversation engaging and interesting.
+#### Evaluate existing dataset

-
-
+```python
+from azure.ai.evaluation import evaluate

+result = evaluate(
+    data="data.jsonl", # provide your data here
+    evaluators={
+        "blocklist": blocklist_evaluator,
+        "relevance": relevance_evaluator
+    },
+    # column mapping
+    evaluator_config={
+        "relevance": {
+            "column_mapping": {
+                "query": "${data.queries}",
+                "ground_truth": "${data.ground_truth}",
+                "response": "${outputs.response}"
+            }
+        }
+    },
+    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
+    azure_ai_project=azure_ai_project,
+    # Optionally provide an output path to dump a json of metric summary, row level data and metric and studio URL
+    output_path="./evaluation_results.json"
+)
 ```
+For more details refer to [Evaluate on test dataset using evaluate()][evaluate_dataset]

-
-
-
-
-
-
-
-
-
-    temperature: 0.0
-    top_p: 1.0
-    presence_penalty: 0
-    frequency_penalty: 0
-    response_format:
-      type: json_schema
-      json_schema:
-        name: QRJsonSchema
-        schema:
-          type: object
-          properties:
-            items:
-              type: array
-              items:
-                type: object
-                properties:
-                  q:
-                    type: string
-                  r:
-                    type: string
-                required:
-                  - q
-                  - r
-
-inputs:
-  text:
-    type: string
-  num_queries:
-    type: integer
-
-
----
-system:
-You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
-Both Questions and Answers MUST BE extracted from given Text
-Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
-RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
-A sentence should contribute multiple QnAs if it has more info in it
-Answer must not be more than 5 words
-Answer must be picked from Text as is
-Question should be as descriptive as possible and must include as much context as possible from Text
-Output must always have the provided number of QnAs
-Output must be in JSON format.
-Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
-Text:
-<|text_start|>
-On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
-Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
-<|text_end|>
-Output with 5 QnAs:
-{
-    "qna": [{
-        "q": "When did the former Apple CEO Steve Jobs introduced the first Macintosh?",
-        "r": "January 24, 1984"
-    },
-    {
-        "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
-        "r": "Steve Jobs"
-    },
-    {
-        "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
-        "r": "2.06 percent"
-    },
-    {
-        "q": "What were the research firms that reported on Apple's market share in the U.S.?",
-        "r": "IDC and Gartner"
+#### Evaluate generative AI application
+```python
+from askwiki import askwiki
+
+result = evaluate(
+    data="data.jsonl",
+    target=askwiki,
+    evaluators={
+        "relevance": relevance_evaluator
     },
-    {
-        "
-
-
-}
-
-
-
-
-
+    evaluator_config={
+        "default": {
+            "column_mapping": {
+                "query": "${data.queries}",
+                "context": "${outputs.context}",
+                "response": "${outputs.response}"
+            }
+        }
+    }
+)
 ```
+The above code snippet refers to the askwiki application in this [sample][evaluate_app].

-
+For more details refer to [Evaluate on a target][evaluate_target]

-
-import json
-import asyncio
-from typing import Any, Dict, List, Optional
-from azure.ai.evaluation.simulator import Simulator
-from promptflow.client import load_flow
-import os
-import wikipedia
+### Simulator

-# Set up the model configuration without api_key, using DefaultAzureCredential
-model_config = {
-    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
-    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
-    # not providing key would make the SDK pick up `DefaultAzureCredential`
-    # use "api_key": "<your API key>"
-    "api_version": "2024-08-01-preview" # keep this for gpt-4o
-}

-
-
-    wiki_title = wikipedia.search(wiki_search_term)[0]
-    wiki_page = wikipedia.page(wiki_title)
-    text = wiki_page.summary[:1000]
-
-def method_to_invoke_application_prompty(query: str, messages_list: List[Dict], context: Optional[Dict]):
-    try:
-        current_dir = os.path.dirname(__file__)
-        prompty_path = os.path.join(current_dir, "application.prompty")
-        _flow = load_flow(
-            source=prompty_path,
-            model=model_config,
-            credential=DefaultAzureCredential()
-        )
-        response = _flow(
-            query=query,
-            context=context,
-            conversation_history=messages_list
-        )
-        return response
-    except Exception as e:
-        print(f"Something went wrong invoking the prompty: {e}")
-        return "something went wrong"
+Simulators allow users to generate synthetic data using their application. The simulator expects the user to have a callback method that invokes their AI application; the integration between your AI application and the simulator happens at the callback method. Here's what a sample callback looks like:
+

+```python
 async def callback(
     messages: Dict[str, List[Dict]],
     stream: bool = False,
-    session_state: Any = None,
+    session_state: Any = None,
     context: Optional[Dict[str, Any]] = None,
 ) -> dict:
     messages_list = messages["messages"]
@@ -315,8 +228,8 @@ async def callback(
     latest_message = messages_list[-1]
     query = latest_message["content"]
     # Call your endpoint or AI application here
-    response
-
+    # response should be a string
+    response = call_to_your_application(query, messages_list, context)
     formatted_response = {
         "content": response,
         "role": "assistant",
@@ -324,33 +237,32 @@ async def callback(
     }
     messages["messages"].append(formatted_response)
     return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
+```

-
-
-
-
-
-
-
-
-
-
-
-
-
+The simulator initialization and invocation looks like this:
+```python
+from azure.ai.evaluation.simulator import Simulator
+model_config = {
+    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
+    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
+    "api_version": os.environ.get("AZURE_API_VERSION"),
+}
+custom_simulator = Simulator(model_config=model_config)
+outputs = asyncio.run(custom_simulator(
+    target=callback,
+    conversation_turns=[
+        [
+            "What should I know about the public gardens in the US?",
         ],
-
-
-
-
-
-
-
-
-    asyncio.run(main())
-    print("done!")
-
+        [
+            "How do I simulate data against LLMs",
+        ],
+    ],
+    max_conversation_turns=2,
+))
+with open("simulator_output.jsonl", "w") as f:
+    for output in outputs:
+        f.write(output.to_eval_qr_json_lines())
 ```

 #### Adversarial Simulator
@@ -358,73 +270,11 @@ if __name__ == "__main__":
 ```python
 from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
 from azure.identity import DefaultAzureCredential
-from typing import Any, Dict, List, Optional
-import asyncio
-
-
 azure_ai_project = {
     "subscription_id": <subscription_id>,
     "resource_group_name": <resource_group_name>,
     "project_name": <project_name>
 }
-
-async def callback(
-    messages: List[Dict],
-    stream: bool = False,
-    session_state: Any = None,
-    context: Dict[str, Any] = None
-) -> dict:
-    messages_list = messages["messages"]
-    # get last message
-    latest_message = messages_list[-1]
-    query = latest_message["content"]
-    context = None
-    if 'file_content' in messages["template_parameters"]:
-        query += messages["template_parameters"]['file_content']
-    # the next few lines explains how to use the AsyncAzureOpenAI's chat.completions
-    # to respond to the simulator. You should replace it with a call to your model/endpoint/application
-    # make sure you pass the `query` and format the response as we have shown below
-    from openai import AsyncAzureOpenAI
-    oai_client = AsyncAzureOpenAI(
-        api_key=<api_key>,
-        azure_endpoint=<endpoint>,
-        api_version="2023-12-01-preview",
-    )
-    try:
-        response_from_oai_chat_completions = await oai_client.chat.completions.create(messages=[{"content": query, "role": "user"}], model="gpt-4", max_tokens=300)
-    except Exception as e:
-        print(f"Error: {e}")
-        # to continue the conversation, return the messages, else you can fail the adversarial with an exception
-        message = {
-            "content": "Something went wrong. Check the exception e for more details.",
-            "role": "assistant",
-            "context": None,
-        }
-        messages["messages"].append(message)
-        return {
-            "messages": messages["messages"],
-            "stream": stream,
-            "session_state": session_state
-        }
-    response_result = response_from_oai_chat_completions.choices[0].message.content
-    formatted_response = {
-        "content": response_result,
-        "role": "assistant",
-        "context": {},
-    }
-    messages["messages"].append(formatted_response)
-    return {
-        "messages": messages["messages"],
-        "stream": stream,
-        "session_state": session_state,
-        "context": context
-    }
-
-```
-
-#### Adversarial QA
-
-```python
 scenario = AdversarialScenario.ADVERSARIAL_QA
 simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

@@ -437,30 +287,30 @@ outputs = asyncio.run(
     )
 )

-print(outputs.
+print(outputs.to_eval_qr_json_lines())
 ```
-#### Direct Attack Simulator

-
-
-
+For more details about the simulator, visit the following links:
+- [Adversarial Simulation docs][adversarial_simulation_docs]
+- [Adversarial scenarios][adversarial_simulation_scenarios]
+- [Simulating jailbreak attacks][adversarial_jailbreak]

-
-
-
-
-
-
-
-
+## Examples
+
+In the following section you will find examples of:
+- [Evaluate an application][evaluate_app]
+- [Evaluate different models][evaluate_models]
+- [Custom Evaluators][custom_evaluators]
+- [Adversarial Simulation][adversarial_simulation]
+- [Simulate with conversation starter][simulate_with_conversation_starter]
+
+More examples can be found [here][evaluate_samples].

-print(outputs)
-```
 ## Troubleshooting

 ### General

-
+Please refer to [troubleshooting][evaluation_tsg] for common issues.

 ### Logging

@@ -505,10 +355,68 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 [code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
 [coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/
 [coc_contact]: mailto:opencode@microsoft.com
+[evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target
+[evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate
+[evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
+[evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate
+[evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_app
+[evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
+[ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio
+[ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio
+[azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/
+[evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_endpoints
+[custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_custom
+[evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate
+[evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in
+[performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators
+[risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators
+[composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators
+[adversarial_simulation_docs]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#generate-adversarial-simulations-for-safety-evaluation
+[adversarial_simulation_scenarios]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#supported-adversarial-simulation-scenarios
+[adversarial_simulation]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/simulate_adversarial
+[simulate_with_conversation_starter]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/simulate_conversation_starter
+[adversarial_jailbreak]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#simulating-jailbreak-attacks


 # Release History

+## 1.0.1 (2024-11-15)
+
+### Bugs Fixed
+- Fixed `[remote]` extra to be needed only when tracking results in Azure AI Studio.
+- Removed `azure-ai-inference` as a dependency.
+
+## 1.0.0 (2024-11-13)
+
+### Breaking Changes
+- The `parallel` parameter has been removed from composite evaluators: `QAEvaluator`, `ContentSafetyChatEvaluator`, and `ContentSafetyMultimodalEvaluator`. To control evaluator parallelism, you can now use the `_parallel` keyword argument, though please note that this private parameter may change in the future.
+- Parameters `query_response_generating_prompty_kwargs` and `user_simulator_prompty_kwargs` have been renamed to `query_response_generating_prompty_options` and `user_simulator_prompty_options` in the Simulator's `__call__` method.
+
+### Bugs Fixed
+- Fixed an issue where the `output_path` parameter in the `evaluate` API did not support relative paths.
+- Output of adversarial simulators is of type `JsonLineList`, and the helper function `to_eval_qr_json_lines` now outputs context from both user and assistant turns along with `category` if it exists in the conversation.
+- Fixed an issue where, during long-running simulations, the API token expires, causing a "Forbidden" error. Instead, users can now set an environment variable `AZURE_TOKEN_REFRESH_INTERVAL` to refresh the token more frequently to prevent expiration and ensure continuous operation of the simulation.
+- Fixed the `evaluate` function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or otherwise difficult to process. Such values are ignored fully, so the aggregated metric of `[1, 2, 3, NaN]` would be 2, not 1.5.
+
+### Other Changes
+- Refined error messages for service-based evaluators and simulators.
+- Tracing has been disabled due to a Cosmos DB initialization issue.
+- Introduced environment variable `AI_EVALS_DISABLE_EXPERIMENTAL_WARNING` to disable the warning message for experimental features.
+- Changed the randomization pattern for `AdversarialSimulator` such that there is an almost equal number of Adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the `AdversarialSimulator` outputs. Previously, for 200 `max_simulation_results` a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, users will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
+- For the `DirectAttackSimulator`, the prompt templates used to generate simulated outputs for each Adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass `randomize_order=True` when you call the `DirectAttackSimulator`, for example:
+```python
+adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+outputs = asyncio.run(
+    adversarial_simulator(
+        scenario=scenario,
+        target=callback,
+        randomize_order=True
+    )
+)
+```
+
 ## 1.0.0b5 (2024-10-28)

 ### Features Added
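Editor's note: the 1.0.0 notes above mention two environment variables, `AZURE_TOKEN_REFRESH_INTERVAL` and `AI_EVALS_DISABLE_EXPERIMENTAL_WARNING`, without showing how to set them. A minimal sketch of wiring both up before a long-running simulation follows; the values shown are illustrative assumptions, not documented defaults.

```python
import os

# Variable names come from the 1.0.0 release notes above; the values below are
# illustrative assumptions, not documented defaults.
os.environ["AZURE_TOKEN_REFRESH_INTERVAL"] = "600"  # refresh the service token more often (interval assumed to be seconds)
os.environ["AI_EVALS_DISABLE_EXPERIMENTAL_WARNING"] = "1"  # silence the experimental-feature warning

# ...then start the long-running adversarial simulation as shown earlier in the README.
```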
@@ -565,8 +473,8 @@ outputs = asyncio.run(custom_simulator(
 - `SimilarityEvaluator`
 - `RetrievalEvaluator`
 - The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
-
-| Evaluator | New
+
+| Evaluator | New `max_token` for Generation |
 | --- | --- |
 | `CoherenceEvaluator` | 800 |
 | `RelevanceEvaluator` | 800 |