azure-ai-evaluation 1.0.0b5__py3-none-any.whl → 1.1.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of azure-ai-evaluation might be problematic.

Files changed (72)
  1. azure/ai/evaluation/_azure/__init__.py +3 -0
  2. azure/ai/evaluation/_azure/_clients.py +188 -0
  3. azure/ai/evaluation/_azure/_models.py +227 -0
  4. azure/ai/evaluation/_azure/_token_manager.py +118 -0
  5. azure/ai/evaluation/_common/_experimental.py +4 -0
  6. azure/ai/evaluation/_common/math.py +62 -2
  7. azure/ai/evaluation/_common/rai_service.py +110 -50
  8. azure/ai/evaluation/_common/utils.py +50 -16
  9. azure/ai/evaluation/_constants.py +2 -0
  10. azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +9 -0
  11. azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +13 -3
  12. azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +12 -1
  13. azure/ai/evaluation/_evaluate/_eval_run.py +38 -43
  14. azure/ai/evaluation/_evaluate/_evaluate.py +62 -131
  15. azure/ai/evaluation/_evaluate/_telemetry/__init__.py +2 -1
  16. azure/ai/evaluation/_evaluate/_utils.py +72 -38
  17. azure/ai/evaluation/_evaluators/_bleu/_bleu.py +16 -17
  18. azure/ai/evaluation/_evaluators/_coherence/_coherence.py +60 -29
  19. azure/ai/evaluation/_evaluators/_common/_base_eval.py +88 -6
  20. azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +16 -3
  21. azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +39 -10
  22. azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +58 -52
  23. azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +79 -34
  24. azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +73 -34
  25. azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +74 -33
  26. azure/ai/evaluation/_evaluators/_content_safety/_violence.py +76 -34
  27. azure/ai/evaluation/_evaluators/_eci/_eci.py +28 -3
  28. azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +20 -13
  29. azure/ai/evaluation/_evaluators/_fluency/_fluency.py +57 -26
  30. azure/ai/evaluation/_evaluators/_gleu/_gleu.py +13 -15
  31. azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +68 -30
  32. azure/ai/evaluation/_evaluators/_meteor/_meteor.py +17 -20
  33. azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +10 -8
  34. azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +0 -2
  35. azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +6 -2
  36. azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +10 -6
  37. azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +6 -2
  38. azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +6 -2
  39. azure/ai/evaluation/_evaluators/_multimodal/_violence.py +6 -2
  40. azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +57 -34
  41. azure/ai/evaluation/_evaluators/_qa/_qa.py +25 -37
  42. azure/ai/evaluation/_evaluators/_relevance/_relevance.py +63 -29
  43. azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +76 -161
  44. azure/ai/evaluation/_evaluators/_rouge/_rouge.py +24 -25
  45. azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +65 -67
  46. azure/ai/evaluation/_evaluators/_similarity/_similarity.py +26 -20
  47. azure/ai/evaluation/_evaluators/_xpia/xpia.py +74 -40
  48. azure/ai/evaluation/_exceptions.py +2 -0
  49. azure/ai/evaluation/_http_utils.py +6 -4
  50. azure/ai/evaluation/_model_configurations.py +65 -14
  51. azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +0 -4
  52. azure/ai/evaluation/_vendor/rouge_score/scoring.py +0 -4
  53. azure/ai/evaluation/_vendor/rouge_score/tokenize.py +0 -4
  54. azure/ai/evaluation/_version.py +1 -1
  55. azure/ai/evaluation/simulator/_adversarial_scenario.py +17 -1
  56. azure/ai/evaluation/simulator/_adversarial_simulator.py +57 -47
  57. azure/ai/evaluation/simulator/_constants.py +11 -1
  58. azure/ai/evaluation/simulator/_conversation/__init__.py +128 -7
  59. azure/ai/evaluation/simulator/_conversation/_conversation.py +0 -1
  60. azure/ai/evaluation/simulator/_direct_attack_simulator.py +16 -8
  61. azure/ai/evaluation/simulator/_indirect_attack_simulator.py +12 -1
  62. azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +3 -1
  63. azure/ai/evaluation/simulator/_model_tools/_rai_client.py +48 -4
  64. azure/ai/evaluation/simulator/_model_tools/_template_handler.py +1 -0
  65. azure/ai/evaluation/simulator/_simulator.py +54 -45
  66. azure/ai/evaluation/simulator/_utils.py +25 -7
  67. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/METADATA +240 -327
  68. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/RECORD +71 -68
  69. azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +0 -322
  70. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/NOTICE.txt +0 -0
  71. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/WHEEL +0 -0
  72. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/top_level.txt +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: azure-ai-evaluation
- Version: 1.0.0b5
+ Version: 1.1.0
  Summary: Microsoft Azure Evaluation Library for Python
  Home-page: https://github.com/Azure/azure-sdk-for-python
  Author: Microsoft Corporation
@@ -9,7 +9,7 @@ License: MIT License
  Project-URL: Bug Reports, https://github.com/Azure/azure-sdk-for-python/issues
  Project-URL: Source, https://github.com/Azure/azure-sdk-for-python
  Keywords: azure,azure sdk
- Classifier: Development Status :: 4 - Beta
+ Classifier: Development Status :: 5 - Production/Stable
  Classifier: Programming Language :: Python
  Classifier: Programming Language :: Python :: 3
  Classifier: Programming Language :: Python :: 3 :: Only
@@ -28,13 +28,20 @@ Requires-Dist: pyjwt >=2.8.0
  Requires-Dist: azure-identity >=1.16.0
  Requires-Dist: azure-core >=1.30.2
  Requires-Dist: nltk >=3.9.1
- Provides-Extra: remote
- Requires-Dist: promptflow-azure <2.0.0,>=1.15.0 ; extra == 'remote'
- Requires-Dist: azure-ai-inference >=1.0.0b4 ; extra == 'remote'
+ Requires-Dist: azure-storage-blob >=12.10.0

  # Azure AI Evaluation client library for Python

- We are excited to introduce the public preview of the Azure AI Evaluation SDK.
+ Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. Application generations are measured quantitatively with mathematical metrics as well as AI-assisted quality and safety metrics. Metrics are defined as `evaluators`. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.
+
+ Use the Azure AI Evaluation SDK to:
+ - Evaluate existing data from generative AI applications
+ - Evaluate generative AI applications
+ - Evaluate by generating mathematical, AI-assisted quality and safety metrics
+
+ The Azure AI Evaluation SDK provides the following to evaluate generative AI applications:
+ - [Evaluators][evaluators] - Generate scores individually or together with the `evaluate` API.
+ - [Evaluate API][evaluate_api] - Python API to evaluate a dataset or application using built-in or custom evaluators.

  [Source code][source_code]
  | [Package (PyPI)][evaluation_pypi]
@@ -42,272 +49,177 @@ We are excited to introduce the public preview of the Azure AI Evaluation SDK.
  | [Product documentation][product_documentation]
  | [Samples][evaluation_samples]

- This package has been tested with Python 3.8, 3.9, 3.10, 3.11, and 3.12.
-
- For a more complete set of Azure libraries, see https://aka.ms/azsdk/python/all

  ## Getting started

  ### Prerequisites

  - Python 3.8 or later is required to use this package.
+ - [Optional] You must have an [Azure AI Project][ai_project] or [Azure OpenAI][azure_openai] resource to use AI-assisted evaluators.

  ### Install the package

- Install the Azure AI Evaluation library for Python with [pip][pip_link]::
+ Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:

  ```bash
  pip install azure-ai-evaluation
  ```
+ If you want to track results in [AI Studio][ai_studio], install the `remote` extra:
+ ```bash
+ pip install azure-ai-evaluation[remote]
+ ```

  ## Key concepts

- Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models.
+ ### Evaluators

- ## Examples
+ Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.

- ### Evaluators
+ #### Built-in evaluators
+
+ Built-in evaluators are out-of-the-box evaluators provided by Microsoft:
+ | Category | Evaluator class |
+ |-----------|------------------------------------------------------------------------------------------------------------------------------------|
+ | [Performance and quality][performance_and_quality_evaluators] (AI-assisted) | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |
+ | [Performance and quality][performance_and_quality_evaluators] (NLP) | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator` |
+ | [Risk and safety][risk_and_safety_evaluators] (AI-assisted) | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator` |
+ | [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator` |

- Users can create evaluator runs on the local machine as shown in the example below:
+ For more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].

  ```python
  import os
- from pprint import pprint

- from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator
+ from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

+ # NLP bleu score evaluator
+ bleu_score_evaluator = BleuScoreEvaluator()
+ result = bleu_score_evaluator(
+     response="Tokyo is the capital of Japan.",
+     ground_truth="The capital of Japan is Tokyo."
+ )

- def response_length(response, **kwargs):
-     return {"value": len(response)}
-
+ # AI assisted quality evaluator
+ model_config = {
+     "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+     "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
+     "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
+ }

- if __name__ == "__main__":
-     # Built-in evaluators
-     # Initialize Azure OpenAI Model Configuration
-     model_config = {
-         "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
-         "api_key": os.environ.get("AZURE_OPENAI_KEY"),
-         "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
-     }
+ relevance_evaluator = RelevanceEvaluator(model_config)
+ result = relevance_evaluator(
+     query="What is the capital of Japan?",
+     response="The capital of Japan is Tokyo."
+ )

-     # Initialzing Relevance Evaluator
-     relevance_eval = RelevanceEvaluator(model_config)
+ # AI assisted safety evaluator
+ azure_ai_project = {
+     "subscription_id": "<subscription_id>",
+     "resource_group_name": "<resource_group_name>",
+     "project_name": "<project_name>",
+ }

-     # Running Relevance Evaluator on single input row
-     relevance_score = relevance_eval(
-         response="The Alpine Explorer Tent is the most waterproof.",
-         query="Which tent is the most waterproof?",
-     )
+ violence_evaluator = ViolenceEvaluator(azure_ai_project)
+ result = violence_evaluator(
+     query="What is the capital of France?",
+     response="Paris."
+ )
+ ```

-     pprint(relevance_score)
-     # {'gpt_relevance': 5.0}
+ #### Custom evaluators

-     # Content Safety Evaluator
+ Built-in evaluators are great out of the box to start evaluating your application's generations. However, you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

-     # Initialize Project Scope
-     azure_ai_project = {
-         "subscription_id": <subscription_id>,
-         "resource_group_name": <resource_group_name>,
-         "project_name": <project_name>
-     }
+ ```python

-     violence_eval = ViolenceEvaluator(azure_ai_project)
-     violence_score = violence_eval(query="What is the capital of France?", response="Paris.")
-     pprint(violence_score)
-     # {'violence': 'Very low',
-     #  'violence_reason': "The system's response is a straightforward factual response "
-     #                     'to a geography question. There is no violent content or '
-     #                     'language present.',
-     #  'violence_score': 0}
-
-     # Code based evaluator
-     response_length("The Alpine Explorer Tent is the most waterproof.")
-     # {'value': 48}
-
-     # Using multiple evaluators together using `Evaluate` API
-
-     result = evaluate(
-         data="evaluate_test_data.jsonl",
-         evaluators={
-             "response_length": response_length,
-             "violence": violence_eval,
-         },
-     )
+ # Custom evaluator as a function to calculate response length
+ def response_length(response, **kwargs):
+     return len(response)

-     pprint(result)
- ```
- ### Simulator
+ # Custom class based evaluator to check for blocked words
+ class BlocklistEvaluator:
+     def __init__(self, blocklist):
+         self._blocklist = blocklist

+     def __call__(self, *, response: str, **kwargs):
+         score = any([word in response for word in self._blocklist])
+         return {"score": score}

- Simulators allow users to generate synthentic data using their application. Simulator expects the user to have a callback method that invokes
- their AI application.
+ blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

- #### Simulating with a Prompty
+ result = response_length("The capital of Japan is Tokyo.")
+ result = blocklist_evaluator(response="The capital of Japan is Tokyo.")

- ```yaml
- ---
- name: ApplicationPrompty
- description: Simulates an application
- model:
-   api: chat
-   parameters:
-     temperature: 0.0
-     top_p: 1.0
-     presence_penalty: 0
-     frequency_penalty: 0
-     response_format:
-       type: text
+ ```

-   inputs:
-     conversation_history:
-       type: dict
+ ### Evaluate API
+ The package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate generative AI application responses.

- ---
- system:
- You are a helpful assistant and you're helping with the user's query. Keep the conversation engaging and interesting.
+ #### Evaluate existing dataset

- Output with a string that continues the conversation, responding to the latest message from the user, given the conversation history:
- {{ conversation_history }}
+ ```python
+ from azure.ai.evaluation import evaluate

+ result = evaluate(
+     data="data.jsonl", # provide your data here
+     evaluators={
+         "blocklist": blocklist_evaluator,
+         "relevance": relevance_evaluator
+     },
+     # column mapping
+     evaluator_config={
+         "relevance": {
+             "column_mapping": {
+                 "query": "${data.queries}",
+                 "ground_truth": "${data.ground_truth}",
+                 "response": "${outputs.response}"
+             }
+         }
+     },
+     # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
+     azure_ai_project=azure_ai_project,
+     # Optionally provide an output path to dump a JSON file of the metric summary, row-level data, and studio URL
+     output_path="./evaluation_results.json"
+ )
  ```
+ For more details, refer to [Evaluate on test dataset using evaluate()][evaluate_dataset].

- Query Response generaing prompty for gpt-4o with `json_schema` support
- Use this file as an override.
- ```yaml
- ---
- name: TaskSimulatorQueryResponseGPT4o
- description: Gets queries and responses from a blob of text
- model:
-   api: chat
-   parameters:
-     temperature: 0.0
-     top_p: 1.0
-     presence_penalty: 0
-     frequency_penalty: 0
-     response_format:
-       type: json_schema
-       json_schema:
-         name: QRJsonSchema
-         schema:
-           type: object
-           properties:
-             items:
-               type: array
-               items:
-                 type: object
-                 properties:
-                   q:
-                     type: string
-                   r:
-                     type: string
-                 required:
-                   - q
-                   - r
-
- inputs:
-   text:
-     type: string
-   num_queries:
-     type: integer
-
-
- ---
- system:
- You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
- Both Questions and Answers MUST BE extracted from given Text
- Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
- RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
- A sentence should contribute multiple QnAs if it has more info in it
- Answer must not be more than 5 words
- Answer must be picked from Text as is
- Question should be as descriptive as possible and must include as much context as possible from Text
- Output must always have the provided number of QnAs
- Output must be in JSON format.
- Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
- Text:
- <|text_start|>
- On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
- Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
- <|text_end|>
- Output with 5 QnAs:
- {
-     "qna": [{
-         "q": "When did the former Apple CEO Steve Jobs introduced the first Macintosh?",
-         "r": "January 24, 1984"
-     },
-     {
-         "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
-         "r": "Steve Jobs"
-     },
-     {
-         "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
-         "r": "2.06 percent"
-     },
-     {
-         "q": "What were the research firms that reported on Apple's market share in the U.S.?",
-         "r": "IDC and Gartner"
+ #### Evaluate generative AI application
+ ```python
+ from askwiki import askwiki
+
+ result = evaluate(
+     data="data.jsonl",
+     target=askwiki,
+     evaluators={
+         "relevance": relevance_evaluator
      },
-     {
-         "q": "What was the percentage increase of Apple's market share in the U.S., as reported by research firms IDC and Gartner?",
-         "r": "6%"
-     }]
- }
- Text:
- <|text_start|>
- {{ text }}
- <|text_end|>
- Output with {{ num_queries }} QnAs:
+     evaluator_config={
+         "default": {
+             "column_mapping": {
+                 "query": "${data.queries}",
+                 "context": "${outputs.context}",
+                 "response": "${outputs.response}"
+             }
+         }
+     }
+ )
  ```
+ The above code snippet refers to the askwiki application in this [sample][evaluate_app].

- Application code:
+ For more details, refer to [Evaluate on a target][evaluate_target].

- ```python
- import json
- import asyncio
- from typing import Any, Dict, List, Optional
- from azure.ai.evaluation.simulator import Simulator
- from promptflow.client import load_flow
- import os
- import wikipedia
+ ### Simulator

- # Set up the model configuration without api_key, using DefaultAzureCredential
- model_config = {
-     "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
-     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
-     # not providing key would make the SDK pick up `DefaultAzureCredential`
-     # use "api_key": "<your API key>"
-     "api_version": "2024-08-01-preview" # keep this for gpt-4o
- }

- # Use Wikipedia to get some text for the simulation
- wiki_search_term = "Leonardo da Vinci"
- wiki_title = wikipedia.search(wiki_search_term)[0]
- wiki_page = wikipedia.page(wiki_title)
- text = wiki_page.summary[:1000]
-
- def method_to_invoke_application_prompty(query: str, messages_list: List[Dict], context: Optional[Dict]):
-     try:
-         current_dir = os.path.dirname(__file__)
-         prompty_path = os.path.join(current_dir, "application.prompty")
-         _flow = load_flow(
-             source=prompty_path,
-             model=model_config,
-             credential=DefaultAzureCredential()
-         )
-         response = _flow(
-             query=query,
-             context=context,
-             conversation_history=messages_list
-         )
-         return response
-     except Exception as e:
-         print(f"Something went wrong invoking the prompty: {e}")
-         return "something went wrong"
+ Simulators allow users to generate synthetic data using their application. The simulator expects the user to have a callback method that invokes their AI application; the integration between your AI application and the simulator happens at the callback method. Here's what a sample callback looks like:
+

+ ```python
  async def callback(
      messages: Dict[str, List[Dict]],
      stream: bool = False,
-     session_state: Any = None,  # noqa: ANN401
+     session_state: Any = None,
      context: Optional[Dict[str, Any]] = None,
  ) -> dict:
      messages_list = messages["messages"]
@@ -315,8 +227,8 @@ async def callback(
      latest_message = messages_list[-1]
      query = latest_message["content"]
      # Call your endpoint or AI application here
-     response = method_to_invoke_application_prompty(query, messages_list, context)
-     # Format the response to follow the OpenAI chat protocol format
+     # response should be a string
+     response = call_to_your_application(query, messages_list, context)
      formatted_response = {
          "content": response,
          "role": "assistant",
@@ -324,33 +236,32 @@ async def callback(
      }
      messages["messages"].append(formatted_response)
      return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
+ ```

- async def main():
-     simulator = Simulator(model_config=model_config)
-     current_dir = os.path.dirname(__file__)
-     query_response_override_for_latest_gpt_4o = os.path.join(current_dir, "TaskSimulatorQueryResponseGPT4o.prompty")
-     outputs = await simulator(
-         target=callback,
-         text=text,
-         query_response_generating_prompty=query_response_override_for_latest_gpt_4o, # use this only with latest gpt-4o
-         num_queries=2,
-         max_conversation_turns=1,
-         user_persona=[
-             f"I am a student and I want to learn more about {wiki_search_term}",
-             f"I am a teacher and I want to teach my students about {wiki_search_term}"
+ The simulator initialization and invocation looks like this:
+ ```python
+ from azure.ai.evaluation.simulator import Simulator
+ model_config = {
+     "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
+     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
+     "api_version": os.environ.get("AZURE_API_VERSION"),
+ }
+ custom_simulator = Simulator(model_config=model_config)
+ outputs = asyncio.run(custom_simulator(
+     target=callback,
+     conversation_turns=[
+         [
+             "What should I know about the public gardens in the US?",
          ],
-     )
-     print(json.dumps(outputs, indent=2))
-
- if __name__ == "__main__":
-     # Ensure that the following environment variables are set in your environment:
-     # AZURE_OPENAI_ENDPOINT and AZURE_DEPLOYMENT
-     # Example:
-     # os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
-     # os.environ["AZURE_DEPLOYMENT"] = "your-deployment-name"
-     asyncio.run(main())
-     print("done!")
-
+         [
+             "How do I simulate data against LLMs",
+         ],
+     ],
+     max_conversation_turns=2,
+ ))
+ with open("simulator_output.jsonl", "w") as f:
+     for output in outputs:
+         f.write(output.to_eval_qr_json_lines())
  ```

  #### Adversarial Simulator
@@ -358,73 +269,11 @@ if __name__ == "__main__":
  ```python
  from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
  from azure.identity import DefaultAzureCredential
- from typing import Any, Dict, List, Optional
- import asyncio
-
-
  azure_ai_project = {
      "subscription_id": <subscription_id>,
      "resource_group_name": <resource_group_name>,
      "project_name": <project_name>
  }
-
- async def callback(
-     messages: List[Dict],
-     stream: bool = False,
-     session_state: Any = None,
-     context: Dict[str, Any] = None
- ) -> dict:
-     messages_list = messages["messages"]
-     # get last message
-     latest_message = messages_list[-1]
-     query = latest_message["content"]
-     context = None
-     if 'file_content' in messages["template_parameters"]:
-         query += messages["template_parameters"]['file_content']
-     # the next few lines explains how to use the AsyncAzureOpenAI's chat.completions
-     # to respond to the simulator. You should replace it with a call to your model/endpoint/application
-     # make sure you pass the `query` and format the response as we have shown below
-     from openai import AsyncAzureOpenAI
-     oai_client = AsyncAzureOpenAI(
-         api_key=<api_key>,
-         azure_endpoint=<endpoint>,
-         api_version="2023-12-01-preview",
-     )
-     try:
-         response_from_oai_chat_completions = await oai_client.chat.completions.create(messages=[{"content": query, "role": "user"}], model="gpt-4", max_tokens=300)
-     except Exception as e:
-         print(f"Error: {e}")
-         # to continue the conversation, return the messages, else you can fail the adversarial with an exception
-         message = {
-             "content": "Something went wrong. Check the exception e for more details.",
-             "role": "assistant",
-             "context": None,
-         }
-         messages["messages"].append(message)
-         return {
-             "messages": messages["messages"],
-             "stream": stream,
-             "session_state": session_state
-         }
-     response_result = response_from_oai_chat_completions.choices[0].message.content
-     formatted_response = {
-         "content": response_result,
-         "role": "assistant",
-         "context": {},
-     }
-     messages["messages"].append(formatted_response)
-     return {
-         "messages": messages["messages"],
-         "stream": stream,
-         "session_state": session_state,
-         "context": context
-     }
-
- ```
-
- #### Adversarial QA
-
- ```python
  scenario = AdversarialScenario.ADVERSARIAL_QA
  simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

@@ -437,30 +286,30 @@ outputs = asyncio.run(
      )
  )

- print(outputs.to_eval_qa_json_lines())
+ print(outputs.to_eval_qr_json_lines())
  ```
- #### Direct Attack Simulator

- ```python
- scenario = AdversarialScenario.ADVERSARIAL_QA
- simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+ For more details about the simulator, visit the following links:
+ - [Adversarial Simulation docs][adversarial_simulation_docs]
+ - [Adversarial scenarios][adversarial_simulation_scenarios]
+ - [Simulating jailbreak attacks][adversarial_jailbreak]

- outputs = asyncio.run(
-     simulator(
-         scenario=scenario,
-         max_conversation_turns=1,
-         max_simulation_results=2,
-         target=callback
-     )
- )
+ ## Examples
+
+ In the following section you will find examples of:
+ - [Evaluate an application][evaluate_app]
+ - [Evaluate different models][evaluate_models]
+ - [Custom Evaluators][custom_evaluators]
+ - [Adversarial Simulation][adversarial_simulation]
+ - [Simulate with conversation starter][simulate_with_conversation_starter]
+
+ More examples can be found [here][evaluate_samples].

- print(outputs)
- ```
  ## Troubleshooting

  ### General

- Azure ML clients raise exceptions defined in [Azure Core][azure_core_readme].
+ Please refer to [troubleshooting][evaluation_tsg] for common issues.

  ### Logging

@@ -505,10 +354,74 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
  [code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
  [coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/
  [coc_contact]: mailto:opencode@microsoft.com
-
+ [evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target
+ [evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate
+ [evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
+ [evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate
+ [evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint
+ [evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
+ [ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio
+ [ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio
+ [azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/
+ [evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint
+ [custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators
+ [evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate
+ [evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in
+ [performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators
+ [risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators
+ [composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators
+ [adversarial_simulation_docs]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#generate-adversarial-simulations-for-safety-evaluation
+ [adversarial_simulation_scenarios]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#supported-adversarial-simulation-scenarios
+ [adversarial_simulation]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Adversarial_Data
+ [simulate_with_conversation_starter]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter
+ [adversarial_jailbreak]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#simulating-jailbreak-attacks

  # Release History

+ ## 1.1.0 (2024-12-12)
+
+ ### Bugs Fixed
+ - Removed `[remote]` extra. This is no longer needed when tracking results in Azure AI Studio.
+ - Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results
+
+ ## 1.0.1 (2024-11-15)
+
+ ### Bugs Fixed
+ - Removed `azure-ai-inference` as a dependency.
+ - Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results
+
+ ## 1.0.0 (2024-11-13)
+
+ ### Breaking Changes
+ - The `parallel` parameter has been removed from composite evaluators: `QAEvaluator`, `ContentSafetyChatEvaluator`, and `ContentSafetyMultimodalEvaluator`. To control evaluator parallelism, you can now use the `_parallel` keyword argument, though please note that this private parameter may change in the future (see the sketch after this list).
+ - Parameters `query_response_generating_prompty_kwargs` and `user_simulator_prompty_kwargs` have been renamed to `query_response_generating_prompty_options` and `user_simulator_prompty_options` in the Simulator's `__call__` method.
+
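A minimal sketch of how these two breaking changes might look in calling code, assuming a `model_config`, `callback` target, and simulator setup like the examples earlier in this README; the exact keyword placement and option values are illustrative, not authoritative:

```python
# Illustrative sketch only: model_config and callback are assumed to be defined
# as in the README examples above; option values here are hypothetical.
import asyncio

from azure.ai.evaluation import QAEvaluator
from azure.ai.evaluation.simulator import Simulator

# `parallel` is gone; the private `_parallel` keyword (subject to change)
# controls parallelism of the sub-evaluators inside composite evaluators.
qa_evaluator = QAEvaluator(model_config=model_config, _parallel=False)

# Simulator keyword arguments were renamed from *_prompty_kwargs to *_prompty_options.
custom_simulator = Simulator(model_config=model_config)
outputs = asyncio.run(custom_simulator(
    target=callback,
    text="Some grounding text to generate queries from.",
    num_queries=2,
    user_simulator_prompty_options={"temperature": 0.2},  # formerly user_simulator_prompty_kwargs
))
```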
+ ### Bugs Fixed
+ - Fixed an issue where the `output_path` parameter in the `evaluate` API did not support relative paths.
+ - Output of adversarial simulators is of type `JsonLineList`, and the helper function `to_eval_qr_json_lines` now outputs context from both user and assistant turns along with `category` if it exists in the conversation.
+ - Fixed an issue where, during long-running simulations, the API token expired, causing a "Forbidden" error. Users can now set the environment variable `AZURE_TOKEN_REFRESH_INTERVAL` to refresh the token more frequently, preventing expiration and ensuring continuous operation of the simulation (see the sketch after this list).
+ - Fixed an issue with the `ContentSafetyEvaluator` that caused parallel execution of sub-evaluators to fail. Parallel execution is now enabled by default again, but can still be disabled via the `_parallel` boolean keyword argument during class initialization.
+ - Fixed the `evaluate` function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or
+ otherwise difficult to process. Such values are ignored fully, so the aggregated metric of `[1, 2, 3, NaN]`
+ would be 2, not 1.5.
+
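A minimal sketch of the token-refresh workaround mentioned above; the variable name comes from the changelog, but the value shown and its unit (assumed seconds) are assumptions:

```python
import os

# Hypothetical value; assumed to be read as a refresh interval in seconds.
# Set it before constructing or running the simulator.
os.environ["AZURE_TOKEN_REFRESH_INTERVAL"] = "600"
```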
+ ### Other Changes
+ - Refined error messages for service-based evaluators and simulators.
+ - Tracing has been disabled due to a Cosmos DB initialization issue.
+ - Introduced the environment variable `AI_EVALS_DISABLE_EXPERIMENTAL_WARNING` to disable the warning message for experimental features (see the sketch after this section).
+ - Changed the randomization pattern for `AdversarialSimulator` such that there is an almost equal number of Adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the `AdversarialSimulator` outputs. Previously, for 200 `max_simulation_results` a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, users will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
+ - For the `DirectAttackSimulator`, the prompt templates used to generate simulated outputs for each Adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass `randomize_order=True` when you call the `DirectAttackSimulator`, for example:
+ ```python
+ adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+ outputs = asyncio.run(
+     adversarial_simulator(
+         scenario=scenario,
+         target=callback,
+         randomize_order=True
+     )
+ )
+ ```
+
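A minimal sketch of silencing the experimental-feature warning noted above; the variable name comes from the changelog, while the accepted value shown is an assumption:

```python
import os

# Assumed truthy value; set before importing experimental evaluators or simulators.
os.environ["AI_EVALS_DISABLE_EXPERIMENTAL_WARNING"] = "true"
```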
  ## 1.0.0b5 (2024-10-28)

  ### Features Added
@@ -565,8 +478,8 @@ outputs = asyncio.run(custom_simulator(
  - `SimilarityEvaluator`
  - `RetrievalEvaluator`
  - The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
-
-  | Evaluator | New Token Limit |
+
+  | Evaluator | New `max_token` for Generation |
  | --- | --- |
  | `CoherenceEvaluator` | 800 |
  | `RelevanceEvaluator` | 800 |