azure-ai-evaluation 1.0.0b5__py3-none-any.whl → 1.0.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. azure/ai/evaluation/_common/_experimental.py +4 -0
  2. azure/ai/evaluation/_common/math.py +62 -2
  3. azure/ai/evaluation/_common/rai_service.py +80 -29
  4. azure/ai/evaluation/_common/utils.py +50 -16
  5. azure/ai/evaluation/_constants.py +1 -0
  6. azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +9 -0
  7. azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +13 -3
  8. azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +11 -0
  9. azure/ai/evaluation/_evaluate/_eval_run.py +34 -10
  10. azure/ai/evaluation/_evaluate/_evaluate.py +59 -103
  11. azure/ai/evaluation/_evaluate/_telemetry/__init__.py +2 -1
  12. azure/ai/evaluation/_evaluate/_utils.py +6 -4
  13. azure/ai/evaluation/_evaluators/_bleu/_bleu.py +16 -17
  14. azure/ai/evaluation/_evaluators/_coherence/_coherence.py +60 -29
  15. azure/ai/evaluation/_evaluators/_common/_base_eval.py +17 -5
  16. azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +4 -2
  17. azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +6 -9
  18. azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +56 -50
  19. azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +79 -34
  20. azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +73 -34
  21. azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +74 -33
  22. azure/ai/evaluation/_evaluators/_content_safety/_violence.py +76 -34
  23. azure/ai/evaluation/_evaluators/_eci/_eci.py +28 -3
  24. azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +20 -13
  25. azure/ai/evaluation/_evaluators/_fluency/_fluency.py +57 -26
  26. azure/ai/evaluation/_evaluators/_gleu/_gleu.py +13 -15
  27. azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +68 -30
  28. azure/ai/evaluation/_evaluators/_meteor/_meteor.py +17 -20
  29. azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +10 -8
  30. azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +0 -2
  31. azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +6 -2
  32. azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +10 -6
  33. azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +6 -2
  34. azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +6 -2
  35. azure/ai/evaluation/_evaluators/_multimodal/_violence.py +6 -2
  36. azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +57 -34
  37. azure/ai/evaluation/_evaluators/_qa/_qa.py +25 -37
  38. azure/ai/evaluation/_evaluators/_relevance/_relevance.py +63 -29
  39. azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +76 -161
  40. azure/ai/evaluation/_evaluators/_rouge/_rouge.py +24 -25
  41. azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +65 -67
  42. azure/ai/evaluation/_evaluators/_similarity/_similarity.py +26 -20
  43. azure/ai/evaluation/_evaluators/_xpia/xpia.py +74 -40
  44. azure/ai/evaluation/_exceptions.py +2 -0
  45. azure/ai/evaluation/_model_configurations.py +65 -14
  46. azure/ai/evaluation/_version.py +1 -1
  47. azure/ai/evaluation/simulator/_adversarial_scenario.py +15 -1
  48. azure/ai/evaluation/simulator/_adversarial_simulator.py +25 -34
  49. azure/ai/evaluation/simulator/_constants.py +11 -1
  50. azure/ai/evaluation/simulator/_direct_attack_simulator.py +16 -8
  51. azure/ai/evaluation/simulator/_indirect_attack_simulator.py +11 -1
  52. azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +3 -1
  53. azure/ai/evaluation/simulator/_model_tools/_rai_client.py +8 -4
  54. azure/ai/evaluation/simulator/_simulator.py +51 -45
  55. azure/ai/evaluation/simulator/_utils.py +25 -7
  56. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/METADATA +232 -324
  57. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/RECORD +60 -61
  58. azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +0 -322
  59. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/NOTICE.txt +0 -0
  60. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/WHEEL +0 -0
  61. {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.0.1.dist-info}/top_level.txt +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: azure-ai-evaluation
- Version: 1.0.0b5
+ Version: 1.0.1
  Summary: Microsoft Azure Evaluation Library for Python
  Home-page: https://github.com/Azure/azure-sdk-for-python
  Author: Microsoft Corporation
@@ -9,7 +9,7 @@ License: MIT License
  Project-URL: Bug Reports, https://github.com/Azure/azure-sdk-for-python/issues
  Project-URL: Source, https://github.com/Azure/azure-sdk-for-python
  Keywords: azure,azure sdk
- Classifier: Development Status :: 4 - Beta
+ Classifier: Development Status :: 5 - Production/Stable
  Classifier: Programming Language :: Python
  Classifier: Programming Language :: Python :: 3
  Classifier: Programming Language :: Python :: 3 :: Only
@@ -30,11 +30,19 @@ Requires-Dist: azure-core >=1.30.2
  Requires-Dist: nltk >=3.9.1
  Provides-Extra: remote
  Requires-Dist: promptflow-azure <2.0.0,>=1.15.0 ; extra == 'remote'
- Requires-Dist: azure-ai-inference >=1.0.0b4 ; extra == 'remote'

  # Azure AI Evaluation client library for Python

- We are excited to introduce the public preview of the Azure AI Evaluation SDK.
+ Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. Generations from generative AI applications are quantitatively measured with mathematical metrics as well as AI-assisted quality and safety metrics. Metrics are defined as `evaluators`. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.
+
+ Use the Azure AI Evaluation SDK to:
+ - Evaluate existing data from generative AI applications
+ - Evaluate generative AI applications
+ - Evaluate by generating mathematical, AI-assisted quality and safety metrics
+
+ The Azure AI Evaluation SDK provides the following to evaluate generative AI applications:
+ - [Evaluators][evaluators] - Generate scores individually or when used together with the `evaluate` API.
+ - [Evaluate API][evaluate_api] - Python API to evaluate a dataset or an application using built-in or custom evaluators.

  [Source code][source_code]
  | [Package (PyPI)][evaluation_pypi]
@@ -42,272 +50,177 @@ We are excited to introduce the public preview of the Azure AI Evaluation SDK.
  | [Product documentation][product_documentation]
  | [Samples][evaluation_samples]

- This package has been tested with Python 3.8, 3.9, 3.10, 3.11, and 3.12.
-
- For a more complete set of Azure libraries, see https://aka.ms/azsdk/python/all

  ## Getting started

  ### Prerequisites

  - Python 3.8 or later is required to use this package.
+ - [Optional] You must have an [Azure AI Project][ai_project] or [Azure OpenAI][azure_openai] resource to use AI-assisted evaluators.

  ### Install the package

- Install the Azure AI Evaluation library for Python with [pip][pip_link]::
+ Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:

  ```bash
  pip install azure-ai-evaluation
  ```
+ If you want to track results in [AI Studio][ai_studio], install the `remote` extra:
+ ```bash
+ pip install azure-ai-evaluation[remote]
+ ```

  ## Key concepts

- Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models.
+ ### Evaluators

- ## Examples
+ Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.

- ### Evaluators
+ #### Built-in evaluators

- Users can create evaluator runs on the local machine as shown in the example below:
+ Built-in evaluators are out-of-the-box evaluators provided by Microsoft:
+ | Category | Evaluator class |
+ |-----------|------------------------------------------------------------------------------------------------------------------------------------|
+ | [Performance and quality][performance_and_quality_evaluators] (AI-assisted) | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |
+ | [Performance and quality][performance_and_quality_evaluators] (NLP) | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator` |
+ | [Risk and safety][risk_and_safety_evaluators] (AI-assisted) | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator` |
+ | [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator` |
+
+ For more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].

  ```python
  import os
- from pprint import pprint
-
- from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator

+ from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

- def response_length(response, **kwargs):
-     return {"value": len(response)}
+ # NLP bleu score evaluator
+ bleu_score_evaluator = BleuScoreEvaluator()
+ result = bleu_score_evaluator(
+     response="Tokyo is the capital of Japan.",
+     ground_truth="The capital of Japan is Tokyo."
+ )

+ # AI assisted quality evaluator
+ model_config = {
+     "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+     "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
+     "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
+ }

- if __name__ == "__main__":
-     # Built-in evaluators
-     # Initialize Azure OpenAI Model Configuration
-     model_config = {
-         "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
-         "api_key": os.environ.get("AZURE_OPENAI_KEY"),
-         "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
-     }
+ relevance_evaluator = RelevanceEvaluator(model_config)
+ result = relevance_evaluator(
+     query="What is the capital of Japan?",
+     response="The capital of Japan is Tokyo."
+ )

-     # Initialzing Relevance Evaluator
-     relevance_eval = RelevanceEvaluator(model_config)
+ # AI assisted safety evaluator
+ azure_ai_project = {
+     "subscription_id": "<subscription_id>",
+     "resource_group_name": "<resource_group_name>",
+     "project_name": "<project_name>",
+ }

-     # Running Relevance Evaluator on single input row
-     relevance_score = relevance_eval(
-         response="The Alpine Explorer Tent is the most waterproof.",
-         query="Which tent is the most waterproof?",
-     )
+ violence_evaluator = ViolenceEvaluator(azure_ai_project)
+ result = violence_evaluator(
+     query="What is the capital of France?",
+     response="Paris."
+ )
+ ```

-     pprint(relevance_score)
-     # {'gpt_relevance': 5.0}
+ #### Custom evaluators

-     # Content Safety Evaluator
+ Built-in evaluators are great out of the box to start evaluating your application's generations. However, you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

-     # Initialize Project Scope
-     azure_ai_project = {
-         "subscription_id": <subscription_id>,
-         "resource_group_name": <resource_group_name>,
-         "project_name": <project_name>
-     }
+ ```python

-     violence_eval = ViolenceEvaluator(azure_ai_project)
-     violence_score = violence_eval(query="What is the capital of France?", response="Paris.")
-     pprint(violence_score)
-     # {'violence': 'Very low',
-     #  'violence_reason': "The system's response is a straightforward factual response "
-     #                     'to a geography question. There is no violent content or '
-     #                     'language present.',
-     #  'violence_score': 0}
-
-     # Code based evaluator
-     response_length("The Alpine Explorer Tent is the most waterproof.")
-     # {'value': 48}
-
-     # Using multiple evaluators together using `Evaluate` API
-
-     result = evaluate(
-         data="evaluate_test_data.jsonl",
-         evaluators={
-             "response_length": response_length,
-             "violence": violence_eval,
-         },
-     )
+ # Custom evaluator as a function to calculate response length
+ def response_length(response, **kwargs):
+     return len(response)

-     pprint(result)
- ```
- ### Simulator
+ # Custom class based evaluator to check for blocked words
+ class BlocklistEvaluator:
+     def __init__(self, blocklist):
+         self._blocklist = blocklist

+     def __call__(self, *, response: str, **kwargs):
+         score = any(word in response for word in self._blocklist)
+         return {"score": score}

- Simulators allow users to generate synthentic data using their application. Simulator expects the user to have a callback method that invokes
- their AI application.
+ blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

- #### Simulating with a Prompty
+ result = response_length("The capital of Japan is Tokyo.")
+ result = blocklist_evaluator(response="The capital of Japan is Tokyo.")

- ```yaml
- ---
- name: ApplicationPrompty
- description: Simulates an application
- model:
-   api: chat
-   parameters:
-     temperature: 0.0
-     top_p: 1.0
-     presence_penalty: 0
-     frequency_penalty: 0
-     response_format:
-       type: text
+ ```

- inputs:
-   conversation_history:
-     type: dict
+ ### Evaluate API
+ The package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate a generative AI application's responses.

- ---
- system:
- You are a helpful assistant and you're helping with the user's query. Keep the conversation engaging and interesting.
+ #### Evaluate existing dataset

- Output with a string that continues the conversation, responding to the latest message from the user, given the conversation history:
- {{ conversation_history }}
+ ```python
+ from azure.ai.evaluation import evaluate

+ result = evaluate(
+     data="data.jsonl", # provide your data here
+     evaluators={
+         "blocklist": blocklist_evaluator,
+         "relevance": relevance_evaluator
+     },
+     # column mapping
+     evaluator_config={
+         "relevance": {
+             "column_mapping": {
+                 "query": "${data.queries}",
+                 "ground_truth": "${data.ground_truth}",
+                 "response": "${outputs.response}"
+             }
+         }
+     },
+     # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
+     azure_ai_project=azure_ai_project,
+     # Optionally provide an output path to dump a JSON file with the metric summary, row-level data, and the studio URL
+     output_path="./evaluation_results.json"
+ )
  ```
+ For more details refer to [Evaluate on test dataset using evaluate()][evaluate_dataset].

- Query Response generaing prompty for gpt-4o with `json_schema` support
- Use this file as an override.
- ```yaml
- ---
- name: TaskSimulatorQueryResponseGPT4o
- description: Gets queries and responses from a blob of text
- model:
-   api: chat
-   parameters:
-     temperature: 0.0
-     top_p: 1.0
-     presence_penalty: 0
-     frequency_penalty: 0
-     response_format:
-       type: json_schema
-       json_schema:
-         name: QRJsonSchema
-         schema:
-           type: object
-           properties:
-             items:
-               type: array
-               items:
-                 type: object
-                 properties:
-                   q:
-                     type: string
-                   r:
-                     type: string
-                 required:
-                   - q
-                   - r
-
- inputs:
-   text:
-     type: string
-   num_queries:
-     type: integer
-
-
- ---
- system:
- You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
- Both Questions and Answers MUST BE extracted from given Text
- Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
- RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
- A sentence should contribute multiple QnAs if it has more info in it
- Answer must not be more than 5 words
- Answer must be picked from Text as is
- Question should be as descriptive as possible and must include as much context as possible from Text
- Output must always have the provided number of QnAs
- Output must be in JSON format.
- Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
- Text:
- <|text_start|>
- On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
- Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
- <|text_end|>
- Output with 5 QnAs:
- {
-     "qna": [{
-         "q": "When did the former Apple CEO Steve Jobs introduced the first Macintosh?",
-         "r": "January 24, 1984"
-     },
-     {
-         "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
-         "r": "Steve Jobs"
-     },
-     {
-         "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
-         "r": "2.06 percent"
-     },
-     {
-         "q": "What were the research firms that reported on Apple's market share in the U.S.?",
-         "r": "IDC and Gartner"
+ #### Evaluate generative AI application
+ ```python
+ from askwiki import askwiki
+
+ result = evaluate(
+     data="data.jsonl",
+     target=askwiki,
+     evaluators={
+         "relevance": relevance_evaluator
      },
-     {
-         "q": "What was the percentage increase of Apple's market share in the U.S., as reported by research firms IDC and Gartner?",
-         "r": "6%"
-     }]
- }
- Text:
- <|text_start|>
- {{ text }}
- <|text_end|>
- Output with {{ num_queries }} QnAs:
+     evaluator_config={
+         "default": {
+             "column_mapping": {
+                 "query": "${data.queries}",
+                 "context": "${outputs.context}",
+                 "response": "${outputs.response}"
+             }
+         }
+     }
+ )
  ```
+ The above code snippet refers to the askwiki application in this [sample][evaluate_app].

- Application code:
+ For more details refer to [Evaluate on a target][evaluate_target].

- ```python
- import json
- import asyncio
- from typing import Any, Dict, List, Optional
- from azure.ai.evaluation.simulator import Simulator
- from promptflow.client import load_flow
- import os
- import wikipedia
+ ### Simulator

- # Set up the model configuration without api_key, using DefaultAzureCredential
- model_config = {
-     "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
-     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
-     # not providing key would make the SDK pick up `DefaultAzureCredential`
-     # use "api_key": "<your API key>"
-     "api_version": "2024-08-01-preview" # keep this for gpt-4o
- }

- # Use Wikipedia to get some text for the simulation
- wiki_search_term = "Leonardo da Vinci"
- wiki_title = wikipedia.search(wiki_search_term)[0]
- wiki_page = wikipedia.page(wiki_title)
- text = wiki_page.summary[:1000]
-
- def method_to_invoke_application_prompty(query: str, messages_list: List[Dict], context: Optional[Dict]):
-     try:
-         current_dir = os.path.dirname(__file__)
-         prompty_path = os.path.join(current_dir, "application.prompty")
-         _flow = load_flow(
-             source=prompty_path,
-             model=model_config,
-             credential=DefaultAzureCredential()
-         )
-         response = _flow(
-             query=query,
-             context=context,
-             conversation_history=messages_list
-         )
-         return response
-     except Exception as e:
-         print(f"Something went wrong invoking the prompty: {e}")
-         return "something went wrong"
+ Simulators allow users to generate synthetic data using their application. The simulator expects the user to have a callback method that invokes their AI application. The integration between your AI application and the simulator happens at the callback method. Here's what a sample callback looks like:
+

+ ```python
+ from typing import Any, Dict, List, Optional
+
  async def callback(
      messages: Dict[str, List[Dict]],
      stream: bool = False,
-     session_state: Any = None,  # noqa: ANN401
+     session_state: Any = None,
      context: Optional[Dict[str, Any]] = None,
  ) -> dict:
      messages_list = messages["messages"]
@@ -315,8 +228,8 @@ async def callback(
      latest_message = messages_list[-1]
      query = latest_message["content"]
      # Call your endpoint or AI application here
-     response = method_to_invoke_application_prompty(query, messages_list, context)
-     # Format the response to follow the OpenAI chat protocol format
+     # response should be a string
+     response = call_to_your_application(query, messages_list, context)
      formatted_response = {
          "content": response,
          "role": "assistant",
@@ -324,33 +237,32 @@ async def callback(
      }
      messages["messages"].append(formatted_response)
      return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
+ ```

- async def main():
-     simulator = Simulator(model_config=model_config)
-     current_dir = os.path.dirname(__file__)
-     query_response_override_for_latest_gpt_4o = os.path.join(current_dir, "TaskSimulatorQueryResponseGPT4o.prompty")
-     outputs = await simulator(
-         target=callback,
-         text=text,
-         query_response_generating_prompty=query_response_override_for_latest_gpt_4o, # use this only with latest gpt-4o
-         num_queries=2,
-         max_conversation_turns=1,
-         user_persona=[
-             f"I am a student and I want to learn more about {wiki_search_term}",
-             f"I am a teacher and I want to teach my students about {wiki_search_term}"
+ The simulator initialization and invocation look like this:
+ ```python
+ import asyncio
+ import os
+
+ from azure.ai.evaluation.simulator import Simulator
+
+ model_config = {
+     "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
+     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
+     "api_version": os.environ.get("AZURE_API_VERSION"),
+ }
+ custom_simulator = Simulator(model_config=model_config)
+ outputs = asyncio.run(custom_simulator(
+     target=callback,
+     conversation_turns=[
+         [
+             "What should I know about the public gardens in the US?",
          ],
-     )
-     print(json.dumps(outputs, indent=2))
-
- if __name__ == "__main__":
-     # Ensure that the following environment variables are set in your environment:
-     # AZURE_OPENAI_ENDPOINT and AZURE_DEPLOYMENT
-     # Example:
-     # os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
-     # os.environ["AZURE_DEPLOYMENT"] = "your-deployment-name"
-     asyncio.run(main())
-     print("done!")
-
+         [
+             "How do I simulate data against LLMs",
+         ],
+     ],
+     max_conversation_turns=2,
+ ))
+ with open("simulator_output.jsonl", "w") as f:
+     for output in outputs:
+         f.write(output.to_eval_qr_json_lines())
  ```
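A minimal sketch of feeding that file back into the `evaluate` API described earlier, assuming the `query` and `response` fields emitted by `to_eval_qr_json_lines()` match the evaluator inputs and reusing the `relevance_evaluator` defined above:

```python
from azure.ai.evaluation import evaluate

# Score the simulated conversations with an evaluator defined earlier in this README.
# Assumption: each JSON line carries "query" and "response" keys the evaluator can consume.
result = evaluate(
    data="simulator_output.jsonl",
    evaluators={"relevance": relevance_evaluator},
)
```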

  #### Adversarial Simulator
@@ -358,73 +270,11 @@ if __name__ == "__main__":
  ```python
  from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
  from azure.identity import DefaultAzureCredential
- from typing import Any, Dict, List, Optional
- import asyncio
-
-
  azure_ai_project = {
      "subscription_id": <subscription_id>,
      "resource_group_name": <resource_group_name>,
      "project_name": <project_name>
  }
-
- async def callback(
-     messages: List[Dict],
-     stream: bool = False,
-     session_state: Any = None,
-     context: Dict[str, Any] = None
- ) -> dict:
-     messages_list = messages["messages"]
-     # get last message
-     latest_message = messages_list[-1]
-     query = latest_message["content"]
-     context = None
-     if 'file_content' in messages["template_parameters"]:
-         query += messages["template_parameters"]['file_content']
-     # the next few lines explains how to use the AsyncAzureOpenAI's chat.completions
-     # to respond to the simulator. You should replace it with a call to your model/endpoint/application
-     # make sure you pass the `query` and format the response as we have shown below
-     from openai import AsyncAzureOpenAI
-     oai_client = AsyncAzureOpenAI(
-         api_key=<api_key>,
-         azure_endpoint=<endpoint>,
-         api_version="2023-12-01-preview",
-     )
-     try:
-         response_from_oai_chat_completions = await oai_client.chat.completions.create(messages=[{"content": query, "role": "user"}], model="gpt-4", max_tokens=300)
-     except Exception as e:
-         print(f"Error: {e}")
-         # to continue the conversation, return the messages, else you can fail the adversarial with an exception
-         message = {
-             "content": "Something went wrong. Check the exception e for more details.",
-             "role": "assistant",
-             "context": None,
-         }
-         messages["messages"].append(message)
-         return {
-             "messages": messages["messages"],
-             "stream": stream,
-             "session_state": session_state
-         }
-     response_result = response_from_oai_chat_completions.choices[0].message.content
-     formatted_response = {
-         "content": response_result,
-         "role": "assistant",
-         "context": {},
-     }
-     messages["messages"].append(formatted_response)
-     return {
-         "messages": messages["messages"],
-         "stream": stream,
-         "session_state": session_state,
-         "context": context
-     }
-
- ```
-
- #### Adversarial QA
-
- ```python
  scenario = AdversarialScenario.ADVERSARIAL_QA
  simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

@@ -437,30 +287,30 @@ outputs = asyncio.run(
      )
  )

- print(outputs.to_eval_qa_json_lines())
+ print(outputs.to_eval_qr_json_lines())
  ```
- #### Direct Attack Simulator

- ```python
- scenario = AdversarialScenario.ADVERSARIAL_QA
- simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+ For more details about the simulator, visit the following links:
+ - [Adversarial Simulation docs][adversarial_simulation_docs]
+ - [Adversarial scenarios][adversarial_simulation_scenarios]
+ - [Simulating jailbreak attacks][adversarial_jailbreak]

- outputs = asyncio.run(
-     simulator(
-         scenario=scenario,
-         max_conversation_turns=1,
-         max_simulation_results=2,
-         target=callback
-     )
- )
+ ## Examples
+
+ In the following section you will find examples of:
+ - [Evaluate an application][evaluate_app]
+ - [Evaluate different models][evaluate_models]
+ - [Custom Evaluators][custom_evaluators]
+ - [Adversarial Simulation][adversarial_simulation]
+ - [Simulate with conversation starter][simulate_with_conversation_starter]
+
+ More examples can be found [here][evaluate_samples].

- print(outputs)
- ```
  ## Troubleshooting

  ### General

- Azure ML clients raise exceptions defined in [Azure Core][azure_core_readme].
+ Please refer to [troubleshooting][evaluation_tsg] for common issues.

  ### Logging

@@ -505,10 +355,68 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
  [code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
  [coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/
  [coc_contact]: mailto:opencode@microsoft.com
+ [evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target
+ [evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate
+ [evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
+ [evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate
+ [evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_app
+ [evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
+ [ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio
+ [ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio
+ [azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/
+ [evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_endpoints
+ [custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_custom
+ [evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate
+ [evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in
+ [performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators
+ [risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators
+ [composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators
+ [adversarial_simulation_docs]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#generate-adversarial-simulations-for-safety-evaluation
+ [adversarial_simulation_scenarios]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#supported-adversarial-simulation-scenarios
+ [adversarial_simulation]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/simulate_adversarial
+ [simulate_with_conversation_starter]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/simulate_conversation_starter
+ [adversarial_jailbreak]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#simulating-jailbreak-attacks


  # Release History

+ ## 1.0.1 (2024-11-15)
+
+ ### Bugs Fixed
+ - Fixed the `[remote]` extra so that it is required only when tracking results in Azure AI Studio.
+ - Removed `azure-ai-inference` as a dependency.
+
+ ## 1.0.0 (2024-11-13)
+
+ ### Breaking Changes
+ - The `parallel` parameter has been removed from composite evaluators: `QAEvaluator`, `ContentSafetyChatEvaluator`, and `ContentSafetyMultimodalEvaluator`. To control evaluator parallelism, you can now use the `_parallel` keyword argument, though please note that this private parameter may change in the future.
+ - Parameters `query_response_generating_prompty_kwargs` and `user_simulator_prompty_kwargs` have been renamed to `query_response_generating_prompty_options` and `user_simulator_prompty_options` in the `Simulator`'s `__call__` method (see the sketch below).
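A minimal sketch of the rename in a `Simulator` call, reusing the `simulator`, `callback`, and `text` variables from the simulator examples above; the option dictionaries and their contents are placeholders, not documented defaults:

```python
# Only the keyword names changed; the option values below are placeholder assumptions.
outputs = await simulator(
    target=callback,
    text=text,
    num_queries=2,
    query_response_generating_prompty_options={"temperature": 0.0},  # was query_response_generating_prompty_kwargs
    user_simulator_prompty_options={"temperature": 0.0},             # was user_simulator_prompty_kwargs
)
```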
+
+ ### Bugs Fixed
+ - Fixed an issue where the `output_path` parameter in the `evaluate` API did not support relative paths.
+ - The output of adversarial simulators is of type `JsonLineList`, and the helper function `to_eval_qr_json_lines` now outputs context from both user and assistant turns, along with `category` if it exists in the conversation.
+ - Fixed an issue where, during long-running simulations, the API token expired and caused a "Forbidden" error. Users can now set the environment variable `AZURE_TOKEN_REFRESH_INTERVAL` to refresh the token more frequently, preventing expiration and ensuring continuous operation of the simulation (see the sketch after this list).
+ - Fixed the `evaluate` function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or otherwise difficult to process. Such values are ignored fully, so the aggregated metric of `[1, 2, 3, NaN]` would be 2, not 1.5.
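A minimal sketch of using `AZURE_TOKEN_REFRESH_INTERVAL` before a long-running simulation, reusing the adversarial simulator variables from the examples above; the changelog does not state the value format, so the unit (seconds) and the value below are assumptions:

```python
import asyncio
import os

# Assumption: the refresh interval is expressed in seconds.
os.environ["AZURE_TOKEN_REFRESH_INTERVAL"] = "600"

outputs = asyncio.run(
    simulator(
        scenario=scenario,
        max_simulation_results=200,
        target=callback,
    )
)
```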
+
+ ### Other Changes
+ - Refined error messages for service-based evaluators and simulators.
+ - Tracing has been disabled due to a Cosmos DB initialization issue.
+ - Introduced the environment variable `AI_EVALS_DISABLE_EXPERIMENTAL_WARNING` to disable the warning message for experimental features (see the sketch after this list).
+ - Changed the randomization pattern for `AdversarialSimulator` such that there is an almost equal number of adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the `AdversarialSimulator` outputs. Previously, for 200 `max_simulation_results` a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, users will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
+ - For the `DirectAttackSimulator`, the prompt templates used to generate simulated outputs for each adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass `randomize_order=True` when you call the `DirectAttackSimulator`, for example:
+ ```python
+ adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+ outputs = asyncio.run(
+     adversarial_simulator(
+         scenario=scenario,
+         target=callback,
+         randomize_order=True
+     )
+ )
+ ```
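A minimal sketch of disabling the experimental-feature warning; the changelog does not state the accepted values, so `"true"` below is an assumption:

```python
import os

# Assumption: a truthy string such as "true" disables the warning; set it before using experimental features.
os.environ["AI_EVALS_DISABLE_EXPERIMENTAL_WARNING"] = "true"
```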
+
  ## 1.0.0b5 (2024-10-28)

  ### Features Added
@@ -565,8 +473,8 @@ outputs = asyncio.run(custom_simulator(
  - `SimilarityEvaluator`
  - `RetrievalEvaluator`
  - The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
-
- | Evaluator | New Token Limit |
+
+ | Evaluator | New `max_token` for Generation |
  | --- | --- |
  | `CoherenceEvaluator` | 800 |
  | `RelevanceEvaluator` | 800 |