azure-ai-evaluation 1.2.0__py3-none-any.whl → 1.4.0__py3-none-any.whl
This diff compares the contents of two publicly released versions of the package as published to their registry. It is provided for informational purposes only and reflects the changes between the package versions as they appear in that registry.
- azure/ai/evaluation/__init__.py +42 -14
- azure/ai/evaluation/_azure/_models.py +6 -6
- azure/ai/evaluation/_common/constants.py +6 -2
- azure/ai/evaluation/_common/rai_service.py +38 -4
- azure/ai/evaluation/_common/raiclient/__init__.py +34 -0
- azure/ai/evaluation/_common/raiclient/_client.py +128 -0
- azure/ai/evaluation/_common/raiclient/_configuration.py +87 -0
- azure/ai/evaluation/_common/raiclient/_model_base.py +1235 -0
- azure/ai/evaluation/_common/raiclient/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/_serialization.py +2050 -0
- azure/ai/evaluation/_common/raiclient/_version.py +9 -0
- azure/ai/evaluation/_common/raiclient/aio/__init__.py +29 -0
- azure/ai/evaluation/_common/raiclient/aio/_client.py +130 -0
- azure/ai/evaluation/_common/raiclient/aio/_configuration.py +87 -0
- azure/ai/evaluation/_common/raiclient/aio/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/_operations.py +981 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/models/__init__.py +60 -0
- azure/ai/evaluation/_common/raiclient/models/_enums.py +18 -0
- azure/ai/evaluation/_common/raiclient/models/_models.py +651 -0
- azure/ai/evaluation/_common/raiclient/models/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/raiclient/operations/_operations.py +1225 -0
- azure/ai/evaluation/_common/raiclient/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/py.typed +1 -0
- azure/ai/evaluation/_common/utils.py +30 -10
- azure/ai/evaluation/_constants.py +10 -0
- azure/ai/evaluation/_converters/__init__.py +3 -0
- azure/ai/evaluation/_converters/_ai_services.py +804 -0
- azure/ai/evaluation/_converters/_models.py +302 -0
- azure/ai/evaluation/_evaluate/_batch_run/__init__.py +10 -3
- azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +104 -0
- azure/ai/evaluation/_evaluate/_batch_run/batch_clients.py +82 -0
- azure/ai/evaluation/_evaluate/_eval_run.py +1 -1
- azure/ai/evaluation/_evaluate/_evaluate.py +36 -4
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +23 -3
- azure/ai/evaluation/_evaluators/_code_vulnerability/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +120 -0
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +21 -2
- azure/ai/evaluation/_evaluators/_common/_base_eval.py +43 -3
- azure/ai/evaluation/_evaluators/_common/_base_multi_eval.py +3 -1
- azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +43 -4
- azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +16 -4
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +42 -5
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +15 -0
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +15 -0
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +15 -0
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +15 -0
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +28 -4
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +21 -2
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +26 -3
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +21 -3
- azure/ai/evaluation/_evaluators/_intent_resolution/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +152 -0
- azure/ai/evaluation/_evaluators/_intent_resolution/intent_resolution.prompty +161 -0
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +26 -3
- azure/ai/evaluation/_evaluators/_qa/_qa.py +51 -7
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +26 -2
- azure/ai/evaluation/_evaluators/_response_completeness/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +157 -0
- azure/ai/evaluation/_evaluators/_response_completeness/response_completeness.prompty +99 -0
- azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +21 -2
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +113 -4
- azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +23 -3
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +24 -5
- azure/ai/evaluation/_evaluators/_task_adherence/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +148 -0
- azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +117 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +292 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +71 -0
- azure/ai/evaluation/_evaluators/_ungrounded_attributes/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +103 -0
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +2 -0
- azure/ai/evaluation/_exceptions.py +5 -1
- azure/ai/evaluation/_legacy/__init__.py +3 -0
- azure/ai/evaluation/_legacy/_batch_engine/__init__.py +9 -0
- azure/ai/evaluation/_legacy/_batch_engine/_config.py +45 -0
- azure/ai/evaluation/_legacy/_batch_engine/_engine.py +368 -0
- azure/ai/evaluation/_legacy/_batch_engine/_exceptions.py +88 -0
- azure/ai/evaluation/_legacy/_batch_engine/_logging.py +292 -0
- azure/ai/evaluation/_legacy/_batch_engine/_openai_injector.py +23 -0
- azure/ai/evaluation/_legacy/_batch_engine/_result.py +99 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run.py +121 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run_storage.py +128 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +217 -0
- azure/ai/evaluation/_legacy/_batch_engine/_status.py +25 -0
- azure/ai/evaluation/_legacy/_batch_engine/_trace.py +105 -0
- azure/ai/evaluation/_legacy/_batch_engine/_utils.py +82 -0
- azure/ai/evaluation/_legacy/_batch_engine/_utils_deprecated.py +131 -0
- azure/ai/evaluation/_legacy/prompty/__init__.py +36 -0
- azure/ai/evaluation/_legacy/prompty/_connection.py +182 -0
- azure/ai/evaluation/_legacy/prompty/_exceptions.py +59 -0
- azure/ai/evaluation/_legacy/prompty/_prompty.py +313 -0
- azure/ai/evaluation/_legacy/prompty/_utils.py +545 -0
- azure/ai/evaluation/_legacy/prompty/_yaml_utils.py +99 -0
- azure/ai/evaluation/_red_team/__init__.py +3 -0
- azure/ai/evaluation/_red_team/_attack_objective_generator.py +192 -0
- azure/ai/evaluation/_red_team/_attack_strategy.py +42 -0
- azure/ai/evaluation/_red_team/_callback_chat_target.py +74 -0
- azure/ai/evaluation/_red_team/_default_converter.py +21 -0
- azure/ai/evaluation/_red_team/_red_team.py +1858 -0
- azure/ai/evaluation/_red_team/_red_team_result.py +246 -0
- azure/ai/evaluation/_red_team/_utils/__init__.py +3 -0
- azure/ai/evaluation/_red_team/_utils/constants.py +64 -0
- azure/ai/evaluation/_red_team/_utils/formatting_utils.py +164 -0
- azure/ai/evaluation/_red_team/_utils/logging_utils.py +139 -0
- azure/ai/evaluation/_red_team/_utils/strategy_utils.py +188 -0
- azure/ai/evaluation/_safety_evaluation/__init__.py +3 -0
- azure/ai/evaluation/_safety_evaluation/_generated_rai_client.py +0 -0
- azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +741 -0
- azure/ai/evaluation/_version.py +2 -1
- azure/ai/evaluation/simulator/_adversarial_scenario.py +3 -1
- azure/ai/evaluation/simulator/_adversarial_simulator.py +61 -27
- azure/ai/evaluation/simulator/_conversation/__init__.py +4 -5
- azure/ai/evaluation/simulator/_conversation/_conversation.py +4 -0
- azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +145 -0
- azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +2 -0
- azure/ai/evaluation/simulator/_model_tools/_rai_client.py +71 -1
- {azure_ai_evaluation-1.2.0.dist-info → azure_ai_evaluation-1.4.0.dist-info}/METADATA +75 -15
- azure_ai_evaluation-1.4.0.dist-info/RECORD +197 -0
- {azure_ai_evaluation-1.2.0.dist-info → azure_ai_evaluation-1.4.0.dist-info}/WHEEL +1 -1
- azure/ai/evaluation/_evaluators/_multimodal/__init__.py +0 -20
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +0 -132
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +0 -55
- azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +0 -100
- azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +0 -124
- azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +0 -100
- azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +0 -100
- azure/ai/evaluation/_evaluators/_multimodal/_violence.py +0 -100
- azure_ai_evaluation-1.2.0.dist-info/RECORD +0 -125
- {azure_ai_evaluation-1.2.0.dist-info → azure_ai_evaluation-1.4.0.dist-info}/NOTICE.txt +0 -0
- {azure_ai_evaluation-1.2.0.dist-info → azure_ai_evaluation-1.4.0.dist-info}/top_level.txt +0 -0
azure/ai/evaluation/simulator/_adversarial_scenario.py
CHANGED

```diff
@@ -5,7 +5,7 @@
 from enum import Enum
 from azure.ai.evaluation._common._experimental import experimental
 
-
+# cspell:ignore vuln
 @experimental
 class AdversarialScenario(Enum):
     """Adversarial scenario types
@@ -28,6 +28,8 @@ class AdversarialScenario(Enum):
     ADVERSARIAL_CONTENT_GEN_UNGROUNDED = "adv_content_gen_ungrounded"
     ADVERSARIAL_CONTENT_GEN_GROUNDED = "adv_content_gen_grounded"
     ADVERSARIAL_CONTENT_PROTECTED_MATERIAL = "adv_content_protected_material"
+    ADVERSARIAL_CODE_VULNERABILITY = "adv_code_vuln"
+    ADVERSARIAL_UNGROUNDED_ATTRIBUTES = "adv_isa"
 
 
 @experimental
```
azure/ai/evaluation/simulator/_adversarial_simulator.py
CHANGED

```diff
@@ -7,6 +7,7 @@ import asyncio
 import logging
 import random
 from typing import Any, Callable, Dict, List, Optional, Union, cast
+import uuid
 
 from tqdm import tqdm
 
@@ -187,6 +188,8 @@ class AdversarialSimulator:
         )
         self._ensure_service_dependencies()
         templates = await self.adversarial_template_handler._get_content_harm_template_collections(scenario.value)
+        simulation_id = str(uuid.uuid4())
+        logger.warning("Use simulation_id to help debug the issue: %s", str(simulation_id))
         concurrent_async_task = min(concurrent_async_task, 1000)
         semaphore = asyncio.Semaphore(concurrent_async_task)
         sim_results = []
@@ -217,32 +220,54 @@ class AdversarialSimulator:
         if randomization_seed is not None:
             random.seed(randomization_seed)
             random.shuffle(templates)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
+        # Prepare task parameters based on scenario - but use a single append call for all scenarios
+        tasks = []
+        template_parameter_pairs = []
+
+        if scenario == AdversarialScenario.ADVERSARIAL_CONVERSATION:
+            # For ADVERSARIAL_CONVERSATION, flatten the parameters
+            for i, template in enumerate(templates):
+                if not template.template_parameters:
+                    continue
+                for parameter in template.template_parameters:
+                    template_parameter_pairs.append((template, parameter))
+        else:
+            # Use original logic for other scenarios - zip parameters
+            parameter_lists = [t.template_parameters for t in templates]
+            zipped_parameters = list(zip(*parameter_lists))
+
+            for param_group in zipped_parameters:
+                for template, parameter in zip(templates, param_group):
+                    template_parameter_pairs.append((template, parameter))
+
+        # Limit to max_simulation_results if needed
+        if len(template_parameter_pairs) > max_simulation_results:
+            template_parameter_pairs = template_parameter_pairs[:max_simulation_results]
+
+        # Single task append loop for all scenarios
+        for template, parameter in template_parameter_pairs:
+            if _jailbreak_type == "upia":
+                parameter = self._add_jailbreak_parameter(parameter, random.choice(jailbreak_dataset))
+
+            tasks.append(
+                asyncio.create_task(
+                    self._simulate_async(
+                        target=target,
+                        template=template,
+                        parameters=parameter,
+                        max_conversation_turns=max_conversation_turns,
+                        api_call_retry_limit=api_call_retry_limit,
+                        api_call_retry_sleep_sec=api_call_retry_sleep_sec,
+                        api_call_delay_sec=api_call_delay_sec,
+                        language=language,
+                        semaphore=semaphore,
+                        scenario=scenario,
+                        simulation_id=simulation_id,
                 )
             )
-
-
-            if len(tasks) >= max_simulation_results:
-                break
+            )
+
 
         for task in asyncio.as_completed(tasks):
             sim_results.append(await task)
             progress_bar.update(1)
@@ -298,9 +323,14 @@ class AdversarialSimulator:
         language: SupportedLanguages,
         semaphore: asyncio.Semaphore,
         scenario: Union[AdversarialScenario, AdversarialScenarioJailbreak],
+        simulation_id: str = "",
     ) -> List[Dict]:
         user_bot = self._setup_bot(
-            role=ConversationRole.USER,
+            role=ConversationRole.USER,
+            template=template,
+            parameters=parameters,
+            scenario=scenario,
+            simulation_id=simulation_id,
         )
         system_bot = self._setup_bot(
             target=target, role=ConversationRole.ASSISTANT, template=template, parameters=parameters, scenario=scenario
@@ -329,7 +359,7 @@ class AdversarialSimulator:
         )
 
     def _get_user_proxy_completion_model(
-        self, template_key: str, template_parameters: TemplateParameters
+        self, template_key: str, template_parameters: TemplateParameters, simulation_id: str = ""
    ) -> ProxyChatCompletionsModel:
         return ProxyChatCompletionsModel(
             name="raisvc_proxy_model",
@@ -340,6 +370,7 @@ class AdversarialSimulator:
             api_version="2023-07-01-preview",
             max_tokens=1200,
             temperature=0.0,
+            simulation_id=simulation_id,
         )
 
     def _setup_bot(
@@ -350,10 +381,13 @@ class AdversarialSimulator:
         parameters: TemplateParameters,
         target: Optional[Callable] = None,
         scenario: Union[AdversarialScenario, AdversarialScenarioJailbreak],
+        simulation_id: str = "",
     ) -> ConversationBot:
         if role is ConversationRole.USER:
             model = self._get_user_proxy_completion_model(
-                template_key=template.template_name,
+                template_key=template.template_name,
+                template_parameters=parameters,
+                simulation_id=simulation_id,
             )
             return ConversationBot(
                 role=role,
```
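The task-construction hunk above replaces the old per-scenario loops with a single pairing step: `ADVERSARIAL_CONVERSATION` flattens every template against its own parameter list, while other scenarios zip parameter lists across templates, and the result is capped at `max_simulation_results`. A standalone sketch of that pairing with plain tuples standing in for the SDK's template objects (names here are illustrative, not the SDK's):

```python
# Sketch of the 1.4.0 pairing logic. "templates" are (name, parameter_list)
# tuples standing in for the SDK's template objects.
def pair_templates(templates, flatten, max_results):
    pairs = []
    if flatten:  # ADVERSARIAL_CONVERSATION: each template x its own parameters
        for name, params in templates:
            for p in params:
                pairs.append((name, p))
    else:  # other scenarios: zip parameter lists across templates
        parameter_lists = [params for _, params in templates]
        for group in zip(*parameter_lists):
            for (name, _), p in zip(templates, group):
                pairs.append((name, p))
    return pairs[:max_results]  # cap at max_simulation_results

templates = [("t1", ["a", "b"]), ("t2", ["c", "d"])]
print(pair_templates(templates, flatten=True, max_results=10))
# [('t1', 'a'), ('t1', 'b'), ('t2', 'c'), ('t2', 'd')]
print(pair_templates(templates, flatten=False, max_results=3))
# [('t1', 'a'), ('t2', 'c'), ('t1', 'b')]
```

The zip branch interleaves templates turn by turn, which is why truncation to `max_results` can cut mid-round.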
azure/ai/evaluation/simulator/_conversation/__init__.py
CHANGED

```diff
@@ -128,19 +128,15 @@ class ConversationBot:
         self.conversation_starter: Optional[Union[str, jinja2.Template, Dict]] = None
         if role == ConversationRole.USER:
             if "conversation_starter" in self.persona_template_args:
-                print(self.persona_template_args)
                 conversation_starter_content = self.persona_template_args["conversation_starter"]
                 if isinstance(conversation_starter_content, dict):
                     self.conversation_starter = conversation_starter_content
-                    print(f"Conversation starter content: {conversation_starter_content}")
                 else:
                     try:
                         self.conversation_starter = jinja2.Template(
                             conversation_starter_content, undefined=jinja2.StrictUndefined
                         )
-                        print("Successfully created a Jinja2 template for the conversation starter.")
                     except jinja2.exceptions.TemplateSyntaxError as e:  # noqa: F841
-                        print(f"Template syntax error: {e}. Using raw content.")
                         self.conversation_starter = conversation_starter_content
             else:
                 self.logger.info(
@@ -153,6 +149,7 @@ class ConversationBot:
         conversation_history: List[ConversationTurn],
         max_history: int,
         turn_number: int = 0,
+        session_state: Optional[Dict[str, Any]] = None,
     ) -> Tuple[dict, dict, float, dict]:
         """
         Prompt the ConversationBot for a response.
@@ -262,6 +259,7 @@ class CallbackConversationBot(ConversationBot):
         conversation_history: List[Any],
         max_history: int,
         turn_number: int = 0,
+        session_state: Optional[Dict[str, Any]] = None,
     ) -> Tuple[dict, dict, float, dict]:
         chat_protocol_message = self._to_chat_protocol(
             self.user_template, conversation_history, self.user_template_parameters
@@ -269,7 +267,7 @@ class CallbackConversationBot(ConversationBot):
         msg_copy = copy.deepcopy(chat_protocol_message)
         result = {}
         start_time = time.time()
-        result = await self.callback(msg_copy)
+        result = await self.callback(msg_copy, session_state=session_state)
         end_time = time.time()
         if not result:
             result = {
@@ -348,6 +346,7 @@ class MultiModalConversationBot(ConversationBot):
         conversation_history: List[Any],
         max_history: int,
         turn_number: int = 0,
+        session_state: Optional[Dict[str, Any]] = None,
     ) -> Tuple[dict, dict, float, dict]:
         previous_prompt = conversation_history[-1]
         chat_protocol_message = await self._to_chat_protocol(conversation_history, self.user_template_parameters)
```
azure/ai/evaluation/simulator/_conversation/_conversation.py
CHANGED

```diff
@@ -101,6 +101,7 @@ async def simulate_conversation(
     :rtype: Tuple[Optional[str], List[ConversationTurn]]
     """
 
+    session_state = {}
     # Read the first prompt.
     (first_response, request, _, full_response) = await bots[0].generate_response(
         session=session,
@@ -149,7 +150,10 @@ async def simulate_conversation(
             conversation_history=conversation_history,
             max_history=history_limit,
             turn_number=current_turn,
+            session_state=session_state,
         )
+        if "session_state" in full_response and full_response["session_state"] is not None:
+            session_state.update(full_response["session_state"])
 
         # check if conversation id is null, which means conversation starter was used. use id from next turn
         if conversation_id is None and "id" in response:
```
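With the `session_state` plumbing above, a callback target now receives a `session_state` keyword on each turn and can return an updated `"session_state"` key in its response, which the conversation loop merges back into the shared dict. A minimal sketch of that round-trip (the callback body is hypothetical; only the keyword argument and the merge rule mirror the diff):

```python
import asyncio

async def callback(message, session_state=None):
    # Hypothetical target: counts turns via session_state and echoes it back.
    turns = (session_state or {}).get("turns", 0) + 1
    return {"messages": message["messages"], "session_state": {"turns": turns}}

async def run_turns(n):
    session_state = {}  # one dict shared across turns, as in simulate_conversation
    for _ in range(n):
        full_response = await callback({"messages": []}, session_state=session_state)
        # Merge rule from the diff: only update when the key exists and is not None.
        if "session_state" in full_response and full_response["session_state"] is not None:
            session_state.update(full_response["session_state"])
    return session_state

print(asyncio.run(run_turns(3)))  # {'turns': 3}
```

Targets that ignore the keyword (or return no `"session_state"`) keep working, since the merge is skipped when the key is absent.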
azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py
ADDED

```diff
@@ -0,0 +1,145 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+
+import os
+from typing import Dict, List, Optional
+
+from azure.core.credentials import TokenCredential
+from azure.ai.evaluation._model_configurations import AzureAIProject
+from azure.ai.evaluation.simulator._model_tools import ManagedIdentityAPITokenManager
+from azure.ai.evaluation._common.raiclient import MachineLearningServicesClient
+import jwt
+import time
+import ast
+
+class GeneratedRAIClient:
+    """Client for the Responsible AI Service using the auto-generated MachineLearningServicesClient.
+
+    :param azure_ai_project: The scope of the Azure AI project. It contains subscription id, resource group, and project name.
+    :type azure_ai_project: ~azure.ai.evaluation.AzureAIProject
+    :param token_manager: The token manager
+    :type token_manager: ~azure.ai.evaluation.simulator._model_tools._identity_manager.APITokenManager
+    """
+
+    def __init__(self, azure_ai_project: AzureAIProject, token_manager: ManagedIdentityAPITokenManager):
+        self.azure_ai_project = azure_ai_project
+        self.token_manager = token_manager
+
+        # Service URL construction
+        if "RAI_SVC_URL" in os.environ:
+            endpoint = os.environ["RAI_SVC_URL"].rstrip("/")
+        else:
+            endpoint = self._get_service_discovery_url()
+
+        # Create the autogenerated client
+        self._client = MachineLearningServicesClient(
+            endpoint=endpoint,
+            subscription_id=self.azure_ai_project["subscription_id"],
+            resource_group_name=self.azure_ai_project["resource_group_name"],
+            workspace_name=self.azure_ai_project["project_name"],
+            credential=self.token_manager,
+        )
+
+    def _get_service_discovery_url(self):
+        """Get the service discovery URL.
+
+        :return: The service discovery URL
+        :rtype: str
+        """
+        import requests
+        bearer_token = self._fetch_or_reuse_token(self.token_manager)
+        headers = {"Authorization": f"Bearer {bearer_token}", "Content-Type": "application/json"}
+
+        response = requests.get(
+            f"https://management.azure.com/subscriptions/{self.azure_ai_project['subscription_id']}/"
+            f"resourceGroups/{self.azure_ai_project['resource_group_name']}/"
+            f"providers/Microsoft.MachineLearningServices/workspaces/{self.azure_ai_project['project_name']}?"
+            f"api-version=2023-08-01-preview",
+            headers=headers,
+            timeout=5,
+        )
+
+        if response.status_code != 200:
+            msg = (
+                f"Failed to connect to your Azure AI project. Please check if the project scope is configured "
+                f"correctly, and make sure you have the necessary access permissions. "
+                f"Status code: {response.status_code}."
+            )
+            raise Exception(msg)
+
+        # Parse the discovery URL
+        from urllib.parse import urlparse
+        base_url = urlparse(response.json()["properties"]["discoveryUrl"])
+        return f"{base_url.scheme}://{base_url.netloc}"
+
+    async def get_attack_objectives(self, risk_category: Optional[str] = None, application_scenario: str = None, strategy: Optional[str] = None) -> Dict:
+        """Get attack objectives using the auto-generated operations.
+
+        :param risk_category: Optional risk category to filter the attack objectives
+        :type risk_category: Optional[str]
+        :param application_scenario: Optional description of the application scenario for context
+        :type application_scenario: str
+        :param strategy: Optional strategy to filter the attack objectives
+        :type strategy: Optional[str]
+        :return: The attack objectives
+        :rtype: Dict
+        """
+        try:
+            # Send the request using the autogenerated client
+            response = self._client.rai_svc.get_attack_objectives(
+                risk_types=[risk_category],
+                lang="en",
+                strategy=strategy,
+            )
+            return response
+
+        except Exception as e:
+            # Log the exception for debugging purposes
+            import logging
+            logging.error(f"Error in get_attack_objectives: {str(e)}")
+            raise
+
+    async def get_jailbreak_prefixes(self) -> List[str]:
+        """Get jailbreak prefixes using the auto-generated operations.
+
+        :return: The jailbreak prefixes
+        :rtype: List[str]
+        """
+        try:
+            # Send the request using the autogenerated client
+            response = self._client.rai_svc.get_jail_break_dataset_with_type(type="upia")
+            if isinstance(response, list):
+                return response
+            else:
+                self.logger.error("Unexpected response format from get_jail_break_dataset_with_type")
+                raise ValueError("Unexpected response format from get_jail_break_dataset_with_type")
+
+        except Exception as e:
+            return [""]
+
+    def _fetch_or_reuse_token(self, credential: TokenCredential, token: Optional[str] = None) -> str:
+        """Get token. Fetch a new token if the current token is near expiry
+
+        :param credential: The Azure authentication credential.
+        :type credential:
+            ~azure.core.credentials.TokenCredential
+        :param token: The Azure authentication token. Defaults to None. If none, a new token will be fetched.
+        :type token: str
+        :return: The Azure authentication token.
+        """
+        if token:
+            # Decode the token to get its expiration time
+            try:
+                decoded_token = jwt.decode(token, options={"verify_signature": False})
+            except jwt.PyJWTError:
+                pass
+            else:
+                exp_time = decoded_token["exp"]
+                current_time = time.time()
+
+                # Return current token if not near expiry
+                if (exp_time - current_time) >= 300:
+                    return token
+
+        return credential.get_token("https://management.azure.com/.default").token
```
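`_fetch_or_reuse_token` above reuses a cached bearer token only while its `exp` claim is at least 300 seconds away. The same check can be sketched with just the standard library, since an unverified JWT payload is base64url-encoded JSON (PyJWT's `jwt.decode(..., options={"verify_signature": False})` extracts the same claims); the demo token below is an assumption built for illustration:

```python
import base64
import json
import time

def seconds_to_expiry(token: str) -> float:
    # JWT = header.payload.signature; the payload is base64url JSON with an "exp" claim.
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"] - time.time()

def should_reuse(token: str, margin: float = 300.0) -> bool:
    # Mirror of the diff's rule: reuse only when >= 300 seconds from expiry.
    return seconds_to_expiry(token) >= margin

# Build an unsigned demo token expiring in one hour.
header = base64.urlsafe_b64encode(b'{"alg":"none"}').rstrip(b"=").decode()
claims = base64.urlsafe_b64encode(
    json.dumps({"exp": int(time.time()) + 3600}).encode()
).rstrip(b"=").decode()
demo = f"{header}.{claims}."
print(should_reuse(demo))  # True
```

The 300-second margin leaves room for a request to complete before the token actually expires.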
azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py
CHANGED

```diff
@@ -89,6 +89,7 @@ class ProxyChatCompletionsModel(OpenAIChatCompletionsModel):
         self.tkey = template_key
         self.tparam = template_parameters
         self.result_url: Optional[str] = None
+        self.simulation_id: Optional[str] = kwargs.pop("simulation_id", "")
 
         super().__init__(name=name, **kwargs)
 
@@ -169,6 +170,7 @@ class ProxyChatCompletionsModel(OpenAIChatCompletionsModel):
             "Content-Type": "application/json",
             "X-CV": f"{uuid.uuid4()}",
             "X-ModelType": self.model or "",
+            "x-ms-client-request-id": self.simulation_id,
         }
         # add all additional headers
         headers.update(self.additional_headers)  # type: ignore[arg-type]
```
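Together with the `uuid4` generated once per `simulate()` call, the two small additions above stamp every outbound request with an `x-ms-client-request-id` header, so the value printed by the simulator's warning can be quoted when reporting a failing run. A minimal sketch of that correlation pattern (the helper name is illustrative, not the SDK's):

```python
import uuid

def build_headers(simulation_id: str, model: str = "") -> dict:
    # Mirrors the headers assembled in ProxyChatCompletionsModel: a fresh
    # per-request id (X-CV) plus one per-run id shared by the whole simulation.
    return {
        "Content-Type": "application/json",
        "X-CV": str(uuid.uuid4()),                # per-request id
        "X-ModelType": model,
        "x-ms-client-request-id": simulation_id,  # per-run correlation id
    }

simulation_id = str(uuid.uuid4())  # generated once per simulation run
h1, h2 = build_headers(simulation_id), build_headers(simulation_id)
assert h1["x-ms-client-request-id"] == h2["x-ms-client-request-id"]
assert h1["X-CV"] != h2["X-CV"]
```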
azure/ai/evaluation/simulator/_model_tools/_rai_client.py
CHANGED

```diff
@@ -2,9 +2,10 @@
 # Copyright (c) Microsoft Corporation. All rights reserved.
 # ---------------------------------------------------------
 import os
-from typing import Any
+from typing import Any, Dict, List
 from urllib.parse import urljoin, urlparse
 import base64
+import json
 
 from azure.ai.evaluation._exceptions import ErrorBlame, ErrorCategory, ErrorTarget, EvaluationException
 from azure.ai.evaluation._http_utils import AsyncHttpPipeline, get_async_http_client, get_http_client
@@ -62,6 +63,7 @@ class RAIClient:  # pylint: disable=client-accepts-api-version-keyword
         self.jailbreaks_json_endpoint = urljoin(self.api_url, "simulation/jailbreak")
         self.simulation_submit_endpoint = urljoin(self.api_url, "simulation/chat/completions/submit")
         self.xpia_jailbreaks_json_endpoint = urljoin(self.api_url, "simulation/jailbreak/xpia")
+        self.attack_objectives_endpoint = urljoin(self.api_url, "simulation/attackobjectives")
 
     def _get_service_discovery_url(self):
         bearer_token = self.token_manager.get_token()
@@ -206,3 +208,71 @@ class RAIClient:  # pylint: disable=client-accepts-api-version-keyword
             category=ErrorCategory.UNKNOWN,
             blame=ErrorBlame.USER_ERROR,
         )
+
+    async def get_attack_objectives(self, risk_categories: List[str], application_scenario: str = None, strategy: str = None) -> Any:
+        """Get the attack objectives based on risk categories and application scenario
+
+        :param risk_categories: List of risk categories to generate attack objectives for
+        :type risk_categories: List[str]
+        :param application_scenario: Optional description of the application scenario for context
+        :type application_scenario: str
+        :param strategy: Optional attack strategy to get specific objectives for
+        :type strategy: str
+        :return: The attack objectives
+        :rtype: Any
+        """
+        # Create query parameters for the request
+        if application_scenario:
+            raise NotImplementedError("Application scenario is not supported yet")
+
+        params = {
+            "api-version": "2022-11-01-preview",
+            "riskTypes": ",".join(risk_categories),
+            "lang": "en"  # Default to English
+        }
+
+        # Add strategy parameter if provided
+        if strategy:
+            params["strategy"] = strategy
+
+        try:
+            # Make the request using the existing get method
+            result = await self.get(self.attack_objectives_endpoint)
+            # from collections import defaultdict
+            # counts_by_risk = defaultdict(int)
+            # for item in result:
+            #     target_harms = item.get("Metadata", {}).get("TargetHarms", [])
+            #     if not target_harms:
+            #         # No risk type specified
+            #         counts_by_risk["empty"] += 1
+            #     else:
+            #         for harm in target_harms:
+            #             # Use "empty" if the risk type field is missing
+            #             risk_type = harm.get("RiskType", "") or "empty"
+            #             counts_by_risk[risk_type] += 1
+            return result
+        except Exception:
+            # If the API fails or isn't implemented yet, return a mock response
+            # This is temporary until the API endpoint is fully implemented
+            return [{
+                "metadata": {
+                    "lang": "en",
+                    "target_harms": [
+                        {
+                            "risk-type": "violence",
+                            "risk-subtype": ""
+                        }
+                    ]
+                },
+                "messages": [
+                    {
+                        "role": "user",
+                        "content": "Risky content"
+                    }
+                ],
+                "modality": "text",
+                "source": [
+                    "source"
+                ]
+            }]
+
```
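The fallback payload above fixes the shape that downstream red-team code can rely on: a list of objectives, each carrying `metadata.target_harms` tags and a `messages` list whose user turn holds the attack prompt. A sketch of pulling the prompts for one risk type out of that shape (the mock dict is taken from the diff; the helper name is ours):

```python
def prompts_for_risk(objectives, risk_type):
    # Collect user-message content from objectives tagged with risk_type.
    prompts = []
    for obj in objectives:
        harms = obj.get("metadata", {}).get("target_harms", [])
        if any(h.get("risk-type") == risk_type for h in harms):
            for msg in obj.get("messages", []):
                if msg.get("role") == "user":
                    prompts.append(msg.get("content"))
    return prompts

# The mock objective returned by RAIClient.get_attack_objectives on failure.
mock = [{
    "metadata": {"lang": "en", "target_harms": [{"risk-type": "violence", "risk-subtype": ""}]},
    "messages": [{"role": "user", "content": "Risky content"}],
    "modality": "text",
    "source": ["source"],
}]
print(prompts_for_risk(mock, "violence"))  # ['Risky content']
```

Note the commented-out debug block in the diff counts objectives by `Metadata.TargetHarms` / `RiskType` (capitalized keys), while the mock uses lowercase `metadata` / `risk-type`, so consumers should not assume one casing for both the live and fallback responses.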
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: azure-ai-evaluation
-Version: 1.2.0
+Version: 1.4.0
 Summary: Microsoft Azure Evaluation Library for Python
 Home-page: https://github.com/Azure/azure-sdk-for-python
 Author: Microsoft Corporation
@@ -21,13 +21,15 @@ Classifier: Operating System :: OS Independent
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: NOTICE.txt
-Requires-Dist: promptflow-devkit
-Requires-Dist: promptflow-core
-Requires-Dist: pyjwt
-Requires-Dist: azure-identity
-Requires-Dist: azure-core
-Requires-Dist: nltk
-Requires-Dist: azure-storage-blob
+Requires-Dist: promptflow-devkit>=1.17.1
+Requires-Dist: promptflow-core>=1.17.1
+Requires-Dist: pyjwt>=2.8.0
+Requires-Dist: azure-identity>=1.16.0
+Requires-Dist: azure-core>=1.30.2
+Requires-Dist: nltk>=3.9.1
+Requires-Dist: azure-storage-blob>=12.10.0
+Provides-Extra: redteam
+Requires-Dist: pyrit>=0.8.0; extra == "redteam"
 
 # Azure AI Evaluation client library for Python
 
@@ -54,7 +56,7 @@ Azure AI SDK provides following to evaluate Generative AI Applications:
 ### Prerequisites
 
 - Python 3.9 or later is required to use this package.
-- [Optional] You must have [Azure AI Project][ai_project] or [Azure Open AI][azure_openai] to use AI-assisted evaluators
+- [Optional] You must have [Azure AI Foundry Project][ai_project] or [Azure Open AI][azure_openai] to use AI-assisted evaluators
 
 ### Install the package
 
@@ -63,10 +65,6 @@ Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:
 ```bash
 pip install azure-ai-evaluation
 ```
-If you want to track results in [AI Studio][ai_studio], install `remote` extra:
-```python
-pip install azure-ai-evaluation[remote]
-```
 
 ## Key concepts
 
@@ -175,9 +173,9 @@ result = evaluate(
             }
         }
     }
-    # Optionally provide your AI
+    # Optionally provide your AI Foundry project information to track your evaluation results in your Azure AI Foundry project
     azure_ai_project = azure_ai_project,
-    # Optionally provide an output path to dump a json of metric summary, row level data and metric and
+    # Optionally provide an output path to dump a json of metric summary, row level data and metric and AI Foundry URL
    output_path="./evaluation_results.json"
 )
 ```
@@ -375,8 +373,70 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 [simulate_with_conversation_starter]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter
 [adversarial_jailbreak]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#simulating-jailbreak-attacks
 
+
 # Release History
 
+## 1.4.0 (2025-03-27)
+
+### Features Added
+- Enhanced binary evaluation results with customizable thresholds
+  - Added threshold support for QA and ContentSafety evaluators
+  - Evaluation results now include both the score and threshold values
+  - Configurable threshold parameter allows custom binary classification boundaries
+  - Default thresholds provided for backward compatibility
+  - Quality evaluators use "higher is better" scoring (score ≥ threshold is positive)
+  - Content safety evaluators use "lower is better" scoring (score ≤ threshold is positive)
+- New built-in evaluator CodeVulnerabilityEvaluator is added.
+  - It provides capabilities to identify the following code vulnerabilities:
+    - path-injection
+    - sql-injection
+    - code-injection
+    - stack-trace-exposure
+    - incomplete-url-substring-sanitization
+    - flask-debug
+    - clear-text-logging-sensitive-data
+    - incomplete-hostname-regexp
+    - server-side-unvalidated-url-redirection
+    - weak-cryptographic-algorithm
+    - full-ssrf
+    - bind-socket-all-network-interfaces
+    - client-side-unvalidated-url-redirection
+    - likely-bugs
+    - reflected-xss
+    - clear-text-storage-sensitive-data
+    - tarslip
+    - hardcoded-credentials
+    - insecure-randomness
+  - It also supports multiple coding languages, such as Python, Java, C++, C#, Go, JavaScript, and SQL.
+
+- New built-in evaluator UngroundedAttributesEvaluator is added.
+  - It evaluates ungrounded inference of human attributes for a given query, response, and context for a single-turn evaluation only, where query represents the user query and response represents the AI system response given the provided context.
+  - Ungrounded Attributes checks whether a response is first ungrounded, and then whether it contains information about the protected class or emotional state of a person.
+  - It identifies the following attributes:
+    - emotional_state
+    - protected_class
+    - groundedness
+- New built-in evaluators for Agent Evaluation (Preview)
+  - IntentResolutionEvaluator - Evaluates how well an agent's response resolves the intent of a user query.
+  - ResponseCompletenessEvaluator - Evaluates the completeness of an agent's response to a user query.
+  - TaskAdherenceEvaluator - Evaluates how well an agent's response adheres to the given task.
+  - ToolCallAccuracyEvaluator - Evaluates the accuracy of tool calls made by an agent in response to a user query.
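The two scoring directions described in the threshold bullets above can be sketched as a small standalone helper; `binary_result` is a hypothetical illustration of the rule, not the SDK's actual implementation:

```python
def binary_result(score, threshold, higher_is_better=True):
    """Map a numeric evaluator score to a pass/fail label against a threshold.

    Quality evaluators pass when score >= threshold (higher is better);
    content safety evaluators pass when score <= threshold (lower is better).
    """
    if higher_is_better:
        return "pass" if score >= threshold else "fail"
    return "pass" if score <= threshold else "fail"

# Quality metric on a 1-5 scale with an assumed threshold of 3
print(binary_result(4, 3))                          # pass
# Content safety severity on a 0-7 scale with an assumed threshold of 3
print(binary_result(5, 3, higher_is_better=False))  # fail
```

Both directions treat a score equal to the threshold as positive, which matches the ≥/≤ wording above; the scales and the threshold of 3 are assumptions for the example.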
+
+### Bugs Fixed
+- Fixed an error in `GroundednessProEvaluator` when handling non-numeric values like "n/a" returned from the service.
+- Uploading local evaluation results from `evaluate` with the same run name will no longer result in each online run sharing (and overwriting) result files.
+
+## 1.3.0 (2025-02-28)
+
+### Breaking Changes
+- Multimodal-specific evaluators `ContentSafetyMultimodalEvaluator`, `ViolenceMultimodalEvaluator`, `SexualMultimodalEvaluator`, `SelfHarmMultimodalEvaluator`, `HateUnfairnessMultimodalEvaluator` and `ProtectedMaterialMultimodalEvaluator` have been removed. Please use `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator` and `ProtectedMaterialEvaluator` instead.
+- The metric name in ProtectedMaterialEvaluator's output has changed from `protected_material.fictional_characters_label` to `protected_material.fictional_characters_defect_rate`. It is now consistent with other evaluators' metric names (ending with `_defect_rate`).
+
 ## 1.2.0 (2025-01-27)
 
 ### Features Added