azure-ai-evaluation 1.9.0__py3-none-any.whl → 1.11.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.



Files changed (85)
  1. azure/ai/evaluation/__init__.py +46 -12
  2. azure/ai/evaluation/_aoai/python_grader.py +84 -0
  3. azure/ai/evaluation/_aoai/score_model_grader.py +1 -0
  4. azure/ai/evaluation/_common/onedp/models/_models.py +5 -0
  5. azure/ai/evaluation/_common/rai_service.py +3 -3
  6. azure/ai/evaluation/_common/utils.py +74 -17
  7. azure/ai/evaluation/_converters/_ai_services.py +60 -10
  8. azure/ai/evaluation/_converters/_models.py +75 -26
  9. azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +70 -22
  10. azure/ai/evaluation/_evaluate/_eval_run.py +14 -1
  11. azure/ai/evaluation/_evaluate/_evaluate.py +163 -44
  12. azure/ai/evaluation/_evaluate/_evaluate_aoai.py +79 -33
  13. azure/ai/evaluation/_evaluate/_utils.py +5 -2
  14. azure/ai/evaluation/_evaluators/_bleu/_bleu.py +1 -1
  15. azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +8 -1
  16. azure/ai/evaluation/_evaluators/_coherence/_coherence.py +3 -2
  17. azure/ai/evaluation/_evaluators/_common/_base_eval.py +143 -25
  18. azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +7 -2
  19. azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +19 -9
  20. azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +15 -5
  21. azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +4 -1
  22. azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +4 -1
  23. azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +5 -2
  24. azure/ai/evaluation/_evaluators/_content_safety/_violence.py +4 -1
  25. azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py +3 -0
  26. azure/ai/evaluation/_evaluators/_eci/_eci.py +3 -0
  27. azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +1 -1
  28. azure/ai/evaluation/_evaluators/_fluency/_fluency.py +3 -2
  29. azure/ai/evaluation/_evaluators/_gleu/_gleu.py +1 -1
  30. azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +114 -4
  31. azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +9 -3
  32. azure/ai/evaluation/_evaluators/_meteor/_meteor.py +1 -1
  33. azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +8 -1
  34. azure/ai/evaluation/_evaluators/_qa/_qa.py +1 -1
  35. azure/ai/evaluation/_evaluators/_relevance/_relevance.py +56 -3
  36. azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +140 -59
  37. azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +11 -3
  38. azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +3 -2
  39. azure/ai/evaluation/_evaluators/_rouge/_rouge.py +1 -1
  40. azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +2 -1
  41. azure/ai/evaluation/_evaluators/_similarity/_similarity.py +3 -2
  42. azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +24 -12
  43. azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +354 -66
  44. azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +214 -187
  45. azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +126 -31
  46. azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +8 -1
  47. azure/ai/evaluation/_evaluators/_xpia/xpia.py +4 -1
  48. azure/ai/evaluation/_exceptions.py +1 -0
  49. azure/ai/evaluation/_legacy/_batch_engine/_config.py +6 -3
  50. azure/ai/evaluation/_legacy/_batch_engine/_engine.py +115 -30
  51. azure/ai/evaluation/_legacy/_batch_engine/_result.py +2 -0
  52. azure/ai/evaluation/_legacy/_batch_engine/_run.py +2 -2
  53. azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +28 -31
  54. azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +2 -0
  55. azure/ai/evaluation/_version.py +1 -1
  56. azure/ai/evaluation/red_team/__init__.py +4 -3
  57. azure/ai/evaluation/red_team/_attack_objective_generator.py +17 -0
  58. azure/ai/evaluation/red_team/_callback_chat_target.py +14 -1
  59. azure/ai/evaluation/red_team/_evaluation_processor.py +376 -0
  60. azure/ai/evaluation/red_team/_mlflow_integration.py +322 -0
  61. azure/ai/evaluation/red_team/_orchestrator_manager.py +661 -0
  62. azure/ai/evaluation/red_team/_red_team.py +655 -2665
  63. azure/ai/evaluation/red_team/_red_team_result.py +6 -0
  64. azure/ai/evaluation/red_team/_result_processor.py +610 -0
  65. azure/ai/evaluation/red_team/_utils/__init__.py +34 -0
  66. azure/ai/evaluation/red_team/_utils/_rai_service_eval_chat_target.py +11 -4
  67. azure/ai/evaluation/red_team/_utils/_rai_service_true_false_scorer.py +6 -0
  68. azure/ai/evaluation/red_team/_utils/constants.py +0 -2
  69. azure/ai/evaluation/red_team/_utils/exception_utils.py +345 -0
  70. azure/ai/evaluation/red_team/_utils/file_utils.py +266 -0
  71. azure/ai/evaluation/red_team/_utils/formatting_utils.py +115 -13
  72. azure/ai/evaluation/red_team/_utils/metric_mapping.py +24 -4
  73. azure/ai/evaluation/red_team/_utils/progress_utils.py +252 -0
  74. azure/ai/evaluation/red_team/_utils/retry_utils.py +218 -0
  75. azure/ai/evaluation/red_team/_utils/strategy_utils.py +17 -4
  76. azure/ai/evaluation/simulator/_adversarial_simulator.py +14 -2
  77. azure/ai/evaluation/simulator/_indirect_attack_simulator.py +13 -1
  78. azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +21 -7
  79. azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +24 -5
  80. azure/ai/evaluation/simulator/_simulator.py +12 -0
  81. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/METADATA +63 -4
  82. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/RECORD +85 -76
  83. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/WHEEL +1 -1
  84. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info/licenses}/NOTICE.txt +0 -0
  85. {azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,218 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+ """
+ Retry utilities for Red Team Agent.
+
+ This module provides centralized retry logic and decorators for handling
+ network errors and other transient failures consistently across the codebase.
+ """
+
+ import asyncio
+ import logging
+ from typing import Any, Callable, Dict, List, Optional, TypeVar
+ from tenacity import (
+     retry,
+     stop_after_attempt,
+     wait_exponential,
+     retry_if_exception,
+     RetryError,
+ )
+
+ # Retry imports for exception handling
+ import httpx
+ import httpcore
+
+ # Import Azure exceptions if available
+ try:
+     from azure.core.exceptions import ServiceRequestError, ServiceResponseError
+
+     AZURE_EXCEPTIONS = (ServiceRequestError, ServiceResponseError)
+ except ImportError:
+     AZURE_EXCEPTIONS = ()
+
+
+ # Type variable for generic retry decorators
+ T = TypeVar("T")
+
+
+ class RetryManager:
+     """Centralized retry management for Red Team operations."""
+
+     # Default retry configuration
+     DEFAULT_MAX_ATTEMPTS = 5
+     DEFAULT_MIN_WAIT = 2
+     DEFAULT_MAX_WAIT = 30
+     DEFAULT_MULTIPLIER = 1.5
+
+     # Network-related exceptions that should trigger retries
+     NETWORK_EXCEPTIONS = (
+         httpx.ConnectTimeout,
+         httpx.ReadTimeout,
+         httpx.ConnectError,
+         httpx.HTTPError,
+         httpx.TimeoutException,
+         httpx.HTTPStatusError,
+         httpcore.ReadTimeout,
+         ConnectionError,
+         ConnectionRefusedError,
+         ConnectionResetError,
+         TimeoutError,
+         OSError,
+         IOError,
+         asyncio.TimeoutError,
+     ) + AZURE_EXCEPTIONS
+
+     def __init__(
+         self,
+         logger: Optional[logging.Logger] = None,
+         max_attempts: int = DEFAULT_MAX_ATTEMPTS,
+         min_wait: int = DEFAULT_MIN_WAIT,
+         max_wait: int = DEFAULT_MAX_WAIT,
+         multiplier: float = DEFAULT_MULTIPLIER,
+     ):
+         """Initialize retry manager.
+
+         :param logger: Logger instance for retry messages
+         :param max_attempts: Maximum number of retry attempts
+         :param min_wait: Minimum wait time between retries (seconds)
+         :param max_wait: Maximum wait time between retries (seconds)
+         :param multiplier: Exponential backoff multiplier
+         """
+         self.logger = logger or logging.getLogger(__name__)
+         self.max_attempts = max_attempts
+         self.min_wait = min_wait
+         self.max_wait = max_wait
+         self.multiplier = multiplier
+
+     def should_retry_exception(self, exception: Exception) -> bool:
+         """Determine if an exception should trigger a retry.
+
+         :param exception: The exception to check
+         :return: True if the exception should trigger a retry
+         """
+         if isinstance(exception, self.NETWORK_EXCEPTIONS):
+             return True
+
+         # Special case for HTTP status errors
+         if isinstance(exception, httpx.HTTPStatusError):
+             return exception.response.status_code == 500 or "model_error" in str(exception)
+
+         return False
+
+     def log_retry_attempt(self, retry_state) -> None:
+         """Log retry attempts for visibility.
+
+         :param retry_state: The retry state object from tenacity
+         """
+         exception = retry_state.outcome.exception()
+         if exception:
+             self.logger.warning(
+                 f"Retry attempt {retry_state.attempt_number}/{self.max_attempts}: "
+                 f"{exception.__class__.__name__} - {str(exception)}. "
+                 f"Retrying in {retry_state.next_action.sleep} seconds..."
+             )
+
+     def log_retry_error(self, retry_state) -> Exception:
+         """Log the final error after all retries failed.
+
+         :param retry_state: The retry state object from tenacity
+         :return: The final exception
+         """
+         exception = retry_state.outcome.exception()
+         self.logger.error(
+             f"All retries failed after {retry_state.attempt_number} attempts. "
+             f"Final error: {exception.__class__.__name__}: {str(exception)}"
+         )
+         return exception
+
+     def create_retry_decorator(self, context: str = "") -> Callable:
+         """Create a retry decorator with the configured settings.
+
+         :param context: Optional context string for logging
+         :return: Configured retry decorator
+         """
+         context_prefix = f"[{context}] " if context else ""
+
+         def log_attempt(retry_state):
+             exception = retry_state.outcome.exception()
+             if exception:
+                 self.logger.warning(
+                     f"{context_prefix}Retry attempt {retry_state.attempt_number}/{self.max_attempts}: "
+                     f"{exception.__class__.__name__} - {str(exception)}. "
+                     f"Retrying in {retry_state.next_action.sleep} seconds..."
+                 )
+
+         def log_final_error(retry_state):
+             exception = retry_state.outcome.exception()
+             self.logger.error(
+                 f"{context_prefix}All retries failed after {retry_state.attempt_number} attempts. "
+                 f"Final error: {exception.__class__.__name__}: {str(exception)}"
+             )
+             return exception
+
+         return retry(
+             retry=retry_if_exception(self.should_retry_exception),
+             stop=stop_after_attempt(self.max_attempts),
+             wait=wait_exponential(
+                 multiplier=self.multiplier,
+                 min=self.min_wait,
+                 max=self.max_wait,
+             ),
+             before_sleep=log_attempt,
+             retry_error_callback=log_final_error,
+         )
+
+     def get_retry_config(self) -> Dict[str, Any]:
+         """Get retry configuration dictionary for backward compatibility.
+
+         :return: Dictionary containing retry configuration
+         """
+         return {
+             "network_retry": {
+                 "retry": retry_if_exception(self.should_retry_exception),
+                 "stop": stop_after_attempt(self.max_attempts),
+                 "wait": wait_exponential(
+                     multiplier=self.multiplier,
+                     min=self.min_wait,
+                     max=self.max_wait,
+                 ),
+                 "retry_error_callback": self.log_retry_error,
+                 "before_sleep": self.log_retry_attempt,
+             }
+         }
+
+
+ def create_standard_retry_manager(logger: Optional[logging.Logger] = None) -> RetryManager:
+     """Create a standard retry manager with default settings.
+
+     :param logger: Optional logger instance
+     :return: Configured RetryManager instance
+     """
+     return RetryManager(logger=logger)
+
+
+ # Convenience function for creating retry decorators
+ def create_retry_decorator(
+     logger: Optional[logging.Logger] = None,
+     context: str = "",
+     max_attempts: int = RetryManager.DEFAULT_MAX_ATTEMPTS,
+     min_wait: int = RetryManager.DEFAULT_MIN_WAIT,
+     max_wait: int = RetryManager.DEFAULT_MAX_WAIT,
+ ) -> Callable:
+     """Create a retry decorator with specified parameters.
+
+     :param logger: Optional logger instance
+     :param context: Optional context for logging
+     :param max_attempts: Maximum retry attempts
+     :param min_wait: Minimum wait time between retries
+     :param max_wait: Maximum wait time between retries
+     :return: Configured retry decorator
+     """
+     retry_manager = RetryManager(
+         logger=logger,
+         max_attempts=max_attempts,
+         min_wait=min_wait,
+         max_wait=max_wait,
+     )
+     return retry_manager.create_retry_decorator(context)
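A note on `should_retry_exception` in the hunk above: `httpx.HTTPStatusError` is itself listed in `NETWORK_EXCEPTIONS` (and it subclasses `httpx.HTTPError`, which is also listed), so the first `isinstance` check returns `True` for every status error and the status-code branch below it is never reached. The sketch below illustrates that shadowing and one possible reordering, assuming the intent was to retry status errors only on HTTP 500 or a `model_error` message. `FakeStatusError`, `TRANSIENT`, `should_retry_shadowed`, and `should_retry_ordered` are hypothetical stand-ins so the example runs without httpx; none of them are part of the package.

```python
from typing import Tuple, Type


class FakeStatusError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError (illustration only)."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


# Hypothetical subset of transport-level failures treated as always retryable.
TRANSIENT: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError)


def should_retry_shadowed(exc: BaseException) -> bool:
    """Mirrors the ordering in the hunk: the broad tuple is checked first."""
    if isinstance(exc, TRANSIENT + (FakeStatusError,)):
        return True
    # Unreachable for FakeStatusError: the check above already matched it.
    if isinstance(exc, FakeStatusError):
        return exc.status_code == 500 or "model_error" in str(exc)
    return False


def should_retry_ordered(exc: BaseException) -> bool:
    """Checks the status error first so its condition is actually applied."""
    if isinstance(exc, FakeStatusError):
        return exc.status_code == 500 or "model_error" in str(exc)
    return isinstance(exc, TRANSIENT)


print(should_retry_shadowed(FakeStatusError(404)))  # True
print(should_retry_ordered(FakeStatusError(404)))   # False
```

Whether a 404 should be retried is a design decision for the package maintainers; the sketch only shows that the ordering of the two checks determines which rule wins.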
@@ -88,12 +88,15 @@ def get_converter_for_strategy(


  def get_chat_target(
- target: Union[PromptChatTarget, Callable, AzureOpenAIModelConfiguration, OpenAIModelConfiguration]
+ target: Union[PromptChatTarget, Callable, AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
+ prompt_to_context: Optional[Dict[str, str]] = None,
  ) -> PromptChatTarget:
  """Convert various target types to a PromptChatTarget.

  :param target: The target to convert
  :type target: Union[PromptChatTarget, Callable, AzureOpenAIModelConfiguration, OpenAIModelConfiguration]
+ :param prompt_to_context: Optional mapping from prompt content to context
+ :type prompt_to_context: Optional[Dict[str, str]]
  :return: A PromptChatTarget instance
  :rtype: PromptChatTarget
  """
@@ -151,7 +154,7 @@ def get_chat_target(
  has_callback_signature = False

  if has_callback_signature:
- chat_target = _CallbackChatTarget(callback=target)
+ chat_target = _CallbackChatTarget(callback=target, prompt_to_context=prompt_to_context)
  else:

  async def callback_target(
@@ -163,8 +166,18 @@ def get_chat_target(
  messages_list = [_message_to_dict(chat_message) for chat_message in messages] # type: ignore
  latest_message = messages_list[-1]
  application_input = latest_message["content"]
+
+ # Check if target accepts context as a parameter
+ sig = inspect.signature(target)
+ param_names = list(sig.parameters.keys())
+
  try:
- response = target(query=application_input)
+ if "context" in param_names:
+ # Pass context if the target function accepts it
+ response = target(query=application_input, context=context)
+ else:
+ # Fallback to original behavior for compatibility
+ response = target(query=application_input)
  except Exception as e:
  response = f"Something went wrong {e!s}"

@@ -177,7 +190,7 @@ def get_chat_target(
  messages_list.append(formatted_response) # type: ignore
  return {"messages": messages_list, "stream": stream, "session_state": session_state, "context": {}}

- chat_target = _CallbackChatTarget(callback=callback_target) # type: ignore
+ chat_target = _CallbackChatTarget(callback=callback_target, prompt_to_context=prompt_to_context) # type: ignore

  return chat_target
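The signature check added in the hunk above can be sketched in isolation as follows. This is a minimal illustration, not the package's code: `call_target`, `legacy_target`, and `context_aware_target` are hypothetical names, and the `context` value is assumed to come from the enclosing callback's arguments, which the hunk does not show.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def call_target(
    target: Callable[..., Any],
    query: str,
    context: Optional[Dict[str, Any]] = None,
) -> Any:
    """Pass `context` only when the target declares a parameter with that name."""
    param_names = list(inspect.signature(target).parameters.keys())
    if "context" in param_names:
        return target(query=query, context=context)
    # Fall back to the query-only call for targets written before `context` existed.
    return target(query=query)


def legacy_target(query: str) -> str:
    """Hypothetical target that only understands `query`."""
    return f"echo: {query}"


def context_aware_target(query: str, context: Optional[Dict[str, Any]] = None) -> str:
    """Hypothetical target that also accepts `context`."""
    return f"echo: {query} (context keys: {sorted(context or {})})"


print(call_target(legacy_target, "hi", {"doc": "x"}))         # echo: hi
print(call_target(context_aware_target, "hi", {"doc": "x"}))  # echo: hi (context keys: ['doc'])
```

One consequence of matching on the parameter name: a target declared as `def f(query, **kwargs)` has no parameter literally named `context`, so it takes the query-only path.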
@@ -8,6 +8,7 @@ import logging
  import random
  from typing import Any, Callable, Dict, List, Optional, Union, cast
  import uuid
+ import warnings

  from tqdm import tqdm

@@ -68,6 +69,14 @@ class AdversarialSimulator:

  def __init__(self, *, azure_ai_project: Union[str, AzureAIProject], credential: TokenCredential):
  """Constructor."""
+ warnings.warn(
+ "DEPRECATION NOTE: Azure AI Evaluation SDK has discontinued active development on the AdversarialSimulator class."
+ + " While existing functionality remains available in preview, it is no longer recommended for production workloads or future integration. "
+ + "We recommend users migrate to the AI Red Teaming Agent for future use as it supports full parity of functionality."
+ + " See https://aka.ms/airedteamingagent-sample for details on AI Red Teaming Agent.",
+ DeprecationWarning,
+ stacklevel=2,
+ )

  if is_onedp_project(azure_ai_project):
  self.azure_ai_project = azure_ai_project
@@ -239,8 +248,11 @@ class AdversarialSimulator:
  # So randomize a the selection instead of the parameter list directly,
  # or a potentially large deep copy.
  if randomization_seed is not None:
- random.seed(randomization_seed)
- random.shuffle(templates)
+ # Create a local random instance to avoid polluting global state
+ local_random = random.Random(randomization_seed)
+ local_random.shuffle(templates)
+ else:
+ random.shuffle(templates)

  # Prepare task parameters based on scenario - but use a single append call for all scenarios
  tasks = []
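The last hunk above replaces `random.seed(...)` on the module-level generator with a private `random.Random` instance. A minimal sketch of why that matters is shown below; `shuffled_copy` is a hypothetical helper, and unlike the hunk (which shuffles `templates` in place) it works on a copy so the caller's list is left alone.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")


def shuffled_copy(items: List[T], seed: Optional[int] = None) -> List[T]:
    """Return a shuffled copy; a seed gives a reproducible order."""
    result = list(items)
    if seed is not None:
        # Private generator: seeding it does not touch the module-level RNG.
        random.Random(seed).shuffle(result)
    else:
        random.shuffle(result)
    return result


# Same seed, same order.
first = shuffled_copy(list(range(10)), seed=42)
second = shuffled_copy(list(range(10)), seed=42)
print(first == second)  # True

# The global generator's sequence is unaffected by the seeded call.
random.seed(0)
expected = random.random()
random.seed(0)
shuffled_copy(list(range(10)), seed=42)
print(random.random() == expected)  # True
```

With the removed `random.seed(randomization_seed)` call, any other code in the process that relied on the global generator would have been reseeded as a side effect.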
@@ -5,7 +5,8 @@
  # noqa: E501
  import asyncio
  import logging
- from typing import Callable, cast, Union
+ import random
+ from typing import Callable, cast, Union, Optional

  from tqdm import tqdm

@@ -105,6 +106,7 @@ class IndirectAttackSimulator(AdversarialSimulator):
  api_call_retry_sleep_sec: int = 1,
  api_call_delay_sec: int = 0,
  concurrent_async_task: int = 3,
+ randomization_seed: Optional[int] = None,
  **kwargs,
  ):
  """
@@ -130,6 +132,9 @@ class IndirectAttackSimulator(AdversarialSimulator):
  :keyword concurrent_async_task: The number of asynchronous tasks to run concurrently during the simulation.
  Defaults to 3.
  :paramtype concurrent_async_task: int
+ :keyword randomization_seed: The seed used to randomize prompt selection. If unset, the system's
+ default seed is used. Defaults to None.
+ :paramtype randomization_seed: Optional[int]
  :return: A list of dictionaries, each representing a simulated conversation. Each dictionary contains:

  - 'template_parameters': A dictionary with parameters used in the conversation template,
@@ -190,6 +195,13 @@ class IndirectAttackSimulator(AdversarialSimulator):
  ncols=100,
  unit="simulations",
  )
+
+ # Apply randomization to templates if seed is provided
+ if randomization_seed is not None:
+ # Create a local random instance to avoid polluting global state
+ local_random = random.Random(randomization_seed)
+ local_random.shuffle(templates)
+
  for template in templates:
  for parameter in template.template_parameters:
  tasks.append(
@@ -30,7 +30,11 @@ class GeneratedRAIClient:
  :type token_manager: ~azure.ai.evaluation.simulator._model_tools._identity_manager.APITokenManager
  """

- def __init__(self, azure_ai_project: Union[AzureAIProject, str], token_manager: ManagedIdentityAPITokenManager):
+ def __init__(
+ self,
+ azure_ai_project: Union[AzureAIProject, str],
+ token_manager: ManagedIdentityAPITokenManager,
+ ):
  self.azure_ai_project = azure_ai_project
  self.token_manager = token_manager

@@ -53,10 +57,14 @@ class GeneratedRAIClient:
  ).rai_svc
  else:
  self._client = AIProjectClient(
- endpoint=azure_ai_project, credential=token_manager, user_agent_policy=user_agent_policy
+ endpoint=azure_ai_project,
+ credential=token_manager,
+ user_agent_policy=user_agent_policy,
  ).red_teams
  self._evaluation_onedp_client = EvaluationServiceOneDPClient(
- endpoint=azure_ai_project, credential=token_manager, user_agent_policy=user_agent_policy
+ endpoint=azure_ai_project,
+ credential=token_manager,
+ user_agent_policy=user_agent_policy,
  )

  def _get_service_discovery_url(self):
@@ -68,7 +76,10 @@ class GeneratedRAIClient:
  import requests

  bearer_token = self._fetch_or_reuse_token(self.token_manager)
- headers = {"Authorization": f"Bearer {bearer_token}", "Content-Type": "application/json"}
+ headers = {
+ "Authorization": f"Bearer {bearer_token}",
+ "Content-Type": "application/json",
+ }

  response = requests.get(
  f"https://management.azure.com/subscriptions/{self.azure_ai_project['subscription_id']}/"
@@ -100,6 +111,7 @@ class GeneratedRAIClient:
  risk_category: Optional[str] = None,
  application_scenario: str = None,
  strategy: Optional[str] = None,
+ language: str = "en",
  scan_session_id: Optional[str] = None,
  ) -> Dict:
  """Get attack objectives using the auto-generated operations.
@@ -112,6 +124,8 @@ class GeneratedRAIClient:
  :type application_scenario: str
  :param strategy: Optional strategy to filter the attack objectives
  :type strategy: Optional[str]
+ :param language: Language code for the attack objectives (e.g., "en", "es", "fr")
+ :type language: str
  :param scan_session_id: Optional unique session ID for the scan
  :type scan_session_id: Optional[str]
  :return: The attack objectives
@@ -122,9 +136,9 @@ class GeneratedRAIClient:
  response = self._client.get_attack_objectives(
  risk_types=[risk_type],
  risk_category=risk_category,
- lang="en",
+ lang=language,
  strategy=strategy,
- headers={"client_request_id": scan_session_id},
+ headers={"x-ms-client-request-id": scan_session_id},
  )
  return response

@@ -146,7 +160,7 @@ class GeneratedRAIClient:
  try:
  # Send the request using the autogenerated client
  response = self._client.get_jail_break_dataset_with_type(
- type="upia", headers={"client_request_id": scan_session_id}
+ type="upia", headers={"x-ms-client-request-id": scan_session_id}
  )
  if isinstance(response, list):
  return response
@@ -10,7 +10,7 @@ from typing import Any, Dict, List, Optional, cast, Union

  from azure.ai.evaluation._http_utils import AsyncHttpPipeline, get_async_http_client
  from azure.ai.evaluation._user_agent import UserAgentSingleton
- from azure.core.exceptions import HttpResponseError
+ from azure.core.exceptions import HttpResponseError, ServiceResponseError
  from azure.core.pipeline.policies import AsyncRetryPolicy, RetryMode
  from azure.ai.evaluation._common.onedp._client import AIProjectClient
  from azure.ai.evaluation._common.onedp.models import SimulationDTO
@@ -208,7 +208,7 @@ class ProxyChatCompletionsModel(OpenAIChatCompletionsModel):
  flag = True
  while flag:
  try:
- response = session.evaluations.operation_results(operation_id, headers=headers)
+ response = session.red_teams.operation_results(operation_id, headers=headers)
  except Exception as e:
  from types import SimpleNamespace # pylint: disable=forgotten-debug-statement

@@ -217,15 +217,34 @@ class ProxyChatCompletionsModel(OpenAIChatCompletionsModel):
  response_data = response
  flag = False
  break
- if response.status_code == 200:
- response_data = cast(List[Dict], response.json())
+ if not isinstance(response, SimpleNamespace) and response.get("object") == "chat.completion":
+ response_data = response
  flag = False
+ break
  else:
  request_count += 1
  sleep_time = RAIService.SLEEP_TIME**request_count
  await asyncio.sleep(sleep_time)
  else:
- response = await session.post(url=self.endpoint_url, headers=proxy_headers, json=sim_request_dto.to_dict())
+ # Retry policy for POST request to RAI service
+ service_call_retry_policy = AsyncRetryPolicy(
+ retry_on_exceptions=[ServiceResponseError],
+ retry_total=7,
+ retry_backoff_factor=10.0,
+ retry_backoff_max=180,
+ retry_mode=RetryMode.Exponential,
+ )
+
+ response = None
+ async with get_async_http_client().with_policies(retry_policy=service_call_retry_policy) as retry_client:
+ try:
+ response = await retry_client.post(
+ url=self.endpoint_url, headers=proxy_headers, json=sim_request_dto.to_dict()
+ )
+ except ServiceResponseError as e:
+ self.logger.error("ServiceResponseError during POST request to rai svc after retries: %s", str(e))
+ raise
+
  # response.raise_for_status()
  if response.status_code != 202:
  raise HttpResponseError(
@@ -7,6 +7,7 @@ import asyncio
  import importlib.resources as pkg_resources
  import json
  import os
+ import random
  import re
  import warnings
  from typing import Any, Callable, Dict, List, Optional, Union, Tuple
@@ -104,6 +105,7 @@ class Simulator:
  user_simulator_prompty_options: Dict[str, Any] = {},
  conversation_turns: List[List[Union[str, Dict[str, Any]]]] = [],
  concurrent_async_tasks: int = 5,
+ randomization_seed: Optional[int] = None,
  **kwargs,
  ) -> List[JsonLineChatProtocol]:
  """
@@ -134,6 +136,9 @@ class Simulator:
  :keyword concurrent_async_tasks: The number of asynchronous tasks to run concurrently during the simulation.
  Defaults to 5.
  :paramtype concurrent_async_tasks: int
+ :keyword randomization_seed: The seed used to randomize task/query order. If unset, the system's
+ default seed is used. Defaults to None.
+ :paramtype randomization_seed: Optional[int]
  :return: A list of simulated conversations represented as JsonLineChatProtocol objects.
  :rtype: List[JsonLineChatProtocol]

@@ -159,6 +164,13 @@ class Simulator:
  f"Only the first {num_queries} lines of the specified tasks will be simulated."
  )

+ # Apply randomization to tasks if seed is provided
+ if randomization_seed is not None and tasks:
+ # Create a local random instance to avoid polluting global state
+ local_random = random.Random(randomization_seed)
+ tasks = tasks.copy() # Don't modify the original list
+ local_random.shuffle(tasks)
+
  max_conversation_turns *= 2 # account for both user and assistant turns

  prompty_model_config = self.model_config
@@ -1,6 +1,6 @@
- Metadata-Version: 2.1
+ Metadata-Version: 2.4
  Name: azure-ai-evaluation
- Version: 1.9.0
+ Version: 1.11.0
  Summary: Microsoft Azure Evaluation Library for Python
  Home-page: https://github.com/Azure/azure-sdk-for-python
  Author: Microsoft Corporation
@@ -21,8 +21,6 @@ Classifier: Operating System :: OS Independent
  Requires-Python: >=3.9
  Description-Content-Type: text/markdown
  License-File: NOTICE.txt
- Requires-Dist: promptflow-devkit>=1.17.1
- Requires-Dist: promptflow-core>=1.17.1
  Requires-Dist: pyjwt>=2.8.0
  Requires-Dist: azure-identity>=1.16.0
  Requires-Dist: azure-core>=1.30.2
@@ -37,6 +35,20 @@ Requires-Dist: Jinja2>=3.1.6
  Requires-Dist: aiohttp>=3.0
  Provides-Extra: redteam
  Requires-Dist: pyrit==0.8.1; extra == "redteam"
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: keywords
+ Dynamic: license
+ Dynamic: license-file
+ Dynamic: project-url
+ Dynamic: provides-extra
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary

  # Azure AI Evaluation client library for Python

@@ -400,6 +412,50 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con

  # Release History

+ ## 1.11.0 (2025-09-02)
+
+ ### Features Added
+ - Added support for user-supplied tags in the `evaluate` function. Tags are key-value pairs that can be used for experiment tracking, A/B testing, filtering, and organizing evaluation runs. The function accepts a `tags` parameter.
+ - Added support for user-supplied TokenCredentials with LLM based evaluators.
+ - Enhanced `GroundednessEvaluator` to support AI agent evaluation with tool calls. The evaluator now accepts agent response data containing tool calls and can extract context from `file_search` tool results for groundedness assessment. This enables evaluation of AI agents that use tools to retrieve information and generate responses. Note: Agent groundedness evaluation is currently supported only when the `file_search` tool is used.
+ - Added `language` parameter to `RedTeam` class for multilingual red team scanning support. The parameter accepts values from `SupportedLanguages` enum including English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, and Simplified Chinese, enabling red team attacks to be generated and conducted in multiple languages.
+ - Added support for IndirectAttack and UngroundedAttributes risk categories in `RedTeam` scanning. These new risk categories expand red team capabilities to detect cross-platform indirect attacks and evaluate ungrounded inferences about human attributes including emotional state and protected class information.
+
+ ### Bugs Fixed
+ - Fixed issue where evaluation results were not properly aligned with input data, leading to incorrect metrics being reported.
+
+ ### Other Changes
+ - Deprecating `AdversarialSimulator` in favor of the [AI Red Teaming Agent](https://aka.ms/airedteamingagent-sample). `AdversarialSimulator` will be removed in the next minor release.
+ - Moved retry configuration constants (`MAX_RETRY_ATTEMPTS`, `MAX_RETRY_WAIT_SECONDS`, `MIN_RETRY_WAIT_SECONDS`) from `RedTeam` class to new `RetryManager` class for better code organization and configurability.
+
+ ## 1.10.0 (2025-07-31)
+
+ ### Breaking Changes
+
+ - Added `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both query and response will be evaluated; when set to `False` (default), only the response will be evaluated. This parameter is available across all RAI service evaluators including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`. Existing code that relies on queries being evaluated will need to explicitly set `evaluate_query=True` to maintain the previous behavior.
+
+ ### Features Added
+
+ - Added support for Azure OpenAI Python grader via `AzureOpenAIPythonGrader` class, which serves as a wrapper around Azure Open AI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
+ - Added `attack_success_thresholds` parameter to `RedTeam` class for configuring custom thresholds that determine attack success. This allows users to set specific threshold values for each risk category, with scores greater than the threshold considered successful attacks (i.e. higher threshold means higher
+ tolerance for harmful responses).
+ - Enhanced threshold reporting in RedTeam results to include default threshold values when custom thresholds aren't specified, providing better transparency about the evaluation criteria used.
+
+
+ ### Bugs Fixed
+
+ - Fixed red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files while the user's `output_path` is reserved for final aggregated results.
+ - Significant improvements to TaskAdherence evaluator. New version has less variance, is much faster and consumes fewer tokens.
+ - Significant improvements to Relevance evaluator. New version has more concrete rubrics and has less variance, is much faster and consumes fewer tokens.
+
+
+ ### Other Changes
+
+ - The default engine for evaluation was changed from `promptflow` (PFClient) to an in-SDK batch client (RunSubmitterClient)
+ - Note: We've temporarily kept an escape hatch to fall back to the legacy `promptflow` implementation by setting `_use_pf_client=True` when invoking `evaluate()`.
+ This is due to be removed in a future release.
+
+
  ## 1.9.0 (2025-07-02)

  ### Features Added
@@ -411,8 +467,11 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
  ### Bugs Fixed

  - Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.
+
+ - Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance. and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently without having context on the other tool calls that happen in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].
  - Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
  - Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
+ - `AzureOpenAIScoreModelGrader` evaluator now supports `pass_threshold` parameter to set the minimum score required for a response to be considered passing. This allows users to define custom thresholds for evaluation results, enhancing flexibility in grading AI model responses.

  ## 1.8.0 (2025-05-29)