PyPI - azure-ai-evaluation - Versions diffs - 1.9.0__py3-none-any.whl → 1.11.0__py3-none-any.whl - Mend

azure-ai-evaluation 1.9.0__py3-none-any.whl → 1.11.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of azure-ai-evaluation might be problematic.

Files changed (85)

azure/ai/evaluation/red_team/_utils/retry_utils.py ADDED

@@ -0,0 +1,218 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+"""
+Retry utilities for Red Team Agent.
+This module provides centralized retry logic and decorators for handling
+network errors and other transient failures consistently across the codebase.
+"""
+import asyncio
+import logging
+from typing import Any, Callable, Dict, List, Optional, TypeVar
+from tenacity import (
+    retry,
+    stop_after_attempt,
+    wait_exponential,
+    retry_if_exception,
+    RetryError,
+)
+# Retry imports for exception handling
+import httpx
+import httpcore
+# Import Azure exceptions if available
+try:
+    from azure.core.exceptions import ServiceRequestError, ServiceResponseError
+    AZURE_EXCEPTIONS = (ServiceRequestError, ServiceResponseError)
+except ImportError:
+    AZURE_EXCEPTIONS = ()
+# Type variable for generic retry decorators
+T = TypeVar("T")
+class RetryManager:
+    """Centralized retry management for Red Team operations."""
+    # Default retry configuration
+    DEFAULT_MAX_ATTEMPTS = 5
+    DEFAULT_MIN_WAIT = 2
+    DEFAULT_MAX_WAIT = 30
+    DEFAULT_MULTIPLIER = 1.5
+    # Network-related exceptions that should trigger retries
+    NETWORK_EXCEPTIONS = (
+        httpx.ConnectTimeout,
+        httpx.ReadTimeout,
+        httpx.ConnectError,
+        httpx.HTTPError,
+        httpx.TimeoutException,
+        httpx.HTTPStatusError,
+        httpcore.ReadTimeout,
+        ConnectionError,
+        ConnectionRefusedError,
+        ConnectionResetError,
+        TimeoutError,
+        OSError,
+        IOError,
+        asyncio.TimeoutError,
+    ) + AZURE_EXCEPTIONS
+    def __init__(
+        self,
+        logger: Optional[logging.Logger] = None,
+        max_attempts: int = DEFAULT_MAX_ATTEMPTS,
+        min_wait: int = DEFAULT_MIN_WAIT,
+        max_wait: int = DEFAULT_MAX_WAIT,
+        multiplier: float = DEFAULT_MULTIPLIER,
+    ):
+        """Initialize retry manager.
+        :param logger: Logger instance for retry messages
+        :param max_attempts: Maximum number of retry attempts
+        :param min_wait: Minimum wait time between retries (seconds)
+        :param max_wait: Maximum wait time between retries (seconds)
+        :param multiplier: Exponential backoff multiplier
+        """
+        self.logger = logger or logging.getLogger(__name__)
+        self.max_attempts = max_attempts
+        self.min_wait = min_wait
+        self.max_wait = max_wait
+        self.multiplier = multiplier
+    def should_retry_exception(self, exception: Exception) -> bool:
+        """Determine if an exception should trigger a retry.
+        :param exception: The exception to check
+        :return: True if the exception should trigger a retry
+        """
+        if isinstance(exception, self.NETWORK_EXCEPTIONS):
+            return True
+        # Special case for HTTP status errors
+        if isinstance(exception, httpx.HTTPStatusError):
+            return exception.response.status_code == 500 or "model_error" in str(exception)
+        return False
+    def log_retry_attempt(self, retry_state) -> None:
+        """Log retry attempts for visibility.
+        :param retry_state: The retry state object from tenacity
+        """
+        exception = retry_state.outcome.exception()
+        if exception:
+            self.logger.warning(
+                f"Retry attempt {retry_state.attempt_number}/{self.max_attempts}: "
+                f"{exception.__class__.__name__} - {str(exception)}. "
+                f"Retrying in {retry_state.next_action.sleep} seconds..."
+            )
+    def log_retry_error(self, retry_state) -> Exception:
+        """Log the final error after all retries failed.
+        :param retry_state: The retry state object from tenacity
+        :return: The final exception
+        """
+        exception = retry_state.outcome.exception()
+        self.logger.error(
+            f"All retries failed after {retry_state.attempt_number} attempts. "
+            f"Final error: {exception.__class__.__name__}: {str(exception)}"
+        )
+        return exception
+    def create_retry_decorator(self, context: str = "") -> Callable:
+        """Create a retry decorator with the configured settings.
+        :param context: Optional context string for logging
+        :return: Configured retry decorator
+        """
+        context_prefix = f"[{context}] " if context else ""
+        def log_attempt(retry_state):
+            exception = retry_state.outcome.exception()
+            if exception:
+                self.logger.warning(
+                    f"{context_prefix}Retry attempt {retry_state.attempt_number}/{self.max_attempts}: "
+                    f"{exception.__class__.__name__} - {str(exception)}. "
+                    f"Retrying in {retry_state.next_action.sleep} seconds..."
+                )
+        def log_final_error(retry_state):
+            exception = retry_state.outcome.exception()
+            self.logger.error(
+                f"{context_prefix}All retries failed after {retry_state.attempt_number} attempts. "
+                f"Final error: {exception.__class__.__name__}: {str(exception)}"
+            )
+            return exception
+        return retry(
+            retry=retry_if_exception(self.should_retry_exception),
+            stop=stop_after_attempt(self.max_attempts),
+            wait=wait_exponential(
+                multiplier=self.multiplier,
+                min=self.min_wait,
+                max=self.max_wait,
+            ),
+            before_sleep=log_attempt,
+            retry_error_callback=log_final_error,
+        )
+    def get_retry_config(self) -> Dict[str, Any]:
+        """Get retry configuration dictionary for backward compatibility.
+        :return: Dictionary containing retry configuration
+        """
+        return {
+            "network_retry": {
+                "retry": retry_if_exception(self.should_retry_exception),
+                "stop": stop_after_attempt(self.max_attempts),
+                "wait": wait_exponential(
+                    multiplier=self.multiplier,
+                    min=self.min_wait,
+                    max=self.max_wait,
+                ),
+                "retry_error_callback": self.log_retry_error,
+                "before_sleep": self.log_retry_attempt,
+            }
+        }
+def create_standard_retry_manager(logger: Optional[logging.Logger] = None) -> RetryManager:
+    """Create a standard retry manager with default settings.
+    :param logger: Optional logger instance
+    :return: Configured RetryManager instance
+    """
+    return RetryManager(logger=logger)
+# Convenience function for creating retry decorators
+def create_retry_decorator(
+    logger: Optional[logging.Logger] = None,
+    context: str = "",
+    max_attempts: int = RetryManager.DEFAULT_MAX_ATTEMPTS,
+    min_wait: int = RetryManager.DEFAULT_MIN_WAIT,
+    max_wait: int = RetryManager.DEFAULT_MAX_WAIT,
+) -> Callable:
+    """Create a retry decorator with specified parameters.
+    :param logger: Optional logger instance
+    :param context: Optional context for logging
+    :param max_attempts: Maximum retry attempts
+    :param min_wait: Minimum wait time between retries
+    :param max_wait: Maximum wait time between retries
+    :return: Configured retry decorator
+    """
+    retry_manager = RetryManager(
+        logger=logger,
+        max_attempts=max_attempts,
+        min_wait=min_wait,
+        max_wait=max_wait,
+    )
+    return retry_manager.create_retry_decorator(context)
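The new `RetryManager` combines tenacity's `retry_if_exception`, `stop_after_attempt` and `wait_exponential`. Its defaults are 5 attempts, a 2 s floor, a 30 s cap and a 1.5 multiplier. Below is a stdlib-only sketch of the same control flow. It assumes tenacity's documented `wait_exponential` formula: `multiplier * 2 ** (attempt_number - 1)`, clamped to `[min, max]`. The names `backoff_delay` and `call_with_retries` are hypothetical, not part of the package.

There is one deliberate difference. Because the decorator in the diff passes `retry_error_callback`, tenacity returns that callback's value (here the logged exception object) once attempts run out, instead of raising. This sketch re-raises the last error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, multiplier: float = 1.5, min_wait: float = 2, max_wait: float = 30) -> float:
    """Delay after a failed 1-based attempt: multiplier * 2**(attempt - 1), clamped to [min_wait, max_wait]."""
    return max(min_wait, min(multiplier * 2 ** (attempt - 1), max_wait))


def call_with_retries(
    fn: Callable[[], T],
    should_retry: Callable[[Exception], bool],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying exceptions accepted by should_retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Non-retryable errors, and the error from the final attempt, propagate unchanged
            if not should_retry(exc) or attempt == max_attempts:
                raise
            sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Consider a call that fails four times with a retryable error under the defaults. It sleeps 2, 3, 6 and 12 seconds (23 s in total) before the fifth attempt. A `ValueError` rejected by the predicate propagates immediately, without sleeping.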

azure/ai/evaluation/red_team/_utils/strategy_utils.py CHANGED

@@ -88,12 +88,15 @@ def get_converter_for_strategy(
 def get_chat_target(
-    target: Union[PromptChatTarget, Callable, AzureOpenAIModelConfiguration, OpenAIModelConfiguration]
+    target: Union[PromptChatTarget, Callable, AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
+    prompt_to_context: Optional[Dict[str, str]] = None,
 ) -> PromptChatTarget:
     """Convert various target types to a PromptChatTarget.
     :param target: The target to convert
     :type target: Union[PromptChatTarget, Callable, AzureOpenAIModelConfiguration, OpenAIModelConfiguration]
+    :param prompt_to_context: Optional mapping from prompt content to context
+    :type prompt_to_context: Optional[Dict[str, str]]
     :return: A PromptChatTarget instance
     :rtype: PromptChatTarget
     """
@@ -151,7 +154,7 @@ def get_chat_target(
             has_callback_signature = False
         if has_callback_signature:
-            chat_target = _CallbackChatTarget(callback=target)
+            chat_target = _CallbackChatTarget(callback=target, prompt_to_context=prompt_to_context)
         else:
             async def callback_target(
@@ -163,8 +166,18 @@ def get_chat_target(
                 messages_list = [_message_to_dict(chat_message) for chat_message in messages]  # type: ignore
                 latest_message = messages_list[-1]
                 application_input = latest_message["content"]
+                # Check if target accepts context as a parameter
+                sig = inspect.signature(target)
+                param_names = list(sig.parameters.keys())
                 try:
-                    response = target(query=application_input)
+                    if "context" in param_names:
+                        # Pass context if the target function accepts it
+                        response = target(query=application_input, context=context)
+                    else:
+                        # Fallback to original behavior for compatibility
+                        response = target(query=application_input)
                 except Exception as e:
                     response = f"Something went wrong {e!s}"
@@ -177,7 +190,7 @@ def get_chat_target(
                 messages_list.append(formatted_response)  # type: ignore
                 return {"messages": messages_list, "stream": stream, "session_state": session_state, "context": {}}
-            chat_target = _CallbackChatTarget(callback=callback_target)  # type: ignore
+            chat_target = _CallbackChatTarget(callback=callback_target, prompt_to_context=prompt_to_context)  # type: ignore
     return chat_target
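The callback path in `get_chat_target` now decides at call time whether to forward `context`, based on the target's declared parameter names. Below is a minimal stdlib sketch of that dispatch. `call_target` is a hypothetical name. The hunk does not show where `context` is bound, so here it is passed in explicitly.

```python
import inspect
from typing import Any, Callable, Optional


def call_target(target: Callable[..., Any], query: str, context: Optional[str] = None) -> Any:
    """Invoke target with query, adding context only if target declares a 'context' parameter."""
    param_names = list(inspect.signature(target).parameters.keys())
    if "context" in param_names:
        return target(query=query, context=context)
    # Older single-argument targets keep working unchanged
    return target(query=query)
```

A target declared as `def app(query, **kwargs)` is not detected, because its parameter names are `query` and `kwargs`. The check in the diff behaves the same way, since it also inspects names only.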

azure/ai/evaluation/simulator/_adversarial_simulator.py CHANGED

@@ -8,6 +8,7 @@ import logging
 import random
 from typing import Any, Callable, Dict, List, Optional, Union, cast
 import uuid
+import warnings
 from tqdm import tqdm
@@ -68,6 +69,14 @@ class AdversarialSimulator:
     def __init__(self, *, azure_ai_project: Union[str, AzureAIProject], credential: TokenCredential):
         """Constructor."""
+        warnings.warn(
+            "DEPRECATION NOTE: Azure AI Evaluation SDK has discontinued active development on the AdversarialSimulator class."
+            + " While existing functionality remains available in preview, it is no longer recommended for production workloads or future integration. "
+            + "We recommend users migrate to the AI Red Teaming Agent for future use as it supports full parity of functionality."
+            + " See https://aka.ms/airedteamingagent-sample for details on AI Red Teaming Agent.",
+            DeprecationWarning,
+            stacklevel=2,
+        )
         if is_onedp_project(azure_ai_project):
             self.azure_ai_project = azure_ai_project
@@ -239,8 +248,11 @@ class AdversarialSimulator:
             # So randomize a the selection instead of the parameter list directly,
             # or a potentially large deep copy.
             if randomization_seed is not None:
-                random.seed(randomization_seed)
-            random.shuffle(templates)
+                # Create a local random instance to avoid polluting global state
+                local_random = random.Random(randomization_seed)
+                local_random.shuffle(templates)
+            else:
+                random.shuffle(templates)
         # Prepare task parameters based on scenario - but use a single append call for all scenarios
         tasks = []
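The `AdversarialSimulator` constructor now emits a `DeprecationWarning` with `stacklevel=2`, which attributes the warning to the line that instantiated the class. Python's default filters (PEP 565) display a `DeprecationWarning` when it is attributed to code in `__main__`. A script that constructs the class directly therefore sees the warning, while calls from inside other libraries are hidden by default. The sketch below shows how the warning can be observed and silenced. `LegacySimulator` is a hypothetical stand-in, not the real class.

```python
import warnings


class LegacySimulator:
    """Hypothetical stand-in for a class being deprecated."""

    def __init__(self) -> None:
        warnings.warn(
            "DEPRECATION NOTE: LegacySimulator is no longer developed; migrate to its replacement.",
            DeprecationWarning,
            stacklevel=2,  # report the caller's line, not this __init__
        )


def make_simulator() -> LegacySimulator:
    return LegacySimulator()


# Record the warning regardless of the default filters
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    make_simulator()

# Silence it by category plus a message prefix (a regex matched case-insensitively
# against the start of the message); catch_warnings restores the filters afterwards
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning, message="DEPRECATION NOTE")
    make_simulator()
```

With `stacklevel=2`, the recorded warning points at the `return LegacySimulator()` line in `make_simulator`, not at the `warnings.warn` call inside `__init__`.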

azure/ai/evaluation/simulator/_indirect_attack_simulator.py CHANGED

@@ -5,7 +5,8 @@
 # noqa: E501
 import asyncio
 import logging
-from typing import Callable, cast, Union
+import random
+from typing import Callable, cast, Union, Optional
 from tqdm import tqdm
@@ -105,6 +106,7 @@ class IndirectAttackSimulator(AdversarialSimulator):
         api_call_retry_sleep_sec: int = 1,
         api_call_delay_sec: int = 0,
         concurrent_async_task: int = 3,
+        randomization_seed: Optional[int] = None,
         **kwargs,
     ):
         """
@@ -130,6 +132,9 @@ class IndirectAttackSimulator(AdversarialSimulator):
         :keyword concurrent_async_task: The number of asynchronous tasks to run concurrently during the simulation.
             Defaults to 3.
         :paramtype concurrent_async_task: int
+        :keyword randomization_seed: The seed used to randomize prompt selection. If unset, the system's
+            default seed is used. Defaults to None.
+        :paramtype randomization_seed: Optional[int]
         :return: A list of dictionaries, each representing a simulated conversation. Each dictionary contains:
          - 'template_parameters': A dictionary with parameters used in the conversation template,
@@ -190,6 +195,13 @@ class IndirectAttackSimulator(AdversarialSimulator):
             ncols=100,
             unit="simulations",
         )
+        # Apply randomization to templates if seed is provided
+        if randomization_seed is not None:
+            # Create a local random instance to avoid polluting global state
+            local_random = random.Random(randomization_seed)
+            local_random.shuffle(templates)
         for template in templates:
             for parameter in template.template_parameters:
                 tasks.append(
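The previous `AdversarialSimulator` code called `random.seed(seed)` before `random.shuffle(...)`. That reseeds the interpreter-wide generator and changes the random stream seen by any other code in the process. Seeding a private `random.Random` keeps the order reproducible without that side effect. The minimal sketch below mirrors the `IndirectAttackSimulator` branch, which does not shuffle when no seed is given. `seeded_shuffle` is a hypothetical name.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")


def seeded_shuffle(items: List[T], seed: Optional[int]) -> None:
    """Shuffle items in place when a seed is given; leave them untouched otherwise."""
    if seed is not None:
        # A private generator: the same seed gives the same order, and the
        # module-level generator behind random.random()/random.shuffle() is never reseeded
        random.Random(seed).shuffle(items)
```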

azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py CHANGED

@@ -30,7 +30,11 @@ class GeneratedRAIClient:
     :type token_manager: ~azure.ai.evaluation.simulator._model_tools._identity_manager.APITokenManager
     """
-    def __init__(self, azure_ai_project: Union[AzureAIProject, str], token_manager: ManagedIdentityAPITokenManager):
+    def __init__(
+        self,
+        azure_ai_project: Union[AzureAIProject, str],
+        token_manager: ManagedIdentityAPITokenManager,
+    ):
         self.azure_ai_project = azure_ai_project
         self.token_manager = token_manager
@@ -53,10 +57,14 @@ class GeneratedRAIClient:
             ).rai_svc
         else:
             self._client = AIProjectClient(
-                endpoint=azure_ai_project, credential=token_manager, user_agent_policy=user_agent_policy
+                endpoint=azure_ai_project,
+                credential=token_manager,
+                user_agent_policy=user_agent_policy,
             ).red_teams
             self._evaluation_onedp_client = EvaluationServiceOneDPClient(
-                endpoint=azure_ai_project, credential=token_manager, user_agent_policy=user_agent_policy
+                endpoint=azure_ai_project,
+                credential=token_manager,
+                user_agent_policy=user_agent_policy,
             )
     def _get_service_discovery_url(self):
@@ -68,7 +76,10 @@ class GeneratedRAIClient:
         import requests
         bearer_token = self._fetch_or_reuse_token(self.token_manager)
-        headers = {"Authorization": f"Bearer {bearer_token}", "Content-Type": "application/json"}
+        headers = {
+            "Authorization": f"Bearer {bearer_token}",
+            "Content-Type": "application/json",
+        }
         response = requests.get(
             f"https://management.azure.com/subscriptions/{self.azure_ai_project['subscription_id']}/"
@@ -100,6 +111,7 @@ class GeneratedRAIClient:
         risk_category: Optional[str] = None,
         application_scenario: str = None,
         strategy: Optional[str] = None,
+        language: str = "en",
         scan_session_id: Optional[str] = None,
     ) -> Dict:
         """Get attack objectives using the auto-generated operations.
@@ -112,6 +124,8 @@ class GeneratedRAIClient:
         :type application_scenario: str
         :param strategy: Optional strategy to filter the attack objectives
         :type strategy: Optional[str]
+        :param language: Language code for the attack objectives (e.g., "en", "es", "fr")
+        :type language: str
         :param scan_session_id: Optional unique session ID for the scan
         :type scan_session_id: Optional[str]
         :return: The attack objectives
@@ -122,9 +136,9 @@ class GeneratedRAIClient:
             response = self._client.get_attack_objectives(
                 risk_types=[risk_type],
                 risk_category=risk_category,
-                lang="en",
+                lang=language,
                 strategy=strategy,
-                headers={"client_request_id": scan_session_id},
+                headers={"x-ms-client-request-id": scan_session_id},
             )
             return response
@@ -146,7 +160,7 @@ class GeneratedRAIClient:
         try:
             # Send the request using the autogenerated client
             response = self._client.get_jail_break_dataset_with_type(
-                type="upia", headers={"client_request_id": scan_session_id}
+                type="upia", headers={"x-ms-client-request-id": scan_session_id}
             )
             if isinstance(response, list):
                 return response

azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py CHANGED

@@ -10,7 +10,7 @@ from typing import Any, Dict, List, Optional, cast, Union
 from azure.ai.evaluation._http_utils import AsyncHttpPipeline, get_async_http_client
 from azure.ai.evaluation._user_agent import UserAgentSingleton
-from azure.core.exceptions import HttpResponseError
+from azure.core.exceptions import HttpResponseError, ServiceResponseError
 from azure.core.pipeline.policies import AsyncRetryPolicy, RetryMode
 from azure.ai.evaluation._common.onedp._client import AIProjectClient
 from azure.ai.evaluation._common.onedp.models import SimulationDTO
@@ -208,7 +208,7 @@ class ProxyChatCompletionsModel(OpenAIChatCompletionsModel):
             flag = True
             while flag:
                 try:
-                    response = session.evaluations.operation_results(operation_id, headers=headers)
+                    response = session.red_teams.operation_results(operation_id, headers=headers)
                 except Exception as e:
                     from types import SimpleNamespace  # pylint: disable=forgotten-debug-statement
@@ -217,15 +217,34 @@ class ProxyChatCompletionsModel(OpenAIChatCompletionsModel):
                     response_data = response
                     flag = False
                     break
-                if response.status_code == 200:
-                    response_data = cast(List[Dict], response.json())
+                if not isinstance(response, SimpleNamespace) and response.get("object") == "chat.completion":
+                    response_data = response
                     flag = False
+                    break
                 else:
                     request_count += 1
                     sleep_time = RAIService.SLEEP_TIME**request_count
                     await asyncio.sleep(sleep_time)
         else:
-            response = await session.post(url=self.endpoint_url, headers=proxy_headers, json=sim_request_dto.to_dict())
+            # Retry policy for POST request to RAI service
+            service_call_retry_policy = AsyncRetryPolicy(
+                retry_on_exceptions=[ServiceResponseError],
+                retry_total=7,
+                retry_backoff_factor=10.0,
+                retry_backoff_max=180,
+                retry_mode=RetryMode.Exponential,
+            )
+            response = None
+            async with get_async_http_client().with_policies(retry_policy=service_call_retry_policy) as retry_client:
+                try:
+                    response = await retry_client.post(
+                        url=self.endpoint_url, headers=proxy_headers, json=sim_request_dto.to_dict()
+                    )
+                except ServiceResponseError as e:
+                    self.logger.error("ServiceResponseError during POST request to rai svc after retries: %s", str(e))
+                    raise
             # response.raise_for_status()
             if response.status_code != 202:
                 raise HttpResponseError(
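In the `red_teams.operation_results` branch, the polling loop now stops once a response looks like a finished chat completion (`"object" == "chat.completion"`). Otherwise it sleeps `RAIService.SLEEP_TIME ** request_count` seconds before polling again. Below is a hedged asyncio sketch of that shape, with some assumptions:

- The base of 2 is a hypothetical stand-in, since the diff does not show the value of `SLEEP_TIME`.
- `max_polls` is an addition that the diff's loop does not have.
- `fetch` and `sleep` are injected so the loop can run without a service.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

SLEEP_BASE = 2  # hypothetical stand-in for RAIService.SLEEP_TIME


async def poll_until_complete(
    fetch: Callable[[], Awaitable[Dict[str, Any]]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    max_polls: int = 10,
) -> Dict[str, Any]:
    """Poll fetch() until it returns a completed chat completion payload."""
    for request_count in range(1, max_polls + 1):
        response = await fetch()
        if response.get("object") == "chat.completion":
            return response
        if request_count < max_polls:
            # Exponential wait between polls: base**1, base**2, ...
            await sleep(SLEEP_BASE ** request_count)
    raise TimeoutError(f"operation did not complete after {max_polls} polls")
```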

azure/ai/evaluation/simulator/_simulator.py CHANGED

@@ -7,6 +7,7 @@ import asyncio
 import importlib.resources as pkg_resources
 import json
 import os
+import random
 import re
 import warnings
 from typing import Any, Callable, Dict, List, Optional, Union, Tuple
@@ -104,6 +105,7 @@ class Simulator:
         user_simulator_prompty_options: Dict[str, Any] = {},
         conversation_turns: List[List[Union[str, Dict[str, Any]]]] = [],
         concurrent_async_tasks: int = 5,
+        randomization_seed: Optional[int] = None,
         **kwargs,
     ) -> List[JsonLineChatProtocol]:
         """
@@ -134,6 +136,9 @@ class Simulator:
         :keyword concurrent_async_tasks: The number of asynchronous tasks to run concurrently during the simulation.
             Defaults to 5.
         :paramtype concurrent_async_tasks: int
+        :keyword randomization_seed: The seed used to randomize task/query order. If unset, the system's
+            default seed is used. Defaults to None.
+        :paramtype randomization_seed: Optional[int]
         :return: A list of simulated conversations represented as JsonLineChatProtocol objects.
         :rtype: List[JsonLineChatProtocol]
@@ -159,6 +164,13 @@ class Simulator:
                 f"Only the first {num_queries} lines of the specified tasks will be simulated."
             )
+        # Apply randomization to tasks if seed is provided
+        if randomization_seed is not None and tasks:
+            # Create a local random instance to avoid polluting global state
+            local_random = random.Random(randomization_seed)
+            tasks = tasks.copy()  # Don't modify the original list
+            local_random.shuffle(tasks)
         max_conversation_turns *= 2  # account for both user and assistant turns
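Unlike the template shuffles in the adversarial simulators, `Simulator` copies `tasks` before shuffling, so the caller's list keeps its original order. The copy is shallow: the task entries themselves are still shared objects. Below is a sketch of that branch. `order_tasks` is a hypothetical helper name.

```python
import random
from typing import List, Optional


def order_tasks(tasks: List[str], seed: Optional[int]) -> List[str]:
    """Return tasks in a seeded random order without mutating the caller's list."""
    if seed is not None and tasks:
        local_random = random.Random(seed)
        tasks = tasks.copy()  # rebinding the local name leaves the caller's list untouched
        local_random.shuffle(tasks)
    return tasks
```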
         prompty_model_config = self.model_config

{azure_ai_evaluation-1.9.0.dist-info → azure_ai_evaluation-1.11.0.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.1
+Metadata-Version: 2.4
 Name: azure-ai-evaluation
-Version: 1.9.0
+Version: 1.11.0
 Summary: Microsoft Azure Evaluation Library for Python
 Home-page: https://github.com/Azure/azure-sdk-for-python
 Author: Microsoft Corporation
@@ -21,8 +21,6 @@ Classifier: Operating System :: OS Independent
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: NOTICE.txt
-Requires-Dist: promptflow-devkit>=1.17.1
-Requires-Dist: promptflow-core>=1.17.1
 Requires-Dist: pyjwt>=2.8.0
 Requires-Dist: azure-identity>=1.16.0
 Requires-Dist: azure-core>=1.30.2
@@ -37,6 +35,20 @@ Requires-Dist: Jinja2>=3.1.6
 Requires-Dist: aiohttp>=3.0
 Provides-Extra: redteam
 Requires-Dist: pyrit==0.8.1; extra == "redteam"
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: license
+Dynamic: license-file
+Dynamic: project-url
+Dynamic: provides-extra
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
 # Azure AI Evaluation client library for Python
@@ -400,6 +412,50 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 # Release History
+## 1.11.0 (2025-09-02)
+### Features Added
+- Added support for user-supplied tags in the `evaluate` function. Tags are key-value pairs that can be used for experiment tracking, A/B testing, filtering, and organizing evaluation runs. The function accepts a `tags` parameter.
+- Added support for user-supplied TokenCredentials with LLM based evaluators.
+- Enhanced `GroundednessEvaluator` to support AI agent evaluation with tool calls. The evaluator now accepts agent response data containing tool calls and can extract context from `file_search` tool results for groundedness assessment. This enables evaluation of AI agents that use tools to retrieve information and generate responses. Note: Agent groundedness evaluation is currently supported only when the `file_search` tool is used.
+- Added `language` parameter to `RedTeam` class for multilingual red team scanning support. The parameter accepts values from `SupportedLanguages` enum including English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, and Simplified Chinese, enabling red team attacks to be generated and conducted in multiple languages.
+- Added support for IndirectAttack and UngroundedAttributes risk categories in `RedTeam` scanning. These new risk categories expand red team capabilities to detect cross-platform indirect attacks and evaluate ungrounded inferences about human attributes including emotional state and protected class information.
+### Bugs Fixed
+- Fixed issue where evaluation results were not properly aligned with input data, leading to incorrect metrics being reported.
+### Other Changes
+- Deprecating `AdversarialSimulator` in favor of the [AI Red Teaming Agent](https://aka.ms/airedteamingagent-sample). `AdversarialSimulator` will be removed in the next minor release.
+- Moved retry configuration constants (`MAX_RETRY_ATTEMPTS`, `MAX_RETRY_WAIT_SECONDS`, `MIN_RETRY_WAIT_SECONDS`) from `RedTeam` class to new `RetryManager` class for better code organization and configurability.
+## 1.10.0 (2025-07-31)
+### Breaking Changes
+- Added `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both query and response will be evaluated; when set to `False` (default), only the response will be evaluated. This parameter is available across all RAI service evaluators including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`.  Existing code that relies on queries being evaluated will need to explicitly set `evaluate_query=True` to maintain the previous behavior.
+### Features Added
+- Added support for Azure OpenAI Python grader via `AzureOpenAIPythonGrader` class, which serves as a wrapper around Azure Open AI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
+- Added `attack_success_thresholds` parameter to `RedTeam` class for configuring custom thresholds that determine attack success. This allows users to set specific threshold values for each risk category, with scores greater than the threshold considered successful attacks (i.e. higher threshold means higher
+tolerance for harmful responses).
+- Enhanced threshold reporting in RedTeam results to include default threshold values when custom thresholds aren't specified, providing better transparency about the evaluation criteria used.
+### Bugs Fixed
+- Fixed red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files while the user's `output_path` is reserved for final aggregated results.
+- Significant improvements to TaskAdherence evaluator. New version has less variance, is much faster and consumes fewer tokens.
+- Significant improvements to Relevance evaluator. New version has more concrete rubrics and has less variance, is much faster and consumes fewer tokens.
+### Other Changes
+- The default engine for evaluation was changed from `promptflow` (PFClient) to an in-SDK batch client (RunSubmitterClient)
+  - Note: We've temporarily kept an escape hatch to fall back to the legacy `promptflow` implementation by setting `_use_pf_client=True` when invoking `evaluate()`.
+    This is due to be removed in a future release.
 ## 1.9.0 (2025-07-02)
 ### Features Added
@@ -411,8 +467,11 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 ### Bugs Fixed
 - Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.
+- Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance. and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently without having context on the other tool calls that happen in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].
 - Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
 - Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
+- `AzureOpenAIScoreModelGrader` evaluator now supports `pass_threshold` parameter to set the minimum score required for a response to be considered passing. This allows users to define custom thresholds for evaluation results, enhancing flexibility in grading AI model responses.
 ## 1.8.0 (2025-05-29)
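Two threshold rules in these release notes reduce to one-line comparisons:

- **Metric pass/fail.** The sketch assumes a higher-is-better metric passes at or above its threshold; the notes do not say whether the comparison is `>=` or `>`. It also reproduces the truncation bug described for 1.9.0.
- **Attack success.** The `attack_success_thresholds` entry states that scores greater than the threshold count as successful attacks.

The function names below are hypothetical, not SDK APIs.

```python
def passes_threshold(score: float, threshold: float) -> bool:
    """Assumed pass rule for a higher-is-better metric: compare the raw decimal score."""
    return score >= threshold


def truncating_passes_threshold(score: float, threshold: float) -> bool:
    """The bug described for 1.9.0: int() truncates 0.9375 to 0 before comparing."""
    return int(score) >= threshold


def attack_succeeded(score: float, threshold: float) -> bool:
    """Rule stated for RedTeam attack_success_thresholds: strictly above the threshold."""
    return score > threshold
```

Under the attack-success rule, raising a category's threshold can only reduce the number of attacks counted as successful. This matches the note that a higher threshold means higher tolerance for harmful responses.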
