azure-ai-evaluation 1.9.0__tar.gz → 1.11.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of azure-ai-evaluation might be problematic.
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/CHANGELOG.md +47 -0
- {azure_ai_evaluation-1.9.0/azure_ai_evaluation.egg-info → azure_ai_evaluation-1.11.0}/PKG-INFO +63 -4
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/TROUBLESHOOTING.md +0 -3
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/__init__.py +46 -12
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/_aoai/python_grader.py +84 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_aoai/score_model_grader.py +1 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/models/_models.py +5 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/rai_service.py +3 -3
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/utils.py +74 -17
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_converters/_ai_services.py +60 -10
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_converters/_models.py +75 -26
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +70 -22
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_eval_run.py +14 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_evaluate.py +163 -44
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_evaluate_aoai.py +79 -33
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_utils.py +5 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_bleu/_bleu.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +8 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_coherence/_coherence.py +3 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_common/_base_eval.py +143 -25
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +7 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +19 -9
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +15 -5
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +4 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +4 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +5 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_content_safety/_violence.py +4 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py +3 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_eci/_eci.py +3 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_fluency/_fluency.py +3 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_gleu/_gleu.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +114 -4
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +9 -3
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_meteor/_meteor.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +8 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_qa/_qa.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_relevance/_relevance.py +56 -3
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +181 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +11 -3
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +3 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_rouge/_rouge.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +2 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_similarity/_similarity.py +3 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +24 -12
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +405 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +214 -187
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +166 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +8 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_xpia/xpia.py +4 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_exceptions.py +1 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_config.py +6 -3
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_engine.py +115 -30
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_result.py +2 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_run.py +2 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +28 -31
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +2 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_version.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/__init__.py +4 -3
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_attack_objective_generator.py +17 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_callback_chat_target.py +14 -1
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_evaluation_processor.py +376 -0
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_mlflow_integration.py +322 -0
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_orchestrator_manager.py +661 -0
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_red_team.py +1164 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_red_team_result.py +6 -0
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_result_processor.py +610 -0
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_utils/__init__.py +37 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_utils/_rai_service_eval_chat_target.py +11 -4
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_utils/_rai_service_true_false_scorer.py +6 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_utils/constants.py +0 -2
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_utils/exception_utils.py +345 -0
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_utils/file_utils.py +266 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_utils/formatting_utils.py +115 -13
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_utils/metric_mapping.py +24 -4
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_utils/progress_utils.py +252 -0
- azure_ai_evaluation-1.11.0/azure/ai/evaluation/red_team/_utils/retry_utils.py +218 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_utils/strategy_utils.py +17 -4
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_adversarial_simulator.py +14 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_indirect_attack_simulator.py +13 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +21 -7
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +24 -5
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_simulator.py +12 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0/azure_ai_evaluation.egg-info}/PKG-INFO +63 -4
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure_ai_evaluation.egg-info/SOURCES.txt +16 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure_ai_evaluation.egg-info/requires.txt +0 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/agent_evaluators/tool_call_accuracy.ipynb +7 -4
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/aoai_score_model_grader_sample.py +61 -7
- azure_ai_evaluation-1.11.0/samples/data/custom_objectives_with_context_example.json +51 -0
- azure_ai_evaluation-1.11.0/samples/evaluation_samples_common.py +128 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/evaluation_samples_evaluate.py +40 -27
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/evaluation_samples_evaluate_fdp.py +7 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/evaluation_samples_threshold.py +16 -16
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/red_team_samples.py +56 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/setup.py +0 -2
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/conftest.py +59 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/converters/ai_agent_converter/serialization_helper.py +6 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py +4 -4
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_builtin_evaluators.py +54 -20
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_evaluate.py +7 -7
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_mass_evaluate.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_metrics_upload.py +4 -0
- azure_ai_evaluation-1.11.0/tests/e2etests/test_red_team.py +379 -0
- azure_ai_evaluation-1.11.0/tests/unittests/test_agent_evaluators.py +105 -0
- azure_ai_evaluation-1.11.0/tests/unittests/test_aoai_alignment_missing_rows.py +90 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_aoai_evaluation_pagination.py +13 -5
- azure_ai_evaluation-1.11.0/tests/unittests/test_aoai_python_grader.py +54 -0
- azure_ai_evaluation-1.11.0/tests/unittests/test_built_in_evaluator.py +254 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_eval_run.py +291 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_evaluate.py +331 -12
- azure_ai_evaluation-1.11.0/tests/unittests/test_evaluate_mismatch.py +488 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_evaluate_performance.py +2 -3
- azure_ai_evaluation-1.11.0/tests/unittests/test_lazy_imports.py +135 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_attack_objective_generator.py +4 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_callback_chat_target.py +77 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_rai_service_eval_chat_target.py +1 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_red_team.py +279 -171
- azure_ai_evaluation-1.11.0/tests/unittests/test_redteam/test_red_team_language_support.py +213 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_red_team_result.py +6 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_strategy_utils.py +61 -1
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_safety_evaluation.py +48 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_save_eval.py +1 -0
- azure_ai_evaluation-1.11.0/tests/unittests/test_tool_call_accuracy_evaluator.py +686 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_utils.py +212 -1
- azure_ai_evaluation-1.9.0/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +0 -100
- azure_ai_evaluation-1.9.0/azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +0 -117
- azure_ai_evaluation-1.9.0/azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +0 -71
- azure_ai_evaluation-1.9.0/azure/ai/evaluation/red_team/_red_team.py +0 -3174
- azure_ai_evaluation-1.9.0/azure/ai/evaluation/simulator/_data_sources/__init__.py +0 -3
- azure_ai_evaluation-1.9.0/samples/evaluation_samples_common.py +0 -60
- azure_ai_evaluation-1.9.0/tests/unittests/test_agent_evaluators.py +0 -102
- azure_ai_evaluation-1.9.0/tests/unittests/test_built_in_evaluator.py +0 -130
- azure_ai_evaluation-1.9.0/tests/unittests/test_tool_call_accuracy_evaluator.py +0 -417
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/MANIFEST.in +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/NOTICE.txt +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/README.md +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_aoai/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_aoai/aoai_grader.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_aoai/label_grader.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_aoai/string_check_grader.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_aoai/text_similarity_grader.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_azure/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_azure/_clients.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_azure/_envs.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_azure/_models.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_azure/_token_manager.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/_experimental.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/constants.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/evaluation_onedp_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/math.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_configuration.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_model_base.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_serialization.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_types.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_utils/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_utils/model_base.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_utils/serialization.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_validation.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_vendor.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/_version.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/aio/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/aio/_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/aio/_configuration.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/aio/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/aio/operations/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/aio/operations/_operations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/aio/operations/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/models/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/models/_enums.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/models/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/operations/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/operations/_operations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/operations/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/py.typed +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/aio/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_operations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_operations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_operations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/operations/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/operations/_operations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/servicepatterns/operations/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/_configuration.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/_model_base.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/_serialization.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/_version.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/aio/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/aio/_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/aio/_configuration.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/aio/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/aio/operations/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/aio/operations/_operations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/aio/operations/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/models/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/models/_enums.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/models/_models.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/models/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/operations/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/operations/_operations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/operations/_patch.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/raiclient/py.typed +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_constants.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_converters/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_converters/_sk_services.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_eval_mapping.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_batch_run/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_batch_run/batch_clients.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_batch_run/code_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluate/_telemetry/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_bleu/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_code_vulnerability/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_coherence/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_common/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_common/_base_multi_eval.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_common/_conversation_aggregators.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_content_safety/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_document_retrieval/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_eci/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_f1_score/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_fluency/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_gleu/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_groundedness/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_intent_resolution/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_intent_resolution/intent_resolution.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_meteor/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_protected_material/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_qa/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_relevance/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_response_completeness/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_response_completeness/response_completeness.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_retrieval/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_rouge/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_service_groundedness/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_similarity/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_task_adherence/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_tool_call_accuracy/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_ungrounded_attributes/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_evaluators/_xpia/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_http_utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/_check.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/_configuration.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/_constants.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/_errors.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/_flows.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/_service.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/entities.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/tracing.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/types.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_adapters/utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_exceptions.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_openai_injector.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_run_storage.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_status.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_trace.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_batch_engine/_utils_deprecated.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_common/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_common/_async_token_provider.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_common/_logging.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/_common/_thread_pool_executor_with_context.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/prompty/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/prompty/_connection.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/prompty/_exceptions.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/prompty/_prompty.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/prompty/_utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_legacy/prompty/_yaml_utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_model_configurations.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_safety_evaluation/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_safety_evaluation/_generated_rai_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_user_agent.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_vendor/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_vendor/rouge_score/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_vendor/rouge_score/scoring.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_vendor/rouge_score/tokenize.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/py.typed +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_agent/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_agent/_agent_functions.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_agent/_agent_tools.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_agent/_agent_utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_agent/_semantic_kernel_plugin.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_attack_strategy.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_default_converter.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_utils/_rai_service_target.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/red_team/_utils/logging_utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_adversarial_scenario.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_constants.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_conversation/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_conversation/_conversation.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_conversation/constants.py +0 -0
- {azure_ai_evaluation-1.9.0/azure/ai/evaluation/red_team/_utils → azure_ai_evaluation-1.11.0/azure/ai/evaluation/simulator/_data_sources}/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_data_sources/grounding.json +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_direct_attack_simulator.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_helpers/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_model_tools/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_model_tools/_rai_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_model_tools/_template_handler.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_model_tools/models.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_prompty/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/simulator/_utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure_ai_evaluation.egg-info/dependency_links.txt +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure_ai_evaluation.egg-info/not-zip-safe +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure_ai_evaluation.egg-info/top_level.txt +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/migration_guide.md +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/pyproject.toml +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/README.md +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/agent_evaluators/agent_evaluation.ipynb +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/agent_evaluators/instructions.md +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/agent_evaluators/intent_resolution.ipynb +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/agent_evaluators/response_completeness.ipynb +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/agent_evaluators/sample_synthetic_conversations.jsonl +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/agent_evaluators/task_adherence.ipynb +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/agent_evaluators/user_functions.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/data/evaluate_test_data.jsonl +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/evaluation_samples_safety_evaluation.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/evaluation_samples_simulate.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/red_team_agent_tool_sample.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/red_team_skip_upload.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/samples/semantic_kernel_red_team_agent_sample.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/setup.cfg +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/__openai_patcher.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/converters/ai_agent_converter/test_run_ids_from_conversation.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/converters/ai_agent_converter/test_sk_agent_converter_internals.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/converters/ai_agent_converter/test_sk_turn_idxs_from_conversation.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/custom_evaluators/answer_length_with_aggregation.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/target_fn.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_adv_simulator.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_aoai_graders.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_lite_management_client.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_prompty_async.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_remote_evaluation.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/e2etests/test_sim_and_eval.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_aoai_integration_features.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_aoai_score_model_grader.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_batch_run_context.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_completeness_evaluator.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_content_safety_defect_rate.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_content_safety_rai_script.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_document_retrieval_evaluator.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_evaluators/slow_eval.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_evaluators/test_conversation_thresholds.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_evaluators/test_inputs_evaluators.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_evaluators/test_service_evaluator_thresholds.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_evaluators/test_threshold_behavior.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_jailbreak_simulator.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_non_adv_simulator.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/__init__.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_attack_strategy.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_constants.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_formatting_utils.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_rai_service_target.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_redteam/test_rai_service_true_false_scorer.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_remote_evaluation_features.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_simulator.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_synthetic_callback_conv_bot.py +0 -0
- {azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/tests/unittests/test_synthetic_conversation_bot.py +0 -0
{azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/CHANGELOG.md
RENAMED
@@ -1,5 +1,49 @@
 # Release History
 
+## 1.11.0 (2025-09-02)
+
+### Features Added
+- Added support for user-supplied tags in the `evaluate` function. Tags are key-value pairs that can be used for experiment tracking, A/B testing, filtering, and organizing evaluation runs. The function accepts a `tags` parameter.
+- Added support for user-supplied TokenCredentials with LLM-based evaluators.
+- Enhanced `GroundednessEvaluator` to support AI agent evaluation with tool calls. The evaluator now accepts agent response data containing tool calls and can extract context from `file_search` tool results for groundedness assessment. This enables evaluation of AI agents that use tools to retrieve information and generate responses. Note: Agent groundedness evaluation is currently supported only when the `file_search` tool is used.
+- Added `language` parameter to `RedTeam` class for multilingual red team scanning support. The parameter accepts values from `SupportedLanguages` enum including English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, and Simplified Chinese, enabling red team attacks to be generated and conducted in multiple languages.
+- Added support for IndirectAttack and UngroundedAttributes risk categories in `RedTeam` scanning. These new risk categories expand red team capabilities to detect cross-platform indirect attacks and evaluate ungrounded inferences about human attributes including emotional state and protected class information.
+
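To make the `tags` entry above concrete, here is a minimal sketch; the data file, model deployment, and tag values are illustrative placeholders rather than anything from the release notes:

```python
from azure.ai.evaluation import CoherenceEvaluator, evaluate

# Illustrative Azure OpenAI config for the LLM-judged evaluator.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

result = evaluate(
    data="evaluate_test_data.jsonl",  # rows with "query"/"response" columns
    evaluators={"coherence": CoherenceEvaluator(model_config=model_config)},
    # New in 1.11.0: free-form key-value pairs recorded with the run,
    # useful for experiment tracking and later filtering.
    tags={"experiment": "prompt_v2", "owner": "eval-team"},
)
```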
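The TokenCredentials entry does not show the wiring. The sketch below assumes a `credential` keyword argument on the LLM-based evaluators together with a key-less model configuration; treat both as assumptions, not the documented surface:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import RelevanceEvaluator

# Assumption: omitting api_key and supplying a TokenCredential makes the
# evaluator authenticate to Azure OpenAI with Entra ID.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
}

relevance = RelevanceEvaluator(
    model_config=model_config,
    credential=DefaultAzureCredential(),  # assumed parameter name
)
print(relevance(query="What is the capital of France?", response="Paris."))
```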
+### Bugs Fixed
+- Fixed issue where evaluation results were not properly aligned with input data, leading to incorrect metrics being reported.
+
+### Other Changes
+- Deprecating `AdversarialSimulator` in favor of the [AI Red Teaming Agent](https://aka.ms/airedteamingagent-sample). `AdversarialSimulator` will be removed in the next minor release.
+- Moved retry configuration constants (`MAX_RETRY_ATTEMPTS`, `MAX_RETRY_WAIT_SECONDS`, `MIN_RETRY_WAIT_SECONDS`) from `RedTeam` class to new `RetryManager` class for better code organization and configurability.
+
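As a sketch of the new `language` parameter on `RedTeam`: the project details below are placeholders, and the `SupportedLanguages` import path is taken from the simulator package, where that enum already lives.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.red_team import RedTeam, RiskCategory
from azure.ai.evaluation.simulator import SupportedLanguages

red_team = RedTeam(
    azure_ai_project={
        "subscription_id": "<subscription-id>",
        "resource_group_name": "<resource-group>",
        "project_name": "<project-name>",
    },
    credential=DefaultAzureCredential(),
    risk_categories=[RiskCategory.Violence],
    num_objectives=1,
    # Attacks are generated and conducted in Spanish.
    language=SupportedLanguages.Spanish,
)
# A scan then proceeds as usual, e.g. `await red_team.scan(target=...)`.
```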
+## 1.10.0 (2025-07-31)
+
+### Breaking Changes
+
+- Added `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both query and response will be evaluated; when set to `False` (default), only the response will be evaluated. This parameter is available across all RAI service evaluators including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`. Existing code that relies on queries being evaluated will need to explicitly set `evaluate_query=True` to maintain the previous behavior.
+
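A minimal sketch of opting back in to the pre-1.10.0 behavior after this breaking change; the project values are placeholders and the evaluator choice is arbitrary:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ViolenceEvaluator

violence = ViolenceEvaluator(
    azure_ai_project={
        "subscription_id": "<subscription-id>",
        "resource_group_name": "<resource-group>",
        "project_name": "<project-name>",
    },
    credential=DefaultAzureCredential(),
    evaluate_query=True,  # judge query + response, as before 1.10.0
)
result = violence(
    query="Describe the weather today.",
    response="It is sunny with a light breeze.",
)
```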
+### Features Added
+
+- Added support for Azure OpenAI Python grader via `AzureOpenAIPythonGrader` class, which serves as a wrapper around Azure OpenAI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
+- Added `attack_success_thresholds` parameter to `RedTeam` class for configuring custom thresholds that determine attack success. This allows users to set specific threshold values for each risk category, with scores greater than the threshold considered successful attacks (i.e. a higher threshold means a higher tolerance for harmful responses).
+- Enhanced threshold reporting in RedTeam results to include default threshold values when custom thresholds aren't specified, providing better transparency about the evaluation criteria used.
+
+
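A sketch of wiring the new Python grader into `evaluate`. The field names follow the service's Python grader contract (`name`, `image_tag`, `pass_threshold`, and a `source` string defining `grade(sample, item)`); the grading logic itself is illustrative:

```python
from azure.ai.evaluation import AzureOpenAIPythonGrader, evaluate

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

python_grader = AzureOpenAIPythonGrader(
    model_config=model_config,
    name="custom_accuracy",
    image_tag="2025-05-08",
    pass_threshold=0.5,
    source='''
def grade(sample: dict, item: dict) -> float:
    """Illustrative: 1.0 when the expected answer appears in the response."""
    response = (item.get("response") or "").lower()
    expected = (item.get("ground_truth") or "").lower()
    return 1.0 if expected and expected in response else 0.0
''',
)

result = evaluate(
    data="evaluate_test_data.jsonl",
    evaluators={"accuracy": python_grader},  # used like a normal evaluator
)
```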
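And a sketch of per-category attack-success thresholds on `RedTeam`; the numeric values and project details are illustrative:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.red_team import RedTeam, RiskCategory

red_team = RedTeam(
    azure_ai_project={
        "subscription_id": "<subscription-id>",
        "resource_group_name": "<resource-group>",
        "project_name": "<project-name>",
    },
    credential=DefaultAzureCredential(),
    risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
    # Scores above a category's threshold count as successful attacks,
    # so a higher value tolerates more harmful responses.
    attack_success_thresholds={
        RiskCategory.Violence: 3,
        RiskCategory.HateUnfairness: 4,
    },
)
```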
+### Bugs Fixed
+
+- Fixed red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files while the user's `output_path` is reserved for final aggregated results.
+- Significant improvements to TaskAdherence evaluator. New version has less variance, is much faster and consumes fewer tokens.
+- Significant improvements to Relevance evaluator. New version has more concrete rubrics and has less variance, is much faster and consumes fewer tokens.
+
+
+### Other Changes
+
+- The default engine for evaluation was changed from `promptflow` (PFClient) to an in-SDK batch client (RunSubmitterClient).
+- Note: We've temporarily kept an escape hatch to fall back to the legacy `promptflow` implementation by setting `_use_pf_client=True` when invoking `evaluate()`. This is due to be removed in a future release.
+
+
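The escape hatch above is a single private keyword argument; a tiny sketch (the leading underscore marks it non-public and slated for removal):

```python
from azure.ai.evaluation import F1ScoreEvaluator, evaluate

# Temporarily force the legacy promptflow batch engine instead of the
# in-SDK RunSubmitterClient.
result = evaluate(
    data="evaluate_test_data.jsonl",
    evaluators={"f1": F1ScoreEvaluator()},
    _use_pf_client=True,
)
```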
 ## 1.9.0 (2025-07-02)
 
 ### Features Added
@@ -11,8 +55,11 @@
 ### Bugs Fixed
 
 - Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.
+
+- Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance, and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently without having context on the other tool calls that happen in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].
 - Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
 - Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
+- `AzureOpenAIScoreModelGrader` evaluator now supports `pass_threshold` parameter to set the minimum score required for a response to be considered passing. This allows users to define custom thresholds for evaluation results, enhancing flexibility in grading AI model responses.
 
 ## 1.8.0 (2025-05-29)
 
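For the `pass_threshold` addition on `AzureOpenAIScoreModelGrader`, a sketch assuming the grader's existing `model`/`input`/`range` fields and an illustrative threshold:

```python
from azure.ai.evaluation import AzureOpenAIScoreModelGrader

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

grader = AzureOpenAIScoreModelGrader(
    model_config=model_config,
    name="answer_quality",
    model="gpt-4o",
    input=[
        {"role": "system",
         "content": "Rate the answer's quality from 0.0 to 1.0."},
        {"role": "user",
         "content": "Question: {{ item.query }}\nAnswer: {{ item.response }}"},
    ],
    range=[0.0, 1.0],
    pass_threshold=0.7,  # new: minimum score for a row to count as passing
)
```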
{azure_ai_evaluation-1.9.0/azure_ai_evaluation.egg-info → azure_ai_evaluation-1.11.0}/PKG-INFO
RENAMED
@@ -1,6 +1,6 @@
-Metadata-Version: 2.1
+Metadata-Version: 2.4
 Name: azure-ai-evaluation
-Version: 1.9.0
+Version: 1.11.0
 Summary: Microsoft Azure Evaluation Library for Python
 Home-page: https://github.com/Azure/azure-sdk-for-python
 Author: Microsoft Corporation
@@ -21,8 +21,6 @@ Classifier: Operating System :: OS Independent
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: NOTICE.txt
-Requires-Dist: promptflow-devkit>=1.17.1
-Requires-Dist: promptflow-core>=1.17.1
 Requires-Dist: pyjwt>=2.8.0
 Requires-Dist: azure-identity>=1.16.0
 Requires-Dist: azure-core>=1.30.2
@@ -37,6 +35,20 @@ Requires-Dist: Jinja2>=3.1.6
 Requires-Dist: aiohttp>=3.0
 Provides-Extra: redteam
 Requires-Dist: pyrit==0.8.1; extra == "redteam"
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: license
+Dynamic: license-file
+Dynamic: project-url
+Dynamic: provides-extra
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
 
 # Azure AI Evaluation client library for Python
 
@@ -400,6 +412,50 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
|
|
|
400
412
|
|
|
401
413
|
# Release History
|
|
402
414
|
|
|
415
|
+
## 1.11.0 (2025-09-02)
|
|
416
|
+
|
|
417
|
+
### Features Added
|
|
418
|
+
- Added support for user-supplied tags in the `evaluate` function. Tags are key-value pairs that can be used for experiment tracking, A/B testing, filtering, and organizing evaluation runs. The function accepts a `tags` parameter.
|
|
419
|
+
- Added support for user-supplied TokenCredentials with LLM based evaluators.
|
|
420
|
+
- Enhanced `GroundednessEvaluator` to support AI agent evaluation with tool calls. The evaluator now accepts agent response data containing tool calls and can extract context from `file_search` tool results for groundedness assessment. This enables evaluation of AI agents that use tools to retrieve information and generate responses. Note: Agent groundedness evaluation is currently supported only when the `file_search` tool is used.
|
|
421
|
+
- Added `language` parameter to `RedTeam` class for multilingual red team scanning support. The parameter accepts values from `SupportedLanguages` enum including English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, and Simplified Chinese, enabling red team attacks to be generated and conducted in multiple languages.
|
|
422
|
+
- Added support for IndirectAttack and UngroundedAttributes risk categories in `RedTeam` scanning. These new risk categories expand red team capabilities to detect cross-platform indirect attacks and evaluate ungrounded inferences about human attributes including emotional state and protected class information.
|
|
423
|
+
|
|
424
|
+
### Bugs Fixed
|
|
425
|
+
- Fixed issue where evaluation results were not properly aligned with input data, leading to incorrect metrics being reported.
|
|
426
|
+
|
|
427
|
+
### Other Changes
|
|
428
|
+
- Deprecating `AdversarialSimulator` in favor of the [AI Red Teaming Agent](https://aka.ms/airedteamingagent-sample). `AdversarialSimulator` will be removed in the next minor release.
|
|
429
|
+
- Moved retry configuration constants (`MAX_RETRY_ATTEMPTS`, `MAX_RETRY_WAIT_SECONDS`, `MIN_RETRY_WAIT_SECONDS`) from `RedTeam` class to new `RetryManager` class for better code organization and configurability.
|
|
430
|
+
|
|
+## 1.10.0 (2025-07-31)
+
+### Breaking Changes
+
+- Added an `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both query and response will be evaluated; when set to `False` (default), only the response will be evaluated. This parameter is available across all RAI service evaluators, including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`. Existing code that relies on queries being evaluated needs to explicitly set `evaluate_query=True` to maintain the previous behavior (see the sketch below).
+
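To make the migration concrete, a minimal sketch of restoring the pre-1.10.0 behavior; the project scope and query/response values are placeholders:

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ViolenceEvaluator

credential = DefaultAzureCredential()
project = "https://<resource>.services.ai.azure.com/api/projects/<project>"  # placeholder

# 1.10.0 default: only the response is sent for evaluation.
violence = ViolenceEvaluator(credential=credential, azure_ai_project=project)

# Opt back in to the pre-1.10.0 behavior of also evaluating the query.
violence_with_query = ViolenceEvaluator(
    credential=credential,
    azure_ai_project=project,
    evaluate_query=True,
)
violence_with_query(query="<user question>", response="<model answer>")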
+### Features Added
+
+- Added support for the Azure OpenAI Python grader via the `AzureOpenAIPythonGrader` class, which wraps Azure OpenAI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
+- Added an `attack_success_thresholds` parameter to the `RedTeam` class for configuring custom thresholds that determine attack success. This allows users to set a specific threshold value for each risk category; scores greater than the threshold are considered successful attacks (i.e. a higher threshold means a higher tolerance for harmful responses). A sketch follows this section.
+- Enhanced threshold reporting in `RedTeam` results to include default threshold values when custom thresholds aren't specified, providing better transparency about the evaluation criteria used.
+
+
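A hedged sketch of `attack_success_thresholds`; the project scope, risk categories, and threshold values are illustrative:

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.red_team import RedTeam, RiskCategory

red_team = RedTeam(
    azure_ai_project="https://<resource>.services.ai.azure.com/api/projects/<project>",
    credential=DefaultAzureCredential(),
    risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
    # Scores above these per-category values count as successful attacks;
    # a higher value tolerates more harmful responses before flagging.
    attack_success_thresholds={
        RiskCategory.Violence: 3,
        RiskCategory.HateUnfairness: 2,
    },
)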
+### Bugs Fixed
+
+- Fixed a red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files, while the user's `output_path` is reserved for the final aggregated results.
+- Significant improvements to the TaskAdherence evaluator. The new version has less variance, is much faster, and consumes fewer tokens.
+- Significant improvements to the Relevance evaluator. The new version has more concrete rubrics, has less variance, is much faster, and consumes fewer tokens.
+
+
+### Other Changes
+
+- The default engine for evaluation was changed from `promptflow` (PFClient) to an in-SDK batch client (RunSubmitterClient).
+- Note: We've temporarily kept an escape hatch to fall back to the legacy `promptflow` implementation by setting `_use_pf_client=True` when invoking `evaluate()` (see the sketch below). This is due to be removed in a future release.
+
+
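A short sketch of the temporary escape hatch, reusing the placeholder `model_config` from the earlier sketch:

from azure.ai.evaluation import evaluate, RelevanceEvaluator

result = evaluate(
    data="eval_data.jsonl",  # hypothetical dataset
    evaluators={"relevance": RelevanceEvaluator(model_config)},
    _use_pf_client=True,  # legacy promptflow engine; due to be removed
)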
 ## 1.9.0 (2025-07-02)
 
 ### Features Added
@@ -411,8 +467,11 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 ### Bugs Fixed
 
 - Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.
+
+- Fixes and improvements to the ToolCallAccuracy evaluator. The new version has less variance and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently, without context on the other tool calls in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].
 - Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
 - Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
+- The `AzureOpenAIScoreModelGrader` evaluator now supports a `pass_threshold` parameter to set the minimum score required for a response to be considered passing. This allows users to define custom thresholds for evaluation results, enhancing flexibility in grading AI model responses.
 
 ## 1.8.0 (2025-05-29)
 
{azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/TROUBLESHOOTING.md
RENAMED
@@ -46,9 +46,6 @@ This guide walks you through how to investigate failures, common errors in the `
 - Risk and safety evaluators depend on the Azure AI Studio safety evaluation backend service. For a list of supported regions, please refer to the documentation [here](https://aka.ms/azureaisafetyeval-regionsupport).
 - If you encounter a 403 Unauthorized error when using safety evaluators, verify that you have the `Contributor` role assigned to your Azure AI project. `Contributor` role is currently required to run safety evaluations.
 
-### Troubleshoot Quality Evaluator Issues
-- For `ToolCallAccuracyEvaluator`, if your input did not have a tool to evaluate, the current behavior is to output `null`.
-
 ## Handle Simulation Errors
 
 ### Adversarial Simulation Supported Regions
{azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/__init__.py
RENAMED
@@ -46,6 +46,7 @@ from ._aoai.label_grader import AzureOpenAILabelGrader
 from ._aoai.string_check_grader import AzureOpenAIStringCheckGrader
 from ._aoai.text_similarity_grader import AzureOpenAITextSimilarityGrader
 from ._aoai.score_model_grader import AzureOpenAIScoreModelGrader
+from ._aoai.python_grader import AzureOpenAIPythonGrader
 
 
 _patch_all = []
@@ -53,21 +54,46 @@ _patch_all = []
 # The converter from the AI service to the evaluator schema requires a dependency on
 # ai.projects, but we also don't want to force users installing ai.evaluations to pull
 # in ai.projects. So we only import it if it's available and the user has ai.projects.
-try:
-    from ._converters._ai_services import AIAgentConverter
+# We use lazy loading to avoid printing messages during import unless the classes are actually used.
+_lazy_imports = {}
 
-    _patch_all.append("AIAgentConverter")
-except ImportError:
-    print(
-        "[INFO] Could not import AIAgentConverter. Please install the dependency with `pip install azure-ai-projects`."
-    )
 
-
-
+def _create_lazy_import(class_name, module_path, dependency_name):
+    """Create a lazy import function for optional dependencies.
 
-
-
-
+    Args:
+        class_name: Name of the class to import
+        module_path: Module path to import from
+        dependency_name: Name of the dependency package for error message
+
+    Returns:
+        A function that performs the lazy import when called
+    """
+
+    def lazy_import():
+        try:
+            module = __import__(module_path, fromlist=[class_name])
+            cls = getattr(module, class_name)
+            _patch_all.append(class_name)
+            return cls
+        except ImportError:
+            raise ImportError(
+                f"Could not import {class_name}. Please install the dependency with `pip install {dependency_name}`."
+            )
+
+    return lazy_import
+
+
+_lazy_imports["AIAgentConverter"] = _create_lazy_import(
+    "AIAgentConverter",
+    "azure.ai.evaluation._converters._ai_services",
+    "azure-ai-projects",
+)
+_lazy_imports["SKAgentConverter"] = _create_lazy_import(
+    "SKAgentConverter",
+    "azure.ai.evaluation._converters._sk_services",
+    "semantic-kernel",
+)
 
 __all__ = [
     "evaluate",
@@ -110,6 +136,14 @@ __all__ = [
     "AzureOpenAIStringCheckGrader",
     "AzureOpenAITextSimilarityGrader",
    "AzureOpenAIScoreModelGrader",
+    "AzureOpenAIPythonGrader",
 ]
 
 __all__.extend([p for p in _patch_all if p not in __all__])
+
+
+def __getattr__(name):
+    """Handle lazy imports for optional dependencies."""
+    if name in _lazy_imports:
+        return _lazy_imports[name]()
+    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
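For readers unfamiliar with the pattern, a standalone sketch of module-level lazy imports via PEP 562 `__getattr__`; the module and attribute names are illustrative, not from the package:

# lazy_demo.py - illustrative module, not part of azure-ai-evaluation
_lazy_imports = {
    # Maps a public name to a zero-argument loader that imports on first use.
    "DataFrame": lambda: getattr(__import__("pandas"), "DataFrame"),
}


def __getattr__(name):
    # Called by Python only when `name` is not a regular module attribute.
    if name in _lazy_imports:
        return _lazy_imports[name]()
    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")

With this in place, `from lazy_demo import DataFrame` triggers the heavy import at first access rather than at module import time, which is why the 1.11.0 `__init__` no longer prints an `[INFO]` message during `import azure.ai.evaluation`.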
azure_ai_evaluation-1.11.0/azure/ai/evaluation/_aoai/python_grader.py
ADDED
@@ -0,0 +1,84 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+from typing import Any, Dict, Union, Optional
+
+from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
+from openai.types.graders import PythonGrader
+from azure.ai.evaluation._common._experimental import experimental
+
+from .aoai_grader import AzureOpenAIGrader
+
+
+@experimental
+class AzureOpenAIPythonGrader(AzureOpenAIGrader):
+    """
+    Wrapper class for OpenAI's Python code graders.
+
+    Enables custom Python-based evaluation logic with flexible scoring and
+    pass/fail thresholds. The grader executes user-provided Python code
+    to evaluate outputs against custom criteria.
+
+    Supplying a PythonGrader to the `evaluate` method will cause an
+    asynchronous request to evaluate the grader via the OpenAI API. The
+    results of the evaluation will then be merged into the standard
+    evaluation results.
+
+    :param model_config: The model configuration to use for the grader.
+    :type model_config: Union[
+        ~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+        ~azure.ai.evaluation.OpenAIModelConfiguration
+    ]
+    :param name: The name of the grader.
+    :type name: str
+    :param image_tag: The image tag for the Python execution environment.
+    :type image_tag: str
+    :param pass_threshold: Score threshold for pass/fail classification.
+        Scores >= threshold are considered passing.
+    :type pass_threshold: float
+    :param source: Python source code containing the grade function.
+        Must define: def grade(sample: dict, item: dict) -> float
+    :type source: str
+    :param kwargs: Additional keyword arguments to pass to the grader.
+    :type kwargs: Any
+
+
+    .. admonition:: Example:
+
+        .. literalinclude:: ../samples/evaluation_samples_common.py
+            :start-after: [START python_grader_example]
+            :end-before: [END python_grader_example]
+            :language: python
+            :dedent: 8
+            :caption: Using AzureOpenAIPythonGrader for custom evaluation logic.
+    """
+
+    id = "azureai://built-in/evaluators/azure-openai/python_grader"
+
+    def __init__(
+        self,
+        *,
+        model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
+        name: str,
+        image_tag: str,
+        pass_threshold: float,
+        source: str,
+        **kwargs: Any,
+    ):
+        # Validate pass_threshold
+        if not 0.0 <= pass_threshold <= 1.0:
+            raise ValueError("pass_threshold must be between 0.0 and 1.0")
+
+        # Store pass_threshold as instance attribute for potential future use
+        self.pass_threshold = pass_threshold
+
+        # Create OpenAI PythonGrader instance
+        grader = PythonGrader(
+            name=name,
+            image_tag=image_tag,
+            pass_threshold=pass_threshold,
+            source=source,
+            type="python",
+        )
+
+        super().__init__(model_config=model_config, grader_config=grader, **kwargs)
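A hedged usage sketch of the class above; the model configuration, dataset path, and `image_tag` value are placeholders, and the `grade` function is just one example of the required signature:

from azure.ai.evaluation import AzureOpenAIPythonGrader, evaluate

grade_source = '''
def grade(sample: dict, item: dict) -> float:
    # Return 1.0 when the output exactly matches the expected label.
    return 1.0 if item.get("output") == item.get("label") else 0.0
'''

python_grader = AzureOpenAIPythonGrader(
    model_config={
        "azure_endpoint": "https://<resource>.openai.azure.com",
        "azure_deployment": "<deployment>",
        "api_key": "<key>",
    },
    name="exact_match",
    image_tag="<image-tag>",  # execution-environment tag; placeholder
    pass_threshold=0.5,       # validated to lie in [0.0, 1.0]
    source=grade_source,
)

result = evaluate(
    data="eval_data.jsonl",   # hypothetical dataset
    evaluators={"exact_match": python_grader},
)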
{azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_aoai/score_model_grader.py
RENAMED
@@ -84,6 +84,7 @@ class AzureOpenAIScoreModelGrader(AzureOpenAIGrader):
             grader_kwargs["range"] = range
         if sampling_params is not None:
             grader_kwargs["sampling_params"] = sampling_params
+        grader_kwargs["pass_threshold"] = self.pass_threshold
 
         grader = ScoreModelGrader(**grader_kwargs)
 
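The one-line change above forwards the grader's `pass_threshold` into the service-side configuration. A hedged construction sketch; apart from `range` and `pass_threshold`, which appear in this diff, the parameter names are assumptions about the grader's signature:

from azure.ai.evaluation import AzureOpenAIScoreModelGrader

score_grader = AzureOpenAIScoreModelGrader(
    model_config=model_config,   # same shape as in the earlier sketches
    name="helpfulness",          # assumed common grader parameter
    model="<judge-deployment>",  # assumed; the scoring model to use
    input=[                      # assumed message-template input format
        {"role": "user", "content": "Rate this answer: {{ item.response }}"}
    ],
    range=[0.0, 1.0],
    pass_threshold=0.7,          # now forwarded to ScoreModelGrader per this diff
)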
{azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/onedp/models/_models.py
RENAMED
@@ -1961,12 +1961,16 @@ class Message(_Model):
     :vartype role: str
     :ivar content: The content.
     :vartype content: str
+    :ivar context: The context.
+    :vartype context: str
     """
 
     role: Optional[str] = rest_field(name="Role", visibility=["read", "create", "update", "delete", "query"])
     """The role."""
     content: Optional[str] = rest_field(name="Content", visibility=["read", "create", "update", "delete", "query"])
     """The content."""
+    context: Optional[str] = rest_field(name="Context", visibility=["read", "create", "update", "delete", "query"])
+    """The context."""
 
     @overload
     def __init__(
@@ -1974,6 +1978,7 @@ class Message(_Model):
         *,
         role: Optional[str] = None,
         content: Optional[str] = None,
+        context: Optional[str] = None,
     ) -> None: ...
 
     @overload
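A minimal sketch of the new field from the caller's side; the message values are illustrative:

msg = Message(
    role="user",
    content="What is our refund policy?",
    # New optional field, serialized as "Context" on the wire.
    context="The customer is on the premium plan.",
)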
{azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/rai_service.py
RENAMED
@@ -290,7 +290,7 @@ async def submit_request_onedp(
     payload = generate_payload(normalized_user_text, metric, annotation_task=annotation_task)
     headers = get_common_headers(token, evaluator_name)
     if scan_session_id:
-        headers["
+        headers["x-ms-client-request-id"] = scan_session_id
     response = client.evaluations.submit_annotation(payload, headers=headers)
     result = json.loads(response)
     operation_id = result["location"].split("/")[-1]
@@ -319,8 +319,8 @@ async def fetch_result(operation_id: str, rai_svc_url: str, credential: TokenCre
     token = await fetch_or_reuse_token(credential, token)
     headers = get_common_headers(token)
 
-    async with
-        response = await client.get(url, headers=headers)
+    async with get_async_http_client() as client:
+        response = await client.get(url, headers=headers, timeout=RAIService.TIMEOUT)
 
     if response.status_code == 200:
         return response.json()
{azure_ai_evaluation-1.9.0 → azure_ai_evaluation-1.11.0}/azure/ai/evaluation/_common/utils.py
RENAMED
@@ -6,11 +6,11 @@ import posixpath
 import re
 import math
 import threading
-from typing import Any, List, Literal, Mapping, Type, TypeVar, Tuple, Union, cast, get_args, get_origin
+from typing import Any, List, Literal, Mapping, Optional, Type, TypeVar, Tuple, Union, cast, get_args, get_origin
 
 import nltk
 from azure.storage.blob import ContainerClient
-from typing_extensions import NotRequired, Required, TypeGuard
+from typing_extensions import NotRequired, Required, TypeGuard, TypeIs
 from azure.ai.evaluation._legacy._adapters._errors import MissingRequiredPackage
 from azure.ai.evaluation._constants import AZURE_OPENAI_TYPE, OPENAI_TYPE
 from azure.ai.evaluation._exceptions import ErrorMessage, ErrorBlame, ErrorCategory, ErrorTarget, EvaluationException
@@ -127,17 +127,15 @@ def construct_prompty_model_config(
     return prompty_model_config
 
 
-def is_onedp_project(azure_ai_project: AzureAIProject) ->
+def is_onedp_project(azure_ai_project: Optional[Union[str, AzureAIProject]]) -> TypeIs[str]:
     """Check if the Azure AI project is an OneDP project.
 
     :param azure_ai_project: The scope of the Azure AI project.
-    :type azure_ai_project:
+    :type azure_ai_project: Optional[Union[str,~azure.ai.evaluation.AzureAIProject]]
     :return: True if the Azure AI project is an OneDP project, False otherwise.
     :rtype: bool
     """
-
-    return True
-    return False
+    return isinstance(azure_ai_project, str)
 
 
 def validate_azure_ai_project(o: object) -> AzureAIProject:
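The switch to a `TypeIs[str]` return annotation (imported from `typing_extensions` above) lets static type checkers narrow the argument in both branches of a check. A minimal illustration, independent of the SDK's own types:

from typing import Optional, Union
from typing_extensions import TypeIs


def is_onedp_project(project: Optional[Union[str, dict]]) -> TypeIs[str]:
    # True narrows `project` to str; False narrows it to Optional[dict].
    return isinstance(project, str)


def describe(project: Optional[Union[str, dict]]) -> str:
    if is_onedp_project(project):
        return project.upper()  # checker knows this is a str
    if project is not None:
        return ",".join(project.keys())  # and here a dict
    return "no project"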
@@ -494,14 +492,17 @@ def _extract_text_from_content(content):
     return text
 
 
-def _get_conversation_history(query):
+def _get_conversation_history(query, include_system_messages=False):
     all_user_queries = []
     cur_user_query = []
     all_agent_responses = []
     cur_agent_response = []
+    system_message = None
     for msg in query:
         if not "role" in msg:
             continue
+        if include_system_messages and msg["role"] == "system" and "content" in msg:
+            system_message = msg.get("content", "")
         if msg["role"] == "user" and "content" in msg:
             if cur_agent_response != []:
                 all_agent_responses.append(cur_agent_response)
@@ -530,13 +531,18 @@ def _get_conversation_history(query):
             category=ErrorCategory.INVALID_VALUE,
             blame=ErrorBlame.USER_ERROR,
         )
-
-
+    result = {"user_queries": all_user_queries, "agent_responses": all_agent_responses}
+    if include_system_messages:
+        result["system_message"] = system_message
+    return result
 
 
 def _pretty_format_conversation_history(conversation_history):
     """Formats the conversation history for better readability."""
     formatted_history = ""
+    if "system_message" in conversation_history and conversation_history["system_message"] is not None:
+        formatted_history += "SYSTEM_PROMPT:\n"
+        formatted_history += " " + conversation_history["system_message"] + "\n\n"
     for i, (user_query, agent_response) in enumerate(
         zip(conversation_history["user_queries"], conversation_history["agent_responses"] + [None])
     ):
@@ -552,10 +558,10 @@ def _pretty_format_conversation_history(conversation_history):
     return formatted_history
 
 
-def reformat_conversation_history(query, logger=None):
+def reformat_conversation_history(query, logger=None, include_system_messages=False):
     """Reformats the conversation history to a more compact representation."""
     try:
-        conversation_history = _get_conversation_history(query)
+        conversation_history = _get_conversation_history(query, include_system_messages=include_system_messages)
         return _pretty_format_conversation_history(conversation_history)
     except:
         # If the conversation history cannot be parsed for whatever reason (e.g. the converter format changed), the original query is returned
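Putting the two changes together, a hedged sketch of what the new `include_system_messages` flag does; the message list is invented, and the exact rendering of the turns is elided:

query = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello."},
    {"role": "user", "content": "What can you do?"},
]

compact = reformat_conversation_history(query, include_system_messages=True)
# Expected to begin with:
# SYSTEM_PROMPT:
#  You are a terse assistant.
# ...followed by the formatted user/agent turns.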
@@ -570,22 +576,53 @@ def reformat_conversation_history(query, logger=None):
     return query
 
 
-def _get_agent_response(agent_response_msgs):
-    """Extracts
+def _get_agent_response(agent_response_msgs, include_tool_messages=False):
+    """Extracts formatted agent response including text, and optionally tool calls/results."""
     agent_response_text = []
+    tool_results = {}
+
+    # First pass: collect tool results
+    if include_tool_messages:
+        for msg in agent_response_msgs:
+            if msg.get("role") == "tool" and "tool_call_id" in msg:
+                for content in msg.get("content", []):
+                    if content.get("type") == "tool_result":
+                        result = content.get("tool_result")
+                        tool_results[msg["tool_call_id"]] = f"[TOOL_RESULT] {result}"
+
+    # Second pass: parse assistant messages and tool calls
     for msg in agent_response_msgs:
-        if "role" in msg and msg
+        if "role" in msg and msg.get("role") == "assistant" and "content" in msg:
             text = _extract_text_from_content(msg["content"])
             if text:
                 agent_response_text.extend(text)
+            if include_tool_messages:
+                for content in msg.get("content", []):
+                    # Todo: Verify if this is the correct way to handle tool calls
+                    if content.get("type") == "tool_call":
+                        if "tool_call" in content and "function" in content.get("tool_call", {}):
+                            tc = content.get("tool_call", {})
+                            func_name = tc.get("function", {}).get("name", "")
+                            args = tc.get("function", {}).get("arguments", {})
+                            tool_call_id = tc.get("id")
+                        else:
+                            tool_call_id = content.get("tool_call_id")
+                            func_name = content.get("name", "")
+                            args = content.get("arguments", {})
+                        args_str = ", ".join(f'{k}="{v}"' for k, v in args.items())
+                        call_line = f"[TOOL_CALL] {func_name}({args_str})"
+                        agent_response_text.append(call_line)
+                        if tool_call_id in tool_results:
+                            agent_response_text.append(tool_results[tool_call_id])
 
     return agent_response_text
 
 
-def reformat_agent_response(response, logger=None):
+def reformat_agent_response(response, logger=None, include_tool_messages=False):
     try:
         if response is None or response == []:
             return ""
-        agent_response = _get_agent_response(response)
+        agent_response = _get_agent_response(response, include_tool_messages=include_tool_messages)
         if agent_response == []:
             # If no message could be extracted, likely the format changed, fallback to the original response in that case
             if logger:
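A hedged sketch of the tool-aware formatting; the message structure follows the shapes handled above, but the concrete values are invented:

response = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_1",
                "name": "lookup_weather",
                "arguments": {"city": "Paris"},
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": [{"type": "tool_result", "tool_result": "18C, cloudy"}],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "It is 18C and cloudy in Paris."}],
    },
]

formatted = reformat_agent_response(response, include_tool_messages=True)
# Expected to interleave lines like:
# [TOOL_CALL] lookup_weather(city="Paris")
# [TOOL_RESULT] 18C, cloudy
# ...alongside the assistant's text.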
@@ -602,6 +639,26 @@ def reformat_agent_response(response, logger=None):
         return response
 
 
+def reformat_tool_definitions(tool_definitions, logger=None):
+    try:
+        output_lines = ["TOOL_DEFINITIONS:"]
+        for tool in tool_definitions:
+            name = tool.get("name", "unnamed_tool")
+            desc = tool.get("description", "").strip()
+            params = tool.get("parameters", {}).get("properties", {})
+            param_names = ", ".join(params.keys()) if params else "no parameters"
+            output_lines.append(f"- {name}: {desc} (inputs: {param_names})")
+        return "\n".join(output_lines)
+    except Exception as e:
+        # If the tool definitions cannot be parsed for whatever reason, the original tool definitions are returned
+        # This is a fallback to ensure that the evaluation can still proceed. See comments on reformat_conversation_history for more details.
+        if logger:
+            logger.warning(
+                f"Tool definitions could not be parsed, falling back to original definitions: {tool_definitions}"
+            )
+        return tool_definitions
+
+
 def upload(path: str, container_client: ContainerClient, logger=None):
     """Upload files or directories to Azure Blob Storage using a container client.
 
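And a short usage sketch of the new helper; the tool definitions are invented for illustration:

tools = [
    {
        "name": "lookup_weather",
        "description": "Get current weather for a city.",
        "parameters": {"properties": {"city": {"type": "string"}}},
    },
    {"name": "ping", "description": "Liveness check."},
]

print(reformat_tool_definitions(tools))
# TOOL_DEFINITIONS:
# - lookup_weather: Get current weather for a city. (inputs: city)
# - ping: Liveness check. (inputs: no parameters)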