azure-ai-evaluation 1.0.0b4__tar.gz → 1.0.0b5__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/CHANGELOG.md +68 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/MANIFEST.in +1 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/NOTICE.txt +20 -0
- {azure_ai_evaluation-1.0.0b4/azure_ai_evaluation.egg-info → azure_ai_evaluation-1.0.0b5}/PKG-INFO +166 -9
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/README.md +96 -8
- azure_ai_evaluation-1.0.0b5/TROUBLESHOOTING.md +50 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/__init__.py +22 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/constants.py +5 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/math.py +11 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/rai_service.py +172 -35
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/utils.py +162 -23
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_constants.py +6 -6
- {azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/__init__.py +3 -2
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluate/_batch_run_client/batch_run_context.py → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +4 -4
- {azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/proxy_client.py +6 -3
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +35 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_eval_run.py +21 -4
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_evaluate.py +267 -139
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_telemetry/__init__.py +5 -5
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_utils.py +40 -7
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_bleu/_bleu.py +1 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_coherence/_coherence.py +14 -9
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +99 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_eval.py +20 -19
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +18 -8
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +48 -9
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +56 -19
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +5 -5
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +30 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +30 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +30 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_violence.py +30 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_eci/_eci.py +3 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_fluency/_fluency.py +20 -20
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +86 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_gleu/_gleu.py +1 -1
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +106 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +113 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +99 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_meteor/_meteor.py +3 -7
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/__init__.py +20 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +130 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +57 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +96 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +120 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +96 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +96 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_violence.py +96 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +90 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_qa/_qa.py +7 -3
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_relevance/_relevance.py +21 -19
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +100 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +197 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +93 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_rouge/_rouge.py +2 -2
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_service_groundedness/__init__.py +9 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +150 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/_similarity.py +17 -14
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_xpia/xpia.py +32 -5
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_exceptions.py +17 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_model_configurations.py +18 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_version.py +1 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/__init__.py +2 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_adversarial_scenario.py +5 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_adversarial_simulator.py +4 -1
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/simulator/_data_sources/__init__.py +3 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/simulator/_data_sources/grounding.json +1150 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_direct_attack_simulator.py +1 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/__init__.py +1 -2
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +22 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_indirect_attack_simulator.py +79 -34
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +1 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +4 -4
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +6 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_simulator.py +115 -61
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_utils.py +6 -6
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5/azure_ai_evaluation.egg-info}/PKG-INFO +166 -9
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/SOURCES.txt +22 -6
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/requires.txt +1 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/setup.py +2 -0
- azure_ai_evaluation-1.0.0b5/tests/__pf_service_isolation.py +28 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/conftest.py +27 -8
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/target_fn.py +18 -0
- azure_ai_evaluation-1.0.0b5/tests/e2etests/test_builtin_evaluators.py +1021 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_evaluate.py +217 -21
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_sim_and_eval.py +5 -9
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_batch_run_context.py +8 -8
- azure_ai_evaluation-1.0.0b5/tests/unittests/test_built_in_evaluator.py +138 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_content_safety_rai_script.py +17 -12
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_eval_run.py +28 -2
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluate.py +59 -22
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_non_adv_simulator.py +7 -4
- azure_ai_evaluation-1.0.0b5/tests/unittests/test_utils.py +56 -0
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +0 -57
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +0 -56
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +0 -72
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +0 -49
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +0 -57
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +0 -64
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +0 -154
- azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +0 -43
- azure_ai_evaluation-1.0.0b4/tests/e2etests/test_builtin_evaluators.py +0 -474
- azure_ai_evaluation-1.0.0b4/tests/unittests/test_built_in_evaluator.py +0 -41
- azure_ai_evaluation-1.0.0b4/tests/unittests/test_utils.py +0 -20
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/simulator/_helpers → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_common}/_experimental.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/code_client.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_bleu/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_coherence/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_eci/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_f1_score/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_fluency/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_gleu/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_groundedness/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_meteor/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_protected_material/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_qa/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_relevance/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_retrieval/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_rouge/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_xpia/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_http_utils.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_user_agent.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_vendor/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_vendor/rouge_score/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_vendor/rouge_score/scoring.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_vendor/rouge_score/tokenize.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/py.typed +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_constants.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/_conversation.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/constants.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_rai_client.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_template_handler.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/models.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_tracing.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/dependency_links.txt +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/not-zip-safe +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/top_level.txt +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/pyproject.toml +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/setup.cfg +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/__openai_patcher.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/custom_evaluators/answer_length_with_aggregation.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_adv_simulator.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_metrics_upload.py +1 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_content_safety_defect_rate.py +1 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluate_telemetry.py +1 -1
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluators/apology_dag/apology.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluators/test_inputs_evaluators.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_jailbreak_simulator.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_save_eval.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_simulator.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_synthetic_callback_conv_bot.py +0 -0
- {azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_synthetic_conversation_bot.py +1 -1

{azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/CHANGELOG.md
RENAMED

@@ -1,5 +1,71 @@
 # Release History
 
+## 1.0.0b5 (2024-10-28)
+
+### Features Added
+- Added `GroundednessProEvaluator`, which is a service-based evaluator for determining response groundedness.
+- Groundedness detection in the Non-Adversarial Simulator via query/context pairs
+  ```python
+  import importlib.resources as pkg_resources
+  package = "azure.ai.evaluation.simulator._data_sources"
+  resource_name = "grounding.json"
+  custom_simulator = Simulator(model_config=model_config)
+  conversation_turns = []
+  with pkg_resources.path(package, resource_name) as grounding_file:
+      with open(grounding_file, "r") as file:
+          data = json.load(file)
+  for item in data:
+      conversation_turns.append([item])
+  outputs = asyncio.run(custom_simulator(
+      target=callback,
+      conversation_turns=conversation_turns,
+      max_conversation_turns=1,
+  ))
+  ```
+- Added evaluators for multimodal use cases
+
+### Breaking Changes
+- Renamed environment variable `PF_EVALS_BATCH_USE_ASYNC` to `AI_EVALS_BATCH_USE_ASYNC`.
+- `RetrievalEvaluator` now requires a `context` input in addition to `query` in single-turn evaluation.
+- `RelevanceEvaluator` no longer takes `context` as an input. It now takes only `query` and `response` in single-turn evaluation.
+- `FluencyEvaluator` no longer takes `query` as an input. It now takes only `response` in single-turn evaluation.
+- The `AdversarialScenario` enum no longer includes `ADVERSARIAL_INDIRECT_JAILBREAK`; indirect jailbreak (XPIA) simulations should be invoked with `IndirectAttackSimulator` instead.
+- Outputs of `Simulator` and `AdversarialSimulator` previously had `to_eval_qa_json_lines` and now have `to_eval_qr_json_lines`. Where `to_eval_qa_json_lines` had:
+  ```json
+  {"question": <user_message>, "answer": <assistant_message>}
+  ```
+  `to_eval_qr_json_lines` now has:
+  ```json
+  {"query": <user_message>, "response": <assistant_message>}
+  ```
+
+### Bugs Fixed
+- The non-adversarial simulator works with `gpt-4o` models using the `json_schema` response format.
+- Fixed an issue where the `evaluate` API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when the venv folder and the target function file are in the same directory.
+- Fixed an `evaluate` API failure when `trace.destination` is set to `none`.
+- The non-adversarial simulator now accepts context from the callback.
+
+### Other Changes
+- Improved error messages for the `evaluate` API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.
+- `GroundednessEvaluator` now supports `query` as an optional input in single-turn evaluation. If `query` is provided, a different prompt template is used for the evaluation.
+- To align with our support of a diverse set of models, the following evaluators now emit a new key in their result output without the `gpt_` prefix. To maintain backwards compatibility, the old key with the `gpt_` prefix is still present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.
+  - `CoherenceEvaluator`
+  - `RelevanceEvaluator`
+  - `FluencyEvaluator`
+  - `GroundednessEvaluator`
+  - `SimilarityEvaluator`
+  - `RetrievalEvaluator`
+- The following evaluators now emit a new key in their result output that includes the LLM reasoning behind the score. The new key follows the pattern `<metric_name>_reason`. The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
+
+  | Evaluator | New Token Limit |
+  | --- | --- |
+  | `CoherenceEvaluator` | 800 |
+  | `RelevanceEvaluator` | 800 |
+  | `FluencyEvaluator` | 800 |
+  | `GroundednessEvaluator` | 800 |
+  | `RetrievalEvaluator` | 1600 |
+- Improved the error message for storage access permission issues to provide clearer guidance for users.
+
 ## 1.0.0b4 (2024-10-16)
 
 ### Breaking Changes

@@ -10,9 +76,11 @@
 
 ### Bugs Fixed
 - Adversarial Conversation simulations would fail with `Forbidden`. Added logic to re-fetch token in the exponential retry logic to retrieve the RAI Service response.
+- Fixed an issue where the Evaluate API did not fail due to missing inputs when the target did not return columns required by the evaluators.
 
 ### Other Changes
 - Enhance the error message to provide clearer instruction when required packages for the remote tracking feature are missing.
+- Print the per-evaluator run summary at the end of the Evaluate API call to make troubleshooting row-level failures easier.
 
 ## 1.0.0b3 (2024-10-01)
 
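To make the new result keys concrete, here is a minimal, hedged sketch (not part of the package changelog) of reading a prompt-based evaluator's output in 1.0.0b5. The `model_config` value is assumed to be the Azure OpenAI model configuration shown in the README content later in this diff.

```python
from azure.ai.evaluation import CoherenceEvaluator

# model_config is assumed to be the AzureOpenAIModelConfiguration dict from the README examples.
coherence_eval = CoherenceEvaluator(model_config=model_config)
result = coherence_eval(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
)

print(result["coherence"])         # new key without the gpt_ prefix
print(result["coherence_reason"])  # new key carrying the LLM reasoning behind the score
print(result["gpt_coherence"])     # legacy key, still emitted for backwards compatibility
```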

{azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/NOTICE.txt
RENAMED

@@ -48,3 +48,23 @@ distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
+
+
+License notice for [Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings](https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2023.1272229/full)
+------------------------------------------------------------------------------------------------------------------
+Copyright © 2023 Hackl, Müller, Granitzer and Sailer. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
+
+
+License notice for [Is ChatGPT a Good NLG Evaluator? A Preliminary Study](https://aclanthology.org/2023.newsum-1.1) (Wang et al., NewSum 2023)
+------------------------------------------------------------------------------------------------------------------
+Copyright © 2023. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
+
+
+License notice for [SummEval: Re-evaluating Summarization Evaluation.](https://doi.org/10.1162/tacl_a_00373) (Fabbri et al.)
+------------------------------------------------------------------------------------------------------------------
+© 2021 Association for Computational Linguistics. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
+
+
+License notice for [Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks](https://aclanthology.org/2023.emnlp-main.543) (Sottana et al., EMNLP 2023)
+------------------------------------------------------------------------------------------------------------------
+© 2023 Association for Computational Linguistics. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
{azure_ai_evaluation-1.0.0b4/azure_ai_evaluation.egg-info → azure_ai_evaluation-1.0.0b5}/PKG-INFO
RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: azure-ai-evaluation
-Version: 1.0.0b4
+Version: 1.0.0b5
 Summary: Microsoft Azure Evaluation Library for Python
 Home-page: https://github.com/Azure/azure-sdk-for-python
 Author: Microsoft Corporation

@@ -30,6 +30,7 @@ Requires-Dist: azure-core>=1.30.2
 Requires-Dist: nltk>=3.9.1
 Provides-Extra: remote
 Requires-Dist: promptflow-azure<2.0.0,>=1.15.0; extra == "remote"
+Requires-Dist: azure-ai-inference>=1.0.0b4; extra == "remote"
 
 # Azure AI Evaluation client library for Python
 

@@ -95,9 +96,6 @@ if __name__ == "__main__":
     # Running Relevance Evaluator on single input row
     relevance_score = relevance_eval(
         response="The Alpine Explorer Tent is the most waterproof.",
-        context="From the our product list,"
-        " the alpine explorer tent is the most waterproof."
-        " The Adventure Dining Table has higher weight.",
         query="Which tent is the most waterproof?",
     )
 

@@ -172,6 +170,95 @@ Output with a string that continues the conversation, responding to the latest m
 {{ conversation_history }}
 
 ```
+
+Query/response generating prompty for gpt-4o with `json_schema` support.
+Use this file as an override.
+```yaml
+---
+name: TaskSimulatorQueryResponseGPT4o
+description: Gets queries and responses from a blob of text
+model:
+  api: chat
+  parameters:
+    temperature: 0.0
+    top_p: 1.0
+    presence_penalty: 0
+    frequency_penalty: 0
+    response_format:
+      type: json_schema
+      json_schema:
+        name: QRJsonSchema
+        schema:
+          type: object
+          properties:
+            items:
+              type: array
+              items:
+                type: object
+                properties:
+                  q:
+                    type: string
+                  r:
+                    type: string
+                required:
+                  - q
+                  - r
+
+inputs:
+  text:
+    type: string
+  num_queries:
+    type: integer
+
+
+---
+system:
+You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
+Both Questions and Answers MUST BE extracted from given Text
+Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
+RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
+A sentence should contribute multiple QnAs if it has more info in it
+Answer must not be more than 5 words
+Answer must be picked from Text as is
+Question should be as descriptive as possible and must include as much context as possible from Text
+Output must always have the provided number of QnAs
+Output must be in JSON format.
+Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
+Text:
+<|text_start|>
+On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
+Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
+<|text_end|>
+Output with 5 QnAs:
+{
+    "qna": [{
+        "q": "When did the former Apple CEO Steve Jobs introduce the first Macintosh?",
+        "r": "January 24, 1984"
+    },
+    {
+        "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
+        "r": "Steve Jobs"
+    },
+    {
+        "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
+        "r": "2.06 percent"
+    },
+    {
+        "q": "What were the research firms that reported on Apple's market share in the U.S.?",
+        "r": "IDC and Gartner"
+    },
+    {
+        "q": "What was the percentage increase of Apple's market share in the U.S., as reported by research firms IDC and Gartner?",
+        "r": "6%"
+    }]
+}
+Text:
+<|text_start|>
+{{ text }}
+<|text_end|>
+Output with {{ num_queries }} QnAs:
+```
+
 Application code:
 
 ```python

@@ -189,6 +276,7 @@ model_config = {
     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
     # not providing key would make the SDK pick up `DefaultAzureCredential`
     # use "api_key": "<your API key>"
+    "api_version": "2024-08-01-preview"  # keep this for gpt-4o
 }
 
 # Use Wikipedia to get some text for the simulation

@@ -232,20 +320,21 @@ async def callback(
     formatted_response = {
         "content": response,
         "role": "assistant",
-        "context": {
-            "citations": None,
-        },
+        "context": "",
     }
     messages["messages"].append(formatted_response)
     return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
 
 async def main():
     simulator = Simulator(model_config=model_config)
+    current_dir = os.path.dirname(__file__)
+    query_response_override_for_latest_gpt_4o = os.path.join(current_dir, "TaskSimulatorQueryResponseGPT4o.prompty")
     outputs = await simulator(
         target=callback,
         text=text,
+        query_response_generating_prompty=query_response_override_for_latest_gpt_4o,  # use this only with latest gpt-4o
         num_queries=2,
-        max_conversation_turns=
+        max_conversation_turns=1,
         user_persona=[
             f"I am a student and I want to learn more about {wiki_search_term}",
             f"I am a teacher and I want to teach my students about {wiki_search_term}"

@@ -267,7 +356,7 @@ if __name__ == "__main__":
 #### Adversarial Simulator
 
 ```python
-from
+from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
 from azure.identity import DefaultAzureCredential
 from typing import Any, Dict, List, Optional
 import asyncio

@@ -420,6 +509,72 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 
 # Release History
 
+## 1.0.0b5 (2024-10-28)
+
+### Features Added
+- Added `GroundednessProEvaluator`, which is a service-based evaluator for determining response groundedness.
+- Groundedness detection in the Non-Adversarial Simulator via query/context pairs
+  ```python
+  import importlib.resources as pkg_resources
+  package = "azure.ai.evaluation.simulator._data_sources"
+  resource_name = "grounding.json"
+  custom_simulator = Simulator(model_config=model_config)
+  conversation_turns = []
+  with pkg_resources.path(package, resource_name) as grounding_file:
+      with open(grounding_file, "r") as file:
+          data = json.load(file)
+  for item in data:
+      conversation_turns.append([item])
+  outputs = asyncio.run(custom_simulator(
+      target=callback,
+      conversation_turns=conversation_turns,
+      max_conversation_turns=1,
+  ))
+  ```
+- Added evaluators for multimodal use cases
+
+### Breaking Changes
+- Renamed environment variable `PF_EVALS_BATCH_USE_ASYNC` to `AI_EVALS_BATCH_USE_ASYNC`.
+- `RetrievalEvaluator` now requires a `context` input in addition to `query` in single-turn evaluation.
+- `RelevanceEvaluator` no longer takes `context` as an input. It now takes only `query` and `response` in single-turn evaluation.
+- `FluencyEvaluator` no longer takes `query` as an input. It now takes only `response` in single-turn evaluation.
+- The `AdversarialScenario` enum no longer includes `ADVERSARIAL_INDIRECT_JAILBREAK`; indirect jailbreak (XPIA) simulations should be invoked with `IndirectAttackSimulator` instead.
+- Outputs of `Simulator` and `AdversarialSimulator` previously had `to_eval_qa_json_lines` and now have `to_eval_qr_json_lines`. Where `to_eval_qa_json_lines` had:
+  ```json
+  {"question": <user_message>, "answer": <assistant_message>}
+  ```
+  `to_eval_qr_json_lines` now has:
+  ```json
+  {"query": <user_message>, "response": <assistant_message>}
+  ```
+
+### Bugs Fixed
+- The non-adversarial simulator works with `gpt-4o` models using the `json_schema` response format.
+- Fixed an issue where the `evaluate` API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when the venv folder and the target function file are in the same directory.
+- Fixed an `evaluate` API failure when `trace.destination` is set to `none`.
+- The non-adversarial simulator now accepts context from the callback.
+
+### Other Changes
+- Improved error messages for the `evaluate` API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.
+- `GroundednessEvaluator` now supports `query` as an optional input in single-turn evaluation. If `query` is provided, a different prompt template is used for the evaluation.
+- To align with our support of a diverse set of models, the following evaluators now emit a new key in their result output without the `gpt_` prefix. To maintain backwards compatibility, the old key with the `gpt_` prefix is still present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.
+  - `CoherenceEvaluator`
+  - `RelevanceEvaluator`
+  - `FluencyEvaluator`
+  - `GroundednessEvaluator`
+  - `SimilarityEvaluator`
+  - `RetrievalEvaluator`
+- The following evaluators now emit a new key in their result output that includes the LLM reasoning behind the score. The new key follows the pattern `<metric_name>_reason`. The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
+
+  | Evaluator | New Token Limit |
+  | --- | --- |
+  | `CoherenceEvaluator` | 800 |
+  | `RelevanceEvaluator` | 800 |
+  | `FluencyEvaluator` | 800 |
+  | `GroundednessEvaluator` | 800 |
+  | `RetrievalEvaluator` | 1600 |
+- Improved the error message for storage access permission issues to provide clearer guidance for users.
+
 ## 1.0.0b4 (2024-10-16)
 
 ### Breaking Changes

@@ -430,9 +585,11 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 
 ### Bugs Fixed
 - Adversarial Conversation simulations would fail with `Forbidden`. Added logic to re-fetch token in the exponential retry logic to retrieve the RAI Service response.
+- Fixed an issue where the Evaluate API did not fail due to missing inputs when the target did not return columns required by the evaluators.
 
 ### Other Changes
 - Enhance the error message to provide clearer instruction when required packages for the remote tracking feature are missing.
+- Print the per-evaluator run summary at the end of the Evaluate API call to make troubleshooting row-level failures easier.
 
 ## 1.0.0b3 (2024-10-01)
 
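As a small, hedged illustration (not taken from the package) of the query/response output format referenced above: assuming `outputs` is the object returned by one of the simulator calls shown in these examples, and that it exposes the `to_eval_qr_json_lines` helper described in the 1.0.0b5 changelog, the results could be persisted for later evaluation roughly like this:

```python
# Illustrative sketch only: `outputs` and to_eval_qr_json_lines() are assumed from the
# simulator examples and the 1.0.0b5 changelog; each line is {"query": ..., "response": ...}.
output_path = "simulator_output.jsonl"
with open(output_path, "w", encoding="utf-8") as f:
    f.write(outputs.to_eval_qr_json_lines())
```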
{azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/README.md
RENAMED

@@ -62,9 +62,6 @@ if __name__ == "__main__":
     # Running Relevance Evaluator on single input row
     relevance_score = relevance_eval(
         response="The Alpine Explorer Tent is the most waterproof.",
-        context="From the our product list,"
-        " the alpine explorer tent is the most waterproof."
-        " The Adventure Dining Table has higher weight.",
         query="Which tent is the most waterproof?",
     )
 

@@ -139,6 +136,95 @@ Output with a string that continues the conversation, responding to the latest m
 {{ conversation_history }}
 
 ```
+
+Query/response generating prompty for gpt-4o with `json_schema` support.
+Use this file as an override.
+```yaml
+---
+name: TaskSimulatorQueryResponseGPT4o
+description: Gets queries and responses from a blob of text
+model:
+  api: chat
+  parameters:
+    temperature: 0.0
+    top_p: 1.0
+    presence_penalty: 0
+    frequency_penalty: 0
+    response_format:
+      type: json_schema
+      json_schema:
+        name: QRJsonSchema
+        schema:
+          type: object
+          properties:
+            items:
+              type: array
+              items:
+                type: object
+                properties:
+                  q:
+                    type: string
+                  r:
+                    type: string
+                required:
+                  - q
+                  - r
+
+inputs:
+  text:
+    type: string
+  num_queries:
+    type: integer
+
+
+---
+system:
+You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
+Both Questions and Answers MUST BE extracted from given Text
+Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
+RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
+A sentence should contribute multiple QnAs if it has more info in it
+Answer must not be more than 5 words
+Answer must be picked from Text as is
+Question should be as descriptive as possible and must include as much context as possible from Text
+Output must always have the provided number of QnAs
+Output must be in JSON format.
+Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
+Text:
+<|text_start|>
+On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
+Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
+<|text_end|>
+Output with 5 QnAs:
+{
+    "qna": [{
+        "q": "When did the former Apple CEO Steve Jobs introduce the first Macintosh?",
+        "r": "January 24, 1984"
+    },
+    {
+        "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
+        "r": "Steve Jobs"
+    },
+    {
+        "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
+        "r": "2.06 percent"
+    },
+    {
+        "q": "What were the research firms that reported on Apple's market share in the U.S.?",
+        "r": "IDC and Gartner"
+    },
+    {
+        "q": "What was the percentage increase of Apple's market share in the U.S., as reported by research firms IDC and Gartner?",
+        "r": "6%"
+    }]
+}
+Text:
+<|text_start|>
+{{ text }}
+<|text_end|>
+Output with {{ num_queries }} QnAs:
+```
+
 Application code:
 
 ```python

@@ -156,6 +242,7 @@ model_config = {
     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
     # not providing key would make the SDK pick up `DefaultAzureCredential`
     # use "api_key": "<your API key>"
+    "api_version": "2024-08-01-preview"  # keep this for gpt-4o
 }
 
 # Use Wikipedia to get some text for the simulation

@@ -199,20 +286,21 @@ async def callback(
     formatted_response = {
         "content": response,
         "role": "assistant",
-        "context": {
-            "citations": None,
-        },
+        "context": "",
     }
     messages["messages"].append(formatted_response)
     return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
 
 async def main():
     simulator = Simulator(model_config=model_config)
+    current_dir = os.path.dirname(__file__)
+    query_response_override_for_latest_gpt_4o = os.path.join(current_dir, "TaskSimulatorQueryResponseGPT4o.prompty")
     outputs = await simulator(
         target=callback,
         text=text,
+        query_response_generating_prompty=query_response_override_for_latest_gpt_4o,  # use this only with latest gpt-4o
         num_queries=2,
-        max_conversation_turns=
+        max_conversation_turns=1,
         user_persona=[
             f"I am a student and I want to learn more about {wiki_search_term}",
             f"I am a teacher and I want to teach my students about {wiki_search_term}"

@@ -234,7 +322,7 @@ if __name__ == "__main__":
 #### Adversarial Simulator
 
 ```python
-from
+from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
 from azure.identity import DefaultAzureCredential
 from typing import Any, Dict, List, Optional
 import asyncio
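To summarize the single-turn signature changes the README diff above implies, here is a minimal sketch; it is not part of the README, the `model_config` dict is assumed to be the Azure OpenAI configuration shown earlier, and the sample strings are placeholders.

```python
from azure.ai.evaluation import FluencyEvaluator, GroundednessEvaluator, RelevanceEvaluator

relevance_eval = RelevanceEvaluator(model_config=model_config)
fluency_eval = FluencyEvaluator(model_config=model_config)
groundedness_eval = GroundednessEvaluator(model_config=model_config)

# RelevanceEvaluator no longer takes `context`; FluencyEvaluator no longer takes `query`.
relevance_eval(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
)
fluency_eval(response="The Alpine Explorer Tent is the most waterproof.")

# GroundednessEvaluator accepts an optional `query` in 1.0.0b5 (placeholder context string).
groundedness_eval(
    response="The Alpine Explorer Tent is the most waterproof.",
    context="From our product list, the Alpine Explorer Tent is the most waterproof.",
    query="Which tent is the most waterproof?",
)
```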
azure_ai_evaluation-1.0.0b5/TROUBLESHOOTING.md
ADDED

@@ -0,0 +1,50 @@
+# Troubleshoot AI Evaluation SDK Issues
+
+This guide walks you through how to investigate failures and common errors in the `azure-ai-evaluation` SDK, and the steps to mitigate these issues.
+
+## Table of Contents
+
+- [Handle Evaluate API Errors](#handle-evaluate-api-errors)
+  - [Troubleshoot Remote Tracking Issues](#troubleshoot-remote-tracking-issues)
+  - [Safety Metric Supported Regions](#safety-metric-supported-regions)
+- [Handle Simulation Errors](#handle-simulation-errors)
+  - [Adversarial Simulation Supported Regions](#adversarial-simulation-supported-regions)
+- [Logging](#logging)
+- [Get Additional Help](#get-additional-help)
+
+## Handle Evaluate API Errors
+
+### Troubleshoot Remote Tracking Issues
+
+- Before running `evaluate()`, make sure you are logged in by running `az login` so that logging and tracing to your Azure AI project can be enabled.
+- Then install the following sub-package:
+
+  ```Shell
+  pip install azure-ai-evaluation[remote]
+  ```
+
+- Ensure that you assign the proper permissions to the storage account linked to your Azure AI Studio hub. This can be done with the following command. More information can be found [here](https://review.learn.microsoft.com/azure/ai-studio/how-to/disable-local-auth).
+
+  ```Shell
+  az role assignment create --role "Storage Blob Data Contributor" --scope /subscriptions/<mySubscriptionID>/resourceGroups/<myResourceGroupName> --assignee-principal-type User --assignee-object-id "<user-id>"
+  ```
+
+- Additionally, if your evaluation run upload fails because you're using a virtual network or private link, check out this [guide](https://docs.microsoft.com/azure/machine-learning/how-to-enable-studio-virtual-network#access-data-using-the-studio).
+
+### Safety Metric Supported Regions
+
+Risk and safety evaluators depend on the Azure AI Studio safety evaluation backend service. For a list of supported regions, please refer to the documentation [here](https://aka.ms/azureaisafetyeval-regionsupport).
+
+## Handle Simulation Errors
+
+### Adversarial Simulation Supported Regions
+
+Adversarial simulators use the Azure AI Studio safety evaluation backend service to generate an adversarial dataset against your application. For a list of supported regions, please refer to the documentation [here](https://aka.ms/azureaiadvsimulator-regionsupport).
+
+## Logging
+
+You can set the logging level via the environment variable `PF_LOGGING_LEVEL`. Valid values include `CRITICAL`, `ERROR`, `WARNING`, `INFO`, and `DEBUG`; the default is `INFO`.
+
+## Get Additional Help
+
+Additional information on ways to reach out for support can be found in the [SUPPORT.md](https://github.com/Azure/azure-sdk-for-python/blob/main/SUPPORT.md) at the root of the repo.
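For the logging knob described above, here is a minimal sketch of turning on verbose output from a Python entry point. The variable name and its values come straight from the troubleshooting guide; everything else is illustrative.

```python
import os

# Set this before invoking evaluate() or a simulator so the SDK picks it up.
os.environ["PF_LOGGING_LEVEL"] = "DEBUG"  # CRITICAL, ERROR, WARNING, INFO (default), DEBUG
```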
{azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/__init__.py
RENAMED

@@ -12,10 +12,19 @@ from ._evaluators._content_safety import (
     SexualEvaluator,
     ViolenceEvaluator,
 )
+from ._evaluators._multimodal._content_safety_multimodal import (
+    ContentSafetyMultimodalEvaluator,
+    HateUnfairnessMultimodalEvaluator,
+    SelfHarmMultimodalEvaluator,
+    SexualMultimodalEvaluator,
+    ViolenceMultimodalEvaluator,
+)
+from ._evaluators._multimodal._protected_material import ProtectedMaterialMultimodalEvaluator
 from ._evaluators._f1_score import F1ScoreEvaluator
 from ._evaluators._fluency import FluencyEvaluator
 from ._evaluators._gleu import GleuScoreEvaluator
 from ._evaluators._groundedness import GroundednessEvaluator
+from ._evaluators._service_groundedness import GroundednessProEvaluator
 from ._evaluators._meteor import MeteorScoreEvaluator
 from ._evaluators._protected_material import ProtectedMaterialEvaluator
 from ._evaluators._qa import QAEvaluator

@@ -27,7 +36,10 @@ from ._evaluators._xpia import IndirectAttackEvaluator
 from ._model_configurations import (
     AzureAIProject,
     AzureOpenAIModelConfiguration,
+    Conversation,
+    EvaluationResult,
     EvaluatorConfig,
+    Message,
     OpenAIModelConfiguration,
 )
 

@@ -37,6 +49,7 @@ __all__ = [
     "F1ScoreEvaluator",
     "FluencyEvaluator",
     "GroundednessEvaluator",
+    "GroundednessProEvaluator",
     "RelevanceEvaluator",
     "SimilarityEvaluator",
     "QAEvaluator",

@@ -57,4 +70,13 @@ __all__ = [
     "AzureOpenAIModelConfiguration",
     "OpenAIModelConfiguration",
     "EvaluatorConfig",
+    "Conversation",
+    "Message",
+    "EvaluationResult",
+    "ContentSafetyMultimodalEvaluator",
+    "HateUnfairnessMultimodalEvaluator",
+    "SelfHarmMultimodalEvaluator",
+    "SexualMultimodalEvaluator",
+    "ViolenceMultimodalEvaluator",
+    "ProtectedMaterialMultimodalEvaluator",
 ]
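A minimal sketch of importing the names newly exported here. The `azure_ai_project` shape and the constructor arguments are assumptions based on the package's other service-based evaluators, not something shown in this diff.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import (
    ContentSafetyMultimodalEvaluator,
    GroundednessProEvaluator,
    ProtectedMaterialMultimodalEvaluator,
)

# Assumed AzureAIProject shape; replace the placeholders with your own project details.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}
credential = DefaultAzureCredential()

# Assumed constructor signature, mirroring the existing service-based evaluators.
groundedness_pro_eval = GroundednessProEvaluator(
    azure_ai_project=azure_ai_project, credential=credential
)
```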
{azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/constants.py
RENAMED

@@ -6,6 +6,9 @@ from enum import Enum
 from azure.core import CaseInsensitiveEnumMeta
 
 
+PROMPT_BASED_REASON_EVALUATORS = ["coherence", "relevance", "retrieval", "groundedness", "fluency"]
+
+
 class CommonConstants:
     """Define common constants."""
 

@@ -35,6 +38,7 @@ class Tasks:
     CONTENT_HARM = "content harm"
     PROTECTED_MATERIAL = "protected material"
     XPIA = "xpia"
+    GROUNDEDNESS = "groundedness"
 
 
 class _InternalAnnotationTasks:

@@ -56,6 +60,7 @@ class EvaluationMetrics(str, Enum, metaclass=CaseInsensitiveEnumMeta):
     SEXUAL = "sexual"
     PROTECTED_MATERIAL = "protected_material"
     XPIA = "xpia"
+    GROUNDEDNESS = "generic_groundedness"
 
 
 class _InternalEvaluationMetrics(str, Enum, metaclass=CaseInsensitiveEnumMeta):
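A small illustrative sketch (not from the package, and it imports a private module) of how the new constant lines up with the `<metric_name>_reason` convention noted in the changelog:

```python
from typing import Optional

from azure.ai.evaluation._common.constants import PROMPT_BASED_REASON_EVALUATORS


def reason_key_for(metric: str) -> Optional[str]:
    # Prompt-based evaluators listed in the constant also emit a "<metric_name>_reason" key in 1.0.0b5.
    return f"{metric}_reason" if metric in PROMPT_BASED_REASON_EVALUATORS else None


print(reason_key_for("groundedness"))  # "groundedness_reason"
print(reason_key_for("bleu"))          # None
```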
{azure_ai_evaluation-1.0.0b4 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/math.py
RENAMED

@@ -5,6 +5,8 @@
 import math
 from typing import List
 
+from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+
 
 def list_sum(lst: List[float]) -> float:
     return sum(lst)

@@ -15,4 +17,13 @@ def list_mean(lst: List[float]) -> float:
 
 
 def list_mean_nan_safe(lst: List[float]) -> float:
+    msg = "All score values are NaN. The mean cannot be calculated."
+    if all(math.isnan(l) for l in lst):
+        raise EvaluationException(
+            message=msg,
+            internal_message=msg,
+            blame=ErrorBlame.USER_ERROR,
+            category=ErrorCategory.INVALID_VALUE,
+            target=ErrorTarget.CONVERSATION,
+        )
     return list_mean([l for l in lst if not math.isnan(l)])