azure-ai-evaluation 1.0.0__py3-none-any.whl → 1.0.0b1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: the source registry flags this version of azure-ai-evaluation as potentially problematic.

Files changed (108)
  1. azure/ai/evaluation/__init__.py +4 -26
  2. azure/ai/evaluation/_common/constants.py +2 -9
  3. azure/ai/evaluation/_common/rai_service.py +122 -302
  4. azure/ai/evaluation/_common/utils.py +35 -393
  5. azure/ai/evaluation/_constants.py +6 -28
  6. azure/ai/evaluation/_evaluate/{_batch_run → _batch_run_client}/__init__.py +2 -3
  7. azure/ai/evaluation/_evaluate/{_batch_run/eval_run_context.py → _batch_run_client/batch_run_context.py} +8 -25
  8. azure/ai/evaluation/_evaluate/{_batch_run → _batch_run_client}/code_client.py +30 -68
  9. azure/ai/evaluation/_evaluate/_batch_run_client/proxy_client.py +61 -0
  10. azure/ai/evaluation/_evaluate/_eval_run.py +40 -117
  11. azure/ai/evaluation/_evaluate/_evaluate.py +255 -416
  12. azure/ai/evaluation/_evaluate/_telemetry/__init__.py +19 -24
  13. azure/ai/evaluation/_evaluate/_utils.py +47 -108
  14. azure/ai/evaluation/_evaluators/_bleu/_bleu.py +19 -18
  15. azure/ai/evaluation/_evaluators/{_retrieval → _chat}/__init__.py +2 -2
  16. azure/ai/evaluation/_evaluators/_chat/_chat.py +350 -0
  17. azure/ai/evaluation/_evaluators/{_service_groundedness → _chat/retrieval}/__init__.py +2 -2
  18. azure/ai/evaluation/_evaluators/_chat/retrieval/_retrieval.py +163 -0
  19. azure/ai/evaluation/_evaluators/_chat/retrieval/retrieval.prompty +48 -0
  20. azure/ai/evaluation/_evaluators/_coherence/_coherence.py +93 -78
  21. azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +39 -76
  22. azure/ai/evaluation/_evaluators/_content_safety/__init__.py +4 -0
  23. azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +68 -104
  24. azure/ai/evaluation/_evaluators/{_multimodal/_content_safety_multimodal_base.py → _content_safety/_content_safety_base.py} +35 -24
  25. azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +296 -0
  26. azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +54 -105
  27. azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +52 -99
  28. azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +52 -101
  29. azure/ai/evaluation/_evaluators/_content_safety/_violence.py +51 -101
  30. azure/ai/evaluation/_evaluators/_eci/_eci.py +55 -45
  31. azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +20 -36
  32. azure/ai/evaluation/_evaluators/_fluency/_fluency.py +94 -76
  33. azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +41 -66
  34. azure/ai/evaluation/_evaluators/_gleu/_gleu.py +17 -15
  35. azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +92 -113
  36. azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +54 -0
  37. azure/ai/evaluation/_evaluators/_meteor/_meteor.py +27 -21
  38. azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +80 -89
  39. azure/ai/evaluation/_evaluators/_protected_materials/__init__.py +5 -0
  40. azure/ai/evaluation/_evaluators/_protected_materials/_protected_materials.py +104 -0
  41. azure/ai/evaluation/_evaluators/_qa/_qa.py +43 -25
  42. azure/ai/evaluation/_evaluators/_relevance/_relevance.py +101 -84
  43. azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +47 -78
  44. azure/ai/evaluation/_evaluators/_rouge/_rouge.py +27 -27
  45. azure/ai/evaluation/_evaluators/_similarity/_similarity.py +45 -55
  46. azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +5 -0
  47. azure/ai/evaluation/_evaluators/_xpia/xpia.py +106 -91
  48. azure/ai/evaluation/_exceptions.py +7 -28
  49. azure/ai/evaluation/_http_utils.py +134 -205
  50. azure/ai/evaluation/_model_configurations.py +8 -104
  51. azure/ai/evaluation/_version.py +1 -1
  52. azure/ai/evaluation/simulator/__init__.py +2 -3
  53. azure/ai/evaluation/simulator/_adversarial_scenario.py +1 -20
  54. azure/ai/evaluation/simulator/_adversarial_simulator.py +95 -116
  55. azure/ai/evaluation/simulator/_constants.py +1 -11
  56. azure/ai/evaluation/simulator/_conversation/__init__.py +13 -14
  57. azure/ai/evaluation/simulator/_conversation/_conversation.py +20 -20
  58. azure/ai/evaluation/simulator/_direct_attack_simulator.py +68 -34
  59. azure/ai/evaluation/simulator/_helpers/__init__.py +1 -1
  60. azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +28 -31
  61. azure/ai/evaluation/simulator/_indirect_attack_simulator.py +95 -108
  62. azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +22 -70
  63. azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +14 -30
  64. azure/ai/evaluation/simulator/_model_tools/_rai_client.py +14 -25
  65. azure/ai/evaluation/simulator/_model_tools/_template_handler.py +24 -68
  66. azure/ai/evaluation/simulator/_model_tools/models.py +21 -19
  67. azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +10 -6
  68. azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +5 -6
  69. azure/ai/evaluation/simulator/_tracing.py +28 -25
  70. azure/ai/evaluation/simulator/_utils.py +13 -34
  71. azure/ai/evaluation/simulator/simulator.py +579 -0
  72. azure_ai_evaluation-1.0.0b1.dist-info/METADATA +377 -0
  73. azure_ai_evaluation-1.0.0b1.dist-info/RECORD +97 -0
  74. {azure_ai_evaluation-1.0.0.dist-info → azure_ai_evaluation-1.0.0b1.dist-info}/WHEEL +1 -1
  75. azure/ai/evaluation/_common/_experimental.py +0 -172
  76. azure/ai/evaluation/_common/math.py +0 -89
  77. azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +0 -99
  78. azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +0 -46
  79. azure/ai/evaluation/_evaluators/_common/__init__.py +0 -13
  80. azure/ai/evaluation/_evaluators/_common/_base_eval.py +0 -344
  81. azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +0 -88
  82. azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +0 -133
  83. azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +0 -113
  84. azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +0 -99
  85. azure/ai/evaluation/_evaluators/_multimodal/__init__.py +0 -20
  86. azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +0 -132
  87. azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +0 -100
  88. azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +0 -124
  89. azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +0 -100
  90. azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +0 -100
  91. azure/ai/evaluation/_evaluators/_multimodal/_violence.py +0 -100
  92. azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +0 -112
  93. azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +0 -93
  94. azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +0 -148
  95. azure/ai/evaluation/_vendor/__init__.py +0 -3
  96. azure/ai/evaluation/_vendor/rouge_score/__init__.py +0 -14
  97. azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +0 -328
  98. azure/ai/evaluation/_vendor/rouge_score/scoring.py +0 -63
  99. azure/ai/evaluation/_vendor/rouge_score/tokenize.py +0 -63
  100. azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +0 -53
  101. azure/ai/evaluation/simulator/_data_sources/__init__.py +0 -3
  102. azure/ai/evaluation/simulator/_data_sources/grounding.json +0 -1150
  103. azure/ai/evaluation/simulator/_prompty/__init__.py +0 -0
  104. azure/ai/evaluation/simulator/_simulator.py +0 -716
  105. azure_ai_evaluation-1.0.0.dist-info/METADATA +0 -595
  106. azure_ai_evaluation-1.0.0.dist-info/NOTICE.txt +0 -70
  107. azure_ai_evaluation-1.0.0.dist-info/RECORD +0 -119
  108. {azure_ai_evaluation-1.0.0.dist-info → azure_ai_evaluation-1.0.0b1.dist-info}/top_level.txt +0 -0
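
The renames above change the public import surface as well as the internals: the beta wheel ships both a ProtectedMaterialEvaluator and a plural ProtectedMaterialsEvaluator, while 1.0.0 keeps only the singular class (see the hunks below). A minimal sketch of how downstream code might tolerate both wheels, assuming the plural name is importable from the package root in the beta; the first import is the part both versions guarantee:

    # Hedged sketch: prefer the singular evaluator, which both wheels export; fall back to
    # the beta-only plural alias only if the singular name is unavailable.
    try:
        from azure.ai.evaluation import ProtectedMaterialEvaluator  # present in 1.0.0 and 1.0.0b1
    except ImportError:
        # 1.0.0b1 also ships ProtectedMaterialsEvaluator (different documented output keys, see below).
        from azure.ai.evaluation import ProtectedMaterialsEvaluator as ProtectedMaterialEvaluator
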
azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty
@@ -0,0 +1,54 @@
+ ---
+ name: Groundedness
+ description: Evaluates groundedness score for QA scenario
+ model:
+   api: chat
+   configuration:
+     type: azure_openai
+     azure_deployment: ${env:AZURE_DEPLOYMENT}
+     api_key: ${env:AZURE_OPENAI_API_KEY}
+     azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
+   parameters:
+     temperature: 0.0
+     max_tokens: 1
+     top_p: 1.0
+     presence_penalty: 0
+     frequency_penalty: 0
+     response_format:
+       type: text
+
+ inputs:
+   response:
+     type: string
+   context:
+     type: string
+
+ ---
+ system:
+ You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
+ user:
+ You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following rating:
+ 1. 5: The ANSWER follows logically from the information contained in the CONTEXT.
+ 2. 1: The ANSWER is logically false from the information contained in the CONTEXT.
+ 3. an integer score between 1 and 5 and if such integer score does not exist, use 1: It is not possible to determine whether the ANSWER is true or false without further information. Read the passage of information thoroughly and select the correct answer from the three answer labels. Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails. Note the ANSWER is generated by a computer system, it can contain certain symbols, which should not be a negative factor in the evaluation.
+ Independent Examples:
+ ## Example Task #1 Input:
+ {"CONTEXT": "Some are reported as not having been wanted at all.", "QUESTION": "", "ANSWER": "All are reported as being completely and fully wanted."}
+ ## Example Task #1 Output:
+ 1
+ ## Example Task #2 Input:
+ {"CONTEXT": "Ten new television shows appeared during the month of September. Five of the shows were sitcoms, three were hourlong dramas, and two were news-magazine shows. By January, only seven of these new shows were still on the air. Five of the shows that remained were sitcoms.", "QUESTION": "", "ANSWER": "At least one of the shows that were cancelled was an hourlong drama."}
+ ## Example Task #2 Output:
+ 5
+ ## Example Task #3 Input:
+ {"CONTEXT": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is neither French nor English.", "QUESTION": "", "ANSWER": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is not French."}
+ ## Example Task #3 Output:
+ 5
+ ## Example Task #4 Input:
+ {"CONTEXT": "Some are reported as not having been wanted at all.", "QUESTION": "", "ANSWER": "All are reported as being completely and fully wanted."}
+ ## Example Task #4 Output:
+ 1
+ ## Actual Task Input:
+ {"CONTEXT": {{context}}, "QUESTION": "", "ANSWER": {{response}}}
+ Reminder: The return values for each task should be correctly formatted as an integer between 1 and 5. Do not repeat the context and question.
+ Actual Task Output:
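
The prompty above declares two inputs, response and context, and reads its Azure OpenAI settings from environment variables. A minimal sketch of how those inputs are typically supplied through the evaluator that wraps this template; the endpoint, deployment, and example strings are placeholders, not values from the diff:

    from azure.ai.evaluation import GroundednessEvaluator

    model_config = {
        "azure_endpoint": "https://<resource>.openai.azure.com",  # placeholder
        "azure_deployment": "<deployment>",                       # placeholder
        "api_key": "<api-key>",                                   # placeholder
    }

    groundedness = GroundednessEvaluator(model_config)
    result = groundedness(
        response="Tokyo is the capital of Japan.",
        context="Tokyo is the capital of Japan.",
    )
    # 1.0.0b1 reports the 1-5 rating as "gpt_groundedness"; 1.0.0 adds an un-prefixed
    # "groundedness" key alongside it (see the removed QAEvaluator note further down).
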
azure/ai/evaluation/_evaluators/_meteor/_meteor.py
@@ -1,10 +1,10 @@
  # ---------------------------------------------------------
  # Copyright (c) Microsoft Corporation. All rights reserved.
  # ---------------------------------------------------------
+ import nltk
  from nltk.translate.meteor_score import meteor_score
  from promptflow._utils.async_utils import async_run_allowing_running_loop
-
- from azure.ai.evaluation._common.utils import nltk_tokenize, ensure_nltk_data_downloaded
+ from azure.ai.evaluation._common.utils import nltk_tokenize


  class _AsyncMeteorScoreEvaluator:
@@ -13,7 +13,10 @@ class _AsyncMeteorScoreEvaluator:
          self._beta = beta
          self._gamma = gamma

-         ensure_nltk_data_downloaded()
+         try:
+             nltk.find("corpora/wordnet.zip")
+         except LookupError:
+             nltk.download("wordnet")

      async def __call__(self, *, ground_truth: str, response: str, **kwargs):
          reference_tokens = nltk_tokenize(ground_truth)
@@ -34,7 +37,7 @@

  class MeteorScoreEvaluator:
      """
-     Calculates the METEOR score for a given response and ground truth.
+     Evaluator that computes the METEOR Score between two strings.

      The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score grader evaluates generated text by
      comparing it to reference texts, focusing on precision, recall, and content alignment. It addresses limitations of
@@ -42,12 +45,6 @@ class MeteorScoreEvaluator:
      word stems to more accurately capture meaning and language variations. In addition to machine translation and
      text summarization, paraphrase detection is an optimal use case for the METEOR score.

-     Use the METEOR score when you want a more linguistically informed evaluation metric that captures not only
-     n-gram overlap but also accounts for synonyms, stemming, and word order. This is particularly useful for evaluating
-     tasks like machine translation, text summarization, and text generation.
-
-     The METEOR score ranges from 0 to 1, with 1 indicating a perfect match.
-
      :param alpha: The METEOR score alpha parameter. Default is 0.9.
      :type alpha: float
      :param beta: The METEOR score beta parameter. Default is 3.0.
@@ -55,18 +52,27 @@
      :param gamma: The METEOR score gamma parameter. Default is 0.5.
      :type gamma: float

-     .. admonition:: Example:
+     **Usage**

-         .. literalinclude:: ../samples/evaluation_samples_evaluate.py
-             :start-after: [START meteor_score_evaluator]
-             :end-before: [END meteor_score_evaluator]
-             :language: python
-             :dedent: 8
-             :caption: Initialize and call a MeteorScoreEvaluator with alpha of 0.8.
-     """
+     .. code-block:: python
+
+         eval_fn = MeteorScoreEvaluator(
+             alpha=0.9,
+             beta=3.0,
+             gamma=0.5
+         )
+         result = eval_fn(
+             response="Tokyo is the capital of Japan.",
+             ground_truth="The capital of Japan is Tokyo.")
+
+     **Output format**
+
+     .. code-block:: python

-     id = "azureml://registries/azureml/models/Meteor-Score-Evaluator/versions/3"
-     """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
+         {
+             "meteor_score": 0.62
+         }
+     """

      def __init__(self, alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5):
          self._async_evaluator = _AsyncMeteorScoreEvaluator(alpha=alpha, beta=beta, gamma=gamma)
@@ -80,7 +86,7 @@ class MeteorScoreEvaluator:
          :keyword ground_truth: The ground truth to be compared against.
          :paramtype ground_truth: str
          :return: The METEOR score.
-         :rtype: Dict[str, float]
+         :rtype: dict
          """
          return async_run_allowing_running_loop(
              self._async_evaluator, ground_truth=ground_truth, response=response, **kwargs
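
The first two hunks swap a shared ensure_nltk_data_downloaded() helper (1.0.0) for an inline WordNet check (1.0.0b1). A standalone sketch of the same guard, useful before a large batch run so the corpus download does not happen lazily inside the first evaluation; nltk.data.find and nltk.download are standard NLTK calls, and the corpus path mirrors the one in the hunk:

    import nltk

    def ensure_wordnet() -> None:
        """Download the WordNet corpus once so METEOR's synonym matching has its data available."""
        try:
            nltk.data.find("corpora/wordnet.zip")
        except LookupError:
            nltk.download("wordnet")

    ensure_wordnet()
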
azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py
@@ -1,113 +1,104 @@
  # ---------------------------------------------------------
  # Copyright (c) Microsoft Corporation. All rights reserved.
  # ---------------------------------------------------------
+ from promptflow._utils.async_utils import async_run_allowing_running_loop
+ from azure.ai.evaluation._common.constants import EvaluationMetrics
+ from azure.ai.evaluation._common.rai_service import evaluate_with_rai_service
+ from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+ from azure.ai.evaluation._model_configurations import AzureAIProject

- from typing import Dict, List, Optional, Union

- from typing_extensions import overload, override
+ class _AsyncProtectedMaterialEvaluator:
+     def __init__(self, azure_ai_project: dict, credential=None):
+         self._azure_ai_project = azure_ai_project
+         self._credential = credential

- from azure.ai.evaluation._common._experimental import experimental
- from azure.ai.evaluation._common.constants import EvaluationMetrics
- from azure.ai.evaluation._evaluators._common import RaiServiceEvaluatorBase
- from azure.ai.evaluation._model_configurations import Conversation
+     async def __call__(self, *, query: str, response: str, **kwargs):
+         """
+         Evaluates content according to this evaluator's metric.
+
+         :keyword query: The query to be evaluated.
+         :paramtype query: str
+         :keyword response: The response to be evaluated.
+         :paramtype response: str
+         :return: The evaluation score computation based on the Content Safety metric (self.metric).
+         :rtype: Any
+         """
+         # Validate inputs
+         # Raises value error if failed, so execution alone signifies success.
+         if not (query and query.strip() and query != "None") or not (
+             response and response.strip() and response != "None"
+         ):
+             msg = "Both 'query' and 'response' must be non-empty strings."
+             raise EvaluationException(
+                 message=msg,
+                 internal_message=msg,
+                 error_category=ErrorCategory.MISSING_FIELD,
+                 error_blame=ErrorBlame.USER_ERROR,
+                 error_target=ErrorTarget.PROTECTED_MATERIAL_EVALUATOR,
+             )
+
+         # Run score computation based on supplied metric.
+         result = await evaluate_with_rai_service(
+             metric_name=EvaluationMetrics.PROTECTED_MATERIAL,
+             query=query,
+             response=response,
+             project_scope=self._azure_ai_project,
+             credential=self._credential,
+         )
+         return result


- @experimental
- class ProtectedMaterialEvaluator(RaiServiceEvaluatorBase[Union[str, bool]]):
+ class ProtectedMaterialEvaluator:
      """
-     Evaluates the protected material score for a given query and response or a multi-turn conversation, with reasoning.
+     Initialize a protected material evaluator to detect whether protected material
+     is present in your AI system's response. Outputs True or False with AI-generated reasoning.

-     Protected material is any text that is under copyright, including song lyrics, recipes, and articles. Protected
-     material evaluation leverages the Azure AI Content Safety Protected Material for Text service to perform the
-     classification.
+     :param azure_ai_project: The scope of the Azure AI project.
+         It contains subscription id, resource group, and project name.
+     :type azure_ai_project: ~azure.ai.evaluation.AzureAIProject
+     :param credential: The credential for connecting to Azure AI project.
+     :type credential: ~azure.core.credentials.TokenCredential
+     :return: Whether or not protected material was found in the response, with AI-generated reasoning.
+     :rtype: Dict[str, str]

-     The protected material score is a boolean value, where True indicates that protected material was detected.
+     **Usage**

-     :param credential: The credential required for connecting to the Azure AI project.
-     :type credential: ~azure.core.credentials.TokenCredential
-     :param azure_ai_project: The scope of the Azure AI project, containing the subscription ID,
-         resource group, and project name.
-     :type azure_ai_project: ~azure.ai.evaluation.AzureAIProject
+     .. code-block:: python
+
+         azure_ai_project = {
+             "subscription_id": "<subscription_id>",
+             "resource_group_name": "<resource_group_name>",
+             "project_name": "<project_name>",
+         }
+         eval_fn = ProtectedMaterialEvaluator(azure_ai_project)
+         result = eval_fn(query="What is the capital of France?", response="Paris.")
+
+     **Output format**

-     .. admonition:: Example:
+     .. code-block:: python

-         .. literalinclude:: ../samples/evaluation_samples_evaluate.py
-             :start-after: [START protected_material_evaluator]
-             :end-before: [END protected_material_evaluator]
-             :language: python
-             :dedent: 8
-             :caption: Initialize and call a ProtectedMaterialEvaluator.
+         {
+             "protected_material_label": "False",
+             "protected_material_reason": "This query does not contain any protected material."
+         }
      """

-     id = "azureml://registries/azureml/models/Protected-Material-Evaluator/versions/3"
-     """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
-
-     @override
-     def __init__(
-         self,
-         credential,
-         azure_ai_project,
-     ):
-         super().__init__(
-             eval_metric=EvaluationMetrics.PROTECTED_MATERIAL,
-             azure_ai_project=azure_ai_project,
-             credential=credential,
-         )
+     def __init__(self, azure_ai_project: dict, credential=None):
+         self._async_evaluator = _AsyncProtectedMaterialEvaluator(azure_ai_project, credential)

-     @overload
-     def __call__(
-         self,
-         *,
-         query: str,
-         response: str,
-     ) -> Dict[str, Union[str, bool]]:
-         """Evaluate a given query/response pair for protected material
+     def __call__(self, *, query: str, response: str, **kwargs):
+         """
+         Evaluates protected material content.

          :keyword query: The query to be evaluated.
          :paramtype query: str
          :keyword response: The response to be evaluated.
          :paramtype response: str
-         :return: The protected material score.
-         :rtype: Dict[str, Union[str, bool]]
-         """
-
-     @overload
-     def __call__(
-         self,
-         *,
-         conversation: Conversation,
-     ) -> Dict[str, Union[float, Dict[str, List[Union[str, bool]]]]]:
-         """Evaluate a conversation for protected material
-
-         :keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
-             key "messages", and potentially a global context under the key "context". Conversation turns are expected
-             to be dictionaries with keys "content", "role", and possibly "context".
-         :paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
-         :return: The protected material score.
-         :rtype: Dict[str, Union[str, bool, Dict[str, List[Union[str, bool]]]]]
-         """
-
-     @override
-     def __call__(
-         self,
-         *,
-         query: Optional[str] = None,
-         response: Optional[str] = None,
-         conversation=None,
-         **kwargs,
-     ):
+         :return: A dictionary containing a boolean label and reasoning.
+         :rtype: dict
          """
-         Evaluate if protected material is present in your AI system's response.
+         return async_run_allowing_running_loop(self._async_evaluator, query=query, response=response, **kwargs)

-         :keyword query: The query to be evaluated.
-         :paramtype query: Optional[str]
-         :keyword response: The response to be evaluated.
-         :paramtype response: Optional[str]
-         :keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
-             key "messages". Conversation turns are expected
-             to be dictionaries with keys "content" and "role".
-         :paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
-         :return: The fluency score.
-         :rtype: Union[Dict[str, Union[str, bool]], Dict[str, Union[float, Dict[str, List[Union[str, bool]]]]]]
-         """
-         return super().__call__(query=query, response=response, conversation=conversation, **kwargs)
+     def _to_async(self):
+         return self._async_evaluator
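
The breaking part of this hunk is the constructor: 1.0.0b1 takes azure_ai_project first with an optional credential, while 1.0.0 requires credential as the first parameter. A small sketch that stays compatible with both orderings by passing everything by keyword; DefaultAzureCredential is one possible credential, not something this diff prescribes:

    from azure.identity import DefaultAzureCredential
    from azure.ai.evaluation import ProtectedMaterialEvaluator

    azure_ai_project = {
        "subscription_id": "<subscription_id>",
        "resource_group_name": "<resource_group_name>",
        "project_name": "<project_name>",
    }

    evaluator = ProtectedMaterialEvaluator(
        azure_ai_project=azure_ai_project,
        credential=DefaultAzureCredential(),  # required first positional in 1.0.0, optional in 1.0.0b1
    )
    result = evaluator(query="What is the capital of France?", response="Paris.")

The shape of the result also differs: the beta documents string fields protected_material_label and protected_material_reason, whereas 1.0.0 returns a boolean label and additionally accepts a conversation input.
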
azure/ai/evaluation/_evaluators/_protected_materials/__init__.py
@@ -0,0 +1,5 @@
+ from ._protected_materials import ProtectedMaterialsEvaluator
+
+ __all__ = [
+     "ProtectedMaterialsEvaluator",
+ ]
azure/ai/evaluation/_evaluators/_protected_materials/_protected_materials.py
@@ -0,0 +1,104 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+ from promptflow._utils.async_utils import async_run_allowing_running_loop
+ from azure.ai.evaluation._common.constants import EvaluationMetrics
+ from azure.ai.evaluation._common.rai_service import evaluate_with_rai_service
+ from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+ from azure.ai.evaluation._model_configurations import AzureAIProject
+
+
+ class _AsyncProtectedMaterialsEvaluator:
+     def __init__(self, azure_ai_project: dict, credential=None):
+         self._azure_ai_project = azure_ai_project
+         self._credential = credential
+
+     async def __call__(self, *, query: str, response: str, **kwargs):
+         """
+         Evaluates content according to this evaluator's metric.
+
+         :keyword query: The query to be evaluated.
+         :paramtype query: str
+         :keyword response: The response to be evaluated.
+         :paramtype response: str
+         :return: The evaluation score computation based on the Content Safety metric (self.metric).
+         :rtype: Any
+         """
+         # Validate inputs
+         # Raises value error if failed, so execution alone signifies success.
+         if not (query and query.strip() and query != "None") or not (
+             response and response.strip() and response != "None"
+         ):
+             msg = "Both 'query' and 'response' must be non-empty strings."
+             raise EvaluationException(
+                 message=msg,
+                 internal_message=msg,
+                 error_category=ErrorCategory.MISSING_FIELD,
+                 error_blame=ErrorBlame.USER_ERROR,
+                 error_target=ErrorTarget.PROTECTED_MATERIAL_EVALUATOR,
+             )
+
+         # Run score computation based on supplied metric.
+         result = await evaluate_with_rai_service(
+             metric_name=EvaluationMetrics.PROTECTED_MATERIAL,
+             query=query,
+             response=response,
+             project_scope=self._azure_ai_project,
+             credential=self._credential,
+         )
+         return result
+
+
+ class ProtectedMaterialsEvaluator:
+     """
+     Initialize a protected materials evaluator to detect whether protected material
+     is present in your AI system's response. Outputs True or False with AI-generated reasoning.
+
+     :param azure_ai_project: The scope of the Azure AI project.
+         It contains subscription id, resource group, and project name.
+     :type azure_ai_project: ~azure.ai.evaluation.AzureAIProject
+     :param credential: The credential for connecting to Azure AI project.
+     :type credential: ~azure.core.credentials.TokenCredential
+     :return: Whether or not protected material was found in the response, with AI-generated reasoning.
+     :rtype: Dict[str, str]
+
+     **Usage**
+
+     .. code-block:: python
+
+         azure_ai_project = {
+             "subscription_id": "<subscription_id>",
+             "resource_group_name": "<resource_group_name>",
+             "project_name": "<project_name>",
+         }
+         eval_fn = ProtectedMaterialsEvaluator(azure_ai_project)
+         result = eval_fn(query="What is the capital of France?", response="Paris.")
+
+     **Output format**
+
+     .. code-block:: python
+
+         {
+             "label": "False",
+             "reasoning": "This query does not contain any protected material."
+         }
+     """
+
+     def __init__(self, azure_ai_project: dict, credential=None):
+         self._async_evaluator = _AsyncProtectedMaterialsEvaluator(azure_ai_project, credential)
+
+     def __call__(self, *, query: str, response: str, **kwargs):
+         """
+         Evaluates protected materials content.
+
+         :keyword query: The query to be evaluated.
+         :paramtype query: str
+         :keyword response: The response to be evaluated.
+         :paramtype response: str
+         :return: A dictionary containing a boolean label and reasoning.
+         :rtype: dict
+         """
+         return async_run_allowing_running_loop(self._async_evaluator, query=query, response=response, **kwargs)
+
+     def _to_async(self):
+         return self._async_evaluator
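
Apart from the class name, this file is essentially line-for-line the singular evaluator above; the user-visible difference is the documented output keys (label/reasoning here versus protected_material_label/protected_material_reason). A tiny, hypothetical adapter (not part of the package) for code that has to consume results from either class:

    def normalize_protected_material(result: dict) -> dict:
        """Map either documented key set onto the singular evaluator's names."""
        label = result.get("protected_material_label", result.get("label"))
        reason = result.get("protected_material_reason", result.get("reasoning"))
        return {"protected_material_label": label, "protected_material_reason": reason}
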
azure/ai/evaluation/_evaluators/_qa/_qa.py
@@ -3,7 +3,7 @@
  # ---------------------------------------------------------

  from concurrent.futures import as_completed
- from typing import Callable, Dict, List, Union
+ from typing import Union

  from promptflow.tracing import ThreadPoolExecutorWithContext as ThreadPoolExecutor

@@ -11,6 +11,7 @@ from .._coherence import CoherenceEvaluator
  from .._f1_score import F1ScoreEvaluator
  from .._fluency import FluencyEvaluator
  from .._groundedness import GroundednessEvaluator
+ from ..._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
  from .._relevance import RelevanceEvaluator
  from .._similarity import SimilarityEvaluator

@@ -22,33 +23,41 @@ class QAEvaluator:
      :param model_config: Configuration for the Azure OpenAI model.
      :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
          ~azure.ai.evaluation.OpenAIModelConfiguration]
-     :return: A callable class that evaluates and generates metrics for "question-answering" scenario.
-     :param kwargs: Additional arguments to pass to the evaluator.
-     :type kwargs: Any
+     :return: A function that evaluates and generates metrics for "question-answering" scenario.
+     :rtype: Callable

-     .. admonition:: Example:
+     **Usage**

-         .. literalinclude:: ../samples/evaluation_samples_evaluate.py
-             :start-after: [START qa_evaluator]
-             :end-before: [END qa_evaluator]
-             :language: python
-             :dedent: 8
-             :caption: Initialize and call a QAEvaluator.
+     .. code-block:: python

-     .. note::
+         eval_fn = QAEvaluator(model_config)
+         result = qa_eval(
+             query="Tokyo is the capital of which country?",
+             response="Japan",
+             context="Tokyo is the capital of Japan.",
+             ground_truth="Japan"
+         )

-         To align with our support of a diverse set of models, keys without the `gpt_` prefix has been added.
-         To maintain backwards compatibility, the old keys with the `gpt_` prefix are still be present in the output;
-         however, it is recommended to use the new keys moving forward as the old keys will be deprecated in the future.
-     """
+     **Output format**
+
+     .. code-block:: python

-     id = "qa"
-     """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
+         {
+             "gpt_groundedness": 3.5,
+             "gpt_relevance": 4.0,
+             "gpt_coherence": 1.5,
+             "gpt_fluency": 4.0,
+             "gpt_similarity": 3.0,
+             "f1_score": 0.42
+         }
+     """

-     def __init__(self, model_config, **kwargs):
-         self._parallel = kwargs.pop("_parallel", False)
+     def __init__(
+         self, model_config: dict, parallel: bool = True
+     ):
+         self._parallel = parallel

-         self._evaluators: List[Union[Callable[..., Dict[str, Union[str, float]]], Callable[..., Dict[str, float]]]] = [
+         self._evaluators = [
              GroundednessEvaluator(model_config),
              RelevanceEvaluator(model_config),
              CoherenceEvaluator(model_config),
@@ -69,15 +78,22 @@ class QAEvaluator:
          :paramtype context: str
          :keyword ground_truth: The ground truth to be evaluated.
          :paramtype ground_truth: str
+         :keyword parallel: Whether to evaluate in parallel. Defaults to True.
+         :paramtype parallel: bool
          :return: The scores for QA scenario.
-         :rtype: Dict[str, Union[str, float]]
+         :rtype: dict
          """
-         results: Dict[str, Union[str, float]] = {}
+         results = {}
          if self._parallel:
              with ThreadPoolExecutor() as executor:
                  futures = {
                      executor.submit(
-                         evaluator, query=query, response=response, context=context, ground_truth=ground_truth, **kwargs
+                         evaluator,
+                         query=query,
+                         response=response,
+                         context=context,
+                         ground_truth=ground_truth,
+                         **kwargs
                      ): evaluator
                      for evaluator in self._evaluators
                  }
@@ -87,7 +103,9 @@
                      results.update(future.result())
          else:
              for evaluator in self._evaluators:
-                 result = evaluator(query=query, response=response, context=context, ground_truth=ground_truth, **kwargs)
+                 result = evaluator(
+                     query=query, response=response, context=context, ground_truth=ground_truth, **kwargs
+                 )
                  results.update(result)

          return results
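
Two differences matter for callers of QAEvaluator: the beta exposes parallel execution as a public parallel=True constructor argument, while 1.0.0 hides it behind a private _parallel kwarg defaulting to False, and the beta reports only gpt_-prefixed metric names, while 1.0.0 adds un-prefixed keys next to them. A short sketch that sticks to what both wheels accept; the model_config values are placeholders, as in the groundedness sketch above:

    from azure.ai.evaluation import QAEvaluator

    model_config = {
        "azure_endpoint": "https://<resource>.openai.azure.com",  # placeholder
        "azure_deployment": "<deployment>",                       # placeholder
        "api_key": "<api-key>",                                   # placeholder
    }

    qa_eval = QAEvaluator(model_config)  # positional model_config works on both wheels
    # qa_eval = QAEvaluator(model_config, parallel=False)  # 1.0.0b1 only; 1.0.0 uses the private _parallel kwarg

    result = qa_eval(
        query="Tokyo is the capital of which country?",
        response="Japan",
        context="Tokyo is the capital of Japan.",
        ground_truth="Japan",
    )
    f1 = result["f1_score"]  # present in both; LLM-judged metrics are gpt_-prefixed in the beta
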