azure-ai-evaluation 1.0.0b2__py3-none-any.whl → 1.13.3__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of azure-ai-evaluation might be problematic.
- azure/ai/evaluation/__init__.py +100 -5
- azure/ai/evaluation/{_evaluators/_chat → _aoai}/__init__.py +3 -2
- azure/ai/evaluation/_aoai/aoai_grader.py +140 -0
- azure/ai/evaluation/_aoai/label_grader.py +68 -0
- azure/ai/evaluation/_aoai/python_grader.py +86 -0
- azure/ai/evaluation/_aoai/score_model_grader.py +94 -0
- azure/ai/evaluation/_aoai/string_check_grader.py +66 -0
- azure/ai/evaluation/_aoai/text_similarity_grader.py +80 -0
- azure/ai/evaluation/_azure/__init__.py +3 -0
- azure/ai/evaluation/_azure/_clients.py +204 -0
- azure/ai/evaluation/_azure/_envs.py +207 -0
- azure/ai/evaluation/_azure/_models.py +227 -0
- azure/ai/evaluation/_azure/_token_manager.py +129 -0
- azure/ai/evaluation/_common/__init__.py +9 -1
- azure/ai/evaluation/{simulator/_helpers → _common}/_experimental.py +24 -9
- azure/ai/evaluation/_common/constants.py +131 -2
- azure/ai/evaluation/_common/evaluation_onedp_client.py +169 -0
- azure/ai/evaluation/_common/math.py +89 -0
- azure/ai/evaluation/_common/onedp/__init__.py +32 -0
- azure/ai/evaluation/_common/onedp/_client.py +166 -0
- azure/ai/evaluation/_common/onedp/_configuration.py +72 -0
- azure/ai/evaluation/_common/onedp/_model_base.py +1232 -0
- azure/ai/evaluation/_common/onedp/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/_serialization.py +2032 -0
- azure/ai/evaluation/_common/onedp/_types.py +21 -0
- azure/ai/evaluation/_common/onedp/_utils/__init__.py +6 -0
- azure/ai/evaluation/_common/onedp/_utils/model_base.py +1232 -0
- azure/ai/evaluation/_common/onedp/_utils/serialization.py +2032 -0
- azure/ai/evaluation/_common/onedp/_validation.py +66 -0
- azure/ai/evaluation/_common/onedp/_vendor.py +50 -0
- azure/ai/evaluation/_common/onedp/_version.py +9 -0
- azure/ai/evaluation/_common/onedp/aio/__init__.py +29 -0
- azure/ai/evaluation/_common/onedp/aio/_client.py +168 -0
- azure/ai/evaluation/_common/onedp/aio/_configuration.py +72 -0
- azure/ai/evaluation/_common/onedp/aio/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/aio/operations/__init__.py +49 -0
- azure/ai/evaluation/_common/onedp/aio/operations/_operations.py +7143 -0
- azure/ai/evaluation/_common/onedp/aio/operations/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/models/__init__.py +358 -0
- azure/ai/evaluation/_common/onedp/models/_enums.py +447 -0
- azure/ai/evaluation/_common/onedp/models/_models.py +5963 -0
- azure/ai/evaluation/_common/onedp/models/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/operations/__init__.py +49 -0
- azure/ai/evaluation/_common/onedp/operations/_operations.py +8951 -0
- azure/ai/evaluation/_common/onedp/operations/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/py.typed +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/__init__.py +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/aio/__init__.py +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_operations.py +34 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/__init__.py +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/__init__.py +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/__init__.py +22 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_operations.py +29 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/__init__.py +22 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_operations.py +29 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/operations/_operations.py +34 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/rai_service.py +831 -142
- azure/ai/evaluation/_common/raiclient/__init__.py +34 -0
- azure/ai/evaluation/_common/raiclient/_client.py +128 -0
- azure/ai/evaluation/_common/raiclient/_configuration.py +87 -0
- azure/ai/evaluation/_common/raiclient/_model_base.py +1235 -0
- azure/ai/evaluation/_common/raiclient/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/_serialization.py +2050 -0
- azure/ai/evaluation/_common/raiclient/_version.py +9 -0
- azure/ai/evaluation/_common/raiclient/aio/__init__.py +29 -0
- azure/ai/evaluation/_common/raiclient/aio/_client.py +130 -0
- azure/ai/evaluation/_common/raiclient/aio/_configuration.py +87 -0
- azure/ai/evaluation/_common/raiclient/aio/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/_operations.py +981 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/models/__init__.py +60 -0
- azure/ai/evaluation/_common/raiclient/models/_enums.py +18 -0
- azure/ai/evaluation/_common/raiclient/models/_models.py +651 -0
- azure/ai/evaluation/_common/raiclient/models/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/raiclient/operations/_operations.py +1238 -0
- azure/ai/evaluation/_common/raiclient/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/py.typed +1 -0
- azure/ai/evaluation/_common/utils.py +870 -34
- azure/ai/evaluation/_constants.py +167 -6
- azure/ai/evaluation/_converters/__init__.py +3 -0
- azure/ai/evaluation/_converters/_ai_services.py +899 -0
- azure/ai/evaluation/_converters/_models.py +467 -0
- azure/ai/evaluation/_converters/_sk_services.py +495 -0
- azure/ai/evaluation/_eval_mapping.py +83 -0
- azure/ai/evaluation/_evaluate/_batch_run/__init__.py +17 -0
- azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +176 -0
- azure/ai/evaluation/_evaluate/_batch_run/batch_clients.py +82 -0
- azure/ai/evaluation/_evaluate/{_batch_run_client → _batch_run}/code_client.py +47 -25
- azure/ai/evaluation/_evaluate/{_batch_run_client/batch_run_context.py → _batch_run/eval_run_context.py} +42 -13
- azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +124 -0
- azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +62 -0
- azure/ai/evaluation/_evaluate/_eval_run.py +102 -59
- azure/ai/evaluation/_evaluate/_evaluate.py +2134 -311
- azure/ai/evaluation/_evaluate/_evaluate_aoai.py +992 -0
- azure/ai/evaluation/_evaluate/_telemetry/__init__.py +14 -99
- azure/ai/evaluation/_evaluate/_utils.py +289 -40
- azure/ai/evaluation/_evaluator_definition.py +76 -0
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +93 -42
- azure/ai/evaluation/_evaluators/_code_vulnerability/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +119 -0
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +117 -91
- azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +76 -39
- azure/ai/evaluation/_evaluators/_common/__init__.py +15 -0
- azure/ai/evaluation/_evaluators/_common/_base_eval.py +742 -0
- azure/ai/evaluation/_evaluators/_common/_base_multi_eval.py +63 -0
- azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +345 -0
- azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +198 -0
- azure/ai/evaluation/_evaluators/_common/_conversation_aggregators.py +49 -0
- azure/ai/evaluation/_evaluators/_content_safety/__init__.py +0 -4
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +144 -86
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +138 -57
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +123 -55
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +133 -54
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +134 -54
- azure/ai/evaluation/_evaluators/_document_retrieval/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py +442 -0
- azure/ai/evaluation/_evaluators/_eci/_eci.py +49 -56
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +102 -60
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +115 -92
- azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +66 -41
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +90 -37
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +318 -82
- azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +114 -0
- azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +104 -0
- azure/ai/evaluation/{_evaluate/_batch_run_client → _evaluators/_intent_resolution}/__init__.py +3 -4
- azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +196 -0
- azure/ai/evaluation/_evaluators/_intent_resolution/intent_resolution.prompty +275 -0
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +107 -61
- azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +104 -77
- azure/ai/evaluation/_evaluators/_qa/_qa.py +115 -63
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +182 -98
- azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +178 -49
- azure/ai/evaluation/_evaluators/_response_completeness/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +202 -0
- azure/ai/evaluation/_evaluators/_response_completeness/response_completeness.prompty +84 -0
- azure/ai/evaluation/_evaluators/{_chat/retrieval → _retrieval}/__init__.py +2 -2
- azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +148 -0
- azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +93 -0
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +189 -50
- azure/ai/evaluation/_evaluators/_service_groundedness/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +179 -0
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +102 -91
- azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +0 -5
- azure/ai/evaluation/_evaluators/_task_adherence/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +226 -0
- azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +101 -0
- azure/ai/evaluation/_evaluators/_task_completion/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_task_completion/_task_completion.py +177 -0
- azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty +220 -0
- azure/ai/evaluation/_evaluators/_task_navigation_efficiency/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py +384 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +298 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +166 -0
- azure/ai/evaluation/_evaluators/_tool_input_accuracy/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py +263 -0
- azure/ai/evaluation/_evaluators/_tool_input_accuracy/tool_input_accuracy.prompty +76 -0
- azure/ai/evaluation/_evaluators/_tool_output_utilization/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_tool_output_utilization/_tool_output_utilization.py +225 -0
- azure/ai/evaluation/_evaluators/_tool_output_utilization/tool_output_utilization.prompty +221 -0
- azure/ai/evaluation/_evaluators/_tool_selection/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_tool_selection/_tool_selection.py +266 -0
- azure/ai/evaluation/_evaluators/_tool_selection/tool_selection.prompty +104 -0
- azure/ai/evaluation/_evaluators/_tool_success/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_tool_success/_tool_success.py +301 -0
- azure/ai/evaluation/_evaluators/_tool_success/tool_success.prompty +321 -0
- azure/ai/evaluation/_evaluators/_ungrounded_attributes/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +102 -0
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +109 -107
- azure/ai/evaluation/_exceptions.py +51 -7
- azure/ai/evaluation/_http_utils.py +210 -137
- azure/ai/evaluation/_legacy/__init__.py +3 -0
- azure/ai/evaluation/_legacy/_adapters/__init__.py +7 -0
- azure/ai/evaluation/_legacy/_adapters/_check.py +17 -0
- azure/ai/evaluation/_legacy/_adapters/_configuration.py +45 -0
- azure/ai/evaluation/_legacy/_adapters/_constants.py +10 -0
- azure/ai/evaluation/_legacy/_adapters/_errors.py +29 -0
- azure/ai/evaluation/_legacy/_adapters/_flows.py +28 -0
- azure/ai/evaluation/_legacy/_adapters/_service.py +16 -0
- azure/ai/evaluation/_legacy/_adapters/client.py +51 -0
- azure/ai/evaluation/_legacy/_adapters/entities.py +26 -0
- azure/ai/evaluation/_legacy/_adapters/tracing.py +28 -0
- azure/ai/evaluation/_legacy/_adapters/types.py +15 -0
- azure/ai/evaluation/_legacy/_adapters/utils.py +31 -0
- azure/ai/evaluation/_legacy/_batch_engine/__init__.py +9 -0
- azure/ai/evaluation/_legacy/_batch_engine/_config.py +48 -0
- azure/ai/evaluation/_legacy/_batch_engine/_engine.py +477 -0
- azure/ai/evaluation/_legacy/_batch_engine/_exceptions.py +88 -0
- azure/ai/evaluation/_legacy/_batch_engine/_openai_injector.py +132 -0
- azure/ai/evaluation/_legacy/_batch_engine/_result.py +107 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run.py +127 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run_storage.py +128 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +262 -0
- azure/ai/evaluation/_legacy/_batch_engine/_status.py +25 -0
- azure/ai/evaluation/_legacy/_batch_engine/_trace.py +97 -0
- azure/ai/evaluation/_legacy/_batch_engine/_utils.py +97 -0
- azure/ai/evaluation/_legacy/_batch_engine/_utils_deprecated.py +131 -0
- azure/ai/evaluation/_legacy/_common/__init__.py +3 -0
- azure/ai/evaluation/_legacy/_common/_async_token_provider.py +117 -0
- azure/ai/evaluation/_legacy/_common/_logging.py +292 -0
- azure/ai/evaluation/_legacy/_common/_thread_pool_executor_with_context.py +17 -0
- azure/ai/evaluation/_legacy/prompty/__init__.py +36 -0
- azure/ai/evaluation/_legacy/prompty/_connection.py +119 -0
- azure/ai/evaluation/_legacy/prompty/_exceptions.py +139 -0
- azure/ai/evaluation/_legacy/prompty/_prompty.py +430 -0
- azure/ai/evaluation/_legacy/prompty/_utils.py +663 -0
- azure/ai/evaluation/_legacy/prompty/_yaml_utils.py +99 -0
- azure/ai/evaluation/_model_configurations.py +130 -8
- azure/ai/evaluation/_safety_evaluation/__init__.py +3 -0
- azure/ai/evaluation/_safety_evaluation/_generated_rai_client.py +0 -0
- azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +917 -0
- azure/ai/evaluation/_user_agent.py +32 -1
- azure/ai/evaluation/_vendor/__init__.py +3 -0
- azure/ai/evaluation/_vendor/rouge_score/__init__.py +14 -0
- azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +324 -0
- azure/ai/evaluation/_vendor/rouge_score/scoring.py +59 -0
- azure/ai/evaluation/_vendor/rouge_score/tokenize.py +59 -0
- azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +53 -0
- azure/ai/evaluation/_version.py +2 -1
- azure/ai/evaluation/red_team/__init__.py +22 -0
- azure/ai/evaluation/red_team/_agent/__init__.py +3 -0
- azure/ai/evaluation/red_team/_agent/_agent_functions.py +261 -0
- azure/ai/evaluation/red_team/_agent/_agent_tools.py +461 -0
- azure/ai/evaluation/red_team/_agent/_agent_utils.py +89 -0
- azure/ai/evaluation/red_team/_agent/_semantic_kernel_plugin.py +228 -0
- azure/ai/evaluation/red_team/_attack_objective_generator.py +268 -0
- azure/ai/evaluation/red_team/_attack_strategy.py +49 -0
- azure/ai/evaluation/red_team/_callback_chat_target.py +115 -0
- azure/ai/evaluation/red_team/_default_converter.py +21 -0
- azure/ai/evaluation/red_team/_evaluation_processor.py +505 -0
- azure/ai/evaluation/red_team/_mlflow_integration.py +430 -0
- azure/ai/evaluation/red_team/_orchestrator_manager.py +803 -0
- azure/ai/evaluation/red_team/_red_team.py +1717 -0
- azure/ai/evaluation/red_team/_red_team_result.py +661 -0
- azure/ai/evaluation/red_team/_result_processor.py +1708 -0
- azure/ai/evaluation/red_team/_utils/__init__.py +37 -0
- azure/ai/evaluation/red_team/_utils/_rai_service_eval_chat_target.py +128 -0
- azure/ai/evaluation/red_team/_utils/_rai_service_target.py +601 -0
- azure/ai/evaluation/red_team/_utils/_rai_service_true_false_scorer.py +114 -0
- azure/ai/evaluation/red_team/_utils/constants.py +72 -0
- azure/ai/evaluation/red_team/_utils/exception_utils.py +345 -0
- azure/ai/evaluation/red_team/_utils/file_utils.py +266 -0
- azure/ai/evaluation/red_team/_utils/formatting_utils.py +365 -0
- azure/ai/evaluation/red_team/_utils/logging_utils.py +139 -0
- azure/ai/evaluation/red_team/_utils/metric_mapping.py +73 -0
- azure/ai/evaluation/red_team/_utils/objective_utils.py +46 -0
- azure/ai/evaluation/red_team/_utils/progress_utils.py +252 -0
- azure/ai/evaluation/red_team/_utils/retry_utils.py +218 -0
- azure/ai/evaluation/red_team/_utils/strategy_utils.py +218 -0
- azure/ai/evaluation/simulator/__init__.py +2 -1
- azure/ai/evaluation/simulator/_adversarial_scenario.py +26 -1
- azure/ai/evaluation/simulator/_adversarial_simulator.py +270 -144
- azure/ai/evaluation/simulator/_constants.py +12 -1
- azure/ai/evaluation/simulator/_conversation/__init__.py +151 -23
- azure/ai/evaluation/simulator/_conversation/_conversation.py +10 -6
- azure/ai/evaluation/simulator/_conversation/constants.py +1 -1
- azure/ai/evaluation/simulator/_data_sources/__init__.py +3 -0
- azure/ai/evaluation/simulator/_data_sources/grounding.json +1150 -0
- azure/ai/evaluation/simulator/_direct_attack_simulator.py +54 -75
- azure/ai/evaluation/simulator/_helpers/__init__.py +1 -2
- azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +1 -0
- azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +26 -5
- azure/ai/evaluation/simulator/_indirect_attack_simulator.py +145 -104
- azure/ai/evaluation/simulator/_model_tools/__init__.py +2 -1
- azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +225 -0
- azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +80 -30
- azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +117 -45
- azure/ai/evaluation/simulator/_model_tools/_rai_client.py +109 -7
- azure/ai/evaluation/simulator/_model_tools/_template_handler.py +97 -33
- azure/ai/evaluation/simulator/_model_tools/models.py +30 -27
- azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +6 -10
- azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +6 -5
- azure/ai/evaluation/simulator/_simulator.py +302 -208
- azure/ai/evaluation/simulator/_utils.py +31 -13
- azure_ai_evaluation-1.13.3.dist-info/METADATA +939 -0
- azure_ai_evaluation-1.13.3.dist-info/RECORD +305 -0
- {azure_ai_evaluation-1.0.0b2.dist-info → azure_ai_evaluation-1.13.3.dist-info}/WHEEL +1 -1
- azure_ai_evaluation-1.13.3.dist-info/licenses/NOTICE.txt +70 -0
- azure/ai/evaluation/_evaluate/_batch_run_client/proxy_client.py +0 -71
- azure/ai/evaluation/_evaluators/_chat/_chat.py +0 -357
- azure/ai/evaluation/_evaluators/_chat/retrieval/_retrieval.py +0 -157
- azure/ai/evaluation/_evaluators/_chat/retrieval/retrieval.prompty +0 -48
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_base.py +0 -65
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +0 -301
- azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +0 -54
- azure/ai/evaluation/_evaluators/_protected_materials/__init__.py +0 -5
- azure/ai/evaluation/_evaluators/_protected_materials/_protected_materials.py +0 -104
- azure/ai/evaluation/simulator/_tracing.py +0 -89
- azure_ai_evaluation-1.0.0b2.dist-info/METADATA +0 -449
- azure_ai_evaluation-1.0.0b2.dist-info/RECORD +0 -99
- {azure_ai_evaluation-1.0.0b2.dist-info → azure_ai_evaluation-1.13.3.dist-info}/top_level.txt +0 -0
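For orientation, the sketch below shows how evaluators added in this release range are typically driven through the package's public entry points. It is a minimal illustration only, not taken from the diff, and it assumes the new classes (for example IntentResolutionEvaluator) are re-exported from the package root, as the expanded azure/ai/evaluation/__init__.py (+100 lines) suggests; the 1.13.3 METADATA listed above is the authoritative reference.

```python
# Minimal sketch (assumption: root-level re-exports and an Azure OpenAI chat deployment).
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    IntentResolutionEvaluator,  # added between 1.0.0b2 and 1.13.3
)

# AzureOpenAIModelConfiguration is a TypedDict; a plain dict with these keys works.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "<chat-deployment>",
}

result = evaluate(
    data="data.jsonl",  # one JSON object per line with query/response/context columns
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "intent_resolution": IntentResolutionEvaluator(model_config),
    },
)
print(result["metrics"])
```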
azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty
ADDED
@@ -0,0 +1,114 @@
+---
+name: Groundedness
+description: Evaluates groundedness score for RAG scenario
+model:
+  api: chat
+  parameters:
+    temperature: 0.0
+    max_tokens: 800
+    top_p: 1.0
+    presence_penalty: 0
+    frequency_penalty: 0
+    response_format:
+      type: text
+
+inputs:
+  query:
+    type: string
+  response:
+    type: string
+  context:
+    type: string
+
+
+---
+system:
+# Instruction
+## Goal
+### You are an expert in evaluating the quality of a RESPONSE from an intelligent system based on provided definition and data. Your goal will involve answering the questions below using the information provided.
+- **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
+- **Data**: Your input data include CONTEXT, QUERY, and RESPONSE.
+- **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways.
+
+user:
+# Definition
+**Groundedness** refers to how well an answer is anchored in the provided context, evaluating its relevance, accuracy, and completeness based exclusively on that context. It assesses the extent to which the answer directly and fully addresses the question without introducing unrelated or incorrect information.
+
+> Context is the source of truth for evaluating the response. If it's empty, rely on the tool results in the response and query.
+> Evaluate the groundedness of the response message, not the chat history.
+
+# Ratings
+## [Groundedness: 1] (Completely Unrelated Response)
+**Definition:** An answer that does not relate to the question or the context in any way.
+- Does not relate to the question or context at all.
+- Talks about the general topic but does not respond to the query.
+
+**Examples:**
+**Context:** The company's annual meeting will be held next Thursday.
+**Query:** When is the company's annual meeting?
+**Response:** I enjoy hiking in the mountains during summer.
+
+**Context:** The museum will exhibit modern art pieces from various local artists.
+**Query:** What kind of art will be exhibited at the museum?
+**Response:** Museums are important cultural institutions.
+
+## [Groundedness: 2] (Attempts to Respond but Contains Incorrect Information)
+**Definition:** An answer that attempts to respond to the question but includes incorrect information not supported by the context. It may misstate facts, misinterpret the context, or provide erroneous details. Even if some points are correct, the presence of inaccuracies makes the response unreliable.
+
+**Examples:**
+**Context:** - The festival starts on June 5th and features international musicians.
+**Query:** When does the festival start?
+**Response:** The festival starts on July 5th and features local artists.
+
+**Context:** bakery_menu.txt: - Croissant au Beurre — flaky, buttery croissant
+**Query:** [{"role":"user","content":"Are there croissants?"}]
+**Response:** [{"role":"assistant","content":"Yes, Croissant au Beurre is on the menu, served with jam."}]
+
+## [Groundedness: 3] (Nothing to be Grounded)
+Definition: An answer that does not provide any information that can be evaluated against the context. This includes responses that are asking for clarification, providing polite fillers, or follow-up questions.
+
+**Examples:**
+**Context:**
+**Query:** [{"role":"user","content":"How many eggs are needed for the recipe?"}, {"role":"tool","content":"tool_result": [{"file_name": "recipe.txt", "content": "The recipe requires two eggs and one cup of milk."}]}, {"role":"assistant","content":"You need three eggs for the recipe."}, {"role":"user","content":"Thank you."}]
+**Response:** [{"role":"assistant","content":"You're welcome, anything else I can help with?"}]
+
+## [Groundedness: 4] (Partially Correct Response)
+**Definition:** An answer that provides a correct response to the question but is incomplete or lacks specific details mentioned in the context. It captures some of the necessary information but omits key elements needed for a full understanding.
+
+**Examples:**
+**Context:** - store_details.txt: The bookstore offers a 15% discount to students and a 10% discount to senior citizens.
+**Query:** [{"role":"user","content":"What discount does the bookstore offer to students, if any?"}]
+**Response:** [{"role":"assistant","content":"Yes, students get a discount at the bookstore."}]
+
+**Context:** The company's headquarters are located in Berlin, Germany.
+**Query:** Where are the company's headquarters?
+**Response:** The company's headquarters are in Germany.
+
+## [Groundedness: 5] (Fully Correct and Complete Response)
+**Definition:** An answer that thoroughly and accurately responds to the question, including all relevant details from the context. It directly addresses the question with precise information, demonstrating complete understanding without adding extraneous information.
+
+**Examples:**
+**CONTEXT:** The author released her latest novel, 'The Silent Echo', on September 1st.
+**QUERY:** [{"role":"user","content":"When was 'The Silent Echo' released?"}]
+**RESPONSE:** [{"role":"assistant","content":"The 'Silent Echo' was released on September 1st."}]
+
+**Context:**
+**Query:** By what date must participants register to receive early bird pricing?
+**Response:** [{"role":"tool","content":"tool_result": [{"file_name": "store_guidelines.txt", "content": "Participants registering before and including May 31st will be eligible for early bird pricing."}]}, {"role":"assistant","content":"Participants must register by May 31st to receive early bird pricing."}]
+
+
+# Data
+CONTEXT: {{context}}
+QUERY: {{query}}
+RESPONSE: {{response}}
+
+
+# Tasks
+## Please provide your assessment Score for the previous RESPONSE message in relation to the CONTEXT, QUERY and RESPONSE tools based on the Definitions above. Your output should include the following information:
+- **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
+- **Explanation**: a very short explanation of why you think the input Data should get that Score.
+- **Score**: based on your previous analysis, provide your Score. The Score you give MUST be a integer score (i.e., "1", "2"...) based on the levels of the definitions.
+
+
+## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
+# Output
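The template above instructs the judge model to wrap its chain of thought, explanation, and score in `<S0>`, `<S1>`, and `<S2>` tags. The SDK parses that reply internally; the snippet below is only an illustrative sketch of how such a tagged reply could be pulled apart, and the sample reply text is hypothetical.

```python
import re

# Hypothetical judge reply following the <S0>/<S1>/<S2> contract from the prompty above.
reply = "<S0>Let's think step by step: ...</S0><S1>The response omits the date.</S1><S2>4</S2>"

def parse_tagged_reply(text: str) -> dict:
    """Extract chain of thought, explanation, and integer score from a tagged reply."""
    fields = {}
    for tag, key in (("S0", "thought_chain"), ("S1", "explanation"), ("S2", "score")):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[key] = match.group(1).strip() if match else None
    if fields["score"] is not None:
        fields["score"] = int(fields["score"])
    return fields

print(parse_tagged_reply(reply))  # {'thought_chain': "...", 'explanation': '...', 'score': 4}
```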
azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty
ADDED
@@ -0,0 +1,104 @@
+---
+name: Groundedness
+description: Evaluates groundedness score for RAG scenario
+model:
+  api: chat
+  parameters:
+    temperature: 0.0
+    max_tokens: 800
+    top_p: 1.0
+    presence_penalty: 0
+    frequency_penalty: 0
+    response_format:
+      type: text
+
+inputs:
+  response:
+    type: string
+  context:
+    type: string
+
+---
+system:
+# Instruction
+## Goal
+### You are an expert in evaluating the quality of a RESPONSE from an intelligent system based on provided definition and data. Your goal will involve answering the questions below using the information provided.
+- **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
+- **Data**: Your input data include CONTEXT and RESPONSE.
+- **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways.
+
+user:
+# Definition
+**Groundedness** refers to how well a response is anchored in the provided context, evaluating its relevance, accuracy, and completeness based exclusively on that context. It assesses the extent to which the response directly and fully addresses the information without introducing unrelated or incorrect information.
+
+> Context is the source of truth for evaluating the response.
+> Evaluate the groundedness of the response message based on the provided context.
+
+# Ratings
+## [Groundedness: 1] (Completely Unrelated Response)
+**Definition:** A response that does not relate to the context in any way.
+- Does not relate to the context at all.
+- Talks about the general topic but does not respond to the context.
+
+**Examples:**
+**Context:** The company's profits increased by 20% in the last quarter.
+**Response:** I enjoy playing soccer on weekends with my friends.
+
+**Context:** The new smartphone model features a larger display and improved battery life.
+**Response:** The history of ancient Egypt is fascinating and full of mysteries.
+
+## [Groundedness: 2] (Attempts to Respond but Contains Incorrect Information)
+**Definition:** A response that attempts to relate to the context but includes incorrect information not supported by the context. It may misstate facts, misinterpret the context, or provide erroneous details. Even if some points are correct, the presence of inaccuracies makes the response unreliable.
+
+**Examples:**
+**Context:** The company's profits increased by 20% in the last quarter.
+**Response:** The company's profits decreased by 20% in the last quarter.
+
+**Context:** The new smartphone model features a larger display and improved battery life.
+**Response:** The new smartphone model has a smaller display and shorter battery life.
+
+## [Groundedness: 3] (Accurate but Vague Response)
+**Definition:** A response that provides accurate information from the context but is overly generic or vague, not meaningfully engaging with the specific details in the context. The information is correct but lacks specificity and detail.
+
+**Examples:**
+**Context:** The company's profits increased by 20% in the last quarter, marking the highest growth rate in its history.
+**Response:** The company is doing well financially.
+
+**Context:** The new smartphone model features a larger display, improved battery life, and an upgraded camera system.
+**Response:** The smartphone has some nice features.
+
+## [Groundedness: 4] (Partially Correct Response)
+**Definition:** A response that provides correct information from the context but is incomplete or lacks specific details mentioned in the context. It captures some of the necessary information but omits key elements needed for a full understanding.
+
+**Examples:**
+**Context:** The company's profits increased by 20% in the last quarter, marking the highest growth rate in its history.
+**Response:** The company's profits increased by 20% in the last quarter.
+
+**Context:** The new smartphone model features a larger display, improved battery life, and an upgraded camera system.
+**Response:** The new smartphone model features a larger display and improved battery life.
+
+## [Groundedness: 5] (Fully Grounded and Complete Response)
+**Definition:** A response that thoroughly and accurately conveys information from the context, including all relevant details. It directly addresses the context with precise information, demonstrating complete understanding without adding extraneous information.
+
+**Examples:**
+**Context:** The company's profits increased by 20% in the last quarter, marking the highest growth rate in its history.
+**Response:** The company's profits increased by 20% in the last quarter, marking the highest growth rate in its history.
+
+**Context:** The new smartphone model features a larger display, improved battery life, and an upgraded camera system.
+**Response:** The new smartphone model features a larger display, improved battery life, and an upgraded camera system.
+
+
+# Data
+CONTEXT: {{context}}
+RESPONSE: {{response}}
+
+
+# Tasks
+## Please provide your assessment Score for the previous RESPONSE in relation to the CONTEXT based on the Definitions above. Your output should include the following information:
+- **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
+- **Explanation**: a very short explanation of why you think the input Data should get that Score.
+- **Score**: based on your previous analysis, provide your Score. The Score you give MUST be a integer score (i.e., "1", "2"...) based on the levels of the definitions.
+
+
+## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
+# Output
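Both prompty variants are consumed by the groundedness evaluator (azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py in the listing above), which applies the with-query or without-query rubric depending on whether a query is supplied. Below is a minimal usage sketch, assuming the public GroundednessEvaluator call signature with context, response, and an optional query; it is not part of the diff.

```python
from azure.ai.evaluation import GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "<chat-deployment>",
}

groundedness = GroundednessEvaluator(model_config)

# With a query: the groundedness_with_query.prompty rubric applies.
with_query = groundedness(
    query="When is the company's annual meeting?",
    context="The company's annual meeting will be held next Thursday.",
    response="The annual meeting is next Thursday.",
)

# Without a query: the groundedness_without_query.prompty rubric applies.
without_query = groundedness(
    context="The company's profits increased by 20% in the last quarter.",
    response="Profits rose 20% last quarter.",
)

print(with_query["groundedness"], without_query["groundedness"])  # 1-5 scores
```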
azure/ai/evaluation/{_evaluate/_batch_run_client → _evaluators/_intent_resolution}/__init__.py
RENAMED
@@ -1,8 +1,7 @@
 # ---------------------------------------------------------
 # Copyright (c) Microsoft Corporation. All rights reserved.
 # ---------------------------------------------------------
-from .batch_run_context import BatchRunContext
-from .code_client import CodeClient
-from .proxy_client import ProxyClient
 
-
+from ._intent_resolution import IntentResolutionEvaluator
+
+__all__ = ["IntentResolutionEvaluator"]
azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py
ADDED
@@ -0,0 +1,196 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+import os
+import math
+import logging
+from typing import Dict, Union, List, Optional
+
+from typing_extensions import overload, override
+
+from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+from azure.ai.evaluation._evaluators._common import PromptyEvaluatorBase
+from azure.ai.evaluation._model_configurations import Conversation, Message
+from ..._common.utils import check_score_is_valid, reformat_conversation_history, reformat_agent_response
+from azure.ai.evaluation._common._experimental import experimental
+
+logger = logging.getLogger(__name__)
+
+
+@experimental
+class IntentResolutionEvaluator(PromptyEvaluatorBase[Union[str, float]]):
+    """
+    Evaluates intent resolution for a given query and response or a multi-turn conversation, including reasoning.
+
+    The intent resolution evaluator assesses whether the user intent was correctly identified and resolved.
+
+    :param model_config: Configuration for the Azure OpenAI model.
+    :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+        ~azure.ai.evaluation.OpenAIModelConfiguration]
+
+    .. admonition:: Example:
+
+        .. literalinclude:: ../samples/evaluation_samples_evaluate.py
+            :start-after: [START intent_resolution_evaluator]
+            :end-before: [END intent_resolution_evaluator]
+            :language: python
+            :dedent: 8
+            :caption: Initialize and call an IntentResolutionEvaluator with a query and response.
+
+    .. admonition:: Example using Azure AI Project URL:
+
+        .. literalinclude:: ../samples/evaluation_samples_evaluate_fdp.py
+            :start-after: [START intent_resolution_evaluator]
+            :end-before: [END intent_resolution_evaluator]
+            :language: python
+            :dedent: 8
+            :caption: Initialize and call IntentResolutionEvaluator using Azure AI Project URL in the following format
+                https://{resource_name}.services.ai.azure.com/api/projects/{project_name}
+
+    """
+
+    _PROMPTY_FILE = "intent_resolution.prompty"
+    _RESULT_KEY = "intent_resolution"
+    _OPTIONAL_PARAMS = ["tool_definitions"]
+
+    _MIN_INTENT_RESOLUTION_SCORE = 1
+    _MAX_INTENT_RESOLUTION_SCORE = 5
+    _DEFAULT_INTENT_RESOLUTION_THRESHOLD = 3
+
+    id = "azureai://built-in/evaluators/intent_resolution"
+    """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
+
+    @override
+    def __init__(self, model_config, *, threshold=_DEFAULT_INTENT_RESOLUTION_THRESHOLD, credential=None, **kwargs):
+        current_dir = os.path.dirname(__file__)
+        prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
+        self.threshold = threshold
+        super().__init__(
+            model_config=model_config,
+            prompty_file=prompty_path,
+            result_key=self._RESULT_KEY,
+            threshold=threshold,
+            credential=credential,
+            _higher_is_better=True,
+            **kwargs,
+        )
+
+    @overload
+    def __call__(
+        self,
+        *,
+        query: Union[str, List[dict]],
+        response: Union[str, List[dict]],
+        tool_definitions: Optional[Union[dict, List[dict]]] = None,
+    ) -> Dict[str, Union[str, float]]:
+        """Evaluate intent resolution for a given query, response and optional tool definitions.
+        The query and response can be either a string or a list of messages.
+
+        Example with string inputs and no tools:
+            evaluator = IntentResolutionEvaluator(model_config)
+            query = "What is the weather today?"
+            response = "The weather is sunny."
+
+            result = evaluator(query=query, response=response)
+
+        Example with list of messages:
+            evaluator = IntentResolutionEvaluator(model_config)
+            query: [{'role': 'system', 'content': 'You are a friendly and helpful customer service agent.'}, {'createdAt': 1700000060, 'role': 'user', 'content': [{'type': 'text', 'text': 'Hi, I need help with the last 2 orders on my account #888. Could you please update me on their status?'}]}]
+            response: [{'createdAt': 1700000070, 'run_id': '0', 'role': 'assistant', 'content': [{'type': 'text', 'text': 'Hello! Let me quickly look up your account details.'}]}, {'createdAt': 1700000075, 'run_id': '0', 'role': 'assistant', 'content': [{'type': 'tool_call', 'tool_call': {'id': 'tool_call_20250310_001', 'type': 'function', 'function': {'name': 'get_orders', 'arguments': {'account_number': '888'}}}}]}, {'createdAt': 1700000080, 'run_id': '0', 'tool_call_id': 'tool_call_20250310_001', 'role': 'tool', 'content': [{'type': 'tool_result', 'tool_result': '[{ "order_id": "123" }, { "order_id": "124" }]'}]}, {'createdAt': 1700000085, 'run_id': '0', 'role': 'assistant', 'content': [{'type': 'text', 'text': 'Thanks for your patience. I see two orders on your account. Let me fetch the details for both.'}]}, {'createdAt': 1700000090, 'run_id': '0', 'role': 'assistant', 'content': [{'type': 'tool_call', 'tool_call': {'id': 'tool_call_20250310_002', 'type': 'function', 'function': {'name': 'get_order', 'arguments': {'order_id': '123'}}}}, {'type': 'tool_call', 'tool_call': {'id': 'tool_call_20250310_003', 'type': 'function', 'function': {'name': 'get_order', 'arguments': {'order_id': '124'}}}}]}, {'createdAt': 1700000095, 'run_id': '0', 'tool_call_id': 'tool_call_20250310_002', 'role': 'tool', 'content': [{'type': 'tool_result', 'tool_result': '{ "order": { "id": "123", "status": "shipped", "delivery_date": "2025-03-15" } }'}]}, {'createdAt': 1700000100, 'run_id': '0', 'tool_call_id': 'tool_call_20250310_003', 'role': 'tool', 'content': [{'type': 'tool_result', 'tool_result': '{ "order": { "id": "124", "status": "delayed", "expected_delivery": "2025-03-20" } }'}]}, {'createdAt': 1700000105, 'run_id': '0', 'role': 'assistant', 'content': [{'type': 'text', 'text': 'The order with ID 123 has been shipped and is expected to be delivered on March 15, 2025. However, the order with ID 124 is delayed and should now arrive by March 20, 2025. Is there anything else I can help you with?'}]}]
+            tool_definitions: [{'name': 'get_orders', 'description': 'Get the list of orders for a given account number.', 'parameters': {'type': 'object', 'properties': {'account_number': {'type': 'string', 'description': 'The account number to get the orders for.'}}}}, {'name': 'get_order', 'description': 'Get the details of a specific order.', 'parameters': {'type': 'object', 'properties': {'order_id': {'type': 'string', 'description': 'The order ID to get the details for.'}}}}, {'name': 'initiate_return', 'description': 'Initiate the return process for an order.', 'parameters': {'type': 'object', 'properties': {'order_id': {'type': 'string', 'description': 'The order ID for the return process.'}}}}, {'name': 'update_shipping_address', 'description': 'Update the shipping address for a given account.', 'parameters': {'type': 'object', 'properties': {'account_number': {'type': 'string', 'description': 'The account number to update.'}, 'new_address': {'type': 'string', 'description': 'The new shipping address.'}}}}]
+
+            result = evaluator(query=query, response=response, tool_definitions=tool_definitions)
+
+        :keyword query: The query to be evaluated which is either a string or a list of messages.
+            The list of messages is the previous conversation history of the user and agent, including system messages and tool calls.
+        :paramtype query: Union[str, List[dict]]
+        :keyword response: The response to be evaluated, which is either a string or a list of messages (full agent response potentially including tool calls)
+        :paramtype response: Union[str, List[dict]]
+        :keyword tool_definitions: An optional list of messages containing the tool definitions the agent is aware of.
+        :paramtype tool_definitions: Optional[Union[dict, List[dict]]]
+        :return: A dictionary with the intent resolution evaluation
+        :rtype: Dict[str, Union[str, float]]
+        """
+
+    @override
+    def __call__(  # pylint: disable=docstring-missing-param
+        self,
+        *args,
+        **kwargs,
+    ):
+        """
+        Invokes the instance using the overloaded __call__ signature.
+
+        For detailed parameter types and return value documentation, see the overloaded __call__ definition.
+        """
+        return super().__call__(*args, **kwargs)
+
+    @override
+    async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:  # type: ignore[override]
+        """Do intent resolution evaluation.
+
+        :param eval_input: The input to the evaluator. Expected to contain whatever inputs are needed for the _flow method
+        :type eval_input: Dict
+        :return: The evaluation result.
+        :rtype: Dict
+        """
+        # we override the _do_eval method as we want the output to be a dictionary, which is a different schema than _base_prompty_eval.py
+        if "query" not in eval_input and "response" not in eval_input:
+            raise EvaluationException(
+                message=f"Both query and response must be provided as input to the intent resolution evaluator.",
+                internal_message=f"Both query and response must be provided as input to the intent resolution evaluator.",
+                blame=ErrorBlame.USER_ERROR,
+                category=ErrorCategory.MISSING_FIELD,
+                target=ErrorTarget.INTENT_RESOLUTION_EVALUATOR,
+            )
+        # reformat query and response to the format expected by the prompty flow
+        eval_input["query"] = reformat_conversation_history(eval_input["query"], logger)
+        eval_input["response"] = reformat_agent_response(eval_input["response"], logger)
+
+        prompty_output_dict = await self._flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
+        llm_output = prompty_output_dict["llm_output"]
+        # llm_output should always be a dictionary because the response_format of prompty is set to json_object, but checking anyway
+        score = math.nan
+        if isinstance(llm_output, dict):
+            score = llm_output.get("score", math.nan)
+            if not check_score_is_valid(
+                score,
+                IntentResolutionEvaluator._MIN_INTENT_RESOLUTION_SCORE,
+                IntentResolutionEvaluator._MAX_INTENT_RESOLUTION_SCORE,
+            ):
+                raise EvaluationException(
+                    message=f"Invalid score value: {score}. Expected a number in range [{IntentResolutionEvaluator._MIN_INTENT_RESOLUTION_SCORE}, {IntentResolutionEvaluator._MAX_INTENT_RESOLUTION_SCORE}].",
+                    internal_message="Invalid score value.",
+                    category=ErrorCategory.FAILED_EXECUTION,
+                    blame=ErrorBlame.SYSTEM_ERROR,
+                )
+            reason = llm_output.get("explanation", "")
+            score = float(score)
+            score_result = "pass" if score >= self._threshold else "fail"
+
+            response_dict = {
+                f"{self._result_key}": score,
+                f"gpt_{self._result_key}": score,
+                f"{self._result_key}_result": score_result,
+                f"{self._result_key}_threshold": self._threshold,
+                f"{self._result_key}_reason": reason,
+                f"{self._result_key}_prompt_tokens": prompty_output_dict.get("input_token_count", 0),
+                f"{self._result_key}_completion_tokens": prompty_output_dict.get("output_token_count", 0),
+                f"{self._result_key}_total_tokens": prompty_output_dict.get("total_token_count", 0),
+                f"{self._result_key}_finish_reason": prompty_output_dict.get("finish_reason", ""),
+                f"{self._result_key}_model": prompty_output_dict.get("model_id", ""),
+                f"{self._result_key}_sample_input": prompty_output_dict.get("sample_input", ""),
+                f"{self._result_key}_sample_output": prompty_output_dict.get("sample_output", ""),
+            }
+            return response_dict
+        # If llm_output is not a dictionary, return NaN for the score. This should never happen
+        if logger:
+            logger.warning("LLM output is not a dictionary, returning NaN for the score.")
+
+        binary_result = self._get_binary_result(score)
+        return {
+            self._result_key: float(score),
+            f"gpt_{self._result_key}": float(score),
+            f"{self._result_key}_result": binary_result,
+            f"{self._result_key}_threshold": self._threshold,
+        }
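The evaluator's own docstring above shows the intended call pattern; the sketch below restates it as a runnable snippet and lists the result keys that _do_eval assembles (score, pass/fail verdict, threshold, reason, and token counts). Treat it as an illustration under the same assumptions as the earlier examples (root-level export, Azure OpenAI model configuration).

```python
from azure.ai.evaluation import IntentResolutionEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "<chat-deployment>",
}

# Default threshold is 3 (scores range from 1 to 5, higher is better).
evaluator = IntentResolutionEvaluator(model_config, threshold=3)

result = evaluator(
    query="What is the weather today?",
    response="The weather is sunny.",
)

# Keys produced by _do_eval (see above): the numeric score plus supporting fields.
print(result["intent_resolution"])               # e.g. 4.0
print(result["intent_resolution_result"])        # "pass" or "fail" versus the threshold
print(result["intent_resolution_reason"])        # judge model's short explanation
print(result["intent_resolution_total_tokens"])  # token usage of the judge call
```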