azure-ai-evaluation 1.0.1__py3-none-any.whl → 1.13.3__py3-none-any.whl

This diff shows the contents of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between the package versions as they appear in their respective public registries.

Potentially problematic release: this version of azure-ai-evaluation might be problematic.

Files changed (277)
  1. azure/ai/evaluation/__init__.py +83 -14
  2. azure/ai/evaluation/_aoai/__init__.py +10 -0
  3. azure/ai/evaluation/_aoai/aoai_grader.py +140 -0
  4. azure/ai/evaluation/_aoai/label_grader.py +68 -0
  5. azure/ai/evaluation/_aoai/python_grader.py +86 -0
  6. azure/ai/evaluation/_aoai/score_model_grader.py +94 -0
  7. azure/ai/evaluation/_aoai/string_check_grader.py +66 -0
  8. azure/ai/evaluation/_aoai/text_similarity_grader.py +80 -0
  9. azure/ai/evaluation/_azure/__init__.py +3 -0
  10. azure/ai/evaluation/_azure/_clients.py +204 -0
  11. azure/ai/evaluation/_azure/_envs.py +207 -0
  12. azure/ai/evaluation/_azure/_models.py +227 -0
  13. azure/ai/evaluation/_azure/_token_manager.py +129 -0
  14. azure/ai/evaluation/_common/__init__.py +9 -1
  15. azure/ai/evaluation/_common/constants.py +124 -2
  16. azure/ai/evaluation/_common/evaluation_onedp_client.py +169 -0
  17. azure/ai/evaluation/_common/onedp/__init__.py +32 -0
  18. azure/ai/evaluation/_common/onedp/_client.py +166 -0
  19. azure/ai/evaluation/_common/onedp/_configuration.py +72 -0
  20. azure/ai/evaluation/_common/onedp/_model_base.py +1232 -0
  21. azure/ai/evaluation/_common/onedp/_patch.py +21 -0
  22. azure/ai/evaluation/_common/onedp/_serialization.py +2032 -0
  23. azure/ai/evaluation/_common/onedp/_types.py +21 -0
  24. azure/ai/evaluation/_common/onedp/_utils/__init__.py +6 -0
  25. azure/ai/evaluation/_common/onedp/_utils/model_base.py +1232 -0
  26. azure/ai/evaluation/_common/onedp/_utils/serialization.py +2032 -0
  27. azure/ai/evaluation/_common/onedp/_validation.py +66 -0
  28. azure/ai/evaluation/_common/onedp/_vendor.py +50 -0
  29. azure/ai/evaluation/_common/onedp/_version.py +9 -0
  30. azure/ai/evaluation/_common/onedp/aio/__init__.py +29 -0
  31. azure/ai/evaluation/_common/onedp/aio/_client.py +168 -0
  32. azure/ai/evaluation/_common/onedp/aio/_configuration.py +72 -0
  33. azure/ai/evaluation/_common/onedp/aio/_patch.py +21 -0
  34. azure/ai/evaluation/_common/onedp/aio/operations/__init__.py +49 -0
  35. azure/ai/evaluation/_common/onedp/aio/operations/_operations.py +7143 -0
  36. azure/ai/evaluation/_common/onedp/aio/operations/_patch.py +21 -0
  37. azure/ai/evaluation/_common/onedp/models/__init__.py +358 -0
  38. azure/ai/evaluation/_common/onedp/models/_enums.py +447 -0
  39. azure/ai/evaluation/_common/onedp/models/_models.py +5963 -0
  40. azure/ai/evaluation/_common/onedp/models/_patch.py +21 -0
  41. azure/ai/evaluation/_common/onedp/operations/__init__.py +49 -0
  42. azure/ai/evaluation/_common/onedp/operations/_operations.py +8951 -0
  43. azure/ai/evaluation/_common/onedp/operations/_patch.py +21 -0
  44. azure/ai/evaluation/_common/onedp/py.typed +1 -0
  45. azure/ai/evaluation/_common/onedp/servicepatterns/__init__.py +1 -0
  46. azure/ai/evaluation/_common/onedp/servicepatterns/aio/__init__.py +1 -0
  47. azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/__init__.py +25 -0
  48. azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_operations.py +34 -0
  49. azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_patch.py +20 -0
  50. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/__init__.py +1 -0
  51. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/__init__.py +1 -0
  52. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/__init__.py +22 -0
  53. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_operations.py +29 -0
  54. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_patch.py +20 -0
  55. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/__init__.py +22 -0
  56. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_operations.py +29 -0
  57. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_patch.py +20 -0
  58. azure/ai/evaluation/_common/onedp/servicepatterns/operations/__init__.py +25 -0
  59. azure/ai/evaluation/_common/onedp/servicepatterns/operations/_operations.py +34 -0
  60. azure/ai/evaluation/_common/onedp/servicepatterns/operations/_patch.py +20 -0
  61. azure/ai/evaluation/_common/rai_service.py +578 -69
  62. azure/ai/evaluation/_common/raiclient/__init__.py +34 -0
  63. azure/ai/evaluation/_common/raiclient/_client.py +128 -0
  64. azure/ai/evaluation/_common/raiclient/_configuration.py +87 -0
  65. azure/ai/evaluation/_common/raiclient/_model_base.py +1235 -0
  66. azure/ai/evaluation/_common/raiclient/_patch.py +20 -0
  67. azure/ai/evaluation/_common/raiclient/_serialization.py +2050 -0
  68. azure/ai/evaluation/_common/raiclient/_version.py +9 -0
  69. azure/ai/evaluation/_common/raiclient/aio/__init__.py +29 -0
  70. azure/ai/evaluation/_common/raiclient/aio/_client.py +130 -0
  71. azure/ai/evaluation/_common/raiclient/aio/_configuration.py +87 -0
  72. azure/ai/evaluation/_common/raiclient/aio/_patch.py +20 -0
  73. azure/ai/evaluation/_common/raiclient/aio/operations/__init__.py +25 -0
  74. azure/ai/evaluation/_common/raiclient/aio/operations/_operations.py +981 -0
  75. azure/ai/evaluation/_common/raiclient/aio/operations/_patch.py +20 -0
  76. azure/ai/evaluation/_common/raiclient/models/__init__.py +60 -0
  77. azure/ai/evaluation/_common/raiclient/models/_enums.py +18 -0
  78. azure/ai/evaluation/_common/raiclient/models/_models.py +651 -0
  79. azure/ai/evaluation/_common/raiclient/models/_patch.py +20 -0
  80. azure/ai/evaluation/_common/raiclient/operations/__init__.py +25 -0
  81. azure/ai/evaluation/_common/raiclient/operations/_operations.py +1238 -0
  82. azure/ai/evaluation/_common/raiclient/operations/_patch.py +20 -0
  83. azure/ai/evaluation/_common/raiclient/py.typed +1 -0
  84. azure/ai/evaluation/_common/utils.py +505 -27
  85. azure/ai/evaluation/_constants.py +148 -0
  86. azure/ai/evaluation/_converters/__init__.py +3 -0
  87. azure/ai/evaluation/_converters/_ai_services.py +899 -0
  88. azure/ai/evaluation/_converters/_models.py +467 -0
  89. azure/ai/evaluation/_converters/_sk_services.py +495 -0
  90. azure/ai/evaluation/_eval_mapping.py +83 -0
  91. azure/ai/evaluation/_evaluate/_batch_run/__init__.py +10 -2
  92. azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +176 -0
  93. azure/ai/evaluation/_evaluate/_batch_run/batch_clients.py +82 -0
  94. azure/ai/evaluation/_evaluate/_batch_run/code_client.py +18 -12
  95. azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +19 -6
  96. azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +47 -22
  97. azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +18 -2
  98. azure/ai/evaluation/_evaluate/_eval_run.py +32 -46
  99. azure/ai/evaluation/_evaluate/_evaluate.py +1809 -142
  100. azure/ai/evaluation/_evaluate/_evaluate_aoai.py +992 -0
  101. azure/ai/evaluation/_evaluate/_telemetry/__init__.py +5 -90
  102. azure/ai/evaluation/_evaluate/_utils.py +237 -42
  103. azure/ai/evaluation/_evaluator_definition.py +76 -0
  104. azure/ai/evaluation/_evaluators/_bleu/_bleu.py +80 -28
  105. azure/ai/evaluation/_evaluators/_code_vulnerability/__init__.py +5 -0
  106. azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +119 -0
  107. azure/ai/evaluation/_evaluators/_coherence/_coherence.py +40 -4
  108. azure/ai/evaluation/_evaluators/_common/__init__.py +2 -0
  109. azure/ai/evaluation/_evaluators/_common/_base_eval.py +427 -29
  110. azure/ai/evaluation/_evaluators/_common/_base_multi_eval.py +63 -0
  111. azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +269 -12
  112. azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +74 -9
  113. azure/ai/evaluation/_evaluators/_common/_conversation_aggregators.py +49 -0
  114. azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +73 -53
  115. azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +35 -5
  116. azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +26 -5
  117. azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +35 -5
  118. azure/ai/evaluation/_evaluators/_content_safety/_violence.py +34 -4
  119. azure/ai/evaluation/_evaluators/_document_retrieval/__init__.py +7 -0
  120. azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py +442 -0
  121. azure/ai/evaluation/_evaluators/_eci/_eci.py +6 -3
  122. azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +97 -70
  123. azure/ai/evaluation/_evaluators/_fluency/_fluency.py +39 -3
  124. azure/ai/evaluation/_evaluators/_gleu/_gleu.py +80 -25
  125. azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +230 -20
  126. azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +30 -29
  127. azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +19 -14
  128. azure/ai/evaluation/_evaluators/_intent_resolution/__init__.py +7 -0
  129. azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +196 -0
  130. azure/ai/evaluation/_evaluators/_intent_resolution/intent_resolution.prompty +275 -0
  131. azure/ai/evaluation/_evaluators/_meteor/_meteor.py +89 -36
  132. azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +22 -4
  133. azure/ai/evaluation/_evaluators/_qa/_qa.py +94 -35
  134. azure/ai/evaluation/_evaluators/_relevance/_relevance.py +100 -4
  135. azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +154 -56
  136. azure/ai/evaluation/_evaluators/_response_completeness/__init__.py +7 -0
  137. azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +202 -0
  138. azure/ai/evaluation/_evaluators/_response_completeness/response_completeness.prompty +84 -0
  139. azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +39 -3
  140. azure/ai/evaluation/_evaluators/_rouge/_rouge.py +166 -26
  141. azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +38 -7
  142. azure/ai/evaluation/_evaluators/_similarity/_similarity.py +81 -85
  143. azure/ai/evaluation/_evaluators/_task_adherence/__init__.py +7 -0
  144. azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +226 -0
  145. azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +101 -0
  146. azure/ai/evaluation/_evaluators/_task_completion/__init__.py +7 -0
  147. azure/ai/evaluation/_evaluators/_task_completion/_task_completion.py +177 -0
  148. azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty +220 -0
  149. azure/ai/evaluation/_evaluators/_task_navigation_efficiency/__init__.py +7 -0
  150. azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py +384 -0
  151. azure/ai/evaluation/_evaluators/_tool_call_accuracy/__init__.py +9 -0
  152. azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +298 -0
  153. azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +166 -0
  154. azure/ai/evaluation/_evaluators/_tool_input_accuracy/__init__.py +9 -0
  155. azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py +263 -0
  156. azure/ai/evaluation/_evaluators/_tool_input_accuracy/tool_input_accuracy.prompty +76 -0
  157. azure/ai/evaluation/_evaluators/_tool_output_utilization/__init__.py +7 -0
  158. azure/ai/evaluation/_evaluators/_tool_output_utilization/_tool_output_utilization.py +225 -0
  159. azure/ai/evaluation/_evaluators/_tool_output_utilization/tool_output_utilization.prompty +221 -0
  160. azure/ai/evaluation/_evaluators/_tool_selection/__init__.py +9 -0
  161. azure/ai/evaluation/_evaluators/_tool_selection/_tool_selection.py +266 -0
  162. azure/ai/evaluation/_evaluators/_tool_selection/tool_selection.prompty +104 -0
  163. azure/ai/evaluation/_evaluators/_tool_success/__init__.py +7 -0
  164. azure/ai/evaluation/_evaluators/_tool_success/_tool_success.py +301 -0
  165. azure/ai/evaluation/_evaluators/_tool_success/tool_success.prompty +321 -0
  166. azure/ai/evaluation/_evaluators/_ungrounded_attributes/__init__.py +5 -0
  167. azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +102 -0
  168. azure/ai/evaluation/_evaluators/_xpia/xpia.py +20 -4
  169. azure/ai/evaluation/_exceptions.py +24 -1
  170. azure/ai/evaluation/_http_utils.py +7 -5
  171. azure/ai/evaluation/_legacy/__init__.py +3 -0
  172. azure/ai/evaluation/_legacy/_adapters/__init__.py +7 -0
  173. azure/ai/evaluation/_legacy/_adapters/_check.py +17 -0
  174. azure/ai/evaluation/_legacy/_adapters/_configuration.py +45 -0
  175. azure/ai/evaluation/_legacy/_adapters/_constants.py +10 -0
  176. azure/ai/evaluation/_legacy/_adapters/_errors.py +29 -0
  177. azure/ai/evaluation/_legacy/_adapters/_flows.py +28 -0
  178. azure/ai/evaluation/_legacy/_adapters/_service.py +16 -0
  179. azure/ai/evaluation/_legacy/_adapters/client.py +51 -0
  180. azure/ai/evaluation/_legacy/_adapters/entities.py +26 -0
  181. azure/ai/evaluation/_legacy/_adapters/tracing.py +28 -0
  182. azure/ai/evaluation/_legacy/_adapters/types.py +15 -0
  183. azure/ai/evaluation/_legacy/_adapters/utils.py +31 -0
  184. azure/ai/evaluation/_legacy/_batch_engine/__init__.py +9 -0
  185. azure/ai/evaluation/_legacy/_batch_engine/_config.py +48 -0
  186. azure/ai/evaluation/_legacy/_batch_engine/_engine.py +477 -0
  187. azure/ai/evaluation/_legacy/_batch_engine/_exceptions.py +88 -0
  188. azure/ai/evaluation/_legacy/_batch_engine/_openai_injector.py +132 -0
  189. azure/ai/evaluation/_legacy/_batch_engine/_result.py +107 -0
  190. azure/ai/evaluation/_legacy/_batch_engine/_run.py +127 -0
  191. azure/ai/evaluation/_legacy/_batch_engine/_run_storage.py +128 -0
  192. azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +262 -0
  193. azure/ai/evaluation/_legacy/_batch_engine/_status.py +25 -0
  194. azure/ai/evaluation/_legacy/_batch_engine/_trace.py +97 -0
  195. azure/ai/evaluation/_legacy/_batch_engine/_utils.py +97 -0
  196. azure/ai/evaluation/_legacy/_batch_engine/_utils_deprecated.py +131 -0
  197. azure/ai/evaluation/_legacy/_common/__init__.py +3 -0
  198. azure/ai/evaluation/_legacy/_common/_async_token_provider.py +117 -0
  199. azure/ai/evaluation/_legacy/_common/_logging.py +292 -0
  200. azure/ai/evaluation/_legacy/_common/_thread_pool_executor_with_context.py +17 -0
  201. azure/ai/evaluation/_legacy/prompty/__init__.py +36 -0
  202. azure/ai/evaluation/_legacy/prompty/_connection.py +119 -0
  203. azure/ai/evaluation/_legacy/prompty/_exceptions.py +139 -0
  204. azure/ai/evaluation/_legacy/prompty/_prompty.py +430 -0
  205. azure/ai/evaluation/_legacy/prompty/_utils.py +663 -0
  206. azure/ai/evaluation/_legacy/prompty/_yaml_utils.py +99 -0
  207. azure/ai/evaluation/_model_configurations.py +26 -0
  208. azure/ai/evaluation/_safety_evaluation/__init__.py +3 -0
  209. azure/ai/evaluation/_safety_evaluation/_generated_rai_client.py +0 -0
  210. azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +917 -0
  211. azure/ai/evaluation/_user_agent.py +32 -1
  212. azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +0 -4
  213. azure/ai/evaluation/_vendor/rouge_score/scoring.py +0 -4
  214. azure/ai/evaluation/_vendor/rouge_score/tokenize.py +0 -4
  215. azure/ai/evaluation/_version.py +2 -1
  216. azure/ai/evaluation/red_team/__init__.py +22 -0
  217. azure/ai/evaluation/red_team/_agent/__init__.py +3 -0
  218. azure/ai/evaluation/red_team/_agent/_agent_functions.py +261 -0
  219. azure/ai/evaluation/red_team/_agent/_agent_tools.py +461 -0
  220. azure/ai/evaluation/red_team/_agent/_agent_utils.py +89 -0
  221. azure/ai/evaluation/red_team/_agent/_semantic_kernel_plugin.py +228 -0
  222. azure/ai/evaluation/red_team/_attack_objective_generator.py +268 -0
  223. azure/ai/evaluation/red_team/_attack_strategy.py +49 -0
  224. azure/ai/evaluation/red_team/_callback_chat_target.py +115 -0
  225. azure/ai/evaluation/red_team/_default_converter.py +21 -0
  226. azure/ai/evaluation/red_team/_evaluation_processor.py +505 -0
  227. azure/ai/evaluation/red_team/_mlflow_integration.py +430 -0
  228. azure/ai/evaluation/red_team/_orchestrator_manager.py +803 -0
  229. azure/ai/evaluation/red_team/_red_team.py +1717 -0
  230. azure/ai/evaluation/red_team/_red_team_result.py +661 -0
  231. azure/ai/evaluation/red_team/_result_processor.py +1708 -0
  232. azure/ai/evaluation/red_team/_utils/__init__.py +37 -0
  233. azure/ai/evaluation/red_team/_utils/_rai_service_eval_chat_target.py +128 -0
  234. azure/ai/evaluation/red_team/_utils/_rai_service_target.py +601 -0
  235. azure/ai/evaluation/red_team/_utils/_rai_service_true_false_scorer.py +114 -0
  236. azure/ai/evaluation/red_team/_utils/constants.py +72 -0
  237. azure/ai/evaluation/red_team/_utils/exception_utils.py +345 -0
  238. azure/ai/evaluation/red_team/_utils/file_utils.py +266 -0
  239. azure/ai/evaluation/red_team/_utils/formatting_utils.py +365 -0
  240. azure/ai/evaluation/red_team/_utils/logging_utils.py +139 -0
  241. azure/ai/evaluation/red_team/_utils/metric_mapping.py +73 -0
  242. azure/ai/evaluation/red_team/_utils/objective_utils.py +46 -0
  243. azure/ai/evaluation/red_team/_utils/progress_utils.py +252 -0
  244. azure/ai/evaluation/red_team/_utils/retry_utils.py +218 -0
  245. azure/ai/evaluation/red_team/_utils/strategy_utils.py +218 -0
  246. azure/ai/evaluation/simulator/_adversarial_scenario.py +6 -0
  247. azure/ai/evaluation/simulator/_adversarial_simulator.py +187 -80
  248. azure/ai/evaluation/simulator/_constants.py +1 -0
  249. azure/ai/evaluation/simulator/_conversation/__init__.py +138 -11
  250. azure/ai/evaluation/simulator/_conversation/_conversation.py +6 -2
  251. azure/ai/evaluation/simulator/_conversation/constants.py +1 -1
  252. azure/ai/evaluation/simulator/_direct_attack_simulator.py +37 -24
  253. azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +1 -0
  254. azure/ai/evaluation/simulator/_indirect_attack_simulator.py +56 -28
  255. azure/ai/evaluation/simulator/_model_tools/__init__.py +2 -1
  256. azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +225 -0
  257. azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +12 -10
  258. azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +100 -45
  259. azure/ai/evaluation/simulator/_model_tools/_rai_client.py +101 -3
  260. azure/ai/evaluation/simulator/_model_tools/_template_handler.py +31 -11
  261. azure/ai/evaluation/simulator/_model_tools/models.py +20 -17
  262. azure/ai/evaluation/simulator/_simulator.py +43 -19
  263. {azure_ai_evaluation-1.0.1.dist-info → azure_ai_evaluation-1.13.3.dist-info}/METADATA +366 -27
  264. azure_ai_evaluation-1.13.3.dist-info/RECORD +305 -0
  265. {azure_ai_evaluation-1.0.1.dist-info → azure_ai_evaluation-1.13.3.dist-info}/WHEEL +1 -1
  266. azure/ai/evaluation/_evaluators/_multimodal/__init__.py +0 -20
  267. azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +0 -132
  268. azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +0 -55
  269. azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +0 -100
  270. azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +0 -124
  271. azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +0 -100
  272. azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +0 -100
  273. azure/ai/evaluation/_evaluators/_multimodal/_violence.py +0 -100
  274. azure/ai/evaluation/simulator/_tracing.py +0 -89
  275. azure_ai_evaluation-1.0.1.dist-info/RECORD +0 -119
  276. {azure_ai_evaluation-1.0.1.dist-info → azure_ai_evaluation-1.13.3.dist-info/licenses}/NOTICE.txt +0 -0
  277. {azure_ai_evaluation-1.0.1.dist-info → azure_ai_evaluation-1.13.3.dist-info}/top_level.txt +0 -0
azure/ai/evaluation/_evaluators/_relevance/relevance.prompty

@@ -10,91 +10,189 @@ model:
      presence_penalty: 0
      frequency_penalty: 0
      response_format:
-       type: text
+       type: json_object

  inputs:
    query:
      type: string
    response:
      type: string
-
  ---
+
  system:
- # Instruction
- ## Goal
- ### You are an expert in evaluating the quality of a RESPONSE from an intelligent system based on provided definition and data. Your goal will involve answering the questions below using the information provided.
- - **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
- - **Data**: Your input data include QUERY and RESPONSE.
- - **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways.
+ You are a Relevance-Judge, an impartial evaluator that scores how well the RESPONSE addresses the user's queries in the CONVERSATION_HISTORY using the definitions provided.

  user:
- # Definition
- **Relevance** refers to how effectively a response addresses a question. It assesses the accuracy, completeness, and direct relevance of the response based solely on the given information.
+ ROLE
+ ====
+ You are a Relevance Evaluator. Your task is to judge how relevant a RESPONSE is to the CONVERSATION_HISTORY using the Relevance definitions provided.

- # Ratings
- ## [Relevance: 1] (Irrelevant Response)
- **Definition:** The response is unrelated to the question. It provides information that is off-topic and does not attempt to address the question posed.
+ INPUT
+ =====
+ CONVERSATION_HISTORY: {{query}}
+ RESPONSE: {{response}}

- **Examples:**
- **Query:** What is the team preparing for?
- **Response:** I went grocery shopping yesterday evening.
+ CONVERSATION_HISTORY is the full dialogue between the user and the agent up to the user's latest message. For single-turn interactions, this will be just the user's query.
+ RESPONSE is the agent's reply to the user's latest message.
+
+ TASK
+ ====
+ Output a JSON object with:
+ 1) a concise explanation of 15-60 words justifying your score based on how well the response is relevant to the user's queries in the CONVERSATION_HISTORY.
+ 2) an integer score from 1 (very poor) to 5 (excellent) using the rubric below.

- **Query:** When will the company's new product line launch?
- **Response:** International travel can be very rewarding and educational.
+ The explanation should always precede the score and should clearly justify the score based on the rubric definitions.
+ Response format exactly as follows:

- ## [Relevance: 2] (Incorrect Response)
- **Definition:** The response attempts to address the question but includes incorrect information. It provides a response that is factually wrong based on the provided information.
+ {
+ "explanation": "<15-60 words>",
+ "score": <1-5>
+ }

- **Examples:**
- **Query:** When was the merger between the two firms finalized?
- **Response:** The merger was finalized on April 10th.

- **Query:** Where and when will the solar eclipse be visible?
- **Response:** The solar eclipse will be visible in Asia on December 14th.
+ EVALUATION STEPS
+ ================
+ A. Read the CONVERSATION_HISTORY and RESPONSE carefully.
+ B. Identify the user's query from the latest message (use conversation history for context if needed).
+ C. Compare the RESPONSE against the rubric below:
+    - Does the response directly address the user's query?
+    - Is the information complete, partial, or off-topic?
+    - Is it vague, generic, or insightful?
+ D. Match the response to the best score from the rubric.
+ E. Provide a short explanation and the score using the required format.

- ## [Relevance: 3] (Incomplete Response)
- **Definition:** The response addresses the question but omits key details necessary for a full understanding. It provides a partial response that lacks essential information.
+ SCORING RUBRIC
+ ==============

- **Examples:**
- **Query:** What type of food does the new restaurant offer?
- **Response:** The restaurant offers Italian food like pasta.
+ ### Score 1 - Irrelevant Response
+ Definition: The response is unrelated to the question. It provides off-topic information and does not attempt to address the question posed.

- **Query:** What topics will the conference cover?
- **Response:** The conference will cover renewable energy and climate change.
+ **Example A**
+ CONVERSATION_HISTORY: What is the team preparing for?
+ RESPONSE: I went grocery shopping yesterday evening.

- ## [Relevance: 4] (Complete Response)
- **Definition:** The response fully addresses the question with accurate and complete information. It includes all essential details required for a comprehensive understanding, without adding any extraneous information.
+ Expected Output:
+ {
+ "explanation": "The response is entirely off-topic and doesn't address the question.",
+ "score": 1
+ }

- **Examples:**
- **Query:** What type of food does the new restaurant offer?
- **Response:** The new restaurant offers Italian cuisine, featuring dishes like pasta, pizza, and risotto.

- **Query:** What topics will the conference cover?
- **Response:** The conference will cover renewable energy, climate change, and sustainability practices.
+ **Example B**
+ CONVERSATION_HISTORY: When will the company's new product line launch?
+ RESPONSE: International travel can be very rewarding and educational.

- ## [Relevance: 5] (Comprehensive Response with Insights)
- **Definition:** The response not only fully and accurately addresses the question but also includes additional relevant insights or elaboration. It may explain the significance, implications, or provide minor inferences that enhance understanding.
+ Expected Output:
+ {
+ "explanation": "The response is completely irrelevant to the product launch question.",
+ "score": 1
+ }

- **Examples:**
- **Query:** What type of food does the new restaurant offer?
- **Response:** The new restaurant offers Italian cuisine, featuring dishes like pasta, pizza, and risotto, aiming to provide customers with an authentic Italian dining experience.

- **Query:** What topics will the conference cover?
- **Response:** The conference will cover renewable energy, climate change, and sustainability practices, bringing together global experts to discuss these critical issues.
+ ### Score 2 Related but Unhelpful / Superficial
+ Definition: The response is loosely or formally related to the query but fails to deliver any meaningful, specific, or useful information. This includes vague phrases, non-answers, or failure/error messages.

+ **Example A**
+ CONVERSATION_HISTORY: What is the event about?
+ RESPONSE: It’s something important.
+
+ Expected Output:
+ {
+ "explanation": "The response vaguely refers to the query topic but lacks specific or informative content.",
+ "score": 2
+ }
+
+ **Example B**
+ CONVERSATION_HISTORY: What’s the weather in Paris?
+ RESPONSE: I tried to find the forecast but the query failed.
+
+ Expected Output:
+ {
+ "explanation": "The response acknowledges the query but provides no usable weather information. It is related but unhelpful.",
+ "score": 2
+ }
+
+ ### Score 3 - Partially Relevant / Incomplete
+ Definition: The response addresses the query and includes relevant information, but omits essential components or detail. The answer is on-topic but insufficient to fully satisfy the request.
+
+ **Example A**
+ CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
+ RESPONSE: The apartment complex has a gym.
+
+ Expected Output:
+ {
+ "explanation": "The response mentions one amenity but does not provide a fuller list or clarify whether other standard features (like parking or security) are included. It partially addresses the query but lacks completeness.",
+ "score": 3
+ }
+
+ **Example B**
+ CONVERSATION_HISTORY: What services does the premium membership include?
+ RESPONSE: It includes priority customer support.
+
+ Expected Output:
+ {
+ "explanation": "The response identifies one service but omits other likely components of a premium membership (e.g., exclusive content or discounts). The information is relevant but incomplete.",
+ "score": 3
+ }
+
+
+
+ ### Score 4 - Fully Relevant / Sufficient Response
+ Definition: The response fully addresses the question with accurate and sufficient information, covering all essential aspects. Very minor omissions are acceptable as long as the core information is intact and the intent is clearly conveyed.
+
+ **Example A**
+ CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
+ RESPONSE: The apartment complex provides a gym, swimming pool, and 24/7 security.
+
+ Expected Output:
+ {
+ "explanation": "The response mentions multiple key amenities likely to be relevant to most users. While it may not list every feature, it clearly conveys the core offerings of the complex.",
+ "score": 4
+ }
+
+ **Example B**
+ CONVERSATION_HISTORY: What services does the premium membership include?
+ RESPONSE: The premium membership includes priority customer support, exclusive content access, and early product releases.
+
+ Expected Output:
+ {
+ "explanation": "The response outlines all major services expected from a premium membership. Even if a minor service is not mentioned, the core value is clearly and fully represented.",
+ "score": 4
+ }


- # Data
- QUERY: {{query}}
- RESPONSE: {{response}}
+ ### Score 5 - Comprehensive Response with Insights
+ Definition: The response not only fully and accurately answers the question, but also adds meaningful elaboration, interpretation, or context that enhances the user's understanding. This goes beyond just listing relevant details — it offers insight into why the information matters, how it's useful, or what impact it has.

+ **Example A**
+ CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
+ RESPONSE: The apartment complex provides a gym, swimming pool, and 24/7 security, designed to offer residents a comfortable and active lifestyle while ensuring their safety.
+
+ Expected Output:
+ {
+ "explanation": "The response fully lists key amenities and additionally explains how these features contribute to resident experience, enhancing the usefulness of the information.",
+ "score": 5
+ }

- # Tasks
- ## Please provide your assessment Score for the previous RESPONSE in relation to the QUERY based on the Definitions above. Your output should include the following information:
- - **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
- - **Explanation**: a very short explanation of why you think the input Data should get that Score.
- - **Score**: based on your previous analysis, provide your Score. The Score you give MUST be a integer score (i.e., "1", "2"...) based on the levels of the definitions.
+ **Example B**
+ CONVERSATION_HISTORY: What services does the premium membership include?
+ RESPONSE: The premium membership includes priority customer support, exclusive content access, and early product releases tailored for users who want quicker resolutions and first access to new features.

+ Expected Output:
+ {
+ "explanation": "The response covers all essential services and adds valuable insight about the target user and benefits, enriching the response beyond basic listing.",
+ "score": 5
+ }
+
+ ### Multi-turn Conversation Example
+ When evaluating responses in a multi-turn conversation, consider the conversation context to understand the user's intent:

- ## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
- # Output
+ **Example - Multi-turn Context**
+ CONVERSATION_HISTORY: [{"role":"user","content":"I'm planning a vacation to Europe."},{"role":"assistant","content":"That sounds exciting! What time of year are you thinking of traveling?"},{"role":"user","content":"Probably in July. What's the weather like then?"}]
+ RESPONSE: [{"role":"assistant","content":"July is summer in Europe with generally warm and pleasant weather. Most countries have temperatures between 20-25°C (68-77°F). It's a popular travel time, so expect crowds at major tourist attractions and higher accommodation prices."}]
+
+ Expected Output:
+ {
+ "explanation": "The response directly addresses the weather question while providing valuable context about crowds and pricing that's relevant to vacation planning established in the conversation.",
+ "score": 5
+ }
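The rewritten relevance.prompty switches `response_format` to `json_object`, so the grader now returns a JSON body with `explanation` and `score` instead of the legacy `<S0>/<S1>/<S2>` tagged output. Below is a minimal usage sketch of the public `RelevanceEvaluator` that this prompty backs; the endpoint and deployment values are placeholders, and the exact output keys (`relevance`, `relevance_reason`) are assumptions based on the result-key pattern visible elsewhere in this diff and may differ by SDK version.

```python
# Minimal sketch (not from the diff): calling the prompty-backed RelevanceEvaluator.
from azure.ai.evaluation import AzureOpenAIModelConfiguration, RelevanceEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                         # placeholder
    azure_deployment="<chat-deployment>",                        # placeholder
)

relevance = RelevanceEvaluator(model_config)
result = relevance(
    query="What is the team preparing for?",
    response="They are preparing for the product launch next month.",
)

# Output keys are assumed from the f"{result_key}" / f"{result_key}_reason" pattern
# used by this release's evaluators; check your installed version for exact names.
print(result.get("relevance"))         # integer 1-5 parsed from the JSON "score"
print(result.get("relevance_reason"))  # the JSON "explanation", if surfaced
```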
azure/ai/evaluation/_evaluators/_response_completeness/__init__.py (new file)

@@ -0,0 +1,7 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ from ._response_completeness import ResponseCompletenessEvaluator
+
+ __all__ = ["ResponseCompletenessEvaluator"]
azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py (new file)

@@ -0,0 +1,202 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ import os
+ import logging
+ import math
+ from typing import Dict, List, Union, Optional
+
+ from typing_extensions import overload, override
+
+ from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+ from azure.ai.evaluation._evaluators._common import PromptyEvaluatorBase
+ from azure.ai.evaluation._common.utils import parse_quality_evaluator_reason_score
+ from azure.ai.evaluation._model_configurations import Conversation, Message
+ from azure.ai.evaluation._common._experimental import experimental
+
+ logger = logging.getLogger(__name__)
+
+
+ @experimental
+ class ResponseCompletenessEvaluator(PromptyEvaluatorBase[Union[str, float]]):
+     """Evaluates the extent to which a given response contains all necessary and relevant information with respect to the
+     provided ground truth.
+
+     The completeness measure assesses how thoroughly an AI model's generated response aligns with the key information,
+     claims, and statements established in the ground truth. This evaluation considers the presence, accuracy,
+     and relevance of the content provided.
+
+     The assessment spans multiple levels, ranging from fully incomplete to fully complete, ensuring a comprehensive
+     evaluation of the response's content quality.
+
+     Use this metric when you need to evaluate an AI model's ability to deliver comprehensive and accurate information,
+     particularly in text generation tasks where conveying all essential details is crucial for clarity,
+     context, and correctness.
+
+     Completeness scores range from 1 to 5:
+
+     1: Fully incomplete — Contains none of the necessary information.
+     2: Barely complete — Contains only a small portion of the required information.
+     3: Moderately complete — Covers about half of the required content.
+     4: Mostly complete — Includes most of the necessary details with minimal omissions.
+     5: Fully complete — Contains all key information without any omissions.
+
+     :param model_config: Configuration for the Azure OpenAI model.
+     :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+         ~azure.ai.evaluation.OpenAIModelConfiguration]
+
+     .. admonition:: Example using Azure AI Project URL:
+
+         .. literalinclude:: ../samples/evaluation_samples_evaluate_fdp.py
+             :start-after: [START completeness_evaluator]
+             :end-before: [END completeness_evaluator]
+             :language: python
+             :dedent: 8
+             :caption: Initialize and call CompletenessEvaluator using Azure AI Project URL in the following format
+                 https://{resource_name}.services.ai.azure.com/api/projects/{project_name}
+
+     """
+
+     # Constants must be defined within eval's directory to be save/loadable
+
+     _PROMPTY_FILE = "response_completeness.prompty"
+     _RESULT_KEY = "response_completeness"
+
+     id = "azureai://built-in/evaluators/response_completeness"
+
+     _MIN_COMPLETENESS_SCORE = 1
+     _MAX_COMPLETENESS_SCORE = 5
+     _DEFAULT_COMPLETENESS_THRESHOLD = 3
+
+     """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
+
+     @override
+     def __init__(
+         self, model_config, *, threshold: Optional[float] = _DEFAULT_COMPLETENESS_THRESHOLD, credential=None, **kwargs
+     ):
+         current_dir = os.path.dirname(__file__)
+         prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
+         self.threshold = threshold  # to be removed in favor of _threshold
+         super().__init__(
+             model_config=model_config,
+             prompty_file=prompty_path,
+             result_key=self._RESULT_KEY,
+             threshold=threshold,
+             credential=credential,
+             _higher_is_better=True,
+             **kwargs,
+         )
+
+     @overload
+     def __call__(
+         self,
+         *,
+         ground_truth: str,
+         response: str,
+     ) -> Dict[str, Union[str, float]]:
+         """Evaluate completeness in given response. Accepts ground truth and response for evaluation.
+         Example usage:
+         Evaluating completeness for a response string
+         ```python
+         from azure.ai.evaluation import CompletenessEvaluator
+         completeness_evaluator = CompletenessEvaluator(model_config)
+         ground_truth = "The ground truth to be evaluated."
+         response = "The response to be evaluated."
+         completeness_results = completeness_evaluator(ground_truth=ground_truth, response=response)
+         ```
+         :keword ground_truth: The ground truth to be evaluated.
+         :paramtype ground_truth: str
+         :keyword response: The response to be evaluated.
+         :paramtype response: Union[str, List[Message]]
+         :return: The response completeness score results.
+         :rtype: Dict[str, Union[str, float]]
+         """
+
+     @overload
+     def __call__(
+         self,
+         *,
+         conversation: Conversation,
+     ) -> Dict[str, Union[float, Dict[str, List[Union[str, float]]]]]:
+         """Evaluate completeness for a conversation
+         :keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
+             key "messages", and potentially a global context under the key "context". Conversation turns are expected
+             to be dictionaries with keys "content", "role", and possibly "context".
+         :paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
+         :return: The fluency score
+         :rtype: Dict[str, Union[float, Dict[str, List[float]]]]
+         """
+
+     @override
+     def __call__(  # pylint: disable=docstring-missing-param
+         self,
+         *args,
+         **kwargs,
+     ):
+         """
+         Invokes the instance using the overloaded __call__ signature.
+
+         For detailed parameter types and return value documentation, see the overloaded __call__ definition.
+         """
+         return super().__call__(*args, **kwargs)
+
+     @override
+     async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:  # type: ignore[override]
+         """Do completeness evaluation.
+         :param eval_input: The input to the evaluator. Expected to contain whatever inputs are needed for the _flow method
+         :type eval_input: Dict
+         :return: The evaluation result.
+         :rtype: Dict
+         """
+         # we override the _do_eval method as we want the output to be a dictionary,
+         # which is a different schema than _base_prompty_eval.py
+         if "ground_truth" not in eval_input or "response" not in eval_input:
+             raise EvaluationException(
+                 message=f"Both ground_truth and response must be provided as input to the completeness evaluator.",
+                 internal_message=f"Both ground_truth and response must be provided as input to the completeness"
+                 f" evaluator.",
+                 blame=ErrorBlame.USER_ERROR,
+                 category=ErrorCategory.MISSING_FIELD,
+                 target=ErrorTarget.COMPLETENESS_EVALUATOR,
+             )
+
+         result = await self._flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
+         llm_output = result.get("llm_output") if isinstance(result, dict) else result
+
+         score = math.nan
+         llm_output_is_dict = isinstance(llm_output, dict)
+         if llm_output_is_dict or isinstance(llm_output, str):
+             reason = ""
+             if llm_output_is_dict:
+                 score = float(llm_output.get("score", math.nan))
+                 reason = llm_output.get("explanation", "")
+             else:
+                 score, reason = parse_quality_evaluator_reason_score(llm_output, valid_score_range="[1-5]")
+
+             binary_result = self._get_binary_result(score)
+
+             # updating the result key and threshold to int based on the schema
+             return {
+                 f"{self._result_key}": int(score),
+                 f"{self._result_key}_result": binary_result,
+                 f"{self._result_key}_threshold": int(self._threshold),
+                 f"{self._result_key}_reason": reason,
+                 f"{self._result_key}_prompt_tokens": result.get("input_token_count", 0),
+                 f"{self._result_key}_completion_tokens": result.get("output_token_count", 0),
+                 f"{self._result_key}_total_tokens": result.get("total_token_count", 0),
+                 f"{self._result_key}_finish_reason": result.get("finish_reason", ""),
+                 f"{self._result_key}_model": result.get("model_id", ""),
+                 f"{self._result_key}_sample_input": result.get("sample_input", ""),
+                 f"{self._result_key}_sample_output": result.get("sample_output", ""),
+             }
+
+         if logger:
+             logger.warning("LLM output is not a dictionary, returning NaN for the score.")
+
+         binary_result = self._get_binary_result(score)
+         return {
+             self._result_key: float(score),
+             f"{self._result_key}_result": binary_result,
+             f"{self._result_key}_threshold": self._threshold,
+         }
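Note that the docstring example above still refers to a `CompletenessEvaluator` name, while the class actually exported by the new `__init__.py` is `ResponseCompletenessEvaluator`. A minimal usage sketch follows; the top-level import is assumed from the expanded package `__init__.py` (+83 lines in this diff), endpoint and deployment values are placeholders, and the result keys mirror the `f"{result_key}_*"` pattern in `_do_eval` above.

```python
# Minimal sketch (not from the diff): calling the new ResponseCompletenessEvaluator.
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from azure.ai.evaluation import ResponseCompletenessEvaluator  # assumed top-level export

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                         # placeholder
    azure_deployment="<chat-deployment>",                        # placeholder
)

# threshold is keyword-only per the __init__ shown above (default 3).
completeness = ResponseCompletenessEvaluator(model_config, threshold=3)
result = completeness(
    ground_truth="Flu shots can prevent flu-related illnesses.",
    response="Flu shots help prevent the flu and related complications.",
)

# Keys follow the f"{result_key}_*" pattern from _do_eval above.
print(result["response_completeness"])         # integer score 1-5
print(result["response_completeness_result"])  # pass/fail-style verdict vs. the threshold
print(result["response_completeness_reason"])  # model-provided explanation
```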
azure/ai/evaluation/_evaluators/_response_completeness/response_completeness.prompty (new file)

@@ -0,0 +1,84 @@
+ ---
+ name: Completeness
+ description: Evaluates Completeness score for QA scenario
+ model:
+   api: chat
+   parameters:
+     temperature: 0.0
+     max_tokens: 800
+     top_p: 1.0
+     seed: 123
+     presence_penalty: 0
+     frequency_penalty: 0
+     response_format:
+       type: text
+
+ inputs:
+   response:
+     type: string
+   ground_truth:
+     type: string
+
+ ---
+ system:
+ # Instruction
+ ## Goal
+ ### You are an expert in evaluating the quality of a Response from an intelligent system based on provided definition and data. Your goal will involve answering the questions below using the information provided.
+ - **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
+ - **Data**: Your input data include Response and Ground Truth.
+ - **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways.
+
+ user:
+ # Definition
+ **Completeness** refers to how accurately and thoroughly a response represents the information provided in the ground truth. It considers both the inclusion of all relevant statements and the correctness of those statements. Each statement in the ground truth should be evaluated individually to determine if it is accurately reflected in the response without missing any key information. The scale ranges from 1 to 5, with higher numbers indicating greater completeness.
+
+ # Ratings
+ ## [Completeness: 1] (Fully Incomplete)
+ **Definition:** A response that does not contain any of the necessary and relevant information with respect to the ground truth. It completely misses all the information, especially claims and statements, established in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot cannot cure cancer. Stay healthy requires sleeping exactly 8 hours a day. A few hours of exercise per week will have little benefits for physical and mental health. Physical and mental health benefits are separate topics. Scientists have not studied any of them."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Staying healthy requires proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+ ## [Completeness: 2] (Barely Complete)
+ **Definition:** A response that contains only a small percentage of all the necessary and relevant information with respect to the ground truth. It misses almost all the information, especially claims and statements, established in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot can prevent flu-related illnesses. Staying healthy requires 2 meals a day. Exercise per week makes no difference to physical and mental health. This is because physical and mental health benefits have low correlation through scientific studies. Scientists are making this observation in studies."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+ ## [Completeness: 3] (Moderately Complete)
+ **Definition:** A response that contains half of the necessary and relevant information with respect to the ground truth. It misses half of the information, especially claims and statements, established in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot can prevent flu-related illnesses. Staying healthy requires a few dollars of investments a day. Even a few dollars of investments per week will not make an impact on physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Fiction writers are starting to discover them through their works."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+ ## [Completeness: 4] (Mostly Complete)
+ **Definition:** A response that contains most of the necessary and relevant information with respect to the ground truth. It misses some minor information, especially claims and statements, established in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot can prevent flu-related illnesses. Staying healthy requires keto diet and rigorous athletic training. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+ ## [Completeness: 5] (Fully Complete)
+ **Definition:** A response that perfectly contains all the necessary and relevant information with respect to the ground truth. It does not miss any information from statements and claims in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+
+ # Data
+ Response: {{response}}
+ Ground Truth: {{ground_truth}}
+
+
+ # Tasks
+ ## Please provide your assessment Score for the previous RESPONSE in relation to the GROUND TRUTH based on the Definitions above. Your output should include the following information:
+ - **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
+ - **Explanation**: a very short explanation of why you think the input data should get that Score.
+ - **Score**: based on your previous analysis, provide your Score. The Score you give MUST be an integer score (i.e., "1", "2"...) based on the levels of the definitions.
+
+ ## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your score</S2>.
+ # Output
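Unlike the relevance prompty, this one keeps `response_format: text` and the legacy `<S0>/<S1>/<S2>` tagged output, which `_do_eval` handles through `parse_quality_evaluator_reason_score` when the LLM output arrives as a string. The snippet below is an illustrative stand-in for that parsing step, not the library's implementation; the regexes, tag handling, and defaults are assumptions.

```python
# Illustrative parser for "<S1>explanation</S1> <S2>score</S2>" style output.
# This mimics what a helper like parse_quality_evaluator_reason_score might do;
# it is not the azure-ai-evaluation implementation.
import math
import re
from typing import Tuple


def parse_tagged_reason_score(llm_output: str) -> Tuple[float, str]:
    score = math.nan
    reason = ""

    # Explanation lives between <S1> tags.
    reason_match = re.search(r"<S1>(.*?)</S1>", llm_output, re.DOTALL)
    if reason_match:
        reason = reason_match.group(1).strip()

    # Score is a single digit in the 1-5 range between <S2> tags.
    score_match = re.search(r"<S2>\D*?([1-5])\D*?</S2>", llm_output, re.DOTALL)
    if score_match:
        score = float(score_match.group(1))

    return score, reason


sample = "<S0>Let's think step by step: ...</S0><S1>Covers all key claims.</S1><S2>5</S2>"
print(parse_tagged_reason_score(sample))  # (5.0, 'Covers all key claims.')
```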
azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py

@@ -31,6 +31,13 @@ class RetrievalEvaluator(PromptyEvaluatorBase[Union[str, float]]):
      :param model_config: Configuration for the Azure OpenAI model.
      :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
          ~azure.ai.evaluation.OpenAIModelConfiguration]
+     :param threshold: The threshold for the evaluation. Default is 3.
+     :type threshold: float
+     :param credential: The credential for authenticating to Azure AI service.
+     :type credential: ~azure.core.credentials.TokenCredential
+     :keyword is_reasoning_model: If True, the evaluator will use reasoning model configuration (o1/o3 models).
+         This will adjust parameters like max_completion_tokens and remove unsupported parameters. Default is False.
+     :paramtype is_reasoning_model: bool
      :return: A function that evaluates and generates metrics for "chat" scenario.
      :rtype: Callable

@@ -43,6 +50,25 @@ class RetrievalEvaluator(PromptyEvaluatorBase[Union[str, float]]):
          :dedent: 8
          :caption: Initialize and call a RetrievalEvaluator.

+     .. admonition:: Example using Azure AI Project URL:
+
+         .. literalinclude:: ../samples/evaluation_samples_evaluate_fdp.py
+             :start-after: [START retrieval_evaluator]
+             :end-before: [END retrieval_evaluator]
+             :language: python
+             :dedent: 8
+             :caption: Initialize and call RetrievalEvaluator using Azure AI Project URL in the following format
+                 https://{resource_name}.services.ai.azure.com/api/projects/{project_name}
+
+     .. admonition:: Example with Threshold:
+
+         .. literalinclude:: ../samples/evaluation_samples_threshold.py
+             :start-after: [START threshold_retrieval_evaluator]
+             :end-before: [END threshold_retrieval_evaluator]
+             :language: python
+             :dedent: 8
+             :caption: Initialize with threshold and call a RetrievalEvaluator.
+
      .. note::

          To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.

@@ -53,14 +79,24 @@ class RetrievalEvaluator(PromptyEvaluatorBase[Union[str, float]]):
      _PROMPTY_FILE = "retrieval.prompty"
      _RESULT_KEY = "retrieval"

-     id = "azureml://registries/azureml/models/Retrieval-Evaluator/versions/1"
+     id = "azureai://built-in/evaluators/retrieval"
      """Evaluator identifier, experimental and to be used only with evaluation in cloud."""

      @override
-     def __init__(self, model_config):  # pylint: disable=super-init-not-called
+     def __init__(self, model_config, *, threshold: float = 3, credential=None, **kwargs):
          current_dir = os.path.dirname(__file__)
          prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
-         super().__init__(model_config=model_config, prompty_file=prompty_path, result_key=self._RESULT_KEY)
+         self._threshold = threshold
+         self._higher_is_better = True
+         super().__init__(
+             model_config=model_config,
+             prompty_file=prompty_path,
+             result_key=self._RESULT_KEY,
+             threshold=threshold,
+             credential=credential,
+             _higher_is_better=self._higher_is_better,
+             **kwargs,
+         )

      @overload
      def __call__(
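Across this release, graded evaluators gain a `threshold` parameter (default 3 for RetrievalEvaluator above) and emit a `*_result` verdict alongside the numeric score. The helper below is a simplified illustration of that thresholding pattern as it appears in the completeness evaluator's `_do_eval`; the SDK's internal `_get_binary_result` may differ in naming, label strings, and edge-case handling.

```python
# Illustrative sketch of the score-vs-threshold pass/fail mapping used by the
# new *_result keys. Not the SDK's _get_binary_result implementation.
import math


def binary_result(score: float, threshold: float = 3, higher_is_better: bool = True) -> str:
    if math.isnan(score):
        return "fail"
    if higher_is_better:
        return "pass" if score >= threshold else "fail"
    return "pass" if score <= threshold else "fail"


# Mirrors the key layout produced by the evaluators in this diff (names assumed).
score = 4.0
print({
    "retrieval": score,
    "retrieval_result": binary_result(score, threshold=3),
    "retrieval_threshold": 3,
})
```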