azure-ai-evaluation 1.0.0b2__py3-none-any.whl → 1.13.3__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of azure-ai-evaluation has been flagged as possibly problematic.

Files changed (299)
  1. azure/ai/evaluation/__init__.py +100 -5
  2. azure/ai/evaluation/{_evaluators/_chat → _aoai}/__init__.py +3 -2
  3. azure/ai/evaluation/_aoai/aoai_grader.py +140 -0
  4. azure/ai/evaluation/_aoai/label_grader.py +68 -0
  5. azure/ai/evaluation/_aoai/python_grader.py +86 -0
  6. azure/ai/evaluation/_aoai/score_model_grader.py +94 -0
  7. azure/ai/evaluation/_aoai/string_check_grader.py +66 -0
  8. azure/ai/evaluation/_aoai/text_similarity_grader.py +80 -0
  9. azure/ai/evaluation/_azure/__init__.py +3 -0
  10. azure/ai/evaluation/_azure/_clients.py +204 -0
  11. azure/ai/evaluation/_azure/_envs.py +207 -0
  12. azure/ai/evaluation/_azure/_models.py +227 -0
  13. azure/ai/evaluation/_azure/_token_manager.py +129 -0
  14. azure/ai/evaluation/_common/__init__.py +9 -1
  15. azure/ai/evaluation/{simulator/_helpers → _common}/_experimental.py +24 -9
  16. azure/ai/evaluation/_common/constants.py +131 -2
  17. azure/ai/evaluation/_common/evaluation_onedp_client.py +169 -0
  18. azure/ai/evaluation/_common/math.py +89 -0
  19. azure/ai/evaluation/_common/onedp/__init__.py +32 -0
  20. azure/ai/evaluation/_common/onedp/_client.py +166 -0
  21. azure/ai/evaluation/_common/onedp/_configuration.py +72 -0
  22. azure/ai/evaluation/_common/onedp/_model_base.py +1232 -0
  23. azure/ai/evaluation/_common/onedp/_patch.py +21 -0
  24. azure/ai/evaluation/_common/onedp/_serialization.py +2032 -0
  25. azure/ai/evaluation/_common/onedp/_types.py +21 -0
  26. azure/ai/evaluation/_common/onedp/_utils/__init__.py +6 -0
  27. azure/ai/evaluation/_common/onedp/_utils/model_base.py +1232 -0
  28. azure/ai/evaluation/_common/onedp/_utils/serialization.py +2032 -0
  29. azure/ai/evaluation/_common/onedp/_validation.py +66 -0
  30. azure/ai/evaluation/_common/onedp/_vendor.py +50 -0
  31. azure/ai/evaluation/_common/onedp/_version.py +9 -0
  32. azure/ai/evaluation/_common/onedp/aio/__init__.py +29 -0
  33. azure/ai/evaluation/_common/onedp/aio/_client.py +168 -0
  34. azure/ai/evaluation/_common/onedp/aio/_configuration.py +72 -0
  35. azure/ai/evaluation/_common/onedp/aio/_patch.py +21 -0
  36. azure/ai/evaluation/_common/onedp/aio/operations/__init__.py +49 -0
  37. azure/ai/evaluation/_common/onedp/aio/operations/_operations.py +7143 -0
  38. azure/ai/evaluation/_common/onedp/aio/operations/_patch.py +21 -0
  39. azure/ai/evaluation/_common/onedp/models/__init__.py +358 -0
  40. azure/ai/evaluation/_common/onedp/models/_enums.py +447 -0
  41. azure/ai/evaluation/_common/onedp/models/_models.py +5963 -0
  42. azure/ai/evaluation/_common/onedp/models/_patch.py +21 -0
  43. azure/ai/evaluation/_common/onedp/operations/__init__.py +49 -0
  44. azure/ai/evaluation/_common/onedp/operations/_operations.py +8951 -0
  45. azure/ai/evaluation/_common/onedp/operations/_patch.py +21 -0
  46. azure/ai/evaluation/_common/onedp/py.typed +1 -0
  47. azure/ai/evaluation/_common/onedp/servicepatterns/__init__.py +1 -0
  48. azure/ai/evaluation/_common/onedp/servicepatterns/aio/__init__.py +1 -0
  49. azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/__init__.py +25 -0
  50. azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_operations.py +34 -0
  51. azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_patch.py +20 -0
  52. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/__init__.py +1 -0
  53. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/__init__.py +1 -0
  54. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/__init__.py +22 -0
  55. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_operations.py +29 -0
  56. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_patch.py +20 -0
  57. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/__init__.py +22 -0
  58. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_operations.py +29 -0
  59. azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_patch.py +20 -0
  60. azure/ai/evaluation/_common/onedp/servicepatterns/operations/__init__.py +25 -0
  61. azure/ai/evaluation/_common/onedp/servicepatterns/operations/_operations.py +34 -0
  62. azure/ai/evaluation/_common/onedp/servicepatterns/operations/_patch.py +20 -0
  63. azure/ai/evaluation/_common/rai_service.py +831 -142
  64. azure/ai/evaluation/_common/raiclient/__init__.py +34 -0
  65. azure/ai/evaluation/_common/raiclient/_client.py +128 -0
  66. azure/ai/evaluation/_common/raiclient/_configuration.py +87 -0
  67. azure/ai/evaluation/_common/raiclient/_model_base.py +1235 -0
  68. azure/ai/evaluation/_common/raiclient/_patch.py +20 -0
  69. azure/ai/evaluation/_common/raiclient/_serialization.py +2050 -0
  70. azure/ai/evaluation/_common/raiclient/_version.py +9 -0
  71. azure/ai/evaluation/_common/raiclient/aio/__init__.py +29 -0
  72. azure/ai/evaluation/_common/raiclient/aio/_client.py +130 -0
  73. azure/ai/evaluation/_common/raiclient/aio/_configuration.py +87 -0
  74. azure/ai/evaluation/_common/raiclient/aio/_patch.py +20 -0
  75. azure/ai/evaluation/_common/raiclient/aio/operations/__init__.py +25 -0
  76. azure/ai/evaluation/_common/raiclient/aio/operations/_operations.py +981 -0
  77. azure/ai/evaluation/_common/raiclient/aio/operations/_patch.py +20 -0
  78. azure/ai/evaluation/_common/raiclient/models/__init__.py +60 -0
  79. azure/ai/evaluation/_common/raiclient/models/_enums.py +18 -0
  80. azure/ai/evaluation/_common/raiclient/models/_models.py +651 -0
  81. azure/ai/evaluation/_common/raiclient/models/_patch.py +20 -0
  82. azure/ai/evaluation/_common/raiclient/operations/__init__.py +25 -0
  83. azure/ai/evaluation/_common/raiclient/operations/_operations.py +1238 -0
  84. azure/ai/evaluation/_common/raiclient/operations/_patch.py +20 -0
  85. azure/ai/evaluation/_common/raiclient/py.typed +1 -0
  86. azure/ai/evaluation/_common/utils.py +870 -34
  87. azure/ai/evaluation/_constants.py +167 -6
  88. azure/ai/evaluation/_converters/__init__.py +3 -0
  89. azure/ai/evaluation/_converters/_ai_services.py +899 -0
  90. azure/ai/evaluation/_converters/_models.py +467 -0
  91. azure/ai/evaluation/_converters/_sk_services.py +495 -0
  92. azure/ai/evaluation/_eval_mapping.py +83 -0
  93. azure/ai/evaluation/_evaluate/_batch_run/__init__.py +17 -0
  94. azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +176 -0
  95. azure/ai/evaluation/_evaluate/_batch_run/batch_clients.py +82 -0
  96. azure/ai/evaluation/_evaluate/{_batch_run_client → _batch_run}/code_client.py +47 -25
  97. azure/ai/evaluation/_evaluate/{_batch_run_client/batch_run_context.py → _batch_run/eval_run_context.py} +42 -13
  98. azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +124 -0
  99. azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +62 -0
  100. azure/ai/evaluation/_evaluate/_eval_run.py +102 -59
  101. azure/ai/evaluation/_evaluate/_evaluate.py +2134 -311
  102. azure/ai/evaluation/_evaluate/_evaluate_aoai.py +992 -0
  103. azure/ai/evaluation/_evaluate/_telemetry/__init__.py +14 -99
  104. azure/ai/evaluation/_evaluate/_utils.py +289 -40
  105. azure/ai/evaluation/_evaluator_definition.py +76 -0
  106. azure/ai/evaluation/_evaluators/_bleu/_bleu.py +93 -42
  107. azure/ai/evaluation/_evaluators/_code_vulnerability/__init__.py +5 -0
  108. azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +119 -0
  109. azure/ai/evaluation/_evaluators/_coherence/_coherence.py +117 -91
  110. azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +76 -39
  111. azure/ai/evaluation/_evaluators/_common/__init__.py +15 -0
  112. azure/ai/evaluation/_evaluators/_common/_base_eval.py +742 -0
  113. azure/ai/evaluation/_evaluators/_common/_base_multi_eval.py +63 -0
  114. azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +345 -0
  115. azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +198 -0
  116. azure/ai/evaluation/_evaluators/_common/_conversation_aggregators.py +49 -0
  117. azure/ai/evaluation/_evaluators/_content_safety/__init__.py +0 -4
  118. azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +144 -86
  119. azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +138 -57
  120. azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +123 -55
  121. azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +133 -54
  122. azure/ai/evaluation/_evaluators/_content_safety/_violence.py +134 -54
  123. azure/ai/evaluation/_evaluators/_document_retrieval/__init__.py +7 -0
  124. azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py +442 -0
  125. azure/ai/evaluation/_evaluators/_eci/_eci.py +49 -56
  126. azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +102 -60
  127. azure/ai/evaluation/_evaluators/_fluency/_fluency.py +115 -92
  128. azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +66 -41
  129. azure/ai/evaluation/_evaluators/_gleu/_gleu.py +90 -37
  130. azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +318 -82
  131. azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +114 -0
  132. azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +104 -0
  133. azure/ai/evaluation/{_evaluate/_batch_run_client → _evaluators/_intent_resolution}/__init__.py +3 -4
  134. azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +196 -0
  135. azure/ai/evaluation/_evaluators/_intent_resolution/intent_resolution.prompty +275 -0
  136. azure/ai/evaluation/_evaluators/_meteor/_meteor.py +107 -61
  137. azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +104 -77
  138. azure/ai/evaluation/_evaluators/_qa/_qa.py +115 -63
  139. azure/ai/evaluation/_evaluators/_relevance/_relevance.py +182 -98
  140. azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +178 -49
  141. azure/ai/evaluation/_evaluators/_response_completeness/__init__.py +7 -0
  142. azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +202 -0
  143. azure/ai/evaluation/_evaluators/_response_completeness/response_completeness.prompty +84 -0
  144. azure/ai/evaluation/_evaluators/{_chat/retrieval → _retrieval}/__init__.py +2 -2
  145. azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +148 -0
  146. azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +93 -0
  147. azure/ai/evaluation/_evaluators/_rouge/_rouge.py +189 -50
  148. azure/ai/evaluation/_evaluators/_service_groundedness/__init__.py +9 -0
  149. azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +179 -0
  150. azure/ai/evaluation/_evaluators/_similarity/_similarity.py +102 -91
  151. azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +0 -5
  152. azure/ai/evaluation/_evaluators/_task_adherence/__init__.py +7 -0
  153. azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +226 -0
  154. azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +101 -0
  155. azure/ai/evaluation/_evaluators/_task_completion/__init__.py +7 -0
  156. azure/ai/evaluation/_evaluators/_task_completion/_task_completion.py +177 -0
  157. azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty +220 -0
  158. azure/ai/evaluation/_evaluators/_task_navigation_efficiency/__init__.py +7 -0
  159. azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py +384 -0
  160. azure/ai/evaluation/_evaluators/_tool_call_accuracy/__init__.py +9 -0
  161. azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +298 -0
  162. azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +166 -0
  163. azure/ai/evaluation/_evaluators/_tool_input_accuracy/__init__.py +9 -0
  164. azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py +263 -0
  165. azure/ai/evaluation/_evaluators/_tool_input_accuracy/tool_input_accuracy.prompty +76 -0
  166. azure/ai/evaluation/_evaluators/_tool_output_utilization/__init__.py +7 -0
  167. azure/ai/evaluation/_evaluators/_tool_output_utilization/_tool_output_utilization.py +225 -0
  168. azure/ai/evaluation/_evaluators/_tool_output_utilization/tool_output_utilization.prompty +221 -0
  169. azure/ai/evaluation/_evaluators/_tool_selection/__init__.py +9 -0
  170. azure/ai/evaluation/_evaluators/_tool_selection/_tool_selection.py +266 -0
  171. azure/ai/evaluation/_evaluators/_tool_selection/tool_selection.prompty +104 -0
  172. azure/ai/evaluation/_evaluators/_tool_success/__init__.py +7 -0
  173. azure/ai/evaluation/_evaluators/_tool_success/_tool_success.py +301 -0
  174. azure/ai/evaluation/_evaluators/_tool_success/tool_success.prompty +321 -0
  175. azure/ai/evaluation/_evaluators/_ungrounded_attributes/__init__.py +5 -0
  176. azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +102 -0
  177. azure/ai/evaluation/_evaluators/_xpia/xpia.py +109 -107
  178. azure/ai/evaluation/_exceptions.py +51 -7
  179. azure/ai/evaluation/_http_utils.py +210 -137
  180. azure/ai/evaluation/_legacy/__init__.py +3 -0
  181. azure/ai/evaluation/_legacy/_adapters/__init__.py +7 -0
  182. azure/ai/evaluation/_legacy/_adapters/_check.py +17 -0
  183. azure/ai/evaluation/_legacy/_adapters/_configuration.py +45 -0
  184. azure/ai/evaluation/_legacy/_adapters/_constants.py +10 -0
  185. azure/ai/evaluation/_legacy/_adapters/_errors.py +29 -0
  186. azure/ai/evaluation/_legacy/_adapters/_flows.py +28 -0
  187. azure/ai/evaluation/_legacy/_adapters/_service.py +16 -0
  188. azure/ai/evaluation/_legacy/_adapters/client.py +51 -0
  189. azure/ai/evaluation/_legacy/_adapters/entities.py +26 -0
  190. azure/ai/evaluation/_legacy/_adapters/tracing.py +28 -0
  191. azure/ai/evaluation/_legacy/_adapters/types.py +15 -0
  192. azure/ai/evaluation/_legacy/_adapters/utils.py +31 -0
  193. azure/ai/evaluation/_legacy/_batch_engine/__init__.py +9 -0
  194. azure/ai/evaluation/_legacy/_batch_engine/_config.py +48 -0
  195. azure/ai/evaluation/_legacy/_batch_engine/_engine.py +477 -0
  196. azure/ai/evaluation/_legacy/_batch_engine/_exceptions.py +88 -0
  197. azure/ai/evaluation/_legacy/_batch_engine/_openai_injector.py +132 -0
  198. azure/ai/evaluation/_legacy/_batch_engine/_result.py +107 -0
  199. azure/ai/evaluation/_legacy/_batch_engine/_run.py +127 -0
  200. azure/ai/evaluation/_legacy/_batch_engine/_run_storage.py +128 -0
  201. azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +262 -0
  202. azure/ai/evaluation/_legacy/_batch_engine/_status.py +25 -0
  203. azure/ai/evaluation/_legacy/_batch_engine/_trace.py +97 -0
  204. azure/ai/evaluation/_legacy/_batch_engine/_utils.py +97 -0
  205. azure/ai/evaluation/_legacy/_batch_engine/_utils_deprecated.py +131 -0
  206. azure/ai/evaluation/_legacy/_common/__init__.py +3 -0
  207. azure/ai/evaluation/_legacy/_common/_async_token_provider.py +117 -0
  208. azure/ai/evaluation/_legacy/_common/_logging.py +292 -0
  209. azure/ai/evaluation/_legacy/_common/_thread_pool_executor_with_context.py +17 -0
  210. azure/ai/evaluation/_legacy/prompty/__init__.py +36 -0
  211. azure/ai/evaluation/_legacy/prompty/_connection.py +119 -0
  212. azure/ai/evaluation/_legacy/prompty/_exceptions.py +139 -0
  213. azure/ai/evaluation/_legacy/prompty/_prompty.py +430 -0
  214. azure/ai/evaluation/_legacy/prompty/_utils.py +663 -0
  215. azure/ai/evaluation/_legacy/prompty/_yaml_utils.py +99 -0
  216. azure/ai/evaluation/_model_configurations.py +130 -8
  217. azure/ai/evaluation/_safety_evaluation/__init__.py +3 -0
  218. azure/ai/evaluation/_safety_evaluation/_generated_rai_client.py +0 -0
  219. azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +917 -0
  220. azure/ai/evaluation/_user_agent.py +32 -1
  221. azure/ai/evaluation/_vendor/__init__.py +3 -0
  222. azure/ai/evaluation/_vendor/rouge_score/__init__.py +14 -0
  223. azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +324 -0
  224. azure/ai/evaluation/_vendor/rouge_score/scoring.py +59 -0
  225. azure/ai/evaluation/_vendor/rouge_score/tokenize.py +59 -0
  226. azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +53 -0
  227. azure/ai/evaluation/_version.py +2 -1
  228. azure/ai/evaluation/red_team/__init__.py +22 -0
  229. azure/ai/evaluation/red_team/_agent/__init__.py +3 -0
  230. azure/ai/evaluation/red_team/_agent/_agent_functions.py +261 -0
  231. azure/ai/evaluation/red_team/_agent/_agent_tools.py +461 -0
  232. azure/ai/evaluation/red_team/_agent/_agent_utils.py +89 -0
  233. azure/ai/evaluation/red_team/_agent/_semantic_kernel_plugin.py +228 -0
  234. azure/ai/evaluation/red_team/_attack_objective_generator.py +268 -0
  235. azure/ai/evaluation/red_team/_attack_strategy.py +49 -0
  236. azure/ai/evaluation/red_team/_callback_chat_target.py +115 -0
  237. azure/ai/evaluation/red_team/_default_converter.py +21 -0
  238. azure/ai/evaluation/red_team/_evaluation_processor.py +505 -0
  239. azure/ai/evaluation/red_team/_mlflow_integration.py +430 -0
  240. azure/ai/evaluation/red_team/_orchestrator_manager.py +803 -0
  241. azure/ai/evaluation/red_team/_red_team.py +1717 -0
  242. azure/ai/evaluation/red_team/_red_team_result.py +661 -0
  243. azure/ai/evaluation/red_team/_result_processor.py +1708 -0
  244. azure/ai/evaluation/red_team/_utils/__init__.py +37 -0
  245. azure/ai/evaluation/red_team/_utils/_rai_service_eval_chat_target.py +128 -0
  246. azure/ai/evaluation/red_team/_utils/_rai_service_target.py +601 -0
  247. azure/ai/evaluation/red_team/_utils/_rai_service_true_false_scorer.py +114 -0
  248. azure/ai/evaluation/red_team/_utils/constants.py +72 -0
  249. azure/ai/evaluation/red_team/_utils/exception_utils.py +345 -0
  250. azure/ai/evaluation/red_team/_utils/file_utils.py +266 -0
  251. azure/ai/evaluation/red_team/_utils/formatting_utils.py +365 -0
  252. azure/ai/evaluation/red_team/_utils/logging_utils.py +139 -0
  253. azure/ai/evaluation/red_team/_utils/metric_mapping.py +73 -0
  254. azure/ai/evaluation/red_team/_utils/objective_utils.py +46 -0
  255. azure/ai/evaluation/red_team/_utils/progress_utils.py +252 -0
  256. azure/ai/evaluation/red_team/_utils/retry_utils.py +218 -0
  257. azure/ai/evaluation/red_team/_utils/strategy_utils.py +218 -0
  258. azure/ai/evaluation/simulator/__init__.py +2 -1
  259. azure/ai/evaluation/simulator/_adversarial_scenario.py +26 -1
  260. azure/ai/evaluation/simulator/_adversarial_simulator.py +270 -144
  261. azure/ai/evaluation/simulator/_constants.py +12 -1
  262. azure/ai/evaluation/simulator/_conversation/__init__.py +151 -23
  263. azure/ai/evaluation/simulator/_conversation/_conversation.py +10 -6
  264. azure/ai/evaluation/simulator/_conversation/constants.py +1 -1
  265. azure/ai/evaluation/simulator/_data_sources/__init__.py +3 -0
  266. azure/ai/evaluation/simulator/_data_sources/grounding.json +1150 -0
  267. azure/ai/evaluation/simulator/_direct_attack_simulator.py +54 -75
  268. azure/ai/evaluation/simulator/_helpers/__init__.py +1 -2
  269. azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +1 -0
  270. azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +26 -5
  271. azure/ai/evaluation/simulator/_indirect_attack_simulator.py +145 -104
  272. azure/ai/evaluation/simulator/_model_tools/__init__.py +2 -1
  273. azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +225 -0
  274. azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +80 -30
  275. azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +117 -45
  276. azure/ai/evaluation/simulator/_model_tools/_rai_client.py +109 -7
  277. azure/ai/evaluation/simulator/_model_tools/_template_handler.py +97 -33
  278. azure/ai/evaluation/simulator/_model_tools/models.py +30 -27
  279. azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +6 -10
  280. azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +6 -5
  281. azure/ai/evaluation/simulator/_simulator.py +302 -208
  282. azure/ai/evaluation/simulator/_utils.py +31 -13
  283. azure_ai_evaluation-1.13.3.dist-info/METADATA +939 -0
  284. azure_ai_evaluation-1.13.3.dist-info/RECORD +305 -0
  285. {azure_ai_evaluation-1.0.0b2.dist-info → azure_ai_evaluation-1.13.3.dist-info}/WHEEL +1 -1
  286. azure_ai_evaluation-1.13.3.dist-info/licenses/NOTICE.txt +70 -0
  287. azure/ai/evaluation/_evaluate/_batch_run_client/proxy_client.py +0 -71
  288. azure/ai/evaluation/_evaluators/_chat/_chat.py +0 -357
  289. azure/ai/evaluation/_evaluators/_chat/retrieval/_retrieval.py +0 -157
  290. azure/ai/evaluation/_evaluators/_chat/retrieval/retrieval.prompty +0 -48
  291. azure/ai/evaluation/_evaluators/_content_safety/_content_safety_base.py +0 -65
  292. azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +0 -301
  293. azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +0 -54
  294. azure/ai/evaluation/_evaluators/_protected_materials/__init__.py +0 -5
  295. azure/ai/evaluation/_evaluators/_protected_materials/_protected_materials.py +0 -104
  296. azure/ai/evaluation/simulator/_tracing.py +0 -89
  297. azure_ai_evaluation-1.0.0b2.dist-info/METADATA +0 -449
  298. azure_ai_evaluation-1.0.0b2.dist-info/RECORD +0 -99
  299. {azure_ai_evaluation-1.0.0b2.dist-info → azure_ai_evaluation-1.13.3.dist-info}/top_level.txt +0 -0
@@ -3,67 +3,196 @@ name: Relevance
  description: Evaluates relevance score for QA scenario
  model:
    api: chat
-   configuration:
-     type: azure_openai
-     azure_deployment: ${env:AZURE_DEPLOYMENT}
-     api_key: ${env:AZURE_OPENAI_API_KEY}
-     azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    parameters:
      temperature: 0.0
-     max_tokens: 1
+     max_tokens: 800
      top_p: 1.0
      presence_penalty: 0
      frequency_penalty: 0
      response_format:
-       type: text
+       type: json_object

  inputs:
    query:
      type: string
    response:
      type: string
-   context:
-     type: string
-
  ---
+
  system:
- You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
+ You are a Relevance-Judge, an impartial evaluator that scores how well the RESPONSE addresses the user's queries in the CONVERSATION_HISTORY using the definitions provided.
+
  user:
- Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one to five stars using the following rating scale:
- One star: the answer completely lacks relevance
- Two stars: the answer mostly lacks relevance
- Three stars: the answer is partially relevant
- Four stars: the answer is mostly relevant
- Five stars: the answer has perfect relevance
-
- This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
-
- context: Marie Curie was a Polish-born physicist and chemist who pioneered research on radioactivity and was the first woman to win a Nobel Prize.
- question: What field did Marie Curie excel in?
- answer: Marie Curie was a renowned painter who focused mainly on impressionist styles and techniques.
- stars: 1
-
- context: The Beatles were an English rock band formed in Liverpool in 1960, and they are widely regarded as the most influential music band in history.
- question: Where were The Beatles formed?
- answer: The band The Beatles began their journey in London, England, and they changed the history of music.
- stars: 2
-
- context: The recent Mars rover, Perseverance, was launched in 2020 with the main goal of searching for signs of ancient life on Mars. The rover also carries an experiment called MOXIE, which aims to generate oxygen from the Martian atmosphere.
- question: What are the main goals of Perseverance Mars rover mission?
- answer: The Perseverance Mars rover mission focuses on searching for signs of ancient life on Mars.
- stars: 3
-
- context: The Mediterranean diet is a commonly recommended dietary plan that emphasizes fruits, vegetables, whole grains, legumes, lean proteins, and healthy fats. Studies have shown that it offers numerous health benefits, including a reduced risk of heart disease and improved cognitive health.
- question: What are the main components of the Mediterranean diet?
- answer: The Mediterranean diet primarily consists of fruits, vegetables, whole grains, and legumes.
- stars: 4
-
- context: The Queen's Royal Castle is a well-known tourist attraction in the United Kingdom. It spans over 500 acres and contains extensive gardens and parks. The castle was built in the 15th century and has been home to generations of royalty.
- question: What are the main attractions of the Queen's Royal Castle?
- answer: The main attractions of the Queen's Royal Castle are its expansive 500-acre grounds, extensive gardens, parks, and the historical castle itself, which dates back to the 15th century and has housed generations of royalty.
- stars: 5
-
- context: {{context}}
- question: {{query}}
- answer: {{response}}
- stars:
+ ROLE
+ ====
+ You are a Relevance Evaluator. Your task is to judge how relevant a RESPONSE is to the CONVERSATION_HISTORY using the Relevance definitions provided.
+
+ INPUT
+ =====
+ CONVERSATION_HISTORY: {{query}}
+ RESPONSE: {{response}}
+
+ CONVERSATION_HISTORY is the full dialogue between the user and the agent up to the user's latest message. For single-turn interactions, this will be just the user's query.
+ RESPONSE is the agent's reply to the user's latest message.
+
+ TASK
+ ====
+ Output a JSON object with:
+ 1) a concise explanation of 15-60 words justifying your score based on how well the response is relevant to the user's queries in the CONVERSATION_HISTORY.
+ 2) an integer score from 1 (very poor) to 5 (excellent) using the rubric below.
+
+ The explanation should always precede the score and should clearly justify the score based on the rubric definitions.
+ Response format exactly as follows:
+
+ {
+   "explanation": "<15-60 words>",
+   "score": <1-5>
+ }
+
+ EVALUATION STEPS
+ ================
+ A. Read the CONVERSATION_HISTORY and RESPONSE carefully.
+ B. Identify the user's query from the latest message (use conversation history for context if needed).
+ C. Compare the RESPONSE against the rubric below:
+    - Does the response directly address the user's query?
+    - Is the information complete, partial, or off-topic?
+    - Is it vague, generic, or insightful?
+ D. Match the response to the best score from the rubric.
+ E. Provide a short explanation and the score using the required format.
+
+ SCORING RUBRIC
+ ==============
+
+ ### Score 1 - Irrelevant Response
+ Definition: The response is unrelated to the question. It provides off-topic information and does not attempt to address the question posed.
+
+ **Example A**
+ CONVERSATION_HISTORY: What is the team preparing for?
+ RESPONSE: I went grocery shopping yesterday evening.
+
+ Expected Output:
+ {
+   "explanation": "The response is entirely off-topic and doesn't address the question.",
+   "score": 1
+ }
+
+ **Example B**
+ CONVERSATION_HISTORY: When will the company's new product line launch?
+ RESPONSE: International travel can be very rewarding and educational.
+
+ Expected Output:
+ {
+   "explanation": "The response is completely irrelevant to the product launch question.",
+   "score": 1
+ }
+
+ ### Score 2 - Related but Unhelpful / Superficial
+ Definition: The response is loosely or formally related to the query but fails to deliver any meaningful, specific, or useful information. This includes vague phrases, non-answers, or failure/error messages.
+
+ **Example A**
+ CONVERSATION_HISTORY: What is the event about?
+ RESPONSE: It's something important.
+
+ Expected Output:
+ {
+   "explanation": "The response vaguely refers to the query topic but lacks specific or informative content.",
+   "score": 2
+ }
+
+ **Example B**
+ CONVERSATION_HISTORY: What's the weather in Paris?
+ RESPONSE: I tried to find the forecast but the query failed.
+
+ Expected Output:
+ {
+   "explanation": "The response acknowledges the query but provides no usable weather information. It is related but unhelpful.",
+   "score": 2
+ }
+
+ ### Score 3 - Partially Relevant / Incomplete
+ Definition: The response addresses the query and includes relevant information, but omits essential components or detail. The answer is on-topic but insufficient to fully satisfy the request.
+
+ **Example A**
+ CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
+ RESPONSE: The apartment complex has a gym.
+
+ Expected Output:
+ {
+   "explanation": "The response mentions one amenity but does not provide a fuller list or clarify whether other standard features (like parking or security) are included. It partially addresses the query but lacks completeness.",
+   "score": 3
+ }
+
+ **Example B**
+ CONVERSATION_HISTORY: What services does the premium membership include?
+ RESPONSE: It includes priority customer support.
+
+ Expected Output:
+ {
+   "explanation": "The response identifies one service but omits other likely components of a premium membership (e.g., exclusive content or discounts). The information is relevant but incomplete.",
+   "score": 3
+ }
+
+ ### Score 4 - Fully Relevant / Sufficient Response
+ Definition: The response fully addresses the question with accurate and sufficient information, covering all essential aspects. Very minor omissions are acceptable as long as the core information is intact and the intent is clearly conveyed.
+
+ **Example A**
+ CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
+ RESPONSE: The apartment complex provides a gym, swimming pool, and 24/7 security.
+
+ Expected Output:
+ {
+   "explanation": "The response mentions multiple key amenities likely to be relevant to most users. While it may not list every feature, it clearly conveys the core offerings of the complex.",
+   "score": 4
+ }
+
+ **Example B**
+ CONVERSATION_HISTORY: What services does the premium membership include?
+ RESPONSE: The premium membership includes priority customer support, exclusive content access, and early product releases.
+
+ Expected Output:
+ {
+   "explanation": "The response outlines all major services expected from a premium membership. Even if a minor service is not mentioned, the core value is clearly and fully represented.",
+   "score": 4
+ }
+
+ ### Score 5 - Comprehensive Response with Insights
+ Definition: The response not only fully and accurately answers the question, but also adds meaningful elaboration, interpretation, or context that enhances the user's understanding. This goes beyond just listing relevant details — it offers insight into why the information matters, how it's useful, or what impact it has.
+
+ **Example A**
+ CONVERSATION_HISTORY: What amenities does the new apartment complex provide?
+ RESPONSE: The apartment complex provides a gym, swimming pool, and 24/7 security, designed to offer residents a comfortable and active lifestyle while ensuring their safety.
+
+ Expected Output:
+ {
+   "explanation": "The response fully lists key amenities and additionally explains how these features contribute to resident experience, enhancing the usefulness of the information.",
+   "score": 5
+ }
+
+ **Example B**
+ CONVERSATION_HISTORY: What services does the premium membership include?
+ RESPONSE: The premium membership includes priority customer support, exclusive content access, and early product releases — tailored for users who want quicker resolutions and first access to new features.
+
+ Expected Output:
+ {
+   "explanation": "The response covers all essential services and adds valuable insight about the target user and benefits, enriching the response beyond basic listing.",
+   "score": 5
+ }
+
+ ### Multi-turn Conversation Example
+ When evaluating responses in a multi-turn conversation, consider the conversation context to understand the user's intent:
+
+ **Example - Multi-turn Context**
+ CONVERSATION_HISTORY: [{"role":"user","content":"I'm planning a vacation to Europe."},{"role":"assistant","content":"That sounds exciting! What time of year are you thinking of traveling?"},{"role":"user","content":"Probably in July. What's the weather like then?"}]
+ RESPONSE: [{"role":"assistant","content":"July is summer in Europe with generally warm and pleasant weather. Most countries have temperatures between 20-25°C (68-77°F). It's a popular travel time, so expect crowds at major tourist attractions and higher accommodation prices."}]
+
+ Expected Output:
+ {
+   "explanation": "The response directly addresses the weather question while providing valuable context about crowds and pricing that's relevant to vacation planning established in the conversation.",
+   "score": 5
+ }
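
The updated prompty drops the `context` input and asks the model for a JSON object with an `explanation` and an integer `score` rather than a bare star rating. Below is a minimal sketch (not part of the diff) of driving the updated evaluator, assuming the public `RelevanceEvaluator` export from `azure.ai.evaluation` and an Azure OpenAI model configuration; the endpoint, deployment, and key values are placeholders, and the exact result keys are indicative only.

```python
# Sketch only: RelevanceEvaluator is the public wrapper around this prompty;
# configuration values are placeholders and result keys may vary by release.
from azure.ai.evaluation import RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

relevance_evaluator = RelevanceEvaluator(model_config)

# Only query and response are passed now; the old `context` input was removed.
result = relevance_evaluator(
    query="Where were The Beatles formed?",
    response="The Beatles were formed in Liverpool, England.",
)
print(result)  # e.g. {"relevance": 5.0, "relevance_reason": "...", ...}
```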
@@ -0,0 +1,7 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ from ._response_completeness import ResponseCompletenessEvaluator
+
+ __all__ = ["ResponseCompletenessEvaluator"]
@@ -0,0 +1,202 @@
+ # ---------------------------------------------------------
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # ---------------------------------------------------------
+
+ import os
+ import logging
+ import math
+ from typing import Dict, List, Union, Optional
+
+ from typing_extensions import overload, override
+
+ from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
+ from azure.ai.evaluation._evaluators._common import PromptyEvaluatorBase
+ from azure.ai.evaluation._common.utils import parse_quality_evaluator_reason_score
+ from azure.ai.evaluation._model_configurations import Conversation, Message
+ from azure.ai.evaluation._common._experimental import experimental
+
+ logger = logging.getLogger(__name__)
+
+
+ @experimental
+ class ResponseCompletenessEvaluator(PromptyEvaluatorBase[Union[str, float]]):
+     """Evaluates the extent to which a given response contains all necessary and relevant information with respect to the
+     provided ground truth.
+
+     The completeness measure assesses how thoroughly an AI model's generated response aligns with the key information,
+     claims, and statements established in the ground truth. This evaluation considers the presence, accuracy,
+     and relevance of the content provided.
+
+     The assessment spans multiple levels, ranging from fully incomplete to fully complete, ensuring a comprehensive
+     evaluation of the response's content quality.
+
+     Use this metric when you need to evaluate an AI model's ability to deliver comprehensive and accurate information,
+     particularly in text generation tasks where conveying all essential details is crucial for clarity,
+     context, and correctness.
+
+     Completeness scores range from 1 to 5:
+
+     1: Fully incomplete — Contains none of the necessary information.
+     2: Barely complete — Contains only a small portion of the required information.
+     3: Moderately complete — Covers about half of the required content.
+     4: Mostly complete — Includes most of the necessary details with minimal omissions.
+     5: Fully complete — Contains all key information without any omissions.
+
+     :param model_config: Configuration for the Azure OpenAI model.
+     :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
+         ~azure.ai.evaluation.OpenAIModelConfiguration]
+
+     .. admonition:: Example using Azure AI Project URL:
+
+         .. literalinclude:: ../samples/evaluation_samples_evaluate_fdp.py
+             :start-after: [START completeness_evaluator]
+             :end-before: [END completeness_evaluator]
+             :language: python
+             :dedent: 8
+             :caption: Initialize and call CompletenessEvaluator using Azure AI Project URL in the following format
+                 https://{resource_name}.services.ai.azure.com/api/projects/{project_name}
+     """
+
+     # Constants must be defined within eval's directory to be save/loadable
+
+     _PROMPTY_FILE = "response_completeness.prompty"
+     _RESULT_KEY = "response_completeness"
+
+     id = "azureai://built-in/evaluators/response_completeness"
+
+     _MIN_COMPLETENESS_SCORE = 1
+     _MAX_COMPLETENESS_SCORE = 5
+     _DEFAULT_COMPLETENESS_THRESHOLD = 3
+
+     """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
+
+     @override
+     def __init__(
+         self, model_config, *, threshold: Optional[float] = _DEFAULT_COMPLETENESS_THRESHOLD, credential=None, **kwargs
+     ):
+         current_dir = os.path.dirname(__file__)
+         prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
+         self.threshold = threshold  # to be removed in favor of _threshold
+         super().__init__(
+             model_config=model_config,
+             prompty_file=prompty_path,
+             result_key=self._RESULT_KEY,
+             threshold=threshold,
+             credential=credential,
+             _higher_is_better=True,
+             **kwargs,
+         )
+
+     @overload
+     def __call__(
+         self,
+         *,
+         ground_truth: str,
+         response: str,
+     ) -> Dict[str, Union[str, float]]:
+         """Evaluate completeness in given response. Accepts ground truth and response for evaluation.
+         Example usage:
+         Evaluating completeness for a response string
+         ```python
+         from azure.ai.evaluation import ResponseCompletenessEvaluator
+         completeness_evaluator = ResponseCompletenessEvaluator(model_config)
+         ground_truth = "The ground truth to be evaluated."
+         response = "The response to be evaluated."
+         completeness_results = completeness_evaluator(ground_truth=ground_truth, response=response)
+         ```
+         :keyword ground_truth: The ground truth to be evaluated.
+         :paramtype ground_truth: str
+         :keyword response: The response to be evaluated.
+         :paramtype response: Union[str, List[Message]]
+         :return: The response completeness score results.
+         :rtype: Dict[str, Union[str, float]]
+         """
+
+     @overload
+     def __call__(
+         self,
+         *,
+         conversation: Conversation,
+     ) -> Dict[str, Union[float, Dict[str, List[Union[str, float]]]]]:
+         """Evaluate completeness for a conversation
+         :keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
+             key "messages", and potentially a global context under the key "context". Conversation turns are expected
+             to be dictionaries with keys "content", "role", and possibly "context".
+         :paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
+         :return: The response completeness score.
+         :rtype: Dict[str, Union[float, Dict[str, List[float]]]]
+         """
+
+     @override
+     def __call__(  # pylint: disable=docstring-missing-param
+         self,
+         *args,
+         **kwargs,
+     ):
+         """
+         Invokes the instance using the overloaded __call__ signature.
+
+         For detailed parameter types and return value documentation, see the overloaded __call__ definition.
+         """
+         return super().__call__(*args, **kwargs)
+
+     @override
+     async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:  # type: ignore[override]
+         """Do completeness evaluation.
+         :param eval_input: The input to the evaluator. Expected to contain whatever inputs are needed for the _flow method
+         :type eval_input: Dict
+         :return: The evaluation result.
+         :rtype: Dict
+         """
+         # we override the _do_eval method as we want the output to be a dictionary,
+         # which is a different schema than _base_prompty_eval.py
+         if "ground_truth" not in eval_input or "response" not in eval_input:
+             raise EvaluationException(
+                 message=f"Both ground_truth and response must be provided as input to the completeness evaluator.",
+                 internal_message=f"Both ground_truth and response must be provided as input to the completeness"
+                 f" evaluator.",
+                 blame=ErrorBlame.USER_ERROR,
+                 category=ErrorCategory.MISSING_FIELD,
+                 target=ErrorTarget.COMPLETENESS_EVALUATOR,
+             )
+
+         result = await self._flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
+         llm_output = result.get("llm_output") if isinstance(result, dict) else result
+
+         score = math.nan
+         llm_output_is_dict = isinstance(llm_output, dict)
+         if llm_output_is_dict or isinstance(llm_output, str):
+             reason = ""
+             if llm_output_is_dict:
+                 score = float(llm_output.get("score", math.nan))
+                 reason = llm_output.get("explanation", "")
+             else:
+                 score, reason = parse_quality_evaluator_reason_score(llm_output, valid_score_range="[1-5]")
+
+             binary_result = self._get_binary_result(score)
+
+             # updating the result key and threshold to int based on the schema
+             return {
+                 f"{self._result_key}": int(score),
+                 f"{self._result_key}_result": binary_result,
+                 f"{self._result_key}_threshold": int(self._threshold),
+                 f"{self._result_key}_reason": reason,
+                 f"{self._result_key}_prompt_tokens": result.get("input_token_count", 0),
+                 f"{self._result_key}_completion_tokens": result.get("output_token_count", 0),
+                 f"{self._result_key}_total_tokens": result.get("total_token_count", 0),
+                 f"{self._result_key}_finish_reason": result.get("finish_reason", ""),
+                 f"{self._result_key}_model": result.get("model_id", ""),
+                 f"{self._result_key}_sample_input": result.get("sample_input", ""),
+                 f"{self._result_key}_sample_output": result.get("sample_output", ""),
+             }
+
+         if logger:
+             logger.warning("LLM output is not a dictionary, returning NaN for the score.")
+
+         binary_result = self._get_binary_result(score)
+         return {
+             self._result_key: float(score),
+             f"{self._result_key}_result": binary_result,
+             f"{self._result_key}_threshold": self._threshold,
+         }
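
As a short usage sketch (not part of the diff): assuming `ResponseCompletenessEvaluator` is re-exported from the package root, as the new `__init__.py` above suggests, a single-turn call takes `ground_truth` and `response` and returns the `response_completeness` score plus the `_result`, `_threshold`, `_reason`, and token-usage keys built in `_do_eval`. The model configuration values below are placeholders.

```python
# Sketch only: configuration values are placeholders; the result keys mirror
# the _do_eval implementation shown in the diff above.
from azure.ai.evaluation import ResponseCompletenessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

completeness = ResponseCompletenessEvaluator(model_config, threshold=3)
result = completeness(
    ground_truth="Flu shot can prevent flu-related illnesses.",
    response="A flu shot helps prevent flu-related illnesses.",
)

print(result["response_completeness"])         # integer score from 1 to 5
print(result["response_completeness_result"])  # pass/fail relative to the threshold
print(result["response_completeness_reason"])  # model-provided explanation
```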
@@ -0,0 +1,84 @@
+ ---
+ name: Completeness
+ description: Evaluates Completeness score for QA scenario
+ model:
+   api: chat
+   parameters:
+     temperature: 0.0
+     max_tokens: 800
+     top_p: 1.0
+     seed: 123
+     presence_penalty: 0
+     frequency_penalty: 0
+     response_format:
+       type: text
+
+ inputs:
+   response:
+     type: string
+   ground_truth:
+     type: string
+
+ ---
+ system:
+ # Instruction
+ ## Goal
+ ### You are an expert in evaluating the quality of a Response from an intelligent system based on provided definition and data. Your goal will involve answering the questions below using the information provided.
+ - **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
+ - **Data**: Your input data include Response and Ground Truth.
+ - **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways.
+
+ user:
+ # Definition
+ **Completeness** refers to how accurately and thoroughly a response represents the information provided in the ground truth. It considers both the inclusion of all relevant statements and the correctness of those statements. Each statement in the ground truth should be evaluated individually to determine if it is accurately reflected in the response without missing any key information. The scale ranges from 1 to 5, with higher numbers indicating greater completeness.
+
+ # Ratings
+ ## [Completeness: 1] (Fully Incomplete)
+ **Definition:** A response that does not contain any of the necessary and relevant information with respect to the ground truth. It completely misses all the information, especially claims and statements, established in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot cannot cure cancer. Stay healthy requires sleeping exactly 8 hours a day. A few hours of exercise per week will have little benefits for physical and mental health. Physical and mental health benefits are separate topics. Scientists have not studied any of them."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Staying healthy requires proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+ ## [Completeness: 2] (Barely Complete)
+ **Definition:** A response that contains only a small percentage of all the necessary and relevant information with respect to the ground truth. It misses almost all the information, especially claims and statements, established in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot can prevent flu-related illnesses. Staying healthy requires 2 meals a day. Exercise per week makes no difference to physical and mental health. This is because physical and mental health benefits have low correlation through scientific studies. Scientists are making this observation in studies."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+ ## [Completeness: 3] (Moderately Complete)
+ **Definition:** A response that contains half of the necessary and relevant information with respect to the ground truth. It misses half of the information, especially claims and statements, established in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot can prevent flu-related illnesses. Staying healthy requires a few dollars of investments a day. Even a few dollars of investments per week will not make an impact on physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Fiction writers are starting to discover them through their works."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+ ## [Completeness: 4] (Mostly Complete)
+ **Definition:** A response that contains most of the necessary and relevant information with respect to the ground truth. It misses some minor information, especially claims and statements, established in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot can prevent flu-related illnesses. Staying healthy requires keto diet and rigorous athletic training. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+ ## [Completeness: 5] (Fully Complete)
+ **Definition:** A response that perfectly contains all the necessary and relevant information with respect to the ground truth. It does not miss any information from statements and claims in the ground truth.
+
+ **Examples:**
+ **Response:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+ **Ground Truth:** "Flu shot can prevent flu-related illnesses. Stay healthy by proper hydration and moderate exercise. Even a few hours of exercise per week can have long-term benefits for physical and mental health. This is because physical and mental health benefits have intricate relationships through behavioral changes. Scientists are starting to discover them through rigorous studies."
+
+
+ # Data
+ Response: {{response}}
+ Ground Truth: {{ground_truth}}
+
+
+ # Tasks
+ ## Please provide your assessment Score for the previous RESPONSE in relation to the GROUND TRUTH based on the Definitions above. Your output should include the following information:
+ - **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
+ - **Explanation**: a very short explanation of why you think the input data should get that Score.
+ - **Score**: based on your previous analysis, provide your Score. The Score you give MUST be an integer score (i.e., "1", "2"...) based on the levels of the definitions.
+
+ ## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your score</S2>.
+ # Output
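
The prompty asks the model to wrap its chain of thought, explanation, and score in `<S0>`, `<S1>`, and `<S2>` tags, while the evaluator code above parses free-text output through `parse_quality_evaluator_reason_score`. The snippet below is a hypothetical illustration of pulling those tagged fields out of a raw completion; it is not the library's actual parser.

```python
# Hypothetical helper, for illustration only: extracts the <S0>/<S1>/<S2>
# fields that the Completeness prompty asks the model to emit.
import re


def parse_tagged_output(text: str) -> dict:
    """Return thought chain, explanation, and integer score from tagged output."""
    fields = {}
    for tag, name in (("S0", "thought_chain"), ("S1", "explanation"), ("S2", "score")):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[name] = match.group(1).strip() if match else None
    if fields["score"]:
        digits = re.sub(r"\D", "", fields["score"])
        fields["score"] = int(digits) if digits else None
    return fields


sample = "<S0>Let's think step by step: ...</S0><S1>All claims are covered.</S1><S2>5</S2>"
print(parse_tagged_output(sample))
# {'thought_chain': "Let's think step by step: ...", 'explanation': 'All claims are covered.', 'score': 5}
```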
@@ -2,8 +2,8 @@
  # Copyright (c) Microsoft Corporation. All rights reserved.
  # ---------------------------------------------------------

- from ._retrieval import RetrievalChatEvaluator
+ from ._retrieval import RetrievalEvaluator

  __all__ = [
-     "RetrievalChatEvaluator",
+     "RetrievalEvaluator",
  ]
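
For existing callers, the rename is a one-line import change. The migration sketch below assumes `RetrievalEvaluator` is re-exported from the package root and accepts per-turn `query` and `context` keyword inputs; the configuration values are placeholders.

```python
# Migration sketch: the 1.0.0b2 import path is shown commented out; the call
# signature (query/context) is an assumption, not confirmed by this diff.
# from azure.ai.evaluation._evaluators._chat.retrieval import RetrievalChatEvaluator  # 1.0.0b2
from azure.ai.evaluation import RetrievalEvaluator  # 1.13.3

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

retrieval = RetrievalEvaluator(model_config)
result = retrieval(
    query="Where were The Beatles formed?",
    context="The Beatles were an English rock band formed in Liverpool in 1960.",
)
print(result)
```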