azure-ai-evaluation 1.0.0b3__tar.gz → 1.0.0b5__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- azure_ai_evaluation-1.0.0b5/CHANGELOG.md +183 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/MANIFEST.in +1 -0
- azure_ai_evaluation-1.0.0b5/NOTICE.txt +70 -0
- {azure_ai_evaluation-1.0.0b3/azure_ai_evaluation.egg-info → azure_ai_evaluation-1.0.0b5}/PKG-INFO +237 -52
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/README.md +126 -42
- azure_ai_evaluation-1.0.0b5/TROUBLESHOOTING.md +50 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/__init__.py +23 -1
- {azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/simulator/_helpers → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_common}/_experimental.py +20 -9
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/constants.py +9 -2
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_common/math.py +29 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/rai_service.py +222 -93
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_common/utils.py +411 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_constants.py +16 -8
- {azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/__init__.py +3 -2
- {azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/code_client.py +33 -17
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluate/_batch_run_client/batch_run_context.py → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +14 -7
- {azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/proxy_client.py +22 -4
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +35 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_eval_run.py +47 -14
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_evaluate.py +370 -188
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_telemetry/__init__.py +15 -16
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_utils.py +77 -25
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_bleu/_bleu.py +1 -1
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_coherence/_coherence.py +16 -10
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +99 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_eval.py +76 -46
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +26 -19
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +62 -25
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +138 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +67 -46
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +33 -4
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +33 -4
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +33 -4
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_violence.py +33 -4
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_eci/_eci.py +7 -5
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +14 -6
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_fluency/_fluency.py +22 -21
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +86 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_gleu/_gleu.py +1 -1
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +106 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +113 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +99 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_meteor/_meteor.py +3 -7
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/__init__.py +20 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +130 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +57 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +96 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +120 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +96 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +96 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_violence.py +96 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +90 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_qa/_qa.py +11 -6
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_relevance/_relevance.py +23 -20
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +100 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +197 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +93 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_rouge/_rouge.py +2 -2
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_service_groundedness/__init__.py +9 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +150 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/_similarity.py +32 -15
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_xpia/xpia.py +36 -10
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_exceptions.py +26 -6
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_http_utils.py +203 -132
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_model_configurations.py +23 -6
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/__init__.py +3 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/__init__.py +14 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +328 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/scoring.py +63 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/tokenize.py +63 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +53 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_version.py +1 -1
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/__init__.py +2 -1
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_adversarial_scenario.py +5 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_adversarial_simulator.py +88 -60
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/__init__.py +13 -12
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/_conversation.py +4 -4
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/simulator/_data_sources/__init__.py +3 -0
- azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/simulator/_data_sources/grounding.json +1150 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_direct_attack_simulator.py +24 -66
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/__init__.py +1 -2
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +26 -5
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_indirect_attack_simulator.py +98 -95
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +67 -21
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +28 -11
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_template_handler.py +68 -24
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/models.py +10 -10
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +4 -9
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +6 -5
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_simulator.py +222 -169
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_tracing.py +4 -4
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_utils.py +6 -6
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5/azure_ai_evaluation.egg-info}/PKG-INFO +237 -52
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/SOURCES.txt +30 -6
- azure_ai_evaluation-1.0.0b5/azure_ai_evaluation.egg-info/requires.txt +10 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/pyproject.toml +1 -2
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/setup.py +5 -5
- azure_ai_evaluation-1.0.0b5/tests/__pf_service_isolation.py +28 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/conftest.py +27 -8
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/custom_evaluators/answer_length_with_aggregation.py +9 -2
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/target_fn.py +18 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_adv_simulator.py +51 -24
- azure_ai_evaluation-1.0.0b5/tests/e2etests/test_builtin_evaluators.py +1021 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_evaluate.py +228 -28
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_sim_and_eval.py +7 -12
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_batch_run_context.py +8 -8
- azure_ai_evaluation-1.0.0b5/tests/unittests/test_built_in_evaluator.py +138 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_content_safety_rai_script.py +28 -23
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_eval_run.py +33 -4
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluate.py +63 -26
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluate_telemetry.py +11 -10
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_jailbreak_simulator.py +4 -3
- azure_ai_evaluation-1.0.0b5/tests/unittests/test_non_adv_simulator.py +362 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_simulator.py +4 -5
- azure_ai_evaluation-1.0.0b5/tests/unittests/test_utils.py +56 -0
- azure_ai_evaluation-1.0.0b3/CHANGELOG.md +0 -81
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_common/utils.py +0 -102
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +0 -57
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +0 -106
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +0 -56
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +0 -71
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +0 -49
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +0 -57
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +0 -64
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +0 -151
- azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +0 -43
- azure_ai_evaluation-1.0.0b3/azure_ai_evaluation.egg-info/requires.txt +0 -16
- azure_ai_evaluation-1.0.0b3/tests/e2etests/test_builtin_evaluators.py +0 -474
- azure_ai_evaluation-1.0.0b3/tests/unittests/test_built_in_evaluator.py +0 -41
- azure_ai_evaluation-1.0.0b3/tests/unittests/test_non_adv_simulator.py +0 -129
- azure_ai_evaluation-1.0.0b3/tests/unittests/test_utils.py +0 -20
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_bleu/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_coherence/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_eci/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_f1_score/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_fluency/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_gleu/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_groundedness/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_meteor/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_protected_material/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_qa/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_relevance/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_retrieval/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_rouge/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_xpia/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_user_agent.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/py.typed +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_constants.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/constants.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_rai_client.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/dependency_links.txt +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/not-zip-safe +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/top_level.txt +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/setup.cfg +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/__openai_patcher.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/__init__.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_metrics_upload.py +1 -1
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_content_safety_defect_rate.py +1 -1
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluators/apology_dag/apology.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluators/test_inputs_evaluators.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_save_eval.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_synthetic_callback_conv_bot.py +0 -0
- {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_synthetic_conversation_bot.py +1 -1
azure_ai_evaluation-1.0.0b5/CHANGELOG.md
ADDED
@@ -0,0 +1,183 @@

# Release History

## 1.0.0b5 (2024-10-28)

### Features Added

- Added `GroundednessProEvaluator`, which is a service-based evaluator for determining response groundedness.
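A minimal usage sketch (not part of the original changelog text); it assumes the new evaluator follows the `azure_ai_project`/`credential` constructor pattern of the other service-based evaluators and accepts `query`/`response`/`context` inputs:
```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import GroundednessProEvaluator

# Hypothetical project reference; replace with your own values.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

groundedness_pro = GroundednessProEvaluator(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
)
result = groundedness_pro(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
    context="From the product list, the Alpine Explorer Tent is the most waterproof.",
)
print(result)
```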
- Groundedness detection in Non Adversarial Simulator via query/context pairs
```python
import importlib.resources as pkg_resources
import asyncio
import json

from azure.ai.evaluation.simulator import Simulator

# model_config and callback are defined as in the Simulator examples in the README
package = "azure.ai.evaluation.simulator._data_sources"
resource_name = "grounding.json"
custom_simulator = Simulator(model_config=model_config)
conversation_turns = []
with pkg_resources.path(package, resource_name) as grounding_file:
    with open(grounding_file, "r") as file:
        data = json.load(file)
for item in data:
    conversation_turns.append([item])
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=conversation_turns,
    max_conversation_turns=1,
))
```
- Added evaluators for multimodal use cases
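A hedged sketch of how the new multimodal content safety evaluation might be invoked. The class name `ContentSafetyMultimodalEvaluator` is inferred from the new `_evaluators/_multimodal` modules in the file list, and the OpenAI-style image message shape is an assumption, not text from this changelog:
```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ContentSafetyMultimodalEvaluator  # assumed export name

# A conversation whose user turn carries an image reference
conversation = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        },
        {"role": "assistant", "content": "The image shows a camping tent next to a lake."},
    ]
}

multimodal_eval = ContentSafetyMultimodalEvaluator(
    azure_ai_project=azure_ai_project,  # project dict as used by the other service-based evaluators
    credential=DefaultAzureCredential(),
)
result = multimodal_eval(conversation=conversation)
```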

### Breaking Changes
- Renamed environment variable `PF_EVALS_BATCH_USE_ASYNC` to `AI_EVALS_BATCH_USE_ASYNC`.
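For existing code, only the variable name changes; the accepted values are assumed to be unchanged (a sketch, not from the changelog):
```python
import os

# Before 1.0.0b5: os.environ["PF_EVALS_BATCH_USE_ASYNC"] = "false"
os.environ["AI_EVALS_BATCH_USE_ASYNC"] = "false"  # assumed to opt out of the async batch run client
```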
- `RetrievalEvaluator` now requires a `context` input in addition to `query` in single-turn evaluation.
- `RelevanceEvaluator` no longer takes `context` as an input. It now only takes `query` and `response` in single-turn evaluation.
- `FluencyEvaluator` no longer takes `query` as an input. It now only takes `response` in single-turn evaluation.
- The `AdversarialScenario` enum no longer includes `ADVERSARIAL_INDIRECT_JAILBREAK`; indirect jailbreak (XPIA) simulations should be run with `IndirectAttackSimulator` instead.
- Outputs of `Simulator` and `AdversarialSimulator` previously exposed `to_eval_qa_json_lines` and now expose `to_eval_qr_json_lines`. Where `to_eval_qa_json_lines` produced:
```json
{"question": <user_message>, "answer": <assistant_message>}
```
`to_eval_qr_json_lines` now produces:
```json
{"query": <user_message>, "response": <assistant_message>}
```
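A short sketch of consuming the renamed helper; it assumes `outputs` is the object returned by a `Simulator`/`AdversarialSimulator` run and that the method returns a JSON Lines string, as its predecessor did:
```python
from pathlib import Path

# `outputs` comes from a Simulator or AdversarialSimulator run
eval_input = outputs.to_eval_qr_json_lines()  # one {"query": ..., "response": ...} object per line
Path("simulated_qr_pairs.jsonl").write_text(eval_input, encoding="utf-8")
```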

### Bugs Fixed
- Non-adversarial simulator works with `gpt-4o` models using the `json_schema` response format
- Fixed an issue where the `evaluate` API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when the venv folder and the target function file are in the same directory.
- Fixed `evaluate` API failure when `trace.destination` is set to `none`
- Non-adversarial simulator now accepts context from the callback

### Other Changes
- Improved error messages for the `evaluate` API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.
- `GroundednessEvaluator` now supports `query` as an optional input in single-turn evaluation. If `query` is provided, a different prompt template will be used for the evaluation.
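A sketch of the optional `query` input (the tent example values are illustrative and `model_config` is assumed to be an Azure OpenAI model configuration defined elsewhere); without `query` the query-free prompt template is used, with it the query-aware template is used:
```python
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config=model_config)

# Without query: the groundedness_without_query template is used
score = groundedness(
    response="The Alpine Explorer Tent is the most waterproof.",
    context="From the product list, the Alpine Explorer Tent is the most waterproof.",
)

# With query: the query-aware groundedness_with_query template is used
score_with_query = groundedness(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
    context="From the product list, the Alpine Explorer Tent is the most waterproof.",
)
```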
51
|
+
- To align with our support of a diverse set of models, the following evaluators will now have a new key in their result output without the `gpt_` prefix. To maintain backwards compatibility, the old key with the `gpt_` prefix will still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
|
|
52
|
+
- `CoherenceEvaluator`
|
|
53
|
+
- `RelevanceEvaluator`
|
|
54
|
+
- `FluencyEvaluator`
|
|
55
|
+
- `GroundednessEvaluator`
|
|
56
|
+
- `SimilarityEvaluator`
|
|
57
|
+
- `RetrievalEvaluator`
|
|
58
|
+
- The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
|
|
59
|
+
|
|
60
|
+
| Evaluator | New Token Limit |
|
|
61
|
+
| --- | --- |
|
|
62
|
+
| `CoherenceEvaluator` | 800 |
|
|
63
|
+
| `RelevanceEvaluator` | 800 |
|
|
64
|
+
| `FluencyEvaluator` | 800 |
|
|
65
|
+
| `GroundednessEvaluator` | 800 |
|
|
66
|
+
| `RetrievalEvaluator` | 1600 |
|
|
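A sketch of reading the 1.0.0b5 result dictionary for one of these evaluators under the naming scheme described above (`model_config` is assumed to be defined elsewhere):
```python
from azure.ai.evaluation import CoherenceEvaluator

coherence = CoherenceEvaluator(model_config=model_config)
result = coherence(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
)

print(result["coherence"])         # new key, no gpt_ prefix
print(result["gpt_coherence"])     # legacy key, kept for backwards compatibility
print(result["coherence_reason"])  # new "<metric_name>_reason" key with the LLM's reasoning
```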
67
|
+
- Improved the error message for storage access permission issues to provide clearer guidance for users.
|
|
68
|
+
|
|
69
|
+
## 1.0.0b4 (2024-10-16)
|
|
70
|
+
|
|
71
|
+
### Breaking Changes
|
|
72
|
+
|
|
73
|
+
- Removed `numpy` dependency. All NaN values returned by the SDK have been changed to from `numpy.nan` to `math.nan`.
|
|
74
|
+
- `credential` is now required to be passed in for all content safety evaluators and `ProtectedMaterialsEvaluator`. `DefaultAzureCredential` will no longer be chosen if a credential is not passed.
|
|
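For illustration (a sketch, not text from the changelog), constructing a content safety evaluator now looks like the following; `azure_ai_project` is assumed to be the usual project dictionary:
```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ViolenceEvaluator

# Since 1.0.0b4 the credential must be passed explicitly
violence = ViolenceEvaluator(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
)
result = violence(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
)
```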
- Changed package extra name from "pf-azure" to "remote".

### Bugs Fixed
- Adversarial Conversation simulations would fail with `Forbidden`. Added logic to re-fetch the token in the exponential retry logic to retrieve the RAI Service response.
- Fixed an issue where the `evaluate` API did not fail due to missing inputs when the target did not return columns required by the evaluators.

### Other Changes
- Enhanced the error message to provide clearer instruction when required packages for the remote tracking feature are missing.
- Print the per-evaluator run summary at the end of the `evaluate` API call to make troubleshooting row-level failures easier.

## 1.0.0b3 (2024-10-01)

### Features Added

- Added `type` field to `AzureOpenAIModelConfiguration` and `OpenAIModelConfiguration`
- The following evaluators now support `conversation` as an alternative input to their usual single-turn inputs (see the sketch after this list):
  - `ViolenceEvaluator`
  - `SexualEvaluator`
  - `SelfHarmEvaluator`
  - `HateUnfairnessEvaluator`
  - `ProtectedMaterialEvaluator`
  - `IndirectAttackEvaluator`
  - `CoherenceEvaluator`
  - `RelevanceEvaluator`
  - `FluencyEvaluator`
  - `GroundednessEvaluator`
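A minimal sketch of the conversation input mentioned above; the message shape (OpenAI-style `messages` with an optional per-turn `context`) is an assumption about this package's conversation format, and `model_config` is assumed to be defined elsewhere:
```python
from azure.ai.evaluation import CoherenceEvaluator

conversation = {
    "messages": [
        {"role": "user", "content": "Which tent is the most waterproof?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent is the most waterproof.",
            "context": "From the product list, the Alpine Explorer Tent is the most waterproof.",
        },
    ]
}

coherence = CoherenceEvaluator(model_config=model_config)
result = coherence(conversation=conversation)  # per-turn scores are aggregated into the result
```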
- Surfaced `RetrievalScoreEvaluator`, formerly an internal part of `ChatEvaluator`, as a standalone conversation-only evaluator.

### Breaking Changes

- Removed `ContentSafetyChatEvaluator` and `ChatEvaluator`
- The `evaluator_config` parameter of `evaluate` now maps an evaluator name to a dictionary `EvaluatorConfig`, which is a `TypedDict`. The `column_mapping` between `data` or `target` and evaluator field names should now be specified inside this new dictionary:

Before:
```python
evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "query": "${data.question}",
            "response": "${data.answer}",
        }
    },
    ...
)
```

After:
```python
evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.question}",
                "response": "${data.answer}",
            }
        }
    },
    ...
)
```

- Simulator now requires a model configuration to call the prompty instead of an Azure AI project scope. This enables the usage of the simulator with Entra ID based auth.
Before:
```python
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("RESOURCE_GROUP"),
    "project_name": os.environ.get("PROJECT_NAME"),
}
sim = Simulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
```
After:
```python
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
}
sim = Simulator(model_config=model_config)
```
If `api_key` is not included in the `model_config`, the prompty runtime in `promptflow-core` will pick up `DefaultAzureCredential`.

### Bugs Fixed

- Fixed issue where Entra ID authentication was not working with `AzureOpenAIModelConfiguration`

## 1.0.0b2 (2024-09-24)

### Breaking Changes

- `data` and `evaluators` are now required keywords in `evaluate`.
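A short sketch of the required keywords (the file name and evaluator choice are illustrative, not from the changelog):
```python
from azure.ai.evaluation import evaluate, F1ScoreEvaluator

result = evaluate(
    data="eval_data.jsonl",                 # required keyword
    evaluators={"f1": F1ScoreEvaluator()},  # required keyword
)
```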
168
|
+
|
|
169
|
+
## 1.0.0b1 (2024-09-20)
|
|
170
|
+
|
|
171
|
+
### Breaking Changes
|
|
172
|
+
|
|
173
|
+
- The `synthetic` namespace has been renamed to `simulator`, and sub-namespaces under this module have been removed
|
|
174
|
+
- The `evaluate` and `evaluators` namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace `azure.ai.evaluation`
|
|
175
|
+
- The parameter name `project_scope` in content safety evaluators have been renamed to `azure_ai_project` for consistency with evaluate API and simulators.
|
|
176
|
+
- Model configurations classes are now of type `TypedDict` and are exposed in the `azure.ai.evaluation` module instead of coming from `promptflow.core`.
|
|
177
|
+
- Updated the parameter names for `question` and `answer` in built-in evaluators to more generic terms: `query` and `response`.
|
|
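A short illustration of the rename (a sketch; the evaluator choice is arbitrary and `model_config` is assumed to be defined elsewhere):
```python
from azure.ai.evaluation import CoherenceEvaluator

coherence = CoherenceEvaluator(model_config=model_config)

# promptflow-evals parameter names:
#   coherence(question="Which tent is the most waterproof?", answer="The Alpine Explorer Tent ...")
# Renamed parameters:
result = coherence(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
)
```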

### Features Added

- First preview
- This package is a port of `promptflow-evals`. New features will be added only to this package moving forward.
- Added a `TypedDict` for `AzureAIProject` that allows for better intellisense and type checking when passing in project information
azure_ai_evaluation-1.0.0b5/NOTICE.txt
ADDED
@@ -0,0 +1,70 @@

NOTICES AND INFORMATION
Do Not Translate or Localize

This software incorporates material from third parties.
Microsoft makes certain open source code available at https://3rdpartysource.microsoft.com,
or you may send a check or money order for US $5.00, including the product name,
the open source component name, platform, and version number, to:

Source Code Compliance Team
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
USA

Notwithstanding any other terms, you may reverse engineer this software to the extent
required to debug changes to any libraries licensed under the GNU Lesser General Public License.

License notice for nltk
---------------------------------------------------------

Copyright 2024 The NLTK Project

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

License notice for rouge-score
---------------------------------------------------------

Copyright 2024 The Google Research Authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


License notice for [Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings](https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2023.1272229/full)
------------------------------------------------------------------------------------------------------------------
Copyright © 2023 Hackl, Müller, Granitzer and Sailer. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).


License notice for [Is ChatGPT a Good NLG Evaluator? A Preliminary Study](https://aclanthology.org/2023.newsum-1.1) (Wang et al., NewSum 2023)
------------------------------------------------------------------------------------------------------------------
Copyright © 2023. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).


License notice for [SummEval: Re-evaluating Summarization Evaluation.](https://doi.org/10.1162/tacl_a_00373) (Fabbri et al.)
------------------------------------------------------------------------------------------------------------------
© 2021 Association for Computational Linguistics. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).


License notice for [Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks](https://aclanthology.org/2023.emnlp-main.543) (Sottana et al., EMNLP 2023)
------------------------------------------------------------------------------------------------------------------
© 2023 Association for Computational Linguistics. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
{azure_ai_evaluation-1.0.0b3/azure_ai_evaluation.egg-info → azure_ai_evaluation-1.0.0b5}/PKG-INFO
RENAMED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: azure-ai-evaluation
-Version: 1.0.0b3
+Version: 1.0.0b5
 Summary: Microsoft Azure Evaluation Library for Python
 Home-page: https://github.com/Azure/azure-sdk-for-python
 Author: Microsoft Corporation
@@ -21,17 +21,16 @@ Classifier: License :: OSI Approved :: MIT License
 Classifier: Operating System :: OS Independent
 Requires-Python: >=3.8
 Description-Content-Type: text/markdown
+License-File: NOTICE.txt
 Requires-Dist: promptflow-devkit>=1.15.0
 Requires-Dist: promptflow-core>=1.15.0
-Requires-Dist: numpy>=1.23.2; python_version < "3.12"
-Requires-Dist: numpy>=1.26.4; python_version >= "3.12"
 Requires-Dist: pyjwt>=2.8.0
-Requires-Dist: azure-identity>=1.
+Requires-Dist: azure-identity>=1.16.0
 Requires-Dist: azure-core>=1.30.2
 Requires-Dist: nltk>=3.9.1
-
-
-Requires-Dist:
+Provides-Extra: remote
+Requires-Dist: promptflow-azure<2.0.0,>=1.15.0; extra == "remote"
+Requires-Dist: azure-ai-inference>=1.0.0b4; extra == "remote"
 
 # Azure AI Evaluation client library for Python
 
@@ -97,9 +96,6 @@ if __name__ == "__main__":
     # Running Relevance Evaluator on single input row
     relevance_score = relevance_eval(
         response="The Alpine Explorer Tent is the most waterproof.",
-        context="From the our product list,"
-        " the alpine explorer tent is the most waterproof."
-        " The Adventure Dining Table has higher weight.",
         query="Which tent is the most waterproof?",
     )
 
@@ -154,11 +150,6 @@ name: ApplicationPrompty
 description: Simulates an application
 model:
   api: chat
-  configuration:
-    type: azure_openai
-    azure_deployment: ${env:AZURE_DEPLOYMENT}
-    api_key: ${env:AZURE_OPENAI_API_KEY}
-    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
   parameters:
     temperature: 0.0
     top_p: 1.0
@@ -179,6 +170,95 @@ Output with a string that continues the conversation, responding to the latest message
 {{ conversation_history }}
 
 ```
+
+Query Response generating prompty for gpt-4o with `json_schema` support
+Use this file as an override.
+```yaml
+---
+name: TaskSimulatorQueryResponseGPT4o
+description: Gets queries and responses from a blob of text
+model:
+  api: chat
+  parameters:
+    temperature: 0.0
+    top_p: 1.0
+    presence_penalty: 0
+    frequency_penalty: 0
+    response_format:
+      type: json_schema
+      json_schema:
+        name: QRJsonSchema
+        schema:
+          type: object
+          properties:
+            items:
+              type: array
+              items:
+                type: object
+                properties:
+                  q:
+                    type: string
+                  r:
+                    type: string
+                required:
+                  - q
+                  - r
+
+inputs:
+  text:
+    type: string
+  num_queries:
+    type: integer
+
+
+---
+system:
+You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
+Both Questions and Answers MUST BE extracted from given Text
+Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
+RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
+A sentence should contribute multiple QnAs if it has more info in it
+Answer must not be more than 5 words
+Answer must be picked from Text as is
+Question should be as descriptive as possible and must include as much context as possible from Text
+Output must always have the provided number of QnAs
+Output must be in JSON format.
+Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
+Text:
+<|text_start|>
+On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
+Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
+<|text_end|>
+Output with 5 QnAs:
+{
+    "qna": [{
+        "q": "When did the former Apple CEO Steve Jobs introduced the first Macintosh?",
+        "r": "January 24, 1984"
+    },
+    {
+        "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
+        "r": "Steve Jobs"
+    },
+    {
+        "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
+        "r": "2.06 percent"
+    },
+    {
+        "q": "What were the research firms that reported on Apple's market share in the U.S.?",
+        "r": "IDC and Gartner"
+    },
+    {
+        "q": "What was the percentage increase of Apple's market share in the U.S., as reported by research firms IDC and Gartner?",
+        "r": "6%"
+    }]
+}
+Text:
+<|text_start|>
+{{ text }}
+<|text_end|>
+Output with {{ num_queries }} QnAs:
+```
 
 Application code:
 
 ```python
@@ -187,93 +267,96 @@ import asyncio
 from typing import Any, Dict, List, Optional
 from azure.ai.evaluation.simulator import Simulator
 from promptflow.client import load_flow
-from azure.identity import DefaultAzureCredential
 import os
+import wikipedia
 
-
-
-"
-"
+# Set up the model configuration without api_key, using DefaultAzureCredential
+model_config = {
+    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
+    # not providing key would make the SDK pick up `DefaultAzureCredential`
+    # use "api_key": "<your API key>"
+    "api_version": "2024-08-01-preview"  # keep this for gpt-4o
 }
 
-
-wiki_search_term = "Leonardo da
+# Use Wikipedia to get some text for the simulation
+wiki_search_term = "Leonardo da Vinci"
 wiki_title = wikipedia.search(wiki_search_term)[0]
 wiki_page = wikipedia.page(wiki_title)
 text = wiki_page.summary[:1000]
 
-def method_to_invoke_application_prompty(query: str):
+def method_to_invoke_application_prompty(query: str, messages_list: List[Dict], context: Optional[Dict]):
     try:
         current_dir = os.path.dirname(__file__)
         prompty_path = os.path.join(current_dir, "application.prompty")
-        _flow = load_flow(
-
-
+        _flow = load_flow(
+            source=prompty_path,
+            model=model_config,
+            credential=DefaultAzureCredential()
+        )
         response = _flow(
             query=query,
            context=context,
            conversation_history=messages_list
        )
        return response
-    except:
-        print("Something went wrong invoking the prompty")
+    except Exception as e:
+        print(f"Something went wrong invoking the prompty: {e}")
        return "something went wrong"
 
 async def callback(
-    messages: List[Dict],
+    messages: Dict[str, List[Dict]],
     stream: bool = False,
     session_state: Any = None,  # noqa: ANN401
     context: Optional[Dict[str, Any]] = None,
 ) -> dict:
     messages_list = messages["messages"]
-    #
+    # Get the last message from the user
     latest_message = messages_list[-1]
     query = latest_message["content"]
-
-
-    response
-    # we are formatting the response to follow the openAI chat protocol format
+    # Call your endpoint or AI application here
+    response = method_to_invoke_application_prompty(query, messages_list, context)
+    # Format the response to follow the OpenAI chat protocol format
     formatted_response = {
         "content": response,
         "role": "assistant",
-        "context":
-            "citations": None,
-        },
+        "context": "",
     }
     messages["messages"].append(formatted_response)
     return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
 
-
-
 async def main():
-    simulator = Simulator(
+    simulator = Simulator(model_config=model_config)
+    current_dir = os.path.dirname(__file__)
+    query_response_override_for_latest_gpt_4o = os.path.join(current_dir, "TaskSimulatorQueryResponseGPT4o.prompty")
     outputs = await simulator(
         target=callback,
         text=text,
+        query_response_generating_prompty=query_response_override_for_latest_gpt_4o,  # use this only with latest gpt-4o
         num_queries=2,
-        max_conversation_turns=
+        max_conversation_turns=1,
         user_persona=[
             f"I am a student and I want to learn more about {wiki_search_term}",
             f"I am a teacher and I want to teach my students about {wiki_search_term}"
         ],
     )
-    print(json.dumps(outputs))
+    print(json.dumps(outputs, indent=2))
 
 if __name__ == "__main__":
-
-
-
-    os.environ["
-    os.environ["
-    os.environ["AZURE_DEPLOYMENT"] = ""
+    # Ensure that the following environment variables are set in your environment:
+    # AZURE_OPENAI_ENDPOINT and AZURE_DEPLOYMENT
+    # Example:
+    # os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
+    # os.environ["AZURE_DEPLOYMENT"] = "your-deployment-name"
     asyncio.run(main())
     print("done!")
+
 ```
 
 #### Adversarial Simulator
 
 ```python
-from
+from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
 from azure.identity import DefaultAzureCredential
 from typing import Any, Dict, List, Optional
 import asyncio