azure-ai-evaluation 1.0.0b5__py3-none-any.whl → 1.1.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- azure/ai/evaluation/_azure/__init__.py +3 -0
- azure/ai/evaluation/_azure/_clients.py +188 -0
- azure/ai/evaluation/_azure/_models.py +227 -0
- azure/ai/evaluation/_azure/_token_manager.py +118 -0
- azure/ai/evaluation/_common/_experimental.py +4 -0
- azure/ai/evaluation/_common/math.py +62 -2
- azure/ai/evaluation/_common/rai_service.py +110 -50
- azure/ai/evaluation/_common/utils.py +50 -16
- azure/ai/evaluation/_constants.py +2 -0
- azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +9 -0
- azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +13 -3
- azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +12 -1
- azure/ai/evaluation/_evaluate/_eval_run.py +38 -43
- azure/ai/evaluation/_evaluate/_evaluate.py +62 -131
- azure/ai/evaluation/_evaluate/_telemetry/__init__.py +2 -1
- azure/ai/evaluation/_evaluate/_utils.py +72 -38
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +16 -17
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +60 -29
- azure/ai/evaluation/_evaluators/_common/_base_eval.py +88 -6
- azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +16 -3
- azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +39 -10
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +58 -52
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +79 -34
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +73 -34
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +74 -33
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +76 -34
- azure/ai/evaluation/_evaluators/_eci/_eci.py +28 -3
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +20 -13
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +57 -26
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +13 -15
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +68 -30
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +17 -20
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +10 -8
- azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +0 -2
- azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +10 -6
- azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +6 -2
- azure/ai/evaluation/_evaluators/_multimodal/_violence.py +6 -2
- azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +57 -34
- azure/ai/evaluation/_evaluators/_qa/_qa.py +25 -37
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +63 -29
- azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +76 -161
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +24 -25
- azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +65 -67
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +26 -20
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +74 -40
- azure/ai/evaluation/_exceptions.py +2 -0
- azure/ai/evaluation/_http_utils.py +6 -4
- azure/ai/evaluation/_model_configurations.py +65 -14
- azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +0 -4
- azure/ai/evaluation/_vendor/rouge_score/scoring.py +0 -4
- azure/ai/evaluation/_vendor/rouge_score/tokenize.py +0 -4
- azure/ai/evaluation/_version.py +1 -1
- azure/ai/evaluation/simulator/_adversarial_scenario.py +17 -1
- azure/ai/evaluation/simulator/_adversarial_simulator.py +57 -47
- azure/ai/evaluation/simulator/_constants.py +11 -1
- azure/ai/evaluation/simulator/_conversation/__init__.py +128 -7
- azure/ai/evaluation/simulator/_conversation/_conversation.py +0 -1
- azure/ai/evaluation/simulator/_direct_attack_simulator.py +16 -8
- azure/ai/evaluation/simulator/_indirect_attack_simulator.py +12 -1
- azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +3 -1
- azure/ai/evaluation/simulator/_model_tools/_rai_client.py +48 -4
- azure/ai/evaluation/simulator/_model_tools/_template_handler.py +1 -0
- azure/ai/evaluation/simulator/_simulator.py +54 -45
- azure/ai/evaluation/simulator/_utils.py +25 -7
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/METADATA +240 -327
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/RECORD +71 -68
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +0 -322
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/NOTICE.txt +0 -0
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/WHEEL +0 -0
- {azure_ai_evaluation-1.0.0b5.dist-info → azure_ai_evaluation-1.1.0.dist-info}/top_level.txt +0 -0
--- azure_ai_evaluation-1.0.0b5.dist-info/METADATA
+++ azure_ai_evaluation-1.1.0.dist-info/METADATA
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: azure-ai-evaluation
-Version: 1.0.0b5
+Version: 1.1.0
 Summary: Microsoft Azure Evaluation Library for Python
 Home-page: https://github.com/Azure/azure-sdk-for-python
 Author: Microsoft Corporation
@@ -9,7 +9,7 @@ License: MIT License
 Project-URL: Bug Reports, https://github.com/Azure/azure-sdk-for-python/issues
 Project-URL: Source, https://github.com/Azure/azure-sdk-for-python
 Keywords: azure,azure sdk
-Classifier: Development Status ::
+Classifier: Development Status :: 5 - Production/Stable
 Classifier: Programming Language :: Python
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
@@ -28,13 +28,20 @@ Requires-Dist: pyjwt >=2.8.0
 Requires-Dist: azure-identity >=1.16.0
 Requires-Dist: azure-core >=1.30.2
 Requires-Dist: nltk >=3.9.1
-
-Requires-Dist: promptflow-azure <2.0.0,>=1.15.0 ; extra == 'remote'
-Requires-Dist: azure-ai-inference >=1.0.0b4 ; extra == 'remote'
+Requires-Dist: azure-storage-blob >=12.10.0
 
 # Azure AI Evaluation client library for Python
 
-
+Use Azure AI Evaluation SDK to assess the performance of your generative AI applications. Generative AI application generations are quantitatively measured with mathematical based metrics, AI-assisted quality and safety metrics. Metrics are defined as `evaluators`. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.
+
+Use Azure AI Evaluation SDK to:
+- Evaluate existing data from generative AI applications
+- Evaluate generative AI applications
+- Evaluate by generating mathematical, AI-assisted quality and safety metrics
+
+Azure AI SDK provides following to evaluate Generative AI Applications:
+- [Evaluators][evaluators] - Generate scores individually or when used together with `evaluate` API.
+- [Evaluate API][evaluate_api] - Python API to evaluate dataset or application using built-in or custom evaluators.
 
 [Source code][source_code]
 | [Package (PyPI)][evaluation_pypi]
@@ -42,272 +49,177 @@ We are excited to introduce the public preview of the Azure AI Evaluation SDK.
 | [Product documentation][product_documentation]
 | [Samples][evaluation_samples]
 
-This package has been tested with Python 3.8, 3.9, 3.10, 3.11, and 3.12.
-
-For a more complete set of Azure libraries, see https://aka.ms/azsdk/python/all
 
 ## Getting started
 
 ### Prerequisites
 
 - Python 3.8 or later is required to use this package.
+- [Optional] You must have [Azure AI Project][ai_project] or [Azure Open AI][azure_openai] to use AI-assisted evaluators
 
 ### Install the package
 
-Install the Azure AI Evaluation
+Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:
 
 ```bash
 pip install azure-ai-evaluation
 ```
+If you want to track results in [AI Studio][ai_studio], install `remote` extra:
+```python
+pip install azure-ai-evaluation[remote]
+```
 
 ## Key concepts
 
-Evaluators
+### Evaluators
 
-
+Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.
 
-
+#### Built-in evaluators
+
+Built-in evaluators are out of box evaluators provided by Microsoft:
+| Category | Evaluator class |
+|-----------|------------------------------------------------------------------------------------------------------------------------------------|
+| [Performance and quality][performance_and_quality_evaluators] (AI-assisted) | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |
+| [Performance and quality][performance_and_quality_evaluators] (NLP) | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`|
+| [Risk and safety][risk_and_safety_evaluators] (AI-assisted) | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator` |
+| [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator` |
 
-
+For more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].
 
 ```python
 import os
-from pprint import pprint
 
-from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator
+from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator
 
+# NLP bleu score evaluator
+bleu_score_evaluator = BleuScoreEvaluator()
+result = bleu_score(
+    response="Tokyo is the capital of Japan.",
+    ground_truth="The capital of Japan is Tokyo."
+)
 
-
-
-
+# AI assisted quality evaluator
+model_config = {
+    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
+    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
+}
 
-
-
-
-
-
-    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
-    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
-}
+relevance_evaluator = RelevanceEvaluator(model_config)
+result = relevance_evaluator(
+    query="What is the capital of Japan?",
+    response="The capital of Japan is Tokyo."
+)
 
-
-
+# AI assisted safety evaluator
+azure_ai_project = {
+    "subscription_id": "<subscription_id>",
+    "resource_group_name": "<resource_group_name>",
+    "project_name": "<project_name>",
+}
 
-
-
-
-
-
+violence_evaluator = ViolenceEvaluator(azure_ai_project)
+result = violence_evaluator(
+    query="What is the capital of France?",
+    response="Paris."
+)
+```
 
-
-# {'gpt_relevance': 5.0}
+#### Custom evaluators
 
-
+Built-in evaluators are great out of the box to start evaluating your application's generations. However you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.
 
-
-azure_ai_project = {
-    "subscription_id": <subscription_id>,
-    "resource_group_name": <resource_group_name>,
-    "project_name": <project_name>
-}
+```python
 
-
-
-
-# {'violence': 'Very low',
-# 'violence_reason': "The system's response is a straightforward factual response "
-#    'to a geography question. There is no violent content or '
-#    'language present.',
-# 'violence_score': 0}
-
-# Code based evaluator
-response_length("The Alpine Explorer Tent is the most waterproof.")
-# {'value': 48}
-
-# Using multiple evaluators together using `Evaluate` API
-
-result = evaluate(
-    data="evaluate_test_data.jsonl",
-    evaluators={
-        "response_length": response_length,
-        "violence": violence_eval,
-    },
-)
+# Custom evaluator as a function to calculate response length
+def response_length(response, **kwargs):
+    return len(response)
 
-
-
-
+# Custom class based evaluator to check for blocked words
+class BlocklistEvaluator:
+    def __init__(self, blocklist):
+        self._blocklist = blocklist
 
+    def __call__(self, *, response: str, **kwargs):
+        score = any([word in answer for word in self._blocklist])
+        return {"score": score}
 
-
-their AI application.
+blocklist_evaluator = BlocklistEvaluator(blocklist=["bad, worst, terrible"])
 
-
+result = response_length("The capital of Japan is Tokyo.")
+result = blocklist_evaluator(answer="The capital of Japan is Tokyo.")
 
-```
----
-name: ApplicationPrompty
-description: Simulates an application
-model:
-  api: chat
-  parameters:
-    temperature: 0.0
-    top_p: 1.0
-    presence_penalty: 0
-    frequency_penalty: 0
-    response_format:
-      type: text
+```
 
-
-
-    type: dict
+### Evaluate API
+The package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate generative AI application response.
 
-
-system:
-You are a helpful assistant and you're helping with the user's query. Keep the conversation engaging and interesting.
+#### Evaluate existing dataset
 
-
-
+```python
+from azure.ai.evaluation import evaluate
 
+result = evaluate(
+    data="data.jsonl", # provide your data here
+    evaluators={
+        "blocklist": blocklist_evaluator,
+        "relevance": relevance_evaluator
+    },
+    # column mapping
+    evaluator_config={
+        "relevance": {
+            "column_mapping": {
+                "query": "${data.queries}"
+                "ground_truth": "${data.ground_truth}"
+                "response": "${outputs.response}"
+            }
+        }
+    }
+    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
+    azure_ai_project = azure_ai_project,
+    # Optionally provide an output path to dump a json of metric summary, row level data and metric and studio URL
+    output_path="./evaluation_results.json"
+)
 ```
+For more details refer to [Evaluate on test dataset using evaluate()][evaluate_dataset]
 
-
-
-
-
-
-
-
-
-
-    temperature: 0.0
-    top_p: 1.0
-    presence_penalty: 0
-    frequency_penalty: 0
-    response_format:
-      type: json_schema
-      json_schema:
-        name: QRJsonSchema
-        schema:
-          type: object
-          properties:
-            items:
-              type: array
-              items:
-                type: object
-                properties:
-                  q:
-                    type: string
-                  r:
-                    type: string
-                required:
-                  - q
-                  - r
-
-inputs:
-  text:
-    type: string
-  num_queries:
-    type: integer
-
-
----
-system:
-You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
-Both Questions and Answers MUST BE extracted from given Text
-Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
-RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
-A sentence should contribute multiple QnAs if it has more info in it
-Answer must not be more than 5 words
-Answer must be picked from Text as is
-Question should be as descriptive as possible and must include as much context as possible from Text
-Output must always have the provided number of QnAs
-Output must be in JSON format.
-Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
-Text:
-<|text_start|>
-On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
-Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
-<|text_end|>
-Output with 5 QnAs:
-{
-    "qna": [{
-        "q": "When did the former Apple CEO Steve Jobs introduced the first Macintosh?",
-        "r": "January 24, 1984"
-    },
-    {
-        "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
-        "r": "Steve Jobs"
-    },
-    {
-        "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
-        "r": "2.06 percent"
-    },
-    {
-        "q": "What were the research firms that reported on Apple's market share in the U.S.?",
-        "r": "IDC and Gartner"
+#### Evaluate generative AI application
+```python
+from askwiki import askwiki
+
+result = evaluate(
+    data="data.jsonl",
+    target=askwiki,
+    evaluators={
+        "relevance": relevance_eval
     },
-    {
-        "
-
-
-}
-
-
-
-
-
+    evaluator_config={
+        "default": {
+            "column_mapping": {
+                "query": "${data.queries}"
+                "context": "${outputs.context}"
+                "response": "${outputs.response}"
+            }
+        }
+    }
+)
 ```
+Above code snippet refers to askwiki application in this [sample][evaluate_app].
 
-
+For more details refer to [Evaluate on a target][evaluate_target]
 
-
-import json
-import asyncio
-from typing import Any, Dict, List, Optional
-from azure.ai.evaluation.simulator import Simulator
-from promptflow.client import load_flow
-import os
-import wikipedia
+### Simulator
 
-# Set up the model configuration without api_key, using DefaultAzureCredential
-model_config = {
-    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
-    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
-    # not providing key would make the SDK pick up `DefaultAzureCredential`
-    # use "api_key": "<your API key>"
-    "api_version": "2024-08-01-preview" # keep this for gpt-4o
-}
 
-
-
-    wiki_title = wikipedia.search(wiki_search_term)[0]
-    wiki_page = wikipedia.page(wiki_title)
-    text = wiki_page.summary[:1000]
-
-def method_to_invoke_application_prompty(query: str, messages_list: List[Dict], context: Optional[Dict]):
-    try:
-        current_dir = os.path.dirname(__file__)
-        prompty_path = os.path.join(current_dir, "application.prompty")
-        _flow = load_flow(
-            source=prompty_path,
-            model=model_config,
-            credential=DefaultAzureCredential()
-        )
-        response = _flow(
-            query=query,
-            context=context,
-            conversation_history=messages_list
-        )
-        return response
-    except Exception as e:
-        print(f"Something went wrong invoking the prompty: {e}")
-        return "something went wrong"
+Simulators allow users to generate synthentic data using their application. Simulator expects the user to have a callback method that invokes their AI application. The intergration between your AI application and the simulator happens at the callback method. Here's how a sample callback would look like:
+
 
+```python
 async def callback(
     messages: Dict[str, List[Dict]],
     stream: bool = False,
-    session_state: Any = None,
+    session_state: Any = None,
     context: Optional[Dict[str, Any]] = None,
 ) -> dict:
     messages_list = messages["messages"]
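Note on the new README snippets in the hunk above: as published they contain a few slips — `result = bleu_score(...)` calls a name that is never defined (the object created is `bleu_score_evaluator`), `BlocklistEvaluator.__call__` accepts `response` but reads `answer`, the blocklist is a single string `"bad, worst, terrible"`, and the `column_mapping` dictionaries are missing commas. A minimal corrected sketch follows; it assumes the documented `BleuScoreEvaluator`, `RelevanceEvaluator`, and `evaluate` APIs, and the dataset path, column names, and environment variables are placeholders.

```python
import os

from azure.ai.evaluation import BleuScoreEvaluator, RelevanceEvaluator, evaluate

# NLP metric: no model deployment needed
bleu_score_evaluator = BleuScoreEvaluator()
bleu_result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.",
)

# AI-assisted metric: needs an Azure OpenAI deployment
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}
relevance_evaluator = RelevanceEvaluator(model_config)


# Custom class-based evaluator: the keyword parameter is `response`, not `answer`
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        score = any(word in response for word in self._blocklist)
        return {"score": score}


blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

# Run several evaluators over a JSONL dataset; note the commas inside column_mapping.
# Without a `target`, responses are read from the dataset itself.
result = evaluate(
    data="data.jsonl",  # placeholder path; one JSON object per line
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator,
    },
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}",
                "ground_truth": "${data.ground_truth}",
                "response": "${data.response}",
            }
        }
    },
    output_path="./evaluation_results.json",
)
```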
@@ -315,8 +227,8 @@ async def callback(
     latest_message = messages_list[-1]
     query = latest_message["content"]
     # Call your endpoint or AI application here
-    response
-
+    # response should be a string
+    response = call_to_your_application(query, messages_list, context)
     formatted_response = {
         "content": response,
         "role": "assistant",
@@ -324,33 +236,32 @@ async def callback(
     }
     messages["messages"].append(formatted_response)
     return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
+```
 
-
-
-
-
-
-
-
-
-
-
-
-
-
+The simulator initialization and invocation looks like this:
+```python
+from azure.ai.evaluation.simulator import Simulator
+model_config = {
+    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
+    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
+    "api_version": os.environ.get("AZURE_API_VERSION"),
+}
+custom_simulator = Simulator(model_config=model_config)
+outputs = asyncio.run(custom_simulator(
+    target=callback,
+    conversation_turns=[
+        [
+            "What should I know about the public gardens in the US?",
         ],
-
-
-
-
-
-
-
-
-
-    asyncio.run(main())
-    print("done!")
-
+        [
+            "How do I simulate data against LLMs",
+        ],
+    ],
+    max_conversation_turns=2,
+))
+with open("simulator_output.jsonl", "w") as f:
+    for output in outputs:
+        f.write(output.to_eval_qr_json_lines())
 ```
 
 #### Adversarial Simulator
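The simulator snippets in the hunks above omit their imports and a concrete target. Below is a self-contained sketch under the callback contract shown there; `call_to_your_application` is a placeholder for your own application, and the environment variable names mirror the README (`AZURE_ENDPOINT`, `AZURE_DEPLOYMENT_NAME`, `AZURE_API_VERSION`).

```python
import asyncio
import os
from typing import Any, Dict, List, Optional

from azure.ai.evaluation.simulator import Simulator


def call_to_your_application(query: str, messages_list: List[Dict], context: Optional[Dict]) -> str:
    # Placeholder: return whatever your application would answer with.
    return f"Echo: {query}"


async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    query = messages_list[-1]["content"]
    response = call_to_your_application(query, messages_list, context)
    messages_list.append({"content": response, "role": "assistant", "context": context})
    return {"messages": messages_list, "stream": stream, "session_state": session_state, "context": context}


model_config = {
    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_API_VERSION"),
}

custom_simulator = Simulator(model_config=model_config)
outputs = asyncio.run(
    custom_simulator(
        target=callback,
        conversation_turns=[
            ["What should I know about the public gardens in the US?"],
            ["How do I simulate data against LLMs"],
        ],
        max_conversation_turns=2,
    )
)

# Persist the simulated conversations as query/response JSON lines for later evaluation.
with open("simulator_output.jsonl", "w") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())
```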
@@ -358,73 +269,11 @@ if __name__ == "__main__":
 ```python
 from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
 from azure.identity import DefaultAzureCredential
-from typing import Any, Dict, List, Optional
-import asyncio
-
-
 azure_ai_project = {
     "subscription_id": <subscription_id>,
     "resource_group_name": <resource_group_name>,
     "project_name": <project_name>
 }
-
-async def callback(
-    messages: List[Dict],
-    stream: bool = False,
-    session_state: Any = None,
-    context: Dict[str, Any] = None
-) -> dict:
-    messages_list = messages["messages"]
-    # get last message
-    latest_message = messages_list[-1]
-    query = latest_message["content"]
-    context = None
-    if 'file_content' in messages["template_parameters"]:
-        query += messages["template_parameters"]['file_content']
-    # the next few lines explains how to use the AsyncAzureOpenAI's chat.completions
-    # to respond to the simulator. You should replace it with a call to your model/endpoint/application
-    # make sure you pass the `query` and format the response as we have shown below
-    from openai import AsyncAzureOpenAI
-    oai_client = AsyncAzureOpenAI(
-        api_key=<api_key>,
-        azure_endpoint=<endpoint>,
-        api_version="2023-12-01-preview",
-    )
-    try:
-        response_from_oai_chat_completions = await oai_client.chat.completions.create(messages=[{"content": query, "role": "user"}], model="gpt-4", max_tokens=300)
-    except Exception as e:
-        print(f"Error: {e}")
-        # to continue the conversation, return the messages, else you can fail the adversarial with an exception
-        message = {
-            "content": "Something went wrong. Check the exception e for more details.",
-            "role": "assistant",
-            "context": None,
-        }
-        messages["messages"].append(message)
-        return {
-            "messages": messages["messages"],
-            "stream": stream,
-            "session_state": session_state
-        }
-    response_result = response_from_oai_chat_completions.choices[0].message.content
-    formatted_response = {
-        "content": response_result,
-        "role": "assistant",
-        "context": {},
-    }
-    messages["messages"].append(formatted_response)
-    return {
-        "messages": messages["messages"],
-        "stream": stream,
-        "session_state": session_state,
-        "context": context
-    }
-
-```
-
-#### Adversarial QA
-
-```python
 scenario = AdversarialScenario.ADVERSARIAL_QA
 simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
 
@@ -437,30 +286,30 @@ outputs = asyncio.run(
     )
 )
 
-print(outputs.
+print(outputs.to_eval_qr_json_lines())
 ```
-#### Direct Attack Simulator
 
-
-
-
+For more details about the simulator, visit the following links:
+- [Adversarial Simulation docs][adversarial_simulation_docs]
+- [Adversarial scenarios][adversarial_simulation_scenarios]
+- [Simulating jailbreak attacks][adversarial_jailbreak]
 
-
-
-
-
-
-
-
-
+## Examples
+
+In following section you will find examples of:
+- [Evaluate an application][evaluate_app]
+- [Evaluate different models][evaluate_models]
+- [Custom Evaluators][custom_evaluators]
+- [Adversarial Simulation][adversarial_simulation]
+- [Simulate with conversation starter][simulate_with_conversation_starter]
+
+More examples can be found [here][evaluate_samples].
 
-print(outputs)
-```
 ## Troubleshooting
 
 ### General
 
-
+Please refer to [troubleshooting][evaluation_tsg] for common issues.
 
 ### Logging
 
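The adversarial example in the hunks above shows only part of the call. A self-contained sketch of the same flow is below; the `callback` echo target is a stand-in for your application, and the `max_conversation_turns`/`max_simulation_results` keywords are assumptions carried over from the non-adversarial `Simulator` usage rather than confirmed by this diff.

```python
import asyncio
from typing import Any, Dict, List, Optional

from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}


async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    # Stand-in target: echo the adversarial query back; replace with a call to your application.
    query = messages["messages"][-1]["content"]
    messages["messages"].append({"content": f"Echo: {query}", "role": "assistant", "context": context})
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}


simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,
        target=callback,
        max_conversation_turns=1,   # assumed keyword, mirroring the Simulator example above
        max_simulation_results=3,   # assumed keyword limiting how many samples are generated
    )
)

# The adversarial output supports the same query/response JSON-lines helper used earlier.
print(outputs.to_eval_qr_json_lines())
```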
@@ -505,10 +354,74 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 [code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
 [coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/
 [coc_contact]: mailto:opencode@microsoft.com
-
+[evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target
+[evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate
+[evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
+[evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate
+[evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint
+[evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
+[ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio
+[ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio
+[azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/
+[evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint
+[custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators
+[evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate
+[evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in
+[performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators
+[risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators
+[composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators
+[adversarial_simulation_docs]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#generate-adversarial-simulations-for-safety-evaluation
+[adversarial_simulation_scenarios]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#supported-adversarial-simulation-scenarios
+[adversarial_simulation]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Adversarial_Data
+[simulate_with_conversation_starter]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter
+[adversarial_jailbreak]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#simulating-jailbreak-attacks
 
 # Release History
 
+## 1.1.0 (2024-12-12)
+
+### Bugs Fixed
+- Removed `[remote]` extra. This is no longer needed when tracking results in Azure AI Studio.
+- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results
+
+## 1.0.1 (2024-11-15)
+
+### Bugs Fixed
+- Removing `azure-ai-inference` as dependency.
+- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results
+
+## 1.0.0 (2024-11-13)
+
+### Breaking Changes
+- The `parallel` parameter has been removed from composite evaluators: `QAEvaluator`, `ContentSafetyChatEvaluator`, and `ContentSafetyMultimodalEvaluator`. To control evaluator parallelism, you can now use the `_parallel` keyword argument, though please note that this private parameter may change in the future.
+- Parameters `query_response_generating_prompty_kwargs` and `user_simulator_prompty_kwargs` have been renamed to `query_response_generating_prompty_options` and `user_simulator_prompty_options` in the Simulator's __call__ method.
+
+### Bugs Fixed
+- Fixed an issue where the `output_path` parameter in the `evaluate` API did not support relative path.
+- Output of adversarial simulators are of type `JsonLineList` and the helper function `to_eval_qr_json_lines` now outputs context from both user and assistant turns along with `category` if it exists in the conversation
+- Fixed an issue where during long-running simulations, API token expires causing "Forbidden" error. Instead, users can now set an environment variable `AZURE_TOKEN_REFRESH_INTERVAL` to refresh the token more frequently to prevent expiration and ensure continuous operation of the simulation.
+- Fixed an issue with the `ContentSafetyEvaluator` that caused parallel execution of sub-evaluators to fail. Parallel execution is now enabled by default again, but can still be disabled via the '_parallel' boolean keyword argument during class initialization.
+- Fix `evaluate` function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or
+  otherwise difficult to process. Such values are ignored fully, so the aggregated metric of `[1, 2, 3, NaN]`
+  would be 2, not 1.5.
+
+### Other Changes
+- Refined error messages for serviced-based evaluators and simulators.
+- Tracing has been disabled due to Cosmos DB initialization issue.
+- Introduced environment variable `AI_EVALS_DISABLE_EXPERIMENTAL_WARNING` to disable the warning message for experimental features.
+- Changed the randomization pattern for `AdversarialSimulator` such that there is an almost equal number of Adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the `AdversarialSimulator` outputs. Previously, for 200 `max_simulation_results` a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, user will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
+- For the `DirectAttackSimulator`, the prompt templates used to generate simulated outputs for each Adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass `randomize_order=True` when you call the `DirectAttackSimulator`, for example:
+  ```python
+  adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+  outputs = asyncio.run(
+      adversarial_simulator(
+          scenario=scenario,
+          target=callback,
+          randomize_order=True
+      )
+  )
+  ```
+
 ## 1.0.0b5 (2024-10-28)
 
 ### Features Added
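The 1.0.0 notes above introduce two environment variables. A minimal sketch of setting them before a long-running simulation is shown below; the values are assumptions, since the changelog does not document the expected format.

```python
import os

# Refresh the service token more often during long-running simulations.
# "600" is an assumed value; the changelog does not state the unit or default.
os.environ["AZURE_TOKEN_REFRESH_INTERVAL"] = "600"

# Suppress the warning emitted by experimental (preview) features.
# "true" is an assumed accepted value; check the package docs for the exact flag format.
os.environ["AI_EVALS_DISABLE_EXPERIMENTAL_WARNING"] = "true"
```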
@@ -565,8 +478,8 @@ outputs = asyncio.run(custom_simulator(
 - `SimilarityEvaluator`
 - `RetrievalEvaluator`
 - The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
-
-| Evaluator | New
+
+| Evaluator | New `max_token` for Generation |
 | --- | --- |
 | `CoherenceEvaluator` | 800 |
 | `RelevanceEvaluator` | 800 |