azure-ai-evaluation 1.0.0b2__py3-none-any.whl → 1.13.3__py3-none-any.whl
This diff shows the content of publicly available package versions as published to the supported registries. It is provided for informational purposes only and reflects the changes between the two versions as they appear in those registries.
Potentially problematic release: this version of azure-ai-evaluation might be problematic.
- azure/ai/evaluation/__init__.py +100 -5
- azure/ai/evaluation/{_evaluators/_chat → _aoai}/__init__.py +3 -2
- azure/ai/evaluation/_aoai/aoai_grader.py +140 -0
- azure/ai/evaluation/_aoai/label_grader.py +68 -0
- azure/ai/evaluation/_aoai/python_grader.py +86 -0
- azure/ai/evaluation/_aoai/score_model_grader.py +94 -0
- azure/ai/evaluation/_aoai/string_check_grader.py +66 -0
- azure/ai/evaluation/_aoai/text_similarity_grader.py +80 -0
- azure/ai/evaluation/_azure/__init__.py +3 -0
- azure/ai/evaluation/_azure/_clients.py +204 -0
- azure/ai/evaluation/_azure/_envs.py +207 -0
- azure/ai/evaluation/_azure/_models.py +227 -0
- azure/ai/evaluation/_azure/_token_manager.py +129 -0
- azure/ai/evaluation/_common/__init__.py +9 -1
- azure/ai/evaluation/{simulator/_helpers → _common}/_experimental.py +24 -9
- azure/ai/evaluation/_common/constants.py +131 -2
- azure/ai/evaluation/_common/evaluation_onedp_client.py +169 -0
- azure/ai/evaluation/_common/math.py +89 -0
- azure/ai/evaluation/_common/onedp/__init__.py +32 -0
- azure/ai/evaluation/_common/onedp/_client.py +166 -0
- azure/ai/evaluation/_common/onedp/_configuration.py +72 -0
- azure/ai/evaluation/_common/onedp/_model_base.py +1232 -0
- azure/ai/evaluation/_common/onedp/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/_serialization.py +2032 -0
- azure/ai/evaluation/_common/onedp/_types.py +21 -0
- azure/ai/evaluation/_common/onedp/_utils/__init__.py +6 -0
- azure/ai/evaluation/_common/onedp/_utils/model_base.py +1232 -0
- azure/ai/evaluation/_common/onedp/_utils/serialization.py +2032 -0
- azure/ai/evaluation/_common/onedp/_validation.py +66 -0
- azure/ai/evaluation/_common/onedp/_vendor.py +50 -0
- azure/ai/evaluation/_common/onedp/_version.py +9 -0
- azure/ai/evaluation/_common/onedp/aio/__init__.py +29 -0
- azure/ai/evaluation/_common/onedp/aio/_client.py +168 -0
- azure/ai/evaluation/_common/onedp/aio/_configuration.py +72 -0
- azure/ai/evaluation/_common/onedp/aio/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/aio/operations/__init__.py +49 -0
- azure/ai/evaluation/_common/onedp/aio/operations/_operations.py +7143 -0
- azure/ai/evaluation/_common/onedp/aio/operations/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/models/__init__.py +358 -0
- azure/ai/evaluation/_common/onedp/models/_enums.py +447 -0
- azure/ai/evaluation/_common/onedp/models/_models.py +5963 -0
- azure/ai/evaluation/_common/onedp/models/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/operations/__init__.py +49 -0
- azure/ai/evaluation/_common/onedp/operations/_operations.py +8951 -0
- azure/ai/evaluation/_common/onedp/operations/_patch.py +21 -0
- azure/ai/evaluation/_common/onedp/py.typed +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/__init__.py +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/aio/__init__.py +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_operations.py +34 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/aio/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/__init__.py +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/__init__.py +1 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/__init__.py +22 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_operations.py +29 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/aio/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/__init__.py +22 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_operations.py +29 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/buildingblocks/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/operations/_operations.py +34 -0
- azure/ai/evaluation/_common/onedp/servicepatterns/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/rai_service.py +831 -142
- azure/ai/evaluation/_common/raiclient/__init__.py +34 -0
- azure/ai/evaluation/_common/raiclient/_client.py +128 -0
- azure/ai/evaluation/_common/raiclient/_configuration.py +87 -0
- azure/ai/evaluation/_common/raiclient/_model_base.py +1235 -0
- azure/ai/evaluation/_common/raiclient/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/_serialization.py +2050 -0
- azure/ai/evaluation/_common/raiclient/_version.py +9 -0
- azure/ai/evaluation/_common/raiclient/aio/__init__.py +29 -0
- azure/ai/evaluation/_common/raiclient/aio/_client.py +130 -0
- azure/ai/evaluation/_common/raiclient/aio/_configuration.py +87 -0
- azure/ai/evaluation/_common/raiclient/aio/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/_operations.py +981 -0
- azure/ai/evaluation/_common/raiclient/aio/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/models/__init__.py +60 -0
- azure/ai/evaluation/_common/raiclient/models/_enums.py +18 -0
- azure/ai/evaluation/_common/raiclient/models/_models.py +651 -0
- azure/ai/evaluation/_common/raiclient/models/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/operations/__init__.py +25 -0
- azure/ai/evaluation/_common/raiclient/operations/_operations.py +1238 -0
- azure/ai/evaluation/_common/raiclient/operations/_patch.py +20 -0
- azure/ai/evaluation/_common/raiclient/py.typed +1 -0
- azure/ai/evaluation/_common/utils.py +870 -34
- azure/ai/evaluation/_constants.py +167 -6
- azure/ai/evaluation/_converters/__init__.py +3 -0
- azure/ai/evaluation/_converters/_ai_services.py +899 -0
- azure/ai/evaluation/_converters/_models.py +467 -0
- azure/ai/evaluation/_converters/_sk_services.py +495 -0
- azure/ai/evaluation/_eval_mapping.py +83 -0
- azure/ai/evaluation/_evaluate/_batch_run/__init__.py +17 -0
- azure/ai/evaluation/_evaluate/_batch_run/_run_submitter_client.py +176 -0
- azure/ai/evaluation/_evaluate/_batch_run/batch_clients.py +82 -0
- azure/ai/evaluation/_evaluate/{_batch_run_client → _batch_run}/code_client.py +47 -25
- azure/ai/evaluation/_evaluate/{_batch_run_client/batch_run_context.py → _batch_run/eval_run_context.py} +42 -13
- azure/ai/evaluation/_evaluate/_batch_run/proxy_client.py +124 -0
- azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +62 -0
- azure/ai/evaluation/_evaluate/_eval_run.py +102 -59
- azure/ai/evaluation/_evaluate/_evaluate.py +2134 -311
- azure/ai/evaluation/_evaluate/_evaluate_aoai.py +992 -0
- azure/ai/evaluation/_evaluate/_telemetry/__init__.py +14 -99
- azure/ai/evaluation/_evaluate/_utils.py +289 -40
- azure/ai/evaluation/_evaluator_definition.py +76 -0
- azure/ai/evaluation/_evaluators/_bleu/_bleu.py +93 -42
- azure/ai/evaluation/_evaluators/_code_vulnerability/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py +119 -0
- azure/ai/evaluation/_evaluators/_coherence/_coherence.py +117 -91
- azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +76 -39
- azure/ai/evaluation/_evaluators/_common/__init__.py +15 -0
- azure/ai/evaluation/_evaluators/_common/_base_eval.py +742 -0
- azure/ai/evaluation/_evaluators/_common/_base_multi_eval.py +63 -0
- azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +345 -0
- azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +198 -0
- azure/ai/evaluation/_evaluators/_common/_conversation_aggregators.py +49 -0
- azure/ai/evaluation/_evaluators/_content_safety/__init__.py +0 -4
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +144 -86
- azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +138 -57
- azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +123 -55
- azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +133 -54
- azure/ai/evaluation/_evaluators/_content_safety/_violence.py +134 -54
- azure/ai/evaluation/_evaluators/_document_retrieval/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_document_retrieval/_document_retrieval.py +442 -0
- azure/ai/evaluation/_evaluators/_eci/_eci.py +49 -56
- azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +102 -60
- azure/ai/evaluation/_evaluators/_fluency/_fluency.py +115 -92
- azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +66 -41
- azure/ai/evaluation/_evaluators/_gleu/_gleu.py +90 -37
- azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +318 -82
- azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +114 -0
- azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +104 -0
- azure/ai/evaluation/{_evaluate/_batch_run_client → _evaluators/_intent_resolution}/__init__.py +3 -4
- azure/ai/evaluation/_evaluators/_intent_resolution/_intent_resolution.py +196 -0
- azure/ai/evaluation/_evaluators/_intent_resolution/intent_resolution.prompty +275 -0
- azure/ai/evaluation/_evaluators/_meteor/_meteor.py +107 -61
- azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +104 -77
- azure/ai/evaluation/_evaluators/_qa/_qa.py +115 -63
- azure/ai/evaluation/_evaluators/_relevance/_relevance.py +182 -98
- azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +178 -49
- azure/ai/evaluation/_evaluators/_response_completeness/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py +202 -0
- azure/ai/evaluation/_evaluators/_response_completeness/response_completeness.prompty +84 -0
- azure/ai/evaluation/_evaluators/{_chat/retrieval → _retrieval}/__init__.py +2 -2
- azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +148 -0
- azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +93 -0
- azure/ai/evaluation/_evaluators/_rouge/_rouge.py +189 -50
- azure/ai/evaluation/_evaluators/_service_groundedness/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +179 -0
- azure/ai/evaluation/_evaluators/_similarity/_similarity.py +102 -91
- azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +0 -5
- azure/ai/evaluation/_evaluators/_task_adherence/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_task_adherence/_task_adherence.py +226 -0
- azure/ai/evaluation/_evaluators/_task_adherence/task_adherence.prompty +101 -0
- azure/ai/evaluation/_evaluators/_task_completion/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_task_completion/_task_completion.py +177 -0
- azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty +220 -0
- azure/ai/evaluation/_evaluators/_task_navigation_efficiency/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py +384 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py +298 -0
- azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty +166 -0
- azure/ai/evaluation/_evaluators/_tool_input_accuracy/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py +263 -0
- azure/ai/evaluation/_evaluators/_tool_input_accuracy/tool_input_accuracy.prompty +76 -0
- azure/ai/evaluation/_evaluators/_tool_output_utilization/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_tool_output_utilization/_tool_output_utilization.py +225 -0
- azure/ai/evaluation/_evaluators/_tool_output_utilization/tool_output_utilization.prompty +221 -0
- azure/ai/evaluation/_evaluators/_tool_selection/__init__.py +9 -0
- azure/ai/evaluation/_evaluators/_tool_selection/_tool_selection.py +266 -0
- azure/ai/evaluation/_evaluators/_tool_selection/tool_selection.prompty +104 -0
- azure/ai/evaluation/_evaluators/_tool_success/__init__.py +7 -0
- azure/ai/evaluation/_evaluators/_tool_success/_tool_success.py +301 -0
- azure/ai/evaluation/_evaluators/_tool_success/tool_success.prompty +321 -0
- azure/ai/evaluation/_evaluators/_ungrounded_attributes/__init__.py +5 -0
- azure/ai/evaluation/_evaluators/_ungrounded_attributes/_ungrounded_attributes.py +102 -0
- azure/ai/evaluation/_evaluators/_xpia/xpia.py +109 -107
- azure/ai/evaluation/_exceptions.py +51 -7
- azure/ai/evaluation/_http_utils.py +210 -137
- azure/ai/evaluation/_legacy/__init__.py +3 -0
- azure/ai/evaluation/_legacy/_adapters/__init__.py +7 -0
- azure/ai/evaluation/_legacy/_adapters/_check.py +17 -0
- azure/ai/evaluation/_legacy/_adapters/_configuration.py +45 -0
- azure/ai/evaluation/_legacy/_adapters/_constants.py +10 -0
- azure/ai/evaluation/_legacy/_adapters/_errors.py +29 -0
- azure/ai/evaluation/_legacy/_adapters/_flows.py +28 -0
- azure/ai/evaluation/_legacy/_adapters/_service.py +16 -0
- azure/ai/evaluation/_legacy/_adapters/client.py +51 -0
- azure/ai/evaluation/_legacy/_adapters/entities.py +26 -0
- azure/ai/evaluation/_legacy/_adapters/tracing.py +28 -0
- azure/ai/evaluation/_legacy/_adapters/types.py +15 -0
- azure/ai/evaluation/_legacy/_adapters/utils.py +31 -0
- azure/ai/evaluation/_legacy/_batch_engine/__init__.py +9 -0
- azure/ai/evaluation/_legacy/_batch_engine/_config.py +48 -0
- azure/ai/evaluation/_legacy/_batch_engine/_engine.py +477 -0
- azure/ai/evaluation/_legacy/_batch_engine/_exceptions.py +88 -0
- azure/ai/evaluation/_legacy/_batch_engine/_openai_injector.py +132 -0
- azure/ai/evaluation/_legacy/_batch_engine/_result.py +107 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run.py +127 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run_storage.py +128 -0
- azure/ai/evaluation/_legacy/_batch_engine/_run_submitter.py +262 -0
- azure/ai/evaluation/_legacy/_batch_engine/_status.py +25 -0
- azure/ai/evaluation/_legacy/_batch_engine/_trace.py +97 -0
- azure/ai/evaluation/_legacy/_batch_engine/_utils.py +97 -0
- azure/ai/evaluation/_legacy/_batch_engine/_utils_deprecated.py +131 -0
- azure/ai/evaluation/_legacy/_common/__init__.py +3 -0
- azure/ai/evaluation/_legacy/_common/_async_token_provider.py +117 -0
- azure/ai/evaluation/_legacy/_common/_logging.py +292 -0
- azure/ai/evaluation/_legacy/_common/_thread_pool_executor_with_context.py +17 -0
- azure/ai/evaluation/_legacy/prompty/__init__.py +36 -0
- azure/ai/evaluation/_legacy/prompty/_connection.py +119 -0
- azure/ai/evaluation/_legacy/prompty/_exceptions.py +139 -0
- azure/ai/evaluation/_legacy/prompty/_prompty.py +430 -0
- azure/ai/evaluation/_legacy/prompty/_utils.py +663 -0
- azure/ai/evaluation/_legacy/prompty/_yaml_utils.py +99 -0
- azure/ai/evaluation/_model_configurations.py +130 -8
- azure/ai/evaluation/_safety_evaluation/__init__.py +3 -0
- azure/ai/evaluation/_safety_evaluation/_generated_rai_client.py +0 -0
- azure/ai/evaluation/_safety_evaluation/_safety_evaluation.py +917 -0
- azure/ai/evaluation/_user_agent.py +32 -1
- azure/ai/evaluation/_vendor/__init__.py +3 -0
- azure/ai/evaluation/_vendor/rouge_score/__init__.py +14 -0
- azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +324 -0
- azure/ai/evaluation/_vendor/rouge_score/scoring.py +59 -0
- azure/ai/evaluation/_vendor/rouge_score/tokenize.py +59 -0
- azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +53 -0
- azure/ai/evaluation/_version.py +2 -1
- azure/ai/evaluation/red_team/__init__.py +22 -0
- azure/ai/evaluation/red_team/_agent/__init__.py +3 -0
- azure/ai/evaluation/red_team/_agent/_agent_functions.py +261 -0
- azure/ai/evaluation/red_team/_agent/_agent_tools.py +461 -0
- azure/ai/evaluation/red_team/_agent/_agent_utils.py +89 -0
- azure/ai/evaluation/red_team/_agent/_semantic_kernel_plugin.py +228 -0
- azure/ai/evaluation/red_team/_attack_objective_generator.py +268 -0
- azure/ai/evaluation/red_team/_attack_strategy.py +49 -0
- azure/ai/evaluation/red_team/_callback_chat_target.py +115 -0
- azure/ai/evaluation/red_team/_default_converter.py +21 -0
- azure/ai/evaluation/red_team/_evaluation_processor.py +505 -0
- azure/ai/evaluation/red_team/_mlflow_integration.py +430 -0
- azure/ai/evaluation/red_team/_orchestrator_manager.py +803 -0
- azure/ai/evaluation/red_team/_red_team.py +1717 -0
- azure/ai/evaluation/red_team/_red_team_result.py +661 -0
- azure/ai/evaluation/red_team/_result_processor.py +1708 -0
- azure/ai/evaluation/red_team/_utils/__init__.py +37 -0
- azure/ai/evaluation/red_team/_utils/_rai_service_eval_chat_target.py +128 -0
- azure/ai/evaluation/red_team/_utils/_rai_service_target.py +601 -0
- azure/ai/evaluation/red_team/_utils/_rai_service_true_false_scorer.py +114 -0
- azure/ai/evaluation/red_team/_utils/constants.py +72 -0
- azure/ai/evaluation/red_team/_utils/exception_utils.py +345 -0
- azure/ai/evaluation/red_team/_utils/file_utils.py +266 -0
- azure/ai/evaluation/red_team/_utils/formatting_utils.py +365 -0
- azure/ai/evaluation/red_team/_utils/logging_utils.py +139 -0
- azure/ai/evaluation/red_team/_utils/metric_mapping.py +73 -0
- azure/ai/evaluation/red_team/_utils/objective_utils.py +46 -0
- azure/ai/evaluation/red_team/_utils/progress_utils.py +252 -0
- azure/ai/evaluation/red_team/_utils/retry_utils.py +218 -0
- azure/ai/evaluation/red_team/_utils/strategy_utils.py +218 -0
- azure/ai/evaluation/simulator/__init__.py +2 -1
- azure/ai/evaluation/simulator/_adversarial_scenario.py +26 -1
- azure/ai/evaluation/simulator/_adversarial_simulator.py +270 -144
- azure/ai/evaluation/simulator/_constants.py +12 -1
- azure/ai/evaluation/simulator/_conversation/__init__.py +151 -23
- azure/ai/evaluation/simulator/_conversation/_conversation.py +10 -6
- azure/ai/evaluation/simulator/_conversation/constants.py +1 -1
- azure/ai/evaluation/simulator/_data_sources/__init__.py +3 -0
- azure/ai/evaluation/simulator/_data_sources/grounding.json +1150 -0
- azure/ai/evaluation/simulator/_direct_attack_simulator.py +54 -75
- azure/ai/evaluation/simulator/_helpers/__init__.py +1 -2
- azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +1 -0
- azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +26 -5
- azure/ai/evaluation/simulator/_indirect_attack_simulator.py +145 -104
- azure/ai/evaluation/simulator/_model_tools/__init__.py +2 -1
- azure/ai/evaluation/simulator/_model_tools/_generated_rai_client.py +225 -0
- azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +80 -30
- azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +117 -45
- azure/ai/evaluation/simulator/_model_tools/_rai_client.py +109 -7
- azure/ai/evaluation/simulator/_model_tools/_template_handler.py +97 -33
- azure/ai/evaluation/simulator/_model_tools/models.py +30 -27
- azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +6 -10
- azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +6 -5
- azure/ai/evaluation/simulator/_simulator.py +302 -208
- azure/ai/evaluation/simulator/_utils.py +31 -13
- azure_ai_evaluation-1.13.3.dist-info/METADATA +939 -0
- azure_ai_evaluation-1.13.3.dist-info/RECORD +305 -0
- {azure_ai_evaluation-1.0.0b2.dist-info → azure_ai_evaluation-1.13.3.dist-info}/WHEEL +1 -1
- azure_ai_evaluation-1.13.3.dist-info/licenses/NOTICE.txt +70 -0
- azure/ai/evaluation/_evaluate/_batch_run_client/proxy_client.py +0 -71
- azure/ai/evaluation/_evaluators/_chat/_chat.py +0 -357
- azure/ai/evaluation/_evaluators/_chat/retrieval/_retrieval.py +0 -157
- azure/ai/evaluation/_evaluators/_chat/retrieval/retrieval.prompty +0 -48
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_base.py +0 -65
- azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +0 -301
- azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +0 -54
- azure/ai/evaluation/_evaluators/_protected_materials/__init__.py +0 -5
- azure/ai/evaluation/_evaluators/_protected_materials/_protected_materials.py +0 -104
- azure/ai/evaluation/simulator/_tracing.py +0 -89
- azure_ai_evaluation-1.0.0b2.dist-info/METADATA +0 -449
- azure_ai_evaluation-1.0.0b2.dist-info/RECORD +0 -99
- {azure_ai_evaluation-1.0.0b2.dist-info → azure_ai_evaluation-1.13.3.dist-info}/top_level.txt +0 -0
azure/ai/evaluation/_evaluators/_bleu/_bleu.py:

@@ -1,60 +1,116 @@
 # ---------------------------------------------------------
 # Copyright (c) Microsoft Corporation. All rights reserved.
 # ---------------------------------------------------------
+from typing import Dict
 from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
-from
+from typing_extensions import overload, override

 from azure.ai.evaluation._common.utils import nltk_tokenize

+from azure.ai.evaluation._evaluators._common import EvaluatorBase
+from azure.ai.evaluation._constants import EVALUATION_PASS_FAIL_MAPPING

-class _AsyncBleuScoreEvaluator:
-    def __init__(self):
-        pass

-
+class BleuScoreEvaluator(EvaluatorBase):
+    """
+    Calculate the BLEU score for a given response and ground truth.
+
+    BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine
+    translation. It is widely used in text summarization and text generation use cases.
+
+    Use the BLEU score when you want to evaluate the similarity between the generated text and reference text,
+    especially in tasks such as machine translation or text summarization, where n-gram overlap is a significant
+    indicator of quality.
+
+    The BLEU score ranges from 0 to 1, with higher scores indicating better quality.
+    :param threshold: The threshold for the evaluation. Default is 0.5.
+    :type threshold: float
+
+    .. admonition:: Example:
+
+        .. literalinclude:: ../samples/evaluation_samples_evaluate.py
+            :start-after: [START bleu_score_evaluator]
+            :end-before: [END bleu_score_evaluator]
+            :language: python
+            :dedent: 8
+            :caption: Initialize and call an BleuScoreEvaluator using azure.ai.evaluation.AzureAIProject
+
+    .. admonition:: Example using Azure AI Project URL:
+
+        .. literalinclude:: ../samples/evaluation_samples_evaluate_fdp.py
+            :start-after: [START bleu_score_evaluator]
+            :end-before: [END bleu_score_evaluator]
+            :language: python
+            :dedent: 8
+            :caption: Initialize and call an BleuScoreEvaluator using Azure AI Project URL in following format
+                https://{resource_name}.services.ai.azure.com/api/projects/{project_name}
+
+    .. admonition:: Example with Threshold:
+
+        .. literalinclude:: ../samples/evaluation_samples_threshold.py
+            :start-after: [START threshold_bleu_score_evaluator]
+            :end-before: [END threshold_bleu_score_evaluator]
+            :language: python
+            :dedent: 8
+            :caption: Initialize with threshold and call an BleuScoreEvaluator.
+    """
+
+    id = "azureai://built-in/evaluators/bleu_score"
+    """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
+
+    def __init__(self, *, threshold=0.5):
+        self._threshold = threshold
+        self._higher_is_better = True
+        super().__init__(threshold=threshold, _higher_is_better=self._higher_is_better)
+
+    @override
+    async def _do_eval(self, eval_input: Dict) -> Dict[str, float]:
+        """Produce a bleu score evaluation result.
+
+        :param eval_input: The input to the evaluation function.
+        :type eval_input: Dict
+        :return: The evaluation result.
+        :rtype: Dict
+        """
+        ground_truth = eval_input["ground_truth"]
+        response = eval_input["response"]
         reference_tokens = nltk_tokenize(ground_truth)
         hypothesis_tokens = nltk_tokenize(response)

         # NIST Smoothing
         smoothing_function = SmoothingFunction().method4
         score = sentence_bleu([reference_tokens], hypothesis_tokens, smoothing_function=smoothing_function)
+        binary_result = False
+        if self._higher_is_better:
+            binary_result = score >= self._threshold
+        else:
+            binary_result = score <= self._threshold

         return {
             "bleu_score": score,
+            "bleu_result": EVALUATION_PASS_FAIL_MAPPING[binary_result],
+            "bleu_threshold": self._threshold,
         }

+    @overload  # type: ignore
+    def __call__(self, *, response: str, ground_truth: str):
+        """
+        Evaluate the BLEU score between the response and the ground truth.

-
-
-
-
-
-
-
-    better quality.
-
-    **Usage**
-
-    .. code-block:: python
-
-        eval_fn = BleuScoreEvaluator()
-        result = eval_fn(
-            response="Tokyo is the capital of Japan.",
-            ground_truth="The capital of Japan is Tokyo.")
-
-    **Output format**
-
-    .. code-block:: python
-
-        {
-            "bleu_score": 0.22
-        }
-    """
-
-    def __init__(self):
-        self._async_evaluator = _AsyncBleuScoreEvaluator()
+        :keyword response: The response to be evaluated.
+        :paramtype response: str
+        :keyword ground_truth: The ground truth to be compared against.
+        :paramtype ground_truth: str
+        :return: The BLEU score.
+        :rtype: Dict[str, float]
+        """

-
+    @override
+    def __call__(  # pylint: disable=docstring-missing-param
+        self,
+        *args,
+        **kwargs,
+    ):
         """
         Evaluate the BLEU score between the response and the ground truth.

@@ -63,11 +119,6 @@ class BleuScoreEvaluator:
         :keyword ground_truth: The ground truth to be compared against.
         :paramtype ground_truth: str
         :return: The BLEU score.
-        :rtype:
+        :rtype: Dict[str, float]
         """
-        return
-            self._async_evaluator, response=response, ground_truth=ground_truth, **kwargs
-        )
-
-    def _to_async(self):
-        return self._async_evaluator
+        return super().__call__(*args, **kwargs)
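The net effect of this rewrite is that `BleuScoreEvaluator` now derives from `EvaluatorBase`, takes a keyword-only `threshold` (default 0.5), and returns a pass/fail result and the threshold alongside the raw score. A minimal usage sketch based on the signature and result keys shown above; the exact value of `bleu_result` depends on `EVALUATION_PASS_FAIL_MAPPING`, which is not shown in this diff:

```python
from azure.ai.evaluation import BleuScoreEvaluator

# Keyword-only threshold, default 0.5 per the new __init__ signature.
bleu = BleuScoreEvaluator(threshold=0.4)

result = bleu(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.",
)

# Keys returned by _do_eval in the diff above:
#   "bleu_score"     - sentence-level BLEU with NIST smoothing
#   "bleu_result"    - label looked up from EVALUATION_PASS_FAIL_MAPPING (assumption: maps bool -> label)
#   "bleu_threshold" - the threshold the score was compared against
print(result["bleu_score"], result["bleu_result"], result["bleu_threshold"])
```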
azure/ai/evaluation/_evaluators/_code_vulnerability/_code_vulnerability.py (new file):

@@ -0,0 +1,119 @@
+# ---------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# ---------------------------------------------------------
+from typing_extensions import overload, override
+from typing import Dict, Union
+
+from azure.ai.evaluation._common._experimental import experimental
+from azure.ai.evaluation._common.constants import EvaluationMetrics
+from azure.ai.evaluation._evaluators._common import RaiServiceEvaluatorBase
+
+
+# cspell:ignore ssrf, vuln
+@experimental
+class CodeVulnerabilityEvaluator(RaiServiceEvaluatorBase[Union[str, bool]]):
+    """
+    Evaluates code vulnerability for a given query and response for a single-turn evaluation only,
+    where query represents the user query or code before the completion, and response represents the code recommended by the assistant.
+
+    The code vulnerability evaluation checks for vulnerabilities in the following coding languages:
+
+    - Python
+    - Java
+    - C++
+    - C#
+    - Go
+    - Javascript
+    - SQL
+
+    The code vulnerability evaluation identifies the following vulnerabilities:
+
+    - path-injection
+    - sql-injection
+    - code-injection
+    - stack-trace-exposure
+    - incomplete-url-substring-sanitization
+    - flask-debug
+    - clear-text-logging-sensitive-data
+    - incomplete-hostname-regexp
+    - server-side-unvalidated-url-redirection
+    - weak-cryptographic-algorithm
+    - full-ssrf
+    - bind-socket-all-network-interfaces
+    - client-side-unvalidated-url-redirection
+    - likely-bugs
+    - reflected-xss
+    - clear-text-storage-sensitive-data
+    - tarslip
+    - hardcoded-credentials
+    - insecure-randomness
+
+    :param credential: The credential for connecting to Azure AI project. Required
+    :type credential: ~azure.core.credentials.TokenCredential
+    :param azure_ai_project: The Azure AI project, which can either be a string representing the project endpoint
+        or an instance of AzureAIProject. It contains subscription id, resource group, and project name.
+    :type azure_ai_project: Union[str, ~azure.ai.evaluation.AzureAIProject]
+    :param kwargs: Additional arguments to pass to the evaluator.
+    :type kwargs: Any
+
+    .. note::
+
+        If this evaluator is supplied to the `evaluate` function, the metric
+        for the code vulnerability will be "code_vulnerability_label".
+    """
+
+    id = "azureai://built-in/evaluators/code_vulnerability"
+    """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
+    _OPTIONAL_PARAMS = ["query"]
+
+    @override
+    def __init__(
+        self,
+        credential,
+        azure_ai_project,
+        **kwargs,
+    ):
+        # Set default for evaluate_query if not provided
+        if "evaluate_query" not in kwargs:
+            kwargs["evaluate_query"] = True
+
+        super().__init__(
+            eval_metric=EvaluationMetrics.CODE_VULNERABILITY,
+            azure_ai_project=azure_ai_project,
+            credential=credential,
+            **kwargs,
+        )
+
+    @overload
+    def __call__(
+        self,
+        *,
+        query: str,
+        response: str,
+    ) -> Dict[str, Union[str, float]]:
+        """Evaluate a given query/response pair for code vulnerability
+
+        :keyword query: The query to be evaluated.
+        :paramtype query: str
+        :keyword response: The response to be evaluated.
+        :paramtype response: str
+        :return: The code vulnerability label.
+        :rtype: Dict[str, Union[str, bool]]
+        """
+
+    @override
+    def __call__(  # pylint: disable=docstring-missing-param
+        self,
+        *args,
+        **kwargs,
+    ):
+        """Evaluate code vulnerability. Accepts query and response for a single-turn evaluation only.
+
+        :keyword query: The query to be evaluated.
+        :paramtype query: Optional[str]
+        :keyword response: The response to be evaluated.
+        :paramtype response: Optional[str]
+        :rtype: Dict[str, Union[str, bool]]
+        """
+
+        return super().__call__(*args, **kwargs)
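`CodeVulnerabilityEvaluator` is service-backed (it extends `RaiServiceEvaluatorBase`), so it needs a credential and an Azure AI project. A minimal construction sketch, assuming the class is exported from the package root; the project endpoint below is a placeholder in the URL format quoted by the docstring captions:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import CodeVulnerabilityEvaluator

# Placeholder endpoint; substitute your own project URL or AzureAIProject instance.
code_vuln = CodeVulnerabilityEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project="https://<resource_name>.services.ai.azure.com/api/projects/<project_name>",
)

result = code_vuln(
    query="Complete this function that opens a file path supplied by the user.",
    response="def read_file(path):\n    return open(path).read()",
)
# Per the note in the docstring, when this evaluator is run through `evaluate`,
# the metric is reported as "code_vulnerability_label".
```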
azure/ai/evaluation/_evaluators/_coherence/_coherence.py:

@@ -1,108 +1,99 @@
 # ---------------------------------------------------------
 # Copyright (c) Microsoft Corporation. All rights reserved.
 # ---------------------------------------------------------
-
 import os
-import
-
-import numpy as np
-from promptflow._utils.async_utils import async_run_allowing_running_loop
-from promptflow.core import AsyncPrompty
-
-from azure.ai.evaluation._exceptions import ErrorBlame, ErrorCategory, ErrorTarget, EvaluationException
-
-from ..._common.utils import ensure_api_version_in_aoai_model_config, ensure_user_agent_in_aoai_model_config
-
-try:
-    from ..._user_agent import USER_AGENT
-except ImportError:
-    USER_AGENT = None
-
-
-class _AsyncCoherenceEvaluator:
-    # Constants must be defined within eval's directory to be save/loadable
-    PROMPTY_FILE = "coherence.prompty"
-    LLM_CALL_TIMEOUT = 600
-    DEFAULT_OPEN_API_VERSION = "2024-02-15-preview"
-
-    def __init__(self, model_config: dict):
-        ensure_api_version_in_aoai_model_config(model_config, self.DEFAULT_OPEN_API_VERSION)
-
-        prompty_model_config = {"configuration": model_config, "parameters": {"extra_headers": {}}}
+from typing import Dict, Union, List

-
-        # https://github.com/encode/httpx/discussions/2959
-        prompty_model_config["parameters"]["extra_headers"].update({"Connection": "close"})
+from typing_extensions import overload, override

-
-
-            prompty_model_config,
-            USER_AGENT,
-        )
-
-        current_dir = os.path.dirname(__file__)
-        prompty_path = os.path.join(current_dir, self.PROMPTY_FILE)
-        self._flow = AsyncPrompty.load(source=prompty_path, model=prompty_model_config)
-
-    async def __call__(self, *, query: str, response: str, **kwargs):
-        # Validate input parameters
-        query = str(query or "")
-        response = str(response or "")
-
-        if not (query.strip() and response.strip()):
-            msg = "Both 'query' and 'response' must be non-empty strings."
-            raise EvaluationException(
-                message=msg,
-                internal_message=msg,
-                error_category=ErrorCategory.INVALID_VALUE,
-                error_blame=ErrorBlame.USER_ERROR,
-                error_target=ErrorTarget.COHERENCE_EVALUATOR,
-            )
+from azure.ai.evaluation._evaluators._common import PromptyEvaluatorBase
+from azure.ai.evaluation._model_configurations import Conversation

-        # Run the evaluation flow
-        llm_output = await self._flow(query=query, response=response, timeout=self.LLM_CALL_TIMEOUT, **kwargs)

-
-        if llm_output:
-            match = re.search(r"\d", llm_output)
-            if match:
-                score = float(match.group())
-
-        return {"gpt_coherence": float(score)}
-
-
-class CoherenceEvaluator:
+class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
     """
-
+    Evaluates coherence score for a given query and response or a multi-turn conversation, including reasoning.
+
+    The coherence measure assesses the ability of the language model to generate text that reads naturally,
+    flows smoothly, and resembles human-like language in its responses. Use it when assessing the readability
+    and user-friendliness of a model's generated responses in real-world applications.

     :param model_config: Configuration for the Azure OpenAI model.
     :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
         ~azure.ai.evaluation.OpenAIModelConfiguration]
+    :param threshold: The threshold for the coherence evaluator. Default is 3.
+    :type threshold: int
+    :param credential: The credential for authenticating to Azure AI service.
+    :type credential: ~azure.core.credentials.TokenCredential
+    :keyword is_reasoning_model: If True, the evaluator will use reasoning model configuration (o1/o3 models).
+        This will adjust parameters like max_completion_tokens and remove unsupported parameters. Default is False.
+    :paramtype is_reasoning_model: bool
+
+    .. admonition:: Example:
+
+        .. literalinclude:: ../samples/evaluation_samples_evaluate.py
+            :start-after: [START coherence_evaluator]
+            :end-before: [END coherence_evaluator]
+            :language: python
+            :dedent: 8
+            :caption: Initialize and call CoherenceEvaluator using azure.ai.evaluation.AzureAIProject
+
+    .. admonition:: Example using Azure AI Project URL:
+
+        .. literalinclude:: ../samples/evaluation_samples_evaluate_fdp.py
+            :start-after: [START coherence_evaluator]
+            :end-before: [END coherence_evaluator]
+            :language: python
+            :dedent: 8
+            :caption: Initialize and call CoherenceEvaluator using Azure AI Project URL in following format
+                https://{resource_name}.services.ai.azure.com/api/projects/{project_name}
+
+    .. admonition:: Example with Threshold:
+
+        .. literalinclude:: ../samples/evaluation_samples_threshold.py
+            :start-after: [START threshold_coherence_evaluator]
+            :end-before: [END threshold_coherence_evaluator]
+            :language: python
+            :dedent: 8
+            :caption: Initialize with threshold and call a CoherenceEvaluator with a query and response.
+
+    .. note::
+
+        To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.
+        To maintain backwards compatibility, the old key with the `gpt_` prefix is still be present in the output;
+        however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
+    """

-
-
-    .. code-block:: python
-
-        eval_fn = CoherenceEvaluator(model_config)
-        result = eval_fn(
-            query="What is the capital of Japan?",
-            response="The capital of Japan is Tokyo.")
-
-    **Output format**
+    _PROMPTY_FILE = "coherence.prompty"
+    _RESULT_KEY = "coherence"

-
+    id = "azureai://built-in/evaluators/coherence"
+    """Evaluator identifier, experimental and to be used only with evaluation in cloud."""

-
-
-
-
-
-
-
+    @override
+    def __init__(self, model_config, *, threshold=3, credential=None, **kwargs):
+        current_dir = os.path.dirname(__file__)
+        prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
+        self._threshold = threshold
+        self._higher_is_better = True
+        super().__init__(
+            model_config=model_config,
+            prompty_file=prompty_path,
+            result_key=self._RESULT_KEY,
+            threshold=threshold,
+            credential=credential,
+            _higher_is_better=self._higher_is_better,
+            **kwargs,
+        )

-
-
-
+    @overload
+    def __call__(
+        self,
+        *,
+        query: str,
+        response: str,
+    ) -> Dict[str, Union[str, float]]:
+        """Evaluate coherence for given input of query, response

         :keyword query: The query to be evaluated.
         :paramtype query: str
@@ -111,7 +102,42 @@ class CoherenceEvaluator:
         :return: The coherence score.
         :rtype: Dict[str, float]
         """
-        return async_run_allowing_running_loop(self._async_evaluator, query=query, response=response, **kwargs)

-
-
+    @overload
+    def __call__(
+        self,
+        *,
+        conversation: Conversation,
+    ) -> Dict[str, Union[float, Dict[str, List[Union[str, float]]]]]:
+        """Evaluate coherence for a conversation
+
+        :keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
+            key "messages", and potentially a global context under the key "context". Conversation turns are expected
+            to be dictionaries with keys "content", "role", and possibly "context".
+        :paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
+        :return: The coherence score.
+        :rtype: Dict[str, Union[float, Dict[str, List[float]]]]
+        """
+
+    @override
+    def __call__(  # pylint: disable=docstring-missing-param
+        self,
+        *args,
+        **kwargs,
+    ):
+        """Evaluate coherence. Accepts either a query and response for a single evaluation,
+        or a conversation for a potentially multi-turn evaluation. If the conversation has more than one pair of
+        turns, the evaluator will aggregate the results of each turn.
+
+        :keyword query: The query to be evaluated.
+        :paramtype query: str
+        :keyword response: The response to be evaluated.
+        :paramtype response: Optional[str]
+        :keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
+            key "messages". Conversation turns are expected
+            to be dictionaries with keys "content" and "role".
+        :paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
+        :return: The relevance score.
+        :rtype: Union[Dict[str, float], Dict[str, Union[float, Dict[str, List[float]]]]]
+        """
+        return super().__call__(*args, **kwargs)
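With the promptflow-based `_AsyncCoherenceEvaluator` removed, `CoherenceEvaluator` now delegates to `PromptyEvaluatorBase` and gains a conversation overload plus a `threshold` parameter. A usage sketch based on the signatures above; the model configuration values are placeholders, and the configuration is assumed to be accepted as a plain dict matching `AzureOpenAIModelConfiguration`:

```python
from azure.ai.evaluation import CoherenceEvaluator

# Placeholder Azure OpenAI configuration.
model_config = {
    "azure_endpoint": "https://<resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "<deployment>",
}

coherence = CoherenceEvaluator(model_config, threshold=3)

# Single-turn overload: query + response.
single_turn = coherence(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo.",
)

# Conversation overload: turns under "messages"; per the docstring, results from
# multi-turn conversations are aggregated across turns.
conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of Japan?"},
        {"role": "assistant", "content": "The capital of Japan is Tokyo."},
    ]
}
multi_turn = coherence(conversation=conversation)

# Output now uses the "coherence" key; the legacy "gpt_coherence" key is kept for
# backwards compatibility per the note in the docstring.
```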
azure/ai/evaluation/_evaluators/_coherence/coherence.prompty:

@@ -3,14 +3,9 @@ name: Coherence
 description: Evaluates coherence score for QA scenario
 model:
   api: chat
-  configuration:
-    type: azure_openai
-    azure_deployment: ${env:AZURE_DEPLOYMENT}
-    api_key: ${env:AZURE_OPENAI_API_KEY}
-    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
   parameters:
     temperature: 0.0
-    max_tokens:
+    max_tokens: 800
     top_p: 1.0
     presence_penalty: 0
     frequency_penalty: 0
@@ -25,38 +20,80 @@ inputs:

 ---
 system:
-
+# Instruction
+## Goal
+### You are an expert in evaluating the quality of a RESPONSE from an intelligent system based on provided definition and data. Your goal will involve answering the questions below using the information provided.
+- **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
+- **Data**: Your input data include a QUERY and a RESPONSE.
+- **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways.

 user:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-question
-
-
-
-
-
-
+# Definition
+**Coherence** refers to the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent answer directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas.
+
+# Ratings
+## [Coherence: 1] (Incoherent Response)
+**Definition:** The response lacks coherence entirely. It consists of disjointed words or phrases that do not form complete or meaningful sentences. There is no logical connection to the question, making the response incomprehensible.
+
+**Examples:**
+**Query:** What are the benefits of renewable energy?
+**Response:** Wind sun green jump apple silence over.
+
+**Query:** Explain the process of photosynthesis.
+**Response:** Plants light water flying blue music.
+
+## [Coherence: 2] (Poorly Coherent Response)
+**Definition:** The response shows minimal coherence with fragmented sentences and limited connection to the question. It contains some relevant keywords but lacks logical structure and clear relationships between ideas, making the overall message difficult to understand.
+
+**Examples:**
+**Query:** How does vaccination work?
+**Response:** Vaccines protect disease. Immune system fight. Health better.
+
+**Query:** Describe how a bill becomes a law.
+**Response:** Idea proposed. Congress discuss vote. President signs.
+
+## [Coherence: 3] (Partially Coherent Response)
+**Definition:** The response partially addresses the question with some relevant information but exhibits issues in the logical flow and organization of ideas. Connections between sentences may be unclear or abrupt, requiring the reader to infer the links. The response may lack smooth transitions and may present ideas out of order.
+
+**Examples:**
+**Query:** What causes earthquakes?
+**Response:** Earthquakes happen when tectonic plates move suddenly. Energy builds up then releases. Ground shakes and can cause damage.
+
+**Query:** Explain the importance of the water cycle.
+**Response:** The water cycle moves water around Earth. Evaporation, then precipitation occurs. It supports life by distributing water.
+
+## [Coherence: 4] (Coherent Response)
+**Definition:** The response is coherent and effectively addresses the question. Ideas are logically organized with clear connections between sentences and paragraphs. Appropriate transitions are used to guide the reader through the response, which flows smoothly and is easy to follow.
+
+**Examples:**
+**Query:** What is the water cycle and how does it work?
+**Response:** The water cycle is the continuous movement of water on Earth through processes like evaporation, condensation, and precipitation. Water evaporates from bodies of water, forms clouds through condensation, and returns to the surface as precipitation. This cycle is essential for distributing water resources globally.
+
+**Query:** Describe the role of mitochondria in cellular function.
+**Response:** Mitochondria are organelles that produce energy for the cell. They convert nutrients into ATP through cellular respiration. This energy powers various cellular activities, making mitochondria vital for cell survival.
+
+## [Coherence: 5] (Highly Coherent Response)
+**Definition:** The response is exceptionally coherent, demonstrating sophisticated organization and flow. Ideas are presented in a logical and seamless manner, with excellent use of transitional phrases and cohesive devices. The connections between concepts are clear and enhance the reader's understanding. The response thoroughly addresses the question with clarity and precision.
+
+**Examples:**
+**Query:** Analyze the economic impacts of climate change on coastal cities.
+**Response:** Climate change significantly affects the economies of coastal cities through rising sea levels, increased flooding, and more intense storms. These environmental changes can damage infrastructure, disrupt businesses, and lead to costly repairs. For instance, frequent flooding can hinder transportation and commerce, while the threat of severe weather may deter investment and tourism. Consequently, cities may face increased expenses for disaster preparedness and mitigation efforts, straining municipal budgets and impacting economic growth.
+
+**Query:** Discuss the significance of the Monroe Doctrine in shaping U.S. foreign policy.
+**Response:** The Monroe Doctrine was a pivotal policy declared in 1823 that asserted U.S. opposition to European colonization in the Americas. By stating that any intervention by external powers in the Western Hemisphere would be viewed as a hostile act, it established the U.S. as a protector of the region. This doctrine shaped U.S. foreign policy by promoting isolation from European conflicts while justifying American influence and expansion in the hemisphere. Its long-term significance lies in its enduring influence on international relations and its role in defining the U.S. position in global affairs.
+
+
+# Data
+QUERY: {{query}}
+RESPONSE: {{response}}
+
+
+# Tasks
+## Please provide your assessment Score for the previous RESPONSE in relation to the QUERY based on the Definitions above. Your output should include the following information:
+- **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
+- **Explanation**: a very short explanation of why you think the input Data should get that Score.
+- **Score**: based on your previous analysis, provide your Score. The Score you give MUST be a integer score (i.e., "1", "2"...) based on the levels of the definitions.
+
+
+## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
+# Output