azure-ai-evaluation 1.0.0b3__tar.gz → 1.0.0b5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (176)
  1. azure_ai_evaluation-1.0.0b5/CHANGELOG.md +183 -0
  2. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/MANIFEST.in +1 -0
  3. azure_ai_evaluation-1.0.0b5/NOTICE.txt +70 -0
  4. {azure_ai_evaluation-1.0.0b3/azure_ai_evaluation.egg-info → azure_ai_evaluation-1.0.0b5}/PKG-INFO +237 -52
  5. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/README.md +126 -42
  6. azure_ai_evaluation-1.0.0b5/TROUBLESHOOTING.md +50 -0
  7. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/__init__.py +23 -1
  8. {azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/simulator/_helpers → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_common}/_experimental.py +20 -9
  9. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/constants.py +9 -2
  10. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_common/math.py +29 -0
  11. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/rai_service.py +222 -93
  12. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_common/utils.py +411 -0
  13. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_constants.py +16 -8
  14. {azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/__init__.py +3 -2
  15. {azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/code_client.py +33 -17
  16. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluate/_batch_run_client/batch_run_context.py → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run/eval_run_context.py +14 -7
  17. {azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluate/_batch_run_client → azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run}/proxy_client.py +22 -4
  18. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluate/_batch_run/target_run_context.py +35 -0
  19. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_eval_run.py +47 -14
  20. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_evaluate.py +370 -188
  21. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_telemetry/__init__.py +15 -16
  22. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/_utils.py +77 -25
  23. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_bleu/_bleu.py +1 -1
  24. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_coherence/_coherence.py +16 -10
  25. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +99 -0
  26. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_eval.py +76 -46
  27. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py +26 -19
  28. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py +62 -25
  29. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +138 -0
  30. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_content_safety_chat.py +67 -46
  31. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_hate_unfairness.py +33 -4
  32. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_self_harm.py +33 -4
  33. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_sexual.py +33 -4
  34. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/_violence.py +33 -4
  35. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_eci/_eci.py +7 -5
  36. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_f1_score/_f1_score.py +14 -6
  37. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_fluency/_fluency.py +22 -21
  38. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +86 -0
  39. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_gleu/_gleu.py +1 -1
  40. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +106 -0
  41. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/groundedness_with_query.prompty +113 -0
  42. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_groundedness/groundedness_without_query.prompty +99 -0
  43. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_meteor/_meteor.py +3 -7
  44. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/__init__.py +20 -0
  45. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal.py +130 -0
  46. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_content_safety_multimodal_base.py +57 -0
  47. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_hate_unfairness.py +96 -0
  48. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_protected_material.py +120 -0
  49. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_self_harm.py +96 -0
  50. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_sexual.py +96 -0
  51. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_multimodal/_violence.py +96 -0
  52. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +90 -0
  53. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_qa/_qa.py +11 -6
  54. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_relevance/_relevance.py +23 -20
  55. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +100 -0
  56. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +197 -0
  57. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +93 -0
  58. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_rouge/_rouge.py +2 -2
  59. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_service_groundedness/__init__.py +9 -0
  60. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py +150 -0
  61. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/_similarity.py +32 -15
  62. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_xpia/xpia.py +36 -10
  63. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_exceptions.py +26 -6
  64. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_http_utils.py +203 -132
  65. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_model_configurations.py +23 -6
  66. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/__init__.py +3 -0
  67. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/__init__.py +14 -0
  68. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/rouge_scorer.py +328 -0
  69. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/scoring.py +63 -0
  70. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/tokenize.py +63 -0
  71. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/_vendor/rouge_score/tokenizers.py +53 -0
  72. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_version.py +1 -1
  73. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/__init__.py +2 -1
  74. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_adversarial_scenario.py +5 -0
  75. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_adversarial_simulator.py +88 -60
  76. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/__init__.py +13 -12
  77. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/_conversation.py +4 -4
  78. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/simulator/_data_sources/__init__.py +3 -0
  79. azure_ai_evaluation-1.0.0b5/azure/ai/evaluation/simulator/_data_sources/grounding.json +1150 -0
  80. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_direct_attack_simulator.py +24 -66
  81. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/__init__.py +1 -2
  82. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/_simulator_data_classes.py +26 -5
  83. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_indirect_attack_simulator.py +98 -95
  84. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_identity_manager.py +67 -21
  85. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_proxy_completion_model.py +28 -11
  86. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_template_handler.py +68 -24
  87. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/models.py +10 -10
  88. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/task_query_response.prompty +4 -9
  89. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/task_simulate.prompty +6 -5
  90. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_simulator.py +222 -169
  91. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_tracing.py +4 -4
  92. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_utils.py +6 -6
  93. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5/azure_ai_evaluation.egg-info}/PKG-INFO +237 -52
  94. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/SOURCES.txt +30 -6
  95. azure_ai_evaluation-1.0.0b5/azure_ai_evaluation.egg-info/requires.txt +10 -0
  96. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/pyproject.toml +1 -2
  97. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/setup.py +5 -5
  98. azure_ai_evaluation-1.0.0b5/tests/__pf_service_isolation.py +28 -0
  99. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/conftest.py +27 -8
  100. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/custom_evaluators/answer_length_with_aggregation.py +9 -2
  101. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/target_fn.py +18 -0
  102. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_adv_simulator.py +51 -24
  103. azure_ai_evaluation-1.0.0b5/tests/e2etests/test_builtin_evaluators.py +1021 -0
  104. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_evaluate.py +228 -28
  105. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_sim_and_eval.py +7 -12
  106. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_batch_run_context.py +8 -8
  107. azure_ai_evaluation-1.0.0b5/tests/unittests/test_built_in_evaluator.py +138 -0
  108. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_content_safety_rai_script.py +28 -23
  109. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_eval_run.py +33 -4
  110. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluate.py +63 -26
  111. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluate_telemetry.py +11 -10
  112. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_jailbreak_simulator.py +4 -3
  113. azure_ai_evaluation-1.0.0b5/tests/unittests/test_non_adv_simulator.py +362 -0
  114. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_simulator.py +4 -5
  115. azure_ai_evaluation-1.0.0b5/tests/unittests/test_utils.py +56 -0
  116. azure_ai_evaluation-1.0.0b3/CHANGELOG.md +0 -81
  117. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_common/utils.py +0 -102
  118. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_coherence/coherence.prompty +0 -57
  119. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_content_safety/_content_safety.py +0 -106
  120. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_fluency/fluency.prompty +0 -56
  121. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py +0 -71
  122. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_groundedness/groundedness.prompty +0 -49
  123. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_protected_material/_protected_material.py +0 -57
  124. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty +0 -64
  125. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_retrieval/_retrieval.py +0 -151
  126. azure_ai_evaluation-1.0.0b3/azure/ai/evaluation/_evaluators/_retrieval/retrieval.prompty +0 -43
  127. azure_ai_evaluation-1.0.0b3/azure_ai_evaluation.egg-info/requires.txt +0 -16
  128. azure_ai_evaluation-1.0.0b3/tests/e2etests/test_builtin_evaluators.py +0 -474
  129. azure_ai_evaluation-1.0.0b3/tests/unittests/test_built_in_evaluator.py +0 -41
  130. azure_ai_evaluation-1.0.0b3/tests/unittests/test_non_adv_simulator.py +0 -129
  131. azure_ai_evaluation-1.0.0b3/tests/unittests/test_utils.py +0 -20
  132. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/__init__.py +0 -0
  133. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/__init__.py +0 -0
  134. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_common/__init__.py +0 -0
  135. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluate/__init__.py +0 -0
  136. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/__init__.py +0 -0
  137. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_bleu/__init__.py +0 -0
  138. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_coherence/__init__.py +0 -0
  139. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_common/__init__.py +0 -0
  140. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_content_safety/__init__.py +0 -0
  141. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_eci/__init__.py +0 -0
  142. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_f1_score/__init__.py +0 -0
  143. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_fluency/__init__.py +0 -0
  144. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_gleu/__init__.py +0 -0
  145. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_groundedness/__init__.py +0 -0
  146. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_meteor/__init__.py +0 -0
  147. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_protected_material/__init__.py +0 -0
  148. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_qa/__init__.py +0 -0
  149. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_relevance/__init__.py +0 -0
  150. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_retrieval/__init__.py +0 -0
  151. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_rouge/__init__.py +0 -0
  152. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/__init__.py +0 -0
  153. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_similarity/similarity.prompty +0 -0
  154. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_evaluators/_xpia/__init__.py +0 -0
  155. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/_user_agent.py +0 -0
  156. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/py.typed +0 -0
  157. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_constants.py +0 -0
  158. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_conversation/constants.py +0 -0
  159. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_helpers/_language_suffix_mapping.py +0 -0
  160. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/__init__.py +0 -0
  161. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_model_tools/_rai_client.py +0 -0
  162. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure/ai/evaluation/simulator/_prompty/__init__.py +0 -0
  163. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/dependency_links.txt +0 -0
  164. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/not-zip-safe +0 -0
  165. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/azure_ai_evaluation.egg-info/top_level.txt +0 -0
  166. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/setup.cfg +0 -0
  167. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/__init__.py +0 -0
  168. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/__openai_patcher.py +0 -0
  169. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/__init__.py +0 -0
  170. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/e2etests/test_metrics_upload.py +1 -1
  171. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_content_safety_defect_rate.py +1 -1
  172. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluators/apology_dag/apology.py +0 -0
  173. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_evaluators/test_inputs_evaluators.py +0 -0
  174. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_save_eval.py +0 -0
  175. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_synthetic_callback_conv_bot.py +0 -0
  176. {azure_ai_evaluation-1.0.0b3 → azure_ai_evaluation-1.0.0b5}/tests/unittests/test_synthetic_conversation_bot.py +1 -1
@@ -0,0 +1,183 @@
+ # Release History
+
+ ## 1.0.0b5 (2024-10-28)
+
+ ### Features Added
+ - Added `GroundednessProEvaluator`, which is a service-based evaluator for determining response groundedness.
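  For illustration, a minimal sketch of invoking the new evaluator (the project values are placeholders; the constructor and call shape follow the credential/project pattern used by the other service-based evaluators in this package):

  ```python
  from azure.identity import DefaultAzureCredential
  from azure.ai.evaluation import GroundednessProEvaluator

  # Placeholder Azure AI project scope (assumed shape, as used elsewhere in this SDK).
  azure_ai_project = {
      "subscription_id": "<subscription-id>",
      "resource_group_name": "<resource-group>",
      "project_name": "<project-name>",
  }

  groundedness_pro = GroundednessProEvaluator(
      azure_ai_project=azure_ai_project,
      credential=DefaultAzureCredential(),
  )
  result = groundedness_pro(
      query="Which tent is the most waterproof?",
      response="The Alpine Explorer Tent is the most waterproof.",
      context="From our product list, the Alpine Explorer tent is the most waterproof.",
  )
  ```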
+ - Groundedness detection in the Non-Adversarial Simulator via query/context pairs
+ ```python
+ import asyncio
+ import importlib.resources as pkg_resources
+ import json
+
+ from azure.ai.evaluation.simulator import Simulator
+
+ # assumes model_config and callback are defined as in the simulator examples below
+ package = "azure.ai.evaluation.simulator._data_sources"
+ resource_name = "grounding.json"
+ custom_simulator = Simulator(model_config=model_config)
+ conversation_turns = []
+ with pkg_resources.path(package, resource_name) as grounding_file:
+     with open(grounding_file, "r") as file:
+         data = json.load(file)
+ for item in data:
+     conversation_turns.append([item])
+ outputs = asyncio.run(custom_simulator(
+     target=callback,
+     conversation_turns=conversation_turns,
+     max_conversation_turns=1,
+ ))
+ ```
+ - Added evaluators for multimodal use cases
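  As a sketch of the new multimodal surface (the class name comes from the new `_multimodal` modules in the file list above; the conversation shape is an assumption based on the OpenAI chat format used elsewhere in this SDK):

  ```python
  from azure.identity import DefaultAzureCredential
  from azure.ai.evaluation import ContentSafetyMultimodalEvaluator

  # Placeholder project scope, as with the other service-based evaluators.
  azure_ai_project = {"subscription_id": "<subscription-id>", "resource_group_name": "<resource-group>", "project_name": "<project-name>"}

  multimodal_eval = ContentSafetyMultimodalEvaluator(
      azure_ai_project=azure_ai_project,
      credential=DefaultAzureCredential(),
  )
  # Assumed input: an OpenAI-style conversation whose user turn mixes text and an image URL.
  result = multimodal_eval(
      conversation={
          "messages": [
              {
                  "role": "user",
                  "content": [
                      {"type": "text", "text": "What is in this image?"},
                      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                  ],
              },
              {"role": "assistant", "content": "A landscape photo of mountains."},
          ]
      }
  )
  ```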
+
+ ### Breaking Changes
+ - Renamed environment variable `PF_EVALS_BATCH_USE_ASYNC` to `AI_EVALS_BATCH_USE_ASYNC`.
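  For example, code that toggled batch-run behavior through the old variable should now set (the value shown is illustrative):

  ```python
  import os

  # Renamed from PF_EVALS_BATCH_USE_ASYNC in 1.0.0b5.
  os.environ["AI_EVALS_BATCH_USE_ASYNC"] = "true"
  ```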
+ - `RetrievalEvaluator` now requires a `context` input in addition to `query` in single-turn evaluation.
+ - `RelevanceEvaluator` no longer takes `context` as an input. It now only takes `query` and `response` in single-turn evaluation.
+ - `FluencyEvaluator` no longer takes `query` as an input. It now only takes `response` in single-turn evaluation.
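  A sketch of the updated single-turn call shapes (`model_config` is a placeholder Azure OpenAI configuration, as in the examples later in this diff):

  ```python
  from azure.ai.evaluation import FluencyEvaluator, RelevanceEvaluator, RetrievalEvaluator

  model_config = {
      "azure_endpoint": "<endpoint>",
      "azure_deployment": "<deployment>",
  }

  # RetrievalEvaluator: query plus context, no response.
  RetrievalEvaluator(model_config=model_config)(
      query="Which tent is the most waterproof?",
      context="From our product list, the Alpine Explorer tent is the most waterproof.",
  )
  # RelevanceEvaluator: query plus response, no context.
  RelevanceEvaluator(model_config=model_config)(
      query="Which tent is the most waterproof?",
      response="The Alpine Explorer Tent is the most waterproof.",
  )
  # FluencyEvaluator: response only.
  FluencyEvaluator(model_config=model_config)(
      response="The Alpine Explorer Tent is the most waterproof.",
  )
  ```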
+ - The `AdversarialScenario` enum no longer includes `ADVERSARIAL_INDIRECT_JAILBREAK`; indirect jailbreak (XPIA) simulations should be invoked with `IndirectAttackSimulator`.
+ - Outputs of `Simulator` and `AdversarialSimulator` previously exposed `to_eval_qa_json_lines` and now expose `to_eval_qr_json_lines`. Where `to_eval_qa_json_lines` produced:
+ ```json
+ {"question": <user_message>, "answer": <assistant_message>}
+ ```
+ `to_eval_qr_json_lines` now produces:
+ ```json
+ {"query": <user_message>, "response": <assistant_message>}
+ ```
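  For example, a sketch of persisting simulator output in the new query/response format for the `evaluate` API (assumes `simulator`, `callback`, and `text` are set up as in the simulator examples below, and that the helper returns a JSON-lines string):

  ```python
  import asyncio

  outputs = asyncio.run(simulator(target=callback, text=text, num_queries=2))
  # Each line now reads {"query": ..., "response": ...} instead of the old question/answer keys.
  with open("eval_data.jsonl", "w") as f:
      f.write(outputs.to_eval_qr_json_lines())
  ```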
+
+ ### Bugs Fixed
+ - The non-adversarial simulator now works with `gpt-4o` models using the `json_schema` response format
+ - Fixed an issue where the `evaluate` API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when the venv folder and the target function file are in the same directory.
+ - Fixed `evaluate` API failure when `trace.destination` is set to `none`
+ - The non-adversarial simulator now accepts context from the callback
+
+ ### Other Changes
+ - Improved error messages for the `evaluate` API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.
+ - `GroundednessEvaluator` now supports `query` as an optional input in single-turn evaluation. If `query` is provided, a different prompt template will be used for the evaluation.
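  For instance, a sketch of both call shapes (placeholder `model_config` as in the other sketches):

  ```python
  from azure.ai.evaluation import GroundednessEvaluator

  model_config = {"azure_endpoint": "<endpoint>", "azure_deployment": "<deployment>"}
  groundedness = GroundednessEvaluator(model_config=model_config)
  # Without a query: the query-less prompt template is used.
  groundedness(
      response="The Alpine Explorer Tent is the most waterproof.",
      context="From our product list, the Alpine Explorer tent is the most waterproof.",
  )
  # With a query: the query-aware prompt template is used instead.
  groundedness(
      query="Which tent is the most waterproof?",
      response="The Alpine Explorer Tent is the most waterproof.",
      context="From our product list, the Alpine Explorer tent is the most waterproof.",
  )
  ```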
+ - To align with our support of a diverse set of models, the following evaluators will now have a new key in their result output without the `gpt_` prefix. To maintain backwards compatibility, the old key with the `gpt_` prefix will still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
+   - `CoherenceEvaluator`
+   - `RelevanceEvaluator`
+   - `FluencyEvaluator`
+   - `GroundednessEvaluator`
+   - `SimilarityEvaluator`
+   - `RetrievalEvaluator`
+ - The following evaluators will now have a new key in their result output containing the LLM reasoning behind the score. The new key follows the pattern `<metric_name>_reason`. The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
+
+   | Evaluator | New Token Limit |
+   | --- | --- |
+   | `CoherenceEvaluator` | 800 |
+   | `RelevanceEvaluator` | 800 |
+   | `FluencyEvaluator` | 800 |
+   | `GroundednessEvaluator` | 800 |
+   | `RetrievalEvaluator` | 1600 |
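  Taken together, a prompty-based evaluator result now carries three related keys; a sketch of reading them (key names follow the patterns stated above):

  ```python
  from azure.ai.evaluation import CoherenceEvaluator

  model_config = {"azure_endpoint": "<endpoint>", "azure_deployment": "<deployment>"}
  result = CoherenceEvaluator(model_config=model_config)(
      query="Which tent is the most waterproof?",
      response="The Alpine Explorer Tent is the most waterproof.",
  )
  print(result["coherence"])         # new key, no gpt_ prefix
  print(result["gpt_coherence"])     # legacy key, kept for backwards compatibility
  print(result["coherence_reason"])  # LLM reasoning behind the score
  ```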
+ - Improved the error message for storage access permission issues to provide clearer guidance for users.
+
+ ## 1.0.0b4 (2024-10-16)
+
+ ### Breaking Changes
+
+ - Removed `numpy` dependency. All NaN values returned by the SDK have been changed from `numpy.nan` to `math.nan`.
+ - `credential` is now required to be passed in for all content safety evaluators and `ProtectedMaterialsEvaluator`. `DefaultAzureCredential` will no longer be chosen if a credential is not passed.
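  A sketch of the now-required explicit credential (project values are placeholders):

  ```python
  from azure.identity import DefaultAzureCredential
  from azure.ai.evaluation import ViolenceEvaluator

  azure_ai_project = {"subscription_id": "<subscription-id>", "resource_group_name": "<resource-group>", "project_name": "<project-name>"}
  violence = ViolenceEvaluator(
      azure_ai_project=azure_ai_project,
      credential=DefaultAzureCredential(),  # must now be passed explicitly
  )
  result = violence(
      query="Which tent is the most waterproof?",
      response="The Alpine Explorer Tent is the most waterproof.",
  )
  ```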
+ - Changed package extra name from "pf-azure" to "remote".
+
+ ### Bugs Fixed
+ - Adversarial conversation simulations would fail with `Forbidden`. Added logic to re-fetch the token within the exponential retry logic used to retrieve the RAI service response.
+ - Fixed an issue where the Evaluate API did not fail due to missing inputs when the target did not return columns required by the evaluators.
+
+ ### Other Changes
+ - Enhanced the error message to provide clearer instructions when required packages for the remote tracking feature are missing.
+ - Print the per-evaluator run summary at the end of the Evaluate API call to make troubleshooting row-level failures easier.
+
+ ## 1.0.0b3 (2024-10-01)
+
+ ### Features Added
+
+ - Added `type` field to `AzureOpenAIModelConfiguration` and `OpenAIModelConfiguration`
+ - The following evaluators now support `conversation` as an alternative input to their usual single-turn inputs:
+   - `ViolenceEvaluator`
+   - `SexualEvaluator`
+   - `SelfHarmEvaluator`
+   - `HateUnfairnessEvaluator`
+   - `ProtectedMaterialEvaluator`
+   - `IndirectAttackEvaluator`
+   - `CoherenceEvaluator`
+   - `RelevanceEvaluator`
+   - `FluencyEvaluator`
+   - `GroundednessEvaluator`
+ - Surfaced `RetrievalScoreEvaluator`, formerly an internal part of `ChatEvaluator`, as a standalone conversation-only evaluator.
+
+ ### Breaking Changes
+
+ - Removed `ContentSafetyChatEvaluator` and `ChatEvaluator`
+ - The `evaluator_config` parameter of `evaluate` now maps an evaluator name to a dictionary `EvaluatorConfig`, which is a `TypedDict`. The `column_mapping` between `data` or `target` and evaluator field names should now be specified inside this new dictionary:
+
+ Before:
+ ```python
+ evaluate(
+     ...,
+     evaluator_config={
+         "hate_unfairness": {
+             "query": "${data.question}",
+             "response": "${data.answer}",
+         }
+     },
+     ...
+ )
+ ```
+
+ After:
+ ```python
+ evaluate(
+     ...,
+     evaluator_config={
+         "hate_unfairness": {
+             "column_mapping": {
+                 "query": "${data.question}",
+                 "response": "${data.answer}",
+             }
+         }
+     },
+     ...
+ )
+ ```
+
+ - Simulator now requires a model configuration to call the prompty instead of an Azure AI project scope. This enables using the simulator with Entra ID-based auth.
+ Before:
+ ```python
+ azure_ai_project = {
+     "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
+     "resource_group_name": os.environ.get("RESOURCE_GROUP"),
+     "project_name": os.environ.get("PROJECT_NAME"),
+ }
+ sim = Simulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+ ```
+ After:
+ ```python
+ model_config = {
+     "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
+ }
+ sim = Simulator(model_config=model_config)
+ ```
+ If `api_key` is not included in the `model_config`, the prompty runtime in `promptflow-core` will pick up `DefaultAzureCredential`.
+
+ ### Bugs Fixed
+
+ - Fixed issue where Entra ID authentication was not working with `AzureOpenAIModelConfiguration`
+
+ ## 1.0.0b2 (2024-09-24)
+
+ ### Breaking Changes
+
+ - `data` and `evaluators` are now required keywords in `evaluate`.
+
+ ## 1.0.0b1 (2024-09-20)
+
+ ### Breaking Changes
+
+ - The `synthetic` namespace has been renamed to `simulator`, and sub-namespaces under this module have been removed
+ - The `evaluate` and `evaluators` namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace `azure.ai.evaluation`
+ - The parameter name `project_scope` in content safety evaluators has been renamed to `azure_ai_project` for consistency with the evaluate API and simulators.
+ - Model configuration classes are now of type `TypedDict` and are exposed in the `azure.ai.evaluation` module instead of coming from `promptflow.core`.
+ - Updated the parameter names for `question` and `answer` in built-in evaluators to more generic terms: `query` and `response`.
+
+ ### Features Added
+
+ - First preview
+ - This package is a port of `promptflow-evals`. New features will be added only to this package moving forward.
+ - Added a `TypedDict` for `AzureAIProject` that allows for better intellisense and type checking when passing in project information
@@ -4,3 +4,4 @@ include azure/__init__.py
  include azure/ai/__init__.py
  include azure/ai/evaluation/py.typed
  recursive-include azure/ai/evaluation *.prompty
+ include azure/ai/evaluation/simulator/_data_sources/grounding.json
@@ -0,0 +1,70 @@
+ NOTICES AND INFORMATION
+ Do Not Translate or Localize
+
+ This software incorporates material from third parties.
+ Microsoft makes certain open source code available at https://3rdpartysource.microsoft.com,
+ or you may send a check or money order for US $5.00, including the product name,
+ the open source component name, platform, and version number, to:
+
+ Source Code Compliance Team
+ Microsoft Corporation
+ One Microsoft Way
+ Redmond, WA 98052
+ USA
+
+ Notwithstanding any other terms, you may reverse engineer this software to the extent
+ required to debug changes to any libraries licensed under the GNU Lesser General Public License.
+
+ License notice for nltk
+ ---------------------------------------------------------
+
+ Copyright 2024 The NLTK Project
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+ License notice for rouge-score
+ ---------------------------------------------------------
+
+ Copyright 2024 The Google Research Authors
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+
+ License notice for [Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings](https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2023.1272229/full)
+ ------------------------------------------------------------------------------------------------------------------
+ Copyright © 2023 Hackl, Müller, Granitzer and Sailer. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
+
+
+ License notice for [Is ChatGPT a Good NLG Evaluator? A Preliminary Study](https://aclanthology.org/2023.newsum-1.1) (Wang et al., NewSum 2023)
+ ------------------------------------------------------------------------------------------------------------------
+ Copyright © 2023. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
+
+
+ License notice for [SummEval: Re-evaluating Summarization Evaluation.](https://doi.org/10.1162/tacl_a_00373) (Fabbri et al.)
+ ------------------------------------------------------------------------------------------------------------------
+ © 2021 Association for Computational Linguistics. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
+
+
+ License notice for [Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks](https://aclanthology.org/2023.emnlp-main.543) (Sottana et al., EMNLP 2023)
+ ------------------------------------------------------------------------------------------------------------------
+ © 2023 Association for Computational Linguistics. This work is openly licensed via [CC BY 4.0](http://creativecommons.org/licenses/by/4.0/).
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: azure-ai-evaluation
- Version: 1.0.0b3
+ Version: 1.0.0b5
  Summary: Microsoft Azure Evaluation Library for Python
  Home-page: https://github.com/Azure/azure-sdk-for-python
  Author: Microsoft Corporation
@@ -21,17 +21,16 @@ Classifier: License :: OSI Approved :: MIT License
  Classifier: Operating System :: OS Independent
  Requires-Python: >=3.8
  Description-Content-Type: text/markdown
+ License-File: NOTICE.txt
  Requires-Dist: promptflow-devkit>=1.15.0
  Requires-Dist: promptflow-core>=1.15.0
- Requires-Dist: numpy>=1.23.2; python_version < "3.12"
- Requires-Dist: numpy>=1.26.4; python_version >= "3.12"
  Requires-Dist: pyjwt>=2.8.0
- Requires-Dist: azure-identity>=1.12.0
+ Requires-Dist: azure-identity>=1.16.0
  Requires-Dist: azure-core>=1.30.2
  Requires-Dist: nltk>=3.9.1
- Requires-Dist: rouge-score>=0.1.2
- Provides-Extra: pf-azure
- Requires-Dist: promptflow-azure<2.0.0,>=1.15.0; extra == "pf-azure"
+ Provides-Extra: remote
+ Requires-Dist: promptflow-azure<2.0.0,>=1.15.0; extra == "remote"
+ Requires-Dist: azure-ai-inference>=1.0.0b4; extra == "remote"

  # Azure AI Evaluation client library for Python

@@ -97,9 +96,6 @@ if __name__ == "__main__":
      # Running Relevance Evaluator on single input row
      relevance_score = relevance_eval(
          response="The Alpine Explorer Tent is the most waterproof.",
-         context="From the our product list,"
-         " the alpine explorer tent is the most waterproof."
-         " The Adventure Dining Table has higher weight.",
          query="Which tent is the most waterproof?",
      )
@@ -154,11 +150,6 @@ name: ApplicationPrompty
  description: Simulates an application
  model:
    api: chat
-   configuration:
-     type: azure_openai
-     azure_deployment: ${env:AZURE_DEPLOYMENT}
-     api_key: ${env:AZURE_OPENAI_API_KEY}
-     azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    parameters:
      temperature: 0.0
      top_p: 1.0
@@ -179,6 +170,95 @@ Output with a string that continues the conversation, responding to the latest m
  {{ conversation_history }}

  ```
+
+ A query-response generating prompty for gpt-4o with `json_schema` support.
+ Use this file as an override.
+ ```yaml
+ ---
+ name: TaskSimulatorQueryResponseGPT4o
+ description: Gets queries and responses from a blob of text
+ model:
+   api: chat
+   parameters:
+     temperature: 0.0
+     top_p: 1.0
+     presence_penalty: 0
+     frequency_penalty: 0
+     response_format:
+       type: json_schema
+       json_schema:
+         name: QRJsonSchema
+         schema:
+           type: object
+           properties:
+             items:
+               type: array
+               items:
+                 type: object
+                 properties:
+                   q:
+                     type: string
+                   r:
+                     type: string
+                 required:
+                   - q
+                   - r
+
+ inputs:
+   text:
+     type: string
+   num_queries:
+     type: integer
+
+
+ ---
+ system:
+ You're an AI that helps in preparing a Question/Answer quiz from Text for "Who wants to be a millionaire" tv show
+ Both Questions and Answers MUST BE extracted from given Text
+ Frame Question in a way so that Answer is RELEVANT SHORT BITE-SIZED info from Text
+ RELEVANT info could be: NUMBER, DATE, STATISTIC, MONEY, NAME
+ A sentence should contribute multiple QnAs if it has more info in it
+ Answer must not be more than 5 words
+ Answer must be picked from Text as is
+ Question should be as descriptive as possible and must include as much context as possible from Text
+ Output must always have the provided number of QnAs
+ Output must be in JSON format.
+ Output must have {{num_queries}} objects in the format specified below. Any other count is unacceptable.
+ Text:
+ <|text_start|>
+ On January 24, 1984, former Apple CEO Steve Jobs introduced the first Macintosh. In late 2003, Apple had 2.06 percent of the desktop share in the United States.
+ Some years later, research firms IDC and Gartner reported that Apple's market share in the U.S. had increased to about 6%.
+ <|text_end|>
+ Output with 5 QnAs:
+ {
+     "qna": [{
+         "q": "When did the former Apple CEO Steve Jobs introduced the first Macintosh?",
+         "r": "January 24, 1984"
+     },
+     {
+         "q": "Who was the former Apple CEO that introduced the first Macintosh on January 24, 1984?",
+         "r": "Steve Jobs"
+     },
+     {
+         "q": "What percent of the desktop share did Apple have in the United States in late 2003?",
+         "r": "2.06 percent"
+     },
+     {
+         "q": "What were the research firms that reported on Apple's market share in the U.S.?",
+         "r": "IDC and Gartner"
+     },
+     {
+         "q": "What was the percentage increase of Apple's market share in the U.S., as reported by research firms IDC and Gartner?",
+         "r": "6%"
+     }]
+ }
+ Text:
+ <|text_start|>
+ {{ text }}
+ <|text_end|>
+ Output with {{ num_queries }} QnAs:
+ ```
+
  Application code:

  ```python
@@ -187,93 +267,96 @@ import asyncio
  from typing import Any, Dict, List, Optional
  from azure.ai.evaluation.simulator import Simulator
  from promptflow.client import load_flow
- from azure.identity import DefaultAzureCredential
  import os
+ import wikipedia

- azure_ai_project = {
-     "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
-     "resource_group_name": os.environ.get("RESOURCE_GROUP"),
-     "project_name": os.environ.get("PROJECT_NAME")
+ # Set up the model configuration without api_key, using DefaultAzureCredential
+ model_config = {
+     "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
+     # Not providing an api_key makes the SDK pick up `DefaultAzureCredential`;
+     # to use a key instead, add: "api_key": "<your API key>"
+     "api_version": "2024-08-01-preview"  # keep this for gpt-4o
  }

- import wikipedia
- wiki_search_term = "Leonardo da vinci"
+ # Use Wikipedia to get some text for the simulation
+ wiki_search_term = "Leonardo da Vinci"
  wiki_title = wikipedia.search(wiki_search_term)[0]
  wiki_page = wikipedia.page(wiki_title)
  text = wiki_page.summary[:1000]

- def method_to_invoke_application_prompty(query: str):
+ def method_to_invoke_application_prompty(query: str, messages_list: List[Dict], context: Optional[Dict]):
      try:
          current_dir = os.path.dirname(__file__)
          prompty_path = os.path.join(current_dir, "application.prompty")
-         _flow = load_flow(source=prompty_path, model={
-             "configuration": azure_ai_project
-         })
+         _flow = load_flow(
+             source=prompty_path,
+             model=model_config,
+             credential=DefaultAzureCredential()
+         )
          response = _flow(
              query=query,
              context=context,
              conversation_history=messages_list
          )
          return response
-     except:
-         print("Something went wrong invoking the prompty")
+     except Exception as e:
+         print(f"Something went wrong invoking the prompty: {e}")
          return "something went wrong"

  async def callback(
-     messages: List[Dict],
+     messages: Dict[str, List[Dict]],
      stream: bool = False,
      session_state: Any = None,  # noqa: ANN401
      context: Optional[Dict[str, Any]] = None,
  ) -> dict:
      messages_list = messages["messages"]
-     # get last message
+     # Get the last message from the user
      latest_message = messages_list[-1]
      query = latest_message["content"]
-     context = None
-     # call your endpoint or ai application here
-     response = method_to_invoke_application_prompty(query)
-     # we are formatting the response to follow the openAI chat protocol format
+     # Call your endpoint or AI application here
+     response = method_to_invoke_application_prompty(query, messages_list, context)
+     # Format the response to follow the OpenAI chat protocol format
      formatted_response = {
          "content": response,
          "role": "assistant",
-         "context": {
-             "citations": None,
-         },
+         "context": "",
      }
      messages["messages"].append(formatted_response)
      return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}

-
-
  async def main():
-     simulator = Simulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+     simulator = Simulator(model_config=model_config)
+     current_dir = os.path.dirname(__file__)
+     query_response_override_for_latest_gpt_4o = os.path.join(current_dir, "TaskSimulatorQueryResponseGPT4o.prompty")
      outputs = await simulator(
          target=callback,
          text=text,
+         query_response_generating_prompty=query_response_override_for_latest_gpt_4o,  # use this only with latest gpt-4o
          num_queries=2,
-         max_conversation_turns=4,
+         max_conversation_turns=1,
          user_persona=[
              f"I am a student and I want to learn more about {wiki_search_term}",
              f"I am a teacher and I want to teach my students about {wiki_search_term}"
          ],
      )
-     print(json.dumps(outputs))
+     print(json.dumps(outputs, indent=2))

  if __name__ == "__main__":
-     os.environ["AZURE_SUBSCRIPTION_ID"] = ""
-     os.environ["RESOURCE_GROUP"] = ""
-     os.environ["PROJECT_NAME"] = ""
-     os.environ["AZURE_OPENAI_API_KEY"] = ""
-     os.environ["AZURE_OPENAI_ENDPOINT"] = ""
-     os.environ["AZURE_DEPLOYMENT"] = ""
+     # Ensure that the following environment variables are set in your environment:
+     # AZURE_OPENAI_ENDPOINT and AZURE_DEPLOYMENT
+     # Example:
+     # os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
+     # os.environ["AZURE_DEPLOYMENT"] = "your-deployment-name"
      asyncio.run(main())
      print("done!")
+
  ```

  #### Adversarial Simulator

  ```python
- from from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
+ from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
  from azure.identity import DefaultAzureCredential
  from typing import Any, Dict, List, Optional
  import asyncio
@@ -426,6 +509,88 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con

  # Release History

+ ## 1.0.0b5 (2024-10-28)
+
+ ### Features Added
+ - Added `GroundednessProEvaluator`, which is a service-based evaluator for determining response groundedness.
+ - Groundedness detection in the Non-Adversarial Simulator via query/context pairs
+ ```python
+ import asyncio
+ import importlib.resources as pkg_resources
+ import json
+
+ from azure.ai.evaluation.simulator import Simulator
+
+ # assumes model_config and callback are defined as in the simulator examples above
+ package = "azure.ai.evaluation.simulator._data_sources"
+ resource_name = "grounding.json"
+ custom_simulator = Simulator(model_config=model_config)
+ conversation_turns = []
+ with pkg_resources.path(package, resource_name) as grounding_file:
+     with open(grounding_file, "r") as file:
+         data = json.load(file)
+ for item in data:
+     conversation_turns.append([item])
+ outputs = asyncio.run(custom_simulator(
+     target=callback,
+     conversation_turns=conversation_turns,
+     max_conversation_turns=1,
+ ))
+ ```
+ - Added evaluators for multimodal use cases
+
+ ### Breaking Changes
+ - Renamed environment variable `PF_EVALS_BATCH_USE_ASYNC` to `AI_EVALS_BATCH_USE_ASYNC`.
+ - `RetrievalEvaluator` now requires a `context` input in addition to `query` in single-turn evaluation.
+ - `RelevanceEvaluator` no longer takes `context` as an input. It now only takes `query` and `response` in single-turn evaluation.
+ - `FluencyEvaluator` no longer takes `query` as an input. It now only takes `response` in single-turn evaluation.
+ - The `AdversarialScenario` enum no longer includes `ADVERSARIAL_INDIRECT_JAILBREAK`; indirect jailbreak (XPIA) simulations should be invoked with `IndirectAttackSimulator`.
+ - Outputs of `Simulator` and `AdversarialSimulator` previously exposed `to_eval_qa_json_lines` and now expose `to_eval_qr_json_lines`. Where `to_eval_qa_json_lines` produced:
+ ```json
+ {"question": <user_message>, "answer": <assistant_message>}
+ ```
+ `to_eval_qr_json_lines` now produces:
+ ```json
+ {"query": <user_message>, "response": <assistant_message>}
+ ```
+
+ ### Bugs Fixed
+ - The non-adversarial simulator now works with `gpt-4o` models using the `json_schema` response format
+ - Fixed an issue where the `evaluate` API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when the venv folder and the target function file are in the same directory.
+ - Fixed `evaluate` API failure when `trace.destination` is set to `none`
+ - The non-adversarial simulator now accepts context from the callback
+
+ ### Other Changes
+ - Improved error messages for the `evaluate` API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.
+ - `GroundednessEvaluator` now supports `query` as an optional input in single-turn evaluation. If `query` is provided, a different prompt template will be used for the evaluation.
+ - To align with our support of a diverse set of models, the following evaluators will now have a new key in their result output without the `gpt_` prefix. To maintain backwards compatibility, the old key with the `gpt_` prefix will still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
+   - `CoherenceEvaluator`
+   - `RelevanceEvaluator`
+   - `FluencyEvaluator`
+   - `GroundednessEvaluator`
+   - `SimilarityEvaluator`
+   - `RetrievalEvaluator`
+ - The following evaluators will now have a new key in their result output containing the LLM reasoning behind the score. The new key follows the pattern `<metric_name>_reason`. The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.
+
+   | Evaluator | New Token Limit |
+   | --- | --- |
+   | `CoherenceEvaluator` | 800 |
+   | `RelevanceEvaluator` | 800 |
+   | `FluencyEvaluator` | 800 |
+   | `GroundednessEvaluator` | 800 |
+   | `RetrievalEvaluator` | 1600 |
+ - Improved the error message for storage access permission issues to provide clearer guidance for users.
+
+ ## 1.0.0b4 (2024-10-16)
+
+ ### Breaking Changes
+
+ - Removed `numpy` dependency. All NaN values returned by the SDK have been changed from `numpy.nan` to `math.nan`.
+ - `credential` is now required to be passed in for all content safety evaluators and `ProtectedMaterialsEvaluator`. `DefaultAzureCredential` will no longer be chosen if a credential is not passed.
+ - Changed package extra name from "pf-azure" to "remote".
+
+ ### Bugs Fixed
+ - Adversarial conversation simulations would fail with `Forbidden`. Added logic to re-fetch the token within the exponential retry logic used to retrieve the RAI service response.
+ - Fixed an issue where the Evaluate API did not fail due to missing inputs when the target did not return columns required by the evaluators.
+
+ ### Other Changes
+ - Enhanced the error message to provide clearer instructions when required packages for the remote tracking feature are missing.
+ - Print the per-evaluator run summary at the end of the Evaluate API call to make troubleshooting row-level failures easier.
+
  ## 1.0.0b3 (2024-10-01)

  ### Features Added
@@ -480,9 +645,29 @@ evaluate(
  )
  ```

+ - Simulator now requires a model configuration to call the prompty instead of an Azure AI project scope. This enables using the simulator with Entra ID-based auth.
+ Before:
+ ```python
+ azure_ai_project = {
+     "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
+     "resource_group_name": os.environ.get("RESOURCE_GROUP"),
+     "project_name": os.environ.get("PROJECT_NAME"),
+ }
+ sim = Simulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+ ```
+ After:
+ ```python
+ model_config = {
+     "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
+     "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
+ }
+ sim = Simulator(model_config=model_config)
+ ```
+ If `api_key` is not included in the `model_config`, the prompty runtime in `promptflow-core` will pick up `DefaultAzureCredential`.
+
  ### Bugs Fixed

- - Fixed issue where Entra ID authentication was not working with `AzureOpenAIModelConfiguration`
+ - Fixed issue where Entra ID authentication was not working with `AzureOpenAIModelConfiguration`

  ## 1.0.0b2 (2024-09-24)

@@ -495,9 +680,9 @@ evaluate(
  ### Breaking Changes

  - The `synthetic` namespace has been renamed to `simulator`, and sub-namespaces under this module have been removed
- - The `evaluate` and `evaluators` namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace `azure.ai.evaluation`
+ - The `evaluate` and `evaluators` namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace `azure.ai.evaluation`
  - The parameter name `project_scope` in content safety evaluators has been renamed to `azure_ai_project` for consistency with the evaluate API and simulators.
- - Model configurations classes are now of type `TypedDict` and are exposed in the `azure.ai.evaluation` module instead of coming from `promptflow.core`.
+ - Model configuration classes are now of type `TypedDict` and are exposed in the `azure.ai.evaluation` module instead of coming from `promptflow.core`.
  - Updated the parameter names for `question` and `answer` in built-in evaluators to more generic terms: `query` and `response`.

  ### Features Added