ibm-watsonx-orchestrate-evaluation-framework 1.1.5__py3-none-any.whl → 1.1.7__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of ibm-watsonx-orchestrate-evaluation-framework might be problematic.
- {ibm_watsonx_orchestrate_evaluation_framework-1.1.5.dist-info → ibm_watsonx_orchestrate_evaluation_framework-1.1.7.dist-info}/METADATA +4 -1
- {ibm_watsonx_orchestrate_evaluation_framework-1.1.5.dist-info → ibm_watsonx_orchestrate_evaluation_framework-1.1.7.dist-info}/RECORD +49 -39
- wxo_agentic_evaluation/analyze_run.py +822 -344
- wxo_agentic_evaluation/arg_configs.py +39 -2
- wxo_agentic_evaluation/data_annotator.py +22 -4
- wxo_agentic_evaluation/description_quality_checker.py +29 -4
- wxo_agentic_evaluation/evaluation_package.py +197 -18
- wxo_agentic_evaluation/external_agent/external_validate.py +3 -1
- wxo_agentic_evaluation/external_agent/types.py +1 -1
- wxo_agentic_evaluation/inference_backend.py +105 -108
- wxo_agentic_evaluation/llm_matching.py +104 -2
- wxo_agentic_evaluation/llm_user.py +2 -2
- wxo_agentic_evaluation/main.py +147 -38
- wxo_agentic_evaluation/metrics/__init__.py +5 -0
- wxo_agentic_evaluation/metrics/evaluations.py +124 -0
- wxo_agentic_evaluation/metrics/llm_as_judge.py +4 -3
- wxo_agentic_evaluation/metrics/metrics.py +64 -1
- wxo_agentic_evaluation/prompt/llmaaj_prompt.jinja2 +15 -0
- wxo_agentic_evaluation/prompt/semantic_matching_prompt.jinja2 +41 -9
- wxo_agentic_evaluation/prompt/template_render.py +20 -2
- wxo_agentic_evaluation/quick_eval.py +23 -11
- wxo_agentic_evaluation/record_chat.py +18 -10
- wxo_agentic_evaluation/red_teaming/attack_evaluator.py +169 -100
- wxo_agentic_evaluation/red_teaming/attack_generator.py +63 -40
- wxo_agentic_evaluation/red_teaming/attack_list.py +78 -8
- wxo_agentic_evaluation/red_teaming/attack_runner.py +71 -14
- wxo_agentic_evaluation/referenceless_eval/function_calling/metrics/function_call/general_metrics.json +783 -0
- wxo_agentic_evaluation/referenceless_eval/function_calling/metrics/function_selection/function_selection_metrics.json +600 -0
- wxo_agentic_evaluation/referenceless_eval/function_calling/pipeline/types.py +10 -10
- wxo_agentic_evaluation/referenceless_eval/referenceless_eval.py +103 -39
- wxo_agentic_evaluation/resource_map.py +3 -1
- wxo_agentic_evaluation/service_instance.py +12 -3
- wxo_agentic_evaluation/service_provider/__init__.py +129 -9
- wxo_agentic_evaluation/service_provider/gateway_provider.py +707 -0
- wxo_agentic_evaluation/service_provider/model_proxy_provider.py +415 -17
- wxo_agentic_evaluation/service_provider/ollama_provider.py +393 -22
- wxo_agentic_evaluation/service_provider/provider.py +130 -10
- wxo_agentic_evaluation/service_provider/referenceless_provider_wrapper.py +52 -0
- wxo_agentic_evaluation/service_provider/watsonx_provider.py +480 -52
- wxo_agentic_evaluation/type.py +15 -5
- wxo_agentic_evaluation/utils/__init__.py +44 -3
- wxo_agentic_evaluation/utils/evaluation_discovery.py +47 -0
- wxo_agentic_evaluation/utils/gateway_provider_utils.py +39 -0
- wxo_agentic_evaluation/utils/messages_parser.py +30 -0
- wxo_agentic_evaluation/utils/parsers.py +71 -0
- wxo_agentic_evaluation/utils/utils.py +140 -20
- wxo_agentic_evaluation/wxo_client.py +81 -0
- {ibm_watsonx_orchestrate_evaluation_framework-1.1.5.dist-info → ibm_watsonx_orchestrate_evaluation_framework-1.1.7.dist-info}/WHEEL +0 -0
- {ibm_watsonx_orchestrate_evaluation_framework-1.1.5.dist-info → ibm_watsonx_orchestrate_evaluation_framework-1.1.7.dist-info}/top_level.txt +0 -0
wxo_agentic_evaluation/prompt/semantic_matching_prompt.jinja2 (+41 -9)

```diff
@@ -1,13 +1,13 @@
 <|begin_of_text|><|start_header_id|>system<|end_header_id|>
-You are an evaluation agent specializing in semantic similarity assessment. Your task is to determine whether two texts express the same factual information and intentions, even when presented differently.
+You are an evaluation agent specializing in semantic similarity assessment. Your task is to determine whether two texts express the same factual information and intentions, even when presented differently, given a context of the situation.

 Key evaluation principles:
-1. Focus on whether the core information and outcome is the same
-2. Different phrasings that convey the same result should be considered equivalent
-3.
-4.
-5.
-6.
+1. Focus on whether the core information and outcome is the same.
+2. Different phrasings that convey the same result should be considered equivalent.
+3. Ignore formatting differences in dates (2022-01-01 vs. 1/1/2022 vs 20220101), numbers ($210,000 vs 210000.0 vs $21,0000.0), and IDs.
+4. When specific values (e.g. IDs, dates, amounts, names) appear in both texts, they must match exactly. If they appear only in one text but the other text doesn't contradict them, consider it equivalent.
+5. Reference IDs that are system-generated (e.g. item IDs, request IDs, confirmation numbers, UUIDs) should be ignored when checking for equivalence.
+6. When checking query results like lists or tables, differences in field values, and rows are acceptable as long as the same entities or items are represented and the query intent, data type, and structure remain the same.

 Respond ONLY with:
 - True: if the texts convey the same essential information and outcomes
@@ -20,16 +20,30 @@ DO NOT provide explanations or commentary - only respond with "True" or "False"
 Evaluate the following examples:

 ### Example 1
+Context:
+Get me a list of all active machines.
+
 Expected:
-
+Here are all the active machines:
+| id | name | number | status |
+|----|-----------|--------|----------|
+| 43 | NNM1 | | active |
+| 01 | XYZ2 | | active |
+| 44 | RRX | | active |

 Actual:
-
+Here are all the active machines:
+| id | name | number | status |
+|----|-----------|--------|----------|
+| 1280 | ABC | | active |

 Answer:
 True

 ### Example 2
+Context:
+Give me information about Ontario.
+
 Expected:
 Ontario is a province in Canada.

@@ -40,6 +54,9 @@ Answer:
 False

 ### Example 3
+Context:
+Find payslip details for user 12345.
+
 Expected:
 No payslips found for user with ID 12345.

@@ -50,6 +67,9 @@ Answer:
 True

 ### Example 4
+Context:
+I'd like to create a new time off request.
+
 Expected:
 Your time off request from 2024-11-01 to 2024-11-01 for TRAVEL has been successfully submitted. The request ID is c705878eb6584e9b910b8db3907a31da.

@@ -60,6 +80,9 @@ Answer:
 True

 ### Example 5
+Context:
+What's my compensation details?
+
 Expected:
 Your compensation details are as follows:
 * Currency: USD
@@ -72,6 +95,9 @@ Answer:
 True

 ### Example 6
+Context:
+Show me my visa details.
+
 Expected:
 Your visa details are as follows:
 - Country: 44
@@ -88,6 +114,9 @@ Answer:
 False

 ### Example 7
+Context:
+Update my preferred name and my starting date.
+
 Expected:
 I successfully updated your personal information.

@@ -101,6 +130,9 @@ True

 ### Now, evaluate the following:

+Context:
+{{ context }}
+
 Expected:
 {{ expected_text }}

```
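For reference, the updated template takes a `context` variable alongside `expected_text` and `actual_text`. Below is a minimal sketch of how those placeholders are filled with plain Jinja2; the inline template only mirrors the new section layout and is not the shipped file.

```python
# Minimal sketch: filling the new Context/Expected/Actual placeholders with
# plain Jinja2. The inline template only mirrors the new prompt layout; the
# real prompt ships as semantic_matching_prompt.jinja2 in the wheel.
from jinja2 import Template

template = Template(
    "Context:\n{{ context }}\n\n"
    "Expected:\n{{ expected_text }}\n\n"
    "Actual:\n{{ actual_text }}\n"
)

prompt_tail = template.render(
    context="Get me a list of all active machines.",
    expected_text="Here are all the active machines: ...",
    actual_text="Here are all the active machines: ...",
)
print(prompt_tail)
```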
wxo_agentic_evaluation/prompt/template_render.py (+20 -2)

```diff
@@ -45,9 +45,11 @@ class KeywordMatchingTemplateRenderer(JinjaTemplateRenderer):


 class SemanticMatchingTemplateRenderer(JinjaTemplateRenderer):
-    def render(self, expected_text: str, actual_text: str) -> str:
+    def render(self, context: str, expected_text: str, actual_text: str) -> str:
         return super().render(
-
+            context=context,
+            expected_text=expected_text,
+            actual_text=actual_text,
         )


@@ -171,3 +173,19 @@ class OffPolicyAttackGeneratorTemplateRenderer(JinjaTemplateRenderer):
             original_story=original_story,
             original_starting_sentence=original_starting_sentence,
         )
+
+
+class LLMaaJTemplateRenderer(JinjaTemplateRenderer):
+    def render(
+        self,
+        user_input: str,
+        agent_answer: str,
+        llmaaj_instructions: str,
+        context: str,
+    ) -> str:
+        return super().render(
+            user_input=user_input,
+            agent_answer=agent_answer,
+            llmaaj_instructions=llmaaj_instructions,
+            context=context,
+        )
```
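A hedged usage sketch of the widened renderer signatures above. The constructor argument (a template path) is assumed to follow the pattern of the other renderers in this diff, such as StoryGenerationTemplateRenderer, and the paths and string values below are illustrative, not taken from the package.

```python
# Sketch (not from the diff): calling the updated render() signatures.
# Assumes JinjaTemplateRenderer subclasses are constructed from a template
# path, as StoryGenerationTemplateRenderer is elsewhere in this release.
from wxo_agentic_evaluation.prompt.template_render import (
    LLMaaJTemplateRenderer,
    SemanticMatchingTemplateRenderer,
)

semantic = SemanticMatchingTemplateRenderer("prompt/semantic_matching_prompt.jinja2")
semantic_prompt = semantic.render(
    context="I'd like to create a new time off request.",  # new argument in 1.1.7
    expected_text="Your time off request has been successfully submitted.",
    actual_text="Your time off request was submitted successfully.",
)

llmaaj = LLMaaJTemplateRenderer("prompt/llmaaj_prompt.jinja2")  # renderer added in 1.1.7
judge_prompt = llmaaj.render(
    user_input="Show me my visa details.",
    agent_answer="Your visa details are as follows: ...",
    llmaaj_instructions="Check that the answer is grounded in the context.",
    context="Visa record for employee 42.",
)
```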
wxo_agentic_evaluation/quick_eval.py (+23 -11)

```diff
@@ -14,7 +14,6 @@ from wxo_agentic_evaluation.arg_configs import QuickEvalConfig
 from wxo_agentic_evaluation.inference_backend import (
     EvaluationController,
     WXOInferenceBackend,
-    get_wxo_client,
 )
 from wxo_agentic_evaluation.llm_user import LLMUser
 from wxo_agentic_evaluation.metrics.metrics import (
@@ -38,6 +37,7 @@ from wxo_agentic_evaluation.utils.open_ai_tool_extractor import (
     ToolExtractionOpenAIFormat,
 )
 from wxo_agentic_evaluation.utils.utils import ReferencelessEvalPanel
+from wxo_agentic_evaluation.wxo_client import get_wxo_client

 ROOT_DIR = os.path.dirname(__file__)
 MODEL_ID = "meta-llama/llama-3-405b-instruct"
@@ -62,7 +62,7 @@ def process_test_case(
     )

     summary, referenceless_metrics = evaluation_controller.generate_summary(
-        task_n, all_tools, messages
+        task_n, all_tools, messages, inference_backend
     )

     outfolder = Path(f"{config.output_dir}/quick-eval")
@@ -111,18 +111,25 @@ class QuickEvalController(EvaluationController):
         return messages

     def generate_summary(
-        self,
+        self,
+        task_n,
+        tools: List[Mapping[str, Any]],
+        messages: List[Message],
+        inference_backend=None,
     ) -> Tuple[ReferenceLessEvalMetrics, List[ExtendedMessage]]:
         # run reference-less evaluation
         rich.print(f"[b][Task-{task_n}] Starting Quick Evaluation")
+        processed_data = ReferencelessEvaluation.fmt_msgs_referenceless(
+            messages
+        )
         te = ReferencelessEvaluation(
             tools,
-            messages,
             MODEL_ID,
             task_n,
             self.test_case_name,
+            inference_backend=inference_backend,
         )
-        referenceless_results = te.run()
+        referenceless_results = te.run(examples=processed_data)
         rich.print(f"[b][Task-{task_n}] Finished Quick Evaluation")

         summary_metrics = self.compute_metrics(referenceless_results)
@@ -167,13 +174,13 @@ class QuickEvalController(EvaluationController):

         extended_messages.append(extended_message)

-        # return summary_metrics, referenceless_results
         return summary_metrics, extended_messages

     def failed_static_metrics_for_tool_call(
         self, static_metrics: Mapping[str, Mapping[str, Any]]
     ) -> Optional[List[FailedStaticTestCases]]:
         """
+        # TODO: in future PR, use the ReferencelessParser library
         static.metrics
         """

@@ -195,6 +202,7 @@ class QuickEvalController(EvaluationController):
         self, semantic_metrics: Mapping[str, Mapping[str, Any]]
     ) -> Optional[List[FailedSemanticTestCases]]:
         """
+        # TODO: in future PR, use the ReferencelessParser library
         semantic.general
         semantic.function_selection

@@ -257,11 +265,6 @@ class QuickEvalController(EvaluationController):
             []
         )  # keep track of tool calls that failed for semantic reason

-        from pprint import pprint
-
-        # pprint("quick eval results: ")
-        # pprint(quick_eval_results)
-
         for tool_call_idx, result in enumerate(quick_eval_results):
             static_passed = result.get("static", {}).get(
                 "final_decision", False
@@ -309,11 +312,20 @@ def main(config: QuickEvalConfig):
         config.auth_config.tenant_name,
         config.auth_config.token,
     )
+    auth = getattr(config, "auth_config", None)
+    extra_kwargs = {}
+    instance_url = getattr(auth, "url", None) if auth else None
+    token = getattr(auth, "token", None) if auth else None
+    if instance_url:
+        extra_kwargs["instance_url"] = instance_url
+    if token:
+        extra_kwargs["token"] = token
     inference_backend = WXOInferenceBackend(wxo_client)
     llm_user = LLMUser(
         wai_client=get_provider(
             config=config.provider_config,
             model_id=config.llm_user_config.model_id,
+            **extra_kwargs,
         ),
         template=LlamaUserTemplateRenderer(
             config.llm_user_config.prompt_config
```
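The `main()` change above follows a simple pattern: read optional auth fields defensively and forward them to `get_provider` only when they are set. Below is a standalone sketch of that pattern; the config objects are stand-ins built with SimpleNamespace, not the real QuickEvalConfig.

```python
# Standalone sketch of the optional-kwargs pattern used in main() above.
# SimpleNamespace stands in for the real auth_config; only fields that are
# actually set get forwarded, so providers that don't accept instance_url
# or token keep working unchanged.
from types import SimpleNamespace


def build_provider_kwargs(config) -> dict:
    auth = getattr(config, "auth_config", None)
    extra_kwargs = {}
    instance_url = getattr(auth, "url", None) if auth else None
    token = getattr(auth, "token", None) if auth else None
    if instance_url:
        extra_kwargs["instance_url"] = instance_url
    if token:
        extra_kwargs["token"] = token
    return extra_kwargs


cfg = SimpleNamespace(auth_config=SimpleNamespace(url="https://example.test", token="sk-..."))
print(build_provider_kwargs(cfg))          # {'instance_url': 'https://example.test', 'token': 'sk-...'}

cfg_no_auth = SimpleNamespace(auth_config=None)
print(build_provider_kwargs(cfg_no_auth))  # {}
```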
wxo_agentic_evaluation/record_chat.py (+18 -10)

```diff
@@ -15,11 +15,7 @@ from wxo_agentic_evaluation.arg_configs import (
     KeywordsGenerationConfig,
 )
 from wxo_agentic_evaluation.data_annotator import DataAnnotator
-from wxo_agentic_evaluation.inference_backend import (
-    WXOClient,
-    WXOInferenceBackend,
-    get_wxo_client,
-)
+from wxo_agentic_evaluation.inference_backend import WXOInferenceBackend
 from wxo_agentic_evaluation.prompt.template_render import (
     StoryGenerationTemplateRenderer,
 )
@@ -27,6 +23,7 @@ from wxo_agentic_evaluation.service_instance import tenant_setup
 from wxo_agentic_evaluation.service_provider import get_provider
 from wxo_agentic_evaluation.type import Message
 from wxo_agentic_evaluation.utils.utils import is_saas_url
+from wxo_agentic_evaluation.wxo_client import WXOClient, get_wxo_client

 warnings.filterwarnings("ignore", category=DeprecationWarning)
 warnings.filterwarnings("ignore", category=FutureWarning)
@@ -45,7 +42,6 @@ def get_recent_runs(wxo_client: WXOClient, limit: int = 20):
     else:
         path = "v1/orchestrate/runs"

-
     meta_resp = wxo_client.get(path, params={"limit": 1, "offset": 0}).json()
     total = meta_resp.get("total", 0)

@@ -54,7 +50,9 @@ def get_recent_runs(wxo_client: WXOClient, limit: int = 20):

     # fetch the most recent runs
     offset_for_latest = max(total - limit, 0)
-    resp = wxo_client.get(
+    resp = wxo_client.get(
+        path, params={"limit": limit, "offset": offset_for_latest}
+    ).json()

     runs = []
     if isinstance(resp, dict):
@@ -72,8 +70,15 @@ def get_recent_runs(wxo_client: WXOClient, limit: int = 20):
     return runs


-def generate_story(annotated_data: dict):
+def generate_story(annotated_data: dict, config: ChatRecordingConfig = None):
     renderer = StoryGenerationTemplateRenderer(STORY_GENERATION_PROMPT_PATH)
+    extra_kwargs = {}
+    instance_url = getattr(config, "service_url", None)
+    token = getattr(config, "token", None)
+    if instance_url:
+        extra_kwargs["instance_url"] = instance_url
+    if token:
+        extra_kwargs["token"] = token
     provider = get_provider(
         model_id="meta-llama/llama-3-405b-instruct",
         params={
@@ -81,6 +86,7 @@ def generate_story(annotated_data: dict):
             "decoding_method": "greedy",
             "max_new_tokens": 256,
         },
+        **extra_kwargs,
     )
     prompt = renderer.render(input_data=json.dumps(annotated_data, indent=2))
     res = provider.query(prompt)
@@ -91,15 +97,16 @@ def annotate_messages(
     agent_name: str,
     messages: List[Message],
     keywords_generation_config: KeywordsGenerationConfig,
+    config: ChatRecordingConfig = None,
 ):
     annotator = DataAnnotator(
         messages=messages, keywords_generation_config=keywords_generation_config
     )
-    annotated_data = annotator.generate()
+    annotated_data = annotator.generate(config=config)
     if agent_name is not None:
         annotated_data["agent"] = agent_name

-    annotated_data["story"] = generate_story(annotated_data)
+    annotated_data["story"] = generate_story(annotated_data, config)

     return annotated_data

@@ -193,6 +200,7 @@ def _record(config: ChatRecordingConfig, bad_threads: set):
         agent_name,
         messages,
         config.keywords_generation_config,
+        config,
     )

     annotation_filename = os.path.join(
```
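record_chat.py applies the same idea one level up: the ChatRecordingConfig is now threaded from `_record` through `annotate_messages` into `generate_story`, so the provider can pick up `service_url`/`token` when recording against a remote instance. Below is a minimal sketch of that threading using stand-in types and simplified signatures, not the real ones.

```python
# Minimal sketch of the config threading added in record_chat.py, using
# stand-in types and simplified signatures. The real code passes
# ChatRecordingConfig down so that get_provider can receive
# instance_url/token when they are configured.
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordingConfig:  # stand-in for ChatRecordingConfig
    service_url: Optional[str] = None
    token: Optional[str] = None


def generate_story(annotated_data: dict, config: RecordingConfig = None) -> dict:
    extra_kwargs = {}
    if config and config.service_url:
        extra_kwargs["instance_url"] = config.service_url
    if config and config.token:
        extra_kwargs["token"] = config.token
    # the real implementation forwards extra_kwargs to get_provider(...)
    return {"story": "…", "provider_kwargs": extra_kwargs}


def annotate_messages(messages: list, config: RecordingConfig = None) -> dict:
    annotated = {"messages": messages}
    annotated.update(generate_story(annotated, config))
    return annotated


print(annotate_messages(["hi"], RecordingConfig(service_url="https://wxo.example", token="t0k")))
```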