PyPI - ibm-watsonx-orchestrate-evaluation-framework - Versions diffs - 1.0.2__py3-none-any.whl → 1.0.4__py3-none-any.whl - Mend

ibm-watsonx-orchestrate-evaluation-framework 1.0.2__py3-none-any.whl → 1.0.4__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of ibm-watsonx-orchestrate-evaluation-framework might be problematic.

Files changed (41)

{ibm_watsonx_orchestrate_evaluation_framework-1.0.2.dist-info → ibm_watsonx_orchestrate_evaluation_framework-1.0.4.dist-info}/METADATA RENAMED

@@ -1,12 +1,11 @@
 Metadata-Version: 2.4
 Name: ibm-watsonx-orchestrate-evaluation-framework
-Version: 1.0.2
+Version: 1.0.4
 Summary: The WxO evaluation framework
 Author-email: Haode Qi <Haode.Qi@ibm.com>
 License: MIT
 Requires-Python: <3.14,>=3.11
 Description-Content-Type: text/markdown
-License-File: LICENSE
 Requires-Dist: rich~=13.9.4
 Requires-Dist: pydantic~=2.10.6
 Requires-Dist: pyyaml~=6.0.2
@@ -32,7 +31,6 @@ Requires-Dist: notebook~=7.4.1; extra == "rag-eval"
 Requires-Dist: ipywidgets~=8.1.6; extra == "rag-eval"
 Requires-Dist: jupyter_contrib_nbextensions; extra == "rag-eval"
 Requires-Dist: jupyter~=1.1.1; extra == "rag-eval"
-Dynamic: license-file
 # WXO-agent evaluation framework
@@ -46,7 +44,8 @@ Dynamic: license-file
 - The agent calls the `runs/` endpoint of the wxo-lite server instance, and the actual tool code is executed on the server side. The server database is not visible to our framework.
 ## prerequisite
-Follow the [SDK setup guide](https://github.ibm.com/WatsonOrchestrate/wxo-clients/tree/main) to install the SDK. Make sure you are using version 1.2.0 of the SDK as this is the version this framework requires.
+Follow the [SDK setup guide](https://github.ibm.com/WatsonOrchestrate/wxo-clients/tree/main) to install the SDK.
+The current framework is compatible with ADK version >= 1.20, <= 1.6.0
 ## setup for evaluation framework
 Run the following command to install evaluation framework in the same env:
@@ -59,9 +58,11 @@ pip install -e .
 ```bash
 orchestrate server start
 export WATSONX_SPACE_ID=""
-export WATSONX_API_KEY=""
+export WATSONX_APIKEY=""
 ```
+NOTE: If you want to use `WO_INSTANCE` and `WO_API_KEY` instead, follow the [model proxy section](#using-model-proxy-provider).
 Import sample hr tools and agent into your default `wxo-dev` env:
 ```bash
 orchestrate tools import -f benchmarks/hr_sample/tools.py -k python
@@ -97,7 +98,6 @@ Note:
       ]
     }
   ],
-  "mine_fields": [],
   "story": "Your username is nwaters and you want to find out timeoff schedule from 20250101 to 20250303."
 }
 ```
@@ -124,7 +124,7 @@ NOTE: run `orchestrate env list` to find the name of the active tenant. for defa
 4. Run the test:
 ```bash
 export WATSONX_SPACE_ID=""
-export WATSONX_API_KEY=""
+export WATSONX_APIKEY=""
 python -m wxo_agentic_evaluation.main --config benchmarks/hr_sample/config.yaml
 ```
@@ -237,6 +237,69 @@ python -m wxo_agentic_evaluation.main --config benchmarks/hr_sample/config.yaml
 For full instructions on setting up tools, writing stories, configuring the pipeline, and generating batch test cases, see the [Batch Test case Generation Guide](./benchmarks/batch_sample/README.MD).
+## Using Model Proxy Provider
+To use the model proxy provider (which allows direct access to LLM models), follow these steps:
+1. Set up environment variables:
+   ```sh
+   export WO_INSTANCE=<your-instance-url>
+   export WO_API_KEY=<your-api-key>
+   ```
+2. Create a configuration file similar to [benchmarks/hr_sample/config_model_proxy.yaml](benchmarks/hr_sample/config_model_proxy.yaml):
+   ```yaml
+   test_paths:
+     - <your-test-path>
+   auth_config:
+     url: http://localhost:4321
+     tenant_name: wxo-dev
+   provider_config:
+     provider: "model_proxy"
+     model_id: "<model-id>"
+   output_dir: "<output-dir>"
+   ```
+3. Run the evaluation:
+   ```sh
+   python -m wxo_agentic_evaluation.main --config path/to/your/config.yaml
+   ```
+## Using Ollama
+To use model from Ollama (local LLM deployment), follow these steps:
+1. Make sure you have [Ollama](https://ollama.com) installed and running on your system.
+2. Pull your desired model using Ollama (e.g. llama3.1:8b):
+   ```sh
+   ollama pull <model-id>
+   ```
+3. Create a configuration file similar to [benchmarks/hr_sample/config_ollama.yaml](benchmarks/hr_sample/config_ollama.yaml):
+   ```yaml
+   test_paths:
+     - <your-test-path>
+   auth_config:
+     url: http://localhost:4321
+     tenant_name: wxo-dev
+   provider_config:
+     provider: "ollama"
+     model_id: "<model-id>"
+   output_dir: "results/ollama/<model-name>"
+   ```
+4. Run the evaluation:
+   ```sh
+   python -m wxo_agentic_evaluation.main --config path/to/your/config.yaml
+   ```
 ## Workflow diagram
 To help better understand the workflow, this is a diagram of how this repo works together with wxO lite python SDK and a wxO runtime.

ibm_watsonx_orchestrate_evaluation_framework-1.0.4.dist-info/RECORD ADDED

@@ -0,0 +1,56 @@
+wxo_agentic_evaluation/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+wxo_agentic_evaluation/analyze_run.py,sha256=C4HowEukNMM-H8FkRcHRqkiNYIQVCoTKbBLiqr1cFRM,4332
+wxo_agentic_evaluation/annotate.py,sha256=nxYMc6gwfQ-GuNjCPFtbX_-Es5-9XDdbXpMH89yRDdc,1228
+wxo_agentic_evaluation/arg_configs.py,sha256=UCrGcakFaAM3reFquMn03qNtKe7Pg8ScbOF0K7o8VDU,2240
+wxo_agentic_evaluation/batch_annotate.py,sha256=44K4DUI498uaLIWUn3nz82AKcU6VnCjrExoG6GpPHoM,6323
+wxo_agentic_evaluation/data_annotator.py,sha256=DJVG2CdhJRAJ3X1ARbrsn9bPjTuytCDGIBM4PEexfnk,8214
+wxo_agentic_evaluation/evaluation_package.py,sha256=jOSe-TCJdAWCk1sWpRYfi_EMkZERrVf5swm-bxfozzc,21333
+wxo_agentic_evaluation/inference_backend.py,sha256=fhEB1kaNN-A08RtJglBiv3QL_8nq8m-g7xbF4WbHAvU,25691
+wxo_agentic_evaluation/llm_matching.py,sha256=l010exoMmsvTIAVHCm-Ok0diyeQogjCmemUb9rJLe6A,1477
+wxo_agentic_evaluation/llm_rag_eval.py,sha256=vsNGz1cFE5QGdhnfrx-iJq1r6q8tSI9Ef1mzuhoHElg,1642
+wxo_agentic_evaluation/llm_user.py,sha256=0zSsyEM7pYQtLcfbnu0gEIkosHDwntOZY84Ito6__SM,1407
+wxo_agentic_evaluation/main.py,sha256=tRXVle2o1JhwJZOTpqdsOzBOpxPYxAH5ziZkbCmzfyU,11470
+wxo_agentic_evaluation/record_chat.py,sha256=ZaOxIabDcE_CzZjKJESgh8LY7pK9UT4OvqQMFVdTG7A,8102
+wxo_agentic_evaluation/resource_map.py,sha256=-dIWQdpEpPeSCbDeYfRupG9KV1Q4NlHGb5KXywjkulM,1645
+wxo_agentic_evaluation/service_instance.py,sha256=yt7XpwheaRRG8Ri4TFIS5G2p5mnCwvNgj6T7bDF5uTU,6494
+wxo_agentic_evaluation/test_prompt.py,sha256=ksteXCs9iDQPMETc4Hb7JAXHhxz2r678U6-sgZJAO28,3924
+wxo_agentic_evaluation/tool_planner.py,sha256=e-lBb4w1klT1HOL9BTwae3lkGv5VBuYC397mSJgOhus,12622
+wxo_agentic_evaluation/type.py,sha256=uVKim70XgPW-3L7Z0yRO07wAH9xa-NcjfaiIyPhYMR0,3413
+wxo_agentic_evaluation/analytics/tools/analyzer.py,sha256=IPX_lAFujjPVI9fhXTNohXTxTmpqRhfzQygCWDYHBHg,18125
+wxo_agentic_evaluation/analytics/tools/main.py,sha256=ocwPUlEjyK7PMdXBg5OM2DVDQBcaHT4UjR4ZmEhR0C4,6567
+wxo_agentic_evaluation/analytics/tools/types.py,sha256=IFLKI1CCQwPR2iWjif8AqL_TEq--VbLwdwnMqfJujBw,4461
+wxo_agentic_evaluation/analytics/tools/ux.py,sha256=EaWNvsq68X_i2H4pQ2fABtXEEmk3ZXqaMrTs42_7MwE,18347
+wxo_agentic_evaluation/external_agent/__init__.py,sha256=LY3gMNzfIEwjpQkx5_2iZFHGQiUL4ymEkKL1dc2uKq4,1491
+wxo_agentic_evaluation/external_agent/external_validate.py,sha256=xW8tqPcm8JYvveSxf-oFCajvF5J8ORaK23YXu-LuFmc,4142
+wxo_agentic_evaluation/external_agent/performance_test.py,sha256=bCXUsW0OeUzwfSSYObgfAmEU5vARkD-PblYU-mU9aPY,2507
+wxo_agentic_evaluation/external_agent/types.py,sha256=4kfWD_ZyGZmpbib33gCxEuKS4HLb7CEtferlQgQe7uk,1624
+wxo_agentic_evaluation/metrics/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+wxo_agentic_evaluation/metrics/llm_as_judge.py,sha256=bybJQfVWiVh3BoFEZjdBmU9EQO9Ukheu3YWmkI9b1ks,1218
+wxo_agentic_evaluation/metrics/metrics.py,sha256=9O2m6T2iW-PMjGrTdMbOHP2Pr4RN0NwbEp6YgFpTi3I,5572
+wxo_agentic_evaluation/prompt/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+wxo_agentic_evaluation/prompt/answer_relevancy_prompt.jinja2,sha256=vLrMWce-5HlvniCQdtifnl-YdbJfT8-oixzfwulZs98,3839
+wxo_agentic_evaluation/prompt/args_extractor_prompt.jinja2,sha256=0qBicXFcc6AA3mQNLPVRmFsnuYaCABJXgZkIH9fO0Js,952
+wxo_agentic_evaluation/prompt/batch_testcase_prompt.jinja2,sha256=QXuk2ecnEPPRCPoWZJyrtb1gAVuIPljB91YoqPBp2Dk,1896
+wxo_agentic_evaluation/prompt/faithfulness_prompt.jinja2,sha256=DW9OdjeZJbOWrngRqTAVD4w0va_HtA2FR4G1POIIamM,2524
+wxo_agentic_evaluation/prompt/keyword_matching_prompt.jinja2,sha256=7mTkSrppjgPluUAIMTWaT30K7M4J4hyR_LjSjW1Ofq0,1290
+wxo_agentic_evaluation/prompt/keywords_generation_prompt.jinja2,sha256=PiCjr1ag44Jk5xD3F24fLD_bOGYh2sF0i5miY4OrVlc,1890
+wxo_agentic_evaluation/prompt/llama_user_prompt.jinja2,sha256=nDfCD0o9cRYmsgIjzD-RZNQxotlvuqrzdsZIY-vT794,684
+wxo_agentic_evaluation/prompt/semantic_matching_prompt.jinja2,sha256=MltPfEXYyOwEC2xNLl7UsFTxNbr8CwHaEcPqtvKE2r8,2749
+wxo_agentic_evaluation/prompt/starting_sentence_generation_prompt.jinja2,sha256=m_l6f7acfnWJmGQ0mXAy85oLGLgzhVhoz7UL1FVYq8A,4908
+wxo_agentic_evaluation/prompt/story_generation_prompt.jinja2,sha256=_DxjkFoHpNTmdVSUzUrUdwn4Cng7nAGqkMnm0ScOH1w,4191
+wxo_agentic_evaluation/prompt/template_render.py,sha256=FVH5ew2TofC5LGqQzqNj90unrxooUZv_5XxJzVdz8uM,3563
+wxo_agentic_evaluation/prompt/tool_chain_agent.jinja2,sha256=9RcIjLYoOvtFsf-RgyMfMcj2Fe8fq1wGkE4nG1zamYY,297
+wxo_agentic_evaluation/prompt/tool_planner.jinja2,sha256=Ln43kwfSX50B1VBsT-MY1TCE0o8pGFh8aQJAzZfGkpI,3239
+wxo_agentic_evaluation/prompt/examples/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+wxo_agentic_evaluation/prompt/examples/data_simple.json,sha256=XXF-Pn-mosklC9Ch7coyaJxosFNnl3OkHSW3YPuiKMM,2333
+wxo_agentic_evaluation/service_provider/__init__.py,sha256=EaY4jjKp58M3W8N3b3a8PNC2S81xA7YV2_QkTIy9DfI,1600
+wxo_agentic_evaluation/service_provider/model_proxy_provider.py,sha256=X5tiE0IKCR2CqhwEGm91LOdzFZQWSXzXQgLOtzi6ng0,4002
+wxo_agentic_evaluation/service_provider/ollama_provider.py,sha256=HMHQVUGFbLSQI1dhysAn70ozJl90yRg-CbNd4vsz-Dc,1116
+wxo_agentic_evaluation/service_provider/provider.py,sha256=MsnRzLYAaQiU6y6xf6eId7kn6-CetQuNZl00EP-Nl28,417
+wxo_agentic_evaluation/service_provider/watsonx_provider.py,sha256=iKVkWs4PRTM_S0TIdPgQ9NFQWPlDvcEvuHpQlIPzO10,6216
+wxo_agentic_evaluation/utils/__init__.py,sha256=QMxk6hx1CDvCBLFh40WpPZmqFNJtDqwXP7S7cXD6NQE,145
+wxo_agentic_evaluation/utils/utils.py,sha256=JYZQZ-OBy43gAWg9S7duJi9StRApGJATs2JUsW1l30M,6057
+ibm_watsonx_orchestrate_evaluation_framework-1.0.4.dist-info/METADATA,sha256=uhmuzKUbgWgKDNayG2dAc-YYvZ_ypeVY4onrcomv0Co,17667
+ibm_watsonx_orchestrate_evaluation_framework-1.0.4.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ibm_watsonx_orchestrate_evaluation_framework-1.0.4.dist-info/top_level.txt,sha256=2okpqtpxyqHoLyb2msio4pzqSg7yPSzwI7ekks96wYE,23
+ibm_watsonx_orchestrate_evaluation_framework-1.0.4.dist-info/RECORD,,

wxo_agentic_evaluation/analytics/tools/analyzer.py CHANGED

@@ -1,9 +1,9 @@
-from type import Message, ContentType, EvaluationData
+from wxo_agentic_evaluation.type import Message, ContentType, EvaluationData
 from typing import List, Optional
 import json
 import rich
 from collections import defaultdict
-from analytics.tools.types import (
+from wxo_agentic_evaluation.analytics.tools.types import (
     ErrorPatterns,
     ToolFailure,
     HallucinatedParameter,
@@ -15,7 +15,7 @@ from analytics.tools.types import (
     AnalysisResults,
     ErrorType,
 )
-from data_annotator import ERROR_KEYWORDS
+from wxo_agentic_evaluation.data_annotator import ERROR_KEYWORDS
 from http import HTTPStatus

wxo_agentic_evaluation/analytics/tools/ux.py CHANGED

@@ -5,7 +5,7 @@ from rich.table import Table
 from rich.panel import Panel
 from rich.align import Align
 from rich.console import Group
-from type import Message, ContentType
+from wxo_agentic_evaluation.type import Message, ContentType
 from typing import List, Dict, Optional
 from analytics.tools.types import (
     ToolDefinitionRecommendation,

wxo_agentic_evaluation/analyze_run.py CHANGED

@@ -9,9 +9,9 @@ from rich.table import Table
 from typing import List
 from wxo_agentic_evaluation.type import (
     ExtendedMessage,
-    ContentType,
-    ToolCallAndRoutingMetrics,
+    ContentType
 )
+from wxo_agentic_evaluation.metrics.metrics import ToolCallAndRoutingMetrics
 from wxo_agentic_evaluation.arg_configs import AnalyzeConfig
 from jsonargparse import CLI
@@ -71,10 +71,11 @@ def analyze(config: AnalyzeConfig):
     test_case_with_failed_tools = []
     for entry in summary:
-        test_case_name = entry["test_case"]
+        test_case_name = entry["dataset_name"]
         if test_case_name.lower().strip() == "summary (average)":
             continue
-        if int(entry["Wrong Function Calls"]) > 0 or int(entry["Wrong Parameters"]) > 0:
+        if not entry["is_success"] or float(entry["tool_calls_with_incorrect_parameter"]) > 0 or float(entry["tool_call_precision"]) < 1.0\
+                or float(entry["tool_call_recall"]) < 1.0:
             test_case_with_failed_tools.append(entry)
     if len(test_case_with_failed_tools) == 0:
         header_table = Table(show_header=False, box=None)
@@ -85,7 +86,7 @@ def analyze(config: AnalyzeConfig):
         rich.print(header_panel)
     for test_case_entry in test_case_with_failed_tools:
-        test_case_name = test_case_entry["test_case"]
+        test_case_name = test_case_entry["dataset_name"]
         test_case_path = os.path.join(
             config.data_path, "messages", f"{test_case_name}.messages.analyze.json"
@@ -94,7 +95,8 @@ def analyze(config: AnalyzeConfig):
         with open(test_case_path, "r", encoding="utf-8") as f:
             temp = json.load(f)
             for entry in temp:
-                test_messages.append(ExtendedMessage(**entry))
+                msg = ExtendedMessage(**entry)
+                test_messages.append(msg)
         test_metrics_path = os.path.join(
             config.data_path, "messages", f"{test_case_name}.metrics.json"
@@ -105,11 +107,9 @@ def analyze(config: AnalyzeConfig):
         header_table.add_row(f"Test Case Name: {test_case_name}")
         header_table.add_row((f"Expected Tool Calls: {metrics.expected_tool_calls}"))
         header_table.add_row(f"Correct Tool Calls: {metrics.correct_tool_calls}")
-        irrelevant_tool_calls = test_case_entry["Wrong Function Calls"]
-        header_table.add_row(f"Irrelevant Tool Call: {irrelevant_tool_calls}")
-        tool_call_with_incorrect_parameters = test_case_entry["Wrong Parameters"]
+        header_table.add_row(f"Text Match: {metrics.text_match.value}")
         header_table.add_row(
-            f"Tool Call with incorrect parameters: {tool_call_with_incorrect_parameters}"
+            f"Journey Success: {metrics.is_success}"
         )
         header_panel = Panel(
             header_table, title="[bold green]📋 Analysis Summary[/bold green]"
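
Note on the analyze_run.py hunks above: the failure filter no longer reads the "Wrong Function Calls" and "Wrong Parameters" columns. It now flags a row when the journey failed, when any tool call had incorrect parameters, or when tool-call precision or recall is below 1.0. The sketch below restates that condition as a standalone predicate. The function names and the sample rows are hypothetical and not part of the package; the rows use real booleans for `is_success`, which is an assumption about how the summary is loaded.

```python
def is_failed_entry(entry: dict) -> bool:
    """Same condition as the 1.0.4 hunk, written as one predicate."""
    return (
        not entry["is_success"]
        or float(entry["tool_calls_with_incorrect_parameter"]) > 0
        or float(entry["tool_call_precision"]) < 1.0
        or float(entry["tool_call_recall"]) < 1.0
    )


def select_failed_entries(summary: list[dict]) -> list[dict]:
    """Skip the aggregate row, keep every row the predicate flags."""
    failed = []
    for entry in summary:
        if entry["dataset_name"].lower().strip() == "summary (average)":
            continue
        if is_failed_entry(entry):
            failed.append(entry)
    return failed


# Made-up rows, for illustration only.
sample_summary = [
    {"dataset_name": "clean_run", "is_success": True,
     "tool_calls_with_incorrect_parameter": 0,
     "tool_call_precision": 1.0, "tool_call_recall": 1.0},
    {"dataset_name": "missed_call", "is_success": True,
     "tool_calls_with_incorrect_parameter": 0,
     "tool_call_precision": 1.0, "tool_call_recall": 0.5},
    {"dataset_name": "bad_args", "is_success": False,
     "tool_calls_with_incorrect_parameter": 1,
     "tool_call_precision": 1.0, "tool_call_recall": 1.0},
    {"dataset_name": "Summary (Average)", "is_success": False,
     "tool_calls_with_incorrect_parameter": 0.33,
     "tool_call_precision": 1.0, "tool_call_recall": 0.83},
]

failed_names = [e["dataset_name"] for e in select_failed_entries(sample_summary)]
print(failed_names)
```

With these rows, a test case that succeeds overall but misses an expected tool call (recall below 1.0) is still reported, which the 1.0.2 condition would not have caught from the two count columns alone.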

wxo_agentic_evaluation/arg_configs.py CHANGED

@@ -22,12 +22,19 @@ class LLMUserConfig:
     user_response_style: List[str] = field(default_factory=list)
+@dataclass
+class ProviderConfig:
+    model_id: str = field(default="meta-llama/llama-3-405b-instruct")
+    provider: str = field(default="watsonx")
 @dataclass
 class TestConfig:
     test_paths: List[str]
     output_dir: str
     auth_config: AuthConfig
     wxo_lite_version: str
+    provider_config: ProviderConfig = field(default_factory=ProviderConfig)
     llm_user_config: LLMUserConfig = field(default_factory=LLMUserConfig)
     enable_verbose_logging: bool = True
     enable_manual_user_input: bool = False
@@ -65,7 +72,7 @@ class ChatRecordingConfig:
         default_factory=KeywordsGenerationConfig
     )
     service_url: str = "http://localhost:4321"
-    tenant_name: str = "wxo-dev"
+    tenant_name: str = "local"
     token: str = None
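
Note on the arg_configs.py hunk above: `TestConfig` gains a `provider_config` field built with `default_factory=ProviderConfig`, so configs that omit it fall back to the `watsonx` provider and the default model id. The sketch below reproduces `ProviderConfig` as shown in the diff and pairs it with a trimmed, hypothetical `SketchConfig` (standing in for `TestConfig`, which has more required fields) to show how the default and an override behave. The `llama3.1:8b` model id is taken from the README example in the METADATA hunk.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderConfig:
    # Copied from the 1.0.4 hunk.
    model_id: str = field(default="meta-llama/llama-3-405b-instruct")
    provider: str = field(default="watsonx")


@dataclass
class SketchConfig:
    # Hypothetical, trimmed stand-in for TestConfig.
    test_paths: list[str]
    output_dir: str
    provider_config: ProviderConfig = field(default_factory=ProviderConfig)


# Omitting provider_config yields a fresh ProviderConfig with the defaults.
default_cfg = SketchConfig(test_paths=["tests/"], output_dir="results")

# Overriding it selects a different provider and model.
ollama_cfg = SketchConfig(
    test_paths=["tests/"],
    output_dir="results/ollama",
    provider_config=ProviderConfig(provider="ollama", model_id="llama3.1:8b"),
)

print(default_cfg.provider_config.provider, ollama_cfg.provider_config.provider)
```

Using `default_factory` rather than a shared default instance means each config object gets its own `ProviderConfig`, so mutating one does not leak into another.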

wxo_agentic_evaluation/batch_annotate.py CHANGED

@@ -5,7 +5,7 @@ import os
 from pathlib import Path
 from jsonargparse import CLI
-from wxo_agentic_evaluation.watsonx_provider import WatsonXProvider
+from wxo_agentic_evaluation.service_provider import get_provider
 from wxo_agentic_evaluation.prompt.template_render import BatchTestCaseGeneratorTemplateRenderer
 from wxo_agentic_evaluation.arg_configs import BatchAnnotateConfig
 from wxo_agentic_evaluation import __file__
@@ -71,7 +71,6 @@ def extract_inputs_from_snapshot(snapshot_path: Path) -> dict:
 def load_example(example_path: Path):
     with example_path.open("r", encoding="utf-8") as f:
         data = json.load(f)
-    data.pop("mine_fields", None)
     return data
@@ -98,13 +97,9 @@ def build_prompt_for_story(agent, tools, tool_inputs, example_case: dict, story:
 def generate_multiple_in_one(prompt, output_dir, starting_index, model_id="meta-llama/llama-3-405b-instruct", ):
     output_dir.mkdir(parents=True, exist_ok=True)
-    provider = WatsonXProvider(
+    provider = get_provider(
         model_id=model_id,
-        llm_decode_parameter={
-            "min_new_tokens": 50,
-            "decoding_method": "greedy",
-            "max_new_tokens": 3000
-        }
+        params={"min_new_tokens": 50, "decoding_method": "greedy", "max_new_tokens": 3000},
     )
     response = provider.query(prompt)
@@ -119,7 +114,6 @@ def generate_multiple_in_one(prompt, output_dir, starting_index, model_id="meta-
         assert isinstance(test_cases, list), "Expected list of test cases"
         for i, case in enumerate(test_cases, start=starting_index):
-            case["mine_fields"] = []  # ✅ Add the field here
             out_file = output_dir / f"synthetic_test_case_{i}.json"
             with out_file.open("w", encoding="utf-8") as f:
                 json.dump(case, f, indent=2)

wxo_agentic_evaluation/data_annotator.py CHANGED

@@ -1,5 +1,6 @@
 from wxo_agentic_evaluation.type import Message, EvaluationData
-from wxo_agentic_evaluation.watsonx_provider import WatsonXProvider
+from wxo_agentic_evaluation.service_provider.watsonx_provider import Provider
+from wxo_agentic_evaluation.service_provider import get_provider
 from wxo_agentic_evaluation.prompt.template_render import (
     LlamaKeywordsGenerationTemplateRenderer,
 )
@@ -94,16 +95,16 @@ ERROR_KEYWORDS = [
 class KeywordsGenerationLLM:
     def __init__(
         self,
-        wai_client: WatsonXProvider,
+        provider: Provider,
         template: LlamaKeywordsGenerationTemplateRenderer,
     ):
-        self.wai_client = wai_client
+        self.provider = provider
         self.prompt_template = template
     def genereate_keywords(self, response) -> Message | None:
         prompt = self.prompt_template.render(response=response)
-        res = self.wai_client.query(prompt)
-        keywords = ast.literal_eval(res["generated_text"].strip())
+        res: str = self.provider.query(prompt)
+        keywords = ast.literal_eval(res.strip())
         return keywords
@@ -120,7 +121,6 @@ class DataAnnotator:
             agent="",
             story="",
             starting_sentence=messages[0].content if messages else "",
-            mine_fields=[],
             goals={},
             goal_details=[],
         )
@@ -145,29 +145,48 @@ class DataAnnotator:
     def _process_tool_call_order(self, wrong_tool_response_id: list[str]) -> list[str]:
         """Process and order tool calls, skipping failed ones"""
+        # gather all call ids that actually got a response
+        valid_call_ids = {
+            json.loads(m.content)["tool_call_id"]
+            for m in self.messages
+            if m.type == "tool_response"
+        }
         order = []
-        for message in self.messages:
-            if message.type == "tool_call":
-                content = json.loads(message.content)
-                # skip all the tool calls that fail
-                if (
-                    content.get("tool_call_id", "") in wrong_tool_response_id
-                    or content.get("id", "") in wrong_tool_response_id
-                ):
-                    continue
-                if "tool_call_id" in content:
-                    del content["tool_call_id"]
-                if "id" in content:
-                    del content["id"]
-                content = json.dumps(content, sort_keys=True)
-                # for a given tool call signature - function name + args only keep the most recent one
-                if content in order:
-                    idx = order.index(content)
-                    order = order[:idx] + order[idx + 1 :] + [content]
-                else:
-                    order.append(content)
+        for idx, message in enumerate(self.messages):
+            if message.type != "tool_call":
+                continue
+            content = json.loads(message.content)
+            call_id = content.get("tool_call_id") or content.get("id")
+            # skip any calls that errored
+            if call_id in wrong_tool_response_id:
+                continue
+            # skip calls that never produced a tool_response
+            if call_id not in valid_call_ids:
+                continue
+            # skip the "reflection" copy that the LLM emits right after a response
+            prev = self.messages[idx - 1] if idx > 0 else None
+            if (
+                prev is not None
+                and prev.type == "tool_response"
+                and json.loads(prev.content).get("tool_call_id") == call_id
+            ):
+                continue
+            # normalize ids so json dumps only reflects name-args
+            content.pop("tool_call_id", None)
+            content.pop("id", None)
+            signature = json.dumps(content, sort_keys=True)
+            # if we’ve seen that exact (name-args) before, drop the old one
+            if signature in order:
+                order.remove(signature)
+            order.append(signature)
         return order
     def _process_tool_calls(self) -> tuple[Dict, List, str]:
@@ -209,16 +228,12 @@ class DataAnnotator:
         # we assume single summary step at the end
         for message in self.messages[::-1]:
             if message.role == "assistant":
-                wai_client = WatsonXProvider(
+                provider = get_provider(
                     model_id=self.keywords_generation_config.model_id,
-                    llm_decode_parameter={
-                        "min_new_tokens": 0,
-                        "decoding_method": "greedy",
-                        "max_new_tokens": 256,
-                    },
+                    params={"min_new_tokens": 0, "decoding_method": "greedy", "max_new_tokens": 256},
                 )
                 kw_generator = KeywordsGenerationLLM(
-                    wai_client=wai_client,
+                    provider=provider,
                     template=LlamaKeywordsGenerationTemplateRenderer(
                         self.keywords_generation_config.prompt_config
                     ),
@@ -247,7 +262,6 @@ class DataAnnotator:
             "agent": self.initial_data.agent,
             "goals": goals,
             "goal_details": goal_details,
-            "mine_fields": [],
             "story": self.initial_data.story,
             "starting_sentence": self.initial_data.starting_sentence,
         }
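
Note on the `_process_tool_call_order` hunk above: the rewritten loop applies four rules in order. It skips calls whose id is in the error list, skips calls that never received a `tool_response`, skips a call that immediately follows the response to the same id (the "reflection" copy), and keeps only the most recent occurrence of each name-plus-arguments signature. The sketch below is a standalone rendering of that logic so the rules can be exercised on a small transcript. The `Msg` class, the function name, and the sample messages are hypothetical; only the loop body follows the diff.

```python
import json
from dataclasses import dataclass


@dataclass
class Msg:
    # Hypothetical stand-in for the framework's Message type.
    type: str
    content: str


def order_tool_calls(messages: list[Msg], wrong_ids: list[str]) -> list[str]:
    # Ids of all calls that actually received a response.
    valid_call_ids = {
        json.loads(m.content)["tool_call_id"]
        for m in messages
        if m.type == "tool_response"
    }
    order: list[str] = []
    for idx, message in enumerate(messages):
        if message.type != "tool_call":
            continue
        content = json.loads(message.content)
        call_id = content.get("tool_call_id") or content.get("id")
        if call_id in wrong_ids:            # call errored
            continue
        if call_id not in valid_call_ids:   # call never got a response
            continue
        prev = messages[idx - 1] if idx > 0 else None
        if (
            prev is not None
            and prev.type == "tool_response"
            and json.loads(prev.content).get("tool_call_id") == call_id
        ):                                  # reflection copy after its own response
            continue
        # Drop ids so the signature reflects name and arguments only.
        content.pop("tool_call_id", None)
        content.pop("id", None)
        signature = json.dumps(content, sort_keys=True)
        if signature in order:              # keep only the most recent occurrence
            order.remove(signature)
        order.append(signature)
    return order


def call(cid: str, name: str, **args) -> Msg:
    return Msg("tool_call", json.dumps({"id": cid, "name": name, "args": args}))


def resp(cid: str) -> Msg:
    return Msg("tool_response", json.dumps({"tool_call_id": cid, "content": "ok"}))


transcript = [
    call("c1", "get_timeoff", user="nwaters"), resp("c1"),
    call("c1", "get_timeoff", user="nwaters"),   # reflection copy
    call("c2", "lookup"),                        # no response
    call("c4", "get_assignment"), resp("c4"),    # reported as an error below
    call("c5", "get_profile"), resp("c5"),
    call("c3", "get_timeoff", user="nwaters"), resp("c3"),  # repeat signature
]

ordered_names = [json.loads(s)["name"] for s in order_tool_calls(transcript, ["c4"])]
print(ordered_names)
```

In this transcript the repeated `get_timeoff` call moves to the end of the order because the earlier occurrence of the same signature is dropped when the later one is seen.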
