ibm-watsonx-gov 1.3.3__cp313-cp313-macosx_11_0_arm64.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- ibm_watsonx_gov/__init__.py +8 -0
- ibm_watsonx_gov/agent_catalog/__init__.py +8 -0
- ibm_watsonx_gov/agent_catalog/clients/__init__.py +14 -0
- ibm_watsonx_gov/agent_catalog/clients/ai_agent_client.py +333 -0
- ibm_watsonx_gov/agent_catalog/core/__init__.py +8 -0
- ibm_watsonx_gov/agent_catalog/core/agent_loader.py +202 -0
- ibm_watsonx_gov/agent_catalog/core/agents.py +134 -0
- ibm_watsonx_gov/agent_catalog/entities/__init__.py +8 -0
- ibm_watsonx_gov/agent_catalog/entities/ai_agent.py +599 -0
- ibm_watsonx_gov/agent_catalog/utils/__init__.py +8 -0
- ibm_watsonx_gov/agent_catalog/utils/constants.py +36 -0
- ibm_watsonx_gov/agent_catalog/utils/notebook_utils.py +70 -0
- ibm_watsonx_gov/ai_experiments/__init__.py +8 -0
- ibm_watsonx_gov/ai_experiments/ai_experiments_client.py +980 -0
- ibm_watsonx_gov/ai_experiments/utils/__init__.py +8 -0
- ibm_watsonx_gov/ai_experiments/utils/ai_experiment_utils.py +139 -0
- ibm_watsonx_gov/clients/__init__.py +0 -0
- ibm_watsonx_gov/clients/api_client.py +99 -0
- ibm_watsonx_gov/clients/segment_client.py +46 -0
- ibm_watsonx_gov/clients/usage_client.cpython-313-darwin.so +0 -0
- ibm_watsonx_gov/clients/wx_ai_client.py +87 -0
- ibm_watsonx_gov/config/__init__.py +14 -0
- ibm_watsonx_gov/config/agentic_ai_configuration.py +225 -0
- ibm_watsonx_gov/config/gen_ai_configuration.py +129 -0
- ibm_watsonx_gov/config/model_risk_configuration.py +173 -0
- ibm_watsonx_gov/config/predictive_ai_configuration.py +20 -0
- ibm_watsonx_gov/entities/__init__.py +8 -0
- ibm_watsonx_gov/entities/agentic_app.py +209 -0
- ibm_watsonx_gov/entities/agentic_evaluation_result.py +185 -0
- ibm_watsonx_gov/entities/ai_evaluation.py +290 -0
- ibm_watsonx_gov/entities/ai_experiment.py +419 -0
- ibm_watsonx_gov/entities/base_classes.py +134 -0
- ibm_watsonx_gov/entities/container.py +54 -0
- ibm_watsonx_gov/entities/credentials.py +633 -0
- ibm_watsonx_gov/entities/criteria.py +508 -0
- ibm_watsonx_gov/entities/enums.py +274 -0
- ibm_watsonx_gov/entities/evaluation_result.py +444 -0
- ibm_watsonx_gov/entities/foundation_model.py +490 -0
- ibm_watsonx_gov/entities/llm_judge.py +44 -0
- ibm_watsonx_gov/entities/locale.py +17 -0
- ibm_watsonx_gov/entities/mapping.py +49 -0
- ibm_watsonx_gov/entities/metric.py +211 -0
- ibm_watsonx_gov/entities/metric_threshold.py +36 -0
- ibm_watsonx_gov/entities/model_provider.py +329 -0
- ibm_watsonx_gov/entities/model_risk_result.py +43 -0
- ibm_watsonx_gov/entities/monitor.py +71 -0
- ibm_watsonx_gov/entities/prompt_setup.py +40 -0
- ibm_watsonx_gov/entities/state.py +22 -0
- ibm_watsonx_gov/entities/utils.py +99 -0
- ibm_watsonx_gov/evaluators/__init__.py +26 -0
- ibm_watsonx_gov/evaluators/agentic_evaluator.py +2725 -0
- ibm_watsonx_gov/evaluators/agentic_traces_evaluator.py +115 -0
- ibm_watsonx_gov/evaluators/base_evaluator.py +22 -0
- ibm_watsonx_gov/evaluators/impl/__init__.py +0 -0
- ibm_watsonx_gov/evaluators/impl/evaluate_metrics_impl.cpython-313-darwin.so +0 -0
- ibm_watsonx_gov/evaluators/impl/evaluate_model_risk_impl.cpython-313-darwin.so +0 -0
- ibm_watsonx_gov/evaluators/metrics_evaluator.py +187 -0
- ibm_watsonx_gov/evaluators/model_risk_evaluator.py +89 -0
- ibm_watsonx_gov/evaluators/traces_evaluator.py +93 -0
- ibm_watsonx_gov/metric_groups/answer_quality/answer_quality_decorator.py +66 -0
- ibm_watsonx_gov/metric_groups/content_safety/content_safety_decorator.py +76 -0
- ibm_watsonx_gov/metric_groups/readability/readability_decorator.py +59 -0
- ibm_watsonx_gov/metric_groups/retrieval_quality/retrieval_quality_decorator.py +63 -0
- ibm_watsonx_gov/metric_groups/usage/usage_decorator.py +58 -0
- ibm_watsonx_gov/metrics/__init__.py +74 -0
- ibm_watsonx_gov/metrics/answer_relevance/__init__.py +8 -0
- ibm_watsonx_gov/metrics/answer_relevance/answer_relevance_decorator.py +63 -0
- ibm_watsonx_gov/metrics/answer_relevance/answer_relevance_metric.py +260 -0
- ibm_watsonx_gov/metrics/answer_similarity/__init__.py +0 -0
- ibm_watsonx_gov/metrics/answer_similarity/answer_similarity_decorator.py +66 -0
- ibm_watsonx_gov/metrics/answer_similarity/answer_similarity_metric.py +219 -0
- ibm_watsonx_gov/metrics/average_precision/__init__.py +0 -0
- ibm_watsonx_gov/metrics/average_precision/average_precision_decorator.py +62 -0
- ibm_watsonx_gov/metrics/average_precision/average_precision_metric.py +174 -0
- ibm_watsonx_gov/metrics/base_metric_decorator.py +193 -0
- ibm_watsonx_gov/metrics/context_relevance/__init__.py +8 -0
- ibm_watsonx_gov/metrics/context_relevance/context_relevance_decorator.py +60 -0
- ibm_watsonx_gov/metrics/context_relevance/context_relevance_metric.py +414 -0
- ibm_watsonx_gov/metrics/cost/__init__.py +8 -0
- ibm_watsonx_gov/metrics/cost/cost_decorator.py +58 -0
- ibm_watsonx_gov/metrics/cost/cost_metric.py +155 -0
- ibm_watsonx_gov/metrics/duration/__init__.py +8 -0
- ibm_watsonx_gov/metrics/duration/duration_decorator.py +59 -0
- ibm_watsonx_gov/metrics/duration/duration_metric.py +111 -0
- ibm_watsonx_gov/metrics/evasiveness/__init__.py +8 -0
- ibm_watsonx_gov/metrics/evasiveness/evasiveness_decorator.py +61 -0
- ibm_watsonx_gov/metrics/evasiveness/evasiveness_metric.py +103 -0
- ibm_watsonx_gov/metrics/faithfulness/__init__.py +8 -0
- ibm_watsonx_gov/metrics/faithfulness/faithfulness_decorator.py +65 -0
- ibm_watsonx_gov/metrics/faithfulness/faithfulness_metric.py +254 -0
- ibm_watsonx_gov/metrics/hap/__init__.py +16 -0
- ibm_watsonx_gov/metrics/hap/hap_decorator.py +58 -0
- ibm_watsonx_gov/metrics/hap/hap_metric.py +98 -0
- ibm_watsonx_gov/metrics/hap/input_hap_metric.py +104 -0
- ibm_watsonx_gov/metrics/hap/output_hap_metric.py +110 -0
- ibm_watsonx_gov/metrics/harm/__init__.py +8 -0
- ibm_watsonx_gov/metrics/harm/harm_decorator.py +60 -0
- ibm_watsonx_gov/metrics/harm/harm_metric.py +103 -0
- ibm_watsonx_gov/metrics/harm_engagement/__init__.py +8 -0
- ibm_watsonx_gov/metrics/harm_engagement/harm_engagement_decorator.py +61 -0
- ibm_watsonx_gov/metrics/harm_engagement/harm_engagement_metric.py +103 -0
- ibm_watsonx_gov/metrics/hit_rate/__init__.py +0 -0
- ibm_watsonx_gov/metrics/hit_rate/hit_rate_decorator.py +59 -0
- ibm_watsonx_gov/metrics/hit_rate/hit_rate_metric.py +167 -0
- ibm_watsonx_gov/metrics/input_token_count/__init__.py +8 -0
- ibm_watsonx_gov/metrics/input_token_count/input_token_count_decorator.py +58 -0
- ibm_watsonx_gov/metrics/input_token_count/input_token_count_metric.py +112 -0
- ibm_watsonx_gov/metrics/jailbreak/__init__.py +8 -0
- ibm_watsonx_gov/metrics/jailbreak/jailbreak_decorator.py +60 -0
- ibm_watsonx_gov/metrics/jailbreak/jailbreak_metric.py +103 -0
- ibm_watsonx_gov/metrics/keyword_detection/keyword_detection_decorator.py +58 -0
- ibm_watsonx_gov/metrics/keyword_detection/keyword_detection_metric.py +111 -0
- ibm_watsonx_gov/metrics/llm_validation/__init__.py +8 -0
- ibm_watsonx_gov/metrics/llm_validation/evaluation_criteria.py +84 -0
- ibm_watsonx_gov/metrics/llm_validation/llm_validation_constants.py +24 -0
- ibm_watsonx_gov/metrics/llm_validation/llm_validation_decorator.py +54 -0
- ibm_watsonx_gov/metrics/llm_validation/llm_validation_impl.py +525 -0
- ibm_watsonx_gov/metrics/llm_validation/llm_validation_metric.py +258 -0
- ibm_watsonx_gov/metrics/llm_validation/llm_validation_prompts.py +106 -0
- ibm_watsonx_gov/metrics/llmaj/__init__.py +0 -0
- ibm_watsonx_gov/metrics/llmaj/llmaj_metric.py +298 -0
- ibm_watsonx_gov/metrics/ndcg/__init__.py +0 -0
- ibm_watsonx_gov/metrics/ndcg/ndcg_decorator.py +61 -0
- ibm_watsonx_gov/metrics/ndcg/ndcg_metric.py +166 -0
- ibm_watsonx_gov/metrics/output_token_count/__init__.py +8 -0
- ibm_watsonx_gov/metrics/output_token_count/output_token_count_decorator.py +58 -0
- ibm_watsonx_gov/metrics/output_token_count/output_token_count_metric.py +112 -0
- ibm_watsonx_gov/metrics/pii/__init__.py +16 -0
- ibm_watsonx_gov/metrics/pii/input_pii_metric.py +102 -0
- ibm_watsonx_gov/metrics/pii/output_pii_metric.py +107 -0
- ibm_watsonx_gov/metrics/pii/pii_decorator.py +59 -0
- ibm_watsonx_gov/metrics/pii/pii_metric.py +96 -0
- ibm_watsonx_gov/metrics/profanity/__init__.py +8 -0
- ibm_watsonx_gov/metrics/profanity/profanity_decorator.py +60 -0
- ibm_watsonx_gov/metrics/profanity/profanity_metric.py +103 -0
- ibm_watsonx_gov/metrics/prompt_safety_risk/__init__.py +8 -0
- ibm_watsonx_gov/metrics/prompt_safety_risk/prompt_safety_risk_decorator.py +57 -0
- ibm_watsonx_gov/metrics/prompt_safety_risk/prompt_safety_risk_metric.py +128 -0
- ibm_watsonx_gov/metrics/reciprocal_rank/__init__.py +0 -0
- ibm_watsonx_gov/metrics/reciprocal_rank/reciprocal_rank_decorator.py +62 -0
- ibm_watsonx_gov/metrics/reciprocal_rank/reciprocal_rank_metric.py +162 -0
- ibm_watsonx_gov/metrics/regex_detection/regex_detection_decorator.py +58 -0
- ibm_watsonx_gov/metrics/regex_detection/regex_detection_metric.py +106 -0
- ibm_watsonx_gov/metrics/retrieval_precision/__init__.py +0 -0
- ibm_watsonx_gov/metrics/retrieval_precision/retrieval_precision_decorator.py +62 -0
- ibm_watsonx_gov/metrics/retrieval_precision/retrieval_precision_metric.py +170 -0
- ibm_watsonx_gov/metrics/sexual_content/__init__.py +8 -0
- ibm_watsonx_gov/metrics/sexual_content/sexual_content_decorator.py +61 -0
- ibm_watsonx_gov/metrics/sexual_content/sexual_content_metric.py +103 -0
- ibm_watsonx_gov/metrics/social_bias/__init__.py +8 -0
- ibm_watsonx_gov/metrics/social_bias/social_bias_decorator.py +62 -0
- ibm_watsonx_gov/metrics/social_bias/social_bias_metric.py +103 -0
- ibm_watsonx_gov/metrics/status/__init__.py +0 -0
- ibm_watsonx_gov/metrics/status/status_metric.py +113 -0
- ibm_watsonx_gov/metrics/text_grade_level/__init__.py +8 -0
- ibm_watsonx_gov/metrics/text_grade_level/text_grade_level_decorator.py +59 -0
- ibm_watsonx_gov/metrics/text_grade_level/text_grade_level_metric.py +127 -0
- ibm_watsonx_gov/metrics/text_reading_ease/__init__.py +8 -0
- ibm_watsonx_gov/metrics/text_reading_ease/text_reading_ease_decorator.py +59 -0
- ibm_watsonx_gov/metrics/text_reading_ease/text_reading_ease_metric.py +123 -0
- ibm_watsonx_gov/metrics/tool_call_accuracy/__init__.py +0 -0
- ibm_watsonx_gov/metrics/tool_call_accuracy/tool_call_accuracy_decorator.py +67 -0
- ibm_watsonx_gov/metrics/tool_call_accuracy/tool_call_accuracy_metric.py +162 -0
- ibm_watsonx_gov/metrics/tool_call_parameter_accuracy/__init__.py +0 -0
- ibm_watsonx_gov/metrics/tool_call_parameter_accuracy/tool_call_parameter_accuracy_decorator.py +68 -0
- ibm_watsonx_gov/metrics/tool_call_parameter_accuracy/tool_call_parameter_accuracy_metric.py +151 -0
- ibm_watsonx_gov/metrics/tool_call_relevance/__init__.py +0 -0
- ibm_watsonx_gov/metrics/tool_call_relevance/tool_call_relevance_decorator.py +71 -0
- ibm_watsonx_gov/metrics/tool_call_relevance/tool_call_relevance_metric.py +166 -0
- ibm_watsonx_gov/metrics/tool_call_syntactic_accuracy/__init__.py +0 -0
- ibm_watsonx_gov/metrics/tool_call_syntactic_accuracy/tool_call_syntactic_accuracy_decorator.py +66 -0
- ibm_watsonx_gov/metrics/tool_call_syntactic_accuracy/tool_call_syntactic_accuracy_metric.py +121 -0
- ibm_watsonx_gov/metrics/topic_relevance/__init__.py +8 -0
- ibm_watsonx_gov/metrics/topic_relevance/topic_relevance_decorator.py +57 -0
- ibm_watsonx_gov/metrics/topic_relevance/topic_relevance_metric.py +106 -0
- ibm_watsonx_gov/metrics/unethical_behavior/__init__.py +8 -0
- ibm_watsonx_gov/metrics/unethical_behavior/unethical_behavior_decorator.py +61 -0
- ibm_watsonx_gov/metrics/unethical_behavior/unethical_behavior_metric.py +103 -0
- ibm_watsonx_gov/metrics/unsuccessful_requests/__init__.py +0 -0
- ibm_watsonx_gov/metrics/unsuccessful_requests/unsuccessful_requests_decorator.py +66 -0
- ibm_watsonx_gov/metrics/unsuccessful_requests/unsuccessful_requests_metric.py +128 -0
- ibm_watsonx_gov/metrics/user_id/__init__.py +0 -0
- ibm_watsonx_gov/metrics/user_id/user_id_metric.py +111 -0
- ibm_watsonx_gov/metrics/utils.py +440 -0
- ibm_watsonx_gov/metrics/violence/__init__.py +8 -0
- ibm_watsonx_gov/metrics/violence/violence_decorator.py +60 -0
- ibm_watsonx_gov/metrics/violence/violence_metric.py +103 -0
- ibm_watsonx_gov/prompt_evaluator/__init__.py +9 -0
- ibm_watsonx_gov/prompt_evaluator/impl/__init__.py +8 -0
- ibm_watsonx_gov/prompt_evaluator/impl/prompt_evaluator_impl.py +554 -0
- ibm_watsonx_gov/prompt_evaluator/impl/pta_lifecycle_evaluator.py +2332 -0
- ibm_watsonx_gov/prompt_evaluator/prompt_evaluator.py +262 -0
- ibm_watsonx_gov/providers/__init__.py +8 -0
- ibm_watsonx_gov/providers/detectors_provider.cpython-313-darwin.so +0 -0
- ibm_watsonx_gov/providers/detectors_provider.py +415 -0
- ibm_watsonx_gov/providers/eval_assist_provider.cpython-313-darwin.so +0 -0
- ibm_watsonx_gov/providers/eval_assist_provider.py +266 -0
- ibm_watsonx_gov/providers/inference_engines/__init__.py +0 -0
- ibm_watsonx_gov/providers/inference_engines/custom_inference_engine.py +165 -0
- ibm_watsonx_gov/providers/inference_engines/portkey_inference_engine.py +57 -0
- ibm_watsonx_gov/providers/llmevalkit/__init__.py +0 -0
- ibm_watsonx_gov/providers/llmevalkit/ciso_agent/main.py +516 -0
- ibm_watsonx_gov/providers/llmevalkit/ciso_agent/preprocess_log.py +111 -0
- ibm_watsonx_gov/providers/llmevalkit/ciso_agent/utils.py +186 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/README.md +411 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/__init__.py +27 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/README.md +306 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/__init__.py +89 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/comparators/__init__.py +30 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/comparators/base.py +411 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/comparators/code_agent.py +1254 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/comparators/exact_match.py +134 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/comparators/fuzzy_string.py +104 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/comparators/hybrid.py +516 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/comparators/llm_judge.py +1882 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/pipeline.py +387 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/types.py +178 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/comparison/utils.py +298 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/consts.py +33 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/__init__.py +31 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/base.py +26 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/function_call/__init__.py +4 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/function_call/general.py +46 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/function_call/general_metrics.json +783 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/function_call/general_metrics_runtime.json +580 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/function_selection/__init__.py +6 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/function_selection/function_selection.py +28 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/function_selection/function_selection_metrics.json +599 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/function_selection/function_selection_metrics_runtime.json +477 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/loader.py +259 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/parameter/__init__.py +7 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/parameter/parameter.py +52 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/parameter/parameter_metrics.json +613 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/parameter/parameter_metrics_runtime.json +489 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/trajectory/__init__.py +7 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/trajectory/trajectory.py +43 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/trajectory/trajectory_metrics.json +161 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/pipeline/__init__.py +0 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/pipeline/adapters.py +102 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/pipeline/pipeline.py +355 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/pipeline/semantic_checker.py +816 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/pipeline/static_checker.py +297 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/pipeline/transformation_prompts.py +509 -0
- ibm_watsonx_gov/providers/llmevalkit/function_calling/pipeline/types.py +596 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/README.md +375 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/__init__.py +137 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/base.py +426 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/output_parser.py +364 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/__init__.py +0 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/consts.py +7 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/ibm_watsonx_ai/__init__.py +0 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/ibm_watsonx_ai/ibm_watsonx_ai.py +656 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/litellm/__init__.py +0 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/litellm/litellm.py +509 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/litellm/rits.py +224 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/litellm/watsonx.py +60 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/mock_llm_client.py +75 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/openai/__init__.py +0 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/openai/openai.py +639 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/wxo_ai_gateway/__init__.py +0 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/wxo_ai_gateway/wxo_ai_gateway.py +134 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/providers/wxo_ai_gateway/wxo_ai_gateway_inference.py +214 -0
- ibm_watsonx_gov/providers/llmevalkit/llm/types.py +136 -0
- ibm_watsonx_gov/providers/llmevalkit/metrics/__init__.py +4 -0
- ibm_watsonx_gov/providers/llmevalkit/metrics/field.py +255 -0
- ibm_watsonx_gov/providers/llmevalkit/metrics/metric.py +332 -0
- ibm_watsonx_gov/providers/llmevalkit/metrics/metrics_runner.py +188 -0
- ibm_watsonx_gov/providers/llmevalkit/metrics/prompt.py +403 -0
- ibm_watsonx_gov/providers/llmevalkit/metrics/utils.py +46 -0
- ibm_watsonx_gov/providers/llmevalkit/prompt/__init__.py +0 -0
- ibm_watsonx_gov/providers/llmevalkit/prompt/runner.py +144 -0
- ibm_watsonx_gov/providers/tool_call_metric_provider.py +455 -0
- ibm_watsonx_gov/providers/unitxt_provider.cpython-313-darwin.so +0 -0
- ibm_watsonx_gov/tools/__init__.py +10 -0
- ibm_watsonx_gov/tools/clients/__init__.py +11 -0
- ibm_watsonx_gov/tools/clients/ai_tool_client.py +405 -0
- ibm_watsonx_gov/tools/clients/detector_client.py +82 -0
- ibm_watsonx_gov/tools/core/__init__.py +8 -0
- ibm_watsonx_gov/tools/core/tool_loader.py +237 -0
- ibm_watsonx_gov/tools/entities/__init__.py +8 -0
- ibm_watsonx_gov/tools/entities/ai_tools.py +435 -0
- ibm_watsonx_gov/tools/onboarding/create/answer_relevance_detector.json +57 -0
- ibm_watsonx_gov/tools/onboarding/create/chromadb_retrieval_tool.json +63 -0
- ibm_watsonx_gov/tools/onboarding/create/context_relevance_detector.json +57 -0
- ibm_watsonx_gov/tools/onboarding/create/duduckgo_search_tool.json +53 -0
- ibm_watsonx_gov/tools/onboarding/create/google_search_tool.json +62 -0
- ibm_watsonx_gov/tools/onboarding/create/hap_detector.json +70 -0
- ibm_watsonx_gov/tools/onboarding/create/jailbreak_detector.json +70 -0
- ibm_watsonx_gov/tools/onboarding/create/pii_detector.json +36 -0
- ibm_watsonx_gov/tools/onboarding/create/prompt_safety_risk_detector.json +69 -0
- ibm_watsonx_gov/tools/onboarding/create/topic_relevance_detector.json +57 -0
- ibm_watsonx_gov/tools/onboarding/create/weather_tool.json +39 -0
- ibm_watsonx_gov/tools/onboarding/create/webcrawler_tool.json +34 -0
- ibm_watsonx_gov/tools/onboarding/create/wikipedia_search_tool.json +53 -0
- ibm_watsonx_gov/tools/onboarding/delete/delete_tools.json +4 -0
- ibm_watsonx_gov/tools/onboarding/update/google_search_tool.json +38 -0
- ibm_watsonx_gov/tools/ootb/__init__.py +8 -0
- ibm_watsonx_gov/tools/ootb/detectors/__init__.py +8 -0
- ibm_watsonx_gov/tools/ootb/detectors/hap_detector_tool.py +109 -0
- ibm_watsonx_gov/tools/ootb/detectors/jailbreak_detector_tool.py +104 -0
- ibm_watsonx_gov/tools/ootb/detectors/pii_detector_tool.py +83 -0
- ibm_watsonx_gov/tools/ootb/detectors/prompt_safety_risk_detector_tool.py +111 -0
- ibm_watsonx_gov/tools/ootb/detectors/topic_relevance_detector_tool.py +101 -0
- ibm_watsonx_gov/tools/ootb/rag/__init__.py +8 -0
- ibm_watsonx_gov/tools/ootb/rag/answer_relevance_detector_tool.py +119 -0
- ibm_watsonx_gov/tools/ootb/rag/context_relevance_detector_tool.py +118 -0
- ibm_watsonx_gov/tools/ootb/search/__init__.py +8 -0
- ibm_watsonx_gov/tools/ootb/search/duckduckgo_search_tool.py +62 -0
- ibm_watsonx_gov/tools/ootb/search/google_search_tool.py +105 -0
- ibm_watsonx_gov/tools/ootb/search/weather_tool.py +95 -0
- ibm_watsonx_gov/tools/ootb/search/web_crawler_tool.py +69 -0
- ibm_watsonx_gov/tools/ootb/search/wikipedia_search_tool.py +63 -0
- ibm_watsonx_gov/tools/ootb/vectordb/__init__.py +8 -0
- ibm_watsonx_gov/tools/ootb/vectordb/chromadb_retriever_tool.py +111 -0
- ibm_watsonx_gov/tools/rest_api/__init__.py +10 -0
- ibm_watsonx_gov/tools/rest_api/restapi_tool.py +72 -0
- ibm_watsonx_gov/tools/schemas/__init__.py +10 -0
- ibm_watsonx_gov/tools/schemas/search_tool_schema.py +46 -0
- ibm_watsonx_gov/tools/schemas/vectordb_retrieval_schema.py +55 -0
- ibm_watsonx_gov/tools/utils/__init__.py +14 -0
- ibm_watsonx_gov/tools/utils/constants.py +69 -0
- ibm_watsonx_gov/tools/utils/display_utils.py +38 -0
- ibm_watsonx_gov/tools/utils/environment.py +108 -0
- ibm_watsonx_gov/tools/utils/package_utils.py +40 -0
- ibm_watsonx_gov/tools/utils/platform_url_mapping.cpython-313-darwin.so +0 -0
- ibm_watsonx_gov/tools/utils/python_utils.py +68 -0
- ibm_watsonx_gov/tools/utils/tool_utils.py +206 -0
- ibm_watsonx_gov/traces/__init__.py +8 -0
- ibm_watsonx_gov/traces/span_exporter.py +195 -0
- ibm_watsonx_gov/traces/span_node.py +251 -0
- ibm_watsonx_gov/traces/span_util.py +153 -0
- ibm_watsonx_gov/traces/trace_utils.py +1074 -0
- ibm_watsonx_gov/utils/__init__.py +8 -0
- ibm_watsonx_gov/utils/aggregation_util.py +346 -0
- ibm_watsonx_gov/utils/async_util.py +62 -0
- ibm_watsonx_gov/utils/authenticator.py +144 -0
- ibm_watsonx_gov/utils/constants.py +15 -0
- ibm_watsonx_gov/utils/errors.py +40 -0
- ibm_watsonx_gov/utils/gov_sdk_logger.py +39 -0
- ibm_watsonx_gov/utils/insights_generator.py +1285 -0
- ibm_watsonx_gov/utils/python_utils.py +425 -0
- ibm_watsonx_gov/utils/rest_util.py +73 -0
- ibm_watsonx_gov/utils/segment_batch_manager.py +162 -0
- ibm_watsonx_gov/utils/singleton_meta.py +25 -0
- ibm_watsonx_gov/utils/url_mapping.cpython-313-darwin.so +0 -0
- ibm_watsonx_gov/utils/validation_util.py +126 -0
- ibm_watsonx_gov/visualizations/__init__.py +13 -0
- ibm_watsonx_gov/visualizations/metric_descriptions.py +57 -0
- ibm_watsonx_gov/visualizations/model_insights.py +1304 -0
- ibm_watsonx_gov/visualizations/visualization_utils.py +75 -0
- ibm_watsonx_gov-1.3.3.dist-info/METADATA +93 -0
- ibm_watsonx_gov-1.3.3.dist-info/RECORD +353 -0
- ibm_watsonx_gov-1.3.3.dist-info/WHEEL +6 -0
ibm_watsonx_gov/providers/llmevalkit/function_calling/metrics/trajectory/trajectory_metrics.json
ADDED
|
@@ -0,0 +1,161 @@
|
|
|
1
|
+
[
|
|
2
|
+
{
|
|
3
|
+
"name": "trajectory_objective_satisfaction",
|
|
4
|
+
"task_description": "As an expert evaluator specializing in user satisfaction assessment, your task is to comprehensively evaluate a complete agent trajectory from initial user request to final response. Your evaluation must focus on USER SATISFACTION - how well the agent fulfilled the user's underlying needs, expectations, and objectives.\n\n## Core Evaluation Philosophy:\nUser satisfaction goes beyond technical correctness. A trajectory can have perfect tool calls but still leave the user unsatisfied if it misses their intent, provides incomplete information, or delivers a poor user experience. Conversely, a trajectory with minor technical imperfections might still achieve high user satisfaction if it addresses the core need effectively.\n\n## Evaluation Dimensions:\n\n### 1. OBJECTIVE UNDERSTANDING (Weight: 25%)\n- Did the agent correctly interpret the user's explicit request?\n- Did the agent identify and address implicit needs or context?\n- Was the scope of the request properly understood (narrow vs. broad)?\n- Did the agent recognize the urgency or priority level?\n\n### 2. STRATEGY & APPROACH (Weight: 20%)\n- Was the chosen approach logical and efficient for the user's goal?\n- Did the agent select the most appropriate tools and sequence?\n- Was the strategy adapted based on intermediate results?\n- Did the agent avoid unnecessary complexity or redundancy?\n\n### 3. INFORMATION COMPLETENESS (Weight: 20%)\n- Did the agent gather all necessary information to satisfy the user's need?\n- Were follow-up questions asked when requirements were ambiguous?\n- Was the information comprehensive enough for the user to take action?\n- Did the agent identify and fill information gaps proactively?\n\n### 4. 
EXECUTION QUALITY (Weight: 15%)\n- Were tool calls executed correctly with appropriate parameters?\n- Did the agent handle errors, exceptions, or unexpected responses gracefully?\n- Was the execution efficient without unnecessary steps?\n- Did the agent validate results and ensure data accuracy?\n\n### 5. USER EXPERIENCE (Weight: 15%)\n- Was the interaction smooth and intuitive from the user's perspective?\n- Did the agent communicate progress and intermediate findings clearly?\n- Was the response time reasonable for the complexity of the request?\n- Did the agent maintain appropriate context throughout the interaction?\n\n### 6. FINAL DELIVERABLE (Weight: 5%)\n- Was the final answer directly actionable for the user?\n- Did the response format match user expectations and needs?\n- Was the information presented clearly and organized logically?\n- Did the agent provide appropriate next steps or recommendations?\n\n## Scoring Rubric:\n\n**Score 5 (Exceptional User Satisfaction):**\n- User's objective completely fulfilled with excellence\n- Proactive identification and addressing of implicit needs\n- Optimal strategy with flawless execution\n- Superior user experience with clear communication\n- Final deliverable exceeds expectations\n\n**Score 4 (High User Satisfaction):**\n- User's objective well satisfied with minor gaps\n- Good understanding of needs with mostly appropriate approach\n- Solid execution with minimal inefficiencies\n- Good user experience with clear communication\n- Final deliverable meets expectations well\n\n**Score 3 (Moderate User Satisfaction):**\n- User's basic objective met but with notable limitations\n- Adequate understanding but missed some nuances\n- Reasonable approach but with some inefficiencies or gaps\n- Acceptable user experience but could be clearer\n- Final deliverable is functional but incomplete\n\n**Score 2 (Low User Satisfaction):**\n- User's objective partially met with significant gaps\n- Limited understanding of user needs or 
context\n- Poor strategy or execution with multiple issues\n- Confusing or frustrating user experience\n- Final deliverable requires substantial additional work\n\n**Score 1 (Very Low User Satisfaction):**\n- User's objective largely unmet or misunderstood\n- Fundamental misinterpretation of user needs\n- Inappropriate strategy or failed execution\n- Poor user experience with unclear communication\n- Final deliverable is unusable or incorrect\n\n## Special Considerations:\n\n**Context Sensitivity:** Consider the user's domain expertise, urgency, and specific use case when evaluating satisfaction.\n\n**Error Recovery:** Evaluate how well the agent recovered from errors or adapted when initial approaches failed.\n\n**Efficiency vs. Thoroughness:** Balance the user's need for quick results against the need for comprehensive information.\n\n**Communication Style:** Assess whether the agent's communication style matched the user's apparent preferences and expertise level.\n\nYour evaluation should prioritize the user's perspective: Would a real user be satisfied with this interaction? Would they achieve their goal and have a positive experience?",
|
|
5
|
+
"jsonschema": {
|
|
6
|
+
"title": "trajectory_objective_satisfaction",
|
|
7
|
+
"description": "Comprehensive assessment of user satisfaction with the agent's trajectory performance.",
|
|
8
|
+
"type": "object",
|
|
9
|
+
"properties": {
|
|
10
|
+
"objective_understanding_score": {
|
|
11
|
+
"type": "integer",
|
|
12
|
+
"minimum": 1,
|
|
13
|
+
"maximum": 5,
|
|
14
|
+
"description": "Score for how well the agent understood the user's explicit and implicit objectives (1-5)."
|
|
15
|
+
},
|
|
16
|
+
"strategy_approach_score": {
|
|
17
|
+
"type": "integer",
|
|
18
|
+
"minimum": 1,
|
|
19
|
+
"maximum": 5,
|
|
20
|
+
"description": "Score for the appropriateness and efficiency of the agent's strategy and approach (1-5)."
|
|
21
|
+
},
|
|
22
|
+
"information_completeness_score": {
|
|
23
|
+
"type": "integer",
|
|
24
|
+
"minimum": 1,
|
|
25
|
+
"maximum": 5,
|
|
26
|
+
"description": "Score for how completely the agent gathered and provided necessary information (1-5)."
|
|
27
|
+
},
|
|
28
|
+
"execution_quality_score": {
|
|
29
|
+
"type": "integer",
|
|
30
|
+
"minimum": 1,
|
|
31
|
+
"maximum": 5,
|
|
32
|
+
"description": "Score for the technical quality and correctness of tool usage and execution (1-5)."
|
|
33
|
+
},
|
|
34
|
+
"user_experience_score": {
|
|
35
|
+
"type": "integer",
|
|
36
|
+
"minimum": 1,
|
|
37
|
+
"maximum": 5,
|
|
38
|
+
"description": "Score for the overall user experience, communication clarity, and interaction flow (1-5)."
|
|
39
|
+
},
|
|
40
|
+
"final_deliverable_score": {
|
|
41
|
+
"type": "integer",
|
|
42
|
+
"minimum": 1,
|
|
43
|
+
"maximum": 5,
|
|
44
|
+
"description": "Score for the quality, usefulness, and completeness of the final response (1-5)."
|
|
45
|
+
},
|
|
46
|
+
"evidence": {
|
|
47
|
+
"type": "string",
|
|
48
|
+
"description": "Detailed evidence from the trajectory supporting your evaluation. Cite specific user messages, tool calls, tool outputs, and agent responses. Organize by evaluation dimension."
|
|
49
|
+
},
|
|
50
|
+
"explanation": {
|
|
51
|
+
"type": "string",
|
|
52
|
+
"description": "Comprehensive explanation of your scoring across all dimensions. Explain how each dimension contributed to user satisfaction and identify key strengths and weaknesses."
|
|
53
|
+
},
|
|
54
|
+
"user_satisfaction_analysis": {
|
|
55
|
+
"type": "string",
|
|
56
|
+
"description": "Focused analysis on user satisfaction: Would the user be satisfied with this interaction? What would their experience be like? What would they think about the agent's performance?"
|
|
57
|
+
},
|
|
58
|
+
"output": {
|
|
59
|
+
"type": "integer",
|
|
60
|
+
"minimum": 1,
|
|
61
|
+
"maximum": 5,
|
|
62
|
+
"description": "Overall user satisfaction score (1-5) calculated as weighted average: Objective Understanding (25%) + Strategy & Approach (20%) + Information Completeness (20%) + Execution Quality (15%) + User Experience (15%) + Final Deliverable (5%)."
|
|
63
|
+
},
|
|
64
|
+
"confidence": {
|
|
65
|
+
"type": "number",
|
|
66
|
+
"minimum": 0,
|
|
67
|
+
"maximum": 1,
|
|
68
|
+
"description": "Your confidence in this evaluation, from 0.0 to 1.0. Consider factors like trajectory clarity, completeness of information, and evaluation complexity."
|
|
69
|
+
},
|
|
70
|
+
"correction": {
|
|
71
|
+
"type": "string",
|
|
72
|
+
"description": "If user satisfaction could be improved, provide specific, actionable recommendations organized by dimension. Focus on changes that would most impact user satisfaction. If no improvements needed, return empty string."
|
|
73
|
+
}
|
|
74
|
+
},
|
|
75
|
+
"required": [
|
|
76
|
+
"objective_understanding_score",
|
|
77
|
+
"strategy_approach_score",
|
|
78
|
+
"information_completeness_score",
|
|
79
|
+
"execution_quality_score",
|
|
80
|
+
"user_experience_score",
|
|
81
|
+
"final_deliverable_score",
|
|
82
|
+
"evidence",
|
|
83
|
+
"explanation",
|
|
84
|
+
"user_satisfaction_analysis",
|
|
85
|
+
"output",
|
|
86
|
+
"confidence",
|
|
87
|
+
"correction"
|
|
88
|
+
]
|
|
89
|
+
},
|
|
90
|
+
"examples": [
|
|
91
|
+
{
|
|
92
|
+
"trajectory": "User: I need to book a flight from NYC to London for next Friday and find a good hotel near the airport.\n\nAgent: I'll help you book a flight and find a hotel. Let me start by searching for flights.\n\nTool Call: search_flights(origin=\"NYC\", destination=\"London\", date=\"2024-01-19\")\nTool Output: {\"flights\": [{\"airline\": \"BA\", \"price\": \"$650\", \"departure\": \"10:00 AM\", \"arrival\": \"10:00 PM\", \"flight_id\": \"BA123\"}]}\n\nAgent: I found a British Airways flight for $650. Let me book this for you and then find hotels.\n\nTool Call: book_flight(flight_id=\"BA123\", passenger_name=\"John Smith\")\nTool Output: {\"booking_confirmed\": true, \"confirmation_code\": \"ABC123\"}\n\nTool Call: search_hotels(location=\"London Heathrow Airport\", checkin=\"2024-01-19\", checkout=\"2024-01-21\")\nTool Output: {\"hotels\": [{\"name\": \"Hilton Heathrow\", \"price\": \"$180/night\", \"rating\": 4.2, \"distance\": \"0.5 miles from airport\"}]}\n\nAgent: Perfect! I've booked your flight (confirmation ABC123) and found the Hilton Heathrow for $180/night, just 0.5 miles from the airport. Your flight departs NYC at 10 AM and arrives in London at 10 PM. Would you like me to book the hotel as well?",
|
|
93
|
+
"expected_output": {
|
|
94
|
+
"objective_understanding_score": 5,
|
|
95
|
+
"strategy_approach_score": 5,
|
|
96
|
+
"information_completeness_score": 4,
|
|
97
|
+
"execution_quality_score": 5,
|
|
98
|
+
"user_experience_score": 5,
|
|
99
|
+
"final_deliverable_score": 4,
|
|
100
|
+
"evidence": "OBJECTIVE UNDERSTANDING: Agent correctly identified both flight booking and hotel search requirements. Properly interpreted 'next Friday' as a specific date and 'near the airport' as location preference.\n\nSTRATEGY & APPROACH: Logical sequence - search flights first, book the selected flight, then search for hotels. Efficient approach without unnecessary steps.\n\nINFORMATION COMPLETENESS: Gathered flight details, booking confirmation, and hotel options. Minor gap: didn't ask about passenger preferences (meal, seat) or hotel booking confirmation.\n\nEXECUTION QUALITY: All tool calls executed correctly with appropriate parameters. Successful flight booking with confirmation code.\n\nUSER EXPERIENCE: Clear communication throughout, provided confirmation details, proactively offered next step (hotel booking).\n\nFINAL DELIVERABLE: Provided actionable information with confirmation code and hotel option, but left hotel booking incomplete.",
|
|
101
|
+
"explanation": "This trajectory demonstrates high user satisfaction across most dimensions. The agent understood both explicit requests (flight + hotel), used an efficient strategy, and executed tools correctly. The user experience was smooth with clear communication and proactive next steps. Minor deduction for not completing the hotel booking or asking about preferences, but overall the user would be very satisfied with getting their flight booked and having a good hotel option presented.",
|
|
102
|
+
"user_satisfaction_analysis": "The user would be highly satisfied with this interaction. They got their primary need (flight booking) completed successfully with a confirmation code, and received a relevant hotel recommendation. The agent was efficient, clear, and proactive. The only minor frustration might be needing a follow-up to complete the hotel booking, but the overall experience would be very positive.",
|
|
103
|
+
"output": 5,
|
|
104
|
+
"confidence": 0.9,
|
|
105
|
+
"correction": "To achieve perfect user satisfaction: 1) Ask about passenger preferences (meal, seat selection) during booking, 2) Proactively book the hotel or clearly ask for confirmation to proceed, 3) Provide more hotel options for comparison."
|
|
106
|
+
}
|
|
107
|
+
},
|
|
108
|
+
{
|
|
109
|
+
"trajectory": "User: What's the weather like in Paris today and should I bring an umbrella?\n\nAgent: I'll check the weather in Paris for you.\n\nTool Call: get_weather(location=\"Paris\", date=\"today\")\nTool Output: {\"temperature\": \"18°C\", \"condition\": \"partly cloudy\", \"humidity\": \"65%\", \"wind\": \"10 km/h\"}\n\nAgent: The weather in Paris today is 18°C and partly cloudy with 65% humidity and 10 km/h wind.",
|
|
110
|
+
"expected_output": {
|
|
111
|
+
"objective_understanding_score": 2,
|
|
112
|
+
"strategy_approach_score": 3,
|
|
113
|
+
"information_completeness_score": 2,
|
|
114
|
+
"execution_quality_score": 4,
|
|
115
|
+
"user_experience_score": 2,
|
|
116
|
+
"final_deliverable_score": 2,
|
|
117
|
+
"evidence": "OBJECTIVE UNDERSTANDING: Agent only addressed the weather query but completely ignored the umbrella question, which was the user's primary practical concern.\n\nSTRATEGY & APPROACH: Basic approach of getting weather data was correct, but no strategy for addressing the umbrella decision.\n\nINFORMATION COMPLETENESS: Missing critical information: precipitation probability, rain forecast, and umbrella recommendation. The weather data alone doesn't answer the user's question.\n\nEXECUTION QUALITY: Tool call was executed correctly with proper parameters and returned valid data.\n\nUSER EXPERIENCE: Poor - user asked a specific practical question about umbrella but got only raw weather data without actionable advice.\n\nFINAL DELIVERABLE: Incomplete response that doesn't help the user make the umbrella decision they asked about.",
|
|
118
|
+
"explanation": "This trajectory shows poor user satisfaction despite technically correct tool usage. The agent failed to understand that the user's real need was a practical recommendation about bringing an umbrella, not just weather data. The response leaves the user to interpret the data themselves, which doesn't fulfill their objective of getting actionable advice.",
|
|
119
|
+
"user_satisfaction_analysis": "The user would be frustrated with this interaction. They asked a simple, practical question about whether to bring an umbrella, but received only raw weather data without the recommendation they needed. They would likely need to ask a follow-up question or make the decision themselves, indicating the agent failed to provide value.",
|
|
120
|
+
"output": 2,
|
|
121
|
+
"confidence": 0.95,
|
|
122
|
+
"correction": "To improve user satisfaction: 1) Directly answer the umbrella question based on weather conditions, 2) Get precipitation probability/rain forecast data, 3) Provide clear recommendation: 'Based on partly cloudy conditions with no rain expected, you likely don't need an umbrella, but consider bringing one if you'll be out all day as conditions can change.'"
|
|
123
|
+
}
|
|
124
|
+
},
|
|
125
|
+
{
|
|
126
|
+
"trajectory": "User: I'm planning a dinner party for 8 people this Saturday. Can you help me find a good restaurant that takes reservations and suggest a wine pairing?\n\nAgent: I'd be happy to help you plan your dinner party! Let me search for restaurants that can accommodate your group.\n\nTool Call: search_restaurants(location=\"user_location\", party_size=8, date=\"2024-01-20\", accepts_reservations=true)\nTool Output: {\"restaurants\": [{\"name\": \"Bistro Laurent\", \"cuisine\": \"French\", \"rating\": 4.6, \"price_range\": \"$$$\", \"availability\": \"7:00 PM, 8:30 PM\"}, {\"name\": \"Mama Rosa's\", \"cuisine\": \"Italian\", \"rating\": 4.3, \"price_range\": \"$$\", \"availability\": \"6:00 PM, 9:00 PM\"}]}\n\nAgent: Great! I found two excellent options for your party of 8:\n\n1. **Bistro Laurent** (French, 4.6★, $$$) - Available at 7:00 PM or 8:30 PM\n2. **Mama Rosa's** (Italian, 4.3★, $$) - Available at 6:00 PM or 9:00 PM\n\nFor wine pairings, let me get some recommendations based on the cuisine:\n\nTool Call: get_wine_recommendations(cuisine=\"French\", occasion=\"dinner_party\", group_size=8)\nTool Output: {\"wines\": [{\"name\": \"Châteauneuf-du-Pape Rouge\", \"type\": \"red\", \"price\": \"$45\", \"description\": \"Full-bodied red perfect for French cuisine\"}, {\"name\": \"Sancerre\", \"type\": \"white\", \"price\": \"$35\", \"description\": \"Crisp white that pairs well with appetizers\"}]}\n\nTool Call: get_wine_recommendations(cuisine=\"Italian\", occasion=\"dinner_party\", group_size=8)\nTool Output: {\"wines\": [{\"name\": \"Chianti Classico\", \"type\": \"red\", \"price\": \"$28\", \"description\": \"Classic Italian red, great with pasta and meat dishes\"}, {\"name\": \"Pinot Grigio\", \"type\": \"white\", \"price\": \"$22\", \"description\": \"Light, refreshing white perfect for Italian fare\"}]}\n\nAgent: Here are wine pairing suggestions for each restaurant:\n\n**For Bistro Laurent (French):**\n- Châteauneuf-du-Pape Rouge ($45) - Full-bodied 
red perfect for French cuisine\n- Sancerre ($35) - Crisp white for appetizers\n\n**For Mama Rosa's (Italian):**\n- Chianti Classico ($28) - Classic Italian red for pasta/meat dishes \n- Pinot Grigio ($22) - Light, refreshing white\n\nBoth restaurants look excellent for your dinner party. Bistro Laurent offers a more upscale French experience, while Mama Rosa's provides great Italian food at a more moderate price point. Would you like me to help you make a reservation at either restaurant?",
|
|
127
|
+
"expected_output": {
|
|
128
|
+
"objective_understanding_score": 5,
|
|
129
|
+
"strategy_approach_score": 5,
|
|
130
|
+
"information_completeness_score": 5,
|
|
131
|
+
"execution_quality_score": 5,
|
|
132
|
+
"user_experience_score": 5,
|
|
133
|
+
"final_deliverable_score": 5,
|
|
134
|
+
"evidence": "OBJECTIVE UNDERSTANDING: Perfect comprehension of both explicit needs (restaurant reservation for 8 people) and implicit needs (good options, wine pairing advice for dinner party context).\n\nSTRATEGY & APPROACH: Excellent strategy - search restaurants first with specific criteria, then get targeted wine recommendations for each cuisine type. Logical, efficient sequence.\n\nINFORMATION COMPLETENESS: Comprehensive information provided - restaurant options with ratings, prices, availability, plus specific wine pairings with prices and descriptions for each cuisine.\n\nEXECUTION QUALITY: All tool calls executed perfectly with appropriate parameters. Smart approach to get wine recommendations for both cuisine types.\n\nUSER EXPERIENCE: Outstanding communication - clear formatting, helpful comparisons, proactive offer to make reservations. Information presented in actionable format.\n\nFINAL DELIVERABLE: Complete solution addressing all user needs with clear options and next steps.",
|
|
135
|
+
"explanation": "This trajectory exemplifies exceptional user satisfaction. The agent demonstrated perfect understanding of the dinner party context, used an optimal strategy to gather comprehensive information, and presented everything in a clear, actionable format. The user received exactly what they needed to make an informed decision about both restaurant and wine selection.",
|
|
136
|
+
"user_satisfaction_analysis": "The user would be extremely satisfied with this interaction. They received exactly what they asked for - good restaurant options with availability and wine pairing suggestions - plus additional valuable context like ratings and prices. The agent was proactive in offering to help with the next step (making reservations). This interaction would likely exceed the user's expectations.",
|
|
137
|
+
"output": 5,
|
|
138
|
+
"confidence": 0.95,
|
|
139
|
+
"correction": ""
|
|
140
|
+
}
|
|
141
|
+
},
|
|
142
|
+
{
|
|
143
|
+
"trajectory": "User: I need to transfer $500 to my friend John's account. His account number is 123456789.\n\nAgent: I can help you transfer money. Let me process this transfer.\n\nTool Call: transfer_money(amount=500, recipient_account=\"123456789\", sender_account=\"user_account\")\nTool Output: {\"error\": \"Insufficient funds. Current balance: $200\"}\n\nAgent: I'm sorry, but the transfer couldn't be completed because you have insufficient funds. Your current balance is $200, but you're trying to transfer $500.\n\nTool Call: get_account_options(account=\"user_account\")\nTool Output: {\"options\": [\"overdraft_protection\", \"line_of_credit\", \"transfer_from_savings\"]}\n\nAgent: I see you have some options available:\n1. Overdraft protection\n2. Line of credit \n3. Transfer from savings account\n\nWould you like me to help you explore any of these options to complete the transfer?",
|
|
144
|
+
"expected_output": {
|
|
145
|
+
"objective_understanding_score": 4,
|
|
146
|
+
"strategy_approach_score": 4,
|
|
147
|
+
"information_completeness_score": 3,
|
|
148
|
+
"execution_quality_score": 4,
|
|
149
|
+
"user_experience_score": 4,
|
|
150
|
+
"final_deliverable_score": 3,
|
|
151
|
+
"evidence": "OBJECTIVE UNDERSTANDING: Agent correctly understood the transfer request and recipient details. Good recognition that user wants to complete the transfer despite initial failure.\n\nSTRATEGY & APPROACH: Good initial approach, and smart recovery by checking available options after the failure. Proactive problem-solving approach.\n\nINFORMATION COMPLETENESS: Provided current balance and available options, but missing details about costs/terms of each option, and didn't check savings balance for transfer option.\n\nEXECUTION QUALITY: Tool calls executed correctly. Good error handling and appropriate follow-up query for solutions.\n\nUSER EXPERIENCE: Clear communication about the problem and potential solutions. Maintained helpful tone despite the failure.\n\nFINAL DELIVERABLE: Identified the problem and provided options, but left user to decide without sufficient information to make an informed choice.",
|
|
152
|
+
"explanation": "This trajectory shows good user satisfaction with effective problem-solving. The agent handled the error gracefully and proactively offered solutions. However, user satisfaction is limited by incomplete information about the options presented - the user needs more details to make an informed decision about how to proceed.",
|
|
153
|
+
"user_satisfaction_analysis": "The user would have moderate satisfaction. While they'd appreciate the agent's proactive approach to solving the insufficient funds problem, they'd likely be frustrated by not having enough information about the options (costs, terms, savings balance) to make a decision. They'd need to ask follow-up questions to proceed.",
|
|
154
|
+
"output": 4,
|
|
155
|
+
"confidence": 0.85,
|
|
156
|
+
"correction": "To improve user satisfaction: 1) Provide details about each option (fees, interest rates, terms), 2) Check savings account balance if transfer from savings is an option, 3) Make a specific recommendation based on the user's situation, 4) Offer to proceed with the recommended option if user agrees."
|
|
157
|
+
}
|
|
158
|
+
}
|
|
159
|
+
]
|
|
160
|
+
}
|
|
161
|
+
]
|
|
File without changes
|
|
@@ -0,0 +1,102 @@
|
|
|
1
|
+
from typing import (
|
|
2
|
+
Any,
|
|
3
|
+
Dict,
|
|
4
|
+
List,
|
|
5
|
+
)
|
|
6
|
+
|
|
7
|
+
from llmevalkit.function_calling.pipeline.types import (
|
|
8
|
+
ToolCall,
|
|
9
|
+
ToolSpec,
|
|
10
|
+
)
|
|
11
|
+
|
|
12
|
+
|
|
13
|
+
# ────────────────────────────────────────────────────────────────────────────────
|
|
14
|
+
# Adapter definitions
|
|
15
|
+
# ────────────────────────────────────────────────────────────────────────────────
|
|
16
|
+
|
|
17
|
+
|
|
18
|
+
class BaseAdapter:
    """Abstract adapter that unifies different API spec / call representations.

    Concrete subclasses wrap one tool-call plus the inventory of available
    tool specs, and expose both through a provider-agnostic interface.
    """

    def get_tools_inventory(self) -> List[Dict[str, Any]]:
        """Return the full specification of every available tool."""
        raise NotImplementedError

    def get_tools_inventory_summary(self) -> List[Dict[str, Any]]:
        """Return a condensed name/description/parameter-type view of all tools."""
        raise NotImplementedError

    def get_tool_spec(self, tool_name: str) -> Dict[str, Any]:
        """Return the specification of the tool named ``tool_name``."""
        raise NotImplementedError

    def get_call_dict(self) -> Dict[str, Any]:
        """Return the wrapped tool call as a plain dictionary."""
        raise NotImplementedError

    def get_function_name(self) -> str:
        """Return the name of the function being called."""
        raise NotImplementedError

    def get_parameters(self) -> Dict[str, Any]:
        """Return the parsed arguments of the wrapped call."""
        raise NotImplementedError

    def get_param_spec_snippet(self, param_name: str) -> Dict[str, Any]:
        """Return a minimal JSON-schema snippet covering only ``param_name``."""
        raise NotImplementedError
|
|
41
|
+
|
|
42
|
+
|
|
43
|
+
class OpenAIAdapter(BaseAdapter):
    """Adapter for ToolSpec + ToolCall inputs (OpenAI-style tool calling)."""

    def __init__(self, specs: List[ToolSpec], call: ToolCall):
        """Store the available tool specs and the single call under evaluation.

        Args:
            specs: Inventory of tool specifications available to the agent.
            call: The tool call being inspected.
        """
        self.specs = specs
        self.call = call

    def get_tools_inventory(self) -> List[Dict[str, Any]]:
        """Return every tool spec serialized to a plain dictionary."""
        return [spec.model_dump() for spec in self.specs]

    def get_tools_inventory_summary(self) -> List[Dict[str, Any]]:
        """Return a condensed inventory view: name, description, and a
        parameter-name -> type mapping per tool.

        Uses ``prop_d.get("type")`` rather than ``prop_d["type"]`` so that
        JSON-schema properties declared without a top-level ``"type"`` key
        (e.g. via ``anyOf``/``oneOf``/``$ref``) map to ``None`` instead of
        raising ``KeyError``.
        """
        return [
            {
                "tool_name": spec.function.name,
                "tool_description": spec.function.description,
                "tool_parameters": {
                    prop_name: prop_d.get("type")
                    for prop_name, prop_d in spec.function.parameters.get(
                        "properties", {}
                    ).items()
                },
            }
            for spec in self.specs
        ]

    def get_tool_spec(self, tool_name: str) -> Dict[str, Any]:
        """Return the spec for ``tool_name``, or an empty dict if unknown."""
        tool = next((t for t in self.specs if t.function.name == tool_name), None)
        return tool.function.model_dump() if tool else {}

    def get_call_dict(self) -> Dict[str, Any]:
        """Return the wrapped call in OpenAI tool-call dictionary form."""
        call_dict = {
            "id": self.call.id,
            "type": "function",
            "function": {
                "name": self.call.function.name,
                "arguments": self.call.function.arguments,
            },
        }
        return call_dict

    def get_function_name(self) -> str:
        """Return the name of the function invoked by the wrapped call."""
        return self.call.function.name

    def get_parameters(self) -> Dict[str, Any]:
        """Return the call's arguments, already parsed into a dictionary."""
        return self.call.function.parsed_arguments

    def get_param_spec_snippet(self, param_name: str) -> Dict[str, Any]:
        """Return a minimal JSON-schema object containing only ``param_name``.

        Falls back to an empty object schema when the called function or the
        requested parameter cannot be found in the spec inventory.
        """
        spec = next(
            (s for s in self.specs if s.function.name == self.get_function_name()),
            None,
        )
        if not spec:
            return {"type": "object", "properties": {}, "required": []}
        # NOTE(review): the fallback to the whole parameters dict assumes some
        # specs store properties at the top level rather than under
        # "properties" — confirm against ToolSpec producers.
        props = spec.function.parameters.get("properties", spec.function.parameters)
        if param_name not in props:
            return {"type": "object", "properties": {}, "required": []}
        return {
            "type": "object",
            "properties": {param_name: props[param_name]},
            "required": [param_name],
        }
|
|
@@ -0,0 +1,355 @@
|
|
|
1
|
+
from pathlib import Path
|
|
2
|
+
from typing import Any, Dict, List, Optional, Union
|
|
3
|
+
import json
|
|
4
|
+
|
|
5
|
+
from llmevalkit.llm import LLMClient
|
|
6
|
+
from llmevalkit.function_calling.pipeline.static_checker import evaluate_static
|
|
7
|
+
from llmevalkit.function_calling.pipeline.semantic_checker import SemanticChecker
|
|
8
|
+
from llmevalkit.function_calling.pipeline.types import (
|
|
9
|
+
FunctionCallMetric,
|
|
10
|
+
FunctionCallInput,
|
|
11
|
+
PipelineResult,
|
|
12
|
+
StaticResult,
|
|
13
|
+
SemanticResult,
|
|
14
|
+
ToolSpec,
|
|
15
|
+
ToolCall,
|
|
16
|
+
StaticMetricResult,
|
|
17
|
+
SemanticCategoryResult,
|
|
18
|
+
TrajectoryPipelineInput,
|
|
19
|
+
TrajectoryPipelineResult,
|
|
20
|
+
)
|
|
21
|
+
|
|
22
|
+
# Default metric JSON paths
# Metric definitions ship alongside the package under <pkg>/metrics/.
# The "*_runtime" variants hold the faster prompts selected when the
# pipeline is constructed with runtime_pipeline=True.
_METRICS_DIR = Path(__file__).parent.parent / "metrics"
_DEFAULT_GENERAL = _METRICS_DIR / "function_call" / "general_metrics.json"
_DEFAULT_GENERAL_RUNTIME = (
    _METRICS_DIR / "function_call" / "general_metrics_runtime.json"
)
_DEFAULT_FUNCSEL = (
    _METRICS_DIR / "function_selection" / "function_selection_metrics.json"
)
_DEFAULT_FUNCSEL_RUNTIME = (
    _METRICS_DIR / "function_selection" / "function_selection_metrics_runtime.json"
)
_DEFAULT_PARAM = _METRICS_DIR / "parameter" / "parameter_metrics.json"
_DEFAULT_PARAM_RUNTIME = _METRICS_DIR / "parameter" / "parameter_metrics_runtime.json"
# Trajectory metrics have no separate runtime variant.
_DEFAULT_TRAJECTORY = _METRICS_DIR / "trajectory" / "trajectory_metrics.json"
|
|
37
|
+
|
|
38
|
+
|
|
39
|
+
class ReflectionPipeline:
|
|
40
|
+
"""
|
|
41
|
+
High-level orchestration for function-call reflection.
|
|
42
|
+
|
|
43
|
+
Modes:
|
|
44
|
+
• static_only: schema checks
|
|
45
|
+
• semantic_only: LLM metrics + transforms
|
|
46
|
+
• run: full static -> semantic -> assemble -> PipelineResult
|
|
47
|
+
|
|
48
|
+
Supports sync & async, custom JSON overrides, and any registered LLM.
|
|
49
|
+
runtime_pipeline: if set to true, use faster prompts (no actionable recommendations, shorter explanations)
|
|
50
|
+
"""
|
|
51
|
+
|
|
52
|
+
def __init__(
    self,
    metrics_client: LLMClient,
    codegen_client: Optional[LLMClient] = None,
    general_metrics: Optional[
        Union[Path, List[FunctionCallMetric], List[str]]
    ] = _DEFAULT_GENERAL_RUNTIME,
    function_metrics: Optional[
        Union[Path, List[FunctionCallMetric], List[str]]
    ] = _DEFAULT_FUNCSEL_RUNTIME,
    parameter_metrics: Optional[
        Union[Path, List[FunctionCallMetric], List[str]]
    ] = _DEFAULT_PARAM_RUNTIME,
    trajectory_metrics: Optional[
        Union[Path, List[FunctionCallMetric], List[str]]
    ] = _DEFAULT_TRAJECTORY,
    transform_enabled: Optional[bool] = False,
    runtime_pipeline: Optional[bool] = True,
    use_examples: Optional[bool] = True,
):
    """
    Build the reflection pipeline.

    Args:
        metrics_client: LLM client used to score semantic metrics.
        codegen_client: LLM client used for transforms/code generation.
            Falls back to ``metrics_client`` when not given.
        general_metrics / function_metrics / parameter_metrics /
        trajectory_metrics: Each may be a Path to a metrics JSON file, a
            list of FunctionCallMetric objects, a list of metric names to
            select from the default JSON file, or None to disable that
            category.
        transform_enabled: Enable LLM-based transforms in the semantic
            checker.
        runtime_pipeline: Select the faster "runtime" default metric files
            when resolving metric-name lists.
            NOTE(review): this flag only affects name-list resolution; the
            signature defaults above are always the runtime files — confirm
            that is intended.
        use_examples: Keep few-shot examples inside metric prompts.

    Raises:
        FileNotFoundError: A referenced metrics JSON file does not exist.
        ValueError: A requested metric name is missing from the default file.
        TypeError: A metrics argument has an unsupported type.
    """
    self.metrics_client = metrics_client
    # Fall back to the metrics client when no dedicated codegen client is given.
    self.codegen_client = (
        codegen_client if codegen_client is not None else metrics_client
    )

    self.general_metrics = general_metrics
    self.function_metrics = function_metrics
    self.parameter_metrics = parameter_metrics
    self.trajectory_metrics = trajectory_metrics

    # Resolve each metrics argument into a list of metric-definition dicts
    # (or None). The default file used for name-list lookups depends on
    # runtime_pipeline; trajectory metrics have no runtime variant.
    gen_defs, fun_defs, par_defs, traj_defs = (
        self._resolve_metrics(metrics, default_path, use_examples=use_examples)
        for metrics, default_path in (
            (
                general_metrics,
                _DEFAULT_GENERAL_RUNTIME if runtime_pipeline else _DEFAULT_GENERAL,
            ),
            (
                function_metrics,
                _DEFAULT_FUNCSEL_RUNTIME if runtime_pipeline else _DEFAULT_FUNCSEL,
            ),
            (
                parameter_metrics,
                _DEFAULT_PARAM_RUNTIME if runtime_pipeline else _DEFAULT_PARAM,
            ),
            (trajectory_metrics, _DEFAULT_TRAJECTORY),
        )
    )

    # Initialize semantic checker with the resolved definitions.
    self.semantic_checker = SemanticChecker(
        general_metrics=gen_defs,
        function_metrics=fun_defs,
        parameter_metrics=par_defs,
        trajectory_metrics=traj_defs,
        metrics_client=self.metrics_client,
        codegen_client=self.codegen_client,
        transform_enabled=transform_enabled,
    )

@staticmethod
def _resolve_metrics(
    metrics: Optional[Union[Path, List[FunctionCallMetric], List[str]]],
    default_path: Path,
    use_examples: bool = True,
) -> Optional[List[dict]]:
    """
    Normalize one metrics argument into a list of metric-definition dicts.

    Accepts None/empty (returns None), a list of metric names (selected
    from ``default_path``), a list of FunctionCallMetric objects
    (serialized via ``model_dump``), or a Path to a metrics JSON file.
    """
    if not metrics:
        return None

    # A list of metric names: select them from the default JSON file.
    if isinstance(metrics, list) and all(isinstance(x, str) for x in metrics):
        if not default_path.is_file():
            raise FileNotFoundError(
                f"Default metrics file not found: {default_path}"
            )
        with default_path.open("r", encoding="utf8") as f_in:
            all_metrics = json.load(f_in)

        filtered_metrics = [
            metric for metric in all_metrics if metric.get("name") in metrics
        ]

        # Strip few-shot examples from prompts when requested.
        if not use_examples:
            for metric in filtered_metrics:
                metric.pop("examples", None)

        if len(filtered_metrics) != len(metrics):
            found_names = {metric.get("name") for metric in filtered_metrics}
            missing = set(metrics) - found_names
            raise ValueError(f"Metrics not found: {missing}")

        return filtered_metrics

    if not isinstance(metrics, (Path, list)):
        raise TypeError(
            "metrics must be Path, List[FunctionCallMetric], List[str], or None"
        )

    # A list of FunctionCallMetric models: serialize them to dicts.
    if isinstance(metrics, list) and all(
        isinstance(x, FunctionCallMetric) for x in metrics
    ):
        return [metric.model_dump() for metric in metrics]

    # Otherwise treat it as a Path to a metrics JSON file.
    if not metrics.is_file():
        raise FileNotFoundError(f"Metrics file not found: {metrics}")
    with metrics.open("r", encoding="utf8") as f:
        return json.load(f)
+
|
|
174
|
+
@staticmethod
def static_only(
    inventory: List[ToolSpec],
    call: ToolCall,
) -> StaticResult:
    """
    Run schema-based static checks on a tool call.

    Returns:
        StaticResult with per-check results and final_decision. Any
        exception raised during evaluation is converted into a failing
        json_schema_validation result rather than propagated.
    """
    try:
        return evaluate_static(inventory, call)
    except Exception as exc:
        # Fold the failure into a single failing static metric.
        failure = StaticMetricResult(
            description="Invalid JSON schema",
            valid=False,
            explanation=f"error parsing JSON schema: {str(exc)}",
        )
        return StaticResult(
            metrics={"json_schema_validation": failure},
            final_decision=False,
        )
|
199
|
+
def semantic_sync(
    self,
    conversation: Union[str, List[Dict[str, str]]],
    inventory: List[ToolSpec],
    call: ToolCall,
    retries: Optional[int] = 2,
    transform_enabled: Optional[bool] = None,
) -> SemanticResult:
    """
    Synchronously run the LLM-based semantic metrics (plus optional
    transforms).

    Thin wrapper: all work is delegated to the semantic checker.
    """
    checker = self.semantic_checker
    return checker.run_sync(
        inventory,
        call,
        conversation,
        retries=retries,
        transform_enabled=transform_enabled,
    )
+
|
|
219
|
+
async def semantic_async(
    self,
    conversation: Union[str, List[Dict[str, str]]],
    inventory: List[ToolSpec],
    call: ToolCall,
    retries: Optional[int] = 2,
    max_parallel: Optional[int] = 10,
    transform_enabled: Optional[bool] = None,
) -> SemanticResult:
    """
    Asynchronously run the LLM-based semantic metrics (plus optional
    transforms).

    Thin wrapper: all work is delegated to the semantic checker.
    """
    checker = self.semantic_checker
    result = await checker.run_async(
        inventory,
        call,
        conversation,
        retries=retries,
        max_parallel=max_parallel,
        transform_enabled=transform_enabled,
    )
    return result
|
|
240
|
+
def run_sync(
    self,
    conversation: Union[str, List[Dict[str, str]]],
    inventory: List[ToolSpec],
    call: ToolCall,
    continue_on_static: Optional[bool] = False,
    retries: Optional[int] = 1,
    transform_enabled: Optional[bool] = None,
) -> PipelineResult:
    """
    Full sync pipeline: static -> semantic -> assemble PipelineResult.

    When static checks fail and ``continue_on_static`` is False, the
    semantic stage is skipped and an all-None SemanticResult is returned.
    """

    def _build_inputs() -> FunctionCallInput:
        # Shared input record for both the short-circuit and full paths.
        return FunctionCallInput(
            conversation_context=conversation,
            tools_inventory=inventory,
            tool_call=call,
        )

    static_res = self.static_only(inventory, call)

    if not (static_res.final_decision or continue_on_static):
        # Static checks failed and we are not asked to continue:
        # short-circuit with an empty semantic section.
        return PipelineResult(
            inputs=_build_inputs(),
            static=static_res,
            semantic=SemanticResult(
                general=None,
                function_selection=None,
                parameter=None,
                transform=None,
            ),
            overall_valid=False,
        )

    semantic_res = self.semantic_sync(
        conversation, inventory, call, retries, transform_enabled
    )
    # NOTE(review): overall_valid is True whenever the semantic stage runs,
    # even if static checks failed under continue_on_static — confirm intended.
    return PipelineResult(
        inputs=_build_inputs(),
        static=static_res,
        semantic=semantic_res,
        overall_valid=True,
    )
+
|
|
286
|
+
async def run_async(
    self,
    conversation: Union[str, List[Dict[str, str]]],
    inventory: List[ToolSpec],
    call: ToolCall,
    continue_on_static: Optional[bool] = False,
    retries: Optional[int] = 1,
    max_parallel: Optional[int] = 10,
    transform_enabled: Optional[bool] = None,
) -> PipelineResult:
    """
    Full async pipeline: static -> semantic -> assemble PipelineResult.

    When static checks fail and ``continue_on_static`` is False, the
    semantic stage is skipped and an all-None SemanticResult is returned.
    """

    def _build_inputs() -> FunctionCallInput:
        # Shared input record for both the short-circuit and full paths.
        return FunctionCallInput(
            conversation_context=conversation,
            tools_inventory=inventory,
            tool_call=call,
        )

    static_res = self.static_only(inventory, call)

    if not (static_res.final_decision or continue_on_static):
        # Static checks failed and we are not asked to continue:
        # short-circuit with an empty semantic section.
        return PipelineResult(
            inputs=_build_inputs(),
            static=static_res,
            semantic=SemanticResult(
                general=None,
                function_selection=None,
                parameter=None,
                transform=None,
            ),
            overall_valid=False,
        )

    semantic_res = await self.semantic_async(
        conversation, inventory, call, retries, max_parallel, transform_enabled
    )
    # NOTE(review): overall_valid is True whenever the semantic stage runs,
    # even if static checks failed under continue_on_static — confirm intended.
    return PipelineResult(
        inputs=_build_inputs(),
        static=static_res,
        semantic=semantic_res,
        overall_valid=True,
    )
|
|
333
|
+
async def run_trajectory_async(
    self,
    trajectory: Union[str, List[Dict[str, str]]],
    tool_inventory: List[Dict[str, Any]],
    retries: Optional[int] = 1,
    max_parallel: Optional[int] = 10,
) -> TrajectoryPipelineResult:
    """
    Run trajectory evaluation asynchronously.

    Delegates scoring to the semantic checker, then wraps the outcome
    together with the original inputs into a TrajectoryPipelineResult.
    """
    checker_result = await self.semantic_checker.run_trajectory_async(
        trajectory=trajectory,
        tool_inventory=tool_inventory,
        retries=retries,
        max_parallel=max_parallel,
    )
    pipeline_inputs = TrajectoryPipelineInput(
        trajectory=trajectory,
        tool_inventory=tool_inventory,
    )
    return TrajectoryPipelineResult(
        inputs=pipeline_inputs,
        result=checker_result,
    )