PyPI - hyperplane-eval - Versions diffs - 0.1.4__tar.gz → 0.1.5__tar.gz - Mend

Files changed (82)

hyperplane_eval-0.1.5/MANIFEST.in ADDED

@@ -0,0 +1,6 @@
+include requirements.txt
+include README.md
+include LICENSE
+recursive-include hyperplane_eval/prompts *.txt
+recursive-include hyperplane_eval/engine/domain *.json
+recursive-include hyperplane_eval/reporting/templates *.html

hyperplane_eval-0.1.5/PKG-INFO ADDED

@@ -0,0 +1,88 @@
+Metadata-Version: 2.4
+Name: hyperplane-eval
+Version: 0.1.5
+Summary: A modular framework for evaluating and verifying agentic LLM outputs.
+Author: Marten Panchev
+Author-email: marten@aquithm.com
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pydantic>=2.0.0
+Requires-Dist: numpy>=1.24.0
+Requires-Dist: scipy>=1.10.0
+Requires-Dist: litellm>=1.0.0
+Requires-Dist: aiohttp>=3.9.0
+Requires-Dist: pandas>=2.0.0
+Requires-Dist: scikit-learn>=1.2.0
+Requires-Dist: openai>=1.0.0
+Requires-Dist: pyngrok>=7.1.0
+Requires-Dist: rich>=13.0.0
+Requires-Dist: questionary>=2.0.0
+Requires-Dist: PyYAML>=6.0.0
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+# Hyperplane Eval
+Hyperplane Eval is a Python-based testing framework that helps you figure out exactly when and where your AI agents break. Instead of writing manual test cases, you give Hyperplane a target function and a set of rules, and it systematically generates edge-cases to map out your agent's "Safe Polytope" — the operational volume where your agent is reliable.
+## 🚀 How It Works: Breadth-First Evaluation
+Testing an AI agent is hard because the potential input space is infinite. Hyperplane solves this by breaking down inputs into "dimensions" of complexity (e.g., Urgency, Ambiguity, Formatting).
+Instead of randomly guessing inputs, Hyperplane uses a **breadth-first evaluation** approach:
+1. **Dimension Extraction:** It automatically extracts relevant dimensions based on the rules you want to test.
+2. **Grid Generation:** It generates a uniform grid of test scenarios across these dimensions (using Sobol sequences for perfectly even distribution).
+3. **Input Synthesis:** It uses a strong LLM to generate realistic user inputs that match those specific dimension coordinates.
+4. **Evaluation:** It executes your local agent code with the generated inputs, and evaluates the output against your rules using a Chain-of-Thought (CoT) judge.
+By doing this breadth-first scan across multiple dimensions simultaneously, Hyperplane creates a mathematical map of your agent's reliability and calculates its "Reliability Coverage" as a clear, comparable percentage.
+## 🚦 CLI Integration
+Hyperplane is incredibly easy to use. You don't need to write any complex evaluation scripts or boilerplate code; everything is handled through an interactive CLI.
+### Setup & Installation
+Install the framework via pip:
+```bash
+pip install hyperplane-eval
+```
+### Running the CLI
+Run the interactive CLI directly in your terminal from inside your project directory:
+```bash
+hyperplane
+```
+The wizard will immediately guide you through the evaluation setup:
+1. **Target Selection:** It will automatically scan your local Python files and let you pick the function that acts as your agent's entry point.
+2. **Rule Definition:** You define the rules your agent must follow in plain English (e.g., "Never offer a refund over $50").
+3. **Configuration:** You configure the depth (how many points to test) and breadth (how many dimensions to extract).
+4. **Execution:** The framework will spin up workers, generate the test space, execute your local code, and render a real-time terminal dashboard.
+Once complete, Hyperplane generates an interactive HTML report showing exactly which dimensions cause your agent to fail, allowing you to easily identify blind spots in your system prompts.
+## 🛠 Technology Stack
+- **Language:** Python 3.10+
+- **Data Modeling:** `pydantic`
+- **Math/Geometry:** `numpy`, `scipy` (Sobol sequences, ConvexHull analysis)
+- **LLM Integration:** `litellm` for universal API connectivity (OpenAI, Gemini, Anthropic, or any local vLLM).
+## 📄 License
+This project is licensed under the Apache License, Version 2.0.
+See the [LICENSE](LICENSE) file for more information.

hyperplane_eval-0.1.5/README.md ADDED

@@ -0,0 +1,54 @@
+# Hyperplane Eval
+Hyperplane Eval is a Python-based testing framework that helps you figure out exactly when and where your AI agents break. Instead of writing manual test cases, you give Hyperplane a target function and a set of rules, and it systematically generates edge-cases to map out your agent's "Safe Polytope" — the operational volume where your agent is reliable.
+## 🚀 How It Works: Breadth-First Evaluation
+Testing an AI agent is hard because the potential input space is infinite. Hyperplane solves this by breaking down inputs into "dimensions" of complexity (e.g., Urgency, Ambiguity, Formatting).
+Instead of randomly guessing inputs, Hyperplane uses a **breadth-first evaluation** approach:
+1. **Dimension Extraction:** It automatically extracts relevant dimensions based on the rules you want to test.
+2. **Grid Generation:** It generates a uniform grid of test scenarios across these dimensions (using Sobol sequences for perfectly even distribution).
+3. **Input Synthesis:** It uses a strong LLM to generate realistic user inputs that match those specific dimension coordinates.
+4. **Evaluation:** It executes your local agent code with the generated inputs, and evaluates the output against your rules using a Chain-of-Thought (CoT) judge.
+By doing this breadth-first scan across multiple dimensions simultaneously, Hyperplane creates a mathematical map of your agent's reliability and calculates its "Reliability Coverage" as a clear, comparable percentage.
+## 🚦 CLI Integration
+Hyperplane is incredibly easy to use. You don't need to write any complex evaluation scripts or boilerplate code; everything is handled through an interactive CLI.
+### Setup & Installation
+Install the framework via pip:
+```bash
+pip install hyperplane-eval
+```
+### Running the CLI
+Run the interactive CLI directly in your terminal from inside your project directory:
+```bash
+hyperplane
+```
+The wizard will immediately guide you through the evaluation setup:
+1. **Target Selection:** It will automatically scan your local Python files and let you pick the function that acts as your agent's entry point.
+2. **Rule Definition:** You define the rules your agent must follow in plain English (e.g., "Never offer a refund over $50").
+3. **Configuration:** You configure the depth (how many points to test) and breadth (how many dimensions to extract).
+4. **Execution:** The framework will spin up workers, generate the test space, execute your local code, and render a real-time terminal dashboard.
+Once complete, Hyperplane generates an interactive HTML report showing exactly which dimensions cause your agent to fail, allowing you to easily identify blind spots in your system prompts.
+## 🛠 Technology Stack
+- **Language:** Python 3.10+
+- **Data Modeling:** `pydantic`
+- **Math/Geometry:** `numpy`, `scipy` (Sobol sequences, ConvexHull analysis)
+- **LLM Integration:** `litellm` for universal API connectivity (OpenAI, Gemini, Anthropic, or any local vLLM).
+## 📄 License
+This project is licensed under the Apache License, Version 2.0.
+See the [LICENSE](LICENSE) file for more information.
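
Note on the README's grid-generation step: it names Sobol sequences (listed under `scipy` in the technology stack) as the way test scenarios are spread across dimensions. The package's own sampling code is not part of this excerpt, so the following is only a rough illustration of the general idea of low-discrepancy sampling over a unit hypercube. It deliberately swaps in a Halton sequence, which is simpler to write without dependencies; it is not Sobol and not the package's implementation. The names `van_der_corput` and `halton_points` are hypothetical.

```python
from typing import List


def van_der_corput(index: int, base: int) -> float:
    """Radical inverse of `index` in `base`: a value in [0, 1)."""
    result, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        result += digit / denom
    return result


def halton_points(n_points: int, primes: List[int]) -> List[List[float]]:
    """Return n_points coordinates, one axis per prime base (one per dimension)."""
    return [
        [van_der_corput(i, base) for base in primes]
        for i in range(1, n_points + 1)
    ]


# Three dimensions (for example Urgency, Ambiguity, Formatting), eight scenarios.
points = halton_points(8, [2, 3, 5])
for point in points:
    print([round(coord, 3) for coord in point])
```

Each coordinate lies in [0, 1) and would still need to be mapped onto whatever scale a dimension uses. Low-discrepancy points cover the space more evenly than independent random draws, but they do not form a regular grid.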

{hyperplane_eval-0.1.4 → hyperplane_eval-0.1.5/hyperplane_eval}/adapters/llms/llm_client.py RENAMED

@@ -4,6 +4,7 @@ import re
 import asyncio
 from typing import Any, Dict
 from litellm import acompletion
+from hyperplane_eval.engine.prompt_loader import load_prompt
 class LLMClient:
@@ -39,8 +40,8 @@ class LLMClient:
         response_schema: Dict[str, Any],
         temperature: float,
     ) -> str:
-        if response_schema:
-            prompt += f"\n\nYOU MUST RETURN A JSON OBJECT WITH THE EXACT FOLLOWING SCHEMA:\n{json.dumps(response_schema, indent=2)}"
+        schema_str = json.dumps(response_schema, indent=2)
+        prompt += "\n\n" + load_prompt("adapters/llm/schema_prompt", schema=schema_str)
         kwargs = {
             "model": self.model,  # Force using the user-selected model
@@ -49,8 +50,7 @@ class LLMClient:
             **self.llm_kwargs,
         }
-        if response_schema:
-            kwargs["response_format"] = {"type": "json_object"}
+        kwargs["response_format"] = {"type": "json_object"}
         async with self._semaphore:
             try:
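
Note on this hunk: in 0.1.5 the `if response_schema:` guards are gone, so the schema text is always appended and `response_format` is always set to JSON mode, even when the caller passes an empty or `None` schema. The sketch below illustrates that consequence. The real `load_prompt` is not shown in this excerpt, so `SCHEMA_PROMPT_TEMPLATE` and `build_prompt` are hypothetical stand-ins; the template text is copied from the 0.1.4 inline string, and the use of `string.Template` is an assumption.

```python
import json
from string import Template
from typing import Any, Dict, Optional

# Hypothetical in-memory stand-in for the packaged prompt template file.
# Template uses $schema, so literal braces in the JSON need no escaping.
SCHEMA_PROMPT_TEMPLATE = Template(
    "YOU MUST RETURN A JSON OBJECT WITH THE EXACT FOLLOWING SCHEMA:\n$schema"
)


def build_prompt(prompt: str, response_schema: Optional[Dict[str, Any]]) -> str:
    """Mirror the 0.1.5 flow: always serialize and append the schema."""
    schema_str = json.dumps(response_schema, indent=2)
    return prompt + "\n\n" + SCHEMA_PROMPT_TEMPLATE.substitute(schema=schema_str)


with_schema = build_prompt("Classify the ticket.", {"label": "string"})
without_schema = build_prompt("Classify the ticket.", None)

print(with_schema)
print(without_schema)  # the schema section now reads "null"
```

If a call site can legitimately pass no schema, the prompt would end with a `null` schema under this flow, which may be worth checking against the callers.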

{hyperplane_eval-0.1.4 → hyperplane_eval-0.1.5/hyperplane_eval}/adapters/runners/agent_runner.py RENAMED

@@ -9,15 +9,15 @@ class AgentRunner:
     def __init__(
         self,
-        executor_func: Callable = None,
-        target_path: str = "",
-        selected_func: dict = None,
+        executor_func: Callable,
+        target_path: str,
+        selected_func: dict,
     ):
         self.executor_func = executor_func
         self.target_path = target_path
         self.selected_func = selected_func
-    async def _call_target_agent(self, messages: List[Dict[str, str]]) -> str:
+    async def call_target_agent(self, messages: List[Dict[str, str]]) -> str:
         """Dispatches a multi-turn request to the agent under evaluation."""
         if not messages:
             return ""
@@ -75,7 +75,3 @@ class AgentRunner:
                 return f"Error: {str(e)}"
         else:
             return ""
-    async def close(self):
-        """No-op close method to satisfy framework expectation."""
-        pass
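
Note on this hunk: three things change for callers. The constructor arguments lose their defaults, `_call_target_agent` becomes the public `call_target_agent`, and the no-op `close()` is removed. The sketch below uses a hypothetical stand-in class, `AgentRunnerSketch`, that copies only the 0.1.5 constructor signature, to show what code written against 0.1.4 would run into. The lambda executor and the `agent.py` and `handle` values are placeholders.

```python
from typing import Callable, Optional


class AgentRunnerSketch:
    """Stand-in mirroring only the 0.1.5 constructor signature."""

    def __init__(self, executor_func: Callable, target_path: str, selected_func: dict):
        self.executor_func = executor_func
        self.target_path = target_path
        self.selected_func = selected_func


# 0.1.4 allowed a bare constructor call; with required arguments it fails.
missing_args_error: Optional[str] = None
try:
    AgentRunnerSketch()  # type: ignore[call-arg]
except TypeError as exc:
    missing_args_error = str(exc)

# Placeholder values, for illustration only.
runner = AgentRunnerSketch(
    executor_func=lambda *args, **kwargs: "",
    target_path="agent.py",
    selected_func={"name": "handle"},
)

print(missing_args_error)
print(hasattr(runner, "close"))  # code that awaited runner.close() needs updating
```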

{hyperplane_eval-0.1.4 → hyperplane_eval-0.1.5/hyperplane_eval}/cli/app.py RENAMED

@@ -7,7 +7,11 @@ from rich.text import Text
 from rich.panel import Panel
 from typing import Any
+from hyperplane_eval.adapters.llms.llm_client import LLMClient
+from hyperplane_eval.adapters.runners.agent_runner import AgentRunner
+from hyperplane_eval.adapters.local_bindings.executor import execute_temp_runner
+from hyperplane_eval.engine.config import EvaluationConfig
+from hyperplane_eval.engine.orchestrator import PipelineOrchestrator
 LOGO = """
@@ -36,6 +40,106 @@ class VerifyApp:
         with open(self.config_file, "w") as f:
             yaml.dump(self.config, f)
+    @staticmethod
+    def update_dashboard_display(
+        active_scenarios: dict,
+        plane_input_space: Any,
+        scenarios_per_plane: int,
+        plane_features: list,
+        rule_idx: int,
+        rules_len: int,
+        plane_idx: int,
+        num_planes: int,
+        rule: str,
+    ) -> Group:
+        """Generates the CLI dashboard showing evaluation progress and scenario status."""
+        pct = min(1.0, len(plane_input_space.get_all_vectors()) / scenarios_per_plane)
+        bar = "█" * int(30 * pct) + "░" * (30 - int(30 * pct))
+        dims_str = ", ".join(f.name for f in plane_features)
+        renderables = []
+        renderables.append(
+            Text.from_markup(
+                f"[bold cyan]Rule [{rule_idx + 1}/{rules_len}] - Plane [{plane_idx + 1}/{num_planes}]:[/bold cyan] {rule[:80]}..."
+            )
+        )
+        renderables.append(Text.from_markup(f"[cyan]Dimensions:[/cyan] {dims_str}"))
+        renderables.append(
+            Text.from_markup(
+                f"[cyan]Progress:[/cyan] [{bar}] {pct:.0%} ({len(plane_input_space.get_all_vectors())}/{scenarios_per_plane})\n"
+            )
+        )
+        for item in list(active_scenarios.values())[-3:]:
+            if item["status"] == "Pending":
+                renderables.append(Text.from_markup(f" • {item['text']}\n"))
+            else:
+                score = item["score"]
+                if score >= 0.75:
+                    marker = "[bold green][✓][/bold green]"
+                elif score >= 0.25:
+                    marker = "[bold yellow][~][/bold yellow]"
+                else:
+                    marker = "[bold red][✗][/bold red]"
+                renderables.append(
+                    Text.from_markup(f" • {marker} ({score:.0%}) {item['text']}\n")
+                )
+        return Group(*renderables)
+    async def run(self):
+        self.console.print(Panel.fit(Text(LOGO, style="bold cyan")))
+        target_path, selected_func, description, rules = await self._prompt_for_target()
+        if not target_path or not selected_func:
+            return
+        rules_to_run = await self._prompt_for_rule(rules)
+        if not rules_to_run:
+            self.console.print("[red]No rules selected. Exiting.[/red]")
+            return
+        (
+            depth,
+            breadth,
+            adversarial,
+            conversational,
+        ) = await self._prompt_for_dynamic_config()
+        rules_str = ", ".join(f"'{r}'" for r in rules_to_run)
+        self.console.print(
+            f"\n[bold green]Starting evaluation locally for rules: {rules_str}[/bold green]"
+        )
+        llm_params = {
+            k.replace("llm_", ""): v
+            for k, v in self.config.items()
+            if k.startswith("llm_") and k != "llm_model"
+        }
+        llm_client = LLMClient(model=self.config.get("llm_model"), **llm_params)
+        runner = AgentRunner(
+            executor_func=execute_temp_runner,
+            target_path=target_path,
+            selected_func=selected_func,
+        )
+        eval_config = EvaluationConfig(
+            rules=rules_to_run,
+            runner=runner,
+            generator_target_schema=selected_func.get("params", []),
+            generator_target_code=selected_func.get("code", ""),
+            depth=depth,
+            breadth=breadth,
+            adversarial_testing=adversarial,
+            conversational_testing=conversational,
+            llm_client=llm_client,
+            agent_description=description,
+        )
+        orchestrator = PipelineOrchestrator(eval_config)
+        await orchestrator.run()
     async def _prompt_for_target(self):
         """Prompts the user to select or confirm the target file and function."""
         if self.config and "file" in self.config and "function" in self.config:
@@ -44,7 +148,8 @@ class VerifyApp:
             )
             use_existing = await questionary.confirm("Use this target?").ask_async()
             if use_existing:
-                from adapters.local_bindings.scanner import extract_functions
+                from hyperplane_eval.adapters.local_bindings.scanner import extract_functions
                 funcs = extract_functions(self.config["file"])
                 selected_func = next(
                     (f for f in funcs if f["name"] == self.config["function"]), None
@@ -82,7 +187,8 @@ class VerifyApp:
             return None, None, None, []
         self.console.print("[cyan]Scanning for functions...[/cyan]")
-        from adapters.local_bindings.scanner import extract_functions
+        from hyperplane_eval.adapters.local_bindings.scanner import extract_functions
         funcs = extract_functions(target_path)
         if not funcs:
             self.console.print(
@@ -304,117 +410,16 @@ class VerifyApp:
         ).ask_async()
         self.config["adversarial_testing"] = adversarial
-        self.save_config()
-        return depth, breadth, adversarial
-    @staticmethod
-    def update_dashboard_display(
-        active_scenarios: dict,
-        plane_input_space: Any,
-        scenarios_per_plane: int,
-        plane_features: list,
-        rule_idx: int,
-        rules_len: int,
-        plane_idx: int,
-        num_planes: int,
-        rule: str,
-    ) -> Group:
-        """Generates the CLI dashboard showing evaluation progress and scenario status."""
-        pct = min(1.0, len(plane_input_space.get_all_vectors()) / scenarios_per_plane)
-        bar = "█" * int(30 * pct) + "░" * (30 - int(30 * pct))
-        dims_str = ", ".join(f.name for f in plane_features)
-        renderables = []
-        renderables.append(
-            Text.from_markup(
-                f"[bold cyan]Rule [{rule_idx + 1}/{rules_len}] - Plane [{plane_idx + 1}/{num_planes}]:[/bold cyan] {rule[:80]}..."
-            )
-        )
-        renderables.append(Text.from_markup(f"[cyan]Dimensions:[/cyan] {dims_str}"))
-        renderables.append(
-            Text.from_markup(
-                f"[cyan]Progress:[/cyan] [{bar}] {pct:.0%} ({len(plane_input_space.get_all_vectors())}/{scenarios_per_plane})\n"
-            )
-        )
-        for item in list(active_scenarios.values())[-3:]:
-            if item["status"] == "Pending":
-                renderables.append(Text.from_markup(f" • {item['text']}\n"))
-            else:
-                score = item["score"]
-                if score >= 0.75:
-                    marker = "[bold green][✓][/bold green]"
-                elif score >= 0.25:
-                    marker = "[bold yellow][~][/bold yellow]"
-                else:
-                    marker = "[bold red][✗][/bold red]"
-                renderables.append(
-                    Text.from_markup(f" • {marker} ({score:.0%}) {item['text']}\n")
-                )
-        return Group(*renderables)
-    async def run(self):
-        self.console.print(Panel.fit(Text(LOGO, style="bold cyan")))
-        target_path, selected_func, description, rules = await self._prompt_for_target()
-        if not target_path or not selected_func:
-            return
-        rules_to_run = await self._prompt_for_rule(rules)
-        if not rules_to_run:
-            self.console.print("[red]No rules selected. Exiting.[/red]")
-            return
-        depth, breadth, adversarial = await self._prompt_for_dynamic_config()
-        rules_str = ", ".join(f"'{r}'" for r in rules_to_run)
-        self.console.print(
-            f"\n[bold green]Starting evaluation locally for rules: {rules_str}[/bold green]"
-        )
-        from adapters.llms.llm_client import LLMClient
-        llm_params = {
-            k.replace("llm_", ""): v
-            for k, v in self.config.items()
-            if k.startswith("llm_") and k != "llm_model"
-        }
-        llm_client = LLMClient(model=self.config.get("llm_model"), **llm_params)
-        from adapters.runners.agent_runner import AgentRunner
-        from adapters.local_bindings.executor import execute_temp_runner
-        runner = AgentRunner(
-            executor_func=execute_temp_runner,
-            target_path=target_path,
-            selected_func=selected_func,
-        )
-        import os
-        agent_dir = os.path.dirname(os.path.abspath(target_path))
-        results_path = os.path.join(agent_dir, "results")
+        conversational = await questionary.confirm(
+            "Enable Conversational Testing? (Injects natural conversational artifacts like dictation errors, multi-tasking, etc.)",
+            default=self.config.get("conversational_testing", False),
+        ).ask_async()
-        from engine.config import EvaluationConfig
-        from engine.orchestrator import PipelineOrchestrator
+        self.config["conversational_testing"] = conversational
+        self.save_config()
-        eval_config = EvaluationConfig(
-            results_dir=results_path,
-            rules=rules_to_run,
-            runner=runner,
-            generator_target_schema=selected_func.get("params", []),
-            generator_target_code=selected_func.get("code", ""),
-            depth=depth,
-            breadth=breadth,
-            adversarial_testing=adversarial,
-            llm_client=llm_client,
-            agent_description=description,
-        )
-        orchestrator = PipelineOrchestrator(eval_config)
-        await orchestrator.run()
+        return depth, breadth, adversarial, conversational
 async def main():
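
Note on `update_dashboard_display`, which this diff moves without changing: its rendering rules are a clamped 30-cell progress bar and a three-way score marker with cut-offs at 0.75 and 0.25. The sketch below restates those two rules as plain functions so they can be checked in isolation. The function names `progress_bar` and `score_marker` are hypothetical, and the Rich markup is left out.

```python
def progress_bar(done: int, total: int, width: int = 30) -> str:
    """Filled/empty bar, clamped at 100% as in the dashboard code."""
    pct = min(1.0, done / total)
    filled = int(width * pct)
    return "█" * filled + "░" * (width - filled)


def score_marker(score: float) -> str:
    """Map a judge score to the pass / partial / fail marker."""
    if score >= 0.75:
        return "✓"
    if score >= 0.25:
        return "~"
    return "✗"


print(progress_bar(15, 30))
print(progress_bar(45, 30))  # more results than planned still renders a full bar
print([score_marker(s) for s in (0.9, 0.75, 0.5, 0.1)])
```

Like the expression it mirrors, the sketch divides by the planned scenario count, so a count of zero would raise `ZeroDivisionError`.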

hyperplane_eval-0.1.5/hyperplane_eval/engine/config.py ADDED

@@ -0,0 +1,20 @@
+from dataclasses import dataclass
+from typing import Any, List, Dict
+from hyperplane_eval.adapters.runners.agent_runner import AgentRunner
+@dataclass
+class EvaluationConfig:
+    """Configuration for an evaluation run."""
+    rules: List[str]
+    runner: AgentRunner
+    generator_target_schema: List[Dict[str, Any]]
+    generator_target_code: str
+    llm_client: Any
+    depth: str
+    breadth: str
+    adversarial_testing: bool
+    conversational_testing: bool
+    agent_description: str
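
Note on `EvaluationConfig`: every field is required, since the dataclass declares no defaults, and there is no `results_dir` field, which matches the removal of that argument in the `cli/app.py` hunk above. The sketch below uses a hypothetical stand-in, `EvaluationConfigSketch`, with the same field list to show how construction behaves. The `runner` and `llm_client` values are `None` placeholders, and the `depth` and `breadth` strings are invented for illustration, since the accepted values are not shown in this excerpt.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class EvaluationConfigSketch:
    """Stand-in with the same fields as the 0.1.5 EvaluationConfig."""

    rules: List[str]
    runner: Any  # AgentRunner in the package
    generator_target_schema: List[Dict[str, Any]]
    generator_target_code: str
    llm_client: Any
    depth: str
    breadth: str
    adversarial_testing: bool
    conversational_testing: bool
    agent_description: str


kwargs: Dict[str, Any] = dict(
    rules=["Never offer a refund over $50"],
    runner=None,        # placeholder
    generator_target_schema=[],
    generator_target_code="",
    llm_client=None,    # placeholder
    depth="medium",     # invented value
    breadth="medium",   # invented value
    adversarial_testing=False,
    conversational_testing=True,
    agent_description="Support agent",
)
config = EvaluationConfigSketch(**kwargs)
field_names = [f.name for f in fields(config)]

# Passing the 0.1.4-era results_dir argument is rejected.
rejected_results_dir = False
try:
    EvaluationConfigSketch(results_dir="results", **kwargs)  # type: ignore[call-arg]
except TypeError:
    rejected_results_dir = True

print(field_names)
print(rejected_results_dir)
```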
