PyPI - hud-python - Versions diffs - 0.4.11__tar.gz → 0.4.13__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of hud-python might be problematic.

Files changed (173)

{hud_python-0.4.11 → hud_python-0.4.13}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.4.11
+Version: 0.4.13
 Summary: SDK for the HUD platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-python
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues
@@ -35,10 +35,9 @@ Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
 Requires-Python: <3.14,>=3.11
-Requires-Dist: fastmcp>=2.11.2
 Requires-Dist: httpx<1,>=0.23.0
-Requires-Dist: hud-mcp-python-sdk>=0.1.0
-Requires-Dist: mcp>=1.13.1
+Requires-Dist: hud-fastmcp-python-sdk>=0.1.2
+Requires-Dist: hud-mcp-python-sdk>=3.13.2
 Requires-Dist: opentelemetry-api>=1.34.1
 Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.34.1
 Requires-Dist: opentelemetry-instrumentation-mcp>=0.44.1
@@ -56,7 +55,11 @@ Provides-Extra: agent
 Requires-Dist: anthropic; extra == 'agent'
 Requires-Dist: datasets>=2.14.0; extra == 'agent'
 Requires-Dist: dotenv>=0.9.9; extra == 'agent'
-Requires-Dist: hud-mcp-use-python-sdk>=0.1.0; extra == 'agent'
+Requires-Dist: hud-mcp-use-python-sdk>=2.3.13; extra == 'agent'
+Requires-Dist: ipykernel; extra == 'agent'
+Requires-Dist: ipython<9; extra == 'agent'
+Requires-Dist: jupyter-client; extra == 'agent'
+Requires-Dist: jupyter-core; extra == 'agent'
 Requires-Dist: langchain; extra == 'agent'
 Requires-Dist: langchain-anthropic; extra == 'agent'
 Requires-Dist: langchain-openai; extra == 'agent'
@@ -66,7 +69,11 @@ Provides-Extra: agents
 Requires-Dist: anthropic; extra == 'agents'
 Requires-Dist: datasets>=2.14.0; extra == 'agents'
 Requires-Dist: dotenv>=0.9.9; extra == 'agents'
-Requires-Dist: hud-mcp-use-python-sdk>=0.1.0; extra == 'agents'
+Requires-Dist: hud-mcp-use-python-sdk>=2.3.13; extra == 'agents'
+Requires-Dist: ipykernel; extra == 'agents'
+Requires-Dist: ipython<9; extra == 'agents'
+Requires-Dist: jupyter-client; extra == 'agents'
+Requires-Dist: jupyter-core; extra == 'agents'
 Requires-Dist: langchain; extra == 'agents'
 Requires-Dist: langchain-anthropic; extra == 'agents'
 Requires-Dist: langchain-openai; extra == 'agents'
@@ -77,7 +84,7 @@ Requires-Dist: aiodocker>=0.24.0; extra == 'dev'
 Requires-Dist: anthropic; extra == 'dev'
 Requires-Dist: datasets>=2.14.0; extra == 'dev'
 Requires-Dist: dotenv>=0.9.9; extra == 'dev'
-Requires-Dist: hud-mcp-use-python-sdk>=0.1.0; extra == 'dev'
+Requires-Dist: hud-mcp-use-python-sdk>=2.3.13; extra == 'dev'
 Requires-Dist: inspect-ai>=0.3.80; extra == 'dev'
 Requires-Dist: ipykernel; extra == 'dev'
 Requires-Dist: ipython<9; extra == 'dev'
@@ -233,7 +240,7 @@ Any hud MCP environment and evaluation works with our RL pipeline. Even our remo
 This is Claude Computer Use running on our proprietary financial analyst benchmark [SheetBench-50](https://huggingface.co/datasets/hud-evals/SheetBench-50):
-![Trace screenshot](https://raw.githubusercontent.com/hud-evals/hud-python/l/text-2048/docs/src/images/trace_sheet.gif)
+![Trace screenshot](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/trace_sheet.gif)
 > [See this trace on _app.hud.so_](https://app.hud.so/trace/9e212e9e-3627-4f1f-9eb5-c6d03c59070a)
@@ -385,7 +392,7 @@ result = await ClaudeAgent().run({  # See all agents: https://docs.hud.so/refere
 All leaderboards are publicly available on [app.hud.so/leaderboards](https://app.hud.so/leaderboards) (see [docs](https://docs.hud.so/evaluate-agents/leaderboards))
-![Leaderboard](https://raw.githubusercontent.com/hud-evals/hud-python/l/text-2048/docs/src/images/leaderboards_2.png)
+![Leaderboard](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/leaderboards_3.png)
 We highly suggest running 3-5 evaluations per dataset for the most consistent results across multiple jobs.
@@ -430,10 +437,6 @@ graph LR
     Trace --> Dashboard
     AnyMCP -->|"MCP"| API
-    style Dashboard fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
-    style SDK fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
-    style RemoteEnv fill:#d1fae5,stroke:#10b981,stroke-width:2px
-    style AnyMCP fill:#fce7f3,stroke:#ec4899,stroke-width:2px,stroke-dasharray: 5 5
 ```
 ## CLI reference

{hud_python-0.4.11 → hud_python-0.4.13}/README.md RENAMED

@@ -130,7 +130,7 @@ Any hud MCP environment and evaluation works with our RL pipeline. Even our remo
 This is Claude Computer Use running on our proprietary financial analyst benchmark [SheetBench-50](https://huggingface.co/datasets/hud-evals/SheetBench-50):
-![Trace screenshot](https://raw.githubusercontent.com/hud-evals/hud-python/l/text-2048/docs/src/images/trace_sheet.gif)
+![Trace screenshot](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/trace_sheet.gif)
 > [See this trace on _app.hud.so_](https://app.hud.so/trace/9e212e9e-3627-4f1f-9eb5-c6d03c59070a)
@@ -282,7 +282,7 @@ result = await ClaudeAgent().run({  # See all agents: https://docs.hud.so/refere
 All leaderboards are publicly available on [app.hud.so/leaderboards](https://app.hud.so/leaderboards) (see [docs](https://docs.hud.so/evaluate-agents/leaderboards))
-![Leaderboard](https://raw.githubusercontent.com/hud-evals/hud-python/l/text-2048/docs/src/images/leaderboards_2.png)
+![Leaderboard](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/leaderboards_3.png)
 We highly suggest running 3-5 evaluations per dataset for the most consistent results across multiple jobs.
@@ -327,10 +327,6 @@ graph LR
     Trace --> Dashboard
     AnyMCP -->|"MCP"| API
-    style Dashboard fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
-    style SDK fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
-    style RemoteEnv fill:#d1fae5,stroke:#10b981,stroke-width:2px
-    style AnyMCP fill:#fce7f3,stroke:#ec4899,stroke-width:2px,stroke-dasharray: 5 5
 ```
 ## CLI reference

{hud_python-0.4.11 → hud_python-0.4.13}/environments/README.md RENAMED

@@ -351,7 +351,7 @@ from . import basic, advanced  # This registers all @setup.tool() decorated func
 # In setup/basic.py
 from . import setup
-from hud.tools.types import SetupResult
+from mcp.types import TextContent
 @setup.tool()
 async def reset(**kwargs):
@@ -361,14 +361,14 @@ async def reset(**kwargs):
         **kwargs: Additional parameters
     Returns:
-        SetupResult
+        TextContent
     """
     # Access environment from the hub
     env = setup.env
     await env.reset_state()
-    return SetupResult(
-        content="Environment reset to initial state",
-        info={"status": "success"}
+    return TextContent(
+        text="Environment reset to initial state",
+        type="text"
     )
 @setup.tool()
@@ -379,14 +379,14 @@ async def seed_data(num_items: int = 5):
         num_items: Number of items to create
     Returns:
-        SetupResult
+        TextContent
     """
     # Access environment from the hub
     env = setup.env
     items = await env.create_items(num_items)
-    return SetupResult(
-        content=f"Created {len(items)} items",
-        info={"items_created": len(items)}
+    return TextContent(
+        text=f"Created {len(items)} items",
+        type="text"
     )
 # In evaluate/__init__.py
@@ -735,7 +735,7 @@ See the `browser` environment for a complete production example of this pattern.
 ### 4. Cursor rules – paste this once
-Inside `.cursor/rules/hud_environment_iteration.mdc` add (or verify) the following so the agent always knows the expected iteration loop:
+Inside `.cursor/rules/mcp_environment_iteration.mdc` add (or verify) the following so the agent always knows the expected iteration loop:
 ```mdc
 ---
@@ -743,7 +743,7 @@ description: Improve an MCP environment
 alwaysApply: false
 ---
 Setup
-1. Make sure the user has started the development server with `hud dev --build` and that you can connect to the environment through the provided HTTP endpoint. Check that you have access to the environment's tools.
+1. Make sure the user has set up the mcp config for the environment by seeing if you have access to the tools by the given name (i.e. my-environment-dev), and make sure the title is in dev mode. If not, ask the user to make a dev version!
 2. Make sure you can find the source folder for this environment. Explore its contents and README.
 3. Clarify the objectives and ask follow up questions on the initial query to determine precise implementation details.
@@ -760,7 +760,7 @@ Iteration
 Context: In the my-environment folder, I have a browser app environment. I've built a tool to interact with it called my-environment-dev.
 Interaction: There are multiple tools to setup and evaluate the environment. There are also interaction tools for you to be able to move around it, and a screenshot tool to see the state. Use all of the available tools.
 Objective: Please test if all setup, evaluation functions are working. This means you should come up with new problem definitions to test all functionality on. Be creative in how you pick edge cases to test on.
-Rules: @hud_environment_iteration.mdc
+Rules: @mcp_environment_iteration.mdc
 ```
 ---
@@ -827,13 +827,13 @@ Before making changes:
 ```python
 # In setup/my_new_setup.py
 from . import setup
-from hud.tools import BaseSetup, SetupResult
+from hud.tools import BaseSetup, TextContent
 @setup("my_new_setup", description="Clear description of what this does")
 class MyNewSetup(BaseSetup):
-    async def __call__(self, context, param1: str, param2: int = 10) -> SetupResult:
+    async def __call__(self, context, param1: str, param2: int = 10) -> TextContent:
         # Implementation
-        return {"status": "success", "details": "..."}
+        return TextContent(...)
 ```
 **Adding New Evaluators**

hud_python-0.4.13/environments/browser/README.md ADDED

@@ -0,0 +1,213 @@
+# Browser Environment
+A browser automation environment for HUD that provides GUI access and web app interaction capabilities. This environment supports hot-reloading during development while maintaining persistent state.
+## Architecture Overview
+The browser environment uses a two-process architecture:
+1. **Context Server** (`context.py`): Long-running process that maintains persistent state
+2. **MCP Server** (`server.py`): Hot-reloadable process that handles tool requests
+### Key Components
+- **BrowserContext**: Stores persistent state (running apps, ports, playwright instance)
+- **ServiceManager**: Manages X11, VNC, and app processes
+- **BaseHub Tools**: Setup and evaluate tools organized by app (2048, todo)
+- **Multiprocessing Proxy**: Enables state sharing between processes
+## Context Management and Common Pitfalls
+### Understanding the Proxy System
+The browser environment uses Python's `multiprocessing.Manager` to share state between the context server and MCP server. This introduces important constraints:
+#### ❌ Common Pitfall: Unpicklable Objects
+```python
+# BAD: This will fail with "cannot pickle 'coroutine' object"
+@setup.tool("my_tool")
+async def my_tool():
+    env = setup.env
+    result = await env.call_app_api("app", "/api/endpoint")  # Returns coroutine
+    # The coroutine can't be serialized through the proxy!
+```
+#### ✅ Solution: Direct HTTP Calls
+```python
+# GOOD: Make HTTP calls directly
+@setup.tool("my_tool")
+async def my_tool():
+    import httpx
+    # Get the backend port from persistent context
+    persistent_ctx = setup.env
+    backend_port = persistent_ctx.get_app_backend_port("app")
+    # Make API call directly
+    url = f"http://localhost:{backend_port}/api/endpoint"
+    async with httpx.AsyncClient() as client:
+        response = await client.get(url)
+        response.raise_for_status()
+        result = response.json()
+```
+### State Synchronization Issues
+#### ❌ Common Pitfall: Direct List/Dict Manipulation
+```python
+# BAD: Regular Python lists don't sync through proxy
+class ServiceManager:
+    def __init__(self):
+        self._launched_apps = []  # Won't sync!
+```
+#### ✅ Solution: Store State in Persistent Context
+```python
+# GOOD: Use the persistent context for shared state
+class BrowserContext:
+    def __init__(self):
+        self._running_apps: List[str] = []
+        self._app_ports: Dict[str, Dict[str, int]] = {}
+    def add_running_app(self, app_name: str) -> None:
+        """Add app to running list."""
+        if app_name not in self._running_apps:
+            self._running_apps.append(app_name)
+```
+### Accessing Shared Resources
+#### ❌ Common Pitfall: Direct Attribute Access
+```python
+# BAD: Direct attribute access on proxy objects
+playwright_tool = env.playwright  # May not work with proxy
+```
+#### ✅ Solution: Use Getter Methods
+```python
+# GOOD: Use proxy-friendly getter methods
+playwright_tool = persistent_ctx.get_playwright_tool()
+```
+## Best Practices
+### 1. Tool Implementation Pattern
+All setup and evaluate tools should follow this pattern:
+```python
+@setup.tool("tool_name")
+async def tool_name(param1: type, param2: type):
+    """Tool description."""
+    try:
+        # Get persistent context
+        persistent_ctx = setup.env  # or evaluate.env
+        # Get app ports
+        backend_port = persistent_ctx.get_app_backend_port("app_name")
+        # Make HTTP request
+        url = f"http://localhost:{backend_port}/api/endpoint"
+        async with httpx.AsyncClient() as client:
+            response = await client.method(url, json=data)
+            response.raise_for_status()
+            result = response.json()
+        # Return result
+        return TextContent(
+            text=f"Success message",
+            type="text"
+        )
+    except Exception as e:
+        logger.error(f"tool_name failed: {e}")
+        return TextContent(
+            text=f"Failed: {str(e)}",
+            type="text"
+        )
+```
+### 2. App Launch Pattern
+When launching apps, ensure ports are stored in the persistent context:
+```python
+# In launch_app tool
+app_info = await service_manager.launch_app(app_name)
+# Store ports in persistent context for later access
+try:
+    backend_port = service_manager.get_app_port(app_name)
+    frontend_port = service_manager.get_app_frontend_port(app_name)
+    persistent_ctx.set_app_ports(app_name, frontend_port, backend_port)
+except Exception as e:
+    logger.error(f"Failed to store ports: {e}")
+# Track app in persistent context
+persistent_ctx.add_running_app(app_name)
+```
+### 3. Import Organization
+Keep imports at module level:
+```python
+# At top of file
+import logging
+import httpx
+from mcp.types import TextContent
+from . import setup
+# Not inside functions
+```
+## Troubleshooting
+### "Cannot pickle 'coroutine' object"
+**Cause**: Trying to return an async function result through the proxy.
+**Fix**: Don't use async methods on proxied objects. Make direct HTTP calls instead.
+### "App not launched" errors
+**Cause**: State synchronization issue between ServiceManager and persistent context.
+**Fix**: Ensure `launch_app` stores app info in the persistent context, and setup/evaluate tools check the persistent context's app list.
+### "Object has no attribute" on proxy objects
+**Cause**: Direct attribute access on multiprocessing proxy objects.
+**Fix**: Use getter/setter methods instead of direct attribute access.
+## Development Workflow
+1. **Start the environment**: `hud dev`
+2. **Make changes**: Edit tools in `src/hud_controller/`
+3. **Test immediately**: The MCP server hot-reloads automatically
+4. **Check logs**: Look for serialization or proxy errors
+## Adding New Apps
+1. Create app directory in `apps/`
+2. Add setup tools in `src/hud_controller/setup/app_name.py`
+3. Add evaluate tools in `src/hud_controller/evaluate/app_name.py`
+4. Follow the HTTP pattern - no `call_app_api` usage
+5. Store app ports in persistent context when launching
+## Key Files
+- `context.py`: Persistent state management
+- `server.py`: MCP server and tool definitions
+- `services.py`: Process management for X11, VNC, apps
+- `setup/`: Setup tools organized by app
+- `evaluate/`: Evaluation tools organized by app
+Remember: When in doubt, make direct HTTP calls and store state in the persistent context!
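
Note on the "cannot pickle 'coroutine' object" pitfall described in the added README: values that cross a `multiprocessing` proxy are pickled, and a coroutine object cannot be. The sketch below reproduces that failure with `pickle` alone, without starting a manager process. The function `call_app_api` here is a hypothetical stand-in, not the environment's real method.

```python
import asyncio
import pickle
from typing import Any


async def call_app_api() -> dict[str, bool]:
    """Hypothetical async API call; stands in for a method on the proxied context."""
    return {"ok": True}


def try_pickle(obj: Any) -> str:
    """Return "ok" if obj can be pickled, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return "ok"
    except TypeError as exc:
        return str(exc)


# Calling an async function without awaiting it yields a coroutine object,
# which is what a proxy would have to serialize.
pending = call_app_api()
coroutine_outcome = try_pickle(pending)
pending.close()  # avoid a "never awaited" warning

# The awaited result is a plain dict and serializes without trouble.
awaited_outcome = try_pickle(asyncio.run(call_app_api()))

print(coroutine_outcome)
print(awaited_outcome)
```

This is why the README recommends fetching only plain values (such as a port number) from the persistent context and making the HTTP call in the local process.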

{hud_python-0.4.11 → hud_python-0.4.13}/environments/remote_browser/README.md RENAMED

@@ -52,10 +52,13 @@ hud dev . --build
 # - Provide HTTP endpoint for Cursor
 # - Auto-restart on file changes
 # - Pass through environment variables
+# - **Keep browser sessions alive across restarts**
 ```
 Add the URL from output to Cursor or click the deeplink.
+**Note**: With hot-reload enabled, your browser session persists across code changes. This means you can modify your code and the server will restart automatically without losing your browser state, tabs, or navigation history.
 #### Option 2: Manual Docker Run
 For direct control over the development environment:

{hud_python-0.4.11 → hud_python-0.4.13}/environments/remote_browser/pyproject.toml RENAMED

@@ -3,25 +3,20 @@ name = "hud-remote-browser"
 version = "0.1.0"
 description = "HUD Remote Browser Controller with MCP tools for cloud browser providers"
 requires-python = ">=3.11,<3.13"
-dependencies = [
-    "hud-python @ git+https://github.com/hud-evals/hud-python.git@l/text-2048",
-    "pyautogui",
-    "playwright",
-    "httpx",
-    "typer",
-    "google-api-python-client",
-    "google-auth",
-]
-[project.scripts]
-hud-remote-browser = "hud_controller.__main__:main"
+dependencies = [ "hud-python>=0.4.12", "pyautogui", "playwright", "httpx", "typer", "google-api-python-client", "google-auth",]
 [build-system]
-requires = ["hatchling"]
+requires = [ "hatchling",]
 build-backend = "hatchling.build"
-[tool.hatch.build.targets.wheel]
-packages = ["src/hud_controller"]
+[project.scripts]
+hud-remote-browser = "hud_controller.__main__:main"
+[tool.hud]
+image = "hud-remote-browser:dev"
 [tool.hatch.metadata]
-allow-direct-references = true
+allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = [ "src/hud_controller",]

hud_python-0.4.13/hud/__main__.py ADDED

@@ -0,0 +1,8 @@
+"""Allow running CLI with python -m hud."""
+from __future__ import annotations
+from hud.cli import main
+if __name__ == "__main__":
+    main()

{hud_python-0.4.11 → hud_python-0.4.13}/hud/agents/base.py RENAMED

@@ -306,7 +306,7 @@ class MCPAgent(ABC):
                         if decision == "STOP":
                             # Try to submit response through lifecycle tool
                             await self._maybe_submit_response(response, messages)
                             logger.info("Stopping execution")
                             final_response = response
                             break
@@ -487,7 +487,7 @@ class MCPAgent(ABC):
             self._available_tools.append(tool)
             # Simplified mapping - just tool name to tool
             self._tool_map[tool.name] = tool
             # Auto-detect response tool as a lifecycle tool
             if tool.name == "response" and "response" not in self.lifecycle_tools:
                 logger.debug("Auto-detected 'response' tool as a lifecycle tool")
@@ -495,7 +495,7 @@ class MCPAgent(ABC):
     async def _maybe_submit_response(self, response: AgentResponse, messages: list[Any]) -> None:
         """Submit response through lifecycle tool if available.
         Args:
             response: The agent's response
             messages: The current message history (will be modified in-place)
@@ -506,17 +506,16 @@ class MCPAgent(ABC):
             try:
                 # Call the response tool with the agent's response
                 response_tool_call = MCPToolCall(
-                    name="response",
-                    arguments={"response": response.content, "messages": messages}
+                    name="response", arguments={"response": response.content, "messages": messages}
                 )
                 response_results = await self.call_tools(response_tool_call)
                 # Format and add the response tool results to messages
                 response_messages = await self.format_tool_results(
                     [response_tool_call], response_results
                 )
                 messages.extend(response_messages)
                 # Mark the task as done
                 logger.info("Response lifecycle tool executed, marking task as done")
             except Exception as e:
@@ -579,7 +578,7 @@ class MCPAgent(ABC):
                 logger.warning("Failed to close auto-created trace: %s", e)
             finally:
                 self._auto_trace_cm = None
         # Clean up auto-created client
         if self._auto_created_client and self.mcp_client:
             try:

{hud_python-0.4.11 → hud_python-0.4.13}/hud/agents/langchain.py RENAMED

@@ -15,10 +15,10 @@ import hud
 if TYPE_CHECKING:
     from langchain.schema.language_model import BaseLanguageModel
     from langchain_core.tools import BaseTool
-    from mcp_use.adapters.langchain_adapter import LangChainAdapter
+    from mcp_use.adapters.langchain_adapter import LangChainAdapter  # type: ignore[attr-defined]
 try:
-    from mcp_use.adapters.langchain_adapter import LangChainAdapter
+    from mcp_use.adapters.langchain_adapter import LangChainAdapter  # type: ignore[attr-defined]
 except ImportError:
     LangChainAdapter = None  # type: ignore[misc, assignment]

{hud_python-0.4.11 → hud_python-0.4.13}/hud/agents/tests/test_openai.py RENAMED

@@ -17,7 +17,9 @@ class TestOperatorAgent:
     @pytest.fixture
     def mock_mcp_client(self):
         """Create a mock MCP client."""
-        mcp_client = MagicMock()
+        mcp_client = AsyncMock()
+        # Set up the mcp_config attribute as a regular dict, not a coroutine
+        mcp_client.mcp_config = {"test_server": {"url": "http://test"}}
         return mcp_client
     @pytest.fixture
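
Note on the `MagicMock` → `AsyncMock` fixture change: a minimal sketch of the pattern, using only `unittest.mock`. Child attributes of an `AsyncMock` are themselves awaitable mocks, so a plain-data attribute such as `mcp_config` has to be assigned explicitly, as the diff does. The method name `call_tool` and the function `use_client` are hypothetical and chosen for illustration.

```python
import asyncio
from unittest.mock import AsyncMock

# Mirror the fixture from the diff: an async client with a plain-dict config.
mcp_client = AsyncMock()
mcp_client.mcp_config = {"test_server": {"url": "http://test"}}

# Hypothetical async method; awaiting it yields the configured return value.
mcp_client.call_tool.return_value = "done"


async def use_client(client: AsyncMock) -> tuple[str, str]:
    """Read the plain config synchronously and await the mocked method."""
    url = client.mcp_config["test_server"]["url"]
    result = await client.call_tool("example")
    return url, result


outcome = asyncio.run(use_client(mcp_client))
print(outcome)
```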
