PyPI - hud-python - Versions diffs - 0.4.8__tar.gz → 0.4.9__tar.gz - Mend

hud-python 0.4.8 tar.gz → 0.4.9 tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of hud-python might be problematic.

Files changed (159)

{hud_python-0.4.8 → hud_python-0.4.9}/.gitignore RENAMED

@@ -42,4 +42,6 @@ CLAUDE.md
 # RL
 wandb/
-outputs/
+outputs/
+test/

{hud_python-0.4.8 → hud_python-0.4.9}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.4.8
+Version: 0.4.9
 Summary: SDK for the HUD platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-python
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues
@@ -38,6 +38,7 @@ Requires-Python: <3.14,>=3.11
 Requires-Dist: fastmcp>=2.11.2
 Requires-Dist: httpx<1,>=0.23.0
 Requires-Dist: hud-mcp-python-sdk>=0.1.0
+Requires-Dist: mcp>=1.13.1
 Requires-Dist: opentelemetry-api>=1.34.1
 Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.34.1
 Requires-Dist: opentelemetry-instrumentation-mcp>=0.44.1
@@ -61,6 +62,16 @@ Requires-Dist: langchain-anthropic; extra == 'agent'
 Requires-Dist: langchain-openai; extra == 'agent'
 Requires-Dist: numpy>=1.24.0; extra == 'agent'
 Requires-Dist: openai; extra == 'agent'
+Provides-Extra: agents
+Requires-Dist: anthropic; extra == 'agents'
+Requires-Dist: datasets>=2.14.0; extra == 'agents'
+Requires-Dist: dotenv>=0.9.9; extra == 'agents'
+Requires-Dist: hud-mcp-use-python-sdk>=0.1.0; extra == 'agents'
+Requires-Dist: langchain; extra == 'agents'
+Requires-Dist: langchain-anthropic; extra == 'agents'
+Requires-Dist: langchain-openai; extra == 'agents'
+Requires-Dist: numpy>=1.24.0; extra == 'agents'
+Requires-Dist: openai; extra == 'agents'
 Provides-Extra: dev
 Requires-Dist: aiodocker>=0.24.0; extra == 'dev'
 Requires-Dist: anthropic; extra == 'dev'

{hud_python-0.4.8 → hud_python-0.4.9}/environments/browser/README.md RENAMED

@@ -2,6 +2,8 @@
 A browser automation environment for the HUD platform demonstrating best practices for building MCP (Model Context Protocol) environments with evaluation systems.
+**Key Feature**: This environment is **hot-reloadable** - it maintains state (running services, browser sessions, launched apps) across server restarts during development.
 ## Quick Start
 ### Build & Deploy
@@ -14,6 +16,34 @@ docker build -t hud-browser .
 docker run --rm -i -p 8080:8080 hud-browser
 ```
+### Hot-Reloadable Architecture
+This environment uses a persistent context server architecture that maintains state across MCP server restarts:
+- **Context Server**: Runs as a separate process holding ServiceManager and state
+- **MCP Server**: Connects via Unix socket, can restart without losing services
+- **State Preservation**: X11, VNC, running apps, and service states persist
+- **Development Friendly**: Edit code and restart MCP server instantly
+#### Docker Architecture
+The environment uses a single CMD that follows the proven text_2048 pattern:
+```dockerfile
+CMD ["sh", "-c", "\
+    # Start services in background \
+    python -m hud_controller.context_server & \
+    x11vnc ... & \
+    # Run MCP server in foreground \
+    exec hud-controller mcp \
+"]
+```
+This pattern ensures:
+- Background services (`&`) start once and persist
+- Only the `exec` command gets wrapped by watchfiles
+- Services survive hot-reloads during development
 ## Deployment to Registry
 ### 1. Publish to Docker Registry
@@ -169,10 +199,11 @@ Set these in your environment/Docker configuration:
 ```
 Docker Container
-├── start.sh                 # Service startup orchestration
 ├── MCP Server (FastMCP)     # Protocol implementation
 │   ├── Tools                # setup, evaluate, computer, etc.
-│   └── Resources           # Dynamic registry discovery
+│   └── Resources            # Dynamic registry discovery
+├── Context Server           # Persistent state management
+│   └── PersistentContext    # Maintains services & browser state
 ├── Services
 │   ├── X11 (Xvfb)          # Virtual display
 │   ├── VNC + Websockify    # Remote access
@@ -188,8 +219,7 @@ Docker Container
 ```
 browser/
-├── Dockerfile              # Multi-stage build with optimization
-├── start.sh                # Service startup script
+├── Dockerfile              # Multi-stage build with integrated startup
 ├── apps/                   # Launchable web applications
 │   ├── todo/              # Example app with evaluation APIs
 │   └── 2048/              # 2048 game app
@@ -197,6 +227,8 @@ browser/
 │   ├── server.py          # FastMCP server + resource definitions
 │   ├── services.py        # Service management
 │   ├── context.py         # Environment context
+│   ├── context_server.py  # Persistent context server
+│   ├── persistent_context.py # State persistence wrapper
 │   ├── evaluators/        # Evaluation system
 │   ├── setup/            # Setup system
 │   └── problems/         # Problem definitions
@@ -205,7 +237,7 @@ browser/
 ## Development Workflow
-### Hot-Reload Development with `hud mcp`
+### Hot-Reload Development with `hud dev`
 For rapid iteration without Docker rebuilds:
@@ -214,7 +246,7 @@ For rapid iteration without Docker rebuilds:
 cd environments/browser
 # Start hot-reload development proxy
-hud mcp . --build
+hud dev . --build
 # This will:
 # - Build/use hud-browser:dev image
@@ -225,6 +257,21 @@ hud mcp . --build
 Add the URL from output to Cursor settings or click the deeplink. Now you can edit code in `src/` and changes apply instantly!
+#### How Hot-Reloading Works
+This environment uses a persistent context server pattern:
+1. **Context Server**: A separate Python process maintains state (services, browser, apps)
+2. **Socket Communication**: MCP server connects via Unix socket `/tmp/hud_browser_ctx.sock`
+3. **State Preservation**: X11, VNC, browser sessions, and launched apps persist across reloads
+4. **Automatic Recovery**: On reload, the server reconnects to existing services
+This means you can:
+- Edit code and have changes apply immediately
+- Keep browser sessions and apps running
+- Maintain VNC connections
+- Preserve test state between iterations
 ### Traditional Development Steps
 1. **Start with apps** - Build your web applications independently
@@ -392,4 +439,9 @@ When creating new MCP environments:
 6. **Update service dependencies** in `services.py` as needed
 7. **Extend Dockerfile** with your environment's requirements
+For hot-reloadability:
+- Keep complex objects out of the persistent context
+- Only store simple, picklable state
+- Recreate tools and clients on each server start
 See `src/hud_controller/README.md` for detailed implementation guidance.
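
The hot-reloadability guidance in this README diff ("only store simple, picklable state") can be checked mechanically. A minimal sketch, assuming a plain dict stands in for the persistent context; `is_picklable` and the sample keys are hypothetical and not part of hud-python:

```python
import pickle
import threading


def is_picklable(value: object) -> bool:
    """Return True if `value` survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(value))
    except Exception:
        return False
    return True


# Hypothetical state: plain strings and lists are safe to persist.
simple_state = {"display": ":1", "launched_apps": ["todo", "2048"]}

# Hypothetical state holding a live object (a lock) that cannot be pickled.
complex_state = {"display": ":1", "lock": threading.Lock()}

print(is_picklable(simple_state))
print(is_picklable(complex_state))
```

Locks, sockets, and live clients typically fail this kind of check, which is consistent with the advice to recreate tools and clients on each server start.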

{hud_python-0.4.8 → hud_python-0.4.9}/environments/browser/pyproject.toml RENAMED

@@ -3,25 +3,20 @@ name = "hud-controller"
 version = "0.1.0"
 description = "HUD Controller for browser environments with MCP tools"
 requires-python = ">=3.11,<3.14"
-dependencies = [
-    "hud-python @ git+https://github.com/hud-evals/hud-python.git@l/text-2048",
-    "playwright",
-    "pyautogui",
-    "httpx",
-    "typer",
-]
-[project.scripts]
-hud-controller = "hud_controller.__main__:main"
+dependencies = [ "hud-python", "playwright", "pyautogui", "httpx", "typer",]
 [build-system]
-requires = ["hatchling"]
+requires = [ "hatchling",]
 build-backend = "hatchling.build"
-[tool.hatch.build.targets.wheel]
-packages = ["src/hud_controller"]
+[project.scripts]
+hud-controller = "hud_controller.__main__:main"
+[tool.hud]
+image = "hud-browser:dev"
 [tool.hatch.metadata]
 allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = [ "src/hud_controller",]

{hud_python-0.4.8 → hud_python-0.4.9}/environments/browser/src/hud_controller/README.md RENAMED

@@ -55,7 +55,7 @@ class EvaluatorRegistry:
     def create_evaluator(cls, spec, context): pass
 ```
-### BrowserEnvironmentContext
+### BrowserContext
 Unified interface for environment interactions:
 - `call_app_api(app, endpoint, method, data)` - Call app backend API

{hud_python-0.4.8 → hud_python-0.4.9}/environments/remote_browser/README.md RENAMED

@@ -34,7 +34,7 @@ docker run --rm -i \
 Development mode allows you to edit code locally and see changes immediately without rebuilding.
-#### Option 1: Using `hud mcp` (Recommended)
+#### Option 1: Using `hud dev` (Recommended)
 The easiest way to develop with hot-reload:
@@ -44,7 +44,7 @@ export BROWSER_PROVIDER=anchorbrowser
 export ANCHOR_API_KEY=your-api-key
 # Start development proxy
-hud mcp . --build
+hud dev . --build
 # This will:
 # - Build/use hud-remote-browser:dev image

{hud_python-0.4.8 → hud_python-0.4.9}/environments/text_2048/README.md RENAMED

@@ -57,13 +57,13 @@ The agent will play 2048 and try to reach a target tile using the available tool
 ## Development Mode
-### Option 1: Using `hud mcp` (Recommended)
+### Option 1: Using `hud dev` (Recommended)
 The easiest way to develop with hot-reload:
 ```bash
 # Start development proxy
-hud mcp . --build
+hud dev . --build
 # This will:
 # - Build/use hud-text-2048:dev image

{hud_python-0.4.8 → hud_python-0.4.9}/hud/agents/base.py RENAMED

@@ -85,6 +85,7 @@ class MCPAgent(ABC):
         self._tool_map: dict[str, types.Tool] = {}  # Simplified: just name to tool
         self.screenshot_history: list[str] = []
         self._auto_trace = auto_trace
+        self._auto_trace_cm: Any | None = None  # Store auto-created trace context manager
         self.initialization_complete = False
         # Response agent to automatically interact with the model
@@ -303,6 +304,9 @@ class MCPAgent(ABC):
                             except Exception as e:
                                 logger.warning("ResponseAgent failed: %s", e)
                         if decision == "STOP":
+                            # Try to submit response through lifecycle tool
+                            await self._maybe_submit_response(response, messages)
                             logger.info("Stopping execution")
                             final_response = response
                             break
@@ -483,6 +487,40 @@ class MCPAgent(ABC):
             self._available_tools.append(tool)
             # Simplified mapping - just tool name to tool
             self._tool_map[tool.name] = tool
+            # Auto-detect response tool as a lifecycle tool
+            if tool.name == "response" and "response" not in self.lifecycle_tools:
+                logger.debug("Auto-detected 'response' tool as a lifecycle tool")
+                self.lifecycle_tools.append("response")
+    async def _maybe_submit_response(self, response: AgentResponse, messages: list[Any]) -> None:
+        """Submit response through lifecycle tool if available.
+        Args:
+            response: The agent's response
+            messages: The current message history (will be modified in-place)
+        """
+        # Check if we have a response lifecycle tool
+        if "response" in self.lifecycle_tools and "response" in self._tool_map:
+            logger.debug("Calling response lifecycle tool")
+            try:
+                # Call the response tool with the agent's response
+                response_tool_call = MCPToolCall(
+                    name="response",
+                    arguments={"response": response.content, "messages": messages}
+                )
+                response_results = await self.call_tools(response_tool_call)
+                # Format and add the response tool results to messages
+                response_messages = await self.format_tool_results(
+                    [response_tool_call], response_results
+                )
+                messages.extend(response_messages)
+                # Mark the task as done
+                logger.info("Response lifecycle tool executed, marking task as done")
+            except Exception as e:
+                logger.error("Response lifecycle tool failed: %s", e)
     async def _setup_config(self, mcp_config: dict[str, dict[str, Any]]) -> None:
         """Inject metadata into the metadata of the initialize request."""
@@ -491,7 +529,7 @@ class MCPAgent(ABC):
                 mcp_config,
                 MCPConfigPatch(meta=self.metadata),
             )
-        setup_hud_telemetry(mcp_config, auto_trace=self._auto_trace)
+        self._auto_trace_cm = setup_hud_telemetry(mcp_config, auto_trace=self._auto_trace)
     def get_available_tools(self) -> list[types.Tool]:
         """Get list of available MCP tools for LLM use (excludes lifecycle tools)."""
@@ -532,6 +570,17 @@ class MCPAgent(ABC):
     async def _cleanup(self) -> None:
         """Cleanup resources."""
+        # Clean up auto-created trace if any
+        if self._auto_trace_cm:
+            try:
+                self._auto_trace_cm.__exit__(None, None, None)
+                logger.info("Closed auto-created trace")
+            except Exception as e:
+                logger.warning("Failed to close auto-created trace: %s", e)
+            finally:
+                self._auto_trace_cm = None
+        # Clean up auto-created client
         if self._auto_created_client and self.mcp_client:
             try:
                 await self.mcp_client.shutdown()
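
The `_auto_trace_cm` change in this file stores a context manager and later closes it during cleanup by calling `__exit__` directly. A minimal sketch of that pattern, assuming the stored object was already entered by whoever created it (the diff does not show this); `fake_trace` and `Agent` are hypothetical stand-ins, not hud-python APIs:

```python
from contextlib import contextmanager

events: list[str] = []


@contextmanager
def fake_trace(name: str):
    """Hypothetical trace: records when it is opened and closed."""
    events.append(f"open:{name}")
    try:
        yield name
    finally:
        events.append(f"close:{name}")


class Agent:
    """Hypothetical agent that owns an auto-created trace."""

    def __init__(self) -> None:
        self._auto_trace_cm = None

    def setup(self) -> None:
        cm = fake_trace("auto")
        cm.__enter__()  # entered manually, so it must be exited manually
        self._auto_trace_cm = cm

    def cleanup(self) -> None:
        if self._auto_trace_cm:
            try:
                self._auto_trace_cm.__exit__(None, None, None)
            finally:
                # Reset the handle so a second cleanup call is a no-op.
                self._auto_trace_cm = None


agent = Agent()
agent.setup()
agent.cleanup()
agent.cleanup()  # safe: nothing left to close
print(events)
```

Resetting the handle in `finally` mirrors the diff: even if closing fails, the agent does not try to close the same trace twice.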

{hud_python-0.4.8 → hud_python-0.4.9}/hud/cli/__init__.py RENAMED

@@ -23,10 +23,13 @@ from .clone import clone_repository, get_clone_message, print_error, print_tutor
 from .cursor import get_cursor_config_path, list_cursor_servers, parse_cursor_config
 from .debug import debug_mcp_stdio
 from .init import create_environment
+from . import list_func as list_module
 from .mcp_server import run_mcp_dev_server
 from .pull import pull_command
 from .push import push_command
+from .remove import remove_command
 from .utils import CaptureLogger
+from .eval import eval_command
 # Create the main Typer app
 app = typer.Typer(
@@ -442,7 +445,8 @@ def run(
         # Get URL from options or environment
         if not url:
-            url = os.getenv("HUD_MCP_URL", "https://mcp.hud.so/v3/mcp")
+            from hud.settings import settings
+            url = settings.hud_mcp_url
         run_remote_server(image, docker_args, transport, port, url, api_key, run_id, verbose)
@@ -561,6 +565,63 @@ def pull(
     pull_command(target, lock_file, yes, verify_only, verbose)
+@app.command(name="list")
+def list_environments(
+    filter_name: str | None = typer.Option(
+        None, "--filter", "-f", help="Filter environments by name (case-insensitive)"
+    ),
+    json_output: bool = typer.Option(
+        False, "--json", help="Output as JSON"
+    ),
+    show_all: bool = typer.Option(
+        False, "--all", "-a", help="Show all columns including digest"
+    ),
+    verbose: bool = typer.Option(
+        False, "--verbose", "-v", help="Show detailed output"
+    ),
+) -> None:
+    """📋 List all HUD environments in local registry.
+    Shows environments pulled with 'hud pull' stored in ~/.hud/envs/
+    Examples:
+        hud list                    # List all environments
+        hud list --filter text      # Filter by name
+        hud list --json            # Output as JSON
+        hud list --all             # Show digest column
+        hud list --verbose         # Show full descriptions
+    """
+    list_module.list_command(filter_name, json_output, show_all, verbose)
+@app.command()
+def remove(
+    target: str | None = typer.Argument(
+        None,
+        help="Environment to remove (digest, name, or 'all' for all environments)"
+    ),
+    yes: bool = typer.Option(
+        False, "--yes", "-y", help="Skip confirmation prompt"
+    ),
+    verbose: bool = typer.Option(
+        False, "--verbose", "-v", help="Show detailed output"
+    ),
+) -> None:
+    """🗑️ Remove HUD environments from local registry.
+    Removes environment metadata from ~/.hud/envs/
+    Note: This does not remove the Docker images.
+    Examples:
+        hud remove abc123              # Remove by digest
+        hud remove text_2048           # Remove by name
+        hud remove hudpython/test_init # Remove by full name
+        hud remove all                 # Remove all environments
+        hud remove all --yes           # Remove all without confirmation
+    """
+    remove_command(target, yes, verbose)
 @app.command()
 def init(
     name: str = typer.Argument(None, help="Environment name (default: current directory name)"),
@@ -592,6 +653,64 @@ def quickstart() -> None:
     clone("https://github.com/hud-evals/quickstart.git")
+@app.command()
+def eval(
+    source: str = typer.Argument(
+        ...,
+        help="HuggingFace dataset identifier (e.g. 'hud-evals/SheetBench-50') or task JSON file",
+    ),
+    full: bool = typer.Option(
+        False,
+        "--full",
+        help="Run the entire dataset (omit for single-task debug mode)",
+    ),
+    agent: str = typer.Option(
+        "claude",
+        "--agent",
+        help="Agent backend to use (claude or openai)",
+    ),
+    model: str | None = typer.Option(
+        None,
+        "--model",
+        help="Model name for the chosen agent",
+    ),
+    allowed_tools: str | None = typer.Option(
+        None,
+        "--allowed-tools",
+        help="Comma-separated list of allowed tools",
+    ),
+    max_concurrent: int = typer.Option(
+        30,
+        "--max-concurrent",
+        help="Concurrency level for full-dataset mode",
+    ),
+    max_steps: int = typer.Option(
+        30,
+        "--max-steps",
+        help="Maximum steps per task (default: 10 for single, 50 for full)",
+    ),
+) -> None:
+    """🚀 Run evaluation on datasets or individual tasks with agents."""
+    # Validate agent choice
+    valid_agents = ["claude", "openai"]
+    if agent not in valid_agents:
+        from hud.utils.design import HUDDesign
+        design = HUDDesign()
+        design.error(f"Invalid agent: {agent}. Must be one of: {', '.join(valid_agents)}")
+        raise typer.Exit(1)
+    # Import and run the command
+    eval_command(
+        source=source,
+        full=full,
+        agent=agent,  # type: ignore
+        model=model,
+        allowed_tools=allowed_tools,
+        max_concurrent=max_concurrent,
+        max_steps=max_steps,
+    )
 def main() -> None:
     """Main entry point for the CLI."""
     # Show header for main help

{hud_python-0.4.8 → hud_python-0.4.9}/hud/cli/analyze_metadata.py RENAMED

@@ -12,6 +12,8 @@ from rich.progress import Progress, SpinnerColumn, TextColumn
 from hud.settings import settings
 from hud.utils.design import HUDDesign
+from .registry import get_registry_dir, list_registry_entries, extract_digest_from_image, load_from_registry
 console = Console()
 design = HUDDesign()
@@ -50,38 +52,31 @@ def fetch_lock_from_registry(reference: str) -> dict | None:
 def check_local_cache(reference: str) -> dict | None:
     """Check local cache for lock file."""
-    # Extract digest if present
-    if "@sha256:" in reference:
-        digest = reference.split("@sha256:")[-1][:12]
-    elif "/" in reference:
-        # Try to find by name pattern
-        cache_dir = Path.home() / ".hud" / "envs"
-        if cache_dir.exists():
-            # Look for any cached version of this image
-            for env_dir in cache_dir.iterdir():
-                if env_dir.is_dir():
-                    lock_file = env_dir / "hud.lock.yaml"
-                    if lock_file.exists():
-                        with open(lock_file) as f:
-                            lock_data = yaml.safe_load(f)
-                        # Check if this matches our reference
-                        if lock_data and "image" in lock_data:
-                            image = lock_data["image"]
-                            # Match by name (ignoring tag/digest)
-                            ref_base = reference.split("@")[0].split(":")[0]
-                            img_base = image.split("@")[0].split(":")[0]
-                            if ref_base in img_base or img_base in ref_base:
-                                return lock_data
-        return None
-    else:
-        digest = "latest"
-    # Check specific digest directory
-    lock_file = Path.home() / ".hud" / "envs" / digest / "hud.lock.yaml"
-    if lock_file.exists():
-        with open(lock_file) as f:
-            return yaml.safe_load(f)
+    # First try exact digest match
+    digest = extract_digest_from_image(reference)
+    lock_data = load_from_registry(digest)
+    if lock_data:
+        return lock_data
+    # If not found and reference has a name, search by name pattern
+    if "/" in reference:
+        # Look for any cached version of this image
+        ref_base = reference.split("@")[0].split(":")[0]
+        for digest, lock_file in list_registry_entries():
+            try:
+                with open(lock_file) as f:
+                    lock_data = yaml.safe_load(f)
+                # Check if this matches our reference
+                if lock_data and "image" in lock_data:
+                    image = lock_data["image"]
+                    # Match by name (ignoring tag/digest)
+                    img_base = image.split("@")[0].split(":")[0]
+                    if ref_base in img_base or img_base in ref_base:
+                        return lock_data
+            except Exception:
+                continue
     return None
@@ -147,15 +142,8 @@ async def analyze_from_metadata(reference: str, output_format: str, verbose: boo
                 source = "registry"
                 # Save to local cache for next time
-                if "@sha256:" in lock_data.get("image", ""):
-                    digest = lock_data["image"].split("@sha256:")[-1][:12]
-                else:
-                    digest = "latest"
-                cache_dir = Path.home() / ".hud" / "envs" / digest
-                cache_dir.mkdir(parents=True, exist_ok=True)
-                with open(cache_dir / "hud.lock.yaml", "w") as f:  # noqa: ASYNC230
-                    yaml.dump(lock_data, f, default_flow_style=False, sort_keys=False)
+                from .registry import save_to_registry
+                save_to_registry(lock_data, lock_data.get("image", ""), verbose=False)
             else:
                 progress.update(task, description="[red]✗ Not found[/red]")
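
The name-based fallback in `check_local_cache` strips the tag and digest from both references and then compares them by substring. A minimal sketch of that matching rule in isolation; `base_name` and `names_match` are hypothetical helper names introduced here, and the image references are made-up examples:

```python
def base_name(reference: str) -> str:
    """Drop everything after the first '@' and then after the first ':'."""
    return reference.split("@")[0].split(":")[0]


def names_match(reference: str, image: str) -> bool:
    """Substring comparison in both directions, as in the diff."""
    ref_base = base_name(reference)
    img_base = base_name(image)
    return ref_base in img_base or img_base in ref_base


print(base_name("hudpython/text_2048:latest"))
print(names_match("hudpython/text_2048:v1", "docker.io/hudpython/text_2048@sha256:abc123"))
# Substring matching is permissive: a shorter name matches a longer one.
print(names_match("hudpython/text", "hudpython/text_2048:latest"))
# Splitting on the first ':' also truncates references that carry a registry port.
print(base_name("localhost:5000/env:dev"))
```

The last two calls illustrate edge cases of this rule: prefix-like names match each other, and a registry host with a port is cut at the port separator.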

{hud_python-0.4.8 → hud_python-0.4.9}/hud/cli/build.py RENAMED

@@ -17,6 +17,8 @@ from hud.clients import MCPClient
 from hud.utils.design import HUDDesign
 from hud.version import __version__ as hud_version
+from .registry import save_to_registry
 def parse_version(version_str: str) -> tuple[int, int, int]:
     """Parse version string like '1.0.0' or '1.0' into tuple of integers."""
@@ -459,6 +461,11 @@ def build_environment(
     # Remove temp image after we're done
     subprocess.run(["docker", "rmi", temp_tag], capture_output=True)  # noqa: S603, S607
+    # Add to local registry
+    if image_id:
+        # Save to local registry using the helper
+        save_to_registry(lock_content, lock_content.get("image", tag), verbose)
     # Print summary
     design.section_title("Build Complete")

{hud_python-0.4.8 → hud_python-0.4.9}/hud/cli/debug.py RENAMED

@@ -167,7 +167,14 @@ async def debug_mcp_stdio(command: list[str], logger: CaptureLogger, max_phase:
                         break
                 except Exception as e:
                     logger.error(f"Failed to parse MCP response: {e}")
-                    continue
+                    logger.error(f"Raw output that caused the error: {repr(line)}")
+                    logger.hint("This usually means non-JSON output is being sent to STDOUT")
+                    logger.hint("Common causes:")
+                    logger.hint("  - Print statements in your server code")
+                    logger.hint("  - Library warnings (use warnings.filterwarnings)")
+                    logger.hint("  - Import-time output from dependencies")
+                    phases_completed = 1  # Mark as failed
+                    break  # Stop trying to parse
         if response and "result" in response:
             logger.success("MCP server initialized successfully")
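
The `debug.py` change stops at the first line of server output that is not valid JSON instead of skipping it, on the reasoning that stray STDOUT text corrupts the stdio transport. A minimal sketch of that fail-fast behavior, assuming one JSON message per line; `first_json_response` is a hypothetical helper, not the function in the diff:

```python
import json


def first_json_response(lines):
    """Return (response, bad_line) for the first non-empty line.

    Parsing stops at the first non-JSON line instead of skipping it.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            return json.loads(line), None
        except json.JSONDecodeError:
            return None, line
    return None, None


clean = ['{"jsonrpc": "2.0", "id": 1, "result": {}}']
noisy = ["Loading model...", '{"jsonrpc": "2.0", "id": 1, "result": {}}']

print(first_json_response(clean))
print(first_json_response(noisy))
```

In the noisy case the offending line is returned to the caller, which is the information the new `logger.error` call surfaces with `repr(line)`.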
