PyPI - hud-python - Versions diffs - 0.4.8__tar.gz → 0.4.10__tar.gz - Mend

hud-python 0.4.8tar.gz → 0.4.10tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of hud-python might be problematic. Click here for more details.

Files changed (160) hide show

{hud_python-0.4.8 → hud_python-0.4.10}/.gitignore RENAMED Viewed

@@ -42,4 +42,6 @@ CLAUDE.md
 # RL
 wandb/
-outputs/
+outputs/
+test/

{hud_python-0.4.8 → hud_python-0.4.10}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.4.8
+Version: 0.4.10
 Summary: SDK for the HUD platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-python
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues
@@ -38,6 +38,7 @@ Requires-Python: <3.14,>=3.11
 Requires-Dist: fastmcp>=2.11.2
 Requires-Dist: httpx<1,>=0.23.0
 Requires-Dist: hud-mcp-python-sdk>=0.1.0
+Requires-Dist: mcp>=1.13.1
 Requires-Dist: opentelemetry-api>=1.34.1
 Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.34.1
 Requires-Dist: opentelemetry-instrumentation-mcp>=0.44.1
@@ -61,6 +62,16 @@ Requires-Dist: langchain-anthropic; extra == 'agent'
 Requires-Dist: langchain-openai; extra == 'agent'
 Requires-Dist: numpy>=1.24.0; extra == 'agent'
 Requires-Dist: openai; extra == 'agent'
+Provides-Extra: agents
+Requires-Dist: anthropic; extra == 'agents'
+Requires-Dist: datasets>=2.14.0; extra == 'agents'
+Requires-Dist: dotenv>=0.9.9; extra == 'agents'
+Requires-Dist: hud-mcp-use-python-sdk>=0.1.0; extra == 'agents'
+Requires-Dist: langchain; extra == 'agents'
+Requires-Dist: langchain-anthropic; extra == 'agents'
+Requires-Dist: langchain-openai; extra == 'agents'
+Requires-Dist: numpy>=1.24.0; extra == 'agents'
+Requires-Dist: openai; extra == 'agents'
 Provides-Extra: dev
 Requires-Dist: aiodocker>=0.24.0; extra == 'dev'
 Requires-Dist: anthropic; extra == 'dev'

{hud_python-0.4.8 → hud_python-0.4.10}/environments/browser/README.md RENAMED Viewed

@@ -2,6 +2,8 @@
 A browser automation environment for the HUD platform demonstrating best practices for building MCP (Model Context Protocol) environments with evaluation systems.
+**Key Feature**: This environment is **hot-reloadable** - it maintains state (running services, browser sessions, launched apps) across server restarts during development.
 ## Quick Start
 ### Build & Deploy
@@ -14,6 +16,34 @@ docker build -t hud-browser .
 docker run --rm -i -p 8080:8080 hud-browser
 ```
+### Hot-Reloadable Architecture
+This environment uses a persistent context server architecture that maintains state across MCP server restarts:
+- **Context Server**: Runs as a separate process holding ServiceManager and state
+- **MCP Server**: Connects via Unix socket, can restart without losing services
+- **State Preservation**: X11, VNC, running apps, and service states persist
+- **Development Friendly**: Edit code and restart MCP server instantly
+#### Docker Architecture
+The environment uses a single CMD that follows the proven text_2048 pattern:
+```dockerfile
+CMD ["sh", "-c", "\
+    # Start services in background \
+    python -m hud_controller.context_server & \
+    x11vnc ... & \
+    # Run MCP server in foreground \
+    exec hud-controller mcp \
+"]
+```
+This pattern ensures:
+- Background services (`&`) start once and persist
+- Only the `exec` command gets wrapped by watchfiles
+- Services survive hot-reloads during development
 ## Deployment to Registry
 ### 1. Publish to Docker Registry
@@ -169,10 +199,11 @@ Set these in your environment/Docker configuration:
 ```
 Docker Container
-├── start.sh                 # Service startup orchestration
 ├── MCP Server (FastMCP)     # Protocol implementation
 │   ├── Tools                # setup, evaluate, computer, etc.
-│   └── Resources           # Dynamic registry discovery
+│   └── Resources            # Dynamic registry discovery
+├── Context Server           # Persistent state management
+│   └── PersistentContext    # Maintains services & browser state
 ├── Services
 │   ├── X11 (Xvfb)          # Virtual display
 │   ├── VNC + Websockify    # Remote access
@@ -188,8 +219,7 @@ Docker Container
 ```
 browser/
-├── Dockerfile              # Multi-stage build with optimization
-├── start.sh                # Service startup script
+├── Dockerfile              # Multi-stage build with integrated startup
 ├── apps/                   # Launchable web applications
 │   ├── todo/              # Example app with evaluation APIs
 │   └── 2048/              # 2048 game app
@@ -197,6 +227,8 @@ browser/
 │   ├── server.py          # FastMCP server + resource definitions
 │   ├── services.py        # Service management
 │   ├── context.py         # Environment context
+│   ├── context_server.py  # Persistent context server
+│   ├── persistent_context.py # State persistence wrapper
 │   ├── evaluators/        # Evaluation system
 │   ├── setup/            # Setup system
 │   └── problems/         # Problem definitions
@@ -205,7 +237,7 @@ browser/
 ## Development Workflow
-### Hot-Reload Development with `hud mcp`
+### Hot-Reload Development with `hud dev`
 For rapid iteration without Docker rebuilds:
@@ -214,7 +246,7 @@ For rapid iteration without Docker rebuilds:
 cd environments/browser
 # Start hot-reload development proxy
-hud mcp . --build
+hud dev . --build
 # This will:
 # - Build/use hud-browser:dev image
@@ -225,6 +257,21 @@ hud mcp . --build
 Add the URL from output to Cursor settings or click the deeplink. Now you can edit code in `src/` and changes apply instantly!
+#### How Hot-Reloading Works
+This environment uses a persistent context server pattern:
+1. **Context Server**: A separate Python process maintains state (services, browser, apps)
+2. **Socket Communication**: MCP server connects via Unix socket `/tmp/hud_browser_ctx.sock`
+3. **State Preservation**: X11, VNC, browser sessions, and launched apps persist across reloads
+4. **Automatic Recovery**: On reload, the server reconnects to existing services
+This means you can:
+- Edit code and have changes apply immediately
+- Keep browser sessions and apps running
+- Maintain VNC connections
+- Preserve test state between iterations
 ### Traditional Development Steps
 1. **Start with apps** - Build your web applications independently
@@ -392,4 +439,9 @@ When creating new MCP environments:
 6. **Update service dependencies** in `services.py` as needed
 7. **Extend Dockerfile** with your environment's requirements
+For hot-reloadability:
+- Keep complex objects out of the persistent context
+- Only store simple, picklable state
+- Recreate tools and clients on each server start
 See `src/hud_controller/README.md` for detailed implementation guidance.

{hud_python-0.4.8 → hud_python-0.4.10}/environments/browser/pyproject.toml RENAMED Viewed

@@ -3,25 +3,20 @@ name = "hud-controller"
 version = "0.1.0"
 description = "HUD Controller for browser environments with MCP tools"
 requires-python = ">=3.11,<3.14"
-dependencies = [
-    "hud-python @ git+https://github.com/hud-evals/hud-python.git@l/text-2048",
-    "playwright",
-    "pyautogui",
-    "httpx",
-    "typer",
-]
-[project.scripts]
-hud-controller = "hud_controller.__main__:main"
+dependencies = [ "hud-python", "playwright", "pyautogui", "httpx", "typer",]
 [build-system]
-requires = ["hatchling"]
+requires = [ "hatchling",]
 build-backend = "hatchling.build"
-[tool.hatch.build.targets.wheel]
-packages = ["src/hud_controller"]
+[project.scripts]
+hud-controller = "hud_controller.__main__:main"
+[tool.hud]
+image = "hud-browser:dev"
 [tool.hatch.metadata]
 allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = [ "src/hud_controller",]

{hud_python-0.4.8 → hud_python-0.4.10}/environments/browser/src/hud_controller/README.md RENAMED Viewed

@@ -55,7 +55,7 @@ class EvaluatorRegistry:
     def create_evaluator(cls, spec, context): pass
 ```
-### BrowserEnvironmentContext
+### BrowserContext
 Unified interface for environment interactions:
 - `call_app_api(app, endpoint, method, data)` - Call app backend API

{hud_python-0.4.8 → hud_python-0.4.10}/environments/remote_browser/README.md RENAMED Viewed

@@ -34,7 +34,7 @@ docker run --rm -i \
 Development mode allows you to edit code locally and see changes immediately without rebuilding.
-#### Option 1: Using `hud mcp` (Recommended)
+#### Option 1: Using `hud dev` (Recommended)
 The easiest way to develop with hot-reload:
@@ -44,7 +44,7 @@ export BROWSER_PROVIDER=anchorbrowser
 export ANCHOR_API_KEY=your-api-key
 # Start development proxy
-hud mcp . --build
+hud dev . --build
 # This will:
 # - Build/use hud-remote-browser:dev image

{hud_python-0.4.8 → hud_python-0.4.10}/environments/text_2048/README.md RENAMED Viewed

@@ -57,13 +57,13 @@ The agent will play 2048 and try to reach a target tile using the available tool
 ## Development Mode
-### Option 1: Using `hud mcp` (Recommended)
+### Option 1: Using `hud dev` (Recommended)
 The easiest way to develop with hot-reload:
 ```bash
 # Start development proxy
-hud mcp . --build
+hud dev . --build
 # This will:
 # - Build/use hud-text-2048:dev image

{hud_python-0.4.8 → hud_python-0.4.10}/hud/agents/base.py RENAMED Viewed

@@ -85,6 +85,7 @@ class MCPAgent(ABC):
         self._tool_map: dict[str, types.Tool] = {}  # Simplified: just name to tool
         self.screenshot_history: list[str] = []
         self._auto_trace = auto_trace
+        self._auto_trace_cm: Any | None = None  # Store auto-created trace context manager
         self.initialization_complete = False
         # Response agent to automatically interact with the model
@@ -303,6 +304,9 @@ class MCPAgent(ABC):
                             except Exception as e:
                                 logger.warning("ResponseAgent failed: %s", e)
                         if decision == "STOP":
+                            # Try to submit response through lifecycle tool
+                            await self._maybe_submit_response(response, messages)
                             logger.info("Stopping execution")
                             final_response = response
                             break
@@ -483,6 +487,40 @@ class MCPAgent(ABC):
             self._available_tools.append(tool)
             # Simplified mapping - just tool name to tool
             self._tool_map[tool.name] = tool
+            # Auto-detect response tool as a lifecycle tool
+            if tool.name == "response" and "response" not in self.lifecycle_tools:
+                logger.debug("Auto-detected 'response' tool as a lifecycle tool")
+                self.lifecycle_tools.append("response")
+    async def _maybe_submit_response(self, response: AgentResponse, messages: list[Any]) -> None:
+        """Submit response through lifecycle tool if available.
+        Args:
+            response: The agent's response
+            messages: The current message history (will be modified in-place)
+        """
+        # Check if we have a response lifecycle tool
+        if "response" in self.lifecycle_tools and "response" in self._tool_map:
+            logger.debug("Calling response lifecycle tool")
+            try:
+                # Call the response tool with the agent's response
+                response_tool_call = MCPToolCall(
+                    name="response",
+                    arguments={"response": response.content, "messages": messages}
+                )
+                response_results = await self.call_tools(response_tool_call)
+                # Format and add the response tool results to messages
+                response_messages = await self.format_tool_results(
+                    [response_tool_call], response_results
+                )
+                messages.extend(response_messages)
+                # Mark the task as done
+                logger.info("Response lifecycle tool executed, marking task as done")
+            except Exception as e:
+                logger.error("Response lifecycle tool failed: %s", e)
     async def _setup_config(self, mcp_config: dict[str, dict[str, Any]]) -> None:
         """Inject metadata into the metadata of the initialize request."""
@@ -491,7 +529,7 @@ class MCPAgent(ABC):
                 mcp_config,
                 MCPConfigPatch(meta=self.metadata),
             )
-        setup_hud_telemetry(mcp_config, auto_trace=self._auto_trace)
+        self._auto_trace_cm = setup_hud_telemetry(mcp_config, auto_trace=self._auto_trace)
     def get_available_tools(self) -> list[types.Tool]:
         """Get list of available MCP tools for LLM use (excludes lifecycle tools)."""
@@ -532,6 +570,17 @@ class MCPAgent(ABC):
     async def _cleanup(self) -> None:
         """Cleanup resources."""
+        # Clean up auto-created trace if any
+        if self._auto_trace_cm:
+            try:
+                self._auto_trace_cm.__exit__(None, None, None)
+                logger.info("Closed auto-created trace")
+            except Exception as e:
+                logger.warning("Failed to close auto-created trace: %s", e)
+            finally:
+                self._auto_trace_cm = None
+        # Clean up auto-created client
         if self._auto_created_client and self.mcp_client:
             try:
                 await self.mcp_client.shutdown()

{hud_python-0.4.8 → hud_python-0.4.10}/hud/cli/__init__.py RENAMED Viewed

@@ -23,9 +23,11 @@ from .clone import clone_repository, get_clone_message, print_error, print_tutor
 from .cursor import get_cursor_config_path, list_cursor_servers, parse_cursor_config
 from .debug import debug_mcp_stdio
 from .init import create_environment
+from . import list_func as list_module
 from .mcp_server import run_mcp_dev_server
 from .pull import pull_command
 from .push import push_command
+from .remove import remove_command
 from .utils import CaptureLogger
 # Create the main Typer app
@@ -129,7 +131,7 @@ def analyze(
 def debug(
     params: list[str] = typer.Argument(  # type: ignore[arg-type]  # noqa: B008
         None,
-        help="Docker image followed by optional Docker run arguments (e.g., 'hud-image:latest -e KEY=value')",  # noqa: E501
+        help="Docker image, environment directory, or config file followed by optional Docker arguments",  # noqa: E501
     ),
     config: Path = typer.Option(  # noqa: B008
         None,
@@ -145,6 +147,12 @@ def debug(
         "--cursor",
         help="Debug a server from Cursor config",
     ),
+    build: bool = typer.Option(
+        False,
+        "--build",
+        "-b",
+        help="Build image before debugging (for directory mode)",
+    ),
     max_phase: int = typer.Option(
         5,
         "--max-phase",
@@ -157,15 +165,24 @@ def debug(
     """🐛 Debug MCP environment - test initialization, tools, and readiness.
     Examples:
-        hud debug hud-text-2048:latest
-        hud debug my-mcp-server:v1 -e API_KEY=xxx -p 8080:8080
+        hud debug .                              # Debug current directory
+        hud debug environments/browser           # Debug specific directory
+        hud debug . --build                      # Build then debug
+        hud debug hud-text-2048:latest          # Debug Docker image
+        hud debug my-mcp-server:v1 -e API_KEY=xxx
         hud debug --config mcp-config.json
         hud debug --cursor text-2048-dev
-        hud debug hud-browser:dev --max-phase 3
+        hud debug . --max-phase 3               # Stop after phase 3
     """
+    # Import here to avoid circular imports
+    from .env_utils import get_image_name, is_environment_directory, build_environment, image_exists
+    from hud.utils.design import HUDDesign
+    design = HUDDesign()
     # Determine the command to run
     command = None
+    docker_args = []
     if config:
         # Load config from JSON file
@@ -183,13 +200,44 @@ def debug(
             console.print(f"[red]❌ {error or 'Failed to parse cursor config'}[/red]")
             raise typer.Exit(1)
     elif params:
-        image, *docker_args = params
-        # Build Docker command
-        command = ["docker", "run", "--rm", "-i", *docker_args, image]
+        first_param = params[0]
+        docker_args = params[1:] if len(params) > 1 else []
+        # Check if it's a directory
+        if Path(first_param).exists() and is_environment_directory(first_param):
+            # Directory mode - like hud dev
+            directory = first_param
+            # Get or generate image name
+            image_name, source = get_image_name(directory)
+            if source == "auto":
+                design.info(f"Auto-generated image name: {image_name}")
+            # Build if requested or if image doesn't exist
+            if build or not image_exists(image_name):
+                if not build and not image_exists(image_name):
+                    if typer.confirm(f"Image {image_name} not found. Build it now?"):
+                        build = True
+                    else:
+                        raise typer.Exit(1)
+                if build:
+                    if not build_environment(directory, image_name):
+                        raise typer.Exit(1)
+            # Build Docker command
+            command = ["docker", "run", "--rm", "-i", *docker_args, image_name]
+        else:
+            # Assume it's an image name
+            image = first_param
+            command = ["docker", "run", "--rm", "-i", *docker_args, image]
     else:
-        console.print("[red]Error: Must specify either a Docker image, --config, or --cursor[/red]")
+        console.print("[red]Error: Must specify a directory, Docker image, --config, or --cursor[/red]")
         console.print("\nExamples:")
-        console.print("  hud debug hud-text-2048:latest")
+        console.print("  hud debug .                      # Debug current directory")
+        console.print("  hud debug environments/browser   # Debug specific directory")
+        console.print("  hud debug hud-text-2048:latest  # Debug Docker image")
         console.print("  hud debug --config mcp-config.json")
         console.print("  hud debug --cursor my-server")
         raise typer.Exit(1)
@@ -442,7 +490,8 @@ def run(
         # Get URL from options or environment
         if not url:
-            url = os.getenv("HUD_MCP_URL", "https://mcp.hud.so/v3/mcp")
+            from hud.settings import settings
+            url = settings.hud_mcp_url
         run_remote_server(image, docker_args, transport, port, url, api_key, run_id, verbose)
@@ -561,6 +610,63 @@ def pull(
     pull_command(target, lock_file, yes, verify_only, verbose)
+@app.command(name="list")
+def list_environments(
+    filter_name: str | None = typer.Option(
+        None, "--filter", "-f", help="Filter environments by name (case-insensitive)"
+    ),
+    json_output: bool = typer.Option(
+        False, "--json", help="Output as JSON"
+    ),
+    show_all: bool = typer.Option(
+        False, "--all", "-a", help="Show all columns including digest"
+    ),
+    verbose: bool = typer.Option(
+        False, "--verbose", "-v", help="Show detailed output"
+    ),
+) -> None:
+    """📋 List all HUD environments in local registry.
+    Shows environments pulled with 'hud pull' stored in ~/.hud/envs/
+    Examples:
+        hud list                    # List all environments
+        hud list --filter text      # Filter by name
+        hud list --json            # Output as JSON
+        hud list --all             # Show digest column
+        hud list --verbose         # Show full descriptions
+    """
+    list_module.list_command(filter_name, json_output, show_all, verbose)
+@app.command()
+def remove(
+    target: str | None = typer.Argument(
+        None,
+        help="Environment to remove (digest, name, or 'all' for all environments)"
+    ),
+    yes: bool = typer.Option(
+        False, "--yes", "-y", help="Skip confirmation prompt"
+    ),
+    verbose: bool = typer.Option(
+        False, "--verbose", "-v", help="Show detailed output"
+    ),
+) -> None:
+    """🗑️ Remove HUD environments from local registry.
+    Removes environment metadata from ~/.hud/envs/
+    Note: This does not remove the Docker images.
+    Examples:
+        hud remove abc123              # Remove by digest
+        hud remove text_2048           # Remove by name
+        hud remove hudpython/test_init # Remove by full name
+        hud remove all                 # Remove all environments
+        hud remove all --yes           # Remove all without confirmation
+    """
+    remove_command(target, yes, verbose)
 @app.command()
 def init(
     name: str = typer.Argument(None, help="Environment name (default: current directory name)"),
@@ -592,6 +698,76 @@ def quickstart() -> None:
     clone("https://github.com/hud-evals/quickstart.git")
+@app.command()
+def eval(
+    source: str = typer.Argument(
+        ...,
+        help="HuggingFace dataset identifier (e.g. 'hud-evals/SheetBench-50') or task JSON file",
+    ),
+    full: bool = typer.Option(
+        False,
+        "--full",
+        help="Run the entire dataset (omit for single-task debug mode)",
+    ),
+    agent: str = typer.Option(
+        "claude",
+        "--agent",
+        help="Agent backend to use (claude or openai)",
+    ),
+    model: str | None = typer.Option(
+        None,
+        "--model",
+        help="Model name for the chosen agent",
+    ),
+    allowed_tools: str | None = typer.Option(
+        None,
+        "--allowed-tools",
+        help="Comma-separated list of allowed tools",
+    ),
+    max_concurrent: int = typer.Option(
+        30,
+        "--max-concurrent",
+        help="Concurrency level for full-dataset mode",
+    ),
+    max_steps: int = typer.Option(
+        30,
+        "--max-steps",
+        help="Maximum steps per task (default: 10 for single, 50 for full)",
+    ),
+) -> None:
+    """🚀 Run evaluation on datasets or individual tasks with agents."""
+    # Validate agent choice
+    valid_agents = ["claude", "openai"]
+    if agent not in valid_agents:
+        from hud.utils.design import HUDDesign
+        design = HUDDesign()
+        design.error(f"Invalid agent: {agent}. Must be one of: {', '.join(valid_agents)}")
+        raise typer.Exit(1)
+    # Import eval_command lazily to avoid importing agent dependencies
+    try:
+        from .eval import eval_command
+    except ImportError as e:
+        from hud.utils.design import HUDDesign
+        design = HUDDesign()
+        design.error(
+            "Evaluation dependencies are not installed. "
+            "Please install with: pip install 'hud-python[agent]'"
+        )
+        raise typer.Exit(1) from e
+    # Run the command
+    eval_command(
+        source=source,
+        full=full,
+        agent=agent,  # type: ignore
+        model=model,
+        allowed_tools=allowed_tools,
+        max_concurrent=max_concurrent,
+        max_steps=max_steps,
+    )
 def main() -> None:
     """Main entry point for the CLI."""
     # Show header for main help

hud-python 0.4.8__tar.gz → 0.4.10__tar.gz

Potentially problematic release.

hud-python 0.4.8tar.gz → 0.4.10tar.gz