PyPI - hud-python - Versions diffs - 0.4.48__tar.gz → 0.4.49__tar.gz - Mend

hud-python 0.4.48__tar.gz → 0.4.49__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of hud-python might be problematic.

Files changed (261)

{hud_python-0.4.48 → hud_python-0.4.49}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.4.48
+Version: 0.4.49
 Summary: SDK for the HUD platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-python
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues

{hud_python-0.4.48 → hud_python-0.4.49}/environments/README.md RENAMED

@@ -156,24 +156,24 @@ For Python-based MCP environments, use this standard structure:
 ```
 my-environment/
 ├── Dockerfile
-├── pyproject.toml          # Package definition with dependencies
-├── README.md               # Environment documentation
-└── src/
-    └── my_module/          # Your Python package
-        ├── __init__.py
-        ├── server.py       # MCP server entry point
-        ├── context.py      # Core stateful environment logic (optional)
-        ├── tools/          # Interactive tools (move, click, type, etc.)
-        │   ├── __init__.py
-        │   └── move.py     # Example: custom tool inheriting from BaseTool
-        ├── setup/          # Setup functions (modular approach)
-        │   ├── __init__.py # Creates SetupTool instance & exports decorator
-        │   ├── basic.py    # Basic setup functions
-        │   └── advanced.py # Advanced setup functions
-        └── evaluate/       # Evaluator functions (modular approach)
-            ├── __init__.py # Creates EvaluateTool instance & exports decorator
-            ├── checks.py   # Basic evaluation checks
-            └── metrics.py  # Advanced metrics evaluators
+├── README.md
+├── server/                 # MCP server package
+│   ├── pyproject.toml      # MCP dependencies (hud-python, etc.)
+│   ├── __init__.py         # Empty package marker
+│   ├── main.py             # mcp = MCPServer() + lifecycle hooks
+│   ├── tools.py            # router = MCPRouter() + @router.tool decorators
+│   ├── setup/              # Setup router (modular approach)
+│   │   ├── __init__.py
+│   │   ├── basic.py        # Basic setup functions
+│   │   └── advanced.py     # Advanced setup functions
+│   └── evaluate/           # Evaluate router (modular approach)
+│       ├── __init__.py
+│       ├── checks.py       # Basic evaluation checks
+│       └── metrics.py      # Advanced metrics evaluators
+└── environment/            # Backend service package
+    ├── pyproject.toml      # Backend dependencies (fastapi, uvicorn)
+    ├── __init__.py
+    └── server.py           # FastAPI app with /health, /act, /reset, /state
 ```
 This structure enables:
@@ -607,51 +607,62 @@ Once all of the above works you can unleash *hundreds* of concurrent agents on y
 ## Phase 5 – Hot-Reload Development
-To enable rapid development without Docker rebuilds, we can mount the source code and use hot-reload. The HUD CLI provides a built-in development proxy that handles all the complexity:
+For rapid local development, run the controller and environment servers separately. This enables instant code updates without Docker rebuilds.
+### Development Setup
+You'll need **two terminal windows** for local development:
+#### Terminal 1: MCP Server
 ```bash
-# Navigate to your environment directory
-cd environments/my-environment
+cd environments/my-environment/server
+hud dev                  # Auto-detects and runs with hot-reload
-# Start the development proxy with hot-reload
-hud dev --build
+# Optional flags:
+hud dev --inspector      # Launch MCP Inspector
+hud dev --interactive    # Launch interactive testing mode
+hud dev --stdio          # Use stdio transport (default: HTTP)
+hud dev --watch ../shared  # Watch additional directories
+```
+The `hud dev` command:
+- Auto-detects the MCP module in the current directory
+- Watches for file changes and reloads automatically
+- Runs on HTTP by default (http://localhost:8765/mcp)
+- Can launch MCP Inspector for testing tools
+- Can launch interactive mode for manual testing
-# Output:
-# 📦 Using cached image: hud-my-environment:dev
-# "hud-my-environment": {
-#   "url": "http://localhost:8765/mcp"
-# }
-# ✨ Add to Cursor: cursor://anysphere.cursor-deeplink/mcp/install?name=...
-# 🌐 Reloading proxy live, press Ctrl+C to stop
+#### Terminal 2: Environment Server (Backend)
+```bash
+cd environments/my-environment/environment
+uvicorn server:app --reload  # Standard uvicorn with hot-reload
 ```
-This command:
-- Auto-detects or builds your Docker image with `:dev` tag
-- Mounts `./src` to `/app/src` for instant code updates
-- Uses watchfiles to monitor file changes and restart automatically
-- Exposes an HTTP endpoint for Cursor integration
-- Caches the image name in `pyproject.toml` for faster subsequent runs
+For the backend, we simply use `uvicorn` directly since it already provides excellent hot-reload capabilities.
+### Development Workflow
+1. Start both servers in separate terminals
+2. Edit code in either `server/` or `environment/` - changes reload automatically
+3. Test changes immediately without rebuilding Docker images
+4. Use MCP Inspector or interactive mode to test tools
+5. When ready, build the complete Docker image: `hud build`
-#### Quick Cursor Setup
+### Quick Cursor Setup
-Either click the deeplink URL from the output, or manually add to `.cursor/mcp.json`:
+Add to `.cursor/mcp.json` (or use the deeplink from `hud dev` output):
 ```json
 {
   "mcpServers": {
-    "hud-my-environment": {
+    "my-environment-dev": {
       "url": "http://localhost:8765/mcp"
     }
   }
 }
 ```
-### Development Workflow
-1. Keep `hud dev` running in one terminal - it automatically handles reloads
-2. Edit your code in `src/` - changes take effect immediately
-3. Test changes in another terminal with `hud analyze` or the interactive mode
-4. Use Cursor/Claude to iterate quickly on your environment
+**Note**: Make sure both MCP server and environment backend are running when using with Cursor or agents.
 ### Process Separation for Stateful Environments
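
Reviewer note: the `server/` plus `environment/` layout introduced in this hunk can be sanity-checked with a short script. The following is a minimal sketch, not part of hud-python: `EXPECTED_FILES` and `missing_layout_files` are hypothetical names, and the `setup/` and `evaluate/` sub-packages are left out on the assumption that the "modular approach" is optional.

```python
from pathlib import Path

# Files named in the 0.4.49 layout from environments/README.md.
# Assumption: setup/ and evaluate/ are optional, so they are not listed here.
EXPECTED_FILES = [
    "Dockerfile",
    "README.md",
    "server/pyproject.toml",
    "server/__init__.py",
    "server/main.py",
    "server/tools.py",
    "environment/pyproject.toml",
    "environment/__init__.py",
    "environment/server.py",
]


def missing_layout_files(env_root: str) -> list[str]:
    """Return the expected files that are absent under env_root."""
    root = Path(env_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_layout_files("environments/my-environment")` returns an empty list when every file in the tree above is present.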

hud_python-0.4.49/environments/blank/README.md ADDED

@@ -0,0 +1,121 @@
+# Blank Environment
+Minimal starter template for building HUD environments.
+See [docs](https://docs.hud.so/build-environments) for the complete environment design workflow.
+## Architecture
+**`environment/`** - Produces structured data
+- Owns all state (game logic, browser sessions, databases, etc.)
+- Exposes HTTP endpoints `/health`, `/act`, `/reset`, `/state` that return structured information about the environment state
+**`server/`** - Wraps data in MCP tools
+- Calls environment endpoints to get structured data for the agent, and environment setup/evaluation
+- Agents and tasks interact only with these tools!
+**Why separate?** Edit tools for the agent or tasks without restarting the heavy environment backend.
+## Development
+```bash
+# Terminal 1 - Environment backend
+cd environment
+uv run uvicorn server:app --reload
+# Terminal 2 - MCP server
+cd server
+uv run hud dev
+```
+Uncomment the `setup` tool in `server/tools.py`, save, and watch it reload.
+Visit http://localhost:8765/docs to see the new tool appear instantly.
+In general, we recommend starting work on the environment backend first, then developing the MCP server to expose the right things to the agent.
+For complex environments that require many dependencies, we recommend running `hud dev` in the environment root:
+```bash
+cd ..
+hud dev
+```
+## Tasks & Evaluation
+```bash
+# Build first in the global folder with the Dockerfile (creates blank:0.1.0)
+hud build
+```
+Your `tasks.json` uses `docker run` to launch the environment:
+```json
+{
+  "prompt": "Your task prompt",
+  "mcp_config": {
+    "local": {
+      "command": "docker",
+      "args": ["run", "--rm", "-i", "blank:0.1.0"]
+    }
+  }
+}
+```
+**Commands:**
+```bash
+# Build first
+hud build
+# Test task locally
+hud eval tasks.json
+# Push environment for remote running
+hud push
+# Production RL training
+hud rl tasks.json  # Auto-converts docker→remote, builds & pushes if needed
+```
+## Publishing Your Environment
+Once your environment is ready, you can share it with the community:
+### 1. Push to Registry
+```bash
+# Build and push your environment (requires docker hub login and hud api key)
+hud build
+hud push
+```
+### 2. Create a Dataset
+Create a dataset on HuggingFace with your tasks:
+**Option A: Upload manually**
+1. Upload your `tasks.json` to HuggingFace
+2. Make sure it's **public** to appear on leaderboards
+**Option B: Use the SDK**
+```python
+from hud.datasets import save_tasks
+import json
+# Load your tasks
+with open("tasks.json") as f:
+    tasks = json.load(f)
+# Push to HuggingFace
+save_tasks(tasks, repo_id="your-org/your-dataset")
+```
+### 3. Run and Track Performance
+```bash
+# Run Claude on your benchmark
+hud eval "your-org/your-dataset" --agent claude
+# View results at:
+# hud.so/leaderboards/your-org/your-dataset
+```
+**Note**: Only public HuggingFace datasets appear as leaderboards!
+📚 Learn more: [Creating Benchmarks](https://docs.hud.so/evaluate-agents/create-benchmarks) | [Leaderboards](https://docs.hud.so/evaluate-agents/leaderboards)
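
Reviewer note: the `tasks.json` entry shown in this README has a small, regular shape (`prompt`, then `mcp_config` with a named server holding `command` and `args`). Below is a minimal sketch of a pre-flight check for that shape. It only covers the fields visible in the example above; it is not HUD's actual task schema, and `check_task` is a hypothetical helper.

```python
# Mirrors the example entry from the blank environment README.
EXAMPLE_TASK = {
    "prompt": "Your task prompt",
    "mcp_config": {
        "local": {
            "command": "docker",
            "args": ["run", "--rm", "-i", "blank:0.1.0"],
        }
    },
}


def check_task(task: dict) -> list[str]:
    """Return a list of problems found in one task entry (empty if none)."""
    problems = []
    prompt = task.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt must be a non-empty string")

    mcp_config = task.get("mcp_config")
    if not isinstance(mcp_config, dict) or not mcp_config:
        problems.append("mcp_config must be a non-empty object")
        return problems

    # Assumption: every server entry uses the command/args form shown above.
    for name, server in mcp_config.items():
        if not isinstance(server, dict) or not isinstance(server.get("command"), str):
            problems.append(f"mcp_config.{name}.command must be a string")
            continue
        args = server.get("args", [])
        if not isinstance(args, list) or not all(isinstance(a, str) for a in args):
            problems.append(f"mcp_config.{name}.args must be a list of strings")
    return problems
```

Running `check_task` on each entry after `json.load` can catch malformed entries before `hud eval tasks.json` starts a container.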

{hud_python-0.4.48 → hud_python-0.4.49}/environments/blank/environment/README.md RENAMED

@@ -10,7 +10,7 @@ Endpoints (FastAPI)
 Run (dev)
 ```bash
-uv run uvicorn environment.server:app --reload --port 8005
+uv run uvicorn server:app --reload --port 8005
 ```
 Principle: treat like a backend. Keep long‑lived state here; add endpoints as tools need them.

hud_python-0.4.49/environments/blank/environment/pyproject.toml ADDED

@@ -0,0 +1,16 @@
+[project]
+name = "blank-environment"
+version = "0.1.0"
+description = "Backend service for blank environment"
+requires-python = ">=3.11"
+dependencies = [
+    "fastapi",
+    "uvicorn[standard]",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.build.targets.wheel]
+packages = ["."]

hud_python-0.4.49/environments/blank/server/README.md ADDED

@@ -0,0 +1,21 @@
+# MCP Server
+MCP layer that wraps environment data in tools for agent interaction.
+## Structure
+- `main.py` - Server initialization, imports routers
+- `tools.py` - MCP tools that call environment HTTP endpoints
+## Development
+```bash
+# Start MCP server with hot-reload
+uv run hud dev
+```
+## Key Principles
+- Keep tools thin - call environment HTTP endpoints
+- Use routers for organization
+- All long-lived state lives in `environment/`, not here
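
Reviewer note: "keep tools thin" means a tool does little more than forward a call to the environment backend and return its JSON. The sketch below shows that shape using only the standard library. Several details are assumptions rather than the template's code: the template depends on `httpx` and registers tools with `@router.tool`, both omitted here; the base URL reuses the `--port 8005` from `environment/README.md`; and `/act` accepting a JSON `POST` body is a guess.

```python
import json
import urllib.request

# Assumption: backend started with `uvicorn server:app --reload --port 8005`.
ENV_URL = "http://localhost:8005"


def build_act_request(payload: dict, base_url: str = ENV_URL) -> urllib.request.Request:
    """Build the HTTP request a thin tool would send to the backend's /act."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/act",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: /act is a POST endpoint
    )


def act(payload: dict, base_url: str = ENV_URL, timeout: float = 10.0) -> dict:
    """Forward the payload to the backend and return its decoded JSON reply."""
    request = build_act_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The tool holds no state of its own, so reloading `server/` never disturbs whatever the backend is tracking.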

hud_python-0.4.49/environments/blank/server/pyproject.toml ADDED

@@ -0,0 +1,19 @@
+[project]
+name = "blank-server"
+version = "0.1.0"
+description = "MCP server for blank environment"
+requires-python = ">=3.11"
+dependencies = [
+    "hud-python>=0.4.49",
+    "httpx>=0.28.1",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.metadata]
+allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = ["."]

{hud_python-0.4.48 → hud_python-0.4.49}/environments/browser/README.md RENAMED

@@ -1,40 +1,39 @@
 # Browser Environment
-A browser automation environment for HUD that provides GUI access and web app interaction capabilities. This environment supports hot-reloading during development while maintaining persistent state.
+Browser automation environment with GUI access for testing web applications. Includes sample apps (2048, Todo) and supports hot-reload development.
-## Quick Start
+## Architecture
-### Interactive Development
-```bash
-# 1. Configure your API keys (optional - only needed for evaluation)
-# Edit .env file to add your HUD_API_KEY and ANTHROPIC_API_KEY
-# 2. Start the environment (optional: with inspector)
-hud dev --build --inspector
+**`environment/`** - Produces structured data
+- FastAPI backend with X11/VNC services (Linux-only)
+- Launches and manages web apps (Next.js frontends + Python backends)
+- Exposes HTTP endpoints for app control and state
-# 3. Choose your preferred way to test:
+**`server/`** - Wraps data in MCP tools
+- Browser automation tools (Playwright, computer vision)
+- Setup tools (launch apps, seed data)
+- Evaluation tools (check game state, todo completion)
-# Option A: Run the task with Claude (requires ANTHROPIC_API_KEY)
-hud eval tasks.json --agent claude
+**Why separate?** The environment backend requires X11/VNC/Chromium (Docker-only). The MCP server tools can be edited with hot-reload, while the heavy environment stays running.
-# Option B: Interactive notebook test_env.ipynb (great for learning!)
-# Requires installation:
-pip install hud-python[agents]
-# Option C: Simple Python script (runs all tasks from tasks.json)
-python test_task.py
-```
+## Development
-## How HUD Environments Work
+This environment **requires Docker** due to X11/VNC dependencies.
-The environment is split into two components:
+```bash
+# Build first (creates hud-browser:0.1.0)
+hud build
-- **`env.py`** - Stateful logic that persists across reloads
-- **`server.py`** - MCP server with tools (reloads on file changes)
+# Start with hot-reload
+hud dev
+```
-This separation is crucial for `hud dev` - it allows you to modify the MCP tools and see changes immediately without losing the environment state. The environment runs as a separate process and communicates via socket, while the server can be restarted freely.
+When you run `hud dev` in an environment with a Dockerfile, it automatically:
+- Detects Docker mode is needed
+- Mounts `server/` and `environment/` as volumes
+- Enables hot-reload for both layers
-If you are ever seeing issues with the environment itself, running `hud dev --full-reload` will reload both the environment and the server.
+Edit files in `server/` or `environment/` and they reload inside the container!
 ## Publishing Your Environment

hud_python-0.4.49/environments/browser/environment/pyproject.toml ADDED

@@ -0,0 +1,23 @@
+[project]
+name = "hud-browser-environment"
+version = "0.1.0"
+description = "HUD Browser Environment Backend"
+requires-python = ">=3.11,<3.14"
+dependencies = [
+    "fastapi>=0.104.1",
+    "uvicorn[standard]>=0.24.0",
+    "python-multipart>=0.0.6",
+    "pydantic>=2.6,<3",
+    "pydantic-settings>=2.2,<3",
+    "httpx",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.metadata]
+allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = ["environment"]

hud_python-0.4.49/environments/browser/server/pyproject.toml ADDED

@@ -0,0 +1,21 @@
+[project]
+name = "hud-browser-server"
+version = "0.1.0"
+description = "HUD Browser MCP Server"
+requires-python = ">=3.11,<3.14"
+dependencies = [
+    "hud-python@git+https://github.com/hud-evals/hud-python@cli-dev",
+    "httpx",
+    "playwright",
+    "pyautogui",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.metadata]
+allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = ["server"]

hud_python-0.4.49/environments/deepresearch/README.md ADDED

@@ -0,0 +1,165 @@
+# Deep Research Environment
+Web research environment powered by Exa API for searching and fetching content.
+See [docs](https://docs.hud.so/build-environments) for the complete environment design workflow.
+## Architecture
+**`environment/`** - Manages Exa API integration and state
+- Holds the Exa API key server-side
+- Exposes HTTP endpoints `/search`, `/fetch`, `/answer`, `/evaluate` for research workflows
+- Implements exponential backoff for rate limiting
+**`server/`** - Wraps data in MCP tools
+- Provides `search()`, `fetch()`, `answer()`, `evaluate()` tools for agents
+- Agents and tasks interact only with these tools
+**Why separate?** Edit tools for the agent or tasks without restarting the environment backend.
+## Tools
+- **`search(query: str)`** - Search the web using Exa API, returns list of results with titles and URLs
+- **`fetch(url: str)`** - Fetch full content from a URL, returns summary, highlights, and text
+- **`answer(final_answer: str)`** - Submit the final research answer
+- **`evaluate(expected_answer: str)`** - Evaluate submitted answer against expected result
+## Setup
+### Requirements
+- Exa API key (get one at [exa.ai](https://exa.ai))
+### Environment Variables
+```bash
+export EXA_API_KEY="your_exa_api_key_here"
+```
+## Development
+```bash
+# Terminal 1 - Environment backend
+cd environment
+export EXA_API_KEY="your_key"
+uv run uvicorn server:app --reload
+# Terminal 2 - MCP server
+cd server
+uv run hud dev
+```
+The environment includes exponential backoff for rate limiting, so API calls will automatically retry on 429 errors.
+In general, we recommend starting work on the environment backend first, then developing the MCP server to expose the right things to the agent.
+For complex environments that require many dependencies, we recommend running `hud dev` in the environment root:
+```bash
+cd ..
+export EXA_API_KEY="your_key"
+hud dev
+```
+## Tasks & Evaluation
+```bash
+# Build first in the global folder with the Dockerfile (creates deepresearch:0.1.0)
+hud build
+```
+Your `tasks.json` uses `docker run` to launch the environment:
+```json
+{
+  "prompt": "Research and answer: What is the capital of France?",
+  "mcp_config": {
+    "local": {
+      "command": "docker",
+      "args": ["run", "--rm", "-i", "-e", "EXA_API_KEY", "deepresearch:0.1.0"]
+    }
+  },
+  "evaluator": {
+    "tool_name": "evaluate",
+    "tool_params": {
+      "expected_answer": "Paris"
+    }
+  }
+}
+```
+**Note:** The `-e EXA_API_KEY` flag passes your local API key to the container.
+**Commands:**
+```bash
+# Build first
+hud build
+# Test task locally
+export EXA_API_KEY="your_key"
+hud eval tasks.json
+# Push environment for remote running
+hud push
+# Production RL training
+hud rl tasks.json  # Auto-converts docker→remote, builds & pushes if needed
+```
+## Publishing Your Environment
+Once your environment is ready, you can share it with the community:
+### 1. Push to Registry
+```bash
+# Build and push your environment (requires docker hub login and hud api key)
+hud build
+hud push
+```
+### 2. Create a Dataset
+Create a dataset on HuggingFace with your tasks:
+**Option A: Upload manually**
+1. Upload your `tasks.json` to HuggingFace
+2. Make sure it's **public** to appear on leaderboards
+**Option B: Use the SDK**
+```python
+from hud.datasets import save_tasks
+import json
+# Load your tasks
+with open("tasks.json") as f:
+    tasks = json.load(f)
+# Push to HuggingFace
+save_tasks(tasks, repo_id="your-org/your-dataset")
+```
+### 3. Run and Track Performance
+```bash
+# Run Claude on your benchmark
+hud eval "your-org/your-dataset" --agent claude
+# View results at:
+# hud.so/leaderboards/your-org/your-dataset
+```
+**Note**: Only public HuggingFace datasets appear as leaderboards!
+📚 Learn more: [Creating Benchmarks](https://docs.hud.so/evaluate-agents/create-benchmarks) | [Leaderboards](https://docs.hud.so/evaluate-agents/leaderboards)
+## Example Research Workflow
+```python
+# Agent searches for information
+results = search("latest AI developments 2024")
+# Agent fetches detailed content from top result
+content = fetch(results[0]["url"])
+# Agent submits final answer
+answer("Based on research, AI developments in 2024 include...")
+# Evaluate answer
+result = evaluate(expected_answer="AI developments")
+```
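
Reviewer note: this README says the environment retries on 429 responses with exponential backoff but does not show how. The following is a generic sketch of that technique, not the environment's actual implementation. `call_with_backoff`, its parameters, and the delay schedule are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-429 status or retries run out.

    send() is a hypothetical callable returning (status_code, body).
    """
    attempt = 0
    while True:
        status, body = send()
        if status != 429 or attempt >= max_retries:
            return status, body
        # Double the wait after each rate-limited attempt: 1s, 2s, 4s, ...
        sleep(base_delay * 2 ** attempt)
        attempt += 1
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting. A production version would typically also add jitter and honor a `Retry-After` header if the API sends one.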

hud_python-0.4.49/environments/deepresearch/environment/pyproject.toml ADDED

@@ -0,0 +1,17 @@
+[project]
+name = "deepresearch-environment"
+version = "0.1.0"
+description = "Backend service for DeepResearch environment"
+requires-python = ">=3.11"
+dependencies = [
+    "fastapi>=0.104.1",
+    "uvicorn[standard]>=0.24.0",
+    "httpx>=0.24.0",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.build.targets.wheel]
+packages = ["environment"]

{hud_python-0.4.48 → hud_python-0.4.49}/environments/deepresearch/pyproject.toml RENAMED

@@ -3,7 +3,7 @@ name = "deepresearch"
 version = "0.1.0"
 description = "DeepResearch HUD environment with HTTP backend (EXA on server)"
 requires-python = ">=3.11"
-dependencies = [ "hud-python==0.4.41", "fastapi>=0.104.1", "uvicorn[standard]>=0.24.0", "httpx>=0.24.0",]
+dependencies = [ "hud-python==0.4.42", "fastapi>=0.104.1", "uvicorn[standard]>=0.24.0", "httpx>=0.24.0",]
 [build-system]
 requires = [ "hatchling",]

hud_python-0.4.49/environments/deepresearch/server/pyproject.toml ADDED

@@ -0,0 +1,19 @@
+[project]
+name = "deepresearch-mcp"
+version = "0.1.0"
+description = "MCP server for DeepResearch environment"
+requires-python = ">=3.11"
+dependencies = [
+    "hud-python>=0.4.49",
+    "httpx>=0.24.0",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.metadata]
+allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = ["mcp"]
