PyPI - hud-python - Versions diffs - 0.2.5__tar.gz → 0.2.7__tar.gz - Mend

hud-python 0.2.5__tar.gz → 0.2.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of hud-python might be problematic.

Files changed (173)

{hud_python-0.2.5 → hud_python-0.2.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.2.5
+Version: 0.2.7
 Summary: SDK for the HUD evaluation platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-sdk
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-sdk/issues
@@ -47,6 +47,7 @@ Requires-Dist: langchain-openai
 Requires-Dist: mcp
 Requires-Dist: numpy
 Requires-Dist: openai
+Requires-Dist: pathspec>=0.12.1
 Requires-Dist: pillow>=11.1.0
 Requires-Dist: pydantic-settings<3,>=2
 Requires-Dist: pydantic<3,>=2
@@ -61,7 +62,7 @@ Requires-Dist: ipython<9; extra == 'dev'
 Requires-Dist: jupyter-client; extra == 'dev'
 Requires-Dist: jupyter-core; extra == 'dev'
 Requires-Dist: openai; extra == 'dev'
-Requires-Dist: pyright==1.1.364; extra == 'dev'
+Requires-Dist: pyright==1.1.401; extra == 'dev'
 Requires-Dist: pytest-asyncio; extra == 'dev'
 Requires-Dist: pytest-cov; extra == 'dev'
 Requires-Dist: pytest-mock; extra == 'dev'
@@ -74,23 +75,23 @@ Description-Content-Type: text/markdown
 </div>
 <h3>
-Create, evaluate, and improve AI agents across web browsers, desktop environments, and custom scenarios.
+Evaluate your Computer Use AI agents across web browsers, desktop environments, and custom scenarios.
 </h3>
-> ### 🚀 Are you a startup building agents?
->
-> [📅 Hop on a call ](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
->
-> We're here to help with eval strategies, custom environments, or improving your agent architecture!
+### 🚀 Are you a startup building agents?
+[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
-> **Early Release Notice**: This SDK is currently in early release status. The API is evolving and may change in future releases as we gather feedback and improve functionality.
+We're here to help with eval strategies, custom environments, or improving your agent architecture!
+> **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-sdk/issues), as the SDK is still evolving!
 [![PyPI version](https://img.shields.io/pypi/v/hud-python)](https://pypi.org/project/hud-python/)
 ## ✨ What You Can Do
-**Evaluate Existing Benchmarks**
+**[Evaluate Existing Benchmarks](https://docs.hud.so/examples/benchmarking-agents)**
 ```python
 from hud import load_taskset, run_job, ClaudeAgent
@@ -98,7 +99,7 @@ taskset = await load_taskset("WebVoyager")  # or GAIA, OSWorld-Ubuntu, Mind2Web
 job = await run_job(ClaudeAgent, taskset, "my-evaluation")
 ```
-**Create Custom Tasks**
+**[Create Custom Tasks](https://docs.hud.so/task-creation)**
 ```python
 from hud.task import Task
@@ -110,7 +111,7 @@ task = Task(
 )
 ```
-**Build Custom Environments**
+**[Build Custom Environments](https://docs.hud.so/environment-creation)**
 ```python
 from hud.types import CustomGym
@@ -123,7 +124,7 @@ custom_gym = CustomGym(
 # Or create complex Docker environments - see environments/ folder for examples
 ```
-**Trace Tool Calls Alongside HUD Environments (or Independently)**
+**[Trace Tool Calls Alongside HUD Environments (or Independently)](https://docs.hud.so/examples/mcp-agent-tracing)**
 ```python
 import hud
@@ -132,23 +133,23 @@ with hud.trace("my-agent-run"):
     result = await agent.run(task)
 ```
-## API Key Setup
-Before getting started, you'll need to obtain an API key:
+## Quick Start
-1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
-2. Set it in your environment or .env file:
+### Installation
 ```bash
-export HUD_API_KEY=your_api_key_here
+pip install hud-python
 ```
-## Quick Start
+### API Key Setup
-### Installation
+Before getting started, you'll need to obtain an API key:
+1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
+2. Set it in your environment or .env file:
 ```bash
-pip install hud-python
+export HUD_API_KEY=your_api_key_here
 ```
 ### Simple Browser Example with Claude Computer Use
@@ -171,6 +172,7 @@ async def main():
         setup=("goto", "google.com"),
         evaluate=("contains_text", "capybara")
     )
+    print(f"Running task with prompt: {task.prompt}")
     # Create environment using the gym module
     env = await gym.make(task)
@@ -182,6 +184,7 @@ async def main():
     obs, _ = await env.reset() # Gets first observation
     for i in range(5):
         actions, done = await agent.predict(obs)
+        print(f"Agent action {i}: {actions}")
         obs, reward, terminated, info = await env.step(actions)
         if done or terminated: break
@@ -269,4 +272,4 @@ If you use this SDK in your research, please cite it as follows:
   url = {https://github.com/hud-evals/hud-sdk},
   langid = {en}
 }
-```
+```

{hud_python-0.2.5 → hud_python-0.2.7}/README.md RENAMED

@@ -3,23 +3,23 @@
 </div>
 <h3>
-Create, evaluate, and improve AI agents across web browsers, desktop environments, and custom scenarios.
+Evaluate your Computer Use AI agents across web browsers, desktop environments, and custom scenarios.
 </h3>
-> ### 🚀 Are you a startup building agents?
->
-> [📅 Hop on a call ](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
->
-> We're here to help with eval strategies, custom environments, or improving your agent architecture!
+### 🚀 Are you a startup building agents?
+[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
-> **Early Release Notice**: This SDK is currently in early release status. The API is evolving and may change in future releases as we gather feedback and improve functionality.
+We're here to help with eval strategies, custom environments, or improving your agent architecture!
+> **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-sdk/issues), as the SDK is still evolving!
 [![PyPI version](https://img.shields.io/pypi/v/hud-python)](https://pypi.org/project/hud-python/)
 ## ✨ What You Can Do
-**Evaluate Existing Benchmarks**
+**[Evaluate Existing Benchmarks](https://docs.hud.so/examples/benchmarking-agents)**
 ```python
 from hud import load_taskset, run_job, ClaudeAgent
@@ -27,7 +27,7 @@ taskset = await load_taskset("WebVoyager")  # or GAIA, OSWorld-Ubuntu, Mind2Web
 job = await run_job(ClaudeAgent, taskset, "my-evaluation")
 ```
-**Create Custom Tasks**
+**[Create Custom Tasks](https://docs.hud.so/task-creation)**
 ```python
 from hud.task import Task
@@ -39,7 +39,7 @@ task = Task(
 )
 ```
-**Build Custom Environments**
+**[Build Custom Environments](https://docs.hud.so/environment-creation)**
 ```python
 from hud.types import CustomGym
@@ -52,7 +52,7 @@ custom_gym = CustomGym(
 # Or create complex Docker environments - see environments/ folder for examples
 ```
-**Trace Tool Calls Alongside HUD Environments (or Independently)**
+**[Trace Tool Calls Alongside HUD Environments (or Independently)](https://docs.hud.so/examples/mcp-agent-tracing)**
 ```python
 import hud
@@ -61,23 +61,23 @@ with hud.trace("my-agent-run"):
     result = await agent.run(task)
 ```
-## API Key Setup
-Before getting started, you'll need to obtain an API key:
+## Quick Start
-1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
-2. Set it in your environment or .env file:
+### Installation
 ```bash
-export HUD_API_KEY=your_api_key_here
+pip install hud-python
 ```
-## Quick Start
+### API Key Setup
-### Installation
+Before getting started, you'll need to obtain an API key:
+1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
+2. Set it in your environment or .env file:
 ```bash
-pip install hud-python
+export HUD_API_KEY=your_api_key_here
 ```
 ### Simple Browser Example with Claude Computer Use
@@ -100,6 +100,7 @@ async def main():
         setup=("goto", "google.com"),
         evaluate=("contains_text", "capybara")
     )
+    print(f"Running task with prompt: {task.prompt}")
     # Create environment using the gym module
     env = await gym.make(task)
@@ -111,6 +112,7 @@ async def main():
     obs, _ = await env.reset() # Gets first observation
     for i in range(5):
         actions, done = await agent.predict(obs)
+        print(f"Agent action {i}: {actions}")
         obs, reward, terminated, info = await env.step(actions)
         if done or terminated: break
@@ -198,4 +200,4 @@ If you use this SDK in your research, please cite it as follows:
   url = {https://github.com/hud-evals/hud-sdk},
   langid = {en}
 }
-```
+```

{hud_python-0.2.5 → hud_python-0.2.7}/docs/concepts/environment.mdx RENAMED

@@ -54,8 +54,6 @@ The HUD SDK provides several standard environment types, specified via the `gym`
 *   **`"hud-browser"`**: Provides a remote Chromium browser instance managed via Playwright. Ideal for web navigation, form interaction, and testing web applications.
     *   [See `hud-browser` Details](../environments/hud-browser.mdx)
-*   **`"hud-ubuntu"`**: Provides a remote Ubuntu desktop environment accessed via VNC. Suitable for tasks involving GUI applications, file system interaction, or running Linux software.
-    *   [See `hud-ubuntu` Details](../environments/hud-ubuntu.mdx)
 *   **`"qa"`**: A non-interactive environment for question-answering tasks where the agent provides a direct textual response.
     *   [See `qa` Environment Details](../environments/qa.mdx)
 *   **`CustomGym`**: Allows defining and running your own [Custom Environments](../advanced/custom-environments.mdx) using Docker, either locally or remotely. This provides maximum flexibility for specific testing needs.

{hud_python-0.2.5 → hud_python-0.2.7}/docs/docs.json RENAMED

@@ -24,6 +24,7 @@
       "pages": [
         "examples/benchmarking-agents",
         "examples/alignment-evaluation",
+        "examples/web-mocks",
         "examples/custom-os-env",
         "examples/mcp-agent-tracing",
         "examples/web-app-testing"
@@ -55,7 +56,6 @@
         "environments/browser",
         "environments/custom",
         "environments/qa",
-        "environments/ubuntu",
         "environments/osworld-ubuntu"
       ]
     }

{hud_python-0.2.5 → hud_python-0.2.7}/docs/environment-creation.mdx RENAMED

@@ -63,8 +63,13 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     # ... other system dependencies for your environment (e.g., desktop, browsers) ...
     && rm -rf /var/lib/apt/lists/*
+# Upgrade pip and setuptools to ensure PEP 660 support
+RUN pip3 install --upgrade pip "setuptools>=64.0.0" wheel
 # Copy your controller source code
 WORKDIR /app
+RUN mkdir /app_data
 COPY ./src /app/src
 COPY ./pyproject.toml /app/
@@ -92,7 +97,7 @@ dependencies = [
 ]
 [build-system]
-requires = ["setuptools>=61.0"]
+requires = ["setuptools>=64.0.0", "wheel"]
 build-backend = "setuptools.build_meta"
 [project.scripts]
@@ -135,9 +140,26 @@ def verify_output_file(expected_content: str) -> float:
         logger.error("Evaluation failed: Output file not found.")
         return 0.0 # Failure
+def step(action: str) -> dict:
+    """Example step function for a Task."""
+    logger.info(f"Controller: Stepping with {action=}")
+    return {
+        "observation": {
+            "text": "Sample Text",
+            "screenshot": None
+        }
+    }
 # You can add more functions as needed for different setup/evaluation logic
 ```
+### d. `src/hud_controller/__init__.py`
+```python
+from .main import initialize_environment, verify_output_file, step
+```
 ## 4. Building & Testing Locally
 ### a. Define `CustomGym`
@@ -307,7 +329,7 @@ We strongly encourage community contributions! If you've built a useful custom e
 Check the `environments/` directory in the SDK for inspiration:
 -   `environments/novnc_ubuntu/`: Provides an Ubuntu desktop accessible via VNC, for GUI-based tasks.
 -   `environments/custom_website/`: A template for packaging and testing your own web application.
--   `environments/gameboy/`: Example of a retro gaming environment.
+-   `environments/pokemon_controller/`: Example of a retro gaming environment.
 ## Using Remote Custom Environments
@@ -353,4 +375,4 @@ task_on_remote2 = Task(
 - **[Task Creation](/task-creation)**: How to define tasks that use your custom environments.
 - **[Custom Environments Overview](/environments/custom)**: Higher-level concepts of custom environments.
-- **[Browser Environment](/environments/browser)**: For standard web interaction tasks.
+- **[Browser Environment](/environments/browser)**: For standard web interaction tasks.
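
The `step` controller function added in this file returns an observation payload with `text` and `screenshot` keys. The following is a minimal typed sketch of that shape, not the SDK's own definition: the `Observation` and `StepResult` names are hypothetical, and the assumption that `screenshot` would hold a string when present is not confirmed by the diff.

```python
import logging
from typing import Optional, TypedDict

logger = logging.getLogger(__name__)


class Observation(TypedDict):
    # Hypothetical type mirroring the keys shown in the diff above.
    text: str
    screenshot: Optional[str]  # assumed to be a string when present


class StepResult(TypedDict):
    # Hypothetical wrapper matching the {"observation": {...}} payload.
    observation: Observation


def step(action: str) -> StepResult:
    """Example step function that logs the action and returns a fixed observation."""
    logger.info("Controller: stepping with action=%r", action)
    return {"observation": {"text": "Sample Text", "screenshot": None}}
```

Typing the payload this way lets a static checker such as pyright flag a missing or misspelled key before the controller is built into an image.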

hud_python-0.2.7/docs/examples/web-mocks.mdx ADDED

@@ -0,0 +1,240 @@
+---
+title: 'Web Mocks'
+description: 'Clone websites and host them as stable test environments for AI agents using HUD page archives.'
+icon: 'clone'
+---
+# Page Cloning
+This guide demonstrates how to create and host web archives for testing AI agents with consistent, offline-first environments. By cloning websites into WACZ (Web Archive Collection Zipped) files, you can ensure your agents always test against specific, unchanging versions of web pages.
+**Goal**: Create reproducible web environments for testing browser-based agents without depending on live websites that might change or go offline.
+**Concepts Covered**:
+- Using ArchiveWeb.page to clone websites into WACZ files
+- Hosting archives locally with the HUD page archives repository and `CustomGym`
+- Uploading archives to app.hud.so for immediate cloud hosting
+- Creating tasks that use these stable archived environments
+## Prerequisites
+- HUD SDK installed
+- Docker installed (for local hosting option)
+- ArchiveWeb.page browser extension (for cloning pages)
+- API keys for HUD and your chosen agent
+## Part 1: Cloning the Page
+### Installing ArchiveWeb.page
+1. **Install the Browser Extension**:
+   - Visit [ArchiveWeb.page](https://archiveweb.page)
+   - Install the extension for Chrome/Chromium-based browsers
+   - The extension icon will appear in your browser toolbar
+2. **Create a New Archive**:
+   - Click the ArchiveWeb.page extension icon
+   - Click "Create New Collection"
+   - Give your collection a descriptive name (e.g., "my-test-site")
+### Capturing Web Pages
+1. **Start Archiving**:
+   - Click "Start" in the extension popup to begin an archiving session
+   - Navigate to the website you want to clone
+   - Interact with the site as your agent would (login, navigate through pages, fill forms)
+   - All pages and resources will be captured automatically
+2. **Best Practices for Agent Testing**:
+   - Capture all relevant pages and states your agent will interact with
+   - Include error pages and edge cases
+   - If testing login flows, capture both logged-out and logged-in states
+   - For form submissions, capture the form page and success/error pages
+3. **Stop and Download**:
+   - Click "Stop" in the extension when done capturing
+   - Click "Download" to save your collection
+   - Choose WACZ format (default)
+   - Save with a meaningful filename (e.g., `my-test-site.wacz`)
+### Example: Cloning a Login Flow
+```
+1. Start archiving session
+2. Visit https://example.com/login
+3. Enter test credentials (e.g., testuser/password123)
+4. Submit the form
+5. Capture the dashboard/welcome page
+6. Optionally capture logout flow
+7. Stop and download as my-test-site.wacz
+```
+## Part 2: Hosting the Website
+You have two options for hosting your archived website:
+### Option 1: Local Hosting with CustomGym
+This approach uses the [HUD page archives repository](https://github.com/hud-evals/page-archives) to host archives locally and access them via `CustomGym`.
+#### Step 1: Clone the Page Archives Repository
+```bash
+git clone https://github.com/hud-evals/page-archives.git
+cd page-archives
+```
+#### Step 2: Add Your Archive
+1. **Place your WACZ file**:
+   ```bash
+   cp ~/Downloads/my-test-site.wacz archives/
+   ```
+2. **Update `archives/archive_list.json`**:
+   ```json
+   {
+     "archives": [
+       {
+         "name": "my-test-site",
+         "displayName": "My Test Site Archive",
+         "startPage": "https://example.com/login"  // Optional: default page to open
+       }
+       // ... other archives
+     ]
+   }
+   ```
+   Note: The `name` field must match your WACZ filename without the `.wacz` extension.
+#### Step 3: Create a CustomGym for the Archive Server
+```python
+from hud.types import CustomGym
+from pathlib import Path
+# Create a Dockerfile for the archive server
+archive_server_dockerfile = """
+FROM node:18-slim
+WORKDIR /app
+COPY . /app
+RUN npm install
+EXPOSE 3000
+CMD ["npm", "run", "start"]
+"""
+# Save Dockerfile in the page-archives directory
+with open("page-archives/Dockerfile", "w") as f:
+    f.write(archive_server_dockerfile)
+# Define the CustomGym
+archive_server_gym = CustomGym(
+    location="local",
+    image_or_build_context=Path("./page-archives"),
+    host_config={
+        "port_bindings": {3000: 3000}  # Expose port 3000
+    }
+)
+```
+#### Step 4: Create Tasks Using the Archived Site
+```python
+from hud import Task, run_job
+from hud.agent import ClaudeAgent
+# Task to test login flow on the archived site
+login_task = Task(
+    prompt="Log into the website using username 'testuser' and password 'password123'.",
+    gym="hud-browser",  # Use browser to interact
+    setup=[
+        # Navigate to your archived site running locally
+        ("goto", "http://localhost:3000/my-test-site")
+    ],
+    evaluate=("page_contains", "Welcome, testuser!")
+)
+```
+#### Advanced: Query Parameters
+The archive viewer supports useful query parameters:
+```python
+# Open a specific page within the archive
+specific_page_task = Task(
+    prompt="Navigate to the user profile page",
+    gym="hud-browser",
+    setup=[
+        ("goto", "http://localhost:3000/my-test-site?page=https%3A%2F%2Fexample.com%2Fprofile")
+    ]
+)
+# Debug mode - shows full ReplayWeb.page UI
+debug_task = Task(
+    prompt="Explore the archive interface",
+    gym="hud-browser",
+    setup=[
+        ("goto", "http://localhost:3000/my-test-site?debug=true")
+    ]
+)
+```
+### Option 2: Cloud Hosting on app.hud.so
+For immediate hosting without local setup, use the HUD platform's built-in page cloning feature.
+#### Step 1: Access Page Clone Feature
+1. Go to [app.hud.so](https://app.hud.so)
+2. Click "Create" in the navigation
+3. Select "Page Clone"
+#### Step 2: Upload Your Archive
+1. Click "Upload WACZ file"
+2. Select your `.wacz` file created in Part 1
+3. Provide a name for your cloned environment
+4. Click "Create"
+#### Step 3: Use the Hosted Archive
+Once uploaded, you'll receive a URL for your hosted archive (e.g., `https://archives.hud.so/your-archive-id`).
+```python
+from hud import Task, run_job
+from hud.agent import ClaudeAgent
+# Task using the cloud-hosted archive
+cloud_login_task = Task(
+    prompt="Log into the website using username 'testuser' and password 'password123'.",
+    gym="hud-browser",
+    setup=[
+        # Navigate to your cloud-hosted archive
+        ("goto", "https://archives.hud.so/your-archive-id")
+    ],
+    evaluate=("page_contains", "Welcome, testuser!")
+)
+# Run evaluation
+job = await run_job(
+    agent_cls=ClaudeAgent,
+    task_or_taskset=cloud_login_task,
+    job_name="Cloud Archive Test"
+)
+```
+## Tips for Effective Page Cloning
+1. **Capture Complete Flows**: Don't just capture individual pages - capture entire user journeys
+2. **Include Resources**: Ensure CSS, JavaScript, and images are properly captured
+3. **Test Your Archives**: Always verify your archives work correctly before using them in evaluations
+4. **Document States**: Keep notes on what states and pages are included in each archive
+5. **Update Regularly**: Re-clone sites when significant changes occur
+## Key Takeaways
+- ArchiveWeb.page makes it easy to create WACZ archives of any website
+- Local hosting with CustomGym gives you full control and fast performance
+- Cloud hosting on app.hud.so provides instant deployment without infrastructure
+- Page cloning ensures consistent, reproducible testing environments for AI agents
+- Archived sites eliminate external dependencies and enable offline testing
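
The query-parameter examples in the added guide pass the target page as a percent-encoded URL (`?page=https%3A%2F%2Fexample.com%2Fprofile`). The following is a minimal sketch of building such URLs with the Python standard library instead of encoding them by hand. The `archive_url` helper is hypothetical and not part of the HUD SDK or the page-archives repository, and combining `page` with `debug` in one URL is an assumption the guide does not confirm.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def archive_url(
    base: str,
    archive: str,
    page: Optional[str] = None,
    debug: bool = False,
) -> str:
    """Build a viewer URL for a named archive (hypothetical helper)."""
    params = {}
    if page is not None:
        # urlencode percent-encodes ':' and '/' in the value.
        params["page"] = page
    if debug:
        params["debug"] = "true"
    url = f"{base.rstrip('/')}/{quote(archive)}"
    return f"{url}?{urlencode(params)}" if params else url


profile_url = archive_url(
    "http://localhost:3000",
    "my-test-site",
    page="https://example.com/profile",
)
```

The resulting string can be passed as the second element of a `("goto", ...)` setup tuple, which avoids encoding mistakes when the archived page URL itself contains query strings.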

{hud_python-0.2.5 → hud_python-0.2.7}/docs/quickstart.mdx RENAMED

@@ -15,7 +15,7 @@ See [Installation](/installation) for more details on development setup.
 ## 2. API Key Setup
-Set your API keys in a `.env` file:
+Set your API keys in a `.env` file (get your HUD API key from [app.hud.so](https://app.hud.so)):
 ```bash
 HUD_API_KEY=sk-hud-...
@@ -51,9 +51,12 @@ async def main():
     await env.close()
 if __name__ == "__main__":
-        asyncio.run(main())
+    asyncio.run(main())
 ```
+Each gym (`hud-browser`, `OSWorld-Ubuntu`, custom) has its own set of setup and evaluate functions, and you can define your own.
+See [setup](/environments/browser#setup-functions-initial-state) and [evaluators](/environments/browser#evaluation-functions) for more info on available functions.
 ### Manual Agent Loop
 ```python
 env = await gym.make(task)

{hud_python-0.2.5 → hud_python-0.2.7}/docs/task-creation.mdx RENAMED

@@ -30,6 +30,10 @@ task = Task(
     setup=("goto", "https://news.example.com"), # Function to run at env.reset()
     evaluate=("page_contains", "artificial intelligence") # Function to run at env.evaluate()
 )
+# Create environment
+env = await gym.make(task)
+# ...
 ```
 ## Setup Functions (for `hud-browser`)
