PyPI - hud-python - Versions diffs - 0.2.5__tar.gz → 0.2.6__tar.gz - Mend


This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of hud-python might be problematic.

Files changed (165)

{hud_python-0.2.5 → hud_python-0.2.6}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.2.5
+Version: 0.2.6
 Summary: SDK for the HUD evaluation platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-sdk
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-sdk/issues
@@ -74,17 +74,17 @@ Description-Content-Type: text/markdown
 </div>
 <h3>
-Create, evaluate, and improve AI agents across web browsers, desktop environments, and custom scenarios.
+Evaluate your Computer Use AI agents across web browsers, desktop environments, and custom scenarios.
 </h3>
-> ### 🚀 Are you a startup building agents?
->
-> [📅 Hop on a call ](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
->
-> We're here to help with eval strategies, custom environments, or improving your agent architecture!
+### 🚀 Are you a startup building agents?
+[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
-> **Early Release Notice**: This SDK is currently in early release status. The API is evolving and may change in future releases as we gather feedback and improve functionality.
+We're here to help with eval strategies, custom environments, or improving your agent architecture!
+> **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-sdk/issues), as the SDK is still evolving!
 [![PyPI version](https://img.shields.io/pypi/v/hud-python)](https://pypi.org/project/hud-python/)
@@ -132,23 +132,23 @@ with hud.trace("my-agent-run"):
     result = await agent.run(task)
 ```
-## API Key Setup
-Before getting started, you'll need to obtain an API key:
+## Quick Start
-1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
-2. Set it in your environment or .env file:
+### Installation
 ```bash
-export HUD_API_KEY=your_api_key_here
+pip install hud-python
 ```
-## Quick Start
+### API Key Setup
-### Installation
+Before getting started, you'll need to obtain an API key:
+1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
+2. Set it in your environment or .env file:
 ```bash
-pip install hud-python
+export HUD_API_KEY=your_api_key_here
 ```
 ### Simple Browser Example with Claude Computer Use
@@ -269,4 +269,4 @@ If you use this SDK in your research, please cite it as follows:
   url = {https://github.com/hud-evals/hud-sdk},
   langid = {en}
 }
-```
+```

{hud_python-0.2.5 → hud_python-0.2.6}/README.md RENAMED

@@ -3,17 +3,17 @@
 </div>
 <h3>
-Create, evaluate, and improve AI agents across web browsers, desktop environments, and custom scenarios.
+Evaluate your Computer Use AI agents across web browsers, desktop environments, and custom scenarios.
 </h3>
-> ### 🚀 Are you a startup building agents?
->
-> [📅 Hop on a call ](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
->
-> We're here to help with eval strategies, custom environments, or improving your agent architecture!
+### 🚀 Are you a startup building agents?
+[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
-> **Early Release Notice**: This SDK is currently in early release status. The API is evolving and may change in future releases as we gather feedback and improve functionality.
+We're here to help with eval strategies, custom environments, or improving your agent architecture!
+> **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-sdk/issues), as the SDK is still evolving!
 [![PyPI version](https://img.shields.io/pypi/v/hud-python)](https://pypi.org/project/hud-python/)
@@ -61,23 +61,23 @@ with hud.trace("my-agent-run"):
     result = await agent.run(task)
 ```
-## API Key Setup
-Before getting started, you'll need to obtain an API key:
+## Quick Start
-1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
-2. Set it in your environment or .env file:
+### Installation
 ```bash
-export HUD_API_KEY=your_api_key_here
+pip install hud-python
 ```
-## Quick Start
+### API Key Setup
-### Installation
+Before getting started, you'll need to obtain an API key:
+1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
+2. Set it in your environment or .env file:
 ```bash
-pip install hud-python
+export HUD_API_KEY=your_api_key_here
 ```
 ### Simple Browser Example with Claude Computer Use
@@ -198,4 +198,4 @@ If you use this SDK in your research, please cite it as follows:
   url = {https://github.com/hud-evals/hud-sdk},
   langid = {en}
 }
-```
+```
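
The reordered Quick Start in this hunk asks for `HUD_API_KEY` to be set in the environment or a `.env` file. A minimal sketch of failing early when the key is missing, using only the standard library. The helper name `require_api_key` is hypothetical and not part of the SDK, and this sketch does not parse `.env` files; it only checks the process environment.

```python
import os


def require_api_key(name: str = "HUD_API_KEY") -> str:
    """Return the value of the environment variable `name`.

    Hypothetical helper for illustration only; the SDK may read the key
    in a different way.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        # Fail before any remote environment is created.
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key immediately rather than partway through a run.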

{hud_python-0.2.5 → hud_python-0.2.6}/docs/concepts/environment.mdx RENAMED

@@ -54,8 +54,6 @@ The HUD SDK provides several standard environment types, specified via the `gym`
 *   **`"hud-browser"`**: Provides a remote Chromium browser instance managed via Playwright. Ideal for web navigation, form interaction, and testing web applications.
     *   [See `hud-browser` Details](../environments/hud-browser.mdx)
-*   **`"hud-ubuntu"`**: Provides a remote Ubuntu desktop environment accessed via VNC. Suitable for tasks involving GUI applications, file system interaction, or running Linux software.
-    *   [See `hud-ubuntu` Details](../environments/hud-ubuntu.mdx)
 *   **`"qa"`**: A non-interactive environment for question-answering tasks where the agent provides a direct textual response.
     *   [See `qa` Environment Details](../environments/qa.mdx)
 *   **`CustomGym`**: Allows defining and running your own [Custom Environments](../advanced/custom-environments.mdx) using Docker, either locally or remotely. This provides maximum flexibility for specific testing needs.

{hud_python-0.2.5 → hud_python-0.2.6}/docs/docs.json RENAMED

@@ -55,7 +55,6 @@
         "environments/browser",
         "environments/custom",
         "environments/qa",
-        "environments/ubuntu",
         "environments/osworld-ubuntu"
       ]
     }

{hud_python-0.2.5 → hud_python-0.2.6}/docs/environment-creation.mdx RENAMED

@@ -63,8 +63,13 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     # ... other system dependencies for your environment (e.g., desktop, browsers) ...
     && rm -rf /var/lib/apt/lists/*
+# Upgrade pip and setuptools to ensure PEP 660 support
+RUN pip3 install --upgrade pip "setuptools>=64.0.0" wheel
 # Copy your controller source code
 WORKDIR /app
+RUN mkdir /app_data
 COPY ./src /app/src
 COPY ./pyproject.toml /app/
@@ -92,7 +97,7 @@ dependencies = [
 ]
 [build-system]
-requires = ["setuptools>=61.0"]
+requires = ["setuptools>=64.0.0", "wheel"]
 build-backend = "setuptools.build_meta"
 [project.scripts]
@@ -135,9 +140,26 @@ def verify_output_file(expected_content: str) -> float:
         logger.error("Evaluation failed: Output file not found.")
         return 0.0 # Failure
+def step(action: str) -> dict:
+    """Example step function for a Task."""
+    logger.info(f"Controller: Stepping with {action=}")
+    return {
+        "observation": {
+            "text": "Sample Text",
+            "screenshot": None
+        }
+    }
 # You can add more functions as needed for different setup/evaluation logic
 ```
+### d. `src/hud_controller/__init__.py`
+```python
+from .main import initialize_environment, verify_output_file, step
+```
 ## 4. Building & Testing Locally
 ### a. Define `CustomGym`

{hud_python-0.2.5 → hud_python-0.2.6}/docs/quickstart.mdx RENAMED

@@ -15,7 +15,7 @@ See [Installation](/installation) for more details on development setup.
 ## 2. API Key Setup
-Set your API keys in a `.env` file:
+Set your API keys in a `.env` file (get your HUD API key from [app.hud.so](https://app.hud.so)):
 ```bash
 HUD_API_KEY=sk-hud-...
@@ -51,9 +51,12 @@ async def main():
     await env.close()
 if __name__ == "__main__":
-        asyncio.run(main())
+    asyncio.run(main())
 ```
+Each gym (`hud-browser`, `OSWorld-Ubuntu`, custom) has its own set of setup and evaluate functions, and you can define your own.
+See [setup](/environments/browser#setup-functions-initial-state) and [evaluators](/environments/browser#evaluation-functions) for more info on available functions.
 ### Manual Agent Loop
 ```python
 env = await gym.make(task)

hud_python-0.2.6/examples/osworld.ipynb ADDED

@@ -0,0 +1,199 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# uv pip install -e \".[dev]\"\n",
+    "from hud import gym, load_taskset\n",
+    "from pprint import pprint"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total tasks in OSWorld: 369\n",
+      "Task prompt: Can you make my computer bring back the last tab I shut down?\n"
+     ]
+    }
+   ],
+   "source": [
+    "taskset = await load_taskset(\"OSWorld-Ubuntu\")\n",
+    "print(f\"Total tasks in OSWorld: {len(taskset)}\")\n",
+    "\n",
+    "test = taskset[144]\n",
+    "print(f\"Task prompt: {test.prompt}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "2025-05-27 10:04:56,691 - hud.gym - INFO - Creating private environment\n"
+     ]
+    }
+   ],
+   "source": [
+    "# The Ubuntu environment will take around 2.5 minutes to start, but can be parallelized\n",
+    "env = await gym.make(test)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Initial observation complete\n",
+      "========= Step 1 =========\n",
+      "Agent's action: [PressAction(type='press', keys=['ctrl', 'shift', 't'])]\n",
+      "========= Step 2 =========\n",
+      "Agent's action: [ResponseAction(type='response', text=\"Great! I've successfully reopened your last closed tab. As you can see, the TripAdvisor tab has been restored. Now you have three tabs open:\\n\\n1. Lonely Planet | Travel Guide\\n2. Airbnb | Vacation rentals\\n3. TripAdvisor: Over a billion reviews & contributions for Hotels\\n\\nThe keyboard shortcut Ctrl+Shift+T is very useful for recovering recently closed tabs in Chrome. You can actually press it multiple times to continue reopening previously closed tabs in the order they were closed.\")]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from hud.agent import ClaudeAgent\n",
+    "\n",
+    "# Define a new agent each time to reset the message history\n",
+    "# Make sure to define the environment variable ANTHROPIC_API_KEY\n",
+    "agent = ClaudeAgent()\n",
+    "\n",
+    "# Initial observation\n",
+    "obs, _ = await env.reset()\n",
+    "print(f\"Initial observation complete\")\n",
+    "\n",
+    "# Agent loop\n",
+    "for i in range(8):\n",
+    "    print(f\"========= Step {i + 1} =========\")\n",
+    "    action, done = await agent.predict(obs)\n",
+    "    print(f\"Agent's action: {action}\")\n",
+    "\n",
+    "    obs, reward, terminated, info = await env.step(action)\n",
+    "\n",
+    "    if done or terminated:\n",
+    "        break"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'error': None,\n",
+      " 'logs': 'INFO: Starting evaluation...\\n'\n",
+      "         'INFO: Evaluating task 08d9a8b1-7b7a-4ba7-a226-4e266e13f6df...\\n'\n",
+      "         'INFO: Evaluator configuration:\\n'\n",
+      "         'INFO:   Metric function(s): is_expected_tabs\\n'\n",
+      "         'INFO:   Metric conjunction: and\\n'\n",
+      "         'INFO:   Result getter: get_open_tabs_info\\n'\n",
+      "         'INFO:   Expected getter: get_rule\\n'\n",
+      "         'INFO:   Metric options: {}\\n'\n",
+      "         'INFO: Setting up post-config for evaluation...\\n'\n",
+      "         'INFO: Evaluating single metric: is_expected_tabs\\n'\n",
+      "         \"INFO: Getting result state using config: {'type': 'open_tabs_info'}\\n\"\n",
+      "         \"INFO: Getting expected state using config: {'type': 'rule', 'rules': \"\n",
+      "         \"{'type': 'url', 'urls': ['https://www.lonelyplanet.com', \"\n",
+      "         \"'https://www.airbnb.com', 'https://www.tripadvisor.com']}}\\n\"\n",
+      "         'INFO: Comparing result state with expected state\\n'\n",
+      "         'INFO: Final evaluation result: 1\\n'\n",
+      "         'INFO: Completed evaluation.\\n'\n",
+      "         'INFO: Completed evaluation.\\n',\n",
+      " 'reward': 1.0}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Evaluate environment state\n",
+    "result = await env.evaluate()\n",
+    "pprint(result)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Make sure to close environment to avoid being charged for idle time\n",
+    "await env.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Parallel runs for the whole dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from hud import run_job\n",
+    "\n",
+    "taskset = await load_taskset(\"OSWorld-Ubuntu\")\n",
+    "job = await run_job(\n",
+    "    ClaudeAgent,\n",
+    "    taskset,\n",
+    "    \"osworld-test\",\n",
+    "    max_steps_per_task=20,\n",
+    "    max_concurrent_tasks=20,\n",
+    "    auto_reply_question=True,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "await job.get_analytics()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
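
The agent loop in the notebook above follows a reset, predict, step pattern with a step cap and two stop conditions. A self-contained sketch of that control flow, runnable without the SDK or any API key. `StubEnv` and `StubAgent` are hypothetical stand-ins that only mimic the return shapes seen in the notebook (`reset()` returning an observation and info, `predict()` returning an action and a done flag, `step()` returning a 4-tuple); they are not hud classes.

```python
import asyncio


class StubEnv:
    """Hypothetical stand-in for an environment; not part of hud."""

    def __init__(self, steps_until_done: int = 3) -> None:
        self._remaining = steps_until_done

    async def reset(self):
        return "initial observation", {}

    async def step(self, action):
        self._remaining -= 1
        terminated = self._remaining <= 0
        reward = 1.0 if terminated else 0.0
        return f"observation after {action}", reward, terminated, {}


class StubAgent:
    """Hypothetical stand-in for an agent; never signals done itself."""

    async def predict(self, obs):
        return [f"act on: {obs}"], False


async def run_episode(env, agent, max_steps: int = 8) -> int:
    """Run the loop and return the number of steps taken."""
    obs, _ = await env.reset()
    steps = 0
    for _ in range(max_steps):
        action, done = await agent.predict(obs)
        obs, reward, terminated, info = await env.step(action)
        steps += 1
        # Stop when either the agent or the environment says so.
        if done or terminated:
            break
    return steps
```

Running `asyncio.run(run_episode(StubEnv(), StubAgent()))` exercises the loop end to end; in a notebook, `await run_episode(...)` works directly, as in the cells above.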

{hud_python-0.2.5 → hud_python-0.2.6}/examples/tasks.ipynb RENAMED

@@ -7,7 +7,6 @@
    "outputs": [],
    "source": [
     "from hud import gym\n",
-    "from hud.utils import stream\n",
     "from hud.task import Task"
    ]
   },
@@ -41,10 +40,7 @@
    "source": [
     "# Create and set up environment with google, takes around 20 seconds\n",
     "env = await gym.make(task)\n",
-    "urls = await env.get_urls()\n",
-    "\n",
-    "# Stream the live view\n",
-    "stream(urls[\"live_url\"])"
+    "await env.stream()"
    ]
   },
   {
@@ -127,10 +123,7 @@
    "source": [
     "# Create and set up environment with google, takes around 20 seconds\n",
     "env = await gym.make(task)\n",
-    "urls = await env.get_urls()\n",
-    "\n",
-    "# Stream the live view\n",
-    "stream(urls[\"live_url\"])"
+    "await env.stream()"
    ]
   },
   {
@@ -217,9 +210,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from hud.adapters.common.types import ResponseAction\n",
+    "from hud import Response\n",
     "\n",
-    "await env.step([ResponseAction(text=\"Paris\")])"
+    "await env.step([Response(text=\"Paris\")])"
    ]
   },
   {
