PyPI - hud-python - Versions diffs - 0.4.51__tar.gz → 0.4.53__tar.gz - Mend

hud-python 0.4.51__tar.gz → 0.4.53__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of hud-python might be problematic.

Files changed (301)

{hud_python-0.4.51 → hud_python-0.4.53}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.4.51
+Version: 0.4.53
 Summary: SDK for the HUD platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-python
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues
@@ -48,6 +48,7 @@ Requires-Dist: opentelemetry-api>=1.34.1
 Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.34.1
 Requires-Dist: opentelemetry-instrumentation-mcp==0.47.0
 Requires-Dist: opentelemetry-sdk>=1.34.1
+Requires-Dist: packaging>=21.0
 Requires-Dist: pathspec>=0.12.1
 Requires-Dist: pillow>=11.1.0
 Requires-Dist: prompt-toolkit==3.0.51
@@ -159,12 +160,12 @@ OSS RL environment + evals toolkit. Wrap software as environments, run benchmark
 ## Highlights
-- 🎓 **[One-click RL](https://hud.so/models)** – Run `hud rl` to get a trained model on any environment.
 - 🚀 **[MCP environment skeleton](https://docs.hud.so/core-concepts/mcp-protocol)** – any agent can call any environment.
 - ⚡️ **[Live telemetry](https://hud.so)** – inspect every tool call, observation, and reward in real time.
 - 🗂️ **[Public benchmarks](https://hud.so/leaderboards)** – OSWorld-Verified, SheetBench-50, and more.
 - 🌐 **[Cloud browsers](environments/remote_browser/)** – AnchorBrowser, Steel, BrowserBase integrations for browser automation.
 - 🛠️ **[Hot-reload dev loop](environments/README.md#phase-5-hot-reload-development-with-cursor-agent)** – `hud dev` for iterating on environments without rebuilds.
+- 🎓 **[One-click RL](https://hud.so/models)** – Run `hud rl` to get a trained model on any environment.
 > We welcome contributors and feature requests – open an issue or hop on a call to discuss improvements!
@@ -185,29 +186,6 @@ uv tool install hud-python
 Before starting, get your HUD_API_KEY at [hud.so](https://hud.so).
-## Quickstart: Training
-RL using GRPO a Qwen2.5-VL model on any hud dataset:
-```bash
-hud get hud-evals/basic-2048 # from HF
-hud rl basic-2048.json
-```
-> See [agent training docs](https://docs.hud.so/train-agents/quickstart)
-Or make your own environment and dataset:
-```bash
-hud init my-env && cd my-env
-hud dev --interactive
-# When ready to run:
-hud rl
-```
-> See [environment design docs](https://docs.hud.so/build-environments)
 ## Quickstart: Evals
 For a tutorial that explains the agent and evaluation design, run:
@@ -264,38 +242,27 @@ The above example let's the agent play 2048 ([See replay](https://hud.so/trace/6
 ![Agent playing 2048](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/2048_1.gif)
-## Reinforcement Learning with GRPO
-This is a Qwen‑2.5‑VL‑3B agent training a policy on the 2048-basic browser environment:
-![RL curve](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/rl_2.png)
+## Quickstart: Training
-Train with the new interactive `hud rl` flow:
+RL using GRPO a Qwen2.5-VL model on any hud dataset:
 ```bash
-# Install CLI
-uv tool install hud-python
-# Option A: Run directly from a HuggingFace dataset
-hud rl hud-evals/basic-2048
-# Option B: Download first, modify, then train
-hud get hud-evals/basic-2048
+hud get hud-evals/basic-2048 # from HF
 hud rl basic-2048.json
-# Optional: baseline evaluation
-hud eval basic-2048.json
 ```
-Supports multi‑turn RL for both:
-- Language‑only models (e.g., `Qwen/Qwen2.5-7B-Instruct`)
-- Vision‑Language models (e.g., `Qwen/Qwen2.5-VL-3B-Instruct`)
+> See [agent training docs](https://docs.hud.so/train-agents/quickstart)
-By default, `hud rl` provisions a persistent server and trainer in the cloud, streams telemetry to `hud.so`, and lets you monitor/manage models at `hud.so/models`. Use `--local` to run entirely on your machines (typically 2+ GPUs: one for vLLM, the rest for training).
+Or make your own environment and dataset:
-Any HUD MCP environment and evaluation works with our RL pipeline (including remote configurations). See the guided docs: `https://docs.hud.so/train-agents/quickstart`.
+```bash
+hud init my-env && cd my-env
+hud dev --interactive
+# When ready to run:
+hud rl
+```
-Pricing: Hosted vLLM and training GPU rates are listed in the [Training Quickstart → Pricing](https://docs.hud.so/train-agents/quickstart#pricing). Manage billing at the [HUD billing dashboard](https://hud.so/project/billing).
+> See [environment design docs](https://docs.hud.so/build-environments)
 ## Benchmarking Agents
@@ -459,6 +426,39 @@ We highly suggest running 3-5 evaluations per dataset for the most consistent re
 Using the [`run_dataset`](https://docs.hud.so/reference/tasks#run_dataset) function with a HuggingFace dataset automatically assigns your job to that leaderboard page, and allows you to create a scorecard out of it:
+## Reinforcement Learning with GRPO
+This is a Qwen‑2.5‑VL‑3B agent training a policy on the 2048-basic browser environment:
+![RL curve](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/rl_2.png)
+Train with the new interactive `hud rl` flow:
+```bash
+# Install CLI
+uv tool install hud-python
+# Option A: Run directly from a HuggingFace dataset
+hud rl hud-evals/basic-2048
+# Option B: Download first, modify, then train
+hud get hud-evals/basic-2048
+hud rl basic-2048.json
+# Optional: baseline evaluation
+hud eval basic-2048.json
+```
+Supports multi‑turn RL for both:
+- Language‑only models (e.g., `Qwen/Qwen2.5-7B-Instruct`)
+- Vision‑Language models (e.g., `Qwen/Qwen2.5-VL-3B-Instruct`)
+By default, `hud rl` provisions a persistent server and trainer in the cloud, streams telemetry to `hud.so`, and lets you monitor/manage models at `hud.so/models`. Use `--local` to run entirely on your machines (typically 2+ GPUs: one for vLLM, the rest for training).
+Any HUD MCP environment and evaluation works with our RL pipeline (including remote configurations). See the guided docs: `https://docs.hud.so/train-agents/quickstart`.
+Pricing: Hosted vLLM and training GPU rates are listed in the [Training Quickstart → Pricing](https://docs.hud.so/train-agents/quickstart#pricing). Manage billing at the [HUD billing dashboard](https://hud.so/project/billing).
 ## Architecture
 ```mermaid

{hud_python-0.4.51 → hud_python-0.4.53}/README.md RENAMED

@@ -22,12 +22,12 @@ OSS RL environment + evals toolkit. Wrap software as environments, run benchmark
 ## Highlights
-- 🎓 **[One-click RL](https://hud.so/models)** – Run `hud rl` to get a trained model on any environment.
 - 🚀 **[MCP environment skeleton](https://docs.hud.so/core-concepts/mcp-protocol)** – any agent can call any environment.
 - ⚡️ **[Live telemetry](https://hud.so)** – inspect every tool call, observation, and reward in real time.
 - 🗂️ **[Public benchmarks](https://hud.so/leaderboards)** – OSWorld-Verified, SheetBench-50, and more.
 - 🌐 **[Cloud browsers](environments/remote_browser/)** – AnchorBrowser, Steel, BrowserBase integrations for browser automation.
 - 🛠️ **[Hot-reload dev loop](environments/README.md#phase-5-hot-reload-development-with-cursor-agent)** – `hud dev` for iterating on environments without rebuilds.
+- 🎓 **[One-click RL](https://hud.so/models)** – Run `hud rl` to get a trained model on any environment.
 > We welcome contributors and feature requests – open an issue or hop on a call to discuss improvements!
@@ -48,29 +48,6 @@ uv tool install hud-python
 Before starting, get your HUD_API_KEY at [hud.so](https://hud.so).
-## Quickstart: Training
-RL using GRPO a Qwen2.5-VL model on any hud dataset:
-```bash
-hud get hud-evals/basic-2048 # from HF
-hud rl basic-2048.json
-```
-> See [agent training docs](https://docs.hud.so/train-agents/quickstart)
-Or make your own environment and dataset:
-```bash
-hud init my-env && cd my-env
-hud dev --interactive
-# When ready to run:
-hud rl
-```
-> See [environment design docs](https://docs.hud.so/build-environments)
 ## Quickstart: Evals
 For a tutorial that explains the agent and evaluation design, run:
@@ -127,38 +104,27 @@ The above example let's the agent play 2048 ([See replay](https://hud.so/trace/6
 ![Agent playing 2048](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/2048_1.gif)
-## Reinforcement Learning with GRPO
-This is a Qwen‑2.5‑VL‑3B agent training a policy on the 2048-basic browser environment:
-![RL curve](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/rl_2.png)
+## Quickstart: Training
-Train with the new interactive `hud rl` flow:
+RL using GRPO a Qwen2.5-VL model on any hud dataset:
 ```bash
-# Install CLI
-uv tool install hud-python
-# Option A: Run directly from a HuggingFace dataset
-hud rl hud-evals/basic-2048
-# Option B: Download first, modify, then train
-hud get hud-evals/basic-2048
+hud get hud-evals/basic-2048 # from HF
 hud rl basic-2048.json
-# Optional: baseline evaluation
-hud eval basic-2048.json
 ```
-Supports multi‑turn RL for both:
-- Language‑only models (e.g., `Qwen/Qwen2.5-7B-Instruct`)
-- Vision‑Language models (e.g., `Qwen/Qwen2.5-VL-3B-Instruct`)
+> See [agent training docs](https://docs.hud.so/train-agents/quickstart)
-By default, `hud rl` provisions a persistent server and trainer in the cloud, streams telemetry to `hud.so`, and lets you monitor/manage models at `hud.so/models`. Use `--local` to run entirely on your machines (typically 2+ GPUs: one for vLLM, the rest for training).
+Or make your own environment and dataset:
-Any HUD MCP environment and evaluation works with our RL pipeline (including remote configurations). See the guided docs: `https://docs.hud.so/train-agents/quickstart`.
+```bash
+hud init my-env && cd my-env
+hud dev --interactive
+# When ready to run:
+hud rl
+```
-Pricing: Hosted vLLM and training GPU rates are listed in the [Training Quickstart → Pricing](https://docs.hud.so/train-agents/quickstart#pricing). Manage billing at the [HUD billing dashboard](https://hud.so/project/billing).
+> See [environment design docs](https://docs.hud.so/build-environments)
 ## Benchmarking Agents
@@ -322,6 +288,39 @@ We highly suggest running 3-5 evaluations per dataset for the most consistent re
 Using the [`run_dataset`](https://docs.hud.so/reference/tasks#run_dataset) function with a HuggingFace dataset automatically assigns your job to that leaderboard page, and allows you to create a scorecard out of it:
+## Reinforcement Learning with GRPO
+This is a Qwen‑2.5‑VL‑3B agent training a policy on the 2048-basic browser environment:
+![RL curve](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/rl_2.png)
+Train with the new interactive `hud rl` flow:
+```bash
+# Install CLI
+uv tool install hud-python
+# Option A: Run directly from a HuggingFace dataset
+hud rl hud-evals/basic-2048
+# Option B: Download first, modify, then train
+hud get hud-evals/basic-2048
+hud rl basic-2048.json
+# Optional: baseline evaluation
+hud eval basic-2048.json
+```
+Supports multi‑turn RL for both:
+- Language‑only models (e.g., `Qwen/Qwen2.5-7B-Instruct`)
+- Vision‑Language models (e.g., `Qwen/Qwen2.5-VL-3B-Instruct`)
+By default, `hud rl` provisions a persistent server and trainer in the cloud, streams telemetry to `hud.so`, and lets you monitor/manage models at `hud.so/models`. Use `--local` to run entirely on your machines (typically 2+ GPUs: one for vLLM, the rest for training).
+Any HUD MCP environment and evaluation works with our RL pipeline (including remote configurations). See the guided docs: `https://docs.hud.so/train-agents/quickstart`.
+Pricing: Hosted vLLM and training GPU rates are listed in the [Training Quickstart → Pricing](https://docs.hud.so/train-agents/quickstart#pricing). Manage billing at the [HUD billing dashboard](https://hud.so/project/billing).
 ## Architecture
 ```mermaid

{hud_python-0.4.51 → hud_python-0.4.53}/environments/blank/README.md RENAMED

@@ -6,10 +6,12 @@ See [docs](https://docs.hud.so/build-environments) for the complete environment
 ## Architecture
 **`environment/`** - Produces structured data
 - Owns all state (game logic, browser sessions, databases, etc.)
 - Exposes HTTP endpoints `/health`, `/act`, `/reset`, `/state` that return structured information about the environment state
 **`server/`** - Wraps data in MCP tools
 - Calls environment endpoints to get structured data for the agent, and environment setup/evaluation
 - Agents and tasks interact only with these tools!
@@ -33,12 +35,14 @@ Visit http://localhost:8765/docs to see the new tool appear instantly.
 In general, we recommend starting work on the environment backend first, then developing the MCP server to expose the right things to the agent.
 For complex environments that require many dependencies, we recommend running `hud dev` in the environment root:
 ```bash
 cd ..
 hud dev
 ```
 ## Tasks & Evaluation
 ```bash
 # Build first in the global folder with the Dockerfile (creates blank:0.1.0)
 hud build
@@ -59,6 +63,7 @@ Your `tasks.json` uses `docker run` to launch the environment:
 ```
 **Commands:**
 ```bash
 # Build first
 hud build
@@ -78,6 +83,7 @@ hud rl tasks.json  # Auto-converts docker→remote, builds & pushes if needed
 Once your environment is ready, you can share it with the community:
 ### 1. Push to Registry
 ```bash
 # Build and push your environment (requires docker hub login and hud api key)
 hud build
@@ -89,10 +95,12 @@ hud push
 Create a dataset on HuggingFace with your tasks:
 **Option A: Upload manually**
 1. Upload your `tasks.json` to HuggingFace
 2. Make sure it's **public** to appear on leaderboards
 **Option B: Use the SDK**
 ```python
 from hud.datasets import save_tasks
 import json
@@ -109,7 +117,7 @@ save_tasks(tasks, repo_id="your-org/your-dataset")
 ```bash
 # Run Claude on your benchmark
-hud eval "your-org/your-dataset" --agent claude
+hud eval "your-org/your-dataset" claude
 # View results at:
 # hud.so/leaderboards/your-org/your-dataset
@@ -118,4 +126,3 @@ hud eval "your-org/your-dataset" --agent claude
 **Note**: Only public HuggingFace datasets appear as leaderboards!
 📚 Learn more: [Creating Benchmarks](https://docs.hud.so/evaluate-agents/create-benchmarks) | [Leaderboards](https://docs.hud.so/evaluate-agents/leaderboards)

{hud_python-0.4.51 → hud_python-0.4.53}/environments/blank/server/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ version = "0.1.0"
 description = "MCP server for blank environment"
 requires-python = ">=3.11"
 dependencies = [
-    "hud-python>=0.4.51",
+    "hud-python>=0.4.53",
     "httpx>=0.28.1",
 ]

{hud_python-0.4.51 → hud_python-0.4.53}/environments/browser/server/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ version = "0.1.0"
 description = "HUD Browser MCP Server"
 requires-python = ">=3.11,<3.14"
 dependencies = [
-    "hud-python@git+https://github.com/hud-evals/hud-python@cli-dev",
+    "hud-python>=0.4.53",
     "httpx",
     "playwright",
     "pyautogui",

{hud_python-0.4.51 → hud_python-0.4.53}/environments/deepresearch/server/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ version = "0.1.0"
 description = "MCP server for DeepResearch environment"
 requires-python = ">=3.11"
 dependencies = [
-    "hud-python>=0.4.51",
+    "hud-python>=0.4.53",
     "httpx>=0.24.0",
 ]

{hud_python-0.4.51 → hud_python-0.4.53}/hud/__init__.py RENAMED

@@ -5,10 +5,22 @@ tools for building, evaluating, and training AI agents.
 from __future__ import annotations
-from .telemetry import Trace, clear_trace, create_job, get_trace, instrument, job, trace
+from .telemetry import (
+    Trace,
+    async_job,
+    async_trace,
+    clear_trace,
+    create_job,
+    get_trace,
+    instrument,
+    job,
+    trace,
+)
 __all__ = [
     "Trace",
+    "async_job",
+    "async_trace",
     "clear_trace",
     "create_job",
     "get_trace",

{hud_python-0.4.51 → hud_python-0.4.53}/hud/agents/base.py RENAMED

@@ -55,6 +55,7 @@ class MCPAgent(ABC):
         # Filtering
         allowed_tools: list[str] | None = None,
         disallowed_tools: list[str] | None = None,
+        response_tool_name: str | None = None,
         # Messages
         system_prompt: str = GLOBAL_SYSTEM_PROMPT,
         append_setup_output: bool = True,
@@ -74,6 +75,7 @@ class MCPAgent(ABC):
                 that provides `mcp_config`.
             allowed_tools: Names of tools to allow (None means allow all).
             disallowed_tools: Names of tools to always exclude.
+            response_tool_name: Name of the tool to use for response.
             system_prompt: System prompt to seed the conversation.
             append_setup_output: Whether to append setup tool output to the
                 first turn's messages.
@@ -108,7 +110,7 @@ class MCPAgent(ABC):
         # Initialize these here so methods can be called before initialize()
         self._tool_map: dict[str, types.Tool] = {}  # Simplified: just name to tool
-        self.response_tool_name = None
+        self.response_tool_name = response_tool_name
         # Trace
         self._auto_trace = auto_trace
@@ -135,7 +137,11 @@ class MCPAgent(ABC):
                 "No MCPClient. Please provide one when initializing the agent or pass a Task with mcp_config."  # noqa: E501
             )
-        await self._setup_config(self.mcp_client.mcp_config)
+        try:
+            client_cfg = getattr(self.mcp_client, "mcp_config", None)
+        except Exception:
+            client_cfg = None
+        await self._setup_config(client_cfg)
         # Initialize client if needed
         try:
@@ -168,6 +174,8 @@ class MCPAgent(ABC):
                     self.disallowed_tools.extend(task.agent_config["disallowed_tools"])
                 else:  # If disallowed_tools is None, we overwrite it
                     self.disallowed_tools = task.agent_config["disallowed_tools"]
+            if "response_tool_name" in task.agent_config:
+                self.response_tool_name = task.agent_config["response_tool_name"]
         all_tools = await self.mcp_client.list_tools()
         self._available_tools = []
@@ -614,8 +622,11 @@ class MCPAgent(ABC):
             except Exception as e:
                 self.console.error_log(f"Response lifecycle tool failed: {e}")
-    async def _setup_config(self, mcp_config: dict[str, dict[str, Any]]) -> None:
+    async def _setup_config(self, mcp_config: dict[str, dict[str, Any]] | None) -> None:
         """Inject metadata into the metadata of the initialize request."""
+        if not isinstance(mcp_config, dict):
+            return
         if self.metadata:
             patch_mcp_config(
                 mcp_config,
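
Note on the `base.py` hunks above: `initialize()` now reads `mcp_config` defensively and `_setup_config` returns early for anything that is not a dict. The following is a minimal standalone sketch of that guard pattern, not the library's code. The helper name `read_mcp_config` and the three small classes are hypothetical and exist only to exercise the three cases (attribute present, attribute missing, attribute access raising).

```python
from __future__ import annotations

from typing import Any


def read_mcp_config(client: object) -> dict[str, Any] | None:
    """Hypothetical helper: return client.mcp_config if readable and a dict, else None."""
    try:
        # getattr's default only covers AttributeError (a missing attribute).
        cfg = getattr(client, "mcp_config", None)
    except Exception:
        # A property may raise something else; treat that as "no config" too.
        cfg = None
    # Mirrors the isinstance check added to _setup_config in the diff.
    return cfg if isinstance(cfg, dict) else None


class WithConfig:
    mcp_config = {"local": {"command": "docker"}}


class WithoutConfig:
    pass


class RaisingConfig:
    @property
    def mcp_config(self) -> dict[str, Any]:
        raise RuntimeError("not connected")


for client in (WithConfig(), WithoutConfig(), RaisingConfig()):
    print(type(client).__name__, read_mcp_config(client))
```

The effect is that a client without a usable `mcp_config` simply skips metadata injection instead of raising during initialization.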

{hud_python-0.4.51 → hud_python-0.4.53}/hud/agents/lite_llm.py RENAMED

@@ -47,7 +47,7 @@ class LiteAgent(GenericOpenAIChatAgent):
             **agent_kwargs,
         )
-    def get_tool_schemas(self) -> list[dict]:
+    def get_tool_schemas(self) -> list[Any]:
         # Prefer LiteLLM's stricter transformer (handles Bedrock & friends)
         if transform_mcp_tool_to_openai_tool is not None:
             return [

{hud_python-0.4.51 → hud_python-0.4.53}/hud/agents/openai_chat_generic.py RENAMED

@@ -20,6 +20,7 @@ import logging
 from typing import TYPE_CHECKING, Any, ClassVar, cast
 import mcp.types as types
+from openai import AsyncOpenAI
 from hud import instrument
 from hud.types import AgentResponse, MCPToolCall, MCPToolResult
@@ -28,7 +29,6 @@ from hud.utils.hud_console import HUDConsole
 from .base import MCPAgent
 if TYPE_CHECKING:
-    from openai import AsyncOpenAI
     from openai.types.chat import ChatCompletionToolParam
 logger = logging.getLogger(__name__)
@@ -42,14 +42,26 @@ class GenericOpenAIChatAgent(MCPAgent):
     def __init__(
         self,
         *,
-        openai_client: AsyncOpenAI | None,
+        openai_client: AsyncOpenAI | None = None,
+        api_key: str | None = None,
+        base_url: str | None = None,
         model_name: str = "gpt-4o-mini",
         completion_kwargs: dict[str, Any] | None = None,
         **agent_kwargs: Any,
     ) -> None:
         # Accept base-agent settings via **agent_kwargs (e.g., mcp_client, system_prompt, etc.)
         super().__init__(**agent_kwargs)
-        self.oai = openai_client
+        # Handle client creation - support both patterns
+        if openai_client is not None:
+            # Use provided client (backward compatibility)
+            self.oai = openai_client
+        elif api_key is not None or base_url is not None:
+            # Create client from config (new pattern, consistent with other agents)
+            self.oai = AsyncOpenAI(api_key=api_key, base_url=base_url)
+        else:
+            raise ValueError("Either openai_client or (api_key and base_url) must be provided")
         self.model_name = model_name
         self.completion_kwargs: dict[str, Any] = completion_kwargs or {}
         self.mcp_schemas = []
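
Note on the constructor change above: the new branch order is (1) use a supplied client, (2) otherwise build one from `api_key` / `base_url`, (3) otherwise raise. A rough offline sketch of that precedence follows. `StubClient` and `resolve_client` are hypothetical stand-ins, with `StubClient` replacing `openai.AsyncOpenAI` so the example runs without the dependency. The condition in the diff uses `or`, so supplying only one of the two values is enough to take the second branch, even though the error message words it as "and".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StubClient:
    """Hypothetical stand-in for openai.AsyncOpenAI (assumption, for offline use)."""

    api_key: str | None = None
    base_url: str | None = None


def resolve_client(
    client: StubClient | None = None,
    api_key: str | None = None,
    base_url: str | None = None,
) -> StubClient:
    """Hypothetical helper mirroring the branch order shown in the diff."""
    if client is not None:
        # A pre-built client wins; api_key/base_url are ignored.
        return client
    if api_key is not None or base_url is not None:
        # Either value alone is sufficient, matching the `or` in the diff.
        return StubClient(api_key=api_key, base_url=base_url)
    raise ValueError("Either openai_client or (api_key and base_url) must be provided")


existing = StubClient(api_key="k")
print(resolve_client(client=existing, api_key="ignored") is existing)
print(resolve_client(base_url="http://localhost:8000/v1"))
```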

{hud_python-0.4.51 → hud_python-0.4.53}/hud/agents/tests/test_base.py RENAMED

@@ -94,7 +94,7 @@ class TestBaseMCPAgent:
         assert agent.mcp_client is not None
         assert agent.allowed_tools is None
-        assert agent.disallowed_tools == []
+        assert agent.disallowed_tools is None
         assert agent.initial_screenshot is True
         assert agent.system_prompt is not None  # Default system prompt is set
@@ -241,6 +241,13 @@ class TestBaseMCPAgent:
         assert "tool2" not in tool_names  # Not in allowed list
         assert "tool3" not in tool_names  # In disallowed list
+        # Make sure tool schemas are correct
+        schemas = agent.get_tool_schemas()
+        assert len(schemas) == 1
+        assert schemas[0]["name"] == "tool1"
+        assert schemas[0]["description"] == "Tool 1"
+        assert schemas[0]["parameters"] == {"type": "object"}
     @pytest.mark.asyncio
     async def test_call_tool_success(self):
         """Test successful tool call."""
@@ -334,7 +341,7 @@ class TestBaseMCPAgent:
         schemas = agent.get_tool_schemas()
         # Should include non-lifecycle tools
-        assert len(schemas) == 1
+        assert len(schemas) == 2
         assert schemas[0]["name"] == "tool1"
     def test_get_tools_by_server(self):
