PyPI - hud-python - Versions diffs - 0.4.54__tar.gz → 0.4.56__tar.gz - Mend

Potentially problematic release.

Files changed (303)

{hud_python-0.4.54 → hud_python-0.4.56}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.4.54
+Version: 0.4.56
 Summary: SDK for the HUD platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-python
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues

hud_python-0.4.56/environments/rubrics/README.md ADDED

@@ -0,0 +1,182 @@
+# Rubrics Environment
+Web research environment powered by Exa API for searching and fetching content, with rubric-based evaluation for structured grading.
+See [docs](https://docs.hud.so/build-environments) for the complete environment design workflow.
+## Architecture
+**`environment/`** - Manages Exa API integration and state
+- Holds the Exa API key server-side
+- Exposes HTTP endpoints `/search`, `/fetch`, `/answer`, `/evaluate` for research workflows
+- Implements exponential backoff for rate limiting
+**`server/`** - Wraps data in MCP tools
+- Provides `search()`, `fetch()`, `answer()`, `evaluate()` tools for agents
+- Agents and tasks interact only with these tools
+**Why separate?** Edit tools for the agent or tasks without restarting the environment backend.
+## Tools
+- **`search(query: str)`** - Search the web using Exa API, returns list of results with titles and URLs
+- **`fetch(url: str)`** - Fetch full content from a URL, returns summary, highlights, and text
+- **`answer(final_answer: str)`** - Submit the final research answer
+- **`evaluate(rubric: list[dict])`** - Evaluate submitted answer using a structured rubric with weighted requirements
+### Rubric-Based Evaluation
+The `evaluate` tool uses The LLM Data Company's [rubric](https://github.com/The-LLM-Data-Company/rubric/) package to grade answers against structured criteria with autograders.
+## Setup
+### Requirements
+- Exa API key (get one at [exa.ai](https://exa.ai))
+### Environment Variables
+```bash
+export EXA_API_KEY="your_exa_api_key_here"
+```
+## Development
+```bash
+# Terminal 1 - Environment backend
+cd environment
+export EXA_API_KEY="your_key"
+uv run uvicorn server:app --reload
+# Terminal 2 - MCP server
+cd server
+uv run hud dev
+```
+The environment includes exponential backoff for rate limiting, so API calls will automatically retry on 429 errors.
+In general, we recommend starting work on the environment backend first, then developing the MCP server to expose the right things to the agent.
+For complex environments that require many dependencies, we recommend running `hud dev` in the environment root:
+```bash
+cd ..
+export EXA_API_KEY="your_key"
+hud dev
+```
+## Tasks & Evaluation
+```bash
+# Build first in the global folder with the Dockerfile (creates rubrics:0.1.0)
+hud build
+```
+Your `tasks.json` uses `docker run` to launch the environment:
+```json
+{
+  "prompt": "Research and answer: What is the capital of France?",
+  "mcp_config": {
+    "local": {
+      "command": "docker",
+      "args": ["run", "--rm", "-i", "-e", "EXA_API_KEY", "rubrics:latest"]
+    }
+  },
+  "evaluate_tool": {
+    "name": "evaluate",
+    "arguments": {
+      "rubric": [
+        {
+          "requirement": "Correctly identifies Paris as the capital of France",
+          "weight": 5
+        },
+        {
+          "requirement": "Provides additional context about Paris (population, history, or geography)",
+          "weight": 10
+        }
+      ]
+    }
+  }
+}
+```
+**Note:** The `-e EXA_API_KEY` flag passes your local API key to the container.
+**Commands:**
+```bash
+# Build first
+hud build
+# Test task locally
+export EXA_API_KEY="your_key"
+hud eval tasks.json
+# Push environment for remote running
+hud push
+# Production RL training
+hud rl tasks.json  # Auto-converts docker→remote, builds & pushes if needed
+```
+## Publishing Your Environment
+Once your environment is ready, you can share it with the community:
+### 1. Push to Registry
+```bash
+# Build and push your environment (requires docker hub login and hud api key)
+hud build
+hud push
+```
+### 2. Create a Dataset
+Create a dataset on HuggingFace with your tasks:
+**Option A: Upload manually**
+1. Upload your `tasks.json` to HuggingFace
+2. Make sure it's **public** to appear on leaderboards
+**Option B: Use the SDK**
+```python
+from hud.datasets import save_tasks
+import json
+# Load your tasks
+with open("tasks.json") as f:
+    tasks = json.load(f)
+# Push to HuggingFace
+save_tasks(tasks, repo_id="your-org/your-dataset")
+```
+### 3. Run and Track Performance
+```bash
+# Run Claude on your benchmark
+hud eval "your-org/your-dataset" --agent claude
+# View results at:
+# hud.so/leaderboards/your-org/your-dataset
+```
+**Note**: Only public HuggingFace datasets appear as leaderboards!
+📚 Learn more: [Creating Benchmarks](https://docs.hud.so/evaluate-agents/create-benchmarks) | [Leaderboards](https://docs.hud.so/evaluate-agents/leaderboards)
+## Example Research Workflow
+```python
+# Agent searches for information
+results = search("latest AI developments 2024")
+# Agent fetches detailed content from top result
+content = fetch(results[0]["url"])
+# Agent submits final answer
+answer("Based on research, AI developments in 2024 include...")
+# Evaluate answer using rubric
+result = evaluate(rubric=[
+    {"requirement": "Mentions at least 3 specific AI developments", "weight": 15},
+    {"requirement": "Includes dates or timeframes for developments", "weight": 5},
+])
+# Returns: {"reward": float, "info": {"report": [...]}, "done": True}
+```
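
The README above describes rubric entries as `requirement`/`weight` pairs but does not show how a reward is derived from them. One plausible reading, sketched here, is the weighted fraction of satisfied requirements. This is an assumption for illustration only: the real `rubric` package grades with autograders and may aggregate differently, and both `weighted_reward` and the `passed` list are hypothetical names.

```python
from typing import TypedDict


class Requirement(TypedDict):
    requirement: str
    weight: float


def weighted_reward(rubric: list[Requirement], passed: list[bool]) -> float:
    """Return the weight of satisfied requirements divided by the total weight."""
    if len(rubric) != len(passed):
        raise ValueError("rubric and passed must have the same length")
    total = sum(item["weight"] for item in rubric)
    if total <= 0:
        return 0.0  # an empty or zero-weight rubric cannot award anything
    earned = sum(item["weight"] for item, ok in zip(rubric, passed) if ok)
    return earned / total


rubric: list[Requirement] = [
    {"requirement": "Correctly identifies Paris as the capital of France", "weight": 5},
    {"requirement": "Provides additional context about Paris", "weight": 10},
]
# Only the first requirement is met, so the reward is 5 / 15.
print(weighted_reward(rubric, [True, False]))
```

Under this reading, the sample task in the README gives twice as much credit for the extra context as for naming the capital, which may or may not be what a task author intends.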

hud_python-0.4.56/environments/rubrics/environment/pyproject.toml ADDED

@@ -0,0 +1,18 @@
+[project]
+name = "rubrics-environment"
+version = "0.1.0"
+description = "Backend service for Rubrics environment"
+requires-python = ">=3.11"
+dependencies = [
+    "fastapi>=0.104.1",
+    "uvicorn[standard]>=0.24.0",
+    "httpx>=0.24.0",
+    "rubric>=1.1.7",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.build.targets.wheel]
+packages = ["environment"]
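
The rubrics README states that the backend retries on 429 responses with exponential backoff, but the backend source is not included in this excerpt. A minimal sketch of that retry pattern, using only the standard library, might look like the following. The helper name `call_with_backoff`, its parameters, and the fake `send` callable are all hypothetical; the real backend presumably wraps `httpx` calls and may use different delays or jitter.

```python
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 429 or retries run out."""
    status = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        sleep(base_delay * (2 ** attempt))  # delay doubles on each retry
        status = send()
    return status


# Simulate two rate-limited responses followed by a success.
responses = iter([429, 429, 200])
delays: list[float] = []
status = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(status, delays)  # 200 [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.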

hud_python-0.4.56/environments/rubrics/pyproject.toml ADDED

@@ -0,0 +1,19 @@
+[project]
+name = "rubrics"
+version = "0.1.0"
+description = "Rubrics HUD environment with HTTP backend (EXA on server)"
+requires-python = ">=3.11"
+dependencies = [ "hud-python==0.4.42", "fastapi>=0.104.1", "uvicorn[standard]>=0.24.0", "httpx>=0.24.0",]
+[build-system]
+requires = [ "hatchling",]
+build-backend = "hatchling.build"
+[tool.hud]
+image = "rubrics:dev"
+[tool.hatch.metadata]
+allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = [ "controller", "environment",]

hud_python-0.4.56/environments/rubrics/server/pyproject.toml ADDED

@@ -0,0 +1,19 @@
+[project]
+name = "rubrics-mcp"
+version = "0.1.0"
+description = "MCP server for Rubrics environment"
+requires-python = ">=3.11"
+dependencies = [
+    "hud-python>=0.4.54",
+    "httpx>=0.24.0",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.metadata]
+allow-direct-references = true
+[tool.hatch.build.targets.wheel]
+packages = ["mcp"]

{hud_python-0.4.54 → hud_python-0.4.56}/hud/agents/base.py RENAMED

@@ -11,6 +11,7 @@ from typing import TYPE_CHECKING, Any, ClassVar, Literal
 import mcp.types as types
+from hud.agents.utils import log_agent_metadata_to_status, log_task_config_to_current_trace
 from hud.types import AgentResponse, MCPToolCall, MCPToolResult, Trace
 from hud.utils.hud_console import HUDConsole
 from hud.utils.mcp import MCPConfigPatch, patch_mcp_config, setup_hud_telemetry
@@ -62,6 +63,7 @@ class MCPAgent(ABC):
         initial_screenshot: bool = True,
         # Misc
         model_name: str = "mcp-agent",
+        checkpoint_name: str | None = None,
         response_agent: ResponseAgent | None = None,
         auto_trace: bool = True,
         verbose: bool = False,
@@ -92,6 +94,7 @@ class MCPAgent(ABC):
         self._auto_created_client = False  # Track if we created the client
         self.model_name = model_name
+        self.checkpoint_name = checkpoint_name
         self.console = HUDConsole(logger=logger)
         # Set verbose mode if requested
@@ -198,6 +201,8 @@ class MCPAgent(ABC):
             f"Agent initialized with {len(self.get_available_tools())} tools: {', '.join([t.name for t in self.get_available_tools()])}"  # noqa: E501
         )
+        await log_agent_metadata_to_status(self.model_name, self.checkpoint_name)
     async def run(self, prompt_or_task: str | Task | dict[str, Any], max_steps: int = 10) -> Trace:
         """
         Run the agent with the given prompt or task.
@@ -223,6 +228,9 @@ class MCPAgent(ABC):
             # Handle Task objects with full lifecycle
             if isinstance(prompt_or_task, Task):
+                # Log a compact summary of task config to the current trace (async)
+                await log_task_config_to_current_trace(prompt_or_task)
                 return await self.run_task(prompt_or_task, max_steps)
             # Handle simple string prompts

{hud_python-0.4.54 → hud_python-0.4.56}/hud/agents/claude.py RENAMED

@@ -89,7 +89,8 @@ class ClaudeAgent(MCPAgent):
         self.use_computer_beta = use_computer_beta
         self.hud_console = HUDConsole(logger=logger)
-        self.model_name = self.model
+        self.model_name = "Claude"
+        self.checkpoint_name = self.model
         # Track mapping from Claude tool names to MCP tool names
         self._claude_to_mcp_tool_map: dict[str, str] = {}
@@ -98,14 +99,14 @@ class ClaudeAgent(MCPAgent):
         # Append Claude-specific instructions to the base system prompt
         claude_instructions = """
         You are Claude, an AI assistant created by Anthropic. You are helpful, harmless, and honest.
         When working on tasks:
         1. Be thorough and systematic in your approach
         2. Complete tasks autonomously without asking for confirmation
         3. Use available tools efficiently to accomplish your goals
         4. Verify your actions and ensure task completion
         5. Be precise and accurate in all operations
         Remember: You are expected to complete tasks autonomously. The user trusts you to accomplish what they asked.
         """.strip()  # noqa: E501

{hud_python-0.4.54 → hud_python-0.4.56}/hud/agents/openai.py RENAMED

@@ -70,6 +70,7 @@ class OperatorAgent(MCPAgent):
         self.openai_client = model_client
         self.model = model
+        self.checkpoint_name = self.model
         self.environment = environment
         # State tracking for OpenAI's stateful API
@@ -84,7 +85,7 @@ class OperatorAgent(MCPAgent):
             except Exception as e:
                 raise ValueError(f"OpenAI API key is invalid: {e}") from e
-        self.model_name = "openai-" + self.model
+        self.model_name = "Operator"
         # Append OpenAI-specific instructions to the base system prompt
         openai_instructions = """

{hud_python-0.4.54 → hud_python-0.4.56}/hud/agents/openai_chat_generic.py RENAMED

@@ -62,7 +62,8 @@ class GenericOpenAIChatAgent(MCPAgent):
         else:
             raise ValueError("Either openai_client or (api_key and base_url) must be provided")
-        self.model_name = model_name
+        self.model_name = "GenericOpenAI"
+        self.checkpoint_name = model_name
         self.completion_kwargs: dict[str, Any] = completion_kwargs or {}
         self.mcp_schemas = []
         self.hud_console = HUDConsole(logger=logger)
@@ -194,7 +195,7 @@ class GenericOpenAIChatAgent(MCPAgent):
             raise ValueError("openai_client is required for GenericOpenAIChatAgent")
         # default transport = OpenAI SDK
         return await self.oai.chat.completions.create(
-            model=self.model_name,
+            model=self.checkpoint_name,
             messages=messages,
             tools=tools,  # type: ignore ready ChatCompletionToolParam-shaped
             **extra,
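
The agent hunks above change what `model_name` means: it becomes a fixed display label (`"Claude"`, `"Operator"`, `"GenericOpenAI"`), while the concrete model identifier moves to the new `checkpoint_name` attribute, which is what the chat-completions call now receives. The sketch below mirrors that split with stand-in classes. `BaseAgent`, `ChatAgent`, `metadata`, `request_kwargs`, and the checkpoint string are hypothetical and are not part of hud-python.

```python
from __future__ import annotations


class BaseAgent:
    """Stand-in for MCPAgent: holds a display label and an optional checkpoint."""

    def __init__(
        self, model_name: str = "mcp-agent", checkpoint_name: str | None = None
    ) -> None:
        self.model_name = model_name
        self.checkpoint_name = checkpoint_name

    def metadata(self) -> dict[str, str]:
        # Only include the checkpoint when one was provided.
        meta = {"model_name": self.model_name}
        if self.checkpoint_name is not None:
            meta["checkpoint_name"] = self.checkpoint_name
        return meta


class ChatAgent(BaseAgent):
    """Stand-in for an agent that receives a model identifier from the caller."""

    def __init__(self, model_name: str) -> None:
        # The caller's identifier becomes the checkpoint; the label is fixed.
        super().__init__(model_name="GenericOpenAI", checkpoint_name=model_name)

    def request_kwargs(self) -> dict[str, str | None]:
        # API requests must carry the checkpoint, not the display label.
        return {"model": self.checkpoint_name}


agent = ChatAgent("my-finetune-v1")
print(agent.metadata())
print(agent.request_kwargs())
```

Code that previously compared `agent.model_name` against a model identifier would need to read `checkpoint_name` instead, which is the adjustment the updated tests in this diff make.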

{hud_python-0.4.54 → hud_python-0.4.56}/hud/agents/tests/test_claude.py RENAMED

@@ -89,7 +89,7 @@ class TestClaudeAgent:
             validate_api_key=False,  # Skip validation in tests
         )
-        assert agent.model_name == "claude-3-opus-20240229"
+        assert agent.model_name == "Claude"
         assert agent.max_tokens == 1000
         assert agent.anthropic_client == mock_model_client
@@ -103,7 +103,7 @@ class TestClaudeAgent:
                 validate_api_key=False,  # Skip validation in tests
             )
-            assert agent.model_name == "claude-3-opus-20240229"
+            assert agent.model_name == "Claude"
             assert agent.anthropic_client is not None
     @pytest.mark.asyncio

{hud_python-0.4.54 → hud_python-0.4.56}/hud/agents/tests/test_openai.py RENAMED

@@ -50,7 +50,7 @@ class TestOperatorAgent:
             validate_api_key=False,  # Skip validation in tests
         )
-        assert agent.model_name == "openai-gpt-4"
+        assert agent.model_name == "Operator"
         assert agent.model == "gpt-4"
         assert agent.openai_client == mock_model_client

hud_python-0.4.56/hud/agents/utils.py ADDED

@@ -0,0 +1,50 @@
+from __future__ import annotations
+import contextlib
+from typing import TYPE_CHECKING
+from hud.otel.context import (
+    _update_task_status_async,
+    get_current_task_run_id,
+)
+if TYPE_CHECKING:
+    from hud.datasets import Task
+async def log_task_config_to_current_trace(task: Task) -> None:
+    with contextlib.suppress(Exception):
+        task_run_id = get_current_task_run_id()
+        if not task_run_id:
+            return
+        raw_config = task.model_dump()
+        await _update_task_status_async(
+            task_run_id,
+            "running",
+            task_id=task.id,
+            extra_metadata={"task_config": raw_config},
+        )
+async def log_agent_metadata_to_status(
+    model_name: str | None = None, checkpoint_name: str | None = None
+) -> None:
+    """Attach agent metadata (model/checkpoint) to current trace status metadata."""
+    with contextlib.suppress(Exception):
+        task_run_id = get_current_task_run_id()
+        if not task_run_id or (not model_name and not checkpoint_name):
+            return
+        agent_meta = {}
+        if model_name is not None:
+            agent_meta["model_name"] = model_name
+        if checkpoint_name is not None:
+            agent_meta["checkpoint_name"] = checkpoint_name
+        await _update_task_status_async(
+            task_run_id,
+            "running",
+            extra_metadata={"agent": agent_meta},
+        )
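
Both helpers in `hud/agents/utils.py` wrap their whole body in `contextlib.suppress(Exception)`, so a telemetry failure never interrupts the agent run. The sketch below reproduces that best-effort pattern in isolation. `update_status`, `log_best_effort`, and the in-memory `recorded` list are hypothetical stand-ins for the private `hud.otel.context` functions, whose behaviour is not shown in this diff.

```python
from __future__ import annotations

import asyncio
import contextlib

recorded: list[dict] = []


async def update_status(run_id: str, metadata: dict, fail: bool = False) -> None:
    """Hypothetical telemetry call that may raise."""
    if fail:
        raise ConnectionError("telemetry backend unreachable")
    recorded.append({"run_id": run_id, **metadata})


async def log_best_effort(
    run_id: str | None, metadata: dict, fail: bool = False
) -> None:
    # Any Exception raised inside the block is swallowed silently.
    with contextlib.suppress(Exception):
        if not run_id:
            return  # no active run: nothing to attach metadata to
        await update_status(run_id, metadata, fail=fail)


async def main() -> None:
    await log_best_effort("run-1", {"model_name": "Claude"})
    await log_best_effort("run-2", {"model_name": "Claude"}, fail=True)  # swallowed
    await log_best_effort(None, {"model_name": "Claude"})  # skipped


asyncio.run(main())
print(recorded)  # [{'run_id': 'run-1', 'model_name': 'Claude'}]
```

Since Python 3.8, `asyncio.CancelledError` derives from `BaseException`, so `suppress(Exception)` does not swallow task cancellation. The trade-off of this pattern is that failed telemetry calls leave no trace unless they are logged elsewhere.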

{hud_python-0.4.54 → hud_python-0.4.56}/hud/cli/__init__.py RENAMED

@@ -382,6 +382,11 @@ def dev(
         "--watch",
         help="Additional directories to watch for changes (default: current directory)",
     ),
+    new: bool = typer.Option(
+        False,
+        "--new",
+        help="Show Cursor installation link for new server setup",
+    ),
 ) -> None:
     """🔥 Development mode - run MCP server with hot-reload.
@@ -422,6 +427,7 @@ def dev(
         watch,
         docker=docker,
         docker_args=docker_args,
+        new=new,
     )
@@ -740,7 +746,7 @@ def init(
         None,
         "--preset",
         "-p",
-        help="Preset to use: blank, deep-research, browser. If omitted, you'll choose interactively.",  # noqa: E501
+        help="Preset to use: blank, deep-research, browser, rubrics. If omitted, you'll choose interactively.",  # noqa: E501
     ),
     directory: str = typer.Option(".", "--dir", "-d", help="Target directory"),
     force: bool = typer.Option(False, "--force", "-f", help="Overwrite existing files"),
@@ -1079,6 +1085,51 @@ def rl(
     )
+@app.command()
+def convert(
+    tasks_file: str = typer.Argument(
+        ..., help="Path to tasks file (JSON/JSONL) to convert to remote MCP configuration"
+    ),
+) -> None:
+    """Convert local MCP task configs to remote (mcp.hud.so) format.
+    This mirrors the implicit conversion flow used by 'hud rl' and writes a new
+    remote_<name>.json next to the source file when needed.
+    """
+    from pathlib import Path
+    from hud.utils.hud_console import HUDConsole
+    hud_console = HUDConsole()
+    try:
+        from .flows.tasks import convert_tasks_to_remote
+        result_path = convert_tasks_to_remote(tasks_file)
+        # If nothing changed, inform the user
+        try:
+            if Path(result_path).resolve() == Path(tasks_file).resolve():
+                hud_console.success(
+                    "Tasks already reference remote MCP URLs. No conversion needed."
+                )
+                hud_console.hint("You can run them directly with: hud eval <tasks_file> --full")
+                return
+        except Exception as e:
+            # Best effort; continue with success message
+            hud_console.debug(f"Path comparison failed, continuing: {e}")
+        hud_console.success(f"Converted tasks written to: {result_path}")
+        hud_console.hint(
+            "You can now run remote flows: hud rl <converted_file> or hud eval <converted_file>"
+        )
+    except typer.Exit:
+        raise
+    except Exception as e:
+        hud_console.error(f"Failed to convert tasks: {e}")
+        raise typer.Exit(1) from e
 @app.command()
 def set(
     assignments: list[str] = typer.Argument(  # type: ignore[arg-type]  # noqa: B008
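
The new `convert` command decides whether anything was converted by comparing the resolved result path with the resolved input path. A small sketch of why `resolve()` matters for that comparison follows; `was_converted` is a hypothetical helper, and the file names are placeholders that need not exist on disk.

```python
from pathlib import Path


def was_converted(source: str, result: str) -> bool:
    """Return True when the result path differs from the source once normalised."""
    return Path(result).resolve() != Path(source).resolve()


# Different spellings of the same file compare equal after resolve().
print(was_converted("tasks.json", "./tasks.json"))       # False
# A new file next to the source counts as a conversion.
print(was_converted("tasks.json", "remote_tasks.json"))  # True
```

Comparing the raw strings instead would report a conversion whenever the same file is spelled differently, for example with a leading `./`.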
