PyPI - hud-python - Versions diffs - 0.6.6__tar.gz → 0.6.8.dev0__tar.gz - Mend

hud-python 0.6.6__tar.gz → 0.6.8.dev0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (244)

{hud_python-0.6.6 → hud_python-0.6.8.dev0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.6.6
+Version: 0.6.8.dev0
 Summary: SDK for the HUD platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-python
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues
@@ -87,7 +87,7 @@ Description-Content-Type: text/markdown
 HUD is a platform for building RL environments for AI agents, across coding, browser, computer-use, and robotics. Define an environment, write tasks, and run them as evals and training across any model, at any scale.
-To learn more, see the [documentation](https://docs.hud.ai) and [API reference](https://docs.hud.ai/reference/environment).
+To learn more, see the [documentation](https://docs.hud.ai) and [environment reference](https://docs.hud.ai/v6/core/environment).
 [![PyPI](https://img.shields.io/pypi/v/hud-python?style=flat-square)](https://pypi.org/project/hud-python/)
 [![License](https://img.shields.io/badge/license-MIT-green?style=flat-square)](LICENSE)
@@ -120,7 +120,7 @@ Then scaffold your first environment:
 hud init my-env
 ```
-![Agent running on SheetBench](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/trace_sheet.gif)
+![Agent running on SheetBench](docs/src/images/trace_sheet.gif)
 ## The protocol
@@ -159,14 +159,14 @@ hud eval my-taskset --remote
 For local iteration, the same protocol works against a container on your laptop:
 ```bash
-hud build .
-docker run -d --name run1 my-env
-docker exec run1 hud task start fix_bug
-docker exec run1 hud task grade fix_bug --answer "…"
+docker build -f Dockerfile.hud -t my-env .
+docker run -d --name run1 -p 8765:8765 my-env
+hud task start fix_bug --url tcp://127.0.0.1:8765
+hud task grade fix_bug --url tcp://127.0.0.1:8765 --answer "..."
 docker rm -f run1
 ```
-→ [Package & deploy](https://docs.hud.ai/run/deploy)
+→ [Run & deploy](https://docs.hud.ai/v6/core/runtime)
 ## Environments & templates
@@ -193,7 +193,7 @@ hud eval tasks.py claude --group 3
 Each graded evaluation is a **trace** (the SDK's live handle is a `Run`). With `HUD_API_KEY` set, every rollout is recorded on [hud.ai](https://hud.ai). Tasks that need a shell, browser, GUI, or robot declare **capabilities** (below); everything else — variants, grading, batching — stays identical.
-→ [Quickstart](https://docs.hud.ai/quickstart) · [Tasks & tasksets](https://docs.hud.ai/reference/tasks)
+→ [Quickstart](https://docs.hud.ai/v6/start/quickstart) · [Tasks & tasksets](https://docs.hud.ai/v6/core/tasks)
 ## Capabilities & harnesses
@@ -211,39 +211,42 @@ A **capability** is a connection the environment exposes; a **harness** attaches
 **Bring your own:** a harness attaches to a capability and defines a tool spec — wrap `browser-use` on `cdp`, a VLA policy on `robot`, or your own agent on `ssh` / `mcp`. No protocol work required.
-→ [Capabilities](https://docs.hud.ai/reference/capabilities) · [Models](https://docs.hud.ai/run/models) · [Robots](https://docs.hud.ai/reference/robots)
+→ [Capabilities](https://docs.hud.ai/v6/core/capabilities) · [Models](https://docs.hud.ai/v6/core/agents) · [Robots](https://docs.hud.ai/v6/advanced/robots)
 ## Deploy on the platform
 From the [platform UI](https://hud.ai) you can run batches, compare models on the same taskset, and inspect every trace.
-→ [Deploy](https://docs.hud.ai/run/deploy) · [Leaderboards](https://hud.ai/leaderboards)
+→ [Run & deploy](https://docs.hud.ai/v6/core/runtime)
 ## Train on rewards
-Every rollout returns a `Run` carrying a `trace_id` and a `reward`, so the tasks you evaluate are already training data. Run a **group** per task and turn the rewards into GRPO advantages with `group_relative()`:
+Every rollout returns a `Run` carrying a `trace_id` and a `reward`, so the tasks you evaluate are already training data. Run a **group** per task and pass the graded runs to `TrainingClient.step()`:
 ```python
+from hud import TrainingClient
 from hud.agents import create_agent
-from hud.eval import Taskset, group_relative
+from hud.eval import Job
-agent = create_agent("claude-sonnet-4-5")
-job = await Taskset(count_letter(word=w) for w in words).run(agent, group=16)
-for runs in job.results.values():
-    advantages = group_relative([r.reward for r in runs], normalize_std=True)
-    ...  # feed (run.trace_id, adv) into your optimizer
+agent = create_agent("arith-rl", completion_kwargs={"extra_body": {"return_token_ids": True}})
+trainer = TrainingClient("arith-rl")
+taskset, runtime = ...  # your Taskset and where rollouts run
+session = await Job.start("arith-rl", group=8)
+start = len(session.runs)
+await taskset.run(agent, runtime=runtime, group=8, job=session)
+await trainer.step(session.runs[start:], learning_rate=1e-5, group_size=8)
 ```
 HUD is the environment-and-reward source for your own GRPO/PPO loop — the same environment trains any model, text or multimodal, unchanged.
-→ [Training](https://docs.hud.ai/run/training) · [Designing tasks for signal](https://docs.hud.ai/run/signal)
+→ [Training](https://docs.hud.ai/v6/core/training) · [Designing tasks for signal](https://docs.hud.ai/v6/core/advice)
 ## Links
 - [Documentation](https://docs.hud.ai)
-- [Quickstart](https://docs.hud.ai/quickstart)
-- [CLI reference](https://docs.hud.ai/reference/cli)
-- [Leaderboards](https://hud.ai/leaderboards)
+- [Quickstart](https://docs.hud.ai/v6/start/quickstart)
+- [CLI reference](https://docs.hud.ai/v6/core/cli)
 - [Environment templates](https://hud.ai/environments)
 - [Supported models](https://hud.ai/models)
 - [Discord](https://discord.gg/wkjtmHYYjm)
@@ -268,8 +271,8 @@ Key areas: [Agents](hud/agents/) · [Environments](hud/environment/) · [Capabil
 ```bibtex
 @software{hud2025agentevalplatform,
-  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Govind Pimpale and Dylan Bowman and Jaideep and Nguyen Nhat Minh},
-  title  = {HUD: An Evaluation and RL Envrionments Platform for Agents},
+  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Govind Pimpale and Dylan Bowman and Jaideep Chawla and Nguyen Nhat Minh},
+  title  = {HUD: An Evaluation and RL Environments Platform for Agents},
   date   = {2025-04},
   url    = {https://github.com/hud-evals/hud-python},
   langid = {en}

{hud_python-0.6.6 → hud_python-0.6.8.dev0}/README.md RENAMED

@@ -8,7 +8,7 @@
 HUD is a platform for building RL environments for AI agents, across coding, browser, computer-use, and robotics. Define an environment, write tasks, and run them as evals and training across any model, at any scale.
-To learn more, see the [documentation](https://docs.hud.ai) and [API reference](https://docs.hud.ai/reference/environment).
+To learn more, see the [documentation](https://docs.hud.ai) and [environment reference](https://docs.hud.ai/v6/core/environment).
 [![PyPI](https://img.shields.io/pypi/v/hud-python?style=flat-square)](https://pypi.org/project/hud-python/)
 [![License](https://img.shields.io/badge/license-MIT-green?style=flat-square)](LICENSE)
@@ -41,7 +41,7 @@ Then scaffold your first environment:
 hud init my-env
 ```
-![Agent running on SheetBench](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/trace_sheet.gif)
+![Agent running on SheetBench](docs/src/images/trace_sheet.gif)
 ## The protocol
@@ -80,14 +80,14 @@ hud eval my-taskset --remote
 For local iteration, the same protocol works against a container on your laptop:
 ```bash
-hud build .
-docker run -d --name run1 my-env
-docker exec run1 hud task start fix_bug
-docker exec run1 hud task grade fix_bug --answer "…"
+docker build -f Dockerfile.hud -t my-env .
+docker run -d --name run1 -p 8765:8765 my-env
+hud task start fix_bug --url tcp://127.0.0.1:8765
+hud task grade fix_bug --url tcp://127.0.0.1:8765 --answer "..."
 docker rm -f run1
 ```
-→ [Package & deploy](https://docs.hud.ai/run/deploy)
+→ [Run & deploy](https://docs.hud.ai/v6/core/runtime)
 ## Environments & templates
@@ -114,7 +114,7 @@ hud eval tasks.py claude --group 3
 Each graded evaluation is a **trace** (the SDK's live handle is a `Run`). With `HUD_API_KEY` set, every rollout is recorded on [hud.ai](https://hud.ai). Tasks that need a shell, browser, GUI, or robot declare **capabilities** (below); everything else — variants, grading, batching — stays identical.
-→ [Quickstart](https://docs.hud.ai/quickstart) · [Tasks & tasksets](https://docs.hud.ai/reference/tasks)
+→ [Quickstart](https://docs.hud.ai/v6/start/quickstart) · [Tasks & tasksets](https://docs.hud.ai/v6/core/tasks)
 ## Capabilities & harnesses
@@ -132,39 +132,42 @@ A **capability** is a connection the environment exposes; a **harness** attaches
 **Bring your own:** a harness attaches to a capability and defines a tool spec — wrap `browser-use` on `cdp`, a VLA policy on `robot`, or your own agent on `ssh` / `mcp`. No protocol work required.
-→ [Capabilities](https://docs.hud.ai/reference/capabilities) · [Models](https://docs.hud.ai/run/models) · [Robots](https://docs.hud.ai/reference/robots)
+→ [Capabilities](https://docs.hud.ai/v6/core/capabilities) · [Models](https://docs.hud.ai/v6/core/agents) · [Robots](https://docs.hud.ai/v6/advanced/robots)
 ## Deploy on the platform
 From the [platform UI](https://hud.ai) you can run batches, compare models on the same taskset, and inspect every trace.
-→ [Deploy](https://docs.hud.ai/run/deploy) · [Leaderboards](https://hud.ai/leaderboards)
+→ [Run & deploy](https://docs.hud.ai/v6/core/runtime)
 ## Train on rewards
-Every rollout returns a `Run` carrying a `trace_id` and a `reward`, so the tasks you evaluate are already training data. Run a **group** per task and turn the rewards into GRPO advantages with `group_relative()`:
+Every rollout returns a `Run` carrying a `trace_id` and a `reward`, so the tasks you evaluate are already training data. Run a **group** per task and pass the graded runs to `TrainingClient.step()`:
 ```python
+from hud import TrainingClient
 from hud.agents import create_agent
-from hud.eval import Taskset, group_relative
+from hud.eval import Job
-agent = create_agent("claude-sonnet-4-5")
-job = await Taskset(count_letter(word=w) for w in words).run(agent, group=16)
-for runs in job.results.values():
-    advantages = group_relative([r.reward for r in runs], normalize_std=True)
-    ...  # feed (run.trace_id, adv) into your optimizer
+agent = create_agent("arith-rl", completion_kwargs={"extra_body": {"return_token_ids": True}})
+trainer = TrainingClient("arith-rl")
+taskset, runtime = ...  # your Taskset and where rollouts run
+session = await Job.start("arith-rl", group=8)
+start = len(session.runs)
+await taskset.run(agent, runtime=runtime, group=8, job=session)
+await trainer.step(session.runs[start:], learning_rate=1e-5, group_size=8)
 ```
 HUD is the environment-and-reward source for your own GRPO/PPO loop — the same environment trains any model, text or multimodal, unchanged.
-→ [Training](https://docs.hud.ai/run/training) · [Designing tasks for signal](https://docs.hud.ai/run/signal)
+→ [Training](https://docs.hud.ai/v6/core/training) · [Designing tasks for signal](https://docs.hud.ai/v6/core/advice)
 ## Links
 - [Documentation](https://docs.hud.ai)
-- [Quickstart](https://docs.hud.ai/quickstart)
-- [CLI reference](https://docs.hud.ai/reference/cli)
-- [Leaderboards](https://hud.ai/leaderboards)
+- [Quickstart](https://docs.hud.ai/v6/start/quickstart)
+- [CLI reference](https://docs.hud.ai/v6/core/cli)
 - [Environment templates](https://hud.ai/environments)
 - [Supported models](https://hud.ai/models)
 - [Discord](https://discord.gg/wkjtmHYYjm)
@@ -189,8 +192,8 @@ Key areas: [Agents](hud/agents/) · [Environments](hud/environment/) · [Capabil
 ```bibtex
 @software{hud2025agentevalplatform,
-  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Govind Pimpale and Dylan Bowman and Jaideep and Nguyen Nhat Minh},
-  title  = {HUD: An Evaluation and RL Envrionments Platform for Agents},
+  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Govind Pimpale and Dylan Bowman and Jaideep Chawla and Nguyen Nhat Minh},
+  title  = {HUD: An Evaluation and RL Environments Platform for Agents},
   date   = {2025-04},
   url    = {https://github.com/hud-evals/hud-python},
   langid = {en}

hud_python-0.6.8.dev0/cookbooks/fireworks-rl-training/README.md ADDED

@@ -0,0 +1,129 @@
+# Fireworks RL Training
+Direct Fireworks Training API loop over the same arithmetic preview task used by
+`cookbooks/rl-training`.
+This does **not** use Fireworks native datasets or RFT jobs. It follows the
+Training API service path from the Fireworks docs:
+1. `FiretitanServiceClient.from_firetitan_config(...)`
+2. `create_deployment_sampler(...)` for high-parallel rollouts
+3. local grading of HUD-style multiplication tasks
+4. `forward_backward_custom(...)` + `optim_step(...)`
+5. `save_weights_for_sampler(...)` + sampler refresh
+References:
+- Fireworks Training API introduction: https://docs.fireworks.ai/fine-tuning/training-api/introduction
+- Training and sampling lifecycle: https://docs.fireworks.ai/fine-tuning/training-api/training-and-sampling
+- Loss functions / GRPO reference: https://docs.fireworks.ai/fine-tuning/training-api/loss-functions
+## Setup
+The repo-level `.env` is loaded automatically. It must contain:
+```bash
+FIREWORKS_API_KEY=...
+FIREWORKS_ACCOUNT_ID=...
+```
+Install the isolated cookbook environment:
+```bash
+uv sync --pre
+```
+## Calibrate task difficulty first
+What matters for GRPO is **within-group** reward spread: advantages are computed
+within each prompt group, so a group whose rollouts all score the same (all 0 or
+all 1) produces zero advantage and no gradient — even if the *overall* mean looks
+healthy. Calibration reports `within_group_reward_std` for exactly this; treat
+it, not `reward_mean`, as the signal that training has something to learn.
+Two backends:
+- `--calibration-backend inference` (default): Fireworks' OpenAI-compatible API.
+  Cheap, but samples `gpt-oss-120b` (`--inference-model`), not the training base —
+  the small serverless catalog on the `lorenss` key has no Qwen3 8B. Use it only
+  for a rough task sanity check.
+- `--calibration-backend managed`: provisions the same deployment sampler that
+  training uses and samples the **actual base model** (Qwen3 8B). This is the
+  calibration that counts. It still skips the trainer and `optim_step`.
+```bash
+uv run train.py --calibrate-only --calibration-backend managed \
+  --groups-per-step 6 --rollouts-per-prompt 6 --parallelism 18 --debug-samples 4
+```
+`--debug-samples N` prints the first N rollouts (reward, output-token count,
+text) so you can see *why* a group scored the way it did. Tune the multiplication
+range until `within_group_reward_std` is clearly above zero:
+- Groups all-correct (`within_group_reward_std ~= 0`) → make it harder
+  (`--min-a/--max-a/--min-b/--max-b`).
+- Groups all-wrong → make it easier, or raise `--max-tokens` so the model can
+  finish its working before the budget cuts it off.
+The shipped defaults (3-digit × 3-digit, `--max-tokens 512`, thinking disabled)
+calibrate to `reward_mean ~= 0.47`, `within_group_reward_std ~= 0.20` on Qwen3 8B:
+a regime where the same problem is sometimes solved (when the model shows its
+work) and sometimes slipped (when it answers directly) — so RL has a gradient to
+follow.
+### Reasoning models and the token budget
+Qwen3 is a hybrid reasoning model: by default it opens a `<think>` block and, on
+a tight `--max-tokens`, spends the whole budget reasoning and never emits the
+answer (reward collapses to zero). This cookbook disables thinking by default
+through the chat template so direct rollouts reach the integer. Pass
+`--enable-thinking` to keep the reasoning block — and raise `--max-tokens`
+accordingly so the answer still fits.
+## Train
+Once calibration has non-trivial rewards:
+```bash
+uv run train.py --steps 5 --groups-per-step 8 --rollouts-per-prompt 8 --parallelism 32
+```
+This uses the direct Training API managed service path. If you want calibration
+to go through the managed deployment sampler too, pass
+`--calibration-backend managed`; this provisions the same resources as training.
+### Preview account constraints
+On the `lorenss` preview account today:
+- **Trainer creation works** end to end with a provisioned key: rollouts,
+  `forward_backward_custom`, `optim_step`, checkpoint save, and sampler hotload
+  all run, and multi-step training completes. (An earlier `unkey inference api id
+  is not configured` 500 on trainer creation was an account-side provisioning gap,
+  now resolved.)
+- **LoRA is unavailable**: the validated `qwen3-8b-128k` shape only accepts
+  full-parameter training, so `--lora-rank > 0` fails at trainer creation with
+  `no validated training shape exists for ... trainer_mode=LORA_TRAINER`.
+- **Hotloads sync full 8B weights** between steps and occasionally exceed the
+  SDK's 600s hotload budget (`RuntimeError: Hotload failed for sampler snapshot
+  ...`). This is transient preview-infra latency, not a loop bug — re-running the
+  same command generally proceeds. There is no clean knob to extend the timeout
+  on the managed sampler path.
+Metrics are written to:
+- `runs/fireworks-rl-preview/metrics.jsonl`
+- `runs/fireworks-rl-preview/reward_loss.png` if `matplotlib` is installed
+## Notes
+- Defaults use Qwen 3 8B full-parameter training:
+  - `accounts/fireworks/models/qwen3-8b`
+  - `Qwen/Qwen3-8B`
+  - `accounts/fireworks/trainingShapes/qwen3-8b-128k`
+- LoRA can be tested with `--lora-rank N`, but the validated Qwen3 8B training
+  shape currently rejects LoRA mode on the `lorenss` preview account.
+- The first checkpoint sync happens after step 0 and subsequent rollouts sample
+  the updated weights through the same deployment.
+- `--keep-trainer` and `--keep-deployment` are available for debugging. By
+  default the trainer is cleaned up and the deployment scales to zero on exit.
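
Note on the calibration section of the added README: its claim that a group whose rollouts all score the same yields zero advantage can be checked with a small sketch. The helper names `group_advantages` and `within_group_reward_std` below are hypothetical and are not taken from the cookbook's `train.py`. The mean-centered, optionally std-normalized advantage is the commonly used GRPO form, and averaging the per-group population standard deviation is an assumed definition of the reported metric, not one confirmed by the diff.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize_std=False, eps=1e-6):
    """Mean-center rewards within one prompt group (hypothetical helper)."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if normalize_std:
        # eps keeps the division finite when every reward in the group is equal.
        scale = pstdev(rewards) + eps
        centered = [c / scale for c in centered]
    return centered


def within_group_reward_std(groups):
    """Average the per-group population std of rewards (assumed definition)."""
    return mean(pstdev(g) for g in groups)


uniform = [1.0, 1.0, 1.0, 1.0]  # every rollout solved the problem
mixed = [1.0, 0.0, 1.0, 0.0]    # same problem, solved only some of the time

print(group_advantages(uniform))                  # [0.0, 0.0, 0.0, 0.0]
print(group_advantages(mixed))                    # [0.5, -0.5, 0.5, -0.5]
print(within_group_reward_std([uniform, mixed]))  # 0.25
```

The uniform group contributes no learning signal even though its mean reward is 1.0, which is why the README treats the within-group spread, not the overall mean, as the quantity to tune.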

{hud_python-0.6.6 → hud_python-0.6.8.dev0}/hud/agents/openai_compatible/agent.py RENAMED

@@ -17,11 +17,13 @@ from hud.types import MCPToolCall, MCPToolResult
 from hud.utils import gateway
 from .tools import (
+    BashTool,
+    EditTool,
     GlobTool,
     GrepTool,
-    ListTool,
     OpenAICompatibleMCPProxyTool,
     ReadTool,
+    WriteTool,
 )
 from .tools.base import format_chat_result
@@ -41,10 +43,12 @@ class OpenAIChatAgent(ToolAgent[ChatCompletionMessageParam, OpenAIChatConfig]):
     """OpenAI-compatible agent using the chat.completions protocol."""
     tool_catalog = (
+        BashTool,
         ReadTool,
-        GrepTool,
         GlobTool,
-        ListTool,
+        GrepTool,
+        EditTool,
+        WriteTool,
         OpenAICompatibleMCPProxyTool,
     )

{hud_python-0.6.6 → hud_python-0.6.8.dev0}/hud/agents/openai_compatible/tools/__init__.py RENAMED

@@ -2,13 +2,15 @@
 from __future__ import annotations
-from .filesystem import GlobTool, GrepTool, ListTool, ReadTool
+from .filesystem import BashTool, EditTool, GlobTool, GrepTool, ReadTool, WriteTool
 from .mcp_proxy import OpenAICompatibleMCPProxyTool
 __all__ = [
+    "BashTool",
+    "EditTool",
     "GlobTool",
     "GrepTool",
-    "ListTool",
     "OpenAICompatibleMCPProxyTool",
     "ReadTool",
+    "WriteTool",
 ]
