hud-python 0.6.6 (tar.gz) → 0.6.7 (tar.gz)

Files changed (243)

{hud_python-0.6.6 → hud_python-0.6.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.6.6
+Version: 0.6.7
 Summary: SDK for the HUD platform.
 Project-URL: Homepage, https://github.com/hud-evals/hud-python
 Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues

hud_python-0.6.7/cookbooks/fireworks-rl-training/README.md ADDED

@@ -0,0 +1,129 @@
+# Fireworks RL Training
+Direct Fireworks Training API loop over the same arithmetic preview task used by
+`cookbooks/rl-training`.
+This does **not** use Fireworks native datasets or RFT jobs. It follows the
+Training API service path from the Fireworks docs:
+1. `FiretitanServiceClient.from_firetitan_config(...)`
+2. `create_deployment_sampler(...)` for high-parallel rollouts
+3. local grading of HUD-style multiplication tasks
+4. `forward_backward_custom(...)` + `optim_step(...)`
+5. `save_weights_for_sampler(...)` + sampler refresh
+References:
+- Fireworks Training API introduction: https://docs.fireworks.ai/fine-tuning/training-api/introduction
+- Training and sampling lifecycle: https://docs.fireworks.ai/fine-tuning/training-api/training-and-sampling
+- Loss functions / GRPO reference: https://docs.fireworks.ai/fine-tuning/training-api/loss-functions
+## Setup
+The repo-level `.env` is loaded automatically. It must contain:
+```bash
+FIREWORKS_API_KEY=...
+FIREWORKS_ACCOUNT_ID=...
+```
+Install the isolated cookbook environment:
+```bash
+uv sync --pre
+```
+## Calibrate task difficulty first
+What matters for GRPO is **within-group** reward spread: advantages are computed
+within each prompt group, so a group whose rollouts all score the same (all 0 or
+all 1) produces zero advantage and no gradient — even if the *overall* mean looks
+healthy. Calibration reports `within_group_reward_std` for exactly this; treat
+it, not `reward_mean`, as the signal that training has something to learn.
+Two backends:
+- `--calibration-backend inference` (default): Fireworks' OpenAI-compatible API.
+  Cheap, but samples `gpt-oss-120b` (`--inference-model`), not the training base —
+  the small serverless catalog on the `lorenss` key has no Qwen3 8B. Use it only
+  for a rough task sanity check.
+- `--calibration-backend managed`: provisions the same deployment sampler that
+  training uses and samples the **actual base model** (Qwen3 8B). This is the
+  calibration that counts. It still skips the trainer and `optim_step`.
+```bash
+uv run train.py --calibrate-only --calibration-backend managed \
+  --groups-per-step 6 --rollouts-per-prompt 6 --parallelism 18 --debug-samples 4
+```
+`--debug-samples N` prints the first N rollouts (reward, output-token count,
+text) so you can see *why* a group scored the way it did. Tune the multiplication
+range until `within_group_reward_std` is clearly above zero:
+- Groups all-correct (`within_group_reward_std ~= 0`) → make it harder
+  (`--min-a/--max-a/--min-b/--max-b`).
+- Groups all-wrong → make it easier, or raise `--max-tokens` so the model can
+  finish its working before the budget cuts it off.
+The shipped defaults (3-digit × 3-digit, `--max-tokens 512`, thinking disabled)
+calibrate to `reward_mean ~= 0.47`, `within_group_reward_std ~= 0.20` on Qwen3 8B:
+a regime where the same problem is sometimes solved (when the model shows its
+work) and sometimes slipped (when it answers directly) — so RL has a gradient to
+follow.
+### Reasoning models and the token budget
+Qwen3 is a hybrid reasoning model: by default it opens a `<think>` block and, on
+a tight `--max-tokens`, spends the whole budget reasoning and never emits the
+answer (reward collapses to zero). This cookbook disables thinking by default
+through the chat template so direct rollouts reach the integer. Pass
+`--enable-thinking` to keep the reasoning block — and raise `--max-tokens`
+accordingly so the answer still fits.
+## Train
+Once calibration has non-trivial rewards:
+```bash
+uv run train.py --steps 5 --groups-per-step 8 --rollouts-per-prompt 8 --parallelism 32
+```
+This uses the direct Training API managed service path. If you want calibration
+to go through the managed deployment sampler too, pass
+`--calibration-backend managed`; this provisions the same resources as training.
+### Preview account constraints
+On the `lorenss` preview account today:
+- **Trainer creation works** end to end with a provisioned key: rollouts,
+  `forward_backward_custom`, `optim_step`, checkpoint save, and sampler hotload
+  all run, and multi-step training completes. (An earlier `unkey inference api id
+  is not configured` 500 on trainer creation was an account-side provisioning gap,
+  now resolved.)
+- **LoRA is unavailable**: the validated `qwen3-8b-128k` shape only accepts
+  full-parameter training, so `--lora-rank > 0` fails at trainer creation with
+  `no validated training shape exists for ... trainer_mode=LORA_TRAINER`.
+- **Hotloads sync full 8B weights** between steps and occasionally exceed the
+  SDK's 600s hotload budget (`RuntimeError: Hotload failed for sampler snapshot
+  ...`). This is transient preview-infra latency, not a loop bug — re-running the
+  same command generally proceeds. There is no clean knob to extend the timeout
+  on the managed sampler path.
+Metrics are written to:
+- `runs/fireworks-rl-preview/metrics.jsonl`
+- `runs/fireworks-rl-preview/reward_loss.png` if `matplotlib` is installed
+## Notes
+- Defaults use Qwen 3 8B full-parameter training:
+  - `accounts/fireworks/models/qwen3-8b`
+  - `Qwen/Qwen3-8B`
+  - `accounts/fireworks/trainingShapes/qwen3-8b-128k`
+- LoRA can be tested with `--lora-rank N`, but the validated Qwen3 8B training
+  shape currently rejects LoRA mode on the `lorenss` preview account.
+- The first checkpoint sync happens after step 0 and subsequent rollouts sample
+  the updated weights through the same deployment.
+- `--keep-trainer` and `--keep-deployment` are available for debugging. By
+  default the trainer is cleaned up and the deployment scales to zero on exit.
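
The calibration section of the README above argues that within-group reward spread, not the overall mean, decides whether GRPO gets a gradient. The following is a minimal sketch of that argument, not the cookbook's `train.py`. It assumes advantages are each rollout's reward minus its group mean, and that `within_group_reward_std` is the average of per-group population standard deviations; the README does not spell out either formula, and the function names here are illustrative.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float]) -> list[float]:
    """Center each rollout's reward on its own prompt group's mean.

    ASSUMPTION: plain mean-centering; the real loss may also rescale.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def within_group_reward_std(groups: list[list[float]]) -> float:
    """Average per-group population std (assumed definition of the metric)."""
    return mean([pstdev(group) for group in groups])


def reward_mean(groups: list[list[float]]) -> float:
    """Mean reward over every rollout, ignoring group boundaries."""
    return mean([r for group in groups for r in group])


# Two saturated groups: one always solved, one never solved.
saturated = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
# Two mixed groups: each prompt is solved in half of its rollouts.
mixed = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]]

for name, groups in (("saturated", saturated), ("mixed", mixed)):
    print(
        name,
        "reward_mean =", reward_mean(groups),
        "within_group_reward_std =", within_group_reward_std(groups),
        "advantages[0] =", group_advantages(groups[0]),
    )
```

Both batches report the same overall mean reward, but only the mixed batch yields non-zero advantages, which is why the README treats the within-group figure as the calibration signal.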

{hud_python-0.6.6 → hud_python-0.6.7}/hud/cli/deploy.py RENAMED

@@ -3,6 +3,7 @@
 from __future__ import annotations
 import asyncio
+import json
 import logging
 import os
 import time
@@ -12,6 +13,7 @@ from typing import Any
 import httpx
 import typer
+from pydantic import ValidationError
 from hud.cli.utils.build_display import display_build_summary
 from hud.cli.utils.build_logs import poll_build_status, stream_build_logs
@@ -19,6 +21,7 @@ from hud.cli.utils.config import parse_env_file, parse_key_value
 from hud.cli.utils.context import create_build_context_tarball, format_size
 from hud.cli.utils.registry import get_registry_environment
 from hud.cli.utils.source import EnvironmentSource, normalize_environment_name
+from hud.eval.runtime import RuntimeConfig
 from hud.utils.exceptions import HudRequestError
 from hud.utils.hud_console import HUDConsole
 from hud.utils.platform import PlatformClient
@@ -32,6 +35,7 @@ class _DeployPlan:
     name: str
     registry_id: str | None
     runtime: str | None
+    runtime_config: dict[str, Any] | None
     env_vars: dict[str, str]
     build_args: dict[str, str]
     build_secrets: dict[str, str]
@@ -75,6 +79,26 @@ def _normalize_runtime(runtime: str | None, console: HUDConsole) -> str | None:
     raise typer.Exit(1)
+def _load_runtime_config(path: str | None, console: HUDConsole) -> dict[str, Any] | None:
+    if path is None:
+        return None
+    config_path = Path(path).expanduser()
+    try:
+        raw = json.loads(config_path.read_text(encoding="utf-8"))
+        config = RuntimeConfig.model_validate(raw)
+    except FileNotFoundError:
+        console.error(f"Runtime config file not found: {config_path}")
+        raise typer.Exit(1) from None
+    except json.JSONDecodeError as exc:
+        console.error(f"Invalid runtime config JSON in {config_path}: {exc.msg}")
+        raise typer.Exit(1) from exc
+    except ValidationError as exc:
+        console.error(f"Invalid runtime config in {config_path}: {exc}")
+        raise typer.Exit(1) from exc
+    payload = config.request_payload()
+    return payload or None
 def _load_env_vars(path: Path, console: HUDConsole, *, warn_missing: bool) -> dict[str, str]:
     if not path.exists():
         if warn_missing:
@@ -322,6 +346,7 @@ def _prepare_deploy_plan(
     build_args: list[str] | None,
     build_secrets: list[str] | None,
     runtime: str | None,
+    runtime_config: str | None,
     verbose: bool,
     platform: PlatformClient,
     console: HUDConsole,
@@ -357,11 +382,13 @@ def _prepare_deploy_plan(
     build_args_dict = _parse_key_value_flags(build_args, option="--build-arg", console=console)
     if build_args_dict and verbose:
         console.info(f"Build arguments: {', '.join(build_args_dict.keys())}")
+    normalized_runtime = _normalize_runtime(runtime, console)
     return _DeployPlan(
         name=resolved_name,
         registry_id=registry_id,
-        runtime=_normalize_runtime(runtime, console),
+        runtime=normalized_runtime,
+        runtime_config=_load_runtime_config(runtime_config, console),
         env_vars=env_vars,
         build_args=build_args_dict,
         build_secrets=_collect_build_secrets(build_secrets, env_dir=env_dir, console=console),
@@ -379,6 +406,7 @@ def deploy_environment(
     build_args: list[str] | None = None,
     build_secrets: list[str] | None = None,
     runtime: str | None = None,
+    runtime_config: str | None = None,
 ) -> None:
     """Deploy one HUD environment to the platform."""
     hud_console = HUDConsole()
@@ -411,6 +439,7 @@ def deploy_environment(
         build_args=build_args,
         build_secrets=build_secrets,
         runtime=runtime,
+        runtime_config=runtime_config,
         verbose=verbose,
         platform=platform,
         console=hud_console,
@@ -485,6 +514,8 @@ async def _trigger_build(
         payload["registry_id"] = plan.registry_id
     if plan.runtime:
         payload["runtime_provider"] = plan.runtime
+    if plan.runtime_config:
+        payload["runtime_config"] = plan.runtime_config
     if plan.env_vars:
         payload["environment_variables"] = plan.env_vars
     if plan.build_args:
@@ -644,6 +675,7 @@ def deploy_all(
     build_args: list[str] | None = None,
     build_secrets: list[str] | None = None,
     runtime: str | None = None,
+    runtime_config: str | None = None,
 ) -> None:
     """Deploy each HUD environment under a parent directory."""
     hud_console = HUDConsole()
@@ -683,6 +715,7 @@ def deploy_all(
                 build_args=build_args,
                 build_secrets=build_secrets,
                 runtime=runtime,
+                runtime_config=runtime_config,
             )
             succeeded.append(env_dir.name)
         except (typer.Exit, SystemExit):
@@ -762,6 +795,11 @@ def deploy_command(
         "--runtime",
         help="Persist Modal as the hosted runtime for this registry",
     ),
+    runtime_config: str | None = typer.Option(
+        None,
+        "--runtime-config",
+        help="Path to a JSON RuntimeConfig for hosted runs",
+    ),
 ) -> None:
     """Deploy HUD environment to the platform.
@@ -781,6 +819,7 @@ def deploy_command(
             build_args=build_args,
             build_secrets=secrets,
             runtime=runtime,
+            runtime_config=runtime_config,
         )
         return
@@ -795,4 +834,5 @@ def deploy_command(
         build_args=build_args,
         build_secrets=secrets,
         runtime=runtime,
+        runtime_config=runtime_config,
     )
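
The `deploy.py` changes above add a `--runtime-config` option that reads a JSON file, validates it, and forwards the result in the build payload. The sketch below mirrors the control flow of `_load_runtime_config` using only the standard library, so it can be run without the SDK. It is an approximation under stated assumptions: `ALLOWED_FIELDS` is a hypothetical stand-in for the real `RuntimeConfig` schema (the diff only shows `image`, `resources`, and `limits` in use and `provider_config` being rejected), and `RuntimeConfigError` replaces the CLI's `console.error(...)` plus `typer.Exit(1)`.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any

# ASSUMPTION: stand-in for the fields RuntimeConfig accepts. The real schema
# lives in hud.eval.runtime and is validated by pydantic, not by this set.
ALLOWED_FIELDS = {"image", "resources", "limits"}


class RuntimeConfigError(Exception):
    """Raised where the CLI would log an error and exit with status 1."""


def load_runtime_config(path: str | None) -> dict[str, Any] | None:
    if path is None:
        return None  # option not passed: nothing to add to the payload
    config_path = Path(path).expanduser()
    try:
        raw = json.loads(config_path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        raise RuntimeConfigError(f"Runtime config file not found: {config_path}") from None
    except json.JSONDecodeError as exc:
        raise RuntimeConfigError(
            f"Invalid runtime config JSON in {config_path}: {exc.msg}"
        ) from exc
    if not isinstance(raw, dict):
        raise RuntimeConfigError(f"Invalid runtime config in {config_path}: expected an object")
    unknown = sorted(set(raw) - ALLOWED_FIELDS)
    if unknown:
        raise RuntimeConfigError(f"Invalid runtime config in {config_path}: unknown {unknown}")
    return raw or None  # an empty object collapses to None, as in the diff


def demo() -> dict[str, Any]:
    """Exercise the loader against the file shapes used in the package tests."""
    results: dict[str, Any] = {}
    with tempfile.TemporaryDirectory() as tmp:
        files = {
            "gpu": {"resources": {"gpu": {"type": "A10G", "count": 2}},
                    "limits": {"startup_timeout_s": 300}},
            "null_override": {"resources": None},
            "unknown": {"provider_config": {}},
        }
        for name, content in files.items():
            target = Path(tmp) / f"{name}.json"
            target.write_text(json.dumps(content), encoding="utf-8")
            try:
                results[name] = load_runtime_config(str(target))
            except RuntimeConfigError as exc:
                results[name] = f"rejected: {exc.__class__.__name__}"
    return results


print(demo())
```

An explicit `{"resources": null}` survives the round trip, which is the behaviour `test_load_runtime_config_preserves_null_override` pins down.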

{hud_python-0.6.6 → hud_python-0.6.7}/hud/cli/tests/test_deploy.py RENAMED

@@ -179,6 +179,47 @@ class TestCollectEnvironmentVariables:
         assert "INVALID_FORMAT" not in result
+class TestRuntimeConfigFile:
+    def test_load_runtime_config_uses_sdk_shape(self, tmp_path: Path) -> None:
+        from hud.cli.deploy import _load_runtime_config
+        from hud.utils.hud_console import HUDConsole
+        config_path = tmp_path / "runtime.json"
+        config_path.write_text(
+            json.dumps(
+                {
+                    "resources": {"gpu": {"type": "A10G", "count": 2}},
+                    "limits": {"startup_timeout_s": 300},
+                }
+            ),
+            encoding="utf-8",
+        )
+        assert _load_runtime_config(str(config_path), HUDConsole()) == {
+            "resources": {"gpu": {"type": "A10G", "count": 2}},
+            "limits": {"startup_timeout_s": 300},
+        }
+    def test_load_runtime_config_preserves_null_override(self, tmp_path: Path) -> None:
+        from hud.cli.deploy import _load_runtime_config
+        from hud.utils.hud_console import HUDConsole
+        config_path = tmp_path / "runtime.json"
+        config_path.write_text(json.dumps({"resources": None}), encoding="utf-8")
+        assert _load_runtime_config(str(config_path), HUDConsole()) == {"resources": None}
+    def test_load_runtime_config_rejects_unknown_fields(self, tmp_path: Path) -> None:
+        from hud.cli.deploy import _load_runtime_config
+        from hud.utils.hud_console import HUDConsole
+        config_path = tmp_path / "runtime.json"
+        config_path.write_text(json.dumps({"provider_config": {}}), encoding="utf-8")
+        with pytest.raises(typer.Exit):
+            _load_runtime_config(str(config_path), HUDConsole())
 class TestDeployEnvironment:
     """Tests for deploy_environment function."""
@@ -262,6 +303,7 @@ class TestDeployAsync:
                     name="test-env",
                     registry_id=None,
                     runtime=None,
+                    runtime_config=None,
                     env_vars={},
                     build_args={},
                     build_secrets={},
@@ -292,6 +334,7 @@ class TestDeployAsync:
                     name="test-env",
                     registry_id=None,
                     runtime=None,
+                    runtime_config=None,
                     env_vars={},
                     build_args={},
                     build_secrets={},
@@ -331,6 +374,7 @@ class TestDeployAsync:
                 name="test-env",
                 registry_id=None,
                 runtime="modal",
+                runtime_config=None,
                 env_vars={},
                 build_args={},
                 build_secrets={},
@@ -343,6 +387,48 @@ class TestDeployAsync:
         assert platform.payload is not None
         assert platform.payload["runtime_provider"] == "modal"
+    @pytest.mark.asyncio
+    async def test_trigger_build_sends_runtime_config(self) -> None:
+        from hud.cli.deploy import _DeployPlan, _trigger_build
+        from hud.utils.hud_console import HUDConsole
+        from hud.utils.platform import PlatformClient
+        class FakePlatform(PlatformClient):
+            payload: dict[str, object] | None = None
+            async def apost(
+                self,
+                path: str,
+                *,
+                json: object | None = None,
+            ) -> dict[str, object]:
+                assert path == "/builds/trigger"
+                assert isinstance(json, dict)
+                object.__setattr__(self, "payload", json)
+                return {"id": "build-1", "registry_id": "registry-1"}
+        runtime_config = {"resources": {"gpu": {"type": "A10G", "count": 1}}}
+        platform = FakePlatform("https://api.example", "key")
+        result = await _trigger_build(
+            platform,
+            build_id="build-1",
+            plan=_DeployPlan(
+                name="test-env",
+                registry_id=None,
+                runtime="modal",
+                runtime_config=runtime_config,
+                env_vars={},
+                build_args={},
+                build_secrets={},
+            ),
+            no_cache=False,
+            console=HUDConsole(),
+        )
+        assert result == {"id": "build-1", "registry_id": "registry-1"}
+        assert platform.payload is not None
+        assert platform.payload["runtime_config"] == runtime_config
 class TestSaveDeployLink:
     """Tests for _save_deploy_link function."""

{hud_python-0.6.6 → hud_python-0.6.7}/hud/eval/runtime.py RENAMED

@@ -108,6 +108,9 @@ class RuntimeConfig(BaseModel):
             self.model_dump() | override.model_dump(exclude_unset=True)
         )
+    def request_payload(self) -> dict[str, Any]:
+        return self.model_dump(mode="json", exclude_unset=True)
 class Provider(Protocol):
     """Server placement: called with the task row being placed, acquire one
@@ -925,7 +928,7 @@ class HostedRuntime:
         if group_id is not None:
             payload["group_id"] = group_id
         if task.runtime_config is not None:
-            runtime_config = task.runtime_config.model_dump(mode="json", exclude_none=True)
+            runtime_config = task.runtime_config.request_payload()
             if runtime_config:
                 payload["runtime_config"] = runtime_config
         await platform.apost("/rollouts/submit", json=payload)
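
The `runtime.py` hunks above replace `model_dump(..., exclude_none=True)` with a shared `request_payload()` that uses `exclude_unset=True`. The practical difference is that a field explicitly set to `None` is now sent, while a field never mentioned is still omitted. The sketch below illustrates the two policies with a hypothetical `SketchConfig` class that records which fields the caller passed. It is not pydantic and not the SDK's `RuntimeConfig`, and it assumes every field defaults to `None`, in which case the two pydantic options reduce to the filters shown.

```python
from __future__ import annotations

from typing import Any


class SketchConfig:
    """Hypothetical stand-in for a model that remembers explicitly set fields."""

    FIELDS = ("image", "resources", "limits")  # ASSUMPTION: illustrative field names

    def __init__(self, **values: Any) -> None:
        unknown = sorted(set(values) - set(self.FIELDS))
        if unknown:
            raise TypeError(f"unknown fields: {unknown}")
        self._explicit = dict(values)  # only what the caller actually passed

    def dump_exclude_none(self) -> dict[str, Any]:
        """0.6.6-style payload: drop every field whose value is None."""
        return {k: v for k, v in self._explicit.items() if v is not None}

    def dump_exclude_unset(self) -> dict[str, Any]:
        """0.6.7-style payload: keep everything the caller set, None included."""
        return dict(self._explicit)


override = SketchConfig(resources=None)  # "clear the inherited resources"
plain = SketchConfig(image="img:tag")    # no override intended

print(override.dump_exclude_none())   # the null override is lost
print(override.dump_exclude_unset())  # the null override is preserved
print(plain.dump_exclude_none() == plain.dump_exclude_unset())
```

Under the old policy the override dumps to an empty dict, which the `if runtime_config:` guard in `HostedRuntime` then skips entirely, so the platform never sees the request to clear `resources`.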

{hud_python-0.6.6 → hud_python-0.6.7}/hud/eval/sync.py RENAMED

@@ -163,7 +163,7 @@ def task_upload_payload(task: Task) -> dict[str, Any]:
     if task.columns:
         payload["columns"] = task.columns
     if task.runtime_config is not None:
-        payload["runtime_config"] = task.runtime_config.model_dump(exclude_none=True)
+        payload["runtime_config"] = task.runtime_config.request_payload()
     return payload
@@ -176,7 +176,7 @@ def _task_signature(task: Task) -> str:
     if task.columns:
         sig_data["columns"] = task.columns
     if task.runtime_config is not None:
-        sig_data["runtime_config"] = task.runtime_config.model_dump(exclude_none=True)
+        sig_data["runtime_config"] = task.runtime_config.request_payload()
     return f"{task.id}|" + json.dumps(
         sig_data,
         sort_keys=True,

{hud_python-0.6.6 → hud_python-0.6.7}/hud/eval/tests/test_hosted.py RENAMED

@@ -164,6 +164,25 @@ async def test_run_submits_and_polls_to_terminal(monkeypatch: pytest.MonkeyPatch
     assert payload["agent"]["config"]["model"] == "test-model"
+@pytest.mark.asyncio
+async def test_run_preserves_runtime_config_null_override(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    platform = _FakePlatform([{"status": "completed", "reward": 0.5}])
+    monkeypatch.setattr(
+        "hud.eval.runtime.PlatformClient.from_settings", classmethod(lambda cls: platform)
+    )
+    await HostedRuntime(poll_interval=0.0).run(
+        Task(env="sums", id="add", runtime_config=RuntimeConfig(resources=None)),
+        _agent(),
+        job_id=uuid.uuid4().hex,
+        trace_id=uuid.uuid4().hex,
+    )
+    assert platform.posts[0][1]["runtime_config"] == {"resources": None}
 @pytest.mark.asyncio
 async def test_run_timeout_requests_platform_cancel(monkeypatch: pytest.MonkeyPatch) -> None:
     platform = _FakePlatform([{"status": "running"}])

{hud_python-0.6.6 → hud_python-0.6.7}/hud/eval/tests/test_sync.py RENAMED

@@ -148,3 +148,15 @@ def test_task_upload_payload_includes_runtime_config() -> None:
     payload = task_upload_payload(task)
     assert payload["runtime_config"] == {"image": "img:tag"}
+def test_task_upload_payload_preserves_runtime_config_null_override() -> None:
+    task = Task(
+        env="e",
+        id="solve",
+        runtime_config=RuntimeConfig(resources=None),
+    )
+    payload = task_upload_payload(task)
+    assert payload["runtime_config"] == {"resources": None}

{hud_python-0.6.6 → hud_python-0.6.7}/hud/version.py RENAMED

@@ -4,4 +4,4 @@ Version information for the HUD SDK.
 from __future__ import annotations
-__version__ = "0.6.6"
+__version__ = "0.6.7"

{hud_python-0.6.6 → hud_python-0.6.7}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "hud-python"
-version = "0.6.6"
+version = "0.6.7"
 description = "SDK for the HUD platform."
 readme = "README.md"
 requires-python = ">=3.11, <3.13"

hud_python-0.6.6/cookbooks/fireworks-rl-training/README.md DELETED

@@ -1,114 +0,0 @@
-# Fireworks RL Training
-Direct Fireworks Training API loop over the same arithmetic preview task used by
-`cookbooks/rl-training`.
-This does **not** use Fireworks native datasets or RFT jobs. It follows the
-Training API service path from the Fireworks docs:
-1. `FiretitanServiceClient.from_firetitan_config(...)`
-2. `create_deployment_sampler(...)` for high-parallel rollouts
-3. local grading of HUD-style multiplication tasks
-4. `forward_backward_custom(...)` + `optim_step(...)`
-5. `save_weights_for_sampler(...)` + sampler refresh
-References:
-- Fireworks Training API introduction: https://docs.fireworks.ai/fine-tuning/training-api/introduction
-- Training and sampling lifecycle: https://docs.fireworks.ai/fine-tuning/training-api/training-and-sampling
-- Loss functions / GRPO reference: https://docs.fireworks.ai/fine-tuning/training-api/loss-functions
-## Setup
-The repo-level `.env` is loaded automatically. It must contain:
-```bash
-FIREWORKS_API_KEY=...
-FIREWORKS_ACCOUNT_ID=...
-```
-Install the isolated cookbook environment:
-```bash
-uv sync --pre
-```
-## Calibrate task difficulty first
-Calibration defaults to Fireworks' OpenAI-compatible inference API, so it does
-**not** create a trainer, provision a Training API deployment, or call
-`optim_step`. This is the cheap way to tune task difficulty before paying for a
-Training API run.
-The calibration model is separate from the training base model because the
-`lorenss` key currently exposes only a small serverless inference catalog (no
-Qwen3 8B deployment). Override it with `--inference-model` if you have a closer
-deployed model.
-```bash
-uv run train.py --calibrate-only --groups-per-step 8 --rollouts-per-prompt 8 --parallelism 32
-```
-The goal is a reward distribution with variance. If reward is all zero, make the
-task easier:
-```bash
-uv run train.py --calibrate-only --min-a 10 --max-a 99 --min-b 2 --max-b 9
-```
-If reward is all one, make the task harder:
-```bash
-uv run train.py --calibrate-only --min-a 1000 --max-a 9999 --min-b 11 --max-b 99
-```
-The current defaults are calibrated for the visible `gpt-oss-120b` inference
-model on the `lorenss` key: 2-digit by 1-digit multiplication with a direct
-"reply only with the integer" prompt. A 32-rollout calibration gave a non-trivial
-baseline (`reward_mean ~= 0.22`, `reward_std ~= 0.42`), while the original
-3-digit by 2-digit range was all-zero.
-## Train
-Once calibration has non-trivial rewards:
-```bash
-uv run train.py --steps 5 --groups-per-step 8 --rollouts-per-prompt 8 --parallelism 32
-```
-This uses the direct Training API managed service path. If you want calibration
-to go through the managed deployment sampler too, pass
-`--calibration-backend managed`; this provisions the same resources as training.
-### Current Fireworks preview account blocker
-On the `lorenss` preview account, trainer creation currently fails before the
-first train step with:
-```text
-failed to ensure FIREWORKS_API_KEY secret: unkey inference api id is not configured
-```
-This happens even with `create_deployment=False`, so it is an account/control
-plane provisioning issue rather than a problem in the rollout or loss code. Once
-Fireworks enables the missing Unkey inference API config for the account, the
-same `uv run train.py ...` command should proceed to trainer startup and the
-first `forward_backward_custom(...)` call.
-Metrics are written to:
-- `runs/fireworks-rl-preview/metrics.jsonl`
-- `runs/fireworks-rl-preview/reward_loss.png` if `matplotlib` is installed
-## Notes
-- Defaults use Qwen 3 8B full-parameter training:
-  - `accounts/fireworks/models/qwen3-8b`
-  - `Qwen/Qwen3-8B`
-  - `accounts/fireworks/trainingShapes/qwen3-8b-128k`
-- LoRA can be tested with `--lora-rank N`, but the validated Qwen3 8B training
-  shape currently rejects LoRA mode on the `lorenss` preview account.
-- The first checkpoint sync happens after step 0 and subsequent rollouts sample
-  the updated weights through the same deployment.
-- `--keep-trainer` and `--keep-deployment` are available for debugging. By
-  default the trainer is cleaned up and the deployment scales to zero on exit.