gpu-dev (PyPI): 0.7.6 (tar.gz) → 0.7.11 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (231)

gpu_dev-0.7.11/.github/workflows/tests.yml ADDED

@@ -0,0 +1,20 @@
+name: tests
+on:
+  push:
+  pull_request:
+jobs:
+  unit:
+    name: unit + mocks
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.12"
+      - name: Install package + test deps
+        run: uv pip install -e ".[test]"
+      - name: Run unit + mock tests (integration excluded)
+        run: uv run pytest -m "not integration"

{gpu_dev-0.7.6 → gpu_dev-0.7.11}/.gitignore RENAMED

@@ -73,3 +73,14 @@ lambda/*/package/
 admin/output/
 .claude/worktrees/
+.claude/settings.local.json
+.claude/scheduled_tasks.lock
+# Org-specific (filled in locally; not committed)
+docs/INTERNAL_AUTH.md
+# Local scratch / staging terraform working dir
+*.pid
+terraform-gpu-devservers/staging/.terraform/
+terraform-gpu-devservers/staging/__pycache__/
+terraform-gpu-devservers/staging/*.log

{gpu_dev-0.7.6 → gpu_dev-0.7.11}/CLAUDE.md RENAMED

@@ -28,6 +28,59 @@ For terraform, we use opentofu, don't ever run tf apply directly. You're free to
 - Group imports in standard order: standard library, third-party, local imports
 - Use absolute imports when possible
+## Testing (DO THIS FOR EVERY CHANGE)
+There is a real test suite now. **Every change must keep it green, and add/adjust
+tests.** Two tiers:
+**1. Unit + mocks — ALWAYS run, must stay green (CI runs this on every push/PR).**
+Fully mocked (boto3 / k8s / SSH / subprocess), no network, ~2s.
+```bash
+uv pip install -e ".[test]"        # one-time: pytest, moto, kubernetes
+uv run pytest -m "not integration" # ~1140 tests; run before every commit
+```
+- Layout: `tests/unit/{sdk,cli,lambda_fn}/test_*.py`; shared fixtures in the root
+  `conftest.py` (`cli_runner`, `lambda_index` = the lambda imported as `index`
+  with env pre-set, `aws_mocks` = MagicMock boto3 handles).
+- When you touch CLI / SDK / lambda code, update or add the matching `test_*.py`.
+- CI: `.github/workflows/tests.yml`. Lambda imports need env vars + sys.path — the
+  root `conftest.py` already sets both.
+**2. e2e integration on STAGING — run for anything touching the
+reserve/pod/SSH/lambda path before merging.** Real reservations on the **staging**
+cluster (us-west-1), cpu + t4 only, auto-cancelled. Staging is the DEFAULT target
+and github_user comes from your config, so the bare command is enough:
+```bash
+uv run pytest -m integration --run-integration -v
+```
+- Staging is the default (`GPU_DEV_TEST_ENV` defaults to `staging` → us-west-1,
+  standard `pytorch-gpu-dev-*` prefix, tf workspace `default`). The integration
+  conftest pins the region so the unit-test us-east-2 default can't leak in. Wired
+  in `cli-tools/.../config.py` ENVIRONMENTS.
+- Covers: cpu-x86 + t4 reserve→active→cancel, list-while-active, exec
+  (`nproc`/`nvidia-smi`/`torch.cuda`), **`claude -p` answers "Paris"** (pod Claude
+  Code/Bedrock), and the **warm pool** (fast warm claim + custom-image
+  warm-ineligibility). Each cancels in a `finally` (no leaked pods).
+- Warm-pool tests need `WARM_POOL_TARGETS` deployed on staging — set in
+  `lambda.tf` for the `default` workspace (`{t4, cpu-x86, cpu-arm}`). Staging IS the
+  tf `default` workspace (us-west-1, environment=test) — there is no `test`/`staging`
+  workspace: `tofu workspace select default && tofu apply`. Until then the warm
+  tests skip ("came up cold"). Custom-image test: set `GPU_DEV_TEST_IMAGE`.
+- Repro test (`test_repro_known_failure.py`): set `GPU_DEV_REPRO_REF` +
+  `GPU_DEV_REPRO_TEST` to a known-red (commit, test). Find one with the
+  **treehugger MCP** (`hud`, user-scope — `get_hud_data`/`master_commit_red`).
+  Note: prebuilt torch is h100/b200 arch, so a CUDA test on t4 needs a full build;
+  prefer a failure that runs on the box's GPU or on cpu.
+- Skips cleanly if staging is unreachable or the runner has no outbound SSH (e.g. a
+  sandbox). The reservation role can query/SQS but lacks `DescribeTable`, so the
+  reachability probe uses scan+get-queue-url, not describe.
+- Validated live (2026-05-31): cpu + t4 lifecycle PASS; warm-claim test confirmed
+  it reaches the real reserve (skips until WARM_POOL_TARGETS is applied).
+**Rule of thumb:** unit+mocks for *every* change; add e2e coverage when you add a
+new command/flow; run the staging e2e before merging anything that could affect a
+live reservation. Don't say "done/tested" without having run the relevant tier.
 ## Content
 - torchci - a next.js app containing a PyTorch CI tracker
@@ -51,6 +104,42 @@ Currently we're working on a developer servers with GPUs in AWS. This means we'l
 # AGENT SECTION
+## Fast-repro redesign — by-SHA artifact cache + on-demand build (2026-06-01)
+Goal: `gpu-dev repro <ref>` for any pytorch commit from the last ~72h lands a built,
+importable tree in <2min. Design: `docs/FAST_REPRO_DESIGN.md`. **All merged to main**
+(PRs #186–#189); **needs `tofu apply` (prod, workspace `prod`) + image rebuild**.
+- **by-SHA artifact cache** (#186): whole *built* trees keyed by commit SHA at
+  `/ccache_shared/prebuilt/by-sha/<sha>.tar.{zst,gz}` (`.sha` written last = the
+  completion gate). Cron seeds one per viable/strict bump (hardlink, no extra space).
+  `stage-pytorch` (cold `--ref`) + `gpu-dev repro` consume on hit → `import torch`
+  with ZERO build. `repro` also publishes its in-pod build via `publish-pytorch-build`
+  (detached) so the cache fills from real usage. All paths safe-fallback on miss;
+  `ls-remote` is `timeout 15`.
+- **retention** (#188): prebuild cron prunes by-sha entries >72h every tick (storage
+  budget ~500-650GB on the elastic ccache EFS). The by-sha set IS the snapshot ladder.
+- **mold linker** (#187): Dockerfile installs `mold`; cron + in-pod repro build wrap
+  with `mold -run` (guarded on `command -v mold`). Drops the libtorch_cuda.so relink
+  ~1-3min → ~15s. **Needs image rebuild** to activate (prod runs a stale image; that's
+  also why prod publishes gzip not zstd — the Dockerfile has zstd already).
+- **on-demand build worker** (#189, `pytorch-ondemand.tf`): always-on Deployment on
+  NodeType=build drains `prebuilt/build-queue/<sha>.req` (own hostPath tree
+  `/mnt/ondemand-build` → builds at `/home/dev/pytorch` so build/ paths are
+  pod-compatible; mold+ccache), publishes by-sha, writes `.worker-alive` heartbeat.
+  `repro` enqueues + polls ONLY when the heartbeat is fresh (else straight to in-pod
+  build → zero regression if not deployed). Makes the FIRST repro of an uncached
+  commit fast. Coordination 100% via shared EFS — no new networking/RBAC/lambda.
+- cuDNN fidelity (`USE_CUDNN=1`) DEFERRED — forcing it can fail the build if cuDNN
+  isn't found under cuda-13.2; needs prod e2e. Base image is cudnn9-devel.
+- Fast path is **prod-arch only** (`sm_90;sm_100` = H100/B200); t4/staging is wrong-arch.
+- Also: SSH alias now keys off reservation id not pod name (#185) so warm/repro pods
+  are reachable via `ssh gpu-dev-<resid>` / `connect` (routing is via the FQDN, the
+  alias is a local label). CCACHE_MAXSIZE settled at 250G (#184).
+- Prod e2e: `gpu-dev repro <fresh-sha> <test> --gpu-type h100 --no-connect` (first =
+  off-pod build + stage; rerun = by-sha HIT zero build). Worker logs:
+  `k -n management logs deploy/pytorch-ondemand-builder -f`.
 ## Instant-sandboxes branch — WIP & things to fix (2026-05-29)
 Big push on warm pools + instant claims + prebuilt pytorch. Tracking state here so it's not lost.
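
The `.sha` completion gate and the `.worker-alive` heartbeat described in the notes above can be sketched in Python. This is a minimal illustration under stated assumptions, not the package's code: `find_prebuilt` and `worker_is_alive` are hypothetical names, the directory layout is taken from the notes, and the 120-second freshness window only approximates the `find -mmin -2` check used by the `repro` shell script.

```python
import time
from pathlib import Path
from typing import Optional


def find_prebuilt(by_sha_dir: Path, sha: str) -> Optional[Path]:
    """Return the tarball for `sha`, or None if it is missing or unfinished.

    Hypothetical helper: the real lookup is a shell function inside the pod.
    """
    # The .sha marker is written last, so its absence means "still publishing".
    if not (by_sha_dir / f"{sha}.sha").is_file():
        return None
    # Check explicit extensions (zstd first, then gzip) instead of globbing.
    for ext in ("zst", "gz"):
        tarball = by_sha_dir / f"{sha}.tar.{ext}"
        if tarball.is_file():
            return tarball
    return None


def worker_is_alive(heartbeat: Path, max_age_s: float = 120.0,
                    now: Optional[float] = None) -> bool:
    """True if the heartbeat file was modified within `max_age_s` seconds."""
    try:
        mtime = heartbeat.stat().st_mtime
    except OSError:
        # A missing or unreadable heartbeat counts as "worker not deployed".
        return False
    current = time.time() if now is None else now
    return current - mtime < max_age_s
```

Requiring the marker before touching the tarball is what keeps a reader from staging a half-written archive; a stale or absent heartbeat lets a caller skip the queue and build locally.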

{gpu_dev-0.7.6 → gpu_dev-0.7.11}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.7.6
+Version: 0.7.11
 Summary: CLI + Python SDK for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10
@@ -15,6 +15,11 @@ Requires-Dist: questionary>=2.1.1
 Requires-Dist: websockets>=12.0
 Requires-Dist: certifi>=2023.7.22
 Requires-Dist: mcp>=1.0.0
+Provides-Extra: test
+Requires-Dist: pytest>=7.4; extra == "test"
+Requires-Dist: pytest-cov>=4.1; extra == "test"
+Requires-Dist: moto[dynamodb,ec2,sqs]>=5.0; extra == "test"
+Requires-Dist: kubernetes>=28.1; extra == "test"
 # GPU Developer CLI & SDK

{gpu_dev-0.7.6 → gpu_dev-0.7.11}/cli-tools/gpu-dev-cli/gpu_dev_cli/cli.py RENAMED

@@ -319,6 +319,9 @@ def _show_single_reservation(connection_info: dict) -> None:
         reservation_id = connection_info["reservation_id"]
         reservation_name = connection_info.get("name")
         pod_name = connection_info.get("pod_name", "")
+        # SSH host alias keys off the reservation id (works for warm-claimed pods,
+        # whose pod_name != gpu-dev-<resid8>). pod_name is shown separately below.
+        host_alias = f"gpu-dev-{short_id}"
         ssh_config_path = get_ssh_config_path(reservation_id, reservation_name)
         use_include = is_ssh_include_enabled()
@@ -328,14 +331,14 @@ def _show_single_reservation(connection_info: dict) -> None:
             if use_include:
                 # User approved Include - show simple commands
                 from .reservations import _make_vscode_link
-                ssh_command_display = f"[green]ssh {pod_name}[/green]"
-                vscode_url = _make_vscode_link(pod_name)
-                vscode_cmd_text = f"code --remote ssh-remote+{pod_name} /home/dev"
+                ssh_command_display = f"[green]ssh {host_alias}[/green]"
+                vscode_url = _make_vscode_link(host_alias)
+                vscode_cmd_text = f"code --remote ssh-remote+{host_alias} /home/dev"
                 vscode_command_display = f"[link={vscode_url}][green]{vscode_cmd_text}[/green][/link]"
                 vscode_info = f"[blue]VS Code Remote:[/blue] {vscode_command_display}\n"
             else:
                 # User declined Include - show commands with -F flag
-                ssh_command_display = f"[green]ssh -F {ssh_config_path} {pod_name}[/green]"
+                ssh_command_display = f"[green]ssh -F {ssh_config_path} {host_alias}[/green]"
                 vscode_command_display = f"Add [green]Include ~/.gpu-dev/*-sshconfig[/green] to ~/.ssh/config and ~/.cursor/ssh_config (or: [green]gpu-dev config ssh-include enable[/green])"
                 vscode_info = f"[blue]VS Code/Cursor:[/blue] {vscode_command_display}\n"
         else:
@@ -1554,27 +1557,82 @@ def repro(ctx, ref, test_args, gpu_type, gpus, hours, no_connect, keep):
     except RuntimeError as e:
         rprint(f"[red]❌ {str(e)}[/red]"); return
-    # ref -> in-pod fetch+checkout (PRs prefer /merge = CI's view, fall back to /head)
+    # Resolve the ref in-pod -> WANT (sha, for the by-sha cache) + FREF (fetch ref).
+    # A MERGED pr/N reproduces the actual squash/merge commit on main (the real trunk
+    # state that was red) — NOT pull/N/merge (the PR re-applied onto *current* trunk,
+    # which goes green once the fix lands). Open PRs keep pull/N/merge (= CI's view).
     r = ref.strip(); prnum = None
     if r.startswith("pr/"): prnum = r[3:]
     elif r.startswith("#"): prnum = r[1:]
     elif r.isdigit(): prnum = r
+    gh = "https://github.com/pytorch/pytorch.git"
     if prnum:
-        fetch = (f"git fetch origin pull/{prnum}/merge 2>/dev/null && git checkout -f FETCH_HEAD || "
-                 f"{{ echo '[repro] no /merge ref, using /head'; git fetch origin pull/{prnum}/head && git checkout -f FETCH_HEAD; }}")
+        api = f"https://api.github.com/repos/pytorch/pytorch/pulls/{prnum}"
+        resolve = (
+            f"PRJSON=$(curl -s -m 10 -H 'Accept: application/vnd.github+json' -H 'User-Agent: gpu-dev' {api} 2>/dev/null); "
+            "MCS=$(printf '%s' \"$PRJSON\" | grep -oE '\"merge_commit_sha\": *\"[0-9a-f]+\"' | head -1 | cut -d'\"' -f4); "
+            "if printf '%s' \"$PRJSON\" | grep -q '\"merged\": *true' && [ -n \"$MCS\" ]; then "
+            f"WANT=\"$MCS\"; FREF=\"$MCS\"; echo \"[repro] pr/{prnum} is merged -> reproducing trunk commit $MCS\"; "
+            f"else FREF=pull/{prnum}/merge; WANT=$(timeout 15 git ls-remote {gh} $FREF 2>/dev/null | head -1 | cut -f1); "
+            f"[ -n \"$WANT\" ] || {{ FREF=pull/{prnum}/head; WANT=$(timeout 15 git ls-remote {gh} $FREF 2>/dev/null | head -1 | cut -f1); echo '[repro] open PR, no /merge -> /head'; }}; fi; ")
     else:
         rq = shlex.quote(r)
-        fetch = f"git fetch origin {rq} 2>/dev/null && git checkout -f FETCH_HEAD || git checkout -f {rq}"
+        resolve = (f"FREF={rq}; WANT=$(timeout 15 git ls-remote {gh} {rq} 2>/dev/null | head -1 | cut -f1); "
+                   f"[ -n \"$WANT\" ] || case {rq} in *[!0-9a-fA-F]*) WANT= ;; *) WANT={rq} ;; esac; ")
+    # in-pod fallback checkout (by-sha miss + farm unavailable): fetch the resolved ref,
+    # else check out the sha directly (reachable for a merged-PR land commit / trunk).
+    checkout = ("git fetch origin \"$FREF\" 2>/dev/null && git checkout -f FETCH_HEAD "
+                "|| git checkout -f \"$WANT\" 2>/dev/null "
+                "|| { git fetch --force origin 2>/dev/null && git checkout -f \"$WANT\"; }")
     testcmd = " ".join(shlex.quote(a) for a in test_args)
+    # by-sha artifact cache: if a fully-built tree for the resolved SHA already exists
+    # (shared EFS, seeded by the build node + prior repros), stage it -> ZERO build.
+    # Otherwise build, then publish the result so the next dev (anyone) gets it instant.
     remote = (
         "set -e; cd /home/dev/pytorch; "
         "git config --global --add safe.directory /home/dev/pytorch 2>/dev/null || true; "
-        f"echo '[repro] checkout {r}'; {fetch}; "
+        "BYSHA=/ccache_shared/prebuilt/by-sha; QUEUE=/ccache_shared/prebuilt/build-queue; HIT=; "
+        # bs <sha>: stage a fully-built by-sha tree into /home/dev/pytorch (zero build); 0 on success.
+        # explicit ext check, not a glob: the pod login shell is zsh, where an unmatched glob is a hard error.
+        # require the .sha completion gate (written last) so we never stage a half-published tarball.
+        "bs() { local s=\"$1\" tb=; [ -f \"$BYSHA/$s.sha\" ] || return 1; for e in zst gz; do [ -f \"$BYSHA/$s.tar.$e\" ] && { tb=\"$BYSHA/$s.tar.$e\"; break; }; done; [ -n \"$tb\" ] || return 1; "
+        "rm -rf /home/dev/pytorch.new; mkdir -p /home/dev/pytorch.new; "
+        "case \"$tb\" in *.zst) zstd -dc \"$tb\" 2>/dev/null | tar -C /home/dev/pytorch.new --strip-components=1 -xf - 2>/dev/null ;; "
+        "*) tar -C /home/dev/pytorch.new --strip-components=1 -xzf \"$tb\" 2>/dev/null ;; esac; "
+        "[ -d /home/dev/pytorch.new/.git ] || { rm -rf /home/dev/pytorch.new; return 1; }; "
+        "rm -rf /home/dev/pytorch; mv /home/dev/pytorch.new /home/dev/pytorch; return 0; }; "
+        + resolve +
+        "echo \"[repro] target ${WANT:-?}\"; "
+        # 1) already cached -> stage it (zero build)
+        "if [ -n \"$WANT\" ] && bs \"$WANT\"; then cd /home/dev/pytorch; HIT=1; echo '[repro] by-sha cache HIT -> staged prebuilt tree (zero build)'; fi; "
+        # 2) not cached, build farm alive -> request an off-pod build, wait, then stage
+        "if [ -z \"$HIT\" ] && [ -n \"$WANT\" ] && [ -n \"$(find \"$QUEUE/.worker-alive\" -mmin -2 2>/dev/null)\" ]; then "
+        "echo \"[repro] no cached build; requesting off-pod build of $WANT (build farm; streaming progress)…\"; printf '%s\\n' \"$FREF\" > \"$QUEUE/$WANT.req\" 2>/dev/null || true; "
+        # poll for the artifact; meanwhile tail the farm's build log (ninja [x/N]) so it's not a silent hang.
+        "i=0; LL=0; while [ $i -lt 400 ]; do [ -f \"$BYSHA/$WANT.sha\" ] && break; [ -f \"$QUEUE/$WANT.req\" ] || break; "
+        "if [ -f \"$QUEUE/$WANT.log\" ]; then NL=$(wc -l < \"$QUEUE/$WANT.log\" 2>/dev/null || echo 0); "
+        "if [ \"$NL\" -gt \"$LL\" ]; then tail -n +$((LL+1)) \"$QUEUE/$WANT.log\" 2>/dev/null | grep -aE '\\[[0-9]+/[0-9]+\\]|Building wheel|Successfully built|error' | tail -1 | sed 's/^/  [farm] /'; LL=$NL; fi; fi; "
+        "sleep 3; i=$((i+1)); done; "
+        "if bs \"$WANT\"; then cd /home/dev/pytorch; HIT=1; echo '[repro] off-pod build ready -> staged (zero build)'; else echo '[repro] off-pod build unavailable, building locally'; fi; fi; "
+        # 3) fall back to in-pod fetch + build (+ cache the result for the next dev)
+        "if [ -z \"$HIT\" ]; then "
+        "echo \"[repro] checking out $FREF\"; " + checkout + "; "
         "echo \"[repro] HEAD $(git rev-parse --short HEAD)\"; "
         "git -c protocol.file.allow=always submodule update --init --recursive --jobs 8 >/dev/null 2>&1 || true; "
         "if ! PYTHONPATH=/home/dev/pytorch python -c 'import torch' 2>/dev/null; then "
-        "echo '[repro] incremental rebuild on warm build/...'; pip install --break-system-packages -e . --no-build-isolation; fi; "
+        "echo \"[repro] prebuilt torch != this commit -> rebuilding (ccache-accelerated, but the further this commit is from viable/strict, the more recompiles). checked-out: $(git log -1 --format='%h %ci')\"; "
+        # mold -run routes the libtorch_cuda.so relink through mold (~15s vs minutes); guarded.
+        # Explicit if/else (not `$M pip`): the pod login shell is zsh, which doesn't word-split
+        # unquoted vars. -v streams the cmake/ninja [x/N] progress instead of pip's blind spinner.
+        "if command -v mold >/dev/null 2>&1; then mold -run pip install --break-system-packages -e . --no-build-isolation -v; "
+        "else pip install --break-system-packages -e . --no-build-isolation -v; fi; fi; "
+        # cache this build for the next dev (detached so it survives the ssh session)
+        "SHA=$(git rev-parse HEAD 2>/dev/null); "
+        "if command -v publish-pytorch-build >/dev/null 2>&1 && [ -n \"$SHA\" ] && [ ! -f \"$BYSHA/$SHA.sha\" ]; then "
+        "echo '[repro] caching this build (by-sha) for next time…'; "
+        "setsid publish-pytorch-build \"$SHA\" >/dev/null 2>&1 < /dev/null & fi; "
+        "fi; "
         f"echo '[repro] running: python {testcmd}'; "
         f"PYTHONPATH=/home/dev/pytorch python {testcmd}"
     )
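
The ref handling at the top of this hunk treats `pr/N`, `#N`, or a bare number as a pull-request reference and everything else as a plain git ref. A minimal sketch of that classification follows; `parse_pr_number` is a hypothetical helper name, since the CLI performs these checks inline.

```python
from typing import Optional


def parse_pr_number(ref: str) -> Optional[str]:
    """Return the PR number encoded in `ref`, or None for a plain git ref."""
    r = ref.strip()
    if r.startswith("pr/"):
        return r[3:]   # "pr/123" -> "123"
    if r.startswith("#"):
        return r[1:]   # "#123" -> "123"
    if r.isdigit():
        return r       # a bare number is read as a PR number
    return None        # branch name, tag, or commit SHA


if __name__ == "__main__":
    for example in ("pr/123", "#45", "678", "main", "deadbeef"):
        print(example, "->", parse_pr_number(example))
```

One consequence of the digit check is that an abbreviated SHA made up only of digits would be read as a PR number, so a full SHA or a branch name is the safer input for non-PR refs.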
@@ -1879,7 +1937,9 @@ def submit(ctx, gpu_type, gpus, hours, disk, ref, no_persistent_disk, spot, dock
                 sys.exit(1)
             create_ssh_config_for_reservation(master_fqdn, master_pod, master_id, master_name)
-        ssh_alias = master_pod
+        # Host alias matches the Host line written by create_ssh_config_for_reservation
+        # (keyed off the reservation id, so warm-claimed masters resolve too).
+        ssh_alias = f"gpu-dev-{master_id[:8]}"
         ssh_base = ["ssh", "-F", str(config_file), "-o", "StrictHostKeyChecking=accept-new"]
         rsync_e = " ".join(shlex.quote(x) for x in ssh_base)
@@ -3166,11 +3226,15 @@ def _show_direct_success(res: dict, elapsed: float) -> None:
     """Print the success block for an instant warm-pool claim,
     matching the normal reserve output (SSH config + VS Code/Cursor remote)."""
     from gpu_dev_cli.reservations import (
-        create_ssh_config_for_reservation, _generate_vscode_command, _generate_cursor_command)
+        create_ssh_config_for_reservation, _generate_vscode_command,
+        _generate_cursor_command, _make_vscode_link, _make_cursor_link)
     rid = res.get("reservation_id", "") or ""
     ssh_command = res.get("ssh_command", "") or ""
     pod_name = res.get("pod_name", "") or ""
     fqdn = res.get("fqdn") or ""
+    # Host alias keys off the reservation id — warm-claimed pods have a pod_name
+    # that is NOT gpu-dev-<resid8>, so we must not use pod_name as the ssh alias.
+    host_alias = f"gpu-dev-{rid[:8]}" if rid else pod_name
     rprint(f"\n[green]✅ Instant reservation ready in {elapsed:.1f}s![/green]")
     rprint(f"[bold]📋 Reservation ID:[/bold] {rid}")
@@ -3179,24 +3243,28 @@ def _show_direct_success(res: dict, elapsed: float) -> None:
     if rid:
         rprint(f"[bold]⚡ Quick Connect:[/bold] gpu-dev connect {rid[:8]}")
-    # Build the per-reservation SSH config so `ssh <pod>` and connect work cleanly.
+    # Build the per-reservation SSH config so `ssh gpu-dev-<resid8>` and connect work cleanly.
     use_include = False
     if fqdn and pod_name and rid:
         try:
             _cfg, use_include = create_ssh_config_for_reservation(fqdn, pod_name, rid, None)
         except Exception:
             pass
-    if pod_name and use_include:
-        rprint(f"[bold]🖥️  SSH Command:[/bold] ssh {pod_name}")
-    elif ssh_command:
-        rprint(f"[bold]🖥️  SSH Command:[/bold] {ssh_command}")
-    vsc = _generate_vscode_command(ssh_command) if ssh_command else None
-    cur = _generate_cursor_command(ssh_command) if ssh_command else None
-    if vsc:
-        rprint(f"[bold]💻 VS Code Remote:[/bold] {vsc}")
-    if cur:
-        rprint(f"[bold]🖥️ Cursor Remote:[/bold] {cur}")
+    if use_include and rid:
+        rprint(f"[bold]🖥️  SSH Command:[/bold] ssh {host_alias}")
+        vscode_url = _make_vscode_link(host_alias)
+        cursor_url = _make_cursor_link(host_alias)
+        rprint(f"[bold]💻 VS Code Remote:[/bold] [link={vscode_url}]code --remote ssh-remote+{host_alias} /home/dev[/link]")
+        rprint(f"[bold]🖥️ Cursor Remote:[/bold] [link={cursor_url}]cursor --remote ssh-remote+{host_alias} /home/dev[/link]")
+    else:
+        if ssh_command:
+            rprint(f"[bold]🖥️  SSH Command:[/bold] {ssh_command}")
+        vsc = _generate_vscode_command(ssh_command) if ssh_command else None
+        cur = _generate_cursor_command(ssh_command) if ssh_command else None
+        if vsc:
+            rprint(f"[bold]💻 VS Code Remote:[/bold] {vsc}")
+        if cur:
+            rprint(f"[bold]🖥️ Cursor Remote:[/bold] {cur}")
 def _format_gpu_display(gpu_count, gpu_type):
@@ -3385,15 +3453,22 @@ def _show_availability(show_spot: bool = False) -> None:
                 spot_table = Table(title="⚡ Spot Instances (us-east-1, ~70% cheaper)")
                 spot_table.add_column("GPU Type", style="cyan")
                 spot_table.add_column("Avail\nNow", style="green")
+                spot_table.add_column("In\nUse", style="yellow")
                 spot_table.add_column("Per\nNode", style="bright_green")
                 spot_table.add_column("Status", style="magenta")
                 spot_table.add_column("Spot Discount", style="dim")
                 _on_demand = {"b300": 95, "b200": 95, "h200": 55, "h100": 98, "a100": 32, "t4": 4.5, "l4": 7}
                 for gt, info in sorted(spot_region_info.items()):
                     avail = info.get("available", 0)
+                    total = info.get("total", 0)
+                    in_use = max(0, total - avail)  # GPUs on up spot nodes already taken
                     per_node = spot_gpus_per_node.get(gt, 8)
                     avail_display = f"[green]{avail}[/green]" if avail > 0 else f"[dim]0[/dim]"
-                    status = "[green]Node up[/green]" if avail > 0 else "Spins up on reserve (~10 min)"
+                    in_use_display = f"[yellow]{in_use}[/yellow]" if in_use > 0 else f"[dim]0[/dim]"
+                    if in_use > 0:
+                        status = "[yellow]Node up (in use)[/yellow]" if avail == 0 else "[green]Node up[/green]"
+                    else:
+                        status = "[green]Node up[/green]" if avail > 0 else "Spins up on reserve (~10 min)"
                     si = info.get("spot_info", {}) or {}
                     sp = si.get("spot_price", "") if isinstance(si, dict) else ""
                     if not sp or (isinstance(si, dict) and "No spot data" in str(si.get("spot_signal", ""))):
@@ -3405,7 +3480,7 @@ def _show_availability(show_spot: bool = False) -> None:
                             avail_signal = f"[green]{pct}% off on-demand[/green]" if pct > 0 else "[dim]At on-demand price[/dim]"
                         except (ValueError, TypeError):
                             avail_signal = "[yellow]Unknown[/yellow]"
-                    spot_table.add_row(f"{gt.upper()} *", avail_display, str(per_node), status, avail_signal)
+                    spot_table.add_row(f"{gt.upper()} *", avail_display, in_use_display, str(per_node), status, avail_signal)
                 console.print(spot_table)
                 rprint("[dim]* = spot: ~70% cheaper, AWS can reclaim with 2-min notice, fulfillment not guaranteed.[/dim]")
                 rprint("[dim]  Separate cluster (us-east-1) with separate disks. Select via gpu-dev reserve (interactive).[/dim]")
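
The new "In Use" column and the adjusted status text reduce to a small decision on two counters. The sketch below restates that logic without the Rich markup; `spot_status` is a hypothetical function name, while the `total` and `avail` fields and the status strings are taken from the hunk above.

```python
from typing import Tuple


def spot_status(total: int, avail: int) -> Tuple[int, str]:
    """Return (in_use, status) for one spot GPU type."""
    # Clamp at zero so an inconsistent report (avail > total) never goes negative.
    in_use = max(0, total - avail)
    if in_use > 0:
        # A node is up; say so even when every GPU on it is taken.
        status = "Node up (in use)" if avail == 0 else "Node up"
    else:
        status = "Node up" if avail > 0 else "Spins up on reserve (~10 min)"
    return in_use, status


if __name__ == "__main__":
    for total, avail in ((8, 0), (8, 3), (0, 0)):
        print(total, avail, spot_status(total, avail))
```

The practical difference from 0.7.6 is the first branch: a fully occupied node is now reported as up and in use rather than as something that would spin up on reserve.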
@@ -3779,7 +3854,8 @@ def connect(ctx: click.Context, reservation_id: Optional[str]) -> None:
             for node in nodes:
                 status_display = "✅ Active" if node.get("status") == "active" else f"⏳ {node.get('status', 'unknown')}"
                 pod_name = node.get("pod_name", "unknown")
-                ssh_cmd_short = f"ssh {pod_name}" if pod_name != "unknown" else "N/A"
+                node_rid = node.get("reservation_id")
+                ssh_cmd_short = f"ssh gpu-dev-{node_rid[:8]}" if node_rid else "N/A"
                 table.add_row(
                     f"Node {node.get('node_index', 0) + 1}",
@@ -4036,10 +4112,11 @@ def get_ssh_config_cmd(ctx: click.Context, reservation_id: Optional[str]) -> Non
                 )
                 if config_path:
+                    node_alias = f"gpu-dev-{node_res_id[:8]}"
                     if use_include:
-                        rprint(f"[green]✅ Node {node_idx + 1}:[/green] [cyan]ssh {pod_name}[/cyan]")
+                        rprint(f"[green]✅ Node {node_idx + 1}:[/green] [cyan]ssh {node_alias}[/cyan]")
                     else:
-                        rprint(f"[green]✅ Node {node_idx + 1}:[/green] [cyan]ssh -F {config_path} {pod_name}[/cyan]")
+                        rprint(f"[green]✅ Node {node_idx + 1}:[/green] [cyan]ssh -F {config_path} {node_alias}[/cyan]")
                 else:
                     rprint(f"[yellow]⚠️  Node {node_idx + 1}: Failed to create SSH config[/yellow]")
@@ -4067,12 +4144,13 @@ def get_ssh_config_cmd(ctx: click.Context, reservation_id: Optional[str]) -> Non
             )
             if config_path:
+                host_alias = f"gpu-dev-{reservation_id[:8]}"
                 rprint(f"[green]✅ SSH config created:[/green] [cyan]{config_path}[/cyan]\n")
                 if use_include:
-                    rprint(f"[green]🎉 You can now connect with:[/green] [cyan]ssh {pod_name}[/cyan]")
+                    rprint(f"[green]🎉 You can now connect with:[/green] [cyan]ssh {host_alias}[/cyan]")
                     rprint(f"[dim]   or:[/dim] [cyan]gpu-dev connect {reservation_id[:8]}[/cyan]")
                 else:
-                    rprint(f"[green]🎉 You can now connect with:[/green] [cyan]ssh -F {config_path} {pod_name}[/cyan]")
+                    rprint(f"[green]🎉 You can now connect with:[/green] [cyan]ssh -F {config_path} {host_alias}[/cyan]")
                     rprint(f"[dim]   or:[/dim] [cyan]gpu-dev connect {reservation_id[:8]}[/cyan]")
             else:
                 rprint("[red]❌ Failed to create SSH config[/red]")
@@ -4639,13 +4717,13 @@ def ssh_include(action: str):
     \b
     When enabled:
-      • Simple SSH commands: ssh <pod-name>
-      • VS Code Remote works: code --remote ssh-remote+<pod-name>
+      • Simple SSH commands: ssh gpu-dev-<reservation-id>
+      • VS Code Remote works: code --remote ssh-remote+gpu-dev-<reservation-id>
       • Cursor Remote works: Open Remote SSH in Cursor
     \b
     When disabled:
-      • Need -F flag: ssh -F ~/.gpu-dev/<id>-sshconfig <pod-name>
+      • Need -F flag: ssh -F ~/.gpu-dev/<id>-sshconfig gpu-dev-<reservation-id>
       • VS Code/Cursor requires manual config setup
     \b
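
Across these hunks the SSH host alias is derived from the first eight characters of the reservation id instead of the pod name. A minimal sketch of that derivation and of the two command forms the CLI prints is shown below; `host_alias` and `ssh_command` are hypothetical helper names, and the example ids and paths are made up for illustration.

```python
def host_alias(reservation_id: str, pod_name: str = "") -> str:
    """Alias keyed off the reservation id; fall back to pod_name if no id."""
    return f"gpu-dev-{reservation_id[:8]}" if reservation_id else pod_name


def ssh_command(reservation_id: str, use_include: bool, config_path: str = "") -> str:
    """Command to display, depending on whether the Include directive is enabled."""
    alias = host_alias(reservation_id)
    if use_include:
        return f"ssh {alias}"
    # Without the Include line, the per-reservation config must be passed explicitly.
    return f"ssh -F {config_path} {alias}"


if __name__ == "__main__":
    rid = "0123456789abcdef"  # made-up reservation id
    print(ssh_command(rid, use_include=True))
    print(ssh_command(rid, use_include=False,
                      config_path="~/.gpu-dev/01234567-sshconfig"))
```

Because routing goes through the FQDN in `HostName`, the alias is only a local label, which is why it can change without affecting how the connection is made.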

{gpu_dev-0.7.6 → gpu_dev-0.7.11}/cli-tools/gpu-dev-cli/gpu_dev_cli/config.py RENAMED

@@ -29,6 +29,15 @@ class Config:
             "description": "Spot-only us-east-1 environment (T4/L4/CPU)",
             "spot_types": ["b300", "b200", "h200", "h100", "a100", "t4", "l4", "rtxpro6000"],
         },
+        # Staging (us-west-1, tf "default" workspace, environment=test). Same
+        # standard resource prefix as prod, just a different region — so only the
+        # region changes. Live capacity: cpu-x86/arm + t4. Used for integration
+        # tests. Select via `GPU_DEV_ENVIRONMENT=staging` (or the "test" env alias).
+        "staging": {
+            "region": "us-west-1",
+            "workspace": "default",
+            "description": "Staging (us-west-1, cpu + t4)",
+        },
     }
     DEFAULT_ENVIRONMENT = "prod"
@@ -43,19 +52,33 @@ class Config:
         # Load unified config (handles migration from legacy files)
         self.user_config = self._load_config()
-        # Get region: env vars take priority (for spot routing), then config, then default
+        # Active environment: GPU_DEV_ENVIRONMENT env wins (handy for tests/CI),
+        # then the persisted config, then the default. Its region/prefix back the
+        # fallbacks below so e.g. `GPU_DEV_ENVIRONMENT=staging` reaches us-west-1.
+        env_override = os.getenv("GPU_DEV_ENVIRONMENT")
+        env_name = env_override or self.user_config.get(
+            "environment", self.DEFAULT_ENVIRONMENT)
+        env_cfg = self.ENVIRONMENTS.get(env_name, {})
+        # Get region: AWS_* env vars take priority (for spot routing); then an
+        # explicit GPU_DEV_ENVIRONMENT switch uses that env's region (beating the
+        # persisted one); then the persisted config; then the env's region; default.
         env_region = os.getenv("AWS_REGION") or os.getenv("AWS_DEFAULT_REGION")
         if env_region and env_region != self.user_config.get("region"):
             self.aws_region = env_region
+        elif env_override and env_cfg.get("region"):
+            self.aws_region = env_cfg["region"]
         elif self.user_config.get("region"):
             self.aws_region = self.user_config["region"]
+        elif env_cfg.get("region"):
+            self.aws_region = env_cfg["region"]
         else:
             self.aws_region = "us-east-2"
         os.environ["AWS_DEFAULT_REGION"] = self.aws_region
-        # Resource naming convention - no config needed!
-        self.prefix = "pytorch-gpu-dev"
+        # Resource naming convention — per-environment prefix (default for prod).
+        self.prefix = env_cfg.get("prefix", "pytorch-gpu-dev")
         # Construct ARNs from convention
         self.queue_name = f"{self.prefix}-reservation-queue"
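
The region precedence introduced in this hunk is easier to follow as a standalone function. The sketch below mirrors the `if`/`elif` chain above but takes its inputs as plain mappings so it can be exercised without touching real environment variables; `resolve_region` is a hypothetical name, and the `ENVIRONMENTS` table is trimmed to the staging entry.

```python
from typing import Any, Mapping

# Trimmed stand-in for Config.ENVIRONMENTS (only the staging entry is shown).
ENVIRONMENTS = {"staging": {"region": "us-west-1", "workspace": "default"}}


def resolve_region(environ: Mapping[str, str],
                   user_config: Mapping[str, Any],
                   environments: Mapping[str, Mapping[str, Any]] = ENVIRONMENTS,
                   default_environment: str = "prod",
                   fallback: str = "us-east-2") -> str:
    env_override = environ.get("GPU_DEV_ENVIRONMENT")
    env_name = env_override or user_config.get("environment", default_environment)
    env_cfg = environments.get(env_name, {})

    env_region = environ.get("AWS_REGION") or environ.get("AWS_DEFAULT_REGION")
    # 1) AWS_* wins, but only when it differs from the persisted region.
    if env_region and env_region != user_config.get("region"):
        return env_region
    # 2) An explicit GPU_DEV_ENVIRONMENT beats the persisted region.
    if env_override and env_cfg.get("region"):
        return env_cfg["region"]
    # 3) Persisted region, 4) the active environment's region, 5) fallback.
    if user_config.get("region"):
        return user_config["region"]
    if env_cfg.get("region"):
        return env_cfg["region"]
    return fallback
```

One detail worth noting: when `AWS_REGION` equals the persisted region, step 1 is skipped, so an explicit `GPU_DEV_ENVIRONMENT=staging` still selects us-west-1.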

{gpu_dev-0.7.6 → gpu_dev-0.7.11}/cli-tools/gpu-dev-cli/gpu_dev_cli/reservations.py RENAMED

@@ -177,12 +177,14 @@ def _generate_cursor_command(ssh_command: str) -> Optional[str]:
         return None
-def _generate_ssh_config(hostname: str, pod_name: str) -> str:
+def _generate_ssh_config(hostname: str, host_alias: str) -> str:
     """Generate SSH config for a reservation
     Args:
-        hostname: The FQDN hostname (e.g., old_bison.devservers.io)
-        pod_name: The pod name to use as SSH host alias
+        hostname: The FQDN hostname (e.g., old_bison.devservers.io). SSH routing
+            happens via this HostName (the ProxyCommand routes on the FQDN), so
+            host_alias is a purely local label.
+        host_alias: The local SSH host alias (e.g., gpu-dev-<resid8>)
     Returns:
         SSH config content as string
@@ -196,7 +198,7 @@ def _generate_ssh_config(hostname: str, pod_name: str) -> str:
     extra = "    AddKeysToAgent yes\n"
     if sys.platform == "darwin":
         extra += "    IgnoreUnknown UseKeychain\n    UseKeychain yes\n"
-    config_content = f"""Host {pod_name}
+    config_content = f"""Host {host_alias}
     HostName {hostname}
     User dev
     ForwardAgent yes
@@ -255,10 +257,10 @@ def _check_ssh_config_permission() -> bool:
     console.print("[dim]  • ~/.cursor/ssh_config[/dim]")
     console.print("[dim]Line added: Include ~/.gpu-dev/*-sshconfig[/dim]\n")
     console.print("[green]Benefits:[/green]")
-    console.print("  • Simple commands: [green]ssh <pod-name>[/green]")
-    console.print("  • VS Code Remote works: [green]code --remote ssh-remote+<pod-name>[/green]")
+    console.print("  • Simple commands: [green]ssh gpu-dev-<reservation-id>[/green]")
+    console.print("  • VS Code Remote works: [green]code --remote ssh-remote+gpu-dev-<reservation-id>[/green]")
     console.print("  • Cursor Remote works: Open Remote SSH in Cursor")
-    console.print("\n[dim]Without this, you'll need to use: [green]ssh -F ~/.gpu-dev/<id>-sshconfig <pod-name>[/green][/dim]")
+    console.print("\n[dim]Without this, you'll need to use: [green]ssh -F ~/.gpu-dev/<id>-sshconfig gpu-dev-<reservation-id>[/green][/dim]")
     console.print("[yellow]━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[/yellow]\n")
     approved = click.confirm("Add Include directive to SSH config files?", default=True)
@@ -326,7 +328,8 @@ def create_ssh_config_for_reservation(hostname: str, pod_name: str, reservation_
     Args:
         hostname: The FQDN hostname (e.g., old_bison.devservers.io)
-        pod_name: The pod name to use as SSH host alias
+        pod_name: The k8s pod name (kept for API compat; no longer used for the
+            host alias — warm-claimed pods have a pod_name != gpu-dev-<resid8>)
         reservation_id: The reservation ID (full or short)
         name: Optional reservation name to use for filename (falls back to short ID)
@@ -346,8 +349,12 @@ def create_ssh_config_for_reservation(hostname: str, pod_name: str, reservation_
     short_id = reservation_id[:8]
     filename = f"{short_id}-sshconfig"
+    # Key the host alias off the reservation id (not pod_name) so warm-claimed pods,
+    # whose pod_name differs from gpu-dev-<resid8>, are still reachable as gpu-dev-<resid8>.
+    host_alias = f"gpu-dev-{short_id}"
     config_file = gpu_dev_dir / filename
-    config_content = _generate_ssh_config(hostname, pod_name)
+    config_content = _generate_ssh_config(hostname, host_alias)
     try:
         config_file.write_text(config_content)
@@ -2220,10 +2227,11 @@ class ReservationManager:
                                                     console.print(
                                                         f"[yellow]⚠️  Could not create SSH config for node {node['index']+1}: {str(e)}[/yellow]")
-                                            # Show connection info
+                                            # Show connection info (alias keys off the reservation id)
+                                            node_alias = f"gpu-dev-{res_id[:8]}" if res_id else pod_name
                                             if config_path and pod_name and use_include:
                                                 console.print(
-                                                    f"[cyan]🖥️  Node {node['index']+1}:[/cyan] [green]ssh {pod_name}[/green]")
+                                                    f"[cyan]🖥️  Node {node['index']+1}:[/cyan] [green]ssh {node_alias}[/green]")
                                             else:
                                                 ssh_command = res.get(
                                                     "ssh_command", "ssh user@pending")
@@ -2321,27 +2329,29 @@ class ReservationManager:
                                         console.print(
                                             f"[yellow]⚠️  Could not create SSH config: {str(e)}[/yellow]")
-                                # Show SSH command using config file if created, otherwise fallback
+                                # Show SSH command using config file if created, otherwise fallback.
+                                # Alias keys off the reservation id (works for warm-claimed pods too).
+                                host_alias = f"gpu-dev-{short_id}"
                                 if config_path and pod_name:
                                     if use_include:
                                         # User approved Include - show simple commands
                                         console.print(
-                                            f"[cyan]🖥️  SSH Command:[/cyan] [green]ssh {pod_name}[/green]")
+                                            f"[cyan]🖥️  SSH Command:[/cyan] [green]ssh {host_alias}[/green]")
                                         # Create clickable VS Code link
-                                        vscode_url = _make_vscode_link(pod_name)
-                                        vscode_command = f"code --remote ssh-remote+{pod_name} /home/dev"
+                                        vscode_url = _make_vscode_link(host_alias)
+                                        vscode_command = f"code --remote ssh-remote+{host_alias} /home/dev"
                                         console.print(
                                             f"[cyan]💻 VS Code Remote:[/cyan] [link={vscode_url}][green]{vscode_command}[/green][/link]")
                                         # Create clickable Cursor link
-                                        cursor_url = _make_cursor_link(pod_name)
-                                        cursor_command = f"cursor --remote ssh-remote+{pod_name} /home/dev"
+                                        cursor_url = _make_cursor_link(host_alias)
+                                        cursor_command = f"cursor --remote ssh-remote+{host_alias} /home/dev"
                                         console.print(
                                             f"[cyan]🖥️ Cursor Remote:[/cyan] [link={cursor_url}][green]{cursor_command}[/green][/link]")
                                     else:
                                         # User declined Include - show commands with -F flag
                                         console.print(
-                                            f"[cyan]🖥️  SSH Command:[/cyan] [green]ssh -F {config_path} {pod_name}[/green]")
+                                            f"[cyan]🖥️  SSH Command:[/cyan] [green]ssh -F {config_path} {host_alias}[/green]")
                                         console.print(
                                             f"[cyan]💻 VS Code/Cursor:[/cyan] Add [green]Include ~/.gpu-dev/*-sshconfig[/green] to ~/.ssh/config and ~/.cursor/ssh_config")
                                         console.print(
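
The alias change in reservations.py can be summarized in a short sketch. It assumes only what the hunks show: the alias is gpu-dev- plus the first eight characters of the reservation id, and the stanza starts with Host, HostName, User dev, and ForwardAgent yes. host_alias_for and ssh_config_stanza are hypothetical names, and the remaining lines the real generator emits (for example the AddKeysToAgent and ProxyCommand settings) are deliberately left out.

```python
def host_alias_for(reservation_id: str) -> str:
    """Local SSH alias keyed off the reservation id, not the pod name."""
    short_id = reservation_id[:8]
    return f"gpu-dev-{short_id}"


def ssh_config_stanza(hostname: str, reservation_id: str) -> str:
    """Partial stanza: only the lines visible in the diff are reproduced."""
    alias = host_alias_for(reservation_id)
    return (
        f"Host {alias}\n"
        f"    HostName {hostname}\n"
        "    User dev\n"
        "    ForwardAgent yes\n"
    )


if __name__ == "__main__":
    # The reservation id below is a made-up example value.
    print(ssh_config_stanza("old_bison.devservers.io", "a1b2c3d4-0000-1111"))
```

Because routing happens on HostName, the alias is a purely local label, so a warm-claimed pod whose pod name differs from gpu-dev-<resid8> stays reachable under the same alias.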
