PyPI - gpu-dev - Versions diffs - 0.7.5__tar.gz → 0.7.10__tar.gz - Mend


Files changed (231)

gpu_dev-0.7.10/.github/workflows/tests.yml ADDED

@@ -0,0 +1,20 @@
+name: tests
+on:
+  push:
+  pull_request:
+jobs:
+  unit:
+    name: unit + mocks
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.12"
+      - name: Install package + test deps
+        run: uv pip install -e ".[test]"
+      - name: Run unit + mock tests (integration excluded)
+        run: uv run pytest -m "not integration"

{gpu_dev-0.7.5 → gpu_dev-0.7.10}/.gitignore RENAMED

@@ -73,3 +73,14 @@ lambda/*/package/
 admin/output/
 .claude/worktrees/
+.claude/settings.local.json
+.claude/scheduled_tasks.lock
+# Org-specific (filled in locally; not committed)
+docs/INTERNAL_AUTH.md
+# Local scratch / staging terraform working dir
+*.pid
+terraform-gpu-devservers/staging/.terraform/
+terraform-gpu-devservers/staging/__pycache__/
+terraform-gpu-devservers/staging/*.log

{gpu_dev-0.7.5 → gpu_dev-0.7.10}/CLAUDE.md RENAMED

@@ -28,6 +28,59 @@ For terraform, we use opentofu, don't ever run tf apply directly. You're free to
 - Group imports in standard order: standard library, third-party, local imports
 - Use absolute imports when possible
+## Testing (DO THIS FOR EVERY CHANGE)
+There is a real test suite now. **Every change must keep it green, and add/adjust
+tests.** Two tiers:
+**1. Unit + mocks — ALWAYS run, must stay green (CI runs this on every push/PR).**
+Fully mocked (boto3 / k8s / SSH / subprocess), no network, ~2s.
+```bash
+uv pip install -e ".[test]"        # one-time: pytest, moto, kubernetes
+uv run pytest -m "not integration" # ~1140 tests; run before every commit
+```
+- Layout: `tests/unit/{sdk,cli,lambda_fn}/test_*.py`; shared fixtures in the root
+  `conftest.py` (`cli_runner`, `lambda_index` = the lambda imported as `index`
+  with env pre-set, `aws_mocks` = MagicMock boto3 handles).
+- When you touch CLI / SDK / lambda code, update or add the matching `test_*.py`.
+- CI: `.github/workflows/tests.yml`. Lambda imports need env vars + sys.path — the
+  root `conftest.py` already sets both.
+**2. e2e integration on STAGING — run for anything touching the
+reserve/pod/SSH/lambda path before merging.** Real reservations on the **staging**
+cluster (us-west-1), cpu + t4 only, auto-cancelled. Staging is the DEFAULT target
+and github_user comes from your config, so the bare command is enough:
+```bash
+uv run pytest -m integration --run-integration -v
+```
+- Staging is the default (`GPU_DEV_TEST_ENV` defaults to `staging` → us-west-1,
+  standard `pytorch-gpu-dev-*` prefix, tf workspace `default`). The integration
+  conftest pins the region so the unit-test us-east-2 default can't leak in. Wired
+  in `cli-tools/.../config.py` ENVIRONMENTS.
+- Covers: cpu-x86 + t4 reserve→active→cancel, list-while-active, exec
+  (`nproc`/`nvidia-smi`/`torch.cuda`), **`claude -p` answers "Paris"** (pod Claude
+  Code/Bedrock), and the **warm pool** (fast warm claim + custom-image
+  warm-ineligibility). Each cancels in a `finally` (no leaked pods).
+- Warm-pool tests need `WARM_POOL_TARGETS` deployed on staging — set in
+  `lambda.tf` for the `default` workspace (`{t4, cpu-x86, cpu-arm}`). Staging IS the
+  tf `default` workspace (us-west-1, environment=test) — there is no `test`/`staging`
+  workspace: `tofu workspace select default && tofu apply`. Until then the warm
+  tests skip ("came up cold"). Custom-image test: set `GPU_DEV_TEST_IMAGE`.
+- Repro test (`test_repro_known_failure.py`): set `GPU_DEV_REPRO_REF` +
+  `GPU_DEV_REPRO_TEST` to a known-red (commit, test). Find one with the
+  **treehugger MCP** (`hud`, user-scope — `get_hud_data`/`master_commit_red`).
+  Note: prebuilt torch is h100/b200 arch, so a CUDA test on t4 needs a full build;
+  prefer a failure that runs on the box's GPU or on cpu.
+- Skips cleanly if staging is unreachable or the runner has no outbound SSH (e.g. a
+  sandbox). The reservation role can query/SQS but lacks `DescribeTable`, so the
+  reachability probe uses scan+get-queue-url, not describe.
+- Validated live (2026-05-31): cpu + t4 lifecycle PASS; warm-claim test confirmed
+  it reaches the real reserve (skips until WARM_POOL_TARGETS is applied).
+**Rule of thumb:** unit+mocks for *every* change; add e2e coverage when you add a
+new command/flow; run the staging e2e before merging anything that could affect a
+live reservation. Don't say "done/tested" without having run the relevant tier.
 ## Content
 - torchci - a next.js app containing a PyTorch CI tracker
@@ -51,6 +104,42 @@ Currently we're working on a developer servers with GPUs in AWS. This means we'l
 # AGENT SECTION
+## Fast-repro redesign — by-SHA artifact cache + on-demand build (2026-06-01)
+Goal: `gpu-dev repro <ref>` for any pytorch commit from the last ~72h lands a built,
+importable tree in <2min. Design: `docs/FAST_REPRO_DESIGN.md`. **All merged to main**
+(PRs #186–#189); **needs `tofu apply` (prod, workspace `prod`) + image rebuild**.
+- **by-SHA artifact cache** (#186): whole *built* trees keyed by commit SHA at
+  `/ccache_shared/prebuilt/by-sha/<sha>.tar.{zst,gz}` (`.sha` written last = the
+  completion gate). Cron seeds one per viable/strict bump (hardlink, no extra space).
+  `stage-pytorch` (cold `--ref`) + `gpu-dev repro` consume on hit → `import torch`
+  with ZERO build. `repro` also publishes its in-pod build via `publish-pytorch-build`
+  (detached) so the cache fills from real usage. All paths safe-fallback on miss;
+  `ls-remote` is `timeout 15`.
+- **retention** (#188): prebuild cron prunes by-sha entries >72h every tick (storage
+  budget ~500-650GB on the elastic ccache EFS). The by-sha set IS the snapshot ladder.
+- **mold linker** (#187): Dockerfile installs `mold`; cron + in-pod repro build wrap
+  with `mold -run` (guarded on `command -v mold`). Drops the libtorch_cuda.so relink
+  ~1-3min → ~15s. **Needs image rebuild** to activate (prod runs a stale image; that's
+  also why prod publishes gzip not zstd — the Dockerfile has zstd already).
+- **on-demand build worker** (#189, `pytorch-ondemand.tf`): always-on Deployment on
+  NodeType=build drains `prebuilt/build-queue/<sha>.req` (own hostPath tree
+  `/mnt/ondemand-build` → builds at `/home/dev/pytorch` so build/ paths are
+  pod-compatible; mold+ccache), publishes by-sha, writes `.worker-alive` heartbeat.
+  `repro` enqueues + polls ONLY when the heartbeat is fresh (else straight to in-pod
+  build → zero regression if not deployed). Makes the FIRST repro of an uncached
+  commit fast. Coordination 100% via shared EFS — no new networking/RBAC/lambda.
+- cuDNN fidelity (`USE_CUDNN=1`) DEFERRED — forcing it can fail the build if cuDNN
+  isn't found under cuda-13.2; needs prod e2e. Base image is cudnn9-devel.
+- Fast path is **prod-arch only** (`sm_90;sm_100` = H100/B200); t4/staging is wrong-arch.
+- Also: SSH alias now keys off reservation id not pod name (#185) so warm/repro pods
+  are reachable via `ssh gpu-dev-<resid>` / `connect` (routing is via the FQDN, the
+  alias is a local label). CCACHE_MAXSIZE settled at 250G (#184).
+- Prod e2e: `gpu-dev repro <fresh-sha> <test> --gpu-type h100 --no-connect` (first =
+  off-pod build + stage; rerun = by-sha HIT zero build). Worker logs:
+  `k -n management logs deploy/pytorch-ondemand-builder -f`.
 ## Instant-sandboxes branch — WIP & things to fix (2026-05-29)
 Big push on warm pools + instant claims + prebuilt pytorch. Tracking state here so it's not lost.
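
Reviewer sketch (not part of the package): the fast-repro notes above say the `.sha` file is written last and acts as the completion gate for a by-SHA tarball. The Python below is a minimal illustration of that lookup rule, mirroring what the `bs()` shell helper in `cli.py` appears to do. The function name `find_staged_tarball` and the directory argument are hypothetical; only the `<sha>.sha` gate and the `zst`-then-`gz` preference come from the diff.

```python
from pathlib import Path
from typing import Optional


def find_staged_tarball(by_sha_dir: Path, sha: str) -> Optional[Path]:
    """Return the tarball for `sha` only if its completion gate exists.

    Hypothetical helper: the real logic lives in a shell function run in the pod.
    """
    # The gate file is written last, so its absence means "still publishing".
    if not (by_sha_dir / f"{sha}.sha").is_file():
        return None
    # Explicit extension check (no glob), preferring zstd over gzip.
    for ext in ("zst", "gz"):
        candidate = by_sha_dir / f"{sha}.tar.{ext}"
        if candidate.is_file():
            return candidate
    return None
```

A tarball without its gate file is treated as a miss, so a reader never stages a half-written archive; the caller then falls back to the build path.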

{gpu_dev-0.7.5 → gpu_dev-0.7.10}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.7.5
+Version: 0.7.10
 Summary: CLI + Python SDK for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10
@@ -15,6 +15,11 @@ Requires-Dist: questionary>=2.1.1
 Requires-Dist: websockets>=12.0
 Requires-Dist: certifi>=2023.7.22
 Requires-Dist: mcp>=1.0.0
+Provides-Extra: test
+Requires-Dist: pytest>=7.4; extra == "test"
+Requires-Dist: pytest-cov>=4.1; extra == "test"
+Requires-Dist: moto[dynamodb,ec2,sqs]>=5.0; extra == "test"
+Requires-Dist: kubernetes>=28.1; extra == "test"
 # GPU Developer CLI & SDK

{gpu_dev-0.7.5 → gpu_dev-0.7.10}/cli-tools/gpu-dev-cli/gpu_dev_cli/cli.py RENAMED

@@ -319,6 +319,9 @@ def _show_single_reservation(connection_info: dict) -> None:
         reservation_id = connection_info["reservation_id"]
         reservation_name = connection_info.get("name")
         pod_name = connection_info.get("pod_name", "")
+        # SSH host alias keys off the reservation id (works for warm-claimed pods,
+        # whose pod_name != gpu-dev-<resid8>). pod_name is shown separately below.
+        host_alias = f"gpu-dev-{short_id}"
         ssh_config_path = get_ssh_config_path(reservation_id, reservation_name)
         use_include = is_ssh_include_enabled()
@@ -328,14 +331,14 @@ def _show_single_reservation(connection_info: dict) -> None:
             if use_include:
                 # User approved Include - show simple commands
                 from .reservations import _make_vscode_link
-                ssh_command_display = f"[green]ssh {pod_name}[/green]"
-                vscode_url = _make_vscode_link(pod_name)
-                vscode_cmd_text = f"code --remote ssh-remote+{pod_name} /home/dev"
+                ssh_command_display = f"[green]ssh {host_alias}[/green]"
+                vscode_url = _make_vscode_link(host_alias)
+                vscode_cmd_text = f"code --remote ssh-remote+{host_alias} /home/dev"
                 vscode_command_display = f"[link={vscode_url}][green]{vscode_cmd_text}[/green][/link]"
                 vscode_info = f"[blue]VS Code Remote:[/blue] {vscode_command_display}\n"
             else:
                 # User declined Include - show commands with -F flag
-                ssh_command_display = f"[green]ssh -F {ssh_config_path} {pod_name}[/green]"
+                ssh_command_display = f"[green]ssh -F {ssh_config_path} {host_alias}[/green]"
                 vscode_command_display = f"Add [green]Include ~/.gpu-dev/*-sshconfig[/green] to ~/.ssh/config and ~/.cursor/ssh_config (or: [green]gpu-dev config ssh-include enable[/green])"
                 vscode_info = f"[blue]VS Code/Cursor:[/blue] {vscode_command_display}\n"
         else:
@@ -1523,12 +1526,19 @@ def reserve(
 @click.option("--gpu-type", default="b200", show_default=True, help="GPU type for the repro box.")
 @click.option("--gpus", type=int, default=1, show_default=True)
 @click.option("--hours", type=float, default=3.0, show_default=True,
-              help="Lifetime ceiling; the box auto-cancels when the test exits unless --keep.")
+              help="Lifetime ceiling for the box.")
+@click.option("--no-connect", is_flag=True, default=False,
+              help="CI mode: run the test, auto-cancel, exit code = test result. Default (on a TTY) drops you into the box to iterate.")
 @click.option("--keep", is_flag=True, default=False,
-              help="Keep the reservation after the test exits (default: auto-cancel).")
+              help="Never cancel the box (skip the cancel prompt / auto-cancel).")
 @click.pass_context
-def repro(ctx, ref, test_args, gpu_type, gpus, hours, keep):
-    """Reserve a GPU, check out a PR/commit, run a test, then auto-cancel.
+def repro(ctx, ref, test_args, gpu_type, gpus, hours, no_connect, keep):
+    """Reserve a GPU, check out a PR/commit, run a test, then drop you into the box.
+    By default (in a terminal) repro runs the test and then **connects you into the
+    box** at ~/pytorch — the ref is checked out, so you can fix and re-run. The box
+    stays alive until you cancel it (you're prompted on exit). Use --no-connect for
+    CI/scripts (run the test, auto-cancel, process exit code = the test result).
     REF: pr/<N>, #<N>, a bare PR number, a branch, or a commit sha. PRs use
     pull/<N>/merge (what CI tests), falling back to /head.
@@ -1539,6 +1549,7 @@ def repro(ctx, ref, test_args, gpu_type, gpus, hours, keep):
     """
     import shlex
     import subprocess
+    import sys
     config = load_config()
     reservation_mgr = ReservationManager(config)
     try:
@@ -1546,27 +1557,82 @@ def repro(ctx, ref, test_args, gpu_type, gpus, hours, keep):
     except RuntimeError as e:
         rprint(f"[red]❌ {str(e)}[/red]"); return
-    # ref -> in-pod fetch+checkout (PRs prefer /merge = CI's view, fall back to /head)
+    # Resolve the ref in-pod -> WANT (sha, for the by-sha cache) + FREF (fetch ref).
+    # A MERGED pr/N reproduces the actual squash/merge commit on main (the real trunk
+    # state that was red) — NOT pull/N/merge (the PR re-applied onto *current* trunk,
+    # which goes green once the fix lands). Open PRs keep pull/N/merge (= CI's view).
     r = ref.strip(); prnum = None
     if r.startswith("pr/"): prnum = r[3:]
     elif r.startswith("#"): prnum = r[1:]
     elif r.isdigit(): prnum = r
+    gh = "https://github.com/pytorch/pytorch.git"
     if prnum:
-        fetch = (f"git fetch origin pull/{prnum}/merge 2>/dev/null && git checkout -f FETCH_HEAD || "
-                 f"{{ echo '[repro] no /merge ref, using /head'; git fetch origin pull/{prnum}/head && git checkout -f FETCH_HEAD; }}")
+        api = f"https://api.github.com/repos/pytorch/pytorch/pulls/{prnum}"
+        resolve = (
+            f"PRJSON=$(curl -s -m 10 -H 'Accept: application/vnd.github+json' -H 'User-Agent: gpu-dev' {api} 2>/dev/null); "
+            "MCS=$(printf '%s' \"$PRJSON\" | grep -oE '\"merge_commit_sha\": *\"[0-9a-f]+\"' | head -1 | cut -d'\"' -f4); "
+            "if printf '%s' \"$PRJSON\" | grep -q '\"merged\": *true' && [ -n \"$MCS\" ]; then "
+            f"WANT=\"$MCS\"; FREF=\"$MCS\"; echo \"[repro] pr/{prnum} is merged -> reproducing trunk commit $MCS\"; "
+            f"else FREF=pull/{prnum}/merge; WANT=$(timeout 15 git ls-remote {gh} $FREF 2>/dev/null | head -1 | cut -f1); "
+            f"[ -n \"$WANT\" ] || {{ FREF=pull/{prnum}/head; WANT=$(timeout 15 git ls-remote {gh} $FREF 2>/dev/null | head -1 | cut -f1); echo '[repro] open PR, no /merge -> /head'; }}; fi; ")
     else:
         rq = shlex.quote(r)
-        fetch = f"git fetch origin {rq} 2>/dev/null && git checkout -f FETCH_HEAD || git checkout -f {rq}"
+        resolve = (f"FREF={rq}; WANT=$(timeout 15 git ls-remote {gh} {rq} 2>/dev/null | head -1 | cut -f1); "
+                   f"[ -n \"$WANT\" ] || case {rq} in *[!0-9a-fA-F]*) WANT= ;; *) WANT={rq} ;; esac; ")
+    # in-pod fallback checkout (by-sha miss + farm unavailable): fetch the resolved ref,
+    # else check out the sha directly (reachable for a merged-PR land commit / trunk).
+    checkout = ("git fetch origin \"$FREF\" 2>/dev/null && git checkout -f FETCH_HEAD "
+                "|| git checkout -f \"$WANT\" 2>/dev/null "
+                "|| { git fetch --force origin 2>/dev/null && git checkout -f \"$WANT\"; }")
     testcmd = " ".join(shlex.quote(a) for a in test_args)
+    # by-sha artifact cache: if a fully-built tree for the resolved SHA already exists
+    # (shared EFS, seeded by the build node + prior repros), stage it -> ZERO build.
+    # Otherwise build, then publish the result so the next dev (anyone) gets it instant.
     remote = (
         "set -e; cd /home/dev/pytorch; "
         "git config --global --add safe.directory /home/dev/pytorch 2>/dev/null || true; "
-        f"echo '[repro] checkout {r}'; {fetch}; "
+        "BYSHA=/ccache_shared/prebuilt/by-sha; QUEUE=/ccache_shared/prebuilt/build-queue; HIT=; "
+        # bs <sha>: stage a fully-built by-sha tree into /home/dev/pytorch (zero build); 0 on success.
+        # explicit ext check, not a glob: the pod login shell is zsh, where an unmatched glob is a hard error.
+        # require the .sha completion gate (written last) so we never stage a half-published tarball.
+        "bs() { local s=\"$1\" tb=; [ -f \"$BYSHA/$s.sha\" ] || return 1; for e in zst gz; do [ -f \"$BYSHA/$s.tar.$e\" ] && { tb=\"$BYSHA/$s.tar.$e\"; break; }; done; [ -n \"$tb\" ] || return 1; "
+        "rm -rf /home/dev/pytorch.new; mkdir -p /home/dev/pytorch.new; "
+        "case \"$tb\" in *.zst) zstd -dc \"$tb\" 2>/dev/null | tar -C /home/dev/pytorch.new --strip-components=1 -xf - 2>/dev/null ;; "
+        "*) tar -C /home/dev/pytorch.new --strip-components=1 -xzf \"$tb\" 2>/dev/null ;; esac; "
+        "[ -d /home/dev/pytorch.new/.git ] || { rm -rf /home/dev/pytorch.new; return 1; }; "
+        "rm -rf /home/dev/pytorch; mv /home/dev/pytorch.new /home/dev/pytorch; return 0; }; "
+        + resolve +
+        "echo \"[repro] target ${WANT:-?}\"; "
+        # 1) already cached -> stage it (zero build)
+        "if [ -n \"$WANT\" ] && bs \"$WANT\"; then cd /home/dev/pytorch; HIT=1; echo '[repro] by-sha cache HIT -> staged prebuilt tree (zero build)'; fi; "
+        # 2) not cached, build farm alive -> request an off-pod build, wait, then stage
+        "if [ -z \"$HIT\" ] && [ -n \"$WANT\" ] && [ -n \"$(find \"$QUEUE/.worker-alive\" -mmin -2 2>/dev/null)\" ]; then "
+        "echo \"[repro] no cached build; requesting off-pod build of $WANT (build farm; streaming progress)…\"; printf '%s\\n' \"$FREF\" > \"$QUEUE/$WANT.req\" 2>/dev/null || true; "
+        # poll for the artifact; meanwhile tail the farm's build log (ninja [x/N]) so it's not a silent hang.
+        "i=0; LL=0; while [ $i -lt 400 ]; do [ -f \"$BYSHA/$WANT.sha\" ] && break; [ -f \"$QUEUE/$WANT.req\" ] || break; "
+        "if [ -f \"$QUEUE/$WANT.log\" ]; then NL=$(wc -l < \"$QUEUE/$WANT.log\" 2>/dev/null || echo 0); "
+        "if [ \"$NL\" -gt \"$LL\" ]; then tail -n +$((LL+1)) \"$QUEUE/$WANT.log\" 2>/dev/null | grep -aE '\\[[0-9]+/[0-9]+\\]|Building wheel|Successfully built|error' | tail -1 | sed 's/^/  [farm] /'; LL=$NL; fi; fi; "
+        "sleep 3; i=$((i+1)); done; "
+        "if bs \"$WANT\"; then cd /home/dev/pytorch; HIT=1; echo '[repro] off-pod build ready -> staged (zero build)'; else echo '[repro] off-pod build unavailable, building locally'; fi; fi; "
+        # 3) fall back to in-pod fetch + build (+ cache the result for the next dev)
+        "if [ -z \"$HIT\" ]; then "
+        "echo \"[repro] checking out $FREF\"; " + checkout + "; "
         "echo \"[repro] HEAD $(git rev-parse --short HEAD)\"; "
         "git -c protocol.file.allow=always submodule update --init --recursive --jobs 8 >/dev/null 2>&1 || true; "
         "if ! PYTHONPATH=/home/dev/pytorch python -c 'import torch' 2>/dev/null; then "
-        "echo '[repro] incremental rebuild on warm build/...'; pip install --break-system-packages -e . --no-build-isolation; fi; "
+        "echo \"[repro] prebuilt torch != this commit -> rebuilding (ccache-accelerated, but the further this commit is from viable/strict, the more recompiles). checked-out: $(git log -1 --format='%h %ci')\"; "
+        # mold -run routes the libtorch_cuda.so relink through mold (~15s vs minutes); guarded.
+        # Explicit if/else (not `$M pip`): the pod login shell is zsh, which doesn't word-split
+        # unquoted vars. -v streams the cmake/ninja [x/N] progress instead of pip's blind spinner.
+        "if command -v mold >/dev/null 2>&1; then mold -run pip install --break-system-packages -e . --no-build-isolation -v; "
+        "else pip install --break-system-packages -e . --no-build-isolation -v; fi; fi; "
+        # cache this build for the next dev (detached so it survives the ssh session)
+        "SHA=$(git rev-parse HEAD 2>/dev/null); "
+        "if command -v publish-pytorch-build >/dev/null 2>&1 && [ -n \"$SHA\" ] && [ ! -f \"$BYSHA/$SHA.sha\" ]; then "
+        "echo '[repro] caching this build (by-sha) for next time…'; "
+        "setsid publish-pytorch-build \"$SHA\" >/dev/null 2>&1 < /dev/null & fi; "
+        "fi; "
         f"echo '[repro] running: python {testcmd}'; "
         f"PYTHONPATH=/home/dev/pytorch python {testcmd}"
     )
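
Reviewer sketch (not part of the package): the hunk above classifies `REF` on the client and then resolves the SHA in the pod. The Python below restates only the two pure-string rules, as an aid to reading the shell: how a PR number is extracted, and when a non-PR ref is accepted as a literal SHA after `git ls-remote` returns nothing. The names `split_ref` and `fallback_sha` are hypothetical.

```python
import string
from typing import Optional, Tuple


def split_ref(ref: str) -> Tuple[Optional[str], str]:
    """Return (pr_number_or_None, stripped_ref), following the order in repro()."""
    r = ref.strip()
    if r.startswith("pr/"):
        return r[3:], r
    if r.startswith("#"):
        return r[1:], r
    if r.isdigit():  # a bare number is treated as a PR number
        return r, r
    return None, r


def fallback_sha(ref: str) -> str:
    """Mirror the shell `case` fallback: keep the ref as WANT only if it is all hex."""
    if ref and all(c in string.hexdigits for c in ref):
        return ref
    return ""
```

Because the digit check runs first, an all-digit ref is always read as a PR number, never as an abbreviated commit SHA.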
@@ -1602,21 +1668,55 @@ def repro(ctx, ref, test_args, gpu_type, gpus, hours, keep):
     if "StrictHostKeyChecking" not in ssh_cmd:
         ssh_cmd = ssh_cmd.replace("ssh ", "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR ", 1)
     rprint(f"[dim]→ {ssh_cmd}[/dim]\n")
+    rid8 = str(rid)[:8]
     rc = 1
     try:
         rc = subprocess.run(f"{ssh_cmd} {shlex.quote(remote)}", shell=True).returncode
     except KeyboardInterrupt:
-        rprint("\n[yellow]interrupted[/yellow]")
-    finally:
+        rprint("\n[yellow]interrupted[/yellow]"); rc = 130
+    verdict = "[green]✓ test passed[/green]" if rc == 0 else f"[red]✗ test failed (exit {rc})[/red]"
+    # Default (TTY): drop into the box so you can fix and re-run. --no-connect is the
+    # CI path: auto-cancel and exit with the test's code.
+    connect = (not no_connect) and sys.stdout.isatty()
+    if connect:
+        rprint(f"\n{verdict} — dropping you into the box at ~/pytorch ({ref} checked out).")
+        rprint(f"[dim]  re-run:  python {testcmd}[/dim]")
+        rprint(f"[dim]  finish:  gpu-dev cancel  (from inside)  •  or exit this shell[/dim]\n")
+        shell_cmd = f"{ssh_cmd} -t {shlex.quote('cd /home/dev/pytorch 2>/dev/null; exec ${SHELL:-bash} -l')}"
+        try:
+            subprocess.run(shell_cmd, shell=True)
+        except KeyboardInterrupt:
+            pass
         if keep:
-            rprint(f"[cyan]📌 kept {str(rid)[:8]} — gpu-dev connect {str(rid)[:8]} • gpu-dev cancel {str(rid)[:8]}[/cyan]")
-        else:
+            rprint(f"[cyan]📌 left {rid8} running — connect: gpu-dev connect {rid8} • cancel: gpu-dev cancel {rid8}[/cyan]")
+            return
+        try:
+            drop = click.confirm(f"Cancel repro box {rid8}?", default=True)
+        except (KeyboardInterrupt, EOFError, click.Abort):
+            drop = False
+        if drop:
             try:
                 reservation_mgr.cancel_reservation(rid, user_info["user_id"])
-                rprint(f"[green]🧹 cancelled repro box {str(rid)[:8]}[/green]")
+                rprint(f"[green]🧹 cancelled {rid8}[/green]")
             except Exception as e:
-                rprint(f"[yellow]auto-cancel failed for {str(rid)[:8]}: {e}[/yellow]")
-    rprint(f"\n[bold]repro exit code: {rc}[/bold]")
+                rprint(f"[yellow]cancel failed for {rid8}: {e}[/yellow]")
+        else:
+            rprint(f"[cyan]📌 left {rid8} running — connect: gpu-dev connect {rid8} • cancel: gpu-dev cancel {rid8}[/cyan]")
+        return
+    # --no-connect / non-TTY: auto-cancel unless --keep, exit code = test result.
+    if keep:
+        rprint(f"[cyan]📌 kept {rid8} — gpu-dev connect {rid8} • gpu-dev cancel {rid8}[/cyan]")
+    else:
+        try:
+            reservation_mgr.cancel_reservation(rid, user_info["user_id"])
+            rprint(f"[green]🧹 cancelled repro box {rid8}[/green]")
+        except Exception as e:
+            rprint(f"[yellow]auto-cancel failed for {rid8}: {e}[/yellow]")
+    rprint(f"\n[bold]repro exit code: {rc}[/bold] ({verdict})")
+    sys.exit(rc)
 _SUBMIT_GPU_TYPES = ["b300", "b200", "b200-mig-1g", "b200-mig-2g", "b200-mig-3g", "h200", "h100",
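
Reviewer sketch (not part of the package): the rewritten tail of `repro()` has four outcomes depending on `--no-connect`, `--keep`, and whether stdout is a TTY. The table-like function below is one way to summarise them; `post_test_plan` and its return keys are hypothetical, and the behaviour is read off the hunk above rather than taken from any documentation.

```python
def post_test_plan(no_connect: bool, is_tty: bool, keep: bool) -> dict:
    """Summarise what repro() does after the test finishes."""
    # Same condition as the diff: connect only interactively and when not opted out.
    connect = (not no_connect) and is_tty
    if connect:
        return {
            "connect": True,
            # --keep skips the prompt; otherwise the user is asked (default yes).
            "cancel": "never" if keep else "prompt",
            # The interactive path returns normally instead of calling sys.exit(rc).
            "exit_with_test_code": False,
        }
    return {
        "connect": False,
        "cancel": "never" if keep else "auto",
        "exit_with_test_code": True,
    }
```

One consequence worth noting: running without `--no-connect` in a non-TTY context (a pipe or CI job) still takes the auto-cancel path, because `sys.stdout.isatty()` is false there.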
@@ -1837,7 +1937,9 @@ def submit(ctx, gpu_type, gpus, hours, disk, ref, no_persistent_disk, spot, dock
                 sys.exit(1)
             create_ssh_config_for_reservation(master_fqdn, master_pod, master_id, master_name)
-        ssh_alias = master_pod
+        # Host alias matches the Host line written by create_ssh_config_for_reservation
+        # (keyed off the reservation id, so warm-claimed masters resolve too).
+        ssh_alias = f"gpu-dev-{master_id[:8]}"
         ssh_base = ["ssh", "-F", str(config_file), "-o", "StrictHostKeyChecking=accept-new"]
         rsync_e = " ".join(shlex.quote(x) for x in ssh_base)
@@ -3124,11 +3226,15 @@ def _show_direct_success(res: dict, elapsed: float) -> None:
     """Print the success block for an instant warm-pool claim,
     matching the normal reserve output (SSH config + VS Code/Cursor remote)."""
     from gpu_dev_cli.reservations import (
-        create_ssh_config_for_reservation, _generate_vscode_command, _generate_cursor_command)
+        create_ssh_config_for_reservation, _generate_vscode_command,
+        _generate_cursor_command, _make_vscode_link, _make_cursor_link)
     rid = res.get("reservation_id", "") or ""
     ssh_command = res.get("ssh_command", "") or ""
     pod_name = res.get("pod_name", "") or ""
     fqdn = res.get("fqdn") or ""
+    # Host alias keys off the reservation id — warm-claimed pods have a pod_name
+    # that is NOT gpu-dev-<resid8>, so we must not use pod_name as the ssh alias.
+    host_alias = f"gpu-dev-{rid[:8]}" if rid else pod_name
     rprint(f"\n[green]✅ Instant reservation ready in {elapsed:.1f}s![/green]")
     rprint(f"[bold]📋 Reservation ID:[/bold] {rid}")
@@ -3137,24 +3243,28 @@ def _show_direct_success(res: dict, elapsed: float) -> None:
     if rid:
         rprint(f"[bold]⚡ Quick Connect:[/bold] gpu-dev connect {rid[:8]}")
-    # Build the per-reservation SSH config so `ssh <pod>` and connect work cleanly.
+    # Build the per-reservation SSH config so `ssh gpu-dev-<resid8>` and connect work cleanly.
     use_include = False
     if fqdn and pod_name and rid:
         try:
             _cfg, use_include = create_ssh_config_for_reservation(fqdn, pod_name, rid, None)
         except Exception:
             pass
-    if pod_name and use_include:
-        rprint(f"[bold]🖥️  SSH Command:[/bold] ssh {pod_name}")
-    elif ssh_command:
-        rprint(f"[bold]🖥️  SSH Command:[/bold] {ssh_command}")
-    vsc = _generate_vscode_command(ssh_command) if ssh_command else None
-    cur = _generate_cursor_command(ssh_command) if ssh_command else None
-    if vsc:
-        rprint(f"[bold]💻 VS Code Remote:[/bold] {vsc}")
-    if cur:
-        rprint(f"[bold]🖥️ Cursor Remote:[/bold] {cur}")
+    if use_include and rid:
+        rprint(f"[bold]🖥️  SSH Command:[/bold] ssh {host_alias}")
+        vscode_url = _make_vscode_link(host_alias)
+        cursor_url = _make_cursor_link(host_alias)
+        rprint(f"[bold]💻 VS Code Remote:[/bold] [link={vscode_url}]code --remote ssh-remote+{host_alias} /home/dev[/link]")
+        rprint(f"[bold]🖥️ Cursor Remote:[/bold] [link={cursor_url}]cursor --remote ssh-remote+{host_alias} /home/dev[/link]")
+    else:
+        if ssh_command:
+            rprint(f"[bold]🖥️  SSH Command:[/bold] {ssh_command}")
+        vsc = _generate_vscode_command(ssh_command) if ssh_command else None
+        cur = _generate_cursor_command(ssh_command) if ssh_command else None
+        if vsc:
+            rprint(f"[bold]💻 VS Code Remote:[/bold] {vsc}")
+        if cur:
+            rprint(f"[bold]🖥️ Cursor Remote:[/bold] {cur}")
 def _format_gpu_display(gpu_count, gpu_type):
@@ -3343,15 +3453,22 @@ def _show_availability(show_spot: bool = False) -> None:
                 spot_table = Table(title="⚡ Spot Instances (us-east-1, ~70% cheaper)")
                 spot_table.add_column("GPU Type", style="cyan")
                 spot_table.add_column("Avail\nNow", style="green")
+                spot_table.add_column("In\nUse", style="yellow")
                 spot_table.add_column("Per\nNode", style="bright_green")
                 spot_table.add_column("Status", style="magenta")
                 spot_table.add_column("Spot Discount", style="dim")
                 _on_demand = {"b300": 95, "b200": 95, "h200": 55, "h100": 98, "a100": 32, "t4": 4.5, "l4": 7}
                 for gt, info in sorted(spot_region_info.items()):
                     avail = info.get("available", 0)
+                    total = info.get("total", 0)
+                    in_use = max(0, total - avail)  # GPUs on up spot nodes already taken
                     per_node = spot_gpus_per_node.get(gt, 8)
                     avail_display = f"[green]{avail}[/green]" if avail > 0 else f"[dim]0[/dim]"
-                    status = "[green]Node up[/green]" if avail > 0 else "Spins up on reserve (~10 min)"
+                    in_use_display = f"[yellow]{in_use}[/yellow]" if in_use > 0 else f"[dim]0[/dim]"
+                    if in_use > 0:
+                        status = "[yellow]Node up (in use)[/yellow]" if avail == 0 else "[green]Node up[/green]"
+                    else:
+                        status = "[green]Node up[/green]" if avail > 0 else "Spins up on reserve (~10 min)"
                     si = info.get("spot_info", {}) or {}
                     sp = si.get("spot_price", "") if isinstance(si, dict) else ""
                     if not sp or (isinstance(si, dict) and "No spot data" in str(si.get("spot_signal", ""))):
@@ -3363,7 +3480,7 @@ def _show_availability(show_spot: bool = False) -> None:
                             avail_signal = f"[green]{pct}% off on-demand[/green]" if pct > 0 else "[dim]At on-demand price[/dim]"
                         except (ValueError, TypeError):
                             avail_signal = "[yellow]Unknown[/yellow]"
-                    spot_table.add_row(f"{gt.upper()} *", avail_display, str(per_node), status, avail_signal)
+                    spot_table.add_row(f"{gt.upper()} *", avail_display, in_use_display, str(per_node), status, avail_signal)
                 console.print(spot_table)
                 rprint("[dim]* = spot: ~70% cheaper, AWS can reclaim with 2-min notice, fulfillment not guaranteed.[/dim]")
                 rprint("[dim]  Separate cluster (us-east-1) with separate disks. Select via gpu-dev reserve (interactive).[/dim]")
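
Reviewer sketch (not part of the package): the new "In Use" column and the status text are derived from the `available` and `total` counts. The function below isolates that decision with the Rich markup removed so the four cases are easy to check; `spot_row_status` is a hypothetical name.

```python
from typing import Tuple


def spot_row_status(avail: int, total: int) -> Tuple[int, str]:
    """Return (in_use, status) for one spot GPU type, without Rich markup."""
    # Clamp at zero in case `total` is missing or lags behind `available`.
    in_use = max(0, total - avail)
    if in_use > 0:
        status = "Node up (in use)" if avail == 0 else "Node up"
    else:
        status = "Node up" if avail > 0 else "Spins up on reserve (~10 min)"
    return in_use, status
```

The only new user-visible state is the fully occupied node: previously `avail == 0` always rendered as "Spins up on reserve", even when a node was up and busy.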
@@ -3737,7 +3854,8 @@ def connect(ctx: click.Context, reservation_id: Optional[str]) -> None:
             for node in nodes:
                 status_display = "✅ Active" if node.get("status") == "active" else f"⏳ {node.get('status', 'unknown')}"
                 pod_name = node.get("pod_name", "unknown")
-                ssh_cmd_short = f"ssh {pod_name}" if pod_name != "unknown" else "N/A"
+                node_rid = node.get("reservation_id")
+                ssh_cmd_short = f"ssh gpu-dev-{node_rid[:8]}" if node_rid else "N/A"
                 table.add_row(
                     f"Node {node.get('node_index', 0) + 1}",
@@ -3994,10 +4112,11 @@ def get_ssh_config_cmd(ctx: click.Context, reservation_id: Optional[str]) -> Non
                 )
                 if config_path:
+                    node_alias = f"gpu-dev-{node_res_id[:8]}"
                     if use_include:
-                        rprint(f"[green]✅ Node {node_idx + 1}:[/green] [cyan]ssh {pod_name}[/cyan]")
+                        rprint(f"[green]✅ Node {node_idx + 1}:[/green] [cyan]ssh {node_alias}[/cyan]")
                     else:
-                        rprint(f"[green]✅ Node {node_idx + 1}:[/green] [cyan]ssh -F {config_path} {pod_name}[/cyan]")
+                        rprint(f"[green]✅ Node {node_idx + 1}:[/green] [cyan]ssh -F {config_path} {node_alias}[/cyan]")
                 else:
                     rprint(f"[yellow]⚠️  Node {node_idx + 1}: Failed to create SSH config[/yellow]")
@@ -4025,12 +4144,13 @@ def get_ssh_config_cmd(ctx: click.Context, reservation_id: Optional[str]) -> Non
             )
             if config_path:
+                host_alias = f"gpu-dev-{reservation_id[:8]}"
                 rprint(f"[green]✅ SSH config created:[/green] [cyan]{config_path}[/cyan]\n")
                 if use_include:
-                    rprint(f"[green]🎉 You can now connect with:[/green] [cyan]ssh {pod_name}[/cyan]")
+                    rprint(f"[green]🎉 You can now connect with:[/green] [cyan]ssh {host_alias}[/cyan]")
                     rprint(f"[dim]   or:[/dim] [cyan]gpu-dev connect {reservation_id[:8]}[/cyan]")
                 else:
-                    rprint(f"[green]🎉 You can now connect with:[/green] [cyan]ssh -F {config_path} {pod_name}[/cyan]")
+                    rprint(f"[green]🎉 You can now connect with:[/green] [cyan]ssh -F {config_path} {host_alias}[/cyan]")
                     rprint(f"[dim]   or:[/dim] [cyan]gpu-dev connect {reservation_id[:8]}[/cyan]")
             else:
                 rprint("[red]❌ Failed to create SSH config[/red]")
@@ -4597,13 +4717,13 @@ def ssh_include(action: str):
     \b
     When enabled:
-      • Simple SSH commands: ssh <pod-name>
-      • VS Code Remote works: code --remote ssh-remote+<pod-name>
+      • Simple SSH commands: ssh gpu-dev-<reservation-id>
+      • VS Code Remote works: code --remote ssh-remote+gpu-dev-<reservation-id>
       • Cursor Remote works: Open Remote SSH in Cursor
     \b
     When disabled:
-      • Need -F flag: ssh -F ~/.gpu-dev/<id>-sshconfig <pod-name>
+      • Need -F flag: ssh -F ~/.gpu-dev/<id>-sshconfig gpu-dev-<reservation-id>
       • VS Code/Cursor requires manual config setup
     \b
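
Reviewer sketch (not part of the package): every hunk in this file replaces `pod_name` with an alias built from the first eight characters of the reservation id. Assuming that convention holds everywhere, the rule reduces to the small helper below; `host_alias` as a standalone function is hypothetical, and the `pod_name` fallback follows `_show_direct_success`.

```python
def host_alias(reservation_id: str, pod_name: str = "") -> str:
    """Build the SSH Host alias from the reservation id, not the pod name."""
    # Warm-claimed pods keep their pool pod name, so the pod name is only a
    # last resort when no reservation id is available.
    if reservation_id:
        return f"gpu-dev-{reservation_id[:8]}"
    return pod_name
```

Centralising the rule in one helper like this would also avoid the repeated f-strings seen across `connect`, `get_ssh_config_cmd`, and `submit`.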

{gpu_dev-0.7.5 → gpu_dev-0.7.10}/cli-tools/gpu-dev-cli/gpu_dev_cli/config.py RENAMED

@@ -29,6 +29,15 @@ class Config:
             "description": "Spot-only us-east-1 environment (T4/L4/CPU)",
             "spot_types": ["b300", "b200", "h200", "h100", "a100", "t4", "l4", "rtxpro6000"],
         },
+        # Staging (us-west-1, tf "default" workspace, environment=test). Same
+        # standard resource prefix as prod, just a different region — so only the
+        # region changes. Live capacity: cpu-x86/arm + t4. Used for integration
+        # tests. Select via `GPU_DEV_ENVIRONMENT=staging` (or the "test" env alias).
+        "staging": {
+            "region": "us-west-1",
+            "workspace": "default",
+            "description": "Staging (us-west-1, cpu + t4)",
+        },
     }
     DEFAULT_ENVIRONMENT = "prod"
@@ -43,19 +52,33 @@ class Config:
         # Load unified config (handles migration from legacy files)
         self.user_config = self._load_config()
-        # Get region: env vars take priority (for spot routing), then config, then default
+        # Active environment: GPU_DEV_ENVIRONMENT env wins (handy for tests/CI),
+        # then the persisted config, then the default. Its region/prefix back the
+        # fallbacks below so e.g. `GPU_DEV_ENVIRONMENT=staging` reaches us-west-1.
+        env_override = os.getenv("GPU_DEV_ENVIRONMENT")
+        env_name = env_override or self.user_config.get(
+            "environment", self.DEFAULT_ENVIRONMENT)
+        env_cfg = self.ENVIRONMENTS.get(env_name, {})
+        # Get region: AWS_* env vars take priority (for spot routing); then an
+        # explicit GPU_DEV_ENVIRONMENT switch uses that env's region (beating the
+        # persisted one); then the persisted config; then the env's region; default.
         env_region = os.getenv("AWS_REGION") or os.getenv("AWS_DEFAULT_REGION")
         if env_region and env_region != self.user_config.get("region"):
             self.aws_region = env_region
+        elif env_override and env_cfg.get("region"):
+            self.aws_region = env_cfg["region"]
         elif self.user_config.get("region"):
             self.aws_region = self.user_config["region"]
+        elif env_cfg.get("region"):
+            self.aws_region = env_cfg["region"]
         else:
             self.aws_region = "us-east-2"
         os.environ["AWS_DEFAULT_REGION"] = self.aws_region
-        # Resource naming convention - no config needed!
-        self.prefix = "pytorch-gpu-dev"
+        # Resource naming convention — per-environment prefix (default for prod).
+        self.prefix = env_cfg.get("prefix", "pytorch-gpu-dev")
         # Construct ARNs from convention
         self.queue_name = f"{self.prefix}-reservation-queue"
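
Reviewer sketch (not part of the package): the region logic above has five tiers, and their order is easy to misread in diff form. The function below reproduces the same branch order as a pure function so it can be exercised without touching `os.environ`. The name `resolve_region`, its parameters, and the trimmed `ENVIRONMENTS` table are assumptions for illustration; only the `staging` entry and the `us-east-2` fallback are taken from the diff.

```python
from typing import Mapping

# Trimmed, illustrative table: only the staging region is copied from the diff.
ENVIRONMENTS = {
    "prod": {},
    "staging": {"region": "us-west-1", "workspace": "default"},
}


def resolve_region(
    environ: Mapping[str, str],
    user_config: Mapping[str, str],
    environments: Mapping[str, Mapping[str, str]] = ENVIRONMENTS,
    default_environment: str = "prod",
    fallback_region: str = "us-east-2",
) -> str:
    env_override = environ.get("GPU_DEV_ENVIRONMENT")
    env_name = env_override or user_config.get("environment", default_environment)
    env_cfg = environments.get(env_name, {})

    # 1) AWS_* env vars win, but only when they differ from the persisted region.
    env_region = environ.get("AWS_REGION") or environ.get("AWS_DEFAULT_REGION")
    if env_region and env_region != user_config.get("region"):
        return env_region
    # 2) An explicit GPU_DEV_ENVIRONMENT beats the persisted region.
    if env_override and env_cfg.get("region"):
        return env_cfg["region"]
    # 3) Persisted region from the user config.
    if user_config.get("region"):
        return user_config["region"]
    # 4) Region of the persisted or default environment.
    if env_cfg.get("region"):
        return env_cfg["region"]
    # 5) Hard-coded default.
    return fallback_region
```

For example, `resolve_region({"GPU_DEV_ENVIRONMENT": "staging"}, {"region": "us-east-2"})` follows tier 2 and yields the staging region. Note that the real constructor also writes the result back to `AWS_DEFAULT_REGION`, so a second `Config` built in the same process may take tier 1 instead.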
