PyPI - gpu-dev - Versions diffs - 0.6.6__tar.gz → 0.7.1__tar.gz - Mend

gpu-dev 0.6.6tar.gz → 0.7.1tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (179)

{gpu_dev-0.6.6 → gpu_dev-0.7.1}/.github/workflows/publish.yml RENAMED

@@ -28,7 +28,7 @@ jobs:
             echo "::error::Tag version ($TAG_VERSION) does not match package version ($PKG_VERSION)"
             exit 1
           fi
-      - name: Build package
+      - name: "Build package (gpu-dev = CLI + SDK)"
         run: uv build
       - name: Generate attestations
         uses: actions/attest-build-provenance@v2
@@ -36,3 +36,8 @@ jobs:
           subject-path: dist/*
       - name: Publish to PyPI
         uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          # Multi-package + re-run safety: skip any file already on PyPI (e.g. a
+          # gpu-dev version that published before a sibling failed) instead of
+          # erroring on the duplicate.
+          skip-existing: true
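
The context lines of the first hunk show the workflow failing the release when the tag version and the package version differ. The shell logic that derives `TAG_VERSION` and `PKG_VERSION` is not part of this diff, so the following is only a minimal sketch of the same guard in Python. The function names are hypothetical, and stripping a leading `v` is an assumption based on the README's note that releases are driven by `v*` tags. Only the error message format is taken from the hunk.

```python
def tag_matches_package(tag: str, pkg_version: str) -> bool:
    """Return True when a release tag such as 'v0.7.1' names pkg_version.

    Assumption: tags carry a single leading 'v' that is not part of the
    package version. The real workflow may derive the version differently.
    """
    tag_version = tag[1:] if tag.startswith("v") else tag
    return tag_version == pkg_version


def mismatch_message(tag_version: str, pkg_version: str) -> str:
    """Build the GitHub Actions error annotation shown in the hunk."""
    return (
        f"::error::Tag version ({tag_version}) does not match "
        f"package version ({pkg_version})"
    )


if __name__ == "__main__":
    # Hypothetical values: a stale tag pushed against the 0.7.1 package.
    tag, pkg = "v0.6.6", "0.7.1"
    if not tag_matches_package(tag, pkg):
        print(mismatch_message(tag[1:], pkg))
        raise SystemExit(1)
```

With this guard in front of the build, `skip-existing: true` only has to cover re-runs of an already published version. A tag that names the wrong version fails before anything is uploaded.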

{gpu_dev-0.6.6 → gpu_dev-0.7.1}/CLAUDE.md RENAMED

@@ -51,6 +51,41 @@ Currently we're working on a developer servers with GPUs in AWS. This means we'l
 # AGENT SECTION
+## Instant-sandboxes branch — WIP & things to fix (2026-05-29)
+Big push on warm pools + instant claims + prebuilt pytorch. Tracking state here so it's not lost.
+**Committed, needs deploy/activation:**
+- `tf apply` (branch `instant-sandboxes`): warm-pool reconciler + fail-open claim hook, async hot-refill on claim, async per-user EFS mount, processor self-invoke IAM, Bedrock marketplace perms on pod IRSA, pytorch `ref` staging, availability counts warm-ready as available, git-cache worktree snapshot + `pytorch-snapshot` DaemonSet, processor Function URL.
+- Reinstall **CLI + SDK**: `--direct` (default on) synchronous claim, `--ref` (pr/commit/branch), `--no-persist`+`--disk` conflict guard, Function-URL cache (`~/.config/gpu-dev/direct-url.json`).
+- Rebuild **gpu-dev image**: Claude Code cache-bust (latest), `~/.local/bin` on PATH (bash+zsh, all disks).
+- **Meta/fbcode**: grant the user IAM role `lambda:InvokeFunctionUrl` + `lambda:GetFunctionUrlConfig` (scoped to reservation-processor) so `--direct` works; otherwise it falls back to SQS silently.
+**Prebuilt viable/strict + warm ccache (importable torch + marginal C++ build) — COMMITTED on `instant-sandboxes`, needs `tf apply`:**
+- [x] Dedicated `m7i.48xlarge` build node group (always-on). `build-node.tf`, node `ip-10-0-26-237` up.
+- [x] Hourly **stateful incremental** build CronJob (`pytorch-prebuild.tf`): `concurrencyPolicy=Forbid` + flock (the "build queue"), **CUDA 13.2** (matches the cu13 nvshmem ABI in the image — 12.8 fails at nvlink), `TORCH_CUDA_ARCH_LIST=9.0;10.0` (see arch note below), `BUILD_TEST=0`, builds at **`/home/dev/pytorch`** on a hostPath (path-match for relocatable incremental), `CCACHE_DIR=/ccache_shared/build-node`, only when viable/strict SHA bumps. Publishes via rsync to `/ccache_shared/prebuilt/pytorch-<arch>`.
+- [x] `pytorch-snapshot` DaemonSet (in `git-cache.tf`) arch-aware: rsyncs the built tree from the shared EFS to each node's `/mnt/nvme/pytorch-built` (arch via `uname -m`; arm skips gracefully). Existing master worktree HTTP pull unchanged.
+- [x] `stage-pytorch` (lambda) reflink-copies the built tree into `/home/dev/pytorch` + sets `PYTHONPATH` (`/etc/profile.d/zz-pytorch.sh` + `*_ext`) so `import torch` works with no pod-side build. With `--ref`: same tree (warm `build/`), checkout the ref, rebuild is incremental. Applies to warm pods too.
+- **Publish/cache decision:** reuse the existing `ccache_shared` EFS (everyone already mounts it) under `/prebuilt`; no new EFS/S3. EFS here = plain NFS volume mounts, not CSI. ccache is shared by build node + ALL dev pods (incl persistent-disk) so a user's own build benefits from the build node's compiles.
+- **Validated build numbers** (m7i, 128 jobs, CUDA 13.2, `9.0;10.0`, BUILD_TEST=0): cold (build/ gone, ccache 86% warm = node-replacement case) **~21m**; incremental (1 cutlass kernel + 386MB relink) **~42s**; ninja no-op **~22s**; ccache **86.5%** hit. Result: `torch 2.13.0a0`, imports, `get_arch_list()=['sm_90','sm_100']`.
+- [ ] **Cleanup:** delete the manual test pod `gpu-dev-buildtest` (gpu-dev ns) — done with empirical measurement (kept for now in case more measurements needed). It holds a warm `/root/pt` build tree.
+- [ ] **Reflink caveat:** stage-pytorch uses `cp -a --reflink=auto || cp -a`. For the drop-in to be *instant* (not a 20-40GB copy), the pod's `/home/dev` (dev-home emptyDir) and the node's `/mnt/nvme` must be the **same filesystem**. Verify node bootstrap puts kubelet emptyDir on `/mnt/nvme`; else it falls back to a full copy (correct, slower).
+**To fix / todo:**
+- [ ] **Direct/warm claim path drops `--ref` and `--no-persist`:** a `reserve --ref X --no-persist` (no `--disk`) still satisfies the line-1388 `claim_direct` condition (it doesn't exclude `ref`), so it goes the warm/direct path which doesn't carry `ref`/`no_persistent_disk` → the user got their **default persistent disk** + no PR staged (reservation `5e83bb5b`: `no_persistent_disk=false, disk_name=default, pytorch_ref=null, version=null`). Fix: exclude `ref` (and honor `no_persistent_disk`) from the direct fast-path, OR thread `ref`/`no_persistent_disk` through `claim_direct`+`handle_direct_claim`. Workaround for now: `--no-direct --no-persist --ref`.
+- [x] **Warm full-GPU (1-GPU) pods + evict-on-demand** (DONE, commit c1211e3): `_evict_warm_for_capacity` deletes the minimum warm-ready pods on a single node when no node has enough free GPUs (gated in `get_target_az_for_reservation` before the Pending fallback; reconciler tops the pool back up). Also covers full **MIG** nodes filling up (not just full-GPU) — warm pods no longer block 2/4/8-GPU or full-node requests. Added `WARM_POOL_TARGETS` `h100:1, b200:1` (safe now that they're evictable). `get_available_gpus_on_node` counts warm pods as used, so placement avoids them until eviction frees them. Needs `tf apply`.
+- [ ] **CLI install hygiene:** user's `~/.venv` has BOTH `gpu-dev 0.6.6` (editable→repo) and a stale duplicate `gpu-dev-cli 0.3.5` (also editable, same dir, different dist name). `pip uninstall gpu-dev-cli` to remove the confusing duplicate; the real package is `gpu-dev`.
+- [ ] **Publish via tarball, not rsync-to-EFS:** rsync of the raw tree (.git + build/ = 100k+ small files) to EFS stalled at 0 files in 13min (NFS per-file round-trips). Switched publish + DaemonSet to a single `zstd` tarball (sequential I/O). (committed)
+- [ ] **Prebuilt built WITHOUT cuDNN** — `import torch` warns "compiled without cuDNN/MIOpen". CI/nightly build with cudnn9. Add libcudnn to the gpu-dev image + `USE_CUDNN=1` to the build recipe for fidelity (conv/cudnn-dependent ops + tests). Irrelevant for flex-attention int64 test; matters generally.
+- [ ] **`--ref pr/N` uses `pull/N/head`, not `/merge`** — `/head` is the PR author's raw branch tip (often based on old trunk, missing trunk-added tests); CI tests `/merge` (PR merged onto current trunk). For CI-repro fidelity, `pr/N` should fetch `pull/N/merge` (fall back to `/head` if no merge ref). `stage-pytorch` REF case in `index.py`. (This is why `pull/185479/head` lacked `test_large_kv_int64_pointer_math_cuda`.)
+- [ ] **Misleading disconnect/expiry message** — on `gpu-dev connect` connection loss OR reservation expiry, the CLI prints "❌ Authentication failed. You don't have SSH access... ask the primary user to add you" even for the PRIMARY user's own expired/cancelled reservation. Distinguish: (a) reservation expired -> "Reservation <id> expired at <time>"; (b) cancelled -> "Reservation was cancelled"; (c) connection dropped but still active -> "Connection lost, reconnect with gpu-dev connect <id>"; (d) genuine auth failure -> the current add-user message. Check reservation status before assuming auth failure.
+- [ ] **`gpu-dev cancel` from inside the pod** — show "Shutting down this reservation..." (graceful message) instead of an abrupt SSH drop, so the user knows the disconnect was intentional.
+- [ ] SSH CA certs to drop the ~0.33s `kubectl exec` key injection on warm claim (auth-model change).
+- [ ] AMI baker re-bakes on every base-EKS-AMI roll (5 baked AMIs in 2 days): pin the base AMI version + clean up old `gpu-dev-baked-*`.
+- [ ] **Warm pods: gate `warm-state=ready` on staging completion** (NOW MORE IMPORTANT — the built tree is ~30GB, and on GPU nodes it's a `cp` not reflink, so staging takes ~1-3min; a claim in that window hands over a half-copied tree). Two options: (a) claim-time check — exec `[ -f /home/dev/.pytorch-staging ]` in `try_claim_warm_pod`, skip pods still staging (simple, but adds ~0.5s exec to every warm claim); (b) label-flip — create with `warm-state=provisioning`, reconciler exec-checks staging + flips to `ready` (no claim latency, but 4 interacting changes: create label + reconciler flip + eviction must also target `provisioning` + claim already filters `ready`). Prefer (b). Marker: `.pytorch-staging` present during, removed when done; `.pytorch-ready` written at end.
+- [ ] **Image-rebuild propagation gap:** pods use `imagePullPolicy=IfNotPresent` + `:latest`, so a rebuilt image does NOT reach pods until the node re-pulls. After every image rebuild you must `kubectl rollout restart daemonset gpu-dev-image-prepuller -n kube-system` (re-pull on all GPU nodes, ~5min) **and** recycle warm pods, else pods run the stale cached image (this is why claude/PATH looked unfixed). Automate later: reconciler recycles warm pods when the `:latest` digest changes (and/or trigger the prepuller restart from the image-build step).
+- [x] **Prebuilt build archs (CORRECTED):** use plain `TORCH_CUDA_ARCH_LIST=9.0;10.0` — **NOT** `9.0a;10.0a`. You never put the `a` in the list yourself. PyTorch's `cmake/Codegen.cmake` (`_BUILD_FOR_ADDITIONAL_ARCHS`, gated on `compute_90`/`compute_100` being present) auto-adds `sm_90a`/`sm_100a` to exactly the cutlass kernels that need Hopper wgmma/TMA (`RowwiseScaledMM.cu`, `ScaledGroupMM.cu`, `GroupMM.cu`). Verified in `compile_commands.json`: the RowwiseScaledMM line shows all four (sm_90, sm_90a, sm_100, sm_100a). Forcing `9.0a` for the whole build is non-CI and would drop the plain SASS / other archs. Per-commit **trunk** CI builds narrow per-runner arch (`9.0` alone for H100 jobs, `10.0` for B200) — nightly builds the fat `7.5;8.0;8.6;9.0;10.0;12.0+PTX`; we match trunk + "9+" for our H100/B200 fleet. To add A100/T4/L4 later, widen to `8.0;8.9;9.0;10.0` (still one build). CUDA 13.2 (image default), not 12.8.
 ## Issues I found with the description above
 - I am not sure terraform-aws-github-runner is correctly described. Next time I go over this code for maintenance or adding something, I'll inform the user of what I think should change. This is not an active goal though, just a sidequest.
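
One to-do item in the hunk above proposes that `--ref pr/N` should fetch `pull/N/merge` and fall back to `pull/N/head` when no merge ref exists. The `stage-pytorch` code in `index.py` is not shown in this diff, so the following is only a sketch of that resolution order. The function name and the regular expression are hypothetical. Treating every non-`pr/N` value as a commit or branch to pass through unchanged is an assumption based on the note that `--ref` accepts "pr/commit/branch".

```python
import re

# Hypothetical pattern for the `pr/N` form of --ref described in the notes.
_PR_REF = re.compile(r"^pr/(\d+)$")


def candidate_git_refs(ref: str) -> list[str]:
    """Return the git refs to try, in order, for a user-supplied --ref.

    For `pr/N`, the merge ref comes first (what CI tests, per the notes),
    then the PR author's raw branch tip as a fallback. Anything else is
    assumed to be a commit SHA or branch name and is returned unchanged.
    """
    match = _PR_REF.match(ref)
    if match is None:
        return [ref]
    number = match.group(1)
    return [f"pull/{number}/merge", f"pull/{number}/head"]


if __name__ == "__main__":
    # The PR number below is the one quoted in the to-do item.
    for candidate in candidate_git_refs("pr/185479"):
        print(candidate)
```

A caller would attempt a `git fetch` of each candidate in turn and stop at the first one that succeeds. That keeps the fallback to `/head` explicit instead of silent.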
@@ -329,6 +364,11 @@ module "us_east_1" {
 - **Scale up T4 instances** - Add 3 more T4 nodes (g4dn.12xlarge) to cluster
 - **Scale up L4 instances** - Add 3 more L4 nodes (g6.12xlarge) to cluster
 - **Add on-demand H100/H200/B200 capacity** - Add at least 2 nodes each of H100 (p5.48xlarge), H200 (p5e.48xlarge), and B200 (p6-b200.48xlarge) as on-demand capacity in addition to existing reserved instances
+- **Run pytorch tests via gpu-dev** - Add a way to run a specific test / set of tests in ../pytorch (see `python run.py` in pytorch for how tests are normally invoked). Short term: `gpu-dev test <paths/test ids>` that reserves, stages pytorch (via --ref), and runs the test command. Long term (stretch, "magic TD"): an agent does target determination from the repo diff, picks the affected tests, kicks off a gpu-dev run, and streams test output back. Builds on the warm-pool + pytorch-snapshot work (instant-sandboxes branch).
+- **Warm pool follow-ups** (from instant-sandboxes branch):
+  - Claim-with-ref: today an explicit `--ref` skips the warm pool (cold path). Could instead claim a warm pod and incrementally `git fetch`+checkout the ref in-place.
+  - Availability display: warm-ready pods count as "used" in the availability table, so `gpu-dev avail` under-reports free MIG/CPU even though a claim is instant. Reconcile the display with warm claimability.
+  - CPU/MIG node disk: the pytorch-snapshot DaemonSet writes ~5-10GB to /mnt/nvme (root disk on nodes without instance NVMe); confirm CPU dev node root volumes are sized for it.
 - **Future features**:
   - Multi-server (16 GPU) reservations
   - GitHub organization/team verification

{gpu_dev-0.6.6 → gpu_dev-0.7.1}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.6.6
-Summary: CLI tool for PyTorch GPU developer server reservations
+Version: 0.7.1
+Summary: CLI + Python SDK for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10
 Description-Content-Type: text/markdown
@@ -34,7 +34,7 @@ print(result.stdout)
 sandbox.cancel()
 ```
-Install: `pip install -e sdk/python/` — see [SDK docs](../../sdk/python/README.md) and [quickstart notebook](../../sdk/python/examples/quickstart.ipynb).
+The SDK ships inside the `gpu-dev` package: `pip install gpu-dev`, then `from gpu_dev import GpuDev`. See [SDK docs](../../sdk/python/README.md) and [quickstart notebook](../../sdk/python/examples/quickstart.ipynb).
 ---
@@ -701,23 +701,19 @@ gpu-dev disk list-content <disk-name>
 ### Getting Help
 - Use `gpu-dev help` or `gpu-dev <command> --help`
-- Report issues: https://github.com/anthropics/claude-code/issues
+- Report issues: https://github.com/wdvr/osdc/issues
 ---
 ## Development
 ```bash
-# Install development dependencies
-poetry install --with dev
-# Run tests
-poetry run pytest
-# Format code
-poetry run black .
-poetry run isort .
+# Editable install from the repo (one package: CLI + SDK)
+pip install -e .
-# Type checking
-poetry run mypy .
+# Build the distribution the way CI does (uv)
+uv build                            # gpu-dev (CLI + SDK)
 ```
+Releases are tag-driven: pushing a `v*` tag runs `.github/workflows/publish.yml`,
+which builds and publishes both packages to PyPI.

{gpu_dev-0.6.6 → gpu_dev-0.7.1}/cli-tools/gpu-dev-cli/README.md RENAMED

@@ -16,7 +16,7 @@ print(result.stdout)
 sandbox.cancel()
 ```
-Install: `pip install -e sdk/python/` — see [SDK docs](../../sdk/python/README.md) and [quickstart notebook](../../sdk/python/examples/quickstart.ipynb).
+The SDK ships inside the `gpu-dev` package: `pip install gpu-dev`, then `from gpu_dev import GpuDev`. See [SDK docs](../../sdk/python/README.md) and [quickstart notebook](../../sdk/python/examples/quickstart.ipynb).
 ---
@@ -683,23 +683,19 @@ gpu-dev disk list-content <disk-name>
 ### Getting Help
 - Use `gpu-dev help` or `gpu-dev <command> --help`
-- Report issues: https://github.com/anthropics/claude-code/issues
+- Report issues: https://github.com/wdvr/osdc/issues
 ---
 ## Development
 ```bash
-# Install development dependencies
-poetry install --with dev
-# Run tests
-poetry run pytest
-# Format code
-poetry run black .
-poetry run isort .
+# Editable install from the repo (one package: CLI + SDK)
+pip install -e .
-# Type checking
-poetry run mypy .
+# Build the distribution the way CI does (uv)
+uv build                            # gpu-dev (CLI + SDK)
 ```
+Releases are tag-driven: pushing a `v*` tag runs `.github/workflows/publish.yml`,
+which builds and publishes both packages to PyPI.
