npm - opencode-skills-collection - Versions diffs - 3.1.2 → 3.1.3 - Mend

opencode-skills-collection 3.1.2 → 3.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (65) hide show

package/bundled-skills/remote-gpu-trainer/references/multinode.md ADDED Viewed

@@ -0,0 +1,190 @@
+# Multi-node NCCL & elastic-training gotchas — ADVANCED
+**Single-box users skip this entire file.** None of it applies to a single instance — one node, however many GPUs,
+runs DDP/FSDP over NVLink/PCIe and never touches the inter-node NCCL transport, fabric-manager, or rendezvous logic
+below. This file is **only** for jobs spanning ≥2 rented instances (multi-node DDP, FSDP, pipeline/tensor parallel,
+or elastic training). It assumes the checkpoint-to-durable + idempotent-resume spine is already in place
+(`references/principles.md` #8; cadence + atomic-write in `references/spot-resilience.md`) — multi-node only changes
+*how the process group forms and breaks*, never the resume mechanism.
+These are all **[P] platform/topology-specific** gotchas. The universal ones (disk, OOM, CRLF, silent sync, spot
+grace) are **not** restated here — see `references/gotchas_universal.md`.
+## Table of contents
+- **Fabric-manager** — one bad node hangs the WHOLE job at NCCL init (MN1)
+- **NIC selection** — NCCL picks docker0/loopback/slow NIC (MN2)
+- **Timeout masking** — default 1800 s hides a straggler/dead rank (MN3)
+- **MTU mismatch** — jumbo frames silently dropped, small messages fine (MN4)
+- **Elastic restart ≠ state restore** — torchrun `--max-restarts` (MN5)
+- **Elastic Horovod pause-below-min-np** — pauses then errors (MN6)
+- **First-move checklist** — bring up a healthy multi-node group
+To jump: `grep -in <keyword> references/multinode.md` (e.g. `grep -in fabric references/multinode.md`).
+---
+## MN1 — Fabric-manager down on one node hangs the entire job at NCCL init
+**Symptom.** Launch a multi-node job; every rank prints up to NCCL init then freezes — no traceback, no progress,
+no OOM, just a silent hang at the first collective. Killing and relaunching reproduces the exact same stall. A
+single-node run of the identical code works fine.
+**Root cause.** On NVSwitch-based nodes (HGX/DGX A100/H100), `nvidia-fabricmanager` must be running and healthy on
+**every** node for NVLink/NVSwitch routing to come up. One node with a stopped, crashed, or version-mismatched
+fabric-manager cannot establish its NVLink fabric, so its ranks never join the collective — and because NCCL init is
+a **global barrier**, all the healthy nodes block waiting for the one that can never arrive. The failure is global;
+the cause is local to one box.
+**Fix.**
+- Check fabric-manager on **every** node before launching, not just the head:
+  `systemctl status nvidia-fabricmanager` (or `nvidia-smi -q | grep -i fabric`). It must be `active (running)` and
+  its version must match the driver on that node.
+- Turn the silent hang into a diagnosable one: launch with `NCCL_DEBUG=INFO` (optionally `NCCL_DEBUG_SUBSYS=INIT,NET`).
+  The first node whose log **stops** before the others printed their topology is the culprit.
+- On a rental the fix is operational, not a reseat: restart the service (`systemctl restart nvidia-fabricmanager`)
+  if permitted, otherwise **stop that instance and re-rent a different box** — a fabric-manager that won't start is
+  usually a sick host (overlaps the Xid hardware-failure logic in `references/gotchas_universal.md`).
+URL: https://support.crusoecloud.com/hc/en-us/articles/46061806112155-NCCL-Hangs-and-Multi-Node-Training-Stalls-Caused-by-Failed-nvidia-fabricmanager
+---
+## MN2 — NCCL picks the wrong NIC (docker0 / loopback / a slow interface)
+**Symptom.** Multi-node init hangs forever, OR it connects but inter-node bandwidth is 10× too slow (allreduce
+dominates the step time; single-node throughput was fine). `NCCL_DEBUG=INFO` shows NCCL binding to `docker0`, `lo`,
+or a 1 GbE management NIC instead of the fast data-plane interface.
+**Root cause.** NCCL auto-discovers network interfaces and, with no guidance, can select an unroutable bridge
+(`docker0`), the loopback, or the slow management NIC — none of which carry traffic between the real nodes. The
+job either never connects (unroutable) or runs over the wrong, slow path.
+**Fix.** Pin the transport explicitly on **all** nodes (identical env on every rank):
+- `export NCCL_SOCKET_IFNAME=<real-iface>` — a prefix filter; e.g. `eth`, `ens`, `bond`. Exclude bad ones with a
+  leading `^`: `NCCL_SOCKET_IFNAME=^docker0,lo`.
+- On RDMA/InfiniBand fabrics pin the HCA: `export NCCL_IB_HCA=mlx5` (the active adapter prefix).
+- **No RDMA (TCP-only rental):** `export NCCL_IB_DISABLE=1` so NCCL stops probing for nonexistent IB and falls back
+  to the socket path cleanly instead of stalling on IB discovery.
+- Confirm the chosen interface with `NCCL_DEBUG=INFO` — the `NET/Socket` or `NET/IB` line names the bound device;
+  verify it is the fast data-plane NIC, not the bridge.
+URL: https://github.com/NVIDIA/nccl/issues/1580
+---
+## MN3 — Default 1800 s NCCL timeout masks a straggler or a dead rank
+**Symptom.** A run that was progressing freezes at a collective for exactly **30 minutes**, then dies with a
+`Watchdog ... collective ... timed out` / `NCCL timeout` error — or worse, hangs far longer because the watchdog is
+off. The real failure (a rank that OOM'd, crashed, or fell behind) happened 30 min earlier and is buried.
+**Root cause.** NCCL's default collective timeout is **1800 s**. When one rank dies or stalls, the others sit in the
+collective waiting out the full 30-minute window before anything surfaces — so the symptom appears half an hour after
+the cause, and a transient straggler can trip a hard abort it should have survived.
+**Fix.**
+- **Fail fast on a dead rank:** `export NCCL_ASYNC_ERROR_HANDLING=1` (newer PyTorch: `TORCH_NCCL_ASYNC_ERROR_HANDLING=1`)
+  so a crashed/unreachable rank tears the group down promptly instead of waiting out the timeout — the surviving ranks
+  get an actionable error near the true failure point.
+- **Tune the window deliberately, both directions.** Genuinely slow collectives (huge allreduce, slow checkpoint
+  barrier) need a **longer** timeout to avoid false aborts: raise it via the process-group init
+  (`torch.distributed.init_process_group(..., timeout=timedelta(minutes=60))`). To surface a hung straggler **sooner**,
+  lower it. The default rarely fits a real job — set it on purpose.
+- Pair with MN1's `NCCL_DEBUG=INFO` so the timeout error names which rank went silent.
+URL: https://repost.aws/questions/QURXddiuikQLesRDGz39RhIw/nccl-socket-timeout-when-using-large-dataset-in-multi-node-pretraining
+---
+## MN4 — Jumbo-frame MTU mismatch silently drops large NCCL frames
+**Symptom.** Small collectives work (rendezvous succeeds, tiny tensors allreduce fine), but the job hangs or throws a
+transport error the moment a **large** payload is sent — large gradient buckets, the first big allreduce, or a model
+broadcast. The break correlates with message **size**, not with which ranks are involved.
+**Root cause.** The container's interface is configured for jumbo frames (MTU 9000) but the host veth / bridge it
+attaches to is still at 1500 (or vice-versa). Small packets fit under the smaller MTU and pass; oversized frames are
+silently dropped at the mismatched hop with no application-level error, so NCCL stalls waiting for data that never
+lands. Classic on containerized rentals where the container MTU and the host bridge MTU were set independently.
+**Fix.**
+- Match MTU end to end: set the **host veth/bridge** to the same MTU as the container interface (9000 ↔ 9000, or drop
+  both to 1500). Inspect with `ip link show` on both the container and the host side.
+- Quick confirm the path actually carries jumbo frames between nodes:
+  `ping -M do -s 8972 <other-node-ip>` (8972 = 9000 − 28 bytes header). If it fails but `-s 1472` succeeds, the large
+  frames are being dropped → fix the MTU, do not blame NCCL.
+- If the host bridge MTU cannot be changed on the rental, set the container interface **down** to 1500 to match — a
+  uniform-but-smaller MTU works; a mismatch does not.
+URL: https://github.com/moby/moby/issues/4378
+---
+## MN5 — torchrun / TorchElastic `--max-restarts` restarts the process group but does NOT restore training state
+**Symptom.** A worker dies (preemption, transient fault); torchrun's `--max-restarts=N` dutifully re-runs rendezvous
+and relaunches **all** workers — but training resumes from **step 0** (or the wrong epoch), silently throwing away the
+progress before the failure. The restart "worked" yet the run is set back hours.
+**Root cause.** TorchElastic guarantees only that the **worker group** is reconstituted: it re-runs the c10d
+rendezvous, re-derives `world_size`/`rank`, and relaunches every worker process. It does **not** persist or reload
+training state — that is entirely the script's responsibility. A `--max-restarts` with no in-script
+load-latest-checkpoint just re-runs `main()` from scratch on every restart.
+**Fix.**
+- The per-epoch (or per-N-step) snapshot is what restores state, exactly per `references/principles.md` #8: write
+  full state (model + optimizer + LR scheduler + epoch/step + RNG + dataloader position) **atomically**
+  (`tmp`→`fsync`→`os.rename`), and **load-latest unconditionally at the top of every launch** so a torchrun restart
+  resumes instead of restarting. Cadence formula + atomic-write detail live in `references/spot-resilience.md`.
+- Use the c10d rendezvous backend hosted on a `host:port` (no etcd dependency); set `--max-restarts` to survive the
+  expected number of preemptions, not 0.
+- **REQUIRED:** treat the restored run as a *resume of the identical config*, never a hand-patched relaunch — a
+  silently-restarted or reshuffled run is the exact contamination `verifying-dl-experiments` guards against; confirm
+  the resumed step/epoch against the loaded checkpoint before trusting any post-restart metric.
+URL: https://docs.pytorch.org/tutorials/beginner/ddp_series_fault_tolerance.html
+---
+## MN6 — Elastic Horovod pauses below `--min-np`, then errors at `HOROVOD_ELASTIC_TIMEOUT`
+**Symptom.** Under elastic Horovod (`horovodrun -np 8 --min-np 4 --max-np 12`), enough workers get preempted to drop
+the live count below `--min-np`; the job does **not** fail immediately — it appears to hang (paused, no progress) —
+and then, minutes later, dies with a timeout error. Operators waiting for an instant failure miss the pause window.
+**Root cause.** Elastic Horovod **pauses** (does not fail) when available workers fall below `--min-np`, waiting for
+capacity to return. It only errors once `HOROVOD_ELASTIC_TIMEOUT` (default **600 s**) elapses without recovering to
+`--min-np`. So a too-high `--min-np` turns a routine couple-of-preemptions event into a hard failure after a silent
+10-minute wait.
+**Fix.**
+- Set `--min-np` **low enough** that the typical number of concurrent preemptions does not breach it — survivors keep
+  training through membership changes instead of pausing.
+- Raise `HOROVOD_ELASTIC_TIMEOUT` if the spot tier's capacity routinely returns slower than 600 s, so a temporary
+  capacity dip resumes rather than aborts.
+- **Pin LR-scaling and data-sharding to `--max-np`, not the live worker count** — otherwise the effective learning
+  rate and shard assignment drift on every membership change, quietly corrupting the run (a `verifying-dl-experiments`
+  concern: a metric from a run whose LR silently rescaled is not a clean datapoint).
+URL: https://horovod.readthedocs.io/en/stable/elastic_include.html
+---
+## First-move checklist — bring up a healthy multi-node group
+Run this order **before** trusting any multi-node throughput number; it isolates MN1–MN4 cheaply (no full job needed):
+- [ ] **Every** node: `systemctl status nvidia-fabricmanager` → `active (running)`, version matches driver (MN1).
+- [ ] Identical NCCL env exported on **all** ranks: `NCCL_SOCKET_IFNAME`, `NCCL_IB_HCA` (or `NCCL_IB_DISABLE=1`), and
+  the chosen `init_process_group` timeout (MN2, MN3).
+- [ ] MTU path check between nodes: `ping -M do -s 8972 <other-node-ip>` succeeds, or both ends pinned to 1500 (MN4).
+- [ ] First real launch with `NCCL_DEBUG=INFO` + `NCCL_ASYNC_ERROR_HANDLING=1`; confirm each rank's `NET/...` line
+  names the fast data-plane NIC, not a bridge (MN2, MN3).
+- [ ] In-script load-latest-checkpoint verified to fire on restart **before** relying on `--max-restarts` /
+  elastic membership recovery (MN5, MN6; spine in `references/principles.md` #8).
+- [ ] Distributed jobs checkpoint **more** often than single-GPU — one preemption wastes N× compute; cadence per
+  `references/spot-resilience.md`.
+For fanning a *sweep* across nodes (independent cells, not one job over many nodes), that is
+`references/parallel_ablation.md` + **REQUIRED** `superpowers:dispatching-parallel-agents`, not this file.

package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md ADDED Viewed

@@ -0,0 +1,196 @@
+# Parallel Ablation Fan-out — FS-shared deployment, isolated write paths, reconciliation
+Run N ablation cells in parallel across instances/queues without corrupting shared state, then
+reconcile and re-verify every cell before any teardown. The mechanism is **one job per cell with an
+isolated write path**; the discipline is **`superpowers:dispatching-parallel-agents`'s independence
+predicate + reconciliation**. **REQUIRED:** `superpowers:dispatching-parallel-agents` and
+**REQUIRED:** `superpowers:verification-before-completion`.
+To jump: `grep -in <keyword> references/parallel_ablation.md`.
+## Table of contents
+1. The fan-out model (one job per cell)
+2. FS-shared wrapper deployment (place once, never mutate mid-run)
+3. The independence predicate (isolated write path = the analogue of a git worktree)
+4. The portable job request (describe once, run on any profile)
+5. Queue-file format + resume via `start_index`
+6. Mandatory post-fan-out reconciliation + full re-verify
+7. Gotchas
+---
+## 1. The fan-out model
+Parallelism comes from running **multiple queues on multiple instances simultaneously** — never from
+parallel jobs inside one instance (sequential per instance keeps memory predictable and prevents disk
+contention). The unit of work is the **ablation cell**: one
+`(cfg, task, epochs)` row → one `run_one` invocation → one isolated output directory.
+```
+shared FS: /path/to/shared/run_one.sh, run_queue.sh   (ONE version, all instances read it)
+instance A  tmux ──> run_queue.sh queueA.txt ──> cell a1 ──> cell a2 ──> ...
+instance B  tmux ──> run_queue.sh queueB.txt ──> cell b1 ──> cell b2 ──> ...
+instance C  tmux ──> run_queue.sh queueC.txt ──> cell c1 ──> ...
+                                       each cell writes ONLY to its own /ckpt/<name>/ + FS/<name>/
+```
+Split the N cells across queue files (one per instance) by cost, not count — route the long cells
+(detection at 50 epochs) onto faster/idle instances so the queues finish near-simultaneously.
+---
+## 2. FS-shared wrapper deployment
+Place a **single copy** of `run_one`/`run_queue` on the cross-instance shared filesystem
+(`profiles/<platform>.md` STORAGE names the mount; on AutoDL it is the FS tier, on RunPod a Network
+Volume, on a bare box a synced NFS/`rsync` target). Every instance reads the **same version** — no
+per-instance drift, no "fixed it on A but not B."
+**Recall principle #6 — never mutate inputs under a live run.** A running queue holds
+`run_queue.sh`/`run_one.sh` in memory by byte-offset; overwriting either mid-run lands bash in the
+middle of a *different* file and re-executes blocks (duplicate runs, stalled queues). Therefore:
+- **Deploy the wrapper before launching any queue.** Treat the FS copy as immutable for the fan-out's
+  lifetime. Edit only when nothing reads it (`pgrep -af run_queue.sh` empty on every instance).
+- **Appending to a queue *file* mid-flight is safe** (streaming read re-reads on each iteration);
+  editing the *script* is not. New cells → append a line, or start a fresh queue file.
+- A fix that must reach in-flight jobs → **version the filename** (`run_one.v2.sh`), drain the old
+  queues, point new queues at the new file. Never `scp` over the path a live queue is reading.
+The FS copy is also the durable safety net: `run_one`'s post-success step syncs `best.pth` +
+metrics + log to `FS/<name>/`, so a released/dead instance still leaves its cell's result on the FS.
+---
+## 3. The independence predicate
+**REQUIRED:** `superpowers:dispatching-parallel-agents` — fan out only over work whose units share no
+mutable state. Here the predicate is concrete: **each cell writes to its own output directory and
+nothing else.** The per-job output dir is the platform analogue of a **git worktree** — an isolated
+workspace where one agent's writes can never collide with another's.
+Hold the predicate by routing every per-cell write to a name-scoped path:
+| Write target | Isolation key | Set via |
+|---|---|---|
+| checkpoints | `<ckpt_root>/<name>/` | `training.checkpoint_dir` override (per `<name>`) |
+| FS final copy | `FS/final_ckpts/<name>/` | `run_one` post-success sync |
+| tracker run | `group=<task>_<cat>`, unique run name | `wandb.group` / `wandb.tags` overrides |
+| per-cell log | `<name>.log` | `run_queue` per-line logging |
+**Never fan out onto shared mutable output.** Two cells writing `latest.pth`, the same
+`checkpoint_dir`, or one tracker run id = the exact shared-state violation the predicate forbids —
+it produces silently interleaved checkpoints and unattributable metrics, which no amount of
+post-hoc reconciliation can untangle. The `<name>` derives 1:1 from the cfg, so distinct cfgs →
+distinct paths automatically; **two queue lines must never share a `<name>`.**
+What is read-shared (the immutable wrappers, the dataset, the base image) is fine — the predicate
+only forbids shared **mutable** state.
+---
+## 4. The portable job request
+Describe a sweep once so the *same* fan-out runs against any profile (the launcher resolves it
+against `profiles/<platform>.md`; the profile supplies paths/verbs, the job supplies the work — see
+`profiles/_schema.md`):
+```yaml
+resources:
+  gpu: {name: A100, count: 1, memory: 24GB+}    # a CONSTRAINT, never a platform SKU
+  disk: 100GB                                    # ckpt_size × cells_per_instance + scratch
+candidates: [autodl, china, runpod]              # ordered fallback → describe once, run anywhere
+run: "bash run_queue.sh queue.txt"               # the per-instance entry point
+```
+Per-instance disk budget = `ckpt_size × cells_in_this_queue + scratch` (principle #5). Pre-compute it
+in Phase 0; a fan-out that under-budgets disk fails the *last* cells of each queue, not the first.
+---
+## 5. Queue-file format + resume
+One ablation cell per line, whitespace-separated (`while IFS=' ' read -r cfg task epochs`):
+```
+<cfg_path> <task> [epochs]
+```
+- `cfg_path` — yaml file relative to repo root; its basename is the cell `<name>` (the isolation key).
+- `task` — reconstruction / segmentation / detection (or other supported task) — sets tracker group/tags.
+- `epochs` — optional integer; omitted → wrapper default (e.g. `20`). The optional 3rd field lets one
+  queue mix per-task budgets (detection 50, recon/seg 20).
+```
+configs/experiments/ablation/recon/baseline.yaml      reconstruction 20
+configs/experiments/ablation/det/baseline.yaml        detection      50
+configs/experiments/ablation/seg/no_aug.yaml          segmentation
+```
+**Resume via `start_index`.** A queue killed at cell k (SSH drop, preemption, OOM) resumes with
+`bash run_queue.sh queue.txt <k>` — it skips the first k lines and continues. This is the queue-level
+form of principle #8 (idempotent resume); combined with per-cell checkpoint-load-on-startup, a
+half-finished cell resumes mid-cell, not from scratch. Keep `start_index` aligned to the queue file:
+appending lines is safe, **reordering or deleting earlier lines shifts every index** — append only.
+---
+## 6. Mandatory post-fan-out reconciliation + full re-verify
+**REQUIRED:** `superpowers:dispatching-parallel-agents` (reconcile) and
+**REQUIRED:** `superpowers:verification-before-completion` (evidence before any success claim). When
+queues report done, the watcher's "done" is a **claim** (principle #3), not ground truth — a cell can
+report success on a silently-failed sync, OOM mid-write, or never have run because its instance died.
+Reconcile and re-verify **every cell before any teardown** — this is a hard gate, not a spot check:
+1. **Roster.** Enumerate the expected cell `<name>` set from all queue files (the ground-truth roster).
+2. **Reconcile.** For each `<name>`, confirm `FS/final_ckpts/<name>/` exists and holds `best.pth` +
+   metrics + log. List the delta: missing, zero-byte, or duplicate-`<name>` collisions (a predicate
+   violation that slipped through — see Gotchas).
+3. **Re-verify by load.** Run `scripts/verify_local.py` over the durable copies — *load* each
+   checkpoint and metrics file. "The file exists" / "the log said synced" is not evidence; a load
+   that succeeds is (principle #3, the `verifying-dl-experiments` boundary owns whether the *number*
+   is real — **REQUIRED:** `verifying-dl-experiments`).
+4. **Remediate, never blind-retry.** Each missing/failed cell → classify the cause, then re-launch the
+   **identical config** (principle #7) on a live instance via `start_index`, or append its line to a
+   fresh queue. Do not patch one cell's config to make it pass — that destroys comparability.
+Only after the roster is 100% reconciled AND every cell loads does the teardown Iron Law unlock
+(SKILL.md Phase 5): no `release`/`terminate`/`destroy` until results are pulled to local AND verified
+by load AND the user approves the cost-affecting action.
+---
+## 7. Gotchas
+**Two cells share a `<name>` → interleaved checkpoints, unattributable metrics.**
+Symptom: one cell's `best.pth` overwritten, a tracker run with mixed curves, reconciliation finds N-1
+output dirs for N cells. → Root cause: independence-predicate violation — two queue lines mapped to
+the same isolation key (same cfg basename / hand-set identical `checkpoint_dir`). → Fix: enforce
+distinct `<name>` per line *before* launch (the cfg→`<name>` map must be injective); on collision,
+rename one cfg and rerun both — interleaved output cannot be un-mixed after the fact.
+**Editing the FS wrapper mid-fan-out → duplicate / stalled cells across instances.**
+Symptom: cells re-run or queues hang after a "quick fix" to `run_queue.sh`/`run_one.sh` on the FS. →
+Root cause: principle #6 — live bash holds the script by byte-offset; overwriting the shared copy
+corrupts every reader at once. → Fix: treat the FS wrapper as immutable for the fan-out's lifetime;
+version the filename and repoint new queues; edit only when `pgrep -af run_queue.sh` is empty
+everywhere.
+**Queue reports "all done" but a cell never ran.**
+Symptom: roster has N cells, FS has fewer; no error in the surviving logs. → Root cause: the instance
+died (released, preempted, host fault) and its queue's "done" was never emitted — absence of failure
+is not presence of success (principle #3). → Fix: reconcile against the **roster**, not against the
+watcher's last status; re-launch missing cells with `start_index` on a live instance.
+**`start_index` resumes the wrong cell after a queue edit.**
+Symptom: resume skips or re-runs the wrong rows. → Root cause: a line was inserted/deleted/reordered,
+shifting every subsequent index. → Fix: append-only to in-flight queue files; to drop a cell, comment
+it (don't delete) so indices stay stable, or start a fresh queue file for the remainder.
+> Universal gotchas (SSH drop on `pkill`, CRLF, cgroup OOM, silent sync, inode exhaustion on
+> many-small-files eval output across a shared FS) are **not** restated here — see
+> `references/gotchas_universal.md`. Shared-FS inode pressure (principle #5) bites hardest exactly
+> during fan-out, when N cells write eval artifacts to one FS at once.

package/bundled-skills/remote-gpu-trainer/references/principles.md ADDED Viewed

@@ -0,0 +1,179 @@
+# Operating Principles — the 10 invariants, expanded
+These are the *why* behind every phase and gotcha. They hold on any **metered, isolated, rented GPU**
+— AutoDL, RunPod, vast.ai, Lambda, Paperspace, a Chinese platform, a bare SSH box, Slurm, or K8s. Only
+the concrete paths/CLI change (those live in `profiles/<platform>.md`). Internalize these; the recipes
+follow. The one-line form is in `SKILL.md`; this file carries the cross-platform nuance.
+To jump: `grep -n '^## ' references/principles.md`.
+---
+## 1. Minimize paid wall-clock
+The meter runs the *entire* time the box is up, not just while the GPU computes. Three consequences:
+smoke-test correctness **locally on CPU before renting** (principle #2); **launch detached and hand
+control back** rather than babysitting a blocking `sleep`; and **release the instant verification
+passes** (principle #9 governs the *who-decides*). Every idle paid minute — a stuck download, a forgotten
+box overnight, a human-in-the-loop pause on a live instance — is money.
+*Universal.* Even on Slurm where the "meter" is walltime/fairshare quota rather than dollars, the same
+discipline applies: don't hold an allocation idle.
+---
+## 2. Cheap checks before expensive compute
+A CPU smoke (1–2 batches, logger disabled, tiny shapes) kills import errors, config drift, tensor-shape
+and measurement-**scale** bugs for ~free, **before** they bill GPU-hours. It is *necessary, not
+sufficient* — it won't catch convergence — but it catches the dumb-and-expensive failures that otherwise
+only surface after an instance spins up.
+*Boundary:* this skill owns *when* to run the smoke (the pre-rent gate). The smoke's *content* — what to
+assert, how to shrink the problem — belongs to **`verifying-dl-experiments`**. Don't duplicate it here.
+---
+## 3. Trust artifacts you loaded, not log lines that claim success
+"synced / saved / done / 100% complete" is a **claim**, and claims lie under a silently-failed write —
+a full disk, exhausted inodes, a swallowed error, a half-uploaded blob. Confirm the file **exists and
+loads** before releasing the only copy.
+**A watcher's own state is also a claim**, not ground truth. An async condition-waiter whose job you
+superseded polls a marker that will never arrive (a zombie that loops forever). A session-scoped monitor
+dies on context reset while the job runs on. Reconcile watchers against the job's *real* process and
+artifacts (`tmux ls` / `squeue` / `pgrep`, output `mtime`, a load-test), tear a watcher down when you
+supersede its job, and match a watcher's lifetime to the wait's duration.
+> **Monitoring physics this rests on:** foreground Bash hard-caps at 600 s (a long foreground wait is
+> killed at 10 min); `run_in_background` has **no** cap and notifies on exit; a never-*exiting* watcher
+> never notifies; an unquoted `|` inside a poll regex splits into piped commands and the first reads
+> stdin → hangs forever. See `references/monitoring_patterns.md`.
+*Universal — the load-bearing spine.* It is the platform instance of
+`superpowers:verification-before-completion`'s Iron Law ("no completion claim without fresh verification
+evidence"). Shared with `verifying-dl-experiments`.
+---
+## 4. Know what survives stop vs destroy
+**The single biggest portability trap.** AutoDL persists `/root` across a power-off — so the AutoDL
+habit is "just 关机, my data's fine." That assumption is **false almost everywhere else**:
+- **RunPod** wipes the *container disk* on stop; only the *volume disk* (`/workspace`) survives a stop,
+  and only a **Network Volume** survives a terminate.
+- **vast.ai** keeps disk across a stop but **bills it forever**; a destroy loses everything.
+- **K8s** wipes the pod filesystem on every reschedule unless a PVC is mounted.
+- **Colab** loses `/content` and RAM on disconnect.
+So the principle is not a path — it's a **discipline**: for each platform, before Phase 0, read the
+profile's STORAGE survival-matrix and write your checkpoints to the mount that survives the teardown verb
+you intend to use. The data you need most often lives on the *volatile* tier by default.
+*Mixed:* the *rule* is universal; the *which-mount* value is a profile fact.
+---
+## 5. Storage fails on the dimension — and the location — you're not watching
+Disk dies on **inodes before bytes** (`df -h` shows 34% while `cp` fails "No space left" because `df -i`
+is at 100% — classic on a shared FS full of many-small-files eval output). The real space hog often
+lives where you didn't look — a **symlinked cache** (`~/.cache/huggingface` mapped onto the data disk)
+can outweigh the `runs/` you created. **Audit with `du` on the actual mount, not assumptions.** Clean by
+**value**: keep the tiny irreplaceable evidence (metric/eval JSONs), discard the large reproducible
+scratch (periodic checkpoints, unused model caches — one observed sweep left **179 GB** of superseded `latest.pt`/`epoch_*.pt` while the real evidence was **<200 MB** of JSON). Pre-compute the budget; monitor `df -i`, not just
+`df -h`.
+*Mixed:* the inode-cap *number* is a profile fact (AutoDL/China enforce ~200K; RunPod/vast/Lambda spec
+GB quotas with no documented inode cap). The "audit the real mount, clean by value" discipline is core.
+The general form of the many-small-files trap is **shard into tar** (WebDataset) — see
+`references/gotchas_universal.md` U25.
+---
+## 6. Never mutate inputs under a live run
+A running job holds its scripts **in memory by byte-offset**. tmux keeps `run_queue.sh` as-loaded; bash
+reads a script by seeking to a saved offset, so `scp`-ing a new version mid-run makes bash land in the
+middle of a *different* file and re-execute blocks (duplicate runs, stalled queues). Version filenames;
+edit only when nothing is reading them (`pgrep -af <script>` empty).
+*Universal — pure bash/tmux physics.* Identical across every SSH backend.
+---
+## 7. Design for retry — failure is probabilistic, transfers are flaky, mirrors are route-specific
+Some fraction of identical launches die (a network blip during `wandb.init`, a transient kernel fault, a
+spot preemption). Wrappers must be **idempotent and resumable**; retry the **identical config** rather
+than hand-patching one run (which destroys comparability — see `verifying-dl-experiments`).
+**Bulk transfers are the prototypical flaky step:** wrap them in `timeout`+resume retry loops — a stall
+≠ permanent failure, and resumable downloads accumulate progress across kills. An acceleration
+**mirror/proxy/cache speeds ONE route, not all** — it may cover the metadata/API path while the bulk-data
+path (a CDN/blob backend) still fails, and a *domestic* source routed through a *foreign*-acceleration
+proxy is slower. Match the route to the origin; validate a speed test on the **same route** the real
+transfer uses (a no-proxy probe of a proxied transfer measures nothing).
+*Universal.* The **spot/preemption** sub-case is profile-parameterized (central on vast/RunPod; on
+Lambda/Paperspace/China the interruption is auto-shutdown/auto-release/capacity instead) — see principle
+#8 and `references/spot-resilience.md`.
+---
+## 8. Checkpoint-to-durable + idempotent resume is the universal spine
+Detaching the job is necessary but not sufficient. The **one** mechanism that survives every failure
+mode — SSH drop, Slurm walltime kill, K8s pod reschedule, spot preemption, Colab disconnect — is:
+1. **Checkpoint full state to the platform's durable location** on a periodic timer (model + optimizer +
+   LR-scheduler + epoch/step + RNG + dataloader position), written **atomically** (`tmp`→`fsync`→
+   `os.rename`) so a mid-write kill never corrupts the latest good checkpoint.
+2. **Load-latest-on-startup unconditionally**, so the *identical launch command* resumes instead of
+   restarting. This is what makes principle #7's "retry the identical config" actually resume progress.
+The **detach primitive is the swappable plug** — tmux on a bare box, `sbatch --requeue` on Slurm, a Job
+manifest on K8s, a Save&Run commit on Kaggle, a checkpoint-to-Drive loop on Colab. Checkpoint+resume is
+the invariant underneath all of them.
+*Universal.* Cadence is a formula, not a guess — Young/Daly `W = √(2·μ·C)` (μ = mean time between
+preemptions, C = checkpoint write time); round *down* to an iteration boundary. Managed frameworks
+(SkyPilot Managed Jobs, SageMaker) move the box for you but **restart your process from scratch — your
+checkpoint-load is what restores progress.** Details + worked numbers in `references/spot-resilience.md`.
+---
+## 9. Cost and destructive actions are the user's call
+Never auto-release/terminate an instance, never delete durable/shared files without explicit
+confirmation, and if your own cleanup can't free enough space, **ask to expand the disk** (state the GB
+needed) rather than silently shrinking the experiment (fewer seeds, smaller eval, capped vis).
+This is sharpened, not softened, by going multi-platform: on RunPod/vast/Lambda the meter-stopping action
+is the **irreversible** `terminate`/`destroy` that deletes the disk — so the confirmation gate matters
+*more*. Operationalize it as the **teardown Iron Law** (SKILL.md Phase 5): no teardown before checkpoints
+are pulled to local AND verified by load AND the user approves the specific cost-affecting action.
+*Universal.* A shared FS is also multi-project: work inside your project's own folder, delete only your
+own redundancy, never a top-level dir you didn't create.
+---
+## 10. Teach the user the platform, don't just drive it
+Most users — especially on a platform they rent only occasionally — don't know its non-obvious
+**conveniences** or its **danger clocks**, and the skill's job is not just to operate the box but to *tell
+them*. On first contact with a platform, proactively surface:
+- **Conveniences they'd otherwise miss:** one-click SSH-key registration (so the agent can connect
+  non-interactively), GPU-availability notifications, the built-in panels (JupyterLab / the TensorBoard tile).
+- **Danger clocks that cost data or money:** auto-release / auto-delete timers on *stopped* instances
+  (AutoDL releases a 关机 box after **15 days** → the data disk is gone; several CN platforms in ~10), a
+  stop that keeps billing (vast.ai forever, RunPod 2×), low-balance / arrears purge.
+The per-platform list lives in each profile's **Surface to the user** block. This pairs with #9: #9 stops
+the agent from *doing* the dangerous thing; #10 makes the agent *warn the human* about the danger clock
+before it fires. The most expensive surprises on rented hardware are the silent timers (a parked box
+released, a stopped disk still billing), not the visible failures — surfacing them early is the cheapest
+insurance.

package/bundled-skills/remote-gpu-trainer/references/self-improvement.md ADDED Viewed

@@ -0,0 +1,74 @@
+# Getting better over time — capture new gotchas + personalize (without corrupting the skill)
+This skill is a static reference, so it does **not** evolve on its own. But every real run teaches
+something — a new platform quirk, a training bug not in the catalog, or the user's own setup. This file
+is the protocol for **sedimenting that knowledge in the right place, at the right bar, without silently
+rewriting the skill**. Apply it whenever a run surfaces something the catalog did not already cover.
+To jump: `grep -in '<keyword>' references/self-improvement.md`.
+## Table of contents
+1. The bar — what qualifies as a keepable gotcha
+2. Route the learning — user memory vs the catalog vs an upstream PR
+3. Propose, don't auto-edit
+4. First-run personalization — capture the user's setup into memory
+5. Freshness — platform facts rot; stamp and re-verify
+## 1. The bar — do NOT enshrine a one-off
+A surprising failure is a **hypothesis, not a gotcha** (principle #3; **REQUIRED:**
+`verifying-dl-experiments`). Sediment a NEW gotcha ONLY when all three hold:
+- **Root-caused** — the mechanism is understood, not just "it worked after I retried."
+- **Reproduced or clearly mechanistic** — not a single flaky incident (a transient network blip is not a gotcha).
+- **Generalizable** — another user on this platform / training setup would hit it too.
+If it fails the bar it is a *project note* (→ §4 memory), not a catalog entry. Enshrining unverified
+one-offs is exactly the catalog rot this bar exists to prevent.
+## 2. Route the learning
+| What was learned | Where it goes | Form |
+|---|---|---|
+| **User/personal/host-specific** — this account's quirk, the user's preference, the usual GPU plan, "on MY box X is true" | **the host's `memory/` system** (host-specific, personal, may be ephemeral) | a `reference` / `project` / `feedback` fact, one per file, deduped |
+| **A project-level fact or recurring error the user keeps hitting on THEIR project** — a config quirk, "always run X first", a path that must be Y, an env that must be activated | a **project instructions file in the user's repo** — `CLAUDE.md` / `AGENTS.md` / `.cursorrules` (persists cross-session AND cross-tool AND for collaborators, unlike host memory) | a short imperative rule; **propose, don't auto-write** (§3) |
+| **Generalizable platform gotcha** | `profiles/<platform>.md` §7 (platform-pinned) or `references/gotchas_universal.md` (cross-platform) | `symptom → root cause → fix` + a source URL |
+| **Generalizable training-debug gotcha** | `references/training/<topic>.md` | same form |
+| **A correction** (a fact here is now wrong/stale) | edit that file; re-stamp its `verified <month>` | note old → new + URL |
+Because this skill is open-source, a generalizable addition/correction is also a candidate for an
+**upstream PR** — offer to open one so every user benefits, not just this install.
+## 3. Propose, don't auto-edit
+**NEVER silently rewrite a skill file from a single run** — a wrong "fix" or a broken structure is worse
+than a missing entry. Instead:
+- Draft the entry (`symptom → root cause → fix`, with its source) and **show it to the user for approval.**
+- For an out-of-scope or larger change, spin it off (the host's task / PR mechanism) rather than bloating
+  the current run.
+- Apply only after the user okays it; then re-run the skill's own checks — cross-refs resolve, **no secret
+  value written**, structure + TOC intact.
+## 4. First-run personalization → memory
+The first time this skill runs for a user, capture their **actual setup** into memory so later runs are
+pre-parameterized instead of re-asked:
+- which platform(s) they rent, and the per-platform §8 SCRIPT OVERRIDES that worked (paths, proxy hook, cred location);
+- the project repo path / training entrypoint / config layout;
+- the tracker entity (wandb / trackio) and **where the key lives** (its env-var name or file path — never the value);
+- the usual GPU plan + disk budget.
+Store as a `project` / `reference` memory. Record only the credential's *name or path*,
+never the secret itself. Next session the profile overrides + `run_one` params come pre-filled.
+## 5. Freshness — this skill has a shelf life
+Platform prices, billing verbs, and limits **change**. Every platform fact is annotated `verified
+<month>` at authoring time. Before betting **money or data** on a teardown/billing fact — the
+irreversible ones (`terminate` / `destroy` / release) — **re-verify it against the platform's current
+docs in the same session.** If a fact is stale, fix it (the §2 correction row) and re-stamp the date. A
+quarterly re-verification of each profile's §5 TEARDOWN/BILLING section keeps the highest-stakes facts
+honest — schedulable via the host's `/schedule`. Run `scripts/check_staleness.py` to list every `verified`
+stamp older than N months (a mechanical reminder of WHAT to re-check — it does not verify the fact itself).