npm - opencode-skills-collection - Versions diffs - 3.1.2 → 3.1.4 - Mend

opencode-skills-collection 3.1.2 → 3.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (65)

package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md ADDED

@@ -0,0 +1,450 @@
+---
+platform: generic-ssh        # the DEFAULT profile; Slurm / K8s / Colab-Kaggle are thin diffs below
+kind: ssh                     # ssh | slurm | kubernetes | notebook (per sub-section)
+meter_stop_verb: manual       # nothing reclaims the box — a forgotten instance bills 24/7
+meter_stop_irreversible: true # destroying the box deletes its disk; no platform undo
+detach_primitive: tmux        # tmux/nohup (bare) | sbatch (Slurm) | k8s-job (K8s) | kaggle-commit
+spot_available: false         # bare box: none by default; Slurm scavenger + spot rentals override
+spot_grace: n/a               # bare: n/a · Slurm: SIGTERM→KillWait(default 30s)→SIGKILL · K8s: terminationGracePeriodSeconds(default 30s)
+shared_fs: host-dependent     # bare: one disk you own · Slurm: parallel /scratch · K8s: a PVC
+inode_cap: host-dependent     # measure with df -i; do NOT assume an AutoDL ~200K constant
+free_egress: host-dependent
+china_mirror_needed: host-dependent  # only if the box sits behind the GFW
+host_driver_cuda_max: host-dependent
+local_nvme: host-dependent
+---
+# Profile: generic-SSH — the DEFAULT (bare box) + Slurm / Kubernetes / Colab-Kaggle diffs
+One-line purpose: the lowest-common-denominator profile for a box where **SSH is the only control
+channel and teardown is manual** — every other platform profile is a *diff* against this baseline.
+> **Surface to the user up front (principle #10):** ⚠️ Danger clock — there is usually **no auto-release / idle timer to save you**: a forgotten box **bills 24/7** until you tear it down, and teardown is entirely manual (no platform safety net). Reality — you **expose ports yourself** (an `ssh -L` tunnel for TB/Jupyter); on Slurm a job dies at **walltime** — design the requeue.
+Read this whole file before Phase 0 on any unbranded rental, then jump to the matching sub-section
+(Slurm / Kubernetes / Colab-Kaggle) if the backend is a scheduler, a cluster, or a notebook.
+**Universal gotchas are NOT restated here** — see `references/gotchas_universal.md`.
+**Table of contents** (`grep -in '<keyword>' profiles/generic-ssh.md` to jump):
+- BASELINE: 8-field schema for the bare-SSH box (sections 1–8)
+- THIN DIFF — SLURM (sbatch replaces tmux)
+- THIN DIFF — KUBERNETES (a Job manifest replaces the shell)
+- THIN DIFF — COLAB / KAGGLE (not SSH-orchestratable)
+The one load-bearing abstraction every backend below solves differently: **detach the job from the
+connection, and make the result survive the session ending.** Checkpoint-to-durable + idempotent
+resume (principle #8) is the invariant; the detach primitive (tmux / sbatch / Job / commit) is the
+swappable plug.
+---
+## 1. LAUNCH
+- **Entry point:** `ssh user@host` — key-based, fronted by an `~/.ssh/config` alias so the rest of
+  the workflow says `ssh gpu-box`. There is **no platform API, console, or CLI** — SSH is the *only*
+  control channel (this is what makes the box "generic"). Set the alias per `references/ssh_transport.md`.
+- **Push code:** `rsync -avz --partial ./proj/ gpu-box:~/proj/` — resumable, delta-only on re-syncs;
+  prefer over `scp` (a reset `scp` restarts from zero). Pull results the same way, reversed.
+- **Download weights/datasets ON the box**, not over the local uplink: `ssh gpu-box 'cd ~/proj &&
+  hf download <repo> --local-dir data'` (or `aws s3 cp`, `wget`). The box almost always has a fatter,
+  cheaper pipe to HF/S3 than a home connection — pushing a 50 GB checkpoint over a residential uplink
+  is the classic self-inflicted stall. Transport verbs → **REQUIRED:** `huggingface-skills:hf-cli`.
+- **Env contract:** whatever the host ships. There is no prebuilt "base" guarantee — inspect
+  `which python && python -V && nvidia-smi` first. If the image ships a usable env, reuse it as you would
+  AutoDL's base env (do not `conda create` on a throwaway box); if it is bare, `conda create` / `venv` once and
+  pin it. State the seed/determinism in the run itself, since no platform does it here (**REQUIRED:**
+  `verifying-dl-experiments`).
+→ **verify:** `ssh gpu-box 'python -c "import torch;print(torch.cuda.is_available())"'` prints `True`.
+## 2. STORAGE MODEL  *(the survival matrix — principle #4)*
+The box gives **one persistent disk that is yours to manage** — no shared FS, no platform quota
+service, no automatic reclamation. *Measure, never assume:* run `df -h && df -i <mount>` live on the
+box. Caps are host-dependent — do **not** carry over an AutoDL ~200K-inode or ~200 GB constant.
+| Tier | Path | Survives STOP? | Survives DESTROY? | Cap |
+|---|---|---|---|---|
+| Root / home disk | `/` , `~` | yes (box keeps running) | **no** (destroy deletes the box) | host-dependent — `df -h`/`df -i` |
+| Attached block volume (if any) | `/path/to/mount` | yes | depends on provider — verify before destroy | host-dependent |
+The only "survival matrix" subtlety on a bare box: there is **no stop/destroy distinction the
+platform enforces** — the box runs until *manually* stopped, and a destroy wipes the disk with no
+undo. So checkpoints must land on a mount that gets `rsync`-pulled to local **before** teardown
+(§5). Disk fails on inodes before bytes and the real hog hides in a symlinked cache — audit the
+actual mount with `du`, clean by value (keep tiny eval JSONs, prune large periodic checkpoints).
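The "measure, never assume" step above can also be taken from Python, for instance inside a pre-flight check that runs before a long job. This is a minimal sketch using only the standard library (`os.statvfs`, POSIX only); the function name `disk_report` and the returned keys are illustrative and are not part of the skill's `scripts/`.

```python
import os

def disk_report(mount: str) -> dict:
    """Return byte and inode usage for the filesystem that holds `mount`."""
    st = os.statvfs(mount)
    total_bytes = st.f_blocks * st.f_frsize
    free_bytes = st.f_bavail * st.f_frsize   # space available to non-root users
    total_inodes = st.f_files                # 0 on filesystems without a fixed inode table
    free_inodes = st.f_favail

    def pct_used(total: int, free: int):
        # None means "not reported", which is different from 0% used.
        # Root-reserved blocks count as used here, so the byte figure can
        # read slightly higher than the Use% column of `df`.
        return None if total <= 0 else round(100.0 * (total - free) / total, 1)

    return {
        "mount": mount,
        "bytes_total": total_bytes,
        "bytes_used_pct": pct_used(total_bytes, free_bytes),
        "inodes_total": total_inodes,
        "inodes_used_pct": pct_used(total_inodes, free_inodes),
    }
```

Calling `disk_report("/")` on the box returns both percentages in one dict, so a launcher could refuse to start when either one is close to full. The thresholds are left to the caller because the caps are host-dependent.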
+## 3. NETWORK
+- **Egress/proxy:** host-dependent; there is no platform proxy hook. If the box sits behind the GFW,
+  set the mirror manually — `export HF_ENDPOINT=https://hf-mirror.com` (or `HF_HUB_ENABLE_HF_TRANSFER=1`
+  off-GFW) — and validate the speed test on the **same route** the real transfer uses (principle #7).
+- **Port exposure:** expose services yourself. TensorBoard / Jupyter ride an SSH tunnel from the
+  local machine: `ssh -L 6006:localhost:6006 gpu-box` then open `http://localhost:6006`. There is
+  no console port-forward button.
+- **SSH flavor:** direct-TCP key-based SSH — `scp`/`rsync` work normally (unlike the proxied SSH on
+  some rental platforms). If the provider hands out a non-standard port, pin it in the alias.
+## 4. SPOT / INTERRUPTION + RESUME  *(principle #7/#8)*
+A bare on-demand box has **no spot/preemption model by default** — it runs until manually stopped, so
+the interruption to design against is an **SSH drop**, not an eviction. Without a detach primitive an
+SSH drop sends SIGHUP and kills the job; `tmux` (§6) is what severs the job from the connection.
+Resume is **self-built**: checkpoint full state (model + optimizer + scheduler + epoch/step + RNG +
+dataloader position) atomically (`tmp`→`fsync`→`os.rename`) on a periodic timer, and load-latest
+unconditionally on startup so the *identical launch command* resumes. Cadence formula + atomic-write
+pattern → `references/spot-resilience.md`. (Spot-rented bare boxes exist — if the provider can evict,
+treat it like the vast.ai profile: tiny/zero grace, checkpoint continuously.)
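The atomic write and the load-latest-on-startup rule described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `references/spot-resilience.md`: `pickle` stands in for `torch.save`/`torch.load` so the example runs without a GPU stack, `os.replace` is used in place of `os.rename` (the same atomic rename on POSIX, and it also overwrites on Windows), and the names `save_atomic`, `load_latest`, and `train` are hypothetical.

```python
import os
import pickle
import tempfile
from typing import Optional

def save_atomic(state: dict, path: str) -> None:
    """Write `state` so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")  # same dir, same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)      # stand-in for torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())       # force the bytes to disk before the rename
        os.replace(tmp_path, path)     # atomic rename over the previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_latest(path: str, fresh_state: dict) -> dict:
    """Load the checkpoint if one exists; otherwise start from `fresh_state`."""
    if not os.path.exists(path):
        return fresh_state
    with open(path, "rb") as f:
        return pickle.load(f)          # stand-in for torch.load(f)

def train(path: str, total_steps: int, stop_after: Optional[int] = None) -> dict:
    """Toy loop: the identical call resumes wherever the last checkpoint left off."""
    # A real state also carries optimizer, scheduler, RNG and dataloader position.
    state = load_latest(path, {"step": 0, "loss_sum": 0.0})
    while state["step"] < total_steps:
        state["loss_sum"] += 1.0 / (state["step"] + 1)   # placeholder for a training step
        state["step"] += 1
        save_atomic(state, path)       # real runs checkpoint on a timer, not every step
        if stop_after is not None and state["step"] >= stop_after:
            break                      # simulate an interruption
    return state
```

Running `train(path, 10, stop_after=4)` and then `train(path, 10)` finishes at step 10 with the same accumulated value as one uninterrupted run, which is the idempotent-resume property the section asks for.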
+## 5. TEARDOWN / BILLING  *(principle #9 + the Iron Law)*
+**Teardown is MANUAL and is the number-one cost failure on this profile.** Nothing reclaims the box:
+no idle timer, no auto-release, no scheduler that ends the job. **A forgotten box bills 24/7** — an
+overnight idle instance is the most expensive single mistake on metered hardware.
+- The meter-stopping action is **provider-manual** (a console "stop"/"destroy", a `terminate` API, or
+  a phone call) — and on most bare rentals it is **irreversible** (deletes the disk).
+- "Stop after pulling results" is a **mandatory final phase**, not an afterthought. Honor the
+  **teardown Iron Law**: no stop/destroy until checkpoints are **pulled to local AND verified by
+  load** (`scripts/verify_local.py`) **AND** the user has approved the cost-affecting action.
+  "It looked done in the log" is not evidence (principle #3). **REQUIRED:**
+  `superpowers:verification-before-completion`.
+## 6. DAEMON TOOL
+- **`tmux`** is the detach primitive: `tmux new -s train` → run inside → `Ctrl-b d` to detach;
+  `tmux attach -t train` to reattach, `tmux ls` to reconcile a watcher against the real session
+  (principle #3). It survives an SSH drop; it does **not** survive a box reboot — relaunch after one.
+- **Fallback** when tmux is absent and cannot be installed: `nohup <cmd> </dev/null >log 2>&1 &` then
+  `disown`. Always redirect stdin from `/dev/null` so the job never blocks reading the terminal.
+- **No native queue** — the operator IS the scheduler, monitor, and janitor. Use the parameterized
+  `scripts/run_queue.sh.template` for a resumable serial queue; never edit a queue script while it is
+  being read (principle #6 — version the filename).
+## 7. TOP GOTCHAS  (platform-pinned; universal ones → `references/gotchas_universal.md`)
+- **GEN1 — Forgotten box bills 24/7.** Symptom: a week-old invoice for an instance that finished
+  training on day one. → Root cause: nothing on a bare box reclaims it; the human is the only janitor.
+  → Fix: make teardown a tracked Phase-5 step; after the verified pull, prompt the user to stop/destroy
+  (never auto-act — principle #9); for cross-session safety set a `/schedule` reminder to re-check.
+- **GEN2 — SSH drop kills the run (no tmux).** Symptom: training dies the moment the laptop sleeps or
+  the network blips. → Root cause: the job is a child of the SSH shell; the drop sends SIGHUP.
+  → Fix: launch inside `tmux` (or `nohup … & disown`) **before** the long run starts — not after it is
+  already orphaned.
+- **GEN3 — `scp` restarts from zero on a reset; `rsync` does not.** Symptom: a 40 GB re-sync that
+  never finishes over a flaky link. → Root cause: `scp` has no resume. → Fix: `rsync -avz --partial`
+  for every code/data/result transfer; wrap bulk pulls in a `timeout`+resume loop (principle #7).
+- **GEN4 — CRLF breaks `.sh` on the Linux box.** Symptom: `bash: $'\r': command not found`, or a
+  shebang that "isn't found." → Root cause: a script authored on Windows carries CRLF line endings.
+  → Fix: `.gitattributes` with `*.sh text eol=lf`; on-box unblock `sed -i 's/\r$//' run.sh`.
+- **GEN5 — Heavy DL static-checked on the wrong machine.** Symptom: an OOM or a CUDA mismatch only
+  reproduces on the box. → Root cause: static/import checks ran locally, the real compute is remote.
+  → Fix: run the cheap CPU smoke locally (Phase 2), run the heavy DL **on the box**; for the
+  bug-vs-effect call once it runs, defer to **REQUIRED:** `verifying-dl-experiments`.
+- **GEN6 — A box reboot silently orphans the run (`tmux` does not survive it).** Symptom: a detached
+  job vanishes with a clean `dmesg`, idle GPU, and low `uptime`; `tmux ls` shows no sessions.
+  → Root cause: `tmux`/`nohup` survive an SSH drop but **not** a host reboot — the rental rebooted (host
+  maintenance, kernel update, or an OOM that took the box) and every session died. → Fix: treat reboot
+  as one of the four "vanished process" causes (cross-link `references/gotchas_universal.md` U3); make
+  resume idempotent (§4) so the *same* launch command continues from the last checkpoint; for a box that
+  reboots often, add an `@reboot` cron or a systemd unit that re-launches the detached queue.
+- **GEN7 — A second concurrent run silently halves throughput by oversubscribing the GPU.** Symptom: two
+  training runs on the "same idle GPU" both crawl, or the second OOMs on a card that looked free.
+  → Root cause: a bare box has **no scheduler** — nothing prevents two processes sharing one GPU, so they
+  contend for VRAM and SM time. → Fix: the operator *is* the scheduler — serialize with the
+  `run_queue.sh` template, or pin each run to a distinct card with `CUDA_VISIBLE_DEVICES=<n>`; check
+  `nvidia-smi` for an existing holder before every launch (zombie holders → U11).
+- **GEN8 — Watching a poll connection, not the run, declares a false death.** Symptom: the ssh-poll
+  drops and the run is pronounced dead, but the job finished fine and wrote `best.pth`. → Root cause: a
+  dropped *poll* connection ≠ the training dying; the two failure modes are conflated. → Fix: on any poll
+  drop, re-ssh and check ground truth directly (`pgrep -af train`, log tail, `best.pth` mtime) before
+  concluding anything (principle #3); robust short-connection poll template → U17.
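The `timeout`+resume loop that GEN3 recommends for bulk pulls might look like the sketch below. It is an assumption-laden illustration, not the skill's own script: `run_with_resume` and `rsync_pull_cmd` are hypothetical names, the attempt count and timeout are arbitrary defaults, and the only `rsync` flags used are the ones this profile already names (`-avz --partial`).

```python
import subprocess
import time

def run_with_resume(cmd: list, attempts: int = 5, timeout_s: float = 600.0,
                    backoff_s: float = 5.0) -> int:
    """Re-run `cmd` until it exits 0; return the number of attempts used.

    Each attempt is bounded by `timeout_s`. A resumable command (such as
    rsync with --partial) picks up from what earlier attempts transferred.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            pass  # subprocess.run kills the child on timeout; fall through and retry
        if attempt < attempts:
            time.sleep(backoff_s)
    raise RuntimeError(f"command still failing after {attempts} attempts: {cmd}")

def rsync_pull_cmd(remote: str, local: str) -> list:
    """argv for the rsync invocation used throughout this profile."""
    return ["rsync", "-avz", "--partial", remote, local]
```

A pull would then be `run_with_resume(rsync_pull_cmd("gpu-box:~/proj/runs/", "./runs/"))`, where the remote path is only an example. Passing the command as an argv list avoids shell quoting problems with paths that contain spaces.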
+### Platform-specific debugging (bare SSH)
+The box has no console — every diagnostic is an ssh one-liner. Run these *separately* (a kill drops the
+SSH, U1/U4), and bound each with `ssh -o ConnectTimeout=15 -o ServerAliveInterval=10` so a blip
+self-kills instead of half-open hanging:
+- **Is the run alive or orphaned?** `ssh gpu-box 'tmux ls; pgrep -af <train-script> | head'` — empty
+  `tmux ls` after a vanished log ⇒ reboot/HUP (GEN6); reconcile the watcher against the real session.
+- **Why did it die (the 4-cause ladder)?** `ssh gpu-box 'dmesg 2>/dev/null | grep -iE "killed process|out of memory|Xid" | tail; uptime'` — OOM line ⇒ U9/U10; clean dmesg + low uptime ⇒ reboot (GEN6); `Xid 48/79` ⇒ dead GPU, re-rent (U22).
+- **GPU health, not just util%:** `ssh gpu-box 'nvidia-smi dmon -s pucvmet -d 1 -c 5'` — read SM clock + power, not `GPU-Util` (a liar, U21); a holder `nvidia-smi` cannot see ⇒ `fuser -v /dev/nvidia*` (U11).
+- **Disk before it bites:** `ssh gpu-box 'df -h <mount>; df -i <mount>'` — inodes hit 100% before bytes (U7); the byte-hog often hides in `~/.cache/huggingface` (`du -sh ~/.cache/huggingface/hub/models--* | sort -rh`).
+- **Stuck download?** A transfer with a live process but a flat `df` is stalled, not progressing —
+  `ssh gpu-box 'ls -la --time-style=+%H:%M data/*.tmp; df -h <mount>'`; if the size has not moved, kill and
+  resume the per-dir loop (`scripts/download_loop.sh`, U12), never restart from zero.
+## 8. SCRIPT OVERRIDES
+Values to parameterize the `scripts/` templates for a bare-SSH box:
+```
+DATA_DIR=$HOME/proj      # working dir / data disk on the box
+DURABLE_DIR=$HOME/proj   # durable mount = the measured persistent disk; pull to local before teardown
+PROXY_HOOK=              # none by default; set HF_ENDPOINT=https://hf-mirror.com only if behind the GFW
+CRED_FILE=~/.netrc       # on the box's local disk, streamed in via stdin; never onto a shared/durable FS
+SCRATCH=*.latest.pth     # plus periodic checkpoints (prune on success; keep best + tiny eval JSONs)
+HF_HOME=$HOME/proj/.hf   # redirect off the default ~/.cache so it lands on the data disk
+DETACH=tmux              # the swappable plug; replaced by sbatch / Job / commit in the diffs below
+```
+---
+# THIN DIFF — SLURM  *(sbatch replaces tmux)*
+`kind: slurm` · meter = walltime/fairshare **quota, not dollars** · detach = `sbatch` · no teardown.
+The scheduler owns the job's lifecycle: the operator **submits**, Slurm runs and detaches it.
+`tmux+nohup` is **replaced** (not supplemented) by `sbatch` — a submitted batch job survives logout
+with no tmux. A bare `srun` still **blocks and dies on terminal close** like a foreground process, so
+wrap `srun` *inside* an `sbatch` script for long runs.
+- **Submit / monitor / kill:** `sbatch job.sh` (returns a jobid immediately) · `squeue -u $USER`
+  (status — replaces "reattach tmux") · `sacct -j <jobid>` (post-mortem: exit code, maxRSS, elapsed)
+  · `scancel <jobid>` (kill). Logs go to `slurm-%j.out` (arrays: `slurm-%A_%a.out`) — file-based, same
+  logs-to-file contract as the baseline.
+- **GPUs are declarative:** `#SBATCH --gres=gpu:a100:2` (or `--gpus=volta:3`); request, do not place.
+  Slurm's GRES plugin sets `CUDA_VISIBLE_DEVICES` per step (verified slurm.schedmd.com/gres.html 2026-06).
+- **Walltime ceiling — the hard new constraint:** `#SBATCH --time=HH:MM:SS` and at the limit each task
+  is sent **SIGTERM, then SIGKILL after `KillWait` (default 30 s)** (verified slurm.schedmd.com/sbatch.html
+  + slurm.conf 2026-06). Long training MUST checkpoint and requeue, not "run until done."
+- **Preemption + checkpoint-on-signal:** on time-limit or scavenger-partition eviction the same
+  SIGTERM→KillWait→SIGKILL sequence applies. Arm `#SBATCH --signal=B:SIGTERM@360` for a ~6-minute warning
+  (the `B:` prefix signals the **batch shell**, not the steps; **Slurm may fire it up to 60 s EARLY** —
+  size the warning with that slack, verified slurm.schedmd.com/sbatch.html 2026-06), trap it to set a flag,
+  and `#SBATCH --requeue` to auto-return to the queue (the script restarts **from its beginning with the
+  same job ID**) and resume from the last checkpoint. Cadence formula → `references/spot-resilience.md`.
+- **Native orchestration replaces hand-rolled fan-out:** `--array=0-15` (rate-limit with `%4`) fans out
+  ablation cells, `--dependency=afterok:<jobid>` chains stages (runs only on exit-code-0).
+- **No per-hour teardown — watch fairshare.** Nodes are not `shutdown`; the job just ends. The
+  baseline's #1 risk (forgotten box) **disappears**, replaced by "don't blow the walltime/fairshare
+  allocation." There is nothing to stop.
+- **No root, shared multi-tenant node:** cannot `apt install`. Use `module load cuda` or a container
+  (**Apptainer/Singularity** — Docker is usually banned).
+- **Filesystem split:** the shared parallel FS (`$HOME`, `/scratch`) persists and is where checkpoints
+  go; node-local **`$TMPDIR` is wiped when the job ends** — stage scratch to `$TMPDIR`, checkpoint to
+  `/scratch`. Multi-node NCCL/fabric specifics → `references/multinode.md`.
+### Slurm gotchas (platform-pinned; universal → `references/gotchas_universal.md`)
+- **SLURM1 — Checkpoint *inside* the signal handler corrupts the checkpoint.** Symptom: `--requeue`
+  works most of the time, then intermittently writes a corrupt `hpc_ckpt` and the requeued job won't
+  load. → Root cause: a Python signal handler can fire **after any bytecode instruction** — including
+  mid-backward-pass — so checkpointing directly in the handler races with training (verified
+  github.com/Lightning-AI/pytorch-lightning#21406 2026-06). → Fix: the handler does the **minimum** —
+  set a flag; poll the flag in the training loop and checkpoint at a **safe point** (end of step), then
+  `scontrol requeue $SLURM_JOB_ID` or exit so `--requeue` returns it.
+- **SLURM2 — Warning signal arrives too late; the SIGKILL lands mid-write.** Symptom: the
+  `--signal@360` trap fires but the checkpoint is half-written when SIGKILL hits. → Root cause: the
+  warning window is too tight for the write. The signal's timing is coarse (Slurm documents that it may
+  be sent **up to 60 s early**, not at an exact offset), and at the actual wall the
+  `KillWait` grace is only ~30 s (verified slurm.schedmd.com 2026-06). → Fix: budget the warning so a
+  full checkpoint fits *before* the wall even with the 60 s jitter; checkpoint *periodically* too (never
+  rely on the one signal); make the write atomic (`tmp`→`fsync`→`rename`, U6) so a truncated file is
+  never loaded.
+- **SLURM3 — `srun` inside `sbatch` no longer inherits `--cpus-per-task` (Slurm ≥ 22.05).** Symptom: a
+  nested `srun` hangs, sees one CPU, or under-threads the dataloader. → Root cause: since 22.05 `srun`
+  stopped reading `SLURM_CPUS_PER_TASK` and must be told explicitly (verified docs.icer.msu.edu 2026-06).
+  → Fix: `srun -c $SLURM_CPUS_PER_TASK …`, or set `export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK`; pass
+  `--gpus-per-task`/`--gres` on the `srun` too — a step does not inherit the allocation's GRES by default.
+- **SLURM4 — OOM is a job STATE, not a Python traceback.** Symptom: the job dies with no error in the
+  log; `sacct` shows `State=OUT_OF_MEMORY` (or `slurmstepd: Detected 1 oom-kill event(s)`). → Root cause:
+  Slurm cgroup sets a hard memory limit at (a fraction of) the requested `--mem`; exceeding it is an
+  OOM-kill the kernel performs (verified osc.edu / icer.msu.edu 2026-06). → Fix: read `sacct -o
+  MaxRSS,ReqMem` and raise `--mem`/`--mem-per-cpu` to MaxRSS×1.2; this is the cgroup-RAM OOM of U9
+  (dataloader workers × a big tensor), distinct from VRAM OOM (U10) — **do not** shrink batch for a
+  host-RAM OOM.
+- **SLURM5 — `$TMPDIR` checkpoints evaporate when the job ends.** Symptom: a requeued/array job finds an
+  empty checkpoint dir. → Root cause: node-local `$TMPDIR` is wiped at job end; only the shared parallel
+  FS persists across a requeue or a different node. → Fix: stage *scratch* to `$TMPDIR` for speed, but
+  write **checkpoints to `/scratch/$USER`**; never point `DURABLE_DIR` at node-local storage.
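The SLURM1 fix (the handler only sets a flag; the loop checkpoints at a safe point) can be sketched in a few lines of standard-library Python. Treat it as an illustration under assumptions: `StopFlag` and `train_until_signalled` are hypothetical names, `train_step` and `save_checkpoint` are caller-supplied callables, and the sketch assumes the signal actually reaches the Python process (with `--signal=B:...` only the batch shell receives it, so the script's `trap` would have to forward it).

```python
import signal

class StopFlag:
    """Records that a stop signal arrived; does nothing else."""
    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        self.requested = True            # minimum work: no I/O, no checkpointing here

    def is_set(self) -> bool:
        return self.requested

def train_until_signalled(total_steps, train_step, save_checkpoint,
                          signum=signal.SIGTERM):
    """Run steps; on a pending stop request, checkpoint at the end of the step and return."""
    flag = StopFlag()
    previous = signal.signal(signum, flag.handler)
    try:
        for step in range(total_steps):
            train_step(step)             # forward / backward / optimizer step
            if flag.is_set():            # safe point: the step is complete
                save_checkpoint(step + 1)
                return step + 1, True    # caller exits or runs `scontrol requeue`
        return total_steps, False
    finally:
        if previous is not None:
            signal.signal(signum, previous)   # restore the handler that was installed before
```

The function returns `(steps_completed, stopped_early)`, so the caller can decide between exiting (letting `--requeue` return the job) and calling `scontrol requeue $SLURM_JOB_ID`. Periodic checkpoints are still needed alongside this, since the signal path alone is not reliable enough.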
+### Slurm debugging (squeue / sacct / cgroup triage)
+- **Still queued or running?** `squeue -u $USER -o '%i %T %r %M %l %R'` — the `%r` Reason column explains
+  a `PENDING` (e.g. `Resources`, `Priority`, `QOSMaxGPUPerUserLimit`); `%R` on a running job is the nodelist.
+- **Post-mortem (why it ended):** `sacct -j <jobid> --format=JobID,State,ExitCode,DerivedExitCode,Elapsed,MaxRSS,ReqMem,Timelimit,NodeList`
+  — `State=TIMEOUT` ⇒ walltime kill (raise `--time` or requeue); `OUT_OF_MEMORY` ⇒ SLURM4; `PREEMPTED`/`NODE_FAIL`
+  ⇒ requeue territory; `ExitCode` like `0:9` means killed by **signal 9** (SIGKILL — the KillWait expired).
+- **Live resource use:** `sstat -j <jobid>.batch --format=JobID,MaxRSS,MaxVMSize` on a *running* step
+  (sacct only finalizes at exit); cross-check against `ReqMem` to catch a creeping leak before the cgroup kills it.
+- **GPU actually allocated to the step?** inside the job: `echo $CUDA_VISIBLE_DEVICES && nvidia-smi -L`
+  — a mismatch ⇒ SLURM3 (`--gres`/`--gpus-per-task` not on the `srun`).
+- **Multi-node hang** (job RUNNING, no progress) ⇒ NCCL/fabric, not Slurm → `references/multinode.md`.
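The MaxRSS×1.2 sizing rule from SLURM4 is simple arithmetic once the `sacct` value is parsed. The sketch below assumes `MaxRSS` arrives as a number with a `K`, `M`, `G`, or `T` suffix; treating a bare number as bytes is an assumption, and `parse_kib` and `suggest_mem_mb` are hypothetical helper names.

```python
import math

_UNITS = {"K": 1, "M": 1024, "G": 1024 ** 2, "T": 1024 ** 3}   # multipliers to KiB

def parse_kib(value: str) -> float:
    """Parse a size such as '1843200K' or '1.5G' into KiB."""
    value = value.strip()
    if not value:
        raise ValueError("empty size field")
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return float(value[:-1]) * _UNITS[suffix]
    return float(value) / 1024.0     # assumption: a bare number is bytes

def suggest_mem_mb(max_rss: str, headroom: float = 1.2) -> int:
    """Return a --mem value in MB: MaxRSS times headroom, rounded up."""
    return math.ceil(parse_kib(max_rss) * headroom / 1024.0)
```

For example, a step whose `MaxRSS` reads `1048576K` (1 GiB) gives `suggest_mem_mb("1048576K") == 1229`, which would go into `#SBATCH --mem=1229M`. Use the largest `MaxRSS` among the job's steps as the input.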
+**Slurm OVERRIDES:** `DETACH=sbatch` · `DURABLE_DIR=/scratch/$USER/proj` (durable) + `DATA_DIR=$TMPDIR`
+(node-local, wiped) · `PROXY_HOOK=module load cuda` · teardown=`n/a (watch sacct + fairshare)`.
+---
+# THIN DIFF — KUBERNETES  *(a Job manifest replaces the shell)*
+`kind: kubernetes` · detach = a `Job` manifest (no shell) · persistence = a **PVC, non-optional**.
+The unit of work is a **manifest**, not a session: `kubectl apply -f job.yaml`; the control plane
+schedules a pod and a `Job` controller **retries it on failure** up to `backoffLimit` (**default 6**;
+with `restartPolicy: Never` each failure creates a *new* pod, while `OnFailure` restarts the container
+in place; verified kubernetes.io Jobs doc 2026-06). The "detach from my connection" problem vanishes
+because the pod was never attached to the operator's shell.
+- **GPUs:** `resources.limits: nvidia.com/gpu: 1`. Quirk (verified kubernetes.io scheduling-gpus 2026-06):
+  GPUs go in **`limits` only**; if `requests` is set it must **equal** `limits`, and you cannot set
+  `requests` without `limits`; GPUs are **integer, not shared or overcommitted** — one whole GPU per
+  container (absent MIG/time-slicing, which K8s does not provide out of the box). Provided by the NVIDIA
+  device-plugin DaemonSet.
+- **Code delivery is different — no `rsync` into a pod.** Code is **baked into a container image**
+  (build → push to a registry) or pulled at pod start. This is the biggest workflow shift from the
+  baseline; pin the base image by `@sha256:` digest, not `:latest` (U30).
+- **Persistence is the headline risk:** the **pod filesystem is EPHEMERAL by design.** On
+  death/restart/reschedule, anything written outside a mounted volume is **gone**. Checkpoints **must**
+  mount a **PersistentVolumeClaim** (or object storage) at `/checkpoints` — this is non-optional and is
+  the single most common way ML-on-K8s loses work.
+- **Monitor:** `kubectl get pods` · `kubectl logs -f <pod>` (replaces `tail -f`). `kubectl exec -it …
+  -- bash` is a debugging tool, not the run mechanism — an exec session is not durable.
+- **Declarative parallelism:** `Job` `parallelism`/`completions` (both default 1) for fan-out (the K8s
+  analog of Slurm arrays).
+- **Lifecycle knobs:** `activeDeadlineSeconds` is the walltime analog (terminates the Job past the
+  deadline); `ttlSecondsAfterFinished` auto-GCs a finished Job; `terminationGracePeriodSeconds` (**default
+  30 s**, verified kubernetes.io 2026-06) is the SIGTERM→SIGKILL window — the K8s analog of Slurm
+  `KillWait`, so the same checkpoint-on-SIGTERM discipline applies.
+- **Teardown is two-layered:** `kubectl delete job <name>` frees the *pod* (cheap), but the underlying
+  **node/cluster keeps costing** unless an autoscaler scales it down. **delete ≠ scale-down** — the
+  node release is the real cost lever, distinct from the baseline's single "destroy the box."
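The knobs listed above (GPU limit, PVC at `/checkpoints`, restart policy, deadline, grace period) fit into one small Job manifest. The sketch below builds it as a Python dict and serializes it to JSON, which `kubectl apply -f -` accepts as well as YAML. Everything named here is hypothetical: the function `gpu_job_manifest`, the container name `train`, the volume name `ckpt`, and the explicit `backoff_limit=3` chosen for illustration.

```python
import json

def gpu_job_manifest(name, image, command, pvc_name, gpus=1, backoff_limit=3,
                     active_deadline_s=None, grace_s=30):
    """Build a batch/v1 Job dict that mounts a PVC at /checkpoints."""
    spec = {
        "backoffLimit": backoff_limit,
        "template": {
            "spec": {
                "restartPolicy": "OnFailure",       # a Job allows only Never or OnFailure
                "terminationGracePeriodSeconds": grace_s,
                "containers": [{
                    "name": "train",
                    "image": image,                  # pin by @sha256: digest, not :latest
                    "command": command,
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},  # whole GPUs, in limits
                    "volumeMounts": [{"name": "ckpt", "mountPath": "/checkpoints"}],
                }],
                "volumes": [{"name": "ckpt",
                             "persistentVolumeClaim": {"claimName": pvc_name}}],
            }
        },
    }
    if active_deadline_s is not None:
        spec["activeDeadlineSeconds"] = active_deadline_s   # walltime analog
    return {"apiVersion": "batch/v1", "kind": "Job",
            "metadata": {"name": name}, "spec": spec}

def to_json(manifest: dict) -> str:
    """Serialize for `kubectl apply -f -`."""
    return json.dumps(manifest, indent=2)
```

Printing `to_json(gpu_job_manifest(...))` and piping it to `kubectl apply -f -` submits the Job. Generating the manifest from code keeps the PVC mount and the GPU limit from being forgotten when a run is cloned.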
+### Kubernetes gotchas (platform-pinned; universal → `references/gotchas_universal.md`)
+- **K8S1 — Pod stuck `Pending`: `Insufficient nvidia.com/gpu`.** Symptom: `kubectl get pods` shows
+  `Pending`; the events read `0/N nodes are available: N Insufficient nvidia.com/gpu`. → Root cause:
+  *usually not* missing hardware — the **device-plugin DaemonSet** isn't running, so no node advertises
+  allocatable GPUs; or a taint blocks scheduling (verified kubenatives.com + GKE troubleshooting 2026-06).
+  → Fix: `kubectl describe node <n> | grep -A4 -E 'Capacity|Allocatable'` — if `nvidia.com/gpu` is `0`,
+  the plugin is down: `kubectl get ds -n kube-system | grep nvidia` and `kubectl logs -n kube-system -l
+  k8s-app=nvidia-device-plugin`; add the matching toleration if the GPU nodes are tainted.
+- **K8S2 — `restartPolicy: Always` is rejected on a Job.** Symptom: `kubectl apply` errors that a Job's
+  pod template may only use `Never` or `OnFailure`. → Root cause: a Job is not a Deployment; only those
+  two restart policies are legal (verified kubernetes.io Jobs doc 2026-06). → Fix: use `OnFailure`
+  (restart the *container* in place — keeps `/checkpoints` warm) or `Never` (a fresh pod per attempt,
+  cleaner logs); never copy a Deployment's `Always`.
+- **K8S3 — `ImagePullBackOff` / `ErrImagePull` after a registry push.** Symptom: the pod never starts;
+  events show `Back-off pulling image`. → Root cause: a private registry without an `imagePullSecrets`,
+  a wrong tag/digest, or a too-big layer timing out the pull. → Fix: `kubectl describe pod <p>` reads the
+  exact pull error; attach `imagePullSecrets`, pin a real `@sha256:` digest (U30), and pre-warm large
+  images onto the node pool.
+- **K8S4 — `Multi-Attach error` on a rescheduled pod (RWO PVC).** Symptom: a pod stuck
+  `ContainerCreating` after a node failure: `Volume is already exclusively attached to one node`. → Root
+  cause: a **ReadWriteOnce** PVC can attach to **one node at a time**; on failover the old attachment
+  hasn't released, and two distributed-training pods on different nodes can never share an RWO volume
+  (verified discuss.kubernetes.io / bobcares.com 2026-06). → Fix: for multi-node training use
+  **ReadWriteMany** (NFS/EFS/CephFS) for the shared checkpoint dir, or pin co-dependent pods to one node
+  with affinity; on a stuck failover, force-detach via the cloud console or delete the old `VolumeAttachment`.
+- **K8S5 — Pod `Evicted` mid-training under node disk pressure.** Symptom: a long run dies with
+  `status: Evicted, reason: The node was low on resource: ephemeral-storage`. → Root cause: container
+  logs, the writable layer, and `emptyDir` count as **ephemeral storage**; checkpoints/caches written
+  outside the PVC fill the node and the kubelet evicts the pod (verified jorijn.com / oneuptime.com
+  2026-06). → Fix: write **everything large to the PVC**, set `resources.limits.ephemeral-storage`,
+  rotate logs, and back `emptyDir` scratch with `sizeLimit`; this is the K8s face of the disk-full crash
+  (U6/U7).
+- **K8S6 — Container runs but trains on CPU (GPU never attached).** Symptom: a pod runs to completion,
+  loss curves normal, ~100× too slow. → Root cause: the GPU limit was omitted, or `nvidia-smi` works on
+  the *node* but the container lacks the runtime/library path. → Fix: **validate `kubectl exec <p> --
+  nvidia-smi` before trusting a run**; ensure `resources.limits.nvidia.com/gpu` is set and the NVIDIA
+  container runtime is the default (this is U31 surfaced through K8s).
+### Kubernetes debugging (kubectl triage)
+- **Why is it Pending / not starting?** `kubectl describe pod <p>` — the **Events** section names it
+  directly (Insufficient GPU ⇒ K8S1; FailedScheduling taint; ImagePullBackOff ⇒ K8S3; FailedMount ⇒ K8S4).
+- **Why did it die?** `kubectl get pod <p> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'`
+  — `reason: OOMKilled` ⇒ raise `resources.limits.memory` (cgroup-RAM, U9); `Error` + exit code ⇒ read logs.
+- **Logs of a crashed/previous attempt:** `kubectl logs <p> --previous` (the current pod may be a fresh
+  retry with an empty log); `kubectl get events --sort-by=.lastTimestamp` for the cluster-wide timeline.
+- **Did the node even offer GPUs?** `kubectl describe node <n> | grep -A4 Allocatable` — `nvidia.com/gpu: 0`
+  ⇒ device plugin down (K8S1).
+- **Is the PVC bound and mounted?** `kubectl get pvc` (`Bound`?) and `kubectl describe pod <p>` Volumes
+  section — an unbound PVC stalls the pod in `Pending`.
+**K8s OVERRIDES:** `DETACH=k8s-job` · `DURABLE_DIR=/checkpoints` (PVC mount — required; RWX for multi-node)
+· `CRED_FILE=""` — credentials arrive as a K8s Secret mounted as an env var (WANDB_API_KEY / HF_TOKEN),
+never a file on disk and never baked into the image layer, so run_one's `[ -n "$CRED_FILE" ]` guard skips
+the file read and the env var passes through · teardown=`kubectl delete` **+** scale the node pool down.
+---
+# THIN DIFF — COLAB / KAGGLE  *(not SSH-orchestratable)*
+`kind: notebook` · **no SSH, no tmux, no persistent disk, no real job abstraction.** The generic
+core's central primitive ("detach + survive the session") cannot be satisfied directly — degrade to
+**checkpoint-to-cloud + idempotent resume**. Teardown is automatic and free; the *opposite* problem to
+the baseline — the work cannot be kept alive long enough.
+**Colab (free tier):**
+- **Idle timeout ~90 min** (no cell activity) and a hard **~12 h max VM lifetime**; on disconnect all
+  RAM, variables, models, and the local `/content` filesystem are **lost**. Limits are **dynamic and
+  unpublished** — GPU type/availability and the exact ceilings "vary over time" and GPU is best-effort,
+  can be denied or downgraded (verified research.google.com/colaboratory/faq.html 2026-06).
+- **Free tier requires the browser tab to STAY OPEN** — *(verified — corrects the draft's "anti-idle
+  tricks are unreliable" framing)*: **background execution is a Pro+ paid feature**; on free tier closing
+  the tab stops the runtime shortly after (verified github.com/googlecolab/colabtools#4151 + community
+  reports 2026-06). So keep-alive hacks aren't merely *unreliable* — there is **no supported headless
+  background run at all** on free Colab. Design for the disconnect, do not fight it.
+- **Only survival mechanism:** mount Google Drive and **checkpoint every epoch to Drive**; make the
+  entrypoint **resume-from-Drive idempotent** so the inevitable reconnect continues, not restarts.
+**Kaggle (free tier) — slightly better, because of one real primitive:**
+- **30 GPU-hours/week** floating quota (T4×2 or P100; resets weekly); **interactive idle timeout ~60 min**
+  and a **~9 h** session cap (verified kaggle.com/docs/efficient-gpu-usage + product-feedback 2026-06).
+- **The one genuine headless-background primitive: "Save Version → Save & Run All (commit)."** It
+  snapshots the notebook and runs it **on a separate machine with no idle timer, surviving browser
+  close**, and **persists `/kaggle/working` (20 GB) as the committed version's output** (commit times out
+  at ~9 h GPU / ~12 h CPU). This is the closest thing to `sbatch` in the free-tier world — single it out
+  as Kaggle's detach primitive. Live monitoring is weak (Colab: watch the cell; Kaggle commit: inspect
+  only the finished version's logs).
+- **Code delivery:** clone from GitHub or pull the platform's dataset mounts — no scp.
+### Colab / Kaggle gotchas (platform-pinned; universal → `references/gotchas_universal.md`)
+- **NB1 — Drive sync lag silently loses the "saved" checkpoint.** Symptom: training logs
+  `saved best.pth to /content/drive/...`, the runtime disconnects an hour later, and the file is **0 bytes
+  or absent** in Drive. → Root cause: writes to mounted Drive are **buffered and sync asynchronously** —
+  large files can take up to ~30 min to actually land, and an unmount/disconnect before the flush loses
+  them (verified github.com/googlecolab/colabtools#2607 + #4426 2026-06). → Fix: call
+  `drive.flush_and_unmount()` right after each checkpoint (it unmounts Drive, so re-mount with
+  `drive.mount('/content/drive')` before the next write), or at minimum `os.fsync` the open file; keep
+  checkpoints small, and treat a checkpoint as durable **only after** it is visible in Drive: re-list it
+  before trusting resume.
+- **NB2 — Kaggle commit fails if any cell errors → the whole output is lost.** Symptom: "Save & Run All"
+  shows `committing…` forever or fails with a non-zero/`Code 0` error, and **nothing** in `/kaggle/working`
+  is saved. → Root cause: a commit re-runs the notebook **top-to-bottom on a fresh machine**; one failing
+  cell (or an interactive-only state, or a flaky cell) aborts the commit and discards its output (verified
+  kaggle.com/product-feedback/334753 + 59557 2026-06). → Fix: before committing, **Run All interactively**
+  end-to-end on a clean kernel (catch order/state bugs); guard long sections so a late failure still writes
+  partial results to `/kaggle/working`; rely on `/kaggle/working` (persisted), not in-memory variables.
+- **NB3 — Kaggle batch (commit) run picks the WRONG accelerator / has no internet.** Symptom: a committed
+  run is glacial (ran on CPU) or fails to `pip install`/download. → Root cause: the **accelerator and
+  internet toggle are notebook settings the commit inherits** — a notebook left on "None"/internet-off
+  commits that way; internet also requires phone verification on the account. → Fix: set Accelerator =
+  GPU and Internet = On in the notebook *before* committing; verify with `torch.cuda.is_available()` in an
+  early cell so a CPU commit fails fast instead of wasting the 9 h.
+- **NB4 — `/content` (Colab) and `/kaggle/temp` are scratch, not durable.** Symptom: results written to
+  `/content/...` or `/kaggle/temp` vanish on disconnect. → Root cause: only Drive (Colab) and
+  `/kaggle/working` (Kaggle committed output) survive the session; everything else is ephemeral. → Fix:
+  point `DURABLE_DIR` at the surviving path; never let the final artifact land only on scratch.
+- **NB5 — Free Colab disconnect mid-epoch with no warning.** Symptom: the session simply dies; there is
+  **no SIGTERM, no grace window** to catch. → Root cause: unlike Slurm/K8s, a notebook eviction gives no
+  signal — the resume contract is the *only* defense. → Fix: checkpoint every N steps to Drive
+  (NB1-safe), make cell-1 resume-from-latest idempotent, and chain runs across sessions under the
+  per-session ceiling. There is no checkpoint-on-signal here (contrast Slurm `--signal` / K8s SIGTERM).
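For NB1, the "re-list it before trusting resume" step might look like the sketch below. `write_and_verify` and `looks_durable` are hypothetical helper names. The check is necessary rather than sufficient: a size read through the mount can reflect the local cache rather than the remote copy, so on Colab it is most meaningful after a flush and a fresh mount. Those Colab-only calls are left out so the example runs anywhere.

```python
import os
import time
from pathlib import Path


def looks_durable(path, expected_size, retries=5, delay_s=2.0):
    """Re-list the file until it exists with the expected non-zero size."""
    path = Path(path)
    for attempt in range(retries):
        try:
            size = path.stat().st_size
        except FileNotFoundError:
            size = -1  # absent so far; may still be syncing
        if expected_size > 0 and size == expected_size:
            return True
        if attempt < retries - 1:
            time.sleep(delay_s)
    return False


def write_and_verify(path, payload, retries=5, delay_s=2.0):
    """Write bytes, fsync them, then confirm the file on re-listing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return looks_durable(path, len(payload), retries, delay_s)
```

A `False` return is the signal to keep the previous checkpoint as the resume target rather than logging the new one as saved.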
+### Colab / Kaggle debugging (session-death triage)
+- **What am I actually on?** First cell: `import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")`
+  and `!nvidia-smi`. The guard avoids an exception from `get_device_name(0)` on a CPU-only runtime; together they catch a CPU-only Colab assignment or a CPU Kaggle commit (NB3) before wasting the session.
+- **Is the checkpoint really in Drive?** `!ls -la /content/drive/MyDrive/proj/*.pth` *after* a
+  `drive.flush_and_unmount()` and a fresh `drive.mount('/content/drive')` (the path is not available
+  while Drive is unmounted). A 0-byte or missing file ⇒ sync lag (NB1); do not tear down trusting it.
+- **Did the Kaggle commit succeed?** Open the Version's **Logs** tab (the only post-mortem for a committed
+  run) — a failed cell shows there; the committed `/kaggle/working` is the artifact, not the editor state.
+- **Disk full inside the notebook?** `!df -h`. `/kaggle/working` caps at 20 GB; HF cache and intermediate
+  files exhaust it fast (U6/U7), so prune before the commit's final write.
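The disk check in the last item can also be done from inside the notebook in Python. The sketch below sums a directory tree and compares it with a cap before a large final write. The helper names are hypothetical, and the default cap of `20 * 10**9` bytes is an assumption about how the 20 GB limit is counted; pass the real figure if it differs.

```python
import os


def _file_sizes(root):
    """Yield (size_bytes, path) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):  # skip symlinks, e.g. cache links
                yield os.path.getsize(full), full


def dir_size_bytes(root):
    """Total bytes currently stored under root."""
    return sum(size for size, _path in _file_sizes(root))


def fits_under_cap(root, incoming_bytes, cap_bytes=20 * 10**9):
    """True if writing incoming_bytes more keeps root within the cap."""
    return dir_size_bytes(root) + incoming_bytes <= cap_bytes


def largest_files(root, top_n=5):
    """Biggest files first: the usual candidates for pruning."""
    return sorted(_file_sizes(root), reverse=True)[:top_n]
```

Calling `fits_under_cap("/kaggle/working", expected_artifact_bytes)` before the final write turns a silent out-of-space failure into an early, explicit one, and `largest_files` points at what to prune.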
+**Colab/Kaggle OVERRIDES:** `DETACH=`Drive-checkpoint loop (Colab) / Save & Run All commit (Kaggle) ·
+`DURABLE_DIR=`Drive `/content/drive/MyDrive/proj` (Colab) / `/kaggle/working` (Kaggle) · teardown=`automatic`
+· the pattern, every run: checkpoint every N steps → idempotent resume from cell 1 → keep each run
+under the per-session ceiling → chain runs across sessions.
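The "keep each run under the per-session ceiling, then chain" part of the pattern can be sketched as a time-budgeted loop. `run_with_budget` and its parameters are hypothetical; `budget_s` would be set from the platform's session limit, and `margin_s` is an assumed safety margin reserved for the final save. The clock is injectable only so the behaviour can be tested without waiting.

```python
import time


def run_with_budget(step_fn, save_fn, start_step, total_steps, budget_s,
                    ckpt_every=100, margin_s=600.0, clock=time.monotonic):
    """Run steps until done or the time budget is nearly spent.

    Returns the next step to run; equals total_steps when training finished.
    """
    deadline = clock() + budget_s - margin_s
    step = start_step
    last_saved = start_step
    while step < total_steps and clock() < deadline:
        step_fn(step)
        step += 1
        if step % ckpt_every == 0:
            save_fn(step)  # periodic checkpoint every ckpt_every steps
            last_saved = step
    if step != last_saved:
        save_fn(step)  # final checkpoint so the next session resumes here
    return step
```

The next session passes the returned value (recovered from the latest checkpoint) as `start_step`, so a long job becomes a chain of short runs that each end voluntarily instead of being cut off.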