npm - opencode-skills-collection - Versions diffs - 3.1.2 → 3.1.4 - Mend

opencode-skills-collection 3.1.2 → 3.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (65) hide show

package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md ADDED Viewed

@@ -0,0 +1,704 @@
+# Universal & mixed gotcha catalog — every metered remote-GPU rental
+The cross-platform gotchas: they bite on **any** metered, isolated, rented GPU — only the concrete
+path/proxy/billing-verb changes (those live in `profiles/<platform>.md`). Each entry is
+**Symptom → Root cause → Fix**. "Mixed" entries are universal in symptom but carry a *platform-specific
+value* in the fix — the rule stays here, the value lives in a profile. Platform-only gotchas (AutoDL's
+TB-pin, the wandb-key classifier, the network_turbo proxy literal) do NOT live here — see each profile's
+TOP GOTCHAS section.
+To jump: `grep -in '<keyword>' references/gotchas_universal.md` (e.g. `inode`, `egress`, `xid`, `crlf`,
+`stdin`, `zombie`). Numbering `U1…` is stable; cross-platform additions continue the same series.
+## Table of contents (by theme)
+- **Process & SSH** — U1 SSH-dies-on-kill · U2 tmux-holds-script-in-memory · U3 vanished-process-4-causes · U4 kill-drops-SSH-before-relaunch · U5 hook-safe-launch
+- **Disk & Storage** — U6 disk-full-crashes-torch.save · U7 storage-fails-on-inodes · U8 stage-hot-data-to-NVMe
+- **Memory & OOM** — U9 cgroup-OOM-num_workers×tensor · U10 VRAM-OOM-vs-cgroup-OOM · U11 zombie-VRAM-nvidia-smi-cant-see · U41 host-metrics-lie/oom_kill-counter
+- **Transfer & Download** — U12 scp-resets→resumable-loop · U13 scp-into-uncreated-dir · U14 egress-surcharge+same-AZ · U15 compress-before-the-wire
+- **Monitoring** — U16 stale-waiters/zombie-monitors · U17 unquoted-pipe-grep-hang+robust-poll · U18 two-leg-remote-self-completion · U19 tracker-deletion-lags · U20 hosted-tracker-survives-teardown · U39 live-panel/TB-silently-empty (path/port/process mismatch) · U43 block-buffered-stdout-looks-frozen
+- **GPU health** — U21 nvidia-smi-util%-is-a-liar · U22 Xid-48/79-dead-GPU-re-rent · U23 thermal/power-throttle-steals-25-40%
+- **Dataloader & IO** — U24 dataloader-starvation-knobs · U25 many-small-files→shard-into-tar · U40 intra-op-thread-oversubscription-starves-GPU
+- **Env & Container** — U26 CRLF-breaks-sh · U27 overlay-config-files · U28 CUDA-toolkit-vs-driver-vs-torch · U29 install-from-lockfile · U30 pin-image-by-sha256 · U31 container-runs-but-no-GPU · U42 box-code-drift/verify-deploy
+- **Cost & teardown** — U32 task-epoch-default · U33 silent-checked-sync
+- **Secrets & trackers** — U34 secrets-via-stdin · U35 tracker-offline-without-key
+- **Delegated (cross-link only)** — U36 cuDNN-nondeterminism · U37 matplotlib-2^16 · U38 GPU-0%-util-data-bound
+- **Pointers** — spot/preemption → `references/spot-resilience.md`; multi-node/NCCL → `references/multinode.md`
+---
+## Process & SSH
+### U1 — SSH disconnects on `pkill -9` (exit 255, "Connection reset")
+**Symptom**: `ssh <host> 'pkill -9 -f train'` returns `Connection reset by peer`, exit 255.
+**Root cause**: killing the python tree tears down the PTY chain; the SSH client gets EOF and exits. The
+remote command may have run fine.
+**Fix**: this is **normal, not an error** — re-ssh and verify state, do not panic-retry.
+```bash
+ssh <host> "tmux kill-session -t qN 2>/dev/null; sleep 3; pkill -9 -f 'src.train'"  # SSH exits 255 here
+ssh <host> "pgrep -af 'src.train' | head -1 || echo CLEAN"                            # separate call verifies
+```
+### U2 — tmux holds the script in memory; editing it mid-run re-executes blocks
+**Symptom**: a queue/launcher script is updated mid-run, but the running job still uses the old logic; or
+an ablation completes cleanly yet **restarts from epoch 1** with a second tracker run and the queue never
+advances.
+**Root cause**: bash reads a script **by byte-offset on demand**. tmux keeps the launched script as-loaded;
+`scp`-ing a new version mid-run makes bash seek to its saved offset in a *now-different* file, land
+mid-command, and re-execute a block (duplicate runs, stalled queue). A child invocation (`bash run_one.sh`)
+IS re-read fresh for the *next* item — but only if none is parked mid-script. (principle #6.)
+**Fix**: **never overwrite a script any process is executing** — check `pgrep -af <script>` first; version
+the filename for hot changes (`run_one_v2.sh`), point only *new* launches at it. Appending lines to a queue
+file is safe (`while read < file` sees appended bytes); changing structure is not. To hot-swap, kill +
+restart the detach session so fresh bash reads from the top. Recovery: kill the session, copy the finished
+`best.pth` to durable storage, restart `run_queue.sh queue.txt <start_index>` to skip done items, delete any
+duplicate tracker run (cross-link verifying-dl-experiments **REQUIRED**).
+**Related detach trap — a non-exported var doesn't cross into the detach primitive.** A `VAR=x` set in
+your shell before `tmux new-session` / `nohup` is **not** in the detached job's environment unless
+**exported** (or inlined in the launched command) — the job sees it empty, and a launcher/monitor that
+interpolates it silently misdirects (writes output to the wrong path, mis-reports "died"). `export VAR`
+before launch, or inline it: `tmux new-session -d "VAR=$VAR bash run.sh"`.
+### U3 — A vanished remote process ≠ OOM: enumerate the 4 causes
+**Symptom**: a detached run's log stops right after `Starting training` with no epoch output and no
+traceback; `pgrep` shows it gone. The reflex is "OOM-killed."
+**Root cause is one of four** — OOM is only one:
+1. **Machine restart / reboot** — `dmesg` is *clean*, GPU idle, cgroup roomy, `uptime` low. Most-missed: nothing in the log hints at it.
+2. **OOM-kill (`-9`)** — `dmesg | grep -i 'killed process'` shows it, memory was tight (U9).
+3. **SSH HUP** — a foreground (non-`nohup`/`tmux`/`setsid`) launch dies when its parent SSH drops.
+4. **Manual kill** — an earlier `pkill` matched more than intended.
+**Fix — diagnose cheap → conclusive before "fixing"**:
+```bash
+dmesg 2>/dev/null | grep -iE 'killed process|out of memory' | tail   # OOM? empty = not OOM
+nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader  # idle now = died, not hung
+cat /sys/fs/cgroup/memory.max | numfmt --to=iec                       # roomy = OOM unlikely
+uptime                                                                # low = recent reboot (cause 1)
+```
+Clean `dmesg` + idle GPU + roomy cgroup + low `uptime` ⇒ **reboot, not OOM**. Do NOT shrink batch size to
+"fix" a phantom OOM — that masks the one variable under test. **Separate trap**: a dropped poll connection
+≠ the training dying — re-ssh and check the process/artifact directly (`pgrep -af train`, log tail,
+`best.pth` mtime) before concluding the run died (principle #3).
+### U4 — `kill` drops the SSH before a relaunch in the SAME command runs
+**Symptom**: `ssh <host> 'pkill -f X; relaunch X'` kills X but X is **not** relaunched; ssh returns 255.
+**Root cause**: killing a session-tied process drops the SSH (U1, normal) at the kill, so everything after
+it in that one command never executes.
+**Fix**: split — kill in one ssh call, relaunch (with NO kill) in the next. To stop a kill/poll pattern
+from matching the matcher's own command line, split the literal: `A=base; B=lines.; pgrep -f "${A}${B}"`
+(the contiguous string `baselines.` never appears in the cmdline running `pgrep`).
+### U5 — Hook-safe remote launch: keep env activation VISIBLE in the launch command
+**Symptom**: an env-guard hook (e.g. "no DL in conda base") blocks or asks on
+`ssh <host> 'nohup bash /root/job.sh ...'` even though `job.sh` activates the right env internally; it also
+misfires on heredocs that inline `python -m <pkg>.train`.
+**Root cause**: the hook scans the **command string** — it cannot see inside an scp'd script, and a bare
+`bash job.sh` launch has no visible `conda activate <env>`, so the guard assumes base.
+**Fix**: write the heavy script via Write/`scp` (so `python -m ...train` lives in the file, not the command)
+and put a VISIBLE activation in the launch ssh command:
+`ssh <host> 'source /path/to/conda.sh; conda activate <env>; nohup bash /root/job.sh ...'` — the script
+re-activating is harmless. Never `--no-verify` / never bypass the guard. (On a single-tenant rental whose
+base IS the env, the right move is to exempt remote/ephemeral base, not to clone it — that's a profile fact.)
+---
+## Disk & Storage
+### U6 — Disk-full crashes `torch.save` with `iostream error`
+**Symptom**: mid-training exit=1; log shows `RuntimeError: basic_ios::clear: iostream error` and
+`unexpected pos N vs M` from inside `torch.serialization`; a leftover `latest.pth.tmp` sits in the
+checkpoint dir; `df` shows the data mount at 100%.
+**Root cause**: `torch.save` writes atomically (write `.tmp` → rename); the `.tmp` write hits disk-full and
+errors. Any quota'd/cgroup disk on any rental does this.
+**Fix — prevent**: pre-budget `ckpt_size × N_runs + worst_case_latest + tracker_local_cache`; if it exceeds
+the mount, schedule mid-run aggregation to durable storage + delete completed-and-aggregated dirs; in
+`run_one.sh`, on success prune the rolling `latest.pth` and keep only `best.pth` (cross-link
+verifying-dl-experiments **REQUIRED** for the keepable-checkpoint policy). **Recover**: delete the
+`*.tmp`/`latest.pth` to free several GB — `best.pth` survives, the queue can resume.
+### U7 — Storage fails on the dimension (and location) not being watched
+**Symptom**: `cp`/`mkdir` fails `No space left on device`, yet `df -h` shows ~34% used — because `df -i`
+reads `100%` (inodes exhausted). Or the data mount fills despite `runs/` looking small.
+**Root cause**: disk dies on **inodes before bytes** — the classic trigger is **per-sample eval output**,
+which writes on the order of `files_per_sample × N_samples × N_conditions` tiny files. And the real
+byte-hog often hides where nobody looks: a **symlinked cache** (`~/.cache/huggingface` mapped onto the data
+disk) can outweigh everything the run created.
+**Fix**: monitor `df -i`, not just `df -h`, in Phase 0 and every space check. **Audit the real mount with
+`du`, not assumptions** (`du -sh ~/.cache/huggingface/hub/models--* | sort -rh`). Clean by **value** — keep
+the tiny irreplaceable evidence (metric/eval JSONs), drop the large reproducible scratch (periodic
+checkpoints, unused caches). Cap per-sample eval visualization (cross-link verifying-dl-experiments
+**REQUIRED** for the sizing policy). The *inode-cap number* is a profile fact (some platforms enforce a hard
+~200K cap; GB-quota'd platforms have none); the many-small-files general form is **shard into tar** (U25).
+Get explicit user confirmation naming `rm -rf` targets; offer "clean vs expand the disk" (principle #9).
+### U8 — Stage hot data to local NVMe before training
+**Symptom**: training is I/O-bound reading from a network/shared/HDD-backed volume; GPU starves between
+batches.
+**Root cause**: a remote/networked filesystem (or a spinning data disk) has far lower random-read
+throughput than instance-local NVMe — HDD-vs-NVMe gaps reach ~35×.
+**Fix**: at job start, copy the working dataset from the durable/shared tier to instance-local NVMe scratch,
+train against the local copy, write checkpoints back to durable storage. The local-NVMe path is a profile
+fact (`local_nvme` in the frontmatter); the stage-then-train discipline is universal. Pairs with U24/U25.
+---
+## Memory & OOM
+### U9 — `num_workers` × a big in-RAM tensor → cgroup OOM-kill (bare "Killed", exit 137)
+**Symptom**: training dies early with a bare `Killed` / `killed by signal: Killed (-9)` and **no Python
+traceback**; lowering `num_workers` makes it vanish.
+**Root cause**: each DataLoader worker is a `fork` that gets its **own copy** of any large object the
+dataset holds (a 16384² float32 matrix ≈ 1 GB). `num_workers=W` ⇒ ~`(W+1)×` that footprint, which blows the
+instance's cgroup `memory.max` even though a bare-process run fits. The kernel OOM-kills with no
+Python-level error, so it reads as a mysterious crash.
+**Fix**: size `num_workers` against `memory.max` and the per-worker resident set, **not** CPU count. Share
+one copy across workers (memmap / module-level singleton built once) or generate the object on the fly.
+Shrinking the problem also fixes it — a smaller matrix dim shrinks footprint *quadratically* (dim 1024 ≈
+4 MB, 256× less than 16384). Confirm it's OOM: `dmesg | tail` shows `Out of memory: Killed process`, and the
+same config survives `num_workers=0`.
+### U10 — VRAM OOM (a big model or a concurrent job) is distinct from cgroup-RAM OOM (U9)
+**Symptom**: `torch.OutOfMemoryError: CUDA out of memory` when launching a second train/eval while another
+runs, or a big model (deep transformer / unrolled net at high res) OOMs alone.
+**Root cause**: **VRAM** — the sum of concurrent jobs' allocations plus fragmentation exceeds the card. NOT
+host-RAM (U9).
+**Fix**: check free VRAM first (`nvidia-smi --query-gpu=memory.free --format=csv,noheader`); size the batch
+to fit *alongside* any concurrent job; set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to cut
+fragmentation. (Run heavy DL on the box; do static/shape checks locally — cross-link
+verifying-dl-experiments **REQUIRED** for local-OOM rationale.)
+### U11 — A zombie holds VRAM `nvidia-smi` cannot see → OOM on an "empty" GPU
+**Symptom**: `nvidia-smi` lists no process and shows free memory, yet a fresh job OOMs immediately; common
+after a crashed DDP run or a killed container.
+**Root cause**: a defunct/orphaned process (or a dead container's namespace) still holds CUDA context and
+VRAM, but `nvidia-smi`'s process table can't attribute it — so the GPU *looks* empty while memory is locked.
+**Fix**: enumerate the real holders via the device nodes and reap them:
+```bash
+fuser -v /dev/nvidia* 2>/dev/null   # or: lsof /dev/nvidia*  → kill -9 the listed PIDs
+```
+If containerized, restart the container. Ship a small reaper that flags any PID with persistent VRAM + ~0%
+util beyond a timeout — cross-link `scripts/reap_vram_zombies.sh`.
+### U41 — On a shared box, `uptime`/`free` describe the whole physical host, not your container — use cgroup-scoped readings + the `oom_kill` counter
+**Symptom**: a detached run looks "dead" or "the host is overloaded" — `uptime` shows load average 400+,
+`top`/`free -m` look maxed — so you suspect contention or an OOM-kill. But the job's own checkpoint `mtime`
+keeps advancing and its log still grows.
+**Root cause**: on a multi-tenant rental, host tools (`uptime`, `top`, `free -m`, `vmstat`) report the
+**physical node you share with other tenants**, not your cgroup. A neighbor's job spikes the host load
+average to ~490 while your container sits near-idle (your processes in `R`/`S`, none stuck in
+uninterruptible `D`). Reading host load as your own → a false "overloaded / OOM-killed" verdict and a
+needless kill-and-restart of a healthy run.
+**Fix**: judge YOUR container from cgroup-scoped readings, not host tools:
+- memory — `/sys/fs/cgroup/memory.current` vs `memory.max` (not `free -m`);
+- were YOU OOM-killed — the **`oom_kill` counter** in `/sys/fs/cgroup/memory.events`
+  (`grep oom_kill /sys/fs/cgroup/memory.events`); a non-incrementing counter means you were **not**
+  OOM-killed, however red host `free` looks;
+- CPU pressure — `/sys/fs/cgroup/cpu.stat` / `cpu.pressure`.
+A high host load with your cgroup roomy and `oom_kill 0` is a **noisy neighbor**, not your bug — don't
+shrink your batch or blame your code (a neighbor genuinely starving you on the shared card is U21/U23
+throttle territory or a re-rent, not a code fix). Sharpens the **U3** vanished-process ladder: the
+authoritative OOM check is the cgroup `oom_kill` counter, not host `dmesg`/`free` noise.
+---
+## Transfer & Download
+### U12 — `scp -r` of a large dir resets mid-transfer → per-dir resumable loop
+**Symptom**: 30–60 min into `scp -r host:...130GB ./`, the connection drops
+(`Read from remote host ... reset by peer`); local has a few dirs, the rest gone. scp does not resume.
+**Root cause**: a single SSH connection carries the whole transfer; any network blip kills all of it.
+**Fix**: loop **per-dir**, each its own SSH session — one failure doesn't lose the others, and re-running
+skips completed dirs. Prefer `rsync -avz --partial --append-verify` (resumes a half-file). Wrap bulk pulls
+in a `timeout … && break` retry loop: a stall ≠ permanent failure, and resumable transfers accumulate
+progress across kills. Validate any speed test on the **same route** the real transfer uses (principle #7).
+See `scripts/download_loop.sh` for the per-dir pattern.
+### U13 — `scp` into a remote dir a sibling command was supposed to create (race)
+**Symptom**: a background `scp big.tar host:/root/x/` fails instantly with `dest open "/root/x/": Failure`
+— the foreground command that would have `mkdir`-ed `/root/x` ran later, or was blocked/cancelled.
+**Root cause**: ordering assumption between parallel/sibling commands; the destination dir didn't exist yet.
+**Fix**: make every transfer self-sufficient inside its own retry loop:
+`ssh host 'mkdir -p /root/x' && scp … || retry`. Never assume a sibling created the destination.
+### U14 — Egress is a silent ~20% surcharge; co-locate and stay same-AZ
+**Symptom**: the monthly bill is ~20% over the rented GPU-hours; a large model/dataset re-pulled daily from
+a hyperscaler bucket dominates cost (a 140 GB model pulled daily from S3 ≈ $378/mo in egress alone).
+**Root cause**: hyperscaler **egress** is metered (AWS ~$0.09/GB, GCP ~$0.08, Azure ~$0.087) while most
+GPU-clouds (Lambda/RunPod/vast/CoreWeave) charge $0. Worse, **cross-AZ traffic bills ~$0.01/GB each
+direction even inside one provider** — storage in a different zone than compute quietly meters every read.
+**Fix**: co-locate storage with compute on the **same provider AND same AZ/region**. Pull a dataset once to
+durable local storage, not per-epoch from a remote bucket. Record `free_egress` / `egress_per_gb` /
+`cross_az_per_gb` as profile fields and prefer a $0-egress GPU-cloud for transfer-heavy jobs.
+### U15 — Compress before the wire
+**Symptom**: checkpoint/dataset transfers are slow and (on metered egress) expensive.
+**Root cause**: raw tensors and JSON cross the network uncompressed.
+**Fix**: zstd/gzip the payload before transfer — cuts checkpoints+datasets 30–60%, JSON 60–80%; store
+weights fp16/int8 where the task tolerates it. Compounds with U14 (less egress $) and U12 (fewer bytes to
+resume). Pairs with U25 (tar shards compress and transfer as one stream).
+---
+## Monitoring
+### U16 — Stale background waiters pile up; supersede a run → STOP its waiter; pick the right lifetime
+**Symptom**: a "Background tasks" panel shows 8+ "Running" wait-loops at 500–740 min elapsed, each
+ssh-polling every ~20 s, while the GPU is idle and the experiment finished hours ago.
+**Root cause**: every kill+restart of a flaky saga armed a NEW `until ssh grep MARKER; do sleep; done`
+waiter but never stopped the OLD one — its marker (in a superseded log) never appears, so it loops forever.
+A `run_in_background` waiter is **not** time-capped (a 781 s task ran to completion + notified; the ~600 s
+cap is on **foreground** Bash only). The real silent-failure mode is a waiter that never EXITS (U17).
+**Fix**: one waiter per live run — superseding a run, stop the old waiter first (`TaskStop`; cross-session
+IDs aren't stoppable from a resumed session — dismiss those from the UI). Multi-hour wait → a **persistent
+Monitor** (no 10-min cap) + a stall-detector emit so a hung run still notifies. A persistent Monitor dies on
+session resume → after any resume, check the remote ground-truth directly (`tmux ls`, `grep DONE log`,
+`nvidia-smi`); never trust a monitor that may be gone (principle #3).
+### U17 — A silent background monitor that never returns: usually an unquoted `|` in grep
+**Symptom**: a `run_in_background` ssh monitor never returns / never notifies; `pgrep` shows a process
+"alive." The run looks hung — but the actual job finished and wrote results fine.
+**Root cause**: the wrapper never EXITED because a sub-command blocks forever. The classic bug is an
+**unquoted `|` in grep** — `grep -hE noise-sweep|snr=|wrote log` — the shell splits it into THREE piped
+commands, and the first (`grep -hE noise-sweep`, no filename) reads **stdin** → blocks forever → the
+pipeline never returns → ssh never returns → the local background process never exits → no completion
+notification. (Background tasks notify on EXIT only — no 600 s cap; foreground Bash is the capped one, U16.)
+**Fix — robust remote-poll template**:
+- **Quote every regex AND give grep a filename**: `grep -hE 'noise-sweep|snr=|wrote' log` (a `|` inside quotes is alternation; a filename means read the file, never stdin).
+- **Bound the ssh**: `ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 -o ServerAliveCountMax=3 …` — a blip self-kills in ~30 s instead of half-open hanging for minutes.
+- **Short-connection poll, not one long-held ssh**: each poll = ssh in → check → disconnect; loop locally with a bounded counter.
+- **Verify by artifact, not notification**: when it "looks done," Read the local output + a fresh `ssh 'grep DONE log; tmux ls; nvidia-smi'` to confirm ground truth (cross-link verifying-dl-experiments **REQUIRED**); don't wait on a notification that may never fire.
+### U18 — "I'll check periodically" is a lie unless a trigger is armed; two-leg remote self-completion
+**Symptom**: a promise to monitor a multi-hour remote run, then no report for a day — because between turns
+the assistant does not run. A cloud scheduler set up to "ssh in and check" silently can't reach the box.
+**Root cause**: two conflated things. (a) Making the REMOTE self-complete (a waiter that blocks on a log
+marker then runs eval) guarantees RESULTS but gives no *reporting cadence* — nothing re-invokes the
+assistant on a timer. (b) A cloud schedule runs in an isolated sandbox with its own checkout and **no access
+to the local SSH key or network** → it cannot `ssh` the rented box, and the SSH private key must **never** go
+into a cloud agent (secret-leak).
+**Fix — the two-leg pattern**:
+- **Remote self-completion (guaranteed, survives session/SSH death)**: chain `train → eval → touch marker` under one `nohup ... </dev/null >log 2>&1 &`. Detect "done" by a **log marker** (`grep -q 'QUEUE DONE' master.log`), NEVER by `pgrep` — the waiter's own command line contains the pattern, so `pgrep -f` matches itself and loops forever (U17).
+- **Live progress (best-effort)**: a session-bound local loop (e.g. `/loop 30m` / cron `3,33 * * * *`) that ssh-polls with the *local* key. Be honest it dies when the session closes — the remote still finishes; the user re-pings to pull.
+- **Don't promise autonomous cross-session polling you can't deliver.** (`tmux` is often absent on a fresh box and `apt-get install` fails offline — `nohup ... </dev/null >log 2>&1 &` is zero-dependency and survives SSH drop; verify with `pgrep -af <script>`.) Full architecture → `references/monitoring_patterns.md`.
+### U19 — Tracker run deletion lags; a fresh export resurrects "deleted" runs
+**Symptom**: `run.delete()` returns, but an immediate `api.runs()` still lists every deleted run; a batch
+history-export minutes later happily re-downloads `<run>__history.csv` for runs just deleted.
+**Root cause**: deletion is asynchronous server-side; list/export endpoints serve stale listings for
+minutes.
+**Fix**: delete → re-verify on a **later** monitoring tick (not a tight loop; a second
+`delete(delete_artifacts=True)` pass is safe). Order matters: do cloud deletions **before** local exports,
+then re-check the export dir for resurrected files and remove them. (cross-link verifying-dl-experiments
+**REQUIRED** for tracker forensics.)
+### U20 — Local logs die with the instance: use a hosted tracker
+**Symptom**: TensorBoard event files written to an ephemeral box vanish on teardown — every curve gone after
+the meter-stop verb runs.
+**Root cause**: a rented box's local disk is not durable past `terminate`/`destroy` (principle #4); the
+metric history lived only there.
+**Fix**: log metrics to a **hosted tracker** so they survive teardown — `trackio.init(space_id=...)` or
+`wandb` online (push under the platform's proxy if behind a firewall). Poll the tracker's structured alerts
+as the monitor instead of brittle ssh-tail. Cross-link huggingface-skills:huggingface-trackio **REQUIRED**
+for the `init/log/finish/alert` mechanics and `space_id` sync.
+### U43 — A detached run's log looks frozen for minutes though training is fine: stdout is block-buffered off a TTY
+**Symptom**: a `nohup`/`tmux` run prints a few lines then nothing for many minutes; it reads as
+"hung / died" and the reflex is to kill it — but checkpoint `mtime`, TB scalars, and `nvidia-smi` all show
+it advancing.
+**Root cause**: Python (and libc stdio) **line-buffer when stdout is a TTY but block-buffer (~4–8 KB) when
+it is a pipe or file** — exactly the detached case. The log only flushes when the buffer fills, so a
+healthy run looks silent and a `grep`-on-log liveness check false-alarms on the gap.
+**Fix**: run unbuffered — `python -u` or `PYTHONUNBUFFERED=1` (the shipped `scripts/run_one.sh.template`
+already exports it); for a shell pipeline use `stdbuf -oL`. And judge liveness by **artifacts, not stdout
+cadence** — checkpoint `mtime`, the TB scalar API, `nvidia-smi` (monitoring_patterns §0 corollary; the
+deeper "is it actually hung?" attach is py-spy, throughput-profiling **T21**). A frozen log is the single
+most common false "dead run."
+---
+## GPU health
+### U21 — `nvidia-smi` GPU-Util % is a liar
+**Symptom**: the perf tile reads 100% util but throughput is poor; or util looks "busy" while the job is
+actually starved (the inverse of U38, which is the 0%-but-running case).
+**Root cause**: `GPU-Util` means "≥1 kernel ran in the sampling window," not "useful work filled the
+window." A trickle of tiny kernels reads as 100%.
+**Fix**: correlate util with **SM clock** (`clocks.current.sm`), memory-bandwidth util, and power draw —
+`nvidia-smi dmon -s pucvmet -d 1`. Low SM clock or low power at "100% util" means the GPU is underfed (go to
+U24). Always sample over several seconds, never one snapshot.
+### U22 — Xid 48/79 = a dead GPU; on a rental, re-rent
+**Symptom**: training crashes or the GPU drops out; `dmesg | grep -i xid` shows an Xid error.
+**Root cause**: Xid is NVIDIA's canonical hardware-fault signal. **Xid 48 = double-bit ECC (the GPU is
+dead); Xid 79 = "GPU has fallen off the bus."** These are hardware, not code.
+**Fix**: on a *rental* the card can't be reseated — **stop the instance and re-rent a different box**; don't
+burn hours debugging code for a hardware fault. Check `dmesg | grep -i xid` as part of the "vanished
+process" ladder (U3) when the GPU goes idle unexpectedly.
+### U23 — Thermal/power throttling silently steals 25–40% with no error
+**Symptom**: "the same code is slower than yesterday" — no error, no crash, just lower throughput.
+**Root cause**: the GPU is thermal- or power-throttling (an H100 throttles around 83 °C; target <75 °C). On
+a shared rental, cooling/power headroom is outside tenant control.
+**Fix**: detect — SM clock falling below base while temp >83 °C, or
+`nvidia-smi -q -d PERFORMANCE` showing a throttle reason. A tenant can't fix cooling → **flag it and
+re-rent** a healthier box; don't read the slowdown as a model/data regression. Pairs with U21 (clocks expose
+it where util% hides it).
+---
+## Dataloader & IO
+### U24 — GPU starves at 10–70% waiting on the dataloader, not on compute
+**Symptom**: util sits well below 100% (but nonzero), step log advances slowly; profiling shows time spent
+in data fetch, not fwd/bwd.
+**Root cause**: the input pipeline can't keep the GPU fed — too few workers, no prefetch, host↔device copies
+on the critical path. (Distinct from U38's *0%* CPU-data-bound transform case; this is the partial-starve
+knob set.)
+**Fix — tune in order**: `num_workers = cores − 1` (sized against per-worker footprint, U9),
+`persistent_workers=True`, `pin_memory=True`, `prefetch_factor=2`. Pathological cases show >100× gaps from
+these alone. If a heavy per-sample transform is the bottleneck, move it to the GPU (cross-link
+verifying-dl-experiments **REQUIRED** for the 0%-util diagnosis, U38). Pairs with U8 (stage to NVMe) and U25.
+### U25 — Millions of small files on a network/object store → transaction-overhead death; shard into tar
+**Symptom**: a dataset of many tiny files streams glacially from a shared/object store; or eval output of
+tens of thousands of per-sample files exhausts inodes (U7) or blows a visualization grid (U37).
+**Root cause**: per-file open/stat/close overhead dominates on networked/object storage; the inode and
+metadata cost scales with file *count*, not bytes.
+**Fix**: pack into **sharded tar** (WebDataset), a few-hundred-MB per shard → 3–10× faster sequential I/O and
+the only sane pattern for streaming from S3. This is the **general form** of the inode-exhaustion trap (U7)
+and the per-sample-vis trap — cap and shard rather than emitting a file per sample. Pairs with U8 (stage the
+shards to local NVMe) and U15 (shards compress as one stream).
+### U40 — A vCPU-sliced rental starves its own GPU: torch intra-op threads default to the HOST core count, not your cgroup quota
+**Symptom**: GPU `sm%` sits ~5–15% and runs grind, but the dataloader is not the bottleneck (few/no
+workers, data already on-device, the U24 knobs don't help); `top` shows dozens of python threads fighting
+over a handful of cores.
+**Root cause**: you rent a **cgroup CPU slice** (e.g. 12 vCPUs of a 64-core host), but torch/OpenMP size
+their intra-op thread pools to the **physical** core count — `torch.get_num_threads()` / `OMP_NUM_THREADS`
+come up ~64. ~57 runnable threads thrashing 12 cores burn the slice on context-switching, so the CPU side
+that launches kernels and feeds the GPU can't keep up and the card idles. No OOM, no error — pure scheduler
+thrash (the *host scheduling* starves the GPU, the inverse of being data-bound).
+**Fix**: cap the pools to your **slice's** vCPU count before launch —
+`export OMP_NUM_THREADS=4 MKL_NUM_THREADS=4` (and/or `torch.set_num_threads(4)`); confirm torch honoured it
+(`python -c "import torch; print(torch.get_num_threads())"` → 4, not 64). Read the real quota from the
+cgroup, not `nproc` (which reports host cores): `cat /sys/fs/cgroup/cpu.max` → `quota period`, vCPUs ≈
+quota/period. Bake the cap into the launch wrapper so every queue cell inherits it. Distinct from **U9**
+(workers × RAM → cgroup OOM) and **U24** (dataloader starvation); the triage that catches it is
+throughput-profiling **T3** (GPU SM% low while a python thread-storm pegs the cores).
+---
+## Env & Container
+### U26 — CRLF breaks `.sh` on Linux (authored on Windows)
+**Symptom**: a synced launcher silently does nothing (empty log); run by hand it errors `set: -: invalid
+option`, `cd: /path\r: No such file or directory`, `syntax error near unexpected token $'do\r'` — every
+line "ends in `\r`."
+**Root cause**: Windows `core.autocrlf=true` (or `git archive` exporting working-tree EOL) writes `.sh` with
+CRLF; Linux `bash` treats the trailing `\r` as part of each token. `.py` is unaffected (Python's universal
+newlines); it is specifically `bash`/`.sh` that breaks.
+**Fix**: add `.gitattributes` with `*.sh text eol=lf` (so `git archive`/checkout always emits LF); immediate
+on-box unblock: `sed -i 's/\r$//' scripts/*.sh`.
+### U27 — `-o dotted.key=value` overrides explode on null parents → freeze protocols as overlay config FILES
+**Symptom**: `-o evaluation.sps_augmentation.enable=true` crashes
+`KeyError: Override path '...' is not a mapping` because the base YAML has the parent as `null`. Worse
+long-term: protocol variants that exist only as one-off CLI strings are unreproducible months later.
+**Root cause**: dotted-key override traversal can't descend through a `null` parent; and a CLI-string-only
+protocol has no diffable, reviewable record.
+**Fix**: define each protocol variant as a small overlay config (`configs/eval_overlays/<protocol>.yaml` with
+`_base_:` pointing at the canonical leaf) and pass it via `-c`. Reviewable, diff-able, immune to null-parent
+traversal. This is also the **retry-the-identical-config mechanism** (principle #7): an overlay file is a
+stable config a retry re-uses byte-for-byte. To reconstruct a historical protocol, read the artifact
+manifest (`*_manifest.json` records the resolved overrides verbatim).
+### U28 — The CUDA-toolkit ↔ host-driver ↔ torch-build triangle
+**Symptom**: `detected CUDA version mismatches the version used to compile PyTorch`; or `no kernel image is
+available for execution` at the first forward on a new-arch GPU.
+**Root cause**: three independently-versioned layers must agree — **the host driver is host-global and a
+tenant usually cannot change it on a rental; the CUDA toolkit is per-env and changeable; the torch build
+must match both.** The toolkit must be ≤ what the host driver supports; a project that pins
+`torch<2.9` can *downgrade* the only build with kernels for a new-arch card (e.g. sm_120).
+**Fix**: keep the image's working torch — filter framework pins out of the remote install:
+```bash
+grep -ivE '^(torch|torchvision|torchaudio)' requirements.txt > /root/req_remote.txt
+pip install -r /root/req_remote.txt
+```
+Set `LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH` when the per-env toolkit must win. Smoke
+`torch.cuda.get_device_capability()` + a heavy project import before launching; the off-band torch version
+lands in the runtime snapshot — disclose it with results. `host_driver_cuda_max` is a profile field.
+### U29 — "Same version, different result": top-level pins let transitive deps drift → install from a lockfile
+**Symptom**: two installs of the "same" `requirements.txt` produce different behavior/results.
+**Root cause**: a hand-edited `requirements.txt` pins only top-level packages; transitive dependencies drift
+between installs.
+**Fix**: install from a **lock file** (`uv lock` / `pip-tools` / `conda-lock`) that pins the full resolved
+graph, not a hand-edited top-level list. Pairs with U28 (filter the framework pins, then lock the rest).
+### U30 — A Dockerfile is NOT reproducible: pin the base image by `@sha256:` digest
+**Symptom**: a container built from the "same" Dockerfile months apart behaves differently.
+**Root cause**: `FROM image:latest` (or any moving tag) resolves to a different layer set over time.
+**Fix**: pin the base image by content digest — `FROM image@sha256:<digest>`, not `:latest` — so the build
+is bit-reproducible. (`pin_image_by_sha256` is a per-platform expectation where the image is the env
+contract.)
+### U31 — Container runs but trains 100× slower = the GPU was never attached (CPU-only)
+**Symptom**: a containerized job runs to completion but is absurdly slow; loss curves look normal, just
+glacial.
+**Root cause**: the container has no GPU — launched without `--gpus all`, or the NVIDIA Container Toolkit is
+missing/too old, so CUDA silently fell back to CPU.
+**Fix**: `docker run --gpus all …`, NVIDIA Container Toolkit ≥1.14, and **validate `nvidia-smi` *inside* the
+container before training** — never assume GPU attachment from a clean `docker run`.
+### U42 — The box runs a hand-synced copy with no git remote; a fix you "committed" may not be deployed — verify it is ON the box before trusting a run or tearing down
+**Symptom**: a bug you fixed and committed locally still reproduces on the box, or an eval runs on stale
+logic (wrong default, missing speedup, pathologically slow), even though local `git log` shows the fix
+landed.
+**Root cause**: most rentals have **no git remote** — the box holds a working tree you pushed by
+`scp`/`rsync`/`tar-over-ssh`, so its code only advances when you re-sync. A local commit changes nothing on
+the box; an interrupted or wrong-path sync, or simply forgetting, leaves the box pre-fix. "I committed it"
+≠ "it's running on the box."
+**Fix**: treat code deploy like the checked-sync (**U33**) — **verify, don't assume**. After syncing, grep
+the box for the change before relying on it:
+```bash
+ssh "$HOST" "grep -n '<new symbol / changed line>' /root/<proj>/path/file.py" || echo 'NOT DEPLOYED'
+```
+or compare a hash (`ssh host 'sha256sum file'` vs local). Make it a pre-flight for any run whose result
+depends on the fix, and part of the **Phase-5 teardown gate** — a verdict produced by stale code is not the
+verdict you think it is (principle #3). Pairs with **U29/U30** (pin deps/image): code AND environment must
+both be the version you believe.
+---
+## Cost & teardown
+### U32 — A task's default epochs differ from another task's; CLI `--epochs` silently overrides the right value
+**Symptom**: one CLI `--epochs N` is applied to all ablations; a subset (e.g. detection vs recon/seg)
+consistently underperforms; a reviewer flags it.
+**Root cause**: some task families need more epochs to converge and default to a higher value in their YAML;
+a blanket CLI `--epochs` silently overrides that per-task default.
+**Fix**: make the queue support a per-line epoch field (e.g. recon/seg `20`, det `50`); audit the codebase's
+YAML for `epochs:` declarations before deploying (`grep -rE '^\s*epochs:' configs/ | sort -u`). This is a
+config-drift instance — really a smoke/sanity target (cross-link verifying-dl-experiments **REQUIRED**).
+### U33 — Silent sync failure: gate the success line on the actual copy result
+**Symptom**: a wrapper prints `auto-synced <name> to durable storage` for every job, but at download time
+the durable dir is missing or empty.
+**Root cause**: the sync block does `mkdir -p "$DST"; cp -f ... 2>/dev/null` then `echo synced`
+**unconditionally** — it never checks the exit code. When the durable FS is inode-exhausted (U7) `mkdir`
+fails but the success line still fires, so monitoring looks green while nothing landed (principle #3).
+**Fix — checked, gated sync**:
+```bash
+if mkdir -p "$DST" && cp -f "$CKPT_DIR/best.pth" "$DST/" && [ -f "$DST/best.pth" ]; then
+    echo "[$(date +%H:%M:%S)] auto-synced $NAME to durable storage"
+else
+    echo "[$(date +%H:%M:%S)] !! SYNC FAILED for $NAME (check df -i) — data disk is still source-of-truth"
+fi
+```
+Until a download is verified locally, trust the **data-disk** copy, not the "synced" log line. The shipped
+`scripts/run_one.sh.template` carries the checked version.
+---
+## Secrets & trackers
+### U34 — Move credentials to the box without the secret ever appearing in a command
+**Symptom**: pasting a key into an ssh/scp command leaks it into shell history, transcripts, and hook logs;
+security hooks (rightly) block scp-ing a whole `~/.netrc` (it carries other machines' credentials).
+**Root cause**: any secret inside a command string is captured by history/transcript/hook logging.
+**Fix**: stream exactly one machine block via **stdin** — the value flows file→pipe→file and never appears in
+any command text or output:
+```bash
+grep -A 2 'machine api.wandb.ai' ~/.netrc | ssh <host> 'umask 077; cat > /root/.netrc && chmod 600 /root/.netrc'
+```
+Verify by capability, not by echoing the value:
+`python -c "import wandb; print(wandb.Api(timeout=20).default_entity)"`. Never write the secret to a
+shared/durable FS that a platform classifier scans (that platform detail is a profile fact).
+### U35 — `WANDB_MODE=offline` still dies without an API key in wrapper stacks → zero curves
+**Symptom**: a run launched `WANDB_MODE=offline` expecting "log locally, sync later" produces **no offline
+run dirs at all**; the train log shows `Disabled WandB due to initialization error: No API key configured`.
+**Root cause**: bare-SDK offline mode needs no key, but project logger *wrappers* often probe the API
+(`wandb.login()` / `wandb.Api()`) before `init` and treat key-absence as fatal → they flip to fully-disabled,
+not offline.
+**Fix**: push credentials BEFORE the first launch (U34) and run online under the platform's proxy; verify the
+first log lines show `Syncing run <name>` + a run URL — treat the *absence* of that line as a failure. Run
+already finished without a tracker? Backfill from the train log (regex per-epoch summaries →
+`init(..., tags=["backfilled"]) → run.log(..., step=epoch)`). Still in flight? Kill and relaunch with
+`--resume <latest.pth>` (costs ≤1 epoch). Prefer a hosted tracker so metrics survive teardown (U20).
+---
+## Delegated — cross-link only, do NOT restate here
+### U36 — cuDNN nondeterminism
+Same config + seed gives slightly different metrics run-to-run (`cudnn.benchmark=True` picks the fastest
+kernel by first-batch timing). Owned by **verifying-dl-experiments** (determinism). Cross-link
+verifying-dl-experiments **REQUIRED**; do not restate the fix here.
+### U37 — matplotlib `2^16`-per-axis limit on large eval visualization
+A composite grid (one row per sample) on a large test set crashes
+`Image size … must be less than 2^16`, often aborting the summary save. Owned by
+**verifying-dl-experiments** (eval-artifact sizing). Cross-link verifying-dl-experiments **REQUIRED**;
+prevent with U25 (cap + shard, don't emit a file/row per sample).
+### U38 — GPU at 0% util but training IS running (CPU-data-bound, not stalled)
+`nvidia-smi` reads ~0% util yet the step log advances and model memory is loaded — a heavy per-sample CPU
+transform with `num_workers=0` serializes data prep and starves the GPU. Owned by
+**verifying-dl-experiments** (0%-util diagnosis). Cross-link verifying-dl-experiments **REQUIRED**; the fix
+knobs are U24, the move-to-GPU remedy is in that skill.
+### U39 — Live monitoring shows nothing (TensorBoard panel empty / `INACTIVE`) but training is fine
+**Symptom**: the platform's TensorBoard tile / web panel is blank or `INACTIVE`, or a backgrounded watcher
+goes silent — yet the run is healthy: the loss advances on the box and the event/log files exist. You
+conclude "monitoring is broken" or, worse, "the run died," and waste a check or restart a fine run.
+**Root cause**: live observability breaks in three platform-shaped ways, none of which is a training
+failure. (1) **Path mismatch** — the platform's built-in panel reads a FIXED logdir/port and your logger
+wrote elsewhere, so the panel sees zero runs (AutoDL pins `tensorboard --logdir /root/tf-logs`; a
+`SummaryWriter(log_dir="runs/<exp>")` is invisible to it). (2) **Process died / never backgrounded** — the
+TB server or the watcher ran in the foreground or under the session and was killed at the foreground cap
+or on session/SSH drop, so nothing serves the curves. (3) **Port not exposed** — the service is up on the
+box but the port was never tunnelled / declared, so the panel can't reach it.
+**Fix** (the rule is universal; the *value* is per-profile): (1) **align the path** — point your logger at
+the panel's pinned dir, OR symlink the pinned dir at your output (`ln -sfn <your-runs>/<exp> <pinned>/<exp>`);
+no retrain — the running writer keeps appending and the panel reloads it. The pinned path lives in the
+profile (AutoDL `/root/tf-logs`, **AD7**; elsewhere write under the durable mount). (2) **run TB + the
+watcher under the detach primitive** (tmux / nohup / the profile's `DETACH`), never foreground, so they
+survive the session and the ~600 s cap (`references/monitoring_patterns.md` §1; cross-host background →
+§7). (3) **expose the port the platform's way** — CN built-in tiles declare it at rent time (`china.md`),
+RunPod via its HTTP proxy (100 s Cloudflare cap, fine for a TB UI, `runpod.md`), Lambda / Paperspace /
+bare-SSH via an `ssh -L 6006:localhost:6006` tunnel (`generic-ssh.md`, `lambda.md`). Before blaming the
+panel, verify ground truth: the event file is non-empty (`ls -la <logdir>; du -sh <logdir>`) and TB
+answers locally (`curl -s localhost:<port>/ | head`). For curves that must **survive teardown**, don't
+depend on a box-local panel at all → a hosted tracker (**U20**).
+---
+## Pointers — gotchas catalogued elsewhere
+- **Spot / preemption** (grace windows 2 min → ~0 s, Young/Daly cadence, atomic-write resume, managed-spot frameworks restart-your-process) → `references/spot-resilience.md`.
+- **Multi-node / NCCL** (fabric-manager hang, wrong NIC, NCCL timeout, jumbo-frame MTU mismatch, torchrun/Horovod elastic state restore) → `references/multinode.md`. Single-box users skip.