opencode-skills-collection 3.1.2 → 3.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (65)
  1. package/bundled-skills/.antigravity-install-manifest.json +4 -1
  2. package/bundled-skills/agent-creator/SKILL.md +246 -0
  3. package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
  4. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  5. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  6. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  7. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  8. package/bundled-skills/docs/sources/sources.md +1 -1
  9. package/bundled-skills/docs/users/bundles.md +1 -1
  10. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  11. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  12. package/bundled-skills/docs/users/getting-started.md +1 -1
  13. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  14. package/bundled-skills/docs/users/usage.md +4 -4
  15. package/bundled-skills/docs/users/visual-guide.md +4 -4
  16. package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
  17. package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
  18. package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
  19. package/bundled-skills/remote-gpu-trainer/README.md +267 -0
  20. package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
  21. package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
  22. package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
  23. package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
  24. package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
  25. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
  26. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
  27. package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
  28. package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
  29. package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
  30. package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
  31. package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
  32. package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
  33. package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
  34. package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
  35. package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
  36. package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
  37. package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
  38. package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
  39. package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
  40. package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
  41. package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
  42. package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
  43. package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
  44. package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
  45. package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
  46. package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
  47. package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
  48. package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
  49. package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
  50. package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
  51. package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
  52. package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
  53. package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
  54. package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
  55. package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
  56. package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
  57. package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
  58. package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
  59. package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
  60. package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
  61. package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
  62. package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
  63. package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
  64. package/package.json +1 -1
  65. package/skills_index.json +66 -0
@@ -0,0 +1,704 @@
1
+ # Universal & mixed gotcha catalog — every metered remote-GPU rental
2
+
3
+ The cross-platform gotchas: they bite on **any** metered, isolated, rented GPU — only the concrete
4
+ path/proxy/billing-verb changes (those live in `profiles/<platform>.md`). Each entry is
5
+ **Symptom → Root cause → Fix**. "Mixed" entries are universal in symptom but carry a *platform-specific
6
+ value* in the fix — the rule stays here, the value lives in a profile. Platform-only gotchas (AutoDL's
7
+ TB-pin, the wandb-key classifier, the network_turbo proxy literal) do NOT live here — see each profile's
8
+ TOP GOTCHAS section.
9
+
10
+ To jump: `grep -in '<keyword>' references/gotchas_universal.md` (e.g. `inode`, `egress`, `xid`, `crlf`,
11
+ `stdin`, `zombie`). Numbering `U1…` is stable; cross-platform additions continue the same series.
12
+
13
+ ## Table of contents (by theme)
14
+
15
+ - **Process & SSH** — U1 SSH-dies-on-kill · U2 tmux-holds-script-in-memory · U3 vanished-process-4-causes · U4 kill-drops-SSH-before-relaunch · U5 hook-safe-launch
16
+ - **Disk & Storage** — U6 disk-full-crashes-torch.save · U7 storage-fails-on-inodes · U8 stage-hot-data-to-NVMe
17
+ - **Memory & OOM** — U9 cgroup-OOM-num_workers×tensor · U10 VRAM-OOM-vs-cgroup-OOM · U11 zombie-VRAM-nvidia-smi-cant-see · U41 host-metrics-lie/oom_kill-counter
18
+ - **Transfer & Download** — U12 scp-resets→resumable-loop · U13 scp-into-uncreated-dir · U14 egress-surcharge+same-AZ · U15 compress-before-the-wire
19
+ - **Monitoring** — U16 stale-waiters/zombie-monitors · U17 unquoted-pipe-grep-hang+robust-poll · U18 two-leg-remote-self-completion · U19 tracker-deletion-lags · U20 hosted-tracker-survives-teardown · U39 live-panel/TB-silently-empty (path/port/process mismatch) · U43 block-buffered-stdout-looks-frozen
20
+ - **GPU health** — U21 nvidia-smi-util%-is-a-liar · U22 Xid-48/79-dead-GPU-re-rent · U23 thermal/power-throttle-steals-25-40%
21
+ - **Dataloader & IO** — U24 dataloader-starvation-knobs · U25 many-small-files→shard-into-tar · U40 intra-op-thread-oversubscription-starves-GPU
22
+ - **Env & Container** — U26 CRLF-breaks-sh · U27 overlay-config-files · U28 CUDA-toolkit-vs-driver-vs-torch · U29 install-from-lockfile · U30 pin-image-by-sha256 · U31 container-runs-but-no-GPU · U42 box-code-drift/verify-deploy
23
+ - **Cost & teardown** — U32 task-epoch-default · U33 silent-checked-sync
24
+ - **Secrets & trackers** — U34 secrets-via-stdin · U35 tracker-offline-without-key
25
+ - **Delegated (cross-link only)** — U36 cuDNN-nondeterminism · U37 matplotlib-2^16 · U38 GPU-0%-util-data-bound
26
+ - **Pointers** — spot/preemption → `references/spot-resilience.md`; multi-node/NCCL → `references/multinode.md`
27
+
28
+ ---
29
+
30
+ ## Process & SSH
31
+
32
+ ### U1 — SSH disconnects on `pkill -9` (exit 255, "Connection reset")
33
+
34
+ **Symptom**: `ssh <host> 'pkill -9 -f train'` returns `Connection reset by peer`, exit 255.
35
+
36
+ **Root cause**: killing the python tree tears down the PTY chain; the SSH client gets EOF and exits. The
37
+ remote command may have run fine.
38
+
39
+ **Fix**: this is **normal, not an error** — re-ssh and verify state, do not panic-retry.
40
+ ```bash
41
+ ssh <host> "tmux kill-session -t qN 2>/dev/null; sleep 3; pkill -9 -f 'src.train'" # SSH exits 255 here
42
+ ssh <host> "pgrep -af 'src.train' | head -1 || echo CLEAN" # separate call verifies
43
+ ```
44
+
45
+ ### U2 — tmux holds the script in memory; editing it mid-run re-executes blocks
46
+
47
+ **Symptom**: a queue/launcher script is updated mid-run, but the running job still uses the old logic; or
48
+ an ablation completes cleanly yet **restarts from epoch 1** with a second tracker run and the queue never
49
+ advances.
50
+
51
+ **Root cause**: bash reads a script **by byte-offset on demand**. tmux keeps the launched script as-loaded;
52
+ `scp`-ing a new version mid-run makes bash seek to its saved offset in a *now-different* file, land
53
+ mid-command, and re-execute a block (duplicate runs, stalled queue). A child invocation (`bash run_one.sh`)
54
+ IS re-read fresh for the *next* item — but only if none is parked mid-script. (principle #6.)
55
+
56
+ **Fix**: **never overwrite a script any process is executing** — check `pgrep -af <script>` first; version
57
+ the filename for hot changes (`run_one_v2.sh`), point only *new* launches at it. Appending lines to a queue
58
+ file is safe (`while read < file` sees appended bytes); changing structure is not. To hot-swap, kill +
59
+ restart the detach session so fresh bash reads from the top. Recovery: kill the session, copy the finished
60
+ `best.pth` to durable storage, restart `run_queue.sh queue.txt <start_index>` to skip done items, delete any
61
+ duplicate tracker run (cross-link verifying-dl-experiments **REQUIRED**).
62
+
63
+ **Related detach trap — a non-exported var doesn't cross into the detach primitive.** A `VAR=x` set in
64
+ your shell before `tmux new-session` / `nohup` is **not** in the detached job's environment unless
65
+ **exported** (or inlined in the launched command) — the job sees it empty, and a launcher/monitor that
66
+ interpolates it silently misdirects (writes output to the wrong path, mis-reports "died"). `export VAR`
67
+ before launch, or inline it: `tmux new-session -d "VAR=$VAR bash run.sh"`.
68
+
69
+ ### U3 — A vanished remote process ≠ OOM: enumerate the 4 causes
70
+
71
+ **Symptom**: a detached run's log stops right after `Starting training` with no epoch output and no
72
+ traceback; `pgrep` shows it gone. The reflex is "OOM-killed."
73
+
74
+ **Root cause is one of four** — OOM is only one:
75
+ 1. **Machine restart / reboot** — `dmesg` is *clean*, GPU idle, cgroup roomy, `uptime` low. Most-missed: nothing in the log hints at it.
76
+ 2. **OOM-kill (`-9`)** — `dmesg | grep -i 'killed process'` shows it, memory was tight (U9).
77
+ 3. **SSH HUP** — a foreground (non-`nohup`/`tmux`/`setsid`) launch dies when its parent SSH drops.
78
+ 4. **Manual kill** — an earlier `pkill` matched more than intended.
79
+
80
+ **Fix — diagnose cheap → conclusive before "fixing"**:
81
+ ```bash
82
+ dmesg 2>/dev/null | grep -iE 'killed process|out of memory' | tail # OOM? empty = not OOM
83
+ nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader # idle now = died, not hung
84
+ numfmt --to=iec --invalid=ignore < /sys/fs/cgroup/memory.max         # roomy (or literal "max" = no limit) = OOM unlikely
85
+ uptime # low = recent reboot (cause 1)
86
+ ```
87
+ Clean `dmesg` + idle GPU + roomy cgroup + low `uptime` ⇒ **reboot, not OOM**. Do NOT shrink batch size to
88
+ "fix" a phantom OOM — that masks the one variable under test. **Separate trap**: a dropped poll connection
89
+ ≠ the training dying — re-ssh and check the process/artifact directly (`pgrep -af train`, log tail,
90
+ `best.pth` mtime) before concluding the run died (principle #3).
91
+
92
+ ### U4 — `kill` drops the SSH before a relaunch in the SAME command runs
93
+
94
+ **Symptom**: `ssh <host> 'pkill -f X; relaunch X'` kills X but X is **not** relaunched; ssh returns 255.
95
+
96
+ **Root cause**: killing a session-tied process drops the SSH (U1, normal) at the kill, so everything after
97
+ it in that one command never executes.
98
+
99
+ **Fix**: split — kill in one ssh call, relaunch (with NO kill) in the next. To stop a kill/poll pattern
100
+ from matching the shell that issues it, split the literal: `A=base; B=lines.; pgrep -f "${A}${B}"`
101
+ (`pgrep` excludes its own process, but the remote shell's command line would otherwise contain the contiguous string `baselines.`; split this way, it never does).
102
+
103
+ ### U5 — Hook-safe remote launch: keep env activation VISIBLE in the launch command
104
+
105
+ **Symptom**: an env-guard hook (e.g. "no DL in conda base") blocks or asks on
106
+ `ssh <host> 'nohup bash /root/job.sh ...'` even though `job.sh` activates the right env internally; it also
107
+ misfires on heredocs that inline `python -m <pkg>.train`.
108
+
109
+ **Root cause**: the hook scans the **command string** — it cannot see inside an scp'd script, and a bare
110
+ `bash job.sh` launch has no visible `conda activate <env>`, so the guard assumes base.
111
+
112
+ **Fix**: write the heavy script via Write/`scp` (so `python -m ...train` lives in the file, not the command)
113
+ and put a VISIBLE activation in the launch ssh command:
114
+ `ssh <host> 'source /path/to/conda.sh; conda activate <env>; nohup bash /root/job.sh ...'` — the script
115
+ re-activating is harmless. Never `--no-verify` / never bypass the guard. (On a single-tenant rental whose
116
+ base IS the env, the right move is to exempt remote/ephemeral base, not to clone it — that's a profile fact.)
117
+
118
+ ---
119
+
120
+ ## Disk & Storage
121
+
122
+ ### U6 — Disk-full crashes `torch.save` with `iostream error`
123
+
124
+ **Symptom**: mid-training exit=1; log shows `RuntimeError: basic_ios::clear: iostream error` and
125
+ `unexpected pos N vs M` from inside `torch.serialization`; a leftover `latest.pth.tmp` sits in the
126
+ checkpoint dir; `df` shows the data mount at 100%.
127
+
128
+ **Root cause**: `torch.save` writes atomically (write `.tmp` → rename); the `.tmp` write hits disk-full and
129
+ errors. Any quota'd/cgroup disk on any rental does this.
130
+
131
+ **Fix — prevent**: pre-budget `ckpt_size × N_runs + worst_case_latest + tracker_local_cache`; if it exceeds
132
+ the mount, schedule mid-run aggregation to durable storage + delete completed-and-aggregated dirs; in
133
+ `run_one.sh`, on success prune the rolling `latest.pth` and keep only `best.pth` (cross-link
134
+ verifying-dl-experiments **REQUIRED** for the keepable-checkpoint policy). **Recover**: delete the
135
+ `*.tmp`/`latest.pth` to free several GB — `best.pth` survives, the queue can resume.
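The pre-budget formula above can be sketched as a small shell check. This is a minimal sketch, not part of the shipped scripts: the helper names `ckpt_budget_gb` and `budget_fits`, the whole-GB units, and the example numbers are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the package): checkpoint disk budget in whole GB.

# need = ckpt_size * n_runs + worst_case_latest + tracker_local_cache
ckpt_budget_gb() {
  local ckpt_gb=$1 n_runs=$2 latest_gb=$3 tracker_gb=$4
  echo $(( ckpt_gb * n_runs + latest_gb + tracker_gb ))
}

# Exit 0 when the budget fits in the free space, exit 1 otherwise.
budget_fits() {
  local need_gb=$1 free_gb=$2
  [ "$need_gb" -le "$free_gb" ]
}

# Example numbers only: 2 GB checkpoints, 12 runs, 4 GB rolling latest, 3 GB tracker cache.
need=$(ckpt_budget_gb 2 12 4 3)
if budget_fits "$need" 50; then
  echo "fits: need ${need} GB of 50 GB free"
else
  echo "over budget: need ${need} GB of 50 GB free, schedule mid-run aggregation"
fi
```

In practice the free-space figure would come from `df` on the data mount rather than a literal, and the check would run before the queue is launched.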
136
+
137
+ ### U7 — Storage fails on the dimension (and location) not being watched
138
+
139
+ **Symptom**: `cp`/`mkdir` fails `No space left on device`, yet `df -h` shows ~34% used — because `df -i`
140
+ reads `100%` (inodes exhausted). Or the data mount fills despite `runs/` looking small.
141
+
142
+ **Root cause**: disk dies on **inodes before bytes** — the classic trigger is **per-sample eval output**,
143
+ which writes on the order of `files_per_sample × N_samples × N_conditions` tiny files. And the real
144
+ byte-hog often hides where nobody looks: a **symlinked cache** (`~/.cache/huggingface` mapped onto the data
145
+ disk) can outweigh everything the run created.
146
+
147
+ **Fix**: monitor `df -i`, not just `df -h`, in Phase 0 and every space check. **Audit the real mount with
148
+ `du`, not assumptions** (`du -sh ~/.cache/huggingface/hub/models--* | sort -rh`). Clean by **value** — keep
149
+ the tiny irreplaceable evidence (metric/eval JSONs), drop the large reproducible scratch (periodic
150
+ checkpoints, unused caches). Cap per-sample eval visualization (cross-link verifying-dl-experiments
151
+ **REQUIRED** for the sizing policy). The *inode-cap number* is a profile fact (some platforms enforce a hard
152
+ ~200K cap; GB-quota'd platforms have none); the general fix for many small files is to **shard into tar** (U25).
153
+ Get explicit user confirmation naming `rm -rf` targets; offer "clean vs expand the disk" (principle #9).
154
+
155
+ ### U8 — Stage hot data to local NVMe before training
156
+
157
+ **Symptom**: training is I/O-bound reading from a network/shared/HDD-backed volume; GPU starves between
158
+ batches.
159
+
160
+ **Root cause**: a remote/networked filesystem (or a spinning data disk) has far lower random-read
161
+ throughput than instance-local NVMe — HDD-vs-NVMe gaps reach ~35×.
162
+
163
+ **Fix**: at job start, copy the working dataset from the durable/shared tier to instance-local NVMe scratch,
164
+ train against the local copy, write checkpoints back to durable storage. The local-NVMe path is a profile
165
+ fact (`local_nvme` in the frontmatter); the stage-then-train discipline is universal. Pairs with U24/U25.
166
+
167
+ ---
168
+
169
+ ## Memory & OOM
170
+
171
+ ### U9 — `num_workers` × a big in-RAM tensor → cgroup OOM-kill (bare "Killed", exit 137)
172
+
173
+ **Symptom**: training dies early with a bare `Killed` / `killed by signal: Killed (-9)` and **no Python
174
+ traceback**; lowering `num_workers` makes it vanish.
175
+
176
+ **Root cause**: each DataLoader worker is a `fork` that gets its **own copy** of any large object the
177
+ dataset holds (a 16384² float32 matrix ≈ 1 GB). `num_workers=W` ⇒ ~`(W+1)×` that footprint, which blows the
178
+ instance's cgroup `memory.max` even though a bare-process run fits. The kernel OOM-kills with no
179
+ Python-level error, so it reads as a mysterious crash.
180
+
181
+ **Fix**: size `num_workers` against `memory.max` and the per-worker resident set, **not** CPU count. Share
182
+ one copy across workers (memmap / module-level singleton built once) or generate the object on the fly.
183
+ Shrinking the problem also fixes it — a smaller matrix dim shrinks footprint *quadratically* (dim 1024 ≈
184
+ 4 MB, 256× less than 16384). Confirm it's OOM: `dmesg | tail` shows `Out of memory: Killed process`, and the
185
+ same config survives `num_workers=0`.
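The footprint arithmetic above can be checked with a few lines of shell. This is a rough sketch under stated assumptions: the function names are hypothetical, the `(W+1)` multiplier is the worst case described in the entry, and the limit must be passed as a number of bytes (a cgroup whose `memory.max` reads the literal `max` has no limit to size against).

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the package) for sizing num_workers against a memory limit.

# Bytes held by one square float32 matrix of side `dim` (4 bytes per element).
matrix_bytes() {
  echo $(( $1 * $1 * 4 ))
}

# Worst-case resident bytes: the main process plus one copy per worker.
loader_footprint_bytes() {
  local dim=$1 workers=$2
  echo $(( (workers + 1) * $(matrix_bytes "$dim") ))
}

# Largest num_workers for which (W+1) copies still fit under `limit` bytes.
max_workers() {
  local dim=$1 limit=$2 per w
  per=$(matrix_bytes "$dim")
  w=$(( limit / per - 1 ))
  if [ "$w" -lt 0 ]; then w=0; fi
  echo "$w"
}

matrix_bytes 16384                 # one copy of the 16384 x 16384 matrix, in bytes
loader_footprint_bytes 16384 8     # worst case with num_workers=8
max_workers 16384 8589934592       # worker ceiling under an 8 GiB limit
```

The ceiling ignores everything else the process holds (model, batches, Python itself), so treat it as an upper bound and leave headroom.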
186
+
187
+ ### U10 — VRAM OOM (a big model or a concurrent job) is distinct from cgroup-RAM OOM (U9)
188
+
189
+ **Symptom**: `torch.OutOfMemoryError: CUDA out of memory` when launching a second train/eval while another
190
+ runs, or a big model (deep transformer / unrolled net at high res) OOMs alone.
191
+
192
+ **Root cause**: **VRAM** — the sum of concurrent jobs' allocations plus fragmentation exceeds the card. NOT
193
+ host-RAM (U9).
194
+
195
+ **Fix**: check free VRAM first (`nvidia-smi --query-gpu=memory.free --format=csv,noheader`); size the batch
196
+ to fit *alongside* any concurrent job; set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to cut
197
+ fragmentation. (Run heavy DL on the box; do static/shape checks locally — cross-link
198
+ verifying-dl-experiments **REQUIRED** for local-OOM rationale.)
199
+
200
+ ### U11 — A zombie holds VRAM `nvidia-smi` cannot see → OOM on an "empty" GPU
201
+
202
+ **Symptom**: `nvidia-smi` lists no process and shows free memory, yet a fresh job OOMs immediately; common
203
+ after a crashed DDP run or a killed container.
204
+
205
+ **Root cause**: a defunct/orphaned process (or a dead container's namespace) still holds CUDA context and
206
+ VRAM, but `nvidia-smi`'s process table can't attribute it — so the GPU *looks* empty while memory is locked.
207
+
208
+ **Fix**: enumerate the real holders via the device nodes and reap them:
209
+ ```bash
210
+ fuser -v /dev/nvidia* 2>/dev/null # or: lsof /dev/nvidia* → kill -9 the listed PIDs
211
+ ```
212
+ If containerized, restart the container. Ship a small reaper that flags any PID with persistent VRAM + ~0%
213
+ util beyond a timeout — cross-link `scripts/reap_vram_zombies.sh`.
214
+
215
+ ### U41 — On a shared box, `uptime`/`free` describe the whole physical host, not your container — use cgroup-scoped readings + the `oom_kill` counter
216
+
217
+ **Symptom**: a detached run looks "dead" or "the host is overloaded" — `uptime` shows load average 400+,
218
+ `top`/`free -m` look maxed — so you suspect contention or an OOM-kill. But the job's own checkpoint `mtime`
219
+ keeps advancing and its log still grows.
220
+
221
+ **Root cause**: on a multi-tenant rental, host tools (`uptime`, `top`, `free -m`, `vmstat`) report the
222
+ **physical node you share with other tenants**, not your cgroup. A neighbor's job spikes the host load
223
+ average to ~490 while your container sits near-idle (your processes in `R`/`S`, none stuck in
224
+ uninterruptible `D`). Reading host load as your own → a false "overloaded / OOM-killed" verdict and a
225
+ needless kill-and-restart of a healthy run.
226
+
227
+ **Fix**: judge YOUR container from cgroup-scoped readings, not host tools:
228
+ - memory — `/sys/fs/cgroup/memory.current` vs `memory.max` (not `free -m`);
229
+ - were YOU OOM-killed — the **`oom_kill` counter** in `/sys/fs/cgroup/memory.events`
230
+ (`grep oom_kill /sys/fs/cgroup/memory.events`); a non-incrementing counter means you were **not**
231
+ OOM-killed, however red host `free` looks;
232
+ - CPU pressure — `/sys/fs/cgroup/cpu.stat` / `cpu.pressure`.
233
+
234
+ A high host load with your cgroup roomy and `oom_kill 0` is a **noisy neighbor**, not your bug — don't
235
+ shrink your batch or blame your code (a neighbor genuinely starving you on the shared card is U21/U23
236
+ throttle territory or a re-rent, not a code fix). Sharpens the **U3** vanished-process ladder: the
237
+ authoritative OOM check is the cgroup `oom_kill` counter, not host `dmesg`/`free` noise.
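The counter check above can be wrapped so a monitor compares against a recorded baseline instead of eyeballing the file. A minimal sketch, assuming cgroup v2 and the default path used in the entry; the function names `oom_kill_count` and `oom_killed_since` are hypothetical and not part of the shipped scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the package) around the cgroup v2 memory.events file.

# Print the oom_kill count; the path defaults to the container's own cgroup.
oom_kill_count() {
  local events_file=${1:-/sys/fs/cgroup/memory.events}
  awk '$1 == "oom_kill" { print $2 }' "$events_file"
}

# Exit 0 if the counter has grown past a previously recorded baseline, else exit 1.
oom_killed_since() {
  local baseline=$1 events_file=${2:-/sys/fs/cgroup/memory.events}
  local now
  now=$(oom_kill_count "$events_file")
  [ "${now:-0}" -gt "$baseline" ]
}

# Usage on the box: record a baseline at launch, compare when the run looks dead.
#   baseline=$(oom_kill_count)
#   ... later ...
#   if oom_killed_since "$baseline"; then echo "OOM-killed in this cgroup"; fi
```

Recording the baseline at launch matters because the counter is cumulative for the cgroup: a non-zero value alone may predate the current run.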
238
+
239
+ ---
240
+
241
+ ## Transfer & Download
242
+
243
+ ### U12 — `scp -r` of a large dir resets mid-transfer → per-dir resumable loop
244
+
245
+ **Symptom**: 30–60 min into `scp -r host:...130GB ./`, the connection drops
246
+ (`Read from remote host ... reset by peer`); local has a few dirs, the rest gone. scp does not resume.
247
+
248
+ **Root cause**: a single SSH connection carries the whole transfer; any network blip kills all of it.
249
+
250
+ **Fix**: loop **per-dir**, each its own SSH session — one failure doesn't lose the others, and re-running
251
+ skips completed dirs. Prefer `rsync -avz --partial --append-verify` (resumes a half-file). Wrap bulk pulls
252
+ in a `timeout … && break` retry loop: a stall ≠ permanent failure, and resumable transfers accumulate
253
+ progress across kills. Validate any speed test on the **same route** the real transfer uses (principle #7).
254
+ See `scripts/download_loop.sh` for the per-dir pattern.
255
+
256
+ ### U13 — `scp` into a remote dir a sibling command was supposed to create (race)
257
+
258
+ **Symptom**: a background `scp big.tar host:/root/x/` fails instantly with `dest open "/root/x/": Failure`
259
+ — the foreground command that would have `mkdir`-ed `/root/x` ran later, or was blocked/cancelled.
260
+
261
+ **Root cause**: ordering assumption between parallel/sibling commands; the destination dir didn't exist yet.
262
+
263
+ **Fix**: make every transfer self-sufficient inside its own retry loop:
264
+ `ssh host 'mkdir -p /root/x' && scp … || retry`. Never assume a sibling created the destination.
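The self-sufficient transfer above can be sketched as two small functions. This is one possible shape, not the package's own implementation: `retry` and `push_file` are hypothetical names, and the attempt count and delay are placeholders to tune.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the package).

# retry <max_attempts> <delay_seconds> <command...>
# Rerun the command until it succeeds or the attempts run out.
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# push_file <host> <local_file> <remote_dir>
# Create the destination and copy in one unit, so no sibling command is assumed.
push_file() {
  local host=$1 src=$2 dest_dir=$3
  ssh "$host" "mkdir -p '$dest_dir'" && scp "$src" "$host:$dest_dir/"
}

# Usage (placeholders): retry 5 10 push_file <host> big.tar /root/x
```

Because `mkdir -p` is idempotent, repeating it on every attempt costs one short SSH round trip and removes the ordering assumption entirely.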
265
+
266
+ ### U14 — Egress is a silent ~20% surcharge; co-locate and stay same-AZ
267
+
268
+ **Symptom**: the monthly bill is ~20% over the rented GPU-hours; a large model/dataset re-pulled daily from
269
+ a hyperscaler bucket dominates cost (a 140 GB model pulled daily from S3 ≈ $378/mo in egress alone).
270
+
271
+ **Root cause**: hyperscaler **egress** is metered (AWS ~$0.09/GB, GCP ~$0.08, Azure ~$0.087) while most
272
+ GPU-clouds (Lambda/RunPod/vast/CoreWeave) charge $0. Worse, **cross-AZ traffic bills ~$0.01/GB each
273
+ direction even inside one provider** — storage in a different zone than compute quietly meters every read.
274
+
275
+ **Fix**: co-locate storage with compute on the **same provider AND same AZ/region**. Pull a dataset once to
276
+ durable local storage, not per-epoch from a remote bucket. Record `free_egress` / `egress_per_gb` /
277
+ `cross_az_per_gb` as profile fields and prefer a $0-egress GPU-cloud for transfer-heavy jobs.
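The egress figure quoted in the symptom can be reproduced with a one-line calculation. A minimal sketch: `egress_cost_usd` is a hypothetical helper, and the per-GB prices are the approximate values quoted above, not live pricing.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the package).

# Monthly egress cost in USD: size_gb * price_per_gb * pulls_per_month.
egress_cost_usd() {
  awk -v gb="$1" -v price="$2" -v pulls="$3" \
    'BEGIN { printf "%.2f\n", gb * price * pulls }'
}

egress_cost_usd 140 0.09 30   # 140 GB model, ~$0.09/GB, pulled daily for 30 days
egress_cost_usd 140 0.01 30   # the same pulls billed at the ~$0.01/GB cross-AZ rate, one direction
egress_cost_usd 140 0 30      # a $0-egress GPU cloud
```

Running the same numbers against a profile's `egress_per_gb` and `cross_az_per_gb` fields before choosing a platform makes the transfer cost visible up front.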
278
+
279
+ ### U15 — Compress before the wire
280
+
281
+ **Symptom**: checkpoint/dataset transfers are slow and (on metered egress) expensive.
282
+
283
+ **Root cause**: raw tensors and JSON cross the network uncompressed.
284
+
285
+ **Fix**: zstd/gzip the payload before transfer — cuts checkpoints+datasets 30–60%, JSON 60–80%; store
286
+ weights fp16/int8 where the task tolerates it. Compounds with U14 (less egress $) and U12 (fewer bytes to
287
+ resume). Pairs with U25 (tar shards compress and transfer as one stream).
288
+
289
+ ---
290
+
291
+ ## Monitoring
292
+
293
+ ### U16 — Stale background waiters pile up; supersede a run → STOP its waiter; pick the right lifetime
294
+
295
+ **Symptom**: a "Background tasks" panel shows 8+ "Running" wait-loops at 500–740 min elapsed, each
296
+ ssh-polling every ~20 s, while the GPU is idle and the experiment finished hours ago.
297
+
298
+ **Root cause**: every kill+restart of a flaky saga armed a NEW `until ssh grep MARKER; do sleep; done`
299
+ waiter but never stopped the OLD one — its marker (in a superseded log) never appears, so it loops forever.
300
+ A `run_in_background` waiter is **not** time-capped (a 781 s task ran to completion + notified; the ~600 s
301
+ cap is on **foreground** Bash only). The real silent-failure mode is a waiter that never EXITS (U17).
302
+
303
+ **Fix**: one waiter per live run. When superseding a run, stop the old waiter first (`TaskStop`; cross-session
304
+ IDs aren't stoppable from a resumed session — dismiss those from the UI). Multi-hour wait → a **persistent
305
+ Monitor** (no 10-min cap) + a stall-detector emit so a hung run still notifies. A persistent Monitor dies on
306
+ session resume → after any resume, check the remote ground-truth directly (`tmux ls`, `grep DONE log`,
307
+ `nvidia-smi`); never trust a monitor that may be gone (principle #3).
308
+
309
+ ### U17 — A silent background monitor that never returns: usually an unquoted `|` in grep
310
+
311
+ **Symptom**: a `run_in_background` ssh monitor never returns / never notifies; `pgrep` shows a process
312
+ "alive." The run looks hung — but the actual job finished and wrote results fine.
313
+
314
+ **Root cause**: the wrapper never EXITED because a sub-command blocks forever. The classic bug is an
315
+ **unquoted `|` in grep** — `grep -hE noise-sweep|snr=|wrote log` — the shell splits it into THREE piped
316
+ commands, and the first (`grep -hE noise-sweep`, no filename) reads **stdin** → blocks forever → the
317
+ pipeline never returns → ssh never returns → the local background process never exits → no completion
318
+ notification. (Background tasks notify on EXIT only — no 600 s cap; foreground Bash is the capped one, U16.)
319
+
320
+ **Fix — robust remote-poll template**:
321
+ - **Quote every regex AND give grep a filename**: `grep -hE 'noise-sweep|snr=|wrote' log` (a `|` inside quotes is alternation; a filename means read the file, never stdin).
322
+ - **Bound the ssh**: `ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 -o ServerAliveCountMax=3 …` — a blip self-kills in ~30 s instead of half-open hanging for minutes.
323
+ - **Short-connection poll, not one long-held ssh**: each poll = ssh in → check → disconnect; loop locally with a bounded counter.
324
+ - **Verify by artifact, not notification**: when it "looks done," Read the local output + a fresh `ssh 'grep DONE log; tmux ls; nvidia-smi'` to confirm ground truth (cross-link verifying-dl-experiments **REQUIRED**); don't wait on a notification that may never fire.
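The first two bullets of the template can be sketched as follows. This is an illustration under assumptions: `count_markers` and `remote_grep` are hypothetical names, the log lines are made up for the demo, and the marker strings are the ones used in the entry.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the package).

# Quoted pattern plus a filename: `|` is regex alternation and grep reads the file, never stdin.
# The unquoted form (grep -hE noise-sweep|snr=|wrote log) is three piped commands and blocks on stdin.
count_markers() {
  grep -cE 'noise-sweep|snr=|wrote' "$1"
}

# One short, bounded ssh per poll; the options are the ones listed in the template above.
remote_grep() {
  local host=$1 pattern=$2 file=$3
  ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 -o ServerAliveCountMax=3 \
    "$host" "grep -hE '$pattern' '$file'"
}

# Local demo on a throwaway log file.
log=$(mktemp)
printf '%s\n' 'epoch 1 loss=0.9' 'noise-sweep start' 'snr=10 done' 'wrote log to out.csv' > "$log"
count_markers "$log"
rm -f "$log"
```

A local loop would call `remote_grep` with a bounded counter and a `sleep` between polls, so every poll is its own short connection.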
325
+
326
+ ### U18 — "I'll check periodically" is a lie unless a trigger is armed; two-leg remote self-completion
327
+
328
+ **Symptom**: a promise to monitor a multi-hour remote run, then no report for a day — because between turns
329
+ the assistant does not run. A cloud scheduler set up to "ssh in and check" silently can't reach the box.
330
+
331
+ **Root cause**: two conflated things. (a) Making the REMOTE self-complete (a waiter that blocks on a log
332
+ marker then runs eval) guarantees RESULTS but gives no *reporting cadence* — nothing re-invokes the
333
+ assistant on a timer. (b) A cloud schedule runs in an isolated sandbox with its own checkout and **no access
334
+ to the local SSH key or network** → it cannot `ssh` the rented box, and the SSH private key must **never** go
335
+ into a cloud agent (secret-leak).
336
+
337
+ **Fix — the two-leg pattern**:
338
+ - **Remote self-completion (guaranteed, survives session/SSH death)**: chain `train → eval → touch marker` under one `nohup ... </dev/null >log 2>&1 &`. Detect "done" by a **log marker** (`grep -q 'QUEUE DONE' master.log`), NEVER by `pgrep`: the waiter's own command line contains the pattern, so `pgrep -f` matches the waiter and the loop never ends (self-match, U4; never-exiting waiter, U17).
339
+ - **Live progress (best-effort)**: a session-bound local loop (e.g. `/loop 30m` / cron `3,33 * * * *`) that ssh-polls with the *local* key. Be honest it dies when the session closes — the remote still finishes; the user re-pings to pull.
340
+ - **Don't promise autonomous cross-session polling you can't deliver.** (`tmux` is often absent on a fresh box and `apt-get install` fails offline — `nohup ... </dev/null >log 2>&1 &` is zero-dependency and survives SSH drop; verify with `pgrep -af <script>`.) Full architecture → `references/monitoring_patterns.md`.
341
+
342
+ ### U19 — Tracker run deletion lags; a fresh export resurrects "deleted" runs
343
+
344
+ **Symptom**: `run.delete()` returns, but an immediate `api.runs()` still lists every deleted run; a batch
345
+ history-export minutes later happily re-downloads `<run>__history.csv` for runs just deleted.
346
+
347
+ **Root cause**: deletion is asynchronous server-side; list/export endpoints serve stale listings for
348
+ minutes.
349
+
350
+ **Fix**: delete → re-verify on a **later** monitoring tick (not a tight loop; a second
351
+ `delete(delete_artifacts=True)` pass is safe). Order matters: do cloud deletions **before** local exports,
352
+ then re-check the export dir for resurrected files and remove them. (cross-link verifying-dl-experiments
353
+ **REQUIRED** for tracker forensics.)
354
+
355
+ ### U20 — Local logs die with the instance: use a hosted tracker
356
+
357
+ **Symptom**: TensorBoard event files written to an ephemeral box vanish on teardown — every curve gone after
358
+ the meter-stop verb runs.
359
+
360
+ **Root cause**: a rented box's local disk is not durable past `terminate`/`destroy` (principle #4); the
361
+ metric history lived only there.
362
+
363
+ **Fix**: log metrics to a **hosted tracker** so they survive teardown — `trackio.init(space_id=...)` or
364
+ `wandb` online (push under the platform's proxy if behind a firewall). Poll the tracker's structured alerts
365
+ as the monitor instead of brittle ssh-tail. Cross-link huggingface-skills:huggingface-trackio **REQUIRED**
366
+ for the `init/log/finish/alert` mechanics and `space_id` sync.
367
+
368
+ ### U43 — A detached run's log looks frozen for minutes though training is fine: stdout is block-buffered off a TTY
369
+
370
+ **Symptom**: a `nohup`/`tmux` run prints a few lines then nothing for many minutes; it reads as
371
+ "hung / died" and the reflex is to kill it — but checkpoint `mtime`, TB scalars, and `nvidia-smi` all show
372
+ it advancing.
373
+
374
+ **Root cause**: Python (and libc stdio) **line-buffer when stdout is a TTY but block-buffer (~4–8 KB) when
375
+ it is a pipe or file** — exactly the detached case. The log only flushes when the buffer fills, so a
376
+ healthy run looks silent and a `grep`-on-log liveness check false-alarms on the gap.
377
+
378
+ **Fix**: run unbuffered — `python -u` or `PYTHONUNBUFFERED=1` (the shipped `scripts/run_one.sh.template`
379
+ already exports it); for a shell pipeline use `stdbuf -oL`. And judge liveness by **artifacts, not stdout
380
+ cadence** — checkpoint `mtime`, the TB scalar API, `nvidia-smi` (monitoring_patterns §0 corollary; the
381
+ deeper "is it actually hung?" attach is py-spy, throughput-profiling **T21**). A frozen log is the single
382
+ most common false "dead run."
383
+
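A liveness check keyed on artifact `mtime` rather than log cadence might look like the following sketch. The helper names and the idea of passing a list of checkpoint or event-file paths are assumptions for illustration; the staleness window is whatever your checkpoint interval implies.

```python
import os
import time


def newest_mtime(paths):
    """Most recent modification time among the paths that exist, or None."""
    times = [os.path.getmtime(p) for p in paths if os.path.exists(p)]
    return max(times) if times else None


def is_alive(paths, max_age_s, now=None):
    """True when at least one artifact was modified within max_age_s seconds."""
    now = time.time() if now is None else now
    latest = newest_mtime(paths)
    return latest is not None and (now - latest) <= max_age_s
```

Passing `now` explicitly keeps the check deterministic in tests; in a monitor loop, leave it unset.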
384
+ ---
385
+
386
+ ## GPU health
387
+
388
+ ### U21 — `nvidia-smi` GPU-Util % is a liar
389
+
390
+ **Symptom**: the perf tile reads 100% util but throughput is poor; or util looks "busy" while the job is
391
+ actually starved (the inverse of U38, which is the 0%-but-running case).
392
+
393
+ **Root cause**: `GPU-Util` is the share of the sampling window in which at least one kernel was executing,
+ not how much of the GPU those kernels filled. Back-to-back tiny kernels on a handful of SMs read as 100%.
395
+
396
+ **Fix**: correlate util with **SM clock** (`clocks.current.sm`), memory-bandwidth util, and power draw —
397
+ `nvidia-smi dmon -s pucvmet -d 1`. Low SM clock or low power at "100% util" means the GPU is underfed (go to
398
+ U24). Always sample over several seconds, never one snapshot.
399
+
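The correlation can be automated over several samples, as in this sketch. It assumes samples come from `nvidia-smi --query-gpu=utilization.gpu,clocks.sm,power.draw --format=csv,noheader,nounits -l 1`; the thresholds (`util_floor`, `sm_frac`, `power_frac`) are illustrative guesses, not values from the source, and the function names are hypothetical.

```python
def parse_samples(csv_text):
    """Parse 'util, sm_mhz, power_w' rows (noheader, nounits) into float tuples."""
    samples = []
    for line in csv_text.strip().splitlines():
        util, sm_mhz, power_w = (float(field) for field in line.split(","))
        samples.append((util, sm_mhz, power_w))
    return samples


def looks_underfed(samples, max_sm_mhz, power_limit_w,
                   util_floor=90.0, sm_frac=0.6, power_frac=0.5):
    """High mean util combined with a low SM clock or low power draw."""
    if not samples:
        return False
    n = len(samples)
    util = sum(s[0] for s in samples) / n
    sm_mhz = sum(s[1] for s in samples) / n
    power_w = sum(s[2] for s in samples) / n
    # Assumed thresholds: tune sm_frac / power_frac per GPU model.
    starved = sm_mhz < sm_frac * max_sm_mhz or power_w < power_frac * power_limit_w
    return util >= util_floor and starved
```

Averaging over the window is the point: a single snapshot is exactly what the fix warns against. Rows where a field reads `[N/A]` would need filtering before `float()`.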
400
+ ### U22 — Xid 48/79 = a dead GPU; on a rental, re-rent
401
+
402
+ **Symptom**: training crashes or the GPU drops out; `dmesg | grep -i xid` shows an Xid error.
403
+
404
+ **Root cause**: Xid is NVIDIA's canonical hardware-fault signal. **Xid 48 = double-bit ECC (the GPU is
405
+ dead); Xid 79 = "GPU has fallen off the bus."** These are hardware, not code.
406
+
407
+ **Fix**: on a *rental* the card can't be reseated — **stop the instance and re-rent a different box**; don't
408
+ burn hours debugging code for a hardware fault. Check `dmesg | grep -i xid` as part of the "vanished
409
+ process" ladder (U3) when the GPU goes idle unexpectedly.
410
+
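The `dmesg` check can be folded into a monitor with a small parser. This sketch assumes kernel log lines have the usual `NVRM: Xid (<pci-id>): <code>, ...` shape; the function names are hypothetical and only the two codes named above are treated as fatal.

```python
import re

# Only the two codes discussed above; extend per NVIDIA's Xid documentation.
FATAL_XIDS = {48: "double-bit ECC error", 79: "GPU has fallen off the bus"}
XID_RE = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xids(dmesg_text):
    """Return (pci_device, xid_code) for every Xid line in a dmesg dump."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(dmesg_text)]


def should_rerent(dmesg_text):
    """True when any reported Xid is one of the hardware-fatal codes."""
    return any(code in FATAL_XIDS for _, code in find_xids(dmesg_text))
```

Feed it the captured output of `dmesg` from the box; any other Xid codes are returned by `find_xids` for manual triage rather than triggering a re-rent.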
411
+ ### U23 — Thermal/power throttling silently steals 25–40% with no error
412
+
413
+ **Symptom**: "the same code is slower than yesterday" — no error, no crash, just lower throughput.
414
+
415
+ **Root cause**: the GPU is thermal- or power-throttling (an H100 throttles around 83 °C; target <75 °C). On
416
+ a shared rental, cooling/power headroom is outside tenant control.
417
+
418
+ **Fix**: detect — SM clock falling below base while temp >83 °C, or
419
+ `nvidia-smi -q -d PERFORMANCE` showing a throttle reason. A tenant can't fix cooling → **flag it and
420
+ re-rent** a healthier box; don't read the slowdown as a model/data regression. Pairs with U21 (clocks expose
421
+ it where util% hides it).
422
+
423
+ ---
424
+
425
+ ## Dataloader & IO
426
+
427
+ ### U24 — GPU starves at 10–70% waiting on the dataloader, not on compute
428
+
429
+ **Symptom**: util sits well below 100% (but nonzero), step log advances slowly; profiling shows time spent
430
+ in data fetch, not fwd/bwd.
431
+
432
+ **Root cause**: the input pipeline can't keep the GPU fed — too few workers, no prefetch, host↔device copies
433
+ on the critical path. (Distinct from U38's *0%* CPU-data-bound transform case; this is the partial-starve
434
+ knob set.)
435
+
436
+ **Fix — tune in order**: `num_workers = cores − 1` (sized against per-worker footprint, U9),
437
+ `persistent_workers=True`, `pin_memory=True`, `prefetch_factor=2` (the PyTorch default whenever `num_workers > 0`, so set it explicitly only to raise it). Pathological cases show >100× gaps from
438
+ these alone. If a heavy per-sample transform is the bottleneck, move it to the GPU (cross-link
439
+ verifying-dl-experiments **REQUIRED** for the 0%-util diagnosis, U38). Pairs with U8 (stage to NVMe) and U25.
440
+
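The knob set can be derived from the usable core count, as in this sketch. `loader_kwargs` is a hypothetical helper; it only builds the keyword arguments and leaves constructing the `torch.utils.data.DataLoader` to the caller. It guards the zero-worker case because `DataLoader` rejects `persistent_workers` and `prefetch_factor` when `num_workers=0`.

```python
def loader_kwargs(usable_cores, prefetch_factor=2):
    """Build DataLoader keyword arguments following the 'cores - 1' rule above."""
    workers = max(usable_cores - 1, 0)
    kwargs = {"num_workers": workers, "pin_memory": True}
    if workers > 0:
        # These two options are only valid when worker processes exist.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs
```

Use it as `DataLoader(dataset, **loader_kwargs(n))`, where `n` is the cgroup quota on a sliced rental (U40) rather than the host core count, and still check the per-worker memory footprint (U9).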
441
+ ### U25 — Millions of small files on a network/object store → transaction-overhead death; shard into tar
442
+
443
+ **Symptom**: a dataset of many tiny files streams glacially from a shared/object store; or eval output of
444
+ tens of thousands of per-sample files exhausts inodes (U7) or blows a visualization grid (U37).
445
+
446
+ **Root cause**: per-file open/stat/close overhead dominates on networked/object storage; the inode and
447
+ metadata cost scales with file *count*, not bytes.
448
+
449
+ **Fix**: pack into **sharded tar** (WebDataset), a few hundred MB per shard → 3–10× faster sequential I/O and
450
+ the only sane pattern for streaming from S3. This is the **general form** of the inode-exhaustion trap (U7)
451
+ and the per-sample-vis trap — cap and shard rather than emitting a file per sample. Pairs with U8 (stage the
452
+ shards to local NVMe) and U15 (shards compress as one stream).
453
+
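The packing step can be approximated with the standard library alone. This is a plain-`tarfile` sketch of size-capped sharding, not the `webdataset` writer itself; the shard naming and the `shard_files` helper are assumptions for illustration.

```python
import os
import tarfile


def shard_files(paths, out_dir, max_shard_bytes):
    """Pack files into sequential tar shards, rolling over at the size cap."""
    os.makedirs(out_dir, exist_ok=True)
    shards, tar, used = [], None, 0
    for path in paths:
        size = os.path.getsize(path)
        # Start a new shard when this file would push a non-empty shard past the cap.
        if tar is None or (used > 0 and used + size > max_shard_bytes):
            if tar is not None:
                tar.close()
            name = os.path.join(out_dir, f"shard-{len(shards):06d}.tar")
            tar, used = tarfile.open(name, "w"), 0
            shards.append(name)
        # Members are stored by basename; colliding basenames would need a prefix.
        tar.add(path, arcname=os.path.basename(path))
        used += size
    if tar is not None:
        tar.close()
    return shards
```

A file larger than the cap still gets its own shard rather than being split, which keeps every sample readable from a single sequential stream.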
454
+ ### U40 — A vCPU-sliced rental starves its own GPU: torch intra-op threads default to the HOST core count, not your cgroup quota
455
+
456
+ **Symptom**: GPU `sm%` sits ~5–15% and runs grind, but the dataloader is not the bottleneck (few/no
457
+ workers, data already on-device, the U24 knobs don't help); `top` shows dozens of python threads fighting
458
+ over a handful of cores.
459
+
460
+ **Root cause**: you rent a **cgroup CPU slice** (e.g. 12 vCPUs of a 64-core host), but torch/OpenMP size
461
+ their intra-op thread pools to the **physical** core count — `torch.get_num_threads()` / `OMP_NUM_THREADS`
462
+ come up ~64. ~57 runnable threads thrashing 12 cores burn the slice on context-switching, so the CPU side
463
+ that launches kernels and feeds the GPU can't keep up and the card idles. No OOM, no error — pure scheduler
464
+ thrash (the *host scheduling* starves the GPU, the inverse of being data-bound).
465
+
466
+ **Fix**: cap the pools to your **slice's** vCPU count before launch:
+ `export OMP_NUM_THREADS=4 MKL_NUM_THREADS=4` (and/or `torch.set_num_threads(4)`; `4` here stands for a 4-vCPU slice, so substitute your own count), then confirm torch honoured it
+ (`python -c "import torch; print(torch.get_num_threads())"` → 4, not 64). Read the real quota from the
+ cgroup, not `nproc` (which respects CPU affinity but not the CFS quota, so it usually reports host cores): on cgroup v2, `cat /sys/fs/cgroup/cpu.max` → `quota period`, vCPUs ≈
+ quota/period (`max` in place of the quota means no limit); on cgroup v1 read `cpu.cfs_quota_us` and `cpu.cfs_period_us` instead. Bake the cap into the launch wrapper so every queue cell inherits it. Distinct from **U9**
+ (workers × RAM → cgroup OOM) and **U24** (dataloader starvation); the triage that catches it is
+ throughput-profiling **T3** (GPU SM% low while a python thread-storm pegs the cores).
473
+
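Reading the quota programmatically might look like this sketch, which assumes a cgroup v2 host where `cpu.max` holds `<quota> <period>` or `max <period>`. Both function names are hypothetical.

```python
import math


def vcpus_from_cpu_max(text):
    """Parse cgroup v2 cpu.max content; None means no quota is set."""
    quota, period = text.split()
    if quota == "max":
        return None
    # Round down, but never report fewer than one usable vCPU.
    return max(1, math.floor(int(quota) / int(period)))


def read_slice_vcpus(path="/sys/fs/cgroup/cpu.max"):
    """Return the slice's vCPU count, or None when the file is absent (e.g. cgroup v1)."""
    try:
        with open(path) as fh:
            return vcpus_from_cpu_max(fh.read())
    except OSError:
        return None
```

A launch wrapper could set `os.environ["OMP_NUM_THREADS"]` from the result before importing torch, or pass it to `torch.set_num_threads()`; when the function returns `None`, fall back to a value you choose by hand.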
474
+ ---
475
+
476
+ ## Env & Container
477
+
478
+ ### U26 — CRLF breaks `.sh` on Linux (authored on Windows)
479
+
480
+ **Symptom**: a synced launcher silently does nothing (empty log); run by hand it errors `set: -: invalid
481
+ option`, `cd: /path\r: No such file or directory`, `syntax error near unexpected token $'do\r'` — every
482
+ line "ends in `\r`."
483
+
484
+ **Root cause**: Windows `core.autocrlf=true` (or `git archive` exporting working-tree EOL) writes `.sh` with
485
+ CRLF; Linux `bash` treats the trailing `\r` as part of each token. `.py` is unaffected (Python's universal
486
+ newlines); it is specifically `bash`/`.sh` that breaks.
487
+
488
+ **Fix**: add `.gitattributes` with `*.sh text eol=lf` (so `git archive`/checkout always emits LF); immediate
489
+ on-box unblock: `sed -i 's/\r$//' scripts/*.sh`.
490
+
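A pre-sync check for CRLF launchers can be sketched in a few lines. The helper names are hypothetical; `strip_crlf` is roughly what the `sed` one-liner above does, applied to one file.

```python
from pathlib import Path


def crlf_scripts(root):
    """Return .sh files under root that contain at least one CRLF line ending."""
    return sorted(p for p in Path(root).rglob("*.sh") if b"\r\n" in p.read_bytes())


def strip_crlf(path):
    """Rewrite one file in place with LF line endings."""
    data = Path(path).read_bytes()
    Path(path).write_bytes(data.replace(b"\r\n", b"\n"))
```

Running `crlf_scripts` locally before each sync catches the problem before the launcher reaches the box; the `.gitattributes` rule remains the durable fix.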
491
+ ### U27 — `-o dotted.key=value` overrides explode on null parents → freeze protocols as overlay config FILES
492
+
493
+ **Symptom**: `-o evaluation.sps_augmentation.enable=true` crashes
494
+ `KeyError: Override path '...' is not a mapping` because the base YAML has the parent as `null`. Worse
495
+ long-term: protocol variants that exist only as one-off CLI strings are unreproducible months later.
496
+
497
+ **Root cause**: dotted-key override traversal can't descend through a `null` parent; and a CLI-string-only
498
+ protocol has no diffable, reviewable record.
499
+
500
+ **Fix**: define each protocol variant as a small overlay config (`configs/eval_overlays/<protocol>.yaml` with
501
+ `_base_:` pointing at the canonical leaf) and pass it via `-c`. Reviewable, diffable, immune to null-parent
502
+ traversal. This is also the **retry-the-identical-config mechanism** (principle #7): an overlay file is a
503
+ stable config a retry re-uses byte-for-byte. To reconstruct a historical protocol, read the artifact
504
+ manifest (`*_manifest.json` records the resolved overrides verbatim).
505
+
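The failure mode and the overlay remedy can be reproduced in miniature. This is a hypothetical reconstruction, not the project's actual config loader: `apply_override` mimics dotted-key traversal and `merge_overlay` mimics a `_base_`-style deep merge.

```python
def apply_override(cfg, dotted_key, value):
    """Set cfg[a][b][c] = value for 'a.b.c'; every parent must already be a mapping."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for depth, key in enumerate(parents):
        child = node.get(key)
        if not isinstance(child, dict):
            path = ".".join(parents[: depth + 1])
            raise KeyError(f"Override path '{path}' is not a mapping")
        node = child
    node[leaf] = value
    return cfg


def merge_overlay(base, overlay):
    """Deep-merge overlay into a copy of base; an overlay mapping replaces a null parent."""
    out = dict(base)
    for key, val in overlay.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge_overlay(out[key], val)
        else:
            out[key] = val
    return out
```

The overlay succeeds where the dotted override fails because it supplies the whole sub-mapping, so there is no null parent to descend through.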
506
+ ### U28 — The CUDA-toolkit ↔ host-driver ↔ torch-build triangle
507
+
508
+ **Symptom**: `detected CUDA version mismatches the version used to compile PyTorch`; or `no kernel image is
509
+ available for execution` at the first forward on a new-arch GPU.
510
+
511
+ **Root cause**: three independently-versioned layers must agree — **the host driver is host-global and a
512
+ tenant usually cannot change it on a rental; the CUDA toolkit is per-env and changeable; the torch build
513
+ must match both.** The toolkit must be ≤ what the host driver supports; a project that pins
514
+ `torch<2.9` can *downgrade* the only build with kernels for a new-arch card (e.g. sm_120).
515
+
516
+ **Fix**: keep the image's working torch — filter framework pins out of the remote install:
517
+ ```bash
+ # Drop only the framework pins; the trailing class keeps torchmetrics, torch-geometric, etc.
+ grep -ivE '^(torch|torchvision|torchaudio)([^a-z0-9_.-]|$)' requirements.txt > /root/req_remote.txt
+ pip install -r /root/req_remote.txt
+ ```
521
+ Set `LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH` when the per-env toolkit must win. Smoke-test
+ `torch.cuda.get_device_capability()` plus a heavy project import before launching; the out-of-band torch version (the image's build, not the project pin)
+ lands in the runtime snapshot, so disclose it with results. `host_driver_cuda_max` is a profile field.
524
+
525
+ ### U29 — "Same version, different result": top-level pins let transitive deps drift → install from a lockfile
526
+
527
+ **Symptom**: two installs of the "same" `requirements.txt` produce different behavior/results.
528
+
529
+ **Root cause**: a hand-edited `requirements.txt` pins only top-level packages; transitive dependencies drift
530
+ between installs.
531
+
532
+ **Fix**: install from a **lock file** (`uv lock` / `pip-tools` / `conda-lock`) that pins the full resolved
533
+ graph, not a hand-edited top-level list. Pairs with U28 (filter the framework pins, then lock the rest).
534
+
535
+ ### U30 — A Dockerfile is NOT reproducible: pin the base image by `@sha256:` digest
536
+
537
+ **Symptom**: a container built from the "same" Dockerfile months apart behaves differently.
538
+
539
+ **Root cause**: `FROM image:latest` (or any moving tag) resolves to a different layer set over time.
540
+
541
+ **Fix**: pin the base image by content digest (`FROM image@sha256:<digest>`, not `:latest`) so every build
+ starts from byte-identical base layers. Later `RUN` steps that install packages still need their own pins (U29). (`pin_image_by_sha256` is a per-platform expectation where the image is the env
+ contract.)
544
+
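A lint for moving tags can be sketched as follows. It is a simplified parser written for illustration (the function name is hypothetical): it handles `FROM ... AS <stage>` and `--platform=`, skips `scratch` and references to earlier stages, and ignores `ARG`-substituted image names.

```python
import re

FROM_RE = re.compile(
    r"\s*FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?", re.IGNORECASE
)


def unpinned_base_images(dockerfile_text):
    """Return FROM image references that are not pinned by an @sha256: digest."""
    stages, unpinned = set(), []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image, alias = match.group(1), match.group(2)
        is_stage = image in stages or image.lower() == "scratch"
        if not is_stage and "@sha256:" not in image:
            unpinned.append(image)
        if alias:
            stages.add(alias)
    return unpinned
```

An empty result means every external base image is digest-pinned; it says nothing about packages installed later in the build.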
545
+ ### U31 — Container runs but trains 100× slower = the GPU was never attached (CPU-only)
546
+
547
+ **Symptom**: a containerized job runs to completion but is absurdly slow; loss curves look normal, just
548
+ glacial.
549
+
550
+ **Root cause**: the container has no GPU: it was launched without `--gpus all`, or the NVIDIA Container Toolkit is
+ missing/too old, so the framework found no CUDA device and silently ran on the CPU.
552
+
553
+ **Fix**: `docker run --gpus all …`, NVIDIA Container Toolkit ≥1.14, and **validate `nvidia-smi` *inside* the
554
+ container before training** — never assume GPU attachment from a clean `docker run`.
555
+
556
+ ### U42 — The box runs a hand-synced copy with no git remote; a fix you "committed" may not be deployed — verify it is ON the box before trusting a run or tearing down
557
+
558
+ **Symptom**: a bug you fixed and committed locally still reproduces on the box, or an eval runs on stale
559
+ logic (wrong default, missing speedup, pathologically slow), even though local `git log` shows the fix
560
+ landed.
561
+
562
+ **Root cause**: most rentals have **no git remote** — the box holds a working tree you pushed by
563
+ `scp`/`rsync`/`tar-over-ssh`, so its code only advances when you re-sync. A local commit changes nothing on
564
+ the box; an interrupted or wrong-path sync, or simply forgetting, leaves the box pre-fix. "I committed it"
565
+ ≠ "it's running on the box."
566
+
567
+ **Fix**: treat code deploy like the checked-sync (**U33**) — **verify, don't assume**. After syncing, grep
568
+ the box for the change before relying on it:
569
+ ```bash
+ ssh "$HOST" "grep -n '<new symbol / changed line>' /root/<proj>/path/file.py" || echo 'NOT DEPLOYED'
+ ```
572
+ or compare a hash (`ssh host 'sha256sum file'` vs local). Make it a pre-flight for any run whose result
573
+ depends on the fix, and part of the **Phase-5 teardown gate** — a verdict produced by stale code is not the
574
+ verdict you think it is (principle #3). Pairs with **U29/U30** (pin deps/image): code AND environment must
575
+ both be the version you believe.
576
+
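The hash comparison can be scripted so the pre-flight is a single boolean. This sketch assumes you have already captured the output of `sha256sum <file>` from the box as a string; `file_sha256` and `is_deployed` are hypothetical helper names.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hex SHA-256 of a local file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_deployed(local_path, remote_sha256sum_output):
    """Compare a local file against the first field of remote `sha256sum` output."""
    fields = remote_sha256sum_output.split()
    return bool(fields) and fields[0].lower() == file_sha256(local_path)
```

Empty remote output (file missing, ssh failure) counts as not deployed, which matches the "verify, don't assume" rule.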
577
+ ---
578
+
579
+ ## Cost & teardown
580
+
581
+ ### U32 — A task's default epochs differ from another task's; CLI `--epochs` silently overrides the right value
582
+
583
+ **Symptom**: one CLI `--epochs N` is applied to all ablations; a subset (e.g. detection vs recon/seg)
584
+ consistently underperforms; a reviewer flags it.
585
+
586
+ **Root cause**: some task families need more epochs to converge and default to a higher value in their YAML;
587
+ a blanket CLI `--epochs` silently overrides that per-task default.
588
+
589
+ **Fix**: make the queue support a per-line epoch field (e.g. recon/seg `20`, det `50`); audit the codebase's
590
+ YAML for `epochs:` declarations before deploying (`grep -rE '^\s*epochs:' configs/ | sort -u`). This is a
591
+ config-drift instance — really a smoke/sanity target (cross-link verifying-dl-experiments **REQUIRED**).
592
+
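The YAML audit can be turned into a pre-deploy check that names the configs a blanket `--epochs` would override. This is a regex-based sketch that assumes `epochs:` appears as a plain integer on its own line in `*.yaml` files; the function names are hypothetical.

```python
import re
from pathlib import Path

EPOCHS_RE = re.compile(r"^\s*epochs:\s*(\d+)\s*(?:#.*)?$", re.MULTILINE)


def epoch_defaults(config_root):
    """Map each YAML file under config_root to the epochs values it declares."""
    found = {}
    for path in sorted(Path(config_root).rglob("*.yaml")):
        values = [int(v) for v in EPOCHS_RE.findall(path.read_text())]
        if values:
            found[str(path.relative_to(config_root))] = values
    return found


def conflicts_with_cli(config_root, cli_epochs):
    """Return the configs whose declared epochs differ from the CLI value."""
    return {
        name: values
        for name, values in epoch_defaults(config_root).items()
        if any(v != cli_epochs for v in values)
    }
```

A non-empty result from `conflicts_with_cli` is the signal to use the per-line epoch field instead of one global flag.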
593
+ ### U33 — Silent sync failure: gate the success line on the actual copy result
594
+
595
+ **Symptom**: a wrapper prints `auto-synced <name> to durable storage` for every job, but at download time
596
+ the durable dir is missing or empty.
597
+
598
+ **Root cause**: the sync block does `mkdir -p "$DST"; cp -f ... 2>/dev/null` then `echo synced`
599
+ **unconditionally** — it never checks the exit code. When the durable FS is inode-exhausted (U7) `mkdir`
600
+ fails but the success line still fires, so monitoring looks green while nothing landed (principle #3).
601
+
602
+ **Fix — checked, gated sync**:
603
+ ```bash
+ if mkdir -p "$DST" && cp -f "$CKPT_DIR/best.pth" "$DST/" && [ -f "$DST/best.pth" ]; then
+   echo "[$(date +%H:%M:%S)] auto-synced $NAME to durable storage"
+ else
+   echo "[$(date +%H:%M:%S)] !! SYNC FAILED for $NAME (check df -i) — data disk is still source-of-truth"
+ fi
+ ```
610
+ Until a download is verified locally, trust the **data-disk** copy, not the "synced" log line. The shipped
611
+ `scripts/run_one.sh.template` carries the checked version.
612
+
613
+ ---
614
+
615
+ ## Secrets & trackers
616
+
617
+ ### U34 — Move credentials to the box without the secret ever appearing in a command
618
+
619
+ **Symptom**: pasting a key into an ssh/scp command leaks it into shell history, transcripts, and hook logs;
620
+ security hooks (rightly) block scp-ing a whole `~/.netrc` (it carries other machines' credentials).
621
+
622
+ **Root cause**: any secret inside a command string is captured by history/transcript/hook logging.
623
+
624
+ **Fix**: stream exactly one machine block via **stdin** — the value flows file→pipe→file and never appears in
625
+ any command text or output:
626
+ ```bash
+ grep -A 2 'machine api.wandb.ai' ~/.netrc | ssh <host> 'umask 077; cat > /root/.netrc && chmod 600 /root/.netrc'
+ ```
629
+ Verify by capability, not by echoing the value:
630
+ `python -c "import wandb; print(wandb.Api(timeout=20).default_entity)"`. Never write the secret to a
631
+ shared/durable FS that a platform classifier scans (that platform detail is a profile fact).
632
+
633
+ ### U35 — `WANDB_MODE=offline` still dies without an API key in wrapper stacks → zero curves
634
+
635
+ **Symptom**: a run launched with `WANDB_MODE=offline` expecting "log locally, sync later" produces **no offline
636
+ run dirs at all**; the train log shows `Disabled WandB due to initialization error: No API key configured`.
637
+
638
+ **Root cause**: bare-SDK offline mode needs no key, but project logger *wrappers* often probe the API
639
+ (`wandb.login()` / `wandb.Api()`) before `init` and treat key-absence as fatal → they flip to fully-disabled,
640
+ not offline.
641
+
642
+ **Fix**: push credentials BEFORE the first launch (U34) and run online under the platform's proxy; verify the
643
+ first log lines show `Syncing run <name>` + a run URL — treat the *absence* of that line as a failure. Run
644
+ already finished without a tracker? Backfill from the train log (regex per-epoch summaries →
645
+ `init(..., tags=["backfilled"]) → run.log(..., step=epoch)`). Still in flight? Kill and relaunch with
646
+ `--resume <latest.pth>` (costs ≤1 epoch). Prefer a hosted tracker so metrics survive teardown (U20).
647
+
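The backfill path can be sketched as a parser plus a replay loop. The log line format (`Epoch <n>: loss=<x>`) and the metric key are assumptions for illustration, since the real per-epoch summary depends on the project; `log_fn` is any callable with the shape of a tracker's `log(metrics, step=...)`.

```python
import re

# Assumed summary format; adapt the pattern to the project's actual train log.
EPOCH_RE = re.compile(r"^Epoch (\d+): loss=([0-9.]+)", re.MULTILINE)


def parse_epoch_summaries(log_text):
    """Return (epoch, loss) pairs for every per-epoch summary line."""
    return [(int(epoch), float(loss)) for epoch, loss in EPOCH_RE.findall(log_text)]


def backfill(log_text, log_fn):
    """Replay each parsed epoch through log_fn(metrics, step=epoch); return the count."""
    rows = parse_epoch_summaries(log_text)
    for epoch, loss in rows:
        log_fn({"train/loss": loss}, step=epoch)
    return len(rows)
```

With wandb, `log_fn` would be the `log` method of a run created with `tags=["backfilled"]`, followed by `finish()` once the replay completes.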
648
+ ---
649
+
650
+ ## Delegated — cross-link only, do NOT restate here
651
+
652
+ ### U36 — cuDNN nondeterminism
653
+
654
+ Same config + seed gives slightly different metrics run-to-run (`cudnn.benchmark=True` picks the fastest
655
+ kernel by first-batch timing). Owned by **verifying-dl-experiments** (determinism). Cross-link
656
+ verifying-dl-experiments **REQUIRED**; do not restate the fix here.
657
+
658
+ ### U37 — matplotlib `2^16`-per-axis limit on large eval visualization
659
+
660
+ A composite grid (one row per sample) on a large test set crashes
661
+ `Image size … must be less than 2^16`, often aborting the summary save. Owned by
662
+ **verifying-dl-experiments** (eval-artifact sizing). Cross-link verifying-dl-experiments **REQUIRED**;
663
+ prevent with U25 (cap + shard, don't emit a file/row per sample).
664
+
665
+ ### U38 — GPU at 0% util but training IS running (CPU-data-bound, not stalled)
666
+
667
+ `nvidia-smi` reads ~0% util yet the step log advances and model memory is loaded — a heavy per-sample CPU
668
+ transform with `num_workers=0` serializes data prep and starves the GPU. Owned by
669
+ **verifying-dl-experiments** (0%-util diagnosis). Cross-link verifying-dl-experiments **REQUIRED**; the fix
670
+ knobs are U24, the move-to-GPU remedy is in that skill.
671
+
672
+ ### U39 — Live monitoring shows nothing (TensorBoard panel empty / `INACTIVE`) but training is fine
673
+
674
+ **Symptom**: the platform's TensorBoard tile / web panel is blank or `INACTIVE`, or a backgrounded watcher
675
+ goes silent — yet the run is healthy: the loss advances on the box and the event/log files exist. You
676
+ conclude "monitoring is broken" or, worse, "the run died," and waste a check or restart a fine run.
677
+
678
+ **Root cause**: live observability breaks in three platform-shaped ways, none of which is a training
679
+ failure. (1) **Path mismatch** — the platform's built-in panel reads a FIXED logdir/port and your logger
680
+ wrote elsewhere, so the panel sees zero runs (AutoDL pins `tensorboard --logdir /root/tf-logs`; a
681
+ `SummaryWriter(log_dir="runs/<exp>")` is invisible to it). (2) **Process died / never backgrounded** — the
682
+ TB server or the watcher ran in the foreground or under the session and was killed at the foreground cap
683
+ or on session/SSH drop, so nothing serves the curves. (3) **Port not exposed** — the service is up on the
684
+ box but the port was never tunnelled / declared, so the panel can't reach it.
685
+
686
+ **Fix** (the rule is universal; the *value* is per-profile): (1) **align the path** — point your logger at
687
+ the panel's pinned dir, OR symlink the pinned dir at your output (`ln -sfn <your-runs>/<exp> <pinned>/<exp>`);
688
+ no retrain — the running writer keeps appending and the panel reloads it. The pinned path lives in the
689
+ profile (AutoDL `/root/tf-logs`, **AD7**; elsewhere write under the durable mount). (2) **run TB + the
690
+ watcher under the detach primitive** (tmux / nohup / the profile's `DETACH`), never foreground, so they
691
+ survive the session and the ~600 s cap (`references/monitoring_patterns.md` §1; cross-host background →
692
+ §7). (3) **expose the port the platform's way** — CN built-in tiles declare it at rent time (`china.md`),
693
+ RunPod via its HTTP proxy (100 s Cloudflare cap, fine for a TB UI, `runpod.md`), Lambda / Paperspace /
694
+ bare-SSH via an `ssh -L 6006:localhost:6006` tunnel (`generic-ssh.md`, `lambda.md`). Before blaming the
695
+ panel, verify ground truth: the event file is non-empty (`ls -la <logdir>; du -sh <logdir>`) and TB
696
+ answers locally (`curl -s localhost:<port>/ | head`). For curves that must **survive teardown**, don't
697
+ depend on a box-local panel at all → a hosted tracker (**U20**).
698
+
699
+ ---
700
+
701
+ ## Pointers — gotchas catalogued elsewhere
702
+
703
+ - **Spot / preemption** (grace windows 2 min → ~0 s, Young/Daly cadence, atomic-write resume, managed-spot frameworks restart-your-process) → `references/spot-resilience.md`.
704
+ - **Multi-node / NCCL** (fabric-manager hang, wrong NIC, NCCL timeout, jumbo-frame MTU mismatch, torchrun/Horovod elastic state restore) → `references/multinode.md`. Single-box users skip.