opencode-skills-collection 3.1.2 → 3.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (65) hide show
  1. package/bundled-skills/.antigravity-install-manifest.json +4 -1
  2. package/bundled-skills/agent-creator/SKILL.md +246 -0
  3. package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
  4. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  5. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  6. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  7. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  8. package/bundled-skills/docs/sources/sources.md +1 -1
  9. package/bundled-skills/docs/users/bundles.md +1 -1
  10. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  11. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  12. package/bundled-skills/docs/users/getting-started.md +1 -1
  13. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  14. package/bundled-skills/docs/users/usage.md +4 -4
  15. package/bundled-skills/docs/users/visual-guide.md +4 -4
  16. package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
  17. package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
  18. package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
  19. package/bundled-skills/remote-gpu-trainer/README.md +267 -0
  20. package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
  21. package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
  22. package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
  23. package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
  24. package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
  25. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
  26. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
  27. package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
  28. package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
  29. package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
  30. package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
  31. package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
  32. package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
  33. package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
  34. package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
  35. package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
  36. package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
  37. package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
  38. package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
  39. package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
  40. package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
  41. package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
  42. package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
  43. package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
  44. package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
  45. package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
  46. package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
  47. package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
  48. package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
  49. package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
  50. package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
  51. package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
  52. package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
  53. package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
  54. package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
  55. package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
  56. package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
  57. package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
  58. package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
  59. package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
  60. package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
  61. package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
  62. package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
  63. package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
  64. package/package.json +1 -1
  65. package/skills_index.json +66 -0
@@ -0,0 +1,148 @@
1
+ # Lifecycle Checklist — the 6-phase runbook as a per-platform checklist
2
+
3
+ Purpose: a platform-parameterized, copy-pasteable checkbox runbook for one remote-GPU job, Phase 0
4
+ (environment audit) through Phase 5 (aggregate + verify + teardown). Substrate is delegated to **your
5
+ platform profile** (`profiles/<platform>.md`, 8-field schema in `profiles/_schema.md`) — this file never
6
+ hardcodes a mount, verb, or proxy. Each phase ends in the runnable check from `SKILL.md`.
7
+
8
+ `grep -in <keyword> references/lifecycle_checklist.md` to jump.
9
+
10
+ ## Table of contents
11
+ - Phase −1 — one-time setup (skip if reused)
12
+ - Phase 0 — environment audit
13
+ - Phase 1 — SSH + credentials
14
+ - Phase 2 — wrapper + CPU-smoke gate
15
+ - Phase 3 — detached launch
16
+ - Phase 4 — durable monitoring
17
+ - Phase 5 — aggregate + verify + teardown
18
+ - Cost-saving teardown table
19
+ - Failure handling (inline, any phase)
20
+
21
+ > **How to use:** pick the profile FIRST (`SKILL.md` → "Pick your platform profile"). Wherever a step
22
+ > says *profile data mount* / *profile durable mount* / *profile meter-stop verb* / *profile detach
23
+ > primitive*, read the literal value out of that profile's STORAGE / TEARDOWN / DAEMON sections. Skip any
24
+ > phase already done. Universal gotchas referenced by id live in `references/gotchas_universal.md` — not
25
+ > restated here.
26
+
27
+ ---
28
+
29
+ ## Phase −1 — One-time setup *(skip if reused from a past project)*
30
+
31
+ - [ ] **First contact — surface the profile's "Surface to the user" block** to the user: the platform's non-obvious **conveniences** (one-click SSH-key registration, GPU-availability notifications, built-in panels) and its **danger clocks** (auto-release/auto-delete timers, stop-still-bills, low-balance purge). Don't assume they know — the danger clocks cost data or money (principle #10).
32
+ - [ ] Local SSH keypair exists (`~/.ssh/id_ed25519`); public key registered with the platform.
33
+ - [ ] `~/.ssh/config` alias per instance, with keepalive (`ServerAliveInterval`/`ServerAliveCountMax`) — see `references/ssh_transport.md`.
34
+ - [ ] Durable storage provisioned per profile (shared FS / network volume / persistent disk), sized ≥ `ckpt_size × N + buffer`.
35
+ - [ ] Reusable image/snapshot saved with the project's env + code, IF the profile supports it (saves repeated cold-build time × N instances).
36
+ - [ ] `.gitattributes` sets `*.sh text eol=lf` so Windows-authored scripts don't ship CRLF (gotcha U26).
37
+
38
+ > **verify:** `ssh <alias> 'echo reachable'` returns `reachable` and the durable mount appears in the profile's survival matrix.
39
+
40
+ ---
41
+
42
+ ## Phase 0 — Environment audit
43
+
44
+ - [ ] Read the profile's STORAGE survival-matrix: know which tier survives *stop* vs *destroy* (principle #4). Write checkpoints to the tier that survives the intended *profile meter-stop verb*.
45
+ - [ ] Read the profile's region/DC-lock: if a shared/network volume is region-scoped, confirm all instances share that region BEFORE launch.
46
+ - [ ] Measure live, do not assume: `df -h && df -i <profile data mount>` (inodes die before bytes — principle #5), cgroup `cat /sys/fs/cgroup/memory.max`, `nvidia-smi`.
47
+ - [ ] `du -sh` the real space hogs on the actual mount (symlinked caches hide — principle #5), not the dir assumed.
48
+ - [ ] Pre-compute the checkpoint disk budget: `ckpt_size × N + scratch`; confirm it fits the *profile data mount* cap.
49
+
50
+ > **verify:** `nvidia-smi` shows the expected GPU and `df -i <profile data mount>` is not near 100%.
51
+
52
+ ---
53
+
54
+ ## Phase 1 — SSH + credentials
55
+
56
+ - [ ] Set the SSH alias/env per the profile's NETWORK section; note the SSH flavor (a proxied/basic SSH may not `scp`/`rsync` — direct-TCP required; ports may change on restart).
57
+ - [ ] Use the prebuilt image/base AS the env — do NOT `conda create` on a rental (throwaway-instance exception: the image IS the env).
58
+ - [ ] Push secrets via **stdin, never onto a shared/durable FS** (a shared FS is multi-project) — pattern in `references/ssh_transport.md`. Reference creds by env-var NAME / file path, never inline a key.
59
+ - [ ] If the profile sits behind the GFW, wire the China-mirror endpoint now (`references/china-network.md`); validate the speed test on the SAME route the real transfer uses (principle #7).
60
+
61
+ > **verify:** `ssh <alias> 'python -c "import torch;print(torch.cuda.is_available())"'` prints `True`.
62
+
63
+ ---
64
+
65
+ ## Phase 2 — Wrapper + CPU-smoke gate
66
+
67
+ - [ ] Build an idempotent `run_one` / `run_queue` from `scripts/` (`run_one.sh.template`, `run_queue.sh.template`), parameterized from the profile's SCRIPT OVERRIDES (`DATA_DIR=`, `DURABLE_DIR=`, `PROXY_HOOK=`, `CRED_FILE=`, `SCRATCH=`, `HF_HOME=`, `DETACH=`).
68
+ - [ ] Wrappers are resumable: load-latest-on-startup unconditionally so the identical launch command resumes, not restarts (principle #8).
69
+ - [ ] Build the per-cell queue/config files with one isolated write path per cell (no shared mutable output — parallel ablation needs this; `references/parallel_ablation.md`).
70
+ - [ ] **Run the cheap CPU smoke LOCALLY, BEFORE renting** — 1–2 batches, logger disabled, tiny shapes; it kills import/config/shape/scale bugs for ~free (principle #2). Smoke *content* → **verifying-dl-experiments** (REQUIRED).
71
+
72
+ > **verify:** smoke exits 0 on 2 batches with the logger disabled, no Traceback.
73
+
74
+ ---
75
+
76
+ ## Phase 3 — Detached launch
77
+
78
+ - [ ] Launch via the *profile detach primitive* (tmux / `sbatch` / k8s Job / commit) — survives an SSH drop; confirm whether it also survives an instance restart (profile DAEMON section).
79
+ - [ ] Push code/data with a resumable transfer (`rsync --partial` or `timeout`+resume loop — principle #7); never edit a script under a live run — version filenames (principle #6).
80
+ - [ ] Probe briefly: log head + process alive + no traceback, then **hand control back**. Never a blocking foreground `sleep` (foreground Bash hard-caps at 600 s).
81
+
82
+ > **verify:** within 60 s, the detach session is alive and the first log line shows the expected step/epoch.
83
+
84
+ ---
85
+
86
+ ## Phase 4 — Durable monitoring
87
+
88
+ - [ ] For anything over ~1–2 h, deploy the **four-layer architecture** (`references/monitoring_patterns.md`): on-box self-completion chain + session patrol loop + event sentinels + recovery handbook. A session-bound watcher alone dies with the session (principle #3).
89
+ - [ ] Use `run_in_background` (no duration cap, notifies on exit; a Claude Code primitive — other hosts map per `monitoring_patterns.md` §7) for long waits; never foreground-poll. NEVER an unquoted `|` inside a poll-regex — it reads stdin and hangs forever.
90
+ - [ ] Watch `df -i` trend (not just `df -h`), cgroup memory %, new FINISHED/ERROR/Traceback markers, and fast-finish (< ~50% expected duration → probable failure).
91
+ - [ ] Reconcile each watcher against the job's REAL process/artifact (`tmux ls`/`squeue`/`pgrep` + output `mtime`) — a watcher's own state is a claim, not ground truth (principle #3). Tear a watcher down when its job is superseded.
92
+ - [ ] Classify each failure → its fixed remediation (see Failure handling below); **never blind-retry**.
93
+
94
+ > **verify:** the patrol reports a status line even when nothing changed (proves it's alive, not silently dead).
95
+
96
+ ---
97
+
98
+ ## Phase 5 — Aggregate + verify + teardown
99
+
100
+ - [ ] Run the aggregation step (`scripts/aggregate_to_fs.sh`, idempotent — safe to re-run) to checked-sync results to the *profile durable mount*. **Gate the success line on the actual copy result** — `cp …; echo synced` lies on a full/inode-exhausted FS (principle #3, gotcha U33).
101
+ - [ ] Confirm the durable mount has the expected artifact count: `ssh <alias> 'ls <profile durable mount>/final_ckpts/ | wc -l'`.
102
+ - [ ] Pull results to local (resumable per-dir scp/rsync loop); HARD-sanitize the local target to `/path/to/local` — never a real personal path.
103
+ - [ ] **Load-and-verify each artifact** before teardown: `python scripts/verify_local.py /path/to/local/final_ckpts/` → expect `OK N/N, errors 0`. Re-pull + re-verify any error.
104
+ - [ ] Record disclosable run facts for the paper: CLI overrides, tracker summary URL (a hosted tracker survives teardown — **huggingface-skills:huggingface-trackio**). Transport verbs (`hf download --resume`, `hf upload-large-folder`) → **huggingface-skills:hf-cli**.
105
+ - [ ] ONLY THEN perform the *profile meter-stop verb*, AFTER explicit user approval of the specific cost-affecting action.
106
+
107
+ > **verify:** `verify_local.py` reports 100% OK *before* any teardown.
108
+
109
+ > **Iron Law — teardown gate:** NO `stop` / `release` / `terminate` / `destroy` / file-delete until
110
+ > checkpoints are **pulled to local AND verified by load**, AND the user has explicitly approved the
111
+ > cost-affecting action. "It looked done in the log" is not evidence (principle #3). On most platforms the
112
+ > meter-stopping verb is **irreversible** (deletes the disk) — confirmation matters *more*, not less. The
113
+ > general form is **superpowers:verification-before-completion** (REQUIRED).
114
+
115
+ ---
116
+
117
+ ## Cost-saving teardown table
118
+
119
+ The verb that stops the meter, what each preserves, and irreversibility — bind the platform-specific verb
120
+ from the profile's TEARDOWN section. **The biggest portability trap: "stop" rarely stops the meter, and
121
+ the action that does is usually irreversible** (principle #4/#9).
122
+
123
+ | Action | Stops GPU meter? | What survives | Reversible? |
124
+ |---|---|---|---|
125
+ | **stop / 关机 (power-off)** | Sometimes — depends on profile (AutoDL: yes, keeps disk; RunPod/vast: still bills storage 1–2×) | Disk tier per profile survival-matrix | Yes — instance restartable |
126
+ | **release idle instance** | Yes | Only the durable/shared mount (data disk gone) | No — instance + container disk destroyed |
127
+ | **terminate** | Yes | Only a network/persistent volume, if one was mounted | **No — irreversible**, disk deleted |
128
+ | **destroy** | Yes | Nothing on the box | **No — irreversible**, total loss |
129
+ | **delete durable files (keep subscription)** | Storage trickle only | Subscription survives for new data | No — those files gone |
130
+ | **cancel durable storage subscription** | Storage cost only | Nothing | **No — irreversible**, all durable data lost |
131
+
132
+ **Default conservative plan:** stop/release the GPU instance first (immediate $ saving, low risk once
133
+ artifacts are verified-local). Keep durable storage 1–3 months until the paper is submitted. Cancel the
134
+ durable subscription LAST, only after the local copy is verified and the user approves.
135
+
136
+ ---
137
+
138
+ ## Failure handling *(inline, any phase)*
139
+
140
+ Categorize before reacting; retry the **identical** config — hand-patching one run destroys comparability
141
+ (principle #7; **verifying-dl-experiments** owns is-it-a-bug-or-real).
142
+
143
+ - [ ] **Probabilistic** (epoch-1 stall, transient `wandb.init` blip, spot preemption): queue a retry with the SAME config, no safeguards. Resume works because of checkpoint-load (principle #8).
144
+ - [ ] **Disk-full** (exit 1 + `iostream` / "No space left"): prune the *profile scratch* (`SCRATCH=` — periodic checkpoints, unused caches), keep `best`; if cleanup can't free enough, **ask to expand the disk**, never silently shrink the experiment (principle #9). Then retry.
145
+ - [ ] **Real bug** (CUDA OOM, code error, all-zero metric): stop, investigate code — do NOT retry blindly.
146
+
147
+ > Symptom → root cause → fix for each, plus the full catalog: `references/gotchas_universal.md`
148
+ > (`grep -in <keyword> references/gotchas_universal.md` to jump).
@@ -0,0 +1,327 @@
1
+ # Monitoring Patterns — durable watching of a remote GPU job
2
+
3
+ Platform-agnostic recipes for babysitting a long-running detached job on a rented box. The crown jewel
4
+ is the **four-layer durable-monitoring architecture** (§3): a session-bound watcher alone dies with the
5
+ session, so layer it. Every recipe uses portable primitives — `tmux` OR `squeue` OR `pgrep`, a log
6
+ marker OR an artifact `mtime` — never one platform's paths. Bind the concrete paths/aliases from
7
+ `profiles/<platform>.md`.
8
+
9
+ To jump: `grep -in '<keyword>' references/monitoring_patterns.md`.
10
+
11
+ ## Table of contents
12
+
13
+ - §0 Monitoring physics — the four facts every recipe rests on
14
+ - §1 The robust short-connection ssh-poll template (the safe poll primitive)
15
+ - §2 Quick health probes (one round-trip each)
16
+ - §3 Durable monitoring architecture — the four layers (L1 self-completion · L2 patrol · L3 sentinels · L4 handbook)
17
+ - §4 Stale-waiter hygiene — one waiter per live run, right lifetime
18
+ - §5 Two-leg self-completion — guaranteed results + best-effort cadence
19
+ - §6 Failure triage on the log
20
+ - §7 Monitoring across agent hosts — per-host background/loop/cron primitives + the 2 portability rules (Claude Code · Codex · Cursor · Trae · generic)
21
+
22
+ ---
23
+
24
+ ## §0 Monitoring physics — the four facts every recipe rests on
25
+
26
+ Verified in-session, not assumed. The whole architecture is engineered around these:
27
+
28
+ > **Tool-portability note:** `run_in_background`, the ~600 s foreground cap, and `/schedule` (below and
29
+ > §5/§3) are the **Claude Code** harness's primitives. On another Agent-Skills host (Codex / Cursor /
30
+ > Trae / …) map them to that agent's equivalents — its background-task or async runner, its own
31
+ > foreground/turn limit, its scheduler. The four-layer architecture itself is host-agnostic — the full
32
+ > per-host mapping (Codex / Cursor / Trae / generic) and the two portability rules are in **§7**.
33
+
34
+ 1. **Foreground Bash hard-caps at 600 s (10 min).** A long foreground wait/monitor is *killed* at the cap
35
+ — so never foreground-poll a multi-hour run.
36
+ 2. **`run_in_background` has NO duration cap and notifies on EXIT.** A 781 s background task ran to
37
+ completion and notified (verified). Long task that finishes → background it.
38
+ 3. **A never-*exiting* watcher never notifies.** No exit event = no notification, ever. A persistent
39
+ `while true` / a stray `grep` reading stdin hangs silently forever and the user reads silence as "dead
40
+ monitor". Every watcher must have a bounded exit.
41
+ 4. **An unquoted `|` inside a poll regex hangs forever.** The shell splits `grep -hE a|b|c log` into three
42
+ piped commands; the first (`grep -hE a`, no filename) reads **stdin** → blocks → the pipeline never
43
+ returns → the ssh never returns → the background process never exits → fact 3 fires. ALWAYS quote the
44
+ regex AND give grep a filename.
45
+
46
+ Corollary — **trust the artifact, not the silence.** When a job "looks done," Read its output file and
47
+ re-check ground truth (`grep DONE log; tmux ls / squeue; nvidia-smi`) before claiming success. Do not
48
+ wait blindly for a notification that may never fire. This is the `verifying-dl-experiments` (REQUIRED)
49
+ Iron Law applied to monitoring.
50
+
51
+ ---
52
+
53
+ ## §1 The robust short-connection ssh-poll template (the safe poll primitive)
54
+
55
+ The single most important pattern: a poll that cannot hang (fact 4) and cannot strand a half-open
56
+ connection. **Never hold one long ssh open for the whole wait** — loop locally, reconnecting each tick.
57
+
58
+ ```bash
59
+ #!/usr/bin/env bash
60
+ set -u
61
+ # Short-connection poll: ssh in → check → disconnect; bounded local loop.
62
+ HOST="<alias>" # from profiles/<platform>.md
63
+ LOG="/path/to/run.log" # remote log path (profile-bound)
64
+ PATTERN='QUEUE DONE|Training completed' # QUOTED → '|' is alternation, never a pipe (fact 4)
65
+ MAX=120 # bounded: 120 ticks × 90 s ≈ 3 h, then give up cleanly
66
+ i=0
67
+ while [ "$i" -lt "$MAX" ]; do
68
+ # ConnectTimeout + ServerAlive bound a network blip to ~30 s instead of a multi-minute half-open hang.
69
+ if ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 -o ServerAliveCountMax=3 "$HOST" \
70
+ "grep -qE '$PATTERN' '$LOG'"; then # quoted regex + a FILENAME → grep reads the file, never stdin
71
+ echo "DONE marker found"; exit 0
72
+ fi
73
+ i=$((i+1)); sleep 90
74
+ done
75
+ echo "poll gave up after $MAX ticks — check ground truth manually"; exit 1
76
+ ```
77
+
78
+ Non-negotiables baked in above:
79
+ - **Quoted regex + a filename** on every remote `grep` — the two independent guards against fact 4.
80
+ - **`ConnectTimeout` / `ServerAliveInterval` / `ServerAliveCountMax`** — a dropped link self-kills fast.
81
+ - **Short connection per tick, bounded local loop** — one ssh per check, a hard tick ceiling so the
82
+ waiter always EXITS (fact 3) and therefore always notifies when backgrounded.
83
+ - **Detect "done" by a log MARKER, never by `pgrep`** of the waiter's own pattern — `pgrep -f` matches
84
+ the waiter's own command line and the loop never ends. On a queue scheduler, `squeue -j <id>` going
85
+ empty is the equivalent done-signal.
86
+
87
+ Run this via `run_in_background` (fact 2: no cap, notifies on exit), or as a single foreground tick under
88
+ the 600 s cap (fact 1). On a session scheduler use it as the L2 patrol body (§3). **Never foreground-poll
89
+ the full wait.**
90
+
91
+ ---
92
+
93
+ ## §2 Quick health probes (one round-trip each)
94
+
95
+ Each is a single short ssh. Combine several into ONE round-trip for a patrol tick (§3). Detach-primitive
96
+ and paths come from the profile; the structure is identical everywhere.
97
+
98
+ > A blank live **TensorBoard tile / web panel** while these probes show a healthy run is **not** a dead
99
+ > run — it is `references/gotchas_universal.md` **U39**: the panel reads a fixed logdir/port your logger
100
+ > didn't write to, or the TB/watcher process died (ran foreground, not under the detach primitive), or the
101
+ > port isn't exposed. Fix per the platform profile; never restart a healthy run over an empty panel.
102
+
103
+ **Is the job alive? (tmux OR squeue OR pgrep — pick the profile's primitive)**
104
+ ```bash
105
+ ssh "$HOST" "tmux ls 2>/dev/null || true; squeue -u \$USER 2>/dev/null || true; pgrep -af 'train' | grep -v grep | head -3"
106
+ ```
107
+
108
+ **Progress since last check** — grep the run's OWN log, not a shared master:
109
+ ```bash
110
+ ssh "$HOST" "grep -nE 'Epoch [0-9]+|Training completed|Early stopping|FINISHED|QUEUE DONE' '$RUN_LOG' | tail -6"
111
+ ```
112
+
113
+ > **Gotcha — crash-detect on the per-run log, never the shared master.** Symptom: a poll reports "run D
114
+ > crashed" while D trains fine. → Root cause: a `tee`'d master log concatenates every run, so grepping it
115
+ > for `Traceback|OutOfMemory` matches an EARLIER run's crash text and false-positives on a healthy later
116
+ > run. → Fix: scope crash detection to the per-run log (`<name>.log`); reserve the master-log grep for
117
+ > `DONE`/`FINISHED`/`QUEUE DONE` and progress markers. A waiter that crash-checks the wrong log spins to
118
+ > its full timeout on a phantom failure.
119
+
120
+ **Resource pressure** (cgroup mem, GPU) — thresholds are rough, profile-tunable:
121
+ ```bash
122
+ ssh "$HOST" "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader; \
123
+ [ -f /sys/fs/cgroup/memory.current ] && numfmt --to=iec \$(cat /sys/fs/cgroup/memory.current)/\$(cat /sys/fs/cgroup/memory.max) 2>/dev/null"
124
+ ```
125
+ - cgroup mem > 90% of max → OOM risk; GPU util > 60% → healthy, not data-bottlenecked.
126
+ - GPU at 0% but the step log advances ≠ idle — it is CPU-data-bound; sample util over several seconds,
127
+ never one snapshot. (Diagnosis → `verifying-dl-experiments`, REQUIRED.)
128
+
129
+ **Disk — the silent killer** — watch `df -i` (inodes) AND `df -h` (bytes); inodes die first on
130
+ many-small-files eval output (→ `references/gotchas_universal.md`):
131
+ ```bash
132
+ ssh "$HOST" "df -h '$DATA_MOUNT'; df -i '$DATA_MOUNT'"
133
+ ```
134
+
135
+ ---
136
+
137
+ ## §3 Durable monitoring architecture — the four layers (earned by three live failures)
138
+
139
+ Session-bound watchers die with the session; the instance itself can die under the watcher; and a
140
+ monitor that only speaks on terminal events reads as "nobody is watching." One layer cannot fix all
141
+ three. Run four — **correctness in L1, liveness in L2, latency in L3, continuity in L4**:
142
+
143
+ | Layer | Lives where | Job | Survives |
144
+ |---|---|---|---|
145
+ | **L1 self-completion chain** | ON the box (tmux / nohup / sbatch dependency) | the work sequences itself: `until grep -q 'Training completed' log; do sleep 150; done && <next stage>`; stages hand off via `touch /path/STAGE_DONE` markers | session death, network loss |
146
+ | **L2 patrol loop** | session scheduler (cron `/loop`) | every ~30 min fire a SELF-CONTAINED patrol: one combined ssh probe + a decision table + "report EVEN IF nothing changed" | idle gaps (NOT session death — see L4) |
147
+ | **L3 event sentinels** | session background (`run_in_background`) | the §1 short-poll `until ssh test -f MARKER; …` for minute-level reaction between patrol ticks | nothing — acceptable; L1/L2 carry correctness |
148
+ | **L4 recovery handbook** | persistent notes/memory | exact resume commands, chain definitions, marker paths, "first command on reconnect" — a BRAND-NEW session takes over from one word | everything |
149
+
150
+ ### L1 — on-box self-completion chain (correctness)
151
+ The box finishes its own pipeline regardless of any watcher. Chain stages under one detach primitive and
152
+ **join them with `&&`, never `;`** so a marker only lands on success:
153
+ ```bash
154
+ # tmux / nohup variant — the detach primitive is the swappable plug (sbatch dependency on Slurm)
155
+ nohup bash -c '
156
+ set -u
157
+ until grep -q "Training completed" /path/to/train.log; do sleep 150; done \
158
+ && python -m eval ... \
159
+ && touch /path/to/STAGE_DONE # marker ONLY on a clean &&-chain
160
+ ' </dev/null >/path/to/chain.log 2>&1 &
161
+ ```
162
+ > **Gotcha — success-gate the chain markers.** Symptom: the downstream chain fires on a phantom
163
+ > completion. → Root cause: joining stages with `;` (or a bare `touch` after a crashing stage) stamps the
164
+ > marker even when a stage died — a live disk-full `torch.save` killed stage 3, the `;`-marker still
165
+ > landed, the next stage ran on nothing. → Fix: `&&` between every stage and the final `touch`; detect
166
+ > done by the marker, never by `pgrep` of the waiter's own pattern (fact 4 / §1).
167
+
168
+ ### L2 — patrol loop (liveness): the design checklist (what made it actually work)
169
+ - **ONE combined ssh probe per tick** — alive-check (tmux ls / squeue / pgrep) + `*_DONE` markers + last
170
+ epoch line + artifact `ls` + dataset file COUNTS, in a single round-trip.
171
+ - **An explicit decision table**, e.g.: ssh down → tell the user to check the console (only they see
172
+ balance/power state); detach session missing AND no completion marker → resume from `latest` + rebuild
173
+ the L1 chain; result CSV exists → `cat` it and report the numbers verbatim; remote file count below the
174
+ local source → resume the transfer; everything done → delete the patrol job itself.
175
+ - **Report a one-line status EVEN WHEN nothing changed** — silence between events is exactly what users
176
+ read as a dead monitor. ("你有定时看吗??" twice in one campaign is the failure signature of L3-only.)
177
+ - **Completeness = file COUNT against the local source** (bytes/hash when names collide), NEVER `test -d`
178
+ — a dir created by a killed transfer passes existence checks forever.
179
+ - **Never blind-restart** — probe session/log/markers first so a patrol firing mid-run cannot
180
+ double-launch (idempotence). Classify each outcome → a fixed remediation; never blind-retry.
181
+
182
+ > **Ready-made tick:** `scripts/health_patrol.sh.template` is this checklist as one runnable,
183
+ > read-only ssh round-trip — alive + done-count + last epoch + crash-scan + `df -h`/`df -i`, an
184
+ > escalation predicate, and a one-line report even when nothing changed — parameterized from the
185
+ > profile's §8. Fire it from the host's recurring runner (§7: `/loop`, cron `3,33 * * * *`, …).
186
+
187
+ ### L3 — event sentinels (latency)
188
+ The §1 short-poll loop for minute-level reaction between patrol ticks. Survives nothing — it is the
189
+ disposable fast-reaction layer; L1/L2 carry correctness. Re-arm exactly ONE after any session resume.
190
+
191
+ > **`run_in_background` is NOT a substitute for `/loop` on an unattended wait.** A one-shot
192
+ > `run_in_background` sentinel notifies on EXIT — fine while you keep working in an ACTIVE session, but if
193
+ > the session goes idle for hours its exit-notification lands on a closed/reset session and you hear
194
+ > nothing (the silent-monitor-for-hours failure). Any UNATTENDED wait over ~1 h → bind the **L2 `/loop`
195
+ > patrol** (a recurring agent re-wake), never a lone one-shot sentinel.
196
+
197
+ ### L4 — recovery handbook (continuity)
198
+ Persistent notes a brand-new session inherits from one word ("继续"): exact resume commands, the L1 chain
199
+ definition, every marker path, the "first command on reconnect." Two durable hardenings:
200
+ - **Externalize transfer/monitor state to a stable OS path** + a DONE marker file *outside* the session
201
+ dir, so any future session resumes by reading files instead of re-uploading.
202
+ - **True restart-immunity means an OS-owned process** (Task Scheduler / cron) — but creating one is
203
+ unauthorized persistence to the permission classifier: **get the user's explicit one-line approval
204
+ first**, or hand them the launch command to run themselves.
205
+
206
+ > **Gotcha — after a context compaction, reconcile UI task chips against the OS process table.**
207
+ > Symptom: 5 chips show "Running" for 2–6 h while zero ssh/scp processes exist and a "running" upload
208
+ > actually died at 2/10 checkpoints, silently gating the downstream eval all evening. → Root cause:
209
+ > background shells die with the old session, but their chips keep showing "Running"; the new session's
210
+ > task list is empty, so the only ground truth is a process scan. → Fix: **first action after any
211
+ > compaction is a process-scan** (e.g. `Get-CimInstance Win32_Process` matched on the remote host string,
212
+ > or `pgrep`/`ps` for ssh/scp), relaunch dead transfers with a byte-size verify, re-arm ONE fresh
213
+ > sentinel, and tell the user to clear the husks.
214
+
215
+ ---
216
+
217
+ ## §4 Stale-waiter hygiene — one waiter per live run, right lifetime
218
+
219
+ > **Gotcha — stale background waiters pile up.** Symptom: the Background-tasks panel shows 8+ "Running"
220
+ > wait-loops at 500–740 min elapsed, ssh-polling every ~20 s, while the GPU is idle and the experiment
221
+ > finished hours ago. → Root cause: every kill+restart of a flaky-network saga armed a NEW
222
+ > `until ssh grep MARKER; do sleep 20; done` waiter but never stopped the OLD one — its marker (in a
223
+ > superseded log) never appears, so it loops forever (fact 3). → Fix below.
224
+
225
+ - **One waiter per live run.** Superseding a run → STOP its old waiter *first* (TaskStop, or dismiss a
226
+ cross-session chip from the UI — resumed-session IDs aren't stoppable programmatically).
227
+ - **Match watcher lifetime to the wait.** Multi-hour wait → a persistent Monitor (no 10-min cap) plus a
228
+ stall-detector so a hung run still notifies. A persistent monitor still dies on session resume → after
229
+ any resume, **check remote ground truth directly** (tmux ls / squeue, `grep DONE log`, `nvidia-smi`);
230
+ do not trust a watcher that may be gone (fact 3 + §0 corollary).
231
+ - **A dropped poll connection ≠ the job dying.** A long background ssh poll gets killed by the remote's
232
+ idle-SSH timeout while the detached training runs on independently. Re-ssh and verify the process/
233
+ artifacts directly before concluding anything died.
234
+
235
+ ---
236
+
237
+ ## §5 Two-leg self-completion — guaranteed results + best-effort cadence
238
+
239
+ "I'll check periodically" is a lie unless a trigger is ARMED — between turns the assistant does not run.
240
+ Two legs, never conflated:
241
+
242
+ - **Leg 1 — remote self-completion (guaranteed, survives session/SSH death):** the L1 chain
243
+ (`train → eval → touch marker` under one detach primitive). Detect done by a log/marker, never by
244
+ `pgrep` of the waiter's own pattern. This guarantees RESULTS but gives no reporting cadence.
245
+ - **Leg 2 — live progress (best-effort):** a session-bound patrol loop (L2, e.g. `/loop 30m` or cron `3,33 * * * *`)
246
+ polling with the LOCAL ssh key. Be honest it dies when the session closes — the remote still finishes;
247
+ the user re-pings to pull.
248
+
249
+ > **A cloud scheduler cannot reach a rented box.** A cloud schedule (`/schedule` / RemoteTrigger) runs in
250
+ > an isolated sandbox with its own checkout and **no access to the local SSH key or network** → it cannot
251
+ > ssh the box, and the SSH private key must **never** be placed in a cloud agent (secret-leak). The honest recurring check is the remote self-monitor + a session loop, not a cloud robot pinging
252
+ > the box. Don't promise autonomous cross-session polling that can't be delivered.
253
+
254
+ For a hosted tracker whose metrics survive teardown and can be polled as a structured monitor instead of
255
+ brittle ssh-tail, use `huggingface-skills:huggingface-trackio` (REQUIRED for that path) — poll its alerts
256
+ rather than grepping a remote log.
257
+
258
+ ---
259
+
260
+ ## §6 Failure triage on the log
261
+
262
+ When a probe shows trouble, pull the full traceback from the per-run log (§2) and classify — each
263
+ outcome maps to a FIXED remediation; never blind-retry:
264
+
265
+ ```bash
266
+ ssh "$HOST" "grep -B2 -A20 'Traceback' '$RUN_LOG' | head -50"
267
+ ```
268
+ - `basic_ios::clear: iostream error` + `unexpected pos N vs M` → **disk full during checkpoint save**;
269
+ check `df -h`/`df -i`, prune `latest`/periodic snapshots to recover (→ `references/gotchas_universal.md`).
270
+ - bare `Killed` / exit 137, no traceback → **cgroup OOM** (workers × big in-RAM tensor); size workers
271
+ vs `memory.max`, not CPU count.
272
+ - `CUDA out of memory` → VRAM, usually consistent across runs (batch too big / concurrent job), rarely
273
+ transient.
274
+ - `KeyError` / `AttributeError` → config/code mismatch; investigate code, do not retry.
275
+ - Early-stop far below baseline with a grad_norm P99 spike in epoch 1–2 → likely **probabilistic
276
+ divergence**; whether it's a bug or a real effect, and the retry-the-identical-config rule, belong to
277
+ `verifying-dl-experiments` (REQUIRED) — this skill owns *running* the retry, not judging the number.
278
+ - log frozen (no new lines) but checkpoint `mtime` advances → **block-buffered stdout**, not a hang
279
+ (`references/gotchas_universal.md` U43; run `python -u`/`PYTHONUNBUFFERED=1`).
280
+ - `uptime`/`free` on the box look maxed but your cgroup is roomy → **noisy neighbor** on the shared host,
281
+ not your job (`references/gotchas_universal.md` U41; the authoritative OOM check is the `oom_kill` counter
282
+ in `/sys/fs/cgroup/memory.events`).
283
+ - GPU SM% pinned low while a python thread-storm pegs the cores → **intra-op thread oversubscription** on a
284
+ vCPU slice (`references/gotchas_universal.md` U40; cap `OMP_NUM_THREADS` to the cgroup quota).
285
+
286
+ Universal gotchas (silent sync, CRLF, mid-run script overwrite, inode caps) are NOT restated here —
287
+ see `references/gotchas_universal.md` (`grep -in '<keyword>' references/gotchas_universal.md` to jump).
288
+
289
+ ---
290
+
291
+ ## §7 Monitoring across agent hosts (portability mapping)
292
+
293
+ The four layers are host-agnostic; only **which primitive runs L2/L3** changes per host. Two rules port
294
+ the whole architecture to Codex / Cursor / Trae / any Agent-Skills host:
295
+
296
+ **Rule 1 — the durable layer needs no agent.** L1 (the box self-completes + `touch`es a marker) plus the
297
+ box **pushing its own notification** at the end of the `&&`-chain — a `curl` webhook / email / a
298
+ `huggingface-skills:huggingface-trackio` alert — works on EVERY host, because it runs entirely on the
299
+ rented box. On a host with no background/scheduler primitive, this IS the monitor; the agent just pulls
300
+ results on its next turn.
301
+
302
+ **Rule 2 — a CLOUD scheduler cannot reach a rented box (§5), on ANY host.** Every host's hosted
303
+ automation runs in an isolated sandbox with no local SSH key or network, so it cannot ssh your box (and
304
+ the key must never be placed in one — secret-leak). Use cloud cron only to **re-wake the agent** or
305
+ **poll a hosted tracker**, never to probe the box. The box-reaching poll must use the host's
306
+ **local/session** runner (which holds your SSH key), or be the L1 on-box loop.
307
+
308
+ | Agent host | Local runner — reaches the box (L3) | Recurring / loop (L2) | Cloud cron/automation — re-wake / tracker only | Foreground/turn limit |
309
+ |---|---|---|---|---|
310
+ | **Claude Code** | `run_in_background` (detach + notify-on-exit); the `Monitor` tool | `/loop` + `ScheduleWakeup` (interval or self-paced) | `/schedule` (cron cloud agent) | ~600 s foreground |
311
+ | **OpenAI Codex** | Codex Cloud background tasks (async, parallel) | a thread that schedules its own wake-up | **Automations** — cron syntax, results → review queue | per cloud task |
312
+ | **Cursor** | Background Agents (async) | — | **Automations** — cron (hourly/daily/weekly) + event triggers | per agent |
313
+ | **Trae** (ByteDance) | Agent / `trae-agent` CLI unattended runs; CI/CD | via a CI/CD pipeline | **no native cron found** → external cron / CI-CD, or rely on Rule 1 | per run |
314
+ | **Generic / none** | any local background-equivalent (else none) | a shell `while`-loop under the turn limit | none | host turn limit |
315
+
316
+ > **Hosts not in the table** (Gemini CLI, VS Code / Copilot, Goose, Kiro, …) take the **Generic** row until they expose a local recurring runner that holds your SSH key — until then, wire **Rule 1** (the on-box self-push) and let the agent pull on its next turn.
317
+
318
+ **Binding the layers:** L1 is unchanged everywhere (on-box). Bind **L2** to the host's local recurring
319
+ runner *if* it reaches the box, else to the box's own `cron`/`at` + a push (Rule 1). Bind **L3** to the
320
+ host's local background runner, re-armed once per resume. When a host offers only cloud automation (or
321
+ nothing), **do not promise agent-side polling of the box** — wire Rule 1 and let the agent pull on its
322
+ next turn (§0 corollary: trust the artifact, not the silence).
323
+
324
+ Host capabilities verified 2026-06: Codex Automations (cron) + Cloud background tasks —
325
+ `developers.openai.com/codex/app/automations` + `/codex/cloud`; Cursor Automations (cron + event
326
+ triggers) + Background Agents — `cursor.com/docs/cloud-agent/automations`; Trae Agent / `trae-agent` CLI
327
+ + CI/CD, no native cron surfaced — `docs.trae.ai/ide` + `github.com/bytedance/trae-agent`.