npm - opencode-skills-collection - Versions diffs - 3.1.2 → 3.1.3 - Mend

opencode-skills-collection 3.1.2 → 3.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (65) hide show

package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md ADDED Viewed

@@ -0,0 +1,148 @@
+# Lifecycle Checklist — the 6-phase runbook as a per-platform checklist
+Purpose: a platform-parameterized, copy-pasteable checkbox runbook for one remote-GPU job, Phase 0
+(environment audit) through Phase 5 (aggregate + verify + teardown). Substrate is delegated to **your
+platform profile** (`profiles/<platform>.md`, 8-field schema in `profiles/_schema.md`) — this file never
+hardcodes a mount, verb, or proxy. Each phase ends in the runnable check from `SKILL.md`.
+`grep -in <keyword> references/lifecycle_checklist.md` to jump.
+## Table of contents
+- Phase −1 — one-time setup (skip if reused)
+- Phase 0 — environment audit
+- Phase 1 — SSH + credentials
+- Phase 2 — wrapper + CPU-smoke gate
+- Phase 3 — detached launch
+- Phase 4 — durable monitoring
+- Phase 5 — aggregate + verify + teardown
+- Cost-saving teardown table
+- Failure handling (inline, any phase)
+> **How to use:** pick the profile FIRST (`SKILL.md` → "Pick your platform profile"). Wherever a step
+> says *profile data mount* / *profile durable mount* / *profile meter-stop verb* / *profile detach
+> primitive*, read the literal value out of that profile's STORAGE / TEARDOWN / DAEMON sections. Skip any
+> phase already done. Universal gotchas referenced by id live in `references/gotchas_universal.md` — not
+> restated here.
+---
+## Phase −1 — One-time setup *(skip if reused from a past project)*
+- [ ] **First contact — surface the profile's "Surface to the user" block** to the user: the platform's non-obvious **conveniences** (one-click SSH-key registration, GPU-availability notifications, built-in panels) and its **danger clocks** (auto-release/auto-delete timers, stop-still-bills, low-balance purge). Don't assume they know — the danger clocks cost data or money (principle #10).
+- [ ] Local SSH keypair exists (`~/.ssh/id_ed25519`); public key registered with the platform.
+- [ ] `~/.ssh/config` alias per instance, with keepalive (`ServerAliveInterval`/`ServerAliveCountMax`) — see `references/ssh_transport.md`.
+- [ ] Durable storage provisioned per profile (shared FS / network volume / persistent disk), sized ≥ `ckpt_size × N + buffer`.
+- [ ] Reusable image/snapshot saved with the project's env + code, IF the profile supports it (saves repeated cold-build time × N instances).
+- [ ] `.gitattributes` sets `*.sh text eol=lf` so Windows-authored scripts don't ship CRLF (gotcha U26).
+> **verify:** `ssh <alias> 'echo reachable'` returns `reachable` and the durable mount appears in the profile's survival matrix.
+---
+## Phase 0 — Environment audit
+- [ ] Read the profile's STORAGE survival-matrix: know which tier survives *stop* vs *destroy* (principle #4). Write checkpoints to the tier that survives the intended *profile meter-stop verb*.
+- [ ] Read the profile's region/DC-lock: if a shared/network volume is region-scoped, confirm all instances share that region BEFORE launch.
+- [ ] Measure live, do not assume: `df -h && df -i <profile data mount>` (inodes die before bytes — principle #5), cgroup `cat /sys/fs/cgroup/memory.max`, `nvidia-smi`.
+- [ ] `du -sh` the real space hogs on the actual mount (symlinked caches hide — principle #5), not the dir assumed.
+- [ ] Pre-compute the checkpoint disk budget: `ckpt_size × N + scratch`; confirm it fits the *profile data mount* cap.
+> **verify:** `nvidia-smi` shows the expected GPU and `df -i <profile data mount>` is not near 100%.
+---
+## Phase 1 — SSH + credentials
+- [ ] Set the SSH alias/env per the profile's NETWORK section; note the SSH flavor (a proxied/basic SSH may not `scp`/`rsync` — direct-TCP required; ports may change on restart).
+- [ ] Use the prebuilt image/base AS the env — do NOT `conda create` on a rental (throwaway-instance exception: the image IS the env).
+- [ ] Push secrets via **stdin, never onto a shared/durable FS** (a shared FS is multi-project) — pattern in `references/ssh_transport.md`. Reference creds by env-var NAME / file path, never inline a key.
+- [ ] If the profile sits behind the GFW, wire the China-mirror endpoint now (`references/china-network.md`); validate the speed test on the SAME route the real transfer uses (principle #7).
+> **verify:** `ssh <alias> 'python -c "import torch;print(torch.cuda.is_available())"'` prints `True`.
+---
+## Phase 2 — Wrapper + CPU-smoke gate
+- [ ] Build an idempotent `run_one` / `run_queue` from `scripts/` (`run_one.sh.template`, `run_queue.sh.template`), parameterized from the profile's SCRIPT OVERRIDES (`DATA_DIR=`, `DURABLE_DIR=`, `PROXY_HOOK=`, `CRED_FILE=`, `SCRATCH=`, `HF_HOME=`, `DETACH=`).
+- [ ] Wrappers are resumable: load-latest-on-startup unconditionally so the identical launch command resumes, not restarts (principle #8).
+- [ ] Build the per-cell queue/config files with one isolated write path per cell (no shared mutable output — parallel ablation needs this; `references/parallel_ablation.md`).
+- [ ] **Run the cheap CPU smoke LOCALLY, BEFORE renting** — 1–2 batches, logger disabled, tiny shapes; it kills import/config/shape/scale bugs for ~free (principle #2). Smoke *content* → **verifying-dl-experiments** (REQUIRED).
+> **verify:** smoke exits 0 on 2 batches with the logger disabled, no Traceback.
+---
+## Phase 3 — Detached launch
+- [ ] Launch via the *profile detach primitive* (tmux / `sbatch` / k8s Job / commit) — survives an SSH drop; confirm whether it also survives an instance restart (profile DAEMON section).
+- [ ] Push code/data with a resumable transfer (`rsync --partial` or `timeout`+resume loop — principle #7); never edit a script under a live run — version filenames (principle #6).
+- [ ] Probe briefly: log head + process alive + no traceback, then **hand control back**. Never a blocking foreground `sleep` (foreground Bash hard-caps at 600 s).
+> **verify:** within 60 s, the detach session is alive and the first log line shows the expected step/epoch.
+---
+## Phase 4 — Durable monitoring
+- [ ] For anything over ~1–2 h, deploy the **four-layer architecture** (`references/monitoring_patterns.md`): on-box self-completion chain + session patrol loop + event sentinels + recovery handbook. A session-bound watcher alone dies with the session (principle #3).
+- [ ] Use `run_in_background` (no duration cap, notifies on exit; a Claude Code primitive — other hosts map per `monitoring_patterns.md` §7) for long waits; never foreground-poll. NEVER an unquoted `|` inside a poll-regex — it reads stdin and hangs forever.
+- [ ] Watch `df -i` trend (not just `df -h`), cgroup memory %, new FINISHED/ERROR/Traceback markers, and fast-finish (< ~50% expected duration → probable failure).
+- [ ] Reconcile each watcher against the job's REAL process/artifact (`tmux ls`/`squeue`/`pgrep` + output `mtime`) — a watcher's own state is a claim, not ground truth (principle #3). Tear a watcher down when its job is superseded.
+- [ ] Classify each failure → its fixed remediation (see Failure handling below); **never blind-retry**.
+> **verify:** the patrol reports a status line even when nothing changed (proves it's alive, not silently dead).
+---
+## Phase 5 — Aggregate + verify + teardown
+- [ ] Run the aggregation step (`scripts/aggregate_to_fs.sh`, idempotent — safe to re-run) to checked-sync results to the *profile durable mount*. **Gate the success line on the actual copy result** — `cp …; echo synced` lies on a full/inode-exhausted FS (principle #3, gotcha U33).
+- [ ] Confirm the durable mount has the expected artifact count: `ssh <alias> 'ls <profile durable mount>/final_ckpts/ | wc -l'`.
+- [ ] Pull results to local (resumable per-dir scp/rsync loop); HARD-sanitize the local target to `/path/to/local` — never a real personal path.
+- [ ] **Load-and-verify each artifact** before teardown: `python scripts/verify_local.py /path/to/local/final_ckpts/` → expect `OK N/N, errors 0`. Re-pull + re-verify any error.
+- [ ] Record disclosable run facts for the paper: CLI overrides, tracker summary URL (a hosted tracker survives teardown — **huggingface-skills:huggingface-trackio**). Transport verbs (`hf download --resume`, `hf upload-large-folder`) → **huggingface-skills:hf-cli**.
+- [ ] ONLY THEN perform the *profile meter-stop verb*, AFTER explicit user approval of the specific cost-affecting action.
+> **verify:** `verify_local.py` reports 100% OK *before* any teardown.
+> **Iron Law — teardown gate:** NO `stop` / `release` / `terminate` / `destroy` / file-delete until
+> checkpoints are **pulled to local AND verified by load**, AND the user has explicitly approved the
+> cost-affecting action. "It looked done in the log" is not evidence (principle #3). On most platforms the
+> meter-stopping verb is **irreversible** (deletes the disk) — confirmation matters *more*, not less. The
+> general form is **superpowers:verification-before-completion** (REQUIRED).
+---
+## Cost-saving teardown table
+The verb that stops the meter, what each preserves, and irreversibility — bind the platform-specific verb
+from the profile's TEARDOWN section. **The biggest portability trap: "stop" rarely stops the meter, and
+the action that does is usually irreversible** (principle #4/#9).
+| Action | Stops GPU meter? | What survives | Reversible? |
+|---|---|---|---|
+| **stop / 关机 (power-off)** | Sometimes — depends on profile (AutoDL: yes, keeps disk; RunPod/vast: still bills storage 1–2×) | Disk tier per profile survival-matrix | Yes — instance restartable |
+| **release idle instance** | Yes | Only the durable/shared mount (data disk gone) | No — instance + container disk destroyed |
+| **terminate** | Yes | Only a network/persistent volume, if one was mounted | **No — irreversible**, disk deleted |
+| **destroy** | Yes | Nothing on the box | **No — irreversible**, total loss |
+| **delete durable files (keep subscription)** | Storage trickle only | Subscription survives for new data | No — those files gone |
+| **cancel durable storage subscription** | Storage cost only | Nothing | **No — irreversible**, all durable data lost |
+**Default conservative plan:** stop/release the GPU instance first (immediate $ saving, low risk once
+artifacts are verified-local). Keep durable storage 1–3 months until the paper is submitted. Cancel the
+durable subscription LAST, only after the local copy is verified and the user approves.
+---
+## Failure handling *(inline, any phase)*
+Categorize before reacting; retry the **identical** config — hand-patching one run destroys comparability
+(principle #7; **verifying-dl-experiments** owns is-it-a-bug-or-real).
+- [ ] **Probabilistic** (epoch-1 stall, transient `wandb.init` blip, spot preemption): queue a retry with the SAME config, no safeguards. Resume works because of checkpoint-load (principle #8).
+- [ ] **Disk-full** (exit 1 + `iostream` / "No space left"): prune the *profile scratch* (`SCRATCH=` — periodic checkpoints, unused caches), keep `best`; if cleanup can't free enough, **ask to expand the disk**, never silently shrink the experiment (principle #9). Then retry.
+- [ ] **Real bug** (CUDA OOM, code error, all-zero metric): stop, investigate code — do NOT retry blindly.
+> Symptom → root cause → fix for each, plus the full catalog: `references/gotchas_universal.md`
+> (`grep -in <keyword> references/gotchas_universal.md` to jump).

package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md ADDED Viewed

@@ -0,0 +1,327 @@
+# Monitoring Patterns — durable watching of a remote GPU job
+Platform-agnostic recipes for babysitting a long-running detached job on a rented box. The crown jewel
+is the **four-layer durable-monitoring architecture** (§3): a session-bound watcher alone dies with the
+session, so layer it. Every recipe uses portable primitives — `tmux` OR `squeue` OR `pgrep`, a log
+marker OR an artifact `mtime` — never one platform's paths. Bind the concrete paths/aliases from
+`profiles/<platform>.md`.
+To jump: `grep -in '<keyword>' references/monitoring_patterns.md`.
+## Table of contents
+- §0 Monitoring physics — the four facts every recipe rests on
+- §1 The robust short-connection ssh-poll template (the safe poll primitive)
+- §2 Quick health probes (one round-trip each)
+- §3 Durable monitoring architecture — the four layers (L1 self-completion · L2 patrol · L3 sentinels · L4 handbook)
+- §4 Stale-waiter hygiene — one waiter per live run, right lifetime
+- §5 Two-leg self-completion — guaranteed results + best-effort cadence
+- §6 Failure triage on the log
+- §7 Monitoring across agent hosts — per-host background/loop/cron primitives + the 2 portability rules (Claude Code · Codex · Cursor · Trae · generic)
+---
+## §0 Monitoring physics — the four facts every recipe rests on
+Verified in-session, not assumed. The whole architecture is engineered around these:
+> **Tool-portability note:** `run_in_background`, the ~600 s foreground cap, and `/schedule` (below and
+> §5/§3) are the **Claude Code** harness's primitives. On another Agent-Skills host (Codex / Cursor /
+> Trae / …) map them to that agent's equivalents — its background-task or async runner, its own
+> foreground/turn limit, its scheduler. The four-layer architecture itself is host-agnostic — the full
+> per-host mapping (Codex / Cursor / Trae / generic) and the two portability rules are in **§7**.
+1. **Foreground Bash hard-caps at 600 s (10 min).** A long foreground wait/monitor is *killed* at the cap
+   — so never foreground-poll a multi-hour run.
+2. **`run_in_background` has NO duration cap and notifies on EXIT.** A 781 s background task ran to
+   completion and notified (verified). Long task that finishes → background it.
+3. **A never-*exiting* watcher never notifies.** No exit event = no notification, ever. A persistent
+   `while true` / a stray `grep` reading stdin hangs silently forever and the user reads silence as "dead
+   monitor". Every watcher must have a bounded exit.
+4. **An unquoted `|` inside a poll regex hangs forever.** The shell splits `grep -hE a|b|c log` into three
+   piped commands; the first (`grep -hE a`, no filename) reads **stdin** → blocks → the pipeline never
+   returns → the ssh never returns → the background process never exits → fact 3 fires. ALWAYS quote the
+   regex AND give grep a filename.
+Corollary — **trust the artifact, not the silence.** When a job "looks done," Read its output file and
+re-check ground truth (`grep DONE log; tmux ls / squeue; nvidia-smi`) before claiming success. Do not
+wait blindly for a notification that may never fire. This is the `verifying-dl-experiments` (REQUIRED)
+Iron Law applied to monitoring.
+---
+## §1 The robust short-connection ssh-poll template (the safe poll primitive)
+The single most important pattern: a poll that cannot hang (fact 4) and cannot strand a half-open
+connection. **Never hold one long ssh open for the whole wait** — loop locally, reconnecting each tick.
+```bash
+#!/usr/bin/env bash
+set -u
+# Short-connection poll: ssh in → check → disconnect; bounded local loop.
+HOST="<alias>"                       # from profiles/<platform>.md
+LOG="/path/to/run.log"               # remote log path (profile-bound)
+PATTERN='QUEUE DONE|Training completed'   # QUOTED → '|' is alternation, never a pipe (fact 4)
+MAX=120                               # bounded: 120 ticks × 90 s ≈ 3 h, then give up cleanly
+i=0
+while [ "$i" -lt "$MAX" ]; do
+  # ConnectTimeout + ServerAlive bound a network blip to ~30 s instead of a multi-minute half-open hang.
+  if ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 -o ServerAliveCountMax=3 "$HOST" \
+       "grep -qE '$PATTERN' '$LOG'"; then          # quoted regex + a FILENAME → grep reads the file, never stdin
+    echo "DONE marker found"; exit 0
+  fi
+  i=$((i+1)); sleep 90
+done
+echo "poll gave up after $MAX ticks — check ground truth manually"; exit 1
+```
+Non-negotiables baked in above:
+- **Quoted regex + a filename** on every remote `grep` — the two independent guards against fact 4.
+- **`ConnectTimeout` / `ServerAliveInterval` / `ServerAliveCountMax`** — a dropped link self-kills fast.
+- **Short connection per tick, bounded local loop** — one ssh per check, a hard tick ceiling so the
+  waiter always EXITS (fact 3) and therefore always notifies when backgrounded.
+- **Detect "done" by a log MARKER, never by `pgrep`** of the waiter's own pattern — `pgrep -f` matches
+  the waiter's own command line and the loop never ends. On a queue scheduler, `squeue -j <id>` going
+  empty is the equivalent done-signal.
+Run this via `run_in_background` (fact 2: no cap, notifies on exit), or as a single foreground tick under
+the 600 s cap (fact 1). On a session scheduler use it as the L2 patrol body (§3). **Never foreground-poll
+the full wait.**
+---
+## §2 Quick health probes (one round-trip each)
+Each is a single short ssh. Combine several into ONE round-trip for a patrol tick (§3). Detach-primitive
+and paths come from the profile; the structure is identical everywhere.
+> A blank live **TensorBoard tile / web panel** while these probes show a healthy run is **not** a dead
+> run — it is `references/gotchas_universal.md` **U39**: the panel reads a fixed logdir/port your logger
+> didn't write to, or the TB/watcher process died (ran foreground, not under the detach primitive), or the
+> port isn't exposed. Fix per the platform profile; never restart a healthy run over an empty panel.
+**Is the job alive? (tmux OR squeue OR pgrep — pick the profile's primitive)**
+```bash
+ssh "$HOST" "tmux ls 2>/dev/null || true; squeue -u \$USER 2>/dev/null || true; pgrep -af 'train' | grep -v grep | head -3"
+```
+**Progress since last check** — grep the run's OWN log, not a shared master:
+```bash
+ssh "$HOST" "grep -nE 'Epoch [0-9]+|Training completed|Early stopping|FINISHED|QUEUE DONE' '$RUN_LOG' | tail -6"
+```
+> **Gotcha — crash-detect on the per-run log, never the shared master.** Symptom: a poll reports "run D
+> crashed" while D trains fine. → Root cause: a `tee`'d master log concatenates every run, so grepping it
+> for `Traceback|OutOfMemory` matches an EARLIER run's crash text and false-positives on a healthy later
+> run. → Fix: scope crash detection to the per-run log (`<name>.log`); reserve the master-log grep for
+> `DONE`/`FINISHED`/`QUEUE DONE` and progress markers. A waiter that crash-checks the wrong log spins to
+> its full timeout on a phantom failure.
+**Resource pressure** (cgroup mem, GPU) — thresholds are rough, profile-tunable:
+```bash
+ssh "$HOST" "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader; \
+  [ -f /sys/fs/cgroup/memory.current ] && numfmt --to=iec \$(cat /sys/fs/cgroup/memory.current)/\$(cat /sys/fs/cgroup/memory.max) 2>/dev/null"
+```
+- cgroup mem > 90% of max → OOM risk; GPU util > 60% → healthy, not data-bottlenecked.
+- GPU at 0% but the step log advances ≠ idle — it is CPU-data-bound; sample util over several seconds,
+  never one snapshot. (Diagnosis → `verifying-dl-experiments`, REQUIRED.)
+**Disk — the silent killer** — watch `df -i` (inodes) AND `df -h` (bytes); inodes die first on
+many-small-files eval output (→ `references/gotchas_universal.md`):
+```bash
+ssh "$HOST" "df -h '$DATA_MOUNT'; df -i '$DATA_MOUNT'"
+```
+---
+## §3 Durable monitoring architecture — the four layers (earned by three live failures)
+Session-bound watchers die with the session; the instance itself can die under the watcher; and a
+monitor that only speaks on terminal events reads as "nobody is watching." One layer cannot fix all
+three. Run four — **correctness in L1, liveness in L2, latency in L3, continuity in L4**:
+| Layer | Lives where | Job | Survives |
+|---|---|---|---|
+| **L1 self-completion chain** | ON the box (tmux / nohup / sbatch dependency) | the work sequences itself: `until grep -q 'Training completed' log; do sleep 150; done && <next stage>`; stages hand off via `touch /path/STAGE_DONE` markers | session death, network loss |
+| **L2 patrol loop** | session scheduler (cron `/loop`) | every ~30 min fire a SELF-CONTAINED patrol: one combined ssh probe + a decision table + "report EVEN IF nothing changed" | idle gaps (NOT session death — see L4) |
+| **L3 event sentinels** | session background (`run_in_background`) | the §1 short-poll `until ssh test -f MARKER; …` for minute-level reaction between patrol ticks | nothing — acceptable; L1/L2 carry correctness |
+| **L4 recovery handbook** | persistent notes/memory | exact resume commands, chain definitions, marker paths, "first command on reconnect" — a BRAND-NEW session takes over from one word | everything |
+### L1 — on-box self-completion chain (correctness)
+The box finishes its own pipeline regardless of any watcher. Chain stages under one detach primitive and
+**join them with `&&`, never `;`** so a marker only lands on success:
+```bash
+# tmux / nohup variant — the detach primitive is the swappable plug (sbatch dependency on Slurm)
+nohup bash -c '
+  set -u
+  until grep -q "Training completed" /path/to/train.log; do sleep 150; done \
+    && python -m eval ... \
+    && touch /path/to/STAGE_DONE        # marker ONLY on a clean &&-chain
+' </dev/null >/path/to/chain.log 2>&1 &
+```
+> **Gotcha — success-gate the chain markers.** Symptom: the downstream chain fires on a phantom
+> completion. → Root cause: joining stages with `;` (or a bare `touch` after a crashing stage) stamps the
+> marker even when a stage died — a live disk-full `torch.save` killed stage 3, the `;`-marker still
+> landed, the next stage ran on nothing. → Fix: `&&` between every stage and the final `touch`; detect
+> done by the marker, never by `pgrep` of the waiter's own pattern (fact 4 / §1).
+### L2 — patrol loop (liveness): the design checklist (what made it actually work)
+- **ONE combined ssh probe per tick** — alive-check (tmux ls / squeue / pgrep) + `*_DONE` markers + last
+  epoch line + artifact `ls` + dataset file COUNTS, in a single round-trip.
+- **An explicit decision table**, e.g.: ssh down → tell the user to check the console (only they see
+  balance/power state); detach session missing AND no completion marker → resume from `latest` + rebuild
+  the L1 chain; result CSV exists → `cat` it and report the numbers verbatim; remote file count below the
+  local source → resume the transfer; everything done → delete the patrol job itself.
+- **Report a one-line status EVEN WHEN nothing changed** — silence between events is exactly what users
+  read as a dead monitor. ("你有定时看吗??" twice in one campaign is the failure signature of L3-only.)
+- **Completeness = file COUNT against the local source** (bytes/hash when names collide), NEVER `test -d`
+  — a dir created by a killed transfer passes existence checks forever.
+- **Never blind-restart** — probe session/log/markers first so a patrol firing mid-run cannot
+  double-launch (idempotence). Classify each outcome → a fixed remediation; never blind-retry.
+> **Ready-made tick:** `scripts/health_patrol.sh.template` is this checklist as one runnable,
+> read-only ssh round-trip — alive + done-count + last epoch + crash-scan + `df -h`/`df -i`, an
+> escalation predicate, and a one-line report even when nothing changed — parameterized from the
+> profile's §8. Fire it from the host's recurring runner (§7: `/loop`, cron `3,33 * * * *`, …).
+### L3 — event sentinels (latency)
+The §1 short-poll loop for minute-level reaction between patrol ticks. Survives nothing — it is the
+disposable fast-reaction layer; L1/L2 carry correctness. Re-arm exactly ONE after any session resume.
+> **`run_in_background` is NOT a substitute for `/loop` on an unattended wait.** A one-shot
+> `run_in_background` sentinel notifies on EXIT — fine while you keep working in an ACTIVE session, but if
+> the session goes idle for hours its exit-notification lands on a closed/reset session and you hear
+> nothing (the silent-monitor-for-hours failure). Any UNATTENDED wait over ~1 h → bind the **L2 `/loop`
+> patrol** (a recurring agent re-wake), never a lone one-shot sentinel.
+### L4 — recovery handbook (continuity)
+Persistent notes a brand-new session inherits from one word ("继续"): exact resume commands, the L1 chain
+definition, every marker path, the "first command on reconnect." Two durable hardenings:
+- **Externalize transfer/monitor state to a stable OS path** + a DONE marker file *outside* the session
+  dir, so any future session resumes by reading files instead of re-uploading.
+- **True restart-immunity means an OS-owned process** (Task Scheduler / cron) — but creating one is
+  unauthorized persistence to the permission classifier: **get the user's explicit one-line approval
+  first**, or hand them the launch command to run themselves.
+> **Gotcha — after a context compaction, reconcile UI task chips against the OS process table.**
+> Symptom: 5 chips show "Running" for 2–6 h while zero ssh/scp processes exist and a "running" upload
+> actually died at 2/10 checkpoints, silently gating the downstream eval all evening. → Root cause:
+> background shells die with the old session, but their chips keep showing "Running"; the new session's
+> task list is empty, so the only ground truth is a process scan. → Fix: **first action after any
+> compaction is a process-scan** (e.g. `Get-CimInstance Win32_Process` matched on the remote host string,
+> or `pgrep`/`ps` for ssh/scp), relaunch dead transfers with a byte-size verify, re-arm ONE fresh
+> sentinel, and tell the user to clear the husks.
+---
+## §4 Stale-waiter hygiene — one waiter per live run, right lifetime
+> **Gotcha — stale background waiters pile up.** Symptom: the Background-tasks panel shows 8+ "Running"
+> wait-loops at 500–740 min elapsed, ssh-polling every ~20 s, while the GPU is idle and the experiment
+> finished hours ago. → Root cause: every kill+restart of a flaky-network saga armed a NEW
+> `until ssh grep MARKER; do sleep 20; done` waiter but never stopped the OLD one — its marker (in a
+> superseded log) never appears, so it loops forever (fact 3). → Fix below.
+- **One waiter per live run.** Superseding a run → STOP its old waiter *first* (TaskStop, or dismiss a
+  cross-session chip from the UI — resumed-session IDs aren't stoppable programmatically).
+- **Match watcher lifetime to the wait.** Multi-hour wait → a persistent Monitor (no 10-min cap) plus a
+  stall-detector so a hung run still notifies. A persistent monitor still dies on session resume → after
+  any resume, **check remote ground truth directly** (tmux ls / squeue, `grep DONE log`, `nvidia-smi`);
+  do not trust a watcher that may be gone (fact 3 + §0 corollary).
+- **A dropped poll connection ≠ the job dying.** A long background ssh poll gets killed by the remote's
+  idle-SSH timeout while the detached training runs on independently. Re-ssh and verify the process/
+  artifacts directly before concluding anything died.
+---
+## §5 Two-leg self-completion — guaranteed results + best-effort cadence
+"I'll check periodically" is a lie unless a trigger is ARMED — between turns the assistant does not run.
+Two legs, never conflated:
+- **Leg 1 — remote self-completion (guaranteed, survives session/SSH death):** the L1 chain
+  (`train → eval → touch marker` under one detach primitive). Detect done by a log/marker, never by
+  `pgrep` of the waiter's own pattern. This guarantees RESULTS but gives no reporting cadence.
+- **Leg 2 — live progress (best-effort):** a session-bound patrol loop (L2, e.g. `/loop 30m` or cron `3,33 * * * *`)
+  polling with the LOCAL ssh key. Be honest it dies when the session closes — the remote still finishes;
+  the user re-pings to pull.
+> **A cloud scheduler cannot reach a rented box.** A cloud schedule (`/schedule` / RemoteTrigger) runs in
+> an isolated sandbox with its own checkout and **no access to the local SSH key or network** → it cannot
+> ssh the box, and the SSH private key must **never** be placed in a cloud agent (secret-leak). The honest recurring check is the remote self-monitor + a session loop, not a cloud robot pinging
+> the box. Don't promise autonomous cross-session polling that can't be delivered.
+For a hosted tracker whose metrics survive teardown and can be polled as a structured monitor instead of
+brittle ssh-tail, use `huggingface-skills:huggingface-trackio` (REQUIRED for that path) — poll its alerts
+rather than grepping a remote log.
+---
+## §6 Failure triage on the log
+When a probe shows trouble, pull the full traceback from the per-run log (§2) and classify — each
+outcome maps to a FIXED remediation; never blind-retry:
+```bash
+ssh "$HOST" "grep -B2 -A20 'Traceback' '$RUN_LOG' | head -50"
+```
+- `basic_ios::clear: iostream error` + `unexpected pos N vs M` → **disk full during checkpoint save**;
+  check `df -h`/`df -i`, prune `latest`/periodic snapshots to recover (→ `references/gotchas_universal.md`).
+- bare `Killed` / exit 137, no traceback → **cgroup OOM** (workers × big in-RAM tensor); size workers
+  vs `memory.max`, not CPU count.
+- `CUDA out of memory` → VRAM, usually consistent across runs (batch too big / concurrent job), rarely
+  transient.
+- `KeyError` / `AttributeError` → config/code mismatch; investigate code, do not retry.
+- Early-stop far below baseline with a grad_norm P99 spike in epoch 1–2 → likely **probabilistic
+  divergence**; whether it's a bug or a real effect, and the retry-the-identical-config rule, belong to
+  `verifying-dl-experiments` (REQUIRED) — this skill owns *running* the retry, not judging the number.
+- log frozen (no new lines) but checkpoint `mtime` advances → **block-buffered stdout**, not a hang
+  (`references/gotchas_universal.md` U43; run `python -u`/`PYTHONUNBUFFERED=1`).
+- `uptime`/`free` on the box look maxed but your cgroup is roomy → **noisy neighbor** on the shared host,
+  not your job (`references/gotchas_universal.md` U41; the authoritative OOM check is the `oom_kill` counter
+  in `/sys/fs/cgroup/memory.events`).
+- GPU SM% pinned low while a python thread-storm pegs the cores → **intra-op thread oversubscription** on a
+  vCPU slice (`references/gotchas_universal.md` U40; cap `OMP_NUM_THREADS` to the cgroup quota).
+Universal gotchas (silent sync, CRLF, mid-run script overwrite, inode caps) are NOT restated here —
+see `references/gotchas_universal.md` (`grep -in '<keyword>' references/gotchas_universal.md` to jump).
+---
+## §7 Monitoring across agent hosts (portability mapping)
+The four layers are host-agnostic; only **which primitive runs L2/L3** changes per host. Two rules port
+the whole architecture to Codex / Cursor / Trae / any Agent-Skills host:
+**Rule 1 — the durable layer needs no agent.** L1 (the box self-completes + `touch`es a marker) plus the
+box **pushing its own notification** at the end of the `&&`-chain — a `curl` webhook / email / a
+`huggingface-skills:huggingface-trackio` alert — works on EVERY host, because it runs entirely on the
+rented box. On a host with no background/scheduler primitive, this IS the monitor; the agent just pulls
+results on its next turn.
+**Rule 2 — a CLOUD scheduler cannot reach a rented box (§5), on ANY host.** Every host's hosted
+automation runs in an isolated sandbox with no local SSH key or network, so it cannot ssh your box (and
+the key must never be placed in one — secret-leak). Use cloud cron only to **re-wake the agent** or
+**poll a hosted tracker**, never to probe the box. The box-reaching poll must use the host's
+**local/session** runner (which holds your SSH key), or be the L1 on-box loop.
+| Agent host | Local runner — reaches the box (L3) | Recurring / loop (L2) | Cloud cron/automation — re-wake / tracker only | Foreground/turn limit |
+|---|---|---|---|---|
+| **Claude Code** | `run_in_background` (detach + notify-on-exit); the `Monitor` tool | `/loop` + `ScheduleWakeup` (interval or self-paced) | `/schedule` (cron cloud agent) | ~600 s foreground |
+| **OpenAI Codex** | Codex Cloud background tasks (async, parallel) | a thread that schedules its own wake-up | **Automations** — cron syntax, results → review queue | per cloud task |
+| **Cursor** | Background Agents (async) | — | **Automations** — cron (hourly/daily/weekly) + event triggers | per agent |
+| **Trae** (ByteDance) | Agent / `trae-agent` CLI unattended runs; CI/CD | via a CI/CD pipeline | **no native cron found** → external cron / CI-CD, or rely on Rule 1 | per run |
+| **Generic / none** | any local background-equivalent (else none) | a shell `while`-loop under the turn limit | none | host turn limit |
+> **Hosts not in the table** (Gemini CLI, VS Code / Copilot, Goose, Kiro, …) take the **Generic** row until they expose a local recurring runner that holds your SSH key — until then, wire **Rule 1** (the on-box self-push) and let the agent pull on its next turn.
+**Binding the layers:** L1 is unchanged everywhere (on-box). Bind **L2** to the host's local recurring
+runner *if* it reaches the box, else to the box's own `cron`/`at` + a push (Rule 1). Bind **L3** to the
+host's local background runner, re-armed once per resume. When a host offers only cloud automation (or
+nothing), **do not promise agent-side polling of the box** — wire Rule 1 and let the agent pull on its
+next turn (§0 corollary: trust the artifact, not the silence).
+Host capabilities verified 2026-06: Codex Automations (cron) + Cloud background tasks —
+`developers.openai.com/codex/app/automations` + `/codex/cloud`; Cursor Automations (cron + event
+triggers) + Background Agents — `cursor.com/docs/cloud-agent/automations`; Trae Agent / `trae-agent` CLI
++ CI/CD, no native cron surfaced — `docs.trae.ai/ide` + `github.com/bytedance/trae-agent`.