npm - opencode-skills-collection - Versions diffs - 3.1.1 → 3.1.3 - Mend

opencode-skills-collection 3.1.1 → 3.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (86) hide show

package/bundled-skills/remote-gpu-trainer/SKILL.md ADDED Viewed

@@ -0,0 +1,249 @@
+---
+name: remote-gpu-trainer
+description: "Deploy, monitor, and debug long GPU jobs on RENTED/remote instances (AutoDL, RunPod, vast.ai, Lambda, Slurm, K8s): teardown/billing safety, spot resilience, resumable checkpointing, OOM/NaN triage."
+risk: safe
+source: community
+source_type: community
+source_repo: Hanyuyuan6/remote-gpu-trainer
+date_added: "2026-06-20"
+category: ml-ops
+license: "MIT"
+license_source: "https://github.com/Hanyuyuan6/remote-gpu-trainer/blob/main/LICENSE"
+compatibility: |
+  Any Agent-Skills (SKILL.md)-compatible agent — Claude Code, Codex, Cursor, Trae, Gemini CLI, etc.
+  Needs a shell + SSH (or a platform CLI/API) to drive the remote box; scripts are bash/python. A few
+  durable-monitoring recipes assume a host background-task runner + scheduler — map to the running
+  agent's equivalents (references/monitoring_patterns.md §7). Companion skills (verifying-dl-experiments,
+  superpowers:*, huggingface-skills:*) are optional separate installs.
+---
+# remote-gpu-trainer — Remote GPU Job Orchestration
+## Overview
+Deploy and babysit long-running GPU jobs on **rented boxes you don't own**, across any platform, and
+get the result off the box before the meter or a preemption kills it. The core insight: **you are a
+short-term tenant on someone else's machine** — so the job is to *detach the work, make the result
+outlive the instance, and stop the meter safely*, not to provision a cluster.
+This skill is **platform-agnostic at the core, platform-specific at the edges**: a fixed set of
+operating principles + a 6-phase lifecycle that hold everywhere, plus one **profile per platform**
+(`profiles/<platform>.md`) that owns every concrete path, proxy, billing verb, and spot semantic. Its
+defensible value is the union the big orchestrators skip: **Chinese cgroup-isolated rentals + bare-SSH
+cheap boxes + the disk-budget / monitoring / teardown reality** that *is* the job on metered hardware.
+## When to Use This Skill
+Use whenever the user deploys, trains, monitors, or troubleshoots a long-running GPU job on a **RENTED
+or remote instance they do not own** — training, eval, ablation sweeps, batch inference, or large data
+processing — on AutoDL, RunPod, vast.ai, Lambda, Paperspace, Chinese platforms (恒源云/矩池云/Featurize/
+揽睿星舟), a bare SSH box, Slurm, or Kubernetes; single OR multi-instance. Triggers (multilingual):
+"远程 GPU 训练", "GPU 租赁", "GPU rental", "租卡", "spot 抢占", "spot preemption", "断点续训",
+"resumable training", "tmux 训练守护", "防 SSH 断线", "scp/rsync 上传", "多实例 ablation",
+"远程 GPU 监控", "省钱关机/销毁实例", "stop vs terminate billing", "checkpoint 磁盘满",
+"CUDA OOM/显存不足", "loss NaN/loss spike", "loss 不下降/不收敛", "overfit 单 batch",
+"FSDP/DeepSpeed 配置", "多卡训练 hang", "dataloader worker/数据增广 bug". **NOT** for purely local
+single-GPU training, in-instance multi-GPU DDP (use torchrun/accelerate), managed multi-cloud
+price-shopping (use SkyPilot's skill), or zero-ops serverless (use Modal).
+## When NOT to use — and what to use instead
+| Situation | Use instead |
+|---|---|
+| Local single-GPU, or multi-GPU **DDP inside one box** | `torchrun` / `accelerate` directly |
+| Managed multi-cloud price-shopping + auto spot-recovery across **Western** clouds | **SkyPilot** (has its own Agent Skill) — then come back here to make your *code* resume-correct so its recovery actually works |
+| Open BYOC dev environments | **dstack** |
+| Zero-ops serverless inference | **Modal** |
+| "Is this metric / ablation delta real?" | **REQUIRED:** `verifying-dl-experiments` (this skill owns *running* the job; that one owns *whether the number is true*) |
+**This skill is for the blind spot those tools leave:** AutoDL + Chinese platforms, bare SSH/Slurm/K8s
+rentals, and the operational gotchas (inode caps, mirror stalls, cgroup OOM, silent sync, spot grace
+windows, irreversible teardown) that survive whichever provisioner you use.
+## Operating principles (the WHY — 10 invariants)
+These hold on every metered, isolated, rented GPU; only the paths/CLI change. One line each; the deep
+form with cross-platform nuance is in **`references/principles.md`** (read it before Phase 0).
+1. **Minimize paid wall-clock.** The meter runs the whole time — smoke locally on CPU before renting, launch detached, release the instant verification passes.
+2. **Cheap checks before expensive compute.** A 1–2 batch CPU smoke (logger off) kills import/config/shape/scale bugs for ~free. (Smoke *content* → `verifying-dl-experiments`.)
+3. **Trust artifacts you loaded, not log lines that claim success.** "synced/saved/done" lies under a silently-failed write; a watcher's own state is also a claim — reconcile it against the real process/artifact.
+4. **Know what survives stop vs destroy.** Per platform, identify exactly which mount survives a *stop* and which survives a *terminate* — the data you need often lives on the volatile one. (The single biggest portability trap.)
+5. **Storage fails on the dimension you're not watching.** Disk dies on **inodes** before bytes; the real hog hides in a symlinked cache; clean by value (keep tiny evidence, drop big scratch); monitor `df -i`, not just `df -h`.
+6. **Never mutate inputs under a live run.** A running job holds its scripts in memory by byte-offset; overwriting one mid-run re-executes blocks. Version filenames.
+7. **Design for retry — failure is probabilistic, transfers are flaky, mirrors are route-specific.** Make wrappers idempotent + resumable; retry the *identical* config; wrap bulk transfers in `timeout`+resume loops; a mirror/proxy speeds ONE route — validate on the same route the real transfer uses.
+8. **Checkpoint-to-durable + idempotent resume is the universal spine.** File checkpoint to the platform's durable location + unconditional load-latest-on-startup is the *one* mechanism that survives an SSH drop, a Slurm walltime kill, a K8s reschedule, a spot preemption, and a Colab disconnect. The detach primitive (tmux/sbatch/Job/commit) is the swappable plug; this is the invariant.
+9. **Cost and destructive actions are the user's call.** Never auto-release/terminate, never delete durable files without confirmation; if cleanup can't free space, **ask to expand the disk** rather than silently shrink the experiment.
+10. **Teach the user the platform, don't just drive it.** Most users don't know a platform's non-obvious **conveniences** (one-click SSH-key registration, GPU-availability notifications, built-in panels) or its **danger clocks** (auto-release/auto-delete timers on a *stopped* box — AutoDL releases a 关机 instance after 15 days → data disk gone; a stop that keeps billing; low-balance purge). Surface them on first contact — #9 stops the agent *doing* the dangerous thing, #10 *warns the human* before the clock fires. Per-platform list → each profile's **Surface to the user** block.
+> **Monitoring physics (substrate for #3):** foreground Bash hard-caps at 600 s; `run_in_background` has no cap and notifies on exit; a never-exiting watcher never notifies; an unquoted `|` in a poll regex reads stdin and hangs forever. The four-layer monitoring architecture is built on these facts → `references/monitoring_patterns.md`.
+## Code discipline (the wrapper & training scripts you write)
+Two rules govern the launch/wrapper/training code this skill has you write — corollaries of #1 and #8, not new invariants:
+1. **Reuse before writing.** Take the lowest rung that already works before adding code: the base image's pre-installed stack + platform features → a framework/library utility (`torchrun` / `accelerate` / HF) → your existing `scripts/` templates → minimal new code. On a metered box a needless `pip install` also burns paid wall-clock and can break the image's ABI — Phase 1's rule (*the prebuilt image **is** the env; don't `conda create` on a rental*) is exactly this principle applied to dependencies.
+2. **Floor — `minimum` bounds scope, not correctness.** Shrinking code must never drop what makes an expensive run survivable: checkpoint-to-durable + idempotent resume (#8), atomic writes, the error handling that prevents losing a long run, or seed/determinism logging. Keep one minimal self-check for non-trivial logic.
+## Pick your platform profile FIRST
+Read the matching profile **before Phase 0** — it owns every path, proxy, credential location, billing
+verb, and spot rule the phases below delegate to. Each follows the same 8-field schema
+(`profiles/_schema.md`).
+> **New here? The path is:** (1) find your platform in the table below → (2) read that profile's **LAUNCH**
+> section (it walks rent → register SSH key → reach the box) → (3) come back and run the 6 phases from Phase 0.
+> Already have a box you can `ssh` into? Skip straight to Phase 0.
+| You're on… | Profile | Kind | Detach primitive | Meter-stop verb |
+|---|---|---|---|---|
+| AutoDL (deepest, battle-tested) | `profiles/autodl.md` | ssh-rental | tmux | 关机 (stops meter, **keeps disk** — the AutoDL exception) |
+| RunPod | `profiles/runpod.md` | ssh-rental | tmux | **terminate** (stop still bills 2×; destroys volume disk) |
+| vast.ai | `profiles/vastai.md` | ssh-rental (spot) | tmux | **destroy** (stop bills disk forever) |
+| Lambda | `profiles/lambda.md` | cloud-api | tmux | **terminate** (no stop state) |
+| Paperspace | `profiles/paperspace.md` | cloud-api | tmux | **destroy + release IP + delete storage** (shut-down stops compute only) |
+| 恒源云 / 矩池云 / Featurize / 揽睿星舟 | `profiles/china.md` | ssh-rental | tmux | per-platform (data disk often bills while stopped) |
+| Bare SSH box / Slurm / K8s / Colab-Kaggle | `profiles/generic-ssh.md` | ssh / slurm / k8s | tmux / sbatch / Job / commit | **manual** (a forgotten box bills 24/7) |
+> **Profile confidence:** AutoDL is battle-tested from the author's daily use; the other six profiles are
+> built from each platform's official docs + community reports (cited inline, `verified <month>`) and not
+> yet independently live-tested — lean on the Phase-0 live measurements and **re-verify any teardown/
+> billing fact against current docs before betting money or data** (`references/self-improvement.md` §5).
+**Mental verb model** (one API across all platforms; the profile binds each verb to real commands):
+`up` (rent+reach) → `push` (code/data on) → `run` (detached + checkpointing) → `watch` (durable monitor) → `pull` (results off + verify) → `down` (stop the meter).
+## Default workflow (6 phases)
+Skip phases already done. Each phase delegates substrate to the profile and **ends in a runnable check**.
+**Phase 0 — Environment audit.** Read the profile's STORAGE survival-matrix + region/DC-lock. Measure live:
+`df -h && df -i <data-mount>`, cgroup `memory.max`, `nvidia-smi`. Pre-compute the checkpoint disk budget
+(`ckpt_size × N + scratch`). → **verify:** `nvidia-smi` shows the expected GPU and `df -i` is not near 100%.
+**Phase 1 — SSH + credentials.** Set the alias/env per the profile (the prebuilt image/base IS the env —
+do not `conda create` on a rental). **Never rented before? the profile's LAUNCH section walks rent → register SSH key → connect.** Push secrets via **stdin, never onto a shared/durable FS**
+(`references/ssh_transport.md`). → **verify:** `ssh <alias> 'python -c "import torch;print(torch.cuda.is_available())"'`.
+**Phase 2 — Wrapper + CPU-smoke gate.** Build an idempotent `run_one`/`run_queue` from `scripts/` (parameterized
+from the profile's OVERRIDES; **size batch/workers to the box for a standalone run, but PIN them across cells for a fair comparison** — `references/training/throughput-profiling.md`). **Run the cheap CPU smoke locally BEFORE renting** — it kills the dumb,
+expensive failures (e.g. `python -m <your.train.module> --limit-batches 2 --epochs 1` — substitute your own entrypoint; this gate needs your training code plugged in). → **verify:** that smoke exits 0 on 2 batches with the logger disabled.
+**Phase 3 — Detached launch.** Launch via the profile's detach primitive; probe briefly (log head + alive +
+no traceback), then **hand back** — never a blocking foreground `sleep`. → **verify:** within 60 s, the detach
+session is alive and the first log line shows the expected step/epoch.
+**Phase 4 — Durable monitoring.** For anything over ~1–2 h, deploy the **four-layer architecture**
+(`references/monitoring_patterns.md`): on-box self-completion chain + session patrol loop + event sentinels +
+recovery handbook. **On Claude Code, fire the L2 patrol via `/loop 30m` (or `ScheduleWakeup`) running `scripts/health_patrol.sh.template`**; a host with no local recurring runner wires the on-box self-push instead (`references/monitoring_patterns.md` §7). A session-bound watcher alone dies with the session. Classify each outcome →
+fixed remediation; **never blind-retry**. → **verify:** the patrol reports even when nothing changed.
+**Phase 5 — Aggregate + verify + teardown.** Checked-sync to durable storage (gate the success line on the
+copy result — principle #3), then **load-and-verify each artifact** (`scripts/verify_local.py`), THEN the profile's
+meter-stopping action. → **verify:** `verify_local.py` reports 100% OK *before* any teardown.
+> **Iron Law — teardown gate:** NO `release` / `terminate` / `destroy` / file-delete until checkpoints are
+> **pulled to local AND verified by load**, and the user has explicitly approved the cost-affecting action.
+> "It looked done in the log" is not evidence (principle #3). On most platforms the meter-stopping action is
+> **irreversible** (deletes the disk) — confirmation matters more, not less.
+## Parallel ablation fan-out
+For N ablation cells: one job per cell, an **isolated write path per job** (no shared mutable output), launched
+across instances/queues. **REQUIRED:** `superpowers:dispatching-parallel-agents` supplies the independence
+predicate (don't fan out onto shared state) and the mandatory post-fan-out reconciliation. FS-shared deployment
+pattern → `references/parallel_ablation.md`.
+## Quick reference — the four facts that bite per platform
+Full detail in each profile; this table is the at-a-glance.
+| Platform | Survives **stop** | Survives **destroy** | Spot grace | China mirror needed |
+|---|---|---|---|---|
+| AutoDL | /root + data + FS | FS only | n/a | yes (`/etc/network_turbo`, hf-mirror) |
+| RunPod | volume disk (bills 2×) | Network Volume only | ~5 s SIGTERM→KILL | no (`hf_transfer`) |
+| vast.ai | disk (bills forever) | nothing | ~0 s (abrupt) | no |
+| Lambda | n/a (no stop) | nothing | n/a (on-demand) | no |
+| China (恒源云/矩池云/…) | varies; data disk bills | per-platform persistent vol | n/a | yes |
+| generic-SSH/Slurm/K8s | you own it | you own it | Slurm SIGTERM→KillWait (def 30 s) | only if in China |
+## Common gotchas (top 8 inline — full catalog in references/)
+The universal ones that cost the most GPU-hours. Symptom → fix; root cause + the rest in
+**`references/gotchas_universal.md`** (run `grep -i '<keyword>' references/gotchas_universal.md` to jump).
+1. **SSH drops on `pkill -9`** (exit 255 + "Connection reset") — normal; re-ssh to verify, don't panic.
+2. **tmux holds the script in memory** — editing it mid-run re-executes blocks; version the filename.
+3. **Disk-full crashes `torch.save`** (`iostream error`) — pre-budget; auto-prune `latest.pth`, keep `best`.
+4. **cgroup OOM with no traceback** (bare `Killed` / exit 137) — `num_workers × big-tensor`; size workers vs `memory.max`, not CPU count.
+5. **Silent sync failure** — `cp … 2>/dev/null; echo synced` lies on a full/inode-exhausted FS; gate the success line on the actual copy result.
+6. **Spot preemption grace is tiny (~5 s → ~0 s on the platforms profiled here; AWS-style 2-min grace only on clouds not profiled)** — a SIGTERM-flush handler is NOT a safety net; checkpoint on a timer to durable storage, load-latest unconditionally (`references/spot-resilience.md`).
+7. **"Stop" rarely stops the meter** — only `terminate`/`destroy` does, and it's irreversible (deletes the disk). Know the verb from the profile before you click, and on RunPod a stopped Pod can even restart with zero GPUs.
+8. **CRLF breaks `.sh` on Linux** — author on Windows → `.gitattributes` `*.sh text eol=lf`; on-box unblock `sed -i 's/\r$//'`.
+## When training itself breaks (the model, not the platform)
+Platform ops is only half the job — once the box is running, training breaks in its own ways. The
+`references/training/` layer is the debug knowledge for the run itself. Boundary: **this layer owns
+"make it run, fast, and not crash"; `verifying-dl-experiments` owns "is the *number* real"** —
+cross-link it for collapse / leakage / metric-validity. Every entry is symptom → root cause → fix with
+cited current docs.
+- `references/training/oom-memory.md` — CUDA/VRAM + host-RAM OOM and the fit-it ladder (grad-accum → bf16 → activation-checkpointing → `expandable_segments` → FSDP/ZeRO → CPU/NVMe offload → LoRA/QLoRA); OOM-at-a-specific-step (first backward / val / longest batch); the memory snapshot + visualizer.
+- `references/training/distributed-launch.md` — `torchrun`/`accelerate`/`deepspeed` launch + env contract, DDP/FSDP/ZeRO config, and the multi-GPU **HANGS** toolkit (one-rank-diverged, rank-conditional collective, dataloader-length mismatch). Multi-node wire → `references/multinode.md`.
+- `references/training/precision-stability.md` — fp16/bf16/tf32 + AMP/GradScaler, NaN/Inf hunting (`detect_anomaly`), LLM **loss spikes** + divergence (warmup, clip, init, z-loss).
+- `references/training/throughput-profiling.md` — GPU-bound vs data-bound vs comms-bound; dataloader knobs; `torch.compile` traps; flash-attention; `torch.profiler` / Nsight.
+- `references/training/checkpoint-resume.md` — full-state save/resume mechanics, sharded (FSDP/DeepSpeed) checkpoints, and the resume bugs (epoch restart, data reshuffle, scaler/EMA dropped). Spot cadence → `references/spot-resilience.md`.
+- `references/training/by-domain.md` — per-domain gotchas: LLM/transformer, vision (det/seg), diffusion, RL, multimodal/VLM.
+- `references/training/convergence-debugging.md` — the **"runs but won't learn / learns badly"** layer: the overfit-one-batch smoke, params-not-updating, optimizer/LR/weight-decay/schedule config, loss-function footguns (double-softmax, BCEWithLogits, CE-target form), fine-tuning/freezing (frozen-BN drift, discriminative LR, LoRA wiring), and the training-dynamics dashboard (update:weight ratio, dead-ReLU, GradScaler-scale).
+- `references/training/data-pipeline.md` — dataloader/dataset **correctness** (not speed): the worker-RNG augmentation-duplication bug, IterableDataset worker/rank sharding, collate/`__len__`/`pin_memory`/`spawn` contracts, and preprocessing/label/shuffle traps (RGB-vs-BGR, ToTensor ÷255, `set_epoch`).
+## Companion skills (separate installs; REQUIRED reading where present)
+These are **separate** Agent Skills, not bundled here — install them for the full experience. On an
+agent where a companion isn't installed, treat its pointer below as an optional cross-reference; this
+skill still works standalone.
+- **`verifying-dl-experiments`** — owns *is-the-number-real*: smoke content, retry-vs-safeguard, keepable-checkpoint, eval sizing, tracker forensics, GPU-0%-util diagnosis. This skill owns *where/when/how-much-$*.
+- **`huggingface-skills:hf-cli`** — the transport verbs (`hf download --resume`, `hf upload-large-folder`, `hf cache verify`); this skill owns the China-mirror swap + stall-retry (`references/china-network.md`).
+- **`huggingface-skills:huggingface-trackio`** — hosted tracker so metrics survive teardown (gotcha U20); poll `trackio` alerts as a structured monitor instead of brittle ssh-tail.
+- **`superpowers:verification-before-completion`** — the Iron Law's general form; gates every "training done / synced / teardown complete" claim.
+- **`superpowers:dispatching-parallel-agents`** — independence predicate + reconciliation for ablation fan-out.
+## Getting better over time (capture new gotchas + personalize)
+This skill is static, but every run can teach it something — without corrupting it.
+Protocol → **`references/self-improvement.md`**. In short: when a run surfaces a gotcha the catalog
+lacks, **only sediment a root-caused, reproduced, generalizable one** (a one-off flake is a hypothesis,
+not a gotcha — principle #3); **route it** — user/project-specific → the host's memory system,
+generalizable → propose adding to `references/gotchas_universal.md` / the profile §7 /
+`references/training/` (and offer an upstream PR); **never silently rewrite a skill file — draft the
+`symptom → root cause → fix` and let the user approve.** On first use, capture the user's platforms +
+paths + tracker entity into memory so later runs are pre-parameterized. Platform facts carry a `verified
+<month>` stamp — re-verify any teardown/billing fact against current docs before betting money or data.
+## Limitations
+- Does not replace a real cloud orchestrator or managed provisioner; use it to make rented-box work survivable, not to optimize multi-cloud procurement.
+- Platform billing, stop, destroy, and data-retention behavior can drift; re-check current provider docs before destructive or money-impacting actions.
+- Requires user-owned credentials, SSH/API access, and explicit confirmation before teardown, deletion, or other irreversible cleanup.
+- Companion skills named above are not bundled here; treat them as optional references unless installed in the current agent environment.
+## Bundled resources
+Load only what the current phase needs.
+- `references/principles.md` — the 10 invariants expanded, with the cross-platform nuance behind each.
+- `references/lifecycle_checklist.md` — the 6-phase runbook as a per-platform checklist.
+- `references/gotchas_universal.md` — universal + mixed gotchas (TOC + grep index at top).
+- `references/monitoring_patterns.md` — the four-layer durable-monitoring architecture + robust ssh-poll template.
+- `references/ssh_transport.md` — ssh config, rsync/scp resumable patterns, secrets-via-stdin, CRLF, two-SSH-flavor caveat.
+- `references/china-network.md` — mirrors table + HF_ENDPOINT + resumable-download ladder + the `no_proxy` trap (all CN platforms).
+- `references/spot-resilience.md` — preemption signals, Young/Daly checkpoint cadence, atomic-write resume.
+- `references/parallel_ablation.md` — FS-shared fan-out + the independence predicate + reconciliation.
+- `references/multinode.md` — (advanced) NCCL / fabric-manager / elastic-training gotchas; single-box users skip.
+- `references/training/` — the **DL-training debug layer** (8 files: oom-memory, distributed-launch, precision-stability, throughput-profiling, checkpoint-resume, by-domain, convergence-debugging, data-pipeline) — see "When training breaks" above.
+- `references/self-improvement.md` — the feedback loop: capture a new gotcha (at a bar) into memory or the catalog, personalize on first run, keep platform facts fresh.
+- `scripts/` — wrapper templates (`run_one`/`run_queue`), monitors (`mem_monitor`, `gpu_health`, `reap_vram_zombies`), the read-only patrol (`health_patrol.sh.template`), transfer/aggregation (`download_loop`, `aggregate_to_fs`, `setup-china-mirrors`), the load-and-verify checker (`verify_local.py`), and the `verified`-stamp freshness linter (`check_staleness.py`).
+- `profiles/<platform>.md` — the per-platform substrate (one per platform; `_schema.md` defines the 8 fields).
+- `examples/autodl_sweep/` — one complete, runnable worked case end to end.

package/bundled-skills/remote-gpu-trainer/evals/README.md ADDED Viewed

@@ -0,0 +1,57 @@
+# Evals — does the skill actually route to the right answer?
+A skill is only as good as an agent's ability to *find and apply* the right entry under a real
+problem. These evals test that, in two tiers, against a fixed set of realistic scenarios
+([`cases.jsonl`](cases.jsonl)) spanning both halves of the skill (remote-GPU operations on every
+platform family + the DL-training-debug layer, including the `convergence-debugging` and
+`data-pipeline` files).
+## Tier 1 — structural reachability (runnable, no API key)
+```bash
+python evals/run_evals.py        # exits non-zero if any case regresses
+```
+For each scenario it asserts the answer is **present, at the documented location, with the
+expected entry IDs / keywords intact**: every `expect_files` exists, every `expect_ids` is still a
+`### <ID>` header there, every `expect_grep` term is still in the text. This is a **drift guard** —
+it catches a renamed/removed entry, a moved section, a deleted file, or a fact rewritten away from
+its key term. Run it in CI; it needs nothing but Python 3.
+What it does **not** prove: that an agent actually *navigates* there (Tier 2), or that the platform
+*facts* are correct on a live box (see Verification status).
+## Tier 2 — agentic navigation (the gold standard)
+The real test: give a **fresh agent** the skill and one scenario's `prompt`, let it navigate **from
+SKILL.md only** (following the documented routing, not blind grep), and check it reaches a correct,
+specific answer covering the case's `must_cover` points within ~2 hops. Each case records its last
+such run in the `agentic` field; the collected runs are in [`RESULTS.md`](RESULTS.md).
+To re-run Tier 2 with any agent/harness: load the skill, paste a case `prompt`, and grade the
+answer against `expect_files` / `expect_ids` / `must_cover`. (Anthropic's skill best-practices
+recommend ≥3 evals across Haiku/Sonnet/Opus — re-running these cases per model is the way to meet
+that bar; results to date were gathered on the development model and are labelled as such.)
+## Adding a case
+Append one JSON object per line to `cases.jsonl`:
+```json
+{"id": "kebab-id", "prompt": "the user's situation, verbatim-ish",
+ "expect_files": ["references/training/<file>.md"], "expect_ids": ["O7"],
+ "expect_grep": ["lr finder"], "must_cover": "the key points a correct answer must hit",
+ "agentic": "PASS/FAIL (date): the navigation path observed"}
+```
+Use `expect_ids` for the training catalogs (they have `### O7 / DP1 / M17 …` headers) and
+`expect_grep` for platform profiles (which are section-structured). Then `python evals/run_evals.py`.
+## Verification status (important)
+These evals test **retrieval and routing inside the skill** — not the truth of the platform facts
+on a live instance. Only the AutoDL profile is battle-tested by the author; the other six platform
+profiles are researched from official docs + community reports and **not yet live-validated** (see
+the repo README's "Verification status" and `references/self-improvement.md` §5). A case passing
+here means "the skill leads an agent to *this documented answer*," not "this answer was confirmed on
+a rented box."

package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md ADDED Viewed

@@ -0,0 +1,44 @@
+# Agentic navigation results (Tier 2)
+Each row: a **fresh agent** was given the skill and one scenario `prompt` from
+[`cases.jsonl`](cases.jsonl), told to navigate **from SKILL.md only** (follow the documented
+routing, no blind grep), and graded on whether it reached a correct, specific answer covering the
+scenario's `must_cover` points within ~2 hops.
+**Methodology / honesty caveats** (so a reader can weight this correctly):
+- Runs to date were gathered **during development**, on the development model (Claude Opus class),
+  as subagent dispatches — not an independent third party, and **not yet** the
+  Haiku/Sonnet/Opus sweep Anthropic's best-practices recommend. Treat as *author-run smoke evals*,
+  not a neutral benchmark.
+- These prove **routing + retrieval** inside the skill, not the truth of platform facts on a live
+  box (only AutoDL is battle-tested — see the repo README's "Verification status").
+- Single run per scenario; no adversarial/perturbed phrasings yet.
+## Results — 2026-06
+| Scenario | Verdict | Hops | Navigation path observed |
+|---|---|---|---|
+| convergence-frozen-resnet | **PASS** | 1 | SKILL.md "When training breaks" → `convergence-debugging.md` O1 (overfit-one-batch) + O2 (params-not-in-optimizer) + O17 (frozen-still-in-optimizer) + O18 (frozen-BN drift) + O6 (Adam vs AdamW) |
+| data-worker-rng-dup | **PASS** | 1 | SKILL.md "When training breaks" → `data-pipeline.md` DP1 (numpy fork-RNG dup; worker_init_fn fix) |
+| oom-on-step-2 | **PASS** | ≤2 | SKILL.md "When training breaks" → `oom-memory.md` (fit-it ladder + OOM-at-step-2 / Adam lazy state) |
+| nccl-one-rank-hang | **PASS** | ≤2 | SKILL.md → `distributed-launch.md` (desync toolkit D19 / one-rank-diverged D20) |
+| diffusion-loss-low-samples-bad | **PASS** | ≤2 | SKILL.md → `by-domain.md` diffusion section (DF1 loss≠quality, DF2 EMA weights) |
+| nan-loss-spike-bf16 | **PASS** | ≤2 | SKILL.md "When training breaks" → `precision-stability.md` P8/P12/P15 (NaN-origin + warmup spike + z-loss) |
+| resume-epoch-reset | **PASS** | 1 | SKILL.md → `checkpoint-resume.md` C1/C12/C14 (save FULL state: epoch/step/scheduler/RNG/scaler) |
+| throughput-gpu-starved | **PASS** | ≤2 | SKILL.md → `throughput-profiling.md` T1/T4 (GPU-bound vs data-bound; num_workers/prefetch) |
+| runpod-spot-resume-teardown | **PASS** | ≤2 | SKILL.md → `profiles/runpod.md` §4/§5 → `spot-resilience.md` → `checkpoint-resume.md` C3 |
+| vastai-teardown-billing | **PASS** | ≤2 | SKILL.md → `profiles/vastai.md` §5 → `lifecycle_checklist.md` Phase 5 |
+| autodl-inode-disk-full | **PASS** | ≤2 | SKILL.md → the inode/disk gotcha (principle #5 / `gotchas_universal.md` U7) |
+| china-hf-download-stall | **PASS** | ≤2 | SKILL.md → `references/china-network.md` (HF_ENDPOINT=hf-mirror, hf_transfer caution) |
+| lambda-stop-vs-terminate | **PASS** | ≤2 | SKILL.md → `profiles/lambda.md` (no stop state; terminate irreversible) |
+| autodl-first-contact-15day | **PASS** | 1 | SKILL.md principle #10 → `profiles/autodl.md` Surface block + AD-DANGER (关机 auto-releases after 15 days) |
+**Summary: 14/14 scenarios routed correctly** (9 via workflow `w2r1t7mm9`, 5 standalone), each to a
+correct + specific answer within ≤2 hops. The Tier-1 structural check (`run_evals.py`) runs all 14
+cases and is the regression guard kept green in CI.
+## Known gaps (what these results do NOT yet cover)
+- No multi-model sweep (Haiku/Sonnet/Opus) — required to claim the best-practices testing bar.
+- No adversarial/paraphrased prompts (e.g. the user describes the symptom in non-canonical words).
+- No live-platform validation of the facts the agent retrieves (the verification-status caveat).

package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl ADDED Viewed

@@ -0,0 +1,14 @@
+{"id": "convergence-frozen-resnet", "prompt": "Fine-tuning a ResNet50 on a rented GPU. Training runs with no errors and normal speed, but loss barely drops and val accuracy is stuck near chance. I froze the backbone with requires_grad=False and use Adam with weight_decay. How do I debug why it isn't learning?", "expect_files": ["references/training/convergence-debugging.md"], "expect_ids": ["O1", "O2", "O17", "O18", "O6"], "expect_grep": [], "must_cover": "overfit-one-batch smoke; frozen-param-still-in-optimizer; frozen-BN running-stats drift; Adam vs AdamW decoupled decay", "agentic": "PASS (1-hop, 2026-06): SKILL.md 'When training breaks' -> convergence-debugging.md O1/O2/O17/O18/O6"}
+{"id": "data-worker-rng-dup", "prompt": "My image augmentations seem to repeat: different DataLoader workers produce identical random crops, and every epoch looks the same. Linux, num_workers=8, numpy-based augmentation. Real bug? Fix?", "expect_files": ["references/training/data-pipeline.md"], "expect_ids": ["DP1"], "expect_grep": ["worker_init_fn", "torch.initial_seed"], "must_cover": "numpy global RNG inherited via fork, not reseeded per worker; fix via worker_init_fn or route RNG through torch", "agentic": "PASS (1-hop, 2026-06): SKILL.md 'When training breaks' -> data-pipeline.md DP1"}
+{"id": "oom-on-step-2", "prompt": "CUDA out of memory on step 2, right after the first optimizer step. Step 1 ran fine. Why does it OOM only on the second step?", "expect_files": ["references/training/oom-memory.md"], "expect_ids": ["M17"], "expect_grep": [], "must_cover": "Adam lazily allocates optimizer state (m,v) on the first step()", "agentic": "PASS (workflow w2r1t7mm9): routed to oom-memory.md ladder + step-2 entry"}
+{"id": "nccl-one-rank-hang", "prompt": "Multi-GPU training hangs partway through an epoch; one rank seems stuck and the others wait forever (NCCL timeout). How do I find which rank and why?", "expect_files": ["references/training/distributed-launch.md"], "expect_ids": ["D19", "D20"], "expect_grep": [], "must_cover": "one rank diverged/OOM'd; survivors hang on the absent collective; desync-debug toolkit", "agentic": "PASS (workflow w2r1t7mm9): routed to distributed-launch.md hang toolkit"}
+{"id": "diffusion-loss-low-samples-bad", "prompt": "My diffusion model's training loss is low and still decreasing, but the generated samples look bad/blurry. The loss says it's fine. What's wrong?", "expect_files": ["references/training/by-domain.md"], "expect_ids": ["DF1", "DF2"], "expect_grep": ["EMA"], "must_cover": "loss != sample quality; sampling from raw (non-EMA) weights; cross-link verifying-dl-experiments", "agentic": "PASS (workflow w2r1t7mm9): routed to by-domain.md diffusion section"}
+{"id": "nan-loss-spike-bf16", "prompt": "LLM pretraining in bf16: loss is stable then suddenly spikes to NaN. How do I find where the NaN comes from and stop the spikes?", "expect_files": ["references/training/precision-stability.md"], "expect_ids": ["P8", "P12", "P15"], "expect_grep": ["z-loss"], "must_cover": "NaN arithmetic origins + anomaly detection; LR-too-high/warmup spike; z-loss to bound logits", "agentic": "PASS (workflow w2r1t7mm9): routed to precision-stability.md"}
+{"id": "resume-epoch-reset", "prompt": "I resume training from a checkpoint but the epoch/step counter restarts from 0 and the LR schedule replays warmup. What did I forget to save/restore?", "expect_files": ["references/training/checkpoint-resume.md"], "expect_ids": ["C1", "C12", "C14"], "expect_grep": [], "must_cover": "save FULL state (epoch/step/scheduler/RNG/scaler), not just weights", "agentic": "PASS (1-hop): SKILL.md -> checkpoint-resume.md"}
+{"id": "throughput-gpu-starved", "prompt": "GPU utilization is low and training is slow on my rented box. I think the dataloader is starving the GPU. How do I confirm and fix it?", "expect_files": ["references/training/throughput-profiling.md"], "expect_ids": ["T1", "T4"], "expect_grep": ["num_workers"], "must_cover": "GPU-bound vs data-bound vs comms-bound triage; num_workers/prefetch knobs", "agentic": "PASS: SKILL.md -> throughput-profiling.md"}
+{"id": "runpod-spot-resume-teardown", "prompt": "On RunPod my spot training keeps getting preempted. How do I make it resume instead of restarting, and how do I stop the meter most cheaply afterwards without losing checkpoints?", "expect_files": ["profiles/runpod.md"], "expect_ids": [], "expect_grep": ["terminate", "Network Volume"], "must_cover": "Network Volume is the only durable store; ~5s grace; terminate (not stop) stops billing; verify ckpt before terminate", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/runpod.md SS4/SS5 -> spot-resilience.md -> checkpoint-resume.md C3"}
+{"id": "vastai-teardown-billing", "prompt": "On vast.ai, what action actually stops billing, and how do I tear down without losing my checkpoints?", "expect_files": ["profiles/vastai.md"], "expect_ids": [], "expect_grep": ["destroy"], "must_cover": "destroy is the only meter-stop; stop still bills disk; copy + load-verify off-box before destroy", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/vastai.md SS5 -> lifecycle_checklist Phase 5"}
+{"id": "autodl-inode-disk-full", "prompt": "On AutoDL my torch.save fails with a disk/iostream error, but df -h shows plenty of space left. What's going on?", "expect_files": ["references/gotchas_universal.md"], "expect_ids": [], "expect_grep": ["inode", "df -i"], "must_cover": "storage dies on inodes before bytes; monitor df -i not just df -h; millions of small files", "agentic": "PASS (workflow w2r1t7mm9): routed to the inode/disk gotcha (principle #5 / U7)"}
+{"id": "china-hf-download-stall", "prompt": "Training in mainland China: a huggingface model download stalls and hangs with no error. How do I fix the download?", "expect_files": ["references/china-network.md"], "expect_ids": [], "expect_grep": ["hf-mirror", "HF_ENDPOINT"], "must_cover": "HF_ENDPOINT=hf-mirror.com; keep hf_transfer OFF on flaky CN links; resumable-download ladder", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> references/china-network.md"}
+{"id": "lambda-stop-vs-terminate", "prompt": "On Lambda Cloud, is there a stop action to pause billing while keeping my instance, or only terminate? How should I tear down?", "expect_files": ["profiles/lambda.md"], "expect_ids": [], "expect_grep": ["terminate"], "must_cover": "no stop state on Lambda on-demand; terminate is irreversible + wipes the instance; persistent FS is the only durable home", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/lambda.md"}
+{"id": "autodl-first-contact-15day", "prompt": "First time on AutoDL. I'll 关机 (stop) my instance between sessions to save money — is my data safe if it stays stopped for a few weeks? Anything else I should know up front?", "expect_files": ["profiles/autodl.md"], "expect_ids": [], "expect_grep": ["Surface to the user", "免密", "AD-DANGER"], "must_cover": "关机 auto-releases after 15 days -> data disk deleted (not safe to park indefinitely); sync best to /root/autodl-fs for a longer pause; surface conveniences (one-click SSH免密, GPU notify, panels) + danger clocks (principle #10)", "agentic": "PASS (2026-06): principle #10 first-contact surfacing -> profiles/autodl.md Surface block + AD-DANGER 15-day clock"}

package/bundled-skills/remote-gpu-trainer/evals/run_evals.py ADDED Viewed

@@ -0,0 +1,68 @@
+#!/usr/bin/env python3
+"""
+Structural retrieval-reachability check for the remote-gpu-trainer skill.
+For each scenario in cases.jsonl, assert that the answer is actually PRESENT in the
+skill, at the documented location, with the expected entry IDs / keywords intact:
+  - every `expect_files` path exists
+  - every `expect_ids` appears as a `### <ID>` header in one of those files
+  - every `expect_grep` keyword appears (case-insensitive) in one of those files
+This is the cheap, no-API-key tier: it does NOT prove an agent *navigates* there
+(that is the agentic tier — see RESULTS.md), and it does NOT prove the platform
+FACTS are correct on a live box (see the README "Verification status"). What it
+DOES catch is drift: a renamed/removed entry ID, a moved section, a deleted file,
+or a fact rewritten away from a key term — i.e. a regression in the skill's known
+load-bearing capabilities.
+Usage:  python evals/run_evals.py            # exits 1 if any case fails
+"""
+import json
+import re
+import sys
+from pathlib import Path
+REPO = Path(__file__).resolve().parent.parent
+CASES = Path(__file__).resolve().parent / "cases.jsonl"
+def header_present(text, id_):
+    # match `### O1 ...` but not `### O10 ...`
+    return re.search(r"(?m)^###\s+" + re.escape(id_) + r"\b", text) is not None
+def main():
+    cases = [json.loads(l) for l in CASES.read_text(encoding="utf-8").splitlines() if l.strip()]
+    passed = failed = 0
+    for c in cases:
+        problems = []
+        blobs = []
+        for f in c.get("expect_files", []):
+            p = REPO / f
+            if not p.exists():
+                problems.append(f"missing file: {f}")
+            else:
+                blobs.append(p.read_text(encoding="utf-8"))
+        joined = "\n".join(blobs)
+        low = joined.lower()
+        for i in c.get("expect_ids", []):
+            if not any(header_present(b, i) for b in blobs):
+                problems.append(f"missing entry id: {i}")
+        for kw in c.get("expect_grep", []):
+            if kw.lower() not in low:
+                problems.append(f"missing keyword: {kw!r}")
+        status = "PASS" if not problems else "FAIL"
+        if problems:
+            failed += 1
+        else:
+            passed += 1
+        print(f"[{status}] {c['id']}")
+        for pr in problems:
+            print(f"         - {pr}")
+    print(f"\n{passed}/{passed + failed} cases reachable" + ("" if not failed else f"  ({failed} FAILED)"))
+    return 1 if failed else 0
+if __name__ == "__main__":
+    sys.exit(main())

package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md ADDED Viewed

@@ -0,0 +1,72 @@
+# Worked example — a 3-cell ablation sweep on AutoDL
+A complete, end-to-end run of the 6-phase lifecycle (SKILL.md) for the deepest profile
+(`profiles/autodl.md`). Substitute your own project name, alias, and configs. Two instances run
+their own queue file in parallel; this walkthrough ships `queue_1.txt` and shows one instance. **Read `profiles/autodl.md`
+first** — it owns every path and verb used below.
+The AutoDL `SCRIPT OVERRIDES` (profiles/autodl.md §8) that parameterize the templates:
+```bash
+export PROJECT_REPO_DIR=/root/myproj
+export DATA_DIR=/root/autodl-tmp          # fast per-instance scratch (checkpoints)
+export DURABLE_DIR=/root/autodl-fs        # region-locked shared FS (survives release)
+export PROXY_HOOK='source /etc/network_turbo'
+export CRED_FILE=/root/.wandb_key
+```
+### Phase 0 — Environment audit
+```bash
+ssh autodl-1 'df -h /root/autodl-tmp /root/autodl-fs / && df -i /root/autodl-fs && \
+              cat /sys/fs/cgroup/memory.max | numfmt --to=iec && nvidia-smi'
+bash scripts/gpu_health.sh 0     # run ON the box: Xid / throttle pre-flight (U22/U23)
+```
+Budget the disk: `ckpt_size × cells_in_queue + scratch`. **Verify:** `nvidia-smi` shows the expected
+GPU; `df -i /root/autodl-fs` is well under 100% (the inode cap, U7).
+### Phase 1 — SSH + credentials
+```bash
+# alias already in ~/.ssh/config (references/ssh_transport.md). Push the wandb key via stdin,
+# to the per-instance disk — NEVER the shared FS (U34, and AutoDL's classifier blocks it, AD-gotcha):
+printf '%s\n' "$WANDB_KEY_FROM_ENV" | ssh autodl-1 'umask 077; cat > /root/.wandb_key && chmod 600 /root/.wandb_key'
+```
+**Verify:** `ssh autodl-1 'python -c "import torch;print(torch.cuda.is_available())"'` prints `True`.
+### Phase 2 — Wrapper + CPU-smoke gate
+```bash
+# Parameterize the templates, drop the .template suffix, smoke locally on CPU BEFORE renting time:
+cp scripts/run_one.sh.template run_one.sh && cp scripts/run_queue.sh.template run_queue.sh
+python -m src.train -c configs/ablation/baseline.yaml --task reconstruction \
+       --limit-batches 2 --epochs 1   # logger off; catches import/shape/scale bugs for free
+```
+**Verify:** the smoke exits 0 on 2 batches. (Smoke *content* → **REQUIRED:** `verifying-dl-experiments`.)
+### Phase 3 — Detached launch
+```bash
+# Push the parameterized wrappers + queue to the shared FS (ONE copy, all instances read it):
+scp run_one.sh run_queue.sh examples/autodl_sweep/queue_1.txt autodl-1:/root/autodl-fs/
+ssh autodl-1 "RUN_ONE=/root/autodl-fs/run_one.sh tmux new -d -s q1 \
+  'bash /root/autodl-fs/run_queue.sh /root/autodl-fs/queue_1.txt 2>&1 | tee /root/autodl-tmp/runs/logs/q1_master.log'"
+```
+**Verify within 60 s:** `ssh autodl-1 'tmux ls && tail -5 /root/autodl-tmp/runs/logs/q1_master.log'` shows
+the session alive and a `STARTING baseline` line. Never overwrite the FS wrapper mid-run (U2 / principle #6).
+### Phase 4 — Durable monitoring
+```bash
+ssh autodl-1 'grep -hE "STARTING|FINISHED|QUEUE DONE|ERROR|Traceback" /root/autodl-tmp/runs/logs/q1_master.log | tail -8'
+```
+For a multi-hour sweep deploy the four-layer architecture (`references/monitoring_patterns.md`): a remote
+self-completion marker + a session patrol loop. Flag a FINISHED at <50% typical duration (probable
+early-stop) and re-launch the **identical** config (principle #7), never a patched one. Don't blind-retry.
+### Phase 5 — Aggregate + verify + teardown
+```bash
+ssh autodl-1 'DATA_DIR=/root/autodl-tmp DURABLE_DIR=/root/autodl-fs bash /root/autodl-fs/aggregate_to_fs.sh'  # gated sync (U33)
+LOCAL_TARGET=/path/to/local/final_ckpts REMOTE_ALIAS=autodl-1 \
+  REMOTE_PATH=/root/autodl-fs/final_ckpts bash scripts/download_loop.sh        # resumable per-dir pull
+python scripts/verify_local.py /path/to/local/final_ckpts/                     # LOAD each best.pth
+```
+**Verify:** `verify_local.py` reports 100% OK. **Iron Law:** only AFTER every cell is pulled AND
+load-verified AND the user approves does teardown run — on AutoDL `关机` stops the meter and keeps the
+disk (the reversible exception); `release` frees it irreversibly. Reconcile against the roster, not the
+log (`references/parallel_ablation.md` §6). **REQUIRED:** `superpowers:verification-before-completion`.

package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt ADDED Viewed

@@ -0,0 +1,6 @@
+# Queue file for instance 1 — one ablation cell per line: <cfg_path> <task> [epochs]
+# Blank epochs => wrapper default (20). Detection needs more epochs (U32) => 50.
+# Split cells across queue_1.txt / queue_2.txt by COST so queues finish together.
+configs/ablation/baseline.yaml        reconstruction 20
+configs/ablation/no_aug.yaml          reconstruction 20
+configs/ablation/det_baseline.yaml    detection      50