opencode-skills-collection 3.1.2 → 3.1.3
- package/bundled-skills/.antigravity-install-manifest.json +4 -1
- package/bundled-skills/agent-creator/SKILL.md +246 -0
- package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
- package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
- package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
- package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
- package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
- package/bundled-skills/docs/sources/sources.md +1 -1
- package/bundled-skills/docs/users/bundles.md +1 -1
- package/bundled-skills/docs/users/claude-code-skills.md +1 -1
- package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
- package/bundled-skills/docs/users/getting-started.md +1 -1
- package/bundled-skills/docs/users/kiro-integration.md +1 -1
- package/bundled-skills/docs/users/usage.md +4 -4
- package/bundled-skills/docs/users/visual-guide.md +4 -4
- package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
- package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
- package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
- package/bundled-skills/remote-gpu-trainer/README.md +267 -0
- package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
- package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
- package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
- package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
- package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
- package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
- package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
- package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
- package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
- package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
- package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
- package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
- package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
- package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
- package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
- package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
- package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
- package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
- package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
- package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
- package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
- package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
- package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
- package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
- package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
- package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
- package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
- package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
- package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
- package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
- package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
- package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
- package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
- package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
- package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
- package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
- package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
- package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
- package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
- package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
- package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
- package/package.json +1 -1
- package/skills_index.json +66 -0
@@ -0,0 +1,44 @@
# Agentic navigation results (Tier 2)

Each row: a **fresh agent** was given the skill and one scenario `prompt` from
[`cases.jsonl`](cases.jsonl), told to navigate **from SKILL.md only** (follow the documented
routing, no blind grep), and graded on whether it reached a correct, specific answer covering the
scenario's `must_cover` points within ~2 hops.

**Methodology / honesty caveats** (so a reader can weight this correctly):
- Runs to date were gathered **during development**, on the development model (Claude Opus class),
  as subagent dispatches — not an independent third party, and **not yet** the
  Haiku/Sonnet/Opus sweep Anthropic's best-practices recommend. Treat as *author-run smoke evals*,
  not a neutral benchmark.
- These prove **routing + retrieval** inside the skill, not the truth of platform facts on a live
  box (only AutoDL is battle-tested — see the repo README's "Verification status").
- Single run per scenario; no adversarial/perturbed phrasings yet.

## Results — 2026-06

| Scenario | Verdict | Hops | Navigation path observed |
|---|---|---|---|
| convergence-frozen-resnet | **PASS** | 1 | SKILL.md "When training breaks" → `convergence-debugging.md` O1 (overfit-one-batch) + O2 (params-not-in-optimizer) + O17 (frozen-still-in-optimizer) + O18 (frozen-BN drift) + O6 (Adam vs AdamW) |
| data-worker-rng-dup | **PASS** | 1 | SKILL.md "When training breaks" → `data-pipeline.md` DP1 (numpy fork-RNG dup; worker_init_fn fix) |
| oom-on-step-2 | **PASS** | ≤2 | SKILL.md "When training breaks" → `oom-memory.md` (fit-it ladder + OOM-at-step-2 / Adam lazy state) |
| nccl-one-rank-hang | **PASS** | ≤2 | SKILL.md → `distributed-launch.md` (desync toolkit D19 / one-rank-diverged D20) |
| diffusion-loss-low-samples-bad | **PASS** | ≤2 | SKILL.md → `by-domain.md` diffusion section (DF1 loss≠quality, DF2 EMA weights) |
| nan-loss-spike-bf16 | **PASS** | ≤2 | SKILL.md "When training breaks" → `precision-stability.md` P8/P12/P15 (NaN-origin + warmup spike + z-loss) |
| resume-epoch-reset | **PASS** | 1 | SKILL.md → `checkpoint-resume.md` C1/C12/C14 (save FULL state: epoch/step/scheduler/RNG/scaler) |
| throughput-gpu-starved | **PASS** | ≤2 | SKILL.md → `throughput-profiling.md` T1/T4 (GPU-bound vs data-bound; num_workers/prefetch) |
| runpod-spot-resume-teardown | **PASS** | ≤2 | SKILL.md → `profiles/runpod.md` §4/§5 → `spot-resilience.md` → `checkpoint-resume.md` C3 |
| vastai-teardown-billing | **PASS** | ≤2 | SKILL.md → `profiles/vastai.md` §5 → `lifecycle_checklist.md` Phase 5 |
| autodl-inode-disk-full | **PASS** | ≤2 | SKILL.md → the inode/disk gotcha (principle #5 / `gotchas_universal.md` U7) |
| china-hf-download-stall | **PASS** | ≤2 | SKILL.md → `references/china-network.md` (HF_ENDPOINT=hf-mirror, hf_transfer caution) |
| lambda-stop-vs-terminate | **PASS** | ≤2 | SKILL.md → `profiles/lambda.md` (no stop state; terminate irreversible) |
| autodl-first-contact-15day | **PASS** | 1 | SKILL.md principle #10 → `profiles/autodl.md` Surface block + AD-DANGER (关机 auto-releases after 15 days) |

**Summary: 14/14 scenarios routed correctly** (9 via workflow `w2r1t7mm9`, 5 standalone), each to a
correct + specific answer within ≤2 hops. The Tier-1 structural check (`run_evals.py`) runs all 14
cases and is the regression guard kept green in CI.

## Known gaps (what these results do NOT yet cover)

- No multi-model sweep (Haiku/Sonnet/Opus) — required to claim the best-practices testing bar.
- No adversarial/paraphrased prompts (e.g. the user describes the symptom in non-canonical words).
- No live-platform validation of the facts the agent retrieves (the verification-status caveat).
@@ -0,0 +1,14 @@
{"id": "convergence-frozen-resnet", "prompt": "Fine-tuning a ResNet50 on a rented GPU. Training runs with no errors and normal speed, but loss barely drops and val accuracy is stuck near chance. I froze the backbone with requires_grad=False and use Adam with weight_decay. How do I debug why it isn't learning?", "expect_files": ["references/training/convergence-debugging.md"], "expect_ids": ["O1", "O2", "O17", "O18", "O6"], "expect_grep": [], "must_cover": "overfit-one-batch smoke; frozen-param-still-in-optimizer; frozen-BN running-stats drift; Adam vs AdamW decoupled decay", "agentic": "PASS (1-hop, 2026-06): SKILL.md 'When training breaks' -> convergence-debugging.md O1/O2/O17/O18/O6"}
{"id": "data-worker-rng-dup", "prompt": "My image augmentations seem to repeat: different DataLoader workers produce identical random crops, and every epoch looks the same. Linux, num_workers=8, numpy-based augmentation. Real bug? Fix?", "expect_files": ["references/training/data-pipeline.md"], "expect_ids": ["DP1"], "expect_grep": ["worker_init_fn", "torch.initial_seed"], "must_cover": "numpy global RNG inherited via fork, not reseeded per worker; fix via worker_init_fn or route RNG through torch", "agentic": "PASS (1-hop, 2026-06): SKILL.md 'When training breaks' -> data-pipeline.md DP1"}
{"id": "oom-on-step-2", "prompt": "CUDA out of memory on step 2, right after the first optimizer step. Step 1 ran fine. Why does it OOM only on the second step?", "expect_files": ["references/training/oom-memory.md"], "expect_ids": ["M17"], "expect_grep": [], "must_cover": "Adam lazily allocates optimizer state (m,v) on the first step()", "agentic": "PASS (workflow w2r1t7mm9): routed to oom-memory.md ladder + step-2 entry"}
{"id": "nccl-one-rank-hang", "prompt": "Multi-GPU training hangs partway through an epoch; one rank seems stuck and the others wait forever (NCCL timeout). How do I find which rank and why?", "expect_files": ["references/training/distributed-launch.md"], "expect_ids": ["D19", "D20"], "expect_grep": [], "must_cover": "one rank diverged/OOM'd; survivors hang on the absent collective; desync-debug toolkit", "agentic": "PASS (workflow w2r1t7mm9): routed to distributed-launch.md hang toolkit"}
{"id": "diffusion-loss-low-samples-bad", "prompt": "My diffusion model's training loss is low and still decreasing, but the generated samples look bad/blurry. The loss says it's fine. What's wrong?", "expect_files": ["references/training/by-domain.md"], "expect_ids": ["DF1", "DF2"], "expect_grep": ["EMA"], "must_cover": "loss != sample quality; sampling from raw (non-EMA) weights; cross-link verifying-dl-experiments", "agentic": "PASS (workflow w2r1t7mm9): routed to by-domain.md diffusion section"}
{"id": "nan-loss-spike-bf16", "prompt": "LLM pretraining in bf16: loss is stable then suddenly spikes to NaN. How do I find where the NaN comes from and stop the spikes?", "expect_files": ["references/training/precision-stability.md"], "expect_ids": ["P8", "P12", "P15"], "expect_grep": ["z-loss"], "must_cover": "NaN arithmetic origins + anomaly detection; LR-too-high/warmup spike; z-loss to bound logits", "agentic": "PASS (workflow w2r1t7mm9): routed to precision-stability.md"}
{"id": "resume-epoch-reset", "prompt": "I resume training from a checkpoint but the epoch/step counter restarts from 0 and the LR schedule replays warmup. What did I forget to save/restore?", "expect_files": ["references/training/checkpoint-resume.md"], "expect_ids": ["C1", "C12", "C14"], "expect_grep": [], "must_cover": "save FULL state (epoch/step/scheduler/RNG/scaler), not just weights", "agentic": "PASS (1-hop): SKILL.md -> checkpoint-resume.md"}
{"id": "throughput-gpu-starved", "prompt": "GPU utilization is low and training is slow on my rented box. I think the dataloader is starving the GPU. How do I confirm and fix it?", "expect_files": ["references/training/throughput-profiling.md"], "expect_ids": ["T1", "T4"], "expect_grep": ["num_workers"], "must_cover": "GPU-bound vs data-bound vs comms-bound triage; num_workers/prefetch knobs", "agentic": "PASS: SKILL.md -> throughput-profiling.md"}
{"id": "runpod-spot-resume-teardown", "prompt": "On RunPod my spot training keeps getting preempted. How do I make it resume instead of restarting, and how do I stop the meter most cheaply afterwards without losing checkpoints?", "expect_files": ["profiles/runpod.md"], "expect_ids": [], "expect_grep": ["terminate", "Network Volume"], "must_cover": "Network Volume is the only durable store; ~5s grace; terminate (not stop) stops billing; verify ckpt before terminate", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/runpod.md SS4/SS5 -> spot-resilience.md -> checkpoint-resume.md C3"}
{"id": "vastai-teardown-billing", "prompt": "On vast.ai, what action actually stops billing, and how do I tear down without losing my checkpoints?", "expect_files": ["profiles/vastai.md"], "expect_ids": [], "expect_grep": ["destroy"], "must_cover": "destroy is the only meter-stop; stop still bills disk; copy + load-verify off-box before destroy", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/vastai.md SS5 -> lifecycle_checklist Phase 5"}
{"id": "autodl-inode-disk-full", "prompt": "On AutoDL my torch.save fails with a disk/iostream error, but df -h shows plenty of space left. What's going on?", "expect_files": ["references/gotchas_universal.md"], "expect_ids": [], "expect_grep": ["inode", "df -i"], "must_cover": "storage dies on inodes before bytes; monitor df -i not just df -h; millions of small files", "agentic": "PASS (workflow w2r1t7mm9): routed to the inode/disk gotcha (principle #5 / U7)"}
{"id": "china-hf-download-stall", "prompt": "Training in mainland China: a huggingface model download stalls and hangs with no error. How do I fix the download?", "expect_files": ["references/china-network.md"], "expect_ids": [], "expect_grep": ["hf-mirror", "HF_ENDPOINT"], "must_cover": "HF_ENDPOINT=hf-mirror.com; keep hf_transfer OFF on flaky CN links; resumable-download ladder", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> references/china-network.md"}
{"id": "lambda-stop-vs-terminate", "prompt": "On Lambda Cloud, is there a stop action to pause billing while keeping my instance, or only terminate? How should I tear down?", "expect_files": ["profiles/lambda.md"], "expect_ids": [], "expect_grep": ["terminate"], "must_cover": "no stop state on Lambda on-demand; terminate is irreversible + wipes the instance; persistent FS is the only durable home", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/lambda.md"}
{"id": "autodl-first-contact-15day", "prompt": "First time on AutoDL. I'll 关机 (stop) my instance between sessions to save money — is my data safe if it stays stopped for a few weeks? Anything else I should know up front?", "expect_files": ["profiles/autodl.md"], "expect_ids": [], "expect_grep": ["Surface to the user", "免密", "AD-DANGER"], "must_cover": "关机 auto-releases after 15 days -> data disk deleted (not safe to park indefinitely); sync best to /root/autodl-fs for a longer pause; surface conveniences (one-click SSH免密, GPU notify, panels) + danger clocks (principle #10)", "agentic": "PASS (2026-06): principle #10 first-contact surfacing -> profiles/autodl.md Surface block + AD-DANGER 15-day clock"}
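
The records above share one shape: seven fields, three of them lists. As a rough sketch of how a new case could be sanity-checked before it is appended, the snippet below validates one JSONL line against that shape. The field names come from the cases themselves. The type table, the `validate_case` helper, and the abbreviated sample record are assumptions made for illustration and are not part of the package.

```python
import json

# Field names are taken from the cases above; the expected types are an
# assumption inferred from how every shown record is written.
REQUIRED_FIELDS = {
    "id": str,
    "prompt": str,
    "expect_files": list,
    "expect_ids": list,
    "expect_grep": list,
    "must_cover": str,
    "agentic": str,
}


def validate_case(line):
    """Parse one JSONL line and return a list of schema problems (empty if OK)."""
    case = json.loads(line)
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in case:
            problems.append(f"missing field: {field}")
        elif not isinstance(case[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Every shown case names at least one file to look in.
    if isinstance(case.get("expect_files"), list) and not case["expect_files"]:
        problems.append("expect_files is empty")
    return problems


# Abbreviated, hypothetical record in the same shape as the real ones.
sample = json.dumps({
    "id": "oom-on-step-2",
    "prompt": "CUDA out of memory on step 2 (abbreviated)",
    "expect_files": ["references/training/oom-memory.md"],
    "expect_ids": ["M17"],
    "expect_grep": [],
    "must_cover": "Adam lazily allocates optimizer state",
    "agentic": "PASS",
})

print(validate_case(sample))
print(validate_case('{"id": "incomplete"}'))
```

A check like this only guards the record shape; whether the referenced files and IDs exist is what `run_evals.py` verifies.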
@@ -0,0 +1,68 @@
#!/usr/bin/env python3
"""
Structural retrieval-reachability check for the remote-gpu-trainer skill.

For each scenario in cases.jsonl, assert that the answer is actually PRESENT in the
skill, at the documented location, with the expected entry IDs / keywords intact:

  - every `expect_files` path exists
  - every `expect_ids` appears as a `### <ID>` header in one of those files
  - every `expect_grep` keyword appears (case-insensitive) in one of those files

This is the cheap, no-API-key tier: it does NOT prove an agent *navigates* there
(that is the agentic tier — see RESULTS.md), and it does NOT prove the platform
FACTS are correct on a live box (see the README "Verification status"). What it
DOES catch is drift: a renamed/removed entry ID, a moved section, a deleted file,
or a fact rewritten away from a key term — i.e. a regression in the skill's known
load-bearing capabilities.

Usage: python evals/run_evals.py # exits 1 if any case fails
"""
import json
import re
import sys
from pathlib import Path

REPO = Path(__file__).resolve().parent.parent
CASES = Path(__file__).resolve().parent / "cases.jsonl"


def header_present(text, id_):
    # match `### O1 ...` but not `### O10 ...`
    return re.search(r"(?m)^###\s+" + re.escape(id_) + r"\b", text) is not None


def main():
    cases = [json.loads(l) for l in CASES.read_text(encoding="utf-8").splitlines() if l.strip()]
    passed = failed = 0
    for c in cases:
        problems = []
        blobs = []
        for f in c.get("expect_files", []):
            p = REPO / f
            if not p.exists():
                problems.append(f"missing file: {f}")
            else:
                blobs.append(p.read_text(encoding="utf-8"))
        joined = "\n".join(blobs)
        low = joined.lower()
        for i in c.get("expect_ids", []):
            if not any(header_present(b, i) for b in blobs):
                problems.append(f"missing entry id: {i}")
        for kw in c.get("expect_grep", []):
            if kw.lower() not in low:
                problems.append(f"missing keyword: {kw!r}")
        status = "PASS" if not problems else "FAIL"
        if problems:
            failed += 1
        else:
            passed += 1
        print(f"[{status}] {c['id']}")
        for pr in problems:
            print(f" - {pr}")
    print(f"\n{passed}/{passed + failed} cases reachable" + ("" if not failed else f" ({failed} FAILED)"))
    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main())
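
The ID check in the script above depends on the trailing `\b` in `header_present`, so a short demonstration of what it accepts and rejects may help. The function body is copied from the script. The sample document and its entry titles are made up for illustration.

```python
import re


def header_present(text, id_):
    # Same pattern as run_evals.py: a level-3 header whose first token is id_,
    # followed by a word boundary so that O1 does not match O10.
    return re.search(r"(?m)^###\s+" + re.escape(id_) + r"\b", text) is not None


# Hypothetical reference file: it has O10 and O2 headers, and only mentions O1 in body text.
doc = (
    "### O10: some entry\n"
    "body text that mentions O1 in passing\n"
    "### O2 (another entry)\n"
)

for entry_id in ("O1", "O2", "O10"):
    print(entry_id, header_present(doc, entry_id))
```

One caveat of this pattern: `\b` also sits between a digit and a hyphen, so a header such as `### O1-legacy` would still count as `O1`. That is harmless only as long as entry IDs are never extended with punctuation.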
@@ -0,0 +1,72 @@
# Worked example — a 3-cell ablation sweep on AutoDL

A complete, end-to-end run of the 6-phase lifecycle (SKILL.md) for the deepest profile
(`profiles/autodl.md`). Substitute your own project name, alias, and configs. Two instances run
their own queue file in parallel; this walkthrough ships `queue_1.txt` and shows one instance. **Read `profiles/autodl.md`
first** — it owns every path and verb used below.

The AutoDL `SCRIPT OVERRIDES` (profiles/autodl.md §8) that parameterize the templates:

```bash
export PROJECT_REPO_DIR=/root/myproj
export DATA_DIR=/root/autodl-tmp # fast per-instance scratch (checkpoints)
export DURABLE_DIR=/root/autodl-fs # region-locked shared FS (survives release)
export PROXY_HOOK='source /etc/network_turbo'
export CRED_FILE=/root/.wandb_key
```

### Phase 0 — Environment audit
```bash
ssh autodl-1 'df -h /root/autodl-tmp /root/autodl-fs / && df -i /root/autodl-fs && \
  cat /sys/fs/cgroup/memory.max | numfmt --to=iec && nvidia-smi'
bash scripts/gpu_health.sh 0 # run ON the box: Xid / throttle pre-flight (U22/U23)
```
Budget the disk: `ckpt_size × cells_in_queue + scratch`. **Verify:** `nvidia-smi` shows the expected
GPU; `df -i /root/autodl-fs` is well under 100% (the inode cap, U7).
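
The disk budget formula above is simple arithmetic, and a worked instance makes the units explicit. The sketch below is illustrative only: the checkpoint size, scratch allowance, free space, and the 80% headroom factor are assumed numbers, not values from the skill.

```python
def disk_budget_gb(ckpt_size_gb, cells_in_queue, scratch_gb):
    """ckpt_size x cells_in_queue + scratch, all in GB."""
    return ckpt_size_gb * cells_in_queue + scratch_gb


def fits(budget_gb, free_gb, headroom=0.8):
    """True if the budget stays within a fraction of free space.

    The 0.8 headroom factor is an assumption for this sketch.
    """
    return budget_gb <= free_gb * headroom


# Assumed numbers: 3 cells (as in queue_1.txt), 2.5 GB per checkpoint, 10 GB scratch.
budget = disk_budget_gb(ckpt_size_gb=2.5, cells_in_queue=3, scratch_gb=10)
print(budget)                      # 17.5
print(fits(budget, free_gb=50))    # True
print(fits(budget, free_gb=20))    # False
```

Note that this covers bytes only; the inode check from `df -i` is a separate limit and needs its own look.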

### Phase 1 — SSH + credentials
```bash
# alias already in ~/.ssh/config (references/ssh_transport.md). Push the wandb key via stdin,
# to the per-instance disk — NEVER the shared FS (U34, and AutoDL's classifier blocks it, AD-gotcha):
printf '%s\n' "$WANDB_KEY_FROM_ENV" | ssh autodl-1 'umask 077; cat > /root/.wandb_key && chmod 600 /root/.wandb_key'
```
**Verify:** `ssh autodl-1 'python -c "import torch;print(torch.cuda.is_available())"'` prints `True`.

### Phase 2 — Wrapper + CPU-smoke gate
```bash
# Parameterize the templates, drop the .template suffix, smoke locally on CPU BEFORE renting time:
cp scripts/run_one.sh.template run_one.sh && cp scripts/run_queue.sh.template run_queue.sh
python -m src.train -c configs/ablation/baseline.yaml --task reconstruction \
  --limit-batches 2 --epochs 1 # logger off; catches import/shape/scale bugs for free
```
**Verify:** the smoke exits 0 on 2 batches. (Smoke *content* → **REQUIRED:** `verifying-dl-experiments`.)

### Phase 3 — Detached launch
```bash
# Push the parameterized wrappers + queue to the shared FS (ONE copy, all instances read it):
scp run_one.sh run_queue.sh examples/autodl_sweep/queue_1.txt autodl-1:/root/autodl-fs/
ssh autodl-1 "RUN_ONE=/root/autodl-fs/run_one.sh tmux new -d -s q1 \
  'bash /root/autodl-fs/run_queue.sh /root/autodl-fs/queue_1.txt 2>&1 | tee /root/autodl-tmp/runs/logs/q1_master.log'"
```
**Verify within 60 s:** `ssh autodl-1 'tmux ls && tail -5 /root/autodl-tmp/runs/logs/q1_master.log'` shows
the session alive and a `STARTING baseline` line. Never overwrite the FS wrapper mid-run (U2 / principle #6).

### Phase 4 — Durable monitoring
```bash
ssh autodl-1 'grep -hE "STARTING|FINISHED|QUEUE DONE|ERROR|Traceback" /root/autodl-tmp/runs/logs/q1_master.log | tail -8'
```
For a multi-hour sweep deploy the four-layer architecture (`references/monitoring_patterns.md`): a remote
self-completion marker + a session patrol loop. Flag a FINISHED at <50% typical duration (probable
early-stop) and re-launch the **identical** config (principle #7), never a patched one. Don't blind-retry.
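
The "FINISHED at <50% typical duration" rule can be sketched as a small predicate. The README does not define "typical duration", so the sketch assumes it is the median of earlier cells' durations. The function names and the sample durations are hypothetical.

```python
from statistics import median


def typical_duration(past_durations_s):
    """Assumption: 'typical' is the median of previously completed cells."""
    return median(past_durations_s)


def is_suspicious_finish(duration_s, typical_s, threshold=0.5):
    """Flag a cell that reported FINISHED in under `threshold` of the typical time."""
    return duration_s < threshold * typical_s


# Hypothetical history of completed cells, in seconds.
history = [3500, 3600, 3900]
typical = typical_duration(history)

print(typical)                               # 3600
print(is_suspicious_finish(1000, typical))   # True: probable early stop
print(is_suspicious_finish(2000, typical))   # False
```

A flagged cell is a prompt to inspect the log before re-launching the identical config, not a reason to retry automatically.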

### Phase 5 — Aggregate + verify + teardown
```bash
ssh autodl-1 'DATA_DIR=/root/autodl-tmp DURABLE_DIR=/root/autodl-fs bash /root/autodl-fs/aggregate_to_fs.sh' # gated sync (U33)
LOCAL_TARGET=/path/to/local/final_ckpts REMOTE_ALIAS=autodl-1 \
  REMOTE_PATH=/root/autodl-fs/final_ckpts bash scripts/download_loop.sh # resumable per-dir pull
python scripts/verify_local.py /path/to/local/final_ckpts/ # LOAD each best.pth
```
**Verify:** `verify_local.py` reports 100% OK. **Iron Law:** only AFTER every cell is pulled AND
load-verified AND the user approves does teardown run — on AutoDL `关机` stops the meter and keeps the
disk (the reversible exception); `release` frees it irreversibly. Reconcile against the roster, not the
log (`references/parallel_ablation.md` §6). **REQUIRED:** `superpowers:verification-before-completion`.
@@ -0,0 +1,6 @@
# Queue file for instance 1 — one ablation cell per line: <cfg_path> <task> [epochs]
# Blank epochs => wrapper default (20). Detection needs more epochs (U32) => 50.
# Split cells across queue_1.txt / queue_2.txt by COST so queues finish together.
configs/ablation/baseline.yaml reconstruction 20
configs/ablation/no_aug.yaml reconstruction 20
configs/ablation/det_baseline.yaml detection 50
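
The queue format described in the comments (`<cfg_path> <task> [epochs]`, with epochs falling back to 20) can be sketched as a parser. The shipped consumer is the shell template `run_queue.sh.template`, which is not shown in this diff, so the Python below is only an illustration of the format. `parse_queue` is a hypothetical name, and the last sample row is invented to show the default.

```python
DEFAULT_EPOCHS = 20  # "wrapper default (20)" per the queue file's own comment


def parse_queue(text):
    """Return (cfg_path, task, epochs) tuples, skipping blank and comment lines."""
    cells = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) not in (2, 3):
            raise ValueError(f"line {lineno}: expected <cfg_path> <task> [epochs]")
        epochs = int(parts[2]) if len(parts) == 3 else DEFAULT_EPOCHS
        cells.append((parts[0], parts[1], epochs))
    return cells


sample_queue = """\
# one ablation cell per line
configs/ablation/baseline.yaml reconstruction 20
configs/ablation/det_baseline.yaml detection 50
configs/ablation/no_epochs.yaml reconstruction
"""

for cell in parse_queue(sample_queue):
    print(cell)
```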
@@ -0,0 +1,100 @@
# Platform Profile Schema

Every `profiles/<platform>.md` describes ONE platform with the **same 8 sections in the same order**, so
they are scannable and diffable. A profile owns all the *slow-changing, per-platform* substrate that the
SKILL.md phases delegate to. It does **not** describe a specific job (that's the portable job request,
below) and never repeats the universal gotchas (those live in `references/gotchas_universal.md` — link,
don't restate).

Design rule borrowed from SkyPilot / dstack / Ray: **hardware is a CONSTRAINT, not a SKU.** A job asks
for `gpu: A100:8`; the profile owns how that maps to this platform's instance types. **Secrets are
referenced by env-var NAME or file path only — never inline a key**.

---

## Required structure of `profiles/<platform>.md`

Start each profile with a compact frontmatter block (the machine-readable facts), then the 8 prose
sections.

```yaml
---
platform: <name> # e.g. runpod
kind: ssh-rental # ssh-rental | cloud-api | kubernetes | slurm
meter_stop_verb: terminate # the action that STOPS billing (stop | terminate | destroy | release | 关机 | manual)
meter_stop_irreversible: true
detach_primitive: tmux # tmux | sbatch | k8s-job | nohup | kaggle-commit
spot_available: true
spot_grace: ~5s # SIGTERM→SIGKILL window, or n/a
shared_fs: false # is there a cross-instance shared filesystem?
inode_cap: none # ~200K | none | host-dependent
free_egress: true # download/upload to the wire free?
china_mirror_needed: false # does it sit behind the GFW?
host_driver_cuda_max: "12.x"
local_nvme: true
---
```
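
Because the frontmatter is meant to be machine-readable, a profile can be linted for missing keys. The sketch below uses only the standard library and handles just the flat `key: value # comment` lines shown above, so it is not a general YAML parser. Treating all thirteen keys as required is an assumption, and `parse_frontmatter` / `missing_keys` are hypothetical helper names.

```python
# Key names come from the frontmatter block above; requiring all of them is an assumption.
REQUIRED_KEYS = [
    "platform", "kind", "meter_stop_verb", "meter_stop_irreversible",
    "detach_primitive", "spot_available", "spot_grace", "shared_fs",
    "inode_cap", "free_egress", "china_mirror_needed",
    "host_driver_cuda_max", "local_nvme",
]


def parse_frontmatter(text):
    """Parse flat `key: value  # comment` lines between two '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("profile must start with a '---' frontmatter line")
    facts = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return facts
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        # Drop a trailing comment and surrounding quotes; values stay strings.
        facts[key.strip()] = value.split(" #", 1)[0].strip().strip('"')
    raise ValueError("unterminated frontmatter block")


def missing_keys(facts):
    return [k for k in REQUIRED_KEYS if k not in facts]


sample_profile = """\
---
platform: runpod
kind: ssh-rental
meter_stop_verb: terminate # the action that STOPS billing
spot_grace: ~5s
host_driver_cuda_max: "12.x"
---
### 1. LAUNCH
"""

facts = parse_frontmatter(sample_profile)
print(facts["meter_stop_verb"], facts["host_driver_cuda_max"])
print(missing_keys(facts))
```

Values are deliberately left as strings (`"true"`, `"none"`, `"~5s"`), since the schema mixes booleans, free text, and approximate quantities in the same block.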

### 1. LAUNCH
Entry points (web console / CLI / REST API / SSH), the canonical create command, and the **env
contract** — what IS the Python env (prebuilt base? a Docker image you choose? Lambda Stack?). State the
rule "the image/base IS the env — do not `conda create` on a rental" if it applies.

### 2. STORAGE MODEL *(the survival matrix — principle #4)*
List every storage tier with its path, speed, and size/inode cap. Then a **survival matrix**:

| Tier | Path | Survives STOP? | Survives DESTROY? | Cap |
|---|---|---|---|---|

State region/DC-lock for any shared/network volume. Name the mount checkpoints MUST go to for the
teardown verb in §5.

### 3. NETWORK
Egress/proxy story, China-mirror relevance (link `references/china-network.md` if applicable), how
ports/services are exposed (TB/Jupyter), and the **SSH flavor(s)** — note if proxied/basic SSH cannot
`scp`/`rsync` (then direct-TCP is required) and whether ports change on restart.

### 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
The interruption model (spot bid? capacity? auto-shutdown clock? auto-release?), the **detection signal +
grace window**, and the resume hook. Link `references/spot-resilience.md` for the cadence formula.

### 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*
Exactly **what stops the meter** (stop vs terminate vs destroy vs 关机), what each preserves, what is
**irreversible**, and the cost trap (e.g. "stop still bills storage 2×"). This is the most error-prone
section — be precise.

### 6. DAEMON TOOL
The detach primitive (`tmux` / `sbatch` / Job manifest / commit), whether it survives an instance restart
(not just an SSH drop), and any native queue/scheduler. Note if `tmux` must be `apt install`-ed or is
absent (use `nohup … </dev/null >log 2>&1 &`).

### 7. TOP GOTCHAS (4–8, platform-pinned)
Only the *platform-specific* ones, Symptom → Root cause → Fix. Universal gotchas are referenced, not
repeated. Give each a stable local id (e.g. `RP1`, `VAST2`).

### 8. SCRIPT OVERRIDES
The exact values to parameterize the `scripts/` templates for this platform:
`DATA_DIR=` (fast scratch) · `DURABLE_DIR=` (survives teardown) · `PROXY_HOOK=` · `CRED_FILE=` (file path; `""` if the key is an env var/secret) · `SCRATCH=` (what to prune) · `HF_HOME=` · `DETACH=`.
The templates read exactly these env-var names. Two further knobs *derive* rather than being set per
platform: `RUN_ONE` (the queue runner's path to `run_one.sh`) defaults to `$DURABLE_DIR/run_one.sh`, and
`PROJECT_REPO_DIR` (where *this run's* code lives) is a per-run value — see "Portable job request" below;
set either explicitly only if your layout differs.
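
The override rules in section 8 amount to "seven names set per platform, plus `RUN_ONE` derived from `DURABLE_DIR` unless given explicitly". The shipped templates are shell scripts that are not shown here, so the Python below only mirrors the described rule. `resolve_overrides` and `unset_overrides` are hypothetical helpers, and `/some/other/run_one.sh` is a made-up path.

```python
# The seven per-platform names listed in section 8.
PLATFORM_OVERRIDES = [
    "DATA_DIR", "DURABLE_DIR", "PROXY_HOOK", "CRED_FILE",
    "SCRATCH", "HF_HOME", "DETACH",
]


def resolve_overrides(env):
    """Return a copy of env with RUN_ONE defaulted to $DURABLE_DIR/run_one.sh."""
    resolved = dict(env)
    if "RUN_ONE" not in resolved:
        resolved["RUN_ONE"] = f"{resolved['DURABLE_DIR']}/run_one.sh"
    return resolved


def unset_overrides(env):
    """Names from section 8 that are absent (an empty string counts as set)."""
    return [name for name in PLATFORM_OVERRIDES if name not in env]


# Values taken from the AutoDL walkthrough; only a subset is set here on purpose.
autodl_env = {
    "DATA_DIR": "/root/autodl-tmp",
    "DURABLE_DIR": "/root/autodl-fs",
    "PROXY_HOOK": "source /etc/network_turbo",
    "CRED_FILE": "/root/.wandb_key",
}

print(resolve_overrides(autodl_env)["RUN_ONE"])
print(unset_overrides(autodl_env))
```

An empty `CRED_FILE` is treated as set because the schema allows `""` when the key lives in an env var or secret.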

---

## Portable job request (NOT in the profile — keep it per-run)

A job is described separately so the *same* job runs against any profile. Document it in
`references/parallel_ablation.md`; the shape:

```yaml
resources:
  gpu: {name: A100, count: 8, memory: 40GB+} # a CONSTRAINT (ranges ok), never a platform SKU
  disk: 200GB
candidates: [autodl, china, runpod] # ordered fallback → "describe once, run anywhere"
file_mounts: {/data: {source: ..., mode: MOUNT_CACHED}} # MOUNT | COPY | MOUNT_CACHED
run: "bash run_queue.sh queue.txt"
```

The launcher resolves a job against a profile; the profile supplies paths/verbs, the job supplies
the work. Keeping them separate is what makes a profile reusable across every job.
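
To make the job/profile split concrete, here is one way the ordered `candidates` fallback could be resolved. No launcher is shown in this diff, so everything structural is an assumption for illustration: the `resolve` function, the `instance_types` mapping inside each profile, and the placeholder instance-type names. Only the `meter_stop_verb` values and the job fields are taken from the text above.

```python
def resolve(job, profiles):
    """Pick the first candidate platform that can satisfy the GPU constraint.

    The profile supplies platform facts (instance type, meter-stop verb);
    the job supplies the work (`run`). Returns None if nothing matches.
    """
    gpu = job["resources"]["gpu"]
    wanted = (gpu["name"], gpu["count"])
    for name in job["candidates"]:          # ordered fallback
        profile = profiles.get(name)
        if profile is None:
            continue                        # no profile on file for this candidate
        instance_type = profile["instance_types"].get(wanted)
        if instance_type is not None:
            return {
                "platform": name,
                "instance_type": instance_type,
                "meter_stop_verb": profile["meter_stop_verb"],
                "run": job["run"],
            }
    return None


job = {
    "resources": {"gpu": {"name": "A100", "count": 8}},
    "candidates": ["autodl", "china", "runpod"],
    "run": "bash run_queue.sh queue.txt",
}

# Hypothetical profile data: the instance-type names are placeholders, not real SKUs.
profiles = {
    "autodl": {"meter_stop_verb": "关机",
               "instance_types": {("A100", 1): "placeholder-a100x1"}},
    "runpod": {"meter_stop_verb": "terminate",
               "instance_types": {("A100", 8): "placeholder-a100x8"}},
}

print(resolve(job, profiles))
```

In this made-up data `autodl` cannot offer eight A100s and `china` has no profile entry, so resolution falls through to `runpod`; the job definition itself never changes.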