opencode-skills-collection 3.1.2 → 3.1.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bundled-skills/.antigravity-install-manifest.json +4 -1
- package/bundled-skills/agent-creator/SKILL.md +246 -0
- package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
- package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
- package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
- package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
- package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
- package/bundled-skills/docs/sources/sources.md +1 -1
- package/bundled-skills/docs/users/bundles.md +1 -1
- package/bundled-skills/docs/users/claude-code-skills.md +1 -1
- package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
- package/bundled-skills/docs/users/getting-started.md +1 -1
- package/bundled-skills/docs/users/kiro-integration.md +1 -1
- package/bundled-skills/docs/users/usage.md +4 -4
- package/bundled-skills/docs/users/visual-guide.md +4 -4
- package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
- package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
- package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
- package/bundled-skills/remote-gpu-trainer/README.md +267 -0
- package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
- package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
- package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
- package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
- package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
- package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
- package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
- package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
- package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
- package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
- package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
- package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
- package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
- package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
- package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
- package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
- package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
- package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
- package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
- package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
- package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
- package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
- package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
- package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
- package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
- package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
- package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
- package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
- package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
- package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
- package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
- package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
- package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
- package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
- package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
- package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
- package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
- package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
- package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
- package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
- package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
- package/package.json +1 -1
- package/skills_index.json +66 -0
@@ -0,0 +1,327 @@
# Profile: AutoDL

The deepest, battle-tested profile: a Chinese cgroup-isolated SSH-rental with a 3-tier storage model
and the *one* rental where the meter-stop action is non-destructive. Fills all 8 schema sections
(`profiles/_schema.md`) at full depth. Read this **before Phase 0**; it owns every path, proxy, billing
verb, and TensorBoard (TB) pin the SKILL.md phases delegate to. Universal gotchas are NOT restated here; see
`references/gotchas_universal.md`.

> **Surface to the user up front (principle #10):** conveniences most users miss: the console has a
> **one-click "设置SSH免密登录"** (set up passwordless SSH login; it registers your key so the agent connects
> non-interactively), **GPU-availability notifications** ("订阅GPU通知", subscribe to GPU notifications), and
> built-in **AutoPanel / JupyterLab / TensorBoard** tiles. ⚠️ Danger clocks:
> **关机 (stop) auto-releases the box after 15 days → the data disk is deleted** (AD-DANGER, §5); only
> `/root/autodl-fs` survives a 释放 (release); low balance / arrears force-stop. And the TB tile is **pinned to
> `/root/tf-logs`**: write your logger there (or symlink) or the panel shows empty (AD7 / U39).

To jump: `grep -in '<keyword>' profiles/autodl.md` (e.g. `grep -in inode profiles/autodl.md`).

## Table of contents

1. LAUNCH: entry points + env contract (base miniconda IS the env)
2. STORAGE MODEL: 3 tiers + survival matrix + inode cap
3. NETWORK: academic proxy + China mirrors + pinned TB
4. SPOT / INTERRUPTION + RESUME: effectively on-demand
5. TEARDOWN / BILLING: 关机 stops the meter AND keeps the disk (the AutoDL exception)
6. DAEMON TOOL: tmux / nohup
7. TOP GOTCHAS: AD1..AD9, platform-pinned
8. SCRIPT OVERRIDES: values to parameterize `scripts/`

---

```yaml
---
platform: autodl
kind: ssh-rental
meter_stop_verb: 关机  # shutdown/power-off STOPS billing AND keeps /root + disks
meter_stop_irreversible: false  # the AutoDL EXCEPTION: 关机 is reversible; only 释放/release deletes
detach_primitive: tmux  # nohup fallback when tmux is not installed (often absent on fresh image)
spot_available: false  # on-demand only; no spot/bid/preemption model
spot_grace: n/a
shared_fs: true  # /root/autodl-fs: region-locked, cross-instance within one region
inode_cap: ~200K  # hard cap on the shared FS, independent of byte capacity
free_egress: true  # no per-GB egress fee, but cross-GFW pulls need the academic proxy (see china_mirror_needed)
china_mirror_needed: true  # behind the GFW: hf-mirror / ModelScope + /etc/network_turbo
host_driver_cuda_max: image-dependent  # the prebuilt image pins torch+CUDA; do not downgrade (AD9)
local_nvme: true  # /root/autodl-tmp data disk is fast local NVMe, per-instance
---
```

---

## 1. LAUNCH

**First time? (rent → reach the box).** On the AutoDL console: pick a GPU + region with stock → **创建实例**
(create instance; choose the PyTorch image, whose base env ships prebuilt) → register your key once via
**设置SSH免密登录** (so the agent connects non-interactively) → copy the instance's **SSH connection string** +
password from the console → test `ssh -p <PORT> root@connect.<region>.seetacloud.com 'nvidia-smi'`. That string
is your entry to every phase below. (Console-only steps; AutoDL's UI shifts, so re-check its docs if a label moved.)

**Entry points.** Web console (创建实例) for create/release/power; per-instance SSH connection string from
the console (`ssh -p <PORT> root@connect.<region>.seetacloud.com`). There is no first-class platform CLI/REST for
job control; SSH is the orchestration channel. Set a stable alias per instance in `~/.ssh/config`
(`Host autodl-<proj>-<N>`, `HostName connect.<region>.seetacloud.com`, `Port <PORT>`) so every later
command is short; the port is assigned at create-time and **changes on re-create** (update the alias).
SSH/keepalive config → `references/ssh_transport.md`.
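
The per-instance alias described above can be written once and reused. This is a minimal sketch, assuming a shell that supports `local`; the function name `add_autodl_alias` and the `ServerAliveInterval 30` keepalive are illustrative assumptions, not values from this profile (the actual keepalive settings are in `references/ssh_transport.md`).

```sh
# Append an SSH alias for one AutoDL instance to an ssh config file.
# Host, port, and alias values are placeholders read off the AutoDL console.
# Usage: add_autodl_alias <alias> <hostname> <port> [config_file]
add_autodl_alias() {
  local alias_name="$1" host="$2" port="$3" config="${4:-$HOME/.ssh/config}"
  mkdir -p "$(dirname "$config")" || return 1
  # Refuse to add a duplicate Host entry; after a re-create, edit the Port instead.
  if [ -f "$config" ] && grep -qxF "Host $alias_name" "$config"; then
    echo "alias $alias_name already present in $config" >&2
    return 1
  fi
  cat >>"$config" <<EOF

Host $alias_name
    HostName $host
    Port $port
    User root
    ServerAliveInterval 30
EOF
}
```

After this, `ssh autodl-<proj>-<N> nvidia-smi` replaces the full connection string in every later command.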

**Env contract: the prebuilt base miniconda IS the env (AD6).** The image ships the full DL stack into
**base** (`/root/miniconda3/bin/python`); there is no `/root/miniconda3/envs/<name>/`. Base is the
deliberate single-tenant project env. **Never `conda create` (including `conda create --clone base`)** on the
rental: cloning wastes ~16 GB on a copy of the base packages, eating the disk just freed, for zero benefit.
Train with the explicit interpreter `/root/miniconda3/bin/python`; in remote polls use that path or pure shell,
never bare `python3` (it may be absent → exit 127). When installing project deps, **filter framework pins** so a
`requirements.txt` does not downgrade the image's torch build (AD9).

> The "no DL in conda base" discipline applies to the *persistent local* machine only; on an ephemeral
> rental, base IS the expected place to run. A local env-guard hook must exempt remote-ssh + instance base.

---

## 2. STORAGE MODEL *(survival matrix, principle #4)*

Three tiers, each with a different speed / size / inode profile and a **different survival behavior**:

| Tier | Path | Speed | Size | Inode cap | Scope |
|---|---|---|---|---|---|
| System disk | `/` | medium | ~30 GB | none | per-instance |
| Data disk | `/root/autodl-tmp` | **fast NVMe** | per-plan (e.g. ~50 GB) | none | per-instance |
| Shared FS | `/root/autodl-fs` | NFS (slow, ~30 s/sync) | ~200 GB | **~200K (hard)** | **region-locked**, all instances in one region |

**Survival matrix:** the part most platforms get wrong, and where AutoDL is the **exception**:

| Tier | Survives 关机 (stop)? | Survives 释放 (release/destroy)? | Notes |
|---|---|---|---|
| `/` system | **yes** | no | AutoDL persists `/root` across power-off, UNLIKE RunPod/vast/K8s/Colab |
| `/root/autodl-tmp` data | **yes** | no | fast tier; checkpoints written here mid-run |
| `/root/autodl-fs` shared | **yes** | **yes** | the ONLY tier that survives release; region-locked |

**Where checkpoints MUST go for the §5 teardown verb:** write live checkpoints to the fast data disk
(`/root/autodl-tmp/checkpoints/<name>`, never the 30 GB system disk), then **checked-sync `best.pth`
to `/root/autodl-fs`**, the only tier that survives a 释放. If you only ever use 关机, the data disk also
survives, but syncing the durable copy to FS is the safe default (a later release loses the data disk).

**Region/DC-lock (AD3).** FS quota is region-scoped; each region has its own physical mount. Files written
from a `<region-a>` instance are invisible to a `<region-b>` instance even at the identical
`/root/autodl-fs/` path. Create the FS quota in the **same region** as the instances; to bridge regions,
pick one region as primary and scp between them (slow). Confirm sharing with a write-from-one /
read-from-another probe before relying on it.

**Inode discipline (AD4).** The ~200K cap is **independent of bytes**: `df -h` can read 34% while `cp`
fails with "No space left" because `df -i` is at 100%. The inode bomb is **per-sample eval visualization**
(`files_per_sample × N_samples × N_conditions` → tens of thousands of tiny files); checkpoints (few large
files) are inode-cheap. Monitor `df -i`, not just `df -h` (Phase 0 + every space check). Eval-artifact
sizing policy is owned by **REQUIRED:** verifying-dl-experiments.
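
The `df -i` check above can be folded into any space check. This is a sketch only: the function names `inode_pct` and `inode_ok` and the 90% default threshold are assumptions for illustration, and the parsing assumes the six-column layout that `df -Pi` prints on GNU/Linux.

```sh
# Print the inode-use percentage (IUse% column) from `df -Pi` output on stdin.
# Prints nothing when the filesystem reports no inode counts ("-").
inode_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); if ($5 ~ /^[0-9]+$/) print $5 }'
}

# Return 0 while inode usage on <path> is below <limit> percent, else 1.
# Usage: inode_ok <path> [limit]
inode_ok() {
  local path="$1" limit="${2:-90}" pct
  pct=$(df -Pi "$path" | inode_pct)
  [ -z "$pct" ] && return 0   # no inode accounting reported; nothing to compare
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $path inode use ${pct}% >= ${limit}%" >&2
    return 1
  fi
}
```

A wrapper could call `inode_ok /root/autodl-fs || exit 1` before any sync, so a full inode table is caught before the copy rather than after.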

**Data-disk hog (AD5).** When `/root/autodl-tmp` hits 100% but `runs/` looks small, the real hog is the
**HF cache symlinked onto the data disk** (`~/.cache/huggingface` → tens of GB of model blobs). Audit with
`du -sh ~/.cache/huggingface/hub/models--* | sort -rh` before deleting checkpoints; redirect `HF_HOME` to
the data disk explicitly (see §8). The disk is expandable; prefer expanding it over silently shrinking the
experiment (principle #9). Get explicit user confirmation that names the `rm -rf` targets (the harness
classifier blocks agent-inferred irreversible deletes).

---

## 3. NETWORK

**Egress proxy: `source /etc/network_turbo` is MANDATORY (AD1).** Instances start with no proxy; direct
egress to `api.wandb.ai` / `huggingface.co` / `github.com` / `pypi.org` is unreliable (0.5 s … 300 s …
blocked). Every shell that calls wandb / HF / pip / git must `source /etc/network_turbo` first
(`source /etc/network_turbo 2>/dev/null || true` at the top of every wrapper). It exports
`http_proxy` / `https_proxy` pointing at the in-DC academic proxy (`http://<proxy-ip>:<port>`), a
`no_proxy` allow-list for domestic endpoints, and the CA bundle. Perf delta: wandb push ~0.8 s with turbo
vs >120 s timeout without. No exceptions: even a small `wandb.summary` write can wedge for minutes.

**China mirrors (AD2).** HF behind the GFW → `HF_ENDPOINT=https://hf-mirror.com` or pull from
**ModelScope**. Two compounding traps: (a) HF's **Xet CAS backend** is NOT mirror-proxied (the mirror
covers the API but big `.safetensors` shards still hit the flaky international endpoint) →
`export HF_HUB_DISABLE_XET=1` (or `pip uninstall -y hf_xet`) to force the classic LFS path the mirror does
proxy; (b) `no_proxy` in network_turbo lists `modelscope.com` but **not** `modelscope.cn`, and routing a
DOMESTIC source through the international-acceleration proxy SLOWS it. Wrap every download in a
`timeout <s> … && break` retry loop (resumes partial files; a stall ≠ permanent failure). Full mirror
table + `no_proxy` ladder → `references/china-network.md`.
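
The `timeout <s> … && break` retry pattern above can be wrapped in a small helper. A minimal sketch, assuming GNU `timeout` is available on the image; the function name `retry_with_timeout` and the fixed 1 s pause between attempts are illustrative choices, not part of the bundled scripts.

```sh
# Retry a command up to <max_tries> times, killing each attempt after <timeout_s>.
# Usage: retry_with_timeout <max_tries> <timeout_s> <command...>
retry_with_timeout() {
  local tries="$1" limit="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    # timeout exits 124 when the limit is hit; any non-zero status means retry.
    if timeout "$limit" "$@"; then
      return 0
    fi
    echo "attempt $i/$tries failed; retrying" >&2
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

Because `timeout` runs an external program, the command must be an executable or a script, not a shell function. A download would be invoked as `retry_with_timeout 20 600 <download command>` after sourcing `/etc/network_turbo` and exporting `HF_ENDPOINT` and `HF_HUB_DISABLE_XET=1`.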

**Port exposure.** AutoDL maps a single custom port (6006) for user services; the platform also exposes
JupyterLab. SSH port is the per-instance `<PORT>` and changes on re-create.

**Platform TensorBoard is pinned to `/root/tf-logs` (AD7).** The image autostarts
`tensorboard --logdir /root/tf-logs --port 6007` on boot and the AutoPanel TB tile proxies straight to that
pid; the `--logdir` is hard-pinned and cannot be reconfigured from inside the container. Events written
anywhere else are invisible in the web tile no matter how correct the `SummaryWriter` setup. Fix: write to
`SummaryWriter(log_dir="/root/tf-logs/<run>")`, or `ln -sfn <your-tb> /root/tf-logs/<run>` (the pinned TB
has `--reload=5`, so the run appears within ~5 s with no restart). Verify with
`curl -s http://127.0.0.1:6007/data/runs` (expect a JSON array with the run), NOT `ss` (can show nothing
inside the container while curl returns 200). Local logs die with the instance; for durable curves use a
hosted tracker (**REQUIRED:** huggingface-skills:huggingface-trackio).
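
The symlink fix above can be sketched as a helper. The function name `link_tb_run` and the optional third argument (useful for trying it outside an AutoDL box) are assumptions for illustration; only the `/root/tf-logs` default and the `ln -sfn` call come from this profile.

```sh
# Expose an existing TensorBoard event directory under the pinned logdir.
# Pass an absolute <event_dir>, since the link target is stored verbatim.
# Usage: link_tb_run <event_dir> <run_name> [tb_root]
link_tb_run() {
  local src="$1" run="$2" tb_root="${3:-/root/tf-logs}"
  [ -d "$src" ] || { echo "no such event dir: $src" >&2; return 1; }
  mkdir -p "$tb_root" || return 1
  # -n replaces an existing link instead of creating a link inside its target.
  ln -sfn "$src" "$tb_root/$run"
}
```

After linking, `curl -s http://127.0.0.1:6007/data/runs` should list the run name once the pinned TensorBoard reloads.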

**SSH flavor.** Direct-TCP SSH on the per-instance host:port, so `scp`/`rsync` work normally (no proxied-SSH
restriction). Use a per-dir resumable loop for large transfers (single-connection `scp -r` resets
mid-transfer); `rsync -avz --partial` is preferred. Transport patterns → `references/ssh_transport.md`.

---

## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*

**No spot/bid/preemption model: AutoDL is on-demand.** There is no mid-run eviction and no SIGTERM grace
window to handle (`spot_grace: n/a`). The real loss vectors are: (a) **forgot to release/关机** → idle
billing (principle #1); (b) an instance **reboot** that ends the running process, detached or not (a vanished
process is not always OOM; enumerate reboot / OOM / SSH-HUP / manual-kill before concluding, see
`references/gotchas_universal.md`); (c) availability: the GPU plan being sold out at create-time (build
retry-until-available, not survive-an-eviction).

**Resume hook.** The universal spine still applies (principle #8): checkpoint atomically to the data disk +
sync `best.pth` to FS, and resume-from-latest unconditionally on relaunch. The detach primitive (§6) makes
the *identical launch command* survive an SSH drop; checkpoint+resume makes it survive a reboot. Cadence
formula → `references/spot-resilience.md` (the formula generalizes even without spot: it bounds the
re-compute lost to a reboot).

---

## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*

**关机 (shutdown / power-off) STOPS the meter AND keeps `/root` + both disks; this is the AutoDL
EXCEPTION among rentals.** Everywhere else (RunPod wipes the container disk on stop, vast bills the disk
forever, K8s wipes the pod FS, Colab loses `/content`) a "stop" is lossy or still-billing. On AutoDL,
关机 is the **safe park**: meter off, all three tiers intact, restart later. There is also a **no-GPU
mode (无卡模式)** for a cheap restart to copy files or fix the env without paying for the GPU.

| Action | Stops meter? | Keeps `/` + data disk? | Keeps FS? | Reversible? |
|---|---|---|---|---|
| 关机 (shutdown) | **yes** | **yes** | yes | **yes**, restart anytime (the AutoDL exception) |
| 无卡模式 (no-GPU) | mostly (cheap) | yes | yes | yes |
| 释放 (release/destroy) | yes | **NO** | yes | **NO, deletes `/` + data disk irreversibly** |

**Cost trap.** 关机 still bills the data-disk *storage* at a small rate while the GPU meter is off: far
cheaper than running, but not free. Only 释放 fully ends storage billing, at the cost of the data disk.

**⚠️ Auto-release clock (AD-DANGER):** a 关机 (stopped) instance is **auto-released after 15 days** (the
console shows "关机 15 天后释放", i.e. "released 15 days after shutdown") → that release deletes `/` **and the
data disk**, so 关机 is safe parking only *within* the window; for a longer pause, sync `best.pth` to
`/root/autodl-fs` (survives 释放) or expect to re-download. Low balance / arrears also force-stop the instance.
**Surface this to the user up front (principle #10)**: most users assume 关机 parks the box indefinitely.
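
To make the 15-day window concrete, the remaining days can be computed from the date the instance was stopped. This is a sketch under stated assumptions: the function name `days_until_release` is hypothetical, it needs GNU `date` for `-d`, and it counts whole 24 h periods, which may not match exactly how the console counts.

```sh
# Days left before a stopped (关机) instance is auto-released.
# Usage: days_until_release <stop_date YYYY-MM-DD> [today YYYY-MM-DD] [window_days]
days_until_release() {
  local stopped="$1" today="${2:-$(date -u +%F)}" window="${3:-15}"
  local t0 t1
  t0=$(date -u -d "$stopped" +%s) || return 1
  t1=$(date -u -d "$today" +%s) || return 1
  # Whole days elapsed since the stop, subtracted from the window.
  echo $(( window - (t1 - t0) / 86400 ))
}
```

A result of zero or less means the window has passed and only `/root/autodl-fs` should be assumed to remain.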

**Teardown Iron Law (SKILL.md Phase 5):** no 释放 / file-delete until `best.pth` is **pulled to local AND
verified by load** (`scripts/verify_local.py`) AND the user explicitly approves; "it looked done in the
log" is not evidence (principle #3). Because 关机 is non-destructive here, the cheap safe move when unsure
is to **关机 and ask**, never 释放 on a guess. **REQUIRED:** superpowers:verification-before-completion is
the general form of this gate.

---

## 6. DAEMON TOOL

**tmux** is the detach primitive when present, but **tmux is often NOT installed on a fresh AutoDL image**
and `apt-get install tmux` fails when egress is down. Zero-dependency fallback:
`nohup bash run_queue.sh queue.txt </dev/null >master.log 2>&1 &`, which survives an SSH drop (SIGHUP) and
needs no package. Verify either with `pgrep -af <script>`. The detach survives an SSH drop; it does **not**
survive a 关机/reboot. That is what checkpoint+resume (§4) is for.
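
The `nohup` fallback above, written as a reusable function. This is a sketch of the fallback path only (it does not try tmux first); the function name `launch_detached` and the `<log>.pid` side file are assumptions for illustration.

```sh
# Start a command detached from the SSH session and record its PID.
# Usage: launch_detached <log_file> <command...>
launch_detached() {
  local log="$1"
  shift
  # stdin from /dev/null and all output to the log, so nothing keeps the SSH tty open.
  nohup "$@" </dev/null >"$log" 2>&1 &
  echo "$!" >"$log.pid"
}
```

For example, `launch_detached master.log bash run_queue.sh queue.txt` mirrors the one-liner above and leaves the PID in `master.log.pid` for a later liveness check.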

**Native queue: none.** AutoDL has no built-in scheduler → use the bundled `scripts/run_queue.sh.template`
(resumable queue iterator, `start_index` for resume) driving `scripts/run_one.sh.template` per cell.
**Never overwrite a script that a running bash is still executing** (bash reads by byte-offset → re-executes
blocks; version the filename instead). This is universal physics, see `references/gotchas_universal.md`.

**Monitoring.** A session-bound watcher dies with the session; for multi-hour runs deploy the four-layer
durable architecture (`references/monitoring_patterns.md`). Detect "done" by a **log marker**
(`grep -q 'QUEUE DONE' master.log`), never by `pgrep` (the waiter's own cmdline matches the pattern and
loops forever). A cloud scheduler cannot reach the rented box (putting the SSH key in a cloud sandbox would
leak a secret); the honest recurring check is the remote self-monitor + a session loop with the local key.
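
The log-marker check above can be sketched as a bounded wait. The function name `wait_for_marker` and the 30 s default poll interval are illustrative assumptions; the `QUEUE DONE` marker and `master.log` name come from this section.

```sh
# Block until <marker> appears in <log> or <timeout_s> seconds have elapsed.
# Usage: wait_for_marker <log> <marker> <timeout_s> [poll_s]
wait_for_marker() {
  local log="$1" marker="$2" timeout_s="$3" poll="${4:-30}" waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -F matches the marker literally; a missing log just means "not yet".
    if grep -qF -- "$marker" "$log" 2>/dev/null; then
      return 0
    fi
    sleep "$poll"
    waited=$((waited + poll))
  done
  return 1
}

# Example: wait_for_marker master.log 'QUEUE DONE' 3600 && echo "queue finished"
```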

---

## 7. TOP GOTCHAS (AutoDL-pinned; universal ones → `references/gotchas_universal.md`)

**AD1: external network call hangs / wandb shows 0 runs.** *Symptom:* `wandb.init` times out at
90/120/180 s, dashboard reads 0 runs while `wandb/run-*` exist locally; HF downloads stall; pip/git glacial.
*Root cause:* instances start with **no proxy**; direct egress to wandb/HF/PyPI/GitHub is unreliable or
blocked, and wandb-core's retry logic under a flaky link can roll back already-uploaded runs. *Fix:*
`source /etc/network_turbo` at the top of **every** shell/wrapper before any external call; recover an
empty cloud project with `for d in wandb/run-*; do timeout 120 wandb sync "$d"; done`.

**AD2: HF download stalls even with hf-mirror + turbo.** *Symptom:* `from_pretrained` /
`snapshot_download` hangs or `ConnectTimeout` on big `.safetensors` shards. *Root cause:* (a) HF's Xet CAS
backend is not mirror-proxied; (b) `no_proxy` lists `modelscope.com` not `modelscope.cn` (domestic source
forced through international proxy = slower); (c) a curl test run without turbo measures the wrong path.
*Fix:* `export HF_HUB_DISABLE_XET=1` (or `pip uninstall -y hf_xet`) with `HF_ENDPOINT=https://hf-mirror.com`,
or pull from ModelScope to a plain dir + load via local-path override; wrap in a `timeout … && break`
resume loop. Detail → `references/china-network.md`.

**AD3: cross-region instances cannot share FS.** *Symptom:* two instances in different regions see
identical `/root/autodl-fs/` paths but files written from one are invisible to the other. *Root cause:* FS
quota is region-scoped; each region has its own physical mount. *Fix:* create the FS quota in the same
region as the instances; bridge regions via scp from a chosen primary; verify with a write-one / read-other
probe.

**AD4: FS write fails "No space left" while `df -h` looks fine.** *Symptom:* `cp`/`mkdir` to
`/root/autodl-fs` fails though `df -h` shows ~34%; `df -i` shows `… 0 100%`. *Root cause:* the shared FS
enforces a **hard ~200K inode cap independent of bytes**; per-sample eval visualization (many tiny files)
exhausts it. *Fix:* monitor `df -i`; cap per-sample eval vis on large test sets (sizing →
verifying-dl-experiments); once a results dir is verified locally, prune its per-sample image subdir from FS;
recover by `find /root/autodl-fs -type d -name '<vis-dir>' -exec rm -rf {} +` to free inodes fast.

**AD5: data disk full; HF cache is the hidden hog; agent `rm` auto-denied.** *Symptom:*
`/root/autodl-tmp` at 100% though `runs/` looks small; an agent `rm -rf` of "obvious junk" is auto-denied.
*Root cause:* `~/.cache/huggingface` is symlinked onto the data disk, so the **HF model cache** (tens of
GB) is the real hog; the harness blocks irreversible `rm -rf` whose targets the agent inferred. *Fix:*
audit `du -sh ~/.cache/huggingface/hub/models--* | sort -rh`; set `HF_HOME` to a chosen data-disk dir + keep
the metric/eval JSONs (tiny evidence); present exact deletion targets + sizes for explicit user
confirmation; offer "clean vs expand the disk".

**AD6: base IS the env; a "never use base" rule blocks every remote command.** *Symptom:* a local "don't
run DL in conda base" guard fires on `ssh autodl 'python train.py'`, but `conda env list` shows nothing and
`/root/miniconda3/envs/` is empty; poll scripts calling `python3` exit 127. *Root cause:* the image installs
the whole DL stack into **base**, so base IS the single-tenant project env (no `/envs/`), and the image often
ships only `python` (no `python3`). *Fix:* train with `/root/miniconda3/bin/python`; exempt remote-ssh +
instance base from the local guard (never `conda create --clone base`); in remote scripts use the explicit
interpreter or pure shell, never bare `python3`.

**AD7: platform TensorBoard pinned to `/root/tf-logs`; events elsewhere invisible.** *Symptom:* the
events file is non-empty and `curl http://127.0.0.1:6007/` returns 200, but the AutoPanel TB tile shows
zero runs; `/data/runs` returns `[]`. *Root cause:* the image autostarts `tensorboard --logdir /root/tf-logs`
and the tile proxies that pid; `--logdir` is hard-pinned and not reconfigurable in-container.
*Fix:* write `SummaryWriter(log_dir="/root/tf-logs/<run>")`, or `ln -sfn <your-tb> /root/tf-logs/<run>`
(the pinned TB's `--reload=5` picks it up in ~5 s); verify with `curl … /data/runs`, not `ss`. (Also:
restart the TB server to evict STALE cached tags after deleting/renaming runs.) The cross-platform "live
panel silently empty" class (path/port/process mismatch on any platform) is the general form →
`references/gotchas_universal.md` U39.

**AD8: wandb val-phase CPU memory spike to 30+ GB at epoch 1 end.** *Symptom:* at the end of epoch 1
(validation), cgroup memory jumps from ~8 GB to 30+ GB, sometimes wedging the instance. *Root cause:*
project trainers log per-sample distributions at `step==1` (e.g. LPIPS/VGG over ~2000 samples on CPU =
~30 GB activations). *Fix:* cap the val-time sample accumulator, e.g. `-o training.val_metric_sample_cap=256`
(project-specific knob; check the trainer for the equivalent). Distinct from a DataLoader-worker cgroup OOM
(universal gotcha).

**AD9: project torch pin would DOWNGRADE the image's working build.** *Symptom:* the image ships e.g. a
new-arch-capable torch (sm_120); the project pins `torch<2.9`; a naive `pip install -r requirements.txt`
replaces it with a wheel lacking the arch's kernels → `no kernel image is available` at first forward.
*Root cause:* the image torch/CUDA build is matched to the rented GPU arch; the project pin is stale for it.
*Fix:* filter framework pins out of the remote install with
`grep -ivE '^(torch|torchvision|torchaudio)' requirements.txt > /root/req_remote.txt && pip install -r /root/req_remote.txt`,
keeping the image build; smoke-test `torch.cuda.get_device_capability()` + a heavy import
before launch; disclose the torch version (which falls outside the project's pin) with results.
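
The AD9 filter can be sketched as a function. One deliberate deviation, flagged as an assumption: the pattern here ends the match at the package-name boundary, because the shorter prefix pattern in the text would also drop any other package whose name starts with `torch` (for example `torchmetrics`). The function name `filter_framework_pins` is hypothetical.

```sh
# Copy a requirements file, dropping pins for torch, torchvision, and torchaudio.
# Usage: filter_framework_pins <requirements.txt> <filtered_output>
filter_framework_pins() {
  local src="$1" dst="$2"
  # The trailing class stops the match at the end of the package name, so
  # packages such as torchmetrics are kept. grep exits 1 when no line is
  # left to print, which is not an error here; exit 2 (real error) still fails.
  grep -ivE '^[[:space:]]*(torch|torchvision|torchaudio)([^A-Za-z0-9_.-]|$)' "$src" >"$dst" || [ $? -eq 1 ]
}

# Example (paths from AD9):
#   filter_framework_pins requirements.txt /root/req_remote.txt \
#     && /root/miniconda3/bin/python -m pip install -r /root/req_remote.txt
```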

---

## 8. SCRIPT OVERRIDES

The exact values to parameterize the `scripts/` templates (`scripts/run_one.sh.template`,
`scripts/run_queue.sh.template`) for AutoDL:

```sh
DATA_DIR=/root/autodl-tmp  # fast NVMe data disk: live checkpoints, logs, HF cache
DURABLE_DIR=/root/autodl-fs  # region-locked shared FS, the only tier surviving 释放
PROXY_HOOK='source /etc/network_turbo 2>/dev/null || true'  # MANDATORY before any external call (AD1)
CRED_FILE=/root/.wandb_key  # per-instance ONLY; the FS security classifier blocks wandb keys
SCRATCH='latest.pth'  # prune on success; keep best.pth (the keepable artifact)
HF_HOME=/root/autodl-tmp/huggingface_cache  # redirect off the symlinked ~/.cache hog (AD5)
HF_ENDPOINT=https://hf-mirror.com  # + HF_HUB_DISABLE_XET=1 (AD2)
DETACH=tmux  # nohup fallback when tmux is absent (§6)
PY=/root/miniconda3/bin/python  # base IS the env; explicit interpreter, never bare python3 (AD6)
TB_LOGDIR=/root/tf-logs  # platform TB is pinned here (AD7)
```

**Credential push (AD-specific).** The FS security classifier blocks files matching wandb-key patterns, so
put the key at the **per-instance** `/root/.wandb_key`, never on `/root/autodl-fs`. Stream exactly one
credential block via stdin so the secret never appears in a command; the wrapper reads it
into `WANDB_API_KEY` before launch. Secrets-via-stdin pattern → `references/ssh_transport.md`.
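
The stdin-only rule above, sketched as the receiving half. This illustrates the idea and is not the pattern documented in `references/ssh_transport.md`; the function name `store_secret` is an assumption.

```sh
# Write a secret read from stdin to <path> with owner-only permissions.
# The secret arrives on stdin, so it never shows up in argv or shell history.
# Usage: store_secret <path>
store_secret() {
  local path="$1"
  (
    umask 077      # new file is created 0600; the subshell keeps the umask change local
    cat >"$path"
  )
}
```

Over SSH the same two commands run remotely, with the local key file redirected into stdin: `ssh autodl-<proj>-<N> 'umask 077; cat > /root/.wandb_key' < <local-key-file>`.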

**Checked-sync (the gated success line).** `run_one.sh` writes live checkpoints to
`$DATA_DIR/checkpoints/<name>`, prunes `latest.pth` on success, then syncs `best.pth` to
`$DURABLE_DIR/final_ckpts/<name>`, **gating the success echo on the actual copy result**: an unconditional
"synced" lies when the FS inode cap (AD4) silently fails the `mkdir`/`cp` (universal silent-sync gotcha).
Until a download is verified locally, the **data disk** copy is source-of-truth.
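
A gated success line could look like the sketch below. It is not the contents of `scripts/run_one.sh.template`; the function name `checked_sync` and the `SYNCED` / `SYNC FAILED` message text are assumptions for illustration.

```sh
# Copy <src_file> into <dst_dir> and report success only if every step worked.
# Usage: checked_sync <src_file> <dst_dir>
checked_sync() {
  local src="$1" dst_dir="$2" dst
  dst="$dst_dir/$(basename "$src")"
  # mkdir, cp, and a byte-for-byte compare must all succeed before "SYNCED" is printed.
  if mkdir -p "$dst_dir" && cp "$src" "$dst" && cmp -s "$src" "$dst"; then
    echo "SYNCED $src -> $dst"
  else
    echo "SYNC FAILED $src -> $dst (check df -h and df -i)" >&2
    return 1
  fi
}
```

With the overrides above this would be called as `checked_sync "$DATA_DIR/checkpoints/<name>/best.pth" "$DURABLE_DIR/final_ckpts/<name>"`, and a non-zero return keeps the data-disk copy as the source of truth.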