opencode-skills-collection 3.1.2 → 3.1.3

Files changed (65)
  1. package/bundled-skills/.antigravity-install-manifest.json +4 -1
  2. package/bundled-skills/agent-creator/SKILL.md +246 -0
  3. package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
  4. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  5. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  6. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  7. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  8. package/bundled-skills/docs/sources/sources.md +1 -1
  9. package/bundled-skills/docs/users/bundles.md +1 -1
  10. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  11. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  12. package/bundled-skills/docs/users/getting-started.md +1 -1
  13. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  14. package/bundled-skills/docs/users/usage.md +4 -4
  15. package/bundled-skills/docs/users/visual-guide.md +4 -4
  16. package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
  17. package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
  18. package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
  19. package/bundled-skills/remote-gpu-trainer/README.md +267 -0
  20. package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
  21. package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
  22. package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
  23. package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
  24. package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
  25. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
  26. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
  27. package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
  28. package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
  29. package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
  30. package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
  31. package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
  32. package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
  33. package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
  34. package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
  35. package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
  36. package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
  37. package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
  38. package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
  39. package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
  40. package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
  41. package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
  42. package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
  43. package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
  44. package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
  45. package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
  46. package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
  47. package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
  48. package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
  49. package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
  50. package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
  51. package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
  52. package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
  53. package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
  54. package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
  55. package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
  56. package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
  57. package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
  58. package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
  59. package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
  60. package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
  61. package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
  62. package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
  63. package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
  64. package/package.json +1 -1
  65. package/skills_index.json +66 -0
package/bundled-skills/remote-gpu-trainer/profiles/autodl.md
@@ -0,0 +1,327 @@
1
+ # Profile: AutoDL
2
+
3
+ The deepest, battle-tested profile — a Chinese cgroup-isolated SSH-rental with a 3-tier storage model
4
+ and the *one* rental where the meter-stop action is non-destructive. Fills all 8 schema sections
5
+ (`profiles/_schema.md`) at full depth. Read this **before Phase 0**; it owns every path, proxy, billing
6
+ verb, and TB pin the SKILL.md phases delegate to. Universal gotchas are NOT restated here — see
7
+ `references/gotchas_universal.md`.
8
+
9
+ > **Surface to the user up front (principle #10):** conveniences most users miss — the console has a
10
+ > **one-click "设置SSH免密登录"** ("set up passwordless SSH login"; registers your key so the agent connects non-interactively), **GPU-availability
11
+ > notifications** ("订阅GPU通知", "subscribe to GPU notifications"), and built-in **AutoPanel / JupyterLab / TensorBoard** tiles. ⚠️ Danger clocks:
12
+ > **a 关机 (stopped) instance is auto-released after 15 days → the data disk is deleted** (AD-DANGER, §5); only
13
+ > `/root/autodl-fs` survives a 释放 (release); low balance or arrears force-stop the instance. And the TB tile is **pinned to
14
+ > `/root/tf-logs`** — write your logger there (or symlink) or the panel shows empty (AD7 / U39).
15
+
16
+ To jump: `grep -in '<keyword>' profiles/autodl.md` (e.g. `grep -in inode profiles/autodl.md`).
17
+
18
+ ## Table of contents
19
+
20
+ 1. LAUNCH — entry points + env contract (base miniconda IS the env)
21
+ 2. STORAGE MODEL — 3 tiers + survival matrix + inode cap
22
+ 3. NETWORK — academic proxy + China mirrors + pinned TB
23
+ 4. SPOT / INTERRUPTION + RESUME — effectively on-demand
24
+ 5. TEARDOWN / BILLING — 关机 stops the meter AND keeps the disk (the AutoDL exception)
25
+ 6. DAEMON TOOL — tmux / nohup
26
+ 7. TOP GOTCHAS — AD1..AD9, platform-pinned
27
+ 8. SCRIPT OVERRIDES — values to parameterize `scripts/`
28
+
29
+ ---
30
+
31
+ ```yaml
32
+ ---
33
+ platform: autodl
34
+ kind: ssh-rental
35
+ meter_stop_verb: 关机 # shutdown/power-off STOPS billing AND keeps /root + disks
36
+ meter_stop_irreversible: false # the AutoDL EXCEPTION — 关机 is reversible; only 释放/release deletes
37
+ detach_primitive: tmux # nohup fallback when tmux is not installed (often absent on fresh image)
38
+ spot_available: false # on-demand only; no spot/bid/preemption model
39
+ spot_grace: n/a
40
+ shared_fs: true # /root/autodl-fs — region-locked, cross-instance within one region
41
+ inode_cap: ~200K # hard cap on the shared FS, independent of byte capacity
42
+ free_egress: true # no per-GB egress fee, but cross-GFW pulls need the academic proxy (see china_mirror_needed)
43
+ china_mirror_needed: true # behind the GFW — hf-mirror / ModelScope + /etc/network_turbo
44
+ host_driver_cuda_max: image-dependent # the prebuilt image pins torch+CUDA; do not downgrade (AD9)
45
+ local_nvme: true # /root/autodl-tmp data disk is fast local NVMe, per-instance
46
+ ---
47
+ ```
48
+
49
+ ---
50
+
51
+ ## 1. LAUNCH
52
+
53
+ **First time? (rent → reach the box).** On the AutoDL console: pick a GPU + region with stock → **创建实例** ("create instance";
54
+ choose the PyTorch image, whose base env ships prebuilt) → register your key once via **设置SSH免密登录** ("set up passwordless SSH login",
55
+ so the agent connects non-interactively) → copy the instance's **SSH connection string** + password from the
56
+ console → test `ssh -p <PORT> root@connect.<region>.seetacloud.com 'nvidia-smi'`. That string is your entry to
57
+ every phase below. (Console-only steps; AutoDL's UI shifts — re-check its docs if a label moved.)
58
+
59
+ **Entry points.** Web console (创建实例) for create/release/power; per-instance SSH connection string from
60
+ the console (`ssh -p <PORT> root@connect.<region>.seetacloud.com`). No first-class platform CLI/REST for
61
+ job control — SSH is the orchestration channel. Set a stable alias per instance in `~/.ssh/config`
62
+ (`Host autodl-<proj>-<N>`, `HostName connect.<region>.seetacloud.com`, `Port <PORT>`) so every later
63
+ command is short; the port is assigned at create-time and **changes on re-create** (update the alias).
64
+ SSH/keepalive config → `references/ssh_transport.md`.
65
+
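The alias convention above can be generated rather than hand-typed. A minimal Python sketch, assuming the `Host` / `HostName` / `Port` layout described in this section; the helper name `ssh_config_block` and the example values are hypothetical, and `User root` is taken from the `root@` login in the console's connection string.

```python
def ssh_config_block(project: str, index: int, region: str, port: int) -> str:
    """Render one ~/.ssh/config stanza for an AutoDL instance alias."""
    return (
        f"Host autodl-{project}-{index}\n"
        f"    HostName connect.{region}.seetacloud.com\n"
        f"    Port {port}\n"
        f"    User root\n"
    )


# Hypothetical values: copy the real region and port from the console.
# The port changes when the instance is re-created, so regenerate the stanza then.
print(ssh_config_block("demo", 1, "example-region", 12345))
```

Append the rendered stanza to `~/.ssh/config` by hand after checking it; the sketch deliberately does not edit that file itself.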
66
+ **Env contract — the prebuilt base miniconda IS the env (AD6).** The image ships the full DL stack into
67
+ **base** (`/root/miniconda3/bin/python`); there is no `/root/miniconda3/envs/<name>/`. Base is the
68
+ deliberate single-tenant project env. **Never `conda create` / `conda create --clone base`** on the rental:
69
+ cloning duplicates ~16 GB of base packages and eats the disk space you just freed, for zero benefit. Train with the explicit
70
+ interpreter `/root/miniconda3/bin/python`; in remote polls use that path or pure shell, never bare
71
+ `python3` (it may be absent → exit 127). When installing project deps, **filter framework pins** so a
72
+ `requirements.txt` does not downgrade the image's torch build (AD9).
73
+
74
+ > The "no DL in conda base" discipline applies to the *persistent local* machine only — on an ephemeral
75
+ > rental, base IS the expected place to run. A local env-guard hook must exempt remote-ssh + instance base.
76
+
77
+ ---
78
+
79
+ ## 2. STORAGE MODEL *(survival matrix — principle #4)*
80
+
81
+ Three tiers, each with a different speed / size / inode profile and a **different survival behavior**:
82
+
83
+ | Tier | Path | Speed | Size | Inode cap | Scope |
84
+ |---|---|---|---|---|---|
85
+ | System disk | `/` | medium | ~30 GB | none | per-instance |
86
+ | Data disk | `/root/autodl-tmp` | **fast NVMe** | per-plan (e.g. ~50 GB) | none | per-instance |
87
+ | Shared FS | `/root/autodl-fs` | NFS (slow, ~30 s/sync) | ~200 GB | **~200K (hard)** | **region-locked**, all instances in one region |
88
+
89
+ **Survival matrix** — the part most platforms get wrong, and where AutoDL is the **exception**:
90
+
91
+ | Tier | Survives 关机 (stop)? | Survives 释放 (release/destroy)? | Notes |
92
+ |---|---|---|---|
93
+ | `/` system | **yes** | no | AutoDL persists `/root` across power-off — UNLIKE RunPod/vast/K8s/Colab |
94
+ | `/root/autodl-tmp` data | **yes** | no | fast tier; checkpoints written here mid-run |
95
+ | `/root/autodl-fs` shared | **yes** | **yes** | the ONLY tier that survives release; region-locked |
96
+
97
+ **Where checkpoints MUST go for the §5 teardown verb:** write live checkpoints to the fast data disk
98
+ (`/root/autodl-tmp/checkpoints/<name>`, never the 30 GB system disk), then **checked-sync `best.pth`
99
+ to `/root/autodl-fs`** — the only tier that survives a 释放. If only ever using 关机, the data disk also
100
+ survives, but syncing the durable copy to FS is the safe default (a later release loses the data disk).
101
+
102
+ **Region/DC-lock (AD3).** FS quota is region-scoped; each region has its own physical mount. Files written
103
+ from a `<region-a>` instance are invisible to a `<region-b>` instance even at the identical
104
+ `/root/autodl-fs/` path. Create the FS quota in the **same region** as the instances; to bridge regions,
105
+ pick one region as primary and scp between them (slow). Confirm sharing with a write-from-one /
105
+ read-from-another probe before relying on it.
107
+
108
+ **Inode discipline (AD4).** The ~200K cap is **independent of bytes**: `df -h` can read 34% while `cp`
109
+ fails "No space left" because `df -i` is at 100%. The inode bomb is **per-sample eval visualization**
110
+ (`files_per_sample × N_samples × N_conditions` → tens of thousands of tiny files); checkpoints (few large
111
+ files) are inode-cheap. Monitor `df -i`, not just `df -h` (Phase 0 + every space check). Eval-artifact
112
+ sizing policy is owned by **REQUIRED:** verifying-dl-experiments.
113
+
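The inode arithmetic above (`files_per_sample × N_samples × N_conditions`) can be checked before an eval run writes anything. A minimal Python sketch, assuming a POSIX host where `os.statvfs` reports inode counts; the function names and the 10,000-inode reserve are illustrative choices, not values from the skill.

```python
import os


def inode_usage(path: str) -> float:
    """Fraction of inodes in use on the filesystem holding `path` (what `df -i` shows)."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems do not report inode totals
        return 0.0
    return 1.0 - st.f_ffree / st.f_files


def eval_files_needed(files_per_sample: int, n_samples: int, n_conditions: int) -> int:
    """Number of files a per-sample eval visualization would create."""
    return files_per_sample * n_samples * n_conditions


def fits_inode_budget(needed: int, free_inodes: int, reserve: int = 10_000) -> bool:
    """True when `needed` new files still leave `reserve` inodes free (reserve is an assumption)."""
    return needed + reserve <= free_inodes
```

For example, 4 files per sample over 2,000 samples and 6 conditions is 48,000 files, roughly a quarter of a ~200K cap in one eval.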
114
+ **Data-disk hog (AD5).** When `/root/autodl-tmp` hits 100% but `runs/` looks small, the real hog is the
115
+ **HF cache symlinked onto the data disk** (`~/.cache/huggingface` → tens of GB of model blobs). Audit
116
+ `du -sh ~/.cache/huggingface/hub/models--* | sort -rh` before deleting checkpoints; redirect `HF_HOME` to
117
+ the data disk explicitly (see §8). Disk is expandable — prefer expand over silently shrinking the
118
+ experiment (principle #9). Get explicit user confirmation naming `rm -rf` targets (the harness classifier
119
+ blocks agent-inferred irreversible deletes).
120
+
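The `du … | sort -rh` audit above can also be done from Python when the result needs to be shown to the user as exact targets and sizes. A minimal sketch, assuming the standard `hub/models--*` cache layout named in the command; the helper names are hypothetical. It sums apparent file sizes and skips symlinks so snapshot links are not counted twice, which means the numbers can differ slightly from `du`.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):  # os.walk does not follow symlinked dirs
        for name in files:
            fp = os.path.join(dirpath, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def largest_models(hub_dir: Path) -> list[tuple[str, int]]:
    """Return (cache dir name, bytes) for each models--* entry, largest first."""
    sizes = [(p.name, dir_size_bytes(p)) for p in hub_dir.glob("models--*") if p.is_dir()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

A call such as `largest_models(Path.home() / ".cache/huggingface/hub")` only reports; any deletion still needs the explicit user confirmation described above.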
121
+ ---
122
+
123
+ ## 3. NETWORK
124
+
125
+ **Egress proxy — `source /etc/network_turbo` is MANDATORY (AD1).** Instances start with no proxy; direct
126
+ egress to `api.wandb.ai` / `huggingface.co` / `github.com` / `pypi.org` is unreliable (0.5 s … 300 s …
127
+ blocked). Every shell that calls wandb / HF / pip / git must `source /etc/network_turbo` first
128
+ (`source /etc/network_turbo 2>/dev/null || true` at the top of every wrapper). It exports
129
+ `http_proxy` / `https_proxy` pointing at the in-DC academic proxy (`http://<proxy-ip>:<port>`), a
130
+ `no_proxy` allow-list for domestic endpoints, and the CA bundle. Perf delta: wandb push ~0.8 s with turbo
131
+ vs >120 s timeout without — no exceptions, even a small `wandb.summary` write can wedge for minutes.
132
+
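A Python launcher started outside a wrapper shell does not inherit what `source /etc/network_turbo` exports. One way to bridge that, sketched under assumptions: run the `source` in a `bash` subprocess, dump its environment, and copy only the proxy variables into the current process. The helper names are hypothetical, `bash` and GNU `env -0` are assumed to be present, and the CA-bundle variable is left out because its name is not given here.

```python
import os
import subprocess

PROXY_KEYS = {"http_proxy", "https_proxy", "no_proxy"}


def env_after_sourcing(script: str = "/etc/network_turbo") -> dict[str, str]:
    """Return the environment a bash shell has after sourcing `script` (missing file tolerated)."""
    cmd = 'source "$1" 2>/dev/null || true; env -0'
    out = subprocess.run(
        ["bash", "-c", cmd, "_", script], check=True, capture_output=True
    ).stdout
    env = {}
    for entry in out.split(b"\0"):  # NUL-separated, so values may contain newlines
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def apply_proxy_env(script: str = "/etc/network_turbo") -> dict[str, str]:
    """Copy proxy variables from the sourced shell into os.environ; return what changed."""
    sourced = env_after_sourcing(script)
    changed = {
        k: v for k, v in sourced.items()
        if k.lower() in PROXY_KEYS and os.environ.get(k) != v
    }
    os.environ.update(changed)
    return changed
```

Calling `apply_proxy_env()` before importing or initializing wandb / HF clients mirrors the "top of every wrapper" rule for shell scripts.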
133
+ **China mirrors (AD2).** HF behind the GFW → `HF_ENDPOINT=https://hf-mirror.com` or pull from
134
+ **ModelScope**. Two compounding traps: (a) HF's **Xet CAS backend** is NOT mirror-proxied (the mirror
135
+ covers the API but big `.safetensors` shards still hit the flaky international endpoint) →
136
+ `export HF_HUB_DISABLE_XET=1` (or `pip uninstall -y hf_xet`) to force the classic LFS path the mirror does
137
+ proxy; (b) `no_proxy` in network_turbo lists `modelscope.com` but **not** `modelscope.cn` — routing a
138
+ DOMESTIC source through the international-acceleration proxy SLOWS it. Wrap every download in a
139
+ `timeout <s> … && break` retry loop (resumes partial files; a stall ≠ permanent failure). Full mirror
140
+ table + `no_proxy` ladder → `references/china-network.md`.
141
+
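The `timeout <s> … && break` retry loop can be expressed in Python when the download is driven from a script. A minimal sketch, assuming the wrapped command resumes partial files on its own; `run_with_retries` and its defaults are hypothetical. Note that `subprocess.run` kills only the direct child on timeout, not grandchildren.

```python
import subprocess
import time


def run_with_retries(cmd: list[str], timeout_s: float,
                     attempts: int = 5, pause_s: float = 1.0) -> bool:
    """Run `cmd` until it exits 0; treat a stall longer than `timeout_s` as a retryable failure."""
    for attempt in range(1, attempts + 1):
        try:
            if subprocess.run(cmd, timeout=timeout_s).returncode == 0:
                return True  # the shell loop's `&& break`
        except subprocess.TimeoutExpired:
            pass  # stalled: subprocess.run has already killed the child
        if attempt < attempts:
            time.sleep(pause_s)
    return False
```

A call might look like `run_with_retries(["huggingface-cli", "download", "<repo-id>"], timeout_s=600)` with `HF_ENDPOINT` and `HF_HUB_DISABLE_XET=1` already exported; the repo id and timeout are placeholders.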
142
+ **Port exposure.** AutoDL maps a single custom port (6006) for user services; the platform also exposes
143
+ JupyterLab. SSH port is the per-instance `<PORT>` and changes on re-create.
144
+
145
+ **Platform TensorBoard is pinned to `/root/tf-logs` (AD7).** The image autostarts
146
+ `tensorboard --logdir /root/tf-logs --port 6007` on boot and the AutoPanel TB tile proxies straight to that
147
+ pid — the `--logdir` is hard-pinned and cannot be reconfigured from inside the container. Events written
148
+ anywhere else are invisible in the web tile no matter how correct the `SummaryWriter` setup. Fix: write to
149
+ `SummaryWriter(log_dir="/root/tf-logs/<run>")`, or `ln -sfn <your-tb> /root/tf-logs/<run>` (the pinned TB
150
+ reloads every 5 s (TensorBoard's `--reload_interval`), so the run appears within ~5 s with no restart). Verify with
151
+ `curl -s http://127.0.0.1:6007/data/runs` (expect a JSON array with the run), NOT `ss` (can show nothing
152
+ inside the container while curl returns 200). Local logs die with the instance — for durable curves use a
153
+ hosted tracker (**REQUIRED:** huggingface-skills:huggingface-trackio).
154
+
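The symlink fix and the `/data/runs` check can be scripted together. A minimal Python sketch, assuming the pinned logdir `/root/tf-logs` and port 6007 stated above; `link_run` and `tensorboard_runs` are hypothetical helper names. The link is swapped in with a rename, so an existing symlink is replaced atomically; an existing real directory at that path is left alone and raises an error instead.

```python
import json
import os
import urllib.request


def link_run(tb_dir: str, run: str, logs_root: str = "/root/tf-logs") -> str:
    """Point <logs_root>/<run> at `tb_dir`, replacing an existing symlink (like `ln -sfn`)."""
    os.makedirs(logs_root, exist_ok=True)
    link = os.path.join(logs_root, run)
    tmp = link + ".tmp-link"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(os.path.abspath(tb_dir), tmp)
    os.replace(tmp, link)  # atomic rename; fails if `link` is a real directory
    return link


def tensorboard_runs(base_url: str = "http://127.0.0.1:6007") -> list:
    """Return the run names the pinned TensorBoard currently serves."""
    with urllib.request.urlopen(f"{base_url}/data/runs", timeout=5) as resp:
        return json.load(resp)
```

After `link_run("<your-tb>", "<run>")`, polling `tensorboard_runs()` until the run name appears is the scripted form of the `curl` check.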
155
+ **SSH flavor.** Direct-TCP SSH on the per-instance host:port — `scp`/`rsync` work normally (no proxied-SSH
156
+ restriction). Use a per-dir resumable loop for large transfers (single-connection `scp -r` resets
156
+ mid-transfer); `rsync -avz --partial` is preferred. Transport patterns → `references/ssh_transport.md`.
158
+
159
+ ---
160
+
161
+ ## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
162
+
163
+ **No spot/bid/preemption model — AutoDL is on-demand.** There is no mid-run eviction, no SIGTERM grace
164
+ window to handle (`spot_grace: n/a`). The real loss vectors are: (a) **forgot to release/关机** → idle
165
+ billing (principle #1); (b) an instance **reboot** that ends a non-detached process (a vanished process is
166
+ not always OOM — enumerate reboot / OOM / SSH-HUP / manual-kill before concluding, see
167
+ `references/gotchas_universal.md`); (c) availability — the GPU plan being sold out at create-time (build
168
+ retry-until-available, not survive-an-eviction).
169
+
170
+ **Resume hook.** The universal spine still applies (principle #8): checkpoint atomically to the data disk +
171
+ sync `best.pth` to FS, and resume-from-latest unconditionally on relaunch. The detach primitive (§6) makes
172
+ the *identical launch command* survive an SSH drop; checkpoint+resume makes it survive a reboot. Cadence
173
+ formula → `references/spot-resilience.md` (the formula generalizes even without spot — it bounds
174
+ re-compute lost to a reboot).
175
+
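"Checkpoint atomically" and "resume-from-latest" can be illustrated without any framework. A minimal Python sketch that writes bytes through a temp file plus `os.replace`, so a reboot mid-write never leaves a truncated checkpoint under the final name; the helper names are hypothetical, and a real trainer would serialize with its own save call into the temp path.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def atomic_write(path: Path, payload: bytes) -> None:
    """Write `payload` to `path` so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # force the data to disk before the rename
        os.replace(tmp, path)      # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def latest_checkpoint(ckpt_dir: Path, pattern: str = "*.pth") -> Optional[Path]:
    """Return the most recently modified checkpoint, or None on a fresh start."""
    candidates = [p for p in ckpt_dir.glob(pattern) if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

On relaunch, the same command calls `latest_checkpoint(...)` first and starts from scratch only when it returns `None`.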
176
+ ---
177
+
178
+ ## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*
179
+
180
+ **关机 (shutdown / power-off) STOPS the meter AND keeps `/root` + both disks — this is the AutoDL
181
+ EXCEPTION among rentals.** Everywhere else (RunPod wipes the container disk on stop, vast bills the disk
182
+ forever, K8s wipes the pod FS, Colab loses `/content`) a "stop" is lossy or still-billing. On AutoDL,
183
+ 关机 is the **safe park**: meter off, all three tiers intact, restart later. There is also a **no-GPU mode
184
+ (无卡模式)** for a cheap restart to copy files or fix the env without paying for the GPU.
185
+
186
+ | Action | Stops meter? | Keeps `/` + data disk? | Keeps FS? | Reversible? |
187
+ |---|---|---|---|---|
188
+ | 关机 (shutdown) | **yes** | **yes** | yes | **yes** — restart anytime (the AutoDL exception) |
189
+ | 无卡模式 (no-GPU) | mostly (cheap) | yes | yes | yes |
190
+ | 释放 (release/destroy) | yes | **NO** | yes | **NO — deletes `/` + data disk irreversibly** |
191
+
192
+ **Cost trap.** 关机 still bills the data-disk *storage* at a small rate while the GPU meter is off — far
193
+ cheaper than running, but not free. Only 释放 fully ends storage billing, at the cost of the data disk.
194
+ **⚠️ Auto-release clock (AD-DANGER):** a 关机 (stopped) instance is **auto-released after 15 days** (the
195
+ console shows "关机 15 天后释放", i.e. "released 15 days after shutdown") → that release deletes `/` **and the data disk**, so 关机 is safe parking
196
+ only *within* the window; for a longer pause, sync `best` to `/root/autodl-fs` (survives 释放) or expect to
197
+ re-download. Low balance / arrears also force-stop the instance. **Surface this to the user up front
198
+ (principle #10)** — most users assume 关机 parks the box indefinitely.
199
+ **Teardown Iron Law (SKILL.md Phase 5):** no 释放 / file-delete until `best.pth` is **pulled to local AND
200
+ verified by load** (`scripts/verify_local.py`) AND the user explicitly approves — "it looked done in the
201
+ log" is not evidence (principle #3). Because 关机 is non-destructive here, the cheap safe move when unsure
202
+ is to **关机 and ask**, never 释放 on a guess. **REQUIRED:** superpowers:verification-before-completion is
203
+ the general form of this gate.
204
+
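The two rules in this section reduce to a date calculation and a three-way conjunction. A minimal Python sketch, assuming the 15-day auto-release window stated above; the function names are hypothetical, and the three booleans must come from real evidence (a pulled file, a successful load, an explicit user reply), not from log output.

```python
from datetime import datetime, timedelta

AUTO_RELEASE_AFTER = timedelta(days=15)  # window stated in the console, per this section


def auto_release_deadline(stopped_at: datetime) -> datetime:
    """Latest moment a stopped (关机) instance still has its system and data disks."""
    return stopped_at + AUTO_RELEASE_AFTER


def may_release(pulled_to_local: bool, verified_by_load: bool, user_approved: bool) -> bool:
    """Teardown gate: release (释放) only when all three conditions hold."""
    return pulled_to_local and verified_by_load and user_approved
```

When `may_release(...)` is false, the fallback is the one this section already names: stop the instance and ask.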
205
+ ---
206
+
207
+ ## 6. DAEMON TOOL
208
+
209
+ **tmux** is the detach primitive when present, but **tmux is often NOT installed on a fresh AutoDL image**
210
+ and `apt-get install tmux` fails when egress is down. Zero-dependency fallback:
211
+ `nohup bash run_queue.sh queue.txt </dev/null >master.log 2>&1 &` — survives an SSH drop (SIGHUP), needs
212
+ no package. Verify either with `pgrep -af <script>`. The detach survives an SSH drop; it does **not**
213
+ survive a 关机/reboot — that is what checkpoint+resume (§4) is for.
214
+
215
+ **Native queue: none.** AutoDL has no built-in scheduler → use the bundled `scripts/run_queue.sh.template`
216
+ (resumable queue iterator, `start_index` for resume) driving `scripts/run_one.sh.template` per cell.
217
+ **Never overwrite a script that a running bash is still executing** (bash reads the file by byte offset, so an in-place edit makes it re-execute
218
+ blocks; version the filename instead). This holds on every platform; see `references/gotchas_universal.md`.
219
+
220
+ **Monitoring.** A session-bound watcher dies with the session; for multi-hour runs deploy the four-layer
221
+ durable architecture (`references/monitoring_patterns.md`). Detect "done" by a **log marker**
222
+ (`grep -q 'QUEUE DONE' master.log`), never by `pgrep` (the waiter's own cmdline matches the pattern and
223
+ loops forever). A cloud scheduler cannot reach the rented box (it has no SSH key, and placing one in a cloud sandbox would leak a secret);
224
+ the honest recurring check is the remote self-monitor plus a session loop that uses the local key.
225
+
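Detecting "done" by log marker rather than by process name can be sketched as a small polling loop. This Python version assumes the `QUEUE DONE` marker and `master.log` name used above; `wait_for_marker` and its defaults are hypothetical. It re-reads the whole log on each poll, which is fine for modest logs but worth replacing with a tail-style read for very large ones.

```python
import time
from pathlib import Path
from typing import Optional


def wait_for_marker(log_path: str, marker: str = "QUEUE DONE",
                    poll_s: float = 30.0, timeout_s: Optional[float] = None) -> bool:
    """Block until `marker` appears in the log; return False if `timeout_s` elapses first."""
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    log = Path(log_path)
    while True:
        if log.exists() and marker in log.read_text(errors="replace"):
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Because the check reads a file instead of matching command lines, the waiter cannot match itself.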
226
+ ---
227
+
228
+ ## 7. TOP GOTCHAS (AutoDL-pinned; universal ones → `references/gotchas_universal.md`)
229
+
230
+ **AD1 — external network call hangs / wandb shows 0 runs.** *Symptom:* `wandb.init` times out at
231
+ 90/120/180 s, dashboard reads 0 runs while `wandb/run-*` exist locally; HF downloads stall; pip/git glacial.
232
+ *Root cause:* instances start with **no proxy**; direct egress to wandb/HF/PyPI/GitHub is unreliable or
233
+ blocked, and wandb-core's retry logic under a flaky link can roll back already-uploaded runs. *Fix:*
234
+ `source /etc/network_turbo` at the top of **every** shell/wrapper before any external call; recover an
235
+ empty cloud project with `for d in wandb/run-*; do timeout 120 wandb sync "$d"; done`.
236
+
237
+ **AD2 — HF download stalls even with hf-mirror + turbo.** *Symptom:* `from_pretrained` /
238
+ `snapshot_download` hangs or `ConnectTimeout` on big `.safetensors` shards. *Root cause:* (a) HF's Xet CAS
239
+ backend is not mirror-proxied; (b) `no_proxy` lists `modelscope.com` not `modelscope.cn` (domestic source
240
+ forced through international proxy = slower); (c) a curl test run without turbo measures the wrong path.
241
+ *Fix:* `export HF_HUB_DISABLE_XET=1` (or `pip uninstall -y hf_xet`) with `HF_ENDPOINT=https://hf-mirror.com`,
242
+ or pull from ModelScope to a plain dir + load via local-path override; wrap in a `timeout … && break`
243
+ resume loop. Detail → `references/china-network.md`.
244
+
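For root cause (b), one possible mitigation is to extend `no_proxy` so the domestic host bypasses the proxy. This is an assumption layered on the text above, which only describes the gap; the authoritative `no_proxy` ladder is in `references/china-network.md`. A minimal Python sketch with a hypothetical helper name; whether a given client honors a bare domain or needs a leading dot varies by tool.

```python
import os


def extend_no_proxy(current: str, extra: list[str]) -> str:
    """Append hosts to a comma-separated no_proxy value without duplicating entries."""
    hosts = [h.strip() for h in current.split(",") if h.strip()]
    for host in extra:
        if host not in hosts:
            hosts.append(host)
    return ",".join(hosts)


# Assumed usage: run after the proxy variables are already in the environment.
os.environ["no_proxy"] = extend_no_proxy(os.environ.get("no_proxy", ""), ["modelscope.cn"])
```

Re-measure a ModelScope download before and after the change; if it is not faster, revert rather than keep an unverified tweak.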
245
+ **AD3 — cross-region instances cannot share FS.** *Symptom:* two instances in different regions see
246
+ identical `/root/autodl-fs/` paths but files written from one are invisible to the other. *Root cause:* FS
247
+ quota is region-scoped; each region has its own physical mount. *Fix:* create the FS quota in the same
248
+ region as the instances; bridge regions via scp from a chosen primary; verify with a write-one / read-other
249
+ probe.
250
+
251
+ **AD4 — FS write fails "No space left" while `df -h` looks fine.** *Symptom:* `cp`/`mkdir` to
252
+ `/root/autodl-fs` fails though `df -h` shows ~34%; `df -i` shows `… 0 100%`. *Root cause:* the shared FS
253
+ enforces a **hard ~200K inode cap independent of bytes**; per-sample eval visualization (many tiny files)
254
+ exhausts it. *Fix:* monitor `df -i`; cap per-sample eval vis on large test sets (sizing →
255
+ verifying-dl-experiments); once a results dir is verified locally, prune its per-sample image subdir from FS; recover by
256
+ `find /root/autodl-fs -type d -name '<vis-dir>' -exec rm -rf {} +` to free inodes fast.
257
+
258
+ **AD5 — data disk full; HF cache is the hidden hog; agent `rm` auto-denied.** *Symptom:*
259
+ `/root/autodl-tmp` at 100% though `runs/` looks small; an agent `rm -rf` of "obvious junk" is auto-denied.
260
+ *Root cause:* `~/.cache/huggingface` is symlinked onto the data disk, so the **HF model cache** (tens of
261
+ GB) is the real hog; the harness blocks irreversible `rm -rf` whose targets the agent inferred. *Fix:*
262
+ audit `du -sh ~/.cache/huggingface/hub/models--* | sort -rh`; set `HF_HOME` to a chosen data-disk dir + keep
263
+ the metric/eval JSONs (tiny evidence); present exact deletion targets + sizes for explicit user
264
+ confirmation; offer "clean vs expand the disk".
265
+
266
+ **AD6 — base IS the env; a "never use base" rule blocks every remote command.** *Symptom:* a local "don't
267
+ run DL in conda base" guard fires on `ssh autodl 'python train.py'`, but `conda env list` shows nothing and
268
+ `/root/miniconda3/envs/` is empty; poll scripts calling `python3` exit 127. *Root cause:* the image installs
269
+ the whole DL stack into **base** — base IS the single-tenant project env (no `/envs/`), and the image often
270
+ ships only `python` (no `python3`). *Fix:* train with `/root/miniconda3/bin/python`; exempt remote-ssh +
271
+ instance base from the local guard (never `conda create --clone base`); in remote scripts use the explicit
272
+ interpreter or pure shell, never bare `python3`.
273
+
274
+ **AD7 — platform TensorBoard pinned to `/root/tf-logs`; events elsewhere invisible.** *Symptom:* the
275
+ events file is non-empty and `curl http://127.0.0.1:6007/` returns 200, but the AutoPanel TB tile shows
276
+ zero runs; `/data/runs` returns `[]`. *Root cause:* the image autostarts `tensorboard --logdir
277
+ /root/tf-logs` and the tile proxies that pid; `--logdir` is hard-pinned and not reconfigurable in-container.
278
+ *Fix:* write `SummaryWriter(log_dir="/root/tf-logs/<run>")`, or `ln -sfn <your-tb> /root/tf-logs/<run>`
279
+ (the pinned TB reloads every 5 s via `--reload_interval`, so it picks the run up in ~5 s); verify with `curl … /data/runs`, not `ss`. (Also:
280
+ restart the TB server to evict STALE cached tags after deleting/renaming runs.) The cross-platform "live panel silently empty" class (path/port/process mismatch on any platform) is the general form → `references/gotchas_universal.md` U39.
281
+
282
+ **AD8 — wandb val-phase CPU memory spike to 30+ GB at epoch 1 end.** *Symptom:* at the end of epoch 1
283
+ (validation), cgroup memory jumps from ~8 GB to 30+ GB, sometimes wedging the instance. *Root cause:*
284
+ project trainers log per-sample distributions at `step==1` (e.g. LPIPS/VGG over ~2000 samples on CPU =
285
+ ~30 GB activations). *Fix:* cap the val-time sample accumulator — `-o training.val_metric_sample_cap=256`
286
+ (project-specific knob; check the trainer for the equivalent). Distinct from a DataLoader-worker cgroup OOM
287
+ (universal gotcha).
288
+
289
+ **AD9 — project torch pin would DOWNGRADE the image's working build.** *Symptom:* the image ships e.g. a
290
+ new-arch-capable torch (sm_120); the project pins `torch<2.9`; a naive `pip install -r requirements.txt`
291
+ replaces it with a wheel lacking the arch's kernels → `no kernel image is available` at first forward.
292
+ *Root cause:* the image torch/CUDA build is matched to the rented GPU arch; the project pin is stale for it.
293
+ *Fix:* filter framework pins out of the remote install —
294
+ `grep -ivE '^(torch|torchvision|torchaudio)([^a-z0-9._-]|$)' requirements.txt > /root/req_remote.txt && pip install -r
295
+ /root/req_remote.txt` — keep the image build; smoke `torch.cuda.get_device_capability()` + a heavy import
296
+ before launch; disclose the off-band torch version with results.
297
+
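Filtering framework pins is easy to get subtly wrong: a prefix match on `torch` would also drop unrelated packages such as `torchmetrics`. A minimal Python sketch that compares the normalized package name exactly; the function name is hypothetical, and it handles only plain requirement lines, leaving comments and option lines such as `-r other.txt` untouched.

```python
import re

FRAMEWORK = {"torch", "torchvision", "torchaudio"}
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def drop_framework_pins(lines: list[str]) -> list[str]:
    """Return requirement lines with torch / torchvision / torchaudio entries removed."""
    kept = []
    for line in lines:
        match = _NAME.match(line)
        # Normalize runs of -, _ and . to a single hyphen and lowercase the name
        name = re.sub(r"[-_.]+", "-", match.group(1)).lower() if match else None
        if name in FRAMEWORK:
            continue  # keep the image's torch build
        kept.append(line)
    return kept
```

Write the result to a separate file (for example the `/root/req_remote.txt` used above) and install from that, leaving the project's `requirements.txt` unchanged.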
298
+ ---
299
+
300
+ ## 8. SCRIPT OVERRIDES
301
+
302
+ The exact values to parameterize the `scripts/` templates (`scripts/run_one.sh.template`,
303
+ `scripts/run_queue.sh.template`) for AutoDL:
304
+
305
+ ```sh
306
+ DATA_DIR=/root/autodl-tmp # fast NVMe data disk — live checkpoints, logs, HF cache
307
+ DURABLE_DIR=/root/autodl-fs # region-locked shared FS — the only tier surviving 释放
308
+ PROXY_HOOK='source /etc/network_turbo 2>/dev/null || true' # MANDATORY before any external call (AD1)
309
+ CRED_FILE=/root/.wandb_key # per-instance ONLY — the FS security classifier blocks wandb keys
310
+ SCRATCH='latest.pth' # prune on success; keep best.pth (the keepable artifact)
311
+ HF_HOME=/root/autodl-tmp/huggingface_cache # redirect off the symlinked ~/.cache hog (AD5)
312
+ HF_ENDPOINT=https://hf-mirror.com # + HF_HUB_DISABLE_XET=1 (AD2)
313
+ DETACH=tmux # nohup fallback when tmux is absent (§6)
314
+ PY=/root/miniconda3/bin/python # base IS the env — explicit interpreter, never bare python3 (AD6)
315
+ TB_LOGDIR=/root/tf-logs # platform TB is pinned here (AD7)
316
+ ```
317
+
318
+ **Credential push (AD-specific).** The FS security classifier blocks files matching wandb-key patterns —
319
+ put the key at the **per-instance** `/root/.wandb_key`, never on `/root/autodl-fs`. Stream exactly one
320
+ credential block via stdin so the secret never appears in a command; the wrapper reads it
321
+ into `WANDB_API_KEY` before launch. Secrets-via-stdin pattern → `references/ssh_transport.md`.
322
+
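The secrets-via-stdin idea can be sketched in Python: the key travels on the child's stdin, so it never appears in an argument list or shell history. This is an illustration only; the pattern the skill actually uses is in `references/ssh_transport.md`, and `push_secret` plus the example alias are hypothetical.

```python
import subprocess


def push_secret(remote_cmd: list[str], secret: str) -> None:
    """Feed `secret` to `remote_cmd` on stdin; the command line itself stays secret-free."""
    subprocess.run(remote_cmd, input=secret.encode(), check=True)


# Assumed usage (alias and key source are placeholders):
# push_secret(["ssh", "autodl-demo-1", "umask 077 && cat > /root/.wandb_key"], key)
```

The `umask 077` in the remote command makes the key file readable by its owner only, and the target stays on the per-instance disk rather than the shared FS.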
323
+ **Checked-sync (the gated success line).** `run_one.sh` writes live checkpoints to
324
+ `$DATA_DIR/checkpoints/<name>`, prunes `latest.pth` on success, then syncs `best.pth` to
325
+ `$DURABLE_DIR/final_ckpts/<name>` **gating the success echo on the actual copy result** — an unconditional
326
+ "synced" lies when the FS inode cap (AD4) silently fails the `mkdir`/`cp` (universal silent-sync gotcha).
327
+ Until a download is verified locally, the **data disk** copy is source-of-truth.
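
Gating the success line on the actual copy result can be sketched as follows. This Python version is not the logic of `run_one.sh` itself; it assumes a single checkpoint file, and `checked_sync` is a hypothetical name. It reports success only when the copy completed and the destination hashes the same as the source. A read-back over NFS may be served from cache, so this is a sanity check rather than proof of durability.

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checked_sync(src: Path, dst_dir: Path) -> bool:
    """Copy `src` into `dst_dir`; return True only if the copy exists and matches."""
    try:
        dst_dir.mkdir(parents=True, exist_ok=True)
        dst = dst_dir / src.name
        shutil.copy2(src, dst)
        return sha256(dst) == sha256(src)
    except OSError:
        return False  # e.g. "No space left" when the inode cap is exhausted


# Print "synced" only on a True result; otherwise the data-disk copy stays source-of-truth.
```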