npm - opencode-skills-collection - Versions diffs - 3.1.2 → 3.1.3 - Mend

opencode-skills-collection 3.1.2 → 3.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (65) hide show

package/bundled-skills/remote-gpu-trainer/profiles/vastai.md ADDED Viewed

@@ -0,0 +1,355 @@
+---
+platform: vastai
+kind: ssh-rental
+meter_stop_verb: destroy        # the action that STOPS billing; stop keeps billing disk forever (compute off, storage on)
+meter_stop_irreversible: true   # destroy permanently deletes container disk
+detach_primitive: tmux          # auto-attached on login; dies on container restart → onstart.sh is the durable hook
+spot_available: true            # interruptible (bid) auction — central to the platform
+spot_grace: ~0s                 # preemption is an abrupt pause, no documented notice / no SIGTERM
+shared_fs: false               # NO platform-wide FS; Volumes are machine-locked (per-GPU bound on restart)
+inode_cap: host-dependent       # undocumented; whatever the host's Docker storage driver gives
+free_egress: host-dependent     # CORRECTED: host-set bandwidth price; billed per byte in AND out, often $0 but not guaranteed
+china_mirror_needed: false      # no China DCs and no platform proxy; fix HF at workload level
+host_driver_cuda_max: image-dependent  # CUDA ships in the chosen Docker image; must be ≤ host driver
+local_nvme: host-dependent
+---
+# vast.ai — platform profile
+One-line purpose: rent a marketplace GPU as a **Docker image on a third-party host**, run a spot-resumable
+job, and **copy results off before `destroy`** — the only verb that stops the full meter.
+> **Surface to the user up front (principle #10):** ⚠️ Danger clocks — a **`stop`ped instance bills its disk FOREVER** (only `destroy` stops the full meter, and `destroy` deletes everything); **bandwidth/egress bills continuously**, host-priced. Risk — rent only **verified, high-reliability** hosts with a direct port (an unverified host can vanish mid-run); cloud-sync works even while stopped (§5), the cleanest durable target.
+**Table of contents** (`grep -in '^## ' profiles/vastai.md` to jump):
+- §1 LAUNCH — offer-driven, Docker-image-is-the-env
+- §2 STORAGE MODEL — per-machine-local disk; survival matrix; cloud-sync escape hatch
+- §3 NETWORK — proxy vs direct SSH; random ports; host-set bandwidth; no China proxy
+- §4 SPOT / INTERRUPTION + RESUME — bid auction, ~0 s pause, GPU-bound resume, status-poll loop
+- §5 TEARDOWN / BILLING — `destroy` is the meter-stop; `stop` bills disk forever; bandwidth bills always
+- §6 DAEMON TOOL — tmux dies on restart; `onstart.sh` is the durable relaunch
+- §7 TOP GOTCHAS — VAST1–VAST13, platform-pinned + Platform-specific debugging
+- §8 SCRIPT OVERRIDES — values to parameterize `scripts/`
+Universal gotchas are NOT restated here — see `references/gotchas_universal.md`. Spot cadence math and
+atomic-resume live in `references/spot-resilience.md`.
+**The one fact that reshapes everything:** vast.ai is a **decentralized marketplace of third-party hosts**,
+not a uniform first-party cloud. Consequences that diverge from AutoDL: **no platform-wide shared FS**, **no
+China-mirror proxy**, **no single prebuilt conda env** (the Docker image IS the env), **storage is locked to
+one physical host and even one GPU ID**, **bandwidth is host-priced (not free by fiat)**, and
+**interruptible (bid) preemption is a real, central, abrupt model**.
+---
+## 1. LAUNCH
+**Entry points** (all equivalent): web console (`cloud.vast.ai`), the `vastai` CLI / Python SDK, the REST
+API (`https://console.vast.ai/api/v1/...`, Bearer token), and SSH into the running container. The CLI is the
+orchestration surface: `pip install vastai`, then `vastai set api-key $VAST_API_KEY` (env-var name only —
+never inline the key).
+**Env contract — the Docker image IS the env.** A bare VM is not offered by default; the create call MUST
+specify `--image` (e.g. `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`). **CUDA version is whatever the
+image ships** — a mismatch with the host driver is a real failure mode (VAST5). The image's default Python
+env is the low-friction place to run — do not `conda create` on a rental (the remote-base exception holds).
+Note: **Docker-in-Docker is not supported** "due to security constraints" (verified
+docs.vast.ai/.../faq/instances 2026-06) — a containerized inner runtime is not an option here.
+**Launch is offer-driven and two-step** (search a marketplace offer → create onto it):
+```bash
+#!/usr/bin/env bash
+set -u
+# 1) find a verified, rentable offer with at least one direct port, cheapest $/dlperf first
+vastai search offers 'gpu_name=RTX_4090 num_gpus=1 verified=true rentable=true direct_port_count>=1' -o 'dlperf_usd-'
+# 2) create onto the chosen OFFER_ID; --direct enables direct-TCP SSH (see §3)
+vastai create instance OFFER_ID --image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime \
+  --disk 50 --ssh --direct --onstart-cmd 'nvidia-smi && bash /workspace/onstart.sh'
+```
+`--onstart-cmd` (**max 16 KB**; for a longer script, gzip+base64-encode it) is written to `/root/onstart.sh`
+and **re-runs on every container start** — this is the platform-native boot hook and the durable relaunch
+path (§6) (verified docs.vast.ai/cli/commands 2026-06). Filter offers hard: an unverified, low-reliability
+host can simply vanish (`Offline`) mid-run (VAST7). Boot is not instant: the host must **pull the Docker
+image and boot — typically 1–5 min depending on image size** (verified docs.vast.ai CLI Hello World 2026-06);
+a fat image stuck in `Loading` is the slow-download symptom (VAST13).
+→ **verify:** `vastai show instance OFFER_ID` lists the new instance `running`, and an in-container
+`nvidia-smi` (via `--onstart-cmd` or first SSH) shows the expected GPU with a CUDA that matches the image.
+---
+## 2. STORAGE MODEL  *(survival matrix — principle #4)*
+Three tiers; the persistence + region story is the single biggest divergence from AutoDL — **there is no
+region-wide shared FS** to sync to. (verified docs.vast.ai/.../storage/types 2026-06)
+| Tier | Path | Speed | Survives STOP? | Survives DESTROY? | Cap |
+|---|---|---|---|---|---|
+| Container / instance disk (`--disk N`) | `/` + `/workspace` | local | **yes** (bills) | **NO — gone** | fixed at create, **non-resizable**, min **10 GB** (default) |
+| Volume (local) | mounted path | local | yes | **yes, until volume deleted** (bills per-GB while it exists) | fixed; **machine-locked**, non-resizable |
+| Cloud sync (S3 / GDrive / Backblaze / Dropbox) | off-box bucket | network | yes | **yes — fully off-box** | provider's; **works even while instance is stopped** |
+| Network Volume (cross-machine) | — | — | — | — | **not in current storage docs — treat as unavailable** |
+**Machine-lock — and per-GPU-lock — is the trap.** A Volume "is tied to the physical machine where created"
+and "cannot migrate between different physical machines." Worse, a stopped instance is **bound to a specific
+GPU ID**, not just the machine: "When an instance is created, it is bound to a specific GPU ID. If the
+instance is stopped, it remains bound to the same GPU ID and waits for that GPU to become available again"
+(verified vast-ai.crisp.help scheduling article 2026-06). So a machine can show **available for rent** (other
+GPUs free) while the stopped instance is stuck in `Scheduling` waiting for *its* GPU (VAST3).
+**Where checkpoints MUST go for the §5 verb:** there is **no durable mount that survives `destroy`** on the
+container disk — so the durable target is **off-box**. Two real off-box paths: (a) `vastai copy` the result
+to local / another instance / a Volume **before** `destroy`; (b) **Cloud sync** (`vastai cloud copy`) to
+S3/GDrive/Backblaze/Dropbox — notably **works even while the instance is stopped** (verified
+docs.vast.ai/.../data-movement 2026-06), which makes it the cleanest durable target for a spot job. Always
+assume the instance is lost once its lifetime expires. Inode caps and FS type are **undocumented and
+host-dependent** (whatever the host's Docker storage driver gives) — `df -i` per host, do not assume an
+AutoDL-style platform constant.
+→ **verify:** before any teardown, `vastai copy <id>:/path/to/ckpt local:/path/to/local` exits 0 (or
+`vastai cloud copy` completes) AND the local artifact loads (`scripts/verify_local.py`).
+---
+## 3. NETWORK
+**Shared public IP + random external port.** Each instance shares a host's (usually shared) public IP;
+"each open internal port (such as 22 or 8080 etc) is mapped to a *random* external port" read from the
+**"IP Port Info" pop-up** (button on the instance) or `vastai show instance` — format
+`PUBLIC_IP:33526 -> 8081/tcp` (verified docs.vast.ai/.../connect/networking 2026-06). Ports change per
+instance — discover them at runtime, never hard-code. **Hard cap 64 open ports per instance.**
+**Two SSH flavors — and the scp size trap:**
+- **Proxy SSH** (default, via Vast's proxy): "works on all machines, slower for data transfer." It carries
+  `scp` but is throttled — vast's own guidance is **scp over proxy only for transfers under ~1 GB**; above
+  that "using the direct ssh connection is recommended" (verified docs.vast.ai/.../data-movement 2026-06).
+- **Direct SSH** (direct-TCP to the host): "requires machines with open ports, faster and more reliable, the
+  preferred method." This is the one that carries large `scp`/`rsync`/`vastai copy` without stalling. It
+  **requires the offer to expose open ports** → filter `direct_port_count>=1` and create with `--direct`.
+**Rule:** if bulk transfer must work, require **direct-TCP** at create time. `vastai copy` "uses rsync and is
+generally fast and efficient, subject to single-link upload/download constraints" — for a multi-GB result,
+direct + a resumable loop (`references/gotchas_universal.md` U12). For a big *inbound* dataset, prefer
+`wget`/`curl` from a cloud bucket over proxied SSH (much higher throughput). Custom services use Docker `-p`
+(e.g. `-p 8081:8081`); Jupyter defaults to internal 8080 gated by `JUPYTER_TOKEN` (override the port via
+`JUPYTER_PORT`).
+**Bandwidth is metered and host-priced — NOT free by fiat (corrected).** "You are charged bandwidth prices
+for every byte sent or received to or from the instance, regardless of what state it is in," and "pricing is
+set by the host and is specific to each offer" (verified docs.vast.ai/.../reference/billing +
+.../instances/pricing 2026-06). In practice many hosts price egress at ~$0 (vast is generally a low/zero
+egress option), but a given offer **can** charge per-GB in *both* directions — read the per-offer bandwidth
+rate (hover the price on the instance card / search page) before a transfer-heavy job. This is why the
+frontmatter is `free_egress: host-dependent`, not `true`.
+**China relevance: none at the platform level.** No China datacenters, no `/etc/network_turbo` equivalent, no
+built-in HF mirror. The HF-unreachable problem still exists at the *workload* level from some hosts, but the
+fix is the job's **own** `HF_ENDPOINT=https://hf-mirror.com` / `hf_transfer`, not a platform script — see
+`references/gotchas_universal.md` (HF download) for the resumable-download ladder.
+→ **verify:** `ssh <alias> 'echo ok'` over the **direct** endpoint, then a 1-file `vastai copy` round-trip
+exits 0.
+---
+## 4. SPOT / INTERRUPTION + RESUME  *(principle #7/#8)*
+vast.ai's **interruptible** rentals are a **live continuous-bid auction** — the cheap-GPU core of the
+platform ("can reduce costs by fifty percent or even more"), far more first-class than anything on AutoDL.
+(verified vast.ai/article/Rental-Types 2026-06)
+- **Bidding:** clients set a bid price; "the current highest bid is the instance that runs, the others are
+  paused." **On-demand always beats interruptible** regardless of bid amount ("on-demand instances will
+  always take precedence").
+- **The bid is fixed at create.** "The bidding method cannot be changed after an instance is rented"
+  (verified Rental-Types 2026-06) — so the resume lever is **not** "raise this instance's bid." To recover an
+  out-priced run, either wait for the higher bid to finish, or **re-launch the identical job on a fresh
+  offer** (cheaper/on-demand) — which is why off-box checkpoints (§2) matter.
+- **Preemption = pause, not destroy.** A preempted instance is paused (disk survives) until its bid regains
+  top priority or the higher bid finishes. Because storage is machine-/GPU-locked, it can only resume **on
+  the original host's original GPU** — the resumability cliff (VAST3).
+- **Detection signal + grace window:** **little/no advance notice — treat the grace as ~0 s, an abrupt
+  pause.** No documented termination signal; a SIGTERM-flush handler is **NOT** a safety net. Detect via the
+  API: `show_instance` returns `actual_status` (current container state), `intended_status` (desired state),
+  `cur_state` (contract/hardware allocation), and `status_msg` (human string, e.g. "success, running ...")
+  (verified docs.vast.ai/api-reference/instances/show-instances 2026-06). A preempted instance stops being
+  `running`; the UI shows **Inactive** (stopped, data preserved) / **Scheduling** (waiting for the GPU to
+  free) / **Offline** (host gone).
+- **Resume hook:** wait for the higher bid to finish or restart the instance; it returns
+  `Scheduling → running` **only if the same GPU is still free** (else it sticks — VAST3), then
+  **`/root/onstart.sh` re-runs** and relaunches training (§6). The job itself must be checkpoint-resumable
+  (`--resume`, load-latest unconditionally) so the identical command resumes idempotently.
+**Orchestrator pattern:** poll `actual_status` / `status_msg` on a timer; on preemption, restart (or
+re-launch on a new offer) and let `onstart.sh` + checkpoint-resume recover. Cadence formula (Young/Daly) and
+atomic temp→fsync→rename resume → `references/spot-resilience.md`.
+→ **verify:** kill-and-resume drill — `vastai stop instance <id>` then `start`; the job resumes from the last
+checkpoint step, not epoch 0.
+---
+## 5. TEARDOWN / BILLING  *(principle #9 + the Iron Law)*
+This is the most error-prone section — be precise. (verified docs.vast.ai/.../reference/billing +
+.../manage-instances 2026-06)
+- **`destroy` is the ONLY thing that stops the full meter** (compute **and** disk). It is **irreversible** —
+  all container-disk data is permanently deleted. (`vastai destroy instance <id>`)
+- **`stop` is a trap:** it detaches the GPU and halts compute billing, but **disk keeps charging
+  indefinitely** while stopped — "stopping an instance does not avoid storage costs," "you will continue to
+  be billed for disk storage, even if your balance is negative." The #1 surprise bill on vast.ai.
+  "Stopped" ≠ "meter off."
+- **Bandwidth bills in EVERY state.** Charged "for every byte sent or received... regardless of what state it
+  is in" — so even a transfer to/from a *stopped* instance (cloud sync) accrues host-set bandwidth cost (§3).
+- **A Volume keeps billing after the instance is destroyed** until the volume itself is deleted ("charged per
+  GB while volume exists," independently from instances).
+- **On-demand instances auto-stop when their host-set lifetime expires** — "when the rental end date is
+  reached, the rental contract expires and the instance is stopped." Data remains until destroyed. An
+  unattended job can silently end, so checkpoint as if the box disappears at any moment.
+- **Zero / negative balance → deletion.** At $0.00 "your instances, storage volumes, and data will be
+  scheduled for deletion unless you add credits"; without a saved card "your instances and stored data will
+  be destroyed." There is a "short grace period where your balance may go negative before deletion occurs" —
+  do not rely on it.
+- **Poll-loop cost trap:** a status-poll loop with no timeout/error check will loop forever while the
+  instance keeps accruing disk + bandwidth charges. Bound every poll loop with `timeout` + an exit check.
+**Teardown Iron Law (vast.ai instance):** NO `destroy` until checkpoints are **copied off-box AND verified by
+load** — either `vastai copy`-ed to local (`scripts/verify_local.py` reports 100% OK) or `vastai cloud copy`
+confirmed — the copy exit status is checked (VAST2), and the user has **explicitly approved** the
+cost-affecting action. "It looked done in the log" is not evidence (principle #3). Because `destroy` deletes
+the disk and there is **no shared FS to fall back on**, the confirmation gate matters more here, not less.
+---
+## 6. DAEMON TOOL
+- **Auto-tmux on SSH login** (same as AutoDL): login attaches a tmux session "to keep the session active
+  even if you disconnect." Disable with `touch ~/.no_auto_tmux` then reconnect (verified docs.vast.ai
+  jupyter-ssh FAQ 2026-06).
+- **tmux survives an SSH disconnect but NOT a container restart/reboot/spot-resume** — a reboot or
+  spot-resume wipes the tmux session. The **durable relaunch hook is `/root/onstart.sh`** (the
+  `--onstart-cmd`), which re-runs on every container start. Put the training relaunch there, **not** in
+  tmux, so a spot-resume actually restarts the job.
+- **SSH keys apply only to instances created AFTER the key is added** — existing instances do not get a new
+  key automatically. Set the account key **before** creating, or inject it via `onstart`. A pasted key missing
+  its `ssh-rsa`/`ssh-ed25519` prefix or `user@host` suffix authenticates as a password prompt — copy the whole
+  line (verified docs.vast.ai jupyter-ssh FAQ 2026-06).
+- **Native queue:** vast.ai has **Serverless / autoscaler** for queue-style workloads, but single-instance
+  training has no managed scheduler — the orchestrator + `onstart.sh` + checkpoint-resume **is** the queue.
+---
+## 7. TOP GOTCHAS  (platform-pinned; Symptom → Root cause → Fix)
+Universal gotchas (CRLF, cgroup OOM, silent-sync, HF stalls, zombie VRAM, GPU-0%-util, scp-resets,
+egress-surcharge) live in `references/gotchas_universal.md` — not repeated here.
+- **VAST1 — surprise bill on a "stopped" instance.** Symptom: a stopped, idle instance keeps charging for
+  days, even past a negative balance. → Root cause: `stop` halts compute only; **disk bills forever while
+  stopped**, and bandwidth bills in every state. → Fix: to stop the meter, **`destroy`** (after copy-out per
+  §5); never leave an instance merely stopped to "save money."
+- **VAST2 — results gone after teardown.** Symptom: `destroy` run, checkpoints irrecoverable. → Root cause:
+  `destroy` permanently nukes container disk and there's **no platform-wide FS to fall back on**. → Fix:
+  `vastai copy` out (or `vastai cloud copy` to a bucket) and **check its exit status** BEFORE `destroy`; gate
+  the success line on the copy result, never on a log claim.
+- **VAST3 — paused/stopped instance stuck in `Scheduling` though the machine shows "available."** Symptom:
+  preempted or stopped run never resumes; the portal still lists the same machine as rentable. → Root cause:
+  the instance is **bound to a specific GPU ID** (not the machine); if that GPU was re-rented, it waits
+  indefinitely while *other* GPUs on the host stay free. "If stuck >30 s, GPU likely rented by another user."
+  → Fix: stop the scheduling attempt, **create a NEW instance on the same host and re-attach the same Volume**
+  (works because other GPUs are free), or re-launch on a different offer from an off-box checkpoint; don't
+  wait for the same GPU to come back (verified vast-ai.crisp.help + manage-instances 2026-06).
+- **VAST4 — job dies mid-step with no warning.** Symptom: interruptible run vanishes abruptly. → Root cause:
+  bid preemption with **~0 s notice and no SIGTERM**; a flush handler never fires. → Fix: periodic checkpoint
+  to disk on a Young/Daly timer + load-latest-on-resume; poll `actual_status`/`status_msg` and restart (§4,
+  `references/spot-resilience.md`). The bid can't be raised on a live instance — re-launch elsewhere if the
+  GPU is gone.
+- **VAST5 — CUDA driver mismatch on a fresh box.** Symptom: `torch.cuda.is_available()` is False / driver
+  mismatch error. → Root cause: **CUDA ships in the Docker image, not the host**; the image's CUDA may be
+  newer than the host driver supports (image CUDA must be ≤ host driver). → Fix: pick an image whose CUDA ≤
+  host driver; verify `nvidia-smi`/`nvcc` inside the container in `onstart` before training (general triangle:
+  `gotchas_universal.md` U28).
+- **VAST6 — a service is unreachable on its "own" port.** Symptom: TB/Jupyter/API not reachable at the
+  internal port. → Root cause: internal ports map to **random external ports** and there's a **64-port cap**
+  per instance. → Fix: open ports with `-p` at create, **discover the external mapping at runtime**
+  (`vastai show instance` / IP Port Info pop-up), never hard-code a port.
+- **VAST7 — host vanishes mid-run.** Symptom: instance flips to `Offline`, work lost. → Root cause: it's a
+  **marketplace** — an unverified/low-reliability host can disconnect. → Fix: filter offers on
+  `verified=true`, high `reliability`, and `direct_port_count>=1`; treat any single host as disposable and
+  checkpoint off-box accordingly.
+- **VAST8 — bulk `scp` over the default SSH stalls / crawls.** Symptom: a multi-GB result copy over the
+  default endpoint hangs or runs at a trickle. → Root cause: the **default is proxy SSH**, throttled and
+  recommended only for <1 GB; large transfers need direct-TCP. → Fix: create with `--direct` (offer must have
+  `direct_port_count>=1`) and use that endpoint for `scp`/`vastai copy`; for big *inbound* data prefer
+  `wget`/`curl` from a bucket (verified data-movement docs 2026-06).
+- **VAST9 — bandwidth shows up on the bill.** Symptom: a transfer-heavy job costs more than the GPU-hours
+  alone. → Root cause: bandwidth is **host-priced and metered per byte in both directions, in every state** —
+  some offers are not $0-egress. → Fix: read the per-offer bandwidth rate before committing; pull a dataset
+  **once** to durable local/Volume, not per-epoch from a remote bucket (general form: `gotchas_universal.md`
+  U14/U15).
+- **VAST10 — disk full, and you can't grow it.** Symptom: `No space left on device` mid-run; `--disk` can't
+  be raised. → Root cause: container disk is **fixed at create (min 10 GB) and non-resizable**; Docker
+  layers + HF cache + checkpoints overrun it. → Fix: over-provision `--disk` at create; redirect `HF_HOME`
+  onto the data disk; prune `latest`/periodic checkpoints, keep only `best` (inode/byte audit:
+  `gotchas_universal.md` U6/U7).
+- **VAST11 — secret baked into the image or onstart-cmd is recoverable.** Symptom: a key embedded at build
+  time or in `--onstart-cmd` is stored by the platform. → Root cause: image layers and the 16 KB onstart
+  string are persisted server-side. → Fix: inject `WANDB_API_KEY`/`HF_TOKEN` via **env vars at create**, never
+  baked into image layers or `--onstart-cmd`; stream creds via stdin at runtime (`gotchas_universal.md` U34).
+- **VAST12 — assuming a cross-machine Network Volume exists.** Symptom: a plan relies on a Volume following
+  the job to a different host. → Root cause: Volumes are **machine-locked**; cross-machine Network Volumes are
+  **not in the current storage docs**. → Fix: design for off-box durability (`vastai cloud copy` to a bucket),
+  not a portable volume; only same-machine re-attach is reliable.
+- **VAST13 — instance stuck in `Loading`, never reaches `running`.** Symptom: a new instance sits in
+  `Loading`/`Connecting` for many minutes. → Root cause: the host is **pulling a large Docker image** (boot is
+  1–5 min, longer for fat images) or the host link is slow. → Fix: wait out the documented window, then read
+  `vastai show logs <id>` (below) for the pull progress; if still stuck, `destroy` and re-create on a faster
+  offer with a slimmer image.
+### Platform-specific debugging (commands + what to check)
+- **Read the boot/container/system logs from off-box:**
+  `vastai show logs <id> --tail 200 [--filter <grep>] [--daemon-logs]` — uploads container logs (and, with
+  `--daemon-logs`, host/system logs) to a generated URL. This is the first stop for a box that won't connect,
+  a stuck `Loading`, or a silent `onstart` failure (verified docs.vast.ai/api-reference/instances/show-logs
+  2026-06). The GUI equivalent is the **"Logs" button** on the instance card.
+- **Inspect the live state machine without SSH:** `vastai show instance <id>` (or the API) — compare
+  `actual_status` (where the container *is*), `intended_status` (where it *should* be), `cur_state` (contract/
+  hardware allocation) and `status_msg`. `intended=running` but `actual≠running` + `Scheduling` ⇒ VAST3
+  (GPU-bound wait); `Offline` ⇒ VAST7 (host gone).
+- **Confirm the GPU is really attached:** in `onstart` / first SSH run `nvidia-smi` and
+  `python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"` — `False`/CPU-only ⇒ VAST5
+  (image CUDA > host driver) or no-GPU container (`gotchas_universal.md` U31).
+- **Detect a stuck download inside the box:** `du -sh ~/.cache/huggingface/hub` over time (no growth = stalled
+  HF pull), `df -h /` (filling = active download) and `df -i /` (inodes), then the resumable-download ladder
+  in `gotchas_universal.md` (HF). A fat-image stall *before* SSH is visible only via `vastai show logs`.
+- **Find the real external ports / SSH target:** `vastai show instance <id>` lists the port map and
+  `vastai ssh-url <id>` prints the connection string — never assume port 22 is reachable (VAST6).
+---
+## 8. SCRIPT OVERRIDES
+Values to parameterize the `scripts/` templates for vast.ai:
+```bash
+# DATA_DIR — data + (only) checkpoint mount; NOTHING survives destroy, so durable = off-box copy-out/cloud-sync
+DATA_DIR=/workspace              # container disk; survives stop, bills forever, GONE on destroy
+DURABLE_DIR=off-box              # no destroy-surviving mount: vastai copy / vastai cloud copy before destroy (§5)
+# PROXY_HOOK — none at platform level (no /etc/network_turbo). HF mirror is the JOB's own env if needed:
+PROXY_HOOK=''                    # set HF_ENDPOINT=https://hf-mirror.com in the job env only if a host can't reach HF
+# CRED_FILE — empty: vast's key is the VAST_API_KEY env var, not a file. WANDB_API_KEY/HF_TOKEN also arrive via env.
+CRED_FILE=""                     # no cred FILE on disk → run_one's [ -n "$CRED_FILE" ] guard skips the cat; VAST_API_KEY + WANDB_API_KEY/HF_TOKEN injected via env at create, NOT into the image or onstart-cmd
+# SCRATCH — what to prune (disk is fixed-size, non-resizable → prune aggressively)
+SCRATCH='latest.pth periodic-*.pth *.tmp ~/.cache/huggingface/hub/blobs'  # keep only best + tiny eval JSONs
+# HF_HOME — redirect cache off the small root onto the data disk
+HF_HOME=/workspace/.cache/huggingface
+# DETACH — durable relaunch is onstart.sh, NOT tmux (tmux dies on container restart/spot-resume)
+DETACH='/root/onstart.sh'        # re-runs on every container start; tmux only for an attached SSH session
+```
+**Secrets note:** inject `WANDB_API_KEY` / `HF_TOKEN` via **env vars at create**, never baked into the Docker
+image layers or the 16 KB `--onstart-cmd` (both are stored by the platform — VAST11).

package/bundled-skills/remote-gpu-trainer/references/china-network.md ADDED Viewed

@@ -0,0 +1,206 @@
+# China network + model-download reference
+Universal recipe for pulling code, packages, and model weights onto **any GPU box behind the GFW** —
+AutoDL, 矩池云, 恒源云, Featurize, 揽睿星舟, or a bare CN SSH instance. The whole problem reduces to **four
+orthogonal env-var switches** (mirror, cache location, resume tier, proxy scope); none requires editing
+training code. This file owns the CN-specific transport swap and stall-retry; **REQUIRED:**
+`huggingface-skills:hf-cli` owns the generic `hf download` / `hf upload` verbs underneath it.
+Universal gotchas (inode caps, silent sync, symlinked caches) are **not** restated here — see
+`references/gotchas_universal.md`. The AutoDL-pinned form lives in `profiles/autodl.md`.
+To jump: `grep -in '<keyword>' references/china-network.md` (try `mirror`, `HF_ENDPOINT`, `hfd`,
+`no_proxy`, `hf_transfer`, `decision`).
+## Table of contents
+1. Mirrors table — PyPI / conda / HuggingFace / alt hub
+2. Env switchboard — the four switches + the import-time trap + cache redirect
+3. Resumable-download ladder — three tiers + the `hf_transfer` caution
+4. The `no_proxy` trap — a proxy that fixes one domain breaks all the others
+5. Decision rule + `scripts/setup-china-mirrors.sh`
+---
+## 1. Mirrors table
+Swap the *source*, not the workflow. Same package names, same repo IDs — only the endpoint changes. Ship
+this verbatim; it is identical across every CN platform.
+| Channel | Set | Endpoint(s) |
+|---|---|---|
+| **PyPI** | `pip config set global.index-url <url>` or `pip install -i <url> pkg` | Tsinghua TUNA `https://pypi.tuna.tsinghua.edu.cn/simple` · Aliyun `https://mirrors.aliyun.com/pypi/simple` · USTC `https://pypi.mirrors.ustc.edu.cn/simple` |
+| **conda** | channels in `~/.condarc` (TUNA Anaconda) | `https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main` + `.../free` + the `cloud/` channels (pytorch, conda-forge) |
+| **HuggingFace** | `export HF_ENDPOINT=https://hf-mirror.com` | drop-in reverse proxy — identical repo IDs, identical `hf download` / `from_pretrained` calls |
+| **Alt model hub** | ModelScope CLI / SDK | `pip install modelscope`; `modelscope download <id>` or `snapshot_download(id, ...)` — often hosts the same Qwen / GLM / Llama weights domestically |
+**conda trap — NEVER mirror `pytorch-nightly`.** TUNA (and every CN Anaconda mirror) syncs the stable
+`pytorch` channel but **does not carry `pytorch-nightly`** — pointing the nightly channel at a mirror
+silently resolves to a stale or absent build. Install nightly only from the official channel (over a real
+proxy if the box is offline), and mirror just the stable channels.
+Source: HF-Mirror `https://hf-mirror.com/`; TUNA PyPI `https://mirrors.tuna.tsinghua.edu.cn/help/pypi/`;
+TUNA Anaconda `https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/`; ModelScope client
+`https://github.com/modelscope/modelscope_hub`.
+---
+## 2. Env switchboard + the import-time trap
+Everything below is **environment variables only** — no code edits. Export them once per shell (or bake
+them into `scripts/setup-china-mirrors.sh`, §5) before anything that touches the wire.
+```bash
+# --- mirror routing ---
+export HF_ENDPOINT=https://hf-mirror.com           # MUST precede any HF import (see trap below)
+# --- caches OFF the small reset-on-release system disk, ONTO the data disk ---
+export HF_HOME=/path/to/datadisk/hf                 # parent for hub/, datasets/, etc.
+export HF_HUB_CACHE=/path/to/datadisk/hf/hub        # the model-blob cache specifically
+export MODELSCOPE_CACHE=/path/to/datadisk/modelscope
+# --- keep hf_transfer OFF on flaky CN links (see §3) ---
+export HF_HUB_ENABLE_HF_TRANSFER=0
+```
+**The import-time trap — `HF_ENDPOINT` is read once, at import.** `huggingface_hub` / `transformers` /
+`datasets` snapshot `HF_ENDPOINT` the moment they are **imported**. Setting it *after* the import (or in a
+notebook cell run after the first `import transformers`) is a no-op — the library already cached the
+international endpoint and every download hits the slow path. Two safe forms:
+```bash
+# Inline on the command — the env is set before the interpreter starts:
+HF_ENDPOINT=https://hf-mirror.com python train.py
+# Or export in the wrapper, ABOVE any python invocation:
+export HF_ENDPOINT=https://hf-mirror.com   # then later: python -m src.train ...
+```
+**Cache redirect — why it matters.** Most CN images pair a tiny reset-on-release system disk with a larger
+persistent data disk. Left at defaults, `~/.cache/huggingface` lands on the system disk and either fills it
+(crashing downloads) or is **wiped on restart** on platforms where `/root` is ephemeral. Redirecting
+`HF_HOME` / `HF_HUB_CACHE` / `MODELSCOPE_CACHE` onto the data disk ties model storage to the same
+disk-budget discipline as checkpoints (principle #5; survival matrix in each profile).
+Source: HF-Mirror `https://hf-mirror.com/`; ModelScope client
+`https://github.com/modelscope/modelscope_hub`.
+---
+## 3. Resumable-download ladder
+Bulk weight pulls are the prototypically flaky step on a CN link — a stall is **not** a permanent failure,
+and every tier below accumulates progress across kills. Escalate by file size and instability.
+**Tier 1 — `hf download <repo> --resume-download` (default).**
+Writes partial blobs as `*.incomplete`; re-running the identical command resumes from the byte offset. Best
+for single repos under ~10 GB. Wrap in a `timeout … && break` retry loop so a stall self-recovers:
+```bash
+#!/usr/bin/env bash
+set -u
+for _ in $(seq 1 20); do
+  timeout 600 hf download "$REPO" --local-dir "$DIR" --resume-download && break
+  echo "stall, retrying (progress is saved)"; sleep 5
+done
+```
+(Underlying verbs — `hf download --resume-download`, `hf cache verify` — belong to **REQUIRED:**
+`huggingface-skills:hf-cli`; this ladder only wraps them with CN-mirror routing + stall-retry.)
+**Tier 2 — `hfd.sh` (aria2 multi-connection) for any single file > 10 GB.**
+`hfd.sh` (the HF-Mirror companion script) drives `aria2c` with many parallel connections per file —
+markedly faster and more stall-resistant than the single-stream CLI on large `.safetensors` shards over a
+congested evening link. Reach for it whenever one file exceeds ~10 GB:
+```bash
+./hfd.sh "$REPO" --tool aria2c -x 8     # 8 connections per file, resumes on re-run
+```
+**Tier 3 — ModelScope `snapshot_download` (HTTP-Range resume).**
+When a model exists on ModelScope (most CN-origin models do), pull it domestically — `snapshot_download`
+does per-file HTTP-Range resume, per-file retry with backoff, and SHA256 verification, all over a domestic
+route that never touches the GFW:
+```python
+from modelscope import snapshot_download
+snapshot_download("Org/Model", local_dir="/path/to/datadisk/model")
+```
+Note: ModelScope writes a plain directory and does **not** populate the HF cache, so
+`from_pretrained("Org/Model")` won't find it — point the load at the local dir.
+**`hf_transfer` caution — keep `HF_HUB_ENABLE_HF_TRANSFER=0` on flaky CN networks.**
+`hf_transfer` is a Rust accelerator that helps on fast, stable links, but it has a **documented
+hang-with-no-error** in exactly the unstable-bandwidth conditions CN ops hit — the download wedges with no
+progress and no exception, defeating every retry loop above. Leave it **off** by default on any CN box;
+only enable it once a route is verified fast and stable.
+Source: hf CLI resume `https://github.com/huggingface/huggingface_hub/issues/3580`; hf_transfer hang
+`https://github.com/huggingface/hf_transfer/issues/30`; ModelScope download
+`https://deepwiki.com/modelscope/modelscope/3.1-model-download-and-caching`.
+---
+## 4. The `no_proxy` trap
+**The highest-value gotcha in this file.** A Clash / VPN proxy added to reach `huggingface.co`
+**simultaneously breaks every domestic mirror** — `pip`, the TUNA index, ModelScope, intra-cloud OSS all
+get routed out through an overseas exit node, producing `ProxyError` or multi-minute stalls (principle #7:
+a proxy speeds ONE route and slows the others).
+**Symptom** → after exporting `http_proxy`/`https_proxy` to fix HF, `pip install` and ModelScope downloads
+hang or raise `ProxyError`, while `huggingface.co` now works.
+**Root cause** → the proxy is global; domestic mirrors that were fast on the direct route are now hauled
+overseas and back.
+**Fix** → exempt every domestic host from the proxy with a `no_proxy` allowlist, minding these library
+quirks:
+- **Leading-dot domains, no `*` wildcards.** `requests` honors `no_proxy` but does **not** expand `*` — use
+  `.modelscope.cn` (leading dot matches the domain and all subdomains), never `*.modelscope.cn`.
+- **Set BOTH `no_proxy` and `NO_PROXY`.** Different libraries read different casings; set both to the same
+  value.
+- **List `127.0.0.1` AND `localhost`.** They are distinct entries; omitting either lets a loopback call
+  (TensorBoard, a local API) get proxied.
+- **`pip` ignores `no_proxy` for its own connections** — pass `pip install --proxy ""` to force pip onto the
+  direct route regardless of an inherited proxy env.
+```bash
+# Only export this WHEN a proxy is present (see below):
+DOMESTIC=".tuna.tsinghua.edu.cn,.aliyun.com,.aliyuncs.com,.ustc.edu.cn,.modelscope.cn,.tencentyun.com"
+export no_proxy="127.0.0.1,localhost,${DOMESTIC}"
+export NO_PROXY="$no_proxy"
+```
+**A clean box with no proxy needs no `no_proxy` at all.** `no_proxy` only un-routes a proxy that is already
+set. On a freshly rented box with no `http_proxy`/`https_proxy` exported, adding `no_proxy` does nothing —
+add it **only** in the same breath as exporting a proxy (§5's "real overseas proxy" branch), and clear it
+when the proxy is unset.
+Source: requests `no_proxy` `https://github.com/psf/requests/issues/4871`; no_proxy guide
+`https://www.browserstack.com/guide/no_proxy-environment-variable`; Clash pip ProxyError
+`https://github.com/clash-verge-rev/clash-verge-rev/issues/2607`.
+---
+## 5. Decision rule + delivery
+**Pick the cheapest route that reaches the weights, in order:**
+1. **hf-mirror first** — `HF_ENDPOINT=https://hf-mirror.com`. Drop-in, same repo IDs, no proxy, no
+   `no_proxy` to manage. Default for everything.
+2. **ModelScope** if the model is absent on the mirror or the mirror route is flaky — same Qwen / GLM /
+   Llama weights domestically, Tier-3 resume, no GFW crossing.
+3. **`hfd.sh`** for any single file > 10 GB on a stable-but-slow link — aria2 multi-connection.
+4. **A real overseas proxy ONLY when a model exists *only* on `huggingface.co`** and neither mirror nor
+   ModelScope carries it. The moment a proxy goes on, **immediately apply the §4 `no_proxy` block** so the
+   domestic mirrors keep working — and unset both when the pull is done.
+**Never** reach for a proxy by reflex: it is the slowest, most fragile option and the one that breaks
+everything else. Mirror → alt hub → multi-connection → proxy, in that order of preference.
+**Ship `scripts/setup-china-mirrors.sh`** — the orchestrator `scp`s it onto the box and `source`s it on
+first connect. It bakes §1 (PyPI + conda mirrors), §2 (the four env switches + cache redirect off the
+system disk), and the §3 default (`HF_HUB_ENABLE_HF_TRANSFER=0`) into one idempotent step, leaving the §4
+proxy block commented out (added only on the rare proxy branch). Author it with `#!/usr/bin/env bash` +
+`set -u`, forward-slash paths, and **no unquoted `|` inside any `grep`** (an unquoted pipe in a regex reads
+stdin and hangs the setup forever).
+Source: HF-Mirror `https://hf-mirror.com/`; ModelScope `https://github.com/modelscope/modelscope_hub`.