opencode-skills-collection 3.1.2 → 3.1.4
- package/bundled-skills/.antigravity-install-manifest.json +4 -1
- package/bundled-skills/agent-creator/SKILL.md +246 -0
- package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
- package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
- package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
- package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
- package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
- package/bundled-skills/docs/sources/sources.md +1 -1
- package/bundled-skills/docs/users/bundles.md +1 -1
- package/bundled-skills/docs/users/claude-code-skills.md +1 -1
- package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
- package/bundled-skills/docs/users/getting-started.md +1 -1
- package/bundled-skills/docs/users/kiro-integration.md +1 -1
- package/bundled-skills/docs/users/usage.md +4 -4
- package/bundled-skills/docs/users/visual-guide.md +4 -4
- package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
- package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
- package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
- package/bundled-skills/remote-gpu-trainer/README.md +267 -0
- package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
- package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
- package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
- package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
- package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
- package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
- package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
- package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
- package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
- package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
- package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
- package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
- package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
- package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
- package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
- package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
- package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
- package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
- package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
- package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
- package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
- package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
- package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
- package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
- package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
- package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
- package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
- package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
- package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
- package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
- package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
- package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
- package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
- package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
- package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
- package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
- package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
- package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
- package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
- package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
- package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
- package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
- package/package.json +1 -1
- package/skills_index.json +66 -0
@@ -0,0 +1,365 @@
---
platform: paperspace # Paperspace (now under DigitalOcean): Gradient Notebooks + Core/Machines
kind: cloud-api # web console + pspace/gradient CLI/SDK + REST; Core machines also reachable by SSH
meter_stop_verb: shut-down # shut-down/power-off stops COMPUTE; only destroy/delete stops storage + IP
meter_stop_irreversible: false # a stop is reversible; destroy/delete IS irreversible (loses block storage)
detach_primitive: tmux # on Core VMs; Notebooks have no clean SSH-daemon story (Jupyter kernel + hard auto-shutdown ceiling)
spot_available: false # no AWS-style spot/preemptible with a 2-min warning
spot_grace: n/a # interruption is capacity-at-launch + a deterministic auto-shutdown clock, not eviction
shared_fs: true # Gradient /storage is team-shared per storage region/cluster
inode_cap: none # no documented inode cap on either /storage or Core block storage
free_egress: true # no documented ingress/egress fee
china_mirror_needed: false # US/global cloud, direct egress; no platform-provided proxy
host_driver_cuda_max: "host-dependent" # ML-in-a-Box / template ships the CUDA+driver stack (often lagging)
local_nvme: host-dependent # ephemeral workspace on Notebooks; block storage on Core
---

# Paperspace (DigitalOcean) — platform profile

One-line purpose: substrate for running detached GPU jobs on Paperspace Gradient (managed Jupyter
notebooks/deployments) and Paperspace Core (raw Linux VMs, "Machines") — what stops the meter, what
survives a stop vs a destroy, and the auto-shutdown clock that ends every long run. Universal gotchas are
NOT repeated here — see `references/gotchas_universal.md`.

> **Surface to the user up front (principle #10):** ⚠️ Danger clocks — an **auto-shutdown timer ends every Notebook/Core run** (set it consciously; Gradient free notebooks hard-cap at 6 h); **snapshots / block storage keep billing after a machine is destroyed** (orphan bleed). Heads-up — the **Gradient CLI/API was deprecated 15 Jul 2024** (pin `gradient<3.0`; the three-CLI mess, §1).

To jump: `grep -in '<keyword>' profiles/paperspace.md`.

## Table of contents
1. LAUNCH — Gradient vs Core, the env contract, the three-CLI mess
2. STORAGE MODEL — survival matrix, the stop-keeps-disk rule, pip-doesn't-persist
3. NETWORK — public IP (static vs dynamic), ports, SSH flavor
4. SPOT / INTERRUPTION + RESUME — the auto-shutdown clock, not spot
5. TEARDOWN / BILLING — what actually stops the meter (the trap)
6. DAEMON TOOL — tmux on Core; why Notebooks resist a daemon
7. TOP GOTCHAS — `PS1`–`PS13`, platform-pinned + platform-specific debugging
8. SCRIPT OVERRIDES — values for the `scripts/` templates

---

## 1. LAUNCH

Two product families, with opposite operating models:

- **Gradient** — the managed layer. **Notebooks** are a web Jupyter IDE on a shared persistent store;
  **Deployments** serve a container behind a REST endpoint (bring a Docker image `<user>/img:tag`);
  **Workflows** run GPU-backed DAG automation. Entry: web console, the CLI/SDK, or REST.
- **Core / Machines** — raw Linux/Windows VMs with a persistent block disk, full root/SSH. OS templates
  include **ML-in-a-Box** (preinstalled CUDA + PyTorch/TensorFlow/RAPIDS/Jupyter; **terminal/SSH-only**,
  home `/home/paperspace`, shell `/bin/bash`). **Ubuntu 22.04 is required for H100 and recommended for
  A100; Ubuntu 20.04 is recommended for any other machine type** (verified github.com/Paperspace/ml-in-a-box
  README + DO machines docs 2026-06). This is the family that maps cleanly onto the AutoDL
  tmux-resilient-training pattern.

**Env contract.** The chosen image/template IS the Python env — do NOT `conda create` on a rental
(principle: the prebuilt base is the env). On Core, run inside **ML-in-a-Box** directly; on Gradient
Deployments, the env is the Docker image specified at create time. Because a *destroy* wipes the box, the
durable analog of the env is a Docker image plus a `requirements.txt`/lock file kept off-box, so a recreate
reproduces it. **On Notebooks, a plain `pip install` does NOT survive a restart** (writes to
`/usr/local/lib`, ephemeral) — see §2 / `PS3`.

**The three-CLI mess (gates ALL automation).** The tooling fragmented across the DigitalOcean acquisition;
the draft's "migrate to the current API/CLI" understates the trap (verified github.com/Paperspace 2026-06):
- The **legacy Gradient REST API endpoints were deprecated 15 Jul 2024** — stale calls 404 or no-op.
- **`gradient-cli` v2 is deprecated**; pin `pip install "gradient<3.0"` only to keep *old* scripts alive.
- **`gradient-python` (github.com/digitalocean/gradient-python) is NOT the orchestration CLI** — it is the
  new DigitalOcean *Gradient AI / GenAI inference* SDK. **Name collision** — do not install it expecting
  notebook/machine control.
- The **recommended tool for new work is the streamlined `pspace` CLI** (github.com/Paperspace/cli,
  releases ongoing into 2026; e.g. `pspace public-ip release <ip>`). Pin and verify the CLI binary +
  version in any automation; do not assume `gradient` ⇒ `pspace` command parity.

→ **verify:** `ssh <core-alias> 'python -c "import torch;print(torch.cuda.is_available())"'` on Core, or a
`print(torch.cuda.is_available())` cell in a Notebook.
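
The pin-and-verify advice above can be sketched as a small preflight. This is a minimal sketch, not an official tool: the binary names `pspace` and `gradient` come from this section, while `find_clis` and `preflight` are hypothetical helper names. It only reports what resolves on `PATH`; it deliberately does not call any version flag, because the exact version command of either CLI is not confirmed here.

```python
import shutil

# Binary names taken from this section; everything else is illustrative.
CLI_NAMES = ("pspace", "gradient")


def find_clis(names=CLI_NAMES):
    """Map each CLI name to its resolved path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def preflight(found):
    """Return human-readable warnings for an automation run."""
    warnings = []
    if found.get("pspace") is None:
        warnings.append("pspace not found: new automation should target it")
    if found.get("gradient") is not None:
        warnings.append("legacy gradient CLI present: keep it pinned <3.0 "
                        "and do not assume pspace command parity")
    return warnings


if __name__ == "__main__":
    for line in preflight(find_clis()):
        print("WARN:", line)
```

Running the preflight at the top of an automation script makes a missing or unexpected CLI visible before any create/stop call is attempted.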

---

## 2. STORAGE MODEL *(survival matrix — principle #4)*

The defining fact: a **stop/shut-down keeps the disk** — Paperspace is one of the few profiles here that
behaves like AutoDL's 关机 in this respect. Only **destroy/delete** removes storage.

**Gradient Notebooks** — `/storage` and `/notebooks` are **separate branches from `/`, NOT nested**
(verified DO notebooks/details/storage-architecture 2026-06):
- `/storage` — **shared persistent**, team-wide, scoped to a **storage region/cluster**. Survives stop.
  (Team-shared ⇒ never write secrets here — see §7 / `references/gotchas_universal.md`.)
- `/notebooks` — **per-notebook persistent**, managed via the console File Manager. Survives stop.
- everything else — **ephemeral workspace** (incl. `/usr/local/lib` where `pip` lands), wiped on stop.

**Core machines** — block storage **50 GB–2 TB**, persists across a stop; **expansion is one-way**
("increasing block storage expands the filesystem and is not reversible"). Region-locked: storage and
custom templates must be used in the **same datacenter**. **Snapshots** are a separate billed resource
(`$0.29/GB/mo`, default policy is **"Never" / 0 stored** — they bill only if manually enabled, and a
snapshot **survives a machine destroy**, so an orphaned snapshot keeps charging — see `PS9`).

| Tier | Path | Survives STOP? | Survives DESTROY/DELETE? | Cap / note |
|---|---|---|---|---|
| Notebook shared persistent | `/storage` | yes | yes (separate resource) | team-shared per region/cluster; billed until deleted |
| Notebook per-notebook | `/notebooks` | yes | no (dies with the notebook) | per-notebook persistent; console File Manager |
| Notebook workspace | everything else (incl. `/usr/local/lib`) | **no** | no | ephemeral; wiped on stop; `pip` lands here |
| Core block storage | machine root + block vol | yes | **no** | 50 GB–2 TB; expansion irreversible; region-locked |
| Core snapshot | (separate resource) | yes | **yes** (orphan-bills!) | `$0.29/GB/mo`; default policy Never/0; survives machine destroy |

**Where checkpoints MUST go (for the §5 teardown verb):** on Notebooks, `/storage` (cross-stop,
cross-delete-of-the-notebook) — `/notebooks` dies if the notebook itself is deleted. On Core, the block
disk survives a stop, but a *destroy* wipes it, so the Iron-Law pull-to-local before destroy still applies.
No documented inode cap on either tier; still monitor `df -i` (universal, U7 / principle #5).
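
The Notebook rows of the survival matrix can be turned into a guard that a training script runs before writing a checkpoint. The following is a minimal sketch: the two mount points come from this section, while the helper names `is_durable` and `survives_notebook_delete` are hypothetical. The check is purely lexical, so it assumes no symlink redirects a path off the persistent mounts.

```python
import posixpath

# Persistent Notebook mounts named in this section; all else is ephemeral.
DURABLE_ROOTS = ("/storage", "/notebooks")


def is_durable(path, roots=DURABLE_ROOTS):
    """True if an absolute POSIX path sits under a persistent mount."""
    norm = posixpath.normpath(path)  # collapses "..", "//" and trailing "/"
    return any(norm == r or norm.startswith(r + "/") for r in roots)


def survives_notebook_delete(path):
    """Only /storage outlives deletion of the notebook itself."""
    return is_durable(path, roots=("/storage",))
```

A script could refuse to start when `is_durable(checkpoint_dir)` is false, which catches the common mistake of checkpointing into the ephemeral workspace.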

---

## 3. NETWORK

- **Egress.** Direct and unproxied to HF/GitHub/PyPI; no `network_turbo`-style accelerator and no
  documented egress fee. China-mirror relevance is **N/A as a platform feature** — relevant only when
  operating from inside China and supplying a private mirror (then `references/china-network.md`).
- **Public IP.** Core machines are reached by **public IP**, of two kinds (verified DO
  machines/how-to/manage-public-ips 2026-06):
  - **Static** — "the same IP address every time it powers on … remains in your account until you delete
    it." Use it to pin stable SSH/endpoint addressing. **Billed until deleted** — *including while the
    machine is powered off* (see §5 / `PS6`). API/CLI can create/release a **static** IP but **cannot add a
    dynamic IP to an existing machine** — dynamic must be requested at machine-creation time.
  - **Dynamic** — "assigned automatically when a machine powers on and deleted when it powers off"; a **new
    IP on every start**, so a hard-coded SSH alias breaks after a restart. **Charged only while the machine
    runs** (auto-released on power-off → no idle IP cost).
  A machine with **no public IP** is internet-isolated (and avoids the IP charge). **Private networks**
  give team-isolated pools.
- **Ports / services.** Firewall is self-managed — open ports to expose services. Tunnel Jupyter (8888) /
  TensorBoard (6006) over SSH on Core:
  `ssh -L 8888:localhost:8888 -L 6006:localhost:6006 paperspace@<machine-ip>`
  (placeholder host — substitute the machine's real IP/static address). In a Gradient Notebook, launch
  TensorBoard in-Jupyter and write logs under `/storage` (or they vanish on stop).
- **SSH flavor.** Core = a standard Linux VM → full `ssh`/`scp`/`rsync` (ML-in-a-Box default user
  `paperspace`). Gradient Notebooks expose a **Jupyter sandbox**, not a clean persistent SSH daemon —
  there is no stable SSH-daemon story for a multi-day unattended run on a Notebook.
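
Because a Dynamic IP changes on every start, it can help to build the tunnel command from a freshly resolved address instead of hard-coding it. The sketch below reproduces the `ssh -L` command from the Ports bullet as an argument list; `tunnel_argv` is a hypothetical helper name, and how the current address is obtained is left to the caller.

```python
def tunnel_argv(host, ports=(8888, 6006), user="paperspace"):
    """Build the argv for an SSH session forwarding each port to localhost."""
    argv = ["ssh"]
    for port in ports:
        # Same local and remote port, bound to the remote loopback interface.
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(f"{user}@{host}")
    return argv


if __name__ == "__main__":
    # Placeholder host, as in the text above; never a real address.
    print(" ".join(tunnel_argv("<machine-ip>")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems.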

---

## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*

**No AWS-style spot/preemptible tier** with a 2-minute interruption warning. The two interruption modes are
different in kind and BOTH are deterministic, not random eviction:

1. **Capacity-at-launch.** The desired GPU type may be unavailable when launching — a *launch-time*
   availability problem, not a runtime eviction. On free notebooks this surfaces as **"out of capacity" /
   the notebook sits "pending" in queue for the next free machine** (verified DO notebooks/how-to docs
   2026-06). Build **retry-launch-until-available** logic, not a 2-minute-grace flush handler; for assured
   access, a paid instance type bypasses the free queue.
2. **Auto-shutdown clock — the hard ceiling on any long run.** The timer is the real killer:
   - **Gradient free** notebooks hard-stop at a **6-hour** maximum auto-shutdown (cannot be raised).
   - **Paid notebooks** default to **12-hour** auto-shutdown; range **1 hour – 1 week**.
   - **Core** machines allow a configurable **1 hour – 1 week** auto-shutdown.
   - **Trap (Core/Linux):** Core Linux auto-shutdown is **wall-clock, not idle-based** — "Linux machines
     shut down regardless of whether any users are connected" (only Windows waits for idle). An active
     SSH/tmux session does **not** keep a Linux Core machine alive past the timer (verified DO
     machines/how-to/manage-auto-shutdown 2026-06).
   - **Trap (API):** auto-shutdown **cannot be enabled/disabled via API or CLI on an existing machine** —
     "you can only manage the auto-shutdown feature via the Paperspace console" (same source). Set it
     deliberately at create time / in the console.

The window is deterministic, so plan around it: a tmux session inside a Notebook **still dies at the
timeout** (§6). **Resume hook:** checkpoint full state to `/storage` (Notebooks) or the block disk
(Core) *before* the auto-shutdown window, then restart and load-latest-on-startup unconditionally.
Because the clock is known in advance, cadence can be planned rather than guessed — but the
load-latest-on-startup spine (principle #8) is what makes the restart idempotent. Young/Daly cadence
formula → `references/spot-resilience.md`.
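
The resume hook described above has two small pieces: find the newest checkpoint at startup, and know the last safe moment to write one before the timer fires. The following is a sketch under stated assumptions: the `ckpt_<step>.pt` naming scheme, the 15-minute safety margin, and both function names are hypothetical and not taken from the skill's scripts.

```python
import re
from pathlib import Path

CKPT_RE = re.compile(r"^ckpt_(\d+)\.pt$")  # hypothetical naming scheme


def latest_checkpoint(ckpt_dir):
    """Return the checkpoint path with the highest step, or None if none exist."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for path in root.iterdir():
        match = CKPT_RE.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def checkpoint_deadline(start_ts, window_hours, margin_minutes=15):
    """Epoch seconds by which the final checkpoint must be on durable storage."""
    return start_ts + window_hours * 3600 - margin_minutes * 60
```

Calling `latest_checkpoint` unconditionally at startup, and starting from scratch only when it returns `None`, is what keeps a restart idempotent.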

---

## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law — the most error-prone section)*

Per-hour billing (verified DO products/paperspace/pricing 2026-06). **A shut-down/power-off STOPS the
compute (GPU) meter** while disk persists — this is the AutoDL-like part. **But it does NOT stop every
meter.**

- **What a stop still bills (the trap):** "When a Paperspace machine is powered off, attached **storage**,
  **public IP addresses**, and other **add-ons** continue to be billed on an hourly basis until you destroy
  those resources." Gradient `/storage` over the plan allowance and Core block storage both keep charging
  while the machine is off.
- **The monthly-cap softener (new fact):** non-GPU resources (storage, public IP, snapshots) have a
  **maximum monthly charge** — "once a non-GPU resource reaches its monthly maximum, it no longer incurs
  charges for the rest of the billing cycle." Static public IP caps at **$3.00/mo** ($0.0045/hr). So a
  forgotten static IP is a bounded ~$3/mo bleed, but a forgotten 2 TB block volume is **~$120/mo** until
  destroyed (verified DO pricing 2026-06).
- **What actually stops the full meter:** **destroy the machine** AND **release the static IP** AND
  **delete the storage** (AND delete any **snapshot**) — separate actions. "To stop all charges for a
  machine and its add-ons, destroy the machine and any resources you no longer need." A stopped-but-not-
  destroyed machine with a Static IP, a 2 TB block volume, and a leftover snapshot is still spending money.
- **Irreversible:** **destroy/delete** of a machine removes its block storage (no recovery); block-storage
  **expansion** is also one-way. A **shut-down is reversible** (resume later).

**Net contrast vs the other profiles:** Paperspace gives a real idle-cheap *stop* (unlike Lambda, which has
no stop), but unlike AutoDL's 关机 the **storage + IP + snapshots keep billing** until each is explicitly
destroyed/released. "Stopped" ≠ "free."
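
The "stopped is not free" point can be made concrete with a rough estimate of what a powered-off machine still costs per month. This is a sketch only: the $3.00 static-IP cap and the $0.29/GB/mo snapshot rate are the figures quoted in this profile, the block-storage figure is supplied by the caller (for example the ~$120/mo quoted for 2 TB), and any monthly maximum on snapshots is not modeled. The function name is hypothetical.

```python
STATIC_IP_MONTHLY_CAP = 3.00   # USD per month, figure quoted in this section
SNAPSHOT_PER_GB_MONTH = 0.29   # USD per GB per month, figure quoted in section 2


def idle_monthly_bleed(static_ip, block_storage_monthly, snapshot_gb=0.0):
    """Rough upper-bound monthly cost (USD) of a powered-off machine's leftovers."""
    total = block_storage_monthly
    if static_ip:
        total += STATIC_IP_MONTHLY_CAP
    total += snapshot_gb * SNAPSHOT_PER_GB_MONTH
    return round(total, 2)


if __name__ == "__main__":
    # Stopped machine with a static IP, a 2 TB volume and a 100 GB snapshot.
    print(idle_monthly_bleed(True, 120.0, snapshot_gb=100))
```

Surfacing this number to the user before a stop makes the choice between "stop" and "destroy after pulling checkpoints" an explicit cost decision.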

> **Iron Law (teardown gate):** NO destroy/delete of the machine, release of the IP, or deletion of
> `/storage`/block-storage/snapshot until checkpoints are **pulled to local AND verified by load**, and the
> user has **explicitly approved** the specific cost-affecting action. A destroy is irreversible — "it
> looked done in the log" is not evidence (principle #3). General form →
> `superpowers:verification-before-completion`.

---

## 6. DAEMON TOOL

- **Core machines** — full VMs ⇒ `tmux`/`screen`/`nohup` all available; SSH is as stable as any cloud VM.
  This is the closest analog to the AutoDL tmux-resilient pattern. tmux survives an SSH drop; it does NOT
  survive a machine **stop/restart** (the process is gone), and — critically on Core/Linux — a live tmux
  session does **not** defer the wall-clock auto-shutdown (§4), so durability still rests on
  checkpoint-to-disk + load-latest (principle #8), not on the detach primitive.
- **Gradient Notebooks** — a managed Jupyter sandbox: **no clean persistent SSH-daemon story**, and the
  **auto-shutdown timer is a hard ceiling** — a tmux session started inside a Notebook **still dies at the
  timeout**. Notebooks are not built for unattended multi-day daemons.
- **Platform-native long-job mechanisms** — **Workflows** (DAG automation) and **Deployments** (always-on
  serving). For training-as-a-daemon, prefer **Core + tmux**; treat Notebooks as interactive/short-run only.

If `tmux` is absent on a minimal image, fall back to `nohup <cmd> </dev/null >log 2>&1 &`.
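
The tmux-first, nohup-fallback choice can be expressed as a small command builder. This is a sketch, assuming the caller has already decided whether `tmux` exists (for instance with `shutil.which("tmux")`) and that `cmd` is a trusted shell command string; the function name, the session name `train`, and the log redirection in the tmux branch are illustrative choices rather than the skill's own template.

```python
import shlex


def detach_argv(cmd, log_path, session="train", have_tmux=True):
    """Build argv that starts `cmd` detached, via tmux or the nohup fallback."""
    log = shlex.quote(log_path)
    if have_tmux:
        # tmux runs the command string through a shell in a detached session.
        return ["tmux", "new-session", "-d", "-s", session,
                f"{cmd} >{log} 2>&1"]
    # Fallback from this section: nohup <cmd> </dev/null >log 2>&1 &
    return ["sh", "-c", f"nohup {cmd} </dev/null >{log} 2>&1 &"]
```

Either way the detach primitive only protects against an SSH drop; the checkpoint-and-resume logic still has to cover a stop or the auto-shutdown timer.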

---

## 7. TOP GOTCHAS (platform-pinned; universal ones → `references/gotchas_universal.md`)

- **PS1 — "Stopped the machine, still getting billed."**
  Symptom: GPU meter halted but the bill keeps climbing while the box is off.
  Root cause: shut-down stops only the **compute** meter; attached **storage** + **public IP** + add-ons +
  snapshots bill hourly until destroyed/released (verified DO pricing 2026-06).
  Fix: to truly stop the meter, **destroy the machine, release the Static IP, delete the storage and any
  snapshot** — separate teardown actions. Audit for orphaned storage/IPs/snapshots after every stop.

- **PS2 — A long run dies at a round-number wall-clock with no error.**
  Symptom: training vanishes at exactly 6 h / 12 h (or the configured Core window); no traceback.
  Root cause: the **auto-shutdown clock**, not a crash — free notebooks 6 h (hard cap), paid notebooks 12 h
  default, Core 1 h–1 wk. On Core/Linux the clock is **wall-clock, not idle** — an active SSH/tmux session
  does NOT extend it (verified DO manage-auto-shutdown 2026-06).
  Fix: checkpoint to `/storage` (Notebooks) or the block disk (Core) **before** the window; for Core, raise
  the auto-shutdown to the longest needed **in the console** (API/CLI cannot change it post-create);
  restart + load-latest to resume.

- **PS3 — `pip install` (or any non-`/storage` write) vanishes after a Notebook restart.**
  Symptom: packages installed in-session are gone next session; "saved" files disappear after stop/restart.
  Root cause: `pip` writes to `/usr/local/lib`, which is **ephemeral workspace** — only `/storage` and
  `/notebooks` persist (verified fast.ai forum + DO storage-architecture 2026-06). "Machines are snapshots,
  not servers," so in-session installs do not persist.
  Fix: install into a persisted dir — `pip install --user <pkg>` (persists only when the home dir sits
  under a persisted tree) or `pip install --target /storage/pyenv <pkg> && export PYTHONPATH=/storage/pyenv`;
  write all checkpoints/logs/outputs under `/storage`; verify they landed (`ls`/checksum) before stop.

- **PS4 — Automation 404s / silently no-ops / installs the wrong SDK.**
  Symptom: a `gradient`-era create/stop call fails or does nothing; or `pip install gradient` (v3+) imports
  an inference SDK with no notebook/machine commands.
  Root cause: **legacy Gradient REST endpoints deprecated 15 Jul 2024**; **`gradient-cli` v2 deprecated**;
  **`gradient-python` v3 is the DigitalOcean Gradient AI inference SDK — a name collision**, not the
  orchestration CLI (verified github.com/Paperspace/gradient-cli + digitalocean/gradient-python 2026-06).
  Fix: for new work use the **`pspace` CLI** (github.com/Paperspace/cli); to keep old scripts alive pin
  `pip install "gradient<3.0"`. Pin and verify the CLI binary + version in any automation.

- **PS5 — Custom template / storage / volume "not found" in a different datacenter.**
  Symptom: a saved template or block volume is unavailable when launching elsewhere; block-storage resize
  can't be undone.
  Root cause: storage and templates are **region/DC-locked**, and **block-storage expansion is
  irreversible** (one-way filesystem grow).
  Fix: pick the datacenter deliberately and keep storage+compute+template co-located; size block storage
  with headroom up-front (cannot shrink).

- **PS6 — SSH alias breaks after every restart.**
  Symptom: the saved `ssh` host no longer connects after a machine restart.
  Root cause: a **Dynamic public IP** is released on power-off and reassigned on start (new IP each time).
  Fix: attach a **Static IP** for stable SSH/endpoint addressing (it bills until deleted, capped at $3/mo;
  see §5 / `PS1`), or re-resolve the address on each start before scripting. Note: API/CLI can manage a
  *static* IP but cannot add a *dynamic* one to an existing machine (request dynamic at create time).

- **PS7 — Free-tier notebook code is PUBLIC by default.**
  Symptom: proprietary/confidential code is world-readable in a Gradient free notebook.
  Root cause: free Gradient notebooks are **public by default; private notebooks require a paid plan**
  (verified Paperspace blog / pricing 2026-06).
  Fix: never put confidential code or any secret in a free notebook; upgrade to a paid plan for private
  notebooks. Treat the free tier as a public scratchpad. (Secrets hygiene → `references/gotchas_universal.md`.)

- **PS8 — Free notebook won't start / sits "pending."**
  Symptom: a free-GPU notebook stays pending or errors "out of capacity"; only one notebook will run.
  Root cause: free tier = **1 concurrent running notebook, ≤5 projects, 5 GB `/storage`**, and free machines
  are pooled — a pending notebook is queued for the next free machine (verified Paperspace free-instances
  docs + blog 2026-06).
  Fix: expect queueing on free; stop the other free notebook (only one runs); for assured access use a paid
  instance type, which skips the free queue.

- **PS9 — A destroyed machine keeps billing via a leftover snapshot.**
  Symptom: machine destroyed, yet a small monthly charge persists.
  Root cause: **snapshots are a separate resource that survives a machine destroy** and bills at
  `$0.29/GB/mo` until deleted; auto-snapshot defaults to "Never"/0 but a manually-enabled policy (daily by
  default, up to 10 stored) silently accrues (verified DO pricing + blog/automated-snapshots 2026-06).
  Fix: when tearing down, delete the snapshot too (console or CLI); audit the snapshots list after every
  machine destroy. Capped per-resource by the monthly maximum but still a bleed.

- **PS10 — Notebook upload/import fails on the 5 GB free cap.**
  Symptom: uploading a multi-GB dataset to `/storage` fails for an unpaid account.
  Root cause: free `/storage` allowance is **5 GB**; overage is **$0.29/GB/mo** (paid plans include more:
  e.g. 200 GB / 1 TB tiers) (verified Paperspace pricing + fast.ai forum 2026-06).
  Fix: stream/stage the dataset rather than uploading the whole thing, prune aggressively, or upgrade the
  plan; redirect HF/torch caches off `/storage` if they would push over the allowance.
|
|
299
|
+
|
|
300
|
+
- **PS11 — ML-in-a-Box CUDA/driver too old for current PyTorch on a new-arch GPU.**
|
|
301
|
+
Symptom: `The NVIDIA driver on your system is too old (found version 110xx). Please update your GPU
|
|
302
|
+
driver`, or `no kernel image is available for execution` on a fresh card.
|
|
303
|
+
Root cause: the template's **host driver/CUDA stack lags newer PyTorch wheels**; on a rental the host
|
|
304
|
+
driver is host-global and a tenant usually cannot upgrade it (verified github.com/Paperspace/ml-in-a-box
|
|
305
|
+
issue #13 2026-06). This is the platform-pinned face of the universal CUDA-triangle (U28).
|
|
306
|
+
Fix: install a torch build matching the box's CUDA (do not force-upgrade the host driver on a rental);
|
|
307
|
+
pick a template whose Ubuntu/driver matches the GPU (22.04 for H100/A100). Full triangle → U28 in
|
|
308
|
+
`references/gotchas_universal.md`.
|
|
309
|
+
|
|
310
|
+
- **PS12 — Gradient Deployment / custom image won't pull or drifts.**
|
|
311
|
+
Symptom: a Deployment fails to pull `<user>/img:tag`, or "the same image" behaves differently over time.
|
|
312
|
+
Root cause: a moving tag (`:latest`) resolves to a different layer set; private-registry creds missing.
|
|
313
|
+
Fix: pin the image by digest (`@sha256:`) and supply registry creds as a Gradient **secret**, not inline.
|
|
314
|
+
General form → U30 in `references/gotchas_universal.md`.
|
|
315
|
+
|
|
316
|
+
- **PS13 — Platform-specific debugging.** Commands + what to check (Core uses standard Linux tooling; the
|
|
317
|
+
Notebook-only items are the platform delta):
|
|
318
|
+
- **Confirm GPU + driver/torch match:** `nvidia-smi` (driver/CUDA version) then
|
|
319
|
+
`python -c "import torch;print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"` —
|
|
320
|
+
a mismatch here is `PS11`/U28, not a code bug.
|
|
321
|
+
- **Find what is eating the 5 GB / over-allowance `/storage` (the platform's own recommended cmd):**
|
|
322
|
+
`du -sch .[!.]* * | sort -h` (or `!du -sch …` in a cell); install `ncdu` for an interactive view
|
|
323
|
+
(verified DO notebooks/how-to/manage-storage 2026-06). Check `df -h` AND `df -i` (inodes, U7).
|
|
324
|
+
- **Is a Notebook write durable?** `df -h /storage /notebooks` and confirm the target is one of those two
|
|
325
|
+
mounts — anything else (incl. `/usr/local/lib`) is ephemeral (`PS3`).
|
|
326
|
+
- **Why did the run vanish?** Walk the universal ladder (U3): `dmesg | grep -iE 'killed process|out of
|
|
327
|
+
memory'` (OOM?), `uptime` (recent reboot = auto-shutdown fired, `PS2`), `nvidia-smi` (GPU idle = died,
|
|
328
|
+
not hung). An `uptime` reset that lands near a round-number window (the 6h/12h ceiling) with a clean `dmesg` ⇒ auto-shutdown, not a crash.
|
|
329
|
+
- **Detect a stuck/slow download:** watch the target file size grow
|
|
330
|
+
(`watch -n5 'ls -l /storage/<file>'`); a flat size with a live process = stalled wire (U12 resumable
|
|
331
|
+
loop). Egress is direct/unproxied here, so a stall is route/peer, not a missing proxy hook.
|
|
332
|
+
- **Audit orphaned billables before declaring teardown done:** in the console (or `pspace`) list
|
|
333
|
+
machines, **public IPs**, **storage/volumes**, and **snapshots** — `PS1`/`PS9` hide in the last two.
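
The stuck-download check in the list above can be scripted instead of watched by eye. A minimal sketch, assuming a hypothetical helper name `is_stalled` and that a flat file size across a few polls is a good-enough stall signal. It cannot tell a stalled transfer from a finished one, so confirm separately that the download process is still alive.

```python
import os
import time


def is_stalled(path: str, interval_s: float = 5.0, samples: int = 3) -> bool:
    """Return True if the file size did not change across `samples` polls.

    Hypothetical helper mirroring `watch -n5 'ls -l <file>'`.
    A missing file counts as size 0, so a download that never starts
    also reads as stalled.
    """
    def size() -> int:
        try:
            return os.path.getsize(path)
        except OSError:
            return 0

    first = size()
    for _ in range(samples):
        time.sleep(interval_s)
        if size() != first:
            return False  # size changed between polls: not stalled
    return True
```

On a Notebook this would be pointed at a file under `/storage`; a `True` result with a live process is the cue to switch to the resumable loop (U12).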
|
|
334
|
+
|
|
335
|
+
---
|
|
336
|
+
|
|
337
|
+
## 8. SCRIPT OVERRIDES
|
|
338
|
+
|
|
339
|
+
Values to parameterize the `scripts/` templates for Paperspace. Forward-slash paths; placeholders for any
|
|
340
|
+
host/IP (never a real address). Core and Gradient differ — both shown.
|
|
341
|
+
|
|
342
|
+
```sh
# --- Gradient Notebook ---
DATA_DIR=/storage # team-shared persistent; survives stop AND notebook delete
DURABLE_DIR=/storage # checkpoints land here (NOT /notebooks — dies with the notebook)
SCRATCH=/tmp # ephemeral workspace; wiped on stop — never the only copy
HF_HOME=/storage/.cache/huggingface # redirect cache off ephemeral workspace (watch the 5 GB free cap, PS10)
PROXY_HOOK= # none — direct egress (no network_turbo)
CRED_FILE="" # Paperspace keys are Gradient secrets / env vars, not files — WANDB_API_KEY/HF_TOKEN arrive via the secret/env (run_one's [ -n "$CRED_FILE" ] guard skips the file read); never write keys to /storage (team-shared)
DETACH= # no clean tmux; Jupyter kernel + hard 6h/12h auto-shutdown ceiling
# NOTE: pip into /storage to persist — pip install --target /storage/pyenv <pkg> && export PYTHONPATH=/storage/pyenv (PS3)

# --- Core machine (preferred for daemonized training) ---
DATA_DIR=/path/to/blockstore # placeholder — the attached block disk mount
DURABLE_DIR=/path/to/blockstore/ckpts
SCRATCH=/tmp
HF_HOME=/path/to/blockstore/.cache/huggingface
PROXY_HOOK= # none
CRED_FILE="" # Paperspace keys are Gradient secrets / env vars, not files — WANDB_API_KEY/HF_TOKEN arrive via the secret/env (run_one's [ -n "$CRED_FILE" ] guard skips the file read); inject at launch, never inline
DETACH=tmux # survives SSH drop, NOT a machine stop, and NOT the wall-clock auto-shutdown — rely on checkpoint+resume
SSH_HOST="<machine-ip>" # placeholder — ML-in-a-Box user is `paperspace`; pin a Static IP for a stable alias (PS6); dynamic IP changes every start
```
|
|
363
|
+
|
|
364
|
+
Reminder: secrets referenced by env-var NAME or Gradient secret only — never inline a key, and never write
|
|
365
|
+
one onto the team-shared `/storage` (universal secrets gotcha → `references/gotchas_universal.md`).
|
|
@@ -0,0 +1,164 @@
|
|
|
1
|
+
---
platform: runpod
kind: ssh-rental
meter_stop_verb: terminate # stop releases the GPU but STILL bills volume disk at 2×; only terminate halts the meter
meter_stop_irreversible: true # terminate deletes container + volume disk; only a Network Volume survives
detach_primitive: tmux # apt-install first; survives SSH drop, NOT a Pod stop/restart
spot_available: true
spot_grace: ~5s # SIGTERM → SIGKILL window on Spot/interruptible preemption
shared_fs: false # global networking = private IP only; a Network Volume is shared within ONE datacenter, not a global FS
inode_cap: none # per-tier GB quotas, no documented inode cap
free_egress: true # no egress fees; download/upload to the open internet is free
china_mirror_needed: false # no mainland-China DC, no GFW — use HF_HUB_ENABLE_HF_TRANSFER=1, not a mirror
host_driver_cuda_max: image-dependent # host driver varies per machine; pick via the CUDA-Version filter (RP9)
local_nvme: true
---
|
|
16
|
+
|
|
17
|
+
# RunPod — platform profile
|
|
18
|
+
|
|
19
|
+
One-line purpose: the per-platform substrate for RunPod Pods — **the Docker image IS the env contract**, a three-tier storage model where the durable mount differs from the parking one, and a teardown verb (`terminate`) that DELETES the volume disk. Read this before Phase 0; it owns every path, port, billing verb, and spot rule the SKILL.md phases delegate here. Universal gotchas are NOT repeated — see `references/gotchas_universal.md`.
|
|
20
|
+
|
|
21
|
+
> **Surface to the user up front (principle #10):** convenience — RunPod's HTTP proxy auto-HTTPS-exposes TB/Jupyter (no tunnel). ⚠️ Danger clocks — a **stopped Pod still bills its volume disk at 2×** and may restart with **zero GPUs** (RP1/RP4), so stop is NOT safe parking; a **low account balance auto-deletes** the Pod; the **~5 GB container disk** silently fills (redirect caches, §8). Decouple durable state onto a Network Volume + **terminate** to truly stop the meter.
|
|
22
|
+
|
|
23
|
+
To jump: `grep -in <keyword> profiles/runpod.md` (e.g. `terminate`, `network volume`, `scp`, `zero-gpu`, `CUDA`, `interruptible`).
|
|
24
|
+
|
|
25
|
+
Table of contents: 1 LAUNCH · 2 STORAGE MODEL · 3 NETWORK · 4 SPOT / INTERRUPTION + RESUME · 5 TEARDOWN / BILLING · 6 DAEMON TOOL · 7 TOP GOTCHAS (+ Platform-specific debugging) · 8 SCRIPT OVERRIDES
|
|
26
|
+
|
|
27
|
+
**Mental-model shift vs AutoDL (the one fact that breaks portability):** AutoDL persists `/root` across a power-off, so "power off (关机) to save money, restart later" is safe. On RunPod a stopped Pod is **pinned to one physical machine and its GPU can be rented away** (zero-GPU-on-restart, RP1), AND it still bills the volume disk at 2× (RP4). Stop is *not* a safe parking spot. Decouple durable state onto a **Network Volume** and **terminate** to truly stop the meter.
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## 1. LAUNCH
|
|
32
|
+
|
|
33
|
+
A Pod = one Docker container on a GPU host. Five entry points to the same primitive:
|
|
34
|
+
|
|
35
|
+
- **Web console** — pick GPU + a template (Docker image), Secure or Community Cloud, On-Demand or Spot/interruptible. A template (image + ports + env + volume mount) is the unit of reproducibility.
|
|
36
|
+
- **`runpodctl` CLI** — `runpodctl create pod --imageName=<img> --gpuType=<id>`, then `start|stop|remove pod <id>`, `get pod`. Every official-template Pod ships `runpodctl` pre-installed with a pod-scoped key (verified docs.runpod.io/runpodctl 2026-06).
|
|
37
|
+
- **REST API** — the current first-class automation surface: `POST /v1/pods` (and `/pods/{id}/start|stop`, `DELETE`). The create body takes `cloudType: SECURE|COMMUNITY` and **`interruptible: true|false`** for Spot (verified docs.runpod.io/api-reference/pods/POST/pods 2026-06). **(NEW — current fact)** The newer REST create-Pod input has **no `bidPerGpu` field**; interruptible is a plain boolean. The legacy **GraphQL** `podRentInterruptable`/`bidPerGpu` bid mutation still exists for the old API surface — if a script sets a bid, it is on the GraphQL path, not REST.
|
|
38
|
+
- **Python SDK** — `runpod` pip package, wraps the API + the serverless-worker SDK.
|
|
39
|
+
- **Custom Docker image** — any image works; official RunPod templates pre-configure an SSH daemon + a `/start.sh`, but a **custom image must start `sshd` itself** and must use **`CMD`, not `ENTRYPOINT`** (RP10).
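
The REST create body described in the list above can be sketched as a small builder. This is an illustration under stated assumptions, not the API's definitive schema: `cloudType` and `interruptible` come from the text, while `imageName` and `gpuTypeIds` are assumed field names and `build_create_pod_body` is a hypothetical helper. It only builds the JSON and sends nothing.

```python
import json


def build_create_pod_body(image: str, gpu_type_id: str, *, spot: bool,
                          secure: bool = True) -> str:
    """Build a JSON body for a create-Pod request (no network call).

    `imageName` and `gpuTypeIds` are ASSUMED field names; check them
    against the API reference before use.
    """
    body = {
        "imageName": image,           # assumed field name
        "gpuTypeIds": [gpu_type_id],  # assumed field name
        "cloudType": "SECURE" if secure else "COMMUNITY",
        "interruptible": spot,        # plain boolean; no bid field on the REST path
    }
    return json.dumps(body)
```

Keeping the Spot choice as a single boolean makes it obvious when a script is still carrying a bid value that belongs to the legacy GraphQL path.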
|
|
40
|
+
|
|
41
|
+
**Env contract — the image IS the env.** RunPod hands over a container the caller specifies, not a prebuilt base conda env (the AutoDL model). Pin the image by `@sha256:` digest, not `:latest`, for reproducibility. "Running in `base` is fine" still holds (the container is ephemeral) — but any env, conda/pip install, or code that lives **outside the volume mount (`/workspace`)** vanishes on stop (§2). Install long-lived envs under `/workspace`, or bake them into the image.
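
A launcher can refuse moving tags before it creates a Pod. A minimal sketch, assuming a hypothetical helper `is_digest_pinned` and the standard `@sha256:<64 hex chars>` digest suffix:

```python
import re

# A digest-pinned reference ends in @sha256: followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True only for references pinned by digest, not by tag."""
    return bool(_DIGEST_RE.search(image_ref))
```

A reference such as `user/img:latest` fails the check, which is the case this profile warns about.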
|
|
42
|
+
|
|
43
|
+
---
|
|
44
|
+
|
|
45
|
+
## 2. STORAGE MODEL *(survival matrix — principle #4)*
|
|
46
|
+
|
|
47
|
+
Three tiers, each with different survival semantics. This is the most error-prone area on RunPod.
|
|
48
|
+
|
|
49
|
+
| Tier | Path | Speed | Cap | Price (verified docs.runpod.io/pods/pricing + storage/types 2026-06) |
|---|---|---|---|---|
| Container disk | `/` (overlay fs, system-managed) | local NVMe | GB quota; **default ~5 GB if not raised** | $0.10/GB/mo running, **not charged when stopped** |
| Volume disk (per-Pod) | `/workspace` (default) | local NVMe | GB quota, **grow-only** | $0.10/GB/mo running, **$0.20 stopped (2×)** |
| Network Volume | `/workspace` (Pods) · `/runpod-volume` (Serverless) · `/workspace` per-node (Instant Clusters) | networked | 4 TB soft ceiling (**>4 TB needs support**) | Standard $0.07/GB/mo (→$0.05 over 1 TB); **High-Performance $0.14/GB/mo (~3× throughput)** |
|
|
54
|
+
|
|
55
|
+
**Survival matrix:**
|
|
56
|
+
|
|
57
|
+
| Tier | Survives STOP? | Survives TERMINATE? | Portable across Pods? |
|---|---|---|---|
| Container disk | **No** (wiped on stop) | No | No |
| Volume disk | **Yes** (retained until Pod deleted) | **No** (deleted on terminate) | No (pinned to that Pod) |
| Network Volume | Yes | **Yes** | **Yes** (shareable within ONE datacenter) |
|
|
62
|
+
|
|
63
|
+
**Checkpoints MUST go to a Network Volume** if `terminate` is the intended teardown verb (§5) — the per-Pod volume disk is deleted by terminate, so durable-but-only-stop-safe state on `/workspace` is lost the moment the meter is truly stopped.
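
The survival matrix is small enough to encode as data, so a script can ask which tiers are safe for a given teardown verb. A minimal sketch; the tier keys and the `safe_tiers` helper are hypothetical names, and the values are copied from the matrix above.

```python
# Survival semantics per storage tier, copied from the survival matrix.
SURVIVAL = {
    "container_disk": {"stop": False, "terminate": False},
    "volume_disk": {"stop": True, "terminate": False},
    "network_volume": {"stop": True, "terminate": True},
}


def safe_tiers(action: str) -> list[str]:
    """Return the tiers whose data survives `action` ("stop" or "terminate")."""
    return [tier for tier, row in SURVIVAL.items() if row[action]]
```

Asking for `"terminate"` leaves only the Network Volume, which is why checkpoints belong there when terminate is the teardown verb.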
|
|
64
|
+
|
|
65
|
+
Critical properties:
|
|
66
|
+
- **Container disk default is tiny (~5 GB)** — pip wheels, the HF cache, apt and conda all land on `/` by default and silently fill it; raise container-disk size at create time OR redirect every cache onto `/workspace` (RP11, §7-debug).
|
|
67
|
+
- **Volume disk grows, never shrinks** — over-provision conservatively; shrinking requires a fresh Pod (verified docs.runpod.io/pods/storage/types: "Increase only" 2026-06).
|
|
68
|
+
- **Network Volume is datacenter-locked** — attaching one constrains all future GPU deployment to that DC, which "may limit GPU availability and reduce failover options" (verified docs.runpod.io/pods/storage/create-network-volumes 2026-06); on a Pod it must be attached **at creation and cannot be detached later** (RP7). Cross-DC moves are manual: rsync/`runpodctl` between two bridge Pods, or the **S3-compatible API** (manage files without launching compute).
|
|
69
|
+
- **Concurrent-write corruption** — "writing to the same volume from multiple workers simultaneously may cause data corruption" (verified same page 2026-06). Serialize writers; for parallel-ablation fan-out give each cell an **isolated write path** (see `references/parallel_ablation.md`).
|
|
70
|
+
- **No documented inode cap** — RunPod specs GB quotas, not inode counts. Audit GB usage with `du` on the actual mount; the `df -i` discipline from `references/gotchas_universal.md` still applies on any small-many-files eval tree, but there is no AutoDL-style hard ~200K ceiling.
|
|
71
|
+
- **Network Volumes cannot be encrypted** and are visible to every attached Pod — never write a secret there (§8).
|
|
72
|
+
- **Global networking ≠ shared FS** — RunPod global networking gives Pods a private IP (`<POD_ID>.runpod.internal`) for Pod-to-Pod traffic, NOT a shared filesystem (verified docs.runpod.io/pods/networking 2026-06). Shared *storage* is still a Network Volume, single-DC.
|
|
73
|
+
|
|
74
|
+
---
|
|
75
|
+
|
|
76
|
+
## 3. NETWORK
|
|
77
|
+
|
|
78
|
+
- **Egress / proxy / China mirror: N/A.** Free egress, regions across NA + Europe + Oceania + Asia-Pacific (e.g. `AP-IN-1` India added 2026-04), **no mainland-China datacenter** (verified runpod.io/blog/new-runpod-datacenter-now-live-ap-in-1 2026-06). No `/etc/network_turbo` equivalent and no China mirror needed; `pip`/`hf`/`apt` reach the open internet directly. For HF big-shard stalls the fix is **not** a mirror — `pip install huggingface_hub[hf_transfer]` + `export HF_HUB_ENABLE_HF_TRANSFER=1`, and point `HF_HOME` at the Network Volume so re-downloads survive Pod churn (RP-G4 / RP11 below; transport verbs → huggingface-skills:hf-cli **REQUIRED**).
|
|
79
|
+
- **Two ways to expose a service** (verified docs.runpod.io/pods/configuration/expose-ports 2026-06):
|
|
80
|
+
1. **HTTP proxy** — `https://<POD_ID>-<INTERNAL_PORT>.proxy.runpod.net`, auto-HTTPS. **Hard 100 s Cloudflare timeout** — a service that doesn't respond within 100 s closes with a **524**; long/streaming/large-payload requests die. Fine for TensorBoard (6006) / Jupyter (8888) UI; bites WebSockets and long polls.
|
|
81
|
+
2. **Direct TCP** — public IP + a **random external port** that changes on every Pod reset. Required for SSH-scp, DBs, WebSockets, long polls. Request a port number **above 70000** in the TCP config to get a **symmetric (external == internal) mapping** ("not valid port numbers, but signal Runpod to allocate matching internal and external ports").
|
|
82
|
+
- One port cannot be exposed on both HTTP and TCP simultaneously.
|
|
83
|
+
- **Public IP stability differs by cloud (NEW — current fact):** Community Cloud public IPs **may change on migration/restart**; Secure Cloud IPs "should remain stable" (verified expose-ports 2026-06). A pinned SSH target is safer on Secure Cloud.
|
|
84
|
+
- **SSH flavors — proxied SSH cannot transfer files.** *Basic SSH* proxies through `ssh.runpod.io` (works everywhere but **does NOT support `scp`/`sftp`/`rsync`**). *Full SSH* is direct-TCP to the Pod's public IP on exposed port 22 (supports `scp`/`rsync`, needs a public-IP Pod + TCP 22 exposed + SSH daemon running + the key on the account). For bulk code/data transfer, full SSH is mandatory (RP6). Without a public IP, **`runpodctl send` / `receive`** (one-time code, no API key, pre-installed) moves files — but it is rated for **small-to-medium files only**; use full-SSH rsync for large datasets (RP13). SSH-config + resumable-rsync patterns → `references/ssh_transport.md`.
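
The HTTP-proxy URL pattern and its 100 s limit from the list above can be sketched as two small helpers. The function names are hypothetical; the URL shape and the timeout value are the ones quoted in this section.

```python
PROXY_TIMEOUT_S = 100  # Cloudflare limit quoted in the text above


def proxy_url(pod_id: str, internal_port: int) -> str:
    """Build the HTTP-proxy URL for a service listening on `internal_port`."""
    if not 1 <= internal_port <= 65535:
        raise ValueError(f"not a valid internal port: {internal_port}")
    return f"https://{pod_id}-{internal_port}.proxy.runpod.net"


def needs_direct_tcp(expected_response_s: float, long_lived: bool = False) -> bool:
    """True when the proxy would cut the request short.

    Long-lived connections (WebSockets, long polls) and responses that take
    as long as the limit or longer should go over direct TCP instead.
    """
    return long_lived or expected_response_s >= PROXY_TIMEOUT_S
```

A TensorBoard UI on port 6006 fits the proxy; a streaming endpoint would be routed to direct TCP.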
|
|
85
|
+
|
|
86
|
+
---
|
|
87
|
+
|
|
88
|
+
## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
|
|
89
|
+
|
|
90
|
+
Two purchase modes, two distinct interruption vectors:
|
|
91
|
+
|
|
92
|
+
- **Spot / interruptible** — set `interruptible: true` (REST) or bid via legacy GraphQL. Roughly **50% cheaper** than On-Demand (verified runpod.io/blog/spot-vs-on-demand-instances-runpod: e.g. A6000 spot $0.232 vs on-demand $0.491/gpu/hr 2026-06; marketing elsewhere cites "up to 60%"). Interruption is **"without notice"** — another user's On-Demand request can reclaim the GPU. Detection signal: **`SIGTERM`, then `SIGKILL` ~5 s later** — only enough to flush a flag or trigger an already-frequent checkpoint, NOT to write a fresh large checkpoint.
|
|
93
|
+
- **On-Demand** — non-interruptible while running, but carries the sneakier **zero-GPU-on-restart** trap (RP1): a stopped Pod is pinned to its host, and if that GPU is rented away the Pod can only restart **with zero GPUs** ("there are no GPUs available on the machine where your Pod was running" — verified docs.runpod.io/references/faq 2026-06). Use it as a data-recovery startup, not a compute one.
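
Because the grace window is only a few seconds, the signal handler should do nothing but record the request. A minimal sketch, assuming a hypothetical `PreemptionFlag` class that the training loop polls between steps:

```python
import signal


class PreemptionFlag:
    """Records a SIGTERM so the training loop can stop at a safe point."""

    def __init__(self) -> None:
        self.requested = False

    def _handle(self, signum, frame) -> None:
        # Keep the handler tiny: the window before SIGKILL is too short
        # to write a large checkpoint from here.
        self.requested = True

    def install(self) -> "PreemptionFlag":
        # Must be called from the main thread (a Python restriction).
        signal.signal(signal.SIGTERM, self._handle)
        return self
```

A loop would call `PreemptionFlag().install()` once at startup and check `flag.requested` after each step, relying on the timer-driven checkpoints for durability instead of the last-second flush.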
|
|
94
|
+
|
|
95
|
+
**Both vectors demand the same design:** checkpoint full state **continuously on a timer to a Network Volume** (atomic temp→fsync→rename), load-latest **unconditionally** on startup, and relaunch on a **fresh host** — never assume the same machine/GPU is available after a stop. The ~5 s grace is an opportunistic last-flush only, never the primary durability mechanism. Cadence formula (Young/Daly) and atomic-resume pattern → `references/spot-resilience.md`.
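
The temp→fsync→rename write and the load-latest step can be sketched as follows. The helper names and the `.ckpt` suffix are hypothetical, and the payload is treated as opaque bytes. One assumption to verify on the actual mount: `os.replace` is atomic on a local POSIX filesystem, and a networked volume may not give the same guarantee.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write `data` to `path` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays
    # on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before the rename
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave half-written temp files behind
        raise


def latest_checkpoint(directory: str, suffix: str = ".ckpt"):
    """Return the most recently modified checkpoint path, or None."""
    candidates = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(suffix)
    ]
    return max(candidates, key=os.path.getmtime) if candidates else None
```

On startup a run would call `latest_checkpoint` on the Network Volume directory unconditionally and start from scratch only when it returns `None`.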
|
|
96
|
+
|
|
97
|
+
---
|
|
98
|
+
|
|
99
|
+
## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*
|
|
100
|
+
|
|
101
|
+
| Action | Stops compute billing? | Stops storage billing? | Deletes data? |
|---|---|---|---|
| **Stop** | Yes (releases GPU) | **No — bills volume disk at 2× ($0.20/GB/mo)** | No, but GPU may be lost on restart (zero-GPU, RP1) |
| **Terminate** | Yes | Yes (for that Pod) | **Yes — deletes container + volume disk, irreversible.** Only a Network Volume survives |
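
The cost gap between parking on a stopped volume disk and on a Network Volume is simple arithmetic. A minimal sketch using the per-GB rates quoted in this profile, assuming a 30-day month and the Standard tier below 1 TB (the over-1-TB discount is not modeled); `idle_cost` is a hypothetical helper.

```python
VOLUME_DISK_STOPPED = 0.20       # $/GB/month, stopped volume disk (2x rate)
NETWORK_VOLUME_STANDARD = 0.07   # $/GB/month, Standard Network Volume


def idle_cost(gb: float, days: float, rate_per_gb_month: float,
              days_per_month: float = 30.0) -> float:
    """Storage cost in dollars for keeping `gb` idle for `days`."""
    return gb * rate_per_gb_month * days / days_per_month
```

For 100 GB parked for 30 days this comes to about $20 on a stopped volume disk versus about $7 on a Standard Network Volume.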
|
|
105
|
+
|
|
106
|
+
- **Stop is a trap, not a safe park.** It does not stop the meter (volume disk keeps billing, *doubled*), and it risks zero-GPU lock-out. A long-stopped Pod quietly bleeds money — `terminate` + Network Volume is cheaper for any idle gap longer than a short pause.
|
|
107
|
+
- **Terminate is the meter-stop verb AND it is destructive.** "Terminating permanently deletes all data not stored in a network volume. Export important data first." (verified docs.runpod.io/pods/manage-pods 2026-06). Move every needed artifact to a Network Volume (then billed at $0.07/GB/mo) or off-platform **before** terminating. If checkpoints are still only on the per-Pod **volume disk** at teardown time, `rsync` them to a Network Volume **or pull them local first** — a Network Volume cannot be attached to an existing Pod after creation (§2 / RP7), so this rescue must happen while the Pod is still alive.
|
|
108
|
+
- **Low-balance auto-stop → silent deletion (NEW — billing trap).** When the account balance can no longer cover remaining runtime, RunPod **auto-stops all Pods**; storage then keeps accruing on the stopped volume disk, and **a depleted balance can have Pods + storage deleted with no backup** ("Runpod cannot restore data once a resource has been terminated due to insufficient balance… does not maintain backups" — verified contact.runpod.io Data-Loss-on-Low-Balance 2026-06). Separately, **stale stopped Pods are removed after ~30 days** of non-use. Disk charges are **non-refundable**. Net: a forgotten Pod first drains credit, then loses data — enable Auto-Pay or terminate-with-Network-Volume before walking away.
|
|
109
|
+
- **Billing granularity:** compute + container/volume disk bill **per second**; Network Volumes bill **hourly** (verified docs.runpod.io/references/billing-information 2026-06).
|
|
110
|
+
- Savings Plans are prepaid 3- or 6-month non-refundable commitments — a separate billing knob, orthogonal to stop/terminate.
|
|
111
|
+
|
|
112
|
+
> **Teardown Iron Law (SKILL.md Phase 5):** NO `terminate` until checkpoints are **pulled to local OR confirmed present on a Network Volume, AND verified by load**, and the user has explicitly approved the cost-affecting action. On RunPod the meter-stop verb is irreversible by design and there is **no backup safety net** (low-balance deletion above) — "it looked done in the log" is not evidence (principle #3). Cross-link: superpowers:verification-before-completion **REQUIRED**.
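
The Iron Law can be expressed as a gate that a teardown script must pass before it calls terminate. A minimal sketch; `safe_to_terminate` is a hypothetical helper, and `load` stands in for whatever actually deserializes the checkpoint (for example a framework's load function).

```python
from typing import Callable, Iterable


def safe_to_terminate(paths: Iterable[str], load: Callable[[str], object],
                      user_approved: bool) -> bool:
    """Allow teardown only if every checkpoint loads and the user approved."""
    if not user_approved:
        return False
    paths = list(paths)
    if not paths:
        return False  # nothing verified means nothing is safe
    for path in paths:
        try:
            load(path)  # verify by load, not by log line or file size
        except Exception:
            return False
    return True
```

The paths passed in should be the copies that will outlive the Pod (local pulls or Network Volume files), not the originals on the per-Pod volume disk.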
|
|
113
|
+
|
|
114
|
+
---
|
|
115
|
+
|
|
116
|
+
## 6. DAEMON TOOL
|
|
117
|
+
|
|
118
|
+
- **tmux** — available but **not installed by default**: `apt-get update && apt-get install -y tmux`. Survives an SSH disconnect; **does NOT survive a Pod restart/stop** (sessions are process-scoped to the container). `screen`/`nohup` are likewise process-scoped — use `nohup <cmd> </dev/null >log 2>&1 &` if tmux is unavailable.
|
|
119
|
+
- **Native queue: Serverless** — RunPod's request→worker→result→scale-to-zero system. `executionTimeout` and `ttl` each cap at **7 days** (TTL is a hard kill even mid-job). It is request/response-shaped, designed for inference/batch — **the wrong tool for interactive long training**.
|
|
120
|
+
- **For multi-day training: Pod + tmux + frequent checkpoints to a Network Volume**, orchestrated via `runpodctl`/REST. The detach primitive (tmux) is the swappable plug; the checkpoint-to-durable + resume-from-latest spine (principle #8) is what actually survives the restart tmux cannot.
|
|
121
|
+
|
|
122
|
+
---
|
|
123
|
+
|
|
124
|
+
## 7. TOP GOTCHAS (platform-pinned; universal ones → `references/gotchas_universal.md`)
|
|
125
|
+
|
|
126
|
+
- **RP1 — Zero-GPU-on-restart.** Symptom: a stopped Pod restarts with no GPU attached and refuses compute work ("Zero GPU Pods"). Root cause: a stopped Pod stays bound to its physical host; another user rented that GPU while it was stopped. Fix: keep all durable state on a **Network Volume**, terminate instead of stop, relaunch on a fresh host. (verified docs.runpod.io/references/faq 2026-06)
|
|
127
|
+
- **RP2 — Container disk wiped on stop.** Symptom: code, conda/pip env, or checkpoints gone after a stop. Root cause: only `/workspace` (volume disk) or a Network Volume survives a stop; container disk (`/`) is cleared. Fix: install envs and write all state under `/workspace` (or the Network Volume).
|
|
128
|
+
- **RP3 — Terminate deletes the volume disk irreversibly.** Symptom: one `remove pod` loses all checkpoints. Root cause: terminate permanently deletes container + volume disk; only a Network Volume persists. Fix: move artifacts to a Network Volume (or local) and verify-by-load before terminating (Iron Law, §5).
|
|
129
|
+
- **RP4 — Stopped storage costs double.** Symptom: a "stopped to save money" Pod keeps charging, faster than expected. Root cause: stopped volume disk bills at $0.20/GB/mo (2× the running rate) and never reaches zero. Fix: for idle gaps, terminate-with-Network-Volume instead of stopping.
|
|
130
|
+
- **RP5 — HTTP-proxy 100 s Cloudflare timeout.** Symptom: long/streaming/large-payload requests return 524 through `*.proxy.runpod.net`. Root cause: a fixed 100 s Cloudflare proxy timeout. Fix: use direct TCP (a port above 70000) for WebSockets, long polls, and big payloads; reserve the HTTP proxy for short UI requests.
|
|
131
|
+
- **RP6 — Basic (proxied) SSH cannot scp/rsync; external TCP port changes on every reset.** Symptom: bulk upload/download fails over `ssh.runpod.io`, or a hardcoded external SSH/service port stops working after a restart. Root cause: proxied basic SSH does not support `scp`/`sftp`/`rsync`, and external port mappings (and Community-Cloud public IPs) are re-assigned on every reset. Fix: use full direct-TCP SSH (public IP + TCP 22 + key on account), and never hardcode the external port — re-read it from Connect → TCP after each (re)start (Secure Cloud IPs are stabler than Community).
|
|
132
|
+
- **RP7 — Network Volume is DC-locked and cannot detach.** Symptom: GPU availability is unexpectedly constrained, or a Network Volume cannot be moved off a Pod. Root cause: a Network Volume pins all future deployment to its datacenter and must be attached at Pod creation, never detached. Fix: choose the DC deliberately up front; do cross-DC moves via bridge-Pod rsync or the S3 API.
|
|
133
|
+
- **RP8 — Low-balance auto-stop then silent deletion.** Symptom: Pods vanish and unrecoverable data is gone after the account ran low; or a Pod kept charging "daily" while doing nothing. Root cause: a depleted balance auto-stops Pods (storage still billing), and depleted-balance / 30-day-stale Pods get deleted with **no backups kept**. Fix: enable Auto-Pay or terminate-with-Network-Volume before leaving a Pod idle; treat the Network Volume / local pull as the only safety net (§5). (verified contact.runpod.io 2026-06)
|
|
134
|
+
- **RP9 — CUDA forward-compat error (host driver too old).** Symptom: container runs locally but on RunPod throws `CUDA failure 804: forward compatibility was attempted on non supported HW`, or `cuda>=12.x, please update your driver`, or `OCI runtime create failed`. Root cause: the assigned machine's NVIDIA host driver is older than the image's CUDA needs (e.g. driver 525.x under a CUDA 12.1 image). Fix: in the deploy dialog use **Additional filters → CUDA Version** to require a machine whose driver meets the image's minimum; or pick an image matching the available driver. (verified github.com/runpod/containers/issues/67 2026-06)
|
|
135
|
+
- **RP10 — `ENTRYPOINT` in a custom image silences the template start command.** Symptom: a custom image deploys but never starts `sshd` / the handler / `/start.sh`; the container runs the wrong process and SSH never comes up. Root cause: an image `ENTRYPOINT` cannot be overridden by the RunPod template's "container start command" (which only overrides `CMD`). Fix: use `CMD ["/start.sh"]` (not `ENTRYPOINT`) in the Dockerfile so the template override works. (verified github.com/runpod/runpodctl/issues/170 2026-06)
|
|
136
|
+
- **RP11 — Container disk (~5 GB) fills, not the volume disk.** Symptom: "No space left on device" mid-`pip install` / mid-download even though `/workspace` has free GB. Root cause: pip wheels, the HF cache, apt and conda default to `/` (the small ~5 GB overlay), not `/workspace`. Fix: raise container-disk size at create time, AND redirect caches onto the volume — `export HF_HOME=/workspace/hf PIP_CACHE_DIR=/workspace/.cache/pip`, install conda envs under `/workspace`. Diagnose with the §7-debug commands. (verified docs.runpod.io/pods/troubleshooting/storage-full 2026-06)
|
|
137
|
+
- **RP12 — Env vars set on the Pod are missing inside a full-SSH (over-TCP) session.** Symptom: `WANDB_API_KEY` / `HF_TOKEN` / template env vars are empty when reached via full SSH, though they exist in the web terminal / basic SSH. Root cause: the SSH daemon's login shell does not inherit the container env set on PID 1 at startup. Fix: snapshot at boot in the start command (`env > /workspace/.env_vars.txt`) and source it in the SSH session, or write the vars into `/etc/environment` / `~/.bashrc`. (verified leimao.github.io Setting-Up-Environment-Variables-SSH-Over-TCP-Runpod 2026-06)
|
|
138
|
+
- **RP13 — `runpodctl send/receive` is only for small/medium files.** Symptom: a large dataset transfer via `runpodctl send` is slow or unreliable. Root cause: the one-time-code transfer is positioned for "quick, occasional, small-to-medium" exchanges, not bulk data. Fix: use full-SSH `rsync` (RP6) or the Network-Volume S3 API for large datasets; keep `send/receive` for keyless one-off pulls on no-public-IP Pods. (verified docs.runpod.io/runpodctl/transfer-files 2026-06)
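
For RP12, a Python entry point can load the boot-time snapshot itself instead of relying on the SSH shell to source it. A minimal sketch, assuming the snapshot is plain `env` output (one `KEY=VALUE` per line); `load_env_snapshot` is a hypothetical helper, and multi-line values are not handled, so passing an explicit allow-list of names is the safer use.

```python
import os


def load_env_snapshot(path: str, only: tuple = ()) -> dict:
    """Load KEY=VALUE lines from `path` into os.environ.

    If `only` is non-empty, just those variable names are imported.
    Lines without '=' are skipped.
    """
    loaded = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, sep, value = line.rstrip("\n").partition("=")
            if not sep or not key or (only and key not in only):
                continue
            loaded[key] = value
    os.environ.update(loaded)
    return loaded
```

The snapshot file itself holds secrets, so it inherits the rule from §8: keep it off any Network Volume.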
|
|
139
|
+
|
|
140
|
+
### Platform-specific debugging
|
|
141
|
+
|
|
142
|
+
Quick checks when a RunPod Pod misbehaves (run inside the Pod unless noted):
|
|
143
|
+
|
|
144
|
+
- **Which disk is full?** `df -h` — read the **`overlay`** row (= container disk `/`, often only ~5 GB) separately from the **`/workspace`** row (volume / Network Volume). A full `overlay` with a near-empty `/workspace` is RP11, not a real out-of-space. Largest offenders: `find /workspace -type f -exec du -h {} + | sort -rh | head -n 10` (swap `/workspace` for `/` to hunt container-disk bloat). If files deleted in JupyterLab didn't free space, empty `~/.local/share/Trash/` and `/workspace/.Trash*`. (verified docs.runpod.io/pods/troubleshooting/storage-full 2026-06)
|
|
145
|
+
- **GPU actually attached?** `nvidia-smi` — if it errors or shows no device, suspect zero-GPU-on-restart (RP1) or a driver/CUDA mismatch (RP9). Cross-check the image's CUDA vs the host driver: `nvcc --version` (image) against the driver line in `nvidia-smi` (host).
|
|
146
|
+
- **Stuck initializing / image pull?** A Pod looping in "initializing" is usually a slow/failing image pull or a throttled machine. Watch the **container logs** (web console → the Pod's *Logs* tab, or `runpodctl get pod <id>`); cloning the template to a different machine / cloud often unsticks it.
|
|
147
|
+
- **SSH won't connect on a custom image?** Confirm `sshd` is actually running (`ps aux | grep sshd`), TCP 22 is exposed, and the Dockerfile used `CMD` not `ENTRYPOINT` (RP10); confirm the public key is on the account and matches the local private key.
|
|
148
|
+
- **Env var missing over SSH?** `env | grep <VAR>` in the SSH shell vs the web terminal — divergence is RP12.
|
|
149
|
+
- **Detect a stuck/zombie download:** watch the target grow — `watch -n5 'du -sh /workspace/hf 2>/dev/null; ls -la <partial-file>'`; a `.incomplete`/`.part` file whose size is frozen means a stalled HF pull → re-run with `HF_HUB_ENABLE_HF_TRANSFER=1` (§3). For a robust remote ssh-poll loop, see `references/gotchas_universal.md` U17.
|
|
150
|
+
- **Billing reality check:** the running meter and remaining-balance runtime live in the web console billing page; do not trust "it should be cheap because it's stopped" — a stopped Pod still bills the volume disk at 2× (RP4) and a low balance silently deletes (RP8).
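
The "which disk is full" check can also run from inside a training script, so a job fails early instead of mid-download. A minimal sketch, assuming the default mount points named in this profile; `free_gb` is a hypothetical helper and paths that do not exist are skipped.

```python
import os
import shutil


def free_gb(paths=("/", "/workspace")) -> dict:
    """Return free space in GB for each existing path."""
    report = {}
    for path in paths:
        if os.path.isdir(path):
            report[path] = shutil.disk_usage(path).free / 1e9
    return report
```

A low value for `/` next to a healthy `/workspace` points at RP11 (caches landing on the container disk).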
|
|
151
|
+
|
|
152
|
+
---
|
|
153
|
+
|
|
154
|
+
## 8. SCRIPT OVERRIDES
|
|
155
|
+
|
|
156
|
+
Values to parameterize the `scripts/` templates for RunPod:
|
|
157
|
+
|
|
158
|
+
- `DATA_DIR=` `/workspace` (the per-Pod volume disk) — stop-safe working state (code, conda/pip env, in-progress outputs survive a stop, not a terminate).
|
|
159
|
+
- `DURABLE_DIR=` a **Network Volume** mount (`/workspace` on Pods, `/runpod-volume` on Serverless) — terminate-safe durable checkpoints. Point `DURABLE_DIR` at the Network Volume when `terminate` is the teardown verb so `best` checkpoints survive Pod deletion AND the low-balance auto-delete (RP8).
|
|
160
|
+
- `PROXY_HOOK=` none. No China mirror. Instead `export HF_HUB_ENABLE_HF_TRANSFER=1` (after `pip install huggingface_hub[hf_transfer]`).
|
|
161
|
+
- `CRED_FILE=""` — no credential file on disk; the key is a RunPod secret / env var injected at Pod creation, so `WANDB_API_KEY` / `HF_TOKEN` arrive via the platform env and `run_one`'s `[ -n "$CRED_FILE" ]` guard skips the file read. **Caveat (RP12):** a full-SSH-over-TCP login shell may NOT see these env vars — snapshot them at boot (`env > /workspace/.env_vars.txt`) and source in the SSH session if a script reads them there. **NEVER** write a key to a Network Volume — it is unencryptable and shared across every attached Pod.
|
|
162
|
+
- `SCRATCH=` periodic/`latest` checkpoints under the Network Volume; keep `best` only (`save_top_k` small). Pruning matters more here — the volume disk grows-only and stopped storage is double-priced (RP4).
|
|
163
|
+
- `HF_HOME=` a path on the Network Volume (e.g. `/workspace/hf` on a Network-Volume-backed Pod) so model caches survive Pod churn instead of re-downloading — AND to keep the cache off the tiny ~5 GB container disk (RP11). Likewise `PIP_CACHE_DIR=/workspace/.cache/pip`.
|
|
164
|
+
- `DETACH=` `tmux` (after `apt-get install -y tmux`); fall back to `nohup … </dev/null >log 2>&1 &`. Neither survives a Pod restart — checkpoint-to-Network-Volume is the resilience layer.
|