opencode-skills-collection 3.1.2 → 3.1.3

Files changed (65)

package/bundled-skills/remote-gpu-trainer/profiles/lambda.md ADDED

@@ -0,0 +1,342 @@
+---
+platform: lambda
+kind: cloud-api               # REST API / web console / SSH to a normal Ubuntu VM
+meter_stop_verb: terminate    # the ONLY action that stops billing; sudo shutdown does NOT
+meter_stop_irreversible: true # terminate wipes local NVMe — there is no stop/suspend state
+detach_primitive: tmux        # plain Ubuntu; tmux/screen/nohup, install if absent
+spot_available: false         # no spot/preemptible tier; interruption is capacity-at-launch
+spot_grace: n/a               # no mid-run eviction → no grace window
+shared_fs: true               # region-locked NFS filesystem, attach-at-launch only
+inode_cap: none               # no documented inode cap; GiB quota only
+free_egress: true             # no ingress/egress fees on instances or filesystems
+china_mirror_needed: false    # US/global cloud, direct egress; no platform proxy
+host_driver_cuda_max: lambda-stack-dependent  # Lambda Stack bundles driver+CUDA+PyTorch; version moves per release — read nvidia-smi on the box, do NOT assume a number
+local_nvme: true              # ephemeral root/local NVMe, gone on terminate
+---
+# Lambda Cloud — Profile
+Lambda Cloud is a **cattle-not-pets** GPU cloud: on-demand + reserved instances, a prebuilt **Lambda
+Stack** image, and **no stop/suspend state** — an instance can only be **launched, restarted, or
+terminated**, and terminate destroys the local NVMe. Nothing on the box survives a teardown except what was
+pushed off-box or written to an attached **region-locked NFS filesystem**. This inverts the AutoDL "shutdown
+keeps data" instinct: here, durable design (checkpoint-to-NFS + idempotent resume) is **mandatory, not optional**.
+> **Surface to the user up front (principle #10):** ⚠️ Danger clocks — there is **no stop/suspend**: an instance can only be launched / restarted / **terminated, and terminate wipes the local NVMe** — only the attached **NFS filesystem** survives, and **it keeps billing until you delete it manually** (LAM6). Conveniences — one-click **JupyterLab** per instance, free egress both directions. A terminate→relaunch yields a **new IP**.
+> Docs/console domain moved from `lambdalabs.com` to `lambda.ai` (docs at `docs.lambda.ai`, console at
+> `cloud.lambda.ai`); the **REST API base is still `cloud.lambdalabs.com/api/v1`** and `cloud.lambda.ai`
+> also resolves (verified docs.lambda.ai + cloud-api 2026-06). Treat both hosts as live.
+To jump: `grep -in <keyword> profiles/lambda.md`.
+**Table of contents** — 1. LAUNCH · 2. STORAGE MODEL (survival matrix) · 3. NETWORK ·
+4. SPOT / INTERRUPTION + RESUME · 5. TEARDOWN / BILLING · 6. DAEMON TOOL · 7. TOP GOTCHAS (LAM1–LAM13) +
+Platform-specific debugging · 8. SCRIPT OVERRIDES.
+Universal gotchas (CRLF, inode/`df -i`, silent sync, cgroup OOM, spot grace) are NOT repeated here —
+see `references/gotchas_universal.md`. Universal invariants → `references/principles.md`.
+---
+## 1. LAUNCH
+Entry points:
+- **Web console** at `cloud.lambda.ai` → Instances → Launch (pick GPU type + region, attach a filesystem
+  here if one is needed — see §2; attach any per-instance firewall ruleset here too — see §3/LAM4).
+- **REST API** — `https://cloud.lambdalabs.com/api/v1`, auth `curl -u $LAMBDA_API_KEY:` (basic-auth,
+  password empty). Canonical automation surface (verified docs.lambda.ai/api/cloud 2026-06):
+  - `GET  /instance-types` — lists every GPU type **and** `regions_with_capacity_available[]` per type.
+    This field IS the capacity signal — poll it to know where a type can launch right now (drives LAM5
+    retry-until-available).
+  - `POST /instance-operations/launch` · `.../terminate` · `.../restart` — create / stop-meter / reboot.
+- **SSH** — standard connection to a normal Ubuntu VM; **default user is `ubuntu`** (not `root`); use
+  `sudo` for root. One-click **JupyterLab** is offered per instance.
+- **SkyPilot** — de-facto orchestration layer: `pip install "skypilot[lambda]"`, key file at
+  `~/.lambda_cloud/lambda_keys` containing a line `api_key = <KEY>` (verified docs.skypilot.co 2026-06).
+  Use it for retry-until-capacity + autostop (§4, §6).
+**Env contract — the image/base IS the env.** Instances ship **Lambda Stack** (NVIDIA driver + CUDA +
+cuDNN + PyTorch/TensorFlow, all upgraded together as one apt metapackage). Run in it directly on the
+throwaway box — do **not** `conda create` on a rental (`references/principles.md` §2), and do not `pip
+install torch` over the top (LAM7/LAM8). Lambda Stack's exact CUDA/driver/PyTorch **moves per release**;
+read it off the box (`nvidia-smi`, `python -c "import torch;print(torch.__version__,torch.version.cuda)"`)
+rather than assuming a number. The **durable** form of the env is a Docker image (Lambda recommends running
+Docker inside the instance) or a setup script replayed on each launch — because terminate destroys the box.
+Reserved / 1-Click Clusters provide flat-rate multi-node (own billing model — LAM12).
+> **verify:** `ssh ubuntu@<IP> 'python -c "import torch;print(torch.cuda.is_available())"'` → `True`.
+---
+## 2. STORAGE MODEL  *(survival matrix — principle #4)*
+Two tiers, and the trap is that the default working location is the **volatile** one.
+- **Local / root NVMe** — fast, per-instance, **ephemeral**. Docs: *"Data not stored in the mount location
+  is erased once you terminate your instance and cannot be recovered"* (verified docs.lambda.ai
+  creating-managing-instances 2026-06). This is where work lands by default.
+- **NFS filesystem** — a regional network filesystem mounted at `/lambda/nfs/<name>` (docs example mount:
+  `/lambda/nfs/persistent-storage`). **The only durable home.** Three hard constraints (verified
+  docs.lambda.ai/public-cloud/filesystems 2026-06):
+  - **Region-locked** — *"The filesystem must reside in the same region as the instance or cluster"* and
+    *"Filesystems cannot currently be transferred between regions."* Pick the region deliberately at create.
+  - **Attach-at-launch only** — *"You must attach the filesystem … at the time that the instance … is
+    launched"* and *"You can't attach a filesystem after you've created an instance."*
+  - Billed **$0.20/GiB/month in 1-hour increments**, **free ingress/egress**; **up to 24 filesystems per
+    account**; most regions allow up to 8 EB/filesystem but **us-south-1 (Texas) caps at 10 TB**.
+- **No documented inode cap** — GiB quota only; no `df -i` ceiling surfaced (still audit `df -i` per the
+  universal storage gotcha).
+| Tier | Path | Survives RESTART? | Survives TERMINATE? | Cap |
+|---|---|---|---|---|
+| Local / root NVMe | `/`, `/home/ubuntu` | yes (data persists; **but cold reboot wipes RAM** — LAM9) | **NO** (erased, unrecoverable) | instance root volume |
+| NFS filesystem | `/lambda/nfs/<name>` | yes | **yes** (separate lifecycle; keeps billing — LAM6) | GiB quota; 10 TB in us-south-1, up to 8 EB in most other regions |
+**Checkpoints MUST go to** `/lambda/nfs/<name>` (the durable tier) so they survive the §5 `terminate` verb. A
+checkpoint left on local NVMe dies with the box. If no filesystem was attached at launch, the only durable
+path is to `pull` the result off-box (free egress) before terminating.
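The filesystem rate quoted above ($0.20/GiB/month) makes the standing cost of a durable tier easy to estimate. A minimal sketch of that arithmetic follows; the function name is hypothetical, and it assumes the charge scales linearly with the billed size and ignores the hourly proration the profile mentions.

```python
# Rate quoted in this profile; passed as a default so it can be overridden.
NFS_RATE_USD_PER_GIB_MONTH = 0.20


def nfs_monthly_cost(size_gib: float,
                     rate: float = NFS_RATE_USD_PER_GIB_MONTH) -> float:
    """Estimate the monthly charge in USD for a filesystem of size_gib GiB.

    Assumption: cost is linear in billed GiB; hourly proration is ignored.
    """
    if size_gib < 0:
        raise ValueError("size_gib must be non-negative")
    return round(size_gib * rate, 2)
```

For example, `nfs_monthly_cost(500)` returns `100.0`, which is the charge a forgotten 500 GiB filesystem would keep accruing each month after every instance is gone (LAM6).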
+---
+## 3. NETWORK
+- **Direct, unproxied egress.** US/global cloud — egress to HF / GitHub / PyPI is direct; **no
+  `network_turbo`-style accelerator exists**, and none is needed. China-mirror relevance is **N/A as a
+  platform feature** (relevant only when operating from inside China; then `references/china-network.md`
+  applies to the user's own setup, nothing platform-provided).
+- **Free egress both directions** — *"Transparent pricing with no egress fees"* (verified lambda.ai
+  pricing 2026-06). Re-pulling a large model or pushing results off-box costs nothing, making
+  "pull-before-terminate" the cheap, safe default when no NFS is attached.
+- **Firewall** — default allows *"only incoming ICMP traffic or TCP traffic on port 22 (SSH)"*. Open more
+  via **global rules** (apply workspace-wide) or **per-instance rulesets** (region-scoped). Per-instance
+  rulesets: *"You must attach rulesets during the instance launch process. You can't attach them after the
+  instance has been launched"* and *"You can't remove rulesets from an instance after the instance has been
+  launched"* (verified docs.lambda.ai/public-cloud/firewalls 2026-06) → plan port exposure before launch
+  (gotcha LAM4). Global rules can still be edited on the workspace afterward.
+- **Exposing TB / Jupyter** — instances get a public IP; tunnel over SSH rather than opening ports:
+  `ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>`. No platform-pinned TensorBoard dir —
+  run TB on `:6006` against the logdir under the NFS mount.
+- **SSH flavor** — direct TCP to a normal VM (`ubuntu@<IP>`); full `scp`/`rsync` work, no proxy-jump quirk.
+  **No static IP feature** — *"On-Demand Cloud doesn't support static IP addresses"* (verified DeepTalk
+  staff 2026-06). The IP is fixed for an instance's life, but **terminate→relaunch yields a NEW IP**
+  (LAM10) — re-read it from the console/API every launch; never hard-code it in automation.
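Because the IP changes on every relaunch, the tunnel command is best built from a freshly read address rather than stored. A small sketch, assuming the caller has already obtained the current IP from the console or API; the helper name is hypothetical, and `-N` (a standard OpenSSH flag meaning "no remote command") is an addition not present in the profile's own command.

```python
from typing import Iterable, List


def ssh_tunnel_argv(ip: str,
                    ports: Iterable[int] = (8888, 6006),
                    user: str = "ubuntu") -> List[str]:
    """Build an ssh argv that forwards each local port to the same port on the box."""
    argv = ["ssh", "-N"]  # -N: forward ports only, do not open a remote shell
    for port in ports:
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(f"{user}@{ip}")
    return argv
```

The returned list can be handed to `subprocess.run(...)` as is; passing the IP as an argument on each launch keeps it out of any persisted SSH config.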
+---
+## 4. SPOT / INTERRUPTION + RESUME  *(principle #7/#8)*
+**No spot / preemptible tier — and no mid-run eviction.** This is the key divergence from vast.ai/RunPod:
+there is **no SIGTERM→SIGKILL grace window to survive**, because a running instance is never evicted
+mid-epoch. The interruption model is different in kind:
+- **Capacity-at-launch is the real failure.** The desired GPU type may be **unavailable when launch is
+  attempted** — Lambda has **no spot tier to fall back to**, and real-world on-demand fill rates are
+  spiky (one published 6-month log: ~64% same-day A100 success — i.e. ~1 in 3 attempts blocked; a 26 h
+  "temporarily unavailable" stall scaling 2→4 H100; verified medium.com/@velinxs 2026-06). H100/B200
+  capacity is the tightest. The resilience pattern is **retry-until-available**, not survive-eviction:
+  poll `GET /instance-types` for `regions_with_capacity_available` and `POST .../launch` the moment a
+  region appears (or let SkyPilot's provisioner retry across regions/types).
+- **Self-inflicted termination only.** Once running, the only destructive events are an operator
+  `terminate`, or an **improper `sudo shutdown`** that pushes the box to **Alert** while still billing
+  (LAM3 / §5), or a **cold reboot** that wipes RAM (LAM9).
+- **Resume hook** — checkpoint full state to the NFS filesystem on a periodic timer, load-latest
+  unconditionally on startup, so a fresh post-capacity launch resumes instead of restarting. Because the
+  box is cattle, the resume path is exercised on *every* relaunch, not just after a rare preemption.
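The retry-until-available pattern above can be sketched as a small polling loop. This is an illustration under assumptions, not Lambda's client: `fetch` is a caller-supplied function that performs `GET /instance-types` and returns the decoded JSON, the payload is assumed to be keyed by type name under `data` (the shape implied by the `jq` probe in §7), and each region entry is assumed to be either a plain string or an object with a `name` field.

```python
import time
from typing import Callable, List, Optional


def regions_with_capacity(payload: dict, instance_type: str) -> List[str]:
    """Extract region names that can launch instance_type right now."""
    entry = payload.get("data", {}).get(instance_type, {})
    regions = entry.get("regions_with_capacity_available", [])
    # Assumption: a region is either a string or a dict carrying "name".
    return [r["name"] if isinstance(r, dict) else r for r in regions]


def wait_for_capacity(fetch: Callable[[], dict],
                      instance_type: str,
                      poll_seconds: float = 30.0,
                      max_polls: Optional[int] = None,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Poll until some region has capacity; return it, or None if max_polls runs out."""
    polls = 0
    while max_polls is None or polls < max_polls:
        regions = regions_with_capacity(fetch(), instance_type)
        if regions:
            return regions[0]  # caller launches here, then resumes from NFS
        polls += 1
        sleep(poll_seconds)
    return None
```

Injecting `fetch` and `sleep` keeps the loop testable without network access; the region it returns only helps if the durable filesystem lives in that same region (LAM2).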
+Cadence formula (Young/Daly) + atomic-write resume → `references/spot-resilience.md`. Here the formula's
+μ is effectively "time between voluntary relaunches," not a preemption rate.
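As a worked illustration of the cadence formula, the sketch below uses Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × μ). The referenced `spot-resilience.md` is not shown here, so this form and the reading of μ as mean time between interruptions (here, voluntary relaunches) are assumptions.

```python
import math


def young_interval(checkpoint_cost_s: float, mu_s: float) -> float:
    """Young's approximation for the checkpoint interval, in seconds.

    checkpoint_cost_s: time to write one checkpoint to the durable tier.
    mu_s: mean time between interruptions (assumed meaning of the formula's mu).
    """
    if checkpoint_cost_s <= 0 or mu_s <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mu_s)
```

With a 60 s checkpoint write and a relaunch roughly once a day (86,400 s), the interval comes out a little under 54 minutes. Since μ on this platform is large, the formula tends to allow sparse checkpoints; a shorter fixed timer may still be preferable when a single lost interval is expensive.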
+---
+## 5. TEARDOWN / BILLING  *(principle #9 + the Iron Law)*
+**TERMINATE is the meter-stop verb — and it is irreversible.** *"Billing begins the moment you launch an
+instance and the instance passes health checks, and ends the moment you terminate the instance"*, billed
+in **one-minute increments**, *"regardless if they're actively being used"* (verified
+docs.lambda.ai/public-cloud/billing 2026-06).
+> **The shutdown trap (most error-prone fact on this platform):** *"Do not use commands such as `sudo
+> shutdown -h now` or `sudo systemctl poweroff` … These commands will not work as expected and will cause
+> your instances to go into Alert status, and billing will continue"* (verified docs.lambda.ai 2026-06).
+> Also `halt` / `shutdown -P 0` only stop the OS, not the meter (DeepTalk staff). Stop the meter **only**
+> via `terminate` from the console or `POST /instance-operations/terminate` — which works even from inside
+> the instance itself.
+What each action preserves:
+- **terminate** — stops the instance meter; **erases the local NVMe** (unrecoverable). The NFS filesystem
+  has a **separate lifecycle** and survives — but it **keeps billing $0.20/GiB/month until explicitly
+  deleted** (*"Billing continues as long as a filesystem exists, even if it's not mounted to an instance"*),
+  so a filesystem forgotten after its instances are terminated is a silent ongoing charge (LAM6).
+- **There is no stop/suspend state** — *"It currently isn't possible to pause (suspend) your instance …
+  Your only options are to launch, restart, or terminate"* (verified docs.lambda.ai 2026-06). Idle-cheap
+  pause is impossible; the only way to stop paying for compute is to destroy the box and rebuild later.
+- **restart / cold reboot** — does **not** stop the meter and does **not** wipe disk, but a **cold reboot
+  erases RAM and bypasses safe shutdown** — reserve it for a frozen box only (LAM9).
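Since `terminate` is the only meter-stop verb, automation typically ends with one call to the terminate endpoint. The sketch below only builds the request and sends nothing. The base URL, path, and basic-auth scheme come from §1; the JSON body shape (`instance_ids`) is an assumption not stated in this profile and should be checked against the Cloud API reference.

```python
import base64
import json
import urllib.request
from typing import Sequence

API_BASE = "https://cloud.lambdalabs.com/api/v1"


def build_terminate_request(api_key: str,
                            instance_ids: Sequence[str]) -> urllib.request.Request:
    """Build (but do not send) a POST to /instance-operations/terminate."""
    if not instance_ids:
        raise ValueError("at least one instance id is required")
    # Basic auth with the API key as username and an empty password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    # Assumed body shape: {"instance_ids": [...]}.
    body = json.dumps({"instance_ids": list(instance_ids)}).encode()
    return urllib.request.Request(
        f"{API_BASE}/instance-operations/terminate",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )
```

Keeping construction separate from sending leaves room for the approval gate: the request would be passed to `urllib.request.urlopen` only after checkpoints are verified and the user has approved the teardown.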
+**Iron Law (SKILL.md Phase 5):** NO `terminate` until checkpoints are **pulled to local OR confirmed on
+NFS by load-test** AND the user approves the cost-affecting action. Because terminate is destructive and
+irreversible, an unverified `cp`/`rsync` to NFS means **permanent loss** — verify the sync (checksum /
+`ls -l` / a load) before terminating, not after. Egress is free, so a belt-and-suspenders `pull` to local
+is cheap. Cross-link: `superpowers:verification-before-completion` (REQUIRED) for the general gate.
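One way to carry out the "verify the sync" step is a size-then-checksum comparison between the local checkpoint and its copy on NFS (or the pulled copy). A minimal sketch with hypothetical helper names; it confirms the bytes match but does not replace the load-test the Iron Law asks for.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_verified(source: Path, copy: Path) -> bool:
    """Return True only if both files exist, have equal size, and equal digests."""
    if not (source.is_file() and copy.is_file()):
        return False
    if source.stat().st_size != copy.stat().st_size:
        return False  # cheap early exit for truncated copies
    return sha256_of(source) == sha256_of(copy)
```

A wrapper could refuse to build the terminate request unless `copy_is_verified(...)` is true for every checkpoint it intends to keep.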
+---
+## 6. DAEMON TOOL
+- **Detach primitive: `tmux`** (or `screen` / `nohup`) on a standard Ubuntu VM — same playbook as the
+  AutoDL tmux pattern. Install if absent (`sudo apt install -y tmux`); fall back to
+  `nohup … </dev/null >log 2>&1 &`.
+- **Survives an SSH drop, NOT a terminate.** tmux keeps the job alive across a dropped connection, but
+  with no stop state the detach primitive can't survive a teardown — only the **checkpoint-to-NFS +
+  idempotent resume** spine does (principle #8). tmux is the SSH-resilience layer; the checkpoint is the
+  instance-resilience layer. (tmux also won't survive a cold reboot — LAM9.)
+- **Native orchestration: SkyPilot** (managed jobs, autostop, retry-until-capacity) + **1-Click
+  Clusters** for multi-node; no platform job-queue otherwise. SkyPilot moves the box on capacity loss but
+  **restarts the process from scratch — the checkpoint-load restores progress** (don't assume the
+  framework resumes training state).
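The checkpoint-to-NFS + idempotent resume spine can be sketched as an atomic write plus an unconditional load-latest. This is an illustration under assumptions: `pickle` stands in for whatever serializer the training framework uses, the `ckpt_<step>.pkl` naming is hypothetical, and the write-to-temp-then-rename pattern is assumed to behave atomically on the mounted filesystem.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


def save_checkpoint(state: Any, ckpt_dir: Path, step: int) -> Path:
    """Write state to a temp file in ckpt_dir, then rename it into place."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"ckpt_{step:08d}.pkl"  # zero-padded: sorts by step
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to storage before the rename
        os.replace(tmp_name, final_path)  # readers never see a partial file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path


def load_latest(ckpt_dir: Path) -> Optional[Any]:
    """Return the newest checkpoint's state, or None on a first launch."""
    if not ckpt_dir.is_dir():
        return None
    candidates = sorted(ckpt_dir.glob("ckpt_*.pkl"))
    if not candidates:
        return None
    with open(candidates[-1], "rb") as handle:
        return pickle.load(handle)
```

Calling `load_latest` at every startup, and starting fresh only when it returns `None`, makes a relaunch and a first launch follow the same code path.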
+---
+## 7. TOP GOTCHAS  (Lambda-pinned — universal ones live in `references/gotchas_universal.md`)
+- **LAM1 — Terminate erases the local NVMe; there is no stop/suspend.**
+  Symptom: relaunched instance is blank, yesterday's run gone. → Root cause: local storage is ephemeral
+  (*"Data not stored in the mount location is erased … and cannot be recovered"*) and no stop state
+  preserves it; the AutoDL "shutdown keeps my data" assumption is false. → Fix: design every workflow around
+  destroy/recreate — checkpoint to `/lambda/nfs/<name>` or `pull` off-box before any terminate; never keep
+  the only copy on local NVMe. (docs.lambda.ai 2026-06)
+- **LAM2 — Filesystem is attach-at-launch only and region-locked.**
+  Symptom: a running instance has no durable storage and one can't be added; or a us-east filesystem won't
+  mount on a us-west instance. → Root cause: filesystems attach only at create time and can't move between
+  regions. → Fix: decide the region and attach the filesystem **at launch**; co-locate instance +
+  filesystem in the same region. (filesystems doc 2026-06)
+- **LAM3 — `sudo shutdown` / `poweroff` keeps the meter running (Alert state).**
+  Symptom: instance "powered off" but the bill keeps climbing. → Root cause: an in-OS shutdown sends the
+  instance to **Alert** without stopping billing; `halt`/`shutdown -P 0` only stop the OS, not the meter.
+  → Fix: stop the meter only via **terminate** (console or `POST /instance-operations/terminate`); never
+  rely on an in-box poweroff. (billing doc + DeepTalk staff 2026-06)
+- **LAM4 — Per-instance firewall rulesets are immutable post-launch.**
+  Symptom: a needed inbound port can't be opened (or a wrong one removed) on a live instance. → Root cause:
+  per-instance rulesets *"must [be attached] during the instance launch process"* and *"can't [be removed]
+  after the instance has been launched."* → Fix: plan port exposure before launch, use an editable
+  **global** rule, or tunnel over SSH (`-L`, §3) instead of opening a port. (firewalls doc 2026-06)
+- **LAM5 — Capacity, not eviction, is the bottleneck (no spot fallback).**
+  Symptom: launch fails / dashboard shows the desired GPU type unavailable; long stalls scaling up. → Root
+  cause: on-demand supply for a specific GPU/region is exhausted (worst for H100/B200), and there is no
+  spot tier to fall back to. → Fix: poll `GET /instance-types` for `regions_with_capacity_available` and
+  launch the instant a region appears (or use SkyPilot's cross-region/type provisioner); resume from the
+  NFS checkpoint once granted (§4). (cloud-api doc + medium.com/@velinxs 2026-06)
+- **LAM6 — The NFS filesystem keeps billing after the instance is gone.**
+  Symptom: all instances terminated, but storage charges continue. → Root cause: *"Billing continues as
+  long as a filesystem exists, even if it's not mounted to an instance"* — $0.20/GiB/month until deleted.
+  → Fix: after the final `pull` + verify, **delete the filesystem** (console Storage → Delete; requires
+  terminating attached instances first) — a distinct teardown step. (billing + filesystems docs 2026-06)
+- **LAM7 — `pip install torch` over Lambda Stack silently shadows or mismatches it.**
+  Symptom: a `pip install` in `base` reports *"Defaulting to user installation because normal site-packages
+  is not writeable"* and lands in `~/.local`, or a `torch==X` pin drags in a CUDA/torchvision combo that
+  conflicts with the system build → import/CUDA errors. → Root cause: Lambda Stack PyTorch lives in
+  system `/usr/lib/python3/dist-packages` (not pip-writable as `ubuntu`); pip's user install or a hard
+  version pin diverges from it. → Fix: use the Stack's PyTorch as-is (don't reinstall), loosen pins
+  (`torch>=2.x` not `==`), or fully isolate in a fresh venv/conda env and install torch there cleanly —
+  don't half-mix pip-over-system. (DeepTalk threads 2026-06)
+- **LAM8 — conda/venv that "borrows" Stack PyTorch via system-site-packages then breaks on pip.**
+  Symptom: created a conda env to use the Stack's torch, then a later `pip install` pulls a second,
+  conflicting torch or can't write site-packages. → Root cause: mixing `--system-site-packages` (to see
+  the system torch) with pip installs into the same env creates two torch copies. → Fix: pick ONE model —
+  either run in the bare Stack base (preferred on a rental), or build a fully self-contained env with
+  `conda install pytorch torchvision` (no system-site-packages borrowing). (DeepTalk
+  bypassing-lambda-stack thread 2026-06)
+- **LAM9 — Cold reboot wipes RAM and tmux; warm restart still bills.**
+  Symptom: after a "reboot" the detached training job is gone and the box came back clean-ish. → Root
+  cause: a **cold reboot** *"erases all data currently in the instance's memory and bypasses the operating
+  system's safe-shutdown mechanisms"* — kills tmux sessions and any in-RAM state; neither reboot stops the
+  meter. → Fix: only cold-reboot a frozen box; rely on checkpoint-to-NFS, not on process survival across a
+  reboot; expect to re-`ssh` and re-`tmux attach` (session may be gone). (console doc 2026-06)
+- **LAM10 — No static IP; the public IP changes on terminate→relaunch.**
+  Symptom: automation/SSH config hard-coded to yesterday's IP fails after a relaunch. → Root cause:
+  *"On-Demand Cloud doesn't support static IP addresses"* — a fresh launch gets a fresh IP. → Fix: read
+  the IP from the console / `GET /instances` on every launch; template SSH config dynamically; never
+  hard-code it. (DeepTalk staff 2026-06)
+- **LAM11 — `apt full-upgrade` on Lambda Stack images can break cuDNN/DOCA.**
+  Symptom: after a recommended `apt-get update && apt-get upgrade` (or `full-upgrade` on 24.04 images), PyTorch/TF
+  fails to find cuDNN, or full-upgrade itself fails on a DOCA package. → Root cause: a system cuDNN bump
+  or DOCA repo state diverges from the Stack-bundled libs. → Fix: avoid blanket `full-upgrade` on a
+  rental; if cuDNN is missing, symlink the Stack copies —
+  `for so in /usr/lib/python3/dist-packages/tensorflow/libcudnn*; do sudo ln -s "$so" /usr/lib/x86_64-linux-gnu/; done`
+  (note: Stack cuDNN is usable *only* by the Stack-installed PyTorch/TF). (troubleshooting doc 2026-06)
+- **LAM12 — 1-Click Clusters / reserved bill differently than on-demand (commitment traps).**
+  Symptom: expected per-minute pricing, got a 2-week minimum / weekly invoice / a reservation that expired.
+  → Root cause: **1-Click Clusters** carry a **minimum 2-week commitment with weekly billing** (not
+  per-minute); **reserved** capacity requires Lambda approval and the **invoice must be paid within ~10
+  days or the reservation is forfeited**, on non-cancelable terms. → Fix: use plain on-demand single
+  instances for per-minute experiments; only enter a cluster/reservation with confirmed sustained need and
+  budget approval. (1-click-clusters docs + nOps/CheckThat 2026-06)
+- **LAM13 — GH200 (ARM/aarch64) breaks `pip install torch` — needs the ARM build.**
+  Symptom: on a 1× GH200 box, `pip install torch` installs a **CPU-only** wheel (no CUDA), or a pinned
+  `torch==2.2.0` fails to resolve. → Root cause: GH200 is aarch64; the default PyPI torch wheel for
+  aarch64 is CPU-only. → Fix: use Lambda Stack's pre-compiled ARM PyTorch (e.g. 2.4.1) as-is, or install
+  from the CUDA index `pip install torch --index-url https://download.pytorch.org/whl/cu128` (aarch64 GPU
+  wheels live there), or compile from source for newer versions; relax exact pins. (DeepTalk GH200 thread
+  + pytorch.org 2026-06)
+### Platform-specific debugging
+- **Confirm billing actually stopped:** after a teardown, check the instance is **gone** (not in *Alert*)
+  via the console or `curl -u $LAMBDA_API_KEY: https://cloud.lambdalabs.com/api/v1/instances` — an Alert-
+  state box (from an in-OS shutdown) is still charging (LAM3).
+- **Capacity probe before launch:** `curl -u $LAMBDA_API_KEY: .../instance-types | jq '.data | to_entries[]
+  | {type:.key, regions:.value.regions_with_capacity_available}'` — empty `regions` ⇒ that GPU type can't
+  launch anywhere right now (LAM5); this is the loop condition for retry-until-available.
+- **GPU sanity on the box:** `nvidia-smi` (driver/CUDA + util) and `python -c "import torch;
+  print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"` — mismatch between
+  `torch.version.cuda` and `nvidia-smi` CUDA usually means a pip-shadowed torch (LAM7/8/13), not a Stack
+  problem.
+- **Read the real Stack version, never assume:** `apt list --installed 2>/dev/null | grep -i lambda-stack`
+  and `dpkg -l | grep -i cudnn` — confirm before debugging a "version mismatch."
+- **Disk pressure on the ephemeral root:** `df -h /` and `df -h /lambda/nfs/<name>`; remember `/home/ubuntu`
+  is volatile — large datasets/checkpoints filling the root volume are also *lost* on terminate, so moving
+  them to NFS preserves them as well as freeing space.
+- **Detect a stalled download:** background the pull (`nohup … &`) and watch growth —
+  `watch -n5 'du -sh <target>; ls -l <target>'` (flat size for minutes ⇒ stalled; re-pull, egress is free).
+- **Stuck/unreachable after reboot:** if SSH dies post-reboot, the box may be in *Alert* or networking
+  failed to come up — check the console state and prefer a fresh **terminate→relaunch** (resume from NFS)
+  over fighting a cold-reboot that already wiped RAM (LAM9).
+---
+## 8. SCRIPT OVERRIDES
+Values to parameterize the `scripts/` templates for Lambda:
+```
+DATA_DIR=       /home/ubuntu (ephemeral NVMe — lost on terminate)
+DURABLE_DIR=    /lambda/nfs/<name>
+PROXY_HOOK=     (none — direct egress; no network_turbo)
+CRED_FILE=      ""  (Lambda key is the $LAMBDA_API_KEY env var, not a file on disk — run_one's [ -n "$CRED_FILE" ] guard skips the file read and the env var passes through; SkyPilot key file at ~/.lambda_cloud/lambda_keys, format `api_key = <KEY>`)
+SCRATCH=        prune periodic ckpts on local NVMe; keep only `best` on /lambda/nfs/<name>
+HF_HOME=        /lambda/nfs/<name>/.cache/huggingface   (durable; survives terminate, free egress on re-pull)
+DETACH=         tmux  (apt install if absent; nohup fallback)
+SSH_USER=       ubuntu   (NOT root)
+```
+Notes for the wrapper:
+- Default checkpoint dir → the NFS mount, not `/home/ubuntu` — the latter is erased on terminate.
+- If no NFS filesystem is attached, set the wrapper to `pull` checkpoints to local on the periodic timer
+  (free egress) instead of relying on durable on-box storage.
+- Re-read the instance IP from the console/API on every launch (LAM10) — never persist it in SSH config.
+- Do not `pip install torch` / blanket `apt full-upgrade` on the rental — use the Stack as-is (LAM7/8/11);
+  on GH200 use the ARM build (LAM13).
+- The teardown step is **terminate via API**, gated by the Iron Law; verify billing stopped (no *Alert*
+  state) and add an explicit reminder to **delete the NFS filesystem** (LAM6) when the project is done.
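The first two wrapper notes amount to a single decision: use the NFS mount when one is attached, otherwise switch to periodic pulls. A minimal sketch of that choice; the function name and the `checkpoints` subdirectory are hypothetical, and the presence of the mount directory is used as a proxy for "a filesystem was attached at launch".

```python
from pathlib import Path
from typing import Optional

NFS_ROOT = Path("/lambda/nfs")


def durable_checkpoint_dir(filesystem_name: str,
                           nfs_root: Path = NFS_ROOT) -> Optional[Path]:
    """Return a checkpoint dir on the NFS mount, or None if no mount exists.

    None signals the wrapper to pull checkpoints off-box on its timer
    instead of trusting /home/ubuntu, which is erased on terminate.
    """
    mount = nfs_root / filesystem_name
    if mount.is_dir():
        return mount / "checkpoints"  # hypothetical subdirectory name
    return None
```

A stricter check could also confirm the path is a mount point (for example with `os.path.ismount`), which would catch a stray local directory of the same name.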