npm - opencode-skills-collection - Versions diffs - 3.1.2 → 3.1.4 - Mend

opencode-skills-collection 3.1.2 → 3.1.4


Files changed (65)

package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md ADDED

@@ -0,0 +1,401 @@
+# Numerical precision & training stability — make it RUN, then stop it diverging
+The mechanics of getting a DL run to compute *finite* numbers fast on a rented card, and of debugging it
+when the loss goes NaN or spikes. This layer owns **make-it-run + the mechanics of divergence**; it does
+NOT own *is the converged number real* / cuDNN-nondeterminism-as-a-metric-error — that is
+**verifying-dl-experiments** (cross-link **REQUIRED** at every "is this a bug or a real effect" fork).
+To jump: `grep -in '<keyword>' references/training/precision-stability.md` (e.g. `tf32`, `bf16`, `scaler`,
+`nan`, `anomaly`, `z-loss`, `clip`, `warmup`, `qk`, `deterministic`).
+## Table of contents
+- **Precision choice** — P1 fp32/tf32/fp16/bf16 decision · P2 TF32 default-off footgun · P3 H100/A100/V100 capability
+- **AMP mechanics** — P4 autocast scope · P5 GradScaler (fp16 only) · P6 bf16 needs no scaler · P7 grad-clip under scaler
+- **NaN / Inf** — P8 where NaNs come from · P9 anomaly detection · P10 fp16 overflow vs underflow · P11 bad-data NaN
+- **Loss spikes / divergence** — P12 LR + warmup · P13 grad clipping · P14 skip-the-batch · P15 z-loss · P16 qk-norm · P17 init
+- **Gradients** — P18 explosion/vanishing diagnosis
+- **Reproducibility** — P19 determinism knobs (cross-link)
+- **Pointers** — gotchas_universal.md, multinode.md, spot-resilience.md
+---
+## Precision choice
+### P1 — Which precision: fp32 / TF32 / fp16 / bf16
+**Symptom**: unsure which `dtype` to train in; run is either slow (fp32) or NaN-prone (fp16).
+**Root cause**: the four modes trade dynamic range against mantissa precision against tensor-core speed.
+fp16 has a 5-bit exponent (max ~65504) so it *overflows* and *underflows* easily; bf16 keeps fp32's 8-bit
+exponent (same range as fp32) but only 7 mantissa bits, so it never needs loss-scaling but is coarser per
+value. TF32 is an fp32-storage mode that runs matmuls at 10 mantissa bits on tensor cores.
+**Fix — default ladder (PyTorch 2.x)**:
+1. **bf16 autocast** on Ampere+ (A100/H100/4090/...) — the modern default; same range as fp32, no GradScaler, robust. `torch.autocast("cuda", dtype=torch.bfloat16)`.
+2. **TF32** for the fp32 matmuls that remain (the non-autocast path) — `torch.set_float32_matmul_precision("high")`. A near-free speedup with negligible convergence impact for most nets (P2).
+3. **fp16 autocast + GradScaler** ONLY if stuck on a card with no bf16 tensor cores (V100/T4/2080Ti) — needs the scaler (P5) and is overflow-prone.
+4. **Pure fp32** as the diagnostic fallback: if a run NaNs, *first* prove it's finite in fp32 before blaming the model. fp32 isolates "is this a numerics bug or a model bug."
+bf16 handles large dot-products / attention logits better than fp16, which saturates and triggers
+scaler-step-skipping. URLs: https://docs.pytorch.org/docs/2.12/amp.html ·
+https://www.runpod.io/articles/guides/fp16-bf16-fp8-mixed-precision-speed-up-my-model-training
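The range-versus-precision trade-off above can be checked without a GPU. This is a minimal sketch in plain Python, assuming only the standard library. `to_fp16`, `to_bf16` and `fits_fp16` are hypothetical helpers (not PyTorch APIs) that round a Python float through each 16-bit format.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; raises OverflowError above fp16's range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Round a finite x to bfloat16: the top 16 bits of its float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def fits_fp16(x: float) -> bool:
    """True if x can be stored in fp16 without overflowing."""
    try:
        to_fp16(x)
        return True
    except OverflowError:
        return False

print(fits_fp16(65504.0), fits_fp16(70000.0))  # largest finite fp16 value vs. a value past it
print(to_bf16(70000.0))                        # in range for bf16, but rounded coarsely
print(to_fp16(1e-8), to_bf16(1e-8))            # tiny value: lost in fp16, kept in bf16
```

The same value that overflows fp16 survives in bf16 at a coarser step size, which is the reason bf16 can drop loss scaling.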
+### P2 — TF32 is OFF by default for matmul since PyTorch 1.12 — the "why is my A100 slow" footgun
+**Symptom**: an fp32 (or autocast-but-fp32-matmul-heavy) run on an A100/H100 is ~2–4× slower than expected;
+nothing is wrong with the code.
+**Root cause**: `torch.backends.cuda.matmul.allow_tf32` defaulted to **True in 1.7–1.11**, then flipped to
+**False in 1.12+** (after precision-loss complaints from non-DL users). So a fresh PyTorch 2.x box runs fp32
+matmuls at full fp32 precision, which bypasses the fast TF32 tensor-core path, unless TF32 is re-enabled.
+Convolution TF32 (`torch.backends.cudnn.allow_tf32`) is a separate knob and stays enabled by default.
+**Fix**: opt back in once at startup —
+```python
+torch.set_float32_matmul_precision("high")   # preferred: enables TF32 (or bf16x3) for fp32 matmul
+# legacy-equivalent, still works:
+torch.backends.cuda.matmul.allow_tf32 = True
+torch.backends.cudnn.allow_tf32 = True
+```
+`"high"` = TF32; `"highest"` = true fp32 (default); `"medium"` = bf16 internal precision (coarser still). HF Trainer exposes `--tf32 1`.
+Most nets converge identically with TF32 as with fp32. URLs:
+https://github.com/pytorch/pytorch/pull/76509 ·
+https://docs.pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html ·
+https://docs.pytorch.org/docs/2.12/notes/numerical_accuracy.html
+### P3 — Card capability gates the choice: bf16 needs Ampere+; V100/T4 are fp16-only
+**Symptom**: bf16 training is unexpectedly slow (no error), or a config picks bf16 on an old card and falls
+to a slow path.
+**Root cause**: fast bf16 tensor cores arrived with **Ampere (A100, RTX 30xx)**; Hopper (H100/H200) adds
+native **FP8**. **V100/T4/RTX 20xx have fp16 tensor cores but no fast bf16** (runs emulated/slow). A rental
+hands whatever card is free, so the right precision is a *per-rental* fact, not a constant.
+**Fix**: branch on capability at runtime, never hardcode —
+```python
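+use_bf16 = torch.cuda.is_bf16_supported()    # True on Ampere+
+# Caveat: recent PyTorch counts *emulated* bf16 by default, so an older card can also report True;
+# where the argument exists, pass including_emulation=False to test for native support.
+amp_dtype = torch.bfloat16 if use_bf16 else torch.float16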
+```
+On V100/T4 use fp16+GradScaler (P5). FP8 (H100) is opt-in via Transformer Engine / `torchao`, not plain
+autocast (out of scope). Record the card next to `nvidia-smi` in Phase 0.
+URL: https://www.e2enetworks.com/blog/nvidia-a100-vs-h100-vs-h200-gpu-comparison
+---
+## AMP mechanics
+### P4 — autocast: wrap ONLY forward + loss, never backward, never `.half()` the model
+**Symptom**: dtype-mismatch errors, or AMP gives no speedup, or grads look wrong.
+**Root cause**: autocast is a context that casts *eligible ops* per-op inside the region; manually
+`.half()`-ing the model or wrapping the backward pass fights it.
+**Fix**:
+```python
+for x, y in loader:
+    optimizer.zero_grad(set_to_none=True)
+    with torch.autocast("cuda", dtype=amp_dtype):   # forward + loss ONLY
+        out = model(x); loss = loss_fn(out, y)
+    # backward is OUTSIDE autocast:
+    loss.backward()                                 # (+ scaler for fp16, P5)
+    optimizer.step()
+```
+Keep the model and optimizer in fp32; do NOT call `model.half()`. Use the new `torch.amp.autocast("cuda",
+...)` / `torch.amp.GradScaler("cuda")` API — `torch.cuda.amp.*` is **deprecated** in PyTorch 2.x. autocast
+state is thread-local (re-enter it inside each DDP/DataParallel worker thread).
+URL: https://docs.pytorch.org/docs/2.12/amp.html
+### P5 — GradScaler: required for fp16 to stop gradient *underflow*
+**Symptom (no scaler, fp16)**: loss looks fine but the model doesn't learn: gradients smaller than fp16's
+smallest representable magnitude (~6e-8) flush to 0.
+**Root cause**: fp16's narrow range underflows small gradients to zero. GradScaler multiplies the loss by a
+large factor before backward (pushing grads into representable range), then unscales before the step and
+**adapts the factor**: on any inf/NaN grad it *skips the optimizer step* and halves the scale (backoff 0.5);
+after `growth_interval` (default 2000) clean steps it doubles it (growth 2.0).
+**Fix — canonical fp16 loop**:
+```python
+scaler = torch.amp.GradScaler("cuda")
+for x, y in loader:
+    optimizer.zero_grad(set_to_none=True)
+    with torch.autocast("cuda", dtype=torch.float16):
+        loss = loss_fn(model(x), y)
+    scaler.scale(loss).backward()
+    scaler.step(optimizer)     # internally unscales; SKIPS step if inf/NaN found
+    scaler.update()            # adapts the scale factor
+```
+Early-training "skipped step" warnings as the scaler calibrates are **normal**; *persistent* skips every
+step = a real overflow (go to P10). URLs:
+https://github.com/pytorch/pytorch/blob/main/docs/source/notes/amp_examples.rst ·
+https://docs.pytorch.org/docs/2.12/amp.html
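To make the scaling idea concrete, here is a small standard-library sketch of the two mechanisms described above: rescuing a gradient that would underflow, and adapting the scale. `to_fp16` and `update_scale` are hypothetical helpers, not the PyTorch implementation. The backoff, growth and interval values are the ones quoted above; the starting scale of 2**16 is an assumption made for the example.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def update_scale(scale, clean_steps, found_inf,
                 growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    """One scale update: back off on inf/NaN, grow after growth_interval clean steps."""
    if found_inf:
        return scale * backoff_factor, 0     # step was skipped; restart the clean-step count
    clean_steps += 1
    if clean_steps == growth_interval:
        return scale * growth_factor, 0
    return scale, clean_steps

true_grad = 1e-8                             # too small for fp16 on its own
scale = 65536.0                              # assumed starting scale (2**16)
unscaled_fp16 = to_fp16(true_grad)           # flushes to zero
scaled_fp16 = to_fp16(true_grad * scale)     # lands inside fp16's normal range
recovered = scaled_fp16 / scale              # unscale in full precision

print(unscaled_fp16, recovered)
print(update_scale(scale, clean_steps=0, found_inf=True))
print(update_scale(scale, clean_steps=1999, found_inf=False))
```

The gradient that vanished when stored directly comes back to within fp16 rounding error once it is scaled before storage and unscaled afterwards.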
+### P6 — bf16 needs NO GradScaler (adding one is pointless, not harmful)
+**Symptom**: a copied fp16 recipe carries a GradScaler into a bf16 run — wasted overhead, not a crash or a wrong result.
+**Root cause**: bf16 has fp32's exponent range, so gradients don't underflow → loss-scaling is unnecessary
+and the scaler's skip/backoff machinery is dead weight (scale-then-unscale cancels, and overflow is rare
+enough that it almost never has a step to skip; for the rare non-finite bf16 step, see P14).
+**Fix**: for bf16, drop the scaler entirely — plain `loss.backward(); optimizer.step()`. Only fp16 (and the
+V100/T4 path) uses GradScaler.
+URL: https://docs.pytorch.org/docs/2.12/amp.html
+### P7 — Gradient clipping under GradScaler: `unscale_` FIRST or you clip scaled grads
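+**Symptom**: `clip_grad_norm_` under fp16 AMP clips at the wrong magnitude: the threshold you asked for is not the one applied.
+**Root cause**: after `scaler.scale(loss).backward()` the grads are still multiplied by the (large) scale factor, so
+`max_norm=1.0` is compared against *scaled* norms. In true units the threshold becomes `1.0 / scale`: the clip fires on essentially every step, and the effective threshold moves whenever the scale changes.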
+**Fix**: `scaler.unscale_(optimizer)` once, THEN clip, THEN `scaler.step`:
+```python
+scaler.scale(loss).backward()
+scaler.unscale_(optimizer)                                  # grads now in true scale
+torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+scaler.step(optimizer); scaler.update()
+```
+`unscale_` must be called at most once per optimizer per step; a second call before `scaler.update()` raises a `RuntimeError`. For bf16, just `clip_grad_norm_` directly, with no unscale.
+URL: https://github.com/pytorch/pytorch/blob/main/docs/source/notes/amp_examples.rst
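A worked example of why the order matters, as a plain-Python sketch. `clip_by_global_norm` is a hypothetical stand-in that mimics global-norm clipping on a flat list of numbers; the scale of 2**16 and the gradient values are assumptions chosen for easy arithmetic.

```python
import math

def norm(grads):
    """Global L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all grads down so their global norm is at most max_norm."""
    coef = min(1.0, max_norm / (norm(grads) + eps))
    return [g * coef for g in grads]

scale = 65536.0                                # assumed loss scale (2**16)
true_grads = [3.0, 4.0]                        # true global norm = 5.0
scaled = [g * scale for g in true_grads]       # what backward() leaves behind under the scaler

# Wrong order: clip the scaled grads, then unscale.
wrong = [g / scale for g in clip_by_global_norm(scaled, max_norm=1.0)]
# Right order: unscale first, then clip.
right = clip_by_global_norm([g / scale for g in scaled], max_norm=1.0)

print(norm(wrong), norm(right))
```

Clipping before unscaling leaves a true gradient norm of about `1.0 / scale` instead of `1.0`, so the requested threshold is off by the full scale factor.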
+---
+## NaN / Inf
+### P8 — Where NaNs come from: the five arithmetic origins
+**Symptom**: loss prints `nan` (or `inf`) after N steps; everything was fine before.
+**Root cause**: NaN/Inf is produced by a *small* set of ops on bad inputs:
+- `log(x)` with `x ≤ 0` (`-inf` at 0, `nan` below 0), e.g. a hand-rolled `log(softmax(x))` or `log` of a `sigmoid` output that hit 0.
+- `sqrt(x)` / `x ** 0.5` with `x < 0`, or its grad at `x = 0` (`d/dx sqrt = 1/(2√x) → inf`).
+- division `a / b` with `b → 0` (un-epsilon'd normalization, variance ≈ 0 in BatchNorm/LayerNorm).
+- `exp(x)` overflow → `inf`, then `inf − inf` / `inf / inf → nan`.
+- fp16 overflow (P10): a value exceeds 65504 → `inf` → grads → NaN.
+**Fix — make the op stable, don't paper over it**:
+- Never hand-roll `log(softmax(x))` — use `F.log_softmax` / `F.cross_entropy` (fused, log-sum-exp-stable).
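+- Add epsilon *inside* the unstable op: `torch.log(x + 1e-8)`, `torch.sqrt(x + 1e-12)`, `a / (b + 1e-8)`. Pick an epsilon the compute dtype can represent: `1e-8` rounds to 0 in fp16, so it only helps where the op runs in fp32 or bf16.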
+- Clamp before the danger op: `x.clamp(min=1e-7)` before `log`; clamp logits before a manual softmax.
+- Use `eps` in the optimizer/norm (AdamW `eps=1e-8`; raise modestly if `v` is tiny and steps explode).
+URLs: https://docs.pytorch.org/docs/stable/generated/torch.log.html ·
+https://medium.com/better-ml/loss-spikes-in-training-causes-detection-and-mitigations-ed66e591b1a1
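The log-sum-exp trick behind the fused ops can be sketched in plain Python. `naive_log_softmax` and `stable_log_softmax` are illustrative helpers, not library functions. Note one difference from tensor code: Python's `math.exp` raises `OverflowError` where a tensor library would silently return `inf`.

```python
import math

def naive_log_softmax(logits):
    """Hand-rolled log(softmax(x)): exponentiates the raw logits."""
    exps = [math.exp(v) for v in logits]       # overflows for large logits
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    """Subtract the max first so every exponent is <= 0."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

logits = [1000.0, 0.0]
try:
    naive_log_softmax(logits)
    naive_ok = True
except OverflowError:
    naive_ok = False

print(naive_ok, stable_log_softmax(logits))
```

Shifting by the maximum changes nothing mathematically, since the shift cancels in the ratio, but it keeps every intermediate value finite.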
+### P9 — Find the exact op: anomaly detection + a cheap forward hook
+**Symptom**: loss is NaN but the stack trace points at `loss.backward()`, not the op that caused it.
+**Root cause**: by default the NaN surfaces wherever it's *consumed*, not where it was *born*.
+**Fix — two tools, cheap → precise**:
+- **Forward NaN hook (cheap, leave on)** — register on every module to catch the *first* layer to emit NaN:
+  ```python
+  for name, m in model.named_modules():
+      m.register_forward_hook(lambda mod, i, o, n=name:
+          print(f"NaN in {n}") if torch.is_tensor(o) and not torch.isfinite(o).all() else None)
+  ```
+- **`torch.autograd.set_detect_anomaly(True)` (expensive, debug-only)** — records the forward traceback of
+  each backward op and raises at the first backward NaN, pointing at the *forward* line that created it.
+  ```python
+  with torch.autograd.detect_anomaly():   # or set_detect_anomaly(True, check_nan=True)
+      loss = loss_fn(model(x), y)         # forward must run inside the context so its traceback is recorded
+      loss.backward()
+  ```
+  The docs warn it "will slow down your program" (roughly an order of magnitude) — enable to *locate*, then
+  turn OFF for the real run, never ship it on. URL: https://docs.pytorch.org/docs/2.12/autograd.html
+### P10 — fp16 overflow vs underflow: read the GradScaler signal
+**Symptom (fp16)**: loss → inf/NaN; or the scaler skips *every* step and the scale factor collapses toward 0.
+**Root cause**: a forward activation exceeds fp16's 65504 max → `inf` → NaN grads → the scaler can't find a
+scale that avoids overflow, so it backs off forever. Common in attention logits and large residual sums.
+(Distinct from underflow, which the scaler *fixes* by P5.)
+**Fix**: switch fp16 → **bf16** (P1): its fp32 range absorbs the large values; this is the single most
+effective fix. If bf16 is unavailable (V100/T4): keep the overflow-prone block (final logits, attention
+scores, the loss) in **fp32** via a nested `torch.autocast("cuda", enabled=False)` region, casting that
+block's inputs with `.float()` since tensors produced under autocast may still be fp16, and apply z-loss
+(P15) / qk-norm (P16) to stop the logits growing.
+URL: https://medium.com/better-ml/loss-spikes-in-training-causes-detection-and-mitigations-ed66e591b1a1
+### P11 — NaN from the *data*, not the math
+**Symptom**: NaN appears at a specific, reproducible step (always step 4137), not gradually.
+**Root cause**: a corrupt sample — NaN/Inf pixel, all-zero target, label outside `[0, C)`, empty sequence,
+divide-by-zero in a custom transform. The math is fine; the input is poison.
+**Fix**: guard at the data boundary — `assert torch.isfinite(x).all(), f"non-finite input @ step {step}"`
+(fail loud, with the index). A reproducible-step NaN ⇒ inspect *that batch* (seed the loader, dump the
+index); a *step-varying* NaN ⇒ a numerics/LR problem (P12), not data. Smoke the data first — smoke
+*content* is owned by **verifying-dl-experiments** (cross-link **REQUIRED**).
+URL: https://arxiv.org/pdf/2311.03938
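A data-boundary guard along these lines can be sketched without any framework. `check_batch` is a hypothetical helper working on nested Python lists; in a real loader the same checks would run on tensors, and the step and sample index would come from the loader.

```python
import math

def check_batch(inputs, labels, num_classes, step):
    """Fail loudly, naming the step and sample, if a batch contains poison."""
    for i, row in enumerate(inputs):
        if len(row) == 0:
            raise ValueError(f"empty sample: step {step}, sample {i}")
        if not all(math.isfinite(v) for v in row):
            raise ValueError(f"non-finite input: step {step}, sample {i}")
    for i, label in enumerate(labels):
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} outside [0, {num_classes}): step {step}, sample {i}")

good = [[0.1, 0.2], [0.3, 0.4]]
bad = [[0.1, float("nan")], [0.3, 0.4]]

check_batch(good, [0, 1], num_classes=2, step=4136)    # passes silently
try:
    check_batch(bad, [0, 1], num_classes=2, step=4137)
except ValueError as err:
    message = str(err)
    print(message)
```

Putting the step and sample index in the message is what turns a reproducible-step NaN into a single sample to inspect.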
+---
+## Loss spikes / divergence
+### P12 — Loss spike / divergence: LR too high or warmup too short
+**Symptom**: training is stable, then the loss jumps orders of magnitude (spike), sometimes recovering,
+sometimes diverging to NaN — most often early, or after a fast LR ramp.
+**Root cause**: if the LR ramps too fast or starts too high, early updates land before activation norms and
+the optimizer's second moment (`v`) have stabilized, overshooting into sharp loss regions → gradient-norm
+blowup → spike. A sustained **grad-norm** rise typically *precedes* the loss spike by several steps.
+**Fix — in order of cheapness**:
+1. **Lengthen warmup** (linear ramp 0 → peak over e.g. 1–10% of steps); warmup is the single biggest lever on LR-sensitivity of final loss.
+2. **Lower peak LR** ~3–10× and re-check.
+3. **Log grad-norm every step** as the early-warning signal — spikes are predictable from activation/grad-norm scaling before they hit.
+4. Resume from the last good checkpoint *before* the spike (don't train through a diverged region).
+URLs: https://arxiv.org/pdf/2309.14322 ·
+https://apxml.com/courses/how-to-build-a-large-language-model/chapter-24-identifying-mitigating-training-instabilities/stabilization-techniques-revisited
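The linear warmup in step 1 can be written as a small schedule function. This is a sketch under stated assumptions: `lr_at` is a hypothetical helper, the peak LR of `3e-4` and the 2% warmup fraction are example values, and the LR is simply held after warmup (any decay schedule is left out).

```python
def lr_at(step, peak_lr, warmup_steps):
    """Linear warmup from 0 to peak_lr, then hold (decay omitted for brevity)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

total_steps = 100_000
warmup_steps = total_steps * 2 // 100        # 2% of steps, inside the 1-10% range above

for step in (0, 500, 1000, 2000, 50_000):
    print(step, lr_at(step, peak_lr=3e-4, warmup_steps=warmup_steps))
```

Lengthening warmup then means raising `warmup_steps` while leaving the peak untouched, which keeps the two levers (ramp length and peak LR) separate when re-checking.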
+### P13 — Gradient clipping: the standard guardrail (and what constant clipping means)
+**Symptom**: occasional grad-norm spikes; or NaN right after a single bad batch.
+**Root cause**: one pathological batch (rare embedding IDs, an outlier sample) produces an outsized global
+grad norm that overshoots.
+**Fix**: clip global grad norm every step — `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`
+with `max_norm` ∈ [0.5, 1.0] typical for transformers (under the scaler: P7). **Diagnostic**: if clipping is
+active *every* step or needs an absurdly low threshold to stay stable, that's a symptom of a deeper problem
+(LR too high P12, bad init P17, architecture), not a fix — chase the cause. Global-norm clipping scales
+*all* grads down, so one embedding-heavy batch can throttle everything else that step — consider per-module
+clipping if embeddings dominate.
+URL: https://medium.com/better-ml/loss-spikes-in-training-causes-detection-and-mitigations-ed66e591b1a1
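The throttling effect of global-norm clipping, and the per-module alternative, can be illustrated with toy numbers. `clip_global` and `clip_per_module` are hypothetical helpers operating on a dict of flat gradient lists; the module names and values are made up for the example.

```python
import math

def l2(values):
    """L2 norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))

def clip_global(grads, max_norm, eps=1e-6):
    """One shared coefficient, computed from the norm over all modules."""
    total = l2([v for g in grads.values() for v in g])
    coef = min(1.0, max_norm / (total + eps))
    return {name: [v * coef for v in g] for name, g in grads.items()}

def clip_per_module(grads, max_norm, eps=1e-6):
    """A separate coefficient per module, so one outlier cannot shrink the rest."""
    out = {}
    for name, g in grads.items():
        coef = min(1.0, max_norm / (l2(g) + eps))
        out[name] = [v * coef for v in g]
    return out

grads = {"embed": [30.0, 40.0], "mlp": [0.3, 0.4]}    # embed norm 50, mlp norm 0.5
g_global = clip_global(grads, max_norm=1.0)
g_module = clip_per_module(grads, max_norm=1.0)

print(l2(g_global["mlp"]), l2(g_module["mlp"]))
```

Under global clipping the well-behaved `mlp` gradient is shrunk by the same factor as the outlier; per-module clipping leaves it alone and only reins in `embed`.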
+### P14 — Skip-the-batch: drop the update when this step is non-finite
+**Symptom**: a single bad batch every few thousand steps NaNs the whole run; restarting wastes hours.
+**Root cause**: the optimizer applies a non-finite grad and permanently corrupts the weights.
+**Fix**: gate the optimizer step on finiteness (fp16's GradScaler already does this internally, P5; bf16
+needs it explicit):
+```python
+loss.backward()
+gnorm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+if torch.isfinite(gnorm):
+    optimizer.step()
+else:
+    optimizer.zero_grad(set_to_none=True)   # skip this batch, keep weights intact
+    skipped += 1
+```
+Log a `skipped` counter — a *rising* skip rate means a systematic problem (P12/P10), not stray bad data.
+Adaptive spike-clipping (ZClip) and momentum-reset on spike (SPAM) automate this for large runs. URLs:
+https://arxiv.org/pdf/2504.02507 · https://arxiv.org/pdf/2501.06842
+### P15 — z-loss: stop softmax logits from drifting unbounded
+**Symptom**: training is slowly destabilizing; the softmax normalizer / output logits grow over time and
+eventually overflow (acute in fp16/bf16); the "output logits diverge from log-probs" failure mode.
+**Root cause**: nothing pins the absolute scale of pre-softmax logits, so they drift up; large logits cause
+numerical instability and (in low precision) overflow → collapse.
+**Fix**: add an auxiliary **z-loss** = `1e-4 · (log Z)²` where `Z` is the softmax denominator
+(`log Z = logsumexp(logits)`), pulling `log Z → 0`:
+```python
+logits = model(x)
+z = torch.logsumexp(logits, dim=-1)
+loss = F.cross_entropy(logits, y) + 1e-4 * (z ** 2).mean()
+```
+Coefficient **1e-4** is the PaLM/ST-MoE value; too large lets z-loss dominate. Standard in LLM pretraining;
+also the recommended fix for MoE router instability. URLs:
+https://medium.com/dair-ai/papers-explained-50-palm-480e72fa3fd5 · https://arxiv.org/pdf/2202.08906 ·
+https://arxiv.org/pdf/2309.14322
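Why cross-entropy alone cannot pin the logit scale can be seen with a few numbers. In this plain-Python sketch (helper names are illustrative), adding a constant to every logit leaves the cross-entropy unchanged while the z-loss term grows, which is the pressure that pulls `log Z` back toward 0.

```python
import math

def logsumexp(logits):
    """Stable log of the softmax denominator Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return logsumexp(logits) - logits[target]

def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)^2."""
    return coeff * logsumexp(logits) ** 2

logits = [1.0, 2.0, 3.0]
shifted = [v + 50.0 for v in logits]         # same softmax, much larger absolute scale

print(cross_entropy(logits, 2), cross_entropy(shifted, 2))
print(z_loss(logits), z_loss(shifted))
```

The cross-entropy values agree to rounding error, so only the auxiliary term distinguishes the drifted logits from the original ones.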
+### P16 — qk-norm: kill attention-logit growth at high LR
+**Symptom**: a transformer diverges only at higher LR; the instability traces to attention scores (Q·Kᵀ)
+growing large before the softmax.
+**Root cause**: "growth of logits in attention layers" — one of the two dominant transformer instability
+modes (the other is output-logit divergence, P15). Unbounded attention logits saturate the softmax.
+**Fix**: apply **QK-LayerNorm** — LayerNorm query and key per-head before the dot-product. Combined with
+z-loss (P15) + warmup (P12), it lets small models train to similar loss across *orders of magnitude* of LR,
+i.e. removes most LR-sensitivity. URL: https://arxiv.org/pdf/2309.14322
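The bounding effect of normalizing queries and keys can be shown on a single head with toy vectors. This is a simplified sketch: `layer_norm` here has no learnable gain or bias (real QK-LayerNorm usually has them), and the vectors and the 100x blow-up are made-up values.

```python
import math

def layer_norm(v, eps=1e-6):
    """LayerNorm without learnable parameters: zero mean, unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def attn_logit(q, k):
    """Scaled dot-product attention score for one query/key pair."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [1.0, 2.0, 3.0, 4.0]
k = [2.0, 0.0, 1.0, 3.0]
big_q = [100.0 * x for x in q]               # the query after its norm has grown 100x

raw, raw_big = attn_logit(q, k), attn_logit(big_q, k)
normed = attn_logit(layer_norm(q), layer_norm(k))
normed_big = attn_logit(layer_norm(big_q), layer_norm(k))

print(raw, raw_big)          # the raw logit grows with the query norm
print(normed, normed_big)    # the normalized logit barely moves
```

Without gain parameters each normalized vector has norm close to `sqrt(d)`, so the scaled score cannot exceed `sqrt(d)` in magnitude however large the raw projections become.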
+### P17 — Initialization & normalization placement
+**Symptom**: divergence in the first few hundred steps regardless of LR; or vanishing signal (P18) in deep
+stacks.
+**Root cause**: residual streams accumulate variance with depth; default init can make early
+activations/grads too large (spike) or too small (vanish). Norm/embedding init scale matters.
+**Fix**: scale residual-branch init by `1/√(2·n_layers)` (GPT-2-style); prefer pre-LN over post-LN for deep
+transformers; init embeddings at small std (~0.02). When unsure, copy a *known-good* config's init+norm
+scheme rather than tuning blind. URL: https://arxiv.org/pdf/2309.14322
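The `1/√(2·n_layers)` factor can be motivated with a toy variance count. This sketch assumes two residual branches per layer (attention and MLP), treats the branch contributions as independent with variance equal to the init variance, and reuses 0.02 as the base std; all three are simplifying assumptions for illustration, and both helpers are hypothetical.

```python
import math

def residual_init_std(n_layers, base_std=0.02):
    """Std for residual-branch output projections, scaled down with depth."""
    return base_std / math.sqrt(2 * n_layers)

def stream_variance(n_layers, branch_std):
    """Variance added to the residual stream by 2 * n_layers independent branches."""
    return 2 * n_layers * branch_std ** 2

n_layers = 12
unscaled = stream_variance(n_layers, 0.02)
scaled = stream_variance(n_layers, residual_init_std(n_layers))

print(residual_init_std(n_layers))
print(unscaled, scaled)
```

In this toy model the unscaled stream variance grows linearly with depth, while the scaled init keeps it at the base variance regardless of `n_layers`.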
+---
+## Gradients
+### P18 — Gradient explosion vs vanishing: diagnose by logging the norm
+**Symptom**: loss NaN/diverges (explosion) OR loss plateaus and the model never learns (vanishing).
+**Root cause**: per-layer grad norms blow up (explosion: deep nets, high LR, no clip) or decay to ~0
+(vanishing: saturating activations, bad init P17, too-deep unnormalized stacks).
+**Fix — measure first**:
+```python
+total = sum(p.grad.detach().norm()**2 for p in model.parameters() if p.grad is not None) ** 0.5
+# log `total` every step; also log per-layer norms when hunting the culprit layer
+```
+- **Explosion** (norm ↑↑): grad clipping (P13), lower LR (P12), longer warmup, bf16 over fp16 (P10).
+- **Vanishing** (norm → 0): residual connections, normalization layers, better init (P17), non-saturating
+  activations (GELU/SiLU over deep sigmoid/tanh stacks), check the LR isn't *too low*.
+A grad-norm trace is the cheapest, highest-signal stability instrument — log it from step 1.
+URL: https://apxml.com/courses/how-to-build-a-large-language-model/chapter-24-identifying-mitigating-training-instabilities/stabilization-techniques-revisited
+---
+## Reproducibility
+### P19 — Deterministic / repro knobs — set them, but the *interpretation* is delegated
+**Symptom**: same config + seed gives slightly different loss/metrics run-to-run.
+**Root cause**: nondeterministic CUDA kernels + `cudnn.benchmark` autotuning pick different algorithms per
+run; TF32/AMP add low-order noise on top.
+**Fix — the mechanical knobs (set these here)**:
+```python
+import random
+import numpy as np
+import torch
+s = 0                                           # hypothetical seed; record it with the run
+torch.manual_seed(s); np.random.seed(s); random.seed(s)
+torch.use_deterministic_algorithms(True)        # may need CUBLAS_WORKSPACE_CONFIG=:4096:8
+torch.backends.cudnn.deterministic = True
+torch.backends.cudnn.benchmark = False          # benchmark=True trades determinism for speed
+```
+**Whether a run-to-run delta is "a real effect vs cuDNN nondeterminism," and the full determinism
+methodology, is owned by verifying-dl-experiments (cross-link REQUIRED)** — catalogued as **U36** in
+`references/gotchas_universal.md`. This layer only ensures the knobs are *set and logged*. Determinism costs
+speed — enable for the datapoint that must be clean, not every throwaway run.
+URL: https://docs.pytorch.org/docs/stable/notes/randomness.html
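The `CUBLAS_WORKSPACE_CONFIG` variable mentioned in the comment above can also be set from Python. This is a minimal standard-library sketch, and `prepare_determinism` is a hypothetical helper. It assumes the variable should be in place before the process touches CUDA, so it belongs at the very top of the entry point, and it only seeds Python's own `random` module.

```python
import os
import random

def prepare_determinism(seed: int) -> None:
    """Set the cuBLAS workspace variable (if unset) and seed Python's RNG."""
    # Assumption: set this before any CUDA work starts in the process.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)

prepare_determinism(1234)
first = [random.random() for _ in range(3)]
prepare_determinism(1234)
second = [random.random() for _ in range(3)]

print(first == second, os.environ["CUBLAS_WORKSPACE_CONFIG"])
```

`setdefault` leaves a value exported by the launcher untouched, so the logged environment still reflects what the run actually used.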
+---
+## Pointers — adjacent layers, do NOT restate here
+- **`references/gotchas_universal.md`** — the *infra* failure modes that masquerade as numerics:
+  **U6** disk-full crashes `torch.save`, **U9** cgroup-OOM (bare `Killed`, not a NaN), **U28** CUDA/driver/
+  torch-build mismatch (`no kernel image` ≠ a precision bug), **U10/U11** VRAM OOM. Rule out infra before
+  chasing a "numerics" ghost.
+- **`verifying-dl-experiments`** (**REQUIRED** cross-link) — owns *is-the-number-real*: smoke **content**,
+  cuDNN-nondeterminism-as-metric-error (U36), collapse/constant-output diagnosis, "bug vs real effect." This
+  file makes training *run and stay finite*; that skill judges whether the converged result is *true*.
+- **`references/spot-resilience.md`** — checkpoint cadence so a divergence-and-resume (P12) loses minimal work.
+- **`references/multinode.md`** — NCCL/precision interactions in DDP (all-reduce dtype, loss-scale sync) for
+  multi-node runs; single-box users skip.