opencode-skills-collection 3.1.2 → 3.1.4

Files changed (65)
  1. package/bundled-skills/.antigravity-install-manifest.json +4 -1
  2. package/bundled-skills/agent-creator/SKILL.md +246 -0
  3. package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
  4. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  5. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  6. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  7. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  8. package/bundled-skills/docs/sources/sources.md +1 -1
  9. package/bundled-skills/docs/users/bundles.md +1 -1
  10. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  11. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  12. package/bundled-skills/docs/users/getting-started.md +1 -1
  13. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  14. package/bundled-skills/docs/users/usage.md +4 -4
  15. package/bundled-skills/docs/users/visual-guide.md +4 -4
  16. package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
  17. package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
  18. package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
  19. package/bundled-skills/remote-gpu-trainer/README.md +267 -0
  20. package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
  21. package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
  22. package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
  23. package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
  24. package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
  25. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
  26. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
  27. package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
  28. package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
  29. package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
  30. package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
  31. package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
  32. package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
  33. package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
  34. package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
  35. package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
  36. package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
  37. package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
  38. package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
  39. package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
  40. package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
  41. package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
  42. package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
  43. package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
  44. package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
  45. package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
  46. package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
  47. package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
  48. package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
  49. package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
  50. package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
  51. package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
  52. package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
  53. package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
  54. package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
  55. package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
  56. package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
  57. package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
  58. package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
  59. package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
  60. package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
  61. package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
  62. package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
  63. package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
  64. package/package.json +1 -1
  65. package/skills_index.json +66 -0
@@ -0,0 +1,401 @@
1
+ # Numerical precision & training stability — make it RUN, then stop it diverging
2
+
3
+ The mechanics of getting a DL run to compute *finite* numbers fast on a rented card, and of debugging it
4
+ when the loss goes NaN or spikes. This layer owns **make-it-run + the mechanics of divergence**; it does
5
+ NOT own *is the converged number real* / cuDNN-nondeterminism-as-a-metric-error — that is
6
+ **verifying-dl-experiments** (cross-link **REQUIRED** at every "is this a bug or a real effect" fork).
7
+
8
+ To jump: `grep -in '<keyword>' references/training/precision-stability.md` (e.g. `tf32`, `bf16`, `scaler`,
9
+ `nan`, `anomaly`, `z-loss`, `clip`, `warmup`, `qk`, `deterministic`).
10
+
11
+ ## Table of contents
12
+
13
+ - **Precision choice** — P1 fp32/tf32/fp16/bf16 decision · P2 TF32 default-off footgun · P3 H100/A100/V100 capability
14
+ - **AMP mechanics** — P4 autocast scope · P5 GradScaler (fp16 only) · P6 bf16 needs no scaler · P7 grad-clip under scaler
15
+ - **NaN / Inf** — P8 where NaNs come from · P9 anomaly detection · P10 fp16 overflow vs underflow · P11 bad-data NaN
16
+ - **Loss spikes / divergence** — P12 LR + warmup · P13 grad clipping · P14 skip-the-batch · P15 z-loss · P16 qk-norm · P17 init
17
+ - **Gradients** — P18 explosion/vanishing diagnosis
18
+ - **Repro** — P19 determinism knobs (cross-link)
19
+ - **Pointers** — gotchas_universal.md, multinode.md, spot-resilience.md
20
+
21
+ ---
22
+
23
+ ## Precision choice
24
+
25
+ ### P1 — Which precision: fp32 / TF32 / fp16 / bf16
26
+
27
+ **Symptom**: unsure which `dtype` to train in; run is either slow (fp32) or NaN-prone (fp16).
28
+
29
+ **Root cause**: the four modes trade dynamic range against mantissa precision against tensor-core speed.
30
+ fp16 has a 5-bit exponent (max ~65504) so it *overflows* and *underflows* easily; bf16 keeps fp32's 8-bit
31
+ exponent (same range as fp32) but only 7 mantissa bits, so it never needs loss-scaling but is coarser per
32
+ value. TF32 is an fp32-storage mode that runs matmuls at 10 mantissa bits on tensor cores.
33
+
34
+ **Fix — default ladder (PyTorch 2.x)**:
35
+ 1. **bf16 autocast** on Ampere+ (A100/H100/4090/...) — the modern default; same range as fp32, no GradScaler, robust. `torch.autocast("cuda", dtype=torch.bfloat16)`.
36
+ 2. **TF32** for the fp32 matmuls that remain (the non-autocast path): `torch.set_float32_matmul_precision("high")`. A near-free speedup with negligible convergence impact for most nets (P2).
37
+ 3. **fp16 autocast + GradScaler** ONLY if stuck on a card with no bf16 tensor cores (V100/T4/2080Ti) — needs the scaler (P5) and is overflow-prone.
38
+ 4. **Pure fp32** as the diagnostic fallback: if a run NaNs, *first* prove it's finite in fp32 before blaming the model. fp32 isolates "is this a numerics bug or a model bug."
39
+
40
+ bf16 handles large dot-products / attention logits better than fp16, which saturates and triggers
41
+ scaler-step-skipping. URLs: https://docs.pytorch.org/docs/2.12/amp.html ·
42
+ https://www.runpod.io/articles/guides/fp16-bf16-fp8-mixed-precision-speed-up-my-model-training
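The range-versus-mantissa trade-off in P1 can be checked without a GPU. The sketch below is an illustration only: `to_fp16` relies on Python's `struct` half-precision format, and `to_bf16` is a hypothetical helper that emulates bf16 by truncating the fp32 encoding to its top 16 bits (real hardware typically rounds to nearest, so low-order digits can differ).

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE fp16; values beyond the fp16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the fp32 encoding (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


# Range: fp16 overflows just above 65504 and underflows tiny values; bf16 does neither here.
print(to_fp16(70000.0))        # inf
print(to_bf16(70000.0))        # 69632.0 (finite, but coarse)
print(to_fp16(1e-8))           # 0.0 (below the smallest fp16 subnormal)
print(to_bf16(1e-8) > 0.0)     # True

# Precision: fp16 keeps 10 mantissa bits, bf16 only 7.
print(to_fp16(1.0009765625))   # 1.0009765625 (1 + 2**-10 survives)
print(to_bf16(1.0009765625))   # 1.0 (lost under truncation)
```

The same value that overflows in fp16 stays finite in bf16, at the price of a coarser result. That is the trade the ladder above is making.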
43
+
44
+ ### P2 — TF32 is OFF by default for matmul since PyTorch 1.12 — the "why is my A100 slow" footgun
45
+
46
+ **Symptom**: an fp32 (or autocast-but-fp32-matmul-heavy) run on an A100/H100 is ~2–4× slower than expected;
47
+ nothing is wrong with the code.
48
+
49
+ **Root cause**: `torch.backends.cuda.matmul.allow_tf32` defaulted **True in 1.7–1.11**, then flipped to
50
+ **False in 1.12+** (precision-loss complaints from non-DL users). So a fresh PyTorch 2.x box runs fp32
51
+ matmuls at full fp32 precision, off the TF32 tensor-core fast path, unless TF32 is re-enabled. Convolutions' TF32
52
+ (`cudnn.allow_tf32`) is a separate knob, enabled by default.
53
+
54
+ **Fix**: opt back in once at startup —
55
+ ```python
56
+ torch.set_float32_matmul_precision("high") # preferred: enables TF32 (or bf16x3) for fp32 matmul
57
+ # legacy-equivalent, still works:
58
+ torch.backends.cuda.matmul.allow_tf32 = True
59
+ torch.backends.cudnn.allow_tf32 = True
60
+ ```
61
+ `"high"` = TF32; `"highest"` = true fp32 (default); `"medium"` = coarser still (bf16 internally). HF Trainer exposes `--tf32 1`.
62
+ Most nets converge identically with TF32 as with fp32. URLs:
63
+ https://github.com/pytorch/pytorch/pull/76509 ·
64
+ https://docs.pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html ·
65
+ https://docs.pytorch.org/docs/2.12/notes/numerical_accuracy.html
66
+
67
+ ### P3 — Card capability gates the choice: bf16 needs Ampere+; V100/T4 are fp16-only
68
+
69
+ **Symptom**: bf16 training is unexpectedly slow (no error), or a config picks bf16 on an old card and falls
70
+ to a slow path.
71
+
72
+ **Root cause**: fast bf16 tensor cores arrived with **Ampere (A100, RTX 30xx)**; Hopper (H100/H200) adds
73
+ native **FP8**. **V100/T4/RTX 20xx have fp16 tensor cores but no fast bf16** (runs emulated/slow). A rental
74
+ hands whatever card is free, so the right precision is a *per-rental* fact, not a constant.
75
+
76
+ **Fix**: branch on capability at runtime, never hardcode —
77
+ ```python
78
+ use_bf16 = torch.cuda.is_bf16_supported() # True on Ampere+
79
+ amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
80
+ ```
81
+ On V100/T4 use fp16+GradScaler (P5). FP8 (H100) is opt-in via Transformer Engine / `torchao`, not plain
82
+ autocast (out of scope). Record the card next to `nvidia-smi` in Phase 0.
83
+ URL: https://www.e2enetworks.com/blog/nvidia-a100-vs-h100-vs-h200-gpu-comparison
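As a cross-check on `torch.cuda.is_bf16_supported()`, the decision can also be keyed on the CUDA compute capability. The sketch below is a hypothetical helper, not part of the skill: `pick_amp_dtype` takes the `(major, minor)` tuple that `torch.cuda.get_device_capability()` returns and yields a dtype name plus whether the fp16 GradScaler path (P5) applies. It assumes that "fast bf16" means compute capability 8.0 or newer, in line with the Ampere+ rule above.

```python
def pick_amp_dtype(capability):
    """Return (autocast dtype name, needs_grad_scaler) for a CUDA compute capability."""
    major, _minor = capability
    if major >= 8:
        # Ampere (8.x), Ada (8.9), Hopper (9.0): fast bf16 tensor cores, no scaler.
        return "bfloat16", False
    # Volta/Turing (7.x) and older: fall back to the fp16 + GradScaler path.
    return "float16", True


# Well-known compute capabilities, used here only as sample inputs.
cards = {"V100": (7, 0), "T4": (7, 5), "A100": (8, 0), "RTX 4090": (8, 9), "H100": (9, 0)}
for name, capability in cards.items():
    print(name, pick_amp_dtype(capability))
```

In a real run, the dtype name would be mapped to `torch.bfloat16` / `torch.float16` and logged next to the card model, so the per-rental choice is visible in the run record.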
84
+
85
+ ---
86
+
87
+ ## AMP mechanics
88
+
89
+ ### P4 — autocast: wrap ONLY forward + loss, never backward, never `.half()` the model
90
+
91
+ **Symptom**: dtype-mismatch errors, or AMP gives no speedup, or grads look wrong.
92
+
93
+ **Root cause**: autocast is a context that casts *eligible ops* per-op inside the region; manually
94
+ `.half()`-ing the model or wrapping the backward pass fights it.
95
+
96
+ **Fix**:
97
+ ```python
98
+ for x, y in loader:
99
+     optimizer.zero_grad(set_to_none=True)
100
+     with torch.autocast("cuda", dtype=amp_dtype): # forward + loss ONLY
101
+         out = model(x); loss = loss_fn(out, y)
102
+     # backward is OUTSIDE autocast:
103
+     loss.backward() # (+ scaler for fp16, P5)
104
+     optimizer.step()
105
+ ```
106
+ Keep the model and optimizer in fp32; do NOT call `model.half()`. Use the new `torch.amp.autocast("cuda",
107
+ ...)` / `torch.amp.GradScaler("cuda")` API — `torch.cuda.amp.*` is **deprecated** in PyTorch 2.x. autocast
108
+ state is thread-local (re-enter it in each new thread; this matters for DataParallel, and for DDP only when one process drives several GPUs).
109
+ URL: https://docs.pytorch.org/docs/2.12/amp.html
110
+
111
+ ### P5 — GradScaler: required for fp16 to stop gradient *underflow*
112
+
113
+ **Symptom (no scaler, fp16)**: loss looks fine but the model doesn't learn: small gradients fall below
114
+ fp16's smallest subnormal (~6e-8) and flush to 0.
115
+
116
+ **Root cause**: fp16's narrow range underflows small gradients to zero. GradScaler multiplies the loss by a
117
+ large factor before backward (pushing grads into representable range), then unscales before the step and
118
+ **adapts the factor**: on any inf/NaN grad it *skips the optimizer step* and halves the scale (backoff 0.5);
119
+ after `growth_interval` (default 2000) clean steps it doubles it (growth 2.0).
120
+
121
+ **Fix — canonical fp16 loop**:
122
+ ```python
123
+ scaler = torch.amp.GradScaler("cuda")
124
+ for x, y in loader:
125
+     optimizer.zero_grad(set_to_none=True)
126
+     with torch.autocast("cuda", dtype=torch.float16):
127
+         loss = loss_fn(model(x), y)
128
+     scaler.scale(loss).backward()
129
+     scaler.step(optimizer) # internally unscales; SKIPS step if inf/NaN found
130
+     scaler.update() # adapts the scale factor
131
+ ```
132
+ Early-training "skipped step" warnings as the scaler calibrates are **normal**; *persistent* skips every
133
+ step = a real overflow (go to P10). URLs:
134
+ https://github.com/pytorch/pytorch/blob/main/docs/source/notes/amp_examples.rst ·
135
+ https://docs.pytorch.org/docs/2.12/amp.html
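The backoff/growth rule in P5 is easy to misread, so here is a toy model of it. `ToyGradScaler` is a hypothetical stand-in written for illustration, not PyTorch's implementation. It only tracks the scale factor, using the defaults quoted above (backoff 0.5, growth 2.0, `growth_interval` 2000) and an initial scale of 2**16.

```python
class ToyGradScaler:
    """Tracks only the loss-scale factor, following the backoff/growth rule."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, found_inf: bool) -> bool:
        """Adapt the scale; return True if the optimizer step would be applied."""
        if found_inf:
            self.scale *= self.backoff_factor   # overflow: skip the step, halve the scale
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            self.scale *= self.growth_factor    # long clean streak: try a larger scale
            self.clean_steps = 0
        return True


scaler = ToyGradScaler()
applied = [scaler.update(found_inf=True) for _ in range(3)]   # early calibration skips
print(applied, scaler.scale)    # [False, False, False] 8192.0
for _ in range(2000):           # one full clean interval
    scaler.update(found_inf=False)
print(scaler.scale)             # 16384.0
```

A handful of early skips followed by growth is the healthy pattern. A scale that keeps halving with no clean streak is the P10 overflow signature.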
136
+
137
+ ### P6 — bf16 needs NO GradScaler (adding one is pointless, not harmful)
138
+
139
+ **Symptom**: a copied fp16 recipe carries a GradScaler into a bf16 run — wasted overhead, not a crash or a wrong result.
140
+
141
+ **Root cause**: bf16 has fp32's exponent range, so gradients don't underflow → loss-scaling is unnecessary
142
+ and the scaler's skip/backoff machinery is dead weight (scale-then-unscale cancels, and it never finds an
143
+ overflow to skip).
144
+
145
+ **Fix**: for bf16, drop the scaler entirely — plain `loss.backward(); optimizer.step()`. Only fp16 (and the
146
+ V100/T4 path) uses GradScaler.
147
+ URL: https://docs.pytorch.org/docs/2.12/amp.html
148
+
149
+ ### P7 — Gradient clipping under GradScaler: `unscale_` FIRST or you clip scaled grads
150
+
151
+ **Symptom**: `clip_grad_norm_` under fp16 AMP has no effect, or clips at the wrong magnitude.
152
+
153
+ **Root cause**: before `unscale_`, the grads are still multiplied by the (large) scale factor, so clipping to
154
+ `max_norm=1.0` really clips the true gradient norm to `1.0 / scale`: it fires on nearly every step, at a threshold that moves whenever the scale changes.
155
+
156
+ **Fix**: `scaler.unscale_(optimizer)` once, THEN clip, THEN `scaler.step`:
157
+ ```python
158
+ scaler.scale(loss).backward()
159
+ scaler.unscale_(optimizer) # grads now in true scale
160
+ torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
161
+ scaler.step(optimizer); scaler.update()
162
+ ```
163
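+ Call `unscale_` at most once per optimizer per step (a second call before `scaler.update()` raises a `RuntimeError`); `scaler.step` sees that the grads are already unscaled and does not unscale them again. For bf16, call `clip_grad_norm_` directly, with no unscale.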
164
+ URL: https://github.com/pytorch/pytorch/blob/main/docs/source/notes/amp_examples.rst
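The ordering rule in P7 comes down to simple arithmetic, shown below with plain Python lists standing in for gradient tensors. `clip_by_global_norm` is a hypothetical helper that mirrors the usual global-norm clipping formula; the scale value 65536 is just an example.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient entries."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads down so their global L2 norm is at most max_norm."""
    coef = min(1.0, max_norm / (global_norm(grads) + eps))
    return [g * coef for g in grads]


scale = 65536.0
true_grads = [0.3, 0.4]                     # true norm 0.5, under the 1.0 threshold
scaled = [g * scale for g in true_grads]    # what .grad holds after scaler.scale(loss).backward()

# Wrong order: clip the scaled grads, then unscale.
wrong = [g / scale for g in clip_by_global_norm(scaled, 1.0)]
# Right order: unscale first, then clip.
right = clip_by_global_norm([g / scale for g in scaled], 1.0)

print(global_norm(wrong))   # about 1.5e-5: the update is crushed to 1/scale
print(global_norm(right))   # 0.5: untouched, since it was under the threshold
```

The wrong order does not fail loudly. It silently shrinks every update, which is why this bug tends to surface as "the model barely learns" rather than as an error.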
165
+
166
+ ---
167
+
168
+ ## NaN / Inf
169
+
170
+ ### P8 — Where NaNs come from: the four arithmetic origins
171
+
172
+ **Symptom**: loss prints `nan` (or `inf`) after N steps; everything was fine before.
173
+
174
+ **Root cause**: NaN/Inf is produced by a small set of ops on bad inputs (four arithmetic origins, plus fp16 overflow):
175
+ - `log(x)` with `x ≤ 0` (e.g. `log` of a `sigmoid` or hand-rolled softmax output that hit 0).
176
+ - `sqrt(x)` / `x ** 0.5` with `x < 0`, or its grad at `x = 0` (`d/dx sqrt = 1/(2√x) → inf`).
177
+ - division `a / b` with `b → 0` (un-epsilon'd normalization, variance ≈ 0 in BatchNorm/LayerNorm).
178
+ - `exp(x)` overflow → `inf`, then `inf − inf` / `inf / inf → nan`.
179
+ - fp16 overflow (P10): a value exceeds 65504 → `inf` → grads → NaN.
180
+
181
+ **Fix — make the op stable, don't paper over it**:
182
+ - Never hand-roll `log(softmax(x))` — use `F.log_softmax` / `F.cross_entropy` (fused, log-sum-exp-stable).
183
+ - Add epsilon *inside* the unstable op: `torch.log(x + 1e-8)`, `torch.sqrt(x + 1e-12)`, `a / (b + 1e-8)`.
184
+ - Clamp before the danger op: `x.clamp(min=1e-7)` before `log`; clamp logits before a manual softmax.
185
+ - Use `eps` in the optimizer/norm (AdamW `eps=1e-8`; raise modestly if `v` is tiny and steps explode).
186
+
187
+ URLs: https://docs.pytorch.org/docs/stable/generated/torch.log.html ·
188
+ https://medium.com/better-ml/loss-spikes-in-training-causes-detection-and-mitigations-ed66e591b1a1
189
+
190
+ ### P9 — Find the exact op: anomaly detection + a cheap forward hook
191
+
192
+ **Symptom**: loss is NaN but the stack trace points at `loss.backward()`, not the op that caused it.
193
+
194
+ **Root cause**: by default the NaN surfaces wherever it's *consumed*, not where it was *born*.
195
+
196
+ **Fix — two tools, cheap → precise**:
197
+ - **Forward NaN hook (cheap, leave on)** — register on every module to catch the *first* layer to emit NaN:
198
+ ```python
199
+ for name, m in model.named_modules():
200
+     m.register_forward_hook(lambda mod, i, o, n=name:
201
+         print(f"NaN in {n}") if torch.is_tensor(o) and not torch.isfinite(o).all() else None)
202
+ ```
203
+ - **`torch.autograd.set_detect_anomaly(True)` (expensive, debug-only)** — records the forward traceback of
204
+ each backward op and raises at the first backward NaN, pointing at the *forward* line that created it.
205
+ ```python
206
+ with torch.autograd.detect_anomaly(): # wrap forward AND backward, or no forward traceback is recorded; or set_detect_anomaly(True, check_nan=True)
207
+     loss = loss_fn(model(x), y); loss.backward()
208
+ ```
209
+ The docs warn it "will slow down your program" (roughly an order of magnitude) — enable to *locate*, then
210
+ turn OFF for the real run, never ship it on. URL: https://docs.pytorch.org/docs/2.12/autograd.html
211
+
212
+ ### P10 — fp16 overflow vs underflow: read the GradScaler signal
213
+
214
+ **Symptom (fp16)**: loss → inf/NaN; or the scaler skips *every* step and the scale factor collapses toward 0.
215
+
216
+ **Root cause**: a forward activation exceeds fp16's 65504 max → `inf` → NaN grads → the scaler can't find a
217
+ scale that avoids overflow, so it backs off forever. Common in attention logits and large residual sums.
218
+ (Distinct from underflow, which the scaler *fixes*; see P5.)
219
+
220
+ **Fix**: switch fp16 → **bf16** (P1) — its fp32 range absorbs the large values; this is the single most
221
+ effective fix. If bf16 is unavailable (V100/T4): keep the overflow-prone block (final logits, attention
222
+ scores, the loss) in **fp32** via a nested `torch.autocast("cuda", enabled=False)` region (cast its inputs with `.float()`, since they may arrive as fp16), and apply z-loss
223
+ (P15) / qk-norm (P16) to stop the logits growing.
224
+ URL: https://medium.com/better-ml/loss-spikes-in-training-causes-detection-and-mitigations-ed66e591b1a1
225
+
226
+ ### P11 — NaN from the *data*, not the math
227
+
228
+ **Symptom**: NaN appears at a specific, reproducible step (always step 4137), not gradually.
229
+
230
+ **Root cause**: a corrupt sample — NaN/Inf pixel, all-zero target, label outside `[0, C)`, empty sequence,
231
+ divide-by-zero in a custom transform. The math is fine; the input is poison.
232
+
233
+ **Fix**: guard at the data boundary — `assert torch.isfinite(x).all(), f"non-finite input @ step {step}"`
234
+ (fail loud, with the index). A reproducible-step NaN ⇒ inspect *that batch* (seed the loader, dump the
235
+ index); a *step-varying* NaN ⇒ a numerics/LR problem (P12), not data. Smoke the data first — smoke
236
+ *content* is owned by **verifying-dl-experiments** (cross-link **REQUIRED**).
237
+ URL: https://arxiv.org/pdf/2311.03938
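A data-boundary guard along the lines of P11 might look like the sketch below. It uses plain Python lists so it runs anywhere; `check_batch` and its arguments are hypothetical names. On tensors, the same checks would be expressed with `torch.isfinite` and comparisons on the label tensor.

```python
import math


def check_batch(inputs, labels, num_classes, step):
    """Raise ValueError naming the step and sample index of the first bad value."""
    for i, (row, label) in enumerate(zip(inputs, labels)):
        if not row:
            raise ValueError(f"empty sample @ step {step}, index {i}")
        if not all(math.isfinite(v) for v in row):
            raise ValueError(f"non-finite input @ step {step}, index {i}")
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} outside [0, {num_classes}) @ step {step}, index {i}")


good_inputs = [[0.1, 0.2], [0.3, 0.4]]
check_batch(good_inputs, [0, 2], num_classes=3, step=4136)    # passes silently

bad_inputs = [[0.1, 0.2], [float("nan"), 0.4]]
try:
    check_batch(bad_inputs, [0, 2], num_classes=3, step=4137)
except ValueError as err:
    print(err)    # non-finite input @ step 4137, index 1
```

Reporting both the step and the in-batch index is the point: it turns "NaN at step 4137" into a specific sample that can be inspected.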
238
+
239
+ ---
240
+
241
+ ## Loss spikes / divergence
242
+
243
+ ### P12 — Loss spike / divergence: LR too high or warmup too short
244
+
245
+ **Symptom**: training is stable, then the loss jumps orders of magnitude (spike), sometimes recovering,
246
+ sometimes diverging to NaN — most often early, or after a fast LR ramp.
247
+
248
+ **Root cause**: if the LR ramps too fast or starts too high, early updates land before activation norms and
249
+ the optimizer's second moment (`v`) have stabilized, overshooting into sharp loss regions → gradient-norm
250
+ blowup → spike. A sustained **grad-norm** rise typically *precedes* the loss spike by several steps.
251
+
252
+ **Fix — in order of cheapness**:
253
+ 1. **Lengthen warmup** (linear ramp 0 → peak over e.g. 1–10% of steps); warmup is the single biggest lever on LR-sensitivity of final loss.
254
+ 2. **Lower peak LR** ~3–10× and re-check.
255
+ 3. **Log grad-norm every step** as the early-warning signal — spikes are predictable from activation/grad-norm scaling before they hit.
256
+ 4. Resume from the last good checkpoint *before* the spike (don't train through a diverged region).
257
+
258
+ URLs: https://arxiv.org/pdf/2309.14322 ·
259
+ https://apxml.com/courses/how-to-build-a-large-language-model/chapter-24-identifying-mitigating-training-instabilities/stabilization-techniques-revisited
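The linear warmup in fix 1 can be written as a one-line schedule. This is a minimal sketch under stated assumptions: `warmup_lr` is a hypothetical helper, the LR is held constant after warmup (any decay schedule is left out), and the 5% warmup fraction is just one point inside the 1–10% range mentioned above.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    """Linear ramp from 0 to peak_lr over warmup_steps, then constant at peak_lr."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, step / warmup_steps)


total_steps = 10_000
warmup_steps = total_steps // 20    # 5% of the run
peak_lr = 3e-4

for step in (0, 250, 500, 9_000):
    print(step, warmup_lr(step, peak_lr, warmup_steps))
```

Lengthening warmup is then a one-parameter change, which makes it cheap to try before touching the peak LR.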
260
+
261
+ ### P13 — Gradient clipping: the standard guardrail (and what constant clipping means)
262
+
263
+ **Symptom**: occasional grad-norm spikes; or NaN right after a single bad batch.
264
+
265
+ **Root cause**: one pathological batch (rare embedding IDs, an outlier sample) produces an outsized global
266
+ grad norm that overshoots.
267
+
268
+ **Fix**: clip global grad norm every step — `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`
269
+ with `max_norm` ∈ [0.5, 1.0] typical for transformers (under the scaler: P7). **Diagnostic**: if clipping is
270
+ active *every* step or needs an absurdly low threshold to stay stable, that's a symptom of a deeper problem
271
+ (LR too high P12, bad init P17, architecture), not a fix — chase the cause. Global-norm clipping scales
272
+ *all* grads down, so one embedding-heavy batch can throttle everything else that step — consider per-module
273
+ clipping if embeddings dominate.
274
+ URL: https://medium.com/better-ml/loss-spikes-in-training-causes-detection-and-mitigations-ed66e591b1a1
275
+
276
+ ### P14 — Skip-the-batch: drop the update when this step is non-finite
277
+
278
+ **Symptom**: a single bad batch every few thousand steps NaNs the whole run; restarting wastes hours.
279
+
280
+ **Root cause**: the optimizer applies a non-finite grad and permanently corrupts the weights.
281
+
282
+ **Fix**: gate the optimizer step on finiteness (fp16's GradScaler already does this internally, P5; bf16
283
+ needs it explicit):
284
+ ```python
285
+ loss.backward()
286
+ gnorm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
287
+ if torch.isfinite(gnorm):
288
+     optimizer.step()
289
+ else:
290
+     optimizer.zero_grad(set_to_none=True) # skip this batch, keep weights intact
291
+     skipped += 1
292
+ ```
293
+ Log a `skipped` counter — a *rising* skip rate means a systematic problem (P12/P10), not stray bad data.
294
+ Adaptive spike-clipping (ZClip) and momentum-reset on spike (SPAM) automate this for large runs. URLs:
295
+ https://arxiv.org/pdf/2504.02507 · https://arxiv.org/pdf/2501.06842
296
+
297
+ ### P15 — z-loss: stop softmax logits from drifting unbounded
298
+
299
+ **Symptom**: training is slowly destabilizing; the softmax normalizer / output logits grow over time and
300
+ eventually overflow (acute in fp16/bf16); the "output logits diverge from log-probs" failure mode.
301
+
302
+ **Root cause**: nothing pins the absolute scale of pre-softmax logits, so they drift up; large logits cause
303
+ numerical instability and (in low precision) overflow → collapse.
304
+
305
+ **Fix**: add an auxiliary **z-loss** = `1e-4 · (log Z)²` where `Z` is the softmax denominator
306
+ (`log Z = logsumexp(logits)`), pulling `log Z → 0`:
307
+ ```python
308
+ logits = model(x)
309
+ z = torch.logsumexp(logits, dim=-1)
310
+ loss = F.cross_entropy(logits, y) + 1e-4 * (z ** 2).mean()
311
+ ```
312
+ Coefficient **1e-4** is the PaLM value (ST-MoE's router z-loss uses its own coefficient, 1e-3 in that paper); too large lets z-loss dominate. Standard in LLM pretraining;
313
+ also the recommended fix for MoE router instability. URLs:
314
+ https://medium.com/dair-ai/papers-explained-50-palm-480e72fa3fd5 · https://arxiv.org/pdf/2202.08906 ·
315
+ https://arxiv.org/pdf/2309.14322
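Why logits can drift at all: cross-entropy is unchanged when a constant is added to every logit, so the loss alone gives no signal about their absolute scale. The pure-Python sketch below illustrates this (the helper names and the +30 shift are arbitrary choices for the example). The z-loss term is the part that does react to the shift.

```python
import math


def logsumexp(logits):
    """Numerically stable log of the softmax denominator Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return logsumexp(logits) - logits[target]


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)**2."""
    return coeff * logsumexp(logits) ** 2


logits = [2.0, 0.5, -1.0]
shifted = [v + 30.0 for v in logits]     # same softmax, much larger absolute scale

print(cross_entropy(logits, 0), cross_entropy(shifted, 0))   # equal up to rounding
print(z_loss(logits), z_loss(shifted))                       # the penalty grows with the shift
```

Because only the z-loss term changes, its gradient is what pulls `log Z` back toward 0 while leaving the predicted distribution alone.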
316
+
317
+ ### P16 — qk-norm: kill attention-logit growth at high LR
318
+
319
+ **Symptom**: a transformer diverges only at higher LR; the instability traces to attention scores (Q·Kᵀ)
320
+ growing large before the softmax.
321
+
322
+ **Root cause**: "growth of logits in attention layers" — one of the two dominant transformer instability
323
+ modes (the other is output-logit divergence, P15). Unbounded attention logits saturate the softmax.
324
+
325
+ **Fix**: apply **QK-LayerNorm** — LayerNorm query and key per-head before the dot-product. Combined with
326
+ z-loss (P15) + warmup (P12), it lets small models train to similar loss across *orders of magnitude* of LR,
327
+ i.e. removes most LR-sensitivity. URL: https://arxiv.org/pdf/2309.14322
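The effect of QK-LayerNorm on logit growth can be seen on a single query/key pair. The sketch below is a simplified illustration in plain Python: `layer_norm` omits the learnable gain and bias that a real per-head LayerNorm carries, and the vectors are arbitrary example values.

```python
import math


def layer_norm(vec, eps=1e-5):
    """LayerNorm without learnable gain/bias: zero mean, unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def attn_logit(q, k, qk_norm):
    """Scaled dot-product attention logit for one query/key pair."""
    if qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


q = [1.0, -2.0, 3.0, 0.5]
k = [0.5, -1.0, 2.0, 1.5]
big_q = [100.0 * v for v in q]    # stand-in for activations that grew during training
big_k = [100.0 * v for v in k]

print(attn_logit(q, k, False), attn_logit(big_q, big_k, False))   # 4.625 46250.0
print(attn_logit(q, k, True), attn_logit(big_q, big_k, True))     # nearly identical
```

Without the gain, each normalized vector has norm at most `sqrt(d)`, so the scaled logit cannot exceed `sqrt(d)` in magnitude regardless of how large the raw activations get. With a learnable gain the bound scales with that gain instead.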
328
+
329
+ ### P17 — Initialization & normalization placement
330
+
331
+ **Symptom**: divergence in the first few hundred steps regardless of LR; or vanishing signal (P18) in deep
332
+ stacks.
333
+
334
+ **Root cause**: residual streams accumulate variance with depth; default init can make early
335
+ activations/grads too large (spike) or too small (vanish). Norm/embedding init scale matters.
336
+
337
+ **Fix**: scale residual-branch init by `1/√(2·n_layers)` (GPT-2-style); prefer pre-LN over post-LN for deep
338
+ transformers; init embeddings at small std (~0.02). When unsure, copy a *known-good* config's init+norm
339
+ scheme rather than tuning blind. URL: https://arxiv.org/pdf/2309.14322
340
+
341
+ ---
342
+
343
+ ## Gradients
344
+
345
+ ### P18 — Gradient explosion vs vanishing: diagnose by logging the norm
346
+
347
+ **Symptom**: loss NaN/diverges (explosion) OR loss plateaus and the model never learns (vanishing).
348
+
349
+ **Root cause**: per-layer grad norms blow up (explosion: deep nets, high LR, no clip) or decay to ~0
350
+ (vanishing: saturating activations, bad init P17, too-deep unnormalized stacks).
351
+
352
+ **Fix — measure first**:
353
+ ```python
354
+ total = sum(p.grad.detach().norm()**2 for p in model.parameters() if p.grad is not None) ** 0.5
355
+ # log `total` every step; also log per-layer norms when hunting the culprit layer
356
+ ```
357
+ - **Explosion** (norm ↑↑): grad clipping (P13), lower LR (P12), longer warmup, bf16 over fp16 (P10).
358
+ - **Vanishing** (norm → 0): residual connections, normalization layers, better init (P17), non-saturating
359
+ activations (GELU/SiLU over deep sigmoid/tanh stacks), check the LR isn't *too low*.
360
+
361
+ A grad-norm trace is the cheapest, highest-signal stability instrument — log it from step 1.
362
+ URL: https://apxml.com/courses/how-to-build-a-large-language-model/chapter-24-identifying-mitigating-training-instabilities/stabilization-techniques-revisited
363
+
364
+ ---
365
+
366
+ ## Reproducibility
367
+
368
+ ### P19 — Deterministic / repro knobs — set them, but the *interpretation* is delegated
369
+
370
+ **Symptom**: same config + seed gives slightly different loss/metrics run-to-run.
371
+
372
+ **Root cause**: nondeterministic CUDA kernels + `cudnn.benchmark` autotuning pick different algorithms per
373
+ run; TF32/AMP add low-order noise on top.
374
+
375
+ **Fix — the mechanical knobs (set these here)**:
376
+ ```python
377
+ torch.manual_seed(s); np.random.seed(s); random.seed(s)
378
+ torch.use_deterministic_algorithms(True) # may need CUBLAS_WORKSPACE_CONFIG=:4096:8
379
+ torch.backends.cudnn.deterministic = True
380
+ torch.backends.cudnn.benchmark = False # benchmark=True trades determinism for speed
381
+ ```
382
+ **Whether a run-to-run delta is "a real effect vs cuDNN nondeterminism," and the full determinism
383
+ methodology, is owned by verifying-dl-experiments (cross-link REQUIRED)** — catalogued as **U36** in
384
+ `references/gotchas_universal.md`. This layer only ensures the knobs are *set and logged*. Determinism costs
385
+ speed — enable for the datapoint that must be clean, not every throwaway run.
386
+ URL: https://docs.pytorch.org/docs/stable/notes/randomness.html
387
+
388
+ ---
389
+
390
+ ## Pointers — adjacent layers, do NOT restate here
391
+
392
+ - **`references/gotchas_universal.md`** — the *infra* failure modes that masquerade as numerics:
393
+ **U6** disk-full crashes `torch.save`, **U9** cgroup-OOM (bare `Killed`, not a NaN), **U28** CUDA/driver/
394
+ torch-build mismatch (`no kernel image` ≠ a precision bug), **U10/U11** VRAM OOM. Rule out infra before
395
+ chasing a "numerics" ghost.
396
+ - **`verifying-dl-experiments`** (**REQUIRED** cross-link) — owns *is-the-number-real*: smoke **content**,
397
+ cuDNN-nondeterminism-as-metric-error (U36), collapse/constant-output diagnosis, "bug vs real effect." This
398
+ file makes training *run and stay finite*; that skill judges whether the converged result is *true*.
399
+ - **`references/spot-resilience.md`** — checkpoint cadence so a divergence-and-resume (P12) loses minimal work.
400
+ - **`references/multinode.md`** — NCCL/precision interactions in DDP (all-reduce dtype, loss-scale sync) for
401
+ multi-node runs; single-box users skip.