npm - opencode-skills-collection - Versions diffs - 3.1.2 → 3.1.4 - Mend

opencode-skills-collection 3.1.2 → 3.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (65) hide show

package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md ADDED Viewed

@@ -0,0 +1,230 @@
+# Per-domain training gotchas — make each domain's run start, not lie, and not silently mistrain
+The cross-cutting layers (precision, OOM, throughput, checkpoint, distributed) hold everywhere; this file
+is the **domain-shaped** residue — the data-format, masking, normalization, schedule, and freezing traps
+that only bite LLM / vision / diffusion / RL / VLM training. Each entry is **Symptom → Root cause → Fix**
+with the exact knob. This layer owns *making the domain pipeline RUN and debugging its mechanics*;
+**verifying-dl-experiments** owns *is the converged number real* (collapse-vs-real-effect, train/val
+leakage, metric validity, constant/degenerate output). Cross-link it (**REQUIRED**) at every "loss is fine
+but the output/metric is wrong" fork — the headline domain failures (diffusion samples bad at low loss,
+mAP=0, reward collapse, VLM ignores the image) are exactly that shape.
+To jump: `grep -in '<keyword>' references/training/by-domain.md` (e.g. `padding`, `packing`, `rope`,
+`z-loss`, `dpo`, `mAP=0`, `mIoU`, `ignore_index`, `ema`, `vae`, `cfg`, `kl`, `whiten`, `projector`,
+`freeze`, `seed`).
+## Table of contents
+- **LLM** — L1 pad-side · L2 loss-mask-−100 · L3 pad-token-unset · L4 packing-cross-contamination · L5 RoPE-context-extension · L6 grad-explosion+z-loss · L7 eval-perplexity-mask · L8 SFT/DPO/RLHF-data-format · L9 DPO-collapse+KL · L10 gated-token-before-Trainer
+- **Vision (cls/det/seg)** — V1 normalization-mismatch · V2 aug-on-eval · V3 mAP=0 · V4 anchor/NMS/conf-thresh · V5 mIoU=0 ignore_index/off-by-one · V6 class-imbalance · V7 BN-tiny-batch
+- **Diffusion** — DF1 loss-low-samples-bad (cross-link) · DF2 EMA-weights · DF3 VAE-scaling · DF4 noise-schedule/timestep · DF5 CFG-conditioning-dropout · DF6 sampler-vs-model · DF7 SNR-weighting
+- **RL** — R1 reward-collapse · R2 KL-blowup · R3 whitening · R4 replay/obs-normalization · R5 non-stationarity · R6 seed-variance (cross-link)
+- **VLM** — X1 stage-freeze · X2 projector-only-stage1 · X3 per-group-LR · X4 image-token-truncation · X5 alignment-collapse (cross-link)
+- **Pointers** — precision-stability.md, oom-memory.md, gotchas_universal.md, verifying-dl-experiments (skill)
+---
+## LLM / transformer
+### L1 — Padding side: right for causal-LM SFT, **left** for generation/DPO
+**Symptom**: a fine-tuned causal LM produces garbage, or DPO/batched-generation logprobs disagree with single-example decoding.
+**Root cause**: causal-LM **training** wants **right-padding** (pad lands after content, attention mask zeroes it). Batched **generation/DPO** want **left-padding** — with right-padding the "last real token" position differs per row, so a shared decode step reads pad. TRL's `DPOTrainer` requires `processing_class` padding side `"left"`.
+**Fix**: `tokenizer.padding_side="right"` for SFT collation; `"left"` for generation/eval/DPO — set it per phase, not globally. ([HF causal-LM](https://huggingface.co/docs/transformers/tasks/language_modeling), [TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
+### L2 — Loss over prompt + pad tokens dilutes the signal → mask with −100
+**Symptom**: SFT "trains" but parrots the prompt / barely follows instructions; loss plausible but flat.
+**Root cause**: HF LM loss is `CrossEntropyLoss(ignore_index=-100)` — only `-100` positions are skipped. Leaving the prompt-prefix labels and pad labels as real ids averages the loss over "predict the prompt / predict pad."
+**Fix**: set labels to `-100` at **both** the prompt prefix (train only on the completion) and all padding positions. TRL `SFTTrainer` `completion_only_loss` / `DataCollatorForCompletionOnlyLM` does the prefix masking — verify it fired (decode one masked label row). Whether the gradient hits the right tokens is a smoke-target → cross-link **verifying-dl-experiments** (**REQUIRED**). ([gpt2 thread](https://huggingface.co/gpt2/discussions/34))
+### L3 — `pad_token` unset → pad error or silent pad-with-token-0
+**Symptom**: `ValueError: Asking to pad but the tokenizer does not have a padding token`, or it pads with id 0 (a real token, often `<unk>`/`!`).
+**Root cause**: many base LMs (GPT-2, Llama, Mistral) ship no `pad_token`.
+**Fix**: `tokenizer.pad_token = tokenizer.eos_token` and `model.config.pad_token_id = tokenizer.pad_token_id`. With right-padding + attention mask, reusing EOS as PAD is safe. If a *new* token is added, `model.resize_token_embeddings(len(tokenizer))` or its id indexes out of range.
+### L4 — Sequence packing leaks attention across documents → contaminated training
+**Symptom**: throughput jumps after enabling packing, but quality drops vs unpacked; the model "completes" one doc with content from a packed neighbor.
+**Root cause**: naive packing concatenates examples into one `max_len` sequence; a vanilla causal mask lets a token in doc 2 attend back into doc 1 — cross-sequence contamination.
+**Fix**: **document masking** — emit `position_ids` that reset per sub-sequence + an attention impl that honors boundaries. TRL/HF `DataCollatorWithFlattening` packs into one stream, returns `position_ids`, and sets each example's first label to `-100`; FlashAttention-2 varlen restricts attention within-document. Requires `attn_implementation="flash_attention_2"` — packing without it silently contaminates. ([HF blog](https://huggingface.co/blog/packing-with-FA2), [transformers #31629](https://github.com/huggingface/transformers/pull/31629), [IBM](https://research.ibm.com/blog/hugging-face-training-flash-attention))
+### L5 — Fine-tuning past pretrain context without RoPE scaling → garbage past N tokens
+**Symptom**: a 4k-context model is incoherent past ~4k at inference even when trained on longer sequences; or long-context finetune won't converge.
+**Root cause**: RoPE frequencies are calibrated to the pretrain context; longer positions extrapolate into unseen rotation angles. Linear interp degrades past ~4×; YaRN holds to 16–32×.
+**Fix**: set `rope_scaling={"type":"linear"|"dynamic"|"yarn","factor":<target/orig>}` in config and finetune **with scaling active** (`"yarn"` for big jumps, `"linear"` only ≤4×). Train-time vs inference-time `rope_scaling` mismatch is a silent regression. ([RoPE deep dive](https://amaarora.github.io/posts/2025-09-21-rope-context-extension.html), [HF guide](https://medium.com/@leannetan/extending-context-length-with-hugging-faces-transformers-6b04db05b39a))
+### L6 — Loss spikes / logit drift in long LM training → z-loss + the precision-layer knobs
+**Symptom**: pretraining/long-SFT loss is stable then spikes; in bf16/fp16 it can NaN; logits grow unboundedly over training.
+**Root cause**: the softmax normalizer `log Z` drifts from 0 as logits grow → low-precision overflow + gradient instability.
+**Fix**: add **z-loss** `1e-4 · log²(Z)` (the PaLM/Gopher coefficient) to pull `log Z` toward 0. The general divergence ladder (warmup, grad-clip, skip-the-batch, qk-norm, bf16-over-fp16) is **references/training/precision-stability.md** P12–P18 (z-loss is P15) — not restated here; this entry is the LM-specific *why z-loss exists*. ([PaLM](https://arxiv.org/abs/2204.02311), [small-scale proxies](https://arxiv.org/abs/2309.14322))
+### L7 — Eval perplexity wrong from a wrong mask/stride, not the model
+**Symptom**: reported PPL implausible, or differs from a published number on the same checkpoint+data.
+**Root cause**: PPL = `exp(mean NLL over scored tokens)`. Including pad/prompt tokens, or a sliding window that double-counts overlap context as scored tokens, corrupts the denominator.
+**Fix**: score only non-`-100` positions; for long docs use the HF strided window where overlap tokens are `-100` (context, not scored). Whether the number is comparable across runs is metric-validity → cross-link **verifying-dl-experiments** (**REQUIRED**). ([HF perplexity](https://huggingface.co/docs/transformers/perplexity))
+### L8 — SFT / DPO / RLHF expect different dataset schemas; the wrong one trains on nothing
+**Symptom**: TRL trainer runs but learns nothing, or errors on a missing column; preference data in an SFT trainer (or vice versa) silently mistrains.
+**Root cause**: **SFT** = prompt+completion (train on completion); **DPO/preference** = `{prompt, chosen, rejected}` or conversational messages; **RLHF/PPO** = prompts only + a separate reward model. Conversational data needs the chat template applied.
+**Fix**: match trainer to schema; for conversational data confirm the chat template fired (decode one example — look for role tags). Prefer the **explicit-prompt** form `{prompt, chosen, rejected}` over implicit. Recommended order SFT → DPO (DPO from a non-SFT base often underperforms). ([TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
+### L9 — DPO reward margin won't grow / chosen logps crash → beta + ref-model + collapse
+**Symptom**: `rewards/margins` ~0 or `rewards/accuracies` ~0.5; or `logps/chosen` and `logps/rejected` both plunge (suppresses everything).
+**Root cause**: `beta` controls deviation from the frozen reference — too **small** → policy drifts (implicit KL blows up, degenerate text); too **large** → signal too weak to move the margin. DPO widens the gap mostly by **suppressing the rejected** likelihood, so both logps falling *with a growing margin* is normal; both falling *with a flat margin* is collapse. A lost/absent `ref_model` (some PEFT paths) removes the anchor.
+**Fix**: start `beta=0.1`, raise to 0.3–0.5 if text degrades, lower if the margin won't move. Use `learning_rate≈1e-6` (TRL DPO default; `≈1e-5` for LoRA) — too high is the classic collapse. Health signal: `rewards/margins` ↑, `rewards/accuracies` → ~0.7+. With `ref_model=None` TRL uses the initial policy as the frozen reference — concrete check: a frozen reference must yield **identical** logps for a fixed batch across steps; re-score one batch early and late, and if they drift the anchor is being trained (the trap when `ref_model=None` lacks a real frozen copy). Bug-vs-real-effect on the collapse → cross-link **verifying-dl-experiments** (**REQUIRED**). ([TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
+### L10 — Gated/private model 401s mid-Trainer → authenticate BEFORE construction
+**Symptom**: `401`/`GatedRepoError` when the Trainer loads a Llama/Gemma/Mistral base despite granted access; or `push_to_hub` can't write.
+**Root cause**: the token must be visible to the process **before** the gated `from_pretrained` / Trainer-internal load; setting it after is too late.
+**Fix**: push the token first (env/stdin, **never inline the literal**): set `HF_TOKEN`, or `huggingface_hub.login(token=os.environ["HF_TOKEN"])` at the top before any `from_pretrained`. `push_to_hub` needs a **write**-scope token + `hub_model_id` in `TrainingArguments`. Verify `huggingface-cli whoami` before launch — on a metered box a 401 wastes a full reload. Secrets transport → `references/ssh_transport.md` (U34); offline-without-key → gotchas_universal.md U35. ([HF gated](https://huggingface.co/docs/hub/en/models-gated))
+---
+## Vision (classification / detection / segmentation)
+### V1 — Normalization mismatch train↔eval → near-zero accuracy on a "trained" model
+**Symptom**: training loss falls but val/test is near-chance; or a fine-tuned backbone is far worse than its pretrained eval.
+**Root cause**: pretrain `(mean,std)` (ImageNet `mean=[.485,.456,.406] std=[.229,.224,.225]`) differs between train/eval paths, or one normalizes to `[0,1]` and the other `[0,255]`; or RGB vs BGR (OpenCV loads BGR). A reported CenterNet case got post-norm mean `-115`, std `8` from wrong channel stats.
+**Fix**: use the **exact** pretrain normalization, identically in train and eval; match channel order. Print one input tensor's per-channel mean/std — should be ~`N(0,1)`. Remaining gap = real-effect vs this bug → cross-link **verifying-dl-experiments** (**REQUIRED**; input-normalization is a named check). ([why-normalize](https://inside-machinelearning.com/en/why-and-how-to-normalize-data-object-detection-on-image-in-pytorch-part-1/), [tf/models #10778](https://github.com/tensorflow/models/issues/10778))
+### V2 — Train-time augmentation applied at eval → unstable/depressed metrics
+**Symptom**: eval numbers flicker run-to-run or sit below the training-curve val.
+**Root cause**: the random transform pipeline (`RandomResizedCrop`, flip, jitter) is reused for the eval loader, so each pass sees different inputs; or `model.eval()` never called so Dropout/BN stay in train mode.
+**Fix**: separate `train_transform` (random) from `eval_transform` (deterministic resize+center-crop+normalize); call `model.eval()` + `torch.no_grad()`. A flickering eval metric is usually this, not the model.
+### V3 — Detection mAP=0 despite a falling loss → box-format / label-id / scale mismatch
+**Symptom**: detection loss decreases normally but mAP is exactly 0 (or ~0) at every eval.
+**Root cause**: a format mismatch the loss tolerates but eval doesn't — (1) box format `cxcywh`/`xywh` vs evaluator's `xyxy`, or normalized `[0,1]` vs absolute pixels; (2) class id off-by-one (0-indexed model vs 1-indexed COCO, 0=background); (3) boxes in resized space matched against original-res GT; (4) eval score threshold so high everything is filtered.
+**Fix**: assert the eval pipeline's box format + class indexing, convert explicitly (`torchvision.ops.box_convert`), and visualize 2–3 predicted boxes before trusting the metric. mAP=0 with healthy loss is almost never the model — it's eval glue; the all-zero-metric pattern → cross-link **verifying-dl-experiments** (**REQUIRED**). ([tf/models #10778](https://github.com/tensorflow/models/issues/10778), [bbox formats](https://www.learnml.io/posts/a-guide-to-bounding-box-formats/))
+### V4 — Detections vanish after NMS / anchor mismatch → no boxes survive
+**Symptom**: raw head outputs look reasonable but final detections are empty or absurdly few.
+**Root cause**: NMS IoU too aggressive, score threshold too high, or anchor sizes/ratios don't cover the dataset's object scales (regression targets unreachable).
+**Fix**: log pre-NMS vs post-NMS counts; loosen `score_thresh` (~0.05 for eval recall), NMS IoU ~0.5–0.6; for anchor heads run k-means auto-anchor (YOLO `autoanchor`) on GT boxes. Pairs with V3.
+### V5 — Segmentation mIoU=0 or NaN → `ignore_index` / label off-by-one
+**Symptom**: seg loss trains but mIoU is 0 (or a class NaN); or loss is NaN from step 1.
+**Root cause**: label/class-index inconsistency — a void value (commonly `255`) not excluded → treated as a class id ≥ `num_classes` (out-of-range / pollutes IoU); or off-by-one (0=background but labels start at 1, or a `reduce_labels` 0→255 shift applied inconsistently between loss and metric).
+**Fix**: set the **same** `ignore_index` in **both** loss and metric — `CrossEntropyLoss(ignore_index=255)` and mIoU mask `(label != 255)`; confirm `label.max() < num_classes` after any shift; apply reduction identically. mIoU=0 with falling loss = all-zero-metric pattern → cross-link **verifying-dl-experiments** (**REQUIRED**). ([torchmetrics #2747](https://github.com/Lightning-AI/torchmetrics/issues/2747), [HF ignore_index](https://discuss.huggingface.co/t/understanding-ignore-index-and-reduce-labels/64587))
+### V6 — Severe class imbalance → model predicts only the majority class
+**Symptom**: high pixel/sample accuracy but rare classes never predicted; minority recall ~0.
+**Root cause**: unweighted cross-entropy is dominated by the majority class; "always predict majority" is the easy degenerate solution.
+**Fix**: weight the loss (`CrossEntropyLoss(weight=...)` inverse-frequency) or focal loss (detection); class-balanced sampler. Report **per-class / macro** metrics, never just overall — a high aggregate hiding a collapsed minority is degenerate output → cross-link **verifying-dl-experiments** (**REQUIRED**).
+### V7 — Tiny per-GPU batch → BatchNorm stats garbage → unstable/poor training
+**Symptom**: detection/seg at batch 1–2 per GPU is unstable or underperforms a larger-batch run.
+**Root cause**: BN estimates mean/var over the batch; at batch 1–2 those are noisy and running stats drift.
+**Fix**: **SyncBatchNorm** across GPUs (`torch.nn.SyncBatchNorm.convert_sync_batchnorm`) under DDP, or **GroupNorm**, or freeze pretrained BN (`FrozenBatchNorm2d`, as detection backbones do). DDP mechanics → `references/training/distributed-launch.md`.
+---
+## Diffusion / generative
+### DF1 — Loss is low but samples are bad → the canonical "loss ≠ quality"
+**Symptom**: the noise-prediction MSE converges nicely but samples are blurry, mode-collapsed, or wrong.
+**Root cause**: diffusion loss (predict noise at a random timestep) is **weakly correlated with sample quality** — good average noise-prediction still compounds errors over the sampling trajectory. Real culprits are downstream: missing EMA (DF2), wrong VAE scaling (DF3), train/sample schedule mismatch (DF4/DF6), no/over CFG (DF5).
+**Fix**: the textbook **is-the-number-real** fork → cross-link **verifying-dl-experiments** (**REQUIRED**; it owns loss-low-output-bad). Mechanically walk DF2→DF6; the single most common miss is evaluating **raw** weights instead of **EMA** (DF2). ([stability techniques](https://apxml.com/courses/advanced-diffusion-architectures/chapter-4-advanced-diffusion-training/training-stability-techniques))
+### DF2 — Sampling from raw (non-EMA) weights → worse than the "same" model
+**Symptom**: samples from the just-saved checkpoint look worse than expected; quality jumps with an EMA checkpoint.
+**Root cause**: diffusion quality depends heavily on EMA — a running average (`decay≈0.999`, ~1000-update window) that denoises the noisy SGD trajectory. Raw weights are the noisy point estimate.
+**Fix**: maintain EMA during training and **sample/evaluate from EMA weights** (`diffusers` `EMAModel`); save both (EMA for inference, raw for resume); verify the eval path actually loaded EMA. Resolves a large share of "DF1" reports. ([EMA](https://medium.com/@thibaut.chauffier/training-diffusion-models-from-scratch-21d7a1f18e9e))
+### DF3 — VAE latent scaling wrong → latents off-unit-variance → diffusion can't learn
+**Symptom**: latent-diffusion (SD-style) training is unstable / produces noise or blank output; or a swapped-in custom VAE degrades everything.
+**Root cause**: the model assumes ~unit-variance latents; the VAE output is multiplied by a calibrated `scaling_factor` (SD v1 = `0.18215`). A wrong/missing factor (or a custom VAE with different stats) leaves latents off-scale.
+**Fix**: scale by the VAE's `config.scaling_factor` on encode, divide on decode. For a **custom** VAE, **measure** empirical latent std on a sample and set factor `1/std` — don't inherit `0.18215`. Print latent mean/std before training (~0/~1). ([sd-vae](https://huggingface.co/stabilityai/sd-vae-ft-mse))
+### DF4 — Noise schedule / timestep / prediction_type mismatch train↔inference
+**Symptom**: structured artifacts, wrong contrast/brightness, or failure at low step counts.
+**Root cause**: betas/alphas schedule (linear/cosine/scaled-linear), `num_train_timesteps`, or `prediction_type` differs between training and the inference scheduler; e.g. trained on `epsilon`, sampled as `v`/`x0`.
+**Fix**: keep the **same** `beta_schedule`, `num_train_timesteps`, `prediction_type` in both — in `diffusers` build the inference scheduler from the training `scheduler.config`. Mismatched `prediction_type` is a silent quality killer.
+### DF5 — Conditioning never dropped during training → CFG is a no-op / model ignores prompt
+**Symptom**: changing `guidance_scale` at inference barely changes output, or the model ignores conditioning.
+**Root cause**: CFG needs a learned **unconditional** path, trained by randomly replacing the condition with a null embedding for a fraction `p_drop` of examples. No dropout ⇒ no usable unconditional estimate ⇒ CFG no-op.
+**Fix**: during training drop the condition `p_drop≈0.1` (replace with null embedding). At inference use `guidance_scale` ~5–15 for T2I (higher = more prompt adherence, lower diversity). Model-ignores-input is a verifying-dl-experiments concern (**REQUIRED**); the training-side root cause is the missing dropout. ([CFG theory](https://apxml.com/courses/advanced-diffusion-architectures/chapter-4-advanced-diffusion-training/classifier-free-guidance-theory))
+### DF6 — Sampler ≠ model: a sampler config that doesn't match the trained objective
+**Symptom**: switching samplers wildly changes quality; one sampler gives noise.
+**Root cause**: samplers assume a `prediction_type` + schedule (DF4); ancestral/SDE vs deterministic ODE interact differently with the trained noise level, and step count below the sampler's stable range degrades.
+**Fix**: validate the checkpoint with the **reference** sampler/step-count from its recipe first, then explore; confirm `prediction_type` matches; for few-step use DPM-Solver++ not plain DDPM. "New sampler → bad samples" is a config mismatch, not a model regression.
+### DF7 — Uniform-timestep loss weighting → blurry samples → Min-SNR weighting
+**Symptom**: samples are systematically blurry / low-detail despite long training.
+**Root cause**: uniform loss weight over-weights high-noise (easy) steps relative to low-noise steps that carry fine detail; the gradient is dominated by the easy regime.
+**Fix**: apply **Min-SNR-γ** weighting (γ≈5) so low-noise steps get their share; `diffusers` scripts expose `--snr_gamma`. Compounds with DF2/DF3 — fix these details together, not in isolation. ([Min-SNR](https://arxiv.org/abs/2303.09556))
+---
+## RL
+### R1 — Reward collapses / output degenerates mid-training
+**Symptom**: average reward suddenly drops, responses get short/repetitive or refuse, length collapses.
+**Root cause**: reward hacking / over-optimization — the policy exploits the reward model's blind spots and drifts far, often after the ratio gets clipped much more and approximate KL spikes.
+**Fix**: strengthen the KL penalty (raise the KL coefficient), lower LR (`≈1e-6` for LLM PPO), reduce PPO update epochs/batch, add a length penalty if length is gamed. Watch reward **and** KL together — a reward jump with a KL spike is hacking, not progress. Bug-vs-real-effect on the collapse → cross-link **verifying-dl-experiments** (**REQUIRED**; it owns collapse/degenerate-output). ([PPO instability](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-4-rl-ppo-fine-tuning/troubleshooting-ppo-instability), [N-details RLHF](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo))
+### R2 — KL to the reference blows up → policy runs away
+**Symptom**: KL grows without bound; generations go incoherent; "diverges" though loss isn't NaN.
+**Root cause**: the KL penalty is too weak (or adaptive-KL target too loose); aggressive updates push the policy far, and a huge KL term can dominate the objective so the model optimizes the penalty instead.
+**Fix**: use an **adaptive KL controller** with a target (e.g. 6), or a fixed coefficient large enough to hold KL bounded; clip the ratio (`cliprange≈0.2`); cap update epochs. Confirm the frozen reference isn't being updated. Same axis DPO's `beta` controls (L9). ([KL penalty role](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-4-rl-ppo-fine-tuning/kl-divergence-penalty-role))
+### R3 — Un-normalized rewards/advantages → unstable gradients → whiten
+**Symptom**: high-variance, brittle training; small reward-scale changes destabilize everything.
+**Root cause**: raw reward/advantage magnitudes vary wildly across batches; PPO gradients are scale-sensitive.
+**Fix**: **whiten** advantages per minibatch (subtract mean, divide by std) — the standard PPO trick, more stabilizing than plain reward normalization (double-normalizing reward **and** advantage is often redundant). For classic-control RL, normalize **observations** with a running mean/std (`VecNormalize`) — un-normalized obs is a top cause of failure to learn. ([impl matters](https://openreview.net/pdf?id=rxEmiOEIFL), [whitening redundancy](https://liujch1998.github.io/2023/04/16/ppo-norm.html))
+### R4 — Replay buffer / normalization state not checkpointed → resume behaves like cold start
+**Symptom**: an off-policy run (DQN/SAC) resumed from checkpoint acts like a cold start; or a normalized env's stats reset on resume and performance tanks.
+**Root cause**: the replay buffer and the running obs/reward normalization stats are part of training **state** but are often omitted — restoring only weights loses them.
+**Fix**: checkpoint+restore the replay buffer (or accept warmup) **and** the `VecNormalize`/running-stats alongside weights. General checkpoint-everything-stateful (optimizer/scheduler/RNG/step) → `references/training/checkpoint-resume.md`; on spot boxes losing buffer/normstats every preemption silently degrades learning → `references/spot-resilience.md`.
+### R5 — Non-stationarity treated as a bug → chasing a moving target
+**Symptom**: value/critic loss won't converge to zero; metrics oscillate even when "working."
+**Root cause**: RL targets are **non-stationary** — the policy changes the data distribution and bootstrapped targets move. A value loss that never hits zero is expected.
+**Fix**: judge by **return/reward trend**, not critic-loss-to-zero; stabilize with a slow target network (DQN/SAC) and GAE. Don't "fix" non-convergent critic loss by shrinking LR to zero — that just stops learning.
+### R6 — A single seed's result is not the result → RL variance is huge
+**Symptom**: identical hyperparameters + different seeds give non-overlapping curves; an ablation "win" disappears on re-run.
+**Root cause**: extreme seed variance from algorithm, policy sampling, and environment stochasticity (comparing 5-run point estimates yields >50% Type-I error).
+**Fix**: report aggregate over **≥5 seeds** (more for noisy envs), use **IQM** (interquartile mean) over mean/median, show CIs. A single-seed delta is not a result — squarely **verifying-dl-experiments** territory (bug-vs-real-effect, seed discipline; **REQUIRED**). ([Henderson](https://arxiv.org/pdf/1708.04133), [how-many-seeds](https://arxiv.org/pdf/1806.08295))
+---
+## Multimodal / VLM
+### X1 — Wrong freeze schedule across stages → alignment never forms or the LLM is wrecked
+**Symptom**: a LLaVA-style VLM doesn't ground on images (ignores visual tokens), or text quality collapses after multimodal finetune.
+**Root cause**: VLM training is **staged**, each stage freezing different towers. Stage 1 (alignment): freeze vision encoder **and** LLM, train **only the projector**. Stage 2 (instruction tuning): unfreeze LLM **and** projector, keep vision encoder frozen. Training the LLM in stage 1 (before the projector aligns) corrupts it; never unfreezing it means it can't use the visual tokens.
+**Fix**: set `requires_grad` per tower per stage per the recipe; print trainable-param counts at each stage start to confirm the freeze took. "Ignores its input image" is model-ignores-input → cross-link **verifying-dl-experiments** (**REQUIRED**). ([LLaVA recipe](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
+### X2 — Projector trained from scratch with the LLM hot → unstable stage-1
+**Symptom**: stage-1 alignment loss is unstable or the projector output is garbage.
+**Root cause**: the projector (linear or 2-layer MLP, often concatenating groups of vision tokens into the LLM embedding space) is **randomly initialized**; flowing gradients into the LLM through an un-aligned projector destabilizes both.
+**Fix**: stage 1 trains the projector **alone** against a frozen LLM so it learns the LLM's embedding space first; only then (stage 2) unfreeze the LLM. Confirm projector output dim == LLM hidden size (else a silent shape-broadcast bug). ([projector adapter](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
+### X3 — One global LR for all towers → vision encoder drifts or LLM underfits
+**Symptom**: a shared LR either over-updates the pretrained vision encoder (forgets visual features) or leaves projector/LLM undertrained.
+**Root cause**: towers are at different maturities — a pretrained vision encoder needs a tiny LR (or freeze), a fresh projector a larger one, the LLM in between. LLaVA: ~`1e-3` projector in alignment, `2e-5` LLM in stage 2.
+**Fix**: **parameter groups** with per-group LRs (`AdamW([{"params":proj.parameters(),"lr":1e-3},{"params":llm.parameters(),"lr":2e-5}])`), or freeze the vision encoder. Log each group's LR to confirm. ([LLaVA LRs](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
+### X4 — Sequence truncation drops image tokens → shape error or silent loss of vision
+**Symptom**: VLM training errors on image-token-count mismatch, or intermittently ignores images on long examples.
+**Root cause**: image placeholders expand into many tokens (hundreds per image); a `max_length` truncation cuts them, breaking the image-token ↔ feature alignment.
+**Fix**: set `max_length=None` (or large enough never to truncate image tokens) for VLM trainers, or verify truncation never removes placeholders across the whole dataset. Count image tokens vs expected per example as a smoke check. ([TRL VLM note](https://huggingface.co/docs/trl/dpo_trainer))
+### X5 — Modality alignment collapse: the LLM answers from text priors, not the image
+**Symptom**: the VLM gives plausible answers that ignore the actual image content (language-prior shortcut).
+**Root cause**: weak visual signal (bad projector, frozen-everything, too little alignment data) lets the LLM fall back on its language prior — a degenerate "ignore the input" solution that still lowers loss on text-predictable answers.
+**Fix**: the mechanical fixes are X1–X3 (correct freeze/projector/LR so the visual path contributes). Whether it's genuinely grounding vs shortcutting is **verifying-dl-experiments** (model-ignores-input/degenerate-output; **REQUIRED**) — probe with image-perturbation / counterfactual-image tests, which that skill owns.
+---
+## Pointers — domain-adjacent mechanics catalogued elsewhere
+- **Precision / NaN / loss-spike / z-loss / grad-clip** (L6's general ladder) → `references/training/precision-stability.md`.
+- **OOM, activation checkpointing, LoRA/QLoRA, FSDP/ZeRO, seq-len memory** → `references/training/oom-memory.md`.
+- **Dataloader starvation, GPU-util%, NVMe staging, tar-sharding** → `references/gotchas_universal.md` (U8, U21, U24, U25).
+- **DDP/FSDP launch, SyncBatchNorm under DDP, NCCL** → `references/training/distributed-launch.md`, `references/multinode.md`.
+- **Checkpoint-everything-stateful + atomic resume** (R4's general form) → `references/training/checkpoint-resume.md`, `references/spot-resilience.md`.
+- **"Runs but won't learn": loop wiring, optimizer/LR/weight-decay, loss-function & label form, freezing/BN drift, dataloader correctness** → `references/training/convergence-debugging.md`, `references/training/data-pipeline.md`.
+- **Is the metric/model correct** (collapse, leakage, all-zero metrics, model-ignores-input, seed discipline) → **verifying-dl-experiments** (**REQUIRED** — owns every "bug vs real effect" fork above).

package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md ADDED Viewed

@@ -0,0 +1,368 @@
+# Correct checkpointing & idempotent resume — full state, atomic write, sharded checkpoints, framework APIs
+Make a training job resume **exactly where it stopped** after any kill — not "reload the weights and
+silently restart the epoch." This layer owns the *mechanics*: what FULL state to save, how to write it
+without corruption, how to load it unconditionally, and the framework-specific knobs (FSDP / DeepSpeed /
+HF Trainer / Accelerate / Lightning) plus the resume **bugs** that make a job look resumed while it
+quietly lost progress. **verifying-dl-experiments** (**REQUIRED**) owns *is the resumed number correct* —
+e.g. proving step/epoch/loss actually continued instead of resetting is its reproducibility check applied
+here. The spot/preemption *cadence* (when + how often, Young/Daly) lives in
+`references/spot-resilience.md` (**REQUIRED** for any interruptible/spot tier) — this file is the *content
+and correctness* of each checkpoint; that file is the *timing*.
+To jump: `grep -in '<keyword>' references/training/checkpoint-resume.md` (e.g. `atomic`, `rename`,
+`scaler`, `ema`, `sampler`, `fsdp`, `sharded`, `zero_to_fp32`, `dcp`, `resume_from_checkpoint`,
+`save_state`, `ckpt_path`, `save_total_limit`, `reshuffle`).
+## Table of contents
+- **The contract** — C1 full-state-list · C2 atomic-write · C3 load-latest-unconditionally · C4 durable-location
+- **Sharded checkpoints (multi-GPU)** — C5 FSDP-FULL_STATE_DICT-rank0-OOM · C6 FSDP-SHARDED_STATE_DICT · C7 DCP-(dcp.save/load) · C8 DeepSpeed-ZeRO-dir+zero_to_fp32
+- **Framework APIs** — C9 HF-Trainer-resume_from_checkpoint+save_total_limit · C10 Accelerate-save_state/load_state · C11 Lightning-ModelCheckpoint+ckpt_path
+- **The resume BUGS** — C12 epoch-restarts · C13 data-reshuffles/order · C14 LR-schedule-resets · C15 scaler-not-restored · C16 EMA-not-saved · C17 save_total_limit-deletes-best · C18 strict-load-key-mismatch
+- **Pointers** — disk-full on save → gotchas_universal.md U6 · silent sync → U33 · keepable-policy/save_top_k → verifying-dl-experiments (skill) · cadence/Young-Daly → spot-resilience.md
+---
+## The contract
+### C1 — A checkpoint that restores only weights is NOT a resume — save the FULL training state
+**Symptom**: resume "works" (no crash) but the loss jumps up, accuracy regresses, or training takes more
+total epochs than an uninterrupted run — because the resume silently restarted the epoch, reset the
+optimizer momentum, and reshuffled the data.
+**Root cause**: `torch.save(model.state_dict())` captures *weights only*. Optimizer momentum/variance,
+the LR-scheduler position, the epoch/step counter, RNG state, the AMP scaler, and the dataloader position
+are all lost, so the restarted run is a *different* trajectory, not a continuation.
+**Fix**: every checkpoint must carry the full state (PyTorch tutorial
+[saving multiple / general checkpoint](https://docs.pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html);
+the spot-resilience §3 list):
+| Must save | Why losing it breaks resume |
+|---|---|
+| model `state_dict` | the weights (obvious) |
+| optimizer `state_dict` | Adam m/v momentum — losing it = a cold optimizer restart (C12) |
+| LR-scheduler `state_dict` | step-based LR position — losing it resets the schedule (C14) |
+| `epoch` **and** global `step`/iteration | resume the exact position, not the epoch start (C12) |
+| RNG state: Python `random`, NumPy, `torch`, **CUDA** (`torch.cuda.get_rng_state_all()`) | reproducible augmentation/dropout stream after restart |
+| dataloader / sampler position | so the next batch is the *next* unseen one, not a reshuffle (C13) |
+| AMP `GradScaler` `state_dict` | the loss-scale + growth tracker — losing it triggers an inf-scale stall (C15) |
+| EMA / SWA shadow weights (if used) | the EMA copy is often what's evaluated — losing it = eval on the wrong weights (C16) |
+| best-metric-so-far + `best.pth` selection state | so "best" survives a restart instead of resetting |
+The runnable atomic skeleton that assembles this dict is in `references/spot-resilience.md` §5 — do not
+duplicate it; this table is the *checklist*, that is the *code*.
+### C2 — Write atomically: tmp → fsync → os.replace (a kill mid-write corrupts a naive save)
+**Symptom**: after a preemption/OOM, `latest.pth` is truncated/zero-byte or `torch.load` raises
+`RuntimeError: PytorchStreamReader failed reading zip archive`; a `latest.pth.tmp` is left behind.
+**Root cause**: overwriting `latest.pth` in place is **not** atomic — a kill partway through leaves a
+corrupt file and (if it was the only checkpoint) zero good ones. `torch.save` itself does *not* fsync.
+**Fix**: write to a temp file, force bytes to disk, then atomically rename (POSIX `rename`/`os.replace`
+is atomic on the **same filesystem**):
+```python
+tmp = ckpt_path + ".tmp"
+with open(tmp, "wb") as f:
+    torch.save(state, f); f.flush(); os.fsync(f.fileno())   # bytes hit disk BEFORE the swap
+os.replace(tmp, ckpt_path)                                   # all-or-nothing; keep prev until this returns
+```
+Keep the previous `latest.pth` valid until the rename returns (a kill at any instant leaves one intact
+file). `os.replace` (not `os.rename`) also works on Windows for the local-test path. Full recipe +
+rationale: `references/spot-resilience.md` §3. Disk-full *during* the save is a separate failure with the
+same `.tmp` left behind → `references/gotchas_universal.md` U6 (pre-budget + prune `latest`, keep `best`).
+### C3 — Load-latest UNCONDITIONALLY on startup → idempotent resume
+**Symptom**: a relaunch starts from scratch because the resume is gated behind a `--resume` flag the
+launch wrapper forgot to pass; or two code paths (fresh vs resume) diverge.
+**Root cause**: making resume *opt-in* means a generic relaunch (spot recovery, SSH-drop restart, queue
+retry) re-trains from zero. A divergent "first launch" code path also drifts from the resume path.
+**Fix**: one code path that loads the latest checkpoint if it exists, else starts fresh — so the
+**identical launch command** converges to the same end state no matter how many times it runs. This is
+what makes principle #7's "retry the identical config" actually *resume* instead of restart, and it is the
+universal spine (principle #8) under SSH-drop / Slurm-walltime / K8s-reschedule / spot-preemption. Skeleton:
+`references/spot-resilience.md` §3 (`load_latest_if_any`).
+### C4 — Checkpoint to the platform's DURABLE location, not local scratch
+**Symptom**: resume after a managed-spot replacement (or a `terminate`/`destroy`) finds no checkpoint —
+the box came up *fresh* and the only copy was on the dead instance's local disk.
+**Root cause**: a replacement node is clean; anything not on a cloud bucket / network volume / shared FS
+is gone (principle #4 — know what survives stop vs destroy).
+**Fix**: write checkpoints to the profile's durable mount (`DURABLE_DIR` in `profiles/<platform>.md` §8),
+or mirror local→durable on the checkpoint timer. The single biggest portability trap is assuming local
+disk survives — see each profile's STORAGE survival-matrix and the SKILL Quick-reference table. Gate the
+sync on the actual copy result, never an unconditional `echo synced` →
+`references/gotchas_universal.md` U33.
+---
+## Sharded checkpoints (multi-GPU)
+### C5 — FSDP `FULL_STATE_DICT` OOMs on rank 0 when gathering a large model
+**Symptom**: an FSDP job trains fine but **crashes at the first checkpoint** with CUDA OOM on rank 0;
+the model is larger than one GPU.
+**Root cause**: `StateDictType.FULL_STATE_DICT` all-gathers every shard onto **one rank** to assemble the
+unsharded dict. For a model that only fits *because* it's sharded, materializing the whole thing on rank 0
+exceeds that GPU's VRAM.
+**Fix**: when taking a full (consolidated) dict, offload it to CPU and build it on rank 0 only —
+`FullStateDictConfig(offload_to_cpu=True, rank0_only=True)`. This all-gathers parameters one-by-one,
+offloading each to CPU on rank 0, so peak GPU memory stays bounded and non-rank-0 workers skip the GPU→CPU
+copy entirely
+([HF Accelerate FSDP guide](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp),
+[Lightning issue #11207](https://github.com/Lightning-AI/pytorch-lightning/issues/11207)). The full dict
+is only viable when it fits in CPU RAM; past that, use sharded (C6). Save a full dict only at the **end**
+for a portable single-file artifact; checkpoint *during* training as sharded.
+### C6 — `SHARDED_STATE_DICT`: each rank saves its own shard (no gather, no rank-0 OOM)
+**Symptom**: need to checkpoint a model too big to consolidate even on CPU, or want a fast resume that
+re-shards onto a *different* world size.
+**Root cause**: `FULL_STATE_DICT` is fundamentally a single-rank materialization; it does not scale and
+cannot reshard.
+**Fix**: use `StateDictType.SHARDED_STATE_DICT` — every rank writes only its own shard, so there is no
+all-gather and no OOM, and the per-rank files load back in parallel. Pair it with Distributed Checkpoint
+(C7), which is the production path for sharded save/load and supports **resharding** (resume on a different
+GPU count). Tradeoff: a sharded checkpoint is a *directory of N files*, not a single `.pth` — convert to a
+full dict for export/inference (C7's `get_model_state_dict`, or the DeepSpeed analogue C8).
+### C7 — Distributed Checkpoint (DCP): `dcp.save` / `dcp.load` for FSDP/sharded models
+**Symptom**: hand-rolling FSDP state-dict context managers is brittle, slow, and breaks when the world
+size changes between save and resume.
+**Root cause**: `torch.save` produces a single file and has no notion of sharding or FQN remapping;
+manually toggling `FSDP.state_dict_type` is error-prone.
+**Fix**: use `torch.distributed.checkpoint` (DCP), the current PyTorch-2.x sharded-checkpoint API
+([DCP recipe](https://docs.pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html),
+[2.12 reference](https://docs.pytorch.org/docs/2.12/distributed.checkpoint.html)). **Save**: get canonical
+dicts with `get_state_dict(model, optimizer)` from `torch.distributed.checkpoint.state_dict`, then
+`dcp.save(state_dict, checkpoint_id=DIR)` — it writes **≥1 file per rank in parallel** and auto-manages FQN
+mappings. **Load**: allocate the model first, then `dcp.load(state_dict, checkpoint_id=DIR)` (loads **in
+place** and **auto-reshards** to the current world size), then `set_state_dict(...)`. DCP beats
+`torch.save` for any distributed model because it shards the write across ranks (no rank-0 gather, C5) and
+reshards on load. For a single portable inference file, convert offline with `torch.distributed.checkpoint.format_utils.dcp_to_torch_save(DIR, "out.pt")` (or the CLI `python -m torch.distributed.checkpoint.format_utils dcp_to_torch DIR out.pt`).
+### C8 — DeepSpeed ZeRO: a checkpoint *directory* per save + `zero_to_fp32.py` to consolidate
+**Symptom**: `model_engine.save_checkpoint(dir)` writes a *folder* of `mp_rank_*` / `zero_pp_rank_*`
+files, not a `.pth`; loading the weights into a plain (non-DeepSpeed) model for inference fails.
+**Root cause**: ZeRO **partitions** optimizer state (stage 1), gradients (2), and parameters (3) across
+ranks; the on-disk checkpoint is inherently sharded across per-rank files — it is not a single fp32 model.
+**Fix** ([DeepSpeed model-checkpointing](https://deepspeed.readthedocs.io/en/stable/model-checkpointing.html),
+[ZeRO tutorial](https://www.deepspeed.ai/tutorials/zero/)):
+- **Save/resume training** — `model_engine.save_checkpoint(save_dir, tag)` /
+  `model_engine.load_checkpoint(save_dir, tag)`. **All ranks must call both** (they're collective; rank-0
+  only deadlocks/corrupts). Round-trips full sharded optimizer+param state.
+- **Export a single fp32 model** — DeepSpeed auto-drops a `zero_to_fp32.py` into the checkpoint dir; run
+  `python zero_to_fp32.py <checkpoint_dir> pytorch_model.bin`, or in-process
+  `from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint(dir)` /
+  `convert_zero_checkpoint_to_fp32_state_dict(...)` / `load_state_dict_from_zero_checkpoint(model, dir)`
+  (the last returns a model that **can't continue training** without re-init). The consolidated file no
+  longer needs DeepSpeed. For ZeRO-3, set
+  `"zero_optimization": {"stage3_gather_16bit_weights_on_model_save": true}` + `engine.save_16bit_model(dir)`.
+---
+## Framework APIs
+### C9 — HF Trainer: `resume_from_checkpoint` + `save_total_limit` (and what it actually saves)
+**Symptom**: assuming `Trainer.save_model()` is a resume point (it saves *weights only*); or a relaunch
+re-trains from step 0 because `resume_from_checkpoint` wasn't passed; or the disk fills with `checkpoint-*`
+dirs.
+**Root cause**: `save_model` ≠ a training checkpoint. A real Trainer checkpoint dir (`checkpoint-<step>`)
+contains the model **plus** `optimizer.pt`, `scheduler.pt`, `rng_state.pth`, `trainer_state.json`, and the
+AMP `scaler.pt` — the full state. Without `resume_from_checkpoint` the run starts cold.
+**Fix** ([Trainer docs](https://huggingface.co/docs/transformers/main/en/main_classes/trainer)):
+`trainer.train(resume_from_checkpoint="path/to/checkpoint-1500")` resumes that exact dir;
+`resume_from_checkpoint=True` auto-finds the **last** checkpoint in `args.output_dir` (idempotent spelling,
+C3; `trainer_utils.get_last_checkpoint(output_dir)` finds it in code). `save_strategy="steps"` +
+`save_steps=N` (or `"epoch"`) sets cadence; **`save_total_limit=k`** keeps only the `k` most-recent
+`checkpoint-*` and **deletes older ones in `output_dir`** — the built-in disk-budget knob (pairs with
+`references/gotchas_universal.md` U6). `load_best_model_at_end=True` + `metric_for_best_model` +
+`greater_is_better` reloads the best checkpoint at the end **and** protects it from `save_total_limit`
+deletion (C17).
+### C10 — Accelerate: `accelerator.save_state(dir)` / `load_state(dir)` + dataloader skip
+**Symptom**: a custom (non-Trainer) Accelerate loop resumes with a cold optimizer/scaler, or the LR
+scheduler resets, or it replays already-seen batches.
+**Root cause**: saving only `accelerator.get_state_dict(model)` drops optimizer/scaler/RNG; and a
+mid-epoch resume re-iterates the dataloader from batch 0.
+**Fix** ([Accelerate checkpoint guide](https://huggingface.co/docs/accelerate/en/usage_guides/checkpoint)):
+`accelerator.save_state(output_dir)` saves model, optimizer, **GradScaler**, and RNG generators in one
+call; `accelerator.load_state(output_dir)` restores all of it (objects must come from the *same* script).
+The LR scheduler (and any object with `state_dict`/`load_state_dict`) **must** be registered first —
+`accelerator.register_for_checkpointing(my_scheduler)` — or it is not saved and resets (C14). For
+mid-epoch resume, skip consumed batches with `accelerator.skip_first_batches(train_dataloader, N)` on the
+first resumed epoch, then fall back to the full dataloader (C13).
+`ProjectConfiguration(automatic_checkpoint_naming=True, total_limit=k)` gives rolling
+`checkpoints/checkpoint_<n>` dirs with a built-in limit.
+### C11 — Lightning: `ModelCheckpoint` + `trainer.fit(ckpt_path=...)` (don't use `resume_from_checkpoint`)
+**Symptom**: an old tutorial's `Trainer(resume_from_checkpoint=...)` is ignored/deprecated; or
+`save_top_k` quietly deletes the checkpoint needed to resume.
+**Root cause**: `resume_from_checkpoint` moved to `fit(ckpt_path=...)` (deprecated since 1.x). A Lightning
+`.ckpt` is a full dump — epoch, global step, LightningModule `state_dict`, **all** optimizer + LR-scheduler
+states, callback states, loop state, and the 16-bit scaling factor (AMP)
+([Lightning checkpointing basics](https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html)).
+**Fix**:
+- Configure `ModelCheckpoint(dirpath=..., monitor="val_loss", mode="min", save_top_k=k, save_last=True)`;
+  resume with `trainer.fit(model, datamodule, ckpt_path="path/to/last.ckpt")`, or
+  `ckpt_path="last"` to auto-pick the `save_last=True` file (the idempotent spelling, C3). Best/last paths
+  read back from `cb.best_model_path` / `cb.last_model_path`.
+- `save_top_k` keeps only the k best by `monitor`; **always set `save_last=True`** so a resume target
+  exists even when the latest step isn't a top-k metric (otherwise resume may have no recent checkpoint).
+  Add custom state (EMA, C16) via `on_save_checkpoint` / `on_load_checkpoint` on the module or a stateful
+  callback. Lightning's DeepSpeed strategy writes a ZeRO dir — convert with
+  `lightning.pytorch.utilities.deepspeed.convert_zero_checkpoint_to_fp32_state_dict` (C8 analogue).
+---
+## The resume BUGS (looks resumed, silently lost progress)
+These are the "it ran without error but the result is wrong" traps — confirm the fix with the
+`verifying-dl-experiments` reproducibility check (**REQUIRED**): kill mid-run, relaunch the *identical*
+command, and verify step/epoch/loss **continue** rather than reset.
+### C12 — Epoch/step restarts from 0 despite "resuming"
+**Symptom**: tracker shows a second run starting at epoch 1; total trained epochs exceed the schedule;
+LR warm-up replays. (The remote-ops version of this — a tmux script re-executed mid-run — is
+`references/gotchas_universal.md` U2.)
+**Root cause**: the loop is `for epoch in range(total_epochs)` with a hardcoded `0` start; the saved
+`epoch`/`step` was never read back, or was saved but not used to seed the range.
+**Fix**: `start_epoch, start_step = load_latest_if_any(...)` then
+`for epoch in range(start_epoch, total_epochs)` and seed the step counter from `start_step`. The counter
+**must** be in the checkpoint (C1) *and* consumed on load.
+### C13 — Data reshuffles / repeats the same order after resume
+**Symptom**: resume re-shows already-seen samples (worse, the *same* batch every epoch even without
+resume), hurting convergence or leaking.
+**Root cause**: two distinct bugs. (a) Resume restarts the epoch from batch 0 without skipping consumed
+batches. (b) `DistributedSampler` seeds its shuffle from an internal epoch that defaults to 0 forever
+unless `sampler.set_epoch(epoch)` is called each epoch — so every epoch (and every resume) produces the
+**identical** order
+([PyTorch #31771](https://github.com/pytorch/pytorch/issues/31771),
+[DistributedSampler docs](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler)).
+**Fix**: call `train_sampler.set_epoch(epoch)` at the top of every epoch (restore the epoch counter on
+resume so the shuffle stream continues). For mid-epoch resume, fast-forward consumed batches
+(`accelerator.skip_first_batches`, C10) or use a resumable/stateful sampler (`torchdata`
+`StatefulDataLoader`) whose offset is in the checkpoint (C1).
+### C14 — LR schedule resets (cosine restarts, warm-up replays)
+**Symptom**: the LR curve restarts from the initial/warm-up value on resume; final LR is wrong; cosine
+decay never reaches its floor.
+**Root cause**: the LR scheduler's `state_dict` (its `last_epoch`/step counter) was not saved or not
+restored. With Accelerate, the scheduler wasn't `register_for_checkpointing`-ed (C10).
+**Fix**: save `scheduler.state_dict()` and call `scheduler.load_state_dict(...)` on resume (C1). Note a
+step-based scheduler advanced *per optimizer step* must restore the **step**, not the epoch — restoring
+only `epoch` under-/over-shoots the schedule.
+### C15 — AMP `GradScaler` not restored → "No inf checks were recorded" / scale stall
+**Symptom**: resuming a mixed-precision run raises
+`AssertionError: No inf checks were recorded for this optimizer`, or training stalls/NaNs because the
+loss-scale snapped back to the default and re-enters the scale-search.
+**Root cause**: the `GradScaler` holds dynamic state — `scale`, `growth_factor`, `backoff_factor`,
+`growth_interval`, `_growth_tracker` — that evolves during training; dropping it resets the scaler
+([PyTorch AMP recipe](https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html),
+[forum: No inf checks were recorded](https://discuss.pytorch.org/t/resume-training-with-mixed-precision-lead-to-no-inf-checks-were-recorded-for-this-optimizer/115828)).
+**Fix**: save `scaler.state_dict()` (call it **after** `scaler.update()` in the iteration) and
+`scaler.load_state_dict(checkpoint["scaler"])` on resume. HF Trainer (`scaler.pt`), Accelerate
+(`save_state`), and Lightning (16-bit factor) all do this automatically — the bug bites hand-written loops.
+Resuming a *non-AMP* checkpoint into an AMP run has no saved scaler → start a **fresh** `GradScaler`.
+### C16 — EMA / SWA shadow weights not saved → eval on the wrong weights after resume
+**Symptom**: pre-resume eval (using EMA weights) is good; post-resume eval drops sharply, then recovers
+over many steps — because the EMA copy restarted from the raw weights.
+**Root cause**: EMA/SWA maintain a *separate* shadow parameter set that is what gets evaluated/exported;
+saving only the live model `state_dict` loses it, so EMA reinitializes from the (noisier) live weights.
+**Fix**: include `ema.state_dict()` (and SWA `AveragedModel` / `swa_scheduler` state) in the checkpoint
+dict (C1) and restore it. In Lightning, persist it via `on_save_checkpoint`/`on_load_checkpoint` (C11).
+This is a *which-weights-are-correct* concern at the boundary — cross-link **verifying-dl-experiments**
+(**REQUIRED**) for confirming the evaluated weights are the intended ones.
+### C17 — `save_total_limit` / `save_top_k` deletes the very checkpoint resume needs
+**Symptom**: resume fails because the target checkpoint was auto-pruned; or `load_best_model_at_end`
+errors because the best checkpoint was rotated out.
+**Root cause**: a rolling limit prunes by *recency* (`save_total_limit`) or by *metric* (`save_top_k`),
+and neither guarantees the most-recent-step checkpoint is the one kept — so the resume anchor can be the
+one deleted.
+**Fix**: keep an explicit `last`/`latest` alongside the top-k (`save_last=True` in Lightning, C11; in HF,
+`load_best_model_at_end=True` makes Trainer preserve the best checkpoint past `save_total_limit`). General
+keepable-checkpoint *policy* (how many, which selection criterion, `save_top_k ≤ 3`, prune `latest`) is
+owned by **verifying-dl-experiments** (**REQUIRED**); the disk-budget consequence is
+`references/gotchas_universal.md` U6.
+### C18 — `load_state_dict` key mismatch on resume (`module.` prefix, compiled-model prefix)
+**Symptom**: resume raises `Missing key(s)` / `Unexpected key(s) ... module.<name>` or
+`_orig_mod.<name>`, or strict load fails after switching DDP/`torch.compile` on or off.
+**Root cause**: `DataParallel`/DDP wrap adds a `module.` prefix and `torch.compile` adds `_orig_mod.` to
+every key; a checkpoint saved wrapped and loaded unwrapped (or vice-versa) won't key-match under
+`strict=True`.
+**Fix**: save the **unwrapped** module — `model.module.state_dict()` (DDP) /
+`accelerator.unwrap_model(model).state_dict()` / `model._orig_mod.state_dict()` (compiled) — so the
+checkpoint is wrapper-agnostic. On load, strip the prefix if present
+(`{k.replace("module.", "").replace("_orig_mod.", ""): v for k, v in sd.items()}`). Keep `strict=True`
+while debugging a resume so a silent partial load can't masquerade as success; only relax it deliberately.
+---
+## Pointers — owned elsewhere, do NOT restate here
+- **Cadence — when/how often** (Young/Daly `W = sqrt(2·mu·C)`, grace windows, opportunistic SIGTERM
+  last-flush, the runnable atomic skeleton) → `references/spot-resilience.md` (**REQUIRED**, spot tier).
+- **Disk-full on save** (pre-budget, prune `latest`, keep `best`, `.tmp` recovery) →
+  `references/gotchas_universal.md` U6; **silent "synced" line** → U33; **inode exhaustion** → U7.
+- **Sharding a model that won't fit** (FSDP wrap policy, ZeRO stages, offload) is the *fitting* concern →
+  `references/training/oom-memory.md` M9/M10; this file owns *checkpointing* the sharded state.
+- **Multi-rank save/load collectives + elastic restart** (torchrun `--max-restarts` restores from the
+  checkpoint) → `references/training/distributed-launch.md`, `references/multinode.md`.
+- **Keepable-checkpoint policy + "is the resumed/best number real"** (selection criterion, `save_top_k`,
+  proving step/epoch/loss continued) → **verifying-dl-experiments** (**REQUIRED**).