opencode-skills-collection 3.1.2 → 3.1.4

Files changed (65)
  1. package/bundled-skills/.antigravity-install-manifest.json +4 -1
  2. package/bundled-skills/agent-creator/SKILL.md +246 -0
  3. package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
  4. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  5. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  6. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  7. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  8. package/bundled-skills/docs/sources/sources.md +1 -1
  9. package/bundled-skills/docs/users/bundles.md +1 -1
  10. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  11. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  12. package/bundled-skills/docs/users/getting-started.md +1 -1
  13. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  14. package/bundled-skills/docs/users/usage.md +4 -4
  15. package/bundled-skills/docs/users/visual-guide.md +4 -4
  16. package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
  17. package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
  18. package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
  19. package/bundled-skills/remote-gpu-trainer/README.md +267 -0
  20. package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
  21. package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
  22. package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
  23. package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
  24. package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
  25. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
  26. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
  27. package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
  28. package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
  29. package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
  30. package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
  31. package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
  32. package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
  33. package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
  34. package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
  35. package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
  36. package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
  37. package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
  38. package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
  39. package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
  40. package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
  41. package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
  42. package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
  43. package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
  44. package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
  45. package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
  46. package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
  47. package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
  48. package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
  49. package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
  50. package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
  51. package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
  52. package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
  53. package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
  54. package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
  55. package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
  56. package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
  57. package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
  58. package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
  59. package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
  60. package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
  61. package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
  62. package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
  63. package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
  64. package/package.json +1 -1
  65. package/skills_index.json +66 -0
@@ -0,0 +1,230 @@
1
+ # Per-domain training gotchas — make each domain's run start, not lie, and not silently mistrain
2
+
3
+ The cross-cutting layers (precision, OOM, throughput, checkpoint, distributed) hold everywhere; this file
4
+ is the **domain-shaped** residue — the data-format, masking, normalization, schedule, and freezing traps
5
+ that only bite LLM / vision / diffusion / RL / VLM training. Each entry is **Symptom → Root cause → Fix**
6
+ with the exact knob. This layer owns *making the domain pipeline RUN and debugging its mechanics*;
7
+ **verifying-dl-experiments** owns *is the converged number real* (collapse-vs-real-effect, train/val
8
+ leakage, metric validity, constant/degenerate output). Cross-link it (**REQUIRED**) at every "loss is fine
9
+ but the output/metric is wrong" fork — the headline domain failures (diffusion samples bad at low loss,
10
+ mAP=0, reward collapse, VLM ignores the image) are exactly that shape.
11
+
12
+ To jump: `grep -in '<keyword>' references/training/by-domain.md` (e.g. `padding`, `packing`, `rope`,
13
+ `z-loss`, `dpo`, `mAP=0`, `mIoU`, `ignore_index`, `ema`, `vae`, `cfg`, `kl`, `whiten`, `projector`,
14
+ `freeze`, `seed`).
15
+
16
+ ## Table of contents
17
+
18
+ - **LLM** — L1 pad-side · L2 loss-mask-−100 · L3 pad-token-unset · L4 packing-cross-contamination · L5 RoPE-context-extension · L6 grad-explosion+z-loss · L7 eval-perplexity-mask · L8 SFT/DPO/RLHF-data-format · L9 DPO-collapse+KL · L10 gated-token-before-Trainer
19
+ - **Vision (cls/det/seg)** — V1 normalization-mismatch · V2 aug-on-eval · V3 mAP=0 · V4 anchor/NMS/conf-thresh · V5 mIoU=0 ignore_index/off-by-one · V6 class-imbalance · V7 BN-tiny-batch
20
+ - **Diffusion** — DF1 loss-low-samples-bad (cross-link) · DF2 EMA-weights · DF3 VAE-scaling · DF4 noise-schedule/timestep · DF5 CFG-conditioning-dropout · DF6 sampler-vs-model · DF7 SNR-weighting
21
+ - **RL** — R1 reward-collapse · R2 KL-blowup · R3 whitening · R4 replay/obs-normalization · R5 non-stationarity · R6 seed-variance (cross-link)
22
+ - **VLM** — X1 stage-freeze · X2 projector-only-stage1 · X3 per-group-LR · X4 image-token-truncation · X5 alignment-collapse (cross-link)
23
+ - **Pointers** — precision-stability.md, oom-memory.md, gotchas_universal.md, verifying-dl-experiments (skill)
24
+
25
+ ---
26
+
27
+ ## LLM / transformer
28
+
29
+ ### L1 — Padding side: right for causal-LM SFT, **left** for generation/DPO
30
+ **Symptom**: a fine-tuned causal LM produces garbage, or DPO/batched-generation logprobs disagree with single-example decoding.
31
+ **Root cause**: causal-LM **training** wants **right-padding** (pad lands after content, attention mask zeroes it). Batched **generation/DPO** want **left-padding** — with right-padding the "last real token" position differs per row, so a shared decode step reads pad. TRL's `DPOTrainer` requires `processing_class` padding side `"left"`.
32
+ **Fix**: `tokenizer.padding_side="right"` for SFT collation; `"left"` for generation/eval/DPO — set it per phase, not globally. ([HF causal-LM](https://huggingface.co/docs/transformers/tasks/language_modeling), [TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
33
+
34
+ ### L2 — Loss over prompt + pad tokens dilutes the signal → mask with −100
35
+ **Symptom**: SFT "trains" but parrots the prompt / barely follows instructions; loss plausible but flat.
36
+ **Root cause**: HF LM loss is `CrossEntropyLoss(ignore_index=-100)` — only `-100` positions are skipped. Leaving the prompt-prefix labels and pad labels as real ids averages the loss over "predict the prompt / predict pad."
37
+ **Fix**: set labels to `-100` at **both** the prompt prefix (train only on the completion) and all padding positions. TRL `SFTTrainer` `completion_only_loss` / `DataCollatorForCompletionOnlyLM` does the prefix masking — verify it fired (decode one masked label row). Whether the gradient hits the right tokens is a smoke-target → cross-link **verifying-dl-experiments** (**REQUIRED**). ([gpt2 thread](https://huggingface.co/gpt2/discussions/34))
38
+
39
+ ### L3 — `pad_token` unset → pad error or silent pad-with-token-0
40
+ **Symptom**: `ValueError: Asking to pad but the tokenizer does not have a padding token`, or it pads with id 0 (a real token, often `<unk>`/`!`).
41
+ **Root cause**: many base LMs (GPT-2, Llama, Mistral) ship no `pad_token`.
42
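+ **Fix**: `tokenizer.pad_token = tokenizer.eos_token` and `model.config.pad_token_id = tokenizer.pad_token_id`. With right-padding + attention mask, reusing EOS as PAD is safe, provided label masking follows the attention mask rather than `pad_token_id` (otherwise the real EOS label is masked too and the model never learns to stop). If a *new* token is added, call `model.resize_token_embeddings(len(tokenizer))` or its id indexes out of range.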
43
+
44
+ ### L4 — Sequence packing leaks attention across documents → contaminated training
45
+ **Symptom**: throughput jumps after enabling packing, but quality drops vs unpacked; the model "completes" one doc with content from a packed neighbor.
46
+ **Root cause**: naive packing concatenates examples into one `max_len` sequence; a vanilla causal mask lets a token in doc 2 attend back into doc 1 — cross-sequence contamination.
47
+ **Fix**: **document masking** — emit `position_ids` that reset per sub-sequence + an attention impl that honors boundaries. TRL/HF `DataCollatorWithFlattening` packs into one stream, returns `position_ids`, and sets each example's first label to `-100`; FlashAttention-2 varlen restricts attention within-document. Requires `attn_implementation="flash_attention_2"` — packing without it silently contaminates. ([HF blog](https://huggingface.co/blog/packing-with-FA2), [transformers #31629](https://github.com/huggingface/transformers/pull/31629), [IBM](https://research.ibm.com/blog/hugging-face-training-flash-attention))
48
+
49
+ ### L5 — Fine-tuning past pretrain context without RoPE scaling → garbage past N tokens
50
+ **Symptom**: a 4k-context model is incoherent past ~4k at inference even when trained on longer sequences; or long-context finetune won't converge.
51
+ **Root cause**: RoPE frequencies are calibrated to the pretrain context; longer positions extrapolate into unseen rotation angles. Linear interp degrades past ~4×; YaRN holds to 16–32×.
52
+ **Fix**: set `rope_scaling={"type":"linear"|"dynamic"|"yarn","factor":<target/orig>}` in config and finetune **with scaling active** (`"yarn"` for big jumps, `"linear"` only ≤4×). Train-time vs inference-time `rope_scaling` mismatch is a silent regression. ([RoPE deep dive](https://amaarora.github.io/posts/2025-09-21-rope-context-extension.html), [HF guide](https://medium.com/@leannetan/extending-context-length-with-hugging-faces-transformers-6b04db05b39a))
53
+
54
+ ### L6 — Loss spikes / logit drift in long LM training → z-loss + the precision-layer knobs
55
+ **Symptom**: pretraining/long-SFT loss is stable then spikes; in bf16/fp16 it can NaN; logits grow unboundedly over training.
56
+ **Root cause**: the softmax normalizer `log Z` drifts from 0 as logits grow → low-precision overflow + gradient instability.
57
+ **Fix**: add **z-loss** `1e-4 · log²(Z)` (the PaLM/Gopher coefficient) to pull `log Z` toward 0. The general divergence ladder (warmup, grad-clip, skip-the-batch, qk-norm, bf16-over-fp16) is **references/training/precision-stability.md** P12–P18 (z-loss is P15) — not restated here; this entry is the LM-specific *why z-loss exists*. ([PaLM](https://arxiv.org/abs/2204.02311), [small-scale proxies](https://arxiv.org/abs/2309.14322))
58
+
59
+ ### L7 — Eval perplexity wrong from a wrong mask/stride, not the model
60
+ **Symptom**: reported PPL implausible, or differs from a published number on the same checkpoint+data.
61
+ **Root cause**: PPL = `exp(mean NLL over scored tokens)`. Including pad/prompt tokens, or a sliding window that double-counts overlap context as scored tokens, corrupts the denominator.
62
+ **Fix**: score only non-`-100` positions; for long docs use the HF strided window where overlap tokens are `-100` (context, not scored). Whether the number is comparable across runs is metric-validity → cross-link **verifying-dl-experiments** (**REQUIRED**). ([HF perplexity](https://huggingface.co/docs/transformers/perplexity))
63
+
64
+ ### L8 — SFT / DPO / RLHF expect different dataset schemas; the wrong one trains on nothing
65
+ **Symptom**: TRL trainer runs but learns nothing, or errors on a missing column; preference data in an SFT trainer (or vice versa) silently mistrains.
66
+ **Root cause**: **SFT** = prompt+completion (train on completion); **DPO/preference** = `{prompt, chosen, rejected}` or conversational messages; **RLHF/PPO** = prompts only + a separate reward model. Conversational data needs the chat template applied.
67
+ **Fix**: match trainer to schema; for conversational data confirm the chat template fired (decode one example — look for role tags). Prefer the **explicit-prompt** form `{prompt, chosen, rejected}` over implicit. Recommended order SFT → DPO (DPO from a non-SFT base often underperforms). ([TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
68
+
69
+ ### L9 — DPO reward margin won't grow / chosen logps crash → beta + ref-model + collapse
70
+ **Symptom**: `rewards/margins` ~0 or `rewards/accuracies` ~0.5; or `logps/chosen` and `logps/rejected` both plunge (suppresses everything).
71
+ **Root cause**: `beta` controls deviation from the frozen reference — too **small** → policy drifts (implicit KL blows up, degenerate text); too **large** → signal too weak to move the margin. DPO widens the gap mostly by **suppressing the rejected** likelihood, so both logps falling *with a growing margin* is normal; both falling *with a flat margin* is collapse. A lost/absent `ref_model` (some PEFT paths) removes the anchor.
72
+ **Fix**: start `beta=0.1`, raise to 0.3–0.5 if text degrades, lower if the margin won't move. Use `learning_rate≈1e-6` (TRL DPO default; `≈1e-5` for LoRA) — too high is the classic collapse. Health signal: `rewards/margins` ↑, `rewards/accuracies` → ~0.7+. With `ref_model=None` TRL uses the initial policy as the frozen reference — concrete check: a frozen reference must yield **identical** logps for a fixed batch across steps; re-score one batch early and late, and if they drift the anchor is being trained (the trap when `ref_model=None` lacks a real frozen copy). Bug-vs-real-effect on the collapse → cross-link **verifying-dl-experiments** (**REQUIRED**). ([TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
73
+
74
+ ### L10 — Gated/private model 401s mid-Trainer → authenticate BEFORE construction
75
+ **Symptom**: `401`/`GatedRepoError` when the Trainer loads a Llama/Gemma/Mistral base despite granted access; or `push_to_hub` can't write.
76
+ **Root cause**: the token must be visible to the process **before** the gated `from_pretrained` / Trainer-internal load; setting it after is too late.
77
+ **Fix**: make the token available first (env/stdin, **never inline the literal**): set `HF_TOKEN`, or call `huggingface_hub.login(token=os.environ["HF_TOKEN"])` at the top before any `from_pretrained`. `push_to_hub` needs a **write**-scope token + `hub_model_id` in `TrainingArguments`. Verify with `huggingface-cli whoami` before launch; on a metered box a 401 wastes a full reload. Secrets transport → `references/ssh_transport.md` (U34); offline-without-key → gotchas_universal.md U35. ([HF gated](https://huggingface.co/docs/hub/en/models-gated))
78
+
79
+ ---
80
+
81
+ ## Vision (classification / detection / segmentation)
82
+
83
+ ### V1 — Normalization mismatch train↔eval → near-zero accuracy on a "trained" model
84
+ **Symptom**: training loss falls but val/test is near-chance; or a fine-tuned backbone is far worse than its pretrained eval.
85
+ **Root cause**: pretrain `(mean,std)` (ImageNet `mean=[.485,.456,.406] std=[.229,.224,.225]`) differs between train/eval paths, or one normalizes to `[0,1]` and the other `[0,255]`; or RGB vs BGR (OpenCV loads BGR). A reported CenterNet case got post-norm mean `-115`, std `8` from wrong channel stats.
86
+ **Fix**: use the **exact** pretrain normalization, identically in train and eval; match channel order. Print one input tensor's per-channel mean/std — should be ~`N(0,1)`. Remaining gap = real-effect vs this bug → cross-link **verifying-dl-experiments** (**REQUIRED**; input-normalization is a named check). ([why-normalize](https://inside-machinelearning.com/en/why-and-how-to-normalize-data-object-detection-on-image-in-pytorch-part-1/), [tf/models #10778](https://github.com/tensorflow/models/issues/10778))
87
+
88
+ ### V2 — Train-time augmentation applied at eval → unstable/depressed metrics
89
+ **Symptom**: eval numbers flicker run-to-run or sit below the training-curve val.
90
+ **Root cause**: the random transform pipeline (`RandomResizedCrop`, flip, jitter) is reused for the eval loader, so each pass sees different inputs; or `model.eval()` never called so Dropout/BN stay in train mode.
91
+ **Fix**: separate `train_transform` (random) from `eval_transform` (deterministic resize+center-crop+normalize); call `model.eval()` + `torch.no_grad()`. A flickering eval metric is usually this, not the model.
92
+
93
+ ### V3 — Detection mAP=0 despite a falling loss → box-format / label-id / scale mismatch
94
+ **Symptom**: detection loss decreases normally but mAP is exactly 0 (or ~0) at every eval.
95
+ **Root cause**: a format mismatch the loss tolerates but eval doesn't — (1) box format `cxcywh`/`xywh` vs evaluator's `xyxy`, or normalized `[0,1]` vs absolute pixels; (2) class id off-by-one (0-indexed model vs 1-indexed COCO, 0=background); (3) boxes in resized space matched against original-res GT; (4) eval score threshold so high everything is filtered.
96
+ **Fix**: assert the eval pipeline's box format + class indexing, convert explicitly (`torchvision.ops.box_convert`), and visualize 2–3 predicted boxes before trusting the metric. mAP=0 with healthy loss is almost never the model — it's eval glue; the all-zero-metric pattern → cross-link **verifying-dl-experiments** (**REQUIRED**). ([tf/models #10778](https://github.com/tensorflow/models/issues/10778), [bbox formats](https://www.learnml.io/posts/a-guide-to-bounding-box-formats/))
97
+
98
+ ### V4 — Detections vanish after NMS / anchor mismatch → no boxes survive
99
+ **Symptom**: raw head outputs look reasonable but final detections are empty or absurdly few.
100
+ **Root cause**: NMS IoU too aggressive, score threshold too high, or anchor sizes/ratios don't cover the dataset's object scales (regression targets unreachable).
101
+ **Fix**: log pre-NMS vs post-NMS counts; loosen `score_thresh` (~0.05 for eval recall), NMS IoU ~0.5–0.6; for anchor heads run k-means auto-anchor (YOLO `autoanchor`) on GT boxes. Pairs with V3.
102
+
103
+ ### V5 — Segmentation mIoU=0 or NaN → `ignore_index` / label off-by-one
104
+ **Symptom**: seg loss trains but mIoU is 0 (or a class NaN); or loss is NaN from step 1.
105
+ **Root cause**: label/class-index inconsistency — a void value (commonly `255`) not excluded → treated as a class id ≥ `num_classes` (out-of-range / pollutes IoU); or off-by-one (0=background but labels start at 1, or a `reduce_labels` 0→255 shift applied inconsistently between loss and metric).
106
+ **Fix**: set the **same** `ignore_index` in **both** loss and metric — `CrossEntropyLoss(ignore_index=255)` and mIoU mask `(label != 255)`; confirm `label.max() < num_classes` after any shift; apply reduction identically. mIoU=0 with falling loss = all-zero-metric pattern → cross-link **verifying-dl-experiments** (**REQUIRED**). ([torchmetrics #2747](https://github.com/Lightning-AI/torchmetrics/issues/2747), [HF ignore_index](https://discuss.huggingface.co/t/understanding-ignore-index-and-reduce-labels/64587))
107
+
108
+ ### V6 — Severe class imbalance → model predicts only the majority class
109
+ **Symptom**: high pixel/sample accuracy but rare classes never predicted; minority recall ~0.
110
+ **Root cause**: unweighted cross-entropy is dominated by the majority class; "always predict majority" is the easy degenerate solution.
111
+ **Fix**: weight the loss (`CrossEntropyLoss(weight=...)` inverse-frequency) or focal loss (detection); class-balanced sampler. Report **per-class / macro** metrics, never just overall — a high aggregate hiding a collapsed minority is degenerate output → cross-link **verifying-dl-experiments** (**REQUIRED**).
112
+
113
+ ### V7 — Tiny per-GPU batch → BatchNorm stats garbage → unstable/poor training
114
+ **Symptom**: detection/seg at batch 1–2 per GPU is unstable or underperforms a larger-batch run.
115
+ **Root cause**: BN estimates mean/var over the batch; at batch 1–2 those are noisy and running stats drift.
116
+ **Fix**: **SyncBatchNorm** across GPUs (`torch.nn.SyncBatchNorm.convert_sync_batchnorm`) under DDP, or **GroupNorm**, or freeze pretrained BN (`FrozenBatchNorm2d`, as detection backbones do). DDP mechanics → `references/training/distributed-launch.md`.
117
+
118
+ ---
119
+
120
+ ## Diffusion / generative
121
+
122
+ ### DF1 — Loss is low but samples are bad → the canonical "loss ≠ quality"
123
+ **Symptom**: the noise-prediction MSE converges nicely but samples are blurry, mode-collapsed, or wrong.
124
+ **Root cause**: diffusion loss (predict noise at a random timestep) is **weakly correlated with sample quality** — good average noise-prediction still compounds errors over the sampling trajectory. Real culprits are downstream: missing EMA (DF2), wrong VAE scaling (DF3), train/sample schedule mismatch (DF4/DF6), no/over CFG (DF5).
125
+ **Fix**: the textbook **is-the-number-real** fork → cross-link **verifying-dl-experiments** (**REQUIRED**; it owns loss-low-output-bad). Mechanically walk DF2→DF6; the single most common miss is evaluating **raw** weights instead of **EMA** (DF2). ([stability techniques](https://apxml.com/courses/advanced-diffusion-architectures/chapter-4-advanced-diffusion-training/training-stability-techniques))
126
+
127
+ ### DF2 — Sampling from raw (non-EMA) weights → worse than the "same" model
128
+ **Symptom**: samples from the just-saved checkpoint look worse than expected; quality jumps with an EMA checkpoint.
129
+ **Root cause**: diffusion quality depends heavily on EMA — a running average (`decay≈0.999`, ~1000-update window) that denoises the noisy SGD trajectory. Raw weights are the noisy point estimate.
130
+ **Fix**: maintain EMA during training and **sample/evaluate from EMA weights** (`diffusers` `EMAModel`); save both (EMA for inference, raw for resume); verify the eval path actually loaded EMA. Resolves a large share of "DF1" reports. ([EMA](https://medium.com/@thibaut.chauffier/training-diffusion-models-from-scratch-21d7a1f18e9e))
131
+
132
+ ### DF3 — VAE latent scaling wrong → latents off-unit-variance → diffusion can't learn
133
+ **Symptom**: latent-diffusion (SD-style) training is unstable / produces noise or blank output; or a swapped-in custom VAE degrades everything.
134
+ **Root cause**: the model assumes ~unit-variance latents; the VAE output is multiplied by a calibrated `scaling_factor` (SD v1 = `0.18215`). A wrong/missing factor (or a custom VAE with different stats) leaves latents off-scale.
135
+ **Fix**: scale by the VAE's `config.scaling_factor` on encode, divide on decode. For a **custom** VAE, **measure** empirical latent std on a sample and set factor `1/std` — don't inherit `0.18215`. Print latent mean/std before training (~0/~1). ([sd-vae](https://huggingface.co/stabilityai/sd-vae-ft-mse))
136
+
137
+ ### DF4 — Noise schedule / timestep / prediction_type mismatch train↔inference
138
+ **Symptom**: structured artifacts, wrong contrast/brightness, or failure at low step counts.
139
+ **Root cause**: betas/alphas schedule (linear/cosine/scaled-linear), `num_train_timesteps`, or `prediction_type` differs between training and the inference scheduler; e.g. trained on `epsilon`, sampled as `v`/`x0`.
140
+ **Fix**: keep the **same** `beta_schedule`, `num_train_timesteps`, `prediction_type` in both — in `diffusers` build the inference scheduler from the training `scheduler.config`. Mismatched `prediction_type` is a silent quality killer.
141
+
142
+ ### DF5 — Conditioning never dropped during training → CFG is a no-op / model ignores prompt
143
+ **Symptom**: changing `guidance_scale` at inference barely changes output, or the model ignores conditioning.
144
+ **Root cause**: CFG needs a learned **unconditional** path, trained by randomly replacing the condition with a null embedding for a fraction `p_drop` of examples. No dropout ⇒ no usable unconditional estimate ⇒ CFG no-op.
145
+ **Fix**: during training drop the condition `p_drop≈0.1` (replace with null embedding). At inference use `guidance_scale` ~5–15 for T2I (higher = more prompt adherence, lower diversity). Model-ignores-input is a verifying-dl-experiments concern (**REQUIRED**); the training-side root cause is the missing dropout. ([CFG theory](https://apxml.com/courses/advanced-diffusion-architectures/chapter-4-advanced-diffusion-training/classifier-free-guidance-theory))
146
+
147
+ ### DF6 — Sampler ≠ model: a sampler config that doesn't match the trained objective
148
+ **Symptom**: switching samplers wildly changes quality; one sampler gives noise.
149
+ **Root cause**: samplers assume a `prediction_type` + schedule (DF4); ancestral/SDE vs deterministic ODE interact differently with the trained noise level, and step count below the sampler's stable range degrades.
150
+ **Fix**: validate the checkpoint with the **reference** sampler/step-count from its recipe first, then explore; confirm `prediction_type` matches; for few-step use DPM-Solver++ not plain DDPM. "New sampler → bad samples" is a config mismatch, not a model regression.
151
+
152
+ ### DF7 — Uniform-timestep loss weighting → blurry samples → Min-SNR weighting
153
+ **Symptom**: samples are systematically blurry / low-detail despite long training.
154
+ **Root cause**: uniform loss weight over-weights high-noise (easy) steps relative to low-noise steps that carry fine detail; the gradient is dominated by the easy regime.
155
+ **Fix**: apply **Min-SNR-γ** weighting (γ≈5) so low-noise steps get their share; `diffusers` scripts expose `--snr_gamma`. Compounds with DF2/DF3 — fix these details together, not in isolation. ([Min-SNR](https://arxiv.org/abs/2303.09556))
156
+
157
+ ---
158
+
159
+ ## RL
160
+
161
+ ### R1 — Reward collapses / output degenerates mid-training
162
+ **Symptom**: average reward suddenly drops, responses get short/repetitive or refuse, length collapses.
163
+ **Root cause**: reward hacking / over-optimization. The policy exploits the reward model's blind spots and drifts far from the reference, often right after the PPO clip fraction rises sharply and approximate KL spikes.
164
+ **Fix**: strengthen the KL penalty (raise the KL coefficient), lower LR (`≈1e-6` for LLM PPO), reduce PPO update epochs/batch, add a length penalty if length is gamed. Watch reward **and** KL together — a reward jump with a KL spike is hacking, not progress. Bug-vs-real-effect on the collapse → cross-link **verifying-dl-experiments** (**REQUIRED**; it owns collapse/degenerate-output). ([PPO instability](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-4-rl-ppo-fine-tuning/troubleshooting-ppo-instability), [N-details RLHF](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo))
165
+
166
+ ### R2 — KL to the reference blows up → policy runs away
167
+ **Symptom**: KL grows without bound; generations go incoherent; "diverges" though loss isn't NaN.
168
+ **Root cause**: the KL penalty is too weak (or adaptive-KL target too loose); aggressive updates push the policy far, and a huge KL term can dominate the objective so the model optimizes the penalty instead.
169
+ **Fix**: use an **adaptive KL controller** with a target (e.g. 6), or a fixed coefficient large enough to hold KL bounded; clip the ratio (`cliprange≈0.2`); cap update epochs. Confirm the frozen reference isn't being updated. Same axis DPO's `beta` controls (L9). ([KL penalty role](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-4-rl-ppo-fine-tuning/kl-divergence-penalty-role))
170
+
171
+ ### R3 — Un-normalized rewards/advantages → unstable gradients → whiten
172
+ **Symptom**: high-variance, brittle training; small reward-scale changes destabilize everything.
173
+ **Root cause**: raw reward/advantage magnitudes vary wildly across batches; PPO gradients are scale-sensitive.
174
+ **Fix**: **whiten** advantages per minibatch (subtract mean, divide by std) — the standard PPO trick, more stabilizing than plain reward normalization (double-normalizing reward **and** advantage is often redundant). For classic-control RL, normalize **observations** with a running mean/std (`VecNormalize`) — un-normalized obs is a top cause of failure to learn. ([impl matters](https://openreview.net/pdf?id=rxEmiOEIFL), [whitening redundancy](https://liujch1998.github.io/2023/04/16/ppo-norm.html))
175
+
176
+ ### R4 — Replay buffer / normalization state not checkpointed → resume behaves like cold start
177
+ **Symptom**: an off-policy run (DQN/SAC) resumed from checkpoint acts like a cold start; or a normalized env's stats reset on resume and performance tanks.
178
+ **Root cause**: the replay buffer and the running obs/reward normalization stats are part of training **state** but are often omitted — restoring only weights loses them.
179
+ **Fix**: checkpoint+restore the replay buffer (or accept warmup) **and** the `VecNormalize`/running-stats alongside weights. General checkpoint-everything-stateful (optimizer/scheduler/RNG/step) → `references/training/checkpoint-resume.md`; on spot boxes losing buffer/normstats every preemption silently degrades learning → `references/spot-resilience.md`.
180
+
181
+ ### R5 — Non-stationarity treated as a bug → chasing a moving target
182
+ **Symptom**: value/critic loss won't converge to zero; metrics oscillate even when "working."
183
+ **Root cause**: RL targets are **non-stationary** — the policy changes the data distribution and bootstrapped targets move. A value loss that never hits zero is expected.
184
+ **Fix**: judge by **return/reward trend**, not critic-loss-to-zero; stabilize with a slow target network (DQN/SAC) and GAE. Don't "fix" non-convergent critic loss by shrinking LR to zero — that just stops learning.
185
+
186
+ ### R6 — A single seed's result is not the result → RL variance is huge
187
+ **Symptom**: identical hyperparameters + different seeds give non-overlapping curves; an ablation "win" disappears on re-run.
188
+ **Root cause**: extreme seed variance from algorithm, policy sampling, and environment stochasticity (comparing 5-run point estimates yields >50% Type-I error).
189
+ **Fix**: report aggregate over **≥5 seeds** (more for noisy envs), use **IQM** (interquartile mean) over mean/median, show CIs. A single-seed delta is not a result — squarely **verifying-dl-experiments** territory (bug-vs-real-effect, seed discipline; **REQUIRED**). ([Henderson](https://arxiv.org/pdf/1708.04133), [how-many-seeds](https://arxiv.org/pdf/1806.08295))
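
A minimal sketch of the aggregate the fix describes, using only the standard library. The function names, the percentile-bootstrap interval, and the `returns` numbers are illustrative assumptions, not taken from the cited papers or from any particular library; the trim rule (drop a floored quarter from each end) is one simple convention for IQM.

```python
import random
import statistics


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    cut = len(ordered) // 4  # drop the lowest and highest quarter (floored)
    kept = ordered[cut:len(ordered) - cut]
    return statistics.fmean(kept)


def bootstrap_ci(scores, stat=iqm, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` over per-seed scores."""
    rng = random.Random(seed)
    n = len(scores)
    draws = sorted(
        stat([rng.choice(scores) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = draws[int((alpha / 2) * n_boot)]
    hi = draws[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical final returns from 8 seeds of one configuration (made-up numbers).
returns = [212.0, 198.5, 40.0, 205.1, 220.3, 190.7, 415.0, 201.2]
point = iqm(returns)              # the two outlier seeds (40.0, 415.0) are trimmed
low, high = bootstrap_ci(returns)
print(f"IQM={point:.1f}  95% CI=[{low:.1f}, {high:.1f}]")
```

Reporting the interval next to the point estimate makes it visible when two configurations overlap, which a single-seed delta hides.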
190
+
191
+ ---
192
+
193
+ ## Multimodal / VLM
194
+
195
+ ### X1 — Wrong freeze schedule across stages → alignment never forms or the LLM is wrecked
196
+ **Symptom**: a LLaVA-style VLM doesn't ground on images (ignores visual tokens), or text quality collapses after multimodal finetune.
197
+ **Root cause**: VLM training is **staged**, each stage freezing different towers. Stage 1 (alignment): freeze vision encoder **and** LLM, train **only the projector**. Stage 2 (instruction tuning): unfreeze LLM **and** projector, keep vision encoder frozen. Training the LLM in stage 1 (before the projector aligns) corrupts it; never unfreezing it means it can't use the visual tokens.
198
+ **Fix**: set `requires_grad` per tower per stage per the recipe; print trainable-param counts at each stage start to confirm the freeze took. "Ignores its input image" is model-ignores-input → cross-link **verifying-dl-experiments** (**REQUIRED**). ([LLaVA recipe](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
199
+
200
+ ### X2 — Projector trained from scratch with the LLM hot → unstable stage-1
201
+ **Symptom**: stage-1 alignment loss is unstable or the projector output is garbage.
202
+ **Root cause**: the projector (linear or 2-layer MLP, often concatenating groups of vision tokens into the LLM embedding space) is **randomly initialized**; flowing gradients into the LLM through an un-aligned projector destabilizes both.
203
+ **Fix**: stage 1 trains the projector **alone** against a frozen LLM so it learns the LLM's embedding space first; only then (stage 2) unfreeze the LLM. Confirm projector output dim == LLM hidden size (else a silent shape-broadcast bug). ([projector adapter](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
204
+
205
+ ### X3 — One global LR for all towers → vision encoder drifts or LLM underfits
206
+ **Symptom**: a shared LR either over-updates the pretrained vision encoder (forgets visual features) or leaves projector/LLM undertrained.
207
+ **Root cause**: towers are at different maturities — a pretrained vision encoder needs a tiny LR (or freeze), a fresh projector a larger one, the LLM in between. LLaVA: ~`1e-3` projector in alignment, `2e-5` LLM in stage 2.
208
+ **Fix**: **parameter groups** with per-group LRs (`AdamW([{"params":proj.parameters(),"lr":1e-3},{"params":llm.parameters(),"lr":2e-5}])`), or freeze the vision encoder. Log each group's LR to confirm. ([LLaVA LRs](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
209
+
210
+ ### X4 — Sequence truncation drops image tokens → shape error or silent loss of vision
211
+ **Symptom**: VLM training errors on image-token-count mismatch, or intermittently ignores images on long examples.
212
+ **Root cause**: image placeholders expand into many tokens (hundreds per image); a `max_length` truncation cuts them, breaking the image-token ↔ feature alignment.
213
+ **Fix**: set `max_length=None` (or large enough never to truncate image tokens) for VLM trainers, or verify truncation never removes placeholders across the whole dataset. Count image tokens vs expected per example as a smoke check. ([TRL VLM note](https://huggingface.co/docs/trl/dpo_trainer))
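
The smoke check in the fix could look roughly like this. Everything here is a hypothetical illustration: `IMAGE_TOKEN_ID`, the `(input_ids, num_images)` layout, and the assumption that truncation keeps the left part of the sequence must all be adapted to the actual processor and trainer in use.

```python
IMAGE_TOKEN_ID = 32000  # hypothetical placeholder id; read the real one from your processor


def count_image_tokens(input_ids, image_token_id=IMAGE_TOKEN_ID):
    """Number of image placeholder tokens present in one tokenized example."""
    return sum(1 for t in input_ids if t == image_token_id)


def find_truncated_examples(examples, max_length, tokens_per_image):
    """Return indices of examples that lose image tokens under right-truncation.

    `examples` is an iterable of (input_ids, num_images) pairs.
    """
    bad = []
    for i, (input_ids, num_images) in enumerate(examples):
        kept = input_ids if max_length is None else input_ids[:max_length]
        if count_image_tokens(kept) != num_images * tokens_per_image:
            bad.append(i)
    return bad


# Toy data: 4 placeholder tokens per image (real models use hundreds).
short = [1, 5, 6] + [IMAGE_TOKEN_ID] * 4 + [7, 2]
long = [1] + [9] * 10 + [IMAGE_TOKEN_ID] * 4 + [2]
examples = [(short, 1), (long, 1)]

print(find_truncated_examples(examples, max_length=12, tokens_per_image=4))    # [1]
print(find_truncated_examples(examples, max_length=None, tokens_per_image=4))  # []
```

Running this over the whole dataset before training turns an intermittent "ignores the image on long examples" failure into an explicit list of offending indices.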
214
+
215
+ ### X5 — Modality alignment collapse: the LLM answers from text priors, not the image
216
+ **Symptom**: the VLM gives plausible answers that ignore the actual image content (language-prior shortcut).
217
+ **Root cause**: weak visual signal (bad projector, frozen-everything, too little alignment data) lets the LLM fall back on its language prior — a degenerate "ignore the input" solution that still lowers loss on text-predictable answers.
218
+ **Fix**: the mechanical fixes are X1–X3 (correct freeze/projector/LR so the visual path contributes). Whether it's genuinely grounding vs shortcutting is **verifying-dl-experiments** (model-ignores-input/degenerate-output; **REQUIRED**) — probe with image-perturbation / counterfactual-image tests, which that skill owns.
219
+
220
+ ---
221
+
222
+ ## Pointers — domain-adjacent mechanics catalogued elsewhere
223
+
224
+ - **Precision / NaN / loss-spike / z-loss / grad-clip** (L6's general ladder) → `references/training/precision-stability.md`.
225
+ - **OOM, activation checkpointing, LoRA/QLoRA, FSDP/ZeRO, seq-len memory** → `references/training/oom-memory.md`.
226
+ - **Dataloader starvation, GPU-util%, NVMe staging, tar-sharding** → `references/gotchas_universal.md` (U8, U21, U24, U25).
227
+ - **DDP/FSDP launch, SyncBatchNorm under DDP, NCCL** → `references/training/distributed-launch.md`, `references/multinode.md`.
228
+ - **Checkpoint-everything-stateful + atomic resume** (R4's general form) → `references/training/checkpoint-resume.md`, `references/spot-resilience.md`.
229
+ - **"Runs but won't learn": loop wiring, optimizer/LR/weight-decay, loss-function & label form, freezing/BN drift, dataloader correctness** → `references/training/convergence-debugging.md`, `references/training/data-pipeline.md`.
230
+ - **Is the metric/model correct** (collapse, leakage, all-zero metrics, model-ignores-input, seed discipline) → **verifying-dl-experiments** (**REQUIRED** — owns every "bug vs real effect" fork above).
@@ -0,0 +1,368 @@
1
+ # Correct checkpointing & idempotent resume — full state, atomic write, sharded checkpoints, framework APIs
2
+
3
+ Make a training job resume **exactly where it stopped** after any kill — not "reload the weights and
4
+ silently restart the epoch." This layer owns the *mechanics*: what FULL state to save, how to write it
5
+ without corruption, how to load it unconditionally, and the framework-specific knobs (FSDP / DeepSpeed /
6
+ HF Trainer / Accelerate / Lightning) plus the resume **bugs** that make a job look resumed while it
7
+ quietly lost progress. **verifying-dl-experiments** (**REQUIRED**) owns *is the resumed number correct* —
8
+ e.g. proving step/epoch/loss actually continued instead of resetting is its reproducibility check applied
9
+ here. The spot/preemption *cadence* (when + how often, Young/Daly) lives in
10
+ `references/spot-resilience.md` (**REQUIRED** for any interruptible/spot tier) — this file is the *content
11
+ and correctness* of each checkpoint; that file is the *timing*.
12
+
13
+ To jump: `grep -in '<keyword>' references/training/checkpoint-resume.md` (e.g. `atomic`, `rename`,
14
+ `scaler`, `ema`, `sampler`, `fsdp`, `sharded`, `zero_to_fp32`, `dcp`, `resume_from_checkpoint`,
15
+ `save_state`, `ckpt_path`, `save_total_limit`, `reshuffle`).
16
+
17
+ ## Table of contents
18
+
19
+ - **The contract** — C1 full-state-list · C2 atomic-write · C3 load-latest-unconditionally · C4 durable-location
20
+ - **Sharded checkpoints (multi-GPU)** — C5 FSDP-FULL_STATE_DICT-rank0-OOM · C6 FSDP-SHARDED_STATE_DICT · C7 DCP-(dcp.save/load) · C8 DeepSpeed-ZeRO-dir+zero_to_fp32
21
+ - **Framework APIs** — C9 HF-Trainer-resume_from_checkpoint+save_total_limit · C10 Accelerate-save_state/load_state · C11 Lightning-ModelCheckpoint+ckpt_path
22
+ - **The resume BUGS** — C12 epoch-restarts · C13 data-reshuffles/order · C14 LR-schedule-resets · C15 scaler-not-restored · C16 EMA-not-saved · C17 save_total_limit-deletes-best · C18 strict-load-key-mismatch
23
+ - **Pointers** — disk-full on save → gotchas_universal.md U6 · silent sync → U33 · keepable-policy/save_top_k → verifying-dl-experiments (skill) · cadence/Young-Daly → spot-resilience.md
24
+
25
+ ---
26
+
27
+ ## The contract
28
+
29
+ ### C1 — A checkpoint that restores only weights is NOT a resume — save the FULL training state
30
+
31
+ **Symptom**: resume "works" (no crash) but the loss jumps up, accuracy regresses, or training takes more
32
+ total epochs than an uninterrupted run — because the resume silently restarted the epoch, reset the
33
+ optimizer momentum, and reshuffled the data.
34
+
35
+ **Root cause**: `torch.save(model.state_dict())` captures *weights only*. Optimizer momentum/variance,
36
+ the LR-scheduler position, the epoch/step counter, RNG state, the AMP scaler, and the dataloader position
37
+ are all lost, so the restarted run is a *different* trajectory, not a continuation.
38
+
39
+ **Fix**: every checkpoint must carry the full state (PyTorch tutorial
40
+ [saving multiple / general checkpoint](https://docs.pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html);
41
+ the spot-resilience §3 list):
42
+
43
+ | Must save | Why losing it breaks resume |
44
+ |---|---|
45
+ | model `state_dict` | the weights (obvious) |
46
+ | optimizer `state_dict` | Adam m/v momentum — losing it = a cold optimizer restart (C12) |
47
+ | LR-scheduler `state_dict` | step-based LR position — losing it resets the schedule (C14) |
48
+ | `epoch` **and** global `step`/iteration | resume the exact position, not the epoch start (C12) |
49
+ | RNG state: Python `random`, NumPy, `torch`, **CUDA** (`torch.cuda.get_rng_state_all()`) | reproducible augmentation/dropout stream after restart |
50
+ | dataloader / sampler position | so the next batch is the *next* unseen one, not a reshuffle (C13) |
51
+ | AMP `GradScaler` `state_dict` | the loss-scale + growth tracker — losing it triggers an inf-scale stall (C15) |
52
+ | EMA / SWA shadow weights (if used) | the EMA copy is often what's evaluated — losing it = eval on the wrong weights (C16) |
53
+ | best-metric-so-far + `best.pth` selection state | so "best" survives a restart instead of resetting |
54
+
55
+ The runnable atomic skeleton that assembles this dict is in `references/spot-resilience.md` §5 — do not
56
+ duplicate it; this table is the *checklist*, that is the *code*.
57
+
58
+ ### C2 — Write atomically: tmp → fsync → os.replace (a kill mid-write corrupts a naive save)
59
+
60
+ **Symptom**: after a preemption/OOM, `latest.pth` is truncated/zero-byte or `torch.load` raises
61
+ `RuntimeError: PytorchStreamReader failed reading zip archive`; a `latest.pth.tmp` is left behind.
62
+
63
+ **Root cause**: overwriting `latest.pth` in place is **not** atomic — a kill partway through leaves a
64
+ corrupt file and (if it was the only checkpoint) zero good ones. `torch.save` itself does *not* fsync.
65
+
66
+ **Fix**: write to a temp file, force bytes to disk, then atomically rename (POSIX `rename`/`os.replace`
67
+ is atomic on the **same filesystem**):
68
+ ```python
69
+ tmp = ckpt_path + ".tmp"
70
+ with open(tmp, "wb") as f:
71
+     torch.save(state, f); f.flush(); os.fsync(f.fileno())   # bytes hit disk BEFORE the swap
72
+ os.replace(tmp, ckpt_path) # all-or-nothing; keep prev until this returns
73
+ ```
74
+ Keep the previous `latest.pth` valid until the rename returns (a kill at any instant leaves one intact
75
+ file). `os.replace` (not `os.rename`) also works on Windows for the local-test path. Full recipe +
76
+ rationale: `references/spot-resilience.md` §3. Disk-full *during* the save is a separate failure with the
77
+ same `.tmp` left behind → `references/gotchas_universal.md` U6 (pre-budget + prune `latest`, keep `best`).
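
As a self-contained illustration of the tmp → fsync → `os.replace` pattern, the sketch below uses `pickle.dump` as a stand-in for `torch.save` so it runs without PyTorch. The `atomic_save` name and the extra directory fsync (which persists the rename itself on POSIX) are assumptions added here, not part of the recipe in `references/spot-resilience.md`.

```python
import os
import pickle
import tempfile


def atomic_save(state, ckpt_path, dump=pickle.dump):
    """Write `state` so a kill leaves the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(ckpt_path))
    tmp = ckpt_path + ".tmp"            # same directory, hence same filesystem
    with open(tmp, "wb") as f:
        dump(state, f)                  # pass dump=torch.save for real checkpoints
        f.flush()
        os.fsync(f.fileno())            # file bytes reach disk before the swap
    os.replace(tmp, ckpt_path)          # atomic rename over the previous checkpoint
    if hasattr(os, "O_DIRECTORY"):      # POSIX only: persist the directory entry too
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "latest.pth")
atomic_save({"step": 100}, path)
atomic_save({"step": 200}, path)        # overwrites only after the new bytes are durable
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored, os.path.exists(path + ".tmp"))
```

A leftover `.tmp` after a crash is safe to delete on startup, since the rename never happened and the previous `latest.pth` is still intact.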
78
+
79
+ ### C3 — Load-latest UNCONDITIONALLY on startup → idempotent resume
80
+
81
+ **Symptom**: a relaunch starts from scratch because the resume is gated behind a `--resume` flag the
82
+ launch wrapper forgot to pass; or two code paths (fresh vs resume) diverge.
83
+
84
+ **Root cause**: making resume *opt-in* means a generic relaunch (spot recovery, SSH-drop restart, queue
85
+ retry) re-trains from zero. A divergent "first launch" code path also drifts from the resume path.
86
+
87
+ **Fix**: one code path that loads the latest checkpoint if it exists, else starts fresh — so the
88
+ **identical launch command** converges to the same end state no matter how many times it runs. This is
89
+ what makes principle #7's "retry the identical config" actually *resume* instead of restart, and it is the
90
+ universal spine (principle #8) under SSH-drop / Slurm-walltime / K8s-reschedule / spot-preemption. Skeleton:
91
+ `references/spot-resilience.md` §3 (`load_latest_if_any`).
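
To make the idempotence claim concrete, here is a toy single-code-path loop. The `load_latest_if_any` name comes from the text above, but its signature, the state layout, and the `die_at` kill simulation are assumptions for this sketch; `pickle` stands in for `torch.save`/`torch.load`.

```python
import os
import pickle
import tempfile


def load_latest_if_any(ckpt_path):
    """Return the saved state if a checkpoint exists, else a fresh initial state."""
    if not os.path.exists(ckpt_path):
        return {"step": 0, "total": 0}
    with open(ckpt_path, "rb") as f:
        return pickle.load(f)


def save(state, ckpt_path):
    """Atomic write (see C2): tmp file, fsync, then rename."""
    tmp = ckpt_path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, ckpt_path)


def run(ckpt_path, total_steps, die_at=None):
    """One launch. The same call serves fresh start and resume; no --resume flag."""
    state = load_latest_if_any(ckpt_path)
    for step in range(state["step"], total_steps):
        if step == die_at:
            return state                # simulated preemption, no cleanup
        state["total"] += step          # stand-in for one training step
        state["step"] = step + 1
        save(state, ckpt_path)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "latest.pkl")
run(ckpt, total_steps=10, die_at=4)     # killed after 4 steps
resumed = run(ckpt, total_steps=10)     # identical command, continues at step 4
rerun = run(ckpt, total_steps=10)       # identical command again, nothing left to do
uninterrupted = run(os.path.join(tempfile.mkdtemp(), "latest.pkl"), total_steps=10)
print(resumed, rerun, uninterrupted)
```

All three end states match, which is the property the launch wrapper relies on when it blindly retries the same command.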
92
+
93
+ ### C4 — Checkpoint to the platform's DURABLE location, not local scratch
94
+
95
+ **Symptom**: resume after a managed-spot replacement (or a `terminate`/`destroy`) finds no checkpoint —
96
+ the box came up *fresh* and the only copy was on the dead instance's local disk.
97
+
98
+ **Root cause**: a replacement node is clean; anything not on a cloud bucket / network volume / shared FS
99
+ is gone (principle #4 — know what survives stop vs destroy).
100
+
101
+ **Fix**: write checkpoints to the profile's durable mount (`DURABLE_DIR` in `profiles/<platform>.md` §8),
102
+ or mirror local→durable on the checkpoint timer. The single biggest portability trap is assuming local
103
+ disk survives — see each profile's STORAGE survival-matrix and the SKILL Quick-reference table. Gate the
104
+ sync on the actual copy result, never an unconditional `echo synced` →
105
+ `references/gotchas_universal.md` U33.
106
+
107
+ ---
108
+
109
+ ## Sharded checkpoints (multi-GPU)
110
+
111
+ ### C5 — FSDP `FULL_STATE_DICT` OOMs on rank 0 when gathering a large model
112
+
113
+ **Symptom**: an FSDP job trains fine but **crashes at the first checkpoint** with CUDA OOM on rank 0;
114
+ the model is larger than one GPU.
115
+
116
+ **Root cause**: `StateDictType.FULL_STATE_DICT` all-gathers every shard onto **one rank** to assemble the
117
+ unsharded dict. For a model that only fits *because* it's sharded, materializing the whole thing on rank 0
118
+ exceeds that GPU's VRAM.
119
+
120
+ **Fix**: when taking a full (consolidated) dict, offload it to CPU and build it on rank 0 only —
121
+ `FullStateDictConfig(offload_to_cpu=True, rank0_only=True)`. This all-gathers parameters one-by-one,
122
+ offloading each to CPU on rank 0, so peak GPU memory stays bounded and non-rank-0 workers skip the GPU→CPU
123
+ copy entirely
124
+ ([HF Accelerate FSDP guide](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp),
125
+ [Lightning issue #11207](https://github.com/Lightning-AI/pytorch-lightning/issues/11207)). The full dict
126
+ is only viable when it fits in CPU RAM; past that, use sharded (C6). Save a full dict only at the **end**
127
+ for a portable single-file artifact; checkpoint *during* training as sharded.
128
+
129
+ ### C6 — `SHARDED_STATE_DICT`: each rank saves its own shard (no gather, no rank-0 OOM)
130
+
131
+ **Symptom**: need to checkpoint a model too big to consolidate even on CPU, or want a fast resume that
132
+ re-shards onto a *different* world size.
133
+
134
+ **Root cause**: `FULL_STATE_DICT` is fundamentally a single-rank materialization; it does not scale and
135
+ cannot reshard.
136
+
137
+ **Fix**: use `StateDictType.SHARDED_STATE_DICT` — every rank writes only its own shard, so there is no
138
+ all-gather and no OOM, and the per-rank files load back in parallel. Pair it with Distributed Checkpoint
139
+ (C7), which is the production path for sharded save/load and supports **resharding** (resume on a different
140
+ GPU count). Tradeoff: a sharded checkpoint is a *directory of N files*, not a single `.pth` — convert to a
141
+ full dict for export/inference (C7's `get_model_state_dict`, or the DeepSpeed analogue C8).
142
+
143
+ ### C7 — Distributed Checkpoint (DCP): `dcp.save` / `dcp.load` for FSDP/sharded models
144
+
145
+ **Symptom**: hand-rolling FSDP state-dict context managers is brittle, slow, and breaks when the world
146
+ size changes between save and resume.
147
+
148
+ **Root cause**: `torch.save` produces a single file and has no notion of sharding or FQN remapping;
149
+ manually toggling `FSDP.state_dict_type` is error-prone.
150
+
151
+ **Fix**: use `torch.distributed.checkpoint` (DCP), the current PyTorch-2.x sharded-checkpoint API
152
+ ([DCP recipe](https://docs.pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html),
153
+ [2.12 reference](https://docs.pytorch.org/docs/2.12/distributed.checkpoint.html)). **Save**: get canonical
154
+ dicts with `get_state_dict(model, optimizer)` from `torch.distributed.checkpoint.state_dict`, then
155
+ `dcp.save(state_dict, checkpoint_id=DIR)` — it writes **≥1 file per rank in parallel** and auto-manages FQN
156
+ mappings. **Load**: allocate the model first, then `dcp.load(state_dict, checkpoint_id=DIR)` (loads **in
157
+ place** and **auto-reshards** to the current world size), then `set_state_dict(...)`. DCP beats
158
+ `torch.save` for any distributed model because it shards the write across ranks (no rank-0 gather, C5) and
159
+ reshards on load. For a single portable inference file, convert offline with `torch.distributed.checkpoint.format_utils.dcp_to_torch_save(DIR, "out.pt")` (or the CLI `python -m torch.distributed.checkpoint.format_utils dcp_to_torch DIR out.pt`).
160
+
161
+ ### C8 — DeepSpeed ZeRO: a checkpoint *directory* per save + `zero_to_fp32.py` to consolidate
162
+
163
+ **Symptom**: `model_engine.save_checkpoint(dir)` writes a *folder* of `mp_rank_*` / `zero_pp_rank_*`
164
+ files, not a `.pth`; loading the weights into a plain (non-DeepSpeed) model for inference fails.
165
+
166
+ **Root cause**: ZeRO **partitions** optimizer state (stage 1), gradients (2), and parameters (3) across
167
+ ranks; the on-disk checkpoint is inherently sharded across per-rank files — it is not a single fp32 model.
168
+
169
+ **Fix** ([DeepSpeed model-checkpointing](https://deepspeed.readthedocs.io/en/stable/model-checkpointing.html),
170
+ [ZeRO tutorial](https://www.deepspeed.ai/tutorials/zero/)):
171
+
172
+ - **Save/resume training** — `model_engine.save_checkpoint(save_dir, tag)` /
173
+ `model_engine.load_checkpoint(save_dir, tag)`. **All ranks must call both** (they're collective; rank-0
174
+ only deadlocks/corrupts). Round-trips full sharded optimizer+param state.
175
+ - **Export a single fp32 model** — DeepSpeed auto-drops a `zero_to_fp32.py` into the checkpoint dir; run
176
+ `python zero_to_fp32.py <checkpoint_dir> pytorch_model.bin`, or in-process
177
+ `get_fp32_state_dict_from_zero_checkpoint(dir)` (imported from `deepspeed.utils.zero_to_fp32`) /
178
+ `convert_zero_checkpoint_to_fp32_state_dict(...)` / `load_state_dict_from_zero_checkpoint(model, dir)`
179
+ (the last returns a model that **can't continue training** without re-init). The consolidated file no
180
+ longer needs DeepSpeed. For ZeRO-3, set
181
+ `"zero_optimization": {"stage3_gather_16bit_weights_on_model_save": true}` + `engine.save_16bit_model(dir)`.
182
+
183
+ ---
184
+
185
+ ## Framework APIs
186
+
187
+ ### C9 — HF Trainer: `resume_from_checkpoint` + `save_total_limit` (and what it actually saves)
188
+
189
+ **Symptom**: assuming `Trainer.save_model()` is a resume point (it saves *weights only*); or a relaunch
190
+ re-trains from step 0 because `resume_from_checkpoint` wasn't passed; or the disk fills with `checkpoint-*`
191
+ dirs.
192
+
193
+ **Root cause**: `save_model` ≠ a training checkpoint. A real Trainer checkpoint dir (`checkpoint-<step>`)
194
+ contains the model **plus** `optimizer.pt`, `scheduler.pt`, `rng_state.pth`, `trainer_state.json`, and the
195
+ AMP `scaler.pt` — the full state. Without `resume_from_checkpoint` the run starts cold.
196
+
197
+ **Fix** ([Trainer docs](https://huggingface.co/docs/transformers/main/en/main_classes/trainer)):
198
+ `trainer.train(resume_from_checkpoint="path/to/checkpoint-1500")` resumes that exact dir;
199
+ `resume_from_checkpoint=True` auto-finds the **last** checkpoint in `args.output_dir` (idempotent spelling,
200
+ C3; `trainer_utils.get_last_checkpoint(output_dir)` finds it in code). `save_strategy="steps"` +
201
+ `save_steps=N` (or `"epoch"`) sets cadence; **`save_total_limit=k`** keeps only the `k` most-recent
202
+ `checkpoint-*` and **deletes older ones in `output_dir`** — the built-in disk-budget knob (pairs with
203
+ `references/gotchas_universal.md` U6). `load_best_model_at_end=True` + `metric_for_best_model` +
204
+ `greater_is_better` reloads the best checkpoint at the end **and** protects it from `save_total_limit`
205
+ deletion (C17).
206
+
207
+ ### C10 — Accelerate: `accelerator.save_state(dir)` / `load_state(dir)` + dataloader skip
208
+
209
+ **Symptom**: a custom (non-Trainer) Accelerate loop resumes with a cold optimizer/scaler, or the LR
210
+ scheduler resets, or it replays already-seen batches.
211
+
212
+ **Root cause**: saving only `accelerator.get_state_dict(model)` drops optimizer/scaler/RNG; and a
213
+ mid-epoch resume re-iterates the dataloader from batch 0.
214
+
215
+ **Fix** ([Accelerate checkpoint guide](https://huggingface.co/docs/accelerate/en/usage_guides/checkpoint)):
216
+ `accelerator.save_state(output_dir)` saves model, optimizer, **GradScaler**, and RNG generators in one
217
+ call; `accelerator.load_state(output_dir)` restores all of it (objects must come from the *same* script).
218
+ An LR scheduler that was not passed through `accelerator.prepare(...)` (and any other custom object with `state_dict`/`load_state_dict`) **must** be registered first with
219
+ `accelerator.register_for_checkpointing(my_scheduler)`, or it is not saved and resets (C14). For
220
+ mid-epoch resume, skip consumed batches with `accelerator.skip_first_batches(train_dataloader, N)` on the
221
+ first resumed epoch, then fall back to the full dataloader (C13).
222
+ `ProjectConfiguration(automatic_checkpoint_naming=True, total_limit=k)` gives rolling
223
+ `checkpoints/checkpoint_<n>` dirs with a built-in limit.
224
+
225
+ ### C11 — Lightning: `ModelCheckpoint` + `trainer.fit(ckpt_path=...)` (don't use `resume_from_checkpoint`)
226
+
227
+ **Symptom**: an old tutorial's `Trainer(resume_from_checkpoint=...)` is ignored/deprecated; or
228
+ `save_top_k` quietly deletes the checkpoint needed to resume.
229
+
230
+ **Root cause**: `resume_from_checkpoint` moved to `fit(ckpt_path=...)` (deprecated in Lightning 1.5, removed in 2.0). A Lightning
231
+ `.ckpt` is a full dump — epoch, global step, LightningModule `state_dict`, **all** optimizer + LR-scheduler
232
+ states, callback states, loop state, and the 16-bit scaling factor (AMP)
233
+ ([Lightning checkpointing basics](https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html)).
234
+
235
+ **Fix**:
236
+ - Configure `ModelCheckpoint(dirpath=..., monitor="val_loss", mode="min", save_top_k=k, save_last=True)`;
237
+ resume with `trainer.fit(model, datamodule, ckpt_path="path/to/last.ckpt")`, or
238
+ `ckpt_path="last"` to auto-pick the `save_last=True` file (the idempotent spelling, C3). Best/last paths
239
+ read back from `cb.best_model_path` / `cb.last_model_path`.
240
+ - `save_top_k` keeps only the k best by `monitor`; **always set `save_last=True`** so a resume target
241
+ exists even when the latest step isn't a top-k metric (otherwise resume may have no recent checkpoint).
242
+ Add custom state (EMA, C16) via `on_save_checkpoint` / `on_load_checkpoint` on the module or a stateful
243
+ callback. Lightning's DeepSpeed strategy writes a ZeRO dir — convert with
244
+ `lightning.pytorch.utilities.deepspeed.convert_zero_checkpoint_to_fp32_state_dict` (C8 analogue).
245
+
246
+ ---
247
+
248
+ ## The resume BUGS (looks resumed, silently lost progress)
249
+
250
+ These are the "it ran without error but the result is wrong" traps — confirm the fix with the
251
+ `verifying-dl-experiments` reproducibility check (**REQUIRED**): kill mid-run, relaunch the *identical*
252
+ command, and verify step/epoch/loss **continue** rather than reset.
253
+
254
+ ### C12 — Epoch/step restarts from 0 despite "resuming"
255
+
256
+ **Symptom**: tracker shows a second run starting at epoch 1; total trained epochs exceed the schedule;
257
+ LR warm-up replays. (The remote-ops version of this — a tmux script re-executed mid-run — is
258
+ `references/gotchas_universal.md` U2.)
259
+
260
+ **Root cause**: the loop is `for epoch in range(total_epochs)` with a hardcoded `0` start; the saved
261
+ `epoch`/`step` was never read back, or was saved but not used to seed the range.
262
+
263
+ **Fix**: `start_epoch, start_step = load_latest_if_any(...)` then
264
+ `for epoch in range(start_epoch, total_epochs)` and seed the step counter from `start_step`. The counter
265
+ **must** be in the checkpoint (C1) *and* consumed on load.
266
+
267
+ ### C13 — Data reshuffles / repeats the same order after resume
268
+
269
+ **Symptom**: resume re-shows already-seen samples (worse, the *same* batch every epoch even without
270
+ resume), hurting convergence or leaking.
271
+
272
+ **Root cause**: two distinct bugs. (a) Resume restarts the epoch from batch 0 without skipping consumed
273
+ batches. (b) `DistributedSampler` seeds its shuffle from an internal epoch that defaults to 0 forever
274
+ unless `sampler.set_epoch(epoch)` is called each epoch — so every epoch (and every resume) produces the
275
+ **identical** order
276
+ ([PyTorch #31771](https://github.com/pytorch/pytorch/issues/31771),
277
+ [DistributedSampler docs](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler)).
278
+
279
+ **Fix**: call `train_sampler.set_epoch(epoch)` at the top of every epoch (restore the epoch counter on
280
+ resume so the shuffle stream continues). For mid-epoch resume, fast-forward consumed batches
281
+ (`accelerator.skip_first_batches`, C10) or use a resumable/stateful sampler (`torchdata`
282
+ `StatefulDataLoader`) whose offset is in the checkpoint (C1).
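
Both bugs can be reproduced without PyTorch. The sketch below imitates the `seed + epoch` shuffle seeding that `DistributedSampler` uses; `epoch_order`, `batches`, and `skip_batches` are hypothetical names for illustration, not a real sampler API.

```python
import random


def epoch_order(num_samples, seed, epoch):
    """Deterministic permutation for one epoch, seeded by seed + epoch."""
    order = list(range(num_samples))
    random.Random(seed + epoch).shuffle(order)
    return order


def batches(num_samples, batch_size, seed, epoch, skip_batches=0):
    """Yield this epoch's batches, skipping those consumed before a kill."""
    order = epoch_order(num_samples, seed, epoch)
    starts = list(range(0, num_samples, batch_size))
    for i in starts[skip_batches:]:
        yield order[i:i + batch_size]


# Uninterrupted epoch 3.
full = list(batches(12, 4, seed=0, epoch=3))

# Killed after 1 batch; the checkpoint recorded epoch=3 and batches_done=1.
resumed = list(batches(12, 4, seed=0, epoch=3, skip_batches=1))
print(resumed == full[1:])  # the resume sees exactly the remaining batches

# Bug (b): the epoch passed to the sampler never advances, so the order repeats.
stuck = [epoch_order(12, seed=0, epoch=0) for _ in range(3)]
print(stuck[0] == stuck[1] == stuck[2])
```

The resume only lines up because the epoch counter and the consumed-batch count were both read back from the checkpoint; restoring either one alone replays or skips data.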
283
+
284
+ ### C14 — LR schedule resets (cosine restarts, warm-up replays)
285
+
286
+ **Symptom**: the LR curve restarts from the initial/warm-up value on resume; final LR is wrong; cosine
287
+ decay never reaches its floor.
288
+
289
+ **Root cause**: the LR scheduler's `state_dict` (its `last_epoch`/step counter) was not saved or not
290
+ restored. With Accelerate, the scheduler wasn't `register_for_checkpointing`-ed (C10).
291
+
292
+ **Fix**: save `scheduler.state_dict()` and call `scheduler.load_state_dict(...)` on resume (C1). Note a
293
+ step-based scheduler advanced *per optimizer step* must restore the **step**, not the epoch — restoring
294
+ only `epoch` under-/over-shoots the schedule.
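
A small stand-in scheduler shows why the step counter has to round-trip. `lr_at` and `StepScheduler` are hypothetical; they mirror the `state_dict`/`load_state_dict` shape of PyTorch schedulers but are not the PyTorch implementation, and the warm-up plus cosine shape is an assumed example schedule.

```python
import math


def lr_at(step, base_lr=3e-4, warmup=100, total=1000, floor=0.0):
    """Linear warm-up followed by cosine decay to `floor`, as a function of step."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return floor + 0.5 * (base_lr - floor) * (1 + math.cos(math.pi * progress))


class StepScheduler:
    """Per-optimizer-step scheduler whose only state is the step counter."""

    def __init__(self):
        self.step_count = 0

    def step(self):
        self.step_count += 1

    def lr(self):
        return lr_at(self.step_count)

    def state_dict(self):
        return {"step_count": self.step_count}

    def load_state_dict(self, state):
        self.step_count = state["step_count"]


sched = StepScheduler()
for _ in range(600):
    sched.step()
checkpoint = {"scheduler": sched.state_dict()}

restored = StepScheduler()
restored.load_state_dict(checkpoint["scheduler"])   # continues the decay at step 600

cold = StepScheduler()                              # the bug: warm-up replays from step 0
print(sched.lr(), restored.lr(), cold.lr())
```

Logging the LR on the first resumed step and comparing it with the last pre-kill value is a cheap way to catch the cold case.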
295
+
296
+ ### C15 — AMP `GradScaler` not restored → "No inf checks were recorded" / scale stall
297
+
298
+ **Symptom**: resuming a mixed-precision run raises
299
+ `AssertionError: No inf checks were recorded for this optimizer`, or training stalls/NaNs because the
300
+ loss-scale snapped back to the default and re-enters the scale-search.
301
+
302
+ **Root cause**: the `GradScaler` holds dynamic state — `scale`, `growth_factor`, `backoff_factor`,
303
+ `growth_interval`, `_growth_tracker` — that evolves during training; dropping it resets the scaler
304
+ ([PyTorch AMP recipe](https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html),
305
+ [forum: No inf checks were recorded](https://discuss.pytorch.org/t/resume-training-with-mixed-precision-lead-to-no-inf-checks-were-recorded-for-this-optimizer/115828)).
306
+
307
+ **Fix**: save `scaler.state_dict()` (call it **after** `scaler.update()` in the iteration) and
308
+ `scaler.load_state_dict(checkpoint["scaler"])` on resume. HF Trainer (`scaler.pt`), Accelerate
309
+ (`save_state`), and Lightning (16-bit factor) all do this automatically — the bug bites hand-written loops.
310
+ Resuming a *non-AMP* checkpoint into an AMP run has no saved scaler → start a **fresh** `GradScaler`.
311
+
312
+ ### C16 — EMA / SWA shadow weights not saved → eval on the wrong weights after resume
313
+
314
+ **Symptom**: pre-resume eval (using EMA weights) is good; post-resume eval drops sharply, then recovers
315
+ over many steps — because the EMA copy restarted from the raw weights.
316
+
317
+ **Root cause**: EMA/SWA maintain a *separate* shadow parameter set that is what gets evaluated/exported;
318
+ saving only the live model `state_dict` loses it, so EMA reinitializes from the (noisier) live weights.
319
+
320
+ **Fix**: include `ema.state_dict()` (and SWA `AveragedModel` / `swa_scheduler` state) in the checkpoint
321
+ dict (C1) and restore it. In Lightning, persist it via `on_save_checkpoint`/`on_load_checkpoint` (C11).
322
+ This is a *which-weights-are-correct* concern at the boundary — cross-link **verifying-dl-experiments**
323
+ (**REQUIRED**) for confirming the evaluated weights are the intended ones.
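
The symptom can be reproduced with a scalar "model". This `EMA` class is a hypothetical minimal version written for this sketch (real EMA helpers track tensors and often a step count); it only shows that the shadow copy is separate state that must be saved.

```python
class EMA:
    """Exponential moving average over a dict of scalar parameters."""

    def __init__(self, params, decay=0.9):
        self.decay = decay
        self.shadow = dict(params)      # starts as a copy of the live weights

    def update(self, params):
        for name, value in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1 - self.decay) * value

    def state_dict(self):
        return {"decay": self.decay, "shadow": dict(self.shadow)}

    def load_state_dict(self, state):
        self.decay = state["decay"]
        self.shadow = dict(state["shadow"])


live = {"w": 0.0}
ema = EMA(live)
for i in range(1, 51):                  # live weight drifts upward; EMA lags behind it
    live["w"] = float(i)
    ema.update(live)

checkpoint = {"model": dict(live), "ema": ema.state_dict()}

good = EMA(checkpoint["model"])
good.load_state_dict(checkpoint["ema"])     # shadow restored from the checkpoint

bad = EMA(checkpoint["model"])              # the bug: shadow re-seeded from live weights
print(ema.shadow["w"], good.shadow["w"], bad.shadow["w"])
```

The `bad` shadow equals the live weight, so the first post-resume eval scores the raw weights rather than the averaged ones.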
324
+
325
+ ### C17 — `save_total_limit` / `save_top_k` deletes the very checkpoint resume needs
326
+
327
+ **Symptom**: resume fails because the target checkpoint was auto-pruned; or `load_best_model_at_end`
328
+ errors because the best checkpoint was rotated out.
329
+
330
+ **Root cause**: a rolling limit prunes by *recency* (`save_total_limit`) or by *metric* (`save_top_k`).
330
+ Recency pruning can rotate out the best checkpoint, and metric pruning can drop the most recent step, so
331
+ either the resume anchor or the `best` target can be the one deleted.
333
+
334
+ **Fix**: keep an explicit `last`/`latest` alongside the top-k (`save_last=True` in Lightning, C11; in HF,
335
+ `load_best_model_at_end=True` makes Trainer preserve the best checkpoint past `save_total_limit`). General
336
+ keepable-checkpoint *policy* (how many, which selection criterion, `save_top_k ≤ 3`, prune `latest`) is
337
+ owned by **verifying-dl-experiments** (**REQUIRED**); the disk-budget consequence is
338
+ `references/gotchas_universal.md` U6.
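
For a hand-written loop, a pruning rule that protects both anchors might look like the sketch below. `checkpoints_to_delete` and the `(step, metric)` layout are hypothetical, and this is not how HF Trainer or Lightning implement rotation internally.

```python
def checkpoints_to_delete(ckpts, keep_recent=2, lower_is_better=True):
    """Return the steps that are safe to prune.

    `ckpts` is a list of (step, metric) pairs. The most recent checkpoints
    (resume anchor) and the single best-metric checkpoint are always kept.
    """
    if not ckpts:
        return []
    by_step = sorted(ckpts, key=lambda c: c[0])
    recent = by_step[-max(1, keep_recent):]          # always keep at least the latest
    latest_steps = {step for step, _ in recent}
    pick = min if lower_is_better else max
    best_step = pick(ckpts, key=lambda c: c[1])[0]
    protected = latest_steps | {best_step}
    return [step for step, _ in by_step if step not in protected]


# Hypothetical (step, val_loss) history; the best checkpoint is not among the newest.
history = [(500, 0.92), (1000, 0.61), (1500, 0.58), (2000, 0.66), (2500, 0.71)]
doomed = checkpoints_to_delete(history, keep_recent=2)
print(doomed)  # [500, 1000]
```

Computing the delete list first and removing files afterwards also makes the policy easy to unit-test without touching disk.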
339
+
340
+ ### C18 — `load_state_dict` key mismatch on resume (`module.` prefix, compiled-model prefix)
341
+
342
+ **Symptom**: resume raises `Missing key(s)` / `Unexpected key(s) ... module.<name>` or
343
+ `_orig_mod.<name>`, or strict load fails after switching DDP/`torch.compile` on or off.
344
+
345
+ **Root cause**: `DataParallel`/DDP wrap adds a `module.` prefix and `torch.compile` adds `_orig_mod.` to
346
+ every key; a checkpoint saved wrapped and loaded unwrapped (or vice-versa) won't key-match under
347
+ `strict=True`.
348
+
349
+ **Fix**: save the **unwrapped** module — `model.module.state_dict()` (DDP) /
350
+ `accelerator.unwrap_model(model).state_dict()` / `model._orig_mod.state_dict()` (compiled) — so the
351
+ checkpoint is wrapper-agnostic. On load, strip the prefix if present
352
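+ (`{k.removeprefix("module.").removeprefix("_orig_mod."): v for k, v in sd.items()}`, Python 3.9+; unlike `str.replace`, this only touches the start of the key, so names such as `submodule.` stay intact). Keep `strict=True`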
353
+ while debugging a resume so a silent partial load can't masquerade as success; only relax it deliberately.
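
When wrappers are nested (for example DDP around a compiled model, or the reverse), the prefixes can stack in either order. A sketch of a helper that peels them repeatedly, anchored at the start of each key; `strip_wrapper_prefixes` is a hypothetical name, and it assumes the model itself has no top-level attribute literally called `module` or `_orig_mod`.

```python
WRAPPER_PREFIXES = ("module.", "_orig_mod.")


def strip_wrapper_prefixes(state_dict):
    """Remove leading DDP / torch.compile prefixes from every key, in any nesting order."""
    cleaned = {}
    for key, value in state_dict.items():
        stripped = True
        while stripped:                      # keep peeling until no wrapper prefix remains
            stripped = False
            for prefix in WRAPPER_PREFIXES:
                if key.startswith(prefix):
                    key = key[len(prefix):]
                    stripped = True
        cleaned[key] = value
    return cleaned


# Plain values stand in for tensors.
saved = {
    "module._orig_mod.encoder.weight": 1,
    "_orig_mod.module.head.bias": 2,
    "submodule.proj.weight": 3,              # must NOT be altered
}
print(strip_wrapper_prefixes(saved))
```

The cleaned dict can then be passed to `load_state_dict(..., strict=True)` on the unwrapped module, so any remaining mismatch still surfaces as an error.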
354
+
355
+ ---
356
+
357
+ ## Pointers — owned elsewhere, do NOT restate here
358
+
359
+ - **Cadence — when/how often** (Young/Daly `W = sqrt(2·mu·C)`, grace windows, opportunistic SIGTERM
360
+ last-flush, the runnable atomic skeleton) → `references/spot-resilience.md` (**REQUIRED**, spot tier).
361
+ - **Disk-full on save** (pre-budget, prune `latest`, keep `best`, `.tmp` recovery) →
362
+ `references/gotchas_universal.md` U6; **silent "synced" line** → U33; **inode exhaustion** → U7.
363
+ - **Sharding a model that won't fit** (FSDP wrap policy, ZeRO stages, offload) is the *fitting* concern →
364
+ `references/training/oom-memory.md` M9/M10; this file owns *checkpointing* the sharded state.
365
+ - **Multi-rank save/load collectives + elastic restart** (torchrun `--max-restarts` restores from the
366
+ checkpoint) → `references/training/distributed-launch.md`, `references/multinode.md`.
367
+ - **Keepable-checkpoint policy + "is the resumed/best number real"** (selection criterion, `save_top_k`,
368
+ proving step/epoch/loss continued) → **verifying-dl-experiments** (**REQUIRED**).