opencode-skills-collection 3.1.2 → 3.1.3

Files changed (65)
  1. package/bundled-skills/.antigravity-install-manifest.json +4 -1
  2. package/bundled-skills/agent-creator/SKILL.md +246 -0
  3. package/bundled-skills/ax-extract-workflow/SKILL.md +156 -0
  4. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  5. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  6. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  7. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  8. package/bundled-skills/docs/sources/sources.md +1 -1
  9. package/bundled-skills/docs/users/bundles.md +1 -1
  10. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  11. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  12. package/bundled-skills/docs/users/getting-started.md +1 -1
  13. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  14. package/bundled-skills/docs/users/usage.md +4 -4
  15. package/bundled-skills/docs/users/visual-guide.md +4 -4
  16. package/bundled-skills/lovable-cleanup/SKILL.md +2 -1
  17. package/bundled-skills/remote-gpu-trainer/.gitattributes +8 -0
  18. package/bundled-skills/remote-gpu-trainer/LICENSE +21 -0
  19. package/bundled-skills/remote-gpu-trainer/README.md +267 -0
  20. package/bundled-skills/remote-gpu-trainer/SKILL.md +249 -0
  21. package/bundled-skills/remote-gpu-trainer/evals/README.md +57 -0
  22. package/bundled-skills/remote-gpu-trainer/evals/RESULTS.md +44 -0
  23. package/bundled-skills/remote-gpu-trainer/evals/cases.jsonl +14 -0
  24. package/bundled-skills/remote-gpu-trainer/evals/run_evals.py +68 -0
  25. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/README.md +72 -0
  26. package/bundled-skills/remote-gpu-trainer/examples/autodl_sweep/queue_1.txt +6 -0
  27. package/bundled-skills/remote-gpu-trainer/profiles/_schema.md +100 -0
  28. package/bundled-skills/remote-gpu-trainer/profiles/autodl.md +327 -0
  29. package/bundled-skills/remote-gpu-trainer/profiles/china.md +397 -0
  30. package/bundled-skills/remote-gpu-trainer/profiles/generic-ssh.md +450 -0
  31. package/bundled-skills/remote-gpu-trainer/profiles/lambda.md +342 -0
  32. package/bundled-skills/remote-gpu-trainer/profiles/paperspace.md +365 -0
  33. package/bundled-skills/remote-gpu-trainer/profiles/runpod.md +164 -0
  34. package/bundled-skills/remote-gpu-trainer/profiles/vastai.md +355 -0
  35. package/bundled-skills/remote-gpu-trainer/references/china-network.md +206 -0
  36. package/bundled-skills/remote-gpu-trainer/references/gotchas_universal.md +704 -0
  37. package/bundled-skills/remote-gpu-trainer/references/lifecycle_checklist.md +148 -0
  38. package/bundled-skills/remote-gpu-trainer/references/monitoring_patterns.md +327 -0
  39. package/bundled-skills/remote-gpu-trainer/references/multinode.md +190 -0
  40. package/bundled-skills/remote-gpu-trainer/references/parallel_ablation.md +196 -0
  41. package/bundled-skills/remote-gpu-trainer/references/principles.md +179 -0
  42. package/bundled-skills/remote-gpu-trainer/references/self-improvement.md +74 -0
  43. package/bundled-skills/remote-gpu-trainer/references/spot-resilience.md +235 -0
  44. package/bundled-skills/remote-gpu-trainer/references/ssh_transport.md +270 -0
  45. package/bundled-skills/remote-gpu-trainer/references/training/by-domain.md +230 -0
  46. package/bundled-skills/remote-gpu-trainer/references/training/checkpoint-resume.md +368 -0
  47. package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md +187 -0
  48. package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md +119 -0
  49. package/bundled-skills/remote-gpu-trainer/references/training/distributed-launch.md +422 -0
  50. package/bundled-skills/remote-gpu-trainer/references/training/oom-memory.md +338 -0
  51. package/bundled-skills/remote-gpu-trainer/references/training/precision-stability.md +401 -0
  52. package/bundled-skills/remote-gpu-trainer/references/training/throughput-profiling.md +451 -0
  53. package/bundled-skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh +55 -0
  54. package/bundled-skills/remote-gpu-trainer/scripts/check_staleness.py +70 -0
  55. package/bundled-skills/remote-gpu-trainer/scripts/download_loop.sh +67 -0
  56. package/bundled-skills/remote-gpu-trainer/scripts/gpu_health.sh +169 -0
  57. package/bundled-skills/remote-gpu-trainer/scripts/health_patrol.sh.template +67 -0
  58. package/bundled-skills/remote-gpu-trainer/scripts/mem_monitor.sh +67 -0
  59. package/bundled-skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh +175 -0
  60. package/bundled-skills/remote-gpu-trainer/scripts/run_one.sh.template +104 -0
  61. package/bundled-skills/remote-gpu-trainer/scripts/run_queue.sh.template +83 -0
  62. package/bundled-skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh +35 -0
  63. package/bundled-skills/remote-gpu-trainer/scripts/verify_local.py +145 -0
  64. package/package.json +1 -1
  65. package/skills_index.json +66 -0
@@ -0,0 +1,422 @@
1
+ # Launching & debugging multi-GPU / multi-node training — torchrun · Accelerate · DeepSpeed · DDP · FSDP
2
+
3
+ Pick a launcher, get the rank/world-size env right, choose a parallelism (DDP vs FSDP vs ZeRO),
4
+ and — when 8 processes silently freeze — find *which* rank diverged. This layer owns *making the
5
+ distributed job RUN, not hang, and not silently mis-shard*; **verifying-dl-experiments** owns *is the
6
+ resulting number correct* (a run whose LR silently rescaled with world size, or that resumed from
7
+ step 0 after a restart, is its concern). Cross-link it (**REQUIRED**) wherever a launch fix changes
8
+ effective batch size, LR, or precision.
9
+
10
+ Single box, multiple GPUs is DDP/FSDP over NVLink/PCIe and lives here. The **inter-node** transport
11
+ (NCCL NIC, fabric-manager, timeout, MTU, elastic restart) is `references/multinode.md` (**REQUIRED**
12
+ for any job spanning ≥2 instances) — this file ends where the wire between boxes begins.
13
+
14
+ To jump: `grep -in '<keyword>' references/training/distributed-launch.md` (e.g. `rdzv`, `local_rank`,
15
+ `unused`, `hang`, `desync`, `fsdp`, `zero`, `state_dict`, `port`, `barrier`, `accelerate`).
16
+
17
+ ## Table of contents
18
+
19
+ - **Launchers & env** — D1 torchrun-env-contract · D2 standalone-vs-rendezvous · D3 LOCAL_RANK-device-bug · D4 port-collision · D5 accelerate-launch · D6 deepspeed-launcher · D7 which-launcher
20
+ - **DDP** — D8 find_unused_parameters · D9 uneven-inputs-Join · D10 SyncBN-&-buffers · D11 effective-batch/LR
21
+ - **FSDP** — D12 wrapping-policy · D13 sharding-strategy · D14 mixed-precision · D15 state_dict-type
22
+ - **DeepSpeed** — D16 ZeRO-stages · D17 config.json-knobs · D18 auto-&-engine.backward
23
+ - **The HANGS** (highest-value) — D19 desync-debug-toolkit · D20 one-rank-diverged · D21 rank-conditional-collective · D22 dataloader-length-mismatch · D23 eval/print/save-on-one-rank
24
+ - **Pointers** — inter-node NCCL/NIC/timeout → multinode.md · OOM/sharding-to-fit → oom-memory.md · spot-restart → spot-resilience.md
25
+
26
+ ---
27
+
28
+ ## Launchers & env
29
+
30
+ ### D1 — The rank/world-size env contract every launcher must satisfy
31
+
32
+ **Symptom**: a raw `python train.py` on a 4-GPU box uses **one** GPU; or `init_process_group` fails at
33
+ startup (or hangs waiting for peers) because `MASTER_ADDR`/`RANK` were never set.
34
+
35
+ **Root cause**: `torch.distributed` reads its topology from **environment variables**, not from the GPU
36
+ count. A bare `python` sets none of them, so the process group never forms.
37
+
38
+ **Fix**: launch through `torchrun`, which sets the full contract per process
39
+ ([torchrun docs](https://docs.pytorch.org/docs/2.12/elastic/run.html)):
40
+
41
+ | Var | Meaning |
42
+ |---|---|
43
+ | `RANK` | global rank `0..WORLD_SIZE-1` (unique across the whole job) |
44
+ | `LOCAL_RANK` | rank **within this node** — bind it to the GPU (`cuda:LOCAL_RANK`), NOT `RANK` (D3) |
45
+ | `WORLD_SIZE` | total workers = `nnodes × nproc_per_node` |
46
+ | `LOCAL_WORLD_SIZE` | workers on this node |
47
+ | `GROUP_RANK` | the node's rank (`0..nnodes-1`) |
48
+ | `MASTER_ADDR` / `MASTER_PORT` | FQDN + port of rank-0 hosting the c10d TCP store |
49
+
50
+ The script reads them (`int(os.environ["LOCAL_RANK"])`), calls
51
+ `init_process_group(backend="nccl")` (NCCL for GPU; `gloo` for CPU), and `set_device(LOCAL_RANK)`
52
+ before allocating any CUDA tensor.
53
+
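A minimal sketch of the reading side of this contract, assuming only the variable names in the table above. `DistEnv` and `read_dist_env` are hypothetical helper names, not part of `torch.distributed`; the function accepts any mapping so it can be exercised without a launcher.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Parse the launcher-provided variables, failing loudly if any is missing."""
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [name for name in required if name not in environ]
    if missing:
        raise RuntimeError(
            f"missing {missing}: start the script with torchrun, not bare python"
        )
    env = DistEnv(
        rank=int(environ["RANK"]),
        local_rank=int(environ["LOCAL_RANK"]),
        world_size=int(environ["WORLD_SIZE"]),
        master_addr=environ["MASTER_ADDR"],
        master_port=int(environ["MASTER_PORT"]),
    )
    if not 0 <= env.rank < env.world_size:
        raise RuntimeError(f"RANK={env.rank} outside 0..{env.world_size - 1}")
    return env
```

Calling `read_dist_env()` at the top of a training script turns a missing launcher into an immediate, readable error; `env.local_rank` is the value to hand to `set_device`.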
54
+ ### D2 — Single-node uses `--standalone`; multi-node needs a shared rendezvous id+endpoint
55
+
56
+ **Symptom**: copying a single-node `torchrun` line to a second node either hangs at init or both nodes
57
+ form two separate 1-node groups.
58
+
59
+ **Root cause**: single-node and multi-node use **different rendezvous**. `--standalone` self-hosts a
60
+ rendezvous on localhost (no coordination); multi-node requires every node to point at the *same*
61
+ external rendezvous server with the *same* job id.
62
+
63
+ **Fix** ([torchrun docs](https://docs.pytorch.org/docs/2.12/elastic/run.html)):
64
+ ```bash
65
+ # single node, 4 GPUs — self-contained, no addr/port to manage
66
+ torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py
67
+
68
+ # multi-node: IDENTICAL command on every node; only env-derived node-rank differs
69
+ torchrun --nnodes=2 --nproc-per-node=8 \
70
+ --rdzv-id=$JOB_ID --rdzv-backend=c10d \
71
+ --rdzv-endpoint=$HEAD_IP:29400 train.py
72
+ ```
73
+ `c10d` is the recommended backend (no etcd dependency). `--nnodes=1:4` enables elastic scaling. The
74
+ inter-node wire health (NIC pinning, fabric-manager, timeout) is `references/multinode.md`.
75
+
76
+ ### D3 — Every process lands on GPU 0 (the `RANK` vs `LOCAL_RANK` bug)
77
+
78
+ **Symptom**: on multi-node, all of node-1's processes pile onto `cuda:0` and OOM, while GPUs 1-7 sit
79
+ idle; single-node looked fine.
80
+
81
+ **Root cause**: the script indexed the device by `RANK`, or never selected one. On a single node `RANK==LOCAL_RANK` so
82
+ the bug hides; on node 1 of a 2-node job `RANK` is 8-15 but the node only has GPUs 0-7, so
83
+ `set_device(RANK)` fails with `invalid device ordinal`, and code that skips `set_device` funnels everything to device 0.
84
+
85
+ **Fix**: **always index the local device by `LOCAL_RANK`**, never `RANK`:
86
+ `torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))`. `RANK` selects the *data shard*; `LOCAL_RANK`
87
+ selects the *physical GPU*.
88
+
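To see why the bug hides on one node, here is a small sketch of how the two ranks relate. It assumes a homogeneous job where every node runs the same number of workers and global ranks are assigned node by node; `global_rank` is a hypothetical helper for illustration, since the launcher performs this assignment itself.

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global RANK under node-by-node assignment (an assumption of this sketch)."""
    return node_rank * nproc_per_node + local_rank


GPUS_PER_NODE = 8

for node in range(2):
    for local in range(GPUS_PER_NODE):
        rank = global_rank(node, local, GPUS_PER_NODE)
        # RANK is only a usable device index while it is below the local GPU count.
        usable = rank < GPUS_PER_NODE
        print(f"node={node} LOCAL_RANK={local} RANK={rank:2d} RANK-as-device-ok={usable}")
```

On node 0 the two values coincide, so indexing by `RANK` appears to work; on node 1 every `RANK` is out of range for the local devices while `LOCAL_RANK` still runs 0-7.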
89
+ ### D4 — `RuntimeError: Address already in use` when launching a second job on one node
90
+
91
+ **Symptom**: a second `torchrun` (e.g. a parallel ablation cell) on the same box dies immediately with
92
+ `errno 98: Address already in use`.
93
+
94
+ **Root cause**: both jobs default to `MASTER_PORT=29500`; the c10d TCP store can't bind a port the
95
+ first job holds ([pytorch#85604](https://github.com/pytorch/pytorch/issues/85604)).
96
+
97
+ **Fix**: give each co-located job a unique port **and** disjoint GPUs:
98
+ ```bash
99
+ CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc-per-node=2 --master-port=29500 train.py &
100
+ CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node=2 --master-port=29600 train.py &
101
+ ```
102
+ Or use `--rdzv-backend=c10d --rdzv-endpoint=localhost:0` to let torchrun pick a free port. Fanning cells across instances
103
+ instead of one box → `references/parallel_ablation.md`.
104
+
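When a wrapper script has to choose the port itself, one common approach is to ask the OS for an unused one. This is a sketch, not part of torchrun; `find_free_port` is a hypothetical helper, and another process can still take the port between this call and the launch, so treat the result as a good guess, not a reservation.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Hypothetical usage from a shell: MASTER_PORT=$(python find_free_port.py)
    print(find_free_port())
```

Each co-located job would call this once and pass the result as its `--master-port`.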
105
+ ### D5 — HF Accelerate: `accelerate launch` reads a config, not torchrun flags
106
+
107
+ **Symptom**: `accelerate launch train.py` runs single-GPU despite 4 cards, because no config exists or
108
+ the config's `num_processes` is 1.
109
+
110
+ **Root cause**: Accelerate wraps the same env contract (D1) but sources it from
111
+ `~/.cache/huggingface/accelerate/default_config.yaml` (written by `accelerate config`) or CLI flags
112
+ ([launch docs](https://huggingface.co/docs/accelerate/en/basic_tutorials/launch)).
113
+
114
+ **Fix**: generate a config once, then launch against it — and on a headless rental, write the YAML
115
+ directly instead of the interactive `accelerate config`:
116
+ ```bash
117
+ accelerate launch --multi_gpu --num_processes=4 --mixed_precision=bf16 train.py
118
+ # or a checked-in YAML (reproducible, diffable):
119
+ accelerate launch --config_file configs/acc_fsdp.yaml train.py
120
+ ```
121
+ Switching DDP↔FSDP↔DeepSpeed is *only* a config swap — the training script is unchanged. The same
122
+ `--num_machines`/`--machine_rank`/`--main_process_ip` map onto multi-node (D2 territory).
123
+
124
+ ### D6 — DeepSpeed: `deepspeed` launcher vs `accelerate launch`, and the `hostfile`
125
+
126
+ **Symptom**: `deepspeed train.py` on multi-node can't find the other host, or `--num_gpus` is ignored.
127
+
128
+ **Root cause**: the `deepspeed` launcher discovers nodes from a `hostfile`
129
+ (`worker-1 slots=8`), distinct from torchrun's rendezvous. Under HF it's usually cleaner to let
130
+ `accelerate launch` (with a DeepSpeed plugin/config) drive it
131
+ ([HF DeepSpeed](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)).
132
+
133
+ **Fix**: single-node `deepspeed --num_gpus=8 train.py --deepspeed ds_config.json`; multi-node
134
+ `deepspeed --hostfile=hostfile --num_gpus=8 train.py ...`. With HF Trainer/Accelerate, pass the config
135
+ via `--config_file` and let it spawn the workers — don't mix both launchers.
136
+
137
+ ### D7 — Which launcher / parallelism — decision in one breath
138
+
139
+ - **Model fits on one GPU, just want more throughput** → **DDP** (`torchrun`), simplest, fastest. Each rank holds a full replica.
140
+ - **Model does NOT fit (params+optim+grads ≈ 18 B/param, see oom-memory.md M1)** → shard it: **FSDP** (PyTorch-native) or **DeepSpeed ZeRO** (richer offload). Sharding-to-fit ladder → `references/training/oom-memory.md` M9.
141
+ - **HF ecosystem / Trainer** → **Accelerate** as the launcher; flip a config field to choose DDP/FSDP/ZeRO.
142
+ - **Need CPU/NVMe offload of params *and* optimizer separately, or ZeRO-Infinity** → **DeepSpeed** (FSDP1 offload is all-or-nothing; [HF concept guide](https://github.com/huggingface/accelerate/blob/main/docs/source/concept_guides/fsdp_and_deepspeed.md)).
143
+
144
+ ---
145
+
146
+ ## DDP
147
+
148
+ ### D8 — `find_unused_parameters` — the "Expected to have finished reduction" error vs the silent hang
149
+
150
+ **Symptom**: `RuntimeError: Expected to have finished reduction in the prior iteration before starting
151
+ a new one. ... parameters that were not used in producing loss`
152
+ ([HF discuss](https://discuss.huggingface.co/t/runtimeerror-expected-to-have-finished-reduction-in-the-prior-iteration-before-starting-a-new-one-this-error-indicates-that-your-module-has-parameters-that-were-not-used-in-producing-loss/64760)).
153
+
154
+ **Root cause**: DDP registers an allreduce hook on every parameter and waits for *all* of them each
155
+ step. If a branch (a frozen head, a conditional layer) produces no gradient, its bucket never fires and
156
+ the reduction never completes.
157
+
158
+ **Fix — in priority order**:
159
+ 1. **Best**: make every output participate in the loss (often the real bug is a dropped/detached head).
160
+ 2. If a branch is *legitimately* unused some steps, `DDP(model, find_unused_parameters=True)` — but it adds a full graph traversal each step and **can be drastically slower** ([PyTorch forum](https://discuss.pytorch.org/t/process-got-stuck-when-set-find-unused-parameters-true-in-ddp/106078)). Use only if (1) is impossible.
161
+ 3. If the return value is a dict/list, DDP may not locate the output tensors — flatten or simplify the `forward` return.
162
+ > Setting `find_unused_parameters=True` to *paper over* a real bug masks it — confirm the params are intentionally unused, don't silence the diagnostic.
163
+
164
+ ### D9 — Ranks have unequal batch counts → hang at the last step (uneven inputs)
165
+
166
+ **Symptom**: training completes most of an epoch then **freezes on the final batch**; one rank had fewer
167
+ samples and exited the loop while the others wait in allreduce forever
168
+ ([PyTorch forum](https://discuss.pytorch.org/t/understanding-distributedsampler-and-dataloader-drop-last/206271)).
169
+
170
+ **Root cause**: DDP assumes every rank runs the **same number of collectives**. `DistributedSampler`
171
+ pads (`drop_last=False`) or drops (`drop_last=True`) to equalize, but a custom sampler, a per-rank
172
+ filter, or an `IterableDataset` can leave counts uneven; the short rank then stops calling allreduce.
173
+
174
+ **Fix**:
175
+ - Use `DistributedSampler` (it equalizes by default) and set the **same** `drop_last` on every rank.
176
+ - Truly uneven inputs (variable-length, can't pad): wrap the loop in the **Join** context manager —
177
+ `from torch.distributed.algorithms.join import Join; with Join([model]): for batch in loader: ...`
178
+ — which mirrors the missing ranks' collectives so finished ranks don't deadlock
179
+ ([Join tutorial](https://docs.pytorch.org/tutorials/advanced/generic_join.html)).
180
+ - Always call `sampler.set_epoch(epoch)` each epoch, or every epoch sees the identical shuffle (a
181
+ silent correctness bug — **verifying-dl-experiments** **REQUIRED**).
182
+
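The arithmetic behind the uneven-count hang can be sketched without any framework. The functions below are hypothetical illustrations of the padding and dropping behavior described above, not the library's code: a hand-rolled strided split gives ranks different lengths, while an equalized split gives every rank the same count.

```python
import math


def naive_shard_len(dataset_len: int, world_size: int, rank: int) -> int:
    """Samples a rank gets from a hand-rolled `range(rank, N, world_size)` split."""
    return len(range(rank, dataset_len, world_size))


def equalized_shard_len(dataset_len: int, world_size: int, drop_last: bool) -> int:
    """Per-rank sample count after padding (drop_last=False) or dropping the tail."""
    if drop_last:
        return dataset_len // world_size
    return math.ceil(dataset_len / world_size)


N, W = 10, 4
print([naive_shard_len(N, W, r) for r in range(W)])  # [3, 3, 2, 2]
print(equalized_shard_len(N, W, drop_last=False))    # 3
print(equalized_shard_len(N, W, drop_last=True))     # 2
```

With 10 samples on 4 ranks, the naive split leaves ranks 2 and 3 one sample short; with a batch size of 1 they would leave the loop one step early and the other two would wait in allreduce.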
183
+ ### D10 — BatchNorm stats diverge across ranks; buffers aren't synced
184
+
185
+ **Symptom**: DDP converges worse than single-GPU at the same effective batch, or eval is unstable —
186
+ each rank computed BN statistics on only its local shard.
187
+
188
+ **Root cause**: DDP all-reduces **gradients** only. **Buffers** (BN running mean/var) are at most broadcast from rank 0
189
+ (`broadcast_buffers=True`), and each replica normalizes with its local batch, so small per-GPU batches give noisy, inconsistent stats.
190
+
191
+ **Fix**: convert BN to synchronized BN before wrapping:
192
+ `model = nn.SyncBatchNorm.convert_sync_batchnorm(model)` then `DDP(model, ...)`. Adds a collective per
193
+ BN layer (cost), but BN stats become global. (Whether the metric *needs* SyncBN is a
194
+ **verifying-dl-experiments** call.)
195
+
196
+ ### D11 — N GPUs silently N× the effective batch (and the LR is now wrong)
197
+
198
+ **Symptom**: moving from 1→8 GPUs makes training diverge or plateau; loss curve is shaped differently
199
+ even with "the same config."
200
+
201
+ **Root cause**: DDP keeps per-GPU batch size, so **effective batch = per_gpu_batch × world_size**. The
202
+ LR tuned for the 1-GPU batch is now mismatched (commonly under-scaled). This is the single most common
203
+ silent multi-GPU regression.
204
+
205
+ **Fix**: scale LR with effective batch (linear-scaling rule as a baseline, with warmup) and record
206
+ `world_size`, per-GPU batch, and effective batch in the run manifest. **This changes the science** —
207
+ declare it; comparing a 1-GPU baseline to an 8-GPU run with unscaled LR is not a clean datapoint
208
+ (**verifying-dl-experiments** **REQUIRED**).
209
+
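A worked instance of the arithmetic, as a sketch: the batch size of 32 and base LR of 1e-4 are made-up numbers, the helper names are hypothetical, and the linear rule is only the baseline mentioned above, not a guarantee that the scaled LR is the right one.

```python
def effective_batch(per_gpu_batch: int, world_size: int) -> int:
    """Effective batch under DDP: every rank processes its own per-GPU batch."""
    return per_gpu_batch * world_size


def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear-scaling rule: LR grows in proportion to the effective batch."""
    return base_lr * new_batch / base_batch


base = effective_batch(per_gpu_batch=32, world_size=1)  # 32
new = effective_batch(per_gpu_batch=32, world_size=8)   # 256

# Fields worth recording in the run manifest so the change is declared, not silent.
manifest = {
    "world_size": 8,
    "per_gpu_batch": 32,
    "effective_batch": new,
    "lr": linearly_scaled_lr(base_lr=1e-4, base_batch=base, new_batch=new),
}
print(manifest)
```

Going from 1 to 8 GPUs multiplies the effective batch by 8, so the linear rule multiplies the LR by 8 as well; warmup still has to be configured in the scheduler.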
210
+ ---
211
+
212
+ ## FSDP (Fully Sharded Data Parallel)
213
+
214
+ ### D12 — FSDP wraps the whole model as one unit → no memory saving (wrapping policy)
215
+
216
+ **Symptom**: FSDP enabled but VRAM barely drops vs DDP, or it OOMs gathering one giant flat parameter.
217
+
218
+ **Root cause**: with no `auto_wrap_policy`, FSDP makes the **entire model one FSDP unit** — it must
219
+ all-gather all parameters at once, defeating sharding
220
+ ([FSDP tutorial](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html)).
221
+
222
+ **Fix**: wrap per transformer block so only one block's params are gathered at a time:
223
+ ```python
224
+ from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
225
+ import functools
226
+ policy = functools.partial(transformer_auto_wrap_policy,
227
+ transformer_layer_cls={LlamaDecoderLayer})
228
+ ```
229
+ Under Accelerate set `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` +
230
+ `fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer`
231
+ ([HF FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp)). FSDP2 (`fully_shard`) is the
232
+ current API; the wrapping principle is identical.
233
+
234
+ ### D13 — Sharding strategy: FULL_SHARD vs SHARD_GRAD_OP vs HYBRID
235
+
236
+ **Symptom**: FSDP is communication-bound (allgather/reducescatter dominate the step), or still OOMs.
237
+
238
+ **Root cause**: the strategy trades memory against comms. `FULL_SHARD` (default, == ZeRO-3) shards
239
+ params+grads+optimizer — max memory saving, max comms. `SHARD_GRAD_OP` (== ZeRO-2) shards grads+optim
240
+ only, keeps params resident — less comms, more memory.
241
+
242
+ **Fix**: pick by the binding constraint — OOM → `FULL_SHARD`; comms-bound but it fits →
243
+ `SHARD_GRAD_OP`. On a **multi-node** job where intra-node NVLink is fast but inter-node is slow,
244
+ `HYBRID_SHARD` shards within a node and replicates across nodes (cuts inter-node traffic; pairs with
245
+ `references/multinode.md` NIC tuning).
246
+
247
+ ### D14 — FSDP mixed precision: loss diverges or buffers stay fp32
248
+
249
+ **Symptom**: bf16 FSDP run diverges where bf16 DDP was fine; or BN/positional buffers silently run in
250
+ the wrong dtype.
251
+
252
+ **Root cause**: FSDP mixed precision is **explicit per-tensor-class** via `MixedPrecision(param_dtype,
253
+ reduce_dtype, buffer_dtype)` — not a single AMP flag. Setting `param_dtype=bf16` but leaving
254
+ `reduce_dtype=fp32` (or vice versa) changes gradient-reduction precision; FSDP keeps fp32 master
255
+ weights and casts to bf16 for forward
256
+ ([pytorch#146114](https://github.com/pytorch/pytorch/issues/146114)).
257
+
258
+ **Fix**: set all three deliberately — a safe default is `param_dtype=bf16, reduce_dtype=fp32` (keep
259
+ reductions in fp32 for stability), and set `buffer_dtype` explicitly so buffers don't drift. Prefer
260
+ **bf16 over fp16** for sharded training (no loss-scaler needed). The numerical-correctness check is
261
+ **verifying-dl-experiments**; this entry only ensures the dtypes are *set*, not left implicit.
262
+
263
+ ### D15 — Checkpoint OOMs or saves an unloadable shard (state_dict type)
264
+
265
+ **Symptom**: `FSDP.state_dict()` OOMs the host RAM on rank 0; or every rank wrote a `.pt` and reloading
266
+ on a different world size fails.
267
+
268
+ **Root cause**: FSDP has three state-dict types. `FULL_STATE_DICT` gathers + unflattens the whole model
269
+ to **rank-0 CPU** (peaks host RAM, single-writer); `SHARDED_STATE_DICT` writes one shard per rank
270
+ (scales, but tied to layout); `LOCAL_STATE_DICT` is raw flat params
271
+ ([HF FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp)).
272
+
273
+ **Fix**:
274
+ - Large models / want resumable-at-scale: **`SHARDED_STATE_DICT`** via Distributed Checkpoint (DCP) — each rank saves its shard, reload reshards to any world size.
275
+ - Need a single portable file (export/inference): `FULL_STATE_DICT` with `rank0_only=True, offload_to_cpu=True` so only rank 0 materializes it on CPU (avoids the all-ranks OOM). FSDP2 uses `broadcast_from_rank0=True` to load the full dict on rank 0 then shard out.
276
+ - Atomic-write + load-latest-on-startup is the resume spine regardless of type → `references/spot-resilience.md` and `references/multinode.md` MN5 (a torchrun restart restores the *group*, never the *state*).
277
+
278
+ ---
279
+
280
+ ## DeepSpeed
281
+
282
+ ### D16 — ZeRO stage selection (1/2/3) and what each shards
283
+
284
+ **Symptom**: ZeRO enabled but still OOM, or comms overhead with no memory need.
285
+
286
+ **Root cause**: stages shard progressively more across data-parallel ranks
287
+ ([DeepSpeed ZeRO](https://www.deepspeed.ai/tutorials/zero/)):
288
+ **Stage 1** = optimizer states · **Stage 2** = + gradients · **Stage 3** = + parameters (== FSDP
289
+ `FULL_SHARD`).
290
+
291
+ **Fix**: smallest stage that fits — Stage 2 is the common sweet spot for models that *almost* fit;
292
+ Stage 3 for models that don't fit even with grads sharded; add **ZeRO-Offload** (CPU) or
293
+ **ZeRO-Infinity** (NVMe) only when Stage 3 alone still OOMs (each offload trades large slowdowns for
294
+ capacity → `references/training/oom-memory.md` M10).
295
+
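A back-of-the-envelope sketch of what each stage saves. The default byte counts are assumptions taken from the usual mixed-precision Adam accounting (2 B parameters, 2 B gradients, 12 B optimizer state per parameter), which differs from the ≈ 18 B/param figure quoted in D7, so pass your own numbers if your setup differs. Activations, buffers, and fragmentation are ignored, and `zero_state_bytes_per_gpu` is a hypothetical helper, not a DeepSpeed API.

```python
def zero_state_bytes_per_gpu(
    n_params: int,
    world_size: int,
    stage: int,
    param_bytes: float = 2.0,
    grad_bytes: float = 2.0,
    optim_bytes: float = 12.0,
) -> float:
    """Model-state bytes held by one GPU for a given ZeRO stage (0 = plain DDP)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    if stage >= 1:
        optim_bytes /= world_size  # stage 1 shards optimizer states
    if stage >= 2:
        grad_bytes /= world_size   # stage 2 also shards gradients
    if stage >= 3:
        param_bytes /= world_size  # stage 3 also shards parameters
    return n_params * (param_bytes + grad_bytes + optim_bytes)


for stage in range(4):
    gib = zero_state_bytes_per_gpu(7_000_000_000, world_size=8, stage=stage) / 2**30
    print(f"stage {stage}: {gib:.1f} GiB per GPU")
```

Comparing the printed figures against the card's VRAM, minus activation memory, is a quick way to find the smallest stage that fits.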
296
+ ### D17 — The `ds_config.json` knobs that actually matter
297
+
298
+ **Symptom**: config applied but behavior unchanged, or a cryptic key error at init.
299
+
300
+ **Root cause**: DeepSpeed reads from the JSON, and several Accelerate/Trainer fields are **ignored** once
301
+ a `deepspeed_config_file` is supplied
302
+ ([HF Accelerate DeepSpeed](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)).
303
+
304
+ **Fix** — the load-bearing keys:
305
+ ```jsonc
306
+ {
307
+ "zero_optimization": {
308
+ "stage": 3,
309
+ "offload_optimizer": {"device": "cpu"}, // or "nvme"
310
+ "offload_param": {"device": "cpu"}
311
+ },
312
+ "bf16": {"enabled": true}, // prefer over fp16 (no loss-scale tuning)
313
+ "gradient_accumulation_steps": "auto", // let HF fill from Trainer
314
+ "train_micro_batch_size_per_gpu": "auto",
315
+ "gradient_clipping": "auto"
316
+ }
317
+ ```
318
+ When the JSON is present, `gradient_accumulation_steps`, `gradient_clipping`, `zero_stage`,
319
+ `offload_*_device`, and `mixed_precision` from the Accelerate config are **overridden by the JSON** —
320
+ set them there, not in two places.
321
+
322
+ ### D18 — `"auto"` mismatch and `loss.backward()` vs `engine.backward()`
323
+
324
+ **Symptom**: optimizer steps far less often than expected (gradient accumulation double-counted), or a
325
+ `RuntimeError` about unscaled gradients.
326
+
327
+ **Root cause**: two traps. (a) Setting `gradient_accumulation_steps` in *both* the Trainer/Accelerate
328
+ config *and* the JSON to non-`"auto"` values multiplies them. (b) With DeepSpeed's own AMP, gradient
329
+ scaling lives inside the engine — calling bare `loss.backward()` instead of `model_engine.backward(loss)`
330
+ skips scaling ([DeepSpeed engine](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py)).
331
+
332
+ **Fix**: set accumulation in **one** place (use `"auto"` in the JSON and let HF fill it); in a manual
333
+ loop call `model_engine.backward(loss); model_engine.step()` — never `loss.backward()` /
334
+ `optimizer.step()` directly under DeepSpeed.
335
+
336
+ ---
337
+
338
+ ## The HANGS — debugging a frozen distributed job (highest-value section)
339
+
340
+ A distributed hang has **no traceback** — every rank sits in a collective waiting for a peer that will
341
+ never call it. The job is to identify *which rank* diverged and *which collective* mismatched.
342
+ (Distinct from a **single-process** vanish — for OOM/reboot/SSH-HUP/kill, see `gotchas_universal.md`
343
+ U3; for the *inter-node* causes — fabric-manager, wrong NIC, MTU, the 1800 s NCCL timeout that *masks*
344
+ the real failure — see `references/multinode.md` MN1-MN4.)
345
+
346
+ ### D19 — The desync-debug toolkit: turn a silent freeze into a named mismatch
347
+
348
+ **Symptom**: all ranks frozen, GPUs at 100% SM util but 0% memory-util (spin-wait), no output.
349
+
350
+ **Root cause**: a collective desync — ranks enqueued *different* collectives, or one rank never reached
351
+ the collective the others are blocked in.
352
+
353
+ **Fix — set these and relaunch the hang**:
354
+ - `export TORCH_DISTRIBUTED_DEBUG=DETAIL` + `export TORCH_CPP_LOG_LEVEL=INFO` → on mismatch PyTorch prints `Detected mismatch between collectives on ranks`, naming the op + sequence number per rank ([PyTorch forum](https://discuss.pytorch.org/t/torch-distributed-collectives-call-logging/172726)). (DETAIL itself does collectives — use to *diagnose*, remove for production; it can perturb timing.)
355
+ - `export NCCL_DEBUG=INFO` (or `WARN`) → the node whose log **stops first** before others print their topology is the culprit.
356
+ - `export TORCH_NCCL_ASYNC_ERROR_HANDLING=1` (older PyTorch: `NCCL_ASYNC_ERROR_HANDLING=1`) → a dead rank tears the group down *promptly* instead of every rank waiting out the 1800 s NCCL timeout (`references/multinode.md` MN3).
357
+ - **Flight Recorder** (`TORCH_NCCL_TRACE_BUFFER_SIZE=2000`) dumps the last N collectives per rank with stack traces — read it to see which rank's queue is one collective behind.
358
+
359
+ ### D20 — One rank diverged (NaN/OOM) and the survivors hang waiting for it
360
+
361
+ **Symptom**: training ran for a while, then froze; one rank's last log shows a NaN, an OOM, or a
362
+ data/CUDA error, the rest are stuck in allreduce.
363
+
364
+ **Root cause**: a rank that crashes or `return`s early **stops calling collectives**; the others block.
365
+ The crash is the cause, the hang is the symptom — and without async error handling (D19) it surfaces
366
+ 30 min later as a timeout, far from the cause.
367
+
368
+ **Fix**: with `TORCH_NCCL_ASYNC_ERROR_HANDLING=1` the group aborts near the true failure. Then fix the
369
+ *diverged rank*, not the hang — common roots: one shard hit a bad sample (rank-dependent data), a
370
+ per-rank OOM from uneven sequence lengths (longest-batch lands on one rank → `oom-memory.md` M16), or
371
+ NaN from LR/precision. Don't lower batch size to "fix" a hang that was actually one rank's data bug.
372
+
373
+ ### D21 — A rank-conditional collective (the `if rank == 0:` deadlock)
374
+
375
+ **Symptom**: hangs reproducibly at the *same* spot — often validation, logging, or checkpoint save.
376
+
377
+ **Root cause**: a collective (or a `dist.barrier()`, or an op that *implies* one like `all_gather`,
378
+ SyncBN, or a metric `all_reduce`) placed inside a rank-conditional branch. Rank 0 calls it; others
379
+ skip it; everyone deadlocks. The classic is "save/log on rank 0 only" where the save path triggers a
380
+ collective ([Lightning#19604](https://github.com/Lightning-AI/pytorch-lightning/issues/19604)).
381
+
382
+ **Fix**: collectives must run on **all ranks unconditionally**. Gate only the *side effect*, not the
383
+ collective: compute the metric's `all_reduce` on every rank, then `if rank == 0: log(value)`. A
384
+ `barrier()` must be reached by every rank or none. Audit every `if rank/local_rank == 0` block for a
385
+ hidden collective.
386
+
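The deadlock can be reproduced without GPUs. The sketch below is a simulation only: threads stand in for ranks and `threading.Barrier` stands in for a collective, with a timeout so the demo reports the hang and exits. `run_ranks` is a hypothetical helper. In a real job the ranks that skipped the call would not finish cleanly as they do here; they would block at their next collective.

```python
import threading


def run_ranks(world_size: int, gate_collective: bool, timeout: float = 0.5) -> dict:
    """Run one simulated step per rank and report which ranks got stuck."""
    barrier = threading.Barrier(world_size)
    outcome = {}

    def worker(rank: int) -> None:
        try:
            if gate_collective:
                # Bug pattern: the "collective" sits inside the rank gate.
                if rank == 0:
                    barrier.wait(timeout=timeout)
            else:
                # Correct pattern: every rank enters the "collective" ...
                barrier.wait(timeout=timeout)
                # ... and only the side effect (logging, saving) would be gated here.
            outcome[rank] = "ok"
        except threading.BrokenBarrierError:
            outcome[rank] = "hang"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcome


print(run_ranks(4, gate_collective=True))   # rank 0 waits for peers that never arrive
print(run_ranks(4, gate_collective=False))  # all ranks pass the barrier
```

The gated run reports rank 0 as hung because no other rank ever reaches the barrier; the ungated run completes on every rank.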
387
+ ### D22 — Dataloader length mismatch across ranks (and the `set_epoch` shuffle bug)
388
+
389
+ **Symptom**: hang at end of epoch (D9's mechanism), OR every epoch trains on the identical data order.
390
+
391
+ **Root cause**: two related dataloader faults. (a) Unequal `len(loader)` per rank → the short rank
392
+ stops calling collectives. (b) Forgetting `sampler.set_epoch(epoch)` → `DistributedSampler` reshuffles
393
+ identically every epoch.
394
+
395
+ **Fix**: identical `batch_size`/`drop_last`/sampler on all ranks; call `set_epoch` each epoch; for
396
+ genuinely uneven data use **Join** (D9). The shuffle-staleness is a correctness bug —
397
+ **verifying-dl-experiments** **REQUIRED**.
398
+
399
+ ### D23 — `print` / `tqdm` / eval / `torch.save` interleaving looks like a hang (but isn't always)
400
+
401
+ **Symptom**: garbled interleaved logs from 8 ranks; or an apparent freeze during eval where only rank 0
402
+ should be working.
403
+
404
+ **Root cause**: by default **every rank executes everything** — 8× the prints, 8× eval, 8 ranks racing
405
+ to write the same checkpoint file (corrupting it). If the eval/save path contains a collective and is
406
+ *also* rank-gated, it's the D21 deadlock; if not, it's just noisy + wasteful + a file race.
407
+
408
+ **Fix**: gate pure side effects (logging, progress bar, file writes) to `if rank == 0:` — but keep any
409
+ collective *outside* the gate (D21). Write checkpoints from rank 0 only, to a temp path, atomic-rename
410
+ (`references/spot-resilience.md`), and `dist.barrier()` (on **all** ranks) before others read the file.
411
+ A genuine hang vs noisy-but-progressing is told apart by the Flight Recorder / step counter (D19), not
412
+ by the log soup.
413
+
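A minimal sketch of the rank-0, temp-path, atomic-rename write described above, using only the standard library. `atomic_write` is a hypothetical helper and the payload is plain bytes; in a training loop the bytes would come from serializing the checkpoint, and the `dist.barrier()` on all ranks would follow the call, outside the rank gate.

```python
import os
import tempfile


def atomic_write(path: str, payload: bytes, rank: int = 0) -> bool:
    """Write `payload` to `path` via temp file + rename; only rank 0 writes."""
    if rank != 0:
        return False  # side effect gated; collectives belong outside this function
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Because the temp file lives in the destination directory, a reader sees either the previous complete checkpoint or the new one, never a partial write.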
414
+ ---
415
+
416
+ ## Pointers — handled elsewhere, do not restate
417
+
418
+ - **Inter-node wire** (NCCL NIC pinning, `nvidia-fabricmanager`, the 1800 s timeout masking a dead rank, jumbo-frame MTU, torchrun/Horovod elastic restart restoring the *group* not the *state*) → `references/multinode.md` (**REQUIRED** for ≥2 instances).
419
+ - **Sharding *to fit a model that OOMs*** (the FSDP/ZeRO ladder in cost order, activation checkpointing, offload, LoRA/QLoRA, reading the OOM trace) → `references/training/oom-memory.md`.
420
+ - **Restart-and-resume mechanics** (atomic write, load-latest, cadence, preemption signals) → `references/spot-resilience.md`; the spine is `references/principles.md` #8.
421
+ - **Single-process vanish** (OOM vs reboot vs SSH-HUP vs manual kill) → `references/gotchas_universal.md` U3; **cgroup host-RAM OOM from `num_workers`** → U9; **zombie VRAM after a crashed DDP run** → U11.
422
+ - **Is the resulting number real** (LR-rescaled run, restarted-from-0 run, shuffle staleness, SyncBN necessity, precision change) → **verifying-dl-experiments** (**REQUIRED** at every "this fix changes the science" note above).