npm - opencode-skills-collection - Versions diffs - 3.1.2 → 3.1.4 - Mend

opencode-skills-collection 3.1.2 → 3.1.4


Files changed (65)

package/bundled-skills/remote-gpu-trainer/references/training/convergence-debugging.md ADDED

@@ -0,0 +1,187 @@
+# Convergence & optimization debugging — it runs, doesn't crash, but won't learn (or learns badly)
+The other training layers cover the run that **crashes** (`oom-memory.md`), **NaNs**
+(`precision-stability.md`), **hangs** (`distributed-launch.md`), or is **slow** (`throughput-profiling.md`).
+This file owns the quieter, far more common failure: the job runs cleanly to the end but the **loss is
+flat, falls too slowly, or the model underfits** — and the bug is in the optimization wiring, not the
+hardware. Each entry is **Symptom → Root cause → Fix** with the exact knob. **Always start with O1
+(overfit one batch)** — it separates "the loop is broken" from "the model/data is weak" in five minutes
+and tells you which half of this file you need.
+Boundary: **verifying-dl-experiments** (**REQUIRED** at every "is the result real" fork) owns collapse,
+leakage, metric validity, train-vs-val generalization, and seed interpretation; this file owns the
+*mechanism* of why a correct-looking loop doesn't converge. NaN / loss-spike / LR-too-**HIGH** live next
+door in `precision-stability.md` (P8–P18) — this file is the LR-too-**LOW** / won't-move / mis-wired side.
+To jump: `grep -in '<keyword>' references/training/convergence-debugging.md` (e.g. `overfit`, `requires_grad`,
+`no_grad`, `optimizer`, `weight decay`, `adamw`, `lr finder`, `scheduler`, `accum`, `cross entropy`,
+`bcewithlogits`, `nllloss`, `freeze`, `batchnorm`, `discriminative`, `lora`, `update ratio`, `dead relu`).
+## Table of contents
+- **It isn't learning at all (start here)** — O1 overfit-one-batch · O2 params-not-in-optimizer · O3 loss-detached-from-graph · O4 zero_grad/backward/step-order · O5 train()/eval()-mode
+- **Optimizer / LR / weight-decay / schedule** — O6 AdamW-vs-Adam+no-decay-group · O7 LR-too-LOW+finder · O8 scheduler-order/cadence · O9 grad-accum-divide · O10 AdamW-eps-in-bf16 · O11 fused/foreach
+- **Loss-function footguns** — O12 double-softmax · O13 BCEWithLogits · O14 CE-target-form · O15 padded-loss-reduction · O16 NLLLoss-needs-log_softmax
+- **Fine-tuning / transfer** — O17 frozen-but-still-in-optimizer · O18 frozen-BN-running-stats · O19 discriminative-LR/forgetting · O20 strict=False-shape-mismatch · O21 LoRA/PEFT-wiring
+- **Training-dynamics dashboard (instrument it)** — O22 update:weight-ratio · O23 actual-LR · O24 GradScaler-scale · O25 dead-ReLU-fraction · O26 weight/grad/act-histograms
+- **Pointers** — precision-stability.md, distributed-launch.md, verifying-dl-experiments (skill)
+---
+## It isn't learning at all — the first-hour triage
+### O1 — Run the overfit-one-batch smoke BEFORE tuning anything (the canonical correctness test)
+**Symptom**: training "runs" (no error, normal throughput) but loss plateaus near its init value or wanders without trending down, across LRs/optimizers/architectures. You're tuning hyperparameters blind because nothing proves the loop can learn at all.
+**Root cause**: the loop is broken somewhere between forward and weight-update (any of O2–O5, or a label/shape bug) and no single test isolates "can this code memorize?" from "is this a modeling/data problem?".
+**Fix**: take ONE fixed mini-batch (2 examples is enough) and loop forward/backward/step on **that same batch** for hundreds of iters — a correct loop drives train loss → ~0. Turn **off** augmentation, shuffling, dropout, and weight decay for the test. Also "verify the loss at init" (e.g. softmax CE should start near `-log(1/n_classes)` then fall). If it will not reach ~0, *"there is a bug somewhere and we cannot continue"* — debug the loop (O2–O5) before touching hyperparameters. (Smoke *content/interpretation* → **verifying-dl-experiments**; this is the mechanical gate.) ([Karpathy, "A Recipe for Training Neural Networks"](https://karpathy.github.io/2019/04/25/recipe/))
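A framework-free sketch of this gate, assuming a hypothetical two-example logistic-regression "batch" in place of a real model. It checks that the loss at init equals `-log(1/2)` and that repeated steps on the same batch drive it toward 0. The data, learning rate, and step count are illustrative assumptions, not values from the recipe.

```python
import math

# Hypothetical fixed "batch": two linearly separable examples (assumption).
BATCH = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_loss(w, b):
    """Mean binary cross-entropy of a linear model on the fixed batch."""
    total = 0.0
    for x, y in BATCH:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(BATCH)

def overfit_one_batch(steps=1000, lr=1.0):
    """Plain gradient descent on the SAME batch every step."""
    w, b = [0.0, 0.0], 0.0
    init_loss = batch_loss(w, b)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in BATCH:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y  # dLoss/dlogit
            gw[0] += err * x[0] / len(BATCH)
            gw[1] += err * x[1] / len(BATCH)
            gb += err / len(BATCH)
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return init_loss, batch_loss(w, b)

init_loss, final_loss = overfit_one_batch()
print(f"init={init_loss:.4f} (expected {math.log(2):.4f}), final={final_loss:.4f}")
```

If the final loss on a real loop refuses to approach 0 the way this toy does, the problem is in the wiring, not the hyperparameters.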
+### O2 — Loss flat from step 0, weights byte-identical after `step()` → params aren't in the optimizer
+**Symptom**: overfit-one-batch fails; a snapshotted param is unchanged before/after `optimizer.step()`; grad-norm may even be nonzero. No error.
+**Root cause**: the optimizer updates a **different** set of tensors than the model forwards through. Four causes: (a) the params have `requires_grad=False` so `.grad` stays `None` and `step()` skips them; (b) a submodule/head was never passed into the optimizer's param iterable; (c) the optimizer was built from `model.parameters()` **before** `model.to(device)`, so it holds stale CPU tensors while the model forwards the GPU copies; (d) freeze/unfreeze toggled `requires_grad` but left the wrong set in the optimizer.
+**Fix**: build the optimizer **after** `model.to(device)`. Assert it sees every trainable param: `opt_ids={id(p) for g in optimizer.param_groups for p in g['params']}; assert all(id(p) in opt_ids for p in model.parameters() if p.requires_grad)`. Log `sum(p.requires_grad for p in model.parameters())` at startup. Probe: `w0=next(model.parameters()).clone(); <one step>; assert not torch.equal(w0, next(model.parameters()))`. ([autograd notes](https://docs.pytorch.org/docs/stable/notes/autograd.html), [torch.optim](https://docs.pytorch.org/docs/stable/optim.html), [stale-optimizer-after-.to bug](https://github.com/pytorch/xla/issues/1623))
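The two checks above can be wrapped into helpers. This is a minimal sketch assuming a toy `nn.Sequential` model and MSE loss; the helper names `assert_optimizer_covers` and `weights_move_after_one_step` are hypothetical, not PyTorch APIs.

```python
import torch
from torch import nn

def assert_optimizer_covers(model, optimizer):
    """Fail fast if any trainable parameter is missing from the optimizer."""
    opt_ids = {id(p) for g in optimizer.param_groups for p in g["params"]}
    missing = [n for n, p in model.named_parameters()
               if p.requires_grad and id(p) not in opt_ids]
    assert not missing, f"trainable params not in optimizer: {missing}"

def weights_move_after_one_step(model, optimizer, x, y):
    """Probe: snapshot one parameter, take one step, report whether it changed."""
    param = next(model.parameters())
    before = param.detach().clone()
    optimizer.zero_grad(set_to_none=True)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return not torch.equal(before, param.detach())

torch.manual_seed(0)
# Toy model and data (assumptions for illustration only).
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
assert_optimizer_covers(model, optimizer)
moved = weights_move_after_one_step(
    model, optimizer, torch.randn(2, 4), torch.randn(2, 1))

# The failure mode: an optimizer built over only part of the model.
head_only = torch.optim.SGD(model[2].parameters(), lr=0.1)
```

Calling `assert_optimizer_covers(model, head_only)` raises, naming the first layer's parameters as the ones the optimizer never sees.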
+### O3 — `backward()` is a no-op / raises "does not require grad" → loss detached from the graph
+**Symptom**: overfit fails with every `p.grad is None`; or `loss.backward()` raises *"element 0 of tensors does not require grad and does not have a grad_fn"*.
+**Root cause**: the loss tensor was severed from autograd before `backward`. Common severings: (a) the train forward+loss ran inside `with torch.no_grad():` / `@torch.inference_mode()` left over from eval — *"computations in no-grad mode are never recorded in the backward graph"*; (b) `.item()` / `.detach()` / `.cpu().numpy()` / `float(loss)` on the loss path (e.g. back-propping an accumulated `total_loss += loss.item()`); (c) a tensor rebuilt from numpy mid-network; (d) the metric, not the differentiable loss, was passed to `backward()`.
+**Fix**: before `backward`, `assert loss.requires_grad and loss.grad_fn is not None`. Keep the differentiable loss tensor distinct from logging scalars (log `loss.item()`, back-prop the raw tensor). Reserve `no_grad`/`inference_mode` for eval only. After `backward`, assert at least one `p.grad is not None`. ([autograd notes](https://docs.pytorch.org/docs/stable/notes/autograd.html))
+### O4 — Wrong `zero_grad` / `backward` / `step` order, or a missing `step()`
+**Symptom**: overfit fails; weights never move, or training is erratic despite nonzero grads.
+**Root cause**: PyTorch's contract is *"gradients by default add up; to prevent double-counting we explicitly zero them each iteration"*, `backward` deposits into `.grad`, `step` reads `.grad`. Failure modes: (a) `optimizer.step()` omitted → grads computed, weights never updated; (b) `zero_grad()` placed **after** `backward()` → wipes the fresh grads; (c) `step()` **before** `backward()` → steps on stale/zero grads; (d) `zero_grad` never called → grads from all iters keep summing → effective LR explodes.
+**Fix**: the canonical order, exactly — `optimizer.zero_grad(set_to_none=True)` → forward → `loss=loss_fn(out,y)` → `loss.backward()` → `optimizer.step()` (under AMP: `scaler.scale(loss).backward()` → `scaler.step(optimizer)` → `scaler.update()`). Gradient accumulation is the one exception (O9): `backward` every micro-step, `step`+`zero_grad` only on the boundary. ([optimization tutorial](https://docs.pytorch.org/tutorials/beginner/basics/optimization_tutorial.html), [torch.optim](https://docs.pytorch.org/docs/stable/optim.html))
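A framework-free sketch of failure mode (d), assuming a scalar "model" that minimizes `0.5 * (w - 3)^2` and a plain float standing in for the `.grad` buffer. With the buffer zeroed each step the error decays; without it the summed gradients act like undamped momentum and the weight keeps oscillating.

```python
def train(steps, lr, zero_grad_each_step):
    """Minimize 0.5 * (w - 3)^2 with a .grad-style buffer that accumulates."""
    w, grad_buf, history = 0.0, 0.0, []
    for _ in range(steps):
        if zero_grad_each_step:
            grad_buf = 0.0       # stands in for optimizer.zero_grad()
        grad_buf += w - 3.0      # stands in for loss.backward(): adds into the buffer
        w -= lr * grad_buf       # stands in for optimizer.step(): reads the buffer
        history.append(abs(w - 3.0))
    return history

good = train(steps=200, lr=0.1, zero_grad_each_step=True)
bad = train(steps=200, lr=0.1, zero_grad_each_step=False)
print(f"with zero_grad: final error {good[-1]:.2e}")
print(f"without zero_grad: worst error in last 50 steps {max(bad[-50:]):.2f}")
```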
+### O5 — Forgot `model.train()` / left `model.eval()` on → Dropout & BatchNorm in the wrong mode
+**Symptom**: two faces — (1) trained under `eval()`: BN uses frozen running stats and never updates them, Dropout is off → underfits / loss barely moves; (2) evaluated under `train()`: BN uses noisy per-batch stats and Dropout fires → val loss flickers run-to-run and looks worse than train.
+**Root cause**: `train()`/`eval()` set a per-module flag that *"has an effect only on certain modules ... e.g. Dropout, BatchNorm"* (`eval()` == `train(False)`). In eval mode BN switches to stored `running_mean/var` and **stops** updating them; Dropout becomes identity. A fresh `nn.Module` defaults to `train()`, but any prior `.eval()` (a reused object, an inference helper, a val loop that didn't switch back) persists.
+**Fix**: bracket phases explicitly — `model.train()` atop each train epoch; `model.eval()` + `with torch.no_grad():` for every val/test pass; `model.train()` again before resuming. After build/load, `assert model.training` before the train loop. (Frozen-backbone BN is a *different* axis → O18; tiny-batch BN → by-domain V7.) ([nn.Module.train/eval](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html))
+---
+## Optimizer / learning-rate / weight-decay / schedule
+### O6 — Weight decay "does nothing" / Norm gains destabilize → coupled `Adam(weight_decay=)` + decaying bias & Norm
+**Symptom**: `weight_decay` on `torch.optim.Adam` barely regularizes (or hurts) vs the literature's AdamW recipe; or a from-scratch transformer/CNN trains worse than a reference at the "same" wd; or small models destabilize when LayerNorm/BN gains and biases get shrunk toward 0.
+**Root cause**: (1) `Adam`'s `weight_decay` is classic **L2**: it is added into the gradient, so it passes through Adam's `1/(sqrt(v)+eps)` preconditioner and params with large historical grads get **less** decay; the effective regularization strength no longer tracks `wd`. **AdamW** applies decoupled decay directly to the weight (`θ ← θ − lr·wd·θ`), outside the moment path, so the decay is uniform across params and independent of gradient history (it still scales with `lr`, as the formula shows). They are **not** interchangeable at the same `wd`. (2) Decaying 1-D params (biases, LayerNorm/BN weight & bias) shrinks Norm gains toward 0; these params add little overfitting capacity, and shrinking them degrades training.
+**Fix**: use `torch.optim.AdamW`, not `Adam(weight_decay=...)`. Split into two param groups with `weight_decay=0.0` on the no-decay group — nanoGPT's rule: decay `p.dim()>=2` (matmul/embedding weights), no-decay `p.dim()<2` (all biases + all LayerNorm weights); HF/timm exclude by name (`bias`, `LayerNorm.weight`). ([AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html), [Loshchilov & Hutter 2017 "Decoupled Weight Decay"](https://arxiv.org/abs/1711.05101), [nanoGPT configure_optimizers](https://github.com/karpathy/nanoGPT/blob/master/model.py))
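A scalar sketch of why the two are not interchangeable, assuming a hand-written single Adam update from zero moment buffers (the function `adam_first_step` is hypothetical, not a library call). On the first step Adam's normalized update is `lr * sign(grad)`, so L2 decay folded into the gradient is swallowed by the preconditioner, while the decoupled form still shrinks the weight by `lr * wd * theta`. Later steps differ in degree rather than this cleanly.

```python
import math

def adam_first_step(theta, grad, lr=1e-3, wd=0.0, decoupled=False,
                    betas=(0.9, 0.999), eps=1e-8):
    """One Adam update starting from zero moment buffers (step t = 1)."""
    if decoupled:
        theta = theta * (1.0 - lr * wd)   # AdamW: decay applied straight to the weight
    else:
        grad = grad + wd * theta          # Adam + L2: decay folded into the gradient
    m = (1.0 - betas[0]) * grad
    v = (1.0 - betas[1]) * grad * grad
    m_hat = m / (1.0 - betas[0])          # bias correction at t = 1
    v_hat = v / (1.0 - betas[1])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

results = {}
for grad in (10.0, 0.01):                 # a large-gradient and a small-gradient param
    results[grad] = {
        "plain": adam_first_step(1.0, grad),
        "l2": adam_first_step(1.0, grad, wd=0.1),
        "adamw": adam_first_step(1.0, grad, wd=0.1, decoupled=True),
    }
    print(grad, results[grad])
```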
+### O7 — Loss crawls with no NaN → LR is too **LOW**; find the band with an LR range test
+**Symptom**: no divergence, no NaN, grads finite — loss just falls glacially or plateaus high; throughput is fine but the model "won't learn." Often after copying an LR from a different-batch/optimizer recipe or defaulting to a tiny "safe" LR. (The mirror of P12's too-HIGH spike.)
+**Root cause**: the LR sits far below the productive band, so each update moves the weights by a negligible amount relative to the distance they need to travel, and optimization crawls. The usable band for adaptive optimizers is narrow and architecture-dependent, so a guessed LR is often 1–2 orders of magnitude too small. This is distinguishable from vanishing grads: the grad-norm is healthy, just under-applied.
+**Fix**: run an **LR range test** (Smith) — from a tiny LR, multiply it geometrically each batch over ~100–1000 steps, plot loss vs LR, pick ~1 decade below where loss starts to diverge. Tools: `pytorch-lr-finder` `LRFinder(model,opt,crit).range_test(loader,end_lr=1,num_iter=100)`, fast.ai `learn.lr_find()`, Lightning `Tuner(trainer).lr_find()`. Re-run whenever batch size / optimizer / architecture changes — the band moves; then confirm the LR survives warmup without the P12 spike. ([Smith 2015 "Cyclical Learning Rates"](https://arxiv.org/abs/1506.01186), [pytorch-lr-finder](https://github.com/davidtvs/pytorch-lr-finder), [Smith 2018 disciplined-approach](https://arxiv.org/abs/1803.09820))
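A sketch of the bookkeeping behind a range test, independent of any of the tools named above. `geometric_lrs` builds the per-batch LR sweep; `suggest_lr` is a hypothetical heuristic that treats "loss exceeds 4× the best loss so far" as divergence (the factor is an assumption) and returns one decade below that point. The loss values are made up for illustration.

```python
def geometric_lrs(start_lr, end_lr, num_iter):
    """LR for each batch of the sweep, multiplied by a constant ratio each step."""
    ratio = (end_lr / start_lr) ** (1.0 / (num_iter - 1))
    return [start_lr * ratio ** i for i in range(num_iter)]

def suggest_lr(lrs, losses, diverge_factor=4.0):
    """Return one decade below the first LR whose loss blows past the running best."""
    best = float("inf")
    for lr, loss in zip(lrs, losses):
        best = min(best, loss)
        if loss > diverge_factor * best:
            return lr / 10.0
    return lrs[-1] / 10.0  # never diverged: the sweep can be extended upward

lrs = geometric_lrs(1e-6, 1.0, 7)                # 1e-6, 1e-5, ..., 1
losses = [2.3, 2.3, 2.1, 1.5, 1.0, 6.0, 50.0]    # illustrative recorded losses
picked = suggest_lr(lrs, losses)
print(f"diverges near {lrs[5]:.0e}, suggested LR {picked:.0e}")
```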
+### O8 — `lr_scheduler.step()` before `optimizer.step()` skips the first LR; per-step vs per-epoch cadence
+**Symptom**: PyTorch warns *"Detected call of `lr_scheduler.step()` before `optimizer.step()`"* and the LR curve is off-by-one; OR a cosine/warmup schedule sized in optimizer steps barely moves (it is stepped once per epoch) or reaches its floor far too early (it is stepped every micro-batch under accumulation, so it runs `accum`× too fast).
+**Root cause**: (1) since PyTorch 1.1 the scheduler must step **after** the optimizer — *"if you ... call scheduler.step() before the optimizer's update ... this will skip the first value of the learning rate schedule."* (2) A scheduler advances one tick per `.step()`; schedulers built around `total_steps`/`num_training_steps` in **optimizer** steps (OneCycleLR, HF `get_cosine_schedule_with_warmup`, Lightning `interval='step'`) must be stepped every optimizer step, and under accumulation an "optimizer step" ≠ a batch.
+**Fix**: order it `optimizer.step(); scheduler.step()`. Step at the granularity its `total_steps` was computed in — per optimizer step for warmup/cosine/OneCycle (inside the `if (i+1)%accum==0` block, **not** every micro-batch), per epoch only for epoch schedulers. HF `Trainer` steps it automatically — don't also step it manually. ([torch.optim — scheduler order](https://docs.pytorch.org/docs/stable/optim.html), [OneCycleLR](https://docs.pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html))
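A framework-free sketch of the cadence bug under accumulation, assuming a hand-written linear-warmup plus cosine schedule (`warmup_cosine` is a hypothetical stand-in for a real scheduler). It records the LR each optimizer step would use when the schedule is ticked on the accumulation boundary versus on every micro-batch.

```python
import math

def warmup_cosine(step, warmup_steps, total_steps, base_lr):
    """Linear warmup, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

def optimizer_step_lrs(num_batches, accum_steps, tick_every_micro_batch,
                       warmup_steps=10, base_lr=1.0):
    """LR seen by each optimizer step; the schedule is sized in optimizer steps."""
    total_steps = num_batches // accum_steps
    ticks, trace = 0, []
    for i in range(num_batches):
        boundary = (i + 1) % accum_steps == 0
        if boundary:                              # optimizer.step() happens here
            trace.append(warmup_cosine(ticks, warmup_steps, total_steps, base_lr))
        if boundary or tick_every_micro_batch:    # scheduler.step()
            ticks += 1
    return trace

right = optimizer_step_lrs(400, 4, tick_every_micro_batch=False)
wrong = optimizer_step_lrs(400, 4, tick_every_micro_batch=True)
dead_steps = sum(1 for lr in wrong if lr == 0.0)
print(f"correct cadence: min LR {min(right):.4f}; wrong cadence: {dead_steps} steps at LR 0")
```

With 4 accumulation steps the mis-stepped schedule is exhausted a quarter of the way through, so most optimizer steps run at the floor.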
+### O9 — Gradient accumulation gives effective N×LR → divide the loss by `accum_steps` (and normalize per token)
+**Symptom**: switching from batch `B` to (micro-batch `B/N` × N accumulation) "at the same config" trains hotter/diverges — loss/grad magnitude ~N× too big, i.e. you silently get N× the LR. For token tasks the accumulated loss also differs from the un-accumulated run even after `/N` when micro-batches hold unequal #non-pad tokens.
+**Root cause**: each micro-batch loss is `reduction='mean'`; `backward` **adds** grads across the N micro-batches, so the accumulated grad = SUM of N mean-grads = N× the full-batch mean grad → stepping on it ≈ N× LR. Subtler: dividing each mean-loss by N still mis-weights tokens when micro-batches have different valid-token counts (average-of-means ≠ total-loss / total-tokens) — HF found and fixed exactly this in `transformers` in 2024.
+**Fix**: divide before backward — `loss = loss_fn(out,y) / accum_steps; loss.backward()`, with `step()`/`zero_grad()` only on the boundary. For token-level losses, normalize by the **total** non-pad tokens across the accumulation window (accumulate `reduction='sum'`, divide by total tokens), not the mean-of-means. Under DDP wrap non-boundary micro-steps in `with model.no_sync():` to skip the all-reduce (correctness-neutral, perf win). (DeepSpeed double-counts accum in some configs → D18; world-size×batch → D11.) ([HF "Fixing Gradient Accumulation"](https://huggingface.co/blog/gradient_accumulation), [DDP no_sync](https://docs.pytorch.org/docs/stable/notes/ddp.html))
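The arithmetic can be checked without a framework. This sketch assumes a scalar model `y_hat = w * x` with squared error and made-up data: summing N micro-batch mean-gradients gives N× the full-batch gradient, and dividing each by `accum_steps` recovers it. The second part shows mean-of-means differing from total-loss over total-tokens when micro-batches hold unequal token counts.

```python
import math

def mean_grad(w, batch):
    """Gradient of mean squared error for the scalar model y_hat = w * x."""
    return sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)

# Illustrative data (assumption): 8 examples, split into 4 micro-batches of 2.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0),
        (5.0, 9.0), (6.0, 13.0), (7.0, 15.0), (8.0, 15.0)]
w, accum_steps = 0.5, 4
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]

full_grad = mean_grad(w, data)
summed_grad = sum(mean_grad(w, mb) for mb in micro_batches)               # forgot /N
scaled_grad = sum(mean_grad(w, mb) / accum_steps for mb in micro_batches)

# Token-level case: per-token losses for two micro-batches with 6 and 2 valid tokens.
token_losses = [[2.0] * 6, [8.0] * 2]
mean_of_means = sum(sum(t) / len(t) for t in token_losses) / len(token_losses)
per_token = sum(sum(t) for t in token_losses) / sum(len(t) for t in token_losses)
print(summed_grad / full_grad, scaled_grad / full_grad, mean_of_means, per_token)
```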
+### O10 — `AdamW(eps=1e-8)` underflows in bf16/fp16 → unbounded updates where `v` is tiny
+**Symptom**: a run stable in fp32 develops update spikes/NaNs once optimizer math is half precision; or AdamW behaves as if `eps=0` (huge updates where the second moment `v` is small). Most visible with fp16 optimizer states or foreach/8-bit paths computing `sqrt(v)+eps` in reduced precision.
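+**Root cause**: the AdamW update is `θ -= lr·m̂/(sqrt(v̂)+eps)`. The default `eps=1e-8` is an fp32 value. In fp16 it lies below the smallest subnormal (~`6e-8`) and rounds to **0**: *"if you use 1e-8 as default and you use 16 bit, it will round to zero."* bf16 keeps fp32's exponent range, so `1e-8` itself stays representable there; its 7-bit mantissa only absorbs `eps` in the sum when `sqrt(v̂)` is already hundreds of times larger, which is why bf16 is affected to a lesser degree. With `eps` effectively 0, params whose `v̂≈0` get an unbounded step. (Separate from GradScaler, which protects activations/grads, not this denominator.)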
+**Fix**: raise eps for half-precision optimizer math — `eps=1e-7` (proposed in pytorch#26218 for fp16) up to `1e-6` for bf16; or keep optimizer states / master weights in **fp32** (FSDP `MixedPrecision`, DeepSpeed bf16 keep an fp32 master) so the default eps stays meaningful. Related: `betas=(0.9,0.999)` averages `v` over ~1000 steps — too slow for short fine-tunes; `0.95` is the common LLM-scale second-moment choice. ([pytorch#26218](https://github.com/pytorch/pytorch/issues/26218), [AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html))
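The rounding behaviour can be reproduced with the standard library alone. This sketch uses `struct`'s half-precision format for fp16 and, as an assumption, emulates bfloat16 by rounding the float32 encoding to its top 16 bits (the helper names are hypothetical).

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Emulate bfloat16: keep the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000  # round to nearest even
    return struct.unpack("<f", struct.pack("<I", bits))[0]

eps_default_fp16 = to_fp16(1e-8)   # flushed to zero
eps_raised_fp16 = to_fp16(1e-7)    # survives as a subnormal
eps_default_bf16 = to_bf16(1e-8)   # representable: bf16 has fp32's exponent range
absorbed = to_bf16(1e-3 + 1e-8) == to_bf16(1e-3)  # eps vanishes inside the sum
print(eps_default_fp16, eps_raised_fp16, eps_default_bf16, absorbed)
```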
+### O11 — `fused=True` AdamW breaks under AMP/FSDP; `foreach` inflates peak memory
+**Symptom**: `AdamW(fused=True)` raises (e.g. on `_foreach_sub_` of `device_found_inf`) or mis-steps under GradScaler / bf16-mixed / FSDP; **or** the default `foreach` path OOMs at the optimizer step on a model that fit during forward/backward.
+**Root cause**: (1) fused AdamW does unscale + step + the inf/NaN check inside one CUDA kernel via `found_inf`; version-specific bugs (pytorch#140514, Lightning#21435) come from that plumbing / FSDP interaction — fused is still the experimental path. (2) `foreach` (the CUDA default when unset) horizontally fuses by allocating intermediates across **all** params at once, raising peak memory at the step vs the for-loop path.
+**Fix**: on a fused error/suspicious step under AMP/FSDP/bf16-mixed, fall back to `fused=False` (lets `foreach` default) or upgrade past the fixed issue — confirm a parity loss-curve before trusting fused for a real datapoint. If the **step** OOMs, set `foreach=False` for the low-peak for-loop path (slower, less memory; see oom-memory). Pick deliberately: fused fastest-when-correct, foreach faster than for-loop but higher peak. ([pytorch#140514](https://github.com/pytorch/pytorch/issues/140514), [Lightning#21435](https://github.com/Lightning-AI/pytorch-lightning/issues/21435), [AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html))
+---
+## Loss-function footguns
+### O12 — `softmax`/`log_softmax` before `nn.CrossEntropyLoss` → double-softmax → diluted gradient, slow/no learning
+**Symptom**: a model with a softmax (or log_softmax) final layer trains far slower than expected, plateaus high, or barely learns; loss is sluggish but not NaN. Classic when porting a Keras/TF model (expects probabilities) to PyTorch, or after "adding softmax to get probabilities."
+**Root cause**: `nn.CrossEntropyLoss` internally does `LogSoftmax + NLLLoss` and *"expects ... unnormalized logits."* Feeding already-softmaxed values applies softmax twice; `softmax(softmax(z))` flattens toward uniform, shrinking the logit dynamic range, so the CE gradient w.r.t. the pre-softmax activations becomes small and ill-conditioned. It still trains — just with a near-vanishing signal.
+**Fix**: pass **raw logits** of shape `(N,C)` — remove any `nn.Softmax`/`F.log_softmax`/`nn.LogSoftmax` from the head. Apply softmax only at inference (for probabilities) or argmax (for the class). If you genuinely need log-probs in-graph, use `F.log_softmax` + `nn.NLLLoss` (O16) instead — never both. ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html))
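A framework-free sketch of the dilution, assuming a hand-written `softmax` and an illustrative 3-class logit vector. Because the second softmax only ever sees inputs in `[0, 1]`, the target probability can never exceed `e / (e + C - 1)`, so the loss has a floor of `log(1 + (C - 1) / e)` no matter how confident the model becomes.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """What a fused cross-entropy computes: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])

logits = [5.0, 0.0, -5.0]                              # confident, correct for class 0
correct_loss = cross_entropy(logits, 0)                # raw logits in
double_loss = cross_entropy(softmax(logits), 0)        # probabilities in: softmax twice
floor = math.log(1.0 + (len(logits) - 1) / math.e)     # best case for the double path
print(f"raw logits: {correct_loss:.4f}  double softmax: {double_loss:.4f}  floor: {floor:.4f}")
```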
+### O13 — `sigmoid` + `nn.BCELoss` → `log(0)=-inf` → NaN; use `nn.BCEWithLogitsLoss` (+`pos_weight`)
+**Symptom**: a binary / multi-label head shows NaN or inf loss (often once outputs saturate toward 0/1), or spiky loss; the model has an explicit `torch.sigmoid` before `nn.BCELoss`. Under imbalance it also collapses to always predicting the majority (negative) class.
+**Root cause**: `nn.BCELoss` takes probabilities and computes `log(p)`/`log(1-p)` directly; when the preceding sigmoid saturates (`p`→0 or 1) the naive formula hits `log(0)=-inf` with an inf/NaN gradient that poisons every param. Current PyTorch clamps `BCELoss`'s log terms at `-100`, which caps the loss instead of returning inf, but the saturated sigmoid still passes back no useful gradient. Two separate ops can't use the stabilized formulation. Plain BCE also weights positives and negatives equally, so rare-positive data drives the trivial all-negative solution.
+**Fix**: feed **raw logits** to `nn.BCEWithLogitsLoss` — it fuses sigmoid+BCE with the log-sum-exp trick, avoiding `log(0)`. Remove the explicit sigmoid (apply only at inference). For imbalance pass `pos_weight = #neg/#pos` per class (`>1` raises recall, `<1` raises precision). Target must be **float**, same shape as the logits. ([BCEWithLogitsLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html), [numerical-stability thread](https://discuss.pytorch.org/t/numerical-stability-of-bcewithlogitsloss/8246)) (imbalance *strategy* → by-domain V6.)
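A scalar sketch of the stabilized formulation, written by hand rather than taken from the library source. `naive_bce` mirrors sigmoid followed by plain BCE; `bce_with_logits` uses the `log(1 + exp(-|z|))` rearrangement and an optional `pos_weight`. In pure Python the naive path fails with a `ValueError` at `log(0)`, which stands in for the `-inf` a tensor library would produce.

```python
import math

def naive_bce(logit, target):
    """Sigmoid then BCE as two separate steps."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target, pos_weight=1.0):
    """Fused form: never takes the log of a saturated probability."""
    # -log(sigmoid(logit)), computed without overflow or log(0)
    softplus_neg = math.log1p(math.exp(-abs(logit))) + max(-logit, 0.0)
    weight = 1.0 + (pos_weight - 1.0) * target
    return (1.0 - target) * logit + weight * softplus_neg

try:
    naive_bce(40.0, 0.0)          # sigmoid(40) rounds to exactly 1.0
    naive_failed = False
except ValueError:
    naive_failed = True

stable = bce_with_logits(40.0, 0.0)
weighted = bce_with_logits(0.0, 1.0, pos_weight=3.0)
print(naive_failed, stable, weighted)
```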
+### O14 — `CrossEntropyLoss` target form: long `(N,)` indices in `[0,C)` vs float `(N,C)` soft; off-by-one → device-side assert
+**Symptom**: any of — `RuntimeError: 0D or 1D target tensor expected, multi-target not supported` (one-hot target); `expected scalar type Long but found Float`; `IndexError: Target N is out of bounds` / CUDA `device-side assert ... t >= 0 && t < n_classes` (a label `== C`, or labels `1..C`, or arbitrary ids); or a plausible-but-non-converging loss.
+**Root cause**: `nn.CrossEntropyLoss` has **two** target forms. **Class-index** form: target shape `(N,)` (one fewer dim than the `(N,C,...)` input), dtype `long`, every value in `[0,C)`. A `(N,C)` target is read as multiple targets ("multi-target"); a value `==C` (off-by-one from 1-indexed classes) or non-contiguous ids trips the bounds assert — on CUDA an **async** device-side assert that may surface at a later, unrelated line. **Class-probability** form (soft/smoothed/mixup): target must be float, same shape `(N,C,...)`, summing to 1. Mixing them is the error.
+**Fix**: hard labels → `targets.long()` of shape `(N,)`; remap ids to contiguous `0..C-1` (`{orig:i for i,orig in enumerate(sorted(set(labels)))}`; subtract 1 if 1-indexed); `assert targets.min()>=0 and targets.max()<C`. Don't one-hot the standard path. Debug the opaque CUDA assert with `CUDA_LAUNCH_BLOCKING=1` (or rerun on CPU) for the real line. Soft labels → a float `(N,C)` distribution (no manual log_softmax). Use `ignore_index` for pad, not an out-of-range sentinel (O15). ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), ["Target N out of bounds" + remap](https://discuss.huggingface.co/t/indexerror-target-4-is-out-of-bounds/10213))
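The remap and bounds check can be sketched on plain lists before any tensors are involved. The helper names and the sample ids are hypothetical.

```python
def remap_labels(labels):
    """Map arbitrary class ids to contiguous indices 0..C-1."""
    mapping = {orig: i for i, orig in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping

def check_targets(targets, num_classes):
    """The same bounds the loss enforces, checked up front on the host."""
    assert min(targets) >= 0 and max(targets) < num_classes, "target out of range"

raw = [3, 7, 7, 10, 3]                 # non-contiguous ids (illustrative)
targets, mapping = remap_labels(raw)
check_targets(targets, num_classes=len(mapping))
print(targets, mapping)
```

Keeping `mapping` around lets predictions be translated back to the original ids at inference.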
+### O15 — Padded-token loss: `reduction='mean'` averages over PAD → diluted, length-dependent loss
+**Symptom**: a seq/NLP model's loss looks suspiciously small from step 0 and scales with how much padding is in the batch (more pad → lower loss); the model under-learns real tokens; changing batch size or max-length changes the loss magnitude for the same data.
+**Root cause**: default `reduction='mean'` divides the summed loss by the **total** element count, **including** padded positions, so the real-token loss is averaged with (near-zero) pad contributions — shrinking reported loss and the effective gradient on real tokens by the pad ratio. Unmasked pad targets also contribute real gradient, teaching the model to predict padding.
+**Fix**: skip padding. Easiest: `nn.CrossEntropyLoss(ignore_index=PAD_ID)` — *"the loss is averaged over non-ignored targets"* (sums valid positions, divides by valid count). Otherwise compute `reduction='none'`, multiply by a 0/1 mask, and divide by `mask.sum()` (valid tokens), **not** `mask.numel()`: `loss=(per_tok*mask).sum()/mask.sum().clamp(min=1)`. Reshape logits→`(N*T,C)`, targets→`(N*T,)` first. (Masking the inputs/attention → by-domain L1/L2; this owns the loss **denominator**.) ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), [ignore_index nuance pytorch#63004](https://github.com/pytorch/pytorch/issues/63004))
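The denominator difference in plain numbers, assuming made-up per-token losses for a sequence with three real tokens and three pad positions:

```python
def masked_mean(per_token_loss, mask):
    """Sum the loss over valid tokens and divide by the valid-token count."""
    valid = sum(mask)
    return sum(l * m for l, m in zip(per_token_loss, mask)) / max(valid, 1)

per_token_loss = [2.0, 1.0, 3.0, 0.1, 0.1, 0.1]   # last three positions are PAD
mask = [1, 1, 1, 0, 0, 0]

naive = sum(per_token_loss) / len(per_token_loss)  # divides by all 6 positions
masked = masked_mean(per_token_loss, mask)         # divides by the 3 real tokens
print(f"mean over everything: {naive:.2f}, mean over real tokens: {masked:.2f}")
```

The `max(valid, 1)` guard plays the role of `clamp(min=1)`: an all-pad row yields 0 instead of a division by zero.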
+### O16 — `nn.NLLLoss` fed raw logits (no `log_softmax`) → silently wrong loss
+**Symptom**: a model uses `nn.NLLLoss` but has no `LogSoftmax`/`F.log_softmax` before it (or a plain `Softmax`): training "runs" with no error but loss is nonsensical / won't converge, accuracy stuck near chance.
+**Root cause**: `nn.NLLLoss` computes **no** softmax — *"the input ... is expected to contain log-probabilities."* It simply gathers `-input[target]`. Raw logits → it negates an arbitrary-scale value; softmax **probabilities** (not log) → it negates a value in `[0,1]` giving a tiny, ill-scaled loss. Either way it isn't cross-entropy and the gradient is wrong, but the shapes are valid so PyTorch can't catch it.
+**Fix**: put `F.log_softmax(logits, dim=1)` (or an `nn.LogSoftmax(dim=1)` final layer) immediately before `nn.NLLLoss` (class dim = 1 for `(N,C)`). Simpler and less error-prone: drop NLLLoss+LogSoftmax and use `nn.CrossEntropyLoss` on raw logits (O12), which fuses both. Never pair NLLLoss with a plain (non-log) Softmax. ([NLLLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html))
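A small framework-free sketch of the silent failure, assuming hand-written `log_softmax` and `nll` helpers. Fed raw logits, the "loss" is just the negated target logit: it goes negative and can be pushed down without bound by inflating that logit.

```python
import math

def log_softmax(z):
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum_exp for v in z]

def nll(inputs, target):
    """Negative log-likelihood as a pure gather: -inputs[target]."""
    return -inputs[target]

logits = [5.0, 0.0, -5.0]
wrong = nll(logits, 0)                # raw logits in: a negative, meaningless "loss"
right = nll(log_softmax(logits), 0)   # log-probabilities in: real cross-entropy
print(wrong, right)
```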
+---
+## Fine-tuning / transfer
+### O17 — A "frozen" layer keeps changing → `requires_grad=False` but still in the optimizer
+**Symptom**: you set `requires_grad=False` on the backbone (or set it **after** building the optimizer over `model.parameters()`), yet the frozen weights keep moving every step; pretrained features drift and degrade though no real gradient flows.
+**Root cause**: whether an optimizer touches a param is decided by `param.grad is None`, **not** by `param.requires_grad`. If a frozen param is in the optimizer, after `backward()` its `.grad` is often a **zero tensor** (not `None`), and SGD/Adam apply **weight decay** (`+wd·param`) and **momentum/Adam buffers** *before* the update — so the param moves even on a zero gradient. `requires_grad=False` only stops grad *accumulation*; it does not remove the param from the optimizer.
+**Fix**: exclude frozen params from the optimizer at construction — `optim.SGD([p for p in model.parameters() if p.requires_grad], lr=...)` (or per-module param groups). If you froze after building the optimizer, rebuild it, or set `param.grad=None` for the frozen params each step. Freezing correctly = `requires_grad=False` **AND** not in any optimizer param group (and for Norm layers, O18). ([forum: WD/momentum on zero grad](https://discuss.pytorch.org/t/parameters-with-requires-grad-false-are-updated-during-training/90096), [pytorch#679](https://github.com/pytorch/pytorch/issues/679))
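The mechanism in scalar form, assuming a hand-written SGD-with-L2 update (`sgd_step` is a hypothetical stand-in that mirrors the skip-on-`None` rule described above, not the library's implementation):

```python
def sgd_step(param, grad, lr=0.1, weight_decay=0.01):
    """SGD with L2 weight decay; params whose grad is None are skipped."""
    if grad is None:
        return param
    return param - lr * (grad + weight_decay * param)

still_in_optimizer = sgd_step(1.0, 0.0)    # zero gradient, yet decay moves the weight
truly_frozen = sgd_step(1.0, None)         # no gradient object: untouched
print(still_in_optimizer, truly_frozen)
```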
+### O18 — Frozen backbone left in `.train()` → BatchNorm `running_mean/var` silently drift
+**Symptom**: the backbone is "frozen" (`requires_grad=False`) yet val accuracy is unstable / worse than train, or `eval()` vs `train()`-mode inference disagree; small fine-tuning batches make it worse. The frozen features keep shifting batch-to-batch.
+**Root cause**: BatchNorm has two kinds of state — learnable affine (`gamma/beta`, gated by `requires_grad`) and **non-learnable** `running_mean/running_var` buffers updated by the **forward pass whenever the module is in training mode** (default `momentum=0.1`), independent of `requires_grad` and the optimizer. A frozen backbone left in `.train()` therefore overwrites the pretrained BN stats with your (often tiny, domain-shifted) batch stats — so the "frozen" extractor isn't frozen.
+**Fix**: put the frozen Norm layers in eval mode after `model.train()`: loop over `backbone.modules()` and call `m.eval()` on every instance of `nn.BatchNorm1d`, `nn.BatchNorm2d`, or `nn.BatchNorm3d`. Building them with `track_running_stats=False` also stops the drift, but the layer then always normalizes with batch statistics, so the pretrained running stats are not used. Re-apply every epoch, because a top-level `model.train()` flips children back. ([BatchNorm2d doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)) (general train/eval-mode bug → O5; tiny-batch BN → by-domain V7.)
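A minimal sketch of the freeze helper and of why it has to be re-applied, assuming a toy conv backbone and a hypothetical helper name `freeze_backbone`. The running mean stays fixed while the BatchNorm layer is in eval mode, then drifts after a top-level `model.train()` puts it back in training mode.

```python
import torch
from torch import nn

def freeze_backbone(backbone):
    """Freeze weights and pin BatchNorm layers to their stored running stats."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    for m in backbone.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()

torch.manual_seed(0)
# Toy backbone + head (assumption for illustration).
backbone = nn.Sequential(nn.Conv2d(3, 4, 3), nn.BatchNorm2d(4), nn.ReLU())
model = nn.Sequential(backbone, nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 2))
bn = backbone[1]
x = torch.randn(8, 3, 8, 8) + 5.0       # shifted input so batch stats differ from init

model.train()
freeze_backbone(backbone)               # must come AFTER model.train()
pinned_before = bn.running_mean.clone()
model(x)
pinned_after = bn.running_mean.clone()

model.train()                           # flips the BatchNorm layer back to training mode
model(x)
drifted = bn.running_mean.clone()
```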
+### O19 — One global LR wrecks pretrained features (catastrophic forgetting) → discriminative LR + gradual unfreezing
+**Symptom**: fine-tuning with a single LR either (too high) destroys the pretrained representations on the first updates and accuracy collapses below a frozen-feature baseline, or (too low) the random new head can't move. Both are the same misconfiguration.
+**Root cause**: at step 0 the backbone is near a good optimum but the new head is random, so its large initial loss yields large gradients that, under one high LR, propagate into and overwrite the low-level pretrained layers (catastrophic forgetting). A single LR can't be simultaneously small enough to preserve early layers and large enough to fit the head — the fix is per-group LRs, not more data.
+**Fix**: discriminative fine-tuning: per-layer param groups with LR decaying toward the input (head highest, stem lowest), e.g. `AdamW([{'params':head,'lr':1e-3},{'params':backbone,'lr':1e-5}])`. Combine with **gradual unfreezing** (train the head with the backbone frozen first, then unfreeze one group at a time, starting with the layers nearest the head and working back toward the input) and an LR **warmup** so the random head settles before its gradients reach the backbone. ([Howard & Ruder 2018, ULMFiT: discriminative fine-tuning + gradual unfreezing](https://arxiv.org/abs/1801.06146)) (the general too-high-LR spike → P12.)
+### O20 — `load_state_dict(strict=False)` still RuntimeErrors on the replaced head → shape ≠ key mismatch
+**Symptom**: you replaced the classifier for a new `num_classes` and pass `strict=False` expecting it to skip the head, but loading still crashes: `RuntimeError: ... size mismatch for fc.weight: copying a param with shape [1000,...] ..., the shape in current model is [N,...]`.
+**Root cause**: `strict=False` relaxes only the **presence** check — it tolerates `missing_keys`/`unexpected_keys`. It does **not** relax tensor-shape compatibility: any key present in **both** the checkpoint and the model whose shapes differ (exactly your old-vs-new head `fc.weight/bias`) still raises. So `strict=False` is necessary but not sufficient when the head keeps the same name.
+**Fix**: drop the incompatible head entries before loading, then load non-strict — `sd={k:v for k,v in ckpt.items() if not k.startswith('fc.')}; missing,unexpected = model.load_state_dict(sd, strict=False)` — and inspect `missing/unexpected` to confirm only the head is missing. Or give the new head a different attribute name so it never collides. (Save/resume of matching architectures → checkpoint-resume C1–C18.) ([load_state_dict doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.load_state_dict), [forum: strict=False ≠ shape](https://discuss.pytorch.org/t/when-load-state-dict-strict-false-do-not-work/82301))
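A runnable sketch of both behaviours, assuming a toy `Net` whose head attribute is named `fc` (the class and sizes are illustrative). The first load shows `strict=False` still raising on the shape mismatch; the second filters the head keys and inspects what was left unloaded.

```python
import torch
from torch import nn

class Net(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.body = nn.Linear(8, 16)
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.fc(torch.relu(self.body(x)))

ckpt = Net(num_classes=1000).state_dict()   # stands in for a pretrained checkpoint
model = Net(num_classes=5)                  # same key names, different head shape

try:
    model.load_state_dict(ckpt, strict=False)
    shape_error = None
except RuntimeError as exc:
    shape_error = str(exc)

# Drop the incompatible head entries, then load non-strict and inspect the report.
filtered = {k: v for k, v in ckpt.items() if not k.startswith("fc.")}
result = model.load_state_dict(filtered, strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)
```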
+### O21 — LoRA/PEFT "barely trains" or reloads random → `target_modules` don't match, head/Norm not in `modules_to_save`
+**Symptom**: after `get_peft_model(...)`, `print_trainable_parameters()` shows ~0% (or far fewer than expected) and loss won't drop; or a PEFT classifier reloads with a **random** head / shifted metrics after `save_pretrained`.
+**Root cause**: (a) LoRA only wraps modules whose names match `target_modules`, and names are architecture-specific (`q_proj/v_proj` for Llama vs `query/value` for BERT vs `convolution` for resnet) — a wrong/absent name injects no adapter, PEFT just warns "no modules matched," and you train nothing. (b) A newly-initialized task head (`score`/`classifier`) or a base-model BatchNorm's `running_mean/var` are **not** saved unless listed in `modules_to_save` — reload restores the base's random head / original BN stats → garbage / non-reproducible outputs.
+**Fix**: enumerate real names with `[n for n,_ in model.named_modules()]` and set `LoraConfig(target_modules=[...])` (or `'all-linear'`); confirm with `model.print_trainable_parameters()` and that you see `lora.Linear` layers. Add the new head and any base Norm layers to `modules_to_save` (e.g. `modules_to_save=['classifier','normalization']`) — or pass the right `task_type` (PEFT auto-adds the standard head). ([PEFT troubleshooting](https://huggingface.co/docs/peft/developer_guides/troubleshooting))
+---
+## Training-dynamics dashboard — instrument it so the failure is visible
+### O22 — Update-to-weight L2 ratio ≈ 1e-3 (the single highest-signal LR dial)
+**Symptom**: loss barely moves (under-stepping) or is jittery (over-stepping), and the bare grad-norm can't tell you which — it isn't scale-relative to the weights.
+**Root cause**: what matters is the size of the **actual update** relative to the param's own magnitude — `ratio = ||lr·update|| / ||W||`, measured per tensor **after** `step()` (so it folds in lr, momentum, Adam's preconditioning, weight decay). CS231n's heuristic: this should sit ~`1e-3`. Lower → LR too low (weights barely change); higher (`1e-2..1e-1`) → LR too high. Being per-tensor, it exposes individually mis-scaled layers (an embedding moving 100× faster than the trunk) that a global grad-norm hides.
+**Fix**: log it every K steps — snapshot `w0={n:p.detach().clone() for n,p in model.named_parameters()}` before `step()`, then `(p.detach()-w0[n]).norm()/(w0[n].norm()+1e-12)` per name after. Lever: `≪1e-3` → raise that group's LR; `≫1e-3` → lower LR / lengthen warmup. Track per param group, not just globally. ([CS231n "Neural Networks 3"](https://cs231n.github.io/neural-networks-3/)) (complements P12/P18.)
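The snapshot-and-compare recipe above can be sketched as two small helpers. The names `snapshot` and `update_ratios` and the toy `nn.Linear` model are hypothetical; plain SGD is used only so the numbers are easy to check by hand, and the same two calls wrap `step()` for any optimizer.

```python
import torch
from torch import nn


def snapshot(model: nn.Module) -> dict[str, torch.Tensor]:
    """Copy every trainable tensor so it can be compared after step()."""
    return {n: p.detach().clone() for n, p in model.named_parameters() if p.requires_grad}


def update_ratios(model: nn.Module, before: dict[str, torch.Tensor], eps: float = 1e-12) -> dict[str, float]:
    """Per-tensor ||W_after - W_before|| / ||W_before||."""
    return {
        n: ((p.detach() - before[n]).norm() / (before[n].norm() + eps)).item()
        for n, p in model.named_parameters()
        if n in before
    }


torch.manual_seed(0)
model = nn.Linear(8, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randn(16, 4)

opt.zero_grad()
nn.functional.mse_loss(model(x), y).backward()
w0 = snapshot(model)          # taken right before the update
opt.step()
ratios = update_ratios(model, w0)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2e}")
```

Cloning every parameter costs one extra copy of the weights, so in a real run the snapshot would only be taken on the logging steps (every K steps), not on every iteration.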
+### O23 — Log the **actual** per-step LR, not the config value
+**Symptom**: you log `cfg.lr` (a constant) so the dashboard LR is flat — yet you're on warmup+cosine. You can't see warmup, decay, a restart, or a frozen scheduler; LR-related loss behavior (spike on ramp, stall at the floor) is invisible.
+**Root cause**: the effective LR lives in `optimizer.param_groups[i]['lr']` and is rewritten by `scheduler.step()` each step (and per group for differential/no-decay LRs). Failure modes: plotting the config scalar (never changes); or the O8 order bug skipping the first value. Also `get_lr()` returns a value "one step ahead" — reading it instead of `get_last_lr()` logs the wrong number.
+**Fix**: log `scheduler.get_last_lr()` (a list — one per param group; log them all if you use differential LRs) or read `optimizer.param_groups[0]['lr']` directly, every step. Don't use `get_lr()` for logging. If the logged LR plateaus when it should ramp/decay, your scheduler isn't being stepped (or is stepped at the wrong cadence → O8). ([torch.optim](https://docs.pytorch.org/docs/stable/optim.html), [lr_scheduler source — get_last_lr](https://github.com/pytorch/pytorch/blob/main/torch/optim/lr_scheduler.py))
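A small sketch of per-step LR logging. The linear warmup implemented with `LambdaLR`, the 5-step warmup length and the toy model are illustrative assumptions, not the setup of any particular run; the point is only that the logged value comes from the scheduler, not from the config.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # 0.1 plays the role of cfg.lr
warmup_steps = 5                                   # hypothetical warmup length
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps)
)

logged, from_param_groups = [], []
for step in range(8):
    opt.zero_grad()
    model(torch.randn(3, 4)).sum().backward()
    # Read the LR that this optimizer step is about to use.
    logged.append(sched.get_last_lr()[0])
    from_param_groups.append(opt.param_groups[0]["lr"])
    opt.step()
    sched.step()

print([round(lr, 3) for lr in logged])
```

With several param groups, `get_last_lr()` returns one entry per group, so a dashboard would log each under its own tag.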
+### O24 — GradScaler scale drifting toward 0 = silent persistent fp16 overflow
+**Symptom**: an fp16-AMP run looks healthy (loss prints, no crash) but isn't learning or silently skips many optimizer steps — because you never plotted `scaler.get_scale()` and the loss-scale has cratered from 65536 toward ~1 (or sawtooths down).
+**Root cause**: GradScaler adapts a multiplicative loss-scale: on any inf/NaN grad it multiplies by `backoff_factor=0.5` **and skips** that `step()`; after `growth_interval=2000` clean steps it multiplies by `growth_factor=2.0` (`init_scale=65536`). A few early backoffs are normal calibration (P5/P10), but a scale that keeps halving and stays low means the forward keeps producing values `> fp16's 65504` → grads overflow → step skipped every step → weights frozen while loss still looks plausible. The config "fp16" tells you nothing; only the live scale reveals it.
+**Fix**: add `scaler.get_scale()` and a skipped-step counter to the dashboard. Healthy: a high plateau (`2^13..2^16`) after early calibration. Bad: monotonic decay toward 1, or step-count not advancing with iteration count. Lever when it collapses: switch **fp16 → bf16** (no scaler; fp32 exponent range absorbs the large activations — highest leverage), or keep the overflow-prone block (final logits / attention) in fp32 via a nested `autocast(enabled=False)`, plus z-loss / qk-norm (P15/P16). Don't "fix" it by lowering `init_scale`. ([torch.amp GradScaler](https://docs.pytorch.org/docs/stable/amp.html)) (mechanism → P5/P10.)
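To make the collapse pattern concrete, here is a pure-Python simulation of the scale update rule described above (halve and skip on overflow, double after `growth_interval` clean steps). It is an illustrative model with the defaults quoted in this entry, not PyTorch's implementation, and `simulate_scale` is a hypothetical name.

```python
def simulate_scale(
    overflow_flags,
    init_scale=65536.0,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
):
    """Replay a sequence of per-step overflow flags; return (scale history, skipped steps)."""
    scale, clean_steps, skipped = init_scale, 0, 0
    history = []
    for overflow in overflow_flags:
        if overflow:
            scale *= backoff_factor   # back off and skip this optimizer step
            clean_steps = 0
            skipped += 1
        else:
            clean_steps += 1
            if clean_steps == growth_interval:
                scale *= growth_factor
                clean_steps = 0
        history.append(scale)
    return history, skipped


# Persistent overflow: every step is skipped and the scale halves each time.
bad_history, bad_skipped = simulate_scale([True] * 16)
# Healthy calibration: two early backoffs, then a long clean stretch.
ok_history, ok_skipped = simulate_scale([True, True] + [False] * 2000)
print(bad_history[-1], bad_skipped, ok_history[-1], ok_skipped)
```

Sixteen consecutive overflows are enough to take the scale from 65536 down to 1 with zero weight updates, which is the signature to look for when plotting `scaler.get_scale()` next to a skipped-step counter.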
+### O25 — Rising dead-ReLU / zero-activation fraction → a slice of the net is permanently off
+**Symptom**: capacity quietly vanishes — a layer's outputs are increasingly all-zero, loss plateaus above where it should, and adding width doesn't help. No crash; it just under-fits. Worst case the net degenerates toward a constant function.
+**Root cause**: a ReLU whose pre-activation is driven negative for ~all inputs outputs 0 and has **zero** local gradient there, so backprop sends no signal to its incoming weights — the unit is stuck off and unrecoverable. Triggered by too-high LR (a big update pushes weights/bias deep negative) or a large negative bias. Once a large fraction of a layer dies, gradients can't flow through it and that capacity is gone. The same shape (saturation → ~0 gradient → frozen region) applies to sigmoid/tanh tails.
+**Fix**: instrument the zero/saturation fraction per activation with a forward hook, logged every K steps per layer: `(out==0).float().mean()` for ReLU, `(out.abs()>0.99).float().mean()` for tanh, and `((out<0.01)|(out>0.99)).float().mean()` for sigmoid (its saturated outputs sit near 0 **and** 1, so an absolute-value test alone only catches the upper tail). Healthy: a stable modest dead fraction (ReLU is sparse by design). Bad: a fraction climbing over training or a layer pinned near ~100% dead. Levers, in order: (1) lower LR (the primary cause); (2) ReLU → LeakyReLU / GELU / SiLU so the negative region keeps a gradient; (3) fix init / large negative biases. ([CS231n "Neural Networks 1": dying ReLU](https://cs231n.github.io/neural-networks-1/)) (the *output* being constant is owned by verifying-dl-experiments; this is the internal mechanism.)
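A minimal sketch of the forward-hook instrumentation for ReLU layers. `attach_dead_fraction_hooks` is a hypothetical helper, and the bias is pushed to a large negative value on purpose so the toy layer is fully dead and the reading is easy to verify.

```python
import torch
from torch import nn


def attach_dead_fraction_hooks(model: nn.Module, stats: dict) -> list:
    """Record the fraction of exactly-zero outputs for every nn.ReLU module."""
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            def hook(_module, _inputs, out, name=name):
                stats[name] = (out == 0).float().mean().item()
            handles.append(module.register_forward_hook(hook))
    return handles


torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
with torch.no_grad():
    net[0].bias.fill_(-100.0)   # deliberately kill every unit for the demo

stats = {}
handles = attach_dead_fraction_hooks(net, stats)
net(torch.randn(32, 4))
for handle in handles:
    handle.remove()             # detach hooks once the reading is taken
print(stats)
```

In a training loop the hooks would stay attached and `stats` would be flushed to the logger every K steps; the keys are the dotted module names, so each layer gets its own curve.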
+### O26 — No weight/grad/activation histograms → scalar norms hide bimodal/saturating/collapsing distributions
+**Symptom**: scalar dashboards (loss, one grad-norm) look fine yet the model under-performs or destabilizes — a mean/norm hides the shape: activations drifting to a saturated tail, weights collapsing to a spike at 0 (a layer dying, O25), or a gradient distribution growing fat outlier tails all read as an unremarkable scalar.
+**Root cause**: norms and means are lossy summaries — a healthy spread and a bimodal/all-saturated/all-zero distribution can share the same L2 norm. The diagnostic signal is the **change in shape over training**, which a scalar can't show.
+**Fix**: periodically (every few hundred steps — histograms aren't free) log `SummaryWriter.add_histogram(tag, values, global_step)` for each layer's **weights**, its **gradients** (after `backward`, before `zero_grad`), and key **activations** (forward hook). Read the time-evolution: weights collapsing to a spike = a layer dying; gradient histograms collapsing to ~0 = vanishing (lever: residual/norm/init, P17); fat tails = clip + lower LR (P13/P12); activations wandering into a saturating tail = init/normalization fix (P17). Pair with the scalars above. ([SummaryWriter.add_histogram](https://docs.pytorch.org/docs/stable/tensorboard.html), [Karpathy recipe — visualize weights/activations](https://karpathy.github.io/2019/04/25/recipe/))
+---
+## Pointers — adjacent mechanics catalogued elsewhere
+- **NaN / loss-spike / LR-too-HIGH / grad explosion / z-loss / qk-norm / init & norm placement / determinism** → `references/training/precision-stability.md` (P8–P19). This file is the LR-too-LOW / won't-move side; that one is the blows-up side.
+- **OOM from the optimizer step / activation checkpointing / LoRA-QLoRA memory** → `references/training/oom-memory.md` (M5, M12–M13).
+- **N-GPU effective batch × LR, DeepSpeed accum double-count, find_unused_parameters** → `references/training/distributed-launch.md` (D11, D18, D8).
+- **Dataloader correctness (worker RNG, collate, labels, shuffle) that mimics "won't learn"** → `references/training/data-pipeline.md`.
+- **Is the converged number REAL** (collapse, leakage, train-vs-val, metric validity, seed discipline) → **verifying-dl-experiments** (**REQUIRED** — every "is the result real" fork above).

package/bundled-skills/remote-gpu-trainer/references/training/data-pipeline.md ADDED

@@ -0,0 +1,119 @@
+# Data-pipeline correctness — the silent mistrainers in the DataLoader, not the model
+`throughput-profiling.md` owns making the dataloader **fast**; this file owns making it **correct** — the
+bugs that raise no error and let training "succeed" on the wrong data: augmentations that secretly never
+vary, streams that duplicate across workers/GPUs, collate that crashes or mis-pads, and preprocessing that
+silently shifts the input distribution. Each entry is **Symptom → Root cause → Fix** with the exact knob.
+Boundary: **verifying-dl-experiments** owns the *judgement* "is this leakage / is the metric valid"; this
+file owns the *mechanism* (what the DataLoader / Dataset / transform actually did). When a data bug makes
+training "run but not learn," cross-check `convergence-debugging.md` — **O1 (overfit one batch)** isolates a
+broken loop from broken data.
+To jump: `grep -in '<keyword>' references/training/data-pipeline.md` (e.g. `worker`, `worker_init_fn`,
+`numpy`, `seed`, `iterabledataset`, `get_worker_info`, `collate`, `pin_memory`, `spawn`, `lambda`, `__len__`,
+`drop_last`, `cache`, `bgr`, `totensor`, `normalize`, `set_epoch`, `shuffle`).
+## Table of contents
+- **DataLoader worker RNG (the augmentation-duplication bug)** — DP1 numpy-RNG-duplicated-across-workers · DP2 IterableDataset-duplicated-workers+ranks · DP3 uneven-shard-DDP-hang
+- **Dataset / collate / DataLoader contract** — DP4 ragged-collate · DP5 pin_memory-custom-type · DP6 spawn-breaks-lambdas · DP7 wrong-__len__ · DP8 size-1-batch-kills-BN · DP9 in-RAM-cache-OOM · DP15 /dev/shm-Bus-error
+- **Input preprocessing / labels / shuffle** — DP10 norm-stats-space/split+RGB/BGR · DP11 cv2-BGR · DP12 ToTensor-no-÷255 · DP13 Normalize-before-ToTensor · DP14 shuffle/sampler + set_epoch
+- **Pointers** — throughput-profiling.md, convergence-debugging.md, distributed-launch.md, verifying-dl-experiments (skill)
+---
+## DataLoader worker RNG — the augmentation-duplication bug
+### DP1 — Identical "random" augmentations across workers and every epoch → numpy's global RNG inherited via `fork`
+**Symptom**: with `num_workers>0`, different workers emit the **same** random augmentation parameters (same crop coords, flips, noise) within a batch, and the exact same random sequence repeats **every epoch**. Augmentation diversity collapses to ~`1/num_workers`; the model generalizes worse for no visible reason, with no crash and no warning. (The audit in the tanelp post cited below found this in >95% of the GitHub repos it examined that combine a custom dataset, NumPy's RNG, and multi-process loading.)
+**Root cause**: DataLoader starts workers via `fork` (Linux default), so each worker inherits an **identical** copy of the parent's NumPy global RNG state. PyTorch seeds each worker's **torch** RNG (and Python `random`) to `base_seed+worker_id`, but older releases (before roughly 1.9) do **not** touch numpy's global RNG, so `np.random.*` in `__getitem__`/transforms is identical across workers and, because workers are respawned from the unchanged parent state, identical every epoch. Newer releases also reseed numpy's global RNG per worker, yet the same duplication still hits any generator PyTorch doesn't know about: a `np.random.RandomState`/`default_rng` built with a fixed seed in `__init__`, or a third-party augmentation library's own RNG.
+**Fix**: pass a `worker_init_fn` that reseeds numpy from torch's already-per-worker seed: `def wif(_): np.random.seed(torch.initial_seed() % 2**32)`. `torch.initial_seed()` = `base_seed+worker_id` and `base_seed` is redrawn each epoch, giving both cross-worker **and** cross-epoch variety. **Two traps**: (a) seeding from a constant (`np.random.seed(42+worker_id)`) re-breaks epoch variety — every epoch resets to the same start; (b) do **not** call `torch.manual_seed(CONST)` in `worker_init_fn` — it clobbers torch's correct per-worker offset. Cleanest of all: route augmentation RNG through torch (`torch.rand`/`torch.Generator`), which is auto-seeded per worker — then no `worker_init_fn` is needed. With `persistent_workers=True` the init runs once, so vary per epoch from an epoch counter instead. ([tanelp "PyTorch+NumPy, you're making a mistake"](https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/), [PyTorch "Randomness in multi-process data loading"](https://docs.pytorch.org/docs/stable/notes/randomness.html))
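The `worker_init_fn` from the fix can be written as a module-level function (which also keeps it picklable under `spawn`). `seed_numpy_from_torch` is a hypothetical name; the call in the main process below exists only to make the effect observable, since the DataLoader normally invokes the function once inside each worker.

```python
import numpy as np
import torch


def seed_numpy_from_torch(worker_id: int) -> None:
    """worker_init_fn: derive NumPy's global seed from torch's per-worker seed."""
    # Inside a worker, torch.initial_seed() is already base_seed + worker_id.
    np.random.seed(torch.initial_seed() % 2**32)


# Demonstration in the main process: after the call, NumPy's stream is
# fully determined by torch's current seed.
torch.manual_seed(1234)
seed_numpy_from_torch(worker_id=0)
first_draw = np.random.rand()
print(first_draw)
```

Usage would be `DataLoader(ds, num_workers=4, worker_init_fn=seed_numpy_from_torch)`; the `worker_id` argument is unused here because torch's seed already carries the per-worker offset.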
+### DP2 — `IterableDataset` yields every sample N× (per worker) or world_size× (per rank) → not sharded
+**Symptom**: an `IterableDataset` with `num_workers=N` yields each sample **N times** (an "epoch" is N× too long, samples repeat within a batch); and under DDP every **rank** streams the **same** data, so `all_reduce` averages identical gradients and the model sees `world_size×` fewer unique samples despite more GPUs. Often misread as a too-large dataset or slow convergence.
+**Root cause**: the **same** `IterableDataset` object is replicated onto every worker **and** every rank; unlike map-style datasets there is no `Sampler` handing out disjoint indices (and `DistributedSampler` does **not** apply to `IterableDataset`). Unless `__iter__` partitions the stream itself, all consumers iterate the identical sequence. `get_worker_info()` knows only intra-process workers, not ranks.
+**Fix**: shard by **both** dimensions inside `__iter__`. Workers: `wi=torch.utils.data.get_worker_info()`, then keep records where `idx % wi.num_workers == wi.id` (or contiguous ranges). Ranks: fold in `dist.get_rank()`/`get_world_size()` — `global_id = rank*num_workers + worker_id`, `global_world = world_size*num_workers`, keep `idx % global_world == global_id`. With HF `datasets`, `datasets.distributed.split_dataset_by_node(ds, rank, world_size)` assigns disjoint per-rank shards, then `num_workers` handles the inner split. ([PyTorch data — IterableDataset multi-worker](https://docs.pytorch.org/docs/stable/data.html), [HF datasets#5360 — DDP duplication](https://github.com/huggingface/datasets/issues/5360))
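The two-level modulo rule above can be checked in isolation with a few lines of plain Python. `shard` is a hypothetical helper; in a real `__iter__` the `worker_id`/`num_workers` values would come from `get_worker_info()` (which returns `None` when `num_workers=0`) and `rank`/`world_size` from `torch.distributed`.

```python
from typing import Iterable, Iterator


def shard(records: Iterable, rank: int, world_size: int,
          worker_id: int, num_workers: int) -> Iterator:
    """Yield only the records owned by this (rank, worker) pair."""
    global_id = rank * num_workers + worker_id
    global_world = world_size * num_workers
    for idx, record in enumerate(records):
        if idx % global_world == global_id:
            yield record


# Simulate 2 ranks x 3 workers reading the same 12-record stream.
records = list(range(12))
seen = []
for rank in range(2):
    for worker_id in range(3):
        seen.extend(shard(records, rank, 2, worker_id, 3))
print(sorted(seen) == records)
```

Every record is emitted exactly once across the six consumers, which is the property to assert on a small stream before launching a real job.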
+### DP3 — Uneven `IterableDataset` shard length under DDP → NCCL hang / silent sample drop
+**Symptom**: after correctly sharding an `IterableDataset` by rank, training intermittently **hangs at the last batch** of an epoch (NCCL collective timeout), or some ranks run one extra step.
+**Root cause**: streaming shards rarely divide evenly by `world_size*num_workers`; when one rank's iterator exhausts while others still yield, the finished rank skips its `backward`/all-reduce and the rest block forever waiting on the absent collective. Unlike map-style `DistributedSampler` (which pads to a uniform length), `IterableDataset` sharding gives no automatic length equalization.
+**Fix**: make every rank run the **same** number of steps — (a) compute a global min steps/epoch and stop all ranks there (drop the ragged tail), (b) pad short shards by cycling samples, or (c) wrap with `model.join()` (the DDP `join` context manager) which shadows collectives for ranks that finish early. Set `drop_last=True` to discard the uneven final micro-batch within a worker. (Map-style `set_epoch` hang is a *different* cause → D22.) ([PyTorch data](https://docs.pytorch.org/docs/stable/data.html), [HF datasets#5360](https://github.com/huggingface/datasets/issues/5360))
+---
+## Dataset / collate / DataLoader contract
+### DP4 — `default_collate` "stack expects each tensor to be equal size" on ragged samples → custom `collate_fn`
+**Symptom**: iteration crashes at batch assembly — `RuntimeError: stack expects each tensor to be equal size, but got [..] at entry 0 and [..] at entry 1` — for variable-length sequences, variable bbox counts, or differently-sized images/masks. `batch_size=1` works; the error appears only at `batch_size>1`.
+**Root cause**: the default collate batches same-key tensors with `torch.stack(batch, 0)`, which requires identical shape on every non-batched dim. Ragged samples violate it, so the stack throws — the bug is in the collate glue, not the model or dataset.
+**Fix**: pass `DataLoader(..., collate_fn=my_collate)`. Sequences: `pad_sequence(seqs, batch_first=True, padding_value=pad_id)` + emit a length/attention mask (then mask the loss → O15, by-domain L2). Detection-style ragged targets: keep them as a Python **list** of per-sample tensors instead of stacking (Faster-RCNN/DETR convention). Variably-sized images: pad to the batch-max H/W (NestedTensor / pad+mask). ([PyTorch data — custom collate_fn](https://docs.pytorch.org/docs/stable/data.html), [forum: variable bbox counts](https://discuss.pytorch.org/t/dataloader-collate-fn-throws-runtimeerror-stack-expects-each-tensor-to-be-equal-size-in-response-to-variable-number-of-bounding-boxes/117952))
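A minimal sketch of the sequence case: pad to the batch maximum and return a boolean mask alongside. `pad_collate` and its `pad_id` default are hypothetical; samples are assumed to be plain lists of token ids.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch, pad_id: int = 0):
    """Collate variable-length id sequences into (padded, mask) tensors."""
    seqs = [torch.as_tensor(sample, dtype=torch.long) for sample in batch]
    lengths = torch.tensor([len(seq) for seq in seqs])
    padded = pad_sequence(seqs, batch_first=True, padding_value=pad_id)
    # mask[i, t] is True where position t holds a real token of sample i.
    mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]
    return padded, mask


padded, mask = pad_collate([[1, 2, 3], [4]])
print(padded.tolist(), mask.tolist())
```

Because it is a module-level function it can be passed directly as `DataLoader(..., collate_fn=pad_collate)`; a non-default `pad_id` can be bound with `functools.partial`.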
+### DP5 — `pin_memory=True` silently no-ops on a custom batch type → it must define `.pin_memory()`
+**Symptom**: after wrapping batches in a custom class (a `Batch` object, a graph batch, a dataclass) and setting `pin_memory=True`, the async H2D copy (`.to('cuda', non_blocking=True)`) no longer overlaps — throughput regresses to a blocking copy — or pinning appears to do nothing.
+**Root cause**: DataLoader's pin step only knows how to pin tensors and the built-in containers it recurses into (`list/tuple/dict`). A user-defined batch type is opaque, so its inner tensors stay pageable; the later `non_blocking=True` copy then silently falls back to **synchronous** (the T6 overlap is lost). PyTorch's contract: *"to enable memory pinning for custom batch or data type(s), define a `pin_memory()` method on your custom type(s)."*
+**Fix**: implement `def pin_memory(self): self.x=self.x.pin_memory(); self.y=self.y.pin_memory(); return self` on the batch type. It must return the batch, because the DataLoader's pin step calls it per batch and hands the return value on to your loop. Then keep `pin_memory=True` and transfer with `.to(device, non_blocking=True)`. ([PyTorch data: Memory Pinning](https://docs.pytorch.org/docs/stable/data.html)) (pinned-memory *perf* mechanics → throughput T6.)
+### DP6 — `num_workers>0` under the `spawn` start method (Windows/macOS) breaks lambdas/closures
+**Symptom**: on Windows/macOS, `num_workers>0` raises `AttributeError: Can't pickle local object '<locals>.<lambda>'`; OR worse, it proceeds but transforms silently vanish (samples come back un-augmented). The identical code runs fine with `num_workers=0` or on Linux.
+**Root cause**: Windows/macOS default to `spawn` — each worker launches a fresh interpreter and reconstructs the dataset/collate/transforms via **pickle**. Lambdas, nested functions, and closures aren't picklable → a hard pickle error, or (pytorch/vision#8066) transforms dropped during serialization. Linux's `fork` copies live memory, masking the bug.
+**Fix**: make everything the worker reconstructs a top-level importable callable — replace `collate_fn=lambda b: ...` and lambda transforms with module-level `def`s; bind args with `functools.partial(top_level_fn, ...)` not a closure; for parameterized transforms use a top-level callable class. Keep main-script code under `if __name__ == '__main__':`. Stopgap: `num_workers=0` sidesteps pickling. ([vision#8066 — transforms lost under spawn](https://github.com/pytorch/vision/issues/8066), [PyTorch data — platform-specific](https://docs.pytorch.org/docs/stable/data.html))
+### DP7 — Wrong `Dataset.__len__` → out-of-range `__getitem__`: IndexError, or a SILENT modulo wraparound
+**Symptom**: either (a) `IndexError`/`KeyError` from `__getitem__` partway through an epoch, or (b) no error but training quietly sees duplicated/skipped samples — when `__getitem__` does `self.items[idx % len(...)]` or indexes a shorter list so over-long indices wrap.
+**Root cause**: the map-style contract — `__len__()` must equal the number of valid keys, and the default `RandomSampler` draws indices from `range(len(dataset))`. If `__len__` is computed from a different/stale source than `__getitem__` indexes (counts files but indexes a filtered list, an off-by-one, a cached length), the sampler requests indices the structure can't serve. A defensive `idx % N` turns the loud IndexError into a silent correctness bug.
+**Fix**: compute `__len__` and `__getitem__` from the **same** collection (materialize the kept-index list in `__init__`, index through it). Remove any `idx % N`/clamping — let an out-of-range index raise. Sanity once: `assert len(ds)==<expected>`; `ds[len(ds)-1]` works and `ds[len(ds)]` raises. ([pytorch#45040](https://github.com/pytorch/pytorch/issues/45040), [PyTorch data — map-style contract](https://docs.pytorch.org/docs/stable/data.html))
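A sketch of the "same collection" rule using a duck-typed map-style dataset in plain Python. `KeptIndexDataset` and `is_png` are hypothetical names; the filter is a module-level function, not a lambda, so the class stays picklable for worker processes.

```python
from typing import Callable, Sequence


def is_png(name: str) -> bool:
    return name.endswith(".png")


class KeptIndexDataset:
    """Map-style dataset whose __len__ and __getitem__ share one index list."""

    def __init__(self, items: Sequence, keep: Callable[[object], bool]) -> None:
        self.items = items
        # Materialize the surviving positions once; both methods use this list.
        self.kept = [i for i, item in enumerate(items) if keep(item)]

    def __len__(self) -> int:
        return len(self.kept)

    def __getitem__(self, idx: int):
        # No modulo, no clamping: an out-of-range index must fail loudly.
        if not 0 <= idx < len(self.kept):
            raise IndexError(f"index {idx} out of range for {len(self.kept)} samples")
        return self.items[self.kept[idx]]


ds = KeptIndexDataset(["a.png", "b.txt", "c.png"], keep=is_png)
print(len(ds), ds[len(ds) - 1])
```

The length counts only the kept files, the last valid index resolves to the last kept file, and `ds[len(ds)]` raises, which is exactly the one-time sanity check described above.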
+### DP8 — A size-1 final batch crashes BatchNorm → `drop_last=True` on the train loader
+**Symptom**: training runs most of an epoch then dies at the **last** batch with `ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, C, ...])`. Happens when `len(dataset) % batch_size == 1`.
+**Root cause**: `nn.BatchNorm*` in training mode computes per-channel mean/var over the batch; with a single sample and trivial spatial size the per-channel count is 1, so variance is undefined and `F.batch_norm` raises (an intentional guard since 0.3). The default `DataLoader(drop_last=False)` keeps that ragged final batch.
+**Fix**: `DataLoader(..., drop_last=True)` on the **train** loader discards the incomplete final batch (the standard fix). Alternatives if you can't drop data: swap BatchNorm → `nn.GroupNorm`/`nn.LayerNorm` (no batch-stat dependence), or freeze BN to eval (O18). Keep `drop_last=False` on the **eval** loader (you want every sample) and rely on `model.eval()` there. (Tiny-batch BN *quality* → V7; per-rank batch-count equalization → D9; this is the single-process size-1 crash.) ([pytorch#4534](https://github.com/pytorch/pytorch/issues/4534))
+### DP9 — An in-RAM `Dataset` cache grows into host-OOM (and under `fork` workers never even shares)
+**Symptom**: RAM climbs steadily across iters/epochs until a bare `Killed` (exit 137, no traceback) — typically from a Dataset that lazy-caches decoded samples (`if idx not in self.cache: self.cache[idx]=load(idx)`). With `num_workers>0` the growth is **per-worker** and the cache gives no speedup.
+**Root cause**: two compounding effects — (1) the cache is unbounded: every index ever requested stays resident, so an epoch caches the whole decoded dataset; (2) under Linux `fork`, each worker is copy-on-write, so writing `self.cache[idx]=...` copies the touched Python objects' pages into that worker's **private** memory — invisible to siblings, so the cache both replicates (RAM × ~`num_workers`) AND is useless for cross-worker reuse.
+**Fix**: don't accumulate unbounded Python objects in `__getitem__`. Options: (a) precompute to a single `np.memmap` / Arrow / LMDB / `.npy` in `__init__` and read slices (the OS page cache **is** shared across forked workers); (b) bound the cache (`functools.lru_cache(maxsize=...)` or a ring buffer); (c) store it in shared memory (`Tensor.share_memory_()`). Prefer numpy/Arrow buffers over `list`/`dict` to avoid copy-on-write page churn. (Static `num_workers × big tensor` startup multiplier → U9; this is the *grows-during-training* cousin.) ([pytorch#13246 — worker memory replication](https://github.com/pytorch/pytorch/issues/13246), [PyTorch data — multi-process memory caveat](https://docs.pytorch.org/docs/stable/data.html))
+---
+## Input preprocessing / labels / shuffle
+### DP10 — Normalization applied in the wrong space/split, or stats mis-aligned to channel order → accuracy quietly tanks
+**Symptom**: the model loads and runs without error, but a pretrained backbone scores far below its reported number, or your own val accuracy is a few points under train for no obvious reason; predictions are systematically biased (reds↔blues confused if channel order is wrong).
+**Root cause**: the per-channel mean/std are correct numbers applied in the wrong space or order. (1) Stats must be computed on the **train split only** and reused verbatim at eval (the sklearn contract: `fit_transform` on train, `transform` — never `fit` — on test/whole set). (2) torchvision pretrained weights expect input already scaled to `[0,1]`, in **RGB**, then normalized with ImageNet `mean=[0.485,0.456,0.406]`/`std=[0.229,0.224,0.225]`. That mean vector is **RGB-indexed**, so feeding a BGR tensor (cv2 default, DP11) aligns the R-stat to the B channel.
+**Fix**: compute stats once on train and reuse the same constants/transform at eval. For a torchvision pretrained model don't hand-roll it — use `weights.transforms()` (e.g. `ResNet50_Weights.IMAGENET1K_V2.transforms()`), which bundles resize + to-`[0,1]` + RGB + the exact Normalize the weights were trained with. (The leakage *judgement* is owned by verifying-dl-experiments; this is the mechanism.) ([sklearn "Common pitfalls" — fit on train only](https://scikit-learn.org/stable/common_pitfalls.html), [torchvision models — input contract](https://docs.pytorch.org/vision/stable/models.html)) (extends V1.)
+### DP11 — `cv2`-loaded image (BGR) fed to an RGB-trained model → channels swapped
+**Symptom**: a pipeline mixing `cv2` for I/O and a PIL/torchvision-trained (RGB) model: no exception, but color-sensitive predictions degrade; visualizing the array shows reds appearing blue. Often surfaces only when you switch the loader (PIL→cv2) and accuracy drops with zero logic change.
+**Root cause**: `cv2.imread`/`VideoCapture` return **BGR** channel order, whereas `PIL.Image` and essentially every ImageNet-pretrained model assume **RGB**. Indexing channel 0 as "red" now reads blue. Both are valid `HxWx3 uint8` arrays, so nothing errors — the model just sees a consistently color-swapped distribution.
+**Fix**: convert immediately after a cv2 load — `img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)` (or `img = img[:, :, ::-1].copy()` — the `.copy()` matters, a negative-stride view breaks `torch.from_numpy`). Or switch I/O to `torchvision.io.read_image`/PIL (RGB). Keep one channel convention end-to-end and assert it at the dataset boundary. ([BGR↔RGB / cvtColor](https://note.nkmk.me/en/python-opencv-bgr-rgb-cvtcolor/), [torchvision models — RGB](https://docs.pytorch.org/vision/stable/models.html))
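The slicing variant can be illustrated with NumPy alone. The synthetic "pure blue" array stands in for what a cv2 load would return, so no image file or OpenCV install is assumed.

```python
import numpy as np

# Synthetic stand-in for a cv2 image: channel 0 is blue in BGR layout.
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255

view = bgr[:, :, ::-1]   # RGB order, but only a negative-stride view
rgb = view.copy()        # contiguous RGB array with positive strides

print(view.strides, rgb.strides)
print(rgb[0, 0].tolist())
```

After the copy, blue sits in the last channel as an RGB consumer expects, and the array has ordinary positive strides, which is the form the fix above recommends handing to `torch.from_numpy`.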
+### DP12 — `transforms.ToTensor` doesn't ÷255 for non-`uint8` input → activations 255× too large
+**Symptom**: loss is huge or NaN from step 0, or activations/gradients are enormous, when the input came from a float numpy array, a `.npy`, a 16-bit/HDR image (PIL mode `I`/`F`), or a tensor already in `[0,1]`. The same model works fine on `uint8` PNGs.
+**Root cause**: `transforms.ToTensor()` rescales to `[0,1]` (÷255) **only** when the source is a PIL Image in a listed mode **or** a numpy array with `dtype==uint8`. In every other case (float32/64, int32, exotic PIL modes) it converts **without** scaling — so a float numpy array in `0..255` stays `0..255`, and a `uint8` array someone already scaled gets ÷255 a second time (→ `0..0.004`).
+**Fix**: don't rely on `ToTensor` for non-uint8 scaling. For float inputs scale explicitly: `t = torch.from_numpy(arr).float() / 255.0` (or the correct max for 16-bit). In the v2 API prefer `transforms.v2.ToImage()` + `transforms.v2.ToDtype(torch.float32, scale=True)`, where `scale=True` makes the rescale explicit and dtype-aware. Sanity: `assert 0.0 <= x.max() <= 1.0` right after. ([ToTensor doc — "tensors are returned without scaling" for other cases](https://docs.pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html))
+### DP13 — `transforms.Normalize` placed before `ToTensor` in `Compose` → TypeError (it needs a float CHW tensor)
+**Symptom**: dataset construction or the first `__getitem__` raises `TypeError: tensor should be a torch tensor. Got <class 'PIL.Image.Image'>` (or `img should be Tensor`).
+**Root cause**: `transforms.Normalize` operates on a float tensor shaped `(C,H,W)` and subtracts a length-`C` mean / divides by length-`C` std along dim 0; it cannot consume a PIL Image or HWC array. In a `Compose` the steps run top-to-bottom, so `Normalize` must come **after** `ToTensor` (which produces the float CHW tensor). PIL-domain ops (Resize/Crop/flip) must come **before** `ToTensor`.
+**Fix**: order the pipeline — PIL ops → `ToTensor()` → `Normalize(mean, std)`, e.g. `Compose([Resize(256), CenterCrop(224), ToTensor(), Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])])`. `mean`/`std` lengths must equal the channel count (3 RGB, 1 grayscale). ([torchvision transforms — Compose order](https://docs.pytorch.org/vision/stable/transforms.html))
+### DP14 — DataLoader silently un-shuffles → `shuffle=True`+sampler raises; `DistributedSampler` without `set_epoch` replays one order
+**Symptom**: two shuffle failures — (a) `ValueError: sampler option is mutually exclusive with shuffle` the moment you add any sampler; (b) no error, but in DDP every epoch iterates the data in the **identical** order, so the train-loss curve looks oddly periodic / over-memorized and shuffling "does nothing."
+**Root cause**: (a) `DataLoader.__init__` enforces mutual exclusion — `shuffle` picks the sampler for you (`True`→`RandomSampler`, `False`→`SequentialSampler`), so passing both is contradictory; `batch_sampler` is likewise exclusive with `batch_size`/`shuffle`/`sampler`/`drop_last`. (b) `DistributedSampler` derives its per-epoch permutation from a generator seeded `self.seed + self.epoch`, and `self.epoch` stays **0** until you call `sampler.set_epoch(epoch)` — so without it every epoch uses `seed+0` → byte-identical ordering.
+**Fix**: (a) when you must use a sampler (DistributedSampler, WeightedRandomSampler), set `shuffle=False` and let the sampler own ordering. (b) call `train_sampler.set_epoch(epoch)` at the **start of each epoch** before iterating (Lightning/Accelerate do this for you; raw torchrun is your responsibility). Verify by logging the first few indices of epoch 0 vs 1 — they must differ. (The DDP `set_epoch` **hang** is a different failure → D22.) ([DataLoader source — shuffle/sampler exclusivity](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py), [DistributedSampler.set_epoch](https://docs.pytorch.org/docs/stable/data.html))
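The replay behavior and the verification step can be reproduced without launching DDP, because `DistributedSampler` accepts explicit `num_replicas` and `rank`. The 16-item list standing in for a dataset and the 2-replica layout are illustrative assumptions.

```python
from torch.utils.data import DistributedSampler

data = list(range(16))  # anything with __len__ works as the dataset here
sampler = DistributedSampler(data, num_replicas=2, rank=0, shuffle=True, seed=0)

# Without set_epoch the internal epoch stays 0, so the order is replayed.
replay_a = list(sampler)
replay_b = list(sampler)

# With set_epoch the permutation is drawn from seed + epoch.
sampler.set_epoch(1)
epoch1 = list(sampler)

print(replay_a[:4], replay_b[:4], epoch1[:4])
```

In a real loop the call sits at the top of the epoch, `train_sampler.set_epoch(epoch)`, and the loader is built with `sampler=train_sampler` and no `shuffle=True`.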
+### DP15 — `Bus error` / DataLoader worker killed → `/dev/shm` exhausted (the rental-container classic)
+**Symptom**: `DataLoader worker (pid N) is killed by signal: Bus error`, or `RuntimeError: unable to write to file </torch_...>` / `received 0 items of ancdata` — on a **rented container** while the identical code runs fine on your workstation. Usually with `num_workers>0`, often mid-epoch.
+**Root cause**: PyTorch passes worker tensors through **shared memory** (`/dev/shm`). Docker defaults `/dev/shm` to **64 MB** and many rentals inherit that, so a few workers moving normal batches overrun it and the kernel SIGBUS-kills a worker. This is *shared-memory* exhaustion — NOT host-RAM OOM (a bare `Killed` / exit-137 → `gotchas_universal.md` U9) and NOT a deadlock.
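+**Fix**: enlarge it at launch with `docker run --shm-size=8g` (or `--ipc=host`). Where you can't set that (a fixed rental), lower `num_workers` (and `prefetch_factor` / batch size) so fewer batches sit in shared memory at once. For the `received 0 items of ancdata` variant, which is open-file-descriptor exhaustion and not a size problem, switch the IPC strategy with `torch.multiprocessing.set_sharing_strategy("file_system")`: it identifies shared blocks by file name instead of passing file descriptors (the Linux default `file_descriptor` strategy), but it still allocates in shared memory, so it does not lift the `/dev/shm` size cap. Tell-tale: `df -h /dev/shm` shows a tiny cap; check it before launch. ([PyTorch multiprocessing shm note](https://docs.pytorch.org/docs/stable/notes/multiprocessing.html), [pytorch#5040](https://github.com/pytorch/pytorch/issues/5040))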
+---
+## Pointers — adjacent mechanics catalogued elsewhere
+- **Dataloader SPEED (num_workers / prefetch / pin-overlap / GPU-starvation)** → `references/training/throughput-profiling.md` (T4–T8), `references/gotchas_universal.md` (U8, U24).
+- **"Runs but won't learn" loop wiring + loss-function + label-form bugs** → `references/training/convergence-debugging.md` (O1 overfit-one-batch first; O14 CrossEntropyLoss target form).
+- **IterableDataset/DDP launch, `set_epoch` hang, SyncBatchNorm, uneven inputs** → `references/training/distributed-launch.md` (D9, D10, D22).
+- **Host-RAM OOM from worker fork-copy of a big startup tensor** → `references/gotchas_universal.md` (U9).
+- **Is the data leaking / is the metric valid / is the split contaminated** → **verifying-dl-experiments** (**REQUIRED** — owns the judgement; this file owns the mechanism).