npm - @tryhamster/gerbil - Versions diffs - 1.0.0-rc.8 → 1.0.0 - Mend

@tryhamster/gerbil 1.0.0-rc.8 → 1.0.0

Files changed (179)

package/LICENSE +1 -1
package/README.md +247 -84
package/dist/architectures-C1I5V3Dt.mjs +6070 -0
package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
package/dist/browser/index.d.ts +264 -588
package/dist/browser/index.d.ts.map +1 -1
package/dist/browser/index.js +585 -2334
package/dist/browser/index.js.map +1 -1
package/dist/cli.mjs +625 -1098
package/dist/cli.mjs.map +1 -1
package/dist/defaults-9komdrbY.mjs +24 -0
package/dist/defaults-9komdrbY.mjs.map +1 -0
package/dist/frameworks/express.d.mts +1 -3
package/dist/frameworks/express.d.mts.map +1 -1
package/dist/frameworks/express.mjs +7 -7
package/dist/frameworks/express.mjs.map +1 -1
package/dist/frameworks/fastify.d.mts +1 -1
package/dist/frameworks/fastify.d.mts.map +1 -1
package/dist/frameworks/fastify.mjs +3 -3
package/dist/frameworks/fastify.mjs.map +1 -1
package/dist/frameworks/hono.d.mts +1 -1
package/dist/frameworks/hono.d.mts.map +1 -1
package/dist/frameworks/hono.mjs +4 -4
package/dist/frameworks/hono.mjs.map +1 -1
package/dist/frameworks/next.d.mts +3 -2
package/dist/frameworks/next.d.mts.map +1 -1
package/dist/frameworks/next.mjs +4 -4
package/dist/frameworks/next.mjs.map +1 -1
package/dist/frameworks/react.d.mts +1 -1
package/dist/frameworks/trpc.d.mts +1 -1
package/dist/frameworks/trpc.d.mts.map +1 -1
package/dist/frameworks/trpc.mjs +4 -4
package/dist/frameworks/trpc.mjs.map +1 -1
package/dist/gerbil-BHrJJIa4.mjs +1656 -0
package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
package/dist/gerbil-BT9fCydo.d.mts +488 -0
package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
package/dist/gerbil-DomNfIr1.mjs +4 -0
package/dist/gpu/hooks.d.mts +520 -0
package/dist/gpu/hooks.d.mts.map +1 -0
package/dist/gpu/hooks.mjs +1188 -0
package/dist/gpu/hooks.mjs.map +1 -0
package/dist/gpu/index.d.mts +2 -0
package/dist/gpu/index.mjs +6 -0
package/dist/gpu-33qCAtHW.mjs +3615 -0
package/dist/gpu-33qCAtHW.mjs.map +1 -0
package/dist/index-Dgmb2kE3.d.mts +245 -0
package/dist/index-Dgmb2kE3.d.mts.map +1 -0
package/dist/index-jEAL2s-A.d.mts +2022 -0
package/dist/index-jEAL2s-A.d.mts.map +1 -0
package/dist/index.d.mts +22 -487
package/dist/index.d.mts.map +1 -1
package/dist/index.mjs +13 -8
package/dist/index.mjs.map +1 -1
package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
package/dist/integrations/ai-sdk.d.mts +75 -6
package/dist/integrations/ai-sdk.d.mts.map +1 -1
package/dist/integrations/ai-sdk.mjs +131 -15
package/dist/integrations/ai-sdk.mjs.map +1 -1
package/dist/integrations/langchain.d.mts +1 -1
package/dist/integrations/langchain.d.mts.map +1 -1
package/dist/integrations/langchain.mjs +5 -5
package/dist/integrations/langchain.mjs.map +1 -1
package/dist/integrations/llamaindex.d.mts +1 -1
package/dist/integrations/llamaindex.d.mts.map +1 -1
package/dist/integrations/llamaindex.mjs +5 -5
package/dist/integrations/llamaindex.mjs.map +1 -1
package/dist/integrations/mcp-client.mjs +3 -3
package/dist/integrations/mcp-client.mjs.map +1 -1
package/dist/integrations/mcp.d.mts +3 -2
package/dist/integrations/mcp.d.mts.map +1 -1
package/dist/integrations/mcp.mjs +5 -5
package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
package/dist/mcp-1DaMsaBc.mjs.map +1 -0
package/dist/memory/index.d.mts +3 -0
package/dist/memory/index.mjs +6 -0
package/dist/memory-D1P7Tmda.mjs +4 -0
package/dist/memory-DVN0MnIG.mjs +132 -0
package/dist/memory-DVN0MnIG.mjs.map +1 -0
package/dist/memory-Dj0J1v88.mjs +294 -0
package/dist/memory-Dj0J1v88.mjs.map +1 -0
package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
package/dist/repl-jV5gcJFA.mjs +9 -0
package/dist/skills/index.d.mts +270 -320
package/dist/skills/index.d.mts.map +1 -1
package/dist/skills/index.mjs +5 -5
package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
package/dist/skills-DX8D59UH.mjs.map +1 -0
package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
package/dist/tools-DQ1mPUw5.mjs.map +1 -0
package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
package/dist/types-D6FiR_oh.d.mts.map +1 -0
package/dist/types-DQBe2lFo.d.mts +165 -0
package/dist/types-DQBe2lFo.d.mts.map +1 -0
package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
package/dist/vector-B0panuy6.mjs +95 -0
package/dist/vector-B0panuy6.mjs.map +1 -0
package/docs/PROJECT-STATE.md +321 -0
package/docs/adding-a-model-family.md +280 -0
package/docs/ai-sdk.md +70 -61
package/docs/architecture/overview.md +17 -7
package/docs/browser.md +203 -8
package/docs/embeddings.md +156 -0
package/docs/gerbil-site-native-migration.md +217 -0
package/docs/gpu-engine/architectures.md +398 -0
package/docs/gpu-engine/ir.md +372 -0
package/docs/gpu-engine/kernels.md +718 -0
package/docs/gpu-engine/paper.html +1759 -0
package/docs/gpu-engine/paper.md +2109 -0
package/docs/gpu-engine/safetensors.md +312 -0
package/docs/gpu-engine/tokenizer.md +302 -0
package/docs/memory-rag.md +91 -0
package/docs/metal-safari-intel.md +190 -0
package/docs/mobile-failure-diagnosis.md +124 -0
package/docs/mobile.md +99 -0
package/docs/observability.md +230 -0
package/docs/onnx-removal-plan.md +339 -0
package/docs/research/autoresearch-portable.md +904 -0
package/docs/research/dispatch-reduction-hivemind.md +84 -0
package/docs/research/ios-safari-model-caching.md +117 -0
package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
package/docs/research/native-stt-model-selection.md +49 -0
package/docs/research/native-tts-model-selection.md +90 -0
package/docs/research/native-vs-chromium-decision.md +152 -0
package/docs/research/nemotron-mamba2-inference.md +910 -0
package/docs/research/qwen35-multimodal.md +293 -0
package/docs/research/qwen36-gemma4-targets.md +337 -0
package/docs/research/sota-embedding-models.md +179 -0
package/docs/research/sota-mobile-models-2026.md +263 -0
package/docs/research/sota-modality-models.md +202 -0
package/docs/research/tps-baselines.md +71 -0
package/docs/research/webgpu-m4-reference.md +104 -0
package/docs/site-update-plan.md +155 -0
package/docs/structured-output.md +123 -0
package/docs/stt.md +63 -446
package/docs/tts.md +77 -499
package/docs/vision.md +100 -338
package/package.json +22 -7
package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
package/dist/gerbil-CJ3ifloF.mjs +0 -4
package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
package/dist/gerbil-qOTe1nl2.d.mts +0 -431
package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
package/dist/kokoro-BNTb6egA.mjs +0 -20210
package/dist/kokoro-BNTb6egA.mjs.map +0 -1
package/dist/kokoro-DFRQ1OeM.js +0 -20212
package/dist/kokoro-DFRQ1OeM.js.map +0 -1
package/dist/mcp-BvbriaBy.mjs.map +0 -1
package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
package/dist/repl-DveXw36T.mjs +0 -9
package/dist/skills-CD3Orlex.mjs.map +0 -1
package/dist/stt-CpLYbGFd.mjs +0 -433
package/dist/stt-CpLYbGFd.mjs.map +0 -1
package/dist/stt-DRPLEEHB.mjs +0 -3
package/dist/stt-Te8Qz-Ay.js +0 -433
package/dist/stt-Te8Qz-Ay.js.map +0 -1
package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
package/dist/transformers.web-DokyH3rP.js +0 -3
package/dist/transformers.web-M6mCnEYJ.js +0 -30382
package/dist/transformers.web-M6mCnEYJ.js.map +0 -1
package/dist/tts-C0xx3CtE.js +0 -724
package/dist/tts-C0xx3CtE.js.map +0 -1
package/dist/tts-DXgsKGCe.mjs +0 -3
package/dist/tts-DeGANMNV.mjs +0 -730
package/dist/tts-DeGANMNV.mjs.map +0 -1
package/dist/types-CiTc7ez3.d.mts.map +0 -1
package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0

package/docs/research/dispatch-reduction-hivemind.md ADDED

@@ -0,0 +1,84 @@
+# Decode Dispatch-Reduction — Hive-Mind Synthesis (June 2026)
+_Generated by a 5-agent research workflow (arXiv + codebase + outside-box) → synthesis, calibrated against scripts/engine/results.jsonl. See docs/gpu-engine/paper.md §20._
+## Research summary
+DISPATCH-OVERHEAD THESIS, RE-CALIBRATED AGAINST THE ENGINE'S OWN MEASURED HISTORY (scripts/engine/results.jsonl, 193 lines). The task framing ("decode is dispatch-overhead bound, ~287 dispatches x 65us") is TRUE for mobile Safari (per-dispatch submit+drain) but FALSE for the primary node-Dawn metric, where the whole decode step is one beginComputePass / one submit and per-dispatch overhead measured ~2-3us. The engine has run this exact experiment family repeatedly and the result is unambiguous:
+PROVEN ON node-DAWN (verbatim lessons): (a) "node-dawn decode is NOT dispatch-count-bound" — r1 ResidualRMSNorm Add+Norm fusion removed ~23 dispatches/token = +0.25% (reverted). (b) RMSNorm-into-MambaSSM removed 18 dispatches + ONE 2048-wide round-trip = +0.3% (reverted, mobile-flagged). (c) ConvStateUpdate-into-CausalConv1dSiLU removed 18 dispatches + a conv_state re-read = +1.0% high-variance (reverted, mobile-flagged). (d) Q/K L2-reduction merge -0.4%. (e) matvec dequant-ALU-factoring +0.09% — "CONCLUSIVE: matvec_int4 is memory-bandwidth-bound, not ALU-bound; the only lever is reducing BYTES MOVED." (f) matvec software-prefetch -1.2% — "matvec is NOT improvable via manual latency-hiding."
+THE REAL node-Dawn WINS were all BYTES-MOVED cuts on WIDE tensors: matvec wg=256/N_TILE=16 A-reuse +4.9%, vec4 INT4 loads, MambaSSM single-pass state fusion (halved state traffic, the big one, +~13%), f16 SSM state +~4%, plus the kept structural fusions that ALSO removed a wide read/write (r2 SiLU-into-conv +1.8%, LFM2 conv post-gate +2.0%, conv pre-gate MulCols +1.9%). The vision leg independently re-confirmed: dispatch+round-trip cuts on NARROW tensors are flat on desktop (b4-r4 MatMul+AddBias -1.0%, r5 SliceRotary -0.4%, r6 residual-Add fold flat — all reverted/kept-as-simplification, all flagged mobile), while the WIDE compute lever won (b4-r6 f16-mix matmul -1.63%).
+DECISIVE RANKING RULE: on node-Dawn a dispatch cut moves the metric ONLY when it ALSO eliminates a WIDE-tensor global round-trip (6144-wide qkvOut, 2048-wide ssm/z, not the 1024-wide narrow activation or N=16 tiny projections). The profiler (test-profile-decode.mjs) puts MatMulInt4 at 67% and MambaSSM at 30% of GPU time, with the engine 5.2x above the 1213 tok/s weight-bandwidth floor — so the residual gap lives in dispatch boundaries (mobile) AND MambaSSM's ~900us/dispatch (desktop). MambaSSM (18 of 24 layers) is therefore the highest-leverage desktop target, and its tail (SSM->per-head-RMSNorm->SiLU-z-gate) is where two stacked WIDE round-trip removals are still on the table.
+WGSL CEILING (cross-angle agreement, arXiv:2605.11581 Ada-MK, Stanford No-Bubbles, Mirage MPK, gpuweb #2233/#3935): true device-wide persistent megakernels need cross-workgroup forward-progress that WGSL explicitly does NOT guarantee — counter spin-loops can hardware-deadlock on Metal/M4. So the only legal "megakernel" is SINGLE-WORKGROUP (intra-workgroup workgroupBarrier only). The Mamba per-head pipeline (one workgroup owns one head's val_dim=128 lanes) fits this exactly. zerotvm.com (Phi-3 Q4, 228 vs WebLLM 342 dispatches) and arXiv:2604.02344 confirm fusion is the lever but provide no desktop-magnitude evidence that contradicts gerbil's own flat results.
+NET: prioritize fusions that COLLAPSE WIDE round-trips inside the 18 Mamba layers (epilogue and prologue megakernels), deprioritize pure dispatch-count cuts (they are mobile-only wins below the 1.5% desktop gate), and DROP the already-tried/reverted standalone fusions (conv-state-into-conv, norm-into-SSM, a/b reorder, down+residual).
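The two-regime arithmetic in the summary above can be checked with a short calculation. This is a minimal sketch that uses only the figures quoted in the summary (~287 dispatches/token, ~65us per dispatch on mobile Safari, ~2-3us on node-Dawn, and the ~23 dispatches removed by r1). The function and constant names are illustrative and are not part of the gerbil codebase.

```typescript
// Per-token dispatch overhead in microseconds: dispatches x per-dispatch cost.
// All figures are the ones quoted in the research summary; names are hypothetical.
function dispatchOverheadUs(dispatchesPerToken: number, perDispatchUs: number): number {
  return dispatchesPerToken * perDispatchUs;
}

const DISPATCHES_PER_TOKEN = 287;

// Mobile Safari: every dispatch pays a submit+drain (~65us each).
const mobileUs = dispatchOverheadUs(DISPATCHES_PER_TOKEN, 65); // 18655us per token

// node-Dawn: one compute pass per token, ~2-3us per dispatch.
const desktopLowUs = dispatchOverheadUs(DISPATCHES_PER_TOKEN, 2); // 574us per token
const desktopHighUs = dispatchOverheadUs(DISPATCHES_PER_TOKEN, 3); // 861us per token

// What a 23-dispatch cut (the size of the r1 fusion) buys in each regime.
const mobileSavedUs = dispatchOverheadUs(23, 65); // 1495us per token
const desktopSavedUs = dispatchOverheadUs(23, 3); // 69us per token
```

On these numbers the same 23-dispatch cut is worth roughly 1.5ms/token on mobile but under 0.1ms/token on desktop, which is consistent with the flat node-Dawn results listed above.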
+## Ranked experiment plan
+### #1. Mamba epilogue single-workgroup megakernel: fold per-head RMSNorm AND SiLU(z)-gate into the MambaSSM kernel (two stacked WIDE round-trip removals)  _(medium effort, medium confidence)_
+**Expected impact:** Removes ~36 dispatches/token (2 per Mamba layer x 18) + TWO 2048-wide round-trips/layer (ssmOut->norm and ssmNormed->gate). node-Dawn: r3 alone (one round-trip) was +0.3%; stacking the second wide round-trip plausibly +1-2% (crosses the 1.5% gate). Mobile Safari: 36 fewer submit+drains/token = strong.
+**Mechanism:** Files: src/gpu/kernels/registry.ts (WGSL_MAMBA_SSM + WGSL_MAMBA_SSM_F16, mambaSSMSpec), src/gpu/architectures/qwen3_5.ts:913-968, src/gpu/ir.ts/executor.ts decode-graph build. The MambaSSM kernel already runs one workgroup per head spanning val_dim=128 lanes and writes ssmOut[T,2048]. Today TWO wide ops follow: per-head RMSNorm (qwen3_5.ts:947, reads/writes 2048-wide) then SwiGLU gate silu(z)*normed (qwen3_5.ts:963, reads ssmNormed+z 2048-wide, writes 2048-wide). Add to mambaSSMSpec two new bindings: norm_weight + z_proj output (zProjOut, already produced at :822), plus eps param. After each head computes its 128 outputs in-lane: (1) workgroup sum-of-squares reduction over the 128 lanes the kernel already owns (reuse existing shared_norm machinery), (2) write y = silu(z[d]) * (ssm_out[d]*inv_rms*norm_w[d]) directly — emitting the GATED 2048-wide output that mamba out_proj consumes. Drop the mamba_norm and mamba_swiglu nodes from the DECODE graph only (keep for prefill). Wire both f32 and f16 SSM-state variants (f16 path is Dawn-only per r8 gating; KEEP rms accumulation in f32). KEY DIFFERENCE FROM PRIOR r3 (which folded ONLY the norm, +0.3% flat): this also removes the SECOND 2048-wide round-trip (the ssmNormed write+reread by SwiGLU), doubling bandwidth saved — the lever that turned r2/r5/r6 from flat into kept wins.
+**Rationale:** Highest-leverage SURVIVOR. Builds directly on the already-VALIDATED-CORRECT r3 norm fold (lowers correctness risk) and adds exactly the missing ingredient the lessons demand: a second WIDE (2048) round-trip removal. MambaSSM is 30% of desktop GPU time and 18 of 24 layers, so its tail is the right target. Single-workgroup-per-head is the only WGSL-legal megakernel form (per cross-angle WGSL forward-progress finding).
+**Risks:** If even two stacked 2048-wide round-trips are too small at T=1 to cross 1.5% (the gate that flattened r3 and the vision narrow-tensor folds), it lands as a mobile-only win + desktop simplification. Must keep f32 rms accumulation (ollama #15865: bf16 state corrupts GatedDeltaNet). Per-head reduction over 128 lanes adds 1 barrier round — keep cheap. z_proj must be bound to BOTH variants + executor bind-group wiring updated. Coherence-validate with test-q4-generate.mjs (merged-cos gate).
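A CPU reference can help with the coherence validation mentioned in the risks for #1. The sketch below compares the current two-pass tail (per-head RMSNorm, then the SiLU(z) gate) against the fused formula `y = silu(z[d]) * (ssm_out[d]*inv_rms*norm_w[d])`. It is an illustration only: the function names are hypothetical, and it assumes the norm uses mean-of-squares plus `eps`, that buffers are head-major with `valDim` contiguous lanes per head, and that the norm weight has `valDim` entries shared across heads.

```typescript
// SiLU activation: x * sigmoid(x).
function silu(x: number): number {
  return x / (1 + Math.exp(-x));
}

// Inverse RMS over one head's lanes. JS numbers are f64, which stands in for
// the "keep rms accumulation in f32" requirement (never lower precision).
function headInvRms(x: Float32Array, base: number, valDim: number, eps: number): number {
  let sumSq = 0;
  for (let d = 0; d < valDim; d++) sumSq += x[base + d] * x[base + d];
  return 1 / Math.sqrt(sumSq / valDim + eps);
}

// Two-pass path: materializes the normed intermediate, then gates it.
function unfusedEpilogue(
  ssmOut: Float32Array, z: Float32Array, normW: Float32Array,
  numHeads: number, valDim: number, eps: number,
): Float32Array {
  const normed = new Float32Array(numHeads * valDim); // the extra wide round-trip
  for (let h = 0; h < numHeads; h++) {
    const base = h * valDim;
    const invRms = headInvRms(ssmOut, base, valDim, eps);
    for (let d = 0; d < valDim; d++) normed[base + d] = ssmOut[base + d] * invRms * normW[d];
  }
  const y = new Float32Array(numHeads * valDim);
  for (let i = 0; i < y.length; i++) y[i] = silu(z[i]) * normed[i];
  return y;
}

// Fused path: one loop nest per head, no intermediate buffer.
function fusedEpilogue(
  ssmOut: Float32Array, z: Float32Array, normW: Float32Array,
  numHeads: number, valDim: number, eps: number,
): Float32Array {
  const y = new Float32Array(numHeads * valDim);
  for (let h = 0; h < numHeads; h++) {
    const base = h * valDim;
    const invRms = headInvRms(ssmOut, base, valDim, eps);
    for (let d = 0; d < valDim; d++) {
      const i = base + d;
      y[i] = silu(z[i]) * (ssmOut[i] * invRms * normW[d]);
    }
  }
  return y;
}
```

The two paths should agree to within f32 rounding of the intermediate, so a small absolute tolerance is the appropriate comparison rather than bit-exact equality.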
+### #2. Mamba prologue megakernel: fuse qkv_proj INT4 matvec + CausalConv1dSiLU into one kernel, never materializing the 6144-wide qkvOut  _(hard effort, medium confidence)_
+**Expected impact:** Removes ~18-36 dispatches/token (conv, optionally conv_state, x18) + the 6144-wide qkvOut write + re-read per Mamba layer (the single widest activation round-trip in the decode graph). node-Dawn: this is the r2/r5 winning pattern at MAX width — plausibly +1-3%. Mobile: 18-36 fewer round-trips.
+**Mechanism:** Files: registry.ts (new ConvInProjInt4SiLU kernel cloning the proven vec4 INT4 matvec dequant + the CausalConv1dSiLU body at ~:6186), qwen3_5.ts:795-867. Today: qkv_proj (MatMulInt4, N=6144) writes the 6144-wide qkvOut to global, then CausalConv1dSiLU RE-READS qkvOut (+conv_state) to produce qkvConv. Depthwise conv is per-channel, so one workgroup can own a contiguous channel slab: compute that slab's INT4-projected QKV for the single new token IN REGISTERS, immediately apply the kernel-size-4 depthwise conv with the 3 conv_state taps + SiLU, write qkvConv — qkvOut is NEVER written to global memory. Removes the 6144-wide qkvOut write + its re-read. Decode-only (T=1); prefill keeps the split path. IMPORTANT: do NOT also fold ConvStateUpdate here as the primary justification — that exact conv-state roll was tried alone (r4, +1.0%, reverted); the WIN here is eliminating the 6144-wide qkvOut materialization (a far larger round-trip than r4 touched). Optionally roll conv_state in the same dispatch as a free rider (per-channel independent, no cross-thread hazard at T=1).
+**Rationale:** Directly removes the widest activation round-trip (6144) in the Mamba path — exactly the lever the lessons identify as the only desktop mover. Distinct from the already-reverted r4 (which only touched conv_state, leaving qkvOut materialized). qkv_proj is part of the 67% MatMulInt4 bucket so feeding conv from registers attacks the #1 hotspot's output traffic.
+**Risks:** Folding the INT4 matvec into per-channel-slab workgroups may underutilize vs the tuned standalone matvec (wg=256/N_TILE=16, the +4.9% A-reuse config) — the 6144-wide matvec is the 48-67% hotspot and stuffing it into conv-shaped workgroups could regress compute even as it saves the round-trip. Mitigation: keep the matvec's column-parallel tiling, write to shared/registers, then conv-reduce. Conv-state roll decode-only (T>1 race per r4). Pooled-aliasing: confirm no other consumer reads qkvOut (only conv + state-update do). High correctness risk — a conv bug corrupts all subsequent tokens; validate bit-exact.
+### #3. Skip SliceLastRow on the decode (T=1) path — lm_head reads final_norm_out row 0 directly  _(easy effort, high confidence)_
+**Expected impact:** Removes 1 dispatch/token. node-Dawn: ~0.3% (noise, one narrow dispatch). Mobile: 1 fewer submit+drain/token. Value is as a zero-risk warm-up edit + simplification + small mobile win.
+**Mechanism:** Files: qwen3_5.ts:1037 (SliceLastRow node), executor.ts decode-entry build (around the fuse* passes at :499-517 / initBindGroups). At decode T=1, final_norm_out is already exactly [1,hidden], so SliceLastRow is a 1-row identity copy. In the decode path, rebind lm_head's activation input directly to final_norm_out and drop the SliceLastRow decodeEntry; keep SliceLastRow for prefill (T>1). Pure removal, no new kernel.
+**Rationale:** The cheapest, safest item — good first round to validate the harness loop. It will NOT pass the 1.5% desktop gate (narrow, single dispatch, consistent with every prior narrow dispatch-cut result) but is a correct simplification and a genuine mobile dispatch-count reduction. Note: SliceLastRow is correctness-critical (it was lost+reconstructed in r10-recovery) so the buffer aliasing must be exact.
+**Risks:** final_norm_out may be a pooled activation buffer; confirm lm_head reading it directly is hazard-safe (separate dispatches are hazard-synchronized — fine) and that logits does not share final_norm_out's pooled slot. Apply ONLY to decode entries; leave prefill SliceLastRow intact. Validate logits row mapping unchanged with test-q4-generate.mjs.
+### #4. Heterogeneous-N quad/dual matvec for the Mamba prologue: fuse qkv_proj(N=6144)+z_proj(N=2048) into one kernel reading norm1Out once  _(hard effort, low confidence)_
+**Expected impact:** Removes ~18-36 dispatches/token (qkv+z dual: 18; +a/b into it if bindings allow: up to 54) + 1-3 redundant reads of norm1Out per layer. BUT norm1Out is the NARROW (1024 f32 = 4KB) activation, not the wide weight stream — per the proven matvec-is-weight-bandwidth-bound lesson, the saved activation re-read is small.
+**Mechanism:** Files: executor.ts:1814 fuseDualMatVecDecodeEntries (generalize), registry.ts (new HeteroMatVecInt4 spec allowing per-output N). Today fuseDualMatVecDecodeEntries requires EQUAL N (executor.ts:1830) so qkv(6144) and z(2048) stay separate; a+b (both N=16) already auto-fuse. Add a spec + pass that fuses same-input/same-K/same-group_size matvecs with DIFFERENT N by partitioning the workgroup grid over the concatenated N-space (an N-offset table in the uniform routes each output column to its matrix). Reads the shared norm1Out vector from L1 once across both. Numerically identical to separate matvecs (WebKit-safe). Gate on ctx.limits.maxStorageBuffersPerShaderStage (qkv+z = 6 weight buffers + input + 2 outputs + uniform = 10, within the >=9 path already used; a true quad with a/b would be 14+, likely over-limit — fall back to qkv+z dual).
+**Rationale:** Removes real dispatches but the round-trip it eliminates is the NARROW shared activation, not a wide weight/activation tensor — exactly the case that came in flat for every prior narrow dispatch cut (r1, r2-dequant, vision b4-r4/r5/r6). Listed for completeness and as a mobile win, but expect desktop-flat. Lower than ranks 1-2 which remove WIDE (2048/6144) round-trips.
+**Risks:** Binding-count limit is the hard blocker (gate + fall back to qkv+z dual). Heterogeneous N partitioning wastes lanes on the tiny a/b (N=16) sub-matrices. The tuned wg=256/N_TILE=16 A-reuse config must be preserved per sub-matrix or it could regress the 67% matvec hotspot. Pooled-aliasing guard (input vs all outputs distinct). Bit-exact validation.
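The N-offset table described in the mechanism for #4 amounts to a prefix sum over the per-matrix output widths, plus a lookup that maps a global output column to its matrix. The following is a sketch of that routing on the CPU, assuming the qkv (N=6144) and z (N=2048) widths from the text. `buildOffsets` and `routeColumn` are hypothetical names, not functions in executor.ts or registry.ts.

```typescript
// Prefix-sum offsets over the concatenated N-space.
// For qkv (6144) + z (2048) this yields [0, 6144, 8192].
function buildOffsets(ns: number[]): number[] {
  const offsets = [0];
  for (const n of ns) offsets.push(offsets[offsets.length - 1] + n);
  return offsets;
}

// Map a global output column to (matrix index, column within that matrix).
// A linear scan is enough here because only 2-4 matrices are ever fused.
function routeColumn(offsets: number[], col: number): { matrix: number; localCol: number } {
  const total = offsets[offsets.length - 1];
  if (col < 0 || col >= total) throw new RangeError(`column ${col} outside [0, ${total})`);
  let m = 0;
  while (col >= offsets[m + 1]) m++;
  return { matrix: m, localCol: col - offsets[m] };
}
```

For example, with `buildOffsets([6144, 2048])`, column 6143 routes to the last qkv column and column 6144 routes to column 0 of z.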
+### #5. Multi-token GPU-resident decode burst: encode K chained greedy steps into ONE command encoder/submit (mobile-targeted)  _(medium effort, medium confidence)_
+**Expected impact:** 0 dispatch-COUNT reduction, but K-fold fewer submit+drain round-trips. node-Dawn: 1-3% (already GPU-bound, single submit/token). Mobile Safari/iOS: potentially 10-30% (each submit currently drains — this is the primary mobile bottleneck the per-dispatch-submit+drain commits address).
+**Mechanism:** Files: executor.ts:1096-1166 submitGreedyDecodeStep. It already chains tokens GPU-side (argmaxResult->inputIds copyBufferToBuffer at :1147, single submit/token). Encode K consecutive decode steps in ONE encoder: after each step's argmax, copy argmax->inputIds and immediately begin the next step's compute pass in the same encoder. Bake K distinct per-step uniform buffers (seqPos/KV-offset differ per step; there is no writeBuffer interleave point inside one encoder) or an in-shader step counter. Greedy-only; EOS lags by K (over-generate and trim).
+**Rationale:** The ONLY proposal that attacks the actual mobile bottleneck (submit+drain) at the submit level rather than per-kernel. Marginal on the desktop metric (correctly), but the task names mobile Safari as the secondary goal and this is the highest-leverage mobile lever that does not require per-kernel WGSL surgery. Distinct from the existing 2-deep pipeline (which overlaps CPU prep, not submits).
+**Risks:** Greedy-only (no temperature/top-p without CPU logits readback breaking the chain). EOS/stop-string delayed by K -> trim overrun. Per-step uniform management inside one encoder is the real complexity (no writeBuffer interleave). Validate coherence + that EOS trimming is correct. Desktop gain likely below 1.5% gate — frame as mobile leg.
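Two pieces of CPU-side bookkeeping for #5 are easy to get wrong: the per-step sequence positions baked into the K uniform buffers, and trimming the overrun when EOS appears mid-burst. The sketch below shows one way to do both. The names are hypothetical, and it assumes the EOS token itself is dropped from the returned text, which the source does not specify.

```typescript
// Sequence positions for K chained decode steps starting at `startPos`.
// Each value would be written into that step's own uniform buffer before encoding.
function burstPositions(startPos: number, k: number): number[] {
  return Array.from({ length: k }, (_, i) => startPos + i);
}

// After reading back a burst of K greedy tokens, drop everything from the
// first EOS onward. `done` tells the caller to stop scheduling further bursts.
function trimBurst(burst: number[], eosId: number): { tokens: number[]; done: boolean } {
  const eosAt = burst.indexOf(eosId);
  if (eosAt === -1) return { tokens: burst.slice(), done: false };
  return { tokens: burst.slice(0, eosAt), done: true };
}
```

With K=5 and EOS produced at the third step, `trimBurst` keeps the first two tokens and discards the two over-generated ones; any KV-cache entries written for the discarded steps would also need to be ignored or rewound by the caller.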
+## Paper notes
+GERBIL DECODE DISPATCH-REDUCTION — RESEARCH DOC NOTES (synthesis of 5 angles + verification against results.jsonl, qwen3_5.ts, executor.ts, registry.ts).
+CORE THESIS (two-regime model of dispatch overhead):
+- MOBILE Safari/WebKit: per-dispatch submit+drain (webkitGroupSize=1) makes EVERY dispatch ~32-71us (arXiv:2604.02344). Decode at ~287 dispatches/token IS dispatch-bound here. Any dispatch cut helps. This is the regime the task framing describes.
+- DESKTOP node-Dawn (the PRIMARY metric, test-benchmark.mjs): the whole decode step is ONE beginComputePass / ONE submit; measured per-dispatch overhead is ~2-3us (results.jsonl r1). Decode is NOT dispatch-count-bound. The metric moves ONLY on BYTES-MOVED reductions over WIDE tensors.
+EMPIRICAL CALIBRATION (gerbil's own measured history — this is the load-bearing evidence, not external literature):
+- DISPATCH-COUNT CUTS THAT WERE FLAT/REVERTED on Dawn: ResidualRMSNorm Add+Norm fuse -23 dispatches = +0.25%; RMSNorm-into-MambaSSM -18 dispatches +1 round-trip = +0.3%; ConvStateUpdate-into-conv -18 dispatches = +1.0% (high variance); Q/K L2 merge -0.4%; mamba-swiglu vec4 -0.1%; vision MatMul+AddBias -1.0%, SliceRotary -0.4%, residual-Add fold flat. ALL below the 1.5% gate; ALL flagged as mobile-positive.
+- BYTES-MOVED CUTS THAT WON on Dawn: matvec wg=256/N_TILE=16 A-reuse +4.9%; vec4 INT4 loads; MambaSSM single-pass state fusion (halved state traffic — the biggest single win, ~+13%); f16 SSM state +~4%; SiLU-into-conv +1.8%; LFM2 conv post-gate C*conv +2.0%; conv pre-gate MulCols +1.9%; vision f16-mix matmul -1.63% encode.
+- matvec_int4 is MEMORY-BANDWIDTH-bound, not ALU-bound (dequant-factoring +0.09%) and NOT latency-improvable (prefetch -1.2%). Engine is 5.2x above the 1213 tok/s weight-bandwidth floor (profiler: MatMulInt4 67%, MambaSSM 30%) — the gap is dispatch boundaries (mobile) + MambaSSM ~900us/dispatch (desktop).
+DESIGN RULE: on Dawn, a dispatch cut moves the metric only if it ALSO removes a WIDE-tensor global round-trip (6144-wide qkvOut, 2048-wide ssm/z; NOT the 1024-wide shared activation, NOT N=16 tiny projections). This single rule explains every kept-vs-reverted result.
+WGSL MEGAKERNEL CEILING (Ada-MK arXiv:2605.11581; Stanford No-Bubbles; Mirage MPK; gpuweb #2233/#3935): device-wide persistent megakernels (3000+ tok/s via cross-SM producer-consumer counters) require cross-workgroup forward-progress that WGSL does NOT guarantee — spin-loops can hardware-deadlock on Metal/M4. OFF THE TABLE. The only legal WGSL megakernel is SINGLE-WORKGROUP (intra-workgroup workgroupBarrier). The Mamba per-head pipeline (one workgroup owns one head's val_dim=128) is the natural fit and the only place a layer-tail megakernel is WGSL-safe.
+DIRECT WEBGPU COMPARABLE: zerotvm.com (Phi-3 Mini Q4, all-WGSL) = 228 dispatches/token vs WebLLM 342, via per-layer projection+RoPE+KV-write fusion, Add+RMSNorm fusion, FFN gate+up+SiLU fusion, online-softmax attention. Gerbil already has the FFN, Add+RMSNorm, and online-softmax pieces; the un-done analog is the projection+conv/RoPE+cache-write fusion. LlamaWeb (arXiv:2605.20706) explicitly does NOT fuse kernels (names it future work) — nobody in WebGPU land has done aggressive decode fusion, so gerbil is at the frontier here. WeInfer (openreview Qu2itILaoZ) = buffer reuse + compute-pass grouping, which gerbil already does.
+ALREADY EXHAUSTED / DO NOT RE-RUN: config sweeps (wg/N_TILE/K_THREADS), subgroups matvec reduction (gated off, large regression on M-series), ConvStateUpdate-into-conv standalone (r4 reverted), RMSNorm-into-MambaSSM standalone (r3 reverted), a_proj+b_proj are ALREADY auto-fused (adjacent, equal N=16), down_proj+residual Add (reverted; +norm fold infeasible — RMSNorm needs full-row reduction after K-parallel matvec, no intra-dispatch grid sync), matvec prefetch/dequant-ALU (bandwidth-bound), command-buffer reuse (already single-submit/token).
+FORWARD PLAN: the two desktop-credible bets both COLLAPSE WIDE round-trips inside the 18 Mamba layers — (1) epilogue megakernel folding per-head-RMSNorm + SiLU(z)-gate into MambaSSM (stacks the SECOND 2048-wide round-trip removal on top of the already-validated r3 norm fold — the missing ingredient that may cross 1.5%); (2) prologue megakernel computing qkv_proj INT4 in registers and feeding conv+SiLU directly, never materializing the 6144-wide qkvOut (the widest activation round-trip in the graph, distinct from the reverted conv-state-only r4). Everything narrower (SliceLastRow skip, heterogeneous quad-matvec on the narrow shared activation, multi-token submit burst) is a mobile-only win below the desktop gate and should be landed as simplifications / for the Safari leg, not expected to move node-Dawn.

package/docs/research/ios-safari-model-caching.md ADDED

@@ -0,0 +1,117 @@
+# Persistently caching a ~400MB model in iOS/iPadOS Safari (Safari 26 / iOS 26.x)
+**Date:** 2026-06-13
+**Target:** iPadOS 26.5, Safari, ~400MB model (Qwen3.5-0.8B INT4)
+**Goal:** Stop the model re-downloading (~60-160s) on every page load.
+Source tags: `[docs/spec]` `[lib/issue]` `[blog/SO]` `[our-probe]`
+---
+## TL;DR verdict
+- **Does switching Cache API -> IndexedDB help? NO.** On iOS Safari they are the *same* best-effort storage pool, evicted together on an origin basis under the *same* WebKit quota + 7-day-inactivity policy. Switching is pointless for durability. `[docs/spec]` `[blog/SO]`
+- **Is persistence achievable? YES, but effectively only as a Home-Screen Web App (PWA).** A plain Safari tab *can* in theory have `navigator.storage.persist()` resolve to `true`, but WebKit's only documented positive heuristic is "opened as a Home Screen Web App." Your probe already confirms a plain tab gets `false`. Add-to-Home-Screen is the only reliable path to a persistence grant. `[docs/spec]` `[our-probe]`
+- **The actual reason the cache never survives on YOUR device is quota pressure, not reload-eviction.** Best-effort data DOES survive a reload in normal conditions. But your probe shows quota ≈ 1049 MB with 444 MB already consumed by foreign data, leaving only ~605 MB. A 400 MB write can land you near the cap; under storage pressure WebKit evicts whole origins LRU. The ~1GB quota is small because Safari 17+ sets the origin quota to ~60% of **total disk space** and this iPad is nearly full. `[docs/spec]` `[our-probe]`
+**Single recommended strategy:** Ship an **Add-to-Home-Screen PWA** that, on first run inside the standalone web app, calls `navigator.storage.persist()` (now likely granted) and writes the 400MB model **in a Worker via OPFS `createSyncAccessHandle()` in small chunks** (not main-thread `createWritable`, which throws on your device). Before writing, free the foreign 444MB. As a plain Safari tab, treat re-download as unavoidable and minimize its cost instead (smaller/streamed model, HTTP cache, fast CDN).
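The chunked OPFS write in the recommended strategy can be sketched as a small synchronous loop over a sync access handle. `SyncHandleLike` below is a hypothetical minimal interface that mirrors the `write(buffer, { at })`, `flush()` and `close()` methods of `FileSystemSyncAccessHandle`, so the loop can be exercised outside a browser. The function name and the chunk size are assumptions.

```typescript
// Minimal shape of the sync access handle methods this sketch relies on.
interface SyncHandleLike {
  write(buffer: Uint8Array, options?: { at?: number }): number; // returns bytes written
  flush(): void;
  close(): void;
}

// Write `data` in chunks of at most `chunkBytes`, advancing by the number of
// bytes the handle reports as written. Always closes the handle.
function writeInChunks(handle: SyncHandleLike, data: Uint8Array, chunkBytes: number): number {
  if (chunkBytes <= 0) throw new RangeError("chunkBytes must be positive");
  let offset = 0;
  try {
    while (offset < data.byteLength) {
      const end = Math.min(offset + chunkBytes, data.byteLength);
      const written = handle.write(data.subarray(offset, end), { at: offset });
      if (written <= 0) throw new Error(`no progress writing at byte ${offset}`);
      offset += written;
    }
    handle.flush(); // push buffered bytes to disk before releasing the lock
  } finally {
    handle.close();
  }
  return offset;
}

// Inside a dedicated Worker the real handle would be obtained roughly like this
// ("model.bin" is a placeholder file name):
//   const root = await navigator.storage.getDirectory();
//   const file = await root.getFileHandle("model.bin", { create: true });
//   const handle = await file.createSyncAccessHandle();
//   writeInChunks(handle, bytes, 4 * 1024 * 1024);
```

For a ~400MB model the bytes would more likely arrive as a stream, in which case the same loop body applies per network chunk with a running offset rather than to one large buffer.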
+---
+## Q1. Without a persistence grant, what survives a reload on iOS Safari?
+**A plain reload does NOT evict best-effort data.** WebKit's storage policy is explicit: eviction is triggered by (a) exceeding the overall quota, (b) system storage pressure, or (c) ~7 days of no user interaction (ITP). A page reload is not in that list. Cache API / IndexedDB / OPFS all survive reloads and same-session navigation under normal conditions. `[docs/spec]`
+- WebKit: *"By default, all origins use a best-effort mode... their data can be evicted."* Eviction happens *"when exceeding the overall quota, when the system is under storage pressure, or when the site has not been interacted with by the user for some time."* Origins *"might be excluded from eviction if it has active page at the time of eviction, or its storage is in persistent mode."* Eviction is *origin-wide* and ordered **LRU**. `[docs/spec]` (https://webkit.org/blog/14403/updates-to-storage-policy/)
+- The **7-day cap** (ITP) wipes *all* script-writable storage (IndexedDB, Cache API, service workers, OPFS, JS cookies) if the origin gets no user interaction for 7 days of browser use. Same-session reloads do not trigger it. `[docs/spec]` (web.dev "Storage for the web"; MDN "Storage quotas and eviction criteria")
+**Conclusion for our case:** Reload is *not* why our cache dies. Because the probe shows usage already at 444/1049 MB, a 400MB write pushes the origin toward the cap, and WebKit then evicts under storage pressure — including across reloads. This is **quota-pressure eviction**, not reload-eviction. So caching *can* work in principle, but only if we (1) free the quota and (2) ideally get persistence so the origin is excluded from eviction.
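The quota arithmetic behind this conclusion is short enough to write down. The sketch below takes an object shaped like the result of `navigator.storage.estimate()` and reports the headroom before and after a planned write, using the probe values from this document (quota 1049 MB, usage 444 MB, 400 MB model). `planWrite` and `EstimateLike` are hypothetical names, and the units are whatever the caller passes in (MB here).

```typescript
// Same optional fields as the StorageEstimate returned by navigator.storage.estimate().
interface EstimateLike {
  quota?: number;
  usage?: number;
}

// Headroom before and after writing `writeSize` (same unit as quota/usage).
function planWrite(est: EstimateLike, writeSize: number) {
  const quota = est.quota ?? 0;
  const usage = est.usage ?? 0;
  const headroom = quota - usage;
  return {
    headroom,
    remainingAfterWrite: headroom - writeSize,
    fits: writeSize <= headroom,
    fractionUsedAfterWrite: quota > 0 ? (usage + writeSize) / quota : 1,
  };
}

// Probe values from this document, in MB.
const probePlan = planWrite({ quota: 1049, usage: 444 }, 400);
// headroom 605, remainingAfterWrite 205, fits true, about 80% of quota used afterwards
```

The write fits, but it leaves the origin at roughly 80% of its quota, which is why freeing the foreign 444 MB first matters.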
+---
+## Q2. Can a plain HTTPS page (LAN IP, not PWA, not bookmarked) get `persist()` granted?
+**In theory yes; in practice, treat it as no for a plain tab. The reliable path is Add-to-Home-Screen (PWA).**
+- WebKit documents that persistent mode is *"granted based on heuristics like whether the website is opened as a Home Screen Web App."* That is the **only** explicit positive heuristic Apple publishes. There is **no user prompt** on Safari (unlike Firefox), and no JS-only way to force a grant. `[docs/spec]` (https://webkit.org/blog/14403/updates-to-storage-policy/)
+- Our probe: `navigator.storage.persist()` returns **false**, `persisted()` false — confirming a plain tab on this device is not granted. `[our-probe]`
+- Bookmarks are **not** a documented heuristic for Safari (they are documented for Chrome only). User engagement is *likely* a secondary signal by analogy to Chromium, but that is undocumented and unreliable for Safari. `[blog/SO]` (web.dev "Persistent storage"; raymondcamden.com Storage API)
+**Once granted, persistent storage is excluded from both LRU/quota eviction and the 7-day ITP purge** `[docs/spec]`. So a PWA that gets the grant is the durable configuration.
+> Caveat: some 2026 PWA field reports claim iOS still aggressively evicts even installed PWAs and gives no hard persistence guarantee. So even as a PWA, design defensively (integrity-check on load, re-fetch on corruption). `[blog/SO]` (magicbell.com PWA iOS limitations 2026)
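+A minimal sketch of the request-then-verify pattern described above. `StorageManagerLike` is a structural subset of the browser `StorageManager` (in a page you would pass `navigator.storage`); the helper name `ensurePersistence` and its status strings are illustrative, not part of any library.

```typescript
// Structural subset of StorageManager, so the sketch also runs outside a browser.
interface StorageManagerLike {
  persist(): Promise<boolean>;
  persisted(): Promise<boolean>;
}

type PersistenceStatus = "already-persistent" | "granted" | "best-effort";

// Hypothetical helper: request persistent mode, then confirm it with
// persisted() instead of trusting the persist() return value alone.
async function ensurePersistence(
  storage: StorageManagerLike,
): Promise<PersistenceStatus> {
  if (await storage.persisted()) return "already-persistent";
  const granted = await storage.persist();
  const confirmed = granted && (await storage.persisted());
  // "best-effort" means the caller should keep the defensive path:
  // integrity-check the cached model on load and re-fetch if needed.
  return confirmed ? "granted" : "best-effort";
}
```

+On the probed plain tab this would resolve to `"best-effort"`; per the WebKit heuristic quoted above, the standalone Home Screen app is the configuration where `"granted"` is plausible.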
+---
+## Q3. What do production browser-LLM libraries use, and does it work on iOS?
+- **transformers.js / @huggingface/transformers**: caches model files via the **Cache API**, per-domain — intended as a one-time download. `[lib/issue]` (HF docs; xenova/transformers.js)
+- On iOS Safari there is no single GitHub issue titled "re-downloads every load," but the *symptoms* are well documented:
+  - `QuotaExceededError` / *"Unable to add response to browser cache: QuotaExceededError"* on iOS when the model exceeds the tight origin quota — a failed cache write means the next load re-downloads. `[lib/issue]` (transformers.js issues; SitePoint "Optimizing Transformers.js for Production")
+  - v3 crashes on iOS/macOS from growing memory usage (#1242), Whisper demo broken on iOS (#1298), Safari unexpected restarts (#973). `[lib/issue]`
+- **web-llm/MLC, wllama, whisper-web**: same class of iOS problems (memory limits, WebGPU/WASM gaps, eviction). No maintainer documents a clean working "survives on iOS Safari" cache path for multi-hundred-MB models. `[lib/issue]`
+**Conclusion:** Re-download-on-iOS is a *real, widely-hit* practical limitation for large models, driven by (a) the small iOS origin quota and (b) lack of persistence — **not** a fundamental "Cache API is ignored" bug. Libraries that "work" do so on devices with ample free disk (large quota) and benefit from the model staying under quota.
+### Q3b. The ~1GB quota and the foreign 444MB
+- **Safari 17+ sets the origin quota to ~60% of total disk space** (overall cap ~80%). It is **per-origin** and **shared** across Cache API + IndexedDB + OPFS (one pool, evicted together). `[docs/spec]` (WebKit storage policy; the old fixed 500MB/1GB IndexedDB cap was removed in Safari 17)
+- The probe's quota ≈ **1049 MB is small because the iPad has little free disk** (~1.75GB free × 60% ≈ 1.05GB). `navigator.storage.estimate()` quota tracks free space, so it shrinks as the device fills. `[docs/spec]` `[our-probe]`
+- The **444MB foreign usage** (likely earlier transformers.js caches) directly steals from the same per-origin pool, leaving only ~605MB. A 400MB write fits *numerically* but leaves the origin near the cap, inviting storage-pressure eviction.
+**Yes — clearing the foreign 444MB is necessary** to make a 400MB write land with headroom and survive. Do it via `caches.keys()` -> delete, IndexedDB `deleteDatabase`, and OPFS `removeEntry`, or have the user clear website data. The real fix for the small quota itself is **free up device disk space** and/or **get persistence (PWA)**.
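+The headroom arithmetic and the Cache API cleanup can be sketched as follows. The numbers are the probe values quoted above; `planWrite`, `clearCaches` and the `isForeign` predicate are hypothetical names, `CacheStorageLike` is a structural subset of the browser `caches` object, and treating all 444MB as foreign is an assumption.

```typescript
interface EstimateLike { usage?: number; quota?: number }
interface CacheStorageLike {
  keys(): Promise<string[]>;
  delete(cacheName: string): Promise<boolean>;
}

const MB = 1_000_000;

// Pure arithmetic over a navigator.storage.estimate()-shaped result.
function planWrite(estimate: EstimateLike, writeBytes: number) {
  const usage = estimate.usage ?? 0;
  const quota = estimate.quota ?? 0;
  const freeBefore = quota - usage;
  const freeAfter = freeBefore - writeBytes;
  const fillAfter = quota > 0 ? (usage + writeBytes) / quota : 1;
  return { freeBefore, freeAfter, fits: freeAfter >= 0, fillAfter };
}

// Delete only the caches the caller classifies as foreign (assumed predicate).
async function clearCaches(
  caches: CacheStorageLike,
  isForeign: (cacheName: string) => boolean,
): Promise<string[]> {
  const removed: string[] = [];
  for (const cacheName of await caches.keys()) {
    if (isForeign(cacheName) && (await caches.delete(cacheName))) {
      removed.push(cacheName);
    }
  }
  return removed;
}

// Probe values: the write fits, but leaves the origin about 80% full.
const probePlan = planWrite({ usage: 444 * MB, quota: 1049 * MB }, 400 * MB);
// Same write after the foreign data is gone: about 38% full.
const clearedPlan = planWrite({ usage: 0, quota: 1049 * MB }, 400 * MB);
```

+IndexedDB (`deleteDatabase`) and OPFS (`removeEntry`) would need the same treatment, since all three share the one per-origin pool.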
+---
+## Q4. Is IndexedDB more durable than Cache API on iOS? (Is the switch worth it?)
+**No. They are equally evictable.** `[docs/spec]` `[blog/SO]`
+- WebKit evicts **per origin** — *"the data of an origin will be deleted as a whole."* If an origin used both IndexedDB and Cache API, **both are deleted together**. `[docs/spec]`
+- The **7-day ITP cap covers IndexedDB AND Cache API AND OPFS** identically. `[docs/spec]` (web.dev; MDN)
+- The only real differences are API shape (Cache API = Response objects; IndexedDB = structured/blobs; OPFS = random-access files) and performance — **not durability**. `[blog/SO]`
+**Switching Cache API -> IndexedDB for durability is pointless on iOS Safari.** (IndexedDB is only marginally preferable if you need to store before a Service Worker is active, or want explicit blob handling — not for survival.)
+---
+## Q5. Pragmatic recommendation — what actually works
+Evaluated against the constraints (no main-thread OPFS write, no persistence grant on a tab, ~1GB shared quota, 444MB foreign usage):
+| Option | Verdict on iOS Safari today |
+|---|---|
+| (a) Cache API + clear foreign 444MB | **Helps but insufficient alone.** Clearing frees headroom so the 400MB write succeeds, but without persistence it's still best-effort and evictable under pressure / after 7 days idle. Necessary, not sufficient. |
+| (b) IndexedDB instead of Cache API | **Pointless for durability** (Q4). Same eviction. Skip the migration. |
+| (c) OPFS via Worker `createSyncAccessHandle()` | **Correct write mechanism** for this device (main-thread `createWritable` throws OOM here `[our-probe]`). `createSyncAccessHandle` is Worker-only by design and is the WebKit-blessed path for large writes — write in 1-4MB chunks, never buffer 400MB. BUT OPFS is the *same best-effort tier* — it does not by itself make data survive without persistence. `[docs/spec]` `[blog/SO]` |
+| (d) Add-to-Home-Screen / PWA for persistence | **The only reliable durability lever.** Standalone Web App is WebKit's documented heuristic for granting `persist()`, which then excludes the origin from LRU + 7-day eviction. `[docs/spec]` |
+| (e) Accept re-download on a plain tab | **Realistic fallback.** Without a PWA, expect re-download eventually; optimize its cost instead. |
+### Recommended strategy (single path)
+**Ship it as an Add-to-Home-Screen PWA and combine (a)+(c)+(d):**
+1. **Add-to-Home-Screen / PWA** (manifest + standalone). This is the only configuration that reliably gets `navigator.storage.persist()`.
+2. On first run **inside the standalone app**, call `await navigator.storage.persist()` and verify `await navigator.storage.persisted()`. If granted, the origin is excluded from eviction (LRU + 7-day). `[docs/spec]`
+3. **Free the foreign 444MB first** (`caches` delete, `indexedDB.deleteDatabase`, OPFS `removeEntry`) so the 400MB write has headroom under the ~1GB pool. Also recommend the user free device disk to grow the quota (quota ≈ 60% of free disk).
+4. **Write the model in a Worker via OPFS `createSyncAccessHandle()` in 1-4MB chunks**, streaming from the network — never hold 400MB in memory. This sidesteps the main-thread `createWritable` OOM you observed. `[our-probe]`
+5. **Integrity-check on load** (length/checksum); if missing or truncated, re-fetch. Treat persistence as best-effort even when granted.
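+Steps 4 and 5 can be sketched as one Worker-side helper. `SyncHandleLike` is a structural subset of `FileSystemSyncAccessHandle`; `writeStreamToHandle`, `expectedBytes` and `maxChunk` are hypothetical names, and the integrity check here is length-only (a checksum would be added the same way).

```typescript
// Structural subset of FileSystemSyncAccessHandle (Worker-only in browsers).
interface SyncHandleLike {
  truncate(newSize: number): void;
  write(data: Uint8Array, options?: { at?: number }): number;
  flush(): void;
  getSize(): number;
  close(): void;
}

// Copy a chunked byte stream into the handle, writing at most `maxChunk`
// bytes per call, so the full model is never buffered in memory.
async function writeStreamToHandle(
  chunks: AsyncIterable<Uint8Array>,
  handle: SyncHandleLike,
  expectedBytes: number,
  maxChunk = 1024 * 1024,
): Promise<number> {
  let offset = 0;
  try {
    handle.truncate(0); // discard any partial file from an earlier attempt
    for await (const chunk of chunks) {
      for (let i = 0; i < chunk.byteLength; i += maxChunk) {
        const slice = chunk.subarray(i, Math.min(i + maxChunk, chunk.byteLength));
        const written = handle.write(slice, { at: offset });
        if (written !== slice.byteLength) {
          throw new Error(`short write at offset ${offset}`);
        }
        offset += written;
      }
    }
    handle.flush();
    // Length check: a truncated file should trigger a re-fetch by the caller.
    if (handle.getSize() !== expectedBytes) {
      throw new Error(`expected ${expectedBytes} bytes, got ${handle.getSize()}`);
    }
    return offset;
  } finally {
    handle.close(); // always release the exclusive lock on the file
  }
}
```

+In a Worker, `chunks` would come from the `fetch` response body; where the stream is not directly async-iterable, a small `getReader()` loop can adapt it.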
+**If a PWA is unacceptable (must be a plain tab):** Accept that the cache will not durably survive on this device. Minimize re-download pain instead — ship a smaller/more-quantized model, stream and start inference before full load, serve from a fast local/CDN source, and rely on the HTTP disk cache (which is separate from the quota'd storage and can survive some reloads, though Safari is also aggressive there). Clearing the foreign 444MB still helps a same-session cache survive reloads, but not the 7-day/eviction window.
+---
+## Sources
+- WebKit — *Updates to Storage Policy* (best-effort vs persistent, eviction triggers, LRU origin-wide eviction, 60%/80% quota, Home-Screen-Web-App heuristic): https://webkit.org/blog/14403/updates-to-storage-policy/ `[docs/spec]`
+- MDN — *Storage quotas and eviction criteria* (7-day ITP eviction of script-writable storage; persist semantics): https://developer.mozilla.org/en-US/docs/Web/API/Storage_API/Storage_quotas_and_eviction_criteria `[docs/spec]`
+- web.dev — *Storage for the web* (7-day cap covers IndexedDB + SW + Cache API; PWA exception): https://web.dev/articles/storage-for-the-web `[docs/spec]`
+- web.dev — *Persistent storage* (heuristics; Chrome vs Safari): https://web.dev/articles/persistent-storage `[blog/SO]`
+- MDN — *FileSystemFileHandle.createWritable / createSyncAccessHandle* (sync handle is Worker-only): https://developer.mozilla.org/en-US/docs/Web/API/FileSystemFileHandle/createWritable `[docs/spec]`
+- MDN content issue #40394 — *Safari storage quota accuracy* (estimate() is misleading on Safari): https://github.com/mdn/content/issues/40394 `[lib/issue]`
+- WebKit bug 199614 — *iOS 13 IndexedDB 500MB cap* (historical; removed in Safari 17): https://bugs.webkit.org/show_bug.cgi?id=199614 `[lib/issue]`
+- transformers.js issues — caching/iOS: #889 (no-cache), #973 (Safari restarts), #1242 (iOS memory crash), #1298 (Whisper iOS): https://github.com/huggingface/transformers.js/issues `[lib/issue]`
+- SitePoint — *Optimizing Transformers.js for Production* (iOS Cache QuotaExceededError; IndexedDB fallback note): https://www.sitepoint.com/optimizing-transformers-js-production/ `[blog/SO]`
+- MagicBell — *PWA iOS Limitations and Safari Support [2026]* (aggressive eviction, no guaranteed persistence even for PWAs): https://www.magicbell.com/blog/pwa-ios-limitations-safari-support-complete-guide `[blog/SO]`
+- Raymond Camden — *Working with the Storage API*: https://www.raymondcamden.com/2023/08/25/working-with-the-storage-api `[blog/SO]`
+- Our on-device probe: iPadOS 26.5 Safari — persist()=false, main-thread createWritable throws OOM, quota≈1049MB / usage≈444MB. `[our-probe]`

package/docs/research/mobile-webgpu-speed-fusion.md ADDED

@@ -0,0 +1,135 @@
+# Mobile WebGPU Speed via Fusion & Batched Submit — Research & Strategy
+Date: 2026-06-15. Scope: making the Gerbil from-scratch WGSL decode engine fast on mobile Safari/iOS for Qwen3.5-0.8B (MLX 4-bit). Companion to `docs/mobile-failure-diagnosis.md` (correctness/crash) and `docs/metal-safari-intel.md`. This doc is about **speed once correct** — the two levers are (1) cut dispatch count via fusion, (2) make a batched submit viable on memory-tight phones.
+**One-line answer:** Our ~4.7 tok/s on iPhone is ~6–12× below what the same class of model already achieves in-browser on Safari WebGPU, and ~10–13× below native. The gap is almost entirely **per-dispatch CPU↔GPU round-trip overhead**, not compute. The realistic target is **15–30 tok/s on iPhone** and **40–60 tok/s on iPad** with the changes below; the single highest-leverage move is **batching all ~440 dispatches per token into ONE command buffer with exactly one in-flight submit** (after the correctness fix), with operator fusion as the second multiplier.
+---
+## 1. Achievable throughput — what other stacks get for ~0.5–1B 4-bit on recent iPhone/iPad
+### Native (the ceiling)
+On **iPhone 17 Pro (A19 Pro, iOS 26.4)**, MLBoy's reproducible cross-runtime benchmark over 128-token responses ([rockyshikoku/Medium](https://rockyshikoku.medium.com/local-llm-on-iphone-which-runtime-is-actually-fastest-58096685481e)):
+| Model (4-bit class) | MLX | llama.cpp (Metal) | LiteRT-LM | CoreML/ANE |
+|---|---|---|---|---|
+| Gemma 4 E2B (~2B) | 47.5 | 37.8 | **55.4** | 33.4 |
+| Qwen 3.5 2B | **61.2** | 39.1 | n/a | 27.9 |
+These are 2B models. A **0.8B** model has ~2.5× fewer params, so native decode for our size on an A-series should be roughly **80–130+ tok/s** (decode is bandwidth-bound, scales ~inverse with active bytes/token). Corroborating order of magnitude: Ricky Takkar's iPhone 17 Pro / iPad Pro MLX sweep and a DEV.to roundup both put sub-1B 4-bit (Qwen3 0.6B, Llama 3.2 1B) at **58–70 tok/s** on A17-Pro-class hardware ([dev.to](https://dev.to/alichherawalla/how-to-run-llms-locally-on-your-iphone-in-2026-completely-offline-no-subscription-4b3a)), degrading with context length. Older iPhone 12/13 (4GB): ~8–15 tok/s for Qwen3 0.6B — useful as our floor reference for memory-tight phones.
+### In-browser WebGPU (our actual comparable)
+- **WebLLM / MLC** on Apple Silicon: the WebLLM paper reports Llama-3.1-8B at **41 tok/s** and Phi-3.5-mini (3.8B) at **71 tok/s**, both on **M3 Max in Chrome Canary** — and explicitly **~72–80% of native MLC** ([WebLLM, arXiv 2412.15803](https://arxiv.org/html/2412.15803v2)). The paper gives **no iOS/Safari numbers**, which is a recurring gap across all sources (flagged).
+- **Safari penalty vs Chrome:** independent WebGPU survey work puts Safari at **30–42 tok/s** where Chrome hits 46–51 tok/s on the same hardware for 4-bit models — i.e. Safari runs in-browser LLMs at roughly **65–80% of Chrome's WebGPU decode** ([aicompetence.org guide](https://aicompetence.org/ai-in-browser-with-webgpu/)).
+- **LlamaWeb** ("Llamas on the Web", arXiv 2605.20706 — the closest published system to what we are building) reports **~52 tok/s on Apple M3 in Chrome**, **54% faster than WebLLM and 69% faster than Transformers.js**, and on low-power devices a decode range of **4–17 tok/s** ([LlamaWeb, arXiv 2605.20706](https://arxiv.org/html/2605.20706v1)). Crucially it confirms **Safari tab memory is limited to <500 MB** and that this is the binding constraint on iOS.
+### Realistic ceiling for Gerbil
+The honest framing: **WebGPU-on-Safari is a ~0.65–0.8× tax on top of Chrome-WebGPU, which is itself ~0.72–0.8× of native.** So in-browser Safari ≈ **0.5–0.65× of native** at best. For a 0.8B 4-bit model with native ≈ 80–130 tok/s on a recent iPhone, the **theoretical in-browser-Safari ceiling is ~40–80 tok/s**. No published system has demonstrated that on iOS Safari specifically (the data simply isn't out there — every paper benchmarks Chrome/desktop), so we should treat **15–30 tok/s on iPhone as the credible near-term target** and 40+ as a stretch. iPad already hits ~31 tok/s for us, consistent with it being less memory/thermally constrained.
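+The multiplication behind these ranges, written out as a small sketch. The inputs are the ranges quoted above; `scaleRange` is an illustrative helper, and the products are estimates rather than measurements.

```typescript
type TokRange = readonly [low: number, high: number];

// Multiply two ranges end to end (low x low, high x high).
function scaleRange(a: TokRange, b: TokRange): TokRange {
  return [a[0] * b[0], a[1] * b[1]];
}

const nativeTokPerSec: TokRange = [80, 130]; // 0.8B 4-bit, recent iPhone
const chromeVsNative: TokRange = [0.72, 0.8]; // WebLLM paper ratio
const safariVsChrome: TokRange = [0.65, 0.8]; // Safari penalty vs Chrome

// ≈ [0.47, 0.64], which the text rounds to 0.5–0.65
const safariVsNative = scaleRange(chromeVsNative, safariVsChrome);

// ≈ [37, 83] tok/s, which the text rounds to ~40–80
const safariCeiling = scaleRange(nativeTokPerSec, safariVsNative);
```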
+---
+## 2. Why WebGPU-on-Safari is slow & how the fast stacks avoid it
+The decisive finding is from "Characterizing WebGPU Dispatch Overhead for LLM Inference" (arXiv 2604.02344), which measured per-dispatch API overhead across NVIDIA/AMD/Apple/Intel × Dawn/wgpu × three browsers:
+- **Metal per-dispatch overhead: 32–71 µs** (vs Vulkan 24–36 µs); total per-operation overhead including the JS/host layer ~95 µs ([arXiv 2604.02344](https://arxiv.org/abs/2604.02344)).
+- **"Per-operation overhead dominates regardless of kernel quality" at batch size 1.** At decode (M=1) the GPU is idle most of the time; you are paying launch/submit latency, not compute.
+- **Naive single-op benchmarks overestimate dispatch cost by ~20×** — meaning the real win comes from amortizing overhead across *many* fused/batched ops, not from micro-optimizing one kernel.
+- **Kernel fusion gave +53% throughput on Vulkan**; backend choice was the dominant factor.
+Now apply our numbers. We do **~440 dispatches/token, each its own command buffer + a full `queue.onSubmittedWorkDone()` drain** (`webkitGroupSize=1`). Even at a conservative ~150–200 µs round-trip per drained submit on Safari (Metal launch + JS promise + scheduler wake), 440 × ~175 µs ≈ **77 ms/token, i.e. an overhead-only ceiling of ~13 tok/s**, and our measured 4.7 tok/s (~210 ms/token) says each drained round-trip is closer to ~480 µs in practice (promise microtask + GPU-process IPC on iPhone). **This is the entire problem.** The actual work for a 0.8B 4-bit forward at M=1 is streaming roughly 0.44 GB of weights (the §4 figure) plus about 1.6 GFLOP of arithmetic (2 FLOPs per parameter), which is on the order of single-digit milliseconds.
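+The same back-of-envelope model as code, using only the figures from the paragraph above. The function names are illustrative, and the model deliberately ignores compute time, so it yields an overhead-only bound.

```typescript
// Per-token latency if every dispatch pays one drained round-trip.
function msPerToken(dispatches: number, roundTripUs: number): number {
  return (dispatches * roundTripUs) / 1000;
}

function tokPerSec(ms: number): number {
  return 1000 / ms;
}

// Invert the model: what round-trip cost does a measured rate imply?
function impliedRoundTripUs(measuredTokPerSec: number, dispatches: number): number {
  return 1e6 / measuredTokPerSec / dispatches;
}

const assumedMs = msPerToken(440, 175);            // 77 ms/token
const overheadCeiling = tokPerSec(assumedMs);      // ≈ 13 tok/s
const measuredUs = impliedRoundTripUs(4.7, 440);   // ≈ 484 µs per drained submit
const oneSubmitMs = msPerToken(1, measuredUs);     // ≈ 0.5 ms of overhead per token
```

+The last line shows why a single drained submit per token moves the bottleneck: at the implied per-submit cost, overhead falls from about 210 ms to well under 1 ms per token.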
+**How the fast stacks avoid it:**
+- **They do NOT drain per dispatch.** The whole point of WebGPU command buffers is to batch many compute passes into one buffer and submit once; the driver schedules them back-to-back with no host round-trip. `onSubmittedWorkDone` is meant to be awaited **once per token** (to read the sampled logit), not per op. The MDN/spec guidance is explicit that awaiting between independent or dependent passes "creates artificial serialization" and removes the driver's freedom to pipeline ([MDN onSubmittedWorkDone](https://developer.mozilla.org/en-US/docs/Web/API/GPUQueue/onSubmittedWorkDone), [gpuweb CommandSubmission](https://github.com/gpuweb/gpuweb/blob/main/design/CommandSubmission.md)).
+- **Separate compute passes ARE the barrier.** WebGPU has no intra-pass memory barrier; the inter-pass dependency (end one `ComputePassEncoder`, begin the next, both in the *same* command buffer) is the synchronization primitive. So the fast pattern is: one command buffer, ~N compute passes (one per dependent op), one `queue.submit`, one `onSubmittedWorkDone` per token. This is exactly what gives Chrome/Dawn its throughput and what we currently refuse to do because of the WebKit zeros-within-one-submit bug (diagnosis §1.2).
+- **CUDA-graph / megakernel analog:** native MLC uses a CUDA-graph rewrite pass to collapse launch overhead ([MLC blog](https://blog.mlc.ai/2024/10/10/optimizing-and-characterizing-high-throughput-low-latency-llm-inference)); the megakernel work (MPK) goes further and fuses the *entire* forward into one launch with an in-kernel scheduler (1–2 µs inter-task overhead), but at A100 batch-1 that only bought 14.5→12.5 ms (~1.16×) because launch overhead is small on CUDA ([Jia/Medium](https://zhihaojia.medium.com/compiling-llms-into-a-megakernel-a-path-to-low-latency-inference-cf7840913c17)). **The lesson for us: the megakernel payoff is proportional to launch overhead, and our launch overhead is ~50–100× CUDA's — so the equivalent move (one big command buffer) is worth far more on Safari than the 1.16× it buys on CUDA.**
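+A minimal sketch of the one-command-buffer pattern from the list above. The interfaces are structural subsets named after the WebGPU methods they mirror; `runToken`, `OpLike` and `passesPerSubmit` are hypothetical and are not the executor's actual API. `passesPerSubmit` is included so the same loop can also express a smaller group size per command buffer.

```typescript
interface PassLike {
  setPipeline(pipeline: unknown): void;
  setBindGroup(index: number, bindGroup: unknown): void;
  dispatchWorkgroups(x: number, y?: number, z?: number): void;
  end(): void;
}
interface EncoderLike {
  beginComputePass(): PassLike;
  finish(): unknown;
}
interface DeviceLike {
  createCommandEncoder(): EncoderLike;
  queue: {
    submit(commandBuffers: unknown[]): void;
    onSubmittedWorkDone(): Promise<void>;
  };
}
interface OpLike {
  pipeline: unknown;
  bindGroup: unknown;
  workgroups: [number, number, number];
}

// Encode `passesPerSubmit` ops per command buffer (one compute pass per op),
// then submit and drain once per group. Returns the number of drained submits.
async function runToken(
  device: DeviceLike,
  ops: readonly OpLike[],
  passesPerSubmit: number = ops.length,
): Promise<number> {
  const group = Math.max(1, passesPerSubmit);
  let submits = 0;
  for (let start = 0; start < ops.length; start += group) {
    const encoder = device.createCommandEncoder();
    for (const op of ops.slice(start, start + group)) {
      const pass = encoder.beginComputePass();
      pass.setPipeline(op.pipeline);
      pass.setBindGroup(0, op.bindGroup);
      pass.dispatchWorkgroups(...op.workgroups);
      pass.end(); // ending the pass orders this op before the next one
    }
    device.queue.submit([encoder.finish()]);
    await device.queue.onSubmittedWorkDone(); // one host round-trip per group
    submits++;
  }
  return submits;
}
```

+With the default group size, 440 ops cost one drained submit; with `passesPerSubmit = 1` the loop reproduces the current per-dispatch behaviour and costs 440.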
+---
+## 3. Kernel/dispatch-reduction techniques for our WGSL decode loop (M=1)
+Highest-leverage first. "Desktop-neutral, mobile-win" = does not regress Dawn/M4 but removes Safari round-trips.
+1. **Batched submit (one command buffer, one drain/token).** This is THE change. It is not a fusion technique; it removes 439 of 440 host round-trips. Expected effect: if Safari per-submit overhead is the binding constraint, going from 440 drained submits to 1 drained submit collapses the ~210 ms/token overhead toward the actual compute time → plausibly **5–10× decode speedup alone**, gated only by the WebKit correctness bug (diagnosis §1.2, revisited under the skeptical flags below). Desktop-neutral (Dawn already tolerates it; the diagnosis even shows our desktop regression came from forcing the *opposite* path).
+2. **Operator fusion to cut the dispatch *count* itself** (multiplies #1 by shrinking N passes, and reduces intermediate buffers → helps memory, §4). In order of value for a decode loop:
+   - **Dequant-into-matvec** — fold INT4 dequant into the matvec kernel; never materialize an f16/f32 weight tensor. LlamaWeb does exactly this: "threads collaboratively load quantized blocks, dequantize into shared memory" and for decode "dequantize directly into registers" ([LlamaWeb](https://arxiv.org/html/2605.20706v1)). We likely already do some of this; verify no separate dequant dispatch exists.
+   - **RMSNorm + matvec fusion** (norm folded into the following Q/K/V/gate projections' epilogue/prologue) — removes a dispatch and a round-trip per norm. Per token we have ~2 norms/layer × 28 layers ≈ 56 norm dispatches; fusing removes them.
+   - **QKV into one matvec** — pack the three projections into a single [3*hidden] matvec dispatch (one weight load region, one dispatch instead of 3). Standard; ~2×/layer dispatch reduction on the attention input side.
+   - **Gate+Up (SwiGLU) into one matvec + fused SiLU·mul epilogue** — one dispatch instead of two matvecs + an elementwise. (We already have a "fused-SwiGLU pipeline" per the diagnosis — confirm it's the single-dispatch form.)
+   - **Residual-add folded into the next kernel's load** (read residual, add, proceed) — removes the standalone add dispatches.
+   - **Attention as fewer passes** — a decode-time flash-style single-pass-per-head (or per KV-tile) attention. Full single-pass attention in WGSL is hard (no intra-pass barrier across the softmax reduction), but online-softmax flash decoding keeps it to a small fixed number of passes regardless of context length.
+   Realistic floor: a transformer layer can be brought to roughly **~6–8 dispatches/layer** (qkv-matvec, rope, attention(1–2), o-proj-matvec, gate/up-matvec, down-matvec, with norms/residuals folded), i.e. **~28 layers × 7 ≈ ~200 dispatches/token**, plus embedding + final-norm + lm_head. So fusion realistically takes us from **~440 → ~200** dispatches. With batched submit that's ~200 passes in one CB — and on Safari, fewer passes also lowers the chance of tripping the within-submit scale bug (diagnosis §1.2 hints the bug is pipeline/binding-count-sensitive).
+3. **GPU-side sampling — keep the loop on-GPU.** Today we read back logits (or argmax) every token. Do **argmax/sampling in a kernel** so the only host readback is the 4-byte chosen token id, and ideally even feed it back into the next step's embedding gather on-GPU (write the sampled id into the input buffer via a tiny kernel). This removes the 485MB-logits readback pressure (already flagged in diagnosis A4) and lets us approach **one `onSubmittedWorkDone` per token** with a 4-byte map. Desktop-neutral.
+4. **Multi-token in flight (pipelining), only if #1–#3 leave headroom.** Speculative/2-tokens-ahead is complex and risky on Safari; deprioritize. Indirect dispatch buys little at M=1 (dispatch dims are static). Skip.
+**Single highest-leverage change: #1 (batched submit), unblocked by fixing the within-submit correctness bug.** Fusion (#2) is the necessary multiplier and the memory enabler, but order it after the submit batching is proven correct, because fusion only pays off once you've stopped draining per dispatch.
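+The dequant-into-matvec item from the fusion list can be illustrated with a CPU reference. This is a sketch under a deliberately simplified, hypothetical layout (two 4-bit weights per byte, low nibble first, symmetric around 8, one f32 scale per group within a row); it is **not** the MLX 4-bit format and not the engine's WGSL kernel. What it shows is that each weight is dequantized where it is consumed, so no f16/f32 weight tensor is ever materialized.

```typescript
// y = W x, where W is stored as packed 4-bit values plus per-group scales.
function matvecInt4(
  packed: Uint8Array,   // rows * cols / 2 bytes, row-major
  scales: Float32Array, // rows * (cols / groupSize) scales
  x: Float32Array,      // cols activations
  rows: number,
  cols: number,
  groupSize: number,
): Float32Array {
  if (cols % groupSize !== 0 || (rows * cols) % 2 !== 0) {
    throw new Error("cols must be a multiple of groupSize and rows*cols even");
  }
  const groupsPerRow = cols / groupSize;
  const y = new Float32Array(rows);
  for (let r = 0; r < rows; r++) {
    let acc = 0;
    for (let g = 0; g < groupsPerRow; g++) {
      let partial = 0;
      for (let k = 0; k < groupSize; k++) {
        const c = g * groupSize + k;
        const idx = r * cols + c;
        const byte = packed[idx >> 1];
        // Unpack one nibble and centre it: the dequantized value lives
        // only in this local, never in a weight buffer.
        const q = (idx & 1) === 0 ? byte & 0x0f : byte >> 4;
        partial += (q - 8) * x[c];
      }
      // Apply the group scale once per group instead of once per weight.
      acc += scales[r * groupsPerRow + g] * partial;
    }
    y[r] = acc;
  }
  return y;
}
```

+A GPU version would split rows across workgroups and reduce `partial` cooperatively, but the data flow (packed load, inline dequant, accumulate) is the same.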
+---
+## 4. The memory-crash angle — why batching OOM'd, and how to make it viable
+The diagnosis already nailed this: batching didn't crash because of *passes-per-buffer*, it crashed because the **buffer footprint was 2.77 GB at maxSeqLen=512** against Safari's hard limit. LlamaWeb independently confirms the constraint: **"Safari tab memory is limited to <500 MB"** and "Safari has especially strict memory usage limits" ([LlamaWeb](https://arxiv.org/html/2605.20706v1)). The reason packing more passes into one command buffer *appears* to OOM is that more in-flight passes pin more intermediate buffers resident simultaneously, and our allocator gives **one buffer per activation tensor at full maxSeqLen with zero reuse** (~430 live buffers). Batched submit makes the whole working set resident at once; per-dispatch drain let buffers churn.
+How the fast stacks make batching memory-viable (and our matching fixes):
+- **Static buffer "arena" + precomputed intermediate sizes** — LlamaWeb allocates a static parameter arena and computes intermediate needs upfront, "avoiding dynamic GPU buffer creation" and the fragmentation/crashes it causes ([LlamaWeb](https://arxiv.org/html/2605.20706v1)). → Our **activation-buffer aliasing** (diagnosis A5): liveness analysis over the execution order, pool by size class → ~430 buffers → ~a dozen live (~50–150 MB at T=512).
+- **Streamed weight loading, no WASM-heap materialization** — LlamaWeb downloads to OPFS and streams into WebGPU using only **four 1 MB buffers**, never holding the model in the JS heap ([LlamaWeb](https://arxiv.org/html/2605.20706v1)). → Our load-transient trims (diagnosis A9) + OPFS streaming.
+- **Shrink logits to [1, vocab]** (diagnosis A4): −485 MB at T=512, and it's the largest single buffer.
+- **Fusion reduces intermediates** (§3.2): every fused op is an intermediate buffer that no longer exists.
+Combined (per diagnosis): INT4@512 ≈ 0.44 GB weights + ~0.15 GB activations ≈ **~0.6–0.7 GB**. That's still above the strict <500 MB Safari figure LlamaWeb cites — so on memory-tight phones we likely also need **maxSeqLen clamped to ≤256** (≈0.4–0.5 GB) and/or weight sharding, and we must **never load BF16 on iOS** (diagnosis A6). Bottom line: **memory reduction is the precondition that makes the one-big-command-buffer batched submit physically possible on iPhone.** They are the same project.
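+The activation-buffer aliasing idea (liveness over execution order, pooled by size class) can be sketched as below. `OpNode`, `aliasActivations` and the power-of-two size classes are assumptions for illustration; they are not the data structures used in `executor.ts`.

```typescript
interface OpNode {
  output: string;   // tensor this op writes
  bytes: number;    // size of that tensor
  inputs: string[]; // tensors this op reads
}

interface SlotPlan {
  slotOf: Map<string, number>; // tensor name -> physical buffer slot
  slotBytes: number[];         // size of each physical slot
}

// Round up to a power-of-two class so similar tensors can share a slot.
function sizeClass(bytes: number): number {
  let c = 256;
  while (c < bytes) c *= 2;
  return c;
}

function aliasActivations(order: readonly OpNode[]): SlotPlan {
  // Liveness: index of the last op that reads each tensor.
  const lastUse = new Map<string, number>();
  order.forEach((op, i) => op.inputs.forEach((t) => lastUse.set(t, i)));

  const slotOf = new Map<string, number>();
  const slotBytes: number[] = [];
  const freeByClass = new Map<number, number[]>();

  order.forEach((op, i) => {
    // Allocate the output BEFORE releasing inputs, so an op never
    // writes into a buffer it is still reading.
    const cls = sizeClass(op.bytes);
    const reused = freeByClass.get(cls)?.pop();
    const slot = reused ?? slotBytes.push(cls) - 1;
    slotOf.set(op.output, slot);

    // Release inputs whose true last reader is this op.
    Array.from(new Set(op.inputs)).forEach((t) => {
      const s = slotOf.get(t);
      if (s === undefined || lastUse.get(t) !== i) return; // graph input or still live
      const list = freeByClass.get(slotBytes[s]) ?? [];
      list.push(s);
      freeByClass.set(slotBytes[s], list);
    });
  });
  return { slotOf, slotBytes };
}
```

+For a pure chain of equal-sized ops this settles at two slots that ping-pong; a residual that is read again later keeps its slot until that later read, which is the "true last-use" condition the risk column warns about.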
+---
+## 5. Quantization / bandwidth — is mobile bandwidth- or overhead-bound here?
+At our current 4.7 tok/s with 440 drained submits, we are **overhead-bound, full stop** — same as the desktop finding, just worse because Safari's per-submit cost is higher and the iPhone GPU process IPC is slower. The dispatch-overhead paper's "per-operation overhead dominates regardless of kernel quality at batch size 1" applies directly ([arXiv 2604.02344](https://arxiv.org/abs/2604.02344)).
+**However, once batched submit lands and per-dispatch overhead no longer dominates, decode becomes bandwidth-bound** (M=1 matvec is pure weight streaming). Then quantization/precision matters in a way it didn't before:
+- LlamaWeb measured **f16→q8 = +20% (high) / +53% (mid-cluster) decode**, and **q4_k_m→q2_k = −17%** (i.e. going *more* aggressive than q4 hurt — likely dequant overhead outweighing bandwidth savings) ([LlamaWeb](https://arxiv.org/html/2605.20706v1)). So **4-bit weights are near the sweet spot**; don't chase 2-bit.
+- **f16 vs f32 activations/scales matters more on mobile than desktop.** Apple A-series GPUs have native f16 throughput and tighter bandwidth than M4 Max; keeping activations, KV-cache, and dequant scales in **f16** roughly halves the bytes moved per decode step and uses the f16 ALU path. Our desktop "f16-doesn't-matter, it's overhead-bound" finding will **not** hold on mobile once overhead is removed — expect f16 activations/KV to be a real win there. (Caveat from the diagnosis: there's an open `?kvf32=1` correctness question around packed-f16 KV on WebKit — resolve that before relying on f16 KV.)
+- WebGPU's mandatory bounds/safety checks cost ~1–5% at decode (vs 14–42% at prefill) per LlamaWeb — negligible for our decode focus; not worth fighting.
+---
+## Prioritized engine-change list (most impact first)
+Mapped to the two levers — **[BATCH]** = make batched submit viable, **[FUSE]** = reduce dispatch count. "DN" = desktop-neutral / mobile-win.
+| # | Change | Lever | Expected speedup (iPhone) | Implementation sketch (our terms) | Risk |
+|---|---|---|---|---|---|
+| **1** | **Resolve the within-submit correctness bug, then batch ALL dispatches into 1 command buffer, 1 `onSubmittedWorkDone`/token** | BATCH | **~5–10×** (the whole ballgame) | In `executor.ts` replace the per-dispatch submit+drain loop (`executor.ts:367-395`) with: one `commandEncoder`, one `beginComputePass`/`end` per dependent op (passes are the barrier), one `queue.submit`, one `await onSubmittedWorkDone()` per token. Run the diagnosis B2 sweep first (N ∈ {1,8,32,64,…,all}) to find the largest N that stays correct on the test WebKit; ship that N. DN: Dawn already tolerates one-CB. | High: blocked by the WebKit zeros-at-scale bug (diagnosis §1.2). If sub-CB granularity bites, fall back to N=32–64/CB (llama.cpp's existence proof), which still cuts 440 round-trips/token to ~7–14 (32–64× fewer than N=1). |
+| **2** | **Activation-buffer aliasing + logits [1,vocab] + no-BF16-on-iOS + maxSeqLen clamp ≤256** | BATCH (memory precondition) | enables #1 at all on iPhone (prevents jetsam) | Liveness pass over `graph.executionOrder` in `allocateActivationBuffers` (`executor.ts:838-850`), pool by size class; logits shape→`[1,vocab]` (`qwen3_5.ts:932-937`); clamp engine/React defaults (`index.ts:168-172`, `use-native-engine.ts:84`). Targets ~0.4–0.7 GB total. | Medium — aliasing must respect true last-use or corrupts output; unit-test against Dawn reference. DN. |
+| **3** | **GPU-side sampling; only 4-byte token-id readback per step** | FUSE/BATCH | 1.1–1.3× + removes 485MB readback pressure | argmax/sample kernel writes chosen id to a 4-byte buffer; map only that. Optionally a tiny kernel writes the id into the next-step input buffer to keep the loop on-GPU. (diagnosis A4/A8) | Low. DN. |
+| **4** | **Fuse RMSNorm into following matvec; QKV into one matvec; SwiGLU gate+up single dispatch + fused SiLU·mul; residual-add folded** | FUSE | 1.5–2.5× (cuts ~440→~200 passes, multiplies #1; fewer intermediates helps #2) | Add fused WGSL variants in the kernel registry; norm epilogue/prologue folded into matvec; pack QKV weights contiguously for one [3*hidden] dispatch; confirm existing fused-SwiGLU is single-dispatch. | Medium — more kernel variants = more compile time + more surface for WebKit WGSL→MSL miscompiles; A/B each against Dawn. DN. |
+| **5** | **Dequant-into-matvec verified (no standalone dequant dispatch); decode matvec dequantizes into registers** | FUSE | 1.2–1.5× if a separate dequant exists; bandwidth win | Confirm INT4 matvec reads packed weights + scales and dequantizes inline (LlamaWeb register-dequant pattern). Remove any materialized-weight path. | Low–medium. DN. |
+| **6** | **f16 activations + f16 KV-cache + f16 scales on mobile** | (bandwidth) | 1.2–1.5× **once bandwidth-bound** (mobile-specific) | Keep activation/KV/scale tensors f16; use f16 ALU path. Gate behind resolving the `?kvf32=1` packed-f16-KV correctness question (diagnosis B5). | Medium: packed-f16 KV has an open WebKit correctness flag; verify first. **Mobile-win, desktop-neutral.** |
+| **7** | **OPFS streamed weight load (four ~1MB staging buffers), drop JS-heap materialization** | BATCH (memory) | load-time + frees headroom for #1 | Stream shards from OPFS straight into GPU buffers; skip `slice(0)` cache copy and MLX zero-copy pinning on iOS (diagnosis A9). | Low. DN. |
+### Net expectation
+- **#1 + #2** (the unblock + the memory precondition) is the bulk: realistically **~10–25 tok/s on iPhone** on their own (from 4.7), depending on how large an N the WebKit bug permits.
+- **#3–#6** stack a further ~2–4× of multipliers and bandwidth wins → the **15–30 tok/s iPhone / 40–60 tok/s iPad** target.
+- #1–#5 and #7 are **desktop-neutral**, and #6 is a mobile-specific win that is not expected to change desktop (#1/#2 actively *fix* the recorded desktop regression by removing the forced per-dispatch path).
+---
+## Skeptical flags / unverified claims
+- **No source gives iOS-Safari WebGPU tok/s for a sub-1B 4-bit model.** Every benchmark (WebLLM, LlamaWeb, the dispatch paper) runs Chrome/desktop. Our 15–30 tok/s iPhone target is **extrapolated** from (native iPhone numbers) × (Safari/Chrome WebGPU penalty) × (Chrome-WebGPU/native penalty) — treat as an engineering estimate, not a measured comparable. **The diagnosis is right that nothing in our repo has a recorded iPhone result either** — the very first action should be to land #1/#2 and *measure*.
+- **kernelfusion.dev** ("median 71×, peak 226×, 15,000–213,000 tok/s on phones, 1,024 dispatches→1") is a **vendor landing page with no methodology or model named**; the phone tok/s figures are not credible for a real LLM and I could not verify any of it. Cited only as directional support that browser engines launch ~10³ dispatches/gen and that single-dispatch fusion is the lever — **do not quote its speedup numbers.**
+- The **megakernel-on-CUDA payoff is small (1.16× at A100 batch-1)** precisely because CUDA launch overhead is tiny; do **not** infer the same modest payoff for us — our Metal/Safari launch overhead is 50–100× higher, so the analogous one-command-buffer move is worth far more. (Cross-check: dispatch paper's 32–71 µs Metal vs MPK's 1–2 µs in-kernel.)
+- The **WebLLM "72–80% of native" ratio is desktop (M3 Max, Chrome).** On iOS Safari the achievable fraction is lower (Safari penalty); I used 0.5–0.65× of native as the Safari-specific estimate, derived from the 30–42 vs 46–51 tok/s Safari-vs-Chrome figure × the 0.72–0.8 Chrome-vs-native figure.
+- **Whether a single ~200-pass command buffer is even *correct* on the target WebKit is unknown** (diagnosis §1.2 is unresolved). #1's speedup is contingent on the B2/B3 sweeps; if the bug is sub-CB-granular, the realistic batched N is 32–64 and the speedup is the lower end of the range. This is the one assumption that could halve the projected gains.
+## Sources
+- [Characterizing WebGPU Dispatch Overhead for LLM Inference (arXiv 2604.02344)](https://arxiv.org/abs/2604.02344) — Metal 32–71 µs/dispatch; per-op overhead dominates at batch 1; fusion +53% Vulkan.
+- [Llamas on the Web / LlamaWeb (arXiv 2605.20706)](https://arxiv.org/html/2605.20706v1) — closest comparable; Safari <500MB; dequant-into-matvec; static arena; f16/q8/q4/q2 deltas; ~52 tok/s M3 Chrome; 4–17 tok/s low-power.
+- [WebLLM (arXiv 2412.15803)](https://arxiv.org/html/2412.15803v2) — MLC/TVM fusion+GEMM tiling; 41/71 tok/s M3 Max Chrome; ~72–80% of native.
+- [MLBoy — Local LLM on iPhone, which runtime is fastest](https://rockyshikoku.medium.com/local-llm-on-iphone-which-runtime-is-actually-fastest-58096685481e) — iPhone 17 Pro native: MLX 61.2 / llama.cpp 39.1 (Qwen 3.5 2B).
+- [Compiling LLMs into a MegaKernel (Jia)](https://zhihaojia.medium.com/compiling-llms-into-a-megakernel-a-path-to-low-latency-inference-cf7840913c17) — 14.5→12.5 ms A100 batch-1; 1–2 µs in-kernel task overhead.
+- [MLC blog — Optimizing high-throughput low-latency inference](https://blog.mlc.ai/2024/10/10/optimizing-and-characterizing-high-throughput-low-latency-llm-inference) — CUDA-graph rewrite pass cuts launch overhead.
+- [MDN — GPUQueue.onSubmittedWorkDone](https://developer.mozilla.org/en-US/docs/Web/API/GPUQueue/onSubmittedWorkDone) & [gpuweb CommandSubmission design](https://github.com/gpuweb/gpuweb/blob/main/design/CommandSubmission.md) — awaiting per pass = artificial serialization; passes-in-one-CB is the intended pattern.
+- [AI in Browser with WebGPU (2025 guide)](https://aicompetence.org/ai-in-browser-with-webgpu/) — Safari 30–42 vs Chrome 46–51 tok/s WebGPU.
+- [Run LLMs locally on iPhone 2026 (dev.to)](https://dev.to/alichherawalla/how-to-run-llms-locally-on-your-iphone-in-2026-completely-offline-no-subscription-4b3a) — sub-1B 4-bit 58–70 tok/s A17-Pro-class; 8–15 tok/s iPhone 12/13.
+- [kernelfusion.dev](https://kernelfusion.dev/) — directional only (browser engines ~10³ dispatches/gen, single-dispatch fusion); **speedup/tok-s figures unverified, not quoted.**

package/docs/research/native-stt-model-selection.md ADDED

@@ -0,0 +1,49 @@
+# Native STT Model Selection — Decision
+**Date:** 2026-06-14
+**Status:** Recommendation — build **Moonshine** native (base 61M / tiny 27M); keep ONNX-Whisper as the multilingual harness fallback.
+**Context:** STT was open as "native vs just a harness capability." Verdict: a small native STT IS worth it, because exactly one model avoids the expensive FFT frontend and needs only one new kernel we've already scoped.
+---
+## Bottom line
+Build **Moonshine** (Useful Sensors). It's the best fit by a wide margin:
+- **Raw-waveform Conv frontend** (2× stride-2 causal conv, 4× downsample) — **no FFT/STFT/mel** (the engine has no FFT; this is the load-bearing gap Moonshine sidesteps). Reuses existing `CausalConv1d`.
+- Encoder = transformer + RoPE; decoder = **AR transformer + cross-attention** — runs on the existing AR decode loop + KV-cache.
+- **27.1M / 61.5M params**, ~15–35MB at 4-bit. **MIT license, safetensors** — drop-in with the loader.
+- **Beats Whisper tiny/base on English WER**; purpose-built for edge/streaming.
+**The only new kernel = `CrossAttention`** — and it's **already declared in the IR** (`src/gpu/ir.ts`, `"CrossAttention"`), just not implemented (`registry.ts` lists it as absent). No CTC head, no FFT, no new frontend math. The engine is ~80% there for an encoder-decoder ASR; Moonshine is the candidate whose missing piece is the cheapest.
+## Why not the others
+| Model | Disqualifier |
+|---|---|
+| Whisper tiny/base | log-mel frontend → needs **FFT/STFT/mel** (expensive, accuracy-sensitive) + cross-attn (2 gaps). |
+| Distil-Whisper | FFT + 121M/770M — too big. |
+| wav2vec2 / HuBERT CTC | Raw-waveform (no FFT) and **no decoder** (CTC = linear+argmax, cheapest *kernel* path) BUT ~95M, English-only, brittle out-of-domain, no internal LM. Stepping stone, not destination. |
+| NVIDIA Parakeet | 600M FastConformer, mel frontend, **PyTorch-only (no safetensors/ONNX)**, CC-BY-4.0. Not portable. |
+| Kyutai STT | mel/STFT + transducer + restrictive license. |
+## Plan
+1. Keep **ONNX-Whisper** (`src/core/stt.ts`) as the multilingual/fallback harness route — don't delete. Moonshine-English + Whisper-multilingual is a strong combo.
+2. Native build (when greenlit): implement the **`CrossAttention`** kernel (encoder K/V computed once, frozen, cached for the whole decode — *not* a growing KV cache like the causal self-attention path), wire Moonshine's causal-conv frontend (reuse `CausalConv1d`) + RoPE encoder + AR decoder.
+3. **Validate cross-attention bit-exact vs a HF reference forward pass on a fixed encoder output BEFORE wiring the full decode loop** — same discipline as `test-codec-kernels.mjs`.
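For the isolated validation in step 3, a CPU reference is the natural thing to compare the kernel against. The sketch below is a minimal single-head version, assuming the encoder K/V have already been projected and are held fixed for the whole decode. `FrozenEncoderKV` and `crossAttend` are hypothetical names; this is not the `CrossAttention` op declared in `src/gpu/ir.ts`, only an illustration of the math it would need to match.

```typescript
// Hypothetical CPU reference for one attention head; not Gerbil's kernel or IR.
interface FrozenEncoderKV {
  readonly k: Float32Array; // [encLen * headDim], row-major
  readonly v: Float32Array; // [encLen * headDim], row-major
  readonly encLen: number;
  readonly headDim: number;
}

// out = softmax(q · K^T / sqrt(headDim)) · V over the full encoder sequence.
// There is no causal mask: every decoder step sees every encoder position.
function crossAttend(q: Float32Array, kv: FrozenEncoderKV): Float32Array {
  const { k, v, encLen, headDim } = kv;
  if (q.length !== headDim) throw new RangeError("q must have headDim elements");
  const scale = 1 / Math.sqrt(headDim);
  const scores = new Float64Array(encLen);
  let maxScore = -Infinity;
  for (let t = 0; t < encLen; t++) {
    let dot = 0;
    for (let d = 0; d < headDim; d++) dot += q[d] * k[t * headDim + d];
    scores[t] = dot * scale;
    if (scores[t] > maxScore) maxScore = scores[t];
  }
  // Numerically stable softmax: subtract the max before exponentiating.
  let denom = 0;
  for (let t = 0; t < encLen; t++) {
    scores[t] = Math.exp(scores[t] - maxScore);
    denom += scores[t];
  }
  const out = new Float32Array(headDim);
  for (let t = 0; t < encLen; t++) {
    const w = scores[t] / denom;
    for (let d = 0; d < headDim; d++) out[d] += w * v[t * headDim + d];
  }
  return out;
}

// Built once after the encoder runs; every decode step reuses the same object.
const encoderKV: FrozenEncoderKV = {
  k: new Float32Array([1, 0, 0, 1]),
  v: new Float32Array([10, 0, 0, 10]),
  encLen: 2,
  headDim: 2,
};

const uniform = crossAttend(new Float32Array([0, 0]), encoderKV); // equal weights on both rows
const peaked = crossAttend(new Float32Array([100, 0]), encoderKV); // weight concentrates on row 0
console.log(Array.from(uniform), Array.from(peaked));
```

Unlike the causal self-attention path, nothing is appended to `encoderKV` between steps; only `q` changes. That makes a fixed encoder output a convenient test fixture.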
+## Biggest risk
+**Cross-attention correctness in the AR decode loop** — mixing two coherence regimes per step: causal self-attention with a *growing* KV-cache + unmasked cross-attention over a *fixed* encoder sequence. This is the exact shape that surfaced the Safari/Metal barrier bugs. Mitigation: freeze encoder K/V, validate the kernel in isolation first.
+## Relationship to TTS
+Moonshine is the natural sibling of the [Orpheus TTS pick](native-tts-model-selection.md) (since superseded in that document by Kani-TTS-2): same Qwen-style block math, same conv-cluster philosophy. `CrossAttention` is also a prerequisite for any encoder-decoder model, so it's reusable engine infrastructure beyond STT.
+## Sources
+- Moonshine (arXiv 2410.15608; v2 streaming arXiv 2602.12241): https://huggingface.co/UsefulSensors/moonshine — 27.1M/61.5M, raw-waveform causal-conv frontend, RoPE, AR+cross-attn, MIT.
+- wav2vec2/HuBERT CTC: raw Conv + linear CTC, ~95M, Apache/MIT.
+- nvidia/parakeet-tdt-0.6b-v2 / parakeet-ctc-0.6b: 600M, mel, CC-BY-4.0, PyTorch-only.
+- kyutai/stt-*: streaming Conformer + transducer, restrictive license.
+- In-repo: `src/gpu/ir.ts` (`CrossAttention` declared), `src/gpu/kernels/registry.ts` (not implemented), `src/gpu/architectures/qwen3_5_vision.ts` (bidirectional attn exists), `src/core/stt.ts` (current ONNX-Whisper route).

package/docs/research/native-tts-model-selection.md ADDED

@@ -0,0 +1,90 @@
+# Native TTS Model Selection — Decision
+**Date:** 2026-06-14
+**Status:** ✅ DECISION — build **Kani-TTS-2** (Feb 2026, ~400M: **LFM2-350M backbone + NanoCodec**). Orpheus is SUPERSEDED (3B + 2024 Llama backbone — too big/old). Do **not** build OmniVoice (diffusion).
+**Context:** OmniVoice (k2-fsa) is masked-token **diffusion** (too heavy). A first pass picked Orpheus (Llama-3.2 + SNAC) for kernel reuse, but a re-hunt for the *smallest/newest* portable TTS found a decisively better fit.
+---
+## ⭐ FINAL PICK: Kani-TTS-2 (supersedes Orpheus below)
+**Build Kani-TTS-2** (`nineninesix/kani-tts`, Feb 2026). It beats Orpheus on every axis and — crucially — **rides infrastructure we already shipped this session**:
+- **Backbone = LFM2-350M** = the SAME hybrid short-conv + GQA architecture we already built (`src/gpu/architectures/lfm2.ts`, validated ~46 tok/s on iPad). The conv-cache the re-search flagged as "the risk" is **already implemented** for LFM2.5. Backbone ≈ done; the codec-LM is that generator emitting NanoCodec tokens instead of text.
+- **Vocoder = NanoCodec** = `ConvTranspose1d` + `Snake` (HiFi-GAN-style, **NO iSTFT/FFT**) → reuses the **DAC vocoder kernels** banked from OmniVoice (`Conv1dFull`, `ConvTranspose1d`, `Snake1d`). NanoCodec adds **FSQ** (finite scalar quantization at 12.5 FPS, the lowest frame rate in the field → fewest decode steps; this bullet originally said 13 codebooks, which the spec correction below revises to 4 tokens/frame).
+- **~370M / ~271MB at INT4** (q4 backbone ~208MB + f16 NanoCodec ~63MB) — mobile-viable. safetensors, on-the-fly INT4.
+> **⚠️ LICENSE CORRECTION (verified at build):** `nineninesix/kani-tts-2-en` is **"other / LFM1.0" (Liquid AI), NOT Apache-2.0** (it fine-tunes LFM2-350M; NanoCodec is under the NVIDIA Open Model License). The earlier "Apache-2.0" claim was wrong. The older **`kani-tts-450m-0.2-ft` IS Apache-2.0** and is the SAME architecture (LFM2-350M codec-LM + NanoCodec), so it runs on the identical engine path. **Decision: if clean commercial licensing matters, ship the 450m Apache variant; if best quality and LFM1.0 terms are acceptable, ship kani-tts-2-en. The engine work covers both; it's purely a weights choice.**
+> **SPEC CORRECTION (verified at build):** NanoCodec v2 is **4 tokens/frame** (not 13), FSQ = 4 groups × 4 dims, levels [9,8,8,7]; decoder is **causal**, **HalfSnake** activation, **clamp** output (not tanh), upsampling rates [7,7,6,3,2] → 22050 Hz. Now correctly implemented and validated against the reference (`test-nanocodec-decode.mjs`, max|err| 4.2e-6, i.e. within floating-point tolerance rather than strictly bit-identical).
+**New work = small:** NanoCodec's **FSQ decode** (scalar de-quantization → codebook → the conv decoder we have) + the codec-LM head/token wiring. No new backbone, no new vocoder family, no FFT.
+**Biggest risk:** NanoCodec FSQ token layout + the LFM2→audio-token head wiring — validate the FSQ→PCM decode bit-exact against the reference NanoCodec `decode()` before wiring the full AR loop.
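The corrected spec figures can be cross-checked with a few lines of arithmetic, and the FSQ de-quantization step is small enough to sketch. The code below assumes the convention of the original FSQ formulation: a group index is a mixed-radix number over the per-dimension levels, with the first dimension least significant, and each digit is centred and scaled by `floor(levels / 2)`. Whether NanoCodec uses exactly this digit order and scaling is an assumption that has to be confirmed against the reference `decode()`; every name here is hypothetical.

```typescript
// Illustrative only: names are hypothetical, not NanoCodec or Gerbil APIs.
const UPSAMPLE_RATES = [7, 7, 6, 3, 2];
const SAMPLE_RATE_HZ = 22050;
const TOKENS_PER_FRAME = 4;
const FSQ_LEVELS = [9, 8, 8, 7];

const product = (xs: readonly number[]): number => xs.reduce((acc, x) => acc * x, 1);

const samplesPerFrame = product(UPSAMPLE_RATES);            // 7*7*6*3*2 = 1764
const framesPerSecond = SAMPLE_RATE_HZ / samplesPerFrame;   // 12.5
const tokensPerSecond = framesPerSecond * TOKENS_PER_FRAME; // 50 audio tokens per second of audio
const codesPerGroup = product(FSQ_LEVELS);                  // 9*8*8*7 = 4032

// ASSUMPTION: first dimension is the least significant digit of the index.
function fsqIndexToCode(index: number, levels: readonly number[] = FSQ_LEVELS): number[] {
  if (!Number.isInteger(index) || index < 0 || index >= product(levels)) {
    throw new RangeError(`index ${index} is outside the FSQ codebook`);
  }
  const code: number[] = [];
  let rest = index;
  for (const l of levels) {
    const digit = rest % l;         // quantization level in this dimension, 0..l-1
    rest = Math.floor(rest / l);
    const half = Math.floor(l / 2); // centre and scale into roughly [-1, 1]
    code.push((digit - half) / half);
  }
  return code;
}

console.log(samplesPerFrame, framesPerSecond, tokensPerSecond, codesPerGroup);
console.log(fsqIndexToCode(0), fsqIndexToCode(2056), fsqIndexToCode(4031));
```

The 12.5 frames per second recovered from the upsampling rates agrees with the frame rate quoted for NanoCodec earlier in this document. A mismatch in digit order would silently permute the decoded dimensions, which is why the comparison against the reference comes first.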
+**Lowest-effort fallback (zero backbone risk):** **OuteTTS-1.0-0.6B** = Qwen3-0.6B (plain transformer, maps 1:1 to our AR loop) + IBM DAC.speech.v1.0 (ConvTranspose+Snake, no FFT), Apache-2.0. Slightly bigger/older than Kani but both halves are already covered. Use if Kani's LFM2-audio wiring proves fiddly.
+**Honest landscape note:** a true sub-100M, safetensors, FFT-free TTS does NOT exist as of June 2026 — Kitten-TTS (15M) and Supertonic (66M) are ONNX-locked + iSTFT; X-Codec2/NeuCodec/Vocos models all need iSTFT. ~400M (Kani) is the floor for a portable codec-LM.
+**Do NOT build:** NeuTTS Air / Llasa (Vocos/iSTFT decoder = FFT cost; Llasa also CC-BY-NC), Kitten/Supertonic (ONNX + iSTFT), Orpheus (3B/old).
+---
+## (Superseded) Original recommendation: Orpheus-style
+---
+## Bottom line
+Build an **Orpheus-style TTS**: a **Llama-3.2 backbone** (codec-LM) predicting **SNAC 24 kHz** audio-codec tokens autoregressively, decoded by the **SNAC decoder** (DAC lineage).
+- **Backbone** = Llama-3.2 → a **Tier-1 generator** on Gerbil's existing AR KV-cache decode loop (RMSNorm/RoPE/GQA/SwiGLU). The codec-LM is literally the text decode loop emitting SNAC token IDs instead of text tokens, so it needs near-zero new backbone code. (The Llama generator is not yet registered, but that is hours of work: copy `qwen2.ts`, drop QKV bias.)
+- **Vocoder** = SNAC 24 kHz decoder = `ConvTranspose1d [8,8,4,2]` + `Snake1d` + dilated `Conv1d` residuals — **the exact DAC-lineage kernel cluster Gerbil already shipped** (`Conv1dFull` dilated, `ConvTranspose1d` w/ output_padding, `Snake1d`, committed in the OmniVoice codec work, `ecff924`). Local attention is **disabled** in the 24 kHz checkpoint (`attn_window_size: null`) → no attention kernel needed in the vocoder.
+- **New kernel surface = 2 small ops only:** a **noise-injection block** and **depthwise conv** support in `Conv1dFull`.
+- SNAC = **19.8M params / ~80 MB**, single forward pass (not autoregressive), mobile-trivial.
+- **License: Apache-2.0** end-to-end (backbone + SNAC). Commercial-clean.
+## The size catch
+The only fully-trained public Orpheus checkpoint is **3B** → desktop-only (~1.8 GB at 4-bit). Canopy announced 150M/400M/1B variants, but the weights actually shipped center on 3B. Two honest options, both on the identical engine path:
+1. **Ship 3B desktop-flagged now** (best quality, fastest to a working demo).
+2. **Pair SNAC with a Llama-3.2-1B codec-LM** (~600 MB at 4-bit, iPad-viable).
+**Plan: prove the path end-to-end on 3B/desktop, then swap in a 1B backbone for mobile.** Only the backbone download changes; the engine work is identical.
+## Biggest risk
+**SNAC multi-scale RVQ token de-interleaving.** The 3 quantizers run at different rates (12/23/47 Hz, `vq_strides [4,2,1]`, ~7 codes/frame). Codes must be emitted by the backbone and regrouped in the right temporal hierarchy before the decoder. Backbone + vocoder kernels are easy; this glue is where bugs hide — validate token layout against the reference SNAC `decode()` first.
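The regrouping described above can be sketched as follows. The three rates fall out of the decoder hop (8·8·4·2 = 512 samples at 24 kHz) divided by each stride, and 7 codes per coarse frame is 1 + 2 + 4. The ordering of codes *within* a frame used here (one coarse code, then each mid code followed by its two fine codes) is an assumption for illustration; the real layout, along with any vocabulary offsets the backbone applies to audio tokens, has to be taken from the reference. `SnacLayers` and `deinterleave` are hypothetical names.

```typescript
// Hypothetical de-interleaver for SNAC-style multi-scale codes; not SNAC's or Orpheus's API.
const DECODER_RATES = [8, 8, 4, 2];
const VQ_STRIDES = [4, 2, 1];
const SNAC_SAMPLE_RATE_HZ = 24000;

const hopLength = DECODER_RATES.reduce((acc, r) => acc * r, 1);       // 512 samples
const layerRatesHz = VQ_STRIDES.map((s) => SNAC_SAMPLE_RATE_HZ / (hopLength * s));
// layerRatesHz ≈ [11.72, 23.44, 46.88], i.e. the 12/23/47 Hz quoted above.

const CODES_PER_FRAME = 7; // 1 coarse + 2 mid + 4 fine per coarse step

interface SnacLayers { coarse: number[]; mid: number[]; fine: number[] }

// ASSUMPTION: each frame is ordered [c, m0, f0, f1, m1, f2, f3]. Verify against the reference.
function deinterleave(flat: readonly number[]): SnacLayers {
  if (flat.length % CODES_PER_FRAME !== 0) {
    throw new RangeError("token count must be a multiple of 7");
  }
  const layers: SnacLayers = { coarse: [], mid: [], fine: [] };
  for (let i = 0; i < flat.length; i += CODES_PER_FRAME) {
    const [c, m0, f0, f1, m1, f2, f3] = flat.slice(i, i + CODES_PER_FRAME);
    layers.coarse.push(c);
    layers.mid.push(m0, m1);
    layers.fine.push(f0, f1, f2, f3);
  }
  return layers;
}

const layers = deinterleave([0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16]);
console.log(layerRatesHz, layers);
```

Feeding distinct integers through the function, as in the last lines, makes a wrong ordering visible before any audio is decoded.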
+## Rejected / excluded
+| Model | Why not |
+|---|---|
+| **OmniVoice** | Diffusion + novel MaskGIT decode driver; ~3 GB. (Codec kernels already salvaged.) |
+| **OuteTTS-0.3-500M** | Ideal Qwen2.5-0.5B backbone, but **WavTokenizer** vocoder = Vocos + ConvNeXt + **iSTFT** (the FFT exotica to avoid). |
+| **Llasa-1B** | Clean Llama-3.2-1B backbone, but **XCodec2** codec is 0.8B params (non-DAC, heavier) and **CC-BY-NC-4.0** (non-commercial). Backup only. |
+| **Kokoro-82M** | StyleTTS2, **not** an AR codec-LM; ISTFTNet (iSTFT) vocoder. Doesn't use the decode loop. |
+| **Parler-TTS-mini** | DAC vocoder (good) but T5-style **encoder-decoder** backbone ≠ the decode loop. |
+| **CSM-1B, Fish-Speech, Zonos** | Two-transformer / custom-codec / flow-matching — more novel work. Deprioritize. |
+## Why this beats OmniVoice
+Reuses two already-validated systems (the AR decode loop + the DAC vocoder cluster) and adds essentially one small codec decoder (2 ops). OmniVoice needs a brand-new diffusion decode driver *and* a diffusion stack — strictly more, heavier, novel code.
+## Build plan (when greenlit)
+1. Register a **Llama** generator (`llama.ts`, Tier-1 — copy `qwen2.ts`, drop QKV bias). Validate text coherence on a Llama-3.2 checkpoint.
+2. **Codec-LM head:** backbone emits SNAC token IDs (audio vocab) instead of text; wire the AR loop to produce the multi-scale RVQ codes.
+3. **SNAC decoder graph:** ConvTranspose1d [8,8,4,2] + Snake1d + dilated Conv1d residuals (reuse kernels) + the **noise-injection** and **depthwise-conv** additions. Convert SNAC's `pytorch_model.bin` → safetensors offline.
+4. **Token de-interleaving** glue (the risk) — validate against reference SNAC `decode()`.
+5. **`engine.speak(text) → {pcm, sampleRate}`** API + e2e validation (save `.wav`, measure real-time factor).
+6. Prove on 3B/desktop → swap 1B backbone for mobile.
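For the real-time-factor measurement in step 5, a small helper is enough. The sketch below assumes `engine.speak()` resolves to mono PCM in a `Float32Array` alongside its sample rate; the plan only specifies `{pcm, sampleRate}`, so the element type and the names `SpeakResult` and `realTimeFactor` are assumptions. RTF is defined here as synthesis time divided by audio duration, so values below 1 mean faster than real time.

```typescript
// Hypothetical helper; SpeakResult mirrors the planned engine.speak() return shape.
interface SpeakResult {
  pcm: Float32Array; // ASSUMPTION: mono samples
  sampleRate: number;
}

// RTF = wall-clock synthesis time / duration of the audio produced.
function realTimeFactor(result: SpeakResult, elapsedMs: number): number {
  const audioSeconds = result.pcm.length / result.sampleRate;
  if (!(audioSeconds > 0)) throw new RangeError("no audio produced");
  return elapsedMs / 1000 / audioSeconds;
}

// Example: 2 s of 24 kHz audio synthesized in 1.5 s of wall-clock time.
const demo: SpeakResult = { pcm: new Float32Array(48000), sampleRate: 24000 };
console.log(realTimeFactor(demo, 1500)); // 0.75
```

In a real run `elapsedMs` would come from timing the `speak` call itself, for example with `performance.now()` before and after.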
+## Sources
+- Orpheus-3B (Llama-3.2-3B + SNAC, Apache-2.0, safetensors): https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
+- SNAC 24 kHz (`attn_window_size: null`, decoder_rates [8,8,4,2], noise/depthwise, 19.8M, pytorch_model.bin): https://huggingface.co/hubertsiuzdak/snac_24khz · config: https://huggingface.co/hubertsiuzdak/snac_24khz/raw/main/config.json
+- SNAC paper / DAC lineage: https://arxiv.org/abs/2410.14411 · https://github.com/hubertsiuzdak/snac
+- Llasa-1B + XCodec2 (CC-BY-NC): https://huggingface.co/HKUSTAudio/Llasa-1B
+- OuteTTS-0.3-500M (WavTokenizer/iSTFT): https://huggingface.co/OuteAI/OuteTTS-0.3-500M
+- Kokoro-82M (StyleTTS2/ISTFTNet): https://huggingface.co/hexgrad/Kokoro-82M