@tryhamster/gerbil 1.0.0-rc.8 → 1.0.0

Files changed (179)
  1. package/LICENSE +1 -1
  2. package/README.md +247 -84
  3. package/dist/architectures-C1I5V3Dt.mjs +6070 -0
  4. package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
  5. package/dist/browser/index.d.ts +264 -588
  6. package/dist/browser/index.d.ts.map +1 -1
  7. package/dist/browser/index.js +585 -2334
  8. package/dist/browser/index.js.map +1 -1
  9. package/dist/cli.mjs +625 -1098
  10. package/dist/cli.mjs.map +1 -1
  11. package/dist/defaults-9komdrbY.mjs +24 -0
  12. package/dist/defaults-9komdrbY.mjs.map +1 -0
  13. package/dist/frameworks/express.d.mts +1 -3
  14. package/dist/frameworks/express.d.mts.map +1 -1
  15. package/dist/frameworks/express.mjs +7 -7
  16. package/dist/frameworks/express.mjs.map +1 -1
  17. package/dist/frameworks/fastify.d.mts +1 -1
  18. package/dist/frameworks/fastify.d.mts.map +1 -1
  19. package/dist/frameworks/fastify.mjs +3 -3
  20. package/dist/frameworks/fastify.mjs.map +1 -1
  21. package/dist/frameworks/hono.d.mts +1 -1
  22. package/dist/frameworks/hono.d.mts.map +1 -1
  23. package/dist/frameworks/hono.mjs +4 -4
  24. package/dist/frameworks/hono.mjs.map +1 -1
  25. package/dist/frameworks/next.d.mts +3 -2
  26. package/dist/frameworks/next.d.mts.map +1 -1
  27. package/dist/frameworks/next.mjs +4 -4
  28. package/dist/frameworks/next.mjs.map +1 -1
  29. package/dist/frameworks/react.d.mts +1 -1
  30. package/dist/frameworks/trpc.d.mts +1 -1
  31. package/dist/frameworks/trpc.d.mts.map +1 -1
  32. package/dist/frameworks/trpc.mjs +4 -4
  33. package/dist/frameworks/trpc.mjs.map +1 -1
  34. package/dist/gerbil-BHrJJIa4.mjs +1656 -0
  35. package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
  36. package/dist/gerbil-BT9fCydo.d.mts +488 -0
  37. package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
  38. package/dist/gerbil-DomNfIr1.mjs +4 -0
  39. package/dist/gpu/hooks.d.mts +520 -0
  40. package/dist/gpu/hooks.d.mts.map +1 -0
  41. package/dist/gpu/hooks.mjs +1188 -0
  42. package/dist/gpu/hooks.mjs.map +1 -0
  43. package/dist/gpu/index.d.mts +2 -0
  44. package/dist/gpu/index.mjs +6 -0
  45. package/dist/gpu-33qCAtHW.mjs +3615 -0
  46. package/dist/gpu-33qCAtHW.mjs.map +1 -0
  47. package/dist/index-Dgmb2kE3.d.mts +245 -0
  48. package/dist/index-Dgmb2kE3.d.mts.map +1 -0
  49. package/dist/index-jEAL2s-A.d.mts +2022 -0
  50. package/dist/index-jEAL2s-A.d.mts.map +1 -0
  51. package/dist/index.d.mts +22 -487
  52. package/dist/index.d.mts.map +1 -1
  53. package/dist/index.mjs +13 -8
  54. package/dist/index.mjs.map +1 -1
  55. package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
  56. package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
  57. package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
  58. package/dist/integrations/ai-sdk.d.mts +75 -6
  59. package/dist/integrations/ai-sdk.d.mts.map +1 -1
  60. package/dist/integrations/ai-sdk.mjs +131 -15
  61. package/dist/integrations/ai-sdk.mjs.map +1 -1
  62. package/dist/integrations/langchain.d.mts +1 -1
  63. package/dist/integrations/langchain.d.mts.map +1 -1
  64. package/dist/integrations/langchain.mjs +5 -5
  65. package/dist/integrations/langchain.mjs.map +1 -1
  66. package/dist/integrations/llamaindex.d.mts +1 -1
  67. package/dist/integrations/llamaindex.d.mts.map +1 -1
  68. package/dist/integrations/llamaindex.mjs +5 -5
  69. package/dist/integrations/llamaindex.mjs.map +1 -1
  70. package/dist/integrations/mcp-client.mjs +3 -3
  71. package/dist/integrations/mcp-client.mjs.map +1 -1
  72. package/dist/integrations/mcp.d.mts +3 -2
  73. package/dist/integrations/mcp.d.mts.map +1 -1
  74. package/dist/integrations/mcp.mjs +5 -5
  75. package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
  76. package/dist/mcp-1DaMsaBc.mjs.map +1 -0
  77. package/dist/memory/index.d.mts +3 -0
  78. package/dist/memory/index.mjs +6 -0
  79. package/dist/memory-D1P7Tmda.mjs +4 -0
  80. package/dist/memory-DVN0MnIG.mjs +132 -0
  81. package/dist/memory-DVN0MnIG.mjs.map +1 -0
  82. package/dist/memory-Dj0J1v88.mjs +294 -0
  83. package/dist/memory-Dj0J1v88.mjs.map +1 -0
  84. package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
  85. package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
  86. package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
  87. package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
  88. package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
  89. package/dist/repl-jV5gcJFA.mjs +9 -0
  90. package/dist/skills/index.d.mts +270 -320
  91. package/dist/skills/index.d.mts.map +1 -1
  92. package/dist/skills/index.mjs +5 -5
  93. package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
  94. package/dist/skills-DX8D59UH.mjs.map +1 -0
  95. package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
  96. package/dist/tools-DQ1mPUw5.mjs.map +1 -0
  97. package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
  98. package/dist/types-D6FiR_oh.d.mts.map +1 -0
  99. package/dist/types-DQBe2lFo.d.mts +165 -0
  100. package/dist/types-DQBe2lFo.d.mts.map +1 -0
  101. package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
  102. package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
  103. package/dist/vector-B0panuy6.mjs +95 -0
  104. package/dist/vector-B0panuy6.mjs.map +1 -0
  105. package/docs/PROJECT-STATE.md +321 -0
  106. package/docs/adding-a-model-family.md +280 -0
  107. package/docs/ai-sdk.md +70 -61
  108. package/docs/architecture/overview.md +17 -7
  109. package/docs/browser.md +203 -8
  110. package/docs/embeddings.md +156 -0
  111. package/docs/gerbil-site-native-migration.md +217 -0
  112. package/docs/gpu-engine/architectures.md +398 -0
  113. package/docs/gpu-engine/ir.md +372 -0
  114. package/docs/gpu-engine/kernels.md +718 -0
  115. package/docs/gpu-engine/paper.html +1759 -0
  116. package/docs/gpu-engine/paper.md +2109 -0
  117. package/docs/gpu-engine/safetensors.md +312 -0
  118. package/docs/gpu-engine/tokenizer.md +302 -0
  119. package/docs/memory-rag.md +91 -0
  120. package/docs/metal-safari-intel.md +190 -0
  121. package/docs/mobile-failure-diagnosis.md +124 -0
  122. package/docs/mobile.md +99 -0
  123. package/docs/observability.md +230 -0
  124. package/docs/onnx-removal-plan.md +339 -0
  125. package/docs/research/autoresearch-portable.md +904 -0
  126. package/docs/research/dispatch-reduction-hivemind.md +84 -0
  127. package/docs/research/ios-safari-model-caching.md +117 -0
  128. package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
  129. package/docs/research/native-stt-model-selection.md +49 -0
  130. package/docs/research/native-tts-model-selection.md +90 -0
  131. package/docs/research/native-vs-chromium-decision.md +152 -0
  132. package/docs/research/nemotron-mamba2-inference.md +910 -0
  133. package/docs/research/qwen35-multimodal.md +293 -0
  134. package/docs/research/qwen36-gemma4-targets.md +337 -0
  135. package/docs/research/sota-embedding-models.md +179 -0
  136. package/docs/research/sota-mobile-models-2026.md +263 -0
  137. package/docs/research/sota-modality-models.md +202 -0
  138. package/docs/research/tps-baselines.md +71 -0
  139. package/docs/research/webgpu-m4-reference.md +104 -0
  140. package/docs/site-update-plan.md +155 -0
  141. package/docs/structured-output.md +123 -0
  142. package/docs/stt.md +63 -446
  143. package/docs/tts.md +77 -499
  144. package/docs/vision.md +100 -338
  145. package/package.json +22 -7
  146. package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
  147. package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
  148. package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
  149. package/dist/gerbil-CJ3ifloF.mjs +0 -4
  150. package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
  151. package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
  152. package/dist/gerbil-qOTe1nl2.d.mts +0 -431
  153. package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
  154. package/dist/kokoro-BNTb6egA.mjs +0 -20210
  155. package/dist/kokoro-BNTb6egA.mjs.map +0 -1
  156. package/dist/kokoro-DFRQ1OeM.js +0 -20212
  157. package/dist/kokoro-DFRQ1OeM.js.map +0 -1
  158. package/dist/mcp-BvbriaBy.mjs.map +0 -1
  159. package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
  160. package/dist/repl-DveXw36T.mjs +0 -9
  161. package/dist/skills-CD3Orlex.mjs.map +0 -1
  162. package/dist/stt-CpLYbGFd.mjs +0 -433
  163. package/dist/stt-CpLYbGFd.mjs.map +0 -1
  164. package/dist/stt-DRPLEEHB.mjs +0 -3
  165. package/dist/stt-Te8Qz-Ay.js +0 -433
  166. package/dist/stt-Te8Qz-Ay.js.map +0 -1
  167. package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
  168. package/dist/transformers.web-DokyH3rP.js +0 -3
  169. package/dist/transformers.web-M6mCnEYJ.js +0 -30382
  170. package/dist/transformers.web-M6mCnEYJ.js.map +0 -1
  171. package/dist/tts-C0xx3CtE.js +0 -724
  172. package/dist/tts-C0xx3CtE.js.map +0 -1
  173. package/dist/tts-DXgsKGCe.mjs +0 -3
  174. package/dist/tts-DeGANMNV.mjs +0 -730
  175. package/dist/tts-DeGANMNV.mjs.map +0 -1
  176. package/dist/types-CiTc7ez3.d.mts.map +0 -1
  177. /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
  178. /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
  179. /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,2109 @@
1
+ # Gerbil WebGPU Inference Engine
2
+
3
+ **A WebGPU-native transformer inference engine for browser-based LLM execution.**
4
+
5
+ *Living technical document. Last updated: June 2026 (native audio — Moonshine STT + Kani-TTS-2 decoder — Gemma 4 E2B, on-device memory/RAG, and the text+ViT autoresearch campaign).*
6
+
7
+ ---
8
+
9
+ ## Table of Contents
10
+
11
+ 1. [Introduction & Motivation](#1-introduction--motivation)
12
+ 2. [Previous Approach & Its Shortcomings](#2-previous-approach--its-shortcomings)
13
+ 3. [Architecture Overview](#3-architecture-overview)
14
+ 4. [Intermediate Representation](#4-intermediate-representation)
15
+ 5. [Safetensors Parser](#5-safetensors-parser)
16
+ 6. [WebGPU Device Layer](#6-webgpu-device-layer)
17
+ 7. [WGSL Kernel Library](#7-wgsl-kernel-library)
18
+ 8. [Tokenizer](#8-tokenizer)
19
+ 9. [Sampler](#9-sampler)
20
+ 10. [Architecture Registry & Graph Generators](#10-architecture-registry--graph-generators)
21
+ 11. [KV Cache](#11-kv-cache)
22
+ 12. [Executor](#12-executor)
23
+ 13. [Model Loading Pipeline](#13-model-loading-pipeline)
24
+ 14. [Public API (WebGPUEngine)](#14-public-api-webgpuengine)
25
+ 15. [What's Built vs. What's Planned](#15-whats-built-vs-whats-planned)
26
+ 16. [Key Design Decisions](#16-key-design-decisions)
27
+ 17. [The Four-Failure-Mode Diagnosis (June 2026)](#17-the-four-failure-mode-diagnosis-june-2026)
28
+ 18. [The Mobile Fix Campaign](#18-the-mobile-fix-campaign)
29
+ 19. [Mobile Results & Submit-Granularity Scaling](#19-mobile-results--submit-granularity-scaling)
30
+ 20. [Autoresearch: Profile-Guided Throughput Optimization](#20-autoresearch-profile-guided-throughput-optimization-june-2026)
31
+ 21. [Native Text Embeddings](#21-native-text-embeddings-june-2026)
32
+ 22. [Native Vision Encoder (Qwen3.5 ViT)](#22-native-vision-encoder-qwen35-vit-june-2026)
33
+ 23. [The Native-Only Architecture Decision](#23-the-native-only-architecture-decision)
34
+ 24. [iOS Model-Caching Reality](#24-ios-model-caching-reality)
35
+ 25. [EmbeddingGemma-300M: a Second Embedding Family](#25-embeddinggemma-300m-a-second-embedding-family-validated-on-device-june-2026)
36
+ 26. [The SentencePiece Tokenizer Fix](#26-the-sentencepiece-tokenizer-fix-a-cross-family-lesson)
37
+ 27. [MLX-4bit Loading and the DWQ Trap](#27-mlx-4bit-loading-broadened-detection-and-the-dwq-trap)
38
+ 28. [The Progress-Reporting Fix](#28-the-progress-reporting-fix-killing-the-stuck-at-10-freeze)
39
+ 29. [Cross-Device Multi-Modal Parity](#29-cross-device-multi-modal-parity-june-2026)
40
+ 30. [Model-Zoo Growth and Kernel-Library Saturation](#30-model-zoo-growth-and-the-saturation-of-the-kernel-library)
41
+ 31. [Native STT: Moonshine](#31-native-stt-moonshine-june-2026)
42
+ 32. [Native TTS: Kani-TTS-2 and the NanoCodec Decoder](#32-native-tts-kani-tts-2-and-the-nanocodec-decoder-june-2026)
43
+ 33. [Gemma 4 E2B: PLE, KV-Sharing, and Logit Softcap](#33-gemma-4-e2b-ple-kv-sharing-and-logit-softcap-june-2026)
44
+ 34. [On-Device Memory / RAG](#34-on-device-memory-rag-june-2026)
45
+ 35. [The June-2026 Autoresearch Campaign (Text, LFM2, and ViT)](#35-the-june-2026-autoresearch-campaign-text-lfm2-and-vit)
46
+ - [Appendix A: File Map](#appendix-a-file-map)
47
+ - [Appendix B: WebGPU Browser Compatibility](#appendix-b-webgpu-browser-compatibility)
48
+
49
+ ---
50
+
51
+ ## 1. Introduction & Motivation
52
+
53
+ Gerbil is an on-device AI library for JavaScript. Its GPU engine is a WebGPU-native transformer inference engine that runs large language models directly in the browser with zero server-side dependencies. Given any HuggingFace model repository, the engine:
54
+
55
+ 1. Downloads the model's `config.json` and determines its architecture
56
+ 2. Generates a fine-grained computation graph (the IR) at runtime
57
+ 3. Downloads safetensors weight files and uploads them to GPU buffers
58
+ 4. Dispatches WGSL compute shaders for each operation in topological order
59
+ 5. Streams generated tokens back to the caller
60
+
61
+ The engine was built to replace an earlier approach based on transformers.js and ONNX Runtime Web, which suffered from fundamental architectural limitations. Those limitations -- iOS memory crashes, 50MB bundle sizes, webpack-within-webpack collisions, ONNX format lock-in, and three divergent inference paths -- motivated a complete rewrite with WebGPU as the single execution target.
62
+
63
+ The core philosophy is: **own every layer of the stack**. No ONNX runtime, no transformers.js, no external WASM blobs. Just TypeScript orchestration code and hand-written WGSL kernels, speaking directly to the WebGPU API. As of this cycle that single engine is no longer text-only: **native text embeddings** (Section 21) and the **native Qwen3.5 vision encoder** (Section 22, bit-exact vs HuggingFace) run through the same IR and kernel registry, and the architecture decision is explicitly **native-only** across text, vision, and embeddings — no fallback lane (Section 23).
64
+
65
+ As of June 2026 the engine runs Qwen3.5-0.8B (MLX 4-bit) with byte-identical greedy output on desktop Dawn (node, Chrome) and WebKit WebGPU (iPad), at **145 tok/s on M4 Max** and **31.7-35.9 tok/s on iPad (iPadOS 26.5)**, the first correct, crash-free on-device mobile results in the project's history. Getting there required splitting what had been treated as "one mobile bug" into four independent failure modes (Section 17) and running a fix campaign spanning memory layout, submit architecture, a WGSL data race, and platform detection (Section 18).
66
+
67
+ ---
68
+
69
+ ## 2. Previous Approach & Its Shortcomings
70
+
71
+ Understanding why the GPU engine was built requires understanding what came before it and why that approach was unsustainable.
72
+
73
+ ### 2.1 The transformers.js + ort-web Architecture
74
+
75
+ The original gerbil browser inference path used [transformers.js](https://github.com/huggingface/transformers.js) (from HuggingFace) as its runtime. The architecture was:
76
+
77
+ ```
78
+ User Code
79
+ -> gerbil browser SDK (createGerbilWorker)
80
+ -> Web Worker (blob URL, IIFE bundle)
81
+ -> transformers.js (AutoModelForCausalLM, AutoTokenizer)
82
+ -> ONNX Runtime Web (ort-web)
83
+ -> WebGPU backend (or WASM fallback)
84
+ -> ONNX model files (.onnx)
85
+ ```
86
+
87
+ transformers.js provided a high-level API (`AutoModelForCausalLM.from_pretrained`) that handled model loading, tokenization, and generation. Under the hood, it delegated all tensor math to ONNX Runtime Web (ort-web), which could target either WebGPU or WASM backends.
88
+
89
+ This worked for desktop browsers. On mobile -- particularly iOS -- it was catastrophic.
90
+
91
+ ### 2.2 The iOS Crisis
92
+
93
+ The single most painful failure mode: **models would download to 100% and then crash the page**.
94
+
95
+ iOS Safari and iOS Chrome (which uses WKWebView under the hood) impose a hard memory limit of approximately 300-400MB per web page. The ort-web WASM runtime alone consumed significant memory, and once model weights were loaded on top of that, the total exceeded the budget. The page would be silently terminated by the OS with no error, no exception, no callback -- just a blank page.
96
+
97
+ Every mitigation was attempted and failed:
98
+
99
+ - **Running in a Web Worker**: WKWebView imposes even stricter memory limits on workers than on the main thread. Moving inference to a worker made things worse.
100
+ - **Main-thread inference**: Avoided worker memory limits but froze the UI entirely during generation. Still crashed on models above ~600MB.
101
+ - **KV cache disposal**: Aggressively disposing past_key_values after each generation freed some memory but not enough. The WASM runtime's own memory footprint was the floor, and it was already above 200MB.
102
+ - **Model fallback chains**: Automatically trying smaller models when larger ones failed. This helped with availability but didn't solve the crash-then-reload loop.
103
+ - **Session phase tracking**: Using `sessionStorage` to detect crash-and-reload cycles (`detectMemoryCrash()` in `device-guards.ts`). This only diagnosed the problem; it couldn't prevent it.
104
+ - **Breadcrumb logging to localStorage**: Writing debug state before each critical step so that after a crash, the reload could read what step killed the page. This produced excellent diagnostics but no fix.
105
+
106
+ The iOS code path grew increasingly complex. `createIOSMainThreadWorker()` in `worker.ts` is 270+ lines of iOS-specific logic including CDN dynamic imports, breadcrumb logging, 128-token generation caps, and manual KV cache disposal -- all of which existed solely to work around ort-web's memory consumption.
107
+
108
+ ### 2.3 The Bundle Size Problem
109
+
110
+ The browser build bundled transformers.js and its dependencies (including kokoro-js for TTS) into a single ESM file via rolldown with `noExternal: ["@huggingface/transformers", "kokoro-js"]` and `inlineDynamicImports: true`. The resulting `dist/browser/index.js` was approximately 50MB.
111
+
112
+ This meant:
113
+
114
+ - Slow initial page loads even with caching
115
+ - Massive JavaScript parse times on mobile devices
116
+ - The browser had to allocate memory for the entire bundle before inference even began
117
+
118
+ ### 2.4 The Webpack-Within-Webpack Crash
119
+
120
+ This was perhaps the most bizarre bug in the entire project. The `@huggingface/transformers` package ships as a **pre-compiled webpack bundle**. When rolldown inlined this bundle into gerbil's ESM output, a subtle scoping issue emerged:
121
+
122
+ The transformers.js webpack runtime declares variables like `var __webpack_exports__`, `var __webpack_modules__`, `var __webpack_module_cache__`, and `var __webpack_require__`. `var` declarations are **function-scoped, not block-scoped**: a declaration that is not enclosed in a function of its own is hoisted to the top of the surrounding function, module, or `eval` scope. Arrow functions do create a `var` scope, just as ordinary functions do, so the declarations that caused the problem must have ended up outside any function wrapper in the inlined output.
123
+
124
+ When Next.js webpack processed gerbil's bundle (which now contained these inlined declarations), it evaluated the code inside `eval()` contexts where its own `__webpack_exports__` was a parameter. The inlined `var __webpack_exports__` was hoisted to the top of that evaluated code and **shadowed** webpack's own parameter with `undefined`. This caused:
125
+
126
+ ```
127
+ __webpack_require__.r(undefined)
128
+ -> Object.defineProperty(undefined, '__esModule', ...)
129
+ -> "Properties can only be defined on Objects"
130
+ ```
131
+
132
+ The fix was a `renderChunk` plugin in `tsdown.config.ts` that renamed all inner webpack runtime variables:
133
+
134
+ ```typescript
135
+ result = result.replaceAll("__webpack_exports__", "__gerbil_tf_exports__");
136
+ result = result.replaceAll("__webpack_modules__", "__gerbil_tf_modules__");
137
+ result = result.replaceAll("__webpack_module_cache__", "__gerbil_tf_cache__");
138
+ result = result.replaceAll("__webpack_require__", "__gerbil_tf_require__");
139
+ ```
140
+
141
+ Additionally, the plugin stripped `//# sourceMappingURL` comments to prevent the Next.js dev overlay from attempting to parse the 8MB source map file, which caused a stack overflow.
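A minimal sketch of such a hook follows, assuming a Rollup-style `renderChunk(code)` plugin interface. The four identifier pairs are taken from the snippet above. The names `WEBPACK_RENAMES`, `renameWebpackRuntime`, and `renameWebpackRuntimePlugin`, and the exact regular expression, are illustrative assumptions and not the contents of `tsdown.config.ts`.

```typescript
// Hypothetical sketch of a rename-and-strip renderChunk hook.
// Only the four identifier pairs come from the original snippet.
const WEBPACK_RENAMES: ReadonlyArray<readonly [string, string]> = [
  ["__webpack_exports__", "__gerbil_tf_exports__"],
  ["__webpack_modules__", "__gerbil_tf_modules__"],
  ["__webpack_module_cache__", "__gerbil_tf_cache__"],
  ["__webpack_require__", "__gerbil_tf_require__"],
];

function renameWebpackRuntime(code: string): string {
  let result = code;
  for (const [from, to] of WEBPACK_RENAMES) {
    // split/join replaces every occurrence, like String.prototype.replaceAll,
    // without requiring an ES2021 lib target.
    result = result.split(from).join(to);
  }
  // Remove whole-line source-map pointer comments (assumed comment form).
  return result.replace(/^\/\/# sourceMappingURL=.*$/gm, "");
}

// Plugin object in the shape Rollup-compatible bundlers accept (assumed).
const renameWebpackRuntimePlugin = {
  name: "rename-webpack-runtime",
  renderChunk(code: string): string {
    return renameWebpackRuntime(code);
  },
};
```

The sketch returns a plain string and ignores source maps entirely; because the renames change identifier lengths, a real plugin that cared about accurate maps would have to regenerate them.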
142
+
143
+ This hack worked, but it was fragile. Every transformers.js update risked breaking it, and it highlighted a fundamental problem: the dependency chain was too deep and too opaque.
144
+
145
+ ### 2.5 The ONNX Dependency
146
+
147
+ transformers.js required models to be pre-converted to ONNX format. This meant:
148
+
149
+ - Only models that had been converted and uploaded to HuggingFace in ONNX format could be used
150
+ - The conversion process itself was error-prone and model-specific
151
+ - Quantization was limited to what the ONNX ecosystem supported
152
+ - Users couldn't point at any arbitrary HuggingFace model repo and run it
153
+
154
+ ### 2.6 The Blob URL Worker Complexity
155
+
156
+ iOS WKWebView cannot load ES module workers (`new Worker(url, { type: "module" })`). The workaround was an IIFE blob URL worker:
157
+
158
+ 1. `worker-entry.ts` -- real TypeScript worker source with static transformers.js imports
159
+ 2. `scripts/build-worker.mjs` -- esbuild bundles worker-entry.ts as IIFE into `worker-code.generated.ts`
160
+ 3. `worker.ts` -- creates a classic (non-module) Blob worker using the `WORKER_CODE` string constant
161
+
162
+ This created a chain of build tooling issues:
163
+ - The IIFE bundle couldn't use `import.meta.url`, which ort-web relied on internally for locating its WASM files
164
+ - CDN paths for the WASM runtime (`ort.bundle.min.mjs`, `ort-wasm-simd-threaded.jsep.wasm`) had to be explicitly configured via `env.backends.onnx.wasm.wasmPaths`
165
+ - These CDN files had to be aliased to `false` in Next.js's webpack config to prevent webpack from trying to resolve them as modules
166
+
167
+ ### 2.7 Three Separate Inference Paths
168
+
169
+ The original gerbil maintained three completely different inference paths:
170
+
171
+ 1. **Browser WebGPU** via transformers.js (desktop Chrome, Edge, Safari 18+)
172
+ 2. **Browser WASM** via transformers.js (iOS, older browsers, low-GPU-memory devices)
173
+ 3. **Node.js** via server-side transformers.js with different model loading
174
+
175
+ Each path had different bugs, different performance characteristics, and different maintenance burdens. The `backend-selector.ts` file alone was over 200 lines of tiered decision logic (`selectBackend()`, `getModelFallbackChain()`, `getDeviceContextLimit()`) to route users to the right path.
176
+
177
+ ### 2.8 Summary: Why a Rewrite Was Necessary
178
+
179
+ | Problem | Root Cause |
180
+ |---------|-----------|
181
+ | iOS crashes after download | ort-web WASM runtime baseline memory too high for WKWebView |
182
+ | 50MB bundle | Bundling transformers.js + ort-web + kokoro-js |
183
+ | Webpack crash | `var` hoisting in pre-compiled webpack bundle |
184
+ | Limited model support | ONNX format requirement |
185
+ | Worker complexity | iOS WKWebView module worker limitations + ort WASM path resolution |
186
+ | Maintenance burden | Three divergent inference code paths |
187
+
188
+ The GPU engine eliminates all of these problems by owning the entire inference stack: a custom IR, custom safetensors parser, custom WGSL kernels, custom tokenizer, and custom sampler -- all speaking directly to the WebGPU API with zero intermediate runtimes.
189
+
190
+ ---
191
+
192
+ ## 3. Architecture Overview
193
+
194
+ The GPU engine is structured as a pipeline of cooperating components:
195
+
196
+ ```
197
+ HuggingFace Hub
198
+ |
199
+ model-loader.ts
200
+ / | \
201
+ config.json tokenizer.json *.safetensors
202
+ | | |
203
+ architectures/ tokenizer.ts safetensors.ts
204
+ qwen2.ts | |
205
+ | Tokenizer weight data
206
+ | |
207
+ ModelGraph (IR) |
208
+ | |
209
+ executor.ts <----------------------+
210
+ / | \
211
+ device.ts | kv-cache.ts
212
+ | |
213
+ WGSL kernels sampler.ts
214
+ (12 shaders) (CPU-side sampling)
215
+ | |
216
+ GPU dispatch token selection
217
+ | |
218
+ logits -----------> next token
219
+ ```
220
+
221
+ ### Component Responsibilities
222
+
223
+ | Component | File | Role |
224
+ |-----------|------|------|
225
+ | **IR** | `ir.ts` | Type definitions for the computation graph: OpType, TensorDesc, OpNode, ModelGraph |
226
+ | **Architecture Registry** | `architectures/index.ts` | Maps HF architecture strings to graph generators |
227
+ | **Graph Generator** | `architectures/qwen2.ts` | Builds a complete ModelGraph from a Qwen config.json |
228
+ | **Safetensors Parser** | `safetensors.ts` | Parses HF safetensors binary format into typed array views |
229
+ | **Device Layer** | `device.ts` | WebGPU initialization, buffer management, pipeline compilation, readback |
230
+ | **KV Cache** | `kv-cache.ts` | Pre-allocated GPU buffers for autoregressive key/value storage |
231
+ | **Executor** | `executor.ts` | Allocates buffers, dispatches ops, manages forward passes |
232
+ | **WGSL Kernels** | `kernels/wgsl/*.wgsl` | 12 hand-written compute shaders for all tensor operations |
233
+ | **Tokenizer** | `tokenizer.ts` | Pure JS BPE tokenizer from HF tokenizer.json |
234
+ | **Sampler** | `sampler.ts` | CPU-side temperature/top-k/top-p/repetition penalty sampling |
235
+ | **Model Loader** | `model-loader.ts` | HF Hub integration: fetch config, tokenizer, weights |
236
+
237
+ ### Data Flow
238
+
239
+ 1. **Model Load**: `model-loader.ts` fetches `config.json` from HuggingFace, determines the architecture (e.g., `Qwen2ForCausalLM`), and calls the matching graph generator. The generator produces a `ModelGraph` -- the complete IR specifying every tensor and every operation. The loader then downloads safetensors files, maps HF keys to canonical names, and returns the weights alongside the graph and tokenizer.
240
+
241
+ 2. **Initialization**: The `Executor` takes the `ModelGraph`, allocates a liveness-pooled set of GPU buffers for activation tensors (Section 18.1), creates the KV cache, and uploads weight data to GPU storage buffers.
242
+
243
+ 3. **Forward Pass**: Given input token IDs, the executor walks the `executionOrder` array, dispatching each operation's WGSL kernel with the appropriate buffers and uniforms. On Dawn (Chrome, node) all dispatches are batched into a single command encoder; on WebKit they are grouped into command buffers of a configurable size with one buffer in flight (Section 18.3).
244
+
245
+ 4. **Logit Readback**: After the forward pass, the `[1, vocab]` logits buffer (only the last position's logits are ever computed — Section 18.2) is copied to a MAP_READ staging buffer and read back to CPU as a `Float32Array`.
246
+
247
+ 5. **Sampling**: `sampler.ts` applies the sampling pipeline (repetition penalty, temperature, top-k, top-p, random selection) to produce the next token ID.
248
+
249
+ 6. **Decode Loop**: Steps 3-5 repeat with T=1 (single token) until an EOS token is sampled or the maximum length is reached.
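The prefill-then-decode control flow in steps 3-6 can be sketched as below. This is an illustration under stated assumptions, not the engine's code: `ForwardFn` and `generateGreedy` are hypothetical names, the forward pass is modelled as a synchronous callback (real WebGPU readback is asynchronous), and greedy argmax stands in for the full sampling pipeline.

```typescript
// Hypothetical stand-in for one forward pass: takes the new token IDs and the
// position of the first one, returns the last position's logits ([vocab]).
type ForwardFn = (tokenIds: number[], startPosition: number) => Float32Array;

// Index of the largest logit (greedy selection).
function argmax(logits: Float32Array): number {
  let bestIndex = 0;
  let bestValue = -Infinity;
  logits.forEach((value, index) => {
    if (value > bestValue) {
      bestValue = value;
      bestIndex = index;
    }
  });
  return bestIndex;
}

function generateGreedy(
  forward: ForwardFn,
  prompt: number[],
  eosId: number,
  maxNewTokens: number,
): number[] {
  const generated: number[] = [];
  if (prompt.length === 0 || maxNewTokens <= 0) return generated;

  // Prefill: one pass over the whole prompt (T = prompt.length).
  let logits = forward(prompt, 0);
  let position = prompt.length;

  while (true) {
    const next = argmax(logits);
    if (next === eosId) break; // stop on EOS
    generated.push(next);
    if (generated.length >= maxNewTokens) break; // stop at max length
    // Decode: one pass over the single new token (T = 1).
    logits = forward([next], position);
    position += 1;
  }
  return generated;
}
```

Passing the forward pass in as a callback keeps the loop independent of the GPU, so the same control flow can be exercised against a fake model in a unit test.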
250
+
251
+ ---
252
+
253
+ ## 4. Intermediate Representation
254
+
255
+ The IR (`ir.ts`) is the contract that every component builds on. It defines what the model computes, how tensors are shaped, and in what order operations execute.
256
+
257
+ ### Design Philosophy
258
+
259
+ The IR is **fine-grained** and **runtime-generated**. Rather than compiling a model to an opaque binary format (like ONNX), the engine generates the IR at runtime from a model's `config.json`. This means:
260
+
261
+ - **Any HuggingFace model** with a supported architecture can be used -- no pre-conversion step
262
+ - **Config changes** (different hidden sizes, layer counts, head counts) are handled automatically
263
+ - **The IR is inspectable**: it's a plain TypeScript object that can be logged, debugged, and visualized
264
+
265
+ ### OpType Taxonomy
266
+
267
+ Every computation the engine can perform is represented as an `OpType`:
268
+
269
+ **Core ops (implemented):**
270
+ - `Embedding` -- Lookup rows from embedding weight matrix by token ID
271
+ - `MatMul` -- Full-precision (f32) tiled matrix multiplication
272
+ - `MatMulInt4` -- Fused INT4 dequantize + matrix multiplication
273
+ - `Add` -- Element-wise addition (residual connections)
274
+ - `Mul` -- Element-wise multiplication (SwiGLU gate)
275
+ - `RMSNorm` -- Root mean square layer normalization
276
+ - `LayerNorm` -- Standard layer normalization (mean + variance)
277
+ - `RoPE` -- Rotary position embeddings
278
+ - `Attention` -- Scaled dot-product attention with causal mask and GQA
279
+ - `Softmax` -- Row-wise softmax
280
+ - `SiLU` -- Sigmoid linear unit activation
281
+ - `GELU` -- Gaussian error linear unit activation (tanh approximation, `gelu_pytorch_tanh`)
282
+ - `GeluErf` -- Exact (erf-based) GELU, used by the ViT merger
283
+ - `AddBias` -- Row-broadcast bias add (ViT projections carry bias)
284
+ - `ApplyRotaryEmb` -- Precomputed-cos/sin `rotate_half` rotary (ViT 2D RoPE)
285
+ - `SliceCols` -- Extract a column range (splits fused QKV in the ViT)
286
+ - `L2Norm` -- Row-wise L2 normalization (embedding tail)
287
+ - `SliceLastRow` -- Extract the last position's row (last-token pooling / lm_head)
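For readers who want the math behind a few of these ops, here is a CPU reference sketch of `RMSNorm`, `SiLU`, and the tanh-approximated `GELU`, using their textbook definitions. The function names are hypothetical and the WGSL kernels may differ in detail (for example in how the norm weight is applied for a given model family); this is only a reference for checking intuition.

```typescript
// RMSNorm over one row: y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
function rmsNorm(x: Float32Array, weight: Float32Array, eps: number): Float32Array {
  let sumOfSquares = 0;
  x.forEach((v) => {
    sumOfSquares += v * v;
  });
  const invRms = 1 / Math.sqrt(sumOfSquares / x.length + eps);
  return x.map((v, i) => v * invRms * weight[i]);
}

// SiLU: x * sigmoid(x).
function silu(x: number): number {
  return x / (1 + Math.exp(-x));
}

// GELU, tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3))).
function geluTanh(x: number): number {
  const c = Math.sqrt(2 / Math.PI);
  return 0.5 * x * (1 + Math.tanh(c * (x + 0.044715 * x * x * x)));
}
```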
288
+
289
+ **Structural ops (defined, not yet kernel-backed):**
290
+ - `Gather`, `Reshape`, `Transpose`, `Concat`
291
+
292
+ **Future ops (stubbed):**
293
+ - `MoERouter`, `ExpertMatMul` -- Mixture of Experts (when the first MoE model is added)
294
+ - `Conv2d`, `AvgPool2d`, `CrossAttention` -- audio / encoder-decoder (future)
295
+
296
+ ### TensorDesc
297
+
298
+ Every tensor in the graph is described by a `TensorDesc`:
299
+
300
+ ```typescript
301
+ interface TensorDesc {
302
+ name: string; // Unique name, e.g. "layers.0.self_attn.q_proj.weight"
303
+ shape: (number | string)[]; // Concrete or symbolic dimensions
304
+ dtype: DType; // "f32" | "f16" | "i32" | "u32" | "i4"
305
+ storage: TensorStorage; // "constant" | "activation" | "kv_cache"
306
+ safetensorsKey?: string; // Key in safetensors file (constants only)
307
+ }
308
+ ```
309
+
310
+ **Symbolic dimensions** are strings like `"T"` (current sequence length) and `"L_max"` (maximum cache length). These are resolved to concrete numbers at execution time. This allows the graph to be generated once and used for both prefill (T=prompt length) and decode (T=1) passes.
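Resolving symbolic dimensions amounts to a lookup per dimension. The sketch below assumes a hypothetical helper `resolveShape` and a plain bindings object; the engine's actual resolution code is not shown in this document.

```typescript
// A shape as it appears in TensorDesc: concrete numbers or symbolic names.
type SymbolicShape = (number | string)[];

// Replace each symbolic dimension with its bound value; fail loudly if unbound.
function resolveShape(shape: SymbolicShape, bindings: Record<string, number>): number[] {
  return shape.map((dim) => {
    if (typeof dim === "number") return dim;
    const value = bindings[dim];
    if (value === undefined) {
      throw new Error(`Unbound symbolic dimension "${dim}"`);
    }
    return value;
  });
}

// The same descriptor serves both passes; only the bindings change.
const hiddenStates: SymbolicShape = ["T", 1024];
const prefillShape = resolveShape(hiddenStates, { T: 7 }); // prompt of 7 tokens
const decodeShape = resolveShape(hiddenStates, { T: 1 }); // single-token decode
```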
311
+
312
+ **Storage types:**
313
+ - `constant` -- Weight tensors loaded from safetensors. Immutable after upload.
314
+ - `activation` -- Intermediate computation results. Buffers are pre-allocated at max size and reused.
315
+ - `kv_cache` -- Key/value cache tensors managed by the KV cache module.
316
+
317
+ ### OpNode
318
+
319
+ Each operation is represented as an `OpNode`:
320
+
321
+ ```typescript
322
+ interface OpNode {
323
+ id: string; // Unique node ID, e.g. "layer0_norm1"
324
+ opType: OpType; // Which operation to perform
325
+ inputs: string[]; // Input tensor names (order = kernel binding order)
326
+ outputs: string[]; // Output tensor names
327
+ attributes: Record<string, unknown>; // Op-specific params (hidden_size, eps, etc.)
328
+ }
329
+ ```
330
+
331
+ The `attributes` dictionary carries operation-specific parameters that the kernel needs. For example, an `RMSNorm` node carries `hidden_size` and `eps`; a `MatMul` node carries `M_tensor`, `K`, and `N`.
332
+
333
+ ### ModelGraph
334
+
335
+ The top-level container:
336
+
337
+ ```typescript
338
+ interface ModelGraph {
339
+ architecture: string; // e.g. "Qwen2ForCausalLM"
340
+ config: ModelArchConfig; // Resolved dimensions
341
+ capabilities: ModelCapabilities; // { text: true, vision: false, moe: false }
342
+ tensors: Record<string, TensorDesc>; // All tensors, keyed by name
343
+ nodes: OpNode[]; // All computation nodes
344
+ executionOrder: string[]; // Topologically-sorted node IDs
345
+ inputs: string[]; // Graph input names (["input_ids"])
346
+ outputs: string[]; // Graph output names (["logits"])
347
+ }
348
+ ```
349
+
350
+ ### Canonical Tensor Naming
351
+
352
+ The IR uses a canonical naming convention for weight tensors:
353
+
354
+ ```
355
+ embed_tokens.weight -- Token embedding matrix
356
+ layers.{i}.input_layernorm.weight -- Pre-attention norm
357
+ layers.{i}.self_attn.q_proj.weight -- Q projection
358
+ layers.{i}.self_attn.k_proj.weight -- K projection
359
+ layers.{i}.self_attn.v_proj.weight -- V projection
360
+ layers.{i}.self_attn.o_proj.weight -- Output projection
361
+ layers.{i}.post_attention_layernorm.weight -- Post-attention norm
362
+ layers.{i}.mlp.gate_proj.weight -- MLP gate (SwiGLU)
363
+ layers.{i}.mlp.up_proj.weight -- MLP up projection
364
+ layers.{i}.mlp.down_proj.weight -- MLP down projection
365
+ norm.weight -- Final norm
366
+ lm_head.weight -- Language model head
367
+ ```
368
+
369
+ This convention is defined in `CANONICAL_KEYS` and matches the common LLaMA/Qwen naming scheme. The `HFKeyMapper` (default: strip the `"model."` prefix) converts HuggingFace safetensors keys to these canonical names.
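A minimal sketch of the default mapping described above, assuming it only strips a leading `"model."` prefix. The function name `defaultKeyMapper` is hypothetical; the real `HFKeyMapper` may handle more cases.

```typescript
// Hypothetical stand-in for the default HFKeyMapper behaviour.
function defaultKeyMapper(hfKey: string): string {
  const prefix = "model.";
  // Keys without the prefix (e.g. "lm_head.weight") pass through unchanged.
  return hfKey.startsWith(prefix) ? hfKey.slice(prefix.length) : hfKey;
}

const canonicalKey = defaultKeyMapper("model.layers.0.self_attn.q_proj.weight");
```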
370
+
371
+ For more details, see [ir.md](./ir.md).
372
+
373
+ ---
374
+
375
+ ## 5. Safetensors Parser
376
+
377
+ The safetensors parser (`safetensors.ts`) reads HuggingFace's binary safetensors format into typed array views over the raw buffer.
378
+
379
+ ### Binary Format
380
+
381
+ ```
382
+ [8 bytes: header_length (little-endian u64)]
383
+ [header_length bytes: JSON header]
384
+ [remaining bytes: raw tensor data, contiguous]
385
+ ```
386
+
387
+ The JSON header maps tensor names to their metadata:
388
+
389
+ ```json
390
+ {
391
+ "model.layers.0.self_attn.q_proj.weight": {
392
+ "dtype": "F32",
393
+ "shape": [896, 896],
394
+ "data_offsets": [0, 3211264]
395
+ },
396
+ "__metadata__": {
397
+ "format": "pt"
398
+ }
399
+ }
400
+ ```
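The layout above can be read with a few lines of code. The following is a sketch, not the code in `safetensors.ts`: the function name is made up, and it assumes an ASCII-only JSON header so that it can avoid depending on `TextDecoder`.

```typescript
interface TensorEntry {
  dtype: string;
  shape: number[];
  data_offsets: [number, number]; // relative to the start of the tensor data region
}

// Hypothetical helper; assumes the JSON header contains only ASCII characters.
function parseSafetensorsHeader(buffer: ArrayBuffer): {
  header: Record<string, TensorEntry>;
  dataStart: number;
} {
  const view = new DataView(buffer);
  // Little-endian u64 length, read as two u32 halves to avoid BigInt.
  const lo = view.getUint32(0, true);
  const hi = view.getUint32(4, true);
  const headerLength = hi * 2 ** 32 + lo;

  const bytes = new Uint8Array(buffer, 8, headerLength);
  let json = "";
  for (let i = 0; i < bytes.length; i++) json += String.fromCharCode(bytes[i]);

  const parsed = JSON.parse(json) as Record<string, unknown>;
  delete parsed["__metadata__"]; // not a tensor entry
  return {
    header: parsed as Record<string, TensorEntry>,
    dataStart: 8 + headerLength,
  };
}
```

A tensor's bytes then live at `dataStart + data_offsets[0]` up to `dataStart + data_offsets[1]` in the file.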
401
+
402
+ ### Zero-Copy Design
403
+
404
+ `getTensorData()` creates typed array views directly into the original `ArrayBuffer` when byte alignment allows. For example, an F32 tensor at a 4-byte-aligned offset returns a `Float32Array` view with no data copy. When alignment requirements aren't met (rare in practice), the function copies the relevant slice into a new properly-aligned buffer.
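The alignment rule can be sketched for the F32 case as follows. `viewF32` is a hypothetical helper written for illustration; the real `getTensorData()` covers more dtypes.

```typescript
// Hypothetical helper showing the zero-copy path and the copy fallback.
function viewF32(buffer: ArrayBuffer, byteOffset: number, byteLength: number): Float32Array {
  const elementSize = Float32Array.BYTES_PER_ELEMENT; // 4
  if (byteOffset % elementSize === 0) {
    // Aligned: a view over the original buffer, no data copied.
    return new Float32Array(buffer, byteOffset, byteLength / elementSize);
  }
  // Misaligned: slice() copies the bytes into a fresh buffer that starts at offset 0.
  return new Float32Array(buffer.slice(byteOffset, byteOffset + byteLength));
}
```

The fallback is needed because the `Float32Array` constructor throws a `RangeError` when the byte offset is not a multiple of 4.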
405
+
406
+ ### Streaming Support
407
+
408
+ `parseSafetensorsFromResponse()` accepts a `Response` object and reads it into an `ArrayBuffer`. The header can be parsed independently of the tensor data, enabling a two-phase approach for large models: parse the header first to learn tensor sizes and offsets, then fetch tensor data by range request.
409
+
410
+ For more details, see [safetensors.md](./safetensors.md).
411
+
412
+ ---
413
+
414
+ ## 6. WebGPU Device Layer
415
+
416
+ The device layer (`device.ts`) wraps the WebGPU API with helpers for buffer management, pipeline compilation, and data readback.
417
+
418
+ ### Device Initialization
419
+
420
+ `initGPU()` requests a high-performance GPU adapter and device with:
421
+
422
+ - Maximum buffer size and storage buffer binding size from the adapter
423
+ - 256x256 max compute workgroup size
424
+ - `shader-f16` feature when available
425
+ - A device loss handler that logs the reason
426
+
427
+ ```typescript
428
+ const adapter = await navigator.gpu.requestAdapter({
429
+ powerPreference: "high-performance",
430
+ });
431
+ ```
432
+
433
+ If `shader-f16` is available, the pipeline compiler automatically prepends `enable f16;` to WGSL source code.
434
+
435
+ ### WebKit Implementation Detection (`isWebKitWebGPU`)
436
+
437
+ `initGPU()` classifies the WebGPU *implementation*, not the GPU hardware. This distinction matters: Dawn running on Apple Silicon (node-dawn, Chrome on macOS) reports `adapter.info.vendor === "apple"` exactly like Safari does, so adapter info **cannot** distinguish Dawn-on-Metal from WebKit-on-Metal. An earlier predicate that keyed on vendor/architecture routed desktop Dawn through every Safari workaround, collapsing desktop decode from ~161 to 19.7 tok/s (Section 17.4).
438
+
439
+ The detection is now purely user-agent based (`device.ts`): `AppleWebKit` in the UA, with iOS/iPadOS detected via `/iPhone|iPad|iPod/` or `maxTouchPoints > 1` (iPadOS masquerades as macOS), and macOS Safari as `AppleWebKit && !Chrome/`. Node has no `navigator.userAgent`, so node-dawn resolves to `false`. Only `isWebKitWebGPU === true` enables the grouped-submit path and other WebKit workarounds.
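A standalone sketch of a user-agent predicate along these lines is shown below. It is not the code in `device.ts`: the function name is hypothetical, and requiring `Macintosh` in the UA before trusting `maxTouchPoints` is an assumption added here to avoid matching touch-screen desktops.

```typescript
// Hypothetical predicate; the real check lives in device.ts and may differ.
function looksLikeWebKit(ua: string | undefined, maxTouchPoints = 0): boolean {
  // Node has no user agent, so node-dawn resolves to false.
  if (!ua || !/AppleWebKit/.test(ua)) return false;

  // iOS/iPadOS: explicit device tokens, or iPadOS masquerading as macOS.
  const isIOS =
    /iPhone|iPad|iPod/.test(ua) || (/Macintosh/.test(ua) && maxTouchPoints > 1);

  // macOS Safari: AppleWebKit without a "Chrome/" token.
  const isMacSafari = !/Chrome\//.test(ua);

  return isIOS || isMacSafari;
}
```

Chromium-based browsers also advertise `AppleWebKit`, which is why the `Chrome/` exclusion matters for desktop Dawn.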
440
+
441
+ ### Uncaptured Error Surfacing
442
+
443
+ `initGPU()` registers `device.onuncapturederror`. Without it, failed buffer allocations and invalid bind groups on Safari are completely silent — the affected dispatches no-op and the symptom is simply zero logits. `WebGPUEngine.create()` additionally wraps executor construction, weight upload, and bind-group creation in paired `pushErrorScope("out-of-memory")` / `pushErrorScope("validation")` scopes and throws a descriptive error if either pops non-null. Before this, every iPad failure was observed blind (Section 18.6).
444
+
445
+ ### Buffer Types
446
+
447
+ | Buffer Type | Usage Flags | Alignment | Purpose |
448
+ |------------|-------------|-----------|---------|
449
+ | Storage | STORAGE \| COPY_SRC \| COPY_DST | 4 bytes | Weights, activations, KV cache |
450
+ | Uniform | UNIFORM \| COPY_DST | 16 bytes | Kernel parameters |
451
+ | Readback | MAP_READ \| COPY_DST | 4 bytes | Reading logits back to CPU |
452
+
453
+ ### Pipeline Compilation and Caching
454
+
455
+ `getOrCreatePipeline()` compiles WGSL shader code into a `GPUComputePipeline` and caches it by a key of `shaderCode::entryPoint`. This ensures each unique kernel is compiled exactly once per session. The cache can be cleared with `clearPipelineCache()` when switching models or recovering from device loss.
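The caching scheme can be illustrated without any WebGPU types. In the sketch below, `makePipelineCache` and its `compile` callback are hypothetical stand-ins; in the real device layer the compile step would create a `GPUComputePipeline`.

```typescript
// Hypothetical generic cache keyed by `shaderCode::entryPoint`.
function makePipelineCache<P>(compile: (code: string, entryPoint: string) => P) {
  const cache = new Map<string, P>();
  return {
    getOrCreate(code: string, entryPoint: string): P {
      const key = `${code}::${entryPoint}`;
      let pipeline = cache.get(key);
      if (pipeline === undefined) {
        pipeline = compile(code, entryPoint); // compiled once per unique key
        cache.set(key, pipeline);
      }
      return pipeline;
    },
    clear(): void {
      cache.clear(); // e.g. when switching models or after device loss
    },
  };
}
```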
456
+
457
+ ### Logit Readback
458
+
459
+ The primary CPU-GPU synchronization point is logit readback. `readbackFloats()` copies data from a storage buffer to a MAP_READ staging buffer, maps it, and returns a `Float32Array` copy. This is used after each forward pass to get the logit distribution for sampling.
460
+
461
+ ### GPU Diagnostics Suite (A-P)
462
+
463
+ `verifyGPU()` runs 16 progressive diagnostics (A through P) testing increasingly complex GPU patterns. This isolates Safari/Metal-specific bugs without needing a full model load:
464
+
465
+ | Test | What It Verifies |
466
+ |------|-----------------|
467
+ | A-E | Basic compute, input binding, workgroups, shared memory, separate encoders |
468
+ | F-G | pack2x16float / unpack2x16float (packed-f16 KV path) |
469
+ | H | 256-thread tree reduction (RMSNorm/softmax pattern) |
470
+ | I | INT4 dequant (nibble extraction + scale/zero) |
471
+ | J | Multi-dispatch chain (3 sequential dispatches in one pass) |
472
+ | K | exp() clamp safety (Metal fast-math NaN check) |
473
+ | L | Large buffer integrity (1 MB upload + readback) |
474
+ | M | Offset buffer integrity (view with non-zero byteOffset, simulates safetensors views) |
475
+ | N | Real RMSNorm kernel (actual WGSL from registry, hidden=1024) |
476
+ | O | Real MatVec kernel (actual WGSL, K-parallel layout) |
477
+ | P | 300 dispatches in one compute pass (model does ~300/forward) |
478
+
479
+ Available via `WebGPUEngine.quickDiagnose()` (no model needed) or `engine.diagnose()`.
480
+
481
+ ### Integrity Checker
482
+
483
+ `integrityCheck()` reads back weight tensors from the GPU and runs a single forward pass, producing checksums that can be compared between Dawn (known-good) and Safari. This pinpoints WHERE corruption occurs:
484
+
485
+ - **Weights match, logits differ** → Kernel computation bug on Metal (WGSL→MSL miscompilation)
486
+ - **Weights match, embed_out all zeros** → writeBuffer/dispatch synchronization bug (Safari reads stale uniform params)
487
+ - **Weights don't match** → Data transfer/download corruption (Safari fetch or writeBuffer bug)
488
+
489
+ Weight checksums are compared against hardcoded Dawn reference values. Logits are validated for corruption signals (NaN, Inf, all-same values). The check runs automatically after the engine loads on all browsers, not just Safari; results are shown in the Playground UI for devices without console access.
490
+
491
+ ---
492
+
493
+ ## 7. WGSL Kernel Library
494
+
495
+ The engine includes 12 hand-written WGSL compute shaders covering all operations needed for transformer inference. Each kernel is in `src/gpu/kernels/wgsl/`.
496
+
497
+ For a comprehensive reference of each kernel's algorithm, bindings, uniform layout, dispatch formula, and optimization opportunities, see [kernels.md](./kernels.md).
498
+
499
+ ### Kernel Summary
500
+
501
+ | Kernel | File | Workgroup Size | Algorithm |
502
+ |--------|------|---------------|-----------|
503
+ | Embedding | `embedding.wgsl` | (256, 1, 1) | Parallel row gather from weight matrix |
504
+ | MatMul | `matmul.wgsl` | (16, 16, 1) | 16x16 tiled multiply with shared memory |
505
+ | MatMulInt4 | `matmul_int4.wgsl` | (16, 16, 1) | Fused INT4 dequantize + multiply |
506
+ | RMSNorm | `rmsnorm.wgsl` | (256, 1, 1) | One workgroup per row, tree reduction |
507
+ | LayerNorm | `layernorm.wgsl` | (256, 1, 1) | Two-pass (mean, variance), tree reduction |
508
+ | RoPE | `rope.wgsl` | (256, 1, 1) | Parallel pair rotation on Q and K |
509
+ | Attention | `attention.wgsl` | (256, 1, 1) | Per-head causal attention with GQA |
510
+ | Softmax | `softmax.wgsl` | (256, 1, 1) | Three-pass (max, sum, normalize) with tree reduction |
511
+ | SiLU | `silu.wgsl` | (256, 1, 1) | Element-wise x / (1 + exp(-x)) |
512
+ | GELU | `gelu.wgsl` | (256, 1, 1) | Approximate GELU with tanh |
513
+ | Add | `add.wgsl` | (256, 1, 1) | Element-wise addition |
514
+ | Mul | `mul.wgsl` | (256, 1, 1) | Element-wise multiplication |
515
+
516
+ ### Design Notes
517
+
518
+ The kernels use a performance-oriented design. Key optimizations:
519
+
520
+ - **Attention**: Tiled online-softmax kernel (Flash-Attention style). Processes KV cache in TILE_S=16 tiles with cooperative loading into shared memory. 256 threads split into groups of 16 for parallel Q·K dot products with vec4 loads, parallel tree reductions for max/sum, and per-dimension V accumulation. Uses exactly 16 KB shared memory (minimum WebGPU guarantee) for iOS compatibility. Score registers are saved before V tile load to allow full shared memory reuse. The Q·K score reduction is **two-phase**: leader threads reduce their group's partials into a local variable, hit a `workgroupBarrier()` in uniform control flow, and only then write `smem[pos_in_tile]`. The earlier single-phase version was a genuine WGSL data race — leader 0 read `smem[1..15]` while leaders 1..15 wrote `smem[0..15]` after the same barrier (Section 17.3). The barrier is a race fix, not a performance knob.
521
+ - **GEMV (decode)**: K-parallel matrix-vector multiply with 8 output columns × 32 K-threads per column, vec4 dot products, single-barrier reduction. Optimized for Apple Silicon SIMD width 32.
522
+ - **MatMul (prefill)**: 16×16 tiling with shared memory tiles.
523
+ - **SwiGLU**: Fused SiLU + Mul in a single kernel (two dispatches become one, saving 1 dispatch per MLP block).
524
+ - **SwiGLU MatVec**: Fused gate_proj + up_proj + SwiGLU for M=1 decode. Reads input vector once from L1 cache, computes both weight matrix projections, and applies SiLU gating — all in a single dispatch. Saves 2 dispatches per MLP block vs separate gate + up + SwiGLU. The executor detects the gate_proj → up_proj → SwiGLU pattern and automatically substitutes the fused kernel for decode.
525
+ - **ResidualRMSNorm**: Fused residual add + RMS normalization (saves 2 dispatches per layer).
526
+
527
+ Safari/Metal compatibility: the attention kernel avoids WGSL `select()` (Safari has correctness bugs) and clamps `exp()` arguments to -80 (Metal returns NaN for `exp(-1e30)` instead of 0). Additionally, Safari uses packed f16 KV cache (`array<u32>` + `pack2x16float`/`unpack2x16float`) instead of native `array<f16>`, which WebKit's WGSL compiler miscompiles. (The packed-f16 path was A/B-verified against `?kvf32=1` on iPadOS 26.5: both produce byte-identical greedy output — Section 19.) Submit granularity on WebKit is handled by the executor's grouped-submit path, not by kernel changes (Section 18.3).
528
+
529
+ The matmul kernel uses a straightforward 16x16 tiling strategy without register blocking.
530
+
531
+ ### Optimization Roadmap
532
+
533
+ Remaining optimizations, prioritized by expected impact. The list is based on research from 63 sources covering the WebGPU spec, Apple Silicon architecture docs, Chrome/Safari implementation details, and Metal compute best practices.
534
+
535
+ #### Priority 1: f16 KV Cache — DONE
536
+
537
+ Halves memory traffic through the attention kernel during decode (192 MB vs 384 MB for Qwen3.5-0.8B at maxSeqLen=4096).
538
+
539
+ **Three kernel strategies** via `KvMode`:
540
+ - **`native-f16`** (Chrome/Dawn): `enable f16;` + `array<f16>` for K/V buffers. Direct `f16()` cast on write, `f32()` on read. Best performance on platforms with correct f16 WGSL support.
541
+ - **`packed-f16`** (Safari): `array<u32>` + `pack2x16float()`/`unpack2x16float()`. No `enable f16` directive — avoids WebKit's WGSL compiler miscompilation of native f16 types. Same memory savings, slightly more ALU for pack/unpack.
542
+ - **`f32`** (fallback): Standard f32 buffers for devices without shader-f16.
543
+
544
+ **Auto-detection** in `WebGPUEngine.create()`: Safari → packed-f16, Chrome/Dawn → native-f16, no f16 → f32. Override via `?kvf32=1` or `?maxseq=N` URL params.
545
+
546
+ **Safari/WebKit bug**: Native `f16` types (`array<f16>`, `f16()` casts) cause garbage output on WebKit's WGSL compiler (confirmed iPad Safari, likely Metal shader compiler). The packed approach sidesteps this entirely by only using `u32` and `f32` types with standard pack/unpack built-ins.
547
+
548
+ #### Priority 2: Subgroup Operations (est. 1.1-1.3×)
549
+
550
+ The stable directive is `enable subgroups;` (shipped Chrome 134). Apple Silicon subgroupSize is always 32. The current 256-thread max reduction drops from 8 barriers + 1024 bytes of shared memory to 1 barrier + 32 bytes: each subgroup does a hardware `subgroupMax()`, leaders write to a tiny scratch array, one barrier, then subgroup 0 reduces the 8 values.
551
+
552
+ **Critical**: Subgroup ops must be in uniform control flow by default (`subgroup_uniformity` diagnostic errors otherwise). Cross-subgroup `if (sg_index == 0u)` blocks ARE uniform within each subgroup, so they pass analysis. Safari 26 does NOT expose subgroups — maintain the shared-memory fallback. Feature detection required.
553
+
554
+ #### Priority 3: Dynamic Uniform Buffer Offsets (est. 1.1-1.3×)
555
+
556
+ `minUniformBufferOffsetAlignment` = 256 bytes everywhere. Pack all ~441 op params at 256-byte stride into a single ~113 KB buffer. Can mix dynamic and static bindings in the same bind group. During dispatch, `pass.setBindGroup(0, bg, [i * 256])` selects each op's params. Only update the ~24 slots with changing `seq_pos` per step via targeted `writeBuffer` calls.
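The buffer size quoted above follows directly from the stride. The sketch below computes the layout for the proposed scheme; `packUniformSlots` is a hypothetical name, and the optimization itself is a roadmap item rather than shipped code.

```typescript
// 256-byte stride, matching minUniformBufferOffsetAlignment as stated above.
const UNIFORM_STRIDE = 256;

// Hypothetical layout helper: one fixed-size slot per op.
function packUniformSlots(opCount: number): { totalBytes: number; offsets: number[] } {
  const offsets: number[] = [];
  for (let i = 0; i < opCount; i++) offsets.push(i * UNIFORM_STRIDE);
  return { totalBytes: opCount * UNIFORM_STRIDE, offsets };
}

// 441 ops * 256 bytes = 112,896 bytes, i.e. the ~113 KB buffer.
const uniformLayout = packUniformSlots(441);
```

Each entry of `offsets` is the value that would be passed as the dynamic offset for that op's dispatch.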
557
+
558
+ #### Priority 4: Fused GEMV + Residual
559
+
560
+ Thread 0 of the K-parallel reduction adds `residual[row]` before writing output. This costs a single extra read and saves 48 dispatches, about 1.4 ms at 30 μs of overhead per dispatch.
561
+
562
+ #### Priority 5: Fused KV Append
563
+
564
+ Combine K+V cache writes into one kernel. Trivial, saves 12 dispatches. Best done alongside f16 KV cache work.
565
+
566
+ #### Priority 6: Advanced GEMV (Register Tiling)
567
+
568
+ Register tiling (4 output rows per thread) is the biggest GEMV win — amortizes input vector reads 4×, increasing arithmetic intensity. Apple GPU has 128 GPRs per SIMD-group, current kernels use ~20-40, so 4 extra accumulators fit easily. Double-buffering is overkill for single-token GEMV where the input vector (4 KB) fits in L1.
569
+
570
+ #### Priority 7: Timestamp Profiling
571
+
572
+ Per-pass timestamps via `timestampWrites` on pass descriptor. `writeTimestamp()` was removed from WebGPU spec. Safari 26 does NOT support `timestamp-query`. Chrome quantizes to 100μs by default; `chrome://flags/#enable-webgpu-developer-features` for full precision. Essential for finding real bottlenecks before investing in lower-impact optimizations.
573
+
574
+ #### Not Worth Pursuing
575
+
576
+ - **Indirect dispatch** (`dispatchWorkgroupsIndirect()`): Slower than direct dispatch due to Chrome's validation overhead. Not useful for fixed architectures.
577
+ - **Double-buffering input**: Overkill for decode where the input vector is 4 KB and fits in L1.
578
+
579
+ #### Platform Capability Matrix
580
+
581
+ | Feature | Chrome 134+ | Safari 26 |
582
+ |---------|-------------|-----------|
583
+ | `shader-f16` | ✅ | ✅ |
584
+ | `subgroups` | ✅ | ❌ |
585
+ | `timestamp-query` | ✅ (100μs quant) | ❌ |
586
+ | `maxComputeWorkgroupStorageSize` | 32768 (requestable) | 16384 (spec minimum) |
587
+
588
+ #### Apple Silicon Occupancy
589
+
590
+ Current 256-thread workgroups with 16 KiB shared memory give ~50% occupancy (2 threadgroups per compute unit sharing the 32 KiB threadgroup memory). This is good for memory-bound kernels where DRAM bandwidth (68-100 GB/s), not compute, is the bottleneck. Smaller workgroups would just increase reduction overhead. Using f16 types reduces register pressure, potentially enabling higher occupancy as a side benefit.
591
+
592
+ #### CPU-GPU Sync Notes
593
+
594
+ `mapAsync()` already guarantees prior work completion — no need for `onSubmittedWorkDone()`. `createCommandEncoder()` cost is <1μs and cannot be avoided (command buffers are single-use in WebGPU). The 441 dispatches cost ~3.5-5.5ms of CPU encoding time. Double-buffered staging buffers could overlap readback with the next step's encoding.
595
+
596
+ **Current baselines (2026-06-13, Qwen3.5-0.8B MLX 4-bit, greedy; see `docs/research/tps-baselines.md`)**: M4 Max node-dawn **207 tok/s** (cooled-stable, after the autoresearch optimization campaign in §20; 145 was the post-detection-fix starting point); iPad (iPadOS 26.5) **31.7 tok/s** batch-all, up to **51.7 tok/s** sustained.
597
+
598
+ **Goals**: desktop **180+ tok/s** via kernel tuning (K_THREADS/N_TILE/workgroup sweeps), fused KV append (−24 dispatches), fused GEMV+residual (−48 dispatches); iPad **50+ tok/s** via single-compute-pass batching on iPadOS 26.5+ plus dispatch-count reduction — every dispatch eliminated helps mobile 2-3× more than desktop because mobile is dispatch-overhead-bound (Section 19).
599
+
600
+ **Mobile invariants any optimization must respect**: iPad `maxComputeWorkgroupStorageSize` = 16384 (the attention kernel sits exactly at this limit); iPad default `maxBufferSize` = 256MB and `maxStorageBufferBindingSize` = 128MB (the INT4 embedding is ~127MB — no headroom for larger vocabularies without sharding); activation buffers are liveness-pooled, so fused kernels must never read and write the same pooled buffer in one dispatch; the two-phase attention reduction barrier is a race fix, not a perf knob.
601
+
602
+ ---
603
+
604
+ ## 8. Tokenizer
605
+
606
+ The tokenizer (`tokenizer.ts`) is a pure JavaScript BPE implementation that reads HuggingFace `tokenizer.json` files directly. No WASM dependencies, no external libraries.
607
+
608
+ ### Encoding Pipeline
609
+
610
+ 1. **Added token splitting**: Text is split around ALL added tokens (not just `special: true` ones) using a regex built from the sorted token list. This ensures tokens like `<think>` and `</think>` (which HF marks as non-special) are correctly recognized during encoding
611
+ 2. **Pre-tokenization**: Non-special segments are split using a GPT-style regex pattern that separates contractions, words, numbers, and punctuation
612
+ 3. **Byte-level encoding**: Characters are converted to the HF byte representation (space becomes `\u0120`, control characters are offset by 256)
613
+ 4. **BPE merge**: Character sequences are iteratively merged according to the merge priority table until no more merges are possible
614
+ 5. **Byte fallback**: Characters not in the vocabulary are encoded as `<0xHH>` byte-level tokens
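Step 4 is the core of the pipeline. The sketch below shows one common way to implement the merge loop; it is not taken from `tokenizer.ts`. The function name, the `"left right"` key format for the rank table, and merging one occurrence per iteration are all assumptions made for illustration.

```typescript
// Hypothetical BPE merge loop. `ranks` maps "left right" pairs to merge priority
// (lower rank merges first).
function bpeMerge(symbols: string[], ranks: Map<string, number>): string[] {
  const out = symbols.slice();
  while (out.length > 1) {
    let best = -1;
    let bestRank = Infinity;
    // Find the adjacent pair with the lowest rank (leftmost wins ties).
    for (let i = 0; i < out.length - 1; i++) {
      const rank = ranks.get(`${out[i]} ${out[i + 1]}`);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        best = i;
      }
    }
    if (best === -1) break; // no more applicable merges
    out.splice(best, 2, out[best] + out[best + 1]);
  }
  return out;
}

const mergeRanks = new Map<string, number>([
  ["l o", 0],
  ["lo w", 1],
  ["e r", 2],
]);
const mergedSymbols = bpeMerge(["l", "o", "w", "e", "r"], mergeRanks);
```

With this rank table the word merges as `l o` first, then `lo w`, then `e r`, leaving the two tokens `low` and `er`.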
615
+
616
+ ### Decoding
617
+
618
+ Decoding reverses the process: token IDs are mapped back to strings, joined, and post-processed to restore spaces (from `\u0120`) and byte-level tokens (from `<0xHH>` format).
619
+
620
+ ### Chat Template
621
+
622
+ The tokenizer implements ChatML format for chat-style models:
623
+
624
+ ```
625
+ <|im_start|>system
626
+ {system message}<|im_end|>
627
+ <|im_start|>user
628
+ {user message}<|im_end|>
629
+ <|im_start|>assistant
630
+ ```
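A formatter for this layout might look like the sketch below. The `formatChatML` name and the `ChatMessage` type are hypothetical, and the trailing newline after the final `assistant` header is an assumption rather than something the template above specifies.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical formatter for the ChatML layout shown above.
function formatChatML(messages: ChatMessage[]): string {
  const turns = messages.map(
    (m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>\n`,
  );
  // Open an assistant turn so the model generates the reply.
  return turns.join("") + "<|im_start|>assistant\n";
}

const chatPrompt = formatChatML([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hello!" },
]);
```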
631
+
632
+ A TODO exists for parsing Jinja2 templates from `tokenizer_config.json` for full generality across model families.
633
+
634
+ For a detailed walkthrough with a worked example, see [tokenizer.md](./tokenizer.md).
635
+
636
+ ---
637
+
638
+ ## 9. Sampler
639
+
640
+ Token sampling is performed on CPU (`sampler.ts`). The rationale is straightforward: for a vocabulary of ~150K tokens, the logits array is ~600 KB (150K × 4 bytes per f32), small enough that a single CPU pass over it is cheap. Moving sampling to GPU would add dispatch overhead and a synchronization point without meaningful benefit.
641
+
642
+ ### Sampling Pipeline
643
+
644
+ Given a `Float32Array` of logits:
645
+
646
+ 1. **Greedy check**: If temperature < 1e-6, return `argmax(logits)` immediately
647
+ 2. **Copy**: Work on a copy to avoid mutating the source
648
+ 3. **Repetition penalty**: For each previously generated token, divide positive logits by the penalty factor and multiply negative logits by it
649
+ 4. **Temperature**: Divide all scores by temperature
650
+ 5. **Top-k**: Sort scores descending, keep only the top K candidates
651
+ 6. **Softmax**: Compute `exp(score - max)` and normalize to get probabilities
652
+ 7. **Top-p (nucleus)**: Keep the smallest set of tokens whose cumulative probability exceeds `topP`, then re-normalize
653
+ 8. **Weighted random sample**: Draw a random number and walk the cumulative distribution
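The eight steps above can be sketched in one function. This is an illustrative reimplementation, not the contents of `sampler.ts`: the name `sampleSketch`, the option shape, the injectable `random` parameter, and applying the repetition penalty once per distinct token are all assumptions.

```typescript
interface SampleOptions {
  temperature: number;
  topK: number;
  topP: number;
  repetitionPenalty: number;
}

function sampleSketch(
  logits: Float32Array,
  previous: number[],
  opts: SampleOptions,
  random: () => number = Math.random,
): number {
  // 1. Greedy check
  if (opts.temperature < 1e-6) {
    let best = 0;
    for (let i = 1; i < logits.length; i++) if (logits[i] > logits[best]) best = i;
    return best;
  }
  // 2. Copy so the caller's logits are not mutated
  const scores = Float32Array.from(logits);
  // 3. Repetition penalty (once per distinct previous token)
  const seen = new Set<number>();
  for (const id of previous) {
    if (seen.has(id)) continue;
    seen.add(id);
    scores[id] = scores[id] > 0
      ? scores[id] / opts.repetitionPenalty
      : scores[id] * opts.repetitionPenalty;
  }
  // 4 + 5. Temperature, then keep the top K candidates
  const candidates = Array.from(scores, (s, id) => ({ id, score: s / opts.temperature }))
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(1, opts.topK));
  // 6. Softmax over the survivors
  const max = candidates[0].score;
  const exps = candidates.map((c) => Math.exp(c.score - max));
  const total = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / total);
  // 7. Top-p: smallest prefix whose cumulative probability exceeds topP
  let keep = 0;
  let cumulative = 0;
  while (keep < probs.length) {
    cumulative += probs[keep++];
    if (cumulative > opts.topP) break;
  }
  // 8. Weighted draw; scaling by `cumulative` re-normalizes the nucleus
  let r = random() * cumulative;
  for (let i = 0; i < keep; i++) {
    r -= probs[i];
    if (r <= 0) return candidates[i].id;
  }
  return candidates[keep - 1].id;
}
```

Passing a fixed `random` function makes the draw reproducible, which is convenient for tests.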
654
+
655
+ ### Default Parameters
656
+
657
+ ```typescript
658
+ temperature: 0.7
659
+ topK: 50
660
+ topP: 0.9
661
+ repetitionPenalty: 1.0 // disabled
662
+ ```
663
+
664
+ ---
665
+
666
+ ## 10. Architecture Registry & Graph Generators
667
+
668
+ The architecture registry (`architectures/index.ts`) maps HuggingFace model architecture strings (from `config.architectures[0]`) to graph generator functions.
669
+
670
+ ### How It Works
671
+
672
+ ```typescript
673
+ const ARCHITECTURES: Record<string, GraphGenerator> = {
674
+ Qwen2ForCausalLM: generateQwen2Graph,
675
+ Qwen3ForCausalLM: generateQwen2Graph,
676
+ // LlamaForCausalLM: generateLlamaGraph, // TODO
677
+ };
678
+ ```
679
+
680
+ When a model is loaded, the engine reads `config.architectures[0]` and looks it up in this registry. The matched generator receives the raw `config.json` object and returns a complete `ModelGraph`.
681
+
682
+ ### Qwen2 Graph Generator
683
+
684
+ `architectures/qwen2.ts` generates the graph for Qwen2/Qwen3 models. Each transformer layer produces 15 operation nodes:
685
+
686
+ 1. `RMSNorm` -- input_layernorm
687
+ 2. `MatMul` -- Q projection
688
+ 3. `MatMul` -- K projection
689
+ 4. `MatMul` -- V projection
690
+ 5. `RoPE` -- rotate Q and K
691
+ 6. `Attention` -- scaled dot-product with causal mask and GQA
692
+ 7. `MatMul` -- output projection
693
+ 8. `Add` -- residual connection 1
694
+ 9. `RMSNorm` -- post_attention_layernorm
695
+ 10. `MatMul` -- gate projection (SwiGLU)
696
+ 11. `MatMul` -- up projection
697
+ 12. `SiLU` -- activation on gate output
698
+ 13. `Mul` -- gate * up (SwiGLU combine)
699
+ 14. `MatMul` -- down projection
700
+ 15. `Add` -- residual connection 2
701
+
702
+ Plus 4 global nodes: `Embedding`, `RMSNorm` (final), `SliceLastRow` (extract the last position of `final_norm_out`), and `MatMul` (lm_head, M=1). Both graph generators (`qwen2.ts`, `qwen3_5.ts`) emit the `SliceLastRow` → `lm_head` pair so that logits are computed only for the last position (Section 18.2).
703
+
704
+ For a Qwen3.5-0.8B model (24 layers), this produces:
705
+ - ~360 operation nodes (15 per layer * 24 + 4 global = 364)
706
+ - ~500+ tensor descriptors
707
+
708
+ ### Adding a New Architecture
709
+
710
+ See [architectures.md](./architectures.md) for a step-by-step guide.
711
+
712
+ ---
713
+
714
+ ## 11. KV Cache
715
+
716
+ The KV cache (`kv-cache.ts`) stores previously computed key and value vectors in GPU memory, enabling autoregressive generation without recomputing attention over the full sequence at each step.
717
+
718
+ ### LHSd Layout
719
+
720
+ The cache uses LHSd (Layer, Head, Sequence, head_dim) layout:
721
+
722
+ ```
723
+ Per layer:
724
+ K buffer: [num_kv_heads, max_seq_len, head_dim] f32
725
+ V buffer: [num_kv_heads, max_seq_len, head_dim] f32
726
+ ```
727
+
728
+ Each layer has its own pair of K and V GPU storage buffers.
729
+
730
+ ### Pre-allocation Strategy
731
+
732
+ Buffers are allocated at `max_seq_len` capacity when the executor is created. This avoids dynamic reallocation during generation. The tradeoff is higher initial memory usage, but it eliminates allocation jitter and potential out-of-memory errors mid-generation.
733
+
734
+ ### Memory Budget Calculation
735
+
736
+ For a model with `L` layers, `H_kv` KV heads, `d` head dimension, and max sequence length `S`:
737
+
738
+ ```
739
+ KV cache bytes = L * H_kv * S * d * 4 (f32) * 2 (K + V)
740
+ ```
741
+
742
+ Example for Qwen3.5-0.8B (24 layers, 4 KV heads, 64 head_dim, 2048 max seq):
743
+ ```
744
+ 24 * 4 * 2048 * 64 * 4 * 2 = 100,663,296 bytes (~96 MB)
745
+ ```
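The same formula as a small function, so that other configurations can be checked quickly. `kvCacheBytes` is a hypothetical helper; the `bytesPerElement` parameter is added here to cover the 2-byte f16 cache modes.

```typescript
// Hypothetical helper implementing the budget formula above.
function kvCacheBytes(
  layers: number,
  kvHeads: number,
  maxSeqLen: number,
  headDim: number,
  bytesPerElement = 4, // 4 for f32, 2 for f16
): number {
  const kAndV = 2;
  return layers * kvHeads * maxSeqLen * headDim * bytesPerElement * kAndV;
}

const f32Budget = kvCacheBytes(24, 4, 2048, 64); // the example above
const f16Budget = kvCacheBytes(24, 4, 2048, 64, 2); // half the f32 figure
```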
746
+
747
+ ### Prefill vs. Decode
748
+
749
+ - **Prefill** (T = prompt length): All T tokens' K/V entries are written at once at positions [0, T-1]. The attention kernel reads and writes the full range.
750
+ - **Decode** (T = 1): A single K/V entry is written at position `seqPos`. The attention kernel reads positions [0, seqPos] and writes position [seqPos].
751
+
752
+ The `seqPos` counter in the `KVCache` struct tracks how many tokens have been cached. `advanceKVCache()` advances it by T after each forward pass. `resetKVCache()` sets it to 0 for a new generation without reallocating buffers.
753
+
754
+ ---
755
+
756
+ ## 12. Executor
757
+
758
+ The executor (`executor.ts`) is the core runtime that orchestrates buffer allocation, weight upload, and compute dispatch.
759
+
760
+ ### Buffer Allocation Strategy
761
+
762
+ The executor allocates activation buffers once at construction, sized for `maxSeqLen` — there is no dynamic allocation during generation and memory usage is deterministic and front-loaded. But buffers are **not** one-per-tensor: `allocateActivationBuffers()` runs a last-use liveness analysis over `graph.executionOrder` and recycles buffers through a size-keyed free pool, so tensors that are never live simultaneously share storage. For Qwen3.5-0.8B this collapses 431 activation tensors to 20 physical buffers (37 MB at T=256) — versus ~2.3 GB for the naive per-tensor scheme that was jetsam-killing iPads (Section 17.1). Graph outputs and tensors that are read before they are written (cross-forward state) are excluded from pooling and keep dedicated buffers. See Section 18.1 for the full algorithm.
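To make the pooling idea concrete, here is a simplified sketch of last-use liveness with a size-keyed free pool. It is not `allocateActivationBuffers()`: the names `PoolNode` and `planBuffers`, the `pinned` set standing in for graph outputs and cross-forward state, and the integer buffer IDs are all assumptions.

```typescript
interface PoolNode {
  inputs: string[];
  outputs: string[];
}

// Hypothetical planner: maps tensor names to physical buffer IDs.
function planBuffers(
  order: PoolNode[],
  sizes: Record<string, number>,
  pinned: Set<string>, // tensors that keep dedicated buffers
): { assignment: Map<string, number>; bufferCount: number } {
  // Last node index at which each tensor is touched.
  const lastUse = new Map<string, number>();
  order.forEach((node, i) => {
    for (const name of node.inputs.concat(node.outputs)) lastUse.set(name, i);
  });

  const free = new Map<number, number[]>(); // byte size -> free buffer IDs
  const assignment = new Map<string, number>();
  const released = new Set<string>();
  let bufferCount = 0;

  order.forEach((node, i) => {
    // Allocate outputs first, so a node never writes a buffer it is reading.
    for (const name of node.outputs) {
      if (assignment.has(name)) continue;
      const pool = pinned.has(name) ? undefined : free.get(sizes[name]);
      const id = pool && pool.length > 0 ? pool.pop()! : bufferCount++;
      assignment.set(name, id);
    }
    // Then return buffers whose tensors die at this node.
    for (const name of node.inputs.concat(node.outputs)) {
      const id = assignment.get(name);
      if (id === undefined || pinned.has(name) || released.has(name)) continue;
      if (lastUse.get(name) !== i) continue;
      released.add(name);
      const list = free.get(sizes[name]) ?? [];
      list.push(id);
      free.set(sizes[name], list);
    }
  });
  return { assignment, bufferCount };
}
```

Releasing only after the node's outputs are assigned is what keeps an input and an output of the same dispatch from sharing storage.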
763
+
764
+ Weight buffers are allocated on demand during `uploadWeights()` and are sized to exactly match the data.
765
+
766
+ ### Forward Pass
767
+
768
+ ```typescript
769
+ async forward(inputIds: Uint32Array): Promise<ForwardResult>
770
+ ```
771
+
772
+ The forward pass uses a **two-phase dispatch pattern**:
773
+
774
+ **Phase 1 — Uniform buffer updates** (before any dispatch):
775
+
776
+ 1. Writes token IDs to the pre-allocated `input_ids` buffer
777
+ 2. Resolves all symbolic shapes (`"T"` -> actual T, `"L_max"` -> seqPos + T)
778
+ 3. For each operation: builds uniform params, compares against cached bytes, and calls `device.queue.writeBuffer()` only if changed
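Step 3 amounts to a byte-level change check in front of each write. The sketch below shows the idea with a callback standing in for `device.queue.writeBuffer()`; `makeUniformWriter` and its slot numbering are hypothetical.

```typescript
// Hypothetical change-detecting writer; `write` stands in for queue.writeBuffer().
function makeUniformWriter(write: (slot: number, bytes: Uint8Array) => void) {
  const cache = new Map<number, Uint8Array>();
  return function update(slot: number, bytes: Uint8Array): boolean {
    const prev = cache.get(slot);
    const unchanged =
      prev !== undefined &&
      prev.length === bytes.length &&
      prev.every((b, i) => b === bytes[i]);
    if (unchanged) return false; // skip the upload
    cache.set(slot, bytes.slice()); // keep a private copy for the next comparison
    write(slot, bytes);
    return true;
  };
}
```

During decode most ops have identical params from step to step, so only the slots that depend on the sequence position end up being written.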
779
+
780
+ **Phase 2 — Compute dispatch** (no writeBuffer calls), with a per-implementation submit strategy:
781
+
782
+ - **Dawn (Chrome, node)**: a single `GPUCommandEncoder` with one compute pass containing all ~440 dispatches, plus the logits copy, submitted in one batch.
783
+ - **WebKit** (`needsMultiEncoder === true`): dispatches are grouped into command buffers of `webkitGroupSize` (one compute pass per dispatch), each submitted and **awaited via `onSubmittedWorkDone()` before the next is encoded** — exactly one command buffer in flight. The logits copy goes in its own final submit. See Section 18.3 for why, and Section 19 for the throughput-vs-group-size curve.
784
+
785
+ Either way the pass then:
786
+
787
+ 4. Copies the `[1, vocab]` logits buffer to the readback buffer (full buffer, offset 0 — `SliceLastRow` already restricted lm_head to the last position)
788
+ 5. Maps the readback buffer and returns the logits as `Float32Array`
789
+ 6. Advances the KV cache position by T
790
+
791
+ `forwardArgmax()` (greedy decode without logits readback) follows the same structure, with the argmax dispatch and its 4-byte readback copy in separate submits on WebKit, mirroring `forward()`.
792
+
793
+ > **History — the "Safari writeBuffer bug" callout, revised**: An earlier version of this section blamed iPad gibberish on WebKit failing to synchronize `writeBuffer()` calls made *during* an active compute pass: stale zero params → all threads exit early → all-zero output → degenerate text at ~276 tok/s (zero-work dispatches are nearly free, which is why the "throughput" was 2-4× desktop). The throughput signature and the "zeros, not garbage" character of that diagnosis were correct — the recorded gibberish *was* zero logits in disguise. But the June 2026 investigation **refuted stale params as the mechanism** (Section 17.2): the kernels' write guards mean zeroed params would have written *nothing*, yet probes showed zeros actively overwriting pre-seeded data, and the failure flipped correct↔zeros purely on submit granularity with byte-identical CPU-side writes. The real cause was a within-submission storage-visibility bug in WebKit, since fixed upstream (does not reproduce on iPadOS 26.5). The two-phase write-then-dispatch pattern is retained as cheap, spec-clean hygiene, but it was never the fix.
794
+
795
+ ### Prefill vs. Decode Dispatch
796
+
797
+ The same forward pass code handles both prefill and decode -- the only difference is the value of T:
798
+
799
+ - **Prefill**: T = prompt token count (e.g., 50). Matmul dispatch sizes are larger, attention processes T queries against T keys.
800
+ - **Decode**: T = 1. Single-token forward pass. Matmul dispatch is minimal, attention processes 1 query against seqPos keys.
801
+
802
+ The kernel dispatch sizing functions (in the kernel registry) use the resolved shapes to compute appropriate workgroup counts for either case.
803
+
804
+ ---
805
+
806
+ ## 13. Model Loading Pipeline
807
+
808
+ The model loader (`model-loader.ts`) orchestrates the entire process of downloading and preparing a model from HuggingFace Hub.
809
+
810
+ ### Loading Steps
811
+
812
+ 1. **Fetch config.json**: Determine architecture, extract dimensions
813
+ 2. **Generate IR**: Call the matching graph generator to produce the `ModelGraph`
814
+ 3. **Fetch tokenizer**: Download `tokenizer.json` and `tokenizer_config.json` in parallel, construct the `Tokenizer`
815
+ 4. **Discover safetensors files**: Try `model.safetensors.index.json` first (for sharded models), fall back to single `model.safetensors`
816
+ 5. **Download weight files**: Stream each safetensors file with progress callbacks
817
+ 6. **Parse and map weights**: Parse safetensors headers, extract typed array views, map HF keys to canonical names via `HFKeyMapper`
818
+ 7. **Verify**: Check that all expected weight tensors are present; warn about missing ones
819
+
820
+ ### HuggingFace Hub Integration
821
+
822
+ The loader resolves repo strings to HF Hub URLs:
823
+
824
+ ```
825
+ "Qwen/Qwen3.5-0.8B" -> "https://huggingface.co/Qwen/Qwen3.5-0.8B/resolve/main"
826
+ ```
827
+
828
+ It supports:
829
+ - Gated models via `hfToken` authentication
830
+ - Custom revisions/branches
831
+ - Full URL pass-through (for self-hosted models)
832
+ - Multi-shard safetensors models (via the index file)
833
+
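The resolution rule above can be sketched as follows. This is an illustration under assumptions: `resolveHubBase` and `authHeaders` are hypothetical names, and the real loader's option handling may differ. The `Authorization: Bearer` header is the standard way to pass a Hugging Face token.

```typescript
const HF_BASE = "https://huggingface.co";

// Hypothetical helper: turn a repo string (or a full URL) into a base URL
// that file names such as "config.json" can be appended to.
function resolveHubBase(repoOrUrl: string, revision: string = "main"): string {
  // Full URLs pass through unchanged (self-hosted models), minus trailing slashes.
  if (/^https?:\/\//.test(repoOrUrl)) {
    return repoOrUrl.replace(/\/+$/, "");
  }
  return HF_BASE + "/" + repoOrUrl + "/resolve/" + encodeURIComponent(revision);
}

// Hypothetical helper: request headers for gated models.
function authHeaders(hfToken?: string): Record<string, string> {
  return hfToken ? { Authorization: "Bearer " + hfToken } : {};
}
```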
834
+ ### Progress Tracking
835
+
836
+ The loader provides normalized progress (0-100) to the caller:
837
+ - 0-5%: Fetching config
838
+ - 5-10%: Fetching tokenizer
839
+ - 10-95%: Downloading weight files (distributed evenly across shards)
840
+ - 95-100%: Final verification
841
+
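One way to map per-phase progress onto the 0-100 ranges listed above is sketched below. The `Phase` type and `normalizedProgress` function are hypothetical; only the percentage ranges and the even split across shards come from the text.

```typescript
type Phase = "config" | "tokenizer" | "weights" | "verify";

// Percentage ranges taken from the list above.
const PHASE_RANGE: Record<Phase, [number, number]> = {
  config: [0, 5],
  tokenizer: [5, 10],
  weights: [10, 95],
  verify: [95, 100],
};

// fraction is the 0..1 completion of the current phase (or current shard).
function normalizedProgress(
  phase: Phase,
  fraction: number,
  shardIndex: number = 0,
  shardCount: number = 1
): number {
  const range = PHASE_RANGE[phase];
  const f = Math.min(1, Math.max(0, fraction));
  // Each shard gets an equal slice of the weights range.
  const within = phase === "weights" ? (shardIndex + f) / shardCount : f;
  return range[0] + (range[1] - range[0]) * within;
}
```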
842
+ ---
843
+
844
+ ## 14. Public API (WebGPUEngine)
845
+
846
+ The engine is designed to be used through a high-level API surface. Based on the current codebase, the intended usage pattern is:
847
+
848
+ ```typescript
849
+ import { initGPU } from "./device.js";
850
+ import { loadModel } from "./model-loader.js";
851
+ import { Executor } from "./executor.js";
852
+ import { sampleToken } from "./sampler.js";
853
+
854
+ // 1. Initialize WebGPU
855
+ const ctx = await initGPU();
856
+
857
+ // 2. Load model from HuggingFace
858
+ const { graph, tokenizer, weights } = await loadModel({
859
+ repo: "Qwen/Qwen3.5-0.8B",
860
+ onProgress: (loaded, total, msg) => console.log(`${loaded}/${total}: ${msg}`),
861
+ });
862
+
863
+ // 3. Create executor and upload weights
864
+ const executor = new Executor(ctx, graph, { maxSeqLen: 2048 });
865
+ executor.uploadWeights(weights);
866
+
867
+ // 4. Encode prompt
868
+ const inputIds = new Uint32Array(
869
+ tokenizer.encodeChat([
870
+ { role: "system", content: "You are a helpful assistant." },
871
+ { role: "user", content: "Hello!" },
872
+ ])
873
+ );
874
+
875
+ // 5. Prefill
876
+ let result = await executor.forward(inputIds);
877
+ let nextToken = sampleToken(result.logits, { temperature: 0.7, topK: 50, topP: 0.9 });
878
+
879
+ // 6. Decode loop
880
+ const maxTokens = 256;
881
+ const generated: number[] = [nextToken];
882
+
883
+ while (generated.length < maxTokens && nextToken !== tokenizer.config.eosTokenId) {
884
+ result = await executor.forward(new Uint32Array([nextToken]));
885
+ nextToken = sampleToken(result.logits, { temperature: 0.7 }, generated);
886
+ generated.push(nextToken);
887
+ process.stdout.write(tokenizer.decode([nextToken]));
888
+ }
889
+
890
+ // 7. Cleanup
891
+ executor.destroy();
892
+ ```
893
+
894
+ ### Key Methods
895
+
896
+ | Method | Description |
897
+ |--------|-------------|
898
+ | `initGPU()` | Request WebGPU adapter and device |
899
+ | `loadModel(options)` | Download model from HF Hub, generate IR, build tokenizer |
900
+ | `new Executor(ctx, graph, opts)` | Create executor with KV cache allocation |
901
+ | `executor.uploadWeights(weights)` | Upload safetensors data to GPU |
902
+ | `executor.forward(inputIds)` | Run one forward pass, return logits |
903
+ | `executor.reset()` | Clear KV cache for new conversation |
904
+ | `executor.destroy()` | Free all GPU resources |
905
+ | `sampleToken(logits, params, history?)` | Sample next token from logit distribution |
906
+
907
+ ---
908
+
909
+ ## 15. What's Built vs. What's Planned
910
+
911
+ | Component | Status | Notes |
912
+ |-----------|--------|-------|
913
+ | IR type system | Built | OpType, TensorDesc, OpNode, ModelGraph |
914
+ | Canonical key mapping | Built | Default HF key mapper (strip "model." prefix) |
915
+ | Safetensors parser | Built | Zero-copy typed views, streaming support |
916
+ | WebGPU device layer | Built | Buffer management, pipeline caching, readback |
917
+ | Embedding kernel | Built | Simple parallel gather |
918
+ | MatMul kernel (f32) | Built | 16x16 tiled with shared memory |
919
+ | MatMulInt4 kernel | Built | Fused dequant+matmul, not yet wired to graph generators |
920
+ | RMSNorm kernel | Built | Tree reduction, one workgroup per row |
921
+ | LayerNorm kernel | Built | Two-pass mean+variance, tree reduction |
922
+ | RoPE kernel | Built | Handles GQA (separate Q/K head counts) |
923
+ | Attention kernel | Built | Tiled online-softmax, 256-thread parallel, 16 KB shared memory, two-phase score reduction (race-free) |
924
+ | Softmax kernel | Built | Three-pass with tree reduction |
925
+ | SiLU kernel | Built | Element-wise |
926
+ | GELU kernel | Built | Approximate version |
927
+ | Add kernel | Built | Element-wise |
928
+ | Mul kernel | Built | Element-wise |
929
+ | BPE Tokenizer | Built | Pure JS, HF tokenizer.json compatible |
930
+ | CPU Sampler | Built | temp/top-k/top-p/repetition penalty |
931
+ | Qwen2/3 graph generator | Built | Full layer structure with SwiGLU MLP |
932
+ | KV Cache | Built | LHSd layout, pre-allocated |
933
+ | Executor | Built | Single-encoder batching (Dawn) + grouped submits with one CB in flight (WebKit, `?group=N`) |
934
+ | Activation liveness pooling | Built | 431 tensors → 20 buffers (37 MB at T=256), size-keyed free pool |
935
+ | SliceLastRow + [1, vocab] logits | Built | Last-position lm_head only; saves 485 MiB (508.6 MB) at T=512 and the full-vocab prefill matmul |
936
+ | Model Loader | Built | HF Hub, multi-shard support |
937
+ | Kernel Registry | Built | Mapping OpType -> kernel spec with dispatch/params helpers |
938
+ | GEMV decode kernels | Built | K-parallel MatVec (f32 + INT4), 8 cols × 32 K-threads per workgroup |
939
+ | SwiGLU MatVec fusion | Built | Fused gate+up+SwiGLU decode kernel, auto-detected by executor |
940
+ | EmbeddingInt4 kernel | Built | INT4 dequant lookup, tied lm_head reuse |
941
+ | INT4 quantization | Built | On-the-fly F32→INT4, GPTQ repack, MLX adapter |
942
+ | WebKit/Safari compat | Built | UA-based `isWebKitWebGPU`, grouped submits, no select(), clamped exp(), 16 KB smem, packed-f16 KV, Cache-API write skip >64MB |
943
+ | GPU error surfacing | Built | `onuncapturederror` + error scopes around alloc/upload/bind-group setup |
944
+ | GPU diagnostics (A-Q) | Built | 17-test suite: buffer integrity, compute, shared mem, f16, tree reduce, INT4, multi-dispatch, exp safety, real kernels, 300-dispatch stress |
945
+ | Integrity checker | Built | Weight/logits checksum comparison between Dawn and Safari, auto-runs on Safari load |
946
+ | Qwen3.5 graph gen | Built | Hybrid Mamba SSM + full attention with Gated Delta Net |
947
+ | LLaMA/Mistral graph gen | Planned | Same base architecture, different key mappings |
948
+ | Phi graph gen | Planned | Different MLP structure (fc1/fc2 vs gate/up/down) |
949
+ | f16 KV cache | Built | Native f16 (Chrome) + packed u32 (Safari), auto-detected |
950
+ | f16 compute path | Planned | Use shader-f16 for faster matmul |
951
+ | Native text embeddings | Built | Qwen3-Embedding-0.6B (`Qwen3ForCausalLM`): last-token EOS pooling + `L2Norm` tail. dim 1024, unit norm, cos(similar)=0.81 > cos(unrelated)=0.56 (Section 21) |
952
+ | L2Norm kernel | Built | Row-wise L2 normalization, one workgroup/row |
953
+ | Vision encoder (ViT) | Built | Qwen3.5's own 12-layer ViT from the same checkpoint; bit-exact vs HF transformers 5.12 (per-token cosine 1.000000, max abs err ~5e-6). `engine.encodeImage(patches, gridTHW)` (Section 22) |
954
+ | Vision ops (AddBias, GeluErf, ApplyRotaryEmb, SliceCols) | Built | ViT bias adds, exact-erf merger GELU, 2D rotary, fused-QKV split |
955
+ | Bidirectional attention (`is_causal` flag) | Built | One-line uniform on the existing parallel attention kernel; text stays causal by default, ViT runs non-causal (Section 22) |
956
+ | Vision LM integration (M-RoPE, token splice, image preprocessing) | In progress | Phase 2 — the encoder is done; splicing image tokens into the text stream is the remaining work |
957
+ | MoE support | Planned | MoERouter, ExpertMatMul kernels |
958
+ | Streaming API | Planned | High-level `stream()` method with async iteration |
959
+ | Jinja2 template parser | Planned | Full chat template support beyond ChatML |
960
+
961
+ ### 15.1 Exploration backlog (open directions, June 2026)
962
+
963
+ Candidates under evaluation — not commitments. The first item gates several others.
964
+
965
+ | Direction | Source / rationale | Status |
966
+ |-----------|--------------------|--------|
967
+ | **Engine consolidation decision** | Two engines (native WGSL text-only; transformers.js/ONNX breadth) is untenable long-term. Only transformers.js covers all modalities (vision/TTS/STT/embeddings). The decisive test: does transformers.js run on iOS 26.5 (post our environmental fixes + ORT 1.25 + IIFE worker)? If yes → consolidate on it; native WGSL kept only if a head-to-head shows a decisive text-speed win. | **Decision gate — test pending** |
968
+ | **KVSwap: KV-cache memory tiering** | son-of-ole/infinite-edge-agent (MIT). KV cache as VRAM→RAM→IndexedDB hierarchy with pinning/eviction/prefetch. Lifts the mobile context cap (the next wall after activation pooling). Applies to any WebGPU text path. | Explore — high value |
969
+ | **SSA: sparse/sub-quadratic attention** | Same repo. Block-summary → top-K block routing → gather → sparse attention. Cheaper long context, but changes attention semantics (approximate) — a quality/throughput research bet, not drop-in. | Explore — research |
970
+ | **Evaluate WebLLM/MLC as the text backend** | Mature compiled-WebGPU runtime (TVM-based); the most-used browser LLM engine. Text-only (not multimodal). Relevant only if we decide a separate fast text path is worth maintaining — as a more-supported alternative to the native WGSL engine. | Evaluate |
971
+ | **Gemma 4 E2B support** | Smallest current on-device Gemma (~3.35 GB q4). Tier 2: needs sliding-window attention + GeGLU + logit soft-cap (`docs/research/qwen36-gemma4-targets.md`). | Planned target |
972
+ | **Qwen3-0.6B (dense)** | Tier-1 proof of the `add-model-family` process; ~0.32 GB q4. | Planned target |
973
+ | **Agent-runtime feature layer** | Persistent local memory + context reconstruction (cf. same repo's Memory OS). A *feature set* packaged AI-SDK-style, NOT an engine concern. | Explore — product layer |
974
+
975
+ ---
976
+
977
+ ## 16. Key Design Decisions
978
+
979
+ ### Why Runtime IR (Not Offline Converter)
980
+
981
+ An offline converter (like ONNX export) adds a mandatory pre-processing step. Users must find or create converted model files, keep them in sync with upstream model updates, and deal with conversion bugs. By generating the IR at runtime from `config.json`, gerbil can point at any HuggingFace repo that uses a supported architecture and just run it. The cost is a few milliseconds of graph generation at load time -- negligible compared to weight download.
982
+
983
+ ### Why f32 First (Not f16)
984
+
985
+ f16 compute requires the `shader-f16` WebGPU feature, which is not universally available. Starting with f32 ensures correctness across all WebGPU-capable browsers. f16 is being added incrementally: the KV cache uses f16 storage with f32 compute (halving attention memory traffic), and f16 matmul will follow. Feature detection ensures automatic fallback to f32 on devices without `shader-f16`.
986
+
987
+ ### Why CPU Sampling
988
+
989
+ The logits array for a typical vocabulary (~150K tokens) is ~600KB as f32. Copying this to CPU and sampling there takes microseconds. GPU-side sampling would require:
990
+ 1. An additional kernel for argmax/multinomial sampling
991
+ 2. A readback of a single u32 (the selected token ID)
992
+ 3. Synchronization to get the result
993
+
994
+ The overhead of kernel dispatch + sync would exceed the CPU sampling time. The CPU also has access to `Math.random()` and can implement complex sampling strategies (repetition penalty over arbitrary history) more naturally.
995
+
996
+ ### Why No WASM Fallback
997
+
998
+ The previous approach maintained three inference paths (WebGPU, WASM, CPU), each with different bugs and performance profiles. The GPU engine targets WebGPU exclusively. If WebGPU is unavailable, the engine throws a clear error rather than silently degrading to a 10x slower path. This keeps the codebase focused and testable.
999
+
1000
+ Browser WebGPU coverage (Chrome 113+, Safari 26+, Firefox 141+) is sufficient for the target audience. The old transformers.js path remains available for legacy browser support if needed.
1001
+
1002
+ ### Why Canonical Tensor Naming
1003
+
1004
+ Different model families use different key prefixes in their safetensors files (`model.layers.0.self_attn.q_proj.weight` vs `transformer.h.0.attn.c_attn.weight`). By mapping to canonical names, the graph generators and executor code don't need to know about HF-specific naming. Adding a new model family only requires writing a key mapper function, not modifying the core engine.
1005
+
1006
+ ### Why LHSd KV Layout
1007
+
1008
+ LHSd (Layer, Head, Sequence, head_dim) puts the sequence dimension third, which means appending a new token's K/V vectors is a simple write at offset `seqPos * head_dim` within each (layer, head) slab. The attention kernel reads contiguous head_dim-sized chunks, which is cache-friendly for the dot product computation. This layout matches what most transformer implementations use internally.
1009
+
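The LHSd indexing can be made concrete with a small offset calculation. This is a sketch: `KVDims` and `kvElementOffset` are hypothetical names, offsets are in elements rather than bytes, and `heads` here means the number of KV heads.

```typescript
// Hypothetical description of one K (or V) cache in LHSd layout.
interface KVDims {
  heads: number;     // KV heads per layer
  maxSeqLen: number; // pre-allocated sequence capacity
  headDim: number;
}

// Element offset of the head_dim-sized vector for (layer, head, seqPos).
function kvElementOffset(layer: number, head: number, seqPos: number, d: KVDims): number {
  // Start of the (layer, head) slab, then a simple seqPos * headDim step inside it.
  const slabStart = (layer * d.heads + head) * d.maxSeqLen * d.headDim;
  return slabStart + seqPos * d.headDim;
}
```

Appending token `seqPos` touches one contiguous `headDim` run per (layer, head) pair, and the attention kernel walks the same slab sequentially.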
1010
+ ### Why Detect the Implementation, Not the Hardware
1011
+
1012
+ Workarounds in this engine exist for *WebGPU implementations* (WebKit's), not for GPUs. Apple Silicon is the hardware under both Safari and desktop Chrome/node-dawn, and `adapter.info` reports `vendor: "apple"` in all three — so any hardware-keyed predicate inevitably drags a correct, fast implementation through workarounds built for a broken one (Section 17.4 documents the 161 → 19.7 tok/s cost of getting this wrong). The user agent identifies the implementation unambiguously: all iOS/iPadOS browsers are WebKit by platform mandate, macOS Safari is `AppleWebKit` without `Chrome/`, and node has no UA at all.
1013
+
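A minimal sketch of a UA-keyed predicate following the three rules in the paragraph above. The function name `isWebKitUserAgent` is hypothetical, and the engine's actual `isWebKitWebGPU` may apply additional checks.

```typescript
// Hypothetical predicate: true when the WebGPU implementation is WebKit's.
function isWebKitUserAgent(ua: string | undefined): boolean {
  // node has no user agent at all: treat it as Dawn.
  if (!ua) return false;
  // Every iOS/iPadOS browser is WebKit by platform mandate
  // (Chrome on iOS reports "CriOS/", not "Chrome/").
  if (/\b(iPhone|iPad|iPod)\b/.test(ua)) return true;
  // macOS Safari: AppleWebKit without a Chrome/ token.
  return /AppleWebKit/.test(ua) && !/Chrome\//.test(ua);
}
```

Note that nothing here inspects `adapter.info`; that is the design choice the section argues for.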
1014
+ ### Why Liveness Pooling Instead of Per-Tensor Buffers
1015
+
1016
+ Per-tensor allocation at `maxSeqLen` is simple and was fine on a 128 GB desktop, but it scales with *graph size*, not with *concurrent liveness* — and a 431-tensor graph at T=512 wants ~2.3 GB while only ~20 tensors are ever alive at once. The pooled allocator keeps every desirable property of pre-allocation (no allocation during generation, deterministic footprint, buffers reused across forwards) at ~1.6% of the memory. The one obligation it creates: fused kernels must never read and write the same pooled buffer in a single dispatch, and mid-graph debug readbacks are only meaningful before a later op reuses the buffer.
1017
+
1018
+ ---
1019
+
1020
+ ## 17. The Four-Failure-Mode Diagnosis (June 2026)
1021
+
1022
+ Through early 2026 the engine's mobile story was a single undifferentiated narrative: "it crashes or produces garbage on iPad, probably Metal coherence." On 2026-06-12 a multi-agent investigation (full report: `docs/mobile-failure-diagnosis.md`; running log: `docs/metal-safari-intel.md`) took the four observed symptoms apart and found **four independent root causes** — two of them local bugs in this codebase, one a genuine (and since-fixed) WebKit bug, and one a detection bug that had silently poisoned every desktop benchmark for months. Conflating them is precisely what made the problem look intractable.
1023
+
1024
+ ### 17.1 jetsam-crash: a local memory bug, not Metal
1025
+
1026
+ The iPad tab kills ("download reaches 100%, then blank page") were attributed to dispatch-strategy overhead — the engineering log explicitly blamed "400 roundtrips + Promise overhead" for the DRAIN_EVERY=1 OOM. That explanation was wrong: a fully drained loop bounds in-flight command buffers to ~1, so roundtrip count cannot dominate memory. The actual cause was the executor's buffer allocator.
1027
+
1028
+ `allocateActivationBuffers()` created **one dedicated buffer per activation tensor at full `maxSeqLen`**, with zero reuse. For Qwen3.5-0.8B (431 activation tensors, vocab 248,320) at `maxSeqLen=512` the math is unforgiving:
1029
+
1030
+ | Component | Bytes |
1031
+ |---|---|
1032
+ | Logits buffer alone: `[T, vocab]` f32 = 512 × 248,320 × 4 | **508.6 MB (≈485 MiB)** — of which only the last row (993 KB) was ever read |
1033
+ | All ~430 per-tensor activation buffers | **~2.30 GB** |
1034
+ | INT4 weights | ~0.44 GB |
1035
+ | **Total GPU footprint, INT4 @ T=512** | **~2.77 GB** |
1036
+
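The logits row of the table can be checked with a few lines of arithmetic. The helper name `logitsBytes` is illustrative; the shapes and vocabulary size are the ones quoted above.

```typescript
const BYTES_PER_F32 = 4;

// Size in bytes of a [T, vocab] f32 logits tensor.
function logitsBytes(T: number, vocab: number): number {
  return T * vocab * BYTES_PER_F32;
}

const fullLogits = logitsBytes(512, 248320);   // whole [T, vocab] buffer
const lastRowLogits = logitsBytes(1, 248320);  // the only row the sampler reads
const fullLogitsMB = fullLogits / 1e6;               // decimal megabytes
const fullLogitsMiB = fullLogits / (1024 * 1024);    // binary mebibytes
```

The decimal and binary figures differ by about 5%, which is why the same buffer appears as both 508.6 MB and 485 MiB.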
1037
+ Against an iOS web-content jetsam budget of ~1.5-2 GB, **every configuration ever run on the iPad was memory-doomed before the first dispatch executed** — INT4 @ T=256 was 1.61 GB, the engine's then-default `maxSeqLen=4096` requested 18.9 GB, and the diagnostic page's default URL silently loaded the BF16 repo (6.08 GB F32 graph) under an "INT4" label. The drained per-dispatch experiments "failed" for memory reasons, not because draining was wrong — which sent the investigation chasing submit strategies when the bug was in allocation.
1038
+
1039
+ Fixes: liveness pooling (18.1), `SliceLastRow` (18.2), iOS `maxSeqLen` clamping, diagnostic-page default model fix, and load-transient trims. Post-fix footprint at T=512: ~0.44 GB weights + ~0.04 GB pooled activations + ~0.03 GB KV/SSM state ≈ **0.6-0.7 GB — comfortably inside the budget for the first time in the project's history**.
1040
+
1041
+ ### 17.2 zero-logits: a real WebKit bug — and an OS-version-dependent one
1042
+
1043
+ The historical core mystery: a full forward pass (~440 dispatches in one command buffer) produced all-zero logits from dispatch entry 2 onward on iPad Safari, while every per-dispatch-submit probe was correct and the identical code was correct on Dawn. Three things are now established:
1044
+
1045
+ 1. **It was not a local bug.** Stale-params and silent-allocation-failure theories were refuted: CPU-side state is byte-identical between the batched and per-dispatch paths (only submit granularity differs, and that alone flipped correct↔zeros); the kernels' `col < params.N` write guards mean zeroed params would write *nothing*, yet probes showed zeros actively **overwriting pre-seeded nonzero data** — a read-side visibility failure, with entry 2 computing real output from a stale (zero) view of entry 1's freshly written input.
1046
+
1047
+ 2. **It was a genuine WebKit within-submission storage-visibility bug, threshold-dependent.** The WebGPU spec guarantees cross-dispatch storage visibility within a submission (gpuweb #4433/#4434). Every passing diagnostic was tiny (Test J: 2 pipelines, 16-byte buffers; Test P: 300 dispatches but one pipeline); production combined ~441 dispatches, 15+ distinct pipelines, 6-9 bindings, and MB-scale buffers in one submission — a scale no diagnostic ever reached.
1048
+
1049
+ 3. **KEY FINDING (2026-06-12): the bug does not reproduce on iPadOS 26.5.** The exact historical zero-logits configuration — the entire ~440-dispatch forward in a single command buffer — now produces **byte-identical output to the desktop Dawn reference** on the test iPad (WebKit WebGPU via Chrome-iOS WKWebView over HTTPS). The bug is OS-version-dependent and was fixed upstream in WebKit (cf. WebKit bug 311598, where llama.cpp reports ~64 passes/CB stable on iOS 26.4). The grouped-submit machinery (18.3) is therefore retained as a *compatibility dial for older WebKit*, not as the permanent execution model.
1050
+
1051
+ ### 17.3 gibberish: zeros in disguise, plus a real data race waiting underneath
1052
+
1053
+ The recorded gibberish event (plausible-but-wrong tokens at ~276 tok/s on iPad, 2026-03) had been attributed to an "unknown Safari WGSL→MSL compiler bug." Two separate findings replace that:
1054
+
1055
+ **(a) The recorded event was zero-logits in disguise.** ~276 tok/s — 2-4× faster than an M4 Max — is the signature of zero-work dispatches: all threads early-exit, argmax over an all-zero logits buffer returns token 0 or whatever degenerate pattern the sampler makes of it, and "generation" races ahead. The miscompilation theory was additionally contradicted by Tests N/O (the real RMSNorm and MatVec kernels pass in isolation on the same iPad) and by the fact that the packed-f16 kernels only ever saw already-zero inputs in failing runs.
1056
+
1057
+ **(b) A genuine WGSL data race existed in all three attention variants.** In the Q·K score reduction, after the publish barrier, each dot-group leader read `smem[tid..tid+15]` and wrote `smem[pos_in_tile]` (slots 0..15) in the same block — leader 0's read range is exactly the write target of leaders 1..15, with no barrier between cross-group reads and writes. This is spec-level UB. It happened to be benign on Tint/M-series scheduling, which is why desktop never showed it, but WebKit's independent WGSL→MSL compiler and A-series scheduling carry no such guarantee; it would corrupt the first key position of each 16-position tile — exactly the "plausible but wrong tokens" phenotype. Fixed unconditionally with the two-phase reduction (18.4) since the cost is one barrier per KV tile.
1058
+
1059
+ ### 17.4 desktop-regression: conflating Apple-GPU hardware with the WebKit implementation
1060
+
1061
+ The Safari workaround gate was `isMetalBackend = vendor === "apple" || arch.startsWith("common")` — a *hardware* test. Dawn running on the M4 Max reports `vendor: "apple", arch: "metal-3"`, so node-dawn and Chrome-on-macOS took the full WebKit workaround path: per-dispatch multi-encoder submits plus the (already-disproven, yet still active) shader variant alternation that compiled ~850-900 per-node pipelines instead of ~25. Result: the recorded **161.8 → 19.7 tok/s collapse** on desktop, and every desktop performance number recorded between commit `2f0cabc` and the fix is poisoned. The fix is the UA-based `isWebKitWebGPU` predicate (18.5); post-fix desktop baseline is 145 tok/s.
1062
+
1063
+ ### 17.5 Corrections to earlier claims in this document
1064
+
1065
+ The June 2026 findings falsify several statements this paper (and the engineering log) previously made. For the record:
1066
+
1067
+ - **"Metal provides no cross-dispatch coherence by design"** — wrong. The WWDC25 quote about command-buffer boundaries explains why CB boundaries are *expensive*, not why they are required for coherence. The WebGPU spec mandates within-submission visibility; Dawn-on-Metal delivers it; WebKit on iPadOS 26.5 delivers it. The older-WebKit zeros were a bug, not a design property to architect around.
1068
+ - **Shader variant alternation** (commit `2f0cabc`) — never necessary. Disproven by the project's own Test Q, yet it remained active and was a major contributor to the desktop regression. Deleted; do not reintroduce.
1069
+ - **Per-dispatch submit without await (fire-and-forget)** (commit `38bc674`): an anti-pattern, not a fix. It queues ~400 unbounded in-flight Metal command buffers per token, the documented WebKit resource-exhaustion pattern, and never produced a recorded on-device result. Replaced by grouped submits with exactly one CB in flight (18.3).
1070
+ - **"DRAIN_EVERY=1/5 OOMs prove roundtrips exhaust memory"** — wrong attribution; the buffer footprint (17.1) was the cause. Drained loops are memory-bounded by construction.
1071
+ - **The two-phase writeBuffer pattern as the gibberish fix** — the pattern is harmless and retained, but stale params were refuted as the zero-logits mechanism (17.2); the recorded gibberish was zeros in disguise (17.3a).
1072
+ - **Validity boundary**: the single-CB zeros behavior was real on the older iPadOS version where it was recorded and presumably remains real on pre-26.5 WebKit; statements about it are *historical, version-scoped facts*, not current behavior.
1073
+
1074
+ ---
1075
+
1076
+ ## 18. The Mobile Fix Campaign
1077
+
1078
+ All fixes landed 2026-06-12. Each is small; together they took the iPad from "never completed a recorded run" to byte-correct 35.9 tok/s.
1079
+
1080
+ ### 18.1 Liveness-based activation pooling
1081
+
1082
+ `executor.ts allocateActivationBuffers()` replaces per-tensor allocation with a classic last-use liveness scheme over the topologically ordered graph:
1083
+
1084
+ 1. **Liveness pass**: walk `graph.executionOrder` once, recording each activation tensor's first defining node and last reading node. Tensors read before they are ever written carry state across forward passes (or are written in place) — they join a `persistent` set along with all `graph.outputs`, and are never pooled.
1085
+ 2. **Allocation pass**: walk the order again with a **size-keyed free pool** (`Map<byteSize, GPUBuffer[]>`). For each node, *acquire outputs first* (popping an exact-size buffer from the pool or creating one) so a node never writes the buffer of a tensor it also reads; then release every input/output whose last use is this node back into the pool.
1086
+ 3. Anything persistent or untouched by the execution order gets a dedicated buffer.
1087
+
1088
+ Reuse is safe because dispatches execute in `executionOrder` on every path (single-pass Dawn, grouped WebKit) and WebGPU synchronizes write-after-read hazards between dispatches. The executor logs the result: `431 tensors → 20 buffers (37 MB)` at T=256. Two standing obligations follow (also listed in Section 7's mobile invariants): fused kernels must not read+write one pooled buffer in a single dispatch, and `debugReadBuffer()` on an intermediate tensor is only meaningful before a later op reuses it.
1089
+
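The three steps above can be sketched on the CPU with integer buffer IDs standing in for `GPUBuffer` objects. This is an illustration under assumptions, not the executor's code: `GraphNode` and `planPooledBuffers` are hypothetical names, and plain objects are used in place of the `Map` the text mentions.

```typescript
// Illustrative node shape; the real OpNode has more fields.
interface GraphNode { inputs: string[]; outputs: string[]; }

function planPooledBuffers(
  order: GraphNode[],
  sizes: Record<string, number>, // activation name -> byte size
  graphOutputs: string[]
): { assignment: Record<string, number>; bufferCount: number } {
  const lastUse: Record<string, number> = {};
  const written: Record<string, boolean> = {};
  const persistent: Record<string, boolean> = {};
  graphOutputs.forEach((t) => { persistent[t] = true; });

  // 1. Liveness pass: record last use; read-before-write means persistent.
  order.forEach((node, i) => {
    node.inputs.forEach((t) => {
      if (!written[t]) persistent[t] = true;
      lastUse[t] = i;
    });
    node.outputs.forEach((t) => { written[t] = true; lastUse[t] = i; });
  });

  // 2. Allocation pass with a size-keyed free pool of buffer IDs.
  const freePool: Record<number, number[]> = {};
  const assignment: Record<string, number> = {};
  const released: Record<string, boolean> = {};
  let bufferCount = 0;
  const acquire = (size: number): number => {
    const list = freePool[size];
    return list && list.length > 0 ? (list.pop() as number) : bufferCount++;
  };

  order.forEach((node, i) => {
    // Acquire outputs first so a node never writes a buffer it also reads.
    node.outputs.forEach((t) => {
      if (!persistent[t] && assignment[t] === undefined) assignment[t] = acquire(sizes[t]);
    });
    // Then release every pooled tensor whose last use is this node.
    node.inputs.concat(node.outputs).forEach((t) => {
      if (persistent[t] || released[t] || lastUse[t] !== i || assignment[t] === undefined) return;
      released[t] = true;
      (freePool[sizes[t]] = freePool[sizes[t]] || []).push(assignment[t]);
    });
  });

  // 3. Persistent or untouched tensors get dedicated buffers.
  Object.keys(sizes).forEach((t) => {
    if (assignment[t] === undefined) assignment[t] = bufferCount++;
  });
  return { assignment, bufferCount };
}
```

On a four-node chain `x → a → b → c → d` with equal-sized activations, `a` and `c` end up sharing one buffer while `b` gets another, which is the ping-pong pattern that keeps the live set small.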
1090
+ ### 18.2 SliceLastRow and the [1, vocab] logits buffer
1091
+
1092
+ Only the last position's logits are ever consumed (the sampler operates on one row). Previously both graph generators emitted `lm_head` over the full `[T, hidden]` normed activations into a `[T, vocab]` logits tensor — 508.6 MB at T=512 (512 × 248,320 × 4) of which 993 KB was read. Now a `SliceLastRow` op (`kernels/registry.ts`, a trivial 256-wide copy kernel with `{width, last_row_offset}` params) extracts `final_norm_out`'s last row into `final_norm_last [1, hidden]`, and `lm_head` runs with M=1 into `logits [1, vocab]`. This saves the 485 MiB buffer **and** removes the full-vocab matmul over all prefill rows — the dominant prefill compute — so it is a pure win on desktop too. Readback offsets collapse to 0 in `forward()`/`forwardArgmax()`.
1093
+
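A CPU reference for what `SliceLastRow` computes, useful for checking the GPU kernel's output. This is a sketch: the function name is hypothetical, and `lastRowOffset` is expressed in elements here, which is an assumption about the kernel's `last_row_offset` param.

```typescript
// CPU reference: extract the last row of a row-major [T, width] tensor.
function sliceLastRow(x: Float32Array, T: number, width: number): Float32Array {
  const lastRowOffset = (T - 1) * width; // element offset of row T-1
  // subarray returns a view over the same memory, with no copy.
  return x.subarray(lastRowOffset, lastRowOffset + width);
}
```

With `T = 1` (decode) the offset is 0 and the slice is the whole input, so the same op is valid in both phases.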
1094
+ ### 18.3 Grouped-submit architecture (WebKit)
1095
+
1096
+ The WebKit execution path in `forward()`/`forwardArgmax()` is now a parameterized grouped loop: `webkitGroupSize` dispatches per command buffer (one compute pass per dispatch), `queue.submit()` then `await queue.onSubmittedWorkDone()` per group — **exactly one command buffer in flight at all times**. The logits copy (and in `forwardArgmax()`, the argmax dispatch and its 4-byte readback copy) go in separate final submits. This replaces the fire-and-forget anti-pattern of ~400 unbounded in-flight CBs per token. `group=1` is the proven-correct floor for older WebKit; the size is sweepable at runtime via the `?group=N` URL parameter, which produced the scaling curve in Section 19. On Dawn nothing changed: one encoder, one compute pass, one submit.
1097
+
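The grouped loop can be sketched independently of WebGPU. In this illustration `GroupQueue` is a hypothetical stand-in: `submit()` represents encoding a group into one command buffer plus `queue.submit()`, and `drained()` represents `queue.onSubmittedWorkDone()`.

```typescript
// Hypothetical abstraction over the GPU queue, for illustration only.
interface GroupQueue {
  submit(dispatchIndices: number[]): void;
  drained(): Promise<void>;
}

// Split dispatch indices 0..count-1 into consecutive groups of groupSize.
function groupDispatches(count: number, groupSize: number): number[][] {
  const size = Math.max(1, Math.floor(groupSize));
  const groups: number[][] = [];
  for (let start = 0; start < count; start += size) {
    const group: number[] = [];
    for (let i = start; i < Math.min(start + size, count); i++) group.push(i);
    groups.push(group);
  }
  return groups;
}

async function runGrouped(queue: GroupQueue, count: number, groupSize: number): Promise<void> {
  for (const group of groupDispatches(count, groupSize)) {
    queue.submit(group);
    // Awaiting here is what bounds the pipeline to one command buffer in flight.
    await queue.drained();
  }
}
```

For a ~440-dispatch forward, `group=64` gives 7 submit-and-drain round trips per token versus 440 at `group=1`, which is the lever behind the scaling curve.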
1098
+ ### 18.4 Two-phase attention score reduction
1099
+
1100
+ In all three attention kernel variants (f32, native-f16, packed-f16), the leader-thread score reduction now accumulates into a local variable, executes a `workgroupBarrier()` in **uniform control flow** (the barrier cannot live inside the guarded block — non-uniform barriers are a WGSL validation error), and only then writes `smem[pos_in_tile]` in a second guarded block. Cost: one extra barrier per 16-position KV tile. This closes the race described in 17.3(b).
1101
+
1102
+ ### 18.5 `isWebKitWebGPU` detection
1103
+
1104
+ Described in Section 6. The hardware-keyed `isMetalBackend` predicate is gone; the executor's `needsMultiEncoder` is set solely from the UA-based `isWebKitWebGPU`, and shader variant alternation is deleted outright (~850-900 pipelines → ~25 on the workaround path's worst case).
1105
+
1106
+ ### 18.6 Error scopes and uncaptured-error surfacing
1107
+
1108
+ `device.onuncapturederror` is registered at init, and `WebGPUEngine.create()` brackets executor construction, weight upload, and bind-group creation with `pushErrorScope("out-of-memory")` + `pushErrorScope("validation")`, throwing a descriptive error (with a crash-phase breadcrumb in `localStorage`) if either scope reports an error when popped. WebKit reports failed allocations asynchronously and otherwise *silently no-ops the affected dispatches*; before this change, every mobile observation was collected blind. Device limits (`maxBufferSize`, `maxStorageBufferBindingSize`, `maxComputeWorkgroupStorageSize`) are logged on WebKit at startup.
1109
+
1110
+ ### 18.7 WebKit Cache-API large-write skip and memory-policy fixes
1111
+
1112
+ - `model-loader.ts browserCacheWrite()`: the defensive `data.slice(0)` needed for `cache.put()` doubles a weight shard in memory at the worst moment; on WebKit, writes over 64 MB are skipped entirely (the HTTP cache still serves warm reloads).
1113
+ - iOS `maxSeqLen` policy (`gpu/index.ts`): WebKit defaults to 512 and is hard-clamped to 2048 (the old default of 4096 was an 18.9 GB request); `?kvf32=1` on Safari caps at 1024; desktop caps at 4096. `?maxseq=N` overrides within `context_length`.
1114
+ - The iPad diagnostic page's default model is now the MLX 4-bit repo, never the BF16 one.
1115
+
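The two policies above can be sketched as pure functions. Several points are assumptions: the function names are hypothetical, "64 MB" is taken as 64 × 1024 × 1024 bytes, the desktop default is assumed equal to its cap, and a `?maxseq=N` override is assumed to remain subject to the caps.

```typescript
const MB = 1024 * 1024;

// Skip Cache-API writes for large shards on WebKit to avoid the data.slice(0) copy.
function shouldSkipCacheWrite(byteLength: number, isWebKit: boolean): boolean {
  return isWebKit && byteLength > 64 * MB;
}

interface MaxSeqOptions {
  isWebKit: boolean;
  kvf32: boolean;        // ?kvf32=1
  requested?: number;    // ?maxseq=N, if present
  contextLength: number; // the model's context_length
}

function resolveMaxSeqLen(o: MaxSeqOptions): number {
  const cap = o.isWebKit ? (o.kvf32 ? 1024 : 2048) : 4096;
  const fallback = o.isWebKit ? 512 : cap; // desktop default is an assumption
  const wanted = o.requested !== undefined ? o.requested : fallback;
  return Math.max(1, Math.min(wanted, cap, o.contextLength));
}
```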
1116
+ ---
1117
+
1118
+ ## 19. Mobile Results & Submit-Granularity Scaling
1119
+
1120
+ All measurements 2026-06-12 on iPad, iPadOS 26.5, WebKit WebGPU via Chrome-iOS WKWebView over HTTPS. Model: Qwen3.5-0.8B MLX 4-bit, greedy (temperature 0). **Every configuration below produced output byte-identical to the desktop Dawn reference** — these are correctness results first, throughput results second. (Canonical numbers: `docs/research/tps-baselines.md`.)
1121
+
1122
+ Device limits (default adapter): `maxBufferSize` 256 MB, `maxStorageBufferBindingSize` 128 MB, `maxComputeWorkgroupStorageSize` 16384 bytes.
1123
+
1124
+ ### Throughput vs. group size (24-token generations)
1125
+
1126
+ | `?group=N` (dispatches/CB, one CB in flight, awaited) | Decode tok/s |
1127
+ |---|---|
1128
+ | 1 | 6.4-7.7 |
1129
+ | 8 | 19.7 |
1130
+ | 32 | 24.6 |
1131
+ | 64 | 28.8 |
1132
+ | 128 | 29.7 |
1133
+ | batch-all (entire forward in one CB) | **31.7** |
1134
+ | sustained 120-token run, group=64, maxseq=512 | **35.9** |
1135
+
1136
+ **Interpretation**: the curve is round-trip-bound until ~32 dispatches/CB — each group costs a full submit + `onSubmittedWorkDone()` JS↔GPU round trip, so throughput scales nearly linearly with group size at first (1→8 is a 3× jump). Past 32-64 the round trips amortize away and the engine becomes **dispatch-bound**: 64→all is only 28.8→31.7, because ~440 individual compute passes per token dominate regardless of CB packaging. This is why dispatch-count reduction (kernel fusion) is worth 2-3× more on mobile than on desktop, and why the iPad goal path (50+ tok/s) runs through a single compute pass per CB plus fusion rather than through submit tuning.
1137
+
1138
+ Other results from the same session:
1139
+
1140
+ - **maxseq survival**: at group=1, both `maxseq=64` and `maxseq=512` complete — pre-fix, every configuration was jetsam-killed (17.1). The memory fix and the correctness fix are independently confirmed.
1141
+ - **KV A/B**: packed-f16 KV vs `?kvf32=1` both byte-correct — the packed-f16 path is cleared on-device (it had never run against correct inputs before; see 17.3a).
1142
+ - **Within-submission visibility**: the historical zero-logits configuration (batch-all) is correct on iPadOS 26.5 — the decisive evidence for 17.2's key finding.
1143
+ - **Desktop post-fix baseline**: node-dawn on M4 Max at **145 tok/s** (143.9 / 144.1 / 147.1), confirming the 19.7 tok/s regression was entirely the detection predicate.
1144
+
1145
+ ### Production submit strategy
1146
+
1147
+ The strategy is OS-gated, chosen at engine init:
1148
+
1149
+ - **iPadOS/iOS 26.5+**: batch-all (single CB per forward) — the upstream WebKit fix makes the fast path correct.
1150
+ - **Older 26.x WebKit**: run a **startup coherence probe** — a small batched dependent-chain dispatch whose result is compared against the same chain run per-dispatch. Probe passes → `group=64`; probe fails → `group=1` awaited (the proven-correct floor, 6.4-7.7 tok/s).
1151
+ - **Dawn (Chrome, node)**: single encoder, single compute pass, unchanged.
1152
+
1153
+ Until the probe ships, `webkitGroupSize` defaults to 1 (correctness floor) and `?group=N` overrides it.
1154
+
1155
+ ### Update (2026-06-15): a third device class — batching *crashes* (iPhone)
1156
+
1157
+ A `?group=N` sweep on an **iPhone (iOS 18.7, WebKit/Safari 26.5)** revealed a failure mode distinct from both the iPad-26.5 class (batch-all correct) and the older-WebKit class (batch-all → zero logits): **any `group > 1` hard-crashes the GPU process** — the page dies ("A problem repeatedly occurred"), confirmed at `group=4`, `32`, and `4096`. `group=1` runs correctly at ~**4.7 tok/s** (the round-trip-bound floor predicted by the curve above). This is the single most important mobile fact: on this device the submit-granularity lever — the whole basis of the 6.4→31.7 tok/s scaling on iPad — is **unavailable**, so the *only* remaining levers are dispatch-count reduction (fusion) and memory reduction.
1158
+
1159
+ Two signs point at **memory**, not pure miscompilation:
1160
+
1161
+ - The crash is at the **first grouped compute (prefill)**, immediately after load, and scales with group size — consistent with the command buffers / intermediate state held in flight before the single `onSubmittedWorkDone()` drain exceeding the iPhone's (smaller-than-iPad) jetsam budget.
1162
+ - The same device **hard-crashes at model *load*** with the ~596 MB vision-enabled Qwen checkpoint but loads the ~404 MB text-only one. It is operating right at its memory ceiling, so the extra working set a batched submit needs is enough to push it over. (Action taken: the site's default chat engine now loads the text-only checkpoint; the vision tower is built only on the Vision tab.)
1163
+
1164
+ **Revised production strategy** — the startup probe must be *crash-surviving*, not merely output-comparing (the original design in the previous subsection cannot detect a class that kills the page before returning a result):
1165
+
1166
+ 1. Persist the candidate group size to `localStorage` **before** running the probe (reuse the existing crash breadcrumb in `device-guards.ts` / `setDownloadPhase`).
1167
+ 2. On reload, if the breadcrumb for a group size is still set, that size crashed → step down (batch-all → 64 → 8 → 1) and record the safe ceiling for the device.
1168
+ 3. Promote a group size to "known-good" only after a probe both completes **and** returns byte-correct output.
1169
+
1170
+ This self-calibrates across all three classes (correct-batched / wrong-batched / crash-batched) with no hardcoded device table, and survives the crash class because the breadcrumb outlives the page kill. **Shipping the crash-surviving probe is the decided path. The rabbit hole was hand-sweeping `?group` on a device that crashes the page on every batched value, which is exactly what a self-calibrating probe is designed to avoid.**
1171
+
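The three-step protocol above can be sketched as a small state machine over a `localStorage`-like store. This is a minimal sketch, not the engine's code: `ProbeStore`, `beginProbe`, `finishProbe`, and the three key names are hypothetical, and only the step-down ladder (batch-all → 64 → 8 → 1) and the breadcrumb idea come from the text. The split into two synchronous calls is deliberate, so the breadcrumb is written before the (possibly page-killing) GPU probe starts.

```typescript
// Hypothetical names throughout; only the protocol itself comes from the text.
interface ProbeStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

type GroupSize = number | "all";
interface ProbeStep { size: GroupSize; needsProbe: boolean; }

// Step-down ladder from the text: batch-all -> 64 -> 8 -> 1.
const LADDER: GroupSize[] = ["all", 64, 8, 1];
const PENDING_KEY = "probe.pending";  // breadcrumb: ladder index under test
const CEILING_KEY = "probe.ceiling";  // first ladder index still allowed
const GOOD_KEY = "probe.knownGood";   // promoted, verified group size

function parseSize(s: string): GroupSize {
  return s === "all" ? "all" : Number(s);
}

// Call at startup. Returns the size to use, or the next size to probe.
function beginProbe(store: ProbeStore): ProbeStep {
  const good = store.getItem(GOOD_KEY);
  if (good !== null) return { size: parseSize(good), needsProbe: false };

  let idx = Number(store.getItem(CEILING_KEY) ?? "0");
  const pending = store.getItem(PENDING_KEY);
  if (pending !== null) {
    // Step 2: a breadcrumb that survived a reload means that size crashed.
    idx = Math.max(idx, Number(pending) + 1);
    store.setItem(CEILING_KEY, String(idx));
  }
  if (idx >= LADDER.length - 1) {
    // group=1 is treated as the proven-correct floor and is not probed.
    store.removeItem(PENDING_KEY);
    store.setItem(GOOD_KEY, "1");
    return { size: 1, needsProbe: false };
  }
  // Step 1: persist the candidate BEFORE the probe runs.
  store.setItem(PENDING_KEY, String(idx));
  return { size: LADDER[idx] as GroupSize, needsProbe: true };
}

// Call only if the probe returned. ok = output was byte-correct.
function finishProbe(store: ProbeStore, ok: boolean): void {
  const pending = store.getItem(PENDING_KEY);
  if (pending === null) return;
  const idx = Number(pending);
  store.removeItem(PENDING_KEY);
  if (ok) {
    // Step 3: promote only after completing AND matching.
    store.setItem(GOOD_KEY, String(LADDER[idx]));
  } else {
    store.setItem(CEILING_KEY, String(idx + 1)); // wrong output: step down
  }
}
```

A caller would loop `beginProbe` → run the real GPU probe → `finishProbe` until `needsProbe` is false; a page kill simply skips `finishProbe`, and the next load steps down.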
1172
+ **Caching note (same session):** on iOS Safari in a plain tab the CacheStorage model cache is **evicted between visits** (`navigator.storage.persist()` is not granted outside an installed PWA — `model-loader.ts:requestPersistence` is best-effort), so every visit re-downloads the full model regardless of throughput. The durable fix is OPFS via a Worker (`createSyncAccessHandle` in chunks), tracked as a separate task; "add to Home Screen" (PWA) also earns persistent storage as an interim path.
1173
+
1174
+ ---
1175
+
1176
+ ## 20. Autoresearch: Profile-Guided Throughput Optimization (June 2026)
1177
+
1178
+ Once mobile correctness was secured, an autonomous optimization loop
1179
+ (`scripts/engine/optimize.mjs --mode=autoresearch`, driven by the model itself)
1180
+ took **desktop decode throughput from 145 → 207 tok/s** (M4 Max, node-dawn,
1181
+ Qwen3.5-0.8B INT4), cooled-stable, with every round verified byte-coherent. The
1182
+ loop's contract: read engine code → make one focused edit → build → benchmark
1183
+ twice (lower mean, to defeat thermal lies) → keep only if >1.5–2% over the
1184
+ current best, else `git checkout` the touched files → log to `results.jsonl`.
1185
+
1186
+ ### The arc — and the change of angle
1187
+
1188
+ The interesting result is not the number but the *shape* of how it was reached:
1189
+
1190
+ | Round | tok/s | Kept | Edit |
1191
+ |---|---|---|---|
1192
+ | baseline | 145 | — | — |
1193
+ | r1 | 145.9 | ✗ | fuse down_proj + residual |
1194
+ | r2 | 152.8 | ✓ | vec4 INT4 weight loads (32 nibbles/iter) |
1195
+ | r3 | 158 | ✓ | vec4 loads in the SwiGLU matvec |
1196
+ | r4 | 158 | ✗ | subgroup-shuffle reduction |
1197
+ | **r5** | **188** | ✓ | **GPU-driven pipelined decode** |
1198
+ | **r6** | **213** | ✓ | **fuse MambaSSM (profiled #2 hotspot)** |
1199
+ | r7 | 210 | ✗ | SSM occupancy split |
1200
+ | **r8** | **222** | ✓ | **f16 SSM-state storage** |
1201
+
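The keep/revert contract can be checked against the table with a few lines. This is a sketch under stated assumptions: `conservativeScore` and `shouldKeep` are hypothetical names, "lower mean" is read here as "take the lower of the two benchmark results", and the 1.5% end of the ">1.5–2%" threshold is used. The tok/s values are the ones in the table.

```typescript
// Assumption: a round is scored by the lower of its two benchmark runs.
function conservativeScore(runA: number, runB: number): number {
  return Math.min(runA, runB);
}

// Keep only if the candidate beats the current best by more than minGain.
function shouldKeep(best: number, candidate: number, minGain: number = 0.015): boolean {
  return candidate > best * (1 + minGain);
}

interface Round { id: string; tokS: number; }

// Replay a sequence of rounds; the best score only advances on kept rounds.
function replay(baseline: number, rounds: Round[], minGain: number = 0.015): string[] {
  let best = baseline;
  const kept: string[] = [];
  for (const r of rounds) {
    if (shouldKeep(best, r.tokS, minGain)) {
      best = r.tokS;
      kept.push(r.id);
    }
  }
  return kept;
}

// tok/s column of the table above.
const tableRounds: Round[] = [
  { id: "r1", tokS: 145.9 },
  { id: "r2", tokS: 152.8 },
  { id: "r3", tokS: 158 },
  { id: "r4", tokS: 158 },
  { id: "r5", tokS: 188 },
  { id: "r6", tokS: 213 },
  { id: "r7", tokS: 210 },
  { id: "r8", tokS: 222 },
];
const keptRounds = replay(145, tableRounds);
```

Under these assumptions the replay keeps r2, r3, r5, r6, and r8, which matches the Kept column; the same holds at a 2% threshold.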
1202
+ Rounds 1–4 are a *sideways* plateau: local kernel-load micro-optimizations
1203
+ (145→158), half reverted. The breakthroughs (r5, r6, r8) came when the loop
1204
+ **stopped optimizing the kernel in front of it and changed its analysis**:
1205
+
1206
+ - **r5** is not a kernel tweak but a *control-flow* restructuring — keep the
1207
+ argmax result on the GPU and feed it back as the next token's input without a
1208
+ per-token CPU round-trip (158→188).
1209
+ - **r6/r8** are *profile-guided*: the loop identified the Mamba-2 SSM as the #2
1210
+ hotspot (≈1982 µs/token, 37% of decode), then attacked it structurally —
1211
+ fusing four per-head passes (188→213), then halving its dominant memory
1212
+ traffic by storing SSM state in f16 while keeping f32 compute (213→222).
1213
+
1214
+ This transition — from "make this matmul faster" to "profile the whole forward,
1215
+ find the real bottleneck, and rewrite its data flow" — is the qualitative jump
1216
+ that broke the plateau. It is also the clearest signal that the search was
1217
+ reasoning about angles it had not tried before, not enumerating variations of
1218
+ the same one.
1219
+
1220
+ ### A safety lesson: reverts can clobber uncommitted work
1221
+
1222
+ One round's failed-experiment revert ran `git checkout -- executor.ts
1223
+ registry.ts`, which silently reset *uncommitted* mobile-fix edits in those files
1224
+ to the last commit, then re-layered only the optimizer's own changes — producing
1225
+ a tree that built (esbuild strips types) but failed `tsc` and would have lost the
1226
+ mobile work. The incident motivates two rules now documented for the loop: revert
1227
+ only the files a round actually touched, and commit known-good baselines before
1228
+ launching autonomous edit sessions. The kept kernel wins were recovered intact
1229
+ from the loop's own per-round `.bak` snapshots.
1230
+
1231
+ ### Open throughput frontiers
1232
+
1233
+ The decode path is now bandwidth- and dispatch-bound rather than matmul-bound.
1234
+ Remaining candidates (tracked in `scripts/engine/backlog.md`): fused KV-append,
1235
+ RMSNorm/RoPE fusion, further dispatch-count reduction (which helps mobile 2–3×
1236
+ more than desktop, since mobile is round-trip-bound below ~32 dispatches/CB), and
1237
+ investigating reported large wins from fusing the linear-attention/SSM layers
1238
+ end-to-end.
1239
+
1240
+ #### The two-regime model (recalibrated, June 2026)
1241
+
1242
+ A profiling pass (env-gated timestamp-query profiler, `scripts/engine/test-profile-decode.mjs`)
1243
+ plus a 5-agent research synthesis (`docs/research/dispatch-reduction-hivemind.md`)
1244
+ resolved an apparent contradiction in this campaign. Decode has **two distinct
1245
+ overhead regimes**:
1246
+
1247
+ - **Mobile (Safari/WebKit):** each dispatch is its own command-buffer submit +
1248
+ `onSubmittedWorkDone()` drain, so per-dispatch cost is ~32–71 µs (Metal; arXiv
1249
+ 2604.02344). At ~287 dispatches/token this regime **is** dispatch-count-bound —
1250
+ any dispatch cut helps, and the lever is fewer dispatches.
1251
+ - **Desktop (node-Dawn, the primary `test-benchmark.mjs` metric):** the whole
1252
+ decode step is **one** `beginComputePass` / **one** submit; measured per-dispatch
1253
+ overhead is ~2–3 µs. Decode is **not** dispatch-count-bound here. The metric
1254
+ moves only on **bytes-moved reductions over WIDE tensors.**
1255
+
1256
+ This is load-bearing and confirmed by the engine's own history: pure dispatch-count
1257
+ cuts were flat/reverted on Dawn (ResidualRMSNorm Add+Norm −23 dispatches = +0.25%;
1258
+ RMSNorm-into-MambaSSM −18 dispatches = +0.3%; software-prefetch −1.2%), while every
1259
+ kept Dawn win removed a **wide** global round-trip (MambaSSM single-pass state
1260
+ fusion ~+13%, f16 SSM state +4%, SiLU-into-conv +1.8%, matvec A-reuse +4.9%). True
1261
+ WebGPU megakernels are **not** viable (no portable inter-workgroup sync / forward
1262
+ progress); the achievable analog is aggressive per-block fusion. The ranked plan
1263
+ (epilogue/prologue Mamba megakernels that erase 2048/6144-wide round-trips) lives
1264
+ in `docs/research/dispatch-reduction-hivemind.md`.
1265
+
1266
+ **Measurement rule:** dispatch-count cuts must be validated on the **iPad runner**
1267
+ (mobile regime), not the desktop benchmark, or they read as "flat" and get reverted
1268
+ despite being real mobile wins.
1269
+
1270
+ ---
1271
+
1272
+ ## 21. Native Text Embeddings (June 2026)
1273
+
1274
+ The second modality to go native after text generation is **text embeddings**.
1275
+ The target model is **Qwen3-Embedding-0.6B**, and the key fact that makes it cheap
1276
+ is that it ships as architecture `Qwen3ForCausalLM` — the *same* graph generator
1277
+ the engine already runs for text. An embedding model is not a separate engine; it
1278
+ is the existing causal-LM forward pass with a different tail.
1279
+
1280
+ ### What changes vs. text generation
1281
+
1282
+ The graph generator (`architectures/qwen2.ts`) takes an `embedding` flag. When
1283
+ set, it replaces the `SliceLastRow → lm_head → logits` tail with a two-op
1284
+ **pooling tail**:
1285
+
1286
+ 1. **`SliceLastRow`** on `final_norm_out` — Qwen3-Embedding uses **last-token
1287
+ (EOS-position) pooling**: the pooled vector is the final hidden state at the
1288
+ last input position. This reuses the exact op the LM path already emits to
1289
+ restrict lm_head to the last row, so no new pooling kernel is needed.
1290
+ 2. **`L2Norm`** — a new row-wise L2-normalization kernel (`WGSL_L2NORM`,
1291
+ `kernels/registry.ts`: one workgroup per row, `{rows, width}` params) that
1292
+ divides the pooled vector by its Euclidean norm, producing a unit-length
1293
+ embedding in tensor `embedding`.
1294
+
1295
+ The lm_head matmul (and the entire ~248K-row vocab projection) is dropped
1296
+ entirely — an embedding forward pass is strictly cheaper than a generation step.
1297
+ `graph.outputs` becomes `["embedding"]`, and `WebGPUEngine` exposes the modality
1298
+ through `engine.embed(text, options)` (with `engine.isEmbedding` as the guard).
1299
+
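For intuition, the two-op pooling tail can be written as a CPU reference. This is a minimal sketch, not the WGSL kernel: the function names are hypothetical, the hidden states are assumed to be a row-major `[rows, width]` `Float32Array`, and the `eps` guard against a zero vector is an added assumption.

```typescript
// SliceLastRow: take the final position's hidden state from [rows, width].
function sliceLastRow(hidden: Float32Array, rows: number, width: number): Float32Array {
  return hidden.slice((rows - 1) * width, rows * width);
}

// L2Norm: divide a vector by its Euclidean norm (eps is an assumed guard).
function l2Norm(v: Float32Array, eps: number = 1e-12): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const inv = 1 / Math.max(Math.sqrt(sumSq), eps);
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) out[i] = v[i] * inv;
  return out;
}

// Pooling tail = SliceLastRow followed by L2Norm.
function poolLastToken(hidden: Float32Array, rows: number, width: number): Float32Array {
  return l2Norm(sliceLastRow(hidden, rows, width));
}
```

With a 2×2 input whose last row is `[3, 4]`, the pooled vector is approximately `[0.6, 0.8]`, which has unit length.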
1300
+ ### The pooling-token subtlety
1301
+
1302
+ Last-token pooling reads the final position, so *which* token sits there matters.
1303
+ Qwen3-Embedding was trained to pool at the literal `<|endoftext|>` token
1304
+ (`eos_token_id`), **not** the chat `<|im_end|>` token the tokenizer reports as its
1305
+ generic `eos_token`. `engine.embed()` (`gpu/index.ts`) resolves the literal
1306
+ `<|endoftext|>` id and appends it if absent, truncating to the budget while
1307
+ keeping that pooling token last. Query embeddings additionally take the
1308
+ Qwen3-Embedding instruction prefix (`"Instruct: {task}\nQuery:{text}"`); document
1309
+ embeddings omit it.
1310
+
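The token handling described above might look like the following. It is a sketch only: `formatQuery` and `withPoolingToken` are hypothetical helpers, the real `<|endoftext|>` id must be resolved from the tokenizer, and "appends it if absent" is interpreted here as "if it is not already the last token".

```typescript
// Query-side prompt, using the instruction format quoted in the text.
function formatQuery(task: string, text: string): string {
  return `Instruct: ${task}\nQuery:${text}`;
}

// Ensure the pooling token is last, truncating the body to fit the budget.
// eotId stands in for the tokenizer's literal <|endoftext|> id.
function withPoolingToken(ids: number[], eotId: number, budget: number): number[] {
  if (budget < 1) throw new Error("budget must be at least 1");
  const endsWithEot = ids.length > 0 && ids[ids.length - 1] === eotId;
  const body = endsWithEot ? ids.slice(0, -1) : ids.slice();
  // Reserve one slot for the pooling token so truncation never drops it.
  return body.slice(0, budget - 1).concat([eotId]);
}
```

Truncating the body rather than the tail is the point: whatever is cut, the last position still holds the token the model was trained to pool at.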
1311
+ ### Validation
1312
+
1313
+ The native embeddings were validated against the reference: output **dim 1024**,
1314
+ **unit L2 norm** confirmed, and the semantic geometry is correct —
1315
+ **cos(similar) = 0.81 > cos(unrelated) = 0.56**. The path is non-autoregressive
1316
+ (prefill only), so it inherits the existing memory and submit machinery with no
1317
+ mobile-specific work.
1318
+
1319
+ ---
1320
+
1321
+ ## 22. Native Vision Encoder (Qwen3.5 ViT) (June 2026)
1322
+
1323
+ Qwen3.5 is **natively multimodal**: the same checkpoint that the engine already
1324
+ runs for text ships a **12-layer Vision Transformer** in its weights — a ~192 MB
1325
+ tower the engine had previously been dropping on load. This cycle implements that
1326
+ ViT natively and validates it **bit-exact against HuggingFace transformers 5.12**.
1327
+
1328
+ ### Why this was tractable
1329
+
1330
+ The vision feasibility was re-assessed (Section 5 / `docs/PROJECT-STATE.md`) and
1331
+ found materially easier than an earlier strategy pass had concluded. That pass
1332
+ believed native vision was blocked behind a *single-threaded attention kernel
1333
+ rewrite* — but it had read `attention.wgsl`, a **dead reference file imported
1334
+ nowhere** (the live kernels are embedded strings in `registry.ts`). The live
1335
+ attention kernel was already the tiled, online-softmax, fully parallel
1336
+ flash-attention-style kernel of Section 7. Making it bidirectional for the ViT was
1337
+ **one uniform flag**, not a rewrite (below). With that correction, the encoder was
1338
+ a bounded amount of ordinary graph-generator work.
1339
+
1340
+ ### Architecture
1341
+
1342
+ The ViT generator (`architectures/qwen3_5_vision.ts`) produces a non-autoregressive
1343
+ graph over a symbolic patch count `N`, validated line-for-line against
1344
+ `modeling_qwen3_5.py`:
1345
+
1346
+ - **Unfold-free patch embed.** Patches arrive **already flattened** to `[N, 1536]`
1347
+ (`patch_dim = in_channels · temporal_patch_size · patch_size² = 3·2·16²`), so the
1348
+ 5-D Conv3d patch embedding collapses to a **plain `MatMul`** + `AddBias`. No
1349
+ `Conv3d`/`unfold` kernel is needed — the host image processor delivers patches in
1350
+ the right layout. A bilinear-interpolated learned position embedding is then
1351
+ added.
1352
+ - **12 pre-norm, bidirectional blocks.** Each block is `LayerNorm → fused-QKV
1353
+ MatMul+AddBias → SliceCols (split Q/K/V) → ApplyRotaryEmb(Q,K) → Attention
1354
+ (bidirectional) → proj MatMul+AddBias → residual → LayerNorm → fc1 MatMul+AddBias
1355
+ → GELU(tanh) → fc2 MatMul+AddBias → residual`.
1356
+ - **Bidirectional attention via one flag.** The attention kernel computes
1357
+ `causal_limit = q_pos + position_offset + 1; S_eff = is_causal ? min(S, causal_limit) : S`.
1358
+ An `is_causal` uniform (default `1`) keeps text decoding causal and unchanged; the
1359
+ ViT passes `causal: false`, setting `is_causal = 0` so every patch attends to all
1360
+ patches. This is the entire non-causal change.
1361
+ - **2D rotary.** The ViT uses a 2D (row, col) rotary embedding. The cos/sin tables
1362
+ and the bilinear-interpolated position embeddings are functions of the image grid
1363
+ `(t, h, w)` **only** — not of weights or pixels — so they are precomputed on the
1364
+ host (`vision-preprocess.ts`, porting `get_vision_bilinear_indices_and_weights`,
1365
+ `get_vision_position_ids`, and `Qwen3_5VisionRotaryEmbedding`) and fed in as input
1366
+ activations. `ApplyRotaryEmb` then applies them with a `rotate_half`. This keeps
1367
+ the GPU graph to the weight-dependent math while staying byte-identical to HF.
1368
+ - **Two distinct GELUs.** The transformer blocks use `gelu_pytorch_tanh` (the
1369
+ `GELU` op); the merger uses the **exact erf** GELU (the new `GeluErf` op). Getting
1370
+ both exactly right was load-bearing for the bit-exact match.
1371
+ - **Spatial-merge-2 → 1024-dim tokens.** The merger's `LayerNorm` over the 768-dim
1372
+ block output is followed by a free row-major reshape `[N, 768] → [N/4, 3072]`
1373
+ (a spatial 2×2 merge — patches arrive pre-grouped into merge-blocks so no gather
1374
+ is needed), then `fc1 → GeluErf → fc2` to produce merged image tokens of dim
1375
+ **1024**, matching the LM hidden size, as `[N/4, 1024]`.
1376
+
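The `is_causal` rule from the attention bullet above can be restated on the CPU to show why one uniform is enough. This is a sketch: `effectiveKeyLength` is a hypothetical name, and the body is a direct transcription of the `causal_limit` / `S_eff` expression quoted in the text.

```typescript
// Number of key positions a query at qPos may attend to.
// S is the total key length; positionOffset shifts qPos into absolute position.
function effectiveKeyLength(
  S: number,
  qPos: number,
  positionOffset: number,
  isCausal: boolean,
): number {
  const causalLimit = qPos + positionOffset + 1;
  return isCausal ? Math.min(S, causalLimit) : S;
}

// Causal text decoding: query 2 of 8 sees keys 0..2.
const causalKeys = effectiveKeyLength(8, 2, 0, true);
// Bidirectional ViT: every patch sees all 8 patches.
const visionKeys = effectiveKeyLength(8, 2, 0, false);
```

Because the kernel's key loop is already bounded by `S_eff`, switching the flag changes only that bound and leaves the tiling and online-softmax logic untouched.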
1377
+ The encoder runs in its own `VisionExecutor` (`vision-executor.ts`) — kept
1378
+ separate from the autoregressive text `Executor` because it is a single prefill
1379
+ over `N` patches with no KV cache, one buffer per tensor, dispatched in
1380
+ `executionOrder`. It reuses the shared kernel registry and device helpers, so the
1381
+ kernel math is identical to the text path and inherits the same WebKit
1382
+ grouped-submit behavior.
1383
+
1384
+ ### A real bug the validation caught
1385
+
1386
+ Achieving the bit-exact match surfaced a genuine kernel bug: **`WGSL_GELU`
1387
+ returned NaN for large arguments on Metal/Dawn**. The MLP can produce `|x| > 30`,
1388
+ and `x³` then pushes the tanh argument into the thousands, where Metal's fast-math
1389
+ `tanh`/`exp` returns NaN instead of saturating. The fix clamps the inner argument
1390
+ to a safe `±15` (`SQRT_2_OVER_PI · (x + GELU_COEFF·x³)`), which is numerically
1391
+ indistinguishable from the true GELU on the saturated tails. This is the same
1392
+ class of Metal fast-math hazard already documented for `exp()` in the attention
1393
+ kernel (Section 7).
1394
+
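A CPU restatement of the clamped tanh-GELU shows why the clamp is harmless. This is a sketch, not the WGSL source: the constant names follow the text, `0.044715` is the standard tanh-approximation coefficient, and JavaScript's `Math.tanh` already saturates, so the sketch demonstrates only that clamping does not change the result, not the Metal NaN itself.

```typescript
const SQRT_2_OVER_PI = Math.sqrt(2 / Math.PI);
const GELU_COEFF = 0.044715; // standard tanh-approximation coefficient
const INNER_CLAMP = 15;      // the +/-15 bound from the text

// tanh-approximate GELU with the inner argument clamped before tanh.
function geluTanhClamped(x: number): number {
  const inner = SQRT_2_OVER_PI * (x + GELU_COEFF * x * x * x);
  const clamped = Math.min(INNER_CLAMP, Math.max(-INNER_CLAMP, inner));
  return 0.5 * x * (1 + Math.tanh(clamped));
}

// Unclamped reference, safe on the CPU where tanh saturates cleanly.
function geluTanhReference(x: number): number {
  const inner = SQRT_2_OVER_PI * (x + GELU_COEFF * x * x * x);
  return 0.5 * x * (1 + Math.tanh(inner));
}

// Largest difference between the two over a grid covering |x| <= 100.
function maxClampError(): number {
  let worst = 0;
  for (let x = -100; x <= 100; x += 0.25) {
    worst = Math.max(worst, Math.abs(geluTanhClamped(x) - geluTanhReference(x)));
  }
  return worst;
}
```

At `|inner| = 15`, `tanh` is within about 2e-13 of ±1, so on the saturated tails the clamped output approaches `x` (or 0) to far better than f32 precision.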
1395
+ ### Validation and the public surface
1396
+
1397
+ Validated **bit-exact vs HF transformers 5.12**: **per-token cosine = 1.000000**,
1398
+ **max absolute error ~5e-6**. The encoder is exposed as
1399
+ `engine.encodeImage(patches, gridTHW)` → merged image tokens `[rows, 1024]`,
1400
+ behind `enableVision: true` at load (the engine snapshots the raw
1401
+ `visual.pos_embed.weight` table before `uploadWeights` consumes it, for the host
1402
+ bilinear interpolation).
1403
+
1404
+ ### What is *not* done yet (phase 2)
1405
+
1406
+ `encodeImage()` returns image tokens; it does **not** splice them into a text
1407
+ sequence. The remaining LM-side integration — **M-RoPE** position assignment for
1408
+ interleaved image/text tokens, **token splicing** of image embeddings into the
1409
+ input stream at the image placeholder positions, and the host-side **image
1410
+ preprocessing** (pixels → ordered patches) — is the in-progress phase 2. The
1411
+ encoder being bit-exact means that integration is plumbing over a verified
1412
+ numerical core, not open research.
1413
+
1414
+ ---
1415
+
1416
+ ## 23. The Native-Only Architecture Decision
1417
+
1418
+ The earlier framing of this project as **"two lanes"** — a native WGSL engine for
1419
+ text plus a transformers.js/ONNX fallback lane for breadth — has been retired. The
1420
+ decided architecture (owner decision, see `docs/PROJECT-STATE.md`) is **one native
1421
+ WebGPU engine**:
1422
+
1423
+ - **Launch set = text + vision + embeddings, all native, in one engine.** Text
1424
+ generation, the Qwen3.5 ViT encoder (Section 22), and Qwen3-Embedding
1425
+ (Section 21) all run through the same IR, kernel registry, and device layer. One
1426
+ model per modality by default, expandable to other families via the
1427
+ add-model-family process (`docs/adding-a-model-family.md`).
1428
+ - **A permanent fallback lane is rejected** — it "assumes defeat to begin with."
1429
+ `transformers.js`/ONNX (`chrome-backend.ts`) is at most temporary dev scaffolding
1430
+ to keep desktop demos alive during the native build, and is being **removed, not
1431
+ kept** as a lane.
1432
+ - **Audio is deferred to native small models, not delegated to a second engine.**
1433
+ Launching without audio is acceptable; a permanent second runtime is not. The
1434
+ candidates are native: **OmniVoice** for TTS (a Qwen3 backbone + a codec decoder —
1435
+ i.e. mostly the text path the engine already runs, plus a decoder) and
1436
+ **Moonshine** for STT (a lean encoder-decoder with a raw-waveform Conv1d
1437
+ frontend, no log-mel Conv2d). A thin onnxruntime-web bridge remains only a
1438
+ break-glass option for a model with no extractable weights *and* no native
1439
+ alternative.
1440
+
1441
+ ### Corrections to earlier "two-lane / blocked-vision" language
1442
+
1443
+ Two claims that appeared in earlier roadmap text are now corrected for the record:
1444
+
1445
+ - **"Native vision is blocked on a parallel attention kernel."** False — the
1446
+ attention kernel was **already parallel** (Section 7). The blocking claim came
1447
+ from reading `attention.wgsl`, a dead file imported nowhere; the live kernel is a
1448
+ tiled flash-attention-style kernel. Non-causal attention is a one-line uniform
1449
+ (`is_causal`), not a new kernel. **Vision is done at the encoder level**
1450
+ (Section 22), bit-exact.
1451
+ - **"tfjs is a kept breadth lane."** It is not. The engine is native-only across
1452
+ all launch modalities; `chrome-backend.ts` is slated for deletion.
1453
+
1454
+ ---
1455
+
1456
+ ## 24. iOS Model-Caching Reality
1457
+
1458
+ A persistent finding this cycle, with consequences for how the engine ships on
1459
+ iOS: **durably caching a ~400 MB model in iOS/iPadOS Safari is not achievable from
1460
+ a plain browser tab.** The full investigation is in
1461
+ `docs/research/ios-safari-model-caching.md`; the load-bearing facts:
1462
+
1463
+ - **Persistence requires a PWA.** WebKit's only documented positive heuristic for
1464
+ granting `navigator.storage.persist()` is "opened as a Home-Screen Web App." A
1465
+ plain tab gets `false` (confirmed by on-device probe). Only a persistence grant
1466
+ excludes an origin from eviction — so durable model caching means shipping an
1467
+ Add-to-Home-Screen PWA.
1468
+ - **Eviction is quota-pressure-driven, not reload-driven.** Best-effort data
1469
+ survives reloads under normal conditions; what kills the cache on a near-full
1470
+ device is WebKit's **origin-wide LRU eviction under storage pressure**. The
1471
+ probed iPad had a ~1 GB origin quota (Safari 17+ sets it to ~60% of *free* disk,
1472
+ and the device was nearly full) with ~444 MB already consumed by foreign caches —
1473
+ a 400 MB write lands near the cap and gets evicted.
1474
+ - **Switching Cache API → IndexedDB does not help.** On iOS they share one
1475
+ best-effort origin pool, evicted together under the same quota and 7-day-ITP
1476
+ policy. The migration buys no durability.
1477
+ - **Main-thread OPFS is broken on iOS.** `createWritable()` on the main thread
1478
+ throws OOM on the test device; the only viable large-write path is a **Worker
1479
+ using OPFS `createSyncAccessHandle()`** in small chunks. This finding drove the
1480
+ removal of the engine's OPFS write path in favor of a Cache-API-only loader
1481
+ (`model-loader.ts`); a main-thread OPFS attempt left unclearable junk that filled
1482
+ the quota and evicted everything.
1483
+
1484
+ The practical consequence: as a plain tab the engine treats re-download as
1485
+ unavoidable and minimizes its cost (smaller/streamed model, HTTP cache, fast CDN);
1486
+ durable on-device caching is a PWA feature, gated behind the persistence grant,
1487
+ foreign-cache cleanup, and Worker-OPFS chunked writes.
1488
+
1489
+ ---
1490
+
1491
+ ## 25. EmbeddingGemma-300M: a Second Embedding Family, Validated On-Device (June 2026)
1492
+
1493
+ §21 shipped native embeddings by reusing the *text* graph — Qwen3-Embedding is
1494
+ `Qwen3ForCausalLM` with a last-token-pool + L2Norm tail. That model proved the
1495
+ modality but not the engine's **generality across embedding families**, and it had a
1496
+ fatal mobile problem: the only weights that existed were BF16 (~1.2 GB, which OOMs
1497
+ the iPad) or a broken MLX-DWQ convert. This cycle adds **EmbeddingGemma-300M**
1498
+ (`mlx-community/embeddinggemma-300m-4bit`, ~173 MB at MLX-4bit) — the first
1499
+ **non-Qwen** embedding model, a genuinely different architecture, **and the first
1500
+ embedding model confirmed running on iPad Safari on-device** (owner-confirmed).
1501
+
1502
+ ### A real encoder, not a re-skinned causal LM
1503
+
1504
+ EmbeddingGemma is a **bidirectional Gemma3 encoder**, so unlike §21 it needed its own
1505
+ graph generator (`architectures/gemma3_encoder.ts`, `generateGemma3EncoderGraph`).
1506
+ The architecture, read line-for-line from `config.json` (the generator hardcodes
1507
+ nothing — block count, dims and head config all come from the config; the 24-block /
1508
+ 768-hidden / 3072-intermediate shape below is that model's config):
1509
+
1510
+ - **24 pre-norm blocks, bidirectional.** The attention op is emitted with
1511
+ `causal: false` — the same one-flag `is_causal` mechanism the ViT introduced
1512
+ (§22), reused for an encoder. This is the second consumer of that flag and the
1513
+ payoff of having made non-causal a uniform rather than a kernel.
1514
+ - **GQA 3 q / 1 kv, head_dim 256.** `q_dim = num_heads·head_dim`,
1515
+ `kv_dim = num_kv_heads·head_dim`; for this model 3·256 = 768 query, 1·256 kv.
1516
+ - **Per-head q-norm / k-norm.** Each block emits `q_norm` and `k_norm` as `RMSNorm`
1517
+ over `head_dim` before RoPE — the Gemma3 per-head QK normalization.
1518
+ - **Dual-theta RoPE, selected per layer from `layer_types`.** Gemma3 interleaves
1519
+ sliding-window and full-attention layers, and they use *different* rotary bases.
1520
+ The generator reads `layer_types[i]`: a `"full_attention"` layer takes
1521
+ `rope_theta` (1e6); every other (local/sliding) layer takes `rope_local_base_freq`
1522
+ (10000). The full head_dim is rotated. (This is a per-layer `layer_types` lookup,
1523
+ not a fixed "every 6th" rule — the config happens to make every 6th layer global,
1524
+ but the code keys off the type string.)
1525
+ - **GeGLU MLP.** `down(gelu_tanh(gate) · up)` — separate gate/up projections, a
1526
+ `GELU` (tanh-approx) op on the gate, a `Mul`, then the down projection. Not the
1527
+ SiLU-based fused SwiGLU of the Qwen text path.
1528
+ - **Gemma's four-norm "sandwich."** Each block carries four norms, not two:
1529
+ `input_layernorm` (pre-attn), `post_attention_layernorm`, `pre_feedforward_layernorm`,
1530
+ `post_feedforward_layernorm`. The two *post* norms are applied to the sublayer
1531
+ output **before** the residual add — Gemma's distinguishing sandwich layout.
1532
+ - **Embedding scale ×√768.** The token embeddings are multiplied by
1533
+ `√hidden_size` via a new `Scale` op before the blocks.
1534
+ - **Tail: MeanPool → Dense0 (768→3072) → Dense1 (3072→768) → L2Norm.** Unlike
1535
+ Qwen3-Embedding's *last-token* EOS pool, EmbeddingGemma uses **mean pooling** over
1536
+ all tokens (a new `MeanPool` kernel), followed by the model's two learned Dense
1537
+ projection heads and a final L2 normalization to a unit-length 768-dim vector.
1538
+
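The per-layer theta selection from the dual-theta RoPE bullet above can be sketched as follows. The field names `layer_types`, `rope_theta`, and `rope_local_base_freq` come from the text; the interface, function names, the six-layer example config, and the `"sliding_attention"` label are illustrative assumptions. The inverse-frequency formula is the standard RoPE definition, not something specific to this engine.

```typescript
interface Gemma3RopeConfig {
  layer_types: string[];
  rope_theta: number;           // base for "full_attention" layers
  rope_local_base_freq: number; // base for every other (local) layer
}

// Key off the type string, not a fixed "every 6th layer" rule.
function ropeThetaForLayer(cfg: Gemma3RopeConfig, layer: number): number {
  return cfg.layer_types[layer] === "full_attention"
    ? cfg.rope_theta
    : cfg.rope_local_base_freq;
}

// Standard RoPE inverse frequencies: theta^(-2k / headDim), k in [0, headDim/2).
function inverseFrequencies(theta: number, headDim: number): Float64Array {
  const half = headDim / 2;
  const out = new Float64Array(half);
  for (let k = 0; k < half; k++) out[k] = Math.pow(theta, (-2 * k) / headDim);
  return out;
}

// Illustrative config: five local layers, then one global layer.
const exampleCfg: Gemma3RopeConfig = {
  layer_types: [
    "sliding_attention", "sliding_attention", "sliding_attention",
    "sliding_attention", "sliding_attention", "full_attention",
  ],
  rope_theta: 1e6,
  rope_local_base_freq: 10000,
};
```

Reading the type string means a config with a different global/local interleave needs no generator change.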
1539
+ ### Two new kernels — and that is all
1540
+
1541
+ The only genuinely new GPU kernels this model required are **`MeanPool`** and
1542
+ **`Scale`** (`kernels/registry.ts`):
1543
+
1544
+ - **`MeanPool`** — `output[c] = (1/T)·Σ_t src[t·width + c]`: the column-mean of a
1545
+ `[T, width]` activation into `[1, width]`. One thread per output channel,
1546
+ workgroup_size 256, params `{ seq_len, width }`.
1547
+ - **`Scale`** — `output[i] = input[i]·scale`, the embedding normalizer. Params
1548
+ `{ count, scale_bits }` (the f32 scale passed as a bit pattern and `bitcast` back).
1549
+
1550
+ Everything else — RMSNorm, the bidirectional attention, RoPE, GELU, Mul, the INT4
1551
+ matmul, L2Norm — was already in the registry. A whole new embedding family cost two
1552
+ small reduction/elementwise kernels.
1553
+
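Both kernels are small enough to restate as CPU references. This is a sketch with hypothetical function names; the formulas and the parameter layout (`seq_len`, `width`, `count`, `scale_bits`) follow the descriptions above, and the bit-pattern round trip mirrors what a WGSL `bitcast<f32>` would do on the GPU side.

```typescript
// MeanPool: column mean of a row-major [seqLen, width] activation.
function meanPool(src: Float32Array, seqLen: number, width: number): Float32Array {
  const out = new Float32Array(width);
  for (let c = 0; c < width; c++) {
    let sum = 0;
    for (let t = 0; t < seqLen; t++) sum += src[t * width + c];
    out[c] = sum / seqLen;
  }
  return out;
}

// Reinterpret an f32 as its u32 bit pattern (host side of scale_bits).
function f32ToBits(x: number): number {
  const buf = new ArrayBuffer(4);
  new Float32Array(buf)[0] = x;
  return new Uint32Array(buf)[0];
}

// Reinterpret a u32 bit pattern as f32 (what bitcast does in the kernel).
function bitsToF32(bits: number): number {
  const buf = new ArrayBuffer(4);
  new Uint32Array(buf)[0] = bits;
  return new Float32Array(buf)[0];
}

// Scale: elementwise multiply by a scale passed as a bit pattern.
function scaleByBits(input: Float32Array, scaleBits: number): Float32Array {
  const s = bitsToF32(scaleBits);
  const out = new Float32Array(input.length);
  for (let i = 0; i < input.length; i++) out[i] = input[i] * s;
  return out;
}
```

Passing the scale as bits lets a float share a uniform struct with integer params such as `count`; for this model the host would pass `f32ToBits(Math.sqrt(768))`.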
1554
+ ### The (1+weight) norm absorption — baked by the loader, for MLX too
1555
+
1556
+ Gemma's RMSNorm is `(1 + weight)·normalized`, not `weight·normalized`. Rather than
1557
+ fork the kernel, the **loader bakes the +1 into every Gemma norm weight** at load
1558
+ (`model-loader.ts`), so the standard RMSNorm kernel stays correct. The subtle part:
1559
+ this runs **even for MLX-4bit** Gemma. mlx-lm pre-absorbs the +1 for Qwen3.5 but
1560
+ **does not** for Gemma — so the Gemma branch deliberately omits the `&& !isMLX`
1561
+ guard the Qwen branch carries, and adds +1 to `input/post_attention/pre_feedforward/
1562
+ post_feedforward_layernorm`, `q_norm`, `k_norm`, and the final `norm`.
1563
+
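The absorption is easy to verify numerically: baking `+1` into the weight and running the standard kernel gives the same result as a Gemma-specific `(1 + weight)` kernel. The sketch below uses hypothetical function names, and `eps = 1e-6` is an assumed value for illustration.

```typescript
// Standard RMSNorm: weight * x / sqrt(mean(x^2) + eps).
function rmsNorm(x: Float32Array, weight: Float32Array, eps: number = 1e-6): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < x.length; i++) sumSq += x[i] * x[i];
  const inv = 1 / Math.sqrt(sumSq / x.length + eps);
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) out[i] = x[i] * inv * weight[i];
  return out;
}

// Gemma-style RMSNorm, written out directly for comparison only.
function gemmaRmsNorm(x: Float32Array, weight: Float32Array, eps: number = 1e-6): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < x.length; i++) sumSq += x[i] * x[i];
  const inv = 1 / Math.sqrt(sumSq / x.length + eps);
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) out[i] = x[i] * inv * (1 + weight[i]);
  return out;
}

// Load-time bake: store (1 + w) so the standard kernel stays correct.
function bakeGemmaNormWeight(weight: Float32Array): Float32Array {
  const out = new Float32Array(weight.length);
  for (let i = 0; i < weight.length; i++) out[i] = 1 + weight[i];
  return out;
}
```

The bake must run exactly once per tensor; applying it to a checkpoint that already absorbed the `+1` would double-count it, which is why the Qwen and Gemma branches differ on the MLX guard.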
1564
+ ### Validation: bit-faithful, then semantic
1565
+
1566
+ Correctness was pinned against an **independent NumPy reference**
1567
+ (`scripts/engine/test-embedding-gemma-reference.py`) that re-implements the MLX
1568
+ affine dequant (`scale·nibble + bias`, group_size 64), the `(1+w)` Gemma norm, the
1569
+ `query_pre_attn_scalar^-0.5` attention scale, GQA repeat, dual-theta RoPE and the two
1570
+ Dense heads from raw safetensors. The reference asserts engine-vs-NumPy
1571
+ `cosine > 0.95` per probe; the measured result was **cos = 1.00000** (the commit's
1572
+ own headline). On top of that, a semantic test (`test-embedding-gemma.mjs`) confirms
1573
+ the geometry is *useful*, not merely reproducible: a "Red Planet" query embeds
1574
+ **closer to two Mars documents than to an unrelated sourdough-bread document by a
1574
+ cosine margin above 0.1**, all vectors are unit-norm at dim 768, and none are NaN or
1576
+ degenerate. The path is prefill-only (non-autoregressive), so it inherits the
1577
+ existing memory/submit machinery with no mobile-specific work — which is why, once
1578
+ the size dropped from 1.2 GB to 173 MB, **it simply ran on the iPad**.
1579
+
1580
+ ### Why this one matters
1581
+
1582
+ Qwen3-Embedding proved the *tail*. EmbeddingGemma proves the *engine*: a different
1583
+ vendor, a different block structure (four-norm sandwich, dual-theta RoPE, per-head
1584
+ QK-norm, mean-pool, Dense heads), validated bit-faithful and then run on-device on
1585
+ the hardware that previously crashed. It is the difference between "we support an
1586
+ embedding model" and "the engine generalizes across embedding families." The
1587
+ abandoned 1.2 GB Qwen3-Embedding-on-iPad and the MLX-DWQ-garbage trap (§27) are the
1588
+ two dead ends this model routes around.
1589
+
1590
+ ---
1591
+
1592
+ ## 26. The SentencePiece Tokenizer Fix (a Cross-Family Lesson)
1593
+
1594
+ EmbeddingGemma surfaced a tokenizer bug that had been **silently destroying
1595
+ semantics**, and the fix is a reusable lesson for any future non-GPT-lineage model.
1596
+
1597
+ Gemma's `tokenizer.json` declares `"type": "BPE"` — but it is **SentencePiece-flavored
1598
+ BPE**, not the byte-level (GPT-2 / `Ġ`) BPE the engine was built for. The two
1599
+ disagree on nearly everything that matters:
1600
+
1601
+ | | Byte-level BPE (Qwen, LFM2) | SentencePiece BPE (Gemma) |
1602
+ |---|---|---|
1603
+ | Space marker | `Ġ` (U+0120) | `▁` (U+2581) |
1604
+ | Token bytes | byte-to-unicode remapped | raw UTF-8 |
1605
+ | `merges` form | `"a b"` strings | `["a","b"]` arrays |
1606
+ | Unknown bytes | — | `<0xHH>` byte-fallback |
1607
+
1608
+ Feeding a SentencePiece vocab through the byte-level path **char-split every word**:
1609
+ the pre-tokenizer's byte-to-unicode remap turned the raw-UTF-8 ▁-prefixed Gemma
1610
+ tokens into unmatchable garbage, the merges never fired, and the embeddings that came
1611
+ out were numerically "valid" (unit norm, no NaN) but **semantically meaningless** —
1612
+ exactly the kind of failure that passes a smoke test and fails a margin test.
1613
+
1614
+ The fix (`tokenizer.ts`) **auto-detects SPM mode structurally**, with no model-name
1615
+ list. A `spmMode` flag is set true when either: the `normalizer` is (or contains, in
1616
+ a `Sequence`) a `Replace` node mapping `" " → "▁"`; or the model has
1617
+ `byte_fallback: true` **and** the vocab literally contains the ▁-prefixed token
1618
+ `"▁the"`. In SPM mode the tokenizer:
1619
+
1620
+ - **encodes** by replacing spaces with ▁ and splitting on ▁ (keeping it attached to
1621
+ the following piece) — *no* byte-to-unicode remap, BPE runs on raw UTF-8;
1622
+ - **decodes** by fusing runs of consecutive `<0xHH>` byte-fallback tokens into a
1623
+ single UTF-8 decode (so a multi-byte codepoint split across byte tokens
1624
+ reassembles) and finally mapping ▁ → space;
1625
+ - **parses merges** from both array and string form into one space-joined key.
1626
+
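The detection rule and the byte-fallback decode described above can be sketched in a few lines. This is a minimal illustration, not the code in `tokenizer.ts`: the function and interface names are hypothetical, and the only fields read are the ones the Hugging Face `tokenizer.json` layout uses for a `Replace` normalizer, `byte_fallback`, and `vocab`.

```typescript
// Sketch only: names are hypothetical, not the engine's own.
const SPM_SPACE_MARK = "\u2581"; // the SentencePiece space marker

interface SpmNormalizerNode {
  type: string;
  pattern?: { String?: string };
  content?: string;
  normalizers?: SpmNormalizerNode[];
}

interface SpmTokenizerJson {
  normalizer?: SpmNormalizerNode | null;
  model: { byte_fallback?: boolean; vocab: Record<string, number> };
}

// True if the node is, or (inside a Sequence) contains, Replace " " -> marker.
function hasSpaceToMarkReplace(node: SpmNormalizerNode | null | undefined): boolean {
  if (!node) return false;
  if (node.type === "Replace") {
    return node.pattern?.String === " " && node.content === SPM_SPACE_MARK;
  }
  if (node.type === "Sequence") {
    return (node.normalizers ?? []).some((child) => hasSpaceToMarkReplace(child));
  }
  return false;
}

// Structural detection: no model-name list involved.
function detectSpmMode(t: SpmTokenizerJson): boolean {
  if (hasSpaceToMarkReplace(t.normalizer)) return true;
  const hasMarkedThe = Object.prototype.hasOwnProperty.call(
    t.model.vocab,
    SPM_SPACE_MARK + "the",
  );
  return t.model.byte_fallback === true && hasMarkedThe;
}

// Decode: fuse runs of <0xHH> tokens into one UTF-8 decode, then map marker to space.
function decodeSpmPieces(pieces: string[]): string {
  const utf8 = new TextDecoder("utf-8");
  let out = "";
  let pendingBytes: number[] = [];
  const flush = (): void => {
    if (pendingBytes.length > 0) {
      out += utf8.decode(new Uint8Array(pendingBytes));
      pendingBytes = [];
    }
  };
  for (const piece of pieces) {
    if (/^<0x[0-9A-Fa-f]{2}>$/.test(piece)) {
      pendingBytes.push(parseInt(piece.slice(3, 5), 16));
    } else {
      flush();
      out += piece;
    }
  }
  flush();
  return out.split(SPM_SPACE_MARK).join(" ");
}
```

Buffering the byte tokens before decoding is what lets a codepoint split across `<0xC3>` and `<0xA9>` come back as a single `é` instead of two replacement characters.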
1627
+ Crucially the detection is *structural*, so **Qwen and LFM2 fall through to the
1628
+ byte-level path unchanged** — they have no `" "→"▁"` normalizer and are not
1629
+ byte_fallback SPM vocabs, so `spmMode` is false and the `Ġ` machinery is untouched.
1630
+ The lesson, recorded for the add-model-family process: **`type: "BPE"` in a HF
1631
+ tokenizer is not a guarantee of byte-level BPE.** Any SentencePiece-lineage family
1632
+ (Gemma, Llama, Mistral) needs the SPM path, and the engine now picks it automatically.
1633
+
1634
+ ---
1635
+
1636
+ ## 27. MLX-4bit Loading: Broadened Detection, and the DWQ Trap
1637
+
1638
+ Shipping a `mlx-community` checkpoint forced the loader to get precise about what
1639
+ "MLX-4bit" means (`model-loader.ts`). Three changes:
1640
+
1641
+ 1. **Detection broadened to mode-less configs.** Standard mlx-lm converts often emit
1642
+ `{ bits: 4, group_size: N }` with **no `mode` field**, where earlier code only
1643
+ recognized an explicit `mode: "affine"`. The loader now treats a config as
1644
+ MLX-shaped when `bits === 4` and either `mode === "affine"` **or**
1645
+ (`mode` is absent **and** `group_size` is a number).
1646
+
1647
+ 2. **A DWQ reject, because DWQ is config-indistinguishable.** Distillation-quantized
1648
+ (DWQ) MLX repos carry the *same* `{bits:4, group_size}` config as a standard
1649
+ affine convert but pack weights the engine's dequant can't read — they produce
1650
+ garbage, not an error. Since the config can't tell them apart, the loader rejects
1651
+ by **repo name**: any repo whose lowercased id contains `"dwq"` can never be
1652
+ treated as a verified MLX repo.
1653
+
1654
+ 3. **A `VERIFIED_MLX_REPOS` allowlist.** A mode-less config only loads if its repo is
1655
+ on a hardcoded allowlist of repos confirmed to be standard affine MLX-4bit (case-
1656
+ insensitive substring match, currently exactly
1657
+ `["mlx-community/embeddinggemma-300m-4bit"]`). The full gate is
1658
+ `isMLX = hasMlxShape && (mode === "affine" || isVerifiedMlxRepo)` — so an explicit
1659
+ `affine` config loads anywhere, but a bare `{bits:4, group_size}` only loads from a
1660
+ vetted repo, and never from a DWQ one. This is the codified form of the
1661
+ "MLX-DWQ-garbage trap" that wasted a cycle on the Qwen embedding model.
1662
+
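Read together, the three rules above reduce to one small predicate. The sketch below is an illustration under stated assumptions: `isMlx4bitSketch` and the `QuantConfigSketch` interface are hypothetical names, while `VERIFIED_MLX_REPOS`, `hasMlxShape`, and `isVerifiedMlxRepo` mirror the identifiers quoted in the text.

```typescript
// Sketch only: a standalone restatement of the gate, not the loader's code.
interface QuantConfigSketch {
  bits?: number;
  group_size?: number;
  mode?: string;
}

const VERIFIED_MLX_REPOS = ["mlx-community/embeddinggemma-300m-4bit"];

function isMlx4bitSketch(repoId: string, q: QuantConfigSketch | undefined): boolean {
  if (!q) return false;

  // Rule 1: explicit affine, or a mode-less config that still carries group_size.
  const modeAbsent = q.mode === undefined;
  const hasMlxShape =
    q.bits === 4 &&
    (q.mode === "affine" || (modeAbsent && typeof q.group_size === "number"));

  // Rule 2: a repo id containing "dwq" can never count as verified.
  // Rule 3: otherwise, case-insensitive substring match against the allowlist.
  const id = repoId.toLowerCase();
  const isVerifiedMlxRepo =
    !id.includes("dwq") &&
    VERIFIED_MLX_REPOS.some((repo) => id.includes(repo.toLowerCase()));

  return hasMlxShape && (q.mode === "affine" || isVerifiedMlxRepo);
}
```

The DWQ check has to run before the allowlist match, because a DWQ repo id can contain an allowlisted id as a substring.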
1663
+ ---
1664
+
1665
+ ## 28. The Progress-Reporting Fix: Killing the "Stuck at 10%" Freeze
1666
+
1667
+ A long-standing UX bug affecting every model was fixed this cycle (commit `682a09b`,
1668
+ `model-loader.ts`): the download bar **froze at "10% — discovering weight files"**
1669
+ for several seconds on every load, looking like a hang. The cause was a *dead zone*
1670
+ of latency-heavy network round-trips between the "discovering" progress emit and the
1671
+ first emit that the download code produced — during which nothing was reported:
1672
+
1673
+ - the **index probe** (`fetchJSON("model.safetensors.index.json")`),
1674
+ - **two header range-requests** per shard (`fetchRange(url, 0, 8)` to read the header
1675
+ length, then a second range to read the full header), and
1676
+ - the **first-byte latency** of the first data Range request.
1677
+
1678
+ The fix brackets that dead zone with two descriptive emits *before* the first chunk:
1679
+ a `"Reading {filename} header (i/N)…"` message right before the header fetch, and a
1680
+ `"Downloading {filename} (0/{totalMB} MB)"` message that shows the size up front so
1681
+ the first (latency-heavy) chunk doesn't read as a freeze. No throughput changed; the
1682
+ bar now narrates the round-trips it was previously silent through. It is a small fix
1683
+ with broad reach — it affected **every model the engine loads**.
1684
+
1685
+ ---
1686
+
1687
+ ## 29. Cross-Device Multi-Modal Parity (June 2026)
1688
+
1689
+ The milestone this cycle is not any single model but their *intersection*: **text,
1690
+ vision, and embeddings now all run natively on iPad Safari/WebKit**, on the same
1691
+ device that crashed at the start of the campaign (§17). Concretely, on iPad
1692
+ (iPadOS 26.5, WebKit), through the one native WGSL engine:
1693
+
1694
+ | Modality | Native model on iPad | Status |
1695
+ |---|---|---|
1696
+ | Text | Qwen3.5-0.8B INT4 | ✅ ~51 tok/s sustained (200-tok run), bit-correct |
1697
+ | Text (alt) | LFM2.5-350M | ✅ ~46 tok/s on-device; faster/smaller alternative |
1698
+ | Vision | Qwen3.5 ViT (`describeImage`) | ✅ runs; encoder bit-exact vs HF (§22) |
1699
+ | Embeddings | EmbeddingGemma-300M | ✅ runs on-device (§25), 173 MB |
1700
+
1701
+ The desktop numbers remain higher (Qwen3.5 ~207 tok/s, §20; LFM2.5 ~600 tok/s on
1702
+ M4 Max), but the *parity* claim is about mobile: for the modalities that matter, the
1703
+ native engine now gives iPad **the same coverage the transformers.js/ONNX path
1704
+ offered, without the mobile crashes, and faster** (native mobile
1705
+ decode is ~5× the transformers.js path, §19/PROJECT-STATE §4).
1706
+
1707
+ The honest gap is **audio**. Neither STT nor TTS runs natively yet:
1708
+
1709
+ - **TTS via OmniVoice** — in progress; a Qwen3 backbone (mostly the text path the
1710
+ engine already runs) plus a codec decoder.
1711
+ - **STT via Moonshine** — not started; a lean encoder-decoder with a raw-waveform
1712
+ Conv1d frontend (no log-mel Conv2d), needing a parallel CrossAttention kernel.
1713
+
1714
+ Until those land, audio is the one launch modality that still requires the
1715
+ transformers.js path (Kokoro/Supertonic TTS, Whisper STT). The native engine is at
1716
+ **multi-modal parity minus audio**.
1717
+
1718
+ ---
1719
+
1720
+ ## 30. Model-Zoo Growth and the Saturation of the Kernel Library
1721
+
1722
+ Two model-zoo facts close out the cycle, and together they describe a qualitative
1723
+ shift in what "add a model" costs.
1724
+
1725
+ **LFM2.5-350M shipped** (`architectures/lfm2.ts`, `Lfm2ForCausalLM`) — a hybrid
1726
+ conv/attention text model, ~199 MB at q4 (half Qwen3.5's footprint), faster on both
1727
+ desktop (~600 tok/s, ~2.8× Qwen3.5) and mobile (~46 tok/s). It needed **no new
1728
+ kernels**. Two general fixes fell out of it and were kept: (1) LFM2's effective FF
1729
+ dim is the `block_auto_adjust_ff_dim`-rounded value, not the config's raw
1730
+ `intermediate_size`; (2) its "garbage output" was a **chat-template** problem, not a
1731
+ graph problem — LFM2.5 ships its template as a `chat_template.jinja` sidecar absent
1732
+ from `tokenizer_config.json`, so the engine fell back to Qwen ChatML and injected an
1733
+ empty `<think>` loop. Fetching the `.jinja` sidecar and gating think-injection on the
1734
+ template actually emitting `<think>` fixes **any** model with a jinja sidecar.
1735
+
1736
+ **Adding a new *text* family is now usually Tier-1: a generator, no new kernels.**
1737
+ The kernel library has effectively **saturated** for standard transformers — Llama,
1738
+ Mistral, and Gemma-text all reduce to ops already in the registry (RMSNorm,
1739
+ GQA attention, RoPE, SwiGLU/GeGLU, INT4 matmul). New kernels are needed only for
1740
+ genuinely novel computation: a new norm, an SSM/Mamba path, PLE, a cross-attention
1741
+ for STT. The effort tiers (`docs/adding-a-model-family.md`) make this concrete —
1742
+ Tier 1 (hours, generator-only) covers most of the HF text zoo; Tier 2 (one novel op)
1743
+ and Tier 3 (SSM/MoE/new-executor) are now the exception. EmbeddingGemma was a
1744
+ Tier-2-ish outlier *only* because of MeanPool/Scale and the SPM tokenizer; the next
1745
+ text family will most likely cost a config→IR generator and nothing else.
1746
+
1747
+ ---
1748
+
1749
+ ## 31. Native STT: Moonshine (June 2026)
1750
+
1751
+ §29 named audio as the one launch modality still on the transformers.js path. This
1752
+ cycle closes the **speech-to-text** half of that gap natively. The model is
1753
+ **Moonshine** (`architectures/moonshine.ts`, `moonshine-executor.ts`,
1754
+ `moonshine-stt.ts`) — a lean encoder-decoder ASR model chosen over Whisper
1755
+ precisely because it avoids the two things the native engine did not want to build:
1756
+ a log-mel front-end (a Conv2d/FFT pipeline) and a generic spectrogram path.
1757
+
1758
+ ### A raw-waveform Conv1d front-end (no FFT, no log-mel)
1759
+
1760
+ Moonshine consumes **16 kHz PCM directly**. The front-end
1761
+ (`architectures/moonshine.ts`) is three strided 1-D convolutions, not a
1762
+ spectrogram:
1763
+
1764
+ 1. `Conv1d(1→H, kernel 127, stride 64)` + `tanh`,
1765
+ 2. `GroupNorm(num_groups=1)` over the H channels,
1766
+ 3. `Conv1d(H→2H, kernel 7, stride 3)` + GELU,
1767
+ 4. `Conv1d(2H→H, kernel 3, stride 2)` + GELU,
1768
+ 5. a `Transpose` from `[H, frames]` to `[frames, H]` for the transformer.
1769
+
1770
+ The total downsample is 64·3·2 = **384×**, i.e. ~41.7 frames/second at 16 kHz.
1771
+ This needed three genuinely new kernels — **`Conv1dFull`**, **`GroupNorm`**, and
1772
+ **`Tanh`** (plus a `Transpose`) in `kernels/registry.ts` — and **no FFT and no
1773
+ 2-D convolution**, which is the whole reason Moonshine was picked over Whisper.
1774
+
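The frame arithmetic for the front-end above can be checked with a short calculation. This sketch assumes the three convolutions are unpadded ("valid"), which the text does not state, so the exact counts are illustrative; the helper names are hypothetical. Under that assumption the kernel overhang makes short clips land slightly under the nominal 16000 / 384 ≈ 41.67 frames per second.

```typescript
// Sketch only: output-length arithmetic for a stack of unpadded strided Conv1d stages.
interface ConvStageSketch {
  kernel: number;
  stride: number;
}

// Kernel sizes and strides as listed for the Moonshine front-end.
const MOONSHINE_FRONT_END: ConvStageSketch[] = [
  { kernel: 127, stride: 64 },
  { kernel: 7, stride: 3 },
  { kernel: 3, stride: 2 },
];

// ASSUMPTION: no padding, so out = floor((len - kernel) / stride) + 1.
function convOutLength(len: number, stage: ConvStageSketch): number {
  if (len < stage.kernel) return 0;
  return Math.floor((len - stage.kernel) / stage.stride) + 1;
}

// Number of encoder frames produced from `samples` PCM samples.
function moonshineFrames(samples: number): number {
  return MOONSHINE_FRONT_END.reduce((len, stage) => convOutLength(len, stage), samples);
}

// Nominal rate from the combined stride alone.
const NOMINAL_FRAMES_PER_SECOND = 16000 / (64 * 3 * 2);
```

For example, `moonshineFrames(16000)` gives 40 frames for one second of audio and `moonshineFrames(160000)` gives 415 for ten seconds, approaching the nominal rate as the clip gets longer.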
1775
+ ### The CrossAttention kernel — the one real new attention primitive
1776
+
1777
+ The decoder attends to the frozen encoder output, so it needed a **cross-attention**
1778
+ kernel distinct from the engine's self-attention. `WGSL_CROSS_ATTENTION`
1779
+ (`kernels/registry.ts`, `crossAttentionSpec`) is a tiled, online-softmax
1780
+ cross-attention: decoder queries stream over the encoder sequence in tiles of 16
1781
+ with a 256-thread workgroup, running-max/running-sum softmax, V accumulation — and
1782
+ the same mobile-safe discipline the self-attention kernel uses (no `select()`,
1783
+ `exp()` clamped, ≤16 KB workgroup memory). It was validated against an independent
1784
+ NumPy reference (`scripts/engine/test-crossattention.mjs`): **max absolute error
1785
+ < 2e-4 and cosine ≥ 0.9999**, i.e. agreement to within f32 rounding.
1786
+
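The online-softmax idea behind the kernel can be shown on the CPU for a single decoder query. The sketch below is not the WGSL kernel: it is a plain TypeScript reference with hypothetical function names, using the tile size of 16 from the text, and it is compared against a naive two-pass softmax to show that streaming in tiles gives the same result.

```typescript
// Sketch only: CPU reference for one query attending over encoder keys/values.
function dotScaled(q: number[], k: number[]): number {
  let s = 0;
  for (let i = 0; i < q.length; i++) s += q[i] * k[i];
  return s / Math.sqrt(q.length);
}

// Naive version: materialize all scores, then softmax, then weight V.
function crossAttendNaive(q: number[], keys: number[][], values: number[][]): number[] {
  const scores = keys.map((k) => dotScaled(q, k));
  const max = Math.max(...scores);
  const weights = scores.map((s) => Math.exp(s - max));
  const sum = weights.reduce((a, b) => a + b, 0);
  const out = new Array<number>(values[0].length).fill(0);
  weights.forEach((w, j) => {
    for (let c = 0; c < out.length; c++) out[c] += (w / sum) * values[j][c];
  });
  return out;
}

// Tiled version: running max, running sum, rescaled V accumulator.
function crossAttendTiled(q: number[], keys: number[][], values: number[][], tile = 16): number[] {
  const out = new Array<number>(values[0].length).fill(0);
  let runMax = -Infinity;
  let runSum = 0;
  for (let start = 0; start < keys.length; start += tile) {
    const end = Math.min(start + tile, keys.length);
    const scores: number[] = [];
    for (let j = start; j < end; j++) scores.push(dotScaled(q, keys[j]));

    const newMax = Math.max(runMax, ...scores);
    const rescale = Math.exp(runMax - newMax); // 0 on the first tile
    runSum *= rescale;
    for (let c = 0; c < out.length; c++) out[c] *= rescale;

    for (let j = start; j < end; j++) {
      const p = Math.exp(scores[j - start] - newMax);
      runSum += p;
      for (let c = 0; c < out.length; c++) out[c] += p * values[j][c];
    }
    runMax = newMax;
  }
  return out.map((v) => v / runSum);
}
```

Because the accumulator is rescaled whenever the running maximum changes, the full score row never has to be held at once, which is what keeps the per-tile working set small.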
1787
+ ### The dual-graph executor: encode once, freeze K/V, then decode
1788
+
1789
+ The runtime is **two graphs, not one** (`moonshine-executor.ts`,
1790
+ `moonshine-stt.ts`):
1791
+
1792
+ - `MoonshineEncoderExecutor.encode(pcm)` runs the Conv1d front-end and the
1793
+ bidirectional encoder transformer **once** over the whole utterance, then
1794
+ projects and caches the **per-decoder-layer K/V** from the encoder output.
1795
+ - `MoonshineSTT.transcribe(pcm)` binds that **frozen encoder K/V** into the decoder
1796
+ and runs a normal greedy autoregressive loop, where each step does self-attention
1797
+ over the generated text *plus* cross-attention into the frozen encoder K/V.
1798
+
1799
+ This is the first encoder-decoder shape in the engine; the design choice — compute
1800
+ the encoder K/V exactly once and treat it as a constant during decode — is what
1801
+ keeps cross-attention cheap per token.
1802
+
1803
+ ### Interleaved RoPE
1804
+
1805
+ Moonshine's rotary embedding is **interleaved**, not split-half. Where the engine's
1806
+ default RoPE (HF Llama `rotate_half`) pairs dim *i* with dim *i*+½·rope_dim,
1807
+ Moonshine pairs **adjacent** dims (2p, 2p+1) under a single frequency `inv_freq[p]`.
1808
+ This is a separate kernel, `WGSL_ROPE_INTERLEAVED` (`ROPE_INTERLEAVED_SPEC`),
1809
+ selected by an `interleaved: true` attribute on the encoder/decoder RoPE nodes;
1810
+ the comment records it as verified against HF's `MoonshineRotaryEmbedding`
1811
+ (`cos[:half].repeat_interleave(2)` + interleaved `rotate_half`).
1812
+
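The difference between the two pairings is easiest to see side by side. The following sketch uses hypothetical function names and a default `theta` of 10000 purely for illustration (the text does not give Moonshine's value); both functions rotate the full vector.

```typescript
// Sketch only: CPU reference for the two RoPE pairings on one head vector.

// Split-half (rotate_half style): pairs dim p with dim p + d/2.
function ropeSplitHalf(x: number[], pos: number, theta = 10000): number[] {
  const out = x.slice();
  const half = x.length / 2;
  for (let p = 0; p < half; p++) {
    const angle = pos * Math.pow(theta, (-2 * p) / x.length);
    const c = Math.cos(angle);
    const s = Math.sin(angle);
    out[p] = x[p] * c - x[p + half] * s;
    out[p + half] = x[p + half] * c + x[p] * s;
  }
  return out;
}

// Interleaved: pairs adjacent dims (2p, 2p+1) under the same frequency inv_freq[p].
function ropeInterleaved(x: number[], pos: number, theta = 10000): number[] {
  const out = x.slice();
  const half = x.length / 2;
  for (let p = 0; p < half; p++) {
    const angle = pos * Math.pow(theta, (-2 * p) / x.length);
    const c = Math.cos(angle);
    const s = Math.sin(angle);
    out[2 * p] = x[2 * p] * c - x[2 * p + 1] * s;
    out[2 * p + 1] = x[2 * p + 1] * c + x[2 * p] * s;
  }
  return out;
}
```

Feeding `[1, 0, 0, 0]` at position 1 through both makes the layout difference visible: the sine component lands in index 2 for split-half and in index 1 for interleaved.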
1813
+ ### Validation and status
1814
+
1815
+ End-to-end transcription is exercised by `scripts/engine/test-moonshine-transcribe.mjs`,
1816
+ which asserts the transcript **contains the expected ground-truth substrings** for
1817
+ standard HF Moonshine reference clips (e.g. "stew for dinner", "his belly counsel"),
1818
+ and reports real-time-factor and the 4-bit size projection as informational output
1819
+ (both computed dynamically from the run, not hardcoded). The architecture comment
1820
+ records **encoder cosine ≈ 0.990 vs HF transformers** on Dawn. (The crisper
1821
+ "6/6 verbatim / RTF ~40× / ~31 MB at 4-bit" framing from the working notes is
1822
+ consistent with these checks, but the *enforced* gates in the committed tests are
1823
+ the substring-match transcript assertions and the bit-exact CrossAttention numbers
1824
+ above; quote those when precision matters.) **Whisper stays as the multilingual /
1825
+ no-WebGPU fallback** — `WhisperSTT` (`src/core/stt.ts`) is a separate
1826
+ transformers.js/ONNX path, untouched by and independent of the native Moonshine
1827
+ engine.
1828
+
1829
+ ---
1830
+
1831
+ ## 32. Native TTS: Kani-TTS-2 and the NanoCodec Decoder (June 2026)
1832
+
1833
+ The **text-to-speech** half of the audio gap is **partly** landed: the hard,
1834
+ novel piece — the **NanoCodec audio decoder** — is implemented and validated
1835
+ bit-exact, while the codec-LM backbone's autoregressive driver is scaffolded and
1836
+ deliberately not yet runnable end-to-end. The model is **Kani-TTS-2**
1837
+ (`architectures/kani_tts.ts`).
1838
+
1839
+ ### The backbone: an LFM2-350M codec-LM
1840
+
1841
+ Kani-TTS-2's backbone is **LFM2-350M** (arch string `KaniTTS2ForCausalLM`,
1842
+ model_type `lfm2`) reusing `generateLfm2Graph` almost verbatim — the LFM2 block
1843
+ math (RMSNorm, short-conv with conv-state cache, GQA, SwiGLU) shipped already in
1844
+ §30. It autoregressively emits **NanoCodec audio tokens, 4 per frame**, into a
1845
+ vocab that extends *above* the text vocab (`vocab_size` 80538 vs `text_vocab_size`
1846
+ 64400; audio token IDs start at 64410). The backbone-specific additions are
1847
+ frame-level position IDs (audio tokens within a frame share a position; text tokens
1848
+ advance by one), a **learnable per-layer RoPE**
1849
+ (`α^(l) = alpha_min + (alpha_max−alpha_min)·sigmoid(alpha_weight^(l))`), the
1850
+ 4-token-frame decode loop, and an optional speaker-embedding projection.
1851
+ `generateKaniTtsGraph` currently **throws a descriptive error** rather than emit a
1852
+ half-wired graph: it parses and reports the config, confirms the decoder is done,
1853
+ and lists the remaining position/RoPE/decode-driver glue.
1854
+
1855
+ ### The NanoCodec decoder — FSQ + causal HiFi-GAN, the novel part
1856
+
1857
+ `generateNanoCodecDecoderGraph` is the genuinely new computation, and it is
1858
+ **complete and validated**. Two stages:
1859
+
1860
+ - **FSQ dequant.** NanoCodec uses **finite scalar quantization**: 4 groups × 4
1861
+ dims, per-group levels `[9, 8, 8, 7]` (codebook 4032), mixed-radix base
1862
+ `[1, 9, 72, 576]`. Each audio code is unpacked by
1863
+ `nonneg = (idx // base[d]) % levels[d]; code = (nonneg − L/2)/(L/2)` — a new
1864
+ **`FSQDequant`** kernel.
1865
+ - **Causal HiFi-GAN vocoder.** A `CausalConv1d(16→864, k7)`, then 5 upsample
1866
+ stages with rates `[7, 7, 6, 3, 2]` (each: `HalfSnake` → depthwise
1867
+ `CausalConvTranspose1d(k = 2·rate)` → a HiFi-GAN residual layer averaging
1868
+ kernels `[3, 7, 11]` over dilations `[1, 3, 5]`), then `HalfSnake` → a final
1869
+ `CausalConv1d(27→1, k3)` and a clamp to [-1, 1]. The hop is 1764, so PCM length
1870
+ = `frames · 1764` at 22050 Hz.
1871
+
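The mixed-radix unpacking and the hop arithmetic above are small enough to write out. This sketch uses hypothetical names and makes one labeled assumption: the half-width `L/2` in the normalization is taken as an integer (`floor(L/2)`), which matters only for the odd level count 9 and is not confirmed by the text.

```typescript
// Sketch only: FSQ code unpacking for one group, plus PCM length from the upsample rates.
const FSQ_LEVELS = [9, 8, 8, 7];
const FSQ_BASE = [1, 9, 72, 576]; // running product of the levels
const UPSAMPLE_RATES = [7, 7, 6, 3, 2];

// Mixed-radix digits: nonneg[d] = (idx // base[d]) % levels[d].
function fsqDigits(idx: number): number[] {
  return FSQ_LEVELS.map((levels, d) => Math.floor(idx / FSQ_BASE[d]) % levels);
}

// Normalized codes per dimension.
function fsqDequant(idx: number): number[] {
  return fsqDigits(idx).map((nonneg, d) => {
    // ASSUMPTION: integer half-width; the text writes L/2 without saying which division.
    const half = Math.floor(FSQ_LEVELS[d] / 2);
    return (nonneg - half) / half;
  });
}

// Codebook size and hop follow directly from the listed levels and rates.
const FSQ_CODEBOOK_SIZE = FSQ_LEVELS.reduce((a, b) => a * b, 1);
const NANOCODEC_HOP = UPSAMPLE_RATES.reduce((a, b) => a * b, 1);

function pcmLength(frames: number): number {
  return frames * NANOCODEC_HOP;
}
```

The products come out to a codebook of 4032 and a hop of 1764, so 25 frames decode to 44100 samples, or two seconds at 22050 Hz.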
1872
+ This needed two more new kernels beyond `FSQDequant`: **`HalfSnake1d`** (snake
1873
+ activation on the first half of the channels, leaky-ReLU on the rest) and
1874
+ **`ConvTranspose1dDepthwise`** (causal depthwise transposed convolution) — all
1875
+ three registered in `kernels/registry.ts`.
1876
+
1877
+ ### Validation and the license note
1878
+
1879
+ The full decoder is checked by `scripts/engine/test-nanocodec-decode.mjs` against a
1880
+ real MLX reference. The committed assertion gate is **`err < 1e-3`** (and matching
1881
+ PCM length); the run prints the *actual* measured error, which the code's own
1882
+ headers record as **max|err| ≈ 4.2e-6 vs the MLX reference** — i.e. bit-exact
1883
+ within f32 rounding, with comfortable margin under the 1e-3 gate. **Status:**
1884
+ NanoCodec decoder + Kani scaffold landed and validated; the codec-LM backbone's
1885
+ frame-position / learnable-RoPE / 4-token-frame AR loop is the remaining glue
1886
+ (most of the block math is reused from `lfm2.ts`).
1887
+
1888
+ **Licensing** (recorded in the source header): the shipping **kani-tts-2-en** is
1889
+ **LFM1.0 (Liquid AI), `license: other`** — it fine-tunes LFM2-350M, so it is *not*
1890
+ Apache; the NanoCodec weights are under the **NVIDIA Open Model License**. The
1891
+ older `kani-tts-450m-0.2-ft` variant is **Apache-2.0** with the same architecture.
1892
+
1893
+ ---
1894
+
1895
+ ## 33. Gemma 4 E2B: PLE, KV-Sharing, and Logit Softcap (June 2026)
1896
+
1897
+ Gemma 4 E2B is the smallest Gemma 4 and the engine's first **Tier-2** text decoder
1898
+ beyond the Gemma3 encoder of §25 — a *text-only* model with several Gemma-4-specific
1899
+ ops, but **no MatFormer / AltUp / LAuReL** (those belong to Gemma-3n, not E2B). Its
1900
+ graph generator is `architectures/gemma4.ts`, building on the Gemma machinery from
1901
+ `gemma3_encoder.ts` with a causal LM tail. The decode graph is **structurally
1902
+ complete and validated**; the gate to running it on real weights is embedding
1903
+ sharding (below).
1904
+
1905
+ ### What is Gemma-4-specific
1906
+
1907
+ - **Per-Layer Embeddings (PLE).** A *second* embedding table is gathered per token,
1908
+ projected once (`per_layer_model_projection` + `per_layer_projection_norm`), and
1909
+ then injected at **every** layer:
1910
+ `h = h + post_norm( per_layer_projection( gelu(gate(h)) · ple_i ) )` — a per-layer
1911
+ gate, GELU, elementwise multiply with that layer's PLE slice, projection, norm,
1912
+ and residual.
1913
+ - **KV-cache sharing.** The last `num_kv_shared_layers` layers reuse the K/V cache of
1914
+ the matching `layer_type` from before the shared region — a *graph-level rewire*
1915
+ with no kernel change and no K/V projection emitted for the shared layers. For E2B
1916
+ that is **35 layers total, the last 20 shared** (layers 15–34 reuse layers 0–14 of
1917
+ matching type).
1918
+ - **Proportional / dual-theta RoPE.** Full-attention layers rotate the first
1919
+ `partial_rotary_factor·head_dim` dims (0.25·256 = 64) but compute `inv_freq` over
1920
+ the **full** `head_dim` denominator (the `rope_denom` attribute) — distinct from
1921
+ Qwen3.5's partial RoPE, which divides by the *rotated* dim. Full layers use
1922
+ `rope_theta` 1e6, sliding layers 1e4 — the same dual-theta selection as the Gemma3
1923
+ encoder.
1924
+ - **GeGLU MLP** (`down(gelu_tanh(gate) · up)`) and Gemma's four-norm sandwich,
1925
+ reused from §25.
1926
+ - **Final logit softcap via a new `Softcap` kernel.** `WGSL_SOFTCAP`
1927
+ (`kernels/registry.ts`, `softcapSpec`) computes `out = cap · tanh(in / cap)` with
1928
+ the cap passed as a bit-pattern; for E2B `cap = 30`, squashing the final logits to
1929
+ ±30. It is wired into the graph tail (`logit_softcap` node) when
1930
+ `final_logit_softcapping > 0`, and validated by `test-gemma4-softcap.mjs`
1931
+ (GPU vs CPU reference, max err < 1e-4; saturation behavior confirmed).
1932
+
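Three of the items above are simple enough to restate as CPU-side helpers. The sketch below is illustrative only: the function names are hypothetical, the rule "reuse the last non-shared layer of the same type" is an assumption about how the matching layer is chosen, and any layer-type pattern passed in is the caller's, not E2B's actual configuration.

```typescript
// Sketch only: softcap, KV-share source lookup, and proportional inv_freq.

// out = cap * tanh(in / cap); squashes logits into (-cap, cap).
function softcap(logit: number, cap = 30): number {
  return cap * Math.tanh(logit / cap);
}

type LayerTypeSketch = "sliding_attention" | "full_attention";

// Which layer's K/V cache a given layer reads.
// ASSUMPTION: a shared layer reuses the LAST non-shared layer with the same type.
function kvSourceLayer(layer: number, layerTypes: LayerTypeSketch[], numKvShared: number): number {
  const firstShared = layerTypes.length - numKvShared;
  if (layer < firstShared) return layer; // owns its K/V
  for (let src = firstShared - 1; src >= 0; src--) {
    if (layerTypes[src] === layerTypes[layer]) return src;
  }
  throw new Error(`no non-shared layer of type ${layerTypes[layer]}`);
}

// inv_freq over `rotatedDims` dims, with an explicit denominator.
// Proportional RoPE: ropeDenom = head_dim (256). Dividing by the rotated dim
// instead would mean ropeDenom = rotatedDims (64).
function ropeInvFreq(rotatedDims: number, ropeDenom: number, theta: number): number[] {
  return Array.from({ length: rotatedDims / 2 }, (_, p) => Math.pow(theta, (-2 * p) / ropeDenom));
}
```

With 35 layers and 20 shared, `kvSourceLayer` returns the layer itself for indices 0 to 14 and a same-type index below 15 for the rest, which is the graph-level rewire the text describes.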
1933
+ ### Validation and the gating item
1934
+
1935
+ `scripts/engine/test-gemma4-graph.mjs` is a structural validation of the generated
1936
+ graph — embedding+PLE pipeline, per-layer PLE injection chain, the sandwich norms,
1937
+ QK-norm + GeGLU, causal attention, proportional RoPE per layer-type, the KV-share
1938
+ rewire (own K/V for layers 0–14, shared for 15–34), the LM head and the softcap —
1939
+ and it passes **62/62** (893 nodes, 1920 tensors).
1940
+
1941
+ The open item is **embedding sharding**. At 4-bit the PLE nibble buffer is
1942
+ **~1.17 GB** (≈1174 MB) and the main embedding is **~201 MB**, both over the
1943
+ per-binding cap, so the loader / `EmbeddingInt4` op must shard the quantized tables
1944
+ across multiple storage buffers before the decode path can run on real weights. The
1945
+ decode graph (last-row slice → tied LM head → Softcap) is otherwise complete and
1946
+ verified; sharding is loader-level work, in progress.
1947
+
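A row-aligned split is one plausible way to shard a table that exceeds the binding cap. The sketch below is an assumption-laden illustration, not the loader's design: the names are hypothetical, the 128 MB cap from Appendix B is taken as 128 MiB, and the row counts and row widths in the usage note are made-up values chosen only so the totals land near the ~1.17 GB and ~201 MB figures quoted above.

```typescript
// Sketch only: split a [rows x bytesPerRow] table into shards that each fit one binding.
interface RowShardSketch {
  startRow: number; // inclusive
  endRow: number; // exclusive
  byteLength: number;
}

function shardRows(totalRows: number, bytesPerRow: number, maxBindingBytes: number): RowShardSketch[] {
  const rowsPerShard = Math.floor(maxBindingBytes / bytesPerRow);
  if (rowsPerShard < 1) throw new Error("a single row exceeds the binding limit");
  const shards: RowShardSketch[] = [];
  for (let start = 0; start < totalRows; start += rowsPerShard) {
    const end = Math.min(start + rowsPerShard, totalRows);
    shards.push({ startRow: start, endRow: end, byteLength: (end - start) * bytesPerRow });
  }
  return shards;
}

// Map a token id to (shard index, row within that shard) for the gather.
function locateRow(tokenId: number, shards: RowShardSketch[]): { shard: number; localRow: number } {
  const shard = shards.findIndex((s) => tokenId >= s.startRow && tokenId < s.endRow);
  if (shard < 0) throw new Error(`token ${tokenId} is outside the table`);
  return { shard, localRow: tokenId - shards[shard].startRow };
}

const ASSUMED_BINDING_CAP = 128 * 1024 * 1024; // ASSUMPTION: "128 MB" read as 128 MiB
```

With hypothetical shapes of 262144 rows at 4480 bytes per row (about 1174 MB) and 262144 rows at 768 bytes per row (about 201 MB), `shardRows` yields 9 and 2 shards respectively under that cap. Keeping shards row-aligned means a lookup touches exactly one buffer.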
1948
+ ---
1949
+
1950
+ ## 34. On-Device Memory / RAG (June 2026)
1951
+
1952
+ The native embedding stack (§21, §25) makes a small **retrieval-augmented memory**
1953
+ layer cheap to build entirely on-device, with no server and no external vector DB.
1954
+ It ships as `@tryhamster/gerbil/memory` (`src/memory/`, exported in `package.json`)
1955
+ and reuses the **native EmbeddingGemma** path through a thin adapter.
1956
+
1957
+ ### Shape
1958
+
1959
+ - **Pluggable vector stores.** Three backends behind one interface
1960
+ (`src/memory/stores/`): `InMemoryStore` (default, process lifetime),
1961
+ `IndexedDBStore` (browser, durable across sessions), and `FileStore` (Node, a
1962
+ durable JSON file on disk) — via `createInMemoryStore()` / `createIndexedDBStore()`
1963
+ / `createFileStore(path)`.
1964
+ - **Native-embedder adapter.** `createGerbilEmbedder(engine)`
1965
+ (`src/memory/gerbil-embedder.ts`) wraps anything with an `embedBatch` — a Gerbil
1966
+ instance, the one-liner `embedBatch`, or the browser `useEmbedding().embedBatch` —
1967
+ so a memory is built with `createMemory({ embed: createGerbilEmbedder(g) })` after
1968
+ `g.loadModel("embeddinggemma-300m")`. The whole pipeline (embed → store → recall)
1969
+ runs on the native WGSL engine.
1970
+ - **Chunking.** `add(text, { chunk: true })` splits long documents into overlapping
1971
+ character windows (`src/memory/chunking.ts`, defaults 1000 chars / 200 overlap),
1972
+ one record per chunk, so retrieval targets relevant passages.
1973
+ - **Redaction on write.** `applyRedaction` (`src/memory/redaction.ts`) accepts a
1974
+ `RegExp` (matches replaced with `[REDACTED]`) or a function, applied **before**
1975
+ the text is embedded and stored, so sensitive data never lands in the index.
1976
+ - **Token-budgeted recall.** `recall(query, options)` (`src/memory/memory.ts`)
1977
+ searches the store, then **greedily packs the highest-scoring records under a
1978
+ token budget** (default 1024, via a ~4-chars/token estimate, accounting for the
1979
+ separator), returning `{ context, records, tokensUsed }` — a ready-to-inject
1980
+ context string sized to fit a prompt budget.
1981
+
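The chunking and budgeted-recall behavior described above can be sketched without the package itself. The functions below are standalone illustrations with hypothetical names, not the code in `src/memory/`; the defaults (1000 chars, 200 overlap, budget 1024, ~4 chars per token) come from the text, while the choice to skip an oversized record and keep trying lower-ranked ones is an assumption.

```typescript
// Sketch only: overlapping character windows and greedy token-budget packing.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

interface ScoredRecordSketch {
  text: string;
  score: number;
}

// Roughly 4 characters per token.
function estimateTokens(s: string): number {
  return Math.ceil(s.length / 4);
}

function packUnderBudget(records: ScoredRecordSketch[], budget = 1024, separator = "\n\n") {
  const ranked = [...records].sort((a, b) => b.score - a.score);
  const kept: ScoredRecordSketch[] = [];
  let tokensUsed = 0;
  for (const record of ranked) {
    // The separator only costs tokens once there is something to separate from.
    const cost = estimateTokens(record.text) + (kept.length > 0 ? estimateTokens(separator) : 0);
    // ASSUMPTION: skip records that do not fit and keep trying lower-ranked ones.
    if (tokensUsed + cost > budget) continue;
    kept.push(record);
    tokensUsed += cost;
  }
  return { context: kept.map((r) => r.text).join(separator), records: kept, tokensUsed };
}
```

A 2500-character document chunks into three windows starting at offsets 0, 800, and 1600, and a budget of 0 returns an empty context, which corresponds to the empty-context edge mentioned under Validation.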
1982
+ ### Validation
1983
+
1984
+ `src/memory/memory.test.ts` is **12 tests** covering relevance ranking, metadata
1985
+ filtering, budget-aware packing (including the empty-context edge), chunking,
1986
+ import/export round-trip, regex and predicate redaction, and durability for both the
1987
+ IndexedDB and File stores. The module is a straightforward, fully-tested consumer of
1988
+ the native embedding modality — no new kernels, no GPU work of its own.
1989
+
1990
+ ---
1991
+
1992
+ ## 35. The June-2026 Autoresearch Campaign (Text, LFM2, and ViT)
1993
+
1994
+ §20 documented the first autoresearch run (Qwen3.5 decode 145→207 on M4 Max). After
1995
+ LFM2.5 (§30) and the ViT (§22) landed, the same loop
1996
+ (`scripts/engine/optimize.mjs --mode=autoresearch`; results in
1997
+ `scripts/engine/results.jsonl`, chart at `scripts/engine/chart.html`) was run over
1998
+ **three more batches** — two on text, one on vision — all on M4 Max / node-dawn. The
1999
+ numbers are smaller than the first run's, and **that is the finding**: the tuned
2000
+ kernels are now near the bandwidth floor, so only a specific *class* of edit still
2001
+ wins.
2002
+
2003
+ ### The three batches
2004
+
2005
+ | Batch | Target | Baseline | Best | Kept | Reverted |
2006
+ |---|---|---|---|---|---|
2007
+ | Text (b1) | Qwen3.5-0.8B Q4 decode | 219.1 | ~223 | 3 | 5 |
2008
+ | Text (b1) | LFM2.5-350M Q4 decode | 624.1 | ~649 | (same batch) | |
2009
+ | Text (b2) | Qwen3.5 / LFM2.5 decode | 220.8 / 652.6 | **234.4 / 672.2** | several | several |
2010
+ | Vision (b3/b4) | Qwen3.5 ViT encode | 581.8 ms | **~502 ms** (−~14%) | 3 + 1 | 3 |
2011
+ | Vision (b3/b4) | `describeImage` decode | 37.0 tok/s | **42.0 tok/s** (+13.5%) | (same batch) | |
2012
+
2013
+ The text wins were **fusions that also kill a wide-tensor round-trip**: fuse SiLU
2014
+ into Qwen's CausalConv1d (+1.8%), fuse LFM2's post-gate `C·conv` (+2.0%) and its
2015
+ pre-gate `B·x`/`B/x` slices via a `MulCols` (+1.9%); the second batch pushed Qwen to
2016
+ ~234 and LFM2 to ~672. The vision wins were all in the **shared f32 `MatMul`** that
2017
+ dominates ViT encode — 2×2 register blocking (−6.5%), vec4 global-tile loads
2018
+ (−3.0%), 4×2 register blocking (−1.9%), plus a `MatMul`+`AddBias` → `MatMulBias`
2019
+ structural fusion (desktop ~flat but removes ~86 dispatches and wide round-trips per
2020
+ encode). Text decode was *unaffected* by the vision batch (it uses `MatMulInt4`,
2021
+ confirmed steady at ~233 tok/s throughout), and every kept ViT change stayed
2022
+ bit-exact (merged cosine 1.0, e2e 7/7, description matches HF).
2023
+
2024
+ ### The lesson, sharpened
2025
+
2026
+ The first campaign's lesson ("stop optimizing the kernel in front of you; profile
2027
+ the whole forward") generalized into a sharper rule, recorded in the batch summaries:
2028
+
2029
+ > **On desktop Dawn, dispatch-count cuts only win when they also eliminate a wide
2030
+ > round-trip on a poorly-occupied kernel.** Pure dispatch/barrier reduction on an
2031
+ > already-tuned kernel is *noise*.
2032
+
2033
+ Concretely, the desktop wins came from eliminating **large, wide reads on
2034
+ poorly-occupied kernels** — fused conv+activation, register-blocked + f16-mixed ViT
2035
+ matmuls — while the tuned INT4 matmuls and the Mamba SSM sit at the **bandwidth
2036
+ floor**, where butterfly reductions, subgroup shuffles, and bigger N-tiles came back
2037
+ flat or negative and were reverted. The honest read is that **the remaining headroom
2038
+ is mobile, not desktop**: several reverted-on-desktop fusions (the `MatMul`+`AddBias`
2039
+ merge, residual-Add fusion) are *predicted mobile wins* because mobile is
2040
+ round-trip-bound below ~32 dispatches per command buffer (§19) — which is exactly
2041
+ why the autoresearch loop's next leg is a mobile-validation pass rather than more
2042
+ desktop rounds.
2043
+
2044
+ ---
2045
+
2046
+ ## Appendix A: File Map
2047
+
2048
+ ```
2049
+ src/gpu/
2050
+ ir.ts -- IR types: OpType, TensorDesc, OpNode, ModelGraph, CANONICAL_KEYS
2051
+ safetensors.ts -- Safetensors binary parser with zero-copy typed views
2052
+ device.ts -- WebGPU device init, buffer helpers, pipeline cache, readback
2053
+ tokenizer.ts -- Pure JS BPE tokenizer from HF tokenizer.json
2054
+ sampler.ts -- CPU-side token sampling (temp/top-k/top-p/rep penalty)
2055
+ executor.ts -- Graph executor: buffer allocation, op dispatch, forward pass
2056
+ kv-cache.ts -- GPU-resident KV cache: allocation, advance, reset, destroy
2057
+ model-loader.ts -- HF Hub integration: fetch config/tokenizer/weights, generate IR (Cache-API-only; OPFS removed)
2058
+ vision-executor.ts -- VisionExecutor: runs the ViT graph (single prefill over N patches)
2059
+ vision-preprocess.ts -- Host-side ViT pos-embeds + 2D rotary cos/sin (grid-only, bit-exact vs HF)
2060
+ moonshine-executor.ts -- MoonshineEncoderExecutor: raw-PCM Conv1d front-end + bidirectional encoder, frozen per-layer K/V
2061
+ moonshine-stt.ts -- MoonshineSTT.transcribe(): AR decoder with self- + cross-attention into frozen encoder K/V
2062
+ architectures/
2063
+ index.ts -- Architecture registry: maps HF strings to graph generators
2064
+ qwen2.ts -- Qwen2/3/3.5 graph generator (SwiGLU MLP, GQA attention; embedding tail for Qwen3-Embedding)
2065
+ qwen3_5_vision.ts -- Qwen3.5 12-layer ViT encoder graph (bidirectional, 2D rotary, spatial-merge-2)
2066
+ gemma3_encoder.ts -- EmbeddingGemma-300M bidirectional encoder (4-norm sandwich, dual-theta RoPE, per-head QK-norm, GeGLU, mean-pool + 2 Dense heads + L2Norm)
2067
+ lfm2.ts -- LFM2.5 hybrid conv/attention text generator (Lfm2ForCausalLM; Tier-1, no new kernels)
2068
+ moonshine.ts -- Moonshine STT encoder-decoder graph (raw-waveform Conv1d front-end, interleaved RoPE, cross-attention)
2069
+ kani_tts.ts -- Kani-TTS-2: LFM2-350M codec-LM scaffold + NanoCodec decoder graph (FSQ + causal HiFi-GAN); decoder validated, backbone AR-loop pending
2070
+ gemma4.ts -- Gemma 4 E2B Tier-2 text decoder (PLE, KV-cache sharing, proportional/dual-theta RoPE, GeGLU, final logit Softcap)
2071
+ kernels/
2072
+ registry.ts -- KernelSpec registry: WGSL sources, bindings, dispatch sizing, uniform builders (incl. SliceLastRow, fused decode kernels, MeanPool, Scale, L2Norm, CrossAttention, ROPE_INTERLEAVED, Conv1dFull/GroupNorm/Tanh/Transpose, FSQDequant/HalfSnake1d/ConvTranspose1dDepthwise, Softcap)
2073
+ wgsl/
2074
+ embedding.wgsl -- Embedding lookup (gather rows by token ID)
2075
+ matmul.wgsl -- Tiled f32 matrix multiply (16x16 shared memory)
2076
+ matmul_int4.wgsl -- Fused INT4 dequantize + matmul
2077
+ rmsnorm.wgsl -- RMS normalization (tree reduction)
2078
+ layernorm.wgsl -- Layer normalization (mean + variance, tree reduction)
2079
+ rope.wgsl -- Rotary position embeddings (GQA-aware)
2080
+ attention.wgsl -- Scaled dot-product attention (causal, GQA)
2081
+ softmax.wgsl -- Row-wise softmax (three-pass, tree reduction)
2082
+ silu.wgsl -- SiLU activation (x * sigmoid(x))
2083
+ gelu.wgsl -- Approximate GELU activation
2084
+ add.wgsl -- Element-wise addition
2085
+ mul.wgsl -- Element-wise multiplication
2086
+ ```
2087
+
2088
+ ---
2089
+
2090
+ ## Appendix B: WebGPU Browser Compatibility
2091
+
2092
+ | Browser | Version | Status |
2093
+ |---------|---------|--------|
2094
+ | Chrome | 113+ (May 2023) | Stable support |
2095
+ | Edge | 113+ (May 2023) | Stable support (Chromium-based) |
2096
+ | Safari | 18+ (Sep 2024) | Stable support (macOS + iOS) |
2097
+ | Firefox | 141+ (Jul 2025) | Stable support (Windows first) |
2098
+ | Chrome Android | 121+ (Jan 2024) | Stable support |
2099
+ | Safari iOS | 18+ | Stable support (via WKWebView) |
2100
+ | Samsung Internet | 25+ | Stable support |
2101
+
2102
+ ### Notable Limitations
2103
+
2104
+ - **iOS WKWebView memory**: the jetsam budget for a web-content process is ~1.5-2 GB. Post-fix (Sections 17.1, 18.1-18.2), Qwen3.5-0.8B INT4 at `maxSeqLen=512` runs at ~0.6-0.7 GB total GPU footprint — comfortable headroom, versus the 2.77 GB the per-tensor allocator used to request. The engine clamps iOS `maxSeqLen` to 2048 (default 512) so an oversized request can never reach the device.
2105
+ - **iPad WebGPU limits** (measured, iPadOS 26.5 defaults): `maxBufferSize` 256 MB, `maxStorageBufferBindingSize` 128 MB, `maxComputeWorkgroupStorageSize` 16384 bytes. The ~127 MB INT4 embedding sits just under the binding limit — larger vocabularies need sharding.
2106
+ - **WebKit submit granularity**: on WebKit older than iPadOS 26.5, batching many dispatches into one command buffer can zero storage reads mid-chain (Section 17.2); the engine's grouped-submit path (`?group=N`, Section 18.3) is the compatibility dial. iPadOS 26.5+ is correct at batch-all.
2107
+ - **Firefox**: WebGPU support arrived later than Chromium and Safari. Feature coverage is complete but performance may vary.
2108
+ - **shader-f16**: Not available on all GPUs. The engine detects this at initialization and adapts accordingly (currently all kernels use f32).
2109
+ - **maxBufferSize**: Varies by device. The engine requests the adapter's maximum, but some mobile GPUs have limits below 256 MB, which constrains model size.