@tryhamster/gerbil 1.0.0-rc.8 → 1.0.0
- package/LICENSE +1 -1
- package/README.md +247 -84
- package/dist/architectures-C1I5V3Dt.mjs +6070 -0
- package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
- package/dist/browser/index.d.ts +264 -588
- package/dist/browser/index.d.ts.map +1 -1
- package/dist/browser/index.js +585 -2334
- package/dist/browser/index.js.map +1 -1
- package/dist/cli.mjs +625 -1098
- package/dist/cli.mjs.map +1 -1
- package/dist/defaults-9komdrbY.mjs +24 -0
- package/dist/defaults-9komdrbY.mjs.map +1 -0
- package/dist/frameworks/express.d.mts +1 -3
- package/dist/frameworks/express.d.mts.map +1 -1
- package/dist/frameworks/express.mjs +7 -7
- package/dist/frameworks/express.mjs.map +1 -1
- package/dist/frameworks/fastify.d.mts +1 -1
- package/dist/frameworks/fastify.d.mts.map +1 -1
- package/dist/frameworks/fastify.mjs +3 -3
- package/dist/frameworks/fastify.mjs.map +1 -1
- package/dist/frameworks/hono.d.mts +1 -1
- package/dist/frameworks/hono.d.mts.map +1 -1
- package/dist/frameworks/hono.mjs +4 -4
- package/dist/frameworks/hono.mjs.map +1 -1
- package/dist/frameworks/next.d.mts +3 -2
- package/dist/frameworks/next.d.mts.map +1 -1
- package/dist/frameworks/next.mjs +4 -4
- package/dist/frameworks/next.mjs.map +1 -1
- package/dist/frameworks/react.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts.map +1 -1
- package/dist/frameworks/trpc.mjs +4 -4
- package/dist/frameworks/trpc.mjs.map +1 -1
- package/dist/gerbil-BHrJJIa4.mjs +1656 -0
- package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
- package/dist/gerbil-BT9fCydo.d.mts +488 -0
- package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
- package/dist/gerbil-DomNfIr1.mjs +4 -0
- package/dist/gpu/hooks.d.mts +520 -0
- package/dist/gpu/hooks.d.mts.map +1 -0
- package/dist/gpu/hooks.mjs +1188 -0
- package/dist/gpu/hooks.mjs.map +1 -0
- package/dist/gpu/index.d.mts +2 -0
- package/dist/gpu/index.mjs +6 -0
- package/dist/gpu-33qCAtHW.mjs +3615 -0
- package/dist/gpu-33qCAtHW.mjs.map +1 -0
- package/dist/index-Dgmb2kE3.d.mts +245 -0
- package/dist/index-Dgmb2kE3.d.mts.map +1 -0
- package/dist/index-jEAL2s-A.d.mts +2022 -0
- package/dist/index-jEAL2s-A.d.mts.map +1 -0
- package/dist/index.d.mts +22 -487
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +13 -8
- package/dist/index.mjs.map +1 -1
- package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
- package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
- package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
- package/dist/integrations/ai-sdk.d.mts +75 -6
- package/dist/integrations/ai-sdk.d.mts.map +1 -1
- package/dist/integrations/ai-sdk.mjs +131 -15
- package/dist/integrations/ai-sdk.mjs.map +1 -1
- package/dist/integrations/langchain.d.mts +1 -1
- package/dist/integrations/langchain.d.mts.map +1 -1
- package/dist/integrations/langchain.mjs +5 -5
- package/dist/integrations/langchain.mjs.map +1 -1
- package/dist/integrations/llamaindex.d.mts +1 -1
- package/dist/integrations/llamaindex.d.mts.map +1 -1
- package/dist/integrations/llamaindex.mjs +5 -5
- package/dist/integrations/llamaindex.mjs.map +1 -1
- package/dist/integrations/mcp-client.mjs +3 -3
- package/dist/integrations/mcp-client.mjs.map +1 -1
- package/dist/integrations/mcp.d.mts +3 -2
- package/dist/integrations/mcp.d.mts.map +1 -1
- package/dist/integrations/mcp.mjs +5 -5
- package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
- package/dist/mcp-1DaMsaBc.mjs.map +1 -0
- package/dist/memory/index.d.mts +3 -0
- package/dist/memory/index.mjs +6 -0
- package/dist/memory-D1P7Tmda.mjs +4 -0
- package/dist/memory-DVN0MnIG.mjs +132 -0
- package/dist/memory-DVN0MnIG.mjs.map +1 -0
- package/dist/memory-Dj0J1v88.mjs +294 -0
- package/dist/memory-Dj0J1v88.mjs.map +1 -0
- package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
- package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
- package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
- package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
- package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
- package/dist/repl-jV5gcJFA.mjs +9 -0
- package/dist/skills/index.d.mts +270 -320
- package/dist/skills/index.d.mts.map +1 -1
- package/dist/skills/index.mjs +5 -5
- package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
- package/dist/skills-DX8D59UH.mjs.map +1 -0
- package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
- package/dist/tools-DQ1mPUw5.mjs.map +1 -0
- package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
- package/dist/types-D6FiR_oh.d.mts.map +1 -0
- package/dist/types-DQBe2lFo.d.mts +165 -0
- package/dist/types-DQBe2lFo.d.mts.map +1 -0
- package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
- package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
- package/dist/vector-B0panuy6.mjs +95 -0
- package/dist/vector-B0panuy6.mjs.map +1 -0
- package/docs/PROJECT-STATE.md +321 -0
- package/docs/adding-a-model-family.md +280 -0
- package/docs/ai-sdk.md +70 -61
- package/docs/architecture/overview.md +17 -7
- package/docs/browser.md +203 -8
- package/docs/embeddings.md +156 -0
- package/docs/gerbil-site-native-migration.md +217 -0
- package/docs/gpu-engine/architectures.md +398 -0
- package/docs/gpu-engine/ir.md +372 -0
- package/docs/gpu-engine/kernels.md +718 -0
- package/docs/gpu-engine/paper.html +1759 -0
- package/docs/gpu-engine/paper.md +2109 -0
- package/docs/gpu-engine/safetensors.md +312 -0
- package/docs/gpu-engine/tokenizer.md +302 -0
- package/docs/memory-rag.md +91 -0
- package/docs/metal-safari-intel.md +190 -0
- package/docs/mobile-failure-diagnosis.md +124 -0
- package/docs/mobile.md +99 -0
- package/docs/observability.md +230 -0
- package/docs/onnx-removal-plan.md +339 -0
- package/docs/research/autoresearch-portable.md +904 -0
- package/docs/research/dispatch-reduction-hivemind.md +84 -0
- package/docs/research/ios-safari-model-caching.md +117 -0
- package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
- package/docs/research/native-stt-model-selection.md +49 -0
- package/docs/research/native-tts-model-selection.md +90 -0
- package/docs/research/native-vs-chromium-decision.md +152 -0
- package/docs/research/nemotron-mamba2-inference.md +910 -0
- package/docs/research/qwen35-multimodal.md +293 -0
- package/docs/research/qwen36-gemma4-targets.md +337 -0
- package/docs/research/sota-embedding-models.md +179 -0
- package/docs/research/sota-mobile-models-2026.md +263 -0
- package/docs/research/sota-modality-models.md +202 -0
- package/docs/research/tps-baselines.md +71 -0
- package/docs/research/webgpu-m4-reference.md +104 -0
- package/docs/site-update-plan.md +155 -0
- package/docs/structured-output.md +123 -0
- package/docs/stt.md +63 -446
- package/docs/tts.md +77 -499
- package/docs/vision.md +100 -338
- package/package.json +22 -7
- package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
- package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
- package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
- package/dist/gerbil-CJ3ifloF.mjs +0 -4
- package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
- package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
- package/dist/gerbil-qOTe1nl2.d.mts +0 -431
- package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
- package/dist/kokoro-BNTb6egA.mjs +0 -20210
- package/dist/kokoro-BNTb6egA.mjs.map +0 -1
- package/dist/kokoro-DFRQ1OeM.js +0 -20212
- package/dist/kokoro-DFRQ1OeM.js.map +0 -1
- package/dist/mcp-BvbriaBy.mjs.map +0 -1
- package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
- package/dist/repl-DveXw36T.mjs +0 -9
- package/dist/skills-CD3Orlex.mjs.map +0 -1
- package/dist/stt-CpLYbGFd.mjs +0 -433
- package/dist/stt-CpLYbGFd.mjs.map +0 -1
- package/dist/stt-DRPLEEHB.mjs +0 -3
- package/dist/stt-Te8Qz-Ay.js +0 -433
- package/dist/stt-Te8Qz-Ay.js.map +0 -1
- package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
- package/dist/transformers.web-DokyH3rP.js +0 -3
- package/dist/transformers.web-M6mCnEYJ.js +0 -30382
- package/dist/transformers.web-M6mCnEYJ.js.map +0 -1
- package/dist/tts-C0xx3CtE.js +0 -724
- package/dist/tts-C0xx3CtE.js.map +0 -1
- package/dist/tts-DXgsKGCe.mjs +0 -3
- package/dist/tts-DeGANMNV.mjs +0 -730
- package/dist/tts-DeGANMNV.mjs.map +0 -1
- package/dist/types-CiTc7ez3.d.mts.map +0 -1
- /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
- /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
- /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,2109 @@

# Gerbil WebGPU Inference Engine

**A WebGPU-native transformer inference engine for browser-based LLM execution.**

*Living technical document. Last updated: June 2026 (native audio: Moonshine STT + Kani-TTS-2 decoder; Gemma 4 E2B; on-device memory/RAG; and the text+ViT autoresearch campaign).*

---

## Table of Contents

1. [Introduction & Motivation](#1-introduction--motivation)
2. [Previous Approach & Its Shortcomings](#2-previous-approach--its-shortcomings)
3. [Architecture Overview](#3-architecture-overview)
4. [Intermediate Representation](#4-intermediate-representation)
5. [Safetensors Parser](#5-safetensors-parser)
6. [WebGPU Device Layer](#6-webgpu-device-layer)
7. [WGSL Kernel Library](#7-wgsl-kernel-library)
8. [Tokenizer](#8-tokenizer)
9. [Sampler](#9-sampler)
10. [Architecture Registry & Graph Generators](#10-architecture-registry--graph-generators)
11. [KV Cache](#11-kv-cache)
12. [Executor](#12-executor)
13. [Model Loading Pipeline](#13-model-loading-pipeline)
14. [Public API (WebGPUEngine)](#14-public-api-webgpuengine)
15. [What's Built vs. What's Planned](#15-whats-built-vs-whats-planned)
16. [Key Design Decisions](#16-key-design-decisions)
17. [The Four-Failure-Mode Diagnosis (June 2026)](#17-the-four-failure-mode-diagnosis-june-2026)
18. [The Mobile Fix Campaign](#18-the-mobile-fix-campaign)
19. [Mobile Results & Submit-Granularity Scaling](#19-mobile-results--submit-granularity-scaling)
20. [Autoresearch: Profile-Guided Throughput Optimization](#20-autoresearch-profile-guided-throughput-optimization-june-2026)
21. [Native Text Embeddings](#21-native-text-embeddings-june-2026)
22. [Native Vision Encoder (Qwen3.5 ViT)](#22-native-vision-encoder-qwen35-vit-june-2026)
23. [The Native-Only Architecture Decision](#23-the-native-only-architecture-decision)
24. [iOS Model-Caching Reality](#24-ios-model-caching-reality)
25. [EmbeddingGemma-300M: a Second Embedding Family](#25-embeddinggemma-300m-a-second-embedding-family-validated-on-device-june-2026)
26. [The SentencePiece Tokenizer Fix](#26-the-sentencepiece-tokenizer-fix-a-cross-family-lesson)
27. [MLX-4bit Loading and the DWQ Trap](#27-mlx-4bit-loading-broadened-detection-and-the-dwq-trap)
28. [The Progress-Reporting Fix](#28-the-progress-reporting-fix-killing-the-stuck-at-10-freeze)
29. [Cross-Device Multi-Modal Parity](#29-cross-device-multi-modal-parity-june-2026)
30. [Model-Zoo Growth and Kernel-Library Saturation](#30-model-zoo-growth-and-the-saturation-of-the-kernel-library)
31. [Native STT: Moonshine](#31-native-stt-moonshine-june-2026)
32. [Native TTS: Kani-TTS-2 and the NanoCodec Decoder](#32-native-tts-kani-tts-2-and-the-nanocodec-decoder-june-2026)
33. [Gemma 4 E2B: PLE, KV-Sharing, and Logit Softcap](#33-gemma-4-e2b-ple-kv-sharing-and-logit-softcap-june-2026)
34. [On-Device Memory / RAG](#34-on-device-memory-rag-june-2026)
35. [The June-2026 Autoresearch Campaign (Text, LFM2, and ViT)](#35-the-june-2026-autoresearch-campaign-text-lfm2-and-vit)
- [Appendix A: File Map](#appendix-a-file-map)
- [Appendix B: WebGPU Browser Compatibility](#appendix-b-webgpu-browser-compatibility)

---

## 1. Introduction & Motivation

Gerbil is an on-device AI library for JavaScript. Its GPU engine is a WebGPU-native transformer inference engine that runs large language models directly in the browser with zero server-side dependencies. Given a HuggingFace model repository with a supported architecture, the engine:

1. Downloads the model's `config.json` and determines its architecture
2. Generates a fine-grained computation graph (the IR) at runtime
3. Downloads safetensors weight files and uploads them to GPU buffers
4. Dispatches WGSL compute shaders for each operation in topological order
5. Streams generated tokens back to the caller

The engine was built to replace an earlier approach based on transformers.js and ONNX Runtime Web, which suffered from fundamental architectural limitations. Those limitations -- iOS memory crashes, 50MB bundle sizes, webpack-within-webpack collisions, ONNX format lock-in, and three divergent inference paths -- motivated a complete rewrite with WebGPU as the single execution target.

The core philosophy is: **own every layer of the stack**. No ONNX runtime, no transformers.js, no external WASM blobs. Just TypeScript orchestration code and hand-written WGSL kernels, speaking directly to the WebGPU API. As of this cycle that single engine is no longer text-only: **native text embeddings** (Section 21) and the **native Qwen3.5 vision encoder** (Section 22, bit-exact vs HuggingFace) run through the same IR and kernel registry, and the architecture decision is explicitly **native-only** across text, vision, and embeddings -- no fallback lane (Section 23).

As of June 2026 the engine runs Qwen3.5-0.8B (MLX 4-bit) with byte-identical greedy output on desktop Dawn (node, Chrome) and WebKit WebGPU (iPad), at **145 tok/s on M4 Max** and **31.7-35.9 tok/s on iPad (iPadOS 26.5)** -- the first correct, crash-free on-device mobile results in the project's history. Getting there required splitting what had been treated as "one mobile bug" into four independent failure modes (Section 17) and a fix campaign spanning memory layout, submit architecture, a WGSL data race, and platform detection (Section 18).

---

## 2. Previous Approach & Its Shortcomings

Understanding why the GPU engine was built requires understanding what came before it and why that approach was unsustainable.

### 2.1 The transformers.js + ort-web Architecture

The original gerbil browser inference path used [transformers.js](https://github.com/huggingface/transformers.js) (from HuggingFace) as its runtime. The architecture was:

```
User Code
  -> gerbil browser SDK (createGerbilWorker)
    -> Web Worker (blob URL, IIFE bundle)
      -> transformers.js (AutoModelForCausalLM, AutoTokenizer)
        -> ONNX Runtime Web (ort-web)
          -> WebGPU backend (or WASM fallback)
            -> ONNX model files (.onnx)
```

transformers.js provided a high-level API (`AutoModelForCausalLM.from_pretrained`) that handled model loading, tokenization, and generation. Under the hood, it delegated all tensor math to ONNX Runtime Web (ort-web), which could target either WebGPU or WASM backends.

This worked for desktop browsers. On mobile -- particularly iOS -- it was catastrophic.

### 2.2 The iOS Crisis

The single most painful failure mode: **models would download to 100% and then crash the page**.

iOS Safari and iOS Chrome (which uses WKWebView under the hood) impose a hard memory limit of approximately 300-400MB per web page. The ort-web WASM runtime alone consumed significant memory, and once model weights were loaded on top of that, the total exceeded the budget. The page would be silently terminated by the OS with no error, no exception, no callback -- just a blank page.

Every mitigation was attempted and failed:

- **Running in a Web Worker**: WKWebView imposes even stricter memory limits on workers than the main thread. Moving inference to a worker made things worse.
- **Main-thread inference**: Avoided worker memory limits but froze the UI entirely during generation. Still crashed on models above ~600MB.
- **KV cache disposal**: Aggressively disposing past_key_values after each generation freed some memory but not enough. The WASM runtime's own memory footprint was the floor, and it was already above 200MB.
- **Model fallback chains**: Automatically trying smaller models when larger ones failed. This helped with availability but didn't solve the crash-then-reload loop.
- **Session phase tracking**: Using `sessionStorage` to detect crash-and-reload cycles (`detectMemoryCrash()` in `device-guards.ts`). This only diagnosed the problem; it couldn't prevent it.
- **Breadcrumb logging to localStorage**: Writing debug state before each critical step so that after a crash, the reload could read what step killed the page. This produced excellent diagnostics but no fix.
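
The session-phase and breadcrumb ideas above can be sketched as follows. This is a minimal illustration, not the code in `device-guards.ts`: the `KeyValueStore` interface, the `PHASE_KEY` name, the phase names, and all three functions are assumptions made for the example. The idea is that a marker written before a memory-heavy step and still present on the next page load implies the previous page instance was killed mid-step.

```typescript
// Hypothetical phase tracker. Names are illustrative, not gerbil's actual API.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

type LoadPhase = "downloading" | "uploading" | "generating";

const PHASE_KEY = "gerbil:phase"; // assumed storage key

/** Record the phase about to start. Call before each memory-heavy step. */
function markPhase(store: KeyValueStore, phase: LoadPhase): void {
  store.setItem(PHASE_KEY, phase);
}

/** Clear the marker once the whole sequence finished without the page dying. */
function markDone(store: KeyValueStore): void {
  store.removeItem(PHASE_KEY);
}

/**
 * Call on page load. A leftover marker means the previous page instance was
 * terminated mid-phase, because an OS kill gives the page no chance to clean up.
 */
function detectCrashedPhase(store: KeyValueStore): LoadPhase | null {
  const value = store.getItem(PHASE_KEY);
  store.removeItem(PHASE_KEY);
  if (value === "downloading" || value === "uploading" || value === "generating") {
    return value;
  }
  return null;
}

/** In-memory stand-in so the sketch also runs outside a browser. */
class MemoryStore implements KeyValueStore {
  private readonly data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
  removeItem(key: string): void {
    this.data.delete(key);
  }
}
```

In a browser, `window.sessionStorage` or `window.localStorage` can be passed directly, since the standard `Storage` interface provides these three methods. As the list notes, this kind of tracking only reports which step killed the page; it does nothing to reduce memory use.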

The iOS code path grew increasingly complex. `createIOSMainThreadWorker()` in `worker.ts` is 270+ lines of iOS-specific logic including CDN dynamic imports, breadcrumb logging, 128-token generation caps, and manual KV cache disposal -- all of which existed solely to work around ort-web's memory consumption.

### 2.3 The Bundle Size Problem

The browser build bundled transformers.js and its dependencies (including kokoro-js for TTS) into a single ESM file via rolldown with `noExternal: ["@huggingface/transformers", "kokoro-js"]` and `inlineDynamicImports: true`. The resulting `dist/browser/index.js` was approximately 50MB.

This meant:

- Slow initial page loads even with caching
- Massive JavaScript parse times on mobile devices
- The browser had to allocate memory for the entire bundle before inference even began

### 2.4 The Webpack-Within-Webpack Crash

This was perhaps the most bizarre bug in the entire project. The `@huggingface/transformers` package ships as a **pre-compiled webpack bundle**. When rolldown inlined this bundle into gerbil's ESM output, a subtle scoping issue emerged:

The transformers.js webpack runtime declares variables like `var __webpack_exports__`, `var __webpack_modules__`, `var __webpack_module_cache__`, and `var __webpack_require__` inside arrow function callbacks. `var` declarations are **function-scoped and hoisted**: from the first statement of the function that contains them, the declared name shadows any outer binding of the same name and reads as `undefined` until it is assigned.

When Next.js webpack processed gerbil's bundle (which now contained these inlined declarations), it evaluated the code inside `eval()` contexts where its own `__webpack_exports__` was a parameter. The inlined `var __webpack_exports__` **shadowed** webpack's own parameter with `undefined`. This caused:

```
__webpack_require__.r(undefined)
  -> Object.defineProperty(undefined, '__esModule', ...)
    -> "Properties can only be defined on Objects"
```

The fix was a `renderChunk` plugin in `tsdown.config.ts` that renamed all inner webpack runtime variables:

```typescript
result = result.replaceAll("__webpack_exports__", "__gerbil_tf_exports__");
result = result.replaceAll("__webpack_modules__", "__gerbil_tf_modules__");
result = result.replaceAll("__webpack_module_cache__", "__gerbil_tf_cache__");
result = result.replaceAll("__webpack_require__", "__gerbil_tf_require__");
```

Additionally, the plugin stripped `//# sourceMappingURL` comments to prevent the Next.js dev overlay from attempting to parse the 8MB source map file, which caused a stack overflow.

This hack worked, but it was fragile. Every transformers.js update risked breaking it, and it highlighted a fundamental problem: the dependency chain was too deep and too opaque.
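
The name collision and the effect of the rename can be reproduced in miniature. The sketch below is illustrative only: `hostExports` stands in for webpack's `__webpack_exports__` parameter, and both wrapper functions are hypothetical rather than code from either bundle. It shows one way a hoisted `var` with a colliding name hides an outer binding, and how renaming the inner variable removes the collision.

```typescript
// Illustrative only: `hostExports` plays the role of webpack's own parameter.
type Exports = { owner: string };

function hostWrapper(hostExports: Exports): string {
  const inlinedRuntime = (): string => {
    // The `var` below is hoisted to the top of this arrow function, so this
    // read sees the local, still-undefined binding instead of the parameter.
    const seen = hostExports === undefined ? "undefined" : hostExports.owner;
    var hostExports: Exports | undefined = { owner: "inlined" };
    return seen;
  };
  return inlinedRuntime();
}

function hostWrapperAfterRename(hostExports: Exports): string {
  const inlinedRuntime = (): string => {
    // With the inner variable renamed there is no collision, so the same
    // read resolves to the host's parameter again.
    const seen = hostExports === undefined ? "undefined" : hostExports.owner;
    var renamedExports: Exports | undefined = { owner: "inlined" };
    void renamedExports; // referenced only to keep the declaration "used"
    return seen;
  };
  return inlinedRuntime();
}
```

Calling `hostWrapper({ owner: "host" })` yields `"undefined"`, while `hostWrapperAfterRename({ owner: "host" })` yields `"host"`. A blanket string rename like the one in the plugin works for the same reason, and it is fragile for the reason the text gives: it depends on the exact identifier names the upstream bundle happens to use.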

### 2.5 The ONNX Dependency

transformers.js required models to be pre-converted to ONNX format. This meant:

- Only models that had been converted and uploaded to HuggingFace in ONNX format could be used
- The conversion process itself was error-prone and model-specific
- Quantization was limited to what the ONNX ecosystem supported
- Users couldn't point at any arbitrary HuggingFace model repo and run it

### 2.6 The Blob URL Worker Complexity

iOS WKWebView cannot load ES module workers (`new Worker(url, { type: "module" })`). The workaround was an IIFE blob URL worker:

1. `worker-entry.ts` -- real TypeScript worker source with static transformers.js imports
2. `scripts/build-worker.mjs` -- esbuild bundles worker-entry.ts as IIFE into `worker-code.generated.ts`
3. `worker.ts` -- creates a classic (non-module) Blob worker using the `WORKER_CODE` string constant

This created a chain of build tooling issues:
- The IIFE bundle couldn't use `import.meta.url`, which ort-web relied on internally for locating its WASM files
- CDN paths for the WASM runtime (`ort.bundle.min.mjs`, `ort-wasm-simd-threaded.jsep.wasm`) had to be explicitly configured via `env.backends.onnx.wasm.wasmPaths`
- These CDN files had to be aliased to `false` in Next.js's webpack config to prevent webpack from trying to resolve them as modules

### 2.7 Three Separate Inference Paths

The original gerbil maintained three completely different inference paths:

1. **Browser WebGPU** via transformers.js (desktop Chrome, Edge, Safari 18+)
2. **Browser WASM** via transformers.js (iOS, older browsers, low-GPU-memory devices)
3. **Node.js** via server-side transformers.js with different model loading

Each path had different bugs, different performance characteristics, and different maintenance burdens. The `backend-selector.ts` file alone was over 200 lines of tiered decision logic (`selectBackend()`, `getModelFallbackChain()`, `getDeviceContextLimit()`) to route users to the right path.

### 2.8 Summary: Why a Rewrite Was Necessary

| Problem | Root Cause |
|---------|-----------|
| iOS crashes after download | ort-web WASM runtime baseline memory too high for WKWebView |
| 50MB bundle | Bundling transformers.js + ort-web + kokoro-js |
| Webpack crash | `var` hoisting in the pre-compiled webpack bundle |
| Limited model support | ONNX format requirement |
| Worker complexity | iOS WKWebView module worker limitations + ort WASM path resolution |
| Maintenance burden | Three divergent inference code paths |

The GPU engine eliminates all of these problems by owning the entire inference stack: a custom IR, custom safetensors parser, custom WGSL kernels, custom tokenizer, and custom sampler -- all speaking directly to the WebGPU API with zero intermediate runtimes.

---

## 3. Architecture Overview

The GPU engine is structured as a pipeline of cooperating components:

```
                     HuggingFace Hub
                            |
                     model-loader.ts
                    /       |        \
         config.json  tokenizer.json  *.safetensors
              |             |              |
      architectures/   tokenizer.ts   safetensors.ts
         qwen2.ts           |              |
              |         Tokenizer     weight data
              |                            |
       ModelGraph (IR)                     |
              |                            |
         executor.ts <---------------------+
        /     |      \
 device.ts    |    kv-cache.ts
     |                  |
 WGSL kernels       sampler.ts
 (12 shaders)   (CPU-side sampling)
     |                  |
 GPU dispatch     token selection
     |                  |
   logits -----------> next token
```

### Component Responsibilities

| Component | File | Role |
|-----------|------|------|
| **IR** | `ir.ts` | Type definitions for the computation graph: OpType, TensorDesc, OpNode, ModelGraph |
| **Architecture Registry** | `architectures/index.ts` | Maps HF architecture strings to graph generators |
| **Graph Generator** | `architectures/qwen2.ts` | Builds a complete ModelGraph from a Qwen config.json |
| **Safetensors Parser** | `safetensors.ts` | Parses HF safetensors binary format into typed array views |
| **Device Layer** | `device.ts` | WebGPU initialization, buffer management, pipeline compilation, readback |
| **KV Cache** | `kv-cache.ts` | Pre-allocated GPU buffers for autoregressive key/value storage |
| **Executor** | `executor.ts` | Allocates buffers, dispatches ops, manages forward passes |
| **WGSL Kernels** | `kernels/wgsl/*.wgsl` | 12 hand-written compute shaders for all tensor operations |
| **Tokenizer** | `tokenizer.ts` | Pure JS BPE tokenizer from HF tokenizer.json |
| **Sampler** | `sampler.ts` | CPU-side temperature/top-k/top-p/repetition penalty sampling |
| **Model Loader** | `model-loader.ts` | HF Hub integration: fetch config, tokenizer, weights |
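
The Architecture Registry row can be pictured as a map from the strings in a config's `architectures` array to generator functions. The sketch below is a simplified illustration under stated assumptions: `HFConfig`, `GraphStub`, `registerArchitecture`, and `resolveGenerator` are hypothetical names, and the stub generator returns a placeholder object rather than a real `ModelGraph`.

```typescript
// Hypothetical shapes. The real registry lives in architectures/index.ts.
interface HFConfig {
  architectures?: string[];
  [key: string]: unknown;
}

// Placeholder for ModelGraph, kept tiny so the sketch stays self-contained.
interface GraphStub {
  architecture: string;
  layerCount: number;
}

type GraphGenerator = (config: HFConfig) => GraphStub;

const registry = new Map<string, GraphGenerator>();

function registerArchitecture(name: string, generator: GraphGenerator): void {
  registry.set(name, generator);
}

/** Pick the first architecture string that has a registered generator. */
function resolveGenerator(config: HFConfig): GraphGenerator {
  const names = config.architectures ?? [];
  for (const name of names) {
    const generator = registry.get(name);
    if (generator) return generator;
  }
  throw new Error(`Unsupported architecture: ${names.join(", ") || "(none)"}`);
}

// Stub standing in for the generator in architectures/qwen2.ts.
registerArchitecture("Qwen2ForCausalLM", (config) => ({
  architecture: "Qwen2ForCausalLM",
  layerCount: Number(config["num_hidden_layers"] ?? 0),
}));
```

Failing loudly on an unknown architecture string fits the single-engine design described in this document: there is no second runtime to route an unsupported model to.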

### Data Flow

1. **Model Load**: `model-loader.ts` fetches `config.json` from HuggingFace, determines the architecture (e.g., `Qwen2ForCausalLM`), and calls the matching graph generator. The generator produces a `ModelGraph` -- the complete IR specifying every tensor and every operation. The loader then downloads safetensors files, maps HF keys to canonical names, and returns the weights alongside the graph and tokenizer.

2. **Initialization**: The `Executor` takes the `ModelGraph`, allocates a liveness-pooled set of GPU buffers for activation tensors (Section 18.1), creates the KV cache, and uploads weight data to GPU storage buffers.

3. **Forward Pass**: Given input token IDs, the executor walks the `executionOrder` array, dispatching each operation's WGSL kernel with the appropriate buffers and uniforms. On Dawn (Chrome, node) all dispatches are batched into a single command encoder; on WebKit they are grouped into command buffers of a configurable size with one buffer in flight (Section 18.3).

4. **Logit Readback**: After the forward pass, the `[1, vocab]` logits buffer (only the last position's logits are ever computed -- Section 18.2) is copied to a MAP_READ staging buffer and read back to CPU as a `Float32Array`.

5. **Sampling**: `sampler.ts` applies the sampling pipeline (repetition penalty, temperature, top-k, top-p, random selection) to produce the next token ID.

6. **Decode Loop**: Steps 3-5 repeat with T=1 (single token) until an EOS token or max length is reached.
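
Steps 3 to 6 above can be sketched as a small driver loop. This is a minimal illustration, not the engine's executor: the `Forward` and `Sample` signatures, `generate`, `collect`, and `argmax` are assumptions made for the example, with the GPU forward pass and readback hidden behind an injected async function that returns the last position's logits.

```typescript
// Assumed signatures: forward() returns the [vocab] logits of the last position.
type Forward = (tokenIds: number[], startPosition: number) => Promise<Float32Array>;
type Sample = (logits: Float32Array, history: number[]) => number;

/** Greedy sampler: index of the largest logit. */
function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

/** Prefill once with the whole prompt, then decode one token at a time. */
async function* generate(
  forward: Forward,
  sample: Sample,
  promptIds: number[],
  eosId: number,
  maxNewTokens: number,
): AsyncGenerator<number> {
  const history = [...promptIds];
  // Prefill: T = prompt length, starting at position 0.
  let logits = await forward(promptIds, 0);
  for (let step = 0; step < maxNewTokens; step++) {
    const next = sample(logits, history);
    if (next === eosId) return;
    history.push(next);
    yield next; // stream the token to the caller
    // Decode: T = 1, at the position of the token just produced.
    if (step + 1 < maxNewTokens) {
      logits = await forward([next], history.length - 1);
    }
  }
}

/** Helper that drains the stream into an array. */
async function collect(stream: AsyncIterable<number>): Promise<number[]> {
  const out: number[] = [];
  for await (const id of stream) out.push(id);
  return out;
}
```

Passing `(logits) => argmax(logits)` as the sampler gives greedy decoding, which is the mode the byte-identical cross-device comparison in Section 1 refers to. The `history` argument is there so that a sampler applying a repetition penalty can see earlier tokens.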

---

## 4. Intermediate Representation

The IR (`ir.ts`) is the contract that every component builds on. It defines what the model computes, how tensors are shaped, and in what order operations execute.

### Design Philosophy

The IR is **fine-grained** and **runtime-generated**. Rather than compiling a model to an opaque binary format (like ONNX), the engine generates the IR at runtime from a model's `config.json`. This means:

- **Any HuggingFace model** with a supported architecture can be used -- no pre-conversion step
- **Config changes** (different hidden sizes, layer counts, head counts) are handled automatically
- **The IR is inspectable**: it's a plain TypeScript object that can be logged, debugged, and visualized

### OpType Taxonomy

Every computation the engine can perform is represented as an `OpType`:

**Core ops (implemented):**
- `Embedding` -- Lookup rows from embedding weight matrix by token ID
- `MatMul` -- Full-precision (f32) tiled matrix multiplication
- `MatMulInt4` -- Fused INT4 dequantize + matrix multiplication
- `Add` -- Element-wise addition (residual connections)
- `Mul` -- Element-wise multiplication (SwiGLU gate)
- `RMSNorm` -- Root mean square layer normalization
- `LayerNorm` -- Standard layer normalization (mean + variance)
- `RoPE` -- Rotary position embeddings
- `Attention` -- Scaled dot-product attention with causal mask and GQA
- `Softmax` -- Row-wise softmax
- `SiLU` -- Sigmoid linear unit activation
- `GELU` -- Gaussian error linear unit activation (tanh approximation, `gelu_pytorch_tanh`)
- `GeluErf` -- Exact (erf-based) GELU, used by the ViT merger
- `AddBias` -- Row-broadcast bias add (ViT projections carry bias)
- `ApplyRotaryEmb` -- Precomputed-cos/sin `rotate_half` rotary (ViT 2D RoPE)
- `SliceCols` -- Extract a column range (splits fused QKV in the ViT)
- `L2Norm` -- Row-wise L2 normalization (embedding tail)
- `SliceLastRow` -- Extract the last position's row (last-token pooling / lm_head)

**Structural ops (defined, not yet kernel-backed):**
- `Gather`, `Reshape`, `Transpose`, `Concat`

**Future ops (stubbed):**
- `MoERouter`, `ExpertMatMul` -- Mixture of Experts (when first MoE model is added)
- `Conv2d`, `AvgPool2d`, `CrossAttention` -- audio / encoder-decoder (future)
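
To make a few of the core ops concrete, the sketch below gives plain CPU reference versions of `RMSNorm`, `SiLU`, `GELU` (tanh approximation), `Softmax`, and the head mapping used by GQA. These are textbook formulations written for illustration, not the engine's WGSL kernels; the function names are hypothetical, and the `RMSNorm` here assumes plain weight scaling (some model families scale by `1 + weight` instead).

```typescript
/** RMSNorm over one row: y_i = x_i / sqrt(mean(x^2) + eps) * w_i. */
function rmsNorm(x: Float32Array, weight: Float32Array, eps: number): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < x.length; i++) sumSquares += x[i] * x[i];
  const inverseRms = 1 / Math.sqrt(sumSquares / x.length + eps);
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) out[i] = x[i] * inverseRms * weight[i];
  return out;
}

/** SiLU: x * sigmoid(x). */
function silu(x: number): number {
  return x / (1 + Math.exp(-x));
}

/** GELU, tanh approximation: 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3))). */
function geluTanh(x: number): number {
  const c = Math.sqrt(2 / Math.PI);
  return 0.5 * x * (1 + Math.tanh(c * (x + 0.044715 * x * x * x)));
}

/** Row-wise softmax, subtracting the row maximum for numerical stability. */
function softmax(row: Float32Array): Float32Array {
  let max = -Infinity;
  for (let i = 0; i < row.length; i++) max = Math.max(max, row[i]);
  const out = new Float32Array(row.length);
  let sum = 0;
  for (let i = 0; i < row.length; i++) {
    out[i] = Math.exp(row[i] - max);
    sum += out[i];
  }
  for (let i = 0; i < row.length; i++) out[i] /= sum;
  return out;
}

/** GQA: consecutive groups of query heads share one key/value head. */
function kvHeadFor(queryHead: number, numHeads: number, numKvHeads: number): number {
  return Math.floor(queryHead / (numHeads / numKvHeads));
}
```

Reference functions of this kind are a common way to check GPU kernels: run the same input through both and compare within a tolerance.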

### TensorDesc

Every tensor in the graph is described by a `TensorDesc`:

```typescript
interface TensorDesc {
  name: string;               // Unique name, e.g. "layers.0.self_attn.q_proj.weight"
  shape: (number | string)[]; // Concrete or symbolic dimensions
  dtype: DType;               // "f32" | "f16" | "i32" | "u32" | "i4"
  storage: TensorStorage;     // "constant" | "activation" | "kv_cache"
  safetensorsKey?: string;    // Key in safetensors file (constants only)
}
```

**Symbolic dimensions** are strings like `"T"` (current sequence length) and `"L_max"` (maximum cache length). These are resolved to concrete numbers at execution time. This allows the graph to be generated once and used for both prefill (T=prompt length) and decode (T=1) passes.
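
Resolving symbolic dimensions amounts to a lookup per dimension. The sketch below is one possible way to do it, not the executor's implementation: `resolveShape`, `elementCount`, and the `bindings` record are hypothetical names introduced for the example.

```typescript
type Dim = number | string;

/** Replace symbolic dimensions such as "T" with the values bound for this pass. */
function resolveShape(shape: Dim[], bindings: Record<string, number>): number[] {
  return shape.map((dim) => {
    if (typeof dim === "number") return dim;
    const value = bindings[dim];
    if (value === undefined) {
      throw new Error(`Unbound symbolic dimension: ${dim}`);
    }
    return value;
  });
}

/** Number of elements in a fully resolved shape. */
function elementCount(shape: number[]): number {
  return shape.reduce((product, dim) => product * dim, 1);
}
```

The same shape `["T", 896]` (896 is an illustrative hidden size) resolves to `[17, 896]` for a 17-token prefill and to `[1, 896]` for a decode step, which is what lets one generated graph serve both passes.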

**Storage types:**
- `constant` -- Weight tensors loaded from safetensors. Immutable after upload.
- `activation` -- Intermediate computation results. Buffers are pre-allocated at max size and reused.
- `kv_cache` -- Key/value cache tensors managed by the KV cache module.

### OpNode

Each operation is represented as an `OpNode`:

```typescript
interface OpNode {
  id: string;                          // Unique node ID, e.g. "layer0_norm1"
  opType: OpType;                      // Which operation to perform
  inputs: string[];                    // Input tensor names (order = kernel binding order)
  outputs: string[];                   // Output tensor names
  attributes: Record<string, unknown>; // Op-specific params (hidden_size, eps, etc.)
}
```

The `attributes` dictionary carries operation-specific parameters that the kernel needs. For example, an `RMSNorm` node carries `hidden_size` and `eps`; a `MatMul` node carries `M_tensor`, `K`, and `N`.
|
|
332
|
+
|
|
333
|
+
### ModelGraph
|
|
334
|
+
|
|
335
|
+
The top-level container:
|
|
336
|
+
|
|
337
|
+
```typescript
|
|
338
|
+
interface ModelGraph {
|
|
339
|
+
architecture: string; // e.g. "Qwen2ForCausalLM"
|
|
340
|
+
config: ModelArchConfig; // Resolved dimensions
|
|
341
|
+
capabilities: ModelCapabilities; // { text: true, vision: false, moe: false }
|
|
342
|
+
tensors: Record<string, TensorDesc>; // All tensors, keyed by name
|
|
343
|
+
nodes: OpNode[]; // All computation nodes
|
|
344
|
+
executionOrder: string[]; // Topologically-sorted node IDs
|
|
345
|
+
inputs: string[]; // Graph input names (["input_ids"])
|
|
346
|
+
outputs: string[]; // Graph output names (["logits"])
|
|
347
|
+
}
|
|
348
|
+
```
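
Because `executionOrder` is described as topologically sorted, one property a graph should satisfy is that every node's inputs are available before it runs. The check below is a sketch under that assumption; `MiniNode` and `firstOrderingViolation` are hypothetical names, and tensors that exist before the pass (weights, graph inputs, KV cache state) are assumed to be passed in as `available`:

```typescript
interface MiniNode {
  id: string;
  inputs: string[];
  outputs: string[];
}

// Returns the ID of the first node that runs before one of its inputs
// exists, or null when the order is consistent.
function firstOrderingViolation(
  nodes: MiniNode[],
  executionOrder: string[],
  available: string[], // tensors that exist before any node runs
): string | null {
  const byId = new Map<string, MiniNode>();
  for (const node of nodes) byId.set(node.id, node);

  const ready = new Set<string>(available);
  for (const id of executionOrder) {
    const node = byId.get(id);
    if (node === undefined) return id; // order references an unknown node
    for (const tensor of node.inputs) {
      if (!ready.has(tensor)) return id;
    }
    for (const tensor of node.outputs) ready.add(tensor);
  }
  return null;
}
```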
|
|
349
|
+
|
|
350
|
+
### Canonical Tensor Naming
|
|
351
|
+
|
|
352
|
+
The IR uses a canonical naming convention for weight tensors:
|
|
353
|
+
|
|
354
|
+
```
|
|
355
|
+
embed_tokens.weight -- Token embedding matrix
|
|
356
|
+
layers.{i}.input_layernorm.weight -- Pre-attention norm
|
|
357
|
+
layers.{i}.self_attn.q_proj.weight -- Q projection
|
|
358
|
+
layers.{i}.self_attn.k_proj.weight -- K projection
|
|
359
|
+
layers.{i}.self_attn.v_proj.weight -- V projection
|
|
360
|
+
layers.{i}.self_attn.o_proj.weight -- Output projection
|
|
361
|
+
layers.{i}.post_attention_layernorm.weight -- Post-attention norm
|
|
362
|
+
layers.{i}.mlp.gate_proj.weight -- MLP gate (SwiGLU)
|
|
363
|
+
layers.{i}.mlp.up_proj.weight -- MLP up projection
|
|
364
|
+
layers.{i}.mlp.down_proj.weight -- MLP down projection
|
|
365
|
+
norm.weight -- Final norm
|
|
366
|
+
lm_head.weight -- Language model head
|
|
367
|
+
```
|
|
368
|
+
|
|
369
|
+
This convention is defined in `CANONICAL_KEYS` and matches the common LLaMA/Qwen naming scheme. The `HFKeyMapper` (default: strip the `"model."` prefix) converts HuggingFace safetensors keys to these canonical names.
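
The default mapping described above amounts to a prefix strip. A sketch, where `stripModelPrefix` is an illustrative stand-in rather than the real `HFKeyMapper` implementation:

```typescript
// Hypothetical stand-in for the default mapper: drop a leading "model." prefix.
function stripModelPrefix(hfKey: string): string {
  const prefix = "model.";
  return hfKey.indexOf(prefix) === 0 ? hfKey.slice(prefix.length) : hfKey;
}
```

Keys that carry no prefix (for example `lm_head.weight`) pass through unchanged.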
|
|
370
|
+
|
|
371
|
+
For more details, see [ir.md](./ir.md).
|
|
372
|
+
|
|
373
|
+
---
|
|
374
|
+
|
|
375
|
+
## 5. Safetensors Parser
|
|
376
|
+
|
|
377
|
+
The safetensors parser (`safetensors.ts`) reads HuggingFace's binary safetensors format into typed array views over the raw buffer.
|
|
378
|
+
|
|
379
|
+
### Binary Format
|
|
380
|
+
|
|
381
|
+
```
|
|
382
|
+
[8 bytes: header_length (little-endian u64)]
|
|
383
|
+
[header_length bytes: JSON header]
|
|
384
|
+
[remaining bytes: raw tensor data, contiguous]
|
|
385
|
+
```
|
|
386
|
+
|
|
387
|
+
The JSON header maps tensor names to their metadata:
|
|
388
|
+
|
|
389
|
+
```json
|
|
390
|
+
{
|
|
391
|
+
"model.layers.0.self_attn.q_proj.weight": {
|
|
392
|
+
"dtype": "F32",
|
|
393
|
+
"shape": [896, 896],
|
|
394
|
+
"data_offsets": [0, 3211264]
|
|
395
|
+
},
|
|
396
|
+
"__metadata__": {
|
|
397
|
+
"format": "pt"
|
|
398
|
+
}
|
|
399
|
+
}
|
|
400
|
+
```
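
The layout above can be read with a few lines of code. This is a sketch of the header step only; `parseSafetensorsHeader` and `TensorEntry` are assumed names and may not match what `safetensors.ts` exports:

```typescript
interface TensorEntry {
  dtype: string;
  shape: number[];
  data_offsets: [number, number]; // [begin, end) relative to the data section
}

function parseSafetensorsHeader(buffer: ArrayBuffer): {
  tensors: Record<string, TensorEntry>;
  dataStart: number;
} {
  const view = new DataView(buffer);
  // Little-endian u64, read as two u32 halves to stay in plain numbers.
  const low = view.getUint32(0, true);
  const high = view.getUint32(4, true);
  const headerLength = high * 4294967296 + low;

  const json = new TextDecoder().decode(new Uint8Array(buffer, 8, headerLength));
  const parsed = JSON.parse(json) as Record<string, unknown>;

  const tensors: Record<string, TensorEntry> = {};
  for (const key of Object.keys(parsed)) {
    if (key === "__metadata__") continue; // free-form metadata, not a tensor
    tensors[key] = parsed[key] as TensorEntry;
  }
  return { tensors, dataStart: 8 + headerLength };
}
```

`dataStart` is the absolute byte position at which the `data_offsets` of every tensor begin counting.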
|
|
401
|
+
|
|
402
|
+
### Zero-Copy Design
|
|
403
|
+
|
|
404
|
+
`getTensorData()` creates typed array views directly into the original `ArrayBuffer` when byte alignment allows. For example, an F32 tensor at a 4-byte-aligned offset returns a `Float32Array` view with no data copy. When alignment requirements aren't met (rare in practice), the function copies the relevant slice into a new properly-aligned buffer.
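
The view-or-copy decision can be sketched for the F32 case as follows. `viewAsFloat32` is a hypothetical helper, not the signature of `getTensorData()`:

```typescript
// Return a Float32Array over [byteOffset, byteOffset + byteLength).
// Aligned offsets get a zero-copy view; unaligned ones fall back to a copy,
// because Float32Array requires a byteOffset that is a multiple of 4.
function viewAsFloat32(
  buffer: ArrayBuffer,
  byteOffset: number,
  byteLength: number,
): Float32Array {
  if (byteOffset % 4 === 0) {
    return new Float32Array(buffer, byteOffset, byteLength / 4);
  }
  // ArrayBuffer.slice allocates a fresh buffer that starts at offset 0.
  return new Float32Array(buffer.slice(byteOffset, byteOffset + byteLength));
}
```

A caller can tell the two cases apart by comparing the result's `.buffer` with the source buffer.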
|
|
405
|
+
|
|
406
|
+
### Streaming Support
|
|
407
|
+
|
|
408
|
+
`parseSafetensorsFromResponse()` accepts a `Response` object and reads it into an `ArrayBuffer`. The header can be parsed independently of the tensor data, enabling a two-phase approach for large models: parse the header first to learn tensor sizes and offsets, then fetch tensor data by range request.
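
For the second phase, each tensor's `data_offsets` translate directly into an HTTP `Range` header once the header length is known. A sketch, with `tensorRangeHeader` as an assumed helper name:

```typescript
// data_offsets are [begin, end) relative to the data section, which starts
// after the 8-byte length prefix and the JSON header. HTTP byte ranges are
// inclusive on both ends, hence the "- 1".
function tensorRangeHeader(
  headerLength: number,
  dataOffsets: [number, number],
): string {
  const dataStart = 8 + headerLength;
  const [begin, end] = dataOffsets;
  return `bytes=${dataStart + begin}-${dataStart + end - 1}`;
}
```

Whether the server honors range requests is outside the parser's control, so a fallback to a full download is still needed.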
|
|
409
|
+
|
|
410
|
+
For more details, see [safetensors.md](./safetensors.md).
|
|
411
|
+
|
|
412
|
+
---
|
|
413
|
+
|
|
414
|
+
## 6. WebGPU Device Layer
|
|
415
|
+
|
|
416
|
+
The device layer (`device.ts`) wraps the WebGPU API with helpers for buffer management, pipeline compilation, and data readback.
|
|
417
|
+
|
|
418
|
+
### Device Initialization
|
|
419
|
+
|
|
420
|
+
`initGPU()` requests a high-performance GPU adapter and device with:
|
|
421
|
+
|
|
422
|
+
- Maximum buffer size and storage buffer binding size from the adapter
|
|
423
|
+
- 256x256 max compute workgroup size
|
|
424
|
+
- `shader-f16` feature when available
|
|
425
|
+
- A device loss handler that logs the reason
|
|
426
|
+
|
|
427
|
+
```typescript
|
|
428
|
+
const adapter = await navigator.gpu.requestAdapter({
|
|
429
|
+
powerPreference: "high-performance",
|
|
430
|
+
});
|
|
431
|
+
```
|
|
432
|
+
|
|
433
|
+
If `shader-f16` is available, the pipeline compiler automatically prepends `enable f16;` to WGSL source code.
|
|
434
|
+
|
|
435
|
+
### WebKit Implementation Detection (`isWebKitWebGPU`)
|
|
436
|
+
|
|
437
|
+
`initGPU()` classifies the WebGPU *implementation*, not the GPU hardware. This distinction matters: Dawn running on Apple Silicon (node-dawn, Chrome on macOS) reports `adapter.info.vendor === "apple"` exactly like Safari does, so adapter info **cannot** distinguish Dawn-on-Metal from WebKit-on-Metal. An earlier predicate that keyed on vendor/architecture routed desktop Dawn through every Safari workaround, collapsing desktop decode from ~161 to 19.7 tok/s (Section 17.4).
|
|
438
|
+
|
|
439
|
+
The detection is now purely user-agent based (`device.ts`): `AppleWebKit` in the UA, with iOS/iPadOS detected via `/iPhone|iPad|iPod/` or `maxTouchPoints > 1` (iPadOS masquerades as macOS), and macOS Safari as `AppleWebKit && !Chrome/`. Node has no `navigator.userAgent`, so node-dawn resolves to `false`. Only `isWebKitWebGPU === true` enables the grouped-submit path and other WebKit workarounds.
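
The predicate described above might look roughly like this. It is a sketch, not the code in `device.ts`: the function name, the explicit parameters, and the `Macintosh` guard on the touch-point check are assumptions made for illustration:

```typescript
function detectWebKitWebGPU(
  userAgent: string | undefined,
  maxTouchPoints: number = 0,
): boolean {
  // Node has no navigator.userAgent, so node-dawn lands here.
  if (!userAgent || !/AppleWebKit/.test(userAgent)) return false;

  // iPadOS in desktop mode reports a macOS UA, so touch points are the tell.
  // (Restricting that check to "Macintosh" UAs is an assumption of this sketch.)
  const isIOS =
    /iPhone|iPad|iPod/.test(userAgent) ||
    (/Macintosh/.test(userAgent) && maxTouchPoints > 1);

  // Chromium-based browsers also carry "AppleWebKit" but add "Chrome/".
  const isSafariLike = !/Chrome\//.test(userAgent);

  return isIOS || isSafariLike;
}
```

Passing the UA and touch-point count as arguments keeps the function testable outside a browser.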
|
|
440
|
+
|
|
441
|
+
### Uncaptured Error Surfacing
|
|
442
|
+
|
|
443
|
+
`initGPU()` registers `device.onuncapturederror`. Without it, failed buffer allocations and invalid bind groups on Safari are completely silent — the affected dispatches no-op and the symptom is simply zero logits. `WebGPUEngine.create()` additionally wraps executor construction, weight upload, and bind-group creation in paired `pushErrorScope("out-of-memory")` / `pushErrorScope("validation")` scopes and throws a descriptive error if either pops non-null. Before this, every iPad failure was observed blind (Section 18.6).
|
|
444
|
+
|
|
445
|
+
### Buffer Types
|
|
446
|
+
|
|
447
|
+
| Buffer Type | Usage Flags | Alignment | Purpose |
|
|
448
|
+
|------------|-------------|-----------|---------|
|
|
449
|
+
| Storage | STORAGE \| COPY_SRC \| COPY_DST | 4 bytes | Weights, activations, KV cache |
|
|
450
|
+
| Uniform | UNIFORM \| COPY_DST | 16 bytes | Kernel parameters |
|
|
451
|
+
| Readback | MAP_READ \| COPY_DST | 4 bytes | Reading logits back to CPU |
|
|
452
|
+
|
|
453
|
+
### Pipeline Compilation and Caching
|
|
454
|
+
|
|
455
|
+
`getOrCreatePipeline()` compiles WGSL shader code into a `GPUComputePipeline` and caches it by a key of `shaderCode::entryPoint`. This ensures each unique kernel is compiled exactly once per session. The cache can be cleared with `clearPipelineCache()` when switching models or recovering from device loss.
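
The caching behavior can be illustrated without a GPU by abstracting compilation behind a callback. `createPipelineCache` and its `compile` parameter are illustrative; only the `shaderCode::entryPoint` key format comes from the description above:

```typescript
function createPipelineCache<P>(
  compile: (shaderCode: string, entryPoint: string) => P,
) {
  const cache = new Map<string, P>();
  return {
    getOrCreate(shaderCode: string, entryPoint: string): P {
      const key = `${shaderCode}::${entryPoint}`;
      let pipeline = cache.get(key);
      if (pipeline === undefined) {
        pipeline = compile(shaderCode, entryPoint); // the expensive step
        cache.set(key, pipeline);
      }
      return pipeline;
    },
    clear(): void {
      cache.clear();
    },
    size(): number {
      return cache.size;
    },
  };
}
```

Keying on the full shader source makes the cache safe when two kernels share an entry point name, at the cost of long map keys.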
|
|
456
|
+
|
|
457
|
+
### Logit Readback
|
|
458
|
+
|
|
459
|
+
The primary CPU-GPU synchronization point is logit readback. `readbackFloats()` copies data from a storage buffer to a MAP_READ staging buffer, maps it, and returns a `Float32Array` copy. This is used after each forward pass to get the logit distribution for sampling.
|
|
460
|
+
|
|
461
|
+
### GPU Diagnostics Suite (A-P)
|
|
462
|
+
|
|
463
|
+
`verifyGPU()` runs 16 progressive diagnostics (A through P) that test increasingly complex GPU patterns. This isolates Safari/Metal-specific bugs without needing a full model load:
|
|
464
|
+
|
|
465
|
+
| Test | What It Verifies |
|
|
466
|
+
|------|-----------------|
|
|
467
|
+
| A-E | Basic compute, input binding, workgroups, shared memory, separate encoders |
|
|
468
|
+
| F-G | pack2x16float / unpack2x16float (packed-f16 KV path) |
|
|
469
|
+
| H | 256-thread tree reduction (RMSNorm/softmax pattern) |
|
|
470
|
+
| I | INT4 dequant (nibble extraction + scale/zero) |
|
|
471
|
+
| J | Multi-dispatch chain (3 sequential dispatches in one pass) |
|
|
472
|
+
| K | exp() clamp safety (Metal fast-math NaN check) |
|
|
473
|
+
| L | Large buffer integrity (1 MB upload + readback) |
|
|
474
|
+
| M | Offset buffer integrity (view with non-zero byteOffset, simulates safetensors views) |
|
|
475
|
+
| N | Real RMSNorm kernel (actual WGSL from registry, hidden=1024) |
|
|
476
|
+
| O | Real MatVec kernel (actual WGSL, K-parallel layout) |
|
|
477
|
+
| P | 300 dispatches in one compute pass (model does ~300/forward) |
|
|
478
|
+
|
|
479
|
+
Available via `WebGPUEngine.quickDiagnose()` (no model needed) or `engine.diagnose()`.
|
|
480
|
+
|
|
481
|
+
### Integrity Checker
|
|
482
|
+
|
|
483
|
+
`integrityCheck()` reads back weight tensors from the GPU and runs a single forward pass, producing checksums that can be compared between Dawn (known-good) and Safari. This pinpoints where corruption occurs:
|
|
484
|
+
|
|
485
|
+
- **Weights match, logits differ** → Kernel computation bug on Metal (WGSL→MSL miscompilation)
|
|
486
|
+
- **Weights match, embed_out all zeros** → writeBuffer/dispatch synchronization bug (Safari reads stale uniform params)
|
|
487
|
+
- **Weights don't match** → Data transfer/download corruption (Safari fetch or writeBuffer bug)
|
|
488
|
+
|
|
489
|
+
Weight checksums are compared against hardcoded Dawn reference values. Logits are validated for corruption signals (NaN, Inf, all-same values). The check runs automatically after the engine loads on all browsers, not just Safari, and its results are shown in the Playground UI for devices without console access.
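
The logit validation step lends itself to a small scan. The sketch below covers only the three signals named above; `logitCorruptionSignal` is a hypothetical name and the real checker may report more detail:

```typescript
type CorruptionSignal = "nan" | "inf" | "all-same" | null;

// Scan a logits array for the corruption signals listed above.
function logitCorruptionSignal(logits: Float32Array): CorruptionSignal {
  if (logits.length === 0) return null;
  let allSame = true;
  for (let i = 0; i < logits.length; i++) {
    const value = logits[i];
    if (isNaN(value)) return "nan";
    if (!isFinite(value)) return "inf";
    if (value !== logits[0]) allSame = false;
  }
  // A single-element array is trivially uniform, so require at least two.
  return allSame && logits.length > 1 ? "all-same" : null;
}
```

All-zero logits, the symptom of silently no-op dispatches, are reported as `"all-same"`.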
|
|
490
|
+
|
|
491
|
+
---
|
|
492
|
+
|
|
493
|
+
## 7. WGSL Kernel Library
|
|
494
|
+
|
|
495
|
+
The engine includes 12 hand-written WGSL compute shaders covering all operations needed for transformer inference. Each kernel is in `src/gpu/kernels/wgsl/`.
|
|
496
|
+
|
|
497
|
+
For a comprehensive reference of each kernel's algorithm, bindings, uniform layout, dispatch formula, and optimization opportunities, see [kernels.md](./kernels.md).
|
|
498
|
+
|
|
499
|
+
### Kernel Summary
|
|
500
|
+
|
|
501
|
+
| Kernel | File | Workgroup Size | Algorithm |
|
|
502
|
+
|--------|------|---------------|-----------|
|
|
503
|
+
| Embedding | `embedding.wgsl` | (256, 1, 1) | Parallel row gather from weight matrix |
|
|
504
|
+
| MatMul | `matmul.wgsl` | (16, 16, 1) | 16x16 tiled multiply with shared memory |
|
|
505
|
+
| MatMulInt4 | `matmul_int4.wgsl` | (16, 16, 1) | Fused INT4 dequantize + multiply |
|
|
506
|
+
| RMSNorm | `rmsnorm.wgsl` | (256, 1, 1) | One workgroup per row, tree reduction |
|
|
507
|
+
| LayerNorm | `layernorm.wgsl` | (256, 1, 1) | Two-pass (mean, variance), tree reduction |
|
|
508
|
+
| RoPE | `rope.wgsl` | (256, 1, 1) | Parallel pair rotation on Q and K |
|
|
509
|
+
| Attention | `attention.wgsl` | (256, 1, 1) | Per-head causal attention with GQA |
|
|
510
|
+
| Softmax | `softmax.wgsl` | (256, 1, 1) | Three-pass (max, sum, normalize) with tree reduction |
|
|
511
|
+
| SiLU | `silu.wgsl` | (256, 1, 1) | Element-wise x / (1 + exp(-x)) |
|
|
512
|
+
| GELU | `gelu.wgsl` | (256, 1, 1) | Approximate GELU with tanh |
|
|
513
|
+
| Add | `add.wgsl` | (256, 1, 1) | Element-wise addition |
|
|
514
|
+
| Mul | `mul.wgsl` | (256, 1, 1) | Element-wise multiplication |
|
|
515
|
+
|
|
516
|
+
### Design Notes
|
|
517
|
+
|
|
518
|
+
The kernels use a performance-oriented design. Key optimizations:
|
|
519
|
+
|
|
520
|
+
- **Attention**: Tiled online-softmax kernel (Flash-Attention style). Processes KV cache in TILE_S=16 tiles with cooperative loading into shared memory. 256 threads split into groups of 16 for parallel Q·K dot products with vec4 loads, parallel tree reductions for max/sum, and per-dimension V accumulation. Uses exactly 16 KB shared memory (minimum WebGPU guarantee) for iOS compatibility. Score registers are saved before V tile load to allow full shared memory reuse. The Q·K score reduction is **two-phase**: leader threads reduce their group's partials into a local variable, hit a `workgroupBarrier()` in uniform control flow, and only then write `smem[pos_in_tile]`. The earlier single-phase version was a genuine WGSL data race — leader 0 read `smem[1..15]` while leaders 1..15 wrote `smem[0..15]` after the same barrier (Section 17.3). The barrier is a race fix, not a performance knob.
|
|
521
|
+
- **GEMV (decode)**: K-parallel matrix-vector multiply with 8 output columns × 32 K-threads per column, vec4 dot products, single-barrier reduction. Optimized for Apple Silicon SIMD width 32.
|
|
522
|
+
- **MatMul (prefill)**: 16×16 tiling with shared memory tiles.
|
|
523
|
+
- **SwiGLU**: Fused SiLU + Mul in a single kernel (saves 1 dispatch per MLP block).
|
|
524
|
+
- **SwiGLU MatVec**: Fused gate_proj + up_proj + SwiGLU for M=1 decode. Reads input vector once from L1 cache, computes both weight matrix projections, and applies SiLU gating — all in a single dispatch. Saves 2 dispatches per MLP block vs separate gate + up + SwiGLU. The executor detects the gate_proj → up_proj → SwiGLU pattern and automatically substitutes the fused kernel for decode.
|
|
525
|
+
- **ResidualRMSNorm**: Fused residual add + RMS normalization (saves 2 dispatches per layer).
|
|
526
|
+
|
|
527
|
+
Safari/Metal compatibility: the attention kernel avoids WGSL `select()` (Safari has correctness bugs) and clamps `exp()` arguments to -80 (Metal returns NaN for `exp(-1e30)` instead of 0). Additionally, Safari uses packed f16 KV cache (`array<u32>` + `pack2x16float`/`unpack2x16float`) instead of native `array<f16>`, which WebKit's WGSL compiler miscompiles. (The packed-f16 path was A/B-verified against `?kvf32=1` on iPadOS 26.5: both produce byte-identical greedy output — Section 19.) Submit granularity on WebKit is handled by the executor's grouped-submit path, not by kernel changes (Section 18.3).
|
|
528
|
+
|
|
529
|
+
The matmul kernel uses a straightforward 16x16 tiling strategy without register blocking.
|
|
530
|
+
|
|
531
|
+
### Optimization Roadmap
|
|
532
|
+
|
|
533
|
+
Remaining optimizations, prioritized by expected impact. The list draws on research from 63 sources covering the WebGPU spec, Apple Silicon architecture docs, Chrome/Safari implementation details, and Metal compute best practices.
|
|
534
|
+
|
|
535
|
+
#### Priority 1: f16 KV Cache — DONE
|
|
536
|
+
|
|
537
|
+
Halves memory traffic through the attention kernel during decode (192 MB vs 384 MB for Qwen3.5-0.8B at maxSeqLen=4096).
|
|
538
|
+
|
|
539
|
+
**Three kernel strategies** via `KvMode`:
|
|
540
|
+
- **`native-f16`** (Chrome/Dawn): `enable f16;` + `array<f16>` for K/V buffers. Direct `f16()` cast on write, `f32()` on read. Best performance on platforms with correct f16 WGSL support.
|
|
541
|
+
- **`packed-f16`** (Safari): `array<u32>` + `pack2x16float()`/`unpack2x16float()`. No `enable f16` directive — avoids WebKit's WGSL compiler miscompilation of native f16 types. Same memory savings, slightly more ALU for pack/unpack.
|
|
542
|
+
- **`f32`** (fallback): Standard f32 buffers for devices without shader-f16.
|
|
543
|
+
|
|
544
|
+
**Auto-detection** in `WebGPUEngine.create()`: Safari → packed-f16, Chrome/Dawn → native-f16, no f16 → f32. Override via `?kvf32=1` or `?maxseq=N` URL params.
|
|
545
|
+
|
|
546
|
+
**Safari/WebKit bug**: Native `f16` types (`array<f16>`, `f16()` casts) cause garbage output on WebKit's WGSL compiler (confirmed iPad Safari, likely Metal shader compiler). The packed approach sidesteps this entirely by only using `u32` and `f32` types with standard pack/unpack built-ins.
|
|
547
|
+
|
|
548
|
+
#### Priority 2: Subgroup Operations (est. 1.1-1.3×)
|
|
549
|
+
|
|
550
|
+
The stable directive is `enable subgroups;` (shipped Chrome 134). Apple Silicon subgroupSize is always 32. The current 256-thread max reduction drops from 8 barriers + 1024 bytes of shared memory to 1 barrier + 32 bytes: each subgroup does a hardware `subgroupMax()`, leaders write to a tiny scratch array, one barrier, then subgroup 0 reduces the 8 values.
|
|
551
|
+
|
|
552
|
+
**Critical**: Subgroup ops must be in uniform control flow by default (`subgroup_uniformity` diagnostic errors otherwise). Cross-subgroup `if (sg_index == 0u)` blocks ARE uniform within each subgroup, so they pass analysis. Safari 26 does NOT expose subgroups — maintain the shared-memory fallback. Feature detection required.
|
|
553
|
+
|
|
554
|
+
#### Priority 3: Dynamic Uniform Buffer Offsets (est. 1.1-1.3×)
|
|
555
|
+
|
|
556
|
+
`minUniformBufferOffsetAlignment` = 256 bytes everywhere. Pack all ~441 op params at 256-byte stride into a single ~113 KB buffer. Can mix dynamic and static bindings in the same bind group. During dispatch, `pass.setBindGroup(0, bg, [i * 256])` selects each op's params. Only update the ~24 slots with changing `seq_pos` per step via targeted `writeBuffer` calls.
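
The packing arithmetic can be sketched on the CPU side. `packParams` and `paramOffset` are hypothetical helpers; the 256-byte stride and the ~441 op count come from the paragraph above:

```typescript
const UNIFORM_STRIDE = 256; // minUniformBufferOffsetAlignment, per the text

// Byte offset of op i's slot; this is the value passed as the dynamic offset.
function paramOffset(opIndex: number): number {
  return opIndex * UNIFORM_STRIDE;
}

// Pack per-op parameter blobs into one buffer, one 256-byte slot per op.
function packParams(params: Uint8Array[]): Uint8Array {
  const packed = new Uint8Array(params.length * UNIFORM_STRIDE);
  params.forEach((blob, i) => {
    if (blob.byteLength > UNIFORM_STRIDE) {
      throw new Error(`op ${i}: params exceed one ${UNIFORM_STRIDE}-byte slot`);
    }
    packed.set(blob, paramOffset(i));
  });
  return packed;
}
```

For 441 ops this yields 441 × 256 = 112,896 bytes, which matches the ~113 KB figure.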
|
|
557
|
+
|
|
558
|
+
#### Priority 4: Fused GEMV + Residual
|
|
559
|
+
|
|
560
|
+
Thread 0 of the K-parallel reduction adds `residual[row]` before writing output. This costs a single extra read and saves 48 dispatches, roughly 1.4 ms at 30 μs of overhead per dispatch.
|
|
561
|
+
|
|
562
|
+
#### Priority 5: Fused KV Append
|
|
563
|
+
|
|
564
|
+
Combine K+V cache writes into one kernel. Trivial, saves 12 dispatches. Best done alongside f16 KV cache work.
|
|
565
|
+
|
|
566
|
+
#### Priority 6: Advanced GEMV (Register Tiling)
|
|
567
|
+
|
|
568
|
+
Register tiling (4 output rows per thread) is the biggest GEMV win — amortizes input vector reads 4×, increasing arithmetic intensity. Apple GPU has 128 GPRs per SIMD-group, current kernels use ~20-40, so 4 extra accumulators fit easily. Double-buffering is overkill for single-token GEMV where the input vector (4 KB) fits in L1.
|
|
569
|
+
|
|
570
|
+
#### Priority 7: Timestamp Profiling
|
|
571
|
+
|
|
572
|
+
Per-pass timestamps via `timestampWrites` on pass descriptor. `writeTimestamp()` was removed from WebGPU spec. Safari 26 does NOT support `timestamp-query`. Chrome quantizes to 100μs by default; `chrome://flags/#enable-webgpu-developer-features` for full precision. Essential for finding real bottlenecks before investing in lower-impact optimizations.
|
|
573
|
+
|
|
574
|
+
#### Not Worth Pursuing
|
|
575
|
+
|
|
576
|
+
- **Indirect dispatch** (`dispatchWorkgroupsIndirect()`): Slower than direct dispatch due to Chrome's validation overhead. Not useful for fixed architectures.
|
|
577
|
+
- **Double-buffering input**: Overkill for decode where the input vector is 4 KB and fits in L1.
|
|
578
|
+
|
|
579
|
+
#### Platform Capability Matrix
|
|
580
|
+
|
|
581
|
+
| Feature | Chrome 134+ | Safari 26 |
|
|
582
|
+
|---------|-------------|-----------|
|
|
583
|
+
| `shader-f16` | ✅ | ✅ |
|
|
584
|
+
| `subgroups` | ✅ | ❌ |
|
|
585
|
+
| `timestamp-query` | ✅ (100μs quant) | ❌ |
|
|
586
|
+
| `maxComputeWorkgroupStorageSize` | 32768 (requestable) | 16384 (spec minimum) |
|
|
587
|
+
|
|
588
|
+
#### Apple Silicon Occupancy
|
|
589
|
+
|
|
590
|
+
Current 256-thread workgroups with 16 KiB shared memory give ~50% occupancy (2 threadgroups per compute unit sharing the 32 KiB threadgroup memory). This is good for memory-bound kernels where DRAM bandwidth (68-100 GB/s), not compute, is the bottleneck. Smaller workgroups would just increase reduction overhead. Using f16 types reduces register pressure, potentially enabling higher occupancy as a side benefit.
|
|
591
|
+
|
|
592
|
+
#### CPU-GPU Sync Notes
|
|
593
|
+
|
|
594
|
+
`mapAsync()` already guarantees prior work completion — no need for `onSubmittedWorkDone()`. `createCommandEncoder()` cost is <1μs and cannot be avoided (command buffers are single-use in WebGPU). The 441 dispatches cost ~3.5-5.5ms of CPU encoding time. Double-buffered staging buffers could overlap readback with the next step's encoding.
|
|
595
|
+
|
|
596
|
+
**Current baselines (2026-06-13, Qwen3.5-0.8B MLX 4-bit, greedy; see `docs/research/tps-baselines.md`)**: M4 Max node-dawn **207 tok/s** (cooled-stable, after the autoresearch optimization campaign in §20; 145 was the post-detection-fix starting point); iPad (iPadOS 26.5) **31.7 tok/s** batch-all, up to **51.7 tok/s** sustained.
|
|
597
|
+
|
|
598
|
+
**Goals**: desktop **180+ tok/s** via kernel tuning (K_THREADS/N_TILE/workgroup sweeps), fused KV append (−24 dispatches), fused GEMV+residual (−48 dispatches); iPad **50+ tok/s** via single-compute-pass batching on iPadOS 26.5+ plus dispatch-count reduction — every dispatch eliminated helps mobile 2-3× more than desktop because mobile is dispatch-overhead-bound (Section 19).
|
|
599
|
+
|
|
600
|
+
**Mobile invariants any optimization must respect**: iPad `maxComputeWorkgroupStorageSize` = 16384 (the attention kernel sits exactly at this limit); iPad default `maxBufferSize` = 256MB and `maxStorageBufferBindingSize` = 128MB (the INT4 embedding is ~127MB — no headroom for larger vocabularies without sharding); activation buffers are liveness-pooled, so fused kernels must never read and write the same pooled buffer in one dispatch; the two-phase attention reduction barrier is a race fix, not a perf knob.
|
|
601
|
+
|
|
602
|
+
---
|
|
603
|
+
|
|
604
|
+
## 8. Tokenizer
|
|
605
|
+
|
|
606
|
+
The tokenizer (`tokenizer.ts`) is a pure JavaScript BPE implementation that reads HuggingFace `tokenizer.json` files directly. No WASM dependencies, no external libraries.
|
|
607
|
+
|
|
608
|
+
### Encoding Pipeline
|
|
609
|
+
|
|
610
|
+
1. **Added token splitting**: Text is split around ALL added tokens (not just `special: true` ones) using a regex built from the sorted token list. This ensures tokens like `<think>` and `</think>` (which HF marks as non-special) are correctly recognized during encoding
|
|
611
|
+
2. **Pre-tokenization**: Non-special segments are split using a GPT-style regex pattern that separates contractions, words, numbers, and punctuation
|
|
612
|
+
3. **Byte-level encoding**: Characters are converted to the HF byte representation (space becomes `\u0120`, control characters are offset by 256)
|
|
613
|
+
4. **BPE merge**: Character sequences are iteratively merged according to the merge priority table until no more merges are possible
|
|
614
|
+
5. **Byte fallback**: Characters not in the vocabulary are encoded as `<0xHH>` byte-level tokens
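
Step 4 is the core of the algorithm. The loop below is a simplified sketch of rank-driven merging for a single pre-token; `bpeMerge` and the space-joined rank keys are assumptions made for illustration, and the real tokenizer may use a faster data structure:

```typescript
// ranks maps "left right" pairs to merge priority (lower merges first).
function bpeMerge(symbols: string[], ranks: Record<string, number>): string[] {
  const out = symbols.slice();
  while (out.length > 1) {
    // Find the adjacent pair with the best (lowest) rank.
    let bestIndex = -1;
    let bestRank = Infinity;
    for (let i = 0; i < out.length - 1; i++) {
      const rank = ranks[out[i] + " " + out[i + 1]];
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex < 0) break; // no mergeable pair left
    out.splice(bestIndex, 2, out[bestIndex] + out[bestIndex + 1]);
  }
  return out;
}
```

With ranks `{"l l": 0, "h e": 1, "he ll": 2}`, the symbols `h e l l o` merge to `h e ll o`, then `he ll o`, then `hell o`.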
|
|
615
|
+
|
|
616
|
+
### Decoding
|
|
617
|
+
|
|
618
|
+
Decoding reverses the process: token IDs are mapped back to strings, joined, and post-processed to restore spaces (from `\u0120`) and byte-level tokens (from `<0xHH>` format).
|
|
619
|
+
|
|
620
|
+
### Chat Template
|
|
621
|
+
|
|
622
|
+
The tokenizer implements ChatML format for chat-style models:
|
|
623
|
+
|
|
624
|
+
```
|
|
625
|
+
<|im_start|>system
|
|
626
|
+
{system message}<|im_end|>
|
|
627
|
+
<|im_start|>user
|
|
628
|
+
{user message}<|im_end|>
|
|
629
|
+
<|im_start|>assistant
|
|
630
|
+
```
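
A sketch of a formatter that produces this layout. `formatChatML` and `ChatMessage` are illustrative names, and the trailing newline after the final `assistant` tag is an assumption based on common ChatML usage:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Render messages in ChatML and leave the prompt open for the assistant turn.
function formatChatML(messages: ChatMessage[]): string {
  let prompt = "";
  for (const message of messages) {
    prompt += `<|im_start|>${message.role}\n${message.content}<|im_end|>\n`;
  }
  return prompt + "<|im_start|>assistant\n";
}
```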
|
|
631
|
+
|
|
632
|
+
A TODO exists for parsing Jinja2 templates from `tokenizer_config.json` for full generality across model families.
|
|
633
|
+
|
|
634
|
+
For a detailed walkthrough with a worked example, see [tokenizer.md](./tokenizer.md).
|
|
635
|
+
|
|
636
|
+
---
|
|
637
|
+
|
|
638
|
+
## 9. Sampler
|
|
639
|
+
|
|
640
|
+
Token sampling is performed on CPU (`sampler.ts`). The rationale is straightforward: for a vocabulary of ~150K tokens, the f32 logits array is only ~600 KB, which the CPU can scan cheaply. Moving sampling to GPU would add dispatch overhead and a synchronization point without meaningful benefit.
|
|
641
|
+
|
|
642
|
+
### Sampling Pipeline
|
|
643
|
+
|
|
644
|
+
Given a `Float32Array` of logits:
|
|
645
|
+
|
|
646
|
+
1. **Greedy check**: If temperature < 1e-6, return `argmax(logits)` immediately
|
|
647
|
+
2. **Copy**: Work on a copy to avoid mutating the source
|
|
648
|
+
3. **Repetition penalty**: For each previously generated token, divide positive logits by the penalty factor and multiply negative logits by it
|
|
649
|
+
4. **Temperature**: Divide all scores by temperature
|
|
650
|
+
5. **Top-k**: Sort scores descending, keep only the top K candidates
|
|
651
|
+
6. **Softmax**: Compute `exp(score - max)` and normalize to get probabilities
|
|
652
|
+
7. **Top-p (nucleus)**: Keep the smallest set of tokens whose cumulative probability exceeds `topP`, then re-normalize
|
|
653
|
+
8. **Weighted random sample**: Draw a random number and walk the cumulative distribution
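
The eight steps can be sketched end to end as below. This is not the code in `sampler.ts`: the function name, the option names, the injectable `rng`, and the choice to penalize each distinct history token once are assumptions of the sketch:

```typescript
interface SamplerOptions {
  temperature: number;
  topK: number;
  topP: number;
  repetitionPenalty: number;
}

function sampleToken(
  logits: Float32Array,
  opts: SamplerOptions,
  history: number[] = [],
  rng: () => number = Math.random, // injectable for reproducible tests
): number {
  // 1. Greedy check
  if (opts.temperature < 1e-6) {
    let best = 0;
    for (let i = 1; i < logits.length; i++) if (logits[i] > logits[best]) best = i;
    return best;
  }
  // 2. Copy so the caller's logits stay untouched
  const scores = new Float32Array(logits);
  // 3. Repetition penalty, applied once per distinct token
  const seen: Record<number, boolean> = {};
  for (const id of history) {
    if (seen[id]) continue;
    seen[id] = true;
    scores[id] = scores[id] > 0
      ? scores[id] / opts.repetitionPenalty
      : scores[id] * opts.repetitionPenalty;
  }
  // 4. Temperature
  for (let i = 0; i < scores.length; i++) scores[i] /= opts.temperature;
  // 5. Top-k: indices sorted by descending score
  const order: number[] = [];
  for (let i = 0; i < scores.length; i++) order.push(i);
  order.sort((a, b) => scores[b] - scores[a]);
  const k = opts.topK > 0 ? Math.min(opts.topK, order.length) : order.length;
  const candidates = order.slice(0, k);
  // 6. Softmax, shifted by the max for numerical stability
  const maxScore = scores[candidates[0]];
  const weights = candidates.map((id) => Math.exp(scores[id] - maxScore));
  const total = weights.reduce((sum, w) => sum + w, 0);
  // 7. Top-p: smallest prefix whose cumulative probability exceeds topP
  let cutoff = candidates.length;
  let cumulative = 0;
  for (let j = 0; j < candidates.length; j++) {
    cumulative += weights[j] / total;
    if (cumulative > opts.topP) { cutoff = j + 1; break; }
  }
  // 8. Weighted draw; scaling by the kept mass is the re-normalization
  let keptTotal = 0;
  for (let j = 0; j < cutoff; j++) keptTotal += weights[j];
  let r = rng() * keptTotal;
  for (let j = 0; j < cutoff; j++) {
    r -= weights[j];
    if (r <= 0) return candidates[j];
  }
  return candidates[cutoff - 1];
}
```

Note that the greedy check returns before the repetition penalty is applied, which follows the step order listed above.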
|
|
654
|
+
|
|
655
|
+
### Default Parameters
|
|
656
|
+
|
|
657
|
+
```typescript
|
|
658
|
+
temperature: 0.7
|
|
659
|
+
topK: 50
|
|
660
|
+
topP: 0.9
|
|
661
|
+
repetitionPenalty: 1.0 // disabled
|
|
662
|
+
```
|
|
663
|
+
|
|
664
|
+
---
|
|
665
|
+
|
|
666
|
+
## 10. Architecture Registry & Graph Generators
|
|
667
|
+
|
|
668
|
+
The architecture registry (`architectures/index.ts`) maps HuggingFace model architecture strings (from `config.architectures[0]`) to graph generator functions.
|
|
669
|
+
|
|
670
|
+
### How It Works
|
|
671
|
+
|
|
672
|
+
```typescript
|
|
673
|
+
const ARCHITECTURES: Record<string, GraphGenerator> = {
|
|
674
|
+
Qwen2ForCausalLM: generateQwen2Graph,
|
|
675
|
+
Qwen3ForCausalLM: generateQwen2Graph,
|
|
676
|
+
// LlamaForCausalLM: generateLlamaGraph, // TODO
|
|
677
|
+
};
|
|
678
|
+
```
|
|
679
|
+
|
|
680
|
+
When a model is loaded, the engine reads `config.architectures[0]` and looks it up in this registry. The matched generator receives the raw `config.json` object and returns a complete `ModelGraph`.
|
|
681
|
+
|
|
682
|
+
### Qwen2 Graph Generator
|
|
683
|
+
|
|
684
|
+
`architectures/qwen2.ts` generates the graph for Qwen2/Qwen3 models. Each transformer layer produces 15 operation nodes:
|
|
685
|
+
|
|
686
|
+
1. `RMSNorm` -- input_layernorm
|
|
687
|
+
2. `MatMul` -- Q projection
|
|
688
|
+
3. `MatMul` -- K projection
|
|
689
|
+
4. `MatMul` -- V projection
|
|
690
|
+
5. `RoPE` -- rotate Q and K
|
|
691
|
+
6. `Attention` -- scaled dot-product with causal mask and GQA
|
|
692
|
+
7. `MatMul` -- output projection
|
|
693
|
+
8. `Add` -- residual connection 1
|
|
694
|
+
9. `RMSNorm` -- post_attention_layernorm
|
|
695
|
+
10. `MatMul` -- gate projection (SwiGLU)
|
|
696
|
+
11. `MatMul` -- up projection
|
|
697
|
+
12. `SiLU` -- activation on gate output
|
|
698
|
+
13. `Mul` -- gate * up (SwiGLU combine)
|
|
699
|
+
14. `MatMul` -- down projection
|
|
700
|
+
15. `Add` -- residual connection 2
|
|
701
|
+
|
|
702
|
+
Plus 4 global nodes: `Embedding`, `RMSNorm` (final), `SliceLastRow` (extract the last position of `final_norm_out`), and `MatMul` (lm_head, M=1). Both graph generators (`qwen2.ts`, `qwen3_5.ts`) emit the `SliceLastRow` → `lm_head` pair so that logits are computed only for the last position (Section 18.2).
|
|
703
|
+
|
|
704
|
+
For a Qwen3.5-0.8B model (24 layers), this produces:
|
|
705
|
+
- ~364 operation nodes (15 per layer * 24 + 4 global)
|
|
706
|
+
- ~500+ tensor descriptors
|
|
707
|
+
|
|
708
|
+
### Adding a New Architecture
|
|
709
|
+
|
|
710
|
+
See [architectures.md](./architectures.md) for a step-by-step guide.
|
|
711
|
+
|
|
712
|
+
---
|
|
713
|
+
|
|
714
|
+
## 11. KV Cache
|
|
715
|
+
|
|
716
|
+
The KV cache (`kv-cache.ts`) stores previously computed key and value vectors in GPU memory, enabling autoregressive generation without recomputing attention over the full sequence at each step.
|
|
717
|
+
|
|
718
|
+
### LHSd Layout
|
|
719
|
+
|
|
720
|
+
The cache uses LHSd (Layer, Head, Sequence, head_dim) layout:
|
|
721
|
+
|
|
722
|
+
```
|
|
723
|
+
Per layer:
|
|
724
|
+
K buffer: [num_kv_heads, max_seq_len, head_dim] f32
|
|
725
|
+
V buffer: [num_kv_heads, max_seq_len, head_dim] f32
|
|
726
|
+
```
|
|
727
|
+
|
|
728
|
+
Each layer has its own pair of K and V GPU storage buffers.
|
|
729
|
+
|
|
730
|
+
### Pre-allocation Strategy
|
|
731
|
+
|
|
732
|
+
Buffers are allocated at `max_seq_len` capacity when the executor is created. This avoids dynamic reallocation during generation. The tradeoff is higher initial memory usage, but it eliminates allocation jitter and potential out-of-memory errors mid-generation.
|
|
733
|
+
|
|
734
|
+
### Memory Budget Calculation
|
|
735
|
+
|
|
736
|
+
For a model with `L` layers, `H_kv` KV heads, `d` head dimension, and max sequence length `S`:
|
|
737
|
+
|
|
738
|
+
```
|
|
739
|
+
KV cache bytes = L * H_kv * S * d * 4 (f32) * 2 (K + V)
|
|
740
|
+
```
|
|
741
|
+
|
|
742
|
+
Example for Qwen3.5-0.8B (24 layers, 4 KV heads, 64 head_dim, 2048 max seq):
|
|
743
|
+
```
|
|
744
|
+
24 * 4 * 2048 * 64 * 4 * 2 = 100,663,296 bytes (~96 MB)
|
|
745
|
+
```
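
The same budget as a function, useful for checking a configuration before allocating. `kvCacheBytes` is an illustrative helper, with `bytesPerElement` defaulting to 4 for f32:

```typescript
// KV cache bytes = L * H_kv * S * d * bytesPerElement * 2 (K + V)
function kvCacheBytes(
  layers: number,
  kvHeads: number,
  maxSeqLen: number,
  headDim: number,
  bytesPerElement: number = 4,
): number {
  return layers * kvHeads * maxSeqLen * headDim * bytesPerElement * 2;
}

const f32Budget = kvCacheBytes(24, 4, 2048, 64);    // the example above
const f16Budget = kvCacheBytes(24, 4, 2048, 64, 2); // half-precision storage
```

Passing `bytesPerElement = 2` models an f16 cache, which halves the total.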
|
|
746
|
+
|
|
747
|
+
### Prefill vs. Decode
|
|
748
|
+
|
|
749
|
+
- **Prefill** (T = prompt length): All T tokens' K/V entries are written at once at positions [0, T-1]. The attention kernel reads and writes the full range.
|
|
750
|
+
- **Decode** (T = 1): A single K/V entry is written at position `seqPos`. The attention kernel reads positions [0, seqPos] and writes position [seqPos].
|
|
751
|
+
|
|
752
|
+
The `seqPos` counter in the `KVCache` struct tracks how many tokens have been cached. `advanceKVCache()` increments it after each forward pass. `resetKVCache()` sets it to 0 for a new generation without reallocating buffers.
|
|
753
|
+
|
|
754
|
+
---
|
|
755
|
+
|
|
756
|
+
## 12. Executor
|
|
757
|
+
|
|
758
|
+
The executor (`executor.ts`) is the core runtime that orchestrates buffer allocation, weight upload, and compute dispatch.
|
|
759
|
+
|
|
760
|
+
### Buffer Allocation Strategy
|
|
761
|
+
|
|
762
|
+
The executor allocates activation buffers once at construction, sized for `maxSeqLen` — there is no dynamic allocation during generation and memory usage is deterministic and front-loaded. But buffers are **not** one-per-tensor: `allocateActivationBuffers()` runs a last-use liveness analysis over `graph.executionOrder` and recycles buffers through a size-keyed free pool, so tensors that are never live simultaneously share storage. For Qwen3.5-0.8B this collapses 431 activation tensors to 20 physical buffers (37 MB at T=256) — versus ~2.3 GB for the naive per-tensor scheme that was jetsam-killing iPads (Section 17.1). Graph outputs and tensors that are read before they are written (cross-forward state) are excluded from pooling and keep dedicated buffers. See Section 18.1 for the full algorithm.
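
To make the pooling idea concrete, here is a simplified planner. It is a sketch of last-use liveness with a size-keyed free pool, not the algorithm in `allocateActivationBuffers()`; `planActivationBuffers`, `PlanNode`, and the integer buffer IDs are assumptions:

```typescript
interface PlanNode {
  inputs: string[];
  outputs: string[];
}

function planActivationBuffers(
  order: PlanNode[],             // nodes in execution order
  sizes: Record<string, number>, // activation tensor name -> byte size
  pinned: string[] = [],         // tensors that must keep a dedicated buffer
): { assignment: Record<string, number>; bufferCount: number } {
  // Pass 1: index of the last node that touches each tensor.
  const lastUse: Record<string, number> = {};
  order.forEach((node, i) => {
    for (const t of node.inputs.concat(node.outputs)) lastUse[t] = i;
  });

  const dedicated = new Set<string>(pinned);
  const freeBySize = new Map<number, number[]>();
  const assignment: Record<string, number> = {};
  let bufferCount = 0;

  // Pass 2: walk the order, reusing freed buffers of the same size.
  order.forEach((node, i) => {
    // Tensors read before they are written get dedicated buffers.
    for (const t of node.inputs) {
      if (t in sizes && !(t in assignment)) {
        assignment[t] = bufferCount++;
        dedicated.add(t);
      }
    }
    // Allocate outputs before releasing inputs, so a node never reads
    // and writes the same pooled buffer.
    for (const t of node.outputs) {
      if (!(t in sizes) || t in assignment) continue;
      const pool = freeBySize.get(sizes[t]);
      if (!dedicated.has(t) && pool !== undefined && pool.length > 0) {
        assignment[t] = pool.pop() as number;
      } else {
        assignment[t] = bufferCount++;
      }
    }
    // Release buffers whose tensor is dead after this node.
    const released = new Set<string>();
    for (const t of node.inputs.concat(node.outputs)) {
      if (lastUse[t] !== i || !(t in assignment)) continue;
      if (dedicated.has(t) || released.has(t)) continue;
      released.add(t);
      const pool = freeBySize.get(sizes[t]) || [];
      pool.push(assignment[t]);
      freeBySize.set(sizes[t], pool);
    }
  });
  return { assignment, bufferCount };
}
```

On a four-node chain `a → b → c → d` with equal sizes and `d` pinned, the planner uses three buffers: `a` and `c` share one, `b` has its own, and `d` is dedicated.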
|
|
763
|
+
|
|
764
|
+
Weight buffers are allocated on demand during `uploadWeights()` and are sized to exactly match the data.
|
|
765
|
+
|
|
766
|
+
### Forward Pass
|
|
767
|
+
|
|
768
|
+
```typescript
|
|
769
|
+
async forward(inputIds: Uint32Array): Promise<ForwardResult>
|
|
770
|
+
```
|
|
771
|
+
|
|
772
|
+
The forward pass uses a **two-phase dispatch pattern**:
|
|
773
|
+
|
|
774
|
+
**Phase 1 — Uniform buffer updates** (before any dispatch):
|
|
775
|
+
|
|
776
|
+
1. Writes token IDs to the pre-allocated `input_ids` buffer
2. Resolves all symbolic shapes (`"T"` -> actual T, `"L_max"` -> seqPos + T)
3. For each operation: builds uniform params, compares against cached bytes, and calls `device.queue.writeBuffer()` only if changed
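Step 3 of the list above (write a uniform only when its bytes changed) can be sketched as follows. This is a hedged illustration: `UniformCache`, `QueueLike`, and the per-op string key are hypothetical names, and only the `writeBuffer(buffer, offset, data)` call shape is taken from the WebGPU `GPUQueue` API.

```typescript
// Minimal structural type: the single GPUQueue method this sketch needs.
interface QueueLike {
  writeBuffer(buffer: unknown, bufferOffset: number, data: Uint8Array): void;
}

function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

class UniformCache {
  private queue: QueueLike;
  private last = new Map<string, Uint8Array>(); // op id -> bytes last written

  constructor(queue: QueueLike) {
    this.queue = queue;
  }

  // Writes params for one op; returns true only if a GPU write was issued.
  update(opId: string, buffer: unknown, params: Uint8Array): boolean {
    const prev = this.last.get(opId);
    if (prev !== undefined && bytesEqual(prev, params)) return false;
    this.queue.writeBuffer(buffer, 0, params);
    this.last.set(opId, params.slice()); // copy, so caller mutation cannot alias the cache
    return true;
  }
}
```

During steady-state decode most ops keep the same shapes from token to token, so only the params that depend on `seqPos` differ and the rest of the writes are skipped.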
|
|
779
|
+
|
|
780
|
+
**Phase 2 — Compute dispatch** (no writeBuffer calls), with a per-implementation submit strategy:
|
|
781
|
+
|
|
782
|
+
- **Dawn (Chrome, node)**: a single `GPUCommandEncoder` with one compute pass containing all ~440 dispatches, plus the logits copy, submitted in one batch.
- **WebKit** (`needsMultiEncoder === true`): dispatches are grouped into command buffers of `webkitGroupSize` (one compute pass per dispatch), each submitted and **awaited via `onSubmittedWorkDone()` before the next is encoded**, so exactly one command buffer is in flight. The logits copy goes in its own final submit. See Section 18.3 for why, and Section 19 for the throughput-vs-group-size curve.
|
|
784
|
+
|
|
785
|
+
Either way the pass then:
|
|
786
|
+
|
|
787
|
+
4. Copies the `[1, vocab]` logits buffer to the readback buffer (full buffer, offset 0; `SliceLastRow` already restricted lm_head to the last position)
5. Maps the readback buffer and returns the logits as `Float32Array`
6. Advances the KV cache position by T
|
|
790
|
+
|
|
791
|
+
`forwardArgmax()` (greedy decode without logits readback) follows the same structure, with the argmax dispatch and its 4-byte readback copy in separate submits on WebKit, mirroring `forward()`.
|
|
792
|
+
|
|
793
|
+
> **History — the "Safari writeBuffer bug" callout, revised**: An earlier version of this section blamed iPad gibberish on WebKit failing to synchronize `writeBuffer()` calls made *during* an active compute pass: stale zero params → all threads exit early → all-zero output → degenerate text at ~276 tok/s (zero-work dispatches are nearly free, which is why the "throughput" was 2-4× desktop). The throughput signature and the "zeros, not garbage" character of that diagnosis were correct — the recorded gibberish *was* zero logits in disguise. But the June 2026 investigation **refuted stale params as the mechanism** (Section 17.2): the kernels' write guards mean zeroed params would have written *nothing*, yet probes showed zeros actively overwriting pre-seeded data, and the failure flipped correct↔zeros purely on submit granularity with byte-identical CPU-side writes. The real cause was a within-submission storage-visibility bug in WebKit, since fixed upstream (does not reproduce on iPadOS 26.5). The two-phase write-then-dispatch pattern is retained as cheap, spec-clean hygiene, but it was never the fix.
|
|
794
|
+
|
|
795
|
+
### Prefill vs. Decode Dispatch
|
|
796
|
+
|
|
797
|
+
The same forward pass code handles both prefill and decode -- the only difference is the value of T:
|
|
798
|
+
|
|
799
|
+
- **Prefill**: T = prompt token count (e.g., 50). Matmul dispatch sizes are larger, and attention processes T queries against T keys.
- **Decode**: T = 1. Single-token forward pass. Matmul dispatch is minimal, and attention processes 1 query against seqPos keys.
|
|
801
|
+
|
|
802
|
+
The kernel dispatch sizing functions (in the kernel registry) use the resolved shapes to compute appropriate workgroup counts for either case.
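To make the prefill/decode difference concrete, the sketch below resolves symbolic dimensions and derives a workgroup grid from them. It is illustrative only: `resolveShape` and `matmulWorkgroups` are hypothetical helpers, the hidden size of 1024 is an assumed example value, and the 16x16 tile is borrowed from the tiled f32 MatMul kernel listed in Section 15 (the decode path may use the GEMV kernels with a different grid).

```typescript
const TILE = 16; // assumed: one workgroup per 16x16 output tile

// Replaces symbolic dimensions such as "T" or "L_max" with concrete sizes.
function resolveShape(
  shape: readonly (number | string)[],
  symbols: Record<string, number>,
): number[] {
  return shape.map((dim) => {
    if (typeof dim === "number") return dim;
    const value = symbols[dim];
    if (value === undefined) throw new Error(`unresolved symbolic dimension "${dim}"`);
    return value;
  });
}

// Workgroup grid [x, y] for an [M, K] x [K, N] matmul producing [M, N].
function matmulWorkgroups(M: number, N: number): [number, number] {
  return [Math.ceil(N / TILE), Math.ceil(M / TILE)];
}

// Same code path for both phases; only T (and therefore L_max) differs.
function gridFor(T: number, seqPos: number): [number, number] {
  const [rows, hidden] = resolveShape(["T", 1024], { T, L_max: seqPos + T });
  return matmulWorkgroups(rows, hidden);
}

console.log(gridFor(50, 0)); // prefill: [64, 4]
console.log(gridFor(1, 50)); // decode:  [64, 1]
```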
|
|
803
|
+
|
|
804
|
+
---
|
|
805
|
+
|
|
806
|
+
## 13. Model Loading Pipeline
|
|
807
|
+
|
|
808
|
+
The model loader (`model-loader.ts`) orchestrates the entire process of downloading and preparing a model from HuggingFace Hub.
|
|
809
|
+
|
|
810
|
+
### Loading Steps
|
|
811
|
+
|
|
812
|
+
1. **Fetch config.json**: Determine architecture, extract dimensions
2. **Generate IR**: Call the matching graph generator to produce the `ModelGraph`
3. **Fetch tokenizer**: Download `tokenizer.json` and `tokenizer_config.json` in parallel, construct the `Tokenizer`
4. **Discover safetensors files**: Try `model.safetensors.index.json` first (for sharded models), fall back to single `model.safetensors`
5. **Download weight files**: Stream each safetensors file with progress callbacks
6. **Parse and map weights**: Parse safetensors headers, extract typed array views, map HF keys to canonical names via `HFKeyMapper`
7. **Verify**: Check that all expected weight tensors are present; warn about missing ones
|
|
819
|
+
|
|
820
|
+
### HuggingFace Hub Integration
|
|
821
|
+
|
|
822
|
+
The loader resolves repo strings to HF Hub URLs:
|
|
823
|
+
|
|
824
|
+
```
"Qwen/Qwen3.5-0.8B" -> "https://huggingface.co/Qwen/Qwen3.5-0.8B/resolve/main"
```
|
|
827
|
+
|
|
828
|
+
It supports:
|
|
829
|
+
- Gated models via `hfToken` authentication
- Custom revisions/branches
- Full URL pass-through (for self-hosted models)
- Multi-shard safetensors models (via the index file)
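The URL resolution and the first three options above might look roughly like this. It is a sketch under assumptions: `resolveBaseUrl`, `authHeaders`, and the `revision` option name are hypothetical (only `repo` and `hfToken` appear in this document), while the `Authorization: Bearer` header is the standard way to present a Hugging Face access token.

```typescript
interface RepoOptions {
  repo: string;      // "org/name", or a full URL for self-hosted models
  revision?: string; // assumed option name: branch, tag, or commit (default "main")
  hfToken?: string;  // access token for gated models
}

// Maps a repo string to the base URL that individual files are fetched from.
function resolveBaseUrl(opts: RepoOptions): string {
  if (/^https?:\/\//.test(opts.repo)) {
    return opts.repo.replace(/\/+$/, ""); // full URL pass-through, trailing slashes trimmed
  }
  const revision = encodeURIComponent(opts.revision ?? "main");
  return `https://huggingface.co/${opts.repo}/resolve/${revision}`;
}

// Request headers; empty for public models.
function authHeaders(opts: RepoOptions): Record<string, string> {
  return opts.hfToken ? { Authorization: `Bearer ${opts.hfToken}` } : {};
}

console.log(resolveBaseUrl({ repo: "Qwen/Qwen3.5-0.8B" }));
// https://huggingface.co/Qwen/Qwen3.5-0.8B/resolve/main
```

A file such as `config.json` would then be requested from the base URL plus `/config.json`, with `authHeaders(opts)` attached to the request.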
|
|
833
|
+
|
|
834
|
+
### Progress Tracking
|
|
835
|
+
|
|
836
|
+
The loader provides normalized progress (0-100) to the caller:
|
|
837
|
+
- 0-5%: Fetching config
- 5-10%: Fetching tokenizer
- 10-95%: Downloading weight files (distributed evenly across shards)
- 95-100%: Final verification
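The bands above amount to a piecewise-linear mapping from per-phase progress to one overall percentage. A minimal sketch, assuming hypothetical names (`overallProgress`, `Phase`) and a per-phase completion fraction in [0, 1]:

```typescript
type Phase = "config" | "tokenizer" | "weights" | "verify";

// Percentage band [start, end] for each phase, as listed above.
const RANGES: Record<Phase, [number, number]> = {
  config: [0, 5],
  tokenizer: [5, 10],
  weights: [10, 95],
  verify: [95, 100],
};

// fraction: completion of the current phase (or of the current shard during "weights").
function overallProgress(
  phase: Phase,
  fraction: number,
  shardIndex = 0,
  shardCount = 1,
): number {
  const f = Math.min(1, Math.max(0, fraction));
  const [start, end] = RANGES[phase];
  // Each shard owns an equal slice of the weights band.
  const within = phase === "weights" ? (shardIndex + f) / shardCount : f;
  return start + (end - start) * within;
}

console.log(overallProgress("weights", 0.5, 1, 4)); // 41.875
```

Giving every shard an equal slice keeps the mapping simple; weighting by shard byte size would be smoother when shards differ a lot in size, but that goes beyond what the list above describes.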
|
|
841
|
+
|
|
842
|
+
---
|
|
843
|
+
|
|
844
|
+
## 14. Public API (WebGPUEngine)
|
|
845
|
+
|
|
846
|
+
The engine is designed to be used through a high-level API surface. Based on the current codebase, the intended usage pattern is:
|
|
847
|
+
|
|
848
|
+
```typescript
import { initGPU } from "./device.js";
import { loadModel } from "./model-loader.js";
import { Executor } from "./executor.js";
import { sampleToken } from "./sampler.js";

// 1. Initialize WebGPU
const ctx = await initGPU();

// 2. Load model from HuggingFace
const { graph, tokenizer, weights } = await loadModel({
  repo: "Qwen/Qwen3.5-0.8B",
  onProgress: (loaded, total, msg) => console.log(`${loaded}/${total}: ${msg}`),
});

// 3. Create executor and upload weights
const executor = new Executor(ctx, graph, { maxSeqLen: 2048 });
executor.uploadWeights(weights);

// 4. Encode prompt
const inputIds = new Uint32Array(
  tokenizer.encodeChat([
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Hello!" },
  ])
);

// 5. Prefill: one forward pass over the whole prompt yields the first token
let result = await executor.forward(inputIds);
let nextToken = sampleToken(result.logits, { temperature: 0.7, topK: 50, topP: 0.9 });

// 6. Decode loop: emit each token before feeding it back, so the first token
//    is printed too and the EOS token is never decoded
const maxTokens = 256;
const generated: number[] = [];

while (generated.length < maxTokens && nextToken !== tokenizer.config.eosTokenId) {
  generated.push(nextToken);
  process.stdout.write(tokenizer.decode([nextToken]));
  result = await executor.forward(new Uint32Array([nextToken]));
  nextToken = sampleToken(result.logits, { temperature: 0.7 }, generated);
}

// 7. Cleanup
executor.destroy();
```
|
|
893
|
+
|
|
894
|
+
### Key Methods
|
|
895
|
+
|
|
896
|
+
| Method | Description |
|--------|-------------|
| `initGPU()` | Request WebGPU adapter and device |
| `loadModel(options)` | Download model from HF Hub, generate IR, build tokenizer |
| `new Executor(ctx, graph, opts)` | Create executor with KV cache allocation |
| `executor.uploadWeights(weights)` | Upload safetensors data to GPU |
| `executor.forward(inputIds)` | Run one forward pass, return logits |
| `executor.reset()` | Clear KV cache for new conversation |
| `executor.destroy()` | Free all GPU resources |
| `sampleToken(logits, params, history?)` | Sample next token from logit distribution |
|
|
906
|
+
|
|
907
|
+
---
|
|
908
|
+
|
|
909
|
+
## 15. What's Built vs. What's Planned
|
|
910
|
+
|
|
911
|
+
| Component | Status | Notes |
|-----------|--------|-------|
| IR type system | Built | OpType, TensorDesc, OpNode, ModelGraph |
| Canonical key mapping | Built | Default HF key mapper (strip "model." prefix) |
| Safetensors parser | Built | Zero-copy typed views, streaming support |
| WebGPU device layer | Built | Buffer management, pipeline caching, readback |
| Embedding kernel | Built | Simple parallel gather |
| MatMul kernel (f32) | Built | 16x16 tiled with shared memory |
| MatMulInt4 kernel | Built | Fused dequant+matmul, not yet wired to graph generators |
| RMSNorm kernel | Built | Tree reduction, one workgroup per row |
| LayerNorm kernel | Built | Two-pass mean+variance, tree reduction |
| RoPE kernel | Built | Handles GQA (separate Q/K head counts) |
| Attention kernel | Built | Tiled online-softmax, 256-thread parallel, 16 KB shared memory, two-phase score reduction (race-free) |
| Softmax kernel | Built | Three-pass with tree reduction |
| SiLU kernel | Built | Element-wise |
| GELU kernel | Built | Approximate version |
| Add kernel | Built | Element-wise |
| Mul kernel | Built | Element-wise |
| BPE Tokenizer | Built | Pure JS, HF tokenizer.json compatible |
| CPU Sampler | Built | temp/top-k/top-p/repetition penalty |
| Qwen2/3 graph generator | Built | Full layer structure with SwiGLU MLP |
| KV Cache | Built | LHSd layout, pre-allocated |
| Executor | Built | Single-encoder batching (Dawn) + grouped submits with one CB in flight (WebKit, `?group=N`) |
| Activation liveness pooling | Built | 431 tensors → 20 buffers (37 MB at T=256), size-keyed free pool |
| SliceLastRow + [1, vocab] logits | Built | Last-position lm_head only; saves 485 MiB (508.6 MB) at T=512 and the full-vocab prefill matmul |
| Model Loader | Built | HF Hub, multi-shard support |
| Kernel Registry | Built | Mapping OpType -> kernel spec with dispatch/params helpers |
| GEMV decode kernels | Built | K-parallel MatVec (f32 + INT4), 8 cols × 32 K-threads per workgroup |
| SwiGLU MatVec fusion | Built | Fused gate+up+SwiGLU decode kernel, auto-detected by executor |
| EmbeddingInt4 kernel | Built | INT4 dequant lookup, tied lm_head reuse |
| INT4 quantization | Built | On-the-fly F32→INT4, GPTQ repack, MLX adapter |
| WebKit/Safari compat | Built | UA-based `isWebKitWebGPU`, grouped submits, no select(), clamped exp(), 16 KB smem, packed-f16 KV, Cache-API write skip >64MB |
| GPU error surfacing | Built | `onuncapturederror` + error scopes around alloc/upload/bind-group setup |
| GPU diagnostics (A-Q) | Built | 17-test suite: buffer integrity, compute, shared mem, f16, tree reduce, INT4, multi-dispatch, exp safety, real kernels, 300-dispatch stress |
| Integrity checker | Built | Weight/logits checksum comparison between Dawn and Safari, auto-runs on Safari load |
| Qwen3.5 graph gen | Built | Hybrid Mamba SSM + full attention with Gated Delta Net |
| LLaMA/Mistral graph gen | Planned | Same base architecture, different key mappings |
| Phi graph gen | Planned | Different MLP structure (fc1/fc2 vs gate/up/down) |
| f16 KV cache | Built | Native f16 (Chrome) + packed u32 (Safari), auto-detected |
| f16 compute path | Planned | Use shader-f16 for faster matmul |
| Native text embeddings | Built | Qwen3-Embedding-0.6B (`Qwen3ForCausalLM`): last-token EOS pooling + `L2Norm` tail. dim 1024, unit norm, cos(similar)=0.81 > cos(unrelated)=0.56 (Section 21) |
| L2Norm kernel | Built | Row-wise L2 normalization, one workgroup/row |
| Vision encoder (ViT) | Built | Qwen3.5's own 12-layer ViT from the same checkpoint; bit-exact vs HF transformers 5.12 (per-token cosine 1.000000, max abs err ~5e-6). `engine.encodeImage(patches, gridTHW)` (Section 22) |
| Vision ops (AddBias, GeluErf, ApplyRotaryEmb, SliceCols) | Built | ViT bias adds, exact-erf merger GELU, 2D rotary, fused-QKV split |
| Bidirectional attention (`is_causal` flag) | Built | One-line uniform on the existing parallel attention kernel; text stays causal by default, ViT runs non-causal (Section 22) |
| Vision LM integration (M-RoPE, token splice, image preprocessing) | In progress | Phase 2: the encoder is done; splicing image tokens into the text stream is the remaining work |
| MoE support | Planned | MoERouter, ExpertMatMul kernels |
| Streaming API | Planned | High-level `stream()` method with async iteration |
| Jinja2 template parser | Planned | Full chat template support beyond ChatML |
|
|
960
|
+
|
|
961
|
+
### 15.1 Exploration backlog (open directions, June 2026)
|
|
962
|
+
|
|
963
|
+
Candidates under evaluation — not commitments. The first item gates several others.
|
|
964
|
+
|
|
965
|
+
| Direction | Source / rationale | Status |
|-----------|--------------------|--------|
| **Engine consolidation decision** | Two engines (native WGSL text-only; transformers.js/ONNX breadth) is untenable long-term. Only transformers.js covers all modalities (vision/TTS/STT/embeddings). The decisive test: does transformers.js run on iOS 26.5 (post our environmental fixes + ORT 1.25 + IIFE worker)? If yes → consolidate on it; native WGSL kept only if a head-to-head shows a decisive text-speed win. | **Decision gate (test pending)** |
| **KVSwap: KV-cache memory tiering** | son-of-ole/infinite-edge-agent (MIT). KV cache as VRAM→RAM→IndexedDB hierarchy with pinning/eviction/prefetch. Lifts the mobile context cap (the next wall after activation pooling). Applies to any WebGPU text path. | Explore (high value) |
| **SSA: sparse/sub-quadratic attention** | Same repo. Block-summary → top-K block routing → gather → sparse attention. Cheaper long context, but changes attention semantics (approximate): a quality/throughput research bet, not drop-in. | Explore (research) |
| **Evaluate WebLLM/MLC as the text backend** | Mature compiled-WebGPU runtime (TVM-based); the most-used browser LLM engine. Text-only (not multimodal). Relevant only if we decide a separate fast text path is worth maintaining, as a more-supported alternative to the native WGSL engine. | Evaluate |
| **Gemma 4 E2B support** | Smallest current on-device Gemma (~3.35 GB q4). Tier 2: needs sliding-window attention + GeGLU + logit soft-cap (`docs/research/qwen36-gemma4-targets.md`). | Planned target |
| **Qwen3-0.6B (dense)** | Tier-1 proof of the `add-model-family` process; ~0.32 GB q4. | Planned target |
| **Agent-runtime feature layer** | Persistent local memory + context reconstruction (cf. same repo's Memory OS). A *feature set* packaged AI-SDK-style, NOT an engine concern. | Explore (product layer) |
|
|
974
|
+
|
|
975
|
+
---
|
|
976
|
+
|
|
977
|
+
## 16. Key Design Decisions
|
|
978
|
+
|
|
979
|
+
### Why Runtime IR (Not Offline Converter)
|
|
980
|
+
|
|
981
|
+
An offline converter (like ONNX export) adds a mandatory pre-processing step. Users must find or create converted model files, keep them in sync with upstream model updates, and deal with conversion bugs. By generating the IR at runtime from `config.json`, gerbil can point at any HuggingFace repo that uses a supported architecture and just run it. The cost is a few milliseconds of graph generation at load time -- negligible compared to weight download.
|
|
982
|
+
|
|
983
|
+
### Why f32 First (Not f16)
|
|
984
|
+
|
|
985
|
+
f16 compute requires the `shader-f16` WebGPU feature, which is not universally available. Starting with f32 ensures correctness across all WebGPU-capable browsers. f16 is being added incrementally: the KV cache uses f16 storage with f32 compute (halving attention memory traffic), and f16 matmul will follow. Feature detection ensures automatic fallback to f32 on devices without `shader-f16`.
|
|
986
|
+
|
|
987
|
+
### Why CPU Sampling
|
|
988
|
+
|
|
989
|
+
The logits array for a typical vocabulary (~150K tokens) is ~600KB as f32. Copying this to CPU and sampling there takes microseconds. GPU-side sampling would require:
|
|
990
|
+
1. An additional kernel for argmax/multinomial sampling
2. A readback of a single u32 (the selected token ID)
3. Synchronization to get the result
|
|
993
|
+
|
|
994
|
+
The overhead of kernel dispatch + sync would exceed the CPU sampling time. The CPU also has access to `Math.random()` and can implement complex sampling strategies (repetition penalty over arbitrary history) more naturally.
|
|
995
|
+
|
|
996
|
+
### Why No WASM Fallback
|
|
997
|
+
|
|
998
|
+
The previous approach maintained three inference paths (WebGPU, WASM, CPU), each with different bugs and performance profiles. The GPU engine targets WebGPU exclusively. If WebGPU is unavailable, the engine throws a clear error rather than silently degrading to a 10x slower path. This keeps the codebase focused and testable.
|
|
999
|
+
|
|
1000
|
+
Browser WebGPU coverage (Chrome 113+, Safari 26+, Firefox 141+) is sufficient for the target audience. The old transformers.js path remains available for legacy browser support if needed.
|
|
1001
|
+
|
|
1002
|
+
### Why Canonical Tensor Naming
|
|
1003
|
+
|
|
1004
|
+
Different model families use different key prefixes in their safetensors files (`model.layers.0.self_attn.q_proj.weight` vs `transformer.h.0.attn.c_attn.weight`). By mapping to canonical names, the graph generators and executor code don't need to know about HF-specific naming. Adding a new model family only requires writing a key mapper function, not modifying the core engine.
|
|
1005
|
+
|
|
1006
|
+
### Why LHSd KV Layout
|
|
1007
|
+
|
|
1008
|
+
LHSd (Layer, Head, Sequence, head_dim) puts the sequence dimension third, so within each (layer, head) slab, appending a new token's K/V vector is a simple write at offset `seqPos * head_dim` from the slab's base. The attention kernel reads contiguous head_dim-sized chunks, which is cache-friendly for the dot-product computation. This layout matches what most transformer implementations use internally.
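The offset arithmetic for this layout is plain row-major indexing over `[L, H, S, d]`. The sketch below is illustrative: `KVLayout` and `kvElementOffset` are hypothetical names, and offsets are counted in elements rather than bytes.

```typescript
interface KVLayout {
  numLayers: number;
  numKVHeads: number; // number of K/V heads (may be smaller than query heads under GQA)
  maxSeqLen: number;
  headDim: number;
}

// Element offset of the K (or V) vector for one layer, head, and sequence position.
function kvElementOffset(l: KVLayout, layer: number, head: number, pos: number): number {
  return ((layer * l.numKVHeads + head) * l.maxSeqLen + pos) * l.headDim;
}

const layout: KVLayout = { numLayers: 2, numKVHeads: 4, maxSeqLen: 8, headDim: 16 };

// Inside one (layer, head) slab the offset is just pos * headDim past the slab base.
const slabBase = kvElementOffset(layout, 1, 2, 0);
console.log(kvElementOffset(layout, 1, 2, 3) - slabBase); // 48
```

Consecutive positions of one head are `headDim` elements apart, so the keys a query attends to form one contiguous run in memory.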
|
|
1009
|
+
|
|
1010
|
+
### Why Detect the Implementation, Not the Hardware
|
|
1011
|
+
|
|
1012
|
+
Workarounds in this engine exist for *WebGPU implementations* (WebKit's), not for GPUs. Apple Silicon is the hardware under both Safari and desktop Chrome/node-dawn, and `adapter.info` reports `vendor: "apple"` in all three — so any hardware-keyed predicate inevitably drags a correct, fast implementation through workarounds built for a broken one (Section 17.4 documents the 161 → 19.7 tok/s cost of getting this wrong). The user agent identifies the implementation unambiguously: all iOS/iPadOS browsers are WebKit by platform mandate, macOS Safari is `AppleWebKit` without `Chrome/`, and node has no UA at all.
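A user-agent predicate along these lines could look as follows. This is only a sketch of the rules stated above, not the engine's actual `isWebKitWebGPU` implementation; the function name and the exact regular expressions are assumptions.

```typescript
// Returns true when the WebGPU implementation is WebKit's, judged by user agent alone.
function isWebKitWebGPUSketch(userAgent: string | undefined): boolean {
  // node has no user agent, and node-dawn is Dawn.
  if (!userAgent) return false;
  // Every iOS/iPadOS browser is WebKit, including Chrome-iOS ("CriOS").
  if (/\b(iPhone|iPad|iPod)\b/.test(userAgent)) return true;
  // Chromium browsers also advertise AppleWebKit, so they must be excluded explicitly.
  return /AppleWebKit\//.test(userAgent) && !/(Chrome|Chromium)\//.test(userAgent);
}

const SAFARI_MAC =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 " +
  "(KHTML, like Gecko) Version/18.0 Safari/605.1.15";
const CHROME_MAC =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36";

console.log(isWebKitWebGPUSketch(SAFARI_MAC)); // true
console.log(isWebKitWebGPUSketch(CHROME_MAC)); // false
console.log(isWebKitWebGPUSketch(undefined));  // false (node)
```

Note that `adapter.info` is not consulted at all: by the argument above it reports the same Apple hardware in all three cases and cannot tell the implementations apart.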
|
|
1013
|
+
|
|
1014
|
+
### Why Liveness Pooling Instead of Per-Tensor Buffers
|
|
1015
|
+
|
|
1016
|
+
Per-tensor allocation at `maxSeqLen` is simple and was fine on a 128 GB desktop, but it scales with *graph size*, not with *concurrent liveness* — and a 431-tensor graph at T=512 wants ~2.3 GB while only ~20 tensors are ever alive at once. The pooled allocator keeps every desirable property of pre-allocation (no allocation during generation, deterministic footprint, buffers reused across forwards) at ~1.6% of the memory. The one obligation it creates: fused kernels must never read and write the same pooled buffer in a single dispatch, and mid-graph debug readbacks are only meaningful before a later op reuses the buffer.
|
|
1017
|
+
|
|
1018
|
+
---
|
|
1019
|
+
|
|
1020
|
+
## 17. The Four-Failure-Mode Diagnosis (June 2026)
|
|
1021
|
+
|
|
1022
|
+
Through early 2026 the engine's mobile story was a single undifferentiated narrative: "it crashes or produces garbage on iPad, probably Metal coherence." On 2026-06-12 a multi-agent investigation (full report: `docs/mobile-failure-diagnosis.md`; running log: `docs/metal-safari-intel.md`) took the four observed symptoms apart and found **four independent root causes** — two of them local bugs in this codebase, one a genuine (and since-fixed) WebKit bug, and one a detection bug that had silently poisoned every desktop benchmark for months. Conflating them is precisely what made the problem look intractable.
|
|
1023
|
+
|
|
1024
|
+
### 17.1 jetsam-crash: a local memory bug, not Metal
|
|
1025
|
+
|
|
1026
|
+
The iPad tab kills ("download reaches 100%, then blank page") were attributed to dispatch-strategy overhead — the engineering log explicitly blamed "400 roundtrips + Promise overhead" for the DRAIN_EVERY=1 OOM. That explanation was wrong: a fully drained loop bounds in-flight command buffers to ~1, so roundtrip count cannot dominate memory. The actual cause was the executor's buffer allocator.
|
|
1027
|
+
|
|
1028
|
+
`allocateActivationBuffers()` created **one dedicated buffer per activation tensor at full `maxSeqLen`**, with zero reuse. For Qwen3.5-0.8B (431 activation tensors, vocab 248,320) at `maxSeqLen=512` the math is unforgiving:
|
|
1029
|
+
|
|
1030
|
+
| Component | Size |
|---|---|
| Logits buffer alone: `[T, vocab]` f32 = 512 × 248,320 × 4 bytes | **508.6 MB (≈485 MiB)**, of which only the last row (993 KB) was ever read |
| All ~430 per-tensor activation buffers | **~2.30 GB** |
| INT4 weights | ~0.44 GB |
| **Total GPU footprint, INT4 @ T=512** | **~2.77 GB** |
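The logits row of the table can be reproduced with a few lines of arithmetic. The helper name `logitsBytes` is hypothetical; the inputs (T=512, vocab 248,320, 4-byte f32) come from the text above.

```typescript
const BYTES_PER_F32 = 4;

// Size in bytes of a [T, vocab] f32 logits tensor.
function logitsBytes(T: number, vocab: number): number {
  return T * vocab * BYTES_PER_F32;
}

const VOCAB = 248320;
const full = logitsBytes(512, VOCAB);  // 508,559,360 bytes
const lastRow = logitsBytes(1, VOCAB); //     993,280 bytes

console.log((full / 1e6).toFixed(1));           // "508.6" (MB)
console.log((full / (1024 * 1024)).toFixed(1)); // "485.0" (MiB)
console.log(Math.round(lastRow / 1000));        // 993 (KB)
console.log(full / lastRow);                    // 512: only 1/512 of the buffer was ever read
```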
|
|
1036
|
+
|
|
1037
|
+
Against an iOS web-content jetsam budget of ~1.5-2 GB, **every configuration ever run on the iPad was memory-doomed before the first dispatch executed** — INT4 @ T=256 was 1.61 GB, the engine's then-default `maxSeqLen=4096` requested 18.9 GB, and the diagnostic page's default URL silently loaded the BF16 repo (6.08 GB F32 graph) under an "INT4" label. The drained per-dispatch experiments "failed" for memory reasons, not because draining was wrong — which sent the investigation chasing submit strategies when the bug was in allocation.
|
|
1038
|
+
|
|
1039
|
+
Fixes: liveness pooling (18.1), `SliceLastRow` (18.2), iOS `maxSeqLen` clamping, diagnostic-page default model fix, and load-transient trims. Post-fix footprint at T=512: ~0.44 GB weights + ~0.04 GB pooled activations + ~0.03 GB KV/SSM state ≈ **0.6-0.7 GB — comfortably inside the budget for the first time in the project's history**.
|
|
1040
|
+
|
|
1041
|
+
### 17.2 zero-logits: a real WebKit bug — and an OS-version-dependent one
|
|
1042
|
+
|
|
1043
|
+
The historical core mystery: a full forward pass (~440 dispatches in one command buffer) produced all-zero logits from dispatch entry 2 onward on iPad Safari, while every per-dispatch-submit probe was correct and the identical code was correct on Dawn. Three things are now established:
|
|
1044
|
+
|
|
1045
|
+
1. **It was not a local bug.** Stale-params and silent-allocation-failure theories were refuted: CPU-side state is byte-identical between the batched and per-dispatch paths (only submit granularity differs, and that alone flipped correct↔zeros); the kernels' `col < params.N` write guards mean zeroed params would write *nothing*, yet probes showed zeros actively **overwriting pre-seeded nonzero data** — a read-side visibility failure, with entry 2 computing real output from a stale (zero) view of entry 1's freshly written input.
|
|
1046
|
+
|
|
1047
|
+
2. **It was a genuine WebKit within-submission storage-visibility bug, threshold-dependent.** The WebGPU spec guarantees cross-dispatch storage visibility within a submission (gpuweb #4433/#4434). Every passing diagnostic was tiny (Test J: 2 pipelines, 16-byte buffers; Test P: 300 dispatches but one pipeline); production combined ~441 dispatches, 15+ distinct pipelines, 6-9 bindings, and MB-scale buffers in one submission — a scale no diagnostic ever reached.
|
|
1048
|
+
|
|
1049
|
+
3. **KEY FINDING (2026-06-12): the bug does not reproduce on iPadOS 26.5.** The exact historical zero-logits configuration — the entire ~440-dispatch forward in a single command buffer — now produces **byte-identical output to the desktop Dawn reference** on the test iPad (WebKit WebGPU via Chrome-iOS WKWebView over HTTPS). The bug is OS-version-dependent and was fixed upstream in WebKit (cf. WebKit bug 311598, where llama.cpp reports ~64 passes/CB stable on iOS 26.4). The grouped-submit machinery (18.3) is therefore retained as a *compatibility dial for older WebKit*, not as the permanent execution model.
|
|
1050
|
+
|
|
1051
|
+
### 17.3 gibberish: zeros in disguise, plus a real data race waiting underneath
|
|
1052
|
+
|
|
1053
|
+
The recorded gibberish event (plausible-but-wrong tokens at ~276 tok/s on iPad, 2026-03) had been attributed to an "unknown Safari WGSL→MSL compiler bug." Two separate findings replace that:
|
|
1054
|
+
|
|
1055
|
+
**(a) The recorded event was zero-logits in disguise.** ~276 tok/s — 2-4× faster than an M4 Max — is the signature of zero-work dispatches: all threads early-exit, argmax over an all-zero logits buffer returns token 0 or whatever degenerate pattern the sampler makes of it, and "generation" races ahead. The miscompilation theory was additionally contradicted by Tests N/O (the real RMSNorm and MatVec kernels pass in isolation on the same iPad) and by the fact that the packed-f16 kernels only ever saw already-zero inputs in failing runs.
|
|
1056
|
+
|
|
1057
|
+
**(b) A genuine WGSL data race existed in all three attention variants.** In the Q·K score reduction, after the publish barrier, each dot-group leader read `smem[tid..tid+15]` and wrote `smem[pos_in_tile]` (slots 0..15) in the same block — leader 0's read range is exactly the write target of leaders 1..15, with no barrier between cross-group reads and writes. This is spec-level UB. It happened to be benign on Tint/M-series scheduling, which is why desktop never showed it, but WebKit's independent WGSL→MSL compiler and A-series scheduling carry no such guarantee; it would corrupt the first key position of each 16-position tile — exactly the "plausible but wrong tokens" phenotype. Fixed unconditionally with the two-phase reduction (18.4) since the cost is one barrier per KV tile.
|
|
1058
|
+
|
|
1059
|
+
### 17.4 desktop-regression: conflating Apple-GPU hardware with the WebKit implementation
|
|
1060
|
+
|
|
1061
|
+
The Safari workaround gate was `isMetalBackend = vendor === "apple" || arch.startsWith("common")` — a *hardware* test. Dawn running on the M4 Max reports `vendor: "apple", arch: "metal-3"`, so node-dawn and Chrome-on-macOS took the full WebKit workaround path: per-dispatch multi-encoder submits plus the (already-disproven, yet still active) shader variant alternation that compiled ~850-900 per-node pipelines instead of ~25. Result: the recorded **161.8 → 19.7 tok/s collapse** on desktop, and every desktop performance number recorded between commit `2f0cabc` and the fix is poisoned. The fix is the UA-based `isWebKitWebGPU` predicate (18.5); post-fix desktop baseline is 145 tok/s.
|
|
1062
|
+
|
|
1063
|
+
### 17.5 Corrections to earlier claims in this document
|
|
1064
|
+
|
|
1065
|
+
The June 2026 findings falsify several statements this paper (and the engineering log) previously made. For the record:
|
|
1066
|
+
|
|
1067
|
+
- **"Metal provides no cross-dispatch coherence by design"**: wrong. The WWDC25 quote about command-buffer boundaries explains why CB boundaries are *expensive*, not why they are required for coherence. The WebGPU spec mandates within-submission visibility; Dawn-on-Metal delivers it; WebKit on iPadOS 26.5 delivers it. The older-WebKit zeros were a bug, not a design property to architect around.
- **Shader variant alternation** (commit `2f0cabc`): never necessary. Disproven by the project's own Test Q, yet it remained active and was a major contributor to the desktop regression. Deleted; do not reintroduce.
- **Per-dispatch submit without await (fire-and-forget)** (commit `38bc674`): an anti-pattern, not a fix. It queues ~400 unbounded in-flight Metal command buffers per token, the documented WebKit resource-exhaustion pattern, and never produced a recorded on-device result. Replaced by grouped submits with exactly one CB in flight (18.3).
- **"DRAIN_EVERY=1/5 OOMs prove roundtrips exhaust memory"**: wrong attribution; the buffer footprint (17.1) was the cause. Drained loops are memory-bounded by construction.
- **The two-phase writeBuffer pattern as the gibberish fix**: the pattern is harmless and retained, but stale params were refuted as the zero-logits mechanism (17.2); the recorded gibberish was zeros in disguise (17.3a).
- **Validity boundary**: the single-CB zeros behavior was real on the older iPadOS version where it was recorded and presumably remains real on pre-26.5 WebKit; statements about it are *historical, version-scoped facts*, not current behavior.
|
|
1073
|
+
|
|
1074
|
+
---
|
|
1075
|
+
|
|
1076
|
+
## 18. The Mobile Fix Campaign
|
|
1077
|
+
|
|
1078
|
+
All fixes landed 2026-06-12. Each is small; together they took the iPad from "never completed a recorded run" to byte-correct 35.9 tok/s.
|
|
1079
|
+
|
|
1080
|
+
### 18.1 Liveness-based activation pooling
|
|
1081
|
+
|
|
1082
|
+
`executor.ts allocateActivationBuffers()` replaces per-tensor allocation with a classic last-use liveness scheme over the topologically ordered graph:
|
|
1083
|
+
|
|
1084
|
+
1. **Liveness pass**: walk `graph.executionOrder` once, recording each activation tensor's first defining node and last reading node. Tensors read before they are ever written carry state across forward passes (or are written in place); they join a `persistent` set along with all `graph.outputs`, and are never pooled.
2. **Allocation pass**: walk the order again with a **size-keyed free pool** (`Map<byteSize, GPUBuffer[]>`). For each node, *acquire outputs first* (popping an exact-size buffer from the pool or creating one) so a node never writes the buffer of a tensor it also reads; then release every input/output whose last use is this node back into the pool.
3. Anything persistent or untouched by the execution order gets a dedicated buffer.
|
|
1087
|
+
|
|
1088
|
+
Reuse is safe because dispatches execute in `executionOrder` on every path (single-pass Dawn, grouped WebKit) and WebGPU synchronizes write-after-read hazards between dispatches. The executor logs the result: `431 tensors → 20 buffers (37 MB)` at T=256. Two standing obligations follow (also listed in Section 7's mobile invariants): fused kernels must not read+write one pooled buffer in a single dispatch, and `debugReadBuffer()` on an intermediate tensor is only meaningful before a later op reuses it.
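The three steps can be sketched on the CPU with integer ids standing in for `GPUBuffer` objects. This is a simplified illustration under assumptions, not the code in `executor.ts`: `OpNodeSketch`, `planActivationBuffers`, and the plan shape are hypothetical, and byte sizes are passed in rather than derived from tensor shapes.

```typescript
// Hypothetical, simplified node shape; the real OpNode carries more fields.
interface OpNodeSketch {
  inputs: string[];
  outputs: string[];
}

interface PoolingPlan {
  assignment: Map<string, number>; // tensor name -> physical buffer id
  bufferSizes: number[];           // byte size of each physical buffer
}

function planActivationBuffers(
  executionOrder: readonly OpNodeSketch[],
  tensorBytes: ReadonlyMap<string, number>, // activation tensors only
  graphOutputs: readonly string[],
): PoolingPlan {
  // Pass 1: liveness. Record each tensor's last use, and mark tensors that
  // are read before any node writes them as persistent.
  const lastUse = new Map<string, number>();
  const written = new Set<string>();
  const persistent = new Set<string>(graphOutputs);
  executionOrder.forEach((node, i) => {
    for (const t of node.inputs) {
      if (!tensorBytes.has(t)) continue; // weights and other non-activations
      if (!written.has(t)) persistent.add(t);
      lastUse.set(t, i);
    }
    for (const t of node.outputs) {
      written.add(t);
      lastUse.set(t, i);
    }
  });

  // Pass 2: allocation with a size-keyed free pool.
  const bufferSizes: number[] = [];
  const assignment = new Map<string, number>();
  const freePool = new Map<number, number[]>();
  const createBuffer = (bytes: number): number => bufferSizes.push(bytes) - 1;

  executionOrder.forEach((node, i) => {
    // Acquire outputs first, so a node never writes a buffer it also reads.
    for (const t of node.outputs) {
      const bytes = tensorBytes.get(t);
      if (bytes === undefined || persistent.has(t) || assignment.has(t)) continue;
      const free = freePool.get(bytes);
      const reused = free !== undefined ? free.pop() : undefined;
      assignment.set(t, reused !== undefined ? reused : createBuffer(bytes));
    }
    // Then release every pooled tensor whose last use is this node.
    for (const t of node.inputs.concat(node.outputs)) {
      const id = assignment.get(t);
      const bytes = tensorBytes.get(t);
      if (id === undefined || bytes === undefined) continue;
      if (persistent.has(t) || lastUse.get(t) !== i) continue;
      lastUse.delete(t); // guards against releasing the same tensor twice
      const free = freePool.get(bytes);
      if (free !== undefined) free.push(id);
      else freePool.set(bytes, [id]);
    }
  });

  // Pass 3: persistent or never-touched tensors get dedicated buffers.
  tensorBytes.forEach((bytes, t) => {
    if (!assignment.has(t)) assignment.set(t, createBuffer(bytes));
  });
  return { assignment, bufferSizes };
}

const plan = planActivationBuffers(
  [
    { inputs: ["x"], outputs: ["a"] },
    { inputs: ["a"], outputs: ["b"] },
    { inputs: ["b"], outputs: ["c"] },
    { inputs: ["c"], outputs: ["logits"] },
  ],
  new Map([["x", 1024], ["a", 1024], ["b", 1024], ["c", 1024], ["logits", 1024]]),
  ["logits"],
);
console.log(`${plan.assignment.size} tensors -> ${plan.bufferSizes.length} buffers`);
// 5 tensors -> 4 buffers
```

In this toy chain `a` and `c` share one buffer because `a` is dead by the time `c` is produced, while `x` (read before it is written) and `logits` (a graph output) keep dedicated buffers.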
|
|
1089
|
+
|
|
1090
|
+
### 18.2 SliceLastRow and the [1, vocab] logits buffer
|
|
1091
|
+
|
|
1092
|
+
Only the last position's logits are ever consumed (the sampler operates on one row). Previously both graph generators emitted `lm_head` over the full `[T, hidden]` normed activations into a `[T, vocab]` logits tensor — 508.6 MB at T=512 (512 × 248,320 × 4) of which 993 KB was read. Now a `SliceLastRow` op (`kernels/registry.ts`, a trivial 256-wide copy kernel with `{width, last_row_offset}` params) extracts `final_norm_out`'s last row into `final_norm_last [1, hidden]`, and `lm_head` runs with M=1 into `logits [1, vocab]`. This saves the 485 MiB buffer **and** removes the full-vocab matmul over all prefill rows — the dominant prefill compute — so it is a pure win on desktop too. Readback offsets collapse to 0 in `forward()`/`forwardArgmax()`.
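A CPU reference for what the `SliceLastRow` copy computes may help when checking a readback. It is a sketch under one labeled assumption: `last_row_offset` is taken to be an element offset, `(T - 1) * width`; the text above names the param fields but not their units, and the helper names are hypothetical.

```typescript
interface SliceLastRowParams {
  width: number;           // row length (the hidden size)
  last_row_offset: number; // assumed: element offset of the last row
}

function sliceLastRowParams(T: number, width: number): SliceLastRowParams {
  return { width, last_row_offset: (T - 1) * width };
}

// CPU reference: copies the last row of a row-major [T, width] tensor.
function sliceLastRowReference(input: Float32Array, p: SliceLastRowParams): Float32Array {
  return input.slice(p.last_row_offset, p.last_row_offset + p.width);
}

// A [3, 4] tensor holding 0..11; the last row is 8, 9, 10, 11.
const normed = Float32Array.from({ length: 12 }, (_, i) => i);
const last = sliceLastRowReference(normed, sliceLastRowParams(3, 4));
console.log(Array.from(last)); // [8, 9, 10, 11]
```

With T=1 (decode) the offset is 0 and the op degenerates to a plain copy, which is consistent with one graph serving both prefill and decode.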
|
|
1093
|
+
|
|
1094
|
+
### 18.3 Grouped-submit architecture (WebKit)
|
|
1095
|
+
|
|
1096
|
+
The WebKit execution path in `forward()`/`forwardArgmax()` is now a parameterized grouped loop: `webkitGroupSize` dispatches per command buffer (one compute pass per dispatch), `queue.submit()` then `await queue.onSubmittedWorkDone()` per group — **exactly one command buffer in flight at all times**. The logits copy (and in `forwardArgmax()`, the argmax dispatch and its 4-byte readback copy) go in separate final submits. This replaces the fire-and-forget anti-pattern of ~400 unbounded in-flight CBs per token. `group=1` is the proven-correct floor for older WebKit; the size is sweepable at runtime via the `?group=N` URL parameter, which produced the scaling curve in Section 19. On Dawn nothing changed: one encoder, one compute pass, one submit.
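
A minimal sketch of the grouped loop, assuming nothing about the executor beyond what the paragraph states. `QueueLike` is a structural stand-in for the two `GPUQueue` methods used (`submit` and `onSubmittedWorkDone`), and `encodeGroup` is a hypothetical callback that records one compute pass per dispatch into a single command buffer.

```typescript
// Structural stand-in for the parts of GPUQueue this loop needs.
interface QueueLike<CB> {
  submit(commandBuffers: CB[]): void;
  onSubmittedWorkDone(): Promise<void>;
}

// Submits `dispatches` in groups of `groupSize`, awaiting each group before
// encoding the next, so at most one command buffer is ever in flight.
// Returns the number of submits performed.
async function runGrouped<D, CB>(
  dispatches: D[],
  groupSize: number,
  queue: QueueLike<CB>,
  encodeGroup: (group: D[]) => CB, // hypothetical encoder callback
): Promise<number> {
  const size = Math.max(1, Math.floor(groupSize));
  let submits = 0;
  for (let start = 0; start < dispatches.length; start += size) {
    const commandBuffer = encodeGroup(dispatches.slice(start, start + size));
    queue.submit([commandBuffer]);
    await queue.onSubmittedWorkDone(); // drain before the next group
    submits++;
  }
  return submits;
}
```

With `groupSize = 1` this degenerates to one submit per dispatch, and with a group size at least as large as the dispatch count it becomes the batch-all configuration.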
|
|
1097
|
+
|
|
1098
|
+
### 18.4 Two-phase attention score reduction
|
|
1099
|
+
|
|
1100
|
+
In all three attention kernel variants (f32, native-f16, packed-f16), the leader-thread score reduction now accumulates into a local variable, executes a `workgroupBarrier()` in **uniform control flow** (the barrier cannot live inside the guarded block — non-uniform barriers are a WGSL validation error), and only then writes `smem[pos_in_tile]` in a second guarded block. Cost: one extra barrier per 16-position KV tile. This closes the race described in 17.3(b).
|
|
1101
|
+
|
|
1102
|
+
### 18.5 `isWebKitWebGPU` detection
|
|
1103
|
+
|
|
1104
|
+
Described in Section 6. The hardware-keyed `isMetalBackend` predicate is gone; the executor's `needsMultiEncoder` is set solely from the UA-based `isWebKitWebGPU`, and shader variant alternation is deleted outright (~850-900 pipelines → ~25 on the workaround path's worst case).
|
|
1105
|
+
|
|
1106
|
+
### 18.6 Error scopes and uncaptured-error surfacing
|
|
1107
|
+
|
|
1108
|
+
`device.onuncapturederror` is registered at init, and `WebGPUEngine.create()` brackets executor construction, weight upload, and bind-group creation with `pushErrorScope("out-of-memory")` + `pushErrorScope("validation")`, throwing a descriptive error (with a crash-phase breadcrumb in `localStorage`) if either scope pops an error. WebKit reports failed allocations asynchronously and otherwise *silently no-ops the affected dispatches*; before this change, every mobile observation was collected blind. Device limits (`maxBufferSize`, `maxStorageBufferBindingSize`, `maxComputeWorkgroupStorageSize`) are logged on WebKit at startup.
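
The bracketing pattern can be sketched as a small wrapper. `DeviceLike` mirrors only the `pushErrorScope`/`popErrorScope` pair of `GPUDevice`; the wrapper name `withErrorScopes` and the error message format are assumptions for illustration, and the `localStorage` breadcrumb is left out.

```typescript
type ScopeFilter = "out-of-memory" | "validation";
interface ErrorLike { message: string; }

// Structural stand-in for the GPUDevice error-scope methods.
interface DeviceLike {
  pushErrorScope(filter: ScopeFilter): void;
  popErrorScope(): Promise<ErrorLike | null>;
}

// Runs `work` inside an out-of-memory scope and a validation scope and
// throws a phase-tagged error if either scope captured something.
async function withErrorScopes<T>(
  device: DeviceLike,
  phase: string,
  work: () => T | Promise<T>,
): Promise<T> {
  device.pushErrorScope("out-of-memory");
  device.pushErrorScope("validation");
  // Scopes pop in LIFO order: validation first, then out-of-memory.
  const popBoth = async () => {
    const validation = await device.popErrorScope();
    const oom = await device.popErrorScope();
    return { validation, oom };
  };
  let result: T;
  try {
    result = await work();
  } catch (err) {
    await popBoth(); // keep the scope stack balanced before rethrowing
    throw err;
  }
  const { validation, oom } = await popBoth();
  if (oom !== null) throw new Error(`[${phase}] out-of-memory: ${oom.message}`);
  if (validation !== null) throw new Error(`[${phase}] validation: ${validation.message}`);
  return result;
}
```

A caller would wrap each init phase separately, for example `await withErrorScopes(device, "weight-upload", () => uploadWeights())`, where `uploadWeights` stands for whatever the phase does.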
|
|
1109
|
+
|
|
1110
|
+
### 18.7 WebKit Cache-API large-write skip and memory-policy fixes
|
|
1111
|
+
|
|
1112
|
+
- `model-loader.ts browserCacheWrite()`: the defensive `data.slice(0)` needed for `cache.put()` doubles a weight shard in memory at the worst moment; on WebKit, writes over 64 MB are skipped entirely (the HTTP cache still serves warm reloads).
|
|
1113
|
+
- iOS `maxSeqLen` policy (`gpu/index.ts`): WebKit defaults to 512 and is hard-clamped to 2048 (the old default of 4096 was an 18.9 GB request); `?kvf32=1` on Safari caps at 1024; desktop caps at 4096. `?maxseq=N` overrides within `context_length`.
|
|
1114
|
+
- The iPad diagnostic page's default model is now the MLX 4-bit repo, never the BF16 one.
|
|
1115
|
+
|
|
1116
|
+
---
|
|
1117
|
+
|
|
1118
|
+
## 19. Mobile Results & Submit-Granularity Scaling
|
|
1119
|
+
|
|
1120
|
+
All measurements 2026-06-12 on iPad, iPadOS 26.5, WebKit WebGPU via Chrome-iOS WKWebView over HTTPS. Model: Qwen3.5-0.8B MLX 4-bit, greedy (temperature 0). **Every configuration below produced output byte-identical to the desktop Dawn reference** — these are correctness results first, throughput results second. (Canonical numbers: `docs/research/tps-baselines.md`.)
|
|
1121
|
+
|
|
1122
|
+
Device limits (default adapter): `maxBufferSize` 256 MB, `maxStorageBufferBindingSize` 128 MB, `maxComputeWorkgroupStorageSize` 16384 bytes.
|
|
1123
|
+
|
|
1124
|
+
### Throughput vs. group size (24-token generations)
|
|
1125
|
+
|
|
1126
|
+
| `?group=N` (dispatches/CB, one CB in flight, awaited) | Decode tok/s |
|
|
1127
|
+
|---|---|
|
|
1128
|
+
| 1 | 6.4-7.7 |
|
|
1129
|
+
| 8 | 19.7 |
|
|
1130
|
+
| 32 | 24.6 |
|
|
1131
|
+
| 64 | 28.8 |
|
|
1132
|
+
| 128 | 29.7 |
|
|
1133
|
+
| batch-all (entire forward in one CB) | **31.7** |
|
|
1134
|
+
| sustained 120-token run, group=64, maxseq=512 | **35.9** |
|
|
1135
|
+
|
|
1136
|
+
**Interpretation**: the curve is round-trip-bound until ~32 dispatches/CB — each group costs a full submit + `onSubmittedWorkDone()` JS↔GPU round trip, so throughput scales nearly linearly with group size at first (1→8 is a 3× jump). Past 32-64 the round trips amortize away and the engine becomes **dispatch-bound**: 64→all is only 28.8→31.7, because ~440 individual compute passes per token dominate regardless of CB packaging. This is why dispatch-count reduction (kernel fusion) is worth 2-3× more on mobile than on desktop, and why the iPad goal path (50+ tok/s) runs through a single compute pass per CB plus fusion rather than through submit tuning.
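
One way to read the interpretation above is as a two-term cost model: every dispatch pays a fixed cost, and every group pays one extra submit-and-drain round trip. The model and its parameter values below are assumptions of this sketch, chosen only to show the shape of the curve; they were not fitted to, and do not reproduce, the measurements in the table.

```typescript
// Assumed model: time per token = D * perDispatch + ceil(D / g) * perRoundTrip.
function modelTokPerSec(
  dispatchesPerToken: number,
  groupSize: number,
  perDispatchSec: number,
  perRoundTripSec: number,
): number {
  const groups = Math.ceil(dispatchesPerToken / Math.max(1, groupSize));
  const secPerToken = dispatchesPerToken * perDispatchSec + groups * perRoundTripSec;
  return 1 / secPerToken;
}

// Illustrative placeholder parameters (not measured values).
const D = 440;              // dispatches per token, the order of magnitude cited above
const PER_DISPATCH = 70e-6; // seconds, placeholder
const ROUND_TRIP = 250e-6;  // seconds, placeholder

for (const g of [1, 8, 32, 64, 128, D]) {
  console.log(`group=${g}: ${modelTokPerSec(D, g, PER_DISPATCH, ROUND_TRIP).toFixed(1)} tok/s (model)`);
}
```

Under this model the round-trip term shrinks as `1/g` while the per-dispatch term stays fixed, which is why gains flatten once groups are large and why only a lower dispatch count moves the ceiling.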
|
|
1137
|
+
|
|
1138
|
+
Other results from the same session:
|
|
1139
|
+
|
|
1140
|
+
- **maxseq survival**: at group=1, both `maxseq=64` and `maxseq=512` complete — pre-fix, every configuration was jetsam-killed (17.1). The memory fix and the correctness fix are independently confirmed.
|
|
1141
|
+
- **KV A/B**: packed-f16 KV vs `?kvf32=1` both byte-correct, so the packed-f16 path is cleared on-device (it had never run against correct inputs before; see 17.3(a)).
|
|
1142
|
+
- **Within-submission visibility**: the historical zero-logits configuration (batch-all) is correct on iPadOS 26.5 — the decisive evidence for 17.2's key finding.
|
|
1143
|
+
- **Desktop post-fix baseline**: node-dawn on M4 Max at **145 tok/s** (143.9 / 144.1 / 147.1), confirming the 19.7 tok/s regression was entirely the detection predicate.
|
|
1144
|
+
|
|
1145
|
+
### Production submit strategy
|
|
1146
|
+
|
|
1147
|
+
The strategy is OS-gated, chosen at engine init:
|
|
1148
|
+
|
|
1149
|
+
- **iPadOS/iOS 26.5+**: batch-all (single CB per forward) — the upstream WebKit fix makes the fast path correct.
|
|
1150
|
+
- **Older 26.x WebKit**: run a **startup coherence probe** — a small batched dependent-chain dispatch whose result is compared against the same chain run per-dispatch. Probe passes → `group=64`; probe fails → `group=1` awaited (the proven-correct floor, 6.4-7.7 tok/s).
|
|
1151
|
+
- **Dawn (Chrome, node)**: single encoder, single compute pass, unchanged.
|
|
1152
|
+
|
|
1153
|
+
Until the probe ships, `webkitGroupSize` defaults to 1 (correctness floor) and `?group=N` overrides it.
|
|
1154
|
+
|
|
1155
|
+
### Update (2026-06-15): a third device class — batching *crashes* (iPhone)
|
|
1156
|
+
|
|
1157
|
+
A `?group=N` sweep on an **iPhone (iOS 18.7, WebKit/Safari 26.5)** revealed a failure mode distinct from both the iPad-26.5 class (batch-all correct) and the older-WebKit class (batch-all → zero logits): **any `group > 1` hard-crashes the GPU process** — the page dies ("A problem repeatedly occurred"), confirmed at `group=4`, `32`, and `4096`. `group=1` runs correctly at ~**4.7 tok/s** (the round-trip-bound floor predicted by the curve above). This is the single most important mobile fact: on this device the submit-granularity lever — the whole basis of the 6.4→31.7 tok/s scaling on iPad — is **unavailable**, so the *only* remaining levers are dispatch-count reduction (fusion) and memory reduction.
|
|
1158
|
+
|
|
1159
|
+
Two signs point at **memory**, not pure miscompilation:
|
|
1160
|
+
|
|
1161
|
+
- The crash is at the **first grouped compute (prefill)**, immediately after load, and scales with group size — consistent with the command buffers / intermediate state held in flight before the single `onSubmittedWorkDone()` drain exceeding the iPhone's (smaller-than-iPad) jetsam budget.
|
|
1162
|
+
- The same device **hard-crashes at model *load*** with the ~596 MB vision-enabled Qwen checkpoint but loads the ~404 MB text-only one. It is operating right at its memory ceiling, so the extra working set a batched submit needs is enough to push it over. (Action taken: the site's default chat engine now loads the text-only checkpoint; the vision tower is built only on the Vision tab.)
|
|
1163
|
+
|
|
1164
|
+
**Revised production strategy** — the startup probe must be *crash-surviving*, not merely output-comparing (the original design in the previous subsection cannot detect a class that kills the page before returning a result):
|
|
1165
|
+
|
|
1166
|
+
1. Persist the candidate group size to `localStorage` **before** running the probe (reuse the existing crash breadcrumb in `device-guards.ts` / `setDownloadPhase`).
|
|
1167
|
+
2. On reload, if the breadcrumb for a group size is still set, that size crashed → step down (batch-all → 64 → 8 → 1) and record the safe ceiling for the device.
|
|
1168
|
+
3. Promote a group size to "known-good" only after a probe both completes **and** returns byte-correct output.
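
The three steps above can be sketched as follows. Everything named here is hypothetical: the storage keys, the `KeyValueStore` interface (a stand-in for the `getItem`/`setItem`/`removeItem` subset of `localStorage`), the use of `0` to encode batch-all, and the `runProbe` callback that returns whether output was byte-correct.

```typescript
// Stand-in for the subset of window.localStorage used by the probe.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const LADDER = [0, 64, 8, 1];               // 0 encodes batch-all; order from the text
const PENDING_KEY = "probe.pendingGroup";   // breadcrumb: size currently under test
const CEILING_KEY = "probe.safeCeilingIdx"; // index into LADDER of the safe ceiling
const GOOD_KEY = "probe.knownGoodGroup";    // promoted, byte-correct size

// Step 2: a breadcrumb that survived a reload means that size killed the page.
function nextCandidate(store: KeyValueStore): number {
  let idx = Number(store.getItem(CEILING_KEY) ?? "0");
  const pending = store.getItem(PENDING_KEY);
  if (pending !== null) {
    const crashedIdx = LADDER.indexOf(Number(pending));
    idx = Math.min(Math.max(idx, crashedIdx + 1), LADDER.length - 1);
    store.setItem(CEILING_KEY, String(idx));
    store.removeItem(PENDING_KEY);
  }
  return LADDER[idx];
}

// Steps 1 and 3: persist before probing, promote only on byte-correct output.
async function probeOnce(
  store: KeyValueStore,
  runProbe: (group: number) => Promise<boolean>,
): Promise<{ group: number; promoted: boolean }> {
  const group = nextCandidate(store);
  store.setItem(PENDING_KEY, String(group)); // written BEFORE the risky work
  const byteCorrect = await runProbe(group); // never resolves if the page dies
  store.removeItem(PENDING_KEY);
  if (byteCorrect) {
    store.setItem(GOOD_KEY, String(group));
  } else {
    const lower = Math.min(LADDER.indexOf(group) + 1, LADDER.length - 1);
    store.setItem(CEILING_KEY, String(lower)); // wrong output: step down next time
  }
  return { group, promoted: byteCorrect };
}
```

The design point is ordering: because the breadcrumb is written before the probe runs, a page kill leaves evidence that the next load can act on, which an output-comparing probe alone cannot do.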
|
|
1169
|
+
|
|
1170
|
+
This self-calibrates across all three classes (correct-batched / wrong-batched / crash-batched) with no hardcoded device table, and survives the crash class because the breadcrumb outlives the page kill. **This — shipping the crash-surviving probe — is the decided path; the rabbit hole was hand-sweeping `?group` on a device that crashes the page on every batched value, which a self-calibrating probe is specifically designed to avoid.**
|
|
1171
|
+
|
|
1172
|
+
**Caching note (same session):** on iOS Safari in a plain tab the CacheStorage model cache is **evicted between visits** (`navigator.storage.persist()` is not granted outside an installed PWA — `model-loader.ts:requestPersistence` is best-effort), so every visit re-downloads the full model regardless of throughput. The durable fix is OPFS via a Worker (`createSyncAccessHandle` in chunks), tracked as a separate task; "add to Home Screen" (PWA) also earns persistent storage as an interim path.
|
|
1173
|
+
|
|
1174
|
+
---
|
|
1175
|
+
|
|
1176
|
+
## 20. Autoresearch: Profile-Guided Throughput Optimization (June 2026)
|
|
1177
|
+
|
|
1178
|
+
Once mobile correctness was secured, an autonomous optimization loop
|
|
1179
|
+
(`scripts/engine/optimize.mjs --mode=autoresearch`, driven by the model itself)
|
|
1180
|
+
took **desktop decode throughput from 145 → 207 tok/s** (M4 Max, node-dawn,
|
|
1181
|
+
Qwen3.5-0.8B INT4), cooled-stable, with every round verified byte-coherent. The
|
|
1182
|
+
loop's contract: read engine code → make one focused edit → build → benchmark
|
|
1183
|
+
twice (taking the lower of the two means, to defeat thermal lies) → keep only if >1.5–2% over the
|
|
1184
|
+
current best, else `git checkout` the touched files → log to `results.jsonl`.
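
The keep-or-revert rule in that contract reduces to a one-line comparison. A minimal sketch, with `shouldKeep` and its `minGain` default as illustrative names and values (the text gives the threshold as a 1.5–2% range):

```typescript
// Keep an edit only if the more pessimistic of two benchmark means beats the
// current best by at least `minGain` (fractional, e.g. 0.02 = 2%).
function shouldKeep(
  meanRunA: number,
  meanRunB: number,
  currentBest: number,
  minGain: number = 0.02,
): boolean {
  const conservative = Math.min(meanRunA, meanRunB);
  return conservative > currentBest * (1 + minGain);
}
```

Taking the minimum rather than the average means a single thermally boosted run cannot carry an edit over the threshold on its own.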
|
|
1185
|
+
|
|
1186
|
+
### The arc — and the change of angle
|
|
1187
|
+
|
|
1188
|
+
The interesting result is not the number but the *shape* of how it was reached:
|
|
1189
|
+
|
|
1190
|
+
| Round | tok/s | Kept | Edit |
|
|
1191
|
+
|---|---|---|---|
|
|
1192
|
+
| baseline | 145 | — | — |
|
|
1193
|
+
| r1 | 145.9 | ✗ | fuse down_proj + residual |
|
|
1194
|
+
| r2 | 152.8 | ✓ | vec4 INT4 weight loads (32 nibbles/iter) |
|
|
1195
|
+
| r3 | 158 | ✓ | vec4 loads in the SwiGLU matvec |
|
|
1196
|
+
| r4 | 158 | ✗ | subgroup-shuffle reduction |
|
|
1197
|
+
| **r5** | **188** | ✓ | **GPU-driven pipelined decode** |
|
|
1198
|
+
| **r6** | **213** | ✓ | **fuse MambaSSM (profiled #2 hotspot)** |
|
|
1199
|
+
| r7 | 210 | ✗ | SSM occupancy split |
|
|
1200
|
+
| **r8** | **222** | ✓ | **f16 SSM-state storage** |
|
|
1201
|
+
|
|
1202
|
+
Rounds 1–4 are a *sideways* plateau: local kernel-load micro-optimizations
|
|
1203
|
+
(145→158), half reverted. The breakthroughs (r5, r6, r8) came when the loop
|
|
1204
|
+
**stopped optimizing the kernel in front of it and changed its analysis**:
|
|
1205
|
+
|
|
1206
|
+
- **r5** is not a kernel tweak but a *control-flow* restructuring — keep the
|
|
1207
|
+
argmax result on the GPU and feed it back as the next token's input without a
|
|
1208
|
+
per-token CPU round-trip (158→188).
|
|
1209
|
+
- **r6/r8** are *profile-guided*: the loop identified the Mamba-2 SSM as the #2
|
|
1210
|
+
hotspot (≈1982 µs/token, 37% of decode), then attacked it structurally —
|
|
1211
|
+
fusing four per-head passes (188→213), then halving its dominant memory
|
|
1212
|
+
traffic by storing SSM state in f16 while keeping f32 compute (213→222).
|
|
1213
|
+
|
|
1214
|
+
This transition — from "make this matmul faster" to "profile the whole forward,
|
|
1215
|
+
find the real bottleneck, and rewrite its data flow" — is the qualitative jump
|
|
1216
|
+
that broke the plateau. It is also the clearest signal that the search was
|
|
1217
|
+
reasoning about angles it had not tried before, not enumerating variations of
|
|
1218
|
+
the same one.
|
|
1219
|
+
|
|
1220
|
+
### A safety lesson: reverts can clobber uncommitted work
|
|
1221
|
+
|
|
1222
|
+
One round's failed-experiment revert ran `git checkout -- executor.ts
|
|
1223
|
+
registry.ts`, which silently reset *uncommitted* mobile-fix edits in those files
|
|
1224
|
+
to the last commit, then re-layered only the optimizer's own changes — producing
|
|
1225
|
+
a tree that built (esbuild strips types) but failed `tsc` and would have lost the
|
|
1226
|
+
mobile work. The incident motivates two rules now documented for the loop: revert
|
|
1227
|
+
only the files a round actually touched, and commit known-good baselines before
|
|
1228
|
+
launching autonomous edit sessions. The kept kernel wins were recovered intact
|
|
1229
|
+
from the loop's own per-round `.bak` snapshots.
|
|
1230
|
+
|
|
1231
|
+
### Open throughput frontiers
|
|
1232
|
+
|
|
1233
|
+
The decode path is now bandwidth- and dispatch-bound rather than matmul-bound.
|
|
1234
|
+
Remaining candidates (tracked in `scripts/engine/backlog.md`): fused KV-append,
|
|
1235
|
+
RMSNorm/RoPE fusion, further dispatch-count reduction (which helps mobile 2–3×
|
|
1236
|
+
more than desktop, since mobile is round-trip-bound below ~32 dispatches/CB), and
|
|
1237
|
+
investigating reported large wins from fusing the linear-attention/SSM layers
|
|
1238
|
+
end-to-end.
|
|
1239
|
+
|
|
1240
|
+
#### The two-regime model (recalibrated, June 2026)
|
|
1241
|
+
|
|
1242
|
+
A profiling pass (env-gated timestamp-query profiler, `scripts/engine/test-profile-decode.mjs`)
|
|
1243
|
+
plus a 5-agent research synthesis (`docs/research/dispatch-reduction-hivemind.md`)
|
|
1244
|
+
resolved an apparent contradiction in this campaign. Decode has **two distinct
|
|
1245
|
+
overhead regimes**:
|
|
1246
|
+
|
|
1247
|
+
- **Mobile (Safari/WebKit):** each dispatch is its own command-buffer submit +
|
|
1248
|
+
`onSubmittedWorkDone()` drain, so per-dispatch cost is ~32–71 µs (Metal; arXiv
|
|
1249
|
+
2604.02344). At ~287 dispatches/token this regime **is** dispatch-count-bound —
|
|
1250
|
+
any dispatch cut helps, and the lever is fewer dispatches.
|
|
1251
|
+
- **Desktop (node-Dawn, the primary `test-benchmark.mjs` metric):** the whole
|
|
1252
|
+
decode step is **one** `beginComputePass` / **one** submit; measured per-dispatch
|
|
1253
|
+
overhead is ~2–3 µs. Decode is **not** dispatch-count-bound here. The metric
|
|
1254
|
+
moves only on **bytes-moved reductions over WIDE tensors.**
|
|
1255
|
+
|
|
1256
|
+
This is load-bearing and confirmed by the engine's own history: pure dispatch-count
|
|
1257
|
+
cuts were flat/reverted on Dawn (ResidualRMSNorm Add+Norm −23 dispatches = +0.25%;
|
|
1258
|
+
RMSNorm-into-MambaSSM −18 dispatches = +0.3%; software-prefetch −1.2%), while every
|
|
1259
|
+
kept Dawn win removed a **wide** global round-trip (MambaSSM single-pass state
|
|
1260
|
+
fusion ~+13%, f16 SSM state +4%, SiLU-into-conv +1.8%, matvec A-reuse +4.9%). True
|
|
1261
|
+
WebGPU megakernels are **not** viable (no portable inter-workgroup sync / forward
|
|
1262
|
+
progress); the achievable analog is aggressive per-block fusion. The ranked plan
|
|
1263
|
+
(epilogue/prologue Mamba megakernels that erase 2048/6144-wide round-trips) lives
|
|
1264
|
+
in `docs/research/dispatch-reduction-hivemind.md`.
|
|
1265
|
+
|
|
1266
|
+
**Measurement rule:** dispatch-count cuts must be validated on the **iPad runner**
|
|
1267
|
+
(mobile regime), not the desktop benchmark, or they read as "flat" and get reverted
|
|
1268
|
+
despite being real mobile wins.
|
|
1269
|
+
|
|
1270
|
+
---
|
|
1271
|
+
|
|
1272
|
+
## 21. Native Text Embeddings (June 2026)
|
|
1273
|
+
|
|
1274
|
+
The second modality to go native after text generation is **text embeddings**.
|
|
1275
|
+
The target model is **Qwen3-Embedding-0.6B**, and the key fact that makes it cheap
|
|
1276
|
+
is that it ships as architecture `Qwen3ForCausalLM` — the *same* graph generator
|
|
1277
|
+
the engine already runs for text. An embedding model is not a separate engine; it
|
|
1278
|
+
is the existing causal-LM forward pass with a different tail.
|
|
1279
|
+
|
|
1280
|
+
### What changes vs. text generation
|
|
1281
|
+
|
|
1282
|
+
The graph generator (`architectures/qwen2.ts`) takes an `embedding` flag. When
|
|
1283
|
+
set, it replaces the `SliceLastRow → lm_head → logits` tail with a two-op
|
|
1284
|
+
**pooling tail**:
|
|
1285
|
+
|
|
1286
|
+
1. **`SliceLastRow`** on `final_norm_out` — Qwen3-Embedding uses **last-token
|
|
1287
|
+
(EOS-position) pooling**: the pooled vector is the final hidden state at the
|
|
1288
|
+
last input position. This reuses the exact op the LM path already emits to
|
|
1289
|
+
restrict lm_head to the last row, so no new pooling kernel is needed.
|
|
1290
|
+
2. **`L2Norm`** — a new row-wise L2-normalization kernel (`WGSL_L2NORM`,
|
|
1291
|
+
`kernels/registry.ts`: one workgroup per row, `{rows, width}` params) that
|
|
1292
|
+
divides the pooled vector by its Euclidean norm, producing a unit-length
|
|
1293
|
+
embedding in tensor `embedding`.
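
For reference, the two-op pooling tail has a simple CPU equivalent. This is a sketch of the math only, not the WGSL kernels: the function names are illustrative, the activations are assumed to be a row-major `Float32Array`, and the `eps` guard against a zero vector is an assumption of this sketch.

```typescript
// Extracts the last row of a row-major [rows, width] matrix.
function sliceLastRow(x: Float32Array, rows: number, width: number): Float32Array {
  return x.slice((rows - 1) * width, rows * width);
}

// Divides a vector by its Euclidean norm, yielding a unit-length vector.
function l2Normalize(v: Float32Array, eps: number = 1e-12): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < v.length; i++) sumSquares += v[i] * v[i];
  const inverseNorm = 1 / Math.max(Math.sqrt(sumSquares), eps);
  return v.map((value) => value * inverseNorm);
}

// Last-token pooling followed by L2 normalization.
function poolEmbedding(finalNormOut: Float32Array, rows: number, width: number): Float32Array {
  return l2Normalize(sliceLastRow(finalNormOut, rows, width));
}
```

Because the output is unit length, cosine similarity between two embeddings is just their dot product.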
|
|
1294
|
+
|
|
1295
|
+
The lm_head matmul (and the entire ~248K-row vocab projection) is dropped
|
|
1296
|
+
entirely — an embedding forward pass is strictly cheaper than a generation step.
|
|
1297
|
+
`graph.outputs` becomes `["embedding"]`, and `WebGPUEngine` exposes the modality
|
|
1298
|
+
through `engine.embed(text, options)` (with `engine.isEmbedding` as the guard).
|
|
1299
|
+
|
|
1300
|
+
### The pooling-token subtlety
|
|
1301
|
+
|
|
1302
|
+
Last-token pooling reads the final position, so *which* token sits there matters.
|
|
1303
|
+
Qwen3-Embedding was trained to pool at the literal `<|endoftext|>` token
|
|
1304
|
+
(`eos_token_id`), **not** the chat `<|im_end|>` token the tokenizer reports as its
|
|
1305
|
+
generic `eos_token`. `engine.embed()` (`gpu/index.ts`) resolves the literal
|
|
1306
|
+
`<|endoftext|>` id and appends it if absent, truncating to the budget while
|
|
1307
|
+
keeping that pooling token last. Query embeddings additionally take the
|
|
1308
|
+
Qwen3-Embedding instruction prefix (`"Instruct: {task}\nQuery:{text}"`); document
|
|
1309
|
+
embeddings omit it.
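
A sketch of the input preparation described above, under stated assumptions: the function names are hypothetical, the pooling-token id is passed in rather than resolved from a tokenizer, and truncation is assumed to drop tokens from the end of the text (the section only states that the pooling token must stay last).

```typescript
// Query embeddings take the instruction prefix; document embeddings do not.
function formatEmbedText(text: string, task?: string): string {
  return task === undefined ? text : `Instruct: ${task}\nQuery:${text}`;
}

// Ensures the sequence ends with the pooling token and fits in `maxTokens`.
function withPoolingToken(
  tokenIds: number[],
  poolingTokenId: number,
  maxTokens: number,
): number[] {
  if (maxTokens < 1) throw new RangeError("maxTokens must be >= 1");
  const alreadyLast = tokenIds[tokenIds.length - 1] === poolingTokenId;
  const body = alreadyLast ? tokenIds.slice(0, -1) : tokenIds.slice();
  // Reserve the final slot for the pooling token, then truncate the body.
  return [...body.slice(0, maxTokens - 1), poolingTokenId];
}
```

Reserving the slot before truncating is what keeps last-token pooling meaningful on over-long inputs; truncating first and appending afterwards would exceed the budget by one token.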
|
|
1310
|
+
|
|
1311
|
+
### Validation
|
|
1312
|
+
|
|
1313
|
+
The native embeddings were validated against the reference: output **dim 1024**,
|
|
1314
|
+
**unit L2 norm** confirmed, and the semantic geometry is correct —
|
|
1315
|
+
**cos(similar) = 0.81 > cos(unrelated) = 0.56**. The path is non-autoregressive
|
|
1316
|
+
(prefill only), so it inherits the existing memory and submit machinery with no
|
|
1317
|
+
mobile-specific work.
|
|
1318
|
+
|
|
1319
|
+
---
|
|
1320
|
+
|
|
1321
|
+
## 22. Native Vision Encoder (Qwen3.5 ViT) (June 2026)
|
|
1322
|
+
|
|
1323
|
+
Qwen3.5 is **natively multimodal**: the same checkpoint that the engine already
|
|
1324
|
+
runs for text ships a **12-layer Vision Transformer** in its weights — a ~192 MB
|
|
1325
|
+
tower the engine had previously been dropping on load. This cycle implements that
|
|
1326
|
+
ViT natively and validates it **bit-exact against HuggingFace transformers 5.12**.
|
|
1327
|
+
|
|
1328
|
+
### Why this was tractable
|
|
1329
|
+
|
|
1330
|
+
The vision feasibility was re-assessed (Section 5 / `docs/PROJECT-STATE.md`) and
|
|
1331
|
+
found materially easier than an earlier strategy pass had concluded. That pass
|
|
1332
|
+
believed native vision was blocked behind a *single-threaded attention kernel
|
|
1333
|
+
rewrite* — but it had read `attention.wgsl`, a **dead reference file imported
|
|
1334
|
+
nowhere** (the live kernels are embedded strings in `registry.ts`). The live
|
|
1335
|
+
attention kernel was already the tiled, online-softmax, fully parallel
|
|
1336
|
+
flash-attention-style kernel of Section 7. Making it bidirectional for the ViT was
|
|
1337
|
+
**one uniform flag**, not a rewrite (below). With that correction, the encoder was
|
|
1338
|
+
a bounded amount of ordinary graph-generator work.
|
|
1339
|
+
|
|
1340
|
+
### Architecture
|
|
1341
|
+
|
|
1342
|
+
The ViT generator (`architectures/qwen3_5_vision.ts`) produces a non-autoregressive
|
|
1343
|
+
graph over a symbolic patch count `N`, validated line-for-line against
|
|
1344
|
+
`modeling_qwen3_5.py`:
|
|
1345
|
+
|
|
1346
|
+
- **Unfold-free patch embed.** Patches arrive **already flattened** to `[N, 1536]`
|
|
1347
|
+
(`patch_dim = in_channels · temporal_patch_size · patch_size² = 3·2·16²`), so the
|
|
1348
|
+
5-D Conv3d patch embedding collapses to a **plain `MatMul`** + `AddBias`. No
|
|
1349
|
+
`Conv3d`/`unfold` kernel is needed — the host image processor delivers patches in
|
|
1350
|
+
the right layout. A bilinear-interpolated learned position embedding is then
|
|
1351
|
+
added.
|
|
1352
|
+
- **12 pre-norm, bidirectional blocks.** Each block is `LayerNorm → fused-QKV
|
|
1353
|
+
MatMul+AddBias → SliceCols (split Q/K/V) → ApplyRotaryEmb(Q,K) → Attention
|
|
1354
|
+
(bidirectional) → proj MatMul+AddBias → residual → LayerNorm → fc1 MatMul+AddBias
|
|
1355
|
+
→ GELU(tanh) → fc2 MatMul+AddBias → residual`.
|
|
1356
|
+
- **Bidirectional attention via one flag.** The attention kernel computes
|
|
1357
|
+
`causal_limit = q_pos + position_offset + 1; S_eff = is_causal ? min(S, causal_limit) : S`.
|
|
1358
|
+
An `is_causal` uniform (default `1`) keeps text decoding causal and unchanged; the
|
|
1359
|
+
ViT passes `causal: false`, setting `is_causal = 0` so every patch attends to all
|
|
1360
|
+
patches. This is the entire non-causal change.
|
|
1361
|
+
- **2D rotary.** The ViT uses a 2D (row, col) rotary embedding. The cos/sin tables
|
|
1362
|
+
and the bilinear-interpolated position embeddings are functions of the image grid
|
|
1363
|
+
`(t, h, w)` **only** — not of weights or pixels — so they are precomputed on the
|
|
1364
|
+
host (`vision-preprocess.ts`, porting `get_vision_bilinear_indices_and_weights`,
|
|
1365
|
+
`get_vision_position_ids`, and `Qwen3_5VisionRotaryEmbedding`) and fed in as input
|
|
1366
|
+
activations. `ApplyRotaryEmb` then applies them with a `rotate_half`. This keeps
|
|
1367
|
+
the GPU graph to the weight-dependent math while staying byte-identical to HF.
|
|
1368
|
+
- **Two distinct GELUs.** The transformer blocks use `gelu_pytorch_tanh` (the
|
|
1369
|
+
`GELU` op); the merger uses the **exact erf** GELU (the new `GeluErf` op). Getting
|
|
1370
|
+
both exactly right was load-bearing for the bit-exact match.
|
|
1371
|
+
- **Spatial-merge-2 → 1024-dim tokens.** The merger's `LayerNorm` over the 768-dim
|
|
1372
|
+
block output is followed by a free row-major reshape `[N, 768] → [N/4, 3072]`
|
|
1373
|
+
(a spatial 2×2 merge — patches arrive pre-grouped into merge-blocks so no gather
|
|
1374
|
+
is needed), then `fc1 → GeluErf → fc2` to produce merged image tokens of dim
|
|
1375
|
+
**1024**, matching the LM hidden size, as `[N/4, 1024]`.
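
The shape arithmetic and the causal switch in the list above can be checked with a few lines of host-side code. This mirrors the formulas quoted in the bullets; the function names are illustrative and none of this is the WGSL kernel itself.

```typescript
// Number of key/value positions a query at `qPos` may attend to.
function effectiveKvLength(
  S: number,
  qPos: number,
  positionOffset: number,
  isCausal: boolean,
): number {
  const causalLimit = qPos + positionOffset + 1;
  return isCausal ? Math.min(S, causalLimit) : S;
}

// patch_dim = in_channels * temporal_patch_size * patch_size^2
function patchDim(inChannels: number, temporalPatchSize: number, patchSize: number): number {
  return inChannels * temporalPatchSize * patchSize * patchSize;
}

// Row-major reshape for a spatial merge of mergeSize x mergeSize patches.
function mergedShape(N: number, hidden: number, mergeSize: number): [number, number] {
  const patchesPerToken = mergeSize * mergeSize;
  if (N % patchesPerToken !== 0) {
    throw new RangeError("patch count must be a multiple of mergeSize^2");
  }
  return [N / patchesPerToken, hidden * patchesPerToken];
}

console.log(patchDim(3, 2, 16));               // 1536
console.log(mergedShape(64, 768, 2));          // [16, 3072]
console.log(effectiveKvLength(16, 3, 0, true)); // 4
```

With `isCausal` false, every query position sees all `S` keys, which is the whole difference between the text decoder and the ViT in this kernel.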
|
|
1376
|
+
|
|
1377
|
+
The encoder runs in its own `VisionExecutor` (`vision-executor.ts`) — kept
|
|
1378
|
+
separate from the autoregressive text `Executor` because it is a single prefill
|
|
1379
|
+
over `N` patches with no KV cache, one buffer per tensor, dispatched in
|
|
1380
|
+
`executionOrder`. It reuses the shared kernel registry and device helpers, so the
|
|
1381
|
+
kernel math is identical to the text path and inherits the same WebKit
|
|
1382
|
+
grouped-submit behavior.
|
|
1383
|
+
|
|
1384
|
+
### A real bug the validation caught
|
|
1385
|
+
|
|
1386
|
+
Achieving the bit-exact match surfaced a genuine kernel bug: **`WGSL_GELU`
|
|
1387
|
+
returned NaN for large arguments on Metal/Dawn**. The MLP can produce `|x| > 30`,
|
|
1388
|
+
and `x³` then pushes the tanh argument into the thousands, where Metal's fast-math
|
|
1389
|
+
`tanh`/`exp` returns NaN instead of saturating. The fix clamps the inner argument
|
|
1390
|
+
to a safe `±15` (`SQRT_2_OVER_PI · (x + GELU_COEFF·x³)`), which is numerically
|
|
1391
|
+
indistinguishable from the true GELU on the saturated tails. This is the same
|
|
1392
|
+
class of Metal fast-math hazard already documented for `exp()` in the attention
|
|
1393
|
+
kernel (Section 7).
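
The clamp can be illustrated numerically on the CPU. This sketch does not reproduce the NaN, since JavaScript's `Math.tanh` saturates correctly; it only shows that clamping the inner argument to ±15 leaves the result unchanged to within rounding. The value `0.044715` is the standard coefficient of the tanh GELU approximation and is assumed to be what `GELU_COEFF` denotes.

```typescript
const SQRT_2_OVER_PI = Math.sqrt(2 / Math.PI);
const GELU_COEFF = 0.044715; // standard tanh-approximation coefficient (assumed)
const INNER_CLAMP = 15;      // tanh(15) is 1 to within f32 precision

// tanh-approximated GELU with the inner argument clamped before tanh.
function geluTanhClamped(x: number): number {
  const inner = SQRT_2_OVER_PI * (x + GELU_COEFF * x * x * x);
  const clamped = Math.min(INNER_CLAMP, Math.max(-INNER_CLAMP, inner));
  return 0.5 * x * (1 + Math.tanh(clamped));
}

// Reference without the clamp, for comparison on a well-behaved tanh.
function geluTanhReference(x: number): number {
  const inner = SQRT_2_OVER_PI * (x + GELU_COEFF * x * x * x);
  return 0.5 * x * (1 + Math.tanh(inner));
}

for (const x of [-40, -3, 0, 1, 3, 40]) {
  console.log(x, geluTanhClamped(x), geluTanhReference(x));
}
```

On the saturated tails the function tends to `x` for large positive inputs and to `0` for large negative inputs, so a clamp at ±15 sits far inside the region where `tanh` has already reached ±1 in f32.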
|
|
1394
|
+
|
|
1395
|
+
### Validation and the public surface
|
|
1396
|
+
|
|
1397
|
+
Validated **bit-exact vs HF transformers 5.12**: **per-token cosine = 1.000000**,
|
|
1398
|
+
**max absolute error ~5e-6** (so "bit-exact" here means agreement to within f32 rounding, not identical bits). The encoder is exposed as
|
|
1399
|
+
`engine.encodeImage(patches, gridTHW)` → merged image tokens `[rows, 1024]`,
|
|
1400
|
+
behind `enableVision: true` at load (the engine snapshots the raw
|
|
1401
|
+
`visual.pos_embed.weight` table before `uploadWeights` consumes it, for the host
|
|
1402
|
+
bilinear interpolation).
|
|
1403
|
+
|
|
1404
|
+
### What is *not* done yet (phase 2)
|
|
1405
|
+
|
|
1406
|
+
`encodeImage()` returns image tokens; it does **not** splice them into a text
|
|
1407
|
+
sequence. The remaining LM-side integration — **M-RoPE** position assignment for
|
|
1408
|
+
interleaved image/text tokens, **token splicing** of image embeddings into the
|
|
1409
|
+
input stream at the image placeholder positions, and the host-side **image
|
|
1410
|
+
preprocessing** (pixels → ordered patches) — is the in-progress phase 2. The
|
|
1411
|
+
encoder being bit-exact means that integration is plumbing over a verified
|
|
1412
|
+
numerical core, not open research.
|
|
1413
|
+
|
|
1414
|
+
---
|
|
1415
|
+
|
|
1416
|
+
## 23. The Native-Only Architecture Decision
|
|
1417
|
+
|
|
1418
|
+
The earlier framing of this project as **"two lanes"** — a native WGSL engine for
|
|
1419
|
+
text plus a transformers.js/ONNX fallback lane for breadth — has been retired. The
|
|
1420
|
+
decided architecture (owner decision, see `docs/PROJECT-STATE.md`) is **one native
|
|
1421
|
+
WebGPU engine**:
|
|
1422
|
+
|
|
1423
|
+
- **Launch set = text + vision + embeddings, all native, in one engine.** Text
|
|
1424
|
+
generation, the Qwen3.5 ViT encoder (Section 22), and Qwen3-Embedding
|
|
1425
|
+
(Section 21) all run through the same IR, kernel registry, and device layer. One
|
|
1426
|
+
model per modality by default, expandable to other families via the
|
|
1427
|
+
add-model-family process (`docs/adding-a-model-family.md`).
|
|
1428
|
+
- **A permanent fallback lane is rejected** — it "assumes defeat to begin with."
|
|
1429
|
+
`transformers.js`/ONNX (`chrome-backend.ts`) is at most temporary dev scaffolding
|
|
1430
|
+
to keep desktop demos alive during the native build, and is being **removed, not
|
|
1431
|
+
kept** as a lane.
|
|
1432
|
+
- **Audio is deferred to native small models, not delegated to a second engine.**
|
|
1433
|
+
Launching without audio is acceptable; a permanent second runtime is not. The
|
|
1434
|
+
candidates are native: **OmniVoice** for TTS (a Qwen3 backbone + a codec decoder —
|
|
1435
|
+
i.e. mostly the text path the engine already runs, plus a decoder) and
|
|
1436
|
+
**Moonshine** for STT (a lean encoder-decoder with a raw-waveform Conv1d
|
|
1437
|
+
frontend, no log-mel Conv2d). A thin onnxruntime-web bridge remains only a
|
|
1438
|
+
break-glass option for a model with no extractable weights *and* no native
|
|
1439
|
+
alternative.
|
|
1440
|
+
|
|
1441
|
+
### Corrections to earlier "two-lane / blocked-vision" language
|
|
1442
|
+
|
|
1443
|
+
Two claims that appeared in earlier roadmap text are now corrected for the record:
|
|
1444
|
+
|
|
1445
|
+
- **"Native vision is blocked on a parallel attention kernel."** False — the
|
|
1446
|
+
attention kernel was **already parallel** (Section 7). The blocking claim came
|
|
1447
|
+
from reading `attention.wgsl`, a dead file imported nowhere; the live kernel is a
|
|
1448
|
+
tiled flash-attention-style kernel. Non-causal attention is a one-line uniform
|
|
1449
|
+
(`is_causal`), not a new kernel. **Vision is done at the encoder level**
|
|
1450
|
+
(Section 22), bit-exact.
|
|
1451
|
+
- **"tfjs [transformers.js] is a kept breadth lane."** It is not. The engine is native-only across
|
|
1452
|
+
all launch modalities; `chrome-backend.ts` is slated for deletion.
|
|
1453
|
+
|
|
1454
|
+
---
|
|
1455
|
+
|
|
1456
|
+
## 24. iOS Model-Caching Reality
|
|
1457
|
+
|
|
1458
|
+
A persistent finding this cycle, with consequences for how the engine ships on
|
|
1459
|
+
iOS: **durably caching a ~400 MB model in iOS/iPadOS Safari is not achievable from
|
|
1460
|
+
a plain browser tab.** The full investigation is in
|
|
1461
|
+
`docs/research/ios-safari-model-caching.md`; the load-bearing facts:
|
|
1462
|
+
|
|
1463
|
+
- **Persistence requires a PWA.** WebKit's only documented positive heuristic for
|
|
1464
|
+
granting `navigator.storage.persist()` is "opened as a Home-Screen Web App." A
|
|
1465
|
+
plain tab gets `false` (confirmed by on-device probe). Only a persistence grant
|
|
1466
|
+
excludes an origin from eviction — so durable model caching means shipping an
|
|
1467
|
+
Add-to-Home-Screen PWA.
|
|
1468
|
+
- **Eviction is quota-pressure-driven, not reload-driven.** Best-effort data
|
|
1469
|
+
survives reloads under normal conditions; what kills the cache on a near-full
|
|
1470
|
+
device is WebKit's **origin-wide LRU eviction under storage pressure**. The
|
|
1471
|
+
probed iPad had a ~1 GB origin quota (Safari 17+ sets it to ~60% of *free* disk,
|
|
1472
|
+
and the device was nearly full) with ~444 MB already consumed by foreign caches —
|
|
1473
|
+
a 400 MB write lands near the cap and gets evicted.
|
|
1474
|
+
- **Switching Cache API → IndexedDB does not help.** On iOS they share one
|
|
1475
|
+
best-effort origin pool, evicted together under the same quota and 7-day-ITP
|
|
1476
|
+
policy. The migration buys no durability.
|
|
1477
|
+
- **Main-thread OPFS is broken on iOS.** `createWritable()` on the main thread
|
|
1478
|
+
throws OOM on the test device; the only viable large-write path is a **Worker
|
|
1479
|
+
using OPFS `createSyncAccessHandle()`** in small chunks. This finding drove the
|
|
1480
|
+
removal of the engine's OPFS write path in favor of a Cache-API-only loader
|
|
1481
|
+
(`model-loader.ts`); a main-thread OPFS attempt left unclearable junk that filled
|
|
1482
|
+
the quota and evicted everything.
|
|
1483
|
+
|
|
1484
|
+
The practical consequence: as a plain tab the engine treats re-download as
|
|
1485
|
+
unavoidable and minimizes its cost (smaller/streamed model, HTTP cache, fast CDN);
|
|
1486
|
+
durable on-device caching is a PWA feature, gated behind the persistence grant,
|
|
1487
|
+
foreign-cache cleanup, and Worker-OPFS chunked writes.
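
The decision this section describes can be sketched as a small policy function. `StorageManagerLike` mirrors the `persisted()`, `persist()`, and `estimate()` methods of `navigator.storage`; the plan names, the `chooseCachePlan` function, and the headroom rule are assumptions of this sketch rather than the loader's actual logic.

```typescript
// Structural stand-in for the navigator.storage methods used here.
interface StorageManagerLike {
  persisted(): Promise<boolean>;
  persist(): Promise<boolean>;
  estimate(): Promise<{ quota?: number; usage?: number }>;
}

type CachePlan = "durable-cache" | "best-effort-cache" | "redownload";

// Picks a caching plan from the persistence grant and the remaining quota.
// `headroomFactor` (hypothetical) demands spare room beyond the model size,
// since a write that lands near the cap is the first thing evicted.
async function chooseCachePlan(
  storage: StorageManagerLike,
  modelBytes: number,
  headroomFactor: number = 1.5,
): Promise<CachePlan> {
  const granted = (await storage.persisted()) || (await storage.persist());
  const { quota = 0, usage = 0 } = await storage.estimate();
  const fits = quota - usage >= modelBytes * headroomFactor;
  if (!fits) return "redownload";
  return granted ? "durable-cache" : "best-effort-cache";
}
```

In a plain iOS tab `persist()` resolves to `false`, so under this sketch the best outcome is a best-effort cache; only an installed Home-Screen web app would reach the durable branch.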
|
|
1488
|
+
|
|
1489
|
+
---
|
|
1490
|
+
|
|
1491
|
+
## 25. EmbeddingGemma-300M: a Second Embedding Family, Validated On-Device (June 2026)
|
|
1492
|
+
|
|
1493
|
+
§21 shipped native embeddings by reusing the *text* graph — Qwen3-Embedding is
|
|
1494
|
+
`Qwen3ForCausalLM` with a last-token-pool + L2Norm tail. That model proved the
|
|
1495
|
+
modality but not the engine's **generality across embedding families**, and it had a
|
|
1496
|
+
fatal mobile problem: the only weights that existed were BF16 (~1.2 GB, which OOMs
|
|
1497
|
+
the iPad) or a broken MLX-DWQ convert. This cycle adds **EmbeddingGemma-300M**
|
|
1498
|
+
(`mlx-community/embeddinggemma-300m-4bit`, ~173 MB at MLX-4bit) — the first
|
|
1499
|
+
**non-Qwen** embedding model, a genuinely different architecture, **and the first
|
|
1500
|
+
embedding model confirmed running on iPad Safari on-device** (owner-confirmed).
|
|
1501
|
+
|
|
1502
|
+
### A real encoder, not a re-skinned causal LM
|
|
1503
|
+
|
|
1504
|
+
EmbeddingGemma is a **bidirectional Gemma3 encoder**, so unlike §21 it needed its own
|
|
1505
|
+
graph generator (`architectures/gemma3_encoder.ts`, `generateGemma3EncoderGraph`).
|
|
1506
|
+
The architecture, read line-for-line from `config.json` (the generator hardcodes
|
|
1507
|
+
nothing — block count, dims and head config all come from the config; the 24-block /
|
|
1508
|
+
768-hidden / 3072-intermediate shape below is that model's config):
|
|
1509
|
+
|
|
1510
|
+
- **24 pre-norm blocks, bidirectional.** The attention op is emitted with
|
|
1511
|
+
`causal: false` — the same one-flag `is_causal` mechanism the ViT introduced
|
|
1512
|
+
(§22), reused for an encoder. This is the second consumer of that flag and the
|
|
1513
|
+
payoff of having made non-causal a uniform flag rather than a separate kernel.
|
|
1514
|
+
- **GQA 3 q / 1 kv, head_dim 256.** `q_dim = num_heads·head_dim`,
|
|
1515
|
+
`kv_dim = num_kv_heads·head_dim`; for this model 3·256 = 768 query, 1·256 kv.
|
|
1516
|
+
- **Per-head q-norm / k-norm.** Each block emits `q_norm` and `k_norm` as `RMSNorm`
|
|
1517
|
+
over `head_dim` before RoPE — the Gemma3 per-head QK normalization.
|
|
1518
|
+
- **Dual-theta RoPE, selected per layer from `layer_types`.** Gemma3 interleaves
|
|
1519
|
+
sliding-window and full-attention layers, and they use *different* rotary bases.
|
|
1520
|
+
The generator reads `layer_types[i]`: a `"full_attention"` layer takes
|
|
1521
|
+
`rope_theta` (1e6); every other (local/sliding) layer takes `rope_local_base_freq`
|
|
1522
|
+
(10000). The full head_dim is rotated. (This is a per-layer `layer_types` lookup,
|
|
1523
|
+
not a fixed "every 6th" rule — the config happens to make every 6th layer global,
|
|
1524
|
+
but the code keys off the type string.)
|
|
1525
|
+
- **GeGLU MLP.** `down(gelu_tanh(gate) · up)` — separate gate/up projections, a
|
|
1526
|
+
`GELU` (tanh-approx) op on the gate, a `Mul`, then the down projection. Not the
|
|
1527
|
+
SiLU-based fused SwiGLU of the Qwen text path.
|
|
1528
|
+
- **Gemma's four-norm "sandwich."** Each block carries four norms, not two:
|
|
1529
|
+
`input_layernorm` (pre-attn), `post_attention_layernorm`, `pre_feedforward_layernorm`,
|
|
1530
|
+
`post_feedforward_layernorm`. The two *post* norms are applied to the sublayer
|
|
1531
|
+
output **before** the residual add — Gemma's distinguishing sandwich layout.
|
|
1532
|
+
- **Embedding scale ×√768.** The token embeddings are multiplied by
|
|
1533
|
+
`√hidden_size` via a new `Scale` op before the blocks.
|
|
1534
|
+
- **Tail: MeanPool → Dense0 (768→3072) → Dense1 (3072→768) → L2Norm.** Unlike
|
|
1535
|
+
Qwen3-Embedding's *last-token* EOS pool, EmbeddingGemma uses **mean pooling** over
|
|
1536
|
+
all tokens (a new `MeanPool` kernel), followed by the model's two learned Dense
|
|
1537
|
+
projection heads and a final L2 normalization to a unit-length 768-dim vector.
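The config-driven derivation described in the list above can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: the `EncoderConfig` type and both helper names are hypothetical, the key names follow the ones quoted in the text, and the `"sliding_attention"` string and the six-entry `layer_types` list are assumptions made for the example.

```typescript
// Hypothetical subset of config.json: only the keys named in the text above.
interface EncoderConfig {
  num_heads: number;
  num_kv_heads: number;
  head_dim: number;
  rope_theta: number; // base for "full_attention" layers
  rope_local_base_freq: number; // base for every other (local/sliding) layer
  layer_types: string[];
}

// q_dim = num_heads * head_dim, kv_dim = num_kv_heads * head_dim.
function attentionDims(cfg: EncoderConfig): { qDim: number; kvDim: number } {
  return {
    qDim: cfg.num_heads * cfg.head_dim,
    kvDim: cfg.num_kv_heads * cfg.head_dim,
  };
}

// The lookup keys off the type string, never off the layer index.
function ropeBaseForLayer(cfg: EncoderConfig, layer: number): number {
  return cfg.layer_types[layer] === "full_attention"
    ? cfg.rope_theta
    : cfg.rope_local_base_freq;
}

// Numbers quoted in the text; layer_types is shortened to six entries for illustration.
const exampleEncoderConfig: EncoderConfig = {
  num_heads: 3,
  num_kv_heads: 1,
  head_dim: 256,
  rope_theta: 1e6,
  rope_local_base_freq: 10000,
  layer_types: [
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
  ],
};
```

Because the base is chosen from the type string, a config with a different global/local interleaving would need no code change in a helper shaped like this.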
|
|
1538
|
+
|
|
1539
|
+
### Two new kernels — and that is all
|
|
1540
|
+
|
|
1541
|
+
The only genuinely new GPU kernels this model required are **`MeanPool`** and
|
|
1542
|
+
**`Scale`** (`kernels/registry.ts`):
|
|
1543
|
+
|
|
1544
|
+
- **`MeanPool`** — `output[c] = (1/T)·Σ_t src[t·width + c]`: the column-mean of a
|
|
1545
|
+
`[T, width]` activation into `[1, width]`. One thread per output channel,
|
|
1546
|
+
workgroup_size 256, params `{ seq_len, width }`.
|
|
1547
|
+
- **`Scale`** — `output[i] = input[i]·scale`, the embedding normalizer. Params
|
|
1548
|
+
`{ count, scale_bits }` (the f32 scale passed as a bit pattern and `bitcast` back).
|
|
1549
|
+
|
|
1550
|
+
Everything else — RMSNorm, the bidirectional attention, RoPE, GELU, Mul, the INT4
|
|
1551
|
+
matmul, L2Norm — was already in the registry. A whole new embedding family cost two
|
|
1552
|
+
small reduction/elementwise kernels.
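For readers who want a CPU-side picture of the two kernels, here is a minimal reference sketch. It mirrors the formulas quoted above but is not the WGSL source; the function names are hypothetical, and the bit-pattern helpers only illustrate how an f32 can travel through a `u32` parameter slot and be recovered.

```typescript
// CPU reference for MeanPool: column mean of a row-major [seqLen, width] activation.
function meanPoolRef(src: Float32Array, seqLen: number, width: number): Float32Array {
  const out = new Float32Array(width);
  for (let c = 0; c < width; c++) {
    let sum = 0;
    for (let t = 0; t < seqLen; t++) sum += src[t * width + c];
    out[c] = sum / seqLen;
  }
  return out;
}

// f32 -> u32 bit pattern, the form in which the scale is passed as a parameter.
function f32ToBits(x: number): number {
  const f32 = new Float32Array(1);
  const u32 = new Uint32Array(f32.buffer);
  f32[0] = x;
  return u32[0];
}

// u32 bit pattern -> f32, the CPU analogue of a shader-side bitcast.
function bitsToF32(bits: number): number {
  const u32 = new Uint32Array(1);
  const f32 = new Float32Array(u32.buffer);
  u32[0] = bits;
  return f32[0];
}

// CPU reference for Scale: output[i] = input[i] * scale.
function scaleRef(input: Float32Array, scaleBits: number): Float32Array {
  const s = bitsToF32(scaleBits);
  const out = new Float32Array(input.length);
  for (let i = 0; i < input.length; i++) out[i] = input[i] * s;
  return out;
}
```

For the embedding scale described earlier, the caller would pass `f32ToBits(Math.sqrt(768))`.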
|
|
1553
|
+
|
|
1554
|
+
### The (1+weight) norm absorption — baked by the loader, for MLX too
|
|
1555
|
+
|
|
1556
|
+
Gemma's RMSNorm is `(1 + weight)·normalized`, not `weight·normalized`. Rather than
|
|
1557
|
+
fork the kernel, the **loader bakes the +1 into every Gemma norm weight** at load
|
|
1558
|
+
(`model-loader.ts`), so the standard RMSNorm kernel stays correct. The subtle part:
|
|
1559
|
+
this runs **even for MLX-4bit** Gemma. mlx-lm pre-absorbs the +1 for Qwen3.5 but
|
|
1560
|
+
**does not** for Gemma — so the Gemma branch deliberately omits the `&& !isMLX`
|
|
1561
|
+
guard the Qwen branch carries, and adds +1 to `input/post_attention/pre_feedforward/
|
|
1562
|
+
post_feedforward_layernorm`, `q_norm`, `k_norm`, and the final `norm`.
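The equivalence the loader relies on can be checked with a small sketch. The helper names and the `eps` default are assumptions for illustration; the point is only that a standard RMSNorm fed `weight + 1` reproduces Gemma's `(1 + weight)·normalized` form.

```typescript
// Standard RMSNorm: weight * x / rms(x).
function rmsNormRef(x: Float32Array, weight: Float32Array, eps = 1e-6): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < x.length; i++) sumSq += x[i] * x[i];
  const inv = 1 / Math.sqrt(sumSq / x.length + eps);
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) out[i] = x[i] * inv * weight[i];
  return out;
}

// Gemma form: (1 + weight) * x / rms(x).
function gemmaRmsNormRef(x: Float32Array, weight: Float32Array, eps = 1e-6): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < x.length; i++) sumSq += x[i] * x[i];
  const inv = 1 / Math.sqrt(sumSq / x.length + eps);
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) out[i] = x[i] * inv * (1 + weight[i]);
  return out;
}

// Load-time bake: add 1 to every element of a Gemma norm weight, once.
function bakePlusOne(weight: Float32Array): Float32Array {
  return weight.map((w) => w + 1);
}
```

After the bake, `rmsNormRef(x, bakePlusOne(w))` and `gemmaRmsNormRef(x, w)` agree to within f32 rounding, which is why the kernel itself does not need a Gemma variant.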
|
|
1563
|
+
|
|
1564
|
+
### Validation: bit-faithful, then semantic
|
|
1565
|
+
|
|
1566
|
+
Correctness was pinned against an **independent NumPy reference**
|
|
1567
|
+
(`scripts/engine/test-embedding-gemma-reference.py`) that re-implements the MLX
|
|
1568
|
+
affine dequant (`scale·nibble + bias`, group_size 64), the `(1+w)` Gemma norm, the
|
|
1569
|
+
`query_pre_attn_scalar^-0.5` attention scale, GQA repeat, dual-theta RoPE and the two
|
|
1570
|
+
Dense heads from raw safetensors. The reference asserts engine-vs-NumPy
|
|
1571
|
+
`cosine > 0.95` per probe; the measured result was **cos = 1.00000** (the commit's
|
|
1572
|
+
own headline). On top of that, a semantic test (`test-embedding-gemma.mjs`) confirms
|
|
1573
|
+
the geometry is *useful*, not merely reproducible: a "Red Planet" query embeds
|
|
1574
|
+
**closer to two Mars documents than to an unrelated sourdough-bread document by a
|
|
1575
|
+
>0.1 cosine margin**, all vectors are unit-norm at dim 768, and none are NaN or
|
|
1576
|
+
degenerate. The path is prefill-only (non-autoregressive), so it inherits the
|
|
1577
|
+
existing memory/submit machinery with no mobile-specific work — which is why, once
|
|
1578
|
+
the size dropped from 1.2 GB to 173 MB, **it simply ran on the iPad**.
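The affine dequant and the cosine gate mentioned above can be sketched as below. This is a simplified illustration for a flat run of weights, not the reference script: the function names are hypothetical, and the packing order (eight 4-bit values per `u32`, lowest nibble first) is an assumption.

```typescript
// Affine 4-bit dequant: value = scale * nibble + bias, one (scale, bias) per group.
// Assumed packing: eight nibbles per u32, lowest nibble first.
function dequantAffine4(
  packed: Uint32Array,
  scales: Float32Array,
  biases: Float32Array,
  groupSize = 64,
): Float32Array {
  const n = packed.length * 8;
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    const word = packed[i >>> 3];
    const nibble = (word >>> ((i & 7) * 4)) & 0xf;
    const g = Math.floor(i / groupSize);
    out[i] = scales[g] * nibble + biases[g];
  }
  return out;
}

// Cosine similarity, the metric used for the engine-vs-reference gate.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / Math.sqrt(na * nb);
}
```

A real weight matrix would apply this per row with groups running along the input dimension; the flat form is enough to show the arithmetic.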
|
|
1579
|
+
|
|
1580
|
+
### Why this one matters
|
|
1581
|
+
|
|
1582
|
+
Qwen3-Embedding proved the *tail*. EmbeddingGemma proves the *engine*: a different
|
|
1583
|
+
vendor, a different block structure (four-norm sandwich, dual-theta RoPE, per-head
|
|
1584
|
+
QK-norm, mean-pool, Dense heads), validated bit-faithful and then run on-device on
|
|
1585
|
+
the hardware that previously crashed. It is the difference between "we support an
|
|
1586
|
+
embedding model" and "the engine generalizes across embedding families." The
|
|
1587
|
+
abandoned 1.2 GB Qwen3-Embedding-on-iPad and the MLX-DWQ-garbage trap (§27) are the
|
|
1588
|
+
two dead ends this model routes around.
|
|
1589
|
+
|
|
1590
|
+
---
|
|
1591
|
+
|
|
1592
|
+
## 26. The SentencePiece Tokenizer Fix (a Cross-Family Lesson)
|
|
1593
|
+
|
|
1594
|
+
EmbeddingGemma surfaced a tokenizer bug that had been **silently destroying
|
|
1595
|
+
semantics**, and the fix is a reusable lesson for any future non-GPT-lineage model.
|
|
1596
|
+
|
|
1597
|
+
Gemma's `tokenizer.json` declares `"type": "BPE"` — but it is **SentencePiece-flavored
|
|
1598
|
+
BPE**, not the byte-level (GPT-2 / `Ġ`) BPE the engine was built for. The two
|
|
1599
|
+
disagree on nearly everything that matters:
|
|
1600
|
+
|
|
1601
|
+
| Aspect | Byte-level BPE (Qwen, LFM2) | SentencePiece BPE (Gemma) |
|
|
1602
|
+
|---|---|---|
|
|
1603
|
+
| Space marker | `Ġ` (U+0120) | `▁` (U+2581) |
|
|
1604
|
+
| Token bytes | byte-to-unicode remapped | raw UTF-8 |
|
|
1605
|
+
| `merges` form | `"a b"` strings | `["a","b"]` arrays |
|
|
1606
|
+
| Unknown bytes | — | `<0xHH>` byte-fallback |
|
|
1607
|
+
|
|
1608
|
+
Feeding a SentencePiece vocab through the byte-level path **char-split every word**:
|
|
1609
|
+
the pre-tokenizer's byte-to-unicode remap turned the raw-UTF-8 ▁-prefixed Gemma
|
|
1610
|
+
tokens into unmatchable garbage, the merges never fired, and the embeddings that came
|
|
1611
|
+
out were numerically "valid" (unit norm, no NaN) but **semantically meaningless** —
|
|
1612
|
+
exactly the kind of failure that passes a smoke test and fails a margin test.
|
|
1613
|
+
|
|
1614
|
+
The fix (`tokenizer.ts`) **auto-detects SPM mode structurally**, with no model-name
|
|
1615
|
+
list. A `spmMode` flag is set true when either: the `normalizer` is (or contains, in
|
|
1616
|
+
a `Sequence`) a `Replace` node mapping `" " → "▁"`; or the model has
|
|
1617
|
+
`byte_fallback: true` **and** the vocab literally contains the ▁-prefixed token
|
|
1618
|
+
`"▁the"`. In SPM mode the tokenizer:
|
|
1619
|
+
|
|
1620
|
+
- **encodes** by replacing spaces with ▁ and splitting on ▁ (keeping it attached to
|
|
1621
|
+
the following piece) — *no* byte-to-unicode remap, BPE runs on raw UTF-8;
|
|
1622
|
+
- **decodes** by fusing runs of consecutive `<0xHH>` byte-fallback tokens into a
|
|
1623
|
+
single UTF-8 decode (so a multi-byte codepoint split across byte tokens
|
|
1624
|
+
reassembles) and finally mapping ▁ → space;
|
|
1625
|
+
- **parses merges** from both array and string form into one space-joined key.
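The three SPM-mode behaviours listed above can be sketched as follows. This is a minimal illustration rather than the code in `tokenizer.ts`: all function names are hypothetical, the BPE merge loop itself is omitted, and `TextDecoder` is assumed to be available as a global (as it is in browsers and Node).

```typescript
const SPM_SPACE = "\u2581"; // the SentencePiece space marker

// Encode side: spaces become the marker, and each marker stays attached to the
// piece that follows it. No byte-to-unicode remap is applied.
function spmPreTokenize(text: string): string[] {
  const replaced = text.split(" ").join(SPM_SPACE);
  return replaced.split(/(?=\u2581)/).filter((piece) => piece.length > 0);
}

// Decode side: runs of consecutive <0xHH> tokens are fused into one UTF-8 decode,
// then the marker is mapped back to a space.
function spmDecode(pieces: string[]): string {
  const decoder = new TextDecoder("utf-8");
  let out = "";
  let pending: number[] = [];
  const flush = (): void => {
    if (pending.length > 0) {
      out += decoder.decode(new Uint8Array(pending));
      pending = [];
    }
  };
  for (const piece of pieces) {
    const m = /^<0x([0-9A-Fa-f]{2})>$/.exec(piece);
    if (m) {
      pending.push(parseInt(m[1], 16));
    } else {
      flush();
      out += piece;
    }
  }
  flush();
  return out.split(SPM_SPACE).join(" ");
}

// Merges: both ["a","b"] arrays and "a b" strings normalise to one space-joined key.
function mergeKey(merge: string | string[]): string {
  return Array.isArray(merge) ? merge.join(" ") : merge;
}
```

Fusing the byte run before decoding matters because a single codepoint such as `é` arrives as two byte-fallback tokens, and decoding them one at a time would yield replacement characters.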
|
|
1626
|
+
|
|
1627
|
+
Crucially the detection is *structural*, so **Qwen and LFM2 fall through to the
|
|
1628
|
+
byte-level path unchanged** — they have no `" "→"▁"` normalizer and are not
|
|
1629
|
+
byte_fallback SPM vocabs, so `spmMode` is false and the `Ġ` machinery is untouched.
|
|
1630
|
+
The lesson, recorded for the add-model-family process: **`type: "BPE"` in an HF
|
|
1631
|
+
tokenizer is not a guarantee of byte-level BPE.** Any SentencePiece-lineage family
|
|
1632
|
+
(Gemma, Llama, Mistral) needs the SPM path, and the engine now picks it automatically.
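The structural detection described in this section could look roughly like the sketch below. It is an illustration under assumptions, not the shipped logic: the type and function names are hypothetical, and the JSON layout assumed for a `Replace` normalizer (`pattern.String` and `content`) and for a `Sequence` (`normalizers`) reflects the usual shape of a `tokenizer.json`, which should be checked against the actual file.

```typescript
// Assumed shape of the relevant parts of tokenizer.json.
interface NormalizerNode {
  type: string;
  pattern?: { String?: string };
  content?: string;
  normalizers?: NormalizerNode[];
}

interface TokenizerJson {
  normalizer?: NormalizerNode | null;
  model: { byte_fallback?: boolean; vocab: Record<string, number> };
}

// True for a Replace node that maps a plain space to the SPM marker.
function isSpaceToMarker(node: NormalizerNode): boolean {
  return node.type === "Replace" && node.pattern?.String === " " && node.content === "\u2581";
}

// Either signal is sufficient; no model-name list is consulted.
function detectSpmMode(tok: TokenizerJson): boolean {
  const n = tok.normalizer;
  const viaNormalizer =
    !!n &&
    (isSpaceToMarker(n) ||
      (n.type === "Sequence" && (n.normalizers ?? []).some(isSpaceToMarker)));
  const viaVocab =
    tok.model.byte_fallback === true &&
    Object.prototype.hasOwnProperty.call(tok.model.vocab, "\u2581the");
  return viaNormalizer || viaVocab;
}
```

A byte-level tokenizer has neither the space-to-marker normalizer nor a byte-fallback vocab, so a check of this shape returns `false` for it and the existing path is left alone.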
|
|
1633
|
+
|
|
1634
|
+
---
|
|
1635
|
+
|
|
1636
|
+
## 27. MLX-4bit Loading: Broadened Detection, and the DWQ Trap
|
|
1637
|
+
|
|
1638
|
+
Shipping an `mlx-community` checkpoint forced the loader to get precise about what
|
|
1639
|
+
"MLX-4bit" means (`model-loader.ts`). Three changes:
|
|
1640
|
+
|
|
1641
|
+
1. **Detection broadened to mode-less configs.** Standard mlx-lm converts often emit
|
|
1642
|
+
`{ bits: 4, group_size: N }` with **no `mode` field**, where earlier code only
|
|
1643
|
+
recognized an explicit `mode: "affine"`. The loader now treats a config as
|
|
1644
|
+
MLX-shaped when `bits === 4` and either `mode === "affine"` **or**
|
|
1645
|
+
(`mode` is absent **and** `group_size` is a number).
|
|
1646
|
+
|
|
1647
|
+
2. **A DWQ reject, because DWQ is config-indistinguishable.** Distillation-quantized
|
|
1648
|
+
(DWQ) MLX repos carry the *same* `{bits:4, group_size}` config as a standard
|
|
1649
|
+
affine convert but pack weights the engine's dequant can't read — they produce
|
|
1650
|
+
garbage, not an error. Since the config can't tell them apart, the loader rejects
|
|
1651
|
+
by **repo name**: any repo whose lowercased id contains `"dwq"` can never be
|
|
1652
|
+
treated as a verified MLX repo.
|
|
1653
|
+
|
|
1654
|
+
3. **A `VERIFIED_MLX_REPOS` allowlist.** A mode-less config only loads if its repo is
|
|
1655
|
+
on a hardcoded allowlist of repos confirmed to be standard affine MLX-4bit (case-
|
|
1656
|
+
insensitive substring match, currently exactly
|
|
1657
|
+
`["mlx-community/embeddinggemma-300m-4bit"]`). The full gate is
|
|
1658
|
+
`isMLX = hasMlxShape && (mode === "affine" || isVerifiedMlxRepo)` — so an explicit
|
|
1659
|
+
`affine` config loads anywhere, but a bare `{bits:4, group_size}` only loads from a
|
|
1660
|
+
vetted repo, and never from a DWQ one. This is the codified form of the
|
|
1661
|
+
"MLX-DWQ-garbage trap" that wasted a cycle on the Qwen embedding model.
|
|
1662
|
+
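The three rules above combine into a single gate, sketched here. The `QuantConfig` type and the function names are hypothetical; the allowlist entry and the `isMLX` expression are the ones quoted in the text, and the helper is a reading of that description rather than a copy of `model-loader.ts`.

```typescript
// Hypothetical shape of the quantization block in a model config.
interface QuantConfig {
  bits?: number;
  group_size?: number;
  mode?: string;
}

const VERIFIED_MLX_REPOS = ["mlx-community/embeddinggemma-300m-4bit"];

// Case-insensitive substring match, with any "dwq" repo rejected up front.
function isVerifiedMlxRepo(repoId: string): boolean {
  const id = repoId.toLowerCase();
  if (id.includes("dwq")) return false;
  return VERIFIED_MLX_REPOS.some((repo) => id.includes(repo.toLowerCase()));
}

// isMLX = hasMlxShape && (mode === "affine" || isVerifiedMlxRepo)
function isMlx4bit(q: QuantConfig, repoId: string): boolean {
  const hasMlxShape =
    q.bits === 4 &&
    (q.mode === "affine" || (q.mode === undefined && typeof q.group_size === "number"));
  return hasMlxShape && (q.mode === "affine" || isVerifiedMlxRepo(repoId));
}
```

The asymmetry is deliberate in this reading: an explicit `mode: "affine"` is trusted on its own, while a mode-less config has to earn trust through the allowlist.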
|
|
1663
|
+
---
|
|
1664
|
+
|
|
1665
|
+
## 28. The Progress-Reporting Fix: Killing the "Stuck at 10%" Freeze
|
|
1666
|
+
|
|
1667
|
+
A long-standing, every-model UX bug was fixed this cycle (commit `682a09b`,
|
|
1668
|
+
`model-loader.ts`): the download bar **froze at "10% — discovering weight files"**
|
|
1669
|
+
for several seconds on every load, looking like a hang. The cause was a *dead zone*
|
|
1670
|
+
of latency-heavy network round-trips between the "discovering" progress emit and the
|
|
1671
|
+
first emit that the download code produced — during which nothing was reported:
|
|
1672
|
+
|
|
1673
|
+
- the **index probe** (`fetchJSON("model.safetensors.index.json")`),
|
|
1674
|
+
- **two header range-requests** per shard (`fetchRange(url, 0, 8)` to read the header
|
|
1675
|
+
length, then a second range to read the full header), and
|
|
1676
|
+
- the **first-byte latency** of the first data Range request.
|
|
1677
|
+
|
|
1678
|
+
The fix brackets that dead zone with two descriptive emits *before* the first chunk:
|
|
1679
|
+
a `"Reading {filename} header (i/N)…"` message right before the header fetch, and a
|
|
1680
|
+
`"Downloading {filename} (0/{totalMB} MB)"` message that shows the size up front so
|
|
1681
|
+
the first (latency-heavy) chunk doesn't read as a freeze. No throughput changed; the
|
|
1682
|
+
bar now narrates the round-trips it was previously silent through. It is a small fix
|
|
1683
|
+
with broad reach — it affected **every model the engine loads**.
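As a rough sketch of the pieces involved: the first range request reads the 8-byte little-endian header length that starts every safetensors file, and the two progress strings are emitted around it. The function names and the rounding of the size to whole MB are assumptions; the message templates are the ones quoted above.

```typescript
// Decode the little-endian u64 header length from the first 8 bytes of a
// safetensors file (the bytes a fetchRange(url, 0, 8) call would return).
function safetensorsHeaderLength(first8: Uint8Array): number {
  const view = new DataView(first8.buffer, first8.byteOffset, 8);
  const lo = view.getUint32(0, true);
  const hi = view.getUint32(4, true);
  return hi * 4294967296 + lo;
}

// Emitted right before the header fetch for shard `index` of `total`.
function readingHeaderMessage(filename: string, index: number, total: number): string {
  return `Reading ${filename} header (${index}/${total})…`;
}

// Emitted before the first data chunk so the size is visible up front.
// Rounding to whole MB is an assumption of this sketch.
function downloadStartMessage(filename: string, totalBytes: number): string {
  const totalMB = Math.round(totalBytes / (1024 * 1024));
  return `Downloading ${filename} (0/${totalMB} MB)`;
}
```

Neither message changes what is fetched; they only give the bar something to say during round-trips that would otherwise be silent.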
|
|
1684
|
+
|
|
1685
|
+
---
|
|
1686
|
+
|
|
1687
|
+
## 29. Cross-Device Multi-Modal Parity (June 2026)
|
|
1688
|
+
|
|
1689
|
+
The milestone this cycle is not any single model but their *intersection*: **text,
|
|
1690
|
+
vision, and embeddings now all run natively on iPad Safari/WebKit**, on the same
|
|
1691
|
+
device that crashed at the start of the campaign (§17). Concretely, on iPad
|
|
1692
|
+
(iPadOS 26.5, WebKit), through the one native WGSL engine:
|
|
1693
|
+
|
|
1694
|
+
| Modality | Native model on iPad | Status |
|
|
1695
|
+
|---|---|---|
|
|
1696
|
+
| Text | Qwen3.5-0.8B INT4 | ✅ ~51 tok/s sustained (200-tok run), bit-correct |
|
|
1697
|
+
| Text (alt) | LFM2.5-350M | ✅ ~46 tok/s on-device; faster/smaller alternative |
|
|
1698
|
+
| Vision | Qwen3.5 ViT (`describeImage`) | ✅ runs; encoder bit-exact vs HF (§22) |
|
|
1699
|
+
| Embeddings | EmbeddingGemma-300M | ✅ runs on-device (§25), 173 MB |
|
|
1700
|
+
|
|
1701
|
+
The desktop numbers remain higher (Qwen3.5 ~207 tok/s, §20; LFM2.5 ~600 tok/s on
|
|
1702
|
+
M4 Max), but the *parity* claim is about mobile: the native engine has reached the
|
|
1703
|
+
**same modality coverage on iPad that the transformers.js/ONNX path offered on the
|
|
1704
|
+
modalities that matter — without the mobile crashes, and faster** (native mobile
|
|
1705
|
+
decode is ~5× the transformers.js path, §19/PROJECT-STATE §4).
|
|
1706
|
+
|
|
1707
|
+
The honest gap is **audio**. Neither STT nor TTS runs natively yet:
|
|
1708
|
+
|
|
1709
|
+
- **TTS via OmniVoice** — in progress; a Qwen3 backbone (mostly the text path the
|
|
1710
|
+
engine already runs) plus a codec decoder.
|
|
1711
|
+
- **STT via Moonshine** — not started; a lean encoder-decoder with a raw-waveform
|
|
1712
|
+
Conv1d frontend (no log-mel Conv2d), needing a parallel CrossAttention kernel.
|
|
1713
|
+
|
|
1714
|
+
Until those land, audio is the one launch modality that still requires the
|
|
1715
|
+
transformers.js path (Kokoro/Supertonic TTS, Whisper STT). The native engine is at
|
|
1716
|
+
**multi-modal parity minus audio**.
|
|
1717
|
+
|
|
1718
|
+
---
|
|
1719
|
+
|
|
1720
|
+
## 30. Model-Zoo Growth and the Saturation of the Kernel Library
|
|
1721
|
+
|
|
1722
|
+
Two model-zoo facts close out the cycle, and together they describe a qualitative
|
|
1723
|
+
shift in what "add a model" costs.
|
|
1724
|
+
|
|
1725
|
+
**LFM2.5-350M shipped** (`architectures/lfm2.ts`, `Lfm2ForCausalLM`) — a hybrid
|
|
1726
|
+
conv/attention text model, ~199 MB at q4 (half Qwen3.5's footprint), faster on both
|
|
1727
|
+
desktop (~600 tok/s, ~2.8× Qwen3.5) and mobile (~46 tok/s). It needed **no new
|
|
1728
|
+
kernels**. Two general fixes fell out of it and were kept: (1) LFM2's effective FF
|
|
1729
|
+
dim is the `block_auto_adjust_ff_dim`-rounded value, not the config's raw
|
|
1730
|
+
`intermediate_size`; (2) its "garbage output" was a **chat-template** problem, not a
|
|
1731
|
+
graph problem — LFM2.5 ships its template as a `chat_template.jinja` sidecar absent
|
|
1732
|
+
from `tokenizer_config.json`, so the engine fell back to Qwen ChatML and injected an
|
|
1733
|
+
empty `<think>` loop. Fetching the `.jinja` sidecar and gating think-injection on the
|
|
1734
|
+
template actually emitting `<think>` fixes **any** model with a jinja sidecar.
|
|
1735
|
+
|
|
1736
|
+
**Adding a new *text* family is now usually Tier-1: a generator, no new kernels.**
|
|
1737
|
+
The kernel library has effectively **saturated** for standard transformers — Llama,
|
|
1738
|
+
Mistral, and Gemma-text all reduce to ops already in the registry (RMSNorm,
|
|
1739
|
+
GQA attention, RoPE, SwiGLU/GeGLU, INT4 matmul). New kernels are needed only for
|
|
1740
|
+
genuinely novel computation: a new norm, an SSM/Mamba path, PLE, a cross-attention
|
|
1741
|
+
for STT. The effort tiers (`docs/adding-a-model-family.md`) make this concrete —
|
|
1742
|
+
Tier 1 (hours, generator-only) covers most of the HF text zoo; Tier 2 (one novel op)
|
|
1743
|
+
and Tier 3 (SSM/MoE/new-executor) are now the exception. EmbeddingGemma was a
|
|
1744
|
+
Tier-2-ish outlier *only* because of MeanPool/Scale and the SPM tokenizer; the next
|
|
1745
|
+
text family will most likely cost a config→IR generator and nothing else.
|
|
1746
|
+
|
|
1747
|
+
---
|
|
1748
|
+
|
|
1749
|
+
## 31. Native STT: Moonshine (June 2026)
|
|
1750
|
+
|
|
1751
|
+
§29 named audio as the one launch modality still on the transformers.js path. This
|
|
1752
|
+
cycle closes the **speech-to-text** half of that gap natively. The model is
|
|
1753
|
+
**Moonshine** (`architectures/moonshine.ts`, `moonshine-executor.ts`,
|
|
1754
|
+
`moonshine-stt.ts`) — a lean encoder-decoder ASR model chosen over Whisper
|
|
1755
|
+
precisely because it avoids the two things the native engine did not want to build:
|
|
1756
|
+
a log-mel front-end (a Conv2d/FFT pipeline) and a generic spectrogram path.
|
|
1757
|
+
|
|
1758
|
+
### A raw-waveform Conv1d front-end (no FFT, no log-mel)
|
|
1759
|
+
|
|
1760
|
+
Moonshine consumes **16 kHz PCM directly**. The front-end
|
|
1761
|
+
(`architectures/moonshine.ts`) is built around three strided 1-D convolutions, not a
|
|
1762
|
+
spectrogram:
|
|
1763
|
+
|
|
1764
|
+
1. `Conv1d(1→H, kernel 127, stride 64)` + `tanh`,
|
|
1765
|
+
2. `GroupNorm(num_groups=1)` over the H channels,
|
|
1766
|
+
3. `Conv1d(H→2H, kernel 7, stride 3)` + GELU,
|
|
1767
|
+
4. `Conv1d(2H→H, kernel 3, stride 2)` + GELU,
|
|
1768
|
+
5. a `Transpose` from `[H, frames]` to `[frames, H]` for the transformer.
|
|
1769
|
+
|
|
1770
|
+
The total downsample is 64·3·2 = **384×**, i.e. ~41.7 frames/second at 16 kHz.
|
|
1771
|
+
This needed three genuinely new kernels — **`Conv1dFull`**, **`GroupNorm`**, and
|
|
1772
|
+
**`Tanh`** (plus a `Transpose`) in `kernels/registry.ts` — and **no FFT and no
|
|
1773
|
+
2-D convolution**, which is the whole reason Moonshine was picked over Whisper.
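The frame-count arithmetic can be worked through with a short sketch. The kernel and stride values are the ones listed above; the helper names are hypothetical, and the sketch assumes unpadded ("valid") convolutions, so exact counts would differ if the real front-end pads its input.

```typescript
interface ConvStage {
  kernel: number;
  stride: number;
}

// The three strided convolutions from the list above.
const MOONSHINE_FRONTEND: ConvStage[] = [
  { kernel: 127, stride: 64 },
  { kernel: 7, stride: 3 },
  { kernel: 3, stride: 2 },
];

// Output length of an unpadded strided 1-D convolution.
function convOutLen(inputLen: number, stage: ConvStage): number {
  if (inputLen < stage.kernel) return 0;
  return Math.floor((inputLen - stage.kernel) / stage.stride) + 1;
}

// Number of encoder frames produced for a given number of PCM samples.
function frontendFrames(samples: number): number {
  let len = samples;
  for (const stage of MOONSHINE_FRONTEND) len = convOutLen(len, stage);
  return len;
}

// Nominal downsample factor: the product of the strides.
function nominalDownsample(): number {
  return MOONSHINE_FRONTEND.reduce((acc, stage) => acc * stage.stride, 1);
}
```

Under the no-padding assumption, one second of 16 kHz audio passes through lengths 16000 → 249 → 81 → 40, slightly below the nominal 16000/384 because each kernel consumes a few samples at the edge.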
|
|
1774
|
+
|
|
1775
|
+
### The CrossAttention kernel — the one real new attention primitive
|
|
1776
|
+
|
|
1777
|
+
The decoder attends to the frozen encoder output, so it needed a **cross-attention**
|
|
1778
|
+
kernel distinct from the engine's self-attention. `WGSL_CROSS_ATTENTION`
|
|
1779
|
+
(`kernels/registry.ts`, `crossAttentionSpec`) is a tiled, online-softmax
|
|
1780
|
+
cross-attention: decoder queries stream over the encoder sequence in tiles of 16
|
|
1781
|
+
with a 256-thread workgroup, running-max/running-sum softmax, V accumulation — and
|
|
1782
|
+
the same mobile-safe discipline the self-attention kernel uses (no `select()`,
|
|
1783
|
+
`exp()` clamped, ≤16 KB workgroup memory). It was validated against an independent
|
|
1784
|
+
NumPy reference (`scripts/engine/test-crossattention.mjs`): **max absolute error
|
|
1785
|
+
< 2e-4 and cosine ≥ 0.9999** — bit-exact within f32 rounding.
|
|
1786
|
+
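The online-softmax idea behind the kernel can be illustrated on the CPU for a single head and a single query. This is a sketch of the general technique, not a port of `WGSL_CROSS_ATTENTION`: the function names are hypothetical, the `1/sqrt(dim)` score scale is an assumption, and the tile size of 16 is taken from the description above.

```typescript
// Naive reference: full softmax over all encoder positions.
function naiveCrossAttention(
  q: Float32Array, k: Float32Array, v: Float32Array, encLen: number, dim: number,
): Float32Array {
  const scale = 1 / Math.sqrt(dim);
  const scores: number[] = [];
  for (let j = 0; j < encLen; j++) {
    let s = 0;
    for (let d = 0; d < dim; d++) s += q[d] * k[j * dim + d];
    scores.push(s * scale);
  }
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const out = new Float32Array(dim);
  for (let d = 0; d < dim; d++) {
    let acc = 0;
    for (let j = 0; j < encLen; j++) acc += exps[j] * v[j * dim + d];
    out[d] = acc / sum;
  }
  return out;
}

// Tiled version: stream over the encoder sequence, keeping a running max,
// a running sum, and a running weighted sum of V that is rescaled per tile.
function tiledCrossAttention(
  q: Float32Array, k: Float32Array, v: Float32Array, encLen: number, dim: number, tile = 16,
): Float32Array {
  const scale = 1 / Math.sqrt(dim);
  let runMax = -Infinity;
  let runSum = 0;
  const acc = new Float64Array(dim);
  for (let start = 0; start < encLen; start += tile) {
    const end = Math.min(start + tile, encLen);
    const scores: number[] = [];
    for (let j = start; j < end; j++) {
      let s = 0;
      for (let d = 0; d < dim; d++) s += q[d] * k[j * dim + d];
      scores.push(s * scale);
    }
    const newMax = Math.max(runMax, ...scores);
    const correction = Math.exp(runMax - newMax); // 0 on the first tile
    runSum *= correction;
    for (let d = 0; d < dim; d++) acc[d] *= correction;
    for (let j = start; j < end; j++) {
      const p = Math.exp(scores[j - start] - newMax);
      runSum += p;
      for (let d = 0; d < dim; d++) acc[d] += p * v[j * dim + d];
    }
    runMax = newMax;
  }
  const out = new Float32Array(dim);
  for (let d = 0; d < dim; d++) out[d] = acc[d] / runSum;
  return out;
}
```

The two functions agree up to rounding, which is the property a kernel-versus-reference test of this kind checks; the tiled form never needs the full score vector in memory at once.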
|
|
1787
|
+
### The dual-graph executor: encode once, freeze K/V, then decode
|
|
1788
|
+
|
|
1789
|
+
The runtime is **two graphs, not one** (`moonshine-executor.ts`,
|
|
1790
|
+
`moonshine-stt.ts`):
|
|
1791
|
+
|
|
1792
|
+
- `MoonshineEncoderExecutor.encode(pcm)` runs the Conv1d front-end and the
|
|
1793
|
+
bidirectional encoder transformer **once** over the whole utterance, then
|
|
1794
|
+
projects and caches the **per-decoder-layer K/V** from the encoder output.
|
|
1795
|
+
- `MoonshineSTT.transcribe(pcm)` binds that **frozen encoder K/V** into the decoder
|
|
1796
|
+
and runs a normal greedy autoregressive loop, where each step does self-attention
|
|
1797
|
+
over the generated text *plus* cross-attention into the frozen encoder K/V.
|
|
1798
|
+
|
|
1799
|
+
This is the first encoder-decoder shape in the engine; the design choice — compute
|
|
1800
|
+
the encoder K/V exactly once and treat it as a constant during decode — is what
|
|
1801
|
+
keeps cross-attention cheap per token.
|
|
1802
|
+
|
|
1803
|
+
### Interleaved RoPE
|
|
1804
|
+
|
|
1805
|
+
Moonshine's rotary embedding is **interleaved**, not split-half. Where the engine's
|
|
1806
|
+
default RoPE (HF Llama `rotate_half`) pairs dim *i* with dim *i*+½·rope_dim,
|
|
1807
|
+
Moonshine pairs **adjacent** dims (2p, 2p+1) under a single frequency `inv_freq[p]`.
|
|
1808
|
+
This is a separate kernel, `WGSL_ROPE_INTERLEAVED` (`ROPE_INTERLEAVED_SPEC`),
|
|
1809
|
+
selected by an `interleaved: true` attribute on the encoder/decoder RoPE nodes;
|
|
1810
|
+
the comment records it as verified against HF's `MoonshineRotaryEmbedding`
|
|
1811
|
+
(`cos[:half].repeat_interleave(2)` + interleaved `rotate_half`).
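The difference between the two pairings is easiest to see side by side. The sketch below rotates a whole vector (so `ropeDim` equals the vector length) and uses hypothetical function names; it follows the pairing rules stated above and is not the WGSL kernel.

```typescript
// inv_freq[p] = theta^(-2p / ropeDim), one frequency per rotated pair.
function ropeInvFreq(ropeDim: number, theta: number): number[] {
  const inv: number[] = [];
  for (let p = 0; p < ropeDim / 2; p++) inv.push(Math.pow(theta, (-2 * p) / ropeDim));
  return inv;
}

// Split-half (HF Llama rotate_half): dim p is paired with dim p + ropeDim/2.
function ropeSplitHalf(x: Float32Array, pos: number, theta = 10000): Float32Array {
  const half = x.length / 2;
  const inv = ropeInvFreq(x.length, theta);
  const out = new Float32Array(x.length);
  for (let p = 0; p < half; p++) {
    const c = Math.cos(pos * inv[p]);
    const s = Math.sin(pos * inv[p]);
    out[p] = x[p] * c - x[p + half] * s;
    out[p + half] = x[p + half] * c + x[p] * s;
  }
  return out;
}

// Interleaved: adjacent dims (2p, 2p+1) share the single frequency inv_freq[p].
function ropeInterleaved(x: Float32Array, pos: number, theta = 10000): Float32Array {
  const inv = ropeInvFreq(x.length, theta);
  const out = new Float32Array(x.length);
  for (let p = 0; p < x.length / 2; p++) {
    const c = Math.cos(pos * inv[p]);
    const s = Math.sin(pos * inv[p]);
    out[2 * p] = x[2 * p] * c - x[2 * p + 1] * s;
    out[2 * p + 1] = x[2 * p + 1] * c + x[2 * p] * s;
  }
  return out;
}
```

Both variants apply the same 2-D rotations with the same frequencies; only which dims form a pair differs, which is why weights trained with one layout produce garbage under the other.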
|
|
1812
|
+
|
|
1813
|
+
### Validation and status
|
|
1814
|
+
|
|
1815
|
+
End-to-end transcription is exercised by `scripts/engine/test-moonshine-transcribe.mjs`,
|
|
1816
|
+
which asserts the transcript **contains the expected ground-truth substrings** for
|
|
1817
|
+
standard HF Moonshine reference clips (e.g. "stew for dinner", "his belly counsel"),
|
|
1818
|
+
and reports real-time-factor and the 4-bit size projection as informational output
|
|
1819
|
+
(both computed dynamically from the run, not hardcoded). The architecture comment
|
|
1820
|
+
records **encoder cosine ≈ 0.990 vs HF transformers** on Dawn. (The crisper
|
|
1821
|
+
"6/6 verbatim / RTF ~40× / ~31 MB at 4-bit" framing from the working notes is
|
|
1822
|
+
consistent with these checks, but the *enforced* gates in the committed tests are
|
|
1823
|
+
the substring-match transcript assertions and the bit-exact CrossAttention numbers
|
|
1824
|
+
above; quote those when precision matters.) **Whisper stays as the multilingual /
|
|
1825
|
+
no-WebGPU fallback** — `WhisperSTT` (`src/core/stt.ts`) is a separate
|
|
1826
|
+
transformers.js/ONNX path, untouched by and independent of the native Moonshine
|
|
1827
|
+
engine.
|
|
1828
|
+
|
|
1829
|
+
---
|
|
1830
|
+
|
|
1831
|
+
## 32. Native TTS: Kani-TTS-2 and the NanoCodec Decoder (June 2026)
|
|
1832
|
+
|
|
1833
|
+
The **text-to-speech** half of the audio gap is **partly** landed: the hard,
|
|
1834
|
+
novel piece — the **NanoCodec audio decoder** — is implemented and validated
|
|
1835
|
+
bit-exact, while the codec-LM backbone's autoregressive driver is scaffolded and
|
|
1836
|
+
deliberately not yet runnable end-to-end. The model is **Kani-TTS-2**
|
|
1837
|
+
(`architectures/kani_tts.ts`).
|
|
1838
|
+
|
|
1839
|
+
### The backbone: an LFM2-350M codec-LM
|
|
1840
|
+
|
|
1841
|
+
Kani-TTS-2's backbone is **LFM2-350M** (arch string `KaniTTS2ForCausalLM`,
|
|
1842
|
+
model_type `lfm2`) reusing `generateLfm2Graph` almost verbatim — the LFM2 block
|
|
1843
|
+
math (RMSNorm, short-conv with conv-state cache, GQA, SwiGLU) shipped already in
|
|
1844
|
+
§30. It autoregressively emits **NanoCodec audio tokens, 4 per frame**, into a
|
|
1845
|
+
vocab that extends *above* the text vocab (`vocab_size` 80538 vs `text_vocab_size`
|
|
1846
|
+
64400; audio token IDs start at 64410). The backbone-specific additions are
|
|
1847
|
+
frame-level position IDs (audio tokens within a frame share a position; text tokens
|
|
1848
|
+
advance by one), a **learnable per-layer RoPE**
|
|
1849
|
+
(`α^(l) = alpha_min + (alpha_max−alpha_min)·sigmoid(alpha_weight^(l))`), the
|
|
1850
|
+
4-token-frame decode loop, and an optional speaker-embedding projection.
|
|
1851
|
+
`generateKaniTtsGraph` currently **throws a descriptive error** rather than emit a
|
|
1852
|
+
half-wired graph: it parses and reports the config, confirms the decoder is done,
|
|
1853
|
+
and lists the remaining position/RoPE/decode-driver glue.
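Two of the backbone-specific pieces are small enough to sketch. The code below is an illustration under stated assumptions, not the missing driver: the function names are hypothetical, any token id at or above 64410 is treated as an audio token, a completed 4-token frame is assumed to advance the position by one, and a partially filled frame is assumed to close when a text token follows. How the per-layer α enters the rotary computation is not described in the text, so only the α formula itself is shown.

```typescript
const AUDIO_TOKEN_START = 64410; // first audio token id, as quoted above
const TOKENS_PER_FRAME = 4;

// alpha = alpha_min + (alpha_max - alpha_min) * sigmoid(alpha_weight), per layer.
function layerAlpha(alphaWeight: number, alphaMin: number, alphaMax: number): number {
  const sigmoid = 1 / (1 + Math.exp(-alphaWeight));
  return alphaMin + (alphaMax - alphaMin) * sigmoid;
}

// Frame-level position ids: text tokens advance by one, while the audio tokens
// of one frame all share a single position.
function framePositionIds(tokenIds: number[]): number[] {
  const positions: number[] = [];
  let pos = 0;
  let inFrame = 0; // audio tokens already placed in the current frame
  for (const id of tokenIds) {
    if (id >= AUDIO_TOKEN_START) {
      positions.push(pos);
      inFrame++;
      if (inFrame === TOKENS_PER_FRAME) {
        pos++;
        inFrame = 0;
      }
    } else {
      if (inFrame > 0) {
        // Assumption: a partial frame is closed when text resumes.
        pos++;
        inFrame = 0;
      }
      positions.push(pos);
      pos++;
    }
  }
  return positions;
}
```

With this scheme a long utterance consumes one position per audio frame rather than four, which is the practical effect of sharing a position within a frame.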
|
|
1854
|
+
|
|
1855
|
+
### The NanoCodec decoder — FSQ + causal HiFi-GAN, the novel part
|
|
1856
|
+
|
|
1857
|
+
`generateNanoCodecDecoderGraph` is the genuinely new computation, and it is
|
|
1858
|
+
**complete and validated**. Two stages:
|
|
1859
|
+
|
|
1860
|
+
- **FSQ dequant.** NanoCodec uses **finite scalar quantization**: 4 groups × 4
|
|
1861
|
+
dims, per-group levels `[9, 8, 8, 7]` (codebook 4032), mixed-radix base
|
|
1862
|
+
`[1, 9, 72, 576]`. Each audio code is unpacked by
|
|
1863
|
+
`nonneg = (idx // base[d]) % levels[d]; code = (nonneg − L/2)/(L/2)` — a new
|
|
1864
|
+
**`FSQDequant`** kernel.
|
|
1865
|
+
- **Causal HiFi-GAN vocoder.** A `CausalConv1d(16→864, k7)`, then 5 upsample
|
|
1866
|
+
stages with rates `[7, 7, 6, 3, 2]` (each: `HalfSnake` → depthwise
|
|
1867
|
+
`CausalConvTranspose1d(k = 2·rate)` → a HiFi-GAN residual layer averaging
|
|
1868
|
+
kernels `[3, 7, 11]` over dilations `[1, 3, 5]`), then `HalfSnake` → a final
|
|
1869
|
+
`CausalConv1d(27→1, k3)` and a clamp to [-1, 1]. The hop is 1764, so PCM length
|
|
1870
|
+
= `frames · 1764` at 22050 Hz.
|
|
1871
|
+
|
|
1872
|
+
This needed two more new kernels beyond `FSQDequant`: **`HalfSnake1d`** (snake
|
|
1873
|
+
activation on the first half of the channels, leaky-ReLU on the rest) and
|
|
1874
|
+
**`ConvTranspose1dDepthwise`** (causal depthwise transposed convolution) — all
|
|
1875
|
+
three registered in `kernels/registry.ts`.
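The FSQ unpacking and the vocoder's length arithmetic can be checked with a short sketch. The levels, bases, and upsample rates are the ones quoted above; the function names are hypothetical, `L/2` is assumed to mean integer division, and the per-stage channel halving (864 down to 27) is inferred from the first and last convolution widths rather than stated in the text.

```typescript
const FSQ_LEVELS = [9, 8, 8, 7];
const FSQ_BASE = [1, 9, 72, 576];
const UPSAMPLE_RATES = [7, 7, 6, 3, 2];

// Mixed-radix unpack of one code index into 4 values in roughly [-1, 1].
function fsqDequant(idx: number): number[] {
  return FSQ_LEVELS.map((levels, d) => {
    const nonneg = Math.floor(idx / FSQ_BASE[d]) % levels;
    const half = Math.floor(levels / 2); // assumption: integer L/2
    return (nonneg - half) / half;
  });
}

// Codebook size is the product of the per-dimension levels.
function fsqCodebookSize(): number {
  return FSQ_LEVELS.reduce((acc, levels) => acc * levels, 1);
}

// Hop length is the product of the upsample rates.
function vocoderHop(): number {
  return UPSAMPLE_RATES.reduce((acc, rate) => acc * rate, 1);
}

// PCM samples produced for a number of codec frames.
function pcmLength(frames: number): number {
  return frames * vocoderHop();
}

// Inferred channel widths after each upsample stage, halving from 864.
function stageChannels(initial = 864): number[] {
  const widths: number[] = [];
  let c = initial;
  for (let i = 0; i < UPSAMPLE_RATES.length; i++) {
    c = c / 2;
    widths.push(c);
  }
  return widths;
}
```

At 22050 Hz a hop of 1764 corresponds to 12.5 codec frames per second, so one second of audio is 12.5 frames of 4 codes each.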
|
|
1876
|
+
|
|
1877
|
+
### Validation and the license note
|
|
1878
|
+
|
|
1879
|
+
The full decoder is checked by `scripts/engine/test-nanocodec-decode.mjs` against a
|
|
1880
|
+
real MLX reference. The committed assertion gate is **`err < 1e-3`** (and matching
|
|
1881
|
+
PCM length); the run prints the *actual* measured error, which the code's own
|
|
1882
|
+
headers record as **max|err| ≈ 4.2e-6 vs the MLX reference** — i.e. bit-exact
|
|
1883
|
+
within f32 rounding, with comfortable margin under the 1e-3 gate. **Status:**
|
|
1884
|
+
NanoCodec decoder + Kani scaffold landed and validated; the codec-LM backbone's
|
|
1885
|
+
frame-position / learnable-RoPE / 4-token-frame AR loop is the remaining glue
|
|
1886
|
+
(most of the block math is reused from `lfm2.ts`).
|
|
1887
|
+
|
|
1888
|
+
**Licensing** (recorded in the source header): the shipping **kani-tts-2-en** is
|
|
1889
|
+
**LFM1.0 (Liquid AI), `license: other`** — it fine-tunes LFM2-350M, so it is *not*
|
|
1890
|
+
Apache; the NanoCodec weights are under the **NVIDIA Open Model License**. The
|
|
1891
|
+
older `kani-tts-450m-0.2-ft` variant is **Apache-2.0** with the same architecture.
|
|
1892
|
+
|
|
1893
|
+
---
|
|
1894
|
+
|
|
1895
|
+
## 33. Gemma 4 E2B: PLE, KV-Sharing, and Logit Softcap (June 2026)
|
|
1896
|
+
|
|
1897
|
+
Gemma 4 E2B is the smallest Gemma 4 and the engine's first **Tier-2** text decoder
|
|
1898
|
+
beyond the Gemma3 encoder of §25 — a *text-only* model with several Gemma-4-specific
|
|
1899
|
+
ops, but **no MatFormer / AltUp / LAuReL** (those belong to Gemma-3n, not E2B). Its
|
|
1900
|
+
graph generator is `architectures/gemma4.ts`, building on the Gemma machinery from
|
|
1901
|
+
`gemma3_encoder.ts` with a causal LM tail. The decode graph is **structurally
|
|
1902
|
+
complete and validated**; the gate to running it on real weights is embedding
|
|
1903
|
+
sharding (below).
|
|
1904
|
+
|
|
1905
|
+
### What is Gemma-4-specific
|
|
1906
|
+
|
|
1907
|
+
- **Per-Layer Embeddings (PLE).** A *second* embedding table is gathered per token,
|
|
1908
|
+
projected once (`per_layer_model_projection` + `per_layer_projection_norm`), and
|
|
1909
|
+
then injected at **every** layer:
|
|
1910
|
+
`h = h + post_norm( per_layer_projection( gelu(gate(h)) · ple_i ) )` — a per-layer
|
|
1911
|
+
gate, GELU, elementwise multiply with that layer's PLE slice, projection, norm,
|
|
1912
|
+
and residual.
|
|
1913
|
+
- **KV-cache sharing.** The last `num_kv_shared_layers` layers reuse the K/V cache of
|
|
1914
|
+
the matching `layer_type` from before the shared region — a *graph-level rewire*
|
|
1915
|
+
with no kernel change and no K/V projection emitted for the shared layers. For E2B
|
|
1916
|
+
that is **35 layers total, the last 20 shared** (layers 15–34 reuse layers 0–14 of
|
|
1917
|
+
matching type).
|
|
1918
|
+
- **Proportional / dual-theta RoPE.** Full-attention layers rotate the first
|
|
1919
|
+
`partial_rotary_factor·head_dim` dims (0.25·256 = 64) but compute `inv_freq` over
|
|
1920
|
+
the **full** `head_dim` denominator (the `rope_denom` attribute) — distinct from
|
|
1921
|
+
Qwen3.5's partial RoPE, which divides by the *rotated* dim. Full layers use
|
|
1922
|
+
`rope_theta` 1e6, sliding layers 1e4 — the same dual-theta selection as the Gemma3
|
|
1923
|
+
encoder.
|
|
1924
|
+
- **GeGLU MLP** (`down(gelu_tanh(gate) · up)`) and Gemma's four-norm sandwich,
|
|
1925
|
+
reused from §25.
|
|
1926
|
+
- **Final logit softcap via a new `Softcap` kernel.** `WGSL_SOFTCAP`
|
|
1927
|
+
(`kernels/registry.ts`, `softcapSpec`) computes `out = cap · tanh(in / cap)` with
|
|
1928
|
+
the cap passed as a bit-pattern; for E2B `cap = 30`, squashing the final logits to
|
|
1929
|
+
±30. It is wired into the graph tail (`logit_softcap` node) when
|
|
1930
|
+
`final_logit_softcapping > 0`, and validated by `test-gemma4-softcap.mjs`
|
|
1931
|
+
(GPU vs CPU reference, max err < 1e-4; saturation behavior confirmed).
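Two of the ops above reduce to one-line formulas, sketched here as CPU references. The function names are hypothetical, and the frequency helpers only illustrate the denominator difference described in the RoPE bullet: the same number of rotated pairs, but with exponents divided by either the full `head_dim` or the rotated dim.

```typescript
// Softcap: out = cap * tanh(in / cap), squashing values into (-cap, cap).
function softcapRef(logits: Float32Array, cap: number): Float32Array {
  return logits.map((x) => cap * Math.tanh(x / cap));
}

// Proportional RoPE: rotate partialFactor * headDim dims, but use the full
// headDim as the exponent denominator.
function invFreqProportional(headDim: number, partialFactor: number, theta: number): number[] {
  const rotDim = Math.floor(headDim * partialFactor);
  const inv: number[] = [];
  for (let p = 0; p < rotDim / 2; p++) inv.push(Math.pow(theta, (-2 * p) / headDim));
  return inv;
}

// Contrast case: the exponent denominator is the rotated dim itself.
function invFreqPartial(headDim: number, partialFactor: number, theta: number): number[] {
  const rotDim = Math.floor(headDim * partialFactor);
  const inv: number[] = [];
  for (let p = 0; p < rotDim / 2; p++) inv.push(Math.pow(theta, (-2 * p) / rotDim));
  return inv;
}
```

With `head_dim` 256 and factor 0.25 both helpers yield 32 frequencies, but the proportional ones decay four times more slowly in the exponent, so the two schemes are not interchangeable.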
|
|
1932
|
+
|
|
1933
|
+
### Validation and the gating item
|
|
1934
|
+
|
|
1935
|
+
`scripts/engine/test-gemma4-graph.mjs` is a structural validation of the generated
|
|
1936
|
+
graph — embedding+PLE pipeline, per-layer PLE injection chain, the sandwich norms,
|
|
1937
|
+
QK-norm + GeGLU, causal attention, proportional RoPE per layer-type, the KV-share
|
|
1938
|
+
rewire (own K/V for layers 0–14, shared for 15–34), the LM head and the softcap —
|
|
1939
|
+
and it passes **62/62** (893 nodes, 1920 tensors).
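The KV-share rewire checked by that test can be pictured as a mapping from each layer to the layer whose K/V cache it reads. The sketch below is an assumption-laden illustration: the function name is hypothetical, the rule "reuse the last unshared layer of the same type" is one plausible reading of "the matching `layer_type` from before the shared region", and the repeating four-sliding-one-full pattern used in the example is illustrative rather than taken from the E2B config.

```typescript
// For each layer, the index of the layer whose K/V cache it uses.
// Unshared layers map to themselves; shared layers map to the last unshared
// layer with the same type string (assumed rule).
function kvSourceLayers(layerTypes: string[], numKvSharedLayers: number): number[] {
  const firstShared = layerTypes.length - numKvSharedLayers;
  return layerTypes.map((type, i) => {
    if (i < firstShared) return i;
    for (let j = firstShared - 1; j >= 0; j--) {
      if (layerTypes[j] === type) return j;
    }
    throw new Error(`no unshared layer of type ${type}`);
  });
}

// Illustrative 35-layer pattern: every fifth layer is full attention.
function exampleLayerTypes(count = 35): string[] {
  const types: string[] = [];
  for (let i = 0; i < count; i++) {
    types.push(i % 5 === 4 ? "full_attention" : "sliding_attention");
  }
  return types;
}
```

Because the shared layers only change which cache they read, a mapping like this is all the graph generator needs; no attention kernel has to know about sharing.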
|
|
1940
|
+
|
|
1941
|
+
The open item is **embedding sharding**. At 4-bit the PLE nibble buffer is
|
|
1942
|
+
**~1.17 GB** (≈1174 MB) and the main embedding is **~201 MB**, both over the
|
|
1943
|
+
per-binding cap, so the loader / `EmbeddingInt4` op must shard the quantized tables
|
|
1944
|
+
across multiple storage buffers before the decode path can run on real weights. The
|
|
1945
|
+
decode graph (last-row slice → tied LM head → Softcap) is otherwise complete and
|
|
1946
|
+
verified; sharding is loader-level work, in progress.
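The sharding arithmetic itself is simple, and a sketch may help size the work. Everything here is hypothetical: the names, the row-aligned splitting strategy, and the numbers in any example. The binding cap is left as a parameter because the text does not state the value the engine uses.

```typescript
interface ShardPlan {
  rowsPerShard: number;
  shardCount: number;
}

// Split a table of `rows` rows into row-aligned shards that each fit the cap.
function planShards(rows: number, bytesPerRow: number, maxBindingBytes: number): ShardPlan {
  const rowsPerShard = Math.floor(maxBindingBytes / bytesPerRow);
  if (rowsPerShard < 1) throw new Error("a single row exceeds the binding cap");
  return { rowsPerShard, shardCount: Math.ceil(rows / rowsPerShard) };
}

// Map a token id (row index) to the shard that holds it and the row inside it.
function locateRow(row: number, plan: ShardPlan): { shard: number; localRow: number } {
  return {
    shard: Math.floor(row / plan.rowsPerShard),
    localRow: row % plan.rowsPerShard,
  };
}
```

Keeping shards row-aligned means a gather for one token touches exactly one buffer, so the embedding op only needs a shard index and a local row rather than cross-buffer reads.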
|
|
1947
|
+
|
|
1948
|
+
---
|
|
1949
|
+
|
|
1950
|
+
## 34. On-Device Memory / RAG (June 2026)
|
|
1951
|
+
|
|
1952
|
+
The native embedding stack (§21, §25) makes a small **retrieval-augmented memory**
|
|
1953
|
+
layer cheap to build entirely on-device, with no server and no external vector DB.
|
|
1954
|
+
It ships as `@tryhamster/gerbil/memory` (`src/memory/`, exported in `package.json`)
|
|
1955
|
+
and reuses the **native EmbeddingGemma** path through a thin adapter.
|
|
1956
|
+
|
|
1957
|
+
### Shape
|
|
1958
|
+
|
|
1959
|
+
- **Pluggable vector stores.** Three backends behind one interface
|
|
1960
|
+
(`src/memory/stores/`): `InMemoryStore` (default, process lifetime),
|
|
1961
|
+
`IndexedDBStore` (browser, durable across sessions), and `FileStore` (Node, a
|
|
1962
|
+
durable JSON file on disk) — via `createInMemoryStore()` / `createIndexedDBStore()`
|
|
1963
|
+
/ `createFileStore(path)`.
|
|
1964
|
+
- **Native-embedder adapter.** `createGerbilEmbedder(engine)`
|
|
1965
|
+
(`src/memory/gerbil-embedder.ts`) wraps anything with an `embedBatch` — a Gerbil
|
|
1966
|
+
instance, the one-liner `embedBatch`, or the browser `useEmbedding().embedBatch` —
|
|
1967
|
+
so a memory is built with `createMemory({ embed: createGerbilEmbedder(g) })` after
|
|
1968
|
+
`g.loadModel("embeddinggemma-300m")`. The whole pipeline (embed → store → recall)
|
|
1969
|
+
runs on the native WGSL engine.
|
|
1970
|
+
- **Chunking.** `add(text, { chunk: true })` splits long documents into overlapping
|
|
1971
|
+
character windows (`src/memory/chunking.ts`, defaults 1000 chars / 200 overlap),
|
|
1972
|
+
one record per chunk, so retrieval targets relevant passages.
|
|
1973
|
+
- **Redaction on write.** `applyRedaction` (`src/memory/redaction.ts`) accepts a
|
|
1974
|
+
`RegExp` (matches replaced with `[REDACTED]`) or a function, applied **before**
|
|
1975
|
+
the text is embedded and stored, so sensitive data never lands in the index.
|
|
1976
|
+
- **Token-budgeted recall.** `recall(query, options)` (`src/memory/memory.ts`)
|
|
1977
|
+
searches the store, then **greedily packs the highest-scoring records under a
|
|
1978
|
+
token budget** (default 1024, via a ~4-chars/token estimate, accounting for the
|
|
1979
|
+
separator), returning `{ context, records, tokensUsed }` — a ready-to-inject
|
|
1980
|
+
context string sized to fit a prompt budget.
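
The chunking and token-budgeted packing described in the list above can be sketched as two small pure functions. This is an illustration under stated assumptions, not the code in `src/memory/`: `chunkText`, `packUnderBudget`, the `"\n\n"` separator, and the choice to skip (rather than stop at) a record that does not fit are all hypothetical.

```typescript
// Overlapping character windows; defaults mirror the 1000 / 200 quoted above.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

interface Scored {
  text: string;
  score: number;
}

// Rough estimate of ~4 characters per token, as described above.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

// Greedily keep the highest-scoring records whose estimated cost, including
// the separator between records, still fits under the token budget.
function packUnderBudget(records: Scored[], maxTokens = 1024, separator = "\n\n") {
  const ranked = [...records].sort((a, b) => b.score - a.score);
  const kept: Scored[] = [];
  let tokensUsed = 0;
  for (const r of ranked) {
    const sepCost = kept.length > 0 ? estimateTokens(separator) : 0;
    const cost = estimateTokens(r.text) + sepCost;
    if (tokensUsed + cost > maxTokens) continue; // assumption: skip and try smaller records
    kept.push(r);
    tokensUsed += cost;
  }
  return { context: kept.map((r) => r.text).join(separator), records: kept, tokensUsed };
}
```

Skipping an oversized record instead of stopping lets a lower-ranked but shorter passage use the remaining budget; a strict stop-at-first-miss policy would be an equally valid reading of "greedy".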
|
|
1981
|
+
|
|
1982
|
+
### Validation
|
|
1983
|
+
|
|
1984
|
+
`src/memory/memory.test.ts` contains **12 tests** covering relevance ranking, metadata
filtering, budget-aware packing (including the empty-context edge case), chunking,
import/export round-trip, regex and predicate redaction, and durability for both the
IndexedDB and File stores. The module is a straightforward, fully tested consumer of
the native embedding modality: no new kernels, no GPU work of its own.
|
|
1989
|
+
|
|
1990
|
+
---
|
|
1991
|
+
|
|
1992
|
+
## 35. The June-2026 Autoresearch Campaign (Text, LFM2, and ViT)
|
|
1993
|
+
|
|
1994
|
+
§20 documented the first autoresearch run (Qwen3.5 decode 145→207 tok/s on M4 Max). After
|
|
1995
|
+
LFM2.5 (§30) and the ViT (§22) landed, the same loop
|
|
1996
|
+
(`scripts/engine/optimize.mjs --mode=autoresearch`; results in
|
|
1997
|
+
`scripts/engine/results.jsonl`, chart at `scripts/engine/chart.html`) was run over
|
|
1998
|
+
**three more batches** — two on text, one on vision — all on M4 Max / node-dawn. The
|
|
1999
|
+
numbers are smaller than the first run's, and **that is the finding**: the tuned
|
|
2000
|
+
kernels are now near the bandwidth floor, so only a specific *class* of edit still
|
|
2001
|
+
wins.
|
|
2002
|
+
|
|
2003
|
+
### The three batches
|
|
2004
|
+
|
|
2005
|
+
| Batch | Target | Baseline | Best | Kept | Reverted |
|---|---|---|---|---|---|
| Text (b1) | Qwen3.5-0.8B Q4 decode | 219.1 tok/s | ~223 tok/s | 3 | 5 |
| Text (b1) | LFM2.5-350M Q4 decode | 624.1 tok/s | ~649 tok/s | (same batch) | |
| Text (b2) | Qwen3.5 / LFM2.5 decode | 220.8 / 652.6 tok/s | **234.4 / 672.2 tok/s** | several | several |
| Vision (b3/b4) | Qwen3.5 ViT encode | 581.8 ms | **~502 ms** (−~14%) | 3 + 1 | 3 |
| Vision (b3/b4) | `describeImage` decode | 37.0 tok/s | **42.0 tok/s** (+13.5%) | (same batch) | |
|
|
2012
|
+
|
|
2013
|
+
The text wins were **fusions that also kill a wide-tensor round-trip**: fuse SiLU
|
|
2014
|
+
into Qwen's CausalConv1d (+1.8%), fuse LFM2's post-gate `C·conv` (+2.0%) and its
|
|
2015
|
+
pre-gate `B·x`/`B/x` slices via a `MulCols` (+1.9%); the second batch pushed Qwen to
|
|
2016
|
+
~234 and LFM2 to ~672. The vision wins were all in the **shared f32 `MatMul`** that
|
|
2017
|
+
dominates ViT encode — 2×2 register blocking (−6.5%), vec4 global-tile loads
|
|
2018
|
+
(−3.0%), 4×2 register blocking (−1.9%), plus a `MatMul`+`AddBias` → `MatMulBias`
|
|
2019
|
+
structural fusion (desktop ~flat but removes ~86 dispatches and wide round-trips per
|
|
2020
|
+
encode). Text decode was *unaffected* by the vision batch (it uses `MatMulInt4`,
|
|
2021
|
+
confirmed steady at ~233 tok/s throughout), and every kept ViT change stayed
|
|
2022
|
+
bit-exact (merged cosine 1.0, e2e 7/7, description matches HF).
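
What "a fusion that also kills a wide-tensor round-trip" means can be shown with a CPU sketch of a depthwise causal conv followed by SiLU. This is an illustration only, assuming a `[T, C]` row-major input and `[C, K]` weights; it is not the WGSL in the engine, and `convAt`, `convThenSilu`, and `fusedConvSilu` are hypothetical names.

```typescript
const silu = (x: number): number => x / (1 + Math.exp(-x));

// One output element of a depthwise causal conv: x is [T, C] row-major,
// w is [C, K]. Math.fround models the f32 value a GPU kernel would hold.
function convAt(x: Float32Array, w: Float32Array, C: number, K: number, t: number, c: number): number {
  let acc = 0;
  for (let k = 0; k < K; k++) {
    const src = t - (K - 1) + k; // only the current and past timesteps
    if (src >= 0) acc += x[src * C + c] * w[c * K + k];
  }
  return Math.fround(acc);
}

// Unfused: the conv writes a full [T, C] intermediate, then SiLU re-reads it.
function convThenSilu(x: Float32Array, w: Float32Array, T: number, C: number, K: number): Float32Array {
  const tmp = new Float32Array(T * C);
  for (let t = 0; t < T; t++) {
    for (let c = 0; c < C; c++) tmp[t * C + c] = convAt(x, w, C, K, t, c);
  }
  return tmp.map(silu); // second pass over the wide tensor
}

// Fused: SiLU is applied while the conv result is still "in a register",
// so the intermediate tensor is never written or read back.
function fusedConvSilu(x: Float32Array, w: Float32Array, T: number, C: number, K: number): Float32Array {
  const out = new Float32Array(T * C);
  for (let t = 0; t < T; t++) {
    for (let c = 0; c < C; c++) out[t * C + c] = silu(convAt(x, w, C, K, t, c));
  }
  return out;
}

const xDemo = new Float32Array([1, 2, 3, 4, 5, 6, 7, 8]); // T = 4, C = 2
const wDemo = new Float32Array([0.25, -0.5, 0.5, 1, 0, -1]); // C = 2, K = 3
const unfusedOut = convThenSilu(xDemo, wDemo, 4, 2, 3);
const fusedOut = fusedConvSilu(xDemo, wDemo, 4, 2, 3);
```

Both paths produce identical values; the fused one saves one dispatch and, more importantly per the finding above, one full write and one full read of the `[T, C]` intermediate.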
|
|
2023
|
+
|
|
2024
|
+
### The lesson, sharpened
|
|
2025
|
+
|
|
2026
|
+
The first campaign's lesson ("stop optimizing the kernel in front of you; profile
|
|
2027
|
+
the whole forward") generalized into a sharper rule, recorded in the batch summaries:
|
|
2028
|
+
|
|
2029
|
+
> **On desktop Dawn, dispatch-count cuts only win when they also eliminate a wide
> round-trip on a poorly-occupied kernel.** Pure dispatch/barrier reduction on an
> already-tuned kernel is *noise*.
|
|
2032
|
+
|
|
2033
|
+
Concretely, the desktop wins came from eliminating **large, wide reads on
|
|
2034
|
+
poorly-occupied kernels** — fused conv+activation, register-blocked + f16-mixed ViT
|
|
2035
|
+
matmuls — while the tuned INT4 matmuls and the Mamba SSM sit at the **bandwidth
|
|
2036
|
+
floor**, where butterfly reductions, subgroup shuffles, and bigger N-tiles came back
|
|
2037
|
+
flat or negative and were reverted. The honest read is that **the remaining headroom
|
|
2038
|
+
is mobile, not desktop**: several fusions that were flat or reverted on desktop (the `MatMul`+`AddBias`
|
|
2039
|
+
merge, residual-Add fusion) are *predicted mobile wins* because mobile is
|
|
2040
|
+
round-trip-bound below ~32 dispatches per command buffer (§19) — which is exactly
|
|
2041
|
+
why the autoresearch loop's next leg is a mobile-validation pass rather than more
|
|
2042
|
+
desktop rounds.
|
|
2043
|
+
|
|
2044
|
+
---
|
|
2045
|
+
|
|
2046
|
+
## Appendix A: File Map
|
|
2047
|
+
|
|
2048
|
+
```
src/gpu/
ir.ts -- IR types: OpType, TensorDesc, OpNode, ModelGraph, CANONICAL_KEYS
safetensors.ts -- Safetensors binary parser with zero-copy typed views
device.ts -- WebGPU device init, buffer helpers, pipeline cache, readback
tokenizer.ts -- Pure JS BPE tokenizer from HF tokenizer.json
sampler.ts -- CPU-side token sampling (temp/top-k/top-p/rep penalty)
executor.ts -- Graph executor: buffer allocation, op dispatch, forward pass
kv-cache.ts -- GPU-resident KV cache: allocation, advance, reset, destroy
model-loader.ts -- HF Hub integration: fetch config/tokenizer/weights, generate IR (Cache-API-only; OPFS removed)
vision-executor.ts -- VisionExecutor: runs the ViT graph (single prefill over N patches)
vision-preprocess.ts -- Host-side ViT pos-embeds + 2D rotary cos/sin (grid-only, bit-exact vs HF)
moonshine-executor.ts -- MoonshineEncoderExecutor: raw-PCM Conv1d front-end + bidirectional encoder, frozen per-layer K/V
moonshine-stt.ts -- MoonshineSTT.transcribe(): AR decoder with self- + cross-attention into frozen encoder K/V
architectures/
index.ts -- Architecture registry: maps HF strings to graph generators
qwen2.ts -- Qwen2/3/3.5 graph generator (SwiGLU MLP, GQA attention; embedding tail for Qwen3-Embedding)
qwen3_5_vision.ts -- Qwen3.5 12-layer ViT encoder graph (bidirectional, 2D rotary, spatial-merge-2)
gemma3_encoder.ts -- EmbeddingGemma-300M bidirectional encoder (4-norm sandwich, dual-theta RoPE, per-head QK-norm, GeGLU, mean-pool + 2 Dense heads + L2Norm)
lfm2.ts -- LFM2.5 hybrid conv/attention text generator (Lfm2ForCausalLM; Tier-1, no new kernels)
moonshine.ts -- Moonshine STT encoder-decoder graph (raw-waveform Conv1d front-end, interleaved RoPE, cross-attention)
kani_tts.ts -- Kani-TTS-2: LFM2-350M codec-LM scaffold + NanoCodec decoder graph (FSQ + causal HiFi-GAN); decoder validated, backbone AR-loop pending
gemma4.ts -- Gemma 4 E2B Tier-2 text decoder (PLE, KV-cache sharing, proportional/dual-theta RoPE, GeGLU, final logit Softcap)
kernels/
registry.ts -- KernelSpec registry: WGSL sources, bindings, dispatch sizing, uniform builders (incl. SliceLastRow, fused decode kernels, MeanPool, Scale, L2Norm, CrossAttention, ROPE_INTERLEAVED, Conv1dFull/GroupNorm/Tanh/Transpose, FSQDequant/HalfSnake1d/ConvTranspose1dDepthwise, Softcap)
wgsl/
embedding.wgsl -- Embedding lookup (gather rows by token ID)
matmul.wgsl -- Tiled f32 matrix multiply (16x16 shared memory)
matmul_int4.wgsl -- Fused INT4 dequantize + matmul
rmsnorm.wgsl -- RMS normalization (tree reduction)
layernorm.wgsl -- Layer normalization (mean + variance, tree reduction)
rope.wgsl -- Rotary position embeddings (GQA-aware)
attention.wgsl -- Scaled dot-product attention (causal, GQA)
softmax.wgsl -- Row-wise softmax (three-pass, tree reduction)
silu.wgsl -- SiLU activation (x * sigmoid(x))
gelu.wgsl -- Approximate GELU activation
add.wgsl -- Element-wise addition
mul.wgsl -- Element-wise multiplication
```
|
|
2087
|
+
|
|
2088
|
+
---
|
|
2089
|
+
|
|
2090
|
+
## Appendix B: WebGPU Browser Compatibility
|
|
2091
|
+
|
|
2092
|
+
| Browser | Version | Status |
|---------|---------|--------|
| Chrome | 113+ (May 2023) | Stable support |
| Edge | 113+ (May 2023) | Stable support (Chromium-based) |
| Safari | 26+ (Sep 2025) | Stable support (macOS + iOS) |
| Firefox | 141+ (Jul 2025) | Stable support (Windows first) |
| Chrome Android | 121+ (Jan 2024) | Stable support |
| Safari iOS | 26+ | Stable support (via WKWebView) |
| Samsung Internet | 25+ | Stable support |
|
|
2101
|
+
|
|
2102
|
+
### Notable Limitations
|
|
2103
|
+
|
|
2104
|
+
- **iOS WKWebView memory**: the jetsam budget for a web-content process is ~1.5-2 GB. Post-fix (Sections 17.1, 18.1-18.2), Qwen3.5-0.8B INT4 at `maxSeqLen=512` runs at ~0.6-0.7 GB total GPU footprint — comfortable headroom, versus the 2.77 GB the per-tensor allocator used to request. The engine clamps iOS `maxSeqLen` to 2048 (default 512) so an oversized request can never reach the device.
|
|
2105
|
+
- **iPad WebGPU limits** (measured, iPadOS 26.5 defaults): `maxBufferSize` 256 MB, `maxStorageBufferBindingSize` 128 MB, `maxComputeWorkgroupStorageSize` 16384 bytes. The ~127 MB INT4 embedding sits just under the binding limit — larger vocabularies need sharding.
|
|
2106
|
+
- **WebKit submit granularity**: on WebKit older than iPadOS 26.5, batching many dispatches into one command buffer can zero storage reads mid-chain (Section 17.2); the engine's grouped-submit path (`?group=N`, Section 18.3) is the compatibility dial. iPadOS 26.5+ is correct at batch-all.
|
|
2107
|
+
- **Firefox**: WebGPU support arrived later than in Chromium (Firefox 141 enabled it on Windows first). Feature coverage is complete where it is enabled, but performance may vary.
|
|
2108
|
+
- **shader-f16**: Not available on all GPUs. The engine detects this at initialization and adapts accordingly (currently all kernels use f32).
|
|
2109
|
+
- **maxBufferSize**: Varies by device. The engine requests the adapter's maximum, but some mobile GPUs have limits below 256 MB, which constrains model size.
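
The limit checks in the list above can be sketched as one pure function over an adapter-like object. This is a hedged illustration, not the engine's device-init code: `assessAdapter`, `AdapterLike`, and the `isIOS` flag are hypothetical, while `limits.maxBufferSize`, `limits.maxStorageBufferBindingSize`, and `features.has("shader-f16")` are standard WebGPU adapter members.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface LimitsLike {
  maxBufferSize: number;
  maxStorageBufferBindingSize: number;
}
interface AdapterLike {
  limits: LimitsLike;
  features: { has(name: string): boolean };
}

interface Capabilities {
  needsSharding: boolean; // largest tensor does not fit in one binding
  hasShaderF16: boolean; // optional feature, absent on some GPUs
  maxSeqLen: number; // after the iOS clamp described above
}

// Hypothetical helper: decide sharding, f16 availability, and the seq-len clamp.
function assessAdapter(
  adapter: AdapterLike,
  largestTensorBytes: number,
  requestedSeqLen = 512,
  isIOS = false,
): Capabilities {
  const bindingCap = Math.min(adapter.limits.maxBufferSize, adapter.limits.maxStorageBufferBindingSize);
  return {
    needsSharding: largestTensorBytes > bindingCap,
    hasShaderF16: adapter.features.has("shader-f16"),
    maxSeqLen: isIOS ? Math.min(requestedSeqLen, 2048) : requestedSeqLen,
  };
}

// Mock adapter using the iPad limits quoted above.
const MB = 1024 * 1024;
const iPadLike: AdapterLike = {
  limits: { maxBufferSize: 256 * MB, maxStorageBufferBindingSize: 128 * MB },
  features: new Set<string>(),
};
const fits = assessAdapter(iPadLike, 127 * MB, 4096, true);
const tooBig = assessAdapter(iPadLike, 201 * MB);
```

In a browser, the object returned by `navigator.gpu.requestAdapter()` satisfies `AdapterLike` structurally, so the same function can be called with the real adapter.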
|