@tryhamster/gerbil 1.0.0-rc.9 → 1.0.1

Files changed (179)
  1. package/LICENSE +1 -1
  2. package/README.md +318 -104
  3. package/dist/architectures-C1I5V3Dt.mjs +6070 -0
  4. package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
  5. package/dist/browser/index.d.ts +276 -590
  6. package/dist/browser/index.d.ts.map +1 -1
  7. package/dist/browser/index.js +592 -2334
  8. package/dist/browser/index.js.map +1 -1
  9. package/dist/cli.mjs +625 -1098
  10. package/dist/cli.mjs.map +1 -1
  11. package/dist/defaults-9komdrbY.mjs +24 -0
  12. package/dist/defaults-9komdrbY.mjs.map +1 -0
  13. package/dist/frameworks/express.d.mts +1 -3
  14. package/dist/frameworks/express.d.mts.map +1 -1
  15. package/dist/frameworks/express.mjs +7 -7
  16. package/dist/frameworks/express.mjs.map +1 -1
  17. package/dist/frameworks/fastify.d.mts +1 -1
  18. package/dist/frameworks/fastify.d.mts.map +1 -1
  19. package/dist/frameworks/fastify.mjs +3 -3
  20. package/dist/frameworks/fastify.mjs.map +1 -1
  21. package/dist/frameworks/hono.d.mts +1 -1
  22. package/dist/frameworks/hono.d.mts.map +1 -1
  23. package/dist/frameworks/hono.mjs +4 -4
  24. package/dist/frameworks/hono.mjs.map +1 -1
  25. package/dist/frameworks/next.d.mts +3 -2
  26. package/dist/frameworks/next.d.mts.map +1 -1
  27. package/dist/frameworks/next.mjs +4 -4
  28. package/dist/frameworks/next.mjs.map +1 -1
  29. package/dist/frameworks/react.d.mts +1 -1
  30. package/dist/frameworks/trpc.d.mts +1 -1
  31. package/dist/frameworks/trpc.d.mts.map +1 -1
  32. package/dist/frameworks/trpc.mjs +4 -4
  33. package/dist/frameworks/trpc.mjs.map +1 -1
  34. package/dist/gerbil-BetB5xb0.d.mts +488 -0
  35. package/dist/gerbil-BetB5xb0.d.mts.map +1 -0
  36. package/dist/gerbil-CTZUa8EZ.mjs +4 -0
  37. package/dist/gerbil-DNniplr4.mjs +1656 -0
  38. package/dist/gerbil-DNniplr4.mjs.map +1 -0
  39. package/dist/gpu/hooks.d.mts +640 -0
  40. package/dist/gpu/hooks.d.mts.map +1 -0
  41. package/dist/gpu/hooks.mjs +1369 -0
  42. package/dist/gpu/hooks.mjs.map +1 -0
  43. package/dist/gpu/index.d.mts +2 -0
  44. package/dist/gpu/index.mjs +6 -0
  45. package/dist/gpu-DFuglcEx.mjs +3790 -0
  46. package/dist/gpu-DFuglcEx.mjs.map +1 -0
  47. package/dist/index-Dgmb2kE3.d.mts +245 -0
  48. package/dist/index-Dgmb2kE3.d.mts.map +1 -0
  49. package/dist/index-DukkJRMj.d.mts +2114 -0
  50. package/dist/index-DukkJRMj.d.mts.map +1 -0
  51. package/dist/index.d.mts +22 -487
  52. package/dist/index.d.mts.map +1 -1
  53. package/dist/index.mjs +13 -8
  54. package/dist/index.mjs.map +1 -1
  55. package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
  56. package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
  57. package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
  58. package/dist/integrations/ai-sdk.d.mts +75 -6
  59. package/dist/integrations/ai-sdk.d.mts.map +1 -1
  60. package/dist/integrations/ai-sdk.mjs +131 -15
  61. package/dist/integrations/ai-sdk.mjs.map +1 -1
  62. package/dist/integrations/langchain.d.mts +1 -1
  63. package/dist/integrations/langchain.d.mts.map +1 -1
  64. package/dist/integrations/langchain.mjs +5 -5
  65. package/dist/integrations/langchain.mjs.map +1 -1
  66. package/dist/integrations/llamaindex.d.mts +1 -1
  67. package/dist/integrations/llamaindex.d.mts.map +1 -1
  68. package/dist/integrations/llamaindex.mjs +5 -5
  69. package/dist/integrations/llamaindex.mjs.map +1 -1
  70. package/dist/integrations/mcp-client.mjs +3 -3
  71. package/dist/integrations/mcp-client.mjs.map +1 -1
  72. package/dist/integrations/mcp.d.mts +3 -2
  73. package/dist/integrations/mcp.d.mts.map +1 -1
  74. package/dist/integrations/mcp.mjs +5 -5
  75. package/dist/{mcp-BvbriaBy.mjs → mcp-D2vvH1Xc.mjs} +4 -4
  76. package/dist/mcp-D2vvH1Xc.mjs.map +1 -0
  77. package/dist/memory/index.d.mts +3 -0
  78. package/dist/memory/index.mjs +6 -0
  79. package/dist/memory-D1P7Tmda.mjs +4 -0
  80. package/dist/memory-DVN0MnIG.mjs +132 -0
  81. package/dist/memory-DVN0MnIG.mjs.map +1 -0
  82. package/dist/memory-Dj0J1v88.mjs +294 -0
  83. package/dist/memory-Dj0J1v88.mjs.map +1 -0
  84. package/dist/moonshine-stt-17dpP1kr.mjs +4 -0
  85. package/dist/moonshine-stt-4ojLtMq7.mjs +11962 -0
  86. package/dist/moonshine-stt-4ojLtMq7.mjs.map +1 -0
  87. package/dist/{one-liner-s-lD8rCC.mjs → one-liner-JhdIPxzF.mjs} +14 -16
  88. package/dist/one-liner-JhdIPxzF.mjs.map +1 -0
  89. package/dist/repl-BDRkwPGX.mjs +9 -0
  90. package/dist/skills/index.d.mts +270 -320
  91. package/dist/skills/index.d.mts.map +1 -1
  92. package/dist/skills/index.mjs +5 -5
  93. package/dist/{skills-CD3Orlex.mjs → skills-CU694Dc8.mjs} +187 -32
  94. package/dist/skills-CU694Dc8.mjs.map +1 -0
  95. package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
  96. package/dist/tools-DQ1mPUw5.mjs.map +1 -0
  97. package/dist/types-DQBe2lFo.d.mts +165 -0
  98. package/dist/types-DQBe2lFo.d.mts.map +1 -0
  99. package/dist/{types-CiTc7ez3.d.mts → types-LlyYILII.d.mts} +112 -14
  100. package/dist/types-LlyYILII.d.mts.map +1 -0
  101. package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
  102. package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
  103. package/dist/vector-B0panuy6.mjs +95 -0
  104. package/dist/vector-B0panuy6.mjs.map +1 -0
  105. package/docs/PROJECT-STATE.md +321 -0
  106. package/docs/adding-a-model-family.md +280 -0
  107. package/docs/ai-sdk.md +70 -61
  108. package/docs/architecture/overview.md +17 -7
  109. package/docs/browser.md +203 -8
  110. package/docs/embeddings.md +156 -0
  111. package/docs/gerbil-site-native-migration.md +217 -0
  112. package/docs/gpu-engine/architectures.md +398 -0
  113. package/docs/gpu-engine/ir.md +372 -0
  114. package/docs/gpu-engine/kernels.md +718 -0
  115. package/docs/gpu-engine/paper.html +1759 -0
  116. package/docs/gpu-engine/paper.md +2109 -0
  117. package/docs/gpu-engine/safetensors.md +312 -0
  118. package/docs/gpu-engine/tokenizer.md +302 -0
  119. package/docs/memory-rag.md +91 -0
  120. package/docs/metal-safari-intel.md +190 -0
  121. package/docs/mobile-failure-diagnosis.md +124 -0
  122. package/docs/mobile.md +99 -0
  123. package/docs/observability.md +230 -0
  124. package/docs/onnx-removal-plan.md +339 -0
  125. package/docs/research/autoresearch-portable.md +904 -0
  126. package/docs/research/dispatch-reduction-hivemind.md +84 -0
  127. package/docs/research/ios-safari-model-caching.md +117 -0
  128. package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
  129. package/docs/research/native-stt-model-selection.md +49 -0
  130. package/docs/research/native-tts-model-selection.md +90 -0
  131. package/docs/research/native-vs-chromium-decision.md +152 -0
  132. package/docs/research/nemotron-mamba2-inference.md +910 -0
  133. package/docs/research/qwen35-multimodal.md +293 -0
  134. package/docs/research/qwen36-gemma4-targets.md +337 -0
  135. package/docs/research/sota-embedding-models.md +179 -0
  136. package/docs/research/sota-mobile-models-2026.md +263 -0
  137. package/docs/research/sota-modality-models.md +202 -0
  138. package/docs/research/tps-baselines.md +71 -0
  139. package/docs/research/webgpu-m4-reference.md +104 -0
  140. package/docs/site-update-plan.md +155 -0
  141. package/docs/structured-output.md +123 -0
  142. package/docs/stt.md +63 -446
  143. package/docs/tts.md +77 -499
  144. package/docs/vision.md +100 -338
  145. package/package.json +22 -7
  146. package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
  147. package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
  148. package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
  149. package/dist/gerbil-CJ3ifloF.mjs +0 -4
  150. package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
  151. package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
  152. package/dist/gerbil-qOTe1nl2.d.mts +0 -431
  153. package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
  154. package/dist/kokoro-BNTb6egA.mjs +0 -20210
  155. package/dist/kokoro-BNTb6egA.mjs.map +0 -1
  156. package/dist/kokoro-CMOGDSgT.js +0 -20212
  157. package/dist/kokoro-CMOGDSgT.js.map +0 -1
  158. package/dist/mcp-BvbriaBy.mjs.map +0 -1
  159. package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
  160. package/dist/repl-DveXw36T.mjs +0 -9
  161. package/dist/skills-CD3Orlex.mjs.map +0 -1
  162. package/dist/stt-Bu-E23Sc.js +0 -433
  163. package/dist/stt-Bu-E23Sc.js.map +0 -1
  164. package/dist/stt-CpLYbGFd.mjs +0 -433
  165. package/dist/stt-CpLYbGFd.mjs.map +0 -1
  166. package/dist/stt-DRPLEEHB.mjs +0 -3
  167. package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
  168. package/dist/transformers.web-DiD1gTwk.js +0 -44695
  169. package/dist/transformers.web-DiD1gTwk.js.map +0 -1
  170. package/dist/transformers.web-u34VxRFM.js +0 -3
  171. package/dist/tts-CqroPaSK.js +0 -724
  172. package/dist/tts-CqroPaSK.js.map +0 -1
  173. package/dist/tts-DXgsKGCe.mjs +0 -3
  174. package/dist/tts-DeGANMNV.mjs +0 -730
  175. package/dist/tts-DeGANMNV.mjs.map +0 -1
  176. package/dist/types-CiTc7ez3.d.mts.map +0 -1
  177. /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
  178. /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
  179. /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,179 @@
1
+ # SOTA Small Embedding Models for Gerbil on iPad (WebGPU)
2
+
3
+ > Research date: 2026-06-13. Verified against live HuggingFace `config.json` / file lists / model cards
4
+ > and the MTEB leaderboard, cross-checked with Perplexity (`sonar-pro`). Engine capabilities verified
5
+ > against `src/gpu/` (model-loader, architectures, kernels).
6
+
7
+ ## TL;DR — the ranked pick
8
+
9
+ | Rank | Model | Arch | q4 size | Compatible q4 format on HF? | Engine work to run it |
10
+ |------|-------|------|---------|------------------------------|------------------------|
11
+ | **1** | **`google/embeddinggemma-300m`** (via `mlx-community/embeddinggemma-300m-4bit`) | **Bidirectional** Gemma3 encoder | **~173 MB** | **YES — standard MLX-4bit** (QAT base, not DWQ) | New **Tier-2 Gemma3 encoder generator** (medium effort) |
12
+ | **2** | **`Qwen/Qwen3-Embedding-0.6B`** | Causal-LM (Qwen3) | ~335 MB (MLX) / ~538 MB (compressed-tensors) | **Risky** — see notes | None for arch (runs today on desktop); but **no clean-verified q4** |
13
+ | **3** | **`Snowflake/snowflake-arctic-embed-s`** or **`bge-small-en-v1.5`** | Bidirectional BERT encoder | ~70–130 MB at q4 (on-the-fly from F32) | No q4 we can read (only ONNX q4); use **on-the-fly INT4 from F32** | New **Tier-2 BERT encoder generator** (smaller effort than Gemma3) |
14
+
15
+ **Recommendation:** ship **EmbeddingGemma-300M** as Gerbil's default iPad embedding model. It is the
16
+ single highest-quality model under 1B params (MTEB English v2 **68.36**, Multilingual v2 **61.15**),
17
+ fits the iPad budget at **~173 MB q4**, and — critically — ships a **standard-affine MLX-4bit** quant
18
+ derived from Google's **quantization-aware-trained (QAT)** base, so q4 quality is near-lossless and the
19
+ format is one our `repackMLX` loader already understands. The cost is one new **bidirectional Gemma3
20
+ encoder generator**. Most of the prerequisite ops already exist (the new `is_causal=false` attention flag,
21
+ RMSNorm, RoPE, GELU-tanh, L2Norm); the one missing op is mean pooling.
22
+
23
+ ---
24
+
25
+ ## Engine capability ground-truth (verified in `src/gpu/`)
26
+
27
+ What the loader actually detects (`src/gpu/model-loader.ts` ~L567–584):
28
+
29
+ ```ts
30
+ const quantConfig = rawConfig.quantization_config;
31
+ const isGPTQ = quantConfig?.quant_method === "gptq";
32
+ const isMLX = !isGPTQ && quantConfig?.bits === 4 && quantConfig?.mode === "affine";
33
+ ```
34
+
35
+ - **GPTQ-Int4**: read via `repackGPTQ` (`src/gpu/gptq-adapter.ts`). Detects `quant_method === "gptq"`.
36
+ - **MLX 4-bit**: read via `repackMLX` (`src/gpu/mlx-adapter.ts`). Tensors `.weight`/`.scales`/`.biases`.
37
+ - **On-the-fly INT4 from F32/BF16**: `quantizeInt4` (`src/gpu/quantize.ts`), group size 128.
38
+ - **NOT supported**: `compressed-tensors` / `pack-quantized`, GGUF, ONNX, MLX **DWQ**.
39
+
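The "affine" 4-bit layout referred to above (per-group `.scales` and `.biases` next to the packed weights) can be illustrated with a small round-trip sketch. This is not Gerbil's `quantizeInt4` or `repackMLX`: the function names, the unpacked one-code-per-byte storage, and the min/max calibration are assumptions made for illustration only.

```typescript
// Illustrative group-wise affine INT4 quantization: w ≈ q * scale + bias, with q in 0..15.
// Assumption: hypothetical helper names; codes are stored unpacked (one per byte) for clarity.
interface QuantizedGroups {
  q: Uint8Array;        // one 4-bit code per weight
  scales: Float32Array; // one scale per group
  biases: Float32Array; // one bias (the group minimum) per group
  groupSize: number;
}

function quantizeAffineInt4(weights: Float32Array, groupSize = 128): QuantizedGroups {
  const numGroups = Math.ceil(weights.length / groupSize);
  const q = new Uint8Array(weights.length);
  const scales = new Float32Array(numGroups);
  const biases = new Float32Array(numGroups);
  for (let g = 0; g < numGroups; g++) {
    const start = g * groupSize;
    const end = Math.min(start + groupSize, weights.length);
    let min = Infinity;
    let max = -Infinity;
    for (let i = start; i < end; i++) {
      min = Math.min(min, weights[i]);
      max = Math.max(max, weights[i]);
    }
    // 16 levels span [min, max]; a constant group gets scale 1 so we never divide by zero.
    const scale = max > min ? (max - min) / 15 : 1;
    scales[g] = scale;
    biases[g] = min;
    for (let i = start; i < end; i++) {
      q[i] = Math.min(15, Math.max(0, Math.round((weights[i] - min) / scale)));
    }
  }
  return { q, scales, biases, groupSize };
}

function dequantizeAffineInt4(packed: QuantizedGroups): Float32Array {
  const out = new Float32Array(packed.q.length);
  for (let i = 0; i < out.length; i++) {
    const g = Math.floor(i / packed.groupSize);
    out[i] = packed.q[i] * packed.scales[g] + packed.biases[g];
  }
  return out;
}
```

With this scheme the reconstruction error of each weight is bounded by half the group's scale. That bound only holds when the stored tensors really are affine codes, which is why a repack that assumes this layout can produce garbage on weights stored differently.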
40
+ ⚠️ **Two loader caveats this research surfaced (both are quick fixes, but load-bearing):**
41
+ 1. **MLX detection currently requires `mode === "affine"`**, but the EmbeddingGemma and Qwen3 MLX
42
+ converts on HF write only `{"bits":4,"group_size":64}` with **no `mode` field** (and may put it under
43
+ `quantization` rather than `quantization_config`). So the detector must be broadened to recognize
44
+ `{bits:4, group_size}` MLX configs and read the `quantization` key, else they fall through to F32.
45
+ 2. **DWQ vs standard MLX is indistinguishable from config** (both say `{bits:4,group_size:64}`). The
46
+ difference is in tensor data. This is exactly why the Qwen3 `...-4bit-DWQ` produces garbage — the
47
+ repack treats it as affine MLX but the weights aren't. We must pin to **known-good standard converts**.
48
+
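Caveat 1 could be addressed along the following lines. This is a minimal sketch, not the actual `model-loader.ts` change: the `RawConfig` / `QuantConfig` interfaces and the `detectQuantFormat` helper are hypothetical names that model only the fields mentioned in the text.

```typescript
// Hypothetical shape of the relevant config.json fields (assumption: only fields named in the text).
interface QuantConfig {
  quant_method?: string;
  bits?: number;
  group_size?: number;
  mode?: string;
}

interface RawConfig {
  quantization_config?: QuantConfig;
  quantization?: QuantConfig; // some MLX converts use this key instead
}

type QuantFormat = "gptq" | "mlx" | "none";

function detectQuantFormat(rawConfig: RawConfig): QuantFormat {
  // Read either key, preferring quantization_config when both are present.
  const quantConfig = rawConfig.quantization_config ?? rawConfig.quantization;
  if (!quantConfig) return "none";
  if (quantConfig.quant_method === "gptq") return "gptq";
  // Accept an explicit "affine" mode or a missing mode field.
  const modeOk = quantConfig.mode === undefined || quantConfig.mode === "affine";
  const hasGroupSize = typeof quantConfig.group_size === "number";
  // Note: DWQ converts carry the same config, so this check cannot tell them apart (caveat 2).
  return quantConfig.bits === 4 && hasGroupSize && modeOk ? "mlx" : "none";
}
```

Because the config alone cannot separate DWQ from standard affine converts, a check like this would still need to be paired with a pinned list of known-good repos.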
49
+ Architecture support today (`src/gpu/architectures/`):
50
+ - `Qwen2ForCausalLM`, `Qwen3ForCausalLM` → `generateQwen2Graph` (causal). **Qwen3-Embedding runs here**
51
+ with `embedding:true` → `SliceLastRow` (last-token EOS pool) + `L2Norm`.
52
+ - `Qwen3_5ForConditionalGeneration` (hybrid Mamba) + a Qwen3.5 **vision ViT** that already uses
53
+ **bidirectional attention** (`causal:false`) — proof the non-causal path works end-to-end.
54
+ - **No Gemma3 generator. No BERT/encoder text generator. Only last-token + L2Norm pooling (no mean pooling).**
55
+
56
+ `is_causal` flag: defined in the Attention kernel (`src/gpu/kernels/registry.ts`, `Params.is_causal: u32`);
57
+ `is_causal=0` → attend to all keys. Consumed by the Qwen3.5 vision tower today.
58
+
59
+ ---
60
+
61
+ ## The survey (live-verified configs + sizes + MTEB)
62
+
63
+ | Model (HF repo) | model_type / arch | Params | hidden×layers | Pooling | MTEB | F32/BF16 size | q4 on HF (compatible?) |
64
+ |---|---|---|---|---|---|---|---|
65
+ | **google/embeddinggemma-300m** | `gemma3_text` / `Gemma3TextModel`, **`use_bidirectional_attention: true`** | 300M | 768 × 24 | **mean** + 2×Dense + L2Norm | **68.36 En / 61.15 Multi (v2)** | ~1.2 GB f32 | **`mlx-community/embeddinggemma-300m-4bit` = 173 MB, standard MLX-4bit, QAT base ✅** (gated, instant ack) |
66
+ | **Qwen/Qwen3-Embedding-0.6B** | `qwen3` / `Qwen3ForCausalLM` | 0.6B | 1024 × 28 | last-token EOS + L2Norm | ~64 Multi (8B=70.58 #1) | **1192 MB bf16** | DWQ (broken ❌), `kerncore/...-MXL-4bit` 335 MB (MLX, **DWQ-contaminated card, risky**), `boboliu/...-W4A16-G128` 538 MB (compressed-tensors ❌ not read) |
67
+ | **BAAI/bge-base-en-v1.5** | `bert` / `BertModel`, absolute pos | 110M | 768 × 12 | CLS (or mean) + L2Norm | 63.5 En | 438 MB f32 | only ONNX q4 ❌ → on-the-fly INT4 from F32 |
68
+ | **BAAI/bge-small-en-v1.5** | `bert` / `BertModel` | ~33M | 384 × 12 | CLS + L2Norm | 61.9 En | 133 MB f32 | only ONNX q4 ❌ → on-the-fly INT4 |
69
+ | **thenlper/gte-base** | `bert` / `BertModel` | 110M | 768 × 12 | mean + L2Norm | 62.4 En | 219 MB f32 | only ONNX/OpenVINO q8 ❌ → on-the-fly INT4 |
70
+ | **thenlper/gte-small** | `bert` / `BertModel` | ~33M | 384 × 12 | mean + L2Norm | 62.0 En | 67 MB f32 | only ONNX q8 ❌ → on-the-fly INT4 |
71
+ | **Snowflake/snowflake-arctic-embed-s** | `bert` / `BertModel`, absolute pos | ~33M | 384 × 12 | CLS + L2Norm | high mid (En) | 133 MB f32 | only ONNX q4 ❌ → on-the-fly INT4 |
72
+ | **Snowflake/snowflake-arctic-embed-xs** | `bert` / `BertModel` | ~23M | 384 × **6** | CLS + L2Norm | mid (En) | 90 MB f32 | only ONNX q4 ❌ → on-the-fly INT4 |
73
+ | **Snowflake/snowflake-arctic-embed-m-v2.0** | `gte` / `GteModel`, **RoPE**, vocab 250k | 305M | 768 × 12 | CLS + L2Norm, Matryoshka 256 | ~55–56 Multi | 1221 MB f32 | only ONNX q4 ❌ (too big anyway) |
74
+ | **nomic-ai/nomic-embed-text-v2-moe** | `nomic_bert` **MoE** (megablocks, 8 experts) | 475M (305M active) | 768 × 12 | mean + L2Norm | strong Multi | f32 large | ❌ MoE megablocks — not supported, skip |
75
+ | **jinaai/jina-embeddings-v3** | `xlm-roberta` + **5 LoRA adapters**, RoPE, Matryoshka | 572M | 1024 × 24 | mean + task-LoRA | strong Multi | 1.1 GB bf16 | ❌ LoRA-adapter routing not supported, too big, skip |
76
+ | **ibm-granite/granite-embedding-278m-multilingual** | `xlm-roberta` / `XLMRobertaModel`, absolute pos | 278M | 768 × 12 | mean (or CLS) + L2Norm | mid-low Multi | 556 MB bf16 | only standard safetensors → on-the-fly INT4 (~150 MB) |
77
+ | **minishlab/potion-base-8M** | `model2vec` / `StaticModel` (**static**, not a transformer) | 8M | 256 dim, PCA, no layers | token-lookup + mean | **51.08** (much lower) | 30 MB f32 | n/a — trivially fast, but a different (non-graph) code path |
78
+
79
+ MTEB sources: EmbeddingGemma & Qwen3 figures from official model cards (EmbeddingGemma card:
80
+ 68.36 En v2 / 61.15 Multi v2; Qwen3 card: 8B #1 at 70.58, 0.6B is the small-tier sibling). BGE/GTE
81
+ English averages cross-confirmed via the AILog MTEB compilation. Potion from its own card (51.08 MTEB).
82
+
83
+ ---
84
+
85
+ ## Why EmbeddingGemma-300M is the pick (#1)
86
+
87
+ - **Quality:** best open model under 1B params — MTEB English v2 **68.36**, Multilingual v2 **61.15**.
88
+ That is materially above every BERT-small option (61–63) and competitive with Qwen3-0.6B while being
89
+ half the download.
90
+ - **Size:** **~173 MB at q4** — comfortably inside the iPad <300 MB target.
91
+ - **Format we can read:** `mlx-community/embeddinggemma-300m-4bit` is a **standard-affine MLX-4bit**
92
+ convert (`.scales`/`.biases` tensors, `mlx-lm`) of Google's **QAT-q4_0** base
93
+ (`google/embeddinggemma-300m-qat-q4_0-unquantized`). Because the base is quantization-aware-trained,
94
+ q4 is near-lossless. This is the same `repackMLX` path we already use — **not** the broken DWQ variant.
95
+ - **Architecture fits our new capability:** it is a **bidirectional Gemma3 encoder**
96
+ (`use_bidirectional_attention: true`) — exactly the case the new `is_causal=false` flag unlocked.
97
+
98
+ ### Exact engine work for EmbeddingGemma (Tier-2 generator)
99
+
100
+ Config (verified): `gemma3_text`, hidden **768**, **24** layers, **3** attention heads (head_dim 256,
101
+ GQA `num_kv_heads:1`), intermediate **1152**, vocab **262144**, `max_position_embeddings` **2048**,
102
+ `gelu_pytorch_tanh`, `rms_norm_eps 1e-6`, `query_pre_attn_scalar 256`. Pooling = **mean** then two
103
+ **Dense** projection layers (the Matryoshka/bottleneck head) then **L2 Normalize**.
104
+
105
+ New generator `architectures/gemma3_embedding.ts` (or extend a gemma3 generator) needs:
106
+ 1. **Bidirectional attention** — set `causal: false` on every Attention op (flag already exists). ✅ op-ready
107
+ 2. **Gemma3 attention specifics:** GQA (heads 3 / kv 1), `query_pre_attn_scalar=256` scaling, head_dim 256.
108
+ 3. **Mixed `layer_types`** — alternating `sliding_attention` (window 512) and `full_attention` (every 6th
109
+ layer). For a 2048-max encoder you can **treat all layers as full bidirectional** (window 512 only
110
+ bounds long contexts; embeddings are short). This is the simplest approximation, and it is exact for
111
+ inputs that fit inside the window. Sliding-window masking can be a later optimization.
112
+ 4. **Dual RoPE bases:** `rope_local_base_freq=10000` (sliding/local layers) vs `rope_theta=1,000,000`
113
+ (full/global layers). RoPE op exists; needs per-layer theta selection.
114
+ 5. **RMSNorm** (have it), **GELU-tanh** (have it), **AddBias** (have it).
115
+ 6. **Mean-pooling tail** — **NEW**: we only have `SliceLastRow` (last-token) + `L2Norm`. Add a
116
+ `MeanPool` op (sum over valid tokens / count). Small kernel.
117
+ 7. **Two Dense head layers** (768→… →768) — plain MatMul(Int4)+AddBias, then **L2Norm** (have it).
118
+ 8. **Loader fix:** broaden MLX detection to accept `quantization: {bits:4, group_size:64}` with no
119
+ `mode` field (see caveat #1 above), and read the `dense.*` head tensors.
120
+
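The mean-pooling tail from item 6 and the final L2 normalization can be pinned down with a CPU reference, which is also useful as a test oracle for a GPU kernel. This is a sketch under assumptions: the function names are hypothetical, `hidden` is taken to be row-major `[seqLen, dim]`, and the two Dense head layers that sit between pooling and normalization in EmbeddingGemma are omitted.

```typescript
// Reference (CPU) sketch of mean pooling over valid tokens, then L2 normalization.
// Assumption: `hidden` is row-major [seqLen, dim]; mask[t] is 1 for real tokens, 0 for padding.
function meanPool(hidden: Float32Array, seqLen: number, dim: number, mask: Uint8Array): Float32Array {
  const out = new Float32Array(dim);
  let count = 0;
  for (let t = 0; t < seqLen; t++) {
    if (mask[t] === 0) continue; // skip padding tokens
    count++;
    for (let d = 0; d < dim; d++) out[d] += hidden[t * dim + d];
  }
  if (count === 0) return out; // no valid tokens: return zeros rather than NaN
  for (let d = 0; d < dim; d++) out[d] /= count;
  return out;
}

function l2Normalize(v: Float32Array, eps = 1e-12): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.max(Math.sqrt(sumSq), eps); // eps guards against division by zero
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) out[i] = v[i] / norm;
  return out;
}
```

Dividing by the count of valid tokens rather than by `seqLen` keeps padded batches from diluting the embedding.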
121
+ Effort: **medium** (one generator + one MeanPool kernel + a small loader tweak). Gated (gemma license),
122
+ but the gate is an instant click-through; we can also point at an ungated standard-MLX mirror or convert
123
+ from the ungated `google/embeddinggemma-300m-qat-q4_0-unquantized` base ourselves.
124
+
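The per-layer theta selection from item 4 might look like the sketch below. The config field names (`rope_theta`, `rope_local_base_freq`, `layer_types`) come from the text above; the helper names are hypothetical, and the inverse-frequency formula is the standard RoPE definition rather than anything read from Gerbil's RoPE op.

```typescript
type LayerType = "sliding_attention" | "full_attention";

// Subset of the Gemma3 text config used here (field names as quoted in the text).
interface Gemma3RopeConfig {
  rope_theta: number;           // base for full/global layers
  rope_local_base_freq: number; // base for sliding/local layers
  layer_types: LayerType[];     // one entry per layer, taken verbatim from config.json
}

// Hypothetical helper: pick the RoPE base for one layer from its declared type.
function ropeThetaForLayer(cfg: Gemma3RopeConfig, layerIndex: number): number {
  const layerType = cfg.layer_types[layerIndex];
  if (layerType === undefined) throw new Error(`no layer_types entry for layer ${layerIndex}`);
  return layerType === "full_attention" ? cfg.rope_theta : cfg.rope_local_base_freq;
}

// Standard RoPE inverse frequencies: theta^(-2i / headDim) for i in [0, headDim / 2).
function ropeInverseFrequencies(theta: number, headDim: number): Float32Array {
  const half = headDim / 2;
  const invFreq = new Float32Array(half);
  for (let i = 0; i < half; i++) invFreq[i] = Math.pow(theta, (-2 * i) / headDim);
  return invFreq;
}
```

Reading `layer_types` from the config instead of hard-coding the interleave keeps the generator correct even if a convert changes the pattern.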
125
+ ---
126
+
127
+ ## #2 Qwen3-Embedding-0.6B — runs today, but q4 is the problem
128
+
129
+ Architecture already works on our engine (causal path, last-token EOS pool + L2Norm) — **zero generator
130
+ work**. The blocker is purely the quant:
131
+ - BF16 is **1192 MB** — too heavy to ship/quantize on-device on iPad.
132
+ - `mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ` → **garbage in our engine** (DWQ).
133
+ - `kerncore/Qwen3-Embedding-0.6B-MXL-4bit` (335 MB) → standard MLX-4bit **in name**, but its model card
134
+ text is copy-pasted from the **DWQ** repo, so provenance is **unverified/risky**. Would need a runtime
135
+ correctness check (cosine vs reference) before trusting.
136
+ - `boboliu/Qwen3-Embedding-0.6B-W4A16-G128` (538 MB) → **compressed-tensors / pack-quantized**, which our
137
+ loader does **not** read. Would need a new compressed-tensors adapter (and 538 MB is over budget).
138
+
139
+ Net: viable only if we either (a) verify the `kerncore` MLX convert is truly affine (not DWQ), or
140
+ (b) produce our own clean MLX-4bit convert of Qwen3-Embedding-0.6B from BF16. Lower quality-per-MB and
141
+ larger than EmbeddingGemma, so it's the fallback, not the default.
142
+
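The "cosine vs reference" check mentioned for the `kerncore` convert could be as small as the sketch below. The helper names and the `0.99` threshold are assumptions for illustration; a real acceptance threshold would need to be chosen from measurements on a known-good convert.

```typescript
// Cosine similarity between two embeddings of equal length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("embedding length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector cannot match anything
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical gate: every candidate embedding must stay close to its reference embedding.
// Assumption: minCosine = 0.99 is a placeholder, not a validated threshold.
function looksLikeCleanConvert(
  candidate: Float32Array[],
  reference: Float32Array[],
  minCosine = 0.99,
): boolean {
  if (candidate.length !== reference.length) throw new Error("prompt count mismatch");
  return candidate.every((vec, i) => cosineSimilarity(vec, reference[i]) >= minCosine);
}
```

Here `reference` would hold embeddings of a fixed prompt set computed from the BF16 weights, and `candidate` the embeddings of the same prompts produced from the q4 convert under test.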
143
+ ## #3 BERT-small encoders — smallest, but needs a different generator and lower quality
144
+
145
+ `bge-small` / `gte-small` / `arctic-embed-s` are classic `BertModel` bidirectional encoders
146
+ (hidden 384, 12 layers, absolute position embeddings, ~33M params). At q4 they're **~30–130 MB** — tiny.
147
+ But:
148
+ - They ship q4 **only as ONNX** (`onnx/model_q4.onnx`) which we don't read; we'd quantize **on-the-fly
149
+ from F32** (no QAT, so slightly more quality loss than EmbeddingGemma's QAT base).
150
+ - They need a **new BERT encoder generator**: token + **learned absolute position** embeddings (+ optional
151
+ token-type embedding), bidirectional attention (`causal:false`), GELU, post-LN LayerNorm (have it), and
152
+ **mean or CLS pooling** (CLS = take row 0, easy; mean = new MeanPool op). **No RoPE** → arguably *simpler*
153
+ than the Gemma3 generator.
154
+ - Quality tops out ~62 MTEB(En) — below EmbeddingGemma's 68.36.
155
+
156
+ Good "absolute floor" option if we want the smallest possible default or want to validate the Tier-2
157
+ encoder path on a simpler architecture first (`gte-small`/`bge-small` is the easiest first encoder to land).
158
+
159
+ ## Explicitly rejected
160
+
161
+ - **nomic-embed-text-v2-moe** — megablocks MoE (8 experts, top-2). No MoE support; skip.
162
+ - **jina-embeddings-v3** — XLM-Roberta + 5 task LoRA adapters + custom flash impl; routing not supported,
163
+ 572M too big; skip.
164
+ - **arctic-embed-m-v2.0** — 1221 MB f32, GteModel w/ RoPE + 250k vocab; over budget; skip (the *small*
165
+ arctic is the keeper).
166
+ - **potion-base-8M** — static Model2Vec (51 MTEB). Not a transformer graph; would be a separate
167
+ token-lookup+pool code path. Worth a cheap "ultra-light" mode later, but quality is well below the rest.
168
+
169
+ ---
170
+
171
+ ## Suggested path
172
+
173
+ 1. **Land EmbeddingGemma-300M** as the default iPad embedding model:
174
+ - Add the Gemma3 bidirectional encoder generator + `MeanPool` kernel + Dense head + loader MLX-detect fix.
175
+ - Use `mlx-community/embeddinggemma-300m-4bit` (or self-convert the QAT base) — ~173 MB, near-lossless q4.
176
+ 2. (Optional, faster first win) **Land `gte-small`/`bge-small` BERT encoder generator** first to de-risk
177
+ the non-causal + mean-pool path on the simplest architecture (~30–67 MB), then reuse the MeanPool for Gemma3.
178
+ 3. Keep **Qwen3-Embedding-0.6B** as a desktop/high-RAM option; only ship its q4 on iPad after verifying a
179
+ clean (non-DWQ) MLX convert against a reference cosine-similarity check.
@@ -0,0 +1,263 @@
1
+ # SOTA Small / Mobile OSS LLMs for Gerbil — Research Report (June 2026)
2
+
3
+ Research date: 2026-06-13. Engine baseline: Gerbil WebGPU INT4 engine, currently running
4
+ Qwen3.5-0.8B (hybrid SSM + attention) at 207 tok/s desktop / 51 tok/s mobile Safari.
5
+
6
+ Method: facts below are taken from **primary sources** wherever possible — live HuggingFace
7
+ `config.json` / safetensors headers fetched directly (ground truth for architecture and size),
8
+ backed by model cards and Perplexity web search for release context. Each claim is marked
9
+ **[config]** (verified from the live HF config/safetensors), **[card/docs]** (official model card
10
+ or vendor docs), or **[social/secondary]** (X/Twitter or third-party roundup, treat as softer).
11
+
12
+ ---
13
+
14
+ ## 0. TL;DR for the impatient
15
+
16
+ - **Gemma 4 is real and shipping.** `google/gemma-4-E2B` and `google/gemma-4-12B` exist on HF
17
+ *today* with downloadable configs. Gemma 4 E2B is the on-device MatFormer variant (successor to
18
+ Gemma 3n E2B/E4B). Architecture = Gemma-family with **5:1 sliding-window:full attention** — this
19
+ is the one genuinely new kernel requirement. **[config]**
20
+ - **The "13x fused LinearAttention" post is mostly a misread.** The real, documented artifact is
21
+ Alibaba's **FlashQLA** — a *fused linear-attention (Gated DeltaNet) kernel* for Qwen3.5/3.6,
22
+ claiming ~3x (not 13x), on **TileLang/CUDA GPUs, not WebGPU**. No verifiable WebGPU/Claude 13x
23
+ kernel exists in public sources. **Gerbil's `qwen3_5.ts` already implements this exact fusion**
24
+ (causal conv1d + delta-rule scan + gating + RMSNorm in one path). **[card/docs + repo]**
25
+ - **Best next target: the Gemma 4 family**, because one new architecture generator (Gemma4, with
26
+ sliding-window attention + final-logit softcap + Gemma RMSNorm) unlocks E2B (mobile), 12B, and
27
+ the 26B-A4B MoE / 31B dense. Qwen3 dense (0.6/1.7/4B) is the cheapest add (it's a config tweak on
28
+ the existing Qwen2/3 path). SmolLM3 is the easiest "new vendor" win (plain Llama-like + NoPE).
29
+
30
+ ---
31
+
32
+ ## 1. Latest SOTA small / mobile OSS LLMs (mid-2026)
33
+
34
+ ### 1.1 Gemma — what actually exists (the user's "Gemma 4 E2B" question)
35
+
36
+ The user's instinct was right; "Gemma 4 E2B" is a real model, not a hallucination.
37
+
38
+ **Verified live from HF configs [config]:**
39
+
40
+ | Repo | model_type | Notes |
41
+ |---|---|---|
42
+ | `google/gemma-4-E2B` | `gemma4` / `Gemma4ForConditionalGeneration` | On-device MatFormer variant. 35 layers, hidden 1536, MQA (1 KV head), head_dim 256, vocab 262144, sliding_window=512, tied embeddings. |
43
+ | `google/gemma-4-12B` | `gemma4_unified` / `Gemma4UnifiedForConditionalGeneration` | 48 layers, hidden 3840, GQA 16/8, head_dim 256, sliding_window=1024, 262k context. |
44
+ | `google/gemma-3n-E2B-it`, `gemma-3n-E4B-it` | (gated, login required) | The previous-gen MatFormer mobile models. Still exist but gated. |
45
+ | `google/gemma-3-1b-it` etc. | (gated) | Gemma 3 dense family, 270M–27B. |
46
+
47
+ **The lineage (correcting the secondary-source confusion):**
48
+ - **Gemma 3** (Mar 2025): dense family 270M / 1B / 4B / 12B / 27B. **[card/docs]**
49
+ - **Gemma 3n E2B / E4B**: the *mobile-optimized MatFormer* models — "E2B"/"E4B" = *effective*
50
+ ~2B / ~4B active params via MatFormer elastic nesting + Per-Layer Embeddings (PLE). Total stored
51
+ params are larger than the effective label; PLE means real on-device memory ≈ the effective size.
52
+ These were the Gemma-3-generation mobile models. **[card/docs, social/secondary]**
53
+ - **Gemma 4** (~Apr–Jun 2026): new generation. Edge variants **E2B / E4B** (MatFormer again),
54
+ dense **12B** (released ~3 Jun 2026), plus larger **26B-A4B MoE** and **31B dense**. So "Gemma 4
55
+ E2B" = the current-gen on-device model, the direct successor to Gemma 3n E2B. **[config for E2B/12B; card/social for E4B/26B/31B]**
56
+
57
+ **Sizes (E2B):** the bf16 single-file safetensors is **~10.2 GB on disk** [config: content-length
58
+ 10,246,621,918 B]. Official Q4 footprint per Google docs is **~2.9 GB** (Q4_0); Unsloth's QAT int4
59
+ GGUF is **~2.62 GB**. **[card/docs]** E4B Q4 ≈ **4.2–4.5 GB**. The MatFormer "effective 2B" label
60
+ does NOT shrink download — you ship the full static weights.
61
+
62
+ ### 1.2 The rest of the field (verified configs + 4-bit sizes)
63
+
64
+ 4-bit estimate column = group-quantized INT4 (g128) ≈ bf16_size × ~0.30 (weights at 0.5 B/param +
65
+ scales; embeddings often kept higher precision). Treat as ±15%.
66
+
67
+ | Model | Params | Arch family | model_type [config] | Context | Tied emb | bf16 size | **~INT4 download** | Official mobile/edge? |
68
+ |---|---|---|---|---|---|---|---|---|
69
+ | **Qwen3.5-0.8B** (current) | 0.8B | Hybrid GDN + attn | `qwen3_5` | 262k | yes | 1.62 GB | **~0.49 GB** | yes (edge line 0.8/2/4/9B) [social] |
70
+ | **Qwen3-0.6B** | 0.6B | Llama-like dense | `qwen3` | 32k | yes | 1.40 GB | **~0.42 GB** | de facto; not branded |
71
+ | **Qwen3-1.7B** | 1.7B | Llama-like dense | `qwen3` | 32k | yes | 3.78 GB | **~1.13 GB** | de facto |
72
+ | **Qwen3-4B** | 4B | Llama-like dense | `qwen3` | 40k | yes | 7.49 GB | **~2.25 GB** | de facto |
73
+ | **Phi-4-mini** | 3.8B | Phi3 (packed QKV, partial RoPE) | `phi3` | 128k | yes | 7.14 GB | **~2.14 GB** | local-friendly, no edge SKU |
74
+ | **SmolLM3-3B** | 3B | Llama-like + NoPE | `smollm3` | 64k | yes | 5.72 GB | **~1.72 GB** | yes — built for edge/browser |
75
+ | **Gemma 4 E2B** | ~2B eff | Gemma4 + sliding-window MatFormer | `gemma4` | 131k | yes | (10.2 GB raw) | **~2.6–2.9 GB** (official QAT) | **yes — official on-device** |
76
+ | **Gemma 4 12B** | 12B | Gemma4 unified | `gemma4_unified` | 262k | yes | 22.3 GB | **~6.7 GB** | desktop-class |
77
+ | **Llama 4 small** | 7–8B class | dense (Llama 4) | `llama4`/`llama` | long | — | ~15 GB | ~4.5 GB | no explicit edge SKU |
78
+
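The estimate column in the table above follows directly from the stated rule of thumb (bf16 size × ~0.30, ±15%). A minimal sketch of that arithmetic, with hypothetical helper names:

```typescript
// Rule of thumb from the text: group-quantized INT4 (g128) download ≈ bf16 size × ~0.30.
const INT4_RATIO = 0.3;
const TOLERANCE = 0.15; // the text says to treat the estimate as ±15%

function estimateInt4DownloadGB(bf16SizeGB: number, ratio = INT4_RATIO): number {
  return bf16SizeGB * ratio;
}

function int4DownloadBandGB(bf16SizeGB: number): { low: number; high: number } {
  const mid = estimateInt4DownloadGB(bf16SizeGB);
  return { low: mid * (1 - TOLERANCE), high: mid * (1 + TOLERANCE) };
}

// Rows taken from the table: bf16 size in GB for each model.
const bf16SizesGB: Record<string, number> = {
  "Qwen3.5-0.8B": 1.62,
  "Qwen3-0.6B": 1.4,
  "Qwen3-1.7B": 3.78,
  "Qwen3-4B": 7.49,
};

const estimates = Object.entries(bf16SizesGB).map(([model, size]) => ({
  model,
  int4GB: estimateInt4DownloadGB(size).toFixed(2),
}));
```

The Gemma 4 E2B row is the exception: its ~2.6–2.9 GB figure is the official QAT footprint quoted above, not the output of this formula.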
79
+ Notes:
80
+ - **No truly tiny official Llama mobile SKU.** Meta's smallest broadly-used open models remain
81
+ 7–8B-class; phone-class Llama is community-quantized, not an official edge variant. **[social/secondary]**
82
+ - **SmolLM3** is all-`full_attention` (no sliding window, no SSM) with **NoPE** (some layers skip
83
+ RoPE) and tied embeddings — architecturally the simplest "new vendor" target. **[config]**
84
+
85
+ ---
86
+
87
+ ## 2. Architecture compatibility with Gerbil's WebGPU INT4 engine
88
+
89
+ Gerbil today (from `src/gpu/architectures/`):
90
+ - `qwen2.ts` — standard transformer (Qwen2/Qwen3 = Llama-like, QKV bias, GQA, RoPE, RMSNorm, SwiGLU).
91
+ - `qwen3_5.ts` — hybrid: full-attention layers (gated, QK-RMSNorm, partial RoPE, attn output gate)
92
+ **+ linear_attention layers implemented as fused Gated DeltaNet** (RMSNorm → fused QKV proj →
93
+ A/B/Z proj → causal conv1d+SiLU → gated-delta-net recurrence with L2-normed Q/K + exp decay +
94
+ delta rule → per-head RMSNorm → SiLU gate → out proj). This is a big asset (see §3).
95
+
96
+ Existing reusable kernels: INT4 matmul (B^T), GQA attention, RMSNorm, RoPE (incl. partial), SwiGLU,
97
+ causal conv1d, delta-rule scan, gating, tied embeddings.
98
+
99
+ Per-candidate op delta (effort tiers per Gerbil's `add-model-family` skill: **Tier 1** = hours,
100
+ reuse all ops; **Tier 2** = days, 1 novel op/kernel; **Tier 3** = weeks, new computation class).
101
+ Note Gerbil generators are **not** Qwen-specific — they're config→IR generators over a
102
+ family-agnostic IR + WGSL kernel registry. Adding a family = write a generator, register it in
103
+ `src/gpu/architectures/index.ts`, and only write a kernel if an op is missing. Existing kernels:
104
+ Embedding/MatMul(Int4), Add, Mul, RMSNorm, LayerNorm, RoPE, Attention, Softmax, SiLU, SwiGLU, GELU,
105
+ ResidualRMSNorm, KVCacheAppend, **MambaSSM, CausalConv1d, SigmoidGate, ConvStateUpdate**, SliceLastRow.
106
+
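A minimal sketch of the "write a generator, register it" flow described above. Every name here (`IrNode`, `ArchGenerator`, `registerArchitecture`, `buildGraph`) is hypothetical and chosen for illustration; Gerbil's actual IR types and the registry in `src/gpu/architectures/index.ts` may look quite different.

```typescript
// Hypothetical shapes: Gerbil's real IR and registry types may differ.
type IrOp = "Embedding" | "MatMulInt4" | "RMSNorm" | "RoPE" | "Attention" | "SwiGLU" | "Add";

interface IrNode {
  op: IrOp;
  layer: number; // -1 marks nodes outside the decoder stack
}

interface ModelConfig {
  model_type: string;
  num_hidden_layers: number;
}

// A generator maps a model config to a flat list of IR nodes.
type ArchGenerator = (config: ModelConfig) => IrNode[];

const registry = new Map<string, ArchGenerator>();

function registerArchitecture(modelType: string, generator: ArchGenerator): void {
  registry.set(modelType, generator);
}

// A Llama-like dense decoder expressed only with ops that already have kernels,
// which is what makes a family "Tier 1" in the sense used above.
const denseDecoder: ArchGenerator = (config) => {
  const nodes: IrNode[] = [{ op: "Embedding", layer: -1 }];
  const perLayer: IrOp[] = [
    "RMSNorm", "MatMulInt4", "RoPE", "Attention", "Add",
    "RMSNorm", "SwiGLU", "Add",
  ];
  for (let layer = 0; layer < config.num_hidden_layers; layer++) {
    for (const op of perLayer) nodes.push({ op, layer });
  }
  // Final norm and the logit projection.
  nodes.push({ op: "RMSNorm", layer: -1 }, { op: "MatMulInt4", layer: -1 });
  return nodes;
};

registerArchitecture("qwen3", denseDecoder);

function buildGraph(config: ModelConfig): IrNode[] {
  const generator = registry.get(config.model_type);
  if (!generator) throw new Error(`No generator registered for ${config.model_type}`);
  return generator(config);
}
```

Under these assumptions, supporting a new dense family amounts to one `registerArchitecture` call, and an unregistered `model_type` fails loudly instead of silently falling back.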
107
+ ### Qwen3 dense (0.6B / 1.7B / 4B) — **Tier 1 (hours)**
108
+ Same Llama-like decoder Gerbil's `qwen2.ts` already runs. `model_type: qwen3`, GQA, RoPE
109
+ (theta 1e6), RMSNorm, SwiGLU, tied embeddings, QK-RMSNorm (already done for Qwen3.5 full-attn
110
+ layers). **New kernels: none.** Mostly a config-mapping / loader change. The Qwen3.5 full-attention
111
+ sublayer is essentially Qwen3, so the code already exists. **[config]**
112
+
113
+ ### SmolLM3-3B — **Tier 1 (hours)**
114
+ Llama-like dense, all full_attention, GQA 16/4, RMSNorm, SwiGLU, tied embeddings. The one quirk:
115
+ **NoPE** — certain layers omit RoPE (config gives `layer_types`/`no_rope_layers`). Need a per-layer
116
+ "skip RoPE" flag in the attention builder. **New kernels: none; one loader/graph flag.** **[config]**
117
+
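The per-layer "skip RoPE" flag could be derived from the config roughly as follows. This is a sketch under two assumptions that should be checked against the actual SmolLM3 `config.json`: that a `1` in `no_rope_layers` means "this layer applies RoPE" (and `0` means NoPE), and that when the list is absent every fourth layer is NoPE. The `SmolLm3LikeConfig` type and `ropeFlags` helper are hypothetical names.

```typescript
interface SmolLm3LikeConfig {
  num_hidden_layers: number;
  // Assumed convention: 1 = layer applies RoPE, 0 = NoPE layer.
  no_rope_layers?: number[];
  // Assumed fallback: every Nth layer (counting from 1) is a NoPE layer.
  no_rope_layer_interval?: number;
}

// Returns one boolean per layer: true if the attention builder should apply RoPE.
function ropeFlags(config: SmolLm3LikeConfig): boolean[] {
  const n = config.num_hidden_layers;
  if (config.no_rope_layers) {
    if (config.no_rope_layers.length < n) {
      throw new Error("no_rope_layers is shorter than num_hidden_layers");
    }
    return config.no_rope_layers.slice(0, n).map((flag) => flag === 1);
  }
  const interval = config.no_rope_layer_interval ?? 4;
  return Array.from({ length: n }, (_, i) => (i + 1) % interval !== 0);
}
```

The attention builder would then consult `ropeFlags(config)[layer]` and emit the existing RoPE op only when the flag is true, so no new kernel is involved.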
118
+ ### Phi-4-mini (3.8B) — **Tier 1 (hours, loader-heavy)**
119
+ `model_type: phi3`. Two deviations from the Qwen path: (a) **packed weights**, a fused `qkv_proj`
119
+ (split into Q/K/V) plus a fused `gate_up_proj` for the MLP, and (b) **partial RoPE**. Gerbil
121
+ already splits fused QKV (Qwen3.5 splits a fused Q+gate) and already does partial RoPE, so this is
122
+ mostly tensor-slicing config. `sliding_window` is set very large (262144) → effectively full
123
+ attention at normal context. **New kernels: none; loader handles packed weights.** **[config]**
124
+
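The loader-side split of a packed `qkv_proj` can be sketched as plain row slicing. The sketch assumes a row-major, unquantized weight of shape `[(numHeads + 2 * numKvHeads) * headDim, hiddenSize]` with rows ordered Q, then K, then V; with INT4-packed weights and per-group scales, the slice boundaries would additionally need to respect the packing layout. `PackedAttnShape` and `splitPackedQkv` are hypothetical names, not Gerbil's loader API.

```typescript
interface PackedAttnShape {
  numHeads: number;
  numKvHeads: number;
  headDim: number;
  hiddenSize: number;
}

interface SplitQkv {
  q: Float32Array;
  k: Float32Array;
  v: Float32Array;
}

// Splits a fused QKV weight into three views without copying.
// Assumed row order inside the packed matrix: all Q rows, then K rows, then V rows.
function splitPackedQkv(packed: Float32Array, shape: PackedAttnShape): SplitQkv {
  const qRows = shape.numHeads * shape.headDim;
  const kvRows = shape.numKvHeads * shape.headDim;
  const expected = (qRows + 2 * kvRows) * shape.hiddenSize;
  if (packed.length !== expected) {
    throw new Error(`packed qkv has ${packed.length} elements, expected ${expected}`);
  }
  const qEnd = qRows * shape.hiddenSize;
  const kEnd = qEnd + kvRows * shape.hiddenSize;
  return {
    q: packed.subarray(0, qEnd),
    k: packed.subarray(qEnd, kEnd),
    v: packed.subarray(kEnd),
  };
}
```

A packed `gate_up_proj` would be handled the same way with two equal halves, assuming the gate rows come first.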
125
+ ### Gemma 4 E2B / 12B — **Tier 2 (days; sliding-window attn is the one novel op)**
126
+ `model_type: gemma4` / `gemma4_unified`. Gemma-family specifics:
127
+ - **Sliding-window attention (SWA)**, interleaved with full attention: E2B is **4:1**
127
+ (sliding:full, full every 5th layer), 12B is **5:1** with window 1024. **This is the one real new
129
+ kernel/masking path Gerbil lacks** — a banded/windowed attention mask (and a windowed KV cache to
130
+ get the memory benefit). At short prompts SWA == full attention, so a correctness-first v1 can
131
+ treat SWA as full attention and add the banded mask + windowed cache later for long-context wins.
132
+ - **Gemma RMSNorm** uses `(1 + weight)` scaling and norms in more places (pre/post attn, pre/post
133
+ MLP). Small kernel variant of existing RMSNorm.
134
+ - **Final logit softcapping** (`tanh`-based) and often **attn logit softcapping** — a cheap
135
+ elementwise op, but must be added or outputs are wrong.
136
+ - **head_dim 256** with **query scaling** (`query_pre_attn_scalar`) — parameter, not new kernel.
137
+ - **GeGLU** activation (Gemma uses gelu-based gating) vs SwiGLU (SiLU). Need a GeGLU variant of the
138
+ MLP (swap SiLU→GELU). Tiny.
139
+ - **MatFormer / PLE (E2B)**: for a single fixed deployment Gerbil can just load the E2B slice as a
140
+ normal dense model — no need to implement elastic nesting. PLE means per-layer embedding tables;
141
+ loader must place them but it's not a new compute kernel.
142
+ - Tied embeddings, vocab 262144 (large — embed/logit matmul is heavier).
143
+
144
+ **New kernels for Gemma 4: (1) windowed/banded attention mask + windowed KV cache, (2) GeGLU MLP
145
+ variant, (3) Gemma-style `(1+w)` RMSNorm, (4) logit softcap elementwise.** Items 2–4 are small;
146
+ item 1 is the only real engineering. A correctness-first build can defer item 1.
147
+
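The four Gemma-specific pieces listed above can be pinned down with scalar CPU reference functions, which are handy as a correctness oracle before writing WGSL. This is a sketch, not Gerbil code: the function names are hypothetical, the window convention (a query sees the last `window` positions including itself) and the tanh approximation of GELU are assumptions to verify against the model's reference implementation.

```typescript
// Plain causal visibility: a query at position q sees keys at positions <= q.
function causalVisible(q: number, k: number): boolean {
  return k <= q;
}

// Banded (sliding-window) visibility: additionally require the key to be
// within the last `window` positions, counting the query position itself.
function swaVisible(q: number, k: number, window: number): boolean {
  return k <= q && q - k < window;
}

// Gemma-style RMSNorm: normalise by the root mean square, then scale by (1 + weight).
function gemmaRmsNorm(x: number[], weight: number[], eps = 1e-6): number[] {
  if (x.length !== weight.length) throw new Error("x and weight must have equal length");
  const meanSq = x.reduce((acc, v) => acc + v * v, 0) / x.length;
  const inv = 1 / Math.sqrt(meanSq + eps);
  return x.map((v, i) => v * inv * (1 + weight[i]));
}

// Logit soft-capping: squashes a logit smoothly into (-cap, cap).
function softcap(logit: number, cap: number): number {
  return cap * Math.tanh(logit / cap);
}

// GELU, tanh approximation (assumed variant).
function geluTanh(x: number): number {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));
}

// GeGLU: the SwiGLU structure with SiLU swapped for GELU on the gate branch.
function geglu(gate: number[], up: number[]): number[] {
  if (gate.length !== up.length) throw new Error("gate and up must have equal length");
  return gate.map((g, i) => geluTanh(g) * up[i]);
}
```

For any pair of positions below the window size, `swaVisible` and `causalVisible` agree, which is why a correctness-first build can treat SWA layers as full attention for short prompts.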
148
+ ### Gemma 4 26B-A4B (MoE) / Llama 4 — **Tier 3 (weeks); out of scope for mobile**
149
+ MoE needs a router + top-k expert gather + per-expert INT4 matmul (sparse dispatch). New kernel
150
+ family, and 6–7 GB+ at int4 — desktop-only. Skip for the mobile mandate. **[social/secondary]**
151
+
152
+ Exotic-feature flag summary:
153
+ - **MatFormer/elastic params**: Gemma 3n/4 E-series — ignorable for a fixed deployment.
154
+ - **Sliding-window attention**: Gemma 4 (and Phi3 in principle) — **needs new masking + windowed cache.**
155
+ - **MoE**: Gemma 4 26B-A4B, Qwen3.5 large — new kernel family, desktop-only.
156
+ - **NoPE**: SmolLM3 — one per-layer flag.
157
+ - **Packed QKV / gate_up**: Phi-4-mini — loader-level, already partially handled.
158
+ - **Logit softcap / Gemma (1+w) RMSNorm / GeGLU**: Gemma — small new variants.
159
+ - **Tied embeddings**: every candidate — already supported.
160
+
161
+ ---
162
+
163
+ ## 3. The "fused LinearAttention 13x" claim — verdict
164
+
165
+ **What's real (confirmed):**
166
+ - Qwen3.5's `linear_attention` layers are **Gated DeltaNet (GDN)** — the Qwen3-Next mechanism:
167
+ short **causal conv1d**, **delta-rule recurrent/chunked scan**, **exponential gating**, **L2-norm
168
+ on Q/K** (replacing softmax), and **RMSNorm**. Interleaved 3:1 with gated full attention. Verified
169
+ in Gerbil's own config (`layer_types`) and NVIDIA Megatron-Bridge / Qwen model-card docs. **[config + card/docs]**
170
+ - Alibaba shipped **FlashQLA** — "high-performance linear attention kernel library built on
171
+ TileLang, specifically optimized for GDN chunked-prefill, the linear attention mechanism used in
172
+ Qwen3.5 and Qwen3.6." It **fuses** the GDN ops and reports **up to ~3x** vs prior linear-attention
173
+ kernels — **on server GPUs (TileLang/CUDA), not WebGPU.** **[card/docs]**
174
+ (Source: alibabacloud.com/blog/flashqla-cp-bwd-friendly-fused-linear-attention-kernels-for-gdn)
175
+
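For reference, one token of the gated delta rule for a single head can be written as a few loops. This is a simplified CPU sketch, not Gerbil's kernel and not FlashQLA's chunked algorithm: it omits the conv1d, the output RMSNorm and gate, and any query scaling, and the names (`gatedDeltaStep`, `alpha` for the decay, `beta` for the write strength) are illustrative. The update implemented is S ← α·S, then S ← S + β·(v − S·k)·kᵀ with L2-normalised q and k, and the output is S·q.

```typescript
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function l2Normalize(v: number[], eps = 1e-6): number[] {
  const norm = Math.sqrt(dot(v, v) + eps);
  return v.map((x) => x / norm);
}

// One recurrent step for one head.
// state: dV x dK matrix (array of rows), updated in place.
// alpha: decay in (0, 1], the exponential gate; beta: write strength in [0, 1].
// Returns the output vector of length dV.
function gatedDeltaStep(
  state: number[][],
  q: number[],
  k: number[],
  v: number[],
  alpha: number,
  beta: number,
): number[] {
  const kn = l2Normalize(k);
  const qn = l2Normalize(q);
  for (let r = 0; r < state.length; r++) {
    const row = state[r];
    // Gated decay of the old memory.
    for (let c = 0; c < row.length; c++) row[c] *= alpha;
    // Delta rule: write only the part of v that the memory does not already predict.
    const predicted = dot(row, kn);
    const delta = beta * (v[r] - predicted);
    for (let c = 0; c < row.length; c++) row[c] += delta * kn[c];
  }
  return state.map((row) => dot(row, qn));
}
```

Because the state has a fixed size (dV x dK per head), per-token decode cost does not grow with context length, which is the property the fused and chunked kernels are built around.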
176
+ **What's NOT verifiable:**
177
+ - A specific X/Twitter post claiming "Claude Opus 4.7 wrote a WebGPU kernel running Qwen3.5 13x
178
+ faster via a fused LinearAttention op" — **no traceable primary source.** Two independent
179
+ Perplexity sweeps (sonar-pro) over WebGPU + Qwen3.5 + GDN + "fused" + "13x" found no repo, no
180
+ benchmark, no attributable tweet. The closest real artifacts are the FlashQLA GPU kernels (~3x)
181
+ and the unrelated "Qwen3.6-35B Claude-4.7-Opus reasoning-distilled" model (no kernel content).
182
+ **Treat the 13x WebGPU claim as social-media rumor / paraphrase of FlashQLA, not a reproduced result.**
183
+
184
+ **Does it apply to Gerbil?** It already *is* applied. Gerbil's `qwen3_5.ts` describes a single fused
185
+ Gated DeltaNet path (the docstring literally lists conv1d+SiLU → gated-delta-net recurrence →
186
+ per-head RMSNorm → SiLU gate → out proj). The "fused LinearAttention" technique = exactly what
187
+ Gerbil's Qwen3.5 path does. The remaining upside is *autoresearch-style kernel tuning* of that
188
+ fused op (chunked scan tiling, workgroup sizing, reducing dispatch/submit overhead — which the
189
+ recent Safari/Metal commits already touch), not a new architectural fusion. There is no free 13x
190
+ sitting on the table; FlashQLA's ~3x is the realistic ceiling for the GDN op specifically, and only
191
+ if Gerbil's current scan is far from optimal.
192
+
193
+ ---
194
+
195
+ ## 4. Recommendation — ranked targets for Gerbil
196
+
197
+ Weighting: SOTA quality × mobile download size × implementation effort against the existing engine.
198
+
199
+ **#1 — Gemma 4 E2B (highest value).** Google's flagship *official on-device* model, multimodal-capable
200
+ text core, ~2.6–2.9 GB at int4 (acceptable mobile download), strong quality. It's the model the
201
+ user actually asked about and it's real and ungated (E2B/12B configs download without auth, unlike
202
+ Gemma 3n). Effort: moderate — needs the Gemma4 generator (sliding-window mask + windowed KV cache,
203
+ GeGLU, Gemma `(1+w)` RMSNorm, logit softcap). A correctness-first v1 can treat SWA as full attention
204
+ and ship fast, then add the windowed cache for long-context/memory wins.
205
+
206
+ **#2 — Qwen3 dense 0.6B / 1.7B / 4B (cheapest, broadest).** Near-zero new kernel work — it's the
207
+ Qwen3.5 full-attention sublayer Gerbil already runs, minus the Gated DeltaNet layers. Gives a clean size ladder
208
+ (0.42 / 1.13 / 2.25 GB int4) so users pick by device. Best effort-to-coverage ratio; should likely
209
+ ship *before* Gemma to derisk the loader/config plumbing.
210
+
211
+ **#3 — SmolLM3-3B (easy new-vendor win).** Plain Llama-like + NoPE flag + tied embeddings; ~1.72 GB
212
+ int4; purpose-built for edge/browser. Almost free given the Qwen3 path; adds vendor diversity.
213
+
214
+ **#4 — Phi-4-mini (3.8B).** Strong reasoning per param, 128k context, ~2.14 GB int4. Effort is
215
+ loader-level (packed QKV / gate_up, partial RoPE — both already partly handled). Good "quality
216
+ small" option but no edge branding and slightly more loader work than SmolLM3.
217
+
218
+ **#5 — Gemma 4 12B / Llama 4 / MoE variants — desktop-only, defer.** 6.7 GB+ int4 and (for 26B-A4B)
219
+ a whole MoE kernel family. Out of scope for the mobile mandate; revisit for a desktop tier.
220
+
221
+ ### Which single architecture-family generator unlocks the most high-value models?
222
+
223
+ **The `Gemma4` generator.** One new arch family (sliding-window attention + windowed KV cache +
224
+ GeGLU + Gemma RMSNorm + logit softcap) unlocks **E2B (mobile), E4B, 12B, and — with an added MoE
225
+ path — 26B-A4B / 31B**: the entire current-gen Google open line, anchored by the single most
226
+ requested mobile model. Sliding-window attention is also reusable for other modern models (it's a
227
+ common long-context pattern), so the investment compounds.
228
+
229
+ **Sequencing recommendation:** ship **Qwen3 dense** first (hours, derisks plumbing) → then build the
230
+ **Gemma4 generator** for E2B (the headline mobile model) → fold in **SmolLM3** as a cheap rider on
231
+ the Qwen3 path. That ordering front-loads low-risk wins and lands the high-value Gemma 4 E2B with
232
+ the loader already battle-tested.
233
+
234
+ ---
235
+
236
+ ## Sources
237
+
238
+ Primary (verified live, 2026-06-13):
239
+ - HF configs fetched directly: `Qwen/Qwen3.5-0.8B`, `Qwen/Qwen3-4B`, `Qwen/Qwen3-1.7B`,
240
+ `Qwen/Qwen3-0.6B`, `microsoft/Phi-4-mini-instruct`, `HuggingFaceTB/SmolLM3-3B`,
241
+ `google/gemma-4-E2B`, `google/gemma-4-12B` (config.json + safetensors index/content-length).
242
+ - Gerbil repo: `src/gpu/architectures/qwen3_5.ts`, `src/gpu/architectures/qwen2.ts`,
243
+ `src/core/model-compat.ts`.
244
+
245
+ Docs / cards:
246
+ - Gemma 4 overview: https://ai.google.dev/gemma/docs/core
247
+ - Gemma 4 QAT int4 GGUF sizes (Unsloth): https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF ,
248
+ https://unsloth.ai/docs/models/gemma-4/qat
249
+ - FlashQLA fused GDN linear-attention kernels:
250
+ https://www.alibabacloud.com/blog/flashqla-cp-bwd-friendly-fused-linear-attention-kernels-for-gdn_603084
251
+ - Qwen 3.5 hybrid GDN arch (NVIDIA): https://docs.nvidia.com/nemo/megatron-bridge/0.4.1/models/vlm/qwen35-vl.html
252
+ - Gated DeltaNet explainer: https://sebastianraschka.com/llms-from-scratch/ch04/08_deltanet/
253
+ - vLLM Qwen3-Next hybrid support: https://vllm.ai/blog/2025-09-11-qwen3-next
254
+ - Qwen3.5 GDN analysis: https://gist.github.com/justinchuby/0213aa253664fb72e9adb0089816de15
255
+
256
+ Secondary / context (treat softer):
257
+ - BentoML 2026 OSS LLM roundup: https://www.bentoml.com/blog/navigating-the-world-of-open-source-large-language-models
258
+ - Trilogy AI "Qwen 3.6 vs Opus 4.7 vs Gemma 4": https://trilogyai.substack.com/p/qwen-36-open-vs-opus-47-vs-gemma
259
+ - Gemma 4 E2B/E4B edge writeup: https://www.mindstudio.ai/blog/gemma-4-e2b-e4b-edge-models-phone-local
260
+
261
+ Unverified (flagged):
262
+ - "Claude Opus 4.7 wrote a WebGPU kernel running Qwen3.5 13x faster via fused LinearAttention" — no
263
+ traceable primary source found; likely a paraphrase/conflation of FlashQLA (GPU, ~3x).