@tryhamster/gerbil 1.0.0-rc.9 → 1.0.1

Files changed (179)
  1. package/LICENSE +1 -1
  2. package/README.md +318 -104
  3. package/dist/architectures-C1I5V3Dt.mjs +6070 -0
  4. package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
  5. package/dist/browser/index.d.ts +276 -590
  6. package/dist/browser/index.d.ts.map +1 -1
  7. package/dist/browser/index.js +592 -2334
  8. package/dist/browser/index.js.map +1 -1
  9. package/dist/cli.mjs +625 -1098
  10. package/dist/cli.mjs.map +1 -1
  11. package/dist/defaults-9komdrbY.mjs +24 -0
  12. package/dist/defaults-9komdrbY.mjs.map +1 -0
  13. package/dist/frameworks/express.d.mts +1 -3
  14. package/dist/frameworks/express.d.mts.map +1 -1
  15. package/dist/frameworks/express.mjs +7 -7
  16. package/dist/frameworks/express.mjs.map +1 -1
  17. package/dist/frameworks/fastify.d.mts +1 -1
  18. package/dist/frameworks/fastify.d.mts.map +1 -1
  19. package/dist/frameworks/fastify.mjs +3 -3
  20. package/dist/frameworks/fastify.mjs.map +1 -1
  21. package/dist/frameworks/hono.d.mts +1 -1
  22. package/dist/frameworks/hono.d.mts.map +1 -1
  23. package/dist/frameworks/hono.mjs +4 -4
  24. package/dist/frameworks/hono.mjs.map +1 -1
  25. package/dist/frameworks/next.d.mts +3 -2
  26. package/dist/frameworks/next.d.mts.map +1 -1
  27. package/dist/frameworks/next.mjs +4 -4
  28. package/dist/frameworks/next.mjs.map +1 -1
  29. package/dist/frameworks/react.d.mts +1 -1
  30. package/dist/frameworks/trpc.d.mts +1 -1
  31. package/dist/frameworks/trpc.d.mts.map +1 -1
  32. package/dist/frameworks/trpc.mjs +4 -4
  33. package/dist/frameworks/trpc.mjs.map +1 -1
  34. package/dist/gerbil-BetB5xb0.d.mts +488 -0
  35. package/dist/gerbil-BetB5xb0.d.mts.map +1 -0
  36. package/dist/gerbil-CTZUa8EZ.mjs +4 -0
  37. package/dist/gerbil-DNniplr4.mjs +1656 -0
  38. package/dist/gerbil-DNniplr4.mjs.map +1 -0
  39. package/dist/gpu/hooks.d.mts +640 -0
  40. package/dist/gpu/hooks.d.mts.map +1 -0
  41. package/dist/gpu/hooks.mjs +1369 -0
  42. package/dist/gpu/hooks.mjs.map +1 -0
  43. package/dist/gpu/index.d.mts +2 -0
  44. package/dist/gpu/index.mjs +6 -0
  45. package/dist/gpu-DFuglcEx.mjs +3790 -0
  46. package/dist/gpu-DFuglcEx.mjs.map +1 -0
  47. package/dist/index-Dgmb2kE3.d.mts +245 -0
  48. package/dist/index-Dgmb2kE3.d.mts.map +1 -0
  49. package/dist/index-DukkJRMj.d.mts +2114 -0
  50. package/dist/index-DukkJRMj.d.mts.map +1 -0
  51. package/dist/index.d.mts +22 -487
  52. package/dist/index.d.mts.map +1 -1
  53. package/dist/index.mjs +13 -8
  54. package/dist/index.mjs.map +1 -1
  55. package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
  56. package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
  57. package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
  58. package/dist/integrations/ai-sdk.d.mts +75 -6
  59. package/dist/integrations/ai-sdk.d.mts.map +1 -1
  60. package/dist/integrations/ai-sdk.mjs +131 -15
  61. package/dist/integrations/ai-sdk.mjs.map +1 -1
  62. package/dist/integrations/langchain.d.mts +1 -1
  63. package/dist/integrations/langchain.d.mts.map +1 -1
  64. package/dist/integrations/langchain.mjs +5 -5
  65. package/dist/integrations/langchain.mjs.map +1 -1
  66. package/dist/integrations/llamaindex.d.mts +1 -1
  67. package/dist/integrations/llamaindex.d.mts.map +1 -1
  68. package/dist/integrations/llamaindex.mjs +5 -5
  69. package/dist/integrations/llamaindex.mjs.map +1 -1
  70. package/dist/integrations/mcp-client.mjs +3 -3
  71. package/dist/integrations/mcp-client.mjs.map +1 -1
  72. package/dist/integrations/mcp.d.mts +3 -2
  73. package/dist/integrations/mcp.d.mts.map +1 -1
  74. package/dist/integrations/mcp.mjs +5 -5
  75. package/dist/{mcp-BvbriaBy.mjs → mcp-D2vvH1Xc.mjs} +4 -4
  76. package/dist/mcp-D2vvH1Xc.mjs.map +1 -0
  77. package/dist/memory/index.d.mts +3 -0
  78. package/dist/memory/index.mjs +6 -0
  79. package/dist/memory-D1P7Tmda.mjs +4 -0
  80. package/dist/memory-DVN0MnIG.mjs +132 -0
  81. package/dist/memory-DVN0MnIG.mjs.map +1 -0
  82. package/dist/memory-Dj0J1v88.mjs +294 -0
  83. package/dist/memory-Dj0J1v88.mjs.map +1 -0
  84. package/dist/moonshine-stt-17dpP1kr.mjs +4 -0
  85. package/dist/moonshine-stt-4ojLtMq7.mjs +11962 -0
  86. package/dist/moonshine-stt-4ojLtMq7.mjs.map +1 -0
  87. package/dist/{one-liner-s-lD8rCC.mjs → one-liner-JhdIPxzF.mjs} +14 -16
  88. package/dist/one-liner-JhdIPxzF.mjs.map +1 -0
  89. package/dist/repl-BDRkwPGX.mjs +9 -0
  90. package/dist/skills/index.d.mts +270 -320
  91. package/dist/skills/index.d.mts.map +1 -1
  92. package/dist/skills/index.mjs +5 -5
  93. package/dist/{skills-CD3Orlex.mjs → skills-CU694Dc8.mjs} +187 -32
  94. package/dist/skills-CU694Dc8.mjs.map +1 -0
  95. package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
  96. package/dist/tools-DQ1mPUw5.mjs.map +1 -0
  97. package/dist/types-DQBe2lFo.d.mts +165 -0
  98. package/dist/types-DQBe2lFo.d.mts.map +1 -0
  99. package/dist/{types-CiTc7ez3.d.mts → types-LlyYILII.d.mts} +112 -14
  100. package/dist/types-LlyYILII.d.mts.map +1 -0
  101. package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
  102. package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
  103. package/dist/vector-B0panuy6.mjs +95 -0
  104. package/dist/vector-B0panuy6.mjs.map +1 -0
  105. package/docs/PROJECT-STATE.md +321 -0
  106. package/docs/adding-a-model-family.md +280 -0
  107. package/docs/ai-sdk.md +70 -61
  108. package/docs/architecture/overview.md +17 -7
  109. package/docs/browser.md +203 -8
  110. package/docs/embeddings.md +156 -0
  111. package/docs/gerbil-site-native-migration.md +217 -0
  112. package/docs/gpu-engine/architectures.md +398 -0
  113. package/docs/gpu-engine/ir.md +372 -0
  114. package/docs/gpu-engine/kernels.md +718 -0
  115. package/docs/gpu-engine/paper.html +1759 -0
  116. package/docs/gpu-engine/paper.md +2109 -0
  117. package/docs/gpu-engine/safetensors.md +312 -0
  118. package/docs/gpu-engine/tokenizer.md +302 -0
  119. package/docs/memory-rag.md +91 -0
  120. package/docs/metal-safari-intel.md +190 -0
  121. package/docs/mobile-failure-diagnosis.md +124 -0
  122. package/docs/mobile.md +99 -0
  123. package/docs/observability.md +230 -0
  124. package/docs/onnx-removal-plan.md +339 -0
  125. package/docs/research/autoresearch-portable.md +904 -0
  126. package/docs/research/dispatch-reduction-hivemind.md +84 -0
  127. package/docs/research/ios-safari-model-caching.md +117 -0
  128. package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
  129. package/docs/research/native-stt-model-selection.md +49 -0
  130. package/docs/research/native-tts-model-selection.md +90 -0
  131. package/docs/research/native-vs-chromium-decision.md +152 -0
  132. package/docs/research/nemotron-mamba2-inference.md +910 -0
  133. package/docs/research/qwen35-multimodal.md +293 -0
  134. package/docs/research/qwen36-gemma4-targets.md +337 -0
  135. package/docs/research/sota-embedding-models.md +179 -0
  136. package/docs/research/sota-mobile-models-2026.md +263 -0
  137. package/docs/research/sota-modality-models.md +202 -0
  138. package/docs/research/tps-baselines.md +71 -0
  139. package/docs/research/webgpu-m4-reference.md +104 -0
  140. package/docs/site-update-plan.md +155 -0
  141. package/docs/structured-output.md +123 -0
  142. package/docs/stt.md +63 -446
  143. package/docs/tts.md +77 -499
  144. package/docs/vision.md +100 -338
  145. package/package.json +22 -7
  146. package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
  147. package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
  148. package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
  149. package/dist/gerbil-CJ3ifloF.mjs +0 -4
  150. package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
  151. package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
  152. package/dist/gerbil-qOTe1nl2.d.mts +0 -431
  153. package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
  154. package/dist/kokoro-BNTb6egA.mjs +0 -20210
  155. package/dist/kokoro-BNTb6egA.mjs.map +0 -1
  156. package/dist/kokoro-CMOGDSgT.js +0 -20212
  157. package/dist/kokoro-CMOGDSgT.js.map +0 -1
  158. package/dist/mcp-BvbriaBy.mjs.map +0 -1
  159. package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
  160. package/dist/repl-DveXw36T.mjs +0 -9
  161. package/dist/skills-CD3Orlex.mjs.map +0 -1
  162. package/dist/stt-Bu-E23Sc.js +0 -433
  163. package/dist/stt-Bu-E23Sc.js.map +0 -1
  164. package/dist/stt-CpLYbGFd.mjs +0 -433
  165. package/dist/stt-CpLYbGFd.mjs.map +0 -1
  166. package/dist/stt-DRPLEEHB.mjs +0 -3
  167. package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
  168. package/dist/transformers.web-DiD1gTwk.js +0 -44695
  169. package/dist/transformers.web-DiD1gTwk.js.map +0 -1
  170. package/dist/transformers.web-u34VxRFM.js +0 -3
  171. package/dist/tts-CqroPaSK.js +0 -724
  172. package/dist/tts-CqroPaSK.js.map +0 -1
  173. package/dist/tts-DXgsKGCe.mjs +0 -3
  174. package/dist/tts-DeGANMNV.mjs +0 -730
  175. package/dist/tts-DeGANMNV.mjs.map +0 -1
  176. package/dist/types-CiTc7ez3.d.mts.map +0 -1
  177. /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
  178. /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
  179. /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,293 @@
1
+ # Qwen3.5 Multimodality — Native Engine Investigation
2
+
3
+ **Date:** 2026-06-13
4
+ **Question:** Is `Qwen/Qwen3.5-0.8B` (the model Gerbil runs) actually multimodal? Our native
5
+ loader downloads only its text tensors and skips ~192 MB labeled "vision/MTP". Could one native
6
+ model give us text + vision (+ audio) on our WebGPU engine, dissolving the two-engine problem on mobile?
7
+
8
+ **Verdict up front:** **YES — Qwen3.5-0.8B is a single, natively multimodal (text + vision)
9
+ checkpoint.** We are deliberately throwing away its vision tower. Re-enabling vision is a **mid-size
10
+ build** (one real new kernel: the patch-embed conv; everything else in the ViT reuses ops we already
11
+ have). **Audio is NOT in Qwen3.5 at all** — omni/audio lives in a separate, much larger family
12
+ (`Qwen3-Omni-30B-A3B`). So one native model covers **text + image/video-in**, not audio.
13
+
14
+ All facts below tagged `[config-verified]` were read directly from live HuggingFace files on
15
+ 2026-06-13 (config.json + safetensors header), not from web prose.
16
+
17
+ ---
18
+
19
+ ## PART A — What Gerbil currently skips
20
+
21
+ ### The filter rule
22
+
23
+ `src/gpu/model-loader.ts`, `createKeyMapperForArch()` (lines 480–504). For the
24
+ `Qwen3_5ForConditionalGeneration` architecture, the HF→canonical key mapper returns `null` for any
25
+ tensor under three prefixes, which causes the selective downloader to skip those byte ranges entirely:
26
+
27
+ ```ts
28
+ function createKeyMapperForArch(architectureName: string): HFKeyMapper {
29
+   if (architectureName === "Qwen3_5ForConditionalGeneration") {
30
+     return (hfKey: string): string | null => {
31
+       // Skip visual encoder, vision tower, and MTP weights
32
+       if (
33
+         hfKey.startsWith("model.visual.") ||
34
+         hfKey.startsWith("vision_tower.") ||
35
+         hfKey.startsWith("mtp.")
36
+       ) {
37
+         return null;
38
+       }
39
+       let key = hfKey;
40
+       if (key.startsWith("model.language_model.")) {
41
+         key = key.slice(21);
42
+       } else if (key.startsWith("language_model.model.")) {
43
+         key = key.slice(21);
44
+       } else if (key.startsWith("model.")) {
45
+         key = key.slice(6);
46
+       }
47
+       return key;
48
+     };
49
+   }
50
+   return createDefaultHFKeyMapper();
51
+ }
52
+ ```
53
+
54
+ ### Where the skip is realized (the log line)
55
+
56
+ `loadModel()`, lines 626–640. Any entry whose mapped key is `null` is excluded from the download and
57
+ counted into `skippedBytes`:
58
+
59
+ ```ts
60
+ const neededEntries = file.entries.filter((e) => keyMapper(e.name) !== null);
61
+ const skippedBytes = file.entries
62
+   .filter((e) => keyMapper(e.name) === null)
63
+   .reduce((sum, e) => sum + e.dataLength, 0);
64
+ // ...
65
+ onProgress?.(10, 100,
66
+   `Selective download: ${dlMB} MB needed, skipping ${savedMB} MB (vision/MTP)`);
67
+ ```
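To make the "skip those byte ranges" step concrete, here is a minimal sketch of how the kept entries could be merged into a small number of HTTP range requests. It is an illustration only: `TensorEntry`, `dataOffset`, `coalesceRanges`, and `toRangeHeader` are hypothetical names (the excerpt above only shows `name` and `dataLength`), and the real downloader may batch ranges differently.

```typescript
// Hypothetical entry shape. Only `name` and `dataLength` appear in the loader
// excerpt; `dataOffset` is assumed here for illustration.
interface TensorEntry {
  name: string;
  dataOffset: number; // byte offset of the tensor inside the shard
  dataLength: number; // byte length of the tensor
}

interface ByteRange {
  start: number; // inclusive
  end: number;   // exclusive
}

// Merge the spans of the kept entries into as few ranges as possible.
// Spans separated by at most `maxGap` bytes are merged, which means the
// gap bytes are downloaded as well.
function coalesceRanges(entries: TensorEntry[], maxGap = 0): ByteRange[] {
  const sorted = [...entries].sort((a, b) => a.dataOffset - b.dataOffset);
  const ranges: ByteRange[] = [];
  for (const e of sorted) {
    const start = e.dataOffset;
    const end = e.dataOffset + e.dataLength;
    const last = ranges[ranges.length - 1];
    if (last !== undefined && start - last.end <= maxGap) {
      last.end = Math.max(last.end, end);
    } else {
      ranges.push({ start, end });
    }
  }
  return ranges;
}

// Format a range as an HTTP Range header value (the end is inclusive on the wire).
function toRangeHeader(r: ByteRange): string {
  return `bytes=${r.start}-${r.end - 1}`;
}
```

A non-zero `maxGap` trades a few wasted bytes for fewer round trips, which may matter more than raw size on mobile networks.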
68
+
69
+ ### What exactly is filtered out — measured from the live safetensors header `[config-verified]`
70
+
71
+ Single shard `model.safetensors-00001-of-00001.safetensors`, total **1666 MB** (BF16/F32, the
72
+ unquantized base). Broken down by prefix:
73
+
74
+ | Component | Tensors | Size | Skipped? |
75
+ |---|---|---|---|
76
+ | `model.language_model.*` (text backbone) | 320 | **1435.1 MB** | kept |
77
+ | `model.visual.*` (vision tower / ViT) | 153 | **191.9 MB** | **skipped** |
78
+ | `mtp.*` (multi-token prediction head) | 15 | **39.0 MB** | **skipped** |
79
+
80
+ So the groups behind the "skipping 192 MB (vision/MTP)" line are **~192 MB vision + ~39 MB MTP**, about
81
+ 231 MB combined in this unquantized shard. (In the q4 path the ratios shrink but the same prefixes are
82
+ dropped; "404 MB needed / 192 MB skipped" is the quantized profile.)
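A per-prefix breakdown like the table above can be reproduced from the safetensors header alone. The sketch below assumes the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header with `data_offsets` per tensor); the function names `parseSafetensorsHeader` and `bytesByPrefix` are made up for this example and are not part of Gerbil.

```typescript
// One tensor record from a safetensors JSON header.
interface SafetensorsEntry {
  dtype: string;
  shape: number[];
  data_offsets: [number, number]; // [begin, end) relative to the data section
}

interface PrefixTotal {
  tensors: number;
  bytes: number;
}

// Parse the JSON header from the leading bytes of a .safetensors file.
// Layout: 8-byte little-endian u64 length N, then N bytes of UTF-8 JSON.
function parseSafetensorsHeader(bytes: Uint8Array): Record<string, SafetensorsEntry> {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const low = view.getUint32(0, true);
  const high = view.getUint32(4, true);
  const headerLen = high * 2 ** 32 + low; // headers are far below 2^53 bytes
  const json = new TextDecoder().decode(bytes.subarray(8, 8 + headerLen));
  const header = JSON.parse(json) as Record<string, SafetensorsEntry>;
  delete header["__metadata__"]; // optional free-form metadata, not a tensor
  return header;
}

// Sum tensor count and byte size for each prefix (first matching prefix wins).
function bytesByPrefix(
  header: Record<string, SafetensorsEntry>,
  prefixes: string[],
): Record<string, PrefixTotal> {
  const totals: Record<string, PrefixTotal> = {};
  for (const p of prefixes) totals[p] = { tensors: 0, bytes: 0 };
  for (const name of Object.keys(header)) {
    const prefix = prefixes.find((p) => name.startsWith(p));
    if (prefix === undefined) continue;
    const [begin, end] = header[name].data_offsets;
    totals[prefix].tensors += 1;
    totals[prefix].bytes += end - begin;
  }
  return totals;
}
```

Calling `bytesByPrefix(header, ["model.language_model.", "model.visual.", "mtp."])` on the real header should give the three rows of the table, assuming the header has been fetched with a range request covering its first bytes.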
83
+
84
+ ### What "MTP" means here `[config-verified]`
85
+
86
+ **MTP = Multi-Token Prediction**, not a vision module. The text config sets
87
+ `"mtp_num_hidden_layers": 1` and `"mtp_use_dedicated_embeddings": false`. The 15 skipped `mtp.*`
88
+ tensors are a small **one-layer auxiliary transformer head** used for speculative / parallel
89
+ next-token decoding:
90
+
91
+ ```
92
+ mtp.fc.weight, mtp.norm.weight, mtp.pre_fc_norm_embedding.weight, mtp.pre_fc_norm_hidden.weight,
93
+ mtp.layers.0.{input_layernorm,post_attention_layernorm}.weight,
94
+ mtp.layers.0.self_attn.{q_proj,k_proj,v_proj,o_proj,q_norm,k_norm}.weight,
95
+ mtp.layers.0.mlp.{gate_proj,up_proj,down_proj}.weight
96
+ ```
97
+
98
+ It is an inference-speedup head (predicts the +2 token to enable self-speculative decoding); it is
99
+ **correctly skipped** for a basic autoregressive engine and has nothing to do with vision. Dropping
100
+ it costs only decode throughput, not capability.
101
+
102
+ ### The vision tower we're skipping — it's a full ViT `[config-verified]`
103
+
104
+ `config.json` ships a complete `vision_config` and the multimodal plumbing tokens, proving the base
105
+ checkpoint is multimodal by construction:
106
+
107
+ ```json
108
+ "image_token_id": 248056,
109
+ "video_token_id": 248057,
110
+ "vision_start_token_id": 248053, "vision_end_token_id": 248054,
111
+ "vision_config": {
112
+ "depth": 12, "hidden_size": 768, "num_heads": 12, "intermediate_size": 3072,
113
+ "in_channels": 3, "patch_size": 16, "spatial_merge_size": 2, "temporal_patch_size": 2,
114
+ "num_position_embeddings": 2304, "out_hidden_size": 1024,
115
+ "hidden_act": "gelu_pytorch_tanh"
116
+ }
117
+ ```
118
+
119
+ Tensor inventory (from the header, leaf names generalized over the 12 blocks):
120
+
121
+ ```
122
+ model.visual.patch_embed.proj.weight BF16 [768, 3, 2, 16, 16] ← Conv3d patch embed
123
+ model.visual.patch_embed.proj.bias
124
+ model.visual.pos_embed.weight BF16 [2304, 768] ← learned pos embeddings
125
+ model.visual.blocks.N.norm1.{weight,bias} ← LayerNorm
126
+ model.visual.blocks.N.attn.qkv.{weight,bias} [2304, 768] ← fused QKV (768→3*768)
127
+ model.visual.blocks.N.attn.proj.{weight,bias}
128
+ model.visual.blocks.N.norm2.{weight,bias} ← LayerNorm
129
+ model.visual.blocks.N.mlp.linear_fc1.{weight,bias} ← GELU MLP
130
+ model.visual.blocks.N.mlp.linear_fc2.{weight,bias}
131
+ model.visual.merger.norm.{weight,bias}
132
+ model.visual.merger.linear_fc1.{weight,bias} [3072, 3072] ← projector (merger)
133
+ model.visual.merger.linear_fc2.{weight,bias} [1024, 3072] → out_hidden_size 1024
134
+ ```
135
+
136
+ The **patch embed is a 5-D Conv3d** kernel `[out=768, in=3, T=2, 16, 16]` (temporal_patch_size=2 ×
137
+ 16×16 spatial patches over RGB) — this is the one genuinely new primitive. Everything else (LayerNorm,
138
+ fused QKV matmul, self-attention + softmax, GELU MLP, the 2-layer merger projecting each 2×2 block of
139
+ 768-d patch tokens (4×768 = 3072) down to 1024 to match the LM hidden size) is **standard transformer math the engine already runs for text.**
140
+
141
+ ### What it would take to NOT skip them
142
+
143
+ For **vision**: change the key mapper to *map* (not null) `model.visual.*` instead of dropping it.
144
+ That's a one-line change to start downloading the 192 MB. The real work is downstream: a Qwen3.5
145
+ vision-tower graph generator + the patch-embed conv kernel + image preprocessing + splicing the
146
+ merged image embeddings into the text sequence at `image_token_id` positions (see Part C).
147
+
148
+ For **MTP**: leave it skipped unless/until we implement speculative decoding. It is an optimization,
149
+ not a modality.
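The loader side of that change could look roughly like the sketch below. It is a guess at the shape of the change, not the actual patch: the function name `createQwen35KeyMapperWithVision` and the canonical `visual.` prefix are assumptions, and the real canonical naming would have to match whatever the vision graph generator expects.

```typescript
type HFKeyMapper = (hfKey: string) => string | null;

// Variant of the Qwen3.5 key mapper that keeps the vision tower and still
// drops the MTP head. The canonical "visual." prefix is assumed for illustration.
function createQwen35KeyMapperWithVision(): HFKeyMapper {
  const visionPrefixes = ["model.visual.", "vision_tower."];
  const textPrefixes = ["model.language_model.", "language_model.model.", "model."];
  return (hfKey: string): string | null => {
    // MTP stays skipped: it only matters once speculative decoding exists.
    if (hfKey.startsWith("mtp.")) return null;
    // Vision keys are mapped instead of nulled, so they get downloaded.
    for (const prefix of visionPrefixes) {
      if (hfKey.startsWith(prefix)) return "visual." + hfKey.slice(prefix.length);
    }
    // Text keys are stripped exactly as in the existing mapper.
    for (const prefix of textPrefixes) {
      if (hfKey.startsWith(prefix)) return hfKey.slice(prefix.length);
    }
    return hfKey;
  };
}
```

Checking the vision prefixes before the generic `model.` prefix matters, because `model.visual.*` would otherwise be stripped to `visual.*` by the text branch only by coincidence rather than by intent.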
150
+
151
+ ---
152
+
153
+ ## PART B — Qwen3.5 actual capabilities (live HF, config-verified)
154
+
155
+ ### B1. Is Qwen3.5-0.8B itself multimodal? `[config-verified]`
156
+
157
+ **Yes, text + vision, in a single checkpoint.** There is no separate VL repo — the base
158
+ `Qwen/Qwen3.5-0.8B` *is* the vision-language model. Evidence (all from the live config/header):
159
+
160
+ - Architecture `Qwen3_5ForConditionalGeneration` (a `*ForConditionalGeneration` multimodal wrapper),
161
+ `model_type: "qwen3_5"`.
162
+ - A full `vision_config` (12-layer ViT) ships in the config.
163
+ - `image_token_id` / `video_token_id` / `vision_start/end_token_id` are defined.
164
+ - `rope_parameters.mrope_interleaved: true` with `mrope_section: [11, 11, 10]` — **M-RoPE
165
+ (multimodal rotary)**, the time/height/width positional scheme used specifically to place image and
166
+ video tokens. A text-only model would not carry mrope sections.
167
+ - 153 `model.visual.*` weight tensors (191.9 MB) physically present in the safetensors.
168
+
169
+ "MTP" is the multi-token-prediction head (Part A), **not** a vision module — they are two separate
170
+ skipped groups.
171
+
172
+ ### B2. The Qwen3.5 family and its VL / Omni variants `[verified via HF API]`
173
+
174
+ Live `huggingface.co/api/models?author=Qwen&search=Qwen3.5` returns the whole family. **Every dense
175
+ member is a unified text+vision checkpoint; there are NO `Qwen3.5-VL` and NO `Qwen3.5-Omni` repos** —
176
+ vision is built into the base models:
177
+
178
+ | Model | Type | Notes |
179
+ |---|---|---|
180
+ | `Qwen/Qwen3.5-0.8B` (+ `-Base`) | dense | **smallest multimodal**; ~1.0 B params (0.8 B label) |
181
+ | `Qwen/Qwen3.5-2B` (+ `-Base`) | dense | |
182
+ | `Qwen/Qwen3.5-4B` (+ `-Base`) | dense | |
183
+ | `Qwen/Qwen3.5-9B` (+ `-Base`) | dense | |
184
+ | `Qwen/Qwen3.5-27B` | dense | also `-FP8`, `-GPTQ-Int4` |
185
+ | `Qwen/Qwen3.5-35B-A3B` | MoE (3B active) | also `-Base/-FP8/-GPTQ-Int4` |
186
+ | `Qwen/Qwen3.5-122B-A10B` | MoE | also `-FP8/-GPTQ-Int4` |
187
+ | `Qwen/Qwen3.5-397B-A17B` | MoE | also `-FP8/-GPTQ-Int4` |
188
+
189
+ **Pre-quantized 4-bit:** official `-GPTQ-Int4` exists for 27B and the MoE tiers only. For 0.8B there
190
+ is no official Int4 — Gerbil quantizes on the fly (our loader supports GPTQ and MLX repack paths).
191
+
192
+ **4-bit download size for the 0.8B (the one we run):** text backbone ~1435 MB BF16 → roughly
192
+ **400 MB at Int4** (matches the "404 MB needed" log). Adding the vision tower (191.9 MB BF16 →
193
+ ~50–60 MB at Int4, or ~96 MB at 8-bit; an unquantized 16-bit tower stays at ~192 MB) keeps a
194
+ vision-capable native model at roughly 455–500 MB total with a quantized tower, which is viable on mobile.
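The vision-tower size can be cross-checked from `vision_config` alone. The sketch below counts parameters following the tensor inventory in Part A and converts them to a size per bit width. Two assumptions are mine: `merger.norm` is taken to be a LayerNorm over `hidden_size`, and quantization overhead (scales, zero points) is ignored, so real Int4 sizes will be somewhat larger than the raw figure.

```typescript
interface VisionConfig {
  depth: number;
  hidden_size: number;
  intermediate_size: number;
  in_channels: number;
  patch_size: number;
  temporal_patch_size: number;
  spatial_merge_size: number;
  num_position_embeddings: number;
  out_hidden_size: number;
}

// Count weights and biases for the ViT, following the tensor inventory.
function countVisionParams(c: VisionConfig): number {
  const h = c.hidden_size;
  const i = c.intermediate_size;
  const perBlock =
    2 * h + 2 * h +     // norm1 and norm2 (weight + bias each)
    3 * h * h + 3 * h + // fused qkv
    h * h + h +         // attn.proj
    i * h + i +         // mlp.linear_fc1
    h * i + h;          // mlp.linear_fc2
  const patchEmbed = h * c.in_channels * c.temporal_patch_size * c.patch_size ** 2 + h;
  const posEmbed = c.num_position_embeddings * h;
  const merged = h * c.spatial_merge_size ** 2; // 2x2 block of tokens, concatenated
  const merger =
    2 * h +                                        // merger.norm (assumed width: hidden_size)
    merged * merged + merged +                     // merger.linear_fc1
    c.out_hidden_size * merged + c.out_hidden_size; // merger.linear_fc2
  return c.depth * perBlock + patchEmbed + posEmbed + merger;
}

// Raw size in MiB at a given bit width, ignoring quantization metadata.
function sizeMiB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / (1024 * 1024);
}

const qwen35Vision: VisionConfig = {
  depth: 12, hidden_size: 768, intermediate_size: 3072, in_channels: 3,
  patch_size: 16, temporal_patch_size: 2, spatial_merge_size: 2,
  num_position_embeddings: 2304, out_hidden_size: 1024,
};

const visionParams = countVisionParams(qwen35Vision); // 100,592,896 parameters
console.log(visionParams, sizeMiB(visionParams, 16).toFixed(1)); // about 191.9 MiB at 16 bits
console.log(sizeMiB(visionParams, 8).toFixed(1), sizeMiB(visionParams, 4).toFixed(1));
```

The 16-bit result lands on the 191.9 figure from the header when "MB" is read as MiB, which suggests the inventory is complete; halving the bit width halves the raw size.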
196
+
197
+ **Modalities, per family:**
198
+
199
+ - **Qwen3.5 (all sizes incl. 0.8B):** text in/out, **image-in, video-in** (vision tokens + temporal
200
+ patch + M-RoPE). **No audio, no speech-out.** `[config-verified]`
201
+ - **Audio / omni is a different family:** `Qwen/Qwen3-Omni-30B-A3B-{Instruct,Thinking,Captioner}` —
202
+ **30B MoE (~3B active)**, far too big for our mobile target, and it's Qwen3-Omni, not Qwen3.5.
203
+ Smallest omni anywhere is `Qwen/Qwen2.5-Omni-3B` (~5.5 B params) `[verified via HF API]`.
204
+
205
+ So: **vision is free inside the model we already run; audio is not available in any small Qwen3.5/omni
206
+ checkpoint we could realistically run on a phone.**
207
+
208
+ ### B3. Architecture of the smallest multimodal variant (Qwen3.5-0.8B vision tower) `[config-verified]`
209
+
210
+ - **Vision encoder:** a 12-layer **ViT** (`depth 12`, `hidden 768`, `12 heads`, `mlp 3072`,
211
+ `gelu_pytorch_tanh`), pre-norm blocks (norm1→attn→norm2→MLP), fused QKV.
212
+ - **Patch embed:** **Conv3d** `[768, 3, 2, 16, 16]` — `temporal_patch_size 2` × `16×16` patches over 3
213
+ RGB channels. (For single images the temporal dim is duplicated to 2; for video it spans frame
214
+ pairs.)
215
+ - **Positional:** learned `pos_embed [2304, 768]` plus M-RoPE on the LM side.
216
+ - **Merger / projector:** `spatial_merge_size 2` → concatenates a 2×2 patch block (768×4 = 3072),
217
+ LayerNorm, then `linear_fc1 [3072,3072]` → GELU → `linear_fc2 [1024,3072]` to produce one
218
+ `out_hidden_size = 1024` token per merged block, matching the LM `hidden_size 1024`.
219
+ - **No audio encoder/decoder** in Qwen3.5.
220
+
221
+ **Ops the engine needs vs. what it already has.** Engine implemented ops (from `src/gpu/ir.ts` +
222
+ `src/gpu/kernels/registry.ts`): Embedding(+Int4), MatMul(+Int4), Add, Mul, RMSNorm, **LayerNorm**,
223
+ RoPE, **Attention**, **Softmax**, SiLU, SwiGLU, **GELU**, plus the Mamba-SSM stack (MambaSSM,
224
+ CausalConv1d, etc.). Stubbed (declared in IR, **no WGSL kernel**): `Conv2d`, `AvgPool2d`,
225
+ `CrossAttention`, and the MoE ops.
226
+
227
+ Mapping the ViT onto that:
228
+
229
+ | ViT component | Engine status |
230
+ |---|---|
231
+ | LayerNorm (norm1/norm2/merger.norm) | ✅ have `LayerNorm` |
232
+ | Fused QKV / proj / MLP matmuls | ✅ have `MatMul` / `MatMulInt4` |
233
+ | Self-attention + softmax (bidirectional, non-causal) | ✅ have `Attention` + `Softmax` — needs a **non-causal mask flag** (text attention is causal) |
234
+ | GELU (`gelu_pytorch_tanh`) | ✅ have `GELU` (verify tanh approximation variant) |
235
+ | Merger projector (concat + 2-layer MLP) | ✅ matmul + a Concat/reshape (Concat is currently stubbed but trivial) |
236
+ | **Patch embed Conv3d `[768,3,2,16,16]`** | ❌ **new** — only genuinely missing primitive |
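For the "verify tanh approximation variant" note in the table, the reference to compare against is the tanh form of GELU, which is what `gelu_pytorch_tanh` denotes in HF configs. A scalar reference is sketched below; `geluTanh` is a name chosen for this example, and whether the engine's existing `GELU` kernel uses this form or the erf form is exactly what would need checking.

```typescript
// tanh approximation of GELU:
//   0.5 * x * (1 + tanh( sqrt(2/pi) * (x + 0.044715 * x^3) ))
function geluTanh(x: number): number {
  const k = Math.sqrt(2 / Math.PI);
  return 0.5 * x * (1 + Math.tanh(k * (x + 0.044715 * x ** 3)));
}

// Apply the activation elementwise, as a CPU reference for a GPU kernel test.
function geluTanhArray(input: Float32Array): Float32Array {
  const out = new Float32Array(input.length);
  for (let i = 0; i < input.length; i++) out[i] = geluTanh(input[i]);
  return out;
}
```

The erf and tanh variants differ only in the third or fourth decimal place, so a kernel test should compare against this reference with a tolerance rather than expect bit-exact output.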
237
+
238
+ The patch-embed conv is a strided non-overlapping conv = an **im2col/unfold + MatMul**: reshape each
239
+ non-overlapping 2×16×16×3 patch into a 1536-vector and multiply by the reshaped
240
+ `[768, 1536]` weight. **No general Conv2d/Conv3d kernel needed** — a small unfold (gather/reshape)
241
+ shader feeding the existing MatMul covers it. `CrossAttention` and `AvgPool2d` are **not** required by
242
+ this ViT (it uses standard self-attention and a learned merger, not pooling/cross-attn).
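A CPU reference for the unfold + MatMul formulation is sketched below, mainly to pin down the flattening order. It assumes the input volume is laid out as `[C, T, H, W]` and flattens each patch in `(c, t, y, x)` order so that it lines up with a Conv3d weight `[out, C, T, P, P]` reshaped to `[out, C*T*P*P]`. The names `unfoldPatches` and `patchEmbed` are illustrative, and the patch ordering the real preprocessor emits may differ.

```typescript
interface Unfolded {
  patches: Float32Array; // [numPatches, patchDim], row-major
  numPatches: number;
  patchDim: number;
}

// Gather non-overlapping tp x p x p patches from a [C, T, H, W] volume.
function unfoldPatches(
  input: Float32Array, C: number, T: number, H: number, W: number,
  tp: number, p: number,
): Unfolded {
  if (T % tp !== 0 || H % p !== 0 || W % p !== 0) {
    throw new Error("T, H, W must be multiples of the patch sizes");
  }
  const gt = T / tp, gh = H / p, gw = W / p;
  const patchDim = C * tp * p * p;
  const numPatches = gt * gh * gw;
  const patches = new Float32Array(numPatches * patchDim);
  let dst = 0;
  for (let pt = 0; pt < gt; pt++)
    for (let py = 0; py < gh; py++)
      for (let px = 0; px < gw; px++)
        // Inner order (c, t, y, x) matches the flattened Conv3d weight.
        for (let c = 0; c < C; c++)
          for (let t = 0; t < tp; t++)
            for (let y = 0; y < p; y++)
              for (let x = 0; x < p; x++) {
                const tt = pt * tp + t, yy = py * p + y, xx = px * p + x;
                patches[dst++] = input[((c * T + tt) * H + yy) * W + xx];
              }
  return { patches, numPatches, patchDim };
}

// out[n, o] = bias[o] + sum_k patches[n, k] * weight[o, k]
function patchEmbed(
  u: Unfolded, weight: Float32Array, bias: Float32Array, outDim: number,
): Float32Array {
  const out = new Float32Array(u.numPatches * outDim);
  for (let n = 0; n < u.numPatches; n++) {
    for (let o = 0; o < outDim; o++) {
      let acc = bias[o];
      for (let k = 0; k < u.patchDim; k++) {
        acc += u.patches[n * u.patchDim + k] * weight[o * u.patchDim + k];
      }
      out[n * outDim + o] = acc;
    }
  }
  return out;
}
```

With the Qwen3.5 values (`C=3`, `tp=2`, `p=16`) `patchDim` comes out as 1536, so the weight is read as `[768, 1536]` without any data movement; only the gather loop would need a new WGSL shader.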
243
+
244
+ ---
245
+
246
+ ## PART C — Verdict & build plan
247
+
248
+ ### Can one native model deliver text + vision (+ audio) on our WebGPU engine?
249
+
250
+ - **Text + vision (image/video-in): YES, realistically.** It's the *same checkpoint we already run* —
251
+ we are choosing to discard its vision tower. The ViT reuses our existing matmul / attention /
252
+ LayerNorm / GELU / softmax kernels. Only one new primitive (the patch-embed conv, implementable as
253
+ unfold+matmul) plus host-side image preprocessing and token splicing are required.
254
+ - **Audio: NO.** Qwen3.5 has no audio path at all. The only omni checkpoints are `Qwen3-Omni-30B-A3B`
255
+ (30B MoE) and `Qwen2.5-Omni-3B` (~5.5B) — both too large for mobile, both a different architecture
256
+ requiring a Whisper-style audio encoder, a codec/talker, and `code2wav` vocoder (RVQ codec decode)
257
+ we don't remotely have. Audio stays out of scope for the native engine.
258
+
259
+ **So it dissolves the two-engine problem *partially*: vision yes, audio no.** If the second
260
+ engine (transformers.js) on mobile was there for *vision*, a vision-enabled Gerbil can replace it and
261
+ we ship one native model for text + image/video. If audio was also required, that still needs a
262
+ separate path — but no small Qwen model gives us audio either, so that's a model-availability wall,
263
+ not a Gerbil limitation.
264
+
265
+ ### Build tier and specific work to enable native Qwen3.5 vision
266
+
267
+ **Tier: mid-size (1 new kernel, 1 new graph generator, host glue). Not a rewrite.**
268
+
269
+ 1. **Loader (trivial):** stop nulling `model.visual.*` in `createKeyMapperForArch` (keep skipping
270
+ `mtp.*`). Map vision keys to canonical names; download grows by ~192 MB BF16 (~50–96 MB if we
271
+ quantize the tower to Int4 or 8-bit).
272
+ 2. **Vision graph generator (new):** add a `generateQwen3_5VisionGraph()` reading `vision_config`,
273
+ emitting the 12 ViT blocks + merger, reusing existing LayerNorm/MatMul/Attention/Softmax/GELU
274
+ nodes. Add a **non-causal flag** to the Attention op (vision attention is bidirectional).
275
+ 3. **Patch-embed kernel (new, small):** unfold/im2col shader that reshapes 2×16×16×3 patches →
276
+ `[num_patches, 1536]`, then MatMul by `[768,1536]` weight + bias. (Reuses MatMul; the only new WGSL
277
+ is the gather/unfold.) `Concat`/`Reshape` for the spatial-merge step need real kernels (currently
278
+ stubbed) but are simple memory shuffles.
279
+ 4. **Image preprocessing (host):** resize/normalize to the patch grid, build the temporal pair, emit
280
+ patch tensors — JS/Canvas/`ImageBitmap`, no GPU.
281
+ 5. **Sequence splicing (executor):** replace `image_token_id` (248056) placeholder positions in the
282
+ text embedding stream with merged vision tokens, and apply **M-RoPE** (`mrope_section [11,11,10]`,
283
+ `mrope_interleaved`) for the multimodal position ids. This M-RoPE variant is new vs. our current
284
+ text RoPE and is the second-most-involved piece after the patch conv.
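The embedding half of step 5 can be sketched on the CPU as follows. This is an assumption-laden illustration: `spliceImageEmbeddings` is a hypothetical helper, embeddings are taken to be row-major `[seqLen, hidden]`, and M-RoPE position ids are deliberately left out because their exact layout is not specified here.

```typescript
// Replace the embedding rows at image placeholder positions with merged
// vision tokens, in order. textEmbeds is [seqLen, hidden]; visionEmbeds is
// [numVisionTokens, hidden]; both are row-major.
function spliceImageEmbeddings(
  tokenIds: number[],
  textEmbeds: Float32Array,
  visionEmbeds: Float32Array,
  hidden: number,
  imageTokenId = 248056, // image_token_id from config.json
): Float32Array {
  if (textEmbeds.length !== tokenIds.length * hidden) {
    throw new Error("textEmbeds does not match tokenIds");
  }
  const numVision = visionEmbeds.length / hidden;
  const out = textEmbeds.slice(); // leave the caller's buffer untouched
  let v = 0;
  for (let i = 0; i < tokenIds.length; i++) {
    if (tokenIds[i] !== imageTokenId) continue;
    if (v >= numVision) throw new Error("more placeholders than vision tokens");
    out.set(visionEmbeds.subarray(v * hidden, (v + 1) * hidden), i * hidden);
    v++;
  }
  if (v !== numVision) throw new Error("unused vision tokens left over");
  return out;
}
```

The two count checks are worth keeping in a real implementation, since a mismatch between the number of placeholders the tokenizer emitted and the number of merged tokens the ViT produced is an easy bug to introduce in the host-side preprocessing.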
285
+
286
+ **New kernels needed:** patch-embed unfold (small), real `Concat`/`Reshape` (trivial), non-causal
287
+ attention flag, M-RoPE position handling. **Not needed:** general `Conv2d`/`Conv3d`, `AvgPool2d`,
288
+ `CrossAttention`, MoE ops.
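The non-causal flag amounts to a different attention mask. As a reference for what the flag would change, here is a minimal sketch of an additive mask builder; `buildAttentionMask` is a name invented for this example, and a real kernel would more likely branch on the flag than materialize an `n × n` mask.

```typescript
// Additive attention mask of shape [n, n]: row = query, column = key.
// 0 means "may attend"; -Infinity means "masked out before softmax".
function buildAttentionMask(n: number, causal: boolean): Float32Array {
  const mask = new Float32Array(n * n); // zero-filled: everything visible
  if (!causal) return mask;             // vision attention: bidirectional
  for (let q = 0; q < n; q++) {
    for (let k = q + 1; k < n; k++) {
      mask[q * n + k] = -Infinity;      // text attention: hide future keys
    }
  }
  return mask;
}
```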
289
+
290
+ **Bottom line:** Qwen3.5-0.8B is one native multimodal model that, with a mid-size vision build
291
+ (~1 real new kernel + a vision graph + M-RoPE + host image prep), gives Gerbil **text + image/video**
292
+ on WebGPU and lets us drop a second vision engine on mobile. **Audio is genuinely unavailable** in any
293
+ small Qwen3.5/omni checkpoint, so it remains a separate problem regardless of engine work.
@@ -0,0 +1,337 @@
1
+ # Qwen3.6 & Gemma 4 — Smallest On-Device Targets (Verification Follow-up)
2
+
3
+ Research date: **2026-06-13**. Follow-up to `docs/research/sota-mobile-models-2026.md`.
4
+ Goal: pick the two smallest on-device models to support next. SmolLM dropped per instruction.
5
+
6
+ Fact tags: **[config-verified]** = fetched the actual `config.json` / HF API live today;
7
+ **[card/docs]** = official model card / vendor docs; **[social/secondary]** = third-party.
8
+
9
+ ---
10
+
11
+ ## 0. TL;DR (the two picks)
12
+
13
+ | Pick | Exact repo ID | Arch | 4-bit download | Dense or hybrid | Gerbil tier |
14
+ |---|---|---|---|---|---|
15
+ | **Smallest "Qwen"** | **`Qwen/Qwen3-0.6B`** (+ `Qwen/Qwen3-0.6B-MLX-4bit`) | dense full-attention transformer | **0.32 GB** (MLX 4-bit) | **DENSE** (not hybrid) | **Tier 1** |
16
+ | **Smallest "Gemma 4"** | **`google/gemma-4-E2B`** (+ `google/gemma-4-E2B-it-qat-q4_0-gguf`) | dense transformer w/ sliding-window attn | **3.35 GB** (official QAT q4_0, text-only) | dense (interleaved SWA) | **Tier 2** |
17
+
18
+ **Headline correction:** There is **no small Qwen3.6**. Qwen3.6 exists but the smallest official
19
+ variant is **27B dense** (also a 35B-A3B MoE). No 0.6B / 1.7B / 4B Qwen3.6 exists or is announced.
20
+ The smallest deployable Qwen on-device model remains **Qwen3-0.6B** (the prior Qwen3 dense generation).
21
+
22
+ **Gemma 4 correction:** No `gemma-4-270m`, no `gemma-4-1B`. The 270M only exists for **Gemma 3**.
23
+ The smallest Gemma 4 is **E2B** (effective ~2B via Per-Layer Embeddings; *not* MatFormer — see
24
+ §2.5). Confirmed against the full `author=google&search=gemma-4` listing.
25
+
26
+ ---
27
+
28
+ ## 1. Target 1 — "smallest Qwen3.6" → does not exist; use Qwen3-0.6B
29
+
30
+ ### 1a. Qwen3.6 existence check [config-verified]
31
+
32
+ Full official Qwen org listing for `Qwen3.6` (HF API `author=Qwen&search=Qwen3.6`, fetched today):
33
+
34
+ ```
35
+ Qwen/Qwen3.6-35B-A3B (MoE, 35B total / 3B active)
36
+ Qwen/Qwen3.6-27B (dense)
37
+ Qwen/Qwen3.6-35B-A3B-FP8
38
+ Qwen/Qwen3.6-27B-FP8
39
+ ```
40
+
41
+ That is the **entire** official family. Hub-wide search for `Qwen3.6 0.6B / 1.7B / 4B / mini / nano / edge`
42
+ returns only community quants/finetunes **of the 27B and 35B-A3B** — no small base model anywhere.
43
+ Probes for `Qwen/Qwen3.6-0.6B`, `-0.5B`, `-1.7B`, `-0.8B`, `-0.6B-Instruct` all return HTTP **401**, which the HF API
43
+ uses as its 404 equivalent for unauthenticated requests (nonexistent repo, not gated). **[config-verified]**
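An existence probe of this kind can be sketched as below. The status interpretation follows the observation above (200 for an existing public repo, 401 for a repo that does not exist when the request is unauthenticated); treating 404 the same way and the `RepoStatus` labels are my own assumptions, and an unauthenticated 401 cannot by itself distinguish a missing repo from a private one.

```typescript
type RepoStatus = "exists" | "missing-or-private" | "other";

// Map an HTTP status from the model-info endpoint to a coarse verdict.
function classifyStatus(status: number): RepoStatus {
  if (status === 200) return "exists";
  if (status === 401 || status === 404) return "missing-or-private";
  return "other"; // rate limits, server errors, and so on
}

// Probe a repo id such as "Qwen/Qwen3-0.6B" without authentication.
async function probeRepo(repoId: string): Promise<RepoStatus> {
  const res = await fetch(`https://huggingface.co/api/models/${repoId}`);
  return classifyStatus(res.status);
}
```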
45
+
46
+ Independent confirmation: Perplexity (`sonar-pro`, today) — *"No official ~0.6B/1.7B/≤5B-class Qwen3.6
47
+ model has been released; the documented Qwen3.6 family is 27B dense and 35B-A3B MoE only."* Sources:
48
+ Unsloth Qwen3.6 docs (https://unsloth.ai/docs/models/qwen3.6), Qwen blog
49
+ (https://qwen.ai/blog?id=qwen3.6-35b-a3b), GitHub QwenLM/Qwen3.6. **[social/secondary]**
50
+
51
+ > The user's belief that "Qwen released a 3.6 generation with a ~0.6B model" is **incorrect** as of
52
+ > 2026-06-13. Qwen3.6 is a large-model release (27B/35B-A3B). For a smallest on-device Qwen, the
53
+ > correct target is **Qwen3-0.6B** from the Qwen3 dense generation.
54
+
55
+ ### 1b. `Qwen/Qwen3-0.6B` — verified config [config-verified]
56
+
57
+ Fetched `https://huggingface.co/Qwen/Qwen3-0.6B/resolve/main/config.json`:
58
+
59
+ | Field | Value |
60
+ |---|---|
61
+ | `architectures[0]` | **`Qwen3ForCausalLM`** |
62
+ | `model_type` | **`qwen3`** |
63
+ | `hidden_size` | 1024 |
64
+ | `num_hidden_layers` | 28 |
65
+ | `num_attention_heads` | 16 |
66
+ | `num_key_value_heads` | 8 (GQA 2:1) |
67
+ | `head_dim` | **128** (note: ≠ hidden/heads = 64; set explicitly) |
68
+ | `intermediate_size` | 3072 |
69
+ | `vocab_size` | 151936 |
70
+ | `tie_word_embeddings` | **true** |
71
+ | `rope_theta` | 1000000 |
72
+ | `max_position_embeddings` | 40960 |
73
+ | `rms_norm_eps` | 1e-6 |
74
+ | `attention_bias` | **false** |
75
+ | `sliding_window` / `use_sliding_window` | null / false |
76
+ | `hidden_act` | silu (→ SwiGLU MLP) |
77
+
78
+ **Dense vs hybrid: 100% DENSE full-attention transformer.** No SSM, no linear/Gated-DeltaNet
79
+ layers, no sliding window. It uses the **simple dense path**, NOT Gerbil's Qwen3.5 hybrid path.
80
+ **[config-verified]**
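The config values above translate directly into projection shapes and KV-cache cost, which is where the explicit `head_dim` matters. The sketch below is plain arithmetic on the table; the helper names are illustrative, and the 2-bytes-per-element default assumes an fp16 KV cache, which is an assumption about the runtime rather than something the config states.

```typescript
interface AttentionConfig {
  hidden_size: number;
  num_hidden_layers: number;
  num_attention_heads: number;
  num_key_value_heads: number;
  head_dim: number;
}

interface AttentionShapes {
  qProj: [number, number]; // [out, in]
  kProj: [number, number];
  vProj: [number, number];
  oProj: [number, number];
  queryHeadsPerKvHead: number;
}

// Projection shapes when head_dim is set explicitly instead of hidden / heads.
function attentionShapes(c: AttentionConfig): AttentionShapes {
  const qDim = c.num_attention_heads * c.head_dim;
  const kvDim = c.num_key_value_heads * c.head_dim;
  return {
    qProj: [qDim, c.hidden_size],
    kProj: [kvDim, c.hidden_size],
    vProj: [kvDim, c.hidden_size],
    oProj: [c.hidden_size, qDim],
    queryHeadsPerKvHead: c.num_attention_heads / c.num_key_value_heads,
  };
}

// Bytes of K plus V cache added per generated token across all layers.
function kvCacheBytesPerToken(c: AttentionConfig, bytesPerElement = 2): number {
  return 2 * c.num_hidden_layers * c.num_key_value_heads * c.head_dim * bytesPerElement;
}

const qwen3_06b: AttentionConfig = {
  hidden_size: 1024, num_hidden_layers: 28, num_attention_heads: 16,
  num_key_value_heads: 8, head_dim: 128,
};

console.log(attentionShapes(qwen3_06b).qProj);   // [2048, 1024]
console.log(kvCacheBytesPerToken(qwen3_06b));    // 114688 bytes per token at fp16
```

Note that `q_proj` widens the hidden state from 1024 to 2048, so a loader that derives `head_dim` as `hidden_size / num_attention_heads` would compute 64 and mis-shape every attention tensor.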
81
+
82
+ **Non-standard features:**
83
+ - **No QKV bias** (`attention_bias: false`) — opposite of Qwen2, which has QKV bias. **[config-verified]**
84
+ - **QK-norm: YES** — Qwen3 applies per-head RMSNorm to Q and K (`self_attn.q_norm.weight`,
85
+ `self_attn.k_norm.weight`). This is the defining Qwen3 architectural change vs Qwen2. Not visible
86
+ as a scalar in config.json but present in the weights and modeling code; confirmed by Gerbil's own
87
+ loader already listing `layers.{i}.self_attn.q_norm.weight` / `k_norm.weight`. **[card/docs]**
88
+ - Tied embeddings (saves shipping a separate lm_head). **[config-verified]**
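QK-norm, as described in the list above, is an RMSNorm applied per head to the projected Q and K vectors before RoPE and the attention scores. A CPU reference might look like the sketch below; `qkNorm` is a name made up for this example, and sharing one `[head_dim]` weight vector across all heads is my reading of the `q_norm.weight` / `k_norm.weight` tensors rather than something the config spells out.

```typescript
// Apply RMSNorm independently to each head's slice of a projected Q or K vector.
// x is [numHeads * headDim]; weight is [headDim] and shared by all heads.
function qkNorm(
  x: Float32Array,
  numHeads: number,
  headDim: number,
  weight: Float32Array,
  eps = 1e-6, // rms_norm_eps from config.json
): Float32Array {
  if (x.length !== numHeads * headDim || weight.length !== headDim) {
    throw new Error("shape mismatch");
  }
  const out = new Float32Array(x.length);
  for (let h = 0; h < numHeads; h++) {
    const base = h * headDim;
    let sumSq = 0;
    for (let d = 0; d < headDim; d++) sumSq += x[base + d] * x[base + d];
    const invRms = 1 / Math.sqrt(sumSq / headDim + eps);
    for (let d = 0; d < headDim; d++) {
      out[base + d] = x[base + d] * invRms * weight[d];
    }
  }
  return out;
}
```

Because each head is normalized on its own, a head with small activations is not drowned out by a head with large ones, which is the practical difference from normalizing the whole projected vector at once.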
89
+
90
+ **Sizes:**
91
+ - Raw bf16 `model.safetensors` = **1,503,300,328 B ≈ 1.50 GB** (HTTP content-length). **[config-verified]**
92
+ - **Official `Qwen/Qwen3-0.6B-MLX-4bit` = 0.317 GB total** (4-bit incl. quantized tied embeddings). **[config-verified]**
93
+ - Official `Qwen/Qwen3-0.6B-GGUF` exists (Q4 ≈ 0.4–0.5 GB; Gerbil quantizes on-the-fly from bf16
94
+ anyway, so the ~0.42 GB g128 INT4 estimate from the prior report stands). **[config-verified for repo existence]**
95
+
96
+ Official quant repos confirmed live: `Qwen/Qwen3-0.6B-MLX-4bit`, `Qwen/Qwen3-0.6B-GGUF`,
97
+ `unsloth/Qwen3-0.6B-GGUF`, `unsloth/Qwen3-0.6B-bnb-4bit`. **[config-verified]**
98
+
99
+ ---
100
+
101
+ ## 2. Target 2 — smallest Gemma 4 = `google/gemma-4-E2B`
102
+
103
+ ### 2a. Smallest-variant check [config-verified]
104
+
105
+ Full `author=google&search=gemma-4` listing (fetched today) contains, by size:
106
+ **E2B, E4B, 12B, 26B-A4B (MoE), 31B** — plus `-it`, `-qat-q4_0-gguf`, `-qat-mobile-transformers`,
107
+ `-qat-w4a16-ct` variants of each. **No `gemma-4-270m`, no `gemma-4-1B`, no `gemma-4-E1B`.** The only
108
+ 270M / 1B Google models are **Gemma 3** (`google/gemma-3-270m-*`, `gemma-3-1b-*`). So **E2B is the
109
+ smallest Gemma 4.** Perplexity concurs: *"Gemma 4-2B (~2B dense) is the smallest public Gemma 4; no
110
+ ~270M or ~1B Gemma 4 as of June 2026."* **[config-verified + social/secondary]**
111
+
112
+ `google/gemma-4-E2B` HF API returns HTTP 200 (ungated, downloadable without auth). **[config-verified]**
113
+
114
+ ### 2b. `google/gemma-4-E2B` — verified config (`text_config`) [config-verified]
115
+
116
+ Fetched `https://huggingface.co/google/gemma-4-E2B/resolve/main/config.json`. It is a multimodal
117
+ container (`Gemma4ForConditionalGeneration`, `model_type: gemma4`) with `text_config` + `vision_config`
118
+ + `audio_config`. For a text-only Gerbil deployment, only `text_config` matters:
119
+
120
+ | Field (`text_config`) | Value |
121
+ |---|---|
122
+ | `architectures[0]` (top) | `Gemma4ForConditionalGeneration` |
123
+ | `model_type` | `gemma4_text` (container `gemma4`) |
124
+ | `hidden_size` | 1536 |
125
+ | `num_hidden_layers` | 35 |
126
+ | `num_attention_heads` | 8 |
127
+ | `num_key_value_heads` | **1 (MQA)** |
128
+ | `head_dim` | **256** (≠ hidden/heads) |
129
+ | `intermediate_size` | 6144 |
130
+ | `vocab_size` | **262144** (large embed/logit matmul) |
131
+ | `tie_word_embeddings` | **true** |
132
+ | `rms_norm_eps` | 1e-6 |
133
+ | `attention_bias` | false |
134
+ | `hidden_activation` | **`gelu_pytorch_tanh`** (→ **GeGLU** MLP, not SwiGLU) |
135
+ | `sliding_window` | **512** |
136
+ | `final_logit_softcapping` | **30.0** |
137
+ | `max_position_embeddings` | 131072 |
138
+
139
+ **Sliding-window attention layout (`layer_types`)** [config-verified]:
140
+ 35 layers, pattern = 4× sliding then 1× full, repeating → **full attention at layers
141
+ 4,9,14,19,24,29,34 (7 full / 28 sliding) = 4:1 sliding:full ratio.** (Prior report said 5:1 — the
142
+ live config shows full every 5th layer, i.e. a **4:1** sliding-to-full ratio.) `sliding_window: 512`.
143
+
144
+ **Dual RoPE** [config-verified]:
145
+ - full_attention layers: `rope_theta: 1e6`, `rope_type: proportional`, **`partial_rotary_factor: 0.25`** (partial RoPE).
146
+ - sliding_attention layers: `rope_theta: 1e4`, `rope_type: default` (full RoPE).
147
+
148
+ **Other Gemma-4 specifics** [config-verified]:
149
+ - `query_pre_attn_scalar`: not present as a separate field in this config — query scaling is folded
150
+ via `head_dim`/`global_head_dim` (`global_head_dim: 512`). Gemma uses head_dim-based scaling; treat
151
+ as a parameter, not a kernel.
152
+ - `hidden_size_per_layer_input: 256` + `vocab_size_per_layer_input: 262144` → **Per-Layer Embeddings
153
+ (PLE)**: extra per-layer embedding tables the loader must place. Not a new compute kernel.
154
+ - `num_kv_shared_layers: 20` → **cross-layer KV sharing** (20 layers reuse KV from a paired layer).
155
+ New loader/graph wiring; reduces KV cache but adds plumbing.
156
+ - `use_double_wide_mlp: true`, `attention_k_eq_v: false`, `enable_moe_block: false` (E2B is dense, not MoE).
157
+ - **Effective "~2B" via PLE (NOT MatFormer):** the effective-vs-total split comes from Per-Layer
158
+ Embeddings, not elastic nesting (see §2.5). Load E2B as a normal fixed dense checkpoint — no
159
+ elastic-nesting code needed. **[card/docs]**
160
+ - Gemma RMSNorm uses `(1 + weight)` scaling — Gerbil's loader already bakes `+1` into Gemma-style
161
+ norm weights (see `model-loader.ts` comment), and norms appear pre/post attn + pre/post MLP.
162
+
163
+ **Sizes:**
164
+ - Raw bf16 single-file `model.safetensors` = **10,246,621,918 B ≈ 10.25 GB** (full multimodal weights). **[config-verified]**
165
+ - **Official QAT INT4 (`google/gemma-4-E2B-it-qat-q4_0-gguf`):** text weights
166
+ `gemma-4-E2B_q4_0-it.gguf` = **3.349 GB**; multimodal projector `gemma-4-E2B-it-mmproj.gguf` =
167
+ 0.987 GB (skip for text-only). So **text-only 4-bit download ≈ 3.35 GB.** **[config-verified]**
168
+ - (Prior report's "~2.6–2.9 GB" figure was the Unsloth/Google headline; the official Google QAT
169
+ q4_0 text GGUF measures **3.35 GB** live — use this.)
170
+
171
+ ---
172
+
173
+ ## 2.5 E2B vs E4B — separate checkpoints, NOT MatFormer slices [config-verified]
174
+
175
+ The addendum asked whether E2B is an elastic MatFormer sub-network nested inside E4B, or a
176
+ genuinely separate checkpoint. **Answer: genuinely separate checkpoints.** Both `config.json`
177
+ files were fetched live today and the dims differ structurally — E2B is not a slice of E4B.
178
+
179
+ **What "E" actually means.** The official E2B model card states verbatim: *"The 'E' in E2B and
180
+ E4B stands for 'effective' parameters. The smaller models incorporate **Per-Layer Embeddings
181
+ (PLE)** to maximize parameter efficiency… These embedding tables are large but are only used for
182
+ quick lookups, which is why the effective parameter count is much smaller than the total."* The
183
+ card lists Gemma 4 as **four distinct sizes — E2B, E4B, 26B-A4B, 31B** — i.e. separate models,
184
+ not nested elastic variants. **The card never mentions "MatFormer" or "Mix-n-Match" (grep count
185
+ = 0).** That is a Gemma-3n-generation mechanism; Gemma 4's edge effective/total split is driven
186
+ by **PLE**, not MatFormer nesting. So the addendum's hypothesis ("E4B is the full model with E2B
187
+ nested inside it as an elastic sub-network") is **incorrect for Gemma 4** — they are two distinct
188
+ checkpoints, each with its own `model.safetensors`. **[card/docs + config-verified]**
189
+
190
+ > Correction to earlier notes in this repo (and `sota-mobile-models-2026.md`): the "MatFormer"
191
+ > label applied to Gemma 4 E2B is wrong. Gemma 4 uses **PLE** for the effective-vs-total split.
192
+ > Treat E2B and E4B as ordinary fixed dense checkpoints — no elastic-nesting code is needed for
193
+ > either, which was the practical conclusion anyway.
194
+
195
+ ### Per-checkpoint facts (both `text_config`, fetched live 2026-06-13)
196
+
197
+ | | **E2B** | **E4B** |
198
+ |---|---|---|
199
+ | Exact HF repo ID | **`google/gemma-4-E2B`** (+ `-it`) | **`google/gemma-4-E4B`** (+ `-it`) |
200
+ | Effective params (card) | **2.3B effective** | **4.5B effective** |
201
+ | Total params incl. embeddings (card) | **5.1B** | **8B** |
202
+ | Raw bf16 `model.safetensors` (content-length) | **10,246,621,918 B ≈ 10.25 GB** | **15,992,595,884 B ≈ 15.99 GB** |
203
+ | Implied 4-bit download (full multimodal, ≈ raw/3.6) | **≈ 2.85 GB** | **≈ 4.44 GB** |
204
+ | Official QAT q4_0 GGUF (text-only) | **3.35 GB** (`gemma-4-E2B_q4_0-it.gguf`, measured) | **≈ 5.0–5.2 GB** (`-E4B-it-qat-q4_0-gguf`, gated; scales with raw) |
205
+ | `hidden_size` | **1536** | **2560** |
206
+ | `num_hidden_layers` | **35** | **42** |
207
+ | `num_attention_heads` | 8 | 8 |
208
+ | `num_key_value_heads` | **1 (MQA)** | **2 (GQA)** |
209
+ | `head_dim` / `global_head_dim` | 256 / 512 | 256 / 512 |
210
+ | `intermediate_size` (MLP) | **6144** | **10240** |
211
+ | `use_double_wide_mlp` | **true** | **false** |
212
+ | `num_kv_shared_layers` | **20** | **18** |
213
+ | `sliding_window` | 512 | 512 |
214
+ | SWA layout (`layer_types`) | 4:1 sliding:full (full @ 4,9,14,19,24,29,34) | 5:1 sliding:full (full @ 5,11,17,23,29,35,41) |
215
+ | `vocab_size` | 262144 | 262144 |
216
+ | `final_logit_softcapping` | 30.0 | 30.0 |
217
+ | `hidden_activation` | `gelu_pytorch_tanh` (GeGLU) | `gelu_pytorch_tanh` (GeGLU) |
218
+ | `tie_word_embeddings` | true | true |
219
+ | `max_position_embeddings` | 131072 | 131072 |
220
+ | `model_type` (container / text) | `gemma4` / `gemma4_text` | `gemma4` / `gemma4_text` |
221
+ | `architectures[0]` | `Gemma4ForConditionalGeneration` | `Gemma4ForConditionalGeneration` |
222
+ | HF gating | ungated (HTTP 200, no auth) | ungated (HTTP 200, no auth) |
223
+
224
+ The structural differences (hidden 1536 vs 2560, 35 vs 42 layers, 1 vs 2 KV heads, MLP 6144 vs
225
+ 10240, double-wide MLP only on E2B, 20 vs 18 KV-shared layers) prove these are **independently
226
+ trained checkpoints**, not one model sliced two ways. **[config-verified]**
227
+
228
+ ### Smaller download
229
+
230
+ **E2B is the smaller download** — by every measure: raw bf16 **10.25 GB vs 15.99 GB**, and
231
+ official text-only 4-bit **≈ 3.35 GB vs ≈ 5.0+ GB**. E2B is ~36% smaller in raw bf16 and roughly a third smaller at 4-bit.
232
+
233
+ ### Does the choice change Gerbil's implementation work? **No — same generator, same kernels.**
234
+
235
+ Both share `model_type: gemma4` / `Gemma4ForConditionalGeneration` and the **identical op set**:
236
+ sliding-window attention, GeGLU MLP, Gemma `(1+w)` RMSNorm, final-logit softcap, dual RoPE
237
+ (partial on full-attn layers, full on sliding layers), PLE tables, cross-layer KV sharing, tied
238
+ embeddings. Gerbil's generators read **every dimension from `config.json` at runtime** (verified
239
+ in `qwen3_5.ts`: `hidden_size`, `num_hidden_layers`, `num_key_value_heads`, `intermediate_size`,
240
+ `head_dim`, `layer_types`, … are all pulled from the config object). Therefore:
241
+
242
+ - **Same `gemma4.ts` generator** handles both — no E4B-specific code.
243
+ - **Same kernels** — the only deltas are config values (1536→2560, 35→42, MQA→GQA, MLP width,
244
+ `use_double_wide_mlp`, KV-shared count, SWA stride). All of these are
244
+ config-parameterized inputs to the graph, not separate kernels.
246
+ - The only practical difference is **which weights you load and how much memory/VRAM they need.**
247
+
248
+ One nuance to verify when implementing: `use_double_wide_mlp` is `true` for E2B and `false` for
249
+ E4B. This is a config flag the generator must honor (likely a 2× factor on the gate/up
250
+ projection), but it is a parameterized branch inside the single generator — not a separate kernel.
251
+
252
+ ### Verdict: is E4B worth the extra size on-device?
253
+
254
+ **For Gerbil's on-device / mobile target: pick E2B.** E4B is ~56% larger (15.99 vs 10.25 GB raw;
255
+ ~5.0 vs 3.35 GB at 4-bit). On an iPad/phone-class budget the extra ~1.7 GB of 4-bit weights plus
256
+ the larger KV/activation footprint (hidden 2560, 42 layers, 2 KV heads) materially raises peak
257
+ memory and lowers tokens/sec, for a quality bump (2.3B→4.5B effective) that is real but not
258
+ transformative at the edge. **E4B is "worth it" only on a laptop/desktop-class device** with
259
+ headroom to spare and a quality-over-latency preference. Since implementation cost is identical
260
+ (same generator, same kernels), Gerbil can support both from one code path and simply let the
261
+ user pick the weight set — but the **default on-device pick should be E2B**.
262
+
263
+ ---
264
+
265
+ ## 3. Gerbil implementation tiers (per `docs/adding-a-model-family.md`)
266
+
267
+ ### Target 1 — `Qwen/Qwen3-0.6B` → **Tier 1 (hours)**
268
+
269
+ `Qwen3ForCausalLM` is **already registered** in `src/gpu/architectures/index.ts` →
270
+ `generateQwen2Graph`. But `qwen2.ts` models the **Qwen2** layer (QKV bias, **no QK-norm**), while
271
+ Qwen3 is the inverse: **no QKV bias + adds per-head QK-RMSNorm**. So a vanilla load would produce
272
+ wrong outputs (missing q_norm/k_norm).
273
+
274
+ **What must be built (all reuse, zero new kernels):**
275
+ 1. Add **QK-norm** (per-head RMSNorm on Q and K) to the dense attention path. This is a
276
+ **copy-paste from `qwen3_5.ts`** (lines ~514–555 already wire `CANONICAL_KEYS.qNorm(i)` /
277
+ `kNorm(i)` + two `RMSNorm` nodes), and the loader already handles `self_attn.q_norm.weight` /
278
+ `k_norm.weight` (with the `+1` bake). Cleanest fix: branch on `model_type === "qwen3"` to emit the
279
+ QK-norm nodes and skip QKV bias.
280
+ 2. Confirm `attention_bias: false` path (skip the Qwen2 QKV-bias tensors).
281
+ 3. `head_dim` is read from config (128) — already handled. GQA 16/8, tied embeddings, RoPE θ=1e6 —
282
+ all already supported.
283
+ 4. Validate vs HF reference (Step 7 of the guide).
284
+
285
+ **New kernels: none.** All ops (RMSNorm, QK-norm RMSNorm, RoPE, GQA Attention, SwiGLU, tied
286
+ embeddings, SliceLastRow) already exist. Effort: hours.
287
+
288
+ ### Target 2 — `google/gemma-4-E2B` → **Tier 2 (days; 1 real new kernel)**
289
+
290
+ New `gemma4.ts` generator + register `Gemma4ForConditionalGeneration` (read `config.text_config`).
291
+
292
+ **What must be built:**
293
+ - **Tier-2 new kernel (the one real engineering item): sliding-window / banded attention** +
294
+ (for the memory win) a **windowed KV cache** — `sliding_window: 512`, interleaved 4:1 with full
295
+ attention. *Correctness-first v1 can treat SWA as full attention* (identical results while the
295
+ total sequence, prompt plus generated tokens, is ≤512) and ship without the new kernel; add the banded mask + windowed cache later for
297
+ long-context memory wins.
298
+ - **Small new variants (each tiny, not a full kernel class):**
299
+ - **GeGLU MLP** (swap SiLU→GELU `gelu_pytorch_tanh` in the gated MLP).
300
+ - **Final logit softcapping** (`tanh`-based, `final_logit_softcapping: 30.0`) — elementwise op on logits.
301
+ - **Gemma `(1+w)` RMSNorm** — loader already bakes `+1`; norms placed pre/post attn + pre/post MLP.
302
+ - **Dual RoPE** — full layers use partial RoPE (`partial_rotary_factor: 0.25`, θ=1e6); sliding
303
+ layers use full RoPE (θ=1e4). Gerbil already does partial RoPE; this is per-layer parameterization.
304
+ - **Loader work (no new compute kernels):**
305
+ - **Per-Layer Embeddings (PLE)** — place per-layer embedding tables (`hidden_size_per_layer_input: 256`).
306
+ - **Cross-layer KV sharing** (`num_kv_shared_layers: 20`) — graph wiring so 20 layers reuse a peer's KV.
307
+ - No MatFormer/elastic handling needed — E2B (and E4B) are fixed dense checkpoints; effective
308
+ param count is a PLE artifact, not a slice (§2.5).
309
+ - **Mobile budget note:** vocab 262144 × hidden 1536 tied embedding is large; at INT4 the embed
310
+ tensor is ~0.2 GB (fits the iPad 128 MB `maxStorageBufferBindingSize` only if sharded — **the
311
+ embed/logit weight will need sharding** per the guide's buffer-cap rule).
312
+
313
+ **Net:** Gemma 4 E2B is **Tier 2** — one genuinely new kernel (windowed attention, deferrable) plus
314
+ several small variants/loader features. Matches the prior report's Gemma4 assessment; the live config
315
+ refines it (4:1 not 5:1, MQA `num_kv_heads=1`, head_dim 256, dual RoPE, PLE, 20 shared-KV layers).
316
+
317
+ ---
318
+
319
+ ## Sources
320
+
321
+ Primary, fetched live 2026-06-13:
322
+ - `https://huggingface.co/Qwen/Qwen3-0.6B/resolve/main/config.json` [config-verified]
323
+ - `https://huggingface.co/google/gemma-4-E2B/resolve/main/config.json` and `.../gemma-4-E2B-it/...` [config-verified]
324
+ - `https://huggingface.co/google/gemma-4-E4B/resolve/main/config.json` and `.../gemma-4-E4B-it/...` [config-verified]
325
+ - `https://huggingface.co/google/gemma-4-E2B/raw/main/README.md` (official card: "E" = effective
326
+ params via Per-Layer Embeddings; four distinct sizes E2B/E4B/26B-A4B/31B; no MatFormer mention) [card/docs]
327
+ - HF API: `api/models?author=Qwen&search=Qwen3.6`, `api/models?author=google&search=gemma-4`,
328
+ `api/models/{repo}?blobs=true`, content-length headers [config-verified]
329
+ - `Qwen/Qwen3-0.6B-MLX-4bit` (0.317 GB), `google/gemma-4-E2B-it-qat-q4_0-gguf`
330
+ (`gemma-4-E2B_q4_0-it.gguf` 3.349 GB, mmproj 0.987 GB) [config-verified]
331
+ - Gerbil repo: `src/gpu/architectures/index.ts` (Qwen3ForCausalLM→qwen2.ts already registered),
332
+ `qwen2.ts` (no QK-norm), `qwen3_5.ts` (QK-norm reference wiring), `model-loader.ts`
333
+ (q_norm/k_norm keys + Gemma `+1` bake).
334
+
335
+ Secondary:
336
+ - Perplexity sonar-pro (2026-06-13): no small Qwen3.6; smallest Gemma 4 ≈ 2B. Citing unsloth.ai
337
+ Qwen3.6 docs, qwen.ai blog, github.com/QwenLM/Qwen3.6. [social/secondary]