@tryhamster/gerbil 1.0.0-rc.9 → 1.0.1
- package/LICENSE +1 -1
- package/README.md +318 -104
- package/dist/architectures-C1I5V3Dt.mjs +6070 -0
- package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
- package/dist/browser/index.d.ts +276 -590
- package/dist/browser/index.d.ts.map +1 -1
- package/dist/browser/index.js +592 -2334
- package/dist/browser/index.js.map +1 -1
- package/dist/cli.mjs +625 -1098
- package/dist/cli.mjs.map +1 -1
- package/dist/defaults-9komdrbY.mjs +24 -0
- package/dist/defaults-9komdrbY.mjs.map +1 -0
- package/dist/frameworks/express.d.mts +1 -3
- package/dist/frameworks/express.d.mts.map +1 -1
- package/dist/frameworks/express.mjs +7 -7
- package/dist/frameworks/express.mjs.map +1 -1
- package/dist/frameworks/fastify.d.mts +1 -1
- package/dist/frameworks/fastify.d.mts.map +1 -1
- package/dist/frameworks/fastify.mjs +3 -3
- package/dist/frameworks/fastify.mjs.map +1 -1
- package/dist/frameworks/hono.d.mts +1 -1
- package/dist/frameworks/hono.d.mts.map +1 -1
- package/dist/frameworks/hono.mjs +4 -4
- package/dist/frameworks/hono.mjs.map +1 -1
- package/dist/frameworks/next.d.mts +3 -2
- package/dist/frameworks/next.d.mts.map +1 -1
- package/dist/frameworks/next.mjs +4 -4
- package/dist/frameworks/next.mjs.map +1 -1
- package/dist/frameworks/react.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts.map +1 -1
- package/dist/frameworks/trpc.mjs +4 -4
- package/dist/frameworks/trpc.mjs.map +1 -1
- package/dist/gerbil-BetB5xb0.d.mts +488 -0
- package/dist/gerbil-BetB5xb0.d.mts.map +1 -0
- package/dist/gerbil-CTZUa8EZ.mjs +4 -0
- package/dist/gerbil-DNniplr4.mjs +1656 -0
- package/dist/gerbil-DNniplr4.mjs.map +1 -0
- package/dist/gpu/hooks.d.mts +640 -0
- package/dist/gpu/hooks.d.mts.map +1 -0
- package/dist/gpu/hooks.mjs +1369 -0
- package/dist/gpu/hooks.mjs.map +1 -0
- package/dist/gpu/index.d.mts +2 -0
- package/dist/gpu/index.mjs +6 -0
- package/dist/gpu-DFuglcEx.mjs +3790 -0
- package/dist/gpu-DFuglcEx.mjs.map +1 -0
- package/dist/index-Dgmb2kE3.d.mts +245 -0
- package/dist/index-Dgmb2kE3.d.mts.map +1 -0
- package/dist/index-DukkJRMj.d.mts +2114 -0
- package/dist/index-DukkJRMj.d.mts.map +1 -0
- package/dist/index.d.mts +22 -487
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +13 -8
- package/dist/index.mjs.map +1 -1
- package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
- package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
- package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
- package/dist/integrations/ai-sdk.d.mts +75 -6
- package/dist/integrations/ai-sdk.d.mts.map +1 -1
- package/dist/integrations/ai-sdk.mjs +131 -15
- package/dist/integrations/ai-sdk.mjs.map +1 -1
- package/dist/integrations/langchain.d.mts +1 -1
- package/dist/integrations/langchain.d.mts.map +1 -1
- package/dist/integrations/langchain.mjs +5 -5
- package/dist/integrations/langchain.mjs.map +1 -1
- package/dist/integrations/llamaindex.d.mts +1 -1
- package/dist/integrations/llamaindex.d.mts.map +1 -1
- package/dist/integrations/llamaindex.mjs +5 -5
- package/dist/integrations/llamaindex.mjs.map +1 -1
- package/dist/integrations/mcp-client.mjs +3 -3
- package/dist/integrations/mcp-client.mjs.map +1 -1
- package/dist/integrations/mcp.d.mts +3 -2
- package/dist/integrations/mcp.d.mts.map +1 -1
- package/dist/integrations/mcp.mjs +5 -5
- package/dist/{mcp-BvbriaBy.mjs → mcp-D2vvH1Xc.mjs} +4 -4
- package/dist/mcp-D2vvH1Xc.mjs.map +1 -0
- package/dist/memory/index.d.mts +3 -0
- package/dist/memory/index.mjs +6 -0
- package/dist/memory-D1P7Tmda.mjs +4 -0
- package/dist/memory-DVN0MnIG.mjs +132 -0
- package/dist/memory-DVN0MnIG.mjs.map +1 -0
- package/dist/memory-Dj0J1v88.mjs +294 -0
- package/dist/memory-Dj0J1v88.mjs.map +1 -0
- package/dist/moonshine-stt-17dpP1kr.mjs +4 -0
- package/dist/moonshine-stt-4ojLtMq7.mjs +11962 -0
- package/dist/moonshine-stt-4ojLtMq7.mjs.map +1 -0
- package/dist/{one-liner-s-lD8rCC.mjs → one-liner-JhdIPxzF.mjs} +14 -16
- package/dist/one-liner-JhdIPxzF.mjs.map +1 -0
- package/dist/repl-BDRkwPGX.mjs +9 -0
- package/dist/skills/index.d.mts +270 -320
- package/dist/skills/index.d.mts.map +1 -1
- package/dist/skills/index.mjs +5 -5
- package/dist/{skills-CD3Orlex.mjs → skills-CU694Dc8.mjs} +187 -32
- package/dist/skills-CU694Dc8.mjs.map +1 -0
- package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
- package/dist/tools-DQ1mPUw5.mjs.map +1 -0
- package/dist/types-DQBe2lFo.d.mts +165 -0
- package/dist/types-DQBe2lFo.d.mts.map +1 -0
- package/dist/{types-CiTc7ez3.d.mts → types-LlyYILII.d.mts} +112 -14
- package/dist/types-LlyYILII.d.mts.map +1 -0
- package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
- package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
- package/dist/vector-B0panuy6.mjs +95 -0
- package/dist/vector-B0panuy6.mjs.map +1 -0
- package/docs/PROJECT-STATE.md +321 -0
- package/docs/adding-a-model-family.md +280 -0
- package/docs/ai-sdk.md +70 -61
- package/docs/architecture/overview.md +17 -7
- package/docs/browser.md +203 -8
- package/docs/embeddings.md +156 -0
- package/docs/gerbil-site-native-migration.md +217 -0
- package/docs/gpu-engine/architectures.md +398 -0
- package/docs/gpu-engine/ir.md +372 -0
- package/docs/gpu-engine/kernels.md +718 -0
- package/docs/gpu-engine/paper.html +1759 -0
- package/docs/gpu-engine/paper.md +2109 -0
- package/docs/gpu-engine/safetensors.md +312 -0
- package/docs/gpu-engine/tokenizer.md +302 -0
- package/docs/memory-rag.md +91 -0
- package/docs/metal-safari-intel.md +190 -0
- package/docs/mobile-failure-diagnosis.md +124 -0
- package/docs/mobile.md +99 -0
- package/docs/observability.md +230 -0
- package/docs/onnx-removal-plan.md +339 -0
- package/docs/research/autoresearch-portable.md +904 -0
- package/docs/research/dispatch-reduction-hivemind.md +84 -0
- package/docs/research/ios-safari-model-caching.md +117 -0
- package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
- package/docs/research/native-stt-model-selection.md +49 -0
- package/docs/research/native-tts-model-selection.md +90 -0
- package/docs/research/native-vs-chromium-decision.md +152 -0
- package/docs/research/nemotron-mamba2-inference.md +910 -0
- package/docs/research/qwen35-multimodal.md +293 -0
- package/docs/research/qwen36-gemma4-targets.md +337 -0
- package/docs/research/sota-embedding-models.md +179 -0
- package/docs/research/sota-mobile-models-2026.md +263 -0
- package/docs/research/sota-modality-models.md +202 -0
- package/docs/research/tps-baselines.md +71 -0
- package/docs/research/webgpu-m4-reference.md +104 -0
- package/docs/site-update-plan.md +155 -0
- package/docs/structured-output.md +123 -0
- package/docs/stt.md +63 -446
- package/docs/tts.md +77 -499
- package/docs/vision.md +100 -338
- package/package.json +22 -7
- package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
- package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
- package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
- package/dist/gerbil-CJ3ifloF.mjs +0 -4
- package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
- package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
- package/dist/gerbil-qOTe1nl2.d.mts +0 -431
- package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
- package/dist/kokoro-BNTb6egA.mjs +0 -20210
- package/dist/kokoro-BNTb6egA.mjs.map +0 -1
- package/dist/kokoro-CMOGDSgT.js +0 -20212
- package/dist/kokoro-CMOGDSgT.js.map +0 -1
- package/dist/mcp-BvbriaBy.mjs.map +0 -1
- package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
- package/dist/repl-DveXw36T.mjs +0 -9
- package/dist/skills-CD3Orlex.mjs.map +0 -1
- package/dist/stt-Bu-E23Sc.js +0 -433
- package/dist/stt-Bu-E23Sc.js.map +0 -1
- package/dist/stt-CpLYbGFd.mjs +0 -433
- package/dist/stt-CpLYbGFd.mjs.map +0 -1
- package/dist/stt-DRPLEEHB.mjs +0 -3
- package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
- package/dist/transformers.web-DiD1gTwk.js +0 -44695
- package/dist/transformers.web-DiD1gTwk.js.map +0 -1
- package/dist/transformers.web-u34VxRFM.js +0 -3
- package/dist/tts-CqroPaSK.js +0 -724
- package/dist/tts-CqroPaSK.js.map +0 -1
- package/dist/tts-DXgsKGCe.mjs +0 -3
- package/dist/tts-DeGANMNV.mjs +0 -730
- package/dist/tts-DeGANMNV.mjs.map +0 -1
- package/dist/types-CiTc7ez3.d.mts.map +0 -1
- /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
- /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
- /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
|
@@ -0,0 +1,293 @@
|
|
|
1
|
+
# Qwen3.5 Multimodality — Native Engine Investigation
|
|
2
|
+
|
|
3
|
+
**Date:** 2026-06-13
|
|
4
|
+
**Question:** Is `Qwen/Qwen3.5-0.8B` (the model Gerbil runs) actually multimodal? Our native
|
|
5
|
+
loader downloads only its text tensors and skips ~192 MB labeled "vision/MTP". Could one native
|
|
6
|
+
model give us text + vision (+ audio) on our WebGPU engine, dissolving the two-engine problem on mobile?
|
|
7
|
+
|
|
8
|
+
**Verdict up front:** **YES — Qwen3.5-0.8B is a single, natively multimodal (text + vision)
|
|
9
|
+
checkpoint.** We are deliberately throwing away its vision tower. Re-enabling vision is a **mid-size
|
|
10
|
+
build** (one real new kernel: the patch-embed conv; everything else in the ViT reuses ops we already
|
|
11
|
+
have). **Audio is NOT in Qwen3.5 at all** — omni/audio lives in a separate, much larger family
|
|
12
|
+
(`Qwen3-Omni-30B-A3B`). So one native model covers **text + image/video-in**, not audio.
|
|
13
|
+
|
|
14
|
+
All facts below tagged `[config-verified]` were read directly from live HuggingFace files on
|
|
15
|
+
2026-06-13 (config.json + safetensors header), not from web prose.
|
|
16
|
+
|
|
17
|
+
---
|
|
18
|
+
|
|
19
|
+
## PART A — What Gerbil currently skips
|
|
20
|
+
|
|
21
|
+
### The filter rule
|
|
22
|
+
|
|
23
|
+
`src/gpu/model-loader.ts`, `createKeyMapperForArch()` (lines 480–504). For the
|
|
24
|
+
`Qwen3_5ForConditionalGeneration` architecture, the HF→canonical key mapper returns `null` for any
|
|
25
|
+
tensor under three prefixes, which causes the selective downloader to skip those byte ranges entirely:
|
|
26
|
+
|
|
27
|
+
```ts
function createKeyMapperForArch(architectureName: string): HFKeyMapper {
  if (architectureName === "Qwen3_5ForConditionalGeneration") {
    return (hfKey: string): string | null => {
      // Skip visual encoder, vision tower, and MTP weights
      if (
        hfKey.startsWith("model.visual.") ||
        hfKey.startsWith("vision_tower.") ||
        hfKey.startsWith("mtp.")
      ) {
        return null;
      }
      let key = hfKey;
      if (key.startsWith("model.language_model.")) {
        key = key.slice(21);
      } else if (key.startsWith("language_model.model.")) {
        key = key.slice(21);
      } else if (key.startsWith("model.")) {
        key = key.slice(6);
      }
      return key;
    };
  }
  return createDefaultHFKeyMapper();
}
```
|
|
53
|
+
|
|
54
|
+
### Where the skip is realized (the log line)
|
|
55
|
+
|
|
56
|
+
`loadModel()`, lines 626–640. Any entry whose mapped key is `null` is excluded from the download and
|
|
57
|
+
counted into `skippedBytes`:
|
|
58
|
+
|
|
59
|
+
```ts
const neededEntries = file.entries.filter((e) => keyMapper(e.name) !== null);
const skippedBytes = file.entries
  .filter((e) => keyMapper(e.name) === null)
  .reduce((sum, e) => sum + e.dataLength, 0);
// ...
onProgress?.(10, 100,
  `Selective download: ${dlMB} MB needed, skipping ${savedMB} MB (vision/MTP)`);
```
|
|
68
|
+
|
|
69
|
+
### What exactly is filtered out — measured from the live safetensors header `[config-verified]`
|
|
70
|
+
|
|
71
|
+
Single shard `model.safetensors-00001-of-00001.safetensors`, total **1666 MB** (BF16/F32, the
|
|
72
|
+
unquantized base). Broken down by prefix:
|
|
73
|
+
|
|
74
|
+
| Component | Tensors | Size | Skipped? |
|---|---|---|---|
| `model.language_model.*` (text backbone) | 320 | **1435.1 MB** | kept |
| `model.visual.*` (vision tower / ViT) | 153 | **191.9 MB** | **skipped** |
| `mtp.*` (multi-token prediction head) | 15 | **39.0 MB** | **skipped** |
|
|
79
|
+
|
|
80
|
+
So the "(vision/MTP)" label in that log line covers two separate groups: **~192 MB vision + ~39 MB MTP**
(~231 MB combined in the BF16 base). (In the q4 path the ratios shrink but the same prefixes are dropped;
"404 MB needed / 192 MB skipped" is the quantized profile.)
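
The per-prefix breakdown in the table can be reproduced from the safetensors header alone, without downloading any tensor data. The following is a minimal sketch, assuming the standard safetensors layout (an 8-byte little-endian header length, then a JSON header whose entries carry `data_offsets: [begin, end]`). `parseSafetensorsHeader` and `bytesByPrefix` are hypothetical helper names for illustration, not Gerbil's loader API.

```typescript
// Hypothetical helpers for illustration; not part of Gerbil's loader.
interface TensorInfo {
  dtype: string;
  shape: number[];
  data_offsets: [number, number]; // [begin, end) byte offsets in the data section
}

// Parse the JSON header from the leading bytes of a .safetensors file.
function parseSafetensorsHeader(bytes: Uint8Array): Record<string, TensorInfo> {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // The header length is a little-endian u64; read it as two u32 halves.
  const lo = view.getUint32(0, true);
  const hi = view.getUint32(4, true);
  const headerLen = hi * 4294967296 + lo;
  const json = new TextDecoder().decode(bytes.subarray(8, 8 + headerLen));
  const parsed = JSON.parse(json) as Record<string, unknown>;
  delete parsed["__metadata__"]; // optional free-form metadata, not a tensor
  return parsed as Record<string, TensorInfo>;
}

// Sum tensor byte sizes per prefix; names matching no prefix land in "other".
function bytesByPrefix(
  header: Record<string, TensorInfo>,
  prefixes: string[],
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const name of Object.keys(header)) {
    const info = header[name];
    const prefix = prefixes.find((p) => name.startsWith(p)) ?? "other";
    const size = info.data_offsets[1] - info.data_offsets[0];
    totals.set(prefix, (totals.get(prefix) ?? 0) + size);
  }
  return totals;
}
```

Calling `bytesByPrefix(header, ["model.language_model.", "model.visual.", "mtp."])` on the real header should yield the three rows of the table, in bytes.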
|
|
83
|
+
|
|
84
|
+
### What "MTP" means here `[config-verified]`
|
|
85
|
+
|
|
86
|
+
**MTP = Multi-Token Prediction**, not a vision module. The text config sets
|
|
87
|
+
`"mtp_num_hidden_layers": 1` and `"mtp_use_dedicated_embeddings": false`. The 15 skipped `mtp.*`
|
|
88
|
+
tensors are a small **one-layer auxiliary transformer head** used for speculative / parallel
|
|
89
|
+
next-token decoding:
|
|
90
|
+
|
|
91
|
+
```
mtp.fc.weight, mtp.norm.weight, mtp.pre_fc_norm_embedding.weight, mtp.pre_fc_norm_hidden.weight,
mtp.layers.0.{input_layernorm,post_attention_layernorm}.weight,
mtp.layers.0.self_attn.{q_proj,k_proj,v_proj,o_proj,q_norm,k_norm}.weight,
mtp.layers.0.mlp.{gate_proj,up_proj,down_proj}.weight
```
|
|
97
|
+
|
|
98
|
+
It is an inference-speedup head (predicts the +2 token to enable self-speculative decoding); it is
|
|
99
|
+
**correctly skipped** for a basic autoregressive engine and has nothing to do with vision. Dropping
|
|
100
|
+
it costs only decode throughput, not capability.
|
|
101
|
+
|
|
102
|
+
### The vision tower we're skipping — it's a full ViT `[config-verified]`
|
|
103
|
+
|
|
104
|
+
`config.json` ships a complete `vision_config` and the multimodal plumbing tokens, proving the base
|
|
105
|
+
checkpoint is multimodal by construction:
|
|
106
|
+
|
|
107
|
+
```json
"image_token_id": 248056,
"video_token_id": 248057,
"vision_start_token_id": 248053, "vision_end_token_id": 248054,
"vision_config": {
  "depth": 12, "hidden_size": 768, "num_heads": 12, "intermediate_size": 3072,
  "in_channels": 3, "patch_size": 16, "spatial_merge_size": 2, "temporal_patch_size": 2,
  "num_position_embeddings": 2304, "out_hidden_size": 1024,
  "hidden_act": "gelu_pytorch_tanh"
}
```
|
|
118
|
+
|
|
119
|
+
Tensor inventory (from the header, leaf names generalized over the 12 blocks):
|
|
120
|
+
|
|
121
|
+
```
model.visual.patch_embed.proj.weight                BF16 [768, 3, 2, 16, 16]  ← Conv3d patch embed
model.visual.patch_embed.proj.bias
model.visual.pos_embed.weight                       BF16 [2304, 768]          ← learned pos embeddings
model.visual.blocks.N.norm1.{weight,bias}                                     ← LayerNorm
model.visual.blocks.N.attn.qkv.{weight,bias}        [2304, 768]               ← fused QKV (768→3*768)
model.visual.blocks.N.attn.proj.{weight,bias}
model.visual.blocks.N.norm2.{weight,bias}                                     ← LayerNorm
model.visual.blocks.N.mlp.linear_fc1.{weight,bias}                            ← GELU MLP
model.visual.blocks.N.mlp.linear_fc2.{weight,bias}
model.visual.merger.norm.{weight,bias}
model.visual.merger.linear_fc1.{weight,bias}        [3072, 3072]              ← projector (merger)
model.visual.merger.linear_fc2.{weight,bias}        [1024, 3072]              → out_hidden_size 1024
```
|
|
135
|
+
|
|
136
|
+
The **patch embed is a 5-D Conv3d** kernel `[out=768, in=3, T=2, 16, 16]` (temporal_patch_size=2 ×
|
|
137
|
+
16×16 spatial patches over RGB) — this is the one genuinely new primitive. Everything else (LayerNorm,
|
|
138
|
+
fused QKV matmul, self-attention + softmax, GELU MLP, the 2-layer merger projecting 4×768 = 3072 → 1024 to match
|
|
139
|
+
the LM hidden size) is **standard transformer math the engine already runs for text.**
|
|
140
|
+
|
|
141
|
+
### What it would take to NOT skip them
|
|
142
|
+
|
|
143
|
+
For **vision**: change the key mapper to *map* (not null) `model.visual.*` instead of dropping it.
|
|
144
|
+
That's a one-line change to start downloading the 192 MB. The real work is downstream: a Qwen3.5
|
|
145
|
+
vision-tower graph generator + the patch-embed conv kernel + image preprocessing + splicing the
|
|
146
|
+
merged image embeddings into the text sequence at `image_token_id` positions (see Part C).
|
|
147
|
+
|
|
148
|
+
For **MTP**: leave it skipped unless/until we implement speculative decoding. It is an optimization,
|
|
149
|
+
not a modality.
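
The loader-side change described above could look roughly like the following. This is a sketch under stated assumptions: the `visual.` canonical prefix and the function name `createQwen35KeyMapperWithVision` are hypothetical, and the `HFKeyMapper` type is redeclared locally so the snippet stands alone. Only the decision to keep `model.visual.*` and keep skipping `mtp.*` comes from the text.

```typescript
// Local stand-in for the loader's mapper type (redeclared so the sketch is self-contained).
type HFKeyMapper = (hfKey: string) => string | null;

// Hypothetical variant of the Qwen3.5 mapper that keeps the vision tower.
function createQwen35KeyMapperWithVision(): HFKeyMapper {
  const visualPrefix = "model.visual.";
  const textPrefixes = ["model.language_model.", "language_model.model.", "model."];
  return (hfKey: string): string | null => {
    // MTP stays skipped: it is a decode-speed optimization, not a modality.
    if (hfKey.startsWith("mtp.")) return null;
    // Assumed canonical naming: strip "model." and keep a "visual." namespace.
    if (hfKey.startsWith(visualPrefix)) {
      return "visual." + hfKey.slice(visualPrefix.length);
    }
    // Text keys are stripped exactly as in the existing mapper.
    for (const prefix of textPrefixes) {
      if (hfKey.startsWith(prefix)) return hfKey.slice(prefix.length);
    }
    return hfKey;
  };
}
```

Keeping the vision keys in their own namespace avoids collisions with text-backbone names such as `norm.weight` once the `model.` prefix is removed.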
|
|
150
|
+
|
|
151
|
+
---
|
|
152
|
+
|
|
153
|
+
## PART B — Qwen3.5 actual capabilities (live HF, config-verified)
|
|
154
|
+
|
|
155
|
+
### B1. Is Qwen3.5-0.8B itself multimodal? `[config-verified]`
|
|
156
|
+
|
|
157
|
+
**Yes, text + vision, in a single checkpoint.** There is no separate VL repo — the base
|
|
158
|
+
`Qwen/Qwen3.5-0.8B` *is* the vision-language model. Evidence (all from the live config/header):
|
|
159
|
+
|
|
160
|
+
- Architecture `Qwen3_5ForConditionalGeneration` (a `*ForConditionalGeneration` multimodal wrapper),
|
|
161
|
+
`model_type: "qwen3_5"`.
|
|
162
|
+
- A full `vision_config` (12-layer ViT) ships in the config.
|
|
163
|
+
- `image_token_id` / `video_token_id` / `vision_start/end_token_id` are defined.
|
|
164
|
+
- `rope_parameters.mrope_interleaved: true` with `mrope_section: [11, 11, 10]` — **M-RoPE
|
|
165
|
+
(multimodal rotary)**, the time/height/width positional scheme used specifically to place image and
|
|
166
|
+
video tokens. A text-only model would not carry mrope sections.
|
|
167
|
+
- 153 `model.visual.*` weight tensors (191.9 MB) physically present in the safetensors.
|
|
168
|
+
|
|
169
|
+
"MTP" is the multi-token-prediction head (Part A), **not** a vision module — they are two separate
|
|
170
|
+
skipped groups.
|
|
171
|
+
|
|
172
|
+
### B2. The Qwen3.5 family and its VL / Omni variants `[verified via HF API]`
|
|
173
|
+
|
|
174
|
+
Live `huggingface.co/api/models?author=Qwen&search=Qwen3.5` returns the whole family. **Every dense
|
|
175
|
+
member is a unified text+vision checkpoint; there are NO `Qwen3.5-VL` and NO `Qwen3.5-Omni` repos** —
|
|
176
|
+
vision is built into the base models:
|
|
177
|
+
|
|
178
|
+
| Model | Type | Notes |
|---|---|---|
| `Qwen/Qwen3.5-0.8B` (+ `-Base`) | dense | **smallest multimodal**; ~1.0 B params (0.8 B label) |
| `Qwen/Qwen3.5-2B` (+ `-Base`) | dense | |
| `Qwen/Qwen3.5-4B` (+ `-Base`) | dense | |
| `Qwen/Qwen3.5-9B` (+ `-Base`) | dense | |
| `Qwen/Qwen3.5-27B` | dense | also `-FP8`, `-GPTQ-Int4` |
| `Qwen/Qwen3.5-35B-A3B` | MoE (3B active) | also `-Base/-FP8/-GPTQ-Int4` |
| `Qwen/Qwen3.5-122B-A10B` | MoE | also `-FP8/-GPTQ-Int4` |
| `Qwen/Qwen3.5-397B-A17B` | MoE | also `-FP8/-GPTQ-Int4` |
|
|
188
|
+
|
|
189
|
+
**Pre-quantized 4-bit:** official `-GPTQ-Int4` exists for 27B and the MoE tiers only. For 0.8B there
|
|
190
|
+
is no official Int4 — Gerbil quantizes on the fly (our loader supports GPTQ and MLX repack paths).
|
|
191
|
+
|
|
192
|
+
**4-bit download size for the 0.8B (the one we run):** text backbone ~1435 MB BF16 → roughly
|
|
193
|
+
**~400 MB at Int4** (matches the "404 MB needed" log). Adding the vision tower (191.9 MB BF16 →
|
|
194
|
+
~50–60 MB Int4, or keep it F16 at ~96 MB) keeps a vision-capable native model well under ~500 MB
|
|
195
|
+
total — viable on mobile.
|
|
196
|
+
|
|
197
|
+
**Modalities, per family:**
|
|
198
|
+
|
|
199
|
+
- **Qwen3.5 (all sizes incl. 0.8B):** text in/out, **image-in, video-in** (vision tokens + temporal
|
|
200
|
+
patch + M-RoPE). **No audio, no speech-out.** `[config-verified]`
|
|
201
|
+
- **Audio / omni is a different family:** `Qwen/Qwen3-Omni-30B-A3B-{Instruct,Thinking,Captioner}` —
|
|
202
|
+
**30B MoE (~3B active)**, far too big for our mobile target, and it's Qwen3-Omni, not Qwen3.5.
|
|
203
|
+
Smallest omni anywhere is `Qwen/Qwen2.5-Omni-3B` (~5.5 B params) `[verified via HF API]`.
|
|
204
|
+
|
|
205
|
+
So: **vision is free inside the model we already run; audio is not available in any small Qwen3.5/omni
|
|
206
|
+
checkpoint we could realistically run on a phone.**
|
|
207
|
+
|
|
208
|
+
### B3. Architecture of the smallest multimodal variant (Qwen3.5-0.8B vision tower) `[config-verified]`
|
|
209
|
+
|
|
210
|
+
- **Vision encoder:** a 12-layer **ViT** (`depth 12`, `hidden 768`, `12 heads`, `mlp 3072`,
|
|
211
|
+
`gelu_pytorch_tanh`), pre-norm blocks (norm1→attn→norm2→MLP), fused QKV.
|
|
212
|
+
- **Patch embed:** **Conv3d** `[768, 3, 2, 16, 16]` — `temporal_patch_size 2` × `16×16` patches over 3
|
|
213
|
+
RGB channels. (For single images the temporal dim is duplicated to 2; for video it spans frame
|
|
214
|
+
pairs.)
|
|
215
|
+
- **Positional:** learned `pos_embed [2304, 768]` plus M-RoPE on the LM side.
|
|
216
|
+
- **Merger / projector:** `spatial_merge_size 2` → concatenates a 2×2 patch block (768×4 = 3072),
|
|
217
|
+
LayerNorm, then `linear_fc1 [3072,3072]` → GELU → `linear_fc2 [1024,3072]` to produce one
|
|
218
|
+
`out_hidden_size = 1024` token per merged block, matching the LM `hidden_size 1024`.
|
|
219
|
+
- **No audio encoder/decoder** in Qwen3.5.
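
The patch and merger arithmetic in the bullets above can be checked with a short calculation. This is a minimal sketch for a single still image whose dimensions are already multiples of `patch_size × spatial_merge_size`; the real preprocessor's resize rules are not described here, and `visionTokenCount` is a hypothetical helper name.

```typescript
interface VisionGridConfig {
  patchSize: number;        // vision_config.patch_size (16)
  spatialMergeSize: number; // vision_config.spatial_merge_size (2)
  hiddenSize: number;       // vision_config.hidden_size (768)
}

// Count ViT patches and merged LM tokens for one image (one temporal group).
function visionTokenCount(heightPx: number, widthPx: number, cfg: VisionGridConfig) {
  const gridH = heightPx / cfg.patchSize;
  const gridW = widthPx / cfg.patchSize;
  const merge = cfg.spatialMergeSize;
  if (!Number.isInteger(gridH) || !Number.isInteger(gridW)) {
    throw new Error("image size must be a multiple of patchSize");
  }
  if (gridH % merge !== 0 || gridW % merge !== 0) {
    throw new Error("patch grid must be divisible by spatialMergeSize");
  }
  const patches = gridH * gridW;
  return {
    patches,                                          // tokens inside the ViT
    mergedTokens: patches / (merge * merge),          // tokens handed to the LM
    mergerInputWidth: cfg.hiddenSize * merge * merge, // 768 * 4 = 3072
  };
}
```

For a 384×384 image this gives a 24×24 grid: 576 patches inside the ViT and 144 merged tokens of width 1024 on the LM side.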
|
|
220
|
+
|
|
221
|
+
**Ops the engine needs vs. what it already has.** Engine implemented ops (from `src/gpu/ir.ts` +
|
|
222
|
+
`src/gpu/kernels/registry.ts`): Embedding(+Int4), MatMul(+Int4), Add, Mul, RMSNorm, **LayerNorm**,
|
|
223
|
+
RoPE, **Attention**, **Softmax**, SiLU, SwiGLU, **GELU**, plus the Mamba-SSM stack (MambaSSM,
|
|
224
|
+
CausalConv1d, etc.). Stubbed (declared in IR, **no WGSL kernel**): `Conv2d`, `AvgPool2d`,
|
|
225
|
+
`CrossAttention`, and the MoE ops.
|
|
226
|
+
|
|
227
|
+
Mapping the ViT onto that:
|
|
228
|
+
|
|
229
|
+
| ViT component | Engine status |
|---|---|
| LayerNorm (norm1/norm2/merger.norm) | ✅ have `LayerNorm` |
| Fused QKV / proj / MLP matmuls | ✅ have `MatMul` / `MatMulInt4` |
| Self-attention + softmax (bidirectional, non-causal) | ✅ have `Attention` + `Softmax` — needs a **non-causal mask flag** (text attention is causal) |
| GELU (`gelu_pytorch_tanh`) | ✅ have `GELU` (verify tanh approximation variant) |
| Merger projector (concat + 2×MLP) | ✅ matmul + a Concat/reshape (Concat is currently stubbed but trivial) |
| **Patch embed Conv3d `[768,3,2,16,16]`** | ❌ **new** — only genuinely missing primitive |
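
The non-causal mask flag from the attention row amounts to one predicate in the score-masking step. Below is a CPU reference sketch of the idea, not the engine's WGSL kernel; `attentionWeights` is a hypothetical name and the scores are plain nested arrays for readability.

```typescript
// Row-wise softmax over raw attention scores [seq][seq].
// causal = true masks positions j > i (text decoding);
// causal = false lets every position attend to every other (ViT blocks).
function attentionWeights(scores: number[][], causal: boolean): number[][] {
  return scores.map((row, i) => {
    const masked = row.map((s, j) => (causal && j > i ? -Infinity : s));
    const max = Math.max(...masked); // finite: position i itself is never masked
    const exps = masked.map((s) => Math.exp(s - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map((e) => e / sum);
  });
}
```

With all-zero scores for two tokens, the causal path gives the first token a weight of 1 on itself and 0 on the second, while the non-causal path gives 0.5 to each.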
|
|
237
|
+
|
|
238
|
+
The patch-embed conv is a strided non-overlapping conv = an **im2col/unfold + MatMul**: reshape each
|
|
239
|
+
non-overlapping 2×16×16×3 patch into a 1536-vector and multiply by the reshaped
|
|
240
|
+
`[768, 1536]` weight. **No general Conv2d/Conv3d kernel needed** — a small unfold (gather/reshape)
|
|
241
|
+
shader feeding the existing MatMul covers it. `CrossAttention` and `AvgPool2d` are **not** required by
|
|
242
|
+
this ViT (it uses standard self-attention and a learned merger, not pooling/cross-attn).
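
The unfold + MatMul equivalence can be written as a small CPU reference. The sketch assumes the input is laid out `[C, T, H, W]` row-major and that the Conv3d kernel `[out, C, T, P, P]` is flattened row-major to `[out, C*T*P*P]`, so each patch is gathered in `(c, t, y, x)` order to match. `patchEmbed` is a hypothetical name, and a real implementation would do the gather in a shader and hand the result to the existing MatMul.

```typescript
// CPU reference: non-overlapping strided conv == unfold (gather) + matmul.
// Assumed layouts: input [C, T, H, W], weight [out, C*T*P*P], both row-major.
function patchEmbed(
  input: Float32Array, C: number, T: number, H: number, W: number,
  weight: Float32Array, bias: Float32Array, out: number, P: number,
): Float32Array {
  const gridH = H / P;
  const gridW = W / P;
  const K = C * T * P * P; // 3 * 2 * 16 * 16 = 1536 for Qwen3.5-0.8B
  const result = new Float32Array(gridH * gridW * out);
  const patch = new Float32Array(K);
  for (let gy = 0; gy < gridH; gy++) {
    for (let gx = 0; gx < gridW; gx++) {
      // Unfold: gather one patch in (c, t, y, x) order.
      let k = 0;
      for (let c = 0; c < C; c++) {
        for (let t = 0; t < T; t++) {
          for (let py = 0; py < P; py++) {
            for (let px = 0; px < P; px++) {
              const y = gy * P + py;
              const x = gx * P + px;
              patch[k++] = input[((c * T + t) * H + y) * W + x];
            }
          }
        }
      }
      // MatMul: one output row per patch.
      const p = gy * gridW + gx;
      for (let o = 0; o < out; o++) {
        let acc = bias[o];
        for (let i = 0; i < K; i++) acc += weight[o * K + i] * patch[i];
        result[p * out + o] = acc;
      }
    }
  }
  return result;
}
```

The gather order is the part most likely to go wrong: if it does not match the weight flattening, the output is silently incorrect, so comparing against a framework Conv3d on a small input is a sensible first test.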
|
|
243
|
+
|
|
244
|
+
---
|
|
245
|
+
|
|
246
|
+
## PART C — Verdict & build plan
|
|
247
|
+
|
|
248
|
+
### Can one native model deliver text + vision (+ audio) on our WebGPU engine?
|
|
249
|
+
|
|
250
|
+
- **Text + vision (image/video-in): YES, realistically.** It's the *same checkpoint we already run* —
|
|
251
|
+
we are choosing to discard its vision tower. The ViT reuses our existing matmul / attention /
|
|
252
|
+
LayerNorm / GELU / softmax kernels. Only one new primitive (the patch-embed conv, implementable as
|
|
253
|
+
unfold+matmul) plus host-side image preprocessing and token splicing are required.
|
|
254
|
+
- **Audio: NO.** Qwen3.5 has no audio path at all. The only omni checkpoints are `Qwen3-Omni-30B-A3B`
|
|
255
|
+
(30B MoE) and `Qwen2.5-Omni-3B` (~5.5B) — both too large for mobile, both a different architecture
|
|
256
|
+
requiring a Whisper-style audio encoder, a codec/talker, and a `code2wav` vocoder (RVQ codec decode)
|
|
257
|
+
we don't remotely have. Audio stays out of scope for the native engine.
|
|
258
|
+
|
|
259
|
+
**So it dissolves the two-engine problem *partially*: vision yes, audio no.** If the second
|
|
260
|
+
engine (transformers.js) on mobile was there for *vision*, a vision-enabled Gerbil can replace it and
|
|
261
|
+
we ship one native model for text + image/video. If audio was also required, that still needs a
|
|
262
|
+
separate path — but no small Qwen model gives us audio either, so that's a model-availability wall,
|
|
263
|
+
not a Gerbil limitation.
|
|
264
|
+
|
|
265
|
+
### Build tier and specific work to enable native Qwen3.5 vision
|
|
266
|
+
|
|
267
|
+
**Tier: mid-size (1 new kernel, 1 new graph generator, host glue). Not a rewrite.**
|
|
268
|
+
|
|
269
|
+
1. **Loader (trivial):** stop nulling `model.visual.*` in `createKeyMapperForArch` (keep skipping
|
|
270
|
+
`mtp.*`). Map vision keys to canonical names; download grows by ~192 MB BF16 (~50–96 MB if we
|
|
271
|
+
quant/keep-F16 the tower).
|
|
272
|
+
2. **Vision graph generator (new):** add a `generateQwen3_5VisionGraph()` reading `vision_config`,
|
|
273
|
+
emitting the 12 ViT blocks + merger, reusing existing LayerNorm/MatMul/Attention/Softmax/GELU
|
|
274
|
+
nodes. Add a **non-causal flag** to the Attention op (vision attention is bidirectional).
|
|
275
|
+
3. **Patch-embed kernel (new, small):** unfold/im2col shader that reshapes 2×16×16×3 patches →
|
|
276
|
+
`[num_patches, 1536]`, then MatMul by `[768,1536]` weight + bias. (Reuses MatMul; the only new WGSL
|
|
277
|
+
is the gather/unfold.) `Concat`/`Reshape` for the spatial-merge step need real kernels (currently
|
|
278
|
+
stubbed) but are simple memory shuffles.
|
|
279
|
+
4. **Image preprocessing (host):** resize/normalize to the patch grid, build the temporal pair, emit
|
|
280
|
+
patch tensors — JS/Canvas/`ImageBitmap`, no GPU.
|
|
281
|
+
5. **Sequence splicing (executor):** replace `image_token_id` (248056) placeholder positions in the
|
|
282
|
+
text embedding stream with merged vision tokens, and apply **M-RoPE** (`mrope_section [11,11,10]`,
|
|
283
|
+
`mrope_interleaved`) for the multimodal position ids. This M-RoPE variant is new vs. our current
|
|
284
|
+
text RoPE and is the second-most-involved piece after the patch conv.
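
Step 5's embedding splice can be sketched on the host as follows. This is an illustration only: `spliceVisionEmbeddings` is a hypothetical name, embeddings are assumed to be flat row-major `[tokens, hidden]` arrays, and the M-RoPE position ids are deliberately left out because their exact layout is not specified in this document.

```typescript
const IMAGE_TOKEN_ID = 248056; // image_token_id from config.json

// Overwrite placeholder rows of textEmbeds [seqLen, hidden] with rows of
// visionEmbeds [numVisionTokens, hidden], in sequence order.
function spliceVisionEmbeddings(
  tokenIds: number[],
  textEmbeds: Float32Array,
  visionEmbeds: Float32Array,
  hidden: number,
): Float32Array {
  const out = textEmbeds.slice(); // copy, so the input stream is left untouched
  const numVision = visionEmbeds.length / hidden;
  let next = 0;
  for (let pos = 0; pos < tokenIds.length; pos++) {
    if (tokenIds[pos] !== IMAGE_TOKEN_ID) continue;
    if (next >= numVision) {
      throw new Error("more image placeholders than vision tokens");
    }
    out.set(visionEmbeds.subarray(next * hidden, (next + 1) * hidden), pos * hidden);
    next++;
  }
  if (next !== numVision) throw new Error("vision tokens left unused");
  return out;
}
```

The two error checks catch a mismatch between the number of placeholders the tokenizer emitted and the number of merged tokens the vision tower produced, which is the most likely integration bug.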
|
|
285
|
+
|
|
286
|
+
**New kernel work needed:** patch-embed unfold (small), real `Concat`/`Reshape` (trivial), a non-causal
|
|
287
|
+
attention flag, M-RoPE position handling. **Not needed:** general `Conv2d`/`Conv3d`, `AvgPool2d`,
|
|
288
|
+
`CrossAttention`, MoE ops.
|
|
289
|
+
|
|
290
|
+
**Bottom line:** Qwen3.5-0.8B is one native multimodal model that, with a mid-size vision build
|
|
291
|
+
(~1 real new kernel + a vision graph + M-RoPE + host image prep), gives Gerbil **text + image/video**
|
|
292
|
+
on WebGPU and lets us drop a second vision engine on mobile. **Audio is genuinely unavailable** in any
|
|
293
|
+
small Qwen3.5/omni checkpoint, so it remains a separate problem regardless of engine work.
|
|
@@ -0,0 +1,337 @@
|
|
|
1
|
+
# Qwen3.6 & Gemma 4 — Smallest On-Device Targets (Verification Follow-up)
|
|
2
|
+
|
|
3
|
+
Research date: **2026-06-13**. Follow-up to `docs/research/sota-mobile-models-2026.md`.
|
|
4
|
+
Goal: pick the two smallest on-device models to support next. SmolLM dropped per instruction.
|
|
5
|
+
|
|
6
|
+
Fact tags: **[config-verified]** = fetched the actual `config.json` / HF API live today;
|
|
7
|
+
**[card/docs]** = official model card / vendor docs; **[social/secondary]** = third-party.
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## 0. TL;DR (the two picks)
|
|
12
|
+
|
|
13
|
+
| Pick | Exact repo ID | Arch | 4-bit download | Dense or hybrid | Gerbil tier |
|---|---|---|---|---|---|
| **Smallest "Qwen"** | **`Qwen/Qwen3-0.6B`** (+ `Qwen/Qwen3-0.6B-MLX-4bit`) | dense full-attention transformer | **0.32 GB** (MLX 4-bit) | **DENSE** (not hybrid) | **Tier 1** |
| **Smallest "Gemma 4"** | **`google/gemma-4-E2B`** (+ `google/gemma-4-E2B-it-qat-q4_0-gguf`) | dense transformer w/ sliding-window attn | **3.35 GB** (official QAT q4_0, text-only) | dense (interleaved SWA) | **Tier 2** |
|
|
17
|
+
|
|
18
|
+
**Headline correction:** There is **no small Qwen3.6**. Qwen3.6 exists but the smallest official
|
|
19
|
+
variant is **27B dense** (also a 35B-A3B MoE). No 0.6B / 1.7B / 4B Qwen3.6 exists or is announced.
|
|
20
|
+
The smallest deployable Qwen on-device model remains **Qwen3-0.6B** (the prior Qwen3 dense generation).
|
|
21
|
+
|
|
22
|
+
**Gemma 4 correction:** No `gemma-4-270m`, no `gemma-4-1B`. The 270M only exists for **Gemma 3**.
|
|
23
|
+
The smallest Gemma 4 is **E2B** (effective ~2B via Per-Layer Embeddings; *not* MatFormer — see
|
|
24
|
+
§2.5). Confirmed against the full `author=google&search=gemma-4` listing.
|
|
25
|
+
|
|
26
|
+
---
|
|
27
|
+
|
|
28
|
+
## 1. Target 1 — "smallest Qwen3.6" → does not exist; use Qwen3-0.6B
|
|
29
|
+
|
|
30
|
+
### 1a. Qwen3.6 existence check [config-verified]
|
|
31
|
+
|
|
32
|
+
Full official Qwen org listing for `Qwen3.6` (HF API `author=Qwen&search=Qwen3.6`, fetched today):
|
|
33
|
+
|
|
34
|
+
```
Qwen/Qwen3.6-35B-A3B        (MoE, 35B total / 3B active)
Qwen/Qwen3.6-27B            (dense)
Qwen/Qwen3.6-35B-A3B-FP8
Qwen/Qwen3.6-27B-FP8
```
|
|
40
|
+
|
|
41
|
+
That is the **entire** official family. Hub-wide search for `Qwen3.6 0.6B / 1.7B / 4B / mini / nano / edge`
|
|
42
|
+
returns only community quants/finetunes **of the 27B and 35B-A3B** — no small base model anywhere.
|
|
43
|
+
Probes for `Qwen/Qwen3.6-0.6B`, `-0.5B`, `-1.7B`, `-0.8B`, `-0.6B-Instruct` all return HTTP **401**, the HF API's
404-equivalent for unauthenticated requests (nonexistent repo, not gated). **[config-verified]**
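
An existence probe of this kind can be scripted. The sketch below assumes the public model-info endpoint `https://huggingface.co/api/models/<repo_id>` and treats both 401 and 404 as "not visible", since a status code alone cannot separate a missing repo from a private one. `probeRepo`, `classifyStatus`, and the `Fetcher` type are hypothetical names; the fetcher is injected so the logic can be exercised without network access.

```typescript
type RepoStatus = "exists" | "not-found-or-private" | "unexpected";

// Minimal fetch-like signature; pass (url) => fetch(url) in real use.
type Fetcher = (url: string) => Promise<{ status: number }>;

// Map an HTTP status from the model-info endpoint to a coarse verdict.
function classifyStatus(status: number): RepoStatus {
  if (status === 200) return "exists";
  if (status === 401 || status === 404) return "not-found-or-private";
  return "unexpected"; // rate limits, server errors, redirects, ...
}

async function probeRepo(repoId: string, fetcher: Fetcher): Promise<RepoStatus> {
  const res = await fetcher(`https://huggingface.co/api/models/${repoId}`);
  return classifyStatus(res.status);
}
```

A call such as `probeRepo("Qwen/Qwen3-0.6B", (url) => fetch(url))` would then be repeated for each candidate repo ID.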
|
|
45
|
+
|
|
46
|
+
Independent confirmation: Perplexity (`sonar-pro`, today) — *"No official ~0.6B/1.7B/≤5B-class Qwen3.6
|
|
47
|
+
model has been released; the documented Qwen3.6 family is 27B dense and 35B-A3B MoE only."* Sources:
|
|
48
|
+
Unsloth Qwen3.6 docs (https://unsloth.ai/docs/models/qwen3.6), Qwen blog
|
|
49
|
+
(https://qwen.ai/blog?id=qwen3.6-35b-a3b), GitHub QwenLM/Qwen3.6. **[social/secondary]**
|
|
50
|
+
|
|
51
|
+
> The user's belief that "Qwen released a 3.6 generation with a ~0.6B model" is **incorrect** as of
|
|
52
|
+
> 2026-06-13. Qwen3.6 is a large-model release (27B/35B-A3B). For a smallest on-device Qwen, the
|
|
53
|
+
> correct target is **Qwen3-0.6B** from the Qwen3 dense generation.
|
|
54
|
+
|
|
55
|
+
### 1b. `Qwen/Qwen3-0.6B` — verified config [config-verified]
|
|
56
|
+
|
|
57
|
+
Fetched `https://huggingface.co/Qwen/Qwen3-0.6B/resolve/main/config.json`:
|
|
58
|
+
|
|
59
|
+
| Field | Value |
|---|---|
| `architectures[0]` | **`Qwen3ForCausalLM`** |
| `model_type` | **`qwen3`** |
| `hidden_size` | 1024 |
| `num_hidden_layers` | 28 |
| `num_attention_heads` | 16 |
| `num_key_value_heads` | 8 (GQA 2:1) |
| `head_dim` | **128** (note: ≠ hidden/heads = 64; set explicitly) |
| `intermediate_size` | 3072 |
| `vocab_size` | 151936 |
| `tie_word_embeddings` | **true** |
| `rope_theta` | 1000000 |
| `max_position_embeddings` | 40960 |
| `rms_norm_eps` | 1e-6 |
| `attention_bias` | **false** |
| `sliding_window` / `use_sliding_window` | null / false |
| `hidden_act` | silu (→ SwiGLU MLP) |
|
|
77
|
+
|
|
78
|
+
**Dense vs hybrid: 100% DENSE full-attention transformer.** No SSM, no linear/Gated-DeltaNet
|
|
79
|
+
layers, no sliding window. It uses the **simple dense path**, NOT Gerbil's Qwen3.5 hybrid path.
|
|
80
|
+
**[config-verified]**
|
|
81
|
+
|
|
82
|
+
**Non-standard features:**
|
|
83
|
+
- **No QKV bias** (`attention_bias: false`) — opposite of Qwen2, which has QKV bias. **[config-verified]**
|
|
84
|
+
- **QK-norm: YES** — Qwen3 applies per-head RMSNorm to Q and K (`self_attn.q_norm.weight`,
|
|
85
|
+
`self_attn.k_norm.weight`). This is the defining Qwen3 architectural change vs Qwen2. Not visible
|
|
86
|
+
as a scalar in config.json but present in the weights and modeling code; confirmed by Gerbil's own
|
|
87
|
+
loader already listing `layers.{i}.self_attn.q_norm.weight` / `k_norm.weight`. **[card/docs]**
|
|
88
|
+
- Tied embeddings (saves shipping a separate lm_head). **[config-verified]**
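The QK-norm item above can be illustrated with a minimal numeric sketch. It assumes a flat `[numHeads * headDim]` layout for one token and plain `weight` scaling; the helper name `perHeadRmsNorm` and the buffer layout are illustrative and are not taken from Gerbil's code. In the HF Qwen3 modeling code the norm sits after the Q/K projections and before RoPE.

```typescript
// Per-head RMSNorm over a flat [numHeads * headDim] vector (one token).
// Hypothetical helper: the name and memory layout are assumptions.
function perHeadRmsNorm(
  x: Float32Array,
  weight: Float32Array, // length headDim, shared by every head
  numHeads: number,
  headDim: number,
  eps = 1e-6,
): Float32Array {
  const out = new Float32Array(x.length);
  for (let h = 0; h < numHeads; h++) {
    const base = h * headDim;
    let sumSq = 0;
    for (let d = 0; d < headDim; d++) sumSq += x[base + d] * x[base + d];
    const inv = 1 / Math.sqrt(sumSq / headDim + eps);
    for (let d = 0; d < headDim; d++) {
      out[base + d] = x[base + d] * inv * weight[d];
    }
  }
  return out;
}

// Qwen3-0.6B shapes from the table: 16 query heads, 8 KV heads, head_dim 128.
const qkHeadDim = 128;
const qVec = new Float32Array(16 * qkHeadDim).fill(2);
const kVec = new Float32Array(8 * qkHeadDim).fill(-3);
const unitWeight = new Float32Array(qkHeadDim).fill(1);
const qNormed = perHeadRmsNorm(qVec, unitWeight, 16, qkHeadDim);
const kNormed = perHeadRmsNorm(kVec, unitWeight, 8, qkHeadDim);
```

Each head is normalized independently, so a constant input collapses to unit magnitude per element regardless of its original scale.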
|
|
89
|
+
|
|
90
|
+
**Sizes:**
|
|
91
|
+
- Raw bf16 `model.safetensors` = **1,503,300,328 B ≈ 1.50 GB** (HTTP content-length). **[config-verified]**
|
|
92
|
+
- **Official `Qwen/Qwen3-0.6B-MLX-4bit` = 0.317 GB total** (4-bit incl. quantized tied embeddings). **[config-verified]**
|
|
93
|
+
- Official `Qwen/Qwen3-0.6B-GGUF` exists (Q4 ≈ 0.4–0.5 GB; Gerbil quantizes on-the-fly from bf16
|
|
94
|
+
anyway, so the ~0.42 GB g128 INT4 estimate from the prior report stands). **[config-verified for repo existence]**
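As a cross-check on the raw size, the parameter count can be rebuilt from the config table. This is a sketch that assumes the usual Qwen3 tensor layout (q/k/v/o projections, gate/up/down MLP, two layer norms plus `q_norm`/`k_norm` per layer, one final norm); the variable names are illustrative.

```typescript
// Parameter count for Qwen3-0.6B from the config values in the table above.
// Tensor layout is an assumption (standard Qwen3 dense decoder).
const qwenCfg = {
  hidden: 1024, layers: 28, heads: 16, kvHeads: 8,
  headDim: 128, intermediate: 3072, vocab: 151936,
};

const qOut = qwenCfg.heads * qwenCfg.headDim;    // 2048, not hidden_size
const kvOut = qwenCfg.kvHeads * qwenCfg.headDim; // 1024
const attnParams =
  qwenCfg.hidden * qOut + 2 * qwenCfg.hidden * kvOut + qOut * qwenCfg.hidden;
const mlpParams = 3 * qwenCfg.hidden * qwenCfg.intermediate;
const normParams = 2 * qwenCfg.hidden + 2 * qwenCfg.headDim;
const perLayerParams = attnParams + mlpParams + normParams;

const embedParams = qwenCfg.vocab * qwenCfg.hidden;
// Tied: the embedding doubles as lm_head. "+ hidden" is the final norm.
const tiedTotal = embedParams + qwenCfg.layers * perLayerParams + qwenCfg.hidden;
// Untied: what the count would be if lm_head were also stored on disk.
const untiedTotal = tiedTotal + embedParams;

const bf16Bytes = (params: number): number => params * 2;
console.log({ tiedTotal, untiedTotal, bytes: bf16Bytes(untiedTotal) });
```

Under these assumptions the untied count gives 1,503,264,768 bytes at bf16, about 36 KB below the reported content-length. That would be consistent with the file storing `lm_head` next to the embedding plus a small safetensors header, but this reading is an inference from the arithmetic, not something the fetched metadata states.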
|
|
95
|
+
|
|
96
|
+
Official quant repos confirmed live: `Qwen/Qwen3-0.6B-MLX-4bit`, `Qwen/Qwen3-0.6B-GGUF`,
|
|
97
|
+
`unsloth/Qwen3-0.6B-GGUF`, `unsloth/Qwen3-0.6B-bnb-4bit`. **[config-verified]**
|
|
98
|
+
|
|
99
|
+
---
|
|
100
|
+
|
|
101
|
+
## 2. Target 2 — smallest Gemma 4 = `google/gemma-4-E2B`
|
|
102
|
+
|
|
103
|
+
### 2a. Smallest-variant check [config-verified]
|
|
104
|
+
|
|
105
|
+
Full `author=google&search=gemma-4` listing (fetched today) contains, by size:
|
|
106
|
+
**E2B, E4B, 12B, 26B-A4B (MoE), 31B** — plus `-it`, `-qat-q4_0-gguf`, `-qat-mobile-transformers`,
|
|
107
|
+
`-qat-w4a16-ct` variants of each. **No `gemma-4-270m`, no `gemma-4-1B`, no `gemma-4-E1B`.** The only
|
|
108
|
+
270M / 1B Google models are **Gemma 3** (`google/gemma-3-270m-*`, `gemma-3-1b-*`). So **E2B is the
|
|
109
|
+
smallest Gemma 4.** Perplexity concurs: *"Gemma 4-2B (~2B dense) is the smallest public Gemma 4; no
|
|
110
|
+
~270M or ~1B Gemma 4 as of June 2026."* **[config-verified + social/secondary]**
|
|
111
|
+
|
|
112
|
+
`google/gemma-4-E2B` HF API returns HTTP 200 (ungated, downloadable without auth). **[config-verified]**
|
|
113
|
+
|
|
114
|
+
### 2b. `google/gemma-4-E2B` — verified config (`text_config`) [config-verified]
|
|
115
|
+
|
|
116
|
+
Fetched `https://huggingface.co/google/gemma-4-E2B/resolve/main/config.json`. It is a multimodal
|
|
117
|
+
container (`Gemma4ForConditionalGeneration`, `model_type: gemma4`) with `text_config` + `vision_config`
|
|
118
|
+
+ `audio_config`. For a text-only Gerbil deployment, only `text_config` matters:
|
|
119
|
+
|
|
120
|
+
| Field (`text_config`) | Value |
|
|
121
|
+
|---|---|
|
|
122
|
+
| `architectures[0]` (top) | `Gemma4ForConditionalGeneration` |
|
|
123
|
+
| `model_type` | `gemma4_text` (container `gemma4`) |
|
|
124
|
+
| `hidden_size` | 1536 |
|
|
125
|
+
| `num_hidden_layers` | 35 |
|
|
126
|
+
| `num_attention_heads` | 8 |
|
|
127
|
+
| `num_key_value_heads` | **1 (MQA)** |
|
|
128
|
+
| `head_dim` | **256** (≠ hidden/heads) |
|
|
129
|
+
| `intermediate_size` | 6144 |
|
|
130
|
+
| `vocab_size` | **262144** (large embed/logit matmul) |
|
|
131
|
+
| `tie_word_embeddings` | **true** |
|
|
132
|
+
| `rms_norm_eps` | 1e-6 |
|
|
133
|
+
| `attention_bias` | false |
|
|
134
|
+
| `hidden_activation` | **`gelu_pytorch_tanh`** (→ **GeGLU** MLP, not SwiGLU) |
|
|
135
|
+
| `sliding_window` | **512** |
|
|
136
|
+
| `final_logit_softcapping` | **30.0** |
|
|
137
|
+
| `max_position_embeddings` | 131072 |
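Two rows of this table map to small elementwise ops. The sketch below shows the tanh-approximate GELU behind `gelu_pytorch_tanh`, the GeGLU gate built on it, and logit softcapping. It assumes Gemma 4 keeps the `cap * tanh(logits / cap)` form used by earlier Gemma releases; function names are illustrative.

```typescript
// tanh-approximate GELU, the activation named by `gelu_pytorch_tanh`.
function geluTanh(x: number): number {
  const c = Math.sqrt(2 / Math.PI);
  return 0.5 * x * (1 + Math.tanh(c * (x + 0.044715 * x * x * x)));
}

// GeGLU gate: activation(gate) * up, elementwise. SwiGLU uses SiLU instead.
function geglu(gate: Float32Array, up: Float32Array): Float32Array {
  return gate.map((g, i) => geluTanh(g) * up[i]);
}

// Final-logit softcapping: squashes logits smoothly toward the range [-cap, cap].
// Assumption: same formula as earlier Gemma generations.
function softcap(logits: Float32Array, cap = 30.0): Float32Array {
  return logits.map((v) => cap * Math.tanh(v / cap));
}
```

Small logits pass through almost unchanged, while very large ones saturate near the cap, which is why the op can be applied as a cheap post-processing step on the logits.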
|
|
138
|
+
|
|
139
|
+
**Sliding-window attention layout (`layer_types`)** [config-verified]:
|
|
140
|
+
35 layers, pattern = 4× sliding then 1× full, repeating → **full attention at layers
|
|
141
|
+
4,9,14,19,24,29,34 (7 full / 28 sliding) = 4:1 sliding:full ratio.** (Prior report said 5:1 — the
|
|
142
|
+
live config shows full every 5th layer, i.e. a **4:1** sliding-to-full ratio.) `sliding_window: 512`.
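The ratio can be sanity-checked by rebuilding the pattern. This is only a sketch: a generator would normally read `layer_types` directly from the config, and the helper names here are hypothetical.

```typescript
type LayerType = "sliding_attention" | "full_attention";

// Rebuilds a "k sliding layers, then 1 full layer" pattern.
// Illustrative only; the real config lists layer_types explicitly.
function buildLayerTypes(numLayers: number, slidingPerFull: number): LayerType[] {
  const period = slidingPerFull + 1;
  const types: LayerType[] = [];
  for (let i = 0; i < numLayers; i++) {
    types.push((i + 1) % period === 0 ? "full_attention" : "sliding_attention");
  }
  return types;
}

function fullLayerIndices(types: LayerType[]): number[] {
  const out: number[] = [];
  types.forEach((t, i) => {
    if (t === "full_attention") out.push(i);
  });
  return out;
}

const e2bFull = fullLayerIndices(buildLayerTypes(35, 4));
console.log(e2bFull); // [4, 9, 14, 19, 24, 29, 34]
```

With 35 layers and 4 sliding layers per full layer, the sketch yields 7 full and 28 sliding layers, matching the indices above.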
|
|
143
|
+
|
|
144
|
+
**Dual RoPE** [config-verified]:
|
|
145
|
+
- full_attention layers: `rope_theta: 1e6`, `rope_type: proportional`, **`partial_rotary_factor: 0.25`** (partial RoPE).
|
|
146
|
+
- sliding_attention layers: `rope_theta: 1e4`, `rope_type: default` (full RoPE).
|
|
147
|
+
|
|
148
|
+
**Other Gemma-4 specifics** [config-verified]:
|
|
149
|
+
- `query_pre_attn_scalar`: not present as a separate field in this config — query scaling is folded
|
|
150
|
+
via `head_dim`/`global_head_dim` (`global_head_dim: 512`). Gemma uses head_dim-based scaling; treat
|
|
151
|
+
as a parameter, not a kernel.
|
|
152
|
+
- `hidden_size_per_layer_input: 256` + `vocab_size_per_layer_input: 262144` → **Per-Layer Embeddings
|
|
153
|
+
(PLE)**: extra per-layer embedding tables the loader must place. Not a new compute kernel.
|
|
154
|
+
- `num_kv_shared_layers: 20` → **cross-layer KV sharing** (20 layers reuse KV from a paired layer).
|
|
155
|
+
New loader/graph wiring; reduces KV cache but adds plumbing.
|
|
156
|
+
- `use_double_wide_mlp: true`, `attention_k_eq_v: false`, `enable_moe_block: false` (E2B is dense, not MoE).
|
|
157
|
+
- **Effective "~2B" via PLE (NOT MatFormer):** the effective-vs-total split comes from Per-Layer
|
|
158
|
+
Embeddings, not elastic nesting (see §2.5). Load E2B as a normal fixed dense checkpoint — no
|
|
159
|
+
elastic-nesting code needed. **[card/docs]**
|
|
160
|
+
- Gemma RMSNorm uses `(1 + weight)` scaling — Gerbil's loader already bakes `+1` into Gemma-style
|
|
161
|
+
norm weights (see `model-loader.ts` comment), and norms appear pre/post attn + pre/post MLP.
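The `(1 + weight)` bake mentioned in the last bullet can be sketched as follows. The function names are illustrative and do not mirror `model-loader.ts`; the point is only that adding 1 at load time lets the runtime kernel stay a plain RMSNorm.

```typescript
// Plain RMSNorm: x / rms(x) * scale.
function rmsNorm(x: Float32Array, scale: Float32Array, eps = 1e-6): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < x.length; i++) sumSq += x[i] * x[i];
  const inv = 1 / Math.sqrt(sumSq / x.length + eps);
  return x.map((v, i) => v * inv * scale[i]);
}

// Load-time bake (hypothetical helper): store (1 + w) once so the kernel
// never needs a Gemma-specific branch.
function bakePlusOne(w: Float32Array): Float32Array {
  return w.map((v) => v + 1);
}

const normInput = new Float32Array([1, -2, 3, -4]);
const gemmaWeight = new Float32Array([0.1, 0, -0.5, 0.25]);
const bakedOutput = rmsNorm(normInput, bakePlusOne(gemmaWeight));
```

A zero stored weight therefore means "scale by 1", which is the behaviour a Gemma checkpoint expects.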
|
|
162
|
+
|
|
163
|
+
**Sizes:**
|
|
164
|
+
- Raw bf16 single-file `model.safetensors` = **10,246,621,918 B ≈ 10.25 GB** (full multimodal weights). **[config-verified]**
|
|
165
|
+
- **Official QAT INT4 (`google/gemma-4-E2B-it-qat-q4_0-gguf`):** text weights
|
|
166
|
+
`gemma-4-E2B_q4_0-it.gguf` = **3.349 GB**; multimodal projector `gemma-4-E2B-it-mmproj.gguf` =
|
|
167
|
+
0.987 GB (skip for text-only). So **text-only 4-bit download ≈ 3.35 GB.** **[config-verified]**
|
|
168
|
+
- (Prior report's "~2.6–2.9 GB" figure was the Unsloth/Google headline; the official Google QAT
|
|
169
|
+
q4_0 text GGUF measures **3.35 GB** live — use this.)
|
|
170
|
+
|
|
171
|
+
---
|
|
172
|
+
|
|
173
|
+
## 2.5 E2B vs E4B — separate checkpoints, NOT MatFormer slices [config-verified]
|
|
174
|
+
|
|
175
|
+
The addendum asked whether E2B is an elastic MatFormer sub-network nested inside E4B, or a
|
|
176
|
+
genuinely separate checkpoint. **Answer: genuinely separate checkpoints.** Both `config.json`
|
|
177
|
+
files were fetched live today and the dims differ structurally — E2B is not a slice of E4B.
|
|
178
|
+
|
|
179
|
+
**What "E" actually means.** The official E2B model card states verbatim: *"The 'E' in E2B and
|
|
180
|
+
E4B stands for 'effective' parameters. The smaller models incorporate **Per-Layer Embeddings
|
|
181
|
+
(PLE)** to maximize parameter efficiency… These embedding tables are large but are only used for
|
|
182
|
+
quick lookups, which is why the effective parameter count is much smaller than the total."* The
|
|
183
|
+
card lists Gemma 4 as **four distinct sizes — E2B, E4B, 26B-A4B, 31B** — i.e. separate models,
|
|
184
|
+
not nested elastic variants. **The card never mentions "MatFormer" or "Mix-n-Match" (grep count
|
|
185
|
+
= 0).** That is a Gemma-3n-generation mechanism; Gemma 4's edge effective/total split is driven
|
|
186
|
+
by **PLE**, not MatFormer nesting. So the addendum's hypothesis ("E4B is the full model with E2B
|
|
187
|
+
nested inside it as an elastic sub-network") is **incorrect for Gemma 4** — they are two distinct
|
|
188
|
+
checkpoints, each with its own `model.safetensors`. **[card/docs + config-verified]**
|
|
189
|
+
|
|
190
|
+
> Correction to earlier notes in this repo (and `sota-mobile-models-2026.md`): the "MatFormer"
|
|
191
|
+
> label applied to Gemma 4 E2B is wrong. Gemma 4 uses **PLE** for the effective-vs-total split.
|
|
192
|
+
> Treat E2B and E4B as ordinary fixed dense checkpoints — no elastic-nesting code is needed for
|
|
193
|
+
> either, which was the practical conclusion anyway.
|
|
194
|
+
|
|
195
|
+
### Per-checkpoint facts (both `text_config`, fetched live 2026-06-13)
|
|
196
|
+
|
|
197
|
+
| | **E2B** | **E4B** |
|
|
198
|
+
|---|---|---|
|
|
199
|
+
| Exact HF repo ID | **`google/gemma-4-E2B`** (+ `-it`) | **`google/gemma-4-E4B`** (+ `-it`) |
|
|
200
|
+
| Effective params (card) | **2.3B effective** | **4.5B effective** |
|
|
201
|
+
| Total params incl. embeddings (card) | **5.1B** | **8B** |
|
|
202
|
+
| Raw bf16 `model.safetensors` (content-length) | **10,246,621,918 B ≈ 10.25 GB** | **15,992,595,884 B ≈ 15.99 GB** |
|
|
203
|
+
| Implied 4-bit download (full multimodal, ≈ raw/3.6) | **≈ 2.85 GB** | **≈ 4.44 GB** |
|
|
204
|
+
| Official QAT q4_0 GGUF (text-only) | **3.35 GB** (`gemma-4-E2B_q4_0-it.gguf`, measured) | **≈ 5.0–5.2 GB** (`-E4B-it-qat-q4_0-gguf`, gated; scales with raw) |
|
|
205
|
+
| `hidden_size` | **1536** | **2560** |
|
|
206
|
+
| `num_hidden_layers` | **35** | **42** |
|
|
207
|
+
| `num_attention_heads` | 8 | 8 |
|
|
208
|
+
| `num_key_value_heads` | **1 (MQA)** | **2 (GQA)** |
|
|
209
|
+
| `head_dim` / `global_head_dim` | 256 / 512 | 256 / 512 |
|
|
210
|
+
| `intermediate_size` (MLP) | **6144** | **10240** |
|
|
211
|
+
| `use_double_wide_mlp` | **true** | **false** |
|
|
212
|
+
| `num_kv_shared_layers` | **20** | **18** |
|
|
213
|
+
| `sliding_window` | 512 | 512 |
|
|
214
|
+
| SWA layout (`layer_types`) | 4:1 sliding:full (full @ 4,9,14,19,24,29,34) | 5:1 sliding:full (full @ 5,11,17,23,29,35,41) |
|
|
215
|
+
| `vocab_size` | 262144 | 262144 |
|
|
216
|
+
| `final_logit_softcapping` | 30.0 | 30.0 |
|
|
217
|
+
| `hidden_activation` | `gelu_pytorch_tanh` (GeGLU) | `gelu_pytorch_tanh` (GeGLU) |
|
|
218
|
+
| `tie_word_embeddings` | true | true |
|
|
219
|
+
| `max_position_embeddings` | 131072 | 131072 |
|
|
220
|
+
| `model_type` (container / text) | `gemma4` / `gemma4_text` | `gemma4` / `gemma4_text` |
|
|
221
|
+
| `architectures[0]` | `Gemma4ForConditionalGeneration` | `Gemma4ForConditionalGeneration` |
|
|
222
|
+
| HF gating | ungated (HTTP 200, no auth) | ungated (HTTP 200, no auth) |
|
|
223
|
+
|
|
224
|
+
The structural differences (hidden 1536 vs 2560, 35 vs 42 layers, 1 vs 2 KV heads, MLP 6144 vs
|
|
225
|
+
10240, double-wide MLP only on E2B, 20 vs 18 KV-shared layers) prove these are **independently
|
|
226
|
+
trained checkpoints**, not one model sliced two ways. **[config-verified]**
|
|
227
|
+
|
|
228
|
+
### Smaller download
|
|
229
|
+
|
|
230
|
+
**E2B is the smaller download** — by every measure: raw bf16 **10.25 GB vs 15.99 GB**, and
|
|
231
|
+
official text-only 4-bit **≈ 3.35 GB vs ≈ 5.0+ GB**. E2B is ~36% smaller on disk.
|
|
232
|
+
|
|
233
|
+
### Does the choice change Gerbil's implementation work? **No — same generator, same kernels.**
|
|
234
|
+
|
|
235
|
+
Both share `model_type: gemma4` / `Gemma4ForConditionalGeneration` and the **identical op set**:
|
|
236
|
+
sliding-window attention, GeGLU MLP, Gemma `(1+w)` RMSNorm, final-logit softcap, dual RoPE
|
|
237
|
+
(partial on full-attn layers, full on sliding layers), PLE tables, cross-layer KV sharing, tied
|
|
238
|
+
embeddings. Gerbil's generators read **every dimension from `config.json` at runtime** (verified
|
|
239
|
+
in `qwen3_5.ts`: `hidden_size`, `num_hidden_layers`, `num_key_value_heads`, `intermediate_size`,
|
|
240
|
+
`head_dim`, `layer_types`, … are all pulled from the config object). Therefore:
|
|
241
|
+
|
|
242
|
+
- **Same `gemma4.ts` generator** handles both — no E4B-specific code.
|
|
243
|
+
- **Same kernels** — the only deltas are config values (1536→2560, 35→42, MQA→GQA, MLP width,
|
|
244
|
+
`use_double_wide_mlp`, KV-shared count, SWA stride). All of these are config-driven
|
|
245
|
+
inputs to the single generator, not separate kernels or code paths.
|
|
246
|
+
- The only practical difference is **which weights you load and how much memory/VRAM they need.**
|
|
247
|
+
|
|
248
|
+
One nuance to verify when implementing: `use_double_wide_mlp` is `true` for E2B and `false` for
|
|
249
|
+
E4B. This is a config flag the generator must honor (likely a 2× factor on the gate/up
|
|
250
|
+
projection), but it is a parameterized branch inside the single generator — not a separate kernel.
|
|
251
|
+
|
|
252
|
+
### Verdict: is E4B worth the extra size on-device?
|
|
253
|
+
|
|
254
|
+
**For Gerbil's on-device / mobile target: pick E2B.** E4B is ~56% larger (15.99 vs 10.25 GB raw;
|
|
255
|
+
~5.0 vs 3.35 GB at 4-bit). On an iPad/phone-class budget the extra ~1.7 GB of 4-bit weights plus
|
|
256
|
+
the larger KV/activation footprint (hidden 2560, 42 layers, 2 KV heads) materially raises peak
|
|
257
|
+
memory and lowers tokens/sec, for a quality bump (2.3B→4.5B effective) that is real but not
|
|
258
|
+
transformative at the edge. **E4B is "worth it" only on a laptop/desktop-class device** with
|
|
259
|
+
headroom to spare and a quality-over-latency preference. Since implementation cost is identical
|
|
260
|
+
(same generator, same kernels), Gerbil can support both from one code path and simply let the
|
|
261
|
+
user pick the weight set — but the **default on-device pick should be E2B**.
|
|
262
|
+
|
|
263
|
+
---
|
|
264
|
+
|
|
265
|
+
## 3. Gerbil implementation tiers (per `docs/adding-a-model-family.md`)
|
|
266
|
+
|
|
267
|
+
### Target 1 — `Qwen/Qwen3-0.6B` → **Tier 1 (hours)**
|
|
268
|
+
|
|
269
|
+
`Qwen3ForCausalLM` is **already registered** in `src/gpu/architectures/index.ts` →
|
|
270
|
+
`generateQwen2Graph`. But `qwen2.ts` models the **Qwen2** layer (QKV bias, **no QK-norm**), while
|
|
271
|
+
Qwen3 is the inverse: **no QKV bias + adds per-head QK-RMSNorm**. So a vanilla load would produce
|
|
272
|
+
wrong outputs (missing q_norm/k_norm).
|
|
273
|
+
|
|
274
|
+
**What must be built (all reuse, zero new kernels):**
|
|
275
|
+
1. Add **QK-norm** (per-head RMSNorm on Q and K) to the dense attention path. This is a
|
|
276
|
+
**copy-paste from `qwen3_5.ts`** (lines ~514–555 already wire `CANONICAL_KEYS.qNorm(i)` /
|
|
277
|
+
`kNorm(i)` + two `RMSNorm` nodes), and the loader already handles `self_attn.q_norm.weight` /
|
|
278
|
+
`k_norm.weight` (with the `+1` bake). Cleanest fix: branch on `model_type === "qwen3"` to emit the
|
|
279
|
+
QK-norm nodes and skip QKV bias.
|
|
280
|
+
2. Confirm `attention_bias: false` path (skip the Qwen2 QKV-bias tensors).
|
|
281
|
+
3. `head_dim` is read from config (128) — already handled. GQA 16/8, tied embeddings, RoPE θ=1e6 —
|
|
282
|
+
all already supported.
|
|
283
|
+
4. Validate vs HF reference (Step 7 of the guide).
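Steps 1 and 2 amount to one branch in the dense attention path. The sketch below shows that branch as a list of op names; the op labels, the `DenseAttnConfig` shape, and `planDenseAttention` are hypothetical and do not mirror Gerbil's graph API.

```typescript
// Hypothetical plan builder: names and config shape are illustrative only.
interface DenseAttnConfig {
  model_type: string;
  attention_bias?: boolean;
}

type AttnOp =
  | "QProj" | "KProj" | "VProj" | "QKVBias"
  | "QNorm" | "KNorm" | "RoPE" | "Attention" | "OProj";

function planDenseAttention(cfg: DenseAttnConfig): AttnOp[] {
  const isQwen3 = cfg.model_type === "qwen3";
  // Qwen2 layers carry QKV bias; Qwen3 sets attention_bias: false.
  const useBias = cfg.attention_bias ?? !isQwen3;
  const ops: AttnOp[] = ["QProj", "KProj", "VProj"];
  if (useBias) ops.push("QKVBias");
  if (isQwen3) ops.push("QNorm", "KNorm"); // per-head RMSNorm, before RoPE
  ops.push("RoPE", "Attention", "OProj");
  return ops;
}

const qwen3Plan = planDenseAttention({ model_type: "qwen3", attention_bias: false });
const qwen2Plan = planDenseAttention({ model_type: "qwen2" });
```

Keeping the decision in one place means the Qwen2 path is untouched, and validation against the HF reference only has to cover the new `qwen3` branch.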
|
|
284
|
+
|
|
285
|
+
**New kernels: none.** All ops (RMSNorm, QK-norm RMSNorm, RoPE, GQA Attention, SwiGLU, tied
|
|
286
|
+
embeddings, SliceLastRow) already exist. Effort: hours.
|
|
287
|
+
|
|
288
|
+
### Target 2 — `google/gemma-4-E2B` → **Tier 2 (days; 1 real new kernel)**
|
|
289
|
+
|
|
290
|
+
New `gemma4.ts` generator + register `Gemma4ForConditionalGeneration` (read `config.text_config`).
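Reading `config.text_config` can be handled by one small resolver shared with text-only families. This is a sketch under assumed types: `HfConfig`, `TextConfig`, and `resolveTextConfig` are hypothetical names, and only three fields are modelled.

```typescript
// Hypothetical config shapes: only the fields this sketch touches are typed.
interface TextConfig {
  model_type: string;
  hidden_size: number;
  num_hidden_layers: number;
}
interface HfConfig extends Partial<TextConfig> {
  architectures?: string[];
  text_config?: TextConfig;
}

// Multimodal containers nest the language-model dims under `text_config`;
// text-only checkpoints such as Qwen3 keep them at the top level.
function resolveTextConfig(config: HfConfig): TextConfig {
  if (config.text_config) return config.text_config;
  const { model_type, hidden_size, num_hidden_layers } = config;
  if (
    model_type === undefined ||
    hidden_size === undefined ||
    num_hidden_layers === undefined
  ) {
    throw new Error("config has neither text_config nor top-level text fields");
  }
  return { model_type, hidden_size, num_hidden_layers };
}

const gemmaText = resolveTextConfig({
  architectures: ["Gemma4ForConditionalGeneration"],
  model_type: "gemma4",
  text_config: { model_type: "gemma4_text", hidden_size: 1536, num_hidden_layers: 35 },
});
```

The generator then works against `TextConfig` only and never needs to know whether the checkpoint shipped vision or audio towers.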
|
|
291
|
+
|
|
292
|
+
**What must be built:**
|
|
293
|
+
- **Tier-2 new kernel (the one real engineering item): sliding-window / banded attention** +
|
|
294
|
+
(for the memory win) a **windowed KV cache** — `sliding_window: 512`, interleaved 4:1 with full
|
|
295
|
+
attention. *Correctness-first v1 can treat SWA as full attention* (identical results while the whole
|
|
296
|
+
sequence, prompt plus generated tokens, stays ≤512 tokens) and ship without the new kernel; add the banded mask + windowed cache later for
|
|
297
|
+
long-context memory wins.
|
|
298
|
+
- **Small new variants (each tiny, not a full kernel class):**
|
|
299
|
+
- **GeGLU MLP** (swap SiLU→GELU `gelu_pytorch_tanh` in the gated MLP).
|
|
300
|
+
- **Final logit softcapping** (`tanh`-based, `final_logit_softcapping: 30.0`) — elementwise op on logits.
|
|
301
|
+
- **Gemma `(1+w)` RMSNorm** — loader already bakes `+1`; norms placed pre/post attn + pre/post MLP.
|
|
302
|
+
- **Dual RoPE** — full layers use partial RoPE (`partial_rotary_factor: 0.25`, θ=1e6); sliding
|
|
303
|
+
layers use full RoPE (θ=1e4). Gerbil already does partial RoPE; this is per-layer parameterization.
|
|
304
|
+
- **Loader work (no new compute kernels):**
|
|
305
|
+
- **Per-Layer Embeddings (PLE)** — place per-layer embedding tables (`hidden_size_per_layer_input: 256`).
|
|
306
|
+
- **Cross-layer KV sharing** (`num_kv_shared_layers: 20`) — graph wiring so 20 layers reuse a peer's KV.
|
|
307
|
+
- No MatFormer/elastic handling needed — E2B (and E4B) are fixed dense checkpoints; effective
|
|
308
|
+
param count is a PLE artifact, not a slice (§2.5).
|
|
309
|
+
- **Mobile budget note:** vocab 262144 × hidden 1536 tied embedding is large; at INT4 the embed
|
|
310
|
+
tensor is ~0.2 GB (fits the iPad 128 MB `maxStorageBufferBindingSize` only if sharded — **the
|
|
311
|
+
embed/logit weight will need sharding** per the guide's buffer-cap rule).
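The mobile budget note is simple arithmetic, sketched below. It counts packed 4-bit weights only (group scales and zero-points are ignored, so a real quantized tensor is somewhat larger), reads "128 MB" as 128 MiB, and assumes row-wise sharding along the vocab axis; none of these choices is confirmed by the guide.

```typescript
// Sizes the tied embedding for gemma-4-E2B and splits it to fit a binding cap.
// Assumptions: packed 4-bit weights only, 128 MiB cap, shards split by vocab row.
const embedVocab = 262144;
const embedHidden = 1536;
const bitsPerWeight = 4;
const capBytes = 128 * 1024 * 1024;

const embedBytes = (embedVocab * embedHidden * bitsPerWeight) / 8; // 201,326,592 B
const bytesPerRow = (embedHidden * bitsPerWeight) / 8;             // 768 B per vocab row
const rowsPerShard = Math.floor(capBytes / bytesPerRow);
const shardCount = Math.ceil(embedVocab / rowsPerShard);

console.log({ embedBytes, rowsPerShard, shardCount });
```

Under these assumptions the tensor is about 0.2 GB, which is 1.5 times the cap, so two shards are the minimum.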
|
|
312
|
+
|
|
313
|
+
**Net:** Gemma 4 E2B is **Tier 2** — one genuinely new kernel (windowed attention, deferrable) plus
|
|
314
|
+
several small variants/loader features. Matches the prior report's Gemma4 assessment; the live config
|
|
315
|
+
refines it (4:1 not 5:1, MQA `num_kv_heads=1`, head_dim 256, dual RoPE, PLE, 20 shared-KV layers).
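The deferrable windowed-attention piece reduces to a banded causal mask. The sketch below also checks the condition under which treating SWA as full attention is exact. Whether the window counts the current token is an assumption to confirm against the reference implementation; the function names are illustrative.

```typescript
// Banded causal mask: query position q may attend to key position k when k is
// not in the future and lies within the last `window` positions (q included).
// Assumption: the window is inclusive of the current token.
function canAttend(q: number, k: number, window: number | null): boolean {
  if (k > q) return false;          // causal
  if (window === null) return true; // full attention
  return q - k < window;
}

// True when the banded mask and the plain causal mask agree for every pair.
function sameAsFullAttention(seqLen: number, window: number): boolean {
  for (let q = 0; q < seqLen; q++) {
    for (let k = 0; k <= q; k++) {
      if (canAttend(q, k, window) !== canAttend(q, k, null)) return false;
    }
  }
  return true;
}

console.log(sameAsFullAttention(512, 512), sameAsFullAttention(513, 512));
```

Once the sequence grows past the window, the oldest positions drop out of the sliding layers, and that is the point where a full-attention fallback stops matching the reference.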
|
|
316
|
+
|
|
317
|
+
---
|
|
318
|
+
|
|
319
|
+
## Sources
|
|
320
|
+
|
|
321
|
+
Primary, fetched live 2026-06-13:
|
|
322
|
+
- `https://huggingface.co/Qwen/Qwen3-0.6B/resolve/main/config.json` [config-verified]
|
|
323
|
+
- `https://huggingface.co/google/gemma-4-E2B/resolve/main/config.json` and `.../gemma-4-E2B-it/...` [config-verified]
|
|
324
|
+
- `https://huggingface.co/google/gemma-4-E4B/resolve/main/config.json` and `.../gemma-4-E4B-it/...` [config-verified]
|
|
325
|
+
- `https://huggingface.co/google/gemma-4-E2B/raw/main/README.md` (official card: "E" = effective
|
|
326
|
+
params via Per-Layer Embeddings; four distinct sizes E2B/E4B/26B-A4B/31B; no MatFormer mention) [card/docs]
|
|
327
|
+
- HF API: `api/models?author=Qwen&search=Qwen3.6`, `api/models?author=google&search=gemma-4`,
|
|
328
|
+
`api/models/{repo}?blobs=true`, content-length headers [config-verified]
|
|
329
|
+
- `Qwen/Qwen3-0.6B-MLX-4bit` (0.317 GB), `google/gemma-4-E2B-it-qat-q4_0-gguf`
|
|
330
|
+
(`gemma-4-E2B_q4_0-it.gguf` 3.349 GB, mmproj 0.987 GB) [config-verified]
|
|
331
|
+
- Gerbil repo: `src/gpu/architectures/index.ts` (Qwen3ForCausalLM→qwen2.ts already registered),
|
|
332
|
+
`qwen2.ts` (no QK-norm), `qwen3_5.ts` (QK-norm reference wiring), `model-loader.ts`
|
|
333
|
+
(q_norm/k_norm keys + Gemma `+1` bake).
|
|
334
|
+
|
|
335
|
+
Secondary:
|
|
336
|
+
- Perplexity sonar-pro (2026-06-13): no small Qwen3.6; smallest Gemma 4 ≈ 2B. Citing unsloth.ai
|
|
337
|
+
Qwen3.6 docs, qwen.ai blog, github.com/QwenLM/Qwen3.6. [social/secondary]
|