@tryhamster/gerbil 1.0.0-rc.9 → 1.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +1 -1
- package/README.md +318 -104
- package/dist/architectures-C1I5V3Dt.mjs +6070 -0
- package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
- package/dist/browser/index.d.ts +276 -590
- package/dist/browser/index.d.ts.map +1 -1
- package/dist/browser/index.js +592 -2334
- package/dist/browser/index.js.map +1 -1
- package/dist/cli.mjs +625 -1098
- package/dist/cli.mjs.map +1 -1
- package/dist/defaults-9komdrbY.mjs +24 -0
- package/dist/defaults-9komdrbY.mjs.map +1 -0
- package/dist/frameworks/express.d.mts +1 -3
- package/dist/frameworks/express.d.mts.map +1 -1
- package/dist/frameworks/express.mjs +7 -7
- package/dist/frameworks/express.mjs.map +1 -1
- package/dist/frameworks/fastify.d.mts +1 -1
- package/dist/frameworks/fastify.d.mts.map +1 -1
- package/dist/frameworks/fastify.mjs +3 -3
- package/dist/frameworks/fastify.mjs.map +1 -1
- package/dist/frameworks/hono.d.mts +1 -1
- package/dist/frameworks/hono.d.mts.map +1 -1
- package/dist/frameworks/hono.mjs +4 -4
- package/dist/frameworks/hono.mjs.map +1 -1
- package/dist/frameworks/next.d.mts +3 -2
- package/dist/frameworks/next.d.mts.map +1 -1
- package/dist/frameworks/next.mjs +4 -4
- package/dist/frameworks/next.mjs.map +1 -1
- package/dist/frameworks/react.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts.map +1 -1
- package/dist/frameworks/trpc.mjs +4 -4
- package/dist/frameworks/trpc.mjs.map +1 -1
- package/dist/gerbil-BetB5xb0.d.mts +488 -0
- package/dist/gerbil-BetB5xb0.d.mts.map +1 -0
- package/dist/gerbil-CTZUa8EZ.mjs +4 -0
- package/dist/gerbil-DNniplr4.mjs +1656 -0
- package/dist/gerbil-DNniplr4.mjs.map +1 -0
- package/dist/gpu/hooks.d.mts +640 -0
- package/dist/gpu/hooks.d.mts.map +1 -0
- package/dist/gpu/hooks.mjs +1369 -0
- package/dist/gpu/hooks.mjs.map +1 -0
- package/dist/gpu/index.d.mts +2 -0
- package/dist/gpu/index.mjs +6 -0
- package/dist/gpu-DFuglcEx.mjs +3790 -0
- package/dist/gpu-DFuglcEx.mjs.map +1 -0
- package/dist/index-Dgmb2kE3.d.mts +245 -0
- package/dist/index-Dgmb2kE3.d.mts.map +1 -0
- package/dist/index-DukkJRMj.d.mts +2114 -0
- package/dist/index-DukkJRMj.d.mts.map +1 -0
- package/dist/index.d.mts +22 -487
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +13 -8
- package/dist/index.mjs.map +1 -1
- package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
- package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
- package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
- package/dist/integrations/ai-sdk.d.mts +75 -6
- package/dist/integrations/ai-sdk.d.mts.map +1 -1
- package/dist/integrations/ai-sdk.mjs +131 -15
- package/dist/integrations/ai-sdk.mjs.map +1 -1
- package/dist/integrations/langchain.d.mts +1 -1
- package/dist/integrations/langchain.d.mts.map +1 -1
- package/dist/integrations/langchain.mjs +5 -5
- package/dist/integrations/langchain.mjs.map +1 -1
- package/dist/integrations/llamaindex.d.mts +1 -1
- package/dist/integrations/llamaindex.d.mts.map +1 -1
- package/dist/integrations/llamaindex.mjs +5 -5
- package/dist/integrations/llamaindex.mjs.map +1 -1
- package/dist/integrations/mcp-client.mjs +3 -3
- package/dist/integrations/mcp-client.mjs.map +1 -1
- package/dist/integrations/mcp.d.mts +3 -2
- package/dist/integrations/mcp.d.mts.map +1 -1
- package/dist/integrations/mcp.mjs +5 -5
- package/dist/{mcp-BvbriaBy.mjs → mcp-D2vvH1Xc.mjs} +4 -4
- package/dist/mcp-D2vvH1Xc.mjs.map +1 -0
- package/dist/memory/index.d.mts +3 -0
- package/dist/memory/index.mjs +6 -0
- package/dist/memory-D1P7Tmda.mjs +4 -0
- package/dist/memory-DVN0MnIG.mjs +132 -0
- package/dist/memory-DVN0MnIG.mjs.map +1 -0
- package/dist/memory-Dj0J1v88.mjs +294 -0
- package/dist/memory-Dj0J1v88.mjs.map +1 -0
- package/dist/moonshine-stt-17dpP1kr.mjs +4 -0
- package/dist/moonshine-stt-4ojLtMq7.mjs +11962 -0
- package/dist/moonshine-stt-4ojLtMq7.mjs.map +1 -0
- package/dist/{one-liner-s-lD8rCC.mjs → one-liner-JhdIPxzF.mjs} +14 -16
- package/dist/one-liner-JhdIPxzF.mjs.map +1 -0
- package/dist/repl-BDRkwPGX.mjs +9 -0
- package/dist/skills/index.d.mts +270 -320
- package/dist/skills/index.d.mts.map +1 -1
- package/dist/skills/index.mjs +5 -5
- package/dist/{skills-CD3Orlex.mjs → skills-CU694Dc8.mjs} +187 -32
- package/dist/skills-CU694Dc8.mjs.map +1 -0
- package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
- package/dist/tools-DQ1mPUw5.mjs.map +1 -0
- package/dist/types-DQBe2lFo.d.mts +165 -0
- package/dist/types-DQBe2lFo.d.mts.map +1 -0
- package/dist/{types-CiTc7ez3.d.mts → types-LlyYILII.d.mts} +112 -14
- package/dist/types-LlyYILII.d.mts.map +1 -0
- package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
- package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
- package/dist/vector-B0panuy6.mjs +95 -0
- package/dist/vector-B0panuy6.mjs.map +1 -0
- package/docs/PROJECT-STATE.md +321 -0
- package/docs/adding-a-model-family.md +280 -0
- package/docs/ai-sdk.md +70 -61
- package/docs/architecture/overview.md +17 -7
- package/docs/browser.md +203 -8
- package/docs/embeddings.md +156 -0
- package/docs/gerbil-site-native-migration.md +217 -0
- package/docs/gpu-engine/architectures.md +398 -0
- package/docs/gpu-engine/ir.md +372 -0
- package/docs/gpu-engine/kernels.md +718 -0
- package/docs/gpu-engine/paper.html +1759 -0
- package/docs/gpu-engine/paper.md +2109 -0
- package/docs/gpu-engine/safetensors.md +312 -0
- package/docs/gpu-engine/tokenizer.md +302 -0
- package/docs/memory-rag.md +91 -0
- package/docs/metal-safari-intel.md +190 -0
- package/docs/mobile-failure-diagnosis.md +124 -0
- package/docs/mobile.md +99 -0
- package/docs/observability.md +230 -0
- package/docs/onnx-removal-plan.md +339 -0
- package/docs/research/autoresearch-portable.md +904 -0
- package/docs/research/dispatch-reduction-hivemind.md +84 -0
- package/docs/research/ios-safari-model-caching.md +117 -0
- package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
- package/docs/research/native-stt-model-selection.md +49 -0
- package/docs/research/native-tts-model-selection.md +90 -0
- package/docs/research/native-vs-chromium-decision.md +152 -0
- package/docs/research/nemotron-mamba2-inference.md +910 -0
- package/docs/research/qwen35-multimodal.md +293 -0
- package/docs/research/qwen36-gemma4-targets.md +337 -0
- package/docs/research/sota-embedding-models.md +179 -0
- package/docs/research/sota-mobile-models-2026.md +263 -0
- package/docs/research/sota-modality-models.md +202 -0
- package/docs/research/tps-baselines.md +71 -0
- package/docs/research/webgpu-m4-reference.md +104 -0
- package/docs/site-update-plan.md +155 -0
- package/docs/structured-output.md +123 -0
- package/docs/stt.md +63 -446
- package/docs/tts.md +77 -499
- package/docs/vision.md +100 -338
- package/package.json +22 -7
- package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
- package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
- package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
- package/dist/gerbil-CJ3ifloF.mjs +0 -4
- package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
- package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
- package/dist/gerbil-qOTe1nl2.d.mts +0 -431
- package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
- package/dist/kokoro-BNTb6egA.mjs +0 -20210
- package/dist/kokoro-BNTb6egA.mjs.map +0 -1
- package/dist/kokoro-CMOGDSgT.js +0 -20212
- package/dist/kokoro-CMOGDSgT.js.map +0 -1
- package/dist/mcp-BvbriaBy.mjs.map +0 -1
- package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
- package/dist/repl-DveXw36T.mjs +0 -9
- package/dist/skills-CD3Orlex.mjs.map +0 -1
- package/dist/stt-Bu-E23Sc.js +0 -433
- package/dist/stt-Bu-E23Sc.js.map +0 -1
- package/dist/stt-CpLYbGFd.mjs +0 -433
- package/dist/stt-CpLYbGFd.mjs.map +0 -1
- package/dist/stt-DRPLEEHB.mjs +0 -3
- package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
- package/dist/transformers.web-DiD1gTwk.js +0 -44695
- package/dist/transformers.web-DiD1gTwk.js.map +0 -1
- package/dist/transformers.web-u34VxRFM.js +0 -3
- package/dist/tts-CqroPaSK.js +0 -724
- package/dist/tts-CqroPaSK.js.map +0 -1
- package/dist/tts-DXgsKGCe.mjs +0 -3
- package/dist/tts-DeGANMNV.mjs +0 -730
- package/dist/tts-DeGANMNV.mjs.map +0 -1
- package/dist/types-CiTc7ez3.d.mts.map +0 -1
- /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
- /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
- /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,179 @@
# SOTA Small Embedding Models for Gerbil on iPad (WebGPU)

> Research date: 2026-06-13. Verified against live HuggingFace `config.json` / file lists / model cards
> and the MTEB leaderboard, cross-checked with Perplexity (`sonar-pro`). Engine capabilities verified
> against `src/gpu/` (model-loader, architectures, kernels).

## TL;DR — the ranked pick

| Rank | Model | Arch | q4 size | Compatible q4 format on HF? | Engine work to run it |
|------|-------|------|---------|------------------------------|------------------------|
| **1** | **`google/embeddinggemma-300m`** (via `mlx-community/embeddinggemma-300m-4bit`) | **Bidirectional** Gemma3 encoder | **~173 MB** | **YES — standard MLX-4bit** (QAT base, not DWQ) | New **Tier-2 Gemma3 encoder generator** (medium effort) |
| **2** | **`Qwen/Qwen3-Embedding-0.6B`** | Causal-LM (Qwen3) | ~335 MB (MLX) / ~538 MB (compressed-tensors) | **Risky** — see notes | None for arch (runs today on desktop); but **no clean-verified q4** |
| **3** | **`Snowflake/snowflake-arctic-embed-s`** or **`bge-small-en-v1.5`** | Bidirectional BERT encoder | ~70–130 MB at q4 (on-the-fly from F32) | No q4 we can read (only ONNX q4); use **on-the-fly INT4 from F32** | New **Tier-2 BERT encoder generator** (smaller effort than Gemma3) |

**Recommendation:** ship **EmbeddingGemma-300M** as Gerbil's default iPad embedding model. It is the
single highest-quality model under 1B params (MTEB English v2 **68.36**, Multilingual v2 **61.15**),
fits the iPad budget at **~173 MB q4**, and — critically — ships a **standard-affine MLX-4bit** quant
derived from Google's **quantization-aware-trained (QAT)** base, so q4 quality is near-lossless and the
format is one our `repackMLX` loader already understands. The cost is one new **bidirectional Gemma3
encoder generator**. We already have most of the prerequisite ops (the new `is_causal=false` attention
flag, RMSNorm, RoPE, GELU-tanh, L2Norm); only mean pooling needs adding.

---

## Engine capability ground-truth (verified in `src/gpu/`)

What the loader actually detects (`src/gpu/model-loader.ts` ~L567–584):

```ts
const quantConfig = rawConfig.quantization_config;
const isGPTQ = quantConfig?.quant_method === "gptq";
const isMLX = !isGPTQ && quantConfig?.bits === 4 && quantConfig?.mode === "affine";
```

- **GPTQ-Int4**: read via `repackGPTQ` (`src/gpu/gptq-adapter.ts`). Detects `quant_method === "gptq"`.
- **MLX 4-bit**: read via `repackMLX` (`src/gpu/mlx-adapter.ts`). Tensors `.weight`/`.scales`/`.biases`.
- **On-the-fly INT4 from F32/BF16**: `quantizeInt4` (`src/gpu/quantize.ts`), group size 128.
- **NOT supported**: `compressed-tensors` / `pack-quantized`, GGUF, ONNX, MLX **DWQ**.
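
For readers unfamiliar with the on-the-fly path in the list above, here is a minimal CPU sketch of group-wise affine INT4 quantization with group size 128. It is a reference illustration, not the code in `src/gpu/quantize.ts`: the names `QuantizedGroup`, `quantizeGroup`, `quantizeInt4Reference` and `dequantizeGroup` are hypothetical, and the min/max affine scheme (`w ≈ code * scale + bias`) is an assumption chosen to mirror the `.scales`/`.biases` layout mentioned for MLX.

```typescript
// Reference sketch only: NOT the code in src/gpu/quantize.ts.
// Assumed scheme: per-group affine INT4, w ≈ code * scale + bias, code in [0, 15].
interface QuantizedGroup {
  scale: number;
  bias: number;
  codes: Uint8Array; // one 4-bit code per byte (left unpacked for readability)
}

function quantizeGroup(values: Float32Array): QuantizedGroup {
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < values.length; i++) {
    min = Math.min(min, values[i]);
    max = Math.max(max, values[i]);
  }
  // 15 steps span the group's range; a constant group gets scale 1 to avoid dividing by zero.
  const scale = max > min ? (max - min) / 15 : 1;
  const codes = new Uint8Array(values.length);
  for (let i = 0; i < values.length; i++) {
    const q = Math.round((values[i] - min) / scale);
    codes[i] = Math.min(15, Math.max(0, q));
  }
  return { scale, bias: min, codes };
}

function quantizeInt4Reference(weights: Float32Array, groupSize = 128): QuantizedGroup[] {
  const groups: QuantizedGroup[] = [];
  for (let start = 0; start < weights.length; start += groupSize) {
    groups.push(quantizeGroup(weights.subarray(start, start + groupSize)));
  }
  return groups;
}

function dequantizeGroup(group: QuantizedGroup): Float32Array {
  const out = new Float32Array(group.codes.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = group.codes[i] * group.scale + group.bias;
  }
  return out;
}
```

Under this scheme the round-trip error of any weight is bounded by half a quantization step (`scale / 2`) within its group, which is why a QAT-trained base loses less quality than a post-hoc quantized F32 checkpoint.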

⚠️ **Two loader caveats this research surfaced (both are quick fixes, but load-bearing):**
1. **MLX detection currently requires `mode === "affine"`**, but the EmbeddingGemma and Qwen3 MLX
   converts on HF write only `{"bits":4,"group_size":64}` with **no `mode` field** (and place it under
   `quantization`, not always under `quantization_config`). So the detector must be broadened to recognize
   `{bits:4, group_size}` MLX configs and read the `quantization` key, else they fall through to F32.
2. **DWQ and standard MLX are indistinguishable from config** (both say `{bits:4,group_size:64}`). The
   difference is in tensor data. This is exactly why the Qwen3 `...-4bit-DWQ` produces garbage — the
   repack treats it as affine MLX but the weights aren't. We must pin to **known-good standard converts**.
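
A minimal sketch of what the broadened detection from caveat #1 could look like, together with a pin list for caveat #2. This is an illustration under stated assumptions, not the loader's actual code: `detectQuantFormat`, `QuantFormat`, `KNOWN_GOOD_MLX_REPOS` and `isPinnedMlxRepo` are hypothetical names, and the only config shapes assumed are the ones quoted in this document.

```typescript
// Hypothetical sketch; the real loader lives in src/gpu/model-loader.ts.
interface QuantBlock {
  quant_method?: string;
  bits?: number;
  group_size?: number;
  mode?: string;
}

interface RawConfig {
  quantization_config?: QuantBlock;
  quantization?: QuantBlock; // key used by some MLX converts
}

type QuantFormat = "gptq" | "mlx4" | "none";

function detectQuantFormat(rawConfig: RawConfig): QuantFormat {
  // Read either key, preferring quantization_config when both exist.
  const q = rawConfig.quantization_config ?? rawConfig.quantization;
  if (!q) return "none";
  if (q.quant_method === "gptq") return "gptq";
  // Accept the original rule (mode === "affine") plus MLX configs that omit
  // `mode` but carry a numeric group_size, e.g. {"bits":4,"group_size":64}.
  const affineOrImplicit =
    q.mode === "affine" || (q.mode === undefined && typeof q.group_size === "number");
  return q.bits === 4 && affineOrImplicit ? "mlx4" : "none";
}

// Config alone cannot tell DWQ from standard MLX, so pin repos that were verified.
// Only the repo this document reports as a standard convert is listed here.
const KNOWN_GOOD_MLX_REPOS = new Set<string>(["mlx-community/embeddinggemma-300m-4bit"]);

function isPinnedMlxRepo(repoId: string): boolean {
  return KNOWN_GOOD_MLX_REPOS.has(repoId);
}
```

The detection is a strict superset of the current rule, so every config that is recognized today stays recognized; the pin list is what actually guards against DWQ, since no config field does.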

Architecture support today (`src/gpu/architectures/`):
- `Qwen2ForCausalLM`, `Qwen3ForCausalLM` → `generateQwen2Graph` (causal). **Qwen3-Embedding runs here**
  with `embedding:true` → `SliceLastRow` (last-token EOS pool) + `L2Norm`.
- `Qwen3_5ForConditionalGeneration` (hybrid Mamba) + a Qwen3.5 **vision ViT** that already uses
  **bidirectional attention** (`causal:false`) — proof the non-causal path works end-to-end.
- **No Gemma3 generator. No BERT/encoder text generator. Only last-token + L2Norm pooling (no mean pooling).**

`is_causal` flag: defined in the Attention kernel (`src/gpu/kernels/registry.ts`, `Params.is_causal: u32`);
`is_causal=0` → attend to all keys. Consumed by the Qwen3.5 vision tower today.
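
To make the flag's effect concrete, the sketch below shows the masking rule it selects, written as plain CPU code. It is an illustration of the semantics described above, not the WGSL kernel; `canAttend` and `visibleKeyCounts` are hypothetical helper names.

```typescript
// Illustration of the masking rule only; the real kernel is WGSL in src/gpu/kernels/.
function canAttend(queryPos: number, keyPos: number, isCausal: boolean): boolean {
  // Causal: a query sees itself and earlier keys. Non-causal: it sees every key.
  return !isCausal || keyPos <= queryPos;
}

// Number of keys each query position can see in a sequence of length seqLen.
function visibleKeyCounts(seqLen: number, isCausal: boolean): number[] {
  const counts: number[] = [];
  for (let q = 0; q < seqLen; q++) {
    let visible = 0;
    for (let k = 0; k < seqLen; k++) {
      if (canAttend(q, k, isCausal)) visible++;
    }
    counts.push(visible);
  }
  return counts;
}

console.log(visibleKeyCounts(4, true));  // [ 1, 2, 3, 4 ]
console.log(visibleKeyCounts(4, false)); // [ 4, 4, 4, 4 ]
```

An encoder needs the second pattern: every token's representation depends on the whole input, which is what makes mean or CLS pooling meaningful.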

---

## The survey (live-verified configs + sizes + MTEB)

| Model (HF repo) | model_type / arch | Params | hidden×layers | Pooling | MTEB | F32/BF16 size | q4 on HF (compatible?) |
|---|---|---|---|---|---|---|---|
| **google/embeddinggemma-300m** | `gemma3_text` / `Gemma3TextModel`, **`use_bidirectional_attention: true`** | 300M | 768 × 24 | **mean** + 2×Dense + L2Norm | **68.36 En / 61.15 Multi (v2)** | ~1.2 GB f32 | **`mlx-community/embeddinggemma-300m-4bit` = 173 MB, standard MLX-4bit, QAT base ✅** (gated, instant ack) |
| **Qwen/Qwen3-Embedding-0.6B** | `qwen3` / `Qwen3ForCausalLM` | 0.6B | 1024 × 28 | last-token EOS + L2Norm | ~64 Multi (8B=70.58 #1) | **1192 MB bf16** | DWQ (broken ❌), `kerncore/...-MXL-4bit` 335 MB (MLX, **DWQ-contaminated card, risky**), `boboliu/...-W4A16-G128` 538 MB (compressed-tensors ❌ not read) |
| **BAAI/bge-base-en-v1.5** | `bert` / `BertModel`, absolute pos | 110M | 768 × 12 | CLS (or mean) + L2Norm | 63.5 En | 438 MB f32 | only ONNX q4 ❌ → on-the-fly INT4 from F32 |
| **BAAI/bge-small-en-v1.5** | `bert` / `BertModel` | ~33M | 384 × 12 | CLS + L2Norm | 61.9 En | 133 MB f32 | only ONNX q4 ❌ → on-the-fly INT4 |
| **thenlper/gte-base** | `bert` / `BertModel` | 110M | 768 × 12 | mean + L2Norm | 62.4 En | 219 MB f16 | only ONNX/OpenVINO q8 ❌ → on-the-fly INT4 |
| **thenlper/gte-small** | `bert` / `BertModel` | ~33M | 384 × 12 | mean + L2Norm | 62.0 En | 67 MB f16 | only ONNX q8 ❌ → on-the-fly INT4 |
| **Snowflake/snowflake-arctic-embed-s** | `bert` / `BertModel`, absolute pos | ~33M | 384 × 12 | CLS + L2Norm | high mid (En) | 133 MB f32 | only ONNX q4 ❌ → on-the-fly INT4 |
| **Snowflake/snowflake-arctic-embed-xs** | `bert` / `BertModel` | ~23M | 384 × **6** | CLS + L2Norm | mid (En) | 90 MB f32 | only ONNX q4 ❌ → on-the-fly INT4 |
| **Snowflake/snowflake-arctic-embed-m-v2.0** | `gte` / `GteModel`, **RoPE**, vocab 250k | 305M | 768 × 12 | CLS + L2Norm, Matryoshka 256 | ~55–56 Multi | 1221 MB f32 | only ONNX q4 ❌ (too big anyway) |
| **nomic-ai/nomic-embed-text-v2-moe** | `nomic_bert` **MoE** (megablocks, 8 experts) | 475M (305M active) | 768 × 12 | mean + L2Norm | strong Multi | f32 large | ❌ MoE megablocks — not supported, skip |
| **jinaai/jina-embeddings-v3** | `xlm-roberta` + **5 LoRA adapters**, RoPE, Matryoshka | 572M | 1024 × 24 | mean + task-LoRA | strong Multi | 1.1 GB bf16 | ❌ LoRA-adapter routing not supported, too big, skip |
| **ibm-granite/granite-embedding-278m-multilingual** | `xlm-roberta` / `XLMRobertaModel`, absolute pos | 278M | 768 × 12 | mean (or CLS) + L2Norm | mid-low Multi | 556 MB bf16 | only standard safetensors → on-the-fly INT4 (~150 MB) |
| **minishlab/potion-base-8M** | `model2vec` / `StaticModel` (**static**, not a transformer) | 8M | 256 dim, PCA, no layers | token-lookup + mean | **51.08** (much lower) | 30 MB f32 | n/a — trivially fast, but a different (non-graph) code path |

MTEB sources: EmbeddingGemma & Qwen3 figures from official model cards (EmbeddingGemma card:
68.36 En v2 / 61.15 Multi v2; Qwen3 card: 8B #1 at 70.58, 0.6B is the small-tier sibling). BGE/GTE
English averages cross-confirmed via the AILog MTEB compilation. Potion from its own card (51.08 MTEB).

---

## Why EmbeddingGemma-300M is the pick (#1)

- **Quality:** best open model under 1B params — MTEB English v2 **68.36**, Multilingual v2 **61.15**.
  That is materially above every BERT-small option (61–63) and competitive with Qwen3-0.6B while being
  half the download.
- **Size:** **~173 MB at q4** — comfortably inside the iPad <300 MB target.
- **Format we can read:** `mlx-community/embeddinggemma-300m-4bit` is a **standard-affine MLX-4bit**
  convert (`.scales`/`.biases` tensors, `mlx-lm`) of Google's **QAT-q4_0** base
  (`google/embeddinggemma-300m-qat-q4_0-unquantized`). Because the base is quantization-aware-trained,
  q4 is near-lossless. This is the same `repackMLX` path we already use — **not** the broken DWQ variant.
- **Architecture fits our new capability:** it is a **bidirectional Gemma3 encoder**
  (`use_bidirectional_attention: true`) — exactly the case the new `is_causal=false` flag unlocked.

### Exact engine work for EmbeddingGemma (Tier-2 generator)

Config (verified): `gemma3_text`, hidden **768**, **24** layers, **3** attention heads (head_dim 256,
GQA `num_kv_heads:1`), intermediate **1152**, vocab **262144**, `max_position_embeddings` **2048**,
`gelu_pytorch_tanh`, `rms_norm_eps 1e-6`, `query_pre_attn_scalar 256`. Pooling = **mean** then two
**Dense** projection layers (the Matryoshka/bottleneck head) then **L2 Normalize**.

New generator `architectures/gemma3_embedding.ts` (or extend a gemma3 generator) needs:
1. **Bidirectional attention** — set `causal: false` on every Attention op (flag already exists). ✅ op-ready
2. **Gemma3 attention specifics:** GQA (heads 3 / kv 1), `query_pre_attn_scalar=256` scaling, head_dim 256.
3. **Mixed `layer_types`** — alternating `sliding_attention` (window 512) and `full_attention` (every 6th
   layer). For a 2048-max encoder you can **treat all layers as full bidirectional** (window 512 only
   bounds long contexts; embeddings are short) — simplest correct approximation. Sliding-window masking
   can be a later optimization.
4. **Dual RoPE bases:** `rope_local_base_freq=10000` (sliding/local layers) vs `rope_theta=1,000,000`
   (full/global layers). RoPE op exists; needs per-layer theta selection.
5. **RMSNorm** (have it), **GELU-tanh** (have it), **AddBias** (have it).
6. **Mean-pooling tail** — **NEW**: we only have `SliceLastRow` (last-token) + `L2Norm`. Add a
   `MeanPool` op (sum over valid tokens / count). Small kernel.
7. **Two Dense head layers** (768→… →768) — plain MatMul(Int4)+AddBias, then **L2Norm** (have it).
8. **Loader fix:** broaden MLX detection to accept `quantization: {bits:4, group_size:64}` with no
   `mode` field (see caveat #1 above), and read the `dense.*` head tensors.
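
Items 3 and 4 in the list above boil down to a per-layer decision, which might be sketched as follows. This is a planning helper under assumptions, not existing generator code: `planEncoderLayers`, `LayerPlan` and `Gemma3TextLikeConfig` are hypothetical names, and the fallback rule (layer `i` is full attention when `(i + 1) % 6 === 0`) is an assumed reading of "every 6th layer" that only applies when the config carries no explicit `layer_types`.

```typescript
// Hypothetical planning helper; field names follow the config keys quoted above.
type LayerType = "sliding_attention" | "full_attention";

interface Gemma3TextLikeConfig {
  num_hidden_layers: number;
  rope_theta: number;           // base for full/global layers
  rope_local_base_freq: number; // base for sliding/local layers
  layer_types?: LayerType[];    // explicit per-layer list, when present
}

interface LayerPlan {
  index: number;
  layerType: LayerType;
  ropeTheta: number;
  causal: boolean; // always false for the bidirectional encoder
}

function planEncoderLayers(cfg: Gemma3TextLikeConfig, fullEvery = 6): LayerPlan[] {
  const plans: LayerPlan[] = [];
  for (let i = 0; i < cfg.num_hidden_layers; i++) {
    // Prefer the explicit list; otherwise assume every `fullEvery`-th layer is full.
    const fallback: LayerType = (i + 1) % fullEvery === 0 ? "full_attention" : "sliding_attention";
    const layerType: LayerType = cfg.layer_types?.[i] ?? fallback;
    plans.push({
      index: i,
      layerType,
      ropeTheta: layerType === "full_attention" ? cfg.rope_theta : cfg.rope_local_base_freq,
      causal: false,
    });
  }
  return plans;
}
```

Note that even when every layer is treated as full bidirectional attention for masking purposes, the RoPE base still has to follow the layer's declared type, because the weights were trained with that base.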

Effort: **medium** (one generator + one MeanPool kernel + a small loader tweak). The repo is gated
(Gemma license), but the gate is an instant click-through; we can also point at an ungated standard-MLX
mirror or convert from the ungated `google/embeddinggemma-300m-qat-q4_0-unquantized` base ourselves.
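
The MeanPool kernel named in the effort estimate is small enough to pin down with a CPU reference, which also gives the GPU version something to be tested against. The sketch below assumes a row-major `[seqLen, dim]` hidden-state buffer and a 0/1 attention mask; `meanPool` and `l2Normalize` are hypothetical names, and the two Dense head layers that sit between pooling and normalization in EmbeddingGemma are deliberately left out.

```typescript
// CPU reference sketch for a MeanPool op followed by L2 normalization.
// Assumes `hidden` is row-major [seqLen, dim] and `mask[t]` is 1 for valid tokens.
function meanPool(
  hidden: Float32Array,
  seqLen: number,
  dim: number,
  mask: ArrayLike<number>,
): Float32Array {
  const out = new Float32Array(dim);
  let validTokens = 0;
  for (let t = 0; t < seqLen; t++) {
    if (!mask[t]) continue; // skip padding rows
    validTokens++;
    for (let d = 0; d < dim; d++) out[d] += hidden[t * dim + d];
  }
  if (validTokens === 0) return out; // nothing valid to average; return zeros
  for (let d = 0; d < dim; d++) out[d] /= validTokens;
  return out;
}

function l2Normalize(v: Float32Array, eps = 1e-12): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < v.length; i++) sumSquares += v[i] * v[i];
  const norm = Math.max(Math.sqrt(sumSquares), eps); // eps guards the all-zero vector
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) out[i] = v[i] / norm;
  return out;
}
```

For example, pooling the rows `[1, 2]` and `[3, 4]` with a third, masked-out row yields `[2, 3]`; the masked row must not influence the result, which is the property most worth asserting in a kernel test.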

---

## #2 Qwen3-Embedding-0.6B — runs today, but q4 is the problem

Architecture already works on our engine (causal path, last-token EOS pool + L2Norm) — **zero generator
work**. The blocker is purely the quant:
- BF16 is **1192 MB** — too heavy to ship/quantize on-device on iPad.
- `mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ` → **garbage in our engine** (DWQ).
- `kerncore/Qwen3-Embedding-0.6B-MXL-4bit` (335 MB) → standard MLX-4bit **in name**, but its model card
  text is copy-pasted from the **DWQ** repo, so provenance is **unverified/risky**. Would need a runtime
  correctness check (cosine vs reference) before trusting.
- `boboliu/Qwen3-Embedding-0.6B-W4A16-G128` (538 MB) → **compressed-tensors / pack-quantized**, which our
  loader does **not** read. Would need a new compressed-tensors adapter (and 538 MB is over budget).

Net: viable only if we either (a) verify the `kerncore` MLX convert is truly affine (not DWQ), or
(b) produce our own clean MLX-4bit convert of Qwen3-Embedding-0.6B from BF16. Lower quality-per-MB and
larger than EmbeddingGemma, so it's the fallback, not the default.
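
The "cosine vs reference" check mentioned above could be as simple as the following sketch: embed a fixed set of probe texts with the candidate quant and with a trusted reference (for example the BF16 model on desktop), then compare pairwise. The function names and the `0.99` default threshold are assumptions for illustration; the right threshold would need to be calibrated on a known-good quant.

```typescript
// Hypothetical acceptance check for a quantized embedding model.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat a zero vector as "no similarity"
}

// Lowest cosine between each candidate embedding and its reference counterpart.
function worstCaseCosine(candidate: Float32Array[], reference: Float32Array[]): number {
  if (candidate.length === 0 || candidate.length !== reference.length) {
    throw new Error("need the same non-empty set of probe embeddings");
  }
  let worst = 1;
  for (let i = 0; i < candidate.length; i++) {
    worst = Math.min(worst, cosineSimilarity(candidate[i], reference[i]));
  }
  return worst;
}

// minCosine = 0.99 is an assumed placeholder, not a measured acceptance bar.
function passesReferenceCheck(
  candidate: Float32Array[],
  reference: Float32Array[],
  minCosine = 0.99,
): boolean {
  return worstCaseCosine(candidate, reference) >= minCosine;
}
```

Using the worst case rather than the mean is deliberate: a mis-decoded quant such as DWQ read as affine tends to fail on every probe, but a subtly wrong convert may only fail on some.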

## #3 BERT-small encoders — smallest, but lower quality and needs a different generator

`bge-small` / `gte-small` / `arctic-embed-s` are classic `BertModel` bidirectional encoders
(hidden 384, 12 layers, absolute position embeddings, ~33M params). At q4 they're **~30–130 MB** — tiny.
But:
- They ship q4 **only as ONNX** (`onnx/model_q4.onnx`) which we don't read; we'd quantize **on-the-fly
  from F32** (no QAT, so slightly more quality loss than EmbeddingGemma's QAT base).
- They need a **new BERT encoder generator**: token + **learned absolute position** embeddings (+ optional
  token-type embedding), bidirectional attention (`causal:false`), GELU, post-LN LayerNorm (have it), and
  **mean or CLS pooling** (CLS = take row 0, easy; mean = new MeanPool op). **No RoPE** → arguably *simpler*
  than the Gemma3 generator.
- Quality tops out ~62 MTEB(En) — below EmbeddingGemma's 68.36.

Good "absolute floor" option if we want the smallest possible default or want to validate the Tier-2
encoder path on a simpler architecture first (`gte-small`/`bge-small` is the easiest first encoder to land).

## Explicitly rejected

- **nomic-embed-text-v2-moe** — megablocks MoE (8 experts, top-2). No MoE support; skip.
- **jina-embeddings-v3** — XLM-Roberta + 5 task LoRA adapters + custom flash impl; routing not supported,
  572M too big; skip.
- **arctic-embed-m-v2.0** — 1221 MB f32, GteModel w/ RoPE + 250k vocab; over budget; skip (the *small*
  arctic is the keeper).
- **potion-base-8M** — static Model2Vec (51 MTEB). Not a transformer graph; would be a separate
  token-lookup+pool code path. Worth a cheap "ultra-light" mode later, but quality is well below the rest.

---

## Suggested path

1. **Land EmbeddingGemma-300M** as the default iPad embedding model:
   - Add the Gemma3 bidirectional encoder generator + `MeanPool` kernel + Dense head + loader MLX-detect fix.
   - Use `mlx-community/embeddinggemma-300m-4bit` (or self-convert the QAT base) — ~173 MB, near-lossless q4.
2. (Optional, faster first win) **Land `gte-small`/`bge-small` BERT encoder generator** first to de-risk
   the non-causal + mean-pool path on the simplest architecture (~30–67 MB), then reuse the MeanPool for Gemma3.
3. Keep **Qwen3-Embedding-0.6B** as a desktop/high-RAM option; only ship its q4 on iPad after verifying a
   clean (non-DWQ) MLX convert against a reference cosine-similarity check.
@@ -0,0 +1,263 @@
# SOTA Small / Mobile OSS LLMs for Gerbil — Research Report (June 2026)

Research date: 2026-06-13. Engine baseline: Gerbil WebGPU INT4 engine, currently running
Qwen3.5-0.8B (hybrid SSM + attention) at 207 tok/s desktop / 51 tok/s mobile Safari.

Method: facts below are taken from **primary sources** wherever possible — live HuggingFace
`config.json` / safetensors headers fetched directly (ground truth for architecture and size),
backed by model cards and Perplexity web search for release context. Each claim is marked
**[config]** (verified from the live HF config/safetensors), **[card/docs]** (official model card
or vendor docs), or **[social/secondary]** (X/Twitter or third-party roundup, treat as softer).

---

## 0. TL;DR for the impatient

- **Gemma 4 is real and shipping.** `google/gemma-4-E2B` and `google/gemma-4-12B` exist on HF
  *today* with downloadable configs. Gemma 4 E2B is the on-device MatFormer variant (successor to
  Gemma 3n E2B/E4B). Architecture = Gemma-family with **5:1 sliding-window:full attention** — this
  is the one genuinely new kernel requirement. **[config]**
- **The "13x fused LinearAttention" post is mostly a misread.** The real, documented artifact is
  Alibaba's **FlashQLA** — a *fused linear-attention (Gated DeltaNet) kernel* for Qwen3.5/3.6,
  claiming ~3x (not 13x), on **TileLang/CUDA GPUs, not WebGPU**. No verifiable WebGPU/Claude 13x
  kernel exists in public sources. **Gerbil's `qwen3_5.ts` already implements this exact fusion**
  (causal conv1d + delta-rule scan + gating + RMSNorm in one path). **[card/docs + repo]**
- **Best next target: the Gemma 4 family**, because one new architecture generator (Gemma4, with
  sliding-window attention + final-logit softcap + Gemma RMSNorm) unlocks E2B (mobile), 12B, and
  the 26B-A4B MoE / 31B dense. Qwen3 dense (0.6/1.7/4B) is the cheapest add (it's a config tweak on
  the existing Qwen2/3 path). SmolLM3 is the easiest "new vendor" win (plain Llama-like + NoPE).

---

## 1. Latest SOTA small / mobile OSS LLMs (mid-2026)

### 1.1 Gemma — what actually exists (the user's "Gemma 4 E2B" question)

The user's instinct was right; "Gemma 4 E2B" is a real model, not a hallucination.

**Verified live from HF configs [config]:**

| Repo | model_type | Notes |
|---|---|---|
| `google/gemma-4-E2B` | `gemma4` / `Gemma4ForConditionalGeneration` | On-device MatFormer variant. 35 layers, hidden 1536, MQA (1 KV head), head_dim 256, vocab 262144, sliding_window=512, tied embeddings. |
| `google/gemma-4-12B` | `gemma4_unified` / `Gemma4UnifiedForConditionalGeneration` | 48 layers, hidden 3840, GQA 16/8, head_dim 256, sliding_window=1024, 262k context. |
| `google/gemma-3n-E2B-it`, `gemma-3n-E4B-it` | (gated, login required) | The previous-gen MatFormer mobile models. Still exist but gated. |
| `google/gemma-3-1b-it` etc. | (gated) | Gemma 3 dense family, 270M–27B. |

**The lineage (correcting the secondary-source confusion):**
- **Gemma 3** (Mar 2025): dense family 270M / 1B / 4B / 12B / 27B. **[card/docs]**
- **Gemma 3n E2B / E4B**: the *mobile-optimized MatFormer* models — "E2B"/"E4B" = *effective*
  ~2B / ~4B active params via MatFormer elastic nesting + Per-Layer Embeddings (PLE). Total stored
  params are larger than the effective label; PLE means real on-device memory ≈ the effective size.
  These were the Gemma-3-generation mobile models. **[card/docs, social/secondary]**
- **Gemma 4** (~Apr–Jun 2026): new generation. Edge variants **E2B / E4B** (MatFormer again),
  dense **12B** (released ~3 Jun 2026), plus larger **26B-A4B MoE** and **31B dense**. So "Gemma 4
  E2B" = the current-gen on-device model, the direct successor to Gemma 3n E2B. **[config for E2B/12B; card/social for E4B/26B/31B]**

**Sizes (E2B):** the bf16 single-file safetensors is **~10.2 GB on disk** [config: content-length
10,246,621,918 B]. Official Q4 footprint per Google docs is **~2.9 GB** (Q4_0); Unsloth's QAT int4
GGUF is **~2.62 GB**. **[card/docs]** E4B Q4 ≈ **4.2–4.5 GB**. The MatFormer "effective 2B" label
does NOT shrink the download: you ship the full static weights.

### 1.2 The rest of the field (verified configs + 4-bit sizes)

4-bit estimate column = group-quantized INT4 (g128) ≈ bf16_size × ~0.30 (weights at 0.5 B/param +
scales; embeddings often kept higher precision). Treat as ±15%.
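
The rule of thumb above can be written out as a small calculation. The sketch below is illustrative only: `estimateInt4Gb` simply applies the quoted 0.30 ratio, while `int4BytesPerParam` shows where a ratio in that neighborhood comes from under an assumed metadata layout (one fp16 scale and one fp16 bias, 4 bytes total, per group of 128 weights). That layout is an assumption, not something this report verifies.

```typescript
// Applies the document's rule of thumb: INT4 download ≈ bf16 size × ~0.30.
function estimateInt4Gb(bf16SizeGb: number, ratio = 0.30): number {
  return bf16SizeGb * ratio;
}

// Bytes per quantized weight: 4 bits of payload plus per-group metadata.
// metadataBytesPerGroup = 4 assumes an fp16 scale and an fp16 bias per group (hypothetical layout).
function int4BytesPerParam(groupSize = 128, metadataBytesPerGroup = 4): number {
  return 0.5 + metadataBytesPerGroup / groupSize;
}

// bf16 stores 2 bytes per weight, so the pure-weights ratio is bytesPerParam / 2.
function int4ToBf16Ratio(groupSize = 128): number {
  return int4BytesPerParam(groupSize) / 2;
}

console.log(estimateInt4Gb(1.40).toFixed(2)); // Qwen3-0.6B row
console.log(estimateInt4Gb(7.49).toFixed(2)); // Qwen3-4B row
console.log(int4ToBf16Ratio().toFixed(3));    // weights-only ratio under the assumed layout
```

Under the assumed layout the weights alone come to roughly 0.27 of the bf16 size; keeping embeddings at higher precision is what moves the practical figure toward the ~0.30 used for the table, and small models with large vocabularies sit at the high end of the ±15% band.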

| Model | Params | Arch family | model_type [config] | Context | Tied emb | bf16 size | **~INT4 download** | Official mobile/edge? |
|---|---|---|---|---|---|---|---|---|
| **Qwen3.5-0.8B** (current) | 0.8B | Hybrid GDN + attn | `qwen3_5` | 262k | yes | 1.62 GB | **~0.49 GB** | yes (edge line 0.8/2/4/9B) [social] |
| **Qwen3-0.6B** | 0.6B | Llama-like dense | `qwen3` | 32k | yes | 1.40 GB | **~0.42 GB** | de facto; not branded |
| **Qwen3-1.7B** | 1.7B | Llama-like dense | `qwen3` | 32k | yes | 3.78 GB | **~1.13 GB** | de facto |
| **Qwen3-4B** | 4B | Llama-like dense | `qwen3` | 40k | yes | 7.49 GB | **~2.25 GB** | de facto |
| **Phi-4-mini** | 3.8B | Phi3 (packed QKV, partial RoPE) | `phi3` | 128k | yes | 7.14 GB | **~2.14 GB** | local-friendly, no edge SKU |
| **SmolLM3-3B** | 3B | Llama-like + NoPE | `smollm3` | 64k | yes | 5.72 GB | **~1.72 GB** | yes — built for edge/browser |
| **Gemma 4 E2B** | ~2B eff | Gemma4 + sliding-window MatFormer | `gemma4` | 131k | yes | (10.2 GB raw) | **~2.6–2.9 GB** (official QAT) | **yes — official on-device** |
| **Gemma 4 12B** | 12B | Gemma4 unified | `gemma4_unified` | 262k | yes | 22.3 GB | **~6.7 GB** | desktop-class |
| **Llama 4 small** | 7–8B class | dense (Llama 4) | `llama4`/`llama` | long | — | ~15 GB | ~4.5 GB | no explicit edge SKU |

Notes:
- **No truly tiny official Llama mobile SKU.** Meta's smallest broadly-used open models remain
  7–8B-class; phone-class Llama is community-quantized, not an official edge variant. **[social/secondary]**
- **SmolLM3** is all-`full_attention` (no sliding window, no SSM) with **NoPE** (some layers skip
  RoPE) and tied embeddings — architecturally the simplest "new vendor" target. **[config]**

---

## 2. Architecture compatibility with Gerbil's WebGPU INT4 engine

Gerbil today (from `src/gpu/architectures/`):
- `qwen2.ts` — standard transformer (Qwen2/Qwen3 = Llama-like, QKV bias, GQA, RoPE, RMSNorm, SwiGLU).
- `qwen3_5.ts` — hybrid: full-attention layers (gated, QK-RMSNorm, partial RoPE, attn output gate)
  **+ linear_attention layers implemented as fused Gated DeltaNet** (RMSNorm → fused QKV proj →
  A/B/Z proj → causal conv1d+SiLU → gated-delta-net recurrence with L2-normed Q/K + exp decay +
  delta rule → per-head RMSNorm → SiLU gate → out proj). This is a big asset (see §3).

Existing reusable kernels: INT4 matmul (B^T), GQA attention, RMSNorm, RoPE (incl. partial), SwiGLU,
causal conv1d, delta-rule scan, gating, tied embeddings.

The per-candidate op delta follows, using the effort tiers from Gerbil's `add-model-family` skill:
**Tier 1** = hours, reuse all ops; **Tier 2** = days, 1 novel op/kernel; **Tier 3** = weeks, new
computation class. Note that Gerbil generators are **not** Qwen-specific: they are config→IR
generators over a family-agnostic IR + WGSL kernel registry. Adding a family = write a generator,
register it in `src/gpu/architectures/index.ts`, and only write a kernel if an op is missing.
Existing kernels:
Embedding/MatMul(Int4), Add, Mul, RMSNorm, LayerNorm, RoPE, Attention, Softmax, SiLU, SwiGLU, GELU,
ResidualRMSNorm, KVCacheAppend, **MambaSSM, CausalConv1d, SigmoidGate, ConvStateUpdate**, SliceLastRow.
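The gated-delta-net recurrence listed for `qwen3_5.ts` above can be illustrated with a plain CPU reference step. This is a minimal sketch of the published Gated DeltaNet update, `S ← α·S·(I − β·k·kᵀ) + β·v·kᵀ` with output `o = S·q`, written for a single head. The function names, the row-major state layout, and the scalar gates are assumptions for illustration; it is not Gerbil's WGSL kernel or its interface.

```typescript
// Hypothetical reference step for one head of a gated delta-rule recurrence.
// The state S is a dV x dK matrix stored row-major in a Float32Array.
// Algebraically: S <- alpha * S + beta * (v - alpha * S k) k^T, then o = S q.

function l2Normalize(x: number[]): number[] {
  const norm = Math.sqrt(x.reduce((acc, v) => acc + v * v, 0)) || 1;
  return x.map((v) => v / norm);
}

function gatedDeltaStep(
  state: Float32Array, // dV * dK entries, updated in place
  q: number[], // query, length dK
  k: number[], // key, length dK
  v: number[], // value, length dV
  alpha: number, // decay gate in (0, 1], e.g. exp(-g)
  beta: number, // write strength in [0, 1]
): number[] {
  const dK = k.length;
  const dV = v.length;
  const qn = l2Normalize(q);
  const kn = l2Normalize(k);
  const out: number[] = new Array<number>(dV).fill(0);
  for (let i = 0; i < dV; i++) {
    // What the decayed memory currently returns for this key.
    let predicted = 0;
    for (let j = 0; j < dK; j++) predicted += alpha * state[i * dK + j] * kn[j];
    // Delta rule: write only the difference between target and prediction.
    const correction = beta * (v[i] - predicted);
    let o = 0;
    for (let j = 0; j < dK; j++) {
      const updated = alpha * state[i * dK + j] + correction * kn[j];
      state[i * dK + j] = updated;
      o += updated * qn[j];
    }
    out[i] = o;
  }
  return out;
}
```

With `alpha = 1` and `beta = 1`, writing a second value under the same key replaces the first instead of accumulating, which is the behaviour that separates the delta rule from plain linear attention.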

### Qwen3 dense (0.6B / 1.7B / 4B) — **Tier 1 (hours)**
The same Llama-like decoder that Gerbil's `qwen2.ts` already runs. `model_type: qwen3`, GQA, RoPE
(theta 1e6), RMSNorm, SwiGLU, tied embeddings, QK-RMSNorm (already done for Qwen3.5 full-attn
layers). **New kernels: none.** Mostly a config-mapping / loader change. The Qwen3.5 full-attention
sublayer is essentially Qwen3, so the code already exists. **[config]**

### SmolLM3-3B — **Tier 1 (hours)**
Llama-like dense, all full_attention, GQA 16/4, RMSNorm, SwiGLU, tied embeddings. The one quirk:
**NoPE** — certain layers omit RoPE (config gives `layer_types`/`no_rope_layers`). This needs a
per-layer "skip RoPE" flag in the attention builder. **New kernels: none; one loader/graph flag.** **[config]**
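The per-layer "skip RoPE" flag for SmolLM3 could be derived from the config along these lines. This is a sketch under stated assumptions: the `NoPeConfig` interface and `ropeEnabledPerLayer` are hypothetical names, and the reading of `no_rope_layers` as "1 = apply RoPE, 0 = NoPE layer" is an assumption that should be checked against the actual config before use.

```typescript
// Hypothetical subset of a SmolLM3-style config.
interface NoPeConfig {
  num_hidden_layers: number;
  // Assumed convention: 1 = the layer applies RoPE, 0 = the layer is a NoPE layer.
  no_rope_layers?: number[];
}

// Returns one boolean per layer: true means "apply RoPE in this layer".
function ropeEnabledPerLayer(cfg: NoPeConfig): boolean[] {
  const flags: boolean[] = [];
  for (let layer = 0; layer < cfg.num_hidden_layers; layer++) {
    const entry = cfg.no_rope_layers?.[layer];
    // Missing entries default to RoPE on, i.e. a plain Llama-like layer.
    flags.push(entry === undefined ? true : entry !== 0);
  }
  return flags;
}
```

An attention builder would then consult `flags[layerIndex]` and leave Q and K unrotated when it is `false`, so no new kernel is involved.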

### Phi-4-mini (3.8B) — **Tier 1 (hours, loader-heavy)**
`model_type: phi3`. Two deviations from the Qwen path: (a) **packed projections**: a single fused
`qkv_proj` matrix that is split into Q/K/V, plus a packed `gate_up_proj` for the MLP; and
(b) **partial RoPE**. Gerbil already splits fused QKV (Qwen3.5 splits a fused Q+gate) and already
does partial RoPE, so this is mostly tensor-slicing config. `sliding_window` is set very large
(262144) → effectively full attention at normal context. **New kernels: none; loader handles
packed weights.** **[config]**
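The packed-weight handling for Phi-4-mini amounts to slicing one matrix by row ranges. The sketch below assumes an unquantized, row-major `[out_features, in_features]` layout with rows ordered Q, then K, then V; `PackedQkvDims` and `splitPackedQkv` are hypothetical names, and slicing already-quantized INT4 weights would additionally have to respect quantization group boundaries.

```typescript
// Hypothetical dimensions needed to split a fused QKV projection.
interface PackedQkvDims {
  numHeads: number;
  numKvHeads: number;
  headDim: number;
  hiddenSize: number; // in_features of the projection
}

interface SplitQkv {
  q: Float32Array;
  k: Float32Array;
  v: Float32Array;
}

// Splits a row-major [qRows + 2 * kvRows, hiddenSize] matrix into Q, K and V views.
function splitPackedQkv(packed: Float32Array, d: PackedQkvDims): SplitQkv {
  const qRows = d.numHeads * d.headDim;
  const kvRows = d.numKvHeads * d.headDim;
  const expected = (qRows + 2 * kvRows) * d.hiddenSize;
  if (packed.length !== expected) {
    throw new Error(`packed QKV has ${packed.length} values, expected ${expected}`);
  }
  const qEnd = qRows * d.hiddenSize;
  const kEnd = qEnd + kvRows * d.hiddenSize;
  // subarray returns views, so no weight data is copied.
  return {
    q: packed.subarray(0, qEnd),
    k: packed.subarray(qEnd, kEnd),
    v: packed.subarray(kEnd),
  };
}
```

A packed `gate_up_proj` can be handled the same way with two equal row ranges, assuming the gate half precedes the up half.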

### Gemma 4 E2B / 12B — **Tier 2 (days; sliding-window attn is the one novel op)**
`model_type: gemma4` / `gemma4_unified`. Gemma-family specifics:
- **Sliding-window attention (SWA)**, interleaved with full attention — E2B is **5:1**
  (sliding:full, full every 5th layer), 12B is **5:1** with window 1024. **This is the one real new
  kernel/masking path Gerbil lacks** — a banded/windowed attention mask (and a windowed KV cache to
  get the memory benefit). At short prompts SWA == full attention, so a correctness-first v1 can
  treat SWA as full attention and add the banded mask + windowed cache later for long-context wins.
- **Gemma RMSNorm** uses `(1 + weight)` scaling and norms in more places (pre/post attn, pre/post
  MLP). Small kernel variant of existing RMSNorm.
- **Final logit softcapping** (`tanh`-based) and often **attn logit softcapping** — a cheap
  elementwise op, but must be added or outputs are wrong.
- **head_dim 256** with **query scaling** (`query_pre_attn_scalar`) — parameter, not new kernel.
- **GeGLU** activation (Gemma uses gelu-based gating) vs SwiGLU (SiLU). Need a GeGLU variant of the
  MLP (swap SiLU→GELU). Tiny.
- **MatFormer / PLE (E2B)**: for a single fixed deployment Gerbil can just load the E2B slice as a
  normal dense model — no need to implement elastic nesting. PLE means per-layer embedding tables;
  loader must place them but it's not a new compute kernel.
- Tied embeddings, vocab 262144 (large — embed/logit matmul is heavier).
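The three small Gemma variants named in the list above (logit softcap, `(1 + weight)` RMSNorm, GeGLU) are all elementwise or per-row operations. The following CPU reference is a sketch only: the function names are hypothetical, the `eps` default is an assumption, and the tanh approximation of GELU is assumed here rather than confirmed for Gemma 4.

```typescript
// Softcap: squashes a logit smoothly into (-cap, cap) via tanh.
function softcap(x: number, cap: number): number {
  return cap * Math.tanh(x / cap);
}

// RMSNorm variant that scales by (1 + weight) instead of weight.
function gemmaRmsNorm(x: number[], weight: number[], eps = 1e-6): number[] {
  const meanSq = x.reduce((acc, v) => acc + v * v, 0) / x.length;
  const inv = 1 / Math.sqrt(meanSq + eps);
  return x.map((v, i) => v * inv * (1 + weight[i]));
}

// tanh approximation of GELU (assumed variant).
function geluTanh(x: number): number {
  const inner = Math.sqrt(2 / Math.PI) * (x + 0.044715 * x * x * x);
  return 0.5 * x * (1 + Math.tanh(inner));
}

// GeGLU: same shape as SwiGLU, with GELU replacing SiLU on the gate branch.
function geglu(gate: number[], up: number[]): number[] {
  return gate.map((g, i) => geluTanh(g) * up[i]);
}
```

With a zero weight vector `gemmaRmsNorm` reduces to an unscaled RMS normalization, which is why loading Gemma weights into a standard `weight`-scaled RMSNorm without the `+ 1` would produce wrong activations.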

**New kernels for Gemma 4: (1) windowed/banded attention mask + windowed KV cache, (2) GeGLU MLP
variant, (3) Gemma-style `(1+w)` RMSNorm, (4) logit softcap elementwise.** Items 2–4 are small;
item 1 is the only real engineering. A correctness-first build can defer item 1.
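Item 1 and the reason it can be deferred can be made concrete with a mask predicate. This is a sketch with hypothetical names; the off-by-one convention (a window of `W` covers the current token plus the `W - 1` tokens before it) is an assumption and must match whatever the reference implementation uses.

```typescript
// Causal attention predicate with an optional sliding window.
function canAttend(queryPos: number, keyPos: number, window?: number): boolean {
  if (keyPos > queryPos) return false; // causal: never look ahead
  if (window === undefined) return true; // full-attention layer
  return queryPos - keyPos < window; // banded: only the most recent `window` keys
}

// Dense boolean mask, useful as a test oracle for a banded attention kernel.
function buildMask(seqLen: number, window?: number): boolean[][] {
  const mask: boolean[][] = [];
  for (let q = 0; q < seqLen; q++) {
    const row: boolean[] = [];
    for (let k = 0; k < seqLen; k++) row.push(canAttend(q, k, window));
    mask.push(row);
  }
  return mask;
}

// Oldest key position a windowed KV cache still has to keep for a given query.
function firstKeptKey(queryPos: number, window: number): number {
  return Math.max(0, queryPos - window + 1);
}
```

While the sequence is no longer than the window, `buildMask(n, window)` equals the plain causal mask, which is the property a correctness-first build relies on. The memory saving only appears once `firstKeptKey` moves past zero and older cache entries can be dropped.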

### Gemma 4 26B-A4B (MoE) / Llama 4 — **Tier 3 (weeks); out of scope for mobile**
MoE needs a router + top-k expert gather + per-expert INT4 matmul (sparse dispatch). New kernel
family, and 6–7 GB+ at int4 — desktop-only. Skip for the mobile mandate. **[social/secondary]**
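The router + top-k gather + per-expert dispatch described above has roughly the following shape for a single token. This is an illustrative sketch with hypothetical names; the softmax-then-top-k-then-renormalize ordering is one common convention and is assumed here, since routing details differ between MoE models.

```typescript
interface ExpertChoice {
  expert: number; // index of the selected expert
  weight: number; // mixing weight, renormalized over the selected experts
}

// Picks the k highest-probability experts for one token.
function routeTopK(routerLogits: number[], k: number): ExpertChoice[] {
  const max = Math.max(...routerLogits);
  const exps = routerLogits.map((l) => Math.exp(l - max)); // stable softmax
  const total = exps.reduce((a, b) => a + b, 0);
  const ranked = exps
    .map((e, expert) => ({ expert, weight: e / total }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, k);
  const kept = ranked.reduce((a, c) => a + c.weight, 0);
  return ranked.map((c) => ({ expert: c.expert, weight: c.weight / kept }));
}

type Expert = (x: number[]) => number[];

// Sparse dispatch: only the selected experts run, and their outputs are mixed.
function moeForward(x: number[], routerLogits: number[], experts: Expert[], k: number): number[] {
  const out: number[] = new Array<number>(x.length).fill(0);
  for (const { expert, weight } of routeTopK(routerLogits, k)) {
    const y = experts[expert](x);
    for (let i = 0; i < out.length; i++) out[i] += weight * y[i];
  }
  return out;
}
```

On a GPU the per-token loop becomes a gather of tokens per expert followed by batched INT4 matmuls, which is the new kernel family the tier estimate refers to.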

Exotic-feature flag summary:
- **MatFormer/elastic params**: Gemma 3n/4 E-series — ignorable for a fixed deployment.
- **Sliding-window attention**: Gemma 4 (and Phi3 in principle) — **needs new masking + windowed cache.**
- **MoE**: Gemma 4 26B-A4B, Qwen3.5 large — new kernel family, desktop-only.
- **NoPE**: SmolLM3 — one per-layer flag.
- **Packed QKV / gate_up**: Phi-4-mini — loader-level, already partially handled.
- **Logit softcap / Gemma (1+w) RMSNorm / GeGLU**: Gemma — small new variants.
- **Tied embeddings**: every candidate — already supported.

---

## 3. The "fused LinearAttention 13x" claim — verdict

**What's real (confirmed):**
- Qwen3.5's `linear_attention` layers are **Gated DeltaNet (GDN)** — the Qwen3-Next mechanism:
  short **causal conv1d**, **delta-rule recurrent/chunked scan**, **exponential gating**, **L2-norm
  on Q/K** (replacing softmax), and **RMSNorm**. Interleaved 3:1 with gated full attention. Verified
  in Gerbil's own config (`layer_types`) and NVIDIA Megatron-Bridge / Qwen model-card docs. **[config + card/docs]**
- Alibaba shipped **FlashQLA** — "high-performance linear attention kernel library built on
  TileLang, specifically optimized for GDN chunked-prefill, the linear attention mechanism used in
  Qwen3.5 and Qwen3.6." It **fuses** the GDN ops and reports **up to ~3x** vs prior linear-attention
  kernels — **on server GPUs (TileLang/CUDA), not WebGPU.** **[card/docs]**
  (Source: alibabacloud.com/blog/flashqla-cp-bwd-friendly-fused-linear-attention-kernels-for-gdn)

**What's NOT verifiable:**
- A specific X/Twitter post claiming "Claude Opus 4.7 wrote a WebGPU kernel running Qwen3.5 13x
  faster via a fused LinearAttention op" — **no traceable primary source.** Two independent
  Perplexity sweeps (sonar-pro) over WebGPU + Qwen3.5 + GDN + "fused" + "13x" found no repo, no
  benchmark, no attributable tweet. The closest real artifacts are the FlashQLA GPU kernels (~3x)
  and the unrelated "Qwen3.6-35B Claude-4.7-Opus reasoning-distilled" model (no kernel content).
  **Treat the 13x WebGPU claim as social-media rumor / paraphrase of FlashQLA, not a reproduced result.**
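A per-op figure such as the ~3x reported for FlashQLA's GDN kernels applies only to the share of runtime spent in that op, so the end-to-end gain is smaller. The sketch below applies Amdahl's law; the runtime fractions used with it are hypothetical inputs, not measurements of Gerbil or Qwen3.5.

```typescript
// Amdahl's law: overall speedup when only a fraction of the runtime is accelerated.
function endToEndSpeedup(opFraction: number, opSpeedup: number): number {
  if (opFraction < 0 || opFraction > 1) {
    throw new RangeError("opFraction must be within [0, 1]");
  }
  if (opSpeedup <= 0) {
    throw new RangeError("opSpeedup must be positive");
  }
  return 1 / (1 - opFraction + opFraction / opSpeedup);
}
```

For example, if the GDN op were half of the total runtime, a 3x faster op would give `endToEndSpeedup(0.5, 3)`, i.e. 1.5x overall, and no speedup of that op alone could reach 2x. A 13x end-to-end result would therefore need nearly all runtime to sit in the accelerated op.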

**Does it apply to Gerbil?** It already *is* applied. Gerbil's `qwen3_5.ts` describes a single fused
Gated DeltaNet path (the docstring literally lists conv1d+SiLU → gated-delta-net recurrence →
per-head RMSNorm → SiLU gate → out proj). The "fused LinearAttention" technique = exactly what
Gerbil's Qwen3.5 path does. The remaining upside is *autoresearch-style kernel tuning* of that
fused op (chunked scan tiling, workgroup sizing, reducing dispatch/submit overhead — which the
recent Safari/Metal commits already touch), not a new architectural fusion. There is no free 13x
sitting on the table; FlashQLA's ~3x is the realistic ceiling for the GDN op specifically, and only
if Gerbil's current scan is far from optimal.

---

## 4. Recommendation — ranked targets for Gerbil

Weighting: SOTA quality × mobile download size × implementation effort against the existing engine.

**#1 — Gemma 4 E2B (highest value).** Google's flagship *official on-device* model, multimodal-capable
text core, ~2.6–2.9 GB at int4 (acceptable mobile download), strong quality. It's the model the
user actually asked about and it's real and ungated (E2B/12B configs download without auth, unlike
Gemma 3n). Effort: moderate — needs the Gemma4 generator (sliding-window mask + windowed KV cache,
GeGLU, Gemma `(1+w)` RMSNorm, logit softcap). A correctness-first v1 can treat SWA as full attention
and ship fast, then add the windowed cache for long-context/memory wins.

**#2 — Qwen3 dense 0.6B / 1.7B / 4B (cheapest, broadest).** Near-zero new kernel work — it's the
Qwen3.5 full-attention sublayer Gerbil already runs, minus the Gated DeltaNet (linear-attention)
layers. This gives a clean size ladder (0.42 / 1.13 / 2.25 GB int4) so users pick by device. Best
effort-to-coverage ratio; should likely ship *before* Gemma to derisk the loader/config plumbing.

**#3 — SmolLM3-3B (easy new-vendor win).** Plain Llama-like + NoPE flag + tied embeddings; ~1.72 GB
int4; purpose-built for edge/browser. Almost free given the Qwen3 path; adds vendor diversity.

**#4 — Phi-4-mini (3.8B).** Strong reasoning per param, 128k context, ~2.14 GB int4. Effort is
loader-level (packed QKV / gate_up, partial RoPE — both already partly handled). Good "quality
small" option but no edge branding and slightly more loader work than SmolLM3.

**#5 — Gemma 4 12B / Llama 4 / MoE variants — desktop-only, defer.** 6.7 GB+ int4 and (for 26B-A4B)
a whole MoE kernel family. Out of scope for the mobile mandate; revisit for a desktop tier.
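The int4 sizes quoted in the ranking above line up with the raw checkpoint sizes from the candidate table at a ratio of about 0.30 (for example 3.78 GB → ~1.13 GB and 7.49 GB → ~2.25 GB). The sketch below encodes that ratio and the "pick by device" idea; the 0.30 factor is inferred from the listed numbers rather than a documented formula, and all names are hypothetical.

```typescript
// Ratio inferred from the table rows (int4 size / raw size); an assumption, not a spec.
const INT4_SIZE_FACTOR = 0.3;

// Estimated int4 download size in GB, rounded to two decimals.
function estimateInt4Gb(rawGb: number): number {
  return Math.round(rawGb * INT4_SIZE_FACTOR * 100) / 100;
}

interface LadderEntry {
  name: string;
  int4Gb: number;
}

// The Qwen3 size ladder as quoted in the ranking.
const qwen3Ladder: LadderEntry[] = [
  { name: "Qwen3-0.6B", int4Gb: 0.42 },
  { name: "Qwen3-1.7B", int4Gb: 1.13 },
  { name: "Qwen3-4B", int4Gb: 2.25 },
];

// Largest model whose int4 download fits the budget, or undefined if none fits.
function pickForBudget(ladder: LadderEntry[], budgetGb: number): LadderEntry | undefined {
  return ladder
    .filter((m) => m.int4Gb <= budgetGb)
    .sort((a, b) => b.int4Gb - a.int4Gb)[0];
}
```

The budget passed to `pickForBudget` would come from whatever device heuristic the application uses; the function only illustrates how a size ladder lets one code path serve several device classes.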

### Which single architecture-family generator unlocks the most high-value models?

**The `Gemma4` generator.** One new arch family (sliding-window attention + windowed KV cache +
GeGLU + Gemma RMSNorm + logit softcap) unlocks **E2B (mobile), E4B, 12B, and — with an added MoE
path — 26B-A4B / 31B**: the entire current-gen Google open line, anchored by the single most
requested mobile model. Sliding-window attention is also reusable for other modern models (it's a
common long-context pattern), so the investment compounds.

**Sequencing recommendation:** ship **Qwen3 dense** first (days, derisks plumbing) → then build the
**Gemma4 generator** for E2B (the headline mobile model) → fold in **SmolLM3** as a cheap rider on
the Qwen3 path. That ordering front-loads low-risk wins and lands the high-value Gemma 4 E2B with
the loader already battle-tested.

---

## Sources

Primary (verified live, 2026-06-13):
- HF configs fetched directly: `Qwen/Qwen3.5-0.8B`, `Qwen/Qwen3-4B`, `Qwen/Qwen3-1.7B`,
  `Qwen/Qwen3-0.6B`, `microsoft/Phi-4-mini-instruct`, `HuggingFaceTB/SmolLM3-3B`,
  `google/gemma-4-E2B`, `google/gemma-4-12B` (config.json + safetensors index/content-length).
- Gerbil repo: `src/gpu/architectures/qwen3_5.ts`, `src/gpu/architectures/qwen2.ts`,
  `src/core/model-compat.ts`.

Docs / cards:
- Gemma 4 overview: https://ai.google.dev/gemma/docs/core
- Gemma 4 QAT int4 GGUF sizes (Unsloth): https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF,
  https://unsloth.ai/docs/models/gemma-4/qat
- FlashQLA fused GDN linear-attention kernels:
  https://www.alibabacloud.com/blog/flashqla-cp-bwd-friendly-fused-linear-attention-kernels-for-gdn_603084
- Qwen 3.5 hybrid GDN arch (NVIDIA): https://docs.nvidia.com/nemo/megatron-bridge/0.4.1/models/vlm/qwen35-vl.html
- Gated DeltaNet explainer: https://sebastianraschka.com/llms-from-scratch/ch04/08_deltanet/
- vLLM Qwen3-Next hybrid support: https://vllm.ai/blog/2025-09-11-qwen3-next
- Qwen3.5 GDN analysis: https://gist.github.com/justinchuby/0213aa253664fb72e9adb0089816de15

Secondary / context (treat softer):
- BentoML 2026 OSS LLM roundup: https://www.bentoml.com/blog/navigating-the-world-of-open-source-large-language-models
- Trilogy AI "Qwen 3.6 vs Opus 4.7 vs Gemma 4": https://trilogyai.substack.com/p/qwen-36-open-vs-opus-47-vs-gemma
- Gemma 4 E2B/E4B edge writeup: https://www.mindstudio.ai/blog/gemma-4-e2b-e4b-edge-models-phone-local

Unverified (flagged):
- "Claude Opus 4.7 wrote a WebGPU kernel running Qwen3.5 13x faster via fused LinearAttention" — no
  traceable primary source found; likely a paraphrase/conflation of FlashQLA (GPU, ~3x).