@tryhamster/gerbil 1.0.0-rc.9 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +1 -1
- package/README.md +247 -84
- package/dist/architectures-C1I5V3Dt.mjs +6070 -0
- package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
- package/dist/browser/index.d.ts +264 -588
- package/dist/browser/index.d.ts.map +1 -1
- package/dist/browser/index.js +585 -2334
- package/dist/browser/index.js.map +1 -1
- package/dist/cli.mjs +625 -1098
- package/dist/cli.mjs.map +1 -1
- package/dist/defaults-9komdrbY.mjs +24 -0
- package/dist/defaults-9komdrbY.mjs.map +1 -0
- package/dist/frameworks/express.d.mts +1 -3
- package/dist/frameworks/express.d.mts.map +1 -1
- package/dist/frameworks/express.mjs +7 -7
- package/dist/frameworks/express.mjs.map +1 -1
- package/dist/frameworks/fastify.d.mts +1 -1
- package/dist/frameworks/fastify.d.mts.map +1 -1
- package/dist/frameworks/fastify.mjs +3 -3
- package/dist/frameworks/fastify.mjs.map +1 -1
- package/dist/frameworks/hono.d.mts +1 -1
- package/dist/frameworks/hono.d.mts.map +1 -1
- package/dist/frameworks/hono.mjs +4 -4
- package/dist/frameworks/hono.mjs.map +1 -1
- package/dist/frameworks/next.d.mts +3 -2
- package/dist/frameworks/next.d.mts.map +1 -1
- package/dist/frameworks/next.mjs +4 -4
- package/dist/frameworks/next.mjs.map +1 -1
- package/dist/frameworks/react.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts.map +1 -1
- package/dist/frameworks/trpc.mjs +4 -4
- package/dist/frameworks/trpc.mjs.map +1 -1
- package/dist/gerbil-BHrJJIa4.mjs +1656 -0
- package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
- package/dist/gerbil-BT9fCydo.d.mts +488 -0
- package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
- package/dist/gerbil-DomNfIr1.mjs +4 -0
- package/dist/gpu/hooks.d.mts +520 -0
- package/dist/gpu/hooks.d.mts.map +1 -0
- package/dist/gpu/hooks.mjs +1188 -0
- package/dist/gpu/hooks.mjs.map +1 -0
- package/dist/gpu/index.d.mts +2 -0
- package/dist/gpu/index.mjs +6 -0
- package/dist/gpu-33qCAtHW.mjs +3615 -0
- package/dist/gpu-33qCAtHW.mjs.map +1 -0
- package/dist/index-Dgmb2kE3.d.mts +245 -0
- package/dist/index-Dgmb2kE3.d.mts.map +1 -0
- package/dist/index-jEAL2s-A.d.mts +2022 -0
- package/dist/index-jEAL2s-A.d.mts.map +1 -0
- package/dist/index.d.mts +22 -487
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +13 -8
- package/dist/index.mjs.map +1 -1
- package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
- package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
- package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
- package/dist/integrations/ai-sdk.d.mts +75 -6
- package/dist/integrations/ai-sdk.d.mts.map +1 -1
- package/dist/integrations/ai-sdk.mjs +131 -15
- package/dist/integrations/ai-sdk.mjs.map +1 -1
- package/dist/integrations/langchain.d.mts +1 -1
- package/dist/integrations/langchain.d.mts.map +1 -1
- package/dist/integrations/langchain.mjs +5 -5
- package/dist/integrations/langchain.mjs.map +1 -1
- package/dist/integrations/llamaindex.d.mts +1 -1
- package/dist/integrations/llamaindex.d.mts.map +1 -1
- package/dist/integrations/llamaindex.mjs +5 -5
- package/dist/integrations/llamaindex.mjs.map +1 -1
- package/dist/integrations/mcp-client.mjs +3 -3
- package/dist/integrations/mcp-client.mjs.map +1 -1
- package/dist/integrations/mcp.d.mts +3 -2
- package/dist/integrations/mcp.d.mts.map +1 -1
- package/dist/integrations/mcp.mjs +5 -5
- package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
- package/dist/mcp-1DaMsaBc.mjs.map +1 -0
- package/dist/memory/index.d.mts +3 -0
- package/dist/memory/index.mjs +6 -0
- package/dist/memory-D1P7Tmda.mjs +4 -0
- package/dist/memory-DVN0MnIG.mjs +132 -0
- package/dist/memory-DVN0MnIG.mjs.map +1 -0
- package/dist/memory-Dj0J1v88.mjs +294 -0
- package/dist/memory-Dj0J1v88.mjs.map +1 -0
- package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
- package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
- package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
- package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
- package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
- package/dist/repl-jV5gcJFA.mjs +9 -0
- package/dist/skills/index.d.mts +270 -320
- package/dist/skills/index.d.mts.map +1 -1
- package/dist/skills/index.mjs +5 -5
- package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
- package/dist/skills-DX8D59UH.mjs.map +1 -0
- package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
- package/dist/tools-DQ1mPUw5.mjs.map +1 -0
- package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
- package/dist/types-D6FiR_oh.d.mts.map +1 -0
- package/dist/types-DQBe2lFo.d.mts +165 -0
- package/dist/types-DQBe2lFo.d.mts.map +1 -0
- package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
- package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
- package/dist/vector-B0panuy6.mjs +95 -0
- package/dist/vector-B0panuy6.mjs.map +1 -0
- package/docs/PROJECT-STATE.md +321 -0
- package/docs/adding-a-model-family.md +280 -0
- package/docs/ai-sdk.md +70 -61
- package/docs/architecture/overview.md +17 -7
- package/docs/browser.md +203 -8
- package/docs/embeddings.md +156 -0
- package/docs/gerbil-site-native-migration.md +217 -0
- package/docs/gpu-engine/architectures.md +398 -0
- package/docs/gpu-engine/ir.md +372 -0
- package/docs/gpu-engine/kernels.md +718 -0
- package/docs/gpu-engine/paper.html +1759 -0
- package/docs/gpu-engine/paper.md +2109 -0
- package/docs/gpu-engine/safetensors.md +312 -0
- package/docs/gpu-engine/tokenizer.md +302 -0
- package/docs/memory-rag.md +91 -0
- package/docs/metal-safari-intel.md +190 -0
- package/docs/mobile-failure-diagnosis.md +124 -0
- package/docs/mobile.md +99 -0
- package/docs/observability.md +230 -0
- package/docs/onnx-removal-plan.md +339 -0
- package/docs/research/autoresearch-portable.md +904 -0
- package/docs/research/dispatch-reduction-hivemind.md +84 -0
- package/docs/research/ios-safari-model-caching.md +117 -0
- package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
- package/docs/research/native-stt-model-selection.md +49 -0
- package/docs/research/native-tts-model-selection.md +90 -0
- package/docs/research/native-vs-chromium-decision.md +152 -0
- package/docs/research/nemotron-mamba2-inference.md +910 -0
- package/docs/research/qwen35-multimodal.md +293 -0
- package/docs/research/qwen36-gemma4-targets.md +337 -0
- package/docs/research/sota-embedding-models.md +179 -0
- package/docs/research/sota-mobile-models-2026.md +263 -0
- package/docs/research/sota-modality-models.md +202 -0
- package/docs/research/tps-baselines.md +71 -0
- package/docs/research/webgpu-m4-reference.md +104 -0
- package/docs/site-update-plan.md +155 -0
- package/docs/structured-output.md +123 -0
- package/docs/stt.md +63 -446
- package/docs/tts.md +77 -499
- package/docs/vision.md +100 -338
- package/package.json +22 -7
- package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
- package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
- package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
- package/dist/gerbil-CJ3ifloF.mjs +0 -4
- package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
- package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
- package/dist/gerbil-qOTe1nl2.d.mts +0 -431
- package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
- package/dist/kokoro-BNTb6egA.mjs +0 -20210
- package/dist/kokoro-BNTb6egA.mjs.map +0 -1
- package/dist/kokoro-CMOGDSgT.js +0 -20212
- package/dist/kokoro-CMOGDSgT.js.map +0 -1
- package/dist/mcp-BvbriaBy.mjs.map +0 -1
- package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
- package/dist/repl-DveXw36T.mjs +0 -9
- package/dist/skills-CD3Orlex.mjs.map +0 -1
- package/dist/stt-Bu-E23Sc.js +0 -433
- package/dist/stt-Bu-E23Sc.js.map +0 -1
- package/dist/stt-CpLYbGFd.mjs +0 -433
- package/dist/stt-CpLYbGFd.mjs.map +0 -1
- package/dist/stt-DRPLEEHB.mjs +0 -3
- package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
- package/dist/transformers.web-DiD1gTwk.js +0 -44695
- package/dist/transformers.web-DiD1gTwk.js.map +0 -1
- package/dist/transformers.web-u34VxRFM.js +0 -3
- package/dist/tts-CqroPaSK.js +0 -724
- package/dist/tts-CqroPaSK.js.map +0 -1
- package/dist/tts-DXgsKGCe.mjs +0 -3
- package/dist/tts-DeGANMNV.mjs +0 -730
- package/dist/tts-DeGANMNV.mjs.map +0 -1
- package/dist/types-CiTc7ez3.d.mts.map +0 -1
- /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
- /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
- /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,910 @@

# Nemotron-3 & Mamba-2: Inference Engine Research

**Research for Gerbil WebGPU inference engine optimization.**
**Date: March 2026**

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Nemotron-3 Model Family](#2-nemotron-3-model-family)
3. [Hybrid Mamba-2 + Transformer Architecture](#3-hybrid-mamba-2--transformer-architecture)
4. [Mamba-2 / SSD Deep Dive](#4-mamba-2--ssd-deep-dive)
5. [Mixture of Experts (MoE)](#5-mixture-of-experts-moe)
6. [LatentMoE](#6-latentmoe)
7. [Multi-Token Prediction (MTP)](#7-multi-token-prediction-mtp)
8. [Quantization: NVFP4 vs INT4](#8-quantization-nvfp4-vs-int4)
9. [Nemotron-3 Nano 30B-A3B Config Reference](#9-nemotron-3-nano-30b-a3b-config-reference)
10. [Gerbil Implications & Implementation Roadmap](#10-gerbil-implications--implementation-roadmap)
11. [Small Model Viability: Nano vs Qwen 3.5 0.8B](#11-small-model-viability-nano-vs-qwen-35-08b)
12. [Sources](#12-sources)

---

## 1. Executive Summary

NVIDIA's Nemotron-3 family is a significant change for local inference: **hybrid Mamba-2 + Transformer + MoE** models that are reported to be 2.2-3.3x faster than comparable Transformer models while matching or exceeding their accuracy. The key innovations relevant to Gerbil:

| Innovation | What It Does | Gerbil Impact |
|-----------|-------------|---------------|
| **Mamba-2 layers** | Replace 92% of attention with constant-memory SSM | Limits KV cache growth to the few remaining attention layers, unlocks long context on mobile |
| **LatentMoE** | 4x bandwidth reduction in expert routing | Direct throughput multiplier on bandwidth-bound M4 Max |
| **MTP** | Native speculative decoding (97% accept rate) | ~1.5-2x decode throughput at batch-1, no draft model needed |
| **NVFP4** | 4-bit floating point with micro-block scaling | Better accuracy/size than our INT4 per-group quantization |
| **Open weights** | Apache 2.0 / nvidia-open-model-license | Can ship Nemotron models directly in Gerbil |

The smallest model, **Nemotron-3 Nano** (30B total / 3B active), is the primary target for Gerbil. No sub-1B Nemotron model exists yet, so Qwen 3.5 0.8B remains our small-model option. Nano's ~3B active parameters and MoE sparsity could still make it feasible on an M4 Max (128GB unified memory), where it would be a large capability upgrade.

---

## 2. Nemotron-3 Model Family

| Model | Total Params | Active Params | Architecture | License |
|-------|-------------|---------------|-------------|---------|
| **Nano** | 31.6B | 3.6B (w/ embeddings) | Hybrid Mamba-2 + MoE | nvidia-open-model-license |
| **Super** | ~120B | ~12B | Hybrid Mamba-2 + LatentMoE | nvidia-open-model-license |
| **Ultra** | ~500B | ~50B | Hybrid Mamba-2 + LatentMoE | nvidia-open-model-license |

### Nano Performance Claims
- **3.3x faster** than Qwen3-30B-A3B on H200 (8k input / 16k output)
- **2.2x faster** than GPT-OSS-20B
- Trained on **25 trillion tokens**
- Supports **1M token context**
- Best-in-class on reasoning, coding, math, agentic tasks in its size class

### Quantization Variants Available
- **BF16** (full precision, official)
- **FP8** (official)
- **NVFP4** (official)
- **GGUF** variants (community: IQ4_XS, MXFP4_MOE via Unsloth)

---

## 3. Hybrid Mamba-2 + Transformer Architecture

### The Core Insight

Pure Transformers have **linearly growing** KV cache and per-token attention compute during decode. Nemotron replaces ~92% of attention layers with Mamba-2, which has **constant** memory and compute per generated token.

### Layer Pattern (Nemotron-3 Nano)

From the `config.json`:
```
hybrid_override_pattern: "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME"
```

Where:
- **M** = Mamba-2 layer
- **E** = MoE (expert) layer (replaces standard FFN)
- **\*** = Attention layer (grouped-query attention with 2 KV heads)

**52 total layers:**
- **23 Mamba-2 layers** (constant-memory decode, no KV cache)
- **23 MoE layers** (128 experts, top-6 routing + 1 shared expert)
- **6 Attention layers** (GQA with 32 heads, 2 KV groups; these need KV cache)
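
The layer counts above can be checked directly against the pattern string. The following is a minimal sketch: the pattern is copied from the config excerpt, while the kind labels (`mamba2`, `moe`, `attention`) and the `parse_pattern` helper are names invented here for illustration, not part of any Gerbil or NVIDIA API.

```python
from collections import Counter

# Pattern string copied from the config excerpt above.
PATTERN = "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME"

# Hypothetical labels for this sketch; only the M / E / * symbols come from the config.
LAYER_KINDS = {"M": "mamba2", "E": "moe", "*": "attention"}


def parse_pattern(pattern: str) -> list[str]:
    """Map each pattern character to a layer kind (raises KeyError on unknown symbols)."""
    return [LAYER_KINDS[ch] for ch in pattern]


layers = parse_pattern(PATTERN)
counts = Counter(layers)
attention_fraction = counts["attention"] / len(layers)

print(len(layers), dict(counts), f"{attention_fraction:.1%}")
```

A loader could use the same list to decide, per layer index, whether to allocate a KV cache (attention), an SSM state (Mamba-2), or neither (MoE).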

### Why This Matters for Gerbil

Only 6 out of 52 layers need KV cache. For Gerbil on mobile:
- KV cache memory drops by **~88%** vs a pure Transformer in which every layer keeps a cache
- Long context becomes feasible within iOS/mobile memory budgets
- Mamba-2 decode is a scalar decay, a rank-1 state update, and one matrix-vector product per head (no softmax, no causal masking, no KV append)

### Nemotron-H Architecture Search Results

From the Nemotron-H paper, the optimal ratio is **~8% attention layers** across all model sizes:
- Nemotron-H-8B: 4 attention out of 52 layers
- Nemotron-H-56B: 10 attention out of 118 layers
- Nemotron-3 Nano 30B: 6 attention out of 52 layers (slightly higher, ~11.5%)

First layer is always Mamba-2, last layer is always FFN/MoE. Attention layers are evenly dispersed throughout.

### Inference Throughput (Nemotron-H benchmarks, H100)

| Model | vs Transformer Baseline | Context |
|-------|------------------------|---------|
| Nemotron-H-8B | **1.8-3x faster** than Qwen-2.5-7B/Llama-3.1-8B | 65k input |
| Nemotron-H-56B | **2.4x faster** than Qwen-2.5-72B/Llama-3.1-70B | 65k input |
| Nemotron-H-56B | **19.6x faster** than Llama-3.1-405B | 65k input |

---

## 4. Mamba-2 / SSD Deep Dive

### 4.1 SSM Equations

The Selective State Space Model at the core of Mamba-2:

```
State update:  h_t = a_t · h_{t-1} + x_t · B_t^T
Output:        y_t = h_t · C_t
```

Where per timestep t:
- `x_t ∈ ℝ^P`: input (P = head_dim, e.g. 64)
- `h_t ∈ ℝ^(P × N)`: hidden state (N = state_size, e.g. 128)
- `y_t ∈ ℝ^P`: output
- `a_t ∈ ℝ`: **scalar** decay (Mamba-2 key simplification)
- `B_t ∈ ℝ^N`: input projection (input-dependent)
- `C_t ∈ ℝ^N`: output projection (input-dependent)

### 4.2 Mamba-2 vs Mamba-1

| Aspect | Mamba-1 (S6) | Mamba-2 (SSD) |
|--------|-------------|---------------|
| A matrix | Diagonal `(N,)` per head | **Scalar** per head (scalar × identity) |
| State size N | 16 | **64-256** (much larger) |
| Training algorithm | Custom CUDA scan | **Matmul-based** (tensor core friendly) |
| Training speed | Baseline | **2-8x faster** |
| B, C generation | Sequential (after conv) | **Parallel** (with X) |

The critical insight: by restricting A to a scalar (instead of diagonal), the recurrence dynamics are shared across all N state dimensions. This means the state update can be expressed as matrix multiplications, enabling use of tensor cores during training.

### 4.3 The SSD (Structured State Space Duality)

The SSM recurrence can be equivalently written as a matrix multiplication:

```
Y = M · X

where M = L ⊙ (C · B^T)     [semiseparable matrix]

L[i,j] = a_i · a_{i-1} · ... · a_{j+1}   for i > j
L[i,i] = 1
L[i,j] = 0                               for i < j
```

This is structurally identical to **causal linear attention** when all `a_t = 1` (L becomes the causal mask). The scalar `a_t` values act as input-dependent relative positional encodings / decay factors.
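
The equivalence between the recurrence and the matrix form can be checked numerically. Below is a small pure-Python sketch for a single head with scalar input (P = 1) and random values; the helper names `ssm_recurrence` and `ssd_matrix` are illustrative only and the shapes are toy-sized.

```python
import random


def ssm_recurrence(a, B, C, x):
    """Run h_t = a_t * h_{t-1} + B_t * x_t and y_t = C_t . h_t for scalar inputs x_t."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_k + b_k * x_t for h_k, b_k in zip(h, B_t)]
        ys.append(sum(c_k * h_k for c_k, h_k in zip(C_t, h)))
    return ys


def ssd_matrix(a, B, C):
    """Build M = L ⊙ (C Bᵀ), where L[i][j] = a_i * ... * a_{j+1} for i >= j and 0 above."""
    T = len(a)
    M = [[0.0] * T for _ in range(T)]
    for i in range(T):
        decay = 1.0                                   # L[i][i] = 1
        for j in range(i, -1, -1):
            cb = sum(c * b for c, b in zip(C[i], B[j]))
            M[i][j] = decay * cb
            decay *= a[j]                             # now equals L[i][j-1]
    return M


rng = random.Random(0)
T, N = 6, 4
a = [rng.uniform(0.5, 1.0) for _ in range(T)]
B = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(T)]
C = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(T)]
x = [rng.gauss(0, 1) for _ in range(T)]

y_rec = ssm_recurrence(a, B, C, x)
M = ssd_matrix(a, B, C)
y_mat = [sum(M[i][j] * x[j] for j in range(T)) for i in range(T)]
max_err = max(abs(p - q) for p, q in zip(y_rec, y_mat))
print(f"max |recurrence - matrix| = {max_err:.2e}")
```

The matrix form costs O(T²) and is only useful for parallel prefill over short chunks; decode keeps using the recurrence.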

### 4.4 Decode Mode (Single Token, Gerbil Hot Path)

For autoregressive generation, Mamba-2 is a simple recurrence:

```python
# Per token, per head:
h = a_t * h + outer(x_t, B_t)   # state update: (P, N) = scalar * (P, N) + (P,) ⊗ (N,)
y = h @ C_t                     # output: (P,) = (P, N) @ (N,)
```

**Per-token cost per head:**
- 1 scalar multiply of state: `P × N` multiplies
- 1 outer product + add: `P × N` multiply-adds
- 1 matrix-vector product: `P × N` multiply-adds
- **Total: ~3PN FLOPs per head** (counting each multiply-add as one operation)

**For Nemotron-3 Nano** (64 heads, P=64, N=128):
- Per Mamba layer: `64 × 3 × 64 × 128 = 1.57M FLOPs`
- Compare to attention: KV cache read + softmax + output = much more for long sequences

**State memory per Mamba layer:**
- `num_heads × head_dim × state_size = 64 × 64 × 128 = 524,288` values
- At f32: **2MB per layer**, at f16: **1MB per layer**
- For 23 Mamba layers: **~23MB total** at f16, ~46MB at f32 (constant, regardless of sequence length!)

Compare to the attention KV cache for 1024 tokens. With GQA only the 2 KV heads are cached: `6 layers × 2(K+V) × 2 KV heads × 128 dim × 1024 tokens × 2 bytes = ~6.3MB`. Without GQA (all 32 heads cached) the same formula gives ~100MB. Either way the cache grows linearly with sequence length.
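
To make the constant-versus-linear comparison concrete, here is a small arithmetic sketch using the shapes quoted above. The function names are illustrative, and the number of cached heads is left as a parameter so both the GQA case (2 KV heads) and the no-GQA case (32 heads) can be compared.

```python
MIB = 1024 * 1024


def mamba_state_bytes(layers, heads, head_dim, state_size, bytes_per_value):
    """Total SSM state size; independent of how many tokens have been generated."""
    return layers * heads * head_dim * state_size * bytes_per_value


def kv_cache_bytes(layers, cached_heads, head_dim, tokens, bytes_per_value):
    """KV cache size: keys and values for every cached head at every position."""
    return layers * 2 * cached_heads * head_dim * tokens * bytes_per_value


state_f16 = mamba_state_bytes(layers=23, heads=64, head_dim=64, state_size=128, bytes_per_value=2)
print(f"Mamba state (f16): {state_f16 / MIB:.0f} MiB at any sequence length")

for tokens in (1024, 32_768, 1_048_576):
    gqa = kv_cache_bytes(6, 2, 128, tokens, 2)
    full = kv_cache_bytes(6, 32, 128, tokens, 2)
    print(f"{tokens:>9} tokens: KV cache {gqa / MIB:8.0f} MiB (2 KV heads), "
          f"{full / MIB:8.0f} MiB (32 heads)")
```

Under these assumptions the 6 attention layers overtake the 23 Mamba layers in memory only after a few thousand tokens with GQA, which is why the attention layers still dominate the budget at very long contexts.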

### 4.5 Prefill Mode (Parallel, Processing Prompt)

The chunked SSD algorithm processes the prompt in parallel using 4 steps:

```
Given: X (T, P), A (T,), B (T, N), C (T, N)
Chunk into blocks of size Q (default 64-128):
  X_chunks: (T/Q, Q, P)
  A_chunks: (T/Q, Q), etc.

Step 1: Intra-chunk (parallel, uses matmuls):
  Y_diag = einsum("bclhn, bcshn, bhcls, bcshp -> bclhp", C, B, L, X)
  # L is the Q×Q semiseparable mask from cumsum of A within each chunk

Step 2: Chunk states (parallel):
  states = einsum("bclhn, bhcl, bclhp -> bchpn", B, decay, X)
  # Final state of each chunk assuming zero initial state

Step 3: Inter-chunk recurrence (sequential, but over T/Q chunks only):
  new_states[c] = decay_chunk[c] * new_states[c-1] + states[c]
  # Simple scan over reduced sequence length T/Q

Step 4: Output correction (parallel):
  Y_off = einsum("bclhn, bchpn, bhcl -> bclhp", C, states, state_decay)
  # Add contribution from initial states to each chunk's output

Final: Y = Y_diag + Y_off
```

**Key property:** Steps 1, 2, 4 are all matmuls (tensor core / GPU friendly). Step 3 is a short sequential scan over T/Q elements (e.g., for 2048 tokens with Q=128, only 16 sequential steps).
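
The four steps can be sanity-checked against the plain recurrence. The sketch below is a deliberately slow pure-Python version for one head with scalar inputs (P = 1). It walks the chunks sequentially, so it only verifies the math; a GPU implementation would run steps 1, 2 and 4 for all chunks in parallel. All function names are illustrative.

```python
import random


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def naive_scan(a, B, C, x):
    """Reference: token-by-token recurrence."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_k + b_k * x_t for h_k, b_k in zip(h, B_t)]
        ys.append(dot(C_t, h))
    return ys


def chunked_scan(a, B, C, x, Q):
    """Chunked evaluation following steps 1-4 (T must be a multiple of Q)."""
    T, N = len(a), len(B[0])
    assert T % Q == 0
    ys = [0.0] * T
    carry = [0.0] * N                         # state entering the current chunk
    for s in range(0, T, Q):
        # Step 1: intra-chunk outputs, assuming zero state at the chunk start.
        for i in range(s, s + Q):
            decay = 1.0
            for j in range(i, s - 1, -1):
                ys[i] += decay * dot(C[i], B[j]) * x[j]
                decay *= a[j]
        # Step 2: final state of this chunk, again assuming zero initial state.
        local = [0.0] * N
        decay = 1.0
        for j in range(s + Q - 1, s - 1, -1):
            local = [l_k + decay * b_k * x[j] for l_k, b_k in zip(local, B[j])]
            decay *= a[j]
        chunk_decay = decay                   # product of every a_t in the chunk
        # Step 4: add the contribution of the state carried in from earlier chunks.
        prefix = 1.0
        for i in range(s, s + Q):
            prefix *= a[i]
            ys[i] += prefix * dot(C[i], carry)
        # Step 3: inter-chunk recurrence, producing the state for the next chunk.
        carry = [chunk_decay * c_k + l_k for c_k, l_k in zip(carry, local)]
    return ys


rng = random.Random(1)
T, N = 16, 3
a = [rng.uniform(0.5, 1.0) for _ in range(T)]
B = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(T)]
C = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(T)]
x = [rng.gauss(0, 1) for _ in range(T)]

reference = naive_scan(a, B, C, x)
errors = {Q: max(abs(p - q) for p, q in zip(reference, chunked_scan(a, B, C, x, Q)))
          for Q in (1, 4, 16)}
print(errors)
```

With `Q = 1` the chunked version degenerates to the recurrence, and with `Q = T` it degenerates to the full matrix form, which is a useful way to test a WGSL port at both extremes.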

### 4.6 Mamba-2 Block Architecture

```
Input x (d_model)
│
├──→ Linear projection → (expand * d_model) ──→ split into:
│         │                                        │
│         ├── z (gate, expand * d_model)           │
│         └── x' (SSM input, expand * d_model)     │
│              │                                   │
│         Conv1d(kernel=4, causal)                 │
│              │                                   │
│         SiLU(x')                                 │
│              │                                   │
│         ┌── B_t = Linear(x')  (N per head)       │
│         ├── C_t = Linear(x')  (N per head)       │
│         ├── dt = Linear(x') → softplus → a_t     │
│         │                                        │
│         └──── SSM(x', a_t, B_t, C_t) ──→ y       │
│                                        │         │
│                                   y * SiLU(z)    │  (gating)
│                                        │         │
│                                   Linear(out) ──→ output (d_model)
```

**Nemotron-3 Nano Mamba-2 config:**
- `expand = 2` → inner dim = 2 × 2688 = 5376
- `mamba_head_dim = 64` (P)
- `mamba_num_heads = 64` → 64 × 64 = 4096 (this does not match expand × hidden = 5376; needs verification)
- `ssm_state_size = 128` (N)
- `conv_kernel = 4`
- `n_groups = 8` (Mamba groups, likely for B/C sharing)
- `chunk_size = 128` (Q for chunked prefill)

### 4.7 Weight Tensors for Mamba-2 Layer

Expected HuggingFace weight names per layer:
```
model.layers.{i}.mamba.in_proj.weight      # (expand*d + expand*d, d_model) or split
model.layers.{i}.mamba.conv1d.weight       # (inner_dim, 1, conv_kernel)
model.layers.{i}.mamba.conv1d.bias         # (inner_dim,)
model.layers.{i}.mamba.out_proj.weight     # (d_model, inner_dim)
model.layers.{i}.mamba.dt_bias             # (num_heads,), added to dt before softplus
model.layers.{i}.mamba.A_log               # (num_heads,), log-space A parameter
model.layers.{i}.mamba.D                   # (num_heads,), skip connection scalar
model.layers.{i}.mamba.norm.weight         # (inner_dim,), RMSNorm before output
```

Note: `A_log` is stored in log-space for numerical stability. At runtime: `A = -exp(A_log)` (negative to ensure decay).
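
The sign convention is what guarantees a valid decay. As a quick illustration, the sketch below combines `A = -exp(A_log)` with a softplus-activated step size and checks that the resulting per-head factor always lies strictly between 0 and 1. The function names and the sample values are made up for the example.

```python
import math


def softplus(v: float) -> float:
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def decay_factor(A_log: float, dt_raw: float, dt_bias: float) -> float:
    """Per-head decay a = exp(dt * A), with A = -exp(A_log) and dt = softplus(dt_raw + dt_bias)."""
    A = -math.exp(A_log)                  # always negative
    dt = softplus(dt_raw + dt_bias)       # always positive
    return math.exp(dt * A)               # therefore strictly between 0 and 1


samples = [decay_factor(A_log, dt_raw, dt_bias=0.0)
           for A_log in (-2.0, 0.0, 2.0)
           for dt_raw in (-4.0, 0.0, 4.0)]
print([f"{a:.3g}" for a in samples])
```

Small `dt` or very negative `A_log` gives a factor near 1 (long memory); large values give a factor near 0 (the head forgets quickly).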

### 4.8 Decode Path Implementation Pseudocode

```python
import numpy as np


def silu(v):
    """SiLU(v) = v * sigmoid(v)."""
    return v / (1.0 + np.exp(-v))


def mamba2_decode_step(x, layer, state):
    """Single token decode for one Mamba-2 layer (illustrative sketch).

    x:          (d_model,) input hidden state for the current token
    state.h:    (num_heads, head_dim, state_size) SSM hidden state
    state.conv: (inner_dim, conv_kernel-1) conv1d rolling buffer
    layer:      weights from section 4.7 as NumPy arrays, plus shape fields
    """
    H, P = layer.num_heads, layer.head_dim

    # 1. Input projection
    xz = layer.in_proj @ x                            # (2 * inner_dim,)
    x_inner, z = np.split(xz, 2)                      # each (inner_dim,)

    # 2. Causal conv1d: append the new input to the buffer, apply the kernel.
    #    conv1d_weight is assumed squeezed to (inner_dim, conv_kernel).
    window = np.concatenate([state.conv, x_inner[:, None]], axis=1)
    x_conv = (window * layer.conv1d_weight).sum(axis=-1) + layer.conv1d_bias
    state.conv = window[:, 1:]                        # drop the oldest column

    # 3. SiLU activation
    x_act = silu(x_conv)

    # 4. Generate B, C, dt. In Mamba-2 these come from the in_proj output and the
    #    exact split depends on the implementation, so `project_bcdt` is a
    #    hypothetical helper standing in for that step.
    B, C, dt_raw = layer.project_bcdt(x_act)          # (H, N), (H, N), (H,)
    dt = np.logaddexp(0.0, dt_raw + layer.dt_bias)    # softplus, (H,)

    # 5. Discretize A
    A = -np.exp(layer.A_log)                          # (H,), negative real
    a = np.exp(dt * A)                                # (H,), decay per head

    # 6. SSM state update (THE HOT LOOP), vectorized over heads:
    #    h[h] = a[h] * h[h] + outer(x_heads[h], B[h])  ->  (head_dim, state_size)
    #    Reference Mamba-2 kernels also scale the input term by dt; fold that
    #    into B here if the checkpoint expects it.
    x_heads = x_act.reshape(H, P)                     # (H, P)
    state.h = a[:, None, None] * state.h + np.einsum("hp,hn->hpn", x_heads, B)

    # 7. SSM output
    y_heads = np.einsum("hpn,hn->hp", state.h, C)     # (H, P)

    # 8. Add D skip connection (one scalar per head)
    y = (y_heads + layer.D[:, None] * x_heads).reshape(-1)   # (inner_dim,)

    # 9. RMSNorm (layer.norm is assumed callable; some implementations apply
    #    the gate before the norm, so verify the order against the checkpoint)
    y = layer.norm(y)

    # 10. Gate with SiLU(z) and project out
    y = y * silu(z)
    output = layer.out_proj @ y                       # (d_model,)

    return output, state
```

### 4.9 Memory Bandwidth Analysis (Decode)

For one Mamba-2 layer in Nemotron Nano, per token (using the inner dim = 5376 figure from section 4.6):

**Weights to read:**
- `in_proj`: 2688 × (2 × 5376) × 2 bytes = ~57.8 MB at f16 (x and z halves, matching the `(2 * inner_dim,)` output in section 4.8; any B/C/dt columns add more)
- `conv1d`: 5376 × 4 × 2 = ~43 KB
- `out_proj`: 5376 × 2688 × 2 = ~28.9 MB
- Other (dt_bias, A_log, D, norm): negligible
- **Total weights: ~87 MB per Mamba layer**

**State to read/write:**
- SSM state: 64 × 64 × 128 × 4 = 2 MB (at f32)
- Conv buffer: 5376 × 3 × 2 = ~32 KB
- **Total state: ~2 MB per layer**

**Comparison with attention layer:**
- Attention weights: QKV proj + output proj ≈ similar to Mamba projections
- **Plus KV cache read**: grows with sequence length (at 1024 tokens: ~1 MB per layer per read)
- At long sequences, KV cache read dominates; Mamba avoids this entirely

---

## 5. Mixture of Experts (MoE)

### Nemotron-3 Nano MoE Configuration

```json
{
  "n_routed_experts": 128,
  "num_experts_per_tok": 6,
  "n_shared_experts": 1,
  "moe_intermediate_size": 1856,
  "moe_shared_expert_intermediate_size": 3712,
  "routed_scaling_factor": 2.5,
  "norm_topk_prob": true,
  "topk_group": 1
}
```

### How MoE Works at Inference

For each token:
1. Router (small MLP) computes scores over all 128 experts
2. Top-6 experts selected
3. Each selected expert processes the token independently
4. Outputs are weighted-summed by router scores
5. Shared expert (always active) output is added

**Active parameters per token:** 6 × (2688 × 1856 × 2) + 1 × (2688 × 3712 × 2) ≈ **80M params** (per MoE layer)
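
The five routing steps can be sketched in a few lines. This toy version uses scalar "tokens" and callables as experts so that it stays self-contained. The softmax scoring function is an assumption (the config excerpt does not say which scoring function the router uses), and `route` / `moe_forward` are illustrative names; `norm_topk_prob` and `routed_scaling_factor` mirror the config keys above.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(scores, k, norm_topk_prob=True, routed_scaling_factor=1.0):
    """Steps 1-2: pick the k highest-scoring experts and return (index, weight) pairs."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = [scores[i] for i in top]
    if norm_topk_prob:                       # renormalize over the selected experts
        total = sum(weights)
        weights = [w / total for w in weights]
    return [(i, w * routed_scaling_factor) for i, w in zip(top, weights)]


def moe_forward(x, router_logits, experts, shared_expert, k, scale=1.0):
    """Steps 3-5: run the selected experts, weight-sum them, add the shared expert."""
    picks = route(softmax(router_logits), k, True, scale)
    routed = sum(w * experts[i](x) for i, w in picks)
    return routed + shared_expert(x)


# Toy setup: 4 experts that multiply by 0, 1, 2, 3 and a shared expert that multiplies by 10.
experts = [lambda v, c=c: c * v for c in range(4)]
shared = lambda v: 10 * v
logits = [0.0, 1.0, 3.0, 2.0]

picks = route(softmax(logits), k=2)
out = moe_forward(1.0, logits, experts, shared, k=2)
print(picks, out)
```

In the real layer each expert is a two-matrix FFN and only the selected experts' weights are read, which is what makes the per-token bandwidth far smaller than the resident weight size.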

### Memory Implications for Gerbil

All 128 expert weights must be in GPU memory, but only 6 are read per token:
- Per MoE layer total weights: 128 × 2 × 2688 × 1856 × 2 bytes ≈ **2.5 GB** (at f16)
- Per MoE layer per-token read: 6 × 2 × 2688 × 1856 × 2 ≈ **120 MB**
- Plus shared expert: 2 × 2688 × 3712 × 2 ≈ **40 MB**
- **Total per MoE layer per token: ~160 MB bandwidth**

With 23 MoE layers: **~3.7 GB bandwidth per token**. This is the bottleneck.

At M4 Max 546 GB/s: theoretical minimum **~6.8ms per token** just for MoE weight reads (assuming perfect bandwidth utilization).
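
The estimate above is easy to reproduce and to re-run for other precisions. The sketch below uses only the dimensions and the 546 GB/s figure quoted in this section; the function names are illustrative, and quantization scale overhead is ignored when varying `bytes_per_weight`.

```python
D_MODEL, MOE_INTER, SHARED_INTER = 2688, 1856, 3712
TOP_K, MOE_LAYERS = 6, 23
BANDWIDTH_BYTES_PER_S = 546e9            # M4 Max figure quoted above


def expert_bytes(d_model, intermediate, bytes_per_weight):
    """Two projection matrices per expert, as in the estimate above."""
    return 2 * d_model * intermediate * bytes_per_weight


def moe_ms_per_token(bytes_per_weight):
    """Lower bound on decode time from MoE weight reads alone."""
    routed = TOP_K * expert_bytes(D_MODEL, MOE_INTER, bytes_per_weight)
    shared = expert_bytes(D_MODEL, SHARED_INTER, bytes_per_weight)
    per_token = MOE_LAYERS * (routed + shared)
    return per_token / BANDWIDTH_BYTES_PER_S * 1e3


ms_f16 = moe_ms_per_token(2.0)
ms_4bit = moe_ms_per_token(0.5)          # 4-bit weights, scales not counted
print(f"f16: {ms_f16:.2f} ms/token, 4-bit: {ms_4bit:.2f} ms/token")
```

Without rounding the intermediate values the f16 bound comes out slightly under the ~6.8 ms quoted above, and 4-bit weights would cut it by about 4x before accounting for scales.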

---

## 6. LatentMoE

### Architecture

LatentMoE reduces the memory bandwidth of MoE by projecting tokens to a smaller latent space before expert routing:

```
Standard MoE:
  x (d) → Router → Expert_i(x) (d→m→d) → weighted sum

LatentMoE:
  x (d) → Router (in d-space)
        → W_down @ x (d→ℓ)
        → Expert_i(x_latent) (ℓ→m→ℓ)
        → W_up @ y_latent (ℓ→d)
        → weighted sum
```

### Key Properties

- **Compression ratio α = d/ℓ = 4** (typical, validated up to 4x without quality loss)
- **Router stays in d-space**: routing decisions use full-dimensional information
- **Experts operate in ℓ-space**: 4x smaller input/output dimensions
- **Expert count scales up**: N′ = α × N (e.g., 128 → 512 experts)
- **Active experts scale up**: K′ = α × K (e.g., 6 → 24 active)
- **Net effect**: same compute, same params, but 4x less bandwidth per expert

### Bandwidth Savings

| Metric | Standard MoE | LatentMoE (α=4) |
|--------|-------------|-----------------|
| Expert weight read per token | `K × 2 × d × m` | `K' × 2 × ℓ × m` (same total) |
| All-to-all communication | `K × d` per token | `K × ℓ` per token (**4x less**) |
| Routing payload | `d` per token per expert | `ℓ` per token per expert (**4x less**) |
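
The "same total" entry in the first row follows directly from K′ = αK and ℓ = d/α. The short check below plugs in numbers to make that explicit. It borrows d and m from the Nano config purely as example values (Nano itself uses standard MoE), and the function name is illustrative.

```python
def expert_weight_elems(active_experts, width, intermediate):
    """Weights touched per token: two (width × intermediate) matrices per active expert."""
    return active_experts * 2 * width * intermediate


d, m, K, alpha = 2688, 1856, 6, 4
latent = d // alpha                                            # ℓ = d / α

standard = expert_weight_elems(K, d, m)                        # K × 2 × d × m
latent_scaled = expert_weight_elems(alpha * K, latent, m)      # K′ = αK experts in ℓ-space
latent_fixed_k = expert_weight_elems(K, latent, m)             # K held at 6 instead

payload_standard = d                                           # per token per expert
payload_latent = latent

print(standard, latent_scaled, latent_fixed_k, payload_standard // payload_latent)
```

So the weight traffic only shrinks if the active expert count is held at K; with K′ = αK the gain comes from each expert being 4x narrower and from the smaller per-expert payload.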

### Performance

- At iso-accuracy: **up to 3.5x throughput improvement** over standard MoE
- At iso-FLOP: significant accuracy gains (MMLU-Pro: +5.65 points)
- Overhead of down/up projection: **<9%** of total compute

### Gerbil Relevance

Nemotron-3 Nano uses standard MoE (not LatentMoE). LatentMoE is used in Super and Ultra models. However:
- If NVIDIA releases a Nano-class model with LatentMoE, the bandwidth savings would be transformative for WebGPU
- The down/up projections are simple matmuls, trivial to implement in WGSL
- Could reduce our per-token MoE bandwidth from ~160MB to ~40MB per layer (if the active expert count stays at K rather than scaling to K′)

---

## 7. Multi-Token Prediction (MTP)

### How It Works

Instead of predicting only the next token, the model has N additional prediction heads that predict tokens 2, 3, ..., N+1 steps ahead.

```
Backbone output (hidden states)
  │
  ├── Head 0 (standard): predict token t+1
  ├── Head 1 (MTP): predict token t+2
  ├── Head 2 (MTP): predict token t+3
  └── Head 3 (MTP): predict token t+4
```

### Training Benefits
- Richer training signal (predicting further ahead encourages planning)
- ~2.4% average improvement across benchmarks
- Minimal additional FLOPs during training

### Inference: Self-Speculative Decoding

The MTP heads enable speculative decoding **without a separate draft model**:

```
1. Run one forward pass → get predictions from all heads
2. Head 0 predicts token t+1 (high confidence)
3. Head 1 predicts token t+2 (draft)
4. Head 2 predicts token t+3 (draft)
5. Verify drafts by running a single forward pass on [t+1, t+2, t+3]
6. Accept all tokens up to first mismatch
```
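
Step 6 is the part that is easy to get subtly wrong, so here is a minimal sketch of greedy acceptance. It assumes the verification pass returns, for each draft position, the token the backbone itself would have produced there; `accept_drafts` is an illustrative name, not an existing Gerbil function.

```python
def accept_drafts(draft_tokens, verified_tokens):
    """Return the tokens to emit after one verification pass.

    draft_tokens[i]    : the MTP head's guess for position i
    verified_tokens[i] : the backbone's own prediction for position i, computed
                         with the earlier drafts as context
    """
    emitted = []
    for draft, verified in zip(draft_tokens, verified_tokens):
        emitted.append(verified)      # the backbone's token is always safe to emit
        if draft != verified:
            break                     # later drafts were conditioned on a wrong token
    return emitted


print(accept_drafts([5, 7, 9], [5, 7, 9]))   # all drafts agree
print(accept_drafts([5, 7, 9], [5, 8, 9]))   # mismatch at the second position
```

On a mismatch the corrected token is still emitted, so every verification pass makes progress by at least one token and the output matches what plain greedy decoding would have produced.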
|
|
473
|
+
|
|
474
|
+
### Key Numbers
|
|
475
|
+
- **97% acceptance rate** on first two predicted tokens (Nemotron-3)
|
|
476
|
+
- **Up to 3x faster** inference with 4-token prediction heads (Meta paper)
|
|
477
|
+
- Particularly effective at **batch-size-1** (exactly Gerbil's use case)
|
|
478
|
+
|
|
479
|
+
### Gerbil Implementation
|
|
480
|
+
|
|
481
|
+
For decode, MTP would:
|
|
482
|
+
1. Add lightweight prediction heads (shared backbone, separate output projections)
|
|
483
|
+
2. Generate 2-4 draft tokens per forward pass
|
|
484
|
+
3. Verify all drafts in a single prefill-style forward pass
|
|
485
|
+
4. Accept the longest matching prefix of drafts and resume decoding from the first mismatch
|
|
486
|
+
|
|
487
|
+
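**Expected speedup: ~1.5-2x** decode throughput at batch-1. Memory overhead depends on the head design: a separate full-vocabulary projection per head (2688 × 131072 ≈ 352M parameters, ~176 MB at INT4) is not negligible, so near-zero overhead requires the heads to share the existing output projection.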
|
|
488
|
+
|
|
489
|
+
**Note:** Nemotron-3 Nano's config doesn't explicitly show MTP configuration, so this may be a Super/Ultra feature or baked into the architecture at the weights level. Needs verification.
|
|
490
|
+
|
|
491
|
+
---
|
|
492
|
+
|
|
493
|
+
## 8. Quantization: NVFP4 vs INT4
|
|
494
|
+
|
|
495
|
+
### Our Current INT4
|
|
496
|
+
|
|
497
|
+
```
|
|
498
|
+
Format: 4-bit unsigned integer, packed 8 nibbles per u32
|
|
499
|
+
Dequant: output = (nibble - zero) * scale
|
|
500
|
+
Group size: 128 (configurable 32/64)
|
|
501
|
+
Scale/zero: f32 per group
|
|
502
|
+
```
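The dequantization rule above can be written out as a small CPU reference, useful for checking a kernel's output. This is a sketch under stated assumptions: nibble 0 is taken to sit in the low 4 bits of each `u32`, and scale/zero are indexed by `floor(elementIndex / groupSize)`. Neither layout detail is confirmed by the text, and `dequantInt4` is a hypothetical name.

```typescript
// CPU reference for: output = (nibble - zero) * scale
// Assumed layout (not confirmed by the source): low nibble first within each u32,
// one f32 scale and one f32 zero per group of `groupSize` elements.
function dequantInt4(
  packed: Uint32Array,
  scales: Float32Array,
  zeros: Float32Array,
  groupSize = 128,
): Float32Array {
  const out = new Float32Array(packed.length * 8);
  for (let w = 0; w < packed.length; w++) {
    const word = packed[w] ?? 0;
    for (let n = 0; n < 8; n++) {
      const idx = w * 8 + n;
      const group = Math.floor(idx / groupSize);
      const nibble = (word >>> (4 * n)) & 0xf; // unsigned 4-bit value, 0..15
      out[idx] = (nibble - (zeros[group] ?? 0)) * (scales[group] ?? 0);
    }
  }
  return out;
}
```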
|
|
503
|
+
|
|
504
|
+
### NVFP4
|
|
505
|
+
|
|
506
|
+
```
|
|
507
|
+
Format: E2M1 (1 sign bit, 2-bit exponent, 1-bit mantissa)
|
|
508
|
+
Values: {0, 0.5, 1, 1.5, 2, 3, 4, 6} × sign
|
|
509
|
+
Block scaling: E4M3 format, per 16-element micro-block
|
|
510
|
+
Global scaling: FP32, second-level scale
|
|
511
|
+
2D block scaling for weights
|
|
512
|
+
```
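To make the two-level scaling concrete, here is a minimal CPU decode sketch. It assumes the sign is the top bit of each 4-bit code and that block scales use the common FP8 E4M3 encoding (bias 7, no infinities). Function names are hypothetical, and how codes and scales are packed in a real NVFP4 checkpoint is not covered.

```typescript
// Magnitudes selected by the low 3 bits of an E2M1 code (the value set listed above).
const E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6];

// Decode one 4-bit E2M1 code; bit 3 is assumed to be the sign.
function decodeE2M1(code: number): number {
  const magnitude = E2M1_MAGNITUDES[code & 0x7] ?? 0;
  return (code & 0x8) !== 0 ? -magnitude : magnitude;
}

// Decode one FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7.
function decodeE4M3(byte: number): number {
  const sign = (byte & 0x80) !== 0 ? -1 : 1;
  const exp = (byte >> 3) & 0xf;
  const man = byte & 0x7;
  if (exp === 0xf && man === 0x7) return NaN; // the only NaN encoding
  if (exp === 0) return sign * (man / 8) * 2 ** -6; // subnormal
  return sign * (1 + man / 8) * 2 ** (exp - 7);
}

// Dequantize one micro-block: value = e2m1 * blockScale(E4M3) * globalScale(FP32).
function dequantNvfp4Block(codes: number[], blockScaleByte: number, globalScale: number): number[] {
  const scale = decodeE4M3(blockScaleByte) * globalScale;
  return codes.map((c) => decodeE2M1(c) * scale);
}
```

There is no zero point to subtract, so the inner loop is a table lookup followed by one multiply.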
|
|
513
|
+
|
|
514
|
+
### Comparison
|
|
515
|
+
|
|
516
|
+
| Aspect | Our INT4 | NVFP4 |
|
|
517
|
+
|--------|---------|-------|
|
|
518
|
+
| Element format | 4-bit unsigned int | E2M1 (4-bit float) |
|
|
519
|
+
| Group/block size | 128 elements | **16 elements** (8x finer) |
|
|
520
|
+
| Scale format | f32 | E4M3 + FP32 (two-level) |
|
|
521
|
+
| Zero point | Per-group | Not needed (symmetric) |
|
|
522
|
+
| Accuracy (vs BF16) | Not measured (no published numbers) | **<1% relative loss difference** |
|
|
523
|
+
| Selective precision | No | **Last 15% of layers in BF16** |
|
|
524
|
+
|
|
525
|
+
### Implications for Gerbil
|
|
526
|
+
|
|
527
|
+
- NVFP4's 16-element micro-blocks are **8x finer granularity** than our 128-element groups
|
|
528
|
+
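- This means 8x more scale factors to read, but each is a 1-byte E4M3 value rather than an f32 scale plus f32 zero point, so scale traffic is roughly 0.5 bits per element in both formats; the finer granularity should track local outliers better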
|
|
529
|
+
- E2M1 dequantization is simpler than INT4 (no zero point subtraction)
|
|
530
|
+
- Could implement as: read 2 packed u32s (16 nibbles) + 1 E4M3 scale + share FP32 global scale
|
|
531
|
+
- NVFP4 weights from HuggingFace could be loaded directly without requantization
|
|
532
|
+
|
|
533
|
+
---
|
|
534
|
+
|
|
535
|
+
## 9. Nemotron-3 Nano 30B-A3B Config Reference
|
|
536
|
+
|
|
537
|
+
Complete architecture config from HuggingFace:
|
|
538
|
+
|
|
539
|
+
```json
|
|
540
|
+
{
|
|
541
|
+
"architectures": ["NemotronHForCausalLM"],
|
|
542
|
+
"model_type": "nemotron_h",
|
|
543
|
+
|
|
544
|
+
"hidden_size": 2688,
|
|
545
|
+
"num_hidden_layers": 52,
|
|
546
|
+
"intermediate_size": 1856,
|
|
547
|
+
"vocab_size": 131072,
|
|
548
|
+
"max_position_embeddings": 262144,
|
|
549
|
+
|
|
550
|
+
"hybrid_override_pattern": "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME",
|
|
551
|
+
|
|
552
|
+
"num_attention_heads": 32,
|
|
553
|
+
"num_key_value_heads": 2,
|
|
554
|
+
"head_dim": 128,
|
|
555
|
+
"rope_theta": 10000,
|
|
556
|
+
|
|
557
|
+
"mamba_num_heads": 64,
|
|
558
|
+
"mamba_head_dim": 64,
|
|
559
|
+
"ssm_state_size": 128,
|
|
560
|
+
"expand": 2,
|
|
561
|
+
"conv_kernel": 4,
|
|
562
|
+
"n_groups": 8,
|
|
563
|
+
"chunk_size": 128,
|
|
564
|
+
"mamba_hidden_act": "silu",
|
|
565
|
+
|
|
566
|
+
"n_routed_experts": 128,
|
|
567
|
+
"num_experts_per_tok": 6,
|
|
568
|
+
"n_shared_experts": 1,
|
|
569
|
+
"moe_intermediate_size": 1856,
|
|
570
|
+
"moe_shared_expert_intermediate_size": 3712,
|
|
571
|
+
"routed_scaling_factor": 2.5,
|
|
572
|
+
|
|
573
|
+
"mlp_hidden_act": "relu2",
|
|
574
|
+
"layer_norm_epsilon": 1e-05,
|
|
575
|
+
"torch_dtype": "bfloat16"
|
|
576
|
+
}
|
|
577
|
+
```
|
|
578
|
+
|
|
579
|
+
### Layer Pattern Decoded
|
|
580
|
+
|
|
581
|
+
```
|
|
582
|
+
Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
|
|
583
|
+
Layer: M E M E M * E M E M E M * E M E M E M *
|
|
584
|
+
Type: m x m x m a x m x m x m a x m x m x m a
|
|
585
|
+
|
|
586
|
+
Position: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
|
|
587
|
+
Layer: E M E M E M * E M E M E M * E M E M E M
|
|
588
|
+
Type: x m x m x m a x m x m x m a x m x m x m
|
|
589
|
+
|
|
590
|
+
Position: 40 41 42 43 44 45 46 47 48 49 50 51
|
|
591
|
+
Layer: E M * E M E M E M E M E
|
|
592
|
+
Type: x m a x m x m x m x m x
|
|
593
|
+
|
|
594
|
+
m = Mamba-2, x = MoE, a = Attention (GQA)
|
|
595
|
+
```
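The layer counts can be derived directly from `hybrid_override_pattern` rather than by hand. The sketch below tallies each layer type and records the attention positions; `summarizePattern` is a hypothetical helper name.

```typescript
// The pattern string from the config above.
const PATTERN = "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME";

// Count layer types and collect the indices of attention ('*') layers.
function summarizePattern(pattern: string) {
  const counts = { mamba: 0, moe: 0, attention: 0 };
  const attentionPositions: number[] = [];
  pattern.split("").forEach((ch, i) => {
    if (ch === "M") counts.mamba++;
    else if (ch === "E") counts.moe++;
    else if (ch === "*") {
      counts.attention++;
      attentionPositions.push(i);
    } else {
      throw new Error(`Unknown pattern character '${ch}' at position ${i}`);
    }
  });
  return { total: pattern.length, counts, attentionPositions };
}

const patternSummary = summarizePattern(PATTERN);
// 52 layers in total: 23 Mamba-2, 23 MoE, 6 attention (at 5, 12, 19, 26, 33, 42).
```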
|
|
596
|
+
|
|
597
|
+
### Derived Dimensions
|
|
598
|
+
|
|
599
|
+
| Component | Dimension | Notes |
|
|
600
|
+
|-----------|-----------|-------|
|
|
601
|
+
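| Mamba inner dim | 64 heads × 64 head_dim = 4096 | `mamba_num_heads × mamba_head_dim`; `expand × hidden_size` (5376) does not apply here |
|
|
602
|
+
| Mamba state per layer | 64 heads × 64 head_dim × 128 state = 524K values | ~2MB at f32 |
|
|
603
|
+
| Mamba conv buffer | 6144 × 3 values | conv_dim = 4096 + 2 × 8 × 128; conv_kernel=4, store last 3 |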
|
|
604
|
+
| Attention KV per token | 2 KV heads × 128 dim × 2(K+V) = 512 values | ~1KB at f16 |
|
|
605
|
+
| MoE per-token read | 6 experts × 2 × 2688 × 1856 = ~60M values | ~120MB at f16 |
|
|
606
|
+
| Total model weights (BF16) | ~63 GB | All 128 experts × 23 layers |
|
|
607
|
+
| Total model weights (INT4) | ~16 GB | With quantization |
|
|
608
|
+
| Active weights per token | ~1.8 GB | Only active experts + Mamba/attention |
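The per-layer figures in this table follow from the config by straightforward arithmetic. The sketch below reproduces a few of them; the camelCase field names are illustrative stand-ins for the config keys, not an existing Gerbil type.

```typescript
// Values copied from the config above; field names are hypothetical.
const dims = {
  mambaHeads: 64, mambaHeadDim: 64, ssmStateSize: 128,
  kvHeads: 2, headDim: 128,
  hiddenSize: 2688, moeIntermediate: 1856,
  routedExperts: 128, expertsPerTok: 6, moeLayers: 23,
};

// Mamba SSM state per layer: heads × head_dim × state_size values, 4 bytes each at f32.
const mambaStateValues = dims.mambaHeads * dims.mambaHeadDim * dims.ssmStateSize;
const mambaStateBytesF32 = mambaStateValues * 4;

// Attention KV per token per attention layer: K and V for each KV head, 2 bytes each at f16.
const kvValuesPerToken = dims.kvHeads * dims.headDim * 2;
const kvBytesPerTokenF16 = kvValuesPerToken * 2;

// Routed-expert weights read per token per MoE layer: up_proj + down_proj for each active expert.
const moeValuesPerToken = dims.expertsPerTok * 2 * dims.hiddenSize * dims.moeIntermediate;

// All routed-expert parameters across the MoE layers, and their size at 4 bits each.
const routedExpertParams =
  dims.moeLayers * dims.routedExperts * 2 * dims.hiddenSize * dims.moeIntermediate;
const routedExpertGBInt4 = routedExpertParams / 2 / 1e9;
```

The routed experts account for about 29.4B of the parameters, which is why the INT4 total lands near 16 GB.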
|
|
609
|
+
|
|
610
|
+
### Weight Size Estimates (INT4/FP4 quantized)
|
|
611
|
+
|
|
612
|
+
| Component | Per Layer | Total |
|
|
613
|
+
|-----------|-----------|-------|
|
|
614
|
+
| Mamba in_proj | 2688 × 10304 / 2 = 13.85 MB (10304 = 4096 gate + 6144 conv_dim + 64 dt) | 23 × 13.85 ≈ 318.5 MB |
|
|
615
|
+
| Mamba out_proj | 4096 × 2688 / 2 = 5.5 MB | 23 × 5.5 ≈ 127 MB |
|
|
616
|
+
| Mamba conv + small | ~11 KB | ~0.25 MB |
|
|
617
|
+
| MoE routed experts | 128 × 2 × 2688 × 1856 / 2 = 0.64 GB | 23 × 0.64 = **14.7 GB** |
|
|
618
|
+
| MoE shared expert | 2 × 2688 × 3712 / 2 = 10 MB | 23 × 10 = 230 MB |
|
|
619
|
+
| Attention (QKV+O) | 4 × 2688 × 2688 / 2 = 14.5 MB | 6 × 14.5 = 87 MB |
|
|
620
|
+
| Embeddings + head | 131072 × 2688 × 2 = 670 MB | 670 MB |
|
|
621
|
+
| **Total (INT4)** | | **~16 GB** |
|
|
622
|
+
|
|
623
|
+
This fits in M4 Max 128GB unified memory and in 64GB configs, but remains impossible for mobile.
|
|
624
|
+
|
|
625
|
+
---
|
|
626
|
+
|
|
627
|
+
## 10. Gerbil Implications & Implementation Roadmap
|
|
628
|
+
|
|
629
|
+
### Phase 1: Mamba-2 Kernel (WGSL)
|
|
630
|
+
|
|
631
|
+
**New kernels needed:**
|
|
632
|
+
|
|
633
|
+
1. **Mamba2Decode** (hot path — single token)
|
|
634
|
+
- SSM state update: `h = a*h + B⊗x` per head
|
|
635
|
+
- SSM output: `y = h @ C` per head
|
|
636
|
+
- Fuse with SiLU gate and output projection
|
|
637
|
+
- Workgroup: one workgroup per head (64 threads for head_dim=64)
|
|
638
|
+
- State in storage buffer (persistent across tokens)
|
|
639
|
+
|
|
640
|
+
2. **Mamba2Conv1d** (per token)
|
|
641
|
+
- 1D causal convolution, kernel=4
|
|
642
|
+
- Maintain rolling buffer of last 3 activations
|
|
643
|
+
- Simple: read 4 values, multiply-accumulate
|
|
644
|
+
|
|
645
|
+
3. **Mamba2Prefill** (prompt processing)
|
|
646
|
+
- Chunked SSD algorithm (4 steps)
|
|
647
|
+
- Steps 1,2,4 are matmuls — reuse existing MatMul kernels
|
|
648
|
+
- Step 3 is short scan — new small kernel
|
|
649
|
+
- Chunk size 128 (from config)
|
|
650
|
+
|
|
651
|
+
4. **Mamba2InputProj / OutputProj**
|
|
652
|
+
- Standard MatVec (decode) or MatMul (prefill)
|
|
653
|
+
- Reuse existing kernels, just need correct dispatch
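As a CPU reference for the Mamba2Decode recurrence listed above (`h = a*h + B⊗x`, `y = h @ C`), here is a minimal single-head sketch. It deliberately omits the `dt` discretization, the `D` skip connection, gating, and the output projection, and it assumes a row-major `[headDim][stateSize]` state layout. `mamba2DecodeHead` is a hypothetical name, not a Gerbil kernel.

```typescript
// One decode step for one Mamba-2 head.
// h: persistent state, headDim × stateSize, row-major (updated in place)
// x: input for this head (headDim), B and C: per-step vectors (stateSize)
// a: scalar decay for this head at this step
function mamba2DecodeHead(
  h: Float32Array,
  x: Float32Array,
  B: Float32Array,
  C: Float32Array,
  a: number,
): Float32Array {
  const headDim = x.length;
  const stateSize = B.length;
  if (h.length !== headDim * stateSize || C.length !== stateSize) {
    throw new Error("Shape mismatch between state, x, B and C");
  }
  const y = new Float32Array(headDim);
  for (let p = 0; p < headDim; p++) {
    let acc = 0;
    for (let n = 0; n < stateSize; n++) {
      const i = p * stateSize + n;
      // State update: h = a*h + outer(x, B)
      const updated = a * (h[i] ?? 0) + (x[p] ?? 0) * (B[n] ?? 0);
      h[i] = updated;
      // Output: y = h @ C
      acc += updated * (C[n] ?? 0);
    }
    y[p] = acc;
  }
  return y;
}
```

In a WGSL version, the outer loop over `p` maps naturally onto the 64 threads of a per-head workgroup, with each thread owning one row of the state.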
|
|
654
|
+
|
|
655
|
+
**State management changes:**
|
|
656
|
+
- New persistent buffer type: SSM state (per layer, constant size)
|
|
657
|
+
- Conv buffer: rolling window (per layer, width=3)
|
|
658
|
+
- These replace KV cache for Mamba layers (massive memory savings)
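The rolling conv buffer can be illustrated with a small CPU sketch of one decode step of a depthwise causal conv1d. The layouts are assumptions: weights are stored as `weight[c*K + j]` with the last tap applied to the current input, and the state keeps the previous `K-1` inputs per channel, oldest first. The SiLU that normally follows the convolution is left out, and `conv1dDecodeStep` is a hypothetical name.

```typescript
// One decode step of a depthwise causal conv1d with kernel width K (K >= 2).
// state: (K-1) previous inputs per channel, oldest first (updated in place)
// x: current input per channel; weight: K taps per channel; bias: one per channel
function conv1dDecodeStep(
  state: Float32Array,
  x: Float32Array,
  weight: Float32Array,
  bias: Float32Array,
  K = 4,
): Float32Array {
  const channels = x.length;
  const out = new Float32Array(channels);
  for (let c = 0; c < channels; c++) {
    const s = c * (K - 1);
    const w = c * K;
    let acc = bias[c] ?? 0;
    // Taps 0..K-2 multiply the stored history, tap K-1 multiplies the new input.
    for (let j = 0; j < K - 1; j++) acc += (weight[w + j] ?? 0) * (state[s + j] ?? 0);
    acc += (weight[w + K - 1] ?? 0) * (x[c] ?? 0);
    out[c] = acc;
    // Slide the window: drop the oldest input, append the current one.
    for (let j = 0; j < K - 2; j++) state[s + j] = state[s + j + 1] ?? 0;
    state[s + K - 2] = x[c] ?? 0;
  }
  return out;
}
```

The state size is fixed at `channels × (K-1)` regardless of sequence length, which is the source of the memory savings relative to a KV cache.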
|
|
659
|
+
|
|
660
|
+
### Phase 2: MoE Routing
|
|
661
|
+
|
|
662
|
+
**New kernels:**
|
|
663
|
+
1. **TopKRouter** — compute router scores, find top-6 experts
|
|
664
|
+
2. **MoEGather** — select expert weights for active experts
|
|
665
|
+
3. **MoEScatter** — weighted sum of expert outputs
|
|
666
|
+
|
|
667
|
+
**Architecture changes:**
|
|
668
|
+
- Expert weights stored as large buffers (128 experts × weight matrix)
|
|
669
|
+
- Router dispatches only active expert kernels per token
|
|
670
|
+
- Shared expert always dispatched in parallel
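A CPU reference for the routing step might look like the sketch below. It assumes the router scores are already non-negative (for example after a sigmoid or softmax) and renormalizes over the selected experts. The exact scoring function used by Nemotron, and where `routed_scaling_factor` is applied, are not specified here and would need to be checked against the reference implementation. Both function names are hypothetical.

```typescript
// Pick the k highest-scoring experts and normalize their weights to sum to 1.
function routeTopK(scores: number[], k: number): { expert: number; weight: number }[] {
  const ranked = scores
    .map((score, expert) => ({ expert, score }))
    // Sort by score descending; break ties by lower expert index for determinism.
    .sort((a, b) => b.score - a.score || a.expert - b.expert)
    .slice(0, k);
  const sum = ranked.reduce((s, r) => s + r.score, 0);
  return ranked.map((r) => ({
    expert: r.expert,
    weight: sum > 0 ? r.score / sum : 1 / ranked.length,
  }));
}

// Weighted sum of the selected experts' output vectors (the "scatter" step).
function combineExperts(outputs: number[][], weights: number[]): number[] {
  const dim = outputs[0]?.length ?? 0;
  const combined = new Array<number>(dim).fill(0);
  outputs.forEach((vec, e) => {
    const w = weights[e] ?? 0;
    for (let d = 0; d < dim; d++) combined[d] = (combined[d] ?? 0) + w * (vec[d] ?? 0);
  });
  return combined;
}
```

For Nano this would be called with 128 scores and `k = 6`; the shared expert's output is added separately since it bypasses the router.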
|
|
671
|
+
|
|
672
|
+
### Phase 3: Graph Generator for NemotronH
|
|
673
|
+
|
|
674
|
+
New architecture handler:
|
|
675
|
+
```typescript
|
|
676
|
+
function generateNemotronHGraph(config: NemotronHConfig): ModelGraph {
|
|
677
|
+
const pattern = config.hybrid_override_pattern;
|
|
678
|
+
for (let i = 0; i < config.num_hidden_layers; i++) {
|
|
679
|
+
if (pattern[i] === 'M') {
|
|
680
|
+
// Mamba-2 layer: in_proj → conv1d → SSM → gate → out_proj
|
|
681
|
+
addMamba2Layer(graph, i);
|
|
682
|
+
} else if (pattern[i] === '*') {
|
|
683
|
+
// Attention layer: QKV → attention → out_proj (existing kernels)
|
|
684
|
+
addAttentionLayer(graph, i);
|
|
685
|
+
} else if (pattern[i] === 'E') {
|
|
686
|
+
// MoE layer: router → top-k gather → expert FFN → scatter
|
|
687
|
+
addMoELayer(graph, i);
|
|
688
|
+
}
|
|
689
|
+
}
|
|
690
|
+
}
|
|
691
|
+
```
|
|
692
|
+
|
|
693
|
+
### Phase 4: MTP Speculative Decoding (if applicable)
|
|
694
|
+
|
|
695
|
+
- Add lightweight prediction heads
|
|
696
|
+
- Implement verify-and-accept loop
|
|
697
|
+
- Expected ~1.5-2x decode speedup
|
|
698
|
+
|
|
699
|
+
### Priority Order
|
|
700
|
+
|
|
701
|
+
1. **NemotronH graph generator** — ships the architecture adapter, validates IR design
|
|
702
|
+
2. **Mamba-2 decode kernel** — gates everything, simplest to validate
|
|
703
|
+
3. **MoE routing + dispatch** — needed for any Nemotron model
|
|
704
|
+
4. **Mamba-2 prefill (chunked SSD)** — needed for prompt processing
|
|
705
|
+
5. **NVFP4 dequantization** — use their quantized weights directly
|
|
706
|
+
6. **MTP speculative decode** — optimization on top
|
|
707
|
+
|
|
708
|
+
---
|
|
709
|
+
|
|
710
|
+
## 10.5. NemotronH Graph Generator — Detailed Implementation Plan
|
|
711
|
+
|
|
712
|
+
### Verified Model Structure (from `modeling_nemotron_h.py`)
|
|
713
|
+
|
|
714
|
+
The actual HuggingFace implementation reveals key differences from assumptions:
|
|
715
|
+
|
|
716
|
+
**Weight prefix is `backbone.` not `model.`:**
|
|
717
|
+
```
|
|
718
|
+
backbone.embeddings.weight
|
|
719
|
+
backbone.layers.{i}.norm.weight
|
|
720
|
+
backbone.layers.{i}.mixer.{component}
|
|
721
|
+
backbone.norm_f.weight
|
|
722
|
+
lm_head.weight
|
|
723
|
+
```
|
|
724
|
+
|
|
725
|
+
**All layer types share the `.mixer.` namespace** — Mamba, Attention, and MLP/MoE all live under `layers.{i}.mixer.*`.
|
|
726
|
+
|
|
727
|
+
**Pattern characters map to:**
|
|
728
|
+
- **M** → `NemotronHMamba2Mixer` (standard Mamba-2 SSM)
|
|
729
|
+
- **E** → `NemotronHMLP` (8B dense model) / MoE (30B Nano model)
|
|
730
|
+
- **\*** → `NemotronHAttention` (GQA)
|
|
731
|
+
|
|
732
|
+
### Verified Safetensors Key Names
|
|
733
|
+
|
|
734
|
+
**Mamba layers (`M`):**
|
|
735
|
+
```
|
|
736
|
+
backbone.layers.{i}.norm.weight # pre-norm
|
|
737
|
+
backbone.layers.{i}.mixer.in_proj.weight # fused [d_mlp, d_mlp, gate, x_B_C, dt]
|
|
738
|
+
backbone.layers.{i}.mixer.conv1d.weight # depthwise conv1d
|
|
739
|
+
backbone.layers.{i}.mixer.conv1d.bias
|
|
740
|
+
backbone.layers.{i}.mixer.dt_bias # (num_heads,)
|
|
741
|
+
backbone.layers.{i}.mixer.A_log # (num_heads,) log-space decay
|
|
742
|
+
backbone.layers.{i}.mixer.D # (num_heads,) skip connection
|
|
743
|
+
backbone.layers.{i}.mixer.norm.weight # gated RMSNorm before output
|
|
744
|
+
backbone.layers.{i}.mixer.out_proj.weight # back to hidden_size
|
|
745
|
+
```
|
|
746
|
+
|
|
747
|
+
**Attention layers (`*`):**
|
|
748
|
+
```
|
|
749
|
+
backbone.layers.{i}.norm.weight
|
|
750
|
+
backbone.layers.{i}.mixer.q_proj.weight
|
|
751
|
+
backbone.layers.{i}.mixer.k_proj.weight
|
|
752
|
+
backbone.layers.{i}.mixer.v_proj.weight
|
|
753
|
+
backbone.layers.{i}.mixer.o_proj.weight
|
|
754
|
+
```
|
|
755
|
+
|
|
756
|
+
**MLP layers (`E` in 8B dense):**
|
|
757
|
+
```
|
|
758
|
+
backbone.layers.{i}.norm.weight
|
|
759
|
+
backbone.layers.{i}.mixer.up_proj.weight
|
|
760
|
+
backbone.layers.{i}.mixer.down_proj.weight
|
|
761
|
+
```
|
|
762
|
+
|
|
763
|
+
**MoE layers (`E` in 30B Nano) — needs verification:**
|
|
764
|
+
```
|
|
765
|
+
backbone.layers.{i}.norm.weight
|
|
766
|
+
backbone.layers.{i}.mixer.gate.weight # router
|
|
767
|
+
backbone.layers.{i}.mixer.experts.{j}.up_proj.weight
|
|
768
|
+
backbone.layers.{i}.mixer.experts.{j}.down_proj.weight
|
|
769
|
+
backbone.layers.{i}.mixer.shared_experts.up_proj.weight # shared expert
|
|
770
|
+
backbone.layers.{i}.mixer.shared_experts.down_proj.weight
|
|
771
|
+
```
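The key lists above translate into a small lookup that a loader could use to check a checkpoint before building the graph. This is a sketch: `expectedKeys` is a hypothetical helper, and the MoE branch uses the unverified key names from the list above.

```typescript
type LayerKind = "M" | "*" | "E";

// Return the safetensors keys expected for one layer.
// numExperts = 0 selects the dense MLP form of 'E'; > 0 selects the MoE form
// (whose key names still need verification against a real checkpoint).
function expectedKeys(layer: number, kind: LayerKind, numExperts = 0): string[] {
  const base = `backbone.layers.${layer}`;
  const mixer = (suffix: string) => `${base}.mixer.${suffix}`;
  const keys = [`${base}.norm.weight`];
  if (kind === "M") {
    keys.push(
      ...["in_proj.weight", "conv1d.weight", "conv1d.bias", "dt_bias",
          "A_log", "D", "norm.weight", "out_proj.weight"].map(mixer),
    );
  } else if (kind === "*") {
    keys.push(...["q_proj.weight", "k_proj.weight", "v_proj.weight", "o_proj.weight"].map(mixer));
  } else if (numExperts === 0) {
    keys.push(mixer("up_proj.weight"), mixer("down_proj.weight"));
  } else {
    keys.push(mixer("gate.weight"));
    for (let j = 0; j < numExperts; j++) {
      keys.push(mixer(`experts.${j}.up_proj.weight`), mixer(`experts.${j}.down_proj.weight`));
    }
    keys.push(mixer("shared_experts.up_proj.weight"), mixer("shared_experts.down_proj.weight"));
  }
  return keys;
}
```

Comparing this list against the keys actually present in the safetensors index is one way to settle the "needs verification" note for the MoE layers.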
|
|
772
|
+
|
|
773
|
+
### Mamba-2 in_proj Split (from source code)
|
|
774
|
+
|
|
775
|
+
NemotronH's Mamba-2 fuses everything into a single `in_proj`:
|
|
776
|
+
|
|
777
|
+
```python
|
|
778
|
+
# in_proj output is split as:
|
|
779
|
+
_, _, gate, hidden_states_B_C, dt = projected_states.split(
|
|
780
|
+
[d_mlp, d_mlp, intermediate_size, conv_dim, num_heads], dim=-1
|
|
781
|
+
)
|
|
782
|
+
|
|
783
|
+
# where:
|
|
784
|
+
# intermediate_size = mamba_num_heads * mamba_head_dim = 64 * 64 = 4096
|
|
785
|
+
# conv_dim = intermediate_size + 2 * n_groups * ssm_state_size
|
|
786
|
+
# = 4096 + 2 * 8 * 128 = 6144
|
|
787
|
+
# num_heads = 64 (for dt)
|
|
788
|
+
# d_mlp = any remaining dimensions / 2 (optional MLP component)
|
|
789
|
+
|
|
790
|
+
# Then hidden_states_B_C is further split:
|
|
791
|
+
hidden_states, B, C = torch.split(hidden_states_B_C,
|
|
792
|
+
[intermediate_size, n_groups*ssm_state_size, n_groups*ssm_state_size], dim=-1)
|
|
793
|
+
```
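On the Gerbil side, the split sizes can be computed from the config so the graph generator can slice the fused `in_proj` output. The sketch below mirrors the arithmetic in the comments above; `inProjSplitSizes` and the `MambaDims` field names are hypothetical.

```typescript
interface MambaDims {
  numHeads: number;   // mamba_num_heads
  headDim: number;    // mamba_head_dim
  nGroups: number;    // n_groups
  stateSize: number;  // ssm_state_size
}

// Compute the five split sizes [d_mlp, d_mlp, gate, x_B_C, dt] for in_proj's output.
// inProjOutDim is the actual output width of in_proj.weight; when omitted,
// d_mlp is taken to be 0 (no optional MLP component).
function inProjSplitSizes(d: MambaDims, inProjOutDim?: number) {
  const intermediate = d.numHeads * d.headDim;
  const convDim = intermediate + 2 * d.nGroups * d.stateSize;
  const core = intermediate + convDim + d.numHeads;
  const extra = (inProjOutDim ?? core) - core;
  if (extra < 0 || extra % 2 !== 0) {
    throw new Error(`in_proj width ${inProjOutDim} is inconsistent with core width ${core}`);
  }
  const dMlp = extra / 2;
  return { intermediate, convDim, dMlp, sizes: [dMlp, dMlp, intermediate, convDim, d.numHeads] };
}

const nanoSplit = inProjSplitSizes({ numHeads: 64, headDim: 64, nGroups: 8, stateSize: 128 });
// intermediate = 4096, convDim = 6144, core in_proj width = 4096 + 6144 + 64 = 10304
```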
|
|
794
|
+
|
|
795
|
+
This differs from Qwen3.5's separate `in_proj_qkv`, `in_proj_a`, `in_proj_b`, `in_proj_z` projections.
|
|
796
|
+
|
|
797
|
+
### Files to Change
|
|
798
|
+
|
|
799
|
+
| File | Change | Size |
|
|
800
|
+
|------|--------|------|
|
|
801
|
+
| `src/gpu/architectures/nemotron_h.ts` | **New file** — graph generator | ~400 lines |
|
|
802
|
+
| `src/gpu/architectures/index.ts` | Register `NemotronHForCausalLM` | 3 lines |
|
|
803
|
+
| `src/gpu/ir.ts` | Add NemotronH canonical keys + HF key mapper | ~40 lines |
|
|
804
|
+
|
|
805
|
+
### Existing Op Reuse
|
|
806
|
+
|
|
807
|
+
| Component | Exists in Gerbil? | Reusable? |
|
|
808
|
+
|-----------|-------------------|-----------|
|
|
809
|
+
| MatMul / MatMulInt4 | Yes | Direct reuse for all projections |
|
|
810
|
+
| Attention + KV cache | Yes | Direct reuse for \* layers |
|
|
811
|
+
| RMSNorm / ResidualRMSNorm | Yes | Direct reuse |
|
|
812
|
+
| SiLU / SwiGLU | Yes | Direct reuse for gating |
|
|
813
|
+
| CausalConv1d | Yes (Qwen3.5) | Direct reuse for Mamba conv |
|
|
814
|
+
| ConvStateUpdate | Yes (Qwen3.5) | Direct reuse |
|
|
815
|
+
| MambaSSM | **Partial** | Qwen3.5 implements Gated DeltaNet. NemotronH needs standard Mamba-2 SSM (different recurrence: `h = a*h + outer(B,x)` vs delta rule). Needs new kernel variant. |
|
|
816
|
+
| ReLU² activation | **No** | Trivial — `relu(x)²` or `x * relu(x)`, ~20 lines WGSL |
|
|
817
|
+
| MoERouter (top-k) | **No** (stubbed in IR) | New kernel: compute router logits, select top-6 |
|
|
818
|
+
| MoE Expert dispatch | **No** (stubbed in IR) | New kernel: route tokens to experts, weighted combine |
|
|
819
|
+
|
|
820
|
+
### What the Graph Generator Produces (But Can't Run Yet)
|
|
821
|
+
|
|
822
|
+
The graph generator will emit correct IR nodes referencing ops whose kernels don't exist yet. This is the same pattern used for Qwen3.5: the graph documents the compute graph, and inference should work once the missing kernels are implemented.
|
|
823
|
+
|
|
824
|
+
**Ops that block end-to-end execution:**
|
|
825
|
+
1. **MoE kernels** (23 E-layers × 128 experts) — the big one
|
|
826
|
+
2. **Standard Mamba-2 SSM kernel** — differs from Qwen3.5's DeltaNet kernel
|
|
827
|
+
3. **ReLU² activation** — trivial to add
|
|
828
|
+
|
|
829
|
+
### Approach
|
|
830
|
+
|
|
831
|
+
Ship the graph generator now (Option A). It serves as:
|
|
832
|
+
- Verified architecture documentation (weight names, dimensions, layer pattern)
|
|
833
|
+
- Correct IR that will work when kernels land
|
|
834
|
+
- Test target for kernel development (load config → generate graph → inspect nodes)
|
|
835
|
+
|
|
836
|
+
---
|
|
837
|
+
|
|
838
|
+
## 11. Small Model Viability: Nano vs Qwen 3.5 0.8B
|
|
839
|
+
|
|
840
|
+
### Does Nano Compete at the Small End?
|
|
841
|
+
|
|
842
|
+
**No.** Nemotron-3 Nano is not a small model replacement for Qwen 3.5 0.8B:
|
|
843
|
+
|
|
844
|
+
| Aspect | Qwen 3.5 0.8B | Nemotron-3 Nano 30B-A3B |
|
|
845
|
+
|--------|--------------|------------------------|
|
|
846
|
+
| Total params | 0.8B | 31.6B |
|
|
847
|
+
| Active params | 0.8B (dense) | 3.6B (MoE) |
|
|
848
|
+
| Weights (INT4) | ~0.4 GB | ~16 GB |
|
|
849
|
+
| Target hardware | Mobile, browser, edge | M4 Max, RTX, DGX Spark |
|
|
850
|
+
| Context length | 32K | 1M |
|
|
851
|
+
| Capability | Basic chat/generation | Full reasoning, coding, agentic |
|
|
852
|
+
|
|
853
|
+
**The use cases are different:**
|
|
854
|
+
- **Qwen 3.5 0.8B**: ultra-lightweight, runs on any device, good enough for simple tasks
|
|
855
|
+
- **Nemotron Nano**: dramatically more capable, but needs ~16GB for INT4 weights alone
|
|
856
|
+
|
|
857
|
+
### No Sub-1B Nemotron Models Exist
|
|
858
|
+
|
|
859
|
+
NVIDIA has not released a Nemotron-3 model with fewer than 3B active parameters. The family starts at Nano (3.6B active / 31.6B total). There are no announced plans for smaller variants.
|
|
860
|
+
|
|
861
|
+
### Gerbil Strategy
|
|
862
|
+
|
|
863
|
+
**Keep both:**
|
|
864
|
+
- **Qwen 3.5 0.8B** for mobile/browser (our current optimized path)
|
|
865
|
+
- **Nemotron-3 Nano** for desktop with M4 Max or similar (~16GB of INT4 weights fits in 64-128GB unified memory)
|
|
866
|
+
- Watch for smaller Nemotron or community-distilled variants
|
|
867
|
+
|
|
868
|
+
### Could Nemotron Nano Run on M4 Max via Gerbil?
|
|
869
|
+
|
|
870
|
+
**Theoretically yes**, with caveats:
|
|
871
|
+
- INT4 weights: ~16 GB (fits in 128GB and 64GB configs)
|
|
872
|
+
- Active weight reads per token: ~1.8 GB of bandwidth
|
|
873
|
+
- At 546 GB/s (M4 Max): theoretical ~300 tok/s **if perfectly bandwidth-bound**
|
|
874
|
+
- Realistic with overhead: **50-150 tok/s** estimated
|
|
875
|
+
- Would need MoE expert weight management (not all 128 experts in active cache)
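The bandwidth-bound ceiling quoted above is a single division: memory bandwidth over bytes read per token. The sketch below makes that explicit and adds an `efficiency` factor as a hypothetical knob for overhead; the factor is illustrative, not a measured value.

```typescript
// Upper bound on decode throughput when every token must stream `activeGBPerToken`
// of weights from memory. `efficiency` (0..1) is a hypothetical fudge factor for
// dispatch overhead, imperfect coalescing, and similar losses.
function estimateTokPerSec(bandwidthGBs: number, activeGBPerToken: number, efficiency = 1): number {
  return (bandwidthGBs / activeGBPerToken) * efficiency;
}

const m4MaxCeiling = estimateTokPerSec(546, 1.8);      // about 303 tok/s
const m4MaxHalf = estimateTokPerSec(546, 1.8, 0.5);    // about 152 tok/s
```

The 50-150 tok/s estimate corresponds to achieving roughly 16-50% of the theoretical bandwidth.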
|
|
876
|
+
|
|
877
|
+
Compared to our current 160 tok/s on Qwen 0.8B, Nano would be slower but **dramatically more capable** — it's a reasoning model that can do multi-step coding, math, and tool use.
|
|
878
|
+
|
|
879
|
+
**Important:** MoE means **all ~31.6B weights must reside in memory** even though only ~3.6B are active per token. The router selects different experts per token unpredictably, so experts cannot be paged out ahead of time. The saving is in compute FLOPs and per-token weight reads (~1.8 GB instead of the full model), not in memory capacity, which matters because WebGPU decode is bandwidth-bound.
|
|
880
|
+
|
|
881
|
+
### Community Small Mamba-2 Models
|
|
882
|
+
|
|
883
|
+
If smaller Mamba-2 models are of future interest:
|
|
884
|
+
|
|
885
|
+
| Model | Params | Type | Notes |
|
|
886
|
+
|-------|--------|------|-------|
|
|
887
|
+
| `state-spaces/mamba2-2.7b` | 2.7B | Pure Mamba-2, dense | Base model only (no instruct) |
|
|
888
|
+
| `state-spaces/mamba2-1.3b` | 1.3B | Pure Mamba-2, dense | Base model only |
|
|
889
|
+
| `state-spaces/mamba2-780m` | 780M | Pure Mamba-2, dense | Base model only |
|
|
890
|
+
| `state-spaces/mamba2attn-2.7b` | 2.7B | Mamba-2 + Attention hybrid | Minimal documentation |
|
|
891
|
+
| `Zyphra/Zamba2-1.2B-Instruct-v2` | 1.2B | Mamba-2 + shared attention | Apache 2.0, instruct-tuned |
|
|
892
|
+
|
|
893
|
+
None of these are Nemotron-class models, but they show the Mamba-2 kernel would have utility beyond Nemotron.
|
|
894
|
+
|
|
895
|
+
---
|
|
896
|
+
|
|
897
|
+
## 12. Sources
|
|
898
|
+
|
|
899
|
+
- [NVIDIA Nemotron 3: Efficient and Open Intelligence (arxiv 2512.20856)](https://arxiv.org/abs/2512.20856)
|
|
900
|
+
- [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models (arxiv 2504.03624)](https://arxiv.org/abs/2504.03624)
|
|
901
|
+
- [Nemotron 3 Nano: Open, Efficient MoE Hybrid Mamba-Transformer (arxiv 2512.20848)](https://arxiv.org/abs/2512.20848)
|
|
902
|
+
- [Mamba-2 / SSD: Transformers are SSMs (arxiv 2405.21060)](https://arxiv.org/abs/2405.21060)
|
|
903
|
+
- [Tri Dao blog: Mamba-2 Part I - The Model](https://tridao.me/blog/2024/mamba2-part1-model/)
|
|
904
|
+
- [Tri Dao blog: Mamba-2 Part III - The Algorithm](https://tridao.me/blog/2024/mamba2-part3-algorithm/)
|
|
905
|
+
- [LatentMoE: Toward Optimal Accuracy per FLOP (arxiv 2601.18089)](https://arxiv.org/abs/2601.18089)
|
|
906
|
+
- [LatentMoE NVIDIA Research Page](https://research.nvidia.com/labs/nemotron/LatentMoE/)
|
|
907
|
+
- [Meta: Better & Faster LLMs via Multi-token Prediction (arxiv 2404.19737)](https://arxiv.org/abs/2404.19737)
|
|
908
|
+
- [Nemotron-3 Nano HuggingFace Blog](https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models)
|
|
909
|
+
- [Nemotron-3 Nano FP8 Config.json](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/blob/main/config.json)
|
|
910
|
+
- [NVIDIA Nemotron-3 Super Blog](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/)
|