@tryhamster/gerbil 1.0.0-rc.9 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (179)
  1. package/LICENSE +1 -1
  2. package/README.md +247 -84
  3. package/dist/architectures-C1I5V3Dt.mjs +6070 -0
  4. package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
  5. package/dist/browser/index.d.ts +264 -588
  6. package/dist/browser/index.d.ts.map +1 -1
  7. package/dist/browser/index.js +585 -2334
  8. package/dist/browser/index.js.map +1 -1
  9. package/dist/cli.mjs +625 -1098
  10. package/dist/cli.mjs.map +1 -1
  11. package/dist/defaults-9komdrbY.mjs +24 -0
  12. package/dist/defaults-9komdrbY.mjs.map +1 -0
  13. package/dist/frameworks/express.d.mts +1 -3
  14. package/dist/frameworks/express.d.mts.map +1 -1
  15. package/dist/frameworks/express.mjs +7 -7
  16. package/dist/frameworks/express.mjs.map +1 -1
  17. package/dist/frameworks/fastify.d.mts +1 -1
  18. package/dist/frameworks/fastify.d.mts.map +1 -1
  19. package/dist/frameworks/fastify.mjs +3 -3
  20. package/dist/frameworks/fastify.mjs.map +1 -1
  21. package/dist/frameworks/hono.d.mts +1 -1
  22. package/dist/frameworks/hono.d.mts.map +1 -1
  23. package/dist/frameworks/hono.mjs +4 -4
  24. package/dist/frameworks/hono.mjs.map +1 -1
  25. package/dist/frameworks/next.d.mts +3 -2
  26. package/dist/frameworks/next.d.mts.map +1 -1
  27. package/dist/frameworks/next.mjs +4 -4
  28. package/dist/frameworks/next.mjs.map +1 -1
  29. package/dist/frameworks/react.d.mts +1 -1
  30. package/dist/frameworks/trpc.d.mts +1 -1
  31. package/dist/frameworks/trpc.d.mts.map +1 -1
  32. package/dist/frameworks/trpc.mjs +4 -4
  33. package/dist/frameworks/trpc.mjs.map +1 -1
  34. package/dist/gerbil-BHrJJIa4.mjs +1656 -0
  35. package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
  36. package/dist/gerbil-BT9fCydo.d.mts +488 -0
  37. package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
  38. package/dist/gerbil-DomNfIr1.mjs +4 -0
  39. package/dist/gpu/hooks.d.mts +520 -0
  40. package/dist/gpu/hooks.d.mts.map +1 -0
  41. package/dist/gpu/hooks.mjs +1188 -0
  42. package/dist/gpu/hooks.mjs.map +1 -0
  43. package/dist/gpu/index.d.mts +2 -0
  44. package/dist/gpu/index.mjs +6 -0
  45. package/dist/gpu-33qCAtHW.mjs +3615 -0
  46. package/dist/gpu-33qCAtHW.mjs.map +1 -0
  47. package/dist/index-Dgmb2kE3.d.mts +245 -0
  48. package/dist/index-Dgmb2kE3.d.mts.map +1 -0
  49. package/dist/index-jEAL2s-A.d.mts +2022 -0
  50. package/dist/index-jEAL2s-A.d.mts.map +1 -0
  51. package/dist/index.d.mts +22 -487
  52. package/dist/index.d.mts.map +1 -1
  53. package/dist/index.mjs +13 -8
  54. package/dist/index.mjs.map +1 -1
  55. package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
  56. package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
  57. package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
  58. package/dist/integrations/ai-sdk.d.mts +75 -6
  59. package/dist/integrations/ai-sdk.d.mts.map +1 -1
  60. package/dist/integrations/ai-sdk.mjs +131 -15
  61. package/dist/integrations/ai-sdk.mjs.map +1 -1
  62. package/dist/integrations/langchain.d.mts +1 -1
  63. package/dist/integrations/langchain.d.mts.map +1 -1
  64. package/dist/integrations/langchain.mjs +5 -5
  65. package/dist/integrations/langchain.mjs.map +1 -1
  66. package/dist/integrations/llamaindex.d.mts +1 -1
  67. package/dist/integrations/llamaindex.d.mts.map +1 -1
  68. package/dist/integrations/llamaindex.mjs +5 -5
  69. package/dist/integrations/llamaindex.mjs.map +1 -1
  70. package/dist/integrations/mcp-client.mjs +3 -3
  71. package/dist/integrations/mcp-client.mjs.map +1 -1
  72. package/dist/integrations/mcp.d.mts +3 -2
  73. package/dist/integrations/mcp.d.mts.map +1 -1
  74. package/dist/integrations/mcp.mjs +5 -5
  75. package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
  76. package/dist/mcp-1DaMsaBc.mjs.map +1 -0
  77. package/dist/memory/index.d.mts +3 -0
  78. package/dist/memory/index.mjs +6 -0
  79. package/dist/memory-D1P7Tmda.mjs +4 -0
  80. package/dist/memory-DVN0MnIG.mjs +132 -0
  81. package/dist/memory-DVN0MnIG.mjs.map +1 -0
  82. package/dist/memory-Dj0J1v88.mjs +294 -0
  83. package/dist/memory-Dj0J1v88.mjs.map +1 -0
  84. package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
  85. package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
  86. package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
  87. package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
  88. package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
  89. package/dist/repl-jV5gcJFA.mjs +9 -0
  90. package/dist/skills/index.d.mts +270 -320
  91. package/dist/skills/index.d.mts.map +1 -1
  92. package/dist/skills/index.mjs +5 -5
  93. package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
  94. package/dist/skills-DX8D59UH.mjs.map +1 -0
  95. package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
  96. package/dist/tools-DQ1mPUw5.mjs.map +1 -0
  97. package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
  98. package/dist/types-D6FiR_oh.d.mts.map +1 -0
  99. package/dist/types-DQBe2lFo.d.mts +165 -0
  100. package/dist/types-DQBe2lFo.d.mts.map +1 -0
  101. package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
  102. package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
  103. package/dist/vector-B0panuy6.mjs +95 -0
  104. package/dist/vector-B0panuy6.mjs.map +1 -0
  105. package/docs/PROJECT-STATE.md +321 -0
  106. package/docs/adding-a-model-family.md +280 -0
  107. package/docs/ai-sdk.md +70 -61
  108. package/docs/architecture/overview.md +17 -7
  109. package/docs/browser.md +203 -8
  110. package/docs/embeddings.md +156 -0
  111. package/docs/gerbil-site-native-migration.md +217 -0
  112. package/docs/gpu-engine/architectures.md +398 -0
  113. package/docs/gpu-engine/ir.md +372 -0
  114. package/docs/gpu-engine/kernels.md +718 -0
  115. package/docs/gpu-engine/paper.html +1759 -0
  116. package/docs/gpu-engine/paper.md +2109 -0
  117. package/docs/gpu-engine/safetensors.md +312 -0
  118. package/docs/gpu-engine/tokenizer.md +302 -0
  119. package/docs/memory-rag.md +91 -0
  120. package/docs/metal-safari-intel.md +190 -0
  121. package/docs/mobile-failure-diagnosis.md +124 -0
  122. package/docs/mobile.md +99 -0
  123. package/docs/observability.md +230 -0
  124. package/docs/onnx-removal-plan.md +339 -0
  125. package/docs/research/autoresearch-portable.md +904 -0
  126. package/docs/research/dispatch-reduction-hivemind.md +84 -0
  127. package/docs/research/ios-safari-model-caching.md +117 -0
  128. package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
  129. package/docs/research/native-stt-model-selection.md +49 -0
  130. package/docs/research/native-tts-model-selection.md +90 -0
  131. package/docs/research/native-vs-chromium-decision.md +152 -0
  132. package/docs/research/nemotron-mamba2-inference.md +910 -0
  133. package/docs/research/qwen35-multimodal.md +293 -0
  134. package/docs/research/qwen36-gemma4-targets.md +337 -0
  135. package/docs/research/sota-embedding-models.md +179 -0
  136. package/docs/research/sota-mobile-models-2026.md +263 -0
  137. package/docs/research/sota-modality-models.md +202 -0
  138. package/docs/research/tps-baselines.md +71 -0
  139. package/docs/research/webgpu-m4-reference.md +104 -0
  140. package/docs/site-update-plan.md +155 -0
  141. package/docs/structured-output.md +123 -0
  142. package/docs/stt.md +63 -446
  143. package/docs/tts.md +77 -499
  144. package/docs/vision.md +100 -338
  145. package/package.json +22 -7
  146. package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
  147. package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
  148. package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
  149. package/dist/gerbil-CJ3ifloF.mjs +0 -4
  150. package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
  151. package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
  152. package/dist/gerbil-qOTe1nl2.d.mts +0 -431
  153. package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
  154. package/dist/kokoro-BNTb6egA.mjs +0 -20210
  155. package/dist/kokoro-BNTb6egA.mjs.map +0 -1
  156. package/dist/kokoro-CMOGDSgT.js +0 -20212
  157. package/dist/kokoro-CMOGDSgT.js.map +0 -1
  158. package/dist/mcp-BvbriaBy.mjs.map +0 -1
  159. package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
  160. package/dist/repl-DveXw36T.mjs +0 -9
  161. package/dist/skills-CD3Orlex.mjs.map +0 -1
  162. package/dist/stt-Bu-E23Sc.js +0 -433
  163. package/dist/stt-Bu-E23Sc.js.map +0 -1
  164. package/dist/stt-CpLYbGFd.mjs +0 -433
  165. package/dist/stt-CpLYbGFd.mjs.map +0 -1
  166. package/dist/stt-DRPLEEHB.mjs +0 -3
  167. package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
  168. package/dist/transformers.web-DiD1gTwk.js +0 -44695
  169. package/dist/transformers.web-DiD1gTwk.js.map +0 -1
  170. package/dist/transformers.web-u34VxRFM.js +0 -3
  171. package/dist/tts-CqroPaSK.js +0 -724
  172. package/dist/tts-CqroPaSK.js.map +0 -1
  173. package/dist/tts-DXgsKGCe.mjs +0 -3
  174. package/dist/tts-DeGANMNV.mjs +0 -730
  175. package/dist/tts-DeGANMNV.mjs.map +0 -1
  176. package/dist/types-CiTc7ez3.d.mts.map +0 -1
  177. /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
  178. /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
  179. /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,910 @@
1
+ # Nemotron-3 & Mamba-2: Inference Engine Research
2
+
3
+ **Research for Gerbil WebGPU inference engine optimization.**
4
+ **Date: March 2026**
5
+
6
+ ---
7
+
8
+ ## Table of Contents
9
+
10
+ 1. [Executive Summary](#1-executive-summary)
11
+ 2. [Nemotron-3 Model Family](#2-nemotron-3-model-family)
12
+ 3. [Hybrid Mamba-2 + Transformer Architecture](#3-hybrid-mamba-2--transformer-architecture)
13
+ 4. [Mamba-2 / SSD Deep Dive](#4-mamba-2--ssd-deep-dive)
14
+ 5. [Mixture of Experts (MoE)](#5-mixture-of-experts-moe)
15
+ 6. [LatentMoE](#6-latentmoe)
16
+ 7. [Multi-Token Prediction (MTP)](#7-multi-token-prediction-mtp)
17
+ 8. [Quantization: NVFP4 vs INT4](#8-quantization-nvfp4-vs-int4)
18
+ 9. [Nemotron-3 Nano 30B-A3B Config Reference](#9-nemotron-3-nano-30b-a3b-config-reference)
19
+ 10. [Gerbil Implications & Implementation Roadmap](#10-gerbil-implications--implementation-roadmap)
20
+ 11. [Small Model Viability: Nano vs Qwen 3.5 0.8B](#11-small-model-viability-nano-vs-qwen-35-08b)
21
+ 12. [Sources](#12-sources)
22
+
23
+ ---
24
+
25
+ ## 1. Executive Summary
26
+
27
+ NVIDIA's Nemotron-3 family represents a paradigm shift for local inference: **hybrid Mamba-2 + Transformer + MoE** models that are dramatically faster than pure Transformers while matching or exceeding their accuracy. The key innovations relevant to Gerbil:
28
+
29
+ | Innovation | What It Does | Gerbil Impact |
30
+ |-----------|-------------|---------------|
31
+ | **Mamba-2 layers** | Replace ~88-92% of attention layers with constant-memory SSM | Confines KV cache growth to the few remaining attention layers, unlocks long context on mobile |
32
+ | **LatentMoE** | 4x bandwidth reduction in expert routing | Direct throughput multiplier on bandwidth-bound M4 Max |
33
+ | **MTP** | Native speculative decoding (97% accept rate) | ~2x decode throughput at batch-1, no draft model needed |
34
+ | **NVFP4** | 4-bit floating point with micro-block scaling | Better accuracy/size than our INT4 per-group quantization |
35
+ | **Open weights** | Apache 2.0 / nvidia-open-model-license | Can ship Nemotron models directly in Gerbil |
36
+
37
+ The smallest model, **Nemotron-3 Nano** (30B total / 3B active), is the primary target for Gerbil. No sub-1B Nemotron model exists yet, so Qwen 3.5 0.8B remains our small-model option. But Nano's 3B active params with MoE sparsity could be feasible on M4 Max (128GB unified memory) and represents a massive capability upgrade.
38
+
39
+ ---
40
+
41
+ ## 2. Nemotron-3 Model Family
42
+
43
+ | Model | Total Params | Active Params | Architecture | License |
44
+ |-------|-------------|---------------|-------------|---------|
45
+ | **Nano** | 31.6B | 3.6B (w/ embeddings) | Hybrid Mamba-2 + MoE | nvidia-open-model-license |
46
+ | **Super** | ~120B | ~12B | Hybrid Mamba-2 + LatentMoE | nvidia-open-model-license |
47
+ | **Ultra** | ~500B | ~50B | Hybrid Mamba-2 + LatentMoE | nvidia-open-model-license |
48
+
49
+ ### Nano Performance Claims
50
+ - **3.3x faster** than Qwen3-30B-A3B on H200 (8k input / 16k output)
51
+ - **2.2x faster** than GPT-OSS-20B
52
+ - Trained on **25 trillion tokens**
53
+ - Supports **1M token context**
54
+ - Best-in-class on reasoning, coding, math, agentic tasks in its size class
55
+
56
+ ### Quantization Variants Available
57
+ - **BF16** (full precision, official)
58
+ - **FP8** (official)
59
+ - **NVFP4** (official)
60
+ - **GGUF** variants (community: IQ4_XS, MXFP4_MOE via Unsloth)
61
+
62
+ ---
63
+
64
+ ## 3. Hybrid Mamba-2 + Transformer Architecture
65
+
66
+ ### The Core Insight
67
+
68
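+ Pure Transformers have a KV cache and per-token attention compute that grow **linearly** with sequence length during decode. Nemotron keeps attention in only ~8-12% of layers (~8% in Nemotron-H, 6 of 52 layers in Nano) and relies on Mamba-2, which has **constant** memory and compute per generated token, for the rest of the sequence mixing.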
69
+
70
+ ### Layer Pattern (Nemotron-3 Nano)
71
+
72
+ From the `config.json`:
73
+ ```
74
+ hybrid_override_pattern: "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME"
75
+ ```
76
+
77
+ Where:
78
+ - **M** = Mamba-2 layer
79
+ - **E** = MoE (expert) layer (replaces standard FFN)
80
+ - **\*** = Attention layer (grouped-query attention with 2 KV heads)
81
+
82
+ **52 total layers:**
83
+ - **23 Mamba-2 layers** (constant-memory decode, no KV cache)
84
+ - **23 MoE layers** (128 experts, top-6 routing + 1 shared expert)
85
+ - **6 Attention layers** (GQA with 32 heads, 2 KV groups — these need KV cache)
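The counts above can be checked directly against the pattern string. The following is a minimal sketch; `summarize_pattern` and `LAYER_NAMES` are hypothetical helper names introduced for illustration, not part of Gerbil or of the model's code.

```python
from collections import Counter

# Pattern string copied from the config.json excerpt above.
PATTERN = "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME"

# Human-readable names for the three layer codes.
LAYER_NAMES = {"M": "mamba2", "E": "moe", "*": "attention"}


def summarize_pattern(pattern: str) -> dict:
    """Count layer types and list the indices of the attention layers."""
    counts = Counter(LAYER_NAMES[ch] for ch in pattern)
    attention_positions = [i for i, ch in enumerate(pattern) if ch == "*"]
    return {
        "total": len(pattern),
        "counts": dict(counts),
        "attention_positions": attention_positions,
        "attention_fraction": counts["attention"] / len(pattern),
    }


summary = summarize_pattern(PATTERN)
print(summary)
```

The attention positions are the only layers for which a loader would need to allocate a KV cache; every `M` position needs a fixed-size SSM state instead.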
86
+
87
+ ### Why This Matters for Gerbil
88
+
89
+ Only 6 out of 52 layers need KV cache. For Gerbil on mobile:
90
+ - KV cache memory drops by **~88%** vs pure Transformer
91
+ - Long context becomes feasible within iOS/mobile memory budgets
92
+ - Mamba-2 decode is a fixed-size state update plus a matrix-vector readout per head (no softmax, no causal masking, no KV append)
93
+
94
+ ### Nemotron-H Architecture Search Results
95
+
96
+ From the Nemotron-H paper, the optimal ratio is **~8% attention layers** across all model sizes:
97
+ - Nemotron-H-8B: 4 attention out of 52 layers
98
+ - Nemotron-H-56B: 10 attention out of 118 layers
99
+ - Nemotron-3 Nano 30B: 6 attention out of 52 layers (slightly higher, ~11.5%)
100
+
101
+ First layer is always Mamba-2, last layer is always FFN/MoE. Attention layers are evenly dispersed throughout.
102
+
103
+ ### Inference Throughput (Nemotron-H benchmarks, H100)
104
+
105
+ | Model | vs Transformer Baseline | Context |
106
+ |-------|------------------------|---------|
107
+ | Nemotron-H-8B | **1.8-3x faster** than Qwen-2.5-7B/Llama-3.1-8B | 65k input |
108
+ | Nemotron-H-56B | **2.4x faster** than Qwen-2.5-72B/Llama-3.1-70B | 65k input |
109
+ | Nemotron-H-56B | **19.6x faster** than Llama-3.1-405B | 65k input |
110
+
111
+ ---
112
+
113
+ ## 4. Mamba-2 / SSD Deep Dive
114
+
115
+ ### 4.1 SSM Equations
116
+
117
+ The Selective State Space Model at the core of Mamba-2:
118
+
119
+ ```
120
+ State update: h_t = a_t · h_{t-1} + x_t · B_t^T
121
+ Output: y_t = h_t · C_t
122
+ ```
123
+
124
+ Where per timestep t:
125
+ - `x_t ∈ ℝ^P` — input (P = head_dim, e.g. 64)
126
+ - `h_t ∈ ℝ^(P × N)` — hidden state (N = state_size, e.g. 128)
127
+ - `y_t ∈ ℝ^P` — output
128
+ - `a_t ∈ ℝ` — **scalar** decay (Mamba-2 key simplification)
129
+ - `B_t ∈ ℝ^N` — input projection (input-dependent)
130
+ - `C_t ∈ ℝ^N` — output projection (input-dependent)
131
+
132
+ ### 4.2 Mamba-2 vs Mamba-1
133
+
134
+ | Aspect | Mamba-1 (S6) | Mamba-2 (SSD) |
135
+ |--------|-------------|---------------|
136
+ | A matrix | Diagonal `(N,)` per head | **Scalar** per head (scalar × identity) |
137
+ | State size N | 16 | **64-256** (much larger) |
138
+ | Training algorithm | Custom CUDA scan | **Matmul-based** (tensor core friendly) |
139
+ | Training speed | Baseline | **2-8x faster** |
140
+ | B, C generation | Sequential (after conv) | **Parallel** (with X) |
141
+
142
+ The critical insight: by restricting A to a scalar (instead of diagonal), the recurrence dynamics are shared across all N state dimensions. This means the state update can be expressed as matrix multiplications, enabling use of tensor cores during training.
143
+
144
+ ### 4.3 The SSD (Structured State Space Duality)
145
+
146
+ The SSM recurrence can be equivalently written as a matrix multiplication:
147
+
148
+ ```
149
+ Y = M · X
150
+
151
+ where M = L ⊙ (C · B^T) [semiseparable matrix]
152
+
153
+ L[i,j] = a_i · a_{i-1} · ... · a_{j+1} for i > j
154
+ L[i,i] = 1
155
+ L[i,j] = 0 for i < j
156
+ ```
157
+
158
+ This is structurally identical to **causal linear attention** when all `a_t = 1` (L becomes the causal mask). The scalar `a_t` values act as input-dependent relative positional encodings / decay factors.
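The equivalence between the recurrence and the matrix form can be checked numerically on a tiny example. This is a sketch in plain Python (no tensor library), with made-up sizes `T=6, P=2, N=3` and random inputs; the function names are illustrative only.

```python
import random


def ssm_recurrent(a, B, C, X):
    """Run h_t = a_t * h_{t-1} + x_t B_t^T, y_t = h_t C_t one step at a time."""
    P, N = len(X[0]), len(B[0])
    h = [[0.0] * N for _ in range(P)]  # hidden state, shape (P, N)
    Y = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, X):
        for p in range(P):
            for n in range(N):
                h[p][n] = a_t * h[p][n] + x_t[p] * B_t[n]
        Y.append([sum(h[p][n] * C_t[n] for n in range(N)) for p in range(P)])
    return Y


def ssm_matrix(a, B, C, X):
    """Compute Y = M X with M[i][j] = L[i][j] * <C_i, B_j> (semiseparable form)."""
    T, P = len(X), len(X[0])
    Y = [[0.0] * P for _ in range(T)]
    for i in range(T):
        for j in range(i + 1):
            decay = 1.0  # L[i][j] = a_i * ... * a_{j+1}; empty product when i == j
            for k in range(j + 1, i + 1):
                decay *= a[k]
            m_ij = decay * sum(c * b for c, b in zip(C[i], B[j]))
            for p in range(P):
                Y[i][p] += m_ij * X[j][p]
    return Y


rng = random.Random(0)
T, P, N = 6, 2, 3
a = [rng.uniform(0.5, 1.0) for _ in range(T)]
B = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(T)]
C = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(T)]
X = [[rng.gauss(0, 1) for _ in range(P)] for _ in range(T)]

Y_rec = ssm_recurrent(a, B, C, X)
Y_mat = ssm_matrix(a, B, C, X)
max_diff = max(abs(r - m) for yr, ym in zip(Y_rec, Y_mat) for r, m in zip(yr, ym))
print(f"max |recurrent - matrix| = {max_diff:.2e}")
```

The two paths agree up to floating-point rounding. The recurrent form is what decode uses; the matrix form is what makes matmul-based prefill possible.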
159
+
160
+ ### 4.4 Decode Mode (Single Token — Gerbil Hot Path)
161
+
162
+ For autoregressive generation, Mamba-2 is a simple recurrence:
163
+
164
+ ```python
165
+ # Per token, per head:
166
+ h = a_t * h + outer(x_t, B_t) # state update: (P, N) = scalar * (P, N) + (P,) ⊗ (N,)
167
+ y = h @ C_t # output: (P,) = (P, N) @ (N,)
168
+ ```
169
+
170
+ **Per-token cost per head:**
171
+ - 1 scalar multiply of state: `P × N` multiplies
172
+ - 1 outer product + add: `P × N` multiply-adds
173
+ - 1 matrix-vector product: `P × N` multiply-adds
174
+ - **Total: ~3PN FLOPs per head**
175
+
176
+ **For Nemotron-3 Nano** (64 heads, P=64, N=128):
177
+ - Per Mamba layer: `64 × 3 × 64 × 128 = 1.57M FLOPs`
178
+ - Compare to attention: KV cache read + softmax + output = much more for long sequences
179
+
180
+ **State memory per Mamba layer:**
181
+ - `num_heads × head_dim × state_size = 64 × 64 × 128 = 524,288` values
182
+ - At f32: **2MB per layer**, at f16: **1MB per layer**
183
+ - For 23 Mamba layers: **~23MB total at f16** (~46MB at f32), constant regardless of sequence length
184
+
185
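+ Compare to the attention KV cache for 1024 tokens. GQA stores only the 2 KV heads, so: `6 layers × 2(K+V) × 2 KV heads × 128 dim × 1024 tokens × 2 bytes = ~6MB`, growing linearly at about 6KB per token (~1.5GB at a 256K context). A full multi-head cache with 32 KV heads would be 16x larger (~100MB at 1024 tokens).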
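The memory arithmetic in this section can be reproduced with a few lines. This sketch uses the config values quoted in this document and counts only the 2 KV heads that grouped-query attention actually caches (`num_key_value_heads`); counting all 32 query heads instead gives a figure 16x larger. The function names are illustrative.

```python
# Values from the Nemotron-3 Nano config quoted in this document.
MAMBA_LAYERS = 23
MAMBA_HEADS = 64
MAMBA_HEAD_DIM = 64   # P
SSM_STATE_SIZE = 128  # N
ATTN_LAYERS = 6
KV_HEADS = 2
ATTN_HEAD_DIM = 128


def mamba_state_bytes(bytes_per_value: int) -> int:
    """Total SSM state across all Mamba-2 layers; independent of sequence length."""
    per_layer = MAMBA_HEADS * MAMBA_HEAD_DIM * SSM_STATE_SIZE
    return MAMBA_LAYERS * per_layer * bytes_per_value


def kv_cache_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """K and V for every cached token in every attention layer."""
    per_token = ATTN_LAYERS * 2 * KV_HEADS * ATTN_HEAD_DIM * bytes_per_value
    return per_token * tokens


def decode_flops_per_mamba_layer() -> int:
    """About 3PN multiply-adds per head: decay, outer-product add, readout."""
    return MAMBA_HEADS * 3 * MAMBA_HEAD_DIM * SSM_STATE_SIZE


MIB = 1024 * 1024
print(f"Mamba state (f16): {mamba_state_bytes(2) / MIB:.0f} MiB")
print(f"Mamba state (f32): {mamba_state_bytes(4) / MIB:.0f} MiB")
for tokens in (1024, 8192, 262144):
    print(f"KV cache @ {tokens} tokens: {kv_cache_bytes(tokens) / MIB:.0f} MiB")
print(f"SSM update FLOPs per layer: {decode_flops_per_mamba_layer():,}")
```

Under these assumptions the KV cache overtakes the f16 Mamba state at roughly 4K tokens, which is where the constant-size state starts to pay off.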
186
+
187
+ ### 4.5 Prefill Mode (Parallel — Processing Prompt)
188
+
189
+ The chunked SSD algorithm processes the prompt in parallel using 4 steps:
190
+
191
+ ```
192
+ Given: X (T, P), A (T,), B (T, N), C (T, N)
193
+ Chunk into blocks of size Q (default 64-128):
194
+ X_chunks: (T/Q, Q, P)
195
+ A_chunks: (T/Q, Q), etc.
196
+
197
+ Step 1 — Intra-chunk (parallel, uses matmuls):
198
+ Y_diag = einsum("bclhn, bcshn, bhcls, bcshp -> bclhp", C, B, L, X)
199
+ # L is the Q×Q semiseparable mask from cumsum of A within each chunk
200
+
201
+ Step 2 — Chunk states (parallel):
202
+ states = einsum("bclhn, bhcl, bclhp -> bchpn", B, decay, X)
203
+ # Final state of each chunk assuming zero initial state
204
+
205
+ Step 3 — Inter-chunk recurrence (sequential, but over T/Q chunks only):
206
+ new_states[c] = decay_chunk[c] * new_states[c-1] + states[c]
207
+ # Simple scan over reduced sequence length T/Q
208
+
209
+ Step 4 — Output correction (parallel):
210
+ Y_off = einsum("bclhn, bchpn, bhcl -> bclhp", C, states, state_decay)
211
+ # Add contribution from initial states to each chunk's output
212
+
213
+ Final: Y = Y_diag + Y_off
214
+ ```
215
+
216
+ **Key property:** Steps 1, 2, 4 are all matmuls (tensor core / GPU friendly). Step 3 is a short sequential scan over T/Q elements (e.g., for 2048 tokens with Q=128, only 16 sequential steps).
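To make the four steps concrete, here is a deliberately simplified sketch with a scalar state (P = N = 1) in plain Python. It is an assumption-laden illustration, not the batched einsum version: the intra-chunk part is written as a loop per chunk (each chunk is independent, so it could run in parallel), and the sketch requires `T` to be a multiple of the chunk size.

```python
import random


def sequential_scan(a, b, c, x):
    """Reference: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, y = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        y.append(c_t * h)
    return y


def chunked_scan(a, b, c, x, chunk):
    """Same result via the four chunked steps (scalar state for readability)."""
    T = len(x)
    assert T % chunk == 0, "sketch assumes T is a multiple of the chunk size"
    starts = range(0, T, chunk)

    y_diag = [0.0] * T     # step 1: intra-chunk outputs (zero initial state)
    chunk_state = []       # step 2: final state of each chunk (zero initial state)
    chunk_decay = []       # total decay across each chunk
    cum_decay = [0.0] * T  # decay from chunk start through position t (inclusive)
    for s in starts:
        h, decay = 0.0, 1.0
        for t in range(s, s + chunk):
            h = a[t] * h + b[t] * x[t]
            decay *= a[t]
            y_diag[t] = c[t] * h
            cum_decay[t] = decay
        chunk_state.append(h)
        chunk_decay.append(decay)

    # Step 3: sequential recurrence over the T/chunk chunk boundaries only.
    carried, h_in = [], 0.0
    for state, decay in zip(chunk_state, chunk_decay):
        carried.append(h_in)  # state entering this chunk
        h_in = decay * h_in + state

    # Step 4: add the contribution of the carried-in state to every position.
    y = []
    for idx, s in enumerate(starts):
        for t in range(s, s + chunk):
            y.append(y_diag[t] + c[t] * cum_decay[t] * carried[idx])
    return y


rng = random.Random(1)
T, Q = 16, 4
a = [rng.uniform(0.5, 1.0) for _ in range(T)]
b, c, x = ([rng.gauss(0, 1) for _ in range(T)] for _ in range(3))
y_ref = sequential_scan(a, b, c, x)
y_chunked = chunked_scan(a, b, c, x, Q)
max_err = max(abs(p - q) for p, q in zip(y_ref, y_chunked))
print(f"max error: {max_err:.2e}")
```

Only the loop in step 3 carries a dependency between chunks, and it runs `T / Q` times (4 here) instead of `T` times.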
217
+
218
+ ### 4.6 Mamba-2 Block Architecture
219
+
220
+ ```
221
+ Input x (d_model)
222
+
223
+ ├──→ Linear projection → (expand * d_model) ──→ split into:
224
+ │ │ │
225
+ │ ├── z (gate, expand * d_model) │
226
+ │ └── x' (SSM input, expand * d_model) │
227
+ │ │ │
228
+ │ Conv1d(kernel=4, causal) │
229
+ │ │ │
230
+ │ SiLU(x') │
231
+ │ │ │
232
+ │ ┌── B_t = Linear(x') (N per head) │
233
+ │ ├── C_t = Linear(x') (N per head) │
234
+ │ ├── dt = Linear(x') → softplus → a_t │
235
+ │ │ │
236
+ │ └──── SSM(x', a_t, B_t, C_t) ──→ y │
237
+ │ │ │
238
+ │ y * SiLU(z) │ (gating)
239
+ │ │ │
240
+ │ Linear(out) ──→ output (d_model)
241
+ ```
242
+
243
+ **Nemotron-3 Nano Mamba-2 config:**
244
+ - `expand = 2` → inner dim = 2 × 2688 = 5376
245
+ - `mamba_head_dim = 64` (P)
246
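+ - `mamba_num_heads = 64` → 64 × 64 = 4096, which disagrees with expand × hidden = 5376; the state-size figures in 4.4 assume 4096 and the projection sizes in 4.9 assume 5376, so the true inner dim needs verification against the model code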
247
+ - `ssm_state_size = 128` (N)
248
+ - `conv_kernel = 4`
249
+ - `n_groups = 8` (Mamba groups, likely for B/C sharing)
250
+ - `chunk_size = 128` (Q for chunked prefill)
251
+
252
+ ### 4.7 Weight Tensors for Mamba-2 Layer
253
+
254
+ Expected HuggingFace weight names per layer:
255
+ ```
256
+ model.layers.{i}.mamba.in_proj.weight # (expand*d + expand*d, d_model) or split
257
+ model.layers.{i}.mamba.conv1d.weight # (inner_dim, 1, conv_kernel)
258
+ model.layers.{i}.mamba.conv1d.bias # (inner_dim,)
259
+ model.layers.{i}.mamba.out_proj.weight # (d_model, inner_dim)
260
+ model.layers.{i}.mamba.dt_bias # (num_heads,) — added to dt before softplus
261
+ model.layers.{i}.mamba.A_log # (num_heads,) — log-space A parameter
262
+ model.layers.{i}.mamba.D # (num_heads,) — skip connection scalar
263
+ model.layers.{i}.mamba.norm.weight # (inner_dim,) — RMSNorm before output
264
+ ```
265
+
266
+ Note: `A_log` is stored in log-space for numerical stability. At runtime: `A = -exp(A_log)` (negative to ensure decay).
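A small numeric check shows why this parameterization guarantees a decay in (0, 1). The sketch below combines `A = -exp(A_log)` with the `a = exp(dt * A)` discretization and the softplus on `dt` used in the decode pseudocode of this document; `decay_from_params` and `dt_raw` are hypothetical names for illustration.

```python
import math


def softplus(v: float) -> float:
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def decay_from_params(a_log: float, dt_raw: float, dt_bias: float) -> float:
    """Per-head decay a = exp(dt * A), A = -exp(A_log), dt = softplus(dt_raw + dt_bias)."""
    A = -math.exp(a_log)             # strictly negative
    dt = softplus(dt_raw + dt_bias)  # strictly positive step size
    return math.exp(dt * A)          # therefore 0 < a < 1


decays = []
for a_log, dt_raw in [(-2.0, -3.0), (0.0, 0.0), (2.0, 3.0)]:
    a = decay_from_params(a_log, dt_raw, dt_bias=0.0)
    decays.append(a)
    print(f"A_log={a_log:+.1f} dt_raw={dt_raw:+.1f} -> a={a:.6f}")
```

Small `dt` or very negative `A_log` gives a decay close to 1 (long memory); large values push it toward 0 (fast forgetting).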
267
+
268
+ ### 4.8 Decode Path Implementation Pseudocode
269
+
270
+ ```python
271
+ def mamba2_decode_step(x, layer, state):
272
+ """Single token decode for one Mamba-2 layer.
273
+
274
+ x: (d_model,) — input token embedding
275
+ state.h: (num_heads, head_dim, state_size) — SSM hidden state
276
+ state.conv: (inner_dim, conv_kernel-1) — conv1d rolling buffer
277
+ """
278
+ # 1. Input projection
279
+ xz = layer.in_proj @ x # (2 * inner_dim,)
280
+ x_inner, z = split(xz, inner_dim) # each (inner_dim,)
281
+
282
+ # 2. Causal conv1d (shift buffer, apply kernel)
283
+ state.conv = roll_left(state.conv)
284
+ state.conv[:, -1] = x_inner
285
+ x_conv = sum(state.conv * layer.conv1d_weight, dim=-1) + layer.conv1d_bias
286
+
287
+ # 3. SiLU activation
288
+ x_act = x_conv * sigmoid(x_conv) # SiLU = x * sigmoid(x)
289
+
290
+ # 4. Generate B, C, dt from activated input
291
+ # (These come from the in_proj output, exact split depends on implementation)
292
+ B = ... # (num_heads, state_size)
293
+ C = ... # (num_heads, state_size)
294
+ dt = softplus(dt_raw + layer.dt_bias) # (num_heads,); dt_raw is the dt slice of the in_proj output (no dt_proj weight is listed in 4.7)
295
+
296
+ # 5. Discretize A
297
+ A = -exp(layer.A_log) # (num_heads,) — negative real
298
+ a = exp(dt * A) # (num_heads,) — decay per head
299
+
300
+ # 6. SSM state update (THE HOT LOOP)
301
+ # state.h shape: (num_heads, head_dim, state_size)
302
+ # a shape: (num_heads,) — broadcast over head_dim and state_size
303
+ x_heads = reshape(x_act, (num_heads, head_dim)) # (num_heads, head_dim)
304
+
305
+ for each head h:
306
+ state.h[h] = a[h] * state.h[h] + outer(x_heads[h], B[h])
307
+ # (head_dim, state_size) = scalar * (head_dim, state_size) + (head_dim,1) @ (1, state_size)
308
+
309
+ # 7. SSM output
310
+ y_heads = einsum("hpn, hn -> hp", state.h, C) # (num_heads, head_dim)
311
+ y = reshape(y_heads, (inner_dim,))
312
+
313
+ # 8. Add D skip connection
314
+ y = y + reshape(layer.D, ...) * x_act
315
+
316
+ # 9. RMSNorm
317
+ y = layer.norm(y)
318
+
319
+ # 10. Gate and output
320
+ y = y * (z * sigmoid(z)) # gate with SiLU(z)
321
+ output = layer.out_proj @ y # (d_model,)
322
+
323
+ return output, state
324
+ ```
325
+
326
+ ### 4.9 Memory Bandwidth Analysis (Decode)
327
+
328
+ For one Mamba-2 layer in Nemotron Nano, per token:
329
+
330
+ **Weights to read:**
331
+ - `in_proj`: 2688 × (2 × 5376) × 2 bytes = ~57.8 MB (at f16; the output holds both `x'` and `z`, see 4.7)
332
+ - `conv1d`: 5376 × 4 × 2 = ~43 KB
333
+ - `out_proj`: 5376 × 2688 × 2 = ~28.9 MB
334
+ - Other (dt_bias, A_log, D, norm): negligible
335
+ - **Total weights: ~87 MB per Mamba layer** (less if the inner dim is 4096 rather than 5376)
336
+
337
+ **State to read/write:**
338
+ - SSM state: 64 × 64 × 128 × 4 = 2 MB (at f32)
339
+ - Conv buffer: 5376 × 3 × 2 = ~32 KB
340
+ - **Total state: ~2 MB per layer**
341
+
342
+ **Comparison with attention layer:**
343
+ - Attention weights: QKV proj + output proj ≈ similar to Mamba projections
344
+ - **Plus KV cache read**: grows with sequence length (at 1024 tokens: ~1 MB per layer per read)
345
+ - At long sequences, KV cache read dominates — Mamba avoids this entirely
346
+
347
+ ---
348
+
349
+ ## 5. Mixture of Experts (MoE)
350
+
351
+ ### Nemotron-3 Nano MoE Configuration
352
+
353
+ ```json
354
+ {
355
+ "n_routed_experts": 128,
356
+ "num_experts_per_tok": 6,
357
+ "n_shared_experts": 1,
358
+ "moe_intermediate_size": 1856,
359
+ "moe_shared_expert_intermediate_size": 3712,
360
+ "routed_scaling_factor": 2.5,
361
+ "norm_topk_prob": true,
362
+ "topk_group": 1
363
+ }
364
+ ```
365
+
366
+ ### How MoE Works at Inference
367
+
368
+ For each token:
369
+ 1. Router (small MLP) computes scores over all 128 experts
370
+ 2. Top-6 experts selected
371
+ 3. Each selected expert processes the token independently
372
+ 4. Outputs are weighted-summed by router scores
373
+ 5. Shared expert (always active) output is added
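The five steps above can be sketched as follows. This is a toy illustration in plain Python: softmax scoring, renormalization over the selected experts, and the way `routed_scaling_factor` is applied are assumptions here (this document does not specify the router's scoring function), and the "experts" are stand-in scaling functions rather than real FFNs.

```python
import math
import random


def route_token(logits, top_k, scaling=1.0):
    """Pick top_k experts and return (expert_index, weight) pairs.

    Weights are softmax probabilities renormalized over the selected
    experts and multiplied by `scaling` (assumed convention).
    """
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, scaling * probs[i] / norm) for i in chosen]


def moe_forward(x, logits, experts, shared_expert, top_k, scaling=1.0):
    """Weighted sum of the selected experts' outputs plus the shared expert."""
    out = list(shared_expert(x))
    for idx, weight in route_token(logits, top_k, scaling):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


rng = random.Random(0)
num_experts, top_k = 128, 6
logits = [rng.gauss(0, 1) for _ in range(num_experts)]
# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * e for e in v] for i in range(num_experts)]
shared = lambda v: [0.5 * e for e in v]

routes = route_token(logits, top_k, scaling=2.5)
print([i for i, _ in routes], sum(w for _, w in routes))
y = moe_forward([1.0, -2.0], logits, experts, shared, top_k, scaling=2.5)
```

Only the six selected experts are ever called, which is why the per-token weight traffic depends on `top_k` rather than on the total expert count.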
374
+
375
+ **Active parameters per token:** 6 × (2688 × 1856 × 2) + 1 × (2688 × 3712 × 2) ≈ **80M params** (per MoE layer)
376
+
377
+ ### Memory Implications for Gerbil
378
+
379
+ All 128 expert weights must be in GPU memory, but only 6 are read per token:
380
+ - Per MoE layer total weights: 128 × 2 × 2688 × 1856 × 2 bytes ≈ **2.5 GB** (at f16)
381
+ - Per MoE layer per-token read: 6 × 2 × 2688 × 1856 × 2 ≈ **120 MB**
382
+ - Plus shared expert: 2 × 2688 × 3712 × 2 ≈ **40 MB**
383
+ - **Total per MoE layer per token: ~160 MB bandwidth**
384
+
385
+ With 23 MoE layers: **~3.7 GB bandwidth per token** — this is the bottleneck.
386
+
387
+ At M4 Max 546 GB/s: theoretical minimum **~6.8ms per token** just for MoE weight reads (assuming perfect bandwidth utilization).
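The bandwidth floor can be recomputed from the config values, which also makes it easy to see how a 4-bit format would move it. A minimal sketch follows, assuming two weight matrices per expert (up and down projection) as in the arithmetic above and ignoring quantization scale overhead in the 4-bit case.

```python
HIDDEN = 2688
MOE_INTERMEDIATE = 1856
SHARED_INTERMEDIATE = 3712
ROUTED_EXPERTS = 128
ACTIVE_EXPERTS = 6
MOE_LAYERS = 23
BYTES_F16 = 2
M4_MAX_BANDWIDTH = 546e9  # bytes/s, figure quoted above


def expert_bytes(intermediate: int, bytes_per_weight: float = BYTES_F16) -> float:
    """Up- and down-projection of one expert (two d x m matrices)."""
    return 2 * HIDDEN * intermediate * bytes_per_weight


def per_token_read_bytes(bytes_per_weight: float = BYTES_F16) -> float:
    """Weights touched per token across all MoE layers."""
    per_layer = (ACTIVE_EXPERTS * expert_bytes(MOE_INTERMEDIATE, bytes_per_weight)
                 + expert_bytes(SHARED_INTERMEDIATE, bytes_per_weight))
    return MOE_LAYERS * per_layer


resident = ROUTED_EXPERTS * expert_bytes(MOE_INTERMEDIATE)
read_f16 = per_token_read_bytes()
floor_ms_f16 = read_f16 / M4_MAX_BANDWIDTH * 1e3
floor_ms_4bit = per_token_read_bytes(0.5) / M4_MAX_BANDWIDTH * 1e3
print(f"resident routed weights per layer: {resident / 1e9:.2f} GB")
print(f"read per token (f16): {read_f16 / 1e9:.2f} GB")
print(f"bandwidth floor (f16): {floor_ms_f16:.2f} ms/token")
print(f"bandwidth floor (4-bit, scales ignored): {floor_ms_4bit:.2f} ms/token")
```

These are lower bounds for the MoE layers alone; Mamba and attention projections add their own reads on top.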
388
+
389
+ ---
390
+
391
+ ## 6. LatentMoE
392
+
393
+ ### Architecture
394
+
395
+ LatentMoE reduces the memory bandwidth of MoE by projecting tokens to a smaller latent space before expert routing:
396
+
397
+ ```
398
+ Standard MoE:
399
+ x (d) → Router → Expert_i(x) (d→m→d) → weighted sum
400
+
401
+ LatentMoE:
402
+ x (d) → Router (in d-space)
403
+ → W_down @ x (d→ℓ)
404
+ → Expert_i(x_latent) (ℓ→m→ℓ)
405
+ → W_up @ y_latent (ℓ→d)
406
+ → weighted sum
407
+ ```
408
+
409
+ ### Key Properties
410
+
411
+ - **Compression ratio α = d/ℓ = 4** (typical, validated up to 4x without quality loss)
412
+ - **Router stays in d-space** — routing decisions use full-dimensional information
413
+ - **Experts operate in ℓ-space** — 4x smaller input/output dimensions
414
+ - **Expert count scales up**: N′ = α × N (e.g., 128 → 512 experts)
415
+ - **Active experts scale up**: K′ = α × K (e.g., 6 → 24 active)
416
+ - **Net effect**: same compute, same params, but 4x less bandwidth per expert
417
+
418
+ ### Bandwidth Savings
419
+
420
+ | Metric | Standard MoE | LatentMoE (α=4) |
421
+ |--------|-------------|-----------------|
422
+ | Expert weight read per token | `K × 2 × d × m` | `K' × 2 × ℓ × m` (same total) |
423
+ | All-to-all communication | `K × d` per token | `K × ℓ` per token (**4x less**) |
424
+ | Routing payload | `d` per token per expert | `ℓ` per token per expert (**4x less**) |
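The accounting behind the table can be worked through with concrete numbers. This sketch plugs in Nano's `d`, `m` and `K` purely for illustration (Nano itself uses standard MoE) and compares two LatentMoE cases: the active expert count scaled to `K' = α × K`, and `K` left unchanged. Function names are hypothetical.

```python
def moe_traffic(d: int, m: int, k: int) -> dict:
    """Per-token traffic, in element counts, for k active experts of width d."""
    return {
        "expert_weight_read": k * 2 * d * m,  # up + down projection per active expert
        "dispatch_payload": k * d,            # activations sent to the active experts
    }


def latent_moe_traffic(d: int, m: int, k: int, alpha: int) -> dict:
    """Same accounting with experts operating in a latent space of width d / alpha."""
    assert d % alpha == 0, "sketch assumes alpha divides d"
    latent = d // alpha
    return {
        "scaled_k": moe_traffic(latent, m, alpha * k),  # K' = alpha * K active experts
        "same_k": moe_traffic(latent, m, k),            # K unchanged
        "projection_weights": 2 * d * latent,           # shared W_down and W_up
    }


D_MODEL, M_INTER, K_ACTIVE, ALPHA = 2688, 1856, 6, 4
standard = moe_traffic(D_MODEL, M_INTER, K_ACTIVE)
latent = latent_moe_traffic(D_MODEL, M_INTER, K_ACTIVE, ALPHA)
rows = (("standard", standard),
        ("latent, K' = alpha*K", latent["scaled_k"]),
        ("latent, K unchanged", latent["same_k"]))
for name, row in rows:
    print(f"{name:>22}: weights={row['expert_weight_read']:,} "
          f"payload={row['dispatch_payload']:,}")
```

With `K' = α × K` the total expert weight read matches standard MoE, as the first table row states; the 4x reduction shows up per expert, and in total only when `K` is held fixed.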
425
+
426
+ ### Performance
427
+
428
+ - At iso-accuracy: **up to 3.5x throughput improvement** over standard MoE
429
+ - At iso-FLOP: significant accuracy gains (MMLU-Pro: +5.65 points)
430
+ - Overhead of down/up projection: **<9%** of total compute
431
+
432
+ ### Gerbil Relevance
433
+
434
+ Nemotron-3 Nano uses standard MoE (not LatentMoE). LatentMoE is used in Super and Ultra models. However:
435
+ - If NVIDIA releases a Nano-class model with LatentMoE, the bandwidth savings would be transformative for WebGPU
436
+ - The down/up projections are simple matmuls — trivial to implement in WGSL
437
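+ - Could reduce our per-token MoE bandwidth from ~160MB to ~40MB per layer, but only if the active expert count stays at K = 6; with K′ = α × K the total expert weight read per token is unchanged (see the table above)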
438
+
439
+ ---
440
+
441
+ ## 7. Multi-Token Prediction (MTP)
442
+
443
+ ### How It Works
444
+
445
+ Instead of predicting only the next token, the model has N additional prediction heads that predict tokens 2, 3, ..., N+1 steps ahead.
446
+
447
+ ```
448
+ Backbone output (hidden states)
449
+
450
+ ├── Head 0 (standard): predict token t+1
451
+ ├── Head 1 (MTP): predict token t+2
452
+ ├── Head 2 (MTP): predict token t+3
453
+ └── Head 3 (MTP): predict token t+4
454
+ ```
455
+
456
+ ### Training Benefits
457
+ - Richer training signal (predicting further ahead encourages planning)
458
+ - ~2.4% average improvement across benchmarks
459
+ - Minimal additional FLOPs during training
460
+
461
+ ### Inference: Self-Speculative Decoding
462
+
463
+ The MTP heads enable speculative decoding **without a separate draft model**:
464
+
465
+ ```
466
+ 1. Run one forward pass → get predictions from all heads
467
+ 2. Head 0 predicts token t+1 (high confidence)
468
+ 3. Head 1 predicts token t+2 (draft)
469
+ 4. Head 2 predicts token t+3 (draft)
470
+ 5. Verify drafts by running a single forward pass on [t+1, t+2, t+3]
471
+ 6. Accept all tokens up to first mismatch
472
+ ```
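The accept/reject rule in steps 5 and 6 is small enough to sketch directly. This assumes greedy decoding, where a draft is accepted only if it equals the token the main head produces at that position during verification; emitting the verified token at the first mismatch is a common refinement and is an assumption here, not something the steps above specify. `accept_drafts` is a hypothetical helper.

```python
def accept_drafts(draft_tokens, verified_tokens):
    """Return the tokens to emit after one verification pass.

    draft_tokens[i] is the draft for position i; verified_tokens[i] is what
    the main head predicts for that position given the preceding tokens.
    Drafts are kept up to the first mismatch; at the mismatch the verified
    token is emitted instead and the remaining drafts are discarded.
    """
    accepted = []
    for draft, verified in zip(draft_tokens, verified_tokens):
        if draft != verified:
            accepted.append(verified)
            break
        accepted.append(draft)
    return accepted


print(accept_drafts([11, 42, 7], [11, 42, 9]))
print(accept_drafts([5, 6], [5, 6]))
print(accept_drafts([1, 2], [3, 4]))
```

Every verification pass therefore yields at least one token, so a rejected draft costs extra compute for that pass but never produces a wrong token.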
473
+
474
+ ### Key Numbers
475
+ - **97% acceptance rate** on first two predicted tokens (Nemotron-3)
476
+ - **Up to 3x faster** inference with 4-token prediction heads (Meta paper)
477
+ - Particularly effective at **batch-size-1** (exactly Gerbil's use case)
478
+
479
+ ### Gerbil Implementation
480
+
481
+ For decode, MTP would:
482
+ 1. Add lightweight prediction heads (shared backbone, separate output projections)
483
+ 2. Generate 2-4 draft tokens per forward pass
484
+ 3. Verify all drafts in a single prefill-style forward pass
485
+ 4. Accept valid tokens, fall back to first mismatch
486
+
487
+ **Expected speedup: ~1.5-2x** decode throughput at batch-1, with near-zero memory overhead (heads are just additional small linear projections over vocab).
488
+
489
+ **Note:** Nemotron-3 Nano's config doesn't explicitly show MTP configuration, so this may be a Super/Ultra feature or baked into the architecture at the weights level. Needs verification.
490
+
491
+ ---
492
+
493
+ ## 8. Quantization: NVFP4 vs INT4
494
+
495
+ ### Our Current INT4
496
+
497
+ ```
498
+ Format: 4-bit unsigned integer, packed 8 nibbles per u32
499
+ Dequant: output = (nibble - zero) * scale
500
+ Group size: 128 (configurable 32/64)
501
+ Scale/zero: f32 per group
502
+ ```
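As a reference for the format above, here is a plain-Python sketch of unpacking and dequantizing one group. The nibble order (lowest nibble first) is an assumption, and `quantize_group` is a hypothetical helper included only so the round trip can be checked; it is not Gerbil's quantizer.

```python
def unpack_nibbles(word: int) -> list:
    """Split one 32-bit word into eight 4-bit values, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize_group(words, scale: float, zero: float) -> list:
    """Apply output = (nibble - zero) * scale to every nibble in a group."""
    return [(n - zero) * scale for w in words for n in unpack_nibbles(w)]


def quantize_group(values, levels: int = 16):
    """Asymmetric per-group quantization; returns (words, scale, zero)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (levels - 1) or 1.0  # avoid a zero scale for constant groups
    zero = -lo / scale
    nibbles = [min(levels - 1, max(0, round(v / scale + zero))) for v in values]
    words = []
    for i in range(0, len(nibbles), 8):
        word = 0
        for j, n in enumerate(nibbles[i:i + 8]):
            word |= n << (4 * j)
        words.append(word)
    return words, scale, zero


values = [i * 0.1 - 6.0 for i in range(128)]  # one 128-element group
words, scale, zero = quantize_group(values)
restored = dequantize_group(words, scale, zero)
max_err = max(abs(v - r) for v, r in zip(values, restored))
print(f"{len(words)} words, scale={scale:.4f}, max error={max_err:.4f}")
```

The worst-case error is half a quantization step, which is set by the value range of the whole 128-element group; that dependence on group size is what the comparison with NVFP4's 16-element blocks is about.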
503
+
504
+ ### NVFP4
505
+
506
+ ```
507
+ Format: E2M1 (2-bit exponent, 1-bit mantissa)
508
+ Values: {0, 0.5, 1, 1.5, 2, 3, 4, 6} × sign
509
+ Block scaling: E4M3 format, per 16-element micro-block
510
+ Global scaling: FP32, second-level scale
511
+ 2D block scaling for weights
512
+ ```
513
+
514
+ ### Comparison
515
+
516
+ | Aspect | Our INT4 | NVFP4 |
517
+ |--------|---------|-------|
518
+ | Element format | 4-bit unsigned int | E2M1 (4-bit float) |
519
+ | Group/block size | 128 elements | **16 elements** (8x finer) |
520
+ | Scale format | f32 | E4M3 + FP32 (two-level) |
521
+ | Zero point | Per-group | Not needed (symmetric) |
522
+ | Accuracy (vs BF16) | ~decent (no published numbers) | **<1% relative loss difference** |
523
+ | Selective precision | No | **Last 15% of layers in BF16** |
524
+
525
+ ### Implications for Gerbil
526
+
527
+ - NVFP4's 16-element micro-blocks are **8x finer granularity** than our 128-element groups
528
+ - This means 8x more scale factors to read, but much better accuracy
529
+ - E2M1 dequantization needs no zero-point subtraction: each nibble decodes through an 8-entry magnitude table plus a sign bit, then is multiplied by the block scale
530
+ - Could implement as: read 2 packed u32s (16 nibbles) + 1 E4M3 scale + share FP32 global scale
531
+ - NVFP4 weights from HuggingFace could be loaded directly without requantization
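A possible decode path for one micro-block is sketched below. The E2M1 value table and the E4M3 bit layout follow the format description above; the nibble packing order (lowest nibble first) and the way the block scale and global scale are combined (a plain product) are assumptions that would need checking against the actual checkpoint layout.

```python
# Magnitudes representable by E2M1, indexed by the low 3 bits of the nibble.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_e4m3(byte: int) -> float:
    """Decode an FP8 E4M3 value (bias 7, no infinities); NaN codes are not handled."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:  # subnormal range
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_block(words, block_scale_byte: int, global_scale: float) -> list:
    """Dequantize one 16-element micro-block stored as two packed 32-bit words."""
    scale = decode_e4m3(block_scale_byte) * global_scale
    return [decode_e2m1((w >> (4 * i)) & 0xF) * scale for w in words for i in range(8)]


# All 16 codes in order, block scale 1.0 (0x38), global scale 0.5.
block = dequantize_block([0x76543210, 0xFEDCBA98], 0x38, 0.5)
print(block)
```

In a WGSL kernel the magnitude table would likely live in a small constant array, with one E4M3 decode amortized over the 16 elements of the block.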
532
+
533
+ ---
534
+
535
+ ## 9. Nemotron-3 Nano 30B-A3B Config Reference
536
+
537
+ Complete architecture config from HuggingFace:
538
+
539
+ ```json
540
+ {
541
+ "architectures": ["NemotronHForCausalLM"],
542
+ "model_type": "nemotron_h",
543
+
544
+ "hidden_size": 2688,
545
+ "num_hidden_layers": 52,
546
+ "intermediate_size": 1856,
547
+ "vocab_size": 131072,
548
+ "max_position_embeddings": 262144,
549
+
550
+ "hybrid_override_pattern": "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME",
551
+
552
+ "num_attention_heads": 32,
553
+ "num_key_value_heads": 2,
554
+ "head_dim": 128,
555
+ "rope_theta": 10000,
556
+
557
+ "mamba_num_heads": 64,
558
+ "mamba_head_dim": 64,
559
+ "ssm_state_size": 128,
560
+ "expand": 2,
561
+ "conv_kernel": 4,
562
+ "n_groups": 8,
563
+ "chunk_size": 128,
564
+ "mamba_hidden_act": "silu",
565
+
566
+ "n_routed_experts": 128,
567
+ "num_experts_per_tok": 6,
568
+ "n_shared_experts": 1,
569
+ "moe_intermediate_size": 1856,
570
+ "moe_shared_expert_intermediate_size": 3712,
571
+ "routed_scaling_factor": 2.5,
572
+
573
+ "mlp_hidden_act": "relu2",
574
+ "layer_norm_epsilon": 1e-05,
575
+ "torch_dtype": "bfloat16"
576
+ }
577
+ ```
578
+
579
+ ### Layer Pattern Decoded
580
+
581
+ ```
582
+ Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
583
+ Layer: M E M E M * E M E M E M * E M E M E M *
584
+ Type: m x m x m a x m x m x m a x m x m x m a
585
+
586
+ Position: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
587
+ Layer: E M E M E M * E M E M E M * E M E M E M
587
+ Type: x m x m x m a x m x m x m a x m x m x m
588
+
589
+ Position: 40 41 42 43 44 45 46 47 48 49 50 51
590
+ Layer: E M * E M E M E M E M E
591
+ Type: x m a x m x m x m x m x
593
+
594
+ m = Mamba-2, x = MoE, a = Attention (GQA)
595
+ ```
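The decoding can be checked mechanically from the `hybrid_override_pattern` string in the config. The sketch below uses hypothetical names (`LayerKind`, `decodePattern`); it only maps characters to layer types and counts them.

```typescript
type LayerKind = "mamba2" | "moe" | "attention";

// Pattern string copied from the config above.
const PATTERN = "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME";

function decodePattern(pattern: string): LayerKind[] {
  const kinds: LayerKind[] = [];
  for (const ch of pattern.split("")) {
    if (ch === "M") kinds.push("mamba2");
    else if (ch === "E") kinds.push("moe");
    else if (ch === "*") kinds.push("attention");
    else throw new Error("Unknown layer code: " + ch);
  }
  return kinds;
}

const layers = decodePattern(PATTERN);

// Count each layer type and record where the attention layers sit.
const counts = { mamba2: 0, moe: 0, attention: 0 };
const attentionPositions: number[] = [];
for (let i = 0; i < layers.length; i++) {
  counts[layers[i]] += 1;
  if (layers[i] === "attention") attentionPositions.push(i);
}
```

For this pattern the counts come out to 23 Mamba-2, 23 MoE, and 6 attention layers, with attention at positions 5, 12, 19, 26, 33, and 42.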
596
+
597
+ ### Derived Dimensions
598
+
599
+ | Component | Dimension | Notes |
600
+ |-----------|-----------|-------|
601
+ | Mamba inner dim | 2 × 2688 = 5376 | expand=2 |
602
+ | Mamba state per layer | 64 heads × 64 head_dim × 128 state = 524K values | ~2MB at f32 |
603
+ | Mamba conv buffer | 5376 × 3 values | conv_kernel=4, store last 3 |
604
+ | Attention KV per token | 2 KV heads × 128 dim × 2(K+V) = 512 values | ~1KB at f16 |
605
+ | MoE per-token read | 6 experts × 2 × 2688 × 1856 = ~60M values | ~120MB at f16 |
606
+ | Total model weights (BF16) | ~63 GB | All 128 experts × 23 layers |
607
+ | Total model weights (INT4) | ~16 GB | With quantization |
608
+ | Active weights per token | ~1.8 GB | Only active experts + Mamba/attention |
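Several rows of this table follow directly from the config values, which a short calculation makes explicit. The object and variable names below are illustrative only; the MoE figure is per MoE layer and the KV figure is per attention layer.

```typescript
// Values copied from the config above.
const cfg = {
  hiddenSize: 2688,
  mambaNumHeads: 64,
  mambaHeadDim: 64,
  ssmStateSize: 128,
  numKeyValueHeads: 2,
  headDim: 128,
  numExpertsPerTok: 6,
  moeIntermediateSize: 1856,
};

// Mamba state per layer: heads x head_dim x state size.
const ssmStateValues = cfg.mambaNumHeads * cfg.mambaHeadDim * cfg.ssmStateSize;
const ssmStateBytesF32 = ssmStateValues * 4;

// Attention KV per token: KV heads x head_dim x 2 (K and V).
const kvValuesPerToken = cfg.numKeyValueHeads * cfg.headDim * 2;
const kvBytesPerTokenF16 = kvValuesPerToken * 2;

// MoE per-token read: active experts x 2 matrices (up, down) x hidden x intermediate.
const moeValuesPerToken =
  cfg.numExpertsPerTok * 2 * cfg.hiddenSize * cfg.moeIntermediateSize;
const moeBytesPerTokenF16 = moeValuesPerToken * 2;
```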
609
+
610
+ ### Weight Size Estimates (INT4/FP4 quantized)
611
+
612
+ | Component | Per Layer | Total |
613
+ |-----------|-----------|-------|
614
+ | Mamba in_proj | 2688 × 5376 / 2 = 7.2 MB | 23 × 7.2 = 166 MB |
615
+ | Mamba out_proj | 5376 × 2688 / 2 = 7.2 MB | 23 × 7.2 = 166 MB |
616
+ | Mamba conv + small | ~11 KB | ~0.25 MB |
617
+ | MoE routed experts | 128 × 2 × 2688 × 1856 / 2 = 0.64 GB | 23 × 0.64 = **14.7 GB** |
617
+ | MoE shared expert | 2 × 2688 × 3712 / 2 = 10 MB | 23 × 10 = 230 MB |
618
+ | Attention (QKV+O) | 4 × 2688 × 2688 / 2 = 14.5 MB | 6 × 14.5 = 87 MB |
619
+ | Embeddings + head | 131072 × 2688 × 2 = 670 MB | 670 MB |
620
+ | **Total (INT4)** | | **~16 GB** |
621
+
622
+ This fits comfortably in M4 Max 128GB unified memory and in 64GB configs, but remains out of reach for mobile.
624
+
625
+ ---
626
+
627
+ ## 10. Gerbil Implications & Implementation Roadmap
628
+
629
+ ### Phase 1: Mamba-2 Kernel (WGSL)
630
+
631
+ **New kernels needed:**
632
+
633
+ 1. **Mamba2Decode** (hot path — single token)
634
+ - SSM state update: `h = a*h + B⊗x` per head
635
+ - SSM output: `y = h @ C` per head
636
+ - Fuse with SiLU gate and output projection
637
+ - Workgroup: one workgroup per head (64 threads for head_dim=64)
638
+ - State in storage buffer (persistent across tokens)
639
+
640
+ 2. **Mamba2Conv1d** (per token)
641
+ - 1D causal convolution, kernel=4
642
+ - Maintain rolling buffer of last 3 activations
643
+ - Simple: read 4 values, multiply-accumulate
644
+
645
+ 3. **Mamba2Prefill** (prompt processing)
646
+ - Chunked SSD algorithm (4 steps)
647
+ - Steps 1,2,4 are matmuls — reuse existing MatMul kernels
648
+ - Step 3 is short scan — new small kernel
649
+ - Chunk size 128 (from config)
650
+
651
+ 4. **Mamba2InputProj / OutputProj**
652
+ - Standard MatVec (decode) or MatMul (prefill)
653
+ - Reuse existing kernels, just need correct dispatch
654
+
655
+ **State management changes:**
656
+ - New persistent buffer type: SSM state (per layer, constant size)
657
+ - Conv buffer: rolling window (per layer, width=3)
658
+ - These replace KV cache for Mamba layers (massive memory savings)
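The per-head recurrence listed under Mamba2Decode can be written as a CPU reference for validating a future WGSL kernel. This is a simplified sketch of the two formulas as stated (`h = a*h + B⊗x`, `y = h @ C`); it leaves out the `dt` scaling, the `D` skip connection, and the gate, and `mamba2DecodeStep` is a hypothetical name.

```typescript
// One decode step for a single head.
// h: persistent state, row-major [headDim][stateSize], updated in place
// x: input for this head (headDim), B and C: state projections (stateSize)
// a: scalar decay for this head
function mamba2DecodeStep(
  h: Float32Array,
  x: Float32Array,
  B: Float32Array,
  C: Float32Array,
  a: number
): Float32Array {
  const headDim = x.length;
  const stateSize = B.length;
  const y = new Float32Array(headDim);
  for (let p = 0; p < headDim; p++) {
    let acc = 0;
    for (let n = 0; n < stateSize; n++) {
      const idx = p * stateSize + n;
      h[idx] = a * h[idx] + x[p] * B[n]; // state update (outer product term)
      acc += h[idx] * C[n];              // output contraction over the state dim
    }
    y[p] = acc;
  }
  return y;
}
```

Because the state has a fixed size, memory use per layer stays constant no matter how many tokens have been processed.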
659
+
660
+ ### Phase 2: MoE Routing
661
+
662
+ **New kernels:**
663
+ 1. **TopKRouter** — compute router scores, find top-6 experts
664
+ 2. **MoEGather** — select expert weights for active experts
665
+ 3. **MoEScatter** — weighted sum of expert outputs
666
+
667
+ **Architecture changes:**
668
+ - Expert weights stored as large buffers (128 experts × weight matrix)
669
+ - Router dispatches only active expert kernels per token
670
+ - Shared expert always dispatched in parallel
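The router and scatter steps can be prototyped on the CPU before writing kernels. The sketch below covers only top-k selection and the weighted sum; `topK` and `combineExperts` are hypothetical names, and how router scores are normalized into weights is deliberately left out because this document does not specify it.

```typescript
// Return the indices of the k highest scores (ties broken by lower index).
function topK(scores: number[], k: number): number[] {
  const entries: { score: number; index: number }[] = [];
  for (let i = 0; i < scores.length; i++) {
    entries.push({ score: scores[i], index: i });
  }
  entries.sort((l, r) => r.score - l.score || l.index - r.index);
  const picked: number[] = [];
  for (let i = 0; i < Math.min(k, entries.length); i++) {
    picked.push(entries[i].index);
  }
  return picked;
}

// Weighted sum of the selected experts' output vectors.
function combineExperts(outputs: number[][], weights: number[]): number[] {
  const dim = outputs[0].length;
  const out: number[] = [];
  for (let i = 0; i < dim; i++) out.push(0);
  for (let e = 0; e < outputs.length; e++) {
    for (let i = 0; i < dim; i++) {
      out[i] += weights[e] * outputs[e][i];
    }
  }
  return out;
}
```

For Nemotron-3 Nano, `k` would be 6 and `scores` would have 128 entries per token.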
671
+
672
+ ### Phase 3: Graph Generator for NemotronH
673
+
674
+ New architecture handler:
675
+ ```typescript
676
+ function generateNemotronHGraph(config: NemotronHConfig): ModelGraph {
677
+ const pattern = config.hybrid_override_pattern;
678
+ for (let i = 0; i < config.num_hidden_layers; i++) {
679
+ if (pattern[i] === 'M') {
680
+ // Mamba-2 layer: in_proj → conv1d → SSM → gate → out_proj
681
+ addMamba2Layer(graph, i);
682
+ } else if (pattern[i] === '*') {
683
+ // Attention layer: QKV → attention → out_proj (existing kernels)
684
+ addAttentionLayer(graph, i);
685
+ } else if (pattern[i] === 'E') {
686
+ // MoE layer: router → top-k gather → expert FFN → scatter
687
+ addMoELayer(graph, i);
688
+ }
689
+ }
690
+ }
691
+ ```
692
+
693
+ ### Phase 4: MTP Speculative Decoding (if applicable)
694
+
695
+ - Add lightweight prediction heads
696
+ - Implement verify-and-accept loop
697
+ - Expected ~1.5-2x decode speedup
698
+
699
+ ### Priority Order
700
+
701
+ 1. **NemotronH graph generator** — ships the architecture adapter, validates IR design
702
+ 2. **Mamba-2 decode kernel** — gates everything, simplest to validate
703
+ 3. **MoE routing + dispatch** — needed for any Nemotron model
704
+ 4. **Mamba-2 prefill (chunked SSD)** — needed for prompt processing
705
+ 5. **NVFP4 dequantization** — use their quantized weights directly
706
+ 6. **MTP speculative decode** — optimization on top
707
+
708
+ ---
709
+
710
+ ## 10.5. NemotronH Graph Generator — Detailed Implementation Plan
711
+
712
+ ### Verified Model Structure (from `modeling_nemotron_h.py`)
713
+
714
+ The actual HuggingFace implementation reveals key differences from assumptions:
715
+
716
+ **Weight prefix is `backbone.` not `model.`:**
717
+ ```
718
+ backbone.embeddings.weight
719
+ backbone.layers.{i}.norm.weight
720
+ backbone.layers.{i}.mixer.{component}
721
+ backbone.norm_f.weight
722
+ lm_head.weight
723
+ ```
724
+
725
+ **All layer types share the `.mixer.` namespace** — Mamba, Attention, and MLP/MoE all live under `layers.{i}.mixer.*`.
726
+
727
+ **Pattern characters map to:**
728
+ - **M** → `NemotronHMamba2Mixer` (standard Mamba-2 SSM)
729
+ - **E** → `NemotronHMLP` (8B dense model) / MoE (30B Nano model)
730
+ - **\*** → `NemotronHAttention` (GQA)
731
+
732
+ ### Verified Safetensors Key Names
733
+
734
+ **Mamba layers (`M`):**
735
+ ```
736
+ backbone.layers.{i}.norm.weight # pre-norm
737
+ backbone.layers.{i}.mixer.in_proj.weight # fused [d_mlp, d_mlp, gate, x_B_C, dt]
738
+ backbone.layers.{i}.mixer.conv1d.weight # depthwise conv1d
739
+ backbone.layers.{i}.mixer.conv1d.bias
740
+ backbone.layers.{i}.mixer.dt_bias # (num_heads,)
741
+ backbone.layers.{i}.mixer.A_log # (num_heads,) log-space decay
742
+ backbone.layers.{i}.mixer.D # (num_heads,) skip connection
743
+ backbone.layers.{i}.mixer.norm.weight # gated RMSNorm before output
744
+ backbone.layers.{i}.mixer.out_proj.weight # back to hidden_size
745
+ ```
746
+
747
+ **Attention layers (`*`):**
748
+ ```
749
+ backbone.layers.{i}.norm.weight
750
+ backbone.layers.{i}.mixer.q_proj.weight
751
+ backbone.layers.{i}.mixer.k_proj.weight
752
+ backbone.layers.{i}.mixer.v_proj.weight
753
+ backbone.layers.{i}.mixer.o_proj.weight
754
+ ```
755
+
756
+ **MLP layers (`E` in 8B dense):**
757
+ ```
758
+ backbone.layers.{i}.norm.weight
759
+ backbone.layers.{i}.mixer.up_proj.weight
760
+ backbone.layers.{i}.mixer.down_proj.weight
761
+ ```
762
+
763
+ **MoE layers (`E` in 30B Nano) — needs verification:**
764
+ ```
765
+ backbone.layers.{i}.norm.weight
766
+ backbone.layers.{i}.mixer.gate.weight # router
767
+ backbone.layers.{i}.mixer.experts.{j}.up_proj.weight
768
+ backbone.layers.{i}.mixer.experts.{j}.down_proj.weight
769
+ backbone.layers.{i}.mixer.shared_experts.up_proj.weight # shared expert
770
+ backbone.layers.{i}.mixer.shared_experts.down_proj.weight
771
+ ```
772
+
773
+ ### Mamba-2 in_proj Split (from source code)
774
+
775
+ NemotronH's Mamba-2 fuses everything into a single `in_proj`:
776
+
777
+ ```python
778
+ # in_proj output is split as:
779
+ d_mlp, d_mlp, gate, hidden_states_B_C, dt = projected_states.split(
780
+ [d_mlp, d_mlp, intermediate_size, conv_dim, num_heads], dim=-1
781
+ )
782
+
783
+ # where:
784
+ # intermediate_size = mamba_num_heads * mamba_head_dim = 64 * 64 = 4096
785
+ # conv_dim = intermediate_size + 2 * n_groups * ssm_state_size
786
+ # = 4096 + 2 * 8 * 128 = 6144
787
+ # num_heads = 64 (for dt)
788
+ # d_mlp = any remaining dimensions / 2 (optional MLP component)
789
+
790
+ # Then hidden_states_B_C is further split:
791
+ hidden_states, B, C = torch.split(hidden_states_B_C,
792
+ [intermediate_size, n_groups*ssm_state_size, n_groups*ssm_state_size], dim=-1)
793
+ ```
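For the graph generator, the split sizes can be derived from the config rather than hard-coded. The sketch below mirrors the arithmetic in the comments above; `Mamba2Dims` and `inProjSplitSizes` are hypothetical names, and `dMlp: 0` is an assumption (no optional MLP component) that should be checked against the real `in_proj` weight shape.

```typescript
interface Mamba2Dims {
  numHeads: number;
  headDim: number;
  nGroups: number;
  stateSize: number;
  dMlp: number; // assumed 0 when in_proj has no extra MLP columns
}

// Sizes of the five chunks that the fused in_proj output is split into.
function inProjSplitSizes(d: Mamba2Dims): { sizes: number[]; total: number } {
  const intermediate = d.numHeads * d.headDim;
  const convDim = intermediate + 2 * d.nGroups * d.stateSize;
  const sizes = [d.dMlp, d.dMlp, intermediate, convDim, d.numHeads];
  let total = 0;
  for (const s of sizes) total += s;
  return { sizes, total };
}

const nano = inProjSplitSizes({
  numHeads: 64,
  headDim: 64,
  nGroups: 8,
  stateSize: 128,
  dMlp: 0,
});
```

Under these assumptions the fused projection has 4096 + 6144 + 64 = 10304 output columns.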
794
+
795
+ This differs from Qwen3.5's separate `in_proj_qkv`, `in_proj_a`, `in_proj_b`, `in_proj_z` projections.
796
+
797
+ ### Files to Change
798
+
799
+ | File | Change | Size |
800
+ |------|--------|------|
801
+ | `src/gpu/architectures/nemotron_h.ts` | **New file** — graph generator | ~400 lines |
802
+ | `src/gpu/architectures/index.ts` | Register `NemotronHForCausalLM` | 3 lines |
803
+ | `src/gpu/ir.ts` | Add NemotronH canonical keys + HF key mapper | ~40 lines |
804
+
805
+ ### Existing Op Reuse
806
+
807
+ | Component | Exists in Gerbil? | Reusable? |
808
+ |-----------|-------------------|-----------|
809
+ | MatMul / MatMulInt4 | Yes | Direct reuse for all projections |
810
+ | Attention + KV cache | Yes | Direct reuse for \* layers |
811
+ | RMSNorm / ResidualRMSNorm | Yes | Direct reuse |
812
+ | SiLU / SwiGLU | Yes | Direct reuse for gating |
813
+ | CausalConv1d | Yes (Qwen3.5) | Direct reuse for Mamba conv |
814
+ | ConvStateUpdate | Yes (Qwen3.5) | Direct reuse |
815
+ | MambaSSM | **Partial** | Qwen3.5 implements Gated DeltaNet. NemotronH needs standard Mamba-2 SSM (different recurrence: `h = a*h + outer(B,x)` vs delta rule). Needs new kernel variant. |
816
+ | ReLU² activation | **No** | Trivial — `relu(x)²` or `x * relu(x)`, ~20 lines WGSL |
817
+ | MoERouter (top-k) | **No** (stubbed in IR) | New kernel: compute router logits, select top-6 |
818
+ | MoE Expert dispatch | **No** (stubbed in IR) | New kernel: route tokens to experts, weighted combine |
819
+
820
+ ### What the Graph Generator Produces (But Can't Run Yet)
821
+
822
+ The graph generator will emit correct IR nodes referencing ops whose kernels don't exist yet. This is the same pattern used for Qwen3.5 — the graph is correct documentation of the compute graph, and when kernels are implemented, inference works automatically.
823
+
824
+ **Ops that block end-to-end execution:**
825
+ 1. **MoE kernels** (23 E-layers × 128 experts) — the big one
826
+ 2. **Standard Mamba-2 SSM kernel** — differs from Qwen3.5's DeltaNet kernel
827
+ 3. **ReLU² activation** — trivial to add
828
+
829
+ ### Approach
830
+
831
+ Ship the graph generator now (Option A). It serves as:
832
+ - Verified architecture documentation (weight names, dimensions, layer pattern)
833
+ - Correct IR that will work when kernels land
834
+ - Test target for kernel development (load config → generate graph → inspect nodes)
835
+
836
+ ---
837
+
838
+ ## 11. Small Model Viability: Nano vs Qwen 3.5 0.8B
839
+
840
+ ### Does Nano Compete at the Small End?
841
+
842
+ **No.** Nemotron-3 Nano is not a small model replacement for Qwen 3.5 0.8B:
843
+
844
+ | Aspect | Qwen 3.5 0.8B | Nemotron-3 Nano 30B-A3B |
845
+ |--------|--------------|------------------------|
846
+ | Total params | 0.8B | 31.6B |
847
+ | Active params | 0.8B (dense) | 3.6B (MoE) |
848
+ | Weights (INT4) | ~0.4 GB | ~16 GB |
849
+ | Target hardware | Mobile, browser, edge | M4 Max, RTX, DGX Spark |
850
+ | Context length | 32K | 1M |
851
+ | Capability | Basic chat/generation | Full reasoning, coding, agentic |
852
+
853
+ **The use cases are different:**
854
+ - **Qwen 3.5 0.8B**: ultra-lightweight, runs on any device, good enough for simple tasks
855
+ - **Nemotron Nano**: dramatically more capable, but needs ~16 GB for INT4 weights alone
856
+
857
+ ### No Sub-1B Nemotron Models Exist
858
+
859
+ NVIDIA has not released any Nemotron model with fewer than 3B active parameters. The family starts at Nano (3.6B active / 31.6B total). There are no announced plans for smaller variants.
860
+
861
+ ### Gerbil Strategy
862
+
863
+ **Keep both:**
864
+ - **Qwen 3.5 0.8B** for mobile/browser (our current optimized path)
865
+ - **Nemotron-3 Nano** for desktop with M4 Max or similar (~16 GB of INT4 weights fits in 64GB or 128GB unified memory)
866
+ - Watch for smaller Nemotron or community-distilled variants
867
+
868
+ ### Could Nemotron Nano Run on M4 Max via Gerbil?
869
+
870
+ **Theoretically yes**, with caveats:
871
+ - INT4 weights: ~16 GB (fits in 128GB and 64GB configs)
872
+ - Active weight reads per token: ~1.8 GB of bandwidth
873
+ - At 546 GB/s (M4 Max): theoretical ~300 tok/s **if perfectly bandwidth-bound**
874
+ - Realistic with overhead: **50-150 tok/s** estimated
875
+ - Would need MoE expert weight management (not all 128 experts in active cache)
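The theoretical figure is simply memory bandwidth divided by bytes read per token. A minimal sketch of that bound, with a hypothetical function name and the numbers quoted above:

```typescript
// Upper bound on decode speed when every token must stream the active weights once.
function bandwidthBoundTokensPerSecond(
  bandwidthGBps: number,
  activeGBPerToken: number
): number {
  return bandwidthGBps / activeGBPerToken;
}

// 546 GB/s memory bandwidth, ~1.8 GB of active weights per token.
const upperBound = bandwidthBoundTokensPerSecond(546, 1.8);
```

This ignores activations, state reads, kernel launch overhead, and scale factors, which is why the realistic estimate is well below the bound.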
876
+
877
+ Compared to our current 160 tok/s on Qwen 0.8B, Nano would be slower but **dramatically more capable** — it's a reasoning model that can do multi-step coding, math, and tool use.
878
+
879
+ **Important:** MoE means **all 31.6B weights must reside in memory** even though only 3.6B are active per token. The router selects different experts per token unpredictably. The saving is in compute FLOPs and per-token weight reads (~1.8 GB), not in memory footprint; this is crucial for bandwidth-bound WebGPU inference.
880
+
881
+ ### Community Small Mamba-2 Models
882
+
883
+ If smaller Mamba-2 models are of future interest:
884
+
885
+ | Model | Params | Type | Notes |
886
+ |-------|--------|------|-------|
887
+ | `state-spaces/mamba2-2.7b` | 2.7B | Pure Mamba-2, dense | Base model only (no instruct) |
888
+ | `state-spaces/mamba2-1.3b` | 1.3B | Pure Mamba-2, dense | Base model only |
889
+ | `state-spaces/mamba2-780m` | 780M | Pure Mamba-2, dense | Base model only |
890
+ | `state-spaces/mamba2attn-2.7b` | 2.7B | Mamba-2 + Attention hybrid | Minimal documentation |
891
+ | `Zyphra/Zamba2-1.2B-Instruct-v2` | 1.2B | Mamba-2 + shared attention | Apache 2.0, instruct-tuned |
892
+
893
+ None of these are Nemotron-class models, but they prove the Mamba-2 kernel would have broader utility beyond just Nemotron.
894
+
895
+ ---
896
+
897
+ ## 12. Sources
898
+
899
+ - [NVIDIA Nemotron 3: Efficient and Open Intelligence (arxiv 2512.20856)](https://arxiv.org/abs/2512.20856)
900
+ - [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models (arxiv 2504.03624)](https://arxiv.org/abs/2504.03624)
901
+ - [Nemotron 3 Nano: Open, Efficient MoE Hybrid Mamba-Transformer (arxiv 2512.20848)](https://arxiv.org/abs/2512.20848)
902
+ - [Mamba-2 / SSD: Transformers are SSMs (arxiv 2405.21060)](https://arxiv.org/abs/2405.21060)
903
+ - [Tri Dao blog: Mamba-2 Part I - The Model](https://tridao.me/blog/2024/mamba2-part1-model/)
904
+ - [Tri Dao blog: Mamba-2 Part III - The Algorithm](https://tridao.me/blog/2024/mamba2-part3-algorithm/)
905
+ - [LatentMoE: Toward Optimal Accuracy per FLOP (arxiv 2601.18089)](https://arxiv.org/abs/2601.18089)
906
+ - [LatentMoE NVIDIA Research Page](https://research.nvidia.com/labs/nemotron/LatentMoE/)
907
+ - [Meta: Better & Faster LLMs via Multi-token Prediction (arxiv 2404.19737)](https://arxiv.org/abs/2404.19737)
908
+ - [Nemotron-3 Nano HuggingFace Blog](https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models)
909
+ - [Nemotron-3 Nano FP8 Config.json](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/blob/main/config.json)
910
+ - [NVIDIA Nemotron-3 Super Blog](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/)