@tryhamster/gerbil 1.0.0-rc.9 → 1.0.1

Files changed (179)
  1. package/LICENSE +1 -1
  2. package/README.md +318 -104
  3. package/dist/architectures-C1I5V3Dt.mjs +6070 -0
  4. package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
  5. package/dist/browser/index.d.ts +276 -590
  6. package/dist/browser/index.d.ts.map +1 -1
  7. package/dist/browser/index.js +592 -2334
  8. package/dist/browser/index.js.map +1 -1
  9. package/dist/cli.mjs +625 -1098
  10. package/dist/cli.mjs.map +1 -1
  11. package/dist/defaults-9komdrbY.mjs +24 -0
  12. package/dist/defaults-9komdrbY.mjs.map +1 -0
  13. package/dist/frameworks/express.d.mts +1 -3
  14. package/dist/frameworks/express.d.mts.map +1 -1
  15. package/dist/frameworks/express.mjs +7 -7
  16. package/dist/frameworks/express.mjs.map +1 -1
  17. package/dist/frameworks/fastify.d.mts +1 -1
  18. package/dist/frameworks/fastify.d.mts.map +1 -1
  19. package/dist/frameworks/fastify.mjs +3 -3
  20. package/dist/frameworks/fastify.mjs.map +1 -1
  21. package/dist/frameworks/hono.d.mts +1 -1
  22. package/dist/frameworks/hono.d.mts.map +1 -1
  23. package/dist/frameworks/hono.mjs +4 -4
  24. package/dist/frameworks/hono.mjs.map +1 -1
  25. package/dist/frameworks/next.d.mts +3 -2
  26. package/dist/frameworks/next.d.mts.map +1 -1
  27. package/dist/frameworks/next.mjs +4 -4
  28. package/dist/frameworks/next.mjs.map +1 -1
  29. package/dist/frameworks/react.d.mts +1 -1
  30. package/dist/frameworks/trpc.d.mts +1 -1
  31. package/dist/frameworks/trpc.d.mts.map +1 -1
  32. package/dist/frameworks/trpc.mjs +4 -4
  33. package/dist/frameworks/trpc.mjs.map +1 -1
  34. package/dist/gerbil-BetB5xb0.d.mts +488 -0
  35. package/dist/gerbil-BetB5xb0.d.mts.map +1 -0
  36. package/dist/gerbil-CTZUa8EZ.mjs +4 -0
  37. package/dist/gerbil-DNniplr4.mjs +1656 -0
  38. package/dist/gerbil-DNniplr4.mjs.map +1 -0
  39. package/dist/gpu/hooks.d.mts +640 -0
  40. package/dist/gpu/hooks.d.mts.map +1 -0
  41. package/dist/gpu/hooks.mjs +1369 -0
  42. package/dist/gpu/hooks.mjs.map +1 -0
  43. package/dist/gpu/index.d.mts +2 -0
  44. package/dist/gpu/index.mjs +6 -0
  45. package/dist/gpu-DFuglcEx.mjs +3790 -0
  46. package/dist/gpu-DFuglcEx.mjs.map +1 -0
  47. package/dist/index-Dgmb2kE3.d.mts +245 -0
  48. package/dist/index-Dgmb2kE3.d.mts.map +1 -0
  49. package/dist/index-DukkJRMj.d.mts +2114 -0
  50. package/dist/index-DukkJRMj.d.mts.map +1 -0
  51. package/dist/index.d.mts +22 -487
  52. package/dist/index.d.mts.map +1 -1
  53. package/dist/index.mjs +13 -8
  54. package/dist/index.mjs.map +1 -1
  55. package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
  56. package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
  57. package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
  58. package/dist/integrations/ai-sdk.d.mts +75 -6
  59. package/dist/integrations/ai-sdk.d.mts.map +1 -1
  60. package/dist/integrations/ai-sdk.mjs +131 -15
  61. package/dist/integrations/ai-sdk.mjs.map +1 -1
  62. package/dist/integrations/langchain.d.mts +1 -1
  63. package/dist/integrations/langchain.d.mts.map +1 -1
  64. package/dist/integrations/langchain.mjs +5 -5
  65. package/dist/integrations/langchain.mjs.map +1 -1
  66. package/dist/integrations/llamaindex.d.mts +1 -1
  67. package/dist/integrations/llamaindex.d.mts.map +1 -1
  68. package/dist/integrations/llamaindex.mjs +5 -5
  69. package/dist/integrations/llamaindex.mjs.map +1 -1
  70. package/dist/integrations/mcp-client.mjs +3 -3
  71. package/dist/integrations/mcp-client.mjs.map +1 -1
  72. package/dist/integrations/mcp.d.mts +3 -2
  73. package/dist/integrations/mcp.d.mts.map +1 -1
  74. package/dist/integrations/mcp.mjs +5 -5
  75. package/dist/{mcp-BvbriaBy.mjs → mcp-D2vvH1Xc.mjs} +4 -4
  76. package/dist/mcp-D2vvH1Xc.mjs.map +1 -0
  77. package/dist/memory/index.d.mts +3 -0
  78. package/dist/memory/index.mjs +6 -0
  79. package/dist/memory-D1P7Tmda.mjs +4 -0
  80. package/dist/memory-DVN0MnIG.mjs +132 -0
  81. package/dist/memory-DVN0MnIG.mjs.map +1 -0
  82. package/dist/memory-Dj0J1v88.mjs +294 -0
  83. package/dist/memory-Dj0J1v88.mjs.map +1 -0
  84. package/dist/moonshine-stt-17dpP1kr.mjs +4 -0
  85. package/dist/moonshine-stt-4ojLtMq7.mjs +11962 -0
  86. package/dist/moonshine-stt-4ojLtMq7.mjs.map +1 -0
  87. package/dist/{one-liner-s-lD8rCC.mjs → one-liner-JhdIPxzF.mjs} +14 -16
  88. package/dist/one-liner-JhdIPxzF.mjs.map +1 -0
  89. package/dist/repl-BDRkwPGX.mjs +9 -0
  90. package/dist/skills/index.d.mts +270 -320
  91. package/dist/skills/index.d.mts.map +1 -1
  92. package/dist/skills/index.mjs +5 -5
  93. package/dist/{skills-CD3Orlex.mjs → skills-CU694Dc8.mjs} +187 -32
  94. package/dist/skills-CU694Dc8.mjs.map +1 -0
  95. package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
  96. package/dist/tools-DQ1mPUw5.mjs.map +1 -0
  97. package/dist/types-DQBe2lFo.d.mts +165 -0
  98. package/dist/types-DQBe2lFo.d.mts.map +1 -0
  99. package/dist/{types-CiTc7ez3.d.mts → types-LlyYILII.d.mts} +112 -14
  100. package/dist/types-LlyYILII.d.mts.map +1 -0
  101. package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
  102. package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
  103. package/dist/vector-B0panuy6.mjs +95 -0
  104. package/dist/vector-B0panuy6.mjs.map +1 -0
  105. package/docs/PROJECT-STATE.md +321 -0
  106. package/docs/adding-a-model-family.md +280 -0
  107. package/docs/ai-sdk.md +70 -61
  108. package/docs/architecture/overview.md +17 -7
  109. package/docs/browser.md +203 -8
  110. package/docs/embeddings.md +156 -0
  111. package/docs/gerbil-site-native-migration.md +217 -0
  112. package/docs/gpu-engine/architectures.md +398 -0
  113. package/docs/gpu-engine/ir.md +372 -0
  114. package/docs/gpu-engine/kernels.md +718 -0
  115. package/docs/gpu-engine/paper.html +1759 -0
  116. package/docs/gpu-engine/paper.md +2109 -0
  117. package/docs/gpu-engine/safetensors.md +312 -0
  118. package/docs/gpu-engine/tokenizer.md +302 -0
  119. package/docs/memory-rag.md +91 -0
  120. package/docs/metal-safari-intel.md +190 -0
  121. package/docs/mobile-failure-diagnosis.md +124 -0
  122. package/docs/mobile.md +99 -0
  123. package/docs/observability.md +230 -0
  124. package/docs/onnx-removal-plan.md +339 -0
  125. package/docs/research/autoresearch-portable.md +904 -0
  126. package/docs/research/dispatch-reduction-hivemind.md +84 -0
  127. package/docs/research/ios-safari-model-caching.md +117 -0
  128. package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
  129. package/docs/research/native-stt-model-selection.md +49 -0
  130. package/docs/research/native-tts-model-selection.md +90 -0
  131. package/docs/research/native-vs-chromium-decision.md +152 -0
  132. package/docs/research/nemotron-mamba2-inference.md +910 -0
  133. package/docs/research/qwen35-multimodal.md +293 -0
  134. package/docs/research/qwen36-gemma4-targets.md +337 -0
  135. package/docs/research/sota-embedding-models.md +179 -0
  136. package/docs/research/sota-mobile-models-2026.md +263 -0
  137. package/docs/research/sota-modality-models.md +202 -0
  138. package/docs/research/tps-baselines.md +71 -0
  139. package/docs/research/webgpu-m4-reference.md +104 -0
  140. package/docs/site-update-plan.md +155 -0
  141. package/docs/structured-output.md +123 -0
  142. package/docs/stt.md +63 -446
  143. package/docs/tts.md +77 -499
  144. package/docs/vision.md +100 -338
  145. package/package.json +22 -7
  146. package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
  147. package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
  148. package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
  149. package/dist/gerbil-CJ3ifloF.mjs +0 -4
  150. package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
  151. package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
  152. package/dist/gerbil-qOTe1nl2.d.mts +0 -431
  153. package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
  154. package/dist/kokoro-BNTb6egA.mjs +0 -20210
  155. package/dist/kokoro-BNTb6egA.mjs.map +0 -1
  156. package/dist/kokoro-CMOGDSgT.js +0 -20212
  157. package/dist/kokoro-CMOGDSgT.js.map +0 -1
  158. package/dist/mcp-BvbriaBy.mjs.map +0 -1
  159. package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
  160. package/dist/repl-DveXw36T.mjs +0 -9
  161. package/dist/skills-CD3Orlex.mjs.map +0 -1
  162. package/dist/stt-Bu-E23Sc.js +0 -433
  163. package/dist/stt-Bu-E23Sc.js.map +0 -1
  164. package/dist/stt-CpLYbGFd.mjs +0 -433
  165. package/dist/stt-CpLYbGFd.mjs.map +0 -1
  166. package/dist/stt-DRPLEEHB.mjs +0 -3
  167. package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
  168. package/dist/transformers.web-DiD1gTwk.js +0 -44695
  169. package/dist/transformers.web-DiD1gTwk.js.map +0 -1
  170. package/dist/transformers.web-u34VxRFM.js +0 -3
  171. package/dist/tts-CqroPaSK.js +0 -724
  172. package/dist/tts-CqroPaSK.js.map +0 -1
  173. package/dist/tts-DXgsKGCe.mjs +0 -3
  174. package/dist/tts-DeGANMNV.mjs +0 -730
  175. package/dist/tts-DeGANMNV.mjs.map +0 -1
  176. package/dist/types-CiTc7ez3.d.mts.map +0 -1
  177. /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
  178. /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
  179. /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -41,8 +41,12 @@ type GenerateOptions = {
    system?: string;
    /** Enable thinking/reasoning mode (Qwen3) */
    thinking?: boolean;
-   /** Callback for each token (streaming) */
-   onToken?: (token: string) => void;
+   /** Callback for each token (streaming); `meta` carries live decode-only tok/s */
+   onToken?: (token: string, meta?: {
+     tokenIndex: number;
+     tps: number;
+     elapsedMs: number;
+   }) => void;
    /** Images to include (only used if model supports vision) */
    images?: ImageInput[];
    /** Enable response caching (default: false) */
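The widened `onToken` signature above can be consumed roughly as follows. This is a minimal sketch, not the package's own code: `TokenMeta`, `OnToken`, and `createTokenCollector` are hypothetical names that mirror the `meta` shape shown in the hunk, and nothing is imported from `@tryhamster/gerbil`.

```typescript
// Local mirror of the `meta` shape from the hunk above (assumption: not imported from the package).
type TokenMeta = { tokenIndex: number; tps: number; elapsedMs: number };
type OnToken = (token: string, meta?: TokenMeta) => void;

// Hypothetical helper: builds an onToken callback that accumulates the streamed
// text and remembers the highest decode throughput it has seen.
function createTokenCollector(): {
  onToken: OnToken;
  text: () => string;
  peakTps: () => number;
  tokenCount: () => number;
} {
  let text = "";
  let peak = 0;
  let count = 0;
  return {
    onToken: (token, meta) => {
      text += token;
      count += 1;
      // `meta` is optional in the type, so guard before reading it.
      if (meta && meta.tps > peak) {
        peak = meta.tps;
      }
    },
    text: () => text,
    peakTps: () => peak,
    tokenCount: () => count,
  };
}
```

Because `meta` stays optional, callbacks written against the old single-argument signature should keep type-checking; a caller would pass `collector.onToken` wherever `GenerateOptions.onToken` is accepted.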
@@ -92,16 +96,47 @@ type EmbedResult = {
    /** Time in ms */
    totalTime: number;
  };
+ type SearchResult = {
+   /** The matched text */
+   text: string;
+   /** Similarity score (0-1, higher is more similar) */
+   score: number;
+   /** Index in the original corpus */
+   index: number;
+ };
+ type SimilarityResult = {
+   /** Similarity score (0-1, higher is more similar) */
+   score: number;
+   /** First text */
+   textA: string;
+   /** Second text */
+   textB: string;
+   /** Time in ms */
+   totalTime: number;
+ };
  type LoadOptions = {
    /** Progress callback */
    onProgress?: (info: ProgressInfo) => void;
-   /** Device: 'auto', 'gpu', 'cpu', 'webgpu' (default: 'auto') */
-   device?: "auto" | "gpu" | "cpu" | "webgpu";
-   /** Quantization: 'q4', 'q8', 'fp16', 'fp32' (default: 'q4') */
+   /**
+    * Compute device. The only inference backend is the native WebGPU engine
+    * (Dawn in Node, WebGPU in the browser); "auto" resolves to "webgpu". There
+    * is no CPU/WASM or ONNX path.
+    */
+   device?: "auto" | "webgpu";
+   /**
+    * Weight quantization. The engine quantizes to INT4 ("q4") on load; the other
+    * values are accepted for forward-compat but currently map to q4.
+    */
    dtype?: "q4" | "q8" | "fp16" | "fp32";
    /** Override context length */
    contextLength?: number;
  };
+ type PreloadOptions = {
+   /** Progress callback for download status */
+   onProgress?: (info: ProgressInfo) => void;
+   /** Keep model loaded in memory after preload (default: false - disposes to free memory) */
+   keepLoaded?: boolean;
+ };
  type ProgressInfo = {
    status: string;
    progress?: number;
@@ -112,14 +147,18 @@ type ProgressInfo = {
  type GerbilConfig = {
    /** Default model */
    model?: string;
-   /** Default device */
-   device?: "auto" | "gpu" | "cpu";
-   /** Default quantization */
+   /** Default device (native WebGPU only; "auto" resolves to "webgpu") */
+   device?: "auto" | "webgpu";
+   /** Default quantization (engine uses INT4 "q4") */
    dtype?: "q4" | "q8" | "fp16" | "fp32";
    /** Cache configuration */
    cache?: CacheConfig;
    /** Fallback configuration */
    fallback?: FallbackConfig;
+   /** Telemetry hooks for observability (Sentry, logging, etc.) */
+   telemetry?: TelemetryConfig;
+   /** Concurrency control for request queuing */
+   concurrency?: ConcurrencyConfig;
  };
  type CacheConfig = {
    /** Enable caching (default: true) */
@@ -183,14 +222,14 @@ type SystemInfo = {
  type GerbilModelSettings = {
    /** Enable thinking mode */
    thinking?: boolean;
-   /** Device to use */
-   device?: "auto" | "gpu" | "cpu";
+   /** Device to use (native WebGPU only) */
+   device?: "auto" | "webgpu";
    /** Quantization level */
    dtype?: "q4" | "q8" | "fp16" | "fp32";
  };
  type GerbilProviderSettings = {
-   /** Default device */
-   device?: "auto" | "gpu" | "cpu";
+   /** Default device (native WebGPU only) */
+   device?: "auto" | "webgpu";
    /** Default quantization */
    dtype?: "q4" | "q8" | "fp16" | "fp32";
  };
@@ -348,6 +387,65 @@ type StreamingTranscriptionSession = {
    /** Reset session (clear buffer and transcript) */
    reset: () => void;
  };
+ /**
+  * Telemetry hooks for production observability.
+  * Pass your own Sentry instance or custom logging functions.
+  */
+ type TelemetryConfig = {
+   /**
+    * Called after successful generation with full result and timing.
+    * Use for logging, metrics, or analytics.
+    */
+   onGenerate?: (event: GenerateEvent) => void;
+   /**
+    * Called when any error occurs during Gerbil operations.
+    * Perfect for Sentry.captureException() or similar.
+    */
+   onError?: (error: Error, context: ErrorContext) => void;
+   /**
+    * Called after model loading completes (success or failure).
+    */
+   onModelLoad?: (event: ModelLoadEvent) => void;
+   /**
+    * Called when a request is queued (if concurrency limit reached).
+    */
+   onQueueWait?: (waitTimeMs: number) => void;
+ };
+ type GenerateEvent = {
+   /** Model used for generation */
+   modelId: string;
+   /** Generation result */
+   result: GenerateResult;
+   /** Whether response came from cache */
+   cached: boolean;
+   /** Time spent waiting in queue (if any) */
+   queueTimeMs?: number;
+ };
+ /**
+  * Context passed to telemetry onError callback.
+  * Flexible record to allow any relevant context data.
+  */
+ type ErrorContext = Record<string, unknown>;
+ type ModelLoadEvent = {
+   /** Model that was loaded */
+   modelId: string;
+   /** Time to load in ms */
+   loadTimeMs: number;
+   /** Whether loaded from cache */
+   fromCache: boolean;
+   /** Device used */
+   device: "webgpu" | "cpu" | "wasm";
+   /** Whether load succeeded */
+   success: boolean;
+   /** Error message if failed */
+   error?: string;
+ };
+ type ConcurrencyConfig = {
+   /** Maximum concurrent generation requests (default: 1 for LLM) */
+   maxConcurrent?: number;
+   /** Request timeout in ms (default: 300000 = 5 min) */
+   timeout?: number;
+ };
  //#endregion
- export { TranscribeSegment as A, SpeakResult as C, TTSModelConfig as D, SystemInfo as E, TranscribeOptions as O, SpeakOptions as S, StreamingTranscriptionSession as T, ModelSource as _, FallbackConfig as a, STTModelConfig as b, GerbilConfig as c, ImageInput as d, JsonOptions as f, ModelConfig as g, LoadTTSOptions as h, EmbedResult as i, VoiceInfo as j, TranscribeResult as k, GerbilModelSettings as l, LoadSTTOptions as m, CacheConfig as n, GenerateOptions as o, LoadOptions as p, EmbedOptions as r, GenerateResult as s, AudioChunk as t, GerbilProviderSettings as u, ModelStats as v, StreamingTranscriptionOptions as w, SessionStats as x, ProgressInfo as y };
- //# sourceMappingURL=types-CiTc7ez3.d.mts.map
+ export { SpeakResult as A, PreloadOptions as C, SessionStats as D, SearchResult as E, TelemetryConfig as F, TranscribeOptions as I, TranscribeResult as L, StreamingTranscriptionSession as M, SystemInfo as N, SimilarityResult as O, TTSModelConfig as P, TranscribeSegment as R, ModelStats as S, STTModelConfig as T, LoadSTTOptions as _, EmbedResult as a, ModelLoadEvent as b, GenerateEvent as c, GerbilConfig as d, GerbilModelSettings as f, LoadOptions as g, JsonOptions as h, EmbedOptions as i, StreamingTranscriptionOptions as j, SpeakOptions as k, GenerateOptions as l, ImageInput as m, CacheConfig as n, ErrorContext as o, GerbilProviderSettings as p, ConcurrencyConfig as r, FallbackConfig as s, AudioChunk as t, GenerateResult as u, LoadTTSOptions as v, ProgressInfo as w, ModelSource as x, ModelConfig as y, VoiceInfo as z };
+ //# sourceMappingURL=types-LlyYILII.d.mts.map
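The telemetry and concurrency types added in this file could be wired up along these lines. This is a hedged sketch under stated assumptions: the types are local, trimmed mirrors of the declarations in the diff, `resolveConcurrency` and `createMemoryTelemetry` are hypothetical helpers, and the default values are the ones stated in the doc comments (`maxConcurrent` 1, `timeout` 300000 ms), not values read from the engine.

```typescript
// Local, trimmed mirrors of the types added in the diff (assumption: not imported from the package).
type ErrorContext = Record<string, unknown>;
type ModelLoadEvent = {
  modelId: string;
  loadTimeMs: number;
  fromCache: boolean;
  device: "webgpu" | "cpu" | "wasm";
  success: boolean;
  error?: string;
};
type TelemetryConfig = {
  onError?: (error: Error, context: ErrorContext) => void;
  onModelLoad?: (event: ModelLoadEvent) => void;
  onQueueWait?: (waitTimeMs: number) => void;
};
type ConcurrencyConfig = { maxConcurrent?: number; timeout?: number };

// Hypothetical helper: fill in the defaults documented in the type comments.
function resolveConcurrency(config: ConcurrencyConfig = {}): Required<ConcurrencyConfig> {
  return {
    maxConcurrent: config.maxConcurrent ?? 1,
    timeout: config.timeout ?? 300_000,
  };
}

// Hypothetical in-memory sink; a real setup might forward to Sentry or a logger instead.
function createMemoryTelemetry(): { telemetry: TelemetryConfig; log: string[] } {
  const log: string[] = [];
  const telemetry: TelemetryConfig = {
    onError: (error, context) => {
      log.push(`error:${error.message}:${JSON.stringify(context)}`);
    },
    onModelLoad: (event) => {
      log.push(`load:${event.modelId}:${event.success ? "ok" : "failed"}:${event.loadTimeMs}ms`);
    },
    onQueueWait: (waitTimeMs) => {
      log.push(`queued:${waitTimeMs}ms`);
    },
  };
  return { telemetry, log };
}
```

Every hook is optional in the declared type, so a consumer can supply only the callbacks it cares about; the resulting objects would be passed as the `telemetry` and `concurrency` fields of `GerbilConfig`.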
@@ -0,0 +1 @@
+ {"version":3,"file":"types-LlyYILII.d.mts","names":[],"sources":["../src/core/types.ts"],"sourcesContent":[],"mappings":";;;;AAyBY,KAfA,WAAA,GAeW;EASX,EAAA,EAAA,MAAA;EAWA,IAAA,EAAA,MAAA;EAmCA,WAAA,EAAA,MAAc;EA8Bd,IAAA,EAAA,MAAA;EAkBA,aAAA,EAAY,MAAA;EAQZ,gBAAW,EAAA,OAAA;EAWX,YAAA,EAAA,OAAY;EAWZ;EAkBA,cAAW,CAAA,EAAA,OAAA;EAqBX;EAOA,iBAAY,CAAA,EAAA,MAAA;EAYZ,MAAA,EAAA,MAAA,GAAY,QAAA,GAAA,KAAA,GAAA,SAAA,GAAA,OAAA,GAAA,OAAA;CAWd;AAGG,KA7MD,WAAA,GA6MC;EAGC,IAAA,EAAA,SAAA,GAAA,aAAA,GAAA,OAAA;EAGE,IAAA,EAAA,MAAA;CAAiB;AAGrB,KA7MA,UAAA,GA6MW;EAiBX;EAqBA,MAAA,EAAA,MAAA;EAUA;EAOA,GAAA,CAAA,EAAA,MAAA;AAyBZ,CAAA;AAWY,KA7RA,eAAA,GA6RsB;EAYtB;EAeA,SAAA,CAAA,EAAA,MAAc;EAmBd;EAWA,WAAA,CAAA,EAAU,MAAA;EAWV;EAaA,IAAA,CAAA,EAAA,MAAA;EAWA;EAiBA,IAAA,CAAA,EAAA,MAAA;EASA;EASA,aAAA,CAAA,EAAA,MAAgB,EAAA;EAahB;EAWA,MAAA,CAAA,EAAA,MAAA;EAeA;EAES,QAAA,CAAA,EAAA,OAAA;EAEN;EAID,OAAA,CAAA,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,IAgCM,CAhCN,EAAA;IAAO,UAAA,EAAA,MAAA;IAqBT,GAAA,EAAA,MAAA;IAKW,SAAA,EAAA,MAAA;EAMH,CAAA,EAAA,GAAA,IAAA;EAAgB;EAKZ,MAAA,CAAA,EAtdb,UAsda,EAAA;EAAc;EAQ1B,KAAA,CAAA,EAAA,OAAA;EAeA;EAEA,QAAA,CAAA,EAAA,MAAc;AAmB1B,CAAA;KAzfY,cAAA;;;;;;;;;;;;;;;;;;KA8BA;;UAEF,CAAA,CAAE,QAAQ;;;;;;;;KAgBR,YAAA;;;;;;KAQA,WAAA;;;;;;;;KAWA,YAAA;;;;;;;;KAWA,gBAAA;;;;;;;;;;KAkBA,WAAA;;sBAEU;;;;;;;;;;;;;;;KAmBV,cAAA;;sBAEU;;;;KAKV,YAAA;;;;;;;KAYA,YAAA;;;;;;;;UAWF;;aAGG;;cAGC;;gBAGE;;KAGJ,WAAA;;;;;;;;;;;;KAiBA,cAAA;;;;;;;;;;;;KAqBA,YAAA;;;;;;;;;KAUA,UAAA;;;;;;KAOA,UAAA;;SAEH;;;;;;;;;;;;;;;;;;KAuBG,mBAAA;;;;;;;;KAWA,sBAAA;;;;;;KAYA,SAAA;;;;;;;;;;;;;;KAeA,cAAA;;;;;;;;;;;;UAYF;;;;;;KAOE,YAAA;;;;;;sBAMU;;yBAEG;;KAGb,UAAA;;WAED;;;;;;;;KASC,WAAA;;SAEH;;;;;;;;;;KAWG,cAAA;;sBAEU;;;;KASV,cAAA;;;;;;;;;;;;;;;;KAiBA,iBAAA;;;;;;sBAMU;;KAGV,iBAAA;;;;;;;;KASA,gBAAA;;;;;;aAMC;;;;;;KAOD,cAAA;;sBAEU;;;;KASV,6BAAA;;;;;;;;;;;;;;KAeA,6BAAA;;qBAES;;eAEN;;;;cAID;;;;;;;;;;;;;;;;KAqBF,eAAA;;;;;uBAKW;;;;;oBAMH,gBAAgB;;;;wBAKZ;;;;;;KAQZ,aAAA;;;;UAIF;;;;;;;;;;KAWE,YAAA,GAAe;KAEf,cAAA;;;;;;;;;;;;;;KAmBA,iBAAA"}
@@ -60,4 +60,4 @@ function extractJson(text) {
  //#endregion
  export { zodToJsonSchema as n, extractJson as t };
- //# sourceMappingURL=utils-CZBZ8dgR.mjs.map
+ //# sourceMappingURL=utils-DKO55ZmZ.mjs.map
@@ -1 +1 @@
- {"version":3,"file":"utils-CZBZ8dgR.mjs","names":["properties: Record<string, any>","required: string[]"],"sources":["../src/core/utils.ts"],"sourcesContent":["/**\n * Shared utility functions for Gerbil core\n */\n\nimport type { z } from \"zod\";\n\n/**\n * Convert Zod schema to JSON Schema (simplified)\n * Handles objects, arrays, primitives, enums, optionals, and defaults\n */\nexport function zodToJsonSchema(schema: z.ZodType<any>): object {\n try {\n if (\"_def\" in schema) {\n const def = (schema as any)._def;\n\n if (def.typeName === \"ZodObject\") {\n const shape = def.shape();\n const properties: Record<string, any> = {};\n const required: string[] = [];\n\n for (const [key, value] of Object.entries(shape)) {\n properties[key] = zodToJsonSchema(value as z.ZodType<any>);\n // Check if required (not optional)\n if (!(value as any)._def?.typeName?.includes(\"Optional\")) {\n required.push(key);\n }\n }\n\n return { type: \"object\", properties, required };\n }\n if (def.typeName === \"ZodString\") {\n return { type: \"string\", description: def.description };\n }\n if (def.typeName === \"ZodNumber\") {\n return { type: \"number\", description: def.description };\n }\n if (def.typeName === \"ZodBoolean\") {\n return { type: \"boolean\" };\n }\n if (def.typeName === \"ZodArray\") {\n return { type: \"array\", items: zodToJsonSchema(def.type) };\n }\n if (def.typeName === \"ZodEnum\") {\n return { type: \"string\", enum: def.values };\n }\n if (def.typeName === \"ZodOptional\") {\n return zodToJsonSchema(def.innerType);\n }\n if (def.typeName === \"ZodDefault\") {\n const inner = zodToJsonSchema(def.innerType);\n return { ...inner, default: def.defaultValue() };\n }\n }\n } catch {}\n\n return { type: \"string\" };\n}\n\n/**\n * Extract JSON from text (finds first { } or [ ] block)\n */\nexport function extractJson(text: string): string {\n const jsonMatch = text.match(/\\{[\\s\\S]*\\}/);\n if (jsonMatch) {\n return jsonMatch[0];\n }\n\n const arrayMatch = 
text.match(/\\[[\\s\\S]*\\]/);\n if (arrayMatch) {\n return arrayMatch[0];\n }\n\n return text;\n}\n"],"mappings":";;;;;AAUA,SAAgB,gBAAgB,QAAgC;AAC9D,KAAI;AACF,MAAI,UAAU,QAAQ;GACpB,MAAM,MAAO,OAAe;AAE5B,OAAI,IAAI,aAAa,aAAa;IAChC,MAAM,QAAQ,IAAI,OAAO;IACzB,MAAMA,aAAkC,EAAE;IAC1C,MAAMC,WAAqB,EAAE;AAE7B,SAAK,MAAM,CAAC,KAAK,UAAU,OAAO,QAAQ,MAAM,EAAE;AAChD,gBAAW,OAAO,gBAAgB,MAAwB;AAE1D,SAAI,CAAE,MAAc,MAAM,UAAU,SAAS,WAAW,CACtD,UAAS,KAAK,IAAI;;AAItB,WAAO;KAAE,MAAM;KAAU;KAAY;KAAU;;AAEjD,OAAI,IAAI,aAAa,YACnB,QAAO;IAAE,MAAM;IAAU,aAAa,IAAI;IAAa;AAEzD,OAAI,IAAI,aAAa,YACnB,QAAO;IAAE,MAAM;IAAU,aAAa,IAAI;IAAa;AAEzD,OAAI,IAAI,aAAa,aACnB,QAAO,EAAE,MAAM,WAAW;AAE5B,OAAI,IAAI,aAAa,WACnB,QAAO;IAAE,MAAM;IAAS,OAAO,gBAAgB,IAAI,KAAK;IAAE;AAE5D,OAAI,IAAI,aAAa,UACnB,QAAO;IAAE,MAAM;IAAU,MAAM,IAAI;IAAQ;AAE7C,OAAI,IAAI,aAAa,cACnB,QAAO,gBAAgB,IAAI,UAAU;AAEvC,OAAI,IAAI,aAAa,aAEnB,QAAO;IAAE,GADK,gBAAgB,IAAI,UAAU;IACzB,SAAS,IAAI,cAAc;IAAE;;SAG9C;AAER,QAAO,EAAE,MAAM,UAAU;;;;;AAM3B,SAAgB,YAAY,MAAsB;CAChD,MAAM,YAAY,KAAK,MAAM,cAAc;AAC3C,KAAI,UACF,QAAO,UAAU;CAGnB,MAAM,aAAa,KAAK,MAAM,cAAc;AAC5C,KAAI,WACF,QAAO,WAAW;AAGpB,QAAO"}
+ {"version":3,"file":"utils-DKO55ZmZ.mjs","names":["properties: Record<string, any>","required: string[]"],"sources":["../src/core/utils.ts"],"sourcesContent":["/**\n * Shared utility functions for Gerbil core\n */\n\nimport type { z } from \"zod\";\n\n/**\n * Convert Zod schema to JSON Schema (simplified)\n * Handles objects, arrays, primitives, enums, optionals, and defaults\n */\nexport function zodToJsonSchema(schema: z.ZodType<any>): object {\n try {\n if (\"_def\" in schema) {\n const def = (schema as any)._def;\n\n if (def.typeName === \"ZodObject\") {\n const shape = def.shape();\n const properties: Record<string, any> = {};\n const required: string[] = [];\n\n for (const [key, value] of Object.entries(shape)) {\n properties[key] = zodToJsonSchema(value as z.ZodType<any>);\n // Check if required (not optional)\n if (!(value as any)._def?.typeName?.includes(\"Optional\")) {\n required.push(key);\n }\n }\n\n return { type: \"object\", properties, required };\n }\n if (def.typeName === \"ZodString\") {\n return { type: \"string\", description: def.description };\n }\n if (def.typeName === \"ZodNumber\") {\n return { type: \"number\", description: def.description };\n }\n if (def.typeName === \"ZodBoolean\") {\n return { type: \"boolean\" };\n }\n if (def.typeName === \"ZodArray\") {\n return { type: \"array\", items: zodToJsonSchema(def.type) };\n }\n if (def.typeName === \"ZodEnum\") {\n return { type: \"string\", enum: def.values };\n }\n if (def.typeName === \"ZodOptional\") {\n return zodToJsonSchema(def.innerType);\n }\n if (def.typeName === \"ZodDefault\") {\n const inner = zodToJsonSchema(def.innerType);\n return { ...inner, default: def.defaultValue() };\n }\n }\n } catch {}\n\n return { type: \"string\" };\n}\n\n/**\n * Extract JSON from text (finds first { } or [ ] block)\n */\nexport function extractJson(text: string): string {\n const jsonMatch = text.match(/\\{[\\s\\S]*\\}/);\n if (jsonMatch) {\n return jsonMatch[0];\n }\n\n const arrayMatch = 
text.match(/\\[[\\s\\S]*\\]/);\n if (arrayMatch) {\n return arrayMatch[0];\n }\n\n return text;\n}\n"],"mappings":";;;;;AAUA,SAAgB,gBAAgB,QAAgC;AAC9D,KAAI;AACF,MAAI,UAAU,QAAQ;GACpB,MAAM,MAAO,OAAe;AAE5B,OAAI,IAAI,aAAa,aAAa;IAChC,MAAM,QAAQ,IAAI,OAAO;IACzB,MAAMA,aAAkC,EAAE;IAC1C,MAAMC,WAAqB,EAAE;AAE7B,SAAK,MAAM,CAAC,KAAK,UAAU,OAAO,QAAQ,MAAM,EAAE;AAChD,gBAAW,OAAO,gBAAgB,MAAwB;AAE1D,SAAI,CAAE,MAAc,MAAM,UAAU,SAAS,WAAW,CACtD,UAAS,KAAK,IAAI;;AAItB,WAAO;KAAE,MAAM;KAAU;KAAY;KAAU;;AAEjD,OAAI,IAAI,aAAa,YACnB,QAAO;IAAE,MAAM;IAAU,aAAa,IAAI;IAAa;AAEzD,OAAI,IAAI,aAAa,YACnB,QAAO;IAAE,MAAM;IAAU,aAAa,IAAI;IAAa;AAEzD,OAAI,IAAI,aAAa,aACnB,QAAO,EAAE,MAAM,WAAW;AAE5B,OAAI,IAAI,aAAa,WACnB,QAAO;IAAE,MAAM;IAAS,OAAO,gBAAgB,IAAI,KAAK;IAAE;AAE5D,OAAI,IAAI,aAAa,UACnB,QAAO;IAAE,MAAM;IAAU,MAAM,IAAI;IAAQ;AAE7C,OAAI,IAAI,aAAa,cACnB,QAAO,gBAAgB,IAAI,UAAU;AAEvC,OAAI,IAAI,aAAa,aAEnB,QAAO;IAAE,GADK,gBAAgB,IAAI,UAAU;IACzB,SAAS,IAAI,cAAc;IAAE;;SAG9C;AAER,QAAO,EAAE,MAAM,UAAU;;;;;AAM3B,SAAgB,YAAY,MAAsB;CAChD,MAAM,YAAY,KAAK,MAAM,cAAc;AAC3C,KAAI,UACF,QAAO,UAAU;CAGnB,MAAM,aAAa,KAAK,MAAM,cAAc;AAC5C,KAAI,WACF,QAAO,WAAW;AAGpB,QAAO"}
@@ -0,0 +1,95 @@
+ //#region src/memory/serialize.ts
+ /** Convert a runtime record to its JSON-safe form. */
+ function serializeRecord(record) {
+   return {
+     id: record.id,
+     text: record.text,
+     embedding: record.embedding ? Array.from(record.embedding) : void 0,
+     metadata: record.metadata,
+     createdAt: record.createdAt
+   };
+ }
+ /** Rebuild a runtime record (with a {@link Float32Array}) from JSON form. */
+ function deserializeRecord(record) {
+   return {
+     id: record.id,
+     text: record.text,
+     embedding: record.embedding ? Float32Array.from(record.embedding) : void 0,
+     metadata: record.metadata ?? {},
+     createdAt: record.createdAt
+   };
+ }
+ /**
+  * True when every key in `filter` is present on `metadata` with an equal
+  * (strict ===) value. An empty/undefined filter matches everything.
+  */
+ function matchesFilter(metadata, filter) {
+   if (!filter) return true;
+   for (const key of Object.keys(filter)) if (metadata[key] !== filter[key]) return false;
+   return true;
+ }
+
+ //#endregion
+ //#region src/memory/vector.ts
+ /**
+  * Vector math for cosine similarity search.
+  *
+  * Vectors are stored L2-normalized so cosine similarity is a plain dot
+  * product. Normalization happens once on insert via {@link normalize}.
+  */
+ /**
+  * Return an L2-normalized copy of `vector`.
+  *
+  * A zero vector is returned unchanged (its norm is 0).
+  */
+ function normalize(vector) {
+   let sumSquares = 0;
+   for (let i = 0; i < vector.length; i++) sumSquares += vector[i] * vector[i];
+   const norm = Math.sqrt(sumSquares);
+   if (norm === 0) return vector;
+   const out = new Float32Array(vector.length);
+   for (let i = 0; i < vector.length; i++) out[i] = vector[i] / norm;
+   return out;
+ }
+ /**
+  * Dot product of two equal-length vectors.
+  *
+  * For L2-normalized inputs this equals cosine similarity.
+  */
+ function dot(a, b) {
+   const length = Math.min(a.length, b.length);
+   let sum = 0;
+   for (let i = 0; i < length; i++) sum += a[i] * b[i];
+   return sum;
+ }
+ /**
+  * Cosine similarity of two vectors, normalizing on the fly.
+  *
+  * Prefer {@link dot} when both inputs are already normalized.
+  */
+ function cosine(a, b) {
+   return dot(normalize(a), normalize(b));
+ }
+ /**
+  * Score every candidate against `query` (dot product) and return the top `k`
+  * by descending score, optionally filtering by a minimum score.
+  *
+  * Inputs are assumed normalized; this keeps the hot path branch-free.
+  */
+ function topK(query, candidates, k, minScore) {
+   const scored = [];
+   for (const candidate of candidates) {
+     const score = dot(query, candidate.vector);
+     if (minScore !== void 0 && score < minScore) continue;
+     scored.push({
+       item: candidate.item,
+       score
+     });
+   }
+   scored.sort((a, b) => b.score - a.score);
+   return scored.slice(0, k);
+ }
+
+ //#endregion
+ export { deserializeRecord as a, topK as i, dot as n, matchesFilter as o, normalize as r, serializeRecord as s, cosine as t };
+ //# sourceMappingURL=vector-B0panuy6.mjs.map
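To make the normalize-once, dot-product-at-query-time design concrete, here is a small typed sketch. The three functions restate the logic shown in the added file; the corpus strings and the two-dimensional vectors are made up for illustration and are not embeddings from any real model. The chunk exports these functions under minified aliases, so the sketch redefines them locally rather than importing them.

```typescript
type Scored<T> = { item: T; score: number };

// L2-normalize; a zero vector is returned unchanged, as in the diff.
function normalize(vector: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < vector.length; i++) sumSquares += vector[i] * vector[i];
  const norm = Math.sqrt(sumSquares);
  if (norm === 0) return vector;
  const out = new Float32Array(vector.length);
  for (let i = 0; i < vector.length; i++) out[i] = vector[i] / norm;
  return out;
}

// Dot product over the shorter of the two lengths; equals cosine similarity
// when both inputs are already L2-normalized.
function dot(a: Float32Array, b: Float32Array): number {
  const length = Math.min(a.length, b.length);
  let sum = 0;
  for (let i = 0; i < length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every candidate, drop those below minScore, sort descending, keep k.
function topK<T>(
  query: Float32Array,
  candidates: { item: T; vector: Float32Array }[],
  k: number,
  minScore?: number,
): Scored<T>[] {
  const scored: Scored<T>[] = [];
  for (const candidate of candidates) {
    const score = dot(query, candidate.vector);
    if (minScore !== undefined && score < minScore) continue;
    scored.push({ item: candidate.item, score });
  }
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k);
}

// Toy corpus (illustrative values only): vectors are normalized once on insert.
const corpus = [
  { item: "cats purr", vector: normalize(Float32Array.of(1, 0)) },
  { item: "dogs bark", vector: normalize(Float32Array.of(0, 1)) },
  { item: "kittens purr", vector: normalize(Float32Array.of(3, 1)) },
];

// The query is normalized too, so each score is a cosine similarity.
const hits = topK(normalize(Float32Array.of(2, 0)), corpus, 2, 0.5);
console.log(hits.map((hit) => hit.item)); // [ 'cats purr', 'kittens purr' ]
```

"dogs bark" scores 0 against the query and is dropped by the `minScore` of 0.5 before sorting, which is why only two items survive. Note that `topK` sorts the full scored list, so the cost is O(n log n) in the number of candidates that pass the threshold.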
@@ -0,0 +1 @@
+ {"version":3,"file":"vector-B0panuy6.mjs","names":["scored: Scored<T>[]"],"sources":["../src/memory/serialize.ts","../src/memory/vector.ts"],"sourcesContent":["/**\n * (De)serialization helpers shared by stores and import/export.\n *\n * Embeddings are stored at runtime as {@link Float32Array} but serialized as\n * plain number arrays so records survive `JSON.stringify` and IndexedDB\n * structured clone round-trips.\n */\n\nimport type { MemoryRecord, SerializedRecord } from \"./types.js\";\n\n/** Convert a runtime record to its JSON-safe form. */\nexport function serializeRecord(record: MemoryRecord): SerializedRecord {\n return {\n id: record.id,\n text: record.text,\n embedding: record.embedding ? Array.from(record.embedding) : undefined,\n metadata: record.metadata,\n createdAt: record.createdAt,\n };\n}\n\n/** Rebuild a runtime record (with a {@link Float32Array}) from JSON form. */\nexport function deserializeRecord(record: SerializedRecord): MemoryRecord {\n return {\n id: record.id,\n text: record.text,\n embedding: record.embedding ? Float32Array.from(record.embedding) : undefined,\n metadata: record.metadata ?? {},\n createdAt: record.createdAt,\n };\n}\n\n/**\n * True when every key in `filter` is present on `metadata` with an equal\n * (strict ===) value. An empty/undefined filter matches everything.\n */\nexport function matchesFilter(\n metadata: Record<string, unknown>,\n filter?: Record<string, unknown>,\n): boolean {\n if (!filter) {\n return true;\n }\n for (const key of Object.keys(filter)) {\n if (metadata[key] !== filter[key]) {\n return false;\n }\n }\n return true;\n}\n","/**\n * Vector math for cosine similarity search.\n *\n * Vectors are stored L2-normalized so cosine similarity is a plain dot\n * product. 
Normalization happens once on insert via {@link normalize}.\n */\n\n/**\n * Return an L2-normalized copy of `vector`.\n *\n * A zero vector is returned unchanged (its norm is 0).\n */\nexport function normalize(vector: Float32Array): Float32Array {\n let sumSquares = 0;\n for (let i = 0; i < vector.length; i++) {\n sumSquares += vector[i] * vector[i];\n }\n const norm = Math.sqrt(sumSquares);\n if (norm === 0) {\n return vector;\n }\n const out = new Float32Array(vector.length);\n for (let i = 0; i < vector.length; i++) {\n out[i] = vector[i] / norm;\n }\n return out;\n}\n\n/**\n * Dot product of two equal-length vectors.\n *\n * For L2-normalized inputs this equals cosine similarity.\n */\nexport function dot(a: Float32Array, b: Float32Array): number {\n const length = Math.min(a.length, b.length);\n let sum = 0;\n for (let i = 0; i < length; i++) {\n sum += a[i] * b[i];\n }\n return sum;\n}\n\n/**\n * Cosine similarity of two vectors, normalizing on the fly.\n *\n * Prefer {@link dot} when both inputs are already normalized.\n */\nexport function cosine(a: Float32Array, b: Float32Array): number {\n return dot(normalize(a), normalize(b));\n}\n\n/** A scored item used by {@link topK}. 
*/\nexport type Scored<T> = { item: T; score: number };\n\n/**\n * Score every candidate against `query` (dot product) and return the top `k`\n * by descending score, optionally filtering by a minimum score.\n *\n * Inputs are assumed normalized; this keeps the hot path branch-free.\n */\nexport function topK<T>(\n query: Float32Array,\n candidates: { item: T; vector: Float32Array }[],\n k: number,\n minScore?: number,\n): Scored<T>[] {\n const scored: Scored<T>[] = [];\n for (const candidate of candidates) {\n const score = dot(query, candidate.vector);\n if (minScore !== undefined && score < minScore) {\n continue;\n }\n scored.push({ item: candidate.item, score });\n }\n scored.sort((a, b) => b.score - a.score);\n return scored.slice(0, k);\n}\n"],"mappings":";;AAWA,SAAgB,gBAAgB,QAAwC;AACtE,QAAO;EACL,IAAI,OAAO;EACX,MAAM,OAAO;EACb,WAAW,OAAO,YAAY,MAAM,KAAK,OAAO,UAAU,GAAG;EAC7D,UAAU,OAAO;EACjB,WAAW,OAAO;EACnB;;;AAIH,SAAgB,kBAAkB,QAAwC;AACxE,QAAO;EACL,IAAI,OAAO;EACX,MAAM,OAAO;EACb,WAAW,OAAO,YAAY,aAAa,KAAK,OAAO,UAAU,GAAG;EACpE,UAAU,OAAO,YAAY,EAAE;EAC/B,WAAW,OAAO;EACnB;;;;;;AAOH,SAAgB,cACd,UACA,QACS;AACT,KAAI,CAAC,OACH,QAAO;AAET,MAAK,MAAM,OAAO,OAAO,KAAK,OAAO,CACnC,KAAI,SAAS,SAAS,OAAO,KAC3B,QAAO;AAGX,QAAO;;;;;;;;;;;;;;;;ACpCT,SAAgB,UAAU,QAAoC;CAC5D,IAAI,aAAa;AACjB,MAAK,IAAI,IAAI,GAAG,IAAI,OAAO,QAAQ,IACjC,eAAc,OAAO,KAAK,OAAO;CAEnC,MAAM,OAAO,KAAK,KAAK,WAAW;AAClC,KAAI,SAAS,EACX,QAAO;CAET,MAAM,MAAM,IAAI,aAAa,OAAO,OAAO;AAC3C,MAAK,IAAI,IAAI,GAAG,IAAI,OAAO,QAAQ,IACjC,KAAI,KAAK,OAAO,KAAK;AAEvB,QAAO;;;;;;;AAQT,SAAgB,IAAI,GAAiB,GAAyB;CAC5D,MAAM,SAAS,KAAK,IAAI,EAAE,QAAQ,EAAE,OAAO;CAC3C,IAAI,MAAM;AACV,MAAK,IAAI,IAAI,GAAG,IAAI,QAAQ,IAC1B,QAAO,EAAE,KAAK,EAAE;AAElB,QAAO;;;;;;;AAQT,SAAgB,OAAO,GAAiB,GAAyB;AAC/D,QAAO,IAAI,UAAU,EAAE,EAAE,UAAU,EAAE,CAAC;;;;;;;;AAYxC,SAAgB,KACd,OACA,YACA,GACA,UACa;CACb,MAAMA,SAAsB,EAAE;AAC9B,MAAK,MAAM,aAAa,YAAY;EAClC,MAAM,QAAQ,IAAI,OAAO,UAAU,OAAO;AAC1C,MAAI,aAAa,UAAa,QAAQ,SACpC;AAEF,SAAO,KAAK;GAAE,MAAM,UAAU;GAAM;GAAO,CAAC;;AAE9C,QAAO,MAAM,GAAG,MAAM,EAAE,QAA
Q,EAAE,MAAM;AACxC,QAAO,OAAO,MAAM,GAAG,EAAE"}
@@ -0,0 +1,321 @@
1
+ # Gerbil — Project State & Architecture Decision
2
+
3
+ **As of 2026-06-14.** The single authoritative snapshot. When something here conflicts
4
+ with an older doc, this wins. Supersedes scattered findings in `docs/research/*` and
5
+ the paper's roadmap. **Newest material is §12** (native audio begins — Moonshine STT
6
+ + Kani-TTS-2 NanoCodec decoder; Gemma 4 E2B text decode; on-device memory/RAG; the
7
+ text+ViT autoresearch campaign; gerbil-site live on the native engine). §11 captured
8
+ EmbeddingGemma on-device, LFM2.5, the SPM tokenizer fix, MLX/DWQ loader, progress fix.
9
+
10
+ ---
11
+
12
+ ## 1. What Gerbil is
13
+
14
+ A local LLM inference library that runs models in the browser and Node on a
15
+ **single native WebGPU engine** behind one task API (no fallback lane — §2). The
16
+ headline achievements this cycle: the from-scratch native engine **works on mobile**
17
+ (iPad/iOS Safari 26.5+, WebKit) — previously it crashed — a desktop optimization
18
+ pass roughly doubled its throughput, and the engine went **multimodal natively**:
19
+ text embeddings ship (Qwen3-Embedding-0.6B) and the Qwen3.5 vision encoder is built
20
+ **bit-exact vs HuggingFace** (vision LM-integration is phase 2).
21
+
22
+ ## 2. The decided architecture — NATIVE-ONLY (owner decision, overrides the panel)
23
+
24
+ **Decision (owner, 2026-06-13):** Gerbil is a single native WebGPU engine. **No
25
+ tfjs fallback lane** — a permanent fallback "assumes defeat to begin with." The
26
+ panel proposed keeping tfjs as a breadth lane; the owner rejected that. Instead:
27
+
28
+ - **Launch set = text + vision + embeddings, ALL native.** One model per modality
29
+ by default, expandable to other families via the add-model-family process.
30
+ - **Audio (TTS/STT) is deferred, not delegated to tfjs** — it ships later as
31
+ small *native* models (candidates under eval: VibeVoice-1.5B, OmniVoice,
32
+ dots.tts-soar for TTS — IF they publish safetensors; Moonshine for STT). "What's
33
+ the big deal" — launching without audio is fine; a permanent second engine is not.
34
+ - **tfjs is at most temporary dev scaffolding, not a destination.** It may stay
35
+ briefly to keep desktop demos working during the native build, but it is being
36
+ removed, not kept as a lane. `chrome-backend.ts` is deleted outright.
37
+ - **Vision uses Qwen3.5's OWN built-in ViT** (we currently skip its 192MB tower) —
38
+ one multimodal model, not a separate vision model. Stop dropping it "like idiots."
39
+ - **A thin onnxruntime-web bridge is a break-glass option only** — used solely if
40
+ a needed model has no extractable weights AND no native alternative. Not a lane.
41
+
42
+ **Status update (2026-06-13):** both next modalities have landed natively.
43
+ **Embeddings ship** (Qwen3-Embedding-0.6B, validated). The **vision encoder is
44
+ DONE** — the Qwen3.5 ViT runs natively and is **bit-exact vs HF transformers 5.12**
45
+ (per-token cosine 1.000000). The earlier "parallel attention kernel first" gate was
46
+ based on a misread of dead code (§5): the attention kernel was already parallel, and
47
+ non-causal was a one-line `is_causal` flag, not a rewrite. The remaining vision work
48
+ is **LM-side integration** (M-RoPE, token splice, image preprocessing) — phase 2,
49
+ plumbing over a verified core. Audio follows once a native small model is validated.
50
+
51
+ ## 3. Capability matrix (what's true today)
52
+
53
+ Native-only. The "tfjs lane" is gone — tfjs is temporary dev scaffolding being
54
+ removed (§2), not a capability path.
55
+
56
+ | Modality | Native status |
57
+ |---|---|
58
+ | Text | ✅ ~51 tok/s mobile (sustained 200-tok), ~207 desktop, bit-correct |
59
+ | Text (alt family) | ✅ **LFM2.5-350M LANDED** (`Lfm2ForCausalLM`, hybrid conv/attn): ~600 tok/s desktop (2.8× Qwen), ~46 tok/s mobile, ~199MB q4, no new kernels |
60
+ | Embeddings | ✅ **DONE + ON iPad** — **EmbeddingGemma-300M** (bidirectional Gemma3 encoder, 173MB MLX-4bit) runs on iPad Safari; cos=1.00000 vs NumPy ref, Mars/bread margin >0.1, dim 768. First non-Qwen embedder. (Qwen3-Embedding-0.6B still works on desktop but OOMs iPad at 1.2GB BF16.) |
61
+ | Vision (image) | ✅ **END-TO-END DONE** — Qwen3.5 ViT, bit-exact vs HF (cosine 1.000000); `describeImage()` word-identical to HF; runs on iPad |
62
+ | Text model families (Llama/Mistral/Gemma) | 🟢 cheap — now usually **Tier-1, generator-only, NO new kernels** (kernel library saturated for standard transformers) |
63
+ | Text (Gemma 4 E2B) | ✅ **COHERENT on q4** (`gemma4.ts`): "capital of France"→"Paris", ~83 tok/s, all 35 layers cos≥0.998 vs MLX-LM. PLE, KV-share (20 layers), proportional/dual-theta RoPE, GeGLU, `Softcap`, double-wide MLP, V-norm, per-node `attn_scale`, head_dim-512 attention. **PLE CPU-streamed (0 MB GPU)** — not GPU-sharded |
64
+ | STT | 🟢 **Moonshine native LANDED** (`moonshine.ts`, `moonshine-executor.ts`, `moonshine-stt.ts`): raw-waveform Conv1d front-end (no FFT/log-mel), bit-exact `CrossAttention` kernel (max|err|<2e-4, cos≥0.9999), dual-graph encode-once/frozen-K-V/AR-decode, interleaved RoPE. Encoder cos≈0.990 vs HF; transcript substring-matches HF refs. **Whisper(ONNX) stays as multilingual / no-GPU fallback** (`src/core/stt.ts`) |
65
+ | TTS | 🟡 **Kani-TTS-2 partial** (`kani_tts.ts`): NanoCodec decoder (FSQ + causal HiFi-GAN) **LANDED + validated bit-exact** (`test-nanocodec-decode.mjs`, gate err<1e-3, measured ~4.2e-6). LFM2-350M codec-LM backbone scaffolded; **AR-loop glue remaining** (frame positions + learnable RoPE + 4-token-frame decode). License: kani-tts-2-en = LFM1.0/other; 450m variant = Apache |
66
+ | Memory / RAG | ✅ **SHIPPED** (`src/memory/`, `@tryhamster/gerbil/memory`): vector store (in-memory/IndexedDB/file), token-budgeted `recall()`, chunking, redaction, native EmbeddingGemma adapter. 12/12 tests. No new kernels |
67
+ | No-WebGPU / old devices | not targeted — engine throws a clear error rather than degrading |
68
+
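The `Softcap` op named in the Gemma 4 row is described in §12 as `cap·tanh(x/cap)` with cap = 30, applied to the final logits. A minimal CPU sketch of that formula follows; the function name and the `Float32Array` signature are illustrative assumptions, not the engine's kernel interface.

```typescript
/**
 * Soft-cap logits into (-cap, cap) with cap * tanh(x / cap).
 * Near zero the function is close to the identity; large values saturate at +/-cap.
 * Hypothetical helper for illustration only.
 */
function softcap(logits: Float32Array, cap = 30): Float32Array {
  return logits.map((x) => cap * Math.tanh(x / cap));
}

/** Index of the largest element; tanh is monotonic, so softcap should not change it. */
function argmax(values: Float32Array): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}
```

Because `tanh` is monotonic, greedy decoding picks the same token before and after the cap; only the logit magnitudes seen by sampling change.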
69
+ ## 4. Performance baselines (re-confirm before quoting)
70
+
71
+ | Platform | Config | Decode tok/s | Note |
72
+ |---|---|---|---|
73
+ | M4 Max, node-dawn | optimized | ~207 | re-confirm on a cooled run; numbers between commit 2f0cabc and the isMetalBackend fix are invalid |
74
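+ | iPad (iOS 26.5) native | batch-all | ~51 sustained (200-token run); ~41 on 60-token runs | Not thermal (confirmed). The 38-41 readings came from short runs with more warmup overhead per token; §10 re-measured ~51 on the post-optimization engine, so the Dawn-tuned changes neither helped nor regressed Metal |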
75
+ | iPad native, submit floor | group=1 awaited | 6–8 | proven-correct floor |
76
+ | iPad transformers.js (same model) | WebGPU | 7–12 | ~5× slower than native |
77
+
78
+ ## 5. Vision — DONE at the encoder level (bit-exact vs HF)
79
+
80
+ The native Qwen3.5 vision encoder is **built and validated bit-exact** against HF
81
+ transformers 5.12: **per-token cosine = 1.000000, max abs err ~5e-6**. Exposed as
82
+ `engine.encodeImage(patches, gridTHW)` → merged image tokens `[rows, 1024]`. See
83
+ paper §22 and `src/gpu/architectures/qwen3_5_vision.ts`, `src/gpu/vision-executor.ts`,
84
+ `src/gpu/vision-preprocess.ts`.
85
+
86
+ The earlier strategy panel claimed native vision was blocked on a single-threaded
87
+ attention kernel. **That was wrong** — it read `src/gpu/kernels/wgsl/attention.wgsl`,
88
+ a STALE reference file **not imported anywhere** (kernels are embedded strings in
89
+ `registry.ts`; the dead `.wgsl` files have been deleted). What actually shipped:
90
+
91
+ 1. **The live attention kernel was already parallel** — `WGSL_ATTENTION` is a
92
+ tiled, online-softmax (flash-attention-style) kernel. No thread-0 serialization,
93
+ no rewrite needed.
94
+ 2. **Non-causal was a one-line flag.** An `is_causal` uniform
95
+ (`S_eff = is_causal ? min(S, causal_limit) : S`) makes it bidirectional for the
96
+ ViT; text stays causal by default. Done.
97
+ 3. **Patch-embed was NOT a new Conv3d kernel.** Patches arrive pre-flattened to
98
+ `[N, 1536]` from the host image processor, so the 5-D unfold/Conv3d collapses to
99
+ a plain `MatMul` + `AddBias`. New ops added were small: `AddBias`, `GeluErf`
100
+ (exact-erf merger GELU), `ApplyRotaryEmb` (2D rotary), `SliceCols` (fused-QKV
101
+ split) — plus host-side pos-embed/rotary precompute (grid-only, bit-exact).
102
+ 4. **A real bug surfaced during validation:** `WGSL_GELU` returned NaN for large
103
+ args on Metal/Dawn (`x³` overflow into fast-math `tanh`); fixed by clamping the
104
+ inner arg to ±15.
105
+
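The `is_causal` switch in item 2 can be illustrated with a host-side reference. This is a sketch of the math only, not the WGSL kernel: `effectiveKeyCount` and `attendRow` are hypothetical names, and `causalLimit` is assumed to mean the number of keys a query may see (its own position plus one).

```typescript
/** Mirrors `S_eff = is_causal ? min(S, causal_limit) : S` from the text. */
function effectiveKeyCount(isCausal: boolean, seqLen: number, causalLimit: number): number {
  return isCausal ? Math.min(seqLen, causalLimit) : seqLen;
}

/** Softmax-weighted sum over the first `sEff` keys for one query vector (reference math). */
function attendRow(q: number[], keys: number[][], values: number[][], sEff: number): number[] {
  const scale = 1 / Math.sqrt(q.length);
  const scores = keys
    .slice(0, sEff)
    .map((k) => scale * k.reduce((acc, kv, i) => acc + kv * q[i], 0));
  // Subtract the max before exponentiating for numerical stability.
  const maxScore = Math.max(...scores);
  const weights = scores.map((s) => Math.exp(s - maxScore));
  const total = weights.reduce((a, b) => a + b, 0);
  const out = new Array<number>(values[0].length).fill(0);
  weights.forEach((w, j) => {
    values[j].forEach((v, d) => {
      out[d] += (w / total) * v;
    });
  });
  return out;
}
```

With `isCausal = false` every query sees all `S` keys, which is what a bidirectional ViT or encoder needs; the text path keeps the causal limit.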
106
+ **Net: vision is done at the encoder level.** Remaining work is **LM-side
107
+ integration** (M-RoPE, image-token splice into the text stream, pixel→patch
108
+ preprocessing) — phase 2, bounded plumbing over a verified numerical core. The
109
+ ViT prefill runs through the same parallel attention as text, so there is no
110
+ separate mobile attention risk; on-device ViT speed is a measure-when-integrated
111
+ item, not a feasibility gate.
112
+
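The GELU fix in item 4 of the list above can be reproduced on the CPU under one loud assumption: the fast-math `tanh` is modeled here as `(e^(2z) - 1) / (e^(2z) + 1)`, which becomes Infinity/Infinity = NaN once the exponential overflows. The actual Metal/Dawn implementation may differ, and `fastTanh` and `geluTanh` are hypothetical names.

```typescript
// Assumed stand-in for a fast-math tanh: large positive z overflows to Infinity / Infinity = NaN.
function fastTanh(z: number): number {
  const e = Math.exp(2 * z);
  return (e - 1) / (e + 1);
}

const SQRT_2_OVER_PI = Math.sqrt(2 / Math.PI);

/**
 * Tanh-approximation GELU. The cubic term makes the inner argument grow as x^3,
 * so `clamp` (15 in the text) bounds it before tanh sees it.
 */
function geluTanh(x: number, clamp?: number): number {
  let inner = SQRT_2_OVER_PI * (x + 0.044715 * x * x * x);
  if (clamp !== undefined) {
    inner = Math.max(-clamp, Math.min(clamp, inner));
  }
  return 0.5 * x * (1 + fastTanh(inner));
}
```

The clamp is numerically free: tanh(15) differs from 1 by roughly 2e-13, far below f32 resolution, so clamped and unclamped results agree wherever the unclamped one is finite.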
113
+ ## 6. Build sequence (ordered)
114
+
115
+ 1. ~~**Native embeddings**~~ — **DONE.** Qwen3-Embedding-0.6B (`Qwen3ForCausalLM`): last-token EOS pooling + `L2Norm` tail. Validated: dim 1024, unit norm, cos(similar)=0.81 > cos(unrelated)=0.56. Confirmed causal-LM-pooling. `engine.embed()`.
116
+ 2. ~~**OPFS model cache**~~ — **resolved differently: OPFS removed.** Main-thread OPFS `createWritable` is broken on iOS (leaves unclearable junk that fills the quota). The loader is now **Cache-API-only**. Durable iOS caching needs a PWA (paper §24, `docs/research/ios-safari-model-caching.md`), not OPFS.
117
+ 3. ~~**Native vision encoder**~~ — **DONE, bit-exact vs HF** (§5). Remaining: LM-side integration (M-RoPE, token splice, image preprocessing) — phase 2.
118
+ 4. **Vision LM integration (phase 2)** — splice `encodeImage()` tokens into the text stream: M-RoPE position assignment, placeholder-token splice, host pixel→patch preprocessing. Plumbing over a verified encoder.
119
+ 5. **Close the mobile WebKit correctness/perf sweep** — `?group=N` + Test-R bisect on real iOS 26.5 for a stable fast-and-correct submit config above the group=1 floor; re-measure desktop.
120
+ 6. **Llama/Mistral/Gemma graph generator** — ~90% is the existing qwen2 generator; unlocks most of the HF text zoo. Low risk, high ROI.
121
+ 7. **Remove tfjs scaffolding** — delete `chrome-backend.ts`. The engine is native-only; tfjs was temporary dev scaffolding, not a kept lane.
122
+ 8. **Audio, native (deferred, not delegated)** — TTS: **OmniVoice** (Qwen3 backbone + codec decoder — mostly the existing text path + a decoder). STT: **Moonshine** (lean encoder-decoder; raw-waveform Conv1d frontend, no log-mel Conv2d; needs a parallel CrossAttention kernel). Ship once a native small model clears the mobile bar.
123
+
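The embedding tail in item 1 (last-token EOS pooling followed by `L2Norm`) amounts to a few lines of host-side math. The sketch below assumes `hiddenStates` is a row-major `[seqLen, dim]` buffer whose final row is the EOS position; the function names are illustrative and are not the engine's internals.

```typescript
/** Take the final token's hidden state from a row-major [seqLen, dim] buffer. */
function lastTokenPool(hiddenStates: Float32Array, seqLen: number, dim: number): Float32Array {
  return hiddenStates.slice((seqLen - 1) * dim, seqLen * dim);
}

/** Scale a vector to unit L2 norm; a zero vector is returned unchanged. */
function l2Normalize(v: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq);
  if (norm === 0) return v;
  return v.map((x) => x / norm);
}

/** For unit vectors the dot product is the cosine similarity. */
function cosineOfUnit(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) sum += a[i] * b[i];
  return sum;
}
```

Once embeddings are unit-norm, comparisons such as the cos(similar) versus cos(unrelated) check quoted above reduce to plain dot products.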
124
+ ## 7. What NOT to build
125
+
126
+ - **A second permanent engine / kept tfjs lane** — the architecture is native-only across text + vision + embeddings. tfjs is temporary scaffolding being removed, not a breadth lane.
127
+ - **A Conv3d/unfold patch-embed kernel for the ViT** — not needed; patches arrive pre-flattened, so patch-embed is a plain MatMul + AddBias (§5).
128
+ - **A heavyweight native TTS pipeline with custom vocoder kernels now** — audio is deferred; when it lands, OmniVoice's Qwen3 backbone reuses the text path. A thin onnxruntime-web bridge stays break-glass-only.
129
+ - **A general Conv2d kernel / Whisper-on-native** — Moonshine's raw-waveform Conv1d avoids the log-mel Conv2d.
130
+ - **An OPFS write path on iOS** — main-thread `createWritable` is broken; durable caching is a PWA concern, not an OPFS one.
131
+ - **Trusting stale tok/s numbers** between 2f0cabc and the isMetalBackend fix.
132
+
133
+ ## 8. Open spikes (measure before committing)
134
+
135
+ - ~~**Embedder pooling direction**~~ — RESOLVED: causal-LM last-token (EOS) pooling, validated.
136
+ - ~~**Parallel non-causal attention feasibility for the ViT**~~ — RESOLVED: the live attention kernel was already parallel; non-causal is a one-line `is_causal` flag; the encoder is bit-exact. Remaining ViT spike: **on-device ViT prefill speed on a real iPad** (measure when LM-integration lands; not a feasibility gate).
137
+ - **Mobile WebKit submit config** — does `?group=N`/Test-R land a stable fast config above the group=1 floor on iOS 26.5? The critical path.
138
+ - **Vision LM-integration correctness** — M-RoPE + token splice must stay bit-exact end-to-end (image+text), not just at the encoder boundary.
139
+ - **Native audio on mobile** — OmniVoice (codec-decoder) and Moonshine (CrossAttention kernel) latency/correctness once a candidate is validated.
140
+ - **Autoresearch on-device tax** — desktop kernel wins produced no mobile gain (the 41-vs-51 gap was a short-run artifact, resolved in §10; sustained decode stayed at ~51); the loop still needs a mobile-validation leg, possibly per-backend tunings.
141
+
142
+ ## 9. Decision log (this cycle)
143
+
144
+ - Mobile native inference fixed (four-bug diagnosis: jetsam memory, WebKit visibility, attention race, detection predicate). See `docs/mobile-failure-diagnosis.md`, paper §17–19.
145
+ - Desktop 145→~207 via autoresearch (paper §20).
146
+ - **Architecture decision: NATIVE-ONLY** across text + vision + embeddings; no tfjs fallback lane; `chrome-backend.ts` to be deleted (paper §23).
147
+ - **Native embeddings shipped** — Qwen3-Embedding-0.6B, last-token EOS pool + L2Norm, validated (paper §21).
148
+ - **Native vision encoder shipped** — Qwen3.5 ViT, bit-exact vs HF transformers 5.12 (cosine 1.000000); LM-integration is phase 2 (paper §22). Supersedes the earlier "skip the ViT" stance (`docs/research/qwen35-multimodal.md`).
149
+ - **Vision feasibility corrected** — attention was already parallel (panel read a dead `.wgsl` file); non-causal is a one-line flag (paper §22–23, §5 above).
150
+ - **OPFS removed** — main-thread `createWritable` broken on iOS; Cache-API-only loader; durable caching needs a PWA (`docs/research/ios-safari-model-caching.md`, paper §24).
151
+ - Modality model picks (`docs/research/sota-modality-models.md`); chromium decision (`docs/research/native-vs-chromium-decision.md`); site update plan (`docs/site-update-plan.md`).
152
+
153
+ ---
154
+
155
+ ## 10. 2026-06-13 evening — multimodal milestone + in-flight work (CAPTURE)
156
+
157
+ **Major wins this session (all committed to `feat/webgpu-engine-mobile`):**
158
+
159
+ - **Native VISION end-to-end DONE & validated bit-exact** (commit `5fed12b`). `engine.describeImage({pixels,width,height})` → coherent description. Validated vs HF transformers 5.12 on `examples/skelly.png`: 7/7 checks — encoder cosine 1.000000, M-RoPE 3D position ids EXACT, spliced embeds cosine 1.0, first-token exact, full greedy description WORD-IDENTICAL to HF for 201 chars ("This is a detailed black-and-white line drawing of the skeleton of a fox…"). New ops: `MRoPE`, `EmbedSplice`, host image preprocessing (smart-resize/normalize/patchify in `vision-preprocess.ts`). Text path unchanged (M-RoPE on linear positions reduces to 1D RoPE).
160
+ - **Native EMBEDDINGS** (earlier commit): Qwen3-Embedding-0.6B works on DESKTOP (cosine 0.81>0.56). BUT **not iPad-viable** — only BF16 (~1.2GB, OOMs on-device) or a broken MLX-DWQ. → Pivot below.
161
+ - **`is_causal` flag** on f32 attention (from vision) now unlocks **bidirectional encoders**.
162
+
163
+ **The 51-vs-41 mobile throughput question — RESOLVED: NOT a regression.** t15 hit 51.7 on a 200-token sustained run; the 38-41 readings were 60-token runs (more warmup overhead/token). A fresh 200-token sustained run on the CURRENT (post-autoresearch-optimization) engine **hit ~51 again**. So the Dawn-tuned wins did NOT regress mobile; sustained mobile decode is ~51 tok/s. Lesson: benchmark with enough tokens (≥200) to be representative.
164
+
165
+ **IN-FLIGHT / NOT YET MERGED (future-me: merge these):**
166
+ - **LFM2.5-350M** — ✅ **LANDED** on `feat/webgpu-engine-mobile` (commit `3c4bac8`). Tier-1, no new kernels. **~600 tok/s desktop (2.8× Qwen3.5's ~213), ~199MB q4 (half Qwen), coherent, bit-exact vs NumPy ref.** Files: new `src/gpu/architectures/lfm2.ts` (824 lines), registered `Lfm2ForCausalLM`, LFM2 CANONICAL_KEYS, loader key-mapper. TWO GENERAL FIXES worth keeping: (1) effective FF dim is 4608 not config's 6656 (`block_auto_adjust_ff_dim` rounding — `multiple_of(⌊2/3·6656⌋)`); (2) **the "garbage output" was the CHAT TEMPLATE, not the graph** — LFM2.5 ships its template as a `chat_template.jinja` sidecar (absent from tokenizer_config.json), so the engine fell back to Qwen ChatML which auto-injects an empty `<think>` → newline loop. Fix: fetch the `.jinja` sidecar + gate think-injection on the template actually emitting `<think>`. This fix helps ANY model with a jinja sidecar. Verdict: LFM2.5 is a viable faster/smaller text-default alternative to Qwen — user picks.
167
+ - **EmbeddingGemma-300M** — ✅ **LANDED + CONFIRMED ON iPad** (commit `4874d01`, see §11). The iPad-ready embedding model (~173MB q4, MTEB 68.36, standard MLX-4bit). Bidirectional Gemma3 encoder — needs the new generator + a MeanPool kernel + dual-theta RoPE + 2 Dense head layers + the MLX-detection loader fix (research found: detector requires `mode:"affine"` but standard MLX converts omit it → silently fall to F32; and DWQ vs standard MLX are config-indistinguishable → the DWQ-garbage trap). See `docs/research/sota-embedding-models.md`.
168
+
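The FF-dim correction in the LFM2.5 entry (4608 rather than the config's 6656) follows from a two-step rounding. The sketch below assumes the rounding multiple is 256, which this document does not state; `adjustedFfDim` is a hypothetical name.

```typescript
/**
 * Llama-style FFN width adjustment: take floor(2/3 * configFfDim),
 * then round up to the next multiple of `multipleOf`.
 * The default multiple of 256 is an assumption for illustration.
 */
function adjustedFfDim(configFfDim: number, multipleOf = 256): number {
  const twoThirds = Math.floor((2 * configFfDim) / 3);
  return multipleOf * Math.ceil(twoThirds / multipleOf);
}
```

For 6656 this gives floor(4437.33) = 4437, rounded up to 18 × 256 = 4608, matching the effective dimension reported above. A generator that reads the config value directly would size the MLP weights incorrectly.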
169
+ **iPad harness:** dashboard at `https://<lan-ip>:8766/` with Text/Vision/transformers.js/Storage-probe/Clear-cache/Stop-queue buttons, live results table, `/results` + `/enqueue` + `/clear-queue` endpoints. Embeddings button removed until EmbeddingGemma lands. TODO: show generated text in the results + default Text benchmark to 200 tokens (representative); wire the Vision button to `describeImage` (needs browser ImageBitmap/Canvas→pixels decode).
170
+
171
+ **iOS caching:** OPFS removed (main-thread createWritable broken on iOS, left unclearable junk). Durable cache needs a PWA (persist() only granted when installed). See `docs/research/ios-safari-model-caching.md`. Deferred.
172
+
173
+ **Native modality scorecard (as of §10):** text ✅, vision ✅ (end-to-end bit-exact), embeddings ✅ desktop / iPad-pending (EmbeddingGemma building), LFM2.5 text-alt ✅ (pending merge). Audio (TTS/STT) deferred to native OmniVoice/Moonshine. **The native-only multimodal engine is real.** **→ All "pending/building" items above have since landed; see §11.**
174
+
175
+ ---
176
+
177
+ ## 11. 2026-06-14 — EmbeddingGemma on-device, LFM2.5 landed, loader hardening (CAPTURE)
178
+
179
+ The §10 in-flight items all landed on `feat/webgpu-engine-mobile`. New wins this session
180
+ (paper §25–§30 document each in depth):
181
+
182
+ - **EmbeddingGemma-300M LANDED and CONFIRMED RUNNING ON iPad Safari** (commit `4874d01`).
183
+ First **non-Qwen** embedding family — a real **bidirectional Gemma3 encoder**
184
+ (`src/gpu/architectures/gemma3_encoder.ts`, `generateGemma3EncoderGraph`): 24 pre-norm
185
+ blocks, GQA 3q/1kv head_dim 256, per-head q/k-norm, **dual-theta RoPE** (sliding θ=10000
186
+ / full θ=1e6 selected per layer from `layer_types`), GeGLU MLP, Gemma's **four-norm
187
+ sandwich**, embed ×√768, tail **MeanPool → Dense0(768→3072) → Dense1(3072→768) → L2Norm**.
188
+ **173 MB at MLX-4bit** (vs the abandoned 1.2 GB Qwen3-Embedding that OOM'd iPad).
189
+ Two new kernels only: **MeanPool**, **Scale** (`kernels/registry.ts`). Validated
190
+ **cos=1.00000 vs an independent NumPy reference** (`scripts/engine/test-embedding-gemma-reference.py`),
191
+ reference gate `cos>0.95`; semantic test asserts a Red-Planet query is closer to two
192
+ Mars docs than to a bread doc by **>0.1 cosine margin**, unit-norm dim-768, no NaN. The
193
+ engine **generalizes across embedding families**, not just Qwen.
194
+
195
+ - **SPM tokenizer fix (load-bearing, cross-family)** (`src/gpu/tokenizer.ts`). Gemma's
196
+ `tokenizer.json` is `type:"BPE"` but **SentencePiece-flavored** (▁/U+2581 spaces, raw
197
+ UTF-8 tokens, array-form merges, `<0xHH>` byte-fallback). The byte-level (`Ġ`) BPE path
198
+ was char-splitting every word → semantically dead embeddings that still passed norm/NaN
199
+ checks. Auto-detected `spmMode` (structural: `" "→"▁"` Replace normalizer **or**
200
+ `byte_fallback && "▁the" in vocab`) now drives encode/decode/merges; **Qwen/LFM2 stay
201
+ byte-level** (no model-name list — they just don't match). Lesson: `type:"BPE"` ≠
202
+ byte-level BPE; any SentencePiece-lineage family (Gemma/Llama/Mistral) needs this path.
203
+
204
+ - **MLX-4bit loader hardened** (`src/gpu/model-loader.ts`). (1) Detection broadened to
205
+ accept mode-less `{bits:4, group_size}` configs (standard mlx-lm omits `mode`). (2) A
206
+ `VERIFIED_MLX_REPOS` allowlist (currently `mlx-community/embeddinggemma-300m-4bit`) gates
207
+ mode-less configs. (3) **Explicit DWQ reject** — DWQ repos carry an identical
208
+ `{bits:4,group_size}` config but pack weights that dequant to garbage, so they're rejected
209
+ by repo-name substring (`includes("dwq")`). Codifies the MLX-DWQ-garbage trap. (4) Gemma's
210
+ `(1+weight)` RMSNorm absorption is **baked by the loader even for MLX** (mlx-lm pre-absorbs
211
+ +1 for Qwen3.5 but NOT for Gemma — the Gemma branch deliberately omits `&& !isMLX`).
212
+
213
+ - **Progress-reporting fix** (commit `682a09b`, `model-loader.ts`). The bar froze at
214
+ "10% discovering weight files" because the gap between that emit and the first download
215
+ chunk (index probe + 2 header range-requests + first-byte latency) emitted nothing. Now
216
+ emits "Reading {file} header" + "Downloading {file} (0/{N} MB)" up front. **Affected every
217
+ model**; fixed universally.
218
+
219
+ - **LFM2.5-350M landed** (commit `3c4bac8`) — Tier-1, no new kernels, ~600 tok/s desktop /
220
+ ~46 tok/s mobile, ~199 MB q4. The general jinja-sidecar chat-template fix from it helps any
221
+ model shipping `chat_template.jinja`.
222
+
223
+ - **Cross-device multi-modal parity.** Text (Qwen3.5 ~51 tok/s, LFM2.5 ~46 tok/s), vision
224
+ (Qwen3.5 ViT `describeImage`), and embeddings (EmbeddingGemma) **all now run natively on
225
+ iPad Safari** — the native engine reaches transformers.js-path modality coverage on the
226
+ modalities that matter, without the mobile crashes and ~5× faster. **Remaining native gap:
227
+ audio** — TTS via OmniVoice (in progress), STT via Moonshine (not started).
228
+
229
+ - **Effort-tier shift.** Adding a new TEXT family is now usually **Tier-1: generator only, no
230
+ new kernels** — the kernel library has saturated for standard transformers (Llama/Mistral/
231
+ Gemma-text reuse existing ops). New kernels are needed only for genuinely novel ops (SSM,
232
+ PLE, new norms, cross-attention). See `docs/adding-a-model-family.md`.
233
+
234
+ - **Site migration assessment** written: `docs/gerbil-site-native-migration.md` (how the
235
+ marketing/docs site can move its in-browser inference from the transformers.js/ONNX worker
236
+ to the native engine, modality-by-modality, with the no-fallback device-coverage tradeoff
237
+ stated plainly). Assessment only — the site repo was not modified.
238
+
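The structural `spmMode` detection described in the SPM tokenizer entry can be sketched as follows. The field names follow the Hugging Face `tokenizer.json` layout as commonly published (a `Replace` normalizer with a `pattern.String`, and `model.byte_fallback`); the real detector lives in `src/gpu/tokenizer.ts` and is not shown in this document, so treat the types and function names here as assumptions.

```typescript
interface NormalizerConfig {
  type: string;
  pattern?: { String?: string; Regex?: string };
  content?: string;
  normalizers?: NormalizerConfig[]; // present on "Sequence" normalizers
}

interface TokenizerJson {
  normalizer?: NormalizerConfig | null;
  model: { type: string; byte_fallback?: boolean; vocab: Record<string, number> };
}

const SPM_SPACE = "\u2581"; // the "▁" marker SentencePiece uses for spaces

/** True if any (possibly nested) normalizer replaces " " with "▁". */
function hasSpaceReplace(n: NormalizerConfig | null | undefined): boolean {
  if (!n) return false;
  if (n.type === "Replace" && n.pattern?.String === " " && n.content === SPM_SPACE) return true;
  return (n.normalizers ?? []).some((child) => hasSpaceReplace(child));
}

/** Structural check only: no model-name list, as the text describes. */
function detectSpmMode(t: TokenizerJson): boolean {
  if (hasSpaceReplace(t.normalizer)) return true;
  return t.model.byte_fallback === true && SPM_SPACE + "the" in t.model.vocab;
}
```

Byte-level tokenizers fail both conditions, so they stay on the existing path without being listed by name.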
239
+ ---
240
+
241
+ ## 12. 2026-06-14 — native audio begins, Gemma 4, memory/RAG, autoresearch campaign (CAPTURE)
242
+
243
+ This session pushed past the §11 "multimodal parity minus audio" milestone. Paper
244
+ **§31–§35** document each in depth. The decided order from §6 (audio last, native not
245
+ delegated) held, and audio is now *under way natively*, not delegated to tfjs.
246
+
247
+ - **Native STT — Moonshine LANDED** (`src/gpu/architectures/moonshine.ts`,
248
+ `moonshine-executor.ts`, `moonshine-stt.ts`; paper §31). Chosen over Whisper to avoid
249
+ a log-mel/Conv2d front-end: it consumes **16 kHz PCM directly** through three strided
250
+ `Conv1d`s (downsample 384×, ~41.6 frames/s) — new kernels `Conv1dFull`, `GroupNorm`,
251
+ `Tanh`, `Transpose`. The real new attention primitive is **`CrossAttention`**
252
+ (`WGSL_CROSS_ATTENTION`, tiled online-softmax), validated **bit-exact vs NumPy**
253
+ (`test-crossattention.mjs`: max|err|<2e-4, cos≥0.9999). Runtime is a **dual graph**:
254
+ `MoonshineEncoderExecutor.encode()` runs the front-end + bidirectional encoder once and
255
+ freezes per-decoder-layer K/V; `MoonshineSTT.transcribe()` runs greedy AR decode with
256
+ self- + cross-attention into that frozen K/V. **Interleaved RoPE** (`ROPE_INTERLEAVED_SPEC`,
257
+ adjacent-dim pairing) vs the default split-half. Validation: encoder cos≈0.990 vs HF
258
+ (source comment); `test-moonshine-transcribe.mjs` asserts transcript contains HF-ref
259
+ substrings; RTF/4-bit size are computed-and-reported, not hardcoded-asserted.
260
+ **Whisper(ONNX) stays the multilingual / no-WebGPU fallback** (`src/core/stt.ts`,
261
+ `WhisperSTT`) — separate, untouched.
262
+
263
+ - **Native TTS — Kani-TTS-2 (partial)** (`src/gpu/architectures/kani_tts.ts`; paper §32).
264
+ The hard novel piece — the **NanoCodec decoder** (FSQ 4×4 levels `[9,8,8,7]`, mixed-radix
265
+ base `[1,9,72,576]` + causal HiFi-GAN, rates `[7,7,6,3,2]`, hop 1764 @ 22050 Hz) — is
266
+ **implemented and validated bit-exact** (`test-nanocodec-decode.mjs`, gate `err<1e-3`,
267
+ measured ~4.2e-6 vs MLX). New kernels: `FSQDequant`, `HalfSnake1d`,
268
+ `ConvTranspose1dDepthwise`. The backbone is **LFM2-350M** (`KaniTTS2ForCausalLM`, audio
269
+ tokens above the text vocab, 4/frame); `generateKaniTtsGraph` **deliberately throws** —
270
+ remaining is the frame-position + learnable-RoPE + 4-token-frame AR-decode glue (most
271
+ block math reused from `lfm2.ts`). License: **kani-tts-2-en = LFM1.0 (other)**, NanoCodec
272
+ = NVIDIA OML; the **450m variant is Apache** (same arch).
273
+
274
+ - **Gemma 4 E2B — text decode COHERENT on real q4 weights** (`src/gpu/architectures/gemma4.ts`;
275
+ paper §33). **Tier-2**, no MatFormer/AltUp/LAuReL. **PLE** (2nd embedding gathered per
276
+ token, per-layer gate+GELU+multiply+project+norm+residual), **KV-cache sharing** (E2B: 35
277
+ layers, last **20** shared via graph-rewire, no kernel change), **proportional RoPE**
278
+ (rotate 0.25·head_dim=64 dims but inv_freq over full head_dim denom; dual-theta 1e6/1e4),
279
+ **GeGLU**, and a new **`Softcap`** kernel (`cap·tanh(x/cap)`, cap=30) on final logits.
280
+ Generates coherently ("The capital of France is" → "Paris") at ~83 tok/s; all 35 layers
281
+ match an MLX-LM reference cos≥0.998 with identical argmax (`test-gemma4-perlayer.mjs`).
282
+ Structural validation **67/67** (`test-gemma4-graph.mjs`; 942 nodes); softcap kernel
283
+ separately validated (`test-gemma4-softcap.mjs`, max err<1e-4). **PLE is CPU-streamed, not
284
+ GPU-sharded**: the ~1.17 GB q4 PLE table stays CPU-resident (**0 MB GPU**) and the executor
285
+ streams per-token rows each step — Gemma 4's intended flash design, mobile-viable, and it
286
+ sidesteps the per-binding cap. Coherence required four fixes: per-node `attn_scale`=1.0
287
+ (HF scaling=1.0; default 1/√head_dim keeps other models byte-identical), parameter-free
288
+ V-norm, double-wide MLP on the KV-shared layers, and head_dim-512 support in the flash
289
+ attention kernel (second per-thread accumulator + smem-capped tiling, 16 KB invariant kept).
290
+
291
+ - **On-device Memory / RAG SHIPPED** (`src/memory/`, `@tryhamster/gerbil/memory`; paper §34).
292
+ Pluggable vector store (`InMemoryStore` / `IndexedDBStore` / `FileStore`), token-budgeted
293
+ `recall()` (default 1024-token greedy pack, ~4-chars/token, returns `{context, records,
294
+ tokensUsed}`), overlapping-window chunking (1000/200), write-time redaction (regex →
295
+ `[REDACTED]` or fn), and a `createGerbilEmbedder()` adapter over **native EmbeddingGemma**.
296
+ **12/12 tests** (`src/memory/memory.test.ts`). No new kernels — a clean consumer of the
297
+ embedding modality.
298
+
299
+ - **Autoresearch TPS campaign — 3 more batches** (`scripts/engine/results.jsonl`,
300
+ `scripts/engine/chart.html`; paper §35), M4 Max / node-dawn. Verified peaks:
301
+ **Qwen3.5-0.8B 219→~234 tok/s**, **LFM2.5-350M 624→~672 tok/s**, **ViT encode
302
+ 581.8→~502 ms (−~14%)**, **`describeImage` 37.0→42.0 tok/s (+13.5%)** — all kept changes
303
+ bit-exact (merged cos 1.0, e2e 7/7). **The winning lesson (sharpened):** desktop wins come
304
+ only from eliminating **large wide reads on poorly-occupied kernels** (fused conv+activation,
305
+ vec4 + register-blocked + f16-mixed ViT matmul, `MatMul+AddBias→MatMulBias`); **pure
306
+ dispatch-count cuts on already-tuned kernels are noise** (butterfly reduce, subgroup
307
+ shuffle, bigger N-tiles all reverted). The INT4 matmuls and Mamba SSM sit at the
308
+ **bandwidth floor** — the remaining headroom is **mobile**, not desktop (several
309
+ reverted-on-desktop fusions are *predicted mobile wins*; the loop's next leg is a
310
+ mobile-validation pass).
311
+
312
+ - **gerbil-site is LIVE on the native engine.** The marketing/docs site's in-browser
313
+ inference now runs on the native WGSL engine (no longer the transformers.js/ONNX worker)
314
+ for the migrated modalities — see `docs/gerbil-site-native-migration.md`. (The §11 entry
315
+ was assessment-only; this cycle it went live.)
316
+
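The FSQ numbers quoted for the NanoCodec decoder are internally consistent, which a short sketch makes concrete: the mixed-radix bases `[1,9,72,576]` are the running products of the levels `[9,8,8,7]`, and the upsample rates multiply to the stated hop. The digit-to-value normalization in `fsqDequant` is the common FSQ convention and is an assumption here; the `FSQDequant` kernel itself is not shown in this document, and all function names are hypothetical.

```typescript
const FSQ_LEVELS = [9, 8, 8, 7];
const FSQ_BASES = [1, 9, 72, 576]; // running products of FSQ_LEVELS
const UPSAMPLE_RATES = [7, 7, 6, 3, 2];

/** Split a code index into one digit per FSQ dimension (mixed-radix decode). */
function fsqDigits(index: number): number[] {
  return FSQ_LEVELS.map((levels, i) => Math.floor(index / FSQ_BASES[i]) % levels);
}

/** Inverse of fsqDigits. */
function fsqIndex(digits: number[]): number {
  return digits.reduce((acc, d, i) => acc + d * FSQ_BASES[i], 0);
}

/** Assumed normalization: center each digit and scale by floor(levels / 2). */
function fsqDequant(index: number): number[] {
  return fsqDigits(index).map((d, i) => {
    const half = Math.floor(FSQ_LEVELS[i] / 2);
    return (d - half) / half;
  });
}

/** Samples produced per codec frame: the product of the upsample rates. */
const hopLength = UPSAMPLE_RATES.reduce((a, b) => a * b, 1);
```

The product of the rates is 1764 samples per frame, so at 22050 Hz the codec runs at 12.5 frames per second, and each code index ranges over 9 × 8 × 8 × 7 = 4032 values.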
317
+ **Native modality scorecard (as of §12):** text ✅ (Qwen3.5, LFM2.5), text-alt families 🟢
318
+ Tier-1, Gemma 4 E2B ✅ text decode (coherent on real q4 weights, PLE CPU-streamed), vision ✅ (Qwen3.5
319
+ ViT, on iPad), embeddings ✅ (EmbeddingGemma, on iPad), **STT ✅ native (Moonshine) + Whisper
320
+ fallback**, **TTS 🟡 (Kani NanoCodec decoder validated, backbone AR loop pending)**, memory/RAG
321
+ ✅ shipped. **Audio is no longer the deferred gap — STT is native; TTS is one AR-loop away.**
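
The memory/RAG behavior summarized in §12 (overlapping-window chunking at 1000/200 and a 1024-token greedy `recall()` pack at ~4 chars/token) can be sketched without the library. Several details are assumptions: the 1000/200 figures are treated as characters, records that do not fit the budget are skipped rather than ending the pack, and every name below is hypothetical rather than the `@tryhamster/gerbil/memory` API.

```typescript
interface ScoredRecord {
  text: string;
  score: number;
}

interface RecallResult {
  context: string;
  records: ScoredRecord[];
  tokensUsed: number;
}

/** Overlapping windows: each chunk starts (size - overlap) characters after the previous one. */
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // the last window already reached the end
  }
  return chunks;
}

/** Rough token estimate at ~4 characters per token. */
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

/** Greedy pack by descending score; assumed policy is to skip records that do not fit. */
function packWithinBudget(candidates: ScoredRecord[], budgetTokens = 1024): RecallResult {
  const ranked = candidates.slice().sort((a, b) => b.score - a.score);
  const records: ScoredRecord[] = [];
  let tokensUsed = 0;
  for (const record of ranked) {
    const cost = estimateTokens(record.text);
    if (tokensUsed + cost > budgetTokens) continue;
    records.push(record);
    tokensUsed += cost;
  }
  return { context: records.map((r) => r.text).join("\n\n"), records, tokensUsed };
}
```

The returned shape mirrors the `{context, records, tokensUsed}` triple named above; the separator between records is not counted against the budget in this sketch.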