@tryhamster/gerbil 1.0.0-rc.9 → 1.0.1
- package/LICENSE +1 -1
- package/README.md +318 -104
- package/dist/architectures-C1I5V3Dt.mjs +6070 -0
- package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
- package/dist/browser/index.d.ts +276 -590
- package/dist/browser/index.d.ts.map +1 -1
- package/dist/browser/index.js +592 -2334
- package/dist/browser/index.js.map +1 -1
- package/dist/cli.mjs +625 -1098
- package/dist/cli.mjs.map +1 -1
- package/dist/defaults-9komdrbY.mjs +24 -0
- package/dist/defaults-9komdrbY.mjs.map +1 -0
- package/dist/frameworks/express.d.mts +1 -3
- package/dist/frameworks/express.d.mts.map +1 -1
- package/dist/frameworks/express.mjs +7 -7
- package/dist/frameworks/express.mjs.map +1 -1
- package/dist/frameworks/fastify.d.mts +1 -1
- package/dist/frameworks/fastify.d.mts.map +1 -1
- package/dist/frameworks/fastify.mjs +3 -3
- package/dist/frameworks/fastify.mjs.map +1 -1
- package/dist/frameworks/hono.d.mts +1 -1
- package/dist/frameworks/hono.d.mts.map +1 -1
- package/dist/frameworks/hono.mjs +4 -4
- package/dist/frameworks/hono.mjs.map +1 -1
- package/dist/frameworks/next.d.mts +3 -2
- package/dist/frameworks/next.d.mts.map +1 -1
- package/dist/frameworks/next.mjs +4 -4
- package/dist/frameworks/next.mjs.map +1 -1
- package/dist/frameworks/react.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts.map +1 -1
- package/dist/frameworks/trpc.mjs +4 -4
- package/dist/frameworks/trpc.mjs.map +1 -1
- package/dist/gerbil-BetB5xb0.d.mts +488 -0
- package/dist/gerbil-BetB5xb0.d.mts.map +1 -0
- package/dist/gerbil-CTZUa8EZ.mjs +4 -0
- package/dist/gerbil-DNniplr4.mjs +1656 -0
- package/dist/gerbil-DNniplr4.mjs.map +1 -0
- package/dist/gpu/hooks.d.mts +640 -0
- package/dist/gpu/hooks.d.mts.map +1 -0
- package/dist/gpu/hooks.mjs +1369 -0
- package/dist/gpu/hooks.mjs.map +1 -0
- package/dist/gpu/index.d.mts +2 -0
- package/dist/gpu/index.mjs +6 -0
- package/dist/gpu-DFuglcEx.mjs +3790 -0
- package/dist/gpu-DFuglcEx.mjs.map +1 -0
- package/dist/index-Dgmb2kE3.d.mts +245 -0
- package/dist/index-Dgmb2kE3.d.mts.map +1 -0
- package/dist/index-DukkJRMj.d.mts +2114 -0
- package/dist/index-DukkJRMj.d.mts.map +1 -0
- package/dist/index.d.mts +22 -487
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +13 -8
- package/dist/index.mjs.map +1 -1
- package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
- package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
- package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
- package/dist/integrations/ai-sdk.d.mts +75 -6
- package/dist/integrations/ai-sdk.d.mts.map +1 -1
- package/dist/integrations/ai-sdk.mjs +131 -15
- package/dist/integrations/ai-sdk.mjs.map +1 -1
- package/dist/integrations/langchain.d.mts +1 -1
- package/dist/integrations/langchain.d.mts.map +1 -1
- package/dist/integrations/langchain.mjs +5 -5
- package/dist/integrations/langchain.mjs.map +1 -1
- package/dist/integrations/llamaindex.d.mts +1 -1
- package/dist/integrations/llamaindex.d.mts.map +1 -1
- package/dist/integrations/llamaindex.mjs +5 -5
- package/dist/integrations/llamaindex.mjs.map +1 -1
- package/dist/integrations/mcp-client.mjs +3 -3
- package/dist/integrations/mcp-client.mjs.map +1 -1
- package/dist/integrations/mcp.d.mts +3 -2
- package/dist/integrations/mcp.d.mts.map +1 -1
- package/dist/integrations/mcp.mjs +5 -5
- package/dist/{mcp-BvbriaBy.mjs → mcp-D2vvH1Xc.mjs} +4 -4
- package/dist/mcp-D2vvH1Xc.mjs.map +1 -0
- package/dist/memory/index.d.mts +3 -0
- package/dist/memory/index.mjs +6 -0
- package/dist/memory-D1P7Tmda.mjs +4 -0
- package/dist/memory-DVN0MnIG.mjs +132 -0
- package/dist/memory-DVN0MnIG.mjs.map +1 -0
- package/dist/memory-Dj0J1v88.mjs +294 -0
- package/dist/memory-Dj0J1v88.mjs.map +1 -0
- package/dist/moonshine-stt-17dpP1kr.mjs +4 -0
- package/dist/moonshine-stt-4ojLtMq7.mjs +11962 -0
- package/dist/moonshine-stt-4ojLtMq7.mjs.map +1 -0
- package/dist/{one-liner-s-lD8rCC.mjs → one-liner-JhdIPxzF.mjs} +14 -16
- package/dist/one-liner-JhdIPxzF.mjs.map +1 -0
- package/dist/repl-BDRkwPGX.mjs +9 -0
- package/dist/skills/index.d.mts +270 -320
- package/dist/skills/index.d.mts.map +1 -1
- package/dist/skills/index.mjs +5 -5
- package/dist/{skills-CD3Orlex.mjs → skills-CU694Dc8.mjs} +187 -32
- package/dist/skills-CU694Dc8.mjs.map +1 -0
- package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
- package/dist/tools-DQ1mPUw5.mjs.map +1 -0
- package/dist/types-DQBe2lFo.d.mts +165 -0
- package/dist/types-DQBe2lFo.d.mts.map +1 -0
- package/dist/{types-CiTc7ez3.d.mts → types-LlyYILII.d.mts} +112 -14
- package/dist/types-LlyYILII.d.mts.map +1 -0
- package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
- package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
- package/dist/vector-B0panuy6.mjs +95 -0
- package/dist/vector-B0panuy6.mjs.map +1 -0
- package/docs/PROJECT-STATE.md +321 -0
- package/docs/adding-a-model-family.md +280 -0
- package/docs/ai-sdk.md +70 -61
- package/docs/architecture/overview.md +17 -7
- package/docs/browser.md +203 -8
- package/docs/embeddings.md +156 -0
- package/docs/gerbil-site-native-migration.md +217 -0
- package/docs/gpu-engine/architectures.md +398 -0
- package/docs/gpu-engine/ir.md +372 -0
- package/docs/gpu-engine/kernels.md +718 -0
- package/docs/gpu-engine/paper.html +1759 -0
- package/docs/gpu-engine/paper.md +2109 -0
- package/docs/gpu-engine/safetensors.md +312 -0
- package/docs/gpu-engine/tokenizer.md +302 -0
- package/docs/memory-rag.md +91 -0
- package/docs/metal-safari-intel.md +190 -0
- package/docs/mobile-failure-diagnosis.md +124 -0
- package/docs/mobile.md +99 -0
- package/docs/observability.md +230 -0
- package/docs/onnx-removal-plan.md +339 -0
- package/docs/research/autoresearch-portable.md +904 -0
- package/docs/research/dispatch-reduction-hivemind.md +84 -0
- package/docs/research/ios-safari-model-caching.md +117 -0
- package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
- package/docs/research/native-stt-model-selection.md +49 -0
- package/docs/research/native-tts-model-selection.md +90 -0
- package/docs/research/native-vs-chromium-decision.md +152 -0
- package/docs/research/nemotron-mamba2-inference.md +910 -0
- package/docs/research/qwen35-multimodal.md +293 -0
- package/docs/research/qwen36-gemma4-targets.md +337 -0
- package/docs/research/sota-embedding-models.md +179 -0
- package/docs/research/sota-mobile-models-2026.md +263 -0
- package/docs/research/sota-modality-models.md +202 -0
- package/docs/research/tps-baselines.md +71 -0
- package/docs/research/webgpu-m4-reference.md +104 -0
- package/docs/site-update-plan.md +155 -0
- package/docs/structured-output.md +123 -0
- package/docs/stt.md +63 -446
- package/docs/tts.md +77 -499
- package/docs/vision.md +100 -338
- package/package.json +22 -7
- package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
- package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
- package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
- package/dist/gerbil-CJ3ifloF.mjs +0 -4
- package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
- package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
- package/dist/gerbil-qOTe1nl2.d.mts +0 -431
- package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
- package/dist/kokoro-BNTb6egA.mjs +0 -20210
- package/dist/kokoro-BNTb6egA.mjs.map +0 -1
- package/dist/kokoro-CMOGDSgT.js +0 -20212
- package/dist/kokoro-CMOGDSgT.js.map +0 -1
- package/dist/mcp-BvbriaBy.mjs.map +0 -1
- package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
- package/dist/repl-DveXw36T.mjs +0 -9
- package/dist/skills-CD3Orlex.mjs.map +0 -1
- package/dist/stt-Bu-E23Sc.js +0 -433
- package/dist/stt-Bu-E23Sc.js.map +0 -1
- package/dist/stt-CpLYbGFd.mjs +0 -433
- package/dist/stt-CpLYbGFd.mjs.map +0 -1
- package/dist/stt-DRPLEEHB.mjs +0 -3
- package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
- package/dist/transformers.web-DiD1gTwk.js +0 -44695
- package/dist/transformers.web-DiD1gTwk.js.map +0 -1
- package/dist/transformers.web-u34VxRFM.js +0 -3
- package/dist/tts-CqroPaSK.js +0 -724
- package/dist/tts-CqroPaSK.js.map +0 -1
- package/dist/tts-DXgsKGCe.mjs +0 -3
- package/dist/tts-DeGANMNV.mjs +0 -730
- package/dist/tts-DeGANMNV.mjs.map +0 -1
- package/dist/types-CiTc7ez3.d.mts.map +0 -1
- /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
- /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
- /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -41,8 +41,12 @@ type GenerateOptions = {
   system?: string;
   /** Enable thinking/reasoning mode (Qwen3) */
   thinking?: boolean;
-  /** Callback for each token (streaming) */
-  onToken?: (token: string
+  /** Callback for each token (streaming); `meta` carries live decode-only tok/s */
+  onToken?: (token: string, meta?: {
+    tokenIndex: number;
+    tps: number;
+    elapsedMs: number;
+  }) => void;
   /** Images to include (only used if model supports vision) */
   images?: ImageInput[];
   /** Enable response caching (default: false) */
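The widened `onToken` signature above can be exercised with a small collector. This is a minimal sketch, not the package's own code: `TokenMeta`, `OnToken`, and `createTokenCollector` are hypothetical names introduced here, and the stream is simulated rather than produced by a real model.

```typescript
// Mirrors the `meta` shape added to `onToken` in this hunk (names are local to this sketch).
type TokenMeta = { tokenIndex: number; tps: number; elapsedMs: number };
type OnToken = (token: string, meta?: TokenMeta) => void;

// Hypothetical helper: builds a callback that accumulates text and tracks the peak tok/s seen.
function createTokenCollector() {
  const state = { text: "", tokens: 0, peakTps: 0 };
  const onToken: OnToken = (token, meta) => {
    state.text += token;
    state.tokens += 1;
    // `meta` is optional, so callers that pass only the token keep working.
    if (meta && meta.tps > state.peakTps) {
      state.peakTps = meta.tps;
    }
  };
  return { state, onToken };
}

// Simulated stream; in real use `onToken` would be passed in GenerateOptions.
const collector = createTokenCollector();
collector.onToken("Hello", { tokenIndex: 0, tps: 0, elapsedMs: 12 });
collector.onToken(" world", { tokenIndex: 1, tps: 41.5, elapsedMs: 36 });
collector.onToken("!");
```

Because `meta` is optional in the new type, a callback written against the old single-argument form should still type-check.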
@@ -92,16 +96,47 @@ type EmbedResult = {
   /** Time in ms */
   totalTime: number;
 };
+type SearchResult = {
+  /** The matched text */
+  text: string;
+  /** Similarity score (0-1, higher is more similar) */
+  score: number;
+  /** Index in the original corpus */
+  index: number;
+};
+type SimilarityResult = {
+  /** Similarity score (0-1, higher is more similar) */
+  score: number;
+  /** First text */
+  textA: string;
+  /** Second text */
+  textB: string;
+  /** Time in ms */
+  totalTime: number;
+};
 type LoadOptions = {
   /** Progress callback */
   onProgress?: (info: ProgressInfo) => void;
-  /**
-
-
+  /**
+   * Compute device. The only inference backend is the native WebGPU engine
+   * (Dawn in Node, WebGPU in the browser); "auto" resolves to "webgpu". There
+   * is no CPU/WASM or ONNX path.
+   */
+  device?: "auto" | "webgpu";
+  /**
+   * Weight quantization. The engine quantizes to INT4 ("q4") on load; the other
+   * values are accepted for forward-compat but currently map to q4.
+   */
   dtype?: "q4" | "q8" | "fp16" | "fp32";
   /** Override context length */
   contextLength?: number;
 };
+type PreloadOptions = {
+  /** Progress callback for download status */
+  onProgress?: (info: ProgressInfo) => void;
+  /** Keep model loaded in memory after preload (default: false - disposes to free memory) */
+  keepLoaded?: boolean;
+};
 type ProgressInfo = {
   status: string;
   progress?: number;
@@ -112,14 +147,18 @@ type ProgressInfo = {
 type GerbilConfig = {
   /** Default model */
   model?: string;
-  /** Default device */
-  device?: "auto" | "
-  /** Default quantization */
+  /** Default device (native WebGPU only; "auto" resolves to "webgpu") */
+  device?: "auto" | "webgpu";
+  /** Default quantization (engine uses INT4 "q4") */
   dtype?: "q4" | "q8" | "fp16" | "fp32";
   /** Cache configuration */
   cache?: CacheConfig;
   /** Fallback configuration */
   fallback?: FallbackConfig;
+  /** Telemetry hooks for observability (Sentry, logging, etc.) */
+  telemetry?: TelemetryConfig;
+  /** Concurrency control for request queuing */
+  concurrency?: ConcurrencyConfig;
 };
 type CacheConfig = {
   /** Enable caching (default: true) */
@@ -183,14 +222,14 @@ type SystemInfo = {
 type GerbilModelSettings = {
   /** Enable thinking mode */
   thinking?: boolean;
-  /** Device to use */
-  device?: "auto" | "
+  /** Device to use (native WebGPU only) */
+  device?: "auto" | "webgpu";
   /** Quantization level */
   dtype?: "q4" | "q8" | "fp16" | "fp32";
 };
 type GerbilProviderSettings = {
-  /** Default device */
-  device?: "auto" | "
+  /** Default device (native WebGPU only) */
+  device?: "auto" | "webgpu";
   /** Default quantization */
   dtype?: "q4" | "q8" | "fp16" | "fp32";
 };
@@ -348,6 +387,65 @@ type StreamingTranscriptionSession = {
   /** Reset session (clear buffer and transcript) */
   reset: () => void;
 };
+/**
+ * Telemetry hooks for production observability.
+ * Pass your own Sentry instance or custom logging functions.
+ */
+type TelemetryConfig = {
+  /**
+   * Called after successful generation with full result and timing.
+   * Use for logging, metrics, or analytics.
+   */
+  onGenerate?: (event: GenerateEvent) => void;
+  /**
+   * Called when any error occurs during Gerbil operations.
+   * Perfect for Sentry.captureException() or similar.
+   */
+  onError?: (error: Error, context: ErrorContext) => void;
+  /**
+   * Called after model loading completes (success or failure).
+   */
+  onModelLoad?: (event: ModelLoadEvent) => void;
+  /**
+   * Called when a request is queued (if concurrency limit reached).
+   */
+  onQueueWait?: (waitTimeMs: number) => void;
+};
+type GenerateEvent = {
+  /** Model used for generation */
+  modelId: string;
+  /** Generation result */
+  result: GenerateResult;
+  /** Whether response came from cache */
+  cached: boolean;
+  /** Time spent waiting in queue (if any) */
+  queueTimeMs?: number;
+};
+/**
+ * Context passed to telemetry onError callback.
+ * Flexible record to allow any relevant context data.
+ */
+type ErrorContext = Record<string, unknown>;
+type ModelLoadEvent = {
+  /** Model that was loaded */
+  modelId: string;
+  /** Time to load in ms */
+  loadTimeMs: number;
+  /** Whether loaded from cache */
+  fromCache: boolean;
+  /** Device used */
+  device: "webgpu" | "cpu" | "wasm";
+  /** Whether load succeeded */
+  success: boolean;
+  /** Error message if failed */
+  error?: string;
+};
+type ConcurrencyConfig = {
+  /** Maximum concurrent generation requests (default: 1 for LLM) */
+  maxConcurrent?: number;
+  /** Request timeout in ms (default: 300000 = 5 min) */
+  timeout?: number;
+};
 //#endregion
-export {
-//# sourceMappingURL=types-
+export { SpeakResult as A, PreloadOptions as C, SessionStats as D, SearchResult as E, TelemetryConfig as F, TranscribeOptions as I, TranscribeResult as L, StreamingTranscriptionSession as M, SystemInfo as N, SimilarityResult as O, TTSModelConfig as P, TranscribeSegment as R, ModelStats as S, STTModelConfig as T, LoadSTTOptions as _, EmbedResult as a, ModelLoadEvent as b, GenerateEvent as c, GerbilConfig as d, GerbilModelSettings as f, LoadOptions as g, JsonOptions as h, EmbedOptions as i, StreamingTranscriptionOptions as j, SpeakOptions as k, GenerateOptions as l, ImageInput as m, CacheConfig as n, ErrorContext as o, GerbilProviderSettings as p, ConcurrencyConfig as r, FallbackConfig as s, AudioChunk as t, GenerateResult as u, LoadTTSOptions as v, ProgressInfo as w, ModelSource as x, ModelConfig as y, VoiceInfo as z };
+//# sourceMappingURL=types-LlyYILII.d.mts.map
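The telemetry and concurrency types added in this hunk might be wired up roughly as follows. This is a sketch under assumptions: the types are restated locally so the block is self-contained, `TelemetryHooks` is a hypothetical subset of `TelemetryConfig` (it omits `onGenerate` because `GenerateResult` is not shown in this diff), and the hooks are invoked by hand to simulate the engine calling them.

```typescript
// Local restatements of the types added in the diff (sketch only).
type ErrorContext = Record<string, unknown>;
type ModelLoadEvent = {
  modelId: string;
  loadTimeMs: number;
  fromCache: boolean;
  device: "webgpu" | "cpu" | "wasm";
  success: boolean;
  error?: string;
};
// Hypothetical subset of TelemetryConfig; `onGenerate` is left out on purpose.
type TelemetryHooks = {
  onError?: (error: Error, context: ErrorContext) => void;
  onModelLoad?: (event: ModelLoadEvent) => void;
  onQueueWait?: (waitTimeMs: number) => void;
};
type ConcurrencyConfig = { maxConcurrent?: number; timeout?: number };

// An in-memory log stands in for Sentry or a metrics client.
const log: string[] = [];
const telemetry: TelemetryHooks = {
  onError: (error, context) => {
    log.push(`error:${error.message}:${String(context.modelId)}`);
  },
  onModelLoad: (event) => {
    log.push(`load:${event.modelId}:${event.success ? "ok" : "failed"}:${event.loadTimeMs}ms`);
  },
  onQueueWait: (waitTimeMs) => {
    log.push(`queued:${waitTimeMs}ms`);
  },
};

// Values match the defaults documented in the diff's comments.
const concurrency: ConcurrencyConfig = { maxConcurrent: 1, timeout: 300_000 };

// These two fields would sit alongside `model`, `device`, etc. in a GerbilConfig.
const configFragment = { telemetry, concurrency };

// Simulate the library invoking the hooks ("demo-model" is a placeholder id).
configFragment.telemetry.onModelLoad?.({
  modelId: "demo-model",
  loadTimeMs: 1200,
  fromCache: true,
  device: "webgpu",
  success: true,
});
configFragment.telemetry.onQueueWait?.(250);
configFragment.telemetry.onError?.(new Error("timeout"), { modelId: "demo-model" });
```

Keeping the hooks as plain functions means the same object can forward to any logging backend without the library depending on it.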
@@ -0,0 +1 @@
|
|
|
1
|
+
{"version":3,"file":"types-LlyYILII.d.mts","names":[],"sources":["../src/core/types.ts"],"sourcesContent":[],"mappings":";;;;AAyBY,KAfA,WAAA,GAeW;EASX,EAAA,EAAA,MAAA;EAWA,IAAA,EAAA,MAAA;EAmCA,WAAA,EAAA,MAAc;EA8Bd,IAAA,EAAA,MAAA;EAkBA,aAAA,EAAY,MAAA;EAQZ,gBAAW,EAAA,OAAA;EAWX,YAAA,EAAA,OAAY;EAWZ;EAkBA,cAAW,CAAA,EAAA,OAAA;EAqBX;EAOA,iBAAY,CAAA,EAAA,MAAA;EAYZ,MAAA,EAAA,MAAA,GAAY,QAAA,GAAA,KAAA,GAAA,SAAA,GAAA,OAAA,GAAA,OAAA;CAWd;AAGG,KA7MD,WAAA,GA6MC;EAGC,IAAA,EAAA,SAAA,GAAA,aAAA,GAAA,OAAA;EAGE,IAAA,EAAA,MAAA;CAAiB;AAGrB,KA7MA,UAAA,GA6MW;EAiBX;EAqBA,MAAA,EAAA,MAAA;EAUA;EAOA,GAAA,CAAA,EAAA,MAAA;AAyBZ,CAAA;AAWY,KA7RA,eAAA,GA6RsB;EAYtB;EAeA,SAAA,CAAA,EAAA,MAAc;EAmBd;EAWA,WAAA,CAAA,EAAU,MAAA;EAWV;EAaA,IAAA,CAAA,EAAA,MAAA;EAWA;EAiBA,IAAA,CAAA,EAAA,MAAA;EASA;EASA,aAAA,CAAA,EAAA,MAAgB,EAAA;EAahB;EAWA,MAAA,CAAA,EAAA,MAAA;EAeA;EAES,QAAA,CAAA,EAAA,OAAA;EAEN;EAID,OAAA,CAAA,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,IAgCM,CAhCN,EAAA;IAAO,UAAA,EAAA,MAAA;IAqBT,GAAA,EAAA,MAAA;IAKW,SAAA,EAAA,MAAA;EAMH,CAAA,EAAA,GAAA,IAAA;EAAgB;EAKZ,MAAA,CAAA,EAtdb,UAsda,EAAA;EAAc;EAQ1B,KAAA,CAAA,EAAA,OAAA;EAeA;EAEA,QAAA,CAAA,EAAA,MAAc;AAmB1B,CAAA;KAzfY,cAAA;;;;;;;;;;;;;;;;;;KA8BA;;UAEF,CAAA,CAAE,QAAQ;;;;;;;;KAgBR,YAAA;;;;;;KAQA,WAAA;;;;;;;;KAWA,YAAA;;;;;;;;KAWA,gBAAA;;;;;;;;;;KAkBA,WAAA;;sBAEU;;;;;;;;;;;;;;;KAmBV,cAAA;;sBAEU;;;;KAKV,YAAA;;;;;;;KAYA,YAAA;;;;;;;;UAWF;;aAGG;;cAGC;;gBAGE;;KAGJ,WAAA;;;;;;;;;;;;KAiBA,cAAA;;;;;;;;;;;;KAqBA,YAAA;;;;;;;;;KAUA,UAAA;;;;;;KAOA,UAAA;;SAEH;;;;;;;;;;;;;;;;;;KAuBG,mBAAA;;;;;;;;KAWA,sBAAA;;;;;;KAYA,SAAA;;;;;;;;;;;;;;KAeA,cAAA;;;;;;;;;;;;UAYF;;;;;;KAOE,YAAA;;;;;;sBAMU;;yBAEG;;KAGb,UAAA;;WAED;;;;;;;;KASC,WAAA;;SAEH;;;;;;;;;;KAWG,cAAA;;sBAEU;;;;KASV,cAAA;;;;;;;;;;;;;;;;KAiBA,iBAAA;;;;;;sBAMU;;KAGV,iBAAA;;;;;;;;KASA,gBAAA;;;;;;aAMC;;;;;;KAOD,cAAA;;sBAEU;;;;KASV,6BAAA;;;;;;;;;;;;;;KAeA,6BAAA;;qBAES;;eAEN;;;;cAID;;;;;;;;;;;;;;;;KAqBF,eAAA;;;;;uBAKW;;;;;oBAMH,gBAAgB;;;;wBAKZ;;;;;;KAQZ,aAAA;;;;UAIF;;;;;;;;;;KAWE,YAAA,GAAe;KAEf,cAAA;;;;;;;;;;;;;;KAmBA,iBAAA"}
|
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"utils-
|
|
1
|
+
{"version":3,"file":"utils-DKO55ZmZ.mjs","names":["properties: Record<string, any>","required: string[]"],"sources":["../src/core/utils.ts"],"sourcesContent":["/**\n * Shared utility functions for Gerbil core\n */\n\nimport type { z } from \"zod\";\n\n/**\n * Convert Zod schema to JSON Schema (simplified)\n * Handles objects, arrays, primitives, enums, optionals, and defaults\n */\nexport function zodToJsonSchema(schema: z.ZodType<any>): object {\n try {\n if (\"_def\" in schema) {\n const def = (schema as any)._def;\n\n if (def.typeName === \"ZodObject\") {\n const shape = def.shape();\n const properties: Record<string, any> = {};\n const required: string[] = [];\n\n for (const [key, value] of Object.entries(shape)) {\n properties[key] = zodToJsonSchema(value as z.ZodType<any>);\n // Check if required (not optional)\n if (!(value as any)._def?.typeName?.includes(\"Optional\")) {\n required.push(key);\n }\n }\n\n return { type: \"object\", properties, required };\n }\n if (def.typeName === \"ZodString\") {\n return { type: \"string\", description: def.description };\n }\n if (def.typeName === \"ZodNumber\") {\n return { type: \"number\", description: def.description };\n }\n if (def.typeName === \"ZodBoolean\") {\n return { type: \"boolean\" };\n }\n if (def.typeName === \"ZodArray\") {\n return { type: \"array\", items: zodToJsonSchema(def.type) };\n }\n if (def.typeName === \"ZodEnum\") {\n return { type: \"string\", enum: def.values };\n }\n if (def.typeName === \"ZodOptional\") {\n return zodToJsonSchema(def.innerType);\n }\n if (def.typeName === \"ZodDefault\") {\n const inner = zodToJsonSchema(def.innerType);\n return { ...inner, default: def.defaultValue() };\n }\n }\n } catch {}\n\n return { type: \"string\" };\n}\n\n/**\n * Extract JSON from text (finds first { } or [ ] block)\n */\nexport function extractJson(text: string): string {\n const jsonMatch = text.match(/\\{[\\s\\S]*\\}/);\n if (jsonMatch) {\n return jsonMatch[0];\n }\n\n const arrayMatch = 
text.match(/\\[[\\s\\S]*\\]/);\n if (arrayMatch) {\n return arrayMatch[0];\n }\n\n return text;\n}\n"],"mappings":";;;;;AAUA,SAAgB,gBAAgB,QAAgC;AAC9D,KAAI;AACF,MAAI,UAAU,QAAQ;GACpB,MAAM,MAAO,OAAe;AAE5B,OAAI,IAAI,aAAa,aAAa;IAChC,MAAM,QAAQ,IAAI,OAAO;IACzB,MAAMA,aAAkC,EAAE;IAC1C,MAAMC,WAAqB,EAAE;AAE7B,SAAK,MAAM,CAAC,KAAK,UAAU,OAAO,QAAQ,MAAM,EAAE;AAChD,gBAAW,OAAO,gBAAgB,MAAwB;AAE1D,SAAI,CAAE,MAAc,MAAM,UAAU,SAAS,WAAW,CACtD,UAAS,KAAK,IAAI;;AAItB,WAAO;KAAE,MAAM;KAAU;KAAY;KAAU;;AAEjD,OAAI,IAAI,aAAa,YACnB,QAAO;IAAE,MAAM;IAAU,aAAa,IAAI;IAAa;AAEzD,OAAI,IAAI,aAAa,YACnB,QAAO;IAAE,MAAM;IAAU,aAAa,IAAI;IAAa;AAEzD,OAAI,IAAI,aAAa,aACnB,QAAO,EAAE,MAAM,WAAW;AAE5B,OAAI,IAAI,aAAa,WACnB,QAAO;IAAE,MAAM;IAAS,OAAO,gBAAgB,IAAI,KAAK;IAAE;AAE5D,OAAI,IAAI,aAAa,UACnB,QAAO;IAAE,MAAM;IAAU,MAAM,IAAI;IAAQ;AAE7C,OAAI,IAAI,aAAa,cACnB,QAAO,gBAAgB,IAAI,UAAU;AAEvC,OAAI,IAAI,aAAa,aAEnB,QAAO;IAAE,GADK,gBAAgB,IAAI,UAAU;IACzB,SAAS,IAAI,cAAc;IAAE;;SAG9C;AAER,QAAO,EAAE,MAAM,UAAU;;;;;AAM3B,SAAgB,YAAY,MAAsB;CAChD,MAAM,YAAY,KAAK,MAAM,cAAc;AAC3C,KAAI,UACF,QAAO,UAAU;CAGnB,MAAM,aAAa,KAAK,MAAM,cAAc;AAC5C,KAAI,WACF,QAAO,WAAW;AAGpB,QAAO"}
|
|
@@ -0,0 +1,95 @@
+//#region src/memory/serialize.ts
+/** Convert a runtime record to its JSON-safe form. */
+function serializeRecord(record) {
+  return {
+    id: record.id,
+    text: record.text,
+    embedding: record.embedding ? Array.from(record.embedding) : void 0,
+    metadata: record.metadata,
+    createdAt: record.createdAt
+  };
+}
+/** Rebuild a runtime record (with a {@link Float32Array}) from JSON form. */
+function deserializeRecord(record) {
+  return {
+    id: record.id,
+    text: record.text,
+    embedding: record.embedding ? Float32Array.from(record.embedding) : void 0,
+    metadata: record.metadata ?? {},
+    createdAt: record.createdAt
+  };
+}
+/**
+ * True when every key in `filter` is present on `metadata` with an equal
+ * (strict ===) value. An empty/undefined filter matches everything.
+ */
+function matchesFilter(metadata, filter) {
+  if (!filter) return true;
+  for (const key of Object.keys(filter)) if (metadata[key] !== filter[key]) return false;
+  return true;
+}
+
+//#endregion
+//#region src/memory/vector.ts
+/**
+ * Vector math for cosine similarity search.
+ *
+ * Vectors are stored L2-normalized so cosine similarity is a plain dot
+ * product. Normalization happens once on insert via {@link normalize}.
+ */
+/**
+ * Return an L2-normalized copy of `vector`.
+ *
+ * A zero vector is returned unchanged (its norm is 0).
+ */
+function normalize(vector) {
+  let sumSquares = 0;
+  for (let i = 0; i < vector.length; i++) sumSquares += vector[i] * vector[i];
+  const norm = Math.sqrt(sumSquares);
+  if (norm === 0) return vector;
+  const out = new Float32Array(vector.length);
+  for (let i = 0; i < vector.length; i++) out[i] = vector[i] / norm;
+  return out;
+}
+/**
+ * Dot product of two equal-length vectors.
+ *
+ * For L2-normalized inputs this equals cosine similarity.
+ */
+function dot(a, b) {
+  const length = Math.min(a.length, b.length);
+  let sum = 0;
+  for (let i = 0; i < length; i++) sum += a[i] * b[i];
+  return sum;
+}
+/**
+ * Cosine similarity of two vectors, normalizing on the fly.
+ *
+ * Prefer {@link dot} when both inputs are already normalized.
+ */
+function cosine(a, b) {
+  return dot(normalize(a), normalize(b));
+}
+/**
+ * Score every candidate against `query` (dot product) and return the top `k`
+ * by descending score, optionally filtering by a minimum score.
+ *
+ * Inputs are assumed normalized; this keeps the hot path branch-free.
+ */
+function topK(query, candidates, k, minScore) {
+  const scored = [];
+  for (const candidate of candidates) {
+    const score = dot(query, candidate.vector);
+    if (minScore !== void 0 && score < minScore) continue;
+    scored.push({
+      item: candidate.item,
+      score
+    });
+  }
+  scored.sort((a, b) => b.score - a.score);
+  return scored.slice(0, k);
+}
+
+//#endregion
+export { deserializeRecord as a, topK as i, dot as n, matchesFilter as o, normalize as r, serializeRecord as s, cosine as t };
+//# sourceMappingURL=vector-B0panuy6.mjs.map
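The vector helpers in this chunk rely on one idea: once vectors are L2-normalized, cosine similarity reduces to a dot product, so ranking needs no per-query normalization. The sketch below restates `normalize`, `dot`, and `topK` as typed local copies (the published chunk exports them under mangled names, so nothing is imported) and runs them on a toy corpus; the candidate ids and vectors are made up for illustration.

```typescript
// Typed local copies of the helpers in the hunk above (sketch only).
function normalize(vector: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < vector.length; i++) sumSquares += vector[i] * vector[i];
  const norm = Math.sqrt(sumSquares);
  if (norm === 0) return vector; // zero vector is returned unchanged
  const out = new Float32Array(vector.length);
  for (let i = 0; i < vector.length; i++) out[i] = vector[i] / norm;
  return out;
}

function dot(a: Float32Array, b: Float32Array): number {
  const length = Math.min(a.length, b.length);
  let sum = 0;
  for (let i = 0; i < length; i++) sum += a[i] * b[i];
  return sum;
}

type Scored<T> = { item: T; score: number };

function topK<T>(
  query: Float32Array,
  candidates: { item: T; vector: Float32Array }[],
  k: number,
  minScore?: number,
): Scored<T>[] {
  const scored: Scored<T>[] = [];
  for (const candidate of candidates) {
    const score = dot(query, candidate.vector);
    if (minScore !== undefined && score < minScore) continue;
    scored.push({ item: candidate.item, score });
  }
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k);
}

// Toy corpus: vectors are normalized once, at insert time.
const corpus = [
  { item: "a", vector: normalize(new Float32Array([2, 0])) }, // same direction as the query
  { item: "b", vector: normalize(new Float32Array([0, 3])) }, // orthogonal to the query
  { item: "c", vector: normalize(new Float32Array([1, 1])) }, // 45 degrees from the query
];
const query = normalize(new Float32Array([5, 0]));

// Keep the two best matches with a score of at least 0.5.
const hits = topK(query, corpus, 2, 0.5);
```

With these inputs the orthogonal candidate scores 0 and is dropped by `minScore`, while the 45-degree candidate scores about 0.707 (the cosine of 45 degrees).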
@@ -0,0 +1 @@
|
|
|
1
|
+
{"version":3,"file":"vector-B0panuy6.mjs","names":["scored: Scored<T>[]"],"sources":["../src/memory/serialize.ts","../src/memory/vector.ts"],"sourcesContent":["/**\n * (De)serialization helpers shared by stores and import/export.\n *\n * Embeddings are stored at runtime as {@link Float32Array} but serialized as\n * plain number arrays so records survive `JSON.stringify` and IndexedDB\n * structured clone round-trips.\n */\n\nimport type { MemoryRecord, SerializedRecord } from \"./types.js\";\n\n/** Convert a runtime record to its JSON-safe form. */\nexport function serializeRecord(record: MemoryRecord): SerializedRecord {\n return {\n id: record.id,\n text: record.text,\n embedding: record.embedding ? Array.from(record.embedding) : undefined,\n metadata: record.metadata,\n createdAt: record.createdAt,\n };\n}\n\n/** Rebuild a runtime record (with a {@link Float32Array}) from JSON form. */\nexport function deserializeRecord(record: SerializedRecord): MemoryRecord {\n return {\n id: record.id,\n text: record.text,\n embedding: record.embedding ? Float32Array.from(record.embedding) : undefined,\n metadata: record.metadata ?? {},\n createdAt: record.createdAt,\n };\n}\n\n/**\n * True when every key in `filter` is present on `metadata` with an equal\n * (strict ===) value. An empty/undefined filter matches everything.\n */\nexport function matchesFilter(\n metadata: Record<string, unknown>,\n filter?: Record<string, unknown>,\n): boolean {\n if (!filter) {\n return true;\n }\n for (const key of Object.keys(filter)) {\n if (metadata[key] !== filter[key]) {\n return false;\n }\n }\n return true;\n}\n","/**\n * Vector math for cosine similarity search.\n *\n * Vectors are stored L2-normalized so cosine similarity is a plain dot\n * product. 
Normalization happens once on insert via {@link normalize}.\n */\n\n/**\n * Return an L2-normalized copy of `vector`.\n *\n * A zero vector is returned unchanged (its norm is 0).\n */\nexport function normalize(vector: Float32Array): Float32Array {\n let sumSquares = 0;\n for (let i = 0; i < vector.length; i++) {\n sumSquares += vector[i] * vector[i];\n }\n const norm = Math.sqrt(sumSquares);\n if (norm === 0) {\n return vector;\n }\n const out = new Float32Array(vector.length);\n for (let i = 0; i < vector.length; i++) {\n out[i] = vector[i] / norm;\n }\n return out;\n}\n\n/**\n * Dot product of two equal-length vectors.\n *\n * For L2-normalized inputs this equals cosine similarity.\n */\nexport function dot(a: Float32Array, b: Float32Array): number {\n const length = Math.min(a.length, b.length);\n let sum = 0;\n for (let i = 0; i < length; i++) {\n sum += a[i] * b[i];\n }\n return sum;\n}\n\n/**\n * Cosine similarity of two vectors, normalizing on the fly.\n *\n * Prefer {@link dot} when both inputs are already normalized.\n */\nexport function cosine(a: Float32Array, b: Float32Array): number {\n return dot(normalize(a), normalize(b));\n}\n\n/** A scored item used by {@link topK}. 
*/\nexport type Scored<T> = { item: T; score: number };\n\n/**\n * Score every candidate against `query` (dot product) and return the top `k`\n * by descending score, optionally filtering by a minimum score.\n *\n * Inputs are assumed normalized; this keeps the hot path branch-free.\n */\nexport function topK<T>(\n query: Float32Array,\n candidates: { item: T; vector: Float32Array }[],\n k: number,\n minScore?: number,\n): Scored<T>[] {\n const scored: Scored<T>[] = [];\n for (const candidate of candidates) {\n const score = dot(query, candidate.vector);\n if (minScore !== undefined && score < minScore) {\n continue;\n }\n scored.push({ item: candidate.item, score });\n }\n scored.sort((a, b) => b.score - a.score);\n return scored.slice(0, k);\n}\n"],"mappings":";;AAWA,SAAgB,gBAAgB,QAAwC;AACtE,QAAO;EACL,IAAI,OAAO;EACX,MAAM,OAAO;EACb,WAAW,OAAO,YAAY,MAAM,KAAK,OAAO,UAAU,GAAG;EAC7D,UAAU,OAAO;EACjB,WAAW,OAAO;EACnB;;;AAIH,SAAgB,kBAAkB,QAAwC;AACxE,QAAO;EACL,IAAI,OAAO;EACX,MAAM,OAAO;EACb,WAAW,OAAO,YAAY,aAAa,KAAK,OAAO,UAAU,GAAG;EACpE,UAAU,OAAO,YAAY,EAAE;EAC/B,WAAW,OAAO;EACnB;;;;;;AAOH,SAAgB,cACd,UACA,QACS;AACT,KAAI,CAAC,OACH,QAAO;AAET,MAAK,MAAM,OAAO,OAAO,KAAK,OAAO,CACnC,KAAI,SAAS,SAAS,OAAO,KAC3B,QAAO;AAGX,QAAO;;;;;;;;;;;;;;;;ACpCT,SAAgB,UAAU,QAAoC;CAC5D,IAAI,aAAa;AACjB,MAAK,IAAI,IAAI,GAAG,IAAI,OAAO,QAAQ,IACjC,eAAc,OAAO,KAAK,OAAO;CAEnC,MAAM,OAAO,KAAK,KAAK,WAAW;AAClC,KAAI,SAAS,EACX,QAAO;CAET,MAAM,MAAM,IAAI,aAAa,OAAO,OAAO;AAC3C,MAAK,IAAI,IAAI,GAAG,IAAI,OAAO,QAAQ,IACjC,KAAI,KAAK,OAAO,KAAK;AAEvB,QAAO;;;;;;;AAQT,SAAgB,IAAI,GAAiB,GAAyB;CAC5D,MAAM,SAAS,KAAK,IAAI,EAAE,QAAQ,EAAE,OAAO;CAC3C,IAAI,MAAM;AACV,MAAK,IAAI,IAAI,GAAG,IAAI,QAAQ,IAC1B,QAAO,EAAE,KAAK,EAAE;AAElB,QAAO;;;;;;;AAQT,SAAgB,OAAO,GAAiB,GAAyB;AAC/D,QAAO,IAAI,UAAU,EAAE,EAAE,UAAU,EAAE,CAAC;;;;;;;;AAYxC,SAAgB,KACd,OACA,YACA,GACA,UACa;CACb,MAAMA,SAAsB,EAAE;AAC9B,MAAK,MAAM,aAAa,YAAY;EAClC,MAAM,QAAQ,IAAI,OAAO,UAAU,OAAO;AAC1C,MAAI,aAAa,UAAa,QAAQ,SACpC;AAEF,SAAO,KAAK;GAAE,MAAM,UAAU;GAAM;GAAO,CAAC;;AAE9C,QAAO,MAAM,GAAG,MAAM,EAAE,QAA
Q,EAAE,MAAM;AACxC,QAAO,OAAO,MAAM,GAAG,EAAE"}
|
|
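The embedded source for `serialize.ts` notes that embeddings are kept as `Float32Array` at runtime but serialized as plain number arrays. A short sketch shows why that conversion matters: `JSON.stringify` turns a typed array into an index-keyed object, not an array. The record types here are assumptions restated locally (in particular, `createdAt` is typed as `number` only for illustration), and the helper bodies are typed copies of the ones in the diff.

```typescript
// Assumed record shapes; the real definitions live in the package's memory types.
type MemoryRecord = {
  id: string;
  text: string;
  embedding?: Float32Array;
  metadata: Record<string, unknown>;
  createdAt: number; // assumption: the diff does not show this field's type
};
type SerializedRecord = {
  id: string;
  text: string;
  embedding?: number[];
  metadata?: Record<string, unknown>;
  createdAt: number;
};

function serializeRecord(record: MemoryRecord): SerializedRecord {
  return {
    id: record.id,
    text: record.text,
    embedding: record.embedding ? Array.from(record.embedding) : undefined,
    metadata: record.metadata,
    createdAt: record.createdAt,
  };
}

function deserializeRecord(record: SerializedRecord): MemoryRecord {
  return {
    id: record.id,
    text: record.text,
    embedding: record.embedding ? Float32Array.from(record.embedding) : undefined,
    metadata: record.metadata ?? {},
    createdAt: record.createdAt,
  };
}

function matchesFilter(
  metadata: Record<string, unknown>,
  filter?: Record<string, unknown>,
): boolean {
  if (!filter) return true;
  for (const key of Object.keys(filter)) {
    if (metadata[key] !== filter[key]) return false;
  }
  return true;
}

const original: MemoryRecord = {
  id: "m1",
  text: "gerbils like seeds",
  embedding: new Float32Array([0.5, 0.25]),
  metadata: { kind: "note" },
  createdAt: 0,
};

// Stringifying the typed array directly yields an object keyed by index.
const naive = JSON.stringify(original.embedding);

// Going through serializeRecord keeps the embedding as a JSON array.
const wire = JSON.stringify(serializeRecord(original));
const restored = deserializeRecord(JSON.parse(wire) as SerializedRecord);

const isNote = matchesFilter(restored.metadata, { kind: "note" });
const isTodo = matchesFilter(restored.metadata, { kind: "todo" });
```

The filter check uses strict equality per key, so it suits flat metadata with primitive values; nested objects would only match by reference.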
@@ -0,0 +1,321 @@
+# Gerbil — Project State & Architecture Decision
+
+**As of 2026-06-14.** The single authoritative snapshot. When something here conflicts
+with an older doc, this wins. Supersedes scattered findings in `docs/research/*` and
+the paper's roadmap. **Newest material is §12** (native audio begins — Moonshine STT
++ Kani-TTS-2 NanoCodec decoder; Gemma 4 E2B text decode; on-device memory/RAG; the
+text+ViT autoresearch campaign; gerbil-site live on the native engine). §11 captured
+EmbeddingGemma on-device, LFM2.5, the SPM tokenizer fix, MLX/DWQ loader, progress fix.
+
+---
+
+## 1. What Gerbil is
+
+A local LLM inference library that runs models in the browser and Node on a
+**single native WebGPU engine** behind one task API (no fallback lane — §2). The
+headline achievements this cycle: the from-scratch native engine **works on mobile**
+(iPad/iOS Safari 26.5+, WebKit) — previously it crashed — a desktop optimization
+pass roughly doubled its throughput, and the engine went **multimodal natively**:
+text embeddings ship (Qwen3-Embedding-0.6B) and the Qwen3.5 vision encoder is built
+**bit-exact vs HuggingFace** (vision LM-integration is phase 2).
+
+## 2. The decided architecture — NATIVE-ONLY (owner decision, overrides the panel)
+
+**Decision (owner, 2026-06-13):** Gerbil is a single native WebGPU engine. **No
+tfjs fallback lane** — a permanent fallback "assumes defeat to begin with." The
+panel proposed keeping tfjs as a breadth lane; the owner rejected that. Instead:
+
+- **Launch set = text + vision + embeddings, ALL native.** One model per modality
+  by default, expandable to other families via the add-model-family process.
+- **Audio (TTS/STT) is deferred, not delegated to tfjs** — it ships later as
+  small *native* models (candidates under eval: VibeVoice-1.5B, OmniVoice,
+  dots.tts-soar for TTS — IF they publish safetensors; Moonshine for STT). "What's
+  the big deal" — launching without audio is fine; a permanent second engine is not.
+- **tfjs is at most temporary dev scaffolding, not a destination.** It may stay
+  briefly to keep desktop demos working during the native build, but it is being
+  removed, not kept as a lane. `chrome-backend.ts` is deleted outright.
+- **Vision uses Qwen3.5's OWN built-in ViT** (we currently skip its 192MB tower) —
+  one multimodal model, not a separate vision model. Stop dropping it "like idiots."
+- **A thin onnxruntime-web bridge is a break-glass option only** — used solely if
+  a needed model has no extractable weights AND no native alternative. Not a lane.
+
+**Status update (2026-06-13):** both next modalities have landed natively.
+**Embeddings ship** (Qwen3-Embedding-0.6B, validated). The **vision encoder is
+DONE** — the Qwen3.5 ViT runs natively and is **bit-exact vs HF transformers 5.12**
+(per-token cosine 1.000000). The earlier "parallel attention kernel first" gate was
+based on a misread of dead code (§5): the attention kernel was already parallel, and
+non-causal was a one-line `is_causal` flag, not a rewrite. The remaining vision work
|
|
48
|
+
is **LM-side integration** (M-RoPE, token splice, image preprocessing) — phase 2,
|
|
49
|
+
plumbing over a verified core. Audio follows once a native small model is validated.
|
|
50
|
+
|
|
51
|
+
## 3. Capability matrix (what's true today)
|
|
52
|
+
|
|
53
|
+
Native-only. The "tfjs lane" is gone — tfjs is temporary dev scaffolding being
|
|
54
|
+
removed (§2), not a capability path.
|
|
55
|
+
|
|
56
|
+
| Modality | Native status |
|
|
57
|
+
|---|---|
|
|
58
|
+
| Text | ✅ ~51 tok/s mobile (sustained 200-tok), ~207 desktop, bit-correct |
|
|
59
|
+
| Text (alt family) | ✅ **LFM2.5-350M LANDED** (`Lfm2ForCausalLM`, hybrid conv/attn): ~600 tok/s desktop (2.8× Qwen), ~46 tok/s mobile, ~199MB q4, no new kernels |
|
|
60
|
+
| Embeddings | ✅ **DONE + ON iPad** — **EmbeddingGemma-300M** (bidirectional Gemma3 encoder, 173MB MLX-4bit) runs on iPad Safari; cos=1.00000 vs NumPy ref, Mars/bread margin >0.1, dim 768. First non-Qwen embedder. (Qwen3-Embedding-0.6B still works on desktop but OOMs iPad at 1.2GB BF16.) |
|
|
61
|
+
| Vision (image) | ✅ **END-TO-END DONE** — Qwen3.5 ViT, bit-exact vs HF (cosine 1.000000); `describeImage()` word-identical to HF; runs on iPad |
|
|
62
|
+
| Text model families (Llama/Mistral/Gemma) | 🟢 cheap — now usually **Tier-1, generator-only, NO new kernels** (kernel library saturated for standard transformers) |
|
|
63
|
+
| Text (Gemma 4 E2B) | ✅ **COHERENT on q4** (`gemma4.ts`): "capital of France"→"Paris", ~83 tok/s, all 35 layers cos≥0.998 vs MLX-LM. PLE, KV-share (20 layers), proportional/dual-theta RoPE, GeGLU, `Softcap`, double-wide MLP, V-norm, per-node `attn_scale`, head_dim-512 attention. **PLE CPU-streamed (0 MB GPU)** — not GPU-sharded |
|
|
64
|
+
| STT | 🟢 **Moonshine native LANDED** (`moonshine.ts`, `moonshine-executor.ts`, `moonshine-stt.ts`): raw-waveform Conv1d front-end (no FFT/log-mel), bit-exact `CrossAttention` kernel (max abs err <2e-4, cos≥0.9999), dual-graph encode-once/frozen-K-V/AR-decode, interleaved RoPE. Encoder cos≈0.990 vs HF; transcript substring-matches HF refs. **Whisper(ONNX) stays as multilingual / no-GPU fallback** (`src/core/stt.ts`) |
|
|
65
|
+
| TTS | 🟡 **Kani-TTS-2 partial** (`kani_tts.ts`): NanoCodec decoder (FSQ + causal HiFi-GAN) **LANDED + validated bit-exact** (`test-nanocodec-decode.mjs`, gate err<1e-3, measured ~4.2e-6). LFM2-350M codec-LM backbone scaffolded; **AR-loop glue remaining** (frame positions + learnable RoPE + 4-token-frame decode). License: kani-tts-2-en = LFM1.0/other; 450m variant = Apache |
|
|
66
|
+
| Memory / RAG | ✅ **SHIPPED** (`src/memory/`, `@tryhamster/gerbil/memory`): vector store (in-memory/IndexedDB/file), token-budgeted `recall()`, chunking, redaction, native EmbeddingGemma adapter. 12/12 tests. No new kernels |
|
|
67
|
+
| No-WebGPU / old devices | not targeted — engine throws a clear error rather than degrading |
|
|
68
|
+
|
|
69
|
+
## 4. Performance baselines (re-confirm before quoting)
|
|
70
|
+
|
|
71
|
+
| Platform | Config | Decode tok/s | Note |
|
|
72
|
+
|---|---|---|---|
|
|
73
|
+
| M4 Max, node-dawn | optimized | ~207 | re-confirm on a cooled run; numbers between commit 2f0cabc and the isMetalBackend fix are invalid |
|
|
74
|
+
| iPad (iOS 26.5) native | batch-all | ~51 (200-token sustained); ~38–41 on 60-token runs | NOT thermal and NOT a regression (§10): the lower readings came from short 60-token runs with more warmup overhead per token; benchmark with ≥200 tokens |
|
|
75
|
+
| iPad native, submit floor | group=1 awaited | 6–8 | proven-correct floor |
|
|
76
|
+
| iPad transformers.js (same model) | WebGPU | 7–12 | ~5× slower than native |
|
|
77
|
+
|
|
78
|
+
## 5. Vision — DONE at the encoder level (bit-exact vs HF)
|
|
79
|
+
|
|
80
|
+
The native Qwen3.5 vision encoder is **built and validated bit-exact** against HF
|
|
81
|
+
transformers 5.12: **per-token cosine = 1.000000, max abs err ~5e-6**. Exposed as
|
|
82
|
+
`engine.encodeImage(patches, gridTHW)` → merged image tokens `[rows, 1024]`. See
|
|
83
|
+
paper §22 and `src/gpu/architectures/qwen3_5_vision.ts`, `src/gpu/vision-executor.ts`,
|
|
84
|
+
`src/gpu/vision-preprocess.ts`.
|
|
85
|
+
|
|
86
|
+
The earlier strategy panel claimed native vision was blocked on a single-threaded
|
|
87
|
+
attention kernel. **That was wrong** — it read `src/gpu/kernels/wgsl/attention.wgsl`,
|
|
88
|
+
a STALE reference file **not imported anywhere** (kernels are embedded strings in
|
|
89
|
+
`registry.ts`; the dead `.wgsl` files have been deleted). What actually shipped:
|
|
90
|
+
|
|
91
|
+
1. **The live attention kernel was already parallel** — `WGSL_ATTENTION` is a
|
|
92
|
+
tiled, online-softmax (flash-attention-style) kernel. No thread-0 serialization,
|
|
93
|
+
no rewrite needed.
|
|
94
|
+
2. **Non-causal was a one-line flag.** An `is_causal` uniform
|
|
95
|
+
(`S_eff = is_causal ? min(S, causal_limit) : S`) makes it bidirectional for the
|
|
96
|
+
ViT; text stays causal by default. Done.
|
|
97
|
+
3. **Patch-embed was NOT a new Conv3d kernel.** Patches arrive pre-flattened to
|
|
98
|
+
`[N, 1536]` from the host image processor, so the 5-D unfold/Conv3d collapses to
|
|
99
|
+
a plain `MatMul` + `AddBias`. New ops added were small: `AddBias`, `GeluErf`
|
|
100
|
+
(exact-erf merger GELU), `ApplyRotaryEmb` (2D rotary), `SliceCols` (fused-QKV
|
|
101
|
+
split) — plus host-side pos-embed/rotary precompute (grid-only, bit-exact).
|
|
102
|
+
4. **A real bug surfaced during validation:** `WGSL_GELU` returned NaN for large
|
|
103
|
+
args on Metal/Dawn (`x³` overflow into fast-math `tanh`); fixed by clamping the
|
|
104
|
+
inner arg to ±15.
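
Items 2 and 4 above each reduce to a one-line formula. The following is a minimal CPU-side sketch in TypeScript, not the shipped WGSL: the function names are hypothetical, and `Math.fround` is used only to imitate f32 rounding of the `tanh` result. Only the `S_eff` expression and the ±15 clamp come from the text.

```typescript
// Hypothetical helper names; only the two formulas are taken from the list above.

/** Effective key length: causal text is capped, the bidirectional ViT sees all S keys. */
function effectiveKeyLength(S: number, causalLimit: number, isCausal: boolean): number {
  return isCausal ? Math.min(S, causalLimit) : S;
}

/** tanh-approximation GELU with the inner argument clamped to [-15, 15]. */
function geluTanhClamped(x: number): number {
  const k = Math.sqrt(2 / Math.PI);
  // The cubic term is what can overflow in f32 for large |x|.
  const inner = k * (x + 0.044715 * x * x * x);
  const clamped = Math.max(-15, Math.min(15, inner));
  // fround imitates f32: tanh(±15) rounds to exactly ±1 at that precision.
  const t = Math.fround(Math.tanh(clamped));
  return 0.5 * x * (1 + t);
}
```

The clamp is lossless at f32 precision because `tanh(15)` differs from 1 by roughly 2e-13, far below f32 resolution, so inputs beyond the clamp already saturate.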
|
|
105
|
+
|
|
106
|
+
**Net: vision is done at the encoder level.** Remaining work is **LM-side
|
|
107
|
+
integration** (M-RoPE, image-token splice into the text stream, pixel→patch
|
|
108
|
+
preprocessing) — phase 2, bounded plumbing over a verified numerical core. The
|
|
109
|
+
ViT prefill runs through the same parallel attention as text, so there is no
|
|
110
|
+
separate mobile attention risk; on-device ViT speed is a measure-when-integrated
|
|
111
|
+
item, not a feasibility gate.
|
|
112
|
+
|
|
113
|
+
## 6. Build sequence (ordered)
|
|
114
|
+
|
|
115
|
+
1. ~~**Native embeddings**~~ — **DONE.** Qwen3-Embedding-0.6B (`Qwen3ForCausalLM`): last-token EOS pooling + `L2Norm` tail. Validated: dim 1024, unit norm, cos(similar)=0.81 > cos(unrelated)=0.56. Confirmed causal-LM-pooling. `engine.embed()`.
|
|
116
|
+
2. ~~**OPFS model cache**~~ — **resolved differently: OPFS removed.** Main-thread OPFS `createWritable` is broken on iOS (leaves unclearable junk that fills the quota). The loader is now **Cache-API-only**. Durable iOS caching needs a PWA (§ paper 24, `docs/research/ios-safari-model-caching.md`), not OPFS.
|
|
117
|
+
3. ~~**Native vision encoder**~~ — **DONE, bit-exact vs HF** (§5). Remaining: LM-side integration (M-RoPE, token splice, image preprocessing) — phase 2.
|
|
118
|
+
4. **Vision LM integration (phase 2)** — splice `encodeImage()` tokens into the text stream: M-RoPE position assignment, placeholder-token splice, host pixel→patch preprocessing. Plumbing over a verified encoder.
|
|
119
|
+
5. **Close the mobile WebKit correctness/perf sweep** — `?group=N` + Test-R bisect on real iOS 26.5 for a stable fast-and-correct submit config above the group=1 floor; re-measure desktop.
|
|
120
|
+
6. **Llama/Mistral/Gemma graph generator** — ~90% is the existing qwen2 generator; unlocks most of the HF text zoo. Low risk, high ROI.
|
|
121
|
+
7. **Remove tfjs scaffolding** — delete `chrome-backend.ts`. The engine is native-only; tfjs was temporary dev scaffolding, not a kept lane.
|
|
122
|
+
8. **Audio, native (deferred, not delegated)** — TTS: **OmniVoice** (Qwen3 backbone + codec decoder — mostly the existing text path + a decoder). STT: **Moonshine** (lean encoder-decoder; raw-waveform Conv1d frontend, no log-mel Conv2d; needs a parallel CrossAttention kernel). Ship once a native small model clears the mobile bar.
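
The embedding tail named in item 1 (last-token pooling followed by `L2Norm`) can be sketched on the CPU as below. This is an illustration under assumptions, not the engine's kernels: the function names and the row-major `[seqLen, dim]` layout are hypothetical, and the last row is assumed to be the EOS position.

```typescript
// Hypothetical names and layout; the engine performs these steps as GPU ops.

/** Take the hidden state of the final token from a row-major [seqLen, dim] buffer. */
function lastTokenPool(hidden: Float32Array, seqLen: number, dim: number): Float32Array {
  const start = (seqLen - 1) * dim;
  return hidden.slice(start, start + dim);
}

/** Scale a vector to unit L2 norm; a zero vector is returned unchanged. */
function l2Normalize(v: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) {
    sumSq += v[i] * v[i];
  }
  const norm = Math.sqrt(sumSq);
  if (norm === 0) {
    return v.slice();
  }
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) {
    out[i] = v[i] / norm;
  }
  return out;
}

/** Pool then normalize, so cosine similarity becomes a plain dot product. */
function embedFromHidden(hidden: Float32Array, seqLen: number, dim: number): Float32Array {
  return l2Normalize(lastTokenPool(hidden, seqLen, dim));
}
```

Because the output is unit-norm, comparisons such as cos(similar) > cos(unrelated) need only a dot product between two embeddings.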
|
|
123
|
+
|
|
124
|
+
## 7. What NOT to build
|
|
125
|
+
|
|
126
|
+
- **A second permanent engine / kept tfjs lane** — the architecture is native-only across text + vision + embeddings. tfjs is temporary scaffolding being removed, not a breadth lane.
|
|
127
|
+
- **A Conv3d/unfold patch-embed kernel for the ViT** — not needed; patches arrive pre-flattened, so patch-embed is a plain MatMul + AddBias (§5).
|
|
128
|
+
- **A heavyweight native TTS pipeline with custom vocoder kernels now** — audio is deferred; when it lands, OmniVoice's Qwen3 backbone reuses the text path. A thin onnxruntime-web bridge stays break-glass-only.
|
|
129
|
+
- **A general Conv2d kernel / Whisper-on-native** — Moonshine's raw-waveform Conv1d avoids the log-mel Conv2d.
|
|
130
|
+
- **An OPFS write path on iOS** — main-thread `createWritable` is broken; durable caching is a PWA concern, not an OPFS one.
|
|
131
|
+
- **Trusting stale tok/s numbers** between 2f0cabc and the isMetalBackend fix.
|
|
132
|
+
|
|
133
|
+
## 8. Open spikes (measure before committing)
|
|
134
|
+
|
|
135
|
+
- ~~**Embedder pooling direction**~~ — RESOLVED: causal-LM last-token (EOS) pooling, validated.
|
|
136
|
+
- ~~**Parallel non-causal attention feasibility for the ViT**~~ — RESOLVED: the live attention kernel was already parallel; non-causal is a one-line `is_causal` flag; the encoder is bit-exact. Remaining ViT spike: **on-device ViT prefill speed on a real iPad** (measure when LM-integration lands; not a feasibility gate).
|
|
137
|
+
- **Mobile WebKit submit config** — does `?group=N`/Test-R land a stable fast config above the group=1 floor on iOS 26.5? The critical path.
|
|
138
|
+
- **Vision LM-integration correctness** — M-RoPE + token splice must stay bit-exact end-to-end (image+text), not just at the encoder boundary.
|
|
139
|
+
- **Native audio on mobile** — OmniVoice (codec-decoder) and Moonshine (CrossAttention kernel) latency/correctness once a candidate is validated.
|
|
140
|
+
- **Autoresearch on-device tax** (updated per §10): the apparent 41-vs-51 mobile gap was a short-run benchmark artifact, not a failed transfer of desktop kernel wins; the loop still needs a mobile-validation leg, possibly with per-backend tunings.
|
|
141
|
+
|
|
142
|
+
## 9. Decision log (this cycle)
|
|
143
|
+
|
|
144
|
+
- Mobile native inference fixed (four-bug diagnosis: jetsam memory, WebKit visibility, attention race, detection predicate). See `docs/mobile-failure-diagnosis.md`, paper §17–19.
|
|
145
|
+
- Desktop 145→~207 via autoresearch (paper §20).
|
|
146
|
+
- **Architecture decision: NATIVE-ONLY** across text + vision + embeddings; no tfjs fallback lane; `chrome-backend.ts` to be deleted (paper §23).
|
|
147
|
+
- **Native embeddings shipped** — Qwen3-Embedding-0.6B, last-token EOS pool + L2Norm, validated (paper §21).
|
|
148
|
+
- **Native vision encoder shipped** — Qwen3.5 ViT, bit-exact vs HF transformers 5.12 (cosine 1.000000); LM-integration is phase 2 (paper §22). Supersedes the earlier "skip the ViT" stance (`docs/research/qwen35-multimodal.md`).
|
|
149
|
+
- **Vision feasibility corrected** — attention was already parallel (panel read a dead `.wgsl` file); non-causal is a one-line flag (paper §22–23, §5 above).
|
|
150
|
+
- **OPFS removed** — main-thread `createWritable` broken on iOS; Cache-API-only loader; durable caching needs a PWA (`docs/research/ios-safari-model-caching.md`, paper §24).
|
|
151
|
+
- Modality model picks (`docs/research/sota-modality-models.md`); chromium decision (`docs/research/native-vs-chromium-decision.md`); site update plan (`docs/site-update-plan.md`).
|
|
152
|
+
|
|
153
|
+
---
|
|
154
|
+
|
|
155
|
+
## 10. 2026-06-13 evening — multimodal milestone + in-flight work (CAPTURE)
|
|
156
|
+
|
|
157
|
+
**Major wins this session (all committed to `feat/webgpu-engine-mobile`):**
|
|
158
|
+
|
|
159
|
+
- **Native VISION end-to-end DONE & validated bit-exact** (commit `5fed12b`). `engine.describeImage({pixels,width,height})` → coherent description. Validated vs HF transformers 5.12 on `examples/skelly.png`: 7/7 checks — encoder cosine 1.000000, M-RoPE 3D position ids EXACT, spliced embeds cosine 1.0, first-token exact, full greedy description WORD-IDENTICAL to HF for 201 chars ("This is a detailed black-and-white line drawing of the skeleton of a fox…"). New ops: `MRoPE`, `EmbedSplice`, host image preprocessing (smart-resize/normalize/patchify in `vision-preprocess.ts`). Text path unchanged (M-RoPE on linear positions reduces to 1D RoPE).
|
|
160
|
+
- **Native EMBEDDINGS** (earlier commit): Qwen3-Embedding-0.6B works on DESKTOP (cosine 0.81>0.56). BUT **not iPad-viable** — only BF16 (~1.2GB, OOMs on-device) or a broken MLX-DWQ. → Pivot below.
|
|
161
|
+
- **`is_causal` flag** on f32 attention (from vision) now unlocks **bidirectional encoders**.
|
|
162
|
+
|
|
163
|
+
**The 51-vs-41 mobile throughput question — RESOLVED: NOT a regression.** t15 hit 51.7 on a 200-token sustained run; the 38-41 readings were 60-token runs (more warmup overhead/token). A fresh 200-token sustained run on the CURRENT (post-autoresearch-optimization) engine **hit ~51 again**. So the Dawn-tuned wins did NOT regress mobile; sustained mobile decode is ~51 tok/s. Lesson: benchmark with enough tokens (≥200) to be representative.
|
|
164
|
+
|
|
165
|
+
**IN-FLIGHT / NOT YET MERGED (future-me: merge these):**
|
|
166
|
+
- **LFM2.5-350M** — ✅ **LANDED** on `feat/webgpu-engine-mobile` (commit `3c4bac8`). Tier-1, no new kernels. **~600 tok/s desktop (2.8× Qwen3.5's ~213), ~199MB q4 (half Qwen), coherent, bit-exact vs NumPy ref.** Files: new `src/gpu/architectures/lfm2.ts` (824 lines), registered `Lfm2ForCausalLM`, LFM2 CANONICAL_KEYS, loader key-mapper. TWO GENERAL FIXES worth keeping: (1) effective FF dim is 4608 not config's 6656 (`block_auto_adjust_ff_dim` rounding — `multiple_of(⌊2/3·6656⌋)`); (2) **the "garbage output" was the CHAT TEMPLATE, not the graph** — LFM2.5 ships its template as a `chat_template.jinja` sidecar (absent from tokenizer_config.json), so the engine fell back to Qwen ChatML which auto-injects an empty `<think>` → newline loop. Fix: fetch the `.jinja` sidecar + gate think-injection on the template actually emitting `<think>`. This fix helps ANY model with a jinja sidecar. Verdict: LFM2.5 is a viable faster/smaller text-default alternative to Qwen — user picks.
|
|
167
|
+
- **EmbeddingGemma-300M** — ✅ **LANDED + CONFIRMED ON iPad** (commit `4874d01`, see §11). The iPad-ready embedding model (~173MB q4, MTEB 68.36, standard MLX-4bit). Bidirectional Gemma3 encoder — needs the new generator + a MeanPool kernel + dual-theta RoPE + 2 Dense head layers + the MLX-detection loader fix (research found: detector requires `mode:"affine"` but standard MLX converts omit it → silently fall to F32; and DWQ vs standard MLX are config-indistinguishable → the DWQ-garbage trap). See `docs/research/sota-embedding-models.md`.
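
The effective FF dim quoted in the LFM2.5 entry (4608 rather than the config's 6656) can be reproduced with a few lines of arithmetic. This sketch assumes the rounding is "take ⌊2/3 · dim⌋, then round up to the next multiple", with a multiple of 256; the value 256 is an assumption chosen because it yields the stated 4608, and the function name is hypothetical.

```typescript
// Assumption: multipleOf = 256 and round-up semantics; neither is stated in the text.
function adjustedFfDim(configFfDim: number, multipleOf: number): number {
  // Step 1: two thirds of the configured dim, floored.
  const twoThirds = Math.floor((2 * configFfDim) / 3);
  // Step 2: round up to the next multiple.
  return multipleOf * Math.ceil(twoThirds / multipleOf);
}

const effective = adjustedFfDim(6656, 256);
console.log(effective);
```

Worked through: ⌊2 · 6656 / 3⌋ = 4437, and the next multiple of 256 at or above 4437 is 18 · 256 = 4608.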
|
|
168
|
+
|
|
169
|
+
**iPad harness:** dashboard at `https://<lan-ip>:8766/` with Text/Vision/transformers.js/Storage-probe/Clear-cache/Stop-queue buttons, live results table, `/results` + `/enqueue` + `/clear-queue` endpoints. Embeddings button removed until EmbeddingGemma lands. TODO: show generated text in the results + default Text benchmark to 200 tokens (representative); wire the Vision button to `describeImage` (needs browser ImageBitmap/Canvas→pixels decode).
|
|
170
|
+
|
|
171
|
+
**iOS caching:** OPFS removed (main-thread createWritable broken on iOS, left unclearable junk). Durable cache needs a PWA (persist() only granted when installed). See `docs/research/ios-safari-model-caching.md`. Deferred.
|
|
172
|
+
|
|
173
|
+
**Native modality scorecard (as of §10):** text ✅, vision ✅ (end-to-end bit-exact), embeddings ✅ desktop / iPad-pending (EmbeddingGemma building), LFM2.5 text-alt ✅ (pending merge). Audio (TTS/STT) deferred to native OmniVoice/Moonshine. **The native-only multimodal engine is real.** **→ All "pending/building" items above have since landed; see §11.**
|
|
174
|
+
|
|
175
|
+
---
|
|
176
|
+
|
|
177
|
+
## 11. 2026-06-14 — EmbeddingGemma on-device, LFM2.5 landed, loader hardening (CAPTURE)
|
|
178
|
+
|
|
179
|
+
The §10 in-flight items all landed on `feat/webgpu-engine-mobile`. New wins this session
|
|
180
|
+
(paper §25–§30 document each in depth):
|
|
181
|
+
|
|
182
|
+
- **EmbeddingGemma-300M LANDED and CONFIRMED RUNNING ON iPad Safari** (commit `4874d01`).
|
|
183
|
+
First **non-Qwen** embedding family — a real **bidirectional Gemma3 encoder**
|
|
184
|
+
(`src/gpu/architectures/gemma3_encoder.ts`, `generateGemma3EncoderGraph`): 24 pre-norm
|
|
185
|
+
blocks, GQA 3q/1kv head_dim 256, per-head q/k-norm, **dual-theta RoPE** (sliding θ=10000
|
|
186
|
+
/ full θ=1e6 selected per layer from `layer_types`), GeGLU MLP, Gemma's **four-norm
|
|
187
|
+
sandwich**, embed ×√768, tail **MeanPool → Dense0(768→3072) → Dense1(3072→768) → L2Norm**.
|
|
188
|
+
**173 MB at MLX-4bit** (vs the abandoned 1.2 GB Qwen3-Embedding that OOM'd iPad).
|
|
189
|
+
Two new kernels only: **MeanPool**, **Scale** (`kernels/registry.ts`). Validated
|
|
190
|
+
**cos=1.00000 vs an independent NumPy reference** (`scripts/engine/test-embedding-gemma-reference.py`),
|
|
191
|
+
reference gate `cos>0.95`; semantic test asserts a Red-Planet query is closer to two
|
|
192
|
+
Mars docs than to a bread doc by **>0.1 cosine margin**, unit-norm dim-768, no NaN. The
|
|
193
|
+
engine **generalizes across embedding families**, not just Qwen.
|
|
194
|
+
|
|
195
|
+
- **SPM tokenizer fix (load-bearing, cross-family)** (`src/gpu/tokenizer.ts`). Gemma's
|
|
196
|
+
`tokenizer.json` is `type:"BPE"` but **SentencePiece-flavored** (▁/U+2581 spaces, raw
|
|
197
|
+
UTF-8 tokens, array-form merges, `<0xHH>` byte-fallback). The byte-level (`Ġ`) BPE path
|
|
198
|
+
was char-splitting every word → semantically dead embeddings that still passed norm/NaN
|
|
199
|
+
checks. Auto-detected `spmMode` (structural: `" "→"▁"` Replace normalizer **or**
|
|
200
|
+
`byte_fallback && "▁the" in vocab`) now drives encode/decode/merges; **Qwen/LFM2 stay
|
|
201
|
+
byte-level** (no model-name list — they just don't match). Lesson: `type:"BPE"` ≠
|
|
202
|
+
byte-level BPE; any SentencePiece-lineage family (Gemma/Llama/Mistral) needs this path.
|
|
203
|
+
|
|
204
|
+
- **MLX-4bit loader hardened** (`src/gpu/model-loader.ts`). (1) Detection broadened to
|
|
205
|
+
accept mode-less `{bits:4, group_size}` configs (standard mlx-lm omits `mode`). (2) A
|
|
206
|
+
`VERIFIED_MLX_REPOS` allowlist (currently `mlx-community/embeddinggemma-300m-4bit`) gates
|
|
207
|
+
mode-less configs. (3) **Explicit DWQ reject** — DWQ repos carry an identical
|
|
208
|
+
`{bits:4,group_size}` config but pack weights that dequant to garbage, so they're rejected
|
|
209
|
+
by repo-name substring (`includes("dwq")`). Codifies the MLX-DWQ-garbage trap. (4) Gemma's
|
|
210
|
+
`(1+weight)` RMSNorm absorption is **baked by the loader even for MLX** (mlx-lm pre-absorbs
|
|
211
|
+
+1 for Qwen3.5 but NOT for Gemma — the Gemma branch deliberately omits `&& !isMLX`).
|
|
212
|
+
|
|
213
|
+
- **Progress-reporting fix** (commit `682a09b`, `model-loader.ts`). The bar froze at
|
|
214
|
+
"10% discovering weight files" because the gap between that emit and the first download
|
|
215
|
+
chunk (index probe + 2 header range-requests + first-byte latency) emitted nothing. Now
|
|
216
|
+
emits "Reading {file} header" + "Downloading {file} (0/{N} MB)" up front. **Affected every
|
|
217
|
+
model**; fixed universally.
|
|
218
|
+
|
|
219
|
+
- **LFM2.5-350M landed** (commit `3c4bac8`) — Tier-1, no new kernels, ~600 tok/s desktop /
|
|
220
|
+
~46 tok/s mobile, ~199 MB q4. The general jinja-sidecar chat-template fix from it helps any
|
|
221
|
+
model shipping `chat_template.jinja`.
|
|
222
|
+
|
|
223
|
+
- **Cross-device multi-modal parity.** Text (Qwen3.5 ~51 tok/s, LFM2.5 ~46 tok/s), vision
|
|
224
|
+
(Qwen3.5 ViT `describeImage`), and embeddings (EmbeddingGemma) **all now run natively on
|
|
225
|
+
iPad Safari** — the native engine reaches transformers.js-path modality coverage on the
|
|
226
|
+
modalities that matter, without the mobile crashes and ~5× faster. **Remaining native gap:
|
|
227
|
+
audio** — TTS via OmniVoice (in progress), STT via Moonshine (not started).
|
|
228
|
+
|
|
229
|
+
- **Effort-tier shift.** Adding a new TEXT family is now usually **Tier-1: generator only, no
|
|
230
|
+
new kernels** — the kernel library has saturated for standard transformers (Llama/Mistral/
|
|
231
|
+
Gemma-text reuse existing ops). New kernels are needed only for genuinely novel ops (SSM,
|
|
232
|
+
PLE, new norms, cross-attention). See `docs/adding-a-model-family.md`.
|
|
233
|
+
|
|
234
|
+
- **Site migration assessment** written: `docs/gerbil-site-native-migration.md` (how the
|
|
235
|
+
marketing/docs site can move its in-browser inference from the transformers.js/ONNX worker
|
|
236
|
+
to the native engine, modality-by-modality, with the no-fallback device-coverage tradeoff
|
|
237
|
+
stated plainly). Assessment only — the site repo was not modified.
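
The structural `spmMode` detection described in the tokenizer entry above can be sketched as a predicate over a parsed `tokenizer.json`. This is a hedged illustration, not the code in `src/gpu/tokenizer.ts`: the interfaces cover only the fields the check reads, the function names are hypothetical, and the handling of nested normalizers is an assumption.

```typescript
// Minimal shapes for the fields this check reads; real files carry more.
interface NormalizerJson {
  type: string;
  pattern?: { String?: string };
  content?: string;
  normalizers?: NormalizerJson[]; // present on "Sequence" normalizers
}

interface TokenizerJson {
  normalizer?: NormalizerJson | null;
  model: { type: string; byte_fallback?: boolean; vocab: Record<string, number> };
}

const SPM_SPACE = "\u2581"; // the "▁" word-boundary marker

/** True if a Replace normalizer maps " " to "▁", searching nested sequences. */
function hasSpaceReplace(n: NormalizerJson | null | undefined): boolean {
  if (!n) {
    return false;
  }
  if (n.type === "Replace" && n.pattern?.String === " " && n.content === SPM_SPACE) {
    return true;
  }
  return (n.normalizers ?? []).some((child) => hasSpaceReplace(child));
}

/** Structural check only: no model-name list is consulted. */
function detectSpmMode(t: TokenizerJson): boolean {
  const byteFallbackSpm =
    t.model.byte_fallback === true && (SPM_SPACE + "the") in t.model.vocab;
  return hasSpaceReplace(t.normalizer) || byteFallbackSpm;
}
```

A byte-level tokenizer whose vocab holds `Ġthe` and has no space-replacing normalizer fails both tests, which is how Qwen and LFM2 would stay on the byte-level path without being named.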
|
|
238
|
+
|
|
239
|
+
---
|
|
240
|
+
|
|
241
|
+
## 12. 2026-06-14 — native audio begins, Gemma 4, memory/RAG, autoresearch campaign (CAPTURE)
|
|
242
|
+
|
|
243
|
+
This session pushed past the §11 "multimodal parity minus audio" milestone. Paper
|
|
244
|
+
**§31–§35** document each in depth. The decided order from §6 (audio last, native not
|
|
245
|
+
delegated) held — and audio is now *under way natively*, not deferred to tfjs.
|
|
246
|
+
|
|
247
|
+
- **Native STT — Moonshine LANDED** (`src/gpu/architectures/moonshine.ts`,
|
|
248
|
+
`moonshine-executor.ts`, `moonshine-stt.ts`; paper §31). Chosen over Whisper to avoid
|
|
249
|
+
a log-mel/Conv2d front-end: it consumes **16 kHz PCM directly** through three strided
|
|
250
|
+
`Conv1d`s (downsample 384×, ~41.6 frames/s) — new kernels `Conv1dFull`, `GroupNorm`,
|
|
251
|
+
`Tanh`, `Transpose`. The real new attention primitive is **`CrossAttention`**
|
|
252
|
+
(`WGSL_CROSS_ATTENTION`, tiled online-softmax), validated **bit-exact vs NumPy**
|
|
253
|
+
(`test-crossattention.mjs`: max|err|<2e-4, cos≥0.9999). Runtime is a **dual graph**:
|
|
254
|
+
`MoonshineEncoderExecutor.encode()` runs the front-end + bidirectional encoder once and
|
|
255
|
+
freezes per-decoder-layer K/V; `MoonshineSTT.transcribe()` runs greedy AR decode with
|
|
256
|
+
self- + cross-attention into that frozen K/V. **Interleaved RoPE** (`ROPE_INTERLEAVED_SPEC`,
|
|
257
|
+
adjacent-dim pairing) vs the default split-half. Validation: encoder cos≈0.990 vs HF
|
|
258
|
+
(source comment); `test-moonshine-transcribe.mjs` asserts transcript contains HF-ref
|
|
259
|
+
substrings; RTF/4-bit size are computed-and-reported, not hardcoded-asserted.
|
|
260
|
+
**Whisper(ONNX) stays the multilingual / no-WebGPU fallback** (`src/core/stt.ts`,
|
|
261
|
+
`WhisperSTT`) — separate, untouched.
|
|
262
|
+
|
|
263
|
+
- **Native TTS — Kani-TTS-2 (partial)** (`src/gpu/architectures/kani_tts.ts`; paper §32).
|
|
264
|
+
The hard novel piece — the **NanoCodec decoder** (FSQ 4×4 levels `[9,8,8,7]`, mixed-radix
|
|
265
|
+
base `[1,9,72,576]` + causal HiFi-GAN, rates `[7,7,6,3,2]`, hop 1764 @ 22050 Hz) — is
|
|
266
|
+
**implemented and validated bit-exact** (`test-nanocodec-decode.mjs`, gate `err<1e-3`,
|
|
267
|
+
measured ~4.2e-6 vs MLX). New kernels: `FSQDequant`, `HalfSnake1d`,
|
|
268
|
+
`ConvTranspose1dDepthwise`. The backbone is **LFM2-350M** (`KaniTTS2ForCausalLM`, audio
|
|
269
|
+
tokens above the text vocab, 4/frame); `generateKaniTtsGraph` **deliberately throws** —
|
|
270
|
+
remaining is the frame-position + learnable-RoPE + 4-token-frame AR-decode glue (most
|
|
271
|
+
block math reused from `lfm2.ts`). License: **kani-tts-2-en = LFM1.0 (other)**, NanoCodec
|
|
272
|
+
= NVIDIA OML; the **450m variant is Apache** (same arch).
|
|
273
|
+
|
|
274
|
+
- **Gemma 4 E2B — text decode COHERENT on real q4 weights** (`src/gpu/architectures/gemma4.ts`;
|
|
275
|
+
paper §33). **Tier-2**, no MatFormer/AltUp/LAuReL. **PLE** (2nd embedding gathered per
|
|
276
|
+
token, per-layer gate+GELU+multiply+project+norm+residual), **KV-cache sharing** (E2B: 35
|
|
277
|
+
layers, last **20** shared via graph-rewire, no kernel change), **proportional RoPE**
|
|
278
|
+
(rotate 0.25·head_dim=64 dims but inv_freq over full head_dim denom; dual-theta 1e6/1e4),
|
|
279
|
+
**GeGLU**, and a new **`Softcap`** kernel (`cap·tanh(x/cap)`, cap=30) on final logits.
|
|
280
|
+
Generates coherently ("The capital of France is" → "Paris") at ~83 tok/s; all 35 layers
|
|
281
|
+
match an MLX-LM reference cos≥0.998 with identical argmax (`test-gemma4-perlayer.mjs`).
|
|
282
|
+
Structural validation **67/67** (`test-gemma4-graph.mjs`; 942 nodes); softcap kernel
|
|
283
|
+
separately validated (`test-gemma4-softcap.mjs`, max err<1e-4). **PLE is CPU-streamed, not
|
|
284
|
+
GPU-sharded**: the ~1.17 GB q4 PLE table stays CPU-resident (**0 MB GPU**) and the executor
|
|
285
|
+
streams per-token rows each step — Gemma 4's intended flash design, mobile-viable, and it
|
|
286
|
+
sidesteps the per-binding cap. Coherence required four fixes: per-node `attn_scale`=1.0
|
|
287
|
+
(HF scaling=1.0; default 1/√head_dim keeps other models byte-identical), parameter-free
|
|
288
|
+
V-norm, double-wide MLP on the KV-shared layers, and head_dim-512 support in the flash
|
|
289
|
+
attention kernel (second per-thread accumulator + smem-capped tiling, 16 KB invariant kept).
|
|
290
|
+
|
|
291
|
+
- **On-device Memory / RAG SHIPPED** (`src/memory/`, `@tryhamster/gerbil/memory`; paper §34).
|
|
292
|
+
Pluggable vector store (`InMemoryStore` / `IndexedDBStore` / `FileStore`), token-budgeted
|
|
293
|
+
`recall()` (default 1024-token greedy pack, ~4-chars/token, returns `{context, records,
|
|
294
|
+
tokensUsed}`), overlapping-window chunking (1000/200), write-time redaction (regex →
|
|
295
|
+
`[REDACTED]` or fn), and a `createGerbilEmbedder()` adapter over **native EmbeddingGemma**.
|
|
296
|
+
**12/12 tests** (`src/memory/memory.test.ts`). No new kernels — a clean consumer of the
|
|
297
|
+
embedding modality.
|
|
298
|
+
|
|
299
|
+
- **Autoresearch TPS campaign — 3 more batches** (`scripts/engine/results.jsonl`,
|
|
300
|
+
`scripts/engine/chart.html`; paper §35), M4 Max / node-dawn. Verified peaks:
|
|
301
|
+
**Qwen3.5-0.8B 219→~234 tok/s**, **LFM2.5-350M 624→~672 tok/s**, **ViT encode
|
|
302
|
+
581.8→~502 ms (−~14%)**, **`describeImage` 37.0→42.0 tok/s (+13.5%)** — all kept changes
|
|
303
|
+
bit-exact (merged cos 1.0, e2e 7/7). **The winning lesson (sharpened):** desktop wins come
|
|
304
|
+
only from eliminating **large wide reads on poorly-occupied kernels** (fused conv+activation,
|
|
305
|
+
vec4 + register-blocked + f16-mixed ViT matmul, `MatMul+AddBias→MatMulBias`); **pure
|
|
306
|
+
dispatch-count cuts on already-tuned kernels are noise** (butterfly reduce, subgroup
|
|
307
|
+
shuffle, bigger N-tiles all reverted). The INT4 matmuls and Mamba SSM sit at the
|
|
308
|
+
**bandwidth floor** — the remaining headroom is **mobile**, not desktop (several
|
|
309
|
+
reverted-on-desktop fusions are *predicted mobile wins*; the loop's next leg is a
|
|
310
|
+
mobile-validation pass).
|
|
311
|
+
|
|
312
|
+
- **gerbil-site is LIVE on the native engine.** The marketing/docs site's in-browser
|
|
313
|
+
inference now runs on the native WGSL engine (no longer the transformers.js/ONNX worker)
|
|
314
|
+
for the migrated modalities — see `docs/gerbil-site-native-migration.md`. (The §11 entry
|
|
315
|
+
was assessment-only; this cycle it went live.)
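
Two of the small numeric pieces from the entries above are easy to check by hand: the `Softcap` formula `cap·tanh(x/cap)` and the FSQ mixed-radix base `[1,9,72,576]` for levels `[9,8,8,7]`. The sketch below is a CPU illustration with hypothetical function names. It only unpacks a codebook index into per-dimension digits and omits the scaling of digits to real values, which the text does not specify.

```typescript
/** Softcap as stated in the text: cap * tanh(x / cap), with cap = 30 on final logits. */
function softcap(x: number, cap: number = 30): number {
  return cap * Math.tanh(x / cap);
}

/** Mixed-radix place values: each entry is the product of all earlier levels. */
function mixedRadixBase(levels: readonly number[]): number[] {
  const base: number[] = [];
  let acc = 1;
  for (const level of levels) {
    base.push(acc);
    acc *= level;
  }
  return base;
}

/** Unpack one codebook index into per-dimension digits, least significant first. */
function fsqDigits(index: number, levels: readonly number[]): number[] {
  let rest = index;
  return levels.map((level) => {
    const digit = rest % level;
    rest = Math.floor(rest / level);
    return digit;
  });
}

const FSQ_LEVELS = [9, 8, 8, 7];
console.log(mixedRadixBase(FSQ_LEVELS)); // [1, 9, 72, 576]
console.log(fsqDigits(577, FSQ_LEVELS)); // [1, 0, 0, 1], since 577 = 1 + 576
```

With these levels a single index ranges over 9 · 8 · 8 · 7 = 4032 codes, and softcap keeps every logit within ±30 while staying close to the identity near zero.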
|
|
316
|
+
|
|
317
|
+
**Native modality scorecard (as of §12):** text ✅ (Qwen3.5, LFM2.5), text-alt families 🟢
|
|
318
|
+
Tier-1, Gemma 4 E2B ✅ (text decode coherent on real q4 weights; PLE CPU-streamed, not GPU-sharded), vision ✅ (Qwen3.5
|
|
319
|
+
ViT, on iPad), embeddings ✅ (EmbeddingGemma, on iPad), **STT ✅ native (Moonshine) + Whisper
|
|
320
|
+
fallback**, **TTS 🟡 (Kani NanoCodec decoder validated, backbone AR loop pending)**, memory/RAG
|
|
321
|
+
✅ shipped. **Audio is no longer the deferred gap — STT is native; TTS is one AR-loop away.**
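
For the memory/RAG entry in §12, the token-budgeted `recall()` (1024-token greedy pack at ~4 chars/token, returning `{context, records, tokensUsed}`) can be pictured with the sketch below. It is an assumption-laden illustration, not the package's implementation: the `MemoryRecord` shape, the function names, the skip-and-continue policy for records that do not fit, and the blank-line joiner are all hypothetical.

```typescript
// Hypothetical record shape; the real store's record type is not shown in the text.
interface MemoryRecord {
  text: string;
  score: number;
}

interface RecallResult {
  context: string;
  records: MemoryRecord[];
  tokensUsed: number;
}

/** Rough token estimate at ~4 characters per token. */
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

/** Greedily pack already-ranked records (best first) into a token budget. */
function packByTokenBudget(ranked: MemoryRecord[], budget: number = 1024): RecallResult {
  const records: MemoryRecord[] = [];
  let tokensUsed = 0;
  for (const record of ranked) {
    const cost = estimateTokens(record.text);
    if (tokensUsed + cost > budget) {
      continue; // assumption: skip what does not fit and keep trying smaller records
    }
    records.push(record);
    tokensUsed += cost;
  }
  const context = records.map((r) => r.text).join("\n\n");
  return { context, records, tokensUsed };
}
```

Stopping at the first record that does not fit would be an equally plausible policy; the skip-and-continue variant simply fills the budget more tightly.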
|