@tryhamster/gerbil 1.0.0-rc.8 → 1.0.0
- package/LICENSE +1 -1
- package/README.md +247 -84
- package/dist/architectures-C1I5V3Dt.mjs +6070 -0
- package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
- package/dist/browser/index.d.ts +264 -588
- package/dist/browser/index.d.ts.map +1 -1
- package/dist/browser/index.js +585 -2334
- package/dist/browser/index.js.map +1 -1
- package/dist/cli.mjs +625 -1098
- package/dist/cli.mjs.map +1 -1
- package/dist/defaults-9komdrbY.mjs +24 -0
- package/dist/defaults-9komdrbY.mjs.map +1 -0
- package/dist/frameworks/express.d.mts +1 -3
- package/dist/frameworks/express.d.mts.map +1 -1
- package/dist/frameworks/express.mjs +7 -7
- package/dist/frameworks/express.mjs.map +1 -1
- package/dist/frameworks/fastify.d.mts +1 -1
- package/dist/frameworks/fastify.d.mts.map +1 -1
- package/dist/frameworks/fastify.mjs +3 -3
- package/dist/frameworks/fastify.mjs.map +1 -1
- package/dist/frameworks/hono.d.mts +1 -1
- package/dist/frameworks/hono.d.mts.map +1 -1
- package/dist/frameworks/hono.mjs +4 -4
- package/dist/frameworks/hono.mjs.map +1 -1
- package/dist/frameworks/next.d.mts +3 -2
- package/dist/frameworks/next.d.mts.map +1 -1
- package/dist/frameworks/next.mjs +4 -4
- package/dist/frameworks/next.mjs.map +1 -1
- package/dist/frameworks/react.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts.map +1 -1
- package/dist/frameworks/trpc.mjs +4 -4
- package/dist/frameworks/trpc.mjs.map +1 -1
- package/dist/gerbil-BHrJJIa4.mjs +1656 -0
- package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
- package/dist/gerbil-BT9fCydo.d.mts +488 -0
- package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
- package/dist/gerbil-DomNfIr1.mjs +4 -0
- package/dist/gpu/hooks.d.mts +520 -0
- package/dist/gpu/hooks.d.mts.map +1 -0
- package/dist/gpu/hooks.mjs +1188 -0
- package/dist/gpu/hooks.mjs.map +1 -0
- package/dist/gpu/index.d.mts +2 -0
- package/dist/gpu/index.mjs +6 -0
- package/dist/gpu-33qCAtHW.mjs +3615 -0
- package/dist/gpu-33qCAtHW.mjs.map +1 -0
- package/dist/index-Dgmb2kE3.d.mts +245 -0
- package/dist/index-Dgmb2kE3.d.mts.map +1 -0
- package/dist/index-jEAL2s-A.d.mts +2022 -0
- package/dist/index-jEAL2s-A.d.mts.map +1 -0
- package/dist/index.d.mts +22 -487
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +13 -8
- package/dist/index.mjs.map +1 -1
- package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
- package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
- package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
- package/dist/integrations/ai-sdk.d.mts +75 -6
- package/dist/integrations/ai-sdk.d.mts.map +1 -1
- package/dist/integrations/ai-sdk.mjs +131 -15
- package/dist/integrations/ai-sdk.mjs.map +1 -1
- package/dist/integrations/langchain.d.mts +1 -1
- package/dist/integrations/langchain.d.mts.map +1 -1
- package/dist/integrations/langchain.mjs +5 -5
- package/dist/integrations/langchain.mjs.map +1 -1
- package/dist/integrations/llamaindex.d.mts +1 -1
- package/dist/integrations/llamaindex.d.mts.map +1 -1
- package/dist/integrations/llamaindex.mjs +5 -5
- package/dist/integrations/llamaindex.mjs.map +1 -1
- package/dist/integrations/mcp-client.mjs +3 -3
- package/dist/integrations/mcp-client.mjs.map +1 -1
- package/dist/integrations/mcp.d.mts +3 -2
- package/dist/integrations/mcp.d.mts.map +1 -1
- package/dist/integrations/mcp.mjs +5 -5
- package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
- package/dist/mcp-1DaMsaBc.mjs.map +1 -0
- package/dist/memory/index.d.mts +3 -0
- package/dist/memory/index.mjs +6 -0
- package/dist/memory-D1P7Tmda.mjs +4 -0
- package/dist/memory-DVN0MnIG.mjs +132 -0
- package/dist/memory-DVN0MnIG.mjs.map +1 -0
- package/dist/memory-Dj0J1v88.mjs +294 -0
- package/dist/memory-Dj0J1v88.mjs.map +1 -0
- package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
- package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
- package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
- package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
- package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
- package/dist/repl-jV5gcJFA.mjs +9 -0
- package/dist/skills/index.d.mts +270 -320
- package/dist/skills/index.d.mts.map +1 -1
- package/dist/skills/index.mjs +5 -5
- package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
- package/dist/skills-DX8D59UH.mjs.map +1 -0
- package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
- package/dist/tools-DQ1mPUw5.mjs.map +1 -0
- package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
- package/dist/types-D6FiR_oh.d.mts.map +1 -0
- package/dist/types-DQBe2lFo.d.mts +165 -0
- package/dist/types-DQBe2lFo.d.mts.map +1 -0
- package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
- package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
- package/dist/vector-B0panuy6.mjs +95 -0
- package/dist/vector-B0panuy6.mjs.map +1 -0
- package/docs/PROJECT-STATE.md +321 -0
- package/docs/adding-a-model-family.md +280 -0
- package/docs/ai-sdk.md +70 -61
- package/docs/architecture/overview.md +17 -7
- package/docs/browser.md +203 -8
- package/docs/embeddings.md +156 -0
- package/docs/gerbil-site-native-migration.md +217 -0
- package/docs/gpu-engine/architectures.md +398 -0
- package/docs/gpu-engine/ir.md +372 -0
- package/docs/gpu-engine/kernels.md +718 -0
- package/docs/gpu-engine/paper.html +1759 -0
- package/docs/gpu-engine/paper.md +2109 -0
- package/docs/gpu-engine/safetensors.md +312 -0
- package/docs/gpu-engine/tokenizer.md +302 -0
- package/docs/memory-rag.md +91 -0
- package/docs/metal-safari-intel.md +190 -0
- package/docs/mobile-failure-diagnosis.md +124 -0
- package/docs/mobile.md +99 -0
- package/docs/observability.md +230 -0
- package/docs/onnx-removal-plan.md +339 -0
- package/docs/research/autoresearch-portable.md +904 -0
- package/docs/research/dispatch-reduction-hivemind.md +84 -0
- package/docs/research/ios-safari-model-caching.md +117 -0
- package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
- package/docs/research/native-stt-model-selection.md +49 -0
- package/docs/research/native-tts-model-selection.md +90 -0
- package/docs/research/native-vs-chromium-decision.md +152 -0
- package/docs/research/nemotron-mamba2-inference.md +910 -0
- package/docs/research/qwen35-multimodal.md +293 -0
- package/docs/research/qwen36-gemma4-targets.md +337 -0
- package/docs/research/sota-embedding-models.md +179 -0
- package/docs/research/sota-mobile-models-2026.md +263 -0
- package/docs/research/sota-modality-models.md +202 -0
- package/docs/research/tps-baselines.md +71 -0
- package/docs/research/webgpu-m4-reference.md +104 -0
- package/docs/site-update-plan.md +155 -0
- package/docs/structured-output.md +123 -0
- package/docs/stt.md +63 -446
- package/docs/tts.md +77 -499
- package/docs/vision.md +100 -338
- package/package.json +22 -7
- package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
- package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
- package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
- package/dist/gerbil-CJ3ifloF.mjs +0 -4
- package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
- package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
- package/dist/gerbil-qOTe1nl2.d.mts +0 -431
- package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
- package/dist/kokoro-BNTb6egA.mjs +0 -20210
- package/dist/kokoro-BNTb6egA.mjs.map +0 -1
- package/dist/kokoro-DFRQ1OeM.js +0 -20212
- package/dist/kokoro-DFRQ1OeM.js.map +0 -1
- package/dist/mcp-BvbriaBy.mjs.map +0 -1
- package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
- package/dist/repl-DveXw36T.mjs +0 -9
- package/dist/skills-CD3Orlex.mjs.map +0 -1
- package/dist/stt-CpLYbGFd.mjs +0 -433
- package/dist/stt-CpLYbGFd.mjs.map +0 -1
- package/dist/stt-DRPLEEHB.mjs +0 -3
- package/dist/stt-Te8Qz-Ay.js +0 -433
- package/dist/stt-Te8Qz-Ay.js.map +0 -1
- package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
- package/dist/transformers.web-DokyH3rP.js +0 -3
- package/dist/transformers.web-M6mCnEYJ.js +0 -30382
- package/dist/transformers.web-M6mCnEYJ.js.map +0 -1
- package/dist/tts-C0xx3CtE.js +0 -724
- package/dist/tts-C0xx3CtE.js.map +0 -1
- package/dist/tts-DXgsKGCe.mjs +0 -3
- package/dist/tts-DeGANMNV.mjs +0 -730
- package/dist/tts-DeGANMNV.mjs.map +0 -1
- package/dist/types-CiTc7ez3.d.mts.map +0 -1
- /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
- /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
- /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,165 @@
+//#region src/memory/types.d.ts
+/**
+ * Type definitions for Gerbil's memory / RAG module.
+ *
+ * The module is engine-agnostic: it stores text + embeddings behind a
+ * pluggable {@link MemoryStore} and embeds text via an injected
+ * {@link Embedder}. Gerbil's native embeddings are wired in via a tiny
+ * adapter (see `gerbil-embedder.ts`).
+ */
+/** Arbitrary JSON-serializable metadata attached to a memory record. */
+type MemoryMetadata = Record<string, unknown>;
+/**
+ * A single stored memory.
+ *
+ * Vectors are stored L2-normalized so that cosine similarity reduces to a
+ * dot product. `embedding` may be omitted for records awaiting embedding.
+ */
+type MemoryRecord = {
+  /** Stable unique id (generated on insert if not supplied). */
+  id: string;
+  /** The raw text content of this memory. */
+  text: string;
+  /** L2-normalized embedding vector. Omitted only before embedding. */
+  embedding?: Float32Array;
+  /** Arbitrary metadata for filtering and provenance. */
+  metadata: MemoryMetadata;
+  /** Epoch milliseconds when the record was created. */
+  createdAt: number;
+};
+/**
+ * Embeds one or more texts into vectors.
+ *
+ * Implementations should return one vector per input text, in order.
+ * Gerbil's native embeddings satisfy this via {@link createGerbilEmbedder}.
+ */
+type Embedder = (texts: string[]) => Promise<Float32Array[]>;
+/**
+ * Predicate or transform applied to text on write to redact sensitive data.
+ *
+ * - A {@link RegExp} replaces all matches with `[REDACTED]`.
+ * - A function returns the redacted string.
+ */
+type Redactor = RegExp | ((text: string) => string);
+/** A hit returned from a vector search. */
+type MemorySearchResult = {
+  /** The matched record. */
+  record: MemoryRecord;
+  /** Cosine similarity score in [-1, 1] (typically [0, 1]). */
+  score: number;
+};
+/** Filter applied to record metadata. Each key must match exactly. */
+type MetadataFilter = Record<string, unknown>;
+/** Options for {@link MemoryStore.search}. */
+type StoreSearchOptions = {
+  /** Maximum number of results (default 5). */
+  k?: number;
+  /** Exact-match metadata filter applied before ranking. */
+  filter?: MetadataFilter;
+  /** Minimum score to include (default: no minimum). */
+  minScore?: number;
+};
+/**
+ * Pluggable persistence + retrieval backend.
+ *
+ * Backends are engine-agnostic and store pre-normalized embeddings. The
+ * default in-memory store works in Node and the browser; an IndexedDB store
+ * adds browser durability and a file store adds Node durability.
+ */
+type MemoryStore = {
+  /** Insert or replace a record by id. */
+  add(record: MemoryRecord): Promise<void>;
+  /** Insert or replace many records (may be faster than per-record adds). */
+  addMany(records: MemoryRecord[]): Promise<void>;
+  /** Fetch a record by id, or `undefined` if absent. */
+  get(id: string): Promise<MemoryRecord | undefined>;
+  /** Cosine top-k search over a query vector. */
+  search(queryVector: Float32Array, options?: StoreSearchOptions): Promise<MemorySearchResult[]>;
+  /** Delete a record by id. Returns `true` if it existed. */
+  delete(id: string): Promise<boolean>;
+  /** List all records (optionally filtered by metadata). */
+  list(filter?: MetadataFilter): Promise<MemoryRecord[]>;
+  /** Remove all records. */
+  clear(): Promise<void>;
+  /** Total number of stored records. */
+  size(): Promise<number>;
+};
+/** Serializable snapshot of a store for import/export. */
+type MemoryExport = {
+  /** Schema version for forward compatibility. */
+  version: 1;
+  /** Exported records (embeddings serialized as plain number arrays). */
+  records: SerializedRecord[];
+};
+/** A record with its embedding serialized as a plain array for JSON. */
+type SerializedRecord = {
+  id: string;
+  text: string;
+  embedding?: number[];
+  metadata: MemoryMetadata;
+  createdAt: number;
+};
+/** Options controlling document chunking before embedding. */
+type ChunkOptions = {
+  /** Target chunk size in characters (default 1000). */
+  chunkSize?: number;
+  /** Overlap between consecutive chunks in characters (default 200). */
+  overlap?: number;
+};
+/** Options for {@link Memory.add}. */
+type AddOptions = {
+  /** Metadata to attach to every resulting record. */
+  metadata?: MemoryMetadata;
+  /** Explicit id (single-chunk adds only). Auto-generated otherwise. */
+  id?: string;
+  /**
+   * Split long text into overlapping chunks before embedding.
+   * Pass `true` for defaults or an object to configure. Default: no chunking.
+   */
+  chunk?: boolean | ChunkOptions;
+};
+/** Options for {@link Memory.search}. */
+type SearchOptions = {
+  /** Maximum number of results (default 5). */
+  k?: number;
+  /** Exact-match metadata filter. */
+  filter?: MetadataFilter;
+  /** Minimum cosine score to include. */
+  minScore?: number;
+};
+/** Options for {@link Memory.recall} (token-budgeted context packing). */
+type RecallOptions = {
+  /** Maximum tokens of context to pack (default 1024). */
+  tokenBudget?: number;
+  /** Candidate pool size to retrieve before packing (default 20). */
+  k?: number;
+  /** Exact-match metadata filter. */
+  filter?: MetadataFilter;
+  /** Minimum cosine score for a candidate to be considered. */
+  minScore?: number;
+  /** Separator placed between packed memories (default "\n\n"). */
+  separator?: string;
+};
+/** Result of {@link Memory.recall}: a packed context block plus provenance. */
+type RecallResult = {
+  /** The token-budgeted context block, ready to prepend to a prompt. */
+  context: string;
+  /** The records included in the context, in packed order. */
+  records: MemoryRecord[];
+  /** Approximate token count of {@link RecallResult.context}. */
+  tokensUsed: number;
+};
+/** Configuration for {@link createMemory}. */
+type MemoryOptions = {
+  /** Embedder used to vectorize text. Required. */
+  embed: Embedder;
+  /** Persistence backend. Defaults to an in-memory store. */
+  store?: MemoryStore;
+  /** Redaction applied to text on write. */
+  redact?: Redactor;
+  /** Default chunking applied to {@link Memory.add} when `chunk` is true. */
+  chunk?: ChunkOptions;
+};
+//#endregion
+export { MemoryMetadata as a, MemorySearchResult as c, RecallOptions as d, RecallResult as f, StoreSearchOptions as g, SerializedRecord as h, MemoryExport as i, MemoryStore as l, SearchOptions as m, ChunkOptions as n, MemoryOptions as o, Redactor as p, Embedder as r, MemoryRecord as s, AddOptions as t, MetadataFilter as u };
+//# sourceMappingURL=types-DQBe2lFo.d.mts.map
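The `Redactor` and `ChunkOptions` declarations in the hunk above describe behaviour only in comments, so a small sketch may help make them concrete. Everything below is an assumption for illustration: `applyRedactor` and `chunkText` are hypothetical helper names that do not appear in the diff, and the fixed-stride sliding window is only one plausible reading of "target chunk size" and "overlap"; the package's real chunker may split differently (for example on sentence boundaries).

```typescript
// Types re-declared locally (copied from the hunk above) so the sketch is standalone.
type Redactor = RegExp | ((text: string) => string);
type ChunkOptions = { chunkSize?: number; overlap?: number };

// Hypothetical helper: apply a Redactor the way its doc comment describes.
function applyRedactor(text: string, redact?: Redactor): string {
  if (!redact) return text;
  if (typeof redact === "function") return redact(text);
  // String.prototype.replace only replaces the first match unless the pattern
  // is global, so add the "g" flag to honour "replaces all matches".
  const flags = redact.flags.indexOf("g") >= 0 ? redact.flags : redact.flags + "g";
  return text.replace(new RegExp(redact.source, flags), "[REDACTED]");
}

// Hypothetical helper: fixed-size character windows that overlap by `overlap`.
// Defaults (1000 / 200) are the ones documented on ChunkOptions.
function chunkText(text: string, options: ChunkOptions = {}): string[] {
  const chunkSize = options.chunkSize ?? 1000;
  const overlap = options.overlap ?? 200;
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("expected 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end, to avoid a redundant tail chunk.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Usage: redact first, then chunk, so no chunk ever stores the raw secret.
const cleaned = applyRedactor("Reach me at 555-1234 or 555-9999.", /\d{3}-\d{4}/);
console.log(cleaned);
console.log(chunkText("abcdefghij", { chunkSize: 4, overlap: 1 }));
```

Redacting before chunking is a design choice of this sketch rather than something the diff confirms; the declared types only say that redaction is "applied to text on write".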
@@ -0,0 +1 @@
+
{"version":3,"file":"types-DQBe2lFo.d.mts","names":[],"sources":["../src/memory/types.ts"],"sourcesContent":[],"mappings":";;AAUA;AAQA;AAmBA;AAQA;AAGA;AAQA;AAGA;AAgBA;AAEc,KAnEF,cAAA,GAAiB,MAmEf,CAAA,MAAA,EAAA,OAAA,CAAA;;;;;;;AAMgC,KAjElC,YAAA,GAiEkC;EAA6B;EAAR,EAAA,EAAA,MAAA;EAE7C;EAEN,IAAA,EAAA,MAAA;EAAyB;EAAR,SAAA,CAAA,EA/DnB,YA+DmB;EAEtB;EAED,QAAA,EAjEE,cAiEF;EAAO;EAIL,SAAA,EAAA,MAAY;AAQxB,CAAA;AASA;AAQA;AAaA;AAUA;AAcA;AAUA;AAES,KApIG,QAAA,GAoIH,CAAA,KAAA,EAAA,MAAA,EAAA,EAAA,GApImC,OAoInC,CApI2C,YAoI3C,EAAA,CAAA;;;;;;;KA5HG,QAAA,GAAW;;KAGX,kBAAA;;UAEF;;;;;KAME,cAAA,GAAiB;;KAGjB,kBAAA;;;;WAID;;;;;;;;;;;KAYC,WAAA;;cAEE,eAAe;;mBAEV,iBAAiB;;mBAEjB,QAAQ;;sBAEL,wBAAwB,qBAAqB,QAAQ;;sBAErD;;gBAEN,iBAAiB,QAAQ;;WAE9B;;UAED;;;KAIE,YAAA;;;;WAID;;;KAIC,gBAAA;;;;YAIA;;;;KAKA,YAAA;;;;;;;KAQA,UAAA;;aAEC;;;;;;;oBAOO;;;KAIR,aAAA;;;;WAID;;;;;KAMC,aAAA;;;;;;WAMD;;;;;;;KAQC,YAAA;;;;WAID;;;;;KAMC,aAAA;;SAEH;;UAEC;;WAEC;;UAED"}
@@ -1 +1 @@
-
{"version":3,"file":"utils-
+
{"version":3,"file":"utils-DKO55ZmZ.mjs","names":["properties: Record<string, any>","required: string[]"],"sources":["../src/core/utils.ts"],"sourcesContent":["/**\n * Shared utility functions for Gerbil core\n */\n\nimport type { z } from \"zod\";\n\n/**\n * Convert Zod schema to JSON Schema (simplified)\n * Handles objects, arrays, primitives, enums, optionals, and defaults\n */\nexport function zodToJsonSchema(schema: z.ZodType<any>): object {\n try {\n if (\"_def\" in schema) {\n const def = (schema as any)._def;\n\n if (def.typeName === \"ZodObject\") {\n const shape = def.shape();\n const properties: Record<string, any> = {};\n const required: string[] = [];\n\n for (const [key, value] of Object.entries(shape)) {\n properties[key] = zodToJsonSchema(value as z.ZodType<any>);\n // Check if required (not optional)\n if (!(value as any)._def?.typeName?.includes(\"Optional\")) {\n required.push(key);\n }\n }\n\n return { type: \"object\", properties, required };\n }\n if (def.typeName === \"ZodString\") {\n return { type: \"string\", description: def.description };\n }\n if (def.typeName === \"ZodNumber\") {\n return { type: \"number\", description: def.description };\n }\n if (def.typeName === \"ZodBoolean\") {\n return { type: \"boolean\" };\n }\n if (def.typeName === \"ZodArray\") {\n return { type: \"array\", items: zodToJsonSchema(def.type) };\n }\n if (def.typeName === \"ZodEnum\") {\n return { type: \"string\", enum: def.values };\n }\n if (def.typeName === \"ZodOptional\") {\n return zodToJsonSchema(def.innerType);\n }\n if (def.typeName === \"ZodDefault\") {\n const inner = zodToJsonSchema(def.innerType);\n return { ...inner, default: def.defaultValue() };\n }\n }\n } catch {}\n\n return { type: \"string\" };\n}\n\n/**\n * Extract JSON from text (finds first { } or [ ] block)\n */\nexport function extractJson(text: string): string {\n const jsonMatch = text.match(/\\{[\\s\\S]*\\}/);\n if (jsonMatch) {\n return jsonMatch[0];\n }\n\n const arrayMatch = 
text.match(/\\[[\\s\\S]*\\]/);\n if (arrayMatch) {\n return arrayMatch[0];\n }\n\n return text;\n}\n"],"mappings":";;;;;AAUA,SAAgB,gBAAgB,QAAgC;AAC9D,KAAI;AACF,MAAI,UAAU,QAAQ;GACpB,MAAM,MAAO,OAAe;AAE5B,OAAI,IAAI,aAAa,aAAa;IAChC,MAAM,QAAQ,IAAI,OAAO;IACzB,MAAMA,aAAkC,EAAE;IAC1C,MAAMC,WAAqB,EAAE;AAE7B,SAAK,MAAM,CAAC,KAAK,UAAU,OAAO,QAAQ,MAAM,EAAE;AAChD,gBAAW,OAAO,gBAAgB,MAAwB;AAE1D,SAAI,CAAE,MAAc,MAAM,UAAU,SAAS,WAAW,CACtD,UAAS,KAAK,IAAI;;AAItB,WAAO;KAAE,MAAM;KAAU;KAAY;KAAU;;AAEjD,OAAI,IAAI,aAAa,YACnB,QAAO;IAAE,MAAM;IAAU,aAAa,IAAI;IAAa;AAEzD,OAAI,IAAI,aAAa,YACnB,QAAO;IAAE,MAAM;IAAU,aAAa,IAAI;IAAa;AAEzD,OAAI,IAAI,aAAa,aACnB,QAAO,EAAE,MAAM,WAAW;AAE5B,OAAI,IAAI,aAAa,WACnB,QAAO;IAAE,MAAM;IAAS,OAAO,gBAAgB,IAAI,KAAK;IAAE;AAE5D,OAAI,IAAI,aAAa,UACnB,QAAO;IAAE,MAAM;IAAU,MAAM,IAAI;IAAQ;AAE7C,OAAI,IAAI,aAAa,cACnB,QAAO,gBAAgB,IAAI,UAAU;AAEvC,OAAI,IAAI,aAAa,aAEnB,QAAO;IAAE,GADK,gBAAgB,IAAI,UAAU;IACzB,SAAS,IAAI,cAAc;IAAE;;SAG9C;AAER,QAAO,EAAE,MAAM,UAAU;;;;;AAM3B,SAAgB,YAAY,MAAsB;CAChD,MAAM,YAAY,KAAK,MAAM,cAAc;AAC3C,KAAI,UACF,QAAO,UAAU;CAGnB,MAAM,aAAa,KAAK,MAAM,cAAc;AAC5C,KAAI,WACF,QAAO,WAAW;AAGpB,QAAO"}
@@ -0,0 +1,95 @@
+//#region src/memory/serialize.ts
+/** Convert a runtime record to its JSON-safe form. */
+function serializeRecord(record) {
+	return {
+		id: record.id,
+		text: record.text,
+		embedding: record.embedding ? Array.from(record.embedding) : void 0,
+		metadata: record.metadata,
+		createdAt: record.createdAt
+	};
+}
+/** Rebuild a runtime record (with a {@link Float32Array}) from JSON form. */
+function deserializeRecord(record) {
+	return {
+		id: record.id,
+		text: record.text,
+		embedding: record.embedding ? Float32Array.from(record.embedding) : void 0,
+		metadata: record.metadata ?? {},
+		createdAt: record.createdAt
+	};
+}
+/**
+ * True when every key in `filter` is present on `metadata` with an equal
+ * (strict ===) value. An empty/undefined filter matches everything.
+ */
+function matchesFilter(metadata, filter) {
+	if (!filter) return true;
+	for (const key of Object.keys(filter)) if (metadata[key] !== filter[key]) return false;
+	return true;
+}
+
+//#endregion
+//#region src/memory/vector.ts
+/**
+ * Vector math for cosine similarity search.
+ *
+ * Vectors are stored L2-normalized so cosine similarity is a plain dot
+ * product. Normalization happens once on insert via {@link normalize}.
+ */
+/**
+ * Return an L2-normalized copy of `vector`.
+ *
+ * A zero vector is returned unchanged (its norm is 0).
+ */
+function normalize(vector) {
+	let sumSquares = 0;
+	for (let i = 0; i < vector.length; i++) sumSquares += vector[i] * vector[i];
+	const norm = Math.sqrt(sumSquares);
+	if (norm === 0) return vector;
+	const out = new Float32Array(vector.length);
+	for (let i = 0; i < vector.length; i++) out[i] = vector[i] / norm;
+	return out;
+}
+/**
+ * Dot product of two equal-length vectors.
+ *
+ * For L2-normalized inputs this equals cosine similarity.
+ */
+function dot(a, b) {
+	const length = Math.min(a.length, b.length);
+	let sum = 0;
+	for (let i = 0; i < length; i++) sum += a[i] * b[i];
+	return sum;
+}
+/**
+ * Cosine similarity of two vectors, normalizing on the fly.
+ *
+ * Prefer {@link dot} when both inputs are already normalized.
+ */
+function cosine(a, b) {
+	return dot(normalize(a), normalize(b));
+}
+/**
+ * Score every candidate against `query` (dot product) and return the top `k`
+ * by descending score, optionally filtering by a minimum score.
+ *
+ * Inputs are assumed normalized; this keeps the hot path branch-free.
+ */
+function topK(query, candidates, k, minScore) {
+	const scored = [];
+	for (const candidate of candidates) {
+		const score = dot(query, candidate.vector);
+		if (minScore !== void 0 && score < minScore) continue;
+		scored.push({
+			item: candidate.item,
+			score
+		});
+	}
+	scored.sort((a, b) => b.score - a.score);
+	return scored.slice(0, k);
+}
+
+//#endregion
+export { deserializeRecord as a, topK as i, dot as n, matchesFilter as o, normalize as r, serializeRecord as s, cosine as t };
+//# sourceMappingURL=vector-B0panuy6.mjs.map
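To illustrate how the vector helpers in the hunk above fit together, here is a minimal standalone sketch. The three functions are re-declared locally with type annotations (the bundle exports them only under minified aliases, so importing them by name from the package is not assumed to work), and the toy two-dimensional vectors and labels are invented for the example.

```typescript
// Local typed copies of the helpers shown in the hunk, so the sketch runs alone.
function normalize(vector: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < vector.length; i++) sumSquares += vector[i] * vector[i];
  const norm = Math.sqrt(sumSquares);
  if (norm === 0) return vector; // zero vector: nothing to scale
  const out = new Float32Array(vector.length);
  for (let i = 0; i < vector.length; i++) out[i] = vector[i] / norm;
  return out;
}

function dot(a: Float32Array, b: Float32Array): number {
  // Note: a length mismatch is not an error; the longer vector is truncated.
  const length = Math.min(a.length, b.length);
  let sum = 0;
  for (let i = 0; i < length; i++) sum += a[i] * b[i];
  return sum;
}

type Scored<T> = { item: T; score: number };

function topK<T>(
  query: Float32Array,
  candidates: { item: T; vector: Float32Array }[],
  k: number,
  minScore?: number,
): Scored<T>[] {
  const scored: Scored<T>[] = [];
  for (const candidate of candidates) {
    const score = dot(query, candidate.vector);
    if (minScore !== undefined && score < minScore) continue;
    scored.push({ item: candidate.item, score });
  }
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k);
}

// Toy data (illustrative only): vectors are normalized once, at "insert" time.
const stored = [
  { item: "east", vector: normalize(new Float32Array([3, 0])) },
  { item: "north-east", vector: normalize(new Float32Array([1, 1])) },
  { item: "west", vector: normalize(new Float32Array([-2, 0])) },
];

// The query is normalized too, so each dot product is a cosine similarity.
const queryVector = normalize(new Float32Array([10, 1]));
const hits = topK(queryVector, stored, 2, 0); // minScore 0 drops "west"
for (const hit of hits) console.log(hit.item, hit.score.toFixed(3));
```

Two properties of the hunk are worth noting when reusing this pattern: `dot` truncates to the shorter input rather than throwing, so mixing embedders with different dimensions would produce scores silently, and `topK` sorts every surviving candidate, which costs O(n log n) per query in the number of stored records.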
@@ -0,0 +1 @@
+
{"version":3,"file":"vector-B0panuy6.mjs","names":["scored: Scored<T>[]"],"sources":["../src/memory/serialize.ts","../src/memory/vector.ts"],"sourcesContent":["/**\n * (De)serialization helpers shared by stores and import/export.\n *\n * Embeddings are stored at runtime as {@link Float32Array} but serialized as\n * plain number arrays so records survive `JSON.stringify` and IndexedDB\n * structured clone round-trips.\n */\n\nimport type { MemoryRecord, SerializedRecord } from \"./types.js\";\n\n/** Convert a runtime record to its JSON-safe form. */\nexport function serializeRecord(record: MemoryRecord): SerializedRecord {\n return {\n id: record.id,\n text: record.text,\n embedding: record.embedding ? Array.from(record.embedding) : undefined,\n metadata: record.metadata,\n createdAt: record.createdAt,\n };\n}\n\n/** Rebuild a runtime record (with a {@link Float32Array}) from JSON form. */\nexport function deserializeRecord(record: SerializedRecord): MemoryRecord {\n return {\n id: record.id,\n text: record.text,\n embedding: record.embedding ? Float32Array.from(record.embedding) : undefined,\n metadata: record.metadata ?? {},\n createdAt: record.createdAt,\n };\n}\n\n/**\n * True when every key in `filter` is present on `metadata` with an equal\n * (strict ===) value. An empty/undefined filter matches everything.\n */\nexport function matchesFilter(\n metadata: Record<string, unknown>,\n filter?: Record<string, unknown>,\n): boolean {\n if (!filter) {\n return true;\n }\n for (const key of Object.keys(filter)) {\n if (metadata[key] !== filter[key]) {\n return false;\n }\n }\n return true;\n}\n","/**\n * Vector math for cosine similarity search.\n *\n * Vectors are stored L2-normalized so cosine similarity is a plain dot\n * product. 
Normalization happens once on insert via {@link normalize}.\n */\n\n/**\n * Return an L2-normalized copy of `vector`.\n *\n * A zero vector is returned unchanged (its norm is 0).\n */\nexport function normalize(vector: Float32Array): Float32Array {\n let sumSquares = 0;\n for (let i = 0; i < vector.length; i++) {\n sumSquares += vector[i] * vector[i];\n }\n const norm = Math.sqrt(sumSquares);\n if (norm === 0) {\n return vector;\n }\n const out = new Float32Array(vector.length);\n for (let i = 0; i < vector.length; i++) {\n out[i] = vector[i] / norm;\n }\n return out;\n}\n\n/**\n * Dot product of two equal-length vectors.\n *\n * For L2-normalized inputs this equals cosine similarity.\n */\nexport function dot(a: Float32Array, b: Float32Array): number {\n const length = Math.min(a.length, b.length);\n let sum = 0;\n for (let i = 0; i < length; i++) {\n sum += a[i] * b[i];\n }\n return sum;\n}\n\n/**\n * Cosine similarity of two vectors, normalizing on the fly.\n *\n * Prefer {@link dot} when both inputs are already normalized.\n */\nexport function cosine(a: Float32Array, b: Float32Array): number {\n return dot(normalize(a), normalize(b));\n}\n\n/** A scored item used by {@link topK}. 
*/\nexport type Scored<T> = { item: T; score: number };\n\n/**\n * Score every candidate against `query` (dot product) and return the top `k`\n * by descending score, optionally filtering by a minimum score.\n *\n * Inputs are assumed normalized; this keeps the hot path branch-free.\n */\nexport function topK<T>(\n query: Float32Array,\n candidates: { item: T; vector: Float32Array }[],\n k: number,\n minScore?: number,\n): Scored<T>[] {\n const scored: Scored<T>[] = [];\n for (const candidate of candidates) {\n const score = dot(query, candidate.vector);\n if (minScore !== undefined && score < minScore) {\n continue;\n }\n scored.push({ item: candidate.item, score });\n }\n scored.sort((a, b) => b.score - a.score);\n return scored.slice(0, k);\n}\n"],"mappings":";;AAWA,SAAgB,gBAAgB,QAAwC;AACtE,QAAO;EACL,IAAI,OAAO;EACX,MAAM,OAAO;EACb,WAAW,OAAO,YAAY,MAAM,KAAK,OAAO,UAAU,GAAG;EAC7D,UAAU,OAAO;EACjB,WAAW,OAAO;EACnB;;;AAIH,SAAgB,kBAAkB,QAAwC;AACxE,QAAO;EACL,IAAI,OAAO;EACX,MAAM,OAAO;EACb,WAAW,OAAO,YAAY,aAAa,KAAK,OAAO,UAAU,GAAG;EACpE,UAAU,OAAO,YAAY,EAAE;EAC/B,WAAW,OAAO;EACnB;;;;;;AAOH,SAAgB,cACd,UACA,QACS;AACT,KAAI,CAAC,OACH,QAAO;AAET,MAAK,MAAM,OAAO,OAAO,KAAK,OAAO,CACnC,KAAI,SAAS,SAAS,OAAO,KAC3B,QAAO;AAGX,QAAO;;;;;;;;;;;;;;;;ACpCT,SAAgB,UAAU,QAAoC;CAC5D,IAAI,aAAa;AACjB,MAAK,IAAI,IAAI,GAAG,IAAI,OAAO,QAAQ,IACjC,eAAc,OAAO,KAAK,OAAO;CAEnC,MAAM,OAAO,KAAK,KAAK,WAAW;AAClC,KAAI,SAAS,EACX,QAAO;CAET,MAAM,MAAM,IAAI,aAAa,OAAO,OAAO;AAC3C,MAAK,IAAI,IAAI,GAAG,IAAI,OAAO,QAAQ,IACjC,KAAI,KAAK,OAAO,KAAK;AAEvB,QAAO;;;;;;;AAQT,SAAgB,IAAI,GAAiB,GAAyB;CAC5D,MAAM,SAAS,KAAK,IAAI,EAAE,QAAQ,EAAE,OAAO;CAC3C,IAAI,MAAM;AACV,MAAK,IAAI,IAAI,GAAG,IAAI,QAAQ,IAC1B,QAAO,EAAE,KAAK,EAAE;AAElB,QAAO;;;;;;;AAQT,SAAgB,OAAO,GAAiB,GAAyB;AAC/D,QAAO,IAAI,UAAU,EAAE,EAAE,UAAU,EAAE,CAAC;;;;;;;;AAYxC,SAAgB,KACd,OACA,YACA,GACA,UACa;CACb,MAAMA,SAAsB,EAAE;AAC9B,MAAK,MAAM,aAAa,YAAY;EAClC,MAAM,QAAQ,IAAI,OAAO,UAAU,OAAO;AAC1C,MAAI,aAAa,UAAa,QAAQ,SACpC;AAEF,SAAO,KAAK;GAAE,MAAM,UAAU;GAAM;GAAO,CAAC;;AAE9C,QAAO,MAAM,GAAG,MAAM,EAAE,QAA
Q,EAAE,MAAM;AACxC,QAAO,OAAO,MAAM,GAAG,EAAE"}
@@ -0,0 +1,321 @@
|
|
|
1
|
+
# Gerbil — Project State & Architecture Decision
|
|
2
|
+
|
|
3
|
+
**As of 2026-06-14.** The single authoritative snapshot. When something here conflicts
|
|
4
|
+
with an older doc, this wins. Supersedes scattered findings in `docs/research/*` and
|
|
5
|
+
the paper's roadmap. **Newest material is §12** (native audio begins — Moonshine STT
|
|
6
|
+
+ Kani-TTS-2 NanoCodec decoder; Gemma 4 E2B text decode; on-device memory/RAG; the
|
|
7
|
+
text+ViT autoresearch campaign; gerbil-site live on the native engine). §11 captured
|
|
8
|
+
EmbeddingGemma on-device, LFM2.5, the SPM tokenizer fix, MLX/DWQ loader, progress fix.
|
|
9
|
+
|
|
10
|
+
---
|
|
11
|
+
|
|
12
|
+
## 1. What Gerbil is
|
|
13
|
+
|
|
14
|
+
A local LLM inference library that runs models in the browser and Node on a
|
|
15
|
+
**single native WebGPU engine** behind one task API (no fallback lane — §2). The
|
|
16
|
+
headline achievements this cycle: the from-scratch native engine **works on mobile**
|
|
17
|
+
(iPad/iOS Safari 26.5+, WebKit) — previously it crashed — a desktop optimization
|
|
18
|
+
pass roughly doubled its throughput, and the engine went **multimodal natively**:
|
|
19
|
+
text embeddings ship (Qwen3-Embedding-0.6B) and the Qwen3.5 vision encoder is built
|
|
20
|
+
**bit-exact vs HuggingFace** (vision LM-integration is phase 2).

## 2. The decided architecture — NATIVE-ONLY (owner decision, overrides the panel)

**Decision (owner, 2026-06-13):** Gerbil is a single native WebGPU engine. **No
tfjs fallback lane** — a permanent fallback "assumes defeat to begin with." The
panel proposed keeping tfjs as a breadth lane; the owner rejected that. Instead:

- **Launch set = text + vision + embeddings, ALL native.** One model per modality
  by default, expandable to other families via the add-model-family process.
- **Audio (TTS/STT) is deferred, not delegated to tfjs** — it ships later as
  small *native* models (candidates under eval: VibeVoice-1.5B, OmniVoice,
  dots.tts-soar for TTS — IF they publish safetensors; Moonshine for STT). "What's
  the big deal" — launching without audio is fine; a permanent second engine is not.
- **tfjs is at most temporary dev scaffolding, not a destination.** It may stay
  briefly to keep desktop demos working during the native build, but it is being
  removed, not kept as a lane. `chrome-backend.ts` is deleted outright.
- **Vision uses Qwen3.5's OWN built-in ViT** (we currently skip its 192MB tower) —
  one multimodal model, not a separate vision model. Stop dropping it "like idiots."
- **A thin onnxruntime-web bridge is a break-glass option only** — used solely if
  a needed model has no extractable weights AND no native alternative. Not a lane.

**Status update (2026-06-13):** both next modalities have landed natively.
**Embeddings ship** (Qwen3-Embedding-0.6B, validated). The **vision encoder is
DONE** — the Qwen3.5 ViT runs natively and is **bit-exact vs HF transformers 5.12**
(per-token cosine 1.000000). The earlier "parallel attention kernel first" gate was
based on a misread of dead code (§5): the attention kernel was already parallel, and
non-causal was a one-line `is_causal` flag, not a rewrite. The remaining vision work
is **LM-side integration** (M-RoPE, token splice, image preprocessing) — phase 2,
plumbing over a verified core. Audio follows once a native small model is validated.

## 3. Capability matrix (what's true today)

Native-only. The "tfjs lane" is gone — tfjs is temporary dev scaffolding being
removed (§2), not a capability path.

| Modality | Native status |
|---|---|
| Text | ✅ ~51 tok/s mobile (sustained 200-tok), ~207 desktop, bit-correct |
| Text (alt family) | ✅ **LFM2.5-350M LANDED** (`Lfm2ForCausalLM`, hybrid conv/attn): ~600 tok/s desktop (2.8× Qwen), ~46 tok/s mobile, ~199MB q4, no new kernels |
| Embeddings | ✅ **DONE + ON iPad** — **EmbeddingGemma-300M** (bidirectional Gemma3 encoder, 173MB MLX-4bit) runs on iPad Safari; cos=1.00000 vs NumPy ref, Mars/bread margin >0.1, dim 768. First non-Qwen embedder. (Qwen3-Embedding-0.6B still works on desktop but OOMs iPad at 1.2GB BF16.) |
| Vision (image) | ✅ **END-TO-END DONE** — Qwen3.5 ViT, bit-exact vs HF (cosine 1.000000); `describeImage()` word-identical to HF; runs on iPad |
| Text model families (Llama/Mistral/Gemma) | 🟢 cheap — now usually **Tier-1, generator-only, NO new kernels** (kernel library saturated for standard transformers) |
| Text (Gemma 4 E2B) | ✅ **COHERENT on q4** (`gemma4.ts`): "capital of France"→"Paris", ~83 tok/s, all 35 layers cos≥0.998 vs MLX-LM. PLE, KV-share (20 layers), proportional/dual-theta RoPE, GeGLU, `Softcap`, double-wide MLP, V-norm, per-node `attn_scale`, head_dim-512 attention. **PLE CPU-streamed (0 MB GPU)** — not GPU-sharded |
| STT | 🟢 **Moonshine native LANDED** (`moonshine.ts`, `moonshine-executor.ts`, `moonshine-stt.ts`): raw-waveform Conv1d front-end (no FFT/log-mel), bit-exact `CrossAttention` kernel (max\|err\|<2e-4, cos≥0.9999), dual-graph encode-once/frozen-K-V/AR-decode, interleaved RoPE. Encoder cos≈0.990 vs HF; transcript substring-matches HF refs. **Whisper(ONNX) stays as multilingual / no-GPU fallback** (`src/core/stt.ts`) |
| TTS | 🟡 **Kani-TTS-2 partial** (`kani_tts.ts`): NanoCodec decoder (FSQ + causal HiFi-GAN) **LANDED + validated bit-exact** (`test-nanocodec-decode.mjs`, gate err<1e-3, measured ~4.2e-6). LFM2-350M codec-LM backbone scaffolded; **AR-loop glue remaining** (frame positions + learnable RoPE + 4-token-frame decode). License: kani-tts-2-en = LFM1.0/other; 450m variant = Apache |
| Memory / RAG | ✅ **SHIPPED** (`src/memory/`, `@tryhamster/gerbil/memory`): vector store (in-memory/IndexedDB/file), token-budgeted `recall()`, chunking, redaction, native EmbeddingGemma adapter. 12/12 tests. No new kernels |
| No-WebGPU / old devices | not targeted — engine throws a clear error rather than degrading |

## 4. Performance baselines (re-confirm before quoting)

| Platform | Config | Decode tok/s | Note |
|---|---|---|---|
| M4 Max, node-dawn | optimized | ~207 | re-confirm on a cooled run; numbers between commit 2f0cabc and the isMetalBackend fix are invalid |
| iPad (iOS 26.5) native | batch-all | ~41 (cooled, consistent) | NOT thermal — confirmed; Dawn-tuned autoresearch wins did NOT transfer to Metal (was ~51 pre-optimization) |
| iPad native, submit floor | group=1 awaited | 6–8 | proven-correct floor |
| iPad transformers.js (same model) | WebGPU | 7–12 | ~5× slower than native |

## 5. Vision — DONE at the encoder level (bit-exact vs HF)

The native Qwen3.5 vision encoder is **built and validated bit-exact** against HF
transformers 5.12: **per-token cosine = 1.000000, max abs err ~5e-6**. Exposed as
`engine.encodeImage(patches, gridTHW)` → merged image tokens `[rows, 1024]`. See
paper §22 and `src/gpu/architectures/qwen3_5_vision.ts`, `src/gpu/vision-executor.ts`,
`src/gpu/vision-preprocess.ts`.

The earlier strategy panel claimed native vision was blocked on a single-threaded
attention kernel. **That was wrong** — it read `src/gpu/kernels/wgsl/attention.wgsl`,
a STALE reference file **not imported anywhere** (kernels are embedded strings in
`registry.ts`; the dead `.wgsl` files have been deleted). What actually shipped:

1. **The live attention kernel was already parallel** — `WGSL_ATTENTION` is a
   tiled, online-softmax (flash-attention-style) kernel. No thread-0 serialization,
   no rewrite needed.
2. **Non-causal was a one-line flag.** An `is_causal` uniform
   (`S_eff = is_causal ? min(S, causal_limit) : S`) makes it bidirectional for the
   ViT; text stays causal by default. Done.
3. **Patch-embed was NOT a new Conv3d kernel.** Patches arrive pre-flattened to
   `[N, 1536]` from the host image processor, so the 5-D unfold/Conv3d collapses to
   a plain `MatMul` + `AddBias`. New ops added were small: `AddBias`, `GeluErf`
   (exact-erf merger GELU), `ApplyRotaryEmb` (2D rotary), `SliceCols` (fused-QKV
   split) — plus host-side pos-embed/rotary precompute (grid-only, bit-exact).
4. **A real bug surfaced during validation:** `WGSL_GELU` returned NaN for large
   args on Metal/Dawn (`x³` overflow into fast-math `tanh`); fixed by clamping the
   inner arg to ±15.
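
As a host-side illustration of items 2 and 4, the sketch below mirrors the `S_eff` expression and the clamped GELU in plain TypeScript. It is not the WGSL kernel source: the function names are hypothetical, and the tanh-approximation constants (`sqrt(2/π)` and `0.044715`) are the standard ones, assumed here rather than read from `WGSL_GELU`.

```typescript
// Number of key positions a query row may attend to, following the
// S_eff expression quoted in item 2. Names are illustrative only.
function effectiveKeyCount(S: number, causalLimit: number, isCausal: boolean): number {
  return isCausal ? Math.min(S, causalLimit) : S;
}

const SQRT_2_OVER_PI = Math.sqrt(2 / Math.PI);

// tanh-approximation GELU with the inner argument clamped to [-clamp, clamp].
// Assumption: the kernel uses the standard cubic approximation.
function geluTanhClamped(x: number, clamp = 15): number {
  const inner = SQRT_2_OVER_PI * (x + 0.044715 * x * x * x);
  // Clamping first means an overflowed cubic term never reaches tanh.
  const clamped = Math.max(-clamp, Math.min(clamp, inner));
  return 0.5 * x * (1 + Math.tanh(clamped));
}
```

In f32, `tanh(±15)` already rounds to ±1, so the clamp should leave well-behaved inputs unchanged; it only stops an overflowed `x³` term from being passed to `tanh`.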

**Net: vision is done at the encoder level.** Remaining work is **LM-side
integration** (M-RoPE, image-token splice into the text stream, pixel→patch
preprocessing) — phase 2, bounded plumbing over a verified numerical core. The
ViT prefill runs through the same parallel attention as text, so there is no
separate mobile attention risk; on-device ViT speed is a measure-when-integrated
item, not a feasibility gate.

## 6. Build sequence (ordered)

1. ~~**Native embeddings**~~ — **DONE.** Qwen3-Embedding-0.6B (`Qwen3ForCausalLM`): last-token EOS pooling + `L2Norm` tail. Validated: dim 1024, unit norm, cos(similar)=0.81 > cos(unrelated)=0.56. Confirmed causal-LM-pooling. `engine.embed()`.
2. ~~**OPFS model cache**~~ — **resolved differently: OPFS removed.** Main-thread OPFS `createWritable` is broken on iOS (leaves unclearable junk that fills the quota). The loader is now **Cache-API-only**. Durable iOS caching needs a PWA (paper §24, `docs/research/ios-safari-model-caching.md`), not OPFS.
3. ~~**Native vision encoder**~~ — **DONE, bit-exact vs HF** (§5). Remaining: LM-side integration (M-RoPE, token splice, image preprocessing) — phase 2.
4. **Vision LM integration (phase 2)** — splice `encodeImage()` tokens into the text stream: M-RoPE position assignment, placeholder-token splice, host pixel→patch preprocessing. Plumbing over a verified encoder.
5. **Close the mobile WebKit correctness/perf sweep** — `?group=N` + Test-R bisect on real iOS 26.5 for a stable fast-and-correct submit config above the group=1 floor; re-measure desktop.
6. **Llama/Mistral/Gemma graph generator** — ~90% is the existing qwen2 generator; unlocks most of the HF text zoo. Low risk, high ROI.
7. **Remove tfjs scaffolding** — delete `chrome-backend.ts`. The engine is native-only; tfjs was temporary dev scaffolding, not a kept lane.
8. **Audio, native (deferred, not delegated)** — TTS: **OmniVoice** (Qwen3 backbone + codec decoder — mostly the existing text path + a decoder). STT: **Moonshine** (lean encoder-decoder; raw-waveform Conv1d frontend, no log-mel Conv2d; needs a parallel CrossAttention kernel). Ship once a native small model clears the mobile bar.
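
The embedding tail from item 1 (last-token pooling followed by `L2Norm`) can be sketched on the CPU as follows. This is an illustrative reference, not the engine's kernel code: the function names and the row-major `[seqLen, dim]` layout are assumptions.

```typescript
// Take the hidden state of the final token from a row-major [seqLen, dim] buffer.
function lastTokenPool(hidden: Float32Array, seqLen: number, dim: number): Float32Array {
  if (hidden.length !== seqLen * dim) {
    throw new RangeError("hidden must hold seqLen * dim values");
  }
  const start = (seqLen - 1) * dim;
  return hidden.slice(start, start + dim);
}

// Scale a vector to unit L2 norm; an all-zero vector is returned as zeros.
function l2Normalize(v: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const inv = sumSq > 0 ? 1 / Math.sqrt(sumSq) : 0;
  return v.map((x) => x * inv);
}

// For unit-norm vectors, cosine similarity reduces to a dot product.
function cosineOfUnit(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}
```

Because the output is unit-norm, a check such as cos(similar) > cos(unrelated) only needs dot products between embeddings.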

## 7. What NOT to build

- **A second permanent engine / kept tfjs lane** — the architecture is native-only across text + vision + embeddings. tfjs is temporary scaffolding being removed, not a breadth lane.
- **A Conv3d/unfold patch-embed kernel for the ViT** — not needed; patches arrive pre-flattened, so patch-embed is a plain MatMul + AddBias (§5).
- **A heavyweight native TTS pipeline with custom vocoder kernels now** — audio is deferred; when it lands, OmniVoice's Qwen3 backbone reuses the text path. A thin onnxruntime-web bridge stays break-glass-only.
- **A general Conv2d kernel / Whisper-on-native** — Moonshine's raw-waveform Conv1d avoids the log-mel Conv2d.
- **An OPFS write path on iOS** — main-thread `createWritable` is broken; durable caching is a PWA concern, not an OPFS one.
- **Trusting stale tok/s numbers** between 2f0cabc and the isMetalBackend fix.

## 8. Open spikes (measure before committing)

- ~~**Embedder pooling direction**~~ — RESOLVED: causal-LM last-token (EOS) pooling, validated.
- ~~**Parallel non-causal attention feasibility for the ViT**~~ — RESOLVED: the live attention kernel was already parallel; non-causal is a one-line `is_causal` flag; the encoder is bit-exact. Remaining ViT spike: **on-device ViT prefill speed on a real iPad** (measure when LM-integration lands; not a feasibility gate).
- **Mobile WebKit submit config** — does `?group=N`/Test-R land a stable fast config above the group=1 floor on iOS 26.5? The critical path.
- **Vision LM-integration correctness** — M-RoPE + token splice must stay bit-exact end-to-end (image+text), not just at the encoder boundary.
- **Native audio on mobile** — OmniVoice (codec-decoder) and Moonshine (CrossAttention kernel) latency/correctness once a candidate is validated.
- **Autoresearch on-device tax** — desktop kernel wins did NOT transfer to mobile (41 vs 51); the loop needs a mobile-validation leg, possibly per-backend tunings.

## 9. Decision log (this cycle)

- Mobile native inference fixed (four-bug diagnosis: jetsam memory, WebKit visibility, attention race, detection predicate). See `docs/mobile-failure-diagnosis.md`, paper §17–19.
- Desktop 145→~207 via autoresearch (paper §20).
- **Architecture decision: NATIVE-ONLY** across text + vision + embeddings; no tfjs fallback lane; `chrome-backend.ts` to be deleted (paper §23).
- **Native embeddings shipped** — Qwen3-Embedding-0.6B, last-token EOS pool + L2Norm, validated (paper §21).
- **Native vision encoder shipped** — Qwen3.5 ViT, bit-exact vs HF transformers 5.12 (cosine 1.000000); LM-integration is phase 2 (paper §22). Supersedes the earlier "skip the ViT" stance (`docs/research/qwen35-multimodal.md`).
- **Vision feasibility corrected** — attention was already parallel (panel read a dead `.wgsl` file); non-causal is a one-line flag (paper §22–23, §5 above).
- **OPFS removed** — main-thread `createWritable` broken on iOS; Cache-API-only loader; durable caching needs a PWA (`docs/research/ios-safari-model-caching.md`, paper §24).
- Modality model picks (`docs/research/sota-modality-models.md`); chromium decision (`docs/research/native-vs-chromium-decision.md`); site update plan (`docs/site-update-plan.md`).

---

## 10. 2026-06-13 evening — multimodal milestone + in-flight work (CAPTURE)

**Major wins this session (all committed to `feat/webgpu-engine-mobile`):**

- **Native VISION end-to-end DONE & validated bit-exact** (commit `5fed12b`). `engine.describeImage({pixels,width,height})` → coherent description. Validated vs HF transformers 5.12 on `examples/skelly.png`: 7/7 checks — encoder cosine 1.000000, M-RoPE 3D position ids EXACT, spliced embeds cosine 1.0, first-token exact, full greedy description WORD-IDENTICAL to HF for 201 chars ("This is a detailed black-and-white line drawing of the skeleton of a fox…"). New ops: `MRoPE`, `EmbedSplice`, host image preprocessing (smart-resize/normalize/patchify in `vision-preprocess.ts`). Text path unchanged (M-RoPE on linear positions reduces to 1D RoPE).
- **Native EMBEDDINGS** (earlier commit): Qwen3-Embedding-0.6B works on DESKTOP (cosine 0.81>0.56). BUT **not iPad-viable** — only BF16 (~1.2GB, OOMs on-device) or a broken MLX-DWQ. → Pivot below.
- **`is_causal` flag** on f32 attention (from vision) now unlocks **bidirectional encoders**.

**The 51-vs-41 mobile throughput question — RESOLVED: NOT a regression.** t15 hit 51.7 on a 200-token sustained run; the 38-41 readings were 60-token runs (more warmup overhead/token). A fresh 200-token sustained run on the CURRENT (post-autoresearch-optimization) engine **hit ~51 again**. So the Dawn-tuned wins did NOT regress mobile; sustained mobile decode is ~51 tok/s. Lesson: benchmark with enough tokens (≥200) to be representative.
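
One way to apply that lesson is to compute throughput from per-token emit timestamps and leave the start-up interval out of the measurement. This is a minimal sketch, assuming the harness records a millisecond timestamp when each token is emitted; `sustainedTokPerSec` and its parameters are hypothetical names, and the timestamps in the example are synthetic rather than measured.

```typescript
// Decode rate over the tokens emitted after `warmupTokens`, in tokens/second.
// emitTimesMs[i] is the time (ms since the request started) at which token i appeared.
function sustainedTokPerSec(emitTimesMs: number[], warmupTokens: number): number {
  const n = emitTimesMs.length;
  if (warmupTokens < 0 || warmupTokens >= n - 1) {
    throw new RangeError("need at least two tokens after warmup");
  }
  const elapsedMs = emitTimesMs[n - 1] - emitTimesMs[warmupTokens];
  const tokens = n - 1 - warmupTokens; // intervals between the counted tokens
  return tokens / (elapsedMs / 1000);
}

// Synthetic example: 500 ms to first token, then one token every 20 ms.
const times = Array.from({ length: 60 }, (_, i) => 500 + 20 * i);
const naive = times.length / (times[times.length - 1] / 1000); // includes start-up cost
const sustained = sustainedTokPerSec(times, 0);                // 59 tokens over 1.18 s
```

On a short run the fixed start-up cost is a large share of the total, so `naive` sits well below `sustained`; the gap narrows as the token count grows, which is why longer runs are more representative.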

**IN-FLIGHT / NOT YET MERGED (future-me: merge these):**

- **LFM2.5-350M** — ✅ **LANDED** on `feat/webgpu-engine-mobile` (commit `3c4bac8`). Tier-1, no new kernels. **~600 tok/s desktop (2.8× Qwen3.5's ~213), ~199MB q4 (half Qwen), coherent, bit-exact vs NumPy ref.** Files: new `src/gpu/architectures/lfm2.ts` (824 lines), registered `Lfm2ForCausalLM`, LFM2 CANONICAL_KEYS, loader key-mapper. TWO GENERAL FIXES worth keeping: (1) effective FF dim is 4608 not config's 6656 (`block_auto_adjust_ff_dim` rounding — `multiple_of(⌊2/3·6656⌋)`); (2) **the "garbage output" was the CHAT TEMPLATE, not the graph** — LFM2.5 ships its template as a `chat_template.jinja` sidecar (absent from tokenizer_config.json), so the engine fell back to Qwen ChatML which auto-injects an empty `<think>` → newline loop. Fix: fetch the `.jinja` sidecar + gate think-injection on the template actually emitting `<think>`. This fix helps ANY model with a jinja sidecar. Verdict: LFM2.5 is a viable faster/smaller text-default alternative to Qwen — user picks.
- **EmbeddingGemma-300M** — ✅ **LANDED + CONFIRMED ON iPad** (commit `4874d01`, see §11). The iPad-ready embedding model (~173MB q4, MTEB 68.36, standard MLX-4bit). Bidirectional Gemma3 encoder — needs the new generator + a MeanPool kernel + dual-theta RoPE + 2 Dense head layers + the MLX-detection loader fix (research found: detector requires `mode:"affine"` but standard MLX converts omit it → silently fall to F32; and DWQ vs standard MLX are config-indistinguishable → the DWQ-garbage trap). See `docs/research/sota-embedding-models.md`.
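
The FF-dimension figure in the first LFM2.5 fix can be checked by hand. The sketch below follows the rounding described above: two-thirds of the configured width, floored, then rounded up to a block multiple. The multiple of 256 is an assumption chosen because it reproduces the 4608 figure; `adjustedFfDim` is a hypothetical helper, not the loader's function.

```typescript
// Effective feed-forward width after the auto-adjust rounding.
// multipleOf = 256 is an assumed value; it is not stated in the notes above.
function adjustedFfDim(configFfDim: number, multipleOf: number): number {
  const twoThirds = Math.floor((2 * configFfDim) / 3);   // ⌊2/3 · dim⌋
  return multipleOf * Math.ceil(twoThirds / multipleOf); // round up to the multiple
}

const twoThirdsOfConfig = Math.floor((2 * 6656) / 3); // 4437
const effective = adjustedFfDim(6656, 256);           // 18 * 256 = 4608
```

A generator that sizes the MLP from the raw config value (6656) would expect weight shapes that the checkpoint does not contain, so the adjusted width has to be computed before shapes are validated.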

**iPad harness:** dashboard at `https://<lan-ip>:8766/` with Text/Vision/transformers.js/Storage-probe/Clear-cache/Stop-queue buttons, live results table, `/results` + `/enqueue` + `/clear-queue` endpoints. Embeddings button removed until EmbeddingGemma lands. TODO: show generated text in the results + default Text benchmark to 200 tokens (representative); wire the Vision button to `describeImage` (needs browser ImageBitmap/Canvas→pixels decode).

**iOS caching:** OPFS removed (main-thread createWritable broken on iOS, left unclearable junk). Durable cache needs a PWA (persist() only granted when installed). See `docs/research/ios-safari-model-caching.md`. Deferred.

**Native modality scorecard (as of §10):** text ✅, vision ✅ (end-to-end bit-exact), embeddings ✅ desktop / iPad-pending (EmbeddingGemma building), LFM2.5 text-alt ✅ (pending merge). Audio (TTS/STT) deferred to native OmniVoice/Moonshine. **The native-only multimodal engine is real.** **→ All "pending/building" items above have since landed; see §11.**

---

## 11. 2026-06-14 — EmbeddingGemma on-device, LFM2.5 landed, loader hardening (CAPTURE)

The §10 in-flight items all landed on `feat/webgpu-engine-mobile`. New wins this session
(paper §25–§30 document each in depth):

- **EmbeddingGemma-300M LANDED and CONFIRMED RUNNING ON iPad Safari** (commit `4874d01`).
  First **non-Qwen** embedding family — a real **bidirectional Gemma3 encoder**
  (`src/gpu/architectures/gemma3_encoder.ts`, `generateGemma3EncoderGraph`): 24 pre-norm
  blocks, GQA 3q/1kv head_dim 256, per-head q/k-norm, **dual-theta RoPE** (sliding θ=10000
  / full θ=1e6 selected per layer from `layer_types`), GeGLU MLP, Gemma's **four-norm
  sandwich**, embed ×√768, tail **MeanPool → Dense0(768→3072) → Dense1(3072→768) → L2Norm**.
  **173 MB at MLX-4bit** (vs the abandoned 1.2 GB Qwen3-Embedding that OOM'd iPad).
  Two new kernels only: **MeanPool**, **Scale** (`kernels/registry.ts`). Validated
  **cos=1.00000 vs an independent NumPy reference** (`scripts/engine/test-embedding-gemma-reference.py`),
  reference gate `cos>0.95`; semantic test asserts a Red-Planet query is closer to two
  Mars docs than to a bread doc by **>0.1 cosine margin**, unit-norm dim-768, no NaN. The
  engine **generalizes across embedding families**, not just Qwen.

- **SPM tokenizer fix (load-bearing, cross-family)** (`src/gpu/tokenizer.ts`). Gemma's
  `tokenizer.json` is `type:"BPE"` but **SentencePiece-flavored** (▁/U+2581 spaces, raw
  UTF-8 tokens, array-form merges, `<0xHH>` byte-fallback). The byte-level (`Ġ`) BPE path
  was char-splitting every word → semantically dead embeddings that still passed norm/NaN
  checks. Auto-detected `spmMode` (structural: `" "→"▁"` Replace normalizer **or**
  `byte_fallback && "▁the" in vocab`) now drives encode/decode/merges; **Qwen/LFM2 stay
  byte-level** (no model-name list — they just don't match). Lesson: `type:"BPE"` ≠
  byte-level BPE; any SentencePiece-lineage family (Gemma/Llama/Mistral) needs this path.

- **MLX-4bit loader hardened** (`src/gpu/model-loader.ts`). (1) Detection broadened to
  accept mode-less `{bits:4, group_size}` configs (standard mlx-lm omits `mode`). (2) A
  `VERIFIED_MLX_REPOS` allowlist (currently `mlx-community/embeddinggemma-300m-4bit`) gates
  mode-less configs. (3) **Explicit DWQ reject** — DWQ repos carry an identical
  `{bits:4,group_size}` config but pack weights that dequant to garbage, so they're rejected
  by repo-name substring (`includes("dwq")`). Codifies the MLX-DWQ-garbage trap. (4) Gemma's
  `(1+weight)` RMSNorm absorption is **baked by the loader even for MLX** (mlx-lm pre-absorbs
  +1 for Qwen3.5 but NOT for Gemma — the Gemma branch deliberately omits `&& !isMLX`).

- **Progress-reporting fix** (commit `682a09b`, `model-loader.ts`). The bar froze at
  "10% discovering weight files" because the gap between that emit and the first download
  chunk (index probe + 2 header range-requests + first-byte latency) emitted nothing. Now
  emits "Reading {file} header" + "Downloading {file} (0/{N} MB)" up front. **Affected every
  model**; fixed universally.

- **LFM2.5-350M landed** (commit `3c4bac8`) — Tier-1, no new kernels, ~600 tok/s desktop /
  ~46 tok/s mobile, ~199 MB q4. The general jinja-sidecar chat-template fix from it helps any
  model shipping `chat_template.jinja`.

- **Cross-device multi-modal parity.** Text (Qwen3.5 ~51 tok/s, LFM2.5 ~46 tok/s), vision
  (Qwen3.5 ViT `describeImage`), and embeddings (EmbeddingGemma) **all now run natively on
  iPad Safari** — the native engine reaches transformers.js-path modality coverage on the
  modalities that matter, without the mobile crashes and ~5× faster. **Remaining native gap:
  audio** — TTS via OmniVoice (in progress), STT via Moonshine (not started).

- **Effort-tier shift.** Adding a new TEXT family is now usually **Tier-1: generator only, no
  new kernels** — the kernel library has saturated for standard transformers (Llama/Mistral/
  Gemma-text reuse existing ops). New kernels are needed only for genuinely novel ops (SSM,
  PLE, new norms, cross-attention). See `docs/adding-a-model-family.md`.

- **Site migration assessment** written: `docs/gerbil-site-native-migration.md` (how the
  marketing/docs site can move its in-browser inference from the transformers.js/ONNX worker
  to the native engine, modality-by-modality, with the no-fallback device-coverage tradeoff
  stated plainly). Assessment only — the site repo was not modified.
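
The structural `spmMode` check from the tokenizer bullet above can be sketched as a pure function over a parsed `tokenizer.json`. This is an illustration under assumptions, not the code in `src/gpu/tokenizer.ts`: the interfaces cover only the fields inspected, the helper names are hypothetical, and the walk through nested `Sequence` normalizers is an assumed detail.

```typescript
// Minimal slice of a Hugging Face tokenizer.json; only the fields inspected here.
interface NormalizerSpec {
  type: string;
  pattern?: { String?: string; Regex?: string };
  content?: string;
  normalizers?: NormalizerSpec[]; // present on "Sequence" normalizers
}
interface TokenizerJson {
  normalizer?: NormalizerSpec | null;
  model?: { type?: string; byte_fallback?: boolean; vocab?: Record<string, number> };
}

const METASPACE = "\u2581"; // the "▁" word-boundary marker

// True if any (possibly nested) normalizer replaces " " with "▁".
function replacesSpaceWithMetaspace(n: NormalizerSpec | null | undefined): boolean {
  if (!n) return false;
  if (n.type === "Replace" && n.pattern?.String === " " && n.content === METASPACE) {
    return true;
  }
  return (n.normalizers ?? []).some((child) => replacesSpaceWithMetaspace(child));
}

// SentencePiece-flavoured BPE if either structural signal matches.
function detectSpmMode(tok: TokenizerJson): boolean {
  if (replacesSpaceWithMetaspace(tok.normalizer)) return true;
  const vocab = tok.model?.vocab ?? {};
  const hasMetaspaceWord = Object.prototype.hasOwnProperty.call(vocab, METASPACE + "the");
  return tok.model?.byte_fallback === true && hasMetaspaceWord;
}
```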

---

## 12. 2026-06-14 — native audio begins, Gemma 4, memory/RAG, autoresearch campaign (CAPTURE)

This session pushed past the §11 "multimodal parity minus audio" milestone. Paper
**§31–§35** document each in depth. The decided order from §6 (audio last, native not
delegated) held — and audio is now *under way natively*, not deferred to tfjs.

- **Native STT — Moonshine LANDED** (`src/gpu/architectures/moonshine.ts`,
  `moonshine-executor.ts`, `moonshine-stt.ts`; paper §31). Chosen over Whisper to avoid
  a log-mel/Conv2d front-end: it consumes **16 kHz PCM directly** through three strided
  `Conv1d`s (downsample 384×, ~41.6 frames/s) — new kernels `Conv1dFull`, `GroupNorm`,
  `Tanh`, `Transpose`. The real new attention primitive is **`CrossAttention`**
  (`WGSL_CROSS_ATTENTION`, tiled online-softmax), validated **bit-exact vs NumPy**
  (`test-crossattention.mjs`: max|err|<2e-4, cos≥0.9999). Runtime is a **dual graph**:
  `MoonshineEncoderExecutor.encode()` runs the front-end + bidirectional encoder once and
  freezes per-decoder-layer K/V; `MoonshineSTT.transcribe()` runs greedy AR decode with
  self- + cross-attention into that frozen K/V. **Interleaved RoPE** (`ROPE_INTERLEAVED_SPEC`,
  adjacent-dim pairing) vs the default split-half. Validation: encoder cos≈0.990 vs HF
  (source comment); `test-moonshine-transcribe.mjs` asserts transcript contains HF-ref
  substrings; RTF/4-bit size are computed-and-reported, not hardcoded-asserted.
  **Whisper(ONNX) stays the multilingual / no-WebGPU fallback** (`src/core/stt.ts`,
  `WhisperSTT`) — separate, untouched.

- **Native TTS — Kani-TTS-2 (partial)** (`src/gpu/architectures/kani_tts.ts`; paper §32).
  The hard novel piece — the **NanoCodec decoder** (FSQ 4×4 levels `[9,8,8,7]`, mixed-radix
  base `[1,9,72,576]` + causal HiFi-GAN, rates `[7,7,6,3,2]`, hop 1764 @ 22050 Hz) — is
  **implemented and validated bit-exact** (`test-nanocodec-decode.mjs`, gate `err<1e-3`,
  measured ~4.2e-6 vs MLX). New kernels: `FSQDequant`, `HalfSnake1d`,
  `ConvTranspose1dDepthwise`. The backbone is **LFM2-350M** (`KaniTTS2ForCausalLM`, audio
  tokens above the text vocab, 4/frame); `generateKaniTtsGraph` **deliberately throws** —
  remaining is the frame-position + learnable-RoPE + 4-token-frame AR-decode glue (most
  block math reused from `lfm2.ts`). License: **kani-tts-2-en = LFM1.0 (other)**, NanoCodec
  = NVIDIA OML; the **450m variant is Apache** (same arch).

- **Gemma 4 E2B — text decode COHERENT on real q4 weights** (`src/gpu/architectures/gemma4.ts`;
  paper §33). **Tier-2**, no MatFormer/AltUp/LAuReL. **PLE** (2nd embedding gathered per
  token, per-layer gate+GELU+multiply+project+norm+residual), **KV-cache sharing** (E2B: 35
  layers, last **20** shared via graph-rewire, no kernel change), **proportional RoPE**
  (rotate 0.25·head_dim=64 dims but inv_freq over full head_dim denom; dual-theta 1e6/1e4),
  **GeGLU**, and a new **`Softcap`** kernel (`cap·tanh(x/cap)`, cap=30) on final logits.
  Generates coherently ("The capital of France is" → "Paris") at ~83 tok/s; all 35 layers
  match an MLX-LM reference cos≥0.998 with identical argmax (`test-gemma4-perlayer.mjs`).
  Structural validation **67/67** (`test-gemma4-graph.mjs`; 942 nodes); softcap kernel
  separately validated (`test-gemma4-softcap.mjs`, max err<1e-4). **PLE is CPU-streamed, not
  GPU-sharded**: the ~1.17 GB q4 PLE table stays CPU-resident (**0 MB GPU**) and the executor
  streams per-token rows each step — Gemma 4's intended flash design, mobile-viable, and it
  sidesteps the per-binding cap. Coherence required four fixes: per-node `attn_scale`=1.0
  (HF scaling=1.0; default 1/√head_dim keeps other models byte-identical), parameter-free
  V-norm, double-wide MLP on the KV-shared layers, and head_dim-512 support in the flash
  attention kernel (second per-thread accumulator + smem-capped tiling, 16 KB invariant kept).

- **On-device Memory / RAG SHIPPED** (`src/memory/`, `@tryhamster/gerbil/memory`; paper §34).
  Pluggable vector store (`InMemoryStore` / `IndexedDBStore` / `FileStore`), token-budgeted
  `recall()` (default 1024-token greedy pack, ~4-chars/token, returns `{context, records,
  tokensUsed}`), overlapping-window chunking (1000/200), write-time redaction (regex →
  `[REDACTED]` or fn), and a `createGerbilEmbedder()` adapter over **native EmbeddingGemma**.
  **12/12 tests** (`src/memory/memory.test.ts`). No new kernels — a clean consumer of the
  embedding modality.

- **Autoresearch TPS campaign — 3 more batches** (`scripts/engine/results.jsonl`,
  `scripts/engine/chart.html`; paper §35), M4 Max / node-dawn. Verified peaks:
  **Qwen3.5-0.8B 219→~234 tok/s**, **LFM2.5-350M 624→~672 tok/s**, **ViT encode
  581.8→~502 ms (−~14%)**, **`describeImage` 37.0→42.0 tok/s (+13.5%)** — all kept changes
  bit-exact (merged cos 1.0, e2e 7/7). **The winning lesson (sharpened):** desktop wins come
  only from eliminating **large wide reads on poorly-occupied kernels** (fused conv+activation,
  vec4 + register-blocked + f16-mixed ViT matmul, `MatMul+AddBias→MatMulBias`); **pure
  dispatch-count cuts on already-tuned kernels are noise** (butterfly reduce, subgroup
  shuffle, bigger N-tiles all reverted). The INT4 matmuls and Mamba SSM sit at the
  **bandwidth floor** — the remaining headroom is **mobile**, not desktop (several
  reverted-on-desktop fusions are *predicted mobile wins*; the loop's next leg is a
  mobile-validation pass).

- **gerbil-site is LIVE on the native engine.** The marketing/docs site's in-browser
  inference now runs on the native WGSL engine (no longer the transformers.js/ONNX worker)
  for the migrated modalities — see `docs/gerbil-site-native-migration.md`. (The §11 entry
  was assessment-only; this cycle it went live.)
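
The NanoCodec numbers quoted in the TTS bullet above are internally consistent, which a few lines of arithmetic can confirm. The sketch assumes the mixed-radix base is the running product of the FSQ levels with the first dimension least significant; the helper names are hypothetical, and the mapping from digits to dequantized values is deliberately left out.

```typescript
const LEVELS = [9, 8, 8, 7];          // FSQ levels per dimension
const UPSAMPLE_RATES = [7, 7, 6, 3, 2]; // HiFi-GAN stage rates

// Running product of the levels: the place value of each dimension.
function mixedRadixBase(levels: number[]): number[] {
  const base: number[] = [];
  let acc = 1;
  for (const l of levels) {
    base.push(acc);
    acc *= l;
  }
  return base;
}

// Split a flat code index into one digit per FSQ dimension.
function indexToDigits(index: number, levels: number[]): number[] {
  const base = mixedRadixBase(levels);
  return levels.map((l, i) => Math.floor(index / base[i]) % l);
}

// Inverse of indexToDigits.
function digitsToIndex(digits: number[], levels: number[]): number {
  const base = mixedRadixBase(levels);
  return digits.reduce((sum, d, i) => sum + d * base[i], 0);
}

const hop = UPSAMPLE_RATES.reduce((a, b) => a * b, 1); // samples per codec frame
const framesPerSecond = 22050 / hop;                   // codec frame rate at 22050 Hz
```

The base comes out as `[1, 9, 72, 576]` and the product of the stage rates equals the quoted hop of 1764 samples, which corresponds to 12.5 codec frames per second at 22050 Hz.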

**Native modality scorecard (as of §12):** text ✅ (Qwen3.5, LFM2.5), text-alt families 🟢
Tier-1, Gemma 4 E2B 🟡 (decode validated, sharding gates real weights), vision ✅ (Qwen3.5
ViT, on iPad), embeddings ✅ (EmbeddingGemma, on iPad), **STT ✅ native (Moonshine) + Whisper
fallback**, **TTS 🟡 (Kani NanoCodec decoder validated, backbone AR loop pending)**, memory/RAG
✅ shipped. **Audio is no longer the deferred gap — STT is native; TTS is one AR-loop away.**