@tryhamster/gerbil 1.0.0-rc.9 → 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (179)
  1. package/LICENSE +1 -1
  2. package/README.md +318 -104
  3. package/dist/architectures-C1I5V3Dt.mjs +6070 -0
  4. package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
  5. package/dist/browser/index.d.ts +276 -590
  6. package/dist/browser/index.d.ts.map +1 -1
  7. package/dist/browser/index.js +592 -2334
  8. package/dist/browser/index.js.map +1 -1
  9. package/dist/cli.mjs +625 -1098
  10. package/dist/cli.mjs.map +1 -1
  11. package/dist/defaults-9komdrbY.mjs +24 -0
  12. package/dist/defaults-9komdrbY.mjs.map +1 -0
  13. package/dist/frameworks/express.d.mts +1 -3
  14. package/dist/frameworks/express.d.mts.map +1 -1
  15. package/dist/frameworks/express.mjs +7 -7
  16. package/dist/frameworks/express.mjs.map +1 -1
  17. package/dist/frameworks/fastify.d.mts +1 -1
  18. package/dist/frameworks/fastify.d.mts.map +1 -1
  19. package/dist/frameworks/fastify.mjs +3 -3
  20. package/dist/frameworks/fastify.mjs.map +1 -1
  21. package/dist/frameworks/hono.d.mts +1 -1
  22. package/dist/frameworks/hono.d.mts.map +1 -1
  23. package/dist/frameworks/hono.mjs +4 -4
  24. package/dist/frameworks/hono.mjs.map +1 -1
  25. package/dist/frameworks/next.d.mts +3 -2
  26. package/dist/frameworks/next.d.mts.map +1 -1
  27. package/dist/frameworks/next.mjs +4 -4
  28. package/dist/frameworks/next.mjs.map +1 -1
  29. package/dist/frameworks/react.d.mts +1 -1
  30. package/dist/frameworks/trpc.d.mts +1 -1
  31. package/dist/frameworks/trpc.d.mts.map +1 -1
  32. package/dist/frameworks/trpc.mjs +4 -4
  33. package/dist/frameworks/trpc.mjs.map +1 -1
  34. package/dist/gerbil-BetB5xb0.d.mts +488 -0
  35. package/dist/gerbil-BetB5xb0.d.mts.map +1 -0
  36. package/dist/gerbil-CTZUa8EZ.mjs +4 -0
  37. package/dist/gerbil-DNniplr4.mjs +1656 -0
  38. package/dist/gerbil-DNniplr4.mjs.map +1 -0
  39. package/dist/gpu/hooks.d.mts +640 -0
  40. package/dist/gpu/hooks.d.mts.map +1 -0
  41. package/dist/gpu/hooks.mjs +1369 -0
  42. package/dist/gpu/hooks.mjs.map +1 -0
  43. package/dist/gpu/index.d.mts +2 -0
  44. package/dist/gpu/index.mjs +6 -0
  45. package/dist/gpu-DFuglcEx.mjs +3790 -0
  46. package/dist/gpu-DFuglcEx.mjs.map +1 -0
  47. package/dist/index-Dgmb2kE3.d.mts +245 -0
  48. package/dist/index-Dgmb2kE3.d.mts.map +1 -0
  49. package/dist/index-DukkJRMj.d.mts +2114 -0
  50. package/dist/index-DukkJRMj.d.mts.map +1 -0
  51. package/dist/index.d.mts +22 -487
  52. package/dist/index.d.mts.map +1 -1
  53. package/dist/index.mjs +13 -8
  54. package/dist/index.mjs.map +1 -1
  55. package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
  56. package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
  57. package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
  58. package/dist/integrations/ai-sdk.d.mts +75 -6
  59. package/dist/integrations/ai-sdk.d.mts.map +1 -1
  60. package/dist/integrations/ai-sdk.mjs +131 -15
  61. package/dist/integrations/ai-sdk.mjs.map +1 -1
  62. package/dist/integrations/langchain.d.mts +1 -1
  63. package/dist/integrations/langchain.d.mts.map +1 -1
  64. package/dist/integrations/langchain.mjs +5 -5
  65. package/dist/integrations/langchain.mjs.map +1 -1
  66. package/dist/integrations/llamaindex.d.mts +1 -1
  67. package/dist/integrations/llamaindex.d.mts.map +1 -1
  68. package/dist/integrations/llamaindex.mjs +5 -5
  69. package/dist/integrations/llamaindex.mjs.map +1 -1
  70. package/dist/integrations/mcp-client.mjs +3 -3
  71. package/dist/integrations/mcp-client.mjs.map +1 -1
  72. package/dist/integrations/mcp.d.mts +3 -2
  73. package/dist/integrations/mcp.d.mts.map +1 -1
  74. package/dist/integrations/mcp.mjs +5 -5
  75. package/dist/{mcp-BvbriaBy.mjs → mcp-D2vvH1Xc.mjs} +4 -4
  76. package/dist/mcp-D2vvH1Xc.mjs.map +1 -0
  77. package/dist/memory/index.d.mts +3 -0
  78. package/dist/memory/index.mjs +6 -0
  79. package/dist/memory-D1P7Tmda.mjs +4 -0
  80. package/dist/memory-DVN0MnIG.mjs +132 -0
  81. package/dist/memory-DVN0MnIG.mjs.map +1 -0
  82. package/dist/memory-Dj0J1v88.mjs +294 -0
  83. package/dist/memory-Dj0J1v88.mjs.map +1 -0
  84. package/dist/moonshine-stt-17dpP1kr.mjs +4 -0
  85. package/dist/moonshine-stt-4ojLtMq7.mjs +11962 -0
  86. package/dist/moonshine-stt-4ojLtMq7.mjs.map +1 -0
  87. package/dist/{one-liner-s-lD8rCC.mjs → one-liner-JhdIPxzF.mjs} +14 -16
  88. package/dist/one-liner-JhdIPxzF.mjs.map +1 -0
  89. package/dist/repl-BDRkwPGX.mjs +9 -0
  90. package/dist/skills/index.d.mts +270 -320
  91. package/dist/skills/index.d.mts.map +1 -1
  92. package/dist/skills/index.mjs +5 -5
  93. package/dist/{skills-CD3Orlex.mjs → skills-CU694Dc8.mjs} +187 -32
  94. package/dist/skills-CU694Dc8.mjs.map +1 -0
  95. package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
  96. package/dist/tools-DQ1mPUw5.mjs.map +1 -0
  97. package/dist/types-DQBe2lFo.d.mts +165 -0
  98. package/dist/types-DQBe2lFo.d.mts.map +1 -0
  99. package/dist/{types-CiTc7ez3.d.mts → types-LlyYILII.d.mts} +112 -14
  100. package/dist/types-LlyYILII.d.mts.map +1 -0
  101. package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
  102. package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
  103. package/dist/vector-B0panuy6.mjs +95 -0
  104. package/dist/vector-B0panuy6.mjs.map +1 -0
  105. package/docs/PROJECT-STATE.md +321 -0
  106. package/docs/adding-a-model-family.md +280 -0
  107. package/docs/ai-sdk.md +70 -61
  108. package/docs/architecture/overview.md +17 -7
  109. package/docs/browser.md +203 -8
  110. package/docs/embeddings.md +156 -0
  111. package/docs/gerbil-site-native-migration.md +217 -0
  112. package/docs/gpu-engine/architectures.md +398 -0
  113. package/docs/gpu-engine/ir.md +372 -0
  114. package/docs/gpu-engine/kernels.md +718 -0
  115. package/docs/gpu-engine/paper.html +1759 -0
  116. package/docs/gpu-engine/paper.md +2109 -0
  117. package/docs/gpu-engine/safetensors.md +312 -0
  118. package/docs/gpu-engine/tokenizer.md +302 -0
  119. package/docs/memory-rag.md +91 -0
  120. package/docs/metal-safari-intel.md +190 -0
  121. package/docs/mobile-failure-diagnosis.md +124 -0
  122. package/docs/mobile.md +99 -0
  123. package/docs/observability.md +230 -0
  124. package/docs/onnx-removal-plan.md +339 -0
  125. package/docs/research/autoresearch-portable.md +904 -0
  126. package/docs/research/dispatch-reduction-hivemind.md +84 -0
  127. package/docs/research/ios-safari-model-caching.md +117 -0
  128. package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
  129. package/docs/research/native-stt-model-selection.md +49 -0
  130. package/docs/research/native-tts-model-selection.md +90 -0
  131. package/docs/research/native-vs-chromium-decision.md +152 -0
  132. package/docs/research/nemotron-mamba2-inference.md +910 -0
  133. package/docs/research/qwen35-multimodal.md +293 -0
  134. package/docs/research/qwen36-gemma4-targets.md +337 -0
  135. package/docs/research/sota-embedding-models.md +179 -0
  136. package/docs/research/sota-mobile-models-2026.md +263 -0
  137. package/docs/research/sota-modality-models.md +202 -0
  138. package/docs/research/tps-baselines.md +71 -0
  139. package/docs/research/webgpu-m4-reference.md +104 -0
  140. package/docs/site-update-plan.md +155 -0
  141. package/docs/structured-output.md +123 -0
  142. package/docs/stt.md +63 -446
  143. package/docs/tts.md +77 -499
  144. package/docs/vision.md +100 -338
  145. package/package.json +22 -7
  146. package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
  147. package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
  148. package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
  149. package/dist/gerbil-CJ3ifloF.mjs +0 -4
  150. package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
  151. package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
  152. package/dist/gerbil-qOTe1nl2.d.mts +0 -431
  153. package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
  154. package/dist/kokoro-BNTb6egA.mjs +0 -20210
  155. package/dist/kokoro-BNTb6egA.mjs.map +0 -1
  156. package/dist/kokoro-CMOGDSgT.js +0 -20212
  157. package/dist/kokoro-CMOGDSgT.js.map +0 -1
  158. package/dist/mcp-BvbriaBy.mjs.map +0 -1
  159. package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
  160. package/dist/repl-DveXw36T.mjs +0 -9
  161. package/dist/skills-CD3Orlex.mjs.map +0 -1
  162. package/dist/stt-Bu-E23Sc.js +0 -433
  163. package/dist/stt-Bu-E23Sc.js.map +0 -1
  164. package/dist/stt-CpLYbGFd.mjs +0 -433
  165. package/dist/stt-CpLYbGFd.mjs.map +0 -1
  166. package/dist/stt-DRPLEEHB.mjs +0 -3
  167. package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
  168. package/dist/transformers.web-DiD1gTwk.js +0 -44695
  169. package/dist/transformers.web-DiD1gTwk.js.map +0 -1
  170. package/dist/transformers.web-u34VxRFM.js +0 -3
  171. package/dist/tts-CqroPaSK.js +0 -724
  172. package/dist/tts-CqroPaSK.js.map +0 -1
  173. package/dist/tts-DXgsKGCe.mjs +0 -3
  174. package/dist/tts-DeGANMNV.mjs +0 -730
  175. package/dist/tts-DeGANMNV.mjs.map +0 -1
  176. package/dist/types-CiTc7ez3.d.mts.map +0 -1
  177. /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
  178. /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
  179. /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,312 @@
1
+ # Safetensors Parser Deep Dive
2
+
3
+ The safetensors parser (`src/gpu/safetensors.ts`) reads HuggingFace's binary safetensors format and provides zero-copy typed array views into the raw data. This document covers the binary format, the parser's design, alignment handling, and streaming support.
4
+
5
+ ---
6
+
7
+ ## Binary Format
8
+
9
+ The safetensors format is intentionally simple. A file consists of three contiguous sections:
10
+
11
+ ```
12
+ +--------------------------------------------------+
13
+ | 8 bytes: header_length (little-endian uint64) |
14
+ +--------------------------------------------------+
15
+ | header_length bytes: JSON header (UTF-8) |
16
+ +--------------------------------------------------+
17
+ | remaining bytes: raw tensor data (contiguous) |
18
+ +--------------------------------------------------+
19
+ ```
20
+
21
+ ### Section 1: Header Length (8 bytes)
22
+
23
+ The first 8 bytes encode the JSON header length as a little-endian unsigned 64-bit integer. In practice, headers are always well under 4GB, so only the lower 32 bits are meaningful. The parser uses `DataView.getBigUint64(0, true)` and converts to a JavaScript `Number`.
24
+
25
+ ### Section 2: JSON Header
26
+
27
+ The header is a JSON object mapping tensor names to their metadata:
28
+
29
+ ```json
30
+ {
31
+ "model.embed_tokens.weight": {
32
+ "dtype": "F32",
33
+ "shape": [151936, 896],
34
+ "data_offsets": [0, 544538624]
35
+ },
36
+ "model.layers.0.input_layernorm.weight": {
37
+ "dtype": "F32",
38
+ "shape": [896],
39
+ "data_offsets": [544538624, 544542208]
40
+ },
41
+ "__metadata__": {
42
+ "format": "pt"
43
+ }
44
+ }
45
+ ```
46
+
47
+ Each tensor entry contains:
48
+ - `dtype`: Data type string (see dtype table below)
49
+ - `shape`: Array of dimension sizes
50
+ - `data_offsets`: `[start, end]` byte offsets relative to the beginning of the data section
51
+
52
+ The special `__metadata__` key is optional and contains file-level metadata (e.g., the framework that produced the file).
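The two header sections described above can be read with a few lines of code. The following is a minimal sketch, not the parser's actual source: `readRawHeader` and `RawTensorInfo` are hypothetical names, and the only format facts it relies on are the ones stated above (8-byte little-endian length, UTF-8 JSON, optional `__metadata__`).

```typescript
// Hypothetical shape of one raw header entry, as it appears in the JSON.
interface RawTensorInfo {
  dtype: string;
  shape: number[];
  data_offsets: [number, number];
}

// Hypothetical helper: decode the length prefix and the JSON header.
function readRawHeader(buffer: ArrayBuffer): {
  headerLength: number;
  dataStart: number;
  tensors: Record<string, RawTensorInfo>;
  metadata: Record<string, string> | null;
} {
  const view = new DataView(buffer);
  // Little-endian uint64, narrowed to Number (fine for realistic header sizes).
  const headerLength = Number(view.getBigUint64(0, true));
  if (8 + headerLength > buffer.byteLength) {
    throw new Error("buffer does not contain the full header");
  }
  const json = new TextDecoder().decode(new Uint8Array(buffer, 8, headerLength));
  const parsed = JSON.parse(json) as Record<string, unknown>;

  const tensors: Record<string, RawTensorInfo> = {};
  let metadata: Record<string, string> | null = null;
  for (const [name, value] of Object.entries(parsed)) {
    if (name === "__metadata__") {
      metadata = value as Record<string, string>;
    } else {
      tensors[name] = value as RawTensorInfo;
    }
  }
  return { headerLength, dataStart: 8 + headerLength, tensors, metadata };
}
```

The sketch performs no validation of dtypes or offsets; a production parser would check both before creating views.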
53
+
54
+ ### Section 3: Tensor Data
55
+
56
+ Raw tensor data, stored contiguously. Each tensor's data occupies `data_offsets[1] - data_offsets[0]` bytes starting at `data_start + data_offsets[0]`, where `data_start = 8 + header_length`.
57
+
58
+ Tensors are stored in row-major order (C contiguous). Multi-dimensional tensors are flattened: element `[i, j, k]` of a tensor with shape `[D0, D1, D2]` is at flat index `i * D1 * D2 + j * D2 + k`.
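The row-major formula generalizes to any rank. As an illustration (the helper `flatIndex` is a made-up name, not part of the parser), the flat index can be accumulated dimension by dimension:

```typescript
// Hypothetical helper: row-major (C-contiguous) flat index for any rank.
function flatIndex(shape: number[], index: number[]): number {
  if (shape.length !== index.length) {
    throw new Error("index rank does not match shape rank");
  }
  let flat = 0;
  for (let d = 0; d < shape.length; d++) {
    const size = shape[d]!;
    const i = index[d]!;
    if (i < 0 || i >= size) {
      throw new RangeError(`index ${i} out of bounds for dimension ${d}`);
    }
    // Horner-style accumulation: ((i * D1) + j) * D2 + k ...
    flat = flat * size + i;
  }
  return flat;
}
```

For shape `[D0, D1, D2]` this expands to `i * D1 * D2 + j * D2 + k`, matching the formula above. Multiplying the result by the dtype's byte size gives the byte offset within the tensor.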
59
+
60
+ ---
61
+
62
+ ## Layout Diagram
63
+
64
+ For a file containing two tensors, "A" (F32, shape [2, 3]) and "B" (F32, shape [4]):
65
+
66
+ ```
67
+ Byte 0 Byte N
68
+ |<-- 8 -->|<------ header_length ------->|<--- tensor data --->|
69
+ +=========+============================+=======================+
70
+ | len=120 | {"A":{"dtype":"F32", | A: 24 bytes (6 f32) |
71
+ | (u64) | "shape":[2,3], | B: 16 bytes (4 f32) |
72
+ | | "data_offsets":[0,24]}, | |
73
+ | | "B":{"dtype":"F32", | |
74
+ | | "shape":[4], | |
75
+ | | "data_offsets":[24,40]}} | |
76
+ +=========+============================+=======================+
77
+ ```
78
+
79
+ Offsets:
80
+ - Header starts at byte 8
81
+ - Data section starts at byte 8 + header_length = 128
82
+ - Tensor A data: bytes 128 to 151 (24 bytes)
83
+ - Tensor B data: bytes 152 to 167 (16 bytes)
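The offsets listed above follow directly from `data_start = 8 + header_length`. A small sketch of the arithmetic, using a helper name (`absoluteByteRange`) that is invented for this example:

```typescript
// Hypothetical helper: turn header-relative offsets into absolute file positions.
function absoluteByteRange(
  headerLength: number,
  dataOffsets: [number, number],
): { first: number; last: number; length: number } {
  const dataStart = 8 + headerLength;
  const [begin, end] = dataOffsets; // end is exclusive in the header
  return {
    first: dataStart + begin,
    last: dataStart + end - 1, // inclusive last byte
    length: end - begin,
  };
}
```

With `headerLength = 120`, tensor A (`[0, 24]`) and tensor B (`[24, 40]`) map to the byte ranges shown in the list.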
84
+
85
+ ---
86
+
87
+ ## DType Mapping
88
+
89
+ | Safetensors DType | Bytes | Alignment | JS TypedArray | Notes |
90
+ |-------------------|-------|-----------|---------------|-------|
91
+ | `F32` | 4 | 4 | `Float32Array` | Most common for model weights |
92
+ | `F16` | 2 | 2 | `Uint16Array` | Raw bit pattern; does not rely on `Float16Array` |
93
+ | `BF16` | 2 | 2 | `Uint16Array` | Brain float 16; bitwise representation |
94
+ | `F64` | 8 | 8 | `Float64Array` | Rare in ML models |
95
+ | `I32` | 4 | 4 | `Int32Array` | |
96
+ | `U32` | 4 | 4 | `Uint32Array` | |
97
+ | `I16` | 2 | 2 | `Int16Array` | |
98
+ | `U16` | 2 | 2 | `Uint16Array` | |
99
+ | `I8` | 1 | 1 | `Int8Array` | Used in some quantization schemes |
100
+ | `U8` | 1 | 1 | `Uint8Array` | |
101
+ | `I64` | 8 | 8 | `BigInt64Array` | |
102
+ | `U64` | 8 | 8 | `BigUint64Array` | |
103
+ | `BOOL` | 1 | 1 | `Uint8Array` | |
104
+
105
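+ Note on F16 and BF16: `Float16Array` is a recent addition to JavaScript (ES2025) and is not available in every runtime, and there is no BF16 typed array at all. The parser therefore returns `Uint16Array` views containing the raw bit patterns. To use F16 data with WebGPU, the bits can be uploaded directly to a GPU buffer typed as `f16` in WGSL (this requires the optional `shader-f16` device feature). To use them on CPU, manual conversion to f32 is required: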
106
+
107
+ ```typescript
108
+ // F16 to F32 conversion (for CPU use)
109
+ function f16ToF32(bits: number): number {
110
+ const sign = (bits >> 15) & 1;
111
+ const exp = (bits >> 10) & 0x1f;
112
+ const frac = bits & 0x3ff;
113
+ if (exp === 0) return (sign ? -1 : 1) * 2 ** -14 * (frac / 1024);
114
+ if (exp === 31) return frac ? NaN : (sign ? -Infinity : Infinity);
115
+ return (sign ? -1 : 1) * 2 ** (exp - 15) * (1 + frac / 1024);
116
+ }
117
+ ```
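BF16 needs a different conversion from F16. A BF16 value is the upper 16 bits of an IEEE 754 float32, so widening it only requires placing the bits in the high half of a 32-bit word. The sketch below is an illustration under that fact; `bf16ToF32` is a hypothetical name and is not taken from the parser.

```typescript
// Scratch space for reinterpreting 32 bits as a float32.
const bf16Scratch = new DataView(new ArrayBuffer(4));

// BF16 to F32 conversion (for CPU use); hypothetical helper.
function bf16ToF32(bits: number): number {
  // Shift the 16 stored bits into the high half; the low mantissa bits are zero.
  bf16Scratch.setUint32(0, ((bits & 0xffff) << 16) >>> 0, false);
  return bf16Scratch.getFloat32(0, false);
}
```

Because the write and the read use the same explicit byte order, the result does not depend on platform endianness.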
118
+
119
+ ---
120
+
121
+ ## Zero-Copy Design
122
+
123
+ The parser's core principle is to avoid copying data whenever possible. `getTensorData()` returns a typed array **view** into the original `ArrayBuffer`:
124
+
125
+ ```typescript
126
+ export function getTensorData(
127
+ buffer: ArrayBuffer,
128
+ file: SafetensorsFile,
129
+ entry: SafetensorEntry,
130
+ ): ArrayBufferView {
131
+ const offset = file.dataStart + entry.dataOffset;
132
+ return makeTypedView(buffer, offset, entry.dataLength, entry.dtype);
133
+ }
134
+ ```
135
+
136
+ ### When Zero-Copy Works
137
+
138
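+ A typed array view requires the absolute byte offset (`dataStart + dataOffset`) to be a multiple of the element size. For example, a `Float32Array` requires 4-byte alignment. Because `dataStart` is `8 + header_length`, the header length matters as well as the tensor offsets. Writers typically pad the header so the data section starts on an aligned boundary, and most tensors are F32, so alignment is almost always satisfied: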
139
+
140
+ ```typescript
141
+ // Zero-copy path (aligned):
142
+ const src = buffer; // Original buffer
143
+ const base = offset; // Offset into original buffer
144
+ return new Float32Array(src, base, byteLength / 4);
145
+ ```
146
+
147
+ ### When Copying Is Required
148
+
149
+ If the offset is not aligned (e.g., an F32 tensor starts at a byte offset that is not a multiple of 4), the parser copies the relevant slice:
150
+
151
+ ```typescript
152
+ // Copy path (misaligned):
153
+ const src = buffer.slice(offset, offset + byteLength); // New aligned buffer
154
+ const base = 0;
155
+ return new Float32Array(src, base, byteLength / 4);
156
+ ```
157
+
158
+ In practice, misalignment is rare because safetensors writers typically align tensor data. It only occurs when small tensors of mixed dtypes are packed in unusual ways.
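Putting the two paths together, the alignment decision can be sketched as a single function. This is an illustrative reduction to the F32 case only; `float32ViewOrCopy` is a hypothetical name, and the real `makeTypedView` presumably dispatches on dtype.

```typescript
// Hypothetical helper: zero-copy when aligned, copy otherwise (F32 only).
function float32ViewOrCopy(
  buffer: ArrayBuffer,
  byteOffset: number,
  byteLength: number,
): Float32Array {
  const elementSize = Float32Array.BYTES_PER_ELEMENT; // 4
  if (byteLength % elementSize !== 0) {
    throw new Error("byteLength is not a multiple of 4");
  }
  if (byteOffset % elementSize === 0) {
    // Aligned: the view shares memory with the original buffer.
    return new Float32Array(buffer, byteOffset, byteLength / elementSize);
  }
  // Misaligned: slice() allocates a fresh buffer whose offset 0 is aligned.
  return new Float32Array(buffer.slice(byteOffset, byteOffset + byteLength));
}
```

A caller can tell which path was taken by comparing the result's `.buffer` with the original buffer.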
159
+
160
+ ---
161
+
162
+ ## Parser API
163
+
164
+ ### `parseSafetensorsHeader(buffer: ArrayBuffer): SafetensorsFile`
165
+
166
+ Parses the header from a buffer. Can be called with just the header portion (first 8 + header_length bytes) or the entire file.
167
+
168
+ Returns:
169
+ ```typescript
170
+ interface SafetensorsFile {
171
+ headerLength: number; // Length of JSON header in bytes
172
+ dataStart: number; // Byte offset where tensor data begins (8 + headerLength)
173
+ entries: SafetensorEntry[]; // All tensor entries, sorted by offset
174
+ metadata: Record<string, string> | null; // Optional __metadata__
175
+ }
176
+ ```
177
+
178
+ Entries are sorted by `dataOffset` for sequential access patterns during loading.
179
+
180
+ ### `getTensorData(buffer, file, entry): ArrayBufferView`
181
+
182
+ Returns a typed array view for a specific tensor entry. Zero-copy when alignment allows.
183
+
184
+ ### `findTensor(file, name): SafetensorEntry | undefined`
185
+
186
+ Finds a tensor entry by exact name match. Linear scan; suitable for the typical case of 100-500 tensors per file.
187
+
188
+ ### `parseSafetensorsFromResponse(response: Response): Promise<{file, fullBuffer}>`
189
+
190
+ Convenience function that reads an entire HTTP response into an `ArrayBuffer` and parses the header. Returns both the parsed header and the full buffer for subsequent tensor extraction.
191
+
192
+ ### `totalTensorBytes(file): number`
193
+
194
+ Sums the `dataLength` of all tensor entries. Useful for progress bars and memory budget estimation.
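The lookup and summation helpers are simple enough to sketch. The code below is an approximation for illustration, not the library source: `EntryLike`, `findByName`, and `sumTensorBytes` are hypothetical names, and the entry type includes only the fields this document mentions.

```typescript
// Minimal stand-in for SafetensorEntry (only the fields used here).
interface EntryLike {
  name: string;
  dataOffset: number;
  dataLength: number;
}

// Linear scan by exact name, as described for findTensor.
function findByName<T extends EntryLike>(entries: T[], name: string): T | undefined {
  return entries.find((entry) => entry.name === name);
}

// Sum of all tensor byte lengths, as described for totalTensorBytes.
function sumTensorBytes(entries: EntryLike[]): number {
  return entries.reduce((total, entry) => total + entry.dataLength, 0);
}
```

If lookups ever dominate (for example, thousands of tensors queried repeatedly), building a `Map` keyed by name once would be a straightforward alternative.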
195
+
196
+ ---
197
+
198
+ ## Streaming Support
199
+
200
+ The parser supports a two-phase loading strategy for large models:
201
+
202
+ ### Phase 1: Header Only
203
+
204
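+ Download the start of the file with an HTTP Range request; the response must cover the first 8 + header_length bytes. Since header_length is not known in advance, request a generous fixed prefix. Parse the header to discover tensor names, shapes, and offsets.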
205
+
206
+ ```typescript
207
+ // Fetch just the first 64KB (more than enough for most headers)
208
+ const headerResponse = await fetch(url, {
209
+ headers: { Range: "bytes=0-65535" }
210
+ });
211
+ const headerBuffer = await headerResponse.arrayBuffer();
212
+ const file = parseSafetensorsHeader(headerBuffer);
213
+
214
+ // Now we know all tensor names, shapes, and byte offsets
215
+ for (const entry of file.entries) {
216
+ console.log(`${entry.name}: ${entry.dtype} ${entry.shape} @ offset ${entry.dataOffset}`);
217
+ }
218
+ ```
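A fixed 64KB prefix is a guess: a checkpoint with many tensors can have a larger header. A defensive variant, sketched here with hypothetical helper names, reads the length prefix first and reports how many bytes are actually needed, so the caller can issue a second Range request when the first one fell short.

```typescript
// Hypothetical helper: bytes required to hold the length prefix plus JSON header.
function requiredHeaderBytes(prefix: ArrayBuffer): number {
  if (prefix.byteLength < 8) {
    throw new Error("need at least 8 bytes to read the header length");
  }
  const headerLength = Number(new DataView(prefix).getBigUint64(0, true));
  return 8 + headerLength;
}

// Hypothetical helper: true when the fetched prefix already contains the whole header.
function hasFullHeader(prefix: ArrayBuffer): boolean {
  return prefix.byteLength >= requiredHeaderBytes(prefix);
}
```

When `hasFullHeader` returns false, the caller would re-request `bytes=0-${requiredHeaderBytes(prefix) - 1}` before calling `parseSafetensorsHeader`.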
219
+
220
+ ### Phase 2: Selective Tensor Download
221
+
222
+ Download specific tensors by offset using Range requests:
223
+
224
+ ```typescript
225
+ const start = file.dataStart + entry.dataOffset;
226
+ const end = start + entry.dataLength - 1;
227
+ const dataResponse = await fetch(url, {
228
+ headers: { Range: `bytes=${start}-${end}` }
229
+ });
230
+ ```
231
+
232
+ This is useful when:
233
+ - You only need a subset of tensors (e.g., loading a single layer for testing)
234
+ - Memory is constrained and you want to upload tensors to GPU one at a time, freeing CPU memory between downloads
235
+ - The model uses sharding and you want to parallelize downloads of independent shards
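One practical caveat, stated as general HTTP behavior rather than anything specific to Gerbil: a server that honors the Range header answers with status 206 and only the requested bytes, while a server that ignores it answers 200 with the whole file. A loader can handle both cases; the sketch below uses a hypothetical helper name and plain `ArrayBuffer` inputs so it stays independent of `fetch`.

```typescript
// Hypothetical helper: normalize a Range response to exactly the tensor's bytes.
function extractTensorBytes(
  status: number,
  body: ArrayBuffer,
  start: number,  // absolute offset of the tensor in the file
  length: number, // tensor byte length
): ArrayBuffer {
  if (status === 206) {
    // Partial content: the body should be exactly the requested range.
    if (body.byteLength !== length) {
      throw new Error(`expected ${length} bytes, got ${body.byteLength}`);
    }
    return body;
  }
  if (status === 200) {
    // Range was ignored: the body is the full file, so cut the tensor out.
    if (body.byteLength < start + length) {
      throw new Error("full response is shorter than the requested range");
    }
    return body.slice(start, start + length);
  }
  throw new Error(`unexpected HTTP status ${status}`);
}
```

In the 200 case the whole file has already been downloaded, so the memory benefit of selective loading is lost for that request.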
236
+
237
+ ---
238
+
239
+ ## Multi-Shard Support
240
+
241
+ Large models split their weights across multiple safetensors files. The split is described in `model.safetensors.index.json`:
242
+
243
+ ```json
244
+ {
245
+ "metadata": { "total_size": 1234567890 },
246
+ "weight_map": {
247
+ "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
248
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
249
+ "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
250
+ "lm_head.weight": "model-00003-of-00003.safetensors"
251
+ }
252
+ }
253
+ ```
254
+
255
+ The model loader (`model-loader.ts`) handles this automatically:
256
+ 1. Tries to fetch `model.safetensors.index.json`
257
+ 2. If found, extracts the unique filenames from `weight_map`
258
+ 3. Downloads each shard and parses it independently
259
+ 4. Maps all tensors from all shards to canonical names
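Step 2 above amounts to de-duplicating the values of `weight_map`. A minimal sketch, assuming only the index layout shown earlier (`ShardIndex` and `shardFilenames` are hypothetical names, not the loader's own):

```typescript
// Minimal view of model.safetensors.index.json (only the field used here).
interface ShardIndex {
  weight_map: Record<string, string>;
}

// Hypothetical helper: unique shard filenames in first-seen order.
function shardFilenames(index: ShardIndex): string[] {
  // A Set preserves insertion order, so the result is deterministic.
  return Array.from(new Set(Object.values(index.weight_map)));
}
```

The resulting list can drive sequential or parallel shard downloads, depending on how much memory is available.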
260
+
261
+ ---
262
+
263
+ ## Example: Parsing a Safetensors File
264
+
265
+ ```typescript
266
+ import { parseSafetensorsHeader, getTensorData, findTensor } from "./safetensors.js";
267
+
268
+ // Assume `buffer` is an ArrayBuffer from fetch()
269
+ const file = parseSafetensorsHeader(buffer);
270
+
271
+ console.log(`Header: ${file.headerLength} bytes`);
272
+ console.log(`Data starts at: ${file.dataStart}`);
273
+ console.log(`Tensors: ${file.entries.length}`);
274
+
275
+ // Find a specific tensor
276
+ const embedding = findTensor(file, "model.embed_tokens.weight");
277
+ if (embedding) {
278
+ console.log(`Embedding: ${embedding.dtype} ${embedding.shape}`);
279
+ console.log(` Offset: ${embedding.dataOffset}, Length: ${embedding.dataLength}`);
280
+
281
+ // Get the data as a typed array
282
+ const data = getTensorData(buffer, file, embedding);
283
+ console.log(` Type: ${data.constructor.name}`);
284
+ console.log(` Elements: ${data.byteLength / 4}`);
285
+
286
+ // For F32, we can read values directly
287
+ const floats = data as Float32Array;
288
+ console.log(` First 5 values: ${Array.from(floats.slice(0, 5))}`);
289
+ }
290
+ ```
291
+
292
+ ---
293
+
294
+ ## Memory Considerations
295
+
296
+ A full model's weights must fit in memory twice during loading:
297
+ 1. The raw `ArrayBuffer` from the HTTP response
298
+ 2. The GPU buffer after upload
299
+
300
+ After GPU upload, the CPU-side `ArrayBuffer` can be released: once no typed array views into it remain reachable, the garbage collector can reclaim the backing buffer. The model loader handles this by processing one safetensors shard at a time.
301
+
302
+ For a 1.7B parameter model in F32:
303
+ ```
304
+ Weight bytes = 1.7e9 * 4 = 6.8 GB (would not fit in browser memory)
305
+ ```
306
+
307
+ This is why quantized models are preferred for browser inference. In INT4 (with group scales):
308
+ ```
309
+ Weight bytes ~ 1.7e9 * 0.5 + overhead = ~1 GB (feasible)
310
+ ```
311
+
312
+ The safetensors parser itself is lightweight -- it only parses the header (a few KB) and creates views into the existing buffer. The dominant memory cost is the buffer itself, not the parser.
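The size estimates above are simple arithmetic on parameter count and bits per weight. The sketch below reproduces them; `weightBytes` is a hypothetical helper, and the 4.125 bits-per-weight figure is an assumed example (one 16-bit scale per group of 128 weights), not a description of any particular quantization format.

```typescript
// Hypothetical helper: raw weight storage in bytes.
function weightBytes(paramCount: number, bitsPerWeight: number): number {
  return (paramCount * bitsPerWeight) / 8;
}

const f32Bytes = weightBytes(1.7e9, 32); // 6.8e9 bytes
const int4Bytes = weightBytes(1.7e9, 4); // 8.5e8 bytes, scales not included
// Assumption for illustration: one 16-bit scale per 128 weights adds 16 / 128 bits.
const int4WithScales = weightBytes(1.7e9, 4 + 16 / 128);
console.log({ f32Bytes, int4Bytes, int4WithScales });
```

Under that assumed group size the scales add roughly 3% on top of the packed weights, which is consistent with the rough "~1 GB" figure once other overhead is included.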
@@ -0,0 +1,302 @@
1
+ # BPE Tokenizer Deep Dive
2
+
3
+ The tokenizer (`src/gpu/tokenizer.ts`) is a pure JavaScript Byte Pair Encoding (BPE) implementation that reads HuggingFace `tokenizer.json` files. No WASM, no native dependencies, no external libraries.
4
+
5
+ ---
6
+
7
+ ## Overview
8
+
9
+ The `Tokenizer` class provides:
10
+ - `encode(text)` -- Convert text to token IDs
11
+ - `decode(ids)` -- Convert token IDs back to text
12
+ - `applyChatTemplate(messages)` -- Format chat messages (ChatML)
13
+ - `encodeChat(messages)` -- Template + encode in one call
14
+ - `Tokenizer.fromJSON(tokenizerJSON, configJSON)` -- Factory from HF files
15
+
16
+ ---
17
+
18
+ ## Internal Data Structures
19
+
20
+ The tokenizer maintains five core maps built from `tokenizer.json`:
21
+
22
+ | Map | Type | Source | Purpose |
23
+ |-----|------|--------|---------|
24
+ | `vocab` | `string -> number` | `model.vocab` | Token string to ID |
25
+ | `vocabReverse` | `number -> string` | (inverse of vocab) | ID to token string |
26
+ | `merges` | `string -> number` | `model.merges` | Merge pair to priority (lower = higher priority) |
27
+ | `specialTokens` | `string -> number` | `added_tokens` (where `special: true`) | Special token detection |
28
+ | `byteFallback` | `number -> string` | Derived from vocab | Byte value to `<0xHH>` token string |
29
+
30
+ ---
31
+
32
+ ## Encoding Pipeline
33
+
34
+ The encoding pipeline transforms text into a sequence of token IDs through five stages:
35
+
36
+ ### Worked Example: `"Hello world"`
37
+
38
+ #### Stage 1: Special Token Splitting
39
+
40
+ The text is split around any special tokens (like `<|im_start|>`, `<|endoftext|>`). Regular text segments are marked `special: false`.
41
+
42
+ ```
43
+ Input: "Hello world"
44
+ Output: [{ text: "Hello world", special: false }]
45
+ ```
46
+
47
+ If the text contained special tokens:
48
+ ```
49
+ Input: "<|im_start|>user\nHello<|im_end|>"
50
+ Output: [
51
+ { text: "<|im_start|>", special: true },
52
+ { text: "user\nHello", special: false },
53
+ { text: "<|im_end|>", special: true },
54
+ ]
55
+ ```
56
+
57
+ Special tokens are matched using a regex built from the sorted (longest-first) special token list, ensuring greedy matching.
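A sketch of this stage, under the assumption that the regex is a plain alternation of escaped token strings. `TextSegment`, `escapeRegex`, and `splitOnSpecialTokens` are hypothetical names chosen for the example.

```typescript
interface TextSegment {
  text: string;
  special: boolean;
}

// Escape characters that have meaning inside a regular expression.
function escapeRegex(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: split text around special tokens, longest token first.
function splitOnSpecialTokens(text: string, specialTokens: string[]): TextSegment[] {
  const tokens = specialTokens.filter((t) => t.length > 0);
  if (tokens.length === 0) {
    return text.length > 0 ? [{ text, special: false }] : [];
  }
  // Longest-first ordering makes the alternation prefer the longest match.
  const sorted = tokens.slice().sort((a, b) => b.length - a.length);
  const pattern = new RegExp(sorted.map(escapeRegex).join("|"), "g");

  const segments: TextSegment[] = [];
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    if (match.index > last) {
      segments.push({ text: text.slice(last, match.index), special: false });
    }
    segments.push({ text: match[0], special: true });
    last = match.index + match[0].length;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), special: false });
  }
  return segments;
}
```

Sorting matters when one special token is a prefix of another: JavaScript alternation tries branches left to right, so the longer token must come first.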
58
+
59
+ #### Stage 2: Pre-tokenization
60
+
61
+ Non-special text is split into chunks using a GPT-style regex:
62
+
63
+ ```
64
+ /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+/gu
65
+ ```
66
+
67
+ This splits on:
68
+ - Contractions: `'s`, `'t`, `'re`, `'ve`, `'m`, `'ll`, `'d`
69
+ - Words (with optional leading space): ` Hello`, ` world`
70
+ - Numbers (with optional leading space): ` 42`
71
+ - Punctuation (with optional leading space): ` !`
72
+ - Whitespace runs
73
+
74
+ ```
75
+ Input: "Hello world"
76
+ Output: ["Hello", " world"]
77
+ ```
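The pattern can be exercised directly with `String.prototype.match`. The wrapper below is a minimal sketch (the name `preTokenize` is an assumption), using the regex exactly as given above:

```typescript
// GPT-style pre-tokenization pattern, copied from the text above.
const PRETOKENIZE_PATTERN =
  /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+/gu;

// Hypothetical helper: split non-special text into pre-token chunks.
function preTokenize(text: string): string[] {
  // With the g flag, match() returns every match, or null when there is none.
  return text.match(PRETOKENIZE_PATTERN) ?? [];
}

console.log(preTokenize("Hello world")); // ["Hello", " world"]
console.log(preTokenize("I'm 42!"));     // ["I", "'m", " 42", "!"]
```

Note how the leading space stays attached to the following word or number; this is what later becomes the `\u0120` prefix.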
78
+
79
+ #### Stage 3: Byte-Level Encoding
80
+
81
+ Each chunk is converted to the byte-level representation used in HF vocabularies: every byte is mapped to a printable Unicode character. The most visible case is the space, which becomes `\u0120`:
82
+
83
+ | Character | Code Point | Representation |
84
+ |-----------|-----------|----------------|
85
+ | Space (` `) | 32 | `\u0120` (latin capital G with dot above) |
86
+ | Newline (`\n`) | 10 | `\u010A` (offset by 256) |
87
+ | Tab (`\t`) | 9 | `\u0109` (offset by 256) |
88
+ | Regular printable | 33-126 | Unchanged |
89
+
90
+ ```
91
+ Input: ["Hello", " world"]
92
+ Output: ["Hello", "\u0120world"] (the space becomes the special character)
93
+ ```
94
+
95
+ This is the convention used by GPT-2 and all derivative tokenizers. The `\u0120` character serves as an in-band marker for "this token starts with a space."
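For reference, the GPT-2 convention can be written as a 256-entry table. The sketch below follows the published GPT-2 `bytes_to_unicode` scheme; whether `tokenizer.ts` builds its table in exactly this way is an assumption, and the function names are illustrative.

```typescript
// Build the GPT-2 byte-to-character table (index = byte value).
function buildByteToUnicode(): string[] {
  // Bytes that already map to a visible character keep their code point.
  const keepsCodePoint = (b: number): boolean =>
    (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174 && b <= 255);

  const table: string[] = [];
  let next = 0; // counter for remapped bytes
  for (let b = 0; b < 256; b++) {
    table.push(
      keepsCodePoint(b) ? String.fromCharCode(b) : String.fromCharCode(256 + next++),
    );
  }
  return table;
}

// Hypothetical helper: byte-level encode one pre-token chunk.
function byteLevelEncode(chunk: string, table: string[]): string {
  const bytes = new TextEncoder().encode(chunk); // UTF-8 bytes
  return Array.from(bytes, (b) => table[b] ?? "").join("");
}
```

Bytes 0 through 32 are the first remapped values, so each lands at its own value plus 256, which is why space, newline, and tab appear as `\u0120`, `\u010A`, and `\u0109`.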
96
+
97
+ #### Stage 4: BPE Merge
98
+
99
+ Each byte-encoded chunk undergoes iterative pair merging:
100
+
101
+ 1. Start with individual characters: `["\u0120", "w", "o", "r", "l", "d"]`
102
+ 2. Find the pair with the lowest merge rank in the merge table
103
+ 3. Merge that pair into one token
104
+ 4. Repeat until no more merges are possible
105
+
106
+ ```
107
+ "\u0120world" merge trace (hypothetical ranks):
108
+ Step 0: ["\u0120", "w", "o", "r", "l", "d"]
109
+ Step 1: ["\u0120w", "o", "r", "l", "d"] (merge "\u0120" + "w", rank 42)
110
+ Step 2: ["\u0120w", "or", "l", "d"] (merge "o" + "r", rank 87)
111
+ Step 3: ["\u0120w", "orl", "d"] (merge "or" + "l", rank 203)
112
+ Step 4: ["\u0120world"] (merge "orl" + "d", then "\u0120w" + "orld")
113
+ ```
114
+
115
+ (Actual merge orders depend on the specific tokenizer's merge table.)
116
+
117
+ The algorithm always selects the pair with the **lowest rank** (highest priority). This greedy strategy is deterministic and reproduces the segmentation the merge table was trained to produce; it is not guaranteed to yield the fewest possible tokens.
118
+
119
+ If a complete chunk is already in the vocabulary as a single token, the BPE step is skipped and the chunk maps directly to its token ID.
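The merge loop described in the steps above can be sketched compactly. This is a simplified illustration rather than the tokenizer's actual code: `bpeMerge` is a hypothetical name, the rank map is keyed by `"left right"` strings in the same form as the `merges` list, and one pair is merged per iteration (leftmost occurrence first).

```typescript
// Simplified BPE merge: repeatedly merge the adjacent pair with the lowest rank.
function bpeMerge(symbols: string[], ranks: Map<string, number>): string[] {
  const parts = symbols.slice();
  while (parts.length > 1) {
    let bestRank = Infinity;
    let bestIndex = -1;
    // Scan all adjacent pairs for the lowest-ranked merge.
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(`${parts[i]} ${parts[i + 1]}`);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex < 0) break; // no mergeable pair left
    const merged = `${parts[bestIndex]}${parts[bestIndex + 1]}`;
    parts.splice(bestIndex, 2, merged);
  }
  return parts;
}
```

Each iteration scans the whole chunk and removes one symbol, which is where the quadratic worst case comes from. The shortcut for chunks already present in the vocabulary would sit in front of this loop.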
120
+
121
+ #### Stage 5: Byte Fallback
122
+
123
+ If any character or merged symbol isn't in the vocabulary, it's encoded as a sequence of raw byte tokens using the `<0xHH>` format:
124
+
125
+ ```
126
+ Unknown character "ñ" (UTF-8 bytes: 0xC3, 0xB1):
127
+ -> ["<0xC3>", "<0xB1>"]
128
+ -> [token_id_for_0xC3, token_id_for_0xB1]
129
+ ```
130
+
131
+ This ensures every possible input can be encoded, even if it contains characters not in the training data.
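Producing the `<0xHH>` token strings is a matter of UTF-8 encoding and hex formatting. A minimal sketch, with `byteFallbackTokens` as a hypothetical helper name:

```typescript
// Hypothetical helper: encode a symbol as <0xHH> byte-fallback token strings.
function byteFallbackTokens(symbol: string): string[] {
  const bytes = new TextEncoder().encode(symbol); // UTF-8 bytes
  return Array.from(
    bytes,
    (b) => `<0x${b.toString(16).toUpperCase().padStart(2, "0")}>`,
  );
}
```

Each resulting string would then be looked up in the vocabulary (the `byteFallback` map in the table of data structures) to obtain its token ID.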
132
+
133
+ ### Complete Example Result
134
+
135
+ ```
136
+ "Hello world"
137
+ -> pre-tokenize: ["Hello", " world"]
138
+ -> byte encode: ["Hello", "\u0120world"]
139
+ -> BPE merge: ["Hello"] -> [token_id_15496]
140
+ ["\u0120world"] -> [token_id_1917]
141
+ -> final IDs: [15496, 1917]
142
+ ```
143
+
144
+ (Token IDs are model-specific; these are illustrative.)
145
+
146
+ ---
147
+
148
+ ## Decoding Pipeline
149
+
150
+ Decoding reverses the encoding:
151
+
152
+ 1. Map each token ID to its string representation via `vocabReverse`
153
+ 2. Optionally skip special tokens (BOS, EOS, etc.)
154
+ 3. Join all token strings
155
+ 4. Convert `\u0120` back to a space
156
+ 5. Convert `<0xHH>` tokens back to raw bytes and decode them as UTF-8
157
+
158
+ ```typescript
159
+ decode(ids: number[], skipSpecialTokens: boolean = true): string
160
+ ```
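The last three decoding steps can be sketched on token strings alone (after the ID lookup and special-token filtering). The helper name `decodeTokenStrings` is an assumption, and the sketch only reverses the space mapping and the byte-fallback tokens, as the list above describes.

```typescript
// Hypothetical helper: join token strings and undo the two encodings.
function decodeTokenStrings(tokens: string[]): string {
  const utf8 = new TextDecoder();
  // Join, then turn the in-band space marker back into a real space.
  const joined = tokens.join("").replace(/\u0120/g, " ");
  // Decode each run of consecutive <0xHH> tokens as one UTF-8 byte sequence,
  // so multi-byte characters split across tokens are reassembled correctly.
  return joined.replace(/(?:<0x[0-9A-Fa-f]{2}>)+/g, (run) => {
    const hexPairs = run.match(/[0-9A-Fa-f]{2}(?=>)/g) ?? [];
    const bytes = hexPairs.map((hex) => parseInt(hex, 16));
    return utf8.decode(new Uint8Array(bytes));
  });
}
```

Decoding byte runs together rather than one token at a time is the important design point: a single `<0xC3>` is not valid UTF-8 on its own.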
161
+
162
+ ---
163
+
164
+ ## Chat Template Support
165
+
166
+ The tokenizer implements ChatML format for chat-style conversation encoding:
167
+
168
+ ```
169
+ <|im_start|>system
170
+ You are a helpful assistant.<|im_end|>
171
+ <|im_start|>user
172
+ What is 2+2?<|im_end|>
173
+ <|im_start|>assistant
174
+ ```
175
+
176
+ ### API
177
+
178
+ ```typescript
179
+ const tokenizer = Tokenizer.fromJSON(tokenizerJSON, configJSON);
180
+
181
+ const text = tokenizer.applyChatTemplate([
182
+ { role: "system", content: "You are a helpful assistant." },
183
+ { role: "user", content: "What is 2+2?" },
184
+ ], { addGenerationPrompt: true });
185
+
186
+ // Returns the formatted string:
187
+ // "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
188
+ // <|im_start|>user\nWhat is 2+2?<|im_end|>\n
189
+ // <|im_start|>assistant\n"
190
+
191
+ const ids = tokenizer.encodeChat([
192
+ { role: "system", content: "You are a helpful assistant." },
193
+ { role: "user", content: "What is 2+2?" },
194
+ ]);
195
+ // Returns the token IDs directly
196
+ ```
197
+
198
+ ### Message Types
199
+
200
+ ```typescript
201
+ interface ChatMessage {
202
+ role: "system" | "user" | "assistant";
203
+ content: string;
204
+ }
205
+ ```
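Given that message shape, the ChatML layout shown earlier can be produced by straightforward string concatenation. The function below is a sketch of the format only (the names `ChatTurn` and `formatChatML` are hypothetical); it makes no claim about how `applyChatTemplate` is implemented or what its defaults are.

```typescript
// Local copy of the message shape, to keep the example self-contained.
interface ChatTurn {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: render messages in ChatML.
function formatChatML(messages: ChatTurn[], addGenerationPrompt: boolean): string {
  let out = "";
  for (const message of messages) {
    out += `<|im_start|>${message.role}\n${message.content}<|im_end|>\n`;
  }
  // The open assistant header cues the model to generate the next turn.
  if (addGenerationPrompt) {
    out += "<|im_start|>assistant\n";
  }
  return out;
}
```

Because `<|im_start|>` and `<|im_end|>` are special tokens, the rendered string is split around them in Stage 1 of encoding and each maps to a single token ID.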
206
+
207
+ ### Limitations
208
+
209
+ The current implementation hardcodes the ChatML format. HuggingFace tokenizers store a Jinja2 template in `tokenizer_config.json` under the `chat_template` field. A future improvement would parse and evaluate this template to support non-ChatML formats (e.g., Llama's `[INST]...[/INST]` format, Phi's `<|user|>...<|end|>` format).
210
+
211
+ ---
212
+
213
+ ## Configuration
214
+
215
+ The tokenizer reads configuration from `tokenizer_config.json`:
216
+
217
+ ```typescript
218
+ interface TokenizerConfig {
219
+ bosToken: string | null; // e.g. "<|endoftext|>"
220
+ eosToken: string | null; // e.g. "<|im_end|>" or "<|endoftext|>"
221
+ bosTokenId: number | null; // Resolved from vocab
222
+ eosTokenId: number | null; // Resolved from vocab
223
+ chatTemplate: string | null; // Jinja2 template (stored but not yet parsed)
224
+ addBosToken: boolean; // Whether to prepend BOS to encoded text
225
+ addEosToken: boolean; // Whether to append EOS to encoded text
226
+ }
227
+ ```
228
+
229
+ When `addBosToken` is true, `encode()` prepends the BOS token ID. When `addEosToken` is true, it appends the EOS token ID. These settings come from the model's tokenizer config.
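The BOS/EOS behavior can be pictured as a thin wrapper around the encoded IDs. This is an illustrative sketch with hypothetical names (`BosEosSettings`, `wrapWithBosEos`); in particular, skipping a token whose ID is `null` is an assumption about how unresolved tokens would be handled.

```typescript
// Subset of the config fields relevant to BOS/EOS handling.
interface BosEosSettings {
  addBosToken: boolean;
  addEosToken: boolean;
  bosTokenId: number | null;
  eosTokenId: number | null;
}

// Hypothetical helper: apply BOS/EOS settings to a list of token IDs.
function wrapWithBosEos(ids: number[], settings: BosEosSettings): number[] {
  const out = ids.slice();
  if (settings.addBosToken && settings.bosTokenId !== null) {
    out.unshift(settings.bosTokenId); // prepend BOS
  }
  if (settings.addEosToken && settings.eosTokenId !== null) {
    out.push(settings.eosTokenId); // append EOS
  }
  return out;
}
```

The input array is copied, so callers can reuse the unwrapped IDs.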
230
+
231
+ ---
232
+
233
+ ## HuggingFace tokenizer.json Format
234
+
235
+ The tokenizer reads the standard HF `tokenizer.json` format:
236
+
237
+ ```json
238
+ {
239
+ "model": {
240
+ "type": "BPE",
241
+ "vocab": {
242
+ "Hello": 15496,
243
+ "\u0120world": 1917,
244
+ "<0x0A>": 198,
245
+ ...
246
+ },
247
+ "merges": [
248
+ "\u0120 t",
249
+ "i n",
250
+ "e r",
251
+ ...
252
+ ]
253
+ },
254
+ "added_tokens": [
255
+ { "id": 151643, "content": "<|endoftext|>", "special": true },
256
+ { "id": 151644, "content": "<|im_start|>", "special": true },
257
+ { "id": 151645, "content": "<|im_end|>", "special": true }
258
+ ]
259
+ }
260
+ ```
261
+
262
+ Key fields:
263
+ - `model.type`: Must be `"BPE"` (the only supported type)
264
+ - `model.vocab`: Complete vocabulary mapping token strings to IDs
265
+ - `model.merges`: Ordered list of merge pairs (index = priority)
266
+ - `added_tokens`: Special tokens with their IDs and flags
267
+
268
+ ---
269
+
270
+ ## Performance Characteristics
271
+
272
+ The BPE algorithm has quadratic worst-case complexity in the length of a single chunk (O(n^2) where n is the number of characters). In practice, this is fast because:
273
+
274
+ 1. Pre-tokenization breaks text into small chunks (typically words)
275
+ 2. Most words are 5-15 characters, so the inner merge loop is small
276
+ 3. Common words are in the vocabulary directly, skipping BPE entirely
277
+
278
+ For a typical prompt of 200 words:
279
+ - Pre-tokenization: ~200 chunks
280
+ - BPE per chunk: ~10-15 merge iterations
281
+ - Total: ~3000 operations, well under 1ms
282
+
283
+ The dominant cost for long prompts is the regex pre-tokenization pass, which is a single linear scan using the built-in regex engine.
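The quadratic bound comes from the shape of the merge loop: each pass scans every adjacent pair (linear), and there can be up to n - 1 passes. A naive sketch of that loop, assuming a rank map keyed by `"left right"`; `bpeMerge` is a hypothetical name and the engine's actual loop may be organized differently:

```typescript
// Naive BPE merge over one pre-tokenized chunk.
// Each pass is O(n) and at most n - 1 merges can happen, hence O(n^2) worst case.
function bpeMerge(symbols: string[], ranks: Map<string, number>): string[] {
  const parts = [...symbols];
  while (parts.length > 1) {
    let bestRank = Infinity;
    let bestIdx = -1;
    // Find the adjacent pair with the lowest rank (highest priority).
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(`${parts[i]} ${parts[i + 1]}`);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIdx = i;
      }
    }
    if (bestIdx === -1) break; // no mergeable pair left
    // Replace the pair with its concatenation.
    parts.splice(bestIdx, 2, `${parts[bestIdx]}${parts[bestIdx + 1]}`);
  }
  return parts;
}

const demoRanks = new Map<string, number>([
  ["i n", 0],
  ["t h", 1],
  ["th in", 2],
]);
console.log(bpeMerge(["t", "h", "i", "n"], demoRanks)); // ["thin"]
```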
284
+
285
+ ---
286
+
287
+ ## Vocabulary Size
288
+
289
+ Common vocabulary sizes for models the engine targets:
290
+
291
+ | Model Family | Vocab Size | Notable |
292
+ |-------------|-----------|---------|
293
+ | Qwen2/3 | 151,936 | Large vocab with extensive CJK coverage |
294
+ | LLaMA 2 | 32,000 | |
295
+ | LLaMA 3 | 128,256 | Significantly expanded |
296
+ | Phi-3 | 32,064 | |
297
+ | SmolLM2 | 49,152 | |
298
+
299
+ The vocab size directly affects:
300
+ - LM head matmul cost (hidden_size * vocab_size)
301
+ - Logit readback size (vocab_size * 4 bytes)
302
+ - Sampler sorting cost (O(vocab_size * log(vocab_size)) for top-k)
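As a rough worked example of the first two costs, the arithmetic can be written out directly. `vocabCosts` is a hypothetical helper, the hidden size of 1024 is an illustrative value rather than a figure from this document, and logits are assumed to be read back as 32-bit floats (per the 4 bytes above).

```typescript
// Per-token costs that scale with vocabulary size.
function vocabCosts(vocabSize: number, hiddenSize: number) {
  return {
    // One multiply-accumulate per weight in the LM head projection.
    lmHeadMacs: hiddenSize * vocabSize,
    // One float32 logit per vocabulary entry.
    logitBytes: vocabSize * 4,
  };
}

const large = vocabCosts(151936, 1024); // Qwen-sized vocab, hypothetical hidden size
const small = vocabCosts(32000, 1024); // LLaMA 2-sized vocab, same hidden size
console.log(large.logitBytes); // 607744
console.log(small.logitBytes); // 128000
console.log(large.lmHeadMacs / small.lmHeadMacs); // 4.748
```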
@@ -0,0 +1,91 @@
1
+ # Memory / RAG (on-device agent memory)
2
+
3
+ > Note: this is the **RAG / persistent-memory** module (`@tryhamster/gerbil/memory`).
4
+ > For GPU/KV-cache memory management see [memory.md](memory.md).
5
+
6
+ Gerbil's memory module is an on-device, persistent memory layer that turns
7
+ Gerbil into an agent harness: store text + embeddings, retrieve semantically,
8
+ and rebuild a token-budgeted context block every turn.
9
+
10
+ It is engine-agnostic — bring any embedder and any storage backend — but wires
11
+ straight into Gerbil's native embeddings by default.
12
+
13
+ ## Quick start
14
+
15
+ ```ts
16
+ import { Gerbil } from "@tryhamster/gerbil";
17
+ import { createMemory, createGerbilEmbedder } from "@tryhamster/gerbil/memory";
18
+
19
+ const g = new Gerbil();
20
+ await g.loadModel("embeddinggemma-300m");
21
+
22
+ const mem = createMemory({ embed: createGerbilEmbedder(g) });
23
+
24
+ await mem.add("Paris is the capital of France", { metadata: { topic: "geo" } });
25
+ const hits = await mem.search("French capital", { k: 3 });
26
+ const { context } = await mem.recall("French capital", { tokenBudget: 512 });
27
+ ```
28
+
29
+ ## Public API
30
+
31
+ `createMemory({ embed, store?, redact?, chunk? }) → Memory`
32
+
33
+ | Method | Description |
34
+ | --- | --- |
35
+ | `add(text, { metadata?, id?, chunk? })` | Redact → (optional) chunk → embed → normalize → store. Returns created ids. |
36
+ | `search(query, { k?, filter?, minScore? })` | Cosine top-k. Returns `{ record, score }[]`. |
37
+ | `recall(query, { tokenBudget?, k?, filter?, minScore?, separator? })` | Retrieve + greedily pack into a token-budgeted context block. |
38
+ | `get(id)` / `delete(id)` / `list(filter?)` / `clear()` / `size()` | CRUD over records. |
39
+ | `export()` / `import(snapshot)` | JSON snapshot round-trip. |
40
+ | `backend` | The underlying `MemoryStore` (for advanced use). |
41
+
42
+ ## Backends (pluggable `MemoryStore`)
43
+
44
+ | Factory | Runtime | Durability |
45
+ | --- | --- | --- |
46
+ | `createInMemoryStore()` (default) | Node + browser | none (process lifetime) |
47
+ | `createIndexedDBStore({ dbName?, storeName?, indexedDB? })` | browser | durable across sessions |
48
+ | `createFileStore(path)` | Node | durable JSON on disk |
49
+
50
+ All backends store **pre-normalized** embeddings and perform a brute-force
51
+ cosine top-k scan, which is adequate at the scale of thousands of records. Inject an
52
+ `indexedDB` factory (e.g. `fake-indexeddb`) to exercise the IndexedDB backend
53
+ under Node.
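Because stored embeddings are pre-normalized, cosine similarity reduces to a dot product against a normalized query. A minimal sketch of such a brute-force scan follows; `StoredRecord`, `normalize`, and `topK` are hypothetical names, not the module's internals.

```typescript
// Hypothetical record shape: only the fields needed for scoring.
interface StoredRecord {
  id: string;
  embedding: Float32Array; // assumed to be unit length already
}

function normalize(v: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq) || 1; // avoid dividing by zero for an all-zero vector
  return v.map((x) => x / norm);
}

// O(n) scan: score every record, sort by score, keep the best k.
function topK(query: Float32Array, records: StoredRecord[], k: number) {
  const q = normalize(query);
  return records
    .map((record) => {
      let score = 0;
      for (let i = 0; i < q.length; i++) score += q[i] * record.embedding[i];
      return { record, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

const demoRecords: StoredRecord[] = [
  { id: "a", embedding: normalize(new Float32Array([1, 0])) },
  { id: "b", embedding: normalize(new Float32Array([0, 1])) },
  { id: "c", embedding: normalize(new Float32Array([1, 1])) },
];
console.log(topK(new Float32Array([2, 0]), demoRecords, 2).map((h) => h.record.id)); // ["a", "c"]
```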
54
+
55
+ ## Embedder injection
56
+
57
+ The module only needs `(texts: string[]) => Promise<Float32Array[]>`.
58
+ `createGerbilEmbedder(engine)` adapts any object with a compatible
59
+ `embedBatch` (a `Gerbil` instance, the one-liner `embedBatch`, or the browser
60
+ `useEmbedding().embedBatch`). Any other embedder works by passing the function
61
+ directly.
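For tests or offline experiments, any function with that signature can stand in for a real model. The sketch below is a toy hashed bag-of-words embedder: it is not semantically meaningful, and `Embedder`, `embedOne`, and `toyEmbedder` are hypothetical names rather than part of the package.

```typescript
// The only contract the memory module is documented to require.
type Embedder = (texts: string[]) => Promise<Float32Array[]>;

const DIMS = 64;

// Hash each word into one of DIMS buckets and count occurrences.
function embedOne(text: string): Float32Array {
  const vec = new Float32Array(DIMS);
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  for (const word of words) {
    let h = 0;
    for (let i = 0; i < word.length; i++) {
      h = (h * 31 + word.charCodeAt(i)) >>> 0; // keep the hash in uint32 range
    }
    const bucket = h % DIMS;
    vec[bucket] = (vec[bucket] ?? 0) + 1;
  }
  return vec;
}

const toyEmbedder: Embedder = async (texts) => texts.map(embedOne);
```

Per the signature above, `toyEmbedder` could then be passed as the `embed` option of `createMemory`.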
62
+
63
+ ## Chunking
64
+
65
+ `add(text, { chunk: true })` or `add(text, { chunk: { chunkSize, overlap } })`
66
+ splits long documents into overlapping character windows (defaults: 1000 chars,
67
+ 200 overlap), one record per chunk, so retrieval targets relevant passages.
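A minimal sketch of overlapping character windows with the documented defaults. `chunkText` is a hypothetical helper; how the module treats the final partial window or invalid options is an assumption here.

```typescript
// Split text into windows of chunkSize characters, each sharing `overlap`
// characters with the previous window.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end, so no tail-only duplicate is emitted.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 2)); // ["abcd", "cdef", "efgh", "ghij"]
console.log(chunkText("x".repeat(2500)).length); // 3
```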
68
+
69
+ ## Context packing (`recall`)
70
+
71
+ `recall` retrieves a candidate pool (default `k: 20`), then greedily fills a
72
+ context block highest-score-first without exceeding `tokenBudget`
73
+ (it skips a too-large candidate and tries smaller ones rather than stopping
74
+ outright). Token counts are **approximate**: the heuristic is ~4 characters per
75
+ token (a common rule of thumb for English text), deliberately avoiding a tokenizer
76
+ dependency. The goal is to stay under a model's context window, not exact
77
+ accounting.
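The packing strategy described above can be sketched as a greedy loop over score-sorted candidates. `Candidate`, `approxTokens`, and `packContext` are hypothetical names. Charging the separator against the budget and rounding up to whole tokens are assumptions, not documented behavior.

```typescript
interface Candidate {
  text: string;
  score: number;
}

// ~4 characters per token, rounded up (the rounding is an assumption).
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

function packContext(candidates: Candidate[], tokenBudget: number, separator = "\n\n"): string {
  const picked: string[] = [];
  let used = 0;
  const byScore = [...candidates].sort((a, b) => b.score - a.score);
  for (const c of byScore) {
    // Every block after the first also pays for one separator.
    const cost = approxTokens(c.text) + (picked.length > 0 ? approxTokens(separator) : 0);
    if (used + cost > tokenBudget) continue; // skip this one, keep trying smaller candidates
    picked.push(c.text);
    used += cost;
  }
  return picked.join(separator);
}

const context = packContext(
  [
    { text: "a".repeat(40), score: 0.9 }, // ~10 tokens, fits
    { text: "b".repeat(400), score: 0.8 }, // ~100 tokens, skipped
    { text: "c".repeat(20), score: 0.7 }, // ~5 tokens, still fits
  ],
  20,
);
console.log(context.length); // 62
```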
78
+
79
+ ## Privacy
80
+
81
+ - `redact` is applied on **write**: a `RegExp` (matches → `[REDACTED]`) or a
82
+ `(text) => string` transform function.
83
+ - `export()` / `import()` move the full corpus as JSON.
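A minimal sketch of write-time redaction covering both documented forms of `redact`. `Redactor` and `applyRedaction` are hypothetical names, and replacing every match (rather than only the first) is an assumption about the module's behavior.

```typescript
type Redactor = RegExp | ((text: string) => string);

function applyRedaction(text: string, redact?: Redactor): string {
  if (!redact) return text;
  if (typeof redact === "function") return redact(text);
  // Assumption: all matches are redacted, so the global flag is forced on.
  const flags = redact.flags.includes("g") ? redact.flags : redact.flags + "g";
  return text.replace(new RegExp(redact.source, flags), "[REDACTED]");
}

console.log(applyRedaction("mail a@b.co and c@d.io", /\S+@\S+/));
// "mail [REDACTED] and [REDACTED]"
console.log(applyRedaction("secret: 1234", (t) => t.replace(/\d/g, "#")));
// "secret: ####"
```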
84
+
85
+ ## Follow-ups
86
+
87
+ - **HNSW/ANN index** for >10k records (current scan is O(n) per query).
88
+ - **Node OPFS / SQLite backend** for larger durable corpora than the JSON
89
+ file store comfortably holds.
90
+ - **Real tokenizer** option for exact budgeting (currently a char heuristic).
91
+ - **TTL / decay & dedup** policies for long-running agents.