npm - @tryhamster/gerbil - Versions diffs - 1.0.0-rc.9 → 1.0.0 - Mend

@tryhamster/gerbil 1.0.0-rc.9 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (179)

package/LICENSE +1 -1
package/README.md +247 -84
package/dist/architectures-C1I5V3Dt.mjs +6070 -0
package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
package/dist/browser/index.d.ts +264 -588
package/dist/browser/index.d.ts.map +1 -1
package/dist/browser/index.js +585 -2334
package/dist/browser/index.js.map +1 -1
package/dist/cli.mjs +625 -1098
package/dist/cli.mjs.map +1 -1
package/dist/defaults-9komdrbY.mjs +24 -0
package/dist/defaults-9komdrbY.mjs.map +1 -0
package/dist/frameworks/express.d.mts +1 -3
package/dist/frameworks/express.d.mts.map +1 -1
package/dist/frameworks/express.mjs +7 -7
package/dist/frameworks/express.mjs.map +1 -1
package/dist/frameworks/fastify.d.mts +1 -1
package/dist/frameworks/fastify.d.mts.map +1 -1
package/dist/frameworks/fastify.mjs +3 -3
package/dist/frameworks/fastify.mjs.map +1 -1
package/dist/frameworks/hono.d.mts +1 -1
package/dist/frameworks/hono.d.mts.map +1 -1
package/dist/frameworks/hono.mjs +4 -4
package/dist/frameworks/hono.mjs.map +1 -1
package/dist/frameworks/next.d.mts +3 -2
package/dist/frameworks/next.d.mts.map +1 -1
package/dist/frameworks/next.mjs +4 -4
package/dist/frameworks/next.mjs.map +1 -1
package/dist/frameworks/react.d.mts +1 -1
package/dist/frameworks/trpc.d.mts +1 -1
package/dist/frameworks/trpc.d.mts.map +1 -1
package/dist/frameworks/trpc.mjs +4 -4
package/dist/frameworks/trpc.mjs.map +1 -1
package/dist/gerbil-BHrJJIa4.mjs +1656 -0
package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
package/dist/gerbil-BT9fCydo.d.mts +488 -0
package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
package/dist/gerbil-DomNfIr1.mjs +4 -0
package/dist/gpu/hooks.d.mts +520 -0
package/dist/gpu/hooks.d.mts.map +1 -0
package/dist/gpu/hooks.mjs +1188 -0
package/dist/gpu/hooks.mjs.map +1 -0
package/dist/gpu/index.d.mts +2 -0
package/dist/gpu/index.mjs +6 -0
package/dist/gpu-33qCAtHW.mjs +3615 -0
package/dist/gpu-33qCAtHW.mjs.map +1 -0
package/dist/index-Dgmb2kE3.d.mts +245 -0
package/dist/index-Dgmb2kE3.d.mts.map +1 -0
package/dist/index-jEAL2s-A.d.mts +2022 -0
package/dist/index-jEAL2s-A.d.mts.map +1 -0
package/dist/index.d.mts +22 -487
package/dist/index.d.mts.map +1 -1
package/dist/index.mjs +13 -8
package/dist/index.mjs.map +1 -1
package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
package/dist/integrations/ai-sdk.d.mts +75 -6
package/dist/integrations/ai-sdk.d.mts.map +1 -1
package/dist/integrations/ai-sdk.mjs +131 -15
package/dist/integrations/ai-sdk.mjs.map +1 -1
package/dist/integrations/langchain.d.mts +1 -1
package/dist/integrations/langchain.d.mts.map +1 -1
package/dist/integrations/langchain.mjs +5 -5
package/dist/integrations/langchain.mjs.map +1 -1
package/dist/integrations/llamaindex.d.mts +1 -1
package/dist/integrations/llamaindex.d.mts.map +1 -1
package/dist/integrations/llamaindex.mjs +5 -5
package/dist/integrations/llamaindex.mjs.map +1 -1
package/dist/integrations/mcp-client.mjs +3 -3
package/dist/integrations/mcp-client.mjs.map +1 -1
package/dist/integrations/mcp.d.mts +3 -2
package/dist/integrations/mcp.d.mts.map +1 -1
package/dist/integrations/mcp.mjs +5 -5
package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
package/dist/mcp-1DaMsaBc.mjs.map +1 -0
package/dist/memory/index.d.mts +3 -0
package/dist/memory/index.mjs +6 -0
package/dist/memory-D1P7Tmda.mjs +4 -0
package/dist/memory-DVN0MnIG.mjs +132 -0
package/dist/memory-DVN0MnIG.mjs.map +1 -0
package/dist/memory-Dj0J1v88.mjs +294 -0
package/dist/memory-Dj0J1v88.mjs.map +1 -0
package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
package/dist/repl-jV5gcJFA.mjs +9 -0
package/dist/skills/index.d.mts +270 -320
package/dist/skills/index.d.mts.map +1 -1
package/dist/skills/index.mjs +5 -5
package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
package/dist/skills-DX8D59UH.mjs.map +1 -0
package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
package/dist/tools-DQ1mPUw5.mjs.map +1 -0
package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
package/dist/types-D6FiR_oh.d.mts.map +1 -0
package/dist/types-DQBe2lFo.d.mts +165 -0
package/dist/types-DQBe2lFo.d.mts.map +1 -0
package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
package/dist/vector-B0panuy6.mjs +95 -0
package/dist/vector-B0panuy6.mjs.map +1 -0
package/docs/PROJECT-STATE.md +321 -0
package/docs/adding-a-model-family.md +280 -0
package/docs/ai-sdk.md +70 -61
package/docs/architecture/overview.md +17 -7
package/docs/browser.md +203 -8
package/docs/embeddings.md +156 -0
package/docs/gerbil-site-native-migration.md +217 -0
package/docs/gpu-engine/architectures.md +398 -0
package/docs/gpu-engine/ir.md +372 -0
package/docs/gpu-engine/kernels.md +718 -0
package/docs/gpu-engine/paper.html +1759 -0
package/docs/gpu-engine/paper.md +2109 -0
package/docs/gpu-engine/safetensors.md +312 -0
package/docs/gpu-engine/tokenizer.md +302 -0
package/docs/memory-rag.md +91 -0
package/docs/metal-safari-intel.md +190 -0
package/docs/mobile-failure-diagnosis.md +124 -0
package/docs/mobile.md +99 -0
package/docs/observability.md +230 -0
package/docs/onnx-removal-plan.md +339 -0
package/docs/research/autoresearch-portable.md +904 -0
package/docs/research/dispatch-reduction-hivemind.md +84 -0
package/docs/research/ios-safari-model-caching.md +117 -0
package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
package/docs/research/native-stt-model-selection.md +49 -0
package/docs/research/native-tts-model-selection.md +90 -0
package/docs/research/native-vs-chromium-decision.md +152 -0
package/docs/research/nemotron-mamba2-inference.md +910 -0
package/docs/research/qwen35-multimodal.md +293 -0
package/docs/research/qwen36-gemma4-targets.md +337 -0
package/docs/research/sota-embedding-models.md +179 -0
package/docs/research/sota-mobile-models-2026.md +263 -0
package/docs/research/sota-modality-models.md +202 -0
package/docs/research/tps-baselines.md +71 -0
package/docs/research/webgpu-m4-reference.md +104 -0
package/docs/site-update-plan.md +155 -0
package/docs/structured-output.md +123 -0
package/docs/stt.md +63 -446
package/docs/tts.md +77 -499
package/docs/vision.md +100 -338
package/package.json +22 -7
package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
package/dist/gerbil-CJ3ifloF.mjs +0 -4
package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
package/dist/gerbil-qOTe1nl2.d.mts +0 -431
package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
package/dist/kokoro-BNTb6egA.mjs +0 -20210
package/dist/kokoro-BNTb6egA.mjs.map +0 -1
package/dist/kokoro-CMOGDSgT.js +0 -20212
package/dist/kokoro-CMOGDSgT.js.map +0 -1
package/dist/mcp-BvbriaBy.mjs.map +0 -1
package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
package/dist/repl-DveXw36T.mjs +0 -9
package/dist/skills-CD3Orlex.mjs.map +0 -1
package/dist/stt-Bu-E23Sc.js +0 -433
package/dist/stt-Bu-E23Sc.js.map +0 -1
package/dist/stt-CpLYbGFd.mjs +0 -433
package/dist/stt-CpLYbGFd.mjs.map +0 -1
package/dist/stt-DRPLEEHB.mjs +0 -3
package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
package/dist/transformers.web-DiD1gTwk.js +0 -44695
package/dist/transformers.web-DiD1gTwk.js.map +0 -1
package/dist/transformers.web-u34VxRFM.js +0 -3
package/dist/tts-CqroPaSK.js +0 -724
package/dist/tts-CqroPaSK.js.map +0 -1
package/dist/tts-DXgsKGCe.mjs +0 -3
package/dist/tts-DeGANMNV.mjs +0 -730
package/dist/tts-DeGANMNV.mjs.map +0 -1
package/dist/types-CiTc7ez3.d.mts.map +0 -1
package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0

package/docs/gpu-engine/safetensors.md ADDED

@@ -0,0 +1,312 @@
+# Safetensors Parser Deep Dive
+The safetensors parser (`src/gpu/safetensors.ts`) reads HuggingFace's binary safetensors format and provides zero-copy typed array views into the raw data. This document covers the binary format, the parser's design, alignment handling, and streaming support.
+---
+## Binary Format
+The safetensors format is intentionally simple. A file consists of three contiguous sections:
+```
++--------------------------------------------------+
+| 8 bytes: header_length (little-endian uint64)     |
++--------------------------------------------------+
+| header_length bytes: JSON header (UTF-8)          |
++--------------------------------------------------+
+| remaining bytes: raw tensor data (contiguous)     |
++--------------------------------------------------+
+```
+### Section 1: Header Length (8 bytes)
+The first 8 bytes encode the JSON header length as a little-endian unsigned 64-bit integer. In practice, headers are always well under 4GB, so only the lower 32 bits are meaningful. The parser uses `DataView.getBigUint64(0, true)` and converts to a JavaScript `Number`.
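As a minimal sketch of the step described above (the helper name `readHeaderLength` is hypothetical and is not taken from the parser's source), the length prefix can be read and range-checked before it is narrowed to a `Number`:

```typescript
// Hypothetical helper, not the parser's actual source.
function readHeaderLength(buffer: ArrayBuffer): number {
  if (buffer.byteLength < 8) {
    throw new Error("need at least 8 bytes to read the header length");
  }
  // `true` selects little-endian byte order.
  const raw = new DataView(buffer).getBigUint64(0, true);
  if (raw > BigInt(Number.MAX_SAFE_INTEGER)) {
    throw new Error("header length does not fit in a JavaScript Number");
  }
  return Number(raw);
}

// Build an 8-byte prefix announcing a 120-byte header.
const prefix = new ArrayBuffer(8);
new DataView(prefix).setBigUint64(0, BigInt(120), true);
const headerLength = readHeaderLength(prefix);
const dataStart = 8 + headerLength; // 128
```

The explicit `MAX_SAFE_INTEGER` check is a defensive choice for this sketch; whether the real parser performs it is not stated in the document.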
+### Section 2: JSON Header
+The header is a JSON object mapping tensor names to their metadata:
+```json
+{
+  "model.embed_tokens.weight": {
+    "dtype": "F32",
+    "shape": [151936, 896],
+    "data_offsets": [0, 544534528]
+  },
+  "model.layers.0.input_layernorm.weight": {
+    "dtype": "F32",
+    "shape": [896],
+    "data_offsets": [544534528, 544538112]
+  },
+  "__metadata__": {
+    "format": "pt"
+  }
+}
+```
+Each tensor entry contains:
+- `dtype`: Data type string (see dtype table below)
+- `shape`: Array of dimension sizes
+- `data_offsets`: `[start, end]` byte offsets relative to the beginning of the data section
+The special `__metadata__` key is optional and contains file-level metadata (e.g., the framework that produced the file).
+### Section 3: Tensor Data
+Raw tensor data, stored contiguously. Each tensor's data occupies `data_offsets[1] - data_offsets[0]` bytes starting at `data_start + data_offsets[0]`, where `data_start = 8 + header_length`.
+Tensors are stored in row-major order (C contiguous). Multi-dimensional tensors are flattened: element `[i, j, k]` of a tensor with shape `[D0, D1, D2]` is at flat index `i * D1 * D2 + j * D2 + k`.
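The row-major formula above generalizes to any rank. The following is an illustrative sketch only (`flatIndex` is a hypothetical name, not part of the parser API):

```typescript
// Row-major (C-contiguous) flat index for an index tuple of any rank.
// For shape [D0, D1, D2] and index [i, j, k] this evaluates to
// i * D1 * D2 + j * D2 + k, written in Horner form.
function flatIndex(shape: number[], index: number[]): number {
  if (shape.length !== index.length) {
    throw new Error("shape and index must have the same rank");
  }
  let flat = 0;
  for (let d = 0; d < shape.length; d++) {
    if (index[d] < 0 || index[d] >= shape[d]) {
      throw new Error(`index ${index[d]} out of range for dimension ${d}`);
    }
    flat = flat * shape[d] + index[d];
  }
  return flat;
}

const position = flatIndex([2, 3, 4], [1, 2, 3]); // 1*12 + 2*4 + 3 = 23
```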
+---
+## Layout Diagram
+For a file containing two tensors, "A" (F32, shape [2, 3]) and "B" (F32, shape [4]):
+```
+Byte 0                                          Byte N
+|<-- 8 -->|<------ header_length ------->|<--- tensor data --->|
++=========+============================+=======================+
+| len=120 | {"A":{"dtype":"F32",       | A: 24 bytes (6 f32)   |
+|  (u64)  |  "shape":[2,3],            | B: 16 bytes (4 f32)   |
+|         |  "data_offsets":[0,24]},    |                       |
+|         | "B":{"dtype":"F32",         |                       |
+|         |  "shape":[4],               |                       |
+|         |  "data_offsets":[24,40]}}    |                       |
++=========+============================+=======================+
+```
+Offsets:
+- Header starts at byte 8
+- Data section starts at byte 8 + header_length = 128
+- Tensor A data: bytes 128 to 151 (24 bytes)
+- Tensor B data: bytes 152 to 167 (16 bytes)
+---
+## DType Mapping
+| Safetensors DType | Bytes | Alignment | JS TypedArray | Notes |
+|-------------------|-------|-----------|---------------|-------|
+| `F32` | 4 | 4 | `Float32Array` | Most common for model weights |
+| `F16` | 2 | 2 | `Uint16Array` | No native f16 typed array; bitwise representation |
+| `BF16` | 2 | 2 | `Uint16Array` | Brain float 16; bitwise representation |
+| `F64` | 8 | 8 | `Float64Array` | Rare in ML models |
+| `I32` | 4 | 4 | `Int32Array` | |
+| `U32` | 4 | 4 | `Uint32Array` | |
+| `I16` | 2 | 2 | `Int16Array` | |
+| `U16` | 2 | 2 | `Uint16Array` | |
+| `I8` | 1 | 1 | `Int8Array` | Used in some quantization schemes |
+| `U8` | 1 | 1 | `Uint8Array` | |
+| `I64` | 8 | 8 | `BigInt64Array` | |
+| `U64` | 8 | 8 | `BigUint64Array` | |
+| `BOOL` | 1 | 1 | `Uint8Array` | |
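+Note on F16 and BF16: `Float16Array` is a recent addition to JavaScript and is not yet available in every runtime, so the parser returns `Uint16Array` views containing the raw bit patterns. To use F16 data with WebGPU, the bits can be uploaded directly to a GPU buffer typed as `f16` in WGSL (which requires the `shader-f16` device feature). BF16 has no WGSL counterpart and needs conversion, either on the CPU or in the shader. To use F16 values on the CPU, manual conversion to f32 is required: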
+```typescript
+// F16 to F32 conversion (for CPU use)
+function f16ToF32(bits: number): number {
+  const sign = (bits >> 15) & 1;
+  const exp = (bits >> 10) & 0x1f;
+  const frac = bits & 0x3ff;
+  if (exp === 0) return (sign ? -1 : 1) * 2 ** -14 * (frac / 1024);
+  if (exp === 31) return frac ? NaN : (sign ? -Infinity : Infinity);
+  return (sign ? -1 : 1) * 2 ** (exp - 15) * (1 + frac / 1024);
+}
+```
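BF16 values need a different conversion. As a sketch (the function name `bf16ToF32` is hypothetical and not part of the documented API): a BF16 value is the upper 16 bits of an IEEE-754 f32, so widening amounts to a 16-bit shift followed by a bit reinterpretation.

```typescript
// BF16 to F32 conversion (for CPU use). Hypothetical helper.
function bf16ToF32(bits: number): number {
  const scratch = new DataView(new ArrayBuffer(4));
  // Place the 16 BF16 bits in the high half of a 32-bit word; the low
  // 16 mantissa bits are zero. Write and read use the same byte order.
  scratch.setUint32(0, (bits & 0xffff) << 16);
  return scratch.getFloat32(0);
}

const one = bf16ToF32(0x3f80);      // 1
const minusTwo = bf16ToF32(0xc000); // -2
```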
+---
+## Zero-Copy Design
+The parser's core principle is to avoid copying data whenever possible. `getTensorData()` returns a typed array **view** into the original `ArrayBuffer`:
+```typescript
+export function getTensorData(
+  buffer: ArrayBuffer,
+  file: SafetensorsFile,
+  entry: SafetensorEntry,
+): ArrayBufferView {
+  const offset = file.dataStart + entry.dataOffset;
+  return makeTypedView(buffer, offset, entry.dataLength, entry.dtype);
+}
+```
+### When Zero-Copy Works
+A typed array view requires that the byte offset is aligned to the element size. For example, a `Float32Array` requires 4-byte alignment. Because tensor data is stored contiguously, most tensors are F32, and writers typically pad the header so the data section starts on an aligned boundary, this requirement is almost always satisfied:
+```typescript
+// Zero-copy path (aligned):
+const src = buffer;        // Original buffer
+const base = offset;       // Offset into original buffer
+return new Float32Array(src, base, byteLength / 4);
+```
+### When Copying Is Required
+If the offset is not aligned (e.g., an F32 tensor starts at a byte offset that is not a multiple of 4), the parser copies the relevant slice:
+```typescript
+// Copy path (misaligned):
+const src = buffer.slice(offset, offset + byteLength);  // New aligned buffer
+const base = 0;
+return new Float32Array(src, base, byteLength / 4);
+```
+In practice, misalignment is rare because safetensors writers typically align tensor data. It only occurs when small tensors of mixed dtypes are packed in unusual ways.
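The two paths above can be combined into one function. This is a sketch restricted to F32, with a hypothetical name (`makeF32View`); the real `makeTypedView` handles every dtype and may be structured differently.

```typescript
// Hypothetical F32-only stand-in for the parser's makeTypedView.
function makeF32View(
  buffer: ArrayBuffer,
  offset: number,
  byteLength: number,
): Float32Array {
  if (byteLength % 4 !== 0) {
    throw new Error("F32 tensor byte length must be a multiple of 4");
  }
  const aligned = offset % Float32Array.BYTES_PER_ELEMENT === 0;
  // Aligned: view the original buffer in place (zero-copy).
  // Misaligned: slice() copies the bytes into a fresh buffer at offset 0.
  const src = aligned ? buffer : buffer.slice(offset, offset + byteLength);
  const base = aligned ? offset : 0;
  return new Float32Array(src, base, byteLength / 4);
}
```

A caller can tell which path was taken by comparing the returned view's `.buffer` with the original buffer.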
+---
+## Parser API
+### `parseSafetensorsHeader(buffer: ArrayBuffer): SafetensorsFile`
+Parses the header from a buffer. Can be called with just the header portion (first 8 + header_length bytes) or the entire file.
+Returns:
+```typescript
+interface SafetensorsFile {
+  headerLength: number;                    // Length of JSON header in bytes
+  dataStart: number;                       // Byte offset where tensor data begins (8 + headerLength)
+  entries: SafetensorEntry[];              // All tensor entries, sorted by offset
+  metadata: Record<string, string> | null; // Optional __metadata__
+}
+```
+Entries are sorted by `dataOffset` for sequential access patterns during loading.
+### `getTensorData(buffer, file, entry): ArrayBufferView`
+Returns a typed array view for a specific tensor entry. Zero-copy when alignment allows.
+### `findTensor(file, name): SafetensorEntry | undefined`
+Find a tensor entry by exact name match. Linear scan; suitable for the typical case of 100-500 tensors per file.
+### `parseSafetensorsFromResponse(response: Response): Promise<{file, fullBuffer}>`
+Convenience function that reads an entire HTTP response into an `ArrayBuffer` and parses the header. Returns both the parsed header and the full buffer for subsequent tensor extraction.
+### `totalTensorBytes(file): number`
+Sums the `dataLength` of all tensor entries. Useful for progress bars and memory budget estimation.
+---
+## Streaming Support
+The parser supports a two-phase loading strategy for large models:
+### Phase 1: Header Only
+Download just the first 8 + header_length bytes using an HTTP Range request. Parse the header to discover tensor names, shapes, and offsets.
+```typescript
+// Fetch the first 64KB (enough for many headers; a longer header needs a larger range)
+const headerResponse = await fetch(url, {
+  headers: { Range: "bytes=0-65535" }
+});
+const headerBuffer = await headerResponse.arrayBuffer();
+const file = parseSafetensorsHeader(headerBuffer);
+// Now we know all tensor names, shapes, and byte offsets
+for (const entry of file.entries) {
+  console.log(`${entry.name}: ${entry.dtype} ${entry.shape} @ offset ${entry.dataOffset}`);
+}
+```
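If a fixed 64KB guess is too small for a given file, one alternative is to request the 8-byte length prefix first and derive the exact header range from it. A sketch of that calculation (`fullHeaderRange` is a hypothetical helper, not part of the documented API):

```typescript
// Given the first 8 bytes of a safetensors file, return the Range header
// value that covers the length prefix plus the whole JSON header.
function fullHeaderRange(first8: ArrayBuffer): string {
  const headerLength = Number(new DataView(first8).getBigUint64(0, true));
  const lastByte = 8 + headerLength - 1; // HTTP Range ends are inclusive
  return `bytes=0-${lastByte}`;
}

const lengthPrefix = new ArrayBuffer(8);
new DataView(lengthPrefix).setBigUint64(0, BigInt(120), true);
const range = fullHeaderRange(lengthPrefix); // "bytes=0-127"
```

This costs one extra round trip (a `bytes=0-7` request) in exchange for never truncating the header.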
+### Phase 2: Selective Tensor Download
+Download specific tensors by offset using Range requests:
+```typescript
+const start = file.dataStart + entry.dataOffset;
+const end = start + entry.dataLength - 1;
+const dataResponse = await fetch(url, {
+  headers: { Range: `bytes=${start}-${end}` }
+});
+```
+This is useful when:
+- You only need a subset of tensors (e.g., loading a single layer for testing)
+- Memory is constrained and you want to upload tensors to GPU one at a time, freeing CPU memory between downloads
+- The model uses sharding and you want to parallelize downloads of independent shards
+---
+## Multi-Shard Support
+Large models split their weights across multiple safetensors files. The split is described in `model.safetensors.index.json`:
+```json
+{
+  "metadata": { "total_size": 1234567890 },
+  "weight_map": {
+    "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "lm_head.weight": "model-00003-of-00003.safetensors"
+  }
+}
+```
+The model loader (`model-loader.ts`) handles this automatically:
+1. Tries to fetch `model.safetensors.index.json`
+2. If found, extracts the unique filenames from `weight_map`
+3. Downloads each shard and parses it independently
+4. Maps all tensors from all shards to canonical names
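Step 2 above can be sketched as follows. The interface and function names are illustrative assumptions; only the `weight_map` field name comes from the index format shown earlier.

```typescript
// Minimal shape of model.safetensors.index.json needed for this step.
interface SafetensorsIndex {
  metadata?: { total_size?: number };
  weight_map: Record<string, string>;
}

// Collect each shard filename once, in a stable (sorted) order.
function shardFilenames(index: SafetensorsIndex): string[] {
  return Array.from(new Set(Object.values(index.weight_map))).sort();
}

const exampleIndex: SafetensorsIndex = {
  metadata: { total_size: 1234567890 },
  weight_map: {
    "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
    "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
    "lm_head.weight": "model-00003-of-00003.safetensors",
  },
};
const shards = shardFilenames(exampleIndex);
```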
+---
+## Example: Parsing a Safetensors File
+```typescript
+import { parseSafetensorsHeader, getTensorData, findTensor } from "./safetensors.js";
+// Assume `buffer` is an ArrayBuffer from fetch()
+const file = parseSafetensorsHeader(buffer);
+console.log(`Header: ${file.headerLength} bytes`);
+console.log(`Data starts at: ${file.dataStart}`);
+console.log(`Tensors: ${file.entries.length}`);
+// Find a specific tensor
+const embedding = findTensor(file, "model.embed_tokens.weight");
+if (embedding) {
+  console.log(`Embedding: ${embedding.dtype} ${embedding.shape}`);
+  console.log(`  Offset: ${embedding.dataOffset}, Length: ${embedding.dataLength}`);
+  // Get the data as a typed array
+  const data = getTensorData(buffer, file, embedding);
+  console.log(`  Type: ${data.constructor.name}`);
+  console.log(`  Elements: ${data.byteLength / 4}`);
+  // For F32, we can read values directly
+  const floats = data as Float32Array;
+  console.log(`  First 5 values: ${Array.from(floats.slice(0, 5))}`);
+}
+```
+---
+## Memory Considerations
+During loading, weights occupy memory twice:
+1. The raw `ArrayBuffer` from the HTTP response
+2. The GPU buffer after upload
+After GPU upload, the CPU-side `ArrayBuffer` can be garbage collected once no typed array views reference it. The model loader keeps the CPU-side copy small by processing one safetensors shard at a time.
+For a 1.7B parameter model in F32:
+```
+Weight bytes = 1.7e9 * 4 = 6.8 GB (would not fit in browser memory)
+```
+This is why quantized models are preferred for browser inference. In INT4 (with group scales):
+```
+Weight bytes ~ 1.7e9 * 0.5 + overhead = ~1 GB (feasible)
+```
+The safetensors parser itself is lightweight -- it only parses the header (a few KB) and creates views into the existing buffer. The dominant memory cost is the buffer itself, not the parser.
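The weight-size arithmetic above can be reproduced with a small estimator. This is a rough sketch: the group size of 128 and the 2-byte (f16) scale per group are assumptions chosen for illustration, not values stated by the document.

```typescript
// Estimate weight storage in bytes. `groupSize` of 0 means no group scales.
function estimateWeightBytes(
  params: number,
  bitsPerWeight: number,
  groupSize = 0,
  scaleBytes = 2,
): number {
  const weights = (params * bitsPerWeight) / 8;
  const scales = groupSize > 0 ? Math.ceil(params / groupSize) * scaleBytes : 0;
  return weights + scales;
}

const f32Bytes = estimateWeightBytes(1.7e9, 32);      // 6,800,000,000
const int4Bytes = estimateWeightBytes(1.7e9, 4, 128); // 876,562,500
```

Under these assumptions the INT4 figure lands a little below the ~1 GB quoted above; real quantization formats add further overhead such as zero points and unquantized layers.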

package/docs/gpu-engine/tokenizer.md ADDED

@@ -0,0 +1,302 @@
+# BPE Tokenizer Deep Dive
+The tokenizer (`src/gpu/tokenizer.ts`) is a pure JavaScript Byte Pair Encoding (BPE) implementation that reads HuggingFace `tokenizer.json` files. No WASM, no native dependencies, no external libraries.
+---
+## Overview
+The `Tokenizer` class provides:
+- `encode(text)` -- Convert text to token IDs
+- `decode(ids)` -- Convert token IDs back to text
+- `applyChatTemplate(messages)` -- Format chat messages (ChatML)
+- `encodeChat(messages)` -- Template + encode in one call
+- `Tokenizer.fromJSON(tokenizerJSON, configJSON)` -- Factory from HF files
+---
+## Internal Data Structures
+The tokenizer maintains five core maps built from `tokenizer.json`:
+| Map | Type | Source | Purpose |
+|-----|------|--------|---------|
+| `vocab` | `string -> number` | `model.vocab` | Token string to ID |
+| `vocabReverse` | `number -> string` | (inverse of vocab) | ID to token string |
+| `merges` | `string -> number` | `model.merges` | Merge pair to priority (lower = higher priority) |
+| `specialTokens` | `string -> number` | `added_tokens` (where `special: true`) | Special token detection |
+| `byteFallback` | `number -> string` | Derived from vocab | Byte value to `<0xHH>` token string |
+---
+## Encoding Pipeline
+The encoding pipeline transforms text into a sequence of token IDs through five stages:
+### Worked Example: `"Hello world"`
+#### Stage 1: Special Token Splitting
+The text is split around any special tokens (like `<|im_start|>`, `<|endoftext|>`). Regular text segments are marked `special: false`.
+```
+Input:  "Hello world"
+Output: [{ text: "Hello world", special: false }]
+```
+If the text contained special tokens:
+```
+Input:  "<|im_start|>user\nHello<|im_end|>"
+Output: [
+  { text: "<|im_start|>", special: true },
+  { text: "user\nHello", special: false },
+  { text: "<|im_end|>", special: true },
+]
+```
+Special tokens are matched using a regex built from the sorted (longest-first) special token list, ensuring greedy matching.
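A minimal sketch of this stage, assuming the special tokens are available as a plain string array (`splitOnSpecialTokens` and `Segment` are hypothetical names; the tokenizer's internals may differ):

```typescript
interface Segment {
  text: string;
  special: boolean;
}

function splitOnSpecialTokens(text: string, specialTokens: string[]): Segment[] {
  // Longest-first ordering makes the alternation prefer the longest token.
  const escaped = specialTokens
    .filter((t) => t.length > 0)
    .sort((a, b) => b.length - a.length)
    .map((t) => t.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  if (escaped.length === 0) {
    return text.length > 0 ? [{ text, special: false }] : [];
  }
  const pattern = new RegExp(escaped.join("|"), "g");
  const segments: Segment[] = [];
  let last = 0;
  for (let m = pattern.exec(text); m !== null; m = pattern.exec(text)) {
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), special: false });
    }
    segments.push({ text: m[0], special: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), special: false });
  }
  return segments;
}
```

Escaping each token before joining matters because tokens such as `<|im_start|>` contain `|`, which would otherwise act as regex alternation.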
+#### Stage 2: Pre-tokenization
+Non-special text is split into chunks using a GPT-style regex:
+```
+/'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+/gu
+```
+This splits on:
+- Contractions: `'s`, `'t`, `'re`, `'ve`, `'m`, `'ll`, `'d`
+- Words (with optional leading space): ` Hello`, ` world`
+- Numbers (with optional leading space): ` 42`
+- Punctuation (with optional leading space): ` !`
+- Whitespace runs
+```
+Input:  "Hello world"
+Output: ["Hello", " world"]
+```
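The regex above can be exercised directly. A small sketch (the wrapper name `preTokenize` is hypothetical):

```typescript
// The pre-tokenization pattern quoted above, with the global and unicode flags.
const PRETOKENIZE =
  /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+/gu;

function preTokenize(text: string): string[] {
  // With the `g` flag, match() returns every match (or null when none).
  return text.match(PRETOKENIZE) ?? [];
}

const helloChunks = preTokenize("Hello world"); // ["Hello", " world"]
const mixedChunks = preTokenize("I'm 42!");     // ["I", "'m", " 42", "!"]
```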
+#### Stage 3: Byte-Level Encoding
+Each chunk is converted to the byte-level representation used in HF vocabularies. The key transformation is the space-to-`\u0120` (character) mapping:
+| Character | Code Point | Representation |
+|-----------|-----------|----------------|
+| Space (` `) | 32 | `\u0120` (latin capital G with dot above) |
+| Newline (`\n`) | 10 | `\u010A` (offset by 256) |
+| Tab (`\t`) | 9 | `\u0109` (offset by 256) |
+| Regular printable | 33-126 | Unchanged |
+```
+Input:  ["Hello", " world"]
+Output: ["Hello", "\u0120world"]    (the space becomes the special character)
+```
+This is the convention used by GPT-2 and all derivative tokenizers. The `\u0120` character serves as an in-band marker for "this token starts with a space."
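For reference, here is a sketch of the GPT-2 style byte-to-character table that produces the mappings above. Whether `tokenizer.ts` builds its table this way is an assumption; the function names are hypothetical.

```typescript
// Bytes that are printable keep their own code point; every other byte is
// assigned 256 + n, where n counts the non-printable bytes seen so far.
function buildByteToUnicode(): string[] {
  const printable = (b: number): boolean =>
    (b >= 0x21 && b <= 0x7e) || (b >= 0xa1 && b <= 0xac) || (b >= 0xae && b <= 0xff);
  const table: string[] = [];
  let next = 0;
  for (let b = 0; b < 256; b++) {
    table.push(String.fromCharCode(printable(b) ? b : 256 + next++));
  }
  return table;
}

const BYTE_TO_UNICODE = buildByteToUnicode();

// Encode a chunk as UTF-8, then map each byte through the table.
function byteLevelEncode(chunk: string): string {
  const bytes = new TextEncoder().encode(chunk);
  return Array.from(bytes, (b) => BYTE_TO_UNICODE[b]).join("");
}

const encodedWorld = byteLevelEncode(" world"); // "\u0120world"
```

Because bytes 0 through 32 are the first non-printable bytes, they map to 256 plus their own value, which is why space, newline, and tab appear "offset by 256" in the table above.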
+#### Stage 4: BPE Merge
+Each byte-encoded chunk undergoes iterative pair merging:
+1. Start with individual characters: `["\u0120", "w", "o", "r", "l", "d"]`
+2. Find the pair with the lowest merge rank in the merge table
+3. Merge that pair into one token
+4. Repeat until no more merges are possible
+```
+"\u0120world" merge trace (hypothetical ranks):
+  Step 0: ["\u0120", "w", "o", "r", "l", "d"]
+  Step 1: ["\u0120w", "o", "r", "l", "d"]        (merge "\u0120" + "w", rank 42)
+  Step 2: ["\u0120w", "or", "l", "d"]             (merge "o" + "r", rank 87)
+  Step 3: ["\u0120w", "orl", "d"]                 (merge "or" + "l", rank 203)
+  Step 4: ["\u0120w", "orld"]                     (merge "orl" + "d")
+  Step 5: ["\u0120world"]                          (merge "\u0120w" + "orld")
+```
+(Actual merge orders depend on the specific tokenizer's merge table.)
+The algorithm always selects the pair with the **lowest rank** (highest priority). Applying merges in rank order is deterministic and reproduces the segmentation defined by the merge table; it is not guaranteed to yield the fewest possible tokens.
+If a complete chunk is already in the vocabulary as a single token, the BPE step is skipped and the chunk maps directly to its token ID.
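The merge loop can be sketched as below. This is a simplified illustration, not the tokenizer's code: it assumes merge keys have the form `"left right"` (matching the entries in `model.merges`) and merges one occurrence per iteration.

```typescript
// Repeatedly merge the adjacent pair with the lowest rank until none applies.
function bpeMerge(chunk: string, ranks: Map<string, number>): string[] {
  const symbols = Array.from(chunk); // start from individual characters
  while (symbols.length > 1) {
    let bestRank = Infinity;
    let bestIndex = -1;
    for (let i = 0; i < symbols.length - 1; i++) {
      const rank = ranks.get(`${symbols[i]} ${symbols[i + 1]}`);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex === -1) break; // no mergeable pair left
    symbols.splice(bestIndex, 2, symbols[bestIndex] + symbols[bestIndex + 1]);
  }
  return symbols;
}

// Toy merge table; the ranks are made up for illustration.
const toyRanks = new Map<string, number>([
  ["\u0120 w", 42],
  ["o r", 87],
  ["or l", 203],
  ["orl d", 300],
  ["\u0120w orld", 400],
]);
const merged = bpeMerge("\u0120world", toyRanks); // ["\u0120world"]
```

Each iteration rescans the whole chunk, which is where the quadratic worst case for a single chunk comes from.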
+#### Stage 5: Byte Fallback
+If any character or merged symbol isn't in the vocabulary, it's encoded as a sequence of raw byte tokens using the `<0xHH>` format:
+```
+Unknown character "ñ" (UTF-8 bytes: 0xC3, 0xB1):
+  -> ["<0xC3>", "<0xB1>"]
+  -> [token_id_for_0xC3, token_id_for_0xB1]
+```
+This ensures every possible input can be encoded, even if it contains characters not in the training data.
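Producing the `<0xHH>` token strings is a short transformation over the UTF-8 bytes. A sketch (the function name is hypothetical; looking the strings up in the vocabulary is left out):

```typescript
// Convert a symbol to its byte-fallback token strings, one per UTF-8 byte.
function byteFallbackTokens(symbol: string): string[] {
  const bytes = new TextEncoder().encode(symbol);
  return Array.from(
    bytes,
    (b) => `<0x${b.toString(16).toUpperCase().padStart(2, "0")}>`,
  );
}

const fallback = byteFallbackTokens("ñ"); // ["<0xC3>", "<0xB1>"]
```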
+### Complete Example Result
+```
+"Hello world"
+  -> pre-tokenize: ["Hello", " world"]
+  -> byte encode:  ["Hello", "\u0120world"]
+  -> BPE merge:    ["Hello"] -> [token_id_15496]
+                   ["\u0120world"] -> [token_id_1917]
+  -> final IDs:    [15496, 1917]
+```
+(Token IDs are model-specific; these are illustrative.)
+---
+## Decoding Pipeline
+Decoding reverses the encoding:
+1. Map each token ID to its string representation via `vocabReverse`
+2. Optionally skip special tokens (BOS, EOS, etc.)
+3. Join all token strings
+4. Replace `\u0120` back to space
+5. Replace `<0xHH>` patterns back to raw bytes
+```typescript
+decode(ids: number[], skipSpecialTokens: boolean = true): string
+```
+---
+## Chat Template Support
+The tokenizer implements ChatML format for chat-style conversation encoding:
+```
+<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+What is 2+2?<|im_end|>
+<|im_start|>assistant
+```
+### API
+```typescript
+const tokenizer = Tokenizer.fromJSON(tokenizerJSON, configJSON);
+const text = tokenizer.applyChatTemplate([
+  { role: "system", content: "You are a helpful assistant." },
+  { role: "user", content: "What is 2+2?" },
+], { addGenerationPrompt: true });
+// Returns the formatted string:
+// "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
+//  <|im_start|>user\nWhat is 2+2?<|im_end|>\n
+//  <|im_start|>assistant\n"
+const ids = tokenizer.encodeChat([
+  { role: "system", content: "You are a helpful assistant." },
+  { role: "user", content: "What is 2+2?" },
+]);
+// Returns the token IDs directly
+```
+### Message Types
+```typescript
+interface ChatMessage {
+  role: "system" | "user" | "assistant";
+  content: string;
+}
+```
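The ChatML layout shown earlier is simple enough to sketch as a standalone function. This is an illustration of the format, not the tokenizer's `applyChatTemplate` implementation; `formatChatML` is a hypothetical name.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Wrap each message in <|im_start|>role ... <|im_end|> and optionally
// open an assistant turn for the model to complete.
function formatChatML(messages: ChatMessage[], addGenerationPrompt = true): string {
  let out = "";
  for (const message of messages) {
    out += `<|im_start|>${message.role}\n${message.content}<|im_end|>\n`;
  }
  if (addGenerationPrompt) {
    out += "<|im_start|>assistant\n";
  }
  return out;
}

const prompt = formatChatML([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is 2+2?" },
]);
```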
+### Limitations
+The current implementation hardcodes the ChatML format. HuggingFace tokenizers store a Jinja2 template in `tokenizer_config.json` under the `chat_template` field. A future improvement would parse and evaluate this template to support non-ChatML formats (e.g., Llama's `[INST]...[/INST]` format, Phi's `<|user|>...<|end|>` format).
+---
+## Configuration
+The tokenizer reads configuration from `tokenizer_config.json`:
+```typescript
+interface TokenizerConfig {
+  bosToken: string | null;      // e.g. "<|endoftext|>"
+  eosToken: string | null;      // e.g. "<|im_end|>" or "<|endoftext|>"
+  bosTokenId: number | null;    // Resolved from vocab
+  eosTokenId: number | null;    // Resolved from vocab
+  chatTemplate: string | null;  // Jinja2 template (stored but not yet parsed)
+  addBosToken: boolean;         // Whether to prepend BOS to encoded text
+  addEosToken: boolean;         // Whether to append EOS to encoded text
+}
+```
+When `addBosToken` is true, `encode()` prepends the BOS token ID. When `addEosToken` is true, it appends the EOS token ID. These settings come from the model's tokenizer config.
+---
+## HuggingFace tokenizer.json Format
+The tokenizer reads the standard HF `tokenizer.json` format:
+```json
+{
+  "model": {
+    "type": "BPE",
+    "vocab": {
+      "Hello": 15496,
+      "\u0120world": 1917,
+      "<0x0A>": 198,
+      ...
+    },
+    "merges": [
+      "\u0120 t",
+      "i n",
+      "e r",
+      ...
+    ]
+  },
+  "added_tokens": [
+    { "id": 151643, "content": "<|endoftext|>", "special": true },
+    { "id": 151644, "content": "<|im_start|>", "special": true },
+    { "id": 151645, "content": "<|im_end|>", "special": true }
+  ]
+}
+```
+Key fields:
+- `model.type`: Must be `"BPE"` (the only supported type)
+- `model.vocab`: Complete vocabulary mapping token strings to IDs
+- `model.merges`: Ordered list of merge pairs (index = priority)
+- `added_tokens`: Special tokens with their IDs and flags
+---
+## Performance Characteristics
+The BPE algorithm has quadratic worst-case complexity in the length of a single chunk (O(n^2) where n is the number of characters). In practice, this is fast because:
+1. Pre-tokenization breaks text into small chunks (typically words)
+2. Most words are 5-15 characters, so the inner merge loop is small
+3. Common words are in the vocabulary directly, skipping BPE entirely
+For a typical prompt of 200 words:
+- Pre-tokenization: ~200 chunks
+- BPE per chunk: ~10-15 merge iterations
+- Total: ~3000 operations, well under 1ms
+The dominant cost for long prompts is the regex pre-tokenization pass, which is a single linear scan using the built-in regex engine.
+---
+## Vocabulary Size
+Common vocabulary sizes for models the engine targets:
+| Model Family | Vocab Size | Notable |
+|-------------|-----------|---------|
+| Qwen2/3 | 151,936 | Large vocab with extensive CJK coverage |
+| LLaMA 2 | 32,000 | |
+| LLaMA 3 | 128,256 | Significantly expanded |
+| Phi-3 | 32,064 | |
+| SmolLM2 | 49,152 | |
+The vocab size directly affects:
+- LM head matmul cost (hidden_size x vocab_size)
+- Logit readback size (vocab_size * 4 bytes)
+- Sampler sorting cost (O(vocab_size * log(vocab_size)) for top-k)

package/docs/memory-rag.md ADDED

@@ -0,0 +1,91 @@
+# Memory / RAG (on-device agent memory)
+> Note: this is the **RAG / persistent-memory** module (`@tryhamster/gerbil/memory`).
+> For GPU/KV-cache memory management see [memory.md](memory.md).
+Gerbil's memory module is an on-device, persistent memory layer that turns
+Gerbil into an agent harness: store text + embeddings, retrieve semantically,
+and rebuild a token-budgeted context block every turn.
+It is engine-agnostic — bring any embedder and any storage backend — but wires
+straight into Gerbil's native embeddings by default.
+## Quick start
+```ts
+import { Gerbil } from "@tryhamster/gerbil";
+import { createMemory, createGerbilEmbedder } from "@tryhamster/gerbil/memory";
+const g = new Gerbil();
+await g.loadModel("embeddinggemma-300m");
+const mem = createMemory({ embed: createGerbilEmbedder(g) });
+await mem.add("Paris is the capital of France", { metadata: { topic: "geo" } });
+const hits = await mem.search("French capital", { k: 3 });
+const { context } = await mem.recall("French capital", { tokenBudget: 512 });
+```
+## Public API
+`createMemory({ embed, store?, redact?, chunk? }) → Memory`
+| Method | Description |
+| --- | --- |
+| `add(text, { metadata?, id?, chunk? })` | Redact → (optional) chunk → embed → normalize → store. Returns created ids. |
+| `search(query, { k?, filter?, minScore? })` | Cosine top-k. Returns `{ record, score }[]`. |
+| `recall(query, { tokenBudget?, k?, filter?, minScore?, separator? })` | Retrieve + greedily pack into a token-budgeted context block. |
+| `get(id)` / `delete(id)` / `list(filter?)` / `clear()` / `size()` | CRUD over records. |
+| `export()` / `import(snapshot)` | JSON snapshot round-trip. |
+| `backend` | The underlying `MemoryStore` (for advanced use). |
+## Backends (pluggable `MemoryStore`)
+| Factory | Runtime | Durability |
+| --- | --- | --- |
+| `createInMemoryStore()` (default) | Node + browser | none (process lifetime) |
+| `createIndexedDBStore({ dbName?, storeName?, indexedDB? })` | browser | durable across sessions |
+| `createFileStore(path)` | Node | durable JSON on disk |
+All backends store **pre-normalized** embeddings and perform a brute-force
+cosine top-k scan, which is fine up to the thousands-of-records scale. Inject an
+`indexedDB` factory (e.g. `fake-indexeddb`) to exercise the IndexedDB backend
+under Node.
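A brute-force cosine scan over pre-normalized embeddings reduces to a dot product per record. The following is a sketch under that assumption; `MemoryRecord`, `normalize`, and `cosineTopK` are illustrative names, not the module's API.

```typescript
interface MemoryRecord {
  id: string;
  embedding: Float32Array; // assumed to be unit-length already
}

function normalize(v: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < v.length; i++) sumSquares += v[i] * v[i];
  const norm = Math.sqrt(sumSquares) || 1; // leave an all-zero vector unchanged
  return v.map((x) => x / norm);
}

// Score every record against the query and keep the k best: O(n) dot
// products plus a sort.
function cosineTopK(records: MemoryRecord[], query: Float32Array, k: number) {
  const q = normalize(query);
  const scored = records.map((record) => {
    let score = 0;
    for (let i = 0; i < q.length; i++) score += record.embedding[i] * q[i];
    return { record, score };
  });
  return scored.sort((a, b) => b.score - a.score).slice(0, k);
}
```

Normalizing once at write time means a query pays for only one normalization, regardless of corpus size.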
+## Embedder injection
+The module only needs `(texts: string[]) => Promise<Float32Array[]>`.
+`createGerbilEmbedder(engine)` adapts any object with a compatible
+`embedBatch` (a `Gerbil` instance, the one-liner `embedBatch`, or the browser
+`useEmbedding().embedBatch`). Any other embedder works by passing the function
+directly.
+## Chunking
+`add(text, { chunk: true })` or `add(text, { chunk: { chunkSize, overlap } })`
+splits long documents into overlapping character windows (defaults: 1000 chars,
+200 overlap), one record per chunk, so retrieval targets relevant passages.
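Overlapping character windows can be sketched as follows. The defaults mirror the ones stated above, but the function itself (`chunkText`) and its handling of the final window are assumptions, not the module's implementation.

```typescript
// Split text into windows of `chunkSize` characters, each starting
// `chunkSize - overlap` characters after the previous one.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const windows = chunkText("abcdefghij", 4, 2); // ["abcd", "cdef", "efgh", "ghij"]
```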
+## Context packing (`recall`)
+`recall` retrieves a candidate pool (default `k: 20`), then greedily fills a
+context block highest-score-first, stopping before `tokenBudget` is exceeded
+(it skips a too-large candidate and tries smaller ones rather than stopping
+outright). Token counts are **approximate**: the heuristic is ~4 characters per
+token (the common English-ish rule), deliberately avoiding a tokenizer
+dependency. The goal is to stay under a model's context window, not exact
+accounting.
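The greedy packing strategy can be sketched like this. The ~4 characters per token heuristic comes from the text above; counting the separator against the budget, and the names `Hit`, `approxTokens`, and `packContext`, are assumptions of this sketch.

```typescript
interface Hit {
  text: string;
  score: number;
}

// Approximate token count: roughly 4 characters per token.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

function packContext(hits: Hit[], tokenBudget: number, separator = "\n\n"): string {
  const picked: string[] = [];
  let used = 0;
  const byScore = hits.slice().sort((a, b) => b.score - a.score);
  for (const hit of byScore) {
    const cost =
      approxTokens(hit.text) + (picked.length > 0 ? approxTokens(separator) : 0);
    // A candidate that does not fit is skipped; smaller ones may still fit.
    if (used + cost > tokenBudget) continue;
    picked.push(hit.text);
    used += cost;
  }
  return picked.join(separator);
}
```

Skipping rather than stopping lets a short, lower-scored passage fill space that a long, higher-scored one could not use.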
+## Privacy
+- `redact` is applied on **write**: either a `RegExp` (matches → `[REDACTED]`) or a
+  `(text) => string` function that returns the redacted text.
+- `export()` / `import()` move the full corpus as JSON.
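The two accepted forms of `redact` could be handled as in the sketch below. It assumes that a `RegExp` replaces every match rather than only the first; `Redactor` and `applyRedaction` are hypothetical names.

```typescript
type Redactor = RegExp | ((text: string) => string);

function applyRedaction(text: string, redact?: Redactor): string {
  if (!redact) return text;
  if (typeof redact === "function") return redact(text);
  // Force the global flag so every match is replaced (an assumption here).
  const flags = redact.flags.includes("g") ? redact.flags : redact.flags + "g";
  return text.replace(new RegExp(redact.source, flags), "[REDACTED]");
}

const scrubbed = applyRedaction("call 555-1234 or 555-9876", /\d{3}-\d{4}/);
// "call [REDACTED] or [REDACTED]"
```

Because redaction runs before embedding, the sensitive text never reaches the embedder or the store.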
+## Follow-ups
+- **HNSW/ANN index** for >10k records (current scan is O(n) per query).
+- **OPFS (browser) / SQLite (Node) backend** for larger durable corpora than the JSON
+  file store comfortably holds.
+- **Real tokenizer** option for exact budgeting (currently a char heuristic).
+- **TTL / decay & dedup** policies for long-running agents.