npm - @tryhamster/gerbil - Versions diffs - 1.0.0-rc.8 → 1.0.0 - Mend

@tryhamster/gerbil 1.0.0-rc.8 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (179)

package/LICENSE +1 -1
package/README.md +247 -84
package/dist/architectures-C1I5V3Dt.mjs +6070 -0
package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
package/dist/browser/index.d.ts +264 -588
package/dist/browser/index.d.ts.map +1 -1
package/dist/browser/index.js +585 -2334
package/dist/browser/index.js.map +1 -1
package/dist/cli.mjs +625 -1098
package/dist/cli.mjs.map +1 -1
package/dist/defaults-9komdrbY.mjs +24 -0
package/dist/defaults-9komdrbY.mjs.map +1 -0
package/dist/frameworks/express.d.mts +1 -3
package/dist/frameworks/express.d.mts.map +1 -1
package/dist/frameworks/express.mjs +7 -7
package/dist/frameworks/express.mjs.map +1 -1
package/dist/frameworks/fastify.d.mts +1 -1
package/dist/frameworks/fastify.d.mts.map +1 -1
package/dist/frameworks/fastify.mjs +3 -3
package/dist/frameworks/fastify.mjs.map +1 -1
package/dist/frameworks/hono.d.mts +1 -1
package/dist/frameworks/hono.d.mts.map +1 -1
package/dist/frameworks/hono.mjs +4 -4
package/dist/frameworks/hono.mjs.map +1 -1
package/dist/frameworks/next.d.mts +3 -2
package/dist/frameworks/next.d.mts.map +1 -1
package/dist/frameworks/next.mjs +4 -4
package/dist/frameworks/next.mjs.map +1 -1
package/dist/frameworks/react.d.mts +1 -1
package/dist/frameworks/trpc.d.mts +1 -1
package/dist/frameworks/trpc.d.mts.map +1 -1
package/dist/frameworks/trpc.mjs +4 -4
package/dist/frameworks/trpc.mjs.map +1 -1
package/dist/gerbil-BHrJJIa4.mjs +1656 -0
package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
package/dist/gerbil-BT9fCydo.d.mts +488 -0
package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
package/dist/gerbil-DomNfIr1.mjs +4 -0
package/dist/gpu/hooks.d.mts +520 -0
package/dist/gpu/hooks.d.mts.map +1 -0
package/dist/gpu/hooks.mjs +1188 -0
package/dist/gpu/hooks.mjs.map +1 -0
package/dist/gpu/index.d.mts +2 -0
package/dist/gpu/index.mjs +6 -0
package/dist/gpu-33qCAtHW.mjs +3615 -0
package/dist/gpu-33qCAtHW.mjs.map +1 -0
package/dist/index-Dgmb2kE3.d.mts +245 -0
package/dist/index-Dgmb2kE3.d.mts.map +1 -0
package/dist/index-jEAL2s-A.d.mts +2022 -0
package/dist/index-jEAL2s-A.d.mts.map +1 -0
package/dist/index.d.mts +22 -487
package/dist/index.d.mts.map +1 -1
package/dist/index.mjs +13 -8
package/dist/index.mjs.map +1 -1
package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
package/dist/integrations/ai-sdk.d.mts +75 -6
package/dist/integrations/ai-sdk.d.mts.map +1 -1
package/dist/integrations/ai-sdk.mjs +131 -15
package/dist/integrations/ai-sdk.mjs.map +1 -1
package/dist/integrations/langchain.d.mts +1 -1
package/dist/integrations/langchain.d.mts.map +1 -1
package/dist/integrations/langchain.mjs +5 -5
package/dist/integrations/langchain.mjs.map +1 -1
package/dist/integrations/llamaindex.d.mts +1 -1
package/dist/integrations/llamaindex.d.mts.map +1 -1
package/dist/integrations/llamaindex.mjs +5 -5
package/dist/integrations/llamaindex.mjs.map +1 -1
package/dist/integrations/mcp-client.mjs +3 -3
package/dist/integrations/mcp-client.mjs.map +1 -1
package/dist/integrations/mcp.d.mts +3 -2
package/dist/integrations/mcp.d.mts.map +1 -1
package/dist/integrations/mcp.mjs +5 -5
package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
package/dist/mcp-1DaMsaBc.mjs.map +1 -0
package/dist/memory/index.d.mts +3 -0
package/dist/memory/index.mjs +6 -0
package/dist/memory-D1P7Tmda.mjs +4 -0
package/dist/memory-DVN0MnIG.mjs +132 -0
package/dist/memory-DVN0MnIG.mjs.map +1 -0
package/dist/memory-Dj0J1v88.mjs +294 -0
package/dist/memory-Dj0J1v88.mjs.map +1 -0
package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
package/dist/repl-jV5gcJFA.mjs +9 -0
package/dist/skills/index.d.mts +270 -320
package/dist/skills/index.d.mts.map +1 -1
package/dist/skills/index.mjs +5 -5
package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
package/dist/skills-DX8D59UH.mjs.map +1 -0
package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
package/dist/tools-DQ1mPUw5.mjs.map +1 -0
package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
package/dist/types-D6FiR_oh.d.mts.map +1 -0
package/dist/types-DQBe2lFo.d.mts +165 -0
package/dist/types-DQBe2lFo.d.mts.map +1 -0
package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
package/dist/vector-B0panuy6.mjs +95 -0
package/dist/vector-B0panuy6.mjs.map +1 -0
package/docs/PROJECT-STATE.md +321 -0
package/docs/adding-a-model-family.md +280 -0
package/docs/ai-sdk.md +70 -61
package/docs/architecture/overview.md +17 -7
package/docs/browser.md +203 -8
package/docs/embeddings.md +156 -0
package/docs/gerbil-site-native-migration.md +217 -0
package/docs/gpu-engine/architectures.md +398 -0
package/docs/gpu-engine/ir.md +372 -0
package/docs/gpu-engine/kernels.md +718 -0
package/docs/gpu-engine/paper.html +1759 -0
package/docs/gpu-engine/paper.md +2109 -0
package/docs/gpu-engine/safetensors.md +312 -0
package/docs/gpu-engine/tokenizer.md +302 -0
package/docs/memory-rag.md +91 -0
package/docs/metal-safari-intel.md +190 -0
package/docs/mobile-failure-diagnosis.md +124 -0
package/docs/mobile.md +99 -0
package/docs/observability.md +230 -0
package/docs/onnx-removal-plan.md +339 -0
package/docs/research/autoresearch-portable.md +904 -0
package/docs/research/dispatch-reduction-hivemind.md +84 -0
package/docs/research/ios-safari-model-caching.md +117 -0
package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
package/docs/research/native-stt-model-selection.md +49 -0
package/docs/research/native-tts-model-selection.md +90 -0
package/docs/research/native-vs-chromium-decision.md +152 -0
package/docs/research/nemotron-mamba2-inference.md +910 -0
package/docs/research/qwen35-multimodal.md +293 -0
package/docs/research/qwen36-gemma4-targets.md +337 -0
package/docs/research/sota-embedding-models.md +179 -0
package/docs/research/sota-mobile-models-2026.md +263 -0
package/docs/research/sota-modality-models.md +202 -0
package/docs/research/tps-baselines.md +71 -0
package/docs/research/webgpu-m4-reference.md +104 -0
package/docs/site-update-plan.md +155 -0
package/docs/structured-output.md +123 -0
package/docs/stt.md +63 -446
package/docs/tts.md +77 -499
package/docs/vision.md +100 -338
package/package.json +22 -7
package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
package/dist/gerbil-CJ3ifloF.mjs +0 -4
package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
package/dist/gerbil-qOTe1nl2.d.mts +0 -431
package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
package/dist/kokoro-BNTb6egA.mjs +0 -20210
package/dist/kokoro-BNTb6egA.mjs.map +0 -1
package/dist/kokoro-DFRQ1OeM.js +0 -20212
package/dist/kokoro-DFRQ1OeM.js.map +0 -1
package/dist/mcp-BvbriaBy.mjs.map +0 -1
package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
package/dist/repl-DveXw36T.mjs +0 -9
package/dist/skills-CD3Orlex.mjs.map +0 -1
package/dist/stt-CpLYbGFd.mjs +0 -433
package/dist/stt-CpLYbGFd.mjs.map +0 -1
package/dist/stt-DRPLEEHB.mjs +0 -3
package/dist/stt-Te8Qz-Ay.js +0 -433
package/dist/stt-Te8Qz-Ay.js.map +0 -1
package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
package/dist/transformers.web-DokyH3rP.js +0 -3
package/dist/transformers.web-M6mCnEYJ.js +0 -30382
package/dist/transformers.web-M6mCnEYJ.js.map +0 -1
package/dist/tts-C0xx3CtE.js +0 -724
package/dist/tts-C0xx3CtE.js.map +0 -1
package/dist/tts-DXgsKGCe.mjs +0 -3
package/dist/tts-DeGANMNV.mjs +0 -730
package/dist/tts-DeGANMNV.mjs.map +0 -1
package/dist/types-CiTc7ez3.d.mts.map +0 -1
package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0

package/docs/browser.md CHANGED

@@ -1,6 +1,93 @@
 # Browser Usage
-Run LLMs, TTS, and STT directly in the browser with WebGPU acceleration. No server required.
+Run models directly in the browser with WebGPU. No server required.
+Browser inference runs on the **native engine** — the React hooks at
+`@tryhamster/gerbil/gpu/hooks` (`useEngine` / `useChat` / `useText` / `useVision` /
+`useEmbedding` / `useTTS` / `useSTT` / `useVoiceChat` / `useMemory`),
+backed by the from-scratch WGSL `WebGPUEngine`. Pure compute shaders, no ONNX, no
+transformers.js. This is the supported path for text, vision, embeddings, and speech, and the
+lane the Gerbil site itself runs on.
+> The old inline transformers.js/ONNX worker hooks (`useChat`, `useSpeech`, `useVoiceInput`,
+> `useEmbedding`, `createGerbilWorker`, `preload*`) have been **removed** from
+> `@tryhamster/gerbil/browser`. `@tryhamster/gerbil/browser` now exports only device/WebGPU
+> utilities (`isModelSafeForDevice`, `detectMemoryCrash`, `downloadModelChunked`,
+> `checkWebGPUCapabilities`, `getBrowserDiagnostics`, …). The "Legacy Worker Lane" sections
+> below are retained for historical reference and no longer reflect the shipped API.
+> **Pre-1.0.** APIs may still shift before 1.0.
+## Native Engine (recommended)
+```tsx
+import { useEngine } from "@tryhamster/gerbil/gpu/hooks";
+function Chat() {
+  const { complete, completion, isLoading, isGenerating, tps } = useEngine({
+    model: "mlx-community/Qwen3.5-0.8B-4bit",
+    autoLoad: true, // dtype defaults to "auto": int4 on mobile, native on desktop
+  });
+  if (isLoading) return <div>Loading model…</div>;
+  return (
+    <div>
+      <button onClick={() => complete("Write a haiku about coding")} disabled={isGenerating}>
+        Generate
+      </button>
+      <p>{completion}</p>
+      {isGenerating && <span>{tps?.toFixed(1)} tok/s</span>}
+    </div>
+  );
+}
+```
+`useEngine` owns the engine lifecycle — load, unload, hot-swap on config change, and
+**reference-counted sharing** so multiple components requesting the same
+`model|dtype|vision|embedding|maxSeqLen` share ONE engine (weights uploaded to the GPU once).
+```typescript
+const {
+  complete,        // (prompt, opts?) => Promise<string> — streams into `completion`
+  describeImage,   // (image, prompt?, opts?) => Promise<string> — needs enableVision
+  embed,           // (text, { taskType }?) => Promise<Float32Array> — needs embedding
+  similarity,      // (a, b) => Promise<number> — needs embedding
+  completion,      // string — current text, streams token by token
+  isLoading, loadingProgress, isGenerating, isReady, tps, error, errorKind,
+  load, stop, dispose,
+} = useEngine({
+  model: "mlx-community/Qwen3.5-0.8B-4bit", // HF repo id
+  dtype: "auto",        // "auto" (default) | "f32" | "q4"
+  maxSeqLen,            // default: 2048 mobile / 4096 desktop
+  autoLoad: false,      // load on mount
+  enableVision: false,  // build the ViT so describeImage() works
+  embedding: false,     // load as an embedding model
+  onReady, onError,     // onError(err, kind) — kind: "no-webgpu" | "oom" | …
+});
+```
+Use `enableVision: true` for image→text (see [Vision](./vision.md)) and `embedding: true`
+for embeddings (see [Embeddings](./embeddings.md)).
+### Browser support (native engine)
+- **Chrome / Edge 113+**
+- **Safari 26+ (iOS / iPadOS 26+)**
+- **Firefox 141+**
+On devices without WebGPU the hook reports an `errorKind` of `"no-webgpu"` rather than
+silently degrading.
+---
+## Legacy Worker Lane (removed — historical reference only)
+> **Removed.** Everything below this point documents the old inline transformers.js/ONNX
+> worker hooks, which have been **deleted** from the package (no `useChat`/`useSpeech`/
+> `useVoiceInput`/`useEmbedding` worker exports, no `createGerbilWorker`, no `preload*`
+> functions, and no ONNX/transformers.js dependency). It is kept here only for historical
+> reference. Use the native hooks from `@tryhamster/gerbil/gpu/hooks` instead.
 ## Quick Start (React)
@@ -34,6 +121,40 @@ function Chat() {
 That's it! The hook handles model loading, streaming, and state management.
+## Model Preloading
+Download models during app initialization so they're ready when users need them:
+```typescript
+import {
+  preloadChatModel,
+  preloadEmbeddingModel,
+  preloadTTSModel,
+  preloadSTTModel
+} from "@tryhamster/gerbil/browser";
+// During app initialization
+async function initApp() {
+  // Preload LLM
+  await preloadChatModel("qwen3-0.6b", {
+    onProgress: (p) => {
+      if (p.status === "downloading") {
+        console.log(`Downloading ${p.file}: ${p.progress}%`);
+      }
+    },
+  });
+  // Preload other models as needed
+  await preloadEmbeddingModel("Xenova/all-MiniLM-L6-v2");
+  await preloadTTSModel("kokoro-82m");
+  await preloadSTTModel("whisper-tiny.en");
+}
+initApp();
+```
+After preloading, hooks like `useChat` will load instantly from IndexedDB cache.
 ## React Hooks
 ### `useChat`
@@ -220,6 +341,9 @@ const {
 #### Vision (Image Analysis)
+> For native image→text, use `useEngine({ enableVision: true }).describeImage(...)` —
+> see [Vision docs](./vision.md). The legacy-lane example below uses the retired ONNX worker.
 Use `useCompletion` with a vision model to analyze images:
 ```tsx
@@ -227,7 +351,7 @@ import { useCompletion } from "@tryhamster/gerbil/browser";
 function ImageAnalyzer() {
   const { complete, completion, isLoading, isGenerating } = useCompletion({
-    model: "ministral-3b",  // Vision model
+    model: "ministral-3b",  // Vision model (legacy lane)
     maxTokens: 2048,
   });
   const [imageUrl, setImageUrl] = useState<string | null>(null);
@@ -456,6 +580,75 @@ for await (const chunk of gerbil.speakStream("Long text...")) {
 }
 ```
+## Embeddings Hook
+### `useEmbedding`
+Generate embeddings for semantic search and similarity:
+```tsx
+import { useEmbedding } from "@tryhamster/gerbil/browser";
+function SemanticSearch() {
+  const { embed, similarity, search, isLoading, isReady, load } = useEmbedding({
+    model: "Xenova/all-MiniLM-L6-v2",  // Default
+    autoLoad: false,
+  });
+  if (isLoading) return <div>Loading embedding model...</div>;
+  const handleSearch = async () => {
+    const results = await search("capital of France", [
+      "Paris is beautiful",
+      "London is in England",
+      "Dogs are pets",
+    ], 2);  // topK = 2
+    console.log(results);
+    // [{ text: "Paris is beautiful", score: 0.89, index: 0 }, ...]
+  };
+  const handleSimilarity = async () => {
+    const score = await similarity("Hello world", "Hi there");
+    console.log(score); // 0.85
+  };
+  return (
+    <div>
+      <button onClick={handleSearch}>Search</button>
+      <button onClick={handleSimilarity}>Compare</button>
+    </div>
+  );
+}
+```
+### Options
+```typescript
+const {
+  // Actions
+  embed,            // (text: string) => Promise<number[]>
+  embedBatch,       // (texts: string[]) => Promise<{ vector, text }[]>
+  similarity,       // (a: string, b: string) => Promise<number>
+  search,           // (query: string, corpus: string[], topK?) => Promise<SearchResult[]>
+  findNearest,      // (embedding: number[], candidates: string[], topK?) => Promise<SearchResult[]>
+  cosineSimilarity, // (a: number[], b: number[]) => number (sync)
+  load,             // () => void - manually load model
+  // State
+  isLoading,        // boolean - model loading
+  isReady,          // boolean - model ready
+  loadingProgress,  // { status, message?, progress? }
+  error,            // string | null
+} = useEmbedding({
+  model: "Xenova/all-MiniLM-L6-v2",  // Embedding model
+  normalize: true,                    // Normalize vectors (default: true)
+  autoLoad: false,                    // Load on mount (default: false)
+  onReady: () => {},
+  onError: (err) => {},
+});
+```
 ## Low-Level API
 For full control, use `createGerbilWorker` directly:
@@ -533,22 +726,24 @@ const info = await getWebGPUInfo();
 // { supported: true, adapter: "Apple", device: "Apple M4 Max" }
 ```
-## Models
+## Models (legacy worker lane)
 | Model | Size | Best For |
 |-------|------|----------|
 | `qwen3-0.6b` | ~400MB | General use, thinking mode |
 | `smollm2-360m` | ~250MB | Faster, smaller |
 | `smollm2-135m` | ~100MB | Fastest, basic tasks |
-| `ministral-3b` | ~2.5GB | **Vision** — image analysis |
-Models are cached in IndexedDB after first download.
+> For vision, embeddings, and speech use the native engine (`useEngine`) — see the
+> [Native Engine](#native-engine-recommended) section above.
+Legacy-lane models are cached in IndexedDB after first download.
-## Browser Support
+## Browser Support (legacy worker lane)
 - **Chrome/Edge 113+** — Full WebGPU support
-- **Safari 18+** — WebGPU support (may have quirks)
-- **Firefox** — WebGPU behind flag, not recommended
+- **Safari 26+ (iOS/iPadOS 26+)** — WebGPU support
+- **Firefox 141+** — WebGPU support
 ## Troubleshooting
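
Reviewer note on the `browser.md` hunks above: the added text says `useEngine` shares one engine between components that request the same `model|dtype|vision|embedding|maxSeqLen` key, using reference counting. The diff does not show how the package implements this, so the following is only a minimal sketch of the general technique. Every name in it (`EngineCache`, `engineKey`, `EngineKeyParts`, `Disposable`) is hypothetical and not part of the gerbil API.

```typescript
// Hypothetical sketch: none of these names come from @tryhamster/gerbil.
interface EngineKeyParts {
  model: string;
  dtype: string;
  vision: boolean;
  embedding: boolean;
  maxSeqLen: number;
}

interface Disposable {
  destroy(): void;
}

interface Entry<E extends Disposable> {
  engine: Promise<E>;
  refs: number;
}

// Mirrors the `model|dtype|vision|embedding|maxSeqLen` key quoted in the docs.
function engineKey(p: EngineKeyParts): string {
  return [p.model, p.dtype, p.vision, p.embedding, p.maxSeqLen].join("|");
}

class EngineCache<E extends Disposable> {
  private entries = new Map<string, Entry<E>>();
  private create: (p: EngineKeyParts) => Promise<E>;

  constructor(create: (p: EngineKeyParts) => Promise<E>) {
    this.create = create;
  }

  // The first caller for a key triggers creation; later callers share the same promise.
  acquire(p: EngineKeyParts): Promise<E> {
    const key = engineKey(p);
    let entry = this.entries.get(key);
    if (!entry) {
      entry = { engine: this.create(p), refs: 0 };
      this.entries.set(key, entry);
    }
    entry.refs += 1;
    return entry.engine;
  }

  // The last release for a key drops the entry and destroys the engine.
  async release(p: EngineKeyParts): Promise<void> {
    const key = engineKey(p);
    const entry = this.entries.get(key);
    if (!entry) return;
    entry.refs -= 1;
    if (entry.refs <= 0) {
      this.entries.delete(key);
      (await entry.engine).destroy();
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Caching the creation promise rather than the resolved engine means two components that mount at the same moment still trigger only one load. In a React hook, `acquire` would typically run in an effect and `release` in its cleanup.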

package/docs/embeddings.md ADDED

@@ -0,0 +1,156 @@
+# Embeddings
+Gerbil generates text embeddings natively on the WebGPU engine using **EmbeddingGemma-300M**
+— a bidirectional Gemma3 encoder with mean pooling and a 2-layer Dense head, producing
+768-dim, L2-normalized vectors. Runs on-device (including iPad Safari), no ONNX, no API keys.
+> **Pre-1.0.** `engine.embed()` is the native path. The old ONNX/transformers.js embedding
+> lane (MiniLM/BGE/GTE) has been removed. The `Gerbil`-class helpers (`embed`, `similarity`,
+> `search`, `findNearest`) still work but now run native EmbeddingGemma under the hood (see
+> [below](#gerbil-class-embeddings-native-wrapper)).
+## Quick Start
+### Node
+```typescript
+import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
+const engine = await WebGPUEngine.create({
+  repo: "mlx-community/embeddinggemma-300m-4bit",
+  embedding: true,
+});
+// EmbeddingGemma is asymmetric — queries and documents use different prefixes.
+const query = await engine.embed("capital of France", { taskType: "query" });
+const doc = await engine.embed("Paris is the capital of France.", { taskType: "document" });
+// Vectors are unit-norm, so cosine similarity is just a dot product.
+const dot = query.reduce((s, v, i) => s + v * doc[i], 0);
+console.log(dot); // ~0.7+
+engine.destroy();
+```
+`embed()` returns a `Float32Array` of length 768 (EmbeddingGemma) with unit L2 norm.
+### React (Browser)
+```tsx
+import { useEngine } from "@tryhamster/gerbil/gpu/hooks";
+function SemanticSearch() {
+  const { embed, similarity, isLoading } = useEngine({
+    model: "mlx-community/embeddinggemma-300m-4bit",
+    embedding: true,
+    autoLoad: true,
+  });
+  if (isLoading) return <div>Loading embedding model…</div>;
+  const compare = async () => {
+    // similarity() embeds a as a query and b as a document, returns cosine.
+    const score = await similarity("Hello world", "Hi there");
+    console.log(score);
+  };
+  return <button onClick={compare}>Compare</button>;
+}
+```
+The hook exposes `embed(text, { taskType })` (defaults to `"query"`) and
+`similarity(a, b)`.
+## Asymmetric tasks
+EmbeddingGemma uses different task prefixes for queries vs documents. Pass `taskType`, or a
+raw `taskPrompt` for non-retrieval tasks (clustering / classification / STS):
+```typescript
+await engine.embed(text, { taskType: "query" });      // "task: search result | query: "
+await engine.embed(text, { taskType: "document" });   // "title: none | text: "
+await engine.embed(text, { taskPrompt: "task: clustering | query: " });
+```
+## API
+```typescript
+interface EmbedOptions {
+  /** EmbeddingGemma: "query" (default) or "document". */
+  taskType?: "query" | "document";
+  /** EmbeddingGemma: raw task prefix, overrides taskType. */
+  taskPrompt?: string;
+  /** Qwen3-Embedding: instruction prefix for query embeddings. */
+  instruction?: string;
+  /** Max tokens to encode (longer inputs are truncated). */
+  maxTokens?: number;
+}
+// async embed(text: string, options?: EmbedOptions): Promise<Float32Array>
+```
+`embed()` requires an engine loaded with `{ embedding: true }`. The pooling strategy is
+chosen by architecture: EmbeddingGemma mean-pools over all tokens; Qwen3-Embedding uses
+last-token (EOS-position) pooling.
+## RAG
+EmbeddingGemma pairs with `@tryhamster/gerbil/memory` for token-budgeted retrieval, or you
+can build a simple pipeline by hand:
+```typescript
+const engine = await WebGPUEngine.create({
+  repo: "mlx-community/embeddinggemma-300m-4bit",
+  embedding: true,
+});
+// Index documents.
+const docs = await loadDocuments();
+const index = [];
+for (const text of docs) {
+  index.push({ text, vector: await engine.embed(text, { taskType: "document" }) });
+}
+// Retrieve.
+const q = await engine.embed(question, { taskType: "query" });
+const ranked = index
+  .map((d) => ({ text: d.text, score: d.vector.reduce((s, v, i) => s + v * q[i], 0) }))
+  .sort((a, b) => b.score - a.score)
+  .slice(0, 3);
+```
+## Other native embedders
+- **Qwen3-Embedding-0.6B** (`Qwen/Qwen3-Embedding-0.6B`) — also supported natively
+  (`{ embedding: true }`); uses last-token pooling and an optional `instruction` prefix.
+  Larger (BF16 OOMs iPad); EmbeddingGemma is the recommended default.
+## Models
+| Model | Repo | Dim | Notes |
+|-------|------|-----|-------|
+| **EmbeddingGemma-300M** | `mlx-community/embeddinggemma-300m-4bit` | 768 | Default; asymmetric; runs on iPad |
+| Qwen3-Embedding-0.6B | `Qwen/Qwen3-Embedding-0.6B` | 1024 | Last-token pooling; desktop |
+---
+## `Gerbil`-class embeddings (native wrapper)
+> The ONNX/transformers.js embedding lane (MiniLM/BGE/GTE) has been removed. The `Gerbil`-class
+> helpers below still work but now run native EmbeddingGemma under the hood (768-dim,
+> WebGPU-required). The old browser `useEmbedding` worker hook is gone — use `useEmbedding`
+> from `@tryhamster/gerbil/gpu/hooks`.
+```typescript
+import { Gerbil } from "@tryhamster/gerbil";
+const g = new Gerbil();
+const { vector } = await g.embed("Hello world"); // number[768] (EmbeddingGemma)
+const { score } = await g.similarity("Hello world", "Hi there");
+const results = await g.search("capital of France", ["Paris…", "London…"]);
+```
+## See Also
+- [Browser Hooks](./browser.md) — React hooks
+- [Vision](./vision.md), [TTS](./tts.md), [STT](./stt.md)
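
Reviewer note on the `embeddings.md` hunk above: the added doc states that EmbeddingGemma mean-pools over all tokens, that Qwen3-Embedding uses last-token pooling, and that the resulting vectors are L2-normalized so cosine similarity reduces to a dot product. The sketch below illustrates those two pooling strategies on a plain row-major `[tokens x dim]` array. It is an assumption-laden illustration, not the engine's code: the helper names are hypothetical, and it leaves out the 2-layer Dense head that the doc says EmbeddingGemma applies.

```typescript
// Hypothetical helpers for illustration; not part of the gerbil API.
// `hidden` is row-major [tokens x dim]: one hidden-state row per token.

function l2Normalize(v: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq) || 1; // guard against an all-zero vector
  return v.map((x) => x / norm);
}

// Mean pooling: average every token's hidden state, then normalize.
function meanPool(hidden: Float32Array, tokens: number, dim: number): Float32Array {
  const out = new Float32Array(dim);
  for (let t = 0; t < tokens; t++) {
    for (let d = 0; d < dim; d++) out[d] += hidden[t * dim + d];
  }
  for (let d = 0; d < dim; d++) out[d] /= tokens;
  return l2Normalize(out);
}

// Last-token pooling: take only the final token's hidden state, then normalize.
function lastTokenPool(hidden: Float32Array, tokens: number, dim: number): Float32Array {
  return l2Normalize(hidden.slice((tokens - 1) * dim, tokens * dim));
}

// For unit-norm vectors, the dot product equals cosine similarity.
function dot(a: Float32Array, b: Float32Array): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
```

Because both pooling functions return unit-norm vectors, `dot` can be used directly for ranking, which is what the hand-built RAG example in the hunk relies on.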

package/docs/gerbil-site-native-migration.md ADDED

@@ -0,0 +1,217 @@
+# gerbil-site → Native-Engine Migration Assessment
+**As of 2026-06-14. Assessment only — no code in `gerbil-site` was changed.**
+The marketing/docs site lives in a **separate repo** at `/Users/shenron/code/gerbil-site`.
+This document maps how that site does in-browser inference today onto the native
+WebGPU engine's browser API (`@tryhamster/gerbil` `src/browser/*` + `src/gpu/index.ts`),
+says which features can switch **today** vs which must **wait**, states the device-coverage
+consequence of dropping the ONNX fallback **plainly**, and gives a phased plan with the
+smallest first step.
+It pairs with the engine paper §29 (cross-device multi-modal parity) and §23 (the
+native-only architecture decision). The owner decision is native-only with **no kept
+transformers.js/ONNX fallback lane**; this doc is the concrete site-side consequence.
+---
+## 0. The one fact that reframes everything
+The site does **not** call `@huggingface/transformers` or `onnxruntime-web` directly
+anywhere in runtime code. Every grep hit for `pipeline(`, `AutoModel`, `AutoTokenizer`,
+`KokoroTTS`, `feature-extraction`, etc. is **docs prose / code samples** under
+`app/docs/**`. All real inference goes through **gerbil**, by two paths:
+- **Path A — gerbil browser hooks** (`@tryhamster/gerbil/browser`): `useCompletion`,
+  `useSpeech`, `useVoiceInput`. These wrap a **transformers.js + onnxruntime-web Web
+  Worker** that lives *inside the gerbil tgz*. transformers.js@3.8.1 and the pinned
+  `onnxruntime-web@1.21.0-dev…` are declared deps but consumed only transitively here.
+- **Path B — gerbil native engine** (`@tryhamster/gerbil/gpu`): the `WebGPUEngine`
+  (pure WGSL, no worker, no ONNX), dynamically imported in `hooks/useNativeEngine.ts`.
+So the migration is **not** "rip ONNX out of the site." It is: **move each gerbil
+browser hook from its transformers.js/ONNX worker backend to the native engine** — and
+the site *already has a working native path* for chat. The work is (a) extend the native
+hook coverage to embeddings + vision, (b) wire audio when native audio lands, and (c)
+decide when to flip the default backend.
+> The site depends on gerbil as a **local tgz at rc.26**
+> (`@tryhamster/gerbil: file:/Users/shenron/Code/gerbil/tryhamster-gerbil-1.0.0-rc.26.tgz`).
+> Any new native browser hooks must be published in a new rc and the tgz re-pinned before
+> the site can consume them.
+---
+## (a) What the site uses for inference today, and where
+Framework: **Next.js 14.2.0** (App Router, React 18), `next.config.js` aliases the gerbil
+*browser* bundle to an empty module server-side and loads ORT WASM/MJS from CDN, with
+COOP/COEP headers for SharedArrayBuffer threading.
+| Call site (file) | Modality | Path | Library actually used | Model id(s) |
+|---|---|---|---|---|
+| `components/PlaygroundFull.tsx:309` `useCompletion` | Chat/completion | A (hooks) | transformers.js worker | `qwen3-0.6b` default + `smollm2-*`, `LFM2-*-ONNX`, `qwen3*` |
+| `components/PlaygroundFull.tsx:318` `useCompletion({model:"ministral-3b"})` | Vision (image→text) | A (hooks) | transformers.js worker (`AutoModelForImageTextToText`) | `ministral-3b` |
+| `components/PlaygroundFull.tsx:366` `useSpeech` | TTS | A (hooks) | `kokoro-js` / transformers.js | `kokoro-82m` (def), `supertonic-66m` |
+| `components/PlaygroundFull.tsx:383` `useVoiceInput` | STT | A (hooks) | transformers.js (Whisper) | `whisper-tiny.en` … `whisper-large-v3-turbo` |
+| `components/PlaygroundFull.tsx` Embed tab (~1700) `similarity()` | Embeddings | — | **REMOVED / dead** (`// useEmbedding removed in this gerbil version`) | n/a |
+| `components/AISDKPlayground.tsx:159` `useCompletion` | Chat | A (hooks) | transformers.js worker | `qwen3-0.6b` |
+| `components/AISDKPlayground.tsx:174` `useCompletion({model:"ministral-3b"})` | Vision | A (hooks) | transformers.js worker | `ministral-3b` |
+| `components/AISDKPlayground.tsx:187` `useSpeech` | TTS | A (hooks) | kokoro-js | `kokoro-82m` |
+| `hooks/useNativeEngine.ts:204` `import("@tryhamster/gerbil/gpu")` → `WebGPUEngine.create/generate` | Chat **only** | **B (native)** | **native WGSL** | `mlx-community/Qwen3.5-0.8B-4bit` (def), `Qwen/Qwen3.5-0.8B`, `Qwen/Qwen3-0.6B`, GPTQ variants |
+Wiring/render sites: `components/Playground.tsx` chooses `PlaygroundNative` (native-only
+chat; other tabs disabled) vs `PlaygroundFull` (all modalities, hooks) off
+`localStorage["gerbil-backend"]`. Both `<Playground />` and `<AISDKPlayground />` render on
+`app/page.tsx` and `app/playground/page.tsx`, all `dynamic(..., { ssr:false })`.
+Two states worth flagging now:
+- **Embeddings are already broken** in rc.26: `useEmbedding` was removed; both playgrounds
+  null out `similarity`/`embed` but keep live Embed UI that would throw if clicked. So the
+  embeddings migration is also a *bug fix*.
+- The native path (`hooks/useNativeEngine.ts`) is the site's **own** hook calling
+  `WebGPUEngine` directly — it does **not** use a gerbil-published `useNativeEngine`
+  (there is no `@tryhamster/gerbil/gpu/hooks` export subpath; see §(b) caveat).
+---
+## (b) Which native browser hook/API replaces each ONNX/transformers.js call site
+The native engine surface (`@tryhamster/gerbil/gpu`, `src/gpu/index.ts`) is the class
+`WebGPUEngine`, constructed via `static create(options)` and exposing
+`generate()`, `embed()`, `describeImage()`, `encodeImage()`. The browser hooks
+(`@tryhamster/gerbil/browser`, `src/browser/*`) currently target the transformers.js
+worker; native hooks either exist privately (`src/browser/use-native-engine.ts`,
+intentionally **not** re-exported because the GPU engine drags in `@huggingface/hub`
+Node-only `node:fs` paths) or must be added.
+| Site call site | Today (ONNX/tfjs) | Native replacement | Native symbol(s) | Status |
+|---|---|---|---|---|
+| `useCompletion` (chat) | tfjs worker `generate` | `WebGPUEngine.create({repo, dtype, maxSeqLen, onProgress})` → `engine.generate(prompt, {maxTokens, sampling, systemPrompt, stopSequences, onToken})` | `WebGPUEngine.create`, `generate` (`src/gpu/index.ts`) | ✅ **today** — already wired in `hooks/useNativeEngine.ts` |
+| Embed tab (removed) | (was `useEmbedding`) | load with `{ embedding: true }`, then `engine.embed(text, {taskType:"query"|"document"})` → unit-L2 `Float32Array` (dim 768 for EmbeddingGemma) | `embed`, guard `isEmbedding` | ✅ **today** — model `mlx-community/embeddinggemma-300m-4bit` (173 MB, runs on iPad, paper §25) |
+| Vision (`ministral-3b`) | tfjs `AutoModelForImageTextToText` | load with `{ enableVision: true }` (Qwen3.5), then `engine.describeImage({pixels,width,height}, prompt, opts)` → `GenerateResult` | `describeImage`, guard `hasVision`; lower-level `encodeImage(patches, gridTHW)` | ✅ **today** — but **model changes** to a Qwen3.5 ViT checkpoint (the native ViT is Qwen3.5's own tower, not Ministral); paper §22 / §10 |
+| `useSpeech` (TTS) | kokoro-js / tfjs | **none yet** — OmniVoice native TTS in progress | — (`src/browser/use-speech.ts` stays tfjs) | ❌ **wait** (audio) |
+| `useVoiceInput` (STT) | tfjs Whisper | **none yet** — Moonshine native STT not started | — (`src/browser/use-voice-input.ts` stays tfjs) | ❌ **wait** (audio) |
+Native `WebGPUEngine` option/method reference (from `src/gpu/index.ts` + `model-loader.ts`,
+exact signatures):
+- `create(options)`: `options extends LoadModelOptions` (`repo` required HF id/URL,
+  `onProgress(loaded,total,message)`, `dtype?: "f32"|"q4"`, `revision?`, `hfToken?`) plus
+  `maxSeqLen?` (capped 4096), `kvMode?`, `enableVision?` (downloads ~192 MB ViT, Qwen3.5
+  only), `embedding?: boolean` (last-token pool + L2; on the Gemma encoder path it builds
+  the encoder graph instead). Flags: `get isEmbedding`, `get hasVision`.
+- `generate(prompt|ChatMessage[], {maxTokens?, stopSequences?, sampling?, systemPrompt?, onToken?})`
+  → `{text, tokensGenerated, tokensPerSecond, totalTime, finishReason, thinking?}`.
+- `embed(text, {instruction?, taskType?:"query"|"document", taskPrompt?, maxTokens?})` →
+  unit-L2 `Float32Array`. Throws if not loaded with `{embedding:true}`.
+- `describeImage(image, prompt?, options?)` where `image` is `{pixels,width,height}` **or**
+  `{patches, gridTHW}` → `GenerateResult`. Throws if not loaded with `{enableVision:true}`.
+**Caveat (publishing gap to fix first):** the native engine is exposed at
+`@tryhamster/gerbil/gpu`, but there is **no published React hook** wrapping it. The site
+solved this itself by writing `hooks/useNativeEngine.ts`. To migrate the other modalities
+cleanly, gerbil should publish proper native hooks (e.g. `useNativeChat`, `useNativeEmbedding`,
+`useNativeVision`) under a real subpath (today `src/browser/use-native-engine.ts` exists but
+is excluded from the `browser` barrel, and `@tryhamster/gerbil/gpu/hooks` is referenced in a
+comment but **not** declared in `package.json` `exports`). Until then, the site keeps wrapping
+`WebGPUEngine` directly, modality by modality, as it already does for chat.
+---
+## (c) Modality coverage map — switch TODAY vs WAIT
+| Site feature | Native today? | Native model | Notes |
+|---|---|---|---|
+| **Chat / completion** | ✅ **yes — already live** | Qwen3.5-0.8B (4bit), LFM2.5-350M | `useNativeEngine` already ships; LFM2.5 is the faster/smaller alt (paper §30) |
+| **Embeddings** | ✅ **yes** | EmbeddingGemma-300M (173 MB) | Runs on iPad (paper §25). Also fixes the currently-broken Embed tab |
+| **Vision (image→text)** | ✅ **yes** | Qwen3.5 ViT (`describeImage`) | Bit-exact vs HF, word-identical greedy output (paper §22). **Model swaps off `ministral-3b`** |
+| **TTS** | ❌ **wait** | OmniVoice (in progress) | Keep `useSpeech` on kokoro-js/tfjs until native audio validates |
+| **STT** | ❌ **wait** | Moonshine (not started) | Keep `useVoiceInput` on tfjs Whisper |
+Net: **chat, embeddings, and vision can all move to native today**; **audio must wait**.
+That maps exactly to the engine's "multi-modal parity minus audio" status (paper §29).
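The coverage table can be encoded as data so that each playground asks one function which backend to use. This is a sketch under stated assumptions: the `Modality` and `Backend` names, the `NATIVE_READY` map, and `backendFor` are illustrative, and routing non-WebGPU visitors to tfjs assumes the fallback is retained.

```typescript
type Modality = "chat" | "embeddings" | "vision" | "tts" | "stt";
type Backend = "native" | "tfjs";

// Mirrors the "Native today?" column of the table above.
const NATIVE_READY: Record<Modality, boolean> = {
  chat: true,
  embeddings: true,
  vision: true,
  tts: false, // wait for OmniVoice
  stt: false, // wait for Moonshine
};

// Native is chosen only when the modality is ready AND WebGPU is present.
// Everything else stays on the transformers.js path (assumed to be retained).
function backendFor(modality: Modality, hasWebGPU: boolean): Backend {
  return hasWebGPU && NATIVE_READY[modality] ? "native" : "tfjs";
}
```

Flipping a modality later (for example TTS once native audio validates) then becomes a one-line change to the map.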
+---
+## (d) Device/browser coverage of WebGPU-only — and the no-fallback consequence, stated plainly
+The native engine is **WebGPU-only**. There is **no WASM/CPU fallback** in the native path
+(`src/browser/backend-selector.ts`'s WASM tiers belong to the *transformers.js* path, not
+the native engine; native `WebGPUEngine.create` simply requires a WebGPU adapter).
+**Devices that gain native (faster, no mobile crashes):**
+- **iPad / iPhone Safari (iPadOS/iOS 26.5+, WebKit)** — the headline win; previously crashed,
+  now runs text + vision + embeddings (paper §17–§29). On older WebKit, the grouped-submit
+  `?group=N` dial is the compatibility lever (paper §18.3).
+- **Desktop Chrome / Edge 113+**, **desktop Safari 18+**, **Firefox 141+**.
+- **Android Chrome 113+**, Samsung Internet 25+ (per paper Appendix B).
+**The plain consequence of dropping the ONNX fallback:** **any device or browser without
+WebGPU loses in-browser inference entirely.** There is no graceful degradation to WASM/CPU
+in the native path — the engine **throws a clear error rather than degrading** (PROJECT-STATE
+§3, "No-WebGPU / old devices: not targeted"). Concretely, the users who lose support are:
+older iOS/iPadOS (pre-26 WebKit where WebGPU is absent or buggy), older desktop browsers,
+locked-down enterprise browsers with WebGPU disabled, and low-end Android without a WebGPU
+adapter. Today those users fall back to the slow-but-working transformers.js WASM path; a
+hard native-only cutover **removes that safety net**. This is a deliberate owner decision
+(paper §23: a permanent fallback "assumes defeat to begin with") — but the site must own the
+UX of it: feature-detect WebGPU up front (`isWebGPUSupported` is already imported in the
+playgrounds) and show an explicit "this demo needs WebGPU" state instead of a silent failure.
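The up-front check can be sketched with the standard `navigator.gpu.requestAdapter()` call. This is a standalone illustration, not the implementation of gerbil's `isWebGPUSupported`; the `NavigatorLike` and `WebGPUState` types and the `detectWebGPU` function are hypothetical, and the navigator object is passed in so the logic can be exercised outside a browser.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type WebGPUState =
  | { kind: "ready" }
  | { kind: "unsupported"; reason: string };

// Resolves to an explicit state the UI can render, instead of failing silently.
async function detectWebGPU(nav: NavigatorLike | undefined): Promise<WebGPUState> {
  if (!nav || !nav.gpu) {
    return { kind: "unsupported", reason: "navigator.gpu is not available" };
  }
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter
      ? { kind: "ready" }
      : { kind: "unsupported", reason: "no WebGPU adapter was returned" };
  } catch {
    return { kind: "unsupported", reason: "requesting a WebGPU adapter failed" };
  }
}
```

In a page this would be called as `detectWebGPU(navigator)`, rendering the "this demo needs WebGPU" state whenever `kind` is `"unsupported"`.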
+---
+## (e) Phased migration plan (smallest first step first)
+**Phase 0 — publish native hooks + re-pin the tgz (prerequisite, gerbil-side).**
+In gerbil, expose browser-safe native React hooks (`useNativeChat`/`useNativeEmbedding`/
+`useNativeVision`) under a declared `exports` subpath, fixing the `@huggingface/hub`
+`node:fs` leak that currently keeps `use-native-engine.ts` out of the barrel. Cut a new rc,
+rebuild the tgz, re-pin `@tryhamster/gerbil` in the site. (If this slips, the site can keep
+hand-wrapping `WebGPUEngine` as it does for chat — but publishing is the clean path.)
+**Phase 1 — SMALLEST FIRST STEP: make native chat the default behind WebGPU detection.**
+The native chat path **already exists and works** (`hooks/useNativeEngine.ts` →
+`WebGPUEngine.generate`). The minimal change is in `components/Playground.tsx`: when
+`isWebGPUSupported()` is true, default `localStorage["gerbil-backend"]` to native
+(`PlaygroundNative`) instead of requiring a manual toggle; keep `PlaygroundFull` (tfjs) as the
+explicit opt-out and the no-WebGPU path. Zero new gerbil API needed, fully reversible, and it
+flips the highest-traffic modality to native first.
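The Phase 1 default can be sketched as one pure function. Only the `gerbil-backend` key comes from the text above; the stored values `"native"` and `"tfjs"`, the `StorageLike` interface, and `resolveBackend` are assumptions made for illustration.

```typescript
// Structural subset of the Web Storage API, so the logic is testable anywhere.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const BACKEND_KEY = "gerbil-backend"; // key named in the plan above
type Backend = "native" | "tfjs"; // stored values are assumed

function resolveBackend(storage: StorageLike, webgpuSupported: boolean): Backend {
  // No WebGPU: always the tfjs playground, and leave any stored choice alone.
  if (!webgpuSupported) return "tfjs";

  const stored = storage.getItem(BACKEND_KEY);
  // An explicit opt-out is respected.
  if (stored === "tfjs") return "tfjs";

  // Otherwise default to native and persist that default.
  if (stored !== "native") storage.setItem(BACKEND_KEY, "native");
  return "native";
}
```

`Playground.tsx` could call `resolveBackend(localStorage, isWebGPUSupported())` once on mount and render `PlaygroundNative` or `PlaygroundFull` from the result, which keeps the change reversible by reverting a single function.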
+**Phase 2 — fix + migrate embeddings to native (also un-breaks the dead tab).**
+Replace the removed `useEmbedding` in `PlaygroundFull`/`AISDKPlayground` with a native
+embedding hook loading `mlx-community/embeddinggemma-300m-4bit` + `engine.embed(text,
+{taskType})`. This both ships native embeddings and repairs the currently-throwing Embed tab.
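Because `embed` is described as returning unit-L2 vectors, cosine similarity reduces to a plain dot product. The sketch below shows how an Embed tab might rank documents against a query; `dot` and `rankBySimilarity` are hypothetical helpers, and the inputs are assumed to be already normalized.

```typescript
// Dot product of two equal-length vectors. For unit-length inputs this
// equals their cosine similarity.
function dot(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Returns document indices ordered from most to least similar to the query.
function rankBySimilarity(query: Float32Array, docs: Float32Array[]): number[] {
  return docs
    .map((doc, index) => ({ index, score: dot(query, doc) }))
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.index);
}
```

The query vector would come from `embed(text, {taskType: "query"})` and the document vectors from `embed(text, {taskType: "document"})`, per the option names listed earlier.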
+**Phase 3 — migrate vision to native, swapping the model.**
+Replace the `ministral-3b` `useCompletion` vision instances with a native vision hook:
+`WebGPUEngine.create({repo: <Qwen3.5 ViT checkpoint>, enableVision:true})` +
+`describeImage({pixels,width,height})`. The host only has to decode the image into
+`{pixels,width,height}`; the engine's `describeImage` handles the pixel→patch
+preprocessing internally. A copy/UI change is also needed because the model id and
+capabilities differ from Ministral.
+**Phase 4 — keep audio on tfjs; flip when native audio lands.**
+Leave `useSpeech` (TTS) and `useVoiceInput` (STT) on the transformers.js path. Swap TTS to
+OmniVoice when it validates, then STT to Moonshine. This is the only phase gated on engine work
+not yet done.
+**Phase 5 — retire the tfjs path for non-audio (optional, end-state).**
+Once Phases 1–3 are stable and audio is native (Phase 4), the transformers.js/ONNX worker is
+needed only if an explicit no-WebGPU fallback is kept. Removing it matches paper §23
+(`chrome-backend.ts` slated for deletion engine-side). Whether to keep *any* tfjs fallback at
+all is the owner call in §(d).
+---
+## (f) The single biggest risk
+**Losing the WebGPU-less audience with no graceful fallback, on a demo that is the product's
+shop window.** The site is gerbil's marketing front door: a visitor on an older iPhone, a
+locked-down work laptop, or any browser without WebGPU currently still gets a working (if slow)
+WASM demo. A native-only cutover turns that into a hard "unsupported" wall. The mitigation is
+non-negotiable and cheap: **feature-detect WebGPU at the top of every playground**
+(`isWebGPUSupported` is already imported there), default non-WebGPU visitors to either the
+retained tfjs `PlaygroundFull` or an explicit, friendly "needs WebGPU" state, and **never**
+flip the default to native without
+that guard in place. Secondary risks, in order: the **tgz/publishing coupling** (no native hooks
+are published yet, so the site is hand-wrapping `WebGPUEngine` — a versioning and maintenance
+liability until Phase 0 lands), the **vision model swap** (Ministral → Qwen3.5 ViT changes
+behavior and copy, not just an import), and **iPad re-download cost** (no durable cache without a
+PWA, paper §24 — a UX, not correctness, issue).