@tryhamster/gerbil 1.0.0-rc.8 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (179)

package/LICENSE +1 -1
package/README.md +247 -84
package/dist/architectures-C1I5V3Dt.mjs +6070 -0
package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
package/dist/browser/index.d.ts +264 -588
package/dist/browser/index.d.ts.map +1 -1
package/dist/browser/index.js +585 -2334
package/dist/browser/index.js.map +1 -1
package/dist/cli.mjs +625 -1098
package/dist/cli.mjs.map +1 -1
package/dist/defaults-9komdrbY.mjs +24 -0
package/dist/defaults-9komdrbY.mjs.map +1 -0
package/dist/frameworks/express.d.mts +1 -3
package/dist/frameworks/express.d.mts.map +1 -1
package/dist/frameworks/express.mjs +7 -7
package/dist/frameworks/express.mjs.map +1 -1
package/dist/frameworks/fastify.d.mts +1 -1
package/dist/frameworks/fastify.d.mts.map +1 -1
package/dist/frameworks/fastify.mjs +3 -3
package/dist/frameworks/fastify.mjs.map +1 -1
package/dist/frameworks/hono.d.mts +1 -1
package/dist/frameworks/hono.d.mts.map +1 -1
package/dist/frameworks/hono.mjs +4 -4
package/dist/frameworks/hono.mjs.map +1 -1
package/dist/frameworks/next.d.mts +3 -2
package/dist/frameworks/next.d.mts.map +1 -1
package/dist/frameworks/next.mjs +4 -4
package/dist/frameworks/next.mjs.map +1 -1
package/dist/frameworks/react.d.mts +1 -1
package/dist/frameworks/trpc.d.mts +1 -1
package/dist/frameworks/trpc.d.mts.map +1 -1
package/dist/frameworks/trpc.mjs +4 -4
package/dist/frameworks/trpc.mjs.map +1 -1
package/dist/gerbil-BHrJJIa4.mjs +1656 -0
package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
package/dist/gerbil-BT9fCydo.d.mts +488 -0
package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
package/dist/gerbil-DomNfIr1.mjs +4 -0
package/dist/gpu/hooks.d.mts +520 -0
package/dist/gpu/hooks.d.mts.map +1 -0
package/dist/gpu/hooks.mjs +1188 -0
package/dist/gpu/hooks.mjs.map +1 -0
package/dist/gpu/index.d.mts +2 -0
package/dist/gpu/index.mjs +6 -0
package/dist/gpu-33qCAtHW.mjs +3615 -0
package/dist/gpu-33qCAtHW.mjs.map +1 -0
package/dist/index-Dgmb2kE3.d.mts +245 -0
package/dist/index-Dgmb2kE3.d.mts.map +1 -0
package/dist/index-jEAL2s-A.d.mts +2022 -0
package/dist/index-jEAL2s-A.d.mts.map +1 -0
package/dist/index.d.mts +22 -487
package/dist/index.d.mts.map +1 -1
package/dist/index.mjs +13 -8
package/dist/index.mjs.map +1 -1
package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
package/dist/integrations/ai-sdk.d.mts +75 -6
package/dist/integrations/ai-sdk.d.mts.map +1 -1
package/dist/integrations/ai-sdk.mjs +131 -15
package/dist/integrations/ai-sdk.mjs.map +1 -1
package/dist/integrations/langchain.d.mts +1 -1
package/dist/integrations/langchain.d.mts.map +1 -1
package/dist/integrations/langchain.mjs +5 -5
package/dist/integrations/langchain.mjs.map +1 -1
package/dist/integrations/llamaindex.d.mts +1 -1
package/dist/integrations/llamaindex.d.mts.map +1 -1
package/dist/integrations/llamaindex.mjs +5 -5
package/dist/integrations/llamaindex.mjs.map +1 -1
package/dist/integrations/mcp-client.mjs +3 -3
package/dist/integrations/mcp-client.mjs.map +1 -1
package/dist/integrations/mcp.d.mts +3 -2
package/dist/integrations/mcp.d.mts.map +1 -1
package/dist/integrations/mcp.mjs +5 -5
package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
package/dist/mcp-1DaMsaBc.mjs.map +1 -0
package/dist/memory/index.d.mts +3 -0
package/dist/memory/index.mjs +6 -0
package/dist/memory-D1P7Tmda.mjs +4 -0
package/dist/memory-DVN0MnIG.mjs +132 -0
package/dist/memory-DVN0MnIG.mjs.map +1 -0
package/dist/memory-Dj0J1v88.mjs +294 -0
package/dist/memory-Dj0J1v88.mjs.map +1 -0
package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
package/dist/repl-jV5gcJFA.mjs +9 -0
package/dist/skills/index.d.mts +270 -320
package/dist/skills/index.d.mts.map +1 -1
package/dist/skills/index.mjs +5 -5
package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
package/dist/skills-DX8D59UH.mjs.map +1 -0
package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
package/dist/tools-DQ1mPUw5.mjs.map +1 -0
package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
package/dist/types-D6FiR_oh.d.mts.map +1 -0
package/dist/types-DQBe2lFo.d.mts +165 -0
package/dist/types-DQBe2lFo.d.mts.map +1 -0
package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
package/dist/vector-B0panuy6.mjs +95 -0
package/dist/vector-B0panuy6.mjs.map +1 -0
package/docs/PROJECT-STATE.md +321 -0
package/docs/adding-a-model-family.md +280 -0
package/docs/ai-sdk.md +70 -61
package/docs/architecture/overview.md +17 -7
package/docs/browser.md +203 -8
package/docs/embeddings.md +156 -0
package/docs/gerbil-site-native-migration.md +217 -0
package/docs/gpu-engine/architectures.md +398 -0
package/docs/gpu-engine/ir.md +372 -0
package/docs/gpu-engine/kernels.md +718 -0
package/docs/gpu-engine/paper.html +1759 -0
package/docs/gpu-engine/paper.md +2109 -0
package/docs/gpu-engine/safetensors.md +312 -0
package/docs/gpu-engine/tokenizer.md +302 -0
package/docs/memory-rag.md +91 -0
package/docs/metal-safari-intel.md +190 -0
package/docs/mobile-failure-diagnosis.md +124 -0
package/docs/mobile.md +99 -0
package/docs/observability.md +230 -0
package/docs/onnx-removal-plan.md +339 -0
package/docs/research/autoresearch-portable.md +904 -0
package/docs/research/dispatch-reduction-hivemind.md +84 -0
package/docs/research/ios-safari-model-caching.md +117 -0
package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
package/docs/research/native-stt-model-selection.md +49 -0
package/docs/research/native-tts-model-selection.md +90 -0
package/docs/research/native-vs-chromium-decision.md +152 -0
package/docs/research/nemotron-mamba2-inference.md +910 -0
package/docs/research/qwen35-multimodal.md +293 -0
package/docs/research/qwen36-gemma4-targets.md +337 -0
package/docs/research/sota-embedding-models.md +179 -0
package/docs/research/sota-mobile-models-2026.md +263 -0
package/docs/research/sota-modality-models.md +202 -0
package/docs/research/tps-baselines.md +71 -0
package/docs/research/webgpu-m4-reference.md +104 -0
package/docs/site-update-plan.md +155 -0
package/docs/structured-output.md +123 -0
package/docs/stt.md +63 -446
package/docs/tts.md +77 -499
package/docs/vision.md +100 -338
package/package.json +22 -7
package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
package/dist/gerbil-CJ3ifloF.mjs +0 -4
package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
package/dist/gerbil-qOTe1nl2.d.mts +0 -431
package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
package/dist/kokoro-BNTb6egA.mjs +0 -20210
package/dist/kokoro-BNTb6egA.mjs.map +0 -1
package/dist/kokoro-DFRQ1OeM.js +0 -20212
package/dist/kokoro-DFRQ1OeM.js.map +0 -1
package/dist/mcp-BvbriaBy.mjs.map +0 -1
package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
package/dist/repl-DveXw36T.mjs +0 -9
package/dist/skills-CD3Orlex.mjs.map +0 -1
package/dist/stt-CpLYbGFd.mjs +0 -433
package/dist/stt-CpLYbGFd.mjs.map +0 -1
package/dist/stt-DRPLEEHB.mjs +0 -3
package/dist/stt-Te8Qz-Ay.js +0 -433
package/dist/stt-Te8Qz-Ay.js.map +0 -1
package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
package/dist/transformers.web-DokyH3rP.js +0 -3
package/dist/transformers.web-M6mCnEYJ.js +0 -30382
package/dist/transformers.web-M6mCnEYJ.js.map +0 -1
package/dist/tts-C0xx3CtE.js +0 -724
package/dist/tts-C0xx3CtE.js.map +0 -1
package/dist/tts-DXgsKGCe.mjs +0 -3
package/dist/tts-DeGANMNV.mjs +0 -730
package/dist/tts-DeGANMNV.mjs.map +0 -1
package/dist/types-CiTc7ez3.d.mts.map +0 -1
package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0

package/docs/research/sota-modality-models.md ADDED

@@ -0,0 +1,202 @@
+# SOTA Small On-Device Models per Modality — Gerbil Native-Build Research (June 2026)
+Research date: **2026-06-13**. Engine baseline: Gerbil from-scratch WebGPU INT4 engine.
+Companion to `sota-mobile-models-2026.md` (which covers text LLMs). This doc covers the
+**multimodal modalities** the engine wants to support natively: TTS, STT, text embeddings, vision.
+**Method.** Facts below are verified against **live HuggingFace** wherever possible — `config.json`,
+the model file list (which reveals pipeline structure for ONNX TTS), and safetensors/blob sizes are
+ground truth. Each claim is tagged **[config]** (live HF config/file list), **[card/docs]** (official
+card or vendor doc), or **[web]** (Perplexity/secondary, softer). Knowledge-cutoff guesses were
+avoided; every repo below was hit live.
+## Gerbil op inventory (what "native-build tier" is measured against)
+Verified from `src/gpu/ir.ts` + `src/gpu/kernels/wgsl/`:
+- **Have (real kernels):** `matmul`, `matmul_int4`, flash `attention`, `causal_conv1d` +
+  `ConvStateUpdate`, `linear_attention` (Gated DeltaNet / Mamba-2 scan `MambaSSM`), `rmsnorm`,
+  `layernorm`, `rope` (incl. partial-rotary capable), `silu`, `swiglu`, `gelu`, `sigmoid_gate`,
+  `softmax`, `embedding`, `add`, `mul`.
+- **Stubbed (declared in IR, no kernel):** `Conv2d`, `AvgPool2d`, `CrossAttention`.
+Tier convention (from the `add-model-family` skill, same as the text-LLM doc):
+**Tier 1** = hours, reuse all ops (loader/config work only). **Tier 2** = days, 1–2 novel kernels.
+**Tier 3** = weeks, a new computation class (e.g. ONNX-graph interpreter, flow-matching sampler).
+---
+## TL;DR — the one to build first per modality
+| Modality | Build first | Why | Tier |
+|---|---|---|---|
+| **Embeddings** | **Qwen3-Embedding-0.6B** | Arch is literally `Qwen3ForCausalLM` — **zero new kernels**, reuses the engine's existing Qwen3 path. Ship in hours. | **Tier 1** |
+| **STT** | **Moonshine-base** (then -tiny) | Smallest real encoder-decoder transformer; raw-waveform Conv1d frontend reuses `causal_conv1d`; only a non-causal conv + cross-attention to add. | **Tier 2** |
+| **TTS** | **Kokoro-82M** (keep) as floor; **port none natively yet** | Every *better* small TTS (Kitten, Supertonic) ships **ONNX-graph-only** — no safetensors tensors to map to our IR. Native TTS is a Tier-3 lift. Recommend ONNX-runtime-web bridge short-term. | **Tier 3** |
+| **Vision** | **Skip standalone** if Qwen3.5-VL lands | A standalone VLM (SmolVLM2-256M) needs Conv2d + CrossAttention (Perceiver), two of our three stubbed ops, plus a separate SigLIP encoder to wire. Not worth it if Qwen3.5-VL covers vision natively. | **Tier 2–3** |
+**Single highest-ROI build: Qwen3-Embedding-0.6B.** It is the only cross-modality pick that is
+genuinely free on our engine.
+---
+## 1. Text Embeddings — clearest win
+### #1 (BUILD FIRST) — Qwen3-Embedding-0.6B — **Tier 1 (hours)**
+- **Repo:** `Qwen/Qwen3-Embedding-0.6B`
+- **Params:** 0.6B. **4-bit download:** ~0.4–0.45 GB (Q4 weights; Ollama packages ~0.64 GB w/ overhead). **[web]**
+- **Arch [config, verified live]:** `architectures: ["Qwen3ForCausalLM"]`, `model_type: qwen3`,
+  28 layers, hidden 1024, GQA 16 heads / 8 KV, head_dim 128, SiLU MLP, RoPE θ=1e6, RMSNorm,
+  tied embeddings, vocab 151669. **Decoder-only / causal**; embedding via **last-token pooling**
+  with an instruction prefix. MTEB multilingual ~64.3. **[config + web]**
+- **Why Tier 1:** this is the *same architecture family Gerbil already runs* (Qwen3 dense). The only
+  delta vs text-gen: stop after the final hidden state, take the last token's vector, L2-normalize —
+  no logits/sampling. **New kernels: none.** A pooling+normalize tail and an "embeddings mode" flag.
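Annotation: the "pooling+normalize tail" described above can be sketched on the CPU as follows. This is a minimal illustration, not Gerbil's implementation; the function name `lastTokenEmbedding` and the flat row-major `[seqLen, dim]` layout are assumptions, and `seqLen` is assumed to count only real (non-padding) tokens.

```javascript
// Hypothetical helper, not part of Gerbil's API.
// hidden: Float32Array holding final-layer hidden states, shape [seqLen, dim], row-major.
function lastTokenEmbedding(hidden, seqLen, dim) {
  if (hidden.length !== seqLen * dim) {
    throw new Error("hidden must contain seqLen * dim elements");
  }
  // Last-token pooling: copy the row that belongs to the final token.
  const start = (seqLen - 1) * dim;
  const out = Float32Array.from(hidden.subarray(start, start + dim));
  // L2-normalize so that a dot product between embeddings is a cosine similarity.
  let sumSq = 0;
  for (const v of out) sumSq += v * v;
  const norm = Math.sqrt(sumSq) || 1; // guard against an all-zero vector
  for (let i = 0; i < dim; i++) out[i] /= norm;
  return out;
}
```

For a two-token input with rows `[1, 2]` and `[3, 4]`, only the second row is used and the result is that row scaled to unit length.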
+### #2 — EmbeddingGemma-300M — **Tier 2 (days; bidirectional attention is the novelty)**
+- **Repo:** `google/embeddinggemma-300m` (**gated** — login required to fetch weights). **[config: gated]**
+- **Params:** 300M. **4-bit:** ~0.3–0.4 GB.
+- **Arch [card/web]:** `model_type: gemma3_text`, `sentence-transformers`, `pipeline: sentence-similarity`.
+  Encoder converted from Gemma 3 via the **T5Gemma** recipe → **bidirectional (non-causal) encoder**,
+  **mean pooling**, output dim 768 with **Matryoshka** 512/256/128 truncation. Best open model <500M on MTEB.
+- **Why Tier 2 not 1:** our `attention.wgsl` is a *causal* flash kernel. A bidirectional encoder needs
+  a **full (non-causal) attention mask** variant — a small kernel branch, plus mean-pool tail. Gating
+  also complicates distribution. Strong model, but Qwen3-Embedding is the free win.
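Annotation: a sketch of the mean-pool tail and Matryoshka truncation mentioned above, under stated assumptions. The helper names are hypothetical, padding tokens are assumed to be already excluded, and renormalizing after truncation is the usual convention for Matryoshka embeddings rather than something the doc specifies.

```javascript
// Hypothetical helpers, not part of Gerbil's API.
function l2Normalize(vec) {
  let sumSq = 0;
  for (const v of vec) sumSq += v * v;
  const norm = Math.sqrt(sumSq) || 1; // guard against an all-zero vector
  return Float32Array.from(vec, (v) => v / norm);
}

// hidden: Float32Array of shape [seqLen, dim], row-major; returns the per-dimension mean.
function meanPool(hidden, seqLen, dim) {
  const out = new Float32Array(dim);
  for (let t = 0; t < seqLen; t++) {
    for (let d = 0; d < dim; d++) out[d] += hidden[t * dim + d];
  }
  for (let d = 0; d < dim; d++) out[d] /= seqLen;
  return out;
}

// Matryoshka truncation: keep the leading targetDim components, then renormalize.
function truncateMatryoshka(embedding, targetDim) {
  if (targetDim > embedding.length) throw new Error("targetDim exceeds embedding size");
  return l2Normalize(embedding.subarray(0, targetDim));
}
```

A 768-dim embedding would be reduced with `truncateMatryoshka(embedding, 256)`; the shorter vector stays unit-length, so cosine comparisons remain valid.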
+### #3 — legacy tiny BERT-encoders (BGE-small / GTE-small / Nomic) — **Tier 2**
+- `BAAI/bge-small-en-v1.5` (~33M, dim 384, CLS pool), `thenlper/gte-small` (~33M, mean pool),
+  `nomic-ai/nomic-embed-text-v1.5` (~137M, dim 768, mean pool, MTEB ~62.3). **[web]**
+- All **bidirectional BERT encoders** → need the same non-causal attention + LayerNorm (have it) +
+  pooling. 4-bit sizes are tiny (30–280 MB). Good fallbacks if a sub-50M footprint matters, but lower
+  quality than Qwen3-Embedding / EmbeddingGemma and still Tier 2 (bidirectional mask).
+**Ranking:** Qwen3-Embedding-0.6B (build first, Tier 1) → EmbeddingGemma-300M (best quality/size if
+you add bidirectional attention + accept gating) → BGE/GTE/Nomic (ultralight Tier-2 fallbacks).
+---
+## 2. STT (Speech-to-Text) — Moonshine is the native-friendliest
+### #1 (BUILD FIRST) — Moonshine (base → tiny) — **Tier 2 (days)**
+- **Repos:** `UsefulSensors/moonshine-base`, `UsefulSensors/moonshine-tiny`
+- **Params:** base ~60M, tiny ~27M. **4-bit:** base ~60–90 MB, tiny ~30 MB.
+- **Arch [config, verified live]:** `MoonshineForConditionalGeneration`, encoder-decoder transformer,
+  hidden 416, 8 encoder + 8 decoder layers, **partial RoPE (factor 0.62)**, encoder GELU / decoder SiLU,
+  vocab 32768. **Frontend is raw 16 kHz waveform** (`Wav2Vec2FeatureExtractor`, `feature_size: 1`,
+  `do_normalize: false`) — **no mel spectrogram**: a stack of strided **Conv1d** layers subsamples the
+  raw audio. **[config]**
+- **Why Moonshine over Whisper:** Whisper needs a **2-conv log-mel frontend** (mel filterbank +
+  two Conv1d layers), absolute sinusoidal positions, and fixed 30 s padded windows. Moonshine has **no mel
+  step**, variable-length input, and **RoPE** — which we already have. Its conv subsampling maps onto
+  our existing **`causal_conv1d`** primitive (made non-causal/strided). It's the best architectural fit.
+- **New kernels:** (1) a **strided non-causal Conv1d** variant of `causal_conv1d` for the encoder
+  frontend; (2) **encoder-decoder CrossAttention** (currently stubbed) — the decoder attends to encoder
+  states. RMSNorm/RoPE/SiLU/softmax/matmul_int4 all reused. Partial-rotary is already supported.
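Annotation: a CPU reference for what a strided non-causal Conv1d computes can be useful when validating a new kernel. The sketch below is an assumption-laden illustration: `conv1dStrided` is a hypothetical name, it uses "valid" (no) padding, and it says nothing about Moonshine's actual kernel sizes or strides, which are not given here.

```javascript
// Hypothetical CPU reference, not Gerbil code.
// input:  array of cIn Float32Arrays, each of length len
// weight: weight[co][ci] is a Float32Array of kernel taps (length k)
// bias:   array of cOut numbers, or null
function conv1dStrided(input, weight, bias, stride) {
  const cIn = input.length;
  const len = input[0].length;
  const cOut = weight.length;
  const k = weight[0][0].length;
  // "Valid" padding: each output reads input[t*stride .. t*stride + k - 1],
  // so the window is not restricted to past samples (non-causal).
  const outLen = Math.max(0, Math.floor((len - k) / stride) + 1);
  const out = [];
  for (let co = 0; co < cOut; co++) {
    const row = new Float32Array(outLen);
    for (let t = 0; t < outLen; t++) {
      let acc = bias ? bias[co] : 0;
      for (let ci = 0; ci < cIn; ci++) {
        for (let j = 0; j < k; j++) {
          acc += weight[co][ci][j] * input[ci][t * stride + j];
        }
      }
      row[t] = acc;
    }
    out.push(row);
  }
  return out;
}
```

With stride 2 and kernel size 3, a length-5 input yields 2 output frames, which is the subsampling behaviour the encoder frontend relies on.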
+### #2 — Whisper tiny/base — **Tier 2–3 (more frontend work)**
+- `openai/whisper-tiny` (~39M), `openai/whisper-base` (~74M), `openai/whisper-small` (~244M).
+  Encoder-decoder transformer, **log-mel + 2× Conv** frontend, sinusoidal positions, cross-attention.
+  4-bit: tiny ~40 MB, base ~75 MB, small ~250 MB. **[web]**
+- **Why lower:** the **mel-spectrogram + 2× Conv1d frontend** and absolute positional embeddings are extra
+  computation classes vs Moonshine. Distil-Whisper variants (`distil-whisper/*`) shrink the decoder but
+  keep the same frontend. Pick Whisper only if you need its broader multilingual robustness; otherwise
+  Moonshine is leaner to build and run.
+### Not recommended for native (Tier 3): NVIDIA NeMo Canary / Parakeet
+- FastConformer encoders = **heavy depthwise-separable Conv2d + relative-position attention + CTC/RNN-T
+  transducer decoder**. Multiple new computation classes (transducer beam search, conv-heavy frontend).
+  High accuracy but the worst native-build fit of the STT options. **[web]**
+**Ranking:** Moonshine-base (build first) → Moonshine-tiny (smallest) → Whisper tiny/base (if
+multilingual robustness needed) → NeMo (skip for native).
+---
+## 3. TTS (Text-to-Speech) — better-than-Kokoro models are ONNX-locked
+The owner is right that **Kokoro-82M is last-gen**. The catch for *native* WebGPU: every credible
+2026 upgrade ships as a **frozen ONNX graph**, not safetensors weight tensors — so there is nothing
+to map onto Gerbil's IR without first writing an ONNX-graph importer. This makes native TTS the
+single hardest modality on our engine.
+### Landscape (verified live on HF)
+| Model | Repo | Size / format | Architecture |
+|---|---|---|---|
+| **Supertonic-3** | `Supertone/supertonic-3` | **ONNX only**, ~400 MB total: `text_encoder` 36 MB + `duration_predictor` 3.7 MB + `vector_estimator` **256 MB** + `vocoder` 101 MB | **Flow-matching / StyleTTS2-style pipeline** (NOT a transformer LM): text enc → duration predictor → flow-matching acoustic ("vector_estimator") → neural vocoder. 39 langs, on-device, 822 likes (trending, May 2026). **[config]** |
+| **Kitten TTS** | `KittenML/kitten-tts-nano-0.2`, `…nano-0.8-fp32`, `…mini-0.1` | **ONNX only** + `voices.npz`; nano ~25 MB, mini larger | Tiny (15–80M) non-autoregressive duration-predictor + lightweight vocoder; CPU-first. Beats Kokoro on **latency/size**, ~par on quality. **[config + web]** |
+| **Kokoro-82M** | `hexgrad/Kokoro-82M` | safetensors available; ~80 MB | StyleTTS2-derived: text enc + duration/pitch predictor + **iSTFT/HiFiGAN-style vocoder**. Last-gen but **the only one with real weight tensors**. **[web]** |
+| **Supertonic-2 / OuteTTS** | `Supertone/supertonic-2`, OuteTTS repos | ONNX (Supertonic) / LM-codec (OuteTTS) | OuteTTS = neural-codec LM (transformer that emits audio codec tokens + needs a codec decoder). Bigger, heavier. **[web]** |
+### Architecture reality vs Gerbil ops
+A StyleTTS2/Supertonic pipeline is **not a transformer LM**. It is: a small text/phoneme encoder
+(matmul + attention — fine), a **duration predictor** (Conv1d + small regressor), an acoustic model
+that is either **flow-matching/diffusion** (iterative sampler — a new computation class) or a mel
+decoder, and a **GAN/iSTFT vocoder** (transposed Conv1d + iSTFT — new kernels). None of the
+better-than-Kokoro models expose these as mappable safetensors; they're sealed ONNX subgraphs.
+### Recommendation
+1. **Short term:** keep **Kokoro-82M** as the quality floor, and ship newer/better TTS
+   (**Supertonic-3**, **Kitten**) via an **`onnxruntime-web` (WebGPU EP) bridge** rather than the
+   native engine. This gets "better than Kokoro" shipping now with zero kernel work.
+2. **If/when native TTS is mandated (Tier 3):** the most tractable native target is **Kokoro-82M
+   itself** (real safetensors), because porting it forces exactly the reusable primitives a native TTS
+   stack needs — **transposed/strided Conv1d, iSTFT, and an upsampling vocoder kernel** — which then
+   also unblock a future native Supertonic-class pipeline. Do NOT start with an ONNX-only model.
+**Ranking (native-build effort, ascending):** Kokoro-82M (only real-weights option; Tier 3) →
+everything else requires an ONNX importer first. **Best quality available now: Supertonic-3 via ONNX
+bridge.** Best size/latency: Kitten via ONNX bridge.
+---
+## 4. Vision — likely NOT needed standalone
+**Flag: if Qwen3.5-VL covers vision natively (separate investigation), a standalone vision model is
+not worth building.** Reasons: a standalone small VLM still drags in the exact ops we've stubbed
+(Conv2d, CrossAttention) plus a *second* model (a SigLIP encoder) and its own resampler — strictly
+more native-build surface than letting the text engine's existing transformer stack absorb vision
+tokens from a Qwen3.5-VL projector.
+### If a standalone is required — smallest good option
+**SmolVLM2-256M-Video-Instruct — Tier 2–3**
+- **Repo:** `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` (also 500M variant).
+- **Params:** 256M. **4-bit:** ~0.15–0.20 GB. **[config + web]**
+- **Arch [config, verified live]:** `SmolVLMForConditionalGeneration`. Text backbone = **Llama**
+  (`VLlama3ForCausalLM`, hidden 576, 30 layers, GQA 9/3 — Tier 1 on our engine). Vision = **SigLIP ViT**
+  with **Conv2d patch embedding**. A **Perceiver resampler** (`perceiver_config`, `scale_factor: 4`)
+  compresses vision tokens via **cross-attention**. **[config]**
+- **New kernels:** (1) **Conv2d** patch-embed (currently stubbed); (2) **CrossAttention** for the
+  Perceiver resampler (stubbed); (3) the SigLIP encoder is a second sub-model to wire (its attention is
+  bidirectional). The Llama text half is free. So even the *smallest* standalone VLM lights up **two of
+  our three stubs** plus a whole extra encoder.
+### Other standalone references (larger)
+- `vikhyatk/moondream2` (~1.6B), Moondream3 (larger): compact ViT + decoder; ~0.8–1.0 GB 4-bit. **[web]**
+- `Qwen/Qwen2.5-VL-3B-Instruct`: custom ViT + dynamic resolution, ~1.5–2 GB 4-bit; bigger but strong OCR. **[web]**
+- **Separate encoders** (if you ever want CLIP-style only): SigLIP2 (general) or FastVLM's **FastViTHD**
+  (hybrid conv-transformer, Conv2d-heavy, latency-optimized). Both need Conv2d. **[web]**
+**Recommendation:** **Do not build a standalone vision model** pending the Qwen3.5-VL native check. If
+forced, SmolVLM2-256M is the smallest — but budget for **Conv2d + CrossAttention** kernels and a SigLIP
+encoder. Those same two kernels are the long pole for any vision path, so build them once for
+Qwen3.5-VL rather than twice.
+---
+## Cross-modality build order (recommendation)
+1. **Qwen3-Embedding-0.6B** — Tier 1, hours, zero kernels. Ship embeddings immediately on the existing
+   Qwen3 path (add last-token pooling + normalize). **Highest ROI by far.**
+2. **Moonshine-base STT** — Tier 2, days. Add a strided non-causal Conv1d (from `causal_conv1d`) and
+   encoder-decoder CrossAttention. Unblocks STT and gives us the first real CrossAttention kernel.
+3. **Vision: wait for Qwen3.5-VL.** The Conv2d + CrossAttention kernels needed for SmolVLM are the same
+   ones Qwen3.5-VL needs — build them once, in the text-engine path, not for a throwaway standalone.
+4. **TTS: ONNX-bridge Supertonic-3/Kitten now**, native Kokoro later (Tier 3). Native TTS is the
+   biggest lift (flow-matching sampler + vocoder/iSTFT + transposed conv) and the worst fit today —
+   defer until a vocoder kernel family is justified.
+**Net:** one Tier-1 build (embeddings) ships a real new modality this week; the STT build (Moonshine)
+is the natural next step and yields the reusable CrossAttention kernel; vision and TTS are deliberately
+deferred behind shared-kernel and ONNX-bridge decisions rather than built blindly.

package/docs/research/tps-baselines.md ADDED

@@ -0,0 +1,71 @@
+# TPS Baselines & Goals
+Established 2026-06-12, after the four-bug mobile fix campaign (see `docs/mobile-failure-diagnosis.md`).
+Model: Qwen3.5-0.8B, MLX 4-bit (INT4), greedy decode. All outputs verified byte-identical across platforms.
+## Baselines (current)
+| Platform | Config | Decode tok/s | Notes |
+|---|---|---|---|
+| M4 Max, node-dawn | single encoder/pass | **207** (207.4/207.0/207.9 cooled) | after 2026-06-13 autoresearch session; warm peak 221, throttled floor ~128 |
+| M4 Max, node-dawn (pre-optimization) | single encoder/pass | 145.0 | post detection-fix baseline this session started from |
+| iPad (iPadOS 26.5, WebKit) | batch-all (1 CB) | **31.7** | historical zero-logits config — now correct |
+| iPad, sustained 200 tok | batch-all | **51.7** | t15 2026-06-13: no throttling over 200 tokens; includes optimizer kernel wins via dist |
+| iPad, submit floor | group=1, awaited | 6.4–7.7 | correctness floor for older WebKit |
+**2026-06-13 update:** the 50+ iPad goal is MET (51.7 sustained). New mobile goal: **70+ tok/s**
+(mobile prefill tuning — t13 showed prefill-heavy drops to 8.2 tok/s overall — plus continued
+kernel wins). maxseq=1024 verified working on device (37.1 tok/s, no crash).
+Group-size scaling on iPad (24-token gens): 1→6.4 · 8→19.7 · 32→24.6 · 64→28.8 · 128→29.7 · all→31.7.
+Interpretation: round-trip-bound until ~32/CB, then dispatch-bound.
+## Goals
+| Platform | Goal | Path |
+|---|---|---|
+| Desktop (M4 Max) | **180+ tok/s** | autoresearch loop: kernel tuning (K_THREADS/N_TILE/workgroup), fused KV append (+24 dispatches), fused GEMV+residual (+48), backlog P1 items |
+| iPad | **50+ tok/s** | single compute pass (vs pass-per-dispatch) on 26.5+, kernel opts shared with desktop, dispatch-count reduction (helps mobile 2-3x more than desktop) |
+| Load time (iPad) | < 15 s warm | HTTP-cache-served shard, quantize-on-GPU |
+## Engine consolidation gate (2026-06-13) — RESOLVED
+Tested transformers.js 4.2.0 on the iPad (iPadOS 26.5, WebKit) against the **same
+model** the native engine runs (Qwen3.5-0.8B, ONNX vs MLX-4bit), to decide
+whether one engine could replace two.
+| Engine / path | Result | tok/s |
+|---|---|---|
+| **Native WGSL** (Qwen3.5-0.8B) | works | **~51** |
+| transformers.js **WebGPU** (Qwen3.5-0.8B-ONNX q4) | works, coherent (handles the hybrid arch) | **6.9–11.6** |
+| transformers.js **WASM** (q4) | **fails** — `GatherBlockQuantized` op missing from the WASM backend (op coverage, not memory) | — |
+**Verdict:** transformers.js *does* run on mobile now (the original reason native
+was built is gone), but it's **~4–7× slower than native on text**, its WASM
+fallback can't run q4 models anywhere, and decode-heavy modalities (Whisper
+decode, VLM text, diffusion) will inherit that ~5× slowdown on mobile. So the
+decision is **two deliberate lanes, not one engine**:
+- **Native = fast text lane** — the models we implement (Qwen, Gemma next), plus
+  embeddings (cheap to add). Mobile-proven, ~5× faster.
+- **transformers.js = breadth lane** — STT/TTS/vision/image-gen + the ONNX zoo,
+  and the no-WebGPU WASM tier. **Open caveat:** decode-heavy modalities on mobile
+  are likely too slow via transformers.js; measure per-modality before assuming, and build
+  native where a mobile modality both matters and is too slow on transformers.js.
+**Also found:** the model re-downloads (~160s) on *every* iOS load for both
+engines — the Cache API isn't persisting on WebKit (likely eviction under
+storage pressure). Needs a durable fix (OPFS/IndexedDB, or chunked cache).
+## Production submit strategy (WebKit)
+OS-dependent: iPadOS/iOS 26.5+ → batch-all (bug fixed upstream). Older 26.x → group=64 if a
+startup coherence probe passes, else group=1 awaited. Probe: small batched dependent-chain
+dispatch comparing batched vs per-dispatch results at engine init.
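Annotation: the selection logic above can be written as a small decision function. This is a sketch, not the engine's code: `chooseSubmitStrategy` and `probeResultsMatch` are hypothetical names, and treating any OS version outside 26.x the same way as the rules for 26.x is an assumption, since the text only covers 26.x.

```javascript
// Hypothetical helpers, not Gerbil's API.
// Exact comparison of probe outputs from the batched and per-dispatch runs.
function probeResultsMatch(batched, perDispatch) {
  if (batched.length !== perDispatch.length) return false;
  for (let i = 0; i < batched.length; i++) {
    if (batched[i] !== perDispatch[i]) return false;
  }
  return true;
}

function chooseSubmitStrategy(osMajor, osMinor, probePassed) {
  // 26.5+ carries the upstream fix, so everything goes into one command buffer.
  const hasUpstreamFix = osMajor > 26 || (osMajor === 26 && osMinor >= 5);
  if (hasUpstreamFix) return { mode: "batch-all", groupSize: Infinity, awaitEach: false };
  // Older builds: group 64 dispatches per submit only if the coherence probe passed.
  if (probePassed) return { mode: "grouped", groupSize: 64, awaitEach: false };
  // Correctness floor: one dispatch per submit, awaited.
  return { mode: "awaited", groupSize: 1, awaitEach: true };
}
```

At engine init the caller would run the probe both ways, pass `probeResultsMatch(...)` as `probePassed`, and cache the returned strategy for the session.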
+## Invariants any optimization must respect
+- iPad `maxComputeWorkgroupStorageSize` = 16384 (attention smem is exactly at the limit)
+- iPad default `maxBufferSize` = 256MB, `maxStorageBufferBindingSize` = 128MB (embedding ~127MB — no headroom for bigger vocabs without sharding)
+- Activation buffers are liveness-pooled: fused kernels must never read+write the same pooled buffer in one dispatch
+- Two-phase attention reduction barrier is a race fix, not a perf knob

package/docs/research/webgpu-m4-reference.md ADDED

@@ -0,0 +1,104 @@
+# WebGPU / WGSL / Apple M4 Max — Condensed Reference
+Quick reference for the kernel optimizer. All numbers verified against specs.
+## Hard Limits (WebGPU Spec)
+| Constraint | Limit | Notes |
+|---|---|---|
+| Max workgroup invocations | 256 | Product of all dimensions |
+| Max workgroup X/Y | 256 | |
+| Max workgroup Z | 64 | |
+| Workgroup storage (`var<workgroup>`) | 16 KB | All `var<workgroup>` combined |
+| Storage buffer default max | 128 MB | Can request higher via adapter |
+| Uniform buffer max | 64 KB | |
+| SIMD width (Apple) | 32 threads | Fixed across all Apple GPUs |
+## Apple M4 Max Hardware
+| Spec | Value |
+|---|---|
+| GPU cores | 40 (or 32 in lower SKU) |
+| Memory bandwidth | 546 GB/s (40-core) / 410 GB/s (32-core) |
+| Memory type | LPDDR5X, unified CPU/GPU |
+| L1 cache per core | 8 KB |
+| L2 cache (shared) | 1 MB |
+| System-level cache (SLC) | 8 MB |
+| Cache line size | 128 bytes |
+| Metal threadgroup memory | 32 KB max |
+| Metal max threads/threadgroup | 1024 |
+| Threadgroup memory alignment | 16 bytes |
+| Shared memory banks | 32 banks, 4-byte granularity |
+| SIMD group size | 32 threads (fixed) |
+| Peak FP32 TFLOPS | ~17.04 (40-core boost) |
+## Key Architecture Facts
+- **Unified memory**: CPU and GPU share the same physical memory. No PCIe transfers. Zero-copy.
+- **Memory-bound kernels**: INT4 matvec at ~8 FLOPS/byte — bandwidth is the bottleneck, not compute.
+- **Dynamic shader core memory** (M3/M4): Register file acts as a cache. Unused threadgroup memory can be reclaimed for registers. Improves occupancy for complex shaders.
+- **Optimal occupancy**: 1K–2K concurrent threads per GPU core. For 40 cores = 40K–80K total threads.
+- **Bank conflicts**: 32-bank shared memory, 4-byte stride. Pad row strides by one element (e.g. 33 instead of 32) to avoid conflicts.
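Annotation: one way to read the memory-bound point is as a bandwidth ceiling. If decode reads roughly every weight byte once per token, tok/s cannot exceed bandwidth divided by bytes read per token. The sketch below assumes exactly that simple model; the function name and the 0.5 GB weight footprint in the usage note are illustrative assumptions, not measurements.

```javascript
// Hypothetical helper: upper bound on decode speed for a bandwidth-bound model.
// bandwidthGBps: memory bandwidth in GB/s (1 GB = 1e9 bytes)
// bytesPerToken: bytes of weights (plus any other traffic) read per generated token
function decodeTokPerSecCeiling(bandwidthGBps, bytesPerToken) {
  if (bytesPerToken <= 0) throw new Error("bytesPerToken must be positive");
  return (bandwidthGBps * 1e9) / bytesPerToken;
}

// Fraction of the ceiling that a measured speed achieves.
function bandwidthUtilization(measuredTokPerSec, bandwidthGBps, bytesPerToken) {
  return measuredTokPerSec / decodeTokPerSecCeiling(bandwidthGBps, bytesPerToken);
}
```

For example, `decodeTokPerSecCeiling(546, 0.5e9)` gives the ceiling for a hypothetical 0.5 GB weight footprint on the 40-core part; a measured speed far below it suggests dispatch or latency overhead rather than bandwidth is limiting.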
+## Workgroup Sizing Guidelines
+1. Start at 64, tune in multiples of 32 (SIMD width)
+2. Smaller workgroups (32–64) give scheduler more flexibility for load balancing
+3. Larger workgroups (128–256) can improve data reuse but increase register pressure
+4. For memory-bound kernels: smaller workgroups often win (less sync overhead)
+5. For compute-bound kernels: larger workgroups can improve ALU utilization
+## WGSL Memory Alignment
+| Type | Alignment | Size |
+|---|---|---|
+| `f32`, `u32`, `i32` | 4 bytes | 4 bytes |
+| `vec2<f32>` | 8 bytes | 8 bytes |
+| `vec3<f32>` | 16 bytes | 12 bytes (+4 padding) |
+| `vec4<f32>` | 16 bytes | 16 bytes |
+| Struct | max(member alignments) | rounded up to alignment |
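Annotation: the table's rules are enough to compute member offsets for a struct of scalars and vectors. The sketch below applies them on the JS side, where a uniform or storage buffer is usually packed; `layoutStruct` is a hypothetical helper, and arrays, matrices and nested structs (which carry extra rules) are deliberately out of scope.

```javascript
// Alignment and size per type, copied from the table above.
const WGSL_TYPES = {
  "f32": { align: 4, size: 4 },
  "u32": { align: 4, size: 4 },
  "i32": { align: 4, size: 4 },
  "vec2<f32>": { align: 8, size: 8 },
  "vec3<f32>": { align: 16, size: 12 },
  "vec4<f32>": { align: 16, size: 16 },
};

function roundUp(value, multiple) {
  return Math.ceil(value / multiple) * multiple;
}

// members: array of [name, wgslType] pairs in declaration order.
function layoutStruct(members) {
  let offset = 0;
  let structAlign = 1;
  const offsets = {};
  for (const [name, type] of members) {
    const info = WGSL_TYPES[type];
    if (!info) throw new Error(`unsupported type: ${type}`);
    offset = roundUp(offset, info.align); // each member starts at a multiple of its alignment
    offsets[name] = offset;
    offset += info.size;
    structAlign = Math.max(structAlign, info.align);
  }
  // Struct size is rounded up to the struct's own alignment.
  return { offsets, align: structAlign, size: roundUp(offset, structAlign) };
}
```

For `[["a", "f32"], ["b", "vec3<f32>"], ["c", "f32"]]`, `b` lands at offset 16 and `c` fits into the 4 bytes after it at offset 28, giving a 32-byte struct.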
+## Subgroup Operations (Partial Support)
+Available in Dawn/Metal backend:
+- `subgroupAdd` → Metal `simd_sum`
+- `subgroupBroadcast` → Metal `simd_broadcast`
+- `subgroupBallot` → Metal `simd_ballot`
+**Not available yet**: subgroup barrier. Must use `workgroupBarrier()` instead.
+**Restriction**: Some backends restrict subgroup ops to 1D workgroups only.
+## Dispatch Mechanics
+```
+global_invocation_id = workgroup_id * workgroup_size + local_invocation_id
+```
+For a buffer of N elements with workgroup_size W:
+```
+dispatchWorkgroups(ceil(N / W), 1, 1)
+```
+Threads with `global_invocation_id >= N` must early-exit via guard condition.
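Annotation: the dispatch arithmetic and the guard can be checked with a plain CPU simulation. This is an illustrative sketch for the 1D case only; `workgroupCount` and `simulateDispatch` are hypothetical names and no GPU API is involved.

```javascript
// Number of workgroups needed to cover n elements.
function workgroupCount(n, workgroupSize) {
  return Math.ceil(n / workgroupSize);
}

// Walks every invocation of a 1D dispatch and applies the out-of-range guard.
function simulateDispatch(n, workgroupSize) {
  const groups = workgroupCount(n, workgroupSize);
  const written = new Uint8Array(n);
  let guarded = 0;
  for (let wg = 0; wg < groups; wg++) {
    for (let local = 0; local < workgroupSize; local++) {
      const globalId = wg * workgroupSize + local; // global_invocation_id.x
      if (globalId >= n) {
        guarded++; // the shader would early-exit here
        continue;
      }
      written[globalId] = 1;
    }
  }
  return { groups, guarded, allWritten: written.every((v) => v === 1) };
}
```

With `n = 1000` and a workgroup size of 64, the last workgroup is only partly in range, which is why the guard is required for any `n` that is not a multiple of the workgroup size.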
+## Critical Constraint: N_TILE / Dispatch Alignment
+**THE #1 BUG TO AVOID** (caused 12 invalid experiments):
+In our matvec kernels, each workgroup computes `N_TILE` output columns:
+```
+N_TILE = workgroup_size / K_THREADS
+```
+The JS dispatch must use the SAME N_TILE:
+```javascript
+pass.dispatchWorkgroups(Math.ceil(N / N_TILE), 1, 1); // pass: a GPUComputePassEncoder
+```
+If WGSL `N_TILE` ≠ JS dispatch `N_TILE`, workgroups skip columns → garbage output with inflated tok/s (500+ tok/s is always a broken shader).
+**Rule**: NEVER change `N_TILE`, `workgroup_size`, or `K_THREADS` in the WGSL shader without ensuring:
+1. `N_TILE == workgroup_size / K_THREADS`
+2. The JS dispatch `getDispatchSize()` uses the same `N_TILE`
+3. Shared memory arrays (`array<f32, SIZE>`) have `SIZE >= workgroup_size`
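Annotation: the three rules can be enforced mechanically before an experiment runs. The sketch below is one possible pre-flight check; `validateTileConfig` and its field names are hypothetical, and it assumes the WGSL constants have already been parsed out of the shader source.

```javascript
// Hypothetical pre-flight check, not Gerbil's getDispatchSize().
function validateTileConfig({ workgroupSize, kThreads, nTileWgsl, nTileJs, sharedSize }) {
  const errors = [];
  if (workgroupSize % kThreads !== 0) {
    errors.push("workgroup_size is not a multiple of K_THREADS");
  }
  if (nTileWgsl !== workgroupSize / kThreads) {
    errors.push("N_TILE != workgroup_size / K_THREADS");
  }
  if (nTileJs !== nTileWgsl) {
    errors.push("JS dispatch N_TILE differs from WGSL N_TILE");
  }
  if (sharedSize < workgroupSize) {
    errors.push("shared array SIZE is smaller than workgroup_size");
  }
  return errors;
}

// Workgroups dispatched for n output columns at a given N_TILE.
function dispatchSize(n, nTile) {
  return Math.ceil(n / nTile);
}
```

The failure mode is easy to see numerically: if the shader computes 8 columns per workgroup but the JS side dispatches with `N_TILE = 16`, then for 1024 columns only `dispatchSize(1024, 16) = 64` workgroups run, covering 512 columns and leaving the rest unwritten.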

package/docs/site-update-plan.md ADDED

@@ -0,0 +1,155 @@
+# Gerbil Marketing/Docs Site — Update Plan
+**Target repo:** `/Users/shenron/Code/gerbil-site` (separate Next.js app, App Router)
+**Date scoped:** 2026-06-13
+**Status:** Plan only — no site edits made yet. Execute against `gerbil-site`, not this repo.
+## Why this exists
+The site predates two major shifts and is now materially wrong:
+1. **Native WGSL engine now works on mobile.** iPad/iOS Safari 26.5+ (WebKit) runs Qwen3.5-0.8B at **~41–51 tok/s, byte-correct**. It previously crashed. The site still says "iOS will crash," gates WebGPU to "Chrome/Edge 113+ only," and shows Safari as "may have quirks."
+2. **Desktop got an autoresearch optimization pass.** M4 Max via node-dawn is now **~207 tok/s** (was ~145). Published numbers (40–200, 100–150) are stale and low.
+There are also two **architecture-reality gaps**:
+- The site presents a single transformers.js/ONNX stack. Reality is **two lanes**: (a) a **native WGSL engine** — fast text, mobile-proven; (b) a **transformers.js/ONNX lane** — modality breadth (STT/TTS/vision/embeddings), full-speed on desktop, **~5x slower on mobile** and only suitable for light/single-pass modalities there.
+- The **model lineup is dated**: Kokoro TTS is old (better small TTS exists now), and **vision should be served by Qwen3.5 natively**, not a separate `ministral-3b` ONNX vision model.
+## Real performance numbers to publish
+| Context | Number | Notes |
+|---|---|---|
+| Desktop native (M4 Max, node-dawn) | **~207 tok/s** | Qwen3.5-0.8B, post-autoresearch (was ~145) |
+| iPad / iOS Safari 26.5+ native | **~41–51 tok/s** | Qwen3.5-0.8B Q4, byte-correct vs reference |
+| Browser desktop native (Chrome/Edge) | ~150–207 tok/s range | keep honest; native WGSL |
+| transformers.js/ONNX on mobile | **~5x slower** | light/single-pass modalities only (one TTS line, one embedding, short transcription). Not for streaming chat. |
+| CPU (Node fallback) | ~30–60 tok/s | unchanged |
+| WASM fallback | ~5–10 tok/s | unchanged |
+**Browser support reality:** Chrome/Edge **113+** AND Safari/iOS **26.5+** (WebKit). WebGPU requires **HTTPS** (secure context) — call this out, it's the #1 mobile gotcha. Firefox still behind a flag → not recommended.
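Because `navigator.gpu` is only exposed in secure contexts, a page served over plain HTTP looks identical to an unsupported browser unless the secure-context check comes first. A minimal sketch of that ordering follows; `WebGPUEnv` and `checkWebGPU` are hypothetical names, and the environment is passed in as plain data so the logic can be exercised outside a browser.

```typescript
// Hypothetical input shape. In a browser it could be filled with:
//   { hasGpu: "gpu" in navigator, isSecureContext: window.isSecureContext }
interface WebGPUEnv {
  hasGpu: boolean;
  isSecureContext: boolean;
}

type WebGPUCheck = { ok: true } | { ok: false; reason: string };

function checkWebGPU(env: WebGPUEnv): WebGPUCheck {
  // Check HTTPS first: on an insecure origin navigator.gpu is absent even
  // in a browser that supports WebGPU, so this gives the more useful hint.
  if (!env.isSecureContext) {
    return { ok: false, reason: "WebGPU requires HTTPS (secure context)." };
  }
  if (!env.hasGpu) {
    return {
      ok: false,
      reason:
        "WebGPU not supported. Use Chrome/Edge 113+ or Safari/iOS 26.5+, served over HTTPS.",
    };
  }
  return { ok: true };
}
```

The unsupported-browser message reuses the wording proposed for `AISDKPlayground.tsx` in this plan, so one helper could back all of the duplicated error strings.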
+---
+## Per-file change list
+Priority key: **P0** = factually wrong / actively misleading; **P1** = important correctness/positioning; **P2** = polish/consistency.
+### 1. `app/page.tsx` (homepage)
+- **P0 — line ~151** (GPU section subheadline):
+  - Current: `"40-200+ tok/s via WebGPU with CPU fallback that runs anywhere JavaScript runs. Text, vision, TTS & transcription. All WebGPU accelerated. Cached in IndexedDB."`
+  - New: `"Up to ~207 tok/s on desktop (M4 Max) and ~41-51 tok/s on iPad/iOS Safari 26.5+ — native WebGPU, byte-correct. Plus vision, TTS & transcription via the ONNX lane. Runs anywhere JavaScript runs. Cached in IndexedDB."`
+  - Note: drop "All WebGPU accelerated" blanket claim — the ONNX modalities are not all WebGPU on mobile.
+- **P1 — line ~164** (`~100MB-2.5GB ONNX models` chip): keep but reframe — this is the ONNX lane. Consider adding a native-engine chip: `Native WGSL engine (mobile-proven)`.
+- **P1 — Hero subheadline (lines ~91–98):** Currently lists "Text, vision, TTS, transcription, tools & skills" flatly. Add a one-line mention that text now runs on **mobile Safari**, the headline differentiator. Suggested addition near line 96: `Now runs on iPhone & iPad (Safari 26.5+), not just desktop.`
+- **P2 — line ~482** (Vision code example): uses `g.loadModel("ministral-3b")`. See model-lineup section — vision should move to Qwen3.5 native. At minimum update the comment; ideally switch the model id once the native VLM path is the recommended one.
+- **P2 — line ~575** (TTS code example, `Kokoro-82M`, `af_heart`): Kokoro is dated. Keep working but plan to swap default to the newer small TTS (see lineup section). Low urgency — code still runs.
+- **P2 — line ~648** (CLI sample output `⚡ 47.2 tok/s`): cosmetic; bump to a current-looking number (e.g. `~190 tok/s`) so the hero CLI doesn't undersell.
+### 2. `app/playground/page.tsx`
+- **P0 — line 59:**
+  - Current: `Requires WebGPU support (Chrome 113+, Edge 113+)`
+  - New: `Requires WebGPU (Chrome/Edge 113+, or Safari/iOS 26.5+). Must be served over HTTPS.`
+- **P1 — lines 124–146** (Chat Models list) and **150–163** (Vision): lineup is dated. Vision lists only `ministral-3b — 2.5GB`; should reflect native Qwen3.5 vision once available, and note mobile can't run 2.5GB ONNX vision. Add a note that **native-engine models (Qwen3.5-0.8B) are the mobile-capable ones**.
+- **P1 — lines 170–183** (TTS panel, `Kokoro-82M — 330MB`): mark Kokoro as legacy / note a newer smaller TTS is recommended.
+### 3. `components/PlaygroundNative.tsx` — **the live mobile demo**
+This is the native WGSL playground. **It now works on mobile but the homepage never shows it by default.**
+- **P0 — `components/Playground.tsx` lines 40–43** decide which playground renders:
+  ```ts
+  const [mode, setMode] = useState<"native" | "full">(() => {
+    if (typeof window === "undefined") return "full";
+    return localStorage.getItem("gerbil-backend") === "native" ? "native" : "full";
+  });
+  ```
+  Default is **`full`** (ONNX `PlaygroundFull`) unless localStorage opts into native. So a first-time visitor — especially on an iPad — gets the ONNX playground, which is the slow/fragile lane on mobile, **not** the fast native engine that now works. **Action:** detect mobile/Safari and default to `native` there (the proven-fast lane), or at least default new visitors to `native` for the text tab. This is the single highest-leverage change to make the "works on mobile" story real on the page.
+- **P1 — `PlaygroundNative.tsx` has NO iOS guards.** Confirmed: this component has no `isModelSafeForDevice`, crash-detection, or mobile gating (unlike `PlaygroundFull.tsx`, which has crash warnings at lines ~419, ~994, ~1007). It used to crash on mobile; now it runs. **Action:** it is safe to surface on mobile, but add a light guard so mobile picks the **Q4 (404 MB)** native model by default rather than F32 (1.6 GB). Currently `nativeDtype` defaults to `"f32"` (line 72) while `nativeModel` defaults to the 4bit MLX model (line 71), so verify that the default actually loads the small Q4 path on mobile. An F32 default dtype on a 1.6 GB model would be too heavy for an iPhone.
+- **P2 — line 51** (Summarize example prompt) says "WebGPU now works in Node.js via Dawn... Small models run at 100+ tok/s on both client and server." Update the embedded number to reflect ~207 desktop / mobile reality, or generalize it. Same stale string is duplicated in `PlaygroundFull.tsx` line 132.
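One possible shape for the mobile default described in this section is sketched below. It is not the site's code: `DeviceHints`, `isLikelyMobile`, `pickDefaultMode`, and `pickNativeDtype` are hypothetical, and the `"q4"` dtype id is an assumption (the plan only confirms that `"f32"` is the current default). Unlike the existing initializer, this sketch honours an explicit stored `"full"` choice and changes only the no-preference case.

```typescript
type PlaygroundMode = "native" | "full";

// Hypothetical input; in a browser these would come from
// navigator.userAgent and navigator.maxTouchPoints.
interface DeviceHints {
  userAgent: string;
  maxTouchPoints: number;
}

function isLikelyMobile(d: DeviceHints): boolean {
  if (/iPhone|iPad|iPod|Android/i.test(d.userAgent)) return true;
  // iPadOS Safari sends a desktop "Macintosh" user agent by default;
  // multiple touch points distinguish it from a real Mac.
  return /Macintosh/i.test(d.userAgent) && d.maxTouchPoints > 1;
}

// stored is the value of localStorage.getItem("gerbil-backend").
function pickDefaultMode(stored: string | null, d: DeviceHints): PlaygroundMode {
  if (stored === "native" || stored === "full") return stored;
  return isLikelyMobile(d) ? "native" : "full";
}

// Assumed dtype ids: small Q4 weights on mobile, F32 elsewhere.
function pickNativeDtype(d: DeviceHints): "q4" | "f32" {
  return isLikelyMobile(d) ? "q4" : "f32";
}
```

User-agent sniffing is brittle, so treat `isLikelyMobile` as a starting heuristic rather than a reliable classifier.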
+### 4. `components/AISDKPlayground.tsx`
+- **P0 — lines 258, 267, 280:** three identical error strings `"WebGPU not supported in this browser. Try Chrome 113+"`.
+  - New: `"WebGPU not supported. Use Chrome/Edge 113+ or Safari/iOS 26.5+, served over HTTPS."`
+- **P1 — lines 192–197:** `useEmbedding` is stubbed out (`embed = null`), so the embeddings tab will throw on run. Either restore the hook or disable the Embed tab. (This is separate from the freshness story, but it is a broken live demo.)
+### 5. `app/docs/browser/page.tsx` — **most stale file (1058 lines)**
+- **P0 — line 16** (top banner): `100-150 tok/s with WebGPU` → `Up to ~207 tok/s desktop, ~41-51 tok/s on iPad/iOS native`.
+- **P0 — lines 705–719** (`iOS Memory Guards` section intro): currently asserts `iOS Safari and iOS Chrome have strict memory limits (~300-400MB effective for WKWebView)` and frames iOS as crash-prone. **Rewrite the framing**: native WGSL engine now runs Qwen3.5-0.8B on iOS Safari 26.5+ at 41–51 tok/s, byte-correct. Keep the memory-guard *utilities* documentation (they're still real and useful for the ONNX lane / large models), but reframe from "iOS will crash" to "iOS is supported on the native engine; memory guards protect the heavier ONNX lane."
+- **P0 — lines 692–694** (Browser Support table, Safari row): `Safari / 18+ / ⚠ May have quirks` → `Safari / iOS 26.5+ (WebKit) / ✓ Native engine supported`. Add the HTTPS-required caveat as a table footnote.
+- **P0 — line 688** (Chrome/Edge row `113+`): fine, but add a note that Safari now joins full support for the native engine.
+- **P1 — lines 250–257, 330–333, 940–942:** `isWebGPUSupported` / alert / troubleshooting copy all say "Chrome/Edge 113+" only. Add Safari/iOS 26.5+ everywhere.
+- **P1 — lines 649–667** (Browser Models speed table `100-150` / `150-200` / `200-300 tok/s`): these are ONNX-lane numbers and now read low next to 207. Recheck against current benchmarks; at minimum add a native-engine row for Qwen3.5-0.8B.
+- **P1 — lines 770–822** (iOS Compatibility Matrix): rebuild around the native engine. Add Qwen3.5-0.8B Q4 (404 MB) as **✓ iOS Safe (native, ~41-51 tok/s)**. Keep ONNX large models as risky/blocked. The current matrix only covers ONNX models.
+- **P2 — lines 401–404** (Mobile q4 note): still accurate for the ONNX lane; clarify that it applies to ONNX, and that the native engine has its own Q4 path.
+### 6. `app/docs/architecture/page.tsx` — **needs the two-lane story (591 lines)**
+- **P0 — entire framing (lines 19–118):** The system-overview and pipeline Mermaid diagrams show only `transformers.js → ONNX Runtime`. There is **no mention of the native WGSL engine**. Add it as a first-class lane:
+  - Lane A: **Native WGSL engine** — hand-written WebGPU compute kernels, fast text generation, runs on desktop (node-dawn ~207 tok/s) and mobile Safari (~41-51 tok/s). This is the default text path.
+  - Lane B: **transformers.js / ONNX Runtime** — modality breadth (STT/TTS/vision/embeddings), full speed on desktop, ~5x slower on mobile, light/single-pass only there.
+  - Update both Mermaid charts to show the split (browser/node → router → {Native WGSL | ONNX}).
+- **P1 — lines 135–152** (Execution Backends table): `WebGPU / Browser, Chrome / ~100-150 tok/s` → split native vs ONNX, add Safari/iOS, bump desktop native to ~207.
+- **P1 — lines 70–74** (Key Decision "WebGPU First"): mentions only headless Chrome (ChromeGPUBackend) for Node. Reality now includes **node-dawn** for the native engine. Add node-dawn as the native-engine Node path.
+- **P2 — lines 236–264** (Node WebGPU path, ChromeGPUBackend, port 43724): still valid for the ONNX lane; clarify which lane it belongs to vs node-dawn.
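The router box in the proposed diagrams could be described with a small decision function. The sketch below only encodes the two-lane split as stated in this plan; `Lane`, `Modality`, and `pickLane` are hypothetical names and do not reflect how Gerbil actually routes requests.

```typescript
type Lane = "native-wgsl" | "onnx";
type Modality = "text" | "vision" | "tts" | "stt" | "embeddings";

interface LaneDecision {
  lane: Lane;
  caveat: string | null;
}

function pickLane(modality: Modality, isMobile: boolean): LaneDecision {
  // Lane A: the native WGSL engine is the default text path on every platform.
  if (modality === "text") {
    return { lane: "native-wgsl", caveat: null };
  }
  // Lane B: ONNX covers the other modalities; on mobile it is only
  // suitable for light, single-pass work.
  return {
    lane: "onnx",
    caveat: isMobile ? "~5x slower on mobile; light/single-pass only" : null,
  };
}
```

If vision later moves to native Qwen3.5, as the model-lineup section proposes, the text check would widen to include `"vision"`.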
+### 7. Model lineup — `components/ModelsTable.tsx`, `app/docs/models/page.tsx`, `app/docs/vision/page.tsx`, `app/docs/tts/page.tsx`
+- **P1 — Vision via Qwen3.5 native, not `ministral-3b`:**
+  - `ModelsTable.tsx` lines ~22–28: `ministral-3b` flagged `vision: true`.
+  - `vision/page.tsx` lines 24, 57, 119: all use `ministral-3b` (2.5GB ONNX).
+  - `models/page.tsx` lines 71–76: "For vision → ministral-3b".
+  - Action: once the native Qwen3.5 vision path is the recommendation, point vision at Qwen3.5 (same family as the fast/mobile text engine). Note that 2.5GB ONNX vision is desktop-only; native Qwen3.5 vision is the path forward. Keep `ministral-3b` documented as a heavier ONNX option if still supported.
+- **P2 — Kokoro is dated TTS:**
+  - `ModelsTable.tsx` line ~194 (`kokoro-82m`), `tts/page.tsx` lines 11/31/61/158, `page.tsx` line ~575.
+  - Action: introduce the newer small TTS as the recommended default; demote Kokoro to "also supported." `supertonic-66m` is already listed (ModelsTable line ~205, tts page line ~39) — if that's the better small model, make it the default in copy and code samples.
+- **P2 — `components/FeatureCards.tsx` line 65:** `"40-200 tok/s on WebGPU"` → `"Up to ~207 tok/s on desktop, runs on iPhone & iPad too."`
+### 8. Sweep (lower priority, same stale patterns)
+- `app/docs/vision/page.tsx` line 422: `70-100+ tok/s (WebGPU)` — vision is genuinely slower; verify but likely fine. ONNX-lane, desktop.
+- `app/docs/repl/page.tsx` lines 289–291: sample REPL output `45-47 tok/s` — cosmetic, bump if desired.
+- `components/PlaygroundFull.tsx` lines 132 (dup summarize prompt), 317/678 (vision ~25 tok/s) — ONNX lane, mostly accurate; update the embedded "100+ tok/s" claim.
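A small scanner could help catch the remaining stale strings during the sweep. This is a rough sketch: `STALE_PATTERNS` and `findStaleClaims` are hypothetical, and the patterns cover only a few of the phrasings quoted in this plan, so the list would need extending before real use.

```typescript
interface StalePattern {
  label: string;
  pattern: RegExp;
}

interface StaleHit {
  line: number; // 1-based line number within the scanned text
  label: string;
  text: string;
}

// Patterns derived from strings quoted in this plan. None use the "g" flag,
// so RegExp.test() stays stateless between calls.
const STALE_PATTERNS: StalePattern[] = [
  { label: "old tok/s figure", pattern: /\b(?:40-200\+?|100-150|100\+) tok\/s/ },
  // Chrome 113+ on a line that never mentions Safari.
  { label: "Chrome-only WebGPU requirement", pattern: /Chrome(?:\/Edge)? 113\+(?!.*Safari)/ },
  { label: "iOS crash framing", pattern: /iOS will crash/i },
];

function findStaleClaims(source: string): StaleHit[] {
  const hits: StaleHit[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { label, pattern } of STALE_PATTERNS) {
      if (pattern.test(text)) {
        hits.push({ line: index + 1, label, text: text.trim() });
      }
    }
  });
  return hits;
}
```

Running it over each file in `app/` and `components/` would give a checklist of lines to review; a hit is a prompt to look, not proof that the copy is wrong.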
+---
+## Suggested execution order
+1. **P0 correctness pass** (no design work): `playground/page.tsx:59`, `AISDKPlayground.tsx` 258/267/280, `browser/page.tsx` 16/688–694/705–719, `page.tsx:151`. These are pure copy fixes — Safari/iOS 26.5+, HTTPS, real tok/s.
+2. **Playground default-to-native on mobile** (`Playground.tsx` + light guard in `PlaygroundNative.tsx`): makes the "works on mobile" claim demonstrably true on the live site. Highest-leverage single change.
+3. **Architecture two-lane rewrite** (`architecture/page.tsx`): the conceptual fix everything else hangs off.
+4. **iOS compatibility matrix + browser models tables** rebuild around the native engine.
+5. **Model lineup** (vision → Qwen3.5 native, TTS default off Kokoro) — coordinate with whatever the engine actually ships as the recommended native VLM/TTS.
+## Open items to confirm before editing
+- Exact recommended **native vision model id** for Qwen3.5 (is the native VLM path shipped, or still desktop-ONNX `ministral-3b` for now?).
+- The **newer small TTS** model id to replace Kokoro as default (Supertonic-66M, or something newer?).
+- Whether to keep the desktop browser native number as a single figure (~207) or a range; pick one and use it consistently across all files.
+- `useEmbedding` is stubbed in `AISDKPlayground.tsx` (line 192) — restore or disable that tab.