PyPI - meadow-mind - Versions diffs - 0.1.0__tar.gz - Mend

meadow-mind 0.1.0__tar.gz

Files changed (17)

meadow_mind-0.1.0/PKG-INFO +170 -0
meadow_mind-0.1.0/README.md +151 -0
meadow_mind-0.1.0/meadow_mind/__init__.py +21 -0
meadow_mind-0.1.0/meadow_mind/_engine.py +302 -0
meadow_mind-0.1.0/meadow_mind/mind.py +95 -0
meadow_mind-0.1.0/meadow_mind/play.py +103 -0
meadow_mind-0.1.0/meadow_mind/prompt.py +54 -0
meadow_mind-0.1.0/meadow_mind/task.py +56 -0
meadow_mind-0.1.0/meadow_mind/tasks.py +150 -0
meadow_mind-0.1.0/meadow_mind.egg-info/PKG-INFO +170 -0
meadow_mind-0.1.0/meadow_mind.egg-info/SOURCES.txt +15 -0
meadow_mind-0.1.0/meadow_mind.egg-info/dependency_links.txt +1 -0
meadow_mind-0.1.0/meadow_mind.egg-info/entry_points.txt +2 -0
meadow_mind-0.1.0/meadow_mind.egg-info/requires.txt +9 -0
meadow_mind-0.1.0/meadow_mind.egg-info/top_level.txt +1 -0
meadow_mind-0.1.0/pyproject.toml +31 -0
meadow_mind-0.1.0/setup.cfg +4 -0

meadow_mind-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,170 @@
+Metadata-Version: 2.4
+Name: meadow-mind
+Version: 0.1.0
+Summary: Language-rule decision mind: zero-training, ~0.4s real decisions for games and control. One install, one import.
+Author: Hey-Meadow Lab
+License: MIT
+Project-URL: Homepage, https://meadow-mind.pages.dev
+Project-URL: Demo, https://meadow-mind.pages.dev/en.html
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: mlx>=0.20
+Requires-Dist: mlx-lm>=0.20
+Requires-Dist: numpy
+Requires-Dist: huggingface_hub
+Provides-Extra: games
+Requires-Dist: gymnasium[box2d,toy-text]; extra == "games"
+Requires-Dist: matplotlib; extra == "games"
+Requires-Dist: imageio[ffmpeg]; extra == "games"
+# Meadow Mind
+**Zero training. Sub-second reactions (~400 ms).**
+A language-rule decision mind: write the policy as one sentence, describe the state as one sentence, and a local 7B model makes a real decision every ~0.4 s. No RL, no reward engineering, no gradients, no samples.
+🌐 **Demo site**: [meadow-mind.pages.dev](https://meadow-mind.pages.dev) (中文) · [English](https://meadow-mind.pages.dev/en.html) · [繁體中文 README](README.zh-TW.md)
+```bash
+pip install meadow-mind          # weights auto-download on first use
+```
+```python
+from meadow_mind import MeadowMind, tasks
+mind = MeadowMind()                    # loads once, runs on-device
+task = tasks.mountaincar()
+mind.check(task)                       # sanity gate: decision-table exam
+action, info = mind.decide(task, obs)  # obs in, env action out (~0.4s)
+```
+## Results
+All on official Gymnasium environments, untouched physics, **zero training**. Every frame below corresponds to one real model decision; no scripted policy, no edited speed-ups.
+| Balance · CartPole-v1<br>**400/400 perfect** (solve bar 195) | Landing · LunarLander-v3<br>**+251 safe landing** (solve bar 200) |
+|---|---|
+| ![CartPole](assets/balance.gif) | ![LunarLander](assets/landing.gif) |
+| Maze · FrozenLake 8×8<br>**goal in 14 steps = shortest path** | Momentum · MountainCar-v0<br>**flag in 103 steps** (limit 200) |
+|---|---|
+| ![Maze](assets/maze.gif) | ![MountainCar](assets/mountaincar.gif) |
+The MountainCar policy is one counterintuitive sentence — `"push in the same direction the car is moving, to pump energy like a swing"` — which replaces an entire RL reward curve.
+### Real-time reflex (wall-clock, not turn-based)
+The model runs in a thread while obstacles fall in real time. If it is still thinking when the obstacle lands, it really crashes.
+| Parkour dodge: full-generation crashes at #1, Meadow Mind clears 5/6 | Shape+color match: 6/6, down to a 0.72 s window |
+|---|---|
+| ![Parkour](assets/parkour.gif) | ![Shape+color](assets/shape_color.gif) |
+### Working memory
+A funnel maze forces both runs into the same dead-end pocket. Reactive (left) paces at its mouth forever; with `Task(memory=True)` (right) it struggles, backs out, and detours to the goal in 22 steps. The only difference is five words in the perception sentence.
+![Memory](assets/memory.gif)
+## Decision latency: traditional LLM vs Meadow Mind
+A traditional LLM agent must **generate its full answer before acting** — and latency grows with answer length. Meadow Mind reads the rule and the situation and decides in **one fixed-latency pass**, right at human reaction speed (0.3–0.4 s):
+![Latency](assets/latency.png)
+```
+Traditional LLM agent                    Meadow Mind
+─────────────────────                    ───────────
+state → long prompt                      state → one sentence   (Perceiver)
+      → generate the answer                    → one sentence rule (Policy)
+        token by token (1.2–3.9 s,             → ONE decision pass, fixed ~0.4 s
+        grows with length)                     → action letter      (Actuator)
+      → parse free text → act            exam-gated before deployment
+```
+## Why a diffusion LLM underneath
+Meadow Mind is built on a diffusion language model (MeadowCoder-7B), not an autoregressive one. The differences that matter:
+| | AR-LLM | Diffusion LLM (Meadow dLLM) |
+|---|---|---|
+| Generation | left to right, one token at a time; words are final once written | drafts the **whole answer at once**, then refines it over multiple steps |
+| Mid-course correction | cannot edit what is already written; fixing means regenerating everything | **refines while working** — any region can be re-opened and corrected in place |
+| Task awareness | sees only the next word | **global**: senses the entire task and answer shape at once |
+| Pre-answer self-sense | none | **Σ**: before answering, Meadow dLLM senses whether it understands the task; low Σ coherence becomes an escalation signal instead of a wrong answer |
+| Decision latency | grows with answer length | **fixed**, independent of answer length |
+| Long free-form prose | mature, strong ecosystem | weaker; smaller ecosystem (honest trade-off) |
+Two of these are what make Meadow Mind possible: **multi-step self-correction** (it can fix its own draft while working) and **global task perception with Σ** (it knows what it is being asked — and whether it understands — before committing to an answer).
+## How it works
+```
+┌────────────────────────────────────────────────────┐
+│ ① Perceiver   your code: numbers -> one sentence   │
+│               "The pole tilts right, fast spin."   │
+├────────────────────────────────────────────────────┤
+│ ② Rule        one sentence = the policy            │
+│               edit behavior by editing words       │
+├────────────────────────────────────────────────────┤
+│ ③ Mind        7B on-device model reads rule+state, │
+│               answers an action letter in a single │
+│               decision pass, fixed ~0.4 s          │
+├────────────────────────────────────────────────────┤
+│ ④ Actuator    letter -> env action                 │
+└────────────────────────────────────────────────────┘
+```
+There is no reward in the loop. The env score is only a report card; improvement happens by **outcome feedback**: the episode trace shows which sentence was wrong, and you edit it. (LunarLander went from a +27.5 crash to a +251 landing by adding one touchdown-cushion line to the perceiver. Ten seconds.)
+## Wire up a new game (5 steps)
+1. **Understand the task, explore input-output.** Variables, actions, win/lose conditions; the reaction deadline must be looser than ~0.4 s. List every action and watch its effect.
+2. **Build perception words.** One sentence describing the current situation. Bucket continuous values (small/big, fast/slow); always include a velocity/trend term.
+3. **Imprint the rule.** Invert the effects into "on situation X do action B". Keyword → letter, one-layer mapping, multiple choice only.
+4. **Decide on memory.** Ask: *"is revisiting the same state a failure signal?"* Yes (maze, exploration, dead ends) → `Task(memory=True)`. No (balance, landing, tracking — repetition IS the job) → keep it off; annotations measurably hurt regulation tasks (CartPole sanity 7/8 → 6/8). Unsure → leave off; the runner prints a hint when it detects looping.
+5. **Take the exam.** Enumerate every situation with its expected letter; `mind.check(task)` passes with at most 1 miss. Failures mean the wording is incomplete — rephrase and re-check, no training.
+Or skip all five: hand `meadow_mind.ai_prompt()` plus your game description to any code agent, and it wires the task for you. You only review the exam score.
+## API
+### `MeadowMind(model_path=None)`
+Weight resolution: `MEADOW_MIND_MODEL` env → explicit path → local cache (`~/.meadow-mind/models/`) → auto-download.
+| Method | |
+|---|---|
+| `mind.decide(task, obs) -> (action, info)` | one real decision; `info = {status, letter, lat}` |
+| `mind.check(task) -> (ok, n)` | sanity gate; raises if the decision table fails |
+### `Task(...)`
+| Field | |
+|---|---|
+| `perceive(obs) -> str` (or `perceive(obs, task)` with memory) | perception layer |
+| `rule` / `option_text` / `options` / `act_text` | the one-sentence policy and its multiple-choice actions |
+| `sanity` | the exam: `[(status sentence, expected letter)]` |
+| `memory` / `mem_key` | working-memory switch (default off) + state key fn |
+| `env_id` / `env_kwargs` / `max_steps` / `judge` | environment wiring and report card |
+With `memory=True` the runner auto-tracks `task.visited`; use `task.seen(key)` inside `perceive` to annotate, e.g. `(safe, already visited)`.
+### CLI
+```bash
+meadow-mind cartpole        # sanity gate -> play one episode -> video + verdict
+```
+## Honest limits
+- Reaction floor is one decision pass (~0.4 s ≈ 2.5 Hz). Tighter deadlines (1 m pole, Pong trajectory prediction) are out of reach today.
+- Suited to tasks whose situations can be said in a sentence and whose policy fits a rule. Continuous high-precision control is not.
+- The perceiver is human-designed (or AI-generated via `ai_prompt()`); the model's job is reading the rule and deciding.
+## Roadmap
+- **v0.2** — layered perception with early action: accumulate confidence through the network and act when it crosses a threshold; easy situations should land near ~0.15 s.
+- Rule-learning loop: discover rules like MountainCar's swing trick from failed episodes automatically (no gradients — the learned artifact is a readable sentence).
+## License
+MIT © Hey-Meadow Lab
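
The Perceiver, Rule, Actuator and exam steps described in the package description above can be sketched without the model at all. The block below is a minimal illustration, not the package's code: `perceive_cartpole`, `RULE`, `OPTIONS`, `SANITY`, `stub_decide` and `check` are hypothetical names, the 0.05 rad bucket threshold is an assumption, and `stub_decide` is a keyword lookup standing in for the real decision pass.

```python
# Hypothetical sketch: none of these names come from meadow_mind itself.

def perceive_cartpole(obs):
    """Perceiver: turn CartPole's four numbers into one sentence."""
    _x, _x_dot, theta, theta_dot = obs
    side = "right" if theta > 0 else "left"
    size = "big" if abs(theta) > 0.05 else "small"   # assumed bucket threshold
    # Trend term: is the pole moving further away from upright?
    falling = (theta > 0) == (theta_dot > 0)
    trend = "falling further" if falling else "recovering"
    return f"The pole tilts {side} ({size}), {trend}."

# Rule: the whole policy as one sentence.
RULE = "If the pole tilts left push left (A); if it tilts right push right (B)."

# Actuator: letter -> Gymnasium CartPole action (0 = push left, 1 = push right).
OPTIONS = {"A": 0, "B": 1}

# Exam: every situation paired with its expected letter.
SANITY = [
    ("The pole tilts right (big), falling further.", "B"),
    ("The pole tilts left (small), recovering.", "A"),
]

def stub_decide(status):
    """Stand-in for the model's decision pass: keyword -> letter."""
    return "B" if "tilts right" in status else "A"

def check(decide, sanity, max_miss=1):
    """Sanity gate: pass when at most `max_miss` situations are answered wrongly."""
    misses = sum(1 for status, want in sanity if decide(status) != want)
    return misses <= max_miss, misses

status = perceive_cartpole([0.0, 0.0, 0.1, 0.5])
action = OPTIONS[stub_decide(status)]
```

Editing behaviour in this scheme means editing `RULE` or the wording produced by the perceiver, then re-running the exam; nothing is trained.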

meadow_mind-0.1.0/README.md ADDED

@@ -0,0 +1,151 @@
+# Meadow Mind
+**Zero training. Sub-second reactions (~400 ms).**
+A language-rule decision mind: write the policy as one sentence, describe the state as one sentence, and a local 7B model makes a real decision every ~0.4 s. No RL, no reward engineering, no gradients, no samples.
+🌐 **Demo site**: [meadow-mind.pages.dev](https://meadow-mind.pages.dev) (中文) · [English](https://meadow-mind.pages.dev/en.html) · [繁體中文 README](README.zh-TW.md)
+```bash
+pip install meadow-mind          # weights auto-download on first use
+```
+```python
+from meadow_mind import MeadowMind, tasks
+mind = MeadowMind()                    # loads once, runs on-device
+task = tasks.mountaincar()
+mind.check(task)                       # sanity gate: decision-table exam
+action, info = mind.decide(task, obs)  # obs in, env action out (~0.4s)
+```
+## Results
+All on official Gymnasium environments, untouched physics, **zero training**. Every frame below corresponds to one real model decision; no scripted policy, no edited speed-ups.
+| Balance · CartPole-v1<br>**400/400 perfect** (solve bar 195) | Landing · LunarLander-v3<br>**+251 safe landing** (solve bar 200) |
+|---|---|
+| ![CartPole](assets/balance.gif) | ![LunarLander](assets/landing.gif) |
+| Maze · FrozenLake 8×8<br>**goal in 14 steps = shortest path** | Momentum · MountainCar-v0<br>**flag in 103 steps** (limit 200) |
+|---|---|
+| ![Maze](assets/maze.gif) | ![MountainCar](assets/mountaincar.gif) |
+The MountainCar policy is one counterintuitive sentence — `"push in the same direction the car is moving, to pump energy like a swing"` — which replaces an entire RL reward curve.
+### Real-time reflex (wall-clock, not turn-based)
+The model runs in a thread while obstacles fall in real time. If it is still thinking when the obstacle lands, it really crashes.
+| Parkour dodge: full-generation crashes at #1, Meadow Mind clears 5/6 | Shape+color match: 6/6, down to a 0.72 s window |
+|---|---|
+| ![Parkour](assets/parkour.gif) | ![Shape+color](assets/shape_color.gif) |
+### Working memory
+A funnel maze forces both runs into the same dead-end pocket. Reactive (left) paces at its mouth forever; with `Task(memory=True)` (right) it struggles, backs out, and detours to the goal in 22 steps. The only difference is five words in the perception sentence.
+![Memory](assets/memory.gif)
+## Decision latency: traditional LLM vs Meadow Mind
+A traditional LLM agent must **generate its full answer before acting** — and latency grows with answer length. Meadow Mind reads the rule and the situation and decides in **one fixed-latency pass**, right at human reaction speed (0.3–0.4 s):
+![Latency](assets/latency.png)
+```
+Traditional LLM agent                    Meadow Mind
+─────────────────────                    ───────────
+state → long prompt                      state → one sentence   (Perceiver)
+      → generate the answer                    → one sentence rule (Policy)
+        token by token (1.2–3.9 s,             → ONE decision pass, fixed ~0.4 s
+        grows with length)                     → action letter      (Actuator)
+      → parse free text → act            exam-gated before deployment
+```
+## Why a diffusion LLM underneath
+Meadow Mind is built on a diffusion language model (MeadowCoder-7B), not an autoregressive one. The differences that matter:
+| | AR-LLM | Diffusion LLM (Meadow dLLM) |
+|---|---|---|
+| Generation | left to right, one token at a time; words are final once written | drafts the **whole answer at once**, then refines it over multiple steps |
+| Mid-course correction | cannot edit what is already written; fixing means regenerating everything | **refines while working** — any region can be re-opened and corrected in place |
+| Task awareness | sees only the next word | **global**: senses the entire task and answer shape at once |
+| Pre-answer self-sense | none | **Σ**: before answering, Meadow dLLM senses whether it understands the task; low Σ coherence becomes an escalation signal instead of a wrong answer |
+| Decision latency | grows with answer length | **fixed**, independent of answer length |
+| Long free-form prose | mature, strong ecosystem | weaker; smaller ecosystem (honest trade-off) |
+Two of these are what make Meadow Mind possible: **multi-step self-correction** (it can fix its own draft while working) and **global task perception with Σ** (it knows what it is being asked — and whether it understands — before committing to an answer).
+## How it works
+```
+┌────────────────────────────────────────────────────┐
+│ ① Perceiver   your code: numbers -> one sentence   │
+│               "The pole tilts right, fast spin."   │
+├────────────────────────────────────────────────────┤
+│ ② Rule        one sentence = the policy            │
+│               edit behavior by editing words       │
+├────────────────────────────────────────────────────┤
+│ ③ Mind        7B on-device model reads rule+state, │
+│               answers an action letter in a single │
+│               decision pass, fixed ~0.4 s          │
+├────────────────────────────────────────────────────┤
+│ ④ Actuator    letter -> env action                 │
+└────────────────────────────────────────────────────┘
+```
+There is no reward in the loop. The env score is only a report card; improvement happens by **outcome feedback**: the episode trace shows which sentence was wrong, and you edit it. (LunarLander went from a +27.5 crash to a +251 landing by adding one touchdown-cushion line to the perceiver. Ten seconds.)
+## Wire up a new game (5 steps)
+1. **Understand the task, explore input-output.** Variables, actions, win/lose conditions; the reaction deadline must be looser than ~0.4 s. List every action and watch its effect.
+2. **Build perception words.** One sentence describing the current situation. Bucket continuous values (small/big, fast/slow); always include a velocity/trend term.
+3. **Imprint the rule.** Invert the effects into "on situation X do action B". Keyword → letter, one-layer mapping, multiple choice only.
+4. **Decide on memory.** Ask: *"is revisiting the same state a failure signal?"* Yes (maze, exploration, dead ends) → `Task(memory=True)`. No (balance, landing, tracking — repetition IS the job) → keep it off; annotations measurably hurt regulation tasks (CartPole sanity 7/8 → 6/8). Unsure → leave off; the runner prints a hint when it detects looping.
+5. **Take the exam.** Enumerate every situation with its expected letter; `mind.check(task)` passes with at most 1 miss. Failures mean the wording is incomplete — rephrase and re-check, no training.
+Or skip all five: hand `meadow_mind.ai_prompt()` plus your game description to any code agent, and it wires the task for you. You only review the exam score.
+## API
+### `MeadowMind(model_path=None)`
+Weight resolution: `MEADOW_MIND_MODEL` env → explicit path → local cache (`~/.meadow-mind/models/`) → auto-download.
+| Method | |
+|---|---|
+| `mind.decide(task, obs) -> (action, info)` | one real decision; `info = {status, letter, lat}` |
+| `mind.check(task) -> (ok, n)` | sanity gate; raises if the decision table fails |
+### `Task(...)`
+| Field | |
+|---|---|
+| `perceive(obs) -> str` (or `perceive(obs, task)` with memory) | perception layer |
+| `rule` / `option_text` / `options` / `act_text` | the one-sentence policy and its multiple-choice actions |
+| `sanity` | the exam: `[(status sentence, expected letter)]` |
+| `memory` / `mem_key` | working-memory switch (default off) + state key fn |
+| `env_id` / `env_kwargs` / `max_steps` / `judge` | environment wiring and report card |
+With `memory=True` the runner auto-tracks `task.visited`; use `task.seen(key)` inside `perceive` to annotate, e.g. `(safe, already visited)`.
+### CLI
+```bash
+meadow-mind cartpole        # sanity gate -> play one episode -> video + verdict
+```
+## Honest limits
+- Reaction floor is one decision pass (~0.4 s ≈ 2.5 Hz). Tighter deadlines (1 m pole, Pong trajectory prediction) are out of reach today.
+- Suited to tasks whose situations can be said in a sentence and whose policy fits a rule. Continuous high-precision control is not.
+- The perceiver is human-designed (or AI-generated via `ai_prompt()`); the model's job is reading the rule and deciding.
+## Roadmap
+- **v0.2** — layered perception with early action: accumulate confidence through the network and act when it crosses a threshold; easy situations should land near ~0.15 s.
+- Rule-learning loop: discover rules like MountainCar's swing trick from failed episodes automatically (no gradients — the learned artifact is a readable sentence).
+## License
+MIT © Hey-Meadow Lab
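
The working-memory behaviour described in the README above (a visited set, plus an annotation such as `(safe, already visited)` in the perception sentence) might look roughly like the following. This is a sketch under assumptions: `TinyMemory` is a hypothetical stand-in for the `visited` / `seen(key)` fields the README mentions, and `perceive_maze`, the grid layout and the wording are invented for illustration.

```python
class TinyMemory:
    """Hypothetical stand-in for the memory fields described in the README."""

    def __init__(self):
        self.visited = set()

    def seen(self, key):
        return key in self.visited

    def mark(self, key):
        self.visited.add(key)


def perceive_maze(pos, memory, walls, size=4):
    """Describe the four neighbouring cells in one sentence, annotating revisits."""
    r, c = pos
    moves = {
        "up": (r - 1, c),
        "down": (r + 1, c),
        "left": (r, c - 1),
        "right": (r, c + 1),
    }
    parts = []
    for name, cell in moves.items():
        in_grid = 0 <= cell[0] < size and 0 <= cell[1] < size
        if not in_grid or cell in walls:
            parts.append(f"{name}: blocked")
        elif memory.seen(cell):
            # The few extra words that let the rule prefer unexplored cells.
            parts.append(f"{name}: (safe, already visited)")
        else:
            parts.append(f"{name}: safe")
    return "; ".join(parts) + "."


memory = TinyMemory()
memory.mark((0, 0))                      # the agent started here
sentence = perceive_maze((0, 1), memory, walls={(1, 1)})
```

With memory switched off, the same function would report the start cell as plain `safe`, which is how a purely reactive run can keep pacing between two cells.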

meadow_mind-0.1.0/meadow_mind/__init__.py ADDED

@@ -0,0 +1,21 @@
+"""Meadow Mind — language-rule decision mind. One install, one import.
+    pip install meadow-mind
+    from meadow_mind import MeadowMind, Task, tasks, ai_prompt
+    mind = MeadowMind()                    # model auto-downloads on first use
+    task = tasks.mountaincar()
+    mind.check(task)                       # sanity gate
+    action, info = mind.decide(task, obs)  # obs in, env action out
+Everything below this API (engine, weights, decoding) is internal.
+Give ai_prompt() to any code agent to wire a NEW game automatically.
+"""
+from .mind import MeadowMind
+from .task import Task
+from .prompt import ai_prompt, AI_PROMPT
+from . import tasks
+__version__ = "0.1.0"
+__all__ = ["MeadowMind", "Task", "tasks", "ai_prompt", "AI_PROMPT", "__version__"]
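
The docstring above says the model auto-downloads on first use, and the README gives the resolution order as `MEADOW_MIND_MODEL` env, then explicit path, then local cache (`~/.meadow-mind/models/`), then auto-download. The sketch below only illustrates that ordering; `resolve_weights` and its return labels are hypothetical, and the package's actual logic lives in `mind.py`, which is not shown in this part of the diff.

```python
import os
from pathlib import Path


def resolve_weights(model_path=None, environ=None, cache_dir=None):
    """Return (source, path) following the documented priority order.

    Hypothetical helper: the name, signature and labels are assumptions.
    """
    environ = os.environ if environ is None else environ
    cache = Path(cache_dir) if cache_dir else Path.home() / ".meadow-mind" / "models"

    env_path = environ.get("MEADOW_MIND_MODEL")
    if env_path:                                    # 1. environment variable wins
        return "env", env_path
    if model_path:                                  # 2. explicit constructor argument
        return "explicit", str(model_path)
    if cache.is_dir() and any(cache.iterdir()):     # 3. non-empty local cache
        return "cache", str(cache)
    return "download", None                         # 4. fall back to auto-download
```

Passing `environ` and `cache_dir` explicitly keeps the function testable without touching the real environment or home directory.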

meadow_mind-0.1.0/meadow_mind/_engine.py ADDED

@@ -0,0 +1,302 @@
+"""DiffuCoder MLX engine — block-wise (semi-autoregressive) masked-diffusion
+generation with a per-layer K/V cache and Dream boundary-shift fix.
+This is the same proven inference path as diffucoder-play/diffucoder_mlx_block.py,
+packaged as a reusable engine for the OpenAI-compatible server. Speed comes from
+the algorithm (block-wise KV cache) + 8-bit weights, NOT from "being MLX".
+"""
+import time
+import numpy as np
+import mlx.core as mx
+from mlx_lm import load
+MASK = 151666     # <|mask|>
+IM_END = 151645   # <|im_end|>
+def _top_p_filter(logits, top_p):
+    sorted_desc = -mx.sort(-logits, axis=-1)
+    cum = mx.cumsum(mx.softmax(sorted_desc, axis=-1), axis=-1)
+    keep = cum <= top_p
+    keep = mx.concatenate([mx.ones_like(keep[:, :1]), keep[:, :-1]], axis=-1)
+    k = mx.sum(keep, axis=-1).astype(mx.int32)
+    thresh = mx.take_along_axis(sorted_desc, (k - 1)[:, None], axis=-1)
+    return mx.where(logits < thresh, -1e9, logits)
+def _sample_tokens(logits, temperature, top_p, neg_entropy):
+    if temperature and temperature > 0:
+        logits = logits / temperature
+    if top_p is not None and top_p < 1:
+        logits = _top_p_filter(logits, top_p)
+    probs = mx.softmax(logits, axis=-1)
+    if temperature and temperature > 0:
+        x0 = mx.random.categorical(logits)
+        conf = mx.take_along_axis(probs, x0[:, None], axis=-1)[:, 0]
+    else:
+        x0 = mx.argmax(probs, axis=-1)
+        conf = mx.max(probs, axis=-1)
+    if neg_entropy:
+        conf = mx.sum(probs * mx.log(probs + 1e-10), axis=-1)
+    return conf, x0
+class _Cache:
+    def __init__(self, n):
+        self.k = [None] * n
+        self.v = [None] * n
+        self.length = 0
+    def append(self, ks, vs):
+        for li in range(len(self.k)):
+            self.k[li] = ks[li] if self.k[li] is None else mx.concatenate([self.k[li], ks[li]], axis=2)
+            self.v[li] = vs[li] if self.v[li] is None else mx.concatenate([self.v[li], vs[li]], axis=2)
+        self.length += ks[0].shape[2]
+class DiffuCoderEngine:
+    def __init__(self, model_path, system="You are a helpful coding assistant."):
+        t0 = time.time()
+        self.is_llada = "llada" in model_path.lower()
+        self.mask_token_id = 156895 if self.is_llada else 151666
+        tokenizer_config = {"trust_remote_code": True} if self.is_llada else None
+        self.model, self.tok = load(model_path, tokenizer_config=tokenizer_config)
+        self.tie = getattr(self.model.args, "tie_word_embeddings", False)
+        self.layers = self.model.model.layers
+        self.n_layers = len(self.layers)
+        self.system = system
+        self.model_path = model_path
+        self.load_time = time.time() - t0
+        self._pc_ids = None      # prefix cache: last prompt's token ids
+        self._pc_kv = None       # prefix cache: last prompt's per-layer K/V
+    def _forward(self, ids_mx, offset, cache, attend_len, mask=None):
+        m = self.model.model
+        h = m.embed_tokens(ids_mx)
+        ks, vs = [], []
+        for li, layer in enumerate(self.layers):
+            attn = layer.self_attn
+            x = layer.input_layernorm(h)
+            B, L, _ = x.shape
+            q = attn.q_proj(x).reshape(B, L, attn.n_heads, -1).transpose(0, 2, 1, 3)
+            k = attn.k_proj(x).reshape(B, L, attn.n_kv_heads, -1).transpose(0, 2, 1, 3)
+            v = attn.v_proj(x).reshape(B, L, attn.n_kv_heads, -1).transpose(0, 2, 1, 3)
+            q = attn.rope(q, offset=offset)
+            k = attn.rope(k, offset=offset)
+            pk, pv = cache.k[li], cache.v[li]
+            if pk is not None and attend_len > 0:
+                kk = mx.concatenate([pk[:, :, :attend_len, :], k], axis=2)
+                vv = mx.concatenate([pv[:, :, :attend_len, :], v], axis=2)
+            else:
+                kk, vv = k, v
+            amask = mask.astype(q.dtype) if mask is not None else None
+            out = mx.fast.scaled_dot_product_attention(q, kk, vv, scale=attn.scale, mask=amask)
+            out = out.transpose(0, 2, 1, 3).reshape(B, L, -1)
+            h = h + attn.o_proj(out)
+            h = h + layer.mlp(layer.post_attention_layernorm(h))
+            ks.append(k)
+            vs.append(v)
+        h = m.norm(h)
+        logits = m.embed_tokens.as_linear(h) if self.tie else self.model.lm_head(h)
+        return logits, ks, vs
+    def _prefill(self, ids, use_cache=True, causal=True):
+        """Build the prompt's K/V cache. With prefix caching, reuse the shared prefix
+        of the previous prompt and only forward the new suffix → long conversations
+        don't re-prefill everything (fixes the 'gets slower each turn' problem).
+        causal=True encodes the prompt left-to-right so the prefix K/V is stable
+        across turns (required for caching; bidirectional prefill is not reusable)."""
+        cache = _Cache(self.n_layers)
+        m = 0
+        if use_cache and causal and self._pc_ids is not None:
+            ci = self._pc_ids
+            lim = min(len(ids), len(ci))
+            while m < lim and ids[m] == ci[m]:
+                m += 1
+            m = min(m, len(ids) - 1)        # leave >=1 token to actually forward
+            if m > 8:                        # worth reusing
+                for li in range(self.n_layers):
+                    cache.k[li] = self._pc_kv.k[li][:, :, :m, :]
+                    cache.v[li] = self._pc_kv.v[li][:, :, :m, :]
+                cache.length = m
+            else:
+                m = 0
+        Ln = len(ids) - m
+        # causal mask: new tokens see ALL m cached prefix + only earlier new tokens.
+        # This makes each prompt token's K/V depend only on its left context -> the
+        # prefix K/V is stable across turns -> reusable (bidirectional prefill is not).
+        mask = None
+        if causal:
+            tri = mx.triu(mx.full((Ln, Ln), -1e9), k=1)
+            mask = mx.concatenate([mx.zeros((Ln, m)), tri], axis=1) if m > 0 else tri
+        _, ks, vs = self._forward(mx.array(ids[m:][None]), m, cache, m, mask=mask)
+        cache.append(ks, vs)
+        if use_cache and causal:
+            mx.eval(cache.k[0], cache.v[0])
+            self._pc_ids, self._pc_kv = ids.copy(), cache
+        return cache, m
+    # ---- Σ: step-0 draft (the model's instant full-answer guess before any commit) ----
+    def step0_draft(self, prompt_text, n=96):
+        """One forward over [prompt + all-MASK]: the model already drafts the whole
+        answer (77-100% of the final words) before committing anything. Returns the
+        draft text + per-position confidence. AR has no equivalent."""
+        pids = np.array(self.tok.encode(prompt_text), dtype=np.int64)
+        x = np.concatenate([pids, np.full(n, self.mask_token_id, dtype=np.int64)])
+        plen = len(pids)
+        logits = self._forward_full(mx.array(x[None]))
+        ans = logits[mx.array(list(range(plen, plen + n)))]
+        probs = mx.softmax(ans, axis=-1)
+        conf = np.array(mx.max(probs, axis=-1).astype(mx.float32))
+        ids = np.array(mx.argmax(ans, axis=-1))
+        return {"ids": ids, "conf": conf, "text": self.tok.decode(ids.tolist()), "coherence": float(conf.mean())}
+    def route(self, user_msg, n=96):
+        """Prefill Gating: read the step-0 draft to pick mode / trigger RAG BEFORE generating."""
+        d = self.step0_draft(self.build_prompt([{"role": "user", "content": user_msg}]), n)
+        t = d["text"].lower()
+        sig = []
+        if any(k in t for k in ("<!doctype", "<html", "<div", "<body")): sig.append("html")
+        if "matplotlib" in t or "plt." in t: sig.append("chart")
+        if "select " in t and "from " in t: sig.append("sql")
+        if "def " in t or "class " in t: sig.append("code")
+        mode = "sectioned" if "html" in sig else "single"
+        # low coherence = model doesn't 'have it' -> escalate / inject RAG before generating
+        escalate = d["coherence"] < 0.45
+        return {"signals": sig, "mode": mode, "coherence": round(d["coherence"], 2),
+                "escalate": escalate, "draft_head": d["text"][:70].replace("\n", " ")}
+    # ---- infilling primitive (DiffuCoder's native strength) ----
+    def _forward_full(self, x_mx):
+        """Full bidirectional forward over the whole sequence (no block cache)."""
+        m = self.model.model
+        h = m.embed_tokens(x_mx)
+        for layer in self.layers:
+            h = layer(h, None, None)
+        h = m.norm(h)
+        logits = m.embed_tokens.as_linear(h) if self.tie else self.model.lm_head(h)
+        if self.is_llada:
+            return logits[0]
+        return mx.concatenate([logits[:, :1], logits[:, :-1]], axis=1)[0]  # Dream shift
+    def infill(self, prompt_text, pre, n_slot, post, steps=8, temperature=0.2, top_p=0.95):
+        """Fill `n_slot` masked tokens between fixed `pre`/`post`.
+        Returns (full_text, slot_text) where slot_text is only the filled slot region."""
+        pids = np.array(self.tok.encode(prompt_text), dtype=np.int64)
+        pre_ids = self.tok.encode(pre, add_special_tokens=False)
+        answer = np.array(pre_ids + [self.mask_token_id] * n_slot
+                          + self.tok.encode(post, add_special_tokens=False), dtype=np.int64)
+        x = np.concatenate([pids, answer])
+        plen = len(pids)
+        slot0 = plen + len(pre_ids)
+        ts = np.linspace(1, 1e-12, steps + 1)
+        for i in range(steps):
+            mi = x == self.mask_token_id
+            if not mi.any():
+                break
+            logits = self._forward_full(mx.array(x[None]))
+            mpos = np.nonzero(mi)[0]
+            conf, x0 = _sample_tokens(logits[mx.array(mpos)], temperature, top_p, neg_entropy=True)
+            mx.eval(conf, x0)
+            conf = np.array(conf.astype(mx.float32)); x0 = np.array(x0.astype(mx.int32))
+            k = int(len(mpos) * (1 - ts[i + 1] / ts[i])) if i < steps - 1 else len(mpos)
+            if k > 0:
+                order = np.argsort(-conf)[:k]
+                x[mpos[order]] = x0[order]
+        out = x[plen:]
+        slot_ids = x[slot0: slot0 + n_slot]
+        slot_text = self.tok.decode([int(t) for t in slot_ids.tolist() if t != self.mask_token_id])
+        full = self.tok.decode(out[out != self.mask_token_id].tolist())
+        return full, slot_text
+    # prefill dictionary: tool -> (pre, slots, post). Structure fixed, only args infilled.
+    TOOL_SCAFFOLDS = {
+        "read_file":   ('{"name": "read_file", "arguments": {"path": "', 8, '"}}'),
+        "write_file":  ('{"name": "write_file", "arguments": {"path": "', 8, '", "content": "..."}}'),
+        "run_bash":    ('{"name": "run_bash", "arguments": {"command": "', 12, '"}}'),
+        "search_code": ('{"name": "search_code", "arguments": {"query": "', 10, '"}}'),
+        "git_commit":  ('{"name": "git_commit", "arguments": {"message": "', 12, '"}}'),
+    }
+    def tool_call(self, user_msg, tool_name, steps=8):
+        """Structure-guaranteed tool call: infill only the arg value, rebuild from scaffold."""
+        pre, n, post = self.TOOL_SCAFFOLDS[tool_name]
+        prompt = self.build_prompt([{"role": "user", "content": user_msg}])
+        _, slot = self.infill(prompt, pre, n, post, steps=steps)
+        # keep only the value up to the first closing quote / special token / newline
+        val = slot.split('"')[0].split("<|")[0].split("\n")[0].strip()
+        return pre + val + post   # guaranteed valid JSON structure
+    @staticmethod
+    def _flatten(content):
+        """OpenAI content can be str | list[{type,text}] | None — flatten to text."""
+        if content is None:
+            return ""
+        if isinstance(content, str):
+            return content
+        if isinstance(content, list):
+            return "".join(p.get("text", "") if isinstance(p, dict) else str(p) for p in content)
+        return str(content)
+    def build_prompt(self, messages):
+        """messages: list of {role, content}. Prepend the engine's default system if none given."""
+        has_sys = any(m.get("role") == "system" for m in messages)
+        parts = []
+        if not has_sys:
+            parts.append(f"<|im_start|>system\n{self.system}<|im_end|>\n")
+        for m in messages:
+            parts.append(f"<|im_start|>{m.get('role','user')}\n{self._flatten(m.get('content'))}<|im_end|>\n")
+        parts.append("<|im_start|>assistant\n")
+        return "".join(parts)
+    def generate(self, prompt, max_new=128, block_size=32, tokens_per_step=8,
+                 temperature=0.2, top_p=0.95, alg="entropy", use_prefix_cache=False,
+                 causal_prefill=False):
+        # DEFAULT = bidirectional prefill, no cache (DiffuCoder is bidirectional-trained;
+        # this preserves quality). prefix-cache REQUIRES causal_prefill to be correct, but
+        # causal encoding degrades quality -> it's an opt-in speed/quality tradeoff, not the
+        # fix for long-conversation slowdown. The clean fix is memory-backed short context.
+        ids = np.array(self.tok.encode(prompt), dtype=np.int64)
+        t0 = time.time()
+        cache, reused = self._prefill(ids, use_prefix_cache, causal_prefill)
+        prev_tok = int(ids[-1])
+        out = []
+        n_blocks = (max_new + block_size - 1) // block_size
+        eps = 1e-12
+        neg = mx.array(-1e30, dtype=mx.float32)
+        for _ in range(n_blocks):
+            off = cache.length
+            block = mx.full((block_size,), self.mask_token_id, dtype=mx.int32)
+            steps = max(1, block_size // tokens_per_step)
+            timesteps = np.linspace(1, eps, steps + 1)
+            for i in range(steps):
+                # unmask selection stays fully in MLX — no per-step GPU->CPU->GPU round-trip
+                is_mask = block == self.mask_token_id
+                window = mx.concatenate([mx.array([prev_tok], dtype=mx.int32), block])
+                logits, _, _ = self._forward(window[None], off - 1, cache, off - 1)
+                conf, x0 = _sample_tokens(logits[0][:-1], temperature, top_p, neg_entropy=(alg == "entropy"))
+                conf = mx.where(is_mask, conf.astype(mx.float32), neg)          # only masked are candidates
+                t, s = timesteps[i], timesteps[i + 1]
+                frac = (1 - s / t) if i < steps - 1 else 1.0
+                n_transfer = mx.floor(mx.sum(is_mask.astype(mx.float32)) * frac).astype(mx.int32)
+                rank = mx.argsort(mx.argsort(-conf))                           # rank 0 = most confident
+                block = mx.where(rank < n_transfer, x0.astype(mx.int32), block)
+            block_list = block.tolist()                                        # one eval per block, not per step
+            out.extend(block_list)
+            _, ks, vs = self._forward(block[None], off, cache, off)
+            cache.append(ks, vs)
+            prev_tok = block_list[-1]
+            if IM_END in block_list:
+                break
+        mx.eval(cache.k[0])
+        dt = time.time() - t0
+        out = np.array(out)
+        if IM_END in out.tolist():
+            out = out[: out.tolist().index(IM_END)]
+        out = out[out != self.mask_token_id]
+        text = self.tok.decode(out.tolist())
+        return {"text": text, "time": dt, "n_tokens": int(len(out)), "tok_per_s": len(out) / dt if dt else 0.0,
+                "prompt_len": int(len(ids)), "prefix_reused": int(reused)}
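
The inner loop of `generate` above reveals masked tokens a few at a time: each step scores every position, computes how many masked positions to commit from the timestep schedule, and keeps the most confident proposals. The pure-Python sketch below mirrors only that control flow. `stub_model`, `fill_block` and the `MASK = -1` placeholder are hypothetical, and the fixed confidences stand in for the negative-entropy scores the engine computes from real logits.

```python
import math

MASK = -1  # placeholder mask id for this sketch; the engine uses the tokenizer's mask id


def stub_model(block, target):
    """Stand-in for a forward pass: propose target[i] with a fixed confidence."""
    confs = [1.0 / (1 + i) for i in range(len(block))]  # earlier positions are "more certain"
    return confs, list(target)


def fill_block(target, tokens_per_step=2, eps=1e-12):
    size = len(target)
    block = [MASK] * size
    steps = max(1, size // tokens_per_step)
    # Same shape as np.linspace(1, eps, steps + 1) in the engine.
    ts = [1 + (eps - 1) * i / steps for i in range(steps + 1)]
    history = []
    for i in range(steps):
        confs, proposal = stub_model(block, target)
        masked = [p for p in range(size) if block[p] == MASK]
        # Fraction of the remaining masked positions to commit at this step;
        # the last step always commits everything that is left.
        frac = (1 - ts[i + 1] / ts[i]) if i < steps - 1 else 1.0
        n_transfer = math.floor(len(masked) * frac)
        # Commit the n_transfer most confident masked positions.
        for p in sorted(masked, key=lambda q: -confs[q])[:n_transfer]:
            block[p] = proposal[p]
        history.append(n_transfer)
    return block, history


block, history = fill_block([10, 11, 12, 13, 14, 15, 16, 17])
```

Because of the floor, an early step may commit slightly fewer tokens than `tokens_per_step`; the final step picks up the remainder, so the block never ends with masks in it.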