PyPI: mlx-code 0.0.24.tar.gz → 0.0.26.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

{mlx_code-0.0.24 → mlx_code-0.0.26}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: mlx-code
-Version: 0.0.24
+Version: 0.0.26
 Summary: Coding Agent for Mac
 Home-page: https://josefalbers.github.io/mlx-code/
 Author: J Joe
@@ -17,6 +17,8 @@ Requires-Dist: httpx
 Requires-Dist: pydantic
 Requires-Dist: textual>=8.2.7
 Requires-Dist: rich>=15.0.0
+Requires-Dist: starlette
+Requires-Dist: uvicorn
 Provides-Extra: all
 Requires-Dist: python-lsp-server[all]; extra == "all"
 Requires-Dist: GitPython; extra == "all"
@@ -47,7 +49,7 @@ A Git-native coding agent that can run entirely on your Mac. No API keys, no clo
 ```
 Conversation tree (nodes = git commits with embedded chat history)
-  main ──●──●──●──●──●──●──●──●──●──●
+  main ──●──●──●──●──●──●──●──●──●──●──●──●──●──●
             │        │
             │        └── branch-1 ──●──●──●
             │                          │ ┌────────────┐
@@ -66,21 +68,21 @@ REPL tabs (each tab = a git branch + agent)    │
 │  └──────┘  └────┬─────┘  └──────────┘  └────────────┘  │
 └─────────────────┼──────────────────────────────────────┘
                   │
-                  ├────────────────────────────────────► each tab is an independent Agent
+                  ├─────────────────────────────────────────► Each tab is an independent Agent
                   │
-             ┌────┴─────────────────────────────────┐
-             │  Agent                               │
-             │  ┌──────────────┐  ┌──────────────┐  │
-             │  │ API:         │  │ Tools:       │  │
-             │  │ MLX (local)  │  │ Read  Write  │  │
-             │  │ Claude       │  │ Edit  Bash   │  │
-             │  │ Gemini       │  │ Grep  Find   │  │
-             │  │ OpenAI       │  │ Ls  Skill    │  │
-             │  └──────────────┘  │ Agent ───────┼──┼───► spawns child Agent
-             │                    └──────────────┘  │     (each with own tools + worktree + etc)
-             │  Git worktree                        │
-             │  (isolation + session state)         │
-             └──────────────────────────────────────┘
+             ┌────┴─────────────────────────────────────┐
+             │  Agent                                   │
+             │  ┌────────────────┐  ┌────────────────┐  │
+             │  │ API:           │  │ Tools:         │  │
+             │  │ Local (mlx-lm) │  │ Read    Write  │  │
+             │  │ Claude         │  │ Edit    Bash   │  │
+             │  │ Gemini         │  │ Grep    Find   │  │
+             │  │ OpenAI         │  │ Ls      Skill  │  │
+             │  └────────────────┘  │ Agent ─────────┼──┼───► Spawns child Agent
+             │                      └────────────────┘  │     (each with own tools + worktree + etc)
+             │  Git worktree                            │
+             │  (isolation + session state)             │
+             └──────────────────────────────────────────┘
 ```
 Each layer is importable and composable on its own. A commit records state, a branch records an alternative path, and a tab is just a live view over an `Agent`.
@@ -98,10 +100,15 @@ result = await agent.run('refactor utils.py to use dataclasses')
 ## Quick start
 ```bash
+# ephemeral run (no installation)
+uvx --from mlx-code mlc
+# or install into the current environment
 pip install mlx-code
-mlc                              # launch with local MLX model
+# launch
+mlc                              # with a local MLX model
 mlc-run --api gemini             # or use a remote provider
-mlc-run --api deepseek --model deepseek-v4-flash
 ```
 That's it. The first run starts a local inference server and drops you into the REPL.
@@ -123,12 +130,12 @@ That's it. The first run starts a local inference server and drops you into the
 **Git is the database.** When the agent makes file changes, they’re committed to a git worktree with the full conversation embedded in the commit message. Resume any past session by hash, branch from any checkpoint, and inspect the agent timeline with `git log`. No proprietary state files, just Git.
-**Your working directory is never at risk.** The agent operates inside a `git worktree`, not your checkout. It can make a mess, and you can inspect or discard it without ever touching `main`.
-**Built-in safety nets.** Subprocess environment variables go through an explicit allowlist, so secrets in your shell are never leaked to agent-spawned processes.
+**Built-in safety nets.** Your working directory is never at risk. The agent operates inside a `git worktree`, not your checkout. It can make a mess, and you can inspect or discard it without ever touching `main`. Subprocess environment variables go through an explicit allowlist, so secrets in your shell are never leaked to agent-spawned processes.
 **Batteries included.** Everything ships in one pip install: the MLX inference engine, the multi-protocol API server, the agent loop, the tools, and the TUI. No llama.cpp, no ollama, no vLLM bridge to find and configure. And the server natively speaks OpenAI, Anthropic, Gemini, and Codex wire formats simultaneously, so `claude`, `codex`, and `gemini` CLIs can all work against your local model without a translation layer.
+**Continuous batching.** The local inference server runs a continuous batching engine that processes multiple sequences concurrently. When you spawn parallel agents (eg, multiple tabs, `asyncio.gather` pipelines, or delegated sub-tasks) they all share the same GPU context and are stepped together each tick. A prefix cache persists KV snapshots to disk, so repeated system prompts and conversation prefixes are prefilled once and reused across sessions. No request queueing, no waiting for the previous agent to finish.
 ---
 ## Agent primitive
@@ -166,12 +173,12 @@ agent.messages = messages
 await agent.run("now add unit tests")
 ```
-Branch from any point in the conversation — each branch gets its own worktree:
+Branch from any point in the conversation. Each branch gets its own worktree:
 ```
 /branch                      # branch from current state
 /branch --rev 2              # branch from the 2nd user turn
-/branch --rev 3 --as-worktree try different approach
+/branch --rev 3 make it use httpx instead
 ```
 Since it's just git, you can inspect the timeline outside the REPL:
@@ -236,6 +243,43 @@ Reliability comes from specialization plus constraint. A read-only reviewer can'
 ---
+## Continuous batching
+The local server can run multiple inference sequences concurrently inside a single batch step. Instead of a global lock that serialises one request at a time, the batching engine maintains a live set of active sequences and yields tokens for all of them on every step.
+```bash
+mlc --engine batch            # continuous batching + built-in REPL
+```
+This unlocks true parallelism for multi-agent workloads:
+```python
+import asyncio
+from mlx_code.repl import Agent
+async def main():
+    agents = [Agent() for _ in range(4)]
+    await asyncio.gather(*[
+        a.run(f"Research topic: {t}")
+        for a, t in zip(agents, ["consensus", "cryptography", "networking", "storage"])
+    ])
+asyncio.run(main())
+```
+All four agents generate simultaneously inside the same batch. No sequential blocking.
+### Health endpoint
+```bash
+curl http://127.0.0.1:8000/health
+# {"status":"ok","model":"mlx-community/Qwen3.5-4B-OptiQ-4bit","active_sequences":2,"prefix_cache_files":5}
+```
+`active_sequences` shows how many agents are generating right now; `prefix_cache_files` shows how many prefix KV snapshots are stored on disk.
+---
 ## Command Line
 ### `mlc`: local server + harness
@@ -243,20 +287,20 @@ Reliability comes from specialization plus constraint. A read-only reviewer can'
 Starts the MLX inference server and launches the built-in TUI harness against it.
 ```bash
-# Default: local server + default TUI
+# Default: local server + default harness
 mlc
-# Use a simple terminal REPL instead of the TUI
-mlc --notui
+# Continuous batching mode (default is sequential caching mode)
+mlc --engine batch
+# Server only, no harness
+mlc --leash none
 # Use a different harness (routes traffic through the local server)
 mlc --leash claude
 mlc --leash gemini
 mlc --leash codex
-# Server only, no harness
-mlc --leash none
 # Specify a model
 mlc --model mlx-community/Qwen3.5-4B-OptiQ-4bit
@@ -307,7 +351,7 @@ mlc-run --api codex
 echo "explain lsp.py" | mlc-run -a deepseek | cat - PLAN.md | mlc-run --url http://localhost:9000
 # Simple terminal REPL (no TUI)
-mlc-run --notui
+mlc-run --bare
 ```
 ---
@@ -432,18 +476,19 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | Command | Description |
 |---|---|
-| `/help` | Show command reference |
+| `/branch [--rev N] [prompt]` | Open a new branch tab from the current (or earlier) checkpoint |
+| `/diff [--all]` | Show a side-by-side diff of changes in the worktree |
 | `/clear [--config F]` | Clear conversation; `--config` reloads agent from a JSON/YAML file |
+| `/tab [N]` | Jump to tab N |
 | `/history [--raw]` | Show conversation transcript; `--raw` shows the raw API message log |
-| `/diff [--all]` | Show a side-by-side diff of changes in the worktree |
-| `/errors` | Show timestamped error log for the current tab |
 | `/tools` | List active tools |
-| `/branch [--rev N] [prompt]` | Open a new branch tab from the current (or earlier) checkpoint |
 | `/abort` | Abort the running agent |
+| `/errors` | Show timestamped error log for the current tab |
 | `/export [path]` | Export session to JSON |
 | `/exit [--all]` | Close branch tab, or exit the app |
-| `!command` | Run a shell command; output captured in the TUI |
-| `$command` | Run an interactive command (TUI suspends, terminal handed to process) |
+| `/help` | Show command reference |
+| `!command` | Run a shell command; output captured in the TUI (eg, `ls`, `cat hello.c`) |
+| `$command` | Run an interactive command (eg, `vim`, `yazi`, `less hello.c`) |
 ### Key bindings
@@ -453,7 +498,7 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | `Ctrl-J` | Insert newline |
 | `Ctrl-1` … `Ctrl-9` | Jump to tab N |
 | `Ctrl-,` / `Ctrl-.` | Cycle through tabs |
-| `Ctrl-C` | Abort running agent |
+| `Ctrl-C` | Clear input, or abort running agent |
 | `Ctrl-D` | Close branch tab, or exit app |
 | `Ctrl-R` | Recall last prompt into editor |
@@ -471,7 +516,7 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | `Skill` | Retrieve named skill instructions from config |
 | `Agent` | Spawn an autonomous sub-agent for delegated work |
-All file tools enforce path sandboxing — the agent cannot read or write outside the worktree.
+All file tools enforce path sandboxing. The agent cannot read or write outside the worktree.
 ### Backends
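
Note on the health endpoint documented in the README hunk above: the response can be polled from a script to watch how many sequences are in flight. The following is a minimal sketch, assuming the server listens on `127.0.0.1:8000` and returns the JSON fields shown in the README sample. `parse_health` and `fetch_health` are hypothetical helper names and are not part of mlx-code.

```python
import json
from urllib.request import urlopen

# Sample body copied from the README hunk; a live server may return other values.
SAMPLE = (
    '{"status":"ok","model":"mlx-community/Qwen3.5-4B-OptiQ-4bit",'
    '"active_sequences":2,"prefix_cache_files":5}'
)


def parse_health(payload: str) -> tuple[bool, int, int]:
    """Return (is_ok, active_sequences, prefix_cache_files) from a /health body."""
    data = json.loads(payload)
    return (
        data.get("status") == "ok",
        int(data.get("active_sequences", 0)),
        int(data.get("prefix_cache_files", 0)),
    )


def fetch_health(url: str = "http://127.0.0.1:8000/health", timeout: float = 2.0):
    """Fetch and parse the endpoint (assumed URL; requires a running server)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_health(resp.read().decode("utf-8"))


print(parse_health(SAMPLE))
```

Only `parse_health` runs without a server; `fetch_health` is included to show how the two pieces would be combined.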

{mlx_code-0.0.24 → mlx_code-0.0.26}/README.md RENAMED

@@ -11,7 +11,7 @@ A Git-native coding agent that can run entirely on your Mac. No API keys, no clo
 ```
 Conversation tree (nodes = git commits with embedded chat history)
-  main ──●──●──●──●──●──●──●──●──●──●
+  main ──●──●──●──●──●──●──●──●──●──●──●──●──●──●
             │        │
             │        └── branch-1 ──●──●──●
             │                          │ ┌────────────┐
@@ -30,21 +30,21 @@ REPL tabs (each tab = a git branch + agent)    │
 │  └──────┘  └────┬─────┘  └──────────┘  └────────────┘  │
 └─────────────────┼──────────────────────────────────────┘
                   │
-                  ├────────────────────────────────────► each tab is an independent Agent
+                  ├─────────────────────────────────────────► Each tab is an independent Agent
                   │
-             ┌────┴─────────────────────────────────┐
-             │  Agent                               │
-             │  ┌──────────────┐  ┌──────────────┐  │
-             │  │ API:         │  │ Tools:       │  │
-             │  │ MLX (local)  │  │ Read  Write  │  │
-             │  │ Claude       │  │ Edit  Bash   │  │
-             │  │ Gemini       │  │ Grep  Find   │  │
-             │  │ OpenAI       │  │ Ls  Skill    │  │
-             │  └──────────────┘  │ Agent ───────┼──┼───► spawns child Agent
-             │                    └──────────────┘  │     (each with own tools + worktree + etc)
-             │  Git worktree                        │
-             │  (isolation + session state)         │
-             └──────────────────────────────────────┘
+             ┌────┴─────────────────────────────────────┐
+             │  Agent                                   │
+             │  ┌────────────────┐  ┌────────────────┐  │
+             │  │ API:           │  │ Tools:         │  │
+             │  │ Local (mlx-lm) │  │ Read    Write  │  │
+             │  │ Claude         │  │ Edit    Bash   │  │
+             │  │ Gemini         │  │ Grep    Find   │  │
+             │  │ OpenAI         │  │ Ls      Skill  │  │
+             │  └────────────────┘  │ Agent ─────────┼──┼───► Spawns child Agent
+             │                      └────────────────┘  │     (each with own tools + worktree + etc)
+             │  Git worktree                            │
+             │  (isolation + session state)             │
+             └──────────────────────────────────────────┘
 ```
 Each layer is importable and composable on its own. A commit records state, a branch records an alternative path, and a tab is just a live view over an `Agent`.
@@ -62,10 +62,15 @@ result = await agent.run('refactor utils.py to use dataclasses')
 ## Quick start
 ```bash
+# ephemeral run (no installation)
+uvx --from mlx-code mlc
+# or install into the current environment
 pip install mlx-code
-mlc                              # launch with local MLX model
+# launch
+mlc                              # with a local MLX model
 mlc-run --api gemini             # or use a remote provider
-mlc-run --api deepseek --model deepseek-v4-flash
 ```
 That's it. The first run starts a local inference server and drops you into the REPL.
@@ -87,12 +92,12 @@ That's it. The first run starts a local inference server and drops you into the
 **Git is the database.** When the agent makes file changes, they’re committed to a git worktree with the full conversation embedded in the commit message. Resume any past session by hash, branch from any checkpoint, and inspect the agent timeline with `git log`. No proprietary state files, just Git.
-**Your working directory is never at risk.** The agent operates inside a `git worktree`, not your checkout. It can make a mess, and you can inspect or discard it without ever touching `main`.
-**Built-in safety nets.** Subprocess environment variables go through an explicit allowlist, so secrets in your shell are never leaked to agent-spawned processes.
+**Built-in safety nets.** Your working directory is never at risk. The agent operates inside a `git worktree`, not your checkout. It can make a mess, and you can inspect or discard it without ever touching `main`. Subprocess environment variables go through an explicit allowlist, so secrets in your shell are never leaked to agent-spawned processes.
 **Batteries included.** Everything ships in one pip install: the MLX inference engine, the multi-protocol API server, the agent loop, the tools, and the TUI. No llama.cpp, no ollama, no vLLM bridge to find and configure. And the server natively speaks OpenAI, Anthropic, Gemini, and Codex wire formats simultaneously, so `claude`, `codex`, and `gemini` CLIs can all work against your local model without a translation layer.
+**Continuous batching.** The local inference server runs a continuous batching engine that processes multiple sequences concurrently. When you spawn parallel agents (eg, multiple tabs, `asyncio.gather` pipelines, or delegated sub-tasks) they all share the same GPU context and are stepped together each tick. A prefix cache persists KV snapshots to disk, so repeated system prompts and conversation prefixes are prefilled once and reused across sessions. No request queueing, no waiting for the previous agent to finish.
 ---
 ## Agent primitive
@@ -130,12 +135,12 @@ agent.messages = messages
 await agent.run("now add unit tests")
 ```
-Branch from any point in the conversation — each branch gets its own worktree:
+Branch from any point in the conversation. Each branch gets its own worktree:
 ```
 /branch                      # branch from current state
 /branch --rev 2              # branch from the 2nd user turn
-/branch --rev 3 --as-worktree try different approach
+/branch --rev 3 make it use httpx instead
 ```
 Since it's just git, you can inspect the timeline outside the REPL:
@@ -200,6 +205,43 @@ Reliability comes from specialization plus constraint. A read-only reviewer can'
 ---
+## Continuous batching
+The local server can run multiple inference sequences concurrently inside a single batch step. Instead of a global lock that serialises one request at a time, the batching engine maintains a live set of active sequences and yields tokens for all of them on every step.
+```bash
+mlc --engine batch            # continuous batching + built-in REPL
+```
+This unlocks true parallelism for multi-agent workloads:
+```python
+import asyncio
+from mlx_code.repl import Agent
+async def main():
+    agents = [Agent() for _ in range(4)]
+    await asyncio.gather(*[
+        a.run(f"Research topic: {t}")
+        for a, t in zip(agents, ["consensus", "cryptography", "networking", "storage"])
+    ])
+asyncio.run(main())
+```
+All four agents generate simultaneously inside the same batch. No sequential blocking.
+### Health endpoint
+```bash
+curl http://127.0.0.1:8000/health
+# {"status":"ok","model":"mlx-community/Qwen3.5-4B-OptiQ-4bit","active_sequences":2,"prefix_cache_files":5}
+```
+`active_sequences` shows how many agents are generating right now; `prefix_cache_files` shows how many prefix KV snapshots are stored on disk.
+---
 ## Command Line
 ### `mlc`: local server + harness
@@ -207,20 +249,20 @@ Reliability comes from specialization plus constraint. A read-only reviewer can'
 Starts the MLX inference server and launches the built-in TUI harness against it.
 ```bash
-# Default: local server + default TUI
+# Default: local server + default harness
 mlc
-# Use a simple terminal REPL instead of the TUI
-mlc --notui
+# Continuous batching mode (default is sequential caching mode)
+mlc --engine batch
+# Server only, no harness
+mlc --leash none
 # Use a different harness (routes traffic through the local server)
 mlc --leash claude
 mlc --leash gemini
 mlc --leash codex
-# Server only, no harness
-mlc --leash none
 # Specify a model
 mlc --model mlx-community/Qwen3.5-4B-OptiQ-4bit
@@ -271,7 +313,7 @@ mlc-run --api codex
 echo "explain lsp.py" | mlc-run -a deepseek | cat - PLAN.md | mlc-run --url http://localhost:9000
 # Simple terminal REPL (no TUI)
-mlc-run --notui
+mlc-run --bare
 ```
 ---
@@ -396,18 +438,19 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | Command | Description |
 |---|---|
-| `/help` | Show command reference |
+| `/branch [--rev N] [prompt]` | Open a new branch tab from the current (or earlier) checkpoint |
+| `/diff [--all]` | Show a side-by-side diff of changes in the worktree |
 | `/clear [--config F]` | Clear conversation; `--config` reloads agent from a JSON/YAML file |
+| `/tab [N]` | Jump to tab N |
 | `/history [--raw]` | Show conversation transcript; `--raw` shows the raw API message log |
-| `/diff [--all]` | Show a side-by-side diff of changes in the worktree |
-| `/errors` | Show timestamped error log for the current tab |
 | `/tools` | List active tools |
-| `/branch [--rev N] [prompt]` | Open a new branch tab from the current (or earlier) checkpoint |
 | `/abort` | Abort the running agent |
+| `/errors` | Show timestamped error log for the current tab |
 | `/export [path]` | Export session to JSON |
 | `/exit [--all]` | Close branch tab, or exit the app |
-| `!command` | Run a shell command; output captured in the TUI |
-| `$command` | Run an interactive command (TUI suspends, terminal handed to process) |
+| `/help` | Show command reference |
+| `!command` | Run a shell command; output captured in the TUI (eg, `ls`, `cat hello.c`) |
+| `$command` | Run an interactive command (eg, `vim`, `yazi`, `less hello.c`) |
 ### Key bindings
@@ -417,7 +460,7 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | `Ctrl-J` | Insert newline |
 | `Ctrl-1` … `Ctrl-9` | Jump to tab N |
 | `Ctrl-,` / `Ctrl-.` | Cycle through tabs |
-| `Ctrl-C` | Abort running agent |
+| `Ctrl-C` | Clear input, or abort running agent |
 | `Ctrl-D` | Close branch tab, or exit app |
 | `Ctrl-R` | Recall last prompt into editor |
@@ -435,7 +478,7 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | `Skill` | Retrieve named skill instructions from config |
 | `Agent` | Spawn an autonomous sub-agent for delegated work |
-All file tools enforce path sandboxing — the agent cannot read or write outside the worktree.
+All file tools enforce path sandboxing. The agent cannot read or write outside the worktree.
 ### Backends
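
Note on the "Continuous batching" section added in the README hunk above: the described behaviour (a live set of active sequences, each stepped once per tick, with new requests joining without waiting for earlier ones to finish) can be illustrated with a toy scheduler. This is a sketch of the general technique only, not the package's engine. `run_continuous_batch` and its `max_active` limit are hypothetical, and each "token" is just a counter decrement.

```python
from collections import deque


def run_continuous_batch(requests, max_active=4):
    """Simulate continuous batching.

    requests: list of (name, n_tokens) pairs in arrival order.
    Returns a timeline: one list per step naming the sequences that
    produced a token in that step.
    """
    pending = deque(requests)
    active = {}    # name -> tokens still to generate
    timeline = []
    while pending or active:
        # Admit waiting requests as soon as a slot is free, instead of
        # waiting for the whole current batch to drain.
        while pending and len(active) < max_active:
            name, n_tokens = pending.popleft()
            if n_tokens > 0:
                active[name] = n_tokens
        if not active:
            break
        step = sorted(active)
        timeline.append(step)
        # Every active sequence advances by one token in the same step.
        for name in step:
            active[name] -= 1
            if active[name] == 0:
                del active[name]  # finished sequences free their slot
    return timeline


timeline = run_continuous_batch([("a", 3), ("b", 1), ("c", 2)], max_active=2)
for i, names in enumerate(timeline, 1):
    print(i, names)
```

In the run above, `c` starts in the step right after `b` finishes while `a` is still generating, which is the property that distinguishes continuous batching from processing fixed batches one after another.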

mlx_code-0.0.24/mlx_code/ntui.py → mlx_code-0.0.26/mlx_code/bare.py RENAMED

@@ -110,6 +110,7 @@ class SimpleRepl:
                 if out_text:
                     self._write_delta(prefix + out_text, 'tool_result')
                 self._last_stream_type = t
+                print()
             elif t == 'commit':
                 self._pending_nls = 0
                 self._awaiting_content = False

mlx_code-0.0.26/mlx_code/bats.py ADDED
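
Note on `PrefixCache._path` in the hunk below: each on-disk KV snapshot is keyed by the sanitised model name, the prefix length, and an 8-byte BLAKE2b digest of the token IDs packed as unsigned ints. The standalone sketch below reproduces that naming scheme using only the standard library, so the key derivation can be inspected without loading a model. `prefix_cache_path` is a hypothetical free function written for illustration; it folds in the `MIN_PREFIX_TOKENS` check that the diff performs in `lookup` and `store`.

```python
import hashlib
from array import array
from pathlib import Path

MIN_PREFIX_TOKENS = 256  # same threshold as the constant in the diff


def prefix_cache_path(cache_dir, model_name, prefix_tokens):
    """Return the snapshot path for a token prefix, or None if it is too short."""
    if not prefix_tokens or len(prefix_tokens) < MIN_PREFIX_TOKENS:
        return None
    # Keep only alphanumeric characters so the model name is filesystem-safe.
    safe = "".join(c for c in model_name if c.isalnum())
    # Pack token IDs as unsigned ints and hash the raw bytes.
    packed = array("I", prefix_tokens).tobytes()
    digest = hashlib.blake2b(packed, digest_size=8).hexdigest()
    return Path(cache_dir) / f"{safe}_{len(prefix_tokens)}_{digest}.safetensors"


path = prefix_cache_path(".cache", "mlx-community/Qwen3.5-4B-OptiQ-4bit", list(range(300)))
print(path.name)
```

Because the digest covers every token in the prefix, two prompts share a snapshot only when their prefixes are identical token for token and have the same length.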

@@ -0,0 +1,299 @@
+import asyncio
+import json
+import queue as _queue
+import time
+import uuid
+import threading
+import hashlib
+from array import array
+from contextlib import asynccontextmanager
+from pathlib import Path
+import mlx.core as mx
+from starlette.applications import Starlette
+from starlette.requests import Request
+from starlette.responses import StreamingResponse, JSONResponse
+from starlette.routing import Route
+import logging
+logger = logging.getLogger(__name__)
+MIN_PREFIX_TOKENS = 256
+def _hash_tokens(tokens):
+    arr = array('I', tokens)
+    return hashlib.blake2b(arr.tobytes(), digest_size=8).hexdigest()
+class PrefixCache:
+    def __init__(self, model_name, cache_dir):
+        self.model_name = model_name
+        self.cache_dir = Path(cache_dir)
+        self.cache_dir.mkdir(parents=True, exist_ok=True)
+    def _path(self, prefix_tokens):
+        safe = ''.join((c for c in self.model_name if c.isalnum()))
+        h = _hash_tokens(prefix_tokens)
+        return self.cache_dir / f'{safe}_{len(prefix_tokens)}_{h}.safetensors'
+    def lookup(self, prefix_tokens):
+        if not prefix_tokens or len(prefix_tokens) < MIN_PREFIX_TOKENS:
+            return None
+        path = self._path(prefix_tokens)
+        if not path.exists():
+            return None
+        try:
+            from mlx_lm.models.cache import load_prompt_cache
+            cache, _ = load_prompt_cache(str(path), return_metadata=True)
+            mx.async_eval(cache)
+            return cache
+        except Exception as exc:
+            logger.info(f'[batch] failed to load prefix cache {path.name}: {exc}')
+            return None
+    def store(self, prefix_tokens, kv_cache):
+        if not prefix_tokens or len(prefix_tokens) < MIN_PREFIX_TOKENS:
+            return
+        path = self._path(prefix_tokens)
+        if path.exists():
+            return
+        try:
+            from mlx_lm.models.cache import save_prompt_cache
+            save_prompt_cache(str(path), kv_cache)
+            logger.info(f'[batch] saved prefix cache  len={len(prefix_tokens)}  file={path.name}')
+        except Exception as exc:
+            logger.info(f'[batch] failed to save prefix cache: {exc}')
+def _prefill_prefix(model, tokens, prefill_step_size=2048):
+    from mlx_lm.models.cache import make_prompt_cache
+    prompt_cache = make_prompt_cache(model)
+    prompt = mx.array(tokens)
+    while prompt.shape[0] > 0:
+        n = min(prefill_step_size, prompt.shape[0])
+        model(prompt[:n][None], cache=prompt_cache)
+        mx.eval([c.state for c in prompt_cache])
+        prompt = prompt[n:]
+        mx.clear_cache()
+    return prompt_cache
+def _get_prefix(tokens, ckpts):
+    if not ckpts:
+        return (None, 0)
+    first_ckpt = min(ckpts)
+    if first_ckpt < MIN_PREFIX_TOKENS:
+        return (None, 0)
+    return (tokens[:first_ckpt], first_ckpt)
+def make_batch_app(model_name: str, cache_dir: str='.cache'):
+    state = {'model': None, 'tokenizer': None, 'batch_gen': None, 'request_queue': _queue.Queue(), 'active': {}, 'loop': None, 'prefix_cache': None}
+    def _engine():
+        rq = state['request_queue']
+        active = state['active']
+        bg = state['batch_gen']
+        tok = state['tokenizer']
+        loop = state['loop']
+        model = state['model']
+        pcache = state['prefix_cache']
+        while True:
+            while not rq.empty():
+                try:
+                    tokens, max_tokens, token_queue, ckpts = rq.get_nowait()
+                    _insert(bg, active, pcache, model, tok, loop, tokens, max_tokens, token_queue, ckpts)
+                except _queue.Empty:
+                    break
+            if not active:
+                tokens, max_tokens, token_queue, ckpts = rq.get()
+                _insert(bg, active, pcache, model, tok, loop, tokens, max_tokens, token_queue, ckpts)
+            try:
+                results = bg.next_generated()
+            except Exception:
+                for uid, meta in list(active.items()):
+                    loop.call_soon_threadsafe(meta['q'].put_nowait, None)
+                active.clear()
+                continue
+            for r in results:
+                meta = active.get(r.uid)
+                if meta is None:
+                    continue
+                detok = meta['detok']
+                detok.add_token(r.token)
+                seg = detok.last_segment
+                if r.finish_reason is not None:
+                    detok.finalize()
+                    if (final := detok.last_segment):
+                        loop.call_soon_threadsafe(meta['q'].put_nowait, final)
+                    loop.call_soon_threadsafe(meta['q'].put_nowait, None)
+                    del active[r.uid]
+                elif seg:
+                    loop.call_soon_threadsafe(meta['q'].put_nowait, seg)
+    def _insert(bg, active, pcache, model, tok, loop, tokens, max_tokens, token_queue, ckpts):
+        prefix_tokens, prefix_len = _get_prefix(tokens, ckpts)
+        if prefix_tokens is not None:
+            cached_kv = pcache.lookup(prefix_tokens)
+            if cached_kv is not None:
+                suffix = tokens[prefix_len:]
+                try:
+                    uids = bg.insert([suffix], [max_tokens], caches=[cached_kv])
+                except Exception as exc:
+                    logger.info(f'[batch] cache insert failed ({exc}), falling back to full prompt')
+                    uids = bg.insert([tokens], [max_tokens])
+                    prefix_len = 0
+                else:
+                    logger.info(f'[batch] cache HIT  prefix={prefix_len}  suffix={len(suffix)}')
+                del cached_kv
+                mx.clear_cache()
+            else:
+                logger.info(f'[batch] prefilling prefix  prefix={prefix_len}  suffix={len(tokens) - prefix_len}')
+                prefix_kv = _prefill_prefix(model, prefix_tokens)
+                pcache.store(prefix_tokens, prefix_kv)
+                suffix = tokens[prefix_len:]
+                try:
+                    uids = bg.insert([suffix], [max_tokens], caches=[prefix_kv])
+                except Exception as exc:
+                    logger.info(f'[batch] cache insert failed ({exc}), falling back to full prompt')
+                    uids = bg.insert([tokens], [max_tokens])
+                    prefix_len = 0
+                del prefix_kv
+                mx.clear_cache()
+            active[uids[0]] = {'q': token_queue, 'detok': tok.detokenizer}
+        else:
+            uids = bg.insert([tokens], [max_tokens])
+            logger.info(f'[batch] no cache  prompt={len(tokens)}')
+            active[uids[0]] = {'q': token_queue, 'detok': tok.detokenizer}
+    @asynccontextmanager
+    async def lifespan(_app):
+        from mlx_lm import load
+        from mlx_lm.generate import BatchGenerator
+        from mlx_lm.tokenizer_utils import TokenizerWrapper
+        logger.info(f'[batch] Loading model {model_name!r} …')
+        model, tokenizer = load(model_name)
+        if not isinstance(tokenizer, TokenizerWrapper):
+            tokenizer = TokenizerWrapper(tokenizer)
+        eos = set(tokenizer.eos_token_ids) | {tokenizer.eos_token_id}
+        stop_tokens = [[t] for t in eos]
+        batch_gen = BatchGenerator(model, stop_tokens=stop_tokens)
+        state.update(model=model, tokenizer=tokenizer, batch_gen=batch_gen, loop=asyncio.get_running_loop(), prefix_cache=PrefixCache(model_name, cache_dir))
+        logger.info('[batch] Model ready. Starting engine thread.')
+        threading.Thread(target=_engine, daemon=True).start()
+        yield
+        batch_gen.close()
+    @staticmethod
+    def _detect_api(path: str) -> str:
+        if path.startswith('/v1beta/models/'):
+            return 'gemini'
+        if path.startswith('/v1/messages'):
+            return 'claude'
+        if path.startswith('/v1/responses'):
+            return 'codex'
+        return 'noapi'
+    async def _stream_sse(token_queue, api, msg_id, in_tokens):
+        from . import main as _m
+        adapters = {'claude': _m.ClaudeAdapter, 'codex': _m.CodexAdapter, 'gemini': _m.GeminiAdapter, 'noapi': _m.DefaultAdapter}
+        adapter = adapters.get(api, _m.DefaultAdapter)(msg_id, in_tokens)
+        yield adapter.start()
+        st = 'thinking'
+        buf = ''
+        think_tags = ['<think>', '</think>']
+        while True:
+            text = await token_queue.get()
+            if text is None:
+                break
+            buf += text
+            seg = text
+            while any((t in seg for t in think_tags)):
+                if st == 'text' and think_tags[0] in seg:
+                    before, _, seg = seg.partition(think_tags[0])
+                    if before:
+                        yield adapter.text('text', before)
+                    st = 'thinking'
+                if st == 'thinking' and think_tags[1] in seg:
+                    before, _, seg = seg.partition(think_tags[1])
+                    if before:
+                        yield adapter.text('thinking', before)
+                    st = 'text'
+            if seg:
+                yield adapter.text(st, seg)
+        if (tools := _m._parse_tools_xml(buf)):
+            for tool in tools:
+                yield adapter.tool(tool)
+            yield adapter.end(True)
+        else:
+            yield adapter.end(False)
+    async def generate_endpoint(request: Request):
+        from . import main as _m
+        if state['batch_gen'] is None:
+            return JSONResponse({'error': 'model not loaded'}, status_code=503)
+        path = request.url.path.split('?')[0].rstrip('/')
+        api = _detect_api(path)
+        if api == 'gemini':
+            q = str(request.url.query) or ''
+            if 'alt=sse' not in q and 'streamGenerateContent' not in path:
+                return JSONResponse({'candidates': [{'content': {'role': 'model', 'parts': [{'text': '{"complexity_reasoning":"local","complexity_score":50}'}]}, 'finishReason': 'STOP'}], 'usageMetadata': {'promptTokenCount': 0, 'candidatesTokenCount': 0}})
+        body = await request.json()
+        max_tokens = int(body.get('max_tokens', body.get('max_completion_tokens', 8192)))
+        try:
+            prompt, ckpts = _m.encode(body, api, state['tokenizer'], None, None, None)
+        except Exception as exc:
+            return JSONResponse({'error': f'encode: {exc}'}, status_code=500)
+        if ckpts is None or not prompt:
+            return JSONResponse({'error': 'empty prompt'}, status_code=400)
+        msg_id = f'msg_{uuid.uuid4().hex}'
+        token_queue = asyncio.Queue()
+        state['request_queue'].put((prompt, max_tokens, token_queue, ckpts))
+        async def _sse():
+            async for chunk in _stream_sse(token_queue, api, msg_id, len(prompt)):
+                yield chunk
+        return StreamingResponse(_sse(), media_type='text/event-stream')
+    async def simple_generate(request: Request):
+        if state['batch_gen'] is None:
+            return JSONResponse({'error': 'model not loaded'}, status_code=503)
+        body = await request.json()
+        tok = state['tokenizer']
+        max_tokens = body.get('max_tokens', 256)
+        if 'messages' in body:
+            text = tok.apply_chat_template(body['messages'], tokenize=False, add_generation_prompt=True)
+        else:
+            text = body.get('prompt', '')
+        tokens = tok.encode(text)
+        if not tokens:
+            return JSONResponse({'error': 'empty prompt'}, status_code=400)
+        token_queue = asyncio.Queue()
+        state['request_queue'].put((tokens, max_tokens, token_queue, []))
+        if body.get('stream', True):
+            async def _raw():
+                while True:
+                    chunk = await token_queue.get()
+                    if chunk is None:
+                        break
+                    yield chunk
+            return StreamingResponse(_raw(), media_type='text/plain')
+        parts = []
+        while True:
+            chunk = await token_queue.get()
+            if chunk is None:
+                break
+            parts.append(chunk)
+        return JSONResponse({'text': ''.join(parts)})
+    async def list_models(_req):
+        return JSONResponse({'data': [{'id': 'local', 'object': 'model', 'created': int(time.time()), 'owned_by': 'local'}]})
+    async def count_tokens(_req):
+        return JSONResponse({'input_tokens': 0})
+    async def health(_req):
+        pc = state['prefix_cache']
+        n_cached = 0
+        if pc and pc.cache_dir.exists():
+            n_cached = sum((1 for _ in pc.cache_dir.glob('*.safetensors')))
+        return JSONResponse({'status': 'ok', 'model': model_name, 'active_sequences': len(state['active']), 'prefix_cache_files': n_cached})
+    return Starlette(routes=[Route('/v1/models', list_models, methods=['GET']), Route('/v1/messages/count_tokens', count_tokens, methods=['POST']), Route('/v1/chat/completions', generate_endpoint, methods=['POST']), Route('/v1/messages', generate_endpoint, methods=['POST']), Route('/v1/responses', generate_endpoint, methods=['POST']), Route('/v1beta/models/{rest:path}', generate_endpoint, methods=['POST']), Route('/generate', simple_generate, methods=['POST']), Route('/health', health, methods=['GET'])], lifespan=lifespan)
+if __name__ == '__main__':
+    import uvicorn
+    uvicorn.run(make_batch_app('mlx-community/Qwen3.5-4B-OptiQ-4bit'), host='0.0.0.0', port=8000)
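
Reviewer note on the `<think>` handling in `_stream_sse` above: the inner `while` loop only consumes a tag when it matches the current state. A chunk containing `<think>` while `st` is already `'thinking'` (or `</think>` while `st` is `'text'`) appears to keep the loop condition true without consuming anything. The sketch below shows one way to split chunks so that every pass either consumes a tag or exits. `split_think` is a hypothetical helper, not part of mlx-code, and like the original it assumes a tag never straddles two chunks.

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_think(chunks, state="thinking"):
    """Yield (state, text) pairs; hypothetical helper, not mlx-code API."""
    for seg in chunks:
        while seg:
            # Only the tag that can end the current state is searched for.
            tag = THINK_CLOSE if state == "thinking" else THINK_OPEN
            before, found, rest = seg.partition(tag)
            if before:
                yield state, before
            if not found:
                break  # no state change in the remainder of this chunk
            state = "text" if state == "thinking" else "thinking"
            seg = rest
```

An unexpected tag (for example `<think>` while already thinking) is passed through as ordinary text instead of being retried.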

{mlx_code-0.0.24 → mlx_code-0.0.26}/mlx_code/main.py RENAMED

@@ -871,13 +871,13 @@ def make_handler(model_name, cache_dir, system, names, skips, gwt=None, parse_th
                 raise
     return Handler
-def serve(host: str, port: int, model: str, cache: str, system: str | None, tools: list[str], skips: list[str], *, fixed_port: bool=False, gwt=None) -> tuple[HTTPServer, str]:
+def _serve_cache(host, port, model, cache, system, tools, skips, *, fixed_port=False, gwt=None):
     handler = make_handler(model, cache, system, tools, skips, gwt)
     while True:
         try:
             server = HTTPServer((host, port), handler)
             url = f'http://{host}:{port}'
-            logger.debug(f'Server bound to {url}')
+            logger.debug(f'Cache server bound to {url}')
             return (server, url)
         except OSError as e:
             if e.errno in (48, 98):
@@ -888,12 +888,52 @@ def serve(host: str, port: int, model: str, cache: str, system: str | None, tool
             else:
                 raise
+def _serve_batch(host, port, model, cache_dir='.cache', *, fixed_port=False):
+    import uvicorn
+    from .bats import make_batch_app
+    import socket
+    import time
+    app = make_batch_app(model, cache_dir=cache_dir)
+    while True:
+        try:
+            with socket.socket() as s:
+                s.bind((host, port))
+        except OSError as e:
+            if e.errno in (48, 98):
+                if fixed_port:
+                    logger.error(f'Port {port} is already in use.')
+                    sys.exit(1)
+                port += 1
+            else:
+                raise
+        else:
+            break
+    config = uvicorn.Config(app, host=host, port=port, loop='asyncio', log_level='warning')
+    uv_server = uvicorn.Server(config)
+    t = threading.Thread(target=uv_server.run, daemon=True)
+    t.start()
+    start_time = time.time()
+    notified = False
+    while True:
+        try:
+            with socket.create_connection((host, port), timeout=0.1):
+                break
+        except OSError:
+            if not notified and time.time() - start_time > 3.0:
+                logger.info('Waiting for batch server to start (model may be downloading)...')
+                notified = True
+            time.sleep(0.2)
+    url = f'http://{host}:{port}'
+    logger.debug(f'Batch server bound to {url}')
+    return (uv_server, url)
 def main():
     parser = argparse.ArgumentParser(description='mlx-code MAIN')
     parser.add_argument('-p', '--prompt', default=None, help='Initial prompt sent automatically when the REPL starts')
     parser.add_argument('-r', '--resume', default=None, metavar='COMMIT', help='Resume a previous session from the given git commit hash')
     parser.add_argument('-m', '--model', default='mlx-community/Qwen3.5-4B-OptiQ-4bit', help='MLX model path or HuggingFace repo ID (default: Qwen3.5-4B-OptiQ-4bit)')
     parser.add_argument('-l', '--leash', choices=['claude', 'codex', 'gemini', 'noapi', 'none'], default='noapi', help="AI harness to launch against the server; 'noapi' starts the built-in REPL, 'none' runs the server only")
+    parser.add_argument('--engine', choices=['cache', 'batch'], default='cache', help="'cache' uses PromptCache + single-sequence (default); 'batch' uses BatchGenerator for concurrent sequences (only compatible with --leash none or noapi)")
     parser.add_argument('--skill', default=None, help='Directory to scan recursively for SKILL.md files')
     parser.add_argument('--tools', nargs='+', default=None, help='Whitelist of tool names to enable; allows all tools when omitted')
     parser.add_argument('--system', type=str, default=None, help='System prompt override passed to the model')
@@ -903,10 +943,14 @@ def main():
     parser.add_argument('--port', type=int, default=None, help='Port to listen on; auto-increments if already in use (default: 8000)')
     parser.add_argument('--skips', nargs='+', default=['(?m)^\\[SUGGESTION MODE[\\s\\S]*', '(?m)^<system-reminder>[\\s\\S]*?^</system-reminder>\\s*'], help='Regex patterns stripped from model output before it is returned to the client')
     parser.add_argument('--stream', default=None, help='File to stream log into')
-    parser.add_argument('--notui', action='store_true', help='Use simple terminal REPL instead of TUI')
+    parser.add_argument('--bare', action='store_true', help='Use simple terminal REPL instead of TUI')
     args, leash_args = parser.parse_known_args()
     logger.debug(f'args={args!r} leash_args={leash_args!r}')
+    if args.engine == 'batch' and args.leash not in ('none', 'noapi'):
+        parser.error('--engine batch only supports --leash none or --leash noapi for now')
     cache = os.path.abspath(args.cache)
+    port = args.port if args.port is not None else 8000
+    fixed_port = args.port is not None
     with tempfile.TemporaryDirectory(dir='/tmp') as _home:
         env = os.environ.copy()
         home = Path(_home)
@@ -915,18 +959,28 @@ def main():
         env['HOME'] = _home
         env['SHELL'] = '/bin/bash'
         env['PWD'] = cwd
-        server, url = serve(host=args.host, port=args.port if args.port is not None else 8000, model=args.model, cache=cache, system=None if args.leash in ('none', 'noapi') else args.system, tools=args.tools, skips=args.skips, fixed_port=args.port is not None, gwt=gwt)
+        if args.engine == 'batch':
+            server, url = _serve_batch(args.host, port, args.model, cache_dir=cache, fixed_port=fixed_port)
+        else:
+            server, url = _serve_cache(host=args.host, port=port, model=args.model, cache=cache, system=None if args.leash in ('none', 'noapi') else args.system, tools=args.tools, skips=args.skips, fixed_port=fixed_port, gwt=gwt)
         if args.leash == 'none':
-            try:
-                server.serve_forever()
-            except KeyboardInterrupt:
-                print('\nShutting down server...')
-                server.server_close()
+            if args.engine == 'batch':
+                try:
+                    threading.Event().wait()
+                except KeyboardInterrupt:
+                    print('\nShutting down server...')
+            else:
+                try:
+                    server.serve_forever()
+                except KeyboardInterrupt:
+                    print('\nShutting down server...')
+                    server.server_close()
         else:
-            threading.Thread(target=server.serve_forever, daemon=True).start()
+            if args.engine == 'cache':
+                threading.Thread(target=server.serve_forever, daemon=True).start()
             if args.leash == 'noapi':
                 from .repl import run_repl
-                run_repl(base_url=url, api=args.leash, repo=cwd, env=env, system=args.system, tool_names=args.tools, sdir=args.skill, init_prompt=args.prompt, resume=args.resume, stream=args.stream, notui=args.notui)
+                run_repl(base_url=url, api=args.leash, repo=cwd, env=env, system=args.system, tool_names=args.tools, sdir=args.skill, init_prompt=args.prompt, resume=args.resume, stream=args.stream, bare=args.bare)
             else:
                 env['GOOGLE_GEMINI_BASE_URL'] = url
                 env['GEMINI_API_KEY'] = 'mc'
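
Reviewer note on the port probing in `_serve_batch` above: the loop binds a throwaway socket, moves to the next port on "address already in use" (errno 48 on macOS, 98 on Linux), and exits when `fixed_port` is set. A minimal sketch of the same idea is below. `find_free_port` is a hypothetical name, and raising `RuntimeError` instead of calling `sys.exit(1)` is an assumption made so the helper can be tested in isolation.

```python
import errno
import socket

def find_free_port(host, port, fixed_port=False):
    """Return the first bindable port at or above `port` (hypothetical helper)."""
    while port <= 65535:
        try:
            # Probe by binding a throwaway socket; it is closed right away.
            with socket.socket() as probe:
                probe.bind((host, port))
            return port
        except OSError as exc:
            # errno.EADDRINUSE is 48 on macOS and 98 on Linux.
            if exc.errno != errno.EADDRINUSE:
                raise
            if fixed_port:
                raise RuntimeError(f"Port {port} is already in use.") from exc
            port += 1
    raise RuntimeError("no free port found")
```

As in the diff, the probe socket is closed before the real server binds, so another process could still claim the port in between.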

{mlx_code-0.0.24 → mlx_code-0.0.26}/mlx_code/repl.py RENAMED

@@ -980,10 +980,10 @@ async def _stream_to_stdout(agent: Agent, user_input: str) -> None:
     if text:
         print(text)
-async def repl(agent, init_prompt=None, notui=False):
+async def repl(agent, init_prompt=None, bare=False):
     is_tty = sys.stdin.isatty() and sys.stdout.isatty()
-    if notui and is_tty:
-        from .ntui import SimpleRepl
+    if bare and is_tty:
+        from .bare import SimpleRepl
         sr = SimpleRepl(agent, init_prompt=init_prompt)
         await sr.run()
         return None
@@ -1025,7 +1025,7 @@ _AGENT_ENV_ALLOWLIST: re.Pattern = re.compile('\n    ^(\n    # ── Execution
 def _make_agent_env(base: dict[str, str]) -> dict[str, str]:
     return {k: v for k, v in base.items() if _AGENT_ENV_ALLOWLIST.match(k)}
-def run_repl(*, base_url=None, model=None, api: Literal['claude', 'codex', 'gemini', 'deepseek', 'noapi']='noapi', system='', sdir=None, skills=None, env=None, tool_names=None, extra_tool_classes=None, api_key=None, gwt=None, ctx=None, init_prompt=None, resume_messages=None, repo=None, resume=None, stream=None, verbose_transcript=False, notui=False):
+def run_repl(*, base_url=None, model=None, api: Literal['claude', 'codex', 'gemini', 'deepseek', 'noapi']='noapi', system='', sdir=None, skills=None, env=None, tool_names=None, extra_tool_classes=None, api_key=None, gwt=None, ctx=None, init_prompt=None, resume_messages=None, repo=None, resume=None, stream=None, verbose_transcript=False, bare=False):
     repo = os.path.abspath(repo or os.getcwd())
     with tempfile.TemporaryDirectory(dir=tempfile.gettempdir()) as _home:
         if gwt is None:
@@ -1064,7 +1064,7 @@ def run_repl(*, base_url=None, model=None, api: Literal['claude', 'codex', 'gemi
             print(f'[resumed {len(resume_messages)} messages from checkpoint]')
         app_instance = None
         try:
-            app_instance = asyncio.run(repl(agent, init_prompt=init_prompt, notui=notui))
+            app_instance = asyncio.run(repl(agent, init_prompt=init_prompt, bare=bare))
         finally:
             if log_fp:
                 log_fp.close()
@@ -1103,7 +1103,7 @@ def main():
     parser.add_argument('--key', default=None, help='API key')
     parser.add_argument('--stream', default=None, help='File to stream log into')
     parser.add_argument('--verbose-transcript', action='store_true', help='Reserved; not yet implemented')
-    parser.add_argument('--notui', action='store_true', help='Use simple terminal REPL instead of TUI')
+    parser.add_argument('--bare', action='store_true', help='Use simple terminal REPL instead of TUI')
     args = parser.parse_args()
     logger.debug(args)
     url, model, tool_names, api_key = (args.url, args.model, args.tools, args.key)
@@ -1117,6 +1117,6 @@ def main():
         url = 'https://generativelanguage.googleapis.com' if api_key else url
         model = 'gemini-3.1-flash-lite' if model is None else model
         tool_names = [] if tool_names is None else tool_names
-    run_repl(api=args.api, system=args.system, repo=args.cwd, model=model, base_url=url, tool_names=tool_names, sdir=args.skill, api_key=api_key, init_prompt=args.prompt, resume=args.resume, stream=args.stream, notui=args.notui)
+    run_repl(api=args.api, system=args.system, repo=args.cwd, model=model, base_url=url, tool_names=tool_names, sdir=args.skill, api_key=api_key, init_prompt=args.prompt, resume=args.resume, stream=args.stream, bare=args.bare)
 if __name__ == '__main__':
     main()
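
Reviewer note on the `--notui` → `--bare` rename above: the old flag is removed outright, so existing scripts that pass `--notui` to `mlc-run` will be rejected by `parse_args()`. If a transition period were wanted, argparse allows both spellings to share one destination. The sketch below is an assumption about how that could look, not something this release does; `build_parser` is a hypothetical name.

```python
import argparse

def build_parser():
    """Hypothetical parser that keeps --notui as a legacy alias of --bare."""
    parser = argparse.ArgumentParser(description="alias sketch")
    # Both option strings write to the same `bare` attribute.
    parser.add_argument(
        "--bare", "--notui", dest="bare", action="store_true",
        help="Use simple terminal REPL instead of TUI",
    )
    return parser
```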

{mlx_code-0.0.24 → mlx_code-0.0.26}/mlx_code/view_log.py RENAMED

@@ -597,7 +597,7 @@ def tui(stdscr, entries, log_file, initial_filter='', initial_visible=None):
 def main():
     parser = argparse.ArgumentParser(description='TUI viewer for JSON log files')
     parser.add_argument('logfile', nargs='?', default='.log.json', help='Path to log file (default: .log.json)')
-    parser.add_argument('-f', '--filter', default=f'lvl:10;file:main,repl,gits,apis,tools', help='Initial filter string (same syntax as in UI)')
+    parser.add_argument('-f', '--filter', default=f'lvl:10;file:main,bats,repl,bare,gits,apis,tools', help='Initial filter string (same syntax as in UI)')
     parser.add_argument('-o', '--out', dest='out', metavar='FILE', help='Write marked entries to FILE (JSON lines format) instead of stdout')
     args = parser.parse_args()
     log_path = args.logfile

{mlx_code-0.0.24 → mlx_code-0.0.26}/mlx_code.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: mlx-code
-Version: 0.0.24
+Version: 0.0.26
 Summary: Coding Agent for Mac
 Home-page: https://josefalbers.github.io/mlx-code/
 Author: J Joe
@@ -17,6 +17,8 @@ Requires-Dist: httpx
 Requires-Dist: pydantic
 Requires-Dist: textual>=8.2.7
 Requires-Dist: rich>=15.0.0
+Requires-Dist: starlette
+Requires-Dist: uvicorn
 Provides-Extra: all
 Requires-Dist: python-lsp-server[all]; extra == "all"
 Requires-Dist: GitPython; extra == "all"
@@ -47,7 +49,7 @@ A Git-native coding agent that can run entirely on your Mac. No API keys, no clo
 ```
 Conversation tree (nodes = git commits with embedded chat history)
-  main ──●──●──●──●──●──●──●──●──●──●
+  main ──●──●──●──●──●──●──●──●──●──●──●──●──●──●
             │        │
             │        └── branch-1 ──●──●──●
             │                          │ ┌────────────┐
@@ -66,21 +68,21 @@ REPL tabs (each tab = a git branch + agent)    │
 │  └──────┘  └────┬─────┘  └──────────┘  └────────────┘  │
 └─────────────────┼──────────────────────────────────────┘
                   │
-                  ├────────────────────────────────────► each tab is an independent Agent
+                  ├─────────────────────────────────────────► Each tab is an independent Agent
                   │
-             ┌────┴─────────────────────────────────┐
-             │  Agent                               │
-             │  ┌──────────────┐  ┌──────────────┐  │
-             │  │ API:         │  │ Tools:       │  │
-             │  │ MLX (local)  │  │ Read  Write  │  │
-             │  │ Claude       │  │ Edit  Bash   │  │
-             │  │ Gemini       │  │ Grep  Find   │  │
-             │  │ OpenAI       │  │ Ls  Skill    │  │
-             │  └──────────────┘  │ Agent ───────┼──┼───► spawns child Agent
-             │                    └──────────────┘  │     (each with own tools + worktree + etc)
-             │  Git worktree                        │
-             │  (isolation + session state)         │
-             └──────────────────────────────────────┘
+             ┌────┴─────────────────────────────────────┐
+             │  Agent                                   │
+             │  ┌────────────────┐  ┌────────────────┐  │
+             │  │ API:           │  │ Tools:         │  │
+             │  │ Local (mlx-lm) │  │ Read    Write  │  │
+             │  │ Claude         │  │ Edit    Bash   │  │
+             │  │ Gemini         │  │ Grep    Find   │  │
+             │  │ OpenAI         │  │ Ls      Skill  │  │
+             │  └────────────────┘  │ Agent ─────────┼──┼───► Spawns child Agent
+             │                      └────────────────┘  │     (each with own tools + worktree + etc)
+             │  Git worktree                            │
+             │  (isolation + session state)             │
+             └──────────────────────────────────────────┘
 ```
 Each layer is importable and composable on its own. A commit records state, a branch records an alternative path, and a tab is just a live view over an `Agent`.
@@ -98,10 +100,15 @@ result = await agent.run('refactor utils.py to use dataclasses')
 ## Quick start
 ```bash
+# ephemeral run (no installation)
+uvx --from mlx-code mlc
+# or install into the current environment
 pip install mlx-code
-mlc                              # launch with local MLX model
+# launch
+mlc                              # with a local MLX model
 mlc-run --api gemini             # or use a remote provider
-mlc-run --api deepseek --model deepseek-v4-flash
 ```
 That's it. The first run starts a local inference server and drops you into the REPL.
@@ -123,12 +130,12 @@ That's it. The first run starts a local inference server and drops you into the
 **Git is the database.** When the agent makes file changes, they’re committed to a git worktree with the full conversation embedded in the commit message. Resume any past session by hash, branch from any checkpoint, and inspect the agent timeline with `git log`. No proprietary state files, just Git.
-**Your working directory is never at risk.** The agent operates inside a `git worktree`, not your checkout. It can make a mess, and you can inspect or discard it without ever touching `main`.
-**Built-in safety nets.** Subprocess environment variables go through an explicit allowlist, so secrets in your shell are never leaked to agent-spawned processes.
+**Built-in safety nets.** Your working directory is never at risk. The agent operates inside a `git worktree`, not your checkout. It can make a mess, and you can inspect or discard it without ever touching `main`. Subprocess environment variables go through an explicit allowlist, so secrets in your shell are never leaked to agent-spawned processes.
 **Batteries included.** Everything ships in one pip install: the MLX inference engine, the multi-protocol API server, the agent loop, the tools, and the TUI. No llama.cpp, no ollama, no vLLM bridge to find and configure. And the server natively speaks OpenAI, Anthropic, Gemini, and Codex wire formats simultaneously, so `claude`, `codex`, and `gemini` CLIs can all work against your local model without a translation layer.
+**Continuous batching.** The local inference server runs a continuous batching engine that processes multiple sequences concurrently. When you spawn parallel agents (eg, multiple tabs, `asyncio.gather` pipelines, or delegated sub-tasks) they all share the same GPU context and are stepped together each tick. A prefix cache persists KV snapshots to disk, so repeated system prompts and conversation prefixes are prefilled once and reused across sessions. No request queueing, no waiting for the previous agent to finish.
 ---
 ## Agent primitive
@@ -166,12 +173,12 @@ agent.messages = messages
 await agent.run("now add unit tests")
 ```
-Branch from any point in the conversation — each branch gets its own worktree:
+Branch from any point in the conversation. Each branch gets its own worktree:
 ```
 /branch                      # branch from current state
 /branch --rev 2              # branch from the 2nd user turn
-/branch --rev 3 --as-worktree try different approach
+/branch --rev 3 make it use httpx instead
 ```
 Since it's just git, you can inspect the timeline outside the REPL:
@@ -236,6 +243,43 @@ Reliability comes from specialization plus constraint. A read-only reviewer can'
 ---
+## Continuous batching
+The local server can run multiple inference sequences concurrently inside a single batch step. Instead of a global lock that serialises one request at a time, the batching engine maintains a live set of active sequences and yields tokens for all of them on every step.
+```bash
+mlc --engine batch            # continuous batching + built-in REPL
+```
+This unlocks true parallelism for multi-agent workloads:
+```python
+import asyncio
+from mlx_code.repl import Agent
+async def main():
+    agents = [Agent() for _ in range(4)]
+    await asyncio.gather(*[
+        a.run(f"Research topic: {t}")
+        for a, t in zip(agents, ["consensus", "cryptography", "networking", "storage"])
+    ])
+asyncio.run(main())
+```
+All four agents generate simultaneously inside the same batch. No sequential blocking.
+### Health endpoint
+```bash
+curl http://127.0.0.1:8000/health
+# {"status":"ok","model":"mlx-community/Qwen3.5-4B-OptiQ-4bit","active_sequences":2,"prefix_cache_files":5}
+```
+`active_sequences` shows how many agents are generating right now; `prefix_cache_files` shows how many prefix KV snapshots are stored on disk.
+---
 ## Command Line
 ### `mlc`: local server + harness
@@ -243,20 +287,20 @@ Reliability comes from specialization plus constraint. A read-only reviewer can'
 Starts the MLX inference server and launches the built-in TUI harness against it.
 ```bash
-# Default: local server + default TUI
+# Default: local server + default harness
 mlc
-# Use a simple terminal REPL instead of the TUI
-mlc --notui
+# Continuous batching mode (default is sequential caching mode)
+mlc --engine batch
+# Server only, no harness
+mlc --leash none
 # Use a different harness (routes traffic through the local server)
 mlc --leash claude
 mlc --leash gemini
 mlc --leash codex
-# Server only, no harness
-mlc --leash none
 # Specify a model
 mlc --model mlx-community/Qwen3.5-4B-OptiQ-4bit
@@ -307,7 +351,7 @@ mlc-run --api codex
 echo "explain lsp.py" | mlc-run -a deepseek | cat - PLAN.md | mlc-run --url http://localhost:9000
 # Simple terminal REPL (no TUI)
-mlc-run --notui
+mlc-run --bare
 ```
 ---
@@ -432,18 +476,19 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | Command | Description |
 |---|---|
-| `/help` | Show command reference |
+| `/branch [--rev N] [prompt]` | Open a new branch tab from the current (or earlier) checkpoint |
+| `/diff [--all]` | Show a side-by-side diff of changes in the worktree |
 | `/clear [--config F]` | Clear conversation; `--config` reloads agent from a JSON/YAML file |
+| `/tab [N]` | Jump to tab N |
 | `/history [--raw]` | Show conversation transcript; `--raw` shows the raw API message log |
-| `/diff [--all]` | Show a side-by-side diff of changes in the worktree |
-| `/errors` | Show timestamped error log for the current tab |
 | `/tools` | List active tools |
-| `/branch [--rev N] [prompt]` | Open a new branch tab from the current (or earlier) checkpoint |
 | `/abort` | Abort the running agent |
+| `/errors` | Show timestamped error log for the current tab |
 | `/export [path]` | Export session to JSON |
 | `/exit [--all]` | Close branch tab, or exit the app |
-| `!command` | Run a shell command; output captured in the TUI |
-| `$command` | Run an interactive command (TUI suspends, terminal handed to process) |
+| `/help` | Show command reference |
+| `!command` | Run a shell command; output captured in the TUI (eg, `ls`, `cat hello.c`) |
+| `$command` | Run an interactive command (eg, `vim`, `yazi`, `less hello.c`) |
 ### Key bindings
@@ -453,7 +498,7 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | `Ctrl-J` | Insert newline |
 | `Ctrl-1` … `Ctrl-9` | Jump to tab N |
 | `Ctrl-,` / `Ctrl-.` | Cycle through tabs |
-| `Ctrl-C` | Abort running agent |
+| `Ctrl-C` | Clear input, or abort running agent |
 | `Ctrl-D` | Close branch tab, or exit app |
 | `Ctrl-R` | Recall last prompt into editor |
@@ -471,7 +516,7 @@ agent = Agent(extra_tool_classes=[LiveDBTool], tool_names=["QueryDB"])
 | `Skill` | Retrieve named skill instructions from config |
 | `Agent` | Spawn an autonomous sub-agent for delegated work |
-All file tools enforce path sandboxing — the agent cannot read or write outside the worktree.
+All file tools enforce path sandboxing. The agent cannot read or write outside the worktree.
 ### Backends

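Reviewer note on the batch server described in the README hunk above: besides `/health`, the app built in `bats.py` registers a `/generate` route that accepts `prompt` (or `messages`), `max_tokens`, and `stream`, and returns `{"text": ...}` when `stream` is false. Since `stream` defaults to true on the server side, a client that wants JSON has to disable it explicitly. A minimal standard-library sketch is below. The function names are hypothetical, and the base URL is an assumption taken from the README's `curl` example.

```python
import json
import urllib.request

def build_generate_request(prompt, base_url="http://127.0.0.1:8000", max_tokens=256):
    """Build a non-streaming POST for the /generate route (hypothetical helper)."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "stream": False}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request; assumes a batch-engine server is already running."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["text"]
```

Only `generate` touches the network. `build_generate_request` can be inspected without a server, for example after starting one with `mlc --engine batch --leash none`.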
{mlx_code-0.0.24 → mlx_code-0.0.26}/mlx_code.egg-info/SOURCES.txt RENAMED

@@ -3,12 +3,13 @@ README.md
 setup.py
 mlx_code/__init__.py
 mlx_code/apis.py
+mlx_code/bare.py
+mlx_code/bats.py
 mlx_code/gits.py
 mlx_code/lsp_tool.py
 mlx_code/main.py
 mlx_code/mcb.py
 mlx_code/mcb_tool.py
-mlx_code/ntui.py
 mlx_code/repl.py
 mlx_code/stream_log.py
 mlx_code/tools.py

{mlx_code-0.0.24 → mlx_code-0.0.26}/mlx_code.egg-info/requires.txt RENAMED

@@ -2,6 +2,8 @@ httpx
 pydantic
 textual>=8.2.7
 rich>=15.0.0
+starlette
+uvicorn
 [:platform_system == "Darwin"]
 mlx-lm>=0.31.3

{mlx_code-0.0.24 → mlx_code-0.0.26}/setup.py RENAMED

@@ -11,7 +11,7 @@ setup(
     author_email="albersj66@gmail.com",
     author="J Joe",
     license="Apache-2.0",
-    version="0.0.24",
+    version="0.0.26",
     readme="README.md",
     description="Coding Agent for Mac",
     long_description=open("README.md").read(),
@@ -24,6 +24,9 @@ setup(
         "textual>=8.2.7",
         "rich>=15.0.0",
+        "starlette",
+        "uvicorn",
     ],
     extras_require={"all": [
         "python-lsp-server[all]",