opencode-llmstack 0.9.4__tar.gz → 0.9.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. {opencode_llmstack-0.9.4/opencode_llmstack.egg-info → opencode_llmstack-0.9.7}/PKG-INFO +26 -39
  2. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/README.md +25 -38
  3. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/UPGRADING.md +12 -12
  4. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/__init__.py +1 -1
  5. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/app.py +78 -112
  6. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/backends/bedrock.py +3 -1
  7. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/generators/opencode.py +2 -2
  8. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/models.ini +11 -17
  9. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7/opencode_llmstack.egg-info}/PKG-INFO +26 -39
  10. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/pyproject.toml +1 -1
  11. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/CHANGELOG.md +0 -0
  12. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/LICENSE +0 -0
  13. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/AGENTS.md +0 -0
  14. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/__main__.py +0 -0
  15. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/_platform.py +0 -0
  16. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/backends/__init__.py +0 -0
  17. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/check_models.py +0 -0
  18. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/cli.py +0 -0
  19. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/__init__.py +0 -0
  20. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/_helpers.py +0 -0
  21. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/activate.py +0 -0
  22. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/check.py +0 -0
  23. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/download.py +0 -0
  24. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/install.py +0 -0
  25. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/install_llama_swap.py +0 -0
  26. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/reload.py +0 -0
  27. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/restart.py +0 -0
  28. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/setup.py +0 -0
  29. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/start.py +0 -0
  30. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/status.py +0 -0
  31. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/commands/stop.py +0 -0
  32. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/download/__init__.py +0 -0
  33. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/download/binary.py +0 -0
  34. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/download/ggufs.py +0 -0
  35. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/generators/__init__.py +0 -0
  36. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/generators/llama_swap.py +0 -0
  37. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/paths.py +0 -0
  38. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/shell_env.py +0 -0
  39. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/llmstack/tiers.py +0 -0
  40. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/SOURCES.txt +0 -0
  41. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/dependency_links.txt +0 -0
  42. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/entry_points.txt +0 -0
  43. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/requires.txt +0 -0
  44. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/top_level.txt +0 -0
  45. {opencode_llmstack-0.9.4 → opencode_llmstack-0.9.7}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: opencode-llmstack
- Version: 0.9.4
+ Version: 0.9.7
  Summary: Multi-tier local LLM stack: llama-swap + FastAPI auto-router + opencode wiring.
  Author: llmstack
  License: MIT License
@@ -78,14 +78,14 @@ client (opencode / curl / Cursor / etc.)
 
 
  http://127.0.0.1:10101   <-- FastAPI router (llmstack.app)
- │ • model="auto" → classify → rewrite to one of 4 tiers
+ │ • model="auto" → classify → rewrite to one of 3 coder tiers
  │ • everything else → pass-through
 
  http://127.0.0.1:10102   <-- llama-swap (binary, manages model lifecycle)
  │ • loads/unloads llama-server processes per model
  │ • matrix solver allows {code-fast + one heavy model} co-resident
 
- llama-server <code-fast | code-smart | plan | plan-uncensored>
+ llama-server <code-fast | code-smart | code-ultra>
 
 
  GGUF in ~/.cache/huggingface/hub/...
@@ -101,7 +101,7 @@ A 64 GB unified memory M4 Max can comfortably hold **one always-on tiny coder +
 
  - **Agent work** (multi-file edits, tool use, refactors) → coder models, which are trained on tool-call protocols and code edits.
  - **Planning** (design discussions, architecture, "what's the best approach") → chat-tuned models, which are better at high-level reasoning and don't try to start writing code in response to every message.
- - **Uncensored planning** is a separate plan-tier model, opted in either by request (`agent.plan-nofilter` in opencode) or by an inline `[nofilter]` trigger in the prompt.
+ - **Uncensored planning** is a separate plan-tier model, opted in by explicit agent selection (`/agent plan-nofilter` in opencode).
 
  Routing decisions cost ~zero — they're a few regex checks in the FastAPI router, not an LLM call.
 
@@ -135,20 +135,18 @@ matches how these models actually behave on this stack:
  than priors, so they tend to *improve* relative to top-tier as the
  conversation grows.
 
- First match wins:
+ First match wins (auto-routing only; `plan` and `plan-uncensored` are not auto-routed):
 
  | # | Condition | → Model | Reason |
  |---|---|---|---|
- | 1 | last user msg contains `[nofilter]`, `[uncensored]`, `[heretic]`, or starts with `uncensored:` / `nofilter:` | `plan-uncensored` | explicit opt-in |
- | 2 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
- | 3 | plan verbs (*design, architect, approach, trade-off, should we, explain why, …*) AND no code blocks / agent verbs / tools | `plan` | pure design discussion (orthogonal track) |
- | 4 | estimated input ≤ 12 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier context still being built, latency/$ are best here |
- | 5 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
- | 6 | otherwise (long context) AND ≥ 10 user turns | `code-smart` | floor: deep agentic loop, keep the heavy model |
- | 7 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
+ | 1 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
+ | 2 | estimated input ≤ 12 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier context still being built, latency/$ are best here |
+ | 3 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
+ | 4 | otherwise (long context) AND ≥ 10 user turns | `code-smart` | floor: deep agentic loop, keep the heavy model |
+ | 5 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
 
  Token estimates are `chars / 4` over all message text + `prompt`. The
- `code-ultra` rungs (2 and 4) are gated on availability: when no
+ `code-ultra` rungs (1 and 2) are gated on availability: when no
  `[code-ultra]` section is loaded from `models.ini`, both silently fall
  back to `code-smart` so vanilla installs don't 404.
 
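The rewritten five-rung table boils down to a few integer comparisons. A minimal sketch of the documented ladder (not the package's exact code; `ultra_available` stands in for the `models.ini` availability check, and the explicit `[ultra]` trigger rung is left out):

```python
# Sketch of the 0.9.7 step-down ladder (ceilings and tier names taken from
# the table above; `ultra_available` stands in for the models.ini check).
HIGH_FIDELITY_CEILING = 12_000
MID_FIDELITY_CEILING = 32_000
MULTI_TURN_THRESHOLD = 10

def estimate_tokens(messages: list[dict], prompt: str | None) -> int:
    chars = sum(len(m["content"]) for m in messages
                if isinstance(m.get("content"), str))
    return (chars + len(prompt or "")) // 4   # the documented chars/4 heuristic

def route(messages: list[dict], prompt: str | None = None,
          ultra_available: bool = False) -> str:
    est = estimate_tokens(messages, prompt)
    if est <= HIGH_FIDELITY_CEILING:           # rung 2: short context, top tier
        return "code-ultra" if ultra_available else "code-smart"
    if est <= MID_FIDELITY_CEILING:            # rung 3: mid context
        return "code-smart"
    user_turns = sum(1 for m in messages if m.get("role") == "user")
    if user_turns >= MULTI_TURN_THRESHOLD:     # rung 4: deep agentic loop floor
        return "code-smart"
    return "code-fast"                         # rung 5: long context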
@@ -198,7 +196,8 @@ your global setup unchanged.
  | **`agent.plan-nofilter`** (custom uncensored planner) | `llama.cpp/plan-uncensored` |
 
  Inside opencode you can switch agents with `/agent` or by `@plan-nofilter`-mentioning
- a custom one. Slash-commands `/review`, `/nofilter` are also available.
+ a custom one. The `plan` and `plan-uncensored` tiers are **not auto-routed** from the build agent —
+ they're only accessible via explicit agent selection (`/agent plan` or `/agent plan-nofilter`).
 
  Want a second terminal into the same stack? Install the activate hook
  once (`eval "$(llmstack activate zsh)"`) and any new shell that `cd`s
@@ -266,8 +265,9 @@ Per-project state (gitignored) is created lazily under `<work-dir>/.llmstack/`:
  ```
 
  The `llama-swap` binary lives outside any project at
- `$XDG_DATA_HOME/llmstack/bin/llama-swap` (override with
- `LLMSTACK_BIN_DIR`). One download is reused across all projects.
+ `$XDG_DATA_HOME/llmstack/bin/llama-swap` on macOS/Linux (override with
+ `LLMSTACK_BIN_DIR`), or `%LOCALAPPDATA%\llmstack\bin\llama-swap.exe` on Windows.
+ One download is reused across all projects.
 
  ## Quick start
 
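The documented locations collapse into a small lookup. A hypothetical sketch (this is not the package's `llmstack/paths.py`; it only encodes the paths and the `LLMSTACK_BIN_DIR` override named in this hunk):

```python
# Hypothetical sketch of the documented binary location (not the real
# llmstack/paths.py): LLMSTACK_BIN_DIR wins, else the platform default.
import os
import sys
from pathlib import Path

def llama_swap_bin() -> Path:
    win = sys.platform == "win32"
    if override := os.getenv("LLMSTACK_BIN_DIR"):
        base = Path(override)
    elif win:
        base = Path(os.environ["LOCALAPPDATA"]) / "llmstack" / "bin"
    else:
        data = os.getenv("XDG_DATA_HOME", str(Path.home() / ".local" / "share"))
        base = Path(data) / "llmstack" / "bin"
    return base / ("llama-swap.exe" if win else "llama-swap")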
@@ -358,8 +358,9 @@ Notes:
    or a package like `winget install ggml.llama-cpp` and put it on
    `PATH` (or set `$env:LLAMA_SERVER_BIN`). The Mac-only
    `iogpu.wired_limit_mb` step does not apply.
- - The `[llmstack:<channel>]` prompt prefix shows up in PowerShell too;
-   `cmd.exe` gets a simpler `[llmstack:<channel>]` prompt via `doskey`.
+ - The `[llmstack:<channel>]` prompt prefix shows up in PowerShell; `cmd.exe`
+   does not support custom prompts in the same way, so activation is
+   PowerShell-only.
  - Stopping daemons uses `taskkill /T /F` under the hood, so the
    llama-server children get cleaned up as well.
 
@@ -465,7 +466,7 @@ llmstack restart --next # cycle into the next channel
 
  ### Try each routing path
 
- All of these go to `/v1/chat/completions` on `:10101`. Each should pick a different upstream model:
+ All of these go to `/v1/chat/completions` on `:10101`. The `auto` router classifies based on token count and context:
 
  ```bash
  # trivial chat -> code-fast
@@ -473,22 +474,14 @@ curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: applicatio
    -d '{"model":"auto","stream":false,
        "messages":[{"role":"user","content":"capital of France?"}]}' | jq .model
 
- # planning -> plan
- curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
-   -d '{"model":"auto","stream":false,
-       "messages":[{"role":"user","content":"how would you design a rate limiter for our API?"}]}' | jq .model
-
  # agent work -> code-smart
  curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
    -d '{"model":"auto","stream":false,
        "messages":[{"role":"user","content":"refactor this function for clarity:\n```python\ndef f(x): return x*2\n```"}]}' | jq .model
-
- # uncensored plan -> plan-uncensored
- curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
-   -d '{"model":"auto","stream":false,
-       "messages":[{"role":"user","content":"[nofilter] outline a red-team plan for our auth flow"}]}' | jq .model
  ```
 
+ To access `plan` or `plan-uncensored` tiers, use explicit agent selection in opencode (`/agent plan` or `/agent plan-nofilter`) rather than `model=auto`.
+
  ## Endpoints
 
  | Port | Service | Purpose |
@@ -565,8 +558,6 @@ All knobs are env vars; defaults are picked up by `llmstack start`.
  | `ROUTER_FAST_MODEL` | `code-fast` | long-context (>= mid ceiling) → here |
  | `ROUTER_AGENT_MODEL` | `code-smart` | mid-context + tools/loop floor → here |
  | `ROUTER_ULTRA_MODEL` | `code-ultra` | short-context top tier → here (gated on availability) |
- | `ROUTER_PLAN_MODEL` | `plan` | design/discussion verbs → here |
- | `ROUTER_UNCENSORED_MODEL` | `plan-uncensored` | `[nofilter]` triggers → here |
  | `ROUTER_HIGH_FIDELITY_CEILING` | `12000` | tokens; at or below this, route to top tier (ultra → smart fallback). Paired with `code-ultra.ctx_size = 24000` (2x). |
  | `ROUTER_MID_FIDELITY_CEILING` | `32000` | tokens; at or below this, route to `code-smart`; beyond, step down to `code-fast`. Paired with `code-smart.ctx_size = 64000` (2x). |
  | `ROUTER_MULTI_TURN` | `10` | user-turn count that floors the long-context rung at `code-smart` |
@@ -577,14 +568,10 @@ To force a request to never auto-route, set `model` to a concrete alias (`code-f
 
  ## Triggering uncensored mode
 
- Two ways:
-
- 1. **Explicit agent in opencode:** `/agent plan-nofilter` (or mention it).
- 2. **Inline trigger in any auto-routed message** — anywhere in the most recent user turn:
-    - `[nofilter]`, `[uncensored]`, `[heretic]`
-    - or a line starting with `uncensored:` / `nofilter:` / `no-filter:`
+ The `plan-uncensored` tier is accessible via explicit agent selection only:
 
- Triggers are *only* checked on the latest user message and the system prompt, so an old `[nofilter]` further up the conversation won't pin the whole session.
+ 1. **In opencode:** `/agent plan-nofilter` (or mention `@plan-nofilter`).
+ 2. **Via opencode config:** set `agent.plan-nofilter` as your active agent.
 
  ## Troubleshooting
 
@@ -594,7 +581,7 @@ Triggers are *only* checked on the latest user message and the system prompt, so
 
  **OOM / unexplained slowdown** → run `top -o mem -stats pid,rsize,command` to see what's resident. The matrix should prevent two heavy models loading together; if it somehow happens, `llmstack restart`.
 
- **Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`AGENT_SIGNALS` / `PLAN_SIGNALS` / `UNCENSORED_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
+ **Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`ULTRA_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
 
  **Want a pure pass-through (no auto routing)** → change opencode's `baseURL` to `http://127.0.0.1:10102/v1` (llama-swap directly) and only use concrete model names. (Note: this skips the bedrock dispatcher; only GGUF tiers will be reachable.)
 
@@ -19,14 +19,14 @@ client (opencode / curl / Cursor / etc.)
 
 
  http://127.0.0.1:10101   <-- FastAPI router (llmstack.app)
- │ • model="auto" → classify → rewrite to one of 4 tiers
+ │ • model="auto" → classify → rewrite to one of 3 coder tiers
  │ • everything else → pass-through
 
  http://127.0.0.1:10102   <-- llama-swap (binary, manages model lifecycle)
  │ • loads/unloads llama-server processes per model
  │ • matrix solver allows {code-fast + one heavy model} co-resident
 
- llama-server <code-fast | code-smart | plan | plan-uncensored>
+ llama-server <code-fast | code-smart | code-ultra>
 
 
  GGUF in ~/.cache/huggingface/hub/...
@@ -42,7 +42,7 @@ A 64 GB unified memory M4 Max can comfortably hold **one always-on tiny coder +
 
  - **Agent work** (multi-file edits, tool use, refactors) → coder models, which are trained on tool-call protocols and code edits.
  - **Planning** (design discussions, architecture, "what's the best approach") → chat-tuned models, which are better at high-level reasoning and don't try to start writing code in response to every message.
- - **Uncensored planning** is a separate plan-tier model, opted in either by request (`agent.plan-nofilter` in opencode) or by an inline `[nofilter]` trigger in the prompt.
+ - **Uncensored planning** is a separate plan-tier model, opted in by explicit agent selection (`/agent plan-nofilter` in opencode).
 
  Routing decisions cost ~zero — they're a few regex checks in the FastAPI router, not an LLM call.
 
@@ -76,20 +76,18 @@ matches how these models actually behave on this stack:
  than priors, so they tend to *improve* relative to top-tier as the
  conversation grows.
 
- First match wins:
+ First match wins (auto-routing only; `plan` and `plan-uncensored` are not auto-routed):
 
  | # | Condition | → Model | Reason |
  |---|---|---|---|
- | 1 | last user msg contains `[nofilter]`, `[uncensored]`, `[heretic]`, or starts with `uncensored:` / `nofilter:` | `plan-uncensored` | explicit opt-in |
- | 2 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
- | 3 | plan verbs (*design, architect, approach, trade-off, should we, explain why, …*) AND no code blocks / agent verbs / tools | `plan` | pure design discussion (orthogonal track) |
- | 4 | estimated input ≤ 12 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier context still being built, latency/$ are best here |
- | 5 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
- | 6 | otherwise (long context) AND ≥ 10 user turns | `code-smart` | floor: deep agentic loop, keep the heavy model |
- | 7 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
+ | 1 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
+ | 2 | estimated input ≤ 12 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier context still being built, latency/$ are best here |
+ | 3 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
+ | 4 | otherwise (long context) AND ≥ 10 user turns | `code-smart` | floor: deep agentic loop, keep the heavy model |
+ | 5 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
 
  Token estimates are `chars / 4` over all message text + `prompt`. The
- `code-ultra` rungs (2 and 4) are gated on availability: when no
+ `code-ultra` rungs (1 and 2) are gated on availability: when no
  `[code-ultra]` section is loaded from `models.ini`, both silently fall
  back to `code-smart` so vanilla installs don't 404.
 
@@ -139,7 +137,8 @@ your global setup unchanged.
  | **`agent.plan-nofilter`** (custom uncensored planner) | `llama.cpp/plan-uncensored` |
 
  Inside opencode you can switch agents with `/agent` or by `@plan-nofilter`-mentioning
- a custom one. Slash-commands `/review`, `/nofilter` are also available.
+ a custom one. The `plan` and `plan-uncensored` tiers are **not auto-routed** from the build agent —
+ they're only accessible via explicit agent selection (`/agent plan` or `/agent plan-nofilter`).
 
  Want a second terminal into the same stack? Install the activate hook
  once (`eval "$(llmstack activate zsh)"`) and any new shell that `cd`s
@@ -207,8 +206,9 @@ Per-project state (gitignored) is created lazily under `<work-dir>/.llmstack/`:
  ```
 
  The `llama-swap` binary lives outside any project at
- `$XDG_DATA_HOME/llmstack/bin/llama-swap` (override with
- `LLMSTACK_BIN_DIR`). One download is reused across all projects.
+ `$XDG_DATA_HOME/llmstack/bin/llama-swap` on macOS/Linux (override with
+ `LLMSTACK_BIN_DIR`), or `%LOCALAPPDATA%\llmstack\bin\llama-swap.exe` on Windows.
+ One download is reused across all projects.
 
  ## Quick start
 
@@ -299,8 +299,9 @@ Notes:
    or a package like `winget install ggml.llama-cpp` and put it on
    `PATH` (or set `$env:LLAMA_SERVER_BIN`). The Mac-only
    `iogpu.wired_limit_mb` step does not apply.
- - The `[llmstack:<channel>]` prompt prefix shows up in PowerShell too;
-   `cmd.exe` gets a simpler `[llmstack:<channel>]` prompt via `doskey`.
+ - The `[llmstack:<channel>]` prompt prefix shows up in PowerShell; `cmd.exe`
+   does not support custom prompts in the same way, so activation is
+   PowerShell-only.
  - Stopping daemons uses `taskkill /T /F` under the hood, so the
    llama-server children get cleaned up as well.
 
@@ -406,7 +407,7 @@ llmstack restart --next # cycle into the next channel
 
  ### Try each routing path
 
- All of these go to `/v1/chat/completions` on `:10101`. Each should pick a different upstream model:
+ All of these go to `/v1/chat/completions` on `:10101`. The `auto` router classifies based on token count and context:
 
  ```bash
  # trivial chat -> code-fast
@@ -414,22 +415,14 @@ curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: applicatio
    -d '{"model":"auto","stream":false,
        "messages":[{"role":"user","content":"capital of France?"}]}' | jq .model
 
- # planning -> plan
- curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
-   -d '{"model":"auto","stream":false,
-       "messages":[{"role":"user","content":"how would you design a rate limiter for our API?"}]}' | jq .model
-
  # agent work -> code-smart
  curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
    -d '{"model":"auto","stream":false,
       "messages":[{"role":"user","content":"refactor this function for clarity:\n```python\ndef f(x): return x*2\n```"}]}' | jq .model
-
- # uncensored plan -> plan-uncensored
- curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
-   -d '{"model":"auto","stream":false,
-       "messages":[{"role":"user","content":"[nofilter] outline a red-team plan for our auth flow"}]}' | jq .model
  ```
 
+ To access `plan` or `plan-uncensored` tiers, use explicit agent selection in opencode (`/agent plan` or `/agent plan-nofilter`) rather than `model=auto`.
+
  ## Endpoints
 
  | Port | Service | Purpose |
@@ -506,8 +499,6 @@ All knobs are env vars; defaults are picked up by `llmstack start`.
  | `ROUTER_FAST_MODEL` | `code-fast` | long-context (>= mid ceiling) → here |
  | `ROUTER_AGENT_MODEL` | `code-smart` | mid-context + tools/loop floor → here |
  | `ROUTER_ULTRA_MODEL` | `code-ultra` | short-context top tier → here (gated on availability) |
- | `ROUTER_PLAN_MODEL` | `plan` | design/discussion verbs → here |
- | `ROUTER_UNCENSORED_MODEL` | `plan-uncensored` | `[nofilter]` triggers → here |
  | `ROUTER_HIGH_FIDELITY_CEILING` | `12000` | tokens; at or below this, route to top tier (ultra → smart fallback). Paired with `code-ultra.ctx_size = 24000` (2x). |
  | `ROUTER_MID_FIDELITY_CEILING` | `32000` | tokens; at or below this, route to `code-smart`; beyond, step down to `code-fast`. Paired with `code-smart.ctx_size = 64000` (2x). |
  | `ROUTER_MULTI_TURN` | `10` | user-turn count that floors the long-context rung at `code-smart` |
@@ -518,14 +509,10 @@ To force a request to never auto-route, set `model` to a concrete alias (`code-f
 
  ## Triggering uncensored mode
 
- Two ways:
-
- 1. **Explicit agent in opencode:** `/agent plan-nofilter` (or mention it).
- 2. **Inline trigger in any auto-routed message** — anywhere in the most recent user turn:
-    - `[nofilter]`, `[uncensored]`, `[heretic]`
-    - or a line starting with `uncensored:` / `nofilter:` / `no-filter:`
+ The `plan-uncensored` tier is accessible via explicit agent selection only:
 
- Triggers are *only* checked on the latest user message and the system prompt, so an old `[nofilter]` further up the conversation won't pin the whole session.
+ 1. **In opencode:** `/agent plan-nofilter` (or mention `@plan-nofilter`).
+ 2. **Via opencode config:** set `agent.plan-nofilter` as your active agent.
 
  ## Troubleshooting
 
@@ -535,7 +522,7 @@ Triggers are *only* checked on the latest user message and the system prompt, so
 
  **OOM / unexplained slowdown** → run `top -o mem -stats pid,rsize,command` to see what's resident. The matrix should prevent two heavy models loading together; if it somehow happens, `llmstack restart`.
 
- **Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`AGENT_SIGNALS` / `PLAN_SIGNALS` / `UNCENSORED_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
+ **Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`ULTRA_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
 
  **Want a pure pass-through (no auto routing)** → change opencode's `baseURL` to `http://127.0.0.1:10102/v1` (llama-swap directly) and only use concrete model names. (Note: this skips the bedrock dispatcher; only GGUF tiers will be reachable.)
 
@@ -266,7 +266,7 @@ How to evaluate:
  - Run `llama-bench -m <new>.gguf -p 512 -n 128 -ngl 999` for raw speed
  - Sniff test with a typical autocomplete prompt; latency should feel like
    the cursor is barely ahead of you
- - Aider leaderboard "edit format" column — proxy for FIM quality
+ - [Aider leaderboard](https://aider.chat/docs/leaderboards/) "edit format" column — proxy for FIM quality
 
  Size budget: **~2–6 GB** weights (we want this resident permanently while
  sharing memory with the heavy tier).
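The `llama-bench` invocation in this hunk is easy to loop over candidates. A hypothetical helper (it assumes `llama-bench` from llama.cpp is on `PATH`; the GGUF glob is only illustrative):

```python
# Hypothetical helper: run the documented llama-bench invocation over a set
# of candidate GGUFs and print each result block for side-by-side reading.
import subprocess
from pathlib import Path

def bench(ggufs: list[Path]) -> None:
    for gguf in ggufs:
        print(f"=== {gguf.name} ===")
        subprocess.run(
            ["llama-bench", "-m", str(gguf),
             "-p", "512", "-n", "128", "-ngl", "999"],
            check=True,
        )

# Illustrative: bench everything in the HF cache mentioned in the README.
bench(sorted(Path.home().glob(".cache/huggingface/hub/**/*.gguf")))
```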
@@ -287,10 +287,10 @@ What matters:
  - **Speed at full context** (MoE models win here on Apple Silicon)
 
  How to evaluate:
- - Aider's [LLM Leaderboard](https://aider.chat/docs/leaderboards/) — most
+ - [Aider's LLM Leaderboard](https://aider.chat/docs/leaderboards/) — most
    honest signal for agentic coding
- - LiveCodeBench scores
- - SWE-Bench Verified (the "real PRs" benchmark)
+ - [LiveCodeBench](https://livecodebench.github.io/leaderboard.html) scores
+ - [SWE-Bench Verified](https://www.swebench.com/) (the "real PRs" benchmark)
  - Run an actual opencode session in `build` mode against your repo
 
  Size budget: **~30–55 GB** weights (must fit alongside `code-fast` ≈ 5 GB
@@ -311,8 +311,8 @@ What matters:
  - **Refusals on edge cases** — fine to refuse weird stuff in plain plan mode
 
  How to evaluate:
- - Open LLM Leaderboard (filter to chat/instruct, your size class)
- - Chatbot Arena — vibes-based but useful proxy
+ - [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) (filter to chat/instruct, your size class)
+ - [Chatbot Arena](https://lmarena.ai/) — vibes-based but useful proxy
  - Hand-roll a "design this rate limiter" prompt and compare outputs
 
  Size budget: **~7–25 GB** weights — this tier shouldn't dominate memory.
@@ -360,12 +360,12 @@ Same size budget as `plan`.
 
  | Tier | Leaderboard |
  |---|---|
- | `code-fast` / `code-smart` | https://aider.chat/docs/leaderboards/ |
- | | https://livecodebench.github.io/leaderboard.html |
- | | https://www.swebench.com/ (Verified split) |
- | `plan` / `plan-uncensored` | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
- | | https://lmarena.ai/ |
- | | https://livebench.ai/ |
+ | `code-fast` / `code-smart` | [Aider LLM Leaderboard](https://aider.chat/docs/leaderboards/) |
+ | | [LiveCodeBench](https://livecodebench.github.io/leaderboard.html) |
+ | | [SWE-Bench Verified](https://www.swebench.com/) |
+ | `plan` / `plan-uncensored` | [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) |
+ | | [Chatbot Arena](https://lmarena.ai/) |
+ | | [LiveBench](https://livebench.ai/) |
 
  **Community signal** (qualitative but valuable):
 
@@ -16,5 +16,5 @@ organised by concern:
 
  from __future__ import annotations
 
- __version__ = "0.9.4"
+ __version__ = "0.9.6"
  __all__ = ["__version__"]
@@ -36,7 +36,7 @@ Behaviour:
  ``POST /v1/completions``
  - if request body ``model == "auto"`` (or unset), classify the request
    and rewrite ``model`` -> one of: ``code-fast``, ``code-smart``,
-   ``code-ultra`` (when wired), ``plan``, ``plan-uncensored``.
+   ``code-ultra`` (when wired).
  - otherwise pass through unchanged.
  - tiers with ``backend = bedrock`` in ``models.ini`` are dispatched
    to AWS Bedrock via :mod:`llmstack.backends.bedrock` instead of
@@ -63,41 +63,28 @@ step DOWN as context grows**. This inverts the classic
  from priors.
 
  So as the conversation accumulates context, we step *down*: ultra
- -> smart -> fast. Triggers and the plan track sit alongside this
- ladder.
+ -> smart -> fast.
 
  Routing decision tree (first match wins):
 
- 1. Explicit "uncensored" trigger in the last user message
-    (``[nofilter]``, ``[uncensored]``, ``[heretic]``, or a line
-    starting with ``uncensored:`` / ``nofilter:``)        -> plan-uncensored
- 2. Explicit "ultra" trigger (``[ultra]``, ``[opus]``,
+ 1. Explicit "ultra" trigger (``[ultra]``, ``[opus]``,
     ``ultra:``, ``opus:``) AND ultra tier configured      -> code-ultra
- 3. PLAN signal words AND no code-block / agent verbs / tools
-    AND estimated tokens <= ``[plan]`` tier's ctx_size
-    (pure design discussion that fits the planner's
-    window)                                               -> plan
-                                                             (if the planner's
-                                                             ctx_size is breached
-                                                             we fall through to
-                                                             the coding ladder
-                                                             rather than send a
-                                                             request that won't
-                                                             fit -- the coding
-                                                             tiers cover larger
-                                                             windows by design)
- 4. Estimated input tokens <= HIGH_FIDELITY_CEILING
+ 2. Estimated input tokens <= HIGH_FIDELITY_CEILING
     ("reasonable context still being built")              -> code-ultra
                                                              (else code-smart)
- 5. Estimated input tokens <= MID_FIDELITY_CEILING        -> code-smart
- 6. Otherwise (long context, top-tier becomes
-    expensive/slow, fast tier's 128k window is the
-    best fit and it's free)                               -> code-fast
+ 3. Estimated input tokens <= MID_FIDELITY_CEILING        -> code-smart
+ 4. Otherwise (long context, top-tier becomes
+    expensive/slow, fast tier's 128k window is the
+    best fit and it's free)                               -> code-fast
                                                              (floored at
                                                              code-smart when
                                                              n_turns >=
                                                              MULTI_TURN_THRESHOLD)
 
+ Plan and uncensored tiers are accessible via their dedicated agent
+ modes (``agent.plan``, ``agent.plan-nofilter``) and slash commands;
+ they are not auto-routed through ``model = auto``.
+
  The auto router's effective max context window is
  ``[code-fast].ctx_size`` -- fast is the bottom of the step-down
  ladder, so any context that would overflow the tiers above lands on
@@ -140,8 +127,6 @@ UPSTREAM = os.getenv("LLAMA_SWAP_URL", "http://127.0.0.1:10102").rstrip("/")
  FAST_MODEL = os.getenv("ROUTER_FAST_MODEL", "code-fast")
  AGENT_MODEL = os.getenv("ROUTER_AGENT_MODEL", "code-smart")
  ULTRA_MODEL = os.getenv("ROUTER_ULTRA_MODEL", "code-ultra")
- PLAN_MODEL = os.getenv("ROUTER_PLAN_MODEL", "plan")
- UNCENSORED_MODEL = os.getenv("ROUTER_UNCENSORED_MODEL", "plan-uncensored")
 
  # Step-DOWN ladder (see module docstring). Both ceilings are *upper
  # bounds* of a tier's sweet-spot range, expressed in estimated input
@@ -167,45 +152,14 @@ UNCENSORED_MODEL = os.getenv("ROUTER_UNCENSORED_MODEL", "plan-uncensored")
  # still has comfortable headroom.
  HIGH_FIDELITY_CEILING = int(os.getenv("ROUTER_HIGH_FIDELITY_CEILING", "12000"))
  MID_FIDELITY_CEILING = int(os.getenv("ROUTER_MID_FIDELITY_CEILING", "32000"))
- # Floor the long-context rung at code-smart whenever a tool-call
- # protocol is in play -- 3B models tool-call unreliably regardless of
- # how big their context window is.
  MULTI_TURN_THRESHOLD = int(os.getenv("ROUTER_MULTI_TURN", "10"))
  AUTO_ALIASES = {"auto", "", None}
 
- UNCENSORED_TRIGGERS = re.compile(
-     r"(\[(uncensored|nofilter|no-?filter|heretic)\]"
-     r"|^[ \t]*(uncensored|nofilter|no-?filter)\s*:)",
-     re.IGNORECASE | re.MULTILINE,
- )
-
  ULTRA_TRIGGERS = re.compile(
      r"(\[(ultra|opus)\]|^[ \t]*(ultra|opus)\s*:)",
      re.IGNORECASE | re.MULTILINE,
  )
 
- PLAN_SIGNALS = re.compile(
-     r"\b(plan|design|architect(ure)?|approach|trade-?off|"
-     r"should\s+we|how\s+would\s+(you|we)|what\s+would\s+you|"
-     r"explain\s+why|reason\s+about|think\s+(through|step|hard|carefully)|"
-     r"compare\s+(options|approaches)|review\s+(the|this|my)\s+"
-     r"(architecture|design|approach|plan)|brainstorm|outline|"
-     r"summari[sz]e|root\s*cause|migrate|port\s+to)\b",
-     re.IGNORECASE,
- )
-
- AGENT_SIGNALS = re.compile(
-     r"\b(implement|fix\s+(this|the|a|my)?\s*(bug|issue|error|test)|"
-     r"write\s+(a|the|some)?\s*(function|class|test|script|module|method)|"
-     r"add\s+(a|the)?\s*(function|class|method|test|file|endpoint)|"
-     r"create\s+(a|the)?\s*(function|class|file|component|endpoint)|"
-     r"refactor|edit|patch|generate\s+code|debug|trace|"
-     r"run\s+tests?|build\s+(it|this)|compile)\b",
-     re.IGNORECASE,
- )
-
- CODE_BLOCK = re.compile(r"```|`[^`\n]{30,}`")
-
  logging.basicConfig(
      level=os.getenv("LOG_LEVEL", "INFO"),
      format="%(asctime)s %(levelname)s router %(message)s",
@@ -221,12 +175,11 @@ async def _lifespan(app: FastAPI):
      bedrock_tiers = sorted(t.name for t in TIERS.values() if t.is_bedrock)
      log.info(
          "router up upstream=%s ladder=[ultra<=%d -> agent<=%d -> fast] "
-         "fast=%s agent=%s ultra=%s plan=%s uncensored=%s bedrock=%s",
+         "fast=%s agent=%s ultra=%s bedrock=%s",
          UPSTREAM, HIGH_FIDELITY_CEILING, MID_FIDELITY_CEILING,
          FAST_MODEL, AGENT_MODEL,
          f"{ULTRA_MODEL} (active)" if _ultra_available()
          else f"{ULTRA_MODEL} (unwired -- high-fidelity rung falls back to {AGENT_MODEL})",
-         PLAN_MODEL, UNCENSORED_MODEL,
          ",".join(bedrock_tiers) or "(none)",
      )
      yield
@@ -302,12 +255,6 @@ def _estimate_tokens(messages: list[dict[str, Any]] | None, prompt: str | None)
      return chars // 4
 
 
- def _matches(pattern: re.Pattern[str], messages: list[dict[str, Any]] | None, prompt: str | None) -> bool:
-     if prompt and pattern.search(prompt):
-         return True
-     return any(pattern.search(t) for t in _iter_message_text(messages))
-
-
  def _ultra_available() -> bool:
      """True iff the ultra tier is loaded from ``models.ini``.
 
@@ -331,6 +278,11 @@ def classify(body: dict[str, Any]) -> tuple[str, str]:
 
      Step-DOWN ladder: top fidelity for short context, fall to mid for
      medium, drop to fast for long. See module docstring for rationale.
+
+     Only the fast / agent / ultra rungs are implemented here. Plan and
+     uncensored tiers are accessible via their dedicated agent modes
+     (``agent.plan``, ``agent.plan-nofilter``) and slash commands; they
+     are not auto-routed from the build agent.
      """
      messages = body.get("messages") if isinstance(body.get("messages"), list) else None
      prompt = body.get("prompt") if isinstance(body.get("prompt"), str) else None
@@ -341,51 +293,17 @@ def classify(body: dict[str, Any]) -> tuple[str, str]:
          for m in (messages or [])
          if m.get("role") == "system" and isinstance(m.get("content"), str)
      ]
-     if any(UNCENSORED_TRIGGERS.search(s) for s in (last_user, *sys_prompts) if s):
-         return UNCENSORED_MODEL, "uncensored-trigger"
 
      if any(ULTRA_TRIGGERS.search(s) for s in (last_user, *sys_prompts) if s):
          if _ultra_available():
              return ULTRA_MODEL, "ultra-trigger"
-         # Explicit user opt-in but the tier isn't wired up. Don't 404 --
-         # serve the request from the heaviest tier we *do* have and let
-         # the user notice in logs that their trigger was a no-op.
          log.warning("ultra-trigger ignored: %s not in models.ini; falling back to %s",
                      ULTRA_MODEL, AGENT_MODEL)
          return AGENT_MODEL, f"ultra-trigger->agent ({ULTRA_MODEL} unavailable)"
 
      n_turns = sum(1 for m in (messages or []) if m.get("role") == "user")
-     _last_msgs = [{"role": "user", "content": last_user}] if last_user else None
-     has_code_signal = (
-         _matches(CODE_BLOCK, _last_msgs, prompt)
-         or _matches(AGENT_SIGNALS, _last_msgs, prompt)
-     )
-
      est = _estimate_tokens(messages, prompt)
 
-     # Plan track is orthogonal to the code fidelity ladder: ``plan`` is a
-     # chat-tuned model meant for design / "should we" discussions. Only
-     # take it when nothing about the request says "I'm about to write
-     # code" (no triple-backticks, no agent verbs). Tools are stripped
-     # from the request body before dispatch (see ``_handle_completion``),
-     # so their presence here does not block plan routing.
-     # Only route to plan if the input fits in the planner's ctx_size --
-     # past that we fall through to the coding ladder which has tiers
-     # (smart, fast) explicitly sized for larger contexts.
-     if (
-         not has_code_signal
-         and _matches(PLAN_SIGNALS, messages, prompt)
-     ):
-         plan_tier = TIER_BY_ALIAS.get(PLAN_MODEL)
-         plan_ctx = plan_tier.ctx_size if plan_tier else 0
-         if not plan_ctx or est <= plan_ctx:
-             return PLAN_MODEL, "plan-signal"
-         log.info(
-             "plan-signal but tokens~%d > %s.ctx_size %d; "
-             "falling through to coding ladder",
-             est, PLAN_MODEL, plan_ctx,
-         )
-
      # Rung 1: short context -- start at the top.
      if est <= HIGH_FIDELITY_CEILING:
          if _ultra_available():
@@ -400,9 +318,7 @@ def classify(body: dict[str, Any]) -> tuple[str, str]:
          return AGENT_MODEL, f"mid-fidelity tokens~{est}<={MID_FIDELITY_CEILING}"
 
      # Rung 3: long context -- step down to fast. Floor at smart only
-     # when the multi-turn threshold is hit; tools alone no longer
-     # prevent the step-down (plan tiers strip tools before dispatch,
-     # and code-fast is a hosted model that tool-calls reliably).
+     # when the multi-turn threshold is hit.
      if n_turns >= MULTI_TURN_THRESHOLD:
          return AGENT_MODEL, f"long-context tokens~{est}>{MID_FIDELITY_CEILING} (user-turns={n_turns}>={MULTI_TURN_THRESHOLD} floor)"
      return FAST_MODEL, f"long-context tokens~{est}>{MID_FIDELITY_CEILING}"
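Taken together, a hedged illustration of what the slimmed-down `classify` now returns (it assumes no `[code-ultra]` section is loaded, so the high-fidelity rung falls back to `code-smart`; only the model element is asserted, because the reason strings embed token estimates):

```python
# Illustration only: exercising llmstack.app.classify after this change.
from llmstack.app import classify

model, reason = classify(
    {"model": "auto",
     "messages": [{"role": "user", "content": "capital of France?"}]}
)
assert model == "code-smart"      # short context, ultra unwired -> smart

model, reason = classify(
    {"model": "auto",
     "messages": [{"role": "user", "content": "x" * 200_000}]}  # ~50k tokens
)
assert model == "code-fast"       # long context, single user turn -> fast
```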
@@ -531,14 +447,14 @@ async def list_models() -> JSONResponse:
              f"'{AGENT_MODEL}' up to ~{MID_FIDELITY_CEILING}, "
              f"'{FAST_MODEL}' beyond that."
          )
-         name = "Auto (step-down router: ultra/agent/fast + plan/uncensored)"
+         name = "Auto (step-down router: ultra/agent/fast)"
      else:
          top_blurb = (
              f"Step-down ladder (top->bottom as context grows): "
              f"'{AGENT_MODEL}' up to ~{MID_FIDELITY_CEILING} tokens, "
              f"'{FAST_MODEL}' beyond that."
          )
-         name = "Auto (step-down router: agent/fast + plan/uncensored)"
+         name = "Auto (step-down router: agent/fast)"
      data["data"].insert(0, {
          "id": "auto",
          "object": "model",
@@ -547,8 +463,6 @@ async def list_models() -> JSONResponse:
          "name": name,
          "description": (
              f"{top_blurb} "
-             f"'{PLAN_MODEL}' for design/planning (orthogonal to ladder); "
-             f"'{UNCENSORED_MODEL}' for explicit [nofilter] triggers; "
              f"'[ultra]'/'[opus]' triggers force '{ULTRA_MODEL}' regardless of size."
          ),
          "tier": "auto",
@@ -608,6 +522,41 @@ def _inject_sampler(body: dict[str, Any], tier: Tier) -> bool:
      return mutated
 
 
+ def _inject_name_json(raw: bytes, tier_name: str) -> bytes:
+     try:
+         data = json.loads(raw)
+     except (json.JSONDecodeError, ValueError):
+         return raw
+     try:
+         msg = data["choices"][0]["message"]
+         if msg.get("content"):
+             msg["name"] = tier_name
+     except (KeyError, IndexError, TypeError):
+         pass
+     return json.dumps(data).encode()
+
+
+ def _inject_name_sse(chunk: bytes, tier_name: str, injected: list[bool]) -> bytes:
+     if injected[0]:
+         return chunk
+     line = chunk.decode(errors="replace")
+     if not line.startswith("data: "):
+         return chunk
+     payload_str = line[len("data: "):].strip()
+     if payload_str in ("[DONE]", ""):
+         return chunk
+     try:
+         payload = json.loads(payload_str)
+         delta = payload["choices"][0]["delta"]
+         if "role" in delta:
+             delta["name"] = tier_name
+             injected[0] = True
+             return f"data: {json.dumps(payload, separators=(',', ':'))}\n\n".encode()
+     except (KeyError, IndexError, TypeError, json.JSONDecodeError):
+         pass
+     return chunk
+
+
  async def _handle_completion(req: Request, path: str) -> Response:
      raw = await req.body()
      headers = _filter_request_headers(req)
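The one-shot contract of `_inject_name_sse` can be checked with a single synthetic chunk (a sketch, not a shipped test; it assumes the 0.9.7 package is importable):

```python
# Sketch: the first role-announcing SSE chunk gets the tier name injected
# into its delta; the shared flag then suppresses any later rewrites.
import json
from llmstack.app import _inject_name_sse  # assumes 0.9.7 installed

chunk = b'data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}\n\n'
injected = [False]          # same one-shot flag shape as in _handle_completion
out = _inject_name_sse(chunk, "code-smart", injected)

payload = json.loads(out.decode().removeprefix("data: "))
assert payload["choices"][0]["delta"]["name"] == "code-smart"
# Second call is a no-op because the flag is already set.
assert injected[0] and _inject_name_sse(chunk, "code-smart", injected) == chunk
```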
@@ -631,11 +580,6 @@
          mutated = True
 
      chosen_name = body.get("model")
-     if chosen_name in {PLAN_MODEL, UNCENSORED_MODEL} and body.get("tools"):
-         log.info("plan tier %s: stripping tools from request", chosen_name)
-         body.pop("tools")
-         body.pop("tool_choice", None)
-         mutated = True
      tier = _resolve_tier(chosen_name)
      if tier is not None and _inject_sampler(body, tier):
          mutated = True
@@ -646,6 +590,28 @@
      if tier is not None and tier.is_bedrock:
          from llmstack.backends import bedrock as bedrock_backend
          resp = await bedrock_backend.dispatch(req, tier, body)
+     elif tier is not None and body.get("stream"):
+         proxy = await _stream_proxy(req.method, path, raw, headers)
+         injected: list[bool] = [False]
+         tier_name = tier.name
+         original_gen = proxy.body_iterator
+
+         async def _named_gen():
+             async for chunk in original_gen:
+                 yield _inject_name_sse(chunk, tier_name, injected)
+
+         proxy.body_iterator = _named_gen()
+         resp = proxy
+     elif tier is not None:
+         proxy = await _stream_proxy(req.method, path, raw, headers)
+         raw_resp = b"".join([chunk async for chunk in proxy.body_iterator])
+         patched = _inject_name_json(raw_resp, tier.name)
+         resp = Response(
+             content=patched,
+             status_code=proxy.status_code,
+             headers=dict(proxy.headers),
+             media_type=proxy.media_type,
+         )
      else:
          resp = await _stream_proxy(req.method, path, raw, headers)
 
@@ -588,6 +588,8 @@ async def _complete_response(client: Any, tier: Tier, converse_kwargs: dict[str,
          return JSONResponse(status_code=502, content={"error": _error_payload(exc)})
 
      message, finish = _openai_message_from_converse(resp)
+     if message.get("content"):
+         message["name"] = tier.name
      usage_in = (resp.get("usage") or {})
      payload = {
          "id": _completion_id(),
@@ -665,7 +667,7 @@ async def _stream_response(client: Any, tier: Tier, converse_kwargs: dict[str, A
 
      # First chunk: announce the assistant role so OpenAI clients can
      # initialise their accumulator.
-     yield _sse(_frame({"role": "assistant"}))
+     yield _sse(_frame({"role": "assistant", "name": model_label}))
 
      # Per-content-block state: index -> "text" | "tool_use"
      block_kinds: dict[int, str] = {}
@@ -69,7 +69,7 @@ COMMANDS = {
          "agent": "plan",
      },
      "nofilter": {
-         "template": "[nofilter]",
+         "template": "",
          "description": "Route to the uncensored planning model.",
          "agent": "plan-nofilter",
      },
@@ -194,7 +194,7 @@ def build_config(
 
      models: dict[str, dict] = {
          "auto": {
-             "name": "Auto (router selects: fast / agent / plan / uncensored)",
+             "name": "Auto (router selects: fast / agent / ultra)",
              "limit": {"context": auto_ctx, "output": 16384},
              "tool_call": True,
              "cost": ZERO_COST,
@@ -178,7 +178,7 @@ description = Qwopus GLM 18B - planning, design discussions, architecture
  ; aws_region = eu-central-1
  ; aws_profile = bedrock-prod
  ; ctx_size = 200000
- ; sampler = temp=0.7, top_p=0.9   ; creative; Opus 4.6 accepts both
+ ; sampler = temp=0.7              ; creative; Opus 4.6
  ; description = Claude Opus 4.6 on Bedrock - planning, design discussions, architecture
 
  [plan-uncensored]
@@ -258,21 +258,18 @@ description = Mistral-Small 3.2 24B Heretic - no-filter planning
  ;
  ; First-match-wins decision tree applied by llmstack/app.py when model="auto":
  ;
- ;   1. "[nofilter]" / "uncensored:" trigger                       -> plan-uncensored
- ;   2. "[ultra]" / "[opus]" / "ultra:" trigger AND code-ultra
+ ;   1. "[ultra]" / "[opus]" / "ultra:" trigger AND code-ultra
  ;      tier configured                                            -> code-ultra
- ;   3. PLAN signal words AND no code-block / agent verbs / tools
- ;      AND tokens <= [plan].ctx_size (pure design discussion that
- ;      still fits the planner's window)                           -> plan
- ;      ...if the plan tier's ctx_size is breached, the request
- ;      falls through to the coding ladder below rather than being
- ;      sent to a planner whose window can't hold the input.
- ;   4. tokens <= high_fidelity_ceiling AND code-ultra configured  -> code-ultra
+ ;   2. tokens <= high_fidelity_ceiling AND code-ultra configured  -> code-ultra
  ;      tokens <= high_fidelity_ceiling AND no code-ultra          -> code-smart
- ;   5. tokens <= mid_fidelity_ceiling                             -> code-smart
- ;   6. otherwise (long context):
- ;      - if tools[] OR turns >= multi_turn (3B tool-calls badly)  -> code-smart
- ;      - else                                                     -> code-fast
+ ;   3. tokens <= mid_fidelity_ceiling                             -> code-smart
+ ;   4. otherwise (long context):
+ ;      - if turns >= multi_turn (floor at smart)                  -> code-smart
+ ;      - else                                                     -> code-fast
+ ;
+ ; Plan and uncensored tiers are accessible via their dedicated agent
+ ; modes (agent.plan, agent.plan-nofilter) and slash commands; they are
+ ; NOT auto-routed through model=auto.
  ;
  ; AUTO ROUTER MAX CONTEXT = [code-fast].ctx_size. The fast tier sits at
  ; the bottom of the step-down ladder, so any context too big for the
@@ -303,9 +300,6 @@ description = Mistral-Small 3.2 24B Heretic - no-filter planning
  high_fidelity_ceiling = 12000   ; tokens; below this, top-tier model is still cheap+fast (and ultra ctx_size = 2 * this)
  mid_fidelity_ceiling = 32000    ; tokens; smart's sweet spot up to here, then step down to fast (smart ctx_size = 2 * this)
  multi_turn = 10                 ; turn count that floors the long-context rung at code-smart
- agent_signal_words = implement, fix bug, write a function, refactor, edit, patch, debug, run tests, build it
- plan_signal_words = design, architect, approach, trade-off, should we, how would you, explain why, think through, compare options, brainstorm, root cause
- uncensored_triggers = [nofilter], [uncensored], [heretic], "uncensored:", "nofilter:" (line start)
  ultra_triggers = [ultra], [opus], "ultra:", "opus:" (line start)
 
  ;------------------------------------------------------------------------------
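The knob lines above keep inline `;` comments, which stock `configparser` strips only when asked. A hedged sketch of reading them (the section header is not visible in this hunk, so `router` below is an assumption):

```python
# Sketch: reading the router knobs from models.ini; the "router" section
# name is an assumption, since the hunk does not show the header.
import configparser

cp = configparser.ConfigParser(inline_comment_prefixes=(";",))
cp.read("models.ini")
high = cp.getint("router", "high_fidelity_ceiling", fallback=12000)
mid = cp.getint("router", "mid_fidelity_ceiling", fallback=32000)
turns = cp.getint("router", "multi_turn", fallback=10)
```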
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
  [project]
  name = "opencode-llmstack"
- version = "0.9.4"
+ version = "0.9.7"
  description = "Multi-tier local LLM stack: llama-swap + FastAPI auto-router + opencode wiring."
  readme = "README.md"
  requires-python = ">=3.11"