opencode-llmstack 0.9.6__tar.gz → 0.9.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. {opencode_llmstack-0.9.6/opencode_llmstack.egg-info → opencode_llmstack-0.9.7}/PKG-INFO +26 -39
  2. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/README.md +25 -38
  3. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/UPGRADING.md +12 -12
  4. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/app.py +0 -2
  5. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/generators/opencode.py +1 -1
  6. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7/opencode_llmstack.egg-info}/PKG-INFO +26 -39
  7. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/pyproject.toml +1 -1
  8. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/CHANGELOG.md +0 -0
  9. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/LICENSE +0 -0
  10. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/AGENTS.md +0 -0
  11. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/__init__.py +0 -0
  12. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/__main__.py +0 -0
  13. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/_platform.py +0 -0
  14. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/backends/__init__.py +0 -0
  15. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/backends/bedrock.py +0 -0
  16. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/check_models.py +0 -0
  17. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/cli.py +0 -0
  18. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/__init__.py +0 -0
  19. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/_helpers.py +0 -0
  20. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/activate.py +0 -0
  21. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/check.py +0 -0
  22. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/download.py +0 -0
  23. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/install.py +0 -0
  24. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/install_llama_swap.py +0 -0
  25. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/reload.py +0 -0
  26. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/restart.py +0 -0
  27. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/setup.py +0 -0
  28. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/start.py +0 -0
  29. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/status.py +0 -0
  30. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/commands/stop.py +0 -0
  31. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/download/__init__.py +0 -0
  32. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/download/binary.py +0 -0
  33. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/download/ggufs.py +0 -0
  34. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/generators/__init__.py +0 -0
  35. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/generators/llama_swap.py +0 -0
  36. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/models.ini +0 -0
  37. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/paths.py +0 -0
  38. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/shell_env.py +0 -0
  39. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/tiers.py +0 -0
  40. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/SOURCES.txt +0 -0
  41. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/dependency_links.txt +0 -0
  42. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/entry_points.txt +0 -0
  43. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/requires.txt +0 -0
  44. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/opencode_llmstack.egg-info/top_level.txt +0 -0
  45. {opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/setup.cfg +0 -0

{opencode_llmstack-0.9.6/opencode_llmstack.egg-info → opencode_llmstack-0.9.7}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: opencode-llmstack
-Version: 0.9.6
+Version: 0.9.7
 Summary: Multi-tier local LLM stack: llama-swap + FastAPI auto-router + opencode wiring.
 Author: llmstack
 License: MIT License
@@ -78,14 +78,14 @@ client (opencode / curl / Cursor / etc.)
 
 
 http://127.0.0.1:10101 <-- FastAPI router (llmstack.app)
-│ • model="auto" → classify → rewrite to one of 4 tiers
+│ • model="auto" → classify → rewrite to one of 3 coder tiers
 │ • everything else → pass-through
 
 http://127.0.0.1:10102 <-- llama-swap (binary, manages model lifecycle)
 │ • loads/unloads llama-server processes per model
 │ • matrix solver allows {code-fast + one heavy model} co-resident
 
-llama-server <code-fast | code-smart | plan | plan-uncensored>
+llama-server <code-fast | code-smart | code-ultra>
 
 
 GGUF in ~/.cache/huggingface/hub/...
@@ -101,7 +101,7 @@ A 64 GB unified memory M4 Max can comfortably hold **one always-on tiny coder +
 
 - **Agent work** (multi-file edits, tool use, refactors) → coder models, which are trained on tool-call protocols and code edits.
 - **Planning** (design discussions, architecture, "what's the best approach") → chat-tuned models, which are better at high-level reasoning and don't try to start writing code in response to every message.
-- **Uncensored planning** is a separate plan-tier model, opted in either by request (`agent.plan-nofilter` in opencode) or by an inline `[nofilter]` trigger in the prompt.
+- **Uncensored planning** is a separate plan-tier model, opted in by explicit agent selection (`/agent plan-nofilter` in opencode).
 
 Routing decisions cost ~zero — they're a few regex checks in the FastAPI router, not an LLM call.
 
@@ -135,20 +135,18 @@ matches how these models actually behave on this stack:
 than priors, so they tend to *improve* relative to top-tier as the
 conversation grows.
 
-First match wins:
+First match wins (auto-routing only; `plan` and `plan-uncensored` are not auto-routed):
 
 | # | Condition | → Model | Reason |
 |---|---|---|---|
-| 1 | last user msg contains `[nofilter]`, `[uncensored]`, `[heretic]`, or starts with `uncensored:` / `nofilter:` | `plan-uncensored` | explicit opt-in |
-| 2 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
-| 3 | plan verbs (*design, architect, approach, trade-off, should we, explain why, …*) AND no code blocks / agent verbs / tools | `plan` | pure design discussion (orthogonal track) |
-| 4 | estimated input ≤ 12 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier, context still being built, latency/$ are best here |
-| 5 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
-| 6 | otherwise (long context) AND ≥ 10 user turns | `code-smart` | floor: deep agentic loop, keep the heavy model |
-| 7 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
+| 1 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
+| 2 | estimated input ≤ 12 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier, context still being built, latency/$ are best here |
+| 3 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
+| 4 | otherwise (long context) AND ≥ 10 user turns | `code-smart` | floor: deep agentic loop, keep the heavy model |
+| 5 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
 
 Token estimates are `chars / 4` over all message text + `prompt`. The
-`code-ultra` rungs (2 and 4) are gated on availability: when no
+`code-ultra` rungs (1 and 2) are gated on availability: when no
 `[code-ultra]` section is loaded from `models.ini`, both silently fall
 back to `code-smart` so vanilla installs don't 404.
 
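As an aside for readers tracking this change: the new ladder is small enough to sketch in plain Python. The snippet below is an illustrative reimplementation of the table above, not the actual `llmstack/app.py` code; the function names and the exact shape of the `ULTRA_TRIGGERS` regex are assumptions.

```python
import re

# Assumed shape of the trigger regex; the real pattern lives in llmstack/app.py.
ULTRA_TRIGGERS = re.compile(r"\[(ultra|opus)\]|^ultra:", re.IGNORECASE | re.MULTILINE)

HIGH_CEILING = 12_000  # ROUTER_HIGH_FIDELITY_CEILING default
MID_CEILING = 32_000   # ROUTER_MID_FIDELITY_CEILING default
MULTI_TURN = 10        # ROUTER_MULTI_TURN default


def estimate_tokens(messages: list[dict], prompt: str = "") -> int:
    """chars / 4 over all message text + prompt, per the README."""
    chars = sum(len(m.get("content") or "") for m in messages) + len(prompt)
    return chars // 4


def choose_model(messages: list[dict], prompt: str = "",
                 ultra_available: bool = False) -> str:
    last_user = next((m.get("content") or "" for m in reversed(messages)
                      if m.get("role") == "user"), "")
    tokens = estimate_tokens(messages, prompt)
    user_turns = sum(1 for m in messages if m.get("role") == "user")

    if ULTRA_TRIGGERS.search(last_user) and ultra_available:
        return "code-ultra"                      # rung 1: explicit opt-in
    if tokens <= HIGH_CEILING:                   # rung 2: short context
        return "code-ultra" if ultra_available else "code-smart"
    if tokens <= MID_CEILING:                    # rung 3: mid context
        return "code-smart"
    if user_turns >= MULTI_TURN:                 # rung 4: deep agentic loop
        return "code-smart"
    return "code-fast"                           # rung 5: long context
```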
@@ -198,7 +196,8 @@ your global setup unchanged.
 | **`agent.plan-nofilter`** (custom uncensored planner) | `llama.cpp/plan-uncensored` |
 
 Inside opencode you can switch agents with `/agent` or by `@plan-nofilter`-mentioning
-a custom one. Slash-commands `/review`, `/nofilter` are also available.
+a custom one. The `plan` and `plan-uncensored` tiers are **not auto-routed** from the build agent —
+they're only accessible via explicit agent selection (`/agent plan` or `/agent plan-nofilter`).
 
 Want a second terminal into the same stack? Install the activate hook
 once (`eval "$(llmstack activate zsh)"`) and any new shell that `cd`s
@@ -266,8 +265,9 @@ Per-project state (gitignored) is created lazily under `<work-dir>/.llmstack/`:
 ```
 
 The `llama-swap` binary lives outside any project at
-`$XDG_DATA_HOME/llmstack/bin/llama-swap` (override with
-`LLMSTACK_BIN_DIR`). One download is reused across all projects.
+`$XDG_DATA_HOME/llmstack/bin/llama-swap` on macOS/Linux (override with
+`LLMSTACK_BIN_DIR`), or `%LOCALAPPDATA%\llmstack\bin\llama-swap.exe` on Windows.
+One download is reused across all projects.
 
 ## Quick start
 
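The platform split is easy to express in code. Below is a minimal sketch of the lookup rule, assuming the `LLMSTACK_BIN_DIR` override applies on every platform; the helper name is hypothetical, and the real resolution (presumably in `llmstack/paths.py`) may differ in detail.

```python
import os
import sys
from pathlib import Path


def llama_swap_binary() -> Path:
    """Resolve the shared llama-swap binary path described above."""
    override = os.getenv("LLMSTACK_BIN_DIR")
    if override:
        base = Path(override)
    elif sys.platform == "win32":
        base = Path(os.environ["LOCALAPPDATA"]) / "llmstack" / "bin"
    else:
        xdg = os.getenv("XDG_DATA_HOME", str(Path.home() / ".local" / "share"))
        base = Path(xdg) / "llmstack" / "bin"
    name = "llama-swap.exe" if sys.platform == "win32" else "llama-swap"
    return base / name
```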
@@ -358,8 +358,9 @@ Notes:
 or a package like `winget install ggml.llama-cpp` and put it on
 `PATH` (or set `$env:LLAMA_SERVER_BIN`). The Mac-only
 `iogpu.wired_limit_mb` step does not apply.
-- The `[llmstack:<channel>]` prompt prefix shows up in PowerShell too;
-`cmd.exe` gets a simpler `[llmstack:<channel>]` prompt via `doskey`.
+- The `[llmstack:<channel>]` prompt prefix shows up in PowerShell; `cmd.exe`
+does not support custom prompts in the same way, so activation is
+PowerShell-only.
 - Stopping daemons uses `taskkill /T /F` under the hood, so the
 llama-server children get cleaned up as well.
 
@@ -465,7 +466,7 @@ llmstack restart --next # cycle into the next channel
 
 ### Try each routing path
 
-All of these go to `/v1/chat/completions` on `:10101`. Each should pick a different upstream model:
+All of these go to `/v1/chat/completions` on `:10101`. The `auto` router classifies based on token count and context:
 
 ```bash
 # trivial chat -> code-fast
@@ -473,22 +474,14 @@ curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: applicatio
 -d '{"model":"auto","stream":false,
 "messages":[{"role":"user","content":"capital of France?"}]}' | jq .model
 
-# planning -> plan
-curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
--d '{"model":"auto","stream":false,
-"messages":[{"role":"user","content":"how would you design a rate limiter for our API?"}]}' | jq .model
-
 # agent work -> code-smart
 curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
 -d '{"model":"auto","stream":false,
 "messages":[{"role":"user","content":"refactor this function for clarity:\n```python\ndef f(x): return x*2\n```"}]}' | jq .model
-
-# uncensored plan -> plan-uncensored
-curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
--d '{"model":"auto","stream":false,
-"messages":[{"role":"user","content":"[nofilter] outline a red-team plan for our auth flow"}]}' | jq .model
 ```
 
+To access `plan` or `plan-uncensored` tiers, use explicit agent selection in opencode (`/agent plan` or `/agent plan-nofilter`) rather than `model=auto`.
+
 ## Endpoints
 
 | Port | Service | Purpose |
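The same probes can be scripted without curl/jq. Here is a stdlib-only Python version; which tier each probe lands on follows the ladder rungs above, so it depends on whether `code-ultra` is wired.

```python
import json
import urllib.request


def route_probe(content: str) -> str:
    """POST a model="auto" request and return the tier the router picked."""
    req = urllib.request.Request(
        "http://127.0.0.1:10101/v1/chat/completions",
        data=json.dumps({
            "model": "auto",
            "stream": False,
            "messages": [{"role": "user", "content": content}],
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["model"]


print(route_probe("capital of France?"))            # short-context rung
print(route_probe("[ultra] refactor this module"))  # rung 1 if code-ultra is wired
```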
@@ -565,8 +558,6 @@ All knobs are env vars; defaults are picked up by `llmstack start`.
 | `ROUTER_FAST_MODEL` | `code-fast` | long-context (>= mid ceiling) → here |
 | `ROUTER_AGENT_MODEL` | `code-smart` | mid-context + tools/loop floor → here |
 | `ROUTER_ULTRA_MODEL` | `code-ultra` | short-context top tier → here (gated on availability) |
-| `ROUTER_PLAN_MODEL` | `plan` | design/discussion verbs → here |
-| `ROUTER_UNCENSORED_MODEL` | `plan-uncensored` | `[nofilter]` triggers → here |
 | `ROUTER_HIGH_FIDELITY_CEILING` | `12000` | tokens; at or below this, route to top tier (ultra → smart fallback). Paired with `code-ultra.ctx_size = 24000` (2x). |
 | `ROUTER_MID_FIDELITY_CEILING` | `32000` | tokens; at or below this, route to `code-smart`; beyond, step down to `code-fast`. Paired with `code-smart.ctx_size = 64000` (2x). |
 | `ROUTER_MULTI_TURN` | `10` | user-turn count that floors the long-context rung at `code-smart` |
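For example, the ladder ceilings can be shifted per invocation by setting the knobs before starting the stack. A hypothetical launcher sketch (the values are arbitrary; only the variable names come from the table above):

```python
import os
import subprocess

env = os.environ | {
    "ROUTER_HIGH_FIDELITY_CEILING": "8000",   # reserve the top tier for shorter prompts
    "ROUTER_MID_FIDELITY_CEILING": "48000",   # let code-smart keep longer contexts
}
subprocess.run(["llmstack", "start"], env=env, check=True)
```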
@@ -577,14 +568,10 @@ To force a request to never auto-route, set `model` to a concrete alias (`code-f
 
 ## Triggering uncensored mode
 
-Two ways:
-
-1. **Explicit agent in opencode:** `/agent plan-nofilter` (or mention it).
-2. **Inline trigger in any auto-routed message** — anywhere in the most recent user turn:
-   - `[nofilter]`, `[uncensored]`, `[heretic]`
-   - or a line starting with `uncensored:` / `nofilter:` / `no-filter:`
+The `plan-uncensored` tier is accessible via explicit agent selection only:
 
-Triggers are *only* checked on the latest user message and the system prompt, so an old `[nofilter]` further up the conversation won't pin the whole session.
+1. **In opencode:** `/agent plan-nofilter` (or mention `@plan-nofilter`).
+2. **Via opencode config:** set `agent.plan-nofilter` as your active agent.
 
 ## Troubleshooting
 
@@ -594,7 +581,7 @@ Triggers are *only* checked on the latest user message and the system prompt, so
 
 **OOM / unexplained slowdown** → run `top -o mem -stats pid,rsize,command` to see what's resident. The matrix should prevent two heavy models loading together; if it somehow happens, `llmstack restart`.
 
-**Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`AGENT_SIGNALS` / `PLAN_SIGNALS` / `UNCENSORED_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
+**Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`ULTRA_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
 
 **Want a pure pass-through (no auto routing)** → change opencode's `baseURL` to `http://127.0.0.1:10102/v1` (llama-swap directly) and only use concrete model names. (Note: this skips the bedrock dispatcher; only GGUF tiers will be reachable.)
 
{opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/README.md
@@ -19,14 +19,14 @@ client (opencode / curl / Cursor / etc.)
 
 
 http://127.0.0.1:10101 <-- FastAPI router (llmstack.app)
-│ • model="auto" → classify → rewrite to one of 4 tiers
+│ • model="auto" → classify → rewrite to one of 3 coder tiers
 │ • everything else → pass-through
 
 http://127.0.0.1:10102 <-- llama-swap (binary, manages model lifecycle)
 │ • loads/unloads llama-server processes per model
 │ • matrix solver allows {code-fast + one heavy model} co-resident
 
-llama-server <code-fast | code-smart | plan | plan-uncensored>
+llama-server <code-fast | code-smart | code-ultra>
 
 
 GGUF in ~/.cache/huggingface/hub/...
@@ -42,7 +42,7 @@ A 64 GB unified memory M4 Max can comfortably hold **one always-on tiny coder +
 
 - **Agent work** (multi-file edits, tool use, refactors) → coder models, which are trained on tool-call protocols and code edits.
 - **Planning** (design discussions, architecture, "what's the best approach") → chat-tuned models, which are better at high-level reasoning and don't try to start writing code in response to every message.
-- **Uncensored planning** is a separate plan-tier model, opted in either by request (`agent.plan-nofilter` in opencode) or by an inline `[nofilter]` trigger in the prompt.
+- **Uncensored planning** is a separate plan-tier model, opted in by explicit agent selection (`/agent plan-nofilter` in opencode).
 
 Routing decisions cost ~zero — they're a few regex checks in the FastAPI router, not an LLM call.
 
@@ -76,20 +76,18 @@ matches how these models actually behave on this stack:
 than priors, so they tend to *improve* relative to top-tier as the
 conversation grows.
 
-First match wins:
+First match wins (auto-routing only; `plan` and `plan-uncensored` are not auto-routed):
 
 | # | Condition | → Model | Reason |
 |---|---|---|---|
-| 1 | last user msg contains `[nofilter]`, `[uncensored]`, `[heretic]`, or starts with `uncensored:` / `nofilter:` | `plan-uncensored` | explicit opt-in |
-| 2 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
-| 3 | plan verbs (*design, architect, approach, trade-off, should we, explain why, …*) AND no code blocks / agent verbs / tools | `plan` | pure design discussion (orthogonal track) |
-| 4 | estimated input ≤ 12 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier, context still being built, latency/$ are best here |
-| 5 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
-| 6 | otherwise (long context) AND ≥ 10 user turns | `code-smart` | floor: deep agentic loop, keep the heavy model |
-| 7 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
+| 1 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
+| 2 | estimated input ≤ 12 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier, context still being built, latency/$ are best here |
+| 3 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
+| 4 | otherwise (long context) AND ≥ 10 user turns | `code-smart` | floor: deep agentic loop, keep the heavy model |
+| 5 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
 
 Token estimates are `chars / 4` over all message text + `prompt`. The
-`code-ultra` rungs (2 and 4) are gated on availability: when no
+`code-ultra` rungs (1 and 2) are gated on availability: when no
 `[code-ultra]` section is loaded from `models.ini`, both silently fall
 back to `code-smart` so vanilla installs don't 404.
 
@@ -139,7 +137,8 @@ your global setup unchanged.
 | **`agent.plan-nofilter`** (custom uncensored planner) | `llama.cpp/plan-uncensored` |
 
 Inside opencode you can switch agents with `/agent` or by `@plan-nofilter`-mentioning
-a custom one. Slash-commands `/review`, `/nofilter` are also available.
+a custom one. The `plan` and `plan-uncensored` tiers are **not auto-routed** from the build agent —
+they're only accessible via explicit agent selection (`/agent plan` or `/agent plan-nofilter`).
 
 Want a second terminal into the same stack? Install the activate hook
 once (`eval "$(llmstack activate zsh)"`) and any new shell that `cd`s
@@ -207,8 +206,9 @@ Per-project state (gitignored) is created lazily under `<work-dir>/.llmstack/`:
 ```
 
 The `llama-swap` binary lives outside any project at
-`$XDG_DATA_HOME/llmstack/bin/llama-swap` (override with
-`LLMSTACK_BIN_DIR`). One download is reused across all projects.
+`$XDG_DATA_HOME/llmstack/bin/llama-swap` on macOS/Linux (override with
+`LLMSTACK_BIN_DIR`), or `%LOCALAPPDATA%\llmstack\bin\llama-swap.exe` on Windows.
+One download is reused across all projects.
 
 ## Quick start
 
@@ -299,8 +299,9 @@ Notes:
 or a package like `winget install ggml.llama-cpp` and put it on
 `PATH` (or set `$env:LLAMA_SERVER_BIN`). The Mac-only
 `iogpu.wired_limit_mb` step does not apply.
-- The `[llmstack:<channel>]` prompt prefix shows up in PowerShell too;
-`cmd.exe` gets a simpler `[llmstack:<channel>]` prompt via `doskey`.
+- The `[llmstack:<channel>]` prompt prefix shows up in PowerShell; `cmd.exe`
+does not support custom prompts in the same way, so activation is
+PowerShell-only.
 - Stopping daemons uses `taskkill /T /F` under the hood, so the
 llama-server children get cleaned up as well.
 
@@ -406,7 +407,7 @@ llmstack restart --next # cycle into the next channel
 
 ### Try each routing path
 
-All of these go to `/v1/chat/completions` on `:10101`. Each should pick a different upstream model:
+All of these go to `/v1/chat/completions` on `:10101`. The `auto` router classifies based on token count and context:
 
 ```bash
 # trivial chat -> code-fast
@@ -414,22 +415,14 @@ curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: applicatio
 -d '{"model":"auto","stream":false,
 "messages":[{"role":"user","content":"capital of France?"}]}' | jq .model
 
-# planning -> plan
-curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
--d '{"model":"auto","stream":false,
-"messages":[{"role":"user","content":"how would you design a rate limiter for our API?"}]}' | jq .model
-
 # agent work -> code-smart
 curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
 -d '{"model":"auto","stream":false,
 "messages":[{"role":"user","content":"refactor this function for clarity:\n```python\ndef f(x): return x*2\n```"}]}' | jq .model
-
-# uncensored plan -> plan-uncensored
-curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
--d '{"model":"auto","stream":false,
-"messages":[{"role":"user","content":"[nofilter] outline a red-team plan for our auth flow"}]}' | jq .model
 ```
 
+To access `plan` or `plan-uncensored` tiers, use explicit agent selection in opencode (`/agent plan` or `/agent plan-nofilter`) rather than `model=auto`.
+
 ## Endpoints
 
 | Port | Service | Purpose |
@@ -506,8 +499,6 @@ All knobs are env vars; defaults are picked up by `llmstack start`.
 | `ROUTER_FAST_MODEL` | `code-fast` | long-context (>= mid ceiling) → here |
 | `ROUTER_AGENT_MODEL` | `code-smart` | mid-context + tools/loop floor → here |
 | `ROUTER_ULTRA_MODEL` | `code-ultra` | short-context top tier → here (gated on availability) |
-| `ROUTER_PLAN_MODEL` | `plan` | design/discussion verbs → here |
-| `ROUTER_UNCENSORED_MODEL` | `plan-uncensored` | `[nofilter]` triggers → here |
 | `ROUTER_HIGH_FIDELITY_CEILING` | `12000` | tokens; at or below this, route to top tier (ultra → smart fallback). Paired with `code-ultra.ctx_size = 24000` (2x). |
 | `ROUTER_MID_FIDELITY_CEILING` | `32000` | tokens; at or below this, route to `code-smart`; beyond, step down to `code-fast`. Paired with `code-smart.ctx_size = 64000` (2x). |
 | `ROUTER_MULTI_TURN` | `10` | user-turn count that floors the long-context rung at `code-smart` |
@@ -518,14 +509,10 @@ To force a request to never auto-route, set `model` to a concrete alias (`code-f
 
 ## Triggering uncensored mode
 
-Two ways:
-
-1. **Explicit agent in opencode:** `/agent plan-nofilter` (or mention it).
-2. **Inline trigger in any auto-routed message** — anywhere in the most recent user turn:
-   - `[nofilter]`, `[uncensored]`, `[heretic]`
-   - or a line starting with `uncensored:` / `nofilter:` / `no-filter:`
+The `plan-uncensored` tier is accessible via explicit agent selection only:
 
-Triggers are *only* checked on the latest user message and the system prompt, so an old `[nofilter]` further up the conversation won't pin the whole session.
+1. **In opencode:** `/agent plan-nofilter` (or mention `@plan-nofilter`).
+2. **Via opencode config:** set `agent.plan-nofilter` as your active agent.
 
 ## Troubleshooting
 
@@ -535,7 +522,7 @@ Triggers are *only* checked on the latest user message and the system prompt, so
 
 **OOM / unexplained slowdown** → run `top -o mem -stats pid,rsize,command` to see what's resident. The matrix should prevent two heavy models loading together; if it somehow happens, `llmstack restart`.
 
-**Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`AGENT_SIGNALS` / `PLAN_SIGNALS` / `UNCENSORED_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
+**Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`ULTRA_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
 
 **Want a pure pass-through (no auto routing)** → change opencode's `baseURL` to `http://127.0.0.1:10102/v1` (llama-swap directly) and only use concrete model names. (Note: this skips the bedrock dispatcher; only GGUF tiers will be reachable.)
 
{opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/UPGRADING.md
@@ -266,7 +266,7 @@ How to evaluate:
 - Run `llama-bench -m <new>.gguf -p 512 -n 128 -ngl 999` for raw speed
 - Sniff test with a typical autocomplete prompt; latency should feel like
 the cursor is barely ahead of you
-- Aider leaderboard "edit format" column — proxy for FIM quality
+- [Aider leaderboard](https://aider.chat/docs/leaderboards/) "edit format" column — proxy for FIM quality
 
 Size budget: **~2–6 GB** weights (we want this resident permanently while
 sharing memory with the heavy tier).
@@ -287,10 +287,10 @@ What matters:
 - **Speed at full context** (MoE models win here on Apple Silicon)
 
 How to evaluate:
-- Aider's [LLM Leaderboard](https://aider.chat/docs/leaderboards/) — most
+- [Aider's LLM Leaderboard](https://aider.chat/docs/leaderboards/) — most
 honest signal for agentic coding
-- LiveCodeBench scores
-- SWE-Bench Verified (the "real PRs" benchmark)
+- [LiveCodeBench](https://livecodebench.github.io/leaderboard.html) scores
+- [SWE-Bench Verified](https://www.swebench.com/) (the "real PRs" benchmark)
 - Run an actual opencode session in `build` mode against your repo
 
 Size budget: **~30–55 GB** weights (must fit alongside `code-fast` ≈ 5 GB
@@ -311,8 +311,8 @@ What matters:
 - **Refusals on edge cases** — fine to refuse weird stuff in plain plan mode
 
 How to evaluate:
-- Open LLM Leaderboard (filter to chat/instruct, your size class)
-- Chatbot Arena — vibes-based but useful proxy
+- [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) (filter to chat/instruct, your size class)
+- [Chatbot Arena](https://lmarena.ai/) — vibes-based but useful proxy
 - Hand-roll a "design this rate limiter" prompt and compare outputs
 
 Size budget: **~7–25 GB** weights — this tier shouldn't dominate memory.
@@ -360,12 +360,12 @@ Same size budget as `plan`.
 
 | Tier | Leaderboard |
 |---|---|
-| `code-fast` / `code-smart` | https://aider.chat/docs/leaderboards/ |
-| | https://livecodebench.github.io/leaderboard.html |
-| | https://www.swebench.com/ (Verified split) |
-| `plan` / `plan-uncensored` | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
-| | https://lmarena.ai/ |
-| | https://livebench.ai/ |
+| `code-fast` / `code-smart` | [Aider LLM Leaderboard](https://aider.chat/docs/leaderboards/) |
+| | [LiveCodeBench](https://livecodebench.github.io/leaderboard.html) |
+| | [SWE-Bench Verified](https://www.swebench.com/) |
+| `plan` / `plan-uncensored` | [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) |
+| | [Chatbot Arena](https://lmarena.ai/) |
+| | [LiveBench](https://livebench.ai/) |
 
 **Community signal** (qualitative but valuable):
 
{opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/app.py
@@ -127,8 +127,6 @@ UPSTREAM = os.getenv("LLAMA_SWAP_URL", "http://127.0.0.1:10102").rstrip("/")
 FAST_MODEL = os.getenv("ROUTER_FAST_MODEL", "code-fast")
 AGENT_MODEL = os.getenv("ROUTER_AGENT_MODEL", "code-smart")
 ULTRA_MODEL = os.getenv("ROUTER_ULTRA_MODEL", "code-ultra")
-PLAN_MODEL = os.getenv("ROUTER_PLAN_MODEL", "plan")
-UNCENSORED_MODEL = os.getenv("ROUTER_UNCENSORED_MODEL", "plan-uncensored")
 
 # Step-DOWN ladder (see module docstring). Both ceilings are *upper
 # bounds* of a tier's sweet-spot range, expressed in estimated input
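The availability gate the README describes (the `code-ultra` rungs silently fall back to `code-smart` when `models.ini` defines no `[code-ultra]` section) could look something like the sketch below. This is an illustration of the documented behavior, not the actual app.py internals.

```python
import configparser
from pathlib import Path


def ultra_available(models_ini: Path) -> bool:
    """True when models.ini defines a [code-ultra] section."""
    cfg = configparser.ConfigParser()
    cfg.read(models_ini)  # silently yields no sections if the file is missing
    return cfg.has_section("code-ultra")


# Gated resolution: vanilla installs without [code-ultra] route those
# rungs to code-smart instead of 404ing on a missing model.
def ultra_or_fallback(models_ini: Path) -> str:
    return "code-ultra" if ultra_available(models_ini) else "code-smart"
```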
{opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/llmstack/generators/opencode.py
@@ -69,7 +69,7 @@ COMMANDS = {
         "agent": "plan",
     },
     "nofilter": {
-        "template": "[nofilter]",
+        "template": "",
         "description": "Route to the uncensored planning model.",
         "agent": "plan-nofilter",
     },
{opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7/opencode_llmstack.egg-info}/PKG-INFO

Same hunks as the PKG-INFO diff at the top of this page (+26 −39); the egg-info copy mirrors the packaged metadata, so the two diffs are identical.
{opencode_llmstack-0.9.6 → opencode_llmstack-0.9.7}/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "opencode-llmstack"
-version = "0.9.6"
+version = "0.9.7"
 description = "Multi-tier local LLM stack: llama-swap + FastAPI auto-router + opencode wiring."
 readme = "README.md"
 requires-python = ">=3.11"