opencode-llmstack 0.6.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. opencode_llmstack-0.6.0/PKG-INFO +693 -0
  2. opencode_llmstack-0.6.0/README.md +658 -0
  3. opencode_llmstack-0.6.0/llmstack/AGENTS.md +13 -0
  4. opencode_llmstack-0.6.0/llmstack/__init__.py +20 -0
  5. opencode_llmstack-0.6.0/llmstack/__main__.py +10 -0
  6. opencode_llmstack-0.6.0/llmstack/_platform.py +420 -0
  7. opencode_llmstack-0.6.0/llmstack/app.py +644 -0
  8. opencode_llmstack-0.6.0/llmstack/backends/__init__.py +19 -0
  9. opencode_llmstack-0.6.0/llmstack/backends/bedrock.py +790 -0
  10. opencode_llmstack-0.6.0/llmstack/check_models.py +119 -0
  11. opencode_llmstack-0.6.0/llmstack/cli.py +264 -0
  12. opencode_llmstack-0.6.0/llmstack/commands/__init__.py +10 -0
  13. opencode_llmstack-0.6.0/llmstack/commands/_helpers.py +91 -0
  14. opencode_llmstack-0.6.0/llmstack/commands/activate.py +71 -0
  15. opencode_llmstack-0.6.0/llmstack/commands/check.py +13 -0
  16. opencode_llmstack-0.6.0/llmstack/commands/download.py +27 -0
  17. opencode_llmstack-0.6.0/llmstack/commands/install.py +365 -0
  18. opencode_llmstack-0.6.0/llmstack/commands/install_llama_swap.py +36 -0
  19. opencode_llmstack-0.6.0/llmstack/commands/reload.py +59 -0
  20. opencode_llmstack-0.6.0/llmstack/commands/restart.py +12 -0
  21. opencode_llmstack-0.6.0/llmstack/commands/setup.py +146 -0
  22. opencode_llmstack-0.6.0/llmstack/commands/start.py +360 -0
  23. opencode_llmstack-0.6.0/llmstack/commands/status.py +260 -0
  24. opencode_llmstack-0.6.0/llmstack/commands/stop.py +73 -0
  25. opencode_llmstack-0.6.0/llmstack/download/__init__.py +21 -0
  26. opencode_llmstack-0.6.0/llmstack/download/binary.py +234 -0
  27. opencode_llmstack-0.6.0/llmstack/download/ggufs.py +164 -0
  28. opencode_llmstack-0.6.0/llmstack/generators/__init__.py +37 -0
  29. opencode_llmstack-0.6.0/llmstack/generators/llama_swap.py +421 -0
  30. opencode_llmstack-0.6.0/llmstack/generators/opencode.py +291 -0
  31. opencode_llmstack-0.6.0/llmstack/models.ini +304 -0
  32. opencode_llmstack-0.6.0/llmstack/paths.py +318 -0
  33. opencode_llmstack-0.6.0/llmstack/shell_env.py +927 -0
  34. opencode_llmstack-0.6.0/llmstack/tiers.py +394 -0
  35. opencode_llmstack-0.6.0/opencode_llmstack.egg-info/PKG-INFO +693 -0
  36. opencode_llmstack-0.6.0/opencode_llmstack.egg-info/SOURCES.txt +40 -0
  37. opencode_llmstack-0.6.0/opencode_llmstack.egg-info/dependency_links.txt +1 -0
  38. opencode_llmstack-0.6.0/opencode_llmstack.egg-info/entry_points.txt +2 -0
  39. opencode_llmstack-0.6.0/opencode_llmstack.egg-info/requires.txt +14 -0
  40. opencode_llmstack-0.6.0/opencode_llmstack.egg-info/top_level.txt +1 -0
  41. opencode_llmstack-0.6.0/pyproject.toml +67 -0
  42. opencode_llmstack-0.6.0/setup.cfg +4 -0
@@ -0,0 +1,693 @@
1
+ Metadata-Version: 2.4
2
+ Name: opencode-llmstack
3
+ Version: 0.6.0
4
+ Summary: Multi-tier local LLM stack: llama-swap + FastAPI auto-router + opencode wiring.
5
+ Author: llmstack
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/rohitgarg19/llmstack
8
+ Project-URL: Issues, https://github.com/rohitgarg19/llmstack/issues
9
+ Keywords: llm,llama-cpp,llama-swap,opencode,router,local-ai
10
+ Classifier: Development Status :: 4 - Beta
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Operating System :: MacOS :: MacOS X
14
+ Classifier: Operating System :: POSIX :: Linux
15
+ Classifier: Operating System :: Microsoft :: Windows
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Topic :: Software Development
21
+ Requires-Python: >=3.11
22
+ Description-Content-Type: text/markdown
23
+ Requires-Dist: fastapi<1.0,>=0.110
24
+ Requires-Dist: httpx<1.0,>=0.27
25
+ Requires-Dist: uvicorn[standard]<1.0,>=0.30
26
+ Requires-Dist: PyYAML<7.0,>=6.0
27
+ Requires-Dist: huggingface_hub<2.0,>=1.0
28
+ Requires-Dist: hf_transfer<1.0,>=0.1
29
+ Provides-Extra: dev
30
+ Requires-Dist: ruff>=0.4; extra == "dev"
31
+ Requires-Dist: pytest>=7; extra == "dev"
32
+ Provides-Extra: bedrock
33
+ Requires-Dist: boto3>=1.35; extra == "bedrock"
34
+ Requires-Dist: botocore>=1.35; extra == "bedrock"
35
+
36
+ # llmstack — multi-tier local LLM stack for Mac M4 Max / 64 GB
37
+
38
+ A Cursor-Auto / Claude-tier-style serving setup for local GGUF models, **role-aware**:
39
+ *coder models for agent work, chat models for planning, with an uncensored chat option for plans that need it.*
40
+
41
+ Each tier can be served by either a **local GGUF** (default) or a **hosted AWS
42
+ Bedrock model** — useful for the top-tier weights that don't fit on a laptop.
43
+ Both backends share the same `auto` router, so opencode/curl/Cursor never need
44
+ to know which one a tier resolves to.
45
+
46
+ Built on:
47
+
48
+ - [`llama.cpp`](https://github.com/ggml-org/llama.cpp) — inference engine (Metal backend)
49
+ - [`llama-swap`](https://github.com/mostlygeek/llama-swap) — multi-model process manager + OpenAI-compatible proxy
50
+ - a tiny FastAPI **router** that adds an `auto` model with intent-based routing in front of llama-swap (and AWS Bedrock)
51
+
52
+ ```
53
+ client (opencode / curl / Cursor / etc.)
54
+ │
55
+ ▼
56
+ http://127.0.0.1:10101 <-- FastAPI router (llmstack.app)
57
+ │ • model="auto" → classify → rewrite to one of 4 tiers
58
+ │ • everything else → pass-through
59
+ ▼
60
+ http://127.0.0.1:10102 <-- llama-swap (binary, manages model lifecycle)
61
+ │ • loads/unloads llama-server processes per model
62
+ │ • matrix solver allows {code-fast + one heavy model} co-resident
63
+ ▼
64
+ llama-server <code-fast | code-smart | plan | plan-uncensored>
65
+ │
66
+ ▼
67
+ GGUF in ~/.cache/huggingface/hub/...
68
+ ```
69
+
70
+ The whole thing is a pure Python package distributed via standard Python tooling
71
+ (`pip install opencode-llmstack`, or `pip install -e .` from this repo). Once installed
72
+ you get a single `llmstack` console-script.
73
+
74
+ ## Why this design
75
+
76
+ A 64 GB unified memory M4 Max can comfortably hold **one always-on tiny coder + one heavy model** simultaneously. We split heavy models by *role*:
77
+
78
+ - **Agent work** (multi-file edits, tool use, refactors) → coder models, which are trained on tool-call protocols and code edits.
79
+ - **Planning** (design discussions, architecture, "what's the best approach") → chat-tuned models, which are better at high-level reasoning and don't try to start writing code in response to every message.
80
+ - **Uncensored planning** is a separate plan-tier model, opted in either by request (`agent.plan-nofilter` in opencode) or by an inline `[nofilter]` trigger in the prompt.
81
+
82
+ Routing decisions cost ~zero — they're a few regex checks in the FastAPI router, not an LLM call.
83
+
84
+ ## Tier mapping
85
+
86
+ | Alias | Model | Quant | Weights | Context | Temp | Role |
87
+ |---|---|---|---|---|---|---|
88
+ | `code-fast` | Qwen2.5-Coder-3B-Instruct | Q5_K_M | ~2.5 GB | **128k** (YaRN ×4) | **0.2** | autocomplete, FIM, single-line edits, quick Q&A. **Always loaded.** |
89
+ | `code-smart` | Qwen3-Coder-Next 80B-A3B (MoE) | Q4_K_M *(→ UD-Q4_K_XL)* | ~45 GB | 64k | **0.5** | **agent mode**: multi-file edits, tool calls, refactors, debugging |
90
+ | `plan` | Qwopus GLM 18B Merged | Q4_K_M | ~9 GB | **64k** (2× native) | **0.7** | **plan mode**: design, architecture, trade-off discussions |
91
+ | `plan-uncensored` | Mistral-Small 3.2 24B Heretic (i1) | i1-Q4_K_M *(→ i1-Q6_K)* | ~13 GB | **128k** (native) | **0.85** | **plan mode, no filter**: when the topic requires it |
92
+
93
+ **Temperature ladder** (low → high = "doing" → "thinking"): code-fast 0.2 (deterministic) · code-smart 0.5 (balanced agent) · plan 0.7 (creative ideation) · plan-uncensored 0.85 (max exploration).
94
+ opencode `agent.<name>.temperature` is set to match — clients can still override per request.
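+
+ For example, pinning a per-request temperature while still letting the router
+ choose the tier looks like this from any OpenAI-compatible client (a minimal
+ httpx sketch; the 0.3 value is only an illustration):
+
+ ```python
+ import httpx
+
+ # Ask the auto-router (port 10101) to pick the tier, but override that tier's
+ # default temperature for this one request. Sampling fields supplied by the
+ # caller win over the per-tier defaults declared in models.ini.
+ resp = httpx.post(
+     "http://127.0.0.1:10101/v1/chat/completions",
+     json={
+         "model": "auto",
+         "temperature": 0.3,
+         "stream": False,
+         "messages": [{"role": "user", "content": "outline a caching strategy"}],
+     },
+     timeout=120,
+ )
+ print(resp.json()["model"])  # which tier the router resolved "auto" to
+ ```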
95
+
96
+ ## How `auto` decides
97
+
98
+ The router runs a **step-down fidelity ladder**: start at the top tier
99
+ for new / short conversations, drop down as the context grows. This
100
+ inverts the classic "escalate when input gets big" pattern, and it
101
+ matches how these models actually behave on this stack:
102
+
103
+ - **Top-tier hosted** (Claude Opus/Sonnet on Bedrock) — fastest *and*
104
+ most accurate on short prompts, but per-request latency and $cost
105
+ scale with input tokens, and long-context behaviour degrades faster
106
+ than headline benchmarks suggest.
107
+ - **`code-smart`** (Qwen3-Coder 80B) — 64k window. Sweet spot is the
108
+ middle of that range; saturates near the top.
109
+ - **`code-fast`** (Qwen2.5-Coder 3B + YaRN ×4) — **128k** window,
110
+ always-resident, free. Smaller models lean on explicit context rather
111
+ than priors, so they tend to *improve* relative to top-tier as the
112
+ conversation grows.
113
+
114
+ First match wins:
115
+
116
+ | # | Condition | → Model | Reason |
117
+ |---|---|---|---|
118
+ | 1 | last user msg contains `[nofilter]`, `[uncensored]`, `[heretic]`, or starts with `uncensored:` / `nofilter:` | `plan-uncensored` | explicit opt-in |
119
+ | 2 | `[ultra]` / `[opus]` / `ultra:` trigger AND `code-ultra` tier configured | `code-ultra` | explicit top-tier opt-in |
120
+ | 3 | plan verbs (*design, architect, approach, trade-off, should we, explain why, …*) AND no code blocks / agent verbs / tools | `plan` | pure design discussion (orthogonal track) |
121
+ | 4 | estimated input ≤ 8 000 tokens | `code-ultra` *(or `code-smart` if ultra unwired)* | top tier — context still being built, latency/$ are best here |
122
+ | 5 | estimated input ≤ 32 000 tokens | `code-smart` | mid-context, local heavy coder is at its sweet spot |
123
+ | 6 | otherwise (long context) AND (`tools[]` OR ≥ 6 turns) | `code-smart` | floor: 3B model tool-calls unreliably |
124
+ | 7 | otherwise (long context) | `code-fast` | 128k YaRN window + always-resident + free |
125
+
126
+ Token estimates are `chars / 4` over all message text + `prompt`. The
127
+ `code-ultra` rungs (2 and 4) are gated on availability: when no
128
+ `[code-ultra]` section is loaded from `models.ini`, both silently fall
129
+ back to `code-smart` so vanilla installs don't 404.
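+
+ A condensed sketch of that first-match ladder, using the same env knobs listed
+ under "Tuning the router" below (illustrative only; the real signal regexes and
+ names live in `llmstack/app.py`):
+
+ ```python
+ import os
+ import re
+
+ HIGH = int(os.environ.get("ROUTER_HIGH_FIDELITY_CEILING", "8000"))
+ MID = int(os.environ.get("ROUTER_MID_FIDELITY_CEILING", "32000"))
+ MULTI_TURN = int(os.environ.get("ROUTER_MULTI_TURN", "6"))
+
+ UNCENSORED = re.compile(r"\[(nofilter|uncensored|heretic)\]|^(uncensored|no-?filter):", re.I | re.M)
+ ULTRA = re.compile(r"\[(ultra|opus)\]|^ultra:", re.I | re.M)
+ PLAN = re.compile(r"\b(design|architect|approach|trade-?offs?|should we|explain why)\b", re.I)
+ AGENTY = re.compile(r"```|\b(refactor|edit|apply|implement)\b", re.I)
+
+ def pick_tier(messages, prompt="", tools=None, ultra_available=False):
+     text = prompt + " ".join(
+         m["content"] for m in messages if isinstance(m.get("content"), str)
+     )
+     last_user = next((m["content"] for m in reversed(messages) if m.get("role") == "user"), "")
+     est_tokens = len(text) // 4                      # chars / 4, as above
+     top = "code-ultra" if ultra_available else "code-smart"
+
+     if UNCENSORED.search(last_user):
+         return "plan-uncensored"                     # 1: explicit opt-in
+     if ultra_available and ULTRA.search(last_user):
+         return "code-ultra"                          # 2: explicit top-tier opt-in
+     if PLAN.search(last_user) and not AGENTY.search(last_user) and not tools:
+         return "plan"                                # 3: pure design discussion
+     if est_tokens <= HIGH:
+         return top                                   # 4: short context -> top tier
+     if est_tokens <= MID:
+         return "code-smart"                          # 5: mid context
+     if tools or len(messages) >= MULTI_TURN:
+         return "code-smart"                          # 6: long-context floor
+     return "code-fast"                               # 7: long context, free tier
+ ```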
130
+
131
+ ## opencode integration
132
+
133
+ `llmstack install` generates an opencode config at
134
+ `<work-dir>/.llmstack/opencode.json` (derived from `models.ini`), where
135
+ `<work-dir>` is whatever directory you ran `llmstack` from (or
136
+ `$LLMSTACK_WORK_DIR`). You can `cd` into any project and run
137
+ `llmstack install` to get a project-local config there. The script also
138
+ copies `AGENTS.md` next to the generated JSON, so the `.llmstack/` folder
139
+ is a self-contained opencode bundle. Your global
140
+ `~/.config/opencode/opencode.json` is **never modified** by this stack.
141
+
142
+ opencode picks up our config because `llmstack start` drops you into a
143
+ subshell with these env vars exported:
144
+
145
+ | Env var | Value |
146
+ |---|---|
147
+ | `OPENCODE_CONFIG` | `<work-dir>/.llmstack/opencode.json` (overrides global, sits below project configs) |
148
+ | `LLMSTACK_CHANNEL` | `current`, `next`, or `external` (thin client of an llmstack router, see below) |
149
+ | `LLMSTACK_ACTIVE` | `1` (used to refuse recursive entry) |
150
+ | `LLMSTACK_ROOT` | absolute path to the installed `llmstack` package |
151
+
152
+ The llama-swap and router daemons are singletons on ports 10101/10102.
153
+ The channel is **pinned at install time** in `.llmstack/default-channel`
154
+ and never auto-detected at runtime — one project on the host owns the
155
+ daemons (installed local), and any other project on the same host that
156
+ wants to consume them is installed `--external` (defaulting to
157
+ `http://127.0.0.1:10101`). This avoids the footgun where a "shared"
158
+ project's `stop` would tear down daemons it can't bring back up.
159
+
160
+ The shell's prompt is prefixed with `[llmstack:<channel>]` so you always
161
+ know whether you're in the env or not. Bash and zsh source your normal
162
+ rc first, then add the prefix; other shells just get the env vars.
163
+
164
+ Inside the subshell, run `opencode` and it will pick up the wiring
165
+ below. Outside the subshell (any other terminal), opencode keeps using
166
+ your global setup unchanged.
167
+
168
+ | opencode agent | Local model |
169
+ |---|---|
170
+ | **default `model`** | `llama.cpp/auto` (router-routed) |
171
+ | **`small_model`** (titles, tasks, tab autocomplete) | `llama.cpp/code-fast` |
172
+ | **`agent.build`** (default builder) | `llama.cpp/code-smart` |
173
+ | **`agent.plan`** (read-only planner) | `llama.cpp/plan` |
174
+ | **`agent.plan-nofilter`** (custom uncensored planner) | `llama.cpp/plan-uncensored` |
175
+
176
+ Inside opencode you can switch agents with `/agent` or by mentioning a custom
177
+ one (e.g. `@plan-nofilter`). Slash-commands `/review` and `/nofilter` are also available.
178
+
179
+ Want a second terminal into the same stack? Install the activate hook
180
+ once (`eval "$(llmstack activate zsh)"`) and any new shell that `cd`s
181
+ into the project picks up `OPENCODE_CONFIG` automatically. Want to run
182
+ opencode without the hook? `OPENCODE_CONFIG=$PWD/.llmstack/opencode.json opencode`
183
+ from any directory you previously ran `install` in.
184
+
185
+ ## Layout
186
+
187
+ ```
188
+ opencode/ # repo root
189
+ ├── pyproject.toml # package metadata + `llmstack` console script
190
+ ├── README.md # this file
191
+ ├── UPGRADING.md # how to swap any tier for a newer/better model
192
+ │ + how to upgrade the Python toolchain itself
193
+ ├── models.ini # SINGLE SOURCE OF TRUTH for tiers + sampler
194
+ └── llmstack/ # the python package (importable, installable)
195
+ ├── __init__.py
196
+ ├── __main__.py # `python -m llmstack`
197
+ ├── cli.py # arg dispatch (the `llmstack` console-script)
198
+ ├── paths.py # state / bin / work dir resolution + env overrides
199
+ ├── shell_env.py # spawn the env-prepared subshell + activate hooks
200
+ ├── app.py # FastAPI auto-router
201
+ ├── tiers.py # parse models.ini -> Tier dataclasses
202
+ ├── check_models.py # snapshot tool (HF metadata + drift check)
203
+ ├── AGENTS.md # opencode agent template (shipped as package data)
204
+ ├── generators/
205
+ │ ├── llama_swap.py # render llama-swap.yaml from models.ini
206
+ │ └── opencode.py # render opencode.json from models.ini
207
+ ├── download/
208
+ │ ├── ggufs.py # background GGUF downloader
209
+ │ └── binary.py # llama-swap release downloader
210
+ └── commands/ # one module per CLI action
211
+ ├── setup.py # first-time walkthrough
212
+ ├── install.py # generate opencode.json (+ AGENTS.md copy)
213
+ ├── install_llama_swap.py
214
+ ├── download.py
215
+ ├── start.py
216
+ ├── reload.py
217
+ ├── stop.py
218
+ ├── restart.py
219
+ ├── status.py
220
+ ├── check.py
221
+ └── activate.py
222
+ ```
223
+
224
+ Per-project state (gitignored) is created lazily under `<work-dir>/.llmstack/`:
225
+
226
+ ```
227
+ .llmstack/
228
+ ├── opencode.json consumed via OPENCODE_CONFIG (written by `install`)
229
+ ├── AGENTS.md copy of the package template (written by `install`)
230
+ ├── llama-swap.yaml generated runtime config (written by `start`)
231
+ ├── default-channel pinned by `llmstack install`
232
+ ├── active-channel written by `llmstack start`, removed by `stop`
233
+ ├── llama-swap.pid daemon pid files
234
+ ├── router.pid
235
+ ├── llmstack.bashrc prompt-prefix rcfile (bash)
236
+ ├── zdotdir/ prompt-prefix rcfile (zsh)
237
+ └── logs/
238
+ ├── llama-swap.log
239
+ ├── router.log
240
+ └── dl-*.log
241
+ ```
242
+
243
+ The `llama-swap` binary lives outside any project at
244
+ `$XDG_DATA_HOME/llmstack/bin/llama-swap` (override with
245
+ `LLMSTACK_BIN_DIR`). One download is reused across all projects.
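+
+ A sketch of roughly how that resolution might look (the real logic lives in
+ `llmstack/paths.py`; the helper name here is made up):
+
+ ```python
+ import os
+ import sys
+ from pathlib import Path
+
+ def llama_swap_binary() -> Path:
+     """Resolve the shared llama-swap binary location."""
+     if override := os.environ.get("LLMSTACK_BIN_DIR"):
+         bin_dir = Path(override)
+     elif sys.platform == "win32":
+         # %LOCALAPPDATA%\llmstack\bin on Windows (see the Windows notes below)
+         bin_dir = Path(os.environ["LOCALAPPDATA"]) / "llmstack" / "bin"
+     else:
+         xdg = os.environ.get("XDG_DATA_HOME", str(Path.home() / ".local" / "share"))
+         bin_dir = Path(xdg) / "llmstack" / "bin"
+     name = "llama-swap.exe" if sys.platform == "win32" else "llama-swap"
+     return bin_dir / name
+ ```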
246
+
247
+ ## Quick start
248
+
249
+ Everything runs through one entry point: `llmstack <action>`.
250
+ Run `llmstack help` to see all actions and options.
251
+
252
+ ```bash
253
+ # 0. Install the package (editable, from this repo).
254
+ python3 -m venv .venv
255
+ .venv/bin/pip install -e .
256
+
257
+ # 1. (Recommended) raise GPU-wired memory to fit code-fast + code-smart together.
258
+ sudo sysctl iogpu.wired_limit_mb=57344
259
+
260
+ # 2. Full setup: download GGUFs, wait, install the llama-swap binary, print
261
+ # the activation hook, check opencode is on PATH. Stepwise & idempotent;
262
+ # re-running it later is safe.
263
+ llmstack setup
264
+
265
+ # 3. Generate this project's .llmstack/opencode.json (+ AGENTS.md copy).
266
+ # `install` does NOT touch llama-swap.yaml -- that's regenerated
267
+ # fresh by `start` for the channel you're booting into.
268
+ llmstack install
269
+
270
+ # 4. Generate .llmstack/llama-swap.yaml for the chosen channel, bring up
271
+ # llama-swap + router. With the activate hook installed (see below),
272
+ # your prompt is already wired to .llmstack/opencode.json -- just run
273
+ # `opencode`. Without the hook, `start` falls back to spawning a
274
+ # subshell with OPENCODE_CONFIG set, prefixed with [llmstack:current].
275
+ # Daemons keep running when you exit; stop them with `llmstack stop`.
276
+ llmstack start
277
+
278
+ # 4a. Daemons only (no fallback subshell, return immediately).
279
+ llmstack start --detach
280
+
281
+ # 4b. Want auto-activation in any new terminal you cd into? Install once:
282
+ eval "$(llmstack activate zsh)"
283
+ # add the same line to ~/.zshrc to make it stick.
284
+
285
+ # 5. Sanity check (works from any terminal)
286
+ llmstack status
287
+ curl -s http://127.0.0.1:10101/v1/models | jq '.data[].id'
288
+ curl -s http://127.0.0.1:10101/models.ini | head # what thin clients see
289
+ ```
290
+
291
+ To stop everything: `llmstack stop`.
292
+
293
+ ### Windows
294
+
295
+ The CLI runs the same way on Windows (PowerShell or `cmd.exe`); the only
296
+ moving parts that differ are the binary asset and the activation hook.
297
+
298
+ ```powershell
299
+ # 0. Install the package (editable, from this repo).
300
+ py -3 -m venv .venv
301
+ .venv\Scripts\pip install -e .
302
+
303
+ # 1. Pull GGUFs + the windows_amd64 llama-swap binary (lives under
304
+ # %LOCALAPPDATA%\llmstack\bin\llama-swap.exe).
305
+ .venv\Scripts\llmstack setup
306
+
307
+ # 2. Generate this project's .llmstack\opencode.json (+ AGENTS.md copy).
308
+ .venv\Scripts\llmstack install
309
+
310
+ # 3. Generate .llmstack\llama-swap.yaml for the chosen channel, bring up
311
+ # the stack. If you've installed the activate hook (step 4) the
312
+ # current shell is already wired to .llmstack\opencode.json; otherwise
313
+ # `start` falls back to spawning a PowerShell subshell.
314
+ .venv\Scripts\llmstack start
315
+
316
+ # 4. Auto-activate per project from any new PowerShell window:
317
+ Invoke-Expression (& llmstack activate powershell | Out-String)
318
+ # or persist (writes ~/.powershell_llmstack_hook + sources it on every shell):
319
+ "Invoke-Expression (& llmstack activate powershell | Out-String)" | Add-Content $PROFILE
320
+ ```
321
+
322
+ Notes:
323
+
324
+ - Only `windows_amd64` llama-swap binaries are published upstream; arm64
325
+ Windows is not supported. GPU acceleration uses whatever backend
326
+ `llama-server` was built with (CUDA / Vulkan / CPU) -- get
327
+ `llama-server.exe` from the [llama.cpp Windows releases](https://github.com/ggml-org/llama.cpp/releases)
328
+ or a package like `winget install ggml.llama-cpp` and put it on
329
+ `PATH` (or set `$env:LLAMA_SERVER_BIN`). The Mac-only
330
+ `iogpu.wired_limit_mb` step does not apply.
331
+ - The `[llmstack:<channel>]` prompt prefix shows up in PowerShell too;
332
+ `cmd.exe` gets a simpler `[llmstack:<channel>]` prompt via `doskey`.
333
+ - Stopping daemons uses `taskkill /T /F` under the hood, so the
334
+ llama-server children get cleaned up as well.
335
+
336
+ ### Thin-client mode (`--external`)
337
+
338
+ `llmstack install --external [URL]` wires this project as a thin client
339
+ of an llmstack router — no llama-swap, no router, no GGUFs needed
340
+ locally, and **no local `models.ini`**. The thin-client install:
341
+
342
+ 1. Fetches `GET URL/models.ini` live from the router (this also
343
+ doubles as the health check — a 200 with valid INI proves the
344
+ router is up).
345
+ 2. Renders `opencode.json` against the fetched content so tier names
346
+ + descriptions agree with what the router actually serves.
347
+ 3. Pins `.llmstack/default-channel = "external <url>"` so subsequent
348
+ commands know they're in client mode.
349
+
350
+ There is no client-side cache: every `install` re-fetches. To pick up
351
+ a tier edit on the router, just re-run `llmstack install` here.
352
+
353
+ URL precedence at install time: `--external <url>` arg > `$LLMSTACK_REMOTE_URL`
354
+ env var > the local router (`http://127.0.0.1:10101`). You normally
355
+ don't set the env var yourself — the activate hook does it for you
356
+ when you `cd` into an external-installed project (see below).
357
+
358
+ Two flavours of the same mode:
359
+
360
+ **Same host, two projects.** One project owns the daemons (local
361
+ install), the others are thin clients of localhost. Zero config:
362
+
363
+ ```bash
364
+ # project A — owns the daemons
365
+ cd ~/projA && llmstack install && llmstack start
366
+
367
+ # project B — consumes them
368
+ cd ~/projB && llmstack install --external
369
+ # baseURL = http://127.0.0.1:10101/v1
370
+ # default-channel = "external http://127.0.0.1:10101"
371
+ # (no local models.ini -- fetched from project A's router)
372
+ llmstack start # verifies /models.ini, drops into the client subshell
373
+ ```
374
+
375
+ **Different host.** Point at a beefy desktop's router from a laptop:
376
+
377
+ ```bash
378
+ # laptop -> desktop running llmstack on 10.0.0.5
379
+ llmstack install --external http://10.0.0.5:10101
380
+ llmstack start # verifies http://10.0.0.5:10101/models.ini
381
+ opencode # talks straight to the remote router
382
+ ```
383
+
384
+ (`LLMSTACK_REMOTE_URL=http://10.0.0.5:10101 llmstack install` also
385
+ works — the env var is honoured as an alternative way in.)
386
+
387
+ The URL is persisted into the channel marker, so any new terminal you
388
+ open with the activate hook installed (`eval "$(llmstack activate zsh)"`)
389
+ will re-export `LLMSTACK_REMOTE_URL` automatically when you `cd` into
390
+ the project. The prompt is medium-purple with the URL:
391
+ `[llmstack:<project> http://10.0.0.5:10101]`. From inside that
392
+ activated shell, `llmstack install` re-fetches `models.ini` without
393
+ needing the flag or URL again.
394
+
395
+ Commands that manage local resources (`setup`, `download`,
396
+ `install-llama-swap`) refuse to run when the project is installed `--external`.
397
+ `stop` is a no-op (nothing local to tear down) — to stop the daemons
398
+ themselves, run `llmstack stop` from the project that owns them (the
399
+ one installed local).
400
+
401
+ ### Auto-activate per project
402
+
403
+ `llmstack activate <shell>` writes the hook to
404
+ `~/.<shell>_llmstack_hook` and prints a `source` line to stdout, so a
405
+ single `eval` both regenerates the file and turns the hook on in your
406
+ current shell. Pasting the same `eval` into your rc keeps it on for
407
+ every new shell:
408
+
409
+ ```bash
410
+ # ~/.zshrc (zsh)
411
+ eval "$(llmstack activate zsh)"
412
+
413
+ # or ~/.bashrc (bash)
414
+ eval "$(llmstack activate bash)"
415
+ ```
416
+
417
+ With the hook installed, `cd` into any project that has a `.llmstack/`
418
+ and your shell is wired up automatically — `OPENCODE_CONFIG`,
419
+ `LLMSTACK_WORK_DIR`, `LLMSTACK_CHANNEL` (and `LLMSTACK_REMOTE_URL` for
420
+ projects installed `--external`) all toggle on/off as you walk in and
421
+ out. There is no separate `llmstack shell` command — this is the shell
422
+ command.
423
+
424
+ ### Common partial flows
425
+
426
+ ```bash
427
+ llmstack install # opencode.json + AGENTS.md (no GGUF downloads)
428
+ llmstack install-llama-swap --force # re-pull llama-swap binary only
429
+ llmstack setup --skip-download # full setup minus the GGUF pull
430
+ llmstack setup --skip-wait # kick off downloads in background, install now
431
+ llmstack check # snapshot configured GGUFs + flag drift
432
+ llmstack start --next # try queued hf_file_next upgrades (reversible)
433
+ llmstack restart --next # cycle into the next channel
434
+ ```
435
+
436
+ ### Try each routing path
437
+
438
+ All of these go to `/v1/chat/completions` on `:10101`. Each should pick a different upstream model:
439
+
440
+ ```bash
441
+ # trivial chat -> code-fast
442
+ curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
443
+ -d '{"model":"auto","stream":false,
444
+ "messages":[{"role":"user","content":"capital of France?"}]}' | jq .model
445
+
446
+ # planning -> plan
447
+ curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
448
+ -d '{"model":"auto","stream":false,
449
+ "messages":[{"role":"user","content":"how would you design a rate limiter for our API?"}]}' | jq .model
450
+
451
+ # agent work -> code-smart
452
+ curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
453
+ -d '{"model":"auto","stream":false,
454
+ "messages":[{"role":"user","content":"refactor this function for clarity:\n```python\ndef f(x): return x*2\n```"}]}' | jq .model
455
+
456
+ # uncensored plan -> plan-uncensored
457
+ curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
458
+ -d '{"model":"auto","stream":false,
459
+ "messages":[{"role":"user","content":"[nofilter] outline a red-team plan for our auth flow"}]}' | jq .model
460
+ ```
461
+
462
+ ## Endpoints
463
+
464
+ | Port | Service | Purpose |
465
+ |---|---|---|
466
+ | 10101 | router (FastAPI) | What clients hit. OpenAI-compatible. Adds `auto` model. |
467
+ | 10102 | llama-swap | Lifecycle manager. Useful UI at `http://127.0.0.1:10102/ui/`. |
468
+ | 10001+ | llama-server children | Internal, allocated dynamically per model. |
469
+
470
+ The router exposes:
471
+
472
+ - `GET /models.ini` ← raw config text (used by `install --external` and as the health check)
473
+ - `GET /v1/models` ← injects `auto` then proxies the rest
474
+ - `POST /v1/chat/completions` ← classify if `model=="auto"`, then proxy
475
+ - `POST /v1/completions` ← same
476
+ - `*` ← pass-through reverse proxy
477
+
478
+ There is no `/health` route on the router — `GET /models.ini`
479
+ returning a 200 + valid INI is the canonical "router is up and
480
+ configured" signal. (Hitting `/health` still works for legacy curl
481
+ users, but it's just the catch-all proxying through to llama-swap's
482
+ own `/health` endpoint.)
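+
+ That probe is easy to script; for example (httpx + configparser, `router_ready`
+ is just an illustrative name):
+
+ ```python
+ import configparser
+
+ import httpx
+
+ def router_ready(base: str = "http://127.0.0.1:10101") -> bool:
+     """True when GET /models.ini returns 200 and parses as INI with sections."""
+     try:
+         r = httpx.get(f"{base}/models.ini", timeout=5)
+         r.raise_for_status()
+         cfg = configparser.ConfigParser()
+         cfg.read_string(r.text)
+         return bool(cfg.sections())
+     except (httpx.HTTPError, configparser.Error):
+         return False
+
+ if __name__ == "__main__":
+     print("router up:", router_ready())
+ ```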
483
+
484
+ ## Memory math (M4 Max / 64 GB)
485
+
486
+ macOS caps GPU-wired memory at ~48 GB (75 % of RAM) by default. To unlock more for the GPU:
487
+
488
+ ```bash
489
+ sudo sysctl iogpu.wired_limit_mb=57344 # 56 GB to GPU; survives until reboot
490
+ ```
491
+
492
+ Resident with our defaults (KV q8_0, full configured context):
493
+
494
+ | Combo | Weights | + KV | Total | Status |
495
+ |---|---|---|---|---|
496
+ | `code-fast` + `code-smart` (Q4_K_M) | 47.5 GB | ~5 GB | ~53 GB | needs `wired_limit` bump |
497
+ | `code-fast` + `code-smart` (UD-Q4_K_XL) | ~52 GB | ~5 GB | ~57 GB | needs `wired_limit` bump |
498
+ | `code-fast` + `plan` | 11.5 GB | ~4.5 GB | ~16 GB | trivial |
499
+ | `code-fast` + `plan-uncensored` | 15.5 GB | ~12.5 GB | ~28 GB | trivial |
500
+ | `code-fast` + `plan` + `plan-uncensored` | ~25 GB | ~14.5 GB | ~40 GB | both chats together |
501
+ | `code-smart` + `plan-uncensored` | 58 GB | … | ❌ | matrix forbids |
502
+
503
+ KV cache only fills up as context grows — these are *worst-case* numbers at the configured max context. Typical usage will be far less.
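+
+ Those totals are just the per-tier weights plus a combined worst-case KV figure;
+ a tiny script to redo the arithmetic for your own combos (numbers copied from
+ the tables above):
+
+ ```python
+ WIRED_LIMIT_GB = 57344 / 1024          # the sysctl bump above (56 GB)
+
+ weights_gb = {"code-fast": 2.5, "code-smart": 45.0, "plan": 9.0, "plan-uncensored": 13.0}
+ kv_worst_gb = {                        # combined KV at full configured context
+     ("code-fast", "code-smart"): 5.0,
+     ("code-fast", "plan"): 4.5,
+     ("code-fast", "plan-uncensored"): 12.5,
+     ("code-fast", "plan", "plan-uncensored"): 14.5,
+ }
+
+ for combo, kv in kv_worst_gb.items():
+     total = sum(weights_gb[m] for m in combo) + kv
+     verdict = "fits" if total <= WIRED_LIMIT_GB else "over the wired limit"
+     print(f"{' + '.join(combo)}: ~{total:.1f} GB ({verdict})")
+ ```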
504
+
505
+ The matrix declares which combinations are valid. When you ask for a model that isn't currently loadable, the solver picks the cheapest set to swap into.
506
+
507
+ ## Upgrading quants after downloads finish
508
+
509
+ All three pre-queued upgrades are same-model, higher-quant — drop-in replacements with no behaviour change beyond quality.
510
+
511
+ Logs are named `dl-<tier>-<label>.log` where `<label>` is `current` (file
512
+ in `models.ini` `hf_file`) or `next` (file in `models.ini` `hf_file_next`).
513
+
514
+ | When this log shows `EOF` (download done) | …edit `llama-swap.yaml` `-hff` line in this tier | …to |
515
+ |---|---|---|
516
+ | `logs/dl-code-smart-next.log` | `code-smart` | `Qwen3-Coder-Next-UD-Q4_K_XL.gguf` |
517
+ | `logs/dl-plan-next.log` | `plan` | `Qwopus-GLM-18B-Healed-Q6_K.gguf` |
518
+ | `logs/dl-plan-uncensored-next.log` | `plan-uncensored` | `Mistral-Small-3.2-24B-Instruct-2506-ultra-uncensored-heretic.i1-Q6_K.gguf` |
519
+
520
+ The `-hf <repo>` lines stay the same; only the `-hff <filename>` line changes.
521
+ After editing, also flip `hf_file` ↔ `hf_file_next` in `models.ini` so
522
+ `llmstack check` no longer reports `DRIFT!`.
523
+
524
+ Then `llmstack restart`.
525
+
526
+ For changing to a *different* model entirely (different family/provider) see [UPGRADING.md](UPGRADING.md).
527
+
528
+ ## Tuning the router
529
+
530
+ All knobs are env vars; defaults are picked up by `llmstack start`.
531
+
532
+ | Env var | Default | Meaning |
533
+ |---|---|---|
534
+ | `LLAMA_SWAP_URL` | `http://127.0.0.1:10102` | upstream llama-swap |
535
+ | `ROUTER_FAST_MODEL` | `code-fast` | long-context (above the mid ceiling) → here |
536
+ | `ROUTER_AGENT_MODEL` | `code-smart` | mid-context + tools/loop floor → here |
537
+ | `ROUTER_ULTRA_MODEL` | `code-ultra` | short-context top tier → here (gated on availability) |
538
+ | `ROUTER_PLAN_MODEL` | `plan` | design/discussion verbs → here |
539
+ | `ROUTER_UNCENSORED_MODEL` | `plan-uncensored` | `[nofilter]` triggers → here |
540
+ | `ROUTER_HIGH_FIDELITY_CEILING` | `8000` | tokens; at or below this, route to top tier (ultra → smart fallback) |
541
+ | `ROUTER_MID_FIDELITY_CEILING` | `32000` | tokens; at or below this, route to `code-smart`; beyond, step down to `code-fast` |
542
+ | `ROUTER_MULTI_TURN` | `6` | turn count that floors the long-context rung at `code-smart` |
543
+ | `ROUTER_HOST` / `ROUTER_PORT` | `127.0.0.1` / `10101` | listen address |
544
+ | `LOG_LEVEL` | `info` | router log level |
545
+
546
+ To force a request to never auto-route, set `model` to a concrete alias (`code-fast`, `code-smart`, `plan`, `plan-uncensored`, or any of their listed aliases like `agent`, `glm`, `nofilter`, …).
547
+
548
+ ## Triggering uncensored mode
549
+
550
+ Two ways:
551
+
552
+ 1. **Explicit agent in opencode:** `/agent plan-nofilter` (or mention it).
553
+ 2. **Inline trigger in any auto-routed message** — anywhere in the most recent user turn:
554
+ - `[nofilter]`, `[uncensored]`, `[heretic]`
555
+ - or a line starting with `uncensored:` / `nofilter:` / `no-filter:`
556
+
557
+ Triggers are *only* checked on the latest user message and the system prompt, so an old `[nofilter]` further up the conversation won't pin the whole session.
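+
+ A sketch of that scoping rule (assumed helper name; the actual trigger patterns
+ are defined in `llmstack/app.py`):
+
+ ```python
+ import re
+
+ TRIGGERS = re.compile(
+     r"\[(nofilter|uncensored|heretic)\]"      # bracketed tag anywhere in the text
+     r"|^\s*(uncensored|no-?filter):",         # or a line starting with the prefix form
+     re.IGNORECASE | re.MULTILINE,
+ )
+
+ def wants_uncensored(messages: list[dict]) -> bool:
+     # Only the most recent user turn and the system prompt are scanned, so a
+     # stale [nofilter] earlier in the conversation never pins the session.
+     last_user = next((m for m in reversed(messages) if m.get("role") == "user"), None)
+     system = next((m for m in messages if m.get("role") == "system"), None)
+     scope = [m["content"] for m in (last_user, system)
+              if m and isinstance(m.get("content"), str)]
+     return any(TRIGGERS.search(text) for text in scope)
+ ```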
558
+
559
+ ## Troubleshooting
560
+
561
+ **`llama-swap` won't start** → check `.llmstack/logs/llama-swap.log`. Most common causes: port 10102 already in use, or a typo in `llama-swap.yaml`.
562
+
563
+ **First request hangs for ~60 s** → that's the model loading from disk into Metal memory. `sendLoadingState: true` will surface "loading…" in the SSE stream. After it's loaded, subsequent requests are instant.
564
+
565
+ **OOM / unexplained slowdown** → run `top -o mem -stats pid,rsize,command` to see what's resident. The matrix should prevent two heavy models loading together; if it somehow happens, `llmstack restart`.
566
+
567
+ **Auto picks the wrong model** → adjust the regex in `llmstack/app.py` (`AGENT_SIGNALS` / `PLAN_SIGNALS` / `UNCENSORED_TRIGGERS`) or move the ladder ceilings via `ROUTER_HIGH_FIDELITY_CEILING` / `ROUTER_MID_FIDELITY_CEILING`. To force a request to never auto-route, pass an explicit `model` (e.g. `code-smart`) instead of `auto`.
568
+
569
+ **Want a pure pass-through (no auto routing)** → change opencode's `baseURL` to `http://127.0.0.1:10102/v1` (llama-swap directly) and only use concrete model names. (Note: this skips the bedrock dispatcher; only GGUF tiers will be reachable.)
570
+
571
+ ## Hosted tiers via AWS Bedrock
572
+
573
+ Any tier in `models.ini` that declares `aws_model_id = ...` is served from
574
+ AWS Bedrock instead of llama-swap. The same tier names + auto-routing apply,
575
+ so swapping `code-smart` from a local GGUF to Claude on Bedrock is a
576
+ `models.ini` edit + `llmstack install` + `llmstack restart` away — clients
577
+ don't change.
578
+
579
+ ```ini
580
+ [code-smart]
581
+ role = agent
582
+ aws_model_id = anthropic.claude-sonnet-4-5-20250929-v1:0
583
+ aws_region = us-west-2
584
+ aws_profile = bedrock-prod ; named profile in ~/.aws/config
585
+ ctx_size = 200000
586
+ sampler = temp=0.5 ; Sonnet 4.5 accepts ONE of temp / top_p
587
+ description = Claude Sonnet 4.5 on Bedrock - heavy coder for agent loops
588
+ ```
589
+
590
+ > **Sampler is per-tier, declared in `models.ini`, applied per backend.**
591
+ > `opencode.json` is intentionally sampler-free in both cases — clients
592
+ > just specify a model. How the sampler reaches the actual inference
593
+ > engine depends on the backend:
594
+ >
595
+ > - **gguf tiers** — the llama-swap generator bakes each tier's
596
+ > `sampler = …` keys into its `llama-server` startup command line as
597
+ > `--temp` / `--top-p` / `--top-k` / `--min-p` / `--repeat-penalty`
598
+ > flags. llama-server applies them as its defaults for every request.
599
+ > The router doesn't touch the body.
600
+ > - **Bedrock tiers** — Bedrock has no server-side defaults mechanism,
601
+ > so the router injects the sampler keys into each outbound request
602
+ > body (mapping `temp` → `temperature`, `top_p` → `topP`; the other
603
+ > llama.cpp-extension keys `top_k`/`min_p`/`rep_pen` are silently
604
+ > dropped because Converse doesn't accept them). Caller-supplied
605
+ > values in the request body still win for per-call overrides.
606
+ >
607
+ > Per-Bedrock-family rules (declare only what your Bedrock model
608
+ > accepts):
609
+ >
610
+ > | Bedrock model family | What `sampler` may contain |
611
+ > |---|---|
612
+ > | Claude Opus 4.7+ | (omit `sampler =` entirely — Opus 4.7 rejects all sampler params) |
613
+ > | Claude Sonnet 4.5 / Haiku 4.5 | `temp` **or** `top_p`, never both |
614
+ > | Claude Opus 4.x (4.1, 4.5, 4.6) | `temp` and/or `top_p` |
615
+ > | Llama / Titan / Cohere / etc. | `temp` and/or `top_p` (check the model card) |
616
+ >
617
+ > Local gguf tiers accept the full set (`temp`, `top_p`, `top_k`,
618
+ > `min_p`, `rep_pen`) — llama-server honours all of them as startup
619
+ > defaults.
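+
+ As a sketch of the Bedrock half of that split (an illustrative helper, not the
+ package's actual code in `llmstack/backends/bedrock.py`):
+
+ ```python
+ def to_inference_config(sampler: dict, request_body: dict) -> dict:
+     """Map models.ini sampler keys onto Bedrock Converse inferenceConfig.
+
+     temp -> temperature, top_p -> topP; llama.cpp-only keys (top_k, min_p,
+     rep_pen) are dropped because Converse has no equivalent. Anything the
+     caller put in the request body wins over the tier default.
+     """
+     cfg = {}
+     if "temp" in sampler:
+         cfg["temperature"] = float(sampler["temp"])
+     if "top_p" in sampler:
+         cfg["topP"] = float(sampler["top_p"])
+     # caller-supplied OpenAI-style fields override the tier defaults
+     if "temperature" in request_body:
+         cfg["temperature"] = request_body["temperature"]
+     if "top_p" in request_body:
+         cfg["topP"] = request_body["top_p"]
+     if "max_tokens" in request_body:
+         cfg["maxTokens"] = request_body["max_tokens"]
+     return cfg
+ ```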
620
+
621
+ `models.ini` is meant to be committable, so it **only names a profile**.
622
+ Credentials, SSO, role chaining, MFA — everything boto3 normally
623
+ handles — live in the standard AWS shared config:
624
+
625
+ ```bash
626
+ aws configure --profile bedrock-prod # static keys
627
+ aws configure sso --profile bedrock-prod # SSO
628
+
629
+ # role chaining: edit ~/.aws/config, add a profile like
630
+ # [profile bedrock-planning]
631
+ # role_arn = arn:aws:iam::123456789012:role/llmstack-bedrock
632
+ # source_profile = bedrock-prod
633
+ # region = us-east-1
634
+ ```
635
+
636
+ Then reference the profile by name from each tier. Different tiers can
637
+ point at different profiles, so two tiers can live in different
638
+ accounts/regions cleanly:
639
+
640
+ | Key (in `models.ini`) | Meaning |
641
+ |---|---|
642
+ | `aws_model_id` | Bedrock model ID (`anthropic.claude-...`, `meta.llama3-1-...`, etc.). Required. |
643
+ | `aws_region` | Region the tier lives in. Falls back to the profile's region / `AWS_REGION` / default chain. |
644
+ | `aws_profile` | Named profile in `~/.aws/config` / `~/.aws/credentials`. Omit for boto3's default chain (env vars, default profile, instance role). |
645
+ | `aws_endpoint_url` | Custom Bedrock endpoint (VPC endpoint, FedRAMP, etc.). |
646
+ | `aws_model_id_next` (+ optional `aws_region_next`) | Queued upgrade target. Mirrors gguf `hf_file_next`: `llmstack start --next` swaps the tier to this model id (and region, if set) until you switch back; permanent promotion is `aws_model_id` edit + `llmstack install`. |
647
+ | `backend = bedrock` | Optional explicit override; auto-detected from `aws_model_id`. |
648
+
649
+ Banned in `models.ini` (parse-time error): `aws_access_key_id`,
650
+ `aws_secret_access_key`, `aws_session_token`, `aws_role_arn`,
651
+ `aws_role_session_name`. Put them in `~/.aws/credentials` or
652
+ `~/.aws/config` under a named profile and reference the profile.
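+
+ A sketch of how a tier section might be validated at parse time (configparser;
+ names are illustrative, the real parsing lives in `llmstack/tiers.py`):
+
+ ```python
+ import configparser
+
+ BANNED_AWS_KEYS = {
+     "aws_access_key_id", "aws_secret_access_key", "aws_session_token",
+     "aws_role_arn", "aws_role_session_name",
+ }
+
+ def load_tier(path: str, name: str) -> dict:
+     cfg = configparser.ConfigParser(inline_comment_prefixes=(";", "#"))
+     cfg.read(path)
+     tier = dict(cfg[name])
+
+     # credentials and role ARNs never belong in a committable models.ini
+     banned = BANNED_AWS_KEYS & tier.keys()
+     if banned:
+         raise ValueError(f"[{name}] must not contain {sorted(banned)}; use a named AWS profile")
+
+     # backend is auto-detected: a tier naming a Bedrock model id is hosted
+     tier.setdefault("backend", "bedrock" if "aws_model_id" in tier else "gguf")
+     return tier
+ ```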
653
+
654
+ Internally the router builds one `bedrock-runtime` client per
655
+ distinct (profile, region, endpoint) tuple, cached for the life of the
656
+ process. Credential refresh (SSO token rotation, role re-assumption,
657
+ IMDS) is handled by boto3 transparently.
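+
+ Roughly (a boto3 sketch, not the package's actual code):
+
+ ```python
+ import boto3
+
+ _clients: dict = {}
+
+ def bedrock_client(profile=None, region=None, endpoint_url=None):
+     """One bedrock-runtime client per (profile, region, endpoint) tuple.
+
+     boto3 sessions handle credential refresh (SSO rotation, role re-assumption,
+     IMDS) behind the scenes, so cached clients stay valid for the process.
+     """
+     key = (profile, region, endpoint_url)
+     if key not in _clients:
+         session = boto3.Session(profile_name=profile, region_name=region)
+         _clients[key] = session.client("bedrock-runtime", endpoint_url=endpoint_url)
+     return _clients[key]
+ ```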
658
+
659
+ Install the AWS SDK (it's an opt-in extra so the local-only path stays
660
+ small):
661
+
662
+ ```bash
663
+ pip install -e '.[bedrock]'
664
+ ```
665
+
666
+ The router translates OpenAI chat/completions to [Bedrock Converse]
667
+ (text + tool calls; streaming and non-streaming both supported) and
668
+ streams the response back as standard OpenAI SSE. Multimodal inputs are
669
+ text-only for now.
670
+
671
+ [Bedrock Converse]: https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_Converse.html
672
+
673
+ Hosted tiers are skipped by `llmstack download` (nothing to fetch) and by
674
+ the `llama-swap.yaml` matrix (nothing to load). They show up in
675
+ `llmstack check` with the model id + region (and a `next` row when
676
+ `aws_model_id_next` is set) instead of HF metadata, and in `/v1/models`
677
+ alongside the local GGUF tiers — including a `channel: current|next`
678
+ metadata field so clients can tell which model id they're actually
679
+ talking to.
680
+
681
+ `llmstack start --next` flips both backends in lock-step: gguf tiers
682
+ swap to `hf_file_next` and bedrock tiers swap to `aws_model_id_next`
683
+ (the router subprocess is launched with `LLMSTACK_USE_NEXT=1`). Either
684
+ backend having a queued upgrade is enough to satisfy `--next`.
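+
+ Conceptually the per-tier switch is just a field choice driven by that env var
+ (a sketch; field names follow the `models.ini` keys documented above):
+
+ ```python
+ import os
+
+ def resolve_tier_target(tier: dict) -> dict:
+     """Pick current vs next fields for one tier, mirroring `llmstack start --next`."""
+     use_next = os.environ.get("LLMSTACK_USE_NEXT") == "1"
+     if "aws_model_id" in tier:                       # hosted Bedrock tier
+         next_id = tier.get("aws_model_id_next") if use_next else None
+         next_region = tier.get("aws_region_next") if use_next else None
+         return {
+             "backend": "bedrock",
+             "model_id": next_id or tier["aws_model_id"],
+             "region": next_region or tier.get("aws_region"),
+             "channel": "next" if next_id else "current",
+         }
+     next_file = tier.get("hf_file_next") if use_next else None   # local gguf tier
+     return {
+         "backend": "gguf",
+         "hf_file": next_file or tier["hf_file"],
+         "channel": "next" if next_file else "current",
+     }
+ ```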
685
+
686
+ **`logs/dl-*.log` is multi-GB and growing** → you're hitting [llama.cpp issue #14802](https://github.com/ggml-org/llama.cpp/issues/14802) where modern `llama-cli` is chat-only and ignores `-no-cnv`, looping `> ` prompts forever (~1.5 MB/s). Fix: `llmstack download` already prefers `llama-completion` over `llama-cli` when both are present (`brew install llama.cpp` ships both as of 2025). If you only have legacy `llama-cli`, either upgrade `llama.cpp` or kill the runaways with `pkill -9 -f llama-cli`.
687
+
688
+ ## Replacing a model with a newer/better one
689
+
690
+ See **[UPGRADING.md](UPGRADING.md)** — covers why models must be GGUF, where to
691
+ find candidates, how to evaluate "better" per tier, the safe upgrade workflow,
692
+ and a worked example. Run `llmstack check` for a snapshot of what's
693
+ currently configured along with HF URLs to compare against.