@houtini/lm 2.9.0 → 2.11.0

package/README.md CHANGED
@@ -12,11 +12,11 @@

  > **Quick Navigation**
  >
- > [How it works](#how-it-works) | [Quick start](#quick-start) | [What gets offloaded](#what-gets-offloaded) | [Tools](#tools) | [Model routing](#model-routing) | [Configuration](#configuration) | [Compatible endpoints](#compatible-endpoints)
+ > [How it works](#how-it-works) | [Quick start](#quick-start) | [What gets offloaded](#what-gets-offloaded) | [Tools](#tools) | [Performance tracking](#performance-tracking) | [Structured JSON output](#structured-json-output) | [Model routing](#model-routing) | [Self-test (shakedown)](#self-test-shakedown) | [Configuration](#configuration) | [Compatible endpoints](#compatible-endpoints) | [Developer guide](./DEVELOPER.md)

  I built this because I kept leaving Claude Code running overnight on big refactors and the token bill was painful. A huge chunk of that spend goes on bounded tasks any decent model handles fine - generating boilerplate, code review, commit messages, format conversion. Stuff that doesn't need Claude's reasoning or tool access.

- Houtini LM connects Claude Code to a local LLM on your network - or any OpenAI-compatible API. Claude keeps doing the hard work - architecture, planning, multi-file changes - and offloads the grunt work to whatever cheaper model you've got running. Free. No rate limits. Private.
+ Houtini LM connects Claude Code to a local LLM on your network - or any OpenAI-compatible API. Claude keeps doing the hard work - architecture, planning, multi-file changes - and offloads the grunt work to whatever cheaper model you've got running. No Claude quota burn. No rate limits. Private. The trade is wall-clock time: local inference is typically 3-30× slower than frontier models, so delegation wins on bounded, self-contained tasks rather than everything.

  I wrote a [full walkthrough of why I built this and how I use it day to day](https://houtini.com/how-to-cut-your-claude-code-bill-with-houtini-lm/).

@@ -144,19 +144,34 @@ The tool descriptions are written to nudge Claude into planning delegation at th

  ## Performance tracking

- Every response includes a footer with real performance data - computed from the SSE stream, not from any proprietary API:
+ Every response includes a footer with real performance data computed from the SSE stream, not from any proprietary API:

  ```
- Model: zai-org/glm-4.7-flash | 125->430 tokens | TTFT: 678ms, 48.7 tok/s, 12.5s
- Session: 8,450 tokens offloaded across 14 calls
+ ---
+ Model: nvidia/nemotron-3-nano | 279→303 tokens (12 reasoning / 291 visible) | TTFT: 485ms, 58.0 tok/s, 5.2s
+ 📊 First measured call on nvidia/nemotron-3-nano: 58.0 tok/s, 485ms to first token — use this to gauge whether to delegate longer tasks.
+ 💰 Claude quota saved — this session: 4,283 tokens / 7 calls · lifetime: 147,432 tokens / 213 calls
  ```

- The `discover` tool shows per-model averages across the session:
+ The 📊 line only appears on the first measured call per model per session; it's a real benchmark from a genuine task, not a synthetic warmup. The 💰 line updates every call.
+
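
The TTFT and tok/s figures come from timing the stream itself. A minimal sketch of that kind of measurement, assuming an OpenAI-compatible endpoint that honours `stream: true` with `stream_options.include_usage` (illustrative only, not the package's actual implementation):

```typescript
// Sketch: measure TTFT and tok/s from an OpenAI-compatible SSE stream.
async function timedCompletion(baseUrl: string, request: object) {
  const start = performance.now();
  let firstTokenAt: number | null = null;
  let completionTokens = 0;

  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ...request, stream: true, stream_options: { include_usage: true } }),
  });

  const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep any partial SSE line for the next read
    for (const line of lines) {
      if (!line.startsWith("data: ") || line.trim() === "data: [DONE]") continue;
      const chunk = JSON.parse(line.slice(6));
      if (firstTokenAt === null && chunk.choices?.[0]?.delta?.content) {
        firstTokenAt = performance.now(); // first visible token
      }
      if (chunk.usage?.completion_tokens) completionTokens = chunk.usage.completion_tokens;
    }
  }

  const end = performance.now();
  const ttftMs = firstTokenAt === null ? null : firstTokenAt - start;
  const genSeconds = (end - (firstTokenAt ?? start)) / 1000;
  return { ttftMs, tokPerSec: completionTokens / Math.max(genSeconds, 0.001), completionTokens };
}
```
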
+ When the active model returns `completion_tokens_details.reasoning_tokens` (DeepSeek R1, LM Studio with "Separate reasoning_content" enabled, OpenAI reasoning models), the token block splits into `reasoning / visible` so you can see when a thinking model is burning its output budget on hidden reasoning.
+
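
When that field is present, the split is simple arithmetic. A sketch, assuming an OpenAI-style `usage` object (the helper name is illustrative):

```typescript
// Sketch: split completion tokens into hidden reasoning vs visible output.
interface Usage {
  completion_tokens: number;
  completion_tokens_details?: { reasoning_tokens?: number };
}

function splitCompletion(usage: Usage): { reasoning: number; visible: number } {
  const reasoning = usage.completion_tokens_details?.reasoning_tokens ?? 0;
  return { reasoning, visible: usage.completion_tokens - reasoning };
}

// e.g. 303 completion tokens with 12 hidden reasoning tokens
// -> { reasoning: 12, visible: 291 }, matching the footer above
```
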
+ ### Lifetime persistence
+
+ Per-model performance and token counts persist across Claude Desktop restarts in `~/.houtini-lm/model-cache.db`. This means:
+
+ - From call 1 of a new session, `discover` shows **historical** tok/s and TTFT for the loaded model — not "not yet benchmarked".
+ - The 💰 counter shows both session and lifetime totals.
+ - The `code_task_files` pre-flight estimator uses measured per-model prefill rate to refuse obviously-too-large inputs with a clear diagnostic, instead of letting them silently hang against the MCP client timeout.
+
+ The data is workstation-specific — that's intentional. Routing decisions should reflect your actual hardware, not a synthetic benchmark.
+
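
For a sense of what persistence like this involves, here is a minimal sketch using `better-sqlite3`; the table name, columns, and rollup are hypothetical, not the package's actual `model-cache.db` schema:

```typescript
// Sketch only: per-model speed persisted across restarts (schema is illustrative).
import Database from "better-sqlite3";
import { mkdirSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

const dir = join(homedir(), ".houtini-lm");
mkdirSync(dir, { recursive: true });
const db = new Database(join(dir, "model-cache.db"));

db.exec(`CREATE TABLE IF NOT EXISTS model_perf (
  model TEXT PRIMARY KEY,
  calls INTEGER NOT NULL DEFAULT 0,
  total_ttft_ms REAL NOT NULL DEFAULT 0,
  total_tok_s REAL NOT NULL DEFAULT 0,
  last_used TEXT
)`);

export function recordCall(model: string, ttftMs: number, tokPerSec: number): void {
  db.prepare(
    `INSERT INTO model_perf (model, calls, total_ttft_ms, total_tok_s, last_used)
     VALUES (?, 1, ?, ?, date('now'))
     ON CONFLICT(model) DO UPDATE SET
       calls = calls + 1,
       total_ttft_ms = total_ttft_ms + excluded.total_ttft_ms,
       total_tok_s = total_tok_s + excluded.total_tok_s,
       last_used = excluded.last_used`,
  ).run(model, ttftMs, tokPerSec);
}

export function lifetimeAverages(model: string) {
  // Returns undefined for a model that has never been measured ("not yet benchmarked").
  return db.prepare(
    `SELECT calls,
            total_ttft_ms / calls AS avgTtftMs,
            total_tok_s / calls AS avgTokPerSec,
            last_used AS lastUsed
     FROM model_perf WHERE model = ?`,
  ).get(model);
}
```
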
+ The `discover` tool shows per-model averages across both scopes:

  ```
- Performance (this session):
-   nvidia/nemotron-3-nano: 6 calls, avg TTFT 234ms, avg 45.2 tok/s
-   zai-org/glm-4.7-flash: 8 calls, avg TTFT 678ms, avg 48.7 tok/s
+ Measured speed (session): 58.0 tok/s · TTFT 485ms (1 call)
+ Measured speed (lifetime on this workstation): 46.9 tok/s · TTFT 2641ms (214 calls, last used 2026-04-20)
  ```

  In practice, Claude delegates more aggressively the longer a session runs. After about 5,000 offloaded tokens, it starts hunting for more work to push over. Reinforcing loop.
@@ -180,7 +195,7 @@ The workhorse. Send a task, get an answer. The description includes planning tri
  | `message` | yes | - | The task. Be specific about output format. |
  | `system` | no | - | Persona - "Senior TypeScript dev" not "helpful assistant" |
  | `temperature` | no | 0.3 | 0.1 for code, 0.3 for analysis, 0.7 for creative |
- | `max_tokens` | no | 2048 | Lower for quick answers, higher for generation |
+ | `max_tokens` | no | *auto* | Defaults to 25% of the loaded model's context window (fallback 16,384). Pass a number to cap it. |
  | `json_schema` | no | - | Force structured JSON output conforming to a schema |
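
The *auto* default amounts to a simple rule: a quarter of the loaded model's context window, or a fixed fallback when the window is unknown. A sketch of that rule (illustrative, not the package's code):

```typescript
// Sketch: resolve max_tokens when the caller doesn't pass one.
function resolveMaxTokens(requested?: number, contextWindow?: number): number {
  if (requested && requested > 0) return requested;       // explicit cap wins
  if (contextWindow && contextWindow > 0) return Math.floor(contextWindow * 0.25);
  return 16_384;                                           // fallback when the window is unknown
}

// e.g. a 131,072-token context window gives 32,768 output tokens by default
resolveMaxTokens(undefined, 131_072);
```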
 
  ### `custom_prompt`
@@ -193,7 +208,7 @@ Three-part prompt: system, context, instruction. Keeping them separate prevents
  | `system` | no | - | Persona + constraints, under 30 words |
  | `context` | no | - | Complete data to analyse. Never truncate. |
  | `temperature` | no | 0.3 | 0.1 for review, 0.3 for analysis |
- | `max_tokens` | no | 2048 | Match to expected output length |
+ | `max_tokens` | no | *auto* | Defaults to 25% of the loaded model's context window (fallback 16,384). |
  | `json_schema` | no | - | Force structured JSON output |

  ### `code_task`
@@ -205,7 +220,20 @@ Built for code analysis. Pre-configured system prompt with temperature and outpu
  | `code` | yes | - | Complete source code. Never truncate. |
  | `task` | yes | - | "Find bugs", "Explain this", "Write tests" |
  | `language` | no | - | "typescript", "python", "rust", etc. |
- | `max_tokens` | no | 2048 | Match to expected output length |
+ | `max_tokens` | no | *auto* | Defaults to 25% of the loaded model's context window (fallback 16,384). |
+
+ ### `code_task_files`
+
+ Like `code_task`, but the local LLM reads files directly from disk — source never passes through the MCP client's context window. Use this when reviewing multiple related files, or a single large file that's awkward to paste. Files are read in parallel with `Promise.allSettled`, so one unreadable file doesn't sink the call; failures are surfaced inline with the reason.
+
+ Includes a **pre-flight prefill estimator**: if measured per-model data from the SQLite cache shows the input would exceed the MCP client's ~60s request timeout during prompt processing, the call is refused early with a concrete diagnostic (estimated prefill seconds, tokens, and sample count) instead of letting it silently hang. First-time callers are never refused — the estimator only fires after ≥2 measured samples.
+
+ | Parameter | Required | Default | What it does |
+ |-----------|----------|---------|-------------|
+ | `paths` | yes | - | Array of absolute file paths. Relative paths are rejected. |
+ | `task` | yes | - | "Find bugs across these files", "Audit this module" |
+ | `language` | no | - | "typescript", "python", "rust", etc. |
+ | `max_tokens` | no | *auto* | Defaults to 25% of the loaded model's context window (fallback 16,384). |
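
A minimal sketch of the pre-flight check described above, assuming a coarse characters-per-token estimate and at least two measured prefill samples from the cache (names and thresholds are illustrative, not the actual implementation):

```typescript
// Sketch: refuse inputs whose prompt processing would exceed the MCP client's
// ~60s request timeout, based on measured per-model prefill speed.
interface PrefillStats { samples: number; tokensPerSecond: number }

const MCP_TIMEOUT_SECONDS = 60;
const CHARS_PER_TOKEN = 4; // coarse estimate for code and prose

function preflight(totalChars: number, stats: PrefillStats | null): { ok: boolean; reason?: string } {
  // Never refuse a first-time caller: require at least two measured samples.
  if (!stats || stats.samples < 2) return { ok: true };

  const estTokens = Math.ceil(totalChars / CHARS_PER_TOKEN);
  const estSeconds = estTokens / stats.tokensPerSecond;
  if (estSeconds <= MCP_TIMEOUT_SECONDS) return { ok: true };

  return {
    ok: false,
    reason:
      `Estimated prefill ${estSeconds.toFixed(1)}s for ~${estTokens} tokens ` +
      `(measured ${stats.tokensPerSecond.toFixed(1)} tok/s over ${stats.samples} samples) ` +
      `exceeds the ~${MCP_TIMEOUT_SECONDS}s MCP request timeout. Split the input into smaller batches.`,
  };
}
```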
 
  ### `embed`

@@ -218,12 +246,43 @@ Generate text embeddings via the OpenAI-compatible `/v1/embeddings` endpoint. Re

  ### `discover`

- Health check. Returns model name, context window, latency, capability profile, and cumulative session stats including per-model performance averages. Call before delegating if you're not sure the LLM's available.
+ Health check and speed readout. Returns model name, context window, capability profile, connection latency (labelled explicitly — this is the `/v1/models` fetch round-trip, *not* inference speed), and the active model's measured tok/s and TTFT averaged over the session. Before any real call has run, measured speed shows as "not yet benchmarked — will be captured on the first real call" rather than inventing a number from a synthetic probe. Call before delegating if you're not sure the LLM's available, or when deciding whether a longer task is worth offloading.

  ### `list_models`

  Lists everything on the LLM server - loaded and downloaded - with full metadata: architecture, quantisation, context window, capabilities, and HuggingFace enrichment data. Shows capability profiles describing what each model is best at, so Claude can make informed delegation decisions.

+ ### `stats`
+
+ Compact markdown dump of your offload stats — session and lifetime totals, per-model performance history, reasoning-token overhead — without the model catalog that `discover` prints. Cheap to call repeatedly to watch the 💰 counter climb.
+
+ | Parameter | Required | Default | What it does |
+ |-----------|----------|---------|-------------|
+ | `model` | no | - | Filter output to a single model ID. Omit for all models ever used on this workstation. |
+
+ Example output:
+
+ ```
+ ## Houtini LM stats
+ **Endpoint**: http://hopper:1234 (LM Studio)
+ **First call on this workstation**: 2026-04-14
+
+ ### Totals
+ | Scope | Calls | Prompt tokens | Completion tokens | Total tokens |
+ | Session | 7 | 3,100 | 1,183 | 4,283 |
+ | Lifetime | 213 | — | — | 147,432 |
+
+ ### Per-model performance
+ | Model | Scope | Calls | Avg TTFT | Avg tok/s | Prompt tokens | Last used |
+ | nvidia/nemotron-3-nano | session | 7 | 485 | 58.0 | — | — |
+ | nvidia/nemotron-3-nano | lifetime | 213 | 2641 | 46.9 | 89,320 | 2026-04-20 |
+
+ ### Reasoning-token overhead (lifetime)
+ 124 / 47,183 completion tokens spent on hidden reasoning (0.3%). Low — reasoning is effectively suppressed.
+ ```
+
+ The reasoning-token overhead line is the canary for "is `reasoning_effort` actually being honoured on this model and this backend?" — above ~30% is a signal to investigate.
+
  ## Structured JSON output

  Both `chat` and `custom_prompt` accept a `json_schema` parameter that forces the response to conform to a JSON Schema. LM Studio uses grammar-based sampling to guarantee valid output - no hoping the model remembers to close its brackets.
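
As an illustration of the shape, a `chat` call with a schema constraint might look like this; the schema and exact argument layout are examples, not a verbatim capture of the tool's interface:

```typescript
// Sketch: chat arguments with a json_schema constraint (the schema is an example).
const chatArgs = {
  message: "List the runtime dependencies in this package.json with a one-line purpose for each.",
  temperature: 0.1,
  json_schema: {
    type: "object",
    properties: {
      dependencies: {
        type: "array",
        items: {
          type: "object",
          properties: {
            name: { type: "string" },
            purpose: { type: "string" },
          },
          required: ["name", "purpose"],
        },
      },
    },
    required: ["dependencies"],
  },
};
```
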
@@ -270,6 +329,35 @@ Qwen, Llama, Nemotron, GLM - they score brilliantly on coding benchmarks now. Th

  **One call at a time.** As of v2.8.0, houtini-lm enforces this automatically with a request semaphore. Parallel calls queue up and run one at a time, so each gets the full timeout budget instead of stacking.
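
A width-one semaphore of this kind is small. A sketch of the queueing behaviour (illustrative, not the package's actual implementation; `callLocalModel` is a hypothetical helper):

```typescript
// Sketch: width-1 semaphore. Parallel calls queue and run strictly one at a time,
// so each gets the full timeout budget instead of sharing it.
class Semaphore {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if a task rejects, so queued calls still run.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

declare function callLocalModel(task: string): Promise<string>; // hypothetical helper

const inference = new Semaphore();
// Both calls are accepted immediately; the second starts only after the first settles.
void inference.run(() => callLocalModel("review fileA.ts"));
void inference.run(() => callLocalModel("review fileB.ts"));
```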

+ ## Self-test (shakedown)
+
+ The canonical way to verify an install and get an honest read on what the loaded model can do on your hardware:
+
+ ```bash
+ npm run shakedown
+ ```
+
+ This runs [`shakedown.mjs`](./shakedown.mjs) — an end-to-end test that exercises all seven tools (`discover` → `list_models` → `chat` → `custom_prompt` → `code_task` → `code_task_files` → `embed`) and prints a summary table with real TTFT, tok/s, token counts, and reasoning-token split for each call. Takes under a minute on a decent rig.
+
+ Sample output tail:
+
+ ```
+ Summary
+
+ 7/7 steps passed on LM Studio, model=nvidia/nemotron-3-nano
+
+ | Tool | OK | TTFT (ms) | tok/s | Tokens in→out | Reasoning | Notes
+ | chat | ✅ | 891 | 36.9 | 48→104 | — | answered
+ | custom_prompt | ✅ | 872 | 43.9 | 170→333 | — | 5 valid items
+ | code_task | ✅ | 857 | 41.6 | 180→189 | — | tests generated
+ | code_task_files | ✅ | 11028 | 39.5 | 6891→3000 | — | cross-referenced
+ | embed | ✅ | — | — | — | — | 768-dim vector
+
+ Tokens offloaded: 10,915 (prompt: 7,289, completion: 3,626, reasoning: 0)
+ ```
+
+ Want a human-readable quality review rather than just latency numbers? Paste [SHAKEDOWN.md](./SHAKEDOWN.md) into a Claude session that has houtini-lm attached — Claude will drive the seven steps and write you a report on output quality as well as performance.
+
  ## Think-block handling

  Some models emit `<think>...</think>` reasoning blocks before the actual answer. Houtini-lm handles this in two ways:
@@ -285,9 +373,9 @@ The quality footer flags `think-blocks-stripped` when stripping occurred, so you
  Every response includes structured quality signals in the footer so Claude (or any orchestrator) can make informed trust decisions:

  ```
- Model: qwen3-coder-30b-a3b | 413→81 tokens | TTFT: 2355ms, 15.0 tok/s, 5.4s
- Quality: think-blocks-stripped, tokens-estimated
- Session: 494 tokens offloaded across 1 call
+ ---
+ Model: qwen3-coder-30b-a3b | 413→81 tokens | TTFT: 2355ms, 15.0 tok/s, 5.4s | Quality: think-blocks-stripped, tokens-estimated
+ 💰 Claude quota saved this session: 494 tokens across 1 offloaded call
  ```

  Flags include: `TRUNCATED` (partial result), `think-blocks-stripped`, `tokens-estimated` (usage data was missing, estimated from content length), `hit-max-tokens`. When no flags fire, the quality line is omitted — clean output, nothing to report.
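
A sketch of the stripping step behind the `think-blocks-stripped` flag, assuming reasoning is delimited by literal `<think>` tags (illustrative only; the real pipeline may differ):

```typescript
// Sketch: strip <think>...</think> reasoning blocks and record that it happened.
function stripThinkBlocks(text: string): { text: string; flags: string[] } {
  const flags: string[] = [];
  const stripped = text.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
  if (stripped !== text.trim()) flags.push("think-blocks-stripped");
  return { text: stripped, flags };
}
```
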
@@ -344,7 +432,7 @@ Works with anything that speaks the OpenAI `/v1/chat/completions` API:

  ## Streaming and timeouts

- All inference uses Server-Sent Events streaming. Tokens arrive incrementally. As of v2.8.0, houtini-lm sends MCP progress notifications on every streamed chunk, which resets the SDK's 60-second client timeout. This means generation can run as long as the model needs there's no hard ceiling as long as tokens keep flowing.
+ All inference uses Server-Sent Events streaming. Tokens arrive incrementally. Since v2.9.0, houtini-lm sends MCP progress notifications on every streamed chunk — including during the thinking phase for reasoning models — which resets the SDK's 60-second client timeout. A 5-minute soft timeout acts as a safety net so a genuinely wedged connection can't hold a tool call open indefinitely; as long as tokens keep flowing, the per-chunk progress keeps the client side alive up to that ceiling.

  If the connection stalls (no new tokens for an extended period), you get a partial result instead of a timeout error. The footer shows `TRUNCATED` when this happens, and the quality metadata flags it so Claude knows to treat the output with appropriate caution.
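
A sketch of that stall handling: consume the stream, signal progress on each chunk, and return whatever has accumulated as a partial result if nothing arrives within the stall window (the callback and window value here are illustrative, not the package's internals):

```typescript
// Sketch: collect streamed text, report progress per chunk, and fall back to a
// partial result flagged TRUNCATED if no new tokens arrive within the stall window.
async function collectWithStallGuard(
  chunks: AsyncIterable<string>,
  onProgress: (charsSoFar: number) => void, // e.g. forwards an MCP progress notification
  stallMs = 60_000,
): Promise<{ text: string; truncated: boolean }> {
  let text = "";
  const iterator = chunks[Symbol.asyncIterator]();
  while (true) {
    const stalled = new Promise<"stalled">((resolve) => setTimeout(() => resolve("stalled"), stallMs));
    const next = await Promise.race([iterator.next(), stalled]); // timer cleanup omitted for brevity
    if (next === "stalled") return { text, truncated: true };    // partial result, flag TRUNCATED
    if (next.done) return { text, truncated: false };
    text += next.value;
    onProgress(text.length);
  }
}
```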

@@ -368,8 +456,11 @@ git clone https://github.com/houtini-ai/lm.git
  cd lm
  npm install
  npm run build
+ npm run shakedown # end-to-end self-test + benchmark
  ```

+ See [DEVELOPER.md](./DEVELOPER.md) for architecture, internals, the reasoning-model pipeline, backend detection, the SQLite performance cache, and instructions for adding new tools or backends.
+
  ## Licence

  Apache-2.0