@houtini/lm 2.10.0 → 2.11.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +86 -10
- package/dist/index.js +567 -87
- package/dist/index.js.map +1 -1
- package/dist/model-cache.d.ts +90 -0
- package/dist/model-cache.js +318 -11
- package/dist/model-cache.js.map +1 -1
- package/package.json +2 -1
- package/server.json +2 -2
package/README.md
CHANGED
@@ -12,7 +12,7 @@
 
 > **Quick Navigation**
 >
-> [How it works](#how-it-works) | [Quick start](#quick-start) | [What gets offloaded](#what-gets-offloaded) | [Tools](#tools) | [Model routing](#model-routing) | [Configuration](#configuration) | [Compatible endpoints](#compatible-endpoints)
+> [How it works](#how-it-works) | [Quick start](#quick-start) | [What gets offloaded](#what-gets-offloaded) | [Tools](#tools) | [Performance tracking](#performance-tracking) | [Structured JSON output](#structured-json-output) | [Model routing](#model-routing) | [Self-test (shakedown)](#self-test-shakedown) | [Configuration](#configuration) | [Compatible endpoints](#compatible-endpoints) | [Developer guide](./DEVELOPER.md)
 
 I built this because I kept leaving Claude Code running overnight on big refactors and the token bill was painful. A huge chunk of that spend goes on bounded tasks any decent model handles fine - generating boilerplate, code review, commit messages, format conversion. Stuff that doesn't need Claude's reasoning or tool access.
 
@@ -144,23 +144,34 @@ The tool descriptions are written to nudge Claude into planning delegation at th
 
 ## Performance tracking
 
-Every response includes a footer with real performance data
+Every response includes a footer with real performance data - computed from the SSE stream, not from any proprietary API:
 
 ```
 ---
-Model:
-🚀 First measured call on
-💰 Claude quota saved this session:
+Model: nvidia/nemotron-3-nano | 279→303 tokens (12 reasoning / 291 visible) | TTFT: 485ms, 58.0 tok/s, 5.2s
+🚀 First measured call on nvidia/nemotron-3-nano: 58.0 tok/s, 485ms to first token - use this to gauge whether to delegate longer tasks.
+💰 Claude quota saved - this session: 4,283 tokens / 7 calls · lifetime: 147,432 tokens / 213 calls
 ```
 
 The 🚀 line only appears on the first measured call per model per session - it's a real benchmark from a genuine task, not a synthetic warmup. The 💰 line updates every call.
 
-
+When the active model returns `completion_tokens_details.reasoning_tokens` (DeepSeek R1, LM Studio with "Separate reasoning_content" enabled, OpenAI reasoning models), the token block splits into `reasoning / visible` so you can see when a thinking model is burning its output budget on hidden reasoning.
+
+### Lifetime persistence
+
+Per-model performance and token counts persist across Claude Desktop restarts in `~/.houtini-lm/model-cache.db`. This means:
+
+- From call 1 of a new session, `discover` shows **historical** tok/s and TTFT for the loaded model - not "not yet benchmarked".
+- The 💰 counter shows both session and lifetime totals.
+- The `code_task_files` pre-flight estimator uses measured per-model prefill rate to refuse obviously-too-large inputs with a clear diagnostic, instead of letting them silently hang against the MCP client timeout.
+
+The data is workstation-specific - that's intentional. Routing decisions should reflect your actual hardware, not a synthetic benchmark.
+
+The `discover` tool shows per-model averages across both scopes:
 
 ```
-
-
-zai-org/glm-4.7-flash: 8 calls, avg TTFT 678ms, avg 48.7 tok/s
+Measured speed (session): 58.0 tok/s · TTFT 485ms (1 call)
+Measured speed (lifetime on this workstation): 46.9 tok/s · TTFT 2641ms (214 calls, last used 2026-04-20)
 ```
 
 In practice, Claude delegates more aggressively the longer a session runs. After about 5,000 offloaded tokens, it starts hunting for more work to push over. Reinforcing loop.
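An aside on the mechanism: the footer numbers are described as computed from the SSE stream itself, and the arithmetic is simple enough to sketch. The TypeScript below is illustrative only - `StreamEvent` and `measureStream` are invented names, not houtini-lm's internals - assuming each parsed SSE chunk carries an arrival timestamp and a token count:

```typescript
// Hypothetical sketch of deriving TTFT and tok/s from streamed chunks.
// Not the package's real code; names and shapes are invented.

interface StreamEvent {
  atMs: number;       // wall-clock arrival time of this SSE chunk
  tokenCount: number; // completion tokens carried by this chunk
}

interface PerfSample {
  ttftMs: number;       // request start -> first token
  tokensPerSec: number; // completion tokens over generation time
  completionTokens: number;
}

function measureStream(startMs: number, events: StreamEvent[]): PerfSample {
  const first = events[0];
  const last = events[events.length - 1];
  const completionTokens = events.reduce((n, e) => n + e.tokenCount, 0);
  // Speed is measured from first token to last token, so a slow prefill
  // inflates TTFT but does not dilute the tok/s figure.
  const genMs = Math.max(last.atMs - first.atMs, 1);
  return {
    ttftMs: first.atMs - startMs,
    tokensPerSec: (completionTokens * 1000) / genMs,
    completionTokens,
  };
}

// Example: chunks arriving 485ms, 985ms and 1485ms after the request started.
const sample = measureStream(1000, [
  { atMs: 1485, tokenCount: 1 },
  { atMs: 1985, tokenCount: 29 },
  { atMs: 2485, tokenCount: 28 },
]);
console.log(sample.ttftMs);       // 485
console.log(sample.tokensPerSec); // 58
```

In practice exact token counts come from the server's `usage` reporting rather than counting chunks; the sketch only shows where the two latency numbers come from.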
@@ -215,9 +226,11 @@ Built for code analysis. Pre-configured system prompt with temperature and outpu
 
 Like `code_task`, but the local LLM reads files directly from disk - source never passes through the MCP client's context window. Use this when reviewing multiple related files, or a single large file that's awkward to paste. Files are read in parallel with `Promise.allSettled`, so one unreadable file doesn't sink the call; failures are surfaced inline with the reason.
 
+Includes a **pre-flight prefill estimator**: if measured per-model data from the SQLite cache shows the input would exceed the MCP client's ~60s request-timeout during prompt processing, the call is refused early with a concrete diagnostic (estimated prefill seconds, tokens, and sample-count) instead of letting it silently hang. First-time callers are never refused - the estimator only fires after ≥2 measured samples.
+
 | Parameter | Required | Default | What it does |
 |-----------|----------|---------|-------------|
-| `
+| `paths` | yes | - | Array of absolute file paths. Relative paths are rejected. |
 | `task` | yes | - | "Find bugs across these files", "Audit this module" |
 | `language` | no | - | "typescript", "python", "rust", etc. |
 | `max_tokens` | no | *auto* | Defaults to 25% of the loaded model's context window (fallback 16,384). |
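The refusal logic this hunk describes (estimate prefill time from the measured rate, never fire before two samples, stay inside a ~60s budget) can be sketched roughly as follows. Names, the exact threshold, and the example rates are invented for illustration; only the behaviour follows the README text:

```typescript
// Hypothetical sketch of the pre-flight prefill estimator described above.
// Not the shipped implementation; names and numbers are invented.

interface PrefillStats {
  samples: number;            // measured calls recorded for this model
  promptTokensPerSec: number; // measured prefill rate on this workstation
}

interface PreflightResult {
  ok: boolean;
  reason?: string;
}

const CLIENT_TIMEOUT_S = 60; // approximate MCP client request timeout

function preflight(inputTokens: number, stats: PrefillStats): PreflightResult {
  // Never refuse a first-time caller: require at least two real samples.
  if (stats.samples < 2) return { ok: true };
  const estimatedS = inputTokens / stats.promptTokensPerSec;
  if (estimatedS > CLIENT_TIMEOUT_S) {
    return {
      ok: false,
      reason:
        `estimated prefill ${estimatedS.toFixed(1)}s for ${inputTokens} tokens ` +
        `exceeds ~${CLIENT_TIMEOUT_S}s client timeout (${stats.samples} samples)`,
    };
  }
  return { ok: true };
}

console.log(preflight(500000, { samples: 1, promptTokensPerSec: 900 }).ok); // true - too few samples to judge
console.log(preflight(90000, { samples: 8, promptTokensPerSec: 900 }).ok);  // false - ~100s estimated prefill
```

Failing loudly with the estimate in the diagnostic is the point: a silent hang against the client timeout looks identical to a crashed server.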
@@ -239,6 +252,37 @@ Health check and speed readout. Returns model name, context window, capability p
 
 Lists everything on the LLM server - loaded and downloaded - with full metadata: architecture, quantisation, context window, capabilities, and HuggingFace enrichment data. Shows capability profiles describing what each model is best at, so Claude can make informed delegation decisions.
 
+### `stats`
+
+Compact markdown dump of your offload stats - session and lifetime totals, per-model performance history, reasoning-token overhead - without the model catalog that `discover` prints. Cheap to call repeatedly to watch the 💰 counter climb.
+
+| Parameter | Required | Default | What it does |
+|-----------|----------|---------|-------------|
+| `model` | no | - | Filter output to a single model ID. Omit for all models ever used on this workstation. |
+
+Example output:
+
+```
+## Houtini LM stats
+**Endpoint**: http://hopper:1234 (LM Studio)
+**First call on this workstation**: 2026-04-14
+
+### Totals
+| Scope | Calls | Prompt tokens | Completion tokens | Total tokens |
+| Session | 7 | 3,100 | 1,183 | 4,283 |
+| Lifetime | 213 | – | – | 147,432 |
+
+### Per-model performance
+| Model | Scope | Calls | Avg TTFT | Avg tok/s | Prompt tokens | Last used |
+| nvidia/nemotron-3-nano | session | 7 | 485 | 58.0 | – | – |
+| nvidia/nemotron-3-nano | lifetime | 213 | 2641 | 46.9 | 89,320 | 2026-04-20 |
+
+### Reasoning-token overhead (lifetime)
+124 / 47,183 completion tokens spent on hidden reasoning (0.3%). Low - reasoning is effectively suppressed.
+```
+
+The reasoning-token overhead line is the canary for "is `reasoning_effort` actually being honoured on this model and this backend?" - above ~30% is a signal to investigate.
+
 ## Structured JSON output
 
 Both `chat` and `custom_prompt` accept a `json_schema` parameter that forces the response to conform to a JSON Schema. LM Studio uses grammar-based sampling to guarantee valid output - no hoping the model remembers to close its brackets.
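To make the `json_schema` guarantee concrete, here is a rough sketch of the kind of schema you might pass and the always-parseable result. Both the schema and the response string are invented examples, not captured tool output:

```typescript
// Invented example of a schema for a code-review response. With grammar-based
// sampling the server can only emit text matching this schema, so JSON.parse
// on the reply cannot fail.

const reviewSchema = {
  type: "object",
  properties: {
    severity: { type: "string", enum: ["low", "medium", "high"] },
    findings: { type: "array", items: { type: "string" } },
  },
  required: ["severity", "findings"],
} as const;

console.log(reviewSchema.required.join(",")); // severity,findings

// A response constrained by the schema above (invented, for illustration):
const raw = '{"severity":"low","findings":["unused import in utils.ts"]}';
const review = JSON.parse(raw) as {
  severity: "low" | "medium" | "high";
  findings: string[];
};
console.log(review.severity);        // low
console.log(review.findings.length); // 1
```

The practical win is that downstream code can cast and index into the parsed object directly instead of defensively re-validating free-form model output.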
@@ -285,6 +329,35 @@ Qwen, Llama, Nemotron, GLM - they score brilliantly on coding benchmarks now. Th
 
 **One call at a time.** As of v2.8.0, houtini-lm enforces this automatically with a request semaphore. Parallel calls queue up and run one at a time, so each gets the full timeout budget instead of stacking.
 
+## Self-test (shakedown)
+
+The canonical way to verify an install and get an honest read on what the loaded model can do on your hardware:
+
+```bash
+npm run shakedown
+```
+
+This runs [`shakedown.mjs`](./shakedown.mjs) - an end-to-end test that exercises all seven tools (`discover` → `list_models` → `chat` → `custom_prompt` → `code_task` → `code_task_files` → `embed`) and prints a summary table with real TTFT, tok/s, token counts, and reasoning-token split for each call. Takes under a minute on a decent rig.
+
+Sample output tail:
+
+```
+Summary
+
+7/7 steps passed on LM Studio, model=nvidia/nemotron-3-nano
+
+| Tool | OK | TTFT (ms) | tok/s | Tokens in→out | Reasoning | Notes
+| chat | ✅ | 891 | 36.9 | 48→104 | – | answered
+| custom_prompt | ✅ | 872 | 43.9 | 170→333 | – | 5 valid items
+| code_task | ✅ | 857 | 41.6 | 180→189 | – | tests generated
+| code_task_files | ✅ | 11028 | 39.5 | 6891→3000 | – | cross-referenced
+| embed | ✅ | – | – | – | – | 768-dim vector
+
+Tokens offloaded: 10,915 (prompt: 7,289, completion: 3,626, reasoning: 0)
+```
+
+Want a human-readable quality review rather than just latency numbers? Paste [SHAKEDOWN.md](./SHAKEDOWN.md) into a Claude session that has houtini-lm attached - Claude will drive the seven steps and write you a report on output quality as well as performance.
+
 ## Think-block handling
 
 Some models emit `<think>...</think>` reasoning blocks before the actual answer. Houtini-lm handles this in two ways:
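The "one call at a time" request semaphore this hunk refers to is a standard single-permit async queue. A minimal sketch of the behaviour - not houtini-lm's actual class, all names invented - looks like:

```typescript
// Illustrative single-permit semaphore: parallel calls queue and run strictly
// one at a time, so each gets the full timeout budget instead of stacking.
// Not the package's real implementation. (Uses top-level await: run as ESM.)

class Semaphore {
  private queue: Array<() => void> = [];
  private busy = false;

  async run<T>(task: () => Promise<T>): Promise<T> {
    await new Promise<void>((acquire) => {
      if (!this.busy) {
        this.busy = true;
        acquire();
      } else {
        this.queue.push(acquire); // park until the current call completes
      }
    });
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) next();          // hand the permit to the next queued call
      else this.busy = false;
    }
  }
}

// Two "tool calls" issued in parallel still execute back to back.
const order: string[] = [];
const sem = new Semaphore();
await Promise.all([
  sem.run(async () => {
    order.push("a-start");
    await new Promise((r) => setTimeout(r, 10));
    order.push("a-end");
  }),
  sem.run(async () => {
    order.push("b-start");
    order.push("b-end");
  }),
]);
console.log(order.join(",")); // a-start,a-end,b-start,b-end
```

Queueing rather than rejecting matters here: the MCP client sees a slow call, not a failed one.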
@@ -383,8 +456,11 @@ git clone https://github.com/houtini-ai/lm.git
 cd lm
 npm install
 npm run build
+npm run shakedown   # end-to-end self-test + benchmark
 ```
 
+See [DEVELOPER.md](./DEVELOPER.md) for architecture, internals, the reasoning-model pipeline, backend detection, the SQLite performance cache, and instructions for adding new tools or backends.
+
 ## Licence
 
 Apache-2.0