@houtini/lm 2.7.0 → 2.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +56 -11
- package/dist/index.js +382 -44
- package/dist/index.js.map +1 -1
- package/dist/model-cache.d.ts +10 -2
- package/dist/model-cache.js +117 -37
- package/dist/model-cache.js.map +1 -1
- package/package.json +2 -2
- package/server.json +2 -2
package/README.md
CHANGED
@@ -1,9 +1,8 @@
-# @houtini/lm Houtini LM -
+# @houtini/lm Houtini LM - Save Tokens by Offloading Tasks from Claude Code to Your Local LLM Server (LM Studio / Ollama) or a Cloud API
 
 [](https://www.npmjs.com/package/@houtini/lm)
 [](https://registry.modelcontextprotocol.io)
-[](https://opensource.org/licenses/MIT)
+[](https://opensource.org/licenses/Apache-2.0)
 
 <p align="center">
 <a href="https://glama.ai/mcp/servers/@houtini-ai/lm">
@@ -11,6 +10,10 @@
 </a>
 </p>
 
+> **Quick Navigation**
+>
+> [How it works](#how-it-works) | [Quick start](#quick-start) | [What gets offloaded](#what-gets-offloaded) | [Tools](#tools) | [Model routing](#model-routing) | [Configuration](#configuration) | [Compatible endpoints](#compatible-endpoints)
+
 I built this because I kept leaving Claude Code running overnight on big refactors and the token bill was painful. A huge chunk of that spend goes on bounded tasks any decent model handles fine - generating boilerplate, code review, commit messages, format conversion. Stuff that doesn't need Claude's reasoning or tool access.
 
 Houtini LM connects Claude Code to a local LLM on your network - or any OpenAI-compatible API. Claude keeps doing the hard work - architecture, planning, multi-file changes - and offloads the grunt work to whatever cheaper model you've got running. Free. No rate limits. Private.
@@ -35,8 +38,6 @@ Claude Code (orchestrator)
 
 Claude's the architect. Your local model's the drafter. Claude QAs everything.
 
-Every response comes back with performance stats - TTFT, tokens per second, generation time - so you can actually see what your local hardware is doing. The session footer tracks cumulative offloaded tokens across every call.
-
 ## Quick start
 
 ### Claude Code
@@ -267,11 +268,55 @@ Qwen, Llama, Nemotron, GLM - they score brilliantly on coding benchmarks now. Th
 
 **Include surrounding context.** For code generation, send imports, types, and function signatures - not just the function body.
 
-**One call at a time.**
+**One call at a time.** As of v2.8.0, houtini-lm enforces this automatically with a request semaphore. Parallel calls queue up and run one at a time, so each gets the full timeout budget instead of stacking.
+
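The queuing behaviour described in the added paragraph above can be pictured as a single-slot promise chain. The sketch below is illustrative only; `Semaphore` and `callLocalModel` are hypothetical names, not houtini-lm's actual internals.

```typescript
// Minimal single-slot semaphore: each caller awaits the previous call's
// completion before starting, so requests run strictly one at a time.
class Semaphore {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if a task rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const inference = new Semaphore();

// Hypothetical usage: parallel tool calls queue here instead of hitting the
// single-model server concurrently, so each gets the full timeout budget.
async function callLocalModel(prompt: string): Promise<string> {
  return inference.run(async () => {
    // ...stream from the OpenAI-compatible endpoint here...
    return "stub response for: " + prompt;
  });
}
```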
+## Think-block handling
+
+Some models emit `<think>...</think>` reasoning blocks before the actual answer. Houtini-lm handles this in two ways:
+
+1. **Suppression at source** — at startup, houtini-lm checks each model's HuggingFace chat template for thinking support. Models that support the `enable_thinking` toggle (like Qwen3) get thinking disabled at inference time, reclaiming the generation budget for actual output. This detection is fully automatic — no hardcoded model lists.
+
+2. **Stripping as fallback** — for models that always emit think blocks regardless (GLM Flash, Nemotron), the content is stripped after assembly so Claude gets clean output. Orphaned opening tags from truncated responses are handled too.
+
+The quality footer flags `think-blocks-stripped` when stripping occurred, so you know the model was reasoning internally even though the output is clean.
+
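The stripping fallback in point 2 amounts to a post-assembly cleanup pass. A minimal sketch, assuming a hypothetical `stripThinkBlocks` helper; the package's real implementation may differ.

```typescript
// Remove complete <think>...</think> blocks, plus an orphaned opening tag
// left behind when a response was truncated mid-reasoning.
function stripThinkBlocks(text: string): { clean: string; stripped: boolean } {
  const withoutBlocks = text.replace(/<think>[\s\S]*?<\/think>/g, "");
  // Orphaned opening tag: drop everything from it onward.
  const clean = withoutBlocks.replace(/<think>[\s\S]*$/, "").trim();
  return { clean, stripped: clean !== text.trim() };
}

const sample = "<think>weighing options...</think>Use a map here.";
const { clean, stripped } = stripThinkBlocks(sample);
// clean === "Use a map here.", stripped === true
// When `stripped` is true, the footer would carry `think-blocks-stripped`.
```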
+## Quality metadata
+
+Every response includes structured quality signals in the footer so Claude (or any orchestrator) can make informed trust decisions:
+
+```
+Model: qwen3-coder-30b-a3b | 413→81 tokens | TTFT: 2355ms, 15.0 tok/s, 5.4s
+Quality: think-blocks-stripped, tokens-estimated
+Session: 494 tokens offloaded across 1 call
+```
+
+Flags include: `TRUNCATED` (partial result), `think-blocks-stripped`, `tokens-estimated` (usage data was missing, estimated from content length), `hit-max-tokens`. When no flags fire, the quality line is omitted — clean output, nothing to report.
+
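For orientation, the flag vocabulary listed above could be modelled as a narrow union. The type names below are hypothetical and are not exported by the package.

```typescript
// Hypothetical modelling of the quality flags listed above.
type QualityFlag =
  | "TRUNCATED"             // partial result
  | "think-blocks-stripped" // reasoning removed from the visible output
  | "tokens-estimated"      // usage data missing; counts estimated from length
  | "hit-max-tokens";       // generation stopped at the token limit

// An empty flag list corresponds to the quality line being omitted.
type QualityReport = { flags: QualityFlag[] };
```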
+## Session metrics resource
+
+The `houtini://metrics/session` MCP resource exposes cumulative offload stats as JSON. Claude can read this proactively to make smarter delegation decisions based on actual session performance:
+
+```json
+{
+  "session": {
+    "totalCalls": 14,
+    "promptTokens": 3200,
+    "completionTokens": 5250,
+    "totalTokensOffloaded": 8450
+  },
+  "perModel": {
+    "qwen3-coder-30b-a3b": {
+      "calls": 14,
+      "avgTtftMs": 2100,
+      "avgTokPerSec": 15.2
+    }
+  }
+}
+```
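The resource payload above implies a stable shape. Here is a hypothetical TypeScript typing of it, plus one derived figure, purely as a reading aid; these interfaces do not come from the package.

```typescript
// Hypothetical types mirroring the metrics JSON shown above.
interface SessionTotals {
  totalCalls: number;
  promptTokens: number;
  completionTokens: number;
  totalTokensOffloaded: number;
}

interface ModelStats {
  calls: number;
  avgTtftMs: number;
  avgTokPerSec: number;
}

interface SessionMetrics {
  session: SessionTotals;
  perModel: Record<string, ModelStats>;
}

// Example: average offloaded tokens per call, derived from the resource payload.
function avgTokensPerCall(m: SessionMetrics): number {
  return m.session.totalCalls === 0
    ? 0
    : m.session.totalTokensOffloaded / m.session.totalCalls;
}
```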
 
-##
+## Request serialisation
 
-
+Parallel MCP tool calls are automatically queued and run one at a time. Most local LLM servers run a single model — without serialisation, parallel requests stack timeouts and waste the generation budget. The semaphore ensures each call gets the full timeout window.
 
 ## Configuration
 
@@ -299,9 +344,9 @@ Works with anything that speaks the OpenAI `/v1/chat/completions` API:
 
 ## Streaming and timeouts
 
-All inference uses Server-Sent Events streaming. Tokens arrive incrementally
+All inference uses Server-Sent Events streaming. Tokens arrive incrementally. As of v2.8.0, houtini-lm sends MCP progress notifications on every streamed chunk, which resets the SDK's 60-second client timeout. This means generation can run as long as the model needs — there's no hard ceiling as long as tokens keep flowing.
 
-
+If the connection stalls (no new tokens for an extended period), you get a partial result instead of a timeout error. The footer shows `TRUNCATED` when this happens, and the quality metadata flags it so Claude knows to treat the output with appropriate caution.
 
 ## Architecture
 
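The keep-alive pattern described in that hunk, streamed chunks doubling as progress signals with a stall fallback, can be sketched roughly as follows; `notifyProgress` stands in for an MCP progress notification and is not the package's actual API.

```typescript
// Consume an async token stream, forwarding a progress signal per chunk so the
// client's 60-second idle timeout keeps resetting. If the stream stalls past
// `stallMs`, return what we have and mark it truncated.
async function collectWithKeepAlive(
  chunks: AsyncIterable<string>,
  notifyProgress: (tokensSoFar: number) => void, // stand-in for a progress notification
  stallMs: number,
): Promise<{ text: string; truncated: boolean }> {
  let text = "";
  let count = 0;
  const iterator = chunks[Symbol.asyncIterator]();

  while (true) {
    const stall = new Promise<"stall">((resolve) =>
      setTimeout(() => resolve("stall"), stallMs),
    );
    const next = await Promise.race([iterator.next(), stall]);
    if (next === "stall") return { text, truncated: true }; // partial result, flagged TRUNCATED
    if (next.done) return { text, truncated: false };
    text += next.value;
    notifyProgress(++count); // resets the client-side timeout
  }
}
```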
@@ -327,4 +372,4 @@ npm run build
 
 ## Licence
 
-
+Apache-2.0