@houtini/lm 2.7.0 → 2.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,9 +1,8 @@
- # @houtini/lm Houtini LM - Offload Tasks from Claude Code to Your Local LLM Server (LM Studio / Ollama) or a Cloud API
+ # @houtini/lm Houtini LM - Save Tokens by Offloading Tasks from Claude Code to Your Local LLM Server (LM Studio / Ollama) or a Cloud API
 
  [![npm version](https://img.shields.io/npm/v/@houtini/lm.svg?style=flat-square)](https://www.npmjs.com/package/@houtini/lm)
  [![MCP Registry](https://img.shields.io/badge/MCP-Registry-blue?style=flat-square)](https://registry.modelcontextprotocol.io)
- [![Known Vulnerabilities](https://snyk.io/test/github/houtini-ai/lm/badge.svg)](https://snyk.io/test/github/houtini-ai/lm)
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 
  <p align="center">
  <a href="https://glama.ai/mcp/servers/@houtini-ai/lm">
@@ -11,6 +10,10 @@
  </a>
  </p>
 
+ > **Quick Navigation**
+ >
+ > [How it works](#how-it-works) | [Quick start](#quick-start) | [What gets offloaded](#what-gets-offloaded) | [Tools](#tools) | [Model routing](#model-routing) | [Configuration](#configuration) | [Compatible endpoints](#compatible-endpoints)
+
  I built this because I kept leaving Claude Code running overnight on big refactors and the token bill was painful. A huge chunk of that spend goes on bounded tasks any decent model handles fine - generating boilerplate, code review, commit messages, format conversion. Stuff that doesn't need Claude's reasoning or tool access.
 
  Houtini LM connects Claude Code to a local LLM on your network - or any OpenAI-compatible API. Claude keeps doing the hard work - architecture, planning, multi-file changes - and offloads the grunt work to whatever cheaper model you've got running. Free. No rate limits. Private.
@@ -35,8 +38,6 @@ Claude Code (orchestrator)
 
  Claude's the architect. Your local model's the drafter. Claude QAs everything.
 
- Every response comes back with performance stats - TTFT, tokens per second, generation time - so you can actually see what your local hardware is doing. The session footer tracks cumulative offloaded tokens across every call.
-
  ## Quick start
 
  ### Claude Code
@@ -267,11 +268,55 @@ Qwen, Llama, Nemotron, GLM - they score brilliantly on coding benchmarks now. Th
 
  **Include surrounding context.** For code generation, send imports, types, and function signatures - not just the function body.
 
- **One call at a time.** If your LLM server runs a single model, parallel calls queue up and stack timeouts. Send them sequentially.
+ **One call at a time.** As of v2.8.0, houtini-lm enforces this automatically with a request semaphore. Parallel calls queue up and run one at a time, so each gets the full timeout budget instead of stacking.
+
+ ## Think-block handling
+
+ Some models emit `<think>...</think>` reasoning blocks before the actual answer. Houtini-lm handles this in two ways:
+
+ 1. **Suppression at source** — at startup, houtini-lm checks each model's HuggingFace chat template for thinking support. Models that support the `enable_thinking` toggle (like Qwen3) get thinking disabled at inference time, reclaiming the generation budget for actual output. This detection is fully automatic — no hardcoded model lists.
+
+ 2. **Stripping as fallback** — for models that always emit think blocks regardless (GLM Flash, Nemotron), the content is stripped after assembly so Claude gets clean output. Orphaned opening tags from truncated responses are handled too.
+
+ The quality footer flags `think-blocks-stripped` when stripping occurred, so you know the model was reasoning internally even though the output is clean.
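
The stripping fallback amounts to a regex pass plus a guard for a lone opening tag. A minimal TypeScript sketch of that idea (illustrative only - `stripThinkBlocks` is a hypothetical name, not houtini-lm's actual function):

```typescript
// Illustrative sketch, not the package's implementation. Removes complete
// <think>...</think> blocks, then drops anything after an orphaned opening
// tag left behind when a response was truncated mid-reasoning.
export function stripThinkBlocks(text: string): { clean: string; stripped: boolean } {
  const withoutClosed = text.replace(/<think>[\s\S]*?<\/think>/g, "");
  const orphanIndex = withoutClosed.indexOf("<think>");
  const clean = orphanIndex === -1 ? withoutClosed : withoutClosed.slice(0, orphanIndex);
  return { clean: clean.trim(), stripped: clean !== text };
}
```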
+
+ ## Quality metadata
+
+ Every response includes structured quality signals in the footer so Claude (or any orchestrator) can make informed trust decisions:
+
+ ```
+ Model: qwen3-coder-30b-a3b | 413→81 tokens | TTFT: 2355ms, 15.0 tok/s, 5.4s
+ Quality: think-blocks-stripped, tokens-estimated
+ Session: 494 tokens offloaded across 1 call
+ ```
+
+ Flags include: `TRUNCATED` (partial result), `think-blocks-stripped`, `tokens-estimated` (usage data was missing, estimated from content length), `hit-max-tokens`. When no flags fire, the quality line is omitted — clean output, nothing to report.
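
An orchestrator that wants to branch on these flags can lift them straight out of the footer text. A small TypeScript sketch under that assumption (`parseQualityFlags` is a hypothetical helper, not part of the package):

```typescript
// Hypothetical helper: extracts the flag list from a footer like the example
// above. Returns an empty array when no "Quality:" line is present.
function parseQualityFlags(footer: string): string[] {
  const match = footer.match(/^Quality:\s*(.+)$/m);
  return match ? match[1].split(",").map((flag) => flag.trim()) : [];
}

// parseQualityFlags("Quality: think-blocks-stripped, tokens-estimated")
// → ["think-blocks-stripped", "tokens-estimated"]
```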
+
+ ## Session metrics resource
+
+ The `houtini://metrics/session` MCP resource exposes cumulative offload stats as JSON. Claude can read this proactively to make smarter delegation decisions based on actual session performance:
+
+ ```json
+ {
+   "session": {
+     "totalCalls": 14,
+     "promptTokens": 3200,
+     "completionTokens": 5250,
+     "totalTokensOffloaded": 8450
+   },
+   "perModel": {
+     "qwen3-coder-30b-a3b": {
+       "calls": 14,
+       "avgTtftMs": 2100,
+       "avgTokPerSec": 15.2
+     }
+   }
+ }
+ ```
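
Outside Claude Code, any MCP client can read the same resource. A rough TypeScript sketch using the MCP SDK's `readResource` call (the launch command and the cast are assumptions - adapt to how you start the server and to your SDK version):

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Sketch only: spawn the server over stdio and read the metrics resource.
// The command/args are assumptions - use whatever launches houtini-lm for you.
const client = new Client({ name: "metrics-reader", version: "1.0.0" });
await client.connect(new StdioClientTransport({ command: "npx", args: ["-y", "@houtini/lm"] }));

const result = await client.readResource({ uri: "houtini://metrics/session" });
const metrics = JSON.parse((result.contents[0] as { text: string }).text);
console.log(`Offloaded ${metrics.session.totalTokensOffloaded} tokens so far`);
```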
 
- ## Think-block stripping
+ ## Request serialisation
 
- Some models - GLM Flash, Nemotron, and others - always emit `<think>...</think>` reasoning blocks before the actual answer. Houtini-lm strips these automatically so Claude gets clean output without wasting time parsing the model's internal chain-of-thought. You still get the benefit of the reasoning (better answers), just without the noise.
+ Parallel MCP tool calls are automatically queued and run one at a time. Most local LLM servers run a single model; without serialisation, parallel requests stack timeouts and waste the generation budget. The semaphore ensures each call gets the full timeout window.
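
The semaphore needn't be anything exotic - a single-slot promise chain does the job. A minimal TypeScript sketch of the pattern (illustrative only; `RequestSemaphore` is not houtini-lm's actual class):

```typescript
// Illustrative single-slot async semaphore built on a promise chain.
// Each caller starts only after the previous call has settled.
class RequestSemaphore {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task, task); // run after the previous call, success or failure
    this.tail = result.catch(() => undefined); // keep the chain alive if this call fails
    return result;
  }
}

// Usage (placeholder names): const reply = await semaphore.run(() => callLocalModel(prompt));
```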
 
  ## Configuration
 
@@ -299,9 +344,9 @@ Works with anything that speaks the OpenAI `/v1/chat/completions` API:
 
  ## Streaming and timeouts
 
- All inference uses Server-Sent Events streaming. Tokens arrive incrementally, keeping the connection alive. If generation takes longer than 55 seconds, you get a partial result instead of a timeout error - the footer shows `TRUNCATED` when this happens.
+ All inference uses Server-Sent Events streaming. Tokens arrive incrementally. As of v2.8.0, houtini-lm sends MCP progress notifications on every streamed chunk, which resets the SDK's 60-second client timeout. This means generation can run as long as the model needs - there's no hard ceiling as long as tokens keep flowing.
 
- The 55-second soft timeout exists because the MCP SDK has a hard ~60s client-side timeout. Without streaming, any response that took longer than 60 seconds just vanished. Not ideal.
+ If the connection stalls (no new tokens for an extended period), you get a partial result instead of a timeout error. The footer shows `TRUNCATED` when this happens, and the quality metadata flags it so Claude knows to treat the output with appropriate caution.
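
The keep-alive mechanism is standard MCP progress reporting: one `notifications/progress` notification per streamed chunk, and the client extends its request timeout each time. A rough server-side sketch in TypeScript - the shape of `extra` here is an assumption, so check how your MCP SDK version exposes `sendNotification` and the caller's progress token:

```typescript
// Rough sketch only. Forwards one progress notification per streamed chunk so
// the client keeps resetting its timeout. Where the progress token lives on
// `extra` is an assumption - adjust to your MCP SDK version.
async function streamWithKeepalive(
  chunks: AsyncIterable<string>,
  extra: {
    sendNotification: (n: unknown) => Promise<void>;
    _meta?: { progressToken?: string | number };
  },
): Promise<string> {
  let output = "";
  let count = 0;
  for await (const chunk of chunks) {
    output += chunk;
    count += 1;
    if (extra._meta?.progressToken !== undefined) {
      await extra.sendNotification({
        method: "notifications/progress",
        params: { progressToken: extra._meta.progressToken, progress: count },
      });
    }
  }
  return output;
}
```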
 
  ## Architecture
 
@@ -327,4 +372,4 @@ npm run build
 
  ## Licence
 
- MIT
+ Apache-2.0