@houtini/lm 2.4.1 → 2.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,31 +1,42 @@
- # @houtini/lm
+ # Houtini LM - Offload Tasks from Claude Code to Your Local LLM Server (LM Studio / Ollama) or a Cloud API
 
- [![npm version](https://img.shields.io/npm/v/@houtini/lm)](https://www.npmjs.com/package/@houtini/lm)
+ [![npm version](https://img.shields.io/npm/v/@houtini/lm.svg?style=flat-square)](https://www.npmjs.com/package/@houtini/lm)
+ [![MCP Registry](https://img.shields.io/badge/MCP-Registry-blue?style=flat-square)](https://registry.modelcontextprotocol.io)
+ [![Known Vulnerabilities](https://snyk.io/test/github/houtini-ai/lm/badge.svg)](https://snyk.io/test/github/houtini-ai/lm)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- I built this because I kept leaving Claude Code running overnight on big refactors and the token bill was painful. A huge chunk of that spend goes on bounded tasks any decent model handles fine - generating boilerplate, explaining code, drafting commit messages, converting formats. Stuff that doesn't need Claude's reasoning or tool access.
+ <p align="center">
+   <a href="https://glama.ai/mcp/servers/@houtini-ai/lm">
+     <img width="380" height="200" src="https://glama.ai/mcp/servers/@houtini-ai/lm/badge" alt="Houtini LM MCP server" />
+   </a>
+ </p>
 
- Houtini LM connects Claude Code to a local LLM on your network. Claude keeps doing the hard work - architecture, planning, multi-file changes - and offloads the grunt work to your local model. Free. No rate limits. Private.
+ I built this because I kept leaving Claude Code running overnight on big refactors and the token bill was painful. A huge chunk of that spend goes on bounded tasks any decent model handles fine - generating boilerplate, code review, commit messages, format conversion. Stuff that doesn't need Claude's reasoning or tool access.
 
- The session footer tracks everything Claude offloads, so you can watch the savings stack up.
+ Houtini LM connects Claude Code to a local LLM on your network - or any OpenAI-compatible API. Claude keeps doing the hard work - architecture, planning, multi-file changes - and offloads the grunt work to whatever cheaper model you've got running. Free. No rate limits. Private.
+
+ I wrote a [full walkthrough of why I built this and how I use it day to day](https://houtini.com/how-to-cut-your-claude-code-bill-with-houtini-lm/).
 
  ## How it works
 
  ```
  Claude Code (orchestrator)
-
- ├─ Complex reasoning, planning, architecture    Claude API (your tokens)
-
- └─ Bounded grunt work                           houtini-lm ──HTTP/SSE──> Your local LLM (free)
-      Boilerplate & test stubs                   Qwen, Llama, Mistral, DeepSeek...
-      Code review & explanations                 LM Studio, Ollama, vLLM, llama.cpp
-      Commit messages & docs
-      Format conversion
-      Mock data & type definitions
+ |
+ |-- Complex reasoning, planning, architecture --> Claude API (your tokens)
+ |
+ +-- Bounded grunt work --> houtini-lm --HTTP/SSE--> Your local LLM (free)
+ .     Boilerplate & test stubs          Qwen, Llama, Nemotron, GLM...
+ .     Code review & explanations        LM Studio, Ollama, vLLM, llama.cpp
+ .     Commit messages & docs            DeepSeek, Groq, Cerebras (cloud)
+ .     Format conversion
+ .     Mock data & type definitions
+ .     Embeddings for RAG pipelines
  ```
 
  Claude's the architect. Your local model's the drafter. Claude QAs everything.
 
+ Every response comes back with performance stats - TTFT, tokens per second, generation time - so you can actually see what your local hardware is doing. The session footer tracks cumulative offloaded tokens across every call.
+
  ## Quick start
 
  ### Claude Code
@@ -44,6 +55,17 @@ I've got a GPU box on my local network running Qwen 3 Coder Next in LM Studio. I
  claude mcp add houtini-lm -e LM_STUDIO_URL=http://192.168.1.50:1234 -- npx -y @houtini/lm
  ```
 
+ ### Cloud APIs
+
+ Works with anything speaking the OpenAI format. DeepSeek at twenty-eight cents per million input tokens, Groq for speed, Cerebras if you want three thousand tokens per second - whatever you fancy:
+
+ ```bash
+ claude mcp add houtini-lm \
+   -e LM_STUDIO_URL=https://api.deepseek.com \
+   -e LM_STUDIO_PASSWORD=your-key-here \
+   -- npx -y @houtini/lm
+ ```
+
  ### Claude Desktop
 
  Drop this into your `claude_desktop_config.json`:
@@ -62,6 +84,36 @@ Drop this into your `claude_desktop_config.json`:
  }
  ```
 
+ ## Model discovery
+
+ This is where things get interesting. At startup, houtini-lm queries your LLM server for every model available - loaded and downloaded - then looks each one up on HuggingFace's free API to pull metadata: architecture, licence, download count, pipeline type. All of that gets cached in a local SQLite database (`~/.houtini-lm/model-cache.db`) so subsequent startups are instant.
+
+ The result is that houtini-lm actually knows what your models are good at. Not just the name - the capabilities, the strengths, what tasks to send where. If you've got Nemotron loaded but a Qwen Coder sitting idle, it'll flag that. If someone on a completely different setup loads a Mistral model houtini-lm has never seen before, the HuggingFace lookup auto-generates a profile for it.
+
+ Run `list_models` and you get the full picture:
+
+ ```
+ Loaded models (ready to use):
+
+ nvidia/nemotron-3-nano
+   type: llm, arch: nemotron_h_moe, quant: Q4_K_M, format: gguf
+   context: 200,082 (max 1,048,576), by: nvidia
+   Capabilities: tool_use
+   NVIDIA Nemotron: compact reasoning model optimised for step-by-step logic
+   Best for: analysis tasks, code bug-finding, math/science questions
+   HuggingFace: text-generation, 1.7M downloads, MIT licence
+
+ Available models (downloaded, not loaded):
+
+ qwen3-coder-30b-a3b-instruct
+   type: llm, arch: qwen3moe, quant: BF16, context: 262,144
+   Qwen3 Coder: code-specialised model with agentic capabilities
+   Best for: code generation, code review, test stubs, refactoring
+   HuggingFace: text-generation, 12.9K downloads, Apache-2.0
+ ```
+
+ For models we know well - Qwen, Nemotron, Granite, LLaMA, GLM, GPT-OSS - there's a curated profile built in with specific strengths and weaknesses. For everything else, the HuggingFace lookup fills the gaps. Cache refreshes every 7 days. Zero friction - `sql.js` is pure WASM, no native dependencies, no build tools needed.
+
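The 7-day refresh boils down to a timestamp comparison. A minimal sketch of that check in TypeScript - the interface and function names here are illustrative, not houtini-lm's actual schema:

```typescript
// Illustrative staleness check for cached model profiles.
// Field and function names are hypothetical, not houtini-lm's real schema.
interface CachedProfile {
  modelId: string;
  fetchedAt: number; // Unix ms timestamp of the HuggingFace lookup
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

function isStale(profile: CachedProfile, now: number = Date.now()): boolean {
  // Refresh once the cached metadata is older than the 7-day window
  return now - profile.fetchedAt > SEVEN_DAYS_MS;
}
```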
  ## What gets offloaded
 
  **Delegate to the local model** - bounded, well-defined tasks:
@@ -72,9 +124,11 @@ Drop this into your `claude_desktop_config.json`:
  | Explain a function | Summarisation doesn't need tool access |
  | Draft commit messages | Diff in, message out |
  | Code review | Paste full source, ask for bugs |
- | Convert formats | JSON→YAML, snake_case→camelCase |
+ | Convert formats | JSON to YAML, snake_case to camelCase |
  | Generate mock data | Schema in, data out |
  | Write type definitions | Source in, types out |
+ | Structured JSON output | Grammar-constrained, guaranteed valid |
+ | Text embeddings | Semantic search, RAG pipelines |
  | Brainstorm approaches | Doesn't commit to anything |
 
  **Keep on Claude** - anything that needs reasoning, tool access, or multi-step orchestration:
@@ -87,15 +141,32 @@ Drop this into your `claude_desktop_config.json`:
 
  The tool descriptions are written to nudge Claude into planning delegation at the start of large tasks, not just using it when it happens to think of it.
 
- ## Token tracking
+ ## Performance tracking
 
- Every response includes a session footer:
+ Every response includes a footer with real performance data - computed from the SSE stream, not from any proprietary API:
 
  ```
- Model: qwen/qwen3-coder-next | This call: 145→248 tokens | Session: 12,450 tokens offloaded across 23 calls
+ Model: zai-org/glm-4.7-flash | 125->430 tokens | TTFT: 678ms, 48.7 tok/s, 12.5s
+ Session: 8,450 tokens offloaded across 14 calls
  ```
 
- The `discover` tool reports cumulative session stats too. Claude sees this data and (I've found) it reinforces the delegation habit throughout long-running tasks. The more it sees it's saving tokens, the more it looks for things to offload.
+ The `discover` tool shows per-model averages across the session:
+
+ ```
+ Performance (this session):
+   nvidia/nemotron-3-nano: 6 calls, avg TTFT 234ms, avg 45.2 tok/s
+   zai-org/glm-4.7-flash: 8 calls, avg TTFT 678ms, avg 48.7 tok/s
+ ```
+
+ In practice, Claude delegates more aggressively the longer a session runs. After about 5,000 offloaded tokens, it starts hunting for more work to push over. Reinforcing loop.
+
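The footer numbers fall out of simple stream timing. A rough sketch of the arithmetic, assuming you record when the request went out and when each token arrived (the shape of houtini-lm's internal bookkeeping may differ):

```typescript
// Derive footer-style stats from SSE stream timestamps.
// Hypothetical helper for illustration; not houtini-lm's actual code.
interface StreamTiming {
  requestStart: number;     // ms: request sent
  firstTokenAt: number;     // ms: first streamed token arrived
  lastTokenAt: number;      // ms: final token arrived
  completionTokens: number; // tokens generated
}

function footerStats(t: StreamTiming) {
  const ttftMs = t.firstTokenAt - t.requestStart;      // time to first token
  const totalSeconds = (t.lastTokenAt - t.requestStart) / 1000;
  const tokPerSec = t.completionTokens / totalSeconds; // overall throughput
  return { ttftMs, tokPerSec, totalSeconds };
}
```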
+ ## Model routing
+
+ If you've got multiple models loaded (or downloaded), houtini-lm picks the best one for each task automatically. Each model family has per-family prompt hints - temperature, output constraints, and think-block flags - so GLM gets told "no preamble, no step-by-step reasoning" while Qwen Coder gets a low temperature for focused code output.
+
+ The routing scores loaded models against the task type (code, chat, analysis, embedding). If the best loaded model isn't ideal for the task, you'll see a suggestion in the response footer pointing to a better downloaded model. No runtime model swapping - model loading takes minutes, so houtini-lm suggests rather than blocks.
+
+ Supported model families with curated prompt hints: GLM-4, Qwen3 Coder, Qwen3, LLaMA 3, Nemotron, Granite, GPT-OSS, Nomic Embed. Unknown models get sensible defaults.
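The selection step itself is just an argmax over per-task suitability scores. A hypothetical sketch - the real curated profiles carry far more detail than this toy scoring table:

```typescript
// Toy model-routing sketch: pick the loaded model with the highest
// suitability score for the requested task type. Illustrative only.
type TaskType = "code" | "chat" | "analysis" | "embedding";

interface LoadedModel {
  id: string;
  scores: Record<TaskType, number>; // 0..1 suitability per task type
}

function pickModel(loaded: LoadedModel[], task: TaskType): LoadedModel {
  if (loaded.length === 0) throw new Error("no models loaded");
  return loaded.reduce((best, m) => (m.scores[task] > best.scores[task] ? m : best));
}
```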
 
  ## Tools
 
@@ -109,10 +180,11 @@ The workhorse. Send a task, get an answer. The description includes planning tri
  | `system` | no | - | Persona - "Senior TypeScript dev" not "helpful assistant" |
  | `temperature` | no | 0.3 | 0.1 for code, 0.3 for analysis, 0.7 for creative |
  | `max_tokens` | no | 2048 | Lower for quick answers, higher for generation |
+ | `json_schema` | no | - | Force structured JSON output conforming to a schema |
 
  ### `custom_prompt`
 
- Three-part prompt: system, context, instruction. Keeping them separate prevents context bleed - consistently outperforms stuffing everything into one message, especially with local models.
+ Three-part prompt: system, context, instruction. Keeping them separate prevents context bleed - consistently outperforms stuffing everything into one message, especially with local models. I tested this properly one weekend - took the same batch of review tasks and ran them both ways. Splitting things into three parts won every round.
 
  | Parameter | Required | Default | What it does |
  |-----------|----------|---------|-------------|
@@ -121,10 +193,11 @@ Three-part prompt: system, context, instruction. Keeping them separate prevents
  | `context` | no | - | Complete data to analyse. Never truncate. |
  | `temperature` | no | 0.3 | 0.1 for review, 0.3 for analysis |
  | `max_tokens` | no | 2048 | Match to expected output length |
+ | `json_schema` | no | - | Force structured JSON output |
 
  ### `code_task`
 
- Built for code analysis. Pre-configured system prompt, locked to temperature 0.2 for focused output.
+ Built for code analysis. Pre-configured system prompt with temperature and output constraints tuned per model family via the routing layer.
 
  | Parameter | Required | Default | What it does |
  |-----------|----------|---------|-------------|
@@ -133,17 +206,56 @@ Built for code analysis. Pre-configured system prompt, locked to temperature 0.2
  | `language` | no | - | "typescript", "python", "rust", etc. |
  | `max_tokens` | no | 2048 | Match to expected output length |
 
+ ### `embed`
+
+ Generate text embeddings via the OpenAI-compatible `/v1/embeddings` endpoint. Requires an embedding model to be available - Nomic Embed is a solid choice. Returns the vector, dimension count, and usage stats.
+
+ | Parameter | Required | Default | What it does |
+ |-----------|----------|---------|-------------|
+ | `input` | yes | - | Text to embed |
+ | `model` | no | auto | Embedding model ID |
+
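Once you have vectors back from `embed`, ranking for semantic search is usually cosine similarity. A self-contained sketch - houtini-lm returns the vectors; the comparison step is up to you:

```typescript
// Cosine similarity between two embedding vectors - the standard
// ranking metric for semantic search / RAG retrieval.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```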
  ### `discover`
 
- Health check. Returns model name, context window, latency, and cumulative session stats. Call before delegating if you're not sure the LLM's available.
+ Health check. Returns model name, context window, latency, capability profile, and cumulative session stats including per-model performance averages. Call before delegating if you're not sure the LLM's available.
 
  ### `list_models`
 
- Lists everything loaded on the LLM server with context window sizes.
+ Lists everything on the LLM server - loaded and downloaded - with full metadata: architecture, quantisation, context window, capabilities, and HuggingFace enrichment data. Shows capability profiles describing what each model is best at, so Claude can make informed delegation decisions.
+
+ ## Structured JSON output
+
+ Both `chat` and `custom_prompt` accept a `json_schema` parameter that forces the response to conform to a JSON Schema. LM Studio uses grammar-based sampling to guarantee valid output - no hoping the model remembers to close its brackets.
+
+ ```json
+ {
+   "json_schema": {
+     "name": "code_review",
+     "schema": {
+       "type": "object",
+       "properties": {
+         "issues": {
+           "type": "array",
+           "items": {
+             "type": "object",
+             "properties": {
+               "line": { "type": "number" },
+               "severity": { "type": "string" },
+               "description": { "type": "string" }
+             },
+             "required": ["line", "severity", "description"]
+           }
+         }
+       },
+       "required": ["issues"]
+     }
+   }
+ }
+ ```
 
  ## Getting good results from local models
 
- Qwen, Llama, DeepSeek - they score brilliantly on coding benchmarks now. The gap between a good and bad result is almost always **prompt quality**, not model capability. I've spent a fair bit of time on this.
+ Qwen, Llama, Nemotron, GLM - they score brilliantly on coding benchmarks now. The gap between a good and bad result is almost always prompt quality, not model capability. I've spent a fair bit of time on this.
 
  **Send complete code.** Local models hallucinate details when you give them truncated input. If a file's too large, send the relevant function - not a snippet with `...` in the middle.
 
@@ -157,6 +269,10 @@ Qwen, Llama, DeepSeek - they score brilliantly on coding benchmarks now. The gap
 
  **One call at a time.** If your LLM server runs a single model, parallel calls queue up and stack timeouts. Send them sequentially.
 
+ ## Think-block stripping
+
+ Some models - GLM Flash, Nemotron, and others - always emit `<think>...</think>` reasoning blocks before the actual answer. Houtini-lm strips these automatically so Claude gets clean output without wasting time parsing the model's internal chain-of-thought. You still get the benefit of the reasoning (better answers), just without the noise.
+
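A minimal sketch of the kind of stripping involved - the actual implementation may be more defensive, but a pair of regex passes covers both closed and unterminated blocks:

```typescript
// Strip <think>...</think> reasoning blocks from model output.
// Sketch only; houtini-lm's real handling may differ (e.g. streaming-aware).
function stripThinkBlocks(text: string): string {
  return text
    .replace(/<think>[\s\S]*?<\/think>/g, "") // closed blocks, non-greedy
    .replace(/<think>[\s\S]*$/, "")           // unterminated trailing block
    .trim();
}
```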
  ## Configuration
 
  | Variable | Default | What it does |
@@ -172,18 +288,34 @@ Works with anything that speaks the OpenAI `/v1/chat/completions` API:
 
  | What | URL | Notes |
  |------|-----|-------|
- | [LM Studio](https://lmstudio.ai) | `http://localhost:1234` | Default, zero config |
+ | [LM Studio](https://lmstudio.ai) | `http://localhost:1234` | Default, zero config. Rich metadata via v0 API. |
  | [Ollama](https://ollama.com) | `http://localhost:11434` | Set `LM_STUDIO_URL` |
  | [vLLM](https://docs.vllm.ai) | `http://localhost:8000` | Native OpenAI API |
  | [llama.cpp](https://github.com/ggml-org/llama.cpp) | `http://localhost:8080` | Server mode |
+ | [DeepSeek](https://platform.deepseek.com) | `https://api.deepseek.com` | 28c/M input tokens |
+ | [Groq](https://groq.com) | `https://api.groq.com/openai` | ~750 tok/s |
+ | [Cerebras](https://cerebras.ai) | `https://api.cerebras.ai` | ~3000 tok/s |
  | Any OpenAI-compatible API | Any URL | Set URL + password |
 
  ## Streaming and timeouts
 
- All inference uses Server-Sent Events streaming. Tokens arrive incrementally, keeping the connection alive. If generation takes longer than 55 seconds, you get a partial result instead of a timeout error - the footer shows `⚠ TRUNCATED` when this happens.
+ All inference uses Server-Sent Events streaming. Tokens arrive incrementally, keeping the connection alive. If generation takes longer than 55 seconds, you get a partial result instead of a timeout error - the footer shows `TRUNCATED` when this happens.
 
  The 55-second soft timeout exists because the MCP SDK has a hard ~60s client-side timeout. Without streaming, any response that took longer than 60 seconds just vanished. Not ideal.
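For reference, the wire format is plain `data:` lines. A minimal sketch of pulling content tokens out of an OpenAI-compatible SSE body - real code would read chunks incrementally rather than a complete string:

```typescript
// Extract content tokens from an OpenAI-compatible chat-completions
// SSE body. Each event is a "data: {...}" line; "data: [DONE]" ends it.
function extractTokens(sseBody: string): string[] {
  const tokens: string[] = [];
  for (const line of sseBody.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice(6).trim();
    if (payload === "[DONE]") break;
    const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
    if (typeof delta === "string") tokens.push(delta);
  }
  return tokens;
}
```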
 
+ ## Architecture
+
+ ```
+ index.ts          Main MCP server - tools, streaming, session tracking
+ model-cache.ts    SQLite-backed model profile cache (sql.js / WASM)
+                   Auto-profiles models via HuggingFace API at startup
+                   Persists to ~/.houtini-lm/model-cache.db
+
+ Inference:        POST /v1/chat/completions (OpenAI-compatible, works everywhere)
+ Model metadata:   GET /api/v0/models (LM Studio, falls back to /v1/models)
+ Embeddings:       POST /v1/embeddings (OpenAI-compatible)
+ ```
+
  ## Development
 
  ```bash
  ```bash