@houtini/lm 2.3.0 → 2.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,11 +1,41 @@
- # @houtini/lm
+ # Houtini LM - Offload Tasks from Claude Code to Your Local LLM Server (LM Studio / Ollama) or a Cloud API
 
- [![npm version](https://img.shields.io/npm/v/@houtini/lm)](https://www.npmjs.com/package/@houtini/lm)
+ [![npm version](https://img.shields.io/npm/v/@houtini/lm.svg?style=flat-square)](https://www.npmjs.com/package/@houtini/lm)
+ [![MCP Registry](https://img.shields.io/badge/MCP-Registry-blue?style=flat-square)](https://registry.modelcontextprotocol.io)
+ [![Known Vulnerabilities](https://snyk.io/test/github/houtini-ai/lm/badge.svg)](https://snyk.io/test/github/houtini-ai/lm)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- An MCP server that connects Claude to any OpenAI-compatible LLM - LM Studio, Ollama, vLLM, llama.cpp, whatever you've got running locally.
+ <p align="center">
+ <a href="https://glama.ai/mcp/servers/@houtini-ai/lm">
+ <img width="380" height="200" src="https://glama.ai/mcp/servers/@houtini-ai/lm/badge" alt="Houtini LM MCP server" />
+ </a>
+ </p>
 
- The idea's simple. Claude's brilliant at orchestration and reasoning, but you're burning tokens on stuff a local model handles just fine. Boilerplate, code review, summarisation, classification - hand it off. Claude keeps working on the hard stuff while your local model chews through the grunt work. Free, parallel, no API keys.
+ I built this because I kept leaving Claude Code running overnight on big refactors and the token bill was painful. A huge chunk of that spend goes on bounded tasks any decent model handles fine - generating boilerplate, code review, commit messages, format conversion. Stuff that doesn't need Claude's reasoning or tool access.
+
+ Houtini LM connects Claude Code to a local LLM on your network - or any OpenAI-compatible API. Claude keeps doing the hard work - architecture, planning, multi-file changes - and offloads the grunt work to whatever cheaper model you've got running. Free. No rate limits. Private.
+
+ I wrote a [full walkthrough of why I built this and how I use it day to day](https://houtini.com/how-to-cut-your-claude-code-bill-with-houtini-lm/).
+
+ ## How it works
+
+ ```
+ Claude Code (orchestrator)
+   |
+   |-- Complex reasoning, planning, architecture --> Claude API (your tokens)
+   |
+   +-- Bounded grunt work --> houtini-lm --HTTP/SSE--> Your local LLM (free)
+         Boilerplate & test stubs       Qwen, Llama, Nemotron, GLM...
+         Code review & explanations     LM Studio, Ollama, vLLM, llama.cpp
+         Commit messages & docs         DeepSeek, Groq, Cerebras (cloud)
+         Format conversion
+         Mock data & type definitions
+         Embeddings for RAG pipelines
+ ```
+
+ Claude's the architect. Your local model's the drafter. Claude QAs everything.
+
+ Every response comes back with performance stats - TTFT (time to first token), tokens per second, generation time - so you can actually see what your local hardware is doing. The session footer tracks cumulative offloaded tokens across every call.
 
  ## Quick start
 
@@ -17,6 +47,25 @@ claude mcp add houtini-lm -- npx -y @houtini/lm
 
  That's it. If LM Studio's running on `localhost:1234` (the default), Claude can start delegating straight away.
 
+ ### LLM on a different machine
+
+ I've got a GPU box on my local network running Qwen 3 Coder Next in LM Studio. If you've got a similar setup, point the URL at it:
+
+ ```bash
+ claude mcp add houtini-lm -e LM_STUDIO_URL=http://192.168.1.50:1234 -- npx -y @houtini/lm
+ ```
+
+ ### Cloud APIs
+
+ Works with anything speaking the OpenAI format. DeepSeek at twenty-eight cents per million input tokens, Groq for speed, Cerebras if you want three thousand tokens per second - whatever you fancy:
+
+ ```bash
+ claude mcp add houtini-lm \
+   -e LM_STUDIO_URL=https://api.deepseek.com \
+   -e LM_STUDIO_PASSWORD=your-key-here \
+   -- npx -y @houtini/lm
+ ```
+
  ### Claude Desktop
 
  Drop this into your `claude_desktop_config.json`:
@@ -35,122 +84,120 @@ Drop this into your `claude_desktop_config.json`:
  }
  ```
 
- ### LLM on a different machine
-
- If you've got a GPU box on your network (I run mine on a separate machine called hopper), point the URL at it:
+ ## Model discovery
 
- ```bash
- claude mcp add houtini-lm -e LM_STUDIO_URL=http://192.168.1.50:1234 -- npx -y @houtini/lm
- ```
+ This is where things get interesting. At startup, houtini-lm queries your LLM server for every model available - loaded and downloaded - then looks each one up on HuggingFace's free API to pull metadata: architecture, licence, download count, pipeline type. All of that gets cached in a local SQLite database (`~/.houtini-lm/model-cache.db`) so subsequent startups are instant.
 
- ## What's it good for?
+ The result is that houtini-lm actually knows what your models are good at. Not just the name - the capabilities, the strengths, what tasks to send where. If you've got Nemotron loaded but a Qwen Coder sitting idle, it'll flag that. If someone on a completely different setup loads a Mistral model houtini-lm has never seen before, the HuggingFace lookup auto-generates a profile for it.
 
- Real examples you can throw at it right now.
+ Run `list_models` and you get the full picture:
 
- **Explain something you just read**
  ```
- "Explain what this function does in 2-3 sentences."
- + paste the function
+ Loaded models (ready to use):
+
+ nvidia/nemotron-3-nano
+   type: llm, arch: nemotron_h_moe, quant: Q4_K_M, format: gguf
+   context: 200,082 (max 1,048,576), by: nvidia
+   Capabilities: tool_use
+   NVIDIA Nemotron: compact reasoning model optimised for step-by-step logic
+   Best for: analysis tasks, code bug-finding, math/science questions
+   HuggingFace: text-generation, 1.7M downloads, MIT licence
+
+ Available models (downloaded, not loaded):
+
+ qwen3-coder-30b-a3b-instruct
+   type: llm, arch: qwen3moe, quant: BF16, context: 262,144
+   Qwen3 Coder: code-specialised model with agentic capabilities
+   Best for: code generation, code review, test stubs, refactoring
+   HuggingFace: text-generation, 12.9K downloads, Apache-2.0
  ```
 
- **Second opinion on generated code**
- ```
- "Find bugs in this TypeScript module. Return a JSON array of {line, issue, fix}."
- + paste the module
- ```
+ For models we know well - Qwen, Nemotron, Granite, LLaMA, GLM, GPT-OSS - there's a curated profile built in with specific strengths and weaknesses. For everything else, the HuggingFace lookup fills the gaps. The cache refreshes every 7 days. Zero friction - `sql.js` is pure WASM, no native dependencies, no build tools needed.
 
- **Draft a commit message**
- ```
- "Write a concise commit message for this diff. One line summary, then bullet points."
- + paste the diff
- ```
+ ## What gets offloaded
 
- **Generate boilerplate**
- ```
- "Write a Jest test file for this React component. Cover the happy path and one error case."
- + paste the component
- ```
+ **Delegate to the local model** - bounded, well-defined tasks:
 
- **Extract structured data**
- ```
- "Extract all API endpoints from this Express router. Return as JSON: {method, path, handler}."
- + paste the router file
- ```
+ | Task | Why it works locally |
+ |------|---------------------|
+ | Generate test stubs | Clear input (source), clear output (tests) |
+ | Explain a function | Summarisation doesn't need tool access |
+ | Draft commit messages | Diff in, message out |
+ | Code review | Paste full source, ask for bugs |
+ | Convert formats | JSON to YAML, snake_case to camelCase |
+ | Generate mock data | Schema in, data out |
+ | Write type definitions | Source in, types out |
+ | Structured JSON output | Grammar-constrained, guaranteed valid |
+ | Text embeddings | Semantic search, RAG pipelines |
+ | Brainstorm approaches | Doesn't commit to anything |
+
+ **Keep on Claude** - anything that needs reasoning, tool access, or multi-step orchestration:
+
+ - Architectural decisions
+ - Reading/writing files
+ - Running tests and interpreting results
+ - Multi-file refactoring plans
+ - Anything that needs to call other tools
+
+ The tool descriptions are written to nudge Claude into planning delegation at the start of large tasks, not just using it when it happens to think of it.
+
+ ## Performance tracking
+
+ Every response includes a footer with real performance data - computed from the SSE stream, not from any proprietary API:
 
- **Translate formats**
  ```
- "Convert this JSON config to YAML. Return only the YAML, no explanation."
- + paste the JSON
+ Model: zai-org/glm-4.7-flash | 125->430 tokens | TTFT: 678ms, 48.7 tok/s, 12.5s
+ Session: 8,450 tokens offloaded across 14 calls
  ```
 
- **Brainstorm before committing to an approach**
+ The `discover` tool shows per-model averages across the session:
+
  ```
- "I need to add caching to this API client. List 3 approaches with trade-offs. Be brief."
- + paste the client code
+ Performance (this session):
+   nvidia/nemotron-3-nano: 6 calls, avg TTFT 234ms, avg 45.2 tok/s
+   zai-org/glm-4.7-flash: 8 calls, avg TTFT 678ms, avg 48.7 tok/s
  ```
 
+ In practice, Claude delegates more aggressively the longer a session runs. After about 5,000 offloaded tokens, it starts hunting for more work to push over. It's a reinforcing loop.
+
+ ## Model routing
+
+ If you've got multiple models loaded (or downloaded), houtini-lm picks the best one for each task automatically. Each model family has per-family prompt hints - temperature, output constraints, and think-block flags - so GLM gets told "no preamble, no step-by-step reasoning" while Qwen Coder gets a low temperature for focused code output.
+
+ The routing layer scores loaded models against the task type (code, chat, analysis, embedding). If the best loaded model isn't ideal for the task, you'll see a suggestion in the response footer pointing to a better downloaded model. There's no runtime model swapping - loading a model takes minutes, so houtini-lm suggests rather than blocks.
+
+ Supported model families with curated prompt hints: GLM-4, Qwen3 Coder, Qwen3, LLaMA 3, Nemotron, Granite, GPT-OSS, Nomic Embed. Unknown models get sensible defaults.
+
  ## Tools
 
  ### `chat`
 
- The workhorse. Send a task, get an answer. Optional system persona if you want to steer the model's perspective.
+ The workhorse. Send a task, get an answer. The description includes planning triggers that nudge Claude to identify offloadable work when it's starting a big task.
 
  | Parameter | Required | Default | What it does |
  |-----------|----------|---------|-------------|
  | `message` | yes | - | The task. Be specific about output format. |
- | `system` | no | - | Persona - "Senior TypeScript dev", not "helpful assistant" |
+ | `system` | no | - | Persona - "Senior TypeScript dev" not "helpful assistant" |
  | `temperature` | no | 0.3 | 0.1 for code, 0.3 for analysis, 0.7 for creative |
  | `max_tokens` | no | 2048 | Lower for quick answers, higher for generation |
-
- **Quick factual question:**
- ```json
- {
-   "message": "What HTTP status code means 'too many requests'? Just the number and name.",
-   "max_tokens": 50
- }
- ```
-
- **Code explanation with persona:**
- ```json
- {
-   "message": "Explain this function. What does it do, what are the edge cases?\n\n```ts\nfunction debounce(fn, ms) { ... }\n```",
-   "system": "Senior TypeScript developer"
- }
- ```
+ | `json_schema` | no | - | Force structured JSON output conforming to a schema |
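
A minimal call needs only `message`; this example is carried over from the previous README and still fits the parameters above:

```json
{
  "message": "What HTTP status code means 'too many requests'? Just the number and name.",
  "max_tokens": 50
}
```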
 
  ### `custom_prompt`
 
- Three-part prompt: system, context, instruction. Keeping them separate stops context bleed - you'll get better results than stuffing everything into one message, especially with smaller models.
+ Three-part prompt: system, context, instruction. Keeping them separate prevents context bleed - consistently outperforms stuffing everything into one message, especially with local models. I tested this properly one weekend - took the same batch of review tasks and ran them both ways. Splitting things into three parts won every round.
 
  | Parameter | Required | Default | What it does |
  |-----------|----------|---------|-------------|
  | `instruction` | yes | - | What to produce. Under 50 words works best. |
- | `system` | no | - | Persona, specific and under 30 words |
+ | `system` | no | - | Persona + constraints, under 30 words |
  | `context` | no | - | Complete data to analyse. Never truncate. |
  | `temperature` | no | 0.3 | 0.1 for review, 0.3 for analysis |
  | `max_tokens` | no | 2048 | Match to expected output length |
-
- **Code review:**
- ```json
- {
-   "system": "Expert Node.js developer focused on error handling and edge cases.",
-   "context": "< full source code here >",
-   "instruction": "List the top 3 bugs as bullet points. For each: line number, what's wrong, how to fix it."
- }
- ```
-
- **Compare two implementations:**
- ```json
- {
-   "system": "Performance-focused Python developer.",
-   "context": "Implementation A:\n...\n\nImplementation B:\n...",
-   "instruction": "Which is faster for 10k+ items? Why? One paragraph."
- }
- ```
+ | `json_schema` | no | - | Force structured JSON output |
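
The three-part shape in practice - a code-review call, carried over from the previous README's example:

```json
{
  "system": "Expert Node.js developer focused on error handling and edge cases.",
  "context": "< full source code here >",
  "instruction": "List the top 3 bugs as bullet points. For each: line number, what's wrong, how to fix it."
}
```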
 
  ### `code_task`
 
- Built specifically for code analysis. Wraps your request with an optimised code-review system prompt and drops the temperature to 0.2 so the model stays focused.
+ Built for code analysis. Pre-configured system prompt with temperature and output constraints tuned per model family via the routing layer.
 
  | Parameter | Required | Default | What it does |
  |-----------|----------|---------|-------------|
@@ -159,41 +206,72 @@ Built specifically for code analysis. Wraps your request with an optimised code-
  | `language` | no | - | "typescript", "python", "rust", etc. |
  | `max_tokens` | no | 2048 | Match to expected output length |
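
For example (carried over from the 2.3.0 README's example call):

```json
{
  "code": "< full source file >",
  "task": "Find bugs and suggest improvements. Reference line numbers.",
  "language": "typescript"
}
```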
161
208
 
209
+ ### `embed`
210
+
211
+ Generate text embeddings via the OpenAI-compatible `/v1/embeddings` endpoint. Requires an embedding model to be available - Nomic Embed is a solid choice. Returns the vector, dimension count, and usage stats.
212
+
213
+ | Parameter | Required | Default | What it does |
214
+ |-----------|----------|---------|-------------|
215
+ | `input` | yes | - | Text to embed |
216
+ | `model` | no | auto | Embedding model ID |
217
+
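
A sketch of an `embed` call - the model ID here is purely illustrative; leave `model` out and routing picks an available embedding model:

```json
{
  "input": "How do I configure the retry policy?",
  "model": "nomic-embed-text"
}
```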
+ ### `discover`
+
+ Health check. Returns model name, context window, latency, capability profile, and cumulative session stats including per-model performance averages. Call before delegating if you're not sure the LLM's available.
+
+ ### `list_models`
+
+ Lists everything on the LLM server - loaded and downloaded - with full metadata: architecture, quantisation, context window, capabilities, and HuggingFace enrichment data. Shows capability profiles describing what each model is best at, so Claude can make informed delegation decisions.
+
+ ## Structured JSON output
+
+ Both `chat` and `custom_prompt` accept a `json_schema` parameter that forces the response to conform to a JSON Schema. LM Studio uses grammar-based sampling to guarantee valid output - no hoping the model remembers to close its brackets.
+
  ```json
  {
-   "code": "< full source file >",
-   "task": "Find bugs and suggest improvements. Reference line numbers.",
-   "language": "typescript"
+   "json_schema": {
+     "name": "code_review",
+     "schema": {
+       "type": "object",
+       "properties": {
+         "issues": {
+           "type": "array",
+           "items": {
+             "type": "object",
+             "properties": {
+               "line": { "type": "number" },
+               "severity": { "type": "string" },
+               "description": { "type": "string" }
+             },
+             "required": ["line", "severity", "description"]
+           }
+         }
+       },
+       "required": ["issues"]
+     }
+   }
  }
  ```
 
- ### `discover`
+ ## Getting good results from local models
 
- Checks if the local LLM's online. Returns the model name, context window size, and response latency. Typically under a second, or an offline status within 5 seconds if the host isn't reachable.
+ Qwen, Llama, Nemotron, GLM - they score brilliantly on coding benchmarks now. The gap between a good and bad result is almost always prompt quality, not model capability. I've spent a fair bit of time on this.
 
- No parameters. Call it before delegating if you're not sure the LLM's available.
+ **Send complete code.** Local models hallucinate details when you give them truncated input. If a file's too large, send the relevant function - not a snippet with `...` in the middle.
 
- ### `list_models`
+ **Be explicit about output format.** "Return a JSON array" or "respond in bullet points" - don't leave it open-ended. Smaller models need this.
 
- Lists everything loaded on the LLM server with context window sizes.
+ **Set a specific persona.** "Expert Rust developer who cares about memory safety" gets noticeably better results than "helpful assistant."
 
- ## How it works
+ **State constraints.** "No preamble", "reference line numbers", "max 5 bullet points" - tell the model what *not* to do as well as what to do.
 
- ```
- Claude ──MCP──> houtini-lm ──HTTP/SSE──> LM Studio (or any OpenAI-compatible API)
-
- ├─ Streaming: tokens arrive incrementally via SSE
- ├─ Soft timeout: returns partial results at 55s
- └─ Graceful failure: returns "offline" if host unreachable
- ```
+ **Include surrounding context.** For code generation, send imports, types, and function signatures - not just the function body.
 
- All inference calls use Server-Sent Events streaming (since v2.3.0). In practice, this means:
+ **One call at a time.** If your LLM server runs a single model, parallel calls queue up and stack timeouts. Send them sequentially.
 
- - Tokens arrive as they're generated, keeping the connection alive
- - If generation takes longer than 55 seconds, you get a partial result instead of a timeout error - the footer shows `⚠ TRUNCATED` when this happens
- - If the host is off or unreachable, you get a clean "offline" message within 5 seconds instead of hanging
+ ## Think-block stripping
 
- The 55-second soft timeout exists because the MCP SDK has a hard ~60s client-side timeout. Without streaming, any response that took longer than 60 seconds just vanished. Now you get whatever the model managed to generate before the deadline.
+ Some models - GLM Flash, Nemotron, and others - always emit `<think>...</think>` reasoning blocks before the actual answer. Houtini-lm strips these automatically so Claude gets clean output without wasting time parsing the model's internal chain-of-thought. You still get the benefit of the reasoning (better answers), just without the noise.
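
The stripping behaviour can be sketched in a few lines - this is an illustrative TypeScript sketch, not the actual houtini-lm implementation (function name and regex are assumptions):

```typescript
// Hypothetical sketch of think-block stripping - not the real houtini-lm code.
// Removes every <think>...</think> span, including multi-line reasoning,
// then trims the leading whitespace that removal leaves behind.
function stripThinkBlocks(text: string): string {
  return text.replace(/<think>[\s\S]*?<\/think>/g, "").trimStart();
}
```

The non-greedy `[\s\S]*?` matches across newlines (unlike `.` without flags), so multi-line reasoning blocks are removed whole.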
 
  ## Configuration
 
@@ -204,30 +282,40 @@ The 55-second soft timeout exists because the MCP SDK has a hard ~60s client-sid
  | `LM_STUDIO_PASSWORD` | *(none)* | Bearer token for authenticated endpoints |
  | `LM_CONTEXT_WINDOW` | `100000` | Fallback context window if the API doesn't report it |
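
Putting the variables together, one plausible `claude_desktop_config.json` entry - the URL is an example for a remote GPU box, and the shape follows the standard MCP `mcpServers` format:

```json
{
  "mcpServers": {
    "houtini-lm": {
      "command": "npx",
      "args": ["-y", "@houtini/lm"],
      "env": {
        "LM_STUDIO_URL": "http://192.168.1.50:1234"
      }
    }
  }
}
```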
 
- ## Getting good results
-
- **Send complete code.** Local models hallucinate details when you give them truncated input. If a file's too large, send the relevant function - not a snippet with `...` in the middle.
-
- **Be explicit about output format.** "Return a JSON array" or "respond in bullet points" - don't leave it open-ended. Smaller models especially need this.
-
- **One call at a time.** If your LLM server runs a single model, parallel calls queue up and stack timeouts. Send them sequentially.
-
- **Match max_tokens to expected output.** 200 for quick answers, 500 for explanations, 2048 for code generation. Lower values mean faster responses.
-
- **Set a specific persona.** "Expert Rust developer who cares about memory safety" gets noticeably better results than "helpful assistant" (or no persona at all).
-
  ## Compatible endpoints
 
  Works with anything that speaks the OpenAI `/v1/chat/completions` API:
 
  | What | URL | Notes |
  |------|-----|-------|
- | [LM Studio](https://lmstudio.ai) | `http://localhost:1234` | Default, zero config |
+ | [LM Studio](https://lmstudio.ai) | `http://localhost:1234` | Default, zero config. Rich metadata via v0 API. |
  | [Ollama](https://ollama.com) | `http://localhost:11434` | Set `LM_STUDIO_URL` |
  | [vLLM](https://docs.vllm.ai) | `http://localhost:8000` | Native OpenAI API |
  | [llama.cpp](https://github.com/ggml-org/llama.cpp) | `http://localhost:8080` | Server mode |
+ | [DeepSeek](https://platform.deepseek.com) | `https://api.deepseek.com` | 28c/M input tokens |
+ | [Groq](https://groq.com) | `https://api.groq.com/openai` | ~750 tok/s |
+ | [Cerebras](https://cerebras.ai) | `https://api.cerebras.ai` | ~3000 tok/s |
  | Any OpenAI-compatible API | Any URL | Set URL + password |
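
For Ollama, for example, the quick-start pattern applies unchanged - only the URL differs:

```bash
claude mcp add houtini-lm -e LM_STUDIO_URL=http://localhost:11434 -- npx -y @houtini/lm
```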
 
+ ## Streaming and timeouts
+
+ All inference uses Server-Sent Events streaming. Tokens arrive incrementally, keeping the connection alive. If generation takes longer than 55 seconds, you get a partial result instead of a timeout error - the footer shows `TRUNCATED` when this happens.
+
+ The 55-second soft timeout exists because the MCP SDK has a hard ~60s client-side timeout. Without streaming, any response that took longer than 60 seconds just vanished. Not ideal.
+
+ ## Architecture
+
+ ```
+ index.ts          Main MCP server - tools, streaming, session tracking
+ model-cache.ts    SQLite-backed model profile cache (sql.js / WASM)
+                   Auto-profiles models via HuggingFace API at startup
+                   Persists to ~/.houtini-lm/model-cache.db
+
+ Inference:        POST /v1/chat/completions (OpenAI-compatible, works everywhere)
+ Model metadata:   GET /api/v0/models (LM Studio, falls back to /v1/models)
+ Embeddings:       POST /v1/embeddings (OpenAI-compatible)
+ ```
+
  ## Development
 
  ```bash