prism-mcp-server 19.0.1 → 19.1.0
- package/README.md +30 -8
- package/dist/tools/prismInferHandler.js +83 -23
- package/dist/utils/entitlements.js +1 -1
- package/dist/utils/modelPicker.js +13 -13
- package/dist/utils/qualityGate.js +43 -0
- package/dist/utils/thinkStrip.js +26 -0
- package/package.json +2 -2
package/README.md
CHANGED
@@ -11,7 +11,7 @@
 <img src="docs/v11_hivemind_multi_agent_dashboard.jpg" alt="Prism Coder — Mind Palace Dashboard with Knowledge Graph and Multi-Agent Hivemind" width="700" />
 </p>

-Prism Coder is an [MCP server](https://modelcontextprotocol.io) that gives Claude, Cursor, and other AI tools long-term memory that survives across sessions. It ships with the open-weight `prism-coder` model fleet (2B–
+Prism Coder is an [MCP server](https://modelcontextprotocol.io) that gives Claude, Cursor, and other AI tools long-term memory that survives across sessions. It ships with the open-weight `prism-coder` model fleet (2B–27B) for fast, offline tool-routing — no cloud required.

 **No account needed. No API keys. Runs on your machine.**
 A paid subscription adds cloud sync, higher model tiers, and team features through the [Synalux portal](https://synalux.ai).
@@ -41,7 +41,7 @@ Open Claude Desktop or Cursor and your agent now has memory backed by a local SQ
 ollama pull dcostenco/prism-coder:2b # 2.3 GB · mobile / lightweight (99.1% routing accuracy)
 ollama pull dcostenco/prism-coder:4b # 3.4 GB · verifier (100% accuracy)
 ollama pull dcostenco/prism-coder:9b # 5.8 GB · default router (100% accuracy, Qwen3.5)
-ollama pull dcostenco/prism-coder:
+ollama pull dcostenco/prism-coder:27b # 16 GB · complex tasks (100% accuracy)
 ```

 Prism detects both the namespaced (`dcostenco/prism-coder:9b`) and bare (`prism-coder:9b`) Ollama tags automatically.
@@ -145,14 +145,16 @@ The free tier runs entirely on your machine. Paid tiers add cloud sync through t

 ## Models

-The `prism-coder` fleet uses Qwen3.5 for MCP tool-routing. The 9B
+The `prism-coder` fleet uses Qwen3.5 for MCP tool-routing AND general inference. The 9B and 27B are fine-tuned with LoRA (r=128, all 64 layers including DeltaNet); the 2B and 4B use stock Qwen3.5-4B at different quantization levels. The 27B scored 100% on BFCL function-calling and 100% on an internal 15-problem coding eval at $0 inference cost.
+
+`prism_infer` supports three modes: `route` (tool routing, fast, nothink), `chat` (conversation with thinking), and `code` (code generation with thinking). In chat/code modes, the model uses `<think>` blocks for chain-of-thought reasoning, which are stripped before the response is served. If the local model fails a quality gate (empty, think-only, or truncated), paid tiers automatically escalate to Claude via the Synalux portal.

 | Model | Ollama tag | Size | [BFCL](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v3_multi_turn.html) Accuracy | Role | Tier |
 |---|---|---|---|---|---|
 | Qwen3.5-4B Q3_K_M | `prism-coder:2b` | 2.3 GB | 99.1% × 3 seeds | iPhone / mobile first gate | Free |
 | Qwen3.5-4B Q4_K_M | `prism-coder:4b` | 3.4 GB | 100% × 3 seeds | Verifier | Free |
 | Qwen3.5-9B (LoRA) | `prism-coder:9b` | 5.8 GB | 100% × 3 seeds | Default router | Standard+ |
-|
+| Qwen3.5-27B (LoRA) | `prism-coder:27b` | 16 GB | 100% × 3 seeds | Quality tier (DeltaNet, 28.5 tok/s) | Advanced+ |

 Weights: [huggingface.co/dcostenco](https://huggingface.co/dcostenco) (public GGUF). Latency depends on model size and hardware — see [Benchmarks](#benchmarks) to measure it on your own machine rather than trusting a printed number.

@@ -162,7 +164,7 @@ Weights: [huggingface.co/dcostenco](https://huggingface.co/dcostenco) (public GG
 query → prism-coder:9b (local router, default)
       → prism-coder:4b (grounding verifier)
       → prism-coder:2b (iPhone / mobile, auto-selected by RAM)
-      → prism-coder:
+      → prism-coder:27b (complex tasks, on demand)
       → cloud fallback (paid tiers, for max quality)
 ```

@@ -189,7 +191,7 @@ Fail-closed on the verified path: when the grounding verifier runs (Standard tie
 ```bash
 git clone https://github.com/dcostenco/prism-coder && cd prism-coder
 pip install anthropic requests
-python3 tests/benchmarks/prism-routing-100/benchmark.py --models 2b 4b 9b
+python3 tests/benchmarks/prism-routing-100/benchmark.py --models 2b 4b 9b 27b
 ```

 **Routing eval (115 cases, 12 categories, 3-seed mean).** Routing accuracy includes the deterministic L3 correction layer — the same rules that run in production. On this narrow tool-routing task all fleet models achieve near-perfect accuracy. Be honest with yourself about what that means: the eval is **near-saturated** for this taxonomy — it measures whether the right one of a small set of MCP tools is selected, not general capability. The useful takeaway is **offline routing reliability at zero cost**, not that a 2.3 GB model rivals a frontier model in general.
@@ -197,7 +199,7 @@ python3 tests/benchmarks/prism-routing-100/benchmark.py --models 2b 4b 9b 32b
 | Model | Routing accuracy | Notes |
 |---|---|---|
 | prism-coder:2b (Q3_K_M) | 99.1% × 3 seeds | 1 failure: regex→knowledge_search |
-| prism-coder:4b / 9b /
+| prism-coder:4b / 9b / 27b | 100% × 3 seeds | Perfect on all 115 cases |
 | Claude (frontier, same eval) | ~98% | Stronger everywhere outside this narrow task |

 **Memory uplift (LoCoMo-Plus, self-published).** A separate long-context dialogue benchmark ([dcostenco/Locomo-Plus](https://github.com/dcostenco/Locomo-Plus)) measures how much structured memory helps a base model retain multi-day context. Results show large gains when a model is paired with Prism memory versus running raw. Note this benchmark is authored, run, and LLM-judged by this project — treat it as a reproducible demonstration, not an independent third-party result, and run it yourself with the commands in that repo.
@@ -254,7 +256,7 @@ All on-device models are free to run locally via Ollama on every tier. A subscri
 | | **Free** | **Standard** $19/mo | **Advanced** $49/mo | **Enterprise** $99/mo |
 |---|---|---|---|---|
 | Seats | 1 | 1 | up to 5 | up to 25 |
-| Local model ceiling | up to 4b | up to 9b | up to
+| Local model ceiling | up to 4b | up to 9b | up to 27b | up to 27b |
 | Daily cloud inference | -- | 200 | 2,000 | 100,000 |
 | Cloud Coder (Web IDE) | -- | 100/day | 1,000/day | 100,000/day |
 | Cloud search | -- | 50/day | 500/day | 100,000/day |
@@ -284,6 +286,26 @@ Prism exposes 40+ MCP tools. The core memory loop:
 | `session_detect_drift` | Detect when a session has drifted from its goal |
 | `verify_behavior` | Pre-edit scenario challenge — catch bad changes before they happen |
 | `knowledge_ingest` | Teach Prism a codebase or document |
+| `prism_infer` | Local-first inference (route/chat/code modes, thinking, cloud escalation) |
+
+### `prism_infer` — local-first inference with cloud escalation
+
+```typescript
+prism_infer({
+  prompt: "Write a binary search in Python",
+  mode: "code", // "route" | "chat" | "code"
+  think: true, // enable <think> reasoning (default: true for chat/code)
+  model_ceiling: "27b", // use the quality tier
+})
+// → 27B generates code locally ($0), with thinking for quality
+// → If quality gate fails + paid tier → auto-escalate to Claude
+```
+
+| Mode | Think | Model | Use case |
+|------|-------|-------|----------|
+| `route` | Off (fast) | 9B default | MCP tool routing |
+| `chat` | On | 27B preferred | Conversation, reasoning |
+| `code` | On | 27B preferred | Code generation, debugging |

 Full TypeScript signatures live in [`src/tools/`](src/tools/); architecture in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md).

package/dist/tools/prismInferHandler.js
CHANGED
@@ -2,7 +2,7 @@
  * prism_infer — local-first inference tool
  * ─────────────────────────────────────────────────────────────
  * Save the caller's cloud tokens by routing to a local prism-coder
- * model via Ollama. Tiers (
+ * model via Ollama. Tiers (27B/9B/8B/1.7B) auto-selected by free
  * RAM, then capped by `model_ceiling` and the set of tags that are
  * actually pulled into Ollama.
  *
@@ -12,7 +12,7 @@
  * 4. On local fail, if cloud_fallback=true:
  * - exchange synalux_sk_ → JWT (cached)
  * - POST synalux portal /api/v1/prism-aac/inference
- * - portal runs its own cascade (9B/
+ * - portal runs its own cascade (9B/27B/Claude by tier)
  * 5. Return { output, backend, model_picked, ram_free_mb, latency_ms, used_cloud }
  *
  * `prism_infer` is a thin client. It never calls Anthropic / OpenRouter
@@ -26,13 +26,15 @@ import { PRISM_SYNALUX_BASE_URL, PRISM_LOCAL_LLM_URL, } from "../config.js";
 import { debugLog } from "../utils/logger.js";
 import { getEntitlements, clampCeiling } from "../utils/entitlements.js";
 import { ddLog } from "../utils/ddLogger.js";
+import { stripThink } from "../utils/thinkStrip.js";
+import { passesQualityGate } from "../utils/qualityGate.js";
 // ─── Tool Definition ────────────────────────────────────────────
 export const PRISM_INFER_TOOL = {
 name: "prism_infer",
 description: "Run an inference on a local prism-coder model (Ollama) to save cloud tokens. " +
-"Picks the largest viable tier —
+"Picks the largest viable tier — 27B / 9B / 8B / 1.7B — based on free RAM at call time, " +
 "clamped by `model_ceiling` and what is actually pulled in Ollama. " +
-"Falls through to the synalux portal cloud cascade (9B →
+"Falls through to the synalux portal cloud cascade (9B → 27B → Claude Opus 4.7) " +
 "only when local is unviable AND `cloud_fallback=true`. " +
 "Use this for code generation, summarisation, classification, or any synth task you would " +
 "otherwise hand to the cloud model — it costs $0 when the local hit succeeds.",
@@ -59,8 +61,8 @@ export const PRISM_INFER_TOOL = {
 },
 model_ceiling: {
 type: "string",
-enum: ["
-description: "Cap the largest tier the picker may select. e.g. '9b' forbids
+enum: ["27b", "9b", "4b", "2b"],
+description: "Cap the largest tier the picker may select. e.g. '9b' forbids 27B even if RAM allows.",
 },
 cloud_fallback: {
 type: "boolean",
@@ -69,7 +71,7 @@ export const PRISM_INFER_TOOL = {
 },
 timeout_ms: {
 type: "number",
-description: "Override per-call timeout. Default scales with model size:
+description: "Override per-call timeout. Default scales with model size: 27B=120s, 9B=60s, 4B=20s, 1.7B=15s.",
 },
 evidence: {
 type: "array",
@@ -102,6 +104,20 @@ export const PRISM_INFER_TOOL = {
 description: "Override the verifier hard timeout. Default 2000 ms.",
 default: 2000,
 },
+mode: {
+type: "string",
+enum: ["route", "chat", "code"],
+description: "Execution mode. 'route' (default) for MCP tool routing — fast, nothink. " +
+"'chat' for general conversation — uses thinking, escalates to cloud on failure. " +
+"'code' for code generation — uses thinking, larger context. " +
+"In chat/code modes, prefers the 27B tier and enables <think> reasoning.",
+default: "route",
+},
+think: {
+type: "boolean",
+description: "Enable thinking mode (<think> blocks). Default: true for chat/code, false for route. " +
+"Thinking improves quality on complex tasks but adds latency (~2-5s).",
+},
 },
 required: ["prompt"],
 },
@@ -123,7 +139,12 @@ export function isPrismInferArgs(args) {
 if (a.timeout_ms !== undefined && typeof a.timeout_ms !== "number")
 return false;
 if (a.model_ceiling !== undefined &&
-!["
+!["27b", "9b", "4b", "2b"].includes(a.model_ceiling))
+return false;
+if (a.mode !== undefined &&
+!["route", "chat", "code"].includes(a.mode))
+return false;
+if (a.think !== undefined && typeof a.think !== "boolean")
 return false;
 if (a.verify !== undefined && typeof a.verify !== "boolean")
 return false;
@@ -146,7 +167,7 @@ export function isPrismInferArgs(args) {
 }
 // ─── Ollama helpers ────────────────────────────────────────────
 const DEFAULT_TIMEOUTS = {
-"prism-coder:
+"prism-coder:27b": 120_000,
 "prism-coder:9b": 60_000,
 "prism-coder:4b": 20_000,
 "prism-coder:2b": 15_000,
@@ -193,16 +214,20 @@ export async function listOllamaLoaded(url = PRISM_LOCAL_LLM_URL) {
 return new Set();
 }
 }
-async function callOllamaGenerate(url, model, prompt, system, maxTokens, temperature, timeoutMs) {
+async function callOllamaGenerate(url, model, prompt, system, maxTokens, temperature, timeoutMs, think) {
 try {
+const messages = [];
+if (system)
+messages.push({ role: "system", content: system });
+messages.push({ role: "user", content: prompt });
 const body = {
 model,
-
-...(system ? { system } : {}),
+messages,
 stream: false,
+...(think !== undefined ? { think } : {}),
 options: { num_predict: maxTokens, temperature },
 };
-const res = await fetch(`${url}/api/
+const res = await fetch(`${url}/api/chat`, {
 method: "POST",
 headers: { "Content-Type": "application/json" },
 body: JSON.stringify(body),
@@ -214,10 +239,10 @@ async function callOllamaGenerate(url, model, prompt, system, maxTokens, tempera
 const data = (await res.json());
 if (data.error)
 return { ok: false, reason: `ollama_err:${data.error}` };
-const text = (data.
+const text = (data.message?.content ?? "").trim();
 if (!text)
 return { ok: false, reason: "empty_response" };
-return { ok: true, text };
+return { ok: true, text, doneReason: data.done_reason };
 }
 catch (err) {
 const name = err instanceof Error ? err.name : "Unknown";
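The two hunks above move `callOllamaGenerate` from a single prompt string to a `messages` array posted to `/api/chat`, and read the reply from `data.message.content` together with `done_reason`. The sketch below restates that request and response shape using only the fields visible in the diff. `buildChatBody` and `parseChatResponse` are hypothetical names chosen for illustration; they are not functions in the package.

```javascript
// Sketch only: helper names are hypothetical. Field names mirror the diff.
function buildChatBody(model, prompt, system, maxTokens, temperature, think) {
  const messages = [];
  if (system) messages.push({ role: "system", content: system });
  messages.push({ role: "user", content: prompt });
  return {
    model,
    messages,
    stream: false,
    // `think` is only sent when the caller chose a value, as in the diff.
    ...(think !== undefined ? { think } : {}),
    options: { num_predict: maxTokens, temperature },
  };
}

// Mirrors the response handling in the second hunk.
function parseChatResponse(data) {
  if (data.error) return { ok: false, reason: `ollama_err:${data.error}` };
  const text = (data.message?.content ?? "").trim();
  if (!text) return { ok: false, reason: "empty_response" };
  return { ok: true, text, doneReason: data.done_reason };
}

const body = buildChatBody("prism-coder:9b", "List the open TODOs", undefined, 256, 0.2, false);
console.log(JSON.stringify(body));

const parsed = parseChatResponse({
  message: { role: "assistant", content: " done " },
  done_reason: "stop",
});
console.log(parsed);
```

Keeping the body construction separate from the `fetch` call, as here, makes the request shape easy to check without a running Ollama instance.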
@@ -279,8 +304,11 @@ export async function runInfer(args, deps) {
 // Fetch user's plan limits (cached 1hr). Free users without auth
 // get 4b ceiling, 50 calls/day, 512 max tokens.
 const ent = deps.entitlements ?? await getEntitlements();
-//
-
+// MF2: In chat/code modes, request the 27B tier (subject to plan ceiling + RAM).
+// mode:"code" implies quality → start higher in the cascade.
+const mode = args.mode ?? "route";
+const modeCeiling = (mode === "chat" || mode === "code") ? (args.model_ceiling ?? "27b") : args.model_ceiling;
+const effectiveCeiling = clampCeiling(modeCeiling, ent.model_ceiling);
 // Clamp max_tokens to plan limit
 const maxTokens = Math.min(args.max_tokens ?? 1024, ent.max_tokens, 8192);
 // Cloud fallback only for paid plans
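The hunk above, together with the later `enableThink` line, determines two defaults from `mode`: which ceiling is requested and whether thinking is enabled. The sketch below collects those two expressions in one place. `resolveModeDefaults` is a hypothetical helper, not part of the package, and it stops before the plan clamp (`clampCeiling`) and the RAM check that the handler applies afterwards.

```javascript
// Hypothetical helper restating two expressions from runInfer.
function resolveModeDefaults(args) {
  const mode = args.mode ?? "route";
  const wantsQuality = mode === "chat" || mode === "code";
  return {
    mode,
    // chat/code request "27b" unless the caller set a ceiling;
    // route passes the caller's ceiling through unchanged (possibly undefined).
    requestedCeiling: wantsQuality ? (args.model_ceiling ?? "27b") : args.model_ceiling,
    // An explicit `think` wins; otherwise thinking is on for everything except route.
    think: args.think ?? (mode !== "route"),
  };
}

console.log(resolveModeDefaults({ prompt: "pick a tool" }));
console.log(resolveModeDefaults({ prompt: "write a parser", mode: "code" }));
console.log(resolveModeDefaults({ prompt: "hi", mode: "chat", model_ceiling: "9b", think: false }));
```

Note that the requested ceiling is only a request: a Free plan would still be clamped to its own ceiling in the next step.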
@@ -326,16 +354,16 @@ export async function runInfer(args, deps) {
 // Walk the tier table top → bottom, capped by model_ceiling. Each tier
 // logs its skip reason ("not_pulled" / "ram_insufficient" / fail reason)
 // so the caller can see exactly why each tier was bypassed.
+let localDraft = null;
 if (installed) {
-// Find start index from ceiling — if no ceiling, start at the top (32B).
 const ceilStart = effectiveCeiling
 ? Math.max(0, MODEL_TIERS.findIndex(t => t.tag.endsWith(`:${effectiveCeiling}`)))
 : 0;
 let anyViable = false;
 for (let i = ceilStart; i < MODEL_TIERS.length; i++) {
 const tier = MODEL_TIERS[i];
-// Accept the tier whether Ollama reports it as bare (`prism-coder:
-// or namespaced (`dcostenco/prism-coder:
+// Accept the tier whether Ollama reports it as bare (`prism-coder:27b`)
+// or namespaced (`dcostenco/prism-coder:27b`, the form `ollama pull`
 // produces from a HF repo). resolveOllamaName returns the actual
 // name Ollama knows so /api/generate finds the model.
 const ollamaName = resolveOllamaName(tier.tag, installed);
@@ -352,9 +380,27 @@ export async function runInfer(args, deps) {
 }
 anyViable = true;
 const timeout = args.timeout_ms ?? DEFAULT_TIMEOUTS[tier.tag] ?? 60_000;
-const
+const enableThink = args.think ?? (mode !== "route");
+const result = await deps.callLocal(deps.ollamaUrl, ollamaName, args.prompt, args.system, maxTokens, temperature, timeout, enableThink);
 if (result.ok) {
-
+const { stripped, thinkOnly } = stripThink(result.text);
+const output = stripped;
+// Quality gate for chat/code modes
+if (mode !== "route") {
+const gate = passesQualityGate(output, thinkOnly, result.doneReason);
+if (!gate.pass && allowCloud) {
+debugLog(`[prism_infer] quality gate FAIL (${gate.reason}) — escalating to cloud`);
+attempts.push({ tier: tier.tag, reason: `quality_gate:${gate.reason}` });
+if (gate.reason === "hard_truncation" || gate.reason === "loop_detected") {
+localDraft = { output, tier: tier.tag };
+}
+break;
+}
+if (!gate.pass) {
+debugLog(`[prism_infer] quality gate FAIL (${gate.reason}) — no cloud, serving local`);
+}
+}
+return await applyVerification(output, gatedArgs, deps, {
 backend: `ollama-${tier.tag.replace("prism-coder:", "")}`,
 model_picked: tier.tag,
 ram_free_mb: ramFreeMb,
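The branch structure in this hunk is easier to follow as a small decision function. The sketch below is an illustration under stated assumptions: `decideAfterLocal` and `afterCloudFailure` are hypothetical names, and they model only the control flow visible in this hunk and the following one, not the verification step or the cloud call itself.

```javascript
// Hypothetical restatement of the post-inference branches in runInfer.
function decideAfterLocal(mode, gate, allowCloud) {
  // Route mode skips the gate; a passing gate serves the local output.
  if (mode === "route" || gate.pass) return { action: "serve_local", keepDraft: false };
  // Gate failed but cloud is not allowed: the local output is served anyway.
  if (!allowCloud) return { action: "serve_local", keepDraft: false };
  // Gate failed and cloud is allowed: escalate. A draft is kept only for
  // truncated or looping output, which still has some usable content.
  const keepDraft = gate.reason === "hard_truncation" || gate.reason === "loop_detected";
  return { action: "escalate_to_cloud", keepDraft };
}

// If the cloud attempt also fails, a kept draft is served and flagged;
// otherwise the handler throws with the list of attempts.
function afterCloudFailure(localDraft) {
  return localDraft
    ? { action: "serve_draft", quality_gate_failed: true }
    : { action: "throw" };
}

console.log(decideAfterLocal("code", { pass: false, reason: "hard_truncation" }, true));
console.log(decideAfterLocal("chat", { pass: false, reason: "think_only" }, true));
console.log(afterCloudFailure({ output: "partial answer", tier: "prism-coder:27b" }));
```

One consequence of this flow is that `think_only` and `empty_response` failures never leave a draft behind, so a failed cloud call after either of them ends in an error instead of a degraded local answer.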
@@ -392,7 +438,20 @@ export async function runInfer(args, deps) {
 else {
 attempts.push({ tier: "synalux", reason: "cloud_fallback_disabled" });
 }
-//
+// Cloud also failed — serve the local draft if we have one
+if (localDraft) {
+debugLog(`[prism_infer] cloud failed, serving gate-failed local draft from ${localDraft.tier}`);
+return await applyVerification(localDraft.output, gatedArgs, deps, {
+backend: `ollama-${localDraft.tier.replace("prism-coder:", "")}`,
+model_picked: localDraft.tier,
+ram_free_mb: ramFreeMb,
+latency_ms: Date.now() - t0,
+used_cloud: false,
+attempts,
+plan: ent.plan,
+quality_gate_failed: true,
+});
+}
 const err = new Error(`prism_infer: no backend produced output. attempts=${JSON.stringify(attempts)}, free=${fmtGb(freeBytes)}`);
 err.attempts = attempts;
 throw err;
@@ -450,6 +509,7 @@ export async function prismInferHandler(args) {
 ` free_ram=${result.ram_free_mb}MB` +
 ` latency=${result.latency_ms}ms` +
 ` used_cloud=${result.used_cloud}` +
+(result.quality_gate_failed ? ` quality_gate_failed=true` : "") +
 (result.verification ? ` verify=${result.verification.action}` : "") +
 (result.attempts.length ? ` attempts=${JSON.stringify(result.attempts)}` : "");
 return {
package/dist/utils/entitlements.js
CHANGED
@@ -32,7 +32,7 @@ const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes
 let cache = null;
 let inFlight = null;
 // ── Model tier ordering for ceiling enforcement ───────────────────
-const TIER_ORDER = ["2b", "4b", "9b", "
+const TIER_ORDER = ["2b", "4b", "9b", "27b"];
 /**
  * Returns true if `requested` exceeds `ceiling`.
  * e.g. ceilingExceeded("9b", "4b") → true (9b > 4b ceiling)
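Only `TIER_ORDER` and the doc comment for `ceilingExceeded` appear in this hunk; the function bodies are not part of the diff. The sketch below shows one way an ordered tier list can drive ceiling checks. Both function bodies are assumptions, and `clampCeilingSketch` in particular is a guess at what the imported `clampCeiling` might do, including how it treats an undefined request.

```javascript
const TIER_ORDER = ["2b", "4b", "9b", "27b"];

// Assumed body: the diff shows only the doc comment, which gives
// ceilingExceeded("9b", "4b") → true as its example.
function ceilingExceeded(requested, ceiling) {
  return TIER_ORDER.indexOf(requested) > TIER_ORDER.indexOf(ceiling);
}

// Hypothetical stand-in for clampCeiling(requested, planCeiling).
// Assumption: with no request, the plan ceiling applies as-is.
function clampCeilingSketch(requested, planCeiling) {
  if (requested === undefined) return planCeiling;
  return ceilingExceeded(requested, planCeiling) ? planCeiling : requested;
}

console.log(ceilingExceeded("9b", "4b"));
console.log(clampCeilingSketch("27b", "9b"));
console.log(clampCeilingSketch("4b", "27b"));
```

Comparing positions in an ordered array avoids parsing the numeric part of the tag, which is why adding a tier is a one-line change to `TIER_ORDER` in this release.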
package/dist/utils/modelPicker.js
CHANGED
@@ -1,19 +1,19 @@
 /**
  * RAM-Gated Local Model Picker
  * ─────────────────────────────────────────────────────────────
- * Cascade: 9b (default) → 4b (verifier) → 2b (mobile) →
+ * Cascade: 9b (default) → 4b (verifier) → 2b (mobile) → 27b (quality).
  *
- * The default ceiling is "9b" — NOT "
+ * The default ceiling is "9b" — NOT "27b". This means:
  * - 9b is the primary model for routing + general inference (Qwen3.5-9B, 100% BFCL)
  * - 4b is used as the grounding verifier (fast, small)
  * - 2b is the mobile/iPhone first gate (Qwen3.5-2B, 99.1% BFCL)
- * -
+ * - 27b is only loaded when caller explicitly passes ceiling="27b"
  * or when the task requires maximum quality (complex code gen, etc.)
  *
- * This saves
+ * This saves 11GB+ RAM vs 27b and keeps response times fast.
  *
  * tag               weights    need free   ctx   role
- * prism-coder:
+ * prism-coder:27b   ~16 GB     ≥ 20 GB     32K   quality (on-demand, Qwen3.5 DeltaNet, 100% BFCL)
  * prism-coder:9b    ~ 5.8 GB   ≥ 8 GB      32K   default router (Qwen3.5, 100% BFCL)
  * prism-coder:4b    ~ 3.4 GB   ≥ 5 GB      32K   verifier (Qwen3.5, 100%)
  * prism-coder:2b    ~ 2.3 GB   ≥ 3 GB      8K    mobile / iPhone (Qwen3.5, 99.1%)
@@ -26,30 +26,30 @@ const GB = 1024 ** 3;
  * the first row whose minFreeGb fits within freeBytes.
  */
 export const MODEL_TIERS = [
-{ tag: 'prism-coder:
+{ tag: 'prism-coder:27b', weightsGb: 16, minFreeGb: 20, ctxTokens: 32_768 },
 { tag: 'prism-coder:9b', weightsGb: 5.8, minFreeGb: 8, ctxTokens: 32_768 },
 { tag: 'prism-coder:4b', weightsGb: 3.4, minFreeGb: 5, ctxTokens: 32_768 },
 { tag: 'prism-coder:2b', weightsGb: 2.3, minFreeGb: 3, ctxTokens: 8_192 },
 ];
 /**
  * True when `installed` matches `tierTag` either as a bare tag
- * (`prism-coder:
- * (`dcostenco/prism-coder:
- * dcostenco/prism-coder:
+ * (`prism-coder:27b`) or as a namespaced HuggingFace-style tag
+ * (`dcostenco/prism-coder:27b`). The README documents `ollama pull
+ * dcostenco/prism-coder:27b`, so Ollama's /api/tags returns the
  * namespaced form — without this matcher the picker would never
  * see them and silently fall through to cloud.
  */
 function tagMatches(installed, tierTag) {
 return installed === tierTag || installed.endsWith(`/${tierTag}`);
 }
-/** Default ceiling: 9b. Pass ceiling="
+/** Default ceiling: 9b. Pass ceiling="27b" explicitly for max quality. */
 export const DEFAULT_CEILING = "9b";
 /**
  * Pick the best viable tier for the given free RAM.
- * Default ceiling is 9b — use ceiling="
+ * Default ceiling is 9b — use ceiling="27b" only for complex tasks.
  *
  * @param freeBytes Result of os.freemem() — binary bytes
- * @param ceiling Cap tier. Default "9b". Pass "
+ * @param ceiling Cap tier. Default "9b". Pass "27b" for complex tasks.
  * @param available Optional whitelist of installed Ollama tags.
  */
 export function pickLocalModel(freeBytes, ceiling, available) {
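The body of `pickLocalModel` is outside this hunk, but its doc comment and the tier table describe the rule: start at the ceiling, walk down, and take the first installed tier whose `minFreeGb` fits in free RAM. The sketch below implements that rule as described. `pickTierSketch` is a hypothetical name and its body is an assumption; the tier table, `GB`, `DEFAULT_CEILING`, and `tagMatches` are copied from the diff.

```javascript
const GB = 1024 ** 3;
const MODEL_TIERS = [
  { tag: "prism-coder:27b", weightsGb: 16, minFreeGb: 20, ctxTokens: 32_768 },
  { tag: "prism-coder:9b", weightsGb: 5.8, minFreeGb: 8, ctxTokens: 32_768 },
  { tag: "prism-coder:4b", weightsGb: 3.4, minFreeGb: 5, ctxTokens: 32_768 },
  { tag: "prism-coder:2b", weightsGb: 2.3, minFreeGb: 3, ctxTokens: 8_192 },
];
const DEFAULT_CEILING = "9b";

function tagMatches(installed, tierTag) {
  return installed === tierTag || installed.endsWith(`/${tierTag}`);
}

// Assumed implementation of the documented rule, not the package's code.
function pickTierSketch(freeBytes, ceiling = DEFAULT_CEILING, available) {
  // Tiers are ordered largest first, so the ceiling marks where to start.
  const start = Math.max(0, MODEL_TIERS.findIndex(t => t.tag.endsWith(`:${ceiling}`)));
  for (const tier of MODEL_TIERS.slice(start)) {
    // Skip tiers that are not installed when a whitelist is given.
    if (available && !available.some(name => tagMatches(name, tier.tag))) continue;
    if (freeBytes >= tier.minFreeGb * GB) return tier;
  }
  return null; // nothing fits: the caller would fall through to cloud or fail
}

console.log(pickTierSketch(24 * GB)?.tag);
console.log(pickTierSketch(24 * GB, "27b")?.tag);
console.log(pickTierSketch(6 * GB)?.tag);
```

With 24 GB free the default call still returns the 9b tier, which is the point of the default ceiling: the 27b weights are only loaded when a caller asks for them.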
@@ -79,7 +79,7 @@ export function pickLocalModel(freeBytes, ceiling, available) {
 }
 /**
  * Resolve a tier tag to the actual Ollama name installed locally.
- * If `installed` contains a namespaced match (e.g. `dcostenco/prism-coder:
+ * If `installed` contains a namespaced match (e.g. `dcostenco/prism-coder:27b`),
  * the namespaced form is returned so Ollama's /api/generate finds it.
  * Falls back to the bare tag when only the bare form is present.
  */
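The doc comment above describes a name-resolution rule without showing the function body. A minimal sketch of that rule follows, assuming `installed` is an iterable of tag strings (the handler passes the result of listing Ollama models, which is a `Set` elsewhere in this diff). `resolveNameSketch` is a hypothetical name, and returning `null` for a missing model is an assumption. The comment still mentions `/api/generate`, although this release posts to `/api/chat`; the resolution rule is the same for either endpoint.

```javascript
function tagMatches(installed, tierTag) {
  return installed === tierTag || installed.endsWith(`/${tierTag}`);
}

// Assumed implementation of the documented behaviour.
function resolveNameSketch(tierTag, installed) {
  const names = [...installed];
  // Prefer a namespaced match such as "dcostenco/prism-coder:27b".
  const namespaced = names.find(n => n !== tierTag && tagMatches(n, tierTag));
  if (namespaced) return namespaced;
  // Fall back to the bare tag when only that form is installed.
  return names.includes(tierTag) ? tierTag : null;
}

const installedTags = new Set(["dcostenco/prism-coder:27b", "prism-coder:9b"]);
console.log(resolveNameSketch("prism-coder:27b", installedTags));
console.log(resolveNameSketch("prism-coder:9b", installedTags));
console.log(resolveNameSketch("prism-coder:4b", installedTags));
```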
package/dist/utils/qualityGate.js
ADDED
@@ -0,0 +1,43 @@
+/**
+ * Quality Gate — deterministic check for obvious inference failures.
+ *
+ * NARROW by design: only high-precision signals that rarely false-positive.
+ * Does NOT judge correctness — that's the grounding verifier's job.
+ * Does NOT use refusal regex (too many false positives on legitimate output).
+ *
+ * Returns: { pass: boolean, reason?: string }
+ */
+/**
+ * Check if a model response passes the quality gate.
+ * @param stripped Response AFTER think-stripping (use stripThink first)
+ * @param thinkOnly True if the response was only <think> blocks with no answer
+ * @param finishReason Ollama's finish_reason if available (e.g. "length" = truncated)
+ */
+export function passesQualityGate(stripped, thinkOnly, finishReason) {
+    // Signal 1: Think-only — model reasoned but produced no answer (check before empty)
+    if (thinkOnly) {
+        return { pass: false, reason: "think_only" };
+    }
+    // Signal 2: Empty or near-empty after stripping
+    if (stripped.trim().length < 5) {
+        return { pass: false, reason: "empty_response" };
+    }
+    // Signal 3: Hard truncation — Ollama reports finish_reason="length"
+    // meaning the model hit num_predict before finishing
+    if (finishReason === "length") {
+        return { pass: false, reason: "hard_truncation" };
+    }
+    // Signal 4: Exact-loop — same sentence repeated 3+ times
+    const sentences = stripped.split(/[.!?\n]+/).map(s => s.trim()).filter(s => s.length > 10);
+    if (sentences.length >= 6) {
+        const counts = new Map();
+        for (const s of sentences) {
+            const key = s.toLowerCase();
+            counts.set(key, (counts.get(key) ?? 0) + 1);
+            if ((counts.get(key) ?? 0) >= 3) {
+                return { pass: false, reason: "loop_detected" };
+            }
+        }
+    }
+    return { pass: true };
+}
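To make the four signals concrete, the example below copies `passesQualityGate` from the file above (with the `export` keyword dropped so the snippet runs on its own) and feeds it one input per outcome. The sample strings are invented for illustration. The handler passes Ollama's `done_reason` as the third argument, so `"length"` here stands for a reply cut off at `num_predict`.

```javascript
// Copied from qualityGate.js above, minus `export`, so the snippet is self-contained.
function passesQualityGate(stripped, thinkOnly, finishReason) {
  if (thinkOnly) return { pass: false, reason: "think_only" };
  if (stripped.trim().length < 5) return { pass: false, reason: "empty_response" };
  if (finishReason === "length") return { pass: false, reason: "hard_truncation" };
  const sentences = stripped.split(/[.!?\n]+/).map(s => s.trim()).filter(s => s.length > 10);
  if (sentences.length >= 6) {
    const counts = new Map();
    for (const s of sentences) {
      const key = s.toLowerCase();
      counts.set(key, (counts.get(key) ?? 0) + 1);
      if ((counts.get(key) ?? 0) >= 3) return { pass: false, reason: "loop_detected" };
    }
  }
  return { pass: true };
}

// Invented sample inputs, one per gate outcome.
const samples = [
  ["", true, "stop"],                                          // think_only
  ["ok", false, "stop"],                                       // empty_response (under 5 chars)
  ["def search(xs, target):", false, "length"],                // hard_truncation
  ["I will retry the request now. ".repeat(6), false, "stop"], // loop_detected
  ["Binary search halves the range each step.", false, "stop"] // passes
];
for (const [text, thinkOnly, reason] of samples) {
  console.log(passesQualityGate(text, thinkOnly, reason));
}
```

The loop check only runs once there are at least six sentences longer than ten characters, so a short answer that repeats a line three times still passes. That fits the file's stated goal of favouring precision over recall.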
package/dist/utils/thinkStrip.js
ADDED
@@ -0,0 +1,26 @@
+/**
+ * Think-Strip — remove <think>...</think> blocks from model output.
+ *
+ * Qwen3.5 uses <think> blocks for chain-of-thought reasoning.
+ * These must be stripped before serving to the user or passing
+ * to the grounding verifier (which would try to ground reasoning text).
+ *
+ * Returns: { stripped: string, thinkContent: string | null, thinkOnly: boolean }
+ */
+const THINK_RE = /<(?:think|\|synalux_think\|)>[\s\S]*?<\/(?:think|\|synalux_think\|)>\s*/g;
+const UNCLOSED_THINK_RE = /<(?:think|\|synalux_think\|)>[\s\S]*$/;
+export function stripThink(raw) {
+    if (!raw.includes("<think>") && !raw.includes("<|synalux_think|>")) {
+        return { stripped: raw, thinkContent: null, thinkOnly: false };
+    }
+    const thinkMatch = raw.match(/<(?:think|\|synalux_think\|)>([\s\S]*?)<\/(?:think|\|synalux_think\|)>/);
+    const thinkContent = thinkMatch ? thinkMatch[1].trim() : null;
+    let stripped = raw.replace(THINK_RE, "");
+    stripped = stripped.replace(UNCLOSED_THINK_RE, "");
+    stripped = stripped.trim();
+    return {
+        stripped,
+        thinkContent,
+        thinkOnly: stripped.length === 0 && raw.trim().length > 0,
+    };
+}
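The example below copies `stripThink` from the file above (without `export`, so it runs standalone) and shows the three cases the handler relies on: a closed block followed by an answer, an unclosed block with no answer, and plain text. The input strings are invented for illustration.

```javascript
// Copied from thinkStrip.js above, minus `export`.
const THINK_RE = /<(?:think|\|synalux_think\|)>[\s\S]*?<\/(?:think|\|synalux_think\|)>\s*/g;
const UNCLOSED_THINK_RE = /<(?:think|\|synalux_think\|)>[\s\S]*$/;
function stripThink(raw) {
  if (!raw.includes("<think>") && !raw.includes("<|synalux_think|>")) {
    return { stripped: raw, thinkContent: null, thinkOnly: false };
  }
  const thinkMatch = raw.match(/<(?:think|\|synalux_think\|)>([\s\S]*?)<\/(?:think|\|synalux_think\|)>/);
  const thinkContent = thinkMatch ? thinkMatch[1].trim() : null;
  let stripped = raw.replace(THINK_RE, "");
  stripped = stripped.replace(UNCLOSED_THINK_RE, "");
  stripped = stripped.trim();
  return { stripped, thinkContent, thinkOnly: stripped.length === 0 && raw.trim().length > 0 };
}

// Closed block followed by an answer: the answer survives, the reasoning is captured.
console.log(stripThink("<think>plan the steps</think>\nAnswer: 42"));
// Unclosed block (for example, generation stopped mid-thought): nothing is left to serve.
console.log(stripThink("<think>still reasoning about"));
// No think markup: the text is returned unchanged.
console.log(stripThink("plain answer"));
```

Two details follow from the regexes: `thinkContent` holds only the first closed block even when several are stripped, and an unclosed block yields `thinkContent: null` with `thinkOnly: true`, which is what sends the reply to the `think_only` branch of the quality gate.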
package/package.json
CHANGED
@@ -1,8 +1,8 @@
 {
   "name": "prism-mcp-server",
-  "version": "19.0
+  "version": "19.1.0",
   "mcpName": "io.github.dcostenco/prism-coder",
-  "description": "Prism Coder
+  "description": "Prism Coder \u2014 Cognitive memory + tool-calling intelligence for AI agents. Mind Palace persistent memory (BFCL Gold Certified, 100% Tool-Call Accuracy, 114 Agent Skills, PHI Guard, Tier Enforcement, Prompt-Based Skill Routing, Zero-Search HDC/HRR retrieval, HRR Semantic Drift Detection across BCBA/Coding/AAC domains, HIPAA-hardened local-first storage, SLERP-optimized GRPO alignment) plus the prism-coder 1.7B\u201332B open-weights LLM fleet.",
   "module": "index.ts",
   "type": "module",
   "main": "dist/server.js",