prism-mcp-server 19.0.1 → 19.2.0
- package/README.md +132 -8
- package/dist/config.js +4 -4
- package/dist/tools/compactionHandler.js +2 -2
- package/dist/tools/ledgerHandlers.js +9 -0
- package/dist/tools/prismInferHandler.js +123 -25
- package/dist/tools/taskRouterHandler.js +2 -2
- package/dist/utils/ddLogger.js +57 -19
- package/dist/utils/entitlements.js +1 -1
- package/dist/utils/inferenceMetrics.js +64 -0
- package/dist/utils/localLlm.js +2 -2
- package/dist/utils/modelPicker.js +13 -13
- package/dist/utils/nerExtractor.js +1 -1
- package/dist/utils/qualityGate.js +67 -0
- package/dist/utils/safetyGate.js +104 -0
- package/dist/utils/thinkStrip.js +26 -0
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -11,7 +11,7 @@
|
|
|
11
11
|
<img src="docs/v11_hivemind_multi_agent_dashboard.jpg" alt="Prism Coder — Mind Palace Dashboard with Knowledge Graph and Multi-Agent Hivemind" width="700" />
|
|
12
12
|
</p>
|
|
13
13
|
|
|
14
|
-
Prism Coder is an [MCP server](https://modelcontextprotocol.io) that gives Claude, Cursor, and other AI tools long-term memory that survives across sessions. It ships with the open-weight `prism-coder` model fleet (2B–
|
|
14
|
+
Prism Coder is an [MCP server](https://modelcontextprotocol.io) that gives Claude, Cursor, and other AI tools long-term memory that survives across sessions. It ships with the open-weight `prism-coder` model fleet (2B–27B) for fast, offline tool-routing — no cloud required.
|
|
15
15
|
|
|
16
16
|
**No account needed. No API keys. Runs on your machine.**
|
|
17
17
|
A paid subscription adds cloud sync, higher model tiers, and team features through the [Synalux portal](https://synalux.ai).
|
|
@@ -41,7 +41,7 @@ Open Claude Desktop or Cursor and your agent now has memory backed by a local SQ
|
|
|
41
41
|
ollama pull dcostenco/prism-coder:2b # 2.3 GB · mobile / lightweight (99.1% routing accuracy)
|
|
42
42
|
ollama pull dcostenco/prism-coder:4b # 3.4 GB · verifier (100% accuracy)
|
|
43
43
|
ollama pull dcostenco/prism-coder:9b # 5.8 GB · default router (100% accuracy, Qwen3.5)
|
|
44
|
-
ollama pull dcostenco/prism-coder:
|
|
44
|
+
ollama pull dcostenco/prism-coder:27b # 16 GB · complex tasks (100% accuracy)
|
|
45
45
|
```
|
|
46
46
|
|
|
47
47
|
Prism detects both the namespaced (`dcostenco/prism-coder:9b`) and bare (`prism-coder:9b`) Ollama tags automatically.
|
|
@@ -78,6 +78,23 @@ Every session is logged with files changed, decisions made, and TODOs. Search, f
|
|
|
78
78
|
<img src="docs/session-ledger.jpg" alt="Session Ledger — 93 sessions, 847 decisions logged across 12 projects" width="700" />
|
|
79
79
|
</p>
|
|
80
80
|
|
|
81
|
+
### Inference Metrics — see where your tokens go
|
|
82
|
+
|
|
83
|
+
Every `prism_infer` call tracks which model handled it (local Ollama vs cloud) and how many tokens were consumed. When you save a session, Prism shows a summary:
|
|
84
|
+
|
|
85
|
+
```
|
|
86
|
+
📊 Inference Metrics (this session):
|
|
87
|
+
Total calls: 12 — Local: 10 (83%) | Cloud: 2 (17%)
|
|
88
|
+
Tokens: 8,420 in + 3,150 out = 11,570 total
|
|
89
|
+
Avg latency: 1,240ms
|
|
90
|
+
By model:
|
|
91
|
+
prism-coder:27b: 6 calls, 7,200 tokens, avg 1,800ms
|
|
92
|
+
prism-coder:9b: 4 calls, 2,870 tokens, avg 620ms
|
|
93
|
+
synalux-27b: 2 calls, 1,500 tokens, avg 1,100ms
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
Local calls use actual Ollama token counts; cloud calls use estimates. Metrics are aggregated by the Synalux portal — Prism is a thin client that forwards per-call data and fetches the summary on demand.
|
|
97
|
+
|
|
81
98
|
### Session Drift Detection
|
|
82
99
|
|
|
83
100
|
Long agent sessions can wander from their original goal. `session_detect_drift` compares current work against the stated goal and returns `on_track / minor_drift / major_drift` so the agent can self-correct.
|
|
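To make the summary format in the Inference Metrics section concrete, here is a minimal sketch of how per-call records could be rolled up into those totals. The record shape (`model`, `backend`, `tokensIn`, `tokensOut`, `latencyMs`) and the `summarize` function are hypothetical. Per the README, the real aggregation is done by the Synalux portal, not by the client.

```javascript
// Hypothetical per-call record:
//   { model, backend: "local" | "cloud", tokensIn, tokensOut, latencyMs }
function summarize(calls) {
  const byModel = new Map();
  let local = 0, tokensIn = 0, tokensOut = 0, latency = 0;
  for (const c of calls) {
    if (c.backend === "local") local += 1;
    tokensIn += c.tokensIn;
    tokensOut += c.tokensOut;
    latency += c.latencyMs;
    // Accumulate per-model call count, token total, and latency sum.
    const m = byModel.get(c.model) ?? { calls: 0, tokens: 0, latency: 0 };
    m.calls += 1;
    m.tokens += c.tokensIn + c.tokensOut;
    m.latency += c.latencyMs;
    byModel.set(c.model, m);
  }
  const total = calls.length;
  const pct = (n) => (total === 0 ? 0 : Math.round((n / total) * 100));
  return {
    total,
    local,
    cloud: total - local,
    localPct: pct(local),
    cloudPct: pct(total - local),
    tokensIn,
    tokensOut,
    tokensTotal: tokensIn + tokensOut,
    avgLatencyMs: total === 0 ? 0 : Math.round(latency / total),
    byModel: [...byModel].map(([model, m]) => ({
      model,
      calls: m.calls,
      tokens: m.tokens,
      avgLatencyMs: Math.round(m.latency / m.calls),
    })),
  };
}
```

Percentages are rounded independently, so they may not always sum to exactly 100.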
@@ -145,14 +162,16 @@ The free tier runs entirely on your machine. Paid tiers add cloud sync through t
|
|
|
145
162
|
|
|
146
163
|
## Models
|
|
147
164
|
|
|
148
|
-
The `prism-coder` fleet uses Qwen3.5 for MCP tool-routing. The 9B
|
|
165
|
+
The `prism-coder` fleet uses Qwen3.5 for MCP tool-routing AND general inference. The 9B and 27B are fine-tuned with LoRA (r=128, all 64 layers including DeltaNet); the 2B and 4B use stock Qwen3.5-4B at different quantization levels. The 27B scored 100% on BFCL function-calling and 100% on an internal 15-problem coding eval at $0 inference cost.
|
|
166
|
+
|
|
167
|
+
`prism_infer` supports three modes: `route` (tool routing, fast, nothink), `chat` (conversation with thinking), and `code` (code generation with thinking). In chat/code modes, the model uses `<think>` blocks for chain-of-thought reasoning, which are stripped before the response is served. If the local model fails a quality gate (empty, think-only, or truncated), paid tiers automatically escalate to Claude via the Synalux portal.
|
|
149
168
|
|
|
150
169
|
| Model | Ollama tag | Size | [BFCL](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v3_multi_turn.html) Accuracy | Role | Tier |
|
|
151
170
|
|---|---|---|---|---|---|
|
|
152
171
|
| Qwen3.5-4B Q3_K_M | `prism-coder:2b` | 2.3 GB | 99.1% × 3 seeds | iPhone / mobile first gate | Free |
|
|
153
172
|
| Qwen3.5-4B Q4_K_M | `prism-coder:4b` | 3.4 GB | 100% × 3 seeds | Verifier | Free |
|
|
154
173
|
| Qwen3.5-9B (LoRA) | `prism-coder:9b` | 5.8 GB | 100% × 3 seeds | Default router | Standard+ |
|
|
155
|
-
|
|
|
174
|
+
| Qwen3.5-27B (LoRA) | `prism-coder:27b` | 16 GB | 100% × 3 seeds | Quality tier (DeltaNet, 28.5 tok/s) | Advanced+ |
|
|
156
175
|
|
|
157
176
|
Weights: [huggingface.co/dcostenco](https://huggingface.co/dcostenco) (public GGUF). Latency depends on model size and hardware — see [Benchmarks](#benchmarks) to measure it on your own machine rather than trusting a printed number.
|
|
158
177
|
|
|
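The Models section says `<think>` blocks are stripped before a response is served and that a think-only response fails the quality gate. The contents of `thinkStrip.js` are not shown in this diff, so the following is only a sketch of what such a helper might look like. The `{ stripped, thinkOnly }` return shape matches how the handler destructures the result; the regex and the handling of an unclosed tag are assumptions.

```javascript
// Hypothetical sketch; the real thinkStrip.js is not shown in this diff.
function stripThink(text) {
  // Remove every complete <think>...</think> block (non-greedy, across newlines).
  let stripped = text.replace(/<think>[\s\S]*?<\/think>/g, "");
  // Assumption: an unclosed <think> means generation stopped mid-reasoning,
  // so everything from the dangling tag onward is dropped.
  const open = stripped.indexOf("<think>");
  if (open !== -1) stripped = stripped.slice(0, open);
  stripped = stripped.trim();
  // "Think-only" = the model emitted reasoning but no visible answer.
  const hadThink = text.includes("<think>");
  return { stripped, thinkOnly: hadThink && stripped.length === 0 };
}
```

Returning `thinkOnly` separately from an empty `stripped` string lets a caller tell "the model said nothing" apart from "the model reasoned but never answered".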
@@ -162,7 +181,7 @@ Weights: [huggingface.co/dcostenco](https://huggingface.co/dcostenco) (public GG
|
|
|
162
181
|
query → prism-coder:9b (local router, default)
|
|
163
182
|
→ prism-coder:4b (grounding verifier)
|
|
164
183
|
→ prism-coder:2b (iPhone / mobile, auto-selected by RAM)
|
|
165
|
-
→ prism-coder:
|
|
184
|
+
→ prism-coder:27b (complex tasks, on demand)
|
|
166
185
|
→ cloud fallback (paid tiers, for max quality)
|
|
167
186
|
```
|
|
168
187
|
|
|
@@ -189,7 +208,7 @@ Fail-closed on the verified path: when the grounding verifier runs (Standard tie
|
|
|
189
208
|
```bash
|
|
190
209
|
git clone https://github.com/dcostenco/prism-coder && cd prism-coder
|
|
191
210
|
pip install anthropic requests
|
|
192
|
-
python3 tests/benchmarks/prism-routing-100/benchmark.py --models 2b 4b 9b
|
|
211
|
+
python3 tests/benchmarks/prism-routing-100/benchmark.py --models 2b 4b 9b 27b
|
|
193
212
|
```
|
|
194
213
|
|
|
195
214
|
**Routing eval (115 cases, 12 categories, 3-seed mean).** Routing accuracy includes the deterministic L3 correction layer — the same rules that run in production. On this narrow tool-routing task all fleet models achieve near-perfect accuracy. Be honest with yourself about what that means: the eval is **near-saturated** for this taxonomy — it measures whether the right one of a small set of MCP tools is selected, not general capability. The useful takeaway is **offline routing reliability at zero cost**, not that a 2.3 GB model rivals a frontier model in general.
|
|
@@ -197,11 +216,96 @@ python3 tests/benchmarks/prism-routing-100/benchmark.py --models 2b 4b 9b 32b
|
|
|
197
216
|
| Model | Routing accuracy | Notes |
|
|
198
217
|
|---|---|---|
|
|
199
218
|
| prism-coder:2b (Q3_K_M) | 99.1% × 3 seeds | 1 failure: regex→knowledge_search |
|
|
200
|
-
| prism-coder:4b / 9b /
|
|
219
|
+
| prism-coder:4b / 9b / 27b | 100% × 3 seeds | Perfect on all 115 cases |
|
|
201
220
|
| Claude (frontier, same eval) | ~98% | Stronger everywhere outside this narrow task |
|
|
202
221
|
|
|
203
222
|
**Memory uplift (LoCoMo-Plus, self-published).** A separate long-context dialogue benchmark ([dcostenco/Locomo-Plus](https://github.com/dcostenco/Locomo-Plus)) measures how much structured memory helps a base model retain multi-day context. Results show large gains when a model is paired with Prism memory versus running raw. Note this benchmark is authored, run, and LLM-judged by this project — treat it as a reproducible demonstration, not an independent third-party result, and run it yourself with the commands in that repo.
|
|
204
223
|
|
|
224
|
+
### Code Generation Quality (27B vs Claude Opus)
|
|
225
|
+
|
|
226
|
+
Three progressively harder Python tasks run through `prism_infer(mode:"code", think:true)` on the local 27B and compared with Claude Opus. Both produce correct, production-quality code. The 27B is slightly more verbose (docstrings, examples); Opus is slightly tighter (`__slots__`, early-exit DFS). On routine coding the 27B at $0 replaces cloud calls entirely.
|
|
227
|
+
|
|
228
|
+
| Task | Local 27B | Claude Opus | Verdict |
|
|
229
|
+
|------|-----------|-------------|---------|
|
|
230
|
+
| Fibonacci with memoization | `@lru_cache`, ValueError on negative, docstring | Nested `_fib` to keep cache private | Both correct, equivalent |
|
|
231
|
+
| LRU Cache (OrderedDict, O(1)) | `Any` keys, isinstance capacity check, `__repr__` | `Hashable` key type (more precise), same ops | Both correct, Opus marginally tighter |
|
|
232
|
+
| Trie with autocomplete | `.lower()` normalization, collect+sort+slice | `__slots__` on TrieNode, early-exit DFS at limit | Both correct, Opus slightly more optimized |
|
|
233
|
+
|
|
234
|
+
<details>
|
|
235
|
+
<summary>Local 27B output — Trie with autocomplete (hardest task)</summary>
|
|
236
|
+
|
|
237
|
+
```python
|
|
238
|
+
class TrieNode:
|
|
239
|
+
def __init__(self):
|
|
240
|
+
self.children: dict[str, 'TrieNode'] = {}
|
|
241
|
+
self.is_end_of_word: bool = False
|
|
242
|
+
|
|
243
|
+
class Trie:
|
|
244
|
+
def __init__(self):
|
|
245
|
+
self.root: TrieNode = TrieNode()
|
|
246
|
+
|
|
247
|
+
def insert(self, word: str) -> None:
|
|
248
|
+
node = self.root
|
|
249
|
+
for char in word.lower():
|
|
250
|
+
if char not in node.children:
|
|
251
|
+
node.children[char] = TrieNode()
|
|
252
|
+
node = node.children[char]
|
|
253
|
+
node.is_end_of_word = True
|
|
254
|
+
|
|
255
|
+
def search(self, word: str) -> bool:
|
|
256
|
+
node = self._get_node(word.lower())
|
|
257
|
+
return node is not None and node.is_end_of_word
|
|
258
|
+
|
|
259
|
+
def starts_with(self, prefix: str) -> bool:
|
|
260
|
+
return self._get_node(prefix.lower()) is not None
|
|
261
|
+
|
|
262
|
+
def autocomplete(self, prefix: str, limit: int = 5) -> list[str]:
|
|
263
|
+
node = self._get_node(prefix.lower())
|
|
264
|
+
if node is None:
|
|
265
|
+
return []
|
|
266
|
+
results: list[str] = []
|
|
267
|
+
self._collect_words(node, prefix.lower(), results)
|
|
268
|
+
results.sort()
|
|
269
|
+
return results[:limit]
|
|
270
|
+
|
|
271
|
+
def _get_node(self, key: str) -> 'TrieNode | None':
|
|
272
|
+
node = self.root
|
|
273
|
+
for char in key:
|
|
274
|
+
if char not in node.children:
|
|
275
|
+
return None
|
|
276
|
+
node = node.children[char]
|
|
277
|
+
return node
|
|
278
|
+
|
|
279
|
+
def _collect_words(self, node: TrieNode, prefix: str, results: list[str]) -> None:
|
|
280
|
+
if node.is_end_of_word:
|
|
281
|
+
results.append(prefix)
|
|
282
|
+
for char, child in sorted(node.children.items()):
|
|
283
|
+
self._collect_words(child, prefix + char, results)
|
|
284
|
+
```
|
|
285
|
+
|
|
286
|
+
</details>
|
|
287
|
+
|
|
288
|
+
| Metric | Local 27B | Cloud (Opus) |
|
|
289
|
+
|--------|-----------|-------------|
|
|
290
|
+
| Latency (Trie task) | ~30s | ~8s |
|
|
291
|
+
| Cost | $0 | ~$0.05 |
|
|
292
|
+
| Think mode | Enabled (stripped before serving) | N/A |
|
|
293
|
+
| Quality gate | Passed (no escalation needed) | N/A |
|
|
294
|
+
|
|
295
|
+
### Cloud Escalation in Practice (`cloud_fallback: true`)
|
|
296
|
+
|
|
297
|
+
The same three tasks with `cloud_fallback: true` — the quality gate decides whether local output is good enough or needs cloud escalation.
|
|
298
|
+
|
|
299
|
+
| Task | used_cloud | Quality Gate | Latency | What happened |
|
|
300
|
+
|------|:----------:|-------------|---------|---------------|
|
|
301
|
+
| Fibonacci (simple) | **no** | Passed | 11s | 27B served directly, $0 |
|
|
302
|
+
| LRU Cache (medium) | **no** | Passed | 21s | 27B served directly, $0 |
|
|
303
|
+
| Trie (hard) | **yes** | `loop_detected` | 55s | 27B looped → gate caught it → escalated to cloud 27B |
|
|
304
|
+
|
|
305
|
+
The quality gate detected repeated sentences (≥3 of the same sentence in ≥6 total) in the 27B's Trie output and escalated automatically. The cloud fallback returned clean code. On a second run of the same prompt, the 27B produced clean output without escalation — the loop is stochastic, not systematic.
|
|
306
|
+
|
|
307
|
+
**Takeaway:** for ~80–90% of coding tasks, the 27B handles everything locally at $0. The quality gate + cloud escalation exists as a safety net for the remaining cases where the local model loops, truncates, or produces empty output. Paid tiers get automatic escalation; free tier gets the local result with a warning.
|
|
308
|
+
|
|
205
309
|
---
|
|
206
310
|
|
|
207
311
|
## Why Prism Coder
|
|
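The escalation write-up describes the loop check as "≥3 of the same sentence in ≥6 total". A minimal sketch of a check with those two thresholds is shown below. Only the thresholds come from the README; the function name `detectLoop`, the sentence splitter, and the lowercase normalization are assumptions, and the real `qualityGate.js` may differ.

```javascript
// Hypothetical sketch of a repeated-sentence check.
function detectLoop(text, minRepeats = 3, minSentences = 6) {
  // Naive splitter: break after ., ! or ? followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s.length > 0);
  // Too little text to call it a loop.
  if (sentences.length < minSentences) return false;
  const counts = new Map();
  for (const s of sentences) {
    const n = (counts.get(s) ?? 0) + 1;
    if (n >= minRepeats) return true;
    counts.set(s, n);
  }
  return false;
}
```

Requiring a minimum sentence count keeps short, deliberately repetitive answers from tripping the check.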
@@ -254,7 +358,7 @@ All on-device models are free to run locally via Ollama on every tier. A subscri
|
|
|
254
358
|
| | **Free** | **Standard** $19/mo | **Advanced** $49/mo | **Enterprise** $99/mo |
|
|
255
359
|
|---|---|---|---|---|
|
|
256
360
|
| Seats | 1 | 1 | up to 5 | up to 25 |
|
|
257
|
-
| Local model ceiling | up to 4b | up to 9b | up to
|
|
361
|
+
| Local model ceiling | up to 4b | up to 9b | up to 27b | up to 27b |
|
|
258
362
|
| Daily cloud inference | -- | 200 | 2,000 | 100,000 |
|
|
259
363
|
| Cloud Coder (Web IDE) | -- | 100/day | 1,000/day | 100,000/day |
|
|
260
364
|
| Cloud search | -- | 50/day | 500/day | 100,000/day |
|
|
@@ -284,6 +388,26 @@ Prism exposes 40+ MCP tools. The core memory loop:
|
|
|
284
388
|
| `session_detect_drift` | Detect when a session has drifted from its goal |
|
|
285
389
|
| `verify_behavior` | Pre-edit scenario challenge — catch bad changes before they happen |
|
|
286
390
|
| `knowledge_ingest` | Teach Prism a codebase or document |
|
|
391
|
+
| `prism_infer` | Local-first inference (route/chat/code modes, thinking, cloud escalation) |
|
|
392
|
+
|
|
393
|
+
### `prism_infer` — local-first inference with cloud escalation
|
|
394
|
+
|
|
395
|
+
```typescript
|
|
396
|
+
prism_infer({
|
|
397
|
+
prompt: "Write a binary search in Python",
|
|
398
|
+
mode: "code", // "route" | "chat" | "code"
|
|
399
|
+
think: true, // enable <think> reasoning (default: true for chat/code)
|
|
400
|
+
model_ceiling: "27b", // use the quality tier
|
|
401
|
+
})
|
|
402
|
+
// → 27B generates code locally ($0), with thinking for quality
|
|
403
|
+
// → If quality gate fails + paid tier → auto-escalate to Claude
|
|
404
|
+
```
|
|
405
|
+
|
|
406
|
+
| Mode | Think | Model | Use case |
|
|
407
|
+
|------|-------|-------|----------|
|
|
408
|
+
| `route` | Off (fast) | 9B default | MCP tool routing |
|
|
409
|
+
| `chat` | On | 27B preferred | Conversation, reasoning |
|
|
410
|
+
| `code` | On | 27B preferred | Code generation, debugging |
|
|
287
411
|
|
|
288
412
|
Full TypeScript signatures live in [`src/tools/`](src/tools/); architecture in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md).
|
|
289
413
|
|
package/dist/config.js
CHANGED
|
@@ -307,11 +307,11 @@ const rawTiebreakerEpsilon = parseFloat(process.env.PRISM_TURBOQUANT_TIEBREAKER_
|
|
|
307
307
|
export const PRISM_TURBOQUANT_TIEBREAKER_EPSILON = Number.isFinite(rawTiebreakerEpsilon) && rawTiebreakerEpsilon >= 0
|
|
308
308
|
? rawTiebreakerEpsilon
|
|
309
309
|
: 0;
|
|
310
|
-
// ─── v9.x: Local LLM (prism-coder
|
|
310
|
+
// ─── v9.x: Local LLM (prism-coder) Integration ────────────────────────────
|
|
311
311
|
// Enables background tasks (compaction, task-router fallback, pipeline ops)
|
|
312
312
|
// to use a local Ollama model instead of the cloud LLM provider.
|
|
313
313
|
//
|
|
314
|
-
// Default model is prism-coder:
|
|
314
|
+
// Default model is prism-coder:9b — fine-tuned on Prism tool schemas.
|
|
315
315
|
// Disabled by default so existing deployments are unaffected.
|
|
316
316
|
//
|
|
317
317
|
// Set PRISM_LOCAL_LLM_ENABLED=true to activate.
|
|
@@ -319,10 +319,10 @@ export const PRISM_TURBOQUANT_TIEBREAKER_EPSILON = Number.isFinite(rawTiebreaker
|
|
|
319
319
|
// Set PRISM_LOCAL_LLM_URL to override the Ollama endpoint (default: localhost:11434).
|
|
320
320
|
// Set PRISM_LOCAL_LLM_TIMEOUT_MS to override per-call timeout (default: 60000, max: 300000).
|
|
321
321
|
// Set PRISM_STRICT_LOCAL_MODE=true to block cloud fallback when local LLM is enabled (HIPAA).
|
|
322
|
-
/** Master switch — enables the local prism-coder
|
|
322
|
+
/** Master switch — enables the local prism-coder LLM for background tasks. */
|
|
323
323
|
export const PRISM_LOCAL_LLM_ENABLED = process.env.PRISM_LOCAL_LLM_ENABLED === "true"; // Opt-in, default false
|
|
324
324
|
/** Ollama model tag to use for local LLM calls. */
|
|
325
|
-
export const PRISM_LOCAL_LLM_MODEL = (process.env.PRISM_LOCAL_LLM_MODEL || "prism-coder:
|
|
325
|
+
export const PRISM_LOCAL_LLM_MODEL = (process.env.PRISM_LOCAL_LLM_MODEL || "prism-coder:9b").trim();
|
|
326
326
|
/** Ollama base URL. Override for remote Ollama instances. */
|
|
327
327
|
export const PRISM_LOCAL_LLM_URL = (process.env.PRISM_LOCAL_LLM_URL || "http://localhost:11434").trim();
|
|
328
328
|
/** Per-call timeout in ms. Prevents stalled background tasks. Capped at 300s. */
|
package/dist/tools/compactionHandler.js
CHANGED
|
@@ -108,7 +108,7 @@ function parseCompactionResponse(response, source) {
|
|
|
108
108
|
}
|
|
109
109
|
async function summarizeEntries(entries) {
|
|
110
110
|
const prompt = buildCompactionPrompt(entries);
|
|
111
|
-
// ── Path 1: Local LLM (prism-coder:
|
|
111
|
+
// ── Path 1: Local LLM (prism-coder:9b) ───────────────────────────
|
|
112
112
|
if (PRISM_LOCAL_LLM_ENABLED) {
|
|
113
113
|
debugLog(`[compact_ledger] Attempting local LLM summarization (${entries.length} entries)`);
|
|
114
114
|
const localResponse = await callLocalLlm(prompt);
|
|
@@ -123,7 +123,7 @@ async function summarizeEntries(entries) {
|
|
|
123
123
|
if (PRISM_STRICT_LOCAL_MODE) {
|
|
124
124
|
throw new Error("[HIPAA] Local LLM failed and PRISM_STRICT_LOCAL_MODE=true. " +
|
|
125
125
|
"Cloud fallback is blocked to prevent unauthorized PHI disclosure. " +
|
|
126
|
-
"Ensure Ollama is running and prism-coder:
|
|
126
|
+
"Ensure Ollama is running and prism-coder:9b is available.");
|
|
127
127
|
}
|
|
128
128
|
debugLog(`[compact_ledger] Local LLM returned null — falling back to cloud LLM`);
|
|
129
129
|
}
|
package/dist/tools/ledgerHandlers.js
CHANGED
|
@@ -89,6 +89,7 @@ const MEMORY_BOUNDARY_SUFFIX = '\n</prism_memory>';
|
|
|
89
89
|
* After saving, generates an embedding vector for the entry via fire-and-forget.
|
|
90
90
|
*/
|
|
91
91
|
import { computeEffectiveImportance, recordMemoryAccess } from "../utils/cognitiveMemory.js";
|
|
92
|
+
import { fetchPortalInferenceMetrics, markSessionStart } from "../utils/inferenceMetrics.js";
|
|
92
93
|
export async function sessionSaveLedgerHandler(args) {
|
|
93
94
|
if (!isSessionSaveLedgerArgs(args)) {
|
|
94
95
|
throw new Error("Invalid arguments for session_save_ledger");
|
|
@@ -229,6 +230,8 @@ export async function sessionSaveLedgerHandler(args) {
|
|
|
229
230
|
storage.decayImportance(project, PRISM_USER_ID, 30).catch((err) => {
|
|
230
231
|
debugLog(`[session_save_ledger] Background decay failed (non-fatal): ${err instanceof Error ? err.message : String(err)}`);
|
|
231
232
|
});
|
|
233
|
+
// Fetch inference metrics from portal (thin-client: portal is authority)
|
|
234
|
+
const metricsBlock = await fetchPortalInferenceMetrics();
|
|
232
235
|
return {
|
|
233
236
|
content: [{
|
|
234
237
|
type: "text",
|
|
@@ -238,6 +241,7 @@ export async function sessionSaveLedgerHandler(args) {
|
|
|
238
241
|
(files_changed?.length ? `Files changed: ${files_changed.length}\n` : "") +
|
|
239
242
|
(decisions?.length ? `Decisions: ${decisions.length}\n` : "") +
|
|
240
243
|
`📊 Embedding generation queued for semantic search.` +
|
|
244
|
+
metricsBlock +
|
|
241
245
|
resolverNote,
|
|
242
246
|
}],
|
|
243
247
|
isError: false,
|
|
@@ -548,11 +552,13 @@ export async function sessionSaveHandoffHandler(args, server) {
|
|
|
548
552
|
// Dynamic import itself failed — module not found or similar
|
|
549
553
|
console.error("[FactMerger] Module load failed (non-fatal): " + err));
|
|
550
554
|
}
|
|
555
|
+
const metricsBlock = await fetchPortalInferenceMetrics();
|
|
551
556
|
// Build response text based on whether a CRDT merge occurred
|
|
552
557
|
const responseText = isMerged
|
|
553
558
|
? `🔄 Auto-merged conflict for "${project}" (v${expected_version} → v${newVersion})\n` +
|
|
554
559
|
`Strategy: ${JSON.stringify(mergeStrategy)}\n` +
|
|
555
560
|
(last_summary ? `Summary: ${last_summary}\n` : "") +
|
|
561
|
+
metricsBlock +
|
|
556
562
|
`\n🔑 Remember: pass expected_version: ${newVersion} on your next save ` +
|
|
557
563
|
`to maintain concurrency control.`
|
|
558
564
|
: `✅ Handoff ${data.status || "saved"} for project "${project}" ` +
|
|
@@ -561,6 +567,7 @@ export async function sessionSaveHandoffHandler(args, server) {
|
|
|
561
567
|
(open_todos?.length ? `Open TODOs: ${open_todos.length} items\n` : "") +
|
|
562
568
|
(active_branch ? `Active branch: ${active_branch}\n` : "") +
|
|
563
569
|
`📊 Embedding generation queued for semantic search.\n` +
|
|
570
|
+
metricsBlock +
|
|
564
571
|
`\n🔑 Remember: pass expected_version: ${newVersion} on your next save ` +
|
|
565
572
|
`to maintain concurrency control.`;
|
|
566
573
|
return {
|
|
@@ -575,6 +582,8 @@ export async function sessionLoadContextHandler(args) {
|
|
|
575
582
|
if (!isSessionLoadContextArgs(args)) {
|
|
576
583
|
throw new Error("Invalid arguments for session_load_context");
|
|
577
584
|
}
|
|
585
|
+
// Mark session boundary — portal metrics fetched with since=this timestamp
|
|
586
|
+
markSessionStart();
|
|
578
587
|
const { project, level = "standard", role } = args;
|
|
579
588
|
const maxTokens = args.max_tokens
|
|
580
589
|
|| parseInt(await getSetting("max_tokens", "0"), 10) || undefined; // v4.0: arg > dashboard setting > none
|
package/dist/tools/prismInferHandler.js
CHANGED
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
* prism_infer — local-first inference tool
|
|
3
3
|
* ─────────────────────────────────────────────────────────────
|
|
4
4
|
* Save the caller's cloud tokens by routing to a local prism-coder
|
|
5
|
-
* model via Ollama. Tiers (
|
|
5
|
+
* model via Ollama. Tiers (27B/9B/4B/2B) auto-selected by free
|
|
6
6
|
* RAM, then capped by `model_ceiling` and the set of tags that are
|
|
7
7
|
* actually pulled into Ollama.
|
|
8
8
|
*
|
|
@@ -12,7 +12,7 @@
|
|
|
12
12
|
* 4. On local fail, if cloud_fallback=true:
|
|
13
13
|
* - exchange synalux_sk_ → JWT (cached)
|
|
14
14
|
* - POST synalux portal /api/v1/prism-aac/inference
|
|
15
|
-
* - portal runs its own cascade (9B/
|
|
15
|
+
* - portal runs its own cascade (9B/27B/Claude by tier)
|
|
16
16
|
* 5. Return { output, backend, model_picked, ram_free_mb, latency_ms, used_cloud }
|
|
17
17
|
*
|
|
18
18
|
* `prism_infer` is a thin client. It never calls Anthropic / OpenRouter
|
|
@@ -26,13 +26,16 @@ import { PRISM_SYNALUX_BASE_URL, PRISM_LOCAL_LLM_URL, } from "../config.js";
|
|
|
26
26
|
import { debugLog } from "../utils/logger.js";
|
|
27
27
|
import { getEntitlements, clampCeiling } from "../utils/entitlements.js";
|
|
28
28
|
import { ddLog } from "../utils/ddLogger.js";
|
|
29
|
+
import { stripThink } from "../utils/thinkStrip.js";
|
|
30
|
+
import { passesQualityGate } from "../utils/qualityGate.js";
|
|
31
|
+
import { checkInputSafety, checkOutputSafety } from "../utils/safetyGate.js";
|
|
29
32
|
// ─── Tool Definition ────────────────────────────────────────────
|
|
30
33
|
export const PRISM_INFER_TOOL = {
|
|
31
34
|
name: "prism_infer",
|
|
32
35
|
description: "Run an inference on a local prism-coder model (Ollama) to save cloud tokens. " +
|
|
33
|
-
"Picks the largest viable tier —
|
|
36
|
+
"Picks the largest viable tier — 27B / 9B / 4B / 2B — based on free RAM at call time, " +
|
|
34
37
|
"clamped by `model_ceiling` and what is actually pulled in Ollama. " +
|
|
35
|
-
"Falls through to the synalux portal cloud cascade (9B →
|
|
38
|
+
"Falls through to the synalux portal cloud cascade (9B → 27B → Claude Opus 4.7) " +
|
|
36
39
|
"only when local is unviable AND `cloud_fallback=true`. " +
|
|
37
40
|
"Use this for code generation, summarisation, classification, or any synth task you would " +
|
|
38
41
|
"otherwise hand to the cloud model — it costs $0 when the local hit succeeds.",
|
|
@@ -59,8 +62,8 @@ export const PRISM_INFER_TOOL = {
|
|
|
59
62
|
},
|
|
60
63
|
model_ceiling: {
|
|
61
64
|
type: "string",
|
|
62
|
-
enum: ["
|
|
63
|
-
description: "Cap the largest tier the picker may select. e.g. '9b' forbids
|
|
65
|
+
enum: ["27b", "9b", "4b", "2b"],
|
|
66
|
+
description: "Cap the largest tier the picker may select. e.g. '9b' forbids 27B even if RAM allows.",
|
|
64
67
|
},
|
|
65
68
|
cloud_fallback: {
|
|
66
69
|
type: "boolean",
|
|
@@ -69,7 +72,7 @@ export const PRISM_INFER_TOOL = {
|
|
|
69
72
|
},
|
|
70
73
|
timeout_ms: {
|
|
71
74
|
type: "number",
|
|
72
|
-
description: "Override per-call timeout. Default scales with model size:
|
|
75
|
+
description: "Override per-call timeout. Default scales with model size: 27B=120s, 9B=60s, 4B=20s, 2B=15s.",
|
|
73
76
|
},
|
|
74
77
|
evidence: {
|
|
75
78
|
type: "array",
|
|
@@ -102,6 +105,20 @@ export const PRISM_INFER_TOOL = {
|
|
|
102
105
|
description: "Override the verifier hard timeout. Default 2000 ms.",
|
|
103
106
|
default: 2000,
|
|
104
107
|
},
|
|
108
|
+
mode: {
|
|
109
|
+
type: "string",
|
|
110
|
+
enum: ["route", "chat", "code"],
|
|
111
|
+
description: "Execution mode. 'route' (default) for MCP tool routing — fast, nothink. " +
|
|
112
|
+
"'chat' for general conversation — uses thinking, escalates to cloud on failure. " +
|
|
113
|
+
"'code' for code generation — uses thinking, larger context. " +
|
|
114
|
+
"In chat/code modes, prefers the 27B tier and enables <think> reasoning.",
|
|
115
|
+
default: "route",
|
|
116
|
+
},
|
|
117
|
+
think: {
|
|
118
|
+
type: "boolean",
|
|
119
|
+
description: "Enable thinking mode (<think> blocks). Default: true for chat/code, false for route. " +
|
|
120
|
+
"Thinking improves quality on complex tasks but adds latency (~2-5s).",
|
|
121
|
+
},
|
|
105
122
|
},
|
|
106
123
|
required: ["prompt"],
|
|
107
124
|
},
|
|
@@ -123,7 +140,12 @@ export function isPrismInferArgs(args) {
|
|
|
123
140
|
if (a.timeout_ms !== undefined && typeof a.timeout_ms !== "number")
|
|
124
141
|
return false;
|
|
125
142
|
if (a.model_ceiling !== undefined &&
|
|
126
|
-
!["
|
|
143
|
+
!["27b", "9b", "4b", "2b"].includes(a.model_ceiling))
|
|
144
|
+
return false;
|
|
145
|
+
if (a.mode !== undefined &&
|
|
146
|
+
!["route", "chat", "code"].includes(a.mode))
|
|
147
|
+
return false;
|
|
148
|
+
if (a.think !== undefined && typeof a.think !== "boolean")
|
|
127
149
|
return false;
|
|
128
150
|
if (a.verify !== undefined && typeof a.verify !== "boolean")
|
|
129
151
|
return false;
|
|
@@ -146,7 +168,7 @@ export function isPrismInferArgs(args) {
|
|
|
146
168
|
}
|
|
147
169
|
// ─── Ollama helpers ────────────────────────────────────────────
|
|
148
170
|
const DEFAULT_TIMEOUTS = {
|
|
149
|
-
"prism-coder:
|
|
171
|
+
"prism-coder:27b": 120_000,
|
|
150
172
|
"prism-coder:9b": 60_000,
|
|
151
173
|
"prism-coder:4b": 20_000,
|
|
152
174
|
"prism-coder:2b": 15_000,
|
|
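The timeout table above is consulted with a two-step fallback later in the handler (`args.timeout_ms ?? DEFAULT_TIMEOUTS[tier.tag] ?? 60_000`). A small sketch of that resolution follows. The values are the ones shown in the diff; the helper name `pickTimeout` is hypothetical.

```javascript
// Values as shown in the diff; the helper name is an assumption.
const DEFAULT_TIMEOUTS = {
  "prism-coder:27b": 120_000,
  "prism-coder:9b": 60_000,
  "prism-coder:4b": 20_000,
  "prism-coder:2b": 15_000,
};

function pickTimeout(tag, overrideMs) {
  // ?? only falls through on null/undefined, so an explicit 0 is kept as-is.
  return overrideMs ?? DEFAULT_TIMEOUTS[tag] ?? 60_000;
}
```

Because `??` is used rather than `||`, a caller passing `timeout_ms: 0` would get a zero timeout rather than the default.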
@@ -193,16 +215,20 @@ export async function listOllamaLoaded(url = PRISM_LOCAL_LLM_URL) {
|
|
|
193
215
|
return new Set();
|
|
194
216
|
}
|
|
195
217
|
}
|
|
196
|
-
async function callOllamaGenerate(url, model, prompt, system, maxTokens, temperature, timeoutMs) {
|
|
218
|
+
async function callOllamaGenerate(url, model, prompt, system, maxTokens, temperature, timeoutMs, think) {
|
|
197
219
|
try {
|
|
220
|
+
const messages = [];
|
|
221
|
+
if (system)
|
|
222
|
+
messages.push({ role: "system", content: system });
|
|
223
|
+
messages.push({ role: "user", content: prompt });
|
|
198
224
|
const body = {
|
|
199
225
|
model,
|
|
200
|
-
|
|
201
|
-
...(system ? { system } : {}),
|
|
226
|
+
messages,
|
|
202
227
|
stream: false,
|
|
228
|
+
...(think !== undefined ? { think } : {}),
|
|
203
229
|
options: { num_predict: maxTokens, temperature },
|
|
204
230
|
};
|
|
205
|
-
const res = await fetch(`${url}/api/
|
|
231
|
+
const res = await fetch(`${url}/api/chat`, {
|
|
206
232
|
method: "POST",
|
|
207
233
|
headers: { "Content-Type": "application/json" },
|
|
208
234
|
body: JSON.stringify(body),
|
|
@@ -214,10 +240,10 @@ async function callOllamaGenerate(url, model, prompt, system, maxTokens, tempera
|
|
|
214
240
|
const data = (await res.json());
|
|
215
241
|
if (data.error)
|
|
216
242
|
return { ok: false, reason: `ollama_err:${data.error}` };
|
|
217
|
-
const text = (data.
|
|
243
|
+
const text = (data.message?.content ?? "").trim();
|
|
218
244
|
if (!text)
|
|
219
245
|
return { ok: false, reason: "empty_response" };
|
|
220
|
-
return { ok: true, text };
|
|
246
|
+
return { ok: true, text, doneReason: data.done_reason, promptTokens: data.prompt_eval_count, completionTokens: data.eval_count };
|
|
221
247
|
}
|
|
222
248
|
catch (err) {
|
|
223
249
|
const name = err instanceof Error ? err.name : "Unknown";
|
|
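The two hunks above move the local call from a prompt-based request to a `messages`-based `/api/chat` request and read the reply from `data.message.content`. The sketch below restates that logic as two pure functions so it can be exercised without a running Ollama instance. The function names `buildChatBody` and `parseChatResponse` are hypothetical; the field names are taken from the diff.

```javascript
// Hypothetical pure-function restatement of the request/response handling.
function buildChatBody(model, prompt, system, maxTokens, temperature, think) {
  const messages = [];
  if (system) messages.push({ role: "system", content: system });
  messages.push({ role: "user", content: prompt });
  return {
    model,
    messages,
    stream: false,
    // Only send `think` when the caller set it explicitly.
    ...(think !== undefined ? { think } : {}),
    options: { num_predict: maxTokens, temperature },
  };
}

function parseChatResponse(data) {
  if (data.error) return { ok: false, reason: `ollama_err:${data.error}` };
  const text = (data.message?.content ?? "").trim();
  if (!text) return { ok: false, reason: "empty_response" };
  return {
    ok: true,
    text,
    doneReason: data.done_reason,
    promptTokens: data.prompt_eval_count,
    completionTokens: data.eval_count,
  };
}
```

Keeping the body construction separate from the `fetch` call makes the `think` and system-message branches easy to unit-test.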
@@ -275,12 +301,28 @@ async function callSynaluxInference(prompt, maxTokens, timeoutMs) {
|
|
|
275
301
|
export async function runInfer(args, deps) {
|
|
276
302
|
const t0 = Date.now();
|
|
277
303
|
const temperature = args.temperature ?? 0;
|
|
304
|
+
// ── L1 Safety — deterministic input interception ────────────
|
|
305
|
+
const safetyIntercept = checkInputSafety(args.prompt);
|
|
306
|
+
if (safetyIntercept) {
|
|
307
|
+
return {
|
|
308
|
+
output: safetyIntercept,
|
|
309
|
+
backend: "safety_gate",
|
|
310
|
+
model_picked: null,
|
|
311
|
+
ram_free_mb: Math.round(deps.freemem() / (1024 * 1024)),
|
|
312
|
+
latency_ms: Date.now() - t0,
|
|
313
|
+
used_cloud: false,
|
|
314
|
+
attempts: [{ tier: "l1_safety", reason: "crisis_or_medical_intercept" }],
|
|
315
|
+
};
|
|
316
|
+
}
|
|
278
317
|
// ── Entitlement enforcement ──────────────────────────────────
|
|
279
318
|
// Fetch user's plan limits (cached 1hr). Free users without auth
|
|
280
319
|
// get 4b ceiling, 50 calls/day, 512 max tokens.
|
|
281
320
|
const ent = deps.entitlements ?? await getEntitlements();
|
|
282
|
-
//
|
|
283
|
-
|
|
321
|
+
// MF2: In chat/code modes, request the 27B tier (subject to plan ceiling + RAM).
|
|
322
|
+
// mode:"code" implies quality → start higher in the cascade.
|
|
323
|
+
const mode = args.mode ?? "route";
|
|
324
|
+
const modeCeiling = (mode === "chat" || mode === "code") ? (args.model_ceiling ?? "27b") : args.model_ceiling;
|
|
325
|
+
const effectiveCeiling = clampCeiling(modeCeiling, ent.model_ceiling);
|
|
284
326
|
// Clamp max_tokens to plan limit
|
|
285
327
|
const maxTokens = Math.min(args.max_tokens ?? 1024, ent.max_tokens, 8192);
|
|
286
328
|
// Cloud fallback only for paid plans
|
|
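The ceiling logic in this hunk defaults chat/code calls to `"27b"` and then clamps the request against the plan's ceiling. `clampCeiling` lives in `entitlements.js`, whose body is not shown in this diff, so the clamp below is an assumed implementation (keep the smaller of the two tiers). The names `TIER_ORDER`, `clampCeilingSketch`, and `resolveCeiling` are hypothetical.

```javascript
// Largest first, matching the order of the tool's `model_ceiling` enum.
const TIER_ORDER = ["27b", "9b", "4b", "2b"];

// Assumed behavior: return whichever ceiling names the smaller model.
function clampCeilingSketch(requested, planCeiling) {
  if (requested === undefined) return planCeiling;
  if (planCeiling === undefined) return requested;
  const r = TIER_ORDER.indexOf(requested);
  const p = TIER_ORDER.indexOf(planCeiling);
  // A higher index means a smaller model.
  return r >= p ? requested : planCeiling;
}

function resolveCeiling(mode, modelCeiling, planCeiling) {
  // Mirrors the diff: chat/code default to the 27B tier unless overridden.
  const wantsQuality = mode === "chat" || mode === "code";
  const modeCeiling = wantsQuality ? (modelCeiling ?? "27b") : modelCeiling;
  return clampCeilingSketch(modeCeiling, planCeiling);
}
```

Under these assumptions, a free-tier user calling `mode: "code"` would still be capped at the plan ceiling rather than getting the 27B tier.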
@@ -326,16 +368,16 @@ export async function runInfer(args, deps) {
|
|
|
326
368
|
// Walk the tier table top → bottom, capped by model_ceiling. Each tier
|
|
327
369
|
// logs its skip reason ("not_pulled" / "ram_insufficient" / fail reason)
|
|
328
370
|
// so the caller can see exactly why each tier was bypassed.
|
|
371
|
+
let localDraft = null;
|
|
329
372
|
if (installed) {
|
|
330
|
-
// Find start index from ceiling — if no ceiling, start at the top (32B).
|
|
331
373
|
const ceilStart = effectiveCeiling
|
|
332
374
|
? Math.max(0, MODEL_TIERS.findIndex(t => t.tag.endsWith(`:${effectiveCeiling}`)))
|
|
333
375
|
: 0;
|
|
334
376
|
let anyViable = false;
|
|
335
377
|
for (let i = ceilStart; i < MODEL_TIERS.length; i++) {
|
|
336
378
|
const tier = MODEL_TIERS[i];
|
|
337
|
-
// Accept the tier whether Ollama reports it as bare (`prism-coder:
|
|
338
|
-
// or namespaced (`dcostenco/prism-coder:
|
|
379
|
+
// Accept the tier whether Ollama reports it as bare (`prism-coder:27b`)
|
|
380
|
+
// or namespaced (`dcostenco/prism-coder:27b`, the form `ollama pull`
|
|
339
381
|
// produces from a HF repo). resolveOllamaName returns the actual
|
|
340
382
|
// name Ollama knows so /api/generate finds the model.
|
|
341
383
|
const ollamaName = resolveOllamaName(tier.tag, installed);
|
|
@@ -352,9 +394,27 @@ export async function runInfer(args, deps) {
|
|
|
352
394
|
}
|
|
353
395
|
anyViable = true;
|
|
354
396
|
const timeout = args.timeout_ms ?? DEFAULT_TIMEOUTS[tier.tag] ?? 60_000;
|
|
355
|
-
const
|
|
397
|
+
const enableThink = args.think ?? (mode !== "route");
|
|
398
|
+
const result = await deps.callLocal(deps.ollamaUrl, ollamaName, args.prompt, args.system, maxTokens, temperature, timeout, enableThink);
|
|
356
399
|
if (result.ok) {
|
|
357
|
-
|
|
400
|
+
const { stripped, thinkOnly } = stripThink(result.text);
|
|
401
|
+
const output = stripped;
|
|
402
|
+
// Quality gate for chat/code modes
|
|
403
|
+
if (mode !== "route") {
|
|
404
|
+
const gate = passesQualityGate(output, thinkOnly, result.doneReason);
|
|
405
|
+
if (!gate.pass && allowCloud) {
|
|
406
|
+
debugLog(`[prism_infer] quality gate FAIL (${gate.reason}) — escalating to cloud`);
|
|
407
|
+
attempts.push({ tier: tier.tag, reason: `quality_gate:${gate.reason}` });
|
|
408
|
+
if (gate.reason === "hard_truncation" || gate.reason === "loop_detected") {
|
|
409
|
+
localDraft = { output, tier: tier.tag, promptTokens: result.promptTokens, completionTokens: result.completionTokens };
|
|
410
|
+
}
|
|
411
|
+
break;
|
|
412
|
+
}
|
|
413
|
+
if (!gate.pass) {
|
|
414
|
+
debugLog(`[prism_infer] quality gate FAIL (${gate.reason}) — no cloud, serving local`);
|
|
415
|
+
}
|
|
416
|
+
}
|
|
417
|
+
return await applyVerification(output, gatedArgs, deps, {
|
|
358
418
|
backend: `ollama-${tier.tag.replace("prism-coder:", "")}`,
|
|
359
419
|
model_picked: tier.tag,
|
|
360
420
|
ram_free_mb: ramFreeMb,
|
|
@@ -362,6 +422,8 @@ export async function runInfer(args, deps) {
|
|
|
362
422
|
used_cloud: false,
|
|
363
423
|
attempts,
|
|
364
424
|
plan: ent.plan,
|
|
425
|
+
prompt_tokens: result.promptTokens,
|
|
426
|
+
completion_tokens: result.completionTokens,
|
|
365
427
|
});
|
|
366
428
|
}
|
|
367
429
|
attempts.push({ tier: tier.tag, reason: result.reason });
|
|
@@ -385,6 +447,8 @@ export async function runInfer(args, deps) {
|
|
|
385
447
|
used_cloud: true,
|
|
386
448
|
attempts,
|
|
387
449
|
plan: ent.plan,
|
|
450
|
+
prompt_tokens: Math.ceil(args.prompt.length / 4),
|
|
451
|
+
completion_tokens: Math.ceil(cloud.output.length / 4),
|
|
388
452
|
});
|
|
389
453
|
}
|
|
390
454
|
attempts.push({ tier: "synalux", reason: cloud.reason ?? "unknown" });
|
|
@@ -392,7 +456,22 @@ export async function runInfer(args, deps) {
|
|
|
392
456
|
else {
|
|
393
457
|
attempts.push({ tier: "synalux", reason: "cloud_fallback_disabled" });
|
|
394
458
|
}
|
|
395
|
-
//
|
|
459
|
+
// Cloud also failed — serve the local draft if we have one
|
|
460
|
+
if (localDraft) {
|
|
461
|
+
debugLog(`[prism_infer] cloud failed, serving gate-failed local draft from ${localDraft.tier}`);
|
|
462
|
+
return await applyVerification(localDraft.output, gatedArgs, deps, {
|
|
463
|
+
backend: `ollama-${localDraft.tier.replace("prism-coder:", "")}`,
|
|
464
|
+
model_picked: localDraft.tier,
|
|
465
|
+
ram_free_mb: ramFreeMb,
|
|
466
|
+
latency_ms: Date.now() - t0,
|
|
467
|
+
used_cloud: false,
|
|
468
|
+
attempts,
|
|
469
|
+
plan: ent.plan,
|
|
470
|
+
prompt_tokens: localDraft.promptTokens,
|
|
471
|
+
completion_tokens: localDraft.completionTokens,
|
|
472
|
+
quality_gate_failed: true,
|
|
473
|
+
});
|
|
474
|
+
}
|
|
396
475
|
const err = new Error(`prism_infer: no backend produced output. attempts=${JSON.stringify(attempts)}, free=${fmtGb(freeBytes)}`);
|
|
397
476
|
err.attempts = attempts;
|
|
398
477
|
throw err;
|
|
@@ -405,9 +484,11 @@ export async function runInfer(args, deps) {
|
|
|
405
484
|
* field so callers can route refusals separately from successes.
|
|
406
485
|
*/
|
|
407
486
|
async function applyVerification(draft, args, deps, partial) {
|
|
487
|
+
// L1 output safety — intercept dangerous model-generated content
|
|
488
|
+
const safeDraft = checkOutputSafety(draft);
|
|
408
489
|
const shouldVerify = args.verify ?? (args.evidence !== undefined && args.evidence.length > 0);
|
|
409
490
|
if (!shouldVerify || !deps.callVerifier) {
|
|
410
|
-
return { ...partial, output:
|
|
491
|
+
return { ...partial, output: safeDraft };
|
|
411
492
|
}
|
|
412
493
|
const verifier = deps.callVerifier;
|
|
413
494
|
const outcome = await verifier({
|
|
@@ -419,7 +500,7 @@ async function applyVerification(draft, args, deps, partial) {
|
|
|
419
500
|
});
|
|
420
501
|
return {
|
|
421
502
|
...partial,
|
|
422
|
-
output: outcome.finalText,
|
|
503
|
+
output: checkOutputSafety(outcome.finalText),
|
|
423
504
|
verification: {
|
|
424
505
|
action: outcome.action,
|
|
425
506
|
verifierChain: outcome.verifierChain,
|
|
@@ -444,12 +525,29 @@ export async function prismInferHandler(args) {
|
|
|
444
525
|
ollamaUrl: PRISM_LOCAL_LLM_URL,
|
|
445
526
|
});
|
|
446
527
|
debugLog(`[prism_infer] backend=${result.backend} model=${result.model_picked} latency=${result.latency_ms}ms free=${result.ram_free_mb}MB`);
|
|
528
|
+
// Forward per-call metrics to portal (thin-client pattern).
|
|
529
|
+
// safety_gate excluded — logging crisis filter triggers is a HIPAA concern.
|
|
530
|
+
if (result.backend !== "safety_gate") {
|
|
531
|
+
ddLog("info", "prism_infer.usage", {
|
|
532
|
+
backend: result.backend,
|
|
533
|
+
model: result.model_picked ?? result.backend,
|
|
534
|
+
used_cloud: result.used_cloud,
|
|
535
|
+
prompt_tokens: result.prompt_tokens ?? 0,
|
|
536
|
+
completion_tokens: result.completion_tokens ?? 0,
|
|
537
|
+
latency_ms: result.latency_ms,
|
|
538
|
+
});
|
|
539
|
+
}
|
|
540
|
+
const tokenStr = result.prompt_tokens != null || result.completion_tokens != null
|
|
541
|
+
? ` tokens=${result.prompt_tokens ?? "?"}in/${result.completion_tokens ?? "?"}out`
|
|
542
|
+
: "";
|
|
447
543
|
const header = `[prism_infer] backend=${result.backend}` +
|
|
448
544
|
` model=${result.model_picked ?? "n/a"}` +
|
|
449
545
|
` plan=${result.plan ?? "unknown"}` +
|
|
450
546
|
` free_ram=${result.ram_free_mb}MB` +
|
|
451
547
|
` latency=${result.latency_ms}ms` +
|
|
452
548
|
` used_cloud=${result.used_cloud}` +
|
|
549
|
+
tokenStr +
|
|
550
|
+
(result.quality_gate_failed ? ` quality_gate_failed=true` : "") +
|
|
453
551
|
(result.verification ? ` verify=${result.verification.action}` : "") +
|
|
454
552
|
(result.attempts.length ? ` attempts=${JSON.stringify(result.attempts)}` : "");
|
|
455
553
|
return {
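The chat/code escalation path added to `runInfer` in this file can be sketched as a single decision function. This is a simplified illustration, not the shipped code: `decide`, `DRAFT_WORTHY`, and the flat argument object are hypothetical names, and the real handler walks several local tiers and calls `applyVerification` before returning.

```javascript
// Hypothetical condensation of the quality-gate / cloud / local-draft flow.
// Only truncated or looping answers are kept as a fallback draft, as in the diff.
const DRAFT_WORTHY = new Set(["hard_truncation", "loop_detected"]);

function decide({ localOutput, gate, allowCloud, cloudOutput }) {
  const attempts = [];

  // Gate passed, or there is no cloud to escalate to: serve the local answer.
  if (gate.pass || !allowCloud) {
    return { output: localOutput, used_cloud: false, attempts };
  }

  // Gate failed and cloud is allowed: record the reason and escalate.
  attempts.push({ tier: "local", reason: `quality_gate:${gate.reason}` });
  const localDraft = DRAFT_WORTHY.has(gate.reason) ? localOutput : null;

  if (cloudOutput != null) {
    return { output: cloudOutput, used_cloud: true, attempts };
  }
  attempts.push({ tier: "cloud", reason: "unknown" });

  // Cloud failed too: a flagged local draft is better than nothing.
  if (localDraft != null) {
    return { output: localDraft, used_cloud: false, quality_gate_failed: true, attempts };
  }
  throw new Error(`no backend produced output. attempts=${JSON.stringify(attempts)}`);
}
```

Note that a `think_only` or `empty_response` failure leaves no draft, so when the cloud call also fails the sketch throws, which appears to match the behaviour of the diff above.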
|
package/dist/tools/taskRouterHandler.js
CHANGED
|
@@ -317,7 +317,7 @@ export async function sessionTaskRouteHandler(args) {
|
|
|
317
317
|
delete result._rawComposite;
|
|
318
318
|
// ── v9.x: Local LLM second-opinion for low-confidence cases ──────────────
|
|
319
319
|
// When confidence is below the threshold AND local LLM is enabled,
|
|
320
|
-
// ask prism-coder:
|
|
320
|
+
// ask prism-coder:9b to break the tie. This is purely additive — if the
|
|
321
321
|
// LLM call fails or times out, the original heuristic result is returned.
|
|
322
322
|
if (PRISM_LOCAL_LLM_ENABLED &&
|
|
323
323
|
result.confidence < PRISM_TASK_ROUTER_CONFIDENCE_THRESHOLD) {
|
|
@@ -350,7 +350,7 @@ export async function sessionTaskRouteHandler(args) {
|
|
|
350
350
|
}
|
|
351
351
|
// ─── Local LLM Route Classifier ──────────────────────────────
|
|
352
352
|
/**
|
|
353
|
-
* Ask prism-coder:
|
|
353
|
+
* Ask prism-coder:9b to classify a task description as "claw" or "host".
|
|
354
354
|
* Returns the string or null if the model is unavailable / response unparseable.
|
|
355
355
|
* Called only when heuristic confidence is below the threshold.
|
|
356
356
|
*/
|
package/dist/utils/ddLogger.js
CHANGED
|
@@ -8,9 +8,17 @@
|
|
|
8
8
|
* Env: PRISM_SYNALUX_BASE_URL (default https://synalux.ai)
|
|
9
9
|
*/
|
|
10
10
|
const SYNALUX_BASE = process.env.PRISM_SYNALUX_BASE_URL || "https://synalux.ai";
|
|
11
|
+
const TELEMETRY_WRITE_TOKEN = process.env.TELEMETRY_WRITE_TOKEN || "";
|
|
11
12
|
const DD_API_KEY = process.env.DD_API_KEY || "";
|
|
12
13
|
const DD_SITE = process.env.DD_SITE || "datadoghq.com";
|
|
13
14
|
const SERVICE = "prism-mcp";
|
|
15
|
+
const CONTEXT_ALLOWLIST = new Set([
|
|
16
|
+
"backend", "model", "used_cloud", "prompt_tokens", "completion_tokens",
|
|
17
|
+
"latency_ms", "plan", "requested_ceiling", "effective_ceiling",
|
|
18
|
+
"ceiling_clamped", "requested_tokens", "effective_tokens", "tokens_clamped",
|
|
19
|
+
"cloud_requested", "cloud_allowed", "cloud_blocked",
|
|
20
|
+
"verify_requested", "verify_allowed", "verify_blocked",
|
|
21
|
+
]);
|
|
14
22
|
const queue = [];
|
|
15
23
|
let flushTimer = null;
|
|
16
24
|
const FLUSH_INTERVAL_MS = 5_000;
|
|
@@ -26,31 +34,61 @@ async function flush() {
|
|
|
26
34
|
return;
|
|
27
35
|
const batch = queue.splice(0, MAX_BATCH);
|
|
28
36
|
// Primary: Synalux portal → Supabase (always available)
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
37
|
+
if (TELEMETRY_WRITE_TOKEN) {
|
|
38
|
+
try {
|
|
39
|
+
await fetch(`${SYNALUX_BASE}/api/v1/telemetry`, {
|
|
40
|
+
method: "POST",
|
|
41
|
+
headers: {
|
|
42
|
+
"Content-Type": "application/json",
|
|
43
|
+
"Authorization": `Bearer ${TELEMETRY_WRITE_TOKEN}`,
|
|
44
|
+
"X-Prism-Client": "prism-mcp",
|
|
45
|
+
},
|
|
46
|
+
body: JSON.stringify(batch.map(e => {
|
|
47
|
+
const ctx = {};
|
|
48
|
+
for (const [k, v] of Object.entries(e)) {
|
|
49
|
+
if (CONTEXT_ALLOWLIST.has(k))
|
|
50
|
+
ctx[k] = v;
|
|
51
|
+
}
|
|
52
|
+
return {
|
|
53
|
+
service: SERVICE,
|
|
54
|
+
event_type: e.status === "error" ? "error" : "action",
|
|
55
|
+
message: e.message,
|
|
56
|
+
context: ctx,
|
|
57
|
+
user_id: e.user_id,
|
|
58
|
+
user_plan: e.user_plan,
|
|
59
|
+
};
|
|
60
|
+
})),
|
|
61
|
+
signal: AbortSignal.timeout(5_000),
|
|
62
|
+
});
|
|
63
|
+
}
|
|
64
|
+
catch {
|
|
65
|
+
// Silent — don't crash the MCP server
|
|
66
|
+
}
|
|
46
67
|
}
|
|
47
68
|
// Secondary: Datadog Logs (if API key is set AND Logs product is enabled)
|
|
69
|
+
// Same allowlist applied — both sinks get identical filtered context.
|
|
48
70
|
if (DD_API_KEY) {
|
|
49
71
|
try {
|
|
50
72
|
await fetch(`https://http-intake.logs.${DD_SITE}/api/v2/logs`, {
|
|
51
73
|
method: "POST",
|
|
52
74
|
headers: { "Content-Type": "application/json", "DD-API-KEY": DD_API_KEY },
|
|
53
|
-
body: JSON.stringify(batch
|
|
75
|
+
body: JSON.stringify(batch.map(e => {
|
|
76
|
+
const ctx = {};
|
|
77
|
+
for (const [k, v] of Object.entries(e)) {
|
|
78
|
+
if (CONTEXT_ALLOWLIST.has(k))
|
|
79
|
+
ctx[k] = v;
|
|
80
|
+
}
|
|
81
|
+
return {
|
|
82
|
+
ddsource: "nodejs",
|
|
83
|
+
ddtags: e.ddtags,
|
|
84
|
+
hostname: e.hostname,
|
|
85
|
+
service: SERVICE,
|
|
86
|
+
status: e.status,
|
|
87
|
+
message: e.message,
|
|
88
|
+
...ctx,
|
|
89
|
+
timestamp: e.timestamp,
|
|
90
|
+
};
|
|
91
|
+
})),
|
|
54
92
|
signal: AbortSignal.timeout(5_000),
|
|
55
93
|
});
|
|
56
94
|
}
|
|
@@ -68,7 +106,7 @@ export function ddLog(level, message, context) {
|
|
|
68
106
|
hostname: process.env.HOSTNAME || "prism-mcp",
|
|
69
107
|
service: SERVICE,
|
|
70
108
|
status: level,
|
|
71
|
-
message,
|
|
109
|
+
message: message.slice(0, 200),
|
|
72
110
|
...context,
|
|
73
111
|
timestamp: new Date().toISOString(),
|
|
74
112
|
});
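Both sinks in `flush()` repeat the same loop over `CONTEXT_ALLOWLIST`. A minimal sketch of that filter as a standalone helper follows; `pickAllowed` is a hypothetical name that does not appear in the package, and the allowlist here is abbreviated to six of the keys shown above.

```javascript
// Abbreviated copy of the allowlist from the diff (the real set has more keys).
const CONTEXT_ALLOWLIST = new Set([
  "backend", "model", "used_cloud",
  "prompt_tokens", "completion_tokens", "latency_ms",
]);

// Hypothetical helper: keep only allowlisted keys so free-form context
// (prompts, outputs, file paths) never reaches a telemetry sink.
function pickAllowed(entry, allowlist = CONTEXT_ALLOWLIST) {
  return Object.fromEntries(
    Object.entries(entry).filter(([key]) => allowlist.has(key))
  );
}

const filtered = pickAllowed({
  backend: "ollama-9b",
  latency_ms: 812,
  prompt: "text that must not be forwarded",
});
console.log(filtered);
```

Under these assumptions, the `prompt` key is dropped and only `backend` and `latency_ms` survive. Sharing one helper between the Synalux and Datadog payload builders would be one way to keep the two filters from drifting apart.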
|
package/dist/utils/entitlements.js
CHANGED
|
@@ -32,7 +32,7 @@ const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes
|
|
|
32
32
|
let cache = null;
|
|
33
33
|
let inFlight = null;
|
|
34
34
|
// ── Model tier ordering for ceiling enforcement ───────────────────
|
|
35
|
-
const TIER_ORDER = ["2b", "4b", "9b", "
|
|
35
|
+
const TIER_ORDER = ["2b", "4b", "9b", "27b"];
|
|
36
36
|
/**
|
|
37
37
|
* Returns true if `requested` exceeds `ceiling`.
|
|
38
38
|
* e.g. ceilingExceeded("9b", "4b") → true (9b > 4b ceiling)
|
package/dist/utils/inferenceMetrics.js
ADDED
|
@@ -0,0 +1,64 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* Inference metrics — thin-client fetch from Synalux portal.
|
|
3
|
+
*
|
|
4
|
+
* Prism forwards per-call metrics via ddLog("prism_infer.usage").
|
|
5
|
+
* The portal aggregates them in app_telemetry. This module fetches
|
|
6
|
+
* the aggregated summary on demand (session_save_ledger/handoff).
|
|
7
|
+
*/
|
|
8
|
+
import { getSynaluxJwt } from "./synaluxJwt.js";
|
|
9
|
+
import { PRISM_SYNALUX_BASE_URL } from "../config.js";
|
|
10
|
+
import { debugLog } from "./logger.js";
|
|
11
|
+
let sessionStartedAt = new Date().toISOString();
|
|
12
|
+
export function markSessionStart() {
|
|
13
|
+
sessionStartedAt = new Date().toISOString();
|
|
14
|
+
}
|
|
15
|
+
async function fetchMetrics() {
|
|
16
|
+
if (!PRISM_SYNALUX_BASE_URL)
|
|
17
|
+
return { metrics: null, error: "no_portal_url" };
|
|
18
|
+
const jwt = await getSynaluxJwt();
|
|
19
|
+
if (!jwt)
|
|
20
|
+
return { metrics: null, error: "jwt_unavailable" };
|
|
21
|
+
try {
|
|
22
|
+
const url = `${PRISM_SYNALUX_BASE_URL}/api/v1/telemetry/inference-metrics?since=${encodeURIComponent(sessionStartedAt)}`;
|
|
23
|
+
const res = await fetch(url, {
|
|
24
|
+
headers: { "Authorization": `Bearer ${jwt}` },
|
|
25
|
+
signal: AbortSignal.timeout(5_000),
|
|
26
|
+
});
|
|
27
|
+
if (!res.ok) {
|
|
28
|
+
debugLog(`[inference-metrics] portal returned ${res.status}`);
|
|
29
|
+
return { metrics: null, error: `portal_${res.status}` };
|
|
30
|
+
}
|
|
31
|
+
return { metrics: (await res.json()) };
|
|
32
|
+
}
|
|
33
|
+
catch (err) {
|
|
34
|
+
const msg = err instanceof Error ? err.message : String(err);
|
|
35
|
+
debugLog(`[inference-metrics] fetch failed: ${msg}`);
|
|
36
|
+
return { metrics: null, error: msg };
|
|
37
|
+
}
|
|
38
|
+
}
|
|
39
|
+
export async function fetchPortalInferenceMetrics() {
|
|
40
|
+
const { metrics, error } = await fetchMetrics();
|
|
41
|
+
if (!metrics) {
|
|
42
|
+
if (error)
|
|
43
|
+
debugLog(`[inference-metrics] unavailable: ${error}`);
|
|
44
|
+
return "";
|
|
45
|
+
}
|
|
46
|
+
if (metrics.total_calls === 0)
|
|
47
|
+
return "";
|
|
48
|
+
const lines = [
|
|
49
|
+
`\n📊 Inference Metrics (this session):`,
|
|
50
|
+
` Total calls: ${metrics.total_calls} — Local: ${metrics.local_calls} (${metrics.local_pct}%) | Cloud: ${metrics.cloud_calls} (${metrics.cloud_pct}%)`,
|
|
51
|
+
` Tokens: ${metrics.total_prompt_tokens.toLocaleString()} in + ${metrics.total_completion_tokens.toLocaleString()} out = ${metrics.total_tokens.toLocaleString()} total`,
|
|
52
|
+
` Avg latency: ${metrics.avg_latency_ms}ms`,
|
|
53
|
+
];
|
|
54
|
+
const models = Object.entries(metrics.by_model).sort((a, b) => b[1].calls - a[1].calls);
|
|
55
|
+
if (models.length > 1) {
|
|
56
|
+
lines.push(` By model:`);
|
|
57
|
+
for (const [name, stats] of models) {
|
|
58
|
+
const tokens = stats.prompt_tokens + stats.completion_tokens;
|
|
59
|
+
const avgMs = stats.calls > 0 ? Math.round(stats.total_latency_ms / stats.calls) : 0;
|
|
60
|
+
lines.push(` ${name}: ${stats.calls} calls, ${tokens.toLocaleString()} tokens, avg ${avgMs}ms`);
|
|
61
|
+
}
|
|
62
|
+
}
|
|
63
|
+
return lines.join("\n");
|
|
64
|
+
}
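The portal response shape is not documented in this diff, but it can be inferred from the properties that `fetchPortalInferenceMetrics` reads. The sample below is a hypothetical payload with made-up numbers, and `summarizeByModel` is an illustrative name for the per-model arithmetic the function performs inline.

```javascript
// Hypothetical payload; field names are taken from the property accesses
// in the diff, the values are invented for illustration.
const sample = {
  total_calls: 3,
  local_calls: 2, local_pct: 67,
  cloud_calls: 1, cloud_pct: 33,
  total_prompt_tokens: 400,
  total_completion_tokens: 200,
  total_tokens: 600,
  avg_latency_ms: 1300,
  by_model: {
    "synalux": { calls: 1, prompt_tokens: 100, completion_tokens: 80, total_latency_ms: 2100 },
    "prism-coder:9b": { calls: 2, prompt_tokens: 300, completion_tokens: 120, total_latency_ms: 1800 },
  },
};

// Sort models by call count (descending) and derive tokens and mean latency.
function summarizeByModel(byModel) {
  return Object.entries(byModel)
    .sort((a, b) => b[1].calls - a[1].calls)
    .map(([name, stats]) => ({
      name,
      tokens: stats.prompt_tokens + stats.completion_tokens,
      avgMs: stats.calls > 0 ? Math.round(stats.total_latency_ms / stats.calls) : 0,
    }));
}

console.log(summarizeByModel(sample.by_model));
```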
|
package/dist/utils/localLlm.js
CHANGED
|
@@ -1,5 +1,5 @@
|
|
|
1
1
|
/**
|
|
2
|
-
* Local LLM Client — Ollama/prism-coder:
|
|
2
|
+
* Local LLM Client — Ollama/prism-coder:9b Integration (v1.0.0)
|
|
3
3
|
* ──────────────────────────────────────────────────────────────────
|
|
4
4
|
* Thin HTTP wrapper around the Ollama /api/chat endpoint.
|
|
5
5
|
*
|
|
@@ -9,7 +9,7 @@
|
|
|
9
9
|
* - Silent fail: returning null instead of throwing ensures callers
|
|
10
10
|
* can fall back to Gemini without crashing the MCP server.
|
|
11
11
|
* - Fire-and-forget safe: wrapped in try/catch, never propagates.
|
|
12
|
-
* - Default model: prism-coder:
|
|
12
|
+
* - Default model: prism-coder:9b — fine-tuned on Prism tool schemas,
|
|
13
13
|
* 8192-token context, Q8_0 quantization, ~8.1GB RAM footprint.
|
|
14
14
|
*
|
|
15
15
|
* FEATURE FLAG:
|
package/dist/utils/modelPicker.js
CHANGED
|
@@ -1,19 +1,19 @@
|
|
|
1
1
|
/**
|
|
2
2
|
* RAM-Gated Local Model Picker
|
|
3
3
|
* ─────────────────────────────────────────────────────────────
|
|
4
|
-
* Cascade: 9b (default) → 4b (verifier) → 2b (mobile) →
|
|
4
|
+
* Cascade: 9b (default) → 4b (verifier) → 2b (mobile) → 27b (quality).
|
|
5
5
|
*
|
|
6
|
-
* The default ceiling is "9b" — NOT "
|
|
6
|
+
* The default ceiling is "9b" — NOT "27b". This means:
|
|
7
7
|
* - 9b is the primary model for routing + general inference (Qwen3.5-9B, 100% BFCL)
|
|
8
8
|
* - 4b is used as the grounding verifier (fast, small)
|
|
9
9
|
* - 2b is the mobile/iPhone first gate (Qwen3.5-2B, 99.1% BFCL)
|
|
10
|
-
* -
|
|
10
|
+
* - 27b is only loaded when caller explicitly passes ceiling="27b"
|
|
11
11
|
* or when the task requires maximum quality (complex code gen, etc.)
|
|
12
12
|
*
|
|
13
|
-
* This saves
|
|
13
|
+
* This saves 11GB+ RAM vs 27b and keeps response times fast.
|
|
14
14
|
*
|
|
15
15
|
* tag weights need free ctx role
|
|
16
|
-
* prism-coder:
|
|
16
|
+
* prism-coder:27b ~16 GB ≥ 20 GB 32K quality (on-demand, Qwen3.5 DeltaNet, 100% BFCL)
|
|
17
17
|
* prism-coder:9b ~ 5.8 GB ≥ 8 GB 32K default router (Qwen3.5, 100% BFCL)
|
|
18
18
|
* prism-coder:4b ~ 3.4 GB ≥ 5 GB 32K verifier (Qwen3.5, 100%)
|
|
19
19
|
* prism-coder:2b ~ 2.3 GB ≥ 3 GB 8K mobile / iPhone (Qwen3.5, 99.1%)
|
|
@@ -26,30 +26,30 @@ const GB = 1024 ** 3;
|
|
|
26
26
|
* the first row whose minFreeGb fits within freeBytes.
|
|
27
27
|
*/
|
|
28
28
|
export const MODEL_TIERS = [
|
|
29
|
-
{ tag: 'prism-coder:
|
|
29
|
+
{ tag: 'prism-coder:27b', weightsGb: 16, minFreeGb: 20, ctxTokens: 32_768 },
|
|
30
30
|
{ tag: 'prism-coder:9b', weightsGb: 5.8, minFreeGb: 8, ctxTokens: 32_768 },
|
|
31
31
|
{ tag: 'prism-coder:4b', weightsGb: 3.4, minFreeGb: 5, ctxTokens: 32_768 },
|
|
32
32
|
{ tag: 'prism-coder:2b', weightsGb: 2.3, minFreeGb: 3, ctxTokens: 8_192 },
|
|
33
33
|
];
|
|
34
34
|
/**
|
|
35
35
|
* True when `installed` matches `tierTag` either as a bare tag
|
|
36
|
-
* (`prism-coder:
|
|
37
|
-
* (`dcostenco/prism-coder:
|
|
38
|
-
* dcostenco/prism-coder:
|
|
36
|
+
* (`prism-coder:27b`) or as a namespaced HuggingFace-style tag
|
|
37
|
+
* (`dcostenco/prism-coder:27b`). The README documents `ollama pull
|
|
38
|
+
* dcostenco/prism-coder:27b`, so Ollama's /api/tags returns the
|
|
39
39
|
* namespaced form — without this matcher the picker would never
|
|
40
40
|
* see them and silently fall through to cloud.
|
|
41
41
|
*/
|
|
42
42
|
function tagMatches(installed, tierTag) {
|
|
43
43
|
return installed === tierTag || installed.endsWith(`/${tierTag}`);
|
|
44
44
|
}
|
|
45
|
-
/** Default ceiling: 9b. Pass ceiling="
|
|
45
|
+
/** Default ceiling: 9b. Pass ceiling="27b" explicitly for max quality. */
|
|
46
46
|
export const DEFAULT_CEILING = "9b";
|
|
47
47
|
/**
|
|
48
48
|
* Pick the best viable tier for the given free RAM.
|
|
49
|
-
* Default ceiling is 9b — use ceiling="
|
|
49
|
+
* Default ceiling is 9b — use ceiling="27b" only for complex tasks.
|
|
50
50
|
*
|
|
51
51
|
* @param freeBytes Result of os.freemem() — binary bytes
|
|
52
|
-
* @param ceiling Cap tier. Default "9b". Pass "
|
|
52
|
+
* @param ceiling Cap tier. Default "9b". Pass "27b" for complex tasks.
|
|
53
53
|
* @param available Optional whitelist of installed Ollama tags.
|
|
54
54
|
*/
|
|
55
55
|
export function pickLocalModel(freeBytes, ceiling, available) {
|
|
@@ -79,7 +79,7 @@ export function pickLocalModel(freeBytes, ceiling, available) {
|
|
|
79
79
|
}
|
|
80
80
|
/**
|
|
81
81
|
* Resolve a tier tag to the actual Ollama name installed locally.
|
|
82
|
-
* If `installed` contains a namespaced match (e.g. `dcostenco/prism-coder:
|
|
82
|
+
* If `installed` contains a namespaced match (e.g. `dcostenco/prism-coder:27b`),
|
|
83
83
|
* the namespaced form is returned so Ollama's /api/generate finds it.
|
|
84
84
|
* Falls back to the bare tag when only the bare form is present.
|
|
85
85
|
*/
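The body of `pickLocalModel` is not part of this diff, so the following is only a sketch of how a RAM-gated, ceiling-capped pick could work with the tier table and `tagMatches` shown above. `pickTier` is a hypothetical stand-in, and its handling of an unknown ceiling (start at the top tier) is an assumption borrowed from the `ceilStart` logic in `runInfer`.

```javascript
const GB = 1024 ** 3;

// Tier table as it appears in the diff, ordered largest to smallest.
const MODEL_TIERS = [
  { tag: "prism-coder:27b", weightsGb: 16, minFreeGb: 20, ctxTokens: 32_768 },
  { tag: "prism-coder:9b", weightsGb: 5.8, minFreeGb: 8, ctxTokens: 32_768 },
  { tag: "prism-coder:4b", weightsGb: 3.4, minFreeGb: 5, ctxTokens: 32_768 },
  { tag: "prism-coder:2b", weightsGb: 2.3, minFreeGb: 3, ctxTokens: 8_192 },
];

// Same matcher as the diff: bare tag or namespaced "owner/tag".
function tagMatches(installed, tierTag) {
  return installed === tierTag || installed.endsWith(`/${tierTag}`);
}

// Hypothetical picker: start at the ceiling tier and walk down until a tier
// is both installed (when a list is given) and fits in free RAM.
function pickTier(freeBytes, ceiling = "9b", installed = null) {
  const start = Math.max(0, MODEL_TIERS.findIndex(t => t.tag.endsWith(`:${ceiling}`)));
  for (const tier of MODEL_TIERS.slice(start)) {
    const pulled = installed === null || installed.some(name => tagMatches(name, tier.tag));
    if (pulled && freeBytes >= tier.minFreeGb * GB) return tier.tag;
  }
  return null; // nothing viable locally
}
```

With the default ceiling, 24 GB of free RAM still selects `prism-coder:9b`; the 27b tier is considered only when the caller raises the ceiling.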
|
package/dist/utils/nerExtractor.js
CHANGED
|
@@ -16,7 +16,7 @@
|
|
|
16
16
|
*
|
|
17
17
|
* Architecture:
|
|
18
18
|
* 1. Rule-based extraction (fast, zero-cost, always available)
|
|
19
|
-
* 2. Local LLM extraction (optional, higher quality, uses prism-coder:
|
|
19
|
+
* 2. Local LLM extraction (optional, higher quality, uses prism-coder:9b)
|
|
20
20
|
* 3. Merged + deduplicated results
|
|
21
21
|
*/
|
|
22
22
|
import { debugLog } from "./logger.js";
|
package/dist/utils/qualityGate.js
ADDED
|
@@ -0,0 +1,67 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* Quality Gate — deterministic check for obvious inference failures.
|
|
3
|
+
*
|
|
4
|
+
* NARROW by design: only high-precision signals that rarely false-positive.
|
|
5
|
+
* Does NOT judge correctness — that's the grounding verifier's job.
|
|
6
|
+
* Does NOT use refusal regex (too many false positives on legitimate output).
|
|
7
|
+
*
|
|
8
|
+
* Returns: { pass: boolean, reason?: string }
|
|
9
|
+
*/
|
|
10
|
+
/**
|
|
11
|
+
* Check if a model response passes the quality gate.
|
|
12
|
+
* @param stripped Response AFTER think-stripping (use stripThink first)
|
|
13
|
+
* @param thinkOnly True if the response was only <think> blocks with no answer
|
|
14
|
+
* @param finishReason Ollama's finish_reason if available (e.g. "length" = truncated)
|
|
15
|
+
*/
|
|
16
|
+
export function passesQualityGate(stripped, thinkOnly, finishReason) {
|
|
17
|
+
// Signal 1: Think-only — model reasoned but produced no answer (check before empty)
|
|
18
|
+
if (thinkOnly) {
|
|
19
|
+
return { pass: false, reason: "think_only" };
|
|
20
|
+
}
|
|
21
|
+
// Signal 2: Empty or near-empty after stripping
|
|
22
|
+
if (stripped.trim().length < 5) {
|
|
23
|
+
return { pass: false, reason: "empty_response" };
|
|
24
|
+
}
|
|
25
|
+
// Signal 3: Hard truncation — Ollama reports finish_reason="length"
|
|
26
|
+
// meaning the model hit num_predict before finishing
|
|
27
|
+
if (finishReason === "length") {
|
|
28
|
+
return { pass: false, reason: "hard_truncation" };
|
|
29
|
+
}
|
|
30
|
+
// Signal 4: Exact-loop detection (two passes).
|
|
31
|
+
//
|
|
32
|
+
// Pass A (prose-only, threshold ≥3): strip structural markdown that
|
|
33
|
+
// naturally repeats (code blocks, tables, headings, bold labels).
|
|
34
|
+
// Catches loops in explanatory text.
|
|
35
|
+
const proseOnly = stripped
|
|
36
|
+
.replace(/```[\s\S]*?```/g, "")
|
|
37
|
+
.replace(/^\|.*\|$/gm, "")
|
|
38
|
+
.replace(/^#{1,6}\s+.*$/gm, "")
|
|
39
|
+
.replace(/^[\s*-]*\*{1,2}[^*]+\*{1,2}:?\s*$/gm, "");
|
|
40
|
+
const proseSentences = proseOnly.split(/[.!?\n]+/).map(s => s.trim()).filter(s => s.length > 10);
|
|
41
|
+
if (proseSentences.length >= 6) {
|
|
42
|
+
const counts = new Map();
|
|
43
|
+
for (const s of proseSentences) {
|
|
44
|
+
const key = s.toLowerCase();
|
|
45
|
+
counts.set(key, (counts.get(key) ?? 0) + 1);
|
|
46
|
+
if ((counts.get(key) ?? 0) >= 3) {
|
|
47
|
+
return { pass: false, reason: "loop_detected" };
|
|
48
|
+
}
|
|
49
|
+
}
|
|
50
|
+
}
|
|
51
|
+
// Pass B (full text, threshold ≥5): catches egregious loops hidden
|
|
52
|
+
// inside fake code blocks or other structural elements. Higher
|
|
53
|
+
// threshold avoids false positives on legitimate code patterns
|
|
54
|
+
// (e.g. `node = self.root` × 4 is fine, × 5 is suspicious).
|
|
55
|
+
const allSentences = stripped.split(/[.!?\n]+/).map(s => s.trim()).filter(s => s.length > 10);
|
|
56
|
+
if (allSentences.length >= 10) {
|
|
57
|
+
const counts = new Map();
|
|
58
|
+
for (const s of allSentences) {
|
|
59
|
+
const key = s.toLowerCase();
|
|
60
|
+
counts.set(key, (counts.get(key) ?? 0) + 1);
|
|
61
|
+
if ((counts.get(key) ?? 0) >= 5) {
|
|
62
|
+
return { pass: false, reason: "loop_detected" };
|
|
63
|
+
}
|
|
64
|
+
}
|
|
65
|
+
}
|
|
66
|
+
return { pass: true };
|
|
67
|
+
}
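The two loop-detection passes in `passesQualityGate` share one counting routine that differs only in its thresholds. A minimal sketch of that routine, under the assumption that it can be factored out, is shown here; `hasLoop` is a hypothetical name, and the markdown stripping that Pass A applies first is omitted.

```javascript
// Hypothetical helper mirroring the counting logic of Pass A and Pass B:
// split into sentences, ignore short fragments, and flag the text once any
// sentence (case-insensitive) has been seen `threshold` times.
function hasLoop(text, threshold, minSentences) {
  const sentences = text
    .split(/[.!?\n]+/)
    .map(s => s.trim())
    .filter(s => s.length > 10);
  if (sentences.length < minSentences) return false;

  const counts = new Map();
  for (const s of sentences) {
    const key = s.toLowerCase();
    const seen = (counts.get(key) ?? 0) + 1;
    counts.set(key, seen);
    if (seen >= threshold) return true;
  }
  return false;
}

const looping = [
  "The cache is warmed on startup",
  "Requests are served from memory",
  "The cache is warmed on startup",
  "Misses fall back to the database",
  "The cache is warmed on startup",
  "Entries expire after one hour",
].join(". ") + ".";

console.log(hasLoop(looping, 3, 6));  // Pass A thresholds
console.log(hasLoop(looping, 5, 10)); // Pass B thresholds
```

In this sample the Pass A thresholds flag the repeated sentence, while the Pass B thresholds do not, because the text has fewer than ten qualifying sentences.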
|
package/dist/utils/safetyGate.js
ADDED
|
@@ -0,0 +1,104 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* L1 Safety Gate — deterministic crisis/medical interception for prism_infer.
|
|
3
|
+
*
|
|
4
|
+
* Runs BEFORE model output reaches the caller. High precision, low recall:
|
|
5
|
+
* only intercepts patterns that unambiguously indicate self-harm intent
|
|
6
|
+
* (input) or dangerous instructions (output). Generic clinical/pharmacological
|
|
7
|
+
* terms ("dose of", "milligrams", "lethal") are NOT intercepted — they appear
|
|
8
|
+
* in routine BCBA/medical notes.
|
|
9
|
+
*
|
|
10
|
+
* This is a backstop, not a comprehensive filter. The model's own safety
|
|
11
|
+
* training is the primary layer.
|
|
12
|
+
*/
|
|
13
|
+
// ── Input: first-person crisis expressions ───────────────────────────────────
|
|
14
|
+
const CRISIS_INPUT_RE = [
|
|
15
|
+
// English
|
|
16
|
+
/hurt(?:ing)?\s+(?:my)?self/i,
|
|
17
|
+
/kill\s+(?:my)?self/i,
|
|
18
|
+
/end\s+my\s+life/i,
|
|
19
|
+
/want\s+to\s+die/i,
|
|
20
|
+
/want\s+to\s+(?:commit\s+)?suicide/i,
|
|
21
|
+
/cut(?:ting)?\s+(?:my)?self/i,
|
|
22
|
+
/(?:hang|hanging)\s+(?:my)?self/i,
|
|
23
|
+
/jump(?:ing)?\s+off/i,
|
|
24
|
+
/how\s+(?:many|much|to).*(?:pills|overdose|die)/i,
|
|
25
|
+
// Spanish — block hyperbole nouns only (NOT hunger/thirst/cold — may be literal for neglected child)
|
|
26
|
+
/quiero\s+morir(?!\s+de\s+(?:risa|la\s+risa|vergüenza|ganas|envidia|aburrimiento)\b)/i,
|
|
27
|
+
/(?:voy\s+a\s+)?matarme(?!\s+(?:estudiando|trabajando|riendo|de\s+(?:risa|la\s+risa))\b)/i,
|
|
28
|
+
/hacerme\s+da[ñn]o/i,
|
|
29
|
+
/(?:quiero\s+)?suicidarme/i,
|
|
30
|
+
// French — block hyperbole nouns only (NOT faim/soif/chaud/froid)
|
|
31
|
+
/je\s+veux\s+mourir(?!\s+d[e']\s*(?:rire|honte|envie|ennui)\b)/i,
|
|
32
|
+
/(?:vais\s+)?me\s+tuer(?!\s+[àa]\s+(?:rire|travailler|étudier)\b)/i,
|
|
33
|
+
/me\s+suicider/i,
|
|
34
|
+
// Portuguese — block hyperbole nouns only (NOT fome/sede/calor/frio/sono)
|
|
35
|
+
/quero\s+morrer(?!\s+de\s+(?:rir|vergonha|inveja|tédio)\b)/i,
|
|
36
|
+
/(?:vou\s+)?me\s+matar(?!\s+(?:estudando|trabalhando|rindo|de\s+(?:rir|tanto))\b)/i,
|
|
37
|
+
// Russian — block hyperbole nouns (смех/стыд/зависть/голод/жара/скука)
|
|
38
|
+
// Russian — NOT голода/жары/холода (may be literal)
|
|
39
|
+
/хочу\s+умереть(?!\s+от\s+(?:смеха|стыда|зависти|скуки)(?:\s|$))/i,
|
|
40
|
+
/убить\s+себя/i,
|
|
41
|
+
/покончить\s+с\s+собой/i,
|
|
42
|
+
// Arabic (alef variants handled in normalize)
|
|
43
|
+
/[اأإآ]ر[يى]د\s+[اأإآ]ن\s+[اأإآ]موت/,
|
|
44
|
+
/قتل\s+نفس[يى]/,
|
|
45
|
+
/[اأإآ]نه[يى]\s+ح[يى][اأإآ]ت[يى]/,
|
|
46
|
+
// German — block hyperbole nouns (Lachen/Scham/Neid/Hunger/Langeweile)
|
|
47
|
+
// German — NOT hunger/kälte (may be literal)
|
|
48
|
+
/(?:ich\s+)?will\s+sterben(?!\s+vor\s+(?:lachen|scham|neid|langeweile)\b)/i,
|
|
49
|
+
/mich\s+umbringen/i,
|
|
50
|
+
// Ukrainian — block hyperbole nouns (сміх/сором/заздрість/голод)
|
|
51
|
+
// Ukrainian — NOT голоду/холоду (may be literal)
|
|
52
|
+
/хочу\s+померти(?!\s+від\s+(?:сміху|сорому|заздрості|нудьги)(?:\s|$))/i,
|
|
53
|
+
/вбити\s+себе/i,
|
|
54
|
+
];
|
|
55
|
+
const MEDICAL_INPUT_RE = [
|
|
56
|
+
/(?:how\s+much|what\s+dose)\s+.*(?:should\s+I|do\s+I|can\s+I)\s+(?:inject|take|give)/i,
|
|
57
|
+
/(?:dose|dosage)\s+(?:should\s+I|do\s+I|can\s+I)\s+(?:inject|take|give)/i,
|
|
58
|
+
];
|
|
59
|
+
// ── Output: model providing dangerous instructions ───────────────────────────
|
|
60
|
+
const CRISIS_OUTPUT_RE = [
|
|
61
|
+
/\bhow many pills\s+to\b/i,
|
|
62
|
+
/\boverdose amount\b/i,
|
|
63
|
+
/\bnoose\b/i,
|
|
64
|
+
/\bhanging yourself\b/i,
|
|
65
|
+
/\bsuicide method\b/i,
|
|
66
|
+
/\bmethod of suicide\b/i,
|
|
67
|
+
/\bself[- ]?harm method\b/i,
|
|
68
|
+
/\bhow to (?:kill|hang|poison|overdose)\s+(?:your|my|him|her|them|one)self\b/i,
|
|
69
|
+
/\bhow to overdose\b/i,
|
|
70
|
+
];
|
|
71
|
+
const MEDICAL_OUTPUT_RE = [
|
|
72
|
+
/\binject this (?:amount|dose|much)\b/i,
|
|
73
|
+
/\btake this (?:amount|many|dose)\b/i,
|
|
74
|
+
/\byou should (?:inject|take)\s+\d+\s+units\b/i,
|
|
75
|
+
];
|
|
76
|
+
// ── Responses ────────────────────────────────────────────────────────────────
|
|
77
|
+
export const CRISIS_RESPONSE = "I'm concerned about your safety. Please call or text 988 (Suicide & Crisis Lifeline) right now — available 24/7. If in immediate danger, call 911. International: findahelpline.com. You are not alone.";
|
|
78
|
+
export const MEDICAL_RESPONSE = "I cannot provide specific medical dosing advice. Please contact your doctor or pharmacist. For emergencies, call 911.";
|
|
79
|
+
// ── API ──────────────────────────────────────────────────────────────────────
|
|
80
|
+
function normalize(text) {
|
|
81
|
+
return text
|
|
82
|
+
.toLowerCase()
|
|
83
|
+
.replace(/\p{Cf}/gu, "")
|
|
84
|
+
.replace(/\p{Mn}/gu, "") // Arabic harakat + all combining marks
|
|
85
|
+
.replace(/ـ/g, "")
|
|
86
|
+
.replace(/[أإآ]/g, "ا")
|
|
87
|
+
.replace(/\s+/g, " ");
|
|
88
|
+
}
|
|
89
|
+
export function checkInputSafety(text) {
|
|
90
|
+
const t = normalize(text);
|
|
91
|
+
if (CRISIS_INPUT_RE.some(p => p.test(t)))
|
|
92
|
+
return CRISIS_RESPONSE;
|
|
93
|
+
if (MEDICAL_INPUT_RE.some(p => p.test(t)))
|
|
94
|
+
return MEDICAL_RESPONSE;
|
|
95
|
+
return null;
|
|
96
|
+
}
|
|
97
|
+
export function checkOutputSafety(response) {
|
|
98
|
+
const r = normalize(response);
|
|
99
|
+
if (CRISIS_OUTPUT_RE.some(re => re.test(r)))
|
|
100
|
+
return CRISIS_RESPONSE;
|
|
101
|
+
if (MEDICAL_OUTPUT_RE.some(re => re.test(r)))
|
|
102
|
+
return MEDICAL_RESPONSE;
|
|
103
|
+
return response;
|
|
104
|
+
}
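The negative lookaheads in `CRISIS_INPUT_RE` exempt hyperbolic idioms such as "dying of laughter" from interception. The Russian and Ukrainian patterns end the lookahead with `(?:\s|$)` rather than `\b`, and the sketch below illustrates a likely reason: JavaScript's `\b` is defined over ASCII word characters, so it never matches after a Cyrillic letter. The patterns here are trimmed to a single exempt noun for illustration, and `intercepts` is a hypothetical helper.

```javascript
// Trimmed variants of the Russian pattern from the diff (one exempt noun only).
const withWordBoundary = /хочу\s+умереть(?!\s+от\s+смеха\b)/i;
const withSpaceOrEnd = /хочу\s+умереть(?!\s+от\s+смеха(?:\s|$))/i;

// Trimmed Spanish pattern: \b works here because "risa" is ASCII.
const spanish = /quiero\s+morir(?!\s+de\s+risa\b)/i;

// Hypothetical helper: true means the gate would intercept the text.
function intercepts(pattern, text) {
  return pattern.test(text.toLowerCase());
}

const idiom = "хочу умереть от смеха"; // "I want to die of laughter"

console.log(intercepts(withWordBoundary, idiom)); // idiom wrongly intercepted
console.log(intercepts(withSpaceOrEnd, idiom));   // idiom exempted
console.log(intercepts(withSpaceOrEnd, "хочу умереть")); // literal statement intercepted
console.log(intercepts(spanish, "quiero morir de risa")); // idiom exempted
```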
|
package/dist/utils/thinkStrip.js
ADDED
|
@@ -0,0 +1,26 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* Think-Strip — remove <think>...</think> blocks from model output.
|
|
3
|
+
*
|
|
4
|
+
* Qwen3.5 uses <think> blocks for chain-of-thought reasoning.
|
|
5
|
+
* These must be stripped before serving to the user or passing
|
|
6
|
+
* to the grounding verifier (which would try to ground reasoning text).
|
|
7
|
+
*
|
|
8
|
+
* Returns: { stripped: string, thinkContent: string | null, thinkOnly: boolean }
|
|
9
|
+
*/
|
|
10
|
+
const THINK_RE = /<(?:think|\|synalux_think\|)>[\s\S]*?<\/(?:think|\|synalux_think\|)>\s*/g;
|
|
11
|
+
const UNCLOSED_THINK_RE = /<(?:think|\|synalux_think\|)>[\s\S]*$/;
|
|
12
|
+
export function stripThink(raw) {
|
|
13
|
+
if (!raw.includes("<think>") && !raw.includes("<|synalux_think|>")) {
|
|
14
|
+
return { stripped: raw, thinkContent: null, thinkOnly: false };
|
|
15
|
+
}
|
|
16
|
+
const thinkMatch = raw.match(/<(?:think|\|synalux_think\|)>([\s\S]*?)<\/(?:think|\|synalux_think\|)>/);
|
|
17
|
+
const thinkContent = thinkMatch ? thinkMatch[1].trim() : null;
|
|
18
|
+
let stripped = raw.replace(THINK_RE, "");
|
|
19
|
+
stripped = stripped.replace(UNCLOSED_THINK_RE, "");
|
|
20
|
+
stripped = stripped.trim();
|
|
21
|
+
return {
|
|
22
|
+
stripped,
|
|
23
|
+
thinkContent,
|
|
24
|
+
thinkOnly: stripped.length === 0 && raw.trim().length > 0,
|
|
25
|
+
};
|
|
26
|
+
}
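The behaviour of `stripThink` on closed, unclosed, and absent reasoning blocks can be illustrated with a reduced sketch. It handles only the `<think>` tag (the shipped regexes also accept `<|synalux_think|>`), omits `thinkContent`, and uses the hypothetical name `stripThinkSketch`.

```javascript
// Reduced versions of the two regexes from the diff, <think> tag only.
const THINK_RE = /<think>[\s\S]*?<\/think>\s*/g;
const UNCLOSED_THINK_RE = /<think>[\s\S]*$/;

// Remove closed blocks first, then anything after a dangling opening tag.
function stripThinkSketch(raw) {
  const stripped = raw
    .replace(THINK_RE, "")
    .replace(UNCLOSED_THINK_RE, "")
    .trim();
  return {
    stripped,
    // Non-empty input that strips to nothing means the model only reasoned.
    thinkOnly: stripped.length === 0 && raw.trim().length > 0,
  };
}

console.log(stripThinkSketch("<think>plan the answer</think>\nAnswer: 42"));
console.log(stripThinkSketch("<think>still reasoning when tokens ran out"));
```

The second call models a response truncated mid-reasoning: everything is stripped and `thinkOnly` is set, which is the signal `passesQualityGate` reports as `think_only`.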
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "prism-mcp-server",
|
|
3
|
-
"version": "19.0
|
|
3
|
+
"version": "19.2.0",
|
|
4
4
|
"mcpName": "io.github.dcostenco/prism-coder",
|
|
5
5
|
"description": "Prism Coder — Cognitive memory + tool-calling intelligence for AI agents. Mind Palace persistent memory (BFCL Gold Certified, 100% Tool-Call Accuracy, 114 Agent Skills, PHI Guard, Tier Enforcement, Prompt-Based Skill Routing, Zero-Search HDC/HRR retrieval, HRR Semantic Drift Detection across BCBA/Coding/AAC domains, HIPAA-hardened local-first storage, SLERP-optimized GRPO alignment) plus the prism-coder 1.7B–32B open-weights LLM fleet.",
|
|
6
6
|
"module": "index.ts",
|