npm - prism-mcp-server - Versions diffs - 19.1.0 → 19.2.1 - Mend

prism-mcp-server 19.1.0 → 19.2.1

Files changed (13)

package/README.md +102 -0
package/dist/config.js +4 -4
package/dist/tools/compactionHandler.js +2 -2
package/dist/tools/ledgerHandlers.js +7 -0
package/dist/tools/prismInferHandler.js +48 -7
package/dist/tools/taskRouterHandler.js +2 -2
package/dist/utils/ddLogger.js +60 -19
package/dist/utils/inferenceMetrics.js +93 -0
package/dist/utils/localLlm.js +2 -2
package/dist/utils/nerExtractor.js +1 -1
package/dist/utils/qualityGate.js +28 -4
package/dist/utils/safetyGate.js +104 -0
package/package.json +2 -2

package/README.md CHANGED

@@ -78,6 +78,23 @@ Every session is logged with files changed, decisions made, and TODOs. Search, f
   <img src="docs/session-ledger.jpg" alt="Session Ledger — 93 sessions, 847 decisions logged across 12 projects" width="700" />
 </p>
+### Inference Metrics — see where your tokens go
+Every `prism_infer` call tracks which model handled it (local Ollama vs cloud) and how many tokens were consumed. When you save a session, Prism shows a summary:
+```
+📊 Inference Metrics (this session):
+  Total calls: 12 — Local: 10 (83%) | Cloud: 2 (17%)
+  Tokens: 8,420 in + 3,150 out = 11,570 total
+  Avg latency: 1,240ms
+  By model:
+    prism-coder:27b: 6 calls, 7,200 tokens, avg 1,800ms
+    prism-coder:9b: 4 calls, 2,870 tokens, avg 620ms
+    synalux-27b: 2 calls, 1,500 tokens, avg 1,100ms
+```
+Local calls use actual Ollama token counts; cloud calls use estimates. Metrics are aggregated by the Synalux portal — Prism is a thin client that forwards per-call data and fetches the summary on demand.
 ### Session Drift Detection
 Long agent sessions can wander from their original goal. `session_detect_drift` compares current work against the stated goal and returns `on_track / minor_drift / major_drift` so the agent can self-correct.
@@ -204,6 +221,91 @@ python3 tests/benchmarks/prism-routing-100/benchmark.py --models 2b 4b 9b 27b
 **Memory uplift (LoCoMo-Plus, self-published).** A separate long-context dialogue benchmark ([dcostenco/Locomo-Plus](https://github.com/dcostenco/Locomo-Plus)) measures how much structured memory helps a base model retain multi-day context. Results show large gains when a model is paired with Prism memory versus running raw. Note this benchmark is authored, run, and LLM-judged by this project — treat it as a reproducible demonstration, not an independent third-party result, and run it yourself with the commands in that repo.
+### Code Generation Quality (27B vs Claude Opus)
+Three progressively harder Python tasks run through `prism_infer(mode:"code", think:true)` on the local 27B and compared with Claude Opus. Both produce correct, production-quality code. The 27B is slightly more verbose (docstrings, examples); Opus is slightly tighter (`__slots__`, early-exit DFS). On routine coding the 27B at $0 replaces cloud calls entirely.
+| Task | Local 27B | Claude Opus | Verdict |
+|------|-----------|-------------|---------|
+| Fibonacci with memoization | `@lru_cache`, ValueError on negative, docstring | Nested `_fib` to keep cache private | Both correct, equivalent |
+| LRU Cache (OrderedDict, O(1)) | `Any` keys, isinstance capacity check, `__repr__` | `Hashable` key type (more precise), same ops | Both correct, Opus marginally tighter |
+| Trie with autocomplete | `.lower()` normalization, collect+sort+slice | `__slots__` on TrieNode, early-exit DFS at limit | Both correct, Opus slightly more optimized |
+<details>
+<summary>Local 27B output — Trie with autocomplete (hardest task)</summary>
+```python
+class TrieNode:
+    def __init__(self):
+        self.children: dict[str, 'TrieNode'] = {}
+        self.is_end_of_word: bool = False
+class Trie:
+    def __init__(self):
+        self.root: TrieNode = TrieNode()
+    def insert(self, word: str) -> None:
+        node = self.root
+        for char in word.lower():
+            if char not in node.children:
+                node.children[char] = TrieNode()
+            node = node.children[char]
+        node.is_end_of_word = True
+    def search(self, word: str) -> bool:
+        node = self._get_node(word.lower())
+        return node is not None and node.is_end_of_word
+    def starts_with(self, prefix: str) -> bool:
+        return self._get_node(prefix.lower()) is not None
+    def autocomplete(self, prefix: str, limit: int = 5) -> list[str]:
+        node = self._get_node(prefix.lower())
+        if node is None:
+            return []
+        results: list[str] = []
+        self._collect_words(node, prefix.lower(), results)
+        results.sort()
+        return results[:limit]
+    def _get_node(self, key: str) -> 'TrieNode | None':
+        node = self.root
+        for char in key:
+            if char not in node.children:
+                return None
+            node = node.children[char]
+        return node
+    def _collect_words(self, node: TrieNode, prefix: str, results: list[str]) -> None:
+        if node.is_end_of_word:
+            results.append(prefix)
+        for char, child in sorted(node.children.items()):
+            self._collect_words(child, prefix + char, results)
+```
+</details>
+| Metric | Local 27B | Cloud (Opus) |
+|--------|-----------|-------------|
+| Latency (Trie task) | ~30s | ~8s |
+| Cost | $0 | ~$0.05 |
+| Think mode | Enabled (stripped before serving) | N/A |
+| Quality gate | Passed (no escalation needed) | N/A |
+### Cloud Escalation in Practice (`cloud_fallback: true`)
+The same three tasks with `cloud_fallback: true` — the quality gate decides whether local output is good enough or needs cloud escalation.
+| Task | used_cloud | Quality Gate | Latency | What happened |
+|------|:----------:|-------------|---------|---------------|
+| Fibonacci (simple) | **no** | Passed | 11s | 27B served directly, $0 |
+| LRU Cache (medium) | **no** | Passed | 21s | 27B served directly, $0 |
+| Trie (hard) | **yes** | `loop_detected` | 55s | 27B looped → gate caught it → escalated to cloud 27B |
+The quality gate detected repeated sentences (≥3 of the same sentence in ≥6 total) in the 27B's Trie output and escalated automatically. The cloud fallback returned clean code. On a second run of the same prompt, the 27B produced clean output without escalation — the loop is stochastic, not systematic.
+**Takeaway:** for ~80–90% of coding tasks, the 27B handles everything locally at $0. The quality gate + cloud escalation exists as a safety net for the remaining cases where the local model loops, truncates, or produces empty output. Paid tiers get automatic escalation; free tier gets the local result with a warning.
 ---
 ## Why Prism Coder
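
The README hunk above describes the quality gate's loop check as "≥3 of the same sentence in ≥6 total". A minimal sketch of that rule follows, assuming sentences are split on punctuation and newlines and that fragments of 10 characters or fewer are ignored. The helper name `hasRepeatedSentence` and its parameters are hypothetical and are not part of the package's API.

```javascript
// Hypothetical helper: illustrates the "≥3 repeats within ≥6 sentences" rule.
// The name, signature, and defaults are assumptions made for this sketch.
function hasRepeatedSentence(text, minSentences = 6, repeatThreshold = 3) {
    // Split on sentence punctuation and newlines, then drop short fragments.
    const sentences = text
        .split(/[.!?\n]+/)
        .map((s) => s.trim())
        .filter((s) => s.length > 10);
    // Too little text to judge: treat as not looping.
    if (sentences.length < minSentences) return false;
    const counts = new Map();
    for (const s of sentences) {
        const key = s.toLowerCase(); // case-insensitive comparison
        const n = (counts.get(key) ?? 0) + 1;
        counts.set(key, n);
        if (n >= repeatThreshold) return true;
    }
    return false;
}

const looping =
    "Insert the word into the trie. ".repeat(3) +
    "Then search for the prefix. Collect all matching words. Sort the final results.";
console.log(hasRepeatedSentence(looping));
```

Requiring a minimum sentence count keeps very short outputs from being flagged when a phrase happens to repeat.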

package/dist/config.js CHANGED

@@ -307,11 +307,11 @@ const rawTiebreakerEpsilon = parseFloat(process.env.PRISM_TURBOQUANT_TIEBREAKER_
 export const PRISM_TURBOQUANT_TIEBREAKER_EPSILON = Number.isFinite(rawTiebreakerEpsilon) && rawTiebreakerEpsilon >= 0
     ? rawTiebreakerEpsilon
     : 0;
-// ─── v9.x: Local LLM (prism-coder:7b) Integration ─────────────────────────
+// ─── v9.x: Local LLM (prism-coder) Integration ────────────────────────────
 // Enables background tasks (compaction, task-router fallback, pipeline ops)
 // to use a local Ollama model instead of the cloud LLM provider.
 //
-// Default model is prism-coder:7b — fine-tuned on Prism tool schemas.
+// Default model is prism-coder:9b — fine-tuned on Prism tool schemas.
 // Disabled by default so existing deployments are unaffected.
 //
 // Set PRISM_LOCAL_LLM_ENABLED=true to activate.
@@ -319,10 +319,10 @@ export const PRISM_TURBOQUANT_TIEBREAKER_EPSILON = Number.isFinite(rawTiebreaker
 // Set PRISM_LOCAL_LLM_URL to override the Ollama endpoint (default: localhost:11434).
 // Set PRISM_LOCAL_LLM_TIMEOUT_MS to override per-call timeout (default: 60000, max: 300000).
 // Set PRISM_STRICT_LOCAL_MODE=true to block cloud fallback when local LLM is enabled (HIPAA).
-/** Master switch — enables the local prism-coder:7b LLM for background tasks. */
+/** Master switch — enables the local prism-coder LLM for background tasks. */
 export const PRISM_LOCAL_LLM_ENABLED = process.env.PRISM_LOCAL_LLM_ENABLED === "true"; // Opt-in, default false
 /** Ollama model tag to use for local LLM calls. */
-export const PRISM_LOCAL_LLM_MODEL = (process.env.PRISM_LOCAL_LLM_MODEL || "prism-coder:7b").trim();
+export const PRISM_LOCAL_LLM_MODEL = (process.env.PRISM_LOCAL_LLM_MODEL || "prism-coder:9b").trim();
 /** Ollama base URL. Override for remote Ollama instances. */
 export const PRISM_LOCAL_LLM_URL = (process.env.PRISM_LOCAL_LLM_URL || "http://localhost:11434").trim();
 /** Per-call timeout in ms. Prevents stalled background tasks. Capped at 300s. */
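
The comments above state a default timeout of 60000 ms and a cap of 300000 ms, but the parsing line itself falls outside this hunk. The sketch below shows one way such a setting could be resolved. The helper name `resolveTimeoutMs` and the handling of missing or invalid values are assumptions, not the package's actual code.

```javascript
// Hypothetical helper: resolve a timeout from an env-style string.
// Assumption: missing, non-numeric, or non-positive values fall back to the default.
function resolveTimeoutMs(raw, defaultMs = 60_000, maxMs = 300_000) {
    const parsed = Number.parseInt(raw ?? "", 10);
    if (!Number.isFinite(parsed) || parsed <= 0) return defaultMs;
    // Cap at the maximum so a misconfigured value cannot stall background tasks.
    return Math.min(parsed, maxMs);
}

console.log(resolveTimeoutMs(process.env.PRISM_LOCAL_LLM_TIMEOUT_MS));
```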

package/dist/tools/compactionHandler.js CHANGED

@@ -108,7 +108,7 @@ function parseCompactionResponse(response, source) {
 }
 async function summarizeEntries(entries) {
     const prompt = buildCompactionPrompt(entries);
-    // ── Path 1: Local LLM (prism-coder:7b) ───────────────────────────
+    // ── Path 1: Local LLM (prism-coder:9b) ───────────────────────────
     if (PRISM_LOCAL_LLM_ENABLED) {
         debugLog(`[compact_ledger] Attempting local LLM summarization (${entries.length} entries)`);
         const localResponse = await callLocalLlm(prompt);
@@ -123,7 +123,7 @@ async function summarizeEntries(entries) {
         if (PRISM_STRICT_LOCAL_MODE) {
             throw new Error("[HIPAA] Local LLM failed and PRISM_STRICT_LOCAL_MODE=true. " +
                 "Cloud fallback is blocked to prevent unauthorized PHI disclosure. " +
-                "Ensure Ollama is running and prism-coder:7b is available.");
+                "Ensure Ollama is running and prism-coder:9b is available.");
         }
         debugLog(`[compact_ledger] Local LLM returned null — falling back to cloud LLM`);
     }

package/dist/tools/ledgerHandlers.js CHANGED

@@ -89,6 +89,7 @@ const MEMORY_BOUNDARY_SUFFIX = '\n</prism_memory>';
  * After saving, generates an embedding vector for the entry via fire-and-forget.
  */
 import { computeEffectiveImportance, recordMemoryAccess } from "../utils/cognitiveMemory.js";
+import { formatInferenceMetrics, resetInferenceMetrics } from "../utils/inferenceMetrics.js";
 export async function sessionSaveLedgerHandler(args) {
     if (!isSessionSaveLedgerArgs(args)) {
         throw new Error("Invalid arguments for session_save_ledger");
@@ -229,6 +230,7 @@ export async function sessionSaveLedgerHandler(args) {
     storage.decayImportance(project, PRISM_USER_ID, 30).catch((err) => {
         debugLog(`[session_save_ledger] Background decay failed (non-fatal): ${err instanceof Error ? err.message : String(err)}`);
     });
+    const metricsBlock = formatInferenceMetrics();
     return {
         content: [{
                 type: "text",
@@ -238,6 +240,7 @@ export async function sessionSaveLedgerHandler(args) {
                     (files_changed?.length ? `Files changed: ${files_changed.length}\n` : "") +
                     (decisions?.length ? `Decisions: ${decisions.length}\n` : "") +
                     `📊 Embedding generation queued for semantic search.` +
+                    metricsBlock +
                     resolverNote,
             }],
         isError: false,
@@ -548,11 +551,13 @@ export async function sessionSaveHandoffHandler(args, server) {
         // Dynamic import itself failed — module not found or similar
         console.error("[FactMerger] Module load failed (non-fatal): " + err));
     }
+    const metricsBlock = formatInferenceMetrics();
     // Build response text based on whether a CRDT merge occurred
     const responseText = isMerged
         ? `🔄 Auto-merged conflict for "${project}" (v${expected_version} → v${newVersion})\n` +
             `Strategy: ${JSON.stringify(mergeStrategy)}\n` +
             (last_summary ? `Summary: ${last_summary}\n` : "") +
+            metricsBlock +
             `\n🔑 Remember: pass expected_version: ${newVersion} on your next save ` +
             `to maintain concurrency control.`
         : `✅ Handoff ${data.status || "saved"} for project "${project}" ` +
@@ -561,6 +566,7 @@ export async function sessionSaveHandoffHandler(args, server) {
             (open_todos?.length ? `Open TODOs: ${open_todos.length} items\n` : "") +
             (active_branch ? `Active branch: ${active_branch}\n` : "") +
             `📊 Embedding generation queued for semantic search.\n` +
+            metricsBlock +
             `\n🔑 Remember: pass expected_version: ${newVersion} on your next save ` +
             `to maintain concurrency control.`;
     return {
@@ -575,6 +581,7 @@ export async function sessionLoadContextHandler(args) {
     if (!isSessionLoadContextArgs(args)) {
         throw new Error("Invalid arguments for session_load_context");
     }
+    resetInferenceMetrics();
     const { project, level = "standard", role } = args;
     const maxTokens = args.max_tokens
         || parseInt(await getSetting("max_tokens", "0"), 10) || undefined; // v4.0: arg > dashboard setting > none

package/dist/tools/prismInferHandler.js CHANGED

@@ -2,7 +2,7 @@
  * prism_infer — local-first inference tool
  * ─────────────────────────────────────────────────────────────
  * Save the caller's cloud tokens by routing to a local prism-coder
- * model via Ollama. Tiers (27B/9B/8B/1.7B) auto-selected by free
+ * model via Ollama. Tiers (27B/9B/4B/2B) auto-selected by free
  * RAM, then capped by `model_ceiling` and the set of tags that are
  * actually pulled into Ollama.
  *
@@ -28,11 +28,13 @@ import { getEntitlements, clampCeiling } from "../utils/entitlements.js";
 import { ddLog } from "../utils/ddLogger.js";
 import { stripThink } from "../utils/thinkStrip.js";
 import { passesQualityGate } from "../utils/qualityGate.js";
+import { checkInputSafety, checkOutputSafety } from "../utils/safetyGate.js";
+import { recordInference } from "../utils/inferenceMetrics.js";
 // ─── Tool Definition ────────────────────────────────────────────
 export const PRISM_INFER_TOOL = {
     name: "prism_infer",
     description: "Run an inference on a local prism-coder model (Ollama) to save cloud tokens. " +
-        "Picks the largest viable tier — 27B / 9B / 8B / 1.7B — based on free RAM at call time, " +
+        "Picks the largest viable tier — 27B / 9B / 4B / 2B — based on free RAM at call time, " +
         "clamped by `model_ceiling` and what is actually pulled in Ollama. " +
         "Falls through to the synalux portal cloud cascade (9B → 27B → Claude Opus 4.7) " +
         "only when local is unviable AND `cloud_fallback=true`. " +
@@ -71,7 +73,7 @@ export const PRISM_INFER_TOOL = {
             },
             timeout_ms: {
                 type: "number",
-                description: "Override per-call timeout. Default scales with model size: 27B=120s, 9B=60s, 4B=20s, 1.7B=15s.",
+                description: "Override per-call timeout. Default scales with model size: 27B=120s, 9B=60s, 4B=20s, 2B=15s.",
             },
             evidence: {
                 type: "array",
@@ -242,7 +244,7 @@ async function callOllamaGenerate(url, model, prompt, system, maxTokens, tempera
         const text = (data.message?.content ?? "").trim();
         if (!text)
             return { ok: false, reason: "empty_response" };
-        return { ok: true, text, doneReason: data.done_reason };
+        return { ok: true, text, doneReason: data.done_reason, promptTokens: data.prompt_eval_count, completionTokens: data.eval_count };
     }
     catch (err) {
         const name = err instanceof Error ? err.name : "Unknown";
@@ -300,6 +302,19 @@ async function callSynaluxInference(prompt, maxTokens, timeoutMs) {
 export async function runInfer(args, deps) {
     const t0 = Date.now();
     const temperature = args.temperature ?? 0;
+    // ── L1 Safety — deterministic input interception ────────────
+    const safetyIntercept = checkInputSafety(args.prompt);
+    if (safetyIntercept) {
+        return {
+            output: safetyIntercept,
+            backend: "safety_gate",
+            model_picked: null,
+            ram_free_mb: Math.round(deps.freemem() / (1024 * 1024)),
+            latency_ms: Date.now() - t0,
+            used_cloud: false,
+            attempts: [{ tier: "l1_safety", reason: "crisis_or_medical_intercept" }],
+        };
+    }
     // ── Entitlement enforcement ──────────────────────────────────
     // Fetch user's plan limits (cached 1hr). Free users without auth
     // get 4b ceiling, 50 calls/day, 512 max tokens.
@@ -392,7 +407,7 @@ export async function runInfer(args, deps) {
                         debugLog(`[prism_infer] quality gate FAIL (${gate.reason}) — escalating to cloud`);
                         attempts.push({ tier: tier.tag, reason: `quality_gate:${gate.reason}` });
                         if (gate.reason === "hard_truncation" || gate.reason === "loop_detected") {
-                            localDraft = { output, tier: tier.tag };
+                            localDraft = { output, tier: tier.tag, promptTokens: result.promptTokens, completionTokens: result.completionTokens };
                         }
                         break;
                     }
@@ -408,6 +423,8 @@ export async function runInfer(args, deps) {
                     used_cloud: false,
                     attempts,
                     plan: ent.plan,
+                    prompt_tokens: result.promptTokens,
+                    completion_tokens: result.completionTokens,
                 });
             }
             attempts.push({ tier: tier.tag, reason: result.reason });
@@ -431,6 +448,8 @@ export async function runInfer(args, deps) {
                 used_cloud: true,
                 attempts,
                 plan: ent.plan,
+                prompt_tokens: Math.ceil(args.prompt.length / 4),
+                completion_tokens: Math.ceil(cloud.output.length / 4),
             });
         }
         attempts.push({ tier: "synalux", reason: cloud.reason ?? "unknown" });
@@ -449,6 +468,8 @@ export async function runInfer(args, deps) {
             used_cloud: false,
             attempts,
             plan: ent.plan,
+            prompt_tokens: localDraft.promptTokens,
+            completion_tokens: localDraft.completionTokens,
             quality_gate_failed: true,
         });
     }
@@ -464,9 +485,11 @@ export async function runInfer(args, deps) {
  * field so callers can route refusals separately from successes.
  */
 async function applyVerification(draft, args, deps, partial) {
+    // L1 output safety — intercept dangerous model-generated content
+    const safeDraft = checkOutputSafety(draft);
     const shouldVerify = args.verify ?? (args.evidence !== undefined && args.evidence.length > 0);
     if (!shouldVerify || !deps.callVerifier) {
-        return { ...partial, output: draft };
+        return { ...partial, output: safeDraft };
     }
     const verifier = deps.callVerifier;
     const outcome = await verifier({
@@ -478,7 +501,7 @@ async function applyVerification(draft, args, deps, partial) {
     });
     return {
         ...partial,
-        output: outcome.finalText,
+        output: checkOutputSafety(outcome.finalText),
         verification: {
             action: outcome.action,
             verifierChain: outcome.verifierChain,
@@ -503,12 +526,30 @@ export async function prismInferHandler(args) {
             ollamaUrl: PRISM_LOCAL_LLM_URL,
         });
         debugLog(`[prism_infer] backend=${result.backend} model=${result.model_picked} latency=${result.latency_ms}ms free=${result.ram_free_mb}MB`);
+        // Local accumulator — sole source of the user-facing metrics block.
+        recordInference(result);
+        // Best-effort portal forwarding (independent analytics stream).
+        // safety_gate excluded — logging crisis filter triggers is a HIPAA concern.
+        if (result.backend !== "safety_gate") {
+            ddLog("info", "prism_infer.usage", {
+                backend: result.backend,
+                model: result.model_picked ?? result.backend,
+                used_cloud: result.used_cloud,
+                prompt_tokens: result.prompt_tokens ?? 0,
+                completion_tokens: result.completion_tokens ?? 0,
+                latency_ms: result.latency_ms,
+            });
+        }
+        const tokenStr = result.prompt_tokens != null || result.completion_tokens != null
+            ? ` tokens=${result.prompt_tokens ?? "?"}in/${result.completion_tokens ?? "?"}out`
+            : "";
         const header = `[prism_infer] backend=${result.backend}` +
             ` model=${result.model_picked ?? "n/a"}` +
             ` plan=${result.plan ?? "unknown"}` +
             ` free_ram=${result.ram_free_mb}MB` +
             ` latency=${result.latency_ms}ms` +
             ` used_cloud=${result.used_cloud}` +
+            tokenStr +
             (result.quality_gate_failed ? ` quality_gate_failed=true` : "") +
             (result.verification ? ` verify=${result.verification.action}` : "") +
             (result.attempts.length ? ` attempts=${JSON.stringify(result.attempts)}` : "");
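
Two small pieces of the hunks above can be read in isolation: the cloud path estimates tokens as one per four characters, rounded up, and the header appends a token suffix only when at least one count is present. The sketch below restates both as standalone functions. The names `estimateTokens` and `formatTokenSuffix` are hypothetical, and the four-characters-per-token ratio is the diff's heuristic rather than a real tokenizer count.

```javascript
// Hypothetical helper mirroring the cloud-path estimate: ~4 characters per token.
function estimateTokens(text) {
    return Math.ceil(text.length / 4);
}

// Hypothetical helper mirroring the header suffix: omitted when both counts are
// missing, with "?" standing in for a single missing count.
function formatTokenSuffix(promptTokens, completionTokens) {
    if (promptTokens == null && completionTokens == null) return "";
    return ` tokens=${promptTokens ?? "?"}in/${completionTokens ?? "?"}out`;
}

console.log(formatTokenSuffix(estimateTokens("Write a trie in Python"), undefined));
```

Because the cloud counts are estimates, they are best read as a rough size indicator alongside the exact counts Ollama reports for local calls.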

package/dist/tools/taskRouterHandler.js CHANGED

@@ -317,7 +317,7 @@ export async function sessionTaskRouteHandler(args) {
     delete result._rawComposite;
     // ── v9.x: Local LLM second-opinion for low-confidence cases ──────────────
     // When confidence is below the threshold AND local LLM is enabled,
-    // ask prism-coder:7b to break the tie. This is purely additive — if the
+    // ask prism-coder:9b to break the tie. This is purely additive — if the
     // LLM call fails or times out, the original heuristic result is returned.
     if (PRISM_LOCAL_LLM_ENABLED &&
         result.confidence < PRISM_TASK_ROUTER_CONFIDENCE_THRESHOLD) {
@@ -350,7 +350,7 @@ export async function sessionTaskRouteHandler(args) {
 }
 // ─── Local LLM Route Classifier ──────────────────────────────
 /**
- * Ask prism-coder:7b to classify a task description as "claw" or "host".
+ * Ask prism-coder:9b to classify a task description as "claw" or "host".
  * Returns the string or null if the model is unavailable / response unparseable.
  * Called only when heuristic confidence is below the threshold.
  */

package/dist/utils/ddLogger.js CHANGED

@@ -8,9 +8,17 @@
  * Env: PRISM_SYNALUX_BASE_URL (default https://synalux.ai)
  */
 const SYNALUX_BASE = process.env.PRISM_SYNALUX_BASE_URL || "https://synalux.ai";
+const TELEMETRY_WRITE_TOKEN = process.env.TELEMETRY_WRITE_TOKEN || "";
 const DD_API_KEY = process.env.DD_API_KEY || "";
 const DD_SITE = process.env.DD_SITE || "datadoghq.com";
 const SERVICE = "prism-mcp";
+const CONTEXT_ALLOWLIST = new Set([
+    "backend", "model", "used_cloud", "prompt_tokens", "completion_tokens",
+    "latency_ms", "plan", "requested_ceiling", "effective_ceiling",
+    "ceiling_clamped", "requested_tokens", "effective_tokens", "tokens_clamped",
+    "cloud_requested", "cloud_allowed", "cloud_blocked",
+    "verify_requested", "verify_allowed", "verify_blocked",
+]);
 const queue = [];
 let flushTimer = null;
 const FLUSH_INTERVAL_MS = 5_000;
@@ -26,31 +34,61 @@ async function flush() {
         return;
     const batch = queue.splice(0, MAX_BATCH);
     // Primary: Synalux portal → Supabase (always available)
-    try {
-        await fetch(`${SYNALUX_BASE}/api/v1/telemetry`, {
-            method: "POST",
-            headers: { "Content-Type": "application/json" },
-            body: JSON.stringify(batch.map(e => ({
-                service: SERVICE,
-                event_type: e.status === "error" ? "error" : "action",
-                message: e.message,
-                context: { ...e, service: undefined, message: undefined },
-                user_id: e.user_id,
-                user_plan: e.user_plan,
-            }))),
-            signal: AbortSignal.timeout(5_000),
-        });
-    }
-    catch {
-        // Silent — don't crash the MCP server
+    if (TELEMETRY_WRITE_TOKEN) {
+        try {
+            await fetch(`${SYNALUX_BASE}/api/v1/telemetry`, {
+                method: "POST",
+                headers: {
+                    "Content-Type": "application/json",
+                    "Authorization": `Bearer ${TELEMETRY_WRITE_TOKEN}`,
+                    "X-Prism-Client": "prism-mcp",
+                },
+                body: JSON.stringify(batch.map(e => {
+                    const ctx = {};
+                    for (const [k, v] of Object.entries(e)) {
+                        if (CONTEXT_ALLOWLIST.has(k))
+                            ctx[k] = v;
+                    }
+                    return {
+                        service: SERVICE,
+                        event_type: e.status === "error" ? "error" : "action",
+                        message: e.message,
+                        context: ctx,
+                        user_id: e.user_id,
+                        user_plan: e.user_plan,
+                    };
+                })),
+                signal: AbortSignal.timeout(5_000),
+            });
+        }
+        catch {
+            // Silent — don't crash the MCP server
+        }
     }
     // Secondary: Datadog Logs (if API key is set AND Logs product is enabled)
+    // Same allowlist applied — both sinks get identical filtered context.
     if (DD_API_KEY) {
         try {
             await fetch(`https://http-intake.logs.${DD_SITE}/api/v2/logs`, {
                 method: "POST",
                 headers: { "Content-Type": "application/json", "DD-API-KEY": DD_API_KEY },
-                body: JSON.stringify(batch),
+                body: JSON.stringify(batch.map(e => {
+                    const ctx = {};
+                    for (const [k, v] of Object.entries(e)) {
+                        if (CONTEXT_ALLOWLIST.has(k))
+                            ctx[k] = v;
+                    }
+                    return {
+                        ddsource: "nodejs",
+                        ddtags: e.ddtags,
+                        hostname: e.hostname,
+                        service: SERVICE,
+                        status: e.status,
+                        message: e.message,
+                        ...ctx,
+                        timestamp: e.timestamp,
+                    };
+                })),
                 signal: AbortSignal.timeout(5_000),
             });
         }
@@ -68,7 +106,7 @@ export function ddLog(level, message, context) {
         hostname: process.env.HOSTNAME || "prism-mcp",
         service: SERVICE,
         status: level,
-        message,
+        message: message.slice(0, 200),
         ...context,
         timestamp: new Date().toISOString(),
     });
@@ -90,3 +128,6 @@ export function ddInfo(message, context) {
 export function ddWarn(message, context) {
     ddLog("warn", message, context);
 }
+if (!TELEMETRY_WRITE_TOKEN && process.env.PRISM_DEBUG_LOGGING) {
+    console.info("[prism-mcp] Portal telemetry not configured (no TELEMETRY_WRITE_TOKEN). Session metrics work locally — this is normal for offline/free-tier use.");
+}
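
The allowlist loop added to both telemetry sinks above can be illustrated on its own. This is a minimal sketch using a shortened key set; the helper name `pickAllowed` and the sample event are hypothetical, and the real `CONTEXT_ALLOWLIST` contains more keys than shown here.

```javascript
// Shortened allowlist for illustration; the diff's CONTEXT_ALLOWLIST is longer.
const ALLOWED_KEYS = new Set([
    "backend", "model", "used_cloud",
    "prompt_tokens", "completion_tokens", "latency_ms",
]);

// Hypothetical helper: copy only allowlisted keys into a fresh context object.
function pickAllowed(event, allowed = ALLOWED_KEYS) {
    const ctx = {};
    for (const [k, v] of Object.entries(event)) {
        if (allowed.has(k)) ctx[k] = v;
    }
    return ctx;
}

// Sample event with a field that should never leave the process.
const sample = { backend: "ollama", latency_ms: 620, prompt: "patient notes" };
console.log(pickAllowed(sample));
```

An allowlist fails closed: a field added to the event later stays out of telemetry until it is explicitly listed, which is the safer default when prompts may contain sensitive text.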

package/dist/utils/inferenceMetrics.js ADDED

@@ -0,0 +1,93 @@
+/**
+ * Inference metrics — local accumulator for user-facing display.
+ *
+ * The local accumulator is the SOLE source for the session metrics block
+ * shown in session_save_ledger/handoff. It tracks what THIS prism process
+ * did THIS session — prism is the natural and only complete source for
+ * this data (the portal only sees what prism forwards).
+ *
+ * Portal forwarding (ddLog → /api/v1/telemetry) is a separate, best-effort
+ * analytics stream that the display path never depends on. If the portal
+ * is down, unconfigured, or the token is missing, users still see metrics.
+ */
+import { debugLog } from "./logger.js";
+const byModel = {};
+let localCalls = 0;
+let cloudCalls = 0;
+let totalPromptTokens = 0;
+let totalCompletionTokens = 0;
+let totalLatencyMs = 0;
+export function recordInference(result) {
+    if (result.backend === "safety_gate")
+        return;
+    const key = result.model_picked ?? result.backend;
+    if (result.used_cloud) {
+        cloudCalls++;
+    }
+    else {
+        localCalls++;
+    }
+    const pt = result.prompt_tokens ?? 0;
+    const ct = result.completion_tokens ?? 0;
+    totalPromptTokens += pt;
+    totalCompletionTokens += ct;
+    totalLatencyMs += result.latency_ms;
+    if (!byModel[key]) {
+        byModel[key] = { calls: 0, promptTokens: 0, completionTokens: 0, totalLatencyMs: 0 };
+    }
+    byModel[key].calls++;
+    byModel[key].promptTokens += pt;
+    byModel[key].completionTokens += ct;
+    byModel[key].totalLatencyMs += result.latency_ms;
+}
+export function getInferenceSnapshot() {
+    const total = localCalls + cloudCalls;
+    const modelCopy = {};
+    for (const [k, v] of Object.entries(byModel)) {
+        modelCopy[k] = { ...v };
+    }
+    return {
+        localCalls,
+        cloudCalls,
+        totalCalls: total,
+        localPct: total > 0 ? Math.round((localCalls / total) * 100) : 0,
+        cloudPct: total > 0 ? 100 - Math.round((localCalls / total) * 100) : 0,
+        totalPromptTokens,
+        totalCompletionTokens,
+        totalTokens: totalPromptTokens + totalCompletionTokens,
+        avgLatencyMs: total > 0 ? Math.round(totalLatencyMs / total) : 0,
+        byModel: modelCopy,
+    };
+}
+export function resetInferenceMetrics() {
+    localCalls = 0;
+    cloudCalls = 0;
+    totalPromptTokens = 0;
+    totalCompletionTokens = 0;
+    totalLatencyMs = 0;
+    for (const key of Object.keys(byModel)) {
+        delete byModel[key];
+    }
+    debugLog("[inference-metrics] Session metrics reset");
+}
+export function formatInferenceMetrics() {
+    const snap = getInferenceSnapshot();
+    if (snap.totalCalls === 0)
+        return "";
+    const lines = [
+        `\n📊 Inference Metrics (this session):`,
+        `  Total calls: ${snap.totalCalls} — Local: ${snap.localCalls} (${snap.localPct}%) | Cloud: ${snap.cloudCalls} (${snap.cloudPct}%)`,
+        `  Tokens: ${snap.totalPromptTokens.toLocaleString()} in + ${snap.totalCompletionTokens.toLocaleString()} out = ${snap.totalTokens.toLocaleString()} total`,
+        `  Avg latency: ${snap.avgLatencyMs}ms`,
+    ];
+    const models = Object.entries(snap.byModel).sort((a, b) => b[1].calls - a[1].calls);
+    if (models.length > 1) {
+        lines.push(`  By model:`);
+        for (const [name, stats] of models) {
+            const tokens = stats.promptTokens + stats.completionTokens;
+            const avgMs = stats.calls > 0 ? Math.round(stats.totalLatencyMs / stats.calls) : 0;
+            lines.push(`    ${name}: ${stats.calls} calls, ${tokens.toLocaleString()} tokens, avg ${avgMs}ms`);
+        }
+    }
+    return lines.join("\n");
+}
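
One detail in `getInferenceSnapshot` above is that the cloud percentage is computed as the complement of the rounded local percentage rather than rounded separately, so the two displayed figures always sum to 100. The sketch below isolates that calculation; the helper name `splitPercentages` is hypothetical.

```javascript
// Hypothetical helper isolating the snapshot's percentage logic.
function splitPercentages(localCalls, cloudCalls) {
    const total = localCalls + cloudCalls;
    if (total === 0) return { localPct: 0, cloudPct: 0 };
    const localPct = Math.round((localCalls / total) * 100);
    // Complement instead of a second Math.round, so the pair sums to 100.
    return { localPct, cloudPct: 100 - localPct };
}

// The README example: 10 local calls and 2 cloud calls.
console.log(splitPercentages(10, 2));
```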

package/dist/utils/localLlm.js CHANGED

@@ -1,5 +1,5 @@
 /**
- * Local LLM Client — Ollama/prism-coder:7b Integration (v1.0.0)
+ * Local LLM Client — Ollama/prism-coder:9b Integration (v1.0.0)
  * ──────────────────────────────────────────────────────────────────
  * Thin HTTP wrapper around the Ollama /api/chat endpoint.
  *
@@ -9,7 +9,7 @@
  *   - Silent fail: returning null instead of throwing ensures callers
  *     can fall back to Gemini without crashing the MCP server.
  *   - Fire-and-forget safe: wrapped in try/catch, never propagates.
- *   - Default model: prism-coder:7b — fine-tuned on Prism tool schemas,
+ *   - Default model: prism-coder:9b — fine-tuned on Prism tool schemas,
  *     8192-token context, Q8_0 quantization, ~8.1GB RAM footprint.
  *
  * FEATURE FLAG:

package/dist/utils/nerExtractor.js CHANGED

@@ -16,7 +16,7 @@
  *
  * Architecture:
  *   1. Rule-based extraction (fast, zero-cost, always available)
- *   2. Local LLM extraction (optional, higher quality, uses prism-coder:7b)
+ *   2. Local LLM extraction (optional, higher quality, uses prism-coder:9b)
  *   3. Merged + deduplicated results
  */
 import { debugLog } from "./logger.js";

package/dist/utils/qualityGate.js CHANGED

@@ -27,11 +27,20 @@ export function passesQualityGate(stripped, thinkOnly, finishReason) {
     if (finishReason === "length") {
         return { pass: false, reason: "hard_truncation" };
     }
-    // Signal 4: Exact-loop — same sentence repeated 3+ times
-    const sentences = stripped.split(/[.!?\n]+/).map(s => s.trim()).filter(s => s.length > 10);
-    if (sentences.length >= 6) {
+    // Signal 4: Exact-loop detection (two passes).
+    //
+    // Pass A (prose-only, threshold ≥3): strip structural markdown that
+    // naturally repeats (code blocks, tables, headings, bold labels).
+    // Catches loops in explanatory text.
+    const proseOnly = stripped
+        .replace(/```[\s\S]*?```/g, "")
+        .replace(/^\|.*\|$/gm, "")
+        .replace(/^#{1,6}\s+.*$/gm, "")
+        .replace(/^[\s*-]*\*{1,2}[^*]+\*{1,2}:?\s*$/gm, "");
+    const proseSentences = proseOnly.split(/[.!?\n]+/).map(s => s.trim()).filter(s => s.length > 10);
+    if (proseSentences.length >= 6) {
         const counts = new Map();
-        for (const s of sentences) {
+        for (const s of proseSentences) {
             const key = s.toLowerCase();
             counts.set(key, (counts.get(key) ?? 0) + 1);
             if ((counts.get(key) ?? 0) >= 3) {
@@ -39,5 +48,20 @@ export function passesQualityGate(stripped, thinkOnly, finishReason) {
             }
         }
     }
+    // Pass B (full text, threshold ≥5): catches egregious loops hidden
+    // inside fake code blocks or other structural elements. Higher
+    // threshold avoids false positives on legitimate code patterns
+    // (e.g. `node = self.root` × 4 is fine, × 5 is suspicious).
+    const allSentences = stripped.split(/[.!?\n]+/).map(s => s.trim()).filter(s => s.length > 10);
+    if (allSentences.length >= 10) {
+        const counts = new Map();
+        for (const s of allSentences) {
+            const key = s.toLowerCase();
+            counts.set(key, (counts.get(key) ?? 0) + 1);
+            if ((counts.get(key) ?? 0) >= 5) {
+                return { pass: false, reason: "loop_detected" };
+            }
+        }
+    }
     return { pass: true };
 }
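
The two-pass check in this hunk can be exercised in isolation. The sketch below reimplements the same splitting and counting rules as standalone helpers; the names `splitFragments`, `hasLoop`, `stripStructure`, and `detectLoop` are hypothetical and do not exist in the package. The fence marker is built with `repeat` only so that the example does not contain a literal triple backtick.

```javascript
const FENCE = "`".repeat(3);

// Split on ., !, ? or newline and keep fragments longer than 10 characters.
function splitFragments(text) {
  return text.split(/[.!?\n]+/).map((s) => s.trim()).filter((s) => s.length > 10);
}

// True when some fragment occurs at least `threshold` times, provided the
// text has at least `minFragments` fragments overall.
function hasLoop(text, minFragments, threshold) {
  const fragments = splitFragments(text);
  if (fragments.length < minFragments) return false;
  const counts = new Map();
  for (const f of fragments) {
    const key = f.toLowerCase();
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n >= threshold) return true;
  }
  return false;
}

// Pass A input: drop fenced blocks, table rows, headings, and bold-label lines.
function stripStructure(text) {
  return text
    .replace(new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g"), "")
    .replace(/^\|.*\|$/gm, "")
    .replace(/^#{1,6}\s+.*$/gm, "")
    .replace(/^[\s*-]*\*{1,2}[^*]+\*{1,2}:?\s*$/gm, "");
}

// Pass A: prose only, 6+ fragments, 3+ repeats. Pass B: full text, 10+ fragments, 5+ repeats.
function detectLoop(text) {
  return hasLoop(stripStructure(text), 6, 3) || hasLoop(text, 10, 5);
}

const filler = [
  "alpha line number one",
  "beta line number two",
  "gamma line number three",
  "delta line number four",
  "epsilon line number five",
  "zeta line number six",
];
const fenced = (n) =>
  [...filler, FENCE + "python", ...Array(n).fill("node = self.root"), FENCE].join("\n");

const fourInCode = detectLoop(fenced(4)); // repeated line sits inside a fence, below Pass B threshold
const fiveInCode = detectLoop(fenced(5)); // fifth repeat trips Pass B
const proseLoop = detectLoop(
  [...filler.slice(0, 3), ...Array(3).fill("I will now check the file")].join("\n")
);
```

Because the splitter also breaks on `.`, a line such as `node = self.root` is counted as the fragment `node = self`; the trailing `root` is shorter than the 10-character cutoff and is discarded.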

package/dist/utils/safetyGate.js ADDED

@@ -0,0 +1,104 @@
+/**
+ * L1 Safety Gate — deterministic crisis/medical interception for prism_infer.
+ *
+ * Runs BEFORE model output reaches the caller. High precision, low recall:
+ * only intercepts patterns that unambiguously indicate self-harm intent
+ * (input) or dangerous instructions (output). Generic clinical/pharmacological
+ * terms ("dose of", "milligrams", "lethal") are NOT intercepted — they appear
+ * in routine BCBA/medical notes.
+ *
+ * This is a backstop, not a comprehensive filter. The model's own safety
+ * training is the primary layer.
+ */
+// ── Input: first-person crisis expressions ───────────────────────────────────
+const CRISIS_INPUT_RE = [
+    // English
+    /hurt(?:ing)?\s+(?:my)?self/i,
+    /kill\s+(?:my)?self/i,
+    /end\s+my\s+life/i,
+    /want\s+to\s+die/i,
+    /want\s+to\s+(?:commit\s+)?suicide/i,
+    /cut(?:ting)?\s+(?:my)?self/i,
+    /(?:hang|hanging)\s+(?:my)?self/i,
+    /jump(?:ing)?\s+off/i,
+    /how\s+(?:many|much|to).*(?:pills|overdose|die)/i,
+    // Spanish: exempt hyperbolic idioms only (hunger/thirst/cold are NOT exempt; they may be literal for a neglected child)
+    /quiero\s+morir(?!\s+de\s+(?:risa|la\s+risa|vergüenza|ganas|envidia|aburrimiento)\b)/i,
+    /(?:voy\s+a\s+)?matarme(?!\s+(?:estudiando|trabajando|riendo|de\s+(?:risa|la\s+risa))\b)/i,
+    /hacerme\s+da[ñn]o/i,
+    /(?:quiero\s+)?suicidarme/i,
+    // French: exempt hyperbolic idioms only (faim/soif/chaud/froid are NOT exempt)
+    /je\s+veux\s+mourir(?!\s+d[e']\s*(?:rire|honte|envie|ennui)\b)/i,
+    /(?:vais\s+)?me\s+tuer(?!\s+[àa]\s+(?:rire|travailler|étudier)\b)/i,
+    /me\s+suicider/i,
+    // Portuguese: exempt hyperbolic idioms only (fome/sede/calor/frio/sono are NOT exempt)
+    /quero\s+morrer(?!\s+de\s+(?:rir|vergonha|inveja|tédio)\b)/i,
+    /(?:vou\s+)?me\s+matar(?!\s+(?:estudando|trabalhando|rindo|de\s+(?:rir|tanto))\b)/i,
+    // Russian: exempt hyperbolic idioms only (смех/стыд/зависть/скука);
+    // голод/жара/холод are NOT exempt (may be literal)
+    /хочу\s+умереть(?!\s+от\s+(?:смеха|стыда|зависти|скуки)(?:\s|$))/i,
+    /убить\s+себя/i,
+    /покончить\s+с\s+собой/i,
+    // Arabic (alef variants handled in normalize)
+    /[اأإآ]ر[يى]د\s+[اأإآ]ن\s+[اأإآ]موت/,
+    /قتل\s+نفس[يى]/,
+    /[اأإآ]نه[يى]\s+ح[يى][اأإآ]ت[يى]/,
+    // German: exempt hyperbolic idioms only (Lachen/Scham/Neid/Langeweile);
+    // Hunger/Kälte are NOT exempt (may be literal)
+    /(?:ich\s+)?will\s+sterben(?!\s+vor\s+(?:lachen|scham|neid|langeweile)\b)/i,
+    /mich\s+umbringen/i,
+    // Ukrainian: exempt hyperbolic idioms only (сміх/сором/заздрість/нудьга);
+    // голод/холод are NOT exempt (may be literal)
+    /хочу\s+померти(?!\s+від\s+(?:сміху|сорому|заздрості|нудьги)(?:\s|$))/i,
+    /вбити\s+себе/i,
+];
+const MEDICAL_INPUT_RE = [
+    /(?:how\s+much|what\s+dose)\s+.*(?:should\s+I|do\s+I|can\s+I)\s+(?:inject|take|give)/i,
+    /(?:dose|dosage)\s+(?:should\s+I|do\s+I|can\s+I)\s+(?:inject|take|give)/i,
+];
+// ── Output: model providing dangerous instructions ───────────────────────────
+const CRISIS_OUTPUT_RE = [
+    /\bhow many pills\s+to\b/i,
+    /\boverdose amount\b/i,
+    /\bnoose\b/i,
+    /\bhanging yourself\b/i,
+    /\bsuicide method\b/i,
+    /\bmethod of suicide\b/i,
+    /\bself[- ]?harm method\b/i,
+    /\bhow to (?:kill|hang|poison|overdose)\s+(?:your|my|him|her|them|one)self\b/i,
+    /\bhow to overdose\b/i,
+];
+const MEDICAL_OUTPUT_RE = [
+    /\binject this (?:amount|dose|much)\b/i,
+    /\btake this (?:amount|many|dose)\b/i,
+    /\byou should (?:inject|take)\s+\d+\s+units\b/i,
+];
+// ── Responses ────────────────────────────────────────────────────────────────
+export const CRISIS_RESPONSE = "I'm concerned about your safety. Please call or text 988 (Suicide & Crisis Lifeline) right now — available 24/7. If in immediate danger, call 911. International: findahelpline.com. You are not alone.";
+export const MEDICAL_RESPONSE = "I cannot provide specific medical dosing advice. Please contact your doctor or pharmacist. For emergencies, call 911.";
+// ── API ──────────────────────────────────────────────────────────────────────
+function normalize(text) {
+    return text
+        .toLowerCase()
+        .replace(/\p{Cf}/gu, "")
+        .replace(/\p{Mn}/gu, "") // Arabic harakat + all combining marks
+        .replace(/ـ/g, "")
+        .replace(/[أإآ]/g, "ا")
+        .replace(/\s+/g, " ");
+}
+export function checkInputSafety(text) {
+    const t = normalize(text);
+    if (CRISIS_INPUT_RE.some(p => p.test(t)))
+        return CRISIS_RESPONSE;
+    if (MEDICAL_INPUT_RE.some(p => p.test(t)))
+        return MEDICAL_RESPONSE;
+    return null;
+}
+export function checkOutputSafety(response) {
+    const r = normalize(response);
+    if (CRISIS_OUTPUT_RE.some(re => re.test(r)))
+        return CRISIS_RESPONSE;
+    if (MEDICAL_OUTPUT_RE.some(re => re.test(r)))
+        return MEDICAL_RESPONSE;
+    return response;
+}
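
As a rough illustration of how the two exported checks might wrap a model call, the sketch below uses a reduced pattern set copied from the file above and a placeholder response string. The `guardedInfer` wrapper and its `callModel` parameter are hypothetical; how `prismInferHandler.js` actually wires these checks in is not shown in this part of the diff.

```javascript
// Reduced pattern set copied from the diff above; the real lists are much longer.
const CRISIS_INPUT_RE = [
  /kill\s+(?:my)?self/i,
  // The negative lookahead exempts hyperbole such as "morir de risa".
  /quiero\s+morir(?!\s+de\s+(?:risa|la\s+risa|vergüenza|ganas|envidia|aburrimiento)\b)/i,
];
const CRISIS_OUTPUT_RE = [/\bhow to overdose\b/i];
const CRISIS_RESPONSE = "[crisis response text]"; // placeholder for the full message

// Simplified normalize: lowercase, drop format characters, collapse whitespace.
function normalize(text) {
  return text.toLowerCase().replace(/\p{Cf}/gu, "").replace(/\s+/g, " ");
}

// Returns the canned response when the input is intercepted, otherwise null.
function checkInputSafety(text) {
  const t = normalize(text);
  return CRISIS_INPUT_RE.some((p) => p.test(t)) ? CRISIS_RESPONSE : null;
}

// Returns the canned response when the output is intercepted, otherwise the
// original (un-normalized) model response.
function checkOutputSafety(response) {
  const r = normalize(response);
  return CRISIS_OUTPUT_RE.some((p) => p.test(r)) ? CRISIS_RESPONSE : response;
}

// Hypothetical wrapper: `callModel` stands in for whatever the handler invokes.
async function guardedInfer(prompt, callModel) {
  const intercepted = checkInputSafety(prompt);
  if (intercepted !== null) return intercepted; // the model is never called
  return checkOutputSafety(await callModel(prompt));
}

const hyperbole = checkInputSafety("Quiero morir de risa con este video");
const literalHunger = checkInputSafety("quiero   morir de hambre");
const passthrough = checkOutputSafety("Here is the refactored function.");
```

The asymmetry in return values follows the original API: the input check returns `null` to signal "proceed", while the output check returns the response itself so callers can use the result directly.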

package/package.json CHANGED

@@ -1,8 +1,8 @@
 {
   "name": "prism-mcp-server",
-  "version": "19.1.0",
+  "version": "19.2.1",
   "mcpName": "io.github.dcostenco/prism-coder",
-  "description": "Prism Coder \u2014 Cognitive memory + tool-calling intelligence for AI agents. Mind Palace persistent memory (BFCL Gold Certified, 100% Tool-Call Accuracy, 114 Agent Skills, PHI Guard, Tier Enforcement, Prompt-Based Skill Routing, Zero-Search HDC/HRR retrieval, HRR Semantic Drift Detection across BCBA/Coding/AAC domains, HIPAA-hardened local-first storage, SLERP-optimized GRPO alignment) plus the prism-coder 1.7B\u201332B open-weights LLM fleet.",
+  "description": "Prism Coder — Cognitive memory + tool-calling intelligence for AI agents. Mind Palace persistent memory (BFCL Gold Certified, 100% Tool-Call Accuracy, 114 Agent Skills, PHI Guard, Tier Enforcement, Prompt-Based Skill Routing, Zero-Search HDC/HRR retrieval, HRR Semantic Drift Detection across BCBA/Coding/AAC domains, HIPAA-hardened local-first storage, SLERP-optimized GRPO alignment) plus the prism-coder 1.7B–32B open-weights LLM fleet.",
   "module": "index.ts",
   "type": "module",
   "main": "dist/server.js",