ruvector-mragent 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 ruvnet
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,159 @@
1
+ # MRAgent — Self-Reconstructing Graph Memory over RuVector, evolved by Darwin
2
+
3
+ A runnable reference implementation of **MRAgent** ("Memory is Reconstructed, Not
4
+ Retrieved: Graph Memory for LLM Agents") on **RuVector** — and then *past* the
5
+ paper. A **Meta-Harness Darwin** loop evolves the reconstruction harness while the
6
+ memory substrate stays frozen ("freeze the model, evolve the harness").
7
+
8
+ > **Frozen model:** the RuVector Cue-Tag-Content memory graph (`agent/memory.mjs`).
9
+ > **Evolved harness:** a 12-gene reconstruction genome (`agent/harness.mjs`).
10
+
11
+ ADRs: **[ADR-269](../../docs/adr/ADR-269-mragent-graph-memory-darwin-optimization.md)**
12
+ (the MRAgent baseline) and **[ADR-270](../../docs/adr/ADR-270-self-reconstructing-graph-memory-beyond-sota.md)**
13
+ (this beyond-SOTA version).
14
+
15
+ ## Beyond the paper
16
+
17
+ MRAgent reconstructs an answer over a *static* graph: search cues → traverse
18
+ cue→tag→content → prune → synthesize. This implementation adds three mechanisms a
19
+ 25-year-out memory system needs, each a tunable gene Darwin co-evolves:
20
+
21
+ 1. **Adaptive depth** (`haltConfidence`) — stop traversing once evidence is
22
+ decisive, so easy queries cost fewer hops (ACT-style adaptive computation).
23
+ 2. **Abstention + calibration** (`abstainThreshold`) — answer *"I don't know"*
24
+ when reconstructed evidence is too weak, instead of confidently hallucinating.
25
+ Graded by a **risk-adjusted utility**, not raw accuracy: a confident wrong
26
+ answer scores worse than an honest abstention.
27
+ 3. **Consolidation / replay** (`agent/consolidate.mjs`) — the store reorganizes
28
+ its own topology from workload (the self-learning GNN RuVector describes),
29
+ laying Cue→shortcut→Content edges so a 3-hop query resolves in 1 hop tomorrow.
30
+
31
+ ## The 12-gene reconstruction genome
32
+
33
+ | Gene | Range | RuVector mapping |
34
+ |------|-------|------------------|
35
+ | `cueK` | 1–12 | # cue vectors from `hybridSearch` |
36
+ | `efSearch` | 16–256 | HNSW search depth |
37
+ | `hybridAlpha` | 0–1 | RRF sparse↔dense weight |
38
+ | `fusion` | rrf · linear · dbsf | hybrid fusion strategy |
39
+ | `traversalDepth` | 1–4 | Cypher `LINKED_TO*1..N` hops |
40
+ | `tagFanout` | 1–8 | tags expanded per node |
41
+ | `pruneThreshold` | 0–0.6 | path-evidence floor |
42
+ | `maxContent` | 1–20 | content `LIMIT` to synthesis |
43
+ | `haltConfidence` | 0.2–0.9 | **adaptive-depth halt** |
44
+ | `rerank` | gnn · none | corroboration-aware rerank |
45
+ | `promptStrategy` | terse · evidence-first · prune-explicit | synthesis prompt |
46
+ | `abstainThreshold` | 0–0.6 | **abstention / calibration** |
47
+
48
+ Every gene is proven load-bearing in `test/harness.test.mjs` — some only via
49
+ *interaction* (distractor tasks are solved by `evidence-first` **or** by
50
+ `terse + gnn + fanout≥2`, an epistatic landscape).
51
+
52
+ ## The hardened corpus (60 tasks, 6 classes, difficulty-varied)
53
+
54
+ `data/eval-set.json` is **generated** by `tools/genCorpus.mjs` (`npm run
55
+ gen-corpus`) as **structured signal specs**; `agent/memory.mjs` synthesizes the
56
+ Cue/Tag/Content node texts so difficulty is guaranteed, not dependent on fragile
57
+ English. A **concept layer** (`agent/concepts.mjs`) gives the dense embedding real
58
+ semantics decoupled from lexical overlap. 10 instances per class, with varied
59
+ difficulty (1-hop AND 2-hop bridges, 1–3 ranking-distractors) so a train/test
60
+ split constrains every gene:
61
+
62
+ | Class | Stresses |
63
+ |-------|----------|
64
+ | semantic | `hybridAlpha`→dense (paraphrase, no shared tokens) |
65
+ | lexical | `hybridAlpha`→sparse (rare identifier, generic concept) |
66
+ | hybrid | `fusion` / RRF (needs both signals) |
67
+ | bridge | `traversalDepth` (1–2 intermediate hops) |
68
+ | distractor | `rerank` / `tagFanout` / `promptStrategy` (ranking-distractor content) |
69
+ | unanswerable | `abstainThreshold` (no correct content exists → abstain) |
70
+
71
+ ## Generalization, not overfitting (train / test / CV)
72
+
73
+ The optimizer **evolves on a train split and reports a held-out test split it
74
+ never saw** — proving the genome generalizes rather than memorizing the eval set.
75
+ Selection uses **3-fold cross-validation with a variance penalty** (mean − ½·range
76
+ across folds) so a knife-edge gene that wins one fold but collapses on another is
77
+ rejected. A subtle bug this surfaced — confidence was depressed by `decay^depth`,
78
+ making deep-but-relevant answers look weak and breaking abstention across depths —
79
+ is fixed by deriving **abstention confidence from the answer's raw relevance, not
80
+ its decayed path score** (`agent/memory.mjs`).
81
+
82
+ ```
83
+                    accuracy   risk     halluc
84
+ baseline (test)    ~30%       ~0.25    0.17
85
+ evolved (test)     ~65%       ~0.81    0.04    ← held out, never seen in evolution
86
+                    +35pt      +0.56            generalizes
87
+ ```
88
+
89
+ (The synthetic toy embedding has per-instance noise, and one global `hybridAlpha`
90
+ cannot perfectly serve both dense- and sparse-keyed queries, so the test ceiling
91
+ is ~80%, not 100% — the gate asks whether **evolution transfers**, which it does.)
92
+
93
+ ## Results on the full corpus (zero optional deps, deterministic)
94
+
95
+ ```
96
+ config            accuracy   risk    halluc   latency   hops
97
+ baseline          50.0%      0.417   0.17     2.81      1.23
98
+ evolved (ref)     70.0%      0.775   0.03     3.09      1.08
99
+ evolved+replay    70.0%      0.775   0.03     3.16      1.00
100
+
101
+ evolved vs baseline: accuracy +20.0pt · risk +0.358 · hallucination 0.17 → 0.03
102
+ consolidation: shortcuts → fewer hops at equal accuracy
103
+ ```
104
+
105
+ `npm run optimize` (full GA + memetic polish) reaches **+33pt train accuracy /
106
+ risk 0.94** and writes the evolved genome to `optimize.report.json`, which
107
+ `npm run benchmark` then picks up. The optimizer is **memetic**: a genetic loop
108
+ (Darwin `mapLimit`/`paretoFront`) explores broadly, then deterministic
109
+ coordinate descent refines narrow optima (e.g. the abstention band).
110
+
111
+ ## Run it
112
+
113
+ ```bash
114
+ cd examples/mragent
115
+ npm test # 12 deterministic gates, every gene proven load-bearing
116
+ npm run benchmark # baseline vs evolved vs evolved+replay
117
+ npm run optimize # Darwin loop + memetic polish + consolidation + held-out test
118
+ npm run gen-corpus # regenerate data/eval-set.json (deterministic)
119
+ npm run probe # inspect @metaharness/darwin exports (optional)
120
+ ```
121
+
122
+ Nothing requires network, an API key, or native bindings. The substrate is a
123
+ deterministic in-process graph with the **same semantics** as a live RuVector
124
+ `.rvf` index (concept-dense + token-sparse hybrid RRF search, bounded-depth
125
+ prunable Cypher traversal, GNN-style corroboration rerank), so an evolved genome
126
+ transfers to production unchanged.
127
+
128
+ ### With the real Darwin write-layer (optional)
129
+
130
+ ```bash
131
+ npm i -D @metaharness/darwin@latest
132
+ npx metaharness evolve . --generations 12 --children 3 --eval-cmd "node benchmark.mjs"
133
+ ```
134
+
135
+ `harness/scorePolicy.ts` is the fitness `metaharness evolve` calls per mutation.
136
+
137
+ ## ADR-150 compliance
138
+
139
+ `@metaharness/darwin` and `ruvector` are **optionalDependencies** only; every
140
+ touch is `try/catch` guarded; `npm test`, `npm run benchmark`, and `npm run
141
+ optimize` all pass with no optional deps installed (the CI gate).
142
+
143
+ ## Layout
144
+
145
+ ```
146
+ examples/mragent/
147
+ ├── agent/
148
+ │ ├── concepts.mjs # concept layer (dense semantics ≠ sparse tokens)
149
+ │ ├── memory.mjs # FROZEN: Cue-Tag-Content store (RuVector semantics)
150
+ │ ├── harness.mjs # EVOLVED: 12-gene genome + reasoning loop
151
+ │ └── consolidate.mjs # replay → self-reorganizing topology
152
+ ├── harness/scorePolicy.ts # Darwin fitness (accuracy + risk + cost)
153
+ ├── data/eval-set.json # 60-task structured corpus (generated)
154
+ ├── tools/genCorpus.mjs # deterministic corpus generator
155
+ ├── optimize.mjs # GA + CV + memetic polish + held-out test + consolidation
156
+ ├── benchmark.mjs # baseline vs evolved vs replay
157
+ ├── probeDarwin.mjs # probe optional @metaharness/darwin
158
+ └── test/harness.test.mjs # 12 acceptance gates
159
+ ```
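
The README's selection rule (mean − ½·range across folds) can be illustrated with a small sketch. This is not code from the package: `cvScore` and the fold scores below are hypothetical, chosen only to show why a genome that peaks on one fold and collapses on another loses to a steadier one.

```javascript
// Hypothetical helper; `cvScore` is not a name taken from the package.
// Variance-penalized selection: mean of the per-fold scores minus half their range.
function cvScore(foldScores) {
  if (foldScores.length === 0) return 0;
  const mean = foldScores.reduce((sum, x) => sum + x, 0) / foldScores.length;
  const range = Math.max(...foldScores) - Math.min(...foldScores);
  return mean - 0.5 * range;
}

// Made-up fold scores for two candidate genomes.
const steady = cvScore([0.70, 0.72, 0.71]);    // consistent across folds
const knifeEdge = cvScore([0.95, 0.90, 0.40]); // higher mean, collapses on one fold

console.log({ steady, knifeEdge });
```

Under this rule the knife-edge candidate ranks below the steady one even though its plain mean is higher, which is the behavior the README attributes to the cross-validated selection.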
@@ -0,0 +1,71 @@
1
+ // Concept layer — gives the FROZEN model a genuine *semantic* dimension that is
2
+ // decoupled from raw lexical overlap.
3
+ //
4
+ // Why this matters: with a plain hash-of-tokens embedding, dense cosine and
5
+ // sparse term-overlap are almost perfectly correlated, so `hybridAlpha` and
6
+ // `fusion` have nothing to tune (ADR-269 measured Δfit≈0 for both). Real
7
+ // embeddings differ: paraphrases ("rapid cold-start" ~ "fast boot") are dense-
8
+ // close with ZERO token overlap, while rare identifiers ("rvf-7", "cve-2") are
9
+ // lexically decisive but semantically generic.
10
+ //
11
+ // We model that split deterministically: tokens that belong to a synonym group
12
+ // project onto a shared CONCEPT dimension (dense semantics), and identifier-like
13
+ // tokens stay in a lexical tail. Result:
14
+ // • semantic queries → answerable by DENSE only (no shared tokens)
15
+ // • lexical queries → answerable by SPARSE only (concept-generic)
16
+ // • hybrid queries → need RRF over both
17
+ // which is exactly the regime where hybridAlpha + fusion are load-bearing.
18
+
19
+ // Synonym groups → concept ids. Tokens in the same group are dense-equivalent.
20
+ const CONCEPT_GROUPS = [
21
+ ["fast", "rapid", "quick", "speed", "swift", "low-latency", "sub-millisecond", "instant"],
22
+ ["boot", "cold-start", "startup", "initialize", "cold-boot", "launch", "spin-up"],
23
+ ["compress", "compression", "quantize", "quantization", "shrink", "squeeze", "pack"],
24
+ ["store", "storage", "persist", "write", "save", "backend", "durable"],
25
+ ["search", "retrieve", "retrieval", "query", "lookup", "find", "recall"],
26
+ ["graph", "topology", "network", "node", "nodes", "edge", "edges", "associative"],
27
+ ["consensus", "agreement", "leader", "elect", "authoritative", "quorum"],
28
+ ["secure", "security", "tamper", "tamper-evident", "witness", "proof", "cryptographic", "immutable"],
29
+ ["merge", "fuse", "fusion", "combine", "aggregate", "blend"],
30
+ ["prune", "filter", "drop", "discard", "remove", "trim"],
31
+ ["accuracy", "recall", "precision", "fidelity", "correct", "quality"],
32
+ ["memory", "remember", "reconstruct", "reconstruction", "cue", "tag", "content"],
33
+ ["validate", "validation", "reject", "fail-fast", "guard", "check"],
34
+ ["concurrency", "lock-free", "parallel", "branching", "copy-on-write", "throughput"],
35
+ ["embedding", "vector", "dense", "representation", "latent"],
36
+ ];
37
+
38
+ export const NUM_CONCEPTS = CONCEPT_GROUPS.length;
39
+
40
+ // Canonical concept name = first token of each group. Corpus specs reference
41
+ // concepts by these names; buildGraph synthesizes DIFFERENT surface tokens from
42
+ // the same group for query vs cue, so they share a concept but not a token.
43
+ export const CONCEPT_NAMES = CONCEPT_GROUPS.map((g) => g[0]);
44
+
45
+ const NAME_TO_INDEX = new Map(CONCEPT_NAMES.map((n, i) => [n, i]));
46
+
47
+ /** k-th distinct surface token of a concept (by name), wrapping the group. */
48
+ export function syn(conceptName, k = 0) {
49
+ const ci = NAME_TO_INDEX.get(conceptName);
50
+ if (ci === undefined) return conceptName; // treat unknown as a literal token
51
+ const group = CONCEPT_GROUPS[ci];
52
+ return group[k % group.length];
53
+ }
54
+
55
+ const TOKEN_TO_CONCEPT = new Map();
56
+ CONCEPT_GROUPS.forEach((group, ci) => {
57
+ for (const tok of group) TOKEN_TO_CONCEPT.set(tok, ci);
58
+ });
59
+
60
+ /** Concept id for a token, or -1 if it is lexical-only (identifier-like). */
61
+ export function conceptOf(token) {
62
+ if (TOKEN_TO_CONCEPT.has(token)) return TOKEN_TO_CONCEPT.get(token);
63
+ return -1;
64
+ }
65
+
66
+ // A token is "identifier-like" (purely lexical) if it carries a digit or hyphen
67
+ // with a digit, or is a known id prefix. These never get a concept, so only
68
+ // sparse search can pin them down.
69
+ export function isIdentifier(token) {
70
+ return /\d/.test(token) || /^(rvf|cve|adr|t\d|id)/.test(token);
71
+ }
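
To make the dense/sparse split in the header comment concrete, here is a minimal sketch. It assumes a toy similarity (counting shared concept ids versus shared literal tokens) rather than the package's actual embedding, and the helper names `sparseOverlap` and `conceptOverlap` are hypothetical. The two groups are abbreviated from `CONCEPT_GROUPS`.

```javascript
// Two groups abbreviated from CONCEPT_GROUPS; everything else is illustrative.
const GROUPS = [
  ["fast", "rapid", "quick"],
  ["boot", "cold-start", "startup"],
];
const tokenToConcept = new Map();
GROUPS.forEach((group, ci) => group.forEach((tok) => tokenToConcept.set(tok, ci)));

// Sparse signal (toy): number of literally shared tokens.
function sparseOverlap(a, b) {
  const setB = new Set(b);
  return a.filter((t) => setB.has(t)).length;
}

// Dense signal (toy): number of shared concept ids; lexical-only tokens have none.
function conceptOverlap(a, b) {
  const ids = (toks) =>
    new Set(toks.map((t) => tokenToConcept.get(t)).filter((c) => c !== undefined));
  const idsB = ids(b);
  return [...ids(a)].filter((c) => idsB.has(c)).length;
}

const paraphraseQuery = ["rapid", "cold-start"];
const paraphraseCue = ["fast", "boot"];
const idQuery = ["rvf-7"];
const idCue = ["rvf-7", "quick"];

console.log(sparseOverlap(paraphraseQuery, paraphraseCue), conceptOverlap(paraphraseQuery, paraphraseCue));
console.log(sparseOverlap(idQuery, idCue), conceptOverlap(idQuery, idCue));
```

The paraphrase pair matches only on the concept side and the identifier pair only on the token side, which is the regime in which a sparse/dense weight has something to tune.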
@@ -0,0 +1,55 @@
1
+ // Memory consolidation / replay — the "sleep" phase of a self-reorganizing memory.
2
+ //
3
+ // Beyond MRAgent: the paper reconstructs over a STATIC graph. A 25-year-out memory
4
+ // system reshapes its own topology from workload — exactly the self-learning GNN
5
+ // RuVector describes ("pushes similarities back into the neighbor lists"). After a
6
+ // batch of successful reconstructions, we REPLAY the winning paths and lay down a
7
+ // direct Cue→shortcut→Content edge, so a query that needed a 3-hop traversal today
8
+ // resolves in 1 hop tomorrow. Embeddings/content (the frozen model) are untouched;
9
+ // only graph adjacency — the store's own learned index — changes.
10
+ //
11
+ // This is gated and deterministic: it consolidates only paths that already
12
+ // reconstruct the CORRECT content, so it never invents associations.
13
+
14
+ import { runReasoningLoop } from "./harness.mjs";
15
+
16
+ /**
17
+ * Replay the corpus under `genome` and add shortcut edges for every task whose
18
+ * correct content is currently reconstructed. Mutates the store's graph topology.
19
+ * Returns { consolidated, avgHopsBefore } for reporting.
20
+ *
21
+ * @param {MemoryStore} store
22
+ * @param {Array} tasks
23
+ * @param {object} genome
24
+ */
25
+ export function consolidate(store, tasks, genome) {
26
+ const { cues, tags } = store.graph;
27
+ let consolidated = 0;
28
+ let hopsBefore = 0;
29
+ let n = 0;
30
+
31
+ for (const task of tasks) {
32
+ if (task.answerable === false) continue;
33
+ const r = runReasoningLoop(store.queryText(task.id), store, genome, task);
34
+ hopsBefore += r.hops; n++;
35
+ if (!r.correct) continue; // only consolidate paths that genuinely work
36
+
37
+ const correctCue = cues.get(`cue:${task.id}:correct`);
38
+ const correctContentId = `content:${task.id}`;
39
+ if (!correctCue) continue;
40
+
41
+ // Lay down a 1-hop shortcut tag the correct cue reaches immediately.
42
+ const shortcutId = `tag:${task.id}-shortcut`;
43
+ if (!tags.has(shortcutId)) {
44
+ tags.set(shortcutId, {
45
+ id: shortcutId, name: `${task.id}-shortcut`, text: "shortcut",
46
+ toks: [], vec: new Float32Array(store.cueList[0].vec.length), content: [correctContentId], next: [],
47
+ });
48
+ // Prepend so it is the first link explored (reached at hop 1, fanout-safe).
49
+ correctCue.links = [shortcutId, ...correctCue.links];
50
+ consolidated++;
51
+ }
52
+ }
53
+
54
+ return { consolidated, avgHopsBefore: hopsBefore / (n || 1) };
55
+ }
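
The effect of the shortcut edge can be seen on a miniature graph. This is a sketch under assumptions: the node shapes are reduced to the `links`, `content`, and `next` fields that `consolidate()` touches, and `hopsTo` is a hypothetical breadth-first counter, not the store's real traversal (which also applies fanout, pruning, and halting).

```javascript
// Illustrative miniature of the store's cue and tag maps; shapes are assumed.
const cues = new Map([["cue:t1:correct", { links: ["tag:a"] }]]);
const tags = new Map([
  ["tag:a", { content: [], next: ["tag:b"] }],
  ["tag:b", { content: ["content:t1"], next: [] }],
]);

// Hypothetical helper: breadth-first hop count from a cue to a content id.
function hopsTo(cueId, contentId, maxDepth = 4) {
  let frontier = cues.get(cueId).links;
  for (let hop = 1; hop <= maxDepth; hop++) {
    const nextFrontier = [];
    for (const tagId of frontier) {
      const tag = tags.get(tagId);
      if (tag.content.includes(contentId)) return hop;
      nextFrontier.push(...tag.next);
    }
    frontier = nextFrontier;
  }
  return -1; // not reachable within maxDepth
}

const hopsBeforeShortcut = hopsTo("cue:t1:correct", "content:t1");

// Same move as consolidate(): add a shortcut tag and prepend it to the cue's links.
const shortcutId = "tag:t1-shortcut";
tags.set(shortcutId, { content: ["content:t1"], next: [] });
const correctCue = cues.get("cue:t1:correct");
correctCue.links = [shortcutId, ...correctCue.links];

const hopsAfterShortcut = hopsTo("cue:t1:correct", "content:t1");
console.log({ hopsBeforeShortcut, hopsAfterShortcut });
```

Prepending rather than appending matters in the original because a small `tagFanout` explores only the first few links of a cue.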
@@ -0,0 +1,204 @@
1
+ // MRAgent EVOLVED HARNESS (v2 — beyond the paper) — the surface Darwin mutates.
2
+ //
3
+ // MRAgent's contribution: memory is a Cue-Tag-Content graph, reconstructed (not
4
+ // retrieved) by searching cues, traversing cue→tag→content, and pruning paths.
5
+ // This v2 adds three mechanisms the paper does not have, each a tunable gene:
6
+ //
7
+ // • ADAPTIVE DEPTH (haltConfidence) — stop traversing once evidence is decisive,
8
+ // so easy queries cost fewer hops (ACT-style adaptive compute).
9
+ // • ABSTENTION (abstainThreshold) — answer "I don't know" when reconstructed
10
+ // evidence is too weak, instead of confidently hallucinating.
11
+ // • CORROBORATION (rerank=gnn) — boost content reached by MULTIPLE paths, so a
12
+ // single high-similarity distractor cannot win.
13
+ //
14
+ // The memory substrate (agent/memory.mjs) stays frozen. Darwin edits only the
15
+ // DARWIN_MUTABLE_BLOCK regions.
16
+
17
+ import { MemoryStore } from "./memory.mjs";
18
+
19
+ // ─── DARWIN_MUTABLE_BLOCK: reconstruction genome ────────────────────────────
20
+ export function baselineGenome() {
21
+ return {
22
+ // Stage 1 — hybrid cue search (RuVector hybridSearch).
23
+ cueK: 4, // initial cue vectors fetched [1..12]
24
+ efSearch: 64, // HNSW search depth / candidate pool [16..256]
25
+ hybridAlpha: 0.5, // RRF weight: 0=sparse … 1=dense [0..1]
26
+ fusion: "rrf", // rrf | linear | dbsf
27
+
28
+ // Stage 2 — active reconstruction (Cypher LINKED_TO*1..N traversal).
29
+ traversalDepth: 2, // cue→tag→content hops [1..4]
30
+ tagFanout: 3, // tags expanded per frontier node [1..8]
31
+ pruneThreshold: 0.1, // drop paths below this evidence score [0..0.6]
32
+ maxContent: 8, // content nodes handed to synthesis(LIMIT)[1..20]
33
+ haltConfidence: 0.9, // adaptive-depth: stop when top≥this [0.2..0.9]
34
+
35
+ // Stage 3 — synthesis (LLM prompt strategy + safety).
36
+ rerank: "gnn", // gnn | none (corroboration-aware rerank)
37
+ promptStrategy: "evidence-first", // terse | evidence-first | prune-explicit
38
+ abstainThreshold: 0.0, // answer "I don't know" if top score < this [0..0.6]
39
+ };
40
+ }
41
+ // ─── END DARWIN_MUTABLE_BLOCK ───────────────────────────────────────────────
42
+
43
+ const STRATEGY_WINDOW = { terse: 2, "evidence-first": Infinity, "prune-explicit": 5 };
44
+
45
+ // Corroboration-aware rerank: content reached by multiple distinct paths is
46
+ // boosted, so a single high-similarity ranking-distractor cannot outrank a
47
+ // well-corroborated answer. (rerank="none" leaves raw similarity order.)
48
+ function gnnRerank(reconstructed) {
49
+ return [...reconstructed]
50
+ .map((c) => ({ ...c, score: c.score * (1 + 0.7 * ((c.paths ?? 1) - 1)) }))
51
+ .sort((a, b) => b.score - a.score);
52
+ }
53
+
54
+ /**
55
+ * Synthesis judge — deterministic stand-in for the LLM. Decides: abstain, answer
56
+ * correctly, or answer wrongly, given the reconstructed context + confidence.
57
+ */
58
+ function synthesize(reconstructed, task, genome, confidence) {
59
+ // ABSTENTION: weak evidence → refuse rather than hallucinate.
60
+ if (confidence < genome.abstainThreshold) return { abstained: true, correct: false, answer: "I don't know." };
61
+
62
+ const window = STRATEGY_WINDOW[genome.promptStrategy] ?? Infinity;
63
+ const visible = reconstructed.slice(0, window === Infinity ? reconstructed.length : window);
64
+ const hitIdx = visible.findIndex((c) => c.taskId === task.id);
65
+
66
+ if (hitIdx === -1) {
67
+ // Nothing correct in the window. If the top is a confident distractor, the LLM
68
+ // would emit it → a (wrong) answer; otherwise it produces an empty/no answer.
69
+ const wrong = visible.length > 0;
70
+ return { abstained: false, correct: false, answer: wrong ? "(distractor)" : "(no answer)" };
71
+ }
72
+
73
+ if (genome.promptStrategy === "prune-explicit") {
74
+ const distractorsAbove = visible.slice(0, hitIdx).filter((c) => c.taskId !== task.id).length;
75
+ if (distractorsAbove >= 2) return { abstained: false, correct: false, answer: "Pruned: ambiguous." };
76
+ }
77
+ return { abstained: false, correct: true, answer: task.expected_fact };
78
+ }
79
+
80
+ /** MRAgent reasoning loop for one task → deterministic result + telemetry. */
81
+ export function runReasoningLoop(queryText, store, genome, task) {
82
+ const cueIds = store.hybridSearch(queryText, genome);
83
+ let { content, stats } = store.reconstruct(queryText, cueIds, genome);
84
+ if (genome.rerank === "gnn") content = gnnRerank(content);
85
+
86
+ // Abstention confidence = chosen content's RAW relevance (depth-independent),
87
+ // not its decayed ranking score — robust across traversal depths.
88
+ const confidence = content.length ? (content[0].sim ?? content[0].score) : 0;
89
+ const out = task ? synthesize(content, task, genome, confidence) : { abstained: false, correct: false };
90
+
91
+ const latencyMs =
92
+ 0.02 * genome.efSearch +
93
+ 0.05 * stats.nodesVisited +
94
+ 0.30 * Math.min(content.length, genome.maxContent) +
95
+ (genome.rerank === "gnn" ? 0.4 : 0);
96
+
97
+ return { ...out, confidence, latencyMs, hops: stats.hops, halted: stats.halted, nodesVisited: stats.nodesVisited, contextSize: content.length };
98
+ }
99
+
100
+ /**
101
+ * Evaluate a genome over the corpus. Reports raw accuracy AND a risk-adjusted
102
+ * utility that rewards correct answers, tolerates honest abstention, and PUNISHES
103
+ * confident hallucination — the calibration objective a 25-year-out memory system
104
+ * is graded on, not raw accuracy alone.
105
+ *
106
+ * answerable: correct → +1 | abstain → 0 | wrong → −1
107
+ * unanswerable: abstain → +1 | any answer → −1
108
+ */
109
+ export function evaluate(genome, store, tasks) {
110
+ let correct = 0, answerable = 0, hallucinations = 0, util = 0;
111
+ let latency = 0, hops = 0, ctx = 0;
112
+ for (const task of tasks) {
113
+ const isAnswerable = task.answerable !== false;
114
+ const r = runReasoningLoop(store.queryText(task.id), store, genome, task);
115
+ if (isAnswerable) {
116
+ answerable++;
117
+ if (r.correct) { correct++; util += 1; }
118
+ else if (r.abstained) { util += 0; }
119
+ else { util -= 1; }
120
+ } else {
121
+ if (r.abstained) { util += 1; }
122
+ else { util -= 1; hallucinations++; }
123
+ }
124
+ latency += r.latencyMs; hops += r.hops; ctx += r.contextSize;
125
+ }
126
+ const n = tasks.length || 1;
127
+ return {
128
+ accuracy: correct / (answerable || 1), // helpfulness on answerable tasks
129
+ riskScore: (util / n + 1) / 2, // risk-adjusted utility in [0,1]
130
+ hallucinationRate: hallucinations / n,
131
+ avgLatencyMs: latency / n,
132
+ avgHops: hops / n,
133
+ avgContext: ctx / n,
134
+ n,
135
+ };
136
+ }
137
+
138
+ /**
139
+ * Deterministic, class-stratified train/test split. Within each class the first
140
+ * `trainFrac` (rounded, ≥1 each side when the class has ≥2) go to train, the rest
141
+ * to test. Used to prove the evolved genome GENERALIZES (we evolve on train, then
142
+ * report held-out test) rather than overfitting the eval set.
143
+ */
144
+ export function splitByClass(tasks, trainFrac = 0.6) {
145
+ const byClass = new Map();
146
+ for (const t of tasks) {
147
+ const c = t.class ?? "default";
148
+ if (!byClass.has(c)) byClass.set(c, []);
149
+ byClass.get(c).push(t);
150
+ }
151
+ const train = [], test = [];
152
+ for (const group of byClass.values()) {
153
+ let nTrain = Math.round(group.length * trainFrac);
154
+ if (group.length >= 2) nTrain = Math.min(group.length - 1, Math.max(1, nTrain));
155
+ group.forEach((t, i) => (i < nTrain ? train : test).push(t));
156
+ }
157
+ return { train, test };
158
+ }
159
+
160
+ /**
161
+ * Deterministic, class-stratified k-fold partition. Each fold draws ~1/k of every
162
+ * class (round-robin), so folds are balanced. Used for cross-validated genome
163
+ * selection: scoring on mean-minus-variance across folds rejects genomes tuned to
164
+ * one split (e.g. a knife-edge abstainThreshold), which is what prevents overfit.
165
+ */
166
+ export function kFoldByClass(tasks, k = 3) {
167
+ const byClass = new Map();
168
+ for (const t of tasks) {
169
+ const c = t.class ?? "default";
170
+ if (!byClass.has(c)) byClass.set(c, []);
171
+ byClass.get(c).push(t);
172
+ }
173
+ const folds = Array.from({ length: k }, () => []);
174
+ for (const group of byClass.values()) group.forEach((t, i) => folds[i % k].push(t));
175
+ return folds.filter((f) => f.length > 0);
176
+ }
177
+
178
+ // ─── DARWIN_MUTABLE_BLOCK: mutation operators ───────────────────────────────
179
+ const FUSIONS = ["rrf", "linear", "dbsf"];
180
+ const RERANKS = ["gnn", "none"];
181
+ const STRATEGIES = ["terse", "evidence-first", "prune-explicit"];
182
+ const clamp = (v, lo, hi) => Math.max(lo, Math.min(hi, v));
183
+ const clampI = (v, lo, hi) => clamp(Math.round(v), lo, hi);
184
+ const pick = (a) => a[Math.floor(Math.random() * a.length)];
185
+
186
+ export function mutate(genome) {
187
+ const g = { ...genome };
188
+ if (Math.random() < 0.4) g.cueK = clampI(g.cueK + (Math.random() * 4 - 2), 1, 12);
189
+ if (Math.random() < 0.4) g.efSearch = clampI(g.efSearch * (0.7 + Math.random() * 0.8), 16, 256);
190
+ if (Math.random() < 0.5) g.hybridAlpha = clamp(g.hybridAlpha + (Math.random() * 0.4 - 0.2), 0, 1);
191
+ if (Math.random() < 0.3) g.fusion = pick(FUSIONS);
192
+ if (Math.random() < 0.4) g.traversalDepth = clampI(g.traversalDepth + (Math.random() < 0.5 ? 1 : -1), 1, 4);
193
+ if (Math.random() < 0.4) g.tagFanout = clampI(g.tagFanout + (Math.random() * 4 - 2), 1, 8);
194
+ if (Math.random() < 0.4) g.pruneThreshold = clamp(g.pruneThreshold + (Math.random() * 0.2 - 0.1), 0, 0.6);
195
+ if (Math.random() < 0.4) g.maxContent = clampI(g.maxContent + (Math.random() * 6 - 3), 1, 20);
196
+ if (Math.random() < 0.4) g.haltConfidence = clamp(g.haltConfidence + (Math.random() * 0.3 - 0.15), 0.2, 0.9);
197
+ if (Math.random() < 0.3) g.rerank = pick(RERANKS);
198
+ if (Math.random() < 0.3) g.promptStrategy = pick(STRATEGIES);
199
+ if (Math.random() < 0.4) g.abstainThreshold = clamp(g.abstainThreshold + (Math.random() * 0.2 - 0.1), 0, 0.6);
200
+ return g;
201
+ }
202
+ // ─── END DARWIN_MUTABLE_BLOCK ───────────────────────────────────────────────
203
+
204
+ export { MemoryStore };
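
The scoring table in the `evaluate()` doc comment can be worked through on a handful of outcomes. The sketch below mirrors that table and the `(util / n + 1) / 2` normalization. The outcome records and the name `riskScore` are hypothetical stand-ins for what `runReasoningLoop` returns.

```javascript
// Hypothetical outcome records; field names are illustrative.
// answerable:   correct → +1 | abstain → 0 | wrong → −1
// unanswerable: abstain → +1 | any answer → −1
function riskScore(outcomes) {
  let util = 0;
  for (const o of outcomes) {
    if (o.answerable) util += o.correct ? 1 : o.abstained ? 0 : -1;
    else util += o.abstained ? 1 : -1;
  }
  const n = outcomes.length || 1;
  return (util / n + 1) / 2; // map mean utility from [-1, 1] to [0, 1]
}

const confidentlyWrong = [
  { answerable: true, correct: true, abstained: false },
  { answerable: true, correct: false, abstained: false }, // hallucinated answer
  { answerable: true, correct: false, abstained: true },
  { answerable: false, correct: false, abstained: true },
];
// Same run, but the wrong answer is replaced by an honest abstention.
const honest = confidentlyWrong.map((o, i) => (i === 1 ? { ...o, abstained: true } : o));

console.log(riskScore(confidentlyWrong), riskScore(honest));
```

Raw accuracy is identical for the two runs (one correct answer out of three answerable tasks), but the risk score is higher for the run that abstains, which is the calibration pressure the doc comment describes.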
@@ -0,0 +1,147 @@
1
+ // GPU LLM write-layer for the MRAgent Darwin optimizer (ADR-260: "the real
2
+ // Darwin write-layer proposes leaps directly from failure traces").
3
+ //
4
+ // The built-in GA explores the genome by RANDOM mutation. This adds the missing
5
+ // directed-proposal layer: it shows a local, GPU-served code model (e.g.
6
+ // qwen2.5-coder on an OpenAI-compatible endpoint) the current genome + the tasks
7
+ // it is FAILING, and asks for an improved genome. Proposals are clamped to the
8
+ // declared gene bounds before they ever enter the population, so a bad LLM
9
+ // output is at worst clamped or ignored, never an unsafe gene.
10
+ //
11
+ // Fully opt-in + gracefully degrading (ADR-150): if no endpoint answers, the
12
+ // optimizer runs exactly as before (deterministic GA + coordinate-descent).
13
+ //
14
+ // Endpoint: OpenAI-compatible POST {url}/chat/completions.
15
+ // MRAGENT_LLM_URL (default http://localhost:11434/v1 — ollama on the GPU)
16
+ // MRAGENT_LLM_MODEL (default qwen2.5-coder:7b)
17
+
18
+ // Declared gene bounds — MUST stay in sync with agent/harness.mjs mutate().
19
+ const BOUNDS = {
20
+ cueK: [1, 12, "int"],
21
+ efSearch: [16, 256, "int"],
22
+ hybridAlpha: [0, 1, "float"],
23
+ traversalDepth: [1, 4, "int"],
24
+ tagFanout: [1, 8, "int"],
25
+ pruneThreshold: [0, 0.6, "float"],
26
+ maxContent: [1, 20, "int"],
27
+ haltConfidence: [0.2, 0.9, "float"],
28
+ abstainThreshold: [0, 0.6, "float"],
29
+ };
30
+ const ENUMS = {
31
+ fusion: ["rrf", "linear", "dbsf"],
32
+ rerank: ["gnn", "none"],
33
+ promptStrategy: ["terse", "evidence-first", "prune-explicit"],
34
+ };
35
+
36
+ /** Clamp/validate an arbitrary object into a safe genome, based on `baseline`. */
37
+ export function coerceGenome(obj, baseline) {
38
+ const g = { ...baseline };
39
+ if (!obj || typeof obj !== "object") return g;
40
+ for (const [k, [lo, hi, kind]] of Object.entries(BOUNDS)) {
41
+ const v = Number(obj[k]);
42
+ if (Number.isFinite(v)) {
43
+ const c = Math.max(lo, Math.min(hi, v));
44
+ g[k] = kind === "int" ? Math.round(c) : c;
45
+ }
46
+ }
47
+ for (const [k, opts] of Object.entries(ENUMS)) {
48
+ if (typeof obj[k] === "string" && opts.includes(obj[k])) g[k] = obj[k];
49
+ }
50
+ return g;
51
+ }
52
+
53
+ function extractJson(text) {
54
+ if (!text) return null;
55
+ const fenced = text.match(/```(?:json)?\s*([\s\S]*?)```/i);
56
+ const body = fenced ? fenced[1] : text;
57
+ const start = body.indexOf("{");
58
+ const end = body.lastIndexOf("}");
59
+ if (start === -1 || end <= start) return null;
60
+ try {
61
+ return JSON.parse(body.slice(start, end + 1));
62
+ } catch {
63
+ return null;
64
+ }
65
+ }
66
+
67
+ /** Returns `{ url, model }` if a local LLM endpoint answers, else `null`. */
68
+ export async function detectEndpoint(timeoutMs = 2500) {
69
+ const url = (process.env.MRAGENT_LLM_URL || "http://localhost:11434/v1").replace(/\/$/, "");
70
+ const model = process.env.MRAGENT_LLM_MODEL || "qwen2.5-coder:7b";
71
+ try {
72
+ const ctrl = AbortSignal.timeout(timeoutMs);
73
+ const r = await fetch(`${url}/models`, { signal: ctrl });
74
+ if (!r.ok) return null;
75
+ return { url, model };
76
+ } catch {
77
+ return null;
78
+ }
79
+ }
80
+
81
+ /**
82
+ * Ask the GPU model for `n` improved genomes given the current best + failure
83
+ * traces. Each proposal is coerced into bounds. Returns `[]` on any failure.
84
+ */
85
+ export async function llmProposeGenomes({ url, model, baseline, current, failures, n = 2, timeoutMs = 60000 }) {
86
+ const genes =
87
+ "cueK[1..12 int], efSearch[16..256 int], hybridAlpha[0..1], fusion(rrf|linear|dbsf), " +
88
+ "traversalDepth[1..4 int], tagFanout[1..8 int], pruneThreshold[0..0.6], maxContent[1..20 int], " +
89
+ "haltConfidence[0.2..0.9], rerank(gnn|none), promptStrategy(terse|evidence-first|prune-explicit), " +
90
+ "abstainThreshold[0..0.6]";
91
+ const sys =
92
+ "You tune a graph-memory retrieval harness (cue search -> bounded traversal -> synthesis). " +
93
+ "Goal: raise accuracy AND risk-adjusted utility (abstain on weak evidence; never confidently hallucinate) " +
94
+ "while keeping traversal cheap. Reason briefly, then output ONLY a JSON array of genome objects.";
95
+ const user =
96
+ `Current genome:\n${JSON.stringify(current)}\n\n` +
97
+ `Failing cases (id, why):\n${failures}\n\n` +
98
+ `Genes and ranges: ${genes}\n\n` +
99
+ `Propose ${n} distinct improved genomes as a JSON array. JSON only.`;
100
+
101
+ let res;
102
+ try {
103
+ res = await fetch(`${url}/chat/completions`, {
104
+ method: "POST",
105
+ headers: { "Content-Type": "application/json" },
106
+ body: JSON.stringify({
107
+ model,
108
+ messages: [
109
+ { role: "system", content: sys },
110
+ { role: "user", content: user },
111
+ ],
112
+ temperature: 0.5,
113
+ max_tokens: 700,
114
+ }),
115
+ signal: AbortSignal.timeout(timeoutMs),
116
+ });
117
+ } catch {
118
+ return [];
119
+ }
120
+ if (!res.ok) return [];
121
+ let content;
122
+ try {
123
+ content = (await res.json()).choices?.[0]?.message?.content ?? "";
124
+ } catch {
125
+ return [];
126
+ }
127
+ // Accept either a JSON array or a single object.
128
+ const parsed = extractArray(content);
129
+ return parsed.slice(0, n).map((o) => coerceGenome(o, baseline));
130
+ }
131
+
132
+ function extractArray(text) {
133
+ const fenced = text.match(/```(?:json)?\s*([\s\S]*?)```/i);
134
+ const body = fenced ? fenced[1] : text;
135
+ const a = body.indexOf("[");
136
+ const b = body.lastIndexOf("]");
137
+ if (a !== -1 && b > a) {
138
+ try {
139
+ const arr = JSON.parse(body.slice(a, b + 1));
140
+ if (Array.isArray(arr)) return arr;
141
+ } catch {
142
+ /* fall through to single-object */
143
+ }
144
+ }
145
+ const one = extractJson(text);
146
+ return one ? [one] : [];
147
+ }
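
The clamping step is easiest to follow on a deliberately bad proposal. The sketch below restates the logic of `coerceGenome` over a reduced, illustrative subset of `BOUNDS` and `ENUMS`. The function name `coerce` and the sample proposal are hypothetical.

```javascript
// Reduced, illustrative subset of the declared bounds above.
const BOUNDS = { cueK: [1, 12, "int"], hybridAlpha: [0, 1, "float"] };
const ENUMS = { fusion: ["rrf", "linear", "dbsf"] };

// Same approach as coerceGenome: start from the baseline, overwrite only valid genes.
function coerce(obj, baseline) {
  const g = { ...baseline };
  if (!obj || typeof obj !== "object") return g;
  for (const [k, [lo, hi, kind]] of Object.entries(BOUNDS)) {
    const v = Number(obj[k]);
    if (Number.isFinite(v)) {
      const c = Math.max(lo, Math.min(hi, v));
      g[k] = kind === "int" ? Math.round(c) : c;
    }
  }
  for (const [k, opts] of Object.entries(ENUMS)) {
    if (typeof obj[k] === "string" && opts.includes(obj[k])) g[k] = obj[k];
  }
  return g;
}

const baseline = { cueK: 4, hybridAlpha: 0.5, fusion: "rrf" };
// Out-of-range number, numeric string, unknown enum value, and an unknown key.
const proposal = { cueK: 99, hybridAlpha: "0.25", fusion: "bm25", extra: true };
const safe = coerce(proposal, baseline);
const fromGarbage = coerce("not an object", baseline);
const fromNull = coerce({ cueK: null }, baseline);

console.log(safe, fromGarbage, fromNull);
```

One detail worth noting when reusing this pattern: `Number(null)` is `0`, so a `null` gene is treated as a finite value and clamped to the lower bound rather than falling back to the baseline. The result is still within bounds, but it is not a no-op.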