ruvector-mragent 0.1.0
- package/LICENSE +21 -0
- package/README.md +159 -0
- package/agent/concepts.mjs +71 -0
- package/agent/consolidate.mjs +55 -0
- package/agent/harness.mjs +204 -0
- package/agent/llmMutator.mjs +147 -0
- package/agent/memory.mjs +355 -0
- package/benchmark.mjs +63 -0
- package/data/eval-set.json +1685 -0
- package/harness/scorePolicy.ts +75 -0
- package/optimize.mjs +304 -0
- package/package.json +56 -0
- package/probeDarwin.mjs +24 -0
- package/test/harness.test.mjs +134 -0
- package/test/llmMutator.test.mjs +48 -0
- package/tools/genCorpus.mjs +74 -0
package/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 ruvnet
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
package/README.md
ADDED
|
@@ -0,0 +1,159 @@
|
|
|
1
|
+
# MRAgent — Self-Reconstructing Graph Memory over RuVector, evolved by Darwin
|
|
2
|
+
|
|
3
|
+
A runnable reference implementation of **MRAgent** ("Memory is Reconstructed, Not
|
|
4
|
+
Retrieved: Graph Memory for LLM Agents") on **RuVector** — and then *past* the
|
|
5
|
+
paper. A **Meta-Harness Darwin** loop evolves the reconstruction harness while the
|
|
6
|
+
memory substrate stays frozen ("freeze the model, evolve the harness").
|
|
7
|
+
|
|
8
|
+
> **Frozen model:** the RuVector Cue-Tag-Content memory graph (`agent/memory.mjs`).
|
|
9
|
+
> **Evolved harness:** a 12-gene reconstruction genome (`agent/harness.mjs`).
|
|
10
|
+
|
|
11
|
+
ADRs: **[ADR-269](../../docs/adr/ADR-269-mragent-graph-memory-darwin-optimization.md)**
|
|
12
|
+
(the MRAgent baseline) and **[ADR-270](../../docs/adr/ADR-270-self-reconstructing-graph-memory-beyond-sota.md)**
|
|
13
|
+
(this beyond-SOTA version).
|
|
14
|
+
|
|
15
|
+
## Beyond the paper
|
|
16
|
+
|
|
17
|
+
MRAgent reconstructs an answer over a *static* graph: search cues → traverse
|
|
18
|
+
cue→tag→content → prune → synthesize. This implementation adds three mechanisms a
|
|
19
|
+
25-year-out memory system needs, each a tunable gene Darwin co-evolves:
|
|
20
|
+
|
|
21
|
+
1. **Adaptive depth** (`haltConfidence`) — stop traversing once evidence is
|
|
22
|
+
decisive, so easy queries cost fewer hops (ACT-style adaptive computation).
|
|
23
|
+
2. **Abstention + calibration** (`abstainThreshold`) — answer *"I don't know"*
|
|
24
|
+
when reconstructed evidence is too weak, instead of confidently hallucinating.
|
|
25
|
+
Graded by a **risk-adjusted utility**, not raw accuracy: a confident wrong
|
|
26
|
+
answer scores worse than an honest abstention.
|
|
27
|
+
3. **Consolidation / replay** (`agent/consolidate.mjs`) — the store reorganizes
|
|
28
|
+
its own topology from workload (the self-learning GNN RuVector describes),
|
|
29
|
+
laying Cue→shortcut→Content edges so a 3-hop query resolves in 1 hop tomorrow.
|
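As a rough illustration of the adaptive-depth idea in item 1, the sketch below walks a toy layered graph and stops as soon as the best evidence reaches `haltConfidence`. The `layers` shape and the `traverseWithHalt` helper are hypothetical and are not taken from `agent/memory.mjs`; only the gene names `traversalDepth` and `haltConfidence` come from this README.

```javascript
// Hypothetical sketch: adaptive-depth traversal over a toy layered graph.
// `layers[d]` holds the evidence scores reachable at hop d + 1 (an assumed
// shape for illustration, not the real store API).
function traverseWithHalt(layers, genome) {
  let best = 0;
  let hops = 0;
  let halted = false;
  const maxDepth = Math.min(genome.traversalDepth, layers.length);
  for (let d = 0; d < maxDepth; d++) {
    hops = d + 1;
    for (const score of layers[d]) best = Math.max(best, score);
    // Evidence is decisive: stop early instead of spending more hops.
    if (best >= genome.haltConfidence) {
      halted = true;
      break;
    }
  }
  return { best, hops, halted };
}

const layers = [[0.35, 0.72], [0.85], [0.4]];
const eager = traverseWithHalt(layers, { traversalDepth: 3, haltConfidence: 0.7 });
const strict = traverseWithHalt(layers, { traversalDepth: 3, haltConfidence: 0.9 });
console.log(eager);  // halts after 1 hop
console.log(strict); // uses all 3 hops without halting
```

A lower `haltConfidence` trades some evidence quality for fewer hops, which is why it is evolved rather than fixed.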
|
30
|
+
|
|
31
|
+
## The 12-gene reconstruction genome
|
|
32
|
+
|
|
33
|
+
| Gene | Range | RuVector mapping |
|
|
34
|
+
|------|-------|------------------|
|
|
35
|
+
| `cueK` | 1–12 | # cue vectors from `hybridSearch` |
|
|
36
|
+
| `efSearch` | 16–256 | HNSW search depth |
|
|
37
|
+
| `hybridAlpha` | 0–1 | RRF sparse↔dense weight |
|
|
38
|
+
| `fusion` | rrf · linear · dbsf | hybrid fusion strategy |
|
|
39
|
+
| `traversalDepth` | 1–4 | Cypher `LINKED_TO*1..N` hops |
|
|
40
|
+
| `tagFanout` | 1–8 | tags expanded per node |
|
|
41
|
+
| `pruneThreshold` | 0–0.6 | path-evidence floor |
|
|
42
|
+
| `maxContent` | 1–20 | content `LIMIT` to synthesis |
|
|
43
|
+
| `haltConfidence` | 0.2–0.9 | **adaptive-depth halt** |
|
|
44
|
+
| `rerank` | gnn · none | corroboration-aware rerank |
|
|
45
|
+
| `promptStrategy` | terse · evidence-first · prune-explicit | synthesis prompt |
|
|
46
|
+
| `abstainThreshold` | 0–0.6 | **abstention / calibration** |
|
|
47
|
+
|
|
48
|
+
Every gene is proven load-bearing in `test/harness.test.mjs` — some only via
|
|
49
|
+
*interaction* (distractor tasks are solved by `evidence-first` **or** by
|
|
50
|
+
`terse + gnn + fanout≥2`, an epistatic landscape).
|
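The interaction can be illustrated with a small sketch. The 0.7 corroboration boost and the window sizes mirror `gnnRerank` and `STRATEGY_WINDOW` in `agent/harness.mjs`; the candidate list, its scores, and the `answersCorrectly` helper are invented for illustration.

```javascript
// Window sizes and the 0.7 boost mirror agent/harness.mjs; the data is invented.
const STRATEGY_WINDOW = { terse: 2, "evidence-first": Infinity, "prune-explicit": 5 };

function gnnRerank(items) {
  return items
    .map((c) => ({ ...c, score: c.score * (1 + 0.7 * ((c.paths ?? 1) - 1)) }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical helper: is the correct content inside the prompt window?
function answersCorrectly(candidates, genome) {
  const ranked = genome.rerank === "gnn" ? gnnRerank(candidates) : candidates;
  const window = STRATEGY_WINDOW[genome.promptStrategy];
  return ranked.slice(0, window).some((c) => c.correct);
}

// Two single-path distractors outrank the correct content on raw similarity.
// With tagFanout >= 2 the correct content is assumed to be reached by 2 paths.
const candidates = [
  { id: "distractor-a", score: 0.9, paths: 1, correct: false },
  { id: "distractor-b", score: 0.85, paths: 1, correct: false },
  { id: "answer", score: 0.6, paths: 2, correct: true },
];

console.log(answersCorrectly(candidates, { rerank: "none", promptStrategy: "terse" }));          // false
console.log(answersCorrectly(candidates, { rerank: "gnn", promptStrategy: "terse" }));           // true
console.log(answersCorrectly(candidates, { rerank: "none", promptStrategy: "evidence-first" })); // true
```

Neither `terse` nor `rerank: "gnn"` is good or bad on its own here; the outcome depends on the combination, which is what makes the landscape epistatic.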
|
51
|
+
|
|
52
|
+
## The hardened corpus (60 tasks, 6 classes, difficulty-varied)
|
|
53
|
+
|
|
54
|
+
`data/eval-set.json` is **generated** by `tools/genCorpus.mjs` (`npm run
|
|
55
|
+
gen-corpus`) as **structured signal specs**; `agent/memory.mjs` synthesizes the
|
|
56
|
+
Cue/Tag/Content node texts so difficulty is guaranteed, not dependent on fragile
|
|
57
|
+
English. A **concept layer** (`agent/concepts.mjs`) gives the dense embedding real
|
|
58
|
+
semantics decoupled from lexical overlap. 10 instances per class, with varied
|
|
59
|
+
difficulty (1-hop AND 2-hop bridges, 1–3 ranking-distractors) so a train/test
|
|
60
|
+
split constrains every gene:
|
|
61
|
+
|
|
62
|
+
| Class | Stresses |
|
|
63
|
+
|-------|----------|
|
|
64
|
+
| semantic | `hybridAlpha`→dense (paraphrase, no shared tokens) |
|
|
65
|
+
| lexical | `hybridAlpha`→sparse (rare identifier, generic concept) |
|
|
66
|
+
| hybrid | `fusion` / RRF (needs both signals) |
|
|
67
|
+
| bridge | `traversalDepth` (1–2 intermediate hops) |
|
|
68
|
+
| distractor | `rerank` / `tagFanout` / `promptStrategy` (ranking-distractor content) |
|
|
69
|
+
| unanswerable | `abstainThreshold` (no correct content exists → abstain) |
|
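To make the dense/sparse split concrete, here is a minimal sketch of a concept-based toy embedding. It is not the embedding in `agent/memory.mjs`; the two synonym groups are copied from `agent/concepts.mjs`, while `denseVec`, `cosine`, and `sparseOverlap` are hypothetical helpers.

```javascript
// Two synonym groups copied from agent/concepts.mjs; everything else is assumed.
const GROUPS = [
  ["fast", "rapid", "quick", "speed", "swift", "low-latency", "sub-millisecond", "instant"],
  ["boot", "cold-start", "startup", "initialize", "cold-boot", "launch", "spin-up"],
];

// Dense view: one dimension per concept, counting tokens that belong to it.
function denseVec(tokens) {
  const v = new Array(GROUPS.length).fill(0);
  for (const t of tokens) {
    const ci = GROUPS.findIndex((g) => g.includes(t));
    if (ci !== -1) v[ci] += 1;
  }
  return v;
}

function cosine(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const d = norm(a) * norm(b);
  return d === 0 ? 0 : dot / d;
}

// Sparse view: Jaccard overlap of the raw token sets.
function sparseOverlap(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  const shared = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : shared / union;
}

const query = ["rapid", "cold-start"];
const cue = ["fast", "boot"];
console.log(cosine(denseVec(query), denseVec(cue))); // paraphrase: dense-close (about 1)
console.log(sparseOverlap(query, cue));              // 0, no shared tokens

const idQuery = ["rvf-7"];
console.log(cosine(denseVec(idQuery), denseVec(["rvf-7"]))); // 0, no concept dimension
console.log(sparseOverlap(idQuery, ["rvf-7"]));              // 1, exact token match
```

The paraphrase pair is only findable through the dense view and the identifier only through the sparse view, which is the regime where `hybridAlpha` and `fusion` have something to tune.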
|
70
|
+
|
|
71
|
+
## Generalization, not overfitting (train / test / CV)
|
|
72
|
+
|
|
73
|
+
The optimizer **evolves on a train split and reports a held-out test split it
|
|
74
|
+
never saw** — proving the genome generalizes rather than memorizing the eval set.
|
|
75
|
+
Selection uses **3-fold cross-validation with a variance penalty** (mean − ½·range
|
|
76
|
+
across folds) so a knife-edge gene that wins one fold but collapses on another is
|
|
77
|
+
rejected. A subtle bug this surfaced — confidence was depressed by `decay^depth`,
|
|
78
|
+
making deep-but-relevant answers look weak and breaking abstention across depths —
|
|
79
|
+
is fixed by deriving **abstention confidence from the answer's raw relevance, not
|
|
80
|
+
its decayed path score** (`agent/memory.mjs`).
|
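The selection rule in the parenthesis can be written out directly. This is a sketch of the stated formula (mean minus half the range across folds), using a hypothetical `cvFitness` helper and invented fold scores; the actual implementation lives in `optimize.mjs` and may differ in detail.

```javascript
// Sketch of "mean - 1/2 * range" selection over per-fold scores.
// `cvFitness` is a hypothetical name; the fold scores below are invented.
function cvFitness(foldScores) {
  if (foldScores.length === 0) return 0;
  const mean = foldScores.reduce((s, x) => s + x, 0) / foldScores.length;
  const range = Math.max(...foldScores) - Math.min(...foldScores);
  return mean - 0.5 * range;
}

const steady = [0.78, 0.8, 0.82];    // mean 0.80, range 0.04
const knifeEdge = [0.95, 0.95, 0.5]; // mean 0.80, range 0.45

console.log(cvFitness(steady));    // about 0.78
console.log(cvFitness(knifeEdge)); // about 0.575
```

Both genomes have the same mean, but the one that collapses on a single fold is ranked well below the steady one.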
|
81
|
+
|
|
82
|
+
```
|
|
83
|
+
accuracy risk halluc
|
|
84
|
+
baseline (test) ~30% ~0.25 0.17
|
|
85
|
+
evolved (test) ~65% ~0.81 0.04 ← held out, never seen in evolution
|
|
86
|
+
+35pt +0.56 generalizes
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
(The synthetic toy embedding has per-instance noise, and one global `hybridAlpha`
|
|
90
|
+
cannot perfectly serve both dense- and sparse-keyed queries, so the test ceiling
|
|
91
|
+
is ~80%, not 100% — the gate asks whether **evolution transfers**, which it does.)
|
|
92
|
+
|
|
93
|
+
## Results on the full corpus (zero optional deps, deterministic)
|
|
94
|
+
|
|
95
|
+
```
|
|
96
|
+
config accuracy risk halluc latency hops
|
|
97
|
+
baseline 50.0% 0.417 0.17 2.81 1.23
|
|
98
|
+
evolved (ref) 70.0% 0.775 0.03 3.09 1.08
|
|
99
|
+
evolved+replay 70.0% 0.775 0.03 3.16 1.00
|
|
100
|
+
|
|
101
|
+
evolved vs baseline: accuracy +20.0pt · risk +0.358 · hallucination 0.17 → 0.03
|
|
102
|
+
consolidation: shortcuts → fewer hops at equal accuracy
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
`npm run optimize` (full GA + memetic polish) reaches **+33pt train accuracy /
|
|
106
|
+
risk 0.94** and writes the evolved genome to `optimize.report.json`, which
|
|
107
|
+
`npm run benchmark` then picks up. The optimizer is **memetic**: a genetic loop
|
|
108
|
+
(Darwin `mapLimit`/`paretoFront`) explores broadly, then deterministic
|
|
109
|
+
coordinate descent refines narrow optima (e.g. the abstention band).
|
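The polish phase can be pictured as plain coordinate descent: nudge one numeric gene at a time by a fixed step and keep the change only if fitness improves. The sketch below assumes a caller-supplied `fitness(genome)` function; `polish`, the step size, and the toy fitness are illustrative assumptions, not the code in `optimize.mjs`.

```javascript
// Hypothetical deterministic coordinate descent over numeric genes.
// `steps` maps gene name -> [lo, hi, stepSize]; all values here are assumed.
function polish(genome, fitness, steps, maxSweeps = 20) {
  let best = { ...genome };
  let bestFit = fitness(best);
  for (let sweep = 0; sweep < maxSweeps; sweep++) {
    let improved = false;
    for (const [gene, [lo, hi, step]] of Object.entries(steps)) {
      for (const delta of [-step, step]) {
        const value = Math.min(hi, Math.max(lo, best[gene] + delta));
        const candidate = { ...best, [gene]: value };
        const fit = fitness(candidate);
        if (fit > bestFit) {
          best = candidate;
          bestFit = fit;
          improved = true;
        }
      }
    }
    if (!improved) break; // local optimum for this step size
  }
  return { genome: best, fitness: bestFit };
}

// Toy fitness with a narrow optimum at abstainThreshold = 0.3.
const toyFitness = (g) => 1 - Math.abs(g.abstainThreshold - 0.3);
const result = polish(
  { abstainThreshold: 0.0 },
  toyFitness,
  { abstainThreshold: [0, 0.6, 0.05] },
);
console.log(result.genome.abstainThreshold); // close to 0.3
```

Because the loop uses no randomness, the same starting genome always polishes to the same result, which keeps the benchmark reproducible.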
|
110
|
+
|
|
111
|
+
## Run it
|
|
112
|
+
|
|
113
|
+
```bash
|
|
114
|
+
cd examples/mragent
|
|
115
|
+
npm test # 12 deterministic gates, every gene proven load-bearing
|
|
116
|
+
npm run benchmark # baseline vs evolved vs evolved+replay
|
|
117
|
+
npm run optimize # Darwin loop + memetic polish + consolidation + held-out test
|
|
118
|
+
npm run gen-corpus # regenerate data/eval-set.json (deterministic)
|
|
119
|
+
npm run probe # inspect @metaharness/darwin exports (optional)
|
|
120
|
+
```
|
|
121
|
+
|
|
122
|
+
Nothing requires network, an API key, or native bindings. The substrate is a
|
|
123
|
+
deterministic in-process graph with the **same semantics** as a live RuVector
|
|
124
|
+
`.rvf` index (concept-dense + token-sparse hybrid RRF search, bounded-depth
|
|
125
|
+
prunable Cypher traversal, GNN-style corroboration rerank), so an evolved genome
|
|
126
|
+
transfers to production unchanged.
|
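For readers unfamiliar with reciprocal rank fusion, the sketch below shows one plausible way a single `hybridAlpha` could weight dense and sparse rankings. The weighting scheme, the constant `k = 60` (a common RRF default), the cue ids, and the `weightedRrf` helper are all assumptions; the formula actually used by `agent/memory.mjs` is not shown in this README.

```javascript
// Hypothetical weighted RRF: alpha = 1 trusts the dense ranking only,
// alpha = 0 trusts the sparse ranking only.
function weightedRrf(denseRanking, sparseRanking, alpha, k = 60) {
  const scores = new Map();
  const add = (ranking, weight) => {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  };
  add(denseRanking, alpha);
  add(sparseRanking, 1 - alpha);
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Invented rankings: one cue wins on dense, one on sparse, one is second in both.
const dense = ["cue:paraphrase", "cue:both", "cue:x", "cue:y", "cue:identifier"];
const sparse = ["cue:identifier", "cue:both", "cue:y", "cue:x", "cue:paraphrase"];

console.log(weightedRrf(dense, sparse, 1.0)[0]); // cue:paraphrase
console.log(weightedRrf(dense, sparse, 0.0)[0]); // cue:identifier
console.log(weightedRrf(dense, sparse, 0.5)[0]); // cue:both
```

The cue that is merely second in each ranking wins only at a balanced `hybridAlpha`, which matches the README's point that hybrid tasks need both signals.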
|
127
|
+
|
|
128
|
+
### With the real Darwin write-layer (optional)
|
|
129
|
+
|
|
130
|
+
```bash
|
|
131
|
+
npm i -D @metaharness/darwin@latest
|
|
132
|
+
npx metaharness evolve . --generations 12 --children 3 --eval-cmd "node benchmark.mjs"
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
`harness/scorePolicy.ts` is the fitness `metaharness evolve` calls per mutation.
|
|
136
|
+
|
|
137
|
+
## ADR-150 compliance
|
|
138
|
+
|
|
139
|
+
`@metaharness/darwin` and `ruvector` are **optionalDependencies** only; every
|
|
140
|
+
touch is `try/catch` guarded; `npm test`, `npm run benchmark`, and `npm run
|
|
141
|
+
optimize` all pass with no optional deps installed (the CI gate).
|
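A guarded optional dependency typically looks like the sketch below. `loadOptional` and `pickOptimizer` are hypothetical helpers, not functions exported by this package; they rely only on the standard dynamic `import()` expression, which rejects when a module cannot be resolved.

```javascript
// Hypothetical helper: resolve an optional dependency or fall back to null.
async function loadOptional(specifier) {
  try {
    return await import(specifier);
  } catch {
    return null; // not installed (or failed to load): caller uses the built-in path
  }
}

// Hypothetical caller: choose a code path based on what is installed.
async function pickOptimizer() {
  const darwin = await loadOptional("@metaharness/darwin");
  return darwin ? "darwin" : "built-in";
}

pickOptimizer().then((name) => console.log(`optimizer: ${name}`));
```

The important property is that a missing package degrades to the built-in path instead of throwing at import time.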
|
142
|
+
|
|
143
|
+
## Layout
|
|
144
|
+
|
|
145
|
+
```
|
|
146
|
+
examples/mragent/
|
|
147
|
+
├── agent/
|
|
148
|
+
│ ├── concepts.mjs # concept layer (dense semantics ≠ sparse tokens)
|
|
149
|
+
│ ├── memory.mjs # FROZEN: Cue-Tag-Content store (RuVector semantics)
|
|
150
|
+
│ ├── harness.mjs # EVOLVED: 12-gene genome + reasoning loop
|
|
151
|
+
│ └── consolidate.mjs # replay → self-reorganizing topology
|
|
152
|
+
├── harness/scorePolicy.ts # Darwin fitness (accuracy + risk + cost)
|
|
153
|
+
├── data/eval-set.json # 60-task structured corpus (generated)
|
|
154
|
+
├── tools/genCorpus.mjs # deterministic corpus generator
|
|
155
|
+
├── optimize.mjs # GA + CV + memetic polish + held-out test + consolidation
|
|
156
|
+
├── benchmark.mjs # baseline vs evolved vs replay
|
|
157
|
+
├── probeDarwin.mjs # probe optional @metaharness/darwin
|
|
158
|
+
└── test/harness.test.mjs # 12 acceptance gates
|
|
159
|
+
```
|
package/agent/concepts.mjs
ADDED
|
@@ -0,0 +1,71 @@
|
|
|
1
|
+
// Concept layer — gives the FROZEN model a genuine *semantic* dimension that is
|
|
2
|
+
// decoupled from raw lexical overlap.
|
|
3
|
+
//
|
|
4
|
+
// Why this matters: with a plain hash-of-tokens embedding, dense cosine and
|
|
5
|
+
// sparse term-overlap are almost perfectly correlated, so `hybridAlpha` and
|
|
6
|
+
// `fusion` have nothing to tune (ADR-269 measured Δfit≈0 for both). Real
|
|
7
|
+
// embeddings differ: paraphrases ("rapid cold-start" ~ "fast boot") are dense-
|
|
8
|
+
// close with ZERO token overlap, while rare identifiers ("rvf-7", "cve-2") are
|
|
9
|
+
// lexically decisive but semantically generic.
|
|
10
|
+
//
|
|
11
|
+
// We model that split deterministically: tokens that belong to a synonym group
|
|
12
|
+
// project onto a shared CONCEPT dimension (dense semantics), and identifier-like
|
|
13
|
+
// tokens stay in a lexical tail. Result:
|
|
14
|
+
// • semantic queries → answerable by DENSE only (no shared tokens)
|
|
15
|
+
// • lexical queries → answerable by SPARSE only (concept-generic)
|
|
16
|
+
// • hybrid queries → need RRF over both
|
|
17
|
+
// which is exactly the regime where hybridAlpha + fusion are load-bearing.
|
|
18
|
+
|
|
19
|
+
// Synonym groups → concept ids. Tokens in the same group are dense-equivalent.
|
|
20
|
+
const CONCEPT_GROUPS = [
|
|
21
|
+
["fast", "rapid", "quick", "speed", "swift", "low-latency", "sub-millisecond", "instant"],
|
|
22
|
+
["boot", "cold-start", "startup", "initialize", "cold-boot", "launch", "spin-up"],
|
|
23
|
+
["compress", "compression", "quantize", "quantization", "shrink", "squeeze", "pack"],
|
|
24
|
+
["store", "storage", "persist", "write", "save", "backend", "durable"],
|
|
25
|
+
["search", "retrieve", "retrieval", "query", "lookup", "find", "recall"],
|
|
26
|
+
["graph", "topology", "network", "node", "nodes", "edge", "edges", "associative"],
|
|
27
|
+
["consensus", "agreement", "leader", "elect", "authoritative", "quorum"],
|
|
28
|
+
["secure", "security", "tamper", "tamper-evident", "witness", "proof", "cryptographic", "immutable"],
|
|
29
|
+
["merge", "fuse", "fusion", "combine", "aggregate", "blend"],
|
|
30
|
+
["prune", "filter", "drop", "discard", "remove", "trim"],
|
|
31
|
+
["accuracy", "recall", "precision", "fidelity", "correct", "quality"],
|
|
32
|
+
["memory", "remember", "reconstruct", "reconstruction", "cue", "tag", "content"],
|
|
33
|
+
["validate", "validation", "reject", "fail-fast", "guard", "check"],
|
|
34
|
+
["concurrency", "lock-free", "parallel", "branching", "copy-on-write", "throughput"],
|
|
35
|
+
["embedding", "vector", "dense", "representation", "latent"],
|
|
36
|
+
];
|
|
37
|
+
|
|
38
|
+
export const NUM_CONCEPTS = CONCEPT_GROUPS.length;
|
|
39
|
+
|
|
40
|
+
// Canonical concept name = first token of each group. Corpus specs reference
|
|
41
|
+
// concepts by these names; buildGraph synthesizes DIFFERENT surface tokens from
|
|
42
|
+
// the same group for query vs cue, so they share a concept but not a token.
|
|
43
|
+
export const CONCEPT_NAMES = CONCEPT_GROUPS.map((g) => g[0]);
|
|
44
|
+
|
|
45
|
+
const NAME_TO_INDEX = new Map(CONCEPT_NAMES.map((n, i) => [n, i]));
|
|
46
|
+
|
|
47
|
+
/** k-th distinct surface token of a concept (by name), wrapping the group. */
|
|
48
|
+
export function syn(conceptName, k = 0) {
|
|
49
|
+
const ci = NAME_TO_INDEX.get(conceptName);
|
|
50
|
+
if (ci === undefined) return conceptName; // treat unknown as a literal token
|
|
51
|
+
const group = CONCEPT_GROUPS[ci];
|
|
52
|
+
return group[k % group.length];
|
|
53
|
+
}
|
|
54
|
+
|
|
55
|
+
const TOKEN_TO_CONCEPT = new Map();
|
|
56
|
+
CONCEPT_GROUPS.forEach((group, ci) => {
|
|
57
|
+
for (const tok of group) TOKEN_TO_CONCEPT.set(tok, ci);
|
|
58
|
+
});
|
|
59
|
+
|
|
60
|
+
/** Concept id for a token, or -1 if it is lexical-only (identifier-like). */
|
|
61
|
+
export function conceptOf(token) {
|
|
62
|
+
if (TOKEN_TO_CONCEPT.has(token)) return TOKEN_TO_CONCEPT.get(token);
|
|
63
|
+
return -1;
|
|
64
|
+
}
|
|
65
|
+
|
|
66
|
+
// A token is "identifier-like" (purely lexical) if it contains a digit or
|
|
67
|
+
// starts with a known id prefix (rvf, cve, adr, id). These never get a concept, so only
|
|
68
|
+
// sparse search can pin them down.
|
|
69
|
+
export function isIdentifier(token) {
|
|
70
|
+
return /\d/.test(token) || /^(rvf|cve|adr|t\d|id)/.test(token);
|
|
71
|
+
}
|
package/agent/consolidate.mjs
ADDED
|
@@ -0,0 +1,55 @@
|
|
|
1
|
+
// Memory consolidation / replay — the "sleep" phase of a self-reorganizing memory.
|
|
2
|
+
//
|
|
3
|
+
// Beyond MRAgent: the paper reconstructs over a STATIC graph. A 25-year-out memory
|
|
4
|
+
// system reshapes its own topology from workload — exactly the self-learning GNN
|
|
5
|
+
// RuVector describes ("pushes similarities back into the neighbor lists"). After a
|
|
6
|
+
// batch of successful reconstructions, we REPLAY the winning paths and lay down a
|
|
7
|
+
// direct Cue→shortcut→Content edge, so a query that needed a 3-hop traversal today
|
|
8
|
+
// resolves in 1 hop tomorrow. Embeddings/content (the frozen model) are untouched;
|
|
9
|
+
// only graph adjacency — the store's own learned index — changes.
|
|
10
|
+
//
|
|
11
|
+
// This is gated and deterministic: it consolidates only paths that already
|
|
12
|
+
// reconstruct the CORRECT content, so it never invents associations.
|
|
13
|
+
|
|
14
|
+
import { runReasoningLoop } from "./harness.mjs";
|
|
15
|
+
|
|
16
|
+
/**
|
|
17
|
+
* Replay the corpus under `genome` and add shortcut edges for every task whose
|
|
18
|
+
* correct content is currently reconstructed. Mutates the store's graph topology.
|
|
19
|
+
* Returns { consolidated, avgHopsBefore } for reporting.
|
|
20
|
+
*
|
|
21
|
+
* @param {MemoryStore} store
|
|
22
|
+
* @param {Array} tasks
|
|
23
|
+
* @param {object} genome
|
|
24
|
+
*/
|
|
25
|
+
export function consolidate(store, tasks, genome) {
|
|
26
|
+
const { cues, tags } = store.graph;
|
|
27
|
+
let consolidated = 0;
|
|
28
|
+
let hopsBefore = 0;
|
|
29
|
+
let n = 0;
|
|
30
|
+
|
|
31
|
+
for (const task of tasks) {
|
|
32
|
+
if (task.answerable === false) continue;
|
|
33
|
+
const r = runReasoningLoop(store.queryText(task.id), store, genome, task);
|
|
34
|
+
hopsBefore += r.hops; n++;
|
|
35
|
+
if (!r.correct) continue; // only consolidate paths that genuinely work
|
|
36
|
+
|
|
37
|
+
const correctCue = cues.get(`cue:${task.id}:correct`);
|
|
38
|
+
const correctContentId = `content:${task.id}`;
|
|
39
|
+
if (!correctCue) continue;
|
|
40
|
+
|
|
41
|
+
// Lay down a 1-hop shortcut tag the correct cue reaches immediately.
|
|
42
|
+
const shortcutId = `tag:${task.id}-shortcut`;
|
|
43
|
+
if (!tags.has(shortcutId)) {
|
|
44
|
+
tags.set(shortcutId, {
|
|
45
|
+
id: shortcutId, name: `${task.id}-shortcut`, text: "shortcut",
|
|
46
|
+
toks: [], vec: new Float32Array(store.cueList[0].vec.length), content: [correctContentId], next: [],
|
|
47
|
+
});
|
|
48
|
+
// Prepend so it is the first link explored (reached at hop 1, fanout-safe).
|
|
49
|
+
correctCue.links = [shortcutId, ...correctCue.links];
|
|
50
|
+
consolidated++;
|
|
51
|
+
}
|
|
52
|
+
}
|
|
53
|
+
|
|
54
|
+
return { consolidated, avgHopsBefore: hopsBefore / (n || 1) };
|
|
55
|
+
}
|
package/agent/harness.mjs
ADDED
|
@@ -0,0 +1,204 @@
|
|
|
1
|
+
// MRAgent EVOLVED HARNESS (v2 — beyond the paper) — the surface Darwin mutates.
|
|
2
|
+
//
|
|
3
|
+
// MRAgent's contribution: memory is a Cue-Tag-Content graph, reconstructed (not
|
|
4
|
+
// retrieved) by searching cues, traversing cue→tag→content, and pruning paths.
|
|
5
|
+
// This v2 adds three mechanisms the paper does not have, each a tunable gene:
|
|
6
|
+
//
|
|
7
|
+
// • ADAPTIVE DEPTH (haltConfidence) — stop traversing once evidence is decisive,
|
|
8
|
+
// so easy queries cost fewer hops (ACT-style adaptive compute).
|
|
9
|
+
// • ABSTENTION (abstainThreshold) — answer "I don't know" when reconstructed
|
|
10
|
+
// evidence is too weak, instead of confidently hallucinating.
|
|
11
|
+
// • CORROBORATION (rerank=gnn) — boost content reached by MULTIPLE paths, so a
|
|
12
|
+
// single high-similarity distractor cannot win.
|
|
13
|
+
//
|
|
14
|
+
// The memory substrate (agent/memory.mjs) stays frozen. Darwin edits only the
|
|
15
|
+
// DARWIN_MUTABLE_BLOCK regions.
|
|
16
|
+
|
|
17
|
+
import { MemoryStore } from "./memory.mjs";
|
|
18
|
+
|
|
19
|
+
// ─── DARWIN_MUTABLE_BLOCK: reconstruction genome ────────────────────────────
|
|
20
|
+
export function baselineGenome() {
|
|
21
|
+
return {
|
|
22
|
+
// Stage 1 — hybrid cue search (RuVector hybridSearch).
|
|
23
|
+
cueK: 4, // initial cue vectors fetched [1..12]
|
|
24
|
+
efSearch: 64, // HNSW search depth / candidate pool [16..256]
|
|
25
|
+
hybridAlpha: 0.5, // RRF weight: 0=sparse … 1=dense [0..1]
|
|
26
|
+
fusion: "rrf", // rrf | linear | dbsf
|
|
27
|
+
|
|
28
|
+
// Stage 2 — active reconstruction (Cypher LINKED_TO*1..N traversal).
|
|
29
|
+
traversalDepth: 2, // cue→tag→content hops [1..4]
|
|
30
|
+
tagFanout: 3, // tags expanded per frontier node [1..8]
|
|
31
|
+
pruneThreshold: 0.1, // drop paths below this evidence score [0..0.6]
|
|
32
|
+
maxContent: 8, // content nodes handed to synthesis(LIMIT)[1..20]
|
|
33
|
+
haltConfidence: 0.9, // adaptive-depth: stop when top≥this [0.2..0.9]
|
|
34
|
+
|
|
35
|
+
// Stage 3 — synthesis (LLM prompt strategy + safety).
|
|
36
|
+
rerank: "gnn", // gnn | none (corroboration-aware rerank)
|
|
37
|
+
promptStrategy: "evidence-first", // terse | evidence-first | prune-explicit
|
|
38
|
+
abstainThreshold: 0.0, // answer "I don't know" if top score < this [0..0.6]
|
|
39
|
+
};
|
|
40
|
+
}
|
|
41
|
+
// ─── END DARWIN_MUTABLE_BLOCK ───────────────────────────────────────────────
|
|
42
|
+
|
|
43
|
+
const STRATEGY_WINDOW = { terse: 2, "evidence-first": Infinity, "prune-explicit": 5 };
|
|
44
|
+
|
|
45
|
+
// Corroboration-aware rerank: content reached by multiple distinct paths is
|
|
46
|
+
// boosted, so a single high-similarity ranking-distractor cannot outrank a
|
|
47
|
+
// well-corroborated answer. (rerank="none" leaves raw similarity order.)
|
|
48
|
+
function gnnRerank(reconstructed) {
|
|
49
|
+
return [...reconstructed]
|
|
50
|
+
.map((c) => ({ ...c, score: c.score * (1 + 0.7 * ((c.paths ?? 1) - 1)) }))
|
|
51
|
+
.sort((a, b) => b.score - a.score);
|
|
52
|
+
}
|
|
53
|
+
|
|
54
|
+
/**
|
|
55
|
+
* Synthesis judge — deterministic stand-in for the LLM. Decides: abstain, answer
|
|
56
|
+
* correctly, or answer wrongly, given the reconstructed context + confidence.
|
|
57
|
+
*/
|
|
58
|
+
function synthesize(reconstructed, task, genome, confidence) {
|
|
59
|
+
// ABSTENTION: weak evidence → refuse rather than hallucinate.
|
|
60
|
+
if (confidence < genome.abstainThreshold) return { abstained: true, correct: false, answer: "I don't know." };
|
|
61
|
+
|
|
62
|
+
const window = STRATEGY_WINDOW[genome.promptStrategy] ?? Infinity;
|
|
63
|
+
const visible = reconstructed.slice(0, window === Infinity ? reconstructed.length : window);
|
|
64
|
+
const hitIdx = visible.findIndex((c) => c.taskId === task.id);
|
|
65
|
+
|
|
66
|
+
if (hitIdx === -1) {
|
|
67
|
+
// Nothing correct in the window. If the top is a confident distractor, the LLM
|
|
68
|
+
// would emit it → a (wrong) answer; otherwise it produces an empty/no answer.
|
|
69
|
+
const wrong = visible.length > 0;
|
|
70
|
+
return { abstained: false, correct: false, answer: wrong ? "(distractor)" : "(no answer)" };
|
|
71
|
+
}
|
|
72
|
+
|
|
73
|
+
if (genome.promptStrategy === "prune-explicit") {
|
|
74
|
+
const distractorsAbove = visible.slice(0, hitIdx).filter((c) => c.taskId !== task.id).length;
|
|
75
|
+
if (distractorsAbove >= 2) return { abstained: false, correct: false, answer: "Pruned: ambiguous." };
|
|
76
|
+
}
|
|
77
|
+
return { abstained: false, correct: true, answer: task.expected_fact };
|
|
78
|
+
}
|
|
79
|
+
|
|
80
|
+
/** MRAgent reasoning loop for one task → deterministic result + telemetry. */
|
|
81
|
+
export function runReasoningLoop(queryText, store, genome, task) {
|
|
82
|
+
const cueIds = store.hybridSearch(queryText, genome);
|
|
83
|
+
let { content, stats } = store.reconstruct(queryText, cueIds, genome);
|
|
84
|
+
if (genome.rerank === "gnn") content = gnnRerank(content);
|
|
85
|
+
|
|
86
|
+
// Abstention confidence = chosen content's RAW relevance (depth-independent),
|
|
87
|
+
// not its decayed ranking score — robust across traversal depths.
|
|
88
|
+
const confidence = content.length ? (content[0].sim ?? content[0].score) : 0;
|
|
89
|
+
const out = task ? synthesize(content, task, genome, confidence) : { abstained: false, correct: false };
|
|
90
|
+
|
|
91
|
+
const latencyMs =
|
|
92
|
+
0.02 * genome.efSearch +
|
|
93
|
+
0.05 * stats.nodesVisited +
|
|
94
|
+
0.30 * Math.min(content.length, genome.maxContent) +
|
|
95
|
+
(genome.rerank === "gnn" ? 0.4 : 0);
|
|
96
|
+
|
|
97
|
+
return { ...out, confidence, latencyMs, hops: stats.hops, halted: stats.halted, nodesVisited: stats.nodesVisited, contextSize: content.length };
|
|
98
|
+
}
|
|
99
|
+
|
|
100
|
+
/**
|
|
101
|
+
* Evaluate a genome over the corpus. Reports raw accuracy AND a risk-adjusted
|
|
102
|
+
* utility that rewards correct answers, tolerates honest abstention, and PUNISHES
|
|
103
|
+
* confident hallucination — the calibration objective a 25-year-out memory system
|
|
104
|
+
* is graded on, not raw accuracy alone.
|
|
105
|
+
*
|
|
106
|
+
* answerable: correct → +1 | abstain → 0 | wrong → −1
|
|
107
|
+
* unanswerable: abstain → +1 | any answer → −1
|
|
108
|
+
*/
|
|
109
|
+
export function evaluate(genome, store, tasks) {
|
|
110
|
+
let correct = 0, answerable = 0, hallucinations = 0, util = 0;
|
|
111
|
+
let latency = 0, hops = 0, ctx = 0;
|
|
112
|
+
for (const task of tasks) {
|
|
113
|
+
const isAnswerable = task.answerable !== false;
|
|
114
|
+
const r = runReasoningLoop(store.queryText(task.id), store, genome, task);
|
|
115
|
+
if (isAnswerable) {
|
|
116
|
+
answerable++;
|
|
117
|
+
if (r.correct) { correct++; util += 1; }
|
|
118
|
+
else if (r.abstained) { util += 0; }
|
|
119
|
+
else { util -= 1; }
|
|
120
|
+
} else {
|
|
121
|
+
if (r.abstained) { util += 1; }
|
|
122
|
+
else { util -= 1; hallucinations++; }
|
|
123
|
+
}
|
|
124
|
+
latency += r.latencyMs; hops += r.hops; ctx += r.contextSize;
|
|
125
|
+
}
|
|
126
|
+
const n = tasks.length || 1;
|
|
127
|
+
return {
|
|
128
|
+
accuracy: correct / (answerable || 1), // helpfulness on answerable tasks
|
|
129
|
+
riskScore: (util / n + 1) / 2, // risk-adjusted utility in [0,1]
|
|
130
|
+
hallucinationRate: hallucinations / n,
|
|
131
|
+
avgLatencyMs: latency / n,
|
|
132
|
+
avgHops: hops / n,
|
|
133
|
+
avgContext: ctx / n,
|
|
134
|
+
n,
|
|
135
|
+
};
|
|
136
|
+
}
|
|
137
|
+
|
|
138
|
+
/**
|
|
139
|
+
* Deterministic, class-stratified train/test split. Within each class the first
|
|
140
|
+
* `trainFrac` (rounded, ≥1 each side when the class has ≥2) go to train, the rest
|
|
141
|
+
* to test. Used to prove the evolved genome GENERALIZES (we evolve on train, then
|
|
142
|
+
* report held-out test) rather than overfitting the eval set.
|
|
143
|
+
*/
|
|
144
|
+
export function splitByClass(tasks, trainFrac = 0.6) {
|
|
145
|
+
const byClass = new Map();
|
|
146
|
+
for (const t of tasks) {
|
|
147
|
+
const c = t.class ?? "default";
|
|
148
|
+
if (!byClass.has(c)) byClass.set(c, []);
|
|
149
|
+
byClass.get(c).push(t);
|
|
150
|
+
}
|
|
151
|
+
const train = [], test = [];
|
|
152
|
+
for (const group of byClass.values()) {
|
|
153
|
+
let nTrain = Math.round(group.length * trainFrac);
|
|
154
|
+
if (group.length >= 2) nTrain = Math.min(group.length - 1, Math.max(1, nTrain));
|
|
155
|
+
group.forEach((t, i) => (i < nTrain ? train : test).push(t));
|
|
156
|
+
}
|
|
157
|
+
return { train, test };
|
|
158
|
+
}
|
|
159
|
+
|
|
160
|
+
/**
|
|
161
|
+
* Deterministic, class-stratified k-fold partition. Each fold draws ~1/k of every
|
|
162
|
+
* class (round-robin), so folds are balanced. Used for cross-validated genome
|
|
163
|
+
* selection: scoring on mean minus a spread penalty across folds rejects genomes tuned to
|
|
164
|
+
* one split (e.g. a knife-edge abstainThreshold), which is what prevents overfit.
|
|
165
|
+
*/
|
|
166
|
+
export function kFoldByClass(tasks, k = 3) {
|
|
167
|
+
const byClass = new Map();
|
|
168
|
+
for (const t of tasks) {
|
|
169
|
+
const c = t.class ?? "default";
|
|
170
|
+
if (!byClass.has(c)) byClass.set(c, []);
|
|
171
|
+
byClass.get(c).push(t);
|
|
172
|
+
}
|
|
173
|
+
const folds = Array.from({ length: k }, () => []);
|
|
174
|
+
for (const group of byClass.values()) group.forEach((t, i) => folds[i % k].push(t));
|
|
175
|
+
return folds.filter((f) => f.length > 0);
|
|
176
|
+
}
|
|
177
|
+
|
|
178
|
+
// ─── DARWIN_MUTABLE_BLOCK: mutation operators ───────────────────────────────
|
|
179
|
+
const FUSIONS = ["rrf", "linear", "dbsf"];
|
|
180
|
+
const RERANKS = ["gnn", "none"];
|
|
181
|
+
const STRATEGIES = ["terse", "evidence-first", "prune-explicit"];
|
|
182
|
+
const clamp = (v, lo, hi) => Math.max(lo, Math.min(hi, v));
|
|
183
|
+
const clampI = (v, lo, hi) => clamp(Math.round(v), lo, hi);
|
|
184
|
+
const pick = (a) => a[Math.floor(Math.random() * a.length)];
|
|
185
|
+
|
|
186
|
+
export function mutate(genome) {
|
|
187
|
+
const g = { ...genome };
|
|
188
|
+
if (Math.random() < 0.4) g.cueK = clampI(g.cueK + (Math.random() * 4 - 2), 1, 12);
|
|
189
|
+
if (Math.random() < 0.4) g.efSearch = clampI(g.efSearch * (0.7 + Math.random() * 0.8), 16, 256);
|
|
190
|
+
if (Math.random() < 0.5) g.hybridAlpha = clamp(g.hybridAlpha + (Math.random() * 0.4 - 0.2), 0, 1);
|
|
191
|
+
if (Math.random() < 0.3) g.fusion = pick(FUSIONS);
|
|
192
|
+
if (Math.random() < 0.4) g.traversalDepth = clampI(g.traversalDepth + (Math.random() < 0.5 ? 1 : -1), 1, 4);
|
|
193
|
+
if (Math.random() < 0.4) g.tagFanout = clampI(g.tagFanout + (Math.random() * 4 - 2), 1, 8);
|
|
194
|
+
if (Math.random() < 0.4) g.pruneThreshold = clamp(g.pruneThreshold + (Math.random() * 0.2 - 0.1), 0, 0.6);
|
|
195
|
+
if (Math.random() < 0.4) g.maxContent = clampI(g.maxContent + (Math.random() * 6 - 3), 1, 20);
|
|
196
|
+
if (Math.random() < 0.4) g.haltConfidence = clamp(g.haltConfidence + (Math.random() * 0.3 - 0.15), 0.2, 0.9);
|
|
197
|
+
if (Math.random() < 0.3) g.rerank = pick(RERANKS);
|
|
198
|
+
if (Math.random() < 0.3) g.promptStrategy = pick(STRATEGIES);
|
|
199
|
+
if (Math.random() < 0.4) g.abstainThreshold = clamp(g.abstainThreshold + (Math.random() * 0.2 - 0.1), 0, 0.6);
|
|
200
|
+
return g;
|
|
201
|
+
}
|
|
202
|
+
// ─── END DARWIN_MUTABLE_BLOCK ───────────────────────────────────────────────
|
|
203
|
+
|
|
204
|
+
export { MemoryStore };
|
package/agent/llmMutator.mjs
ADDED
|
@@ -0,0 +1,147 @@
|
|
|
1
|
+
// GPU LLM write-layer for the MRAgent Darwin optimizer (ADR-260: "the real
|
|
2
|
+
// Darwin write-layer proposes leaps directly from failure traces").
|
|
3
|
+
//
|
|
4
|
+
// The built-in GA explores the genome by RANDOM mutation. This adds the missing
|
|
5
|
+
// directed-proposal layer: it shows a local, GPU-served code model (e.g.
|
|
6
|
+
// qwen2.5-coder on an OpenAI-compatible endpoint) the current genome + the tasks
|
|
7
|
+
// it is FAILING, and asks for an improved genome. Proposals are clamped to the
|
|
8
|
+
// declared gene bounds before they ever enter the population, so a bad LLM
|
|
9
|
+
// output is at worst clamped or ignored, never an unsafe gene.
|
|
10
|
+
//
|
|
11
|
+
// Fully opt-in + gracefully degrading (ADR-150): if no endpoint answers, the
|
|
12
|
+
// optimizer runs exactly as before (deterministic GA + coordinate-descent).
|
|
13
|
+
//
|
|
14
|
+
// Endpoint: OpenAI-compatible POST {url}/chat/completions.
|
|
15
|
+
// MRAGENT_LLM_URL (default http://localhost:11434/v1 — ollama on the GPU)
|
|
16
|
+
// MRAGENT_LLM_MODEL (default qwen2.5-coder:7b)
|
|
17
|
+
|
|
18
|
+
// Declared gene bounds — MUST stay in sync with agent/harness.mjs mutate().
|
|
19
|
+
const BOUNDS = {
|
|
20
|
+
cueK: [1, 12, "int"],
|
|
21
|
+
efSearch: [16, 256, "int"],
|
|
22
|
+
hybridAlpha: [0, 1, "float"],
|
|
23
|
+
traversalDepth: [1, 4, "int"],
|
|
24
|
+
tagFanout: [1, 8, "int"],
|
|
25
|
+
pruneThreshold: [0, 0.6, "float"],
|
|
26
|
+
maxContent: [1, 20, "int"],
|
|
27
|
+
haltConfidence: [0.2, 0.9, "float"],
|
|
28
|
+
abstainThreshold: [0, 0.6, "float"],
|
|
29
|
+
};
|
|
30
|
+
const ENUMS = {
|
|
31
|
+
fusion: ["rrf", "linear", "dbsf"],
|
|
32
|
+
rerank: ["gnn", "none"],
|
|
33
|
+
promptStrategy: ["terse", "evidence-first", "prune-explicit"],
|
|
34
|
+
};
|
|
35
|
+
|
|
36
|
+
/** Clamp/validate an arbitrary object into a safe genome, based on `baseline`. */
|
|
37
|
+
export function coerceGenome(obj, baseline) {
|
|
38
|
+
const g = { ...baseline };
|
|
39
|
+
if (!obj || typeof obj !== "object") return g;
|
|
40
|
+
for (const [k, [lo, hi, kind]] of Object.entries(BOUNDS)) {
|
|
41
|
+
const v = Number(obj[k]);
|
|
42
|
+
if (Number.isFinite(v)) {
|
|
43
|
+
const c = Math.max(lo, Math.min(hi, v));
|
|
44
|
+
g[k] = kind === "int" ? Math.round(c) : c;
|
|
45
|
+
}
|
|
46
|
+
}
|
|
47
|
+
for (const [k, opts] of Object.entries(ENUMS)) {
|
|
48
|
+
if (typeof obj[k] === "string" && opts.includes(obj[k])) g[k] = obj[k];
|
|
49
|
+
}
|
|
50
|
+
return g;
|
|
51
|
+
}
|
|
52
|
+
|
|
53
|
+
function extractJson(text) {
|
|
54
|
+
if (!text) return null;
|
|
55
|
+
const fenced = text.match(/```(?:json)?\s*([\s\S]*?)```/i);
|
|
56
|
+
const body = fenced ? fenced[1] : text;
|
|
57
|
+
const start = body.indexOf("{");
|
|
58
|
+
const end = body.lastIndexOf("}");
|
|
59
|
+
if (start === -1 || end <= start) return null;
|
|
60
|
+
try {
|
|
61
|
+
return JSON.parse(body.slice(start, end + 1));
|
|
62
|
+
} catch {
|
|
63
|
+
return null;
|
|
64
|
+
}
|
|
65
|
+
}
|
|
66
|
+
|
|
67
|
+
/** Returns `{ url, model }` if a local LLM endpoint answers, else `null`. */
|
|
68
|
+
export async function detectEndpoint(timeoutMs = 2500) {
|
|
69
|
+
const url = (process.env.MRAGENT_LLM_URL || "http://localhost:11434/v1").replace(/\/$/, "");
|
|
70
|
+
const model = process.env.MRAGENT_LLM_MODEL || "qwen2.5-coder:7b";
|
|
71
|
+
try {
|
|
72
|
+
const ctrl = AbortSignal.timeout(timeoutMs);
|
|
73
|
+
const r = await fetch(`${url}/models`, { signal: ctrl });
|
|
74
|
+
if (!r.ok) return null;
|
|
75
|
+
return { url, model };
|
|
76
|
+
} catch {
|
|
77
|
+
return null;
|
|
78
|
+
}
|
|
79
|
+
}
|
|
80
|
+
|
|
81
|
+
/**
|
|
82
|
+
* Ask the GPU model for `n` improved genomes given the current best + failure
|
|
83
|
+
* traces. Each proposal is coerced into bounds. Returns `[]` on any failure.
|
|
84
|
+
*/
|
|
85
|
+
export async function llmProposeGenomes({ url, model, baseline, current, failures, n = 2, timeoutMs = 60000 }) {
|
|
86
|
+
const genes =
|
|
87
|
+
"cueK[1..12 int], efSearch[16..256 int], hybridAlpha[0..1], fusion(rrf|linear|dbsf), " +
|
|
88
|
+
"traversalDepth[1..4 int], tagFanout[1..8 int], pruneThreshold[0..0.6], maxContent[1..20 int], " +
|
|
89
|
+
"haltConfidence[0.2..0.9], rerank(gnn|none), promptStrategy(terse|evidence-first|prune-explicit), " +
|
|
90
|
+
"abstainThreshold[0..0.6]";
|
|
91
|
+
const sys =
|
|
92
|
+
"You tune a graph-memory retrieval harness (cue search -> bounded traversal -> synthesis). " +
|
|
93
|
+
"Goal: raise accuracy AND risk-adjusted utility (abstain on weak evidence; never confidently hallucinate) " +
|
|
94
|
+
"while keeping traversal cheap. Reason briefly, then output ONLY a JSON array of genome objects.";
|
|
95
|
+
const user =
|
|
96
|
+
`Current genome:\n${JSON.stringify(current)}\n\n` +
|
|
97
|
+
`Failing cases (id, why):\n${failures}\n\n` +
|
|
98
|
+
`Genes and ranges: ${genes}\n\n` +
|
|
99
|
+
`Propose ${n} distinct improved genomes as a JSON array. JSON only.`;
|
|
100
|
+
|
|
101
|
+
let res;
|
|
102
|
+
try {
|
|
103
|
+
res = await fetch(`${url}/chat/completions`, {
|
|
104
|
+
method: "POST",
|
|
105
|
+
headers: { "Content-Type": "application/json" },
|
|
106
|
+
body: JSON.stringify({
|
|
107
|
+
model,
|
|
108
|
+
messages: [
|
|
109
|
+
{ role: "system", content: sys },
|
|
110
|
+
{ role: "user", content: user },
|
|
111
|
+
],
|
|
112
|
+
temperature: 0.5,
|
|
113
|
+
max_tokens: 700,
|
|
114
|
+
}),
|
|
115
|
+
signal: AbortSignal.timeout(timeoutMs),
|
|
116
|
+
});
|
|
117
|
+
} catch {
|
|
118
|
+
return [];
|
|
119
|
+
}
|
|
120
|
+
if (!res.ok) return [];
|
|
121
|
+
let content;
|
|
122
|
+
try {
|
|
123
|
+
content = String((await res.json())?.choices?.[0]?.message?.content ?? "");
|
|
124
|
+
} catch {
|
|
125
|
+
return [];
|
|
126
|
+
}
|
|
127
|
+
// Accept either a JSON array or a single object.
|
|
128
|
+
const parsed = extractArray(content);
|
|
129
|
+
return parsed.slice(0, n).map((o) => coerceGenome(o, baseline));
|
|
130
|
+
}
|
|
131
|
+
|
|
132
|
+
function extractArray(text) {
|
|
133
|
+
const fenced = text.match(/```(?:json)?\s*([\s\S]*?)```/i);
|
|
134
|
+
const body = fenced ? fenced[1] : text;
|
|
135
|
+
const a = body.indexOf("[");
|
|
136
|
+
const b = body.lastIndexOf("]");
|
|
137
|
+
if (a !== -1 && b > a) {
|
|
138
|
+
try {
|
|
139
|
+
const arr = JSON.parse(body.slice(a, b + 1));
|
|
140
|
+
if (Array.isArray(arr)) return arr;
|
|
141
|
+
} catch {
|
|
142
|
+
/* fall through to single-object */
|
|
143
|
+
}
|
|
144
|
+
}
|
|
145
|
+
const one = extractJson(text);
|
|
146
|
+
return one ? [one] : [];
|
|
147
|
+
}
|
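The clamping step in coerceGenome is what keeps LLM proposals safe, so a small worked example may help. The sketch below is not part of the package: it copies the clamping logic into a standalone script with a reduced, illustrative gene table (only cueK, hybridAlpha and fusion), and the baseline and proposal objects are hypothetical values chosen to show each case.

```javascript
// Illustrative subset of the gene table; the real BOUNDS/ENUMS declare more genes.
const BOUNDS = {
  cueK: [1, 12, "int"],
  hybridAlpha: [0, 1, "float"],
};
const ENUMS = { fusion: ["rrf", "linear", "dbsf"] };

// Reduced copy of the clamping logic, for demonstration only.
function coerceGenome(obj, baseline) {
  const g = { ...baseline };
  if (!obj || typeof obj !== "object") return g;
  for (const [k, [lo, hi, kind]] of Object.entries(BOUNDS)) {
    const v = Number(obj[k]);
    if (Number.isFinite(v)) {
      const c = Math.max(lo, Math.min(hi, v));
      g[k] = kind === "int" ? Math.round(c) : c;
    }
  }
  for (const [k, opts] of Object.entries(ENUMS)) {
    if (typeof obj[k] === "string" && opts.includes(obj[k])) g[k] = obj[k];
  }
  return g;
}

// Hypothetical baseline genome and a deliberately bad proposal.
const baseline = { cueK: 4, hybridAlpha: 0.5, fusion: "rrf" };
const proposal = { cueK: 99, hybridAlpha: "0.25", fusion: "bogus", extra: true };

const safe = coerceGenome(proposal, baseline);
console.log(safe);

// A non-object proposal falls back to a copy of the baseline.
const fallback = coerceGenome("not json", baseline);
console.log(fallback);
```

In this sketch the out-of-range cueK is clamped to the upper bound, the numeric string for hybridAlpha is accepted, the unknown fusion value leaves the baseline setting in place, and the undeclared extra key is never copied into the result.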