pi-autoresearch-vkf 0.5.0

package/CHANGELOG.md ADDED
@@ -0,0 +1,106 @@
1
+ # Changelog
2
+
3
+ ## 0.5.0 — unreleased
4
+
5
+ Self-contained workspace (breaking path change).
6
+
7
+ - All package state now lives under a single namespaced directory,
8
+ `.autoresearch-vkf/`, with `session/` (ephemeral run state, was `.auto/`) and
9
+ `memory/` (the VKF bundle, was `.research-memory/`). This stops the session dir
10
+ from colliding with pi-autoresearch's `.auto/` and makes the package's
11
+ footprint obvious and self-contained.
12
+ - Global memory moves to `~/.autoresearch-vkf/memory/`;
13
+ `PI_AUTORESEARCH_GLOBAL_ROOT` now names the *root* (default `~`).
14
+ - Internal: hardcoded `${root}/.research-memory` paths replaced with
15
+ `memoryPaths(root)`; `paths.ts` exposes `pkgDir` and the session/memory
16
+ subdir layout. Existing bundles can be migrated by moving `.auto` →
17
+ `.autoresearch-vkf/session` and `.research-memory` → `.autoresearch-vkf/memory`.
18
+
19
+ ## 0.4.0 — unreleased
20
+
21
+ Phase 4: global cross-project memory + a benchmark vs standard autoresearch.
22
+
23
+ - **Global shared memory**: a cross-project VKF bundle (default
24
+ `~/.config/pi-autoresearch-vkf`, override `$PI_AUTORESEARCH_GLOBAL_ROOT`),
25
+ reusing the same card helpers via a `globalRoot()` resolver.
26
+ - **`promote_to_global` tool**: copies a trusted card (source_verified+) into
27
+ global memory with a transaction; only durable, verified knowledge is shareable.
28
+ - **`recall_memory` `scope`** param (`project` / `global` / `both`) surfaces
29
+ knowledge learned in other repos.
30
+ - **Benchmark harness** (`benchmark/`): standard blind autoresearch vs ours over
31
+ deterministic, ground-truth idea-environments. Ours is driven through the *real*
32
+ `scoring.ts` (rankIdeas) and `synthesis.ts` (findContradictions), so it
33
+ benchmarks shipped code. Metrics: best improvement, unique mechanisms, wasted
34
+ experiments, dead-ends retried, synthesized ideas, found-optimum rate. `npm run
35
+ bench`; `--update-readme` writes results between `<!-- BENCH:START/END -->`.
36
+ - Across both benchmark scenarios, ours reaches the (synthesis-only) optimum in 100% of runs vs 0% for the standard loop, with
37
+ roughly 3.5× the best improvement, zero repeated experiments, and fewer dead-end retries.
38
+
39
+ Dashboards:
40
+ - New `export_dashboard` tool writes two self-contained browser pages to `.auto/`:
41
+ `progress.html` (inline-SVG metric-over-time chart, experiment timeline, memory
42
+ lifecycle; auto-refreshing) and `dashboard.html` (the interactive idea-lineage
43
+ graph via `vkf html`).
44
+ - `progress_html.ts` is a pure, unit-tested renderer (no JS/asset deps); `vkf.ts`
45
+ gains an `html()` bridge wrapper.
46
+
47
+ Knowledge ingestion:
48
+ - `knowledge-gather` uses the agent's built-in `WebSearch` / `WebFetch` against
49
+ free, openly accessible databases — arXiv, Semantic Scholar, OpenAlex, Crossref
50
+ — with no API keys, paid services, or MCP setup.
51
+ - README documents ingestion and the free sources used.
52
+
53
+ ## 0.3.0 — unreleased
54
+
55
+ Phase 3: hypothesis synthesis — generate novel ideas, don't just retrieve them.
56
+
57
+ - **`synthesis.ts`** (pure, unit-tested): mechanism/context/topic similarity;
58
+ contradiction mining (explicit conflicts, outcome flips, same-goal/different-
59
+ mechanism); cross-domain transfer scored by `mechanism_sim × (1 − context_sim)`.
60
+ - **`find_contradictions` tool** — surfaces tensions in memory as generative
61
+ hypothesis questions.
62
+ - **`find_transfers` tool** — mechanism (not keyword) search for cross-domain
63
+ analogies to import into the current problem.
64
+ - **Idea provenance** — `remember_claim`/`buildClaimCard` accept `origin`
65
+ (literature / contradiction / transfer / synthesis) and `derived_from`, so
66
+ agent-synthesized hypotheses are traceable to their seeds.
67
+ - **New skills**: `contradiction-miner`, `cross-domain-transfer`,
68
+ `idea-tournament`; orchestrator updated with a synthesis step.
69
+
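The cross-domain transfer score above can be illustrated with a minimal sketch. It assumes the two similarities are already computed and lie in [0, 1]; the `TransferCandidate` shape and the `rankTransfers` helper are hypothetical names for illustration, not the API of `synthesis.ts`.

```typescript
// Hypothetical shape: similarities are assumed to be precomputed in [0, 1].
interface TransferCandidate {
  id: string;
  mechanismSim: number; // how alike the "how" is
  contextSim: number;   // how alike the "where" is
}

// transfer score = mechanism_sim × (1 − context_sim), as stated in the changelog.
function transferScore(c: TransferCandidate): number {
  return c.mechanismSim * (1 - c.contextSim);
}

// Highest-scoring transfers first; the input array is left untouched.
function rankTransfers(candidates: TransferCandidate[]): TransferCandidate[] {
  return [...candidates].sort((a, b) => transferScore(b) - transferScore(a));
}
```

Under this formula, a card with a near-identical mechanism from the same context scores low, while the same mechanism from a distant context scores high, which matches the "same how, different where" intent.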
70
+ ## 0.2.0 — unreleased
71
+
72
+ Phase 2: novelty & priority scoring.
73
+
74
+ - **`score_ideas` tool** ranks untested claims by
75
+ `priority = expected_value × feasibility × evidence_strength × novelty ×
76
+ info_gain ÷ implementation_cost`, returning the full factor breakdown.
77
+ - **`scoring.ts`** (pure, unit-tested): token Jaccard novelty that penalizes
78
+ similarity to already-tried experiments, settled claims, and a configurable
79
+ standard playbook; evidence strength derived from verification level +
80
+ reliability; info-gain from belief uncertainty.
81
+ - **Scoring inputs on claims**: `remember_claim` accepts optional
82
+ `expected_value`, `feasibility`, `info_gain`, `implementation_cost`; all factors
83
+ fall back to sensible derivations when omitted.
84
+ - `hypothesis-loop` now scores instead of guessing the next experiment.
85
+
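As a rough sketch of the scoring described above: the priority product follows the formula in this entry, while the novelty rule (one minus the highest token-Jaccard similarity to anything already known) and the tokenizer are assumptions made for illustration. The names `Factors`, `tokenize`, and `novelty` are hypothetical and may not match `scoring.ts`.

```typescript
// Hypothetical factor bundle; field names mirror the changelog formula.
interface Factors {
  expectedValue: number;
  feasibility: number;
  evidenceStrength: number;
  novelty: number;
  infoGain: number;
  implementationCost: number; // must be > 0
}

// priority = expected_value × feasibility × evidence_strength × novelty × info_gain ÷ implementation_cost
function priority(f: Factors): number {
  if (f.implementationCost <= 0) {
    throw new RangeError("implementationCost must be positive");
  }
  const gain = f.expectedValue * f.feasibility * f.evidenceStrength * f.novelty * f.infoGain;
  return gain / f.implementationCost;
}

// Assumed tokenizer: lowercased alphanumeric words.
function tokenize(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Jaccard similarity of two token sets, in [0, 1].
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const t of a) if (b.has(t)) shared++;
  return shared / (a.size + b.size - shared);
}

// Assumed rule: novelty = 1 − the highest similarity to any known description
// (tried experiments, settled claims, playbook entries).
function novelty(idea: string, known: string[]): number {
  const ideaTokens = tokenize(idea);
  let maxSim = 0;
  for (const k of known) {
    maxSim = Math.max(maxSim, jaccard(ideaTokens, tokenize(k)));
  }
  return 1 - maxSim;
}
```

Because the factors multiply, a zero in any one of them (for example, an idea identical to a prior experiment) drives the whole priority to zero in this sketch.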
86
+ ## 0.1.0 — unreleased
87
+
88
+ Initial MVP: autoresearch with verifiable long-term memory.
89
+
90
+ - **Two-layer persistence**: ephemeral `.auto/` session + durable
91
+ `.research-memory/` VKF bundle that persists across runs.
92
+ - **Seven tools**: `init_research`, `remember_claim`, `verify_claim`,
93
+ `recall_memory`, `run_experiment`, `log_experiment`, `research_status`.
94
+ - **Six skills**: `autoresearch-create` (spine), `knowledge-gather`,
95
+ `claim-extract`, `claim-verify`, `hypothesis-loop`, `research-report`.
96
+ - **VKF bridge**: shells out to the `vkf` CLI (auto-detected in a `VKF` conda env
97
+ or via `$PI_AUTORESEARCH_VKF`) for validation, graph, freshness, and permission
98
+ checks; reads/writes bundle markdown directly. Degrades gracefully when `vkf`
99
+ is absent.
100
+ - **Trust lifecycle**: memory states (candidate → source_verified →
101
+ locally_tested/replicated → contradicted → deprecated → retired) mapped onto
102
+ VKF `status` + a staging/verified/deprecated directory layout, with a
103
+ transaction record for every change (propose-don't-promote).
104
+ - **Belief updates**: numeric belief per claim, mirrored to VKF's categorical
105
+ `confidence`, updated on each experiment outcome.
106
+ - Generated bundles validate at VKF Profile 1 (governed).
package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Eric Jahns
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,301 @@
1
+ # pi-autoresearch-vkf
2
+
3
+ > **Autoresearch that remembers — and can prove what it learned.**
4
+
5
+ A [pi](https://pi.dev) extension that turns a blind optimization loop into a
6
+ self-improving researcher with **verifiable long-term memory**. It gathers
7
+ frontier literature, distills it into structured claims, *verifies* them, runs
8
+ experiments, and writes the results back to a git-native knowledge bundle — so the
9
+ next run builds on what was learned instead of rediscovering the obvious.
10
+
11
+ The memory layer is [VKF](https://github.com/EricJahns/Verifiable-Knowledge-Format)
12
+ (Verifiable Knowledge Format): markdown + YAML knowledge objects with provenance,
13
+ evidence, confidence, and a trust lifecycle, gated by the real `vkf` CLI.
14
+
15
+ ## Why
16
+
17
+ A plain autoresearch loop tries an idea, measures it, keeps wins, reverts
18
+ regressions — and forgets everything. It can't say *where* a good idea came from,
19
+ *what* it already tried, or *whether* a win was real. This extension adds the
20
+ missing layer:
21
+
22
+ ```
23
+ RAG agent: retrieve papers → try idea → forget context
24
+ pi-autoresearch-vkf:
25
+ retrieve → extract claims → verify → store
26
+ → hypothesize → test → update belief
27
+ → avoid repeated failures → improve future search
28
+ ```
29
+
30
+ The novelty isn't "autoresearch + RAG." It's that the agent's scientific memory is
31
+ **verifiable, lifecycle-managed, and auditable**.
32
+
33
+ ## Install
34
+
35
+ ```sh
36
+ pi install npm:pi-autoresearch-vkf
37
+ # or, from a local checkout:
38
+ pi install file:/path/to/pi-autoresearch-vkf
39
+ ```
40
+
41
+ ### Requirements
42
+
43
+ | Dependency | For | Required? |
44
+ |---|---|---|
45
+ | **`vkf` CLI** | Trust gating — validation, graph, freshness, permission checks | Recommended (memory still works without it; validation is skipped) |
46
+ | **Web tools** (`WebSearch` / `WebFetch`) | Ingesting new knowledge from the literature | Recommended — the ingestion path |
47
+
48
+ - **`vkf` CLI** — the extension finds it automatically inside a conda env named
49
+ `VKF`, or set `$PI_AUTORESEARCH_VKF` to the `vkf` executable.
50
+
51
+ ### Knowledge sources (how ingestion works)
52
+
53
+ The extension stores and reasons over knowledge; it does **not** fetch papers
54
+ itself. Gathering is done by the host agent through the `knowledge-gather` skill,
55
+ using the agent's built-in **`WebSearch` + `WebFetch`** against free, openly
56
+ accessible databases — no API keys, no paid services, no MCP setup:
57
+
58
+ - **arXiv** (`arxiv.org`, `export.arxiv.org/api`)
59
+ - **Semantic Scholar** (`api.semanticscholar.org` Graph API)
60
+ - **OpenAlex** (`api.openalex.org`)
61
+ - **Crossref** (`api.crossref.org`)
62
+ - GitHub / docs / benchmark reports / blogs for implementation hints
63
+
64
+ The agent reads sources and calls `remember_claim` to persist each finding as a
65
+ VKF card. If the host has no web tools, you can still ingest by pasting papers /
66
+ PDFs / findings for the agent to extract, or by seeding claims from the agent's
67
+ own knowledge (marked low-reliability until verified).
68
+
69
+ ## Usage
70
+
71
+ In a project you want to optimize:
72
+
73
+ ```
74
+ optimize the test suite runtime, using the research literature and remembering what works
75
+ ```
76
+
77
+ The **autoresearch-create** skill drives it: confirm goal/metric/command → init →
78
+ gather literature → extract & verify claims → loop (recall → experiment →
79
+ write-back) → report. All state lives in one self-contained `.autoresearch-vkf/`
80
+ folder at the project root, so work **survives restarts and context resets**.
81
+
82
+ ## How it works
83
+
84
+ ```
85
+ goal ─► recall_memory ─► gather literature ─► remember_claim (candidates)
86
+ │ │
87
+ │ verify_claim ──► trusted claims
88
+ ▼ │
89
+ hypothesis-loop: recall ─► pick idea ─► run_experiment ─► log_experiment
90
+ │ │
91
+ │ writes experiment card back to memory,
92
+ │ updates the claim's belief & lifecycle
93
+
94
+ research-report (paper → claim → hypothesis → patch → metric Δ → memory update)
95
+ ```
96
+
97
+ ### One self-contained workspace
98
+
99
+ Everything the package owns lives under a single namespaced `.autoresearch-vkf/`
100
+ directory, so it never collides with other tools and is obvious at a glance:
101
+
102
+ | Layer | Folder | Lifetime |
103
+ |-------|--------|----------|
104
+ | **Session** | `.autoresearch-vkf/session/` | this run — goal, experiment log, measure script, dashboards (safe to gitignore) |
105
+ | **Project memory** | `.autoresearch-vkf/memory/` | **persists across runs** — the VKF bundle (meant to be committed) |
106
+ | **Global memory** | `~/.autoresearch-vkf/memory/` | **persists across projects** — trusted knowledge promoted from any repo |
107
+
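The layout in the table above can be sketched as a small path helper. This is an illustration only: `workspacePaths` and `migrationPlan` are hypothetical names, and the package's own `paths.ts` may be organized differently. The legacy directory names (`.auto`, `.research-memory`) come from the changelog's migration note.

```typescript
import { join } from "node:path";

// Hypothetical helper shape; only the directory names come from this README.
interface WorkspacePaths {
  pkgDir: string;  // <root>/.autoresearch-vkf
  session: string; // ephemeral per-run state
  memory: string;  // durable VKF bundle
}

function workspacePaths(root: string): WorkspacePaths {
  const pkgDir = join(root, ".autoresearch-vkf");
  return {
    pkgDir,
    session: join(pkgDir, "session"),
    memory: join(pkgDir, "memory"),
  };
}

// Pairs of [legacy directory, new directory] for a project created before 0.5.0.
function migrationPlan(root: string): Array<[string, string]> {
  const p = workspacePaths(root);
  return [
    [join(root, ".auto"), p.session],
    [join(root, ".research-memory"), p.memory],
  ];
}
```

Passing the home directory as `root` would give the global memory location under the same scheme. Actually moving the directories is left to the caller, so the plan can be inspected before anything on disk changes.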
108
+ ### The memory lifecycle
109
+
110
+ Every card carries a trust state. Agents *propose*; promotion is explicit and
111
+ audited (a VKF transaction is written for each change). The memory states map
112
+ directly onto a VKF `status` plus a lifecycle directory:
113
+
114
+ | Memory state | VKF status | Directory |
115
+ |---|---|---|
116
+ | `candidate` | `draft` | `staging/` |
117
+ | `source_verified` | `active` | `verified/` |
118
+ | `locally_tested` / `replicated` | `verified` | `verified/` |
119
+ | `contradicted` | `disputed` | `deprecated/` |
120
+ | `deprecated` | `deprecated` | `deprecated/` |
121
+ | `retired` | `retracted` | `deprecated/` |
122
+
123
+ Only `source_verified`+ drives serious hypotheses; only `locally_tested`+ strongly
124
+ steers experiments. This — plus the staging area and the citation-checking
125
+ verifier — is the defense against **memory poisoning**.
126
+
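For illustration, the table and the two trust thresholds above could be encoded as follows. The mapping mirrors the table row for row; the numeric `TRUST_RANK` ordering, and the choice to rank contradicted, deprecated, and retired cards below every threshold, are assumptions of this sketch rather than the package's documented behavior.

```typescript
type MemoryState =
  | "candidate"
  | "source_verified"
  | "locally_tested"
  | "replicated"
  | "contradicted"
  | "deprecated"
  | "retired";

interface Placement {
  status: string; // VKF `status` value
  directory: "staging" | "verified" | "deprecated";
}

// Mirrors the lifecycle table row for row.
const LIFECYCLE: Record<MemoryState, Placement> = {
  candidate: { status: "draft", directory: "staging" },
  source_verified: { status: "active", directory: "verified" },
  locally_tested: { status: "verified", directory: "verified" },
  replicated: { status: "verified", directory: "verified" },
  contradicted: { status: "disputed", directory: "deprecated" },
  deprecated: { status: "deprecated", directory: "deprecated" },
  retired: { status: "retracted", directory: "deprecated" },
};

// Assumed ordering along the promotion path; demoted states rank below all thresholds.
const TRUST_RANK: Record<MemoryState, number> = {
  candidate: 0,
  source_verified: 1,
  locally_tested: 2,
  replicated: 3,
  contradicted: -1,
  deprecated: -1,
  retired: -1,
};

// "Only source_verified+ drives serious hypotheses."
function canDriveHypotheses(state: MemoryState): boolean {
  return TRUST_RANK[state] >= TRUST_RANK.source_verified;
}

// "Only locally_tested+ strongly steers experiments."
function stronglySteersExperiments(state: MemoryState): boolean {
  return TRUST_RANK[state] >= TRUST_RANK.locally_tested;
}
```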
127
+ ### Tools
128
+
129
+ | Tool | What it does |
130
+ |------|--------------|
131
+ | `init_research` | Scaffold the `.autoresearch-vkf/` workspace (session + memory VKF bundle). |
132
+ | `remember_claim` | Stage a literature-derived candidate claim (+ its source paper). |
133
+ | `verify_claim` | Advance/downgrade a card's trust lifecycle (audited). |
134
+ | `recall_memory` | Query memory (project / global / both): trusted claims, candidates, prior experiments, negatives, conflicts. |
135
+ | `score_ideas` | Rank untested ideas by `EV × feasibility × evidence × novelty × info_gain ÷ cost`. |
136
+ | `find_contradictions` | Mine memory for tensions between claims — each a seed for a novel hypothesis. |
137
+ | `find_transfers` | Cross-domain mechanism search: same *how*, different *where*. |
138
+ | `run_experiment` | Run the measurement command; capture `METRIC name=value`. |
139
+ | `log_experiment` | Record a result, write it back to memory, update belief & lifecycle. |
140
+ | `promote_to_global` | Copy a trusted card into the cross-project global memory. |
141
+ | `export_dashboard` | Write browser dashboards: a live progress page + the `vkf html` idea-lineage graph. |
142
+ | `research_status` | Show session experiments + memory lifecycle. |
143
+
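The table says `run_experiment` captures `METRIC name=value` from the measurement command. A minimal sketch of such a capture step is shown below; the exact grammar (one metric per line, numeric values, later lines overriding earlier ones) is an assumption, and `parseMetrics` is a hypothetical name.

```typescript
// Collect every `METRIC name=value` line from a command's output.
// Assumed grammar: one metric per line, value parseable as a finite number.
function parseMetrics(output: string): Map<string, number> {
  const metrics = new Map<string, number>();
  for (const line of output.split(/\r?\n/)) {
    const m = /^METRIC\s+([\w.-]+)=(\S+)$/.exec(line.trim());
    if (!m) continue; // not a metric line
    const value = Number(m[2]);
    if (Number.isFinite(value)) {
      metrics.set(m[1], value); // a later line with the same name wins
    }
  }
  return metrics;
}
```

A measurement script then only needs to print a line such as `METRIC runtime_s=12.5` alongside its normal output; any other lines are ignored by this sketch.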
144
+ ### Skills
145
+
146
+ | Skill | Role |
147
+ |-------|------|
148
+ | `autoresearch-create` | Orchestrator / spine — the entry point. |
149
+ | `knowledge-gather` | Find candidate techniques via WebSearch/WebFetch (arXiv / Semantic Scholar / OpenAlex / Crossref / GitHub). |
150
+ | `claim-extract` | Distill sources into reusable claim cards. |
151
+ | `claim-verify` | Check citations & codebase fit — the trust layer. |
152
+ | `contradiction-miner` | Turn tensions in memory into novel hypotheses. |
153
+ | `cross-domain-transfer` | Import a mechanism from another field. |
154
+ | `idea-tournament` | Multi-perspective debate to pick the 2–3 ideas worth testing. |
155
+ | `hypothesis-loop` | Pick the next idea and run the smallest falsifying experiment. |
156
+ | `research-report` | The auditable lineage report. |
157
+
158
+ ### The `.autoresearch-vkf/` workspace
159
+
160
+ ```
161
+ .autoresearch-vkf/
162
+ session/ # ephemeral per-run state (config, experiment log, dashboards)
163
+ memory/ # the durable VKF knowledge bundle:
164
+ vkf.bundle.yaml # profile 1 (governed); 2 (verified) once evidence lands
165
+ staging/ # candidates (status: draft)
166
+ verified/ # source-/locally-verified, replicated
167
+ deprecated/ # contradicted / retired
168
+ transactions/ # one record per promote/demote/write-back
169
+ ```
170
+
171
+ The `memory/` bundle is just markdown — human-readable, version-controllable, and
172
+ auditable. Run `vkf validate .autoresearch-vkf/memory`, `vkf graph`,
173
+ `vkf freshness`, or `vkf html` over it any time.
174
+
175
+ ## Benchmark
176
+
177
+ Does verifiable memory + novelty scoring + synthesis actually search better than a
178
+ blind loop? `npm run bench` runs both policies over deterministic, ground-truth
179
+ idea-environments — driving *ours* through the real `scoring.ts` and `synthesis.ts`
180
+ — and reports the difference. See [benchmark/README.md](benchmark/README.md) for
181
+ exactly what is and isn't simulated.
182
+
183
+ <!-- BENCH:START -->
184
+
185
+ Mean over 500 seeds per scenario. "Standard" = blind loop (EV-greedy,
186
+ no durable memory, no synthesis). "Ours" = VKF memory + novelty scoring +
187
+ contradiction synthesis, driven through the real scoring/synthesis modules.
188
+
189
+ ### Tiny-LM validation loss (budget 10)
190
+
191
+ | Metric | Standard | Ours |
192
+ |---|---:|---:|
193
+ | Best improvement (higher better) | 0.035 | **0.130** |
194
+ | Unique mechanisms tried | 7.8 | **10.0** |
195
+ | Wasted (repeat) experiments | 2.2 | **0.0** |
196
+ | Dead-ends retried | 1.4 | **1.0** |
197
+ | Synthesized ideas discovered | 0.0 | **1.0** |
198
+ | Found optimum (rate) | 0% | **100%** |
199
+
200
+ ### Inference latency (budget 8)
201
+
202
+ | Metric | Standard | Ours |
203
+ |---|---:|---:|
204
+ | Best improvement (higher better) | 0.043 | **0.150** |
205
+ | Unique mechanisms tried | 6.3 | **8.0** |
206
+ | Wasted (repeat) experiments | 1.7 | **0.0** |
207
+ | Dead-ends retried | 1.7 | **1.0** |
208
+ | Synthesized ideas discovered | 0.0 | **1.0** |
209
+ | Found optimum (rate) | 0% | **100%** |
210
+
211
+ <!-- BENCH:END -->
212
+
213
+ The global optimum in each scenario is a *synthesized* idea that a blind loop cannot
214
+ construct, so the standard policy reaches it 0% of the time. Ours first tries both parent
215
+ ideas (steered by memory and novelty scoring); synthesis then unlocks their combination.
216
+
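The results block above sits between `<!-- BENCH:START -->` and `<!-- BENCH:END -->` markers, which `--update-readme` rewrites. One way such a marker-delimited update could work is sketched here; `updateBenchSection` is a hypothetical name and the blank-line padding is an assumption, not the harness's actual code.

```typescript
const BENCH_START = "<!-- BENCH:START -->";
const BENCH_END = "<!-- BENCH:END -->";

// Replace whatever sits between the two markers with `results`,
// keeping the markers and everything outside them untouched.
function updateBenchSection(readme: string, results: string): string {
  const start = readme.indexOf(BENCH_START);
  const end = readme.indexOf(BENCH_END);
  if (start === -1 || end === -1 || end < start) {
    throw new Error("BENCH markers missing or out of order");
  }
  const head = readme.slice(0, start + BENCH_START.length);
  const tail = readme.slice(end);
  return `${head}\n\n${results.trim()}\n\n${tail}`;
}
```

Written this way the update is idempotent: applying it twice with the same results yields the same file, so re-running the benchmark does not accumulate stale tables.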
217
+ ## Watching progress
218
+
219
+ Three live views, in increasing detail:
220
+
221
+ - **Widget** (always on, above the editor) — win/loss counts, best metric, memory
222
+ state tally; refreshes after every tool call.
223
+ - **Fullscreen overlay** — press **Ctrl+G** (or call `research_status`) for the
224
+ full experiment list, memory lifecycle, and verified claims.
225
+ - **Browser dashboards** — `export_dashboard` writes two self-contained pages to
226
+ `.autoresearch-vkf/session/`:
227
+ - `progress.html` — metric-over-time chart, experiment timeline, and memory
228
+ lifecycle; auto-refreshes so an open tab tracks the run live.
229
+ - `dashboard.html` — the interactive **idea-lineage graph** (paper → claim →
230
+ experiment, with conflict/derived-from edges), generated by `vkf html`.
231
+
232
+ ```sh
233
+ open .autoresearch-vkf/session/progress.html # watch progress as it goes
234
+ open .autoresearch-vkf/session/dashboard.html # explore the knowledge lineage
235
+ ```
236
+
237
+ ## Configuration
238
+
239
+ - `PI_AUTORESEARCH_VKF` — path to the `vkf` executable (overrides auto-detection).
240
+ - `PI_AUTORESEARCH_VKF_CONDA_ENV` — conda env to find `vkf` in (default `VKF`).
241
+ - `PI_AUTORESEARCH_GLOBAL_ROOT` — root for the global cross-project memory
242
+ (default `~`, i.e. the bundle lives at `~/.autoresearch-vkf/memory/`).
243
+ - `PI_AUTORESEARCH_SHORTCUT` — key for the fullscreen dashboard (default `ctrl+g`;
244
+ set to `none` to disable).
245
+
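As a sketch of how these variables and their defaults fit together: the `Settings` shape and `resolveSettings` are hypothetical, empty values are treated as unset by assumption, and paths are joined with `/` for brevity (POSIX-style).

```typescript
// Hypothetical settings shape; only the variable names and defaults come from the list above.
interface Settings {
  vkfPath: string | undefined; // PI_AUTORESEARCH_VKF; undefined means auto-detect
  condaEnv: string;            // PI_AUTORESEARCH_VKF_CONDA_ENV, default "VKF"
  globalMemory: string;        // <global root>/.autoresearch-vkf/memory
  shortcut: string | null;     // PI_AUTORESEARCH_SHORTCUT; null when set to "none"
}

function resolveSettings(
  env: Record<string, string | undefined>,
  home: string,
): Settings {
  // Assumption: an empty string counts as "not set".
  const globalRoot = env.PI_AUTORESEARCH_GLOBAL_ROOT || home;
  const shortcut = env.PI_AUTORESEARCH_SHORTCUT || "ctrl+g";
  return {
    vkfPath: env.PI_AUTORESEARCH_VKF || undefined,
    condaEnv: env.PI_AUTORESEARCH_VKF_CONDA_ENV || "VKF",
    // Strip trailing slashes so the joined path has a single separator.
    globalMemory: `${globalRoot.replace(/\/+$/, "")}/.autoresearch-vkf/memory`,
    shortcut: shortcut.toLowerCase() === "none" ? null : shortcut,
  };
}
```

In a Node process this would typically be called as `resolveSettings(process.env, os.homedir())`; taking both as parameters keeps the function pure and easy to unit-test.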
246
+ ## Development
247
+
248
+ ```sh
249
+ npm install
250
+ npm run typecheck # tsc --noEmit
251
+ npm test # node --experimental-strip-types --test tests/*.test.mjs
252
+ npm run bench # standard autoresearch vs ours
253
+ ```
254
+
255
+ `npm test` requires a Node build with TypeScript stripping support, first available in Node 22.6 (the same
256
+ requirement pi has for loading `.ts` extensions). On a Node built without it, run
257
+ the tests through a loader instead, e.g. `node --import tsx --test tests/*.test.mjs`.
258
+
259
+ ## Publishing
260
+
261
+ The package ships its `.ts` extensions and `.md` skills as-is (pi loads them
262
+ directly — no build step). The `files` whitelist publishes only `extensions/`,
263
+ `skills/`, and the docs; `prepublishOnly` runs `typecheck` as a gate.
264
+
265
+ Two ways to release:
266
+
267
+ - **Tagged CI release (recommended).** Add an npm *Automation* token as the repo
268
+ secret `NPM_TOKEN`, then bump the version and push a matching tag — the
269
+ [`publish.yml`](.github/workflows/publish.yml) workflow publishes with provenance:
270
+ ```sh
271
+ npm version patch # or minor/major — updates package.json + makes a tag
272
+ git push --follow-tags
273
+ ```
274
+ - **Manual.** `npm login`, then:
275
+ ```sh
276
+ npm publish --access public # prepublishOnly runs typecheck first
277
+ ```
278
+
279
+ Verify what will ship first with `npm pack --dry-run`.
280
+
281
+ ## Roadmap
282
+
283
+ All four planned phases are in: the lean MVP (Phase 1), the **novelty scorer**
284
+ (Phase 2), the **hypothesis-synthesis layer** (Phase 3 — `find_contradictions`,
285
+ `find_transfers`, `idea-tournament`), and **global cross-project memory + the
286
+ benchmark** (Phase 4).
287
+
288
+ Possible next steps:
289
+
290
+ - **End-to-end live benchmark** — a real LLM agent on real repos with human
291
+ novelty ratings (the controlled harness here isolates the search policy).
292
+ - **Bundle profile 2** — attach reproduction `verification` blocks to experiment
293
+ cards so memory validates at the strict `verified` profile.
294
+
295
+ Knowledge ingestion via `WebSearch`/`WebFetch` against free databases (arXiv,
296
+ Semantic Scholar, OpenAlex, Crossref) is already built in; see
297
+ [Knowledge sources](#knowledge-sources-how-ingestion-works).
298
+
299
+ ## License
300
+
301
+ MIT