npm - pi-autoresearch-vkf - Versions diffs - 0.5.0 - Mend

pi-autoresearch-vkf 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (29)

package/CHANGELOG.md +106 -0
package/LICENSE +21 -0
package/README.md +301 -0
package/extensions/pi-autoresearch-vkf/cards.ts +546 -0
package/extensions/pi-autoresearch-vkf/config.ts +81 -0
package/extensions/pi-autoresearch-vkf/dashboard.ts +107 -0
package/extensions/pi-autoresearch-vkf/experiments.ts +112 -0
package/extensions/pi-autoresearch-vkf/frontmatter.ts +272 -0
package/extensions/pi-autoresearch-vkf/index.ts +1014 -0
package/extensions/pi-autoresearch-vkf/jsonl.ts +41 -0
package/extensions/pi-autoresearch-vkf/metrics.ts +25 -0
package/extensions/pi-autoresearch-vkf/paths.ts +113 -0
package/extensions/pi-autoresearch-vkf/progress_html.ts +246 -0
package/extensions/pi-autoresearch-vkf/render.ts +34 -0
package/extensions/pi-autoresearch-vkf/runtime.ts +46 -0
package/extensions/pi-autoresearch-vkf/scoring.ts +185 -0
package/extensions/pi-autoresearch-vkf/shortcuts.ts +22 -0
package/extensions/pi-autoresearch-vkf/synthesis.ts +207 -0
package/extensions/pi-autoresearch-vkf/vkf.ts +220 -0
package/package.json +69 -0
package/skills/autoresearch-create/SKILL.md +74 -0
package/skills/claim-extract/SKILL.md +44 -0
package/skills/claim-verify/SKILL.md +50 -0
package/skills/contradiction-miner/SKILL.md +55 -0
package/skills/cross-domain-transfer/SKILL.md +57 -0
package/skills/hypothesis-loop/SKILL.md +57 -0
package/skills/idea-tournament/SKILL.md +55 -0
package/skills/knowledge-gather/SKILL.md +67 -0
package/skills/research-report/SKILL.md +58 -0

package/CHANGELOG.md ADDED

@@ -0,0 +1,106 @@
+# Changelog
+## 0.5.0 — unreleased
+Self-contained workspace (breaking path change).
+- All package state now lives under a single namespaced directory,
+  `.autoresearch-vkf/`, with `session/` (ephemeral run state, was `.auto/`) and
+  `memory/` (the VKF bundle, was `.research-memory/`). This stops the session dir
+  from colliding with pi-autoresearch's `.auto/` and makes the package's
+  footprint obvious and self-contained.
+- Global memory moves to `~/.autoresearch-vkf/memory/`;
+  `PI_AUTORESEARCH_GLOBAL_ROOT` now names the *root* (default `~`).
+- Internal: hardcoded `${root}/.research-memory` paths replaced with
+  `memoryPaths(root)`; `paths.ts` exposes `pkgDir` and the session/memory
+  subdir layout. Existing bundles can be migrated by moving `.auto` →
+  `.autoresearch-vkf/session` and `.research-memory` → `.autoresearch-vkf/memory`.
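The migration described in the 0.5.0 entry can be sketched as a pure function that lists the two directory moves for a project root. This is only an illustration: `planMigration` and `Move` are hypothetical names, not part of the package, and the sketch assumes POSIX-style paths.

```typescript
// Hypothetical helper, not the package's API: it only *plans* the moves
// named in the changelog and leaves the actual renaming to the caller.
interface Move {
  from: string;
  to: string;
}

function planMigration(root: string): Move[] {
  const base = root.replace(/\/+$/, ""); // drop trailing slashes
  const pkgDir = `${base}/.autoresearch-vkf`; // new namespaced directory
  return [
    { from: `${base}/.auto`, to: `${pkgDir}/session` },
    { from: `${base}/.research-memory`, to: `${pkgDir}/memory` },
  ];
}

// Print the equivalent shell commands for review before running anything.
for (const move of planMigration("/work/my-project")) {
  console.log(`mv ${move.from} ${move.to}`);
}
```

The parent `.autoresearch-vkf/` directory has to exist before either move is carried out.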
+## 0.4.0 — unreleased
+Phase 4: global cross-project memory + a benchmark vs standard autoresearch.
+- **Global shared memory**: a cross-project VKF bundle (default
+  `~/.config/pi-autoresearch-vkf`, override `$PI_AUTORESEARCH_GLOBAL_ROOT`),
+  reusing the same card helpers via a `globalRoot()` resolver.
+- **`promote_to_global` tool**: copies a trusted card (source_verified+) into
+  global memory with a transaction; only durable, verified knowledge is shareable.
+- **`recall_memory` `scope`** param (`project` / `global` / `both`) surfaces
+  knowledge learned in other repos.
+- **Benchmark harness** (`benchmark/`): standard blind autoresearch vs ours over
+  deterministic, ground-truth idea-environments. Ours is driven through the *real*
+  `scoring.ts` (rankIdeas) and `synthesis.ts` (findContradictions), so it
+  benchmarks shipped code. Metrics: best improvement, unique mechanisms, wasted
+  experiments, dead-ends retried, synthesized ideas, found-optimum rate. `npm run
+  bench`; `--update-readme` writes results between `<!-- BENCH:START/END -->`.
+- Across scenarios, ours reaches the (synthesis-only) optimum 100% of the time
+  vs 0% for the standard loop, with 3.5 to 3.7× the best improvement, zero
+  repeats, and fewer dead-end retries.
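Writing results between the `<!-- BENCH:START -->` and `<!-- BENCH:END -->` markers amounts to a bounded string replacement. A minimal sketch follows, assuming both markers appear exactly once. `replaceBetweenMarkers` is a hypothetical name, and the harness's actual `--update-readme` code is not shown in this diff.

```typescript
const BENCH_START = "<!-- BENCH:START -->";
const BENCH_END = "<!-- BENCH:END -->";

// Replace whatever sits between the two markers, keeping the markers themselves.
function replaceBetweenMarkers(readme: string, results: string): string {
  const start = readme.indexOf(BENCH_START);
  const end = readme.indexOf(BENCH_END);
  if (start === -1 || end === -1 || end < start) {
    throw new Error("BENCH markers missing or out of order");
  }
  const head = readme.slice(0, start + BENCH_START.length);
  const tail = readme.slice(end); // begins with the END marker
  return head + "\n" + results.trim() + "\n" + tail;
}
```

Throwing when a marker is missing keeps a malformed README from being silently overwritten.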
+Dashboards:
+- New `export_dashboard` tool writes two self-contained browser pages to `.auto/`:
+  `progress.html` (inline-SVG metric-over-time chart, experiment timeline, memory
+  lifecycle; auto-refreshing) and `dashboard.html` (the interactive idea-lineage
+  graph via `vkf html`).
+- `progress_html.ts` is a pure, unit-tested renderer (no JS/asset deps); `vkf.ts`
+  gains an `html()` bridge wrapper.
+Knowledge ingestion:
+- `knowledge-gather` uses the agent's built-in `WebSearch` / `WebFetch` against
+  free, openly accessible databases — arXiv, Semantic Scholar, OpenAlex, Crossref
+  — with no API keys, paid services, or MCP setup.
+- README documents ingestion and the free sources used.
+## 0.3.0 — unreleased
+Phase 3: hypothesis synthesis — generate novel ideas, don't just retrieve them.
+- **`synthesis.ts`** (pure, unit-tested): mechanism/context/topic similarity;
+  contradiction mining (explicit conflicts, outcome flips, same-goal/different-
+  mechanism); cross-domain transfer scored by `mechanism_sim × (1 − context_sim)`.
+- **`find_contradictions` tool** — surfaces tensions in memory as generative
+  hypothesis questions.
+- **`find_transfers` tool** — mechanism (not keyword) search for cross-domain
+  analogies to import into the current problem.
+- **Idea provenance** — `remember_claim`/`buildClaimCard` accept `origin`
+  (literature / contradiction / transfer / synthesis) and `derived_from`, so
+  agent-synthesized hypotheses are traceable to their seeds.
+- **New skills**: `contradiction-miner`, `cross-domain-transfer`,
+  `idea-tournament`; orchestrator updated with a synthesis step.
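The transfer score `mechanism_sim × (1 − context_sim)` rewards candidates whose mechanism matches but whose context differs. The sketch below takes both similarities as precomputed inputs in [0, 1]. How `synthesis.ts` computes them is not shown in this changelog, and `transferScore`, `rankTransfers`, and `Candidate` are hypothetical names.

```typescript
interface Candidate {
  id: string;
  mechanismSim: number; // similarity of the underlying mechanism, assumed in [0, 1]
  contextSim: number; // similarity of the problem context, assumed in [0, 1]
}

// Score from the changelog: mechanism_sim × (1 − context_sim).
function transferScore(mechanismSim: number, contextSim: number): number {
  const clamp = (x: number): number => Math.min(1, Math.max(0, x));
  return clamp(mechanismSim) * (1 - clamp(contextSim));
}

// Highest-scoring (most transferable) candidates first; the input is not mutated.
function rankTransfers(candidates: Candidate[]): Candidate[] {
  return candidates
    .slice()
    .sort(
      (a, b) =>
        transferScore(b.mechanismSim, b.contextSim) -
        transferScore(a.mechanismSim, a.contextSim),
    );
}
```

A near-identical context drives the score toward zero even when the mechanism matches perfectly, which separates a cross-domain analogy from a plain duplicate.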
+## 0.2.0 — unreleased
+Phase 2: novelty & priority scoring.
+- **`score_ideas` tool** ranks untested claims by
+  `priority = expected_value × feasibility × evidence_strength × novelty ×
+  info_gain ÷ implementation_cost`, returning the full factor breakdown.
+- **`scoring.ts`** (pure, unit-tested): token Jaccard novelty that penalizes
+  similarity to already-tried experiments, settled claims, and a configurable
+  standard playbook; evidence strength derived from verification level +
+  reliability; info-gain from belief uncertainty.
+- **Scoring inputs on claims**: `remember_claim` accepts optional
+  `expected_value`, `feasibility`, `info_gain`, `implementation_cost`; all factors
+  fall back to sensible derivations when omitted.
+- `hypothesis-loop` now scores instead of guessing the next experiment.
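The priority formula and the token Jaccard novelty term can be illustrated as follows. This is a sketch under stated assumptions, not the contents of `scoring.ts`: the tokenizer, the choice of `1 − max similarity` as the novelty value, and every identifier below are hypothetical.

```typescript
// Lowercased alphanumeric tokens of a short text (tokenization is an assumption).
function tokens(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 0),
  );
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
function jaccard(a: string, b: string): number {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Assumed novelty: 1 minus the closest match among already-known texts
// (tried experiments, settled claims, playbook entries).
function novelty(idea: string, alreadyKnown: string[]): number {
  let maxSim = 0;
  for (const prior of alreadyKnown) maxSim = Math.max(maxSim, jaccard(idea, prior));
  return 1 - maxSim;
}

interface Factors {
  expectedValue: number;
  feasibility: number;
  evidenceStrength: number;
  novelty: number;
  infoGain: number;
  implementationCost: number;
}

// priority = EV × feasibility × evidence × novelty × info_gain ÷ cost
function priority(f: Factors): number {
  const cost = Math.max(f.implementationCost, 1e-9); // guard against division by zero
  return (f.expectedValue * f.feasibility * f.evidenceStrength * f.novelty * f.infoGain) / cost;
}
```

Because the factors multiply, a single near-zero factor (for example, an idea almost identical to a failed experiment) suppresses the whole priority regardless of the others.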
+## 0.1.0 — unreleased
+Initial MVP: autoresearch with verifiable long-term memory.
+- **Two-layer persistence**: ephemeral `.auto/` session + durable
+  `.research-memory/` VKF bundle that persists across runs.
+- **Seven tools**: `init_research`, `remember_claim`, `verify_claim`,
+  `recall_memory`, `run_experiment`, `log_experiment`, `research_status`.
+- **Six skills**: `autoresearch-create` (spine), `knowledge-gather`,
+  `claim-extract`, `claim-verify`, `hypothesis-loop`, `research-report`.
+- **VKF bridge**: shells out to the `vkf` CLI (auto-detected in a `VKF` conda env
+  or via `$PI_AUTORESEARCH_VKF`) for validation, graph, freshness, and permission
+  checks; reads/writes bundle markdown directly. Degrades gracefully when `vkf`
+  is absent.
+- **Trust lifecycle**: memory states (candidate → source_verified →
+  locally_tested/replicated → contradicted → deprecated → retired) mapped onto
+  VKF `status` + a staging/verified/deprecated directory layout, with a
+  transaction record for every change (propose-don't-promote).
+- **Belief updates**: numeric belief per claim, mirrored to VKF's categorical
+  `confidence`, updated on each experiment outcome.
+- Generated bundles validate at VKF Profile 1 (governed).

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Eric Jahns
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED

@@ -0,0 +1,301 @@
+# pi-autoresearch-vkf
+> **Autoresearch that remembers — and can prove what it learned.**
+A [pi](https://pi.dev) extension that turns a blind optimization loop into a
+self-improving researcher with **verifiable long-term memory**. It gathers
+frontier literature, distills it into structured claims, *verifies* them, runs
+experiments, and writes the results back to a git-native knowledge bundle — so the
+next run builds on what was learned instead of rediscovering the obvious.
+The memory layer is [VKF](https://github.com/EricJahns/Verifiable-Knowledge-Format)
+(Verifiable Knowledge Format): markdown + YAML knowledge objects with provenance,
+evidence, confidence, and a trust lifecycle, gated by the real `vkf` CLI.
+## Why
+A plain autoresearch loop tries an idea, measures it, keeps wins, reverts
+regressions — and forgets everything. It can't say *where* a good idea came from,
+*what* it already tried, or *whether* a win was real. This extension adds the
+missing layer:
+```
+RAG agent:        retrieve papers → try idea → forget context
+pi-autoresearch-vkf:
+                  retrieve → extract claims → verify → store
+                  → hypothesize → test → update belief
+                  → avoid repeated failures → improve future search
+```
+The novelty isn't "autoresearch + RAG." It's that the agent's scientific memory is
+**verifiable, lifecycle-managed, and auditable**.
+## Install
+```sh
+pi install npm:pi-autoresearch-vkf
+# or, from a local checkout:
+pi install file:/path/to/pi-autoresearch-vkf
+```
+### Requirements
+| Dependency | For | Required? |
+|---|---|---|
+| **`vkf` CLI** | Trust gating — validation, graph, freshness, permission checks | Recommended (memory still works without it; validation is skipped) |
+| **Web tools** (`WebSearch` / `WebFetch`) | Ingesting new knowledge from the literature | Recommended — the ingestion path |
+- **`vkf` CLI** — the extension finds it automatically inside a conda env named
+  `VKF`, or set `$PI_AUTORESEARCH_VKF` to the `vkf` executable.
+### Knowledge sources (how ingestion works)
+The extension stores and reasons over knowledge; it does **not** fetch papers
+itself. Gathering is done by the host agent through the `knowledge-gather` skill,
+using the agent's built-in **`WebSearch` + `WebFetch`** against free, openly
+accessible databases — no API keys, no paid services, no MCP setup:
+- **arXiv** (`arxiv.org`, `export.arxiv.org/api`)
+- **Semantic Scholar** (`api.semanticscholar.org` Graph API)
+- **OpenAlex** (`api.openalex.org`)
+- **Crossref** (`api.crossref.org`)
+- GitHub / docs / benchmark reports / blogs for implementation hints
+The agent reads sources and calls `remember_claim` to persist each finding as a
+VKF card. If the host has no web tools, you can still ingest by pasting papers /
+PDFs / findings for the agent to extract, or by seeding claims from the agent's
+own knowledge (marked low-reliability until verified).
+## Usage
+In a project you want to optimize:
+```
+optimize the test suite runtime, using the research literature and remembering what works
+```
+The **autoresearch-create** skill drives it: confirm goal/metric/command → init →
+gather literature → extract & verify claims → loop (recall → experiment →
+write-back) → report. All state lives in one self-contained `.autoresearch-vkf/`
+folder at the project root, so work **survives restarts and context resets**.
+## How it works
+```
+goal ─► recall_memory ─► gather literature ─► remember_claim (candidates)
+   │                                              │
+   │                                         verify_claim ──► trusted claims
+   ▼                                              │
+ hypothesis-loop:  recall ─► pick idea ─► run_experiment ─► log_experiment
+   │                                                            │
+   │                                  writes experiment card back to memory,
+   │                                  updates the claim's belief & lifecycle
+   ▼
+ research-report   (paper → claim → hypothesis → patch → metric Δ → memory update)
+```
+### One self-contained workspace
+Everything the package owns lives under a single namespaced `.autoresearch-vkf/`
+directory, so it never collides with other tools and is obvious at a glance:
+| Layer | Folder | Lifetime |
+|-------|--------|----------|
+| **Session** | `.autoresearch-vkf/session/` | this run — goal, experiment log, measure script, dashboards (safe to gitignore) |
+| **Project memory** | `.autoresearch-vkf/memory/` | **persists across runs** — the VKF bundle (meant to be committed) |
+| **Global memory** | `~/.autoresearch-vkf/memory/` | **persists across projects** — trusted knowledge promoted from any repo |
+### The memory lifecycle
+Every card carries a trust state. Agents *propose*; promotion is explicit and
+audited (a VKF transaction is written for each change). Each memory state maps
+directly onto a VKF `status` and a lifecycle directory:
+| Memory state | VKF status | Directory |
+|---|---|---|
+| `candidate` | `draft` | `staging/` |
+| `source_verified` | `active` | `verified/` |
+| `locally_tested` / `replicated` | `verified` | `verified/` |
+| `contradicted` | `disputed` | `deprecated/` |
+| `deprecated` | `deprecated` | `deprecated/` |
+| `retired` | `retracted` | `deprecated/` |
+Only `source_verified`+ drives serious hypotheses; only `locally_tested`+ strongly
+steers experiments. This — plus the staging area and the citation-checking
+verifier — is the defense against **memory poisoning**.
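The lifecycle table above can be written down as a lookup, together with a check for the `source_verified`+ and `locally_tested`+ thresholds. The mapping values come from the table. The type names, `meetsTrustFloor`, and the assumption that `replicated` ranks above `locally_tested` are illustrative only.

```typescript
type MemoryState =
  | "candidate"
  | "source_verified"
  | "locally_tested"
  | "replicated"
  | "contradicted"
  | "deprecated"
  | "retired";

interface Placement {
  status: string; // VKF `status` value
  dir: "staging" | "verified" | "deprecated"; // lifecycle directory
}

const LIFECYCLE: Record<MemoryState, Placement> = {
  candidate: { status: "draft", dir: "staging" },
  source_verified: { status: "active", dir: "verified" },
  locally_tested: { status: "verified", dir: "verified" },
  replicated: { status: "verified", dir: "verified" },
  contradicted: { status: "disputed", dir: "deprecated" },
  deprecated: { status: "deprecated", dir: "deprecated" },
  retired: { status: "retracted", dir: "deprecated" },
};

// Assumed ordering of the trusted states, weakest first.
const TRUSTED_ORDER: MemoryState[] = ["source_verified", "locally_tested", "replicated"];

// True when `state` is a trusted state at or above `floor`.
function meetsTrustFloor(state: MemoryState, floor: MemoryState): boolean {
  const s = TRUSTED_ORDER.indexOf(state);
  const f = TRUSTED_ORDER.indexOf(floor);
  return s !== -1 && f !== -1 && s >= f;
}
```

States outside the trusted list (`candidate`, `contradicted`, `deprecated`, `retired`) never meet any floor, which matches the idea that unverified or withdrawn cards should not steer experiments.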
+### Tools
+| Tool | What it does |
+|------|--------------|
+| `init_research` | Scaffold the `.autoresearch-vkf/` workspace (session + memory VKF bundle). |
+| `remember_claim` | Stage a literature-derived candidate claim (+ its source paper). |
+| `verify_claim` | Advance/downgrade a card's trust lifecycle (audited). |
+| `recall_memory` | Query memory (project / global / both): trusted claims, candidates, prior experiments, negatives, conflicts. |
+| `score_ideas` | Rank untested ideas by `EV × feasibility × evidence × novelty × info_gain ÷ cost`. |
+| `find_contradictions` | Mine memory for tensions between claims — each a seed for a novel hypothesis. |
+| `find_transfers` | Cross-domain mechanism search: same *how*, different *where*. |
+| `run_experiment` | Run the measurement command; capture `METRIC name=value`. |
+| `log_experiment` | Record a result, write it back to memory, update belief & lifecycle. |
+| `promote_to_global` | Copy a trusted card into the cross-project global memory. |
+| `export_dashboard` | Write browser dashboards: a live progress page + the `vkf html` idea-lineage graph. |
+| `research_status` | Show session experiments + memory lifecycle. |
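Capturing `METRIC name=value` from a measurement command's output could look like the sketch below. The README gives only that shape, so the exact grammar (one metric per line, numeric value, allowed characters in the name) is an assumption, and `parseMetrics` is a hypothetical name.

```typescript
// Collect `METRIC name=value` lines from captured stdout; other lines are ignored.
function parseMetrics(stdout: string): Record<string, number> {
  const metrics: Record<string, number> = {};
  for (const line of stdout.split(/\r?\n/)) {
    const match = /^METRIC\s+([A-Za-z_][\w.-]*)=(\S+)$/.exec(line.trim());
    if (!match) continue;
    const name = match[1] ?? "";
    const value = Number(match[2] ?? "");
    // Skip values that do not parse as a finite number.
    if (name !== "" && isFinite(value)) metrics[name] = value;
  }
  return metrics;
}

console.log(parseMetrics("running tests...\nMETRIC runtime_s=12.5\n"));
```

If the same metric name is printed more than once, this sketch keeps the last value.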
+### Skills
+| Skill | Role |
+|-------|------|
+| `autoresearch-create` | Orchestrator / spine — the entry point. |
+| `knowledge-gather` | Find candidate techniques via WebSearch/WebFetch (arXiv / Semantic Scholar / OpenAlex / Crossref / GitHub). |
+| `claim-extract` | Distill sources into reusable claim cards. |
+| `claim-verify` | Check citations & codebase fit — the trust layer. |
+| `contradiction-miner` | Turn tensions in memory into novel hypotheses. |
+| `cross-domain-transfer` | Import a mechanism from another field. |
+| `idea-tournament` | Multi-perspective debate to pick the 2–3 ideas worth testing. |
+| `hypothesis-loop` | Pick the next idea and run the smallest falsifying experiment. |
+| `research-report` | The auditable lineage report. |
+### The `.autoresearch-vkf/` workspace
+```
+.autoresearch-vkf/
+  session/             # ephemeral per-run state (config, experiment log, dashboards)
+  memory/              # the durable VKF knowledge bundle:
+    vkf.bundle.yaml    #   profile 1 (governed); 2 (verified) once evidence lands
+    staging/           #   candidates (status: draft)
+    verified/          #   source-/locally-verified, replicated
+    deprecated/        #   contradicted / retired
+    transactions/      #   one record per promote/demote/write-back
+```
+The `memory/` bundle is just markdown — human-readable, version-controllable, and
+auditable. Run `vkf validate .autoresearch-vkf/memory`, `vkf graph`,
+`vkf freshness`, or `vkf html` over it any time.
+## Benchmark
+Does verifiable memory + novelty scoring + synthesis actually search better than a
+blind loop? `npm run bench` runs both policies over deterministic, ground-truth
+idea-environments — driving *ours* through the real `scoring.ts` and `synthesis.ts`
+— and reports the difference. See [benchmark/README.md](benchmark/README.md) for
+exactly what is and isn't simulated.
+<!-- BENCH:START -->
+Mean over 500 seeds per scenario. "Standard" = blind loop (EV-greedy,
+no durable memory, no synthesis). "Ours" = VKF memory + novelty scoring +
+contradiction synthesis, driven through the real scoring/synthesis modules.
+## Tiny-LM validation loss (budget 10)
+| Metric | Standard | Ours |
+|---|---:|---:|
+| Best improvement (higher better) | 0.035 | **0.130** |
+| Unique mechanisms tried | 7.8 | **10.0** |
+| Wasted (repeat) experiments | 2.2 | **0.0** |
+| Dead-ends retried | 1.4 | **1.0** |
+| Synthesized ideas discovered | 0.0 | **1.0** |
+| Found optimum (rate) | 0% | **100%** |
+## Inference latency (budget 8)
+| Metric | Standard | Ours |
+|---|---:|---:|
+| Best improvement (higher better) | 0.043 | **0.150** |
+| Unique mechanisms tried | 6.3 | **8.0** |
+| Wasted (repeat) experiments | 1.7 | **0.0** |
+| Dead-ends retried | 1.7 | **1.0** |
+| Synthesized ideas discovered | 0.0 | **1.0** |
+| Found optimum (rate) | 0% | **100%** |
+<!-- BENCH:END -->
+The global optimum in each scenario is a *synthesized* idea that a blind loop
+cannot construct, so the standard policy reaches it 0% of the time. Ours first
+tries both parent ideas (memory + novelty scoring), then synthesis combines them.
+## Watching progress
+Three live views, in increasing detail:
+- **Widget** (always on, above the editor) — win/loss counts, best metric, memory
+  state tally; refreshes after every tool call.
+- **Fullscreen overlay** — press **Ctrl+G** (or call `research_status`) for the
+  full experiment list, memory lifecycle, and verified claims.
+- **Browser dashboards** — `export_dashboard` writes two self-contained pages to
+  `.autoresearch-vkf/session/`:
+  - `progress.html` — metric-over-time chart, experiment timeline, and memory
+    lifecycle; auto-refreshes so an open tab tracks the run live.
+  - `dashboard.html` — the interactive **idea-lineage graph** (paper → claim →
+    experiment, with conflict/derived-from edges), generated by `vkf html`.
+  ```sh
+  open .autoresearch-vkf/session/progress.html    # watch progress as it goes
+  open .autoresearch-vkf/session/dashboard.html   # explore the knowledge lineage
+  ```
+## Configuration
+- `PI_AUTORESEARCH_VKF` — path to the `vkf` executable (overrides auto-detection).
+- `PI_AUTORESEARCH_VKF_CONDA_ENV` — conda env to find `vkf` in (default `VKF`).
+- `PI_AUTORESEARCH_GLOBAL_ROOT` — root for the global cross-project memory
+  (default `~`, i.e. the bundle lives at `~/.autoresearch-vkf/memory/`).
+- `PI_AUTORESEARCH_SHORTCUT` — key for the fullscreen dashboard (default `ctrl+g`;
+  set to `none` to disable).
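The four variables and their documented defaults can be resolved in one place, as in this sketch. The environment and home directory are passed in so the function stays pure. `resolveSettings` and `Settings` are hypothetical names, and treating an empty string as unset is an assumption.

```typescript
interface Settings {
  vkfPath: string | null; // null means fall back to auto-detection
  condaEnv: string;
  globalMemoryDir: string;
  shortcut: string | null; // null means the shortcut is disabled
}

function resolveSettings(
  env: Record<string, string | undefined>,
  home: string,
): Settings {
  // PI_AUTORESEARCH_GLOBAL_ROOT names the root; the bundle lives beneath it.
  const root = (env["PI_AUTORESEARCH_GLOBAL_ROOT"] || home).replace(/\/+$/, "");
  const shortcut = env["PI_AUTORESEARCH_SHORTCUT"] || "ctrl+g";
  return {
    vkfPath: env["PI_AUTORESEARCH_VKF"] || null,
    condaEnv: env["PI_AUTORESEARCH_VKF_CONDA_ENV"] || "VKF",
    globalMemoryDir: `${root}/.autoresearch-vkf/memory`,
    shortcut: shortcut === "none" ? null : shortcut,
  };
}

console.log(resolveSettings({}, "/home/alice"));
```

In a Node process this would typically be called as `resolveSettings(process.env, os.homedir())`.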
+## Development
+```sh
+npm install
+npm run typecheck   # tsc --noEmit
+npm test            # node --experimental-strip-types --test tests/*.test.mjs
+npm run bench       # standard autoresearch vs ours
+```
+`npm test` requires a Node 22+ build with TypeScript stripping support (the same
+requirement pi has for loading `.ts` extensions). On a Node built without it, run
+the tests through a loader instead, e.g. `node --import tsx --test tests/*.test.mjs`.
+## Publishing
+The package ships its `.ts` extensions and `.md` skills as-is (pi loads them
+directly — no build step). The `files` whitelist publishes only `extensions/`,
+`skills/`, and the docs; `prepublishOnly` runs `typecheck` as a gate.
+Two ways to release:
+- **Tagged CI release (recommended).** Add an npm *Automation* token as the repo
+  secret `NPM_TOKEN`, then bump the version and push a matching tag — the
+  [`publish.yml`](.github/workflows/publish.yml) workflow publishes with provenance:
+  ```sh
+  npm version patch        # or minor/major — updates package.json + makes a tag
+  git push --follow-tags
+  ```
+- **Manual.** `npm login`, then:
+  ```sh
+  npm publish --access public      # prepublishOnly runs typecheck first
+  ```
+Verify what will ship first with `npm pack --dry-run`.
+## Roadmap
+All four planned phases are in: the lean MVP (Phase 1), the **novelty scorer**
+(Phase 2), the **hypothesis-synthesis layer** (Phase 3 — `find_contradictions`,
+`find_transfers`, `idea-tournament`), and **global cross-project memory + the
+benchmark** (Phase 4).
+Possible next steps:
+- **End-to-end live benchmark** — a real LLM agent on real repos with human
+  novelty ratings (the controlled harness here isolates the search policy).
+- **Bundle profile 2** — attach reproduction `verification` blocks to experiment
+  cards so memory validates at the strict `verified` profile.
+Knowledge ingestion via `WebSearch`/`WebFetch` against free databases (arXiv,
+Semantic Scholar, OpenAlex, Crossref) is already built in; see
+[Knowledge sources](#knowledge-sources-how-ingestion-works).
+## License
+MIT