audrey 0.23.1 → 1.0.1
- package/CHANGELOG.md +101 -15
- package/LICENSE +21 -21
- package/README.md +232 -6
- package/SECURITY.md +2 -1
- package/benchmarks/adapter-kit.mjs +20 -0
- package/benchmarks/adapter-self-test.mjs +166 -0
- package/benchmarks/adapters/example-allow.mjs +28 -0
- package/benchmarks/adapters/mem0-platform.mjs +267 -0
- package/benchmarks/adapters/registry.json +51 -0
- package/benchmarks/adapters/zep-cloud.mjs +280 -0
- package/benchmarks/baselines.js +169 -0
- package/benchmarks/build-leaderboard.mjs +170 -0
- package/benchmarks/cases.js +537 -0
- package/benchmarks/create-conformance-card.mjs +139 -0
- package/benchmarks/create-submission-bundle.mjs +176 -0
- package/benchmarks/dry-run-external-adapters.mjs +165 -0
- package/benchmarks/guardbench.js +1125 -0
- package/benchmarks/output/adapter-self-test/guardbench-adapter-self-test.json +50 -0
- package/benchmarks/output/external/guardbench-external-dry-run.json +69 -0
- package/benchmarks/output/external/guardbench-external-evidence.json +56 -0
- package/benchmarks/output/guardbench-conformance-card.json +63 -0
- package/benchmarks/output/guardbench-manifest.json +414 -0
- package/benchmarks/output/guardbench-raw.json +1271 -0
- package/benchmarks/output/guardbench-summary.json +2107 -0
- package/benchmarks/output/leaderboard/guardbench-leaderboard.json +93 -0
- package/benchmarks/output/leaderboard/guardbench-leaderboard.md +7 -0
- package/benchmarks/output/submission-bundle/guardbench-conformance-card.json +63 -0
- package/benchmarks/output/submission-bundle/guardbench-manifest.json +414 -0
- package/benchmarks/output/submission-bundle/guardbench-raw.json +1271 -0
- package/benchmarks/output/submission-bundle/guardbench-summary.json +2107 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-adapter-registry.schema.json +69 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-adapter-self-test.schema.json +156 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-conformance-card.schema.json +184 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-external-dry-run.schema.json +74 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-external-evidence.schema.json +108 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-external-run.schema.json +160 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-leaderboard.schema.json +179 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-manifest.schema.json +213 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-publication-verification.schema.json +47 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-raw.schema.json +184 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-submission-manifest.schema.json +151 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-summary.schema.json +249 -0
- package/benchmarks/output/submission-bundle/submission-manifest.json +131 -0
- package/benchmarks/output/submission-bundle/validation-report.json +31 -0
- package/benchmarks/output/summary.json +2354 -0
- package/benchmarks/perf-snapshot.js +304 -0
- package/benchmarks/perf.bench.js +161 -0
- package/benchmarks/public-paths.mjs +78 -0
- package/benchmarks/reference-results.js +70 -0
- package/benchmarks/report.js +259 -0
- package/benchmarks/run-external-guardbench.mjs +281 -0
- package/benchmarks/run.js +682 -0
- package/benchmarks/schemas/guardbench-adapter-registry.schema.json +69 -0
- package/benchmarks/schemas/guardbench-adapter-self-test.schema.json +156 -0
- package/benchmarks/schemas/guardbench-conformance-card.schema.json +184 -0
- package/benchmarks/schemas/guardbench-external-dry-run.schema.json +74 -0
- package/benchmarks/schemas/guardbench-external-evidence.schema.json +108 -0
- package/benchmarks/schemas/guardbench-external-run.schema.json +160 -0
- package/benchmarks/schemas/guardbench-leaderboard.schema.json +179 -0
- package/benchmarks/schemas/guardbench-manifest.schema.json +213 -0
- package/benchmarks/schemas/guardbench-publication-verification.schema.json +47 -0
- package/benchmarks/schemas/guardbench-raw.schema.json +184 -0
- package/benchmarks/schemas/guardbench-submission-manifest.schema.json +151 -0
- package/benchmarks/schemas/guardbench-summary.schema.json +249 -0
- package/benchmarks/snapshots/perf-0.22.2.json +123 -0
- package/benchmarks/snapshots/perf-0.23.0.json +123 -0
- package/benchmarks/validate-adapter-module.mjs +104 -0
- package/benchmarks/validate-adapter-registry.mjs +134 -0
- package/benchmarks/validate-adapter-self-test.mjs +96 -0
- package/benchmarks/validate-guardbench-artifacts.mjs +343 -0
- package/benchmarks/verify-external-evidence.mjs +296 -0
- package/benchmarks/verify-publication-artifacts.mjs +286 -0
- package/benchmarks/verify-submission-bundle.mjs +167 -0
- package/dist/mcp-server/config.d.ts +1 -1
- package/dist/mcp-server/config.d.ts.map +1 -1
- package/dist/mcp-server/config.js +1 -1
- package/dist/mcp-server/config.js.map +1 -1
- package/dist/mcp-server/index.d.ts +65 -3
- package/dist/mcp-server/index.d.ts.map +1 -1
- package/dist/mcp-server/index.js +675 -157
- package/dist/mcp-server/index.js.map +1 -1
- package/dist/src/action-key.d.ts +9 -0
- package/dist/src/action-key.d.ts.map +1 -0
- package/dist/src/action-key.js +49 -0
- package/dist/src/action-key.js.map +1 -0
- package/dist/src/adaptive.js +5 -5
- package/dist/src/affect.js +8 -8
- package/dist/src/audrey.d.ts +13 -0
- package/dist/src/audrey.d.ts.map +1 -1
- package/dist/src/audrey.js +68 -3
- package/dist/src/audrey.js.map +1 -1
- package/dist/src/capsule.js +4 -4
- package/dist/src/causal.js +3 -3
- package/dist/src/consolidate.js +48 -48
- package/dist/src/controller.d.ts +78 -6
- package/dist/src/controller.d.ts.map +1 -1
- package/dist/src/controller.js +273 -53
- package/dist/src/controller.js.map +1 -1
- package/dist/src/db.js +172 -172
- package/dist/src/decay.js +8 -8
- package/dist/src/embedding.d.ts +2 -1
- package/dist/src/embedding.d.ts.map +1 -1
- package/dist/src/embedding.js +39 -29
- package/dist/src/embedding.js.map +1 -1
- package/dist/src/encode.js +6 -6
- package/dist/src/feedback.d.ts +6 -0
- package/dist/src/feedback.d.ts.map +1 -1
- package/dist/src/feedback.js +6 -0
- package/dist/src/feedback.js.map +1 -1
- package/dist/src/forget.js +12 -12
- package/dist/src/hybrid-recall.js +9 -9
- package/dist/src/impact.js +6 -6
- package/dist/src/import.d.ts +3 -3
- package/dist/src/import.js +41 -41
- package/dist/src/index.d.ts +5 -4
- package/dist/src/index.d.ts.map +1 -1
- package/dist/src/index.js +3 -3
- package/dist/src/index.js.map +1 -1
- package/dist/src/interference.js +14 -14
- package/dist/src/introspect.js +18 -18
- package/dist/src/preflight.d.ts.map +1 -1
- package/dist/src/preflight.js +41 -0
- package/dist/src/preflight.js.map +1 -1
- package/dist/src/promote.js +7 -7
- package/dist/src/prompts.js +118 -118
- package/dist/src/recall.js +30 -30
- package/dist/src/reflexes.d.ts +1 -0
- package/dist/src/reflexes.d.ts.map +1 -1
- package/dist/src/reflexes.js +3 -0
- package/dist/src/reflexes.js.map +1 -1
- package/dist/src/rollback.js +4 -4
- package/dist/src/routes.d.ts.map +1 -1
- package/dist/src/routes.js +71 -2
- package/dist/src/routes.js.map +1 -1
- package/dist/src/validate.js +25 -25
- package/docs/AUDREY_PAPER_OUTLINE.md +175 -0
- package/docs/MEMORY_BENCHMARKING.md +59 -0
- package/docs/PRODUCTION_BACKLOG.md +304 -0
- package/docs/paper/00-master.md +48 -0
- package/docs/paper/01-introduction.md +27 -0
- package/docs/paper/02-related-work.md +47 -0
- package/docs/paper/03-problem-definition.md +108 -0
- package/docs/paper/04-design.md +164 -0
- package/docs/paper/05-guardbench-spec.md +412 -0
- package/docs/paper/06-implementation.md +113 -0
- package/docs/paper/07-evaluation.md +168 -0
- package/docs/paper/08-discussion-limitations.md +61 -0
- package/docs/paper/09-conclusion.md +11 -0
- package/docs/paper/SUBMISSION_README.md +162 -0
- package/docs/paper/appendix-a-demo-transcript.md +114 -0
- package/docs/paper/arxiv-compile-report.schema.json +116 -0
- package/docs/paper/arxiv-source.schema.json +61 -0
- package/docs/paper/audrey-paper-v1.md +1106 -0
- package/docs/paper/browser-launch-plan.json +209 -0
- package/docs/paper/browser-launch-plan.schema.json +100 -0
- package/docs/paper/browser-launch-results.json +86 -0
- package/docs/paper/browser-launch-results.schema.json +66 -0
- package/docs/paper/claim-register.json +138 -0
- package/docs/paper/claim-register.schema.json +81 -0
- package/docs/paper/evidence-ledger.md +103 -0
- package/docs/paper/output/arxiv/README-arxiv.txt +8 -0
- package/docs/paper/output/arxiv/arxiv-manifest.json +41 -0
- package/docs/paper/output/arxiv/main.tex +949 -0
- package/docs/paper/output/arxiv/references.bib +222 -0
- package/docs/paper/output/arxiv-compile-report.json +24 -0
- package/docs/paper/output/submission-bundle/LICENSE +21 -0
- package/docs/paper/output/submission-bundle/README.md +555 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/adapter-self-test/guardbench-adapter-self-test.json +50 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/external/guardbench-external-dry-run.json +69 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/external/guardbench-external-evidence.json +56 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-conformance-card.json +63 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-manifest.json +414 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-raw.json +1271 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-summary.json +2107 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/leaderboard/guardbench-leaderboard.json +93 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/leaderboard/guardbench-leaderboard.md +7 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/submission-bundle/submission-manifest.json +131 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/submission-bundle/validation-report.json +31 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/summary.json +2354 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-adapter-registry.schema.json +69 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-adapter-self-test.schema.json +156 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-conformance-card.schema.json +184 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-dry-run.schema.json +74 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-evidence.schema.json +108 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-run.schema.json +160 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-leaderboard.schema.json +179 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-manifest.schema.json +213 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-publication-verification.schema.json +47 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-raw.schema.json +184 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-submission-manifest.schema.json +151 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-summary.schema.json +249 -0
- package/docs/paper/output/submission-bundle/docs/AUDREY_PAPER_OUTLINE.md +175 -0
- package/docs/paper/output/submission-bundle/docs/paper/00-master.md +48 -0
- package/docs/paper/output/submission-bundle/docs/paper/01-introduction.md +27 -0
- package/docs/paper/output/submission-bundle/docs/paper/02-related-work.md +47 -0
- package/docs/paper/output/submission-bundle/docs/paper/03-problem-definition.md +108 -0
- package/docs/paper/output/submission-bundle/docs/paper/04-design.md +164 -0
- package/docs/paper/output/submission-bundle/docs/paper/05-guardbench-spec.md +412 -0
- package/docs/paper/output/submission-bundle/docs/paper/06-implementation.md +113 -0
- package/docs/paper/output/submission-bundle/docs/paper/07-evaluation.md +168 -0
- package/docs/paper/output/submission-bundle/docs/paper/08-discussion-limitations.md +61 -0
- package/docs/paper/output/submission-bundle/docs/paper/09-conclusion.md +11 -0
- package/docs/paper/output/submission-bundle/docs/paper/SUBMISSION_README.md +162 -0
- package/docs/paper/output/submission-bundle/docs/paper/appendix-a-demo-transcript.md +114 -0
- package/docs/paper/output/submission-bundle/docs/paper/arxiv-compile-report.schema.json +116 -0
- package/docs/paper/output/submission-bundle/docs/paper/arxiv-source.schema.json +61 -0
- package/docs/paper/output/submission-bundle/docs/paper/audrey-paper-v1.md +1106 -0
- package/docs/paper/output/submission-bundle/docs/paper/browser-launch-plan.json +209 -0
- package/docs/paper/output/submission-bundle/docs/paper/browser-launch-plan.schema.json +100 -0
- package/docs/paper/output/submission-bundle/docs/paper/browser-launch-results.json +86 -0
- package/docs/paper/output/submission-bundle/docs/paper/browser-launch-results.schema.json +66 -0
- package/docs/paper/output/submission-bundle/docs/paper/claim-register.json +138 -0
- package/docs/paper/output/submission-bundle/docs/paper/claim-register.schema.json +81 -0
- package/docs/paper/output/submission-bundle/docs/paper/evidence-ledger.md +103 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/README-arxiv.txt +8 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/arxiv-manifest.json +41 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/main.tex +949 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/references.bib +222 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv-compile-report.json +24 -0
- package/docs/paper/output/submission-bundle/docs/paper/paper-submission-bundle.schema.json +70 -0
- package/docs/paper/output/submission-bundle/docs/paper/publication-pack.json +81 -0
- package/docs/paper/output/submission-bundle/docs/paper/publication-pack.schema.json +60 -0
- package/docs/paper/output/submission-bundle/docs/paper/references.bib +222 -0
- package/docs/paper/output/submission-bundle/package.json +212 -0
- package/docs/paper/output/submission-bundle/paper-submission-manifest.json +379 -0
- package/docs/paper/paper-submission-bundle.schema.json +70 -0
- package/docs/paper/publication-pack.json +81 -0
- package/docs/paper/publication-pack.schema.json +60 -0
- package/docs/paper/references.bib +222 -0
- package/package.json +87 -4
- package/scripts/audit-release-completion.mjs +362 -0
- package/scripts/create-arxiv-source.mjs +362 -0
- package/scripts/create-paper-submission-bundle.mjs +210 -0
- package/scripts/finalize-release.mjs +526 -0
- package/scripts/prepare-release-cut.mjs +269 -0
- package/scripts/publish-release-bundle.mjs +209 -0
- package/scripts/publish-release-github-api.mjs +429 -0
- package/scripts/run-vitest.mjs +34 -0
- package/scripts/smoke-cli.js +92 -0
- package/scripts/sync-paper-artifacts.mjs +109 -0
- package/scripts/verify-arxiv-compile.mjs +440 -0
- package/scripts/verify-arxiv-source.mjs +194 -0
- package/scripts/verify-browser-launch-plan.mjs +237 -0
- package/scripts/verify-browser-launch-results.mjs +285 -0
- package/scripts/verify-paper-artifacts.mjs +338 -0
- package/scripts/verify-paper-claims.mjs +226 -0
- package/scripts/verify-paper-submission-bundle.mjs +207 -0
- package/scripts/verify-publication-pack.mjs +196 -0
- package/scripts/verify-python-package.py +201 -0
- package/scripts/verify-release-readiness.mjs +785 -0
@@ -0,0 +1,168 @@
# 7. Evaluation

This Stage-A evaluation reports implemented Audrey artifacts and local GuardBench adapters only. Section 5 specifies GuardBench as a reproducibility contract for pre-action memory control. The empirical claims below come from the repository's existing performance snapshot, the current behavioral regression output, the local comparative GuardBench runner, source-linked implementation inspection, and a freshly captured repeated-failure demo transcript.

## Methodology Disclosure

The evaluation separates specification from implementation evidence. GuardBench defines the scenarios, baselines, metrics, and reproducibility requirements for future external-system evaluation. This paper reports what the current repository already supports: encode and hybrid-recall latency from the canonical 0.22.2 snapshot, the `bench:memory:check` regression gate output, a local comparative GuardBench run, and a deterministic demo showing Audrey Guard blocking a repeated failed action (Ledger: E20-E25, E41-E42, E46).

This is narrower than a cross-system benchmark. It is also the honest Stage-A claim: the paper introduces the control problem, specifies the benchmark, reports local comparative results, and avoids unrun external-system comparisons.

## Performance Snapshot

The canonical performance snapshot is `benchmarks/snapshots/perf-0.22.2.json`. It was generated on 2026-05-01T02:15:29.400Z from git SHA `e2e821b`, using mock in-process 64-dimensional embeddings, a mock in-process LLM provider, hybrid vector-plus-lexical retrieval with limit 5, corpus sizes 100, 1000, and 5000, and 50 recall runs per corpus size. The machine provenance is Node 25.5.0, V8 14.1.146.11-node.18, Windows x64, 24-core AMD Ryzen 9 7900X3D, and 62.9 GB RAM (Ledger: E20).

The snapshot reports the following encode and hybrid recall latencies in milliseconds:

| Corpus Size | Encode p50 | Encode p95 | Encode p99 | Hybrid Recall p50 | Hybrid Recall p95 | Hybrid Recall p99 |
|---:|---:|---:|---:|---:|---:|---:|
| 100 | 0.331 | 0.589 | 7.65 | 0.539 | 1.82 | 2.712 |
| 1000 | 0.307 | 2.147 | 9.672 | 1.566 | 2.364 | 21.177 |
| 5000 | 0.308 | 1.838 | 10.45 | 2.091 | 3.417 | 16.58 |

These numbers measure Audrey's local call path under an in-process mock embedding provider. They do not measure real embedding-provider latency, local transformer warmup cost, GPU/CPU variance for 384-dimensional local embeddings, OpenAI or Gemini network latency, or production-load concurrency. The snapshot itself notes that cloud and local 384-dimensional providers will report higher recall latency dominated by embedding cost and network (Ledger: E20-E22, E31).
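
For readers reproducing the latency table, the following is a minimal sketch of deriving p50, p95, and p99 from raw timing samples. It assumes the nearest-rank percentile method; the method used by the snapshot script is not stated in this section, and `percentile` and `summarize` are hypothetical helper names, not Audrey APIs.

```javascript
// Nearest-rank percentile over raw latency samples in milliseconds.
// `percentile` and `summarize` are hypothetical helpers, not Audrey APIs.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("need at least one sample");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, input untouched
  // Multiply before dividing so that ranks such as 95 * 100 / 100 stay exact.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function summarize(samples) {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}

// Synthetic samples 1..50 ms stand in for 50 recall runs.
const runs = Array.from({ length: 50 }, (_, i) => i + 1);
console.log(summarize(runs)); // { p50: 25, p95: 48, p99: 50 }
```

Under this assumption, the p99 of 50 runs is simply the slowest run, which would be consistent with the recall p99 column not growing monotonically with corpus size.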

## Behavioral Regression Result

The current `benchmarks/output/summary.json` was generated on 2026-05-15T17:52:00.842Z with command `node benchmarks/run.js --provider mock --dimensions 64` (Ledger: E24). It reports:

| System | Score (%) | Pass Rate (%) | Average Duration (ms) |
|---|---:|---:|---:|
| Audrey | 100 | 100 | 93.58333333333333 |
| Vector Only | 41.66666666666667 | 25 | 0.25 |
| Keyword + Recency | 41.66666666666667 | 25 | 0.5833333333333334 |
| Recent Window | 37.5 | 25 | 0 |

This output is a regression-gate result. The baselines are toy local baselines used to catch retrieval and lifecycle regressions in the Audrey codebase. They are not external systems, not tuned competitor implementations, and not GuardBench baselines (Ledger: E23-E24). The current suite covers retrieval and operation families such as information extraction, knowledge updates, multi-session reasoning, conflict resolution, procedural learning, privacy boundary, overwrite, delete-and-abstain, semantic merge, and procedural merge (Ledger: E23-E24).

## GuardBench Local Comparative Result

The current `benchmarks/output/guardbench-summary.json`,
`benchmarks/output/guardbench-manifest.json`, and
`benchmarks/output/guardbench-raw.json` were generated on 2026-05-12 with:

```bash
npm run bench:guard:check
```

It reports local adapters only, not external-system comparisons (Ledger: E46):

| Metric | Result |
|---|---:|
| Scenarios passed | 10 / 10 |
| Prevention rate | 100% |
| False-block rate | 0% |
| Evidence recall | 100% |
| Redaction leaks | 0 |
| Recall-degradation detection | 100% |
| Guard latency p50 / p95 | 2.465 ms / 30.791 ms |
| Published artifact raw-secret leaks | 0 |
| Audrey Guard decision accuracy | 100% |
| No-memory decision accuracy | 10% |
| Recent-window decision accuracy | 60% |
| Vector-only decision accuracy | 40% |
| FTS-only decision accuracy | 10% |
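
To make the accuracy and false-block rows concrete, here is a rough sketch of how such rates could be computed from per-scenario results. The metric definitions, the `expected` and `actual` field names, and the toy scenarios are all assumptions for illustration; the authoritative definitions are the ones in the GuardBench specification (Section 5).

```javascript
// Decision accuracy: share of scenarios where the adapter's decision matches
// the expected one. Field names `expected` and `actual` are assumptions.
function decisionAccuracy(results) {
  if (results.length === 0) return 0;
  const correct = results.filter((r) => r.actual === r.expected).length;
  return (100 * correct) / results.length;
}

// False-block rate: share of scenarios that should not be blocked but were.
function falseBlockRate(results) {
  const shouldPass = results.filter((r) => r.expected !== "block");
  if (shouldPass.length === 0) return 0;
  const wronglyBlocked = shouldPass.filter((r) => r.actual === "block").length;
  return (100 * wronglyBlocked) / shouldPass.length;
}

// Toy data (not GuardBench scenarios): a baseline that always answers "allow".
const toy = [
  { id: "s1", expected: "block", actual: "allow" },
  { id: "s2", expected: "warn", actual: "allow" },
  { id: "s3", expected: "allow", actual: "allow" },
  { id: "s4", expected: "block", actual: "allow" },
];
console.log(decisionAccuracy(toy), falseBlockRate(toy)); // 25 0
```

Under these definitions, an adapter that never blocks earns credit only where `allow` is the expected decision, and it can never produce a false block.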

The ten scenarios cover exact repeated failures, required procedures, changed
file scopes, changed commands, recovered failures, vector recall degradation,
FTS recall degradation, truncation-boundary secret redaction, conflicting
instructions, and noisy-store control-memory recall. These results are the
first public local comparative Audrey GuardBench numbers. The emitted manifest
records the ten scenario actions, seeded memories, seeded tool events, fault
injections, expected evidence classes, and non-secret references for seeded
redaction probes; the raw output file records every local adapter result for
each case. The harness also sweeps the summary, manifest, and raw output for
seeded raw secrets and fails the run on artifact leaks. External ESM adapters
receive private seed values at runtime while expected decisions and evidence
remain withheld during execution. The first concrete external adapter targets
Mem0 Platform via its REST APIs, but it has not yet been run with a live key, so
this section does not report Mem0 scores.
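
The artifact sweep described above can be sketched as a plain substring scan. This is an illustrative sketch, not the harness code: `findArtifactLeaks`, `assertNoLeaks`, the seed value, and the artifact contents are hypothetical, and the real harness may match secrets differently.

```javascript
// Scan emitted artifact text for seeded raw secrets.
// Reports which artifact leaked which seed index, never the secret itself.
function findArtifactLeaks(artifacts, seededSecrets) {
  const leaks = [];
  for (const [name, text] of Object.entries(artifacts)) {
    seededSecrets.forEach((secret, seedIndex) => {
      if (secret.length > 0 && text.includes(secret)) {
        leaks.push({ artifact: name, seedIndex });
      }
    });
  }
  return leaks;
}

// Fail the run if any artifact contains a seeded secret.
function assertNoLeaks(artifacts, seededSecrets) {
  const leaks = findArtifactLeaks(artifacts, seededSecrets);
  if (leaks.length > 0) {
    throw new Error(`artifact leak: ${JSON.stringify(leaks)}`);
  }
}

// Toy run: the seed value and artifact contents are made up.
const seeds = ["sk-test-SEEDED-0001"];
const cleanArtifacts = {
  "summary.json": '{"redactionLeaks":0}',
  "raw.json": '{"output":"[REDACTED]"}',
};
assertNoLeaks(cleanArtifacts, seeds); // passes silently
```

Reporting the seed index instead of the matched text keeps the failure message itself from becoming a new leak.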

## Repeated-Failure Demo Transcript

The qualitative control figure is the deterministic repeated-failure demo. The project was rebuilt with `npm run build`, then the demo was run with:

```bash
node dist/mcp-server/index.js demo --scenario repeated-failure
```

The run produced the following transcript. The temporary path, memory IDs, and timestamp-bearing failure ID are run-specific; the decision structure is the evaluated behavior (Ledger: E25, E41-E42). Appendix A provides the same transcript as a standalone reproduction artifact with line annotations.

```text
Audrey Guard repeated-failure demo

Memory store: [LOCAL-TEMP]/audrey-demo-AkCROa
Step 1: the agent tries a deploy and hits a real setup failure.
Step 2: Audrey stores the failure and the operational rule it implies.
Lesson memory: 01KR491DG2YZHVEM79QVW5BHZA

Step 3: a new preflight checks the same action before tool use.

Audrey Guard: BLOCKED

Reason: Blocked: this exact Bash action failed before. Stop: 3 memory reflexes, 2 blocking, 1 warning matched.
Risk score: 0.90

Evidence:
- 01KR491DFZYZ20TFK71KJHC88F
- 01KR491DG2YZHVEM79QVW5BHZA
- failure:Bash:2026-05-08T17:09:22.047Z

Recommended action:
- Do not repeat the exact failed action until the prior error is understood or the command is changed.
- Do not proceed until the high-severity memory warning is addressed.
- Apply this must-follow rule before acting.
- Mitigate this remembered risk before proceeding.
- Before re-running Bash, check what changed since the last failure.

Memory reflexes:
- block: Apply this must-follow rule before acting. Before running npm run deploy, run npm run db:generate because Prisma client must be generated first.
- block: Mitigate this remembered risk before proceeding. Before running npm run deploy, run npm run db:generate because Prisma client must be generated first.
- warn: Before re-running Bash, check what changed since the last failure.

Next: fix the warning and retry, or pass --override to allow this guard check.

Impact:
- 1 repeated failure prevented
- 1 helpful memory validation recorded
- 3 evidence ids attached

Audrey saw the agent fail once.
Audrey stopped it from failing twice.
```

The transcript demonstrates the core pre-action memory-control loop for one scenario: a failed tool action is observed, a procedural lesson is encoded, a later identical action is intercepted before tool use, the guard returns a block decision, and the decision carries evidence IDs, recommendations, reflexes, and impact accounting (Ledger: E3-E4, E8-E11, E16-E17, E25, E42).

## Implemented-Evidence Summary

| Design Claim | Ledger IDs | Evidence Type |
|---|---|---|
| Audrey exposes a pre-action `allow`/`warn`/`block` controller result. | E1-E2 | Source inspection |
| Exact repeated failures are blocked by action identity. | E3, E25, E42 | Source inspection and demo |
| Post-action observation stores redacted tool outcomes and action identity. | E4, E12-E13 | Source inspection |
| Capsules preserve structured memory sections, evidence IDs, budget state, and recall errors. | E5-E7 | Source inspection |
| Preflight converts memory health, failures, risks, procedures, contradictions, and disputed memories into severity-sorted warnings. | E8-E10 | Source inspection |
| Reflexes carry response type, recommendations, and evidence IDs. | E11 | Source inspection |
| Recall degradation is represented and propagated as `RecallError[]`. | E15, E40 | Source inspection |
| Hybrid recall uses vector plus FTS with RRF constants `60`, `0.3`, and `0.7`. | E14 | Source inspection |
| Redaction covers named credentials, generic auth, private keys, payment/PII patterns, sessions, signed URLs, entropy fallback, JSON sensitive keys, and truncation marker preservation. | E12-E13 | Source inspection |
| The runtime includes SQLite, sqlite-vec, FTS5, MCP stdio, Hono REST, CLI, and Python client surfaces. | E29-E36 | Source inspection |
| Security-relevant defaults keep REST local, require API keys for non-loopback, disable no-auth exposure, disable admin tools, and restrict promote roots. | E33, E35, E37-E38 | Source inspection |
| Encode and hybrid-recall latency are reported from the canonical 0.22.2 perf snapshot. | E20-E22 | Snapshot |
| Behavioral regression status is reported from the current `bench:memory:check` output. | E23-E24 | Regression output |
| Local comparative GuardBench status is reported from `bench:guard:check`. | E46 | GuardBench comparative output |
| The repeated-failure control loop is demonstrated by the `demo --scenario repeated-failure` transcript. | E25, E41-E42 | Demo |
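
The hybrid-recall row names reciprocal rank fusion (RRF) with constants `60`, `0.3`, and `0.7`. As a rough illustration of what a weighted RRF looks like, the sketch below treats `60` as the rank offset `k` and the other two values as per-retriever weights. Which weight belongs to the vector list and which to the FTS list is an assumption here, as are the function name and the memory IDs.

```javascript
// Weighted reciprocal rank fusion: score(id) = sum over lists of w / (k + rank).
// Ranks are 1-based. The weight-to-retriever mapping is an assumption.
function fuseRrf(vectorIds, ftsIds, { k = 60, vectorWeight = 0.7, ftsWeight = 0.3 } = {}) {
  const scores = new Map();
  const add = (ids, weight) => {
    ids.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + weight / (k + index + 1));
    });
  };
  add(vectorIds, vectorWeight);
  add(ftsIds, ftsWeight);
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// m1 and m3 appear in both lists, so they outrank m2, which is vector-only.
const fused = fuseRrf(["m1", "m2", "m3"], ["m3", "m1"]);
console.log(fused.map((r) => r.id)); // [ 'm1', 'm3', 'm2' ]
```

Because RRF works on ranks and not raw scores, the vector distances and FTS relevance values never need to be put on a common scale.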

## What This Section Does Not Claim

This section does not claim cross-system superiority over Mem0, Letta/MemGPT, Zep, Graphiti, MemOS, LangMem, Supermemory, Cognee, or any production memory service.

This section claims local comparative GuardBench results, the adapter contract, and the existence of the Mem0 adapter only. External-system GuardBench outputs are deferred until live runs are captured.

This section does not claim production-load measurements. The performance snapshot is a local mock-provider benchmark, not a concurrency, soak, storage-pressure, or real-provider benchmark.

This section does not claim real-hardware variance. It reports one machine's provenance and raw numbers from the repository snapshot.

This section does not claim perfect redaction coverage. It reports implemented pattern coverage and the current GuardBench artifact sweep for seeded raw-secret leakage.

This section does not claim that local toy baselines represent external systems. The current `bench:memory:check` baselines exist to detect Audrey regressions.

@@ -0,0 +1,61 @@
# 8. Discussion and Limitations

## What the Implementation Changes

Repeated-failure prevention is the clearest implemented behavior. The demo records one failed `npm run deploy`, stores the operational lesson that Prisma generation must run first, and blocks the identical future action before tool use. The guard output includes a `BLOCKED` decision, risk score `0.90`, two blocking reflexes, one warning reflex, evidence IDs, concrete recommendations, and impact accounting (Ledger: E25, E42). The change is not better recall in a chat answer. The change is that the second tool call does not happen.

The redacted evidence trail changes what can be safely remembered. Audrey's tool-tracing path states that raw tool input, output, and error text do not leave the module without redaction, and it stores redacted summaries, hashes, file fingerprints, redaction state, and a memory event (Ledger: E12). The redaction catalog covers named API keys, auth headers, private keys, URL credentials, password assignments, payment and PII patterns, signed URLs, session cookies, high-entropy tokens, JSON sensitive keys, and truncation marker preservation (Ledger: E13). This lets guard decisions point back to prior tool evidence without turning the memory store into an unfiltered secret archive.
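
As a small illustration of pattern-based redaction, the sketch below applies two rules of the kind the catalog lists (auth headers and password assignments). The rule names, the regular expressions, and the `[REDACTED:<rule>]` marker format are assumptions made for this example; they are not Audrey's catalog, which covers far more cases.

```javascript
// Two illustrative redaction rules; a production catalog is much broader.
// Rule names, patterns, and the marker format are assumptions.
const RULES = [
  { name: "auth-header", pattern: /\b(Authorization:\s*Bearer\s+)[A-Za-z0-9._~+\/=-]+/gi },
  { name: "password-assignment", pattern: /\b(password\s*[=:]\s*)\S+/gi },
];

function redact(text) {
  let redacted = text;
  const applied = []; // which rules fired, useful as stored redaction state
  for (const rule of RULES) {
    redacted = redacted.replace(rule.pattern, (match, prefix) => {
      applied.push(rule.name);
      // Keep the non-secret prefix so the summary stays readable.
      return `${prefix}[REDACTED:${rule.name}]`;
    });
  }
  return { redacted, applied };
}

const traced = redact('curl -H "Authorization: Bearer abc.DEF-123" --data password=hunter2');
console.log(traced.redacted);
```

Recording which rules fired alongside the redacted text gives a later guard decision something to cite without retaining the raw value.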

Action-key determinism gives the controller a hard repeated-failure path. Audrey hashes tool name, redacted command or action text, normalized working directory, and sorted normalized file scope, then matches the hash against prior failed tool events for the same agent (Ledger: E3). This is separate from semantic retrieval. A prior failure does not need to be semantically similar enough to rank first; the exact repeated action is a structured event match.
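
A deterministic action key of this shape can be sketched in a few lines. The sketch assumes SHA-256 over a JSON payload and simple path normalization; the hash function, field names, and normalization rules Audrey actually uses are not given in this section, so `actionKey` and `isRepeatedFailure` are hypothetical.

```javascript
const { createHash } = require("node:crypto");
const path = require("node:path");

// Hypothetical action key: SHA-256 over tool name, redacted command text,
// normalized working directory, and sorted normalized file scope.
function actionKey({ tool, redactedCommand, cwd, files = [] }) {
  // Unify separators, then collapse "." and ".." segments.
  const normalize = (p) => path.posix.normalize(p.replace(/\\/g, "/"));
  const payload = JSON.stringify({
    tool,
    command: redactedCommand.trim(),
    cwd: normalize(cwd),
    files: files.map(normalize).sort(), // order-independent file scope
  });
  return createHash("sha256").update(payload).digest("hex");
}

// Exact-match lookup against keys of prior failed tool events for one agent.
function isRepeatedFailure(action, priorFailedKeys) {
  return priorFailedKeys.has(actionKey(action));
}

const deploy = { tool: "Bash", redactedCommand: "npm run deploy", cwd: "/repo/app", files: [] };
const failedKeys = new Set([actionKey(deploy)]);
console.log(isRepeatedFailure({ ...deploy, cwd: "/repo/./app" }, failedKeys)); // true
```

The lookup is a set membership test, so it does not depend on embedding quality or ranking; a changed command or file scope produces a different key and falls through to the ordinary recall path.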

Recall-degradation handling changes failure posture. Audrey records missing vector tables, per-type KNN failures, and FTS lookup failures as `RecallError[]`, preserves partial results, exposes degradation in memory status, and carries errors into capsules (Ledger: E15, E40). Preflight converts recall errors into high-severity memory-health warnings, and strict guard mode blocks high-severity warnings (Ledger: E9-E10). A degraded memory substrate therefore becomes a visible control condition instead of an invisible recall-quality drop.
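
The path from a failed recall leg to a block decision can be sketched as three small steps. Everything named here is an assumption for illustration: the leg names, the error and warning shapes, and the non-strict fallback to `warn` are not taken from Audrey's source.

```javascript
// Run each recall leg independently; keep partial results and collect errors.
function recallWithDegradation(legs) {
  const results = [];
  const errors = []; // stands in for RecallError[]
  for (const [source, run] of Object.entries(legs)) {
    try {
      results.push(...run());
    } catch (err) {
      errors.push({ source, message: String(err.message ?? err) });
    }
  }
  return { results, errors };
}

// Preflight: every recall error becomes a high-severity memory-health warning.
function preflightWarnings(errors) {
  return errors.map((e) => ({
    type: "memory-health",
    severity: "high",
    message: `recall degraded in ${e.source}: ${e.message}`,
  }));
}

// Strict guard mode blocks on any high-severity warning.
// The non-strict fallback to "warn" is an assumption.
function guardDecision(warnings, { strict = true } = {}) {
  const hasHigh = warnings.some((w) => w.severity === "high");
  if (hasHigh) return strict ? "block" : "warn";
  return warnings.length > 0 ? "warn" : "allow";
}

const recall = recallWithDegradation({
  vector: () => { throw new Error("vec table missing"); },
  fts: () => [{ id: "m1", text: "run db:generate before deploy" }],
});
console.log(recall.results.length, guardDecision(preflightWarnings(recall.errors))); // 1 block
```

The FTS result survives even though the vector leg failed, and the failure itself is what drives the decision.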
|
|
12
|
+
|
|
13
|
+
## Conservative Control Choices
|
|
14
|
+
|
|
15
|
+
Audrey is conservative by design. Strict mode blocks high-severity warnings rather than asking the model to decide whether a remembered warning matters (Ledger: E10). This increases false-positive risk when a high-severity warning is stale or overbroad. The trade-off is appropriate for pre-action control because the guarded operations include file mutation, shell execution, credential exposure, network calls, and destructive tools. The safer default is to force inspection when memory reports high risk.
|
|
16
|
+
|
|
17
|
+
Agent-scoped recall is another conservative choice. Capsule construction forces `scope: 'agent'`, so a caller cannot accidentally widen a capsule to shared memory by passing broad recall options (Ledger: E7). This reduces cross-agent leakage at the cost of hiding useful memory learned by another agent. Cross-agent sharing belongs behind an explicit federation policy, not an accidental default.
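
One way to make the scope non-overridable is to write it unconditionally when building the recall options. The option names, the `"shared"` scope value, and the default limit in this sketch are hypothetical; only the forced `scope: 'agent'` comes from the text.

```typescript
// Hypothetical caller-facing options; "shared" stands for any wider scope.
interface CallerRecallOptions {
  scope?: "agent" | "shared";
  limit?: number;
}

interface CapsuleRecallOptions {
  scope: "agent";
  limit: number;
}

function capsuleRecallOptions(caller: CallerRecallOptions): CapsuleRecallOptions {
  return {
    limit: caller.limit === undefined ? 10 : caller.limit,
    // Written unconditionally: whatever scope the caller passed is ignored.
    scope: "agent",
  };
}
```

Typing the result as `scope: "agent"` also lets the compiler reject any later code path that tries to widen it.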
|
|
18
|
+
|
|
19
|
+
The trusted-control-source gate is also conservative. A memory tagged as must-follow becomes a control rule only when its source is `direct-observation` or `told-by-user`; untrusted must-follow tags are routed to uncertain or disputed context (Ledger: E6). This blocks an obvious memory-injection path. It also misclassifies some legitimate input: a useful operating rule from a source that has not been promoted into trust is demoted to context rather than enforced.
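
The gate can be pictured as a routing function over tagged memories. The two trusted source strings come from the text; the tag string, the section names, and the memory record shape are assumptions for this sketch.

```typescript
interface TaggedMemory {
  id: string;
  source: string;
  tags: string[];
}

// Hypothetical capsule section names.
type CapsuleRoute = "must_follow" | "uncertain_or_disputed" | "context";

const TRUSTED_CONTROL_SOURCES: string[] = ["direct-observation", "told-by-user"];

function routeMemory(memory: TaggedMemory): CapsuleRoute {
  // Assumed tag spelling; memories without it never become control rules.
  if (memory.tags.indexOf("must-follow") === -1) return "context";
  // The tag alone is not enough: the source must also be trusted.
  return TRUSTED_CONTROL_SOURCES.indexOf(memory.source) !== -1
    ? "must_follow"
    : "uncertain_or_disputed";
}
```

A tool output that merely contains rule-like text and carries the tag is routed to the disputed section, so it can inform the agent without being enforced.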
|
|
20
|
+
|
|
21
|
+
## What This Paper Does Not Claim
|
|
22
|
+
|
|
23
|
+
This paper reports one local comparative GuardBench run. It does not report GuardBench results across external memory systems. Section 5 specifies GuardBench; the full external-system run is deferred to v2.
|
|
24
|
+
|
|
25
|
+
It does not report cross-system comparisons against Mem0, Letta/MemGPT, Zep, Graphiti, MemOS, LangMem, Supermemory, Cognee, or hosted memory services.
|
|
26
|
+
|
|
27
|
+
It does not report production-load measurements, concurrent-agent soak tests, storage-pressure behavior, or long-running operational telemetry.
|
|
28
|
+
|
|
29
|
+
It does not report real-provider embedding latency. The canonical performance snapshot uses a mock in-process 64-dimensional embedding provider (Ledger: E20-E22).
|
|
30
|
+
|
|
31
|
+
It does not prove redaction completeness. The implementation has a broad rule catalog, but unknown credential formats and adversarial encodings remain out of scope (Ledger: E13).
|
|
32
|
+
|
|
33
|
+
It does not prevent first-time errors. Pre-action memory control works from prior evidence, remembered rules, contradictions, and recall health. A novel error with no remembered signal still reaches the underlying tool policy.
|
|
34
|
+
|
|
35
|
+
It does not replace sandboxing, OS permissions, MCP permission systems, human approval, or network isolation. Audrey is a memory-derived control layer that fits inside a host's broader tool-use safety stack.
|
|
36
|
+
|
|
37
|
+
## Threats to Current Claims
|
|
38
|
+
|
|
39
|
+
Action-key fidelity is a central threat. Repeated-failure prevention depends on stable normalization of command text, working directory, and file scope. If a host supplies incomplete file lists, unstable cwd values, or action text that hides the meaningful operation inside nested arguments, exact-repeat detection loses coverage (Ledger: E3). The current hash is deterministic, not omniscient.
|
|
40
|
+
|
|
41
|
+
Redaction is rule-based. It covers many common credential and PII formats, sensitive JSON keys, high-entropy strings, and truncation boundaries (Ledger: E13). It remains incomplete for novel secret formats, multi-part secrets split across fields, encoded payloads, and tool outputs crafted to evade regex and entropy checks.
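
To make the entropy limitation concrete, the sketch below computes Shannon entropy per character and flags long, high-entropy tokens. The length and entropy thresholds are arbitrary illustration values, not Audrey's, and the function names are hypothetical.

```typescript
// Shannon entropy in bits per character of the token's own character distribution.
function shannonEntropyBits(token: string): number {
  if (token.length === 0) return 0;
  const counts: { [ch: string]: number } = {};
  for (let i = 0; i < token.length; i++) {
    const ch = token.charAt(i);
    counts[ch] = (counts[ch] || 0) + 1;
  }
  let entropy = 0;
  for (const ch of Object.keys(counts)) {
    const p = counts[ch] / token.length;
    entropy -= p * (Math.log(p) / Math.LN2);
  }
  return entropy;
}

// Assumed thresholds: at least 20 characters and at least 4 bits per character.
function looksHighEntropy(token: string, minLength: number = 20, minBits: number = 4): boolean {
  return token.length >= minLength && shannonEntropyBits(token) >= minBits;
}
```

Under these thresholds a 24-character random-looking token is flagged, but the same token split into two 12-character halves in separate fields passes both checks, which is the multi-part evasion described above.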
|
|
42
|
+
|
|
43
|
+
Capsule pruning is priority-based, not learned. Capsules enforce a character budget and preserve structured sections and evidence IDs (Ledger: E5). The current implementation uses explicit sectioning and truncation logic rather than a learned policy that optimizes downstream guard accuracy. Budget pressure can hide useful non-control context while keeping control evidence.
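
Priority-based pruning under a character budget can be sketched as a greedy fill in section order. The section names and their ordering are assumptions, and this version drops whole entries rather than truncating text, so it is a simplification of the sectioning and truncation logic the paragraph describes.

```typescript
interface CapsuleEntry {
  id: string;
  section: string;
  text: string;
}

interface PackedCapsule {
  kept: CapsuleEntry[];
  droppedIds: string[];
}

// Hypothetical fixed order: control sections first, general context last.
const CAPSULE_SECTION_ORDER: string[] = ["must_follow", "risks", "recent_failures", "context"];

function sectionRank(section: string): number {
  const idx = CAPSULE_SECTION_ORDER.indexOf(section);
  return idx === -1 ? CAPSULE_SECTION_ORDER.length : idx;
}

// Fill the budget in priority order; entries that do not fit are dropped whole,
// and their IDs are reported so the omission stays visible.
function buildCapsule(entries: CapsuleEntry[], budgetChars: number): PackedCapsule {
  const ranked = entries.slice().sort((a, b) => sectionRank(a.section) - sectionRank(b.section));
  const kept: CapsuleEntry[] = [];
  const droppedIds: string[] = [];
  let used = 0;
  for (const entry of ranked) {
    if (used + entry.text.length > budgetChars) {
      droppedIds.push(entry.id);
      continue;
    }
    kept.push(entry);
    used += entry.text.length;
  }
  return { kept, droppedIds };
}
```

With a tight budget the control rule survives and the context entry is dropped, which mirrors the trade-off noted above: nothing in the ordering is learned from downstream guard accuracy.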
|
|
44
|
+
|
|
45
|
+
Reflex generation is deterministic, not adaptive. Reflexes are mapped from preflight warnings into `guide`, `warn`, or `block` responses with evidence and recommendations (Ledger: E11). The mapping does not learn from later validation events. Validation updates memory salience and bookkeeping, but risk scoring remains fixed by severity rules (Ledger: E16, E45).
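
A fixed severity-to-response table captures why the mapping is deterministic. The severity names and record fields in this sketch are assumptions; the three response values `guide`, `warn`, and `block` come from the text.

```typescript
type FindingSeverity = "info" | "medium" | "high";
type ReflexResponse = "guide" | "warn" | "block";

interface PreflightFinding {
  severity: FindingSeverity;
  message: string;
  evidenceIds: string[];
  recommendation: string;
}

interface MemoryReflex {
  response: ReflexResponse;
  message: string;
  evidenceIds: string[];
  recommendation: string;
}

// Fixed table: nothing here reads validation history, so the mapping never adapts.
const RESPONSE_BY_SEVERITY: { [S in FindingSeverity]: ReflexResponse } = {
  info: "guide",
  medium: "warn",
  high: "block",
};

function toReflexes(findings: PreflightFinding[]): MemoryReflex[] {
  return findings.map((f): MemoryReflex => ({
    response: RESPONSE_BY_SEVERITY[f.severity],
    message: f.message,
    evidenceIds: f.evidenceIds.slice(),
    recommendation: f.recommendation,
  }));
}
```

Making the mapping adaptive would mean giving `toReflexes` a second input derived from validation events, which is the gap the paragraph identifies.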
|
|
46
|
+
|
|
47
|
+
## Open Problems
|
|
48
|
+
|
|
49
|
+
Host hook parity. Audrey exposes CLI pieces that host hooks can call, and Claude Code documents hook extension points [@anthropic2026claudecodehooks]. Audrey now generates and applies Claude Code hook settings, `guard --hook` emits the current PreToolUse `hookSpecificOutput.permissionDecision` shape, and `observe-tool` records post-tool events (Ledger: E43). The remaining production installer work is equivalent wiring for hosts with stable hook surfaces, especially Codex.
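
For orientation, the sketch below maps a guard decision onto the PreToolUse output shape. The `hookSpecificOutput.permissionDecision` path is named in the text; the `hookEventName` and `permissionDecisionReason` fields and the `allow`/`ask`/`deny` values follow the Claude Code hooks documentation and should be re-checked against the current reference. The mapping of `warn` to `ask` is an assumption of this example, not a statement about what `guard --hook` emits.

```typescript
type GuardOutcome = "allow" | "warn" | "block";
type PermissionDecision = "allow" | "ask" | "deny";

interface PreToolUseHookOutput {
  hookSpecificOutput: {
    hookEventName: "PreToolUse";
    permissionDecision: PermissionDecision;
    permissionDecisionReason: string;
  };
}

// Assumed mapping: warnings ask the user, blocks deny the tool call.
function toPermissionDecision(outcome: GuardOutcome): PermissionDecision {
  if (outcome === "block") return "deny";
  if (outcome === "warn") return "ask";
  return "allow";
}

function toHookOutput(outcome: GuardOutcome, reason: string): PreToolUseHookOutput {
  return {
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision: toPermissionDecision(outcome),
      permissionDecisionReason: reason,
    },
  };
}
```

A hook script would serialize this object with `JSON.stringify` and write it to standard output; a host with a different hook surface would need its own equivalent of `toHookOutput`.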
|
|
50
|
+
|
|
51
|
+
Validation lineage is implemented but not yet policy-adaptive. Audrey can bind validation events to the exact preflight event, evidence IDs, and action key that produced a decision, and rejects mismatched evidence claims (Ledger: E44). The next step is using that closed-loop signal to tune warning priority and recommendation wording without giving the model direct control over policy.
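
The binding check can be sketched as a lookup plus two comparisons. The record shapes, outcome values, and rejection messages are hypothetical; the three bound items (preflight event, evidence IDs, action key) and the rejection of mismatched evidence claims come from the text.

```typescript
interface PreflightRecord {
  eventId: string;
  actionKey: string;
  evidenceIds: string[];
}

interface ValidationClaim {
  preflightEventId: string;
  actionKey: string;
  evidenceIds: string[];
  outcome: "helpful" | "wrong";
}

interface LineageCheck {
  accepted: boolean;
  reason: string;
}

function checkValidationLineage(claim: ValidationClaim, preflights: PreflightRecord[]): LineageCheck {
  let matched: PreflightRecord | undefined;
  for (const record of preflights) {
    if (record.eventId === claim.preflightEventId) matched = record;
  }
  if (matched === undefined) return { accepted: false, reason: "unknown preflight event" };
  if (matched.actionKey !== claim.actionKey) return { accepted: false, reason: "action key mismatch" };
  // Every claimed evidence ID must have been part of the original decision.
  for (const id of claim.evidenceIds) {
    if (matched.evidenceIds.indexOf(id) === -1) {
      return { accepted: false, reason: "evidence not in preflight: " + id };
    }
  }
  return { accepted: true, reason: "ok" };
}
```

Only accepted claims would be eligible to feed any future tuning of warning priority, which keeps a model from crediting or blaming evidence that played no part in the decision.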
|
|
52
|
+
|
|
53
|
+
Cross-agent memory federation. Audrey currently protects capsules through agent-scoped recall (Ledger: E7). Multi-agent runtimes need explicit federation rules: which memories transfer across agents, which remain private, which require user confirmation, and how contradictions propagate across agent identities.
|
|
54
|
+
|
|
55
|
+
Adaptive risk scoring. Preflight uses a fixed severity map and strict mode blocks high-severity warnings (Ledger: E45). Validation feedback should eventually tune risk scoring, warning ordering, and recommendation wording without giving the model direct control over the safety policy.
|
|
56
|
+
|
|
57
|
+
Adversarial-memory robustness. The trusted-control-source gate blocks untrusted must-follow tags, but poisoned tool outputs that enter through trusted observation paths remain a hard problem (Ledger: E6, E12). Future work needs adversarial memory tests where attacker-controlled output resembles an operational rule, a project instruction, or a false recovery signal.
|
|
58
|
+
|
|
59
|
+
## Local-First as a Feature
|
|
60
|
+
|
|
61
|
+
Pre-action memory control operates on sensitive operational history: shell commands, file paths, build failures, project rules, API error summaries, user instructions, and redacted tool outputs. Sending that control surface to a hosted memory service creates an avoidable privacy, availability, and latency dependency. Audrey's local-first stack (SQLite, sqlite-vec, and FTS5), its loopback REST default, and its refusal to serve the network without auth keep the controller deployable inside local agents and air-gapped environments (Ledger: E29-E30, E35, E37). The local design is not just an implementation convenience; it is aligned with the data that a pre-action controller must inspect.
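
One reading of the loopback default and no-auth refusal is a bind-time check like the sketch below. The option names, the loopback host list, and the error text are assumptions for illustration and do not describe Audrey's actual server configuration.

```typescript
// Hypothetical server options.
interface ServeOptions {
  host?: string;
  apiKey?: string;
}

interface BindPlan {
  host: string;
  authRequired: boolean;
}

const LOOPBACK_HOSTS: string[] = ["127.0.0.1", "::1", "localhost"];

function planBind(options: ServeOptions): BindPlan {
  // Default to loopback so the service is unreachable from other machines.
  const host = options.host === undefined ? "127.0.0.1" : options.host;
  const isLoopback = LOOPBACK_HOSTS.indexOf(host) !== -1;
  const hasAuth = options.apiKey !== undefined && options.apiKey.length > 0;
  // Refuse to expose the memory store on the network without authentication.
  if (!isLoopback && !hasAuth) {
    throw new Error("refusing to listen on " + host + " without authentication");
  }
  return { host, authRequired: hasAuth };
}
```

Failing at startup, rather than serving unauthenticated requests, keeps a misconfiguration from silently exposing operational history.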
|
|
@@ -0,0 +1,11 @@
|
|
|
1
|
+
# 9. Conclusion
|
|
2
|
+
|
|
3
|
+
Agent memory should be judged by whether it changes future tool actions. Audrey implements that claim for local agents through a host-side memory controller that runs before tool use and returns an auditable `allow`, `warn`, or `block` decision with evidence.
|
|
4
|
+
|
|
5
|
+
The implemented contribution is Audrey Guard: a local-first loop that observes tool outcomes, redacts traces, retrieves hybrid memory, builds bounded capsules, scores preflight risk, generates reflexes, blocks exact repeated failures, records post-action outcomes, and reports validation-linked impact (Ledger: E1-E17, E25-E26, E29-E42). The specified contribution is GuardBench: a scenario manifest, baseline set, metric suite, and reproducibility contract for evaluating memory by action effect rather than retrieved-text relevance.
|
|
6
|
+
|
|
7
|
+
The Stage-A evidence is intentionally bounded. This paper reports source-linked implementation evidence, the canonical 0.22.2 performance snapshot, the current behavioral regression gate output, a local comparative GuardBench run, and a deterministic repeated-failure demo transcript (Ledger: E20-E25, E41-E42, E46). It also includes the external adapter contract and Mem0 evidence-bundle path, but it does not report full external-system GuardBench results.
|
|
8
|
+
|
|
9
|
+
For v2, the plan is to run live external adapters, publish raw per-scenario output bundles, run expanded redaction sweeps, and report guard-overhead p50/p95 under machine-provenance controls.
|
|
10
|
+
|
|
11
|
+
The core result is simple: Audrey saw the agent fail once. Audrey stopped it from failing twice.
|
|
@@ -0,0 +1,162 @@
|
|
|
1
|
+
# Submission README
|
|
2
|
+
|
|
3
|
+
## What's Complete
|
|
4
|
+
|
|
5
|
+
- Nine drafted body sections in `01-introduction.md` through `09-conclusion.md`.
|
|
6
|
+
- Consolidated Markdown master: `audrey-paper-v1.md`.
|
|
7
|
+
- Evidence ledger with 97 rows: `evidence-ledger.md`.
|
|
8
|
+
- Machine-readable claim register: `claim-register.json`.
|
|
9
|
+
- Machine-readable publication pack: `publication-pack.json`.
|
|
10
|
+
- Machine-readable browser launch plan: `browser-launch-plan.json`.
|
|
11
|
+
- Machine-readable browser launch results ledger: `browser-launch-results.json`.
|
|
12
|
+
- Machine-readable arXiv source package: `docs/paper/output/arxiv/`.
|
|
13
|
+
- Machine-readable arXiv compile report: `docs/paper/output/arxiv-compile-report.json`.
|
|
14
|
+
- Compiled arXiv PDF and compile log: `docs/paper/output/arxiv-compile/main.pdf`, `docs/paper/output/arxiv-compile/arxiv-compile.log`.
|
|
15
|
+
- Machine-readable paper submission bundle: `docs/paper/output/submission-bundle/`.
|
|
16
|
+
- Primary-source bibliography with 21 entries: `references.bib`.
|
|
17
|
+
- Verbatim repeated-failure demo transcript: `appendix-a-demo-transcript.md`.
|
|
18
|
+
- GuardBench Stage-A specification with scenario manifest, baselines, metrics, reproducibility contract, JSON schemas, and Stage-A/Stage-B boundary.
|
|
19
|
+
- Machine-readable GuardBench manifest schema: `benchmarks/schemas/guardbench-manifest.schema.json`.
|
|
20
|
+
- Machine-readable GuardBench summary schema: `benchmarks/schemas/guardbench-summary.schema.json`.
|
|
21
|
+
- Machine-readable GuardBench raw-output schema: `benchmarks/schemas/guardbench-raw.schema.json`.
|
|
22
|
+
- Machine-readable GuardBench external-run metadata schema with SHA-256 artifact hashes: `benchmarks/schemas/guardbench-external-run.schema.json`.
|
|
23
|
+
- Machine-readable GuardBench external dry-run matrix schema: `benchmarks/schemas/guardbench-external-dry-run.schema.json`.
|
|
24
|
+
- Machine-readable GuardBench external evidence verification schema: `benchmarks/schemas/guardbench-external-evidence.schema.json`.
|
|
25
|
+
- Machine-readable GuardBench conformance-card schema: `benchmarks/schemas/guardbench-conformance-card.schema.json`.
|
|
26
|
+
- Machine-readable GuardBench submission-manifest schema: `benchmarks/schemas/guardbench-submission-manifest.schema.json`.
|
|
27
|
+
- Machine-readable GuardBench leaderboard schema: `benchmarks/schemas/guardbench-leaderboard.schema.json`.
|
|
28
|
+
- Machine-readable GuardBench adapter self-test schema: `benchmarks/schemas/guardbench-adapter-self-test.schema.json`.
|
|
29
|
+
- Machine-readable GuardBench adapter registry schema: `benchmarks/schemas/guardbench-adapter-registry.schema.json`.
|
|
30
|
+
- Machine-readable GuardBench publication verifier schema: `benchmarks/schemas/guardbench-publication-verification.schema.json`.
|
|
31
|
+
- Machine-readable browser launch-plan schema: `docs/paper/browser-launch-plan.schema.json`.
|
|
32
|
+
- Machine-readable browser launch-results schema: `docs/paper/browser-launch-results.schema.json`.
|
|
33
|
+
- Machine-readable arXiv source-manifest schema: `docs/paper/arxiv-source.schema.json`.
|
|
34
|
+
- Machine-readable arXiv compile-report schema: `docs/paper/arxiv-compile-report.schema.json`.
|
|
35
|
+
- Machine-readable paper submission-bundle schema: `docs/paper/paper-submission-bundle.schema.json`.
|
|
36
|
+
- Standalone GuardBench artifact validator: `npm run bench:guard:validate`.
|
|
37
|
+
- Public artifact path normalizer and local absolute-path sweep: `benchmarks/public-paths.mjs`.
|
|
38
|
+
- Shareable GuardBench conformance-card generator: `npm run bench:guard:card`.
|
|
39
|
+
- Portable GuardBench submission-bundle generator: `npm run bench:guard:bundle`.
|
|
40
|
+
- Offline GuardBench submission-bundle verifier: `npm run bench:guard:bundle:verify`.
|
|
41
|
+
- Verified GuardBench leaderboard builder: `npm run bench:guard:leaderboard`.
|
|
42
|
+
- Adapter registry: `benchmarks/adapters/registry.json`.
|
|
43
|
+
- Adapter registry validator: `npm run bench:guard:adapter-registry:validate`.
|
|
44
|
+
- Adapter author kit: `benchmarks/adapter-kit.mjs`.
|
|
45
|
+
- Fast adapter module validator: `npm run bench:guard:adapter-module:validate`.
|
|
46
|
+
- External adapter self-test: `npm run bench:guard:adapter-self-test`.
|
|
47
|
+
- Saved adapter self-test validator: `npm run bench:guard:adapter-self-test:validate`.
|
|
48
|
+
- GuardBench publication artifact verifier: `npm run bench:guard:publication:verify`.
|
|
49
|
+
- External adapter dry-run matrix: `npm run bench:guard:external:dry-run`.
|
|
50
|
+
- External evidence verifier: `npm run bench:guard:external:evidence`.
|
|
51
|
+
- Strict external evidence gate for credentialed runs: `npm run bench:guard:external:evidence:strict`.
|
|
52
|
+
- Local comparative GuardBench runner with a passing 10-scenario Audrey Guard check via `npm run bench:guard:check`.
|
|
53
|
+
- Strict external GuardBench adapter contract via `node benchmarks/guardbench.js --adapter ./adapter.mjs --check`.
|
|
54
|
+
- External adapters: `benchmarks/adapters/mem0-platform.mjs` (requires runtime `MEM0_API_KEY`) and `benchmarks/adapters/zep-cloud.mjs` (requires runtime `ZEP_API_KEY`).
|
|
55
|
+
- External evidence-bundle runners: `npm run bench:guard:mem0` and `npm run bench:guard:zep` with dry-run metadata capture.
|
|
56
|
+
- Paper metric sync command: `npm run paper:sync`.
|
|
57
|
+
- Paper claim verifier: `npm run paper:claims`.
|
|
58
|
+
- Publication pack verifier: `npm run paper:publication-pack`.
|
|
59
|
+
- arXiv source-package generator: `npm run paper:arxiv`.
|
|
60
|
+
- arXiv source-package verifier: `npm run paper:arxiv:verify`.
|
|
61
|
+
- arXiv compile-report gate: `npm run paper:arxiv:compile`.
|
|
62
|
+
- Strict arXiv compile gate: `npm run paper:arxiv:compile:strict`.
|
|
63
|
+
- Browser launch-plan verifier: `npm run paper:launch-plan`.
|
|
64
|
+
- Browser launch-results verifier: `npm run paper:launch-results`.
|
|
65
|
+
- Strict browser launch-results gate: `npm run paper:launch-results:strict`.
|
|
66
|
+
- Paper submission-bundle generator: `npm run paper:bundle`.
|
|
67
|
+
- Paper submission-bundle verifier: `npm run paper:bundle:verify`.
|
|
68
|
+
- Paper artifact verifier: `npm run paper:verify`.
|
|
69
|
+
- Pending-aware 1.0 release readiness verifier: `npm run release:readiness`.
|
|
70
|
+
- Strict 1.0 release readiness verifier: `npm run release:readiness:strict`.
|
|
71
|
+
- Source-control release-state check inside `npm run release:readiness`.
|
|
72
|
+
- Live remote-head verification inside `npm run release:readiness`.
|
|
73
|
+
- npm registry/auth readiness check inside `npm run release:readiness`.
|
|
74
|
+
- Dry-run release cut planner: `npm run release:cut:plan`.
|
|
75
|
+
- Final release cut writer: `npm run release:cut:apply`.
|
|
76
|
+
- Python package release verifier: `npm run python:release:check`.
|
|
77
|
+
- Paper-aware release gate: `npm run release:gate:paper`.
|
|
78
|
+
- Stage-A scope is explicit: implemented Audrey evidence, existing performance snapshot, current regression gate output, local comparative GuardBench output, and deterministic demo transcript.
|
|
79
|
+
|
|
80
|
+
## arXiv Preprint v1 Checklist
|
|
81
|
+
|
|
82
|
+
1. Generate the deterministic TeX source package with `npm run paper:arxiv`.
|
|
83
|
+
2. Verify the generated source package with `npm run paper:arxiv:verify`.
|
|
84
|
+
3. Run `npm run paper:arxiv:compile`; it compiles with `tectonic`, `latexmk`, `pdflatex`/`bibtex`, or `uvx tecto` plus a local bundle proxy when present and otherwise records a pending toolchain blocker in `docs/paper/output/arxiv-compile-report.json`.
|
|
85
|
+
4. Preview the arXiv-compiled PDF in the browser before pressing final submit.
|
|
86
|
+
5. Confirm the workshop variant stays under the target page limit. arXiv can run longer; workshops often need approximately 9 body pages.
|
|
87
|
+
6. Upload to arXiv with `cs.AI` as the primary category and `cs.CR` as secondary.
|
|
88
|
+
|
|
89
|
+
## Pre-Submission Status - 2026-05-08
|
|
90
|
+
|
|
91
|
+
- Done: README benchmark table values now match `benchmarks/snapshots/perf-0.22.2.json` and Ledger `E28` is resolved.
|
|
92
|
+
- Done: Ledger `E25` was re-checked against the current repeated-failure demo at `mcp-server/index.ts:825-879`.
|
|
93
|
+
- Done: primary-source URLs in `references.bib` were re-checked live on 2026-05-08; all 21 entries were reachable. The MCP schema reference now points at the current `2025-11-25` schema page.
|
|
94
|
+
- Update 2026-05-12: `npm run bench:guard:check` now runs a local comparative GuardBench suite across Audrey Guard, no-memory, recent-window, vector-only, and FTS-only adapters. Audrey Guard passes 10/10 scenarios. The harness supports external ESM adapters, the first concrete Mem0 Platform adapter is implemented, the output bundle includes an artifact redaction sweep, and `npm run bench:guard:mem0 -- --dry-run` captures the live-run command and metadata without storing credentials. A live Mem0 run is still pending a runtime key (Ledger: E46-E51).
|
|
95
|
+
- Update 2026-05-13: the adapter registry now includes Zep Cloud. `benchmarks/adapters/zep-cloud.mjs` creates a benchmark user/session, writes scenario memory with `memory.add`, searches user graph memory with `graph.search`, deletes the benchmark user during cleanup, and is covered by module validation plus mocked REST-flow tests. A live Zep run is still pending a runtime key (Ledger: E77).
|
|
96
|
+
- Update 2026-05-13: `npm run bench:guard:external:dry-run` writes non-secret dry-run metadata for every runtime-env adapter in the registry and is included in the release gates so live-run readiness cannot silently drift (Ledger: E78).
|
|
97
|
+
- Update 2026-05-13: the external dry-run matrix is now schema-bound by `benchmarks/schemas/guardbench-external-dry-run.schema.json`, written to `benchmarks/output/external/guardbench-external-dry-run.json`, included in package dry-run contents, and validated by `paper:verify` (Ledger: E79).
|
|
98
|
+
- Update 2026-05-13: `npm run bench:guard:publication:verify` now checks the schema-bound external dry-run matrix alongside registry, module, self-test, artifacts, submission bundle, and leaderboard, so the one-command public benchmark verifier covers every current GuardBench readiness artifact (Ledger: E80).
|
|
99
|
+
- Update 2026-05-13: `npm run bench:guard:external:evidence` writes a schema-bound external evidence verification report at `benchmarks/output/external/guardbench-external-evidence.json`, reports Mem0/Zep as pending when only dry-run metadata exists, checks saved metadata for runtime credential leaks, and ships a strict mode that fails until credentialed live bundles pass (Ledger: E81).
|
|
100
|
+
- Update 2026-05-13: `docs/paper/claim-register.json` now records supported and pending public claims, and `npm run paper:claims` verifies claim text, forbidden overclaims, evidence files, GuardBench artifacts, and the external-score Stage-B boundary before publication (Ledger: E82).
|
|
101
|
+
- Update 2026-05-13: `docs/paper/publication-pack.json` now carries arXiv, Hacker News, Reddit, X, and LinkedIn launch copy, and `npm run paper:publication-pack` verifies character limits, required entries, claim IDs, forbidden overclaims, pending Mem0/Zep boundary language, and secret leakage before browser-based posting (Ledger: E83).
|
|
102
|
+
- Update 2026-05-13: `npm run paper:bundle` now writes `docs/paper/output/submission-bundle/` with paper files, claim and publication registers, GuardBench outputs, schemas, README/package metadata, and `paper-submission-manifest.json` hashes; `npm run paper:bundle:verify` checks the manifest, file hashes, required files, GuardBench snapshot, claim verifier, and publication-pack verifier before upload (Ledger: E84).
|
|
103
|
+
- Update 2026-05-13: `docs/paper/browser-launch-plan.json` now maps the verified publication copy to arXiv, Hacker News, Reddit, X, and LinkedIn browser targets, records current source URLs/rules checked on 2026-05-13, requires human login/captcha/manual rule checks where needed, and is verified by `npm run paper:launch-plan` before browser posting (Ledger: E85).
|
|
104
|
+
- Update 2026-05-13: `npm run paper:arxiv` now writes a deterministic arXiv TeX source package under `docs/paper/output/arxiv/`; `npm run paper:arxiv:verify` checks the manifest, hashes, bibliography count, converted citations, missing bib IDs, seeded-secret redaction, and absence of local absolute paths before browser upload (Ledger: E86).
|
|
105
|
+
- Update 2026-05-13: `docs/paper/browser-launch-results.json` now records the post-submit state for arXiv, Hacker News, Reddit, X, and LinkedIn targets. `npm run paper:launch-results` validates target alignment, post-submit URL hosts, completed checklist fields, blocker text, and leakage boundaries while allowing pending targets; `npm run paper:launch-results:strict` fails until every target has a submitted, operator-verified public URL (Ledger: E87).
|
|
106
|
+
- Update 2026-05-13: public GuardBench and paper bundle artifacts now normalize repo-local paths before writing saved JSON/Markdown and run a local absolute-path sweep in `bench:guard:publication:verify`, `bench:guard:bundle:verify`, `paper:bundle:verify`, and `paper:verify` (Ledger: E88).
|
|
107
|
+
- Update 2026-05-13: the X launch copy now reserves `reservedUrlChars: 24` for the final public artifact URL using the `x-counting-characters` source, and `paper:launch-results` rejects submitted artifact-url targets unless `artifactUrl` records the public paper or repo URL (Ledger: E89).
|
|
108
|
+
- Update 2026-05-13: `npm run release:readiness` now emits the pending-aware Audrey 1.0 prompt-to-artifact checklist, while `npm run release:readiness:strict` fails until the target version, Python artifacts, npm registry/auth readiness, PyPI publish readiness, browser publication URLs, live Mem0/Zep evidence, and package publish readiness are complete (Ledger: E90).
|
|
109
|
+
- Update 2026-05-13: `npm audit --omit=dev --audit-level=moderate` is clean after refreshing the transitive `protobufjs`/`@protobufjs/*` lockfile chain used through `onnxruntime-web` (Ledger: E91).
|
|
110
|
+
- Update 2026-05-13: `npm run release:cut:plan` now previews the exact 1.0 version/changelog edits with publishable release notes instead of TODO scaffolding, and `npm run release:cut:apply` writes them only when the final cut is intentional (Ledger: E92).
|
|
111
|
+
- Update 2026-05-13: `npm run python:release:check` now builds the Python wheel/sdist, verifies package metadata and typed package contents, checks for local path leakage, and runs `twine check`; release gates run it before package publishing checks (Ledger: E93).
|
|
112
|
+
- Update 2026-05-13: `npm run release:readiness` now also checks source-control release state: branch, upstream, origin push remote, ahead/behind count, clean working tree, and `v1.0.0` tag placement (Ledger: E94).
|
|
113
|
+
- Update 2026-05-13: `npm run release:readiness` now checks `audrey@1.0.0` on the npm registry and requires `npm whoami` when the version is still unpublished, so package publish readiness cannot pass from the version bump alone (Ledger: E95).
|
|
114
|
+
- Update 2026-05-13: `npm run release:readiness` now verifies the live `origin/<branch>` head with `git ls-remote`, retries through Git's OpenSSL backend when Windows Schannel fails with `SEC_E_NO_CREDENTIALS`, and blocks when the local tracking ref is stale (Ledger: E96).
|
|
115
|
+
- Update 2026-05-13: `npm run paper:arxiv:compile` now attempts a local TeX compile with `tectonic`, `latexmk`, `pdflatex`/`bibtex`, or `uvx tecto` through a local bundle proxy, writes the schema-bound `docs/paper/output/arxiv-compile-report.json`, and lets the release-readiness gate keep missing TeX tooling as an explicit pending blocker instead of an undocumented host gap (Ledger: E97).
|
|
116
|
+
|
|
117
|
+
## Final Upload Checks
|
|
118
|
+
|
|
119
|
+
- Re-run URL verification if the arXiv upload moves to a later day.
|
|
120
|
+
- Run `npm run paper:sync` after benchmark outputs change.
|
|
121
|
+
- Run `npm run paper:claims` before public posts or submissions.
|
|
122
|
+
- Run `npm run paper:publication-pack` before using the launch copy.
|
|
123
|
+
- Run `npm run paper:arxiv` before arXiv browser upload.
|
|
124
|
+
- Run `npm run paper:arxiv:verify` before arXiv browser upload.
|
|
125
|
+
- Run `npm run paper:arxiv:compile` before arXiv browser upload; run `npm run paper:arxiv:compile:strict` only when a supported TeX toolchain is installed and compile proof must be complete.
|
|
126
|
+
- Run `npm run paper:launch-plan` before opening browser submission targets.
|
|
127
|
+
- Run `npm run paper:launch-results` after any browser submission or skipped/failed target update.
|
|
128
|
+
- Run `npm run paper:launch-results:strict` only when you intend to prove every launch target has a recorded public result.
|
|
129
|
+
- Run `npm run paper:bundle` after `paper:sync` and before upload.
|
|
130
|
+
- Run `npm run paper:bundle:verify` before uploading `docs/paper/output/submission-bundle/`.
|
|
131
|
+
- Run `npm run paper:verify` after any benchmark or paper edit.
|
|
132
|
+
- Run `npm run release:readiness` to capture the current 1.0 checklist without hiding pending publish blockers.
|
|
133
|
+
- Run `npm run release:readiness:strict` only after version bump, committed/tagged source-control state, live remote-head verification, Python artifacts, npm registry/auth readiness, PyPI publish readiness, browser public URLs, and live Mem0/Zep evidence are complete.
|
|
134
|
+
- Run `npm run release:cut:plan -- --target-version 1.0.0 --json` before applying the final version/changelog bump.
|
|
135
|
+
- Run `npm run release:cut:apply -- --target-version 1.0.0` only when the final 1.0 cut is intentional, then confirm strict readiness sees no placeholder changelog markers.
|
|
136
|
+
- Run `npm run python:release:check` after the final version cut and before PyPI upload.
|
|
137
|
+
- Run `npm audit --omit=dev --audit-level=moderate` before publishing package artifacts.
|
|
138
|
+
- Run `npm run release:gate:paper` before publishing or submitting public claims.
|
|
139
|
+
- Compile `docs/paper/output/arxiv/main.tex` with a local TeX toolchain before final arXiv upload; on hosts without `tectonic`, `latexmk`, `pdflatex`/`bibtex`, or `uvx tecto`, `paper:arxiv:compile` records the blocker and strict readiness remains pending.
|
|
140
|
+
- Upload to arXiv with `cs.AI` as the primary category and `cs.CR` as secondary.
|
|
141
|
+
- Use `publication-pack.json` for the first HN, Reddit, X, and LinkedIn drafts after `paper:publication-pack` passes.
|
|
142
|
+
- Keep the first X post's `reservedUrlChars` budget intact when adding the final public artifact URL.
|
|
143
|
+
- Use `browser-launch-plan.json` for target order, login/captcha handling, manual platform-rule checks, and post-submit URL capture.
|
|
144
|
+
- Update `browser-launch-results.json` with returned public URLs or blockers after each browser action.
|
|
145
|
+
- Use `docs/paper/output/submission-bundle/` as the machine-readable browser/upload package after `paper:bundle:verify` passes.
|
|
146
|
+
|
|
147
|
+
## Stage-B Work for v2
|
|
148
|
+
|
|
149
|
+
- Run `npm run bench:guard:mem0` with a runtime `MEM0_API_KEY` and publish the output bundle.
|
|
150
|
+
- Run `npm run bench:guard:zep` with a runtime `ZEP_API_KEY` and publish the output bundle.
|
|
151
|
+
- Run `npm run bench:guard:external:evidence:strict` after both live bundles exist.
|
|
152
|
+
- Add adapters for additional external memory systems using the current GuardBench ESM adapter contract and verify each one with `npm run bench:guard:adapter-self-test -- --adapter <adapter.mjs>`.
|
|
153
|
+
- Add external per-scenario confusion matrices, expanded multi-secret redaction sweeps, machine-provenance latency tables, and raw output artifacts.
|
|
154
|
+
- Strip evidence-ledger references from prose after the GuardBench claim set is stable.
|
|
155
|
+
- Compress the body to approximately 7,000 words for workshop submission.
|
|
156
|
+
|
|
157
|
+
## Suggested Venue Order
|
|
158
|
+
|
|
159
|
+
1. arXiv preprint: immediate, no review gate, establishes a priority date.
|
|
160
|
+
2. NeurIPS workshop on foundation models for decision making, memory and agents, or LLM systems, depending on accepted workshop calls.
|
|
161
|
+
3. EMNLP Industry Track if GuardBench results land before the deadline.
|
|
162
|
+
4. SOSP/OSDI as a stretch target after full GuardBench results across external systems are ready.
|
|
@@ -0,0 +1,114 @@
|
|
|
1
|
+
# Appendix A. Repeated-Failure Demo Transcript
|
|
2
|
+
|
|
3
|
+
This appendix records the qualitative demo used in Section 7. The demo creates an isolated temporary Audrey store, records one failed deploy command, stores the operational rule implied by the failure, and runs a pre-action guard check on the same command. It is the paper's artifact-grounded figure because it shows the central claim in one executable trace: memory changes the next tool action before the tool runs.
|
|
4
|
+
|
|
5
|
+
## Commands
|
|
6
|
+
|
|
7
|
+
Build command:
|
|
8
|
+
|
|
9
|
+
```bash
|
|
10
|
+
npm run build
|
|
11
|
+
```
|
|
12
|
+
|
|
13
|
+
Run command:
|
|
14
|
+
|
|
15
|
+
```bash
|
|
16
|
+
node dist/mcp-server/index.js demo --scenario repeated-failure
|
|
17
|
+
```
|
|
18
|

## Verbatim Transcript

```text
Audrey Guard repeated-failure demo

Memory store: [LOCAL-TEMP]/audrey-demo-AkCROa
Step 1: the agent tries a deploy and hits a real setup failure.
Step 2: Audrey stores the failure and the operational rule it implies.
Lesson memory: 01KR491DG2YZHVEM79QVW5BHZA

Step 3: a new preflight checks the same action before tool use.

Audrey Guard: BLOCKED

Reason: Blocked: this exact Bash action failed before. Stop: 3 memory reflexes, 2 blocking, 1 warning matched.
Risk score: 0.90

Evidence:
- 01KR491DFZYZ20TFK71KJHC88F
- 01KR491DG2YZHVEM79QVW5BHZA
- failure:Bash:2026-05-08T17:09:22.047Z

Recommended action:
- Do not repeat the exact failed action until the prior error is understood or the command is changed.
- Do not proceed until the high-severity memory warning is addressed.
- Apply this must-follow rule before acting.
- Mitigate this remembered risk before proceeding.
- Before re-running Bash, check what changed since the last failure.

Memory reflexes:
- block: Apply this must-follow rule before acting. Before running npm run deploy, run npm run db:generate because Prisma client must be generated first.
- block: Mitigate this remembered risk before proceeding. Before running npm run deploy, run npm run db:generate because Prisma client must be generated first.
- warn: Before re-running Bash, check what changed since the last failure.

Next: fix the warning and retry, or pass --override to allow this guard check.

Impact:
- 1 repeated failure prevented
- 1 helpful memory validation recorded
- 3 evidence ids attached

Audrey saw the agent fail once.
Audrey stopped it from failing twice.
```

## Line Annotations

| Transcript Fragment | Demonstrates |
|---|---|
| `Audrey Guard repeated-failure demo` | The named scenario is the Guard demo, not a generic recall query. |
| `Memory store: ...\audrey-demo-...` | The demo uses an isolated temporary memory store; IDs and temp path are run-specific. |
| `Step 1: the agent tries a deploy...` | The first action is allowed to fail once, creating real operational evidence. |
| `Step 2: Audrey stores the failure...` | The failed tool outcome is converted into memory state. |
| `Lesson memory: ...` | The procedural lesson receives a concrete memory ID that can be cited later. |
| `Step 3: a new preflight...` | The next decision occurs before tool use, which is the pre-action control boundary. |
| `Audrey Guard: BLOCKED` | The controller returns an enforced block decision rather than retrieved context. |
| `Reason: Blocked: this exact Bash action failed before...` | The decision combines exact repeated-failure matching with reflex summary counts. |
| `Risk score: 0.90` | The guard exposes a numeric risk score in the decision object. |
| `Evidence:` and the three evidence rows | The block is auditable: prior event, lesson memory, and failure-class evidence are attached. |
| `Recommended action:` rows | The guard returns concrete next actions rather than only a warning sentence. |
| `Memory reflexes:` rows | Preflight warnings are converted into block and warn reflexes with operational wording. |
| `Next: fix the warning...` | The CLI output preserves an explicit override path while defaulting to prevention. |
| `Impact:` rows | The loop records prevention, validation, and attached-evidence accounting. |
| `Audrey saw the agent fail once.` | The system observes failure instead of pretending it can prevent first-time unknown errors. |
| `Audrey stopped it from failing twice.` | The central behavior: memory changes the next tool action. |
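
The three evidence rows in the transcript have two visible shapes: two 26-character identifiers and one `failure:Bash:<timestamp>` key. When auditing a block, it can help to sort evidence by shape. The following is a minimal sketch under stated assumptions: `classifyEvidenceId` is a hypothetical helper, not part of Audrey's API, and reading the 26-character IDs as ULID-style is an inference from their format, not something the transcript states.

```javascript
// Hypothetical helper: classify an evidence ID by its visible shape.
// Assumption: 26-character Crockford base32 strings are ULID-style memory IDs.
const ULID_SHAPE = /^[0-9A-HJKMNP-TV-Z]{26}$/;
// Assumption: failure keys look like failure:<tool>:<ISO 8601 timestamp>.
const FAILURE_KEY = /^failure:([^:]+):(.+)$/;

function classifyEvidenceId(id) {
  const failure = FAILURE_KEY.exec(id);
  if (failure) {
    const [, tool, timestamp] = failure;
    // Date.parse returns NaN when the timestamp is not parseable.
    const timestampValid = !Number.isNaN(Date.parse(timestamp));
    return { kind: "failure", tool, timestamp, timestampValid };
  }
  if (ULID_SHAPE.test(id)) return { kind: "memory", id };
  return { kind: "unknown", id };
}

// The three evidence rows copied from the transcript.
const evidence = [
  "01KR491DFZYZ20TFK71KJHC88F",
  "01KR491DG2YZHVEM79QVW5BHZA",
  "failure:Bash:2026-05-08T17:09:22.047Z",
];
const classified = evidence.map(classifyEvidenceId);
console.log(classified.map((e) => e.kind).join(", ")); // prints "memory, memory, failure"
```

Anything that matches neither shape is reported as `unknown` rather than guessed at, so a change in ID format shows up in an audit instead of being misfiled.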

## How to Reproduce

Prerequisites:

- Node.js 20 or newer.
- Audrey dependencies installed with `npm install`.
- A clean or dirty checkout is acceptable; the demo uses a temporary mock-provider store and does not write to the user's normal Audrey data directory.

Commands:

```bash
npm run build
node dist/mcp-server/index.js demo --scenario repeated-failure
```

Expected output shape:

- The command prints `Audrey Guard repeated-failure demo`.
- It reports a temporary `Memory store` path.
- It prints a run-specific `Lesson memory` ID.
- It prints `Audrey Guard: BLOCKED`.
- It prints `Risk score: 0.90`.
- It lists at least one prior failure evidence ID, one lesson memory ID, and one `failure:Bash:<timestamp>` evidence ID.
- It reports two blocking reflexes, one warning reflex, one repeated failure prevented, one helpful memory validation recorded, and three evidence IDs attached.
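
The expected-output list above can be turned into a mechanical check on captured demo output. The following is a minimal sketch, assuming the output has been captured as a string and that each line matches the wording in the transcript; `EXPECTED_SHAPE` and `missingFromTranscript` are hypothetical names and not part of Audrey.

```javascript
// Hypothetical checker: each entry pairs a label with a pattern taken from
// the expected-output list. None of this ships with Audrey.
const EXPECTED_SHAPE = [
  ["title", /^Audrey Guard repeated-failure demo$/m],
  ["memory store path", /^Memory store: \S+/m],
  ["lesson memory id", /^Lesson memory: \S+/m],
  ["block decision", /^Audrey Guard: BLOCKED$/m],
  ["risk score", /^Risk score: 0\.90$/m],
  ["failure evidence id", /^\s*- failure:Bash:\S+$/m],
  ["reflex counts", /2 blocking, 1 warning matched/],
  ["prevention count", /^\s*- 1 repeated failure prevented$/m],
  ["validation count", /^\s*- 1 helpful memory validation recorded$/m],
  ["evidence count", /^\s*- 3 evidence ids attached$/m],
];

// Returns the labels of every expectation the transcript fails to meet.
function missingFromTranscript(text) {
  const normalized = text
    .replace(/\r\n/g, "\n") // tolerate Windows line endings
    .replace(/[ \t]+$/gm, ""); // ignore trailing whitespace
  return EXPECTED_SHAPE
    .filter(([, pattern]) => !pattern.test(normalized))
    .map(([label]) => label);
}

// Abbreviated sample built from lines of the transcript in this document.
const sample = [
  "Audrey Guard repeated-failure demo",
  "",
  "Memory store: [LOCAL-TEMP]/audrey-demo-AkCROa",
  "Lesson memory: 01KR491DG2YZHVEM79QVW5BHZA",
  "Audrey Guard: BLOCKED",
  "Reason: Blocked: this exact Bash action failed before. Stop: 3 memory reflexes, 2 blocking, 1 warning matched.",
  "Risk score: 0.90",
  "- failure:Bash:2026-05-08T17:09:22.047Z",
  "- 1 repeated failure prevented",
  "- 1 helpful memory validation recorded",
  "- 3 evidence ids attached",
].join("\n");

console.log(missingFromTranscript(sample)); // prints []
```

An empty result means every listed expectation was found. The patterns deliberately leave the memory store path and the IDs unconstrained beyond being non-empty, since those values change per run.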

Run-specific values:

- The temp directory suffix changes on each run.
- Memory IDs change on each run.
- The timestamp inside the `failure:Bash:<timestamp>` evidence ID changes on each run.
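
Because these three values differ on every run, comparing two transcripts directly will always show differences. One way around this is to replace the run-specific values with placeholders before comparing. The following is a minimal sketch: `normalizeTranscript` is a hypothetical helper, and its patterns are assumptions inferred from the transcript in this document (an alphanumeric suffix after `audrey-demo-`, 26-character ULID-shaped memory IDs, and a timestamp following `failure:<tool>:`).

```javascript
// Hypothetical normalizer: replaces the three run-specific values with
// placeholders. The patterns are assumptions inferred from the transcript.
function normalizeTranscript(text) {
  return text
    // Timestamp inside the failure evidence ID.
    .replace(/(failure:[^:\s]+:)\S+/g, "$1<timestamp>")
    // Temp directory suffix after "audrey-demo-".
    .replace(/(audrey-demo-)[A-Za-z0-9]+/g, "$1<suffix>")
    // 26-character ULID-shaped memory IDs.
    .replace(/\b[0-9A-HJKMNP-TV-Z]{26}\b/g, "<memory-id>");
}

// Lines copied from the transcript in this document.
const runA = [
  "Memory store: [LOCAL-TEMP]/audrey-demo-AkCROa",
  "Lesson memory: 01KR491DG2YZHVEM79QVW5BHZA",
  "- failure:Bash:2026-05-08T17:09:22.047Z",
].join("\n");

// Made-up values standing in for a second run; not real demo output.
const runB = [
  "Memory store: [LOCAL-TEMP]/audrey-demo-Zq81Lm",
  "Lesson memory: 01KR4B7XQ2M3N4P5R6S7T8V9WA",
  "- failure:Bash:2026-05-09T08:30:00.000Z",
].join("\n");

console.log(normalizeTranscript(runA) === normalizeTranscript(runB)); // prints true
```

The timestamp is replaced first so that later patterns never see it. Everything outside the three run-specific values is left untouched, so any other difference between two runs still surfaces in a comparison.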