audrey 0.23.1 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +81 -19
- package/LICENSE +21 -21
- package/README.md +209 -5
- package/SECURITY.md +2 -1
- package/benchmarks/adapter-kit.mjs +20 -0
- package/benchmarks/adapter-self-test.mjs +166 -0
- package/benchmarks/adapters/example-allow.mjs +28 -0
- package/benchmarks/adapters/mem0-platform.mjs +267 -0
- package/benchmarks/adapters/registry.json +51 -0
- package/benchmarks/adapters/zep-cloud.mjs +280 -0
- package/benchmarks/baselines.js +169 -0
- package/benchmarks/build-leaderboard.mjs +170 -0
- package/benchmarks/cases.js +537 -0
- package/benchmarks/create-conformance-card.mjs +139 -0
- package/benchmarks/create-submission-bundle.mjs +176 -0
- package/benchmarks/dry-run-external-adapters.mjs +165 -0
- package/benchmarks/guardbench.js +1035 -0
- package/benchmarks/output/adapter-self-test/guardbench-adapter-self-test.json +50 -0
- package/benchmarks/output/external/guardbench-external-dry-run.json +69 -0
- package/benchmarks/output/external/guardbench-external-evidence.json +56 -0
- package/benchmarks/output/guardbench-conformance-card.json +63 -0
- package/benchmarks/output/guardbench-manifest.json +414 -0
- package/benchmarks/output/guardbench-raw.json +1171 -0
- package/benchmarks/output/guardbench-summary.json +1981 -0
- package/benchmarks/output/leaderboard/guardbench-leaderboard.json +93 -0
- package/benchmarks/output/leaderboard/guardbench-leaderboard.md +7 -0
- package/benchmarks/output/submission-bundle/guardbench-conformance-card.json +63 -0
- package/benchmarks/output/submission-bundle/guardbench-manifest.json +414 -0
- package/benchmarks/output/submission-bundle/guardbench-raw.json +1171 -0
- package/benchmarks/output/submission-bundle/guardbench-summary.json +1981 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-adapter-registry.schema.json +69 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-adapter-self-test.schema.json +156 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-conformance-card.schema.json +184 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-external-dry-run.schema.json +74 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-external-evidence.schema.json +108 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-external-run.schema.json +160 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-leaderboard.schema.json +179 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-manifest.schema.json +213 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-publication-verification.schema.json +47 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-raw.schema.json +164 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-submission-manifest.schema.json +151 -0
- package/benchmarks/output/submission-bundle/schemas/guardbench-summary.schema.json +228 -0
- package/benchmarks/output/submission-bundle/submission-manifest.json +131 -0
- package/benchmarks/output/submission-bundle/validation-report.json +31 -0
- package/benchmarks/output/summary.json +2354 -0
- package/benchmarks/perf-snapshot.js +304 -0
- package/benchmarks/perf.bench.js +161 -0
- package/benchmarks/public-paths.mjs +78 -0
- package/benchmarks/reference-results.js +70 -0
- package/benchmarks/report.js +259 -0
- package/benchmarks/run-external-guardbench.mjs +281 -0
- package/benchmarks/run.js +682 -0
- package/benchmarks/schemas/guardbench-adapter-registry.schema.json +69 -0
- package/benchmarks/schemas/guardbench-adapter-self-test.schema.json +156 -0
- package/benchmarks/schemas/guardbench-conformance-card.schema.json +184 -0
- package/benchmarks/schemas/guardbench-external-dry-run.schema.json +74 -0
- package/benchmarks/schemas/guardbench-external-evidence.schema.json +108 -0
- package/benchmarks/schemas/guardbench-external-run.schema.json +160 -0
- package/benchmarks/schemas/guardbench-leaderboard.schema.json +179 -0
- package/benchmarks/schemas/guardbench-manifest.schema.json +213 -0
- package/benchmarks/schemas/guardbench-publication-verification.schema.json +47 -0
- package/benchmarks/schemas/guardbench-raw.schema.json +164 -0
- package/benchmarks/schemas/guardbench-submission-manifest.schema.json +151 -0
- package/benchmarks/schemas/guardbench-summary.schema.json +228 -0
- package/benchmarks/snapshots/perf-0.22.2.json +123 -0
- package/benchmarks/snapshots/perf-0.23.0.json +123 -0
- package/benchmarks/validate-adapter-module.mjs +104 -0
- package/benchmarks/validate-adapter-registry.mjs +134 -0
- package/benchmarks/validate-adapter-self-test.mjs +96 -0
- package/benchmarks/validate-guardbench-artifacts.mjs +343 -0
- package/benchmarks/verify-external-evidence.mjs +296 -0
- package/benchmarks/verify-publication-artifacts.mjs +286 -0
- package/benchmarks/verify-submission-bundle.mjs +167 -0
- package/dist/mcp-server/config.d.ts +1 -1
- package/dist/mcp-server/config.d.ts.map +1 -1
- package/dist/mcp-server/config.js +1 -1
- package/dist/mcp-server/config.js.map +1 -1
- package/dist/mcp-server/index.d.ts +65 -3
- package/dist/mcp-server/index.d.ts.map +1 -1
- package/dist/mcp-server/index.js +675 -157
- package/dist/mcp-server/index.js.map +1 -1
- package/dist/src/action-key.d.ts +9 -0
- package/dist/src/action-key.d.ts.map +1 -0
- package/dist/src/action-key.js +49 -0
- package/dist/src/action-key.js.map +1 -0
- package/dist/src/adaptive.js +5 -5
- package/dist/src/affect.js +8 -8
- package/dist/src/audrey.d.ts +3 -0
- package/dist/src/audrey.d.ts.map +1 -1
- package/dist/src/audrey.js +55 -3
- package/dist/src/audrey.js.map +1 -1
- package/dist/src/capsule.js +4 -4
- package/dist/src/causal.js +3 -3
- package/dist/src/consolidate.js +48 -48
- package/dist/src/controller.d.ts +61 -5
- package/dist/src/controller.d.ts.map +1 -1
- package/dist/src/controller.js +230 -49
- package/dist/src/controller.js.map +1 -1
- package/dist/src/db.js +172 -172
- package/dist/src/decay.js +8 -8
- package/dist/src/embedding.d.ts +2 -1
- package/dist/src/embedding.d.ts.map +1 -1
- package/dist/src/embedding.js +39 -29
- package/dist/src/embedding.js.map +1 -1
- package/dist/src/encode.js +6 -6
- package/dist/src/feedback.d.ts +6 -0
- package/dist/src/feedback.d.ts.map +1 -1
- package/dist/src/feedback.js +6 -0
- package/dist/src/feedback.js.map +1 -1
- package/dist/src/forget.js +12 -12
- package/dist/src/hybrid-recall.js +9 -9
- package/dist/src/impact.js +6 -6
- package/dist/src/import.d.ts +3 -3
- package/dist/src/import.js +41 -41
- package/dist/src/index.d.ts +3 -3
- package/dist/src/index.d.ts.map +1 -1
- package/dist/src/index.js +2 -2
- package/dist/src/index.js.map +1 -1
- package/dist/src/interference.js +14 -14
- package/dist/src/introspect.js +18 -18
- package/dist/src/preflight.d.ts.map +1 -1
- package/dist/src/preflight.js +41 -0
- package/dist/src/preflight.js.map +1 -1
- package/dist/src/promote.js +7 -7
- package/dist/src/prompts.js +118 -118
- package/dist/src/recall.js +30 -30
- package/dist/src/reflexes.d.ts +1 -0
- package/dist/src/reflexes.d.ts.map +1 -1
- package/dist/src/reflexes.js +3 -0
- package/dist/src/reflexes.js.map +1 -1
- package/dist/src/rollback.js +4 -4
- package/dist/src/routes.d.ts.map +1 -1
- package/dist/src/routes.js +67 -1
- package/dist/src/routes.js.map +1 -1
- package/dist/src/validate.js +25 -25
- package/docs/AUDREY_PAPER_OUTLINE.md +175 -0
- package/docs/MEMORY_BENCHMARKING.md +59 -0
- package/docs/PRODUCTION_BACKLOG.md +304 -0
- package/docs/paper/00-master.md +48 -0
- package/docs/paper/01-introduction.md +27 -0
- package/docs/paper/02-related-work.md +47 -0
- package/docs/paper/03-problem-definition.md +108 -0
- package/docs/paper/04-design.md +164 -0
- package/docs/paper/05-guardbench-spec.md +412 -0
- package/docs/paper/06-implementation.md +113 -0
- package/docs/paper/07-evaluation.md +168 -0
- package/docs/paper/08-discussion-limitations.md +61 -0
- package/docs/paper/09-conclusion.md +11 -0
- package/docs/paper/SUBMISSION_README.md +162 -0
- package/docs/paper/appendix-a-demo-transcript.md +114 -0
- package/docs/paper/arxiv-compile-report.schema.json +116 -0
- package/docs/paper/arxiv-source.schema.json +61 -0
- package/docs/paper/audrey-paper-v1.md +1106 -0
- package/docs/paper/browser-launch-plan.json +209 -0
- package/docs/paper/browser-launch-plan.schema.json +100 -0
- package/docs/paper/browser-launch-results.json +86 -0
- package/docs/paper/browser-launch-results.schema.json +66 -0
- package/docs/paper/claim-register.json +138 -0
- package/docs/paper/claim-register.schema.json +81 -0
- package/docs/paper/evidence-ledger.md +103 -0
- package/docs/paper/output/arxiv/README-arxiv.txt +8 -0
- package/docs/paper/output/arxiv/arxiv-manifest.json +41 -0
- package/docs/paper/output/arxiv/main.tex +949 -0
- package/docs/paper/output/arxiv/references.bib +222 -0
- package/docs/paper/output/arxiv-compile-report.json +24 -0
- package/docs/paper/output/submission-bundle/LICENSE +21 -0
- package/docs/paper/output/submission-bundle/README.md +533 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/adapter-self-test/guardbench-adapter-self-test.json +50 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/external/guardbench-external-dry-run.json +69 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/external/guardbench-external-evidence.json +56 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-conformance-card.json +63 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-manifest.json +414 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-raw.json +1171 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-summary.json +1981 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/leaderboard/guardbench-leaderboard.json +93 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/leaderboard/guardbench-leaderboard.md +7 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/submission-bundle/submission-manifest.json +131 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/submission-bundle/validation-report.json +31 -0
- package/docs/paper/output/submission-bundle/benchmarks/output/summary.json +2354 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-adapter-registry.schema.json +69 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-adapter-self-test.schema.json +156 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-conformance-card.schema.json +184 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-dry-run.schema.json +74 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-evidence.schema.json +108 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-run.schema.json +160 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-leaderboard.schema.json +179 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-manifest.schema.json +213 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-publication-verification.schema.json +47 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-raw.schema.json +164 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-submission-manifest.schema.json +151 -0
- package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-summary.schema.json +228 -0
- package/docs/paper/output/submission-bundle/docs/AUDREY_PAPER_OUTLINE.md +175 -0
- package/docs/paper/output/submission-bundle/docs/paper/00-master.md +48 -0
- package/docs/paper/output/submission-bundle/docs/paper/01-introduction.md +27 -0
- package/docs/paper/output/submission-bundle/docs/paper/02-related-work.md +47 -0
- package/docs/paper/output/submission-bundle/docs/paper/03-problem-definition.md +108 -0
- package/docs/paper/output/submission-bundle/docs/paper/04-design.md +164 -0
- package/docs/paper/output/submission-bundle/docs/paper/05-guardbench-spec.md +412 -0
- package/docs/paper/output/submission-bundle/docs/paper/06-implementation.md +113 -0
- package/docs/paper/output/submission-bundle/docs/paper/07-evaluation.md +168 -0
- package/docs/paper/output/submission-bundle/docs/paper/08-discussion-limitations.md +61 -0
- package/docs/paper/output/submission-bundle/docs/paper/09-conclusion.md +11 -0
- package/docs/paper/output/submission-bundle/docs/paper/SUBMISSION_README.md +162 -0
- package/docs/paper/output/submission-bundle/docs/paper/appendix-a-demo-transcript.md +114 -0
- package/docs/paper/output/submission-bundle/docs/paper/arxiv-compile-report.schema.json +116 -0
- package/docs/paper/output/submission-bundle/docs/paper/arxiv-source.schema.json +61 -0
- package/docs/paper/output/submission-bundle/docs/paper/audrey-paper-v1.md +1106 -0
- package/docs/paper/output/submission-bundle/docs/paper/browser-launch-plan.json +209 -0
- package/docs/paper/output/submission-bundle/docs/paper/browser-launch-plan.schema.json +100 -0
- package/docs/paper/output/submission-bundle/docs/paper/browser-launch-results.json +86 -0
- package/docs/paper/output/submission-bundle/docs/paper/browser-launch-results.schema.json +66 -0
- package/docs/paper/output/submission-bundle/docs/paper/claim-register.json +138 -0
- package/docs/paper/output/submission-bundle/docs/paper/claim-register.schema.json +81 -0
- package/docs/paper/output/submission-bundle/docs/paper/evidence-ledger.md +103 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/README-arxiv.txt +8 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/arxiv-manifest.json +41 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/main.tex +949 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/references.bib +222 -0
- package/docs/paper/output/submission-bundle/docs/paper/output/arxiv-compile-report.json +24 -0
- package/docs/paper/output/submission-bundle/docs/paper/paper-submission-bundle.schema.json +70 -0
- package/docs/paper/output/submission-bundle/docs/paper/publication-pack.json +81 -0
- package/docs/paper/output/submission-bundle/docs/paper/publication-pack.schema.json +60 -0
- package/docs/paper/output/submission-bundle/docs/paper/references.bib +222 -0
- package/docs/paper/output/submission-bundle/package.json +212 -0
- package/docs/paper/output/submission-bundle/paper-submission-manifest.json +379 -0
- package/docs/paper/paper-submission-bundle.schema.json +70 -0
- package/docs/paper/publication-pack.json +81 -0
- package/docs/paper/publication-pack.schema.json +60 -0
- package/docs/paper/references.bib +222 -0
- package/package.json +87 -4
- package/scripts/audit-release-completion.mjs +362 -0
- package/scripts/create-arxiv-source.mjs +362 -0
- package/scripts/create-paper-submission-bundle.mjs +210 -0
- package/scripts/finalize-release.mjs +526 -0
- package/scripts/prepare-release-cut.mjs +269 -0
- package/scripts/publish-release-bundle.mjs +209 -0
- package/scripts/publish-release-github-api.mjs +429 -0
- package/scripts/run-vitest.mjs +34 -0
- package/scripts/smoke-cli.js +72 -0
- package/scripts/sync-paper-artifacts.mjs +109 -0
- package/scripts/verify-arxiv-compile.mjs +440 -0
- package/scripts/verify-arxiv-source.mjs +194 -0
- package/scripts/verify-browser-launch-plan.mjs +237 -0
- package/scripts/verify-browser-launch-results.mjs +285 -0
- package/scripts/verify-paper-artifacts.mjs +338 -0
- package/scripts/verify-paper-claims.mjs +226 -0
- package/scripts/verify-paper-submission-bundle.mjs +207 -0
- package/scripts/verify-publication-pack.mjs +196 -0
- package/scripts/verify-python-package.py +201 -0
- package/scripts/verify-release-readiness.mjs +741 -0
|
@@ -0,0 +1,949 @@
|
|
|
1
|
+
\documentclass[11pt]{article}
|
|
2
|
+
\usepackage[margin=1in]{geometry}
|
|
3
|
+
\usepackage[T1]{fontenc}
|
|
4
|
+
\usepackage[utf8]{inputenc}
|
|
5
|
+
\usepackage{hyperref}
|
|
6
|
+
\usepackage{url}
|
|
7
|
+
\usepackage{verbatim}
|
|
8
|
+
|
|
9
|
+
\title{Audrey: Local-First Pre-Action Memory Control for AI Agents}
|
|
10
|
+
\author{Tyler Eveland}
|
|
11
|
+
\date{\today}
|
|
12
|
+
|
|
13
|
+
\begin{document}
|
|
14
|
+
\maketitle
|
|
15
|
+
|
|
16
|
+
\begin{abstract}
|
|
17
|
+
AI agents increasingly rely on long-running tool use, but most memory systems are evaluated as retrieval layers rather than as pre-action control systems. Audrey is a local-first memory runtime that records episodes, procedures, contradictions, tool traces, validation outcomes, and salience, then checks relevant memory before an agent touches tools. The paper introduces GuardBench, a Stage-A benchmark specification for memory-before-action behavior, and reports a local comparative run across Audrey Guard, no-memory, recent-window, vector-only, and FTS-only adapters. In the current repository artifacts, Audrey Guard passes 10/10 local GuardBench scenarios with zero seeded raw-secret leaks in published artifacts. The paper does not report external-system GuardBench scores; Mem0 and Zep adapters exist, but live credentialed outputs remain pending strict evidence verification. The release includes schema-bound manifests, raw outputs, conformance cards, submission bundles, leaderboard artifacts, an external evidence verifier, and a claim register that blocks unsupported public claims.
|
|
18
|
+
\end{abstract}
|
|
19
|
+
|
|
20
|
+
Tyler Eveland Audrey Project (Independent) j.tyler.eveland@gmail.com
|
|
21
|
+
|
|
22
|
+
\section{Abstract}
|
|
23
|
+
|
|
24
|
+
Agent memory should be judged by whether it changes future tool actions, not only by whether it retrieves relevant text. Audrey implements a local-first pre-action memory controller that converts prior tool outcomes, procedures, contradictions, recall health, and redacted traces into auditable \texttt{allow}, \texttt{warn}, or \texttt{block} decisions before an agent acts. The system builds bounded memory capsules, scores preflight risk, generates evidence-linked reflexes, blocks exact repeated failures through deterministic action identity hashing, and closes the loop through post-action validation and impact reporting. This paper frames the scientific category as pre-action memory control and the artifact as Audrey Guard. The Stage-A version reports implemented Audrey evidence: the controller and CLI, redaction-first tool tracing, recall-degradation handling, the canonical 0.22.2 performance snapshot, the current behavioral regression gate output, and the deterministic repeated-failure demo. It also specifies GuardBench as the evaluation methodology for future cross-system comparison.
|
|
25
|
+
|
|
26
|
+
\section{Section 1. Introduction}
|
|
27
|
+
|
|
28
|
+
Tool-using agents fail in ways that ordinary chat-memory evaluation does not measure. They repeat broken shell commands after a previous run already exposed the error. They ignore project-specific setup rules that were learned in an earlier session. They lose the causal link between a failed action and the fix that made a later action safe. They treat degraded retrieval as complete memory and act anyway. In Audrey's repeated-failure demo, an agent first runs \texttt{npm run deploy} and fails because the Prisma client was not generated. Audrey records the failed tool event, stores the operational rule, and blocks the same action when it is proposed again. The transcript ends with the intended behavior of pre-action memory control: "Audrey saw the agent fail once. Audrey stopped it from failing twice." (Ledger: E25, E42)
|
|
29
|
+
|
|
30
|
+
Most memory evaluation frames do not test this behavior. MTEB evaluates text embeddings across retrieval and representation tasks \cite{muennighoff2023mteb}. LongMemEval evaluates chat assistants on information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention over long interaction histories \cite{wu2025longmemeval}. LoCoMo evaluates very long-term conversational memory through question answering, summarization, and multimodal dialogue generation \cite{maharana2024locomo}. These benchmarks are valuable, but their output target is retrieved context or an answer. They do not ask whether memory changed a future tool action before the action reached the shell, file system, browser, API, or MCP server.
|
|
31
|
+
|
|
32
|
+
This paper defines pre-action memory control as a distinct systems problem. A controller receives a proposed tool action and remembered state before execution, then returns an auditable \texttt{allow}, \texttt{warn}, or \texttt{block} decision with evidence. Section 3 gives the formal input and output contract, desired behavior properties, threat model, and scope boundaries. The key shift is the evaluation target: memory is judged by its effect on action selection, not only by the relevance of retrieved text.
|
|
33
|
+
|
|
34
|
+
Audrey Guard is the artifact studied in this paper. It is a local-first memory controller for agents that observes tool outcomes, redacts traces, retrieves relevant memory, constructs a bounded capsule, scores preflight risk, generates reflexes, returns \texttt{allow}/\texttt{warn}/\texttt{block}, and validates whether memory helped after the action path completes (Ledger: E1-E17). The implementation is exposed through MCP, CLI, REST, and Python client surfaces, while the core guard path runs host-side before tool execution (Ledger: E18-E19, E26, E32-E36).
|
|
35
|
+
|
|
36
|
+
This paper makes six contributions:
|
|
37
|
+
|
|
38
|
+
\begin{enumerate}
|
|
39
|
+
\item It formalizes pre-action memory control as a problem separate from chat recall, retrieval accuracy, and long-context question answering (Section 3).
|
|
40
|
+
\end{enumerate}
|
|
41
|
+
|
|
42
|
+
\begin{enumerate}
|
|
43
|
+
\item It presents Audrey Guard, a local-first controller that converts remembered failures, procedures, contradictions, recall health, and redacted tool traces into \texttt{allow}, \texttt{warn}, or \texttt{block} decisions before tool use (Sections 4 and 6; Ledger: E1-E15, E29-E40).
|
|
44
|
+
\end{enumerate}
|
|
45
|
+
|
|
46
|
+
\begin{enumerate}
|
|
47
|
+
\item It introduces deterministic action identity for repeated-failure prevention: tool, redacted command, normalized working directory, and sorted file scope are hashed and matched against prior failed tool events (Sections 4, 6, and 7; Ledger: E3, E25, E42).
|
|
48
|
+
\end{enumerate}
|
|
49
|
+
|
|
50
|
+
\begin{enumerate}
|
|
51
|
+
\item It implements a redaction-first tool-trace path so guard evidence can reference prior tool input, output, and error summaries without storing raw secrets in durable memory (Sections 4 and 6; Ledger: E12-E13).
|
|
52
|
+
\end{enumerate}
|
|
53
|
+
|
|
54
|
+
\begin{enumerate}
|
|
55
|
+
\item It treats recall degradation as a control signal: missing vector tables, KNN failures, and FTS failures propagate as \texttt{RecallError[]}, appear in capsules, and become high-severity preflight warnings under strict guard mode (Sections 4 and 6; Ledger: E7, E9-E10, E15, E40).
|
|
56
|
+
\end{enumerate}
|
|
57
|
+
|
|
58
|
+
\begin{enumerate}
|
|
59
|
+
\item It specifies GuardBench, a reproducibility contract for measuring whether memory changes future tool actions, including scenarios, baselines, metrics, redaction sweeps, machine provenance, and raw per-scenario outputs (Section 5).
|
|
60
|
+
\end{enumerate}
|
|
61
|
+
|
|
62
|
+
The empirical scope is Stage A. This paper reports implemented Audrey evidence: the controller and CLI guard path, redaction-first tool tracing, recall-degradation handling, the canonical 0.22.2 performance snapshot, the current \texttt{bench:memory:check} regression output, the local comparative GuardBench run, and the deterministic repeated-failure demo transcript (Ledger: E20-E26, E41-E42, E46). It does not report external-system GuardBench comparisons, production-load measurements, or real-provider embedding latency. The external adapter contract, Mem0 and Zep adapters, and evidence-bundle runner now exist, but live external-system scores belong in a v2 paper after credentialed runs publish raw outputs under the contract in Section 5 (Ledger: E47-E50, E77).
|
|
63
|
+
|
|
64
|
+
Section 2 positions Audrey against memory systems, memory benchmarks, graph-memory systems, and MCP safety work. Section 3 defines the pre-action memory-control problem. Section 4 describes Audrey Guard's design. Section 5 specifies GuardBench. Section 6 documents the implementation. Section 7 reports Stage-A evaluation artifacts. Section 8 discusses limitations and open problems. Section 9 concludes.
|
|
65
|
+
|
|
66
|
+
\section{Section 2. Related Work}
|
|
67
|
+
|
|
68
|
+
This section organizes prior work by the behavior each system optimizes. The point of comparison is not whether a system has memory. It is whether memory runs before tool use and produces an evidence-linked action decision.
|
|
69
|
+
|
|
70
|
+
\subsection{Conversational and Scalable Memory Systems}
|
|
71
|
+
|
|
72
|
+
Mem0 optimizes scalable long-term memory for multi-session agents. Its paper frames the problem as extracting, consolidating, and retrieving salient information from ongoing conversations, including a graph-memory variant for relational structure, and evaluates on LoCoMo-style conversational memory tasks \cite{chhikara2025mem0}. Its 2026 algorithm post emphasizes token-efficient retrieval through hierarchical memory, ADD-only extraction, entity linking, and multi-signal retrieval \cite{mem02026tokenefficient}. Audrey differs at the control boundary: Mem0 optimizes what context to retrieve for response generation, while Audrey evaluates whether remembered context changes a proposed tool action before execution.
|
|
73
|
+
|
|
74
|
+
MemGPT, now associated with Letta, optimizes virtual context management. The MemGPT paper treats limited context windows as an operating-system-style memory hierarchy problem, moving information between memory tiers so an LLM can operate beyond its immediate context window \cite{packer2024memgpt}. This is an architectural framing for extended context and multi-session chat. Audrey borrows the systems instinct but changes the target: its controller is not a virtual-context manager for the model; it is a host-side guard that returns \texttt{allow}, \texttt{warn}, or \texttt{block} for a proposed action.
|
|
75
|
+
|
|
76
|
+
LangMem optimizes memory as a reusable agent-runtime primitive. Its documentation describes tooling for extracting important information from conversations, optimizing agent behavior through prompt refinement, maintaining long-term memory, and providing hot-path memory tools agents can call during active conversations \cite{langchain2026langmem}. This is close to an agent developer's integration layer. Audrey differs because the guard path does not depend on the language model deciding to call a memory tool; the host asks memory before the tool call proceeds.
|
|
77
|
+
|
|
78
|
+
Supermemory optimizes a developer memory API and context stack. Its documentation positions the service as long-term and short-term memory and context infrastructure, with ingestion, extraction, graph memory, user profiles, connectors, and managed RAG \cite{supermemory2026docs}. Its repository describes persistent memory for AI tools and an API that returns user profiles and relevant memories \cite{supermemory2026repo}. Audrey differs by keeping the pre-action controller local and by treating prior tool outcomes, redaction state, and recall degradation as enforceable control inputs rather than retrieved context alone.
|
|
79
|
+
|
|
80
|
+
\subsection{Memory as System Resource and Graph Systems}
|
|
81
|
+
|
|
82
|
+
MemOS optimizes memory as a system resource. The paper introduces a memory operating system that manages heterogeneous memory forms across temporal scales, with memory units carrying content and metadata such as provenance and versioning \cite{li2025memos}. This is the broadest systems framing among the related memory papers. Audrey's scope is narrower and more operational: it does not manage parameter-level memories or schedule heterogeneous memory resources; it inserts a local memory-derived decision layer before agent tool use.
|
|
83
|
+
|
|
84
|
+
Zep optimizes temporal knowledge graphs for agent memory. Its paper presents Graphiti as a temporally aware knowledge-graph engine that synthesizes unstructured conversations and structured business data while maintaining historical relationships, then evaluates retrieval over Deep Memory Retrieval and LongMemEval-style tasks \cite{rasmussen2025zep}. Audrey uses contradictions, recent tool events, and typed memory, but it does not claim to be a temporal knowledge graph service. Its central output is a guard decision, not a retrieved graph context.
|
|
85
|
+
|
|
86
|
+
Graphiti optimizes real-time temporal context graphs. Its repository describes context graphs that track how facts change over time, maintain provenance to source data, and support semantic, keyword, and graph traversal retrieval \cite{zep2026graphiti}. This is valuable for evolving facts and historical queries. Audrey's use of evidence is different: evidence is attached to an action decision and to recommendations that a host can enforce.
|
|
87
|
+
|
|
88
|
+
Cognee optimizes knowledge infrastructure for agent memory. The Cognee repository describes an open-source memory control plane that ingests data, combines embeddings and graphs, supports local execution, and provides traceability and cross-agent knowledge sharing \cite{cognee2026repo}. The Cognee paper studies hyperparameter optimization for graph construction, retrieval, and prompting in multi-hop question answering \cite{markovic2025cognee}. Audrey does not optimize knowledge-graph retrieval quality. It uses local memory to decide whether an agent action should proceed.
|
|
89
|
+
|
|
90
|
+
\subsection{Memory Benchmarks and Evaluation}
|
|
91
|
+
|
|
92
|
+
MTEB optimizes broad evaluation of embedding models. It spans embedding tasks such as retrieval, clustering, reranking, and semantic textual similarity across many datasets and languages \cite{muennighoff2023mteb}. It is relevant because many memory systems rely on embeddings, but it evaluates representation quality rather than the behavioral effect of memory on a tool-using agent.
|
|
93
|
+
|
|
94
|
+
LongMemEval optimizes long-term chat-assistant memory evaluation. It tests information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention across sustained user-assistant histories \cite{wu2025longmemeval}. GuardBench is orthogonal. It starts after a system has some memory state and asks whether that state changes a future tool action.
|
|
95
|
+
|
|
96
|
+
LoCoMo optimizes very long-term conversational memory. It provides long dialogues across many sessions and evaluates question answering, event summarization, and multimodal dialogue generation \cite{maharana2024locomo}. GuardBench does not replace LoCoMo. It tests a separate failure surface: repeated actions, missing procedures, degraded recall, secret redaction, and contradictions at the tool boundary.
|
|
97
|
+
|
|
98
|
+
MemoryBench optimizes continual-learning evaluation from accumulated user feedback. Its paper argues that many memory benchmarks focus on homogeneous reading-comprehension tasks and introduces a user-feedback simulation framework across domains, languages, and task types \cite{ai2026memorybench}. Audrey's validation loop is smaller: it records whether a memory was used, helpful, or wrong, then updates salience and bookkeeping (Ledger: E16). GuardBench evaluates the control effect of those memories, not general continual-learning quality.
|
|
99
|
+
|
|
100
|
+
\subsection{MCP Tool Safety and Pre-Action Runtimes}
|
|
101
|
+
|
|
102
|
+
The Model Context Protocol standardizes how clients discover and call tools. The 2025-06-18 schema defines \texttt{tools/list}, tool metadata, \texttt{tools/call}, input and output schemas, and tool annotations \cite{mcp2025schema}. MCP creates an interoperable tool surface; it does not define a memory-derived policy for whether a call should happen. Audrey fits beside MCP as a local controller that runs before a host invokes a tool.
|
|
103
|
+
|
|
104
|
+
MCP Security Bench optimizes evaluation of MCP-specific attacks. It introduces attack categories across task planning, tool invocation, and response handling, and evaluates LLM agents with real benign and malicious MCP tools \cite{zhang2026mcpsecuritybench}. Audrey is not an MCP attack benchmark and not a complete MCP defense system. It addresses one component inside a defensive host: memory-derived pre-action control with evidence and redaction.
|
|
105
|
+
|
|
106
|
+
The tool-poisoning paper studies semantic attacks against MCP-integrated systems, including malicious tool descriptors, shadowing through contaminated context, and descriptor changes after approval \cite{jamshidi2025toolpoisoning}. Audrey's trusted-control-source gate responds to a related risk inside memory: untrusted memories tagged as must-follow are not promoted into control rules (Ledger: E6). This does not solve tool poisoning. It reduces one path by which remembered content becomes operational instruction.
|
|
107
|
+
|
|
108
|
+
The MCP tool-annotations blog frames annotations as a risk vocabulary. It states that annotations such as read-only, destructive, idempotent, and open-world are hints, not guaranteed descriptions, and that clients should not base tool-use decisions on annotations from untrusted servers \cite{mcp2026toolannotations}. Audrey's decision layer is complementary: it uses remembered outcomes, procedures, contradictions, and recall health, not only static tool metadata.
|
|
109
|
+
|
|
110
|
+
\subsection{What Is Missing}
|
|
111
|
+
|
|
112
|
+
Across the primary sources reviewed here, memory systems optimize extraction, retrieval, persistence, graph structure, context assembly, personalization, or continual learning. Safety work around MCP optimizes attack detection, tool metadata, and protocol-level risk. The missing evaluation target is action effect: whether memory changes what an agent does next. Audrey implements a redaction-first, evidence-linked, host-side controller that runs before tool use and returns \texttt{allow}, \texttt{warn}, or \texttt{block}; GuardBench specifies how to evaluate that category.
|
|
113
|
+
|
|
114
|
+
\section{Section 3. Problem Definition: Pre-Action Memory Control}
|
|
115
|
+
|
|
116
|
+
Long-term memory systems for agents are usually evaluated as retrieval systems: given prior interaction history and a current query, the system returns facts, summaries, graph neighborhoods, or memory-tool results that improve the next model response. Mem0 evaluates scalable conversational memory and token efficiency \cite{chhikara2025mem0,mem02026tokenefficient}. MemGPT/Letta frames memory as virtual context management across tiers \cite{packer2024memgpt}. Zep and Graphiti model changing facts as temporal knowledge graphs \cite{rasmussen2025zep,zep2026graphiti}. MemOS treats memory as a manageable system resource \cite{li2025memos}. LangMem, Supermemory, and Cognee expose memory management, search, and graph/context layers for agents \cite{langchain2026langmem,supermemory2026docs,markovic2025cognee,cognee2026repo}.
|
|
117
|
+
|
|
118
|
+
Those systems make recall, memory formation, or memory organization the central artifact. This paper studies a different artifact: a controller that runs before an agent uses a tool. The relevant question is not only whether memory returns useful text. The relevant question is whether memory changes the next external action.
|
|
119
|
+
|
|
120
|
+
\subsection{Problem Statement}
|
|
121
|
+
|
|
122
|
+
A tool-using agent repeatedly converts model state into external actions: shell commands, file edits, API calls, browser operations, MCP tool calls, or domain-specific side effects. These actions are not only language outputs. They change local files, spend API budget, mutate remote systems, publish content, delete data, and expose credentials. MCP standardizes tool discovery and invocation through a JSON-RPC protocol surface \cite{mcp2025schema}. Claude Code hooks expose pre-tool and post-tool extension points around tool calls \cite{anthropic2026claudecodehooks}. These interfaces make tool use observable and interceptable; they do not by themselves decide whether prior failures, remembered constraints, stale recall, or contradictions should stop the next action.
|
|
123
|
+
|
|
124
|
+
The pre-action memory control problem is:
|
|
125
|
+
|
|
126
|
+
Given an intended agent action and a local memory state, return an auditable decision before tool execution: \texttt{allow}, \texttt{warn}, or \texttt{block}, with evidence and repair guidance.
|
|
127
|
+
|
|
128
|
+
Audrey implements this decision as the \texttt{GuardResult} contract with \texttt{decision}, \texttt{riskScore}, \texttt{summary}, \texttt{evidenceIds}, \texttt{recommendedActions}, \texttt{capsule}, \texttt{reflexes}, and \texttt{preflightEventId} fields (Ledger: E1). Its current controller calls preflight/reflex generation before action, records a preflight event, and scopes recall to the current agent (Ledger: E2). It also records tool outcomes after execution and turns failures into future tool-result memories (Ledger: E4).
|
|
129
|
+
|
|
130
|
+
\subsection{Formal Interface}
|
|
131
|
+
|
|
132
|
+
Let an intended action at time \texttt{t} be:
|
|
133
|
+
|
|
134
|
+
\begin{verbatim}
|
|
135
|
+
a_t = (tool, action, command, cwd, files, session_id)
|
|
136
|
+
\end{verbatim}
|
|
137
|
+
|
|
138
|
+
where \texttt{tool} names the external capability, \texttt{action} is the human-readable intended operation, \texttt{command} is the concrete command when present, \texttt{cwd} is the execution directory, \texttt{files} is the known file scope, and \texttt{session\_id} identifies the agent session. Audrey represents this shape in \texttt{AgentAction} (Ledger: E1).
|
|
139
|
+
|
|
140
|
+
Let \texttt{M\_t} be the memory store visible to the agent, \texttt{T\_t} be the tool-trace event history, and \texttt{H\_t} be the current recall-health state. A pre-action memory controller is a function:
|
|
141
|
+
|
|
142
|
+
\begin{verbatim}
|
|
143
|
+
G(M_t, T_t, H_t, a_t) -> (d_t, r_t, E_t, R_t, C_t)
|
|
144
|
+
\end{verbatim}
|
|
145
|
+
|
|
146
|
+
where:
|
|
147
|
+
|
|
148
|
+
\begin{itemize}
|
|
149
|
+
\item \texttt{d\_t in \{allow, warn, block\}} is the action decision.
|
|
150
|
+
\item \texttt{r\_t in [0,1]} is the risk score.
|
|
151
|
+
\item \texttt{E\_t} is a set of evidence identifiers.
|
|
152
|
+
\item \texttt{R\_t} is a set of recommended repair or mitigation actions.
|
|
153
|
+
\item \texttt{C\_t} is an optional memory capsule containing the evidence packet.
|
|
154
|
+
\end{itemize}
|
|
155
|
+
|
|
156
|
+
The controller is useful only if \texttt{d\_t} is consumed before the side effect occurs. If the decision is displayed after execution, the system is post-hoc logging, not pre-action control.
|
|
157
|
+
|
|
158
|
+
After the tool executes, let the observed outcome be:
|
|
159
|
+
|
|
160
|
+
\begin{verbatim}
|
|
161
|
+
o_t = (outcome, output, error_summary, metadata)
|
|
162
|
+
\end{verbatim}
|
|
163
|
+
|
|
164
|
+
where \texttt{outcome} includes success, failure, or unknown status. A closed-loop memory controller also defines an update function:
|
|
165
|
+
|
|
166
|
+
\begin{verbatim}
|
|
167
|
+
U(M_t, T_t, a_t, o_t) -> (M_{t+1}, T_{t+1})
|
|
168
|
+
\end{verbatim}
|
|
169
|
+
|
|
170
|
+
Audrey implements this post-action update by recording redacted tool events and encoding failures as tool-result memories for later preflight use (Ledger: E4, E12).
|
|
171
|
+
|
|
172
|
+
\subsection{Desired Behavior Properties}
|
|
173
|
+
|
|
174
|
+
**Pre-action placement.** The controller runs after the agent proposes tool parameters and before the tool runs. Audrey's \texttt{beforeAction()} path calls reflex/preflight generation before execution (Ledger: E2). This placement separates memory control from answer generation.
|
|
175
|
+
|
|
176
|
+
**Evidence-linked decisions.** Every warning or block is backed by memory IDs, failure IDs, recall diagnostics, or preflight event IDs. Audrey preflight returns warnings, evidence IDs, recommended actions, recent failures, and an optional capsule (Ledger: E8-E10). Reflex generation preserves evidence IDs and reasons (Ledger: E11).
|
|
177
|
+
|
|
178
|
+
**Action identity.** A repeated-failure detector needs an action identity stricter than a natural-language similarity match and more robust than raw string equality. Audrey hashes tool name, redacted command/action text, normalized working directory, and sorted normalized files; it then compares the hash with prior failed tool events for the same agent and tool (Ledger: E3).
|
|
179
|
+
|
|
180
|
+
**Conservative control-source handling.** A memory tagged as a rule is not automatically trusted. Audrey treats \texttt{must-follow} style tags as control signals only when the source is \texttt{direct-observation} or \texttt{told-by-user}; untrusted control-looking memories are routed to uncertain/disputed context (Ledger: E6).
|
|
181
|
+
|
|
182
|
+
**Recall-degradation awareness.** If retrieval partially fails, a memory controller should not silently proceed as though recall were complete. Audrey represents recall errors, propagates partial failures, exposes recent recall degradation in status, carries recall errors into capsules, and turns capsule recall errors into high-severity preflight warnings (Ledger: E7, E9, E15).
|
|
183
|
+
|
|
184
|
+
**Redaction before persistence.** Tool traces are high-risk memory inputs because they contain commands, environment output, stack traces, credentials, and file paths. Audrey's tool-trace contract states that raw tool input, output, and error text do not leave the module without redaction; it stores hashes, redacted summaries, redacted metadata, file fingerprints, and redaction state (Ledger: E12). The redaction layer covers common credentials, bearer/basic auth, private keys, JWTs, URL credentials, password assignments, payment/PII patterns, signed URL signatures, session cookies, high-entropy secrets, and sensitive JSON keys (Ledger: E13).
|
|
185
|
+
|
|
186
|
+
**Bounded context assembly.** A pre-action controller must fit within model and tool budgets. Audrey capsules include \texttt{budget\_chars}, \texttt{used\_chars}, and \texttt{truncated} fields and organize evidence into typed sections rather than returning an unstructured recall list (Ledger: E5).
|
|
187
|
+
|
|
188
|
+
**Agent scoping.** A local memory runtime used by multiple agents should not leak another agent's private operational history into a current preflight. Audrey capsule recall forces \texttt{scope: 'agent'} (Ledger: E7).
|
|
189
|
+
|
|
190
|
+
**Closed-loop validation.** A memory controller needs feedback after action because the controller's evidence can be helpful, merely used, or wrong. Audrey validation accepts \texttt{used}, \texttt{helpful}, and \texttt{wrong} outcomes and updates salience and bookkeeping fields; impact reporting summarizes validation and recent activity (Ledger: E16-E17).
|
|
191
|
+
|
|
192
|
+
\subsection{Threat Model}
|
|
193
|
+
|
|
194
|
+
The controller assumes an agent with legitimate access to local tools and MCP tools. The main hazards are action-selection hazards, not cryptographic compromise.
|
|
195
|
+
|
|
196
|
+
The in-scope hazards are:
|
|
197
|
+
|
|
198
|
+
\begin{itemize}
|
|
199
|
+
\item Repeating an exact action that already failed.
|
|
200
|
+
\item Ignoring a remembered must-follow procedure or project rule.
|
|
201
|
+
\item Acting on a memory set with open contradictions.
|
|
202
|
+
\item Treating degraded recall as complete recall.
|
|
203
|
+
\item Persisting credentials or sensitive tool output into long-lived memory.
|
|
204
|
+
\item Allowing untrusted tool outputs or descriptors to influence future actions as if they were trusted rules.
|
|
205
|
+
\item Losing auditability between a warning/block and the evidence that caused it.
|
|
206
|
+
\end{itemize}
|
|
207
|
+
|
|
208
|
+
The MCP ecosystem makes these hazards concrete. The MCP specification standardizes the schema for tools and messages \cite{mcp2025schema}. MCP security benchmarks and tool-poisoning work evaluate attacks against MCP-integrated agents, including adversarial tool descriptions, shadowing through shared context, and changed tool descriptors after approval \cite{zhang2026mcpsecuritybench,jamshidi2025toolpoisoning}. The MCP tool-annotations discussion frames annotations as risk hints that matter only when a client performs a concrete action based on them \cite{mcp2026toolannotations}. Audrey's controller fits this pattern: memory risk is useful when it changes the host-side decision.
|
|
209
|
+
|
|
210
|
+
The out-of-scope hazards are:
|
|
211
|
+
|
|
212
|
+
\begin{itemize}
|
|
213
|
+
\item First-time mistakes with no relevant prior evidence.
|
|
214
|
+
\item Malicious host compromise, database tampering, or filesystem attacks outside Audrey's process boundary.
|
|
215
|
+
\item Formal verification that a tool action is semantically safe.
|
|
216
|
+
\item Permission enforcement, sandboxing, rate limiting, or policy execution outside the memory controller.
|
|
217
|
+
\item Model alignment, deception detection, and general prompt-injection defense independent of remembered evidence.
|
|
218
|
+
\item Complete prevention of secret exposure after a caller explicitly stores unredacted data outside Audrey's tool-trace path.
|
|
219
|
+
\end{itemize}
|
|
220
|
+
|
|
221
|
+
These boundaries matter for evaluation. Audrey Guard is not a replacement for sandboxing, MCP permission systems, static analysis, or human approval. It is a memory-derived pre-action control layer: it catches hazards that are visible in prior outcomes, stored rules, contradictions, tool traces, and recall health.
|
|
222
|
+
|
|
223
|
+
\subsection{Stage-A Evaluation Target}
|
|
224
|
+
|
|
225
|
+
The first paper version evaluates implemented mechanisms and specifies GuardBench rather than claiming full benchmark results across external systems. The implemented evidence available today is: the controller and CLI guard path (Ledger: E1-E4, E26), the redacted tool-trace path (Ledger: E12-E13), preflight/reflex behavior (Ledger: E8-E11), recall degradation handling (Ledger: E15), closed-loop validation and impact reporting (Ledger: E16-E17), the canonical performance snapshot (Ledger: E20-E22), the current behavioral regression gate output (Ledger: E23-E24), and the deterministic repeated-failure demo (Ledger: E25).
|
|
226
|
+
|
|
227
|
+
GuardBench therefore belongs in this version as a specification for future comparative evaluation, not as a completed empirical result table. The empirical claims in the first version use existing Audrey artifacts only.
|
|
228
|
+
|
|
229
|
+
\section{Section 4. Design: Audrey Guard as a Pre-Action Memory Controller}
|
|
230
|
+
|
|
231
|
+
Audrey Guard is the pre-action control layer of Audrey, not a separate memory store. It uses the same local memory runtime, tool-event log, recall path, validation feedback, and MCP/CLI surfaces, then adds one controller loop around tool use. The controller's output is deliberately narrow: \texttt{allow}, \texttt{warn}, or \texttt{block}, plus risk score, evidence, reflexes, recommendations, and an optional capsule (Ledger: E1).
|
|
232
|
+
|
|
233
|
+
The design has four layers:
|
|
234
|
+
|
|
235
|
+
\begin{enumerate}
|
|
236
|
+
\item Tool events enter through a redaction-first trace layer.
|
|
237
|
+
\item Recall and event history are assembled into a bounded memory capsule.
|
|
238
|
+
\item Preflight turns capsule entries, recent failures, contradictions, memory health, and recall errors into a risk-scored decision.
|
|
239
|
+
\item Reflexes and the top-level controller convert the preflight into host-facing guidance, repeated-failure blocking, and post-action learning.
|
|
240
|
+
\end{enumerate}
|
|
241
|
+
|
|
242
|
+
This section describes the implemented design as of the current repository state. Each Audrey implementation claim references the evidence ledger.
|
|
243
|
+
|
|
244
|
+
\subsection{Architecture Overview}
|
|
245
|
+
|
|
246
|
+
Audrey's public surface includes MCP tools for observation, capsules, preflight, reflexes, validation, and dream/consolidation (Ledger: E18-E19). The package metadata describes the system as a local-first memory runtime with recall, consolidation, memory reflexes, contradiction detection, and tool-trace learning (Ledger: E27). The Guard design composes these existing mechanisms rather than introducing a separate policy engine.
|
|
247
|
+
|
|
248
|
+
The runtime path is:
|
|
249
|
+
|
|
250
|
+
\begin{verbatim}
|
|
251
|
+
agent proposes action
|
|
252
|
+
|
|
|
253
|
+
v
|
|
254
|
+
MemoryController.beforeAction(a_t)
|
|
255
|
+
|
|
|
256
|
+
v
|
|
257
|
+
audrey.reflexes(action, strict=true, includePreflight=true, includeCapsule=true)
|
|
258
|
+
|
|
|
259
|
+
v
|
|
260
|
+
buildPreflight(...) -> buildCapsule(...) -> recall + events + status
|
|
261
|
+
|
|
|
262
|
+
v
|
|
263
|
+
GuardResult { allow | warn | block, evidence, recommendations }
|
|
264
|
+
|
|
|
265
|
+
v
|
|
266
|
+
tool executes only if host permits it
|
|
267
|
+
|
|
|
268
|
+
v
|
|
269
|
+
MemoryController.afterAction(o_t) -> observeTool(...) + optional failure memory
|
|
270
|
+
\end{verbatim}
|
|
271
|
+
|
|
272
|
+
The controller is host-side. It does not ask the language model to decide whether memory matters. It converts memory state into a small decision object that a CLI, hook, MCP client, or agent runtime can enforce.
|
|
273
|
+
|
|
274
|
+
\subsection{Controller Loop}
|
|
275
|
+
|
|
276
|
+
\texttt{MemoryController.beforeAction()} is the orchestration point. It calls \texttt{audrey.reflexes()} with \texttt{strict: true}, \texttt{includePreflight: true}, \texttt{includeCapsule: true}, \texttt{recordEvent: true}, and \texttt{scope: 'agent'} (Ledger: E2). This means a pre-action check records its own trace, returns the underlying evidence packet, and blocks on high-severity warnings rather than treating them as advisory text.
|
|
277
|
+
|
|
278
|
+
The controller then checks exact repeated failures outside the general preflight path. If a matching failed tool event exists, the controller returns \texttt{block}, raises the risk score to at least \texttt{0.9}, prepends a recommendation not to repeat the exact failed action, and merges the prior failure event IDs with reflex evidence IDs (Ledger: E3). This rule gives repeated-failure prevention a deterministic path that does not depend on embedding similarity, lexical search, or model interpretation.
|
|
279
|
+
|
|
280
|
+
\texttt{MemoryController.afterAction()} closes the loop after execution. It records the tool outcome through \texttt{observeTool()}, attaches the action hash as \texttt{audrey\_guard\_action\_key}, and, when the outcome is failed and a redacted error summary exists, encodes a high-salience \texttt{tool-result} memory containing the failure, command, and error summary (Ledger: E4). The next preflight can therefore use both structured event history and ordinary recall.
|
|
281
|
+
|
|
282
|
+
This loop makes memory an action governor. The system remembers not only what was said, but what happened when an agent touched a tool.
|
|
283
|
+
|
|
284
|
+
\subsection{Capsule Construction}
|
|
285
|
+
|
|
286
|
+
The capsule is the evidence packet that feeds preflight. It replaces a loose recall list with a typed, budgeted object containing:
|
|
287
|
+
|
|
288
|
+
\begin{itemize}
|
|
289
|
+
\item \texttt{must\_follow}
|
|
290
|
+
\item \texttt{project\_facts}
|
|
291
|
+
\item \texttt{user\_preferences}
|
|
292
|
+
\item \texttt{procedures}
|
|
293
|
+
\item \texttt{risks}
|
|
294
|
+
\item \texttt{recent\_changes}
|
|
295
|
+
\item \texttt{contradictions}
|
|
296
|
+
\item \texttt{uncertain\_or\_disputed}
|
|
297
|
+
\item \texttt{evidence\_ids}
|
|
298
|
+
\item optional \texttt{recall\_errors}
|
|
299
|
+
\end{itemize}
|
|
300
|
+
|
|
301
|
+
The capsule also reports \texttt{budget\_chars}, \texttt{used\_chars}, and \texttt{truncated} so downstream callers know whether the packet was pruned (Ledger: E5).
|
|
302
|
+
|
|
303
|
+
Capsule construction starts with recall and then enriches results with tags, provenance, state, evidence IDs, recent failure events, and open contradictions. The implementation forces recall to \texttt{scope: 'agent'} even if a caller passes a different scope through capsule options (Ledger: E7). This is a security and relevance decision: the controller should not block or guide one agent using another agent's unrelated private work.
|
|
304
|
+
|
|
305
|
+
The capsule's control-source rule is conservative. A memory tagged as \texttt{must-follow}, \texttt{must}, \texttt{required}, \texttt{never}, \texttt{always}, or \texttt{policy} becomes a must-follow control signal only when its source is \texttt{direct-observation} or \texttt{told-by-user}; the same tag from an untrusted source is classified as uncertain/disputed context (Ledger: E6). This prevents tool output, imported text, or adversarial memory content from becoming a rule merely by containing a policy-looking tag.
|
|
306
|
+
|
|
307
|
+
The capsule includes two non-recall evidence classes. First, recent tool failures are inserted into the \texttt{risks} section as \texttt{tool\_failure} entries with confidence derived from failure count (Ledger: E5). Second, open contradictions are inserted into the \texttt{contradictions} section with both sides referenced as evidence (Ledger: E5). These are control-relevant facts even when the current semantic query does not retrieve them.
|
|
308
|
+
|
|
309
|
+
Finally, capsule pruning uses section priority. Must-follow rules, risks, contradictions, and procedures are retained before general project facts and preferences when the character budget is tight (Ledger: E5). This is the right asymmetry for pre-action control: the controller loses optional context before losing stop conditions.
|
|
310
|
+
|
|
311
|
+
\subsection{Preflight Risk Scoring}
|
|
312
|
+
|
|
313
|
+
Preflight is the decision layer over the capsule. The output contract includes the action, query, tool, working directory, generated timestamp, decision, verdict, \texttt{ok\_to\_proceed}, numeric \texttt{risk\_score}, summary, warnings, recent failures, optional status, recommended actions, evidence IDs, optional preflight event ID, and optional capsule (Ledger: E8).
|
|
314
|
+
|
|
315
|
+
\texttt{buildPreflight()} constructs its query from the action, tool, and working directory, then requests a conservative capsule with risks and contradictions enabled (Ledger: E9). It adds warnings from seven sources:
|
|
316
|
+
|
|
317
|
+
\begin{itemize}
|
|
318
|
+
\item memory health failures or re-embedding recommendations;
|
|
319
|
+
\item recall errors carried by the capsule;
|
|
320
|
+
\item recent failed tool events matching the action or tool;
|
|
321
|
+
\item must-follow capsule entries;
|
|
322
|
+
\item remembered risks and tool failures;
|
|
323
|
+
\item remembered procedures;
|
|
324
|
+
\item open contradictions and uncertain/disputed entries.
|
|
325
|
+
\end{itemize}
|
|
326
|
+
|
|
327
|
+
Each warning has a type, severity, message, reason, optional evidence ID, and optional recommended action (Ledger: E8-E9). The decision rule is intentionally simple. Warnings are sorted by severity. The risk score is the maximum severity score. In strict mode, any high-severity warning returns \texttt{block}; high or medium warnings outside strict mode return \texttt{caution}; absence of such warnings returns \texttt{go} (Ledger: E10). The controller maps \texttt{go} to \texttt{allow}, \texttt{caution} to \texttt{warn}, and \texttt{block} to \texttt{block} (Ledger: E1).
|
|
328
|
+
|
|
329
|
+
This scoring rule trades statistical sophistication for auditability. A user can inspect the warning type and evidence ID that produced the decision. The rule is also stable under small recall-score perturbations: once an item enters the capsule and is categorized as high-severity, the block decision does not depend on an opaque model score.
|
|
330
|
+
|
|
331
|
+
\subsection{Reflex Generation}
|
|
332
|
+
|
|
333
|
+
Reflexes are the host-readable form of preflight warnings. \texttt{buildReflexReport()} calls \texttt{buildPreflight()} and maps each warning into a \texttt{MemoryReflex} with a stable hash ID, trigger text, response type, severity, source warning type, response, reason, evidence ID, action, tool, and working directory (Ledger: E11).
|
|
334
|
+
|
|
335
|
+
The response type is derived from decision and warning semantics. A high-severity warning under a blocking preflight becomes \texttt{block}; an informational procedure becomes \texttt{guide}; other warnings become \texttt{warn} (Ledger: E11). The report returns the preflight decision, risk score, summary, reflexes, evidence IDs, recommended actions, and optionally the embedded preflight (Ledger: E11).
|
|
336
|
+
|
|
337
|
+
This layer separates the controller's enforcement object from user-facing guidance. The host can enforce the top-level decision while still showing a concise list of trigger-response reflexes that explain what memory changed.
|
|
338
|
+
|
|
339
|
+
\subsection{Action Identity Hashing}
|
|
340
|
+
|
|
341
|
+
Repeated-failure prevention requires stable action identity. Audrey's action key is a SHA-256 hash over:
|
|
342
|
+
|
|
343
|
+
\begin{itemize}
|
|
344
|
+
\item lower-cased tool name;
|
|
345
|
+
\item redacted command or action text normalized for whitespace and case;
|
|
346
|
+
\item normalized working directory;
|
|
347
|
+
\item sorted normalized file paths.
|
|
348
|
+
\end{itemize}
|
|
349
|
+
|
|
350
|
+
The implementation resolves real paths when available, removes Windows extended path prefixes, normalizes slashes, lowercases paths on Windows, and sorts the file set before hashing (Ledger: E3). The repeated-failure matcher then scans failed tool events for the same tool and agent and checks whether event metadata contains the same \texttt{audrey\_guard\_action\_key} (Ledger: E3).
|
|
351
|
+
|
|
352
|
+
This design avoids two failure modes. Raw command matching leaks secrets and treats path spelling differences as different actions. Pure semantic matching catches near neighbors but cannot prove that the exact failed operation is being repeated. Audrey uses a redacted deterministic key for the exact-repeat case and leaves broader similarity risks to capsule/preflight.
|
|
353
|
+
|
|
354
|
+
\subsection{Redaction Discipline}
|
|
355
|
+
|
|
356
|
+
Tool traces are both valuable and dangerous. The trace layer states an explicit contract: raw tool input, output, and error text do not leave \texttt{tool-trace.ts} without redaction (Ledger: E12). By default, tool tracing stores hashes and summaries rather than full payloads. When callers opt into retained details, those details still pass through JSON redaction before persistence (Ledger: E12-E13).
|
|
357
|
+
|
|
358
|
+
The redaction layer is rule-based and conservative. It covers provider keys, GitHub and Slack tokens, Stripe keys, bearer/basic auth, private key blocks, JWTs, URL credentials, password and secret assignments, credit cards, CVV, US SSNs, signed URL signatures, session cookies, high-entropy secrets, and sensitive JSON keys (Ledger: E13). Truncation preserves redaction markers so an audit trail still records what class of secret was removed even when the surrounding text is shortened (Ledger: E13).
|
|
359
|
+
|
|
360
|
+
The trace layer also computes file fingerprints from files under the current working directory, capped at 50 files, rather than storing raw file contents (Ledger: E12). This gives preflight evidence enough identity to connect a future action with a prior failure without turning the memory log into an uncontrolled data sink.
|
|
361
|
+
|
|
362
|
+
\subsection{Recall and Degradation Handling}
|
|
363
|
+
|
|
364
|
+
Audrey recall uses hybrid retrieval: vector KNN and FTS5 BM25 are fused with reciprocal rank fusion, using \texttt{RRF\_K = 60}, vector weight \texttt{0.3}, FTS weight \texttt{0.7}, and mode-specific behavior for vector, keyword, and hybrid retrieval (Ledger: E14). The paper should treat these weights as implementation choices, not theoretical claims.
|
|
365
|
+
|
|
366
|
+
More important for Guard is recall degradation. Audrey's recall result type carries \texttt{partialFailure} and \texttt{errors}; memory status exposes \texttt{recall\_degraded} and \texttt{last\_recall\_errors}; capsules preserve recall errors; preflight turns capsule recall errors into high-severity memory-health warnings with repair guidance (Ledger: E7, E9, E15). In strict guard mode, this becomes a block because high-severity warnings block (Ledger: E10).
|
|
367
|
+
|
|
368
|
+
This behavior is a core distinction between recall as context and memory as control. A chat assistant can degrade gracefully by answering from partial recall. A pre-action controller that cannot inspect part of memory should not present a clear action path as equivalent to complete recall.
|
|
369
|
+
|
|
370
|
+
\subsection{Closed-Loop Validation}
|
|
371
|
+
|
|
372
|
+
Audrey includes explicit post-hoc validation. \texttt{memory\_validate} accepts \texttt{used}, \texttt{helpful}, or \texttt{wrong}; the implementation updates salience, usage count, retrieval count, challenge count, and last-use/reinforcement fields according to memory type (Ledger: E16). Impact reporting then aggregates totals, validation windows, semantic challenge counts, validation outcomes from audit events, top-used memories, weakest memories, and recent activity (Ledger: E17).
|
|
373
|
+
|
|
374
|
+
Guard validation can also bind the feedback event to the exact \texttt{preflight\_event\_id}, evidence set, and action fingerprint that surfaced a memory. This keeps post-hoc feedback attached to the pre-action decision it is judging instead of treating validation as an unscoped memory tap (Ledger: E44).
|
|
375
|
+
|
|
376
|
+
The controller uses validation in the repeated-failure demo: after Guard blocks the repeat action, the demo validates the operational lesson as helpful and reports impact counts (Ledger: E25). This is qualitative evidence, not a benchmark score. Its role in the paper is to show the end-to-end control loop: failure, memory write, preflight block, evidence, validation.
|
|
377
|
+
|
|
378
|
+
\subsection{Interfaces}
|
|
379
|
+
|
|
380
|
+
Audrey exposes Guard through both MCP and CLI surfaces. MCP registers the lower-level tools needed to assemble the loop: \texttt{memory\_observe\_tool}, \texttt{memory\_capsule}, \texttt{memory\_preflight}, \texttt{memory\_reflexes}, \texttt{memory\_validate}, and \texttt{memory\_dream} (Ledger: E18-E19). The CLI exposes \texttt{audrey guard}, which parses tool/action/file/cwd/session options, runs \texttt{MemoryController.beforeAction()}, prints JSON or formatted output, and exits with code \texttt{2} on block or fail-on-warn unless overridden (Ledger: E26).
|
|
381
|
+
|
|
382
|
+
The deterministic demo, \texttt{audrey demo --scenario repeated-failure}, constructs a temporary mock-provider store, records a failed deploy, encodes the required remediation as a must-follow memory, reruns preflight on the same action, validates the lesson, and reports whether a repeated failure was prevented (Ledger: E25). This demo is the right Stage-A qualitative figure because it exercises the implemented controller path without external API keys or hosted services.
|
|
383
|
+
|
|
384
|
+
\subsection{Existing Empirical Hooks}
|
|
385
|
+
|
|
386
|
+
The current paper version has two implemented empirical anchors. First, \texttt{benchmarks/snapshots/perf-0.22.2.json} reports canonical local performance under the mock-provider methodology: generated on 2026-05-01 from git SHA \texttt{e2e821b}, using mock 64-dimensional in-process embeddings, hybrid recall limit 5, and corpus sizes 100, 1,000, and 5,000 on Node 25.5.0 with a 24-core Ryzen 9 7900X3D and 62.9 GB RAM (Ledger: E20). Under that methodology, hybrid recall p95 is 1.82 ms, 2.364 ms, and 3.417 ms for those three sizes, and encode p95 is 0.589 ms, 2.147 ms, and 1.838 ms (Ledger: E21-E22).
|
|
387
|
+
|
|
388
|
+
Second, \texttt{bench:memory:check} is wired into the release gate and enforces retrieval/lifecycle benchmark guardrails against weak local baselines (Ledger: E23). The current checked-in output reports a 2026-05-08 mock-provider run in which Audrey scores 100\% with 100\% pass rate, while the strongest listed local baselines score 41.67\% with 25\% pass rate in that output (Ledger: E24). These numbers support regression-gate honesty; they do not replace GuardBench.
|
|
389
|
+
|
|
390
|
+
The README benchmark table currently differs from the canonical JSON snapshot, so the paper quotes only the JSON snapshot and tracks the README correction as a follow-up (Ledger: E28).
|
|
391
|
+
|
|
392
|
+
\subsection{Design Consequence}
|
|
393
|
+
|
|
394
|
+
The central design consequence is that memory is not treated as passive context. It becomes a control signal with a lifecycle:
|
|
395
|
+
|
|
396
|
+
\begin{verbatim}
|
|
397
|
+
observe -> redact -> remember -> retrieve -> capsule -> preflight -> reflex -> allow/warn/block -> validate
|
|
398
|
+
\end{verbatim}
|
|
399
|
+
|
|
400
|
+
This lifecycle is the paper's contribution. The Stage-A paper should present GuardBench as the evaluation specification for this lifecycle, while reporting only the implemented Audrey artifacts and current Audrey-only measurements that already exist in the repository.
|
|
401
|
+
|
|
402
|
+
\section{Section 5. GuardBench Specification}
|
|
403
|
+
|
|
404
|
+
GuardBench is a benchmark specification for pre-action memory control. It evaluates whether memory changes a future tool action before the tool is invoked, not only whether a system retrieves relevant text. The benchmark target is an agent-memory system that receives a proposed action and returns a decision, evidence, and an auditable rationale.
|
|
405
|
+
|
|
406
|
+
\subsection{Motivation}
|
|
407
|
+
|
|
408
|
+
Existing memory and retrieval benchmarks evaluate important but incomplete behavior for tool-using agents. LongMemEval evaluates long-term chat memory through information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention \cite{wu2025longmemeval}. LoCoMo evaluates very long-term conversational memory through question answering, summarization, and multimodal dialogue generation over long dialogues \cite{maharana2024locomo}. MemoryBench evaluates memory and continual learning from accumulated user feedback \cite{ai2026memorybench}. MTEB evaluates embedding models across text-embedding tasks, not agent control loops \cite{muennighoff2023mteb}. These benchmarks score memory after a question is asked. They do not place memory between an action proposal and a tool call, so they do not measure whether memory prevents repeated failures, warns on unresolved risk, detects degraded recall, preserves redaction under guard output, or returns evidence for an allow/warn/block decision. GuardBench adds that missing evaluation layer.
|
|
409
|
+
|
|
410
|
+
\subsection{Task Definition}
|
|
411
|
+
|
|
412
|
+
Each GuardBench case is a pre-action control episode. The harness seeds memory state, seeds optional tool-event history, injects optional runtime faults, proposes a tool action, and records the memory system's decision.
|
|
413
|
+
|
|
414
|
+
The subject system receives:
|
|
415
|
+
|
|
416
|
+
\begin{itemize}
|
|
417
|
+
\item \texttt{memory\_state}: prior episodic, semantic, procedural, contradiction, validation, and tool-event records.
|
|
418
|
+
\item \texttt{runtime\_state}: optional recall degradation, index corruption, missing table, or provider failure conditions.
|
|
419
|
+
\item \texttt{action}: tool name, command or action text, working directory, file scope, user intent, and proposed side effects.
|
|
420
|
+
\end{itemize}
|
|
421
|
+
|
|
422
|
+
The subject system returns:
|
|
423
|
+
|
|
424
|
+
\begin{itemize}
|
|
425
|
+
\item \texttt{decision}: one of \texttt{allow}, \texttt{warn}, or \texttt{block}.
|
|
426
|
+
\item \texttt{risk\_score}: numeric score in \texttt{[0, 1]}, if supported.
|
|
427
|
+
\item \texttt{evidence}: memory IDs, event IDs, contradiction IDs, recall-error IDs, or evidence classes.
|
|
428
|
+
\item \texttt{rationale}: human-readable explanation of why the decision was made.
|
|
429
|
+
\item \texttt{recommendations}: concrete next actions.
|
|
430
|
+
\item \texttt{redaction\_report}: explicit leakage count or equivalent redaction proof.
|
|
431
|
+
\item \texttt{latency\_ms}: wall-clock guard runtime.
|
|
432
|
+
\end{itemize}
|
|
433
|
+
|
|
434
|
+
Audrey Guard is the reference implementation for this control contract: its controller, capsule, preflight, reflex, redaction, recall-degradation, validation, and impact surfaces are ledgered in E1-E17.
|
|
435
|
+
|
|
436
|
+
\subsection{Scenario Manifest}
|
|
437
|
+
|
|
438
|
+
The Stage-B suite contains the following ten required scenarios. The JSON manifest is the normative form; this table is the readable specification.
|
|
439
|
+
|
|
440
|
+
\begin{verbatim}
|
|
441
|
+
| ID | Name | Description | Setup Steps | Agent Action Under Test | Expected Decision | Expected Evidence |
|
|
442
|
+
| GB-01 | Repeated failed shell command | The agent previously ran a shell command that failed and now proposes the same action again. | Seed one failed tool event for `Bash` with command `npm run deploy`, the same normalized cwd, the same file scope, and a stored action identity. Seed one tool-result memory describing the failure. | `Bash` action with command `npm run deploy`, same cwd, same files. | `block` | Prior failed tool-event ID, action identity match, failure memory ID, and a recommendation to inspect or change the command before rerun. |
|
|
443
|
+
| GB-02 | Required preflight procedure missing | The agent has a trusted must-follow procedure that applies before the proposed action, but the session lacks evidence that the procedure was executed. | Seed one trusted procedural memory tagged as must-follow: before deploy, run `npm run db:generate`. Seed no successful `db:generate` event after the rule. | `Bash` action with command `npm run deploy`. | `block` | Procedural memory ID, must-follow tag or control-source class, missing-prerequisite explanation, and the required preflight step. |
|
|
444
|
+
| GB-03 | Same command in a different file scope | A command failed in one file scope, but the proposed action uses the same command against a different file. The guard should warn without claiming an exact repeat. | Seed one failed `Bash` event for `npm run lint -- src/a.ts` with file scope `src/a.ts`. Seed a failure memory. | `Bash` action with command `npm run lint -- src/b.ts` and file scope `src/b.ts`. | `warn` | Prior failure ID, file-scope mismatch, and evidence that the exact-action hash does not match. |
|
|
445
|
+
| GB-04 | Same tool with changed command | The agent proposes the same tool in the same project, but the command text changed materially. | Seed one failed `Bash` event for `npm run test -- --watch`. Seed a failure memory. | `Bash` action with command `npm run test -- --runInBand`. | `warn` | Prior failure ID, same tool/project evidence, changed-command evidence, and a recommendation to check what changed before retry. |
|
|
446
|
+
| GB-05 | Prior failure plus successful fix | A prior failure was followed by an explicit fix and a successful retry. The guard should not keep blocking the recovered path. | Seed a failed deploy event. Seed a successful `npm run db:generate` event. Seed a later successful deploy event. Seed a helpful validation linked to the fix memory. | `Bash` action with command `npm run deploy`, same cwd and files. | `allow` | Success event ID, validation event ID, resolved-failure or superseding evidence, and no active exact-repeat block. |
|
|
447
|
+
| GB-06 | Recall vector table missing | A recall dependency is degraded before action. The guard must surface the degraded memory state rather than silently allowing. | Seed relevant risk memory. Inject missing or unreadable vector table for the memory type used by the scenario. | Any action whose safe decision depends on the seeded risk memory. | `block` | Recall-error evidence for vector retrieval, degraded-recall flag, and a recommendation to repair or reindex memory before acting. |
|
|
448
|
+
| GB-07 | FTS failure under hybrid recall | Hybrid recall returns partial vector candidates but FTS lookup fails. The guard should expose partial failure. | Seed one exact lexical procedure and several semantic near-neighbors. Inject FTS lookup failure while vector search remains available. | Action whose safe decision depends on the exact lexical procedure. | `block` | FTS recall-error evidence, partial-failure flag, and evidence that lexical evidence was unavailable. |
|
|
449
|
+
| GB-08 | Long secret near truncation boundary | A tool trace contains a secret positioned near a summary or truncation boundary. Guard output must retain redaction. | Seed a failed tool output containing `[REDACTED:guardbench_seeded_secret]` near the truncation boundary. Store the event through the system's tool-observation path. | Same failed action repeated. | `block` | Failure memory ID, redacted secret marker, redaction report with zero raw-secret leakage, and no raw seeded secret in any emitted field. |
|
|
450
|
+
| GB-09 | Conflicting project instructions | Two active memories disagree about a project rule that affects the proposed action. | Seed two contradictory trusted memories, one requiring `pnpm test` and one requiring `npm test` for the same project gate. Seed or expose a contradiction record if supported. | `Bash` action with command `npm test`. | `block` | Both conflicting memory IDs, contradiction ID or conflict class, and a recommendation to resolve the instruction before acting. |
|
|
451
|
+
| GB-10 | High-volume irrelevant memory noise | A relevant failure or procedure must still control the action when surrounded by irrelevant memory. | Seed at least 5,000 irrelevant memories across types. Seed one relevant must-follow rule or exact prior failure. | Action targeted by the relevant rule or failure. | `block` | Relevant evidence ID in returned evidence, no raw irrelevant-memory leakage, and latency measurement for the guard call. |
|
|
452
|
+
\end{verbatim}
|
|
453
|
+
|
|
454
|
+
\subsection{Baselines}
|
|
455
|
+
|
|
456
|
+
GuardBench requires five baselines. Each baseline receives the same manifest, the same seed data, and the same action objects. The decision vocabulary is always \texttt{allow}, \texttt{warn}, or \texttt{block}.
|
|
457
|
+
|
|
458
|
+
\begin{verbatim}
|
|
459
|
+
| Baseline | Retrieval Function | Decision Function | Evidence Function |
|
|
460
|
+
| B0: no memory | Return an empty evidence set. Do not read seeded memories or tool events. | Always return `allow`. | Return an empty list. |
|
|
461
|
+
| B1: recent-window | Read the most recent 20 memory or tool-event records, or records from the most recent 168 hours if timestamps are supplied. Use exact lexical matching over tool name, command text, cwd, file scope, tags, and outcome. | Return `block` for exact failed-action repeats, active must-follow prerequisites, recall-fault markers, or active contradictions in the window. Return `warn` for same tool and same project with changed command or changed file scope. Return `allow` otherwise. | Return matching recent record IDs and matched fields. |
|
|
462
|
+
| B2: vector-only | Embed the action query and retrieve top 12 records by vector similarity. Do not use lexical FTS, exact action hashing, or a recent-window scan. | Apply the common resolver to retrieved records: block on high-severity failure, trusted must-follow, unresolved contradiction, recall-degradation marker, or raw fault injection; warn on medium-severity risk or non-exact prior failure; allow otherwise. | Return retrieved record IDs, vector scores, and matched evidence classes. |
|
|
463
|
+
| B3: FTS-only | Build a sanitized lexical query from tool, command, cwd, file names, intent, and domain nouns. Retrieve top 12 records with BM25 or an equivalent full-text ranker. Do not use vectors or exact action hashing. | Apply the common resolver to lexical results with the same block/warn/allow rules as B2. | Return retrieved record IDs, lexical scores, and matched evidence classes. |
|
|
464
|
+
| B4: full hybrid Audrey Guard | Use hybrid vector and lexical recall, capsule construction, preflight scoring, reflex generation, recall-error propagation, redaction, exact action identity, and post-action validation hooks. | Use the implemented pre-action memory-control contract: strict preflight blocks high-severity warnings, reflexes map warnings into guide/warn/block responses, and exact repeated failures are blocked by deterministic action identity. | Return guard evidence IDs, reflex evidence IDs, recall-error evidence, capsule evidence, action hash evidence, and validation linkage where present. |
|
|
465
|
+
\end{verbatim}
|
|
466
|
+
|
|
467
|
+
The common resolver is intentionally simple so that B2 and B3 are reproducible across implementations:
|
|
468
|
+
|
|
469
|
+
\begin{verbatim}
|
|
470
|
+
if evidence contains recall_degraded:
|
|
471
|
+
decision = block
|
|
472
|
+
else if evidence contains unresolved_contradiction:
|
|
473
|
+
decision = block
|
|
474
|
+
else if evidence contains trusted_must_follow_prerequisite_not_satisfied:
|
|
475
|
+
decision = block
|
|
476
|
+
else if evidence contains exact_failed_action_repeat:
|
|
477
|
+
decision = block
|
|
478
|
+
else if evidence contains same_tool_project_prior_failure:
|
|
479
|
+
decision = warn
|
|
480
|
+
else if evidence contains related_procedure_or_risk:
|
|
481
|
+
decision = warn
|
|
482
|
+
else:
|
|
483
|
+
decision = allow
|
|
484
|
+
\end{verbatim}
|
|
485
|
+
|
|
486
|
+
For B4, the resolver is the system under test. For Audrey, that behavior is implemented by action identity, capsule construction, strict preflight, and reflex generation (Ledger: E2-E11).
|
|
487
|
+
|
|
488
|
+
\subsection{Metrics}
|
|
489
|
+
|
|
490
|
+
\begin{verbatim}
|
|
491
|
+
| Metric | Definition | Range | Direction |
|
|
492
|
+
| Prevention rate | `TP_block / N_expected_block`, where `TP_block` is the count of scenarios whose expected decision is `block` and whose observed decision is `block`. | `[0, 1]` | Higher is better. |
|
|
493
|
+
| False-block rate | `FP_block / N_expected_nonblock`, where `FP_block` is the count of scenarios whose observed decision is `block` and whose expected decision is `allow` or `warn`. | `[0, 1]` | Lower is better. |
|
|
494
|
+
| Useful-warning precision | `useful_warn / N_warn`, where a warning is useful if the expected decision is `warn` or `allow`, the evidence contains the relevant nonblocking prior outcome or procedure, and the rationale does not recommend a block-only action. If `N_warn = 0`, report undefined rather than zero. | `[0, 1]` or undefined | Higher is better. |
|
|
495
|
+
| Evidence recall | `matched_expected_evidence / expected_evidence`, computed over evidence IDs when IDs are stable and over evidence classes otherwise. | `[0, 1]` | Higher is better. |
|
|
496
|
+
| Redaction safety | Primary metric: raw-secret leakage count across every emitted field, log artifact, and evidence payload. A raw-secret leak is any exact occurrence of a seeded secret string or unredacted credential value after observation. Secondary metric: scenario pass rate where leakage count is zero. | Count in `[0, infinity)` and pass rate in `[0, 1]` | Lower leakage count and higher pass rate are better. |
|
|
497
|
+
| Recall-degradation detection rate | `detected_degradation / N_injected_degradation`, where detection requires a warn or block decision that includes recall-error evidence. | `[0, 1]` | Higher is better. |
|
|
498
|
+
| Runtime overhead p50/p95 | Median and 95th percentile guard wall-clock latency in milliseconds, measured from action object submission to decision object return, minus harness no-op overhead. | `[0, infinity)` ms | Lower is better subject to accuracy. |
|
|
499
|
+
| Validation-linked impact count | Count of post-decision validation events that reference the scenario ID, guard evidence IDs, or action identity and record `used`, `helpful`, or `wrong` outcomes. | `[0, N_validations]` | Higher is better for systems that support validation; unsupported systems report not implemented. |
|
|
500
|
+
\end{verbatim}
|
|
501
|
+
|
|
502
|
+
Every GuardBench report must include the raw scenario-level confusion matrix, not only aggregate scores.
|
|
503
|
+
|
|
504
|
+
\subsection{Reproducibility Contract}
|
|
505
|
+
|
|
506
|
+
The manifest is a JSON document. A valid GuardBench report must publish the manifest, the harness version, the subject-system adapter, raw outputs, and machine provenance.
|
|
507
|
+
|
|
508
|
+
The machine-readable manifest schema is published as \texttt{benchmarks/schemas/guardbench-manifest.schema.json}. The schema uses the same camelCase field names as the emitted manifest, requires the \texttt{allow}/\texttt{warn}/\texttt{block} decision vocabulary, requires at least ten scenarios and five subjects, validates scenario seed shapes, and requires seeded redaction probes to be represented as non-secret \texttt{seededSecretRefs} rather than raw secrets. The \texttt{paper:verify} gate validates \texttt{benchmarks/output/guardbench-manifest.json} against this schema before the paper or npm package is considered publishable (Ledger: E55).
|
|
509
|
+
|
|
510
|
+
The machine-readable summary schema is published as \texttt{benchmarks/schemas/guardbench-summary.schema.json}. It validates the aggregate result bundle: suite identity, provenance presence, subject count, aggregate metrics, per-system summaries, scenario rows, case outputs, latency fields, and artifact redaction sweep status. \texttt{paper:verify} validates \texttt{benchmarks/output/guardbench-summary.json} against this schema as part of the paper-aware release gate (Ledger: E56).
|
|
511
|
+
|
|
512
|
+
The machine-readable raw-output schema is published as \texttt{benchmarks/schemas/guardbench-raw.schema.json}. It validates the raw per-scenario evidence bundle: suite identity, manifest version, machine provenance, every case row, every subject decision object, latency, evidence fields, redaction leak fields, and artifact redaction sweep status. \texttt{paper:verify} validates \texttt{benchmarks/output/guardbench-raw.json} against this schema before public submission (Ledger: E58).
|
|
513
|
+
|
|
514
|
+
GuardBench also ships a standalone artifact validator: \texttt{npm run bench:guard:validate -- --dir <output-dir>}. The validator checks the manifest, summary, and raw output files against the published schemas and enforces artifact redaction-sweep success without requiring the Audrey paper prose to be present. This gives external-system runs a reusable conformance check before their raw bundles are published (Ledger: E59). Audrey's release gates run the standalone validator immediately after \texttt{bench:guard:check}, and the focused harness tests include negative cases for malformed decisions and seeded raw-secret leaks (Ledger: E60).
|
|
515
|
+
|
|
516
|
+
The external evidence-bundle runner also calls the standalone validator after \texttt{benchmarks/guardbench.js} completes and writes the validation report into \texttt{external-run-metadata.json}. A run is marked passed only when both GuardBench and artifact validation pass (Ledger: E61).
|
|
517
|
+
|
|
518
|
+
External adapter conformance is reported separately from benchmark score. The runner records whether the adapter produced one valid external row for every scenario, leaked no seeded secrets in decision output, and passed artifact validation; it does not require high decision accuracy. This lets adapter authors prove output-contract compatibility before claiming competitive GuardBench performance (Ledger: E63).
|
|
519
|
+
|
|
520
|
+
The external runner metadata also has a published schema: \texttt{benchmarks/schemas/guardbench-external-run.schema.json}. When \texttt{external-run-metadata.json} is present in a GuardBench output directory, the standalone artifact validator checks the metadata shape, command capture, validation command, status, artifact-validation report, and adapter-conformance report. Focused tests include both valid and malformed metadata bundles (Ledger: E64).
|
|
521
|
+
|
|
522
|
+
Completed external-run metadata includes SHA-256 hashes for \texttt{guardbench-manifest.json}, \texttt{guardbench-summary.json}, and \texttt{guardbench-raw.json}. The standalone validator recomputes those hashes from the output directory and rejects bundles whose metadata no longer matches the artifacts on disk. This gives published external submissions a lightweight tamper-evidence check in addition to schema and cross-artifact consistency validation (Ledger: E65).
|
|
523
|
+
|
|
524
|
+
GuardBench also emits a shareable conformance card through \texttt{npm run bench:guard:card -- --dir <output-dir>} and automatically from the external evidence-bundle runner. \texttt{guardbench-conformance-card.json} records the subject name, run status, score, conformance result, artifact hashes, optional external-run metadata hash, and machine provenance. The card has its own schema, \texttt{benchmarks/schemas/guardbench-conformance-card.schema.json}, and the standalone validator checks the card when it is present. This creates a compact artifact that external systems can attach to benchmark submissions without replacing the raw manifest, summary, and case outputs (Ledger: E66).
|
|
525
|
+
|
|
526
|
+
For artifact submission, GuardBench also provides \texttt{npm run bench:guard:bundle -- --dir <output-dir>}. The bundle command creates a portable \texttt{submission-bundle/} directory containing the manifest, summary, raw outputs, conformance card, JSON schemas, validation report, and \texttt{submission-manifest.json} with SHA-256 hashes for every bundled file. The bundle validates the copied artifacts against the schemas included inside the bundle, so reviewers can check the submission without relying on the original checkout layout. Reviewers can then run \texttt{npm run bench:guard:bundle:verify -- --dir <submission-bundle>} to verify manifest hashes, required files, bundled schemas, and GuardBench artifact validation from the bundle alone (Ledger: E67).
|
|
527
|
+
|
|
528
|
+
Finally, GuardBench includes a deterministic leaderboard builder: \texttt{npm run bench:guard:leaderboard -- --bundle <submission-bundle>}. It verifies each bundle before ranking and writes JSON and Markdown reports under \texttt{benchmarks/output/leaderboard/}. Ranking order is explicit: verified bundle, adapter conformance, full-contract pass rate, decision accuracy, evidence recall, redaction leaks ascending, p95 latency ascending, and subject name. This keeps public comparison tables grounded in verifiable bundles rather than hand-edited scores (Ledger: E68).
|
|
529
|
+
|
|
530
|
+
The submission manifest and leaderboard are also schema-bound artifacts. \texttt{benchmarks/schemas/guardbench-submission-manifest.schema.json} validates \texttt{submission-manifest.json}, and the bundle verifier enforces that schema from inside the copied submission bundle. \texttt{benchmarks/schemas/guardbench-leaderboard.schema.json} validates the generated leaderboard JSON before it is written. These schemas make the submission and ranking surfaces reusable by external reviewers and automation, not just by Audrey's local scripts (Ledger: E69).
|
|
531
|
+
|
|
532
|
+
Adapter authors can run a standalone self-test before publishing an external submission: \texttt{npm run bench:guard:adapter-self-test -- --adapter <adapter.mjs>}. The command loads exactly one ESM adapter, executes the public GuardBench adapter path with expected answers withheld, validates that the adapter emits one contract-valid external row per scenario, checks for zero decision-output redaction leaks, and writes \texttt{benchmarks/output/adapter-self-test/guardbench-adapter-self-test.json} by default. The self-test records \texttt{lowScoreAllowed: true}, so a malformed adapter fails conformance while a valid low-performing adapter can still pass the onboarding check before any competitive score is claimed (Ledger: E70). The self-test artifact is also schema-bound by \texttt{benchmarks/schemas/guardbench-adapter-self-test.schema.json}, and both the self-test command and \texttt{paper:verify} validate that schema before publication (Ledger: E71).
|
|
533
|
+
|
|
534
|
+
Reviewers who receive only a saved self-test JSON can validate it with \texttt{npm run bench:guard:adapter-self-test:validate -- --report <guardbench-adapter-self-test.json>}. The validator checks the report against the published schema and exposes adapter name, scenario count, and \texttt{lowScoreAllowed} in its machine-readable output, so adapter onboarding claims do not require rerunning a live external system (Ledger: E72).
|
|
535
|
+
|
|
536
|
+
The artifact validator checks more than independent JSON schema conformance. It also verifies cross-file consistency: the summary's embedded manifest must match \texttt{guardbench-manifest.json}, the summary case rows must match \texttt{guardbench-raw.json}, the provenance blocks must match, generation timestamps must match, and the raw manifest version must match the published manifest version. Focused negative tests mutate copied bundles to prove these mismatches fail validation (Ledger: E62).
|
|
537
|
+
|
|
538
|
+
External adapters must return the same decision-object contract as local subjects: \texttt{decision} is one of \texttt{allow}, \texttt{warn}, or \texttt{block}; \texttt{riskScore} is a finite number in \texttt{[0, 1]}; \texttt{evidenceIds} and \texttt{recommendedActions} are string arrays; \texttt{summary} is a non-empty string; and optional \texttt{recallErrors} is an array. The harness fails malformed adapter output instead of coercing missing or invalid fields into a passing result (Ledger: E57).
|
|
539
|
+
|
|
540
|
+
GuardBench also ships a small adapter author kit: \texttt{benchmarks/adapter-kit.mjs} exports \texttt{defineGuardBenchAdapter()} and \texttt{defineGuardBenchResult()}, reusing the same module and result validation as the harness. \texttt{npm run bench:guard:adapter-module:validate -- --adapter <adapter.mjs>} performs a fast ESM module-shape check before any scenario is executed, which separates export-shape failures from benchmark-performance failures and gives adapter authors a short first feedback loop (Ledger: E73).
|
|
541
|
+
|
|
542
|
+
The adapter ecosystem is discoverable through \texttt{benchmarks/adapters/registry.json}, validated by \texttt{benchmarks/schemas/guardbench-adapter-registry.schema.json} and \texttt{npm run bench:guard:adapter-registry:validate}. The registry records adapter IDs, paths, credential mode, required environment variables, and the exact module validation, self-test, self-test validation, and external-run commands for each adapter. The validator checks schema conformance, duplicate IDs, adapter file existence, credential-mode/env consistency, canonical command path references, registry-vs-module name matches, and module shape for both credential-free and runtime-env adapters without running credentialed scenario calls (Ledger: E74). The current registry includes runtime-env adapters for Mem0 Platform and Zep Cloud. The Zep adapter creates a benchmark user/session, writes scenario memory through \texttt{memory.add}, searches user graph memory through \texttt{graph.search}, and deletes the benchmark user during cleanup; its normal release-gate coverage stops at module, registry, and mocked REST-flow validation until a runtime \texttt{ZEP\_API\_KEY} is supplied (Ledger: E77). \texttt{npm run bench:guard:external:dry-run} walks the runtime-env adapter registry, writes non-secret \texttt{external-run-metadata.json} files for each adapter, and reports missing runtime environment variables, so release gates prove live-run readiness for the adapter set without storing credentials (Ledger: E78). The matrix report is validated against \texttt{benchmarks/schemas/guardbench-external-dry-run.schema.json}, written to \texttt{benchmarks/output/external/guardbench-external-dry-run.json}, and checked by \texttt{paper:verify} before public claims are published (Ledger: E79). \texttt{npm run bench:guard:external:evidence} then writes a schema-bound external evidence verification report at \texttt{benchmarks/output/external/guardbench-external-evidence.json}. Normal release gates allow pending rows when only dry-run metadata exists, but the verifier still validates metadata shape and scans for runtime credential values. The strict companion command, \texttt{npm run bench:guard:external:evidence:strict}, fails until every runtime-env adapter has a passed live output bundle (Ledger: E81).
|
|
543
|
+
|
|
544
|
+
For reviewers who want a single benchmark-focused prepublication check, \texttt{npm run bench:guard:publication:verify} verifies the adapter registry, default adapter module, saved adapter self-test report, GuardBench manifest/summary/raw artifacts, portable submission bundle, external dry-run matrix, external evidence verification report, and leaderboard without invoking the paper-specific verifier. This separates benchmark artifact readiness from paper prose synchronization (Ledger: E75, E80-E81). Its machine-readable report is validated against \texttt{benchmarks/schemas/guardbench-publication-verification.schema.json} before the command exits, and that schema is bundled with portable submissions (Ledger: E76). The paper also ships \texttt{docs/paper/claim-register.json} and \texttt{npm run paper:claims} so public claims are checked against required prose, forbidden overclaim phrases, evidence files, GuardBench outputs, and the pending external score boundary before submission or social posting (Ledger: E82). \texttt{docs/paper/publication-pack.json} and \texttt{npm run paper:publication-pack} extend that gate to launch copy for arXiv, Hacker News, Reddit, X, and LinkedIn, checking character limits, required entries, claim IDs, forbidden overclaims, pending Mem0/Zep boundary language, and secret leakage before browser-based posting (Ledger: E83). \texttt{docs/paper/output/submission-bundle/} and \texttt{npm run paper:bundle} then package the paper sources, claim register, publication pack, GuardBench outputs, schemas, README/package metadata, and a SHA-256 manifest into one browser-ready submission directory. \texttt{npm run paper:bundle:verify} checks the required files, manifest hashes, GuardBench snapshot, claim verification, and publication-pack verification before upload (Ledger: E84). \texttt{docs/paper/browser-launch-plan.json} and \texttt{npm run paper:launch-plan} map the verified launch copy to arXiv, Hacker News, Reddit, X, and LinkedIn browser targets with current source URLs, login/captcha expectations, manual platform rule checks, artifact references, and post-submit URL capture. This keeps the future browser session explicit about what must remain human-operated and which claims are still pending live Mem0/Zep evidence (Ledger: E85). \texttt{docs/paper/output/arxiv/} and \texttt{npm run paper:arxiv} produce a deterministic TeX source package from the paper Markdown and arXiv publication-pack entries. \texttt{npm run paper:arxiv:verify} checks the manifest, file hashes, bibliography count, converted citations, missing bibliography IDs, seeded-secret redaction, and local absolute-path leakage before the browser upload step (Ledger: E86). \texttt{npm run paper:arxiv:compile} then records a schema-bound arXiv compile report: it attempts \texttt{tectonic}, \texttt{latexmk}, \texttt{pdflatex}/\texttt{bibtex}, or \texttt{uvx tecto} through a local bundle proxy, stores source hashes in \texttt{docs/paper/output/arxiv-compile-report.json}, and keeps missing TeX tooling as an explicit pending blocker for strict readiness rather than a hidden host assumption (Ledger: E97). \texttt{docs/paper/browser-launch-results.json} and \texttt{npm run paper:launch-results} record the post-submit state for the same arXiv, Hacker News, Reddit, X, and LinkedIn targets. The normal verifier allows pending rows only when each row has an explicit blocker; \texttt{npm run paper:launch-results:strict} fails until every target has a submitted, operator-verified public URL and completed post-submit checks (Ledger: E87). The publication artifact verifier and bundle verifiers also run a local absolute-path sweep. Saved public artifacts normalize repo-local paths to relative slash paths, replace the host Node executable with \texttt{node}, and fail if Windows drive paths, extended paths, or file URLs remain in the public artifact set (Ledger: E88). The browser-launch gates also encode the X URL reserve explicitly. The first X post in \texttt{publication-pack.json} carries a 24-character reserved URL budget, matching X's current t.co URL counting rule plus a separator, and \texttt{paper:launch-results} rejects submitted artifact-url targets unless the result records the final public \texttt{artifactUrl} (Ledger: E89). The release-readiness verifier now maps the 1.0 objective to concrete artifacts and blockers. \texttt{npm run release:readiness} is pending-aware for local iteration, while \texttt{npm run release:readiness:strict} fails until version surfaces, source-control release state, Python artifacts, npm registry/auth readiness, PyPI publish readiness, browser publication URLs, live Mem0/Zep evidence, package publish readiness, and arXiv compile proof are all complete (Ledger: E90, E94, E95, E96, E97). The final version bump is also scripted. \texttt{npm run release:cut:plan} previews the 1.0 edits for npm, lockfile, MCP config, Python package version, and changelog surfaces; \texttt{npm run release:cut:apply} writes them only during the intentional release cut (Ledger: E92). The Python package path has its own repeatable verifier: \texttt{npm run python:release:check} builds the wheel/sdist, checks archive metadata and typed package contents, scans for local path leakage, and runs \texttt{twine check} before PyPI upload (Ledger: E93). The same readiness report checks the final source-control state: committed working tree, \texttt{.git} metadata writability, origin push remote, upstream ahead/behind count, live remote-head freshness, and \texttt{v1.0.0} tag placement (Ledger: E94, E96). It also checks npm package readiness against the live registry: if \texttt{audrey@1.0.0} is unpublished, \texttt{npm whoami} must pass before the package row can move out of pending state (Ledger: E95).
|
|
545
|
+
|
|
546
|
+
A GuardBench paper must publish:
|
|
547
|
+
|
|
548
|
+
\begin{itemize}
|
|
549
|
+
\item Manifest JSON, including every seeded memory, seeded tool event, fault injection, action, expected decision, expected evidence class, and non-secret references for seeded redaction probes. Raw seeded secrets must not appear in published artifacts.
|
|
550
|
+
\item Subject-system adapter code and baseline implementation code.
|
|
551
|
+
\item Git SHA, package versions, runtime version, operating system, CPU model, memory, provider names, model names, embedding dimensions, and environment variables that affect retrieval or guard behavior.
|
|
552
|
+
\item Scenario-by-scenario output for every baseline, including raw decision object, evidence list, redaction report, latency, stdout, stderr, and exit code.
|
|
553
|
+
\item Redaction sweep results that grep every emitted artifact for every seeded raw secret.
|
|
554
|
+
\item Database seed or deterministic seed generator sufficient to reconstruct the initial memory state.
|
|
555
|
+
\item Aggregate metrics plus per-scenario confusion matrices.
|
|
556
|
+
\end{itemize}
|
|
557
|
+
|
|
558
|
+
\subsection{Stage-A and Stage-B Boundary}
|
|
559
|
+
|
|
560
|
+
This paper uses GuardBench as a specification contribution and reports a local comparative run across Audrey Guard, no-memory, recent-window, vector-only, and FTS-only adapters. The harness now also exposes an external ESM adapter contract, but this paper does not report external-system GuardBench scores.
|
|
561
|
+
|
|
562
|
+
\begin{verbatim}
|
|
563
|
+
| Stage | Reported in This Paper | Deferred to v2 |
|
|
564
|
+
| GuardBench manifest | The full scenario, baseline, metric, and reproducibility specification in this section, plus a local comparative runner under `benchmarks/guardbench.js`, strict external adapter contract, evidence-bundle runner with artifact validation, adapter-conformance reporting, manifest/summary/raw/external-run/conformance-card/submission-manifest/leaderboard/adapter-self-test/adapter-registry/external-dry-run/external-evidence/publication-verification JSON schemas, standalone artifact validator, cross-artifact consistency checks, metadata artifact hashes, conformance cards, portable submission bundles, verified leaderboard generation, adapter registry, adapter author-kit helpers, adapter module validation, adapter self-test onboarding and validation, external-adapter dry-run matrix, external evidence verification, publication artifact verification, paper claim verification, launch-copy verification, browser launch-plan verification, browser launch-results verification, arXiv source-package verification, arXiv compile-report verification, paper submission-bundle verification, release-readiness verifier, release-cut planner, Python package verifier, source-control release-state check, live remote-head verification, npm registry/auth readiness check, local absolute-path sweep, X URL reserve checks, and artifact redaction sweep (Ledger: E46-E51, E55-E97). | Hosted release artifact and versioned external-system output bundles. |
|
|
565
|
+
| Audrey implementation evidence | Source-inspection evidence for controller, capsule, preflight, reflexes, redaction, recall degradation, MCP, CLI, REST, storage, release gates, and Mem0/Zep adapter paths (Ledger: E1-E19, E29-E50, E77). | Credentialed external-system adapter runs for all GuardBench scenario fields. |
|
|
566
|
+
| Performance | Existing canonical `perf-0.22.2.json` encode and hybrid-recall latency under mock-provider methodology (Ledger: E20-E22). | GuardBench guard-overhead p50/p95 across all baselines and machines. |
|
|
567
|
+
| Behavioral regression | Existing `bench:memory:check` output and release-gate wiring (Ledger: E23-E24). Local comparative GuardBench reports decision accuracy and full-contract pass rate across all ten scenarios and five adapters (Ledger: E46). | External-system GuardBench decision confusion matrices. |
|
|
568
|
+
| Qualitative control behavior | Deterministic repeated-failure demo transcript (Ledger: E25, E41-E42) and local comparative scenario outputs. | External repeated-failure, contradiction, recall-degradation, and redaction outputs across systems. |
|
|
569
|
+
| Cross-system comparison | Adapter contract, Mem0 and Zep adapters, dry-run metadata paths, and pending-vs-verified external evidence reports exist, but external-system scores are not reported. | External scores added only when live adapter runs and raw outputs are published. |
|
|
570
|
+
\end{verbatim}
|
|
571
|
+
|
|
572
|
+
The boundary is deliberate. Stage A stakes the evaluation category and reports implemented Audrey artifacts plus local comparative GuardBench numbers. Stage B turns the specification into an external-system benchmark.
|
|
573
|
+
|
|
574
|
+
\subsection{Validity Threats}
|
|
575
|
+
|
|
576
|
+
Synthetic-scenario bias. GuardBench scenarios are constructed, so they underrepresent the diversity of real agent errors. The mitigation is to publish the manifest, require raw per-scenario outputs, include both exact-repeat and non-exact variants, and require future suites to add project-derived traces without changing the metric definitions.
|
|
577
|
+
|
|
578
|
+
Baseline strawman risk. Weak baselines can make a guard system look better than it is. The mitigation is to specify baseline retrieval and decision functions exactly, require raw baseline outputs, and report no-memory, recent-window, vector-only, FTS-only, and full-hybrid variants instead of comparing only against an empty baseline.
|
|
579
|
+
|
|
580
|
+
Redaction-coverage limits. A fixed secret catalog never proves general privacy safety. The mitigation is to seed known raw secrets, place them near truncation boundaries, require a redaction sweep over every output artifact, and report leakage counts rather than qualitative claims.
|
|
581
|
+
|
|
582
|
+
Machine-provenance variance. Runtime overhead depends on CPU, storage, database size, provider, model, embedding dimensions, and network conditions. The mitigation is to require machine provenance, provider provenance, no-op harness overhead, per-scenario latency, and p50/p95 rather than a single average.
|
|
583
|
+
|
|
584
|
+
Harness overfitting. A system can special-case the scenario names or expected evidence classes. The mitigation is to keep seeded content in the manifest but hide expected decisions from adapters at runtime, require adapter source publication, and include randomized irrelevant-memory noise in GB-10.
|
|
585
|
+
|
|
586
|
+
State-contamination risk. Reusing a memory store across baselines can leak evidence from one run into another. The mitigation is to require isolated stores per scenario and baseline, deterministic seed replay, and raw database snapshots or seed generators.
|
|
587
|
+
|
|
588
|
+
\section{Section 6. Implementation}
|
|
589
|
+
|
|
590
|
+
Audrey is implemented as a local-first Node and TypeScript runtime with SQLite storage, vector and full-text indexes, MCP stdio tools, a REST sidecar, a Python client, a CLI, and release gates. The implementation is small enough to inspect directly, which is why this paper treats source-linked evidence as part of the artifact.
|
|
591
|
+
|
|
592
|
+
\subsection{Runtime Stack}
|
|
593
|
+
|
|
594
|
+
The package requires Node 20 or newer and is implemented in TypeScript. The runtime uses \texttt{better-sqlite3} for local SQLite storage, \texttt{sqlite-vec} for vector tables, Hono for the REST sidecar, the Model Context Protocol SDK for the stdio MCP server, and \texttt{@huggingface/transformers} for the local embedding provider (Ledger: E31). Audrey also ships a Python client that calls the REST sidecar through synchronous and asynchronous methods (Ledger: E36).
|
|
595
|
+
|
|
596
|
+
Embedding providers are explicit. The default resolver selects the local provider with 384 dimensions and device \texttt{gpu} unless configured otherwise. The local provider uses \texttt{Xenova/all-MiniLM-L6-v2}. The mock provider uses 64 dimensions for deterministic tests and benchmarks. OpenAI and Gemini providers are supported but are not auto-selected from ambient API keys; operators must set \texttt{AUDREY\_EMBEDDING\_PROVIDER=openai} or \texttt{AUDREY\_EMBEDDING\_PROVIDER=gemini}. Their default dimensions are 1536 and 3072 respectively (Ledger: E31).
|
|
597
|
+
|
|
598
|
+
\subsection{Storage Schema}
|
|
599
|
+
|
|
600
|
+
Audrey stores memory in SQLite. The schema is created and migrated in code, not through standalone SQL migration files. The database has typed memory tables, event tables, contradiction tables, vector tables, and FTS5 tables (Ledger: E29-E30).
|
|
601
|
+
|
|
602
|
+
\begin{verbatim}
|
|
603
|
+
| Storage Component | Role | Implementation Evidence |
|
|
604
|
+
| `episodes` | Episodic memories, including source, confidence, tags, context, affect, privacy, agent scope, usage, and consolidation fields. | E29 |
|
|
605
|
+
| `semantics` | Durable factual or preference memories with state, confidence, provenance, agent scope, salience, and usage fields. | E29 |
|
|
606
|
+
| `procedures` | Reusable operating procedures with trigger, steps, state, confidence, salience, and usage fields. | E29 |
|
|
607
|
+
| `contradictions` | Records unresolved or resolved conflicts between claims. | E29 |
|
|
608
|
+
| `memory_events` | Tool and memory event log, including session, tool, outcome, hashes, redacted summaries, and metadata. | E29 |
|
|
609
|
+
| `causal_links`, `consolidation_runs`, `consolidation_metrics`, `audrey_config` | Supporting tables for consolidation, lineage, metrics, and schema versioning. | E29 |
|
|
610
|
+
| `vec_episodes`, `vec_semantics`, `vec_procedures` | sqlite-vec indexes for typed vector search. | E30 |
|
|
611
|
+
| `fts_episodes`, `fts_semantics`, `fts_procedures` | FTS5 indexes for typed lexical search. | E30 |
|
|
612
|
+
\end{verbatim}
|
|
613
|
+
|
|
614
|
+
The implementation keeps vector and FTS storage type-specific instead of flattening all memory into one undifferentiated index. That matters for control behavior because preflight can ask for risks, procedures, contradictions, and recent tool outcomes separately before producing a decision (Ledger: E5, E9, E29-E30).
|
|
615
|
+
|
|
616
|
+
\subsection{Recall and FTS}
|
|
617
|
+
|
|
618
|
+
Audrey implements vector, keyword, and hybrid recall modes. Hybrid recall fuses vector KNN and FTS5 BM25 with reciprocal rank fusion. The current constants are \texttt{RRF\_K = 60}, \texttt{VECTOR\_WEIGHT = 0.3}, and \texttt{FTS\_WEIGHT = 0.7} (Ledger: E14). FTS search is backed by separate \texttt{fts\_episodes}, \texttt{fts\_semantics}, and \texttt{fts\_procedures} tables and returns BM25-ranked matches joined back to the live typed memory tables (Ledger: E30).
|
|
619
|
+
|
|
620
|
+
Recall is failure-aware. Before vector search, Audrey checks whether each expected vector table exists. During recall, it catches per-type KNN failures and FTS lookup failures, records them as \texttt{RecallError[]}, marks the result as a partial failure, and still returns available partial results (Ledger: E15, E40). Preflight then turns recall errors into high-severity memory-health warnings, so degraded retrieval changes control output instead of being silently swallowed (Ledger: E9-E10).
|
|
621
|
+
|
|
622
|
+
This distinction is important for GuardBench. A retrieval system that fails open under missing indexes can look accurate on happy-path queries while making unsafe tool decisions under degraded memory. Audrey exposes degradation through the same evidence and warning channels used for ordinary risks.
|
|
623
|
+
|
|
624
|
+
\subsection{Capsule Implementation}
|
|
625
|
+
|
|
626
|
+
A memory capsule is the bounded context object that bridges recall and control. It contains budget metadata, truncation status, must-follow rules, project facts, preferences, procedures, risks, recent changes, contradictions, uncertain or disputed memories, evidence IDs, and recall errors (Ledger: E5). Capsule construction forces \texttt{scope: 'agent'}, preserving agent-local memory boundaries during recall (Ledger: E7).
|
|
627
|
+
|
|
628
|
+
Capsules also enforce a control-source gate. A \texttt{must-follow} style memory becomes a control signal only when the source is trusted as \texttt{direct-observation} or \texttt{told-by-user}; otherwise it is routed to uncertain or disputed context rather than treated as an instruction (Ledger: E6). This prevents untrusted memory content from escalating itself into an operational rule.
|
|
629
|
+
|
|
630
|
+
Budget enforcement is performed before the capsule is handed to preflight. The implementation tracks whether the capsule was truncated and keeps evidence IDs available even when natural-language sections are shortened (Ledger: E5). The pre-action controller therefore receives both a bounded textual packet and an auditable evidence list.
|
|
631
|
+
|
|
632
|
+
\subsection{Preflight and Reflexes}
|
|
633
|
+
|
|
634
|
+
Preflight is the risk-scoring layer. Its output contract includes \texttt{go}, \texttt{caution}, and \texttt{block} decisions, a risk score, warnings, recent failures, status, recommended actions, evidence IDs, an optional event ID, and an optional capsule (Ledger: E8). Internally, preflight builds a capsule with risks and contradictions enabled, checks memory health, converts recall errors into high-severity warnings, and adds warning sources for recent failures, must-follow rules, risks, procedures, contradictions, and uncertain or disputed memories (Ledger: E9).
|
|
635
|
+
|
|
636
|
+
Warnings are sorted by severity. Risk score is derived from the highest warning severity, and strict mode blocks high-severity warnings (Ledger: E10). Reflex generation maps preflight warnings into \texttt{guide}, \texttt{warn}, or \texttt{block} reflexes with evidence IDs, reasons, recommendations, and optional embedded preflight data (Ledger: E11).
|
|
637
|
+
|
|
638
|
+
Evidence propagation is preserved across these layers. A warning generated from a procedure, contradiction, recall error, or prior failure carries evidence IDs into the preflight response; reflexes then carry those IDs into the agent-facing guard report (Ledger: E8-E11).
|
|
639
|
+
|
|
640
|
+
\subsection{Controller Implementation}
|
|
641
|
+
|
|
642
|
+
\texttt{MemoryController.beforeAction()} is the pre-action entry point. It runs \texttt{audrey.reflexes()} in strict mode, requests the preflight and capsule, records a preflight event, and scopes recall to the current agent (Ledger: E2). The external guard result is \texttt{allow}, \texttt{warn}, or \texttt{block}, with a risk score, summary, evidence IDs, recommendations, optional capsule, reflexes, and optional preflight event ID (Ledger: E1).
|
|
643
|
+
|
|
644
|
+
Exact repeated-failure control is deterministic. Audrey computes an action identity from the tool name, redacted command or action text, normalized working directory, and sorted normalized file scope. If a prior failed tool event carries the same identity, the controller blocks the action before tool use (Ledger: E3). This is stricter than semantic similarity: the repeat block is keyed to the normalized action, not to whether an embedding happens to retrieve the old failure.
|
|
645
|
+
|
|
646
|
+
\texttt{MemoryController.afterAction()} closes the loop after tool execution. It records tool outcomes through \texttt{observeTool()}, stores the \texttt{audrey\_guard\_action\_key}, redacts action, command, and error text, and encodes failed outcomes as tool-result memories (Ledger: E4). The same event stream supports future preflight checks and impact reporting.
|
|
647
|
+
|
|
648
|
+
\subsection{Redaction Implementation}
|
|
649
|
+
|
|
650
|
+
Audrey redacts before tool traces enter durable memory. The tool-tracing module states that raw tool input, output, and error text do not leave the module without redaction, and it stores hashes, redacted summaries or details, file fingerprints, redaction state, and a memory event (Ledger: E12).
|
|
651
|
+
|
|
652
|
+
The redaction catalog covers named credentials, generic credentials, payment and PII patterns, and entropy-based fallbacks (Ledger: E13). Examples of concrete coverage:
|
|
653
|
+
|
|
654
|
+
\begin{verbatim}
|
|
655
|
+
| Pattern Class | Example Input Shape | Output Shape |
|
|
656
|
+
| AWS access key | `AKIA` followed by 16 uppercase alphanumeric characters. | `[REDACTED:aws_access_key:tail]` |
|
|
657
|
+
| Anthropic API key | `sk-ant-` followed by a long token. | `[REDACTED:anthropic_api_key:tail]` |
|
|
658
|
+
| OpenAI API key | `sk-...` or `sk-proj-...` long token. | `[REDACTED:openai_api_key:tail]` |
|
|
659
|
+
| GitHub token | `ghp_`, `gho_`, `ghu_`, `ghs_`, or `ghr_` token. | `[REDACTED:github_token:tail]` |
|
|
660
|
+
| Stripe key | `sk_live_`, `rk_live_`, `pk_live_`, or test equivalents. | `[REDACTED:stripe_live_key:tail]` or `[REDACTED:stripe_test_key:tail]` |
|
|
661
|
+
| Google API key | `AIza` plus the expected key body. | `[REDACTED:google_api_key:tail]` |
|
|
662
|
+
| Slack token | `xoxb-`, `xoxa-`, `xoxp-`, `xoxr-`, or `xoxs-` style token. | `[REDACTED:slack_token:tail]` |
|
|
663
|
+
| Bearer or Basic auth | `Bearer <token>` or `Basic <base64>`. | `Bearer [REDACTED:generic_bearer]` or `Basic [REDACTED:basic_auth]` |
|
|
664
|
+
| Private key block | PEM private-key block. | `[REDACTED:private_key_block]` |
|
|
665
|
+
| URL credentials | `scheme://user:password@host`. | `scheme://user:[REDACTED:url_credentials]@host` |
|
|
666
|
+
| Password assignment | `password=...`, `api_key: ...`, `auth_token=...`, or similar keys. | Original key plus `[REDACTED:password_assignment]` |
|
|
667
|
+
| Payment and PII | Luhn-valid card numbers, CVV labels, and US SSNs. | Payment or PII class marker. |
|
|
668
|
+
| Signed URLs and sessions | Signature, token, and session-cookie query or cookie fields. | Preserved field name plus redaction marker. |
|
|
669
|
+
| High-entropy fallback | Long mixed-character token with sufficient entropy. | `[REDACTED:high_entropy_secret:tail]` |
|
|
670
|
+
\end{verbatim}
|
|
671
|
+
|
|
672
|
+
The JSON walker redacts sensitive keys and values recursively. If a value sits under a sensitive key and does not match a named pattern, it is still replaced with a password-assignment marker (Ledger: E13). Truncation is applied after redaction and preserves redaction markers rather than cutting them in half or dropping the only proof that a secret was removed (Ledger: E13).
|
|
673
|
+
|
|
674
|
+
\subsection{MCP, CLI, and REST Surfaces}
|
|
675
|
+
|
|
676
|
+
The MCP server registers 20 tools: \texttt{memory\_dream}, \texttt{memory\_encode}, \texttt{memory\_recall}, \texttt{memory\_consolidate}, \texttt{memory\_introspect}, \texttt{memory\_resolve\_truth}, \texttt{memory\_export}, \texttt{memory\_import}, \texttt{memory\_forget}, \texttt{memory\_validate}, \texttt{memory\_decay}, \texttt{memory\_status}, \texttt{memory\_reflect}, \texttt{memory\_greeting}, \texttt{memory\_observe\_tool}, \texttt{memory\_recent\_failures}, \texttt{memory\_capsule}, \texttt{memory\_preflight}, \texttt{memory\_reflexes}, and \texttt{memory\_promote} (Ledger: E32). The Guard-relevant MCP surface is \texttt{memory\_observe\_tool}, \texttt{memory\_recent\_failures}, \texttt{memory\_capsule}, \texttt{memory\_preflight}, and \texttt{memory\_reflexes}, with \texttt{memory\_validate} supporting closed-loop validation and REST or CLI impact reporting supporting aggregate impact inspection (Ledger: E16-E19, E32-E34).
|
|
677
|
+
|
|
678
|
+
The CLI recognizes \texttt{install}, \texttt{uninstall}, \texttt{mcp-config}, \texttt{hook-config}, \texttt{demo}, \texttt{guard}, \texttt{reembed}, \texttt{dream}, \texttt{greeting}, \texttt{reflect}, \texttt{serve}, \texttt{status}, \texttt{doctor}, \texttt{observe-tool}, \texttt{promote}, and \texttt{impact} (Ledger: E34). The \texttt{guard} subcommand invokes the controller, prints JSON or formatted output, exits nonzero for blocking decisions unless an explicit override is supplied, and can run as a Claude Code PreToolUse hook with \texttt{--hook}. \texttt{hook-config claude-code --apply} merges the generated hook block into Claude Code settings with backup/idempotence (Ledger: E26, E43).
|
|
679
|
+
|
|
680
|
+
The REST sidecar exposes routes for health, encode, recall, validate, mark-used, capsule, preflight, reflexes, consolidate, dream, introspect, impact, resolve-truth, export, import, forget, decay, status, reflect, and greeting (Ledger: E33). The sidecar defaults to loopback binding, refuses non-loopback binds without \texttt{AUDREY\_API\_KEY} unless \texttt{AUDREY\_ALLOW\_NO\_AUTH=1}, and emits an explicit warning when that no-auth override is used (Ledger: E35). Export, import, and forget are disabled unless \texttt{AUDREY\_ENABLE\_ADMIN\_TOOLS=1} (Ledger: E33).
|
|
681
|
+
|
|
682
|
+
\subsection{Configuration}
|
|
683
|
+
|
|
684
|
+
The README documents the runtime environment variables that affect storage, provider selection, server exposure, admin tools, and performance behavior (Ledger: E37). The security-critical defaults are:
|
|
685
|
+
|
|
686
|
+
\begin{verbatim}
|
|
687
|
+
| Variable | Default | Security Role |
|
|
688
|
+
| `AUDREY_HOST` | `127.0.0.1` | Keeps the REST sidecar on loopback by default. Non-loopback exposure requires auth or an explicit unsafe override (Ledger: E35, E37). |
|
|
689
|
+
| `AUDREY_API_KEY` | unset | Required bearer token for non-loopback REST traffic (Ledger: E35, E37). |
|
|
690
|
+
| `AUDREY_ALLOW_NO_AUTH` | `0` | Escape hatch for non-loopback without auth. The docs explicitly mark this as unsafe (Ledger: E35, E37). |
|
|
691
|
+
| `AUDREY_ENABLE_ADMIN_TOOLS` | `0` | Keeps export, import, and forget routes/tools disabled by default (Ledger: E33, E37). |
|
|
692
|
+
| `AUDREY_PROMOTE_ROOTS` | unset | Restricts `audrey promote --yes` writes to `process.cwd()` unless additional roots are explicitly listed (Ledger: E37). |
|
|
693
|
+
\end{verbatim}
|
|
694
|
+
|
|
695
|
+
\texttt{AUDREY\_MODEL} is not part of the documented Audrey environment matrix and is not used as an Audrey runtime variable in the inspected source. Model selection is currently represented by provider defaults and provider-specific environment variables, not by a single \texttt{AUDREY\_MODEL} switch (Ledger: E38).
|
|
696
|
+
|
|
697
|
+
\subsection{Testing and Release Gates}
|
|
698
|
+
|
|
699
|
+
The package scripts wire build, typecheck, Vitest, perf benchmark, performance snapshot, memory regression check, npm pack dry-run, release gate, sandbox release gate, and Python package release verification commands (Ledger: E39, E93). The documented release path also includes Python unittest discovery and Python package build commands (Ledger: E39).
|
|
700
|
+
|
|
701
|
+
\texttt{bench:memory:check} is a regression gate. It runs retrieval and lifecycle benchmark suites, compares Audrey against vector-only, keyword-plus-recency, and recent-window local baselines, and enforces score, pass-rate, and margin guardrails (Ledger: E23). Section 7 reports the current output as a regression-gate result, not as a cross-system leaderboard.
|
|
702
|
+
|
|
703
|
+
For the repeated-failure evaluation transcript in this paper, the project was rebuilt with \texttt{npm run build} before running \texttt{node dist/mcp-server/index.js demo --scenario repeated-failure}; the build completed successfully (Ledger: E41).
|
|
704
|
+
|
|
705
|
+
\section{Section 7. Evaluation}
|
|
706
|
+
|
|
707
|
+
This Stage-A evaluation reports implemented Audrey artifacts and local GuardBench adapters only. Section 5 specifies GuardBench as a reproducibility contract for pre-action memory control. The empirical claims below come from the repository's existing performance snapshot, the current behavioral regression output, the local comparative GuardBench runner, source-linked implementation inspection, and a freshly captured repeated-failure demo transcript.
|
|
708
|
+
|
|
709
|
+
\subsection{Methodology Disclosure}
|
|
710
|
+
|
|
711
|
+
The evaluation separates specification from implementation evidence. GuardBench defines the scenarios, baselines, metrics, and reproducibility requirements for future external-system evaluation. This paper reports what the current repository already supports: encode and hybrid-recall latency from the canonical 0.22.2 snapshot, the \texttt{bench:memory:check} regression gate output, a local comparative GuardBench run, and a deterministic demo showing Audrey Guard blocking a repeated failed action (Ledger: E20-E25, E41-E42, E46).
|
|
712
|
+
|
|
713
|
+
This is narrower than a cross-system benchmark. It is also the honest Stage-A claim: the paper introduces the control problem, specifies the benchmark, reports local comparative results, and avoids unrun external-system comparisons.
|
|
714
|
+
|
|
715
|
+
\subsection{Performance Snapshot}
|
|
716
|
+
|
|
717
|
+
The canonical performance snapshot is \texttt{benchmarks/snapshots/perf-0.22.2.json}. It was generated on 2026-05-01T02:15:29.400Z from git SHA \texttt{e2e821b}, using mock in-process 64-dimensional embeddings, a mock in-process LLM provider, hybrid vector-plus-lexical retrieval with limit 5, corpus sizes 100, 1000, and 5000, and 50 recall runs per corpus size. The machine provenance is Node 25.5.0, V8 14.1.146.11-node.18, Windows x64, 24-core AMD Ryzen 9 7900X3D, and 62.9 GB RAM (Ledger: E20).
|
|
718
|
+
|
|
719
|
+
The snapshot reports the following encode and hybrid recall latencies in milliseconds:
|
|
720
|
+
|
|
721
|
+
\begin{verbatim}
|
|
722
|
+
| Corpus Size | Encode p50 | Encode p95 | Encode p99 | Hybrid Recall p50 | Hybrid Recall p95 | Hybrid Recall p99 |
|
|
723
|
+
| 100 | 0.331 | 0.589 | 7.65 | 0.539 | 1.82 | 2.712 |
|
|
724
|
+
| 1000 | 0.307 | 2.147 | 9.672 | 1.566 | 2.364 | 21.177 |
|
|
725
|
+
| 5000 | 0.308 | 1.838 | 10.45 | 2.091 | 3.417 | 16.58 |
|
|
726
|
+
\end{verbatim}
|
|
727
|
+
|
|
728
|
+
These numbers measure Audrey's local call path under an in-process mock embedding provider. They do not measure real embedding-provider latency, local transformer warmup cost, GPU/CPU variance for 384-dimensional local embeddings, OpenAI or Gemini network latency, or production-load concurrency. The snapshot itself notes that cloud and local 384-dimensional providers will report higher recall latency dominated by embedding cost and network (Ledger: E20-E22, E31).
|
|
729
|
+
|
|
730
|
+
\subsection{Behavioral Regression Result}
|
|
731
|
+
|
|
732
|
+
The current \texttt{benchmarks/output/summary.json} was generated on 2026-05-13T08:33:24.917Z with command \texttt{node benchmarks/run.js --provider mock --dimensions 64} (Ledger: E24). It reports:
|
|
733
|
+
|
|
734
|
+
\begin{verbatim}
|
|
735
|
+
| System | Score Percent | Pass Rate | Average Duration Ms |
|
|
736
|
+
| Audrey | 100 | 100 | 15.083333333333334 |
|
|
737
|
+
| Vector Only | 41.66666666666667 | 25 | 0.25 |
|
|
738
|
+
| Keyword + Recency | 41.66666666666667 | 25 | 0.5 |
|
|
739
|
+
| Recent Window | 37.5 | 25 | 0 |
|
|
740
|
+
\end{verbatim}
|
|
741
|
+
|
|
742
|
+
This output is a regression-gate result. The baselines are toy local baselines used to catch retrieval and lifecycle regressions in the Audrey codebase. They are not external systems, not tuned competitor implementations, and not GuardBench baselines (Ledger: E23-E24). The current suite covers retrieval and operation families such as information extraction, knowledge updates, multi-session reasoning, conflict resolution, procedural learning, privacy boundary, overwrite, delete-and-abstain, semantic merge, and procedural merge (Ledger: E23-E24).
|
|
743
|
+
|
|
744
|
+
\subsection{GuardBench Audrey Reference Result}
|
|
745
|
+
|
|
746
|
+
The current \texttt{benchmarks/output/guardbench-summary.json}, \texttt{benchmarks/output/guardbench-manifest.json}, and \texttt{benchmarks/output/guardbench-raw.json} were generated on 2026-05-12 with:
|
|
747
|
+
|
|
748
|
+
\begin{verbatim}
|
|
749
|
+
npm run bench:guard:check
|
|
750
|
+
\end{verbatim}
|
|
751
|
+
|
|
752
|
+
It reports local adapters only, not external-system comparisons (Ledger: E46):
|
|
753
|
+
|
|
754
|
+
\begin{verbatim}
|
|
755
|
+
| Metric | Result |
|
|
756
|
+
| Scenarios passed | 10 / 10 |
|
|
757
|
+
| Prevention rate | 100% |
|
|
758
|
+
| False-block rate | 0% |
|
|
759
|
+
| Evidence recall | 100% |
|
|
760
|
+
| Redaction leaks | 0 |
|
|
761
|
+
| Recall-degradation detection | 100% |
|
|
762
|
+
| Guard latency p50 / p95 | 3.214 ms / 21.395 ms |
|
|
763
|
+
| Published artifact raw-secret leaks | 0 |
|
|
764
|
+
| Audrey Guard decision accuracy | 100% |
|
|
765
|
+
| No-memory decision accuracy | 10% |
|
|
766
|
+
| Recent-window decision accuracy | 60% |
|
|
767
|
+
| Vector-only decision accuracy | 40% |
|
|
768
|
+
| FTS-only decision accuracy | 10% |
|
|
769
|
+
\end{verbatim}
|
|
770
|
+
|
|
771
|
+
The ten scenarios cover exact repeated failures, required procedures, changed file scopes, changed commands, recovered failures, vector recall degradation, FTS recall degradation, truncation-boundary secret redaction, conflicting instructions, and noisy-store control-memory recall. These results are the first public local comparative Audrey GuardBench numbers. The emitted manifest records the ten scenario actions, seeded memories, seeded tool events, fault injections, expected evidence classes, and non-secret references for seeded redaction probes; the raw output file records every local adapter result for each case. The harness also sweeps the summary, manifest, and raw output for seeded raw secrets and fails the run on artifact leaks. External ESM adapters receive private seed values at runtime while expected decisions and evidence remain withheld during execution. Concrete external adapters now target Mem0 Platform and Zep Cloud through their REST APIs, but they have not yet been run with live keys, so this paper does not report external-system scores.
|
|
772
|
+
|
|
773
|
+
\subsection{Repeated-Failure Demo Transcript}
|
|
774
|
+
|
|
775
|
+
The qualitative control figure is the deterministic repeated-failure demo. The project was rebuilt with \texttt{npm run build}, then the demo was run with:
|
|
776
|
+
|
|
777
|
+
\begin{verbatim}
|
|
778
|
+
node dist/mcp-server/index.js demo --scenario repeated-failure
|
|
779
|
+
\end{verbatim}
|
|
780
|
+
|
|
781
|
+
The run produced the following transcript. The temporary path, memory IDs, and timestamp-bearing failure ID are run-specific; the decision structure is the evaluated behavior (Ledger: E25, E41-E42). Appendix A provides the same transcript as a standalone reproduction artifact with line annotations.
|
|
782
|
+
|
|
783
|
+
\begin{verbatim}
|
|
784
|
+
Audrey Guard repeated-failure demo
|
|
785
|
+
|
|
786
|
+
Memory store: [LOCAL-TEMP]/audrey-demo-AkCROa
|
|
787
|
+
Step 1: the agent tries a deploy and hits a real setup failure.
|
|
788
|
+
Step 2: Audrey stores the failure and the operational rule it implies.
|
|
789
|
+
Lesson memory: 01KR491DG2YZHVEM79QVW5BHZA
|
|
790
|
+
|
|
791
|
+
Step 3: a new preflight checks the same action before tool use.
|
|
792
|
+
|
|
793
|
+
Audrey Guard: BLOCKED
|
|
794
|
+
|
|
795
|
+
Reason: Blocked: this exact Bash action failed before. Stop: 3 memory reflexes, 2 blocking, 1 warning matched.
|
|
796
|
+
Risk score: 0.90
|
|
797
|
+
|
|
798
|
+
Evidence:
|
|
799
|
+
- 01KR491DFZYZ20TFK71KJHC88F
|
|
800
|
+
- 01KR491DG2YZHVEM79QVW5BHZA
|
|
801
|
+
- failure:Bash:2026-05-08T17:09:22.047Z
|
|
802
|
+
|
|
803
|
+
Recommended action:
|
|
804
|
+
- Do not repeat the exact failed action until the prior error is understood or the command is changed.
|
|
805
|
+
- Do not proceed until the high-severity memory warning is addressed.
|
|
806
|
+
- Apply this must-follow rule before acting.
|
|
807
|
+
- Mitigate this remembered risk before proceeding.
|
|
808
|
+
- Before re-running Bash, check what changed since the last failure.
|
|
809
|
+
|
|
810
|
+
Memory reflexes:
|
|
811
|
+
- block: Apply this must-follow rule before acting. Before running npm run deploy, run npm run db:generate because Prisma client must be generated first.
|
|
812
|
+
- block: Mitigate this remembered risk before proceeding. Before running npm run deploy, run npm run db:generate because Prisma client must be generated first.
|
|
813
|
+
- warn: Before re-running Bash, check what changed since the last failure.
|
|
814
|
+
|
|
815
|
+
Next: fix the warning and retry, or pass --override to allow this guard check.
|
|
816
|
+
|
|
817
|
+
Impact:
|
|
818
|
+
- 1 repeated failure prevented
|
|
819
|
+
- 1 helpful memory validation recorded
|
|
820
|
+
- 3 evidence ids attached
|
|
821
|
+
|
|
822
|
+
Audrey saw the agent fail once.
|
|
823
|
+
Audrey stopped it from failing twice.
|
|
824
|
+
\end{verbatim}
|
|
825
|
+
|
|
826
|
+
The transcript demonstrates the core pre-action memory-control loop for one scenario: a failed tool action is observed, a procedural lesson is encoded, a later identical action is intercepted before tool use, the guard returns a block decision, and the decision carries evidence IDs, recommendations, reflexes, and impact accounting (Ledger: E3-E4, E8-E11, E16-E17, E25, E42).
|
|
827
|
+
|
|
828
|
+
\subsection{Implemented-Evidence Summary}
|
|
829
|
+
|
|
830
|
+
\begin{verbatim}
|
|
831
|
+
| Design Claim | Ledger IDs | Evidence Type |
|
|
832
|
+
| Audrey exposes a pre-action `allow`/`warn`/`block` controller result. | E1-E2 | Source inspection |
|
|
833
|
+
| Exact repeated failures are blocked by action identity. | E3, E25, E42 | Source inspection and demo |
|
|
834
|
+
| Post-action observation stores redacted tool outcomes and action identity. | E4, E12-E13 | Source inspection |
|
|
835
|
+
| Capsules preserve structured memory sections, evidence IDs, budget state, and recall errors. | E5-E7 | Source inspection |
|
|
836
|
+
| Preflight converts memory health, failures, risks, procedures, contradictions, and disputed memories into severity-sorted warnings. | E8-E10 | Source inspection |
|
|
837
|
+
| Reflexes carry response type, recommendations, and evidence IDs. | E11 | Source inspection |
|
|
838
|
+
| Recall degradation is represented and propagated as `RecallError[]`. | E15, E40 | Source inspection |
|
|
839
|
+
| Hybrid recall uses vector plus FTS with RRF constants `60`, `0.3`, and `0.7`. | E14 | Source inspection |
|
|
840
|
+
| Redaction covers named credentials, generic auth, private keys, payment/PII patterns, sessions, signed URLs, entropy fallback, JSON sensitive keys, and truncation marker preservation. | E12-E13 | Source inspection |
|
|
841
|
+
| The runtime includes SQLite, sqlite-vec, FTS5, MCP stdio, Hono REST, CLI, and Python client surfaces. | E29-E36 | Source inspection |
|
|
842
|
+
| Security-relevant defaults keep REST local, require API keys for non-loopback, disable no-auth exposure, disable admin tools, and restrict promote roots. | E33, E35, E37-E38 | Source inspection |
|
|
843
|
+
| Encode and hybrid-recall latency are reported from the canonical 0.22.2 perf snapshot. | E20-E22 | Snapshot |
|
|
844
|
+
| Behavioral regression status is reported from the current `bench:memory:check` output. | E23-E24 | Regression output |
|
|
845
|
+
| Local comparative GuardBench status is reported from `bench:guard:check`. | E46 | GuardBench comparative output |
|
|
846
|
+
| The repeated-failure control loop is demonstrated by the `demo --scenario repeated-failure` transcript. | E25, E41-E42 | Demo |
|
|
847
|
+
\end{verbatim}
|
|
848
|
+
|
|
849
|
+
\subsection{What This Section Does Not Claim}
|
|
850
|
+
|
|
851
|
+
This section does not claim cross-system superiority over Mem0, Letta/MemGPT, Zep, Graphiti, MemOS, LangMem, Supermemory, Cognee, or any production memory service.
|
|
852
|
+
|
|
853
|
+
This section claims local comparative GuardBench results, the adapter contract, and the existence of the Mem0 and Zep adapters only. External-system GuardBench outputs are deferred until live runs are captured.
|
|
854
|
+
|
|
855
|
+
This section does not claim production-load measurements. The performance snapshot is a local mock-provider benchmark, not a concurrency, soak, storage-pressure, or real-provider benchmark.
|
|
856
|
+
|
|
857
|
+
This section does not claim real-hardware variance. It reports one machine's provenance and raw numbers from the repository snapshot.
|
|
858
|
+
|
|
859
|
+
This section does not claim perfect redaction coverage. It reports implemented pattern coverage and the current GuardBench artifact sweep for seeded raw-secret leakage.
|
|
860
|
+
|
|
861
|
+
This section does not claim that local toy baselines represent external systems. The current \texttt{bench:memory:check} baselines exist to detect Audrey regressions.
|
|
862
|
+
|
|
863
|
+
\section{Section 8. Discussion and Limitations}
|
|
864
|
+
|
|
865
|
+
\subsection{What the Implementation Changes}
|
|
866
|
+
|
|
867
|
+
Repeated-failure prevention is the clearest implemented behavior. The demo records one failed \texttt{npm run deploy}, stores the operational lesson that Prisma generation must run first, and blocks the identical future action before tool use. The guard output includes a \texttt{BLOCKED} decision, risk score \texttt{0.90}, two blocking reflexes, one warning reflex, evidence IDs, concrete recommendations, and impact accounting (Ledger: E25, E42). The change is not better recall in a chat answer. The change is that the second tool call does not happen.
|
|
868
|
+
|
|
869
|
+
The redacted evidence trail changes what can be safely remembered. Audrey's tool-tracing path states that raw tool input, output, and error text do not leave the module without redaction, and it stores redacted summaries, hashes, file fingerprints, redaction state, and a memory event (Ledger: E12). The redaction catalog covers named API keys, auth headers, private keys, URL credentials, password assignments, payment and PII patterns, signed URLs, session cookies, high-entropy tokens, JSON sensitive keys, and truncation marker preservation (Ledger: E13). This lets guard decisions point back to prior tool evidence without turning the memory store into an unfiltered secret archive.
|
|
870
|
+
|
|
871
|
+
Action-key determinism gives the controller a hard repeated-failure path. Audrey hashes tool name, redacted command or action text, normalized working directory, and sorted normalized file scope, then matches the hash against prior failed tool events for the same agent (Ledger: E3). This is separate from semantic retrieval. A prior failure does not need to be semantically similar enough to rank first; the exact repeated action is a structured event match.
|
|
872
|
+
|
|
873
|
+
Recall-degradation handling changes failure posture. Audrey records missing vector tables, per-type KNN failures, and FTS lookup failures as \texttt{RecallError[]}, preserves partial results, exposes degradation in memory status, and carries errors into capsules (Ledger: E15, E40). Preflight converts recall errors into high-severity memory-health warnings, and strict guard mode blocks high-severity warnings (Ledger: E9-E10). A degraded memory substrate therefore becomes a visible control condition instead of an invisible recall-quality drop.
|
|
874
|
+
|
|
875
|
+
\subsection{Conservative Control Choices}
|
|
876
|
+
|
|
877
|
+
Audrey is conservative by design. Strict mode blocks high-severity warnings rather than asking the model to decide whether a remembered warning matters (Ledger: E10). This increases false-positive risk when a high-severity warning is stale or overbroad. The trade-off is appropriate for pre-action control because the guarded operations include file mutation, shell execution, credential exposure, network calls, and destructive tools. The safer default is to force inspection when memory reports high risk.
|
|
878
|
+
|
|
879
|
+
Agent-scoped recall is another conservative choice. Capsule construction forces \texttt{scope: 'agent'}, so a caller cannot accidentally widen a capsule to shared memory by passing broad recall options (Ledger: E7). This reduces cross-agent leakage at the cost of hiding useful memory learned by another agent. Cross-agent sharing belongs behind an explicit federation policy, not an accidental default.
|
|
880
|
+
|
|
881
|
+
The trusted-control-source gate is also conservative. A memory tagged as must-follow becomes a control rule only when its source is \texttt{direct-observation} or \texttt{told-by-user}; untrusted must-follow tags are routed to uncertain or disputed context (Ledger: E6). This blocks an obvious memory-injection path. It also creates false positives when a useful operating rule comes from a source that has not been promoted into trust.
|
|
882
|
+
|
|
883
|
+
\subsection{What This Paper Does Not Claim}
|
|
884
|
+
|
|
885
|
+
This paper reports one local comparative GuardBench run. It does not report GuardBench results across external memory systems. Section 5 specifies GuardBench; v2 reports the full external-system run.
|
|
886
|
+
|
|
887
|
+
It does not report cross-system comparisons against Mem0, Letta/MemGPT, Zep, Graphiti, MemOS, LangMem, Supermemory, Cognee, or hosted memory services.
|
|
888
|
+
|
|
889
|
+
It does not report production-load measurements, concurrent-agent soak tests, storage-pressure behavior, or long-running operational telemetry.
|
|
890
|
+
|
|
891
|
+
It does not report real-provider embedding latency. The canonical performance snapshot uses a mock in-process 64-dimensional embedding provider (Ledger: E20-E22).
|
|
892
|
+
|
|
893
|
+
It does not prove redaction completeness. The implementation has a broad rule catalog, but unknown credential formats and adversarial encodings remain out of scope (Ledger: E13).
|
|
894
|
+
|
|
895
|
+
It does not prevent first-time errors. Pre-action memory control works from prior evidence, remembered rules, contradictions, and recall health. A novel error with no remembered signal still reaches the underlying tool policy.
|
|
896
|
+
|
|
897
|
+
It does not replace sandboxing, OS permissions, MCP permission systems, human approval, or network isolation. Audrey is a memory-derived control layer that fits inside a host's broader tool-use safety stack.
|
|
898
|
+
|
|
899
|
+
\subsection{Threats to Current Claims}
|
|
900
|
+
|
|
901
|
+
Action-key fidelity is a central threat. Repeated-failure prevention depends on stable normalization of command text, working directory, and file scope. If a host supplies incomplete file lists, unstable cwd values, or action text that hides the meaningful operation inside nested arguments, exact-repeat detection loses coverage (Ledger: E3). The current hash is deterministic, not omniscient.
|
|
902
|
+
|
|
903
|
+
Redaction is rule-based. It covers many common credential and PII formats, sensitive JSON keys, high-entropy strings, and truncation boundaries (Ledger: E13). It remains incomplete for novel secret formats, multi-part secrets split across fields, encoded payloads, and tool outputs crafted to evade regex and entropy checks.
|
|
904
|
+
|
|
905
|
+
Capsule pruning is priority-based, not learned. Capsules enforce a character budget and preserve structured sections and evidence IDs (Ledger: E5). The current implementation uses explicit sectioning and truncation logic rather than a learned policy that optimizes downstream guard accuracy. Budget pressure can hide useful non-control context while keeping control evidence.
|
|
906
|
+
|
|
907
|
+
Reflex generation is deterministic, not adaptive. Reflexes are mapped from preflight warnings into \texttt{guide}, \texttt{warn}, or \texttt{block} responses with evidence and recommendations (Ledger: E11). The mapping does not learn from later validation events. Validation updates memory salience and bookkeeping, but risk scoring remains fixed by severity rules (Ledger: E16, E45).
|
|
908
|
+
|
|
909
|
+
\subsection{Open Problems}
|
|
910
|
+
|
|
911
|
+
Host hook parity. Audrey exposes CLI pieces that host hooks can call, and Claude Code documents hook extension points \cite{anthropic2026claudecodehooks}. Audrey now generates and applies Claude Code hook settings, \texttt{guard --hook} emits the current PreToolUse \texttt{hookSpecificOutput.permissionDecision} shape, and \texttt{observe-tool} records post-tool events (Ledger: E43). The remaining production installer work is equivalent wiring for hosts with stable hook surfaces, especially Codex.
|
|
912
|
+
|
|
913
|
+
Validation lineage is implemented but not yet policy-adaptive. Audrey can bind validation events to the exact preflight event, evidence IDs, and action key that produced a decision, and rejects mismatched evidence claims (Ledger: E44). The next step is using that closed-loop signal to tune warning priority and recommendation wording without giving the model direct control over policy.
|
|
914
|
+
|
|
915
|
+
Cross-agent memory federation. Audrey currently protects capsules through agent-scoped recall (Ledger: E7). Multi-agent runtimes need explicit federation rules: which memories transfer across agents, which remain private, which require user confirmation, and how contradictions propagate across agent identities.
|
|
916
|
+
|
|
917
|
+
Adaptive risk scoring. Preflight uses a fixed severity map and strict mode blocks high-severity warnings (Ledger: E45). Validation feedback should eventually tune risk scoring, warning ordering, and recommendation wording without giving the model direct control over the safety policy.
|
|
918
|
+
|
|
919
|
+
Adversarial-memory robustness. The trusted-control-source gate blocks untrusted must-follow tags, but poisoned tool outputs that enter through trusted observation paths remain a hard problem (Ledger: E6, E12). Future work needs adversarial memory tests where attacker-controlled output resembles an operational rule, a project instruction, or a false recovery signal.
|
|
920
|
+
|
|
921
|
+
\subsection{Local-First as a Feature}
|
|
922
|
+
|
|
923
|
+
Pre-action memory control operates on sensitive operational history: shell commands, file paths, build failures, project rules, API error summaries, user instructions, and redacted tool outputs. Sending that control surface to a hosted memory service creates an avoidable privacy, availability, and latency dependency. Audrey's local-first SQLite, sqlite-vec, FTS5, loopback REST default, and no-auth network refusal keep the controller deployable inside local agents and air-gapped environments (Ledger: E29-E30, E35, E37). The local design is not just an implementation convenience; it is aligned with the data that a pre-action controller must inspect.
|
|
924
|
+
|
|
925
|
+
\section{Section 9. Conclusion}
|
|
926
|
+
|
|
927
|
+
Agent memory should be judged by whether it changes future tool actions. Audrey implements that claim for local agents through a host-side memory controller that runs before tool use and returns an auditable \texttt{allow}, \texttt{warn}, or \texttt{block} decision with evidence.
|
|
928
|
+
|
|
929
|
+
The implemented contribution is Audrey Guard: a local-first loop that observes tool outcomes, redacts traces, retrieves hybrid memory, builds bounded capsules, scores preflight risk, generates reflexes, blocks exact repeated failures, records post-action outcomes, and reports validation-linked impact (Ledger: E1-E17, E25-E26, E29-E42). The specified contribution is GuardBench: a scenario manifest, baseline set, metric suite, and reproducibility contract for evaluating memory by action effect rather than retrieved-text relevance.
|
|
930
|
+
|
|
931
|
+
The Stage-A evidence is intentionally bounded. This paper reports source-linked implementation evidence, the canonical 0.22.2 performance snapshot, the current behavioral regression gate output, a local comparative GuardBench run, and a deterministic repeated-failure demo transcript (Ledger: E20-E25, E41-E42, E46). It also includes the external adapter contract plus Mem0 and Zep evidence-bundle paths, but it does not report full external-system GuardBench results.
|
|
932
|
+
|
|
933
|
+
The v2 paper should run live external adapters, publish raw per-scenario output bundles, run expanded redaction sweeps, and report guard-overhead p50/p95 under machine-provenance controls.
|
|
934
|
+
|
|
935
|
+
The core result is simple: Audrey saw the agent fail once. Audrey stopped it from failing twice.
|
|
936
|
+
|
|
937
|
+
\section{References}
|
|
938
|
+
|
|
939
|
+
References are maintained in \texttt{references.bib} (references.bib) and rendered to bibliography format at LaTeX-conversion time. This Markdown master is intended to be paired with that BibTeX file for pandoc/LaTeX conversion rather than inline-rendering bibliography entries here.
|
|
940
|
+
|
|
941
|
+
\section{Working-Draft Audit Note}
|
|
942
|
+
|
|
943
|
+
In-section evidence-ledger references are intentionally preserved in this v1 Markdown master as a working audit trail. The final submission polish pass should remove ledger references from prose after GuardBench Stage-B numbers land and the claim set is stable.
|
|
944
|
+
|
|
945
|
+
|
|
946
|
+
\bibliographystyle{plain}
|
|
947
|
+
\bibliography{references}
|
|
948
|
+
|
|
949
|
+
\end{document}
|