audrey 0.21.0 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (346)
  1. package/CHANGELOG.md +238 -0
  2. package/LICENSE +21 -21
  3. package/README.md +281 -33
  4. package/SECURITY.md +30 -0
  5. package/benchmarks/adapter-kit.mjs +20 -0
  6. package/benchmarks/adapter-self-test.mjs +166 -0
  7. package/benchmarks/adapters/example-allow.mjs +28 -0
  8. package/benchmarks/adapters/mem0-platform.mjs +267 -0
  9. package/benchmarks/adapters/registry.json +51 -0
  10. package/benchmarks/adapters/zep-cloud.mjs +280 -0
  11. package/benchmarks/baselines.js +169 -0
  12. package/benchmarks/build-leaderboard.mjs +170 -0
  13. package/benchmarks/cases.js +537 -0
  14. package/benchmarks/create-conformance-card.mjs +139 -0
  15. package/benchmarks/create-submission-bundle.mjs +176 -0
  16. package/benchmarks/dry-run-external-adapters.mjs +165 -0
  17. package/benchmarks/guardbench.js +1035 -0
  18. package/benchmarks/output/adapter-self-test/guardbench-adapter-self-test.json +50 -0
  19. package/benchmarks/output/external/guardbench-external-dry-run.json +69 -0
  20. package/benchmarks/output/external/guardbench-external-evidence.json +56 -0
  21. package/benchmarks/output/guardbench-conformance-card.json +63 -0
  22. package/benchmarks/output/guardbench-manifest.json +414 -0
  23. package/benchmarks/output/guardbench-raw.json +1171 -0
  24. package/benchmarks/output/guardbench-summary.json +1981 -0
  25. package/benchmarks/output/leaderboard/guardbench-leaderboard.json +93 -0
  26. package/benchmarks/output/leaderboard/guardbench-leaderboard.md +7 -0
  27. package/benchmarks/output/submission-bundle/guardbench-conformance-card.json +63 -0
  28. package/benchmarks/output/submission-bundle/guardbench-manifest.json +414 -0
  29. package/benchmarks/output/submission-bundle/guardbench-raw.json +1171 -0
  30. package/benchmarks/output/submission-bundle/guardbench-summary.json +1981 -0
  31. package/benchmarks/output/submission-bundle/schemas/guardbench-adapter-registry.schema.json +69 -0
  32. package/benchmarks/output/submission-bundle/schemas/guardbench-adapter-self-test.schema.json +156 -0
  33. package/benchmarks/output/submission-bundle/schemas/guardbench-conformance-card.schema.json +184 -0
  34. package/benchmarks/output/submission-bundle/schemas/guardbench-external-dry-run.schema.json +74 -0
  35. package/benchmarks/output/submission-bundle/schemas/guardbench-external-evidence.schema.json +108 -0
  36. package/benchmarks/output/submission-bundle/schemas/guardbench-external-run.schema.json +160 -0
  37. package/benchmarks/output/submission-bundle/schemas/guardbench-leaderboard.schema.json +179 -0
  38. package/benchmarks/output/submission-bundle/schemas/guardbench-manifest.schema.json +213 -0
  39. package/benchmarks/output/submission-bundle/schemas/guardbench-publication-verification.schema.json +47 -0
  40. package/benchmarks/output/submission-bundle/schemas/guardbench-raw.schema.json +164 -0
  41. package/benchmarks/output/submission-bundle/schemas/guardbench-submission-manifest.schema.json +151 -0
  42. package/benchmarks/output/submission-bundle/schemas/guardbench-summary.schema.json +228 -0
  43. package/benchmarks/output/submission-bundle/submission-manifest.json +131 -0
  44. package/benchmarks/output/submission-bundle/validation-report.json +31 -0
  45. package/benchmarks/output/summary.json +2354 -0
  46. package/benchmarks/perf-snapshot.js +304 -0
  47. package/benchmarks/perf.bench.js +161 -0
  48. package/benchmarks/public-paths.mjs +78 -0
  49. package/benchmarks/reference-results.js +70 -0
  50. package/benchmarks/report.js +259 -0
  51. package/benchmarks/run-external-guardbench.mjs +281 -0
  52. package/benchmarks/run.js +682 -0
  53. package/benchmarks/schemas/guardbench-adapter-registry.schema.json +69 -0
  54. package/benchmarks/schemas/guardbench-adapter-self-test.schema.json +156 -0
  55. package/benchmarks/schemas/guardbench-conformance-card.schema.json +184 -0
  56. package/benchmarks/schemas/guardbench-external-dry-run.schema.json +74 -0
  57. package/benchmarks/schemas/guardbench-external-evidence.schema.json +108 -0
  58. package/benchmarks/schemas/guardbench-external-run.schema.json +160 -0
  59. package/benchmarks/schemas/guardbench-leaderboard.schema.json +179 -0
  60. package/benchmarks/schemas/guardbench-manifest.schema.json +213 -0
  61. package/benchmarks/schemas/guardbench-publication-verification.schema.json +47 -0
  62. package/benchmarks/schemas/guardbench-raw.schema.json +164 -0
  63. package/benchmarks/schemas/guardbench-submission-manifest.schema.json +151 -0
  64. package/benchmarks/schemas/guardbench-summary.schema.json +228 -0
  65. package/benchmarks/snapshots/perf-0.22.2.json +123 -0
  66. package/benchmarks/snapshots/perf-0.23.0.json +123 -0
  67. package/benchmarks/validate-adapter-module.mjs +104 -0
  68. package/benchmarks/validate-adapter-registry.mjs +134 -0
  69. package/benchmarks/validate-adapter-self-test.mjs +96 -0
  70. package/benchmarks/validate-guardbench-artifacts.mjs +343 -0
  71. package/benchmarks/verify-external-evidence.mjs +296 -0
  72. package/benchmarks/verify-publication-artifacts.mjs +286 -0
  73. package/benchmarks/verify-submission-bundle.mjs +167 -0
  74. package/dist/mcp-server/config.d.ts +5 -4
  75. package/dist/mcp-server/config.d.ts.map +1 -1
  76. package/dist/mcp-server/config.js +6 -8
  77. package/dist/mcp-server/config.js.map +1 -1
  78. package/dist/mcp-server/index.d.ts +281 -23
  79. package/dist/mcp-server/index.d.ts.map +1 -1
  80. package/dist/mcp-server/index.js +1186 -82
  81. package/dist/mcp-server/index.js.map +1 -1
  82. package/dist/src/action-key.d.ts +9 -0
  83. package/dist/src/action-key.d.ts.map +1 -0
  84. package/dist/src/action-key.js +49 -0
  85. package/dist/src/action-key.js.map +1 -0
  86. package/dist/src/adaptive.d.ts.map +1 -1
  87. package/dist/src/adaptive.js +8 -6
  88. package/dist/src/adaptive.js.map +1 -1
  89. package/dist/src/affect.d.ts +4 -1
  90. package/dist/src/affect.d.ts.map +1 -1
  91. package/dist/src/affect.js +14 -12
  92. package/dist/src/affect.js.map +1 -1
  93. package/dist/src/audrey.d.ts +57 -4
  94. package/dist/src/audrey.d.ts.map +1 -1
  95. package/dist/src/audrey.js +512 -65
  96. package/dist/src/audrey.js.map +1 -1
  97. package/dist/src/capsule.d.ts +2 -1
  98. package/dist/src/capsule.d.ts.map +1 -1
  99. package/dist/src/capsule.js +18 -8
  100. package/dist/src/capsule.js.map +1 -1
  101. package/dist/src/causal.d.ts.map +1 -1
  102. package/dist/src/causal.js +23 -5
  103. package/dist/src/causal.js.map +1 -1
  104. package/dist/src/confidence.d.ts.map +1 -1
  105. package/dist/src/confidence.js +3 -0
  106. package/dist/src/confidence.js.map +1 -1
  107. package/dist/src/consolidate.d.ts +1 -0
  108. package/dist/src/consolidate.d.ts.map +1 -1
  109. package/dist/src/consolidate.js +70 -54
  110. package/dist/src/consolidate.js.map +1 -1
  111. package/dist/src/controller.d.ts +94 -0
  112. package/dist/src/controller.d.ts.map +1 -0
  113. package/dist/src/controller.js +350 -0
  114. package/dist/src/controller.js.map +1 -0
  115. package/dist/src/db.d.ts.map +1 -1
  116. package/dist/src/db.js +181 -169
  117. package/dist/src/db.js.map +1 -1
  118. package/dist/src/decay.d.ts.map +1 -1
  119. package/dist/src/decay.js +62 -55
  120. package/dist/src/decay.js.map +1 -1
  121. package/dist/src/embedding.d.ts +2 -1
  122. package/dist/src/embedding.d.ts.map +1 -1
  123. package/dist/src/embedding.js +60 -22
  124. package/dist/src/embedding.js.map +1 -1
  125. package/dist/src/encode.d.ts +9 -2
  126. package/dist/src/encode.d.ts.map +1 -1
  127. package/dist/src/encode.js +25 -12
  128. package/dist/src/encode.js.map +1 -1
  129. package/dist/src/export.d.ts.map +1 -1
  130. package/dist/src/export.js +5 -3
  131. package/dist/src/export.js.map +1 -1
  132. package/dist/src/feedback.d.ts +35 -0
  133. package/dist/src/feedback.d.ts.map +1 -0
  134. package/dist/src/feedback.js +129 -0
  135. package/dist/src/feedback.js.map +1 -0
  136. package/dist/src/forget.d.ts.map +1 -1
  137. package/dist/src/forget.js +68 -60
  138. package/dist/src/forget.js.map +1 -1
  139. package/dist/src/fts.js +1 -1
  140. package/dist/src/fts.js.map +1 -1
  141. package/dist/src/hybrid-recall.d.ts +2 -1
  142. package/dist/src/hybrid-recall.d.ts.map +1 -1
  143. package/dist/src/hybrid-recall.js +41 -32
  144. package/dist/src/hybrid-recall.js.map +1 -1
  145. package/dist/src/impact.d.ts +47 -0
  146. package/dist/src/impact.d.ts.map +1 -0
  147. package/dist/src/impact.js +146 -0
  148. package/dist/src/impact.js.map +1 -0
  149. package/dist/src/import.d.ts +177 -1
  150. package/dist/src/import.d.ts.map +1 -1
  151. package/dist/src/import.js +235 -46
  152. package/dist/src/import.js.map +1 -1
  153. package/dist/src/index.d.ts +5 -1
  154. package/dist/src/index.d.ts.map +1 -1
  155. package/dist/src/index.js +3 -1
  156. package/dist/src/index.js.map +1 -1
  157. package/dist/src/interference.d.ts +5 -2
  158. package/dist/src/interference.d.ts.map +1 -1
  159. package/dist/src/interference.js +39 -32
  160. package/dist/src/interference.js.map +1 -1
  161. package/dist/src/introspect.js +18 -18
  162. package/dist/src/llm.d.ts.map +1 -1
  163. package/dist/src/llm.js +1 -0
  164. package/dist/src/llm.js.map +1 -1
  165. package/dist/src/migrate.d.ts.map +1 -1
  166. package/dist/src/migrate.js +21 -9
  167. package/dist/src/migrate.js.map +1 -1
  168. package/dist/src/preflight.d.ts +2 -1
  169. package/dist/src/preflight.d.ts.map +1 -1
  170. package/dist/src/preflight.js +66 -5
  171. package/dist/src/preflight.js.map +1 -1
  172. package/dist/src/profile.d.ts +23 -0
  173. package/dist/src/profile.d.ts.map +1 -0
  174. package/dist/src/profile.js +51 -0
  175. package/dist/src/profile.js.map +1 -0
  176. package/dist/src/promote.d.ts.map +1 -1
  177. package/dist/src/promote.js +8 -9
  178. package/dist/src/promote.js.map +1 -1
  179. package/dist/src/prompts.d.ts.map +1 -1
  180. package/dist/src/prompts.js +165 -136
  181. package/dist/src/prompts.js.map +1 -1
  182. package/dist/src/recall.d.ts +9 -6
  183. package/dist/src/recall.d.ts.map +1 -1
  184. package/dist/src/recall.js +204 -62
  185. package/dist/src/recall.js.map +1 -1
  186. package/dist/src/redact.d.ts +7 -1
  187. package/dist/src/redact.d.ts.map +1 -1
  188. package/dist/src/redact.js +94 -11
  189. package/dist/src/redact.js.map +1 -1
  190. package/dist/src/reflexes.d.ts +1 -0
  191. package/dist/src/reflexes.d.ts.map +1 -1
  192. package/dist/src/reflexes.js +3 -0
  193. package/dist/src/reflexes.js.map +1 -1
  194. package/dist/src/rollback.d.ts.map +1 -1
  195. package/dist/src/rollback.js +13 -8
  196. package/dist/src/rollback.js.map +1 -1
  197. package/dist/src/routes.d.ts +1 -0
  198. package/dist/src/routes.d.ts.map +1 -1
  199. package/dist/src/routes.js +251 -6
  200. package/dist/src/routes.js.map +1 -1
  201. package/dist/src/rules-compiler.d.ts.map +1 -1
  202. package/dist/src/rules-compiler.js +36 -6
  203. package/dist/src/rules-compiler.js.map +1 -1
  204. package/dist/src/server.d.ts +2 -1
  205. package/dist/src/server.d.ts.map +1 -1
  206. package/dist/src/server.js +42 -4
  207. package/dist/src/server.js.map +1 -1
  208. package/dist/src/tool-trace.d.ts.map +1 -1
  209. package/dist/src/tool-trace.js +42 -29
  210. package/dist/src/tool-trace.js.map +1 -1
  211. package/dist/src/types.d.ts +28 -1
  212. package/dist/src/types.d.ts.map +1 -1
  213. package/dist/src/ulid.d.ts.map +1 -1
  214. package/dist/src/ulid.js +52 -2
  215. package/dist/src/ulid.js.map +1 -1
  216. package/dist/src/utils.d.ts.map +1 -1
  217. package/dist/src/utils.js +8 -1
  218. package/dist/src/utils.js.map +1 -1
  219. package/dist/src/validate.d.ts +2 -0
  220. package/dist/src/validate.d.ts.map +1 -1
  221. package/dist/src/validate.js +77 -46
  222. package/dist/src/validate.js.map +1 -1
  223. package/docs/AUDREY_PAPER_OUTLINE.md +175 -0
  224. package/docs/MEMORY_BENCHMARKING.md +59 -0
  225. package/docs/PRODUCTION_BACKLOG.md +304 -0
  226. package/docs/paper/00-master.md +48 -0
  227. package/docs/paper/01-introduction.md +27 -0
  228. package/docs/paper/02-related-work.md +47 -0
  229. package/docs/paper/03-problem-definition.md +108 -0
  230. package/docs/paper/04-design.md +164 -0
  231. package/docs/paper/05-guardbench-spec.md +412 -0
  232. package/docs/paper/06-implementation.md +113 -0
  233. package/docs/paper/07-evaluation.md +168 -0
  234. package/docs/paper/08-discussion-limitations.md +61 -0
  235. package/docs/paper/09-conclusion.md +11 -0
  236. package/docs/paper/SUBMISSION_README.md +162 -0
  237. package/docs/paper/appendix-a-demo-transcript.md +114 -0
  238. package/docs/paper/arxiv-compile-report.schema.json +116 -0
  239. package/docs/paper/arxiv-source.schema.json +61 -0
  240. package/docs/paper/audrey-paper-v1.md +1106 -0
  241. package/docs/paper/browser-launch-plan.json +209 -0
  242. package/docs/paper/browser-launch-plan.schema.json +100 -0
  243. package/docs/paper/browser-launch-results.json +86 -0
  244. package/docs/paper/browser-launch-results.schema.json +66 -0
  245. package/docs/paper/claim-register.json +138 -0
  246. package/docs/paper/claim-register.schema.json +81 -0
  247. package/docs/paper/evidence-ledger.md +103 -0
  248. package/docs/paper/output/arxiv/README-arxiv.txt +8 -0
  249. package/docs/paper/output/arxiv/arxiv-manifest.json +41 -0
  250. package/docs/paper/output/arxiv/main.tex +949 -0
  251. package/docs/paper/output/arxiv/references.bib +222 -0
  252. package/docs/paper/output/arxiv-compile-report.json +24 -0
  253. package/docs/paper/output/submission-bundle/LICENSE +21 -0
  254. package/docs/paper/output/submission-bundle/README.md +533 -0
  255. package/docs/paper/output/submission-bundle/benchmarks/output/adapter-self-test/guardbench-adapter-self-test.json +50 -0
  256. package/docs/paper/output/submission-bundle/benchmarks/output/external/guardbench-external-dry-run.json +69 -0
  257. package/docs/paper/output/submission-bundle/benchmarks/output/external/guardbench-external-evidence.json +56 -0
  258. package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-conformance-card.json +63 -0
  259. package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-manifest.json +414 -0
  260. package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-raw.json +1171 -0
  261. package/docs/paper/output/submission-bundle/benchmarks/output/guardbench-summary.json +1981 -0
  262. package/docs/paper/output/submission-bundle/benchmarks/output/leaderboard/guardbench-leaderboard.json +93 -0
  263. package/docs/paper/output/submission-bundle/benchmarks/output/leaderboard/guardbench-leaderboard.md +7 -0
  264. package/docs/paper/output/submission-bundle/benchmarks/output/submission-bundle/submission-manifest.json +131 -0
  265. package/docs/paper/output/submission-bundle/benchmarks/output/submission-bundle/validation-report.json +31 -0
  266. package/docs/paper/output/submission-bundle/benchmarks/output/summary.json +2354 -0
  267. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-adapter-registry.schema.json +69 -0
  268. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-adapter-self-test.schema.json +156 -0
  269. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-conformance-card.schema.json +184 -0
  270. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-dry-run.schema.json +74 -0
  271. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-evidence.schema.json +108 -0
  272. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-external-run.schema.json +160 -0
  273. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-leaderboard.schema.json +179 -0
  274. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-manifest.schema.json +213 -0
  275. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-publication-verification.schema.json +47 -0
  276. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-raw.schema.json +164 -0
  277. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-submission-manifest.schema.json +151 -0
  278. package/docs/paper/output/submission-bundle/benchmarks/schemas/guardbench-summary.schema.json +228 -0
  279. package/docs/paper/output/submission-bundle/docs/AUDREY_PAPER_OUTLINE.md +175 -0
  280. package/docs/paper/output/submission-bundle/docs/paper/00-master.md +48 -0
  281. package/docs/paper/output/submission-bundle/docs/paper/01-introduction.md +27 -0
  282. package/docs/paper/output/submission-bundle/docs/paper/02-related-work.md +47 -0
  283. package/docs/paper/output/submission-bundle/docs/paper/03-problem-definition.md +108 -0
  284. package/docs/paper/output/submission-bundle/docs/paper/04-design.md +164 -0
  285. package/docs/paper/output/submission-bundle/docs/paper/05-guardbench-spec.md +412 -0
  286. package/docs/paper/output/submission-bundle/docs/paper/06-implementation.md +113 -0
  287. package/docs/paper/output/submission-bundle/docs/paper/07-evaluation.md +168 -0
  288. package/docs/paper/output/submission-bundle/docs/paper/08-discussion-limitations.md +61 -0
  289. package/docs/paper/output/submission-bundle/docs/paper/09-conclusion.md +11 -0
  290. package/docs/paper/output/submission-bundle/docs/paper/SUBMISSION_README.md +162 -0
  291. package/docs/paper/output/submission-bundle/docs/paper/appendix-a-demo-transcript.md +114 -0
  292. package/docs/paper/output/submission-bundle/docs/paper/arxiv-compile-report.schema.json +116 -0
  293. package/docs/paper/output/submission-bundle/docs/paper/arxiv-source.schema.json +61 -0
  294. package/docs/paper/output/submission-bundle/docs/paper/audrey-paper-v1.md +1106 -0
  295. package/docs/paper/output/submission-bundle/docs/paper/browser-launch-plan.json +209 -0
  296. package/docs/paper/output/submission-bundle/docs/paper/browser-launch-plan.schema.json +100 -0
  297. package/docs/paper/output/submission-bundle/docs/paper/browser-launch-results.json +86 -0
  298. package/docs/paper/output/submission-bundle/docs/paper/browser-launch-results.schema.json +66 -0
  299. package/docs/paper/output/submission-bundle/docs/paper/claim-register.json +138 -0
  300. package/docs/paper/output/submission-bundle/docs/paper/claim-register.schema.json +81 -0
  301. package/docs/paper/output/submission-bundle/docs/paper/evidence-ledger.md +103 -0
  302. package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/README-arxiv.txt +8 -0
  303. package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/arxiv-manifest.json +41 -0
  304. package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/main.tex +949 -0
  305. package/docs/paper/output/submission-bundle/docs/paper/output/arxiv/references.bib +222 -0
  306. package/docs/paper/output/submission-bundle/docs/paper/output/arxiv-compile-report.json +24 -0
  307. package/docs/paper/output/submission-bundle/docs/paper/paper-submission-bundle.schema.json +70 -0
  308. package/docs/paper/output/submission-bundle/docs/paper/publication-pack.json +81 -0
  309. package/docs/paper/output/submission-bundle/docs/paper/publication-pack.schema.json +60 -0
  310. package/docs/paper/output/submission-bundle/docs/paper/references.bib +222 -0
  311. package/docs/paper/output/submission-bundle/package.json +212 -0
  312. package/docs/paper/output/submission-bundle/paper-submission-manifest.json +379 -0
  313. package/docs/paper/paper-submission-bundle.schema.json +70 -0
  314. package/docs/paper/publication-pack.json +81 -0
  315. package/docs/paper/publication-pack.schema.json +60 -0
  316. package/docs/paper/references.bib +222 -0
  317. package/package.json +103 -26
  318. package/scripts/audit-release-completion.mjs +362 -0
  319. package/scripts/create-arxiv-source.mjs +362 -0
  320. package/scripts/create-paper-submission-bundle.mjs +210 -0
  321. package/scripts/finalize-release.mjs +526 -0
  322. package/scripts/prepare-release-cut.mjs +269 -0
  323. package/scripts/publish-release-bundle.mjs +209 -0
  324. package/scripts/publish-release-github-api.mjs +429 -0
  325. package/scripts/run-vitest.mjs +34 -0
  326. package/scripts/smoke-cli.js +72 -0
  327. package/scripts/sync-paper-artifacts.mjs +109 -0
  328. package/scripts/verify-arxiv-compile.mjs +440 -0
  329. package/scripts/verify-arxiv-source.mjs +194 -0
  330. package/scripts/verify-browser-launch-plan.mjs +237 -0
  331. package/scripts/verify-browser-launch-results.mjs +285 -0
  332. package/scripts/verify-paper-artifacts.mjs +338 -0
  333. package/scripts/verify-paper-claims.mjs +226 -0
  334. package/scripts/verify-paper-submission-bundle.mjs +207 -0
  335. package/scripts/verify-publication-pack.mjs +196 -0
  336. package/scripts/verify-python-package.py +201 -0
  337. package/scripts/verify-release-readiness.mjs +741 -0
  338. package/docs/assets/benchmarks/local-benchmark.svg +0 -45
  339. package/docs/assets/benchmarks/operations-benchmark.svg +0 -45
  340. package/docs/assets/benchmarks/published-memory-standards.svg +0 -50
  341. package/docs/audrey-for-dummies.md +0 -670
  342. package/docs/benchmarking.md +0 -151
  343. package/docs/future-of-llm-memory.md +0 -452
  344. package/docs/mcp-hosts.md +0 -206
  345. package/docs/ollama-local-agents.md +0 -128
  346. package/docs/production-readiness.md +0 -128
@@ -1 +1 @@
1
- {"version":3,"file":"validate.js","sourceRoot":"","sources":["../../src/validate.ts"],"names":[],"mappings":"AAEA,OAAO,EAAE,UAAU,EAAE,MAAM,WAAW,CAAC;AACvC,OAAO,EAAE,aAAa,EAAE,MAAM,YAAY,CAAC;AAC3C,OAAO,EAAE,iCAAiC,EAAE,MAAM,cAAc,CAAC;AAEjE,MAAM,uBAAuB,GAAG,IAAI,CAAC;AACrC,MAAM,uBAAuB,GAAG,IAAI,CAAC;AAkBrC,MAAM,CAAC,KAAK,UAAU,cAAc,CAClC,EAAqB,EACrB,iBAAoC,EACpC,OAAwD,EACxD,UAII,EAAE;IAEN,MAAM,EACJ,SAAS,GAAG,uBAAuB,EACnC,sBAAsB,GAAG,uBAAuB,EAChD,WAAW,GACZ,GAAG,OAAO,CAAC;IAEZ,MAAM,aAAa,GAAG,MAAM,iBAAiB,CAAC,KAAK,CAAC,OAAO,CAAC,OAAO,CAAC,CAAC;IACrE,MAAM,aAAa,GAAG,iBAAiB,CAAC,cAAc,CAAC,aAAa,CAAC,CAAC;IAEtE,MAAM,eAAe,GAAG,EAAE,CAAC,OAAO,CAAC;;;;;;;GAOlC,CAAC,CAAC,GAAG,CAAC,aAAa,CAAuC,CAAC;IAE5D,IAAI,SAAS,GAAkC,IAAI,CAAC;IACpD,IAAI,cAAc,GAAG,CAAC,CAAC;IAEvB,IAAI,eAAe,EAAE,CAAC;QACpB,SAAS,GAAG,eAAe,CAAC;QAC5B,cAAc,GAAG,eAAe,CAAC,UAAU,CAAC;IAC9C,CAAC;IAED,IAAI,SAAS,IAAI,cAAc,IAAI,SAAS,EAAE,CAAC;QAC7C,MAAM,WAAW,GAAG,aAAa,CAAW,SAAS,CAAC,oBAAoB,EAAE,EAAE,CAAC,CAAC;QAChF,IAAI,CAAC,WAAW,CAAC,QAAQ,CAAC,OAAO,CAAC,EAAE,CAAC,EAAE,CAAC;YACtC,WAAW,CAAC,IAAI,CAAC,OAAO,CAAC,EAAE,CAAC,CAAC;QAC/B,CAAC;QAED,MAAM,SAAS,GAAG,sBAAsB,CAAC,EAAE,EAAE,WAAW,EAAE,OAAO,CAAC,CAAC;QAEnE,MAAM,GAAG,GAAG,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE,CAAC;QACrC,EAAE,CAAC,OAAO,CAAC;;;;;;;;KAQV,CAAC,CAAC,GAAG,CACJ,IAAI,CAAC,SAAS,CAAC,WAAW,CAAC,EAC3B,WAAW,CAAC,MAAM,EAClB,SAAS,EACT,GAAG,EACH,SAAS,CAAC,EAAE,CACb,CAAC;QAEF,OAAO;YACL,MAAM,EAAE,YAAY;YACpB,UAAU,EAAE,SAAS,CAAC,EAAE;YACxB,UAAU,EAAE,cAAc;SAC3B,CAAC;IACJ,CAAC;IAED,IAAI,SAAS,IAAI,cAAc,IAAI,sBAAsB,IAAI,WAAW,EAAE,CAAC;QACzE,MAAM,QAAQ,GAAG,iCAAiC,CAAC,OAAO,CAAC,OAAO,EAAE,SAAS,CAAC,OAAO,CAAC,CAAC;QACvF,MAAM,OAAO,GAAG,MAAM,WAAW,CAAC,IAAI,CAAC,QAAQ,CAK9C,CAAC;QAEF,IAAI,OAAO,CAAC,WAAW,EAAE,CAAC;YACxB,MAAM,UAAU,GAAG,OAAO,CAAC,UAAU,KAAK,mBAAmB;gBAC3D,CAAC,CAAC,EAAE,IAAI,EAAE,mBAAmB,EAAE,UAAU,EAAE,OAAO,CAAC,UAAU,EAAE,WAAW,EAAE,OAAO,CAAC,WAAW,EAAE;gBACjG,CAAC,CAAC,OAAO,CAAC,UAAU;oBAClB,CAAC,CAAC,EAAE,IAAI,EAAE,OAAO,CAAC,UAAU,EAAE,WAAW,EAAE,OAAO,CAAC,WAAW,EAAE;oBAChE,CAAC,CAAC,IAA
I,CAAC;YAEX,MAAM,eAAe,GAAG,mBAAmB,CACzC,EAAE,EACF,SAAS,CAAC,EAAE,EACZ,UAAU,EACV,OAAO,CAAC,EAAE,EACV,UAAU,EACV,UAAU,CACX,CAAC;YAEF,IAAI,OAAO,CAAC,UAAU,KAAK,UAAU,EAAE,CAAC;gBACtC,EAAE,CAAC,OAAO,CAAC,sDAAsD,CAAC,CAAC,GAAG,CAAC,SAAS,CAAC,EAAE,CAAC,CAAC;YACvF,CAAC;iBAAM,IAAI,OAAO,CAAC,UAAU,KAAK,mBAAmB,IAAI,OAAO,CAAC,UAAU,EAAE,CAAC;gBAC5E,EAAE,CAAC,OAAO,CAAC,+EAA+E,CAAC;qBACxF,GAAG,CAAC,IAAI,CAAC,SAAS,CAAC,OAAO,CAAC,UAAU,CAAC,EAAE,SAAS,CAAC,EAAE,CAAC,CAAC;YAC3D,CAAC;YAED,OAAO;gBACL,MAAM,EAAE,eAAe;gBACvB,eAAe;gBACf,UAAU,EAAE,SAAS,CAAC,EAAE;gBACxB,UAAU,EAAE,cAAc;gBAC1B,UAAU,EAAE,OAAO,CAAC,UAAU,IAAI,IAAI;aACvC,CAAC;QACJ,CAAC;IACH,CAAC;IAED,OAAO,EAAE,MAAM,EAAE,MAAM,EAAE,CAAC;AAC5B,CAAC;AAED,SAAS,sBAAsB,CAC7B,EAAqB,EACrB,WAAqB,EACrB,cAAkC;IAElC,MAAM,WAAW,GAAG,IAAI,GAAG,EAAU,CAAC;IACtC,WAAW,CAAC,GAAG,CAAC,cAAc,CAAC,MAAM,CAAC,CAAC;IAEvC,IAAI,WAAW,CAAC,MAAM,GAAG,CAAC,EAAE,CAAC;QAC3B,MAAM,YAAY,GAAG,WAAW,CAAC,GAAG,CAAC,GAAG,EAAE,CAAC,GAAG,CAAC,CAAC,IAAI,CAAC,GAAG,CAAC,CAAC;QAC1D,MAAM,IAAI,GAAG,EAAE,CAAC,OAAO,CACrB,qDAAqD,YAAY,GAAG,CACrE,CAAC,GAAG,CAAC,GAAG,WAAW,CAAgB,CAAC;QACrC,KAAK,MAAM,GAAG,IAAI,IAAI,EAAE,CAAC;YACvB,WAAW,CAAC,GAAG,CAAC,GAAG,CAAC,MAAM,CAAC,CAAC;QAC9B,CAAC;IACH,CAAC;IAED,OAAO,WAAW,CAAC,IAAI,CAAC;AAC1B,CAAC;AAED,MAAM,UAAU,mBAAmB,CACjC,EAAqB,EACrB,QAAgB,EAChB,UAAkB,EAClB,QAAgB,EAChB,UAAkB,EAClB,UAAyB;IAEzB,MAAM,EAAE,GAAG,UAAU,EAAE,CAAC;IACxB,MAAM,GAAG,GAAG,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE,CAAC;IAErC,MAAM,KAAK,GAAG,UAAU,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,CAAC,MAAM,CAAC;IAC/C,MAAM,UAAU,GAAG,UAAU,CAAC,CAAC,CAAC,GAAG,CAAC,CAAC,CAAC,IAAI,CAAC;IAC3C,MAAM,cAAc,GAAG,UAAU,CAAC,CAAC,CAAC,IAAI,CAAC,SAAS,CAAC,UAAU,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC;IAEtE,EAAE,CAAC,OAAO,CAAC;;;;GAIV,CAAC,CAAC,GAAG,CAAC,EAAE,EAAE,QAAQ,EAAE,UAAU,EAAE,QAAQ,EAAE,UAAU,EAAE,KAAK,EAAE,cAAc,EAAE,UAAU,EAAE,GAAG,CAAC,CAAC;IAE/F,OAAO,EAAE,CAAC;AACZ,CAAC;AAED,MAAM,UAAU,mBAAmB,CAAC,EAAqB,EAAE,eAAuB,EAAE,aAAqB;IACvG,MAAM,GAAG,GAAG,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE,CAAC;IACrC,EAAE,CAAC,OAAO,CAAC;;;;;;GAMV,CAAC,CAAC,GAAG,CAAC,aA
Aa,EAAE,GAAG,EAAE,eAAe,CAAC,CAAC;AAC9C,CAAC"}
1
+ {"version":3,"file":"validate.js","sourceRoot":"","sources":["../../src/validate.ts"],"names":[],"mappings":"AAEA,OAAO,EAAE,UAAU,EAAE,MAAM,WAAW,CAAC;AACvC,OAAO,EAAE,aAAa,EAAE,MAAM,YAAY,CAAC;AAC3C,OAAO,EAAE,iCAAiC,EAAE,MAAM,cAAc,CAAC;AAEjE,MAAM,uBAAuB,GAAG,IAAI,CAAC;AACrC,MAAM,uBAAuB,GAAG,IAAI,CAAC;AAkBrC,MAAM,CAAC,KAAK,UAAU,cAAc,CAClC,EAAqB,EACrB,iBAAoC,EACpC,OAAwD,EACxD,UAMI,EAAE;IAEN,MAAM,EACJ,SAAS,GAAG,uBAAuB,EACnC,sBAAsB,GAAG,uBAAuB,EAChD,WAAW,EACX,eAAe,EACf,eAAe,GAChB,GAAG,OAAO,CAAC;IAEZ,MAAM,aAAa,GAAG,eAAe,IAAI,iBAAiB,CAAC,cAAc,CACvE,eAAe,IAAI,MAAM,iBAAiB,CAAC,KAAK,CAAC,OAAO,CAAC,OAAO,CAAC,CAClE,CAAC;IAEF,MAAM,eAAe,GAAG,EAAE,CAAC,OAAO,CAAC;;;;;;;GAOlC,CAAC,CAAC,GAAG,CAAC,aAAa,CAAuC,CAAC;IAE5D,IAAI,SAAS,GAAkC,IAAI,CAAC;IACpD,IAAI,cAAc,GAAG,CAAC,CAAC;IAEvB,IAAI,eAAe,EAAE,CAAC;QACpB,SAAS,GAAG,eAAe,CAAC;QAC5B,cAAc,GAAG,eAAe,CAAC,UAAU,CAAC;IAC9C,CAAC;IAED,IAAI,SAAS,IAAI,cAAc,IAAI,SAAS,EAAE,CAAC;QAC7C,MAAM,OAAO,GAAG,SAAS,CAAC,EAAE,CAAC;QAC7B,MAAM,SAAS,GAAG,EAAE,CAAC,WAAW,CAAC,GAAG,EAAE;YACpC,mFAAmF;YACnF,MAAM,OAAO,GAAG,EAAE,CAAC,OAAO,CACxB,yDAAyD,CAC1D,CAAC,GAAG,CAAC,OAAO,CAAwD,CAAC;YACtE,MAAM,QAAQ,GAAG,aAAa,CAC5B,OAAO,EAAE,oBAAoB,IAAI,IAAI,EACrC,EAAE,CACH,CAAC;YACF,MAAM,QAAQ,GAAG,CAAC,QAAQ,CAAC,QAAQ,CAAC,OAAO,CAAC,EAAE,CAAC,CAAC;YAChD,IAAI,QAAQ,EAAE,CAAC;gBACb,QAAQ,CAAC,IAAI,CAAC,OAAO,CAAC,EAAE,CAAC,CAAC;YAC5B,CAAC;YACD,MAAM,SAAS,GAAG,sBAAsB,CAAC,EAAE,EAAE,QAAQ,EAAE,OAAO,CAAC,CAAC;YAChE,MAAM,GAAG,GAAG,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE,CAAC;YACrC,yEAAyE;YACzE,qEAAqE;YACrE,EAAE,CAAC,OAAO,CAAC;;;;;;;;OAQV,CAAC,CAAC,GAAG,CACJ,QAAQ,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC,EAChB,IAAI,CAAC,SAAS,CAAC,QAAQ,CAAC,EACxB,QAAQ,CAAC,MAAM,EACf,SAAS,EACT,GAAG,EACH,OAAO,CACR,CAAC;QACJ,CAAC,CAAC,CAAC;QACH,SAAS,EAAE,CAAC;QAEZ,OAAO;YACL,MAAM,EAAE,YAAY;YACpB,UAAU,EAAE,OAAO;YACnB,UAAU,EAAE,cAAc;SAC3B,CAAC;IACJ,CAAC;IAED,IAAI,SAAS,IAAI,cAAc,IAAI,sBAAsB,IAAI,WAAW,EAAE,CAAC;QACzE,MAAM,QAAQ,GAAG,iCAAiC,CAAC,OAAO,CAAC,OAAO,EAAE,SAAS,CAAC,OAAO,CAAC,CAAC;QACvF,MAAM,GAAG,GAAG,MAAM,WAAW,CAAC,IAAI,CA
AC,QAAQ,CAAC,CAAC;QAC7C,IAAI,CAAC,GAAG,IAAI,OAAO,GAAG,KAAK,QAAQ,EAAE,CAAC;YACpC,MAAM,IAAI,KAAK,CAAC,kDAAkD,CAAC,CAAC;QACtE,CAAC;QACD,MAAM,SAAS,GAAG,GAA8B,CAAC;QACjD,IAAI,OAAO,SAAS,CAAC,WAAW,KAAK,SAAS,EAAE,CAAC;YAC/C,MAAM,IAAI,KAAK,CAAC,gEAAgE,CAAC,CAAC;QACpF,CAAC;QACD,MAAM,OAAO,GAKT;YACF,WAAW,EAAE,SAAS,CAAC,WAAW;YAClC,UAAU,EAAE,OAAO,SAAS,CAAC,UAAU,KAAK,QAAQ,CAAC,CAAC,CAAC,SAAS,CAAC,UAAU,CAAC,CAAC,CAAC,SAAS;YACvF,UAAU,EACR,SAAS,CAAC,UAAU;gBACpB,OAAO,SAAS,CAAC,UAAU,KAAK,QAAQ;gBACxC,CAAC,KAAK,CAAC,OAAO,CAAC,SAAS,CAAC,UAAU,CAAC;gBACpC,MAAM,CAAC,MAAM,CAAC,SAAS,CAAC,UAAU,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,OAAO,CAAC,KAAK,QAAQ,CAAC;gBACrE,CAAC,CAAE,SAAS,CAAC,UAAqC;gBAClD,CAAC,CAAC,SAAS;YACf,WAAW,EAAE,OAAO,SAAS,CAAC,WAAW,KAAK,QAAQ,CAAC,CAAC,CAAC,SAAS,CAAC,WAAW,CAAC,CAAC,CAAC,SAAS;SAC3F,CAAC;QAEF,IAAI,OAAO,CAAC,WAAW,EAAE,CAAC;YACxB,MAAM,OAAO,GAAG,SAAS,CAAC,EAAE,CAAC;YAC7B,MAAM,UAAU,GAAG,OAAO,CAAC,UAAU,KAAK,mBAAmB;gBAC3D,CAAC,CAAC,EAAE,IAAI,EAAE,mBAAmB,EAAE,UAAU,EAAE,OAAO,CAAC,UAAU,EAAE,WAAW,EAAE,OAAO,CAAC,WAAW,EAAE;gBACjG,CAAC,CAAC,OAAO,CAAC,UAAU;oBAClB,CAAC,CAAC,EAAE,IAAI,EAAE,OAAO,CAAC,UAAU,EAAE,WAAW,EAAE,OAAO,CAAC,WAAW,EAAE;oBAChE,CAAC,CAAC,IAAI,CAAC;YAEX,IAAI,eAAe,GAAG,EAAE,CAAC;YACzB,MAAM,mBAAmB,GAAG,EAAE,CAAC,WAAW,CAAC,GAAG,EAAE;gBAC9C,eAAe,GAAG,mBAAmB,CACnC,EAAE,EACF,OAAO,EACP,UAAU,EACV,OAAO,CAAC,EAAE,EACV,UAAU,EACV,UAAU,CACX,CAAC;gBACF,IAAI,OAAO,CAAC,UAAU,KAAK,UAAU,EAAE,CAAC;oBACtC,EAAE,CAAC,OAAO,CAAC,sDAAsD,CAAC,CAAC,GAAG,CAAC,OAAO,CAAC,CAAC;gBAClF,CAAC;qBAAM,IAAI,OAAO,CAAC,UAAU,KAAK,mBAAmB,IAAI,OAAO,CAAC,UAAU,EAAE,CAAC;oBAC5E,EAAE,CAAC,OAAO,CAAC,+EAA+E,CAAC;yBACxF,GAAG,CAAC,IAAI,CAAC,SAAS,CAAC,OAAO,CAAC,UAAU,CAAC,EAAE,OAAO,CAAC,CAAC;gBACtD,CAAC;YACH,CAAC,CAAC,CAAC;YACH,mBAAmB,EAAE,CAAC;YAEtB,OAAO;gBACL,MAAM,EAAE,eAAe;gBACvB,eAAe;gBACf,UAAU,EAAE,OAAO;gBACnB,UAAU,EAAE,cAAc;gBAC1B,UAAU,EAAE,OAAO,CAAC,UAAU,IAAI,IAAI;aACvC,CAAC;QACJ,CAAC;IACH,CAAC;IAED,OAAO,EAAE,MAAM,EAAE,MAAM,EAAE,CAAC;AAC5B,CAAC;AAED,SAAS,sBAAsB,CAC7B,EAAqB,EACrB,WAAqB,EACrB,cAAkC;IAElC,MAAM,
WAAW,GAAG,IAAI,GAAG,EAAU,CAAC;IACtC,WAAW,CAAC,GAAG,CAAC,cAAc,CAAC,MAAM,CAAC,CAAC;IAEvC,IAAI,WAAW,CAAC,MAAM,GAAG,CAAC,EAAE,CAAC;QAC3B,MAAM,YAAY,GAAG,WAAW,CAAC,GAAG,CAAC,GAAG,EAAE,CAAC,GAAG,CAAC,CAAC,IAAI,CAAC,GAAG,CAAC,CAAC;QAC1D,MAAM,IAAI,GAAG,EAAE,CAAC,OAAO,CACrB,qDAAqD,YAAY,GAAG,CACrE,CAAC,GAAG,CAAC,GAAG,WAAW,CAAgB,CAAC;QACrC,KAAK,MAAM,GAAG,IAAI,IAAI,EAAE,CAAC;YACvB,WAAW,CAAC,GAAG,CAAC,GAAG,CAAC,MAAM,CAAC,CAAC;QAC9B,CAAC;IACH,CAAC;IAED,OAAO,WAAW,CAAC,IAAI,CAAC;AAC1B,CAAC;AAED,MAAM,UAAU,mBAAmB,CACjC,EAAqB,EACrB,QAAgB,EAChB,UAAkB,EAClB,QAAgB,EAChB,UAAkB,EAClB,UAAyB;IAEzB,MAAM,EAAE,GAAG,UAAU,EAAE,CAAC;IACxB,MAAM,GAAG,GAAG,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE,CAAC;IAErC,MAAM,KAAK,GAAG,UAAU,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,CAAC,MAAM,CAAC;IAC/C,MAAM,UAAU,GAAG,UAAU,CAAC,CAAC,CAAC,GAAG,CAAC,CAAC,CAAC,IAAI,CAAC;IAC3C,MAAM,cAAc,GAAG,UAAU,CAAC,CAAC,CAAC,IAAI,CAAC,SAAS,CAAC,UAAU,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC;IAEtE,EAAE,CAAC,OAAO,CAAC;;;;GAIV,CAAC,CAAC,GAAG,CAAC,EAAE,EAAE,QAAQ,EAAE,UAAU,EAAE,QAAQ,EAAE,UAAU,EAAE,KAAK,EAAE,cAAc,EAAE,UAAU,EAAE,GAAG,CAAC,CAAC;IAE/F,OAAO,EAAE,CAAC;AACZ,CAAC;AAED,MAAM,UAAU,mBAAmB,CAAC,EAAqB,EAAE,eAAuB,EAAE,aAAqB;IACvG,MAAM,GAAG,GAAG,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE,CAAC;IACrC,EAAE,CAAC,OAAO,CAAC;;;;;;GAMV,CAAC,CAAC,GAAG,CAAC,aAAa,EAAE,GAAG,EAAE,eAAe,CAAC,CAAC;AAC9C,CAAC"}
@@ -0,0 +1,175 @@
1
+ # Audrey Paper Outline
2
+
3
+ ## Working Title
4
+
5
+ Audrey Guard: Local-First Pre-Action Memory Control for Tool-Using Agents
6
+
7
+ ## One-Sentence Thesis
8
+
9
+ Long-term memory for agents should not stop at recall; it should run before tool use, connect prior outcomes to the next action, and return an auditable `allow`, `warn`, or `block` decision with evidence.
10
+
11
+ ## Abstract Draft
12
+
13
+ Tool-using agents repeatedly fail in ways that are avoidable: they rerun broken commands, ignore project-specific procedures, lose the context behind prior failures, and trust degraded retrieval paths as if they were complete. Existing agent-memory systems focus mainly on storing and retrieving conversational facts. Audrey reframes memory as a local-first control loop for action: observe tool outcomes, encode durable lessons, build a memory capsule before the next action, generate reflexes, decide whether to allow, warn, or block, and validate whether the memory changed the result.
14
+
15
+ This paper introduces Audrey Guard, a SQLite-backed memory controller for Model Context Protocol and CLI agents. Audrey Guard combines hybrid vector/FTS recall, memory capsules, preflight warnings, tool-trace learning, redaction-first audit logging, and evidence-linked impact measurement. The evaluation plan measures repeated-failure prevention, false-block rate, degraded-recall fail-closed behavior, redaction safety, and overhead. The result is a practical memory firewall for local agent work: not a replacement for general memory platforms, but an auditable layer that helps agents avoid repeating known mistakes before they touch tools.
16
+
17
+ ## Core Contributions
18
+
19
+ 1. Define pre-action memory control as a distinct problem from generic long-term memory retrieval.
20
+ 2. Present the Audrey Guard loop: `PostToolUse` observation -> memory encoding -> preflight/capsule/reflex generation -> `allow` / `warn` / `block` -> validation/impact.
21
+ 3. Show a local-first implementation over SQLite, vector search, FTS, MCP, CLI, REST, and Python clients.
22
+ 4. Introduce GuardBench, an evaluation suite focused on tool-use risk reduction rather than chat-memory accuracy alone.
23
+ 5. Measure safety properties that memory systems usually underreport: repeated-failure prevention, recall degradation handling, secret redaction, and audit lineage.
24
+
25
+ ## Paper Structure
26
+
27
+ ### 1. Introduction
28
+
29
+ - Agents now operate tools, not just text conversations.
30
+ - The failure mode is operational: the agent starts each run knowing less than its previous run already learned.
31
+ - Generic memory recall is necessary but insufficient; the memory must participate before action.
32
+ - Audrey's claim: a local memory controller can prevent repeated tool failures with low overhead and inspectable evidence.
33
+
34
+ ### 2. Background and Related Work
35
+
36
+ - Agent-memory systems: Mem0, Letta/MemGPT, LangMem, Zep/Graphiti, Supermemory, OpenMemory, Cognee, LlamaIndex memory.
37
+ - Memory-as-system-resource work: MemOS, procedural memory, evidence-driven retention, temporal graphs.
38
+ - MCP tool safety: tool annotations, tool poisoning, descriptor drift, open-world tool risk.
39
+ - Hook runtimes: Claude Code `PreToolUse`, `PostToolUse`, and `PostToolUseFailure` make pre-action memory control deployable.
40
+
41
+ ### 3. Problem Definition
42
+
43
+ - Input: proposed agent action, tool name, command/action text, cwd, file scope, session id, and current memory store.
44
+ - Output: decision, risk score, summary, evidence ids, recommended actions, reflexes, optional capsule, and preflight event id.
45
+ - Desired behavior:
46
+ - Block exact repeated failures unless the action changed.
47
+ - Warn on relevant prior failures, must-follow procedures, contradictions, and degraded recall.
48
+ - Preserve evidence lineage and redact secrets before durable storage.
49
+ - Keep added latency low enough to run inside tool hooks.
50
+
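The input and output fields listed in the problem definition can be sketched as types. This is a minimal illustration, not Audrey's actual API: the names `GuardRequest`, `GuardResult`, and `exitCodeFor` are hypothetical, and the exit code value `1` is an assumption (the backlog only says the CLI emits nonzero on `block` and supports `--fail-on-warn`).

```typescript
// Hypothetical type names; the outline lists the fields but not their exact shapes.
type GuardDecision = "allow" | "warn" | "block";

interface GuardRequest {
  tool: string;        // tool name, e.g. "Bash"
  action: string;      // command/action text
  cwd: string;
  fileScope: string[]; // files the action is expected to touch
  sessionId: string;
}

interface GuardResult {
  decision: GuardDecision;
  riskScore: number;
  summary: string;
  evidenceIds: string[];
  recommendedActions: string[];
  reflexes: string[];
  capsule?: unknown;   // optional memory capsule; shape left open here
  preflightEventId: string;
}

// Assumed mapping from decision to process exit code:
// nonzero on block, and on warn only when failOnWarn is set.
function exitCodeFor(result: GuardResult, failOnWarn = false): number {
  if (result.decision === "block") return 1;
  if (result.decision === "warn" && failOnWarn) return 1;
  return 0;
}
```

A caller would build a `GuardRequest` from the hook payload, obtain a `GuardResult` from the controller, and pass it to `exitCodeFor` to decide whether the tool call proceeds.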
51
+ ### 4. Audrey Guard Design
52
+
53
+ - Memory substrate: episodic, semantic, procedural, event log, validation, decay, consolidation.
54
+ - Recall: hybrid vector + FTS with tag/source/date filters and partial-failure diagnostics.
55
+ - Capsules: budgeted evidence assembly for action context.
56
+ - Preflight: warnings and risk scoring from capsule sections, status, and recent tool failures.
57
+ - Reflexes: action-oriented responses generated from preflight evidence.
58
+ - Controller: `beforeAction()` and `afterAction()` over existing Audrey primitives.
59
+ - Audit safety: redaction-before-truncation, action hashing, file-scope hashing, event ids.
60
+
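The audit-safety bullet combines two ideas: redact before truncating, and hash the action together with its file scope. The sketch below shows one way these could fit together. The secret pattern, the 2000-character limit, the sorted file scope, and the function names are all assumptions for illustration, not Audrey's implementation.

```typescript
import { createHash } from "node:crypto";

// Illustrative secret pattern only; real redaction rules would be broader.
const SECRET_PATTERN = /\b(?:sk|ghp)_[A-Za-z0-9]{8,}\b/g;

// Redact first, then truncate. Truncating first could cut a secret in half,
// leaving a prefix that no longer matches the pattern and leaks into storage.
function redactThenTruncate(text: string, maxLength: number): string {
  const redacted = text.replace(SECRET_PATTERN, "[REDACTED]");
  return redacted.length > maxLength ? redacted.slice(0, maxLength) : redacted;
}

// Hash a canonical form of the action: lowercased tool name, redacted action
// text, and sorted file scope so ordering does not change the identity.
function actionHash(tool: string, action: string, fileScope: string[]): string {
  const canonical = JSON.stringify({
    tool: tool.toLowerCase(),
    action: redactThenTruncate(action.trim(), 2000),
    fileScope: [...fileScope].sort(),
  });
  return createHash("sha256").update(canonical).digest("hex");
}
```

Including file scope in the hash means the same command against different files produces a different identity, which matches the "same command in a different file scope" GuardBench scenario.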
61
+ ### 5. Implementation
62
+
63
+ - Runtime: Node.js 20+, TypeScript, SQLite, sqlite-vec, Hono REST, MCP stdio, Python client.
64
+ - CLI:
65
+ - `audrey guard --tool Bash "npm run deploy"`
66
+ - `audrey demo --scenario repeated-failure`
67
+ - MCP surfaces:
68
+ - Tools for recall, preflight, reflexes, observe-tool, impact, status.
69
+ - Resources for status, recent memories, and principles.
70
+ - Prompts for session briefing, recall, and reflection.
71
+ - Docker behavior: fail-closed non-loopback REST sidecar with required API key.
72
+
73
+ ### 6. Evaluation: GuardBench
74
+
75
+ Baselines:
76
+
77
+ - No memory.
78
+ - Recent-window memory.
79
+ - Vector-only recall.
80
+ - Keyword/FTS-only recall.
81
+ - Audrey Guard with hybrid recall and exact-failure matching.
82
+
83
+ Scenarios:
84
+
85
+ - Repeated failed shell command.
86
+ - Required preflight procedure missing.
87
+ - Same command in a different file scope.
88
+ - Same tool/action with changed command.
89
+ - Prior failure plus successful fix.
90
+ - Recall vector table missing.
91
+ - FTS failure under hybrid recall.
92
+ - Long secret near truncation boundary.
93
+ - Conflicting project instructions.
94
+ - High-volume irrelevant memory noise.
95
+
96
+ Metrics:
97
+
98
+ - Repeated-failure prevention rate.
99
+ - False-block rate.
100
+ - Useful-warning precision.
101
+ - Evidence recall: whether the blocking evidence is surfaced.
102
+ - Redaction safety: raw secret leakage count.
103
+ - Recall-degradation detection rate.
104
+ - Runtime overhead p50/p95.
105
+ - Validation-linked impact count.
106
+
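Several of the metrics above reduce to simple counts over per-scenario runs. The following sketch shows one plausible way to compute them. The `ScenarioRun` shape, the metric definitions, and the nearest-rank percentile are assumptions, not the GuardBench scoring rules.

```typescript
type Decision = "allow" | "warn" | "block";

// Hypothetical per-scenario record.
interface ScenarioRun {
  expected: Decision;
  actual: Decision;
  latencyMs: number;
}

// Nearest-rank percentile over a copy of the input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)] as number;
}

function summarize(runs: ScenarioRun[]) {
  const shouldBlock = runs.filter((r) => r.expected === "block");
  const shouldNotBlock = runs.filter((r) => r.expected !== "block");
  const prevented = shouldBlock.filter((r) => r.actual === "block").length;
  const falseBlocks = shouldNotBlock.filter((r) => r.actual === "block").length;
  const latencies = runs.map((r) => r.latencyMs);
  return {
    // Share of should-block scenarios that were actually blocked.
    preventionRate: shouldBlock.length ? prevented / shouldBlock.length : NaN,
    // Share of should-not-block scenarios that were blocked anyway.
    falseBlockRate: shouldNotBlock.length ? falseBlocks / shouldNotBlock.length : NaN,
    p50Ms: percentile(latencies, 50),
    p95Ms: percentile(latencies, 95),
  };
}
```

Reporting prevention rate and false-block rate side by side keeps a system from scoring well simply by blocking everything.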
107
+ ### 7. Results Plan
108
+
109
+ - Use the existing repeated-failure demo as the first qualitative figure.
110
+ - Run `npm run bench:memory:check` as the memory-regression baseline.
111
+ - Keep the `bench:guard` command wired into release evidence before paper submission.
112
+ - Report machine provenance for all timings, matching the existing 0.22.2 benchmark snapshot style.
113
+ - Include ablations:
114
+ - Without exact action hash.
115
+ - Without file scope.
116
+ - Without recall degradation warnings.
117
+ - Without redaction-aware truncation.
118
+
119
+ ### 8. Discussion
120
+
121
+ - Why Audrey should not compete as "the best general memory store."
122
+ - Why local-first matters for tool traces: secrets, filesystem paths, project rules, and private failures.
123
+ - Why tool annotations are hints, not policy guarantees.
124
+ - What Audrey borrows from graph memory without adding a graph database to the core.
125
+ - Limitations:
126
+ - Claude Code hook config can be applied with a guarded settings merge, but
127
+ equivalent Codex hook wiring still depends on a stable host hook surface.
128
+ - Validation lineage is bound to exact preflight event evidence, but feedback
129
+ does not yet tune risk scoring.
130
+ - Local comparative GuardBench numbers exist; no external-system numbers yet.
131
+ - Temporal belief fields are still future work.
132
+
133
+ ### 9. Conclusion
134
+
135
+ - Agent memory should be judged by whether it changes future actions, not just whether it retrieves relevant text.
136
+ - Audrey Guard demonstrates a practical local loop for using memory as a pre-action control layer.
137
+ - The next publishable milestone is live external-adapter GuardBench output plus broader host-hook integration.
138
+
139
+ ## Figures and Tables
140
+
141
+ 1. Guard loop diagram: observe -> encode -> capsule/preflight/reflex -> decision -> validate.
142
+ 2. Architecture diagram: SQLite store, event log, recall, controller, MCP/CLI/REST clients.
143
+ 3. Repeated-failure demo transcript with evidence ids.
144
+ 4. GuardBench table by baseline and scenario.
145
+ 5. Redaction/truncation safety table.
146
+ 6. Latency table: preflight p50/p95 by memory count.
147
+
148
+ ## Artifact Checklist Before Submission
149
+
150
+ - `bench:guard` script and JSON output.
151
+ - Public GuardBench scenario manifest, comparative adapter package, and external-run metadata bundle.
152
+ - Reproducible benchmark snapshot with Node version, CPU, RAM, git SHA.
153
+ - CLI smoke transcript for `audrey demo --scenario repeated-failure`.
154
+ - MCP smoke transcript for `tools/list`, `resources/list`, `prompts/list`, and `memory_status`.
155
+ - Python integration proof.
156
+ - Docker fail-closed auth proof.
157
+ - Paper appendix with exact commands.
158
+
159
+ ## Submission Strategy
160
+
161
+ 1. Publish an arXiv preprint after GuardBench exists.
162
+ 2. Submit to an agent-systems, AI engineering, or LLM applications workshop.
163
+ 3. Keep the first version implementation-centered, not theory-heavy.
164
+ 4. Release the evaluation artifact with the paper so the claim is falsifiable.
165
+
166
+ ## Source Map
167
+
168
+ - MCP tool annotations and trust model: https://modelcontextprotocol.io/specification/2025-11-25/server/tools and https://modelcontextprotocol.io/specification/2025-11-25/schema
169
+ - MCP annotation risk vocabulary: https://blog.modelcontextprotocol.io/posts/2026-03-16-tool-annotations/
170
+ - Claude Code hooks: https://code.claude.com/docs/en/hooks
171
+ - Mem0 token-efficient memory algorithm: https://mem0.ai/blog/mem0-the-token-efficient-memory-algorithm
172
+ - MemOS: https://huggingface.co/papers/2507.03724
173
+ - MCP Security Bench: https://huggingface.co/papers/2510.15994
174
+ - Securing MCP against tool poisoning: https://papers.cool/arxiv/2512.06556
175
+ - Zep/Graphiti temporal knowledge graph: https://help.getzep.com/graphiti/graphiti/overview
@@ -0,0 +1,59 @@
1
+ # Audrey Memory Benchmarking Strategy
2
+
3
+ Updated: 2026-05-05 for the 0.23.0 Audrey Guard release.
4
+
5
+ Audrey should win trust before it tries to win leaderboards. The benchmark story is:
6
+ separate what Audrey can prove locally, publish the exact harness and artifacts, and
7
+ only compare against external systems on tasks that measure the same thing.
8
+
9
+ ## 0.23.0 Release Stance
10
+
11
+ - Performance snapshots measure Audrey's local pipeline: SQLite, sqlite-vec,
12
+ hybrid ranking, and post-encode work. They intentionally exclude hosted
13
+ embedding latency and do not compare against unrelated systems.
14
+ - The behavioral suite is split into retrieval, lifecycle operations, and guard
15
+ loop behavior. Guard cases stay out of the comparable aggregate because
16
+ "no controller" baselines are useful regressions, not fair leaderboard peers.
17
+ - `npm run bench:memory:guard` is Audrey's new product benchmark: can memory
18
+ stop an agent before it repeats a known tool failure or violates a must-follow
19
+ release rule, and can the receipt boundary reject replayed or non-guard
20
+ outcomes?
21
+ - The next public claim should be a reproducible report, not a slogan: commit
22
+ the command, raw JSON, machine provenance, model/provider configuration, and
23
+ any judge prompt or scoring rule used.
24
+
25
+ ## External Benchmark Map
26
+
27
+ | Benchmark | What It Tests | Audrey Fit |
28
+ |---|---|---|
29
+ | LongMemEval | Information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention across scalable chat histories. | Good retrieval/lifecycle fit once Audrey has an adapter and evaluator. Source: https://arxiv.org/abs/2410.10813 |
30
+ | LoCoMo | Very long-term conversations around 300 turns, 9K tokens on average, and up to 35 sessions, with QA, summarization, and multimodal dialogue tasks. | Useful external context, but Audrey should keep published scores separate from local synthetic cases. Source: https://arxiv.org/abs/2402.17753 |
31
+ | MemoryAgentBench | Incremental multi-turn memory with accurate retrieval, test-time learning, long-range understanding, and selective forgetting. | Strong fit for Audrey's live agent posture because it evaluates online accumulation rather than static long-context reading. Source: https://arxiv.org/abs/2507.05257 |
32
+ | StructMemEval | Whether agents organize long-term memory into useful structures such as ledgers, to-do lists, and trees rather than just recalling facts. | High-value 0.24 target for Audrey's memory-controller routing and future typed memory stores. Source: https://arxiv.org/abs/2602.11243 |
33
+ | MemGUI-Bench | Cross-temporal and cross-spatial memory for mobile GUI agents, with memory-centric tasks and staged evaluation. | Not a direct coding-agent benchmark, but its failure taxonomy is relevant to tool-bound agents with UI state. Source: https://arxiv.org/abs/2602.06075 |
34
+
35
+ ## Release-Quality Rules
36
+
37
+ 1. Do not mix controller benchmarks with retrieval leaderboards unless all
38
+ compared systems receive equivalent controller affordances.
39
+ 2. Do not quote latency without the embedding provider, dimensions, corpus size,
40
+ recall mode, hardware, Node version, and whether warm caches were involved.
41
+ 3. Treat abstention, deletion, overwrite, conflict resolution, and selective
42
+ forgetting as first-class memory outcomes, not edge cases.
43
+ 4. Prefer task evidence over vibe: raw JSON, artifacts, evaluator code, and
44
+ reproduction commands should ship with every public benchmark claim.
45
+ 5. For coding-agent memory, measure prevented mistakes and time-to-recovery, not
46
+ only whether a stored fact was recalled.
47
+
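Rules 2 and 4 call for machine provenance alongside every benchmark number. A minimal sketch of capturing it with Node's built-in `os` module follows. The `MachineProvenance` field names are hypothetical, and the git SHA is passed in by the caller rather than discovered, to keep the example self-contained.

```typescript
import * as os from "node:os";

// Hypothetical record shape; field names are illustrative.
interface MachineProvenance {
  nodeVersion: string;
  platform: string;
  arch: string;
  cpuModel: string;
  cpuCount: number;
  totalMemoryBytes: number;
  gitSha: string;
  capturedAt: string;
}

function captureProvenance(gitSha: string): MachineProvenance {
  const cpus = os.cpus();
  return {
    nodeVersion: process.version,
    platform: os.platform(),
    arch: os.arch(),
    // os.cpus() can return an empty array on some hosts.
    cpuModel: cpus[0]?.model ?? "unknown",
    cpuCount: cpus.length,
    totalMemoryBytes: os.totalmem(),
    gitSha,
    capturedAt: new Date().toISOString(),
  };
}
```

Writing this record into the same JSON file as the raw results keeps timings and their hardware context from drifting apart.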
48
+ ## 0.24 Benchmark Targets
49
+
50
+ - Add a LongMemEval adapter that can run a small public shard with mock and real
51
+ embedding providers.
52
+ - Add a MemoryAgentBench-style incremental harness with explicit selective
53
+ forgetting and test-time learning cases.
54
+ - Add structured-memory cases that force Audrey to maintain a ledger, checklist,
55
+ or dependency tree across sessions.
56
+ - Add an agent-tool benchmark where `beforeAction()` and `afterAction()` wrap a
57
+ scripted coding workflow and score prevented repeats, blocked violations, and
58
+ useful cautions.
59
+ - Publish one reproducible external report before making any SOTA-style claim.
@@ -0,0 +1,304 @@
1
+ # Audrey Production Backlog
2
+
3
+ Updated: 2026-05-13 after the Audrey 1.0 release-cut and paper-gate pass.
4
+
5
+ This file tracks release posture and remaining product work. It is intentionally
6
+ public-safe: it avoids exploit recipes, stale line references, and private
7
+ planning notes.
8
+
9
+ ## Current Release Posture
10
+
11
+ Audrey 1.0.0 has been cut locally and is ready for package-level validation
12
+ through the full gate:
13
+
14
+ ```bash
15
+ npm run release:gate
16
+ npm run release:gate:paper
17
+ npm run release:cut:plan
18
+ npm run release:readiness
19
+ npm run python:release:check
20
+ npm run npm:smoke
21
+ npm run security:audit
22
+ npx audrey doctor
23
+ npx audrey status --fail-on-unhealthy
24
+ python -m unittest discover -s python/tests -v
25
+ python -m build --no-isolation python
26
+ ```
27
+
28
+ `npm run release:readiness` is intentionally pending-aware. It exits cleanly
29
+ when local code and paper artifacts verify but the final 1.0 release is still
30
+ blocked on source-control release state, GitHub Release object readiness, npm
31
+ registry/auth readiness, PyPI publish readiness, authenticated browser
32
+ publication URLs, live Mem0/Zep evidence, or npm/PyPI account steps. Use
33
+ `npm run release:readiness:strict` only when cutting the actual 1.0 release;
34
+ strict mode must fail until those publish blockers are resolved.
35
+ `npm run release:cut:plan` is the dry-run version/changelog cut. It previews
36
+ the edits that `npm run release:cut:apply -- --target-version 1.0.0` would
37
+ write to `package.json`, `package-lock.json`, `mcp-server/config.ts`,
38
+ `python/audrey_memory/_version.py`, and `CHANGELOG.md`. The changelog plan is
39
+ publishable release-note copy rather than TODO scaffolding, and strict
40
+ readiness rejects placeholder markers if a manual edit reintroduces them.
41
+
42
+ `npm test` now routes through `scripts/run-vitest.mjs`, which sets `TEMP`,
43
+ `TMP`, and `TMPDIR` to `.tmp-vitest` before Vitest starts. That removes the
44
+ previous locked-down Windows temp-directory startup failure while keeping
45
+ `npm run release:gate:sandbox` available for hosts that block child-process
46
+ spawning entirely.
47
+
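The temp-directory redirection described above can be illustrated with a small helper that builds the child-process environment. This is a sketch under assumptions, not the contents of `scripts/run-vitest.mjs`: the function name `vitestTempEnv` is hypothetical, and only the three variable names and the `.tmp-vitest` directory come from the text.

```typescript
import * as path from "node:path";

// Build an environment in which all three common temp variables point at a
// repo-local directory, leaving every other variable untouched.
function vitestTempEnv(
  repoRoot: string,
  baseEnv: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const tmp = path.join(repoRoot, ".tmp-vitest");
  return { ...baseEnv, TEMP: tmp, TMP: tmp, TMPDIR: tmp };
}
```

A launcher would typically create the directory first and then pass the result as the `env` option when spawning the test runner.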
48
+ ## Unreleased v0.23 Guard Chassis
49
+
50
+ The first Audrey Guard slice is now in the working tree:
51
+
52
+ - `src/controller.ts` adds `MemoryController.beforeAction()` and
53
+ `afterAction()` over the existing tool-trace, reflex, preflight, capsule,
54
+ validation, and impact primitives.
55
+ - `npx audrey guard --tool <Tool> "<action>"` runs the controller from the
56
+ terminal, prints a screenshot-friendly guard decision, emits nonzero on
57
+ `block`, and supports `--json`, `--explain`, `--override`, and
58
+ `--fail-on-warn`.
59
+ - `npx audrey demo --scenario repeated-failure` is the deterministic
60
+ no-network demo: a deploy fails once, Audrey records it, the next preflight
61
+ blocks the repeat with evidence, and `impact` records the helpful validation.
62
+ - `Audrey.encodeBatch()` now uses provider-level `embedBatch()` instead of
63
+ issuing one embedding call per episode.
64
+ - Guard exact-failure matching redacts before trimming, treats tool names
65
+ case-insensitively, and includes file scope in the action hash so unrelated
66
+ edits do not collide.
67
+ - Validation feedback can now bind to the exact `preflight_event_id`, evidence
68
+ id set, and Guard action fingerprint that surfaced a memory; Audrey rejects
69
+ lineage claims when the validated memory was not preflight evidence.
70
+ - `npx audrey guard --hook --fail-on-warn` consumes Claude Code `PreToolUse`
71
+ JSON from stdin and emits the current `hookSpecificOutput.permissionDecision`
72
+ shape; `npx audrey hook-config claude-code` generates PreToolUse and
73
+ PostToolUse command hooks, with failed outcomes inferred from the
74
+ PostToolUse hook payload.
75
+ - `npx audrey hook-config claude-code --apply --scope project|user` now merges
76
+ those hooks into Claude Code settings, preserves unrelated settings/hooks,
77
+ dedupes Audrey handlers, and writes a timestamped backup before changing an
78
+ existing settings file.
79
+ - `npm run bench:guard:check` now publishes local GuardBench comparative
80
+ numbers for ten pre-action scenarios across Audrey Guard, no-memory,
81
+ recent-window, vector-only, and FTS-only adapters, including repeated failure
82
+ prevention, recovered-path suppression, recall degradation, redaction safety,
83
+ and noisy memory-store control recall.
84
+ - GuardBench now accepts external ESM adapters with `--adapter`, withholds
85
+ expected decisions/evidence during adapter execution, and emits manifest,
86
+ raw-output, and machine-provenance artifacts.
87
+ - `npm run bench:guard:adapter-smoke` exercises the external adapter loader
88
+ through the real CLI path with a credential-free example adapter.
89
+ - `benchmarks/adapters/mem0-platform.mjs` is the first concrete external-system
90
+ adapter. It uses Mem0 Platform REST APIs with runtime-only `MEM0_API_KEY`,
91
+ async add/event polling, V2 search, and user-entity cleanup.
92
+ - `npm run bench:guard:mem0` wraps the live Mem0 run in a reproducible external
93
+ GuardBench evidence bundle with runtime-env checks and
94
+ `external-run-metadata.json`; `--dry-run` captures the exact command without
95
+ needing credentials.
96
+ - `npm run release:gate:paper` is the publication gate: it rebuilds, typechecks,
97
+ refreshes performance and behavioral benchmark outputs, runs GuardBench,
98
+ syncs README/paper/ledger metrics from JSON artifacts, verifies paper
99
+ consistency and redaction hygiene, runs the pending-aware 1.0 readiness
100
+ checklist, then runs a clean-consumer npm package smoke test and the npm pack
101
+ dry-run.
102
+ - Preflight now performs a supplemental tagged control-memory sweep for
103
+ trusted `must-follow` memories so high-salience rules survive irrelevant
104
+ memory noise.
105
+ - Recall partial-failure diagnostics now propagate into capsules and strict
106
+ Guard preflights, so degraded vector/FTS paths become blocking memory-health
107
+ warnings instead of silent empty context.
108
+ - `/v1/status` and `memory_status` expose the latest recall degradation check
109
+ with the failing path and error message.
110
+ - `npm test` and `npm run test:watch` use the repo-local Vitest temp launcher,
111
+ so full Vitest is no longer dependent on a writable user temp directory.
112
+ - `docs/AUDREY_PAPER_OUTLINE.md` defines the publishable Audrey Guard thesis
113
+ and the GuardBench evaluation plan.
114
+
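The hook bullet above mentions the `hookSpecificOutput.permissionDecision` shape that Claude Code reads from a `PreToolUse` hook. The sketch below shows how a guard decision might be translated into that shape. The mapping of `warn` to `ask` (or to `deny` under `--fail-on-warn`) is an assumption, and `toHookOutput` is a hypothetical name, not Audrey's function.

```typescript
type GuardDecision = "allow" | "warn" | "block";
type PermissionDecision = "allow" | "ask" | "deny";

interface PreToolUseHookOutput {
  hookSpecificOutput: {
    hookEventName: "PreToolUse";
    permissionDecision: PermissionDecision;
    permissionDecisionReason: string;
  };
}

// Assumed mapping: block always denies; warn asks the user unless
// failOnWarn escalates it to a denial; allow passes through.
function toHookOutput(
  decision: GuardDecision,
  summary: string,
  failOnWarn: boolean,
): PreToolUseHookOutput {
  let permissionDecision: PermissionDecision = "allow";
  if (decision === "block") permissionDecision = "deny";
  else if (decision === "warn") permissionDecision = failOnWarn ? "deny" : "ask";
  return {
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision,
      permissionDecisionReason: summary,
    },
  };
}
```

The hook process would serialize this object with `JSON.stringify` and write it to stdout for the host to consume.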
115
+ Guard is still not production-complete: Claude Code hook apply is explicit rather
116
+ than part of `install`, Codex hook wiring is not available yet, and validation
117
+ feedback is recorded but not yet used to tune risk scoring.
118
+
119
+ ## Shipped In The 0.22.2 Correctness Pass
120
+
121
+ - Two CodeRabbit review passes plus a CodeQL audit landed: see `CHANGELOG.md#0222---2026-05-01`.
122
+ Net result: every critical and major finding from the first pass was
123
+ eliminated; the second pass surfaced a duplicate of the `vec_*.state`
124
+ stale-denormalization bug in `src/interference.ts` and an API-key auth
125
+ length-leak in `src/routes.ts`, both fixed.
126
+ - `GET /v1/impact` REST route + Python `impact()` on sync and async clients.
127
+ `analytics()` is now an alias of `impact()`.
128
+ - Python integration tests unskipped; they spin up the real TS REST sidecar
129
+ and exercise encode → recall → mark_used → impact → snapshot → restore.
130
+ - Legitimate performance benchmarks (`npm run bench:perf-snapshot`) replace
131
+ the synthetic-baseline SVGs that previously shipped in README.
132
+
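The API-key length leak mentioned above is a common pitfall: `crypto.timingSafeEqual` throws when its inputs differ in length, and a length check done beforehand reveals the expected key length through timing. One widely used mitigation, sketched here as an assumption and not as the fix that landed in `src/routes.ts`, is to hash both values to a fixed size before comparing. The function name `apiKeyMatches` is hypothetical.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both inputs to 32-byte digests so the comparison always runs over
// equal-length buffers, regardless of the length of the provided key.
function apiKeyMatches(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}
```

Because both digests are always 32 bytes, the comparison never throws and its duration does not depend on how long the submitted key is.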
133
+ ## Shipped In The 0.22.1 Hardening Pass
134
+
135
+ - Agent-scoped encode, recall, greeting, capsule, preflight, reflex, and
136
+ consolidation paths.
137
+ - Admin export, import, and forget tools/routes fail closed unless
138
+ `AUDREY_ENABLE_ADMIN_TOOLS=1` is set.
139
+ - Snapshot import uses bounded schemas for memory rows, config, events,
140
+ consolidation history, and content size.
141
+ - Export/import now preserves `memory_events` and consolidated-memory agent
142
+ ownership.
143
+ - Stored memory content is wrapped as untrusted data before LLM extraction,
144
+ contradiction, reflection, and rule-promotion prompts.
145
+ - Local embeddings are the default. Cloud embedding providers require explicit
146
+ `AUDREY_EMBEDDING_PROVIDER`.
147
+ - MCP stdio now exposes memory tools, `audrey://status`, `audrey://recent`,
148
+ `audrey://principles`, and briefing/recall/reflection prompt templates.
149
+ - Python package metadata builds cleanly as `audrey-memory 0.22.1`.
150
+ - Release scripts separate full CI (`release:gate`) from a reduced gate
151
+ (`release:gate:sandbox`) for hosts that cannot start Vitest.
152
+
153
+ ## Release Evidence To Keep Current
154
+
155
+ Before publishing a new npm or Python package, capture:
156
+
157
+ - `npm run release:gate` on a normal CI host.
158
+ - `npm run release:gate:sandbox` on locked-down local hosts.
159
+ - `npm run release:gate:paper` before publishing the paper, npm package, or
160
+ public launch posts that quote benchmark numbers.
161
+ - `npm run release:readiness -- --json` to capture the current 1.0 prompt-to-artifact checklist.
162
+ - `npm run release:readiness:strict -- --json` immediately before npm publish, PyPI publish, and browser launch are claimed complete.
163
+ - `npm run release:cut:plan -- --target-version 1.0.0 --json` before applying the final version/changelog bump.
164
+ - `npm run python:release:check` to build wheel/sdist artifacts, inspect package metadata, check for local path leakage, and run `twine check`.
165
+ - `npm run npm:smoke` to prove the packed npm tarball installs in a clean consumer project, imports the public ESM API, runs encode/recall, and executes both CLI shims.
166
+ - `npm run security:audit`.
167
+ - `npm ci --dry-run`.
168
+ - Direct stdio MCP smoke: initialize, `tools/list`, `resources/list`,
169
+ `prompts/list`, `resources/read audrey://status`.
170
+ - `npx audrey --version`, `npx audrey doctor --json`, and
171
+ `npx audrey status --json --fail-on-unhealthy`.
172
+ - Python unit tests (`python -m unittest discover -s python/tests -v`).
173
+
174
+ ## P0: Next Release Blockers
175
+
176
+ 1. Add the equivalent Codex hook wiring when the host exposes a stable hook
177
+ surface, and add a one-command Claude Code install mode that runs MCP
178
+ registration plus hook apply after a dry-run preview.
179
+ 2. Run the Mem0 Platform and Zep Cloud adapters with real runtime keys,
180
+ publish their raw per-scenario outputs, then add Letta/Graphiti-style
181
+ adapters.
182
+ 3. Use validation feedback to tune warning priority, recommendation wording,
183
+ and repeated-risk suppression without giving the model direct policy
184
+ control.
185
+ 4. Add MCP tool-risk policy inputs: descriptor fingerprints, annotation hints,
186
+ trusted-server status, and descriptor drift warnings.
187
+
188
+ ## 1.0 / Paper Publish Blockers
189
+
190
+ The local code and paper gates are strong, but the 1.0/publication objective is
191
+ not complete until `release:readiness:strict` passes. Current blockers:
192
+
193
+ - Remote source control is only partially released and not yet coherent: live
194
+ GitHub refs now show `release/audrey-1.0.0` at `83eb0ad` while `v1.0.0` is
195
+ still tag object `9a22dca` peeled to older commit `b3430fa`. Reconcile or
196
+ recreate the tag on the final release commit before treating 1.0 as cut. This
197
+ sandbox still cannot write `.git` metadata, the local `origin/master`
198
+ tracking ref is stale versus live `53761da`, and the working tree still has
199
+ uncommitted release/launch evidence changes. Fetch/reconcile from a
200
+ credentialed host or a clean temporary clone before treating this checkout as
201
+ source-control ready.
202
+ - The GitHub tag exists, but the public GitHub Releases API currently returns
203
+ `404` for `v1.0.0`, and this browser session is not signed into GitHub. Publish
204
+ a stable GitHub Release from the verified tag and attach or link the paper and
205
+ submission artifacts before strict readiness can pass.
206
+ - npm publish readiness still needs CLI authentication: `audrey@1.0.0` is not
207
+ published on the registry and `npm whoami` currently returns E401. The local
208
+ tarball smoke now passes through `npm run npm:smoke`, including clean-consumer
209
+ install, ESM import, encode/recall, and both CLI shims.
210
+ - PyPI publish readiness still needs runtime credentials or trusted-publisher
211
+ evidence for the final `audrey-memory` upload.
212
+ - Browser launch results are still not complete: LinkedIn, Reddit, and Hacker
213
+ News are submitted and verified, arXiv is account-authorization blocked under
214
+ support ticket `AH-190018`, and X still needs a logged-in posting session. The
215
+ first r/LocalLLaMA attempt was removed for insufficient subreddit karma, so
216
+ the recorded Reddit launch URL is now the rule-checked r/ClaudeCode Showcase
217
+ post. Audrey-specific Reddit replies in that thread now include the GitHub
218
+ repo URL, including the PreToolUse/permissions.deny exchange and Moriarty's
219
+ GuardBench feedback thread. The first Hacker News Show HN path was
220
+ account-restricted, so the recorded Hacker News launch URL is the verified
221
+ neutral link submission.
222
+ - Mem0 and Zep GuardBench evidence is still dry-run/pending until
223
+ `MEM0_API_KEY` and `ZEP_API_KEY` are provided at runtime and strict external
224
+ evidence passes.
225
+ - npm/PyPI publishing still needs account authentication and OTP handling.
226
+ - Production `npm audit --omit=dev --audit-level=moderate` is clean after the
227
+ latest transitive `protobufjs` lockfile refresh.
228
+
229
+ ## P1: Product Quality
230
+
231
+ 1. Add `memory_ask` / `recallAuto()` for callers that should not choose a
232
+ retrieval strategy manually.
233
+ 2. Add adaptive hybrid recall weighting behind an environment flag, then compare
234
+ against the current benchmark output before making it default.
235
+ 3. Benchmark `encodeBatch` across mock, OpenAI, and Gemini providers before
236
+ claiming cloud-provider speedups.
237
+ 4. Add a visible `audrey impact` or dashboard story that shows memories used,
238
+ helpful, wrong, decayed, and promoted over time.
239
+ 5. Add install smoke tests for generated Codex, Claude Code, VS Code, Cursor,
240
+ and Windsurf MCP configs.
241
+
242
+ ## P2: Hardening And Scale
243
+
244
+ 1. Move large local embedding dependencies to optional install paths if package
245
+ size becomes a distribution blocker.
246
+ 2. Add event-log retention controls for long-running tool-trace stores.
247
+ 3. Add signed import/export bundles for cross-machine memory transfer.
248
+ 4. Cache prepared statements on hot recall paths if production profiling shows
249
+ SQLite prepare overhead above budget.
250
+ 5. Add bitemporal belief fields for facts that change over time.
251
+
252
+ ## Commercial Wedge
253
+
254
+ The product wedge remains "memory before action": Audrey should prevent agents
255
+ from repeating known bad actions, ignoring known workflows, or acting without
256
+ the context they already earned. The strongest paid surface is likely team
257
+ memory operations: policy editor, memory diff/rollback, audit log, shared
258
+ encrypted stores, hosted relay, CI gates, and support.
259
+
260
+ ## v0.23 Product Direction (Tracked, Not Decided)
261
+
262
+ A 2026-05-01 audit recommends repositioning Audrey from a generic local
263
+ memory framework to **Audrey Guard** — a local-first memory firewall whose
264
+ single job is to stop AI coding agents from repeating expensive mistakes
265
+ before they touch tools. The core loop already exists in pieces in this
266
+ repo (`observeTool`, `preflight`, `reflexes`, `validate`, `impact`,
267
+ `promote`); the v0.23 work would be making them feel like one product
268
+ loop instead of separate primitives.
269
+
270
+ Open questions before committing to the rename:
271
+
272
+ - Is the marketing surface ("memory firewall for coding agents") narrower
273
+ than the actual product can support across non-coding agents?
274
+ - Does keeping the current "local memory runtime" framing for the OSS core
275
+ while branding the guard CLI separately give us the same wedge without
276
+ abandoning existing positioning?
277
+
278
+ Concrete v0.23 work the audit identified, scoped to fit one or two
279
+ releases:
280
+
281
+ 1. Host hook wiring for Claude Code and Codex so Guard runs automatically
282
+ before tool use and tool outcomes feed back into trace/validation surfaces.
283
+ 2. Memory Controller Layer (`src/controller.ts`) that owns
284
+ `beforeAction(action) → GuardResult` and
285
+ `afterAction(outcome) → void` over the existing primitives. This
286
+ chassis also enables splitting `src/audrey.ts` (now ~1.2K lines) into
287
+ focused services.
288
+ 3. Benchmark the new batched `Audrey.encodeBatch()` path across mock, OpenAI,
289
+ and Gemini providers before claiming cloud-provider speedups.
290
+ 4. Hybrid-recall N+1: batch the FTS-only row loaders in
291
+ `src/hybrid-recall.ts` by type instead of per-id SELECTs.
292
+ 5. Persist recall-degradation history in the event log so status can show more
293
+ than the latest in-process check.
294
+ 6. Move the heavy local embedding dependency
295
+ (`@huggingface/transformers` + ONNX) to `optionalDependencies` so
296
+ non-local-provider users don't pay the install size.
297
+ 7. Expand FTS-only confidence in hybrid recall through the same
298
+ confidence/scoring pipeline used by vector candidates.
299
+ 8. Add `AUDREY_STRICT_ISOLATION=1` and make strict agent scope the
300
+ default before team scopes ship.
301
+
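Item 4 in the list above (the hybrid-recall N+1) describes replacing per-id SELECTs with one query per memory type. A minimal sketch of that batching pattern follows. The table names, the `MemoryRef` shape, and both function names are hypothetical, and the SQL is built only from an allowlisted table name plus `?` placeholders.

```typescript
// Hypothetical reference to a stored memory row.
interface MemoryRef {
  type: string;
  id: string;
}

// Hypothetical allowlist mapping memory types to table names.
const TABLE_FOR_TYPE: Record<string, string> = {
  episodic: "episodes",
  semantic: "semantics",
};

// Group ids by memory type so each type needs only one query.
function groupIdsByType(refs: MemoryRef[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const ref of refs) {
    const list = groups.get(ref.type) ?? [];
    list.push(ref.id);
    groups.set(ref.type, list);
  }
  return groups;
}

// Build one parameterized IN (...) query for a type, or null when the type
// is unknown or there is nothing to load.
function buildBatchSelect(
  type: string,
  ids: string[],
): { sql: string; params: string[] } | null {
  const table = TABLE_FOR_TYPE[type];
  if (!table || ids.length === 0) return null;
  const placeholders = ids.map(() => "?").join(", ");
  return { sql: `SELECT * FROM ${table} WHERE id IN (${placeholders})`, params: ids };
}
```

With N candidate ids spread over k types, this issues k statements instead of N, and the ids themselves never enter the SQL string.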
302
+ The "first paid feature" line of work — encrypted blob sync of local
303
+ Audrey stores ("Audrey Cloud Sync") — remains the smallest commercial
304
+ primitive that doesn't require rebuilding the product around hosted Postgres.
@@ -0,0 +1,48 @@
1
+ # Agent Memory Should Control Tool Use: Audrey Guard and Pre-Action Memory Control
2
+
3
+ ## Abstract
4
+
5
+ Agent memory should be judged by whether it changes future tool actions, not only by whether it retrieves relevant text. Audrey implements a local-first pre-action memory controller that converts prior tool outcomes, procedures, contradictions, recall health, and redacted traces into auditable `allow`, `warn`, or `block` decisions before an agent acts. The system builds bounded memory capsules, scores preflight risk, generates evidence-linked reflexes, blocks exact repeated failures through deterministic action identity hashing, and closes the loop through post-action validation and impact reporting. This paper frames the scientific category as pre-action memory control and the artifact as Audrey Guard. The Stage-A version reports implemented Audrey evidence: the controller and CLI, redaction-first tool tracing, recall-degradation handling, the canonical 0.22.2 performance snapshot, the current behavioral regression gate output, the local comparative GuardBench run, and the deterministic repeated-failure demo. It also specifies GuardBench as the evaluation methodology for future cross-system comparison.
6
+
7
+ ## Table of Contents and Authoring Status
8
+
9
+ | Section | File | Status | Owner |
10
+ |---|---|---|---|
11
+ | 0. Master, abstract, status | `00-master.md` | Draft initialized | Codex |
12
+ | 1. Introduction | `01-introduction.md` | Draft complete | Claude strategy, Codex draft |
13
+ | 2. Related Work | `02-related-work.md` | Draft complete | Claude citation strategy, Codex draft |
14
+ | 3. Problem Definition | `03-problem-definition.md` | Draft complete | Codex |
15
+ | 4. Design | `04-design.md` | Draft complete | Codex |
16
+ | 5. GuardBench Specification | `05-guardbench-spec.md` | Draft complete | Claude spec review, Codex draft |
17
+ | 6. Implementation | `06-implementation.md` | Draft complete | Codex |
18
+ | 7. Evaluation | `07-evaluation.md` | Draft complete | Codex with Claude anti-claim review |
19
+ | 8. Discussion and Limitations | `08-discussion-limitations.md` | Draft complete | Claude review, Codex draft |
20
+ | 9. Conclusion | `09-conclusion.md` | Draft complete | Codex |
21
+ | Consolidated v1 master | `audrey-paper-v1.md` | Assembled | Codex |
22
+ | Appendix A. Demo Transcript | `appendix-a-demo-transcript.md` | Draft complete | Codex |
23
+ | Appendix B. Evidence Ledger | `evidence-ledger.md` | Initialized and populated | Codex |
24
+ | References | `references.bib` | Initialized with primary URLs; benchmark citations added | Codex |
25
+
26
+ ## Current Draft Constraints
27
+
28
+ - Quote benchmark numbers from `benchmarks/snapshots/perf-0.22.2.json`, not the README sample table (Ledger: E28).
29
+ - Treat GuardBench Stage A as a specification contribution plus local comparative result, not completed external-system results.
30
+ - Cite external claims only from primary papers, official documentation, official repositories, or first-party project posts.
31
+ - Keep claims about Audrey tied to evidence-ledger IDs.
32
+ - Keep section-body ledger references while drafting; remove them during final submission polish after claims are stable.
33
+
34
+ ## Assembled Draft Preview
35
+
36
+ | Order | File | Lines |
37
+ |---|---|---:|
38
+ | Master | `audrey-paper-v1.md` | 921 |
39
+ | 1 | `01-introduction.md` | 27 |
40
+ | 2 | `02-related-work.md` | 47 |
41
+ | 3 | `03-problem-definition.md` | 108 |
42
+ | 4 | `04-design.md` | 162 |
43
+ | 5 | `05-guardbench-spec.md` | 242 |
44
+ | 6 | `06-implementation.md` | 113 |
45
+ | 7 | `07-evaluation.md` | 124 |
46
+ | 8 | `08-discussion-limitations.md` | 61 |
47
+ | 9 | `09-conclusion.md` | 11 |
48
+ | Appendix A | `appendix-a-demo-transcript.md` | 114 |
@@ -0,0 +1,27 @@
1
+ # 1. Introduction
2
+
3
+ Tool-using agents fail in ways that ordinary chat-memory evaluation does not measure. They repeat broken shell commands after a previous run already exposed the error. They ignore project-specific setup rules that were learned in an earlier session. They lose the causal link between a failed action and the fix that made a later action safe. They treat degraded retrieval as complete memory and act anyway. In Audrey's repeated-failure demo, an agent first runs `npm run deploy` and fails because the Prisma client was not generated. Audrey records the failed tool event, stores the operational rule, and blocks the same action when it is proposed again. The transcript ends with the intended behavior of pre-action memory control: "Audrey saw the agent fail once. Audrey stopped it from failing twice." (Ledger: E25, E42)
4
+
5
+ Most memory evaluation frameworks do not test this behavior. MTEB evaluates text embeddings across retrieval and representation tasks [@muennighoff2023mteb]. LongMemEval evaluates chat assistants on information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention over long interaction histories [@wu2025longmemeval]. LoCoMo evaluates very long-term conversational memory through question answering, summarization, and multimodal dialogue generation [@maharana2024locomo]. These benchmarks are valuable, but their output target is retrieved context or an answer. They do not ask whether memory changed a future tool action before the action reached the shell, file system, browser, API, or MCP server.
6
+
7
+ This paper defines pre-action memory control as a distinct systems problem. A controller receives a proposed tool action and remembered state before execution, then returns an auditable `allow`, `warn`, or `block` decision with evidence. Section 3 gives the formal input and output contract, desired behavior properties, threat model, and scope boundaries. The key shift is the evaluation target: memory is judged by its effect on action selection, not only by the relevance of retrieved text.
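The contract in the paragraph above can be illustrated with a minimal sketch. It is not Audrey's API: `decide`, `SEVERITY`, the evidence fields, and the IDs `mem-1` and `evt-7` are hypothetical names chosen for illustration, and the "most restrictive suggestion wins" rule is an assumption rather than the controller's documented scoring.

```javascript
// Hypothetical sketch of a pre-action decision contract; not Audrey's actual API.
// All names (decide, SEVERITY, evidence fields, IDs) are illustrative assumptions.

// Decisions ordered from least to most restrictive.
const SEVERITY = { allow: 0, warn: 1, block: 2 };

// Return the most restrictive decision suggested by remembered evidence,
// plus the evidence items that support that decision.
function decide(proposedAction, rememberedEvidence) {
  let decision = "allow";
  for (const item of rememberedEvidence) {
    if (SEVERITY[item.suggests] > SEVERITY[decision]) {
      decision = item.suggests;
    }
  }
  // An "allow" needs no supporting evidence; otherwise keep the matching items.
  const evidence =
    decision === "allow"
      ? []
      : rememberedEvidence.filter((item) => item.suggests === decision);
  return { action: proposedAction, decision, evidence };
}

// The repeated-failure scenario from the introduction, with made-up evidence IDs.
const result = decide(
  { tool: "shell", command: "npm run deploy" },
  [
    { id: "mem-1", suggests: "warn", summary: "Rule: generate the Prisma client before deploying." },
    { id: "evt-7", suggests: "block", summary: "Same action failed earlier: Prisma client not generated." },
  ]
);
console.log(result.decision, result.evidence.map((e) => e.id));
```

Returning the supporting evidence alongside the decision is what makes the outcome auditable in this sketch: a caller can show which remembered item caused a `warn` or `block`.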
8
+
9
+ Audrey Guard is the artifact studied in this paper. It is a local-first memory controller for agents that observes tool outcomes, redacts traces, retrieves relevant memory, constructs a bounded capsule, scores preflight risk, generates reflexes, returns `allow`/`warn`/`block`, and validates whether memory helped after the action path completes (Ledger: E1-E17). The implementation is exposed through MCP, CLI, REST, and Python client surfaces, while the core guard path runs host-side before tool execution (Ledger: E18-E19, E26, E32-E36).
10
+
11
+ This paper makes six contributions:
12
+
13
+ 1. It formalizes pre-action memory control as a problem separate from chat recall, retrieval accuracy, and long-context question answering (Section 3).
14
+
15
+ 2. It presents Audrey Guard, a local-first controller that converts remembered failures, procedures, contradictions, recall health, and redacted tool traces into `allow`, `warn`, or `block` decisions before tool use (Sections 4 and 6; Ledger: E1-E15, E29-E40).
16
+
17
+ 3. It introduces deterministic action identity for repeated-failure prevention: tool, redacted command, normalized working directory, and sorted file scope are hashed and matched against prior failed tool events (Sections 4, 6, and 7; Ledger: E3, E25, E42).
18
+
19
+ 4. It implements a redaction-first tool-trace path so guard evidence can reference prior tool input, output, and error summaries without storing raw secrets in durable memory (Sections 4 and 6; Ledger: E12-E13).
20
+
21
+ 5. It treats recall degradation as a control signal: missing vector tables, KNN failures, and FTS failures propagate as `RecallError[]`, appear in capsules, and become high-severity preflight warnings under strict guard mode (Sections 4 and 6; Ledger: E7, E9-E10, E15, E40).
22
+
23
+ 6. It specifies GuardBench, a reproducibility contract for measuring whether memory changes future tool actions, including scenarios, baselines, metrics, redaction sweeps, machine provenance, and raw per-scenario outputs (Section 5).
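Contribution 3 names the four inputs to action identity: tool, redacted command, normalized working directory, and sorted file scope. A minimal sketch of that idea follows. The hash algorithm (SHA-256), the JSON field layout, the redaction pattern, and the names `redact`, `actionIdentity`, and `failedActions` are all assumptions made for illustration; the source does not specify them.

```javascript
// Hypothetical sketch of deterministic action identity; not Audrey's actual code.
// SHA-256, the JSON layout, and the redaction regex are assumptions.
const { createHash } = require("node:crypto");
const path = require("node:path");

// Illustrative redaction: mask the value of KEY=value pairs whose key looks secret.
function redact(command) {
  return command.replace(
    /\b([A-Z0-9_]*(?:TOKEN|SECRET|KEY|PASSWORD)[A-Z0-9_]*)=\S+/g,
    "$1=[REDACTED]"
  );
}

// Hash a canonical form so equivalent proposals map to the same identity.
function actionIdentity({ tool, command, cwd, files = [] }) {
  const canonical = JSON.stringify({
    tool,
    command: redact(command),
    // resolve() normalizes "." and ".." segments and drops trailing slashes.
    cwd: path.posix.resolve(cwd),
    // Copy before sorting so the caller's array is not mutated.
    files: [...files].sort(),
  });
  return createHash("sha256").update(canonical).digest("hex");
}

// Identities of previously failed tool events.
const failedActions = new Set();
failedActions.add(
  actionIdentity({
    tool: "shell",
    command: "API_TOKEN=abc123 npm run deploy",
    cwd: "/repo/app/",
    files: ["b.ts", "a.ts"],
  })
);

// Same action proposed again with a different secret, path spelling, and file order.
const retry = actionIdentity({
  tool: "shell",
  command: "API_TOKEN=xyz789 npm run deploy",
  cwd: "/repo/./app",
  files: ["a.ts", "b.ts"],
});
console.log(failedActions.has(retry) ? "block" : "allow");
```

Redacting before hashing means the stored identity never depends on a secret value, so a rotated token does not hide a repeated failure in this sketch.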
24
+
25
+ The empirical scope is Stage A. This paper reports implemented Audrey evidence: the controller and CLI guard path, redaction-first tool tracing, recall-degradation handling, the canonical 0.22.2 performance snapshot, the current `bench:memory:check` regression output, the local comparative GuardBench run, and the deterministic repeated-failure demo transcript (Ledger: E20-E26, E41-E42, E46). It does not report external-system GuardBench comparisons, production-load measurements, or real-provider embedding latency. The external adapter contract, Mem0 adapter, and evidence-bundle runner now exist, but live external-system scores belong in a v2 paper after credentialed runs publish raw outputs under the contract in Section 5 (Ledger: E47-E50).
26
+
27
+ Section 2 positions Audrey against memory systems, memory benchmarks, graph-memory systems, and MCP safety work. Section 3 defines the pre-action memory-control problem. Section 4 describes Audrey Guard's design. Section 5 specifies GuardBench. Section 6 documents the implementation. Section 7 reports Stage-A evaluation artifacts. Section 8 discusses limitations and open problems. Section 9 concludes.