claude_memory 0.12.0 → 0.13.0
- checksums.yaml +4 -4
- data/.claude/memory.sqlite3 +0 -0
- data/.claude/rules/claude_memory.generated.md +44 -48
- data/.claude/settings.local.json +2 -1
- data/.claude-plugin/marketplace.json +2 -2
- data/.claude-plugin/plugin.json +3 -5
- data/CHANGELOG.md +52 -0
- data/CLAUDE.md +13 -8
- data/README.md +46 -0
- data/db/migrations/019_add_observations.rb +43 -0
- data/db/migrations/020_add_observation_promotion.rb +33 -0
- data/docs/GETTING_STARTED.md +38 -0
- data/docs/api_stability.md +23 -7
- data/docs/architecture.md +18 -6
- data/docs/audit_runbook.md +67 -0
- data/docs/dashboard.md +28 -0
- data/docs/improvements.md +94 -1
- data/docs/influence/mastra-observational-memory.md +198 -0
- data/docs/influence/strands-agent-sops.md +163 -0
- data/docs/quality_review.md +45 -0
- data/docs/soak/audit_2026-06-03_agent-training-program.json +53 -0
- data/docs/soak/audit_2026-06-03_agentic.json +31 -0
- data/docs/soak/audit_2026-06-03_ai-software-architect.json +19 -0
- data/docs/soak/audit_2026-06-03_chaos_to_the_rescue.json +60 -0
- data/docs/soak/audit_2026-06-03_claude_memory.json +55 -0
- data/docs/soak/audit_2026-06-03_daily-vibe.json +59 -0
- data/docs/soak/audit_2026-06-03_minerva-sky.json +19 -0
- data/docs/soak/audit_2026-06-03_nowreading.dev.json +19 -0
- data/docs/soak/audit_2026-06-03_ups.dev.json +55 -0
- data/docs/soak/baseline_2026-06-03.md +145 -0
- data/lib/claude_memory/audit/checks.rb +149 -0
- data/lib/claude_memory/audit/runner.rb +4 -0
- data/lib/claude_memory/commands/census_command.rb +1 -1
- data/lib/claude_memory/commands/checks/embeddings_check.rb +97 -0
- data/lib/claude_memory/commands/doctor_command.rb +1 -0
- data/lib/claude_memory/commands/hook_command.rb +16 -3
- data/lib/claude_memory/commands/initializers/hooks_configurator.rb +3 -1
- data/lib/claude_memory/commands/install_skill_command.rb +4 -0
- data/lib/claude_memory/commands/observations_command.rb +367 -0
- data/lib/claude_memory/commands/registry.rb +2 -0
- data/lib/claude_memory/commands/setup_vectors_command.rb +182 -0
- data/lib/claude_memory/commands/skills/reflect.md +68 -0
- data/lib/claude_memory/commands/stats_command.rb +60 -1
- data/lib/claude_memory/dashboard/api.rb +4 -0
- data/lib/claude_memory/dashboard/index.html +154 -2
- data/lib/claude_memory/dashboard/observations.rb +115 -0
- data/lib/claude_memory/dashboard/server.rb +1 -0
- data/lib/claude_memory/distill/extraction.rb +6 -4
- data/lib/claude_memory/distill/null_distiller.rb +86 -3
- data/lib/claude_memory/distill/reference_material_detector.rb +4 -1
- data/lib/claude_memory/domain/observation.rb +118 -0
- data/lib/claude_memory/embeddings/generator.rb +1 -1
- data/lib/claude_memory/hook/context_injector.rb +100 -2
- data/lib/claude_memory/mcp/handlers/management_handlers.rb +113 -2
- data/lib/claude_memory/mcp/handlers/query_handlers.rb +48 -1
- data/lib/claude_memory/mcp/instructions_builder.rb +1 -0
- data/lib/claude_memory/mcp/query_guide.rb +28 -0
- data/lib/claude_memory/mcp/tool_definitions.rb +58 -0
- data/lib/claude_memory/mcp/tools.rb +3 -0
- data/lib/claude_memory/observe/observations_renderer.rb +49 -0
- data/lib/claude_memory/observe/reflector.rb +91 -0
- data/lib/claude_memory/publish.rb +53 -1
- data/lib/claude_memory/resolve/resolver.rb +45 -8
- data/lib/claude_memory/store/schema_manager.rb +1 -1
- data/lib/claude_memory/store/sqlite_store.rb +181 -0
- data/lib/claude_memory/sweep/maintenance.rb +15 -1
- data/lib/claude_memory/sweep/sweeper.rb +7 -1
- data/lib/claude_memory/version.rb +1 -1
- data/lib/claude_memory.rb +7 -0
- metadata +23 -1
|
@@ -0,0 +1,198 @@
|
|
|
1
|
+
# Mastra Observational Memory — Influence Study
|
|
2
|
+
|
|
3
|
+
*Analysis Date: 2026-06-16*
|
|
4
|
+
*Source: Mastra "Observational Memory" (announcement + docs + research, Feb 2026)*
|
|
5
|
+
*Type: Architecture study (feature/paradigm, not a full repo clone)*
|
|
6
|
+
*Status: Design exploration — no code yet. Branch `claude/observational-layer-design-7662r9`.*
|
|
7
|
+
|
|
8
|
+
*Sources:*
|
|
9
|
+
- *[Announcing Observational Memory (Mastra blog)](https://mastra.ai/blog/observational-memory)*
|
|
10
|
+
- *[Observational Memory docs](https://mastra.ai/docs/memory/observational-memory)*
|
|
11
|
+
- *[Observational Memory research / LongMemEval](https://mastra.ai/research/observational-memory)*
|
|
12
|
+
- *[VentureBeat: "Observational memory cuts AI agent costs 10x..."](https://venturebeat.com/data/observational-memory-cuts-ai-agent-costs-10x-and-outscores-rag-on-long)*
|
|
13
|
+
- *[The Decoder: traffic-light priority system](https://the-decoder.com/mastras-open-source-ai-memory-uses-traffic-light-emojis-for-more-efficient-compression/)*
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## Executive Summary
|
|
18
|
+
|
|
19
|
+
### What this is
|
|
20
|
+
|
|
21
|
+
Mastra Observational Memory (OM) is a **text-based, dual-agent episodic memory** for long-running agents. It compresses raw message history into a structured, append-only log of dated **observations** that lives entirely in the LLM context window — no vector or graph DB. It reports state-of-the-art LongMemEval scores (84.23% with gpt-4o; 94.87% with gpt-5-mini) at 3–6× token compression.
|
|
22
|
+
|
|
23
|
+
### Why ClaudeMemory cares
|
|
24
|
+
|
|
25
|
+
In Mastra's taxonomy, ClaudeMemory is the thing OM positions *against*: a structured **semantic** store (subject-predicate-object facts with scope, validity windows, supersession, provenance) injected **dynamically per query** via `memory.recall` and SessionStart fact injection.
|
|
26
|
+
|
|
27
|
+
The key realization from this study: **ClaudeMemory has no episodic layer at all.** Facts answer "what is true." Observations answer "what happened." OM is purely episodic; ClaudeMemory is purely semantic. An observational layer is not redundant with distillation — it is the missing half.
|
|
28
|
+
|
|
29
|
+
We already own two of OM's four moving parts in spirit:
|
|
30
|
+
- The **distillation pipeline** (NullDistiller + Layer-2 Claude-as-distiller) is an Observer that emits *facts* instead of a *narrative log*.
|
|
31
|
+
- **Resolve + Sweep** is a Reflector that operates on *facts* instead of *observations*.
|
|
32
|
+
|
|
33
|
+
The work is therefore: add a narrative episodic store, point the existing Observer/Reflector machinery at it, add a cache-stable injection mode, and — uniquely to us — bridge observations into facts via corroboration.
|
|
34
|
+
|
|
35
|
+
---
|
|
36
|
+
|
|
37
|
+
## How Mastra OM Works
|
|
38
|
+
|
|
39
|
+
### Two-block context window
|
|
40
|
+
|
|
41
|
+
1. **Observation block** — a compressed, append-only log of dated observations (decisions, key events, distilled facts from older messages). Reads like a log of decisions and actions, not documentation.
|
|
42
|
+
2. **Raw tail** — recent messages not yet compressed.
|
|
43
|
+
|
|
44
|
+
### The Observer
|
|
45
|
+
|
|
46
|
+
Fires when raw message tokens cross ~30k (configurable). A separate background agent compresses messages into new dated observations appended to the observation block. Each observation captures one discrete event: a user statement, an agent action, a tool-call result, or a preference expressed in passing. The reported compression is 3–6×.
|
|
47
|
+
|
|
48
|
+
### The Reflector
|
|
49
|
+
|
|
50
|
+
Fires when observations cross ~40k tokens (configurable). A separate background agent garbage-collects: combines related items, reflects on overarching patterns, and drops context that no longer matters.
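
The two triggers described above amount to a pair of token-threshold checks. The following is a minimal sketch, assuming the ~30k and ~40k defaults quoted in the text; the module and method names are hypothetical and are not Mastra's API.

```ruby
# Hypothetical names; only the ~30k / ~40k defaults come from the text.
module OmThresholds
  OBSERVE_AT = 30_000 # raw-message tokens that trigger the Observer
  REFLECT_AT = 40_000 # observation-block tokens that trigger the Reflector

  # Returns the background jobs that should fire for the given token counts.
  def self.due(raw_tokens:, observation_tokens:)
    jobs = []
    jobs << :observe if raw_tokens >= OBSERVE_AT
    jobs << :reflect if observation_tokens >= REFLECT_AT
    jobs
  end
end

p OmThresholds.due(raw_tokens: 31_000, observation_tokens: 12_000)
```

Keeping both checks in one place makes it easy to see that the Observer and Reflector are driven by different counters: the raw tail for one, the observation block for the other.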
|
|
51
|
+
|
|
52
|
+
### Traffic-light priority
|
|
53
|
+
|
|
54
|
+
Observations carry 🔴 (important) / 🟡 (maybe important) / 🟢 (info only). The priority is **internal** to the Observer/Reflector pipeline. When observations are presented to the main "Actor" agent, 🟡 and 🟢 are stripped — only 🔴 survives — because the priority emojis serve the memory pipeline and are visual noise to the actor.
|
|
55
|
+
|
|
56
|
+
### Prompt-cache stability (the headline win)
|
|
57
|
+
|
|
58
|
+
Because the observation block is **append-only between reflections**, the prompt prefix stays stable and every turn gets a full cache hit. Cache invalidates only on a reflection, which is infrequent. This is explicitly contrasted with RAG-style memory that re-retrieves and rewrites the prompt every turn, busting the cache and producing a variable cost curve.
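
The cache argument rests on one property: after an append, the previous rendering is still an exact prefix of the new one. A small sketch of that property follows; the class is illustrative only and does not model any real prompt cache.

```ruby
# Illustrative only: an append-only log whose earlier renders stay a prefix.
class ObservationBlock
  def initialize
    @lines = []
  end

  # Appending never rewrites earlier lines.
  def append(date, text)
    @lines << "#{date} #{text}"
    self
  end

  def render
    @lines.map { |line| "#{line}\n" }.join
  end
end

block = ObservationBlock.new.append("2026-06-16", "User chose SQLite over Postgres")
before = block.render
after = block.append("2026-06-17", "Agent added migration 019").render

# The old render is a byte-for-byte prefix of the new one, which is what a
# prefix cache needs in order to reuse earlier work.
puts after.start_with?(before)
```

A reflection pass that rewrites or reorders earlier lines breaks this property, which is why the text treats reflection as the only cache-invalidating event.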
|
|
59
|
+
|
|
60
|
+
### Storage
|
|
61
|
+
|
|
62
|
+
Plain text in a standard backend (Postgres / LibSQL / MongoDB), loaded directly into the context window — not pulled through embedding search.
|
|
63
|
+
|
|
64
|
+
---
|
|
65
|
+
|
|
66
|
+
## Comparative Analysis vs ClaudeMemory
|
|
67
|
+
|
|
68
|
+
| Dimension | Mastra OM | ClaudeMemory today |
|-----------|-----------|--------------------|
| Memory type | Episodic (narrative log) | Semantic (SPO facts) |
| Storage | Plain text in context window | Normalized SQLite + FTS5 + vec0 |
| Retrieval | None: log loaded wholesale | Dynamic per-query (FTS + vector RRF) |
| Compression | Observer (LLM), 3–6× | NullDistiller + Claude-as-distiller → facts |
| Consolidation | Reflector (LLM), lossy drop | Resolve (supersession) + Sweep (TTL/GC) |
| Provenance | Weak: compression is lossy | Strong: provenance receipts, lineage |
| Cache behavior | Stable append-only prefix | Per-query injection (cache-busting) |
| Cost | Two background LLM agents (extra API $) | Claude-as-distiller, zero extra API $ |
|
|
78
|
+
|
|
79
|
+
**The two systems are complementary, not competing.** OM's weakness is exactly ClaudeMemory's strength (provenance, truth maintenance) and vice versa (episodic recall, cache-stable injection).
|
|
80
|
+
|
|
81
|
+
---
|
|
82
|
+
|
|
83
|
+
## Adoption Opportunities (prioritized)
|
|
84
|
+
|
|
85
|
+
### High Priority
|
|
86
|
+
|
|
87
|
+
**A. Episodic observation store + Layer-1 Observer.** New `observations` table (schema v19 — v18 was taken by OTel telemetry); NullDistiller emits observation rows alongside facts; `memory.observations` read tool. Append-only with `consolidated_into` lineage (mirrors `fact_links`) rather than Mastra's lossy drop — preserves our provenance guarantee. Zero behavior change to facts.
|
|
88
|
+
|
|
89
|
+
**B. Cache-stable injection.** Publish `.claude/rules/claude_memory.observations.md` (append-only, dated, 🔴+plain only — 🟡/🟢 stripped as Mastra does for the actor). SessionStart injects a two-block context: Block 1 = consolidated observations (stable, cache-friendly), Block 2 = recent undistilled tail. Front-loading a stable block reduces the per-turn `memory.recall` churn that busts caching. *Honest limit:* we influence Claude Code's cache via a stable `additionalContext` prefix within a session; we don't control it. Cross-session caching remains Claude Code's domain.
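
Recommendation B can be pictured as a renderer that emits the stable block first and the volatile tail last. This is a sketch under assumptions: the struct, method names, and headings are hypothetical, and it reads "🔴+plain only" as "keep the 🔴 marker, render 🟡/🟢 rows without a marker".

```ruby
Obs = Struct.new(:ts, :body, :priority, keyword_init: true)

# Block 1: consolidated observations, oldest first so the prefix stays stable.
# Assumption: priority 1 keeps its marker; 2 and 3 render as plain text.
def render_observation_block(observations)
  observations.sort_by(&:ts).map do |o|
    marker = o.priority == 1 ? "🔴 " : ""
    "- #{o.ts} #{marker}#{o.body}"
  end.join("\n")
end

# Two-block context: stable block first, recent undistilled tail last.
def render_context(observations, recent_tail)
  [
    "## Observations",
    render_observation_block(observations),
    "## Recent (undistilled)",
    recent_tail.join("\n")
  ].join("\n")
end

obs = [
  Obs.new(ts: "2026-06-15", body: "Switched tests to Minitest", priority: 2),
  Obs.new(ts: "2026-06-14", body: "Decided to keep SQLite", priority: 1)
]
puts render_context(obs, ["user: run the suite again"])
```

Sorting by timestamp before rendering matters: if new rows could land in the middle of Block 1, the prefix would change and the cache benefit would be lost.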
|
|
90
|
+
|
|
91
|
+
**C. The observation→fact promotion bridge (unique to us).** The Reflector promotes *corroborated* observations into structured facts. An observation is low-commitment; a fact is committed truth. Requiring repeated, corroborated sightings before promotion is a natural confidence gate, and it directly mitigates the documented hallucination problem where the distiller commits `uses_database`/`uses_framework` facts from one-off example text in docs (today producing reject churn). With observation-first, fact-on-corroboration, a premature hallucinated fact is never committed.
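
One way to read "corroborated" is "seen in at least N distinct sessions". The sketch below assumes that reading; the threshold, the struct, and the claim strings are hypothetical, since the study does not fix a corroboration rule.

```ruby
require "set"

Sighting = Struct.new(:claim, :session_id, keyword_init: true)

# Assumption: a claim is corroborated once it has been observed in at least
# `min_sessions` distinct sessions. Repeats within one session do not count.
def promotable_claims(sightings, min_sessions: 2)
  sessions_by_claim = Hash.new { |hash, key| hash[key] = Set.new }
  sightings.each { |s| sessions_by_claim[s.claim] << s.session_id }
  sessions_by_claim.select { |_, sessions| sessions.size >= min_sessions }.keys
end

sightings = [
  Sighting.new(claim: "uses_database:sqlite", session_id: "s1"),
  Sighting.new(claim: "uses_database:sqlite", session_id: "s2"),
  Sighting.new(claim: "uses_database:postgres", session_id: "s1"), # one-off doc example
  Sighting.new(claim: "uses_database:postgres", session_id: "s1")  # same session again
]
p promotable_claims(sightings)
```

Counting distinct sessions rather than raw sightings is the design choice that blocks the doc-example case: a snippet quoted twice in one session still counts once.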
|
|
92
|
+
|
|
93
|
+
### Medium Priority
|
|
94
|
+
|
|
95
|
+
**D. Automatic Reflector (free) — confirmed feasible.** A consultation with the claude-code-guide agent (2026-06-16) confirms automatic reflection is achievable with zero extra API cost, in two tiers:
|
|
96
|
+
- **Deterministic tier (fully autonomous, no model):** dedupe near-identical observations, drop stale 🟢 past a TTL, merge by entity/time window — pure Ruby, run shell-side inside the `PreCompact` and `SessionEnd` hooks (and the existing Sweep). This needs no model and fires automatically.
|
|
97
|
+
- **Semantic tier (autonomous-on-next-turn, rides the session):** at `PreCompact`, the hook injects a reflection instruction via `additionalContext` ("consolidate the observation log: combine related items, surface patterns, drop the irrelevant"). Claude Code itself performs the consolidation on its next turn, inside the existing session — no separate paid call.
|
|
98
|
+
|
|
99
|
+
See the dedicated section below for why `PreCompact` is the right trigger and what the constraints are. This **supersedes** the earlier "manual `/reflect` only" recommendation: `/reflect` remains as a manual on-demand deep pass, but reflection is now primarily automatic.
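
The deterministic tier can be sketched as a single pass over active rows. This is an illustration under assumptions: "near-identical" is approximated by equality after normalizing case and whitespace, the status values are hypothetical, and rows are flagged rather than deleted so that lineage survives.

```ruby
Row = Struct.new(:id, :ts, :body, :priority, :status, :consolidated_into, keyword_init: true)

GREEN = 3 # priority 3 = 🟢 info, following the data model sketch

# Normalizes case and whitespace so trivially different bodies compare equal.
def fingerprint(body)
  body.downcase.gsub(/\s+/, " ").strip
end

# One deterministic pass: tombstone duplicates and expire 🟢 rows older than
# `ttl_days`. Rows are mutated in place and never removed.
def reflect!(rows, now:, ttl_days: 30)
  survivors = {}
  rows.sort_by(&:ts).each do |row|
    next unless row.status == "active"

    key = fingerprint(row.body)
    if survivors.key?(key)
      row.status = "consolidated"
      row.consolidated_into = survivors[key].id
    elsif row.priority == GREEN && (now - row.ts) > ttl_days * 86_400
      row.status = "expired"
    else
      survivors[key] = row
    end
  end
  rows
end

now = Time.utc(2026, 6, 16)
rows = [
  Row.new(id: 1, ts: Time.utc(2026, 6, 1), body: "Chose SQLite", priority: 1, status: "active"),
  Row.new(id: 2, ts: Time.utc(2026, 6, 2), body: "chose  sqlite", priority: 2, status: "active"),
  Row.new(id: 3, ts: Time.utc(2026, 4, 1), body: "Ran rubocop", priority: 3, status: "active")
]
reflect!(rows, now: now).each { |r| p [r.id, r.status, r.consolidated_into] }
```

Because nothing here calls a model, the same pass can run from a hook or from Sweep without affecting API spend.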
|
|
100
|
+
|
|
101
|
+
**E. Compression / cache telemetry.** Reuse the `context_tokens` telemetry on `hook_context` events (0.11.0) and the Trust/Health panels to report compression ratio and token reduction. Add a LongMemEval-style episodic/long-session suite to DevMemBench alongside the existing retrieval and truth-maintenance suites.
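
The compression figure itself is simple arithmetic over two token counts. A sketch with a hypothetical helper follows; neither the method name nor the return shape comes from claude_memory.

```ruby
# Hypothetical helper: compares raw transcript tokens with the tokens of the
# observations that replaced them.
def compression_stats(raw_tokens:, observation_tokens:)
  raise ArgumentError, "observation_tokens must be positive" unless observation_tokens.positive?

  {
    ratio: (raw_tokens.to_f / observation_tokens).round(2),
    tokens_saved: raw_tokens - observation_tokens
  }
end

p compression_stats(raw_tokens: 30_000, observation_tokens: 7_500)
```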
|
|
102
|
+
|
|
103
|
+
### Features to Avoid (from this study)
|
|
104
|
+
|
|
105
|
+
- **Two always-on background LLM agents.** Violates the standing convention against features requiring separate Anthropic API calls. Our Observer = context-hook injection (Claude-as-distiller); our Reflector = deterministic shell-side GC + `PreCompact`-injected semantic consolidation that rides the existing session (see automatic-reflection section).
|
|
106
|
+
- **Claude Code Routines / subagents for reflection.** Routines run as a separate scheduled cloud session (separate token budget); subagents run in their own context window (~7× token burn). Both incur extra spend — rejected for recurring reflection. Reserve them, if ever, for a one-off heavy backfill the user explicitly opts into.
|
|
107
|
+
- **Lossy drop on reflection.** Mastra truly discards observations ("never forgives"). We tombstone via `consolidated_into` and retain raw `content_items` — provenance is non-negotiable.
|
|
108
|
+
- **Replacing dynamic recall.** Augment, don't replace. Observations become a front-loaded episodic block; `memory.recall` stays for targeted lookups.
|
|
109
|
+
|
|
110
|
+
---
|
|
111
|
+
|
|
112
|
+
## Proposed Data Model (sketch)
|
|
113
|
+
|
|
114
|
+
```
observations (schema v19)
  id, ts (event time), session_id
  body                    -- dense narrative text, the observation itself
  kind                    -- user_statement | agent_action | tool_result | preference | decision | event
  priority                -- 1=🔴 important, 2=🟡 maybe, 3=🟢 info (internal pipeline signal)
  scope, project_path
  source_content_item_id  -- provenance back to the raw transcript chunk
  consolidated_into       -- Reflector lineage (mirrors fact_links supersession)
  token_count             -- for budget / compression math
  status, created_at, reflected_at
```
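
The same sketch can be expressed as a plain Ruby value object. The field names follow the listing above; the validation rules and helper methods are assumptions added for illustration and are not the gem's `Domain::Observation`.

```ruby
OBSERVATION_KINDS = %w[user_statement agent_action tool_result preference decision event].freeze
OBSERVATION_PRIORITIES = { 1 => "🔴", 2 => "🟡", 3 => "🟢" }.freeze

Observation = Struct.new(
  :id, :ts, :session_id, :body, :kind, :priority, :scope, :project_path,
  :source_content_item_id, :consolidated_into, :token_count,
  :status, :created_at, :reflected_at,
  keyword_init: true
) do
  # Returns a list of problems; an empty list means the row is well-formed.
  def errors
    problems = []
    problems << "unknown kind: #{kind.inspect}" unless OBSERVATION_KINDS.include?(kind)
    problems << "unknown priority: #{priority.inspect}" unless OBSERVATION_PRIORITIES.key?(priority)
    problems << "body is empty" if body.to_s.strip.empty?
    problems
  end

  # A row is a tombstone once the Reflector has folded it into another row.
  def tombstoned?
    !consolidated_into.nil?
  end
end

row = Observation.new(body: "Chose SQLite for local-first storage", kind: "decision", priority: 1)
p row.errors
```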
|
|
126
|
+
|
|
127
|
+
## Proposed Pipeline Integration
|
|
128
|
+
|
|
129
|
+
```
        Transcripts → Ingest → Index (FTS5)
                          ↓
      ┌─────────────── Distill ───────────────┐
      │                                       │
Facts (SPO, semantic)          Observations (narrative, episodic)   ← NEW
      │                                       │
Resolve (truth maint.)         Reflect (consolidate / GC / pattern) ← NEW
      │                                       │
Store (facts)                  Store (observations)                 ← NEW
      │                                       │
      └──────────── Promotion bridge ──────────┘
        (Reflector promotes corroborated observations → facts)
                          ↓
Publish: stable observation block (cache-friendly) + fact snapshot
```
|
|
145
|
+
|
|
146
|
+
## Automatic Reflection in Claude Code (consultation findings, 2026-06-16)
|
|
147
|
+
|
|
148
|
+
Source: claude-code-guide agent consultation. Citations: [Hooks reference](https://code.claude.com/docs/en/hooks.md), [Subagents](https://code.claude.com/docs/en/subagents.md), [Routines / scheduled tasks](https://code.claude.com/docs/en/web-scheduled-tasks).
|
|
149
|
+
|
|
150
|
+
**What does not exist:** There is no timer-, cron-, or idle-based hook event. Hook events are lifecycle-driven only — `SessionStart`, `SessionEnd`, `UserPromptSubmit`, `Stop`/`StopFailure`, `PreCompact`/`PostCompact`, `PreToolUse`/`PostToolUse(Failure)`, plus async signals (`FileChanged`, etc.). No hook can force a model turn or enqueue a prompt; a hook can only inject `additionalContext` that the model acts on at its *next* invocation.
|
|
151
|
+
|
|
152
|
+
**What this unlocks anyway:** `PreCompact` is the right reflection trigger because it fires precisely when the context window is filling — i.e. on *context pressure*. That is conceptually the same signal Mastra uses (Reflector fires at a ~40k-token observation threshold). So "reflect when memory gets big" maps cleanly onto "reflect when Claude Code is about to compact."
|
|
153
|
+
|
|
154
|
+
**The free automatic pattern (recommended):**
|
|
155
|
+
- `PreCompact` + `SessionEnd` hooks run the **deterministic** Reflector shell-side in Ruby (dedupe / TTL-drop 🟢 / merge) — fully autonomous, no model, no cost.
|
|
156
|
+
- `PreCompact` injects an `additionalContext` instruction that makes Claude perform the **semantic** consolidation (pattern-finding, observation→fact promotion) on its next turn, inside the existing session — no separate paid call.
|
|
157
|
+
- `SessionStart` injects the consolidated two-block observation log (already in recommendation B).
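
The semantic half of this pattern comes down to the hook printing a small JSON document. The sketch below is hedged: `hookSpecificOutput` and `additionalContext` are the keys this document names, `hookEventName` is an assumed field, and whether `PreCompact` honors injected context should be checked against the hooks reference cited above.

```ruby
require "json"

# Wording taken from the reflection instruction quoted earlier in this study.
REFLECT_INSTRUCTION =
  "Consolidate the observation log: combine related items, " \
  "surface patterns, drop the irrelevant."

# Builds the payload a hook would print to stdout.
# Assumption: the "hookEventName" key and its value are illustrative.
def precompact_payload(instruction = REFLECT_INSTRUCTION)
  {
    "hookSpecificOutput" => {
      "hookEventName" => "PreCompact",
      "additionalContext" => instruction
    }
  }
end

puts JSON.generate(precompact_payload)
```

The hook only supplies text; the consolidation itself happens when the model next runs, which is why the study describes this tier as autonomous on the next turn rather than immediate.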
|
|
158
|
+
|
|
159
|
+
**Where extra cost is unavoidable (and therefore rejected):** truly autonomous *between-session* reflection on a wall clock. That requires Claude Code Routines (separate paid cloud session) or a headless `claude -p` call or a subagent (~7× tokens) — all separate spend. We accept the tradeoff: our reflection is automatic on *lifecycle events* (compaction, session boundaries), not on a wall-clock timer. For our single-developer, local-first scale this is sufficient.
|
|
160
|
+
|
|
161
|
+
## Suggested Phasing
|
|
162
|
+
|
|
163
|
+
1. Schema + Layer-1 Observer (table, NullDistiller rows, `memory.observations`).
|
|
164
|
+
2. Stable two-block injection; measure token/compression deltas.
|
|
165
|
+
3. **Automatic Reflector**: deterministic GC shell-side in `PreCompact` + `SessionEnd`/Sweep.
|
|
166
|
+
4. **Automatic semantic reflection**: `PreCompact` `additionalContext` consolidation instruction + observation→fact promotion bridge. Keep a manual `/reflect` skill for on-demand deep passes.
|
|
167
|
+
|
|
168
|
+
Phase 4 is where this stops being "Mastra-on-Ruby" and becomes a hybrid episodic+semantic system stronger than either alone.
|
|
169
|
+
|
|
170
|
+
## Decisions for ClaudeMemory (memory-convention format)
|
|
171
|
+
|
|
172
|
+
Per the `/study-repo` memory discipline, the following are decisions about **claude_memory itself** derived from this study — to be stored via `memory.store_extraction` (`subject=claude_memory`, `decision`/`architecture` predicate, reason clause embedded) once the memory MCP server is connected. External facts about Mastra stay in this influence doc, not in memory.
|
|
173
|
+
|
|
174
|
+
- **Decision:** claude_memory will add an episodic observation layer that *augments* (does not replace) the dynamic-recall semantic fact store — because facts answer "what is true" and observations answer "what happened," and we currently have no episodic half; recall stays for targeted lookups while observations provide a stable front-loaded narrative. (User-confirmed "augment" on 2026-06-16.)
|
|
175
|
+
- **Decision:** observation reflection will be automatic via the `PreCompact` and `SessionEnd` hooks rather than a manual-only skill — because Claude Code exposes no timer/cron hook, but `PreCompact` fires on context pressure (the analog of Mastra's token-threshold trigger) and rides the existing session at no extra API cost.
|
|
176
|
+
- **Decision:** the Reflector's deterministic GC runs shell-side in Ruby and its semantic consolidation runs via `PreCompact` `additionalContext` (Claude-as-reflector inline) — to keep automatic reflection within the no-extra-API-cost convention, explicitly rejecting Claude Code Routines and subagents because each incurs a separate token budget.
|
|
177
|
+
- **Decision:** reflection will tombstone superseded observations via a `consolidated_into` link rather than hard-deleting them (unlike Mastra's lossy drop) — to preserve claude_memory's provenance guarantee while still bounding context size.
|
|
178
|
+
- **Decision:** an observation is promoted to a structured fact only after corroboration across multiple observations — because requiring repeated sightings before commitment doubles as an anti-hallucination gate against the documented reject-churn from one-off doc/example text.
|
|
179
|
+
|
|
180
|
+
> **Pending memory ingestion (deferred 2026-06-16).** These were NOT written to the project DB in the remote design session: `.claude/memory.sqlite3` is git-LFS-backed and the container had only the pointer (no `git-lfs`, real DB not materialized), so writing would have clobbered it. Store them in a local session via `memory.store_extraction` with this payload:
|
|
181
|
+
>
|
|
182
|
+
> ```json
> {
>   "scope": "project",
>   "facts": [
>     {"subject": "claude_memory", "predicate": "decision", "object": "Add an episodic observation layer that augments (does not replace) the dynamic-recall semantic fact store, because facts answer 'what is true' and observations answer 'what happened' and we currently lack the episodic half; recall stays for targeted lookups."},
>     {"subject": "claude_memory", "predicate": "decision", "object": "Make observation reflection automatic via the PreCompact and SessionEnd hooks rather than a manual-only skill, because Claude Code exposes no timer/cron hook but PreCompact fires on context pressure (the analog of Mastra's token threshold) and rides the existing session at no extra API cost."},
>     {"subject": "claude_memory", "predicate": "decision", "object": "Run the Reflector's deterministic GC shell-side in Ruby and its semantic consolidation via PreCompact additionalContext (Claude-as-reflector inline), to keep automatic reflection within the no-extra-API-cost convention, explicitly rejecting Claude Code Routines and subagents because each incurs a separate token budget."},
>     {"subject": "claude_memory", "predicate": "decision", "object": "Tombstone superseded observations via a consolidated_into link rather than hard-deleting them (unlike Mastra's lossy drop), to preserve claude_memory's provenance guarantee while still bounding context size."},
>     {"subject": "claude_memory", "predicate": "decision", "object": "Promote an observation to a structured fact only after corroboration across multiple observations, because requiring repeated sightings before commitment doubles as an anti-hallucination gate against reject-churn from one-off doc/example text."}
>   ]
> }
> ```
|
|
194
|
+
|
|
195
|
+
## Open Questions
|
|
196
|
+
|
|
197
|
+
- **Augment vs replace recall?** Resolved: **augment** (user-confirmed 2026-06-16). Observations become a front-loaded episodic block; `memory.recall` stays for targeted lookups.
|
|
198
|
+
- **Automatic vs manual reflection?** Resolved: **automatic** via `PreCompact`/`SessionEnd` (deterministic GC shell-side + semantic consolidation injected for the next turn), with `/reflect` retained for manual deep passes. The only thing we forgo is wall-clock between-session reflection, which would cost extra (Routines/subagents) — deliberately rejected.
|
|
@@ -0,0 +1,163 @@
|
|
|
1
|
+
# Strands Agent SOPs Analysis
|
|
2
|
+
|
|
3
|
+
*Analysis Date: 2026-05-01*
|
|
4
|
+
*Source: AWS Open Source Blog — "Introducing Strands Agent SOPs: Natural Language Workflows for AI Agents"*
|
|
5
|
+
*URL: https://aws.amazon.com/blogs/opensource/introducing-strands-agent-sops-natural-language-workflows-for-ai-agents/*
|
|
6
|
+
*Type: Article (not a repo). PyPI package `strands-agents-sops`, GitHub `strands-agents/agent-sop`.*
|
|
7
|
+
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
## Executive Summary
|
|
11
|
+
|
|
12
|
+
**Agent SOPs** are markdown-based "Standard Operating Procedures" that wrap an AI agent's instructions in a parameterized, RFC-2119-keyworded, chain-able format. Amazon teams use thousands of them internally; the open-source release ships four reference SOPs (`codebase-summary`, `pdd`, `code-task-generator`, `code-assist`) and tooling to author/load them via Strands Agents, MCP prompts, Claude Skills, or raw model calls.
|
|
13
|
+
|
|
14
|
+
**Verdict for ClaudeMemory**: ClaudeMemory already implements most of what SOPs propose, under different names. Skills (`/distill-transcripts`, `/release`, `/study-repo`) *are* SOPs. The hook context injection pipeline already chains stages (ingest → distill → resolve → publish). The current distillation prompt already uses RFC-2119-ish "MUST" language for the reason-clause requirement.
|
|
15
|
+
|
|
16
|
+
The genuinely novel ideas — and the only ones worth a closer look — are:
|
|
17
|
+
1. **Explicit parameter contracts** at SOP entry (Required/Optional with defaults), versus our skills that take freeform `$ARGUMENTS`.
|
|
18
|
+
2. **Progress checkpoints + resumability** for long-running workflows (`✅ Step 1 complete` style markers the agent emits).
|
|
19
|
+
3. **Self-describing format spec** (`strands-agents-sops rule` command) that lets Claude author new SOPs from a description.
|
|
20
|
+
|
|
21
|
+
The rest is old news for us. **Recommendation: do not adopt the Strands library or format. Borrow two narrow ideas (resumability + explicit parameters) into `/distill-transcripts` if and only if real distillation runs are large enough to fail mid-batch.**
|
|
22
|
+
|
|
23
|
+
## What an SOP Actually Is
|
|
24
|
+
|
|
25
|
+
Per the article, an SOP is markdown with these conventions:
|
|
26
|
+
|
|
27
|
+
- **RFC 2119 keywords** (MUST / SHOULD / MAY) for behavioral control.
|
|
28
|
+
- **Required/Optional parameters block** with defaults — gathered from the user via natural-language dialogue at invocation time.
|
|
29
|
+
- **Numbered steps** the agent executes sequentially.
|
|
30
|
+
- **Progress annotations** the agent prints as it goes (`✅ Validated codebase path exists`).
|
|
31
|
+
- **Output artifacts** in a conventional `.sop/<name>/` directory, used as handoff between chained SOPs.
|
|
32
|
+
|
|
33
|
+
Example parameter declaration (the only verbatim format snippet in the article):
|
|
34
|
+
|
|
35
|
+
```
Required Parameters:
• codebase_path: Path to the codebase to analyze
Optional Parameters:
• output_dir: Directory where documentation will be stored (default: ".sop/summary")
```
|
|
41
|
+
|
|
42
|
+
Invocation surfaces:
|
|
43
|
+
|
|
44
|
+
- **Strands Agents (Python)**: `Agent(system_prompt=code_assist, tools=[editor, shell])` — SOP becomes the system prompt.
|
|
45
|
+
- **MCP**: SOPs registered as MCP *prompts* (the `prompts/list` + `prompts/get` channel), invoked with `@codebase-summary` in Kiro CLI / `/prompts` listing.
|
|
46
|
+
- **Claude Skills**: A CLI converts SOPs to Anthropic Skill format.
|
|
47
|
+
- **Direct LLM**: paste into a model's message and run.
|
|
48
|
+
|
|
49
|
+
Composition is **sequential chaining via artifact handoff** — `codebase-summary` writes docs, `pdd` reads them. No nesting/include directive is shown.
|
|
50
|
+
|
|
51
|
+
## How This Maps to ClaudeMemory Today
|
|
52
|
+
|
|
53
|
+
| Strands concept | Our equivalent | Status |
|---|---|---|
| Markdown SOP | `lib/claude_memory/commands/skills/*.md` (Anthropic Skills) | ✅ Have it |
| MCP prompts surface | `MCP::QueryGuide` registers `memory_guide` via `prompts/list`+`prompts/get` | ✅ Have it |
| RFC-2119 "MUST" in instructions | `distill-transcripts.md:38-43` uses MUST for reason-clause embed | ✅ Have it |
| SessionStart prompt injection | `hook_command.rb:213` writes `hookSpecificOutput.additionalContext` | ✅ Have it |
| SOP chaining via artifacts | `Ingest → Distill → Resolve → Store → Publish` (CLAUDE.md L72-79) | ✅ Have it (DB rows are the artifacts) |
| AI-assisted SOP authoring | `/skill-creator` skill | ✅ Have it |
| Format spec exposable to Claude | `strands-agents-sops rule` CLI | ⚠️ Partial: our distillation prompt is in `distill-transcripts.md`, not exposed as a tool |
| Required/Optional parameter contract | We pass `$ARGUMENTS` as freeform text | ❌ Missing |
| Progress checkpoints + resumability | `/distill-transcripts` runs end-to-end; no mid-batch checkpoint | ❌ Missing |
| Reference SOPs (`codebase-summary` etc.) | N/A (wrong domain) | ❌ Not applicable |
|
|
65
|
+
|
|
66
|
+
The pattern is clear: we independently arrived at the same architecture. The two rows marked "❌ Missing" are the only candidates worth considering for adoption.
|
|
67
|
+
|
|
68
|
+
## Where SOPs Could Improve Distillation
|
|
69
|
+
|
|
70
|
+
### 1. Resumability for `/distill-transcripts`
|
|
71
|
+
|
|
72
|
+
**Current state.** `/distill-transcripts --limit 10` calls `memory.undistilled`, processes items one-by-one, calls `memory.mark_distilled` after each. If the run aborts mid-batch (rate limit, context exhaustion, user Ctrl-C), the items processed before the abort are marked, the rest are not. There is no explicit checkpoint file, but the DB itself is the checkpoint.
|
|
73
|
+
|
|
74
|
+
**SOPs angle.** SOPs add visible progress markers (`✅ Item 4/10 complete`) and a resume contract (`if .sop/distill/state.json exists, skip processed items`).
|
|
75
|
+
|
|
76
|
+
**Honest verdict.** Our DB-as-checkpoint already handles resumability. The visible-progress angle is a UX win for big runs but not a correctness improvement. **Do this only if we add a `--limit 100`+ workflow that users actually run.** Today nobody runs that.
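
The DB-as-checkpoint behavior can be shown with an in-memory stand-in. The real tools named above are `memory.undistilled` and `memory.mark_distilled`; the `FakeStore` class and `distill_batch` method below are hypothetical and only imitate the mark-after-each-item order.

```ruby
# In-memory stand-in for the store; not the gem's API.
class FakeStore
  def initialize(ids)
    @distilled = ids.to_h { |id| [id, false] }
  end

  def undistilled(limit:)
    @distilled.reject { |_, done| done }.keys.first(limit)
  end

  def mark_distilled(id)
    @distilled[id] = true
  end
end

# Marks each item right after it is processed, so an abort loses at most the
# item in flight and a rerun picks up the unmarked remainder.
def distill_batch(store, limit:)
  processed = []
  store.undistilled(limit: limit).each do |id|
    yield id
    store.mark_distilled(id)
    processed << id
  end
  processed
end

store = FakeStore.new([1, 2, 3, 4])
begin
  distill_batch(store, limit: 10) { |id| raise "rate limited" if id == 3 }
rescue RuntimeError
  # Items 1 and 2 were marked before the abort.
end
resumed = distill_batch(store, limit: 10) { |id| id }
p resumed
```

No state file is involved: the second call simply asks the store what is still unmarked, which is the sense in which the database is the checkpoint.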
|
|
77
|
+
|
|
78
|
+
### 2. Required/Optional Parameter Block in Skills
|
|
79
|
+
|
|
80
|
+
**Current state.** `/distill-transcripts` accepts `--limit N` parsed implicitly inside the skill body. Other skills accept `$ARGUMENTS` as a freeform blob. Users discover parameters by reading the skill markdown.
|
|
81
|
+
|
|
82
|
+
**SOPs angle.** Declared parameter blocks let the agent prompt the user before running ("what's the codebase_path?"), and let tooling (an SOP registry, MCP prompt list) introspect the contract.
|
|
83
|
+
|
|
84
|
+
**Honest verdict.** Anthropic Skills allow YAML frontmatter that already does this (`argument-hint`, parameter docs). We are under-using that frontmatter. **Cheap, safe improvement.** Adding a `Parameters:` block to the top of `distill-transcripts.md`, `release.md`, `study-repo.md` (and friends) costs ~10 minutes per skill and makes them self-documenting to both humans and any agent reading them.
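
To make "introspect the contract" concrete, here is a sketch that parses the Required/Optional layout quoted from the article into structured records. The `Param` struct and `parse_parameters` method are hypothetical, and handling of anything beyond that one quoted snippet is a guess.

```ruby
Param = Struct.new(:name, :description, :required, :default, keyword_init: true)

# Parses the "Required Parameters:" / "Optional Parameters:" layout.
# Lines before the first heading, or lines that do not look like
# "• name: description (default: value)", are skipped.
def parse_parameters(text)
  required = nil
  text.each_line.filter_map do |line|
    line = line.strip
    if line.match?(/\A(Required|Optional) Parameters:\z/i)
      required = line.downcase.start_with?("required")
      next
    end

    match = line.match(/\A[•\-*]\s*(\w+):\s*(.+?)(?:\s*\(default:\s*"?([^")]*)"?\))?\z/)
    next if match.nil? || required.nil?

    Param.new(name: match[1], description: match[2], required: required, default: match[3])
  end
end

snippet = <<~SOP
  Required Parameters:
  • codebase_path: Path to the codebase to analyze
  Optional Parameters:
  • output_dir: Directory where documentation will be stored (default: ".sop/summary")
SOP
parse_parameters(snippet).each { |param| p param.to_h }
```

With records like these, an agent or registry can list which parameters are missing before a skill runs instead of guessing from freeform `$ARGUMENTS`.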
|
|
85
|
+
|
|
86
|
+
### 3. Format-Spec-As-Tool for Authoring
|
|
87
|
+
|
|
88
|
+
**Current state.** `/skill-creator` exists. It has the format knowledge in its prompt body.
|
|
89
|
+
|
|
90
|
+
**SOPs angle.** `strands-agents-sops rule` is a CLI command that prints the SOP format spec to stdout, so any agent can `Bash` it and learn how to author one. This is a small but real ergonomic win — the spec lives in one place, not duplicated into every "make a new skill" prompt.
|
|
91
|
+
|
|
92
|
+
**Honest verdict.** Marginal. We don't have a sprawl of skill-authoring locations to consolidate. **Defer indefinitely** unless we start writing many more skills.
|
|
93
|
+
|
|
94
|
+
## What NOT to Adopt
|
|
95
|
+
|
|
96
|
+
- **The Python package itself.** Strands is a Python agent framework; we're a Ruby gem. No reuse path.
|
|
97
|
+
- **The `.sop/<name>/` artifact directory convention.** We persist via DB rows + `claude_memory.generated.md`. Adding a parallel filesystem artifact tree would just add cleanup burden and an out-of-DB state to reconcile.
|
|
98
|
+
- **The four reference SOPs (`codebase-summary` etc.).** Wrong domain — they're for code-workflow agents, not memory pipelines. Nothing to lift.
|
|
99
|
+
- **Renaming "skills" to "SOPs" in our docs.** Anthropic's term is *Skills*; that's the term Claude Code users know. Adopting Amazon's term creates confusion for zero gain.
|
|
100
|
+
- **Sequential-only chaining as an enforced pattern.** Our pipeline already chains, but we should keep room for parallel work (e.g., NullDistiller layer 1 runs synchronously in the ingest hook regardless of layer 2/3). SOP chaining is sequential by construction.
|
|
101
|
+
|
|
102
|
+
## Adoption Opportunities
|
|
103
|
+
|
|
104
|
+
### Medium Priority
|
|
105
|
+
|
|
106
|
+
#### 1. Parameter blocks in skill frontmatter
|
|
107
|
+
|
|
108
|
+
- **Value**: Self-documenting skills; Claude can prompt the user for missing parameters instead of guessing from `$ARGUMENTS`. Better introspectability for any future skill registry UI.
|
|
109
|
+
- **Evidence**: Article's `Required Parameters / Optional Parameters` block — only structural snippet quoted verbatim; Anthropic Skills format already supports `argument-hint` and similar fields we under-use.
|
|
110
|
+
- **Implementation**: Add `## Parameters` section near the top of `lib/claude_memory/commands/skills/distill-transcripts.md`, `release.md`, `study-repo.md`, `quality-update.md`, `improve.md`. Format: bullet list with `name: description (default: …)`.
|
|
111
|
+
- **Effort**: ~30 minutes total across all skills.
|
|
112
|
+
- **Trade-off**: Tiny doc maintenance burden; no runtime cost.
|
|
113
|
+
- **Recommendation**: ADOPT (low-cost, high-clarity).
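To make the "self-documenting to tooling" claim concrete, here is a minimal sketch of how a `## Parameters` section in the bullet shape `name: description (default: …)` could be read back by a script. Everything named here (`SkillParameter`, `PARAM_LINE`, `parse_parameters`) is hypothetical and not part of the gem, and the bullet grammar is an assumption about how the blocks would be written.

```ruby
# Hypothetical sketch: read a "## Parameters" section out of a skill markdown file.
# The bullet shape `name: description (default: ...)` follows the proposal above;
# SkillParameter, PARAM_LINE and parse_parameters are illustrative names only.

SkillParameter = Struct.new(:name, :description, :default, keyword_init: true)

# Matches "- name: description" with an optional trailing "(default: value)".
PARAM_LINE = /\A-\s+`?(?<name>[\w-]+)`?:\s+(?<desc>.*?)(?:\s+\(default:\s*(?<default>[^)]*)\))?\s*\z/

def parse_parameters(markdown)
  in_section = false
  markdown.each_line.with_object([]) do |line, params|
    stripped = line.strip

    # Any level-2 heading either opens or closes the Parameters section.
    if stripped.start_with?("## ")
      in_section = (stripped == "## Parameters")
      next
    end
    next unless in_section

    match = PARAM_LINE.match(stripped)
    next unless match

    params << SkillParameter.new(
      name: match[:name],
      description: match[:desc],
      default: match[:default] # nil when no default is declared
    )
  end
end
```

A registry UI or a pre-run prompt could treat parameters whose `default` is `nil` as the ones to ask the user about. Whether that is the right rule for "required" is a design choice this sketch does not settle.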
|
|
114
|
+
|
|
115
|
+
### Low Priority
|
|
116
|
+
|
|
117
|
+
#### 2. Progress markers + explicit checkpoint file in `/distill-transcripts`
|
|
118
|
+
|
|
119
|
+
- **Value**: Better UX on long runs (users see progress); cleaner resume after mid-batch failure.
|
|
120
|
+
- **Evidence**: Article shows `✅ Validated codebase path exists` style output; SOPs document progress to support resumability.
|
|
121
|
+
- **Implementation**: Have `/distill-transcripts` print `[N/M] item <docid> → K facts` after each `memory.mark_distilled`. Optionally, write `.claude/distill_state.json` with `last_processed_content_id` so a re-run can resume.
|
|
122
|
+
- **Effort**: ~1 hour for stdout markers; ~3 hours including a state file with safe-resume semantics.
|
|
123
|
+
- **Trade-off**: State file adds another moving piece; DB already handles correctness, so this is purely UX. Not worth doing until somebody actually runs `/distill-transcripts --limit 100+` regularly.
|
|
124
|
+
- **Recommendation**: DEFER. Revisit if dashboard/usage data shows multi-hundred-item distillation runs.
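If this item is ever picked up, the two pieces could look roughly like the sketch below: a `[N/M] item <docid> → K facts` line per item, plus a small JSON state file holding `last_processed_content_id`. The helper names, the assumption that content ids increase monotonically, and the block that stands in for the real distillation step are all hypothetical; the gem's actual `memory.mark_distilled` flow is not modelled here.

```ruby
require "json"

# Hypothetical sketch of the deferred progress-marker and checkpoint idea.
# Assumes content ids increase monotonically, which the source does not confirm.

def progress_line(index, total, doc_id, fact_count)
  "[#{index}/#{total}] item #{doc_id} → #{fact_count} facts"
end

def load_last_processed(state_path)
  return nil unless File.exist?(state_path)

  JSON.parse(File.read(state_path))["last_processed_content_id"]
rescue JSON::ParserError
  nil # a corrupt state file is treated as "no checkpoint"
end

def save_last_processed(state_path, content_id)
  tmp_path = "#{state_path}.tmp"
  File.write(tmp_path, JSON.generate({"last_processed_content_id" => content_id}))
  File.rename(tmp_path, state_path) # rename so a crash never leaves a half-written file
end

# Yields each pending item to the caller's block, which returns the fact count.
def distill_batch(items, state_path:, out: $stdout)
  last = load_last_processed(state_path)
  pending = last ? items.select { |item| item[:id] > last } : items

  pending.each_with_index do |item, index|
    fact_count = yield(item)
    out.puts progress_line(index + 1, pending.size, item[:id], fact_count)
    save_last_processed(state_path, item[:id])
  end
  pending.size
end
```

Writing the checkpoint only after an item finishes means a crash mid-item causes that item to be retried on the next run, which matches the "DB already handles correctness" framing: the state file only saves work, it is never the source of truth.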
|
|
125
|
+
|
|
126
|
+
#### 3. SOP-style format spec exposed as MCP prompt
|
|
127
|
+
|
|
128
|
+
- **Value**: Lets a future "make me a new skill" agent fetch our skill format spec via MCP `prompts/get` instead of duplicating it.
|
|
129
|
+
- **Evidence**: Article's `strands-agents-sops rule` command — same idea.
|
|
130
|
+
- **Implementation**: Add a `skill_authoring_guide` prompt to `MCP::QueryGuide` alongside `memory_guide`.
|
|
131
|
+
- **Effort**: ~1 hour.
|
|
132
|
+
- **Trade-off**: Solves a problem we don't yet have. We have one skill-authoring location (`/skill-creator`).
|
|
133
|
+
- **Recommendation**: DEFER until skill sprawl is a real problem.
|
|
134
|
+
|
|
135
|
+
### Features to Avoid
|
|
136
|
+
|
|
137
|
+
- **Generic "SOP runtime" abstraction** layered over our skills: pure ceremony. Anthropic Skills already give us the runtime.
|
|
138
|
+
- **`.sop/<name>/` artifact filesystem** parallel to our DB: doubles state, doubles cleanup, halves the value of having a curated SQLite store.
|
|
139
|
+
- **Adopting the term "SOP" anywhere user-facing**: term collision with Skills.
|
|
140
|
+
|
|
141
|
+
## Implementation Recommendations
|
|
142
|
+
|
|
143
|
+
**Phase 1 (do this in any 0.12.x release).** Add `## Parameters` blocks to the existing skill markdowns. ~30 minutes. Closes the only meaningful gap from this study.
|
|
144
|
+
|
|
145
|
+
**Phase 2 (defer).** Progress markers + `.claude/distill_state.json` checkpoint, only after we see real users running large distillation batches.
|
|
146
|
+
|
|
147
|
+
**Phase 3 (avoid unless triggered).** MCP-prompt-exposed skill format spec, only after we have ≥3 skill-authoring locations to consolidate.
|
|
148
|
+
|
|
149
|
+
## Architecture Decisions
|
|
150
|
+
|
|
151
|
+
**Preserve.** Our DB-as-checkpoint substrate, our use of Anthropic Skills as the SOP equivalent, our `additionalContext` injection on SessionStart, our distillation prompt's explicit reason-clause requirement.
|
|
152
|
+
|
|
153
|
+
**Adopt.** Explicit parameter declarations in skill frontmatter (Phase 1).
|
|
154
|
+
|
|
155
|
+
**Reject.** Strands Python package, `.sop/` artifact tree, generic SOP runtime abstraction, terminology adoption ("SOP" → user-facing).
|
|
156
|
+
|
|
157
|
+
## Key Takeaways
|
|
158
|
+
|
|
159
|
+
1. **We are already doing this.** Strands describes a class of patterns (markdown instructions, MCP prompts, parameterized invocation, sequential chaining, RFC-2119 vocabulary) that ClaudeMemory already arrived at independently. The existence of Strands is *validation*, not a roadmap.
|
|
160
|
+
2. **Anthropic Skills ≈ Strands SOPs.** Same idea, different label, different ecosystem. Don't refactor toward Strands; we'd just be renaming Skills.
|
|
161
|
+
3. **One narrow win.** Explicit parameter declarations in skill frontmatter cost ~30 minutes and make our skills self-documenting. Worth doing.
|
|
162
|
+
4. **One narrow defer.** Progress markers + checkpoint files in `/distill-transcripts` are real UX improvements *if* anyone runs distillation at scale; today nobody does. Revisit when the data says to.
|
|
163
|
+
5. **No deep architectural shifts.** Nothing in the article justifies a redesign of our distillation, storage, or prompting pipelines.
|
data/docs/quality_review.md
CHANGED
|
@@ -9,6 +9,51 @@
|
|
|
9
9
|
|
|
10
10
|
---
|
|
11
11
|
|
|
12
|
+
## Observational Layer — Pre-Merge Review (2026-06-18)
|
|
13
|
+
|
|
14
|
+
**Review Date:** 2026-06-18
|
|
15
|
+
**Previous Review:** 2026-04-28 (51 days ago)
|
|
16
|
+
**Scope:** the observational-layer branch (`claude/observational-layer-design-7662r9`, 22 commits ahead of `origin/main`, ~57 files). Two parallel expert-lens reviews — core/data layer (migrations 019/020, `Domain::Observation`, `SQLiteStore` observation methods, `Resolver`) and pipeline layer (`NullDistiller` extraction, renderer, `Reflector`, `ContextInjector`, MCP handlers, dashboard panel).
|
|
17
|
+
|
|
18
|
+
**Verdict:** No hard merge-blockers. The append-only/tombstone discipline is consistent, `Domain::Observation` is a clean immutable value object, border validation (`coerce_observation`) is textbook, and test coverage on this surface is above the repo average. The items below are latent correctness edge-cases + cleanups; none break the system as shipped (the layer is experimental and observations are project-scoped only today).
|
|
19
|
+
|
|
20
|
+
### High — address or consciously accept before merge
|
|
21
|
+
|
|
22
|
+
- **H1 · `consolidate_observations` read-modify-write race** — `lib/claude_memory/store/sqlite_store.rb:805-826` (Evans/Bernhardt). The source `SELECT` and `combined = sources.sum{…}` run *outside* the `@db.transaction` block, and the tombstone `UPDATE` (822) doesn't re-assert `status: "active"`. Two reflectors firing close together (PreCompact + SessionEnd) could double-count corroboration or re-tombstone an already-consolidated source. SQLite's single-writer lock narrows the window but doesn't close the gap. **Fix:** move the read inside the transaction and re-filter `status: "active"` on the update. Mechanical, ~1h incl. spec.
|
|
23
|
+
- **P1 · `noise_body?` over-broad — drops legit prose** — `lib/claude_memory/distill/null_distiller.rb:53,169` (Grimm/Bernhardt). `NOISE_BODY_SIGNATURE` matches `::`, `{}`, `=>` anywhere, so `"decided to adopt ClaudeMemory::Observation as the model"` is silently dropped — common in Ruby prose. Confirmed at runtime. **This is a precision-tuning *design* change to extraction behavior, not a mechanical fix** — per the project's data-driven-design convention it should be surveyed against real corpus data before retuning, not changed blind. Candidate: narrow to strong structural markers (`def `/`class `/`module `, JSON `","`/`":\s*"`, `$(`, `&&`, `||`), drop bare `::`/`{}`/`=>`, add a false-negative spec corpus.
|
|
24
|
+
- **P2 · cross-scope promote nudge — latent wrong-DB landmine** — `lib/claude_memory/hook/context_injector.rb:159-194` + `lib/claude_memory/mcp/handlers/management_handlers.rb:88` (Evans/Beck). `fetch_promotion_candidates` flat-maps project+global stores; the reflection block emits `[obs #<id>]` (a *per-DB* autoincrement id) with no scope; `promote_observation`/`consolidate_observations` default `scope: "project"`. A global-store candidate would route the promote call to the wrong DB. **Dead today** (nothing writes observations to the global DB), but a genuine landmine if global observations ever appear. **Fix:** restrict reflection candidates to the project store + document, or scope-tag the nudge line (mirror `emitted_facts_by_scope`). ~1-2h.
|
|
25
|
+
- **M1 · `consolidate_observations` has zero test coverage** — the most complex method in `sqlite_store.rb` is the only observation method with no specs (the `< 2 → nil` guard, summed-corroboration-tips-threshold, multi-row tombstone are all load-bearing and untested). Add specs alongside the H1 fix. ~1h.
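The P1 failure mode is easy to reproduce in isolation. The sketch below is illustrative only: the real `NOISE_BODY_SIGNATURE` is not quoted in this review, so `BROAD_NOISE` is an approximation built from the three tokens named above, and `NARROW_NOISE` is one possible reading of the candidate markers. Anchoring `def`/`class`/`module` to the start of a line is an added assumption, made so that prose such as "the class is immutable" is not flagged.

```ruby
# Illustrative patterns only; neither constant is copied from null_distiller.rb.

# Approximation of the over-broad signature: bare `::`, `{}` or `=>` anywhere.
BROAD_NOISE = /::|\{\}|=>/

# One reading of the narrowed candidate: strong structural markers only.
NARROW_NOISE = Regexp.union(
  /^\s*(?:def|class|module) /, # Ruby definitions at line start (assumption)
  /","/,                       # adjacent JSON string values
  /":\s*"/,                    # JSON key/value pair
  /\$\(/,                      # shell command substitution
  /&&/,
  /\|\|/
)

def noise?(body, signature)
  signature.match?(body)
end

prose = "decided to adopt ClaudeMemory::Observation as the model"
code  = "def consolidate(ids)\n  ids.any? && ids.size > 1\nend"
json  = '{"kind":"decision","priority": "1"}'

puts noise?(prose, BROAD_NOISE)  # the legit sentence is dropped
puts noise?(prose, NARROW_NOISE) # the legit sentence survives
puts noise?(code, NARROW_NOISE)
puts noise?(json, NARROW_NOISE)
```

This demonstrates the shape of the trade-off, not a recommended retune. The narrowed pattern will let some code-like bodies through that the broad one catches, which is the reason the item asks for a survey against real corpus data first.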
|
|
26
|
+
|
|
27
|
+
### Medium
|
|
28
|
+
|
|
29
|
+
- **M2/P7 · token-estimate `/4.0` heuristic duplicated 3×** — `sqlite_store.rb:714,819` + `dashboard/observations.rb:98` (Metz DRY). The compression-ratio correctness depends on both halves using the same divisor. Extract `Core::TokenEstimate.from_chars`/`.from_bytes`. ~1h.
|
|
30
|
+
- **M3/P10 · `recent_observations` `min_priority` name inverted** — `sqlite_store.rb:731` (Beck revealing-names). Filters `priority <= min_priority`, but priority is inverted (1=important), so a higher "minimum" returns *more* rows. Rename `max_priority_value`/`importance_floor`. ~30m + callers.
|
|
31
|
+
- **M4 · `Resolver#apply` `@return` rdoc stale** — `resolver.rb:29` omits the `:observations_created` and `:fact_ids` keys this PR adds. ~10m.
|
|
32
|
+
- **M5 · `fact_ids` array silently contains `nil`s** — `resolver.rb:45,61` (Grimm meaningful-returns). Comment promises positional alignment with `extraction.facts`, but `:discard` contributes `nil`; the sole consumer already `.compact.first`s. Pick one contract and document it (or compact at source). ~20m.
|
|
33
|
+
- **P3 · `clean_observation_body` is 6 chained gsubs, brittle** — `null_distiller.rb:178` (Bernhardt). Pure text logic buried as a private method; extract to a tested `Observe::BodyCleaner` with an input→output spec table. ~2h.
|
|
34
|
+
- **P4 · `extract_decisions`/`extract_observations` double-scan `DECISION_PATTERNS`** — `null_distiller.rb:105,138` (Metz DRY). Two full regex passes per chunk on the P95<5ms hot path; titles and bodies also diverge, complicating later corroboration. ~2-3h.
|
|
35
|
+
- **P5 · `consolidate_observations` reuses `coerce_observation(args)` on the whole tool-args hash** — `management_handlers.rb:151` (Grimm border). Couples the consolidation tool's param names to the observation schema and pulls in `kind`/`priority` defaults the caller may not intend. Pass a narrowed hash. ~1h.
|
|
36
|
+
- **P6 · dashboard N+1 across stores** — `dashboard/observations.rb:48-99` (Evans, bounded). 8-12 small aggregate queries per load; acceptable at store-count 2 but the `.where(status: "active")` predicate repeats ~6×. One `group_and_count(:status)` per store. ~2h.
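For M2/P7, the extraction could be as small as the sketch below. The `/4.0` divisor comes from the review; the rounding rule, the idea that `from_bytes` shares the same divisor, the `compression_ratio` helper, and the shortened `Core` namespace are assumptions made for illustration rather than the gem's actual code.

```ruby
# Hypothetical sketch of a single home for the chars-per-token heuristic.
module Core
  module TokenEstimate
    # The heuristic the review says is currently duplicated three times.
    CHARS_PER_TOKEN = 4.0

    module_function

    # Rounding up is an assumption; the existing call sites may truncate instead.
    def from_chars(char_count)
      (char_count / CHARS_PER_TOKEN).ceil
    end

    # Treats one byte as one character, which only holds for ASCII-heavy text.
    def from_bytes(byte_count)
      (byte_count / CHARS_PER_TOKEN).ceil
    end

    # Both halves go through the same estimator, so the ratio cannot drift
    # if the divisor is ever changed in one place.
    def compression_ratio(source_chars, summary_chars)
      summary_tokens = from_chars(summary_chars)
      return nil if summary_tokens.zero?

      from_chars(source_chars) / summary_tokens.to_f
    end
  end
end
```

Usage would be along the lines of `Core::TokenEstimate.from_chars(body.length)` at each of the three call sites, with the dashboard computing its ratio through `compression_ratio` rather than dividing two independently estimated numbers.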
|
|
37
|
+
|
|
38
|
+
### Low (fast-follow cleanups)
|
|
39
|
+
|
|
40
|
+
- **L1** `persist_observations` reaches into raw hashes (`obs[:body]`…) — coerce through `Domain::Observation` at the border; defaults are triplicated. `resolver.rb:81`. ~1-2h.
|
|
41
|
+
- **L2** `respond_to?(:observations)` guard is dead defensiveness — `Extraction` always defines it. `resolver.rb:82`. ~5m.
|
|
42
|
+
- **L3/P8** status strings (`"active"`/`"consolidated"`/`"expired"`) and `2`/`3` literals scattered — add `STATUSES`/reuse `PROMOTION_THRESHOLD`/`INFO` from `Domain::Observation`. ~30m.
|
|
43
|
+
- **L4** `increment_corroboration` returns void while sibling mutators return `updated > 0` — make symmetric. `sqlite_store.rb:772`. ~10m.
|
|
44
|
+
- **L5** migration index DDL uses raw `CREATE INDEX` rather than Sequel's `index` DSL — idiomatic-only. ~30m, optional.
|
|
45
|
+
- **L6** `consolidate_observations` doesn't thread `session_id` (synthesized rows get NULL) — document the intent or thread it. ~5m.
|
|
46
|
+
|
|
47
|
+
### What's done well
|
|
48
|
+
|
|
49
|
+
Append-only/tombstone discipline honored end-to-end with "row preserved, not deleted" specs; `Domain::Observation` immutable/frozen/self-validating with intention-revealing predicates; `corroborated?(threshold)` kept a total function (threshold injected, not hard-coded); resolver change genuinely additive (observations persist *inside* the extraction transaction, "no observations → fact behavior unchanged" tested); pure Sequel datasets throughout (no raw SQL except index DDL); promotion-gate tests pin the anti-hallucination invariant; `coerce_observation` border validation with `filter_map` drops invalids without aborting the batch; the Go-language case-sensitivity fix is clean and well-specced.
|
|
50
|
+
|
|
51
|
+
### Recommended pre-merge action
|
|
52
|
+
|
|
53
|
+
Fix the mechanical items — **H1** (read-inside-transaction), **P2** (defuse cross-scope), **M1** (the missing consolidation spec), **M4/M5** (document this PR's own new `apply` surface). Flag **P1** (regex retune) for a data-driven decision — do not change blind. Everything else is tracked here as fast-follow.
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
12
57
|
## Post-0.11 Investigation: Hallucination Rate Metric Calibration (2026-04-30)
|
|
13
58
|
|
|
14
59
|
When #48 (hallucination-rate metric) was first run against this project's real DB, it surfaced numbers that *looked* alarming:
|
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
{
|
|
2
|
+
"ok": true,
|
|
3
|
+
"checks_run": 10,
|
|
4
|
+
"counts": {
|
|
5
|
+
"error": 0,
|
|
6
|
+
"warn": 0,
|
|
7
|
+
"info": 2
|
|
8
|
+
},
|
|
9
|
+
"stats": {
|
|
10
|
+
"global": {
|
|
11
|
+
"active_facts": 4,
|
|
12
|
+
"predicate_counts": {
|
|
13
|
+
"convention": 4
|
|
14
|
+
}
|
|
15
|
+
},
|
|
16
|
+
"project": {
|
|
17
|
+
"active_facts": 27,
|
|
18
|
+
"predicate_counts": {
|
|
19
|
+
"convention": 20,
|
|
20
|
+
"uses_language": 3,
|
|
21
|
+
"uses_framework": 2,
|
|
22
|
+
"deployment_platform": 1,
|
|
23
|
+
"uses_database": 1
|
|
24
|
+
}
|
|
25
|
+
}
|
|
26
|
+
},
|
|
27
|
+
"findings": [
|
|
28
|
+
{
|
|
29
|
+
"id": "C007",
|
|
30
|
+
"severity": "info",
|
|
31
|
+
"title": "35% of decisions/conventions lack reason clauses (7/20)",
|
|
32
|
+
"detail": "Facts without 'because/so that/to avoid/...' lose their justification once context fades. Bare conclusions are dead weight when the team grows or you revisit a year later.",
|
|
33
|
+
"suggestion": "Inspect with: claude-memory explain <fact_id>. Reject low-value bare facts or rewrite with reason clauses via memory.store_extraction.",
|
|
34
|
+
"fact_ids": [
|
|
35
|
+
4,
|
|
36
|
+
6,
|
|
37
|
+
9,
|
|
38
|
+
13,
|
|
39
|
+
21,
|
|
40
|
+
23,
|
|
41
|
+
24
|
|
42
|
+
]
|
|
43
|
+
},
|
|
44
|
+
{
|
|
45
|
+
"id": "C009",
|
|
46
|
+
"severity": "info",
|
|
47
|
+
"title": "24 auto-memory file(s) not yet imported",
|
|
48
|
+
"detail": "~/.claude/projects/<slug>/memory/*.md files contain durable knowledge that isn't reachable via memory.recall until imported. AutoMemoryMirror only surfaces them transiently at SessionStart.",
|
|
49
|
+
"suggestion": "Preview: claude-memory import-auto-memory --dry-run. Import: claude-memory import-auto-memory.",
|
|
50
|
+
"fact_ids": []
|
|
51
|
+
}
|
|
52
|
+
]
|
|
53
|
+
}
|
|
@@ -0,0 +1,31 @@
|
|
|
1
|
+
{
|
|
2
|
+
"ok": true,
|
|
3
|
+
"checks_run": 10,
|
|
4
|
+
"counts": {
|
|
5
|
+
"error": 0,
|
|
6
|
+
"warn": 1,
|
|
7
|
+
"info": 0
|
|
8
|
+
},
|
|
9
|
+
"stats": {
|
|
10
|
+
"global": {
|
|
11
|
+
"active_facts": 4,
|
|
12
|
+
"predicate_counts": {
|
|
13
|
+
"convention": 4
|
|
14
|
+
}
|
|
15
|
+
},
|
|
16
|
+
"project": {
|
|
17
|
+
"active_facts": 0,
|
|
18
|
+
"predicate_counts": {}
|
|
19
|
+
}
|
|
20
|
+
},
|
|
21
|
+
"findings": [
|
|
22
|
+
{
|
|
23
|
+
"id": "C008",
|
|
24
|
+
"severity": "warn",
|
|
25
|
+
"title": "Only 0 active project fact(s)",
|
|
26
|
+
"detail": "A nearly-empty project DB suggests either a fresh install (ignore) OR a broken ingest pipeline / overzealous rejection. Verify hooks are firing: claude-memory doctor.",
|
|
27
|
+
"suggestion": "claude-memory doctor; claude-memory stats; check .claude/settings.json hook configuration.",
|
|
28
|
+
"fact_ids": []
|
|
29
|
+
}
|
|
30
|
+
]
|
|
31
|
+
}
|
|
@@ -0,0 +1,19 @@
|
|
|
1
|
+
{
|
|
2
|
+
"ok": true,
|
|
3
|
+
"checks_run": 10,
|
|
4
|
+
"counts": {
|
|
5
|
+
"error": 0,
|
|
6
|
+
"warn": 0,
|
|
7
|
+
"info": 0
|
|
8
|
+
},
|
|
9
|
+
"stats": {
|
|
10
|
+
"global": {
|
|
11
|
+
"active_facts": 4,
|
|
12
|
+
"predicate_counts": {
|
|
13
|
+
"convention": 4
|
|
14
|
+
}
|
|
15
|
+
},
|
|
16
|
+
"project": null
|
|
17
|
+
},
|
|
18
|
+
"findings": []
|
|
19
|
+
}
|
|
@@ -0,0 +1,60 @@
|
|
|
1
|
+
{
|
|
2
|
+
"ok": true,
|
|
3
|
+
"checks_run": 10,
|
|
4
|
+
"counts": {
|
|
5
|
+
"error": 0,
|
|
6
|
+
"warn": 1,
|
|
7
|
+
"info": 1
|
|
8
|
+
},
|
|
9
|
+
"stats": {
|
|
10
|
+
"global": {
|
|
11
|
+
"active_facts": 4,
|
|
12
|
+
"predicate_counts": {
|
|
13
|
+
"convention": 4
|
|
14
|
+
}
|
|
15
|
+
},
|
|
16
|
+
"project": {
|
|
17
|
+
"active_facts": 17,
|
|
18
|
+
"predicate_counts": {
|
|
19
|
+
"convention": 15,
|
|
20
|
+
"decision": 1,
|
|
21
|
+
"uses_framework": 1
|
|
22
|
+
}
|
|
23
|
+
}
|
|
24
|
+
},
|
|
25
|
+
"findings": [
|
|
26
|
+
{
|
|
27
|
+
"id": "C003",
|
|
28
|
+
"severity": "warn",
|
|
29
|
+
"title": "46 content items not yet deeply distilled",
|
|
30
|
+
"detail": "Backlog grows when SessionStart distillation prompts aren't acknowledged with memory.mark_distilled. A large backlog means the same text gets re-extracted across sessions, increasing hallucination rate.",
|
|
31
|
+
"suggestion": "Triage with /distill-transcripts (interactive) OR mark all distilled if you accept the backlog is noise: claude-memory sweep --mark-all-distilled",
|
|
32
|
+
"fact_ids": []
|
|
33
|
+
},
|
|
34
|
+
{
|
|
35
|
+
"id": "C007",
|
|
36
|
+
"severity": "info",
|
|
37
|
+
"title": "100% of decisions/conventions lack reason clauses (16/16)",
|
|
38
|
+
"detail": "Facts without 'because/so that/to avoid/...' lose their justification once context fades. Bare conclusions are dead weight when the team grows or you revisit a year later.",
|
|
39
|
+
"suggestion": "Inspect with: claude-memory explain <fact_id>. Reject low-value bare facts or rewrite with reason clauses via memory.store_extraction.",
|
|
40
|
+
"fact_ids": [
|
|
41
|
+
2,
|
|
42
|
+
3,
|
|
43
|
+
4,
|
|
44
|
+
5,
|
|
45
|
+
6,
|
|
46
|
+
7,
|
|
47
|
+
9,
|
|
48
|
+
10,
|
|
49
|
+
11,
|
|
50
|
+
12,
|
|
51
|
+
13,
|
|
52
|
+
14,
|
|
53
|
+
15,
|
|
54
|
+
16,
|
|
55
|
+
17,
|
|
56
|
+
18
|
|
57
|
+
]
|
|
58
|
+
}
|
|
59
|
+
]
|
|
60
|
+
}
|