RubyGems - claude_memory - Versions diffs - 0.11.0 → 0.12.0 - Mend

claude_memory 0.11.0 → 0.12.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (59)

checksums.yaml +4 -4
data/.claude/memory.sqlite3 +0 -0
data/.claude/rules/claude_memory.generated.md +42 -64
data/.claude/skills/release/SKILL.md +44 -6
data/.claude/skills/study-repo/SKILL.md +15 -0
data/.claude-plugin/commands/audit-memory.md +68 -0
data/.claude-plugin/marketplace.json +1 -1
data/.claude-plugin/plugin.json +1 -1
data/CHANGELOG.md +26 -0
data/CLAUDE.md +9 -2
data/README.md +29 -1
data/db/migrations/018_add_otel_telemetry.rb +81 -0
data/docs/1_0_punchlist.md +318 -66
data/docs/api_stability.md +341 -0
data/docs/audit_runbook.md +209 -0
data/docs/claude_monitoring.md +956 -0
data/docs/improvements.md +148 -9
data/docs/influence/ai-memory-systems-2026.md +403 -0
data/docs/memory_audit_2026-05-21.md +303 -0
data/docs/plugin.md +1 -1
data/lib/claude_memory/audit/checks.rb +239 -0
data/lib/claude_memory/audit/finding.rb +33 -0
data/lib/claude_memory/audit/runner.rb +73 -0
data/lib/claude_memory/commands/audit_command.rb +117 -0
data/lib/claude_memory/commands/dashboard_command.rb +2 -1
data/lib/claude_memory/commands/import_auto_memory_command.rb +180 -0
data/lib/claude_memory/commands/otel_command.rb +240 -0
data/lib/claude_memory/commands/registry.rb +4 -1
data/lib/claude_memory/configuration.rb +60 -0
data/lib/claude_memory/core/fact_query_builder.rb +1 -0
data/lib/claude_memory/dashboard/api.rb +8 -0
data/lib/claude_memory/dashboard/index.html +140 -1
data/lib/claude_memory/dashboard/prompt_journey.rb +48 -0
data/lib/claude_memory/dashboard/server.rb +86 -0
data/lib/claude_memory/dashboard/telemetry.rb +156 -0
data/lib/claude_memory/deprecations.rb +106 -0
data/lib/claude_memory/distill/reference_material_detector.rb +37 -4
data/lib/claude_memory/hook/auto_memory_mirror.rb +7 -3
data/lib/claude_memory/hook/context_injector.rb +11 -2
data/lib/claude_memory/mcp/tool_definitions.rb +3 -3
data/lib/claude_memory/otel/attributes.rb +118 -0
data/lib/claude_memory/otel/constants.rb +32 -0
data/lib/claude_memory/otel/ingestor.rb +54 -0
data/lib/claude_memory/otel/otlp_json_envelope.rb +254 -0
data/lib/claude_memory/otel/prompt_scope.rb +108 -0
data/lib/claude_memory/otel/settings_writer.rb +122 -0
data/lib/claude_memory/otel/status.rb +58 -0
data/lib/claude_memory/recall/staleness_annotator.rb +73 -0
data/lib/claude_memory/resolve/predicate_policy.rb +17 -1
data/lib/claude_memory/resolve/resolver.rb +30 -3
data/lib/claude_memory/shortcuts.rb +61 -18
data/lib/claude_memory/store/prompt_journey_query.rb +87 -0
data/lib/claude_memory/store/schema_manager.rb +1 -1
data/lib/claude_memory/store/sqlite_store.rb +136 -0
data/lib/claude_memory/sweep/maintenance.rb +31 -1
data/lib/claude_memory/sweep/sweeper.rb +6 -0
data/lib/claude_memory/version.rb +1 -1
data/lib/claude_memory.rb +18 -0
metadata +26 -1

data/db/migrations/018_add_otel_telemetry.rb ADDED

@@ -0,0 +1,81 @@
+# frozen_string_literal: true
+# Migration v18: OpenTelemetry ingestion tables.
+#
+# ClaudeMemory's dashboard accepts OTLP/HTTP/JSON exports from Claude Code so
+# users can see cost-per-API-call, token usage by model, latency, and per-prompt
+# event journeys without leaving the dashboard.
+#
+# Three storage tables:
+#   - otel_metrics: numeric data points (token counts, USD cost, durations).
+#     Two value columns (value_int + value_float) preserve int64 precision for
+#     counters like token counts that can exceed 2^53, the limit of Float's
+#     exact integer range (53-bit mantissa).
+#   - otel_events: log-style records (user_prompt, tool_result, api_request,
+#     skill_activated, ...). Indexed on prompt_id for the journey UNION.
+#   - otel_traces: spans. Table ships now so the schema is forward-ready, but
+#     the dashboard's POST /v1/traces returns 501 until the user opts in via
+#     `claude-memory otel --enable-traces`.
+#
+# Plus an additive prompt_id column on activity_events so existing hook
+# events (recall, hook_ingest, hook_context) can be UNION-joined into the
+# Prompt Journey panel.
+Sequel.migration do
+  up do
+    create_table?(:otel_metrics) do
+      primary_key :id
+      String :name, null: false
+      String :value_type, null: false
+      Bignum :value_int
+      Float :value_float
+      String :unit
+      String :attributes_json, text: true
+      String :resource_json, text: true
+      String :recorded_at, null: false
+    end
+    run "CREATE INDEX IF NOT EXISTS idx_otel_metrics_name_time ON otel_metrics(name, recorded_at)"
+    run "CREATE INDEX IF NOT EXISTS idx_otel_metrics_recorded_at ON otel_metrics(recorded_at)"
+    create_table?(:otel_events) do
+      primary_key :id
+      String :event_name, null: false
+      String :session_id
+      String :prompt_id
+      String :attributes_json, text: true
+      String :resource_json, text: true
+      String :occurred_at, null: false
+    end
+    run "CREATE INDEX IF NOT EXISTS idx_otel_events_name_time ON otel_events(event_name, occurred_at)"
+    run "CREATE INDEX IF NOT EXISTS idx_otel_events_session ON otel_events(session_id)"
+    run "CREATE INDEX IF NOT EXISTS idx_otel_events_prompt ON otel_events(prompt_id)"
+    create_table?(:otel_traces) do
+      primary_key :id
+      String :trace_id, null: false
+      String :span_id, null: false
+      String :parent_span_id
+      String :name, null: false
+      String :session_id
+      String :prompt_id
+      Bignum :start_unix_nano
+      Bignum :end_unix_nano
+      Integer :duration_ms
+      String :status_code
+      String :attributes_json, text: true
+      String :resource_json, text: true
+      String :recorded_at, null: false
+    end
+    run "CREATE INDEX IF NOT EXISTS idx_otel_traces_trace ON otel_traces(trace_id)"
+    run "CREATE INDEX IF NOT EXISTS idx_otel_traces_time ON otel_traces(recorded_at)"
+    alter_table(:activity_events) { add_column :prompt_id, String }
+    run "CREATE INDEX IF NOT EXISTS idx_activity_events_prompt ON activity_events(prompt_id)"
+  end
+  down do
+    run "DROP INDEX IF EXISTS idx_activity_events_prompt"
+    alter_table(:activity_events) { drop_column :prompt_id }
+    drop_table?(:otel_traces)
+    drop_table?(:otel_events)
+    drop_table?(:otel_metrics)
+  end
+end
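
The migration comment's rationale for two value columns can be checked in plain Ruby. This is a minimal sketch, not code from the gem: `split_value` and the `"int"` / `"float"` strings for `value_type` are hypothetical names chosen for illustration.

```ruby
# Hypothetical helper (not from the gem): choose the column a data point
# belongs in, mirroring the value_int / value_float split in the migration.
MAX_EXACT_FLOAT_INT = 2**53 # above this, Float cannot represent every integer

def split_value(value)
  case value
  when Integer
    { value_type: "int", value_int: value, value_float: nil }
  when Float
    { value_type: "float", value_int: nil, value_float: value }
  else
    raise ArgumentError, "unsupported metric value: #{value.inspect}"
  end
end

big = MAX_EXACT_FLOAT_INT + 1
lossy = big.to_f.to_i        # round-trip the counter through Float
row = split_value(big)

puts lossy == big            # false: the Float round-trip dropped the +1
puts row[:value_int] == big  # true: the integer column keeps it exactly
```

Storing integer counters in their own column avoids the silent rounding shown on the `lossy` line.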

data/docs/1_0_punchlist.md CHANGED

@@ -1,7 +1,9 @@
 # 1.0 Punchlist
 *Created: 2026-04-28. Restructured 2026-04-28 (post-0.10.0 release) around
-milestone versions per the path-to-1.0 plan.*
+milestone versions per the path-to-1.0 plan. Re-oriented 2026-05-27 to
+acknowledge OTel + audit-toolkit landings and re-anchor on the three
+1.0 pillars.*
 The remaining work for a stable 1.0 release. Distinct from `improvements.md` —
 that file tracks the long tail of inbound study/idea entries; this file tracks
@@ -25,6 +27,27 @@ So 1.0 isn't gated by features. It's gated by **the measurement infrastructure
 being trustworthy enough to defend a 1.0 claim.** That's why this punchlist is
 mostly observability, not capability.
+### The three 1.0 pillars
+Restated 2026-05-27 to ground prioritization decisions:
+1. **Stability** — semver-locked CLI / MCP / hook / Ruby API contracts, schema
+   round-trip discipline, deprecation policy. Anchored by `docs/api_stability.md`
+   (#11 ✅) and the round-trip-spec convention.
+2. **Visibility** — a skeptical user can see what memory costs, what memory
+   contains, what memory contributed, and what is wrong with it, on one screen,
+   in <30s, without trusting our marketing. Anchored by the Trust panel, the
+   digest, OTel ingestion, and the new `claude-memory audit` toolkit.
+3. **Long-horizon quality** — over weeks and months, the repo demonstrably
+   improves session quality rather than degrading it. Anchored by the harm
+   benchmark (#3, the actual release gate), the CLAUDE.md headline baseline
+   (#4), repeat-correction detection (#8), and the drift dashboard (#10).
+Every 0.12 item maps to one of those pillars; an item that doesn't map is a
+1.x feature, not a 1.0 gate. The audit toolkit and OTel landed during 0.12
+because they directly serve pillars 1 and 2 — not as scope creep, but as work
+the original punchlist didn't anticipate would be needed.
 Items are cross-linked to the canonical entry in `improvements.md` where the
 implementation detail and acceptance criteria live. This file is the
 prioritization view; that file is the work view.
@@ -159,17 +182,32 @@ Effort: ½d.
 ---
-## 0.12.0 — "Release Discipline" (~1 week of work)
+## 0.12.0 — "Release Discipline + Observability + Self-Audit" (~4 weeks of work)
-Theme: *we can't ship a regression without noticing.* Internal infrastructure
-that prevents future regressions. Not flashy but the actual prerequisite for
-1.0's semver commitment.
+Theme: *we can't ship a regression without noticing, and we can see what's
+happening inside.* Internal infrastructure that prevents future regressions,
+plus the observability primitives the 1.0 visibility pillar requires, plus
+the self-audit toolkit that catches drift in our own DB.
-### #3 Negative-fact harm benchmark (full 10-15 scenarios)
+*Restructured 2026-05-01: #11 (API stability audit) promoted from 1.0
+because the scoreboard #6 needs an explicit stable-surface list to gate
+against; new #12 (pre-release hook smoke gate) added to codify the
+verification convention that surfaced during 0.11 work.*
+*Restructured 2026-05-27: theme widened from "Release Discipline" to
+acknowledge two unplanned but on-mission work tracks that landed during the
+0.12 window — the OTel observability primitives (~15 commits) and the audit
+toolkit (#13). Both serve 1.0 pillars 1+2 directly and the punchlist now
+reflects that.*
+### #3 Negative-fact harm benchmark (full 10-15 scenarios) — **in progress 2026-05-27 (Path B blocker)**
 **Gap.** Every benchmark today measures whether memory **helps**. Nothing
 measures whether memory **harms** — i.e. injects a wrong fact and Claude
-follows it. Without this, "memory helps" is unfalsifiable.
+follows it. Without this, "memory helps" is unfalsifiable. This is the
+single 0.12 item that directly serves pillar 3 (long-horizon quality);
+shipping 0.12 without it would tag a release whose central claim is
+unmeasured.
 **Acceptance.** `spec/benchmarks/dataset/harm_scenarios.yml` with 10-15 cases
 spanning four harm classes (stale-tech, mismatched-scope, superseded-but-
@@ -184,26 +222,56 @@ in that regime.
 → improvements.md entry: *#49 Negative-Fact Harm Benchmark* (full corpus).
 Effort: 2d.
-### #4 Publish the CLAUDE.md baseline in headline E2E results
+### #4 Publish the CLAUDE.md baseline in headline E2E results — **DEFERRED to 0.13 (2026-05-29): harness limitation**
 **Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
-and is wired into `comparative_helper.rb`. The README's headline comparative
-table doesn't include it. The single most important question for adoption —
-*"is this better than a hand-written CLAUDE.md?"* — is unanswered in our
-published numbers.
+and is wired into `comparative_helper.rb`. The single most important question
+for adoption — *"is this better than a hand-written CLAUDE.md?"* — is
+unanswered in our published numbers.
+**What happened.** The first real-mode comparative run (2026-05-28) returned
+ClaudeMemory **0/10**, No-memory **0/10**, CLAUDE.md baseline **8/10** — and
+investigation showed this is a *harness artifact, not a verdict*. The CLAUDE.md
+adapter auto-loads every fact into context unconditionally; the ClaudeMemory
+adapter relies on Claude proactively calling `memory.recall` MCP tools, which
+`claude -p` headless mode doesn't do for these prompts (and the SessionStart
+context hook injects only a generic top-5, not the specific fact each
+LongMemEval-style scenario needs). So ClaudeMemory's retrieval path is never
+exercised and it ties no-memory at 0. Publishing 0% vs 80% would actively
+mislead and violate the visibility pillar's honest-numbers standard.
+**Decision (2026-05-29).** Defer #4 to 0.13. It was never a release blocker
+(the harm gate was, and it's green at 0/13). 0.12 ships without comparative
+numbers; the README + benchmark README document the limitation honestly.
+**0.13 acceptance.** Fix the harness so it fairly exercises ClaudeMemory's
+retrieval — either (a) force memory-tool use (allowedTools + a recall-
+encouraging system turn), or (b) inject the full fact set via the context
+hook to match CLAUDE.md's "everything in context" model — then re-run and
+publish the real win/loss.
-**Acceptance.** Comparative E2E report includes `CLAUDE.md baseline` row in
-`spec/benchmarks/README.md` and in `bin/run-evals --comparative` summary.
-README explicitly states the win/loss versus the static baseline.
+→ improvements.md entry: *#50 CLAUDE.md Baseline in Headline Results*.
+Effort: harness fix ~1d + one real-mode run.
-**Why this release.** Cheapest item on the list — adapter built, just
-surface the number. Pairs with #6 because it materializes once the
-scoreboard infrastructure is there.
+### #16 Headless retrieval gap — *new observation 2026-05-29, investigate for 0.13*
-→ improvements.md entry: *#50 CLAUDE.md Baseline in Headline Results*.
-Effort: 30min code + one $2-8 real-mode run.
+**Observation.** The #4 comparative run surfaced a genuine (separable) product
+concern: in fully headless, non-interactive `claude -p` usage with no
+tool-forcing, Claude does **not** proactively call ClaudeMemory's `memory.recall`
+MCP tools, so memory's contribution rides entirely on what the SessionStart
+context hook injects (a generic top-5 decisions/conventions/architecture). For
+*interactive* sessions — where Claude readily calls MCP tools — this isn't an
+issue, and it's the primary use case. But the gap is real and worth measuring:
+does the context-hook top-5 cover enough, or should headless usage get a richer
+injection (or a recall-on-demand affordance)?
+**Why not 0.12.** This is investigation, not a known fix, and it's orthogonal
+to the 0.12 visibility/stability theme. Pair it with the #4 harness fix in 0.13
+since both touch the same headless-retrieval seam.
+→ No improvements.md entry yet; originates from the 2026-05-28 comparative run.
-### #6 Release-to-release benchmark scoreboard
+### #6 Release-to-release benchmark scoreboard ✅ landed 2026-05-01
 **Gap.** Benchmark output is textual today. Nothing diff-able across versions.
 Regressions land silently — the only reason we caught the BM25 normalization
@@ -217,9 +285,194 @@ reads it and refuses to ship on regressions over threshold.
 **Why this release.** The semver commitment in 1.0 *requires* this — we
 can't promise non-regression without the infrastructure to detect it.
+**Status.** Landed 2026-05-01. `bin/run-evals` writes
+`spec/benchmarks/results/<version>.json` with diff-friendly pass-rate
+metrics by category and per-scenario. `bin/bench-diff` compares against
+the most recent prior tagged version's scoreboard via `Gem::Version`
+ordering, flags pass-rate drops > threshold (default 5%), supports
+`--threshold` / `--baseline` / `--json` / `--strict`. 11 unit specs
+covering missing-baseline, threshold tuning, deep-nested metric paths,
+JSON output. Wired into `/release` skill as new Phase 1 Step 7 (after
+smoke gate, before lint). First release with the gate is 0.12.0 itself
+— prior versions have no scoreboard, so bench-diff exits 0 with a "no
+baseline" note; from 0.13 onward it actively gates.
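
The baseline selection and threshold check described in the Status paragraph can be sketched as follows. This is an illustration under assumptions, not the `bin/bench-diff` source: the in-memory `boards` hash stands in for the `spec/benchmarks/results/<version>.json` files, the method names are hypothetical, and the 5% threshold is read here as an absolute pass-rate drop of 0.05.

```ruby
require "rubygems" # Gem::Version ships with Ruby

# Pick the most recent version strictly older than `current`.
# Returns nil when no prior scoreboard exists (the "no baseline" case).
def prior_baseline(scoreboards, current)
  cur = Gem::Version.new(current)
  best = scoreboards.keys.map { |v| Gem::Version.new(v) }.select { |v| v < cur }.max
  best && best.to_s
end

# Report metrics whose pass rate dropped by more than `threshold`.
def regressions(baseline, current, threshold: 0.05)
  baseline.each_with_object({}) do |(metric, old_rate), drops|
    new_rate = current[metric]
    next if new_rate.nil?

    drop = old_rate - new_rate
    drops[metric] = drop.round(4) if drop > threshold
  end
end

# Hypothetical scoreboards, for illustration only.
boards = {
  "0.9.1" => { "recall" => 0.90 },
  "0.10.0" => { "recall" => 0.92, "harm" => 1.0 },
  "0.12.0" => { "recall" => 0.80, "harm" => 1.0 }
}

base = prior_baseline(boards, "0.12.0")
puts base                                            # 0.10.0, not 0.9.1
puts regressions(boards[base], boards["0.12.0"]).keys.inspect # only "recall" is flagged
```

`Gem::Version` matters here because a plain string sort would rank `0.9.1` above `0.10.0` and select the wrong baseline.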
 → improvements.md entry: *#52 Benchmark Scoreboard Diff*. Effort: 1d.
-**Ship target:** ~4 weeks from 0.10.0 (end of May 2026).
+### #11 API stability audit — *promoted from 1.0 (2026-05-01)* ✅ landed 2026-05-01
+**Gap.** "1.0 commits to semver" is meaningless without an explicit
+public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0 / 0.11.0
+(MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints,
+`detail_json` field set) have evolved organically and aren't formally
+documented as stable vs. internal.
+**Acceptance.**
+- New `docs/api_stability.md` enumerating:
+  - **Public CLI**: every `claude-memory <subcommand>` and its flags, with stability tier
+  - **Public MCP tools**: every tool's schema, return shape, and tool-annotation hints
+  - **Public hook contract**: payload fields, return shapes, exit codes, `detail_json` field set per event_type
+  - **Public Ruby API**: `Recall`, `Configuration`, `Store::StoreManager`, `Domain::*` vs. internal-only
+  - **Schema**: stability of column names, table names, predicate vocabulary
+- Deprecation policy paragraph: "we'll mark X deprecated in N.x.0 (with a runtime warning), keep it functional for ≥1 minor cycle, and remove no earlier than (N+1).0.0"
+- `ClaudeMemory::Deprecations.warn(name:, replacement:, removed_in:)` module wired up and used at least once so the mechanism is exercised
+- README + CLAUDE.md link to the new doc as the authoritative source
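
The `warn(name:, replacement:, removed_in:)` signature named in the acceptance list could look roughly like this. It is a hedged sketch: the real `ClaudeMemory::Deprecations` module may differ, and `DeprecationSketch`, the warn-once behavior, the message wording, and the `old_flag` / `new_flag` names are all assumptions made for illustration.

```ruby
require "stringio"

# Hypothetical stand-in for a deprecation-warning module.
module DeprecationSketch
  @seen = {}

  class << self
    # Where warnings go; defaults to $stderr when unset.
    attr_accessor :output

    # Emit one warning per deprecated name, however often it is called.
    # Returns true when a warning was written, false when it was suppressed.
    def warn(name:, replacement:, removed_in:)
      return false if @seen[name]

      @seen[name] = true
      (output || $stderr).puts(
        "[DEPRECATION] #{name} is deprecated; use #{replacement} instead. " \
        "It will be removed no earlier than #{removed_in}."
      )
      true
    end

    # Forget which names have already warned (useful in specs).
    def reset!
      @seen = {}
    end
  end
end

DeprecationSketch.output = StringIO.new
2.times do
  DeprecationSketch.warn(name: "old_flag", replacement: "new_flag", removed_in: "1.0.0")
end
puts DeprecationSketch.output.string.lines.size # 1: the repeat call was suppressed
```

Warning once per name keeps hook output readable when a deprecated surface is hit on every session.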
+**Why this release.** #6's scoreboard needs to know what surfaces are stable
+to gate against. Without #11, any "regression" finding is arguable. The
+deprecation-warning module is also a prerequisite for any soft-rename work
+during the 0.12 → 1.0 soak.
+→ improvements.md entry: *#59 API Stability Audit*. Effort: 2d.
+### #12 Pre-release hook smoke gate — *new this release (2026-05-01)* ✅ landed 2026-05-01
+**Gap.** During 0.11 work, five commits landed for #47 token-budget telemetry
+with 156 specs green. 24 hours of real SessionStart hook events recorded no
+`context_tokens` field — because the *installed* gem was still 0.9.1 and the
+`.claude/settings.json` hooks invoke the installed binary via PATH, not the
+working tree. The bug wasn't in the code; the bug was in the release process.
+This trap has been hit twice now (#47 in 0.11; an earlier ActivityLog
+incident on 2026-04-16). It's documented in
+`~/.claude/projects/.../memory/feedback_hooks_run_installed_gem.md` and as
+two project conventions, but documentation hasn't stopped me (Claude) from
+springing the trap again.
+**Acceptance.**
+- New `bin/pre-release-smoke` script: `rake install` → trigger each hook
+  with a synthetic payload → inspect `activity_events.detail_json` via
+  `sqlite3 json_extract` for expected fields per the current version → exit
+  non-zero if anything is null.
+- Per-version expectation manifest at `spec/smoke/expected_fields.yml`
+  declares `{event_type, fields, since_version}` so new fields just need a
+  YAML append; no script changes per release.
+- `/release` skill Phase 1 runs the smoke gate after specs and before lint.
+  Failure aborts before `git push`.
+- Test: `spec/smoke/pre_release_smoke_spec.rb` validates the manifest schema
+  and that the exit-code logic correctly flips on simulated null fields.
+**Why this release.** Release Discipline that doesn't catch the trap I've
+already hit twice isn't real discipline. Pairs with #6 — the scoreboard
+catches regressions in measurement; the smoke gate catches the regression
+where the measurement itself doesn't fire.
+→ improvements.md entry: *#63 Pre-Release Hook Smoke Gate*. Effort: ½d.
+### #13 Memory health audit toolkit — *unplanned, landed 2026-05-27* ✅
+**Gap.** Drift inside the project DB — duplicate global conventions,
+single-cardinality multiplicity, contamination-driven rejection churn, bare
+conclusions, shortcut tools leaking the wrong predicate — was diagnosable
+only by hand, project by project. The 2026-05-21 audit surfaced 103 rejected
+single-cardinality facts in this project's own DB, all sourced from example
+text in our own docs being re-ingested. Without a productionized check, this
+class of regression silently erodes the 1.0 visibility claim.
+**Acceptance.**
+- `claude-memory audit` CLI with ten contract checks (C001-C010), `--json`
+  for CI, `--severity`, `--no-exit`
+- `/audit-memory` slash command for interactive walkthrough
+- `docs/audit_runbook.md` per-check rationale + remediation
+- `ReferenceMaterialDetector` example-quote guard + `Resolver` `:discard`
+  path (defense-in-depth at write time)
+- Memory shortcuts (`memory.decisions`/`.conventions`/`.architecture`)
+  switched from FTS text search to predicate-based filtering
+- `claude-memory import-auto-memory` retroactively pulls auto-memory entries
+  `AutoMemoryMirror` missed (slug bug fixed: `tr("/_", "-")`)
+- Signal-health benchmark spec (`spec/benchmarks/health/database_signal_spec.rb`)
+  codifies the cleanup contracts so regressions can be detected in CI
+**Why this release.** Serves pillars 1 (stability — guards single-cardinality
+predicates from drifting) and 2 (visibility — surfaces drift as a measurable
+signal). The detector + resolver fixes mean the 0.12 → 1.0 soak is more
+likely to surface real signal vs. doc-text contamination noise.
+→ improvements.md entry: not yet promoted; lives in `docs/memory_audit_2026-05-21.md`
+as the originating artifact. Effort: ~2d (across the 2026-05-27 session).
+### #14 OpenTelemetry ingestion + Dashboard Telemetry/Prompt Journey — *unplanned, landed 2026-05-21* ✅
+**Gap.** The visibility pillar promised "you can see what memory costs and
+what it's doing." Token-budget telemetry (#1) covered the cost; the rest —
+per-tool latency, cost-per-hour, the full prompt-to-response journey across
+hooks/MCP/distillation — was invisible without an external tracer. Claude
+Code already exports OTLP if asked; the question was whether ClaudeMemory
+should ingest its own telemetry rather than punting to Datadog/Honeycomb.
+**Acceptance.**
+- Schema v18: `otel_metrics`, `otel_events`, `otel_traces` + `prompt_id`
+  on `activity_events` for journey correlation
+- `claude-memory otel` CLI manages the env block (`--enable`, `--disable`,
+  `--enable-traces`, `--capture-prompts`, `--status`, `--verify`, `--backfill`)
+- Dashboard exposes `/v1/metrics`, `/v1/logs`, `/v1/traces` on
+  `127.0.0.1:3377` (OTLP/HTTP/JSON) plus a new "Telemetry" drawer
+- Prompt Journey panel UNIONs `otel_events` with `activity_events` and
+  back-tags activity_events with `prompt.id` via `OTel::PromptScope`
+- Sweep retention: 30d metrics, 14d events, 7d traces
+- Privacy posture: opt-in for prompt capture; traces 501-gated until
+  explicit `--enable-traces`
+**Why this release.** Directly serves pillar 2 (visibility) at a depth
+nothing else can — no dashboard polish substitutes for actual per-prompt
+trace data. Loud answer to "what is this thing doing right now?"
+→ improvements.md entry: tracked under the OTel research → study line.
+Effort: ~2.5w (Apr 26 → May 21).
+### #15 Staleness guard for single-value facts — *born from the #3 harm run, landed 2026-05-28* ✅
+**Gap.** The first full-corpus real-mode harm run (#3) surfaced a 15.4%
+harm rate (2 of 13 scenarios). One was a false positive in the test
+pattern (fixed in the corpus); the other was a **real harm**: Claude
+emitted `git push heroku HEAD:main` from a stale `deployment_platform`
+fact with no hedge.
+Single-value predicates are exclusive claims Claude follows
+authoritatively — and ClaudeMemory had no defense against a stale one
+when no superseding fact exists (supersession only fires if the
+migration was recorded). This is a direct pillar-3 (long-horizon
+quality) hole: over months, single-value facts go stale and silently
+make Claude wrong.
+**Acceptance.**
+- `Recall::StalenessAnnotator` pure function: flags single-value facts
+  (uses_database / deployment_platform / auth_method) that are old
+  (valid_from/created_at older than threshold) AND not recently
+  confirmed (last_recalled_at null/stale)
+- `Hook::ContextInjector` appends a "⚠ stale … verify before relying"
+  marker at SessionStart; multi-value predicates never annotated
+- `Configuration#injection_stale_days` (default 180, env override),
+  distinct from the 14-day dashboard review window
+- Re-run of #3 (scaffolded + best-of-N) confirms the gate is green
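
The staleness rule in the acceptance list (single-value predicate, old, and not recently confirmed) can be sketched as a pure function. This is not the gem's `Recall::StalenessAnnotator`: the fact Hash shape, the `stale?` name, and reusing the same day threshold for "recently confirmed" are assumptions.

```ruby
require "time"

SINGLE_VALUE_PREDICATES = %w[uses_database deployment_platform auth_method].freeze
SECONDS_PER_DAY = 86_400

# Hypothetical fact shape: a Hash with :predicate, :valid_from, :created_at,
# and :last_recalled_at (ISO 8601 strings or nil).
def stale?(fact, now:, stale_days: 180)
  # Multi-value predicates are never annotated.
  return false unless SINGLE_VALUE_PREDICATES.include?(fact[:predicate])

  cutoff = now - (stale_days * SECONDS_PER_DAY)
  asserted = fact[:valid_from] || fact[:created_at]
  return false if asserted.nil? || Time.parse(asserted) > cutoff

  # Old fact: stale only if it was never recalled or last recalled long ago.
  recalled = fact[:last_recalled_at]
  recalled.nil? || Time.parse(recalled) <= cutoff
end

now = Time.utc(2026, 5, 28)
heroku = { predicate: "deployment_platform", valid_from: "2025-01-01T00:00:00Z" }
db = { predicate: "uses_database", valid_from: "2025-01-01T00:00:00Z",
       last_recalled_at: "2026-05-01T00:00:00Z" }
convention = { predicate: "convention", created_at: "2024-01-01T00:00:00Z" }

puts stale?(heroku, now: now)     # true: old and never recalled
puts stale?(db, now: now)         # false: old but recalled recently
puts stale?(convention, now: now) # false: not a single-value predicate
```

Passing `now:` explicitly keeps the function deterministic, which suits the "pure function" framing in the list above.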
+**Why this release.** It's the concrete payoff of building the harm
+benchmark before 1.0: the benchmark didn't just report a number, it
+forced a real defensive feature that makes the long-horizon-quality
+claim defensible. Shipping #3 without #15 would have meant tagging a
+release whose own gate said "memory makes Claude wrong 1-in-13 times."
+**Harness hardening (same investigation).** The first full-corpus run
+also exposed two confounds that made the gate unverifiable: scenarios
+ran in an empty tmpdir (Claude often refused for lack of project
+context, not because it resisted the bad fact) and single-shot scoring
+was noisy (the harmed *set* changed run-to-run). Fixed by (a) shipping a
+`project_files` scaffold per scenario whose current state contradicts
+the wrong memory fact — making each case a real "memory vs reality"
+test — and (b) best-of-N majority scoring (HARM_BENCH_RUNS, default 3).
+Without this, #15's effect couldn't be measured cleanly.
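
Best-of-N majority scoring, as described in (b) above, can be sketched like this. The result shape, the strict-majority rule, and the `scenario_b` / `scenario_c` ids are assumptions for illustration; only `harm_stale_deployment_heroku` is an id named in the punchlist.

```ruby
# Hypothetical result shape: scenario id => per-run booleans
# (true = Claude followed the wrong fact in that run).
def majority_harmed?(runs)
  runs.count(true) > runs.size / 2.0
end

# Fraction of scenarios harmed in a strict majority of their runs.
def harm_rate(results)
  harmed = results.count { |_id, runs| majority_harmed?(runs) }
  harmed.to_f / results.size
end

results = {
  "harm_stale_deployment_heroku" => [true, false, false],
  "scenario_b" => [false, false, false],
  "scenario_c" => [true, true, false]
}

harmed_ids = results.select { |_id, runs| majority_harmed?(runs) }.keys
puts harmed_ids.inspect        # only scenario_c is harmed in 2 of 3 runs
puts harm_rate(results).round(3)
```

A single noisy run no longer flips a scenario's verdict, which is what stabilizes the harmed set from run to run.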
+→ improvements.md entry: not yet promoted; originates from the
+`spec/benchmarks/dataset/harm_scenarios.yml` `harm_stale_deployment_heroku`
+finding. Effort: ~½d (2026-05-28 session).
+**Ship target:** ready to tag (2026-05-29). #3 harm gate is green at 0/13
+(best-of-3) after #15; #4 deferred to 0.13 (harness limitation, never a
+blocker); everything else in 0.12 has shipped. 0.12 tags now; soak window
+2-3 weeks before 1.0.
 ---
@@ -275,34 +528,7 @@ dashboard. Answers "is my fact base going off?" without a manual audit.
 → improvements.md entry: *#56 Drift Dashboard*. Effort: 1.5d.
-### #11 API stability audit (NEW — added 2026-04-28)
-**Gap.** "1.0 commits to semver" is meaningless without an explicit
-public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0
-(MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints)
-have evolved organically and aren't formally documented as stable vs.
-internal.
-**Acceptance.**
-- New `docs/api_stability.md` enumerating:
-  - **Public CLI**: every `claude-memory <subcommand>` and its flags, with stability tier
-  - **Public MCP tools**: every tool's schema, return shape, and tool-annotation hints
-  - **Public hook contract**: payload fields, return shapes, exit codes
-  - **Public Ruby API**: which classes/modules under `lib/claude_memory/` are external-facing (`Recall`, `Configuration`, `Store::StoreManager`?) vs. internal-only
-  - **Schema**: stability of column names, table names, predicate vocabulary
-- A deprecation policy: "we'll mark X deprecated in N.x.0 and remove no earlier than (N+1).0.0"
-- README + CLAUDE.md link to the new doc as the authoritative source
-**Why this release.** Without this, the 1.0 semver promise is vibes, not a
-contract. Future regressions in non-listed areas can be argued away; future
-regressions in listed areas are bugs. Forces us to be honest about what
-we're committing to.
-→ improvements.md entry: *#59 API Stability Audit* (added 2026-04-28; renumbered
-from #57 after rebase brought in Mercury-article entries #57/#58). Effort:
-2d including the doc + deprecation-warning instrumentation for any
-soon-to-be-removed surface.
+*(#11 API stability audit moved to 0.12 on 2026-05-01 — see above.)*
 ### Release framing
@@ -330,11 +556,26 @@ README + CHANGELOG framing for 1.0 explicitly states:
 ## Risk to flag now
-The biggest hidden risk in this plan is **the harm benchmark (#3) finds
-something.** If 10-15 scenarios with intentionally wrong facts produce >1%
-harm rate, that's a fundamental retrieval-discipline issue that could push
-1.0 by months. The 3-scenario prototype in 0.11 (above) is specifically
-designed to surface this risk earlier.
+The biggest hidden risk in this plan was that **the harm benchmark (#3)
+would find something.** The 3-scenario prototype in 0.11 (above) was specifically
+designed to surface this risk earlier — and **on 2026-04-30 the real-mode
+prototype reported 0/3 harm**, green-lighting the full corpus expansion.
+Risk is materially reduced; the 10-15-case corpus may still surface
+something the 3-case sample missed, but a fundamental retrieval-discipline
+issue is now unlikely.
+Remaining risk for 0.12: **#11 API stability audit reveals the surface is
+larger or messier than we thought**, pushing the doc work past the 2-day
+estimate. Mitigation: scope `Public Ruby API` aggressively to "internal
+unless proven otherwise" — easier to promote later than demote. *Update
+2026-05-27: #11 landed on time on 2026-05-01; this risk did not materialize.*
+Remaining risk for 0.12, take 2 (added 2026-05-27 in light of Path B):
+**the full 13-scenario harm corpus surfaces a >1% harm rate** that the
+3-scenario prototype masked. Mitigation paths if it happens: classify the
+harming class, ship a guard (the way #13 added `ReferenceMaterialDetector`
+example-quote guard for the contamination class), re-run. Worst case
+extends 0.12 by ~3-5 days; doesn't push 1.0 if the soak window has slack.
 ---
@@ -352,20 +593,31 @@ Based on actual release cadence Mar-Apr 2026:
 Average ~2 weeks per minor with substantial work landing each cycle.
-| Milestone | Estimated work | Calendar target |
-|---|---|---|
-| 0.10.x patches | reactive | as-needed |
-| 0.11.0 | ~1 week | ~2026-05-12 |
-| 0.12.0 | ~1 week | ~2026-05-26 |
-| Soak | 2-3 weeks | through ~2026-06-16 |
-| 1.0.0 | 1-2 days release prep + #11 | ~2026-06-16 to 2026-06-23 |
+| Milestone | Estimated work | Calendar target | Status |
+|---|---|---|---|
+| 0.11.0 | ~1 week | ~2026-05-12 | ✅ shipped 2026-04-30 |
+| 0.11.x patches | reactive | as-needed | open |
+| 0.12.0 (originally planned) | ~1.5 weeks | ~2026-06-02 | superseded — actual scope widened (see 2026-05-27 restructure) |
+| 0.12.0 (actual) | ~4 weeks (#6/#11/#12 + OTel + audit toolkit + Path B #3/#4) | tag ~2026-06-03 | 5 of 7 items shipped; #3 + #4 in progress |
+| Soak | 2-3 weeks | through ~2026-06-24 | future |
+| 1.0.0 | 1-2 days release prep | ~2026-06-24 to 2026-07-01 | future |
+*0.12 grew from ~1 week to ~1.5 weeks after 2026-05-01 restructure
+(promoted #11 + added #12), then widened again to ~4 weeks after the
+2026-05-27 restructure that absorbed the OTel observability work and the
+audit toolkit. 1.0 calendar shifted ~3 weeks later in total but the soak
+window remains 2-3 weeks — the visibility/stability surface 0.12 now ships
+is materially larger than the original "Release Discipline" scope.*
 These are calendar estimates assuming roughly the same focus level as the
 0.10.0 cycle. Real cadence will adjust based on what surfaces during soak.
 ---
-*Last updated: 2026-04-28 (post-0.10.0). Restructured around milestone
-versions per the path-to-1.0 plan. #7 moved up from post-1.0 to 0.11; #11
-API stability audit added as a new 1.0 must-have; 3-scenario harm prototype
-added to 0.11 as risk-de-risking work for the full 0.12 benchmark.*
+*Last updated: 2026-05-27 (mid-0.12 cycle). 0.11.0 shipped 2026-04-30 with
+all 5 punchlist items + harm prototype reporting 0/3 harm. 0.12 restructured
+2026-05-01 (promoted #11, added #12) and again 2026-05-27 (absorbed OTel
+#14 + audit toolkit #13, re-anchored on the three 1.0 pillars, committed
+to Path B finishing #3 + #4 before tag). 0.12 grew ~1.5w → ~4w; 1.0 ship
+target shifted ~3w later in return. Soak window held at 2-3w because the
+visibility surface in 0.12 is materially larger than originally scoped.*