claude_memory 0.10.0 → 0.12.0
- checksums.yaml +4 -4
- data/.claude/memory.sqlite3 +0 -0
- data/.claude/rules/claude_memory.generated.md +42 -64
- data/.claude/skills/release/SKILL.md +44 -6
- data/.claude/skills/study-repo/SKILL.md +15 -0
- data/.claude-plugin/commands/audit-memory.md +68 -0
- data/.claude-plugin/marketplace.json +1 -1
- data/.claude-plugin/plugin.json +1 -1
- data/CHANGELOG.md +70 -0
- data/CLAUDE.md +20 -5
- data/README.md +64 -2
- data/db/migrations/018_add_otel_telemetry.rb +81 -0
- data/docs/1_0_punchlist.md +522 -89
- data/docs/GETTING_STARTED.md +3 -1
- data/docs/api_stability.md +341 -0
- data/docs/architecture.md +3 -3
- data/docs/audit_runbook.md +209 -0
- data/docs/claude_monitoring.md +956 -0
- data/docs/dashboard.md +23 -3
- data/docs/improvements.md +329 -5
- data/docs/influence/ai-memory-systems-2026.md +403 -0
- data/docs/memory_audit_2026-05-21.md +303 -0
- data/docs/plugin.md +1 -1
- data/docs/quality_review.md +35 -0
- data/lib/claude_memory/audit/checks.rb +239 -0
- data/lib/claude_memory/audit/finding.rb +33 -0
- data/lib/claude_memory/audit/runner.rb +73 -0
- data/lib/claude_memory/commands/audit_command.rb +117 -0
- data/lib/claude_memory/commands/dashboard_command.rb +2 -1
- data/lib/claude_memory/commands/digest_command.rb +95 -3
- data/lib/claude_memory/commands/hook_command.rb +27 -2
- data/lib/claude_memory/commands/import_auto_memory_command.rb +180 -0
- data/lib/claude_memory/commands/initializers/hooks_configurator.rb +7 -4
- data/lib/claude_memory/commands/otel_command.rb +240 -0
- data/lib/claude_memory/commands/registry.rb +5 -1
- data/lib/claude_memory/commands/show_command.rb +90 -0
- data/lib/claude_memory/commands/stats_command.rb +94 -2
- data/lib/claude_memory/configuration.rb +60 -0
- data/lib/claude_memory/core/fact_query_builder.rb +1 -0
- data/lib/claude_memory/dashboard/api.rb +8 -0
- data/lib/claude_memory/dashboard/index.html +140 -1
- data/lib/claude_memory/dashboard/prompt_journey.rb +48 -0
- data/lib/claude_memory/dashboard/server.rb +86 -0
- data/lib/claude_memory/dashboard/telemetry.rb +156 -0
- data/lib/claude_memory/dashboard/trust.rb +180 -11
- data/lib/claude_memory/deprecations.rb +106 -0
- data/lib/claude_memory/distill/bare_conclusion_detector.rb +71 -0
- data/lib/claude_memory/distill/reference_material_detector.rb +37 -4
- data/lib/claude_memory/hook/auto_memory_mirror.rb +7 -3
- data/lib/claude_memory/hook/context_injector.rb +11 -2
- data/lib/claude_memory/hook/handler.rb +142 -1
- data/lib/claude_memory/mcp/tool_definitions.rb +3 -3
- data/lib/claude_memory/otel/attributes.rb +118 -0
- data/lib/claude_memory/otel/constants.rb +32 -0
- data/lib/claude_memory/otel/ingestor.rb +54 -0
- data/lib/claude_memory/otel/otlp_json_envelope.rb +254 -0
- data/lib/claude_memory/otel/prompt_scope.rb +108 -0
- data/lib/claude_memory/otel/settings_writer.rb +122 -0
- data/lib/claude_memory/otel/status.rb +58 -0
- data/lib/claude_memory/recall/staleness_annotator.rb +73 -0
- data/lib/claude_memory/resolve/predicate_policy.rb +17 -1
- data/lib/claude_memory/resolve/resolver.rb +30 -3
- data/lib/claude_memory/shortcuts.rb +61 -18
- data/lib/claude_memory/store/prompt_journey_query.rb +87 -0
- data/lib/claude_memory/store/schema_manager.rb +1 -1
- data/lib/claude_memory/store/sqlite_store.rb +136 -0
- data/lib/claude_memory/sweep/maintenance.rb +31 -1
- data/lib/claude_memory/sweep/sweeper.rb +6 -0
- data/lib/claude_memory/templates/hooks.example.json +5 -0
- data/lib/claude_memory/version.rb +1 -1
- data/lib/claude_memory.rb +20 -0
- metadata +28 -1
data/docs/1_0_punchlist.md
CHANGED
@@ -1,10 +1,13 @@
 # 1.0 Punchlist
 
-*Created: 2026-04-28
+*Created: 2026-04-28. Restructured 2026-04-28 (post-0.10.0 release) around
+milestone versions per the path-to-1.0 plan. Re-oriented 2026-05-27 to
+acknowledge OTel + audit-toolkit landings and re-anchor on the three
+1.0 pillars.*
 
 The remaining work for a stable 1.0 release. Distinct from `improvements.md` —
 that file tracks the long tail of inbound study/idea entries; this file tracks
-**what blocks 1.0 confidence**.
+**what blocks 1.0 confidence and which release each item ships in**.
 
 Guiding question: *a skeptical Ruby developer should be able to look at one
 screen and say "yes, this is helping, here's the evidence" without trusting our
@@ -12,15 +15,58 @@ marketing.* Today the dashboard tells that story in pieces but not as a
 headline. Each item below closes a specific gap that prevents that headline
 from existing.
 
+## What 1.0 commits to
+
+Not "feature complete" — semver commitment. Once we ship 1.0:
+
+- Public APIs (CLI surface, MCP tool schemas, hook payload shapes) lock to semver
+- Schema migrations stay forward-compatible per the round-trip-spec convention
+- The trust signals we ship have a baseline measurement other releases must beat
+
+So 1.0 isn't gated by features. It's gated by **the measurement infrastructure
+being trustworthy enough to defend a 1.0 claim.** That's why this punchlist is
+mostly observability, not capability.
+
+### The three 1.0 pillars
+
+Restated 2026-05-27 to ground prioritization decisions:
+
+1. **Stability** — semver-locked CLI / MCP / hook / Ruby API contracts, schema
+round-trip discipline, deprecation policy. Anchored by `docs/api_stability.md`
+(#11 ✅) and the round-trip-spec convention.
+2. **Visibility** — a skeptical user can see what memory costs, what memory
+contains, what memory contributed, and what is wrong with it, on one screen,
+in <30s, without trusting our marketing. Anchored by the Trust panel, the
+digest, OTel ingestion, and the new `claude-memory audit` toolkit.
+3. **Long-horizon quality** — over weeks and months, the repo demonstrably
+improves session quality rather than degrading it. Anchored by the harm
+benchmark (#3, the actual release gate), the CLAUDE.md headline baseline
+(#4), repeat-correction detection (#8), and the drift dashboard (#10).
+
+Every 0.12 item maps to one of those pillars; an item that doesn't map is a
+1.x feature, not a 1.0 gate. The audit toolkit and OTel landed during 0.12
+because they directly serve pillars 1 and 2 — not as scope creep, but as work
+the original punchlist didn't anticipate would be needed.
+
 Items are cross-linked to the canonical entry in `improvements.md` where the
 implementation detail and acceptance criteria live. This file is the
 prioritization view; that file is the work view.
 
 ---
 
-##
+## 0.10.x — patch as needed (now)
 
-
+Reactive only. Real usage will surface issues; cut a patch when one shows up.
+No proactive minor work here.
+
+---
+
+## 0.11.0 — "Trust & Cost" (~1 week of work)
+
+Theme: *users can see what memory costs and whether it's helping.* Each item
+adds a number a skeptical user can read.
+
+### #1 Token budget telemetry — *what does memory cost?* ✅ landed 2026-04-29
 
 **Gap.** `Core::TokenEstimator` exists and is unused outside one helper. We
 have no idea what % of the SessionStart token budget memory consumes per
@@ -30,13 +76,18 @@ session, how it scales with DB size, or whether it's growing.
 tokens per session over the last 30 days. Per-session count rides on every
 `hook_context` activity event so the data is queryable post-hoc.
 
-**Why
-
-defend the trade.
+**Why this release.** Loudest critique of any context-injection memory
+system; if we can't answer it numerically, we can't defend the trade.
 
-
+**Status.** Landed in 4 atomic commits on 2026-04-29 (15cb5f5, 35ae8d2,
+d9601ca, 5bfd7c8). `context_tokens` recorded on every successful
+`hook_context` event, surfaced via `Dashboard::Trust#token_budget`,
+`claude-memory digest` "Context cost" section, and
+`claude-memory stats --tokens [--since DAYS]` with histogram.
 
-
+→ improvements.md entry: *#47 Token Budget Telemetry*. Effort: 4-6h.
+
+### #2 Hallucination rate as a first-class trust metric ✅ landed 2026-04-29
 
 **Gap.** `ReferenceMaterialDetector` already classifies suspect facts and we
 know from the #34 audit that ~25% of facts had embedded reasoning (i.e.
@@ -48,48 +99,16 @@ suspect-fact ratio + bare-conclusion ratio over active facts in both stores.
 Digest includes a 30-day rejection rate ("how much of what we extracted got
 rejected within a week?") so calibration drift is visible.
 
-**Why
-
-
-→ improvements.md entry: *Hallucination Rate Metric*
-
-### 3. Negative-fact harm benchmark
-
-**Gap.** Every benchmark we run today measures whether memory **helps**.
-Nothing measures whether memory **harms** — i.e. injects a wrong fact and
-Claude follows it. Without this, "memory helps" is unfalsifiable.
+**Why this release.** Pollution rate matters as much as recall rate. Pairs
+with #1 — together they answer the "is this still worth it?" question.
 
-**
-
-
-`bin/run-evals`. >1% harm rate blocks release.
+**Status.** Landed in 3 atomic commits on 2026-04-29 (27fa6af, 4d1c5bf,
+0b72fa4). New `Distill::BareConclusionDetector` + `Dashboard::Trust#quality_score`
++ `claude-memory digest` Quality section with rejection rate.
 
-
-is strictly worse than no memory; we need a release gate that proves we're
-not in that regime.
+→ improvements.md entry: *#48 Hallucination Rate Metric*. Effort: 1d.
 
-
-
-### 4. Publish the CLAUDE.md baseline in headline E2E results
-
-**Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
-and supports E2E. The adapter is wired into `comparative_helper.rb` but the
-README's headline comparative table doesn't include it. The single most
-important question for adoption — *"is this better than a hand-written
-CLAUDE.md?"* — is currently unanswered in our published numbers.
-
-**Acceptance.** Comparative E2E report includes `CLAUDE.md baseline` row in
-`spec/benchmarks/README.md` and in `bin/run-evals --comparative` summary
-output. README explicitly states the win/loss versus the static baseline.
-
-**Why must-have.** Cheapest item on the list — adapter already built, just
-surface the number. If we can't beat a static CLAUDE.md on developer
-scenarios, that's the loudest possible signal that the rest of the system
-needs work; if we can, that's the headline 1.0 brag.
-
-→ improvements.md entry: *CLAUDE.md Baseline in Headline Results*
-
-### 5. `claude-memory show` — human-readable "what would be injected"
+### #5 `claude-memory show` — human-readable "what would be injected" ✅ landed 2026-04-29
 
 **Gap.** Inspecting memory state today requires the dashboard or several CLI
 commands (`recall`, `stats`, `census`). The CLAUDE.md alternative is
@@ -101,64 +120,426 @@ path real sessions use, prints what would be injected next session in plain
 English (not JSON), sized to fit a terminal, with predicate-grouped sections
 matching the snapshot format.
 
-**Why
+**Why this release.** Trust requires inspectability. A user who can't see what
 memory will inject can't develop confidence in it.
 
-
+**Status.** Landed 2026-04-29 (commit 2586bb3). New `Commands::ShowCommand`
+runs `Hook::ContextInjector` and prints the would-be-injected Markdown.
+Default suppresses the raw-transcript pending-knowledge dump for
+readability (`--pending` opts in). Footer reports fact count, token
+estimate, char count.
+
+→ improvements.md entry: *#51 claude-memory show*. Effort: ½d.
+
+### #7 First-week ROI nudge — *moved up from post-1.0* ✅ landed 2026-04-30
+
+**Gap.** New users install, run a few sessions, don't know whether memory is
+working. The dashboard exists but they have to know to look.
+
+**Acceptance.** SessionEnd hook prints `memory contributed N facts this
+session, %used = X` inline for the first ~10 sessions, then quiets. Opt-out
+via `CLAUDE_MEMORY_NO_NUDGE=1`.
+
+**Why this release.** Belongs with the trust theme — it's the user-visible
+proof that memory is doing work for them. Originally listed as post-1.0;
+elevating because cold-start trust deserves to land before 1.0.
+
+**Status.** Landed in 2 atomic commits on 2026-04-30 (f450ed9, 3acce93)
+plus production smoke-test against this project's DB (event #229
+recorded with n=11, used=0, pct=0 for a real session_id). New
+`Hook::Handler#nudge` + `claude-memory hook nudge`; SessionEnd config
+appends nudge after ingest+sweep. Silent on opt-out, missing
+session_id, n=0, or first-week-complete (so empty sessions don't burn
+slots).
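
The silence rules in this status note can be sketched as one pure decision function. This is a minimal sketch under stated assumptions, not the gem's `Hook::Handler#nudge`: the method name `nudge_line`, its keyword arguments, and the hard cap of 10 (read off "first ~10 sessions") are all hypothetical.

```ruby
# Hypothetical sketch of the first-week nudge decision. Names are illustrative
# and are not the gem's actual Hook::Handler#nudge API.
FIRST_WEEK_SESSIONS = 10 # assumed cap, taken from "first ~10 sessions"

# Returns the line to print, or nil when the nudge should stay silent.
def nudge_line(session_id:, facts_injected:, facts_used:, nudges_shown:, env: ENV)
  return nil if env["CLAUDE_MEMORY_NO_NUDGE"] == "1"   # explicit opt-out
  return nil if session_id.nil? || session_id.empty?   # missing session_id
  return nil if facts_injected.zero?                   # n=0: don't burn a slot
  return nil if nudges_shown >= FIRST_WEEK_SESSIONS    # first week complete

  pct = (100.0 * facts_used / facts_injected).round
  "memory contributed #{facts_injected} facts this session, %used = #{pct}"
end
```

Keeping the decision free of I/O means each silent path can be unit-tested without a database or a real hook payload; the caller decides where the returned line is printed.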
+
+→ improvements.md entry: *#53 First-Week ROI Nudge*. Effort: ½d.
+
+### Risk-de-risking — 3-scenario harm prototype ✅ landed 2026-04-30
+
+Before 0.12 builds the full 10-15-scenario harm benchmark (see #3), run a
+3-scenario prototype against the 0.10.0 codebase to confirm whether harm is
+actually low. If the prototype surfaces a >0% harm rate on simple cases, the
+full benchmark in 0.12 will reveal a fundamental issue — better to know at
+0.11 than discover at 0.12.
+
+**Acceptance.** Three hand-written `harm_scenarios.yml` cases (one stale-tech,
+one mismatched-scope, one superseded-but-undetected) run against real Claude
+under `EVAL_MODE=real`. Reports go/no-go on the larger benchmark in 0.12.
 
-
+**Status.** Landed 2026-04-30 (commit 35b368e). Three cases written:
+`harm_stale_tech` (MySQL fact vs SQLite reality), `harm_mismatched_scope`
+(global TS/Tailwind preference applied to a Ruby gem),
+`harm_superseded_undetected` (two contradicting auth_method facts both
+active). Structure validation passes in stub mode. Real-mode is gated
+behind `EVAL_MODE=real` (~$2-8 per run) so the operator decides when to
+spend; this prototype reports harm rate but doesn't enforce a threshold
+yet — that's the 0.12 release-gate work.
+
+→ improvements.md entry: *#49 Negative-Fact Harm Benchmark* (prototype phase).
+Effort: ½d.
+
+**Ship target:** ~2 weeks from 0.10.0 (mid-May 2026 at current velocity).
+
+---
+
+## 0.12.0 — "Release Discipline + Observability + Self-Audit" (~4 weeks of work)
+
+Theme: *we can't ship a regression without noticing, and we can see what's
+happening inside.* Internal infrastructure that prevents future regressions,
+plus the observability primitives the 1.0 visibility pillar requires, plus
+the self-audit toolkit that catches drift in our own DB.
+
+*Restructured 2026-05-01: #11 (API stability audit) promoted from 1.0
+because the scoreboard #6 needs an explicit stable-surface list to gate
+against; new #12 (pre-release hook smoke gate) added to codify the
+verification convention that surfaced during 0.11 work.*
+
+*Restructured 2026-05-27: theme widened from "Release Discipline" to
+acknowledge two unplanned but on-mission work tracks that landed during the
+0.12 window — the OTel observability primitives (~15 commits) and the audit
+toolkit (#13). Both serve 1.0 pillars 1+2 directly and the punchlist now
+reflects that.*
+
+### #3 Negative-fact harm benchmark (full 10-15 scenarios) — **in progress 2026-05-27 (Path B blocker)**
+
+**Gap.** Every benchmark today measures whether memory **helps**. Nothing
+measures whether memory **harms** — i.e. injects a wrong fact and Claude
+follows it. Without this, "memory helps" is unfalsifiable. This is the
+single 0.12 item that directly serves pillar 3 (long-horizon quality);
+shipping 0.12 without it would tag a release whose central claim is
+unmeasured.
+
+**Acceptance.** `spec/benchmarks/dataset/harm_scenarios.yml` with 10-15 cases
+spanning four harm classes (stale-tech, mismatched-scope, superseded-but-
+undetected, reference-material-as-fact). Each scores `harm` if Claude follows
+the wrong fact, `safe` otherwise. Wired into `bin/run-evals`. **>1% harm
+rate blocks release** (configurable via `HARM_RATE_THRESHOLD`).
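
The gate arithmetic in this acceptance paragraph can be sketched as follows. Only the `harm`/`safe` verdicts, the 1% default, and the `HARM_RATE_THRESHOLD` variable name come from the text; the function names, and the choice to read the variable as a fraction (`0.01`) rather than a percentage, are assumptions.

```ruby
# Hypothetical release-gate check. Function names are illustrative and the
# threshold is assumed to be expressed as a fraction (0.01 == 1%).
DEFAULT_HARM_RATE_THRESHOLD = 0.01

# Fraction of scenario verdicts scored "harm".
def harm_rate(verdicts)
  return 0.0 if verdicts.empty?

  verdicts.count("harm").to_f / verdicts.size
end

# True when the measured harm rate exceeds the configured threshold.
def release_blocked?(verdicts, env: ENV)
  threshold = Float(env.fetch("HARM_RATE_THRESHOLD", DEFAULT_HARM_RATE_THRESHOLD))
  harm_rate(verdicts) > threshold
end
```

With a corpus of 10-15 scenarios, a single `harm` verdict already sits well above 1% (one in thirteen is roughly 7.7%), so at the default threshold the gate effectively requires zero harms.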
+
+**Why this release.** A retrieval system that occasionally makes Claude
+*wrong* is strictly worse than no memory; the release gate proves we're not
+in that regime.
+
+→ improvements.md entry: *#49 Negative-Fact Harm Benchmark* (full corpus).
+Effort: 2d.
+
+### #4 Publish the CLAUDE.md baseline in headline E2E results — **DEFERRED to 0.13 (2026-05-29): harness limitation**
+
+**Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
+and is wired into `comparative_helper.rb`. The single most important question
+for adoption — *"is this better than a hand-written CLAUDE.md?"* — is
+unanswered in our published numbers.
+
+**What happened.** The first real-mode comparative run (2026-05-28) returned
+ClaudeMemory **0/10**, No-memory **0/10**, CLAUDE.md baseline **8/10** — and
+investigation showed this is a *harness artifact, not a verdict*. The CLAUDE.md
+adapter auto-loads every fact into context unconditionally; the ClaudeMemory
+adapter relies on Claude proactively calling `memory.recall` MCP tools, which
+`claude -p` headless mode doesn't do for these prompts (and the SessionStart
+context hook injects only a generic top-5, not the specific fact each
+LongMemEval-style scenario needs). So ClaudeMemory's retrieval path is never
+exercised and it ties no-memory at 0. Publishing 0% vs 80% would actively
+mislead and violate the visibility pillar's honest-numbers standard.
+
+**Decision (2026-05-29).** Defer #4 to 0.13. It was never a release blocker
+(the harm gate was, and it's green at 0/13). 0.12 ships without comparative
+numbers; the README + benchmark README document the limitation honestly.
+
+**0.13 acceptance.** Fix the harness so it fairly exercises ClaudeMemory's
+retrieval — either (a) force memory-tool use (allowedTools + a recall-
+encouraging system turn), or (b) inject the full fact set via the context
+hook to match CLAUDE.md's "everything in context" model — then re-run and
+publish the real win/loss.
+
+→ improvements.md entry: *#50 CLAUDE.md Baseline in Headline Results*.
+Effort: harness fix ~1d + one real-mode run.
+
+### #16 Headless retrieval gap — *new observation 2026-05-29, investigate for 0.13*
+
+**Observation.** The #4 comparative run surfaced a genuine (separable) product
+concern: in fully headless, non-interactive `claude -p` usage with no
+tool-forcing, Claude does **not** proactively call ClaudeMemory's `memory.recall`
+MCP tools, so memory's contribution rides entirely on what the SessionStart
+context hook injects (a generic top-5 decisions/conventions/architecture). For
+*interactive* sessions — where Claude readily calls MCP tools — this isn't an
+issue, and it's the primary use case. But the gap is real and worth measuring:
+does the context-hook top-5 cover enough, or should headless usage get a richer
+injection (or a recall-on-demand affordance)?
+
+**Why not 0.12.** This is investigation, not a known fix, and it's orthogonal
+to the 0.12 visibility/stability theme. Pair it with the #4 harness fix in 0.13
+since both touch the same headless-retrieval seam.
+
+→ No improvements.md entry yet; originates from the 2026-05-28 comparative run.
+
+### #6 Release-to-release benchmark scoreboard ✅ landed 2026-05-01
 
 **Gap.** Benchmark output is textual today. Nothing diff-able across versions.
-Regressions land silently — the only reason we caught the
-
+Regressions land silently — the only reason we caught the BM25 normalization
+bug was a manual run.
 
 **Acceptance.** Each `bin/run-evals` run writes
-`spec/benchmarks/results/<version>.json`. New `bin/bench-diff`
-
-
-
+`spec/benchmarks/results/<version>.json`. New `bin/bench-diff` compares
+against the last tagged version's JSON and reports deltas. `/release` skill
+reads it and refuses to ship on regressions over threshold.
+
+**Why this release.** The semver commitment in 1.0 *requires* this — we
+can't promise non-regression without the infrastructure to detect it.
+
+**Status.** Landed 2026-05-01. `bin/run-evals` writes
+`spec/benchmarks/results/<version>.json` with diff-friendly pass-rate
+metrics by category and per-scenario. `bin/bench-diff` compares against
+the most recent prior tagged version's scoreboard via `Gem::Version`
+ordering, flags pass-rate drops > threshold (default 5%), supports
+`--threshold` / `--baseline` / `--json` / `--strict`. 11 unit specs
+covering missing-baseline, threshold tuning, deep-nested metric paths,
+JSON output. Wired into `/release` skill as new Phase 1 Step 7 (after
+smoke gate, before lint). First release with the gate is 0.12.0 itself
+— prior versions have no scoreboard, so bench-diff exits 0 with a "no
+baseline" note; from 0.13 onward it actively gates.
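
The baseline selection and drop detection described in this status note can be sketched with `Gem::Version`, which ships with Ruby. This is not the `bin/bench-diff` source: the function names, the flat `{metric => pass_rate}` hash shape, and the reading of "5%" as an absolute drop in pass-rate fraction are assumptions.

```ruby
require "rubygems" # Gem::Version (already loaded in a standard Ruby)

# Most recent scoreboard version strictly older than `current`, or nil when
# there is no prior scoreboard (the "no baseline" case).
def baseline_version(available, current)
  cur = Gem::Version.new(current)
  available.map { |v| Gem::Version.new(v) }.select { |v| v < cur }.max&.to_s
end

# Metrics whose pass rate dropped by more than `threshold` (assumed to be an
# absolute difference between two fractions in 0.0..1.0).
def regressions(baseline, current, threshold: 0.05)
  baseline.each_with_object({}) do |(name, old_rate), out|
    new_rate = current[name]
    next if new_rate.nil? # metric absent from the new run: not compared here

    drop = old_rate - new_rate
    out[name] = drop.round(4) if drop > threshold
  end
end
```

`Gem::Version` orders `0.9.1` before `0.10.0`, which plain string comparison does not, so the version lookup stays correct once a component reaches two digits.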
+
+→ improvements.md entry: *#52 Benchmark Scoreboard Diff*. Effort: 1d.
+
+### #11 API stability audit — *promoted from 1.0 (2026-05-01)* ✅ landed 2026-05-01
+
+**Gap.** "1.0 commits to semver" is meaningless without an explicit
+public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0 / 0.11.0
+(MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints,
+`detail_json` field set) have evolved organically and aren't formally
+documented as stable vs. internal.
+
+**Acceptance.**
+
+- New `docs/api_stability.md` enumerating:
+- **Public CLI**: every `claude-memory <subcommand>` and its flags, with stability tier
+- **Public MCP tools**: every tool's schema, return shape, and tool-annotation hints
+- **Public hook contract**: payload fields, return shapes, exit codes, `detail_json` field set per event_type
+- **Public Ruby API**: `Recall`, `Configuration`, `Store::StoreManager`, `Domain::*` vs. internal-only
+- **Schema**: stability of column names, table names, predicate vocabulary
+- Deprecation policy paragraph: "we'll mark X deprecated in N.x.0 (with a runtime warning), keep it functional for ≥1 minor cycle, and remove no earlier than (N+1).0.0"
+- `ClaudeMemory::Deprecations.warn(name:, replacement:, removed_in:)` module wired up and used at least once so the mechanism is exercised
+- README + CLAUDE.md link to the new doc as the authoritative source
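
A deprecation helper with the keyword signature named in this list could look roughly like the sketch below. It is not the contents of `lib/claude_memory/deprecations.rb`: the module name `DeprecationsSketch`, the warn-once behavior, the message wording, and the `io:` parameter are assumptions added for illustration.

```ruby
# Hypothetical sketch of a runtime deprecation warning helper. Only the
# (name:, replacement:, removed_in:) keywords come from the punchlist.
module DeprecationsSketch
  @seen = {}

  # Emits one warning per deprecated name per process.
  # Returns true when a warning was written, false when it was suppressed.
  def self.warn(name:, replacement:, removed_in:, io: $stderr)
    return false if @seen[name]

    @seen[name] = true
    io.puts "[DEPRECATION] #{name} is deprecated and will be removed in " \
            "#{removed_in}; use #{replacement} instead."
    true
  end

  # Clears the warn-once registry (useful between test cases).
  def self.reset!
    @seen.clear
  end
end
```

Warning once per name keeps hook output readable when a deprecated path is hit on every event, and injecting `io` lets a spec capture the message without touching the real stderr.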
+
+**Why this release.** #6's scoreboard needs to know what surfaces are stable
+to gate against. Without #11, any "regression" finding is arguable. The
+deprecation-warning module is also a prerequisite for any soft-rename work
+during the 0.12 → 1.0 soak.
+
+→ improvements.md entry: *#59 API Stability Audit*. Effort: 2d.
+
+### #12 Pre-release hook smoke gate — *new this release (2026-05-01)* ✅ landed 2026-05-01
+
+**Gap.** During 0.11 work, five commits landed for #47 token-budget telemetry
+with 156 specs green. 24 hours of real SessionStart hook events recorded no
+`context_tokens` field — because the *installed* gem was still 0.9.1 and the
+`.claude/settings.json` hooks invoke the installed binary via PATH, not the
+working tree. The bug wasn't in the code; the bug was in the release process.
+
+This trap has been hit twice now (#47 in 0.11; an earlier ActivityLog
+incident on 2026-04-16). It's documented in
+`~/.claude/projects/.../memory/feedback_hooks_run_installed_gem.md` and as
+two project conventions, but documentation hasn't stopped me (Claude) from
+springing the trap again.
+
+**Acceptance.**
+
+- New `bin/pre-release-smoke` script: `rake install` → trigger each hook
+with a synthetic payload → inspect `activity_events.detail_json` via
+`sqlite3 json_extract` for expected fields per the current version → exit
+non-zero if anything is null.
+- Per-version expectation manifest at `spec/smoke/expected_fields.yml`
+declares `{event_type, fields, since_version}` so new fields just need a
+YAML append; no script changes per release.
+- `/release` skill Phase 1 runs the smoke gate after specs and before lint.
+Failure aborts before `git push`.
+- Test: `spec/smoke/pre_release_smoke_spec.rb` validates the manifest schema
+and that the exit-code logic correctly flips on simulated null fields.
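
The manifest-driven check in this list can be sketched in a few lines. This is a sketch under stated assumptions rather than `bin/pre-release-smoke` itself: it swaps the `sqlite3 json_extract` query for plain `JSON.parse` over an in-memory hash of `event_type => detail_json`, and the manifest entry (including `since_version: "0.11.0"`) plus all function names are hypothetical.

```ruby
require "json"
require "yaml"
require "rubygems" # Gem::Version

# Hypothetical manifest in the {event_type, fields, since_version} shape.
MANIFEST = YAML.safe_load(<<~YAML)
  - event_type: hook_context
    fields: [context_tokens]
    since_version: "0.11.0"
YAML

# Returns [event_type, field] pairs that are expected at `version` but are
# missing or null in the recorded detail_json.
def missing_fields(manifest, events, version)
  current = Gem::Version.new(version)
  manifest.flat_map do |entry|
    next [] if Gem::Version.new(entry["since_version"]) > current

    detail = JSON.parse(events.fetch(entry["event_type"], "{}"))
    entry["fields"].select { |f| detail[f].nil? }.map { |f| [entry["event_type"], f] }
  end
end

# Exit status for the gate: non-zero when anything expected is null.
def smoke_exit_code(missing)
  missing.empty? ? 0 : 1
end
```

Because entries carry `since_version`, a field introduced in a later release is simply skipped when the gate runs against an older version, which is what lets new expectations be added by appending YAML rather than editing the script.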
+
+**Why this release.** Release Discipline that doesn't catch the trap I've
+already hit twice isn't real discipline. Pairs with #6 — the scoreboard
+catches regressions in measurement; the smoke gate catches the regression
+where the measurement itself doesn't fire.
+
+→ improvements.md entry: *#63 Pre-Release Hook Smoke Gate*. Effort: ½d.
+
+### #13 Memory health audit toolkit — *unplanned, landed 2026-05-27* ✅
+
+**Gap.** Drift inside the project DB — duplicate global conventions,
+single-cardinality multiplicity, contamination-driven rejection churn, bare
+conclusions, shortcut tools leaking the wrong predicate — was diagnosable
+only by hand, project by project. The 2026-05-21 audit surfaced 103 rejected
+single-cardinality facts in this project's own DB, all sourced from example
+text in our own docs being re-ingested. Without a productionized check, this
+class of regression silently erodes the 1.0 visibility claim.
+
+**Acceptance.**
+
+- `claude-memory audit` CLI with ten contract checks (C001-C010), `--json`
+for CI, `--severity`, `--no-exit`
+- `/audit-memory` slash command for interactive walkthrough
+- `docs/audit_runbook.md` per-check rationale + remediation
+- `ReferenceMaterialDetector` example-quote guard + `Resolver` `:discard`
+path (defense-in-depth at write time)
+- Memory shortcuts (`memory.decisions`/`.conventions`/`.architecture`)
+switched from FTS text search to predicate-based filtering
+- `claude-memory import-auto-memory` retroactively pulls auto-memory entries
+`AutoMemoryMirror` missed (slug bug fixed: `tr("/_", "-")`)
+- Signal-health benchmark spec (`spec/benchmarks/health/database_signal_spec.rb`)
+codifies the cleanup contracts so regressions can be detected in CI
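
The slug fix quoted in the list above relies on `String#tr` mapping every character in its first argument to the replacement. A minimal illustration follows; the helper name `project_slug` and the example path are hypothetical, and the assumption that the slug is derived from a project path is inferred from context rather than stated in the diff.

```ruby
# Hypothetical helper: derive a directory slug by replacing both "/" and "_"
# with "-", as in the quoted fix tr("/_", "-").
def project_slug(path)
  path.tr("/_", "-")
end

# For comparison only (not claimed to be the earlier code): a variant that
# maps just "/" leaves underscores in place and yields a different slug.
def slash_only_slug(path)
  path.tr("/", "-")
end
```

For a path such as `/home/dev/claude_memory` the two variants disagree on the final segment, which is the kind of mismatch that would make a lookup by slug miss an existing directory.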
|
|
388
|
+
|
|
389
|
+
**Why this release.** Serves pillars 1 (stability — guards single-cardinality
|
|
390
|
+
predicates from drifting) and 2 (visibility — surfaces drift as a measurable
|
|
391
|
+
signal). The detector + resolver fixes mean the 0.12 → 1.0 soak is more
|
|
392
|
+
likely to surface real signal vs. doc-text contamination noise.
|
|
393
|
+
|
|
394
|
+
→ improvements.md entry: not yet promoted; lives in `docs/memory_audit_2026-05-21.md`
|
|
395
|
+
as the originating artifact. Effort: ~2d (across the 2026-05-27 session).
|
|
396
|
+
|
|
397
|
+
### #14 OpenTelemetry ingestion + Dashboard Telemetry/Prompt Journey — *unplanned, landed 2026-05-21* ✅
|
|
398
|
+
|
|
399
|
+
**Gap.** The visibility pillar promised "you can see what memory costs and
|
|
400
|
+
what it's doing." Token-budget telemetry (#1) covered the cost; the rest —
|
|
401
|
+
per-tool latency, cost-per-hour, the full prompt-to-response journey across
|
|
402
|
+
hooks/MCP/distillation — was invisible without an external tracer. Claude
|
|
403
|
+
Code already exports OTLP if asked; the question was whether ClaudeMemory
|
|
404
|
+
should ingest its own telemetry rather than punting to Datadog/Honeycomb.
|
|
405
|
+
|
|
406
|
+
**Acceptance.**
|
|
407
|
+
|
|
408
|
+
- Schema v18: `otel_metrics`, `otel_events`, `otel_traces` + `prompt_id`
|
|
409
|
+
on `activity_events` for journey correlation
|
|
410
|
+
- `claude-memory otel` CLI manages the env block (`--enable`, `--disable`,
|
|
411
|
+
`--enable-traces`, `--capture-prompts`, `--status`, `--verify`, `--backfill`)
|
|
412
|
+
- Dashboard exposes `/v1/metrics`, `/v1/logs`, `/v1/traces` on
|
|
413
|
+
`127.0.0.1:3377` (OTLP/HTTP/JSON) plus a new "Telemetry" drawer
|
|
414
|
+
- Prompt Journey panel UNIONs `otel_events` with `activity_events` and
|
|
415
|
+
back-tags activity_events with `prompt.id` via `OTel::PromptScope`
|
|
416
|
+
- Sweep retention: 30d metrics, 14d events, 7d traces
|
|
417
|
+
- Privacy posture: opt-in for prompt capture; traces 501-gated until
|
|
418
|
+
explicit `--enable-traces`
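The retention windows in the list above can be sketched as a pure cutoff calculation. This is a minimal illustration, not the gem's sweep code: the module name `RetentionSketch` and its methods are hypothetical, and only the table names and day counts come from the list.

```ruby
# Hypothetical sketch of the retention windows; not the gem's actual sweep code.
module RetentionSketch
  SECONDS_PER_DAY = 86_400
  # Table names and windows come from the acceptance list above.
  RETENTION_DAYS = {
    "otel_metrics" => 30,
    "otel_events" => 14,
    "otel_traces" => 7
  }.freeze

  module_function

  # Oldest timestamp each table keeps, relative to `now`.
  def cutoffs(now)
    RETENTION_DAYS.transform_values { |days| now - (days * SECONDS_PER_DAY) }
  end

  # True when a row recorded at `recorded_at` falls outside its table's window.
  def expired?(table, recorded_at, now)
    recorded_at < cutoffs(now).fetch(table)
  end
end

now = Time.utc(2026, 5, 21)
cutoffs = RetentionSketch.cutoffs(now)
ten_days_ago = now - (10 * RetentionSketch::SECONDS_PER_DAY)
trace_expired = RetentionSketch.expired?("otel_traces", ten_days_ago, now) # 10d is past the 7d window
event_expired = RetentionSketch.expired?("otel_events", ten_days_ago, now) # 10d is inside the 14d window
```

Keeping the cutoff calculation free of database access makes the windows easy to assert in a spec; the actual delete statements would sit on top of it.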
|
|
419
|
+
|
|
420
|
+
**Why this release.** Directly serves pillar 2 (visibility) at a depth
|
|
421
|
+
nothing else can — no dashboard polish substitutes for actual per-prompt
|
|
422
|
+
trace data. It directly answers "what is this thing doing right now?"
|
|
423
|
+
|
|
424
|
+
→ improvements.md entry: tracked under the OTel research → study line.
|
|
425
|
+
Effort: ~2.5w (Apr 26 → May 21).
|
|
426
|
+
|
|
427
|
+
### #15 Staleness guard for single-value facts — *born from the #3 harm run, landed 2026-05-28* ✅
|
|
428
|
+
|
|
429
|
+
**Gap.** The first full-corpus real-mode harm run (#3) surfaced a 15.4%
|
|
430
|
+
harm rate (2 of 13 scenarios). One was a false positive in the test pattern (fixed in the
|
|
431
|
+
corpus); the other was a **real harm**: Claude emitted `git push heroku
|
|
432
|
+
HEAD:main` from a stale `deployment_platform` fact with no hedge.
|
|
433
|
+
Single-value predicates are exclusive claims Claude follows
|
|
434
|
+
authoritatively — and ClaudeMemory had no defense against a stale one
|
|
435
|
+
when no superseding fact exists (supersession only fires if the
|
|
436
|
+
migration was recorded). This is a direct pillar-3 (long-horizon
|
|
437
|
+
quality) hole: over months, single-value facts go stale and silently
|
|
438
|
+
make Claude wrong.
|
|
439
|
+
|
|
440
|
+
**Acceptance.**
|
|
441
|
+
|
|
442
|
+
- `Recall::StalenessAnnotator` pure function: flags single-value facts
|
|
443
|
+
(uses_database / deployment_platform / auth_method) that are old
|
|
444
|
+
(valid_from/created_at older than threshold) AND not recently
|
|
445
|
+
confirmed (last_recalled_at null/stale)
|
|
446
|
+
- `Hook::ContextInjector` appends a "⚠ stale … verify before relying"
|
|
447
|
+
marker at SessionStart; multi-value predicates never annotated
|
|
448
|
+
- `Configuration#injection_stale_days` (default 180, env override),
|
|
449
|
+
distinct from the 14-day dashboard review window
|
|
450
|
+
- Re-run of #3 (scaffolded + best-of-N) confirms the gate is green
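The staleness rule in the list above (single-value predicate, old, and not recently confirmed) can be sketched as a pure function. This is an illustration under stated assumptions, not the gem's `Recall::StalenessAnnotator`: the module name `StalenessSketch`, the hash shape of a fact, and the choice to reuse the same 180-day threshold for `last_recalled_at` are all hypothetical.

```ruby
# Hypothetical sketch, not the gem's actual Recall::StalenessAnnotator.
module StalenessSketch
  # Single-value predicates named in the acceptance list above.
  SINGLE_VALUE_PREDICATES = %w[uses_database deployment_platform auth_method].freeze
  SECONDS_PER_DAY = 86_400

  module_function

  # True only when the fact is single-value, old, AND not recently recalled.
  def stale?(fact, now:, stale_days: 180)
    return false unless SINGLE_VALUE_PREDICATES.include?(fact[:predicate])

    asserted_at = fact[:valid_from] || fact[:created_at]
    return false if asserted_at.nil? # assumption: unknown age is left unflagged

    # Assumption: the same threshold applies to the last recall timestamp.
    cutoff = now - (stale_days * SECONDS_PER_DAY)
    recalled_at = fact[:last_recalled_at]
    asserted_at < cutoff && (recalled_at.nil? || recalled_at < cutoff)
  end

  # Returns new hashes; the input facts are not mutated.
  def annotate(facts, now:, stale_days: 180)
    facts.map { |fact| fact.merge(stale: stale?(fact, now: now, stale_days: stale_days)) }
  end
end

now = Time.utc(2026, 5, 28)
day = StalenessSketch::SECONDS_PER_DAY
facts = [
  # Old and never recalled: flagged.
  { predicate: "deployment_platform", object: "heroku", created_at: now - (400 * day), last_recalled_at: nil },
  # Old but recalled three days ago: not flagged.
  { predicate: "uses_database", object: "postgresql", created_at: now - (400 * day), last_recalled_at: now - (3 * day) },
  # Multi-value predicate (made-up name): never flagged.
  { predicate: "convention", object: "rubocop", created_at: now - (400 * day), last_recalled_at: nil }
]
flags = StalenessSketch.annotate(facts, now: now).map { |fact| fact[:stale] }
```

Because the function takes `now` as an argument and touches no store, the same inputs always give the same flags, which is what makes it straightforward to test.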
|
|
451
|
+
|
|
452
|
+
**Why this release.** It's the concrete payoff of building the harm
|
|
453
|
+
benchmark before 1.0: the benchmark didn't just report a number, it
|
|
454
|
+
forced a real defensive feature that makes the long-horizon-quality
|
|
455
|
+
claim defensible. Shipping #3 without #15 would have meant tagging a
|
|
456
|
+
release whose own gate said "memory makes Claude wrong 1-in-13 times."
|
|
457
|
+
|
|
458
|
+
**Harness hardening (same investigation).** The first full-corpus run
|
|
459
|
+
also exposed two confounds that made the gate unverifiable: scenarios
|
|
460
|
+
ran in an empty tmpdir (Claude often refused for lack of project
|
|
461
|
+
context, not because it resisted the bad fact) and single-shot scoring
|
|
462
|
+
was noisy (the harmed *set* changed run-to-run). Fixed by (a) shipping a
|
|
463
|
+
`project_files` scaffold per scenario whose current state contradicts
|
|
464
|
+
the wrong memory fact — making each case a real "memory vs reality"
|
|
465
|
+
test — and (b) best-of-N majority scoring (HARM_BENCH_RUNS, default 3).
|
|
466
|
+
Without this, #15's effect couldn't be measured cleanly.
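Best-of-N majority scoring, as described above, might look roughly like the following. The module `BestOfNSketch` and its method names are hypothetical, and treating a scenario as harmed only on a strict majority of runs is an assumption; only the `HARM_BENCH_RUNS` variable and its default of 3 come from the text.

```ruby
# Hypothetical sketch of best-of-N majority scoring; not the harness's actual code.
module BestOfNSketch
  module_function

  # Reads the run count from HARM_BENCH_RUNS, defaulting to 3 as stated above.
  def runs_from_env(env = ENV)
    Integer(env.fetch("HARM_BENCH_RUNS", "3"))
  end

  # Yields the run index once per run; the block returns true when that run was harmed.
  # Assumption: a scenario counts as harmed only on a strict majority of runs.
  def harmed_by_majority?(runs: 3)
    harmed_runs = Array.new(runs) { |index| yield(index) }.count(true)
    harmed_runs * 2 > runs
  end
end

# Canned outcomes stand in for real model runs.
flaky = [true, false, false]
steady = [true, true, false]
flaky_verdict = BestOfNSketch.harmed_by_majority?(runs: 3) { |i| flaky[i] }
steady_verdict = BestOfNSketch.harmed_by_majority?(runs: 3) { |i| steady[i] }
```

A single harmed run out of three no longer flips the scenario, which is the noise the single-shot scoring suffered from.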
|
|
467
|
+
|
|
468
|
+
→ improvements.md entry: not yet promoted; originates from the
|
|
469
|
+
`spec/benchmarks/dataset/harm_scenarios.yml` `harm_stale_deployment_heroku`
|
|
470
|
+
finding. Effort: ~½d (2026-05-28 session).
|
|
471
|
+
|
|
472
|
+
**Ship target:** ready to tag (2026-05-29). #3 harm gate is green at 0/13
|
|
473
|
+
(best-of-3) after #15; #4 deferred to 0.13 (harness limitation, never a
|
|
474
|
+
blocker); everything else in 0.12 has shipped. 0.12 tags now; soak window
|
|
475
|
+
2-3 weeks before 1.0.
|
|
120
476
|
|
|
121
|
-
|
|
122
|
-
snapshot. 1.0 is the moment we commit to *not regressing* what we ship.
|
|
477
|
+
---
|
|
123
478
|
|
|
124
|
-
→
|
|
479
|
+
## 0.12.x → 1.0 — soak period (2-3 weeks)
|
|
125
480
|
|
|
126
|
-
|
|
481
|
+
Critical phase. Run 0.12 against real usage. Watch:
|
|
482
|
+
|
|
483
|
+
- **Harm rate stays at 0%** — release gate from #3
|
|
484
|
+
- **Hallucination rate trend** — from #2
|
|
485
|
+
- **Token budget growth** — from #1, #9
|
|
486
|
+
- **Utilization ratio** — across multiple projects
|
|
487
|
+
|
|
488
|
+
If any signal shifts unfavorably during soak, fix in 0.12.x. **Don't ship 1.0
|
|
489
|
+
from a release that hasn't observed itself for ≥2 weeks.**
|
|
490
|
+
|
|
491
|
+
This soak period is also where the relevance ratio metric (#31 from 0.10.0)
|
|
492
|
+
gets its first real-mode measurement, and where the 0.11 trust
|
|
493
|
+
signals become measured numbers rather than theory.
|
|
127
494
|
|
|
128
|
-
|
|
495
|
+
---
|
|
129
496
|
|
|
130
|
-
|
|
497
|
+
## 1.0.0 — "Stable Memory"
|
|
131
498
|
|
|
132
|
-
|
|
499
|
+
Theme: *ready for daily use, ready to recommend.*
|
|
133
500
|
|
|
134
|
-
|
|
135
|
-
inline for the first ~10 sessions. Closes the cold-start gap where new users
|
|
136
|
-
don't see value because they don't think to look.
|
|
501
|
+
### Post-1.0-punchlist polish (if landed during soak)
|
|
137
502
|
|
|
138
|
-
|
|
503
|
+
These were originally post-1.0 in the punchlist; if soak time permits, they
|
|
504
|
+
land in 1.0. Otherwise they ship in 1.1.
|
|
139
505
|
|
|
140
|
-
### 8
|
|
506
|
+
### #8 Real-session repeat-correction detection
|
|
141
507
|
|
|
142
|
-
The repeat-correction benchmark (#32) is synthetic; production
|
|
143
|
-
equivalent signal. Analyze `activity_events`
|
|
144
|
-
last session, the user re-stated it this session" — that's where
|
|
145
|
-
silently failing.
|
|
508
|
+
The repeat-correction benchmark (#32 from 0.10.0) is synthetic; production
|
|
509
|
+
has no equivalent signal. Analyze `activity_events` for "this fact was
|
|
510
|
+
injected last session, the user re-stated it this session" — that's where
|
|
511
|
+
memory is silently failing.
|
|
146
512
|
|
|
147
|
-
→ improvements.md entry:
|
|
513
|
+
→ improvements.md entry: *#54 Real-Session Repeat-Correction Detection*.
|
|
514
|
+
Effort: 2d.
|
|
148
515
|
|
|
149
|
-
### 9
|
|
516
|
+
### #9 Token-cost growth tracking
|
|
150
517
|
|
|
151
518
|
Builds on #1. Weekly digest reports "context cost grew X% over 30d" as an
|
|
152
519
|
anomaly signal that the DB is bloating or context injection is going wide.
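The "grew X% over 30d" figure is plain percentage growth against a 30-day-old baseline. A minimal sketch, assuming a hypothetical helper name and made-up token counts:

```ruby
# Hypothetical helper; the digest's real calculation may differ.
def context_cost_growth_percent(tokens_30d_ago, tokens_now)
  return nil if tokens_30d_ago.zero? # no baseline, so growth is undefined

  ((tokens_now - tokens_30d_ago) / tokens_30d_ago.to_f * 100).round(1)
end

# Made-up numbers: 1,200 tokens per session a month ago, 1,500 now.
growth = context_cost_growth_percent(1_200, 1_500)
puts "context cost grew #{growth}% over 30d" # prints "context cost grew 25.0% over 30d"
```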
|
|
153
520
|
|
|
154
|
-
→ improvements.md entry:
|
|
521
|
+
→ improvements.md entry: *#55 Token-Cost Growth Tracking*. Effort: 3h after
|
|
522
|
+
#1 lands.
|
|
155
523
|
|
|
156
|
-
### 10
|
|
524
|
+
### #10 Drift dashboard
|
|
157
525
|
|
|
158
526
|
Snapshot `census` weekly, surface predicate distribution shifts on the
|
|
159
527
|
dashboard. Answers "is my fact base going off?" without a manual audit.
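One way to express a "predicate distribution shift" is the change in each predicate's share of the fact base between two weekly snapshots. The sketch below assumes a snapshot is a hash of predicate name to fact count; that shape, the method names, and the counts are hypothetical and not taken from the `census` command.

```ruby
# Hypothetical sketch; the snapshot shape is assumed, not taken from `census`.
def predicate_shares(counts)
  total = counts.values.sum.to_f
  return {} if total.zero?

  counts.transform_values { |count| count / total }
end

# Change in share, in percentage points, for every predicate seen in either snapshot.
def share_shifts(previous, current)
  before = predicate_shares(previous)
  after = predicate_shares(current)
  (before.keys | after.keys).to_h do |predicate|
    delta = after.fetch(predicate, 0.0) - before.fetch(predicate, 0.0)
    [predicate, (delta * 100).round(1)]
  end
end

# Made-up counts: "convention" facts triple while "uses_database" holds steady.
last_week = { "uses_database" => 5, "convention" => 15 }
this_week = { "uses_database" => 5, "convention" => 45 }
shifts = share_shifts(last_week, this_week)
```

Comparing shares rather than raw counts keeps a uniformly growing fact base from looking like drift.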
|
|
160
528
|
|
|
161
|
-
→ improvements.md entry:
|
|
529
|
+
→ improvements.md entry: *#56 Drift Dashboard*. Effort: 1.5d.
|
|
530
|
+
|
|
531
|
+
*(#11 API stability audit moved to 0.12 on 2026-05-01 — see above.)*
|
|
532
|
+
|
|
533
|
+
### Release framing
|
|
534
|
+
|
|
535
|
+
README + CHANGELOG framing for 1.0 explicitly states:
|
|
536
|
+
|
|
537
|
+
- "We measured X harm rate, Y utilization, Z hallucination rate across N
|
|
538
|
+
projects over W weeks before tagging this."
|
|
539
|
+
- The public API surface is documented at `docs/api_stability.md`
|
|
540
|
+
- Deprecation policy explicit
|
|
541
|
+
|
|
542
|
+
**Ship target:** 6-8 weeks from 0.10.0 (mid-June 2026 at current velocity).
|
|
162
543
|
|
|
163
544
|
---
|
|
164
545
|
|
|
@@ -168,23 +549,75 @@ dashboard. Answers "is my fact base going off?" without a manual audit.
|
|
|
168
549
|
drawers cover the primary need.
|
|
169
550
|
- **#45 Live SSE/WebSocket feed** — polling is adequate; dashboard polish, not
|
|
170
551
|
a confidence gap.
|
|
552
|
+
- **#23 REST API endpoint** — MCP covers primary use case; defer to 1.x.
|
|
553
|
+
- **#25 HTTP MCP transport** — no startup-latency complaint to motivate it yet.
|
|
554
|
+
|
|
555
|
+
---
|
|
556
|
+
|
|
557
|
+
## Risk to flag now
|
|
558
|
+
|
|
559
|
+
The biggest hidden risk in this plan was **the harm benchmark (#3) finds
|
|
560
|
+
something.** The 3-scenario prototype in 0.11 (above) was specifically
|
|
561
|
+
designed to surface this risk earlier — and **on 2026-04-30 the real-mode
|
|
562
|
+
prototype reported 0/3 harm**, green-lighting the full corpus expansion.
|
|
563
|
+
Risk is materially reduced; the 10-15-case corpus may still surface
|
|
564
|
+
something the 3-case sample missed, but a fundamental retrieval-discipline
|
|
565
|
+
issue is now unlikely.
|
|
566
|
+
|
|
567
|
+
Remaining risk for 0.12: **#11 API stability audit reveals the surface is
|
|
568
|
+
larger or messier than we thought**, pushing the doc work past the 2-day
|
|
569
|
+
estimate. Mitigation: scope `Public Ruby API` aggressively to "internal
|
|
570
|
+
unless proven otherwise" — easier to promote later than demote. *Update
|
|
571
|
+
2026-05-27: #11 landed on time on 2026-05-01; this risk did not materialize.*
|
|
572
|
+
|
|
573
|
+
Remaining risk for 0.12, take 2 (added 2026-05-27 in light of Path B):
|
|
574
|
+
**the full 13-scenario harm corpus surfaces a >1% harm rate** that the
|
|
575
|
+
3-scenario prototype masked. Mitigation paths if it happens: classify the
|
|
576
|
+
harming class, ship a guard (the way #13 added `ReferenceMaterialDetector`
|
|
577
|
+
example-quote guard for the contamination class), re-run. Worst case
|
|
578
|
+
extends 0.12 by ~3-5 days; doesn't push 1.0 if the soak window has slack.
|
|
171
579
|
|
|
172
580
|
---
|
|
173
581
|
|
|
174
|
-
##
|
|
582
|
+
## Velocity assumptions
|
|
583
|
+
|
|
584
|
+
Based on actual release cadence Mar-Apr 2026:
|
|
585
|
+
|
|
586
|
+
| Release pair | Days between releases |
|
|
587
|
+
|---|---|
|
|
588
|
+
| 0.7.0 → 0.7.1 | minor patch, days |
|
|
589
|
+
| 0.7.1 → 0.8.0 | 17 |
|
|
590
|
+
| 0.8.0 → 0.9.0 | 17 |
|
|
591
|
+
| 0.9.0 → 0.9.1 | same day (patch) |
|
|
592
|
+
| 0.9.1 → 0.10.0 | 12 |
|
|
593
|
+
|
|
594
|
+
Average ~2 weeks per minor (17, 17, and 12 days; mean ≈15) with substantial work landing each cycle.
|
|
175
595
|
|
|
176
|
-
|
|
596
|
+
| Milestone | Estimated work | Calendar target | Status |
|
|
597
|
+
|---|---|---|---|
|
|
598
|
+
| 0.11.0 | ~1 week | ~2026-05-12 | ✅ shipped 2026-04-30 |
|
|
599
|
+
| 0.11.x patches | reactive | as-needed | open |
|
|
600
|
+
| 0.12.0 (originally planned) | ~1.5 weeks | ~2026-06-02 | superseded — actual scope widened (see 2026-05-27 restructure) |
|
|
601
|
+
| 0.12.0 (actual) | ~4 weeks (#6/#11/#12 + OTel + audit toolkit + Path B #3/#4) | tag ~2026-06-03 | 5 of 7 items shipped; #3 + #4 in progress |
|
|
602
|
+
| Soak | 2-3 weeks | through ~2026-06-24 | future |
|
|
603
|
+
| 1.0.0 | 1-2 days release prep | ~2026-06-24 to 2026-07-01 | future |
|
|
177
604
|
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
|
|
605
|
+
*0.12 grew from ~1 week to ~1.5 weeks after the 2026-05-01 restructure
|
|
606
|
+
(promoted #11 + added #12), then widened again to ~4 weeks after the
|
|
607
|
+
2026-05-27 restructure that absorbed the OTel observability work and the
|
|
608
|
+
audit toolkit. 1.0 calendar shifted ~3 weeks later in total but the soak
|
|
609
|
+
window remains 2-3 weeks — the visibility/stability surface 0.12 now ships
|
|
610
|
+
is materially larger than the original "Release Discipline" scope.*
|
|
181
611
|
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
land.
|
|
612
|
+
These are calendar estimates assuming roughly the same focus level as the
|
|
613
|
+
0.10.0 cycle. Real cadence will adjust based on what surfaces during soak.
|
|
185
614
|
|
|
186
615
|
---
|
|
187
616
|
|
|
188
|
-
*Last updated: 2026-
|
|
189
|
-
|
|
190
|
-
|
|
617
|
+
*Last updated: 2026-05-27 (mid-0.12 cycle). 0.11.0 shipped 2026-04-30 with
|
|
618
|
+
all 5 punchlist items + harm prototype reporting 0/3 harm. 0.12 restructured
|
|
619
|
+
2026-05-01 (promoted #11, added #12) and again 2026-05-27 (absorbed OTel
|
|
620
|
+
#14 + audit toolkit #13, re-anchored on the three 1.0 pillars, committed
|
|
621
|
+
to Path B finishing #3 + #4 before tag). 0.12 grew ~1.5w → ~4w; 1.0 ship
|
|
622
|
+
target shifted ~3w later as a result. Soak window held at 2-3w because the
|
|
623
|
+
visibility surface in 0.12 is materially larger than originally scoped.*
|