@lh8ppl/claude-memory-kit 0.3.3 → 0.3.5

package/README.md CHANGED
@@ -20,7 +20,7 @@
20
20
  - **Bounded by compression** — session → daily → weekly Haiku rollups (cron or lazy-on-read) keep the snapshot small as history grows. The session-buffer rollup self-heals at session start too, so memory stays bounded even if you never cleanly close the window.
21
21
  - **Don't start empty — import the rules you already own** — `cmk import-claude-md` parses an existing `CLAUDE.md` / `.cursorrules` / `AGENTS.md` into typed, searchable facts through the same safe write path (secret screening, sanitization, dedup), with provenance back to source file + line. `--dry-run` previews first.
22
22
  - **Per-project, in-repo** — `context/` lives inside your project and travels with `git clone`. Each project keeps its own memory.
23
- - **8 health checks** — `cmk doctor` validates hook wiring, distill freshness, transcript firing, INDEX consistency, cron registration, native-memory coexistence, stale locks, and native-binding health (npm 12 readiness) — each failure with a repair command.
23
+ - **9 health checks** — `cmk doctor` validates hook wiring, distill freshness, transcript firing, INDEX consistency, cron registration, native-memory coexistence, stale locks, native-binding health (npm 12 readiness), and version drift (a project scaffold behind your installed `cmk` after an update) — each failure with a repair command.
24
24
 
25
25
  ## Install — pick ONE route
26
26
 
@@ -62,7 +62,7 @@ Most-used commands (full list via `cmk --help`):
62
62
  | Command | Purpose |
63
63
  | --- | --- |
64
64
  | `cmk install` | Scaffold `context/` + the `memory-write`/`memory-search` skills + `.gitignore` + CLAUDE.md block + wire hooks (`--no-hooks` for scaffold-only) |
65
- | `cmk doctor` | Run HC-1..HC-8 health checks, surface repair commands |
65
+ | `cmk doctor` | Run HC-1..HC-9 health checks, surface repair commands |
66
66
  | `cmk repair --hooks` / `--locks` / `--index` / `--all` | Idempotent self-repair |
67
67
  | `cmk search "<query>" [--mode keyword\|semantic\|hybrid] [--scope facts\|transcripts\|decisions]` | Search memory — by meaning with the embedder (hybrid default after `--with-semantic`); `--scope transcripts` = the raw session record; `--scope decisions` = the decision journal (history / "what did we reject") |
68
68
  | `cmk get <id…>` / `cmk timeline <id>` / `cmk cite <id>` / `cmk recent-activity` | Read the index back — full fact bodies + provenance, sequential context around an observation, a canonical citation link, recent changes (the CLI side of the `mk_*` MCP read tools) |
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@lh8ppl/claude-memory-kit",
3
- "version": "0.3.3",
3
+ "version": "0.3.5",
4
4
  "description": "cmk — the CLI for claude-memory-kit. Per-project, in-repo memory system for Claude Code.",
5
5
  "type": "module",
6
6
  "bin": {
package/src/claude-md.mjs CHANGED
@@ -64,7 +64,10 @@ function buildBlock(content, version) {
64
64
  * gracefully from a corrupted block (e.g. the user accidentally
65
65
  * deleted the end marker by hand).
66
66
  */
67
- function findManagedBlock(text) {
67
+ // Exported (Task 162) for version-drift.mjs (HC-9) — reads the managed-block
68
+ // version marker without re-implementing the parser. Public contract: returns
69
+ // `{version, corrupted, ...}` or null.
70
+ export function findManagedBlock(text) {
68
71
  const startMatch = text.match(MARKER_START_RE);
69
72
  if (!startMatch) return null;
70
73
 
@@ -107,7 +110,9 @@ function parseVersion(v) {
107
110
  * compareVersions('1.0.0', '1.0.0') === 0
108
111
  * compareVersions('2.0.0', '1.9.9') === 1
109
112
  */
110
- function compareVersions(a, b) {
113
+ // Exported (Task 162) for version-drift.mjs (HC-9). Public contract: -1/0/1,
114
+ // strips a `-prerelease` suffix before comparing.
115
+ export function compareVersions(a, b) {
111
116
  const av = parseVersion(a);
112
117
  const bv = parseVersion(b);
113
118
  for (let i = 0; i < 3; i++) {
package/src/compress-retry.mjs ADDED
@@ -0,0 +1,180 @@
1
+ // Bounded, transient-only retry for the Haiku compress call (Task 161 / D-175).
2
+ //
3
+ // WHY this exists: the v0.3.3 cut-gate surfaced `haiku_timeout` / `compress_failed`
4
+ // failures on the compression path. Measurement (D-174) proved the failure is
5
+ // ENVIRONMENTAL/TRANSIENT, not input-size-driven — the kit's own compress.log shows
6
+ // the largest SUCCESS (470 KB) bigger than the largest timeout (334 KB), and a 9 KB
7
+ // input timed out. So the fix is a RETRY (a re-call usually succeeds), not an input
8
+ // cap. The kit inherited the no-retry shape from claude-remember (its precedent —
9
+ // which doesn't retry either); the rest of the field does.
10
+ //
11
+ // The SHAPE is grounded in a 9-system code read
12
+ // (docs/research/2026-06-19-llm-call-retry-patterns-cross-system.md), which converges
13
+ // unanimously on: bounded attempts, exponential backoff, retry ONLY the transient
14
+ // class keyed on the error TYPE, NEVER the deterministic class, reraise after
15
+ // exhaustion. graphiti's `is_server_or_retry_error` predicate + Letta's
16
+ // ValueError(retry)-vs-RuntimeError(don't) split are the model.
17
+ //
18
+ // COMPOSITION (design §8.5 / D-42): the SessionEnd-hook `compressSession` runs under
19
+ // a 60s ceiling CONCURRENT with autoPersona — a 50s attempt + a 50s retry = 100s
20
+ // blows the ceiling. So callers under the ceiling pass `maxAttempts: 1` (no retry —
21
+ // they delegate the retry to the ceiling-free lazy path via the existing
22
+ // restore-on-failure, D-79); only the ceiling-free paths (daily-distill /
23
+ // weekly-curate / lazy compress) pass `maxAttempts: 2`.
24
+ //
25
+ // COOLDOWN INTERACTION (skill-review I-1): the 120s Haiku cooldown marker is touched
26
+ // on SUCCESS only, and the callers gate `isCooldownActive` ONCE before the retry loop.
27
+ // A retry WIDENS the existing "no marker until success" window (~50s → ~100s), so a
28
+ // second hook firing mid-retry could pass the gate and start its own compress. This
29
+ // is NOT a new bug class — the window pre-exists the retry — and the D-79 claim-rename
30
+ // mutex (renameSync of now.md is the real lock) still prevents any corruption: only
31
+ // one roll wins the rename; the other reads an empty buffer and skips. The retry only
32
+ // makes a pre-existing benign window ~2× wider; no marker change is warranted.
33
+ //
34
+ // NO JITTER (skill-review M-2): the field (graphiti) jitters its backoff
35
+ // (wait_random_exponential) to avoid thundering-herd across many concurrent clients.
36
+ // The kit's compress is a single low-concurrency local process (one detached child at
37
+ // a time, gated by the cooldown), so there is no herd to avoid — a plain exponential
38
+ // backoff is sufficient and keeps the timing deterministic for tests.
39
+
40
+ // The timeout for compress callers that have NO outer hook ceiling — the cron /
41
+ // detached-lazy children (daily-distill, weekly-curate, the lazy compressSession).
42
+ // The D-92/F-2 composition rule: a ceiling-free caller must NOT inherit the
43
+ // hook-sized 50s bound (which is sized under the 60s SessionEnd ceiling, §8.5).
44
+ // 120s is chosen against MEASURED `claude --print` latency: it runs ~18-27s when
45
+ // fast but was observed at 78s (Task 163 live, on a 4.7KB input) and 89s (the v0.3.4
46
+ // cut-gate, 10KB) in slow-Haiku windows — environmental, not size-driven (D-174).
47
+ // 120s clears those with headroom; the 50s budget killed them needlessly, leaving
48
+ // `recent.md` stale (D-179). One constant so the family can't drift back to 50s.
49
+ // (auto-extract uses its own 90s under the Stop hook — a separate detached path.)
50
+ export const CEILING_FREE_TIMEOUT_MS = 120_000;
51
+
52
+ // The backoff BETWEEN retries on the ceiling-free paths. The default baseBackoffMs
53
+ // (600ms) is far too short for the kit's failure mode: `claude --print` slowness is
54
+ // a transient WINDOW (slow for a stretch, then fine — D-174), and the whole point of
55
+ // backoff is to let that window PASS before retrying. A 600ms wait retries while
56
+ // still INSIDE the same slow window, so attempt 2 hits the same slowness and also
57
+ // times out. The field waits SECONDS for exactly this reason (graphiti 5-120s,
58
+ // Letta cap 10s, mempalace 2-8s — all checked across 19 systems; NONE use sub-second
59
+ // backoff, and NONE escalate the timeout itself). 5s is the field's low end — one
60
+ // 5s wait between the 2 ceiling-free attempts gives the slow window room to clear.
61
+ // Safe on every path: the HOOK path is maxAttempts:1 (no retry → backoff never
62
+ // fires); the ceiling-free paths run detached/cron, so a multi-second wait is free.
63
+ export const CEILING_FREE_BACKOFF_MS = 5_000;
64
+
65
+ /**
66
+ * Classify a compress() rejection as transient (worth a retry) or deterministic
67
+ * (a re-call re-fails identically — don't waste the attempt or the budget).
68
+ *
69
+ * Transient (retry):
70
+ * - HaikuTimeoutError (`category: 'haiku_timeout'`) — `claude --print` was slow;
71
+ * the D-174 environmental case. A re-call usually succeeds.
72
+ * - HaikuFailedError (`category: 'haiku_failed'`) whose stderr looks like a
73
+ * transient server/overload/rate-limit blip (the field's 5xx/429/overloaded
74
+ * class), classified from the `exit_code`/`stderr` that 161.6a now captures.
75
+ *
76
+ * Deterministic (do NOT retry):
77
+ * - A spawn error (`code: 'ENOENT'` etc.) — the binary isn't there; re-spawning
78
+ * re-fails.
79
+ * - A HaikuFailedError whose stderr is a known-deterministic class (auth /
80
+ * invalid-key / policy / bad-request) — retrying always re-fails (graphiti's
81
+ * explicit "retrying policy-violating content will always fail").
82
+ * - Anything unrecognized — default to NOT retryable (conservative: an unknown
83
+ * failure is more likely a real bug than a blip, and a wasted retry costs the
84
+ * hook budget).
85
+ *
86
+ * @param {unknown} err
87
+ * @returns {boolean}
88
+ */
89
+ export function isRetryableCompressError(err) {
90
+ if (!err || typeof err !== 'object') return false;
91
+
92
+ // Timeout = the transient/environmental case (D-174). Always retryable.
93
+ if (err.category === 'haiku_timeout') return true;
94
+
95
+ // Spawn-level failure (ENOENT / EACCES / EINVAL) — the binary/permissions are
96
+ // wrong; re-spawning re-fails identically. Never retryable.
97
+ if (typeof err.code === 'string' && /^E[A-Z]+$/.test(err.code)) return false;
98
+
99
+ // Non-zero exit — conditional on WHY (the exit_code/stderr 161.6a captures).
100
+ if (err.category === 'haiku_failed') {
101
+ const stderr = String(err.stderr ?? '').toLowerCase();
102
+ // Known-DETERMINISTIC classes: a re-call re-fails. Never retry these.
103
+ // (Skill-review I-2: `not found` was DROPPED — it appears in transient
104
+ // contexts too, e.g. a transient "host not found" / "upstream not found,
105
+ // retrying"; a deterministic 404 from `claude --print` is unlikely, and the
106
+ // conservative default below already catches a genuine unknown deterministic
107
+ // failure. Keeping only HIGH-CONFIDENCE deterministic markers.)
108
+ if (
109
+ /auth|invalid[_ -]?(api[_ -]?)?key|unauthor|forbidden|permission|policy|invalid[_ -]?request|bad[_ -]?request/.test(
110
+ stderr,
111
+ )
112
+ ) {
113
+ return false;
114
+ }
115
+ // Known-TRANSIENT classes: server/overload/rate blips recover on a re-call.
116
+ if (/overload|rate[_ -]?limit|429|5\d\d|timeout|timed[_ -]?out|temporar|unavailable|connection|network|reset/.test(stderr)) {
117
+ return true;
118
+ }
119
+ // Unknown non-zero exit → conservative: do NOT retry (treat as deterministic).
120
+ return false;
121
+ }
122
+
123
+ return false;
124
+ }
125
+
126
+ /**
127
+ * Call `backend.compress(opts)` with a bounded, transient-only retry.
128
+ *
129
+ * @param {{compress: (opts: object) => Promise<any>}} backend
130
+ * @param {object} opts — passed verbatim to backend.compress on every attempt.
131
+ * @param {object} [config]
132
+ * @param {number} [config.maxAttempts=2] — TOTAL attempts (1 = no retry; the ceiling-bound contract). Field range is 2–4; the kit uses ≤2 (one retry) to fit the budget.
133
+ * @param {number} [config.baseBackoffMs=600] — exponential backoff base: wait `baseBackoffMs * 2**(attempt-1)` before attempt N+1. 0 disables the wait (tests).
134
+ * @param {(err: unknown) => boolean} [config.isRetryable=isRetryableCompressError]
135
+ * @param {(ms: number) => Promise<void>} [config.sleep] — injectable for tests.
136
+ * @param {(info: {attempt: number, error: unknown}) => void} [config.onRetry] — fired once
137
+ * per retry (Task 161.12 observability), BEFORE the backoff, with the FAILED attempt
138
+ * number + the (transient) error. Callers use it to record a `retries` count on their
139
+ * compress.log entry so a frequent-retry rate (the degrading-environment signal D-174
140
+ * is about) is visible. Not fired on a first-try success or a non-retryable failure.
141
+ * @returns {Promise<any>} the backend.compress result; reraises the last error after exhaustion.
142
+ */
143
+ export async function compressWithRetry(
144
+ backend,
145
+ opts,
146
+ {
147
+ maxAttempts = 2,
148
+ baseBackoffMs = 600,
149
+ isRetryable = isRetryableCompressError,
150
+ sleep = (ms) => new Promise((r) => setTimeout(r, ms)),
151
+ onRetry,
152
+ } = {},
153
+ ) {
154
+ const attempts = Math.max(1, maxAttempts);
155
+ let lastErr;
156
+ for (let attempt = 1; attempt <= attempts; attempt++) {
157
+ try {
158
+ return await backend.compress(opts);
159
+ } catch (err) {
160
+ lastErr = err;
161
+ // Stop immediately if this is the last attempt OR the error isn't transient.
162
+ if (attempt >= attempts || !isRetryable(err)) {
163
+ throw err;
164
+ }
165
+ // We're going to retry — surface it for observability (161.12), before the wait.
166
+ if (typeof onRetry === 'function') {
167
+ try {
168
+ onRetry({ attempt, error: err });
169
+ } catch {
170
+ // onRetry is best-effort instrumentation — never let it break the retry.
171
+ }
172
+ }
173
+ // Exponential backoff before the next attempt (skip the wait when base is 0).
174
+ const delay = baseBackoffMs > 0 ? baseBackoffMs * 2 ** (attempt - 1) : 0;
175
+ if (delay > 0) await sleep(delay);
176
+ }
177
+ }
178
+ // Unreachable (the loop either returns or throws), but satisfies control-flow analysis.
179
+ throw lastErr;
180
+ }
@@ -37,6 +37,7 @@ import { join, dirname } from 'node:path';
37
37
  import { nowIso } from './audit-log.mjs';
38
38
  import { ERROR_CATEGORIES } from './result-shapes.mjs';
39
39
  import { HaikuTimeoutError } from './compressor.mjs';
40
+ import { compressWithRetry } from './compress-retry.mjs';
40
41
  import {
41
42
  DEFAULT_COOLDOWN_MS,
42
43
  isCooldownActive,
@@ -225,6 +226,22 @@ export async function compressSession({
225
226
  now,
226
227
  cooldownMs = DEFAULT_COOLDOWN_MS,
227
228
  maxOutputBytes = DEFAULT_MAX_OUTPUT_BYTES,
229
+ // Task 161 / D-175: retry policy. DEFAULT 1 = NO retry — the SessionEnd-hook
230
+ // contract: this fn runs under the 60s ceiling CONCURRENT with autoPersona, where
231
+ // a 50s attempt + a 50s retry = 100s blows the ceiling. The ceiling-free LAZY
232
+ // caller (runLazyCompress) passes maxAttempts:2 to opt into one retry; the hook
233
+ // keeps its restore-on-failure (D-79) and delegates the retry to that lazy path.
234
+ maxAttempts = 1,
235
+ // DEFAULT 50s = the SessionEnd-hook budget (sized under the 60s ceiling, §8.5).
236
+ // The ceiling-free LAZY caller (runLazyCompress, a detached SessionStart child
237
+ // with NO outer ceiling) passes 120s so a slow-but-not-broken `claude --print`
238
+ // window doesn't time out needlessly — the D-92/F-2 composition rule: a
239
+ // ceiling-free caller must not inherit a ceiling-sized timeout.
240
+ timeoutMs = 50_000,
241
+ // Backoff between retries (only the lazy maxAttempts:2 path retries). DEFAULT
242
+ // undefined → compressWithRetry's 600ms; the ceiling-free LAZY caller passes the
243
+ // 5s ceiling-free backoff so a retry lands AFTER the slow-Haiku window (D-179).
244
+ baseBackoffMs,
228
245
  } = {}) {
229
246
  const ts = now ?? nowIso();
230
247
  const date = dateFromIso(ts);
@@ -325,14 +342,23 @@ export async function compressSession({
325
342
  // restoreRolling call, so the buffer is never stranded in the rolling file.
326
343
  // See design §8.5 for the composition rationale.
327
344
  let result;
345
+ let retries = 0; // Task 161.12: count retries (only the lazy maxAttempts:2 path can retry).
328
346
  try {
329
- result = await backend.compress({
330
- input: wrapBufferForPrompt(buffer),
331
- instructions,
332
- preserveCitationIds: true,
333
- maxOutputBytes,
334
- timeoutMs: 50_000,
335
- });
347
+ // maxAttempts default 1 (hook contract: no retry); the lazy caller passes 2.
348
+ // compressWithRetry is a no-op wrapper at maxAttempts:1 (single attempt, reraise).
349
+ result = await compressWithRetry(
350
+ backend,
351
+ {
352
+ input: wrapBufferForPrompt(buffer),
353
+ instructions,
354
+ preserveCitationIds: true,
355
+ maxOutputBytes,
356
+ timeoutMs,
357
+ },
358
+ // baseBackoffMs only forwarded when the caller set it (the lazy ceiling-free
359
+ // path passes the 5s backoff); undefined → compressWithRetry's 600ms default.
360
+ { maxAttempts, ...(baseBackoffMs != null ? { baseBackoffMs } : {}), onRetry: () => { retries += 1; } },
361
+ );
336
362
  } catch (err) {
337
363
  // Distinguish HAIKU_TIMEOUT (slow Anthropic) from COMPRESS_FAILED
338
364
  // (non-zero subprocess exit / spawn ENOENT / etc). Analytics
@@ -357,6 +383,13 @@ export async function compressSession({
357
383
  duration_ms,
358
384
  success: false,
359
385
  error_category: errorCategory,
386
+ // Task 161 (D-173 observability): capture the STRUCTURED failure reason
387
+ // (subprocess exit code + stderr) so a `compress_failed` is diagnosable.
388
+ // Pre-161 the log kept only error_category — the WHY was discarded, which
389
+ // is why the kit's own 329-byte compress_failed could not be explained.
390
+ ...(err?.exitCode != null ? { exit_code: err.exitCode } : {}),
391
+ ...(err?.stderr ? { error_detail: String(err.stderr).slice(0, 500) } : {}),
392
+ ...(retries > 0 ? { retries } : {}), // 161.12: failed AFTER retrying (lazy path)
360
393
  };
361
394
  writeCompressLogEntry({ projectRoot, date, entry });
362
395
  return {
@@ -397,6 +430,7 @@ export async function compressSession({
397
430
  cost_usd: result?.costUSD ?? 0,
398
431
  duration_ms,
399
432
  success: true,
433
+ ...(retries > 0 ? { retries } : {}), // 161.12: succeeded after a transient retry (lazy path)
400
434
  };
401
435
  writeCompressLogEntry({ projectRoot, date, entry });
402
436
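The log entries above add their new fields through conditional object spreads, so a key appears only when it carries information. A minimal sketch of that pattern follows; `failureFields` is a hypothetical helper written for illustration, and only the three spread expressions are taken from the hunks above.

```javascript
// Hypothetical helper: assembles the optional failure fields the way the hunks above do.
function failureFields(err, retries) {
  return {
    success: false,
    // Present only when the error carries a subprocess exit code.
    ...(err?.exitCode != null ? { exit_code: err.exitCode } : {}),
    // Present only when stderr is non-empty; capped at 500 characters.
    ...(err?.stderr ? { error_detail: String(err.stderr).slice(0, 500) } : {}),
    // Present only when at least one retry happened.
    ...(retries > 0 ? { retries } : {}),
  };
}

// A timeout after one retry: no exit code or stderr, so only `retries` is added.
const timeoutEntry = failureFields({ category: 'haiku_timeout' }, 1);

// A non-zero exit with a long stderr and no retry: detail is truncated, `retries` is omitted.
const exitEntry = failureFields({ exitCode: 1, stderr: 'x'.repeat(2000) }, 0);
```

Because the check is `!= null`, an exit code of `0` would still be recorded, whereas an empty `stderr` string is dropped.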
 
package/src/compressor.mjs CHANGED
@@ -94,6 +94,23 @@ export class HaikuTimeoutError extends Error {
94
94
  }
95
95
  }
96
96
 
97
+ // Non-zero subprocess exit (the `compress_failed` category). Carries the
98
+ // STRUCTURED exit code + captured stderr so callers can write the real
99
+ // failure reason into compress.log — pre-161 this was a plain Error with
100
+ // the detail buried in `.message`, and the log kept only `error_category`,
101
+ // making a `compress_failed` undiagnosable (the 329-byte failure in the
102
+ // kit's own log that the D-173 investigation could not explain). Mirrors
103
+ // HaikuTimeoutError so the two failure modes carry parallel diagnostics.
104
+ export class HaikuFailedError extends Error {
105
+ constructor(message, { exitCode, stderr }) {
106
+ super(message);
107
+ this.name = 'HaikuFailedError';
108
+ this.category = 'haiku_failed';
109
+ this.exitCode = exitCode ?? null;
110
+ this.stderr = stderr ?? '';
111
+ }
112
+ }
113
+
97
114
  // SIGTERM → grace window → SIGKILL escalation. Exported so the kill
98
115
  // chain itself is independently testable against real OS processes
99
116
  // (see tests/spawn-smoke-kill-chain.test.js) — the production code
@@ -292,8 +309,9 @@ export class HaikuViaAnthropicApi extends CompressorBackend {
292
309
  if (settled) return; // timeout already fired
293
310
  if (code !== 0) {
294
311
  settleReject(
295
- new Error(
312
+ new HaikuFailedError(
296
313
  `HaikuViaAnthropicApi: claude --print exit ${code}: ${stderr.trim() || '(no stderr)'}`,
314
+ { exitCode: code, stderr: stderr.trim() },
297
315
  ),
298
316
  );
299
317
  return;
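The structured `stderr` carried by `HaikuFailedError` is what the retry classifier keys on. The sketch below restates that classification for a few sample inputs. The two regular expressions are copied from `isRetryableCompressError` in this diff; `isRetryableSketch`, `failed`, and the sample stderr strings are hypothetical and are not real `claude --print` output.

```javascript
// Patterns copied from isRetryableCompressError as shown in this diff.
const DETERMINISTIC = /auth|invalid[_ -]?(api[_ -]?)?key|unauthor|forbidden|permission|policy|invalid[_ -]?request|bad[_ -]?request/;
const TRANSIENT = /overload|rate[_ -]?limit|429|5\d\d|timeout|timed[_ -]?out|temporar|unavailable|connection|network|reset/;

// Hypothetical restatement of the classifier, for experimentation only.
function isRetryableSketch(err) {
  if (!err || typeof err !== 'object') return false;
  if (err.category === 'haiku_timeout') return true;                              // always transient
  if (typeof err.code === 'string' && /^E[A-Z]+$/.test(err.code)) return false;   // spawn-level failure
  if (err.category !== 'haiku_failed') return false;                              // unrecognized: do not retry
  const stderr = String(err.stderr ?? '').toLowerCase();
  if (DETERMINISTIC.test(stderr)) return false;                                   // checked first, so it wins
  return TRANSIENT.test(stderr);                                                  // unknown stderr yields false
}

// Made-up error objects shaped like HaikuFailedError instances.
const failed = (stderr) => ({ category: 'haiku_failed', exitCode: 1, stderr });

const verdicts = {
  overloaded: isRetryableSketch(failed('API Error: 529 Overloaded')),
  badKey: isRetryableSketch(failed('Invalid API key')),
  authTimeout: isRetryableSketch(failed('authentication service timed out')),
  unknown: isRetryableSketch(failed('segmentation fault')),
  spawn: isRetryableSketch({ code: 'ENOENT' }),
  timeout: isRetryableSketch({ category: 'haiku_timeout' }),
};
```

The `authTimeout` case shows the ordering: a message that matches both pattern sets is treated as deterministic, because the deterministic test runs before the transient one.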
package/src/daily-distill.mjs CHANGED
@@ -28,6 +28,7 @@ import { join } from 'node:path';
28
28
  import { nowIso } from './audit-log.mjs';
29
29
  import { ERROR_CATEGORIES } from './result-shapes.mjs';
30
30
  import { HaikuTimeoutError } from './compressor.mjs';
31
+ import { compressWithRetry, CEILING_FREE_TIMEOUT_MS, CEILING_FREE_BACKOFF_MS } from './compress-retry.mjs';
31
32
  import {
32
33
  DEFAULT_COOLDOWN_MS,
33
34
  isCooldownActive,
@@ -195,14 +196,27 @@ export async function dailyDistill({
195
196
  const instructions = buildDistillInstructions(maxOutputBytes);
196
197
 
197
198
  let result;
199
+ let retries = 0; // Task 161.12: count retries so the log shows the retry RATE.
198
200
  try {
199
- result = await backend.compress({
200
- input: buffer,
201
- instructions,
202
- preserveCitationIds: true,
203
- maxOutputBytes,
204
- timeoutMs: 50_000,
205
- });
201
+ // Task 161 / D-175: ceiling-free path (cron/detached child, NO 60s hook ceiling)
202
+ // → bounded transient-only retry. A re-call recovers the D-174 environmental
203
+ // timeout / transient non-zero exit; a deterministic failure (ENOENT/auth) fails
204
+ // fast (isRetryableCompressError). maxAttempts:2 = one retry.
205
+ result = await compressWithRetry(
206
+ backend,
207
+ {
208
+ input: buffer,
209
+ instructions,
210
+ preserveCitationIds: true,
211
+ maxOutputBytes,
212
+ // Ceiling-free (cron / detached lazy child, NO 60s hook ceiling) → the
213
+ // generous ceiling-free timeout, NOT the hook-sized 50s (D-92/F-2 + D-179).
214
+ timeoutMs: CEILING_FREE_TIMEOUT_MS,
215
+ },
216
+ // 5s backoff between the 2 attempts (NOT the 600ms default) so a retry lands
217
+ // AFTER the transient slow-Haiku window, not inside it (D-179).
218
+ { maxAttempts: 2, baseBackoffMs: CEILING_FREE_BACKOFF_MS, onRetry: () => { retries += 1; } },
219
+ );
206
220
  touchCooldownMarker({ projectRoot, now: ts });
207
221
  } catch (err) {
208
222
  touchCooldownMarker({ projectRoot, now: ts });
@@ -217,6 +231,10 @@ export async function dailyDistill({
217
231
  ts, scope: 'daily-distill', input_bytes, output_bytes: 0,
218
232
  model_id: typeof backend.modelId === 'function' ? backend.modelId() : null,
219
233
  cost_usd: 0, duration_ms, success: false, error_category: errorCategory,
234
+ // Task 161 (D-173 observability): structured failure reason — see compress-session.mjs.
235
+ ...(err?.exitCode != null ? { exit_code: err.exitCode } : {}),
236
+ ...(err?.stderr ? { error_detail: String(err.stderr).slice(0, 500) } : {}),
237
+ ...(retries > 0 ? { retries } : {}), // 161.12: failed AFTER retrying
220
238
  },
221
239
  });
222
240
  return {
@@ -246,6 +264,7 @@ export async function dailyDistill({
246
264
  (typeof backend.modelId === 'function' ? backend.modelId() : null),
247
265
  cost_usd: result?.costUSD ?? 0,
248
266
  duration_ms, success: true, source_days: files.length,
267
+ ...(retries > 0 ? { retries } : {}), // 161.12: succeeded after a transient retry
249
268
  },
250
269
  });
251
270
  return {
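The comments above say the `retries` field exists so that the retry rate becomes visible in the log. A minimal sketch of reading that signal follows. It assumes the log entries have already been parsed into an array of objects; the on-disk format of `compress.log` is not shown in this diff, and `retryRate` and the sample entries are hypothetical.

```javascript
// Hypothetical: fraction of log entries that needed at least one retry.
// Entries without a `retries` key count as zero retries, matching the
// conditional spread that omits the key when no retry happened.
function retryRate(entries) {
  if (entries.length === 0) return 0;
  const retried = entries.filter((e) => (e.retries ?? 0) > 0).length;
  return retried / entries.length;
}

// Made-up sample entries using field names that appear in the hunks above.
const sampleEntries = [
  { scope: 'daily-distill', success: true },
  { scope: 'daily-distill', success: true, retries: 1 },
  { scope: 'daily-distill', success: false, error_category: 'haiku_timeout', retries: 1 },
  { scope: 'daily-distill', success: true },
];

const rate = retryRate(sampleEntries);
```

A rising rate over time would be the degrading-environment signal the comments describe; what threshold counts as "frequent" is left open here.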
package/src/doctor.mjs CHANGED
@@ -1,4 +1,4 @@
1
- // `cmk doctor` — health checks HC-1..HC-8 (Task 37, T-031; memsearch HC-1/HC-7 removed in Task 120; HC-8 native bindings added in Task 141a).
1
+ // `cmk doctor` — health checks HC-1..HC-9 (Task 37, T-031; memsearch HC-1/HC-7 removed in Task 120; HC-8 native bindings added in Task 141a; HC-9 version-drift/update-path added in Task 162 / D-176).
2
2
  //
3
3
  // Public boundary:
4
4
  // async runDoctor({projectRoot, userDir, now, promptUser?, ...overrides})
@@ -46,6 +46,8 @@ import { cronSentinelPath } from './lazy-compress.mjs';
46
46
  import { getNativeAutoMemoryState } from './native-memory.mjs';
47
47
  import { checkKitBinding, checkEmbedderBinding } from './native-binding.mjs';
48
48
  import { resolveDefaultSearchMode } from './semantic-backend.mjs';
49
+ import { checkVersionDrift } from './version-drift.mjs';
50
+ import { getKitVersion } from './install.mjs';
49
51
 
50
52
  const TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000;
51
53
  const THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000;
@@ -541,10 +543,27 @@ async function hc8NativeBindings({ projectRoot, kitBindingProbe, embedderBinding
541
543
  * parameter lands at that PR alongside the actual consent flow — not
542
544
  * pre-empted in v0.1.0 to avoid the "forward-compat hooks rot" pattern.
543
545
  */
546
+ // --- HC-9: project scaffold version matches the installed cmk (Task 162 / D-176) ---
547
+ // After `npm i -g @latest`, a project's version-stamped scaffold stays at the OLD
548
+ // version until `cmk install` re-runs there (the easily-forgotten per-project step).
549
+ // HC-9 reads the project's CLAUDE.md managed-block version + the installed binary
550
+ // version and tells the user to re-run `cmk install` when the project is behind.
551
+ function hc9VersionDrift({ projectRoot, kitVersion }) {
552
+ const claudeMdPath = join(projectRoot, 'CLAUDE.md');
553
+ let claudeMdText = null;
554
+ try {
555
+ if (existsSync(claudeMdPath)) claudeMdText = readFileSync(claudeMdPath, 'utf8');
556
+ } catch {
557
+ claudeMdText = null; // unreadable → skip (treated as not-installed)
558
+ }
559
+ return checkVersionDrift({ claudeMdText, kitVersion });
560
+ }
561
+
544
562
  export async function runDoctor({
545
563
  projectRoot,
546
564
  userDir,
547
565
  now,
566
+ kitVersion,
548
567
  kitBindingProbe,
549
568
  embedderBindingProbe,
550
569
  } = {}) {
@@ -569,10 +588,12 @@ export async function runDoctor({
569
588
  const c6 = hc6NativeAutoMemory({ projectRoot, now: ts });
570
589
  const c7 = hc7StaleLocks({ projectRoot, userDir: resolvedUserDir });
571
590
  const c8 = await hc8NativeBindings({ projectRoot, kitBindingProbe, embedderBindingProbe });
591
+ // HC-9: kitVersion injectable for tests; defaults to the installed binary's version.
592
+ const c9 = hc9VersionDrift({ projectRoot, kitVersion: kitVersion ?? getKitVersion() });
572
593
 
573
594
  return {
574
595
  action: 'completed',
575
- checks: [c1, c2, c3, c4, c5, c6, c7, c8],
596
+ checks: [c1, c2, c3, c4, c5, c6, c7, c8, c9],
576
597
  duration_ms: Date.now() - t0,
577
598
  };
578
599
  }
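A caller of `runDoctor` receives the `checks` array and can surface the repair command for each failure. The sketch below is one hypothetical way to do that. The check shape (`id`, `name`, `status`, `message`, optional `recoveryCommand`) follows the HC-9 return type documented in version-drift.mjs and is assumed, not confirmed, to hold for the other checks; `summarizeChecks` and the HC-1 sample name are invented for illustration.

```javascript
// Hypothetical formatter for a doctor result's `checks` array.
function summarizeChecks(checks) {
  const lines = checks.map((c) => {
    // Only failing checks with a recovery command get a fix hint.
    const fix = c.status === 'fail' && c.recoveryCommand ? ` (fix: ${c.recoveryCommand})` : '';
    return `[${c.status}] ${c.id} ${c.name}${fix}`;
  });
  const failed = checks.filter((c) => c.status === 'fail').map((c) => c.id);
  return { lines, failed };
}

// Made-up sample result; only the HC-9 name and command come from the diff.
const summary = summarizeChecks([
  { id: 'HC-1', name: 'Hook wiring', status: 'pass', message: 'ok' },
  {
    id: 'HC-9',
    name: 'Project scaffold version matches the installed cmk',
    status: 'fail',
    message: 'scaffold is behind the installed cmk',
    recoveryCommand: 'cmk install',
  },
]);
```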
package/src/lazy-compress.mjs CHANGED
@@ -44,6 +44,7 @@ import {
44
44
  import { dailyDistill } from './daily-distill.mjs';
45
45
  import { weeklyCurate } from './weekly-curate.mjs';
46
46
  import { compressSession } from './compress-session.mjs';
47
+ import { CEILING_FREE_TIMEOUT_MS, CEILING_FREE_BACKOFF_MS } from './compress-retry.mjs';
47
48
  import { syncDecisionsJournal } from './decisions-journal.mjs';
48
49
 
49
50
  const DEFAULT_DAILY_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours
@@ -411,6 +412,16 @@ export async function runLazyCompress({
411
412
  backend,
412
413
  now: ts,
413
414
  cooldownMs: 0,
415
+ // Task 161 / D-175: the lazy path is a DETACHED SessionStart child with NO 60s
416
+ // hook ceiling, so it opts into the one retry the hook path can't afford. This
417
+ // is where the SessionEnd-hook's failed roll (which restored now.md, D-79) gets
418
+ // its real bounded retry.
419
+ maxAttempts: 2,
420
+ // Ceiling-free (detached child, no 60s ceiling) → the generous timeout + the 5s
421
+ // backoff so a retry lands AFTER the slow-Haiku window (D-92/F-2 + D-179; matches
422
+ // daily-distill / weekly-curate). compressSession forwards both to compressWithRetry.
423
+ timeoutMs: CEILING_FREE_TIMEOUT_MS,
424
+ baseBackoffMs: CEILING_FREE_BACKOFF_MS,
414
425
  });
415
426
  } else if (verdict.action === 'stale-weekly') {
416
427
  delegatedTo = 'weekly-curate';
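The composition rule behind these parameters is plain arithmetic, worked through below. This is a sketch under one stated assumption: every attempt runs all the way to its timeout, and the extra time of the SIGTERM-to-SIGKILL grace window is ignored. `worstCaseMs` and `HOOK_CEILING_MS` are hypothetical names; the 50s, 60s, 120s, 600ms, and 5s values come from the comments in this diff.

```javascript
const HOOK_CEILING_MS = 60_000; // SessionEnd hook ceiling, per the comments above

// Worst case: each attempt consumes its full timeout, with an exponential
// wait of baseBackoffMs * 2**(attempt - 1) between consecutive attempts.
function worstCaseMs({ maxAttempts, timeoutMs, baseBackoffMs }) {
  let total = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    total += timeoutMs;
    if (attempt < maxAttempts) total += baseBackoffMs * 2 ** (attempt - 1);
  }
  return total;
}

// Hook path as shipped: one 50s attempt, no retry.
const hook = worstCaseMs({ maxAttempts: 1, timeoutMs: 50_000, baseBackoffMs: 600 });

// Hook path if it retried: 50s + 0.6s + 50s, well past the 60s ceiling.
const hookWithRetry = worstCaseMs({ maxAttempts: 2, timeoutMs: 50_000, baseBackoffMs: 600 });

// Ceiling-free lazy path: 120s + 5s + 120s, acceptable because nothing bounds it.
const lazy = worstCaseMs({ maxAttempts: 2, timeoutMs: 120_000, baseBackoffMs: 5_000 });

const hookFits = hook <= HOOK_CEILING_MS;
const hookRetryFits = hookWithRetry <= HOOK_CEILING_MS;
```

With these inputs the hook path stays at 50,000 ms, a retried hook would need 100,600 ms, and the lazy path can take up to 245,000 ms.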
package/src/version-drift.mjs ADDED
@@ -0,0 +1,72 @@
1
+ // HC-9: version-drift detection (Task 162 / D-176).
2
+ //
3
+ // WHY: after a user updates the global `cmk` (npm i -g @latest), a project's
4
+ // version-stamped scaffold — the CLAUDE.md managed block, the hooks, the skills —
5
+ // stays at the OLD version until `cmk install` re-runs in that project. Updating the
6
+ // npm package ALONE does not touch a project (the per-project re-install is the
7
+ // easily-forgotten step). Pre-162 the kit was silent about it (D-172: no update path).
8
+ // HC-9 makes `cmk doctor` TELL the user the project is behind + the exact command.
9
+ //
10
+ // The project's installed version lives in the CLAUDE.md managed-block start marker
11
+ // (`<!-- claude-memory-kit:start v0.3.3 -->`); the installed binary version is
12
+ // getKitVersion(). Drift = binary NEWER than the project marker → "run cmk install".
13
+ // A project marker NEWER than the binary is a downgrade (older global cli opening a
14
+ // newer-scaffolded project), NOT drift — flag pass, not a false alarm.
15
+
16
+ import { findManagedBlock, compareVersions } from './claude-md.mjs';
17
+
18
+ /**
19
+ * Pure HC-9 check. Injectable inputs (no disk read here) so the logic is unit-tested
20
+ * without a fixture tree; the doctor wiring reads CLAUDE.md + getKitVersion() and
21
+ * passes them in.
22
+ *
23
+ * @param {object} args
24
+ * @param {string|null} args.claudeMdText — the project's CLAUDE.md content, or null if absent.
25
+ * @param {string} args.kitVersion — the installed binary version (getKitVersion()).
26
+ * @returns {{id:'HC-9', name:string, status:'pass'|'fail'|'skip', message:string, recoveryCommand?:string}}
27
+ */
28
+ export function checkVersionDrift({ claudeMdText, kitVersion } = {}) {
29
+ const id = 'HC-9';
30
+ const name = 'Project scaffold version matches the installed cmk';
31
+
32
+ // No CLAUDE.md, or no managed block → the project isn't kit-installed (or the block
33
+ // was hand-removed). Not a drift signal; skip (HC-1/repair owns the missing-block case).
34
+ if (!claudeMdText) {
35
+ return { id, name, status: 'skip', message: 'no CLAUDE.md found — project not kit-installed' };
36
+ }
37
+ const block = findManagedBlock(claudeMdText);
38
+ if (!block) {
39
+ return { id, name, status: 'skip', message: 'no claude-memory-kit managed block in CLAUDE.md' };
40
+ }
41
+
42
+ // `block.version` is the `:start vX` marker value (findManagedBlock recovers it
43
+ // even from a corrupted/orphan-start block — a stale corrupted block still earns
44
+ // the `cmk install` advice, which fixes both). compareVersions strips any
45
+ // `-prerelease` tag, so a `v0.3.4-beta` scaffold reads as `0.3.4` (the kit ships
46
+ // no prereleases today; this is the intended "close enough" behavior).
47
+ const projectVersion = block.version;
48
+ const cmp = compareVersions(kitVersion, projectVersion);
49
+
50
+ if (cmp <= 0) {
51
+ // Binary == project (match) OR binary < project (a downgrade — older cli, newer
52
+ // scaffold). Neither is "re-run install to catch up." Pass.
53
+ return {
54
+ id,
55
+ name,
56
+ status: 'pass',
57
+ message:
58
+ cmp === 0
59
+ ? `project scaffold (v${projectVersion}) matches the installed cmk (v${kitVersion})`
60
+ : `project scaffold (v${projectVersion}) is newer than the installed cmk (v${kitVersion}) — likely an older global cli; not drift`,
61
+ };
62
+ }
63
+
64
+ // Binary NEWER than the project marker → the project is stale. THE drift case.
65
+ return {
66
+ id,
67
+ name,
68
+ status: 'fail',
69
+ message: `this project's scaffold is v${projectVersion} but your installed cmk is v${kitVersion} — re-run \`cmk install\` here to refresh the CLAUDE.md block, hooks, and skills (then restart Claude Code)`,
70
+ recoveryCommand: 'cmk install',
71
+ };
72
+ }
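The three HC-9 outcomes can be reproduced with a self-contained sketch. Everything named here is hypothetical: `START_MARKER` is a guess at the marker syntax based on the example in the comments (the package's real `MARKER_START_RE` is not shown in this diff), `compareSketch` follows only the documented contract of `compareVersions` (returns -1/0/1, strips a `-prerelease` suffix), and `driftStatus` leaves out the corrupted-block recovery that `findManagedBlock` performs.

```javascript
// Hypothetical marker pattern, modeled on `<!-- claude-memory-kit:start v0.3.3 -->`.
const START_MARKER = /<!--\s*claude-memory-kit:start\s+v(\d+\.\d+\.\d+(?:-[\w.]+)?)\s*-->/;

// Parse "major.minor.patch" into numbers, dropping any "-prerelease" suffix.
function parseCore(v) {
  return v.split('-')[0].split('.').map(Number);
}

// Returns -1, 0, or 1, comparing numerically component by component.
function compareSketch(a, b) {
  const av = parseCore(a);
  const bv = parseCore(b);
  for (let i = 0; i < 3; i++) {
    if (av[i] !== bv[i]) return av[i] < bv[i] ? -1 : 1;
  }
  return 0;
}

// 'skip' when there is nothing to compare, 'fail' only when the binary is newer.
function driftStatus(claudeMdText, kitVersion) {
  if (!claudeMdText) return 'skip';
  const m = claudeMdText.match(START_MARKER);
  if (!m) return 'skip';
  return compareSketch(kitVersion, m[1]) > 0 ? 'fail' : 'pass';
}

const claudeMd = '# Project\n<!-- claude-memory-kit:start v0.3.3 -->\nmanaged content';

const outcomes = {
  behind: driftStatus(claudeMd, '0.3.5'),    // binary newer than scaffold
  match: driftStatus(claudeMd, '0.3.3'),     // same version
  downgrade: driftStatus(claudeMd, '0.3.2'), // binary older than scaffold
  missing: driftStatus(null, '0.3.5'),       // no CLAUDE.md
  noBlock: driftStatus('# Project', '0.3.5') // no managed block
};
```

The numeric comparison matters once a component reaches two digits: `compareSketch('0.10.0', '0.9.9')` yields `1`, where a plain string comparison would order them the other way.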
package/src/weekly-curate.mjs CHANGED
@@ -39,6 +39,7 @@ import { canonicalize } from '@lh8ppl/cmk-canonicalize';
39
39
  import { nowIso } from './audit-log.mjs';
40
40
  import { ERROR_CATEGORIES, errorResult } from './result-shapes.mjs';
41
41
  import { HaikuTimeoutError } from './compressor.mjs';
42
+ import { compressWithRetry, CEILING_FREE_TIMEOUT_MS, CEILING_FREE_BACKOFF_MS } from './compress-retry.mjs';
42
43
  import {
43
44
  DEFAULT_COOLDOWN_MS,
44
45
  isCooldownActive,
@@ -343,7 +344,8 @@ export async function weeklyCurate({
343
344
  // corpus (heavier than a session summary) and runs as a cron/lazy child
344
345
  // with no 60s hook ceiling — give the classifier headroom so a large
345
346
  // corpus doesn't time out. (Corpus is byte-capped at PERSONA_CORPUS_BYTES.)
346
- timeoutMs: 120_000,
347
+ // The shared ceiling-free timeout (D-92/F-2; was the original 120s literal).
348
+ timeoutMs: CEILING_FREE_TIMEOUT_MS,
347
349
  });
348
350
  }
349
351
 
@@ -388,14 +390,23 @@ export async function weeklyCurate({
388
390
  const sourceDates = old.map((f) => f.date);
389
391
 
390
392
  let result;
393
+ let retries = 0; // Task 161.12: count retries so the log shows the retry RATE.
391
394
  try {
392
- result = await backend.compress({
393
- input: buffer,
394
- instructions,
395
- preserveCitationIds: true,
396
- maxOutputBytes: archiveMaxBytes,
397
- timeoutMs: 50_000,
398
- });
395
+ // Task 161 / D-175: ceiling-free path (cron/detached child, NO 60s hook ceiling)
396
+ // → bounded transient-only retry (maxAttempts:2 = one retry). See compress-retry.mjs.
397
+ result = await compressWithRetry(
398
+ backend,
399
+ {
400
+ input: buffer,
401
+ instructions,
402
+ preserveCitationIds: true,
403
+ maxOutputBytes: archiveMaxBytes,
404
+ // Ceiling-free → the generous timeout, NOT the hook-sized 50s (D-92/F-2 + D-179).
405
+ timeoutMs: CEILING_FREE_TIMEOUT_MS,
406
+ },
407
+ // 5s backoff so a retry lands after the slow-Haiku window, not inside it (D-179).
408
+ { maxAttempts: 2, baseBackoffMs: CEILING_FREE_BACKOFF_MS, onRetry: () => { retries += 1; } },
409
+ );
399
410
  touchCooldownMarker({ projectRoot, now: ts });
400
411
  } catch (err) {
401
412
  touchCooldownMarker({ projectRoot, now: ts });
@@ -418,6 +429,10 @@ export async function weeklyCurate({
418
429
  duration_ms,
419
430
  success: false,
420
431
  error_category: errorCategory,
432
+ // Task 161 (D-173 observability): structured failure reason — see compress-session.mjs.
433
+ ...(err?.exitCode != null ? { exit_code: err.exitCode } : {}),
434
+ ...(err?.stderr ? { error_detail: String(err.stderr).slice(0, 500) } : {}),
435
+ ...(retries > 0 ? { retries } : {}), // 161.12: failed AFTER retrying
421
436
  },
422
437
  });
423
438
  return errorResult({
@@ -487,6 +502,7 @@ export async function weeklyCurate({
487
502
  archived_days: old.length,
488
503
  current_days: current.length,
489
504
  recent_rebuild_action: recentResult?.action ?? 'skipped',
505
+ ...(retries > 0 ? { retries } : {}), // 161.12: succeeded after a transient retry
490
506
  ...(deletionErrors.length > 0 ? { deletion_errors: deletionErrors } : {}),
491
507
  },
492
508
  });