instar 1.3.583 → 1.3.585

Files changed (46)
  1. package/dist/commands/server.d.ts.map +1 -1
  2. package/dist/commands/server.js +15 -1
  3. package/dist/commands/server.js.map +1 -1
  4. package/dist/config/ConfigDefaults.d.ts.map +1 -1
  5. package/dist/config/ConfigDefaults.js +13 -0
  6. package/dist/config/ConfigDefaults.js.map +1 -1
  7. package/dist/core/PlaywrightProfileRegistry.d.ts +269 -0
  8. package/dist/core/PlaywrightProfileRegistry.d.ts.map +1 -0
  9. package/dist/core/PlaywrightProfileRegistry.js +640 -0
  10. package/dist/core/PlaywrightProfileRegistry.js.map +1 -0
  11. package/dist/core/PostUpdateMigrator.d.ts +21 -0
  12. package/dist/core/PostUpdateMigrator.d.ts.map +1 -1
  13. package/dist/core/PostUpdateMigrator.js +195 -0
  14. package/dist/core/PostUpdateMigrator.js.map +1 -1
  15. package/dist/core/devGatedFeatures.d.ts.map +1 -1
  16. package/dist/core/devGatedFeatures.js +6 -0
  17. package/dist/core/devGatedFeatures.js.map +1 -1
  18. package/dist/core/types.d.ts +13 -0
  19. package/dist/core/types.d.ts.map +1 -1
  20. package/dist/core/types.js.map +1 -1
  21. package/dist/messaging/MessageProcessingLedger.d.ts +19 -2
  22. package/dist/messaging/MessageProcessingLedger.d.ts.map +1 -1
  23. package/dist/messaging/MessageProcessingLedger.js +25 -4
  24. package/dist/messaging/MessageProcessingLedger.js.map +1 -1
  25. package/dist/messaging/stuckMessageRecovery.d.ts +10 -0
  26. package/dist/messaging/stuckMessageRecovery.d.ts.map +1 -1
  27. package/dist/messaging/stuckMessageRecovery.js +13 -5
  28. package/dist/messaging/stuckMessageRecovery.js.map +1 -1
  29. package/dist/scaffold/templates.d.ts.map +1 -1
  30. package/dist/scaffold/templates.js +8 -1
  31. package/dist/scaffold/templates.js.map +1 -1
  32. package/dist/server/CapabilityIndex.d.ts.map +1 -1
  33. package/dist/server/CapabilityIndex.js +1 -0
  34. package/dist/server/CapabilityIndex.js.map +1 -1
  35. package/dist/server/routes.d.ts +8 -0
  36. package/dist/server/routes.d.ts.map +1 -1
  37. package/dist/server/routes.js +341 -0
  38. package/dist/server/routes.js.map +1 -1
  39. package/package.json +1 -1
  40. package/src/data/builtin-manifest.json +63 -63
  41. package/src/scaffold/templates.ts +9 -1
  42. package/upgrades/1.3.584.md +84 -0
  43. package/upgrades/1.3.585.md +48 -0
  44. package/upgrades/side-effects/playwright-profile-registry.md +140 -0
  45. package/upgrades/side-effects/wedge-recovery-abandoned-notice.md +47 -0
  46. package/upgrades/1.3.583.md +0 -51
@@ -0,0 +1,84 @@
1
+ # Upgrade Guide — vNEXT
2
+
3
+ <!-- assembled-by: assemble-next-md -->
4
+ <!-- bump: minor -->
5
+
6
+ ## What Changed
7
+
8
+ **A durable per-agent registry mapping each Playwright browser profile to the accounts it is logged into — plus boot-time awareness of what browser access the agent actually has.** Until now the agent self-unblocked by driving a real browser (Playwright MCP) logged into real accounts, but there was no authoritative record of *which profile holds which account*. That knowledge lived only as ~21 scattered, partly-contradictory `operationalFacts` — which led the agent to ask the operator to act (or grind a credential treadmill) instead of resolving the right profile itself.
9
+
10
+ The new `PlaywrightProfileRegistry` (`src/core/PlaywrightProfileRegistry.ts`) is the missing data + awareness + selection + activation layer:
11
+
12
+ - A durable, machine-local state file `state/playwright-profiles.json` mapping each **profile** (a physical browser user-data-dir on THIS machine) to the **accounts** it is responsible for — by vault-secret NAME only, **never values**.
13
+ - A compact boot-awareness pointer injected at session start (`GET /playwright-profiles/session-context`), so the agent knows from message one what browser access it has and **as whom** — operator-owned accounts flagged loud (Know Your Principal), login state rendered as last-asserted staleness (advisory, never a guarantee).
14
+ - Routes to list / create / assign / resolve / activate profiles. `resolve` picks the owning profile for a `(service, identity)` and forces disambiguation rather than silently picking a privileged account; `activate` rewrites the MCP config and restarts the session onto the chosen profile.
15
+
16
+ **Safety posture (the honesty disciplines that keep this from re-creating the scattered-facts problem):** no secret VALUE is ever stored, returned, injected, or resolved (names only). Every write is audited (`logs/playwright-profiles.jsonl`). A corrupt registry file fails CLOSED for writes (never auto-overwritten) and OPEN for the boot block (injects nothing). Caller-supplied profile dirs are path-jailed to the agent home. The seed is metadata-only — it never touches `.mcp.json` / `.claude/settings.json`, so an update can never regress another agent's shared browser login.
17
+
18
+ **Rollout:** the whole feature is **dev-gated** (`playwrightRegistry.enabled` omitted → live on a development agent, **dark on the fleet** — routes 503, the boot block injects nothing). The only destructive op, `activate` (config rewrite + session restart), additionally ships `dryRun: true` — it LOGS the intended rewrite/refresh and performs NEITHER until a deliberate `dryRun: false`. Existing agents pick it up via full migration parity (state seed, session-start hook, CLAUDE.md awareness section, config default + strip-false migration).
19
+
20
+ A short timer drift that recurs while load sits in the **1.0–1.5/core band** slipped past both
21
+ existing guards: the load guard fires only above 1.5/core, and the consecutive burst floor resets
22
+ whenever on-time ticks fall between drifts. Its ~2-minute cadence also outlasted the 60s cooldown.
23
+ So each isolated drift emitted a **false `wake`**, firing the full wake-recovery cascade (tunnel
24
+ restart, Slack reconnect, mesh-lease churn, topic failover) — the source of a class of multi-machine
25
+ UX failures: a reply that's lost the conversation thread, messages that get no reply, and "remote
26
+ typing is disabled" (the 2026-06-15 incident, measured at ~1.13/core).
27
+
28
+ The detector now adds a **recurring-drift guard**: a short drift within `recentDriftWindowMs`
29
+ (default 5 min) of a prior short drift, while load is oversubscribed (`> recentDriftLoadFloor`,
30
+ default 1.0/core), is treated as recurring CPU starvation and suppressed. This generalizes the burst
31
+ floor from *consecutive* ticks to *recent* ticks, and the load gate confines it to the
32
+ oversubscribed band the hard guard leaves open.
33
+
34
+ ## What to Tell Your User
35
+
36
+ - ⚗️ **Experimental, development-agent only.** On the fleet this ships dark: the routes return 503 and nothing is injected at session start, so a normal agent sees no change. On a development agent it runs live, but the only destructive operation (switching the browser onto a profile) is held in dry-run by default, so it only LOGS what it would do until dry-run is deliberately turned off.
37
+ - **What it gives a dev agent:** instead of asking you to drive the browser or produce a credential, the agent can now look up which browser profile is logged into a given account, pick the right one, and (when activated) restart its session onto that profile. It tracks which accounts are **yours** vs the agent's own, so it won't act as you in a browser unless explicitly authorized — and login state is treated as last-asserted, so it re-verifies in-browser before any privileged action.
38
+ - **At-rest honesty:** the registry file is plaintext machine-local. It lists account identities + vault key NAMES — so filesystem access to the machine reveals the agent's access *map*, never the credentials themselves (same posture as the self-knowledge tree and the relationships store).
39
+
40
+ Side-effects review: upgrades/side-effects/playwright-profile-registry.md
41
+
42
+ - **Fewer spurious reconnects on a busy laptop**: "When my machine got busy I used to mistake the
43
+ slowdown for the computer going to sleep, which kicked off a disruptive recovery — dropping the
44
+ conversation thread, going quiet, or disabling typing. I now recognize that pattern and stay calm,
45
+ so those multi-machine glitches should largely stop."
46
+ - **Real sleeps still handled**: "If the machine genuinely sleeps, I still notice and recover
47
+ properly — nothing changes there."
48
+
49
+ ## Summary of New Capabilities
50
+
51
+ | Capability | How to Use |
52
+ |-----------|-----------|
53
+ | See which browser profile holds which account (full detail; vault NAMES only) | `GET /playwright-profiles` |
54
+ | Compact boot-awareness pointer (also auto-injected at session start) | `GET /playwright-profiles/session-context` |
55
+ | Create a custom profile | `POST /playwright-profiles` `{ id, description?, userDataDir? }` |
56
+ | Assign an account to a profile (owner agent\|operator; vault NAMES only) | `POST /playwright-profiles/:id/accounts` `{ service, identity, owner, vaultRefs[], loginMethod?, note? }` |
57
+ | Pick the right profile for a task | `GET /playwright-profiles/resolve?service=&identity=` (ambiguous service-only → `{ ambiguous: true, candidates }`) |
58
+ | Switch the browser onto a profile (config rewrite + session restart) | `POST /playwright-profiles/:id/activate` (ships `dryRun: true` — logs the intended switch until a deliberate `dryRun: false`; reversible by activating `default`) |
59
+
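The `resolve` row above can be illustrated with a small sketch. It is not the package's implementation: the types, the function name, and the non-ambiguous return shape are assumptions made for illustration. Only the `(service, identity)` inputs and the `{ ambiguous: true, candidates }` response for an ambiguous service-only lookup come from the guide.

```typescript
// Hypothetical types; the real registry schema is not shown in this diff.
interface AccountSketch {
  service: string;
  identity: string;
  owner: "agent" | "operator";
}

interface ProfileSketch {
  id: string;
  accounts: AccountSketch[];
}

interface MatchSketch {
  profileId: string;
  account: AccountSketch;
}

interface ResolveSketchResult {
  ambiguous: boolean;
  profileId: string | null; // assumed shape for the unambiguous case
  candidates: MatchSketch[];
}

// Picks the owning profile for (service, identity). When several accounts
// match, it returns every candidate instead of silently choosing one.
function resolveProfileSketch(
  profiles: ProfileSketch[],
  service: string,
  identity?: string,
): ResolveSketchResult {
  const matches: MatchSketch[] = [];
  for (const profile of profiles) {
    for (const account of profile.accounts) {
      const identityOk = identity === undefined || account.identity === identity;
      if (account.service === service && identityOk) {
        matches.push({ profileId: profile.id, account });
      }
    }
  }
  if (matches.length === 1) {
    return { ambiguous: false, profileId: matches[0].profileId, candidates: matches };
  }
  if (matches.length === 0) {
    return { ambiguous: false, profileId: null, candidates: [] };
  }
  return { ambiguous: true, profileId: null, candidates: matches };
}
```

Returning the full candidate list on ambiguity leaves the choice with the caller, which is the behavior the guide describes as forcing disambiguation rather than picking a privileged account.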
60
+ | Capability | How to Use |
61
+ |-----------|-----------|
62
+ | Suppress false "wake" events from CPU starvation on a loaded host | automatic |
63
+ | Tune or disable the new guard | `monitoring.sleepWake.recentDriftWindowMs` / `.recentDriftLoadFloor` (set window to 0 to disable) |
64
+
65
+ ## Evidence
66
+
67
+ - `PlaywrightProfileRegistry` seeds exactly one `default` profile via the shared `resolvePlaywrightMcpConfig()` resolver (records the real `--user-data-dir` if the canonical config carries one, else `null` = the built-in default — never `.playwright-mcp`, which is the MCP output-dir, not the browser profile).
68
+ - New `DEV_GATED_FEATURES` entry `playwrightRegistry` (configPath `playwrightRegistry.enabled`) — picked up automatically by the dual-side wiring test (`tests/unit/devGatedFeatures-wiring.test.ts`): the entry resolves LIVE under a dev-agent config and DARK under a fleet config.
69
+ - `ConfigDefaults` adds `playwrightRegistry: { dryRun: true }` and OMITS `enabled` (the dev-gate convention, mirroring `credentialRepointing` / `topicProfiles`).
70
+ - Migration parity in `PostUpdateMigrator`: the `/playwright-profiles/session-context` session-start fetch+inject block is modeled byte-for-byte on the existing `/self-knowledge/session-context` block (`curl -sf --max-time 4 --connect-timeout 1`, `python3` parse of `.block`, fail-open on 503/404/empty); a `migrateClaudeMd` content-sniff appends the awareness section; the `playwright-profiles-seed-v1` marker migration seeds the default profile metadata-only (idempotent, marks done either way); a `migrateConfigPlaywrightRegistryDevGate` strip-false migration mirrors the credential-repointing strip so a stale default-shaped `enabled: false` resolves the gate live.
71
+ - The CLAUDE.md awareness section is authored ONCE (`PLAYWRIGHT_PROFILE_REGISTRY_CLAUDEMD_SECTION`) and shared by `generateClaudeMd` (new installs) and `migrateClaudeMd` (existing agents) so the two can never drift.
72
+
73
+ Reproduction (live, 2026-06-15): on a host measured at loadavg ~18 on 16 cores (~1.13/core — above
74
+ 1.0 but below the 1.5 hard guard), `server.log` showed `[SleepWakeDetector] Wake detected after
75
+ ~33s/~21s sleep` recurring roughly every 2 minutes while the host was actively in use (not sleeping),
76
+ each triggering the wake-recovery cascade. The drifts were isolated (on-time ticks between them reset
77
+ the consecutive counter) and ~2 min apart (outlasting the 60s cooldown), so neither existing guard
78
+ caught them.
79
+
80
+ After the fix (verified by 45/45 sleep-wake unit tests across 5 files, both sides of the boundary): a
81
+ recurring short drift in the 1.0–1.5 band is suppressed (no `wake` emitted, recorded as
82
+ `cpu-starvation`); a genuinely isolated short drift, any drift on a light/idle host (ratio ≤ 1.0),
83
+ and every long (real) sleep still emit; `recentDriftWindowMs: 0` restores byte-identical prior
84
+ behavior. tsc clean.
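The recurring-drift rule described in this guide can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the shipped `SleepWakeDetector` code: the function and interface names are hypothetical, the window comparison is assumed to be inclusive, and the caller is assumed to have already classified the drift as short. Only `recentDriftWindowMs` (default 5 min), `recentDriftLoadFloor` (default 1.0/core), and the "window of 0 disables the guard" behavior come from the guide.

```typescript
// Hypothetical config shape; only the two key names come from the guide.
interface RecurringDriftConfigSketch {
  recentDriftWindowMs: number;  // 0 disables the guard
  recentDriftLoadFloor: number; // load per core; the guard applies strictly above it
}

const RECURRING_DRIFT_DEFAULTS: RecurringDriftConfigSketch = {
  recentDriftWindowMs: 5 * 60 * 1000,
  recentDriftLoadFloor: 1.0,
};

// Returns true when a short drift should be suppressed as recurring CPU
// starvation instead of being emitted as a `wake`.
function isRecurringStarvationSketch(
  nowMs: number,
  lastShortDriftMs: number | null,
  loadPerCore: number,
  cfg: RecurringDriftConfigSketch = RECURRING_DRIFT_DEFAULTS,
): boolean {
  if (cfg.recentDriftWindowMs <= 0) return false;            // guard disabled
  if (lastShortDriftMs === null) return false;               // genuinely isolated drift
  if (loadPerCore <= cfg.recentDriftLoadFloor) return false; // light or idle host
  return nowMs - lastShortDriftMs <= cfg.recentDriftWindowMs;
}
```

With the reproduction's numbers (about 1.13/core, drifts roughly 2 minutes apart), the second drift falls inside the 5-minute window and above the 1.0 floor, so the predicate returns `true` and no `wake` would be emitted.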
@@ -0,0 +1,48 @@
1
+ # Upgrade Guide — vNEXT
2
+
3
+ <!-- assembled-by: assemble-next-md -->
4
+ <!-- bump: patch -->
5
+
6
+ ## What Changed
7
+
8
+ When an inbound message gets stranded mid-turn (the machine handling it crashed or was handed off),
9
+ recovery re-runs it a few times, then gives up. Previously "give up" did nothing useful: the message
10
+ was **silently dropped** (the user was never told) AND it stayed marked "being-worked-on", so the
11
+ recovery routine kept re-finding it and logging `stuck-recovery: giving up … after 3 attempts` every
12
+ ~10 minutes, indefinitely (observed firing for hours on the same entries).
13
+
14
+ Now an exhausted entry is **terminally abandoned** and **announced**: a new terminal `abandoned`
15
+ state in `MessageProcessingLedger` (`markAbandoned`) moves it out of `processing` — so `reclaimStuck`
16
+ stops re-selecting it (the log-loop ends), `beginProcessing` refuses to revive it, and a provider
17
+ redelivery of the same event is dropped (a genuine resend uses a fresh dedupeKey). It leaves
18
+ `reply_committed_at` NULL, so it never masquerades as a real reply. `recoverStuckMessages` surfaces
19
+ abandoned entries and the server emits one per-topic loss notice.
20
+
21
+ ## What to Tell Your User
22
+
23
+ - **No more silently-dropped messages**: "If I ever can't finish handling something you sent — say my
24
+ machine crashed mid-thought — I'll now tell you plainly that I didn't get to it and ask you to
25
+ resend, instead of leaving you waiting on a reply that never comes."
26
+ - **A quieter, healthier system**: "I also fixed a case where I'd keep silently retrying the same
27
+ failed message every few minutes, forever. Now I close it out cleanly the first time."
28
+
29
+ ## Summary of New Capabilities
30
+
31
+ | Capability | How to Use |
32
+ |-----------|-----------|
33
+ | Honest loss notice when a stuck message is abandoned | automatic (posted to the affected conversation) |
34
+ | Stuck-recovery give-up no longer loops or drops silently | automatic |
35
+
36
+ ## Evidence
37
+
38
+ Reproduction (live, 2026-06-15): `logs/server.log` showed `stuck-recovery: giving up on
39
+ telegram:21487:990487 after 3 attempts` (plus two sibling entries) firing on a ~10-minute cadence for
40
+ hours (19:07 → 20:11+), because the give-up branch did `skipped++; continue` — leaving the entry in
41
+ `processing` so `reclaimStuck` re-selected it every cycle, with no user notice.
42
+
43
+ After the fix (verified by 30/30 ledger + recovery unit tests, both sides of the boundary): an
44
+ exhausted entry is marked `abandoned` (state asserted terminal, `replyCommittedAt` NULL,
45
+ `hasReplyCommittedForTopicSince` still false, `reclaimStuck` no longer returns it), surfaced in
46
+ `result.abandoned`, and a subsequent recovery pass neither re-selects nor re-surfaces it — the
47
+ 10-minute log-loop is gone and exactly one "resend anything still needed" notice is emitted per
48
+ abandoned entry. tsc clean.
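The terminal `abandoned` transition described in this guide can be sketched with an in-memory stand-in for the ledger. This is an illustration only: the class, its fields, and the simplified method signatures are assumptions (the real `markAbandoned` also takes an `epoch`, and the real `reclaimStuck` presumably applies an age threshold, neither of which is modeled here). The sketch shows only the properties the guide states: `processing → abandoned`, no revival, no re-selection, and `reply_committed_at` left NULL.

```typescript
type LedgerStateSketch = "processing" | "abandoned";

interface LedgerEntrySketch {
  state: LedgerStateSketch;
  abandonedAt: number | null;
  replyCommittedAt: number | null;
}

// Hypothetical in-memory ledger; the real one is SQLite-backed.
class LedgerSketch {
  private entries = new Map<string, LedgerEntrySketch>();

  // Starts work on a new event. An existing entry, including an abandoned
  // one, is never revived, so a redelivery of the same event is dropped.
  beginProcessing(dedupeKey: string): boolean {
    if (this.entries.has(dedupeKey)) return false;
    this.entries.set(dedupeKey, {
      state: "processing",
      abandonedAt: null,
      replyCommittedAt: null,
    });
    return true;
  }

  // processing -> abandoned; a no-op for anything not currently processing.
  // replyCommittedAt is deliberately left null so it never looks like a reply.
  markAbandoned(dedupeKey: string, nowMs: number): boolean {
    const entry = this.entries.get(dedupeKey);
    if (!entry || entry.state !== "processing") return false;
    entry.state = "abandoned";
    entry.abandonedAt = nowMs;
    return true;
  }

  // Only entries still in `processing` are candidates for recovery.
  reclaimStuck(): string[] {
    const keys: string[] = [];
    this.entries.forEach((entry, key) => {
      if (entry.state === "processing") keys.push(key);
    });
    return keys;
  }

  get(dedupeKey: string): LedgerEntrySketch | undefined {
    return this.entries.get(dedupeKey);
  }
}
```

Because an abandoned entry is no longer in `processing`, a later recovery pass has nothing to re-select, which is what ends the repeating give-up log line.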
@@ -0,0 +1,140 @@
1
+ # Side-Effects Review — Playwright Profile Registry + Account-Access Awareness
2
+
3
+ Spec: `docs/specs/playwright-profile-registry.md` (converged 2 iterations, approved).
4
+ Tier: **2** (new feature: new store, 8 routes, migration parity, session-start hook,
5
+ config + dev-gate, agent awareness; risk floor raised by new-capability + identity-touch
6
+ + fleet-rollout signals — Tier 2 matches).
7
+
8
+ ## Phase 1 — Principle check (signal vs authority)
9
+
10
+ Does this change involve a decision point that gates information flow / blocks actions /
11
+ constrains agent behavior? **No blocking authority.** The feature is a
12
+ data + awareness + selection + tool layer:
13
+ - The boot block is an explicitly-ADVISORY signal (`<playwright-profiles>` envelope,
14
+ "background signal, not authority — verify before acting"); login state is
15
+ `lastAsserted`, never asserted as fact (D11).
16
+ - `activate` is a TOOL the agent invokes; it does NOT bypass any external-operation /
17
+ coherence gate — switching the browser profile is not authorization to act as that
18
+ identity (D12 + the activate clause).
19
+ - The only gates are the dev-agent dark gate + the `activate` `dryRun` — both ROLLOUT
20
+ controls, not behavioral authority.
21
+
22
+ No brittle check holds blocking authority. Compliant with `docs/signal-vs-authority.md`.
23
+
24
+ ## Phase 2 — Build location
25
+
26
+ Fresh worktree `.worktrees/playwright-profile-registry`, branch
27
+ `echo/playwright-profile-registry` off `JKHeadley/main` @461ceec0e (package.json
28
+ v1.3.579). `git remote -v`: JKHeadley = canonical. Identity set to
29
+ `Instar Agent (echo)` / `echo@instar.local`.
30
+
31
+ ## The 8 questions
32
+
33
+ ### 1. Over-block (what legitimate inputs does this reject that it shouldn't?)
34
+ - `userDataDir` jail (D9) rejects any path outside the agent home, `-`-prefixed, or
35
+ NUL-bearing. A legitimate profile dir is always inside the agent home (the worktree
36
+ convention / sandbox-stable home), so this rejects nothing legitimate. A user who
37
+ genuinely wanted a profile outside the home would be rejected — intentional (the jail
38
+ is the security boundary; out-of-home profiles are exactly the cross-agent-theft /
39
+ sandbox-revocation hazard).
40
+ - Ref-validation fails CLOSED when the vault is unreadable (D17): a legitimate assign is
41
+ rejected (409) while the vault is decrypt-failed/absent. Intentional — better to
42
+ refuse than record an unvalidated ref. The vault being unreadable is itself an
43
+ incident the operator should resolve first.
44
+
45
+ ### 2. Under-block (what failure modes does this still miss?)
46
+ - The registry cannot observe a dead browser cookie — `lastAsserted: true` can be stale.
47
+ MITIGATED by D11 (rendered staleness age + advisory framing + "verify before
48
+ privileged action"), not eliminated. This is by design: the registry is a signal, not
49
+ a liveness oracle. The agent must re-verify in-browser.
50
+ - `owner: agent|operator` is a self-asserted label, not a verified principal (D12 note):
51
+ a poisoned/mistaken write could mislabel. MITIGATED by the audit log (attributable),
52
+ the advisory framing, and the real act-as defense being the un-bypassed
53
+ external-op/coherence gate — the label is a hint, never an authorization.
54
+
55
+ ### 3. Level-of-abstraction fit
56
+ Correct layer. It mirrors the proven `BootSelfKnowledge` boot-block pattern + the
57
+ `CommitmentTracker` CAS pattern + the `credentialRepointing` dryRun convention. The
58
+ login keystrokes are deliberately NOT here (D8 — a non-deterministic interactive action
59
+ belongs in the agent, not a deterministic route). A smarter gate does not already own
60
+ this; it feeds awareness, it does not duplicate one.
61
+
62
+ ### 4. Signal vs authority compliance
63
+ Compliant (see Phase 1). The boot block is advisory; `activate` does not gate behavior;
64
+ the dev-gate + dryRun are rollout controls. No brittle blocking authority added.
65
+
66
+ ### 5. Interactions (shadowing / double-fire / races)
67
+ - `activate`'s session refresh + the MCP-health auto-refresh (`mcp-health-autorefresh.sh`)
68
+ could both target playwright. MITIGATED: the auto-refresh has a hard once-per-(session,
69
+ failed-set) loop-guard (verified at PostUpdateMigrator.ts:8442+); `activate`'s
70
+ already-active fast path (no write/no refresh when the target dir is already set) +
71
+ the per-session activate cooldown/window breaker prevent a restart storm (D19).
72
+ - Concurrent writes to `state/playwright-profiles.json`: single-writer CAS `mutate()`
73
+ (D14) — no lost update (NOT bare `writeConfigAtomic`).
74
+ - The shared `resolvePlaywrightMcpConfig()` is the SINGLE source-of-truth for both seed
75
+ and activate (F2) — the two paths cannot drift on "where the playwright arg lives."
76
+ - The boot fetch is added adjacent to the self-knowledge fetch in `getSessionStartHook()`
77
+ — same fail-open shape; it cannot block boot (D22).
78
+
79
+ ### 6. External surfaces
80
+ - New HTTP routes (`/playwright-profiles/*`) — Bearer-authed, whole-feature dev-gated
81
+ (503 on fleet). Visible to the operating agent only.
82
+ - A new always-injected boot block — kept COMPACT (≤800 bytes, pointer-not-payload, D21)
83
+ to respect the boot-bloat lesson (L1); full detail behind the route.
84
+ - The plaintext `state/playwright-profiles.json` lists account IDENTITIES + vault key
85
+ NAMES (never values) — an at-rest access MAP. Documented honestly in the
86
+ agent-awareness section (same posture as `SelfKnowledgeTree`/operationalFacts).
87
+ - `activate` (only when `dryRun:false`) mutates the playwright MCP config file +
88
+ restarts the session — agent-initiated, audited, reversible.
89
+
90
+ ### 7. Multi-machine posture
91
+ **Machine-local BY DESIGN** (D6). A browser profile's logged-in session lives in cookies
92
+ on one machine's disk and cannot be moved by copying metadata. The state file, routes,
93
+ and boot block describe only the machine serving the request; the boot block reads LOCAL
94
+ state even after a topic transfer. No replication, no proxied-on-read, no generated URLs
95
+ crossing a machine boundary, no user-facing notices needing one-voice gating. Registry
96
+ state does not strand on topic transfer (it correctly does not travel). The cross-machine
97
+ "which machine holds profile X" read is tracked as a follow-up
98
+ (<!-- tracked: CMT-1554-pwprofile-crossmachine-holder-view -->), not silently assumed.
99
+
100
+ ### 8. Rollback cost
101
+ Cheap. Dark on the fleet by construction (dev-gated → all routes 503, session-start
102
+ injects nothing, state file inert). On a dev agent: `playwrightRegistry.enabled: false`.
103
+ `activate`'s config edit (only when `dryRun:false`) is reversed by activating `default`
104
+ (restores the no-arg built-in profile) or a one-line manual revert. No data migration,
105
+ no destructive state. The seeded default profile + `dryRun:true` config default are
106
+ additive. Back-out = flip the flag (no hot-fix release needed for the dark default).
107
+
108
+ ## Verification
109
+
110
+ - `npx tsc --noEmit` → clean (exit 0).
111
+ - New tests: unit 47 + integration 14 + e2e 3 = 64, all green; `devGatedFeatures-wiring`
112
+ 82 green (picks up the new entry); ratchet/capability suites (no-silent-fallbacks,
113
+ no-silent-llm-fallback, CapabilityIndex, capability-registry-generator,
114
+ lint-dev-agent-dark-gate, PostUpdateMigrator-guardsCapabilitySection) all green.
115
+ - `node scripts/lint-dev-agent-dark-gate.js` → clean. `node scripts/lint-guard-manifest.js`
116
+ → clean (request-driven feature, no manifest entry needed).
117
+
118
+ ## Phase 5 — Second-pass review (independent)
119
+
120
+ **Concur with the review** — the implementation matches the artifact's claims and is
121
+ sound. Independently verified against the code (file:line):
122
+ 1. activate (routes.ts:16933-17002): write+refresh gated behind `dryRun` default-true;
123
+ already-active fast path skips both; per-session 30s cooldown + 5/5min breaker
124
+ (:16758-16775) on the real-switch path only; rewrites only `mcpServers.playwright.args`
125
+ + schedules a refresh, makes no authorization claim → cannot grant act-as. No storm.
126
+ 2. No secret values: `listVaultNames` → `secretKeyPaths` (names only); audit log + boot
127
+ block never carry values/refs-values. Invariant holds end-to-end.
128
+ 3. Signal vs authority: boot block advisory; operator accounts marked "act-as only when
129
+ authorized"; staleness rendered; no code consumes the block as authority.
130
+ 4. Fail-closed/open: assign fails closed when vault names null; corrupt file → CRUD
131
+ throws, never overwrites; block fails open; boot hook `curl -sf --max-time 4`,
132
+ non-2xx/empty injects nothing.
133
+ 5. CAS: genuine single-writer `mutate()` (statSig before/after + retry).
134
+ 6. dev-gate: all 8 routes 503 on fleet; flag read fresh per request; `enabled` omitted;
135
+ strip-false migration + DEV_GATED_FEATURES entry present.
136
+ 7. sanitize: every rendered boot field through `sanitizeForBlock` (escapes `<`/`>`,
137
+ strips control chars); envelope breakout impossible.
138
+ 8. New risks: none material (reads call one-time idempotent ensureSeeded — no storm;
139
+ audit-log re-sanitize advisory documented, not yet a live surface).
140
+ No hole in the activate restart path or the no-value invariant.
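Item 7 of the second-pass review says every rendered boot field passes through `sanitizeForBlock`, which escapes `<`/`>` and strips control characters. A minimal sketch of that idea follows. The function below is a hypothetical stand-in, not the package's code: the use of HTML entities for escaping and the exact control-character range are assumptions.

```typescript
// Hypothetical stand-in for sanitizeForBlock. Assumes HTML-entity escaping
// and treats U+0000..U+001F plus U+007F as control characters.
function sanitizeForBlockSketch(value: string): string {
  return value
    .replace(/[\u0000-\u001F\u007F]/g, "") // strip control chars (incl. newlines)
    .replace(/</g, "&lt;")                 // escape angle brackets so a field
    .replace(/>/g, "&gt;");                // cannot close the envelope tag
}

// Wraps already-sanitized fields in the advisory envelope named in the review.
function renderBootBlockSketch(fields: string[]): string {
  const body = fields.map(sanitizeForBlockSketch).join("; ");
  return `<playwright-profiles>${body}</playwright-profiles>`;
}
```

For example, a field containing `a</playwright-profiles>` followed by a newline and `b` renders as `a&lt;/playwright-profiles&gt;b`, so the only closing tag in the output is the one the renderer itself appends.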
@@ -0,0 +1,47 @@
1
+ # Side-effects — wedge-recovery terminal-abandon + loss notice (gap #2 / CMT-1563)
2
+
3
+ ## What changed (3 src files + 2 test files)
4
+
5
+ - `src/messaging/MessageProcessingLedger.ts` — new terminal `abandoned` state: `LedgerState` +
6
+ `LedgerEntry.abandonedAt` + `abandoned_at` schema column (idempotent ALTER, self-migrating);
7
+ `markAbandoned(dedupeKey, epoch)` (`processing → abandoned`, sets `abandoned_at`, leaves
8
+ `reply_committed_at` NULL); `isActedOn` + `beginProcessing` now treat `abandoned` as terminal.
9
+ - `src/messaging/stuckMessageRecovery.ts` — the give-up branch calls `markAbandoned` and pushes the
10
+ entry to a new `StuckRecoveryResult.abandoned: Array<{topic, dedupeKey}>`.
11
+ - `src/commands/server.ts` — `runStuckRecovery` emits ONE per-topic loss notice from
12
+ `result.abandoned` via `notify(...)` (targeted to the topic).
13
+
14
+ ## Behavioral side-effects
15
+
16
+ - **The give-up log-loop ends.** An exhausted stuck entry is moved out of `processing`, so
17
+ `reclaimStuck` stops re-selecting it every cycle (was firing `giving up on … after 3 attempts`
18
+ every ~10 min for hours on the same entries).
19
+ - **Loss is no longer silent.** Each abandoned entry produces ONE "I didn't get to N message(s) you
20
+ sent earlier — resend anything still needed" notice to its topic, exactly once.
21
+ - **A redelivery of an abandoned event is dropped** (`isActedOn` true) — a genuine resend has a
22
+ fresh dedupeKey and is processed normally.
23
+ - **No false reply-evidence:** `abandoned` leaves `reply_committed_at` NULL, so
24
+ `hasReplyCommittedForTopicSince` never returns true for it — it can't wrongly suppress recovery of
25
+ a sibling stuck entry on the same topic.
26
+
27
+ ## Risk + rollback
28
+
29
+ - Highest-risk subsystem (exactly-once message ledger). Fail-safe: `markAbandoned` acts ONLY on an
30
+ already-exhausted `processing` entry — it cannot touch an in-flight or still-recoverable entry,
31
+ and never marks anything replied. The new state is additive (free-TEXT `state` column).
32
+ - No flag — a correctness fix to a live loss + log-loop, not a dark feature. Revert = restore the
33
+ prior `skipped++; continue` give-up branch (but that reinstates the silent drop + the loop).
34
+
35
+ ## Tests
36
+
37
+ - `tests/unit/MessageProcessingLedger.test.ts` — 2 new: `markAbandoned` terminal semantics (no fake
38
+ reply, terminal, no false topic reply-evidence, not re-selected); no-op when not `processing`.
39
+ - `tests/unit/stuck-message-recovery.test.ts` — 1 new: exhausted entry abandoned + surfaced + not
40
+ re-looped; the standby-result assertion updated for the new `abandoned: []` field.
41
+ - 30/30 ledger + recovery unit tests green; tsc clean.
42
+
43
+ ## Migration parity
44
+
45
+ Self-migrating SQLite (idempotent `ALTER TABLE ADD COLUMN abandoned_at`) — existing agents' ledgers
46
+ upgrade in place on first access, no PostUpdateMigrator step. Internal recovery mechanism; no
47
+ agent-facing route/capability, so no CLAUDE.md template change required.
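The "idempotent `ALTER TABLE ADD COLUMN`" step can be sketched as a planning function. SQLite's `ADD COLUMN` fails if the column already exists, so one common way to make it idempotent is to inspect the existing columns first (for example via `PRAGMA table_info`). The diff does not show how the package does it, so the approach, the function name, the table name `ledger`, and the `INTEGER` column type below are all assumptions.

```typescript
// Subset of a `PRAGMA table_info(<table>)` row; only `name` is needed here.
interface ColumnInfoSketch {
  name: string;
}

// Returns the statement to run, or null when the column is already present,
// so calling it on every startup is safe.
function planAddColumnSketch(
  existing: ColumnInfoSketch[],
  table: string,
  column: string,
  type: string,
): string | null {
  const present = existing.some((c) => c.name === column);
  return present ? null : `ALTER TABLE ${table} ADD COLUMN ${column} ${type}`;
}

// Hypothetical usage: table name and column type are illustrative only.
const plannedStatement = planAddColumnSketch(
  [{ name: "dedupe_key" }, { name: "state" }],
  "ledger",
  "abandoned_at",
  "INTEGER",
);
```

A second run against a ledger that already has `abandoned_at` plans nothing, which matches the guide's claim that existing ledgers upgrade in place on first access.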
@@ -1,51 +0,0 @@
1
- # Upgrade Guide — vNEXT
2
-
3
- <!-- assembled-by: assemble-next-md -->
4
- <!-- bump: patch -->
5
-
6
- ## What Changed
7
-
8
- A short timer drift that recurs while load sits in the **1.0–1.5/core band** slipped past both
9
- existing guards: the load guard fires only above 1.5/core, and the consecutive burst floor resets
10
- whenever on-time ticks fall between drifts. Its ~2-minute cadence also outlasted the 60s cooldown.
11
- So each isolated drift emitted a **false `wake`**, firing the full wake-recovery cascade (tunnel
12
- restart, Slack reconnect, mesh-lease churn, topic failover) — the source of a class of multi-machine
13
- UX failures: a reply that's lost the conversation thread, messages that get no reply, and "remote
14
- typing is disabled" (the 2026-06-15 incident, measured at ~1.13/core).
15
-
16
- The detector now adds a **recurring-drift guard**: a short drift within `recentDriftWindowMs`
17
- (default 5 min) of a prior short drift, while load is oversubscribed (`> recentDriftLoadFloor`,
18
- default 1.0/core), is treated as recurring CPU starvation and suppressed. This generalizes the burst
19
- floor from *consecutive* ticks to *recent* ticks, and the load gate confines it to the
20
- oversubscribed band the hard guard leaves open.
21
-
22
- ## What to Tell Your User
23
-
24
- - **Fewer spurious reconnects on a busy laptop**: "When my machine got busy I used to mistake the
25
- slowdown for the computer going to sleep, which kicked off a disruptive recovery — dropping the
26
- conversation thread, going quiet, or disabling typing. I now recognize that pattern and stay calm,
27
- so those multi-machine glitches should largely stop."
28
- - **Real sleeps still handled**: "If the machine genuinely sleeps, I still notice and recover
29
- properly — nothing changes there."
30
-
31
- ## Summary of New Capabilities
32
-
33
- | Capability | How to Use |
34
- |-----------|-----------|
35
- | Suppress false "wake" events from CPU starvation on a loaded host | automatic |
36
- | Tune or disable the new guard | `monitoring.sleepWake.recentDriftWindowMs` / `.recentDriftLoadFloor` (set window to 0 to disable) |
37
-
38
- ## Evidence
39
-
40
- Reproduction (live, 2026-06-15): on a host measured at loadavg ~18 on 16 cores (~1.13/core — above
41
- 1.0 but below the 1.5 hard guard), `server.log` showed `[SleepWakeDetector] Wake detected after
42
- ~33s/~21s sleep` recurring roughly every 2 minutes while the host was actively in use (not sleeping),
43
- each triggering the wake-recovery cascade. The drifts were isolated (on-time ticks between them reset
44
- the consecutive counter) and ~2 min apart (outlasting the 60s cooldown), so neither existing guard
45
- caught them.
46
-
47
- After the fix (verified by 45/45 sleep-wake unit tests across 5 files, both sides of the boundary): a
48
- recurring short drift in the 1.0–1.5 band is suppressed (no `wake` emitted, recorded as
49
- `cpu-starvation`); a genuinely isolated short drift, any drift on a light/idle host (ratio ≤ 1.0),
50
- and every long (real) sleep still emit; `recentDriftWindowMs: 0` restores byte-identical prior
51
- behavior. tsc clean.