instar 1.3.577 → 1.3.579
- package/dist/commands/server.d.ts.map +1 -1
- package/dist/commands/server.js +13 -0
- package/dist/commands/server.js.map +1 -1
- package/dist/core/PostUpdateMigrator.d.ts +1 -0
- package/dist/core/PostUpdateMigrator.d.ts.map +1 -1
- package/dist/core/PostUpdateMigrator.js +108 -0
- package/dist/core/PostUpdateMigrator.js.map +1 -1
- package/dist/core/action-claim.d.ts +34 -0
- package/dist/core/action-claim.d.ts.map +1 -0
- package/dist/core/action-claim.js +118 -0
- package/dist/core/action-claim.js.map +1 -0
- package/dist/core/types.d.ts +9 -0
- package/dist/core/types.d.ts.map +1 -1
- package/dist/core/types.js.map +1 -1
- package/dist/monitoring/CommitmentTracker.d.ts.map +1 -1
- package/dist/monitoring/CommitmentTracker.js +12 -0
- package/dist/monitoring/CommitmentTracker.js.map +1 -1
- package/dist/monitoring/ResumeQueue.d.ts +50 -0
- package/dist/monitoring/ResumeQueue.d.ts.map +1 -1
- package/dist/monitoring/ResumeQueue.js +179 -3
- package/dist/monitoring/ResumeQueue.js.map +1 -1
- package/dist/monitoring/guardManifest.d.ts.map +1 -1
- package/dist/monitoring/guardManifest.js +17 -0
- package/dist/monitoring/guardManifest.js.map +1 -1
- package/dist/scaffold/templates.d.ts.map +1 -1
- package/dist/scaffold/templates.js +2 -0
- package/dist/scaffold/templates.js.map +1 -1
- package/dist/server/routes.d.ts.map +1 -1
- package/dist/server/routes.js +62 -0
- package/dist/server/routes.js.map +1 -1
- package/package.json +1 -1
- package/src/data/builtin-manifest.json +64 -64
- package/src/scaffold/templates.ts +2 -0
- package/src/templates/hooks/settings-template.json +10 -0
- package/upgrades/1.3.578.md +64 -0
- package/upgrades/1.3.579.md +63 -0
- package/upgrades/side-effects/action-claim-followthrough-sentinel.md +109 -0
- package/upgrades/side-effects/autonomous-run-outlives-session.md +114 -0
@@ -0,0 +1,64 @@
# Upgrade Guide — vNEXT

<!-- assembled-by: assemble-next-md -->
<!-- bump: patch -->

## What Changed

A new constitutional standard ("An Autonomous Run Must Outlive Its Session") plus the
fix behind it. The mid-work resume queue (the system that revives a reaped autonomous
run, #1157) takes a host-local lock so two machines can't corrupt its shared state.
A machine RENAME used to leave a stale lock the queue mistook for a shared-volume
conflict — so it silently disabled the entire run-revival guard and never said so
(the 2026-06-15 incident). Two changes:

- **Rename-aware lock (GAP-D).** When the lock shows a different host, the queue now
  distinguishes a single-host rename (provably host-local disk + dead pid + ≥5min
  stale heartbeat → auto-heal the lock via an O_EXCL first-writer-wins takeover) from
  a genuine shared-volume conflict (stay disabled). FAIL-CLOSED on any uncertainty
  (unknown filesystem, `df` failure, live pid, fresh heartbeat). The original HARD
  INVARIANT — never pid-probe a foreign-host lock — is fully preserved when auto-heal
  is off. Ships **fleet-default OFF** (`monitoring.resumeQueue.autoHealStaleHostLock`),
  dev-agent dryRun-first (logs "would auto-heal" without rewriting) before going live.
- **A disabled revival queue is now LOUD (D2).** The queue self-reports to the
  guard-posture inventory (`GUARD_MANIFEST` entry + `guardStatus()` + an unconditional
  registration), so a disabled revival queue reads `off-runtime-divergent` on
  `GET /guards` and raises one aggregated attention item — never silently inert.

No new route. New code-defaulted config key (kept out of ConfigDefaults to preserve
the fleet flip, consistent with #1157). Signal-only surfacing; the only authority is
the queue refusing to start itself (bounded self-recovery, fail-closed).
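
The rename-versus-conflict decision described above can be read as a short chain of fail-closed checks. The following is a minimal sketch under stated assumptions, not the package's code: `decideForeignHostLock`, `ForeignLockInputs`, and the `LockDecision` labels are hypothetical names, and only the conditions themselves (auto-heal flag, host-local disk, dead pid, heartbeat at least 5 minutes stale, dryRun) come from the notes.

```typescript
// Illustrative only: names and shapes here are assumptions, not the package's API.
type LockDecision = "stay-disabled" | "would-auto-heal" | "auto-heal";

interface ForeignLockInputs {
  autoHeal: boolean;                // monitoring.resumeQueue.autoHealStaleHostLock
  dryRun: boolean;                  // log intent without rewriting the lock
  isStateDirLocal: () => boolean;   // true only when the disk is provably host-local
  isPidAlive: () => boolean | null; // null means the probe was inconclusive
  heartbeatAgeMs: number | null;    // null means no readable heartbeat
}

const STALE_HEARTBEAT_MS = 5 * 60 * 1000; // the ">= 5 min stale" threshold from the text

function decideForeignHostLock(input: ForeignLockInputs): LockDecision {
  // Auto-heal off: probe nothing, which keeps the original invariant intact.
  if (!input.autoHeal) return "stay-disabled";
  // Locality is checked first so a shared volume short-circuits before any pid probe.
  if (!input.isStateDirLocal()) return "stay-disabled";
  // Anything other than a definite "dead" counts as uncertainty.
  if (input.isPidAlive() !== false) return "stay-disabled";
  if (input.heartbeatAgeMs === null || input.heartbeatAgeMs < STALE_HEARTBEAT_MS) {
    return "stay-disabled";
  }
  return input.dryRun ? "would-auto-heal" : "auto-heal";
}
```

Passing the probes as functions rather than precomputed values makes the ordering visible: with the flag off, or on a non-local disk, the pid probe is never invoked.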

## What to Tell Your User

- **Your autonomous work survives a machine rename now** (dev agent): "If I rename or
  restore this machine, the system that brings a reaped autonomous run back no longer
  quietly switches itself off. On a provable same-machine rename it heals its own lock
  (carefully — only on a local disk, with the old process gone); on anything uncertain
  it stays cautious. And if that revival system is ever genuinely disabled, you'll see
  it flagged on the guards view with one alert instead of silence." ⚗️ Experimental —
  the self-heal ships dark on the fleet (dev-agent first) and rolls out more widely
  only after it's proven safe.

## Summary of New Capabilities

| Capability | How to Use |
|-----------|-----------|
| A machine rename auto-heals the resume-queue lock instead of silently disabling revival | Automatic on the dev agent (`monitoring.resumeQueue.autoHealStaleHostLock`; fleet default off) |
| A disabled revival queue surfaces as `off-runtime-divergent` | `GET /guards` (automatic) |
| Diagnose "why didn't my autonomous run come back?" | `GET /guards` + `GET /sessions/resume-queue` (disabled reason) |

## Evidence

- Unit (`tests/unit/resume-queue-autoheal-lock.test.ts`, 11): the FD1 device-source
  truth-table (local `/dev/*` → local; `//host`/`host:/path` → not; unknown/tmpfs/map →
  fail-closed); auto-heal fires only on local-FS + dead-pid + stale-heartbeat; stays
  disabled on a non-local FS or a live pid; dryRun logs-but-does-not-rewrite; auto-heal
  off preserves today's behavior; `guardStatus()` reporting.
- Integration (`tests/integration/resume-queue-guard-posture.test.ts`, 3): a
  runtime-disabled queue classifies `off-runtime-divergent` through the real
  GUARD_MANIFEST entry + `deriveGuardRow`; a healthy queue does not.
- Regression: the full resume-queue unit + route suite (100 tests) stays green —
  including the existing HARD-INVARIANT test that a foreign-host lock is never
  pid-probed (which guards against re-introducing the cross-host probe). tsc clean;
  `lint-guard-manifest` clean. Independent Phase-5 second-pass review concurred.
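
The FD1 device-source truth-table listed in the first Evidence bullet can be illustrated with a small classifier over the device column that `df -P` reports. This is a sketch under assumptions: `classifyDfSourceSketch` and `isProvablyLocal` are hypothetical names, and the patterns are one plausible reading of the three rows (`/dev/*`, `//host` or `host:/path`, everything else), not the shipped `classifyDfSourceLocal`.

```typescript
// Illustrative only: a three-way reading of the df source column.
type DfSourceClass = "local" | "network" | "unknown";

function classifyDfSourceSketch(source: string): DfSourceClass {
  const s = source.trim();
  // SMB/CIFS style sources look like //host/share.
  if (s.startsWith("//")) return "network";
  // NFS style sources look like host:/export/path.
  if (/^[^/\s:]+:\//.test(s)) return "network";
  // A block device under /dev is the only thing treated as local.
  if (s.startsWith("/dev/")) return "local";
  // tmpfs, automount maps, empty strings, and anything unrecognized.
  return "unknown";
}

// Fail-closed: only a definite "local" enables auto-heal.
function isProvablyLocal(source: string): boolean {
  return classifyDfSourceSketch(source) === "local";
}
```

Keeping "unknown" distinct from "network" matters only for logging; both map to the same fail-closed outcome in `isProvablyLocal`.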
@@ -0,0 +1,63 @@
# Upgrade Guide — vNEXT

<!-- assembled-by: assemble-next-md -->
<!-- bump: patch -->

## What Changed

A signal-only backstop for the word≠action gap: when the agent says a CONCRETE
future action in a conversational turn ("I'll restart the server", "relaunching
now", "pushing the change"), a thin Stop hook posts the turn to a new server route
`POST /action-claim/observe`, which classifies the claim and opens an idempotent
follow-through **commitment** for the topic — so the existing PromiseBeacon + the
revival path make sure the action actually happens instead of being silently
dropped.

- `src/core/action-claim.ts` — deterministic, high-precision classifier (closed
  concrete-verb set; first-person-scoped; sentence-initial-participle for the
  "Relaunching now" idiom; rejects vague filler, past tense, quotes, and
  imperatives/questions/third-person directed at others). Fails toward NOT
  registering.
- `CommitmentTracker.record()` — idempotent `externalKey` create (the missing
  dedupe primitive): a restated claim updates ONE commitment instead of spawning N.
- The route enforces a per-topic cap + a 6h auto-expiry. The hook ALWAYS exits 0 —
  it never blocks a message. Off by default behind `messaging.actionClaim.enabled`.

Completed-action verification (checking "I already pushed it" against real evidence)
is a tracked follow-up <!-- tracked: CMT-1554-sibling action-claim-A2-evidence-primitive --> — it needs a per-turn tool-call-evidence primitive that
doesn't exist yet; the founding incident was a future-action claim, which this
covers.
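
The idempotent `externalKey` create described for `CommitmentTracker.record()` can be sketched with an in-memory store. Everything named here (`CommitmentStoreSketch`, `Commitment`, `RecordInput`, the `status` values, `close()`) is a hypothetical stand-in; the real tracker's fields and persistence are not shown in this diff. The sketch only illustrates the stated behavior: a repeated key returns the existing open commitment, distinct keys create distinct commitments, and a call without a key behaves as before.

```typescript
// Illustrative only: an in-memory stand-in for the tracker's dedupe behavior.
type CommitmentStatus = "open" | "done" | "expired";

interface Commitment {
  id: string;
  topic: string;
  text: string;
  status: CommitmentStatus;
  externalKey?: string;
}

interface RecordInput {
  topic: string;
  text: string;
  externalKey?: string;
}

class CommitmentStoreSketch {
  private readonly items: Commitment[] = [];
  private nextId = 1;

  // Only open commitments count as active; terminal ones are excluded.
  getActive(topic: string): Commitment[] {
    return this.items.filter((c) => c.topic === topic && c.status === "open");
  }

  record(input: RecordInput): Commitment {
    if (input.externalKey !== undefined) {
      const existing = this.items.find(
        (c) => c.status === "open" && c.externalKey === input.externalKey,
      );
      if (existing !== undefined) {
        // Early return before minting an id: a restated claim updates one record.
        existing.text = input.text;
        return existing;
      }
    }
    const created: Commitment = { id: `c${this.nextId}`, status: "open", ...input };
    this.nextId += 1;
    this.items.push(created);
    return created;
  }

  close(id: string): void {
    const found = this.items.find((c) => c.id === id);
    if (found !== undefined) found.status = "done";
  }
}
```

Because the lookup only considers open commitments, a key whose earlier commitment already reached a terminal state mints a fresh one, which matches the behavior the second-pass review of this change describes.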

## What to Tell Your User

- **A concrete thing I say I'll do becomes a tracked promise** (when enabled): "If I
  say I'll restart/push/deploy/merge something, that now opens a tracked commitment
  so I actually follow through — instead of it being a sentence that might quietly
  never happen. It's careful (vague 'I'll take a look' doesn't trigger it), it never
  blocks my messages, and it de-duplicates so restating the same thing doesn't pile
  up reminders." ⚗️ Experimental — ships off by default; enabled on the dev agent
  first to prove it out before any fleet rollout.

## Summary of New Capabilities

| Capability | How to Use |
|-----------|-----------|
| A concrete future-action claim opens a tracked follow-through commitment | Automatic when `messaging.actionClaim.enabled` (off by default) |
| Idempotent — a restated claim updates one commitment, not many | Automatic (`externalKey` dedupe) |
| Diagnose "why did a commitment appear when I said I'd restart X?" | `GET /commitments` — that's this sentinel tracking the stated action |

## Evidence

- Unit (`tests/unit/action-claim.test.ts`, 9): classifier both sides — catches the
  founding "Relaunching now" + first-person near-future forms + participle
  normalization; rejects vague filler, past tense, quotes, AND
  imperatives/questions/third-person directed at others (the Phase-5 precision fix).
- Unit (`tests/unit/CommitmentTracker-externalKey-dedupe.test.ts`, 3): `record()`
  returns the existing open commitment on a repeated `externalKey`; distinct keys →
  distinct; no-key → unchanged behavior.
- Integration (`tests/integration/action-claim-route.test.ts`, 6): `POST
  /action-claim/observe` over the real HTTP pipeline — flag-off no-op, register,
  dedupe, non-claim no-op, per-topic cap, 400.
- Regression: 87 existing CommitmentTracker tests green; tsc clean; settings-template
  valid JSON. Phase-5 second-pass review raised a precision concern (third-person
  false positives) which was fixed and re-verified.
@@ -0,0 +1,109 @@
# Side-Effects Review — Action-Claim Follow-Through Sentinel (P2)

Spec: `docs/specs/action-claim-followthrough-sentinel.md` (converged + approved).
Change: a thin Stop hook posts each finished conversational turn to a new server
route `POST /action-claim/observe`, which classifies a CONCRETE future-action claim
("I'll restart it", "relaunching now") and opens an idempotent follow-through
commitment. Signal-only, dark by default. (A2 — completed-action verification —
is DESCOPED, tracked: no per-turn evidence primitive exists.)

Files:
- `src/core/action-claim.ts` — `classifyActionClaim` + `classifyDfSourceLocal`-style deterministic classifier (FD2/FD4).
- `src/monitoring/CommitmentTracker.ts` — `record()` idempotent `externalKey` create (FD3, the missing dedupe primitive).
- `src/server/routes.ts` — `POST /action-claim/observe` (flag-gated, server-side classify + idempotent create + per-topic cap + expiry; signal-only).
- `src/core/PostUpdateMigrator.ts` — `getActionClaimFollowthroughHook()` + migrateHooks deploy + migrateSettings Stop-register + migrateClaudeMd awareness.
- `src/templates/hooks/settings-template.json` — Stop entry (new agents).
- `src/scaffold/templates.ts` — generateClaudeMd awareness line.
- Tests: `tests/unit/action-claim.test.ts` (7), `tests/unit/CommitmentTracker-externalKey-dedupe.test.ts` (3), `tests/integration/action-claim-route.test.ts` (6).

## 1. Over-block
Nothing is blocked — the hook ALWAYS `exit(0)`; the route never blocks a send. The
only "over-fire" risk is registering a spurious commitment. Mitigated by FD2 (closed
concrete-verb set; vague filler like "I'll take a look" does NOT trigger; fail toward
NOT-registering on ambiguity) + FD3 (dedupe + auto-expiry + per-topic cap). Verified
by the unit truth-table (both sides) + the integration no-op/cap tests.

## 2. Under-block
Misses: (a) completed-action claims ("I already pushed it") — A2, deliberately
descoped (no per-turn evidence channel; tracked); (b) creatively-worded future
claims outside the closed verb set — accepted under-coverage (precision over recall,
since a false commitment nags). Both are the safe direction.

## 3. Level-of-abstraction fit
Correct. The thin hook mirrors the proven `response-review.js` Stop-hook siting; the
classifier + dedupe run SERVER-SIDE (a plain-JS hook can't import the TS classifier);
the dedupe lives IN `CommitmentTracker.record()` (the one writer of that store); the
follow-through rides the EXISTING PromiseBeacon + revival path rather than a new
mechanism. No new notification surface.

## 4. Signal vs authority compliance (docs/signal-vs-authority.md)
COMPLIANT. The hook is a pure side-effect POST that never emits `decision:block`
(always exit 0). The route opens a commitment (a signal/record) and never gates,
delays, or rewrites a message. The classifier is a brittle deterministic matcher used
ONLY as a signal — never as blocking authority. The whole feature is off by default.

## 5. Interactions
- Builds ON the existing `detectTimePromise`/PromiseBeacon path rather than a second
  classifier; the dedupe `externalKey` is tagged `actionclaim:` so the per-topic cap
  can count its own commitments without colliding with other externalKey users.
- `record()` idempotency is additive: absent `externalKey` → unchanged behavior
  (verified — 87 existing CommitmentTracker tests still pass + a no-key test).
- The Stop hook is registered AFTER the existing Stop hooks (stop-gate-router stays
  first); it can't shadow them (it never blocks).
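
The first Interactions bullet, a per-topic cap that counts only `actionclaim:`-tagged commitments, can be sketched as a simple filter. The function name `underPerTopicCap`, the `ActiveCommitment` shape, and the cap value passed in are assumptions for illustration; the diff does not state the real cap or the route's internals.

```typescript
// Illustrative only: count this feature's own commitments, ignore everyone else's.
const ACTION_CLAIM_PREFIX = "actionclaim:";

interface ActiveCommitment {
  topic: string;
  externalKey?: string;
}

function underPerTopicCap(
  active: ActiveCommitment[],
  topic: string,
  cap: number, // hypothetical parameter; the real cap value is not given in the notes
): boolean {
  const mine = active.filter(
    (c) => c.topic === topic && c.externalKey?.startsWith(ACTION_CLAIM_PREFIX) === true,
  );
  return mine.length < cap;
}
```

Commitments without a key, or with a key from another feature, never consume this feature's budget, so the cap cannot starve unrelated `externalKey` users.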

## 6. External surfaces
- New route `POST /action-claim/observe` (auth'd like all routes). New Stop hook
  registered in `.claude/settings.json` (new + existing agents via migrateSettings).
  New config keys under `messaging.actionClaim.*` (read with safe defaults).
- No change visible to other agents/users when the flag is off (the fleet default).

## 7. Multi-machine posture (Cross-Machine Coherence)
MACHINE-LOCAL BY DESIGN. The hook fires on the machine running the conversational
turn and registers a commitment in THAT machine's CommitmentTracker — exactly where
the turn happened and where the PromiseBeacon that follows it through runs. Commitment
cross-machine replication (if ever wanted) rides the existing `stateSync` family,
out of scope here. No URLs/notices that must survive a machine boundary.

## 8. Rollback cost
Trivial. The feature is off unless `messaging.actionClaim.enabled` is set; setting it
back to false (or absent) fully disables it — the hook no-ops at its first config
read, the route returns `feature-disabled`. The `record()` dedupe is inert without an
`externalKey`. No migration, no data repair.

## Decisions (mine, per the run's full preapproval)
- **No `migrateConfig` entry for the flag.** Absent = off is the correct dark default,
  consistent with how the resume-queue keys are deliberately kept out of ConfigDefaults
  to preserve the fleet flip. The dev agent enables `messaging.actionClaim.enabled`
  explicitly to soak before any fleet default flip (a separate reviewed decision).
- **A2 descoped + tracked** — building it would lean on a per-turn evidence primitive
  that doesn't exist (the P1 lesson); the founding incident was a future-action claim,
  which the v1 feature covers.

## Test coverage (Testing Integrity)
- Unit: classifier truth-table both sides (FD2/FD4) + dedupe idempotency (FD3).
- Integration: `POST /action-claim/observe` over the real HTTP pipeline — flag-off
  no-op, register, dedupe (restated claim → same commitment), non-claim no-op,
  per-topic cap, 400 on bad input.
- E2E (`tests/e2e/action-claim-lifecycle.test.ts`, 2): boots a REAL Express server on
  a real port and hits `POST /action-claim/observe` — the "feature is alive"
  assertion (200, not 404/503) + a concrete claim opens a real commitment
  (`getActive()` for the topic) + a benign message registers nothing.
- Regression: 87 existing CommitmentTracker tests green; tsc clean; settings-template valid JSON.

## Second-pass review
**Concern raised → FIXED → re-verified.** The independent Phase-5 reviewer confirmed
signal-only (hook always exit 0, route never blocks), correct `record()` idempotency
(early-return before id/emit; `getActive()` excludes terminal so a terminal same-key
mints fresh; CAS untouched; 87 existing tests green), per-topic cap counts only
`actionclaim:`-tagged commitments, and full Migration Parity wiring — BUT found a real
FD2 precision bug: the third classifier regex (bare present-participle + trailer) was
NOT first-person scoped, so it false-positived on imperatives/questions/third-person
("Did you restart it?", "Please merge the PR", "He is deploying it", "The script
reverts …"). Left unfixed, enabling the flag would mint spurious follow-through
commitments — the exact false-commitment-nag FD2 exists to prevent.

FIX: the third regex now requires a SENTENCE-INITIAL PARTICIPLE (`(?:^|[.!?]\s+)` +
the `-ing` form only) — keeps the founding "Relaunching now" / "Done. Pushing it now."
and rejects all eight flagged false positives. Re-verified: classifier unit tests
9/9 (added the third-person/imperative/interrogative rows + a sentence-initial-after-
boundary row), route integration 6/6 — 15 green. Verdict after fix: concur.
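
The sentence-initial anchoring in the fix can be tried out in isolation. This sketch reuses the quoted prefix `(?:^|[.!?]\s+)` and restricts matches to `-ing` forms; the verb list is a hypothetical sample built from verbs mentioned in these notes, not the shipped closed set, and the trailer check and the other rejections (quotes, past tense, vague filler) are deliberately left out.

```typescript
// Illustrative only: the anchoring piece of the third regex, with an assumed verb sample.
const PARTICIPLES_SKETCH = [
  "restarting",
  "relaunching",
  "pushing",
  "deploying",
  "merging",
  "reverting",
];

// Match a listed participle only at the start of the turn or right after . ! ?
const SENTENCE_INITIAL_PARTICIPLE = new RegExp(
  `(?:^|[.!?]\\s+)(?:${PARTICIPLES_SKETCH.join("|")})\\b`,
  "i",
);

function looksLikeParticipleClaim(turn: string): boolean {
  return SENTENCE_INITIAL_PARTICIPLE.test(turn.trim());
}
```

The four false positives quoted above fail this check for two different reasons: "Did you restart it?" and "Please merge the PR" contain no `-ing` form, while "He is deploying it" has the participle mid-sentence. On its own the anchor is not sufficient (a sentence that merely starts with a listed participle would still match), which is why the real classifier layers further checks on top.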
@@ -0,0 +1,114 @@
# Side-Effects Review — autonomous-run-outlives-session

Spec: `docs/specs/autonomous-run-outlives-session.md` (converged + approved).
Change: GAP-D — the resume-queue host-lock distinguishes a single-host RENAME
(auto-heal) from a genuine shared-volume conflict (stay disabled), fail-closed;
a disabled revival queue self-reports to the guard-posture inventory; + the
constitutional standard "An Autonomous Run Must Outlive Its Session".

Files:
- `src/monitoring/ResumeQueue.ts` — `classifyDfSourceLocal` + `isStateDirHostLocalDefault` (FD1), foreign-host rename-vs-conflict classifier (FD2), `takeOverLockAtomic` (FD4), `guardStatus()` (D2), `autoHealStaleHostLock` config field (FD5).
- `src/monitoring/guardManifest.ts` — `GUARD_MANIFEST` entry `monitoring.resumeQueue.enabled` (component `ResumeQueue`).
- `src/commands/server.ts` — dev-gate resolves `autoHealStaleHostLock`; UNCONDITIONAL `guardRegistry.register` for the queue.
- `src/core/types.ts` — `autoHealStaleHostLock?` config field.
- `docs/STANDARDS-REGISTRY.md` — the new standard.
- `src/scaffold/templates.ts` + `src/core/PostUpdateMigrator.ts` — Agent Awareness line (new + deployed agents).
- Tests: `tests/unit/resume-queue-autoheal-lock.test.ts` (11), `tests/integration/resume-queue-guard-posture.test.ts` (3).

## 1. Over-block (what legitimate inputs does this reject that it shouldn't?)
The auto-heal is STRICTLY ADDITIVE and gated: it can only turn a currently-DISABLED
foreign-host case into an enabled one. It never disables a case that previously
worked. The risk direction is "fails to heal a legitimate rename" → the queue
stays disabled exactly as today (no regression), just with a louder surface.
Fail-closed on any uncertainty (unknown FS, df failure, live pid, fresh heartbeat)
means some genuine renames won't auto-heal — acceptable: the operator clears the
lock manually as before, and the guard-posture alert now tells them to.

## 2. Under-block (what failure modes does this still miss?)
- pid recycling (FD3, accepted): a recycled dead pid that maps to a live unrelated
  process reads as a live conflict → stays disabled + LOUD (safe direction; worst
  case a false escalation, never corruption).
- The narrow double-boot unlink race in `takeOverLockAtomic` (two server boots on
  one machine within ms of each other post-rename): O_EXCL gives EEXIST to the
  loser in the common case; the residual double-unlink window is backstopped by
  the next-acquire live-pid + heartbeat check. Not corruption — at worst a
  transient re-evaluation.
- Genuine shared-volume setups where `df -P` reports a device string we don't
  recognize as network: classified unknown → NOT local → stays disabled (correct).
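
The O_EXCL first-writer-wins takeover and its residual unlink window, discussed in the second bullet above, can be sketched with Node's `fs` module, where the `"wx"` open flag fails if the path already exists. The function names `createLockExclusive` and `takeOverLockSketch` and the lock payload shape are assumptions; the shipped `takeOverLockAtomic` is not shown in this diff and may differ.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Illustrative only: hypothetical lock payload.
interface LockOwner {
  host: string;
  pid: number;
}

// Create the lock file only if it does not exist ("wx" = O_CREAT | O_EXCL semantics).
function createLockExclusive(lockPath: string, owner: LockOwner): boolean {
  let fd: number;
  try {
    fd = fs.openSync(lockPath, "wx");
  } catch {
    // EEXIST means another writer won; any other error also fails closed.
    return false;
  }
  try {
    fs.writeSync(fd, JSON.stringify({ ...owner, heartbeat: Date.now() }));
  } finally {
    fs.closeSync(fd);
  }
  return true;
}

// Remove the stale lock, then race to recreate it; the first creator wins.
function takeOverLockSketch(lockPath: string, owner: LockOwner): boolean {
  try {
    fs.unlinkSync(lockPath);
  } catch (err) {
    // A missing file is fine (someone else already unlinked); anything else fails closed.
    if ((err as { code?: string }).code !== "ENOENT") return false;
  }
  return createLockExclusive(lockPath, owner);
}
```

The residual window is visible in the structure: a second process could unlink the file the first one just created before creating its own. The sketch does nothing about that, mirroring the review's position that the next acquire's live-pid and heartbeat check is the backstop.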

## 3. Level-of-abstraction fit
Correct layer. The lock classifier lives IN `ResumeQueue.acquireLock` (the only
place that owns the lock), and the surfacing rides the EXISTING guard-posture
inventory (GUARD_MANIFEST + GuardRegistry + GuardPostureProbe) rather than a new
parallel alert path. No new notification surface invented — it feeds the one that
already aggregates and dedups (Bounded Notification Surface).

## 4. Signal vs authority compliance (docs/signal-vs-authority.md)
COMPLIANT. The auto-heal is bounded SELF-RECOVERY of the queue's own lock with a
fail-closed default — not a brittle gate holding blocking authority over agent
behavior or message flow. The guard-posture surfacing is a pure SIGNAL-producer
(it reports a disabled state; it never blocks, delays, or rewrites anything). The
default `autoHealStaleHostLock:false` keeps the behavior change off the fleet until
proven; the dev-agent runs it dryRun-first (logs intent without rewriting).

## 5. Interactions
- Preserves the original HARD INVARIANT (never pid-probe a foreign lock) when
  auto-heal is OFF — verified by the existing `resume-queue.test.ts:417` invariant
  test (which initially regressed and was fixed by gating all probing behind
  `autoHealStaleHostLock`).
- The new GUARD_MANIFEST entry passes `lint-guard-manifest` (the drainer is not
  auto-flagged, so no orphan NOT_A_GUARD entry — which would itself fail the lint).
- `guardRegistry.register` is UNCONDITIONAL (even when start() returns false) so a
  lock-disabled queue reads `off-runtime-divergent`, not `missing`.
- Does NOT touch `evidenceEligible` / the #1157 revival path — strictly the lock
  gate. No double-fire with the existing same-host stale reclaim (that path is
  unchanged; this is the foreign-host branch only).
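
Why unconditional registration matters can be shown with a toy posture derivation. Only the labels `off-runtime-divergent` and `missing` come from these notes; the `on` and `off-config` labels, the `GuardStatusSketch` shape, and `deriveGuardRowSketch` are hypothetical and are not the real `deriveGuardRow`.

```typescript
// Illustrative only: "on" and "off-config" are assumed labels for the other rows.
type GuardPosture = "on" | "off-config" | "off-runtime-divergent" | "missing";

interface GuardStatusSketch {
  running: boolean;
  disabledReason?: string;
}

function deriveGuardRowSketch(
  configEnabled: boolean,
  registered: GuardStatusSketch | undefined,
): GuardPosture {
  // Never registered: the inventory cannot tell a crash from a guard that never existed.
  if (registered === undefined) return "missing";
  if (!configEnabled) return "off-config";
  // Config says on but the component refused to start: the divergence worth surfacing.
  return registered.running ? "on" : "off-runtime-divergent";
}
```

If registration were skipped whenever `start()` returned false, a lock-disabled queue would fall into the first branch and read as `missing`, losing the disabled reason that the guards view is meant to surface.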

## 6. External surfaces
- New config key `monitoring.resumeQueue.autoHealStaleHostLock` (fleet default
  false). No new route. `GET /guards` and `GET /sessions/resume-queue` gain a
  truthful disabled-state read; no schema break (additive).
- CLAUDE.md template + migrator add one awareness bullet (new + deployed agents).
- No external network/timing dependence beyond a single bounded (3000ms) `df -P`
  at lock-acquisition.

## 7. Multi-machine posture (Cross-Machine Coherence)
MACHINE-LOCAL BY DESIGN, and that is the whole point: the resume-queue lock + its
state dir are deliberately host-local (a shared volume across two hosts is
unsupported — the invariant this change PROTECTS). The fix makes the host-local
assumption ROBUST to a rename of the SAME machine without ever weakening the
cross-host protection (a genuine foreign live host still disables). guardStatus is
read per-machine; each machine's `/guards` reports its own queue. No replication
needed or wanted (a lock is intrinsically local).

## 8. Rollback cost
Cheap and immediate. `monitoring.resumeQueue.autoHealStaleHostLock:false` (the
fleet default) fully disables the new auto-heal — reverting to today's
disable-on-mismatch behavior — with no restart-data implications (config read at
queue construction; next server start picks it up). The guard-posture surfacing is
inert when the queue is healthy and harmless when disabled (it only reads state).
No migration, no data repair. The constitutional-standard doc + CLAUDE.md lines are
documentation (no runtime surface).

## Test coverage (Testing Integrity)
- Unit: `resume-queue-autoheal-lock.test.ts` — FD1 truth-table; auto-heal on
  provable rename; stays-disabled on non-local FS / live pid / fresh heartbeat;
  dryRun no-rewrite; auto-heal-off preserves original behavior; guardStatus.
- Integration: `resume-queue-guard-posture.test.ts` — a runtime-disabled queue
  classifies `off-runtime-divergent` through the real GUARD_MANIFEST entry +
  `deriveGuardRow` (the route's path); a healthy queue does not.
- E2E: the existing `tests/e2e/resume-idle-autonomous-lifecycle.test.ts` exercises
  the queue alive end-to-end; this change is additive and those pass. (A dedicated
  boot-with-stale-lock E2E is a candidate enhancement; the unit+integration tiers
  cover the new logic and its wiring.)
- Regression: full resume-queue unit + route suite (100 tests) green; tsc clean;
  lint-guard-manifest clean.

## Second-pass review
**Concur with the review.** Independent Phase-5 audit (guard/recovery path) verified, citing code:
1. Auto-heal can NEVER fire on a genuine shared volume with a live remote holder — `fsLocal` is dispositive and `&&`-short-circuits before any pid probe; `df -P` on a network mount never reports `/dev/*`.
2. The HARD INVARIANT (never pid-probe a foreign lock) is preserved when auto-heal is OFF (the fleet default) — all probing is gated behind `if (this.cfg.autoHealStaleHostLock)`; the existing invariant test (default config) still asserts `probed===false`.
3. `takeOverLockAtomic` O_EXCL first-writer-wins is correct; the only residual is the documented narrow double-boot unlink window, backstopped by the next-acquire live-pid+heartbeat check — transient/self-correcting, never durable corruption.
4. Signal-vs-Authority compliant — the only authority is the queue refusing to start itself (bounded self-recovery, fail-closed); `guardStatus()` is a pure signal producer.
5. No common-path regression — a healthy boot never enters the foreign-host branch and never calls `df`.
Verdict: sound, fail-closed in the right direction, well-tested on both sides of every decision boundary, safely gated.