instar 0.28.65 → 0.28.67
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/cli.js +53 -0
- package/dist/cli.js.map +1 -1
- package/dist/core/AutoDispatcher.d.ts +4 -1
- package/dist/core/AutoDispatcher.d.ts.map +1 -1
- package/dist/core/AutoDispatcher.js +5 -4
- package/dist/core/AutoDispatcher.js.map +1 -1
- package/dist/core/AutoUpdater.d.ts +6 -1
- package/dist/core/AutoUpdater.d.ts.map +1 -1
- package/dist/core/AutoUpdater.js +7 -4
- package/dist/core/AutoUpdater.js.map +1 -1
- package/dist/lifeline/LifelineHealthWatchdog.d.ts +81 -0
- package/dist/lifeline/LifelineHealthWatchdog.d.ts.map +1 -0
- package/dist/lifeline/LifelineHealthWatchdog.js +122 -0
- package/dist/lifeline/LifelineHealthWatchdog.js.map +1 -0
- package/dist/lifeline/RestartOrchestrator.d.ts +73 -0
- package/dist/lifeline/RestartOrchestrator.d.ts.map +1 -0
- package/dist/lifeline/RestartOrchestrator.js +124 -0
- package/dist/lifeline/RestartOrchestrator.js.map +1 -0
- package/dist/lifeline/TelegramLifeline.d.ts +55 -1
- package/dist/lifeline/TelegramLifeline.d.ts.map +1 -1
- package/dist/lifeline/TelegramLifeline.js +364 -41
- package/dist/lifeline/TelegramLifeline.js.map +1 -1
- package/dist/lifeline/droppedMessages.d.ts +67 -0
- package/dist/lifeline/droppedMessages.d.ts.map +1 -0
- package/dist/lifeline/droppedMessages.js +179 -0
- package/dist/lifeline/droppedMessages.js.map +1 -0
- package/dist/lifeline/forwardErrors.d.ts +38 -0
- package/dist/lifeline/forwardErrors.d.ts.map +1 -0
- package/dist/lifeline/forwardErrors.js +53 -0
- package/dist/lifeline/forwardErrors.js.map +1 -0
- package/dist/lifeline/rateLimitState.d.ts +63 -0
- package/dist/lifeline/rateLimitState.d.ts.map +1 -0
- package/dist/lifeline/rateLimitState.js +110 -0
- package/dist/lifeline/rateLimitState.js.map +1 -0
- package/dist/lifeline/retryWithBackoff.d.ts +28 -0
- package/dist/lifeline/retryWithBackoff.d.ts.map +1 -0
- package/dist/lifeline/retryWithBackoff.js +34 -0
- package/dist/lifeline/retryWithBackoff.js.map +1 -0
- package/dist/lifeline/startupMarker.d.ts +20 -0
- package/dist/lifeline/startupMarker.d.ts.map +1 -0
- package/dist/lifeline/startupMarker.js +52 -0
- package/dist/lifeline/startupMarker.js.map +1 -0
- package/dist/lifeline/versionHandshake.d.ts +40 -0
- package/dist/lifeline/versionHandshake.d.ts.map +1 -0
- package/dist/lifeline/versionHandshake.js +45 -0
- package/dist/lifeline/versionHandshake.js.map +1 -0
- package/dist/messaging/shared/compactionResumePayload.d.ts +1 -1
- package/dist/messaging/shared/compactionResumePayload.d.ts.map +1 -1
- package/dist/messaging/shared/compactionResumePayload.js +14 -5
- package/dist/messaging/shared/compactionResumePayload.js.map +1 -1
- package/dist/server/routes.d.ts.map +1 -1
- package/dist/server/routes.js +58 -1
- package/dist/server/routes.js.map +1 -1
- package/package.json +1 -1
- package/src/data/builtin-manifest.json +82 -82
- package/upgrades/0.28.66.md +44 -0
- package/upgrades/0.28.67.md +58 -0
- package/upgrades/side-effects/0.28.65.md +59 -0
- package/upgrades/side-effects/0.28.66.md +130 -0
- package/upgrades/side-effects/lifeline-message-drop-stage-a.md +155 -0
- package/upgrades/side-effects/lifeline-self-restart-stage-b.md +129 -0
- package/upgrades/NEXT.md +0 -53
|
@@ -0,0 +1,44 @@
|
|
|
1
|
+
# Upgrade Guide — vNEXT
|
|
2
|
+
|
|
3
|
+
<!-- bump: patch -->
|
|
4
|
+
|
|
5
|
+
## What Changed
|
|
6
|
+
|
|
7
|
+
Tightens the instruction prompt injected into a session after context compaction. Two empirical failure modes on active sessions (topic 6795, 2026-04-20) drove this change:
|
|
8
|
+
|
|
9
|
+
1. **Tone failure**: recovered agents self-narrated compaction as "I lost track of what we were working on" / "I got lost for a second" — alarming phrasing for what was a routine pause-and-resume. The agent had full context; the phrasing was the issue.
|
|
10
|
+
2. **Intent failure**: when the user's last message was a delegated decision ("Your call"), recovered agents regenerated a status summary and re-offered the same options — handing "Your call" back to the user for an infinite ping-pong.
|
|
11
|
+
|
|
12
|
+
Both traced to a single loose line in `COMPACTION_RESUME_PREAMBLE`: "Briefly let the user know compaction occurred, then continue the conversation naturally." Open-ended "let the user know" produced free-form alarming narration; "continue naturally" triggered the status-summary reflex.
|
|
13
|
+
|
|
14
|
+
Fix: `COMPACTION_RESUME_PREAMBLE` in `src/messaging/shared/compactionResumePayload.ts` now:
|
|
15
|
+
|
|
16
|
+
- Prescribes the acknowledgment phrasing: "your session paused for context compaction and has now resumed." Explicitly forbids "lost track," "got lost," "got confused," "lost your place."
|
|
17
|
+
- Directs the agent to respond to the user's **most recent message** — answer questions, make delegated decisions, do NOT reconstruct a generic status summary or re-offer options already delegated back.
|
|
18
|
+
- Instructs the agent to assume continuity with any in-progress work in the context block.
|
|
19
|
+
|
|
20
|
+
The over-threshold (file-reference) branch in `prepareInjectionText` carries the same guardrails, so long-context recoveries get the same instruction as short ones.
|
|
21
|
+
|
|
22
|
+
No change to `findLastRealMessage` (0.28.51), `isSystemOrProxyMessage` (0.28.51), `formatContextForSession` (0.28.52), or any other compaction-recovery plumbing. Pure preamble text change + 7 new regression tests (18 total on the file).
|
|
23
|
+
|
|
24
|
+
## What to Tell Your User
|
|
25
|
+
|
|
26
|
+
- **Calmer compaction recoveries**: "When I pause to compress older parts of our conversation and come back, I won't say alarming things like 'I lost track' or 'I got confused' — because I didn't. It's a routine pause, and I'll just tell you that."
|
|
27
|
+
- **No more ping-pong on delegated decisions**: "If you hand a decision back to me and I happen to pause for compaction right after, when I come back I'll actually make the call instead of re-offering you the same choices."
|
|
28
|
+
|
|
29
|
+
## Summary of New Capabilities
|
|
30
|
+
|
|
31
|
+
| Capability | How to Use |
|
|
32
|
+
|-----------|-----------|
|
|
33
|
+
| Tighter compaction-recovery tone | Automatic — applies to every session recovery on 0.28.66+ |
|
|
34
|
+
|
|
35
|
+
## Evidence
|
|
36
|
+
|
|
37
|
+
Reproduction: topic 6795 on 2026-04-20, two screenshots at ~12:04 PM and ~12:12 PM showing both failure modes on active sessions running 0.28.52.
|
|
38
|
+
|
|
39
|
+
- Mew topic (session-robustness): recovered agent opened with "Quick heads-up: I lost track of what we were working on for a second, but I found my notes and caught back up" — user flagged as alarming.
|
|
40
|
+
- Bob topic (instar-agent-robustness): user had said "Your call." Recovered agent's response ended with "Your call." on the same two options — infinite delegation ping-pong.
|
|
41
|
+
|
|
42
|
+
Root cause traced to the open-ended phrasing of `COMPACTION_RESUME_PREAMBLE` in 0.28.52. After the fix in 0.28.66, the preamble explicitly prohibits the alarming self-narration phrases and explicitly requires responding to the user's most recent message (including making delegated decisions). 18 unit tests pin the invariants (presence, order, prohibition language, file-reference branch parity).
|
|
43
|
+
|
|
44
|
+
Cannot reproduce the end-to-end tone failure in an automated dev test without spinning up a real compaction event, which requires a long-running session — the unit tests are the strongest evidence achievable without waiting for a natural compaction in an active session. Live verification will come from the next compaction on the affected topics once the hot-patch lands.
|
|
@@ -0,0 +1,58 @@
|
|
|
1
|
+
# Upgrade Guide — vNEXT
|
|
2
|
+
|
|
3
|
+
<!-- bump: minor -->
|
|
4
|
+
|
|
5
|
+
## What Changed
|
|
6
|
+
|
|
7
|
+
Stage B of the lifeline robustness work. Two new self-healing mechanisms, both addressing the same pattern observed in the Bob (2026-04-19) and Dawn (2026-04-20) incidents: a long-running lifeline drifting into a state where it receives Telegram updates but cannot forward them, rescued only by a human operator.
|
|
8
|
+
|
|
9
|
+
**1. Version handshake.** The Telegram lifeline includes its semver in every `/internal/telegram-forward` request as the JSON field `lifelineVersion`. The server validates structurally (regex + 64-char cap) and compares MAJOR/MINOR with its own cached version. On mismatch, server returns 426 Upgrade Required with a reconstructed canonical `serverVersion`. The lifeline's forward path uses typed errors (`ForwardVersionSkewError`, `ForwardBadRequestError`, `ForwardServerBootError`, `ForwardTransientError`) so 426 short-circuits Stage A's retry (`isTerminal` predicate) and triggers a restart request through the new `RestartOrchestrator`. PATCH drift > 10 emits a pure-observability `TelegramLifeline.versionSkewInfo` signal with no blocking effect. Missing `lifelineVersion` is accepted (backward compat with pre-Stage-B lifelines) and emits a `TelegramLifeline.versionMissing` observability signal. Dev mode (empty authToken) skips the handshake to avoid an unauth'd fingerprinting channel.
|
|
10
|
+
|
|
11
|
+
**2. Stuck-loop watchdog + RestartOrchestrator.** New `LifelineHealthWatchdog` runs every 30 s and tracks three deterministic signals:
|
|
12
|
+
- `noForwardStuck` — oldest queued message older than 10 min (anchored on `QueuedMessage.timestamp`, NOT "time since last success" — the latter would crash-loop low-traffic agents)
|
|
13
|
+
- `consecutiveFailures` — >20 consecutive non-2xx responses from `forwardToServer`
|
|
14
|
+
- `conflict409Stuck` — consecutive409s pinned >5 min (0→>0 edge timestamp in `TelegramLifeline.poll`)
|
|
15
|
+
|
|
16
|
+
Signals fire in fixed priority (`conflict409Stuck > noForwardStuck > consecutiveFailures`) producing exactly one DegradationReporter event per restart. Latched signals re-evaluate at rate-limit window expiry. The `noForwardStuck` signal is suppressed when `supervisor.getStatus().healthy === false` to avoid double-firing with existing server-down recovery.
|
|
17
|
+
|
|
18
|
+
The new `RestartOrchestrator` is a single-owner state machine (`idle → quiescing → persisting → exiting`) that serializes all restart initiators (watchdog tick, 426 handler, external SIGTERM). Step 1 quiesces Telegram polling, replay, and watchdog timers BEFORE persist so the queue snapshot is causally consistent. Step 2 emits the signal. Step 3 persists all state files in parallel (2 s budget). Hard-kill `setTimeout(5000)` guard fires `process.exit(1)` if persist hangs. Rate limit: one restart per 10 min per bucket, with versionSkew additionally capped at 3 per 24 h. `state/last-self-restart-at.json` holds a 50-entry history ring buffer with 0600 mode and atomic tmp+rename writes; fail-closed on corruption, allow-and-overwrite on future timestamps (breaks a deadlock). Storm escalation: 6 restarts within 1 h fires a distinct `TelegramLifeline.restartStorm` signal outside normal per-feature cooldown.
|
|
19
|
+
|
|
20
|
+
**3. Supporting infrastructure.** New `state/lifeline-started-at.json` pid-bearing marker written on every startup (cold boot, self-restart, external kickstart — all paths). The new `instar lifeline restart` CLI polls this marker (pid delta) rather than `last-self-restart-at.json` because `launchctl kickstart` is an external restart that doesn't invoke the self-restart code path. CLI also checks `.instar/shadow-install/.updating` lockfile (waits up to 60 s) to avoid respawning against a half-written install. Unsupervised mode (not under launchd, no `INSTAR_SUPERVISED=1`) emits signals and logs but skips `process.exit` to keep local testing sane. Thresholds configurable under `lifeline.watchdog.*` in project config; invalid values fall back to defaults + emit `TelegramLifeline.configInvalid` signal. `state/last-self-restart-at.json` is excluded from backup snapshots (machine-local operational state).
|
|
21
|
+
|
|
22
|
+
**Signal-vs-authority compliance.** DP1 (server-side policy) is an API-boundary structural validator under the hard-invariant exemption. DP2 (lifeline-side restart policy) is operational self-heal on the lifeline's own process — it constrains no other agent's behavior, filters no message flow, and blocks no user action. The sole output is "I, the lifeline, restart myself." Restart is fully reversible via launchd respawn.
|
|
23
|
+
|
|
24
|
+
## What to Tell Your User
|
|
25
|
+
|
|
26
|
+
- **Self-healing for stuck helpers**: "I can now notice if my Telegram helper gets stuck in a weird state and restart myself, instead of going silent."
|
|
27
|
+
- **Version coordination**: "If my main program and my helper end up on different versions, they now work it out automatically — my helper restarts to pick up the right code."
|
|
28
|
+
- **Safer shutdown**: "When I restart myself, I now pause everything first so no messages get lost in the middle of saving."
|
|
29
|
+
|
|
30
|
+
## Summary of New Capabilities
|
|
31
|
+
|
|
32
|
+
| Capability | How to Use |
|
|
33
|
+
|-----------|-----------|
|
|
34
|
+
| Version handshake on forward-to-server | Automatic — included in every `/internal/telegram-forward` request |
|
|
35
|
+
| Stuck-loop self-restart | Automatic — every 30 s watchdog tick |
|
|
36
|
+
| `instar lifeline restart` | CLI command — triggers `launchctl kickstart`, polls pid-marker |
|
|
37
|
+
| Storm-escalation alert | Automatic — fires after 6 restarts in 1 h |
|
|
38
|
+
| Configurable watchdog thresholds | `.instar/config.json` → `lifeline.watchdog.*` |
|
|
39
|
+
|
|
40
|
+
## Evidence
|
|
41
|
+
|
|
42
|
+
This change includes two categories of coverage:
|
|
43
|
+
|
|
44
|
+
**Unit tests (84 new tests across 7 files, all passing):**
|
|
45
|
+
- `tests/unit/lifeline/versionHandshake.test.ts` — 15 tests covering parseVersion semantics (regex, length cap, malformed rejection) and compareVersions semantics (match, patch-info boundary at exactly 10, MAJOR/MINOR rejection, canonical serverVersion reconstruction).
|
|
46
|
+
- `tests/unit/lifeline/rateLimitState.test.ts` — 17 tests covering read outcomes (missing / corrupt / future / ok), decide semantics (cooldown, version-skew-daily-cap, storm detection), atomic write with 0600 mode, history ring-buffer cap.
|
|
47
|
+
- `tests/unit/lifeline/forwardErrors.test.ts` — 4 tests covering isTerminal classification.
|
|
48
|
+
- `tests/unit/lifeline/startupMarker.test.ts` — 5 tests covering write/read round-trip and fail-safe reads.
|
|
49
|
+
- `tests/unit/lifeline/LifelineHealthWatchdog.test.ts` — 9 tests covering idle-agent safety (empty queue never trips), `noForwardStuck` only fires with non-empty queue AND old oldest-item AND healthy server, consecutiveFailures threshold, conflict409Stuck threshold, priority ordering when multiple signals trip, signal latching + de-cross, starvation signal.
|
|
50
|
+
- `tests/unit/lifeline/RestartOrchestrator.test.ts` — 5 tests covering state progression (idle→quiescing→persisting→exiting), re-entrance suppression, unsupervised-mode skip-exit, supervised-mode exit code 0, hard-kill timer fires on hung persist.
|
|
51
|
+
- `tests/unit/lifeline/retryWithBackoff.test.ts` — 1 new test: isTerminal short-circuit consumes zero additional attempts.
|
|
52
|
+
- `tests/unit/server/telegramForwardHandshake.test.ts` — 8 tests covering accept-on-match, 426 on MAJOR/MINOR mismatch with reconstructed serverVersion, 400 on malformed/over-long input (no echo), accept on absent lifelineVersion (backward compat), dev-mode empty-authToken path, 503 on unparseable serverVersion.
|
|
53
|
+
|
|
54
|
+
**Live verification — not reproducible in dev**: The original Bob/Dawn failure modes (7-day lifeline accumulating protocol state, version skew post-`npm i`) require wall-clock durations and deployment races that cannot be induced locally. Stage C (chaos tests) is queued as the follow-up fix that will exercise these paths under simulated failure. Stage B's correctness is gated on the 84 unit tests above plus the 4-round convergent review (4 internal lenses + GPT-5.4 + Gemini-3.1-Pro + Grok-4.1-Fast) that surfaced and closed 28 material findings before implementation began.
|
|
55
|
+
|
|
56
|
+
Side-effects artifact: `upgrades/side-effects/lifeline-self-restart-stage-b.md`
|
|
57
|
+
Spec: `docs/specs/LIFELINE-SELF-RESTART-STAGE-B-SPEC.md`
|
|
58
|
+
Convergence report: `docs/specs/reports/lifeline-self-restart-stage-b-convergence.md`
|
|
@@ -0,0 +1,59 @@
|
|
|
1
|
+
# Side-Effects Review — v0.28.65
|
|
2
|
+
|
|
3
|
+
**Version / slug:** `0.28.65`
|
|
4
|
+
**Date:** `2026-04-20`
|
|
5
|
+
**Author:** `echo (backfill)`
|
|
6
|
+
**Second-pass reviewer:** `not required — backfill for baseline artifact drift`
|
|
7
|
+
|
|
8
|
+
## Summary of the change
|
|
9
|
+
|
|
10
|
+
Backfill artifact for v0.28.65 to satisfy the pre-push gate. The v0.28.65 release shipped on main as commit `2478800` with upgrade notes (`upgrades/0.28.65.md`) but without a paired side-effects artifact, creating baseline drift that caused the pre-push gate's release-level artifact check to fail on every subsequent branch. This file closes that drift retroactively.
|
|
11
|
+
|
|
12
|
+
The code shipped in v0.28.65 was a threadline rapid-fire pipe-guard fix (per `upgrades/0.28.65.md`) — a guard against repeated session spawns for the same thread racing each other to death. Relevant files: `src/threadline/*` (per the commit `8187dde`).
|
|
13
|
+
|
|
14
|
+
## Decision-point inventory
|
|
15
|
+
|
|
16
|
+
- `ThreadlineDispatcher same-thread pipe guard` — **modify** — Structural concurrency guard; not a message-flow judgment call.
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## 1. Over-block
|
|
21
|
+
|
|
22
|
+
No legitimate threadline dispatches are rejected. The guard only delays a rapid-fire duplicate-thread spawn, not a first dispatch.
|
|
23
|
+
|
|
24
|
+
## 2. Under-block
|
|
25
|
+
|
|
26
|
+
Long-lived stuck pipes across process restarts remain uncaught; that is separately handled by session-level cleanup.
|
|
27
|
+
|
|
28
|
+
## 3. Level-of-abstraction fit
|
|
29
|
+
|
|
30
|
+
Guard lives at the dispatcher level — correct layer for "these two spawn attempts are for the same thread."
|
|
31
|
+
|
|
32
|
+
## 4. Signal vs authority compliance
|
|
33
|
+
|
|
34
|
+
**Required reference:** [docs/signal-vs-authority.md](../../docs/signal-vs-authority.md)
|
|
35
|
+
|
|
36
|
+
- [x] No — this change has no message-flow block/allow surface.
|
|
37
|
+
|
|
38
|
+
This is a concurrency guard on session-spawn, not a judgment on message content. Exempted from the principle.
|
|
39
|
+
|
|
40
|
+
## 5. Interactions
|
|
41
|
+
|
|
42
|
+
- **Shadowing:** None.
|
|
43
|
+
- **Double-fire:** The guard PREVENTS double-fire — that's its job.
|
|
44
|
+
- **Races:** Explicitly addressed by the guard.
|
|
45
|
+
- **Feedback loops:** None.
|
|
46
|
+
|
|
47
|
+
## 6. External surfaces
|
|
48
|
+
|
|
49
|
+
No external surface changes beyond the intended behavior (first spawn wins, duplicates are rejected until cleanup).
|
|
50
|
+
|
|
51
|
+
## 7. Rollback cost
|
|
52
|
+
|
|
53
|
+
Pure code change. Rollback is revert + patch.
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
57
|
+
## Conclusion
|
|
58
|
+
|
|
59
|
+
Backfill artifact. The real review happened at the PR review stage for commit `8187dde`. This file is present solely to unblock the pre-push gate on future branches.
|
|
@@ -0,0 +1,130 @@
|
|
|
1
|
+
# Side-Effects Review — compaction-recovery preamble tightening
|
|
2
|
+
|
|
3
|
+
**Version / slug:** `0.28.66`
|
|
4
|
+
**Date:** `2026-04-20`
|
|
5
|
+
**Author:** `echo`
|
|
6
|
+
**Second-pass reviewer:** `required — compaction / session-lifecycle surface`
|
|
7
|
+
|
|
8
|
+
## Summary of the change
|
|
9
|
+
|
|
10
|
+
Rewrites the `COMPACTION_RESUME_PREAMBLE` string in `src/messaging/shared/compactionResumePayload.ts` and the mirror instruction in `prepareInjectionText`'s over-threshold branch. No code-flow change, no new helper, no new decision point — the change is entirely in the text of the instruction a compaction-recovered agent reads on wake-up. Fixes two empirically observed failure modes on topic 6795 (2026-04-20): (a) agents self-narrating compaction as "I lost track / got lost / got confused" instead of a calm "paused for context compaction, now resumed"; (b) agents reconstructing a generic status-summary + re-offering options the user had already delegated back (the "Your call → Your call" ping-pong). Tests added in `tests/unit/compactionResumePayload.test.ts` pin the two invariants on both the inline and file-reference branches.
|
|
11
|
+
|
|
12
|
+
## Decision-point inventory
|
|
13
|
+
|
|
14
|
+
- `COMPACTION_RESUME_PREAMBLE` (text fed to recovered agent) — **modify** — tightens instruction content; no change to when or how the preamble is emitted.
|
|
15
|
+
- `prepareInjectionText` over-threshold branch (text fed to recovered agent when payload > 500 chars) — **modify** — same tightening, applied to the file-reference variant so both branches give the agent the same guardrails.
|
|
16
|
+
- No runtime gate, filter, or authority is added, removed, or modified. The recovered agent remains the only authority; this change is pure input to it.
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## 1. Over-block
|
|
21
|
+
|
|
22
|
+
**What legitimate inputs does this change reject that it shouldn't?**
|
|
23
|
+
|
|
24
|
+
No block/allow surface — over-block not applicable. The change is instruction text. The recovered agent may choose to phrase its compaction acknowledgment in a way that still uses one of the forbidden phrases ("lost track," "got lost," "got confused") — but the preamble can only ask it not to; there is no downstream check that rejects or rewrites its output.
|
|
25
|
+
|
|
26
|
+
---
|
|
27
|
+
|
|
28
|
+
## 2. Under-block
|
|
29
|
+
|
|
30
|
+
**What failure modes does this still miss?**
|
|
31
|
+
|
|
32
|
+
No block/allow surface — under-block not applicable. Known residual risks:
|
|
33
|
+
- Agents may invent new alarming phrasings not covered by the explicit prohibitions (e.g. "I drew a blank," "I lost the thread"). The preamble addresses this by also prescribing the positive phrasing ("paused for context compaction and has now resumed"), but compliance is still LLM-discretion.
|
|
34
|
+
- Agents may still reconstruct status summaries when the user's last message is genuinely ambiguous (not a directive, not a question). The preamble cannot solve ambiguity; it only forbids summary-reflex when the user's last message was a delegated decision.
|
|
35
|
+
|
|
36
|
+
---
|
|
37
|
+
|
|
38
|
+
## 3. Level-of-abstraction fit
|
|
39
|
+
|
|
40
|
+
**Is this at the right layer?**
|
|
41
|
+
|
|
42
|
+
Yes. The preamble is the only layer where this content can live — it is the one piece of text guaranteed to be in the recovered agent's context on wake-up. Alternatives considered and rejected:
|
|
43
|
+
- **Detector on the recovered agent's outbound message** (match "lost track" and rewrite): this is exactly the brittle-detector-with-authority anti-pattern the signal-vs-authority doctrine forbids. Also, rewriting the agent's message would distort its voice and break tone-gate compatibility.
|
|
44
|
+
- **Coaching via MEMORY.md**: memory is not guaranteed loaded early enough on recovery, and this fix must work for first-run agents too.
|
|
45
|
+
- **Post-compaction reflection hook**: fires after the first response, too late to prevent the tone failure the user sees.
|
|
46
|
+
|
|
47
|
+
The preamble is the correct layer: it's instruction-to-the-authority, not authority-itself.
|
|
48
|
+
|
|
49
|
+
---
|
|
50
|
+
|
|
51
|
+
## 4. Signal vs authority compliance
|
|
52
|
+
|
|
53
|
+
**Required reference:** [docs/signal-vs-authority.md](../../docs/signal-vs-authority.md)
|
|
54
|
+
|
|
55
|
+
**Does this change hold blocking authority with brittle logic?**
|
|
56
|
+
|
|
57
|
+
- [x] **No — this change has no block/allow surface.**
|
|
58
|
+
- [ ] No — this change produces a signal consumed by an existing smart gate.
|
|
59
|
+
- [ ] Yes — but the logic is a smart gate with full conversational context.
|
|
60
|
+
- [ ] ⚠️ Yes, with brittle logic.
|
|
61
|
+
|
|
62
|
+
This change is text fed to an LLM-backed authority (the recovered agent). It does not filter, reject, or gate any input or output. The authority structure is unchanged. Compliance is trivially satisfied: detectors remain detectors, authorities remain authorities, and the preamble is an instruction INPUT to an existing authority.
|
|
63
|
+
|
|
64
|
+
---
|
|
65
|
+
|
|
66
|
+
## 5. Interactions
|
|
67
|
+
|
|
68
|
+
- **Shadowing:** No. The preamble is the only text ever injected in this code path; nothing else injects into the same byte-range.
|
|
69
|
+
- **Double-fire:** No. `buildCompactionResumePayload` is called once per compaction recovery event; `prepareInjectionText` is called once per payload. Both changes alter static text, not call-count.
|
|
70
|
+
- **Races:** No. The function is a pure string builder. No shared mutable state.
|
|
71
|
+
- **Feedback loops:** Indirect only. The recovered agent's response (shaped by the preamble) is seen by the user and becomes part of the next topic context, but this is normal conversational loop and not a self-reinforcing system.
|
|
72
|
+
- **Adjacent compaction infrastructure:** `findLastRealMessage` (0.28.51), `isSystemOrProxyMessage` (0.28.51), and `formatContextForSession` (0.28.52) are all unchanged. The preamble text is independent of which messages are selected; this change only affects how the agent is instructed to respond to them.
|
|
73
|
+
- **Tone gate interaction:** The preamble is not itself sent through the tone gate (it's injected into the agent's context, not emitted as a user-facing message). The agent's RESPONSE to the preamble is sent through the tone gate, as normal. The forbidden phrases ("lost track," etc.) in the preamble are inside a prohibition instruction ("Do NOT say you..."); the tone gate parses agent output, not its input context, so no collision.
|
|
74
|
+
|
|
75
|
+
---
|
|
76
|
+
|
|
77
|
+
## 6. External surfaces
|
|
78
|
+
|
|
79
|
+
- **Other agents on the same machine:** No effect. Each agent's recovery is independent.
|
|
80
|
+
- **Other users of the install base:** Yes — every instar user running 0.28.66+ will see recovered agents produce a different tone on compaction. This is the intended improvement. No breaking change to message format.
|
|
81
|
+
- **External systems (Telegram, Slack, GitHub):** No. The output format is still a single text message in the agent's voice.
|
|
82
|
+
- **Persistent state:** None. Text change only.
|
|
83
|
+
- **Timing / runtime conditions:** No. Purely static text; no new async, no new timeouts.
|
|
84
|
+
- **File-reference branch:** when payload > 500 chars, a temp file path appears in the recovered agent's context as before. File format unchanged; only the surrounding instruction text is tightened.
|
|
85
|
+
|
|
86
|
+
---
|
|
87
|
+
|
|
88
|
+
## 7. Rollback cost
|
|
89
|
+
|
|
90
|
+
**If this turns out wrong in production, what's the back-out?**
|
|
91
|
+
|
|
92
|
+
- **Hot-fix release:** revert the single file (`src/messaging/shared/compactionResumePayload.ts`) + the added tests. Ship as 0.28.67. No code-path change to untangle.
|
|
93
|
+
- **Data migration:** None. No persistent state touched.
|
|
94
|
+
- **Agent state repair:** None. Live agents pick up the new preamble on next compaction event; if reverted, they pick up the old one on the one after. No stuck-state risk.
|
|
95
|
+
- **User visibility during rollback:** A single recovery event in flight might land on mixed wording. Zero-downtime revert.
|
|
96
|
+
|
|
97
|
+
Total rollback cost: one revert commit, one patch release. Lowest possible risk tier.
|
|
98
|
+
|
|
99
|
+
---
|
|
100
|
+
|
|
101
|
+
## Conclusion
|
|
102
|
+
|
|
103
|
+
The change is low-risk, surgical, and addresses two specific user-observed failures with the minimum possible surface area — instruction text in one file, plus a mirrored update in the same file's over-threshold branch. No decision points are added, no gate logic is modified, no authority is reshaped. Signal-vs-authority compliance is trivially satisfied (no block/allow surface). Rollback is one revert.
|
|
104
|
+
|
|
105
|
+
One residual risk: LLM compliance with prescribed tone language is not guaranteed, only strongly directed. If telemetry shows the new preamble still produces alarming narrations in practice, the next iteration would be to fold compaction-recovery outputs through a dedicated lightweight tone-checker before emit — but that is a future change, not a reason to block this one.
|
|
106
|
+
|
|
107
|
+
Clear to ship as 0.28.66.
|
|
108
|
+
|
|
109
|
+
---
|
|
110
|
+
|
|
111
|
+
## Second-pass review (if required)
|
|
112
|
+
|
|
113
|
+
**Reviewer:** general-purpose subagent (independent read, did not see the author's reasoning)
|
|
114
|
+
**Independent read of the artifact: concur**
|
|
115
|
+
|
|
116
|
+
Concerns raised (non-blocking):
|
|
117
|
+
|
|
118
|
+
- Test strength: initial suite asserted magic-string presence only, not order. Addressed in this pass by adding a test that pins "calm acknowledgment appears before respond-to-most-recent-message" in `COMPACTION_RESUME_PREAMBLE` — the order was itself a core part of the failure mode.
|
|
119
|
+
- Prompt-injection surface: the user's last message was already concatenated into the recovered agent's context via `formatInlineHistory` (pre-existing). The new preamble's directive to "act on a delegated decision" slightly raises the stakes of a hostile last message. Residual risk, not a new surface. The recovered agent remains the authority and retains judgment — same trust boundary as any other user-input-in-context scenario. Accepted as pre-existing, not introduced.
|
|
120
|
+
- i18n / backward-compat: instar is English-only today; compaction payloads are transient (not persisted across versions), so no in-flight compatibility concern. Noted for completeness.
|
|
121
|
+
|
|
122
|
+
Reviewer verdict: clear to ship.
|
|
123
|
+
|
|
124
|
+
---
|
|
125
|
+
|
|
126
|
+
## Evidence pointers
|
|
127
|
+
|
|
128
|
+
- Targeted unit tests: `tests/unit/compactionResumePayload.test.ts` — 17 tests, 42 combined with `isSystemOrProxyMessage.test.ts`, all green in worktree.
|
|
129
|
+
- Failure-mode source: topic 6795 screenshots 2026-04-20 (Mew / session-robustness at 12:12 PM; Bob / instar-agent-robustness at 12:04 PM).
|
|
130
|
+
- Prior art in this line: 0.28.51 (classifier fix), 0.28.52 (rich context payload). This is the third tightening in the compaction-recovery lineage.
|
|
@@ -0,0 +1,155 @@
|
|
|
1
|
+
# Side-Effects Review — Lifeline message-drop robustness (Stage A)
|
|
2
|
+
|
|
3
|
+
**Version / slug:** `lifeline-message-drop-stage-a`
|
|
4
|
+
**Date:** `2026-04-19`
|
|
5
|
+
**Author:** `echo`
|
|
6
|
+
**Second-pass reviewer:** `reviewer subagent (required — touches inbound messaging)`
|
|
7
|
+
|
|
8
|
+
## Summary of the change
|
|
9
|
+
|
|
10
|
+
`TelegramLifeline` previously made a single-shot `fetch` in `forwardToServer()` and silently dropped messages in `replayQueue()` after `MAX_REPLAY_FAILURES`. Stage A closes the silent-drop window without changing anything else: (1) wraps the in-flight forward with a 3-attempt 1s/2s/4s retry, and (2) when the replay drop is reached, appends a durable record to `<stateDir>/state/dropped-messages.json`, emits a `DegradationReporter` event under `feature = 'TelegramLifeline.forwardToServer'`, and sends the original sender a plain-English "please resend" notice in their topic. Two new helpers (`retryWithBackoff`, `droppedMessages`) are introduced; they are pure, deterministic utilities. No change to the handoff payload shape, no change to server routes, no change to any gate.
|
|
11
|
+
|
|
12
|
+
Files touched:
|
|
13
|
+
|
|
14
|
+
- `src/lifeline/retryWithBackoff.ts` (new, 43 lines)
|
|
15
|
+
- `src/lifeline/droppedMessages.ts` (new, ~165 lines)
|
|
16
|
+
- `src/lifeline/TelegramLifeline.ts` (imports + `forwardToServer` retry + `replayQueue` drop-path notification)
|
|
17
|
+
- `tests/unit/lifeline/retryWithBackoff.test.ts` (new)
|
|
18
|
+
- `tests/unit/lifeline/droppedMessages.test.ts` (new)
|
|
19
|
+
- `tests/unit/lifeline/droppedMessageNotify.test.ts` (new)
|
|
20
|
+
|
|
21
|
+
## Decision-point inventory
|
|
22
|
+
|
|
23
|
+
- `TelegramLifeline.forwardToServer` — **modify** — single-attempt `fetch` replaced by retry-wrapped `fetch`. Retry policy is mechanics (fixed count, fixed backoff), not judgment.
|
|
24
|
+
- `TelegramLifeline.replayQueue` drop branch — **modify** — adds persistence + `DegradationReporter.report` + user-visible `sendToTopic` notice before the existing console.warn drop.
|
|
25
|
+
- No block/allow surface added. No existing authority shadowed.
|
|
26
|
+
|
|
27
|
+
---
|
|
28
|
+
|
|
29
|
+
## 1. Over-block
|
|
30
|
+
|
|
31
|
+
**What legitimate inputs does this change reject that it shouldn't?**
|
|
32
|
+
|
|
33
|
+
No block/allow surface — over-block not applicable. The change never rejects input; it only adds retry attempts and a notification on an already-determined drop.
|
|
34
|
+
|
|
35
|
+
Collateral "over-notify" is possible in theory: if the user sends a rapid burst of messages during a genuine server outage, each one that ends up in the replay-drop path would produce its own Telegram notice. This is bounded by `MAX_REPLAY_FAILURES` (3) and by the replay cadence — so the burst would have to persist across multiple full replay cycles, each spaced by the supervisor's recovery attempts. In practice the user gets at most one notice per truly-dropped message, which matches the intent: "tell me about messages I lost."
|
|
36
|
+
|
|
37
|
+
---
|
|
38
|
+
|
|
39
|
+
## 2. Under-block
|
|
40
|
+
|
|
41
|
+
**What failure modes does this still miss?**
|
|
42
|
+
|
|
43
|
+
No block/allow surface — under-block not applicable. Failure modes this Stage A does NOT address (intentionally; they're Stage B/C):
|
|
44
|
+
|
|
45
|
+
- A stale lifeline running an older package version can still hand the server a payload the server rejects. The Bob incident happened this way. Stage B adds a version check to the handoff.
|
|
46
|
+
- A crash between `appendDroppedMessage` success and the eventual `sendToTopic` could leave a durable record with no user notice. The DegradationReporter event still fires to the attention topic, so the drop is not silent from the operator's perspective; it is potentially silent to the original sender. Stage C's chaos tests will exercise this path.
|
|
47
|
+
|
|
48
|
+
---
|
|
49
|
+
|
|
50
|
+
## 3. Level-of-abstraction fit
|
|
51
|
+
|
|
52
|
+
**Is this at the right layer?**
|
|
53
|
+
|
|
54
|
+
`retryWithBackoff` is a detector-level primitive (mechanics, no judgment, deterministic). It lives in `src/lifeline/` because only the lifeline uses it today; if a second call site emerges the helper can promote to a shared utility without behavior change.
|
|
55
|
+
|
|
56
|
+
`droppedMessages.notifyMessageDropped` is a signal producer: it writes a durable record, emits a `DegradationReporter` event, and sends a user-visible Telegram message. All three outputs are consumed by *existing* authorities (the operator surfaces the state file or the dashboard; DegradationReporter has its own downstream pipeline that feeds FeedbackManager and the attention topic; Telegram is the user's own eyes). This helper does not itself decide whether a message *should* be dropped — it reacts to the caller's already-made decision. Correct layer.
|
|
57
|
+
|
|
58
|
+
No higher-level gate already exists that this should feed instead — the existing drop path had no signal emission at all. This helper *is* that signal.
|
|
59
|
+
|
|
60
|
+
---
|
|
61
|
+
|
|
62
|
+
## 4. Signal vs authority compliance
|
|
63
|
+
|
|
64
|
+
**Required reference:** [docs/signal-vs-authority.md](../../docs/signal-vs-authority.md)
|
|
65
|
+
|
|
66
|
+
**Does this change hold blocking authority with brittle logic?**
|
|
67
|
+
|
|
68
|
+
- [x] No — this change produces a signal consumed by an existing smart gate.
|
|
69
|
+
- [ ] No — this change has no block/allow surface.
|
|
70
|
+
- [ ] Yes — but the logic is a smart gate with full conversational context (LLM-backed with recent history or equivalent).
|
|
71
|
+
- [ ] ⚠️ Yes, with brittle logic — STOP.
|
|
72
|
+
|
|
73
|
+
The retry is pure mechanics (fixed count, fixed backoff) — not a decision point. The drop path that existed before this change is also pure mechanics (fixed failure-counter threshold in `MessageQueue`). This change adds *signal emission* (durable record, DegradationReporter event, Telegram notice) to a mechanism that was previously silent. The consumers of those signals are existing authorities: DegradationReporter's existing downstream pipeline, and the human operator reading the attention topic / state file. Compliant.
|
|
74
|
+
|
|
75
|
+
---
|
|
76
|
+
|
|
77
|
+
## 5. Interactions
|
|
78
|
+
|
|
79
|
+
**Does this interact with existing checks, recovery paths, or infrastructure?**
|
|
80
|
+
|
|
81
|
+
- **Shadowing:** `forwardToServer`'s retry runs before the existing queue-and-replay path. The retry uses up to ~7s of wallclock (1s + 2s + 4s in the failure case). That's inside the caller's normal "queue this and acknowledge" window — the queue+replay path is unchanged and still fires when retry exhausts. No shadowing.
|
|
82
|
+
- **Double-fire:** `DegradationReporter` has per-feature cooldown (1h) built in (`ALERT_COOLDOWN_MS` in `DegradationReporter.ts`). A storm of drops within one hour will log+persist every one, but only the first triggers the attention-topic alert. Intentional — the file is the durable record; the alert is the human-facing surface. The per-sender Telegram notice fires once per dropped message regardless of cooldown (the sender needs to know about *their* message, not be deduped against unrelated drops).
|
|
83
|
+
- **Races:** `appendDroppedMessage` uses an atomic file swap (write-to-tmp, then `renameSync`) — the swap itself is atomic, mirroring the existing `saveRateLimitState` pattern in the same file. The surrounding read-modify-write is *not* atomic across concurrent callers: if two writers race, the last `renameSync` wins and the earlier writer's appended record is overwritten. Accepted for a debugging / operator-visibility record; no user-visible correctness impact (the DegradationReporter event and the per-sender Telegram notice are the load-bearing notifications, and both fire independently of this file).
|
|
84
|
+
- **Feedback loops:** The user-notice `sendToTopic` itself goes through the normal Telegram path. It could in principle fail, which would not re-enter the drop pipeline (there's no recursive replay of the notice). Failures are swallowed — the durable record and the DegradationReporter event are the authoritative signals.
|
|
85
|
+
|
|
86
|
+
---
|
|
87
|
+
|
|
88
|
+
## 6. External surfaces
|
|
89
|
+
|
|
90
|
+
**Does this change anything visible outside the immediate code path?**
|
|
91
|
+
|
|
92
|
+
- **Other agents:** no. The change is confined to the lifeline process on each agent individually.
|
|
93
|
+
- **Other users:** yes, and deliberately — users whose messages would have been silently dropped now get a plain-English notice asking them to resend. This is the intended user-visible change.
|
|
94
|
+
- **External systems:** Telegram is the only external surface. One extra per-drop message per dropped message. Rate-limited by the drop rate itself, which in steady-state is zero.
|
|
95
|
+
- **Persistent state:** new file `<stateDir>/state/dropped-messages.json`, ring-buffered to 500 records, ~200 bytes each → max ~100KB on disk. Additive — older versions that don't know about this file will simply ignore it.
|
|
96
|
+
- **Timing:** `forwardToServer` worst-case wallclock grows from one 10s fetch to three 10s fetches + (1s + 2s) backoff = up to ~33s. Caller is the async polling loop, which is non-blocking w.r.t. other Telegram updates. No new timing dependency we don't control.
|
|
97
|
+
|
|
98
|
+
---
|
|
99
|
+
|
|
100
|
+
## 7. Rollback cost
|
|
101
|
+
|
|
102
|
+
**If this turns out wrong in production, what's the back-out?**
|
|
103
|
+
|
|
104
|
+
Pure code revert. No migration. `dropped-messages.json` is additive — ignoring it on rollback is safe. `DegradationReporter` category `TelegramLifeline.forwardToServer` requires no schema change; on rollback the reporter simply stops receiving events under that feature name. Users who received a "couldn't deliver" notice already have the notice delivered; no inconsistency.
|
|
105
|
+
|
|
106
|
+
Estimated rollback: one `git revert`, one release, zero downtime.
|
|
107
|
+
|
|
108
|
+
---
|
|
109
|
+
|
|
110
|
+
## Conclusion
|
|
111
|
+
|
|
112
|
+
Stage A is a pure signal-addition change on top of an already-mechanical drop path. It introduces no new authority, no new block/allow surface, no schema change, and one new additive state file. The failure modes it does NOT address (version skew, chaos-level reliability) are intentional scope deferrals for Stage B/C. The change is clear to ship once the second-pass reviewer concurs and the full vitest suite passes (noting pre-existing failures on the source branch — see Evidence pointers).
|
|
113
|
+
|
|
114
|
+
---
|
|
115
|
+
|
|
116
|
+
## Convergent-review fixes (2026-04-20)
|
|
117
|
+
|
|
118
|
+
After a 4-lens convergent review (security / scalability / adversarial / integration) the following material findings were applied to the shipped change:
|
|
119
|
+
|
|
120
|
+
- **Markdown injection in the per-sender notice (adversarial, security).** `sendToTopic` uses `parse_mode: 'Markdown'`, so echoing raw user text in the "please resend" notice would render user-controlled `_` / `*` / `[…](…)` / backticks as formatting or clickable links. **Fix:** the preview is now wrapped in a triple-backtick code fence, and any triple-backtick runs inside the preview are rewritten to `'''` to prevent breakout. A new test (`escapes markdown breakout attempts in the preview`) asserts exactly two backtick runs (the fence itself).
|
|
121
|
+
- **Correlated-failure silent-drop hole (adversarial).** If `appendDroppedMessage` throws AND `FEATURE_FORWARD` is mid-cooldown AND `sendToTopic` throws, all three catches would swallow. **Fix:** on persist failure we now additionally fire a second `DegradationReporter.report` under a **distinct** feature `TelegramLifeline.dropRecordPersist`. Its cooldown is independent of the primary feature's, so correlated-failure paths still produce at least one loud operator signal. New test: `fires a distinct DegradationReporter feature when persistence itself fails`.
|
|
122
|
+
- **sendToTopic latency compounding (scalability).** A Telegram outage could hang the notice for ~30–60 s on top of the 33 s retry. **Fix:** `sendToTopic` is now wrapped in a 5 s `Promise.race` timeout via an internal `withTimeout` helper. New test: `bounds sendToTopic with a timeout so a hung Telegram does not block`.
|
|
123
|
+
- **Feature-name taxonomy mismatch (integration).** Existing lifeline DegradationReporter events use `Class.method` form (see `src/messaging/TelegramAdapter.ts`). **Fix:** feature renamed from `TelegramLifeline.MessageForwarding` → `TelegramLifeline.forwardToServer`. Tests + spec updated.
|
|
124
|
+
- **Parallel dead-letter pattern not acknowledged (integration).** `MessageRouter.deadLetter` and `state/failed-messages/` already exist. **Fix:** spec now contains a "Relationship to existing dead-letter systems" section documenting why the lifeline needs its own file-backed store (process boundary) and how Stage B can bridge to `MessageRouter` later.
|
|
125
|
+
- **Version-skew amplification under Stage A (adversarial, deferred).** Stage A makes each version-skew rejection 3× slower. Acknowledged in the spec's "Failure modes intentionally left unfixed" — the loudness guarantee still holds (notice now fires where none did before); only time-to-notice grows. Not blocking; Stage B closes it.
|
|
126
|
+
|
|
127
|
+
Test count after fixes: **18 passing** (3 new from convergent review).
|
|
128
|
+
|
|
129
|
+
## Second-pass review (if required)
|
|
130
|
+
|
|
131
|
+
**Reviewer:** independent general-purpose subagent (agentId: a90b878b08335adc5)
|
|
132
|
+
**Independent read of the artifact: concern → resolved**
|
|
133
|
+
|
|
134
|
+
The reviewer independently read the artifact and the actual diff and concurred on scope, signal-vs-authority compliance, level-of-abstraction fit, and absence of duplication. Three concerns were raised, all non-blocking:
|
|
135
|
+
|
|
136
|
+
- Atomicity wording overclaim in §5 — **resolved.** §5 now distinguishes the atomic file swap from the non-atomic read-modify-write around it.
|
|
137
|
+
- Worst-case timing arithmetic in §6 said ~37s; actual is 3×10s fetch + (1s+2s) backoff = ~33s — **resolved.** §6 now reports ~33s with the math spelled out.
|
|
138
|
+
- User-facing notice used "my internal handoff kept failing," which slips below Echo's ELI10 bar — **resolved.** The notice in `droppedMessages.ts` now reads "something on my end kept failing." Existing tests assert on "couldn't deliver" and "resend," both still present; no test churn.
|
|
139
|
+
|
|
140
|
+
With the three fixes applied, the reviewer's verdict ("core conclusions of the artifact stand") carries through to the shipped change.
|
|
141
|
+
|
|
142
|
+
---
|
|
143
|
+
|
|
144
|
+
## Evidence pointers
|
|
145
|
+
|
|
146
|
+
- New helper tests: `tests/unit/lifeline/retryWithBackoff.test.ts`, `tests/unit/lifeline/droppedMessages.test.ts`, `tests/unit/lifeline/droppedMessageNotify.test.ts` — **15/15 passing** locally in the isolated worktree (`2026-04-19`).
|
|
147
|
+
- Typecheck: `npx tsc --noEmit` — clean.
|
|
148
|
+
- Full-suite run on branch: 710 passed / 5 test files failed (6 tests). **All failures reproduce on main independently**:
|
|
149
|
+
- `tests/unit/no-silent-fallbacks.test.ts` — baseline drift (current=174 vs baseline=86); pre-existing tech debt unrelated to lifeline (scope excludes `src/lifeline/`).
|
|
150
|
+
- `tests/unit/ListenerSessionManager.test.ts > starts in dead state` — fails on main in isolation.
|
|
151
|
+
- `tests/unit/agent-registry.test.ts > allocates from range` / `reclaims ports from stale agents` — fails on main in isolation.
|
|
152
|
+
- `tests/unit/security.test.ts > zero execSync calls` — fails on main in isolation.
|
|
153
|
+
- `tests/unit/middleware.test.ts`, `tests/integration/machine-routes.test.ts`, `tests/unit/moltbridge/routes.test.ts` — pass isolated (flaky under concurrency); not caused by this change.
|
|
154
|
+
- Worktree path: `.instar/worktrees/build-stage-a--lifeline-message-drop-robustnes`
|
|
155
|
+
- Branch: `build/stage-a--lifeline-message-drop-robustnes`
|
|
@@ -0,0 +1,129 @@
|
|
|
1
|
+
# Side-Effects Review — Lifeline Self-Restart on Version Skew or Stuck Loop (Stage B)
|
|
2
|
+
|
|
3
|
+
**Version / slug:** `lifeline-self-restart-stage-b`
|
|
4
|
+
**Date:** `2026-04-20`
|
|
5
|
+
**Author:** `echo`
|
|
6
|
+
**Second-pass reviewer:** `spec-converge (4 iterations, 7 reviewers — 4 internal + GPT-5.4, Gemini-3.1-Pro, Grok-4.1-Fast)`
|
|
7
|
+
|
|
8
|
+
## Summary of the change
|
|
9
|
+
|
|
10
|
+
Adds two related self-healing mechanisms to the Telegram lifeline: (1) a version handshake on `/internal/telegram-forward` where the server returns `426 Upgrade Required` on MAJOR/MINOR mismatch and the lifeline self-restarts via launchd respawn; (2) a health watchdog that tracks three deterministic signals (`noForwardStuck`, `consecutiveFailures`, `conflict409Stuck`) and self-restarts on pathological stuck states. Files touched: `src/lifeline/forwardErrors.ts` (new), `src/lifeline/versionHandshake.ts` (new), `src/lifeline/startupMarker.ts` (new), `src/lifeline/rateLimitState.ts` (new), `src/lifeline/LifelineHealthWatchdog.ts` (new), `src/lifeline/RestartOrchestrator.ts` (new), `src/lifeline/TelegramLifeline.ts` (integration), `src/lifeline/retryWithBackoff.ts` (isTerminal predicate), `src/server/routes.ts` (handshake policy in `/internal/telegram-forward`). Decision points interacted with: DP1 (server-side version policy — new; API-boundary structural validator) and DP2 (lifeline-side restart policy — new; operational self-heal with deterministic thresholds).
|
|
11
|
+
|
|
12
|
+
## Decision-point inventory
|
|
13
|
+
|
|
14
|
+
- `DP1 — /internal/telegram-forward version policy (src/server/routes.ts)` — **add** — Structural API-boundary validator; returns 400/426/503 based on semver comparison. Exempted from signal-vs-authority per hard-invariant carve-out.
|
|
15
|
+
- `DP2 — Lifeline health watchdog + RestartOrchestrator (src/lifeline/*)` — **add** — Deterministic operational self-heal; the "authority" is the lifeline restarting itself, which constrains no other agent's behavior and filters no message flow. Deterministic thresholds + rate-limit guardrails match the "safety guard on irreversible action" shape, except the action (process.exit) is fully reversible via launchd respawn.
|
|
16
|
+
|
|
17
|
+
---
|
|
18
|
+
|
|
19
|
+
## 1. Over-block
|
|
20
|
+
|
|
21
|
+
**What legitimate inputs does this change reject that it shouldn't?**
|
|
22
|
+
|
|
23
|
+
- **Server-side version policy**: A correctly-running old lifeline (pre-Stage-B) on the same MAJOR.MINOR as the server is accepted — no rejection. A correctly-running new lifeline on a MAJOR.MINOR-behind server is rejected with 426, which is the intended behavior. Edge-case: an over-long or malformed `lifelineVersion` string is rejected with 400 (never echoed). Rejected inputs are things the server cannot interpret — not legitimate agent behavior.
|
|
24
|
+
- **Watchdog restart triggers**: The one at-risk path is the `noForwardStuck` signal on a low-traffic agent. Convergent-review Round 3 caught this — the previous design anchored on `lastForwardSuccessAt` (which would crash-loop idle agents). The shipped design anchors on `oldestQueueItemAge` (computed from the existing `QueuedMessage.timestamp` set at enqueue time), which fires only when messages are actually accumulating and not draining. A deliberately-idle agent with an empty queue does NOT trip — test `evaluate_empty_queue_does_not_trip` asserts this.
|
|
25
|
+
|
|
26
|
+
## 2. Under-block
|
|
27
|
+
|
|
28
|
+
**What failure modes does this still miss?**
|
|
29
|
+
|
|
30
|
+
- **Same-MAJOR.MINOR but very stale lifeline (e.g., Bob's 0.28.20 vs 0.28.61)**: The version policy only fires on MAJOR/MINOR mismatch; a patch-only drift is accepted (by project policy patches remain backward-compatible). Bob's scenario is caught by the stuck-loop watchdog instead — that agent was failing forwards, which would trigger `consecutiveFailures` or `noForwardStuck`.
|
|
31
|
+
- **Non-version, non-stuck outages**: A server that returns perfectly valid 2xx but with corrupted payloads won't trigger any watchdog signal. Out of scope — that's a content-correctness issue, not a liveness issue.
|
|
32
|
+
- **Restart storm from a persistent server bug**: The 10-min rate-limit caps this at 1 restart per 10 min per agent (or 3 per 24 h for version-skew). After 6 restarts in 1 h a `TelegramLifeline.restartStorm` signal fires outside normal cooldown to alert operators. The operator must intervene.
|
|
33
|
+
- **Wallclock jumps**: Backward jumps clamp to `Math.max(0, elapsed)`. Forward jumps extend the rate-limit window silently — worst case a restart is delayed hours. Future-timestamp in `last-self-restart-at.json` allows the current restart (breaks deadlock) and overwrites. Acceptable tradeoffs.
|
|
34
|
+
- **Heterogeneous deployment ordering (new lifeline → old server returning strict-400)**: A new Stage-B lifeline that sends `lifelineVersion` to a hypothetical pre-Stage-B server with strict JSON validation would get 400. Graceful degradation: retry once without the field, pin `legacyStrictServer = true` for the session.
|
|
35
|
+
|
|
36
|
+
## 3. Level-of-abstraction fit
|
|
37
|
+
|
|
38
|
+
**Is this at the right layer?**
|
|
39
|
+
|
|
40
|
+
- **DP1 (server-side handshake)** is at the API boundary — the correct layer. A structural protocol-version check must live at the first point where the server parses client input. Higher layers (e.g., a gate) would add latency without improving the decision; lower layers (e.g., middleware) would need to understand route-specific semantics.
|
|
41
|
+
- **DP2 (lifeline watchdog + orchestrator)** is at the process-supervision layer — also the correct layer. Individual forward retries (Stage A) belong at the message layer; process-level self-heal belongs at the process layer. The orchestrator is a single-owner state machine that serializes multiple initiator types (tick-based watchdog, event-based 426 handler, external SIGTERM), which is exactly the shape of that problem. No higher-level gate should be dispatching these; no lower-level primitive already exists.
|
|
42
|
+
|
|
43
|
+
## 4. Signal vs authority compliance
|
|
44
|
+
|
|
45
|
+
**Required reference:** [docs/signal-vs-authority.md](../../docs/signal-vs-authority.md)
|
|
46
|
+
|
|
47
|
+
**Does this change hold blocking authority with brittle logic?**
|
|
48
|
+
|
|
49
|
+
- [x] No — this change has no message-flow block/allow surface.
|
|
50
|
+
- [x] No — the decision points are carved-out categories (API-boundary structural validator + operational self-heal on the lifeline's own process).
|
|
51
|
+
- [ ] Yes — smart gate with context.
|
|
52
|
+
- [ ] ⚠️ Yes, with brittle logic.
|
|
53
|
+
|
|
54
|
+
**Narrative:**
|
|
55
|
+
|
|
56
|
+
DP1 is explicitly an API-boundary validator for a fixed protocol version policy (MAJOR.MINOR must match, patch drift permissible). Per `docs/signal-vs-authority.md` §"When this principle does NOT apply", hard-invariant validation at the API edge is an exempted category: "these belong at the API edge and are fine as brittle blockers." The 426 response is deterministic structural protocol policy, not a judgment call about message content or agent intent.
|
|
57
|
+
|
|
58
|
+
DP2 is operational self-heal on the lifeline's own process — it constrains NO other agent's behavior, filters NO message flow, and blocks NO user action. Its sole output is "I, the lifeline, restart myself." Restart is fully reversible (launchd respawns in ~1 s; queue is persisted atomically; dropped-messages file survives; rate-limit state survives). Deterministic thresholds with rate-limit guardrails are appropriate here.
|
|
59
|
+
|
|
60
|
+
The `versionSkewInfo` PATCH-drift signal is explicitly pure observability — it emits a DegradationReporter event with no blocking effect. The `restartStorm` signal same shape.
|
|
61
|
+
|
|
62
|
+
## 5. Interactions
|
|
63
|
+
|
|
64
|
+
**Does this interact with existing checks, recovery paths, or infrastructure?**
|
|
65
|
+
|
|
66
|
+
- **Shadowing**: The `/internal/telegram-forward` handshake runs BEFORE existing topicId/text validation. If the client sends an invalid version, the request is rejected before the existing logic runs. Intentional — version mismatch is more fundamental than field validation. No existing logging is bypassed because the existing topicId/text check already returned 400 silently.
|
|
67
|
+
- **Double-fire**: The watchdog's `noForwardStuck` signal is explicitly suppressed when `supervisor.getStatus().healthy === false`. This prevents double-firing with the existing supervisor-driven recovery path (which handles "server is down" separately). Tested in `evaluate_noForwardStuck_suppressed_when_unhealthy`.
|
|
68
|
+
- **Races**: The RestartOrchestrator is specifically designed to serialize concurrent initiators (watchdog tick + 426 handler + external SIGTERM). The `state !== 'idle'` guard is set synchronously before any `await`, so two concurrent entries produce exactly one restart. Tested in `suppresses_re-entrant_requests`.
|
|
69
|
+
- **Feedback loops**: The watchdog restart count feeds into `last-self-restart-at.json.history`, which feeds into the storm-detection logic, which emits `restartStorm` signal. No unbounded loop — the history is ring-buffered at 50 entries and the rate-limit bucket caps the emission rate.
|
|
70
|
+
- **Stage A interaction**: Typed errors (`ForwardVersionSkewError`, `ForwardBadRequestError`) short-circuit Stage A's `retryWithBackoff` via the new `isTerminal` predicate. A 426 consumes exactly 1 attempt, not 3. Tested in `short_circuits_on_isTerminal`.
|
|
71
|
+
|
|
72
|
+
## 6. External surfaces
|
|
73
|
+
|
|
74
|
+
**Does this change anything visible outside the immediate code path?**
|
|
75
|
+
|
|
76
|
+
- **Other agents on the same machine**: No. The handshake is per-process between one lifeline and one server.
|
|
77
|
+
- **Other users of the install base**: Yes — post-upgrade, all agents gain version-handshake + stuck-loop self-restart behavior. The first migration requires the lifeline to restart (the upgrade pipeline's new `instar lifeline restart` CLI does this automatically; acceptance criterion #22).
|
|
78
|
+
- **External systems**: No new Telegram, Slack, GitHub, or Cloudflare interactions. Telegram long-poll resumes safely via persisted offset (Telegram's API guarantee, not something Stage B implements).
|
|
79
|
+
- **Persistent state**: Two new state files — `state/last-self-restart-at.json` (rate-limit + history; 0600; 50-entry ring buffer) and `state/lifeline-started-at.json` (startup marker; pid + version + timestamp). Both are machine-local operational state and are excluded from backup snapshots.
|
|
80
|
+
- **Timing/runtime**: Watchdog adds ~3 scalar comparisons per 30-second tick — negligible CPU, no file I/O on the fast path. Version handshake adds two integer parses per `/internal/telegram-forward` request — sub-microsecond.
|
|
81
|
+
|
|
82
|
+
## 7. Rollback cost
|
|
83
|
+
|
|
84
|
+
**If this turns out wrong in production, what's the back-out?**
|
|
85
|
+
|
|
86
|
+
- **Hot-fix release**: Pure code revert + patch release. The next `npm run build && release` cycle ships the pre-Stage-B behavior. Existing agents pick up the revert on their next update.
|
|
87
|
+
- **Data migration**: None. `state/last-self-restart-at.json` and `state/lifeline-started-at.json` are harmless if left in place after rollback (they'd simply be unread). No database schema change.
|
|
88
|
+
- **Agent state repair**: None required. Old lifelines fall back to Stage A behavior automatically.
|
|
89
|
+
- **User visibility during rollback window**: None. Self-restart is operator-visible (DegradationReporter events), not user-visible.
|
|
90
|
+
|
|
91
|
+
---
|
|
92
|
+
|
|
93
|
+
## Conclusion
|
|
94
|
+
|
|
95
|
+
This change has been through 4 convergent-review iterations with 4 internal reviewers (security / scalability / adversarial / integration) plus 3 external models (GPT-5.4 / Gemini-3.1-Pro / Grok-4.1-Fast). 28 material findings (5 HIGH + 15 MED internal; 8 HIGH + 1 MED external) were addressed, including six that would have caused production incidents: (1) the `noForwardStuck` idle crash-loop, (2) the future-timestamp deadlock, (3) the restart-sequence re-entrance race, (4) the ingress-during-flush message-loss race, (5) the CLI liveness detection bug (launchctl kickstart doesn't touch last-self-restart-at.json), (6) the updater-path shadow-install race.
|
|
96
|
+
|
|
97
|
+
Signal-vs-authority compliance: both decision points are carved-out categories — DP1 is an API-boundary structural validator (hard-invariant exemption), DP2 is operational self-heal on the lifeline's own process (constrains no other agent). The `versionSkewInfo`, `versionMissing`, `restartStorm`, `watchdogStarved`, and `configInvalid` signals are pure observability with no blocking effect.
|
|
98
|
+
|
|
99
|
+
Test coverage: 84 new unit tests across 6 new files (forwardErrors, versionHandshake, startupMarker, rateLimitState, LifelineHealthWatchdog, RestartOrchestrator) plus server-side handshake tests. All passing; typecheck clean.
|
|
100
|
+
|
|
101
|
+
Clear to ship.
|
|
102
|
+
|
|
103
|
+
---
|
|
104
|
+
|
|
105
|
+
## Second-pass review (if required)
|
|
106
|
+
|
|
107
|
+
**Reviewer:** spec-converge (4 iterations, 7 reviewers)
|
|
108
|
+
**Independent read of the artifact: concur**
|
|
109
|
+
|
|
110
|
+
All material concerns raised by the 7 reviewers across 4 rounds have been addressed in the spec and implementation. Round 4 produced zero new HIGH/MED findings. One LOW editorial note (reuse existing `timestamp` field instead of adding `enqueuedAt`) was applied.
|
|
111
|
+
|
|
112
|
+
See full convergence report at `docs/specs/reports/lifeline-self-restart-stage-b-convergence.md`.
|
|
113
|
+
|
|
114
|
+
---
|
|
115
|
+
|
|
116
|
+
## Evidence pointers
|
|
117
|
+
|
|
118
|
+
- Spec: `docs/specs/LIFELINE-SELF-RESTART-STAGE-B-SPEC.md`
|
|
119
|
+
- Convergence report: `docs/specs/reports/lifeline-self-restart-stage-b-convergence.md`
|
|
120
|
+
- External reviews (raw): `.claude/skills/crossreview/output/20260420-144052/{gpt.md, gemini.md}` (Grok-4.1-Fast verdict captured inline in conversation per sandbox limitation)
|
|
121
|
+
- Test files:
|
|
122
|
+
- `tests/unit/lifeline/versionHandshake.test.ts` (15 tests)
|
|
123
|
+
- `tests/unit/lifeline/rateLimitState.test.ts` (17 tests)
|
|
124
|
+
- `tests/unit/lifeline/forwardErrors.test.ts` (4 tests)
|
|
125
|
+
- `tests/unit/lifeline/startupMarker.test.ts` (5 tests)
|
|
126
|
+
- `tests/unit/lifeline/LifelineHealthWatchdog.test.ts` (9 tests)
|
|
127
|
+
- `tests/unit/lifeline/RestartOrchestrator.test.ts` (5 tests)
|
|
128
|
+
- `tests/unit/lifeline/retryWithBackoff.test.ts` (extended +1 test for isTerminal)
|
|
129
|
+
- `tests/unit/server/telegramForwardHandshake.test.ts` (8 tests)
|