alvin-bot 4.8.6 → 4.8.7

package/CHANGELOG.md CHANGED
@@ -2,6 +2,87 @@

  All notable changes to Alvin Bot are documented here.

+ ## [4.8.7] — 2026-04-11
+
+ ### 🐛 `/update` now detects stale-runtime (rebuild without restart)
+
+ Caught immediately after publishing 4.8.6 on the Mac mini: `/update` reported "Already up to date — no new commits" even though the running process was on **v4.8.5** while the disk was already built at **v4.8.6**. The user could see the version mismatch in `/status` (v4.8.5) but `/update` refused to acknowledge it.
+
+ **Root cause**: The updater only compared **git commits** (or **npm registry version**) against the local install. It never checked whether the **running process's in-memory version** was older than the **on-disk built version**. This is the dev/CI loop scenario:
+
+ 1. You edit src/, bump package.json, commit + push
+ 2. `npm run build` regenerates dist/ at the new version
+ 3. The running process has the OLD code in memory
+ 4. You run `/update` in Telegram
+ 5. git: HEAD == origin/main (just pushed) → 0 commits behind → "up to date"
+ 6. Process never restarts → keeps running OLD code
+
+ **Fix**: New `isRuntimeStale()` check at the very start of `runUpdate()`. Compares `BOT_VERSION` (in-memory at process start) against `package.json.version` from disk via the existing semver compare. If disk is newer, returns success with `requiresRestart=true` immediately — skip the git/npm fetch entirely, just signal a restart so the fresh code takes effect.
+
+ After 4.8.7, running `/update` after a manual rebuild will correctly say *"Disk is already built at vX, running vY. Restarting to pick up the new code..."* and trigger the restart.
+
+ ### ✨ Internal watchdog with crash-loop brake (`src/services/watchdog.ts`)
+
+ Ali asked for "derbe persistent" (German slang, roughly "seriously persistent"). `KeepAlive: true` from 4.8.6 already got us 95% of the way there; the missing piece was a brake that stops the bot from restart-looping forever when a crash is deterministic (corrupt state file, missing dependency, broken upgrade).
+
+ **New module**: `src/services/watchdog.ts`. Four parts:
+
+ **1. Liveness beacon**. Every 30 s the bot writes `~/.alvin-bot/state/watchdog.json` with `{lastBeat, pid, bootTime, crashCount, crashWindowStart, version}`. The write is a synchronous `fs.writeFileSync` of one short JSON object, so it is fast, though not non-blocking.
+
+ **2. Crash-loop brake**. On every fresh boot, the watchdog reads the previous beacon:
+
+ - If the previous beacon is **less than 90 s old** → the previous process exited very recently → that's a crash (or a deliberate restart, treated the same way for the brake's purpose). Increment `crashCount`.
+ - If the previous beacon is **older than 90 s** → previous process had clean uptime → reset counter to 0.
+ - The crash window is **10 minutes**. Crashes within this window accumulate; older ones don't count.
+ - If `crashCount` reaches **10**, the brake engages:
+   - Writes `~/.alvin-bot/state/crash-loop.alert` with the timestamp, version, error log path, and recovery steps
+   - Tries to `launchctl unload -w` its own LaunchAgent so launchd stops retrying (otherwise `KeepAlive: true` would keep burning CPU forever)
+   - Exits with code 3
+
+ **3. Recovery**. After **5 minutes of clean uptime**, the watchdog auto-resets the crash counter to 0. So a healthy bot that occasionally has a transient hiccup doesn't slowly accumulate toward the brake over days.
+
+ **4. Brake check at startup**. `checkCrashLoopBrake()` runs in `index.ts` **before** any expensive init — if the alert file already exists, the bot exits cleanly with code 3 and tries to unload itself again. This prevents launchd from spinning the bot up just to write the same alert over and over.
+
+ **Recovery from a tripped brake**:
+
+ ```bash
+ # 1. Investigate the error log
+ cat ~/.alvin-bot/logs/alvin-bot.err.log
+
+ # 2. Fix whatever was wrong
+ # 3. Remove the alert file
+ rm ~/.alvin-bot/state/crash-loop.alert
+
+ # 4. Reload the LaunchAgent
+ alvin-bot launchd install
+ ```
+
+ **What this catches**:
+
+ - Process crashes (segfault, OOM kill) → exit non-zero → brake increments
+ - `process.exit()` from unhandled rejection → similar
+ - Tight crash loops → brake engages at 10 within 10 min
+ - Corrupted state files that crash on read → brake engages eventually
+
+ **What this does NOT catch (yet)**:
+
+ - Event-loop deadlocks where the process is alive but completely frozen. The watchdog beacon is written from a timer on the event loop, so a frozen loop simply stops the beacon and the in-process watchdog cannot detect the freeze. A future release will add an external sister LaunchAgent (`com.alvinbot.watchdog`) that runs every 2 minutes via `StartInterval` and kills the main bot if its beacon file is too stale. Tracked as a follow-up.
+
+ **Telemetry surface**: `alvin-bot status` could read the beacon file in a future release to show "crash count: X in last Y minutes". For now, the alert file is the main user-facing signal.
+
+ ### 🛡 LaunchAgent: ProcessType + LimitLoadToSessionType
+
+ Two small plist hardening tweaks:
+
+ - **`ProcessType: Background`**: an explicit hint to launchd that this is a long-running background service. Per `launchd.plist(5)`, launchd applies resource limits to Background jobs so that they do not disrupt the user experience; `Standard` is equivalent to leaving `ProcessType` unset.
+ - **`LimitLoadToSessionType: Aqua`** — only loads in user GUI sessions. Prevents the LaunchAgent from accidentally loading in non-GUI contexts (e.g. SSH login session) where it would not have Keychain access. Defensive: matches our existing assumption that the bot needs the GUI keychain unlocked for Claude SDK OAuth.
+
+ These don't change behaviour for normal use, but they're explicit about our intent. macOS will treat the bot as a proper background service rather than a generic foreground job.
+
+ ### Tests
+
+ 87 still passing — no test changes (the stale-runtime check is a fast-path branch that doesn't disturb the existing git/npm logic).
+
  ## [4.8.6] — 2026-04-11

  ### 🐛 LaunchAgent: `/restart` left the bot down forever
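The brake rules listed in the 4.8.7 changelog entry above can be read as a small state transition over the previous beacon. The following is a minimal sketch of that reading, not the package's code: `BeaconState`, `CrashDecision`, and `nextCrashState` are hypothetical names introduced for illustration, and the three thresholds are the values quoted in the changelog (90 s, 10 minutes, 10 crashes).

```typescript
// Hypothetical sketch of the crash-counter rules; names are illustrative only.
interface BeaconState {
  lastBeat: number;         // ms timestamp of the last heartbeat written
  crashCount: number;       // rapid restarts counted in the current window
  crashWindowStart: number; // ms timestamp at which the current window opened
}

interface CrashDecision {
  crashCount: number;
  crashWindowStart: number;
  brake: boolean; // true when the crash-loop brake should engage
}

// Thresholds as quoted in the changelog.
const STALE_BEACON_MS = 90 * 1000;
const CRASH_WINDOW_MS = 10 * 60 * 1000;
const CRASH_BRAKE_THRESHOLD = 10;

function nextCrashState(previous: BeaconState | null, now: number): CrashDecision {
  // No beacon, or the last heartbeat is old: this boot is not a rapid restart.
  if (previous === null || now - previous.lastBeat >= STALE_BEACON_MS) {
    return { crashCount: 0, crashWindowStart: now, brake: false };
  }
  // Rapid restart: accumulate inside the window, otherwise start a new window at 1.
  const inWindow = now - previous.crashWindowStart < CRASH_WINDOW_MS;
  const crashCount = inWindow ? previous.crashCount + 1 : 1;
  const crashWindowStart = inWindow ? previous.crashWindowStart : now;
  return { crashCount, crashWindowStart, brake: crashCount >= CRASH_BRAKE_THRESHOLD };
}
```

Under these assumptions, a boot 5 s after the last heartbeat with `crashCount: 9` inside the window returns `crashCount: 10` and `brake: true`, while the same boot with a heartbeat 120 s old returns a counter of 0.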
package/bin/cli.js CHANGED
@@ -1466,6 +1466,12 @@ function renderLaunchdPlist({ label, nodePath, entryPoint, cwd, home, logDir })
  <key>ThrottleInterval</key>
  <integer>5</integer>

+ <key>ProcessType</key>
+ <string>Background</string>
+
+ <key>LimitLoadToSessionType</key>
+ <string>Aqua</string>
+
  <key>StandardOutPath</key>
  <string>${logDir}/alvin-bot.out.log</string>

package/dist/index.js CHANGED
@@ -14,6 +14,11 @@ if (hasLegacyData()) {
  }
  // 3. Seed defaults for any files that don't exist yet (fresh install)
  seedDefaults();
+ // 4. Crash-loop brake check — if we've crashed N times in a short window,
+ // refuse to start, write an alert file, and unload our LaunchAgent so
+ // launchd stops retrying. Runs BEFORE any expensive init so a broken
+ // state file doesn't tank the whole CPU.
+ checkCrashLoopBrake();
  // ── Normal imports (safe now — DATA_DIR is ready) ──────────────────
  import { Bot, InlineKeyboard } from "grammy";
  import { config } from "./config.js";
@@ -76,6 +81,7 @@ import { loadSkills } from "./services/skills.js";
  import { loadHooks } from "./services/hooks.js";
  import { registerShutdownHandler } from "./services/restart.js";
  import { cancelAllSubAgents } from "./services/subagents.js";
+ import { startWatchdog, stopWatchdog, checkCrashLoopBrake } from "./services/watchdog.js";
  import { getRegistry } from "./engine.js";
  import { scanAssets } from "./services/asset-index.js";
  // Scan asset directory and generate INDEX.json + INDEX.md
@@ -235,6 +241,7 @@ const shutdown = async () => {
  // agents can post a cancellation message to Telegram before the bot
  // stops. Capped at 5s internally so a hang can't block shutdown.
  await cancelAllSubAgents(true);
+ stopWatchdog();
  stopScheduler();
  stopSessionCleanup();
  if (queueInterval)
@@ -472,6 +479,8 @@ if (bot) {
  console.log(` Users: ${config.allowedUsers.length} authorized`);
  // Start heartbeat monitor
  startHeartbeat();
+ // Start internal watchdog (crash-loop brake + liveness beacon)
+ startWatchdog();
  // Index memory vectors in background (non-blocking)
  initEmbeddings().catch(() => { });
  },
@@ -483,5 +492,6 @@ else {
  console.log(` WebUI: http://localhost:${process.env.WEB_PORT || 3100}`);
  // Start heartbeat monitor even without Telegram
  startHeartbeat();
+ startWatchdog();
  initEmbeddings().catch(() => { });
  }
@@ -19,6 +19,7 @@ import { resolve, dirname } from "path";
  import { fileURLToPath } from "url";
  import fs from "fs";
  import os from "os";
+ import { BOT_VERSION } from "../version.js";
  const execAsync = promisify(exec);
  const PROJECT_ROOT = resolve(dirname(fileURLToPath(import.meta.url)), "../..");
  const DATA_DIR = process.env.ALVIN_DATA_DIR || resolve(os.homedir(), ".alvin-bot");
@@ -84,12 +85,40 @@ function compareSemver(a, b) {
      }
      return 0;
  }
+ /**
+  * Is the running bot's in-memory version older than what's already built
+  * on disk? This happens when the dev/CI rebuilt the bot mid-session and
+  * the process hasn't restarted yet. A manual /update without a git/npm
+  * fetch should still trigger a restart in this case so the fresh code
+  * takes effect.
+  */
+ function isRuntimeStale() {
+     const onDisk = readLocalVersion();
+     if (!onDisk || !BOT_VERSION || BOT_VERSION === "unknown")
+         return false;
+     return compareSemver(BOT_VERSION, onDisk) < 0;
+ }
  /** Pull latest changes, install deps, rebuild. Returns a structured result
   * instead of throwing so the /update command can report cleanly to Telegram.
   * Dispatches to the git path for source installs and the npm path for
-  * npm-global installs. */
+  * npm-global installs.
+  *
+  * Before doing any fetch, checks whether the disk is already newer than
+  * the running process (i.e. someone rebuilt between the process start
+  * and this call). If so, returns success with requiresRestart=true so
+  * the command handler can trigger a graceful restart.
+  */
  export async function runUpdate() {
      try {
+         // Stale-runtime check: disk is already newer than the running code.
+         if (isRuntimeStale()) {
+             const onDisk = readLocalVersion();
+             return {
+                 ok: true,
+                 message: `Disk is already built at v${onDisk}, running v${BOT_VERSION}. Restarting to pick up the new code...`,
+                 requiresRestart: true,
+             };
+         }
          if (isOwnGitRepo()) {
              return await runGitUpdate();
          }
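The stale-runtime check in the hunk above depends on a `compareSemver` helper whose body is not shown in this diff. The sketch below is a standalone variant that takes both versions as arguments so it can be exercised without `BOT_VERSION` or `readLocalVersion()`. `compareSemverSketch` and `isRuntimeStaleFor` are hypothetical names, and the comparison assumes plain `major.minor.patch` strings with no pre-release tags; the package's own helper may behave differently.

```typescript
// Hypothetical stand-in for the package's compareSemver (body not shown in the diff).
// Assumption: versions are plain "major.minor.patch" strings; pre-release tags are ignored.
function compareSemverSketch(a: string, b: string): number {
  const pa = a.split(".").map((part) => parseInt(part, 10) || 0);
  const pb = b.split(".").map((part) => parseInt(part, 10) || 0);
  for (let i = 0; i < 3; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

// Mirrors the guard order of isRuntimeStale(): missing or "unknown" versions
// never count as stale, otherwise stale means running < on-disk.
function isRuntimeStaleFor(running: string | null, onDisk: string | null): boolean {
  if (!onDisk || !running || running === "unknown") return false;
  return compareSemverSketch(running, onDisk) < 0;
}
```

With the versions from the changelog entry, `isRuntimeStaleFor("4.8.5", "4.8.6")` is true, so `/update` would take the restart fast path; equal versions or an unknown running version fall through to the normal git/npm path.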
@@ -0,0 +1,236 @@
+ /**
+  * Internal Watchdog — Self-monitoring for crash-loop detection.
+  *
+  * Writes a liveness beacon file every 30 s with the current pid + boot
+  * time + crash counter. On startup, reads the beacon to detect whether
+  * the previous process exited cleanly or crashed. If too many crashes
+  * happen in a short window, refuses to keep restarting and writes an
+  * alert file so the user can investigate.
+  *
+  * Persistence layers this complements:
+  *   - launchd KeepAlive: true  → restarts on any exit (good)
+  *   - ThrottleInterval: 5      → minimum 5 s between restarts (good)
+  *   - This watchdog            → caps the total restart count so we
+  *                                don't burn CPU on a truly broken state
+  *
+  * What this CAN catch:
+  *   - Process crash → exit non-zero → launchd restarts → next boot reads
+  *     beacon, sees a recent exit, increments crash counter
+  *   - Tight crash loop → counter accumulates → hits brake at 10
+  *
+  * What this CANNOT catch (yet):
+  *   - True event-loop deadlocks (process alive but frozen). That requires
+  *     an external watchdog process — tracked as a follow-up.
+  */
+ import fs from "fs";
+ import { resolve } from "path";
+ import os from "os";
+ import { execSync } from "child_process";
+ import { BOT_VERSION } from "../version.js";
+ const DATA_DIR = process.env.ALVIN_DATA_DIR || resolve(os.homedir(), ".alvin-bot");
+ const STATE_DIR = resolve(DATA_DIR, "state");
+ const BEACON_FILE = resolve(STATE_DIR, "watchdog.json");
+ const ALERT_FILE = resolve(STATE_DIR, "crash-loop.alert");
+ const BEACON_INTERVAL_MS = 30_000; // write a beacon every 30 s
+ const CRASH_WINDOW_MS = 10 * 60 * 1000; // 10 min — crashes within this count toward the brake
+ const CRASH_BRAKE_THRESHOLD = 10; // after this many crashes in the window, brake
+ const STALE_BEACON_MS = 90_000; // a beacon older than this is considered "old enough that previous process really exited"
+ const RECOVERY_UPTIME_MS = 5 * 60 * 1000; // 5 min of clean uptime resets the counter
+ let beaconTimer = null;
+ let resetTimer = null;
+ let bootTime = 0;
+ function ensureStateDir() {
+     try {
+         fs.mkdirSync(STATE_DIR, { recursive: true });
+     }
+     catch (err) {
+         console.error("[watchdog] failed to create state dir:", err);
+     }
+ }
+ function readBeacon() {
+     try {
+         const raw = fs.readFileSync(BEACON_FILE, "utf-8");
+         const parsed = JSON.parse(raw);
+         if (typeof parsed.lastBeat === "number" &&
+             typeof parsed.pid === "number" &&
+             typeof parsed.bootTime === "number" &&
+             typeof parsed.crashCount === "number" &&
+             typeof parsed.crashWindowStart === "number" &&
+             typeof parsed.version === "string") {
+             return parsed;
+         }
+         return null;
+     }
+     catch {
+         return null;
+     }
+ }
+ function writeBeacon(data) {
+     try {
+         fs.writeFileSync(BEACON_FILE, JSON.stringify(data, null, 0), "utf-8");
+     }
+     catch (err) {
+         console.error("[watchdog] failed to write beacon:", err);
+     }
+ }
+ function writeAlert(reason, crashCount) {
+     try {
+         const content = [
+             `Alvin Bot crash-loop brake hit at ${new Date().toISOString()}`,
+             `Version: ${BOT_VERSION}`,
+             `Crashes in the last ${CRASH_WINDOW_MS / 60_000} minutes: ${crashCount}`,
+             `Threshold: ${CRASH_BRAKE_THRESHOLD}`,
+             ``,
+             `Reason: ${reason}`,
+             ``,
+             `The bot will refuse to start until this file is removed AND the`,
+             `LaunchAgent is reloaded. Investigate the recent error log:`,
+             ` ${resolve(DATA_DIR, "logs", "alvin-bot.err.log")}`,
+             ``,
+             `Recovery steps once you've fixed the underlying issue:`,
+             ` rm "${ALERT_FILE}"`,
+             ` alvin-bot launchd install # or just kickstart the service`,
+             ``,
+         ].join("\n");
+         fs.writeFileSync(ALERT_FILE, content, "utf-8");
+     }
+     catch (err) {
+         console.error("[watchdog] failed to write alert:", err);
+     }
+ }
+ /**
+  * Check whether the watchdog has hit the crash-loop brake. Called once
+  * at startup, BEFORE most of the bot initializes. If the brake is set
+  * (alert file exists), the bot exits cleanly with code 3 — and because
+  * launchd's KeepAlive will keep retrying, we also try to unload our
+  * own LaunchAgent so the retries stop.
+  */
+ export function checkCrashLoopBrake() {
+     if (!fs.existsSync(ALERT_FILE))
+         return;
+     console.error("");
+     console.error("==================================================");
+     console.error("⛔ alvin-bot crash-loop brake is engaged");
+     console.error("==================================================");
+     try {
+         const content = fs.readFileSync(ALERT_FILE, "utf-8");
+         console.error(content);
+     }
+     catch { /* ignore */ }
+     // Attempt to unload our own LaunchAgent so launchd stops retrying.
+     // If we don't do this, launchd just KeepAlive's us forever and we
+     // burn CPU writing the same alert.
+     if (process.platform === "darwin") {
+         try {
+             const home = os.homedir();
+             const plistPath = resolve(home, "Library", "LaunchAgents", "com.alvinbot.app.plist");
+             if (fs.existsSync(plistPath)) {
+                 execSync(`launchctl unload -w "${plistPath}"`, { stdio: "pipe" });
+                 console.error("[watchdog] LaunchAgent unloaded — bot will not auto-restart.");
+             }
+         }
+         catch (err) {
+             console.error("[watchdog] failed to unload LaunchAgent:", err);
+         }
+     }
+     // Exit with a distinct code so logs make the cause obvious
+     process.exit(3);
+ }
+ /**
+  * Start the watchdog. Called from src/index.ts after all services are
+  * initialized. Reads the previous beacon, increments crash counter if
+  * the previous run exited recently, schedules the periodic beacon
+  * writer, and schedules a recovery-mark reset after RECOVERY_UPTIME_MS
+  * of clean uptime.
+  */
+ export function startWatchdog() {
+     ensureStateDir();
+     bootTime = Date.now();
+     const previous = readBeacon();
+     let crashCount = 0;
+     let crashWindowStart = bootTime;
+     if (previous) {
+         const timeSinceLastBeat = bootTime - previous.lastBeat;
+         const inWindow = bootTime - previous.crashWindowStart < CRASH_WINDOW_MS;
+         if (timeSinceLastBeat < STALE_BEACON_MS) {
+             // Previous process exited very recently → that's a crash (or a
+             // graceful exit immediately followed by a restart, which we treat
+             // the same way for the brake — the goal is to detect rapid cycles).
+             if (inWindow) {
+                 crashCount = previous.crashCount + 1;
+                 crashWindowStart = previous.crashWindowStart;
+             }
+             else {
+                 // Previous crash was outside the window → reset counter
+                 crashCount = 1;
+             }
+             console.log(`[watchdog] detected restart after ${Math.round(timeSinceLastBeat / 1000)}s — crash ${crashCount}/${CRASH_BRAKE_THRESHOLD} in current ${CRASH_WINDOW_MS / 60_000}min window`);
+             if (crashCount >= CRASH_BRAKE_THRESHOLD) {
+                 console.error(`[watchdog] crash-loop brake triggered (${crashCount} crashes in ${CRASH_WINDOW_MS / 60_000}min)`);
+                 writeAlert(`Process restarted ${crashCount} times within ${CRASH_WINDOW_MS / 60_000} minutes. Last beacon was ${Math.round(timeSinceLastBeat / 1000)}s ago. Most likely a deterministic crash on startup.`, crashCount);
+                 // Re-use the brake check to unload + exit cleanly
+                 checkCrashLoopBrake();
+             }
+         }
+         else {
+             // Previous beacon was old → process had clean uptime before exit,
+             // OR system was rebooted between runs. Reset crash count.
+             crashCount = 0;
+             crashWindowStart = bootTime;
+         }
+     }
+     // Write the first beacon immediately so a fresh restart updates the file
+     writeBeacon({
+         lastBeat: bootTime,
+         pid: process.pid,
+         bootTime,
+         crashCount,
+         crashWindowStart,
+         version: BOT_VERSION,
+     });
+     // Periodic beacon writer
+     beaconTimer = setInterval(() => {
+         writeBeacon({
+             lastBeat: Date.now(),
+             pid: process.pid,
+             bootTime,
+             crashCount,
+             crashWindowStart,
+             version: BOT_VERSION,
+         });
+     }, BEACON_INTERVAL_MS);
+     // Schedule a recovery counter reset after RECOVERY_UPTIME_MS of clean
+     // uptime. If we make it that far without dying, the bot is healthy
+     // again and we shouldn't penalize a future single crash.
+     resetTimer = setTimeout(() => {
+         if (crashCount > 0) {
+             console.log(`[watchdog] ${RECOVERY_UPTIME_MS / 60_000}min clean uptime — resetting crash counter from ${crashCount} to 0`);
+             crashCount = 0;
+             crashWindowStart = Date.now();
+             writeBeacon({
+                 lastBeat: Date.now(),
+                 pid: process.pid,
+                 bootTime,
+                 crashCount,
+                 crashWindowStart,
+                 version: BOT_VERSION,
+             });
+         }
+     }, RECOVERY_UPTIME_MS);
+     console.log(`[watchdog] started — beacon every ${BEACON_INTERVAL_MS / 1000}s, brake at ${CRASH_BRAKE_THRESHOLD} crashes per ${CRASH_WINDOW_MS / 60_000}min, recovery after ${RECOVERY_UPTIME_MS / 60_000}min uptime`);
+ }
+ /**
+  * Stop the watchdog cleanly. Called from the shutdown handler in
+  * index.ts so beacon timers don't keep the process alive after the
+  * grammy bot has stopped.
+  */
+ export function stopWatchdog() {
+     if (beaconTimer) {
+         clearInterval(beaconTimer);
+         beaconTimer = null;
+     }
+     if (resetTimer) {
+         clearTimeout(resetTimer);
+         resetTimer = null;
+     }
+ }
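The module header above notes that a frozen event loop stops the beacon without ending the process, and that catching this needs an external checker, which the changelog tracks as a follow-up. The sketch below shows one way such a checker might judge staleness from the beacon file's contents. It is an assumption-laden illustration: `isBeaconFrozen` and `maxAgeMs` are hypothetical, and the source does not say which threshold or logic the follow-up will use.

```typescript
// Hypothetical helper for an external checker; not part of the package.
// rawBeacon is the text of watchdog.json, now is the current time in ms,
// maxAgeMs is a caller-chosen staleness threshold (an assumption, not from the source).
function isBeaconFrozen(rawBeacon: string, now: number, maxAgeMs: number): boolean {
  let lastBeat: unknown;
  try {
    lastBeat = (JSON.parse(rawBeacon) as { lastBeat?: unknown }).lastBeat;
  } catch {
    return false; // unreadable beacon: no evidence of a freeze either way
  }
  if (typeof lastBeat !== "number") return false;
  // A live process rewrites lastBeat on every beacon interval, so a large
  // gap means the writer has stopped running.
  return now - lastBeat > maxAgeMs;
}
```

A stale beacon alone does not prove a freeze, because the file also stops updating once the bot has exited. A real checker would have to confirm that the recorded `pid` is still running before killing anything.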
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "alvin-bot",
-   "version": "4.8.6",
+   "version": "4.8.7",
    "description": "Alvin Bot — Your personal AI agent on Telegram, WhatsApp, Discord, Signal, and Web.",
    "type": "module",
    "main": "dist/index.js",