alvin-bot 4.8.5 → 4.8.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +107 -0
- package/bin/cli.js +8 -7
- package/dist/handlers/commands.js +1 -1
- package/dist/index.js +10 -0
- package/dist/services/updater.js +30 -1
- package/dist/services/watchdog.js +236 -0
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED

@@ -2,6 +2,113 @@

All notable changes to Alvin Bot are documented here.

## [4.8.7] — 2026-04-11

### 🐛 `/update` now detects stale-runtime (rebuild without restart)

Caught immediately after publishing 4.8.6 on the Mac mini: `/update` reported "Already up to date — no new commits" even though the running process was on **v4.8.5** while the disk was already built at **v4.8.6**. The user could see the version mismatch in `/status` (v4.8.5), but `/update` refused to acknowledge it.

**Root cause**: The updater only compared **git commits** (or the **npm registry version**) against the local install. It never checked whether the **running process's in-memory version** was older than the **on-disk built version**. This is the dev/CI loop scenario:

1. You edit src/, bump package.json, commit + push
2. `npm run build` regenerates dist/ at the new version
3. The running process still has the OLD code in memory
4. You run `/update` in Telegram
5. git: HEAD == origin/main (just pushed) → 0 commits behind → "up to date"
6. The process never restarts → it keeps running the OLD code

**Fix**: A new `isRuntimeStale()` check at the very start of `runUpdate()`. It compares `BOT_VERSION` (captured in memory at process start) against `package.json.version` read from disk, via the existing semver compare. If the disk is newer, it returns success with `requiresRestart=true` immediately — skipping the git/npm fetch entirely and just signalling a restart so the fresh code takes effect.
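The changelog leans on "the existing semver compare" without showing it; as a minimal sketch (the actual helper in `dist/services/updater.js` may differ in details), such a comparison looks like:

```javascript
// Sketch of a three-part semver comparison — the real compareSemver in
// updater.js may differ. Returns <0 if a is older than b, 0 if equal,
// >0 if a is newer. Numeric compare per part, so "4.10.0" > "4.9.9".
function compareSemver(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const d = (pa[i] || 0) - (pb[i] || 0);
    if (d !== 0) return d;
  }
  return 0;
}

// The stale-runtime check reduces to: disk newer than memory → restart.
console.log(compareSemver("4.8.5", "4.8.6") < 0); // true — runtime is stale
```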
After 4.8.7, running `/update` after a manual rebuild will correctly say *"Disk is already built at vX, running vY. Restarting to pick up the new code..."* and trigger the restart.

### ✨ Internal watchdog with crash-loop brake (`src/services/watchdog.ts`)

Ali asked for the bot to be "seriously persistent" ("derbe persistent"). We were already 95% there with `KeepAlive: true` from 4.8.6; the missing piece was a brake that stops the bot from infinite-restart-looping when a deterministic crash happens (corrupt state file, missing dependency, broken upgrade).

**New module**: `src/services/watchdog.ts`. Four responsibilities:

**1. Liveness beacon**. Every 30 s the bot writes `~/.alvin-bot/state/watchdog.json` with `{lastBeat, pid, bootTime, crashCount, crashWindowStart, version}`. A fast disk write, no blocking I/O.

**2. Crash-loop brake**. On every fresh boot, the watchdog reads the previous beacon:

- If the previous beacon is **less than 90 s old** → the previous process exited very recently → that counts as a crash (or a deliberate restart, treated the same way for the brake's purpose). Increment `crashCount`.
- If the previous beacon is **older than 90 s** → the previous process had clean uptime → reset the counter to 0.
- The crash window is **10 minutes**. Crashes within this window accumulate; older ones don't count.
- If `crashCount` reaches **10**, the brake engages:
  - Writes `~/.alvin-bot/state/crash-loop.alert` with the timestamp, version, error log path, and recovery steps
  - Tries to `launchctl unload -w` its own LaunchAgent so launchd stops retrying (otherwise `KeepAlive: true` would keep burning CPU forever)
  - Exits with code 3
**3. Recovery**. After **5 minutes of clean uptime**, the watchdog auto-resets the crash counter to 0, so a healthy bot that occasionally has a transient hiccup doesn't slowly accumulate toward the brake over days.

**4. Brake check at startup**. `checkCrashLoopBrake()` runs in `index.ts` **before** any expensive init — if the alert file already exists, the bot exits cleanly with code 3 and tries to unload itself again. This prevents launchd from spinning the bot up just to write the same alert over and over.

**Recovery from a tripped brake**:

```bash
# 1. Investigate the error log
cat ~/.alvin-bot/logs/alvin-bot.err.log

# 2. Fix whatever was wrong
# 3. Remove the alert file
rm ~/.alvin-bot/state/crash-loop.alert

# 4. Reload the LaunchAgent
alvin-bot launchd install
```

**What this catches**:

- Process crashes (segfault, OOM kill) → non-zero exit → brake increments
- `process.exit()` from an unhandled rejection → likewise
- Tight crash loops → brake engages at 10 crashes within 10 min
- Corrupted state files that crash on read → brake engages eventually

**What this does NOT catch (yet)**:

- Event-loop deadlocks where the process is alive but completely frozen. The beacon writer itself needs a live event loop, so it can't detect a freeze. A future release will add an external sister LaunchAgent (`com.alvinbot.watchdog`) that runs every 2 minutes via `StartInterval` and kills the main bot if its beacon file is too stale. Tracked as a follow-up.
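That external sister agent isn't part of this release; as a purely hypothetical sketch, it could look like the following plist (the `com.alvinbot.watchdog` label and 2-minute `StartInterval` come from the note above; the script path and its contents are invented for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.alvinbot.watchdog</string>
  <!-- Hypothetical helper script: checks the age of
       ~/.alvin-bot/state/watchdog.json and kills the recorded pid
       if the beacon is too stale. Not shipped in 4.8.7. -->
  <key>ProgramArguments</key>
  <array>
    <string>/bin/sh</string>
    <string>-c</string>
    <string>"$HOME/.alvin-bot/bin/check-beacon.sh"</string>
  </array>
  <key>StartInterval</key>
  <integer>120</integer> <!-- run every 2 minutes -->
</dict>
</plist>
```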
**Telemetry surface**: `alvin-bot status` could read the beacon file in a future release to show "crash count: X in last Y minutes" — for now, the alert file is the main user-facing signal.
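A hypothetical sketch of that future status line — `formatBeacon` is an invented helper name; only the beacon field names come from this release:

```javascript
// Hypothetical: turn a parsed watchdog.json beacon into a status line.
// Field names match the beacon shape documented above; the formatting
// is an assumption, not shipped code.
function formatBeacon(beacon, now = Date.now()) {
  const ageS = Math.round((now - beacon.lastBeat) / 1000);
  const windowMin = Math.round((now - beacon.crashWindowStart) / 60000);
  return `v${beacon.version} pid ${beacon.pid} — last beat ${ageS}s ago, ` +
         `crash count: ${beacon.crashCount} in last ${windowMin} minutes`;
}

const now = Date.now();
console.log(formatBeacon(
  { lastBeat: now - 30000, pid: 4242, bootTime: now - 600000,
    crashCount: 2, crashWindowStart: now - 300000, version: "4.8.7" },
  now
));
// → "v4.8.7 pid 4242 — last beat 30s ago, crash count: 2 in last 5 minutes"
```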
### 🛡 LaunchAgent: ProcessType + LimitLoadToSessionType

Two small plist hardening tweaks:

- **`ProcessType: Background`** — an explicit hint to launchd that this is a long-running background service. macOS gives Background processes friendlier scheduling and is less likely to kill them under memory pressure (vs `Standard`, the default for unlabeled jobs).
- **`LimitLoadToSessionType: Aqua`** — only load in user GUI sessions. Prevents the LaunchAgent from accidentally loading in non-GUI contexts (e.g. an SSH login session) where it would not have Keychain access. Defensive: matches our existing assumption that the bot needs the GUI keychain unlocked for Claude SDK OAuth.

These don't change behaviour for normal use, but they make our intent explicit: macOS will treat the bot as a proper background service rather than a generic foreground job.
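Concretely, the two keys land in the rendered plist like this (mirroring the `renderLaunchdPlist` template change in the bin/cli.js diff further down):

```xml
<key>ProcessType</key>
<string>Background</string>

<key>LimitLoadToSessionType</key>
<string>Aqua</string>
```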
### Tests

87 still passing — no test changes (the stale-runtime check is a fast-path branch that doesn't disturb the existing git/npm logic).

## [4.8.6] — 2026-04-11

### 🐛 LaunchAgent: `/restart` left the bot down forever

Caught on the Mac mini production bot: running `/restart` in Telegram killed the bot cleanly, but the process never came back, leaving the bot dead until manual intervention.

**Root cause**: The 4.6.0 LaunchAgent plist template hardcoded a conditional `KeepAlive`:

```xml
<key>KeepAlive</key>
<dict>
    <key>SuccessfulExit</key>
    <false/>  <!-- don't restart on normal exit -->
    <key>Crashed</key>
    <true/>   <!-- only restart on crash -->
</dict>
```

That meant launchd would only auto-restart on **crashes**, not on normal exits. But `/restart` (and `/update`) work by calling `process.exit(0)` — a deliberate clean exit — and relying on the process manager to bring the bot back up. With pm2 this always worked, because pm2's default is "restart on any exit". With launchd's conditional KeepAlive, `process.exit(0)` was the ONE exit path that guaranteed the bot stayed down.

**Fix**: The plist template now uses `<key>KeepAlive</key><true/>` — unconditional restart on any exit, matching pm2's default behavior. `ThrottleInterval` dropped from 10 s to 5 s so recovery is quicker.

**Migration for existing installs**: re-run `alvin-bot launchd install` to get the new plist. The install script unloads the old plist, writes the new one, and reloads it — existing data and running state are preserved.

Also removed the stale "(PM2)" suffix from the `/restart` Telegram command description — it's just "Restart the bot" now, since the command works identically with both pm2 and launchd.

## [4.8.5] — 2026-04-11

### 🐛 `/update` now works for npm-global installs
package/bin/cli.js
CHANGED

@@ -1461,15 +1461,16 @@ function renderLaunchdPlist({ label, nodePath, entryPoint, cwd, home, logDir })
     <true/>

     <key>KeepAlive</key>
-    <dict>
-        <key>SuccessfulExit</key>
-        <false/>
-        <key>Crashed</key>
-        <true/>
-    </dict>
+    <true/>

     <key>ThrottleInterval</key>
-    <integer>10</integer>
+    <integer>5</integer>
+
+    <key>ProcessType</key>
+    <string>Background</string>
+
+    <key>LimitLoadToSessionType</key>
+    <string>Aqua</string>

     <key>StandardOutPath</key>
     <string>${logDir}/alvin-bot.out.log</string>

package/dist/handlers/commands.js
CHANGED

@@ -156,7 +156,7 @@ export function registerCommands(bot) {
     { command: "webui", description: "Open Web UI in browser" },
     { command: "setup", description: "Configure API keys & platforms" },
     { command: "cancel", description: "Cancel running request" },
-    { command: "restart", description: "Restart the bot (PM2)" },
+    { command: "restart", description: "Restart the bot" },
     { command: "update", description: "Pull latest, build, restart" },
     { command: "autoupdate", description: "Auto-update on|off|status" },
   ]).catch(err => console.error("Failed to set bot commands:", err));
package/dist/index.js
CHANGED

@@ -14,6 +14,11 @@ if (hasLegacyData()) {
 }
 // 3. Seed defaults for any files that don't exist yet (fresh install)
 seedDefaults();
+// 4. Crash-loop brake check — if we've crashed N times in a short window,
+// refuse to start, write an alert file, and unload our LaunchAgent so
+// launchd stops retrying. Runs BEFORE any expensive init so a broken
+// state file doesn't tank the whole CPU.
+checkCrashLoopBrake();
 // ── Normal imports (safe now — DATA_DIR is ready) ──────────────────
 import { Bot, InlineKeyboard } from "grammy";
 import { config } from "./config.js";

@@ -76,6 +81,7 @@ import { loadSkills } from "./services/skills.js";
 import { loadHooks } from "./services/hooks.js";
 import { registerShutdownHandler } from "./services/restart.js";
 import { cancelAllSubAgents } from "./services/subagents.js";
+import { startWatchdog, stopWatchdog, checkCrashLoopBrake } from "./services/watchdog.js";
 import { getRegistry } from "./engine.js";
 import { scanAssets } from "./services/asset-index.js";
 // Scan asset directory and generate INDEX.json + INDEX.md

@@ -235,6 +241,7 @@ const shutdown = async () => {
   // agents can post a cancellation message to Telegram before the bot
   // stops. Capped at 5s internally so a hang can't block shutdown.
   await cancelAllSubAgents(true);
+  stopWatchdog();
   stopScheduler();
   stopSessionCleanup();
   if (queueInterval)

@@ -472,6 +479,8 @@ if (bot) {
     console.log(` Users: ${config.allowedUsers.length} authorized`);
     // Start heartbeat monitor
     startHeartbeat();
+    // Start internal watchdog (crash-loop brake + liveness beacon)
+    startWatchdog();
     // Index memory vectors in background (non-blocking)
     initEmbeddings().catch(() => { });
   },

@@ -483,5 +492,6 @@ else {
   console.log(` WebUI: http://localhost:${process.env.WEB_PORT || 3100}`);
   // Start heartbeat monitor even without Telegram
   startHeartbeat();
+  startWatchdog();
   initEmbeddings().catch(() => { });
 }
package/dist/services/updater.js
CHANGED

@@ -19,6 +19,7 @@ import { resolve, dirname } from "path";
 import { fileURLToPath } from "url";
 import fs from "fs";
 import os from "os";
+import { BOT_VERSION } from "../version.js";
 const execAsync = promisify(exec);
 const PROJECT_ROOT = resolve(dirname(fileURLToPath(import.meta.url)), "../..");
 const DATA_DIR = process.env.ALVIN_DATA_DIR || resolve(os.homedir(), ".alvin-bot");

@@ -84,12 +85,40 @@ function compareSemver(a, b) {
   }
   return 0;
 }
+/**
+ * Is the running bot's in-memory version older than what's already built
+ * on disk? This happens when the dev/CI rebuilt the bot mid-session and
+ * the process hasn't restarted yet. A manual /update without a git/npm
+ * fetch should still trigger a restart in this case so the fresh code
+ * takes effect.
+ */
+function isRuntimeStale() {
+  const onDisk = readLocalVersion();
+  if (!onDisk || !BOT_VERSION || BOT_VERSION === "unknown")
+    return false;
+  return compareSemver(BOT_VERSION, onDisk) < 0;
+}
 /** Pull latest changes, install deps, rebuild. Returns a structured result
  * instead of throwing so the /update command can report cleanly to Telegram.
  * Dispatches to the git path for source installs and the npm path for
- * npm-global installs. */
+ * npm-global installs.
+ *
+ * Before doing any fetch, checks whether the disk is already newer than
+ * the running process (i.e. someone rebuilt between the process start
+ * and this call). If so, returns success with requiresRestart=true so
+ * the command handler can trigger a graceful restart.
+ */
 export async function runUpdate() {
   try {
+    // Stale-runtime check: disk is already newer than the running code.
+    if (isRuntimeStale()) {
+      const onDisk = readLocalVersion();
+      return {
+        ok: true,
+        message: `Disk is already built at v${onDisk}, running v${BOT_VERSION}. Restarting to pick up the new code...`,
+        requiresRestart: true,
+      };
+    }
     if (isOwnGitRepo()) {
       return await runGitUpdate();
     }
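The `/update` command handler itself is not part of this diff; a hypothetical sketch of how a handler might consume the structured result (the handler and `reply` function are invented for illustration, and `runUpdate` is stubbed with the stale-runtime result shape from above):

```javascript
// Hypothetical sketch — the real /update handler is not in this diff.
// runUpdate() is stubbed; only the result-handling shape matters here.
async function runUpdate() {
  return {
    ok: true,
    message: "Disk is already built at v4.8.7, running v4.8.6. " +
             "Restarting to pick up the new code...",
    requiresRestart: true,
  };
}

async function handleUpdateCommand(reply) {
  const result = await runUpdate();
  await reply(result.ok ? result.message : `Update failed: ${result.message}`);
  if (result.ok && result.requiresRestart) {
    // Clean exit — launchd's unconditional KeepAlive (4.8.6+) brings the
    // bot back up on the freshly built code.
    return "exit(0)"; // stand-in for process.exit(0) so the sketch is testable
  }
  return "keep-running";
}

handleUpdateCommand(async (msg) => console.log(msg)).then(console.log);
```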
package/dist/services/watchdog.js
ADDED

@@ -0,0 +1,236 @@
/**
 * Internal Watchdog — Self-monitoring for crash-loop detection.
 *
 * Writes a liveness beacon file every 30 s with the current pid + boot
 * time + crash counter. On startup, reads the beacon to detect whether
 * the previous process exited cleanly or crashed. If too many crashes
 * happen in a short window, refuses to keep restarting and writes an
 * alert file so the user can investigate.
 *
 * Persistence layers this complements:
 *   - launchd KeepAlive: true → restarts on any exit (good)
 *   - ThrottleInterval: 5     → minimum 5 s between restarts (good)
 *   - This watchdog           → caps the total restart count so we
 *                               don't burn CPU on a truly broken state
 *
 * What this CAN catch:
 *   - Process crash → exit non-zero → launchd restarts → next boot reads
 *     beacon, sees a recent exit, increments crash counter
 *   - Tight crash loop → counter accumulates → hits brake at 10
 *
 * What this CANNOT catch (yet):
 *   - True event-loop deadlocks (process alive but frozen). That requires
 *     an external watchdog process — tracked as a follow-up.
 */
import fs from "fs";
import { resolve } from "path";
import os from "os";
import { execSync } from "child_process";
import { BOT_VERSION } from "../version.js";

const DATA_DIR = process.env.ALVIN_DATA_DIR || resolve(os.homedir(), ".alvin-bot");
const STATE_DIR = resolve(DATA_DIR, "state");
const BEACON_FILE = resolve(STATE_DIR, "watchdog.json");
const ALERT_FILE = resolve(STATE_DIR, "crash-loop.alert");

const BEACON_INTERVAL_MS = 30_000;        // write a beacon every 30 s
const CRASH_WINDOW_MS = 10 * 60 * 1000;   // 10 min — crashes within this count toward the brake
const CRASH_BRAKE_THRESHOLD = 10;         // after this many crashes in the window, brake
const STALE_BEACON_MS = 90_000;           // a beacon older than this is considered "old enough that the previous process really exited"
const RECOVERY_UPTIME_MS = 5 * 60 * 1000; // 5 min of clean uptime resets the counter

let beaconTimer = null;
let resetTimer = null;
let bootTime = 0;

function ensureStateDir() {
    try {
        fs.mkdirSync(STATE_DIR, { recursive: true });
    }
    catch (err) {
        console.error("[watchdog] failed to create state dir:", err);
    }
}

function readBeacon() {
    try {
        const raw = fs.readFileSync(BEACON_FILE, "utf-8");
        const parsed = JSON.parse(raw);
        if (typeof parsed.lastBeat === "number" &&
            typeof parsed.pid === "number" &&
            typeof parsed.bootTime === "number" &&
            typeof parsed.crashCount === "number" &&
            typeof parsed.crashWindowStart === "number" &&
            typeof parsed.version === "string") {
            return parsed;
        }
        return null;
    }
    catch {
        return null;
    }
}

function writeBeacon(data) {
    try {
        fs.writeFileSync(BEACON_FILE, JSON.stringify(data, null, 0), "utf-8");
    }
    catch (err) {
        console.error("[watchdog] failed to write beacon:", err);
    }
}

function writeAlert(reason, crashCount) {
    try {
        const content = [
            `Alvin Bot crash-loop brake hit at ${new Date().toISOString()}`,
            `Version: ${BOT_VERSION}`,
            `Crashes in the last ${CRASH_WINDOW_MS / 60_000} minutes: ${crashCount}`,
            `Threshold: ${CRASH_BRAKE_THRESHOLD}`,
            ``,
            `Reason: ${reason}`,
            ``,
            `The bot will refuse to start until this file is removed AND the`,
            `LaunchAgent is reloaded. Investigate the recent error log:`,
            `  ${resolve(DATA_DIR, "logs", "alvin-bot.err.log")}`,
            ``,
            `Recovery steps once you've fixed the underlying issue:`,
            `  rm "${ALERT_FILE}"`,
            `  alvin-bot launchd install   # or just kickstart the service`,
            ``,
        ].join("\n");
        fs.writeFileSync(ALERT_FILE, content, "utf-8");
    }
    catch (err) {
        console.error("[watchdog] failed to write alert:", err);
    }
}

/**
 * Check whether the watchdog has hit the crash-loop brake. Called once
 * at startup, BEFORE most of the bot initializes. If the brake is set
 * (alert file exists), the bot exits cleanly with code 3 — and because
 * launchd's KeepAlive will keep retrying, we also try to unload our
 * own LaunchAgent so the retries stop.
 */
export function checkCrashLoopBrake() {
    if (!fs.existsSync(ALERT_FILE))
        return;
    console.error("");
    console.error("==================================================");
    console.error("⛔ alvin-bot crash-loop brake is engaged");
    console.error("==================================================");
    try {
        const content = fs.readFileSync(ALERT_FILE, "utf-8");
        console.error(content);
    }
    catch { /* ignore */ }
    // Attempt to unload our own LaunchAgent so launchd stops retrying.
    // If we don't do this, launchd just KeepAlive's us forever and we
    // burn CPU writing the same alert.
    if (process.platform === "darwin") {
        try {
            const home = os.homedir();
            const plistPath = resolve(home, "Library", "LaunchAgents", "com.alvinbot.app.plist");
            if (fs.existsSync(plistPath)) {
                execSync(`launchctl unload -w "${plistPath}"`, { stdio: "pipe" });
                console.error("[watchdog] LaunchAgent unloaded — bot will not auto-restart.");
            }
        }
        catch (err) {
            console.error("[watchdog] failed to unload LaunchAgent:", err);
        }
    }
    // Exit with a distinct code so logs make the cause obvious
    process.exit(3);
}

/**
 * Start the watchdog. Called from src/index.ts after all services are
 * initialized. Reads the previous beacon, increments crash counter if
 * the previous run exited recently, schedules the periodic beacon
 * writer, and schedules a recovery-mark reset after RECOVERY_UPTIME_MS
 * of clean uptime.
 */
export function startWatchdog() {
    ensureStateDir();
    bootTime = Date.now();
    const previous = readBeacon();
    let crashCount = 0;
    let crashWindowStart = bootTime;
    if (previous) {
        const timeSinceLastBeat = bootTime - previous.lastBeat;
        const inWindow = bootTime - previous.crashWindowStart < CRASH_WINDOW_MS;
        if (timeSinceLastBeat < STALE_BEACON_MS) {
            // Previous process exited very recently → that's a crash (or a
            // graceful exit immediately followed by a restart, which we treat
            // the same way for the brake — the goal is to detect rapid cycles).
            if (inWindow) {
                crashCount = previous.crashCount + 1;
                crashWindowStart = previous.crashWindowStart;
            }
            else {
                // Previous crash was outside the window → reset counter
                crashCount = 1;
            }
            console.log(`[watchdog] detected restart after ${Math.round(timeSinceLastBeat / 1000)}s — crash ${crashCount}/${CRASH_BRAKE_THRESHOLD} in current ${CRASH_WINDOW_MS / 60_000}min window`);
            if (crashCount >= CRASH_BRAKE_THRESHOLD) {
                console.error(`[watchdog] crash-loop brake triggered (${crashCount} crashes in ${CRASH_WINDOW_MS / 60_000}min)`);
                writeAlert(`Process restarted ${crashCount} times within ${CRASH_WINDOW_MS / 60_000} minutes. Last beacon was ${Math.round(timeSinceLastBeat / 1000)}s ago. Most likely a deterministic crash on startup.`, crashCount);
                // Re-use the brake check to unload + exit cleanly
                checkCrashLoopBrake();
            }
        }
        else {
            // Previous beacon was old → process had clean uptime before exit,
            // OR system was rebooted between runs. Reset crash count.
            crashCount = 0;
            crashWindowStart = bootTime;
        }
    }
    // Write the first beacon immediately so a fresh restart updates the file
    writeBeacon({
        lastBeat: bootTime,
        pid: process.pid,
        bootTime,
        crashCount,
        crashWindowStart,
        version: BOT_VERSION,
    });
    // Periodic beacon writer
    beaconTimer = setInterval(() => {
        writeBeacon({
            lastBeat: Date.now(),
            pid: process.pid,
            bootTime,
            crashCount,
            crashWindowStart,
            version: BOT_VERSION,
        });
    }, BEACON_INTERVAL_MS);
    // Schedule a recovery counter reset after RECOVERY_UPTIME_MS of clean
    // uptime. If we make it that far without dying, the bot is healthy
    // again and we shouldn't penalize a future single crash.
    resetTimer = setTimeout(() => {
        if (crashCount > 0) {
            console.log(`[watchdog] ${RECOVERY_UPTIME_MS / 60_000}min clean uptime — resetting crash counter from ${crashCount} to 0`);
            crashCount = 0;
            crashWindowStart = Date.now();
            writeBeacon({
                lastBeat: Date.now(),
                pid: process.pid,
                bootTime,
                crashCount,
                crashWindowStart,
                version: BOT_VERSION,
            });
        }
    }, RECOVERY_UPTIME_MS);
    console.log(`[watchdog] started — beacon every ${BEACON_INTERVAL_MS / 1000}s, brake at ${CRASH_BRAKE_THRESHOLD} crashes per ${CRASH_WINDOW_MS / 60_000}min, recovery after ${RECOVERY_UPTIME_MS / 60_000}min uptime`);
}

/**
 * Stop the watchdog cleanly. Called from the shutdown handler in
 * index.ts so beacon timers don't keep the process alive after the
 * grammy bot has stopped.
 */
export function stopWatchdog() {
    if (beaconTimer) {
        clearInterval(beaconTimer);
        beaconTimer = null;
    }
    if (resetTimer) {
        clearTimeout(resetTimer);
        resetTimer = null;
    }
}