alvin-bot 4.8.6 → 4.8.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +114 -0
- package/bin/cli.js +6 -0
- package/dist/config.js +7 -1
- package/dist/handlers/commands.js +77 -5
- package/dist/i18n.js +4 -4
- package/dist/index.js +21 -1
- package/dist/services/cron.js +17 -5
- package/dist/services/subagents.js +37 -4
- package/dist/services/updater.js +30 -1
- package/dist/services/watchdog.js +236 -0
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -2,6 +2,120 @@

All notable changes to Alvin Bot are documented here.

## [4.8.8] — 2026-04-11

### ✨ Unlimited sub-agent & cron timeouts (user-configurable)

Sub-agents and `ai-query` cron jobs used to hard-cap at 5 minutes (`SUBAGENT_TIMEOUT=300000` default), and `shell` cron jobs at 60 s. Long-running research, deep-dive audits, or anything that crossed the threshold got killed mid-stream with `status: "timeout"`. 4.8.8 flips the default to **unlimited** and lets the user override both globally and per job.

**What changed:**

- **Default is now infinite.** `src/config.ts` seeds `subAgentTimeout` from the `SUBAGENT_TIMEOUT` env var or falls back to `-1` (unlimited). The runtime value lives in `~/.alvin-bot/sub-agents.json` as `defaultTimeoutMs` and can be changed at runtime without a restart.
- **New `/subagents timeout` command.** `/subagents timeout` shows the current value; `/subagents timeout 3600` sets 1 h; `/subagents timeout off` (or `-1`, `0`, `unlimited`, `infinite`) disables the cap entirely. The default status output now includes a `⏱ Timeout` line.
- **Per-job override on cron.** `/cron add 1h ai-query "deep audit" --timeout off` gives this one job no timeout. `/cron add 5m shell "pm2 ls" --timeout 30` caps this shell job at 30 s. Omitting `--timeout` inherits the current global default. The same flag exists on `scripts/cron-manage.js add --timeout <sec|off>`.
- **`CronJob.timeoutMs` field.** Optional number in `cron-jobs.json`. Undefined = inherit the global default; any value ≤ 0 = unlimited.
- **Semantics.** `spawnSubAgent` now only arms the `setTimeout(abort)` when `timeout > 0`. At ≤ 0, no abort timer is created, the existing `if (timeoutId) clearTimeout(…)` call sites are null-safe, and the agent runs until it finishes, is cancelled via `/subagents cancel`, or the process dies.
- **Shell cron behaviour preserved.** If a shell job has no `timeoutMs`, `execSync` is called without a `timeout` option, which Node treats as infinite — the freedom the old hard-coded 60 s cap took away.
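The "≤ 0 means no timer" semantics can be sketched in isolation. This is a simplified illustration, not the package's code — `armTimeout` is a hypothetical helper standing in for the timer-arming logic inside `spawnSubAgent`:

```javascript
// Only arm an abort timer for a positive timeout; -1 / 0 = unlimited.
// Call sites then do `if (timeoutId) clearTimeout(timeoutId)`, which is a
// safe no-op when no timer was ever created.
function armTimeout(timeoutMs, onTimeout) {
  if (typeof timeoutMs === "number" && timeoutMs > 0) {
    return setTimeout(onTimeout, timeoutMs);
  }
  return null; // unlimited: the agent runs until done, cancelled, or killed
}
```

With `-1` (the new default) no timer exists at all, so there is nothing to fire and nothing to leak.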

**ENV var still works but is seed-only.** `SUBAGENT_TIMEOUT=600000` at startup still seeds the config on first load, but the persisted value in `sub-agents.json` wins after that.
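That precedence ("persisted value wins, env only seeds") can be distilled into one resolver. `resolveDefaultTimeoutMs` is a hypothetical name — the real logic is split across `config.js` and `subagents.js`:

```javascript
// Persisted sub-agents.json value wins; the env var is consulted only when
// nothing has been persisted yet. Anything non-positive collapses to -1.
function resolveDefaultTimeoutMs(persisted, envSeed) {
  if (typeof persisted === "number" && Number.isFinite(persisted)) {
    return persisted <= 0 ? -1 : Math.floor(persisted);
  }
  const raw = envSeed !== undefined && envSeed !== "" ? Number(envSeed) : NaN;
  return Number.isFinite(raw) && raw > 0 ? Math.floor(raw) : -1;
}
```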

### 🐛 Silenced harmless `message is not modified` Telegram errors

Occasionally Ali would see a red banner at the bottom of an Alvin message:

> Error: Call to 'editMessageText' failed! (400: Bad Request: message is not modified: specified new message content and reply markup are exactly the same as a current content and reply markup of the message)

It never broke anything, but it polluted the logs and showed up as an "internal error" reply to the user. Root cause: Telegram's Bot API refuses `editMessageText` when the new content + reply markup are byte-identical to the existing message. This happens legitimately in callback handlers — e.g. tapping a cron-toggle button twice, re-rendering a sudo/keys/platforms menu, language-switch callbacks that render the same content, or stream flushes where the throttled partial hasn't changed since the last edit.

**Fix**: `bot.catch()` in `src/index.ts` now filters out this specific error early. Two regex patterns (`/message is not modified/i` and `/specified new message content.*exactly the same/i`) cover both variants Telegram sends. Real errors (network, SDK, provider failures) still log and still surface the "internal error" reply to the user — only this one harmless class gets dropped.
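The filter reduces to a small predicate. `isBenignEditError` is a hypothetical helper — in the shipped code the check lives inline in `bot.catch()`:

```javascript
// The two patterns cover both variants of Telegram's 400
// "message is not modified" error text.
const BENIGN_EDIT_PATTERNS = [
  /message is not modified/i,
  /specified new message content.*exactly the same/i,
];

function isBenignEditError(err) {
  const msg = err instanceof Error ? err.message : String(err);
  return BENIGN_EDIT_PATTERNS.some((re) => re.test(msg));
}
```

Anything that doesn't match falls through to the normal error path (log + user-facing reply).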

### 📝 CLAUDE.md: PM2 references updated to launchd

The project `CLAUDE.md` still said *"PM2: `alvin-bot` Prozess, Config in `ecosystem.config.cjs`"* ("PM2: `alvin-bot` process, config in `ecosystem.config.cjs`") — outdated since the 4.8.6 switch to launchd. Updated to reflect the actual process manager (`~/Library/LaunchAgents/com.alvinbot.app.plist`, `KeepAlive=true`, `RunAtLoad=true`), the log paths, and a note that `watchdog.ts` only brakes process crash-loops — it does **not** kill long-running sessions or sub-agents. `ecosystem.config.cjs` is now labelled legacy.

The global `~/.claude/CLAUDE.md` was also corrected: `alvin-bot` was removed from the VPS PM2-process list (it runs locally, not on the VPS) and the cron-hub note now correctly says "als **launchd LaunchAgent**" ("as a **launchd LaunchAgent**").

## [4.8.7] — 2026-04-11

### 🐛 `/update` now detects a stale runtime (rebuild without restart)

Caught immediately after publishing 4.8.6 on the Mac mini: `/update` reported "Already up to date — no new commits" even though the running process was on **v4.8.5** while the disk was already built at **v4.8.6**. The user could see the version mismatch in `/status` (v4.8.5), but `/update` refused to acknowledge it.

**Root cause**: The updater only compared **git commits** (or the **npm registry version**) against the local install. It never checked whether the **running process's in-memory version** was older than the **on-disk built version**. This is the dev/CI loop scenario:

1. You edit `src/`, bump `package.json`, commit + push
2. `npm run build` regenerates `dist/` at the new version
3. The running process has the OLD code in memory
4. You run `/update` in Telegram
5. git: HEAD == origin/main (just pushed) → 0 commits behind → "up to date"
6. The process never restarts → keeps running the OLD code

**Fix**: New `isRuntimeStale()` check at the very start of `runUpdate()`. It compares `BOT_VERSION` (captured in memory at process start) against `package.json.version` from disk via the existing semver compare. If the disk is newer, it returns success with `requiresRestart=true` immediately — skipping the git/npm fetch entirely and just signalling a restart so the fresh code takes effect.
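The check itself is tiny. A minimal sketch — `compareSemver` here is a simplified stand-in for the updater's existing semver compare, and the version arguments stand in for `BOT_VERSION` and the freshly read `package.json.version`:

```javascript
// Compare two "major.minor.patch" strings: negative / zero / positive
// like a classic comparator. Pre-release tags are ignored in this sketch.
function compareSemver(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if ((pa[i] || 0) !== (pb[i] || 0)) return (pa[i] || 0) - (pb[i] || 0);
  }
  return 0;
}

// Disk newer than the in-memory version => rebuilt without a restart.
function isRuntimeStale(runtimeVersion, diskVersion) {
  return compareSemver(diskVersion, runtimeVersion) > 0;
}
```

This is exactly the v4.8.5-in-memory / v4.8.6-on-disk case from the bug report.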

After 4.8.7, running `/update` after a manual rebuild will correctly say *"Disk is already built at vX, running vY. Restarting to pick up the new code..."* and trigger the restart.

### ✨ Internal watchdog with crash-loop brake (`src/services/watchdog.ts`)

Ali asked for "derbe persistent" (German slang: seriously persistent) — we were already 95% there with `KeepAlive: true` from 4.8.6, but the missing piece was a brake to stop the bot from restart-looping forever when a deterministic crash happens (corrupt state file, missing dependency, broken upgrade).

**New module**: `src/services/watchdog.ts`. Four pieces:

**1. Liveness beacon.** Every 30 s the bot writes `~/.alvin-bot/state/watchdog.json` with `{lastBeat, pid, bootTime, crashCount, crashWindowStart, version}` — a fast disk write that doesn't block the event loop.

**2. Crash-loop brake.** On every fresh boot, the watchdog reads the previous beacon:

- If the previous beacon is **less than 90 s old** → the previous process exited very recently → that's a crash (or a deliberate restart, treated the same way for the brake's purpose). Increment `crashCount`.
- If the previous beacon is **older than 90 s** → the previous process had clean uptime → reset the counter to 0.
- The crash window is **10 minutes**. Crashes within this window accumulate; older ones don't count.
- If `crashCount` reaches **10**, the brake engages:
  - Writes `~/.alvin-bot/state/crash-loop.alert` with the timestamp, version, error log path, and recovery steps
  - Tries to `launchctl unload -w` its own LaunchAgent so launchd stops retrying (otherwise `KeepAlive: true` would keep burning CPU forever)
  - Exits with code 3

**3. Recovery.** After **5 minutes of clean uptime**, the watchdog auto-resets the crash counter to 0, so a healthy bot that occasionally has a transient hiccup doesn't slowly accumulate toward the brake over days.
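The boot-time decision above can be condensed into one pure function. This is a hypothetical distillation (`nextCrashState` is not the package's API) using the thresholds from the changelog — 90 s beacon freshness, a 10-minute window, brake at 10 crashes:

```javascript
const FRESH_MS = 90_000;          // beacon younger than this => prior crash
const WINDOW_MS = 10 * 60_000;    // crash-accumulation window
const MAX_CRASHES = 10;           // brake threshold

// prev: previous beacon ({lastBeat, crashCount, crashWindowStart}) or null.
// Returns the counter state for this boot plus the brake decision.
function nextCrashState(prev, now) {
  // No beacon, or a stale one => the previous process had clean uptime.
  if (!prev || now - prev.lastBeat >= FRESH_MS) {
    return { crashCount: 0, crashWindowStart: now, engageBrake: false };
  }
  // Fresh beacon => crash. Restart the window if the old one expired.
  const expired = now - prev.crashWindowStart > WINDOW_MS;
  const crashCount = (expired ? 0 : prev.crashCount) + 1;
  return {
    crashCount,
    crashWindowStart: expired ? now : prev.crashWindowStart,
    engageBrake: crashCount >= MAX_CRASHES,
  };
}
```

Keeping the decision pure like this makes the 90 s / 10 min / 10-crash boundaries trivial to unit-test.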

**4. Brake check at startup.** `checkCrashLoopBrake()` runs in `index.ts` **before** any expensive init — if the alert file already exists, the bot exits cleanly with code 3 and tries to unload itself again. This prevents launchd from spinning the bot up just to write the same alert over and over.

**Recovery from a tripped brake**:

```bash
# 1. Investigate the error log
cat ~/.alvin-bot/logs/alvin-bot.err.log

# 2. Fix whatever was wrong

# 3. Remove the alert file
rm ~/.alvin-bot/state/crash-loop.alert

# 4. Reload the LaunchAgent
alvin-bot launchd install
```

**What this catches**:

- Process crashes (segfault, OOM kill) → exit non-zero → brake increments
- `process.exit()` from an unhandled rejection → similar
- Tight crash loops → brake engages at 10 within 10 min
- Corrupted state files that crash on read → brake engages eventually

**What this does NOT catch (yet)**:

- Event-loop deadlocks where the process is alive but completely frozen. The beacon needs a live event loop, so the watchdog can't detect a freeze. A future release will add an external sister LaunchAgent (`com.alvinbot.watchdog`) that runs every 2 minutes via `StartInterval` and kills the main bot if its beacon file is too stale. Tracked as a follow-up.

**Telemetry surface**: `alvin-bot status` could read the beacon file in a future release to show "crash count: X in last Y minutes" — for now, the alert file is the main user-facing signal.

### 🛡 LaunchAgent: ProcessType + LimitLoadToSessionType

Two small plist hardening tweaks:

- **`ProcessType: Background`** — an explicit hint to launchd that this is a long-running background service. macOS gives Background processes friendlier scheduling and is less likely to kill them under memory pressure (vs `Standard`, the default for unlabeled jobs).
- **`LimitLoadToSessionType: Aqua`** — only loads in user GUI sessions. This prevents the LaunchAgent from accidentally loading in non-GUI contexts (e.g. an SSH login session) where it would not have Keychain access. Defensive: it matches our existing assumption that the bot needs the GUI keychain unlocked for Claude SDK OAuth.

These don't change behaviour for normal use, but they make our intent explicit: macOS treats the bot as a proper background service rather than a generic foreground job.

### Tests

87 still passing — no test changes (the stale-runtime check is a fast-path branch that doesn't disturb the existing git/npm logic).

## [4.8.6] — 2026-04-11

### 🐛 LaunchAgent: `/restart` left the bot down forever

package/bin/cli.js
CHANGED

@@ -1466,6 +1466,12 @@ function renderLaunchdPlist({ label, nodePath, entryPoint, cwd, home, logDir })
     <key>ThrottleInterval</key>
     <integer>5</integer>

+    <key>ProcessType</key>
+    <string>Background</string>
+
+    <key>LimitLoadToSessionType</key>
+    <string>Aqua</string>
+
     <key>StandardOutPath</key>
     <string>${logDir}/alvin-bot.out.log</string>

package/dist/config.js
CHANGED

@@ -45,7 +45,13 @@ export const config = {
     compactionThreshold: Number(process.env.COMPACTION_THRESHOLD) || 80000,
     // Sub-Agents
     maxSubAgents: Number(process.env.MAX_SUBAGENTS) || 4,
-
+    // Default sub-agent timeout. -1 / 0 = unlimited (no hard cut-off).
+    // The runtime value lives in sub-agents.json and can be changed at runtime
+    // via /subagents timeout; this constant only seeds the initial config on
+    // first launch when SUBAGENT_TIMEOUT is not set.
+    subAgentTimeout: process.env.SUBAGENT_TIMEOUT !== undefined && process.env.SUBAGENT_TIMEOUT !== ""
+        ? Number(process.env.SUBAGENT_TIMEOUT)
+        : -1,
     // TTS Provider
     ttsProvider: (process.env.TTS_PROVIDER || "edge"),
     elevenlabs: {

package/dist/handlers/commands.js
CHANGED

@@ -1277,9 +1277,29 @@ export function registerCommands(bot) {
         `Commands: /cron add · delete · toggle · run · info`, { parse_mode: "HTML", reply_markup: keyboard });
         return;
     }
-    // /cron add <schedule> <type> <payload>
+    // /cron add <schedule> <type> <payload> [--timeout <sec|off>]
     if (arg.startsWith("add ")) {
-
+        let rest = arg.slice(4).trim();
+        // Extract optional --timeout flag from anywhere in the command.
+        // Accepts seconds, "off", "unlimited", "-1", or "0" — anything ≤ 0
+        // or non-numeric collapses to -1 (unlimited).
+        let timeoutMs;
+        const timeoutMatch = rest.match(/(^|\s)--timeout\s+(\S+)/);
+        if (timeoutMatch) {
+            const val = timeoutMatch[2].toLowerCase();
+            if (["off", "unlimited", "infinite", "-1", "0"].includes(val)) {
+                timeoutMs = -1;
+            }
+            else {
+                const secs = Number(timeoutMatch[2]);
+                if (!Number.isFinite(secs) || secs < 0) {
+                    await ctx.reply(`❌ Invalid <code>--timeout</code> value: ${timeoutMatch[2]}`, { parse_mode: "HTML" });
+                    return;
+                }
+                timeoutMs = Math.floor(secs * 1000);
+            }
+            rest = rest.replace(/(^|\s)--timeout\s+\S+/, "").trim();
+        }
     // Natural language schedule shortcuts (German + English)
     const naturalSchedules = {
         "täglich": "0 8 * * *", "daily": "0 8 * * *",
@@ -1342,7 +1362,7 @@ export function registerCommands(bot) {
     else {
         const sp = rest.indexOf(" ");
         if (sp < 0) {
-            await ctx.reply("Format: <code>/cron add <schedule> <type> <payload>
+            await ctx.reply("Format: <code>/cron add <schedule> <type> <payload> [--timeout <sec|off>]</code>\n\nSchedule options:\n• <b>Intervals:</b> 5m, 1h, 30s, 2d\n• <b>Natural:</b> daily, weekly, monthly, weekdays, hourly\n• <b>With time:</b> 8:30 daily, weekdays 9:00\n• <b>German:</b> täglich, wöchentlich, morgens, abends\n• <b>Cron:</b> \"0 9 * * 1-5\"\n\nOptional <code>--timeout</code> in seconds, or <code>off</code>/<code>-1</code> for unlimited.", { parse_mode: "HTML" });
             return;
         }
         schedule = rest.slice(0, sp);
@@ -1381,12 +1401,19 @@ export function registerCommands(bot) {
             payload,
             target: { platform: "telegram", chatId: String(chatId) },
             createdBy: `telegram:${userId}`,
+            ...(timeoutMs !== undefined ? { timeoutMs } : {}),
         });
         const readableSched = humanReadableSchedule(job.schedule);
+        const timeoutLine = typeof job.timeoutMs === "number"
+            ? job.timeoutMs <= 0
+                ? `<b>Timeout:</b> ∞ (unlimited)\n`
+                : `<b>Timeout:</b> ${Math.round(job.timeoutMs / 1000)}s\n`
+            : "";
         await ctx.reply(`✅ <b>Cron Job created</b>\n\n` +
             `<b>Name:</b> ${job.name}\n` +
             `📅 <b>${readableSched}</b>\n` +
             `<b>Type:</b> ${job.type}\n` +
+            timeoutLine +
             `<b>Next run:</b> ${formatNextRun(job.nextRunAt)}\n` +
             `<b>ID:</b> <code>${job.id}</code>`, { parse_mode: "HTML" });
         return;
@@ -1734,7 +1761,7 @@ export function registerCommands(bot) {
     // type both "/sub-agents" and "/subagents" — Telegram routes both to this.
     bot.command(["sub_agents", "subagents"], async (ctx) => {
         const lang = getSession(ctx.from.id).language;
-        const { listSubAgents, cancelSubAgent, getSubAgentResult, getMaxParallelAgents, getConfiguredMaxParallel, setMaxParallelAgents, findSubAgentByName, getVisibility, setVisibility, getQueueCap, setQueueCap, } = await import("../services/subagents.js");
+        const { listSubAgents, cancelSubAgent, getSubAgentResult, getMaxParallelAgents, getConfiguredMaxParallel, setMaxParallelAgents, findSubAgentByName, getVisibility, setVisibility, getQueueCap, setQueueCap, getDefaultTimeoutMs, setDefaultTimeoutMs, } = await import("../services/subagents.js");
         const arg = (ctx.match || "").trim();
         const tokens = arg.split(/\s+/).filter(Boolean);
         const sub = tokens[0]?.toLowerCase() || "";
@@ -1792,6 +1819,47 @@ export function registerCommands(bot) {
         await ctx.reply(lines.join("\n"), { parse_mode: "Markdown" });
         return;
     }
+    // /subagents timeout [sec|off|unlimited|-1] — set default sub-agent timeout
+    if (sub === "timeout") {
+        const val = tokens[1];
+        const formatTimeout = (ms) => {
+            if (ms <= 0)
+                return "∞ (unlimited)";
+            if (ms < 1000)
+                return `${ms}ms`;
+            const sec = ms / 1000;
+            if (sec < 60)
+                return `${sec}s`;
+            const min = sec / 60;
+            if (min < 60)
+                return `${min.toFixed(min < 10 ? 1 : 0)}min`;
+            return `${(min / 60).toFixed(1)}h`;
+        };
+        if (!val) {
+            const current = getDefaultTimeoutMs();
+            await ctx.reply(`⏱ Default sub-agent timeout: *${formatTimeout(current)}*\n\n` +
+                `Usage: \`/subagents timeout <sec>\` · \`/subagents timeout off\`\n` +
+                `\`off\`, \`unlimited\`, \`-1\` oder \`0\` = kein Timeout. ` +
+                `Gilt für neue Subagents und ai-query Cron-Jobs ohne eigenen Wert.`, { parse_mode: "Markdown" });
+            return;
+        }
+        const lower = val.toLowerCase();
+        let ms;
+        if (["off", "unlimited", "infinite", "-1", "0"].includes(lower)) {
+            ms = -1;
+        }
+        else {
+            const secs = Number(val);
+            if (!Number.isFinite(secs) || secs < 0) {
+                await ctx.reply(`❌ Ungültiger Wert \`${val}\`. Nutze Sekunden (z.B. \`300\`) oder \`off\`.`, { parse_mode: "Markdown" });
+                return;
+            }
+            ms = Math.floor(secs * 1000);
+        }
+        const effective = setDefaultTimeoutMs(ms);
+        await ctx.reply(`✅ Default sub-agent timeout: *${formatTimeout(effective)}*`, { parse_mode: "Markdown" });
+        return;
+    }
     // /subagents queue <n> — set bounded-queue cap (0 disables queue)
     if (sub === "queue") {
         const n = parseInt(tokens[1] || "", 10);
@@ -1921,6 +1989,10 @@ export function registerCommands(bot) {
         ? `${t("bot.subagents.maxLabel", lang)} 0 ${t("bot.subagents.autoSuffix", lang, { n: effective })}`
         : `${t("bot.subagents.maxLabel", lang)} ${configured}`;
     const visibilityLabel = `${t("bot.subagents.visibilityLabel", lang)} *${getVisibility()}*`;
+    const currentTimeout = getDefaultTimeoutMs();
+    const timeoutLabel = currentTimeout <= 0
+        ? `⏱ Timeout: *∞ (unlimited)*`
+        : `⏱ Timeout: *${Math.round(currentTimeout / 1000)}s*`;
     const agents = listSubAgents();
     let body = "";
     if (agents.length === 0) {
@@ -1931,7 +2003,7 @@ export function registerCommands(bot) {
     }
     const header = t("bot.subagents.header", lang);
     const usage = `\n\n${t("bot.subagents.usage", lang)}`;
-    const full = `${header}\n${maxLabel}\n${visibilityLabel}${body}${usage}`;
+    const full = `${header}\n${maxLabel}\n${visibilityLabel}\n${timeoutLabel}${body}${usage}`;
     await ctx.reply(full, { parse_mode: "Markdown" }).catch(() => ctx.reply(full));
 });
}
package/dist/i18n.js
CHANGED

@@ -519,10 +519,10 @@ const strings = {
         fr: "Durée : {sec}s · Tokens : {in}/{out}",
     },
     "bot.subagents.usage": {
-        en: "Commands:\n/subagents — show status\n/subagents max <n> — set parallel limit (0=auto)\n/subagents visibility <auto|banner|silent|live> — delivery mode\n/subagents queue <n> — bounded-queue cap (0 = disabled)\n/subagents stats — last 24h run stats\n/subagents list — list all\n/subagents cancel <name|id> — cancel one\n/subagents result <name|id> — show result",
-        de: "Befehle:\n/subagents — Status anzeigen\n/subagents max <n> — Parallel-Limit setzen (0=auto)\n/subagents visibility <auto|banner|silent|live> — Delivery-Modus\n/subagents list — alle anzeigen\n/subagents cancel <name|id> — abbrechen\n/subagents result <name|id> — Ergebnis anzeigen",
-        es: "Comandos:\n/subagents — ver estado\n/subagents max <n> — establecer límite (0=auto)\n/subagents visibility <auto|banner|silent|live> — modo de entrega\n/subagents list — listar todos\n/subagents cancel <nombre|id> — cancelar uno\n/subagents result <nombre|id> — ver resultado",
-        fr: "Commandes :\n/subagents — état\n/subagents max <n> — limite parallèle (0=auto)\n/subagents visibility <auto|banner|silent|live> — mode de livraison\n/subagents list — lister tous\n/subagents cancel <nom|id> — annuler un\n/subagents result <nom|id> — voir résultat",
+        en: "Commands:\n/subagents — show status\n/subagents max <n> — set parallel limit (0=auto)\n/subagents timeout <sec|off> — default timeout (off = unlimited)\n/subagents visibility <auto|banner|silent|live> — delivery mode\n/subagents queue <n> — bounded-queue cap (0 = disabled)\n/subagents stats — last 24h run stats\n/subagents list — list all\n/subagents cancel <name|id> — cancel one\n/subagents result <name|id> — show result",
+        de: "Befehle:\n/subagents — Status anzeigen\n/subagents max <n> — Parallel-Limit setzen (0=auto)\n/subagents timeout <sec|off> — Default-Timeout (off = unendlich)\n/subagents visibility <auto|banner|silent|live> — Delivery-Modus\n/subagents queue <n> — Queue-Cap (0 = deaktiviert)\n/subagents list — alle anzeigen\n/subagents cancel <name|id> — abbrechen\n/subagents result <name|id> — Ergebnis anzeigen",
+        es: "Comandos:\n/subagents — ver estado\n/subagents max <n> — establecer límite (0=auto)\n/subagents timeout <seg|off> — timeout por defecto (off = sin límite)\n/subagents visibility <auto|banner|silent|live> — modo de entrega\n/subagents list — listar todos\n/subagents cancel <nombre|id> — cancelar uno\n/subagents result <nombre|id> — ver resultado",
+        fr: "Commandes :\n/subagents — état\n/subagents max <n> — limite parallèle (0=auto)\n/subagents timeout <sec|off> — délai par défaut (off = illimité)\n/subagents visibility <auto|banner|silent|live> — mode de livraison\n/subagents list — lister tous\n/subagents cancel <nom|id> — annuler un\n/subagents result <nom|id> — voir résultat",
     },
     "bot.subagents.visibilityLabel": {
         en: "Visibility:",
package/dist/index.js
CHANGED

@@ -14,6 +14,11 @@ if (hasLegacyData()) {
 }
 // 3. Seed defaults for any files that don't exist yet (fresh install)
 seedDefaults();
+// 4. Crash-loop brake check — if we've crashed N times in a short window,
+//    refuse to start, write an alert file, and unload our LaunchAgent so
+//    launchd stops retrying. Runs BEFORE any expensive init so a broken
+//    state file doesn't tank the whole CPU.
+checkCrashLoopBrake();
 // ── Normal imports (safe now — DATA_DIR is ready) ──────────────────
 import { Bot, InlineKeyboard } from "grammy";
 import { config } from "./config.js";
@@ -76,6 +81,7 @@ import { loadSkills } from "./services/skills.js";
 import { loadHooks } from "./services/hooks.js";
 import { registerShutdownHandler } from "./services/restart.js";
 import { cancelAllSubAgents } from "./services/subagents.js";
+import { startWatchdog, stopWatchdog, checkCrashLoopBrake } from "./services/watchdog.js";
 import { getRegistry } from "./engine.js";
 import { scanAssets } from "./services/asset-index.js";
 // Scan asset directory and generate INDEX.json + INDEX.md
@@ -210,10 +216,20 @@ if (hasTelegram) {
     bot.on("message:photo", handlePhoto);
     bot.on("message:document", handleDocument);
     bot.on("message:text", handleMessage);
-    // Error handling — log but don't crash
+    // Error handling — log but don't crash.
     bot.catch((err) => {
         const ctx = err.ctx;
         const e = err.error;
+        // Telegram's "message is not modified" (400) is harmless — it fires
+        // when a callback handler re-renders an inline keyboard / edited
+        // message with content that happens to match the current message
+        // exactly (e.g. double-tapped toggle button, identical list after
+        // re-render). Swallow it silently so it neither pollutes the logs
+        // nor bubbles up to the user as "internal error".
+        const msg = e instanceof Error ? e.message : String(e);
+        if (/message is not modified/i.test(msg) || /specified new message content.*exactly the same/i.test(msg)) {
+            return;
+        }
         console.error(`Error handling update ${ctx?.update?.update_id}:`, e);
         // Try to notify the user
         if (ctx?.chat?.id) {
@@ -235,6 +251,7 @@ const shutdown = async () => {
     // agents can post a cancellation message to Telegram before the bot
     // stops. Capped at 5s internally so a hang can't block shutdown.
     await cancelAllSubAgents(true);
+    stopWatchdog();
     stopScheduler();
     stopSessionCleanup();
     if (queueInterval)
@@ -472,6 +489,8 @@ if (bot) {
     console.log(`   Users: ${config.allowedUsers.length} authorized`);
     // Start heartbeat monitor
     startHeartbeat();
+    // Start internal watchdog (crash-loop brake + liveness beacon)
+    startWatchdog();
     // Index memory vectors in background (non-blocking)
     initEmbeddings().catch(() => { });
 },
@@ -483,5 +502,6 @@ else {
     console.log(`   WebUI: http://localhost:${process.env.WEB_PORT || 3100}`);
     // Start heartbeat monitor even without Telegram
     startHeartbeat();
+    startWatchdog();
     initEmbeddings().catch(() => { });
 }
package/dist/services/cron.js
CHANGED

@@ -122,11 +122,16 @@ async function executeJob(job) {
         }
         case "shell": {
             const cmd = job.payload.command || "echo 'no command'";
-
-
+            // Per-job timeout, default = no timeout (execSync treats timeout=0
+            // or "undefined" as infinite). Users opt in via /cron add … --timeout N.
+            const shellOpts = {
                 stdio: "pipe",
                 env: { ...process.env, PATH: process.env.PATH + ":/opt/homebrew/bin:/usr/local/bin" },
-            }
+            };
+            if (typeof job.timeoutMs === "number" && job.timeoutMs > 0) {
+                shellOpts.timeout = job.timeoutMs;
+            }
+            const output = execSync(cmd, shellOpts).toString().trim();
             // Notify with output
             if (notifyCallback && output) {
                 await notifyCallback(job.target, `🔧 ${job.name}\n\`\`\`\n${output.slice(0, 3000)}\n\`\`\``);
@@ -173,14 +178,20 @@ async function executeJob(job) {
             ? Number(job.target.chatId)
             : undefined;
         const result = await new Promise((resolve, reject) => {
-
+            // Only pass `timeout` through when the job has a per-job value.
+            // Otherwise the sub-agent inherits the current /subagents default.
+            const spawnConfig = {
                 name: job.name,
                 prompt,
                 workingDir: BOT_ROOT,
                 source: "cron",
                 parentChatId,
                 onComplete: (r) => resolve(r),
-            }
+            };
+            if (typeof job.timeoutMs === "number") {
+                spawnConfig.timeout = job.timeoutMs;
+            }
+            spawnSubAgent(spawnConfig).catch(reject);
         });
         // Non-success: don't notify here. The I3 delivery router has
         // already posted the appropriate banner (cancelled / timeout /
@@ -309,6 +320,7 @@ export function createJob(input) {
         nextRunAt: null,
         runCount: 0,
         createdBy: input.createdBy || "unknown",
+        ...(typeof input.timeoutMs === "number" ? { timeoutMs: input.timeoutMs } : {}),
     };
     // Calculate first run
     job.nextRunAt = calculateNextRun(job);

package/dist/services/subagents.js
CHANGED

@@ -21,6 +21,14 @@ let configCache = null;
 function isValidVisibility(v) {
     return v === "auto" || v === "banner" || v === "silent" || v === "live";
 }
+/** Resolve the initial default timeout from config.ts, which itself seeds
+ * from the SUBAGENT_TIMEOUT env var. -1 = unlimited. */
+function seedDefaultTimeout() {
+    const raw = config.subAgentTimeout;
+    if (typeof raw !== "number" || !Number.isFinite(raw) || raw <= 0)
+        return -1;
+    return Math.floor(raw);
+}
 function loadSubAgentsConfig() {
     if (configCache)
         return configCache;
@@ -33,14 +41,18 @@ function loadSubAgentsConfig() {
             queueCap: typeof parsed.queueCap === "number"
                 ? Math.max(0, Math.min(Math.floor(parsed.queueCap), ABSOLUTE_MAX_QUEUE))
                 : DEFAULT_QUEUE_CAP,
+            defaultTimeoutMs: typeof parsed.defaultTimeoutMs === "number" && Number.isFinite(parsed.defaultTimeoutMs)
+                ? (parsed.defaultTimeoutMs <= 0 ? -1 : Math.floor(parsed.defaultTimeoutMs))
+                : seedDefaultTimeout(),
         };
     }
     catch {
-        // File missing or invalid — seed from env
+        // File missing or invalid — seed from env vars then default to auto/unlimited
         configCache = {
             maxParallel: Number(process.env.MAX_SUBAGENTS) || 0,
             visibility: "auto",
             queueCap: DEFAULT_QUEUE_CAP,
+            defaultTimeoutMs: seedDefaultTimeout(),
         };
     }
     return configCache;
@@ -102,6 +114,18 @@ export function setQueueCap(n) {
     saveSubAgentsConfig({ ...cfg, queueCap: clamped });
     return clamped;
 }
+/** Current default timeout in ms. -1 = unlimited. */
+export function getDefaultTimeoutMs() {
+    return loadSubAgentsConfig().defaultTimeoutMs;
+}
+/** Set the default timeout in ms. Any value ≤ 0 or non-finite collapses
+ * to -1 (unlimited). Returns the persisted value. */
+export function setDefaultTimeoutMs(ms) {
+    const normalized = !Number.isFinite(ms) || ms <= 0 ? -1 : Math.floor(ms);
+    const cfg = loadSubAgentsConfig();
+    saveSubAgentsConfig({ ...cfg, defaultTimeoutMs: normalized });
+    return normalized;
+}
 // ── State ───────────────────────────────────────────────
 const activeAgents = new Map();
 // ── Name resolver (B2) ──────────────────────────────────
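The loader and setter hunks above share one normalization rule: any non-finite or ≤ 0 value collapses to `-1` ("unlimited"), everything else is floored to whole milliseconds. A standalone restatement of that rule (not the shipped module, which also persists the result):

```javascript
// Normalize a user-supplied timeout: -1 means "no timeout".
function normalizeTimeoutMs(ms) {
  return !Number.isFinite(ms) || ms <= 0 ? -1 : Math.floor(ms);
}

console.log(normalizeTimeoutMs(3600 * 1000)); // 3600000  (1 h)
console.log(normalizeTimeoutMs(0));           // -1       (unlimited)
console.log(normalizeTimeoutMs(NaN));         // -1       (garbage → unlimited)
console.log(normalizeTimeoutMs(1500.9));      // 1500     (floored to whole ms)
```

Collapsing `0`, negatives, `NaN`, and `Infinity` all into `-1` means every downstream consumer only has to test `timeout > 0` to decide whether to arm a timer.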
@@ -433,14 +457,23 @@ export function spawnSubAgent(agentConfig) {
     const resolved = resolveAgentName(agentConfig.name);
     const resolvedName = resolved.name;
     const id = crypto.randomUUID();
-
+    // Timeout resolution order:
+    // 1. Per-spawn override (agentConfig.timeout) — used by cron jobs that
+    //    carry their own timeoutMs.
+    // 2. Runtime default from sub-agents.json (set via /subagents timeout).
+    // 3. config.subAgentTimeout fallback (seeded from SUBAGENT_TIMEOUT env).
+    // Any value ≤ 0 means "no timeout" — we simply don't arm the abort timer.
+    // The existing null-safe `clearTimeout(timeoutId)` call sites make this
+    // a safe no-op when the agent finishes or is cancelled.
+    const timeout = agentConfig.timeout ?? getDefaultTimeoutMs();
     const abort = new AbortController();
-    const timeoutId = setTimeout(() => abort.abort(), timeout);
+    const timeoutId = timeout > 0 ? setTimeout(() => abort.abort(), timeout) : null;
     const willRunImmediately = running < maxParallel;
     const canQueue = !willRunImmediately && queueCap > 0 && queuedLen < queueCap;
     if (!willRunImmediately && !canQueue) {
         // No slot, no queue room → priority-aware reject
-
+        if (timeoutId)
+            clearTimeout(timeoutId);
         const source = sourceOf(agentConfig);
         const runningAgents = [...activeAgents.values()].filter((a) => a.info.status === "running");
         const userSlots = runningAgents.filter((a) => a.info.source === "user").length;
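The core of the hunk above is "don't arm the timer when the timeout is unlimited": with `timeout ≤ 0` no `setTimeout` is ever created, so nothing can fire, and the `clearTimeout` call sites just check for `null`. A simplified sketch (`armAbort` is a hypothetical helper; the real code inlines this in spawnSubAgent after resolving the per-spawn override):

```javascript
// Arm an AbortController deadline only when a positive timeout is given.
function armAbort(timeoutMs) {
  const abort = new AbortController();
  const timeoutId = timeoutMs > 0 ? setTimeout(() => abort.abort(), timeoutMs) : null;
  // Null-safe disarm: a no-op for unlimited agents or after completion.
  const cancel = () => { if (timeoutId) clearTimeout(timeoutId); };
  return { signal: abort.signal, cancel };
}

const limited = armAbort(50);
limited.cancel(); // agent finished before the deadline — timer disarmed
const unlimited = armAbort(-1); // -1 / 0 → no timer is ever armed
console.log(limited.signal.aborted, unlimited.signal.aborted); // false false
```

Skipping the timer entirely (rather than passing a huge delay) also avoids Node's 32-bit `setTimeout` cap, where delays over ~24.8 days fire immediately.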
package/dist/services/updater.js
CHANGED
@@ -19,6 +19,7 @@ import { resolve, dirname } from "path";
 import { fileURLToPath } from "url";
 import fs from "fs";
 import os from "os";
+import { BOT_VERSION } from "../version.js";
 const execAsync = promisify(exec);
 const PROJECT_ROOT = resolve(dirname(fileURLToPath(import.meta.url)), "../..");
 const DATA_DIR = process.env.ALVIN_DATA_DIR || resolve(os.homedir(), ".alvin-bot");
@@ -84,12 +85,40 @@ function compareSemver(a, b) {
     }
     return 0;
 }
+/**
+ * Is the running bot's in-memory version older than what's already built
+ * on disk? This happens when the dev/CI rebuilt the bot mid-session and
+ * the process hasn't restarted yet. A manual /update without a git/npm
+ * fetch should still trigger a restart in this case so the fresh code
+ * takes effect.
+ */
+function isRuntimeStale() {
+    const onDisk = readLocalVersion();
+    if (!onDisk || !BOT_VERSION || BOT_VERSION === "unknown")
+        return false;
+    return compareSemver(BOT_VERSION, onDisk) < 0;
+}
 /** Pull latest changes, install deps, rebuild. Returns a structured result
  * instead of throwing so the /update command can report cleanly to Telegram.
  * Dispatches to the git path for source installs and the npm path for
- * npm-global installs. */
+ * npm-global installs.
+ *
+ * Before doing any fetch, checks whether the disk is already newer than
+ * the running process (i.e. someone rebuilt between the process start
+ * and this call). If so, returns success with requiresRestart=true so
+ * the command handler can trigger a graceful restart.
+ */
 export async function runUpdate() {
     try {
+        // Stale-runtime check: disk is already newer than the running code.
+        if (isRuntimeStale()) {
+            const onDisk = readLocalVersion();
+            return {
+                ok: true,
+                message: `Disk is already built at v${onDisk}, running v${BOT_VERSION}. Restarting to pick up the new code...`,
+                requiresRestart: true,
+            };
+        }
        if (isOwnGitRepo()) {
            return await runGitUpdate();
        }
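The stale-runtime check reduces to one semver comparison: if the version on disk sorts after the version the process booted with, restart without fetching. A self-contained sketch under stated assumptions — this `compareSemver` is a hypothetical minimal reimplementation (numeric dotted components only, no pre-release tags), and this `isRuntimeStale` takes its versions as parameters instead of reading them from disk:

```javascript
// Compare two "x.y.z" versions: -1 if a < b, 1 if a > b, 0 if equal.
function compareSemver(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if ((pa[i] || 0) < (pb[i] || 0)) return -1;
    if ((pa[i] || 0) > (pb[i] || 0)) return 1;
  }
  return 0;
}

// Stale = a newer build already sits on disk; bail out on unknowns so a
// broken version file can never trigger a restart loop.
function isRuntimeStale(running, onDisk) {
  if (!onDisk || !running || running === "unknown") return false;
  return compareSemver(running, onDisk) < 0;
}

console.log(isRuntimeStale("4.8.6", "4.8.8")); // true  — restart picks up the new build
console.log(isRuntimeStale("4.8.8", "4.8.8")); // false — nothing to do
console.log(isRuntimeStale("unknown", "4.8.8")); // false — can't tell, don't loop
```

The "unknown" guard matters: returning `requiresRestart: true` on an unreadable version would restart the bot on every /update forever.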
package/dist/services/watchdog.js
ADDED
@@ -0,0 +1,236 @@
+/**
+ * Internal Watchdog — Self-monitoring for crash-loop detection.
+ *
+ * Writes a liveness beacon file every 30 s with the current pid + boot
+ * time + crash counter. On startup, reads the beacon to detect whether
+ * the previous process exited cleanly or crashed. If too many crashes
+ * happen in a short window, refuses to keep restarting and writes an
+ * alert file so the user can investigate.
+ *
+ * Persistence layers this complements:
+ *   - launchd KeepAlive: true → restarts on any exit (good)
+ *   - ThrottleInterval: 5 → minimum 5 s between restarts (good)
+ *   - This watchdog → caps the total restart count so we
+ *     don't burn CPU on a truly broken state
+ *
+ * What this CAN catch:
+ *   - Process crash → exit non-zero → launchd restarts → next boot reads
+ *     beacon, sees a recent exit, increments crash counter
+ *   - Tight crash loop → counter accumulates → hits brake at 10
+ *
+ * What this CANNOT catch (yet):
+ *   - True event-loop deadlocks (process alive but frozen). That requires
+ *     an external watchdog process — tracked as a follow-up.
+ */
+import fs from "fs";
+import { resolve } from "path";
+import os from "os";
+import { execSync } from "child_process";
+import { BOT_VERSION } from "../version.js";
+const DATA_DIR = process.env.ALVIN_DATA_DIR || resolve(os.homedir(), ".alvin-bot");
+const STATE_DIR = resolve(DATA_DIR, "state");
+const BEACON_FILE = resolve(STATE_DIR, "watchdog.json");
+const ALERT_FILE = resolve(STATE_DIR, "crash-loop.alert");
+const BEACON_INTERVAL_MS = 30_000; // write a beacon every 30 s
+const CRASH_WINDOW_MS = 10 * 60 * 1000; // 10 min — crashes within this count toward the brake
+const CRASH_BRAKE_THRESHOLD = 10; // after this many crashes in the window, brake
+const STALE_BEACON_MS = 90_000; // a beacon older than this is considered "old enough that previous process really exited"
+const RECOVERY_UPTIME_MS = 5 * 60 * 1000; // 5 min of clean uptime resets the counter
+let beaconTimer = null;
+let resetTimer = null;
+let bootTime = 0;
+function ensureStateDir() {
+    try {
+        fs.mkdirSync(STATE_DIR, { recursive: true });
+    }
+    catch (err) {
+        console.error("[watchdog] failed to create state dir:", err);
+    }
+}
+function readBeacon() {
+    try {
+        const raw = fs.readFileSync(BEACON_FILE, "utf-8");
+        const parsed = JSON.parse(raw);
+        if (typeof parsed.lastBeat === "number" &&
+            typeof parsed.pid === "number" &&
+            typeof parsed.bootTime === "number" &&
+            typeof parsed.crashCount === "number" &&
+            typeof parsed.crashWindowStart === "number" &&
+            typeof parsed.version === "string") {
+            return parsed;
+        }
+        return null;
+    }
+    catch {
+        return null;
+    }
+}
+function writeBeacon(data) {
+    try {
+        fs.writeFileSync(BEACON_FILE, JSON.stringify(data, null, 0), "utf-8");
+    }
+    catch (err) {
+        console.error("[watchdog] failed to write beacon:", err);
+    }
+}
+function writeAlert(reason, crashCount) {
+    try {
+        const content = [
+            `Alvin Bot crash-loop brake hit at ${new Date().toISOString()}`,
+            `Version: ${BOT_VERSION}`,
+            `Crashes in the last ${CRASH_WINDOW_MS / 60_000} minutes: ${crashCount}`,
+            `Threshold: ${CRASH_BRAKE_THRESHOLD}`,
+            ``,
+            `Reason: ${reason}`,
+            ``,
+            `The bot will refuse to start until this file is removed AND the`,
+            `LaunchAgent is reloaded. Investigate the recent error log:`,
+            `  ${resolve(DATA_DIR, "logs", "alvin-bot.err.log")}`,
+            ``,
+            `Recovery steps once you've fixed the underlying issue:`,
+            `  rm "${ALERT_FILE}"`,
+            `  alvin-bot launchd install   # or just kickstart the service`,
+            ``,
+        ].join("\n");
+        fs.writeFileSync(ALERT_FILE, content, "utf-8");
+    }
+    catch (err) {
+        console.error("[watchdog] failed to write alert:", err);
+    }
+}
+/**
+ * Check whether the watchdog has hit the crash-loop brake. Called once
+ * at startup, BEFORE most of the bot initializes. If the brake is set
+ * (alert file exists), the bot exits cleanly with code 3 — and because
+ * launchd's KeepAlive will keep retrying, we also try to unload our
+ * own LaunchAgent so the retries stop.
+ */
+export function checkCrashLoopBrake() {
+    if (!fs.existsSync(ALERT_FILE))
+        return;
+    console.error("");
+    console.error("==================================================");
+    console.error("⛔ alvin-bot crash-loop brake is engaged");
+    console.error("==================================================");
+    try {
+        const content = fs.readFileSync(ALERT_FILE, "utf-8");
+        console.error(content);
+    }
+    catch { /* ignore */ }
+    // Attempt to unload our own LaunchAgent so launchd stops retrying.
+    // If we don't do this, launchd just KeepAlive's us forever and we
+    // burn CPU writing the same alert.
+    if (process.platform === "darwin") {
+        try {
+            const home = os.homedir();
+            const plistPath = resolve(home, "Library", "LaunchAgents", "com.alvinbot.app.plist");
+            if (fs.existsSync(plistPath)) {
+                execSync(`launchctl unload -w "${plistPath}"`, { stdio: "pipe" });
+                console.error("[watchdog] LaunchAgent unloaded — bot will not auto-restart.");
+            }
+        }
+        catch (err) {
+            console.error("[watchdog] failed to unload LaunchAgent:", err);
+        }
+    }
+    // Exit with a distinct code so logs make the cause obvious
+    process.exit(3);
+}
+/**
+ * Start the watchdog. Called from src/index.ts after all services are
+ * initialized. Reads the previous beacon, increments crash counter if
+ * the previous run exited recently, schedules the periodic beacon
+ * writer, and schedules a recovery-mark reset after RECOVERY_UPTIME_MS
+ * of clean uptime.
+ */
+export function startWatchdog() {
+    ensureStateDir();
+    bootTime = Date.now();
+    const previous = readBeacon();
+    let crashCount = 0;
+    let crashWindowStart = bootTime;
+    if (previous) {
+        const timeSinceLastBeat = bootTime - previous.lastBeat;
+        const inWindow = bootTime - previous.crashWindowStart < CRASH_WINDOW_MS;
+        if (timeSinceLastBeat < STALE_BEACON_MS) {
+            // Previous process exited very recently → that's a crash (or a
+            // graceful exit immediately followed by a restart, which we treat
+            // the same way for the brake — the goal is to detect rapid cycles).
+            if (inWindow) {
+                crashCount = previous.crashCount + 1;
+                crashWindowStart = previous.crashWindowStart;
+            }
+            else {
+                // Previous crash was outside the window → reset counter
+                crashCount = 1;
+            }
+            console.log(`[watchdog] detected restart after ${Math.round(timeSinceLastBeat / 1000)}s — crash ${crashCount}/${CRASH_BRAKE_THRESHOLD} in current ${CRASH_WINDOW_MS / 60_000}min window`);
+            if (crashCount >= CRASH_BRAKE_THRESHOLD) {
+                console.error(`[watchdog] crash-loop brake triggered (${crashCount} crashes in ${CRASH_WINDOW_MS / 60_000}min)`);
+                writeAlert(`Process restarted ${crashCount} times within ${CRASH_WINDOW_MS / 60_000} minutes. Last beacon was ${Math.round(timeSinceLastBeat / 1000)}s ago. Most likely a deterministic crash on startup.`, crashCount);
+                // Re-use the brake check to unload + exit cleanly
+                checkCrashLoopBrake();
+            }
+        }
+        else {
+            // Previous beacon was old → process had clean uptime before exit,
+            // OR system was rebooted between runs. Reset crash count.
+            crashCount = 0;
+            crashWindowStart = bootTime;
+        }
+    }
+    // Write the first beacon immediately so a fresh restart updates the file
+    writeBeacon({
+        lastBeat: bootTime,
+        pid: process.pid,
+        bootTime,
+        crashCount,
+        crashWindowStart,
+        version: BOT_VERSION,
+    });
+    // Periodic beacon writer
+    beaconTimer = setInterval(() => {
+        writeBeacon({
+            lastBeat: Date.now(),
+            pid: process.pid,
+            bootTime,
+            crashCount,
+            crashWindowStart,
+            version: BOT_VERSION,
+        });
+    }, BEACON_INTERVAL_MS);
+    // Schedule a recovery counter reset after RECOVERY_UPTIME_MS of clean
+    // uptime. If we make it that far without dying, the bot is healthy
+    // again and we shouldn't penalize a future single crash.
+    resetTimer = setTimeout(() => {
+        if (crashCount > 0) {
+            console.log(`[watchdog] ${RECOVERY_UPTIME_MS / 60_000}min clean uptime — resetting crash counter from ${crashCount} to 0`);
+            crashCount = 0;
+            crashWindowStart = Date.now();
+            writeBeacon({
+                lastBeat: Date.now(),
+                pid: process.pid,
+                bootTime,
+                crashCount,
+                crashWindowStart,
+                version: BOT_VERSION,
+            });
+        }
+    }, RECOVERY_UPTIME_MS);
+    console.log(`[watchdog] started — beacon every ${BEACON_INTERVAL_MS / 1000}s, brake at ${CRASH_BRAKE_THRESHOLD} crashes per ${CRASH_WINDOW_MS / 60_000}min, recovery after ${RECOVERY_UPTIME_MS / 60_000}min uptime`);
+}
+/**
+ * Stop the watchdog cleanly. Called from the shutdown handler in
+ * index.ts so beacon timers don't keep the process alive after the
+ * grammy bot has stopped.
+ */
+export function stopWatchdog() {
+    if (beaconTimer) {
+        clearInterval(beaconTimer);
+        beaconTimer = null;
+    }
+    if (resetTimer) {
+        clearTimeout(resetTimer);
+        resetTimer = null;
+    }
+}
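The startup branch of the watchdog reduces to a small decision table: fresh beacon within the crash window → increment, fresh beacon outside the window → restart the window at 1, stale beacon or no beacon → reset to 0. A pure-function sketch of just that accounting (`accountCrash` is a hypothetical extraction for illustration; the shipped file inlines this in startWatchdog alongside the logging and brake check):

```javascript
const CRASH_WINDOW_MS = 10 * 60 * 1000; // same thresholds as the shipped file
const STALE_BEACON_MS = 90_000;

// Given the previous beacon (or null) and the current boot time, compute
// the new crash counter and the start of the crash window it counts in.
function accountCrash(previous, bootTime) {
  if (!previous) return { crashCount: 0, crashWindowStart: bootTime };
  const timeSinceLastBeat = bootTime - previous.lastBeat;
  if (timeSinceLastBeat >= STALE_BEACON_MS) {
    // Old beacon → clean uptime before exit (or a reboot): reset.
    return { crashCount: 0, crashWindowStart: bootTime };
  }
  const inWindow = bootTime - previous.crashWindowStart < CRASH_WINDOW_MS;
  return inWindow
    ? { crashCount: previous.crashCount + 1, crashWindowStart: previous.crashWindowStart }
    : { crashCount: 1, crashWindowStart: bootTime };
}

const now = Date.now();
// Crashed 10 s after the last beat, inside the window → 3 becomes 4.
console.log(accountCrash({ lastBeat: now - 10_000, crashCount: 3, crashWindowStart: now - 60_000 }, now).crashCount); // 4
```

Keeping this logic pure makes the brake condition (`crashCount >= CRASH_BRAKE_THRESHOLD`) easy to reason about: it only ever counts rapid restarts inside one sliding 10-minute window.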