alvin-bot 4.8.6 → 4.8.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +114 -0
- package/bin/cli.js +6 -0
- package/dist/config.js +7 -1
- package/dist/handlers/commands.js +77 -5
- package/dist/i18n.js +4 -4
- package/dist/index.js +21 -1
- package/dist/services/cron.js +17 -5
- package/dist/services/subagents.js +37 -4
- package/dist/services/updater.js +30 -1
- package/dist/services/watchdog.js +236 -0
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -2,6 +2,120 @@

All notable changes to Alvin Bot are documented here.

## [4.8.8] — 2026-04-11

### ✨ Unlimited sub-agent & cron timeouts (user-configurable)

Sub-agents and `ai-query` cron jobs used to hard-cap at 5 minutes (`SUBAGENT_TIMEOUT=300000` default), and `shell` cron jobs at 60 s. Long-running research, deep-dive audits, or anything that crossed the threshold got killed mid-stream with `status: "timeout"`. 4.8.8 flips the default to **unlimited** and lets the user override both globally and per job.

**What changed:**

- **Default is now infinite.** `src/config.ts` seeds `subAgentTimeout` from the `SUBAGENT_TIMEOUT` env var or falls back to `-1` (unlimited). The runtime value lives in `~/.alvin-bot/sub-agents.json` as `defaultTimeoutMs` and can be changed at runtime without a restart.
- **New `/subagents timeout` command.** `/subagents timeout` shows the current value; `/subagents timeout 3600` sets 1 h; `/subagents timeout off` (or `-1`, `0`, `unlimited`, `infinite`) disables the cap entirely. The default status output now includes a `⏱ Timeout` line.
- **Per-job override on cron.** `/cron add 1h ai-query "deep audit" --timeout off` gives this one job no timeout. `/cron add 5m shell "pm2 ls" --timeout 30` caps this shell job at 30 s. Omitting `--timeout` inherits the current global default. The same flag exists on `scripts/cron-manage.js add --timeout <sec|off>`.
- **`CronJob.timeoutMs` field.** Optional number in `cron-jobs.json`. Undefined = inherit the global default; any value ≤ 0 = unlimited.
- **Semantics.** `spawnSubAgent` now only arms the `setTimeout(abort)` when `timeout > 0`. At ≤ 0, no abort timer is created, the existing `if (timeoutId) clearTimeout(…)` call sites are null-safe, and the agent runs until it finishes, is cancelled via `/subagents cancel`, or the process dies.
- **Shell cron behaviour preserved.** If a shell job has no `timeoutMs`, `execSync` is called without a `timeout` option, which Node treats as infinite — the freedom the old hard-coded 60 s cap took away.
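The "≤ 0 means no timer" semantics can be sketched in isolation. This is a simplified illustration, not the package's code — `armTimeout` is a hypothetical helper standing in for the timer-arming logic inside `spawnSubAgent`:

```javascript
// Only arm an abort timer for a positive timeout; -1 / 0 = unlimited.
// Call sites then do `if (timeoutId) clearTimeout(timeoutId)`, which is a
// safe no-op when no timer was ever created.
function armTimeout(timeoutMs, onTimeout) {
  if (typeof timeoutMs === "number" && timeoutMs > 0) {
    return setTimeout(onTimeout, timeoutMs);
  }
  return null; // unlimited: the agent runs until done, cancelled, or killed
}
```

With `-1` (the new default) no timer exists at all, so there is nothing to fire and nothing to leak.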

**ENV var still works but is seed-only.** `SUBAGENT_TIMEOUT=600000` at startup still seeds the config on first load, but the persisted value in `sub-agents.json` wins after that.
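That precedence ("persisted value wins, env only seeds") can be distilled into one resolver. `resolveDefaultTimeoutMs` is a hypothetical name — the real logic is split across `config.js` and `subagents.js`:

```javascript
// Persisted sub-agents.json value wins; the env var is consulted only when
// nothing has been persisted yet. Anything non-positive collapses to -1.
function resolveDefaultTimeoutMs(persisted, envSeed) {
  if (typeof persisted === "number" && Number.isFinite(persisted)) {
    return persisted <= 0 ? -1 : Math.floor(persisted);
  }
  const raw = envSeed !== undefined && envSeed !== "" ? Number(envSeed) : NaN;
  return Number.isFinite(raw) && raw > 0 ? Math.floor(raw) : -1;
}
```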

### 🐛 Silenced harmless `message is not modified` Telegram errors

Occasionally Ali would see a red banner at the bottom of an Alvin message:

> Error: Call to 'editMessageText' failed! (400: Bad Request: message is not modified: specified new message content and reply markup are exactly the same as a current content and reply markup of the message)

It never broke anything, but it polluted the logs and showed up as an "internal error" reply to the user. Root cause: Telegram's Bot API refuses `editMessageText` when the new content + reply markup are byte-identical to the existing message. This happens legitimately in callback handlers — e.g. tapping a cron-toggle button twice, re-rendering a sudo/keys/platforms menu, language-switch callbacks that render the same content, or stream flushes where the throttled partial hasn't changed since the last edit.

**Fix**: `bot.catch()` in `src/index.ts` now filters out this specific error early. Two regex patterns (`/message is not modified/i` and `/specified new message content.*exactly the same/i`) cover both variants Telegram sends. Real errors (network, SDK, provider failures) still log and still surface the "internal error" reply to the user — only this one harmless class gets dropped.
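The filter reduces to a small predicate. `isBenignEditError` is a hypothetical helper — in the shipped code the check lives inline in `bot.catch()`:

```javascript
// The two patterns cover both variants of Telegram's 400
// "message is not modified" error text.
const BENIGN_EDIT_PATTERNS = [
  /message is not modified/i,
  /specified new message content.*exactly the same/i,
];

function isBenignEditError(err) {
  const msg = err instanceof Error ? err.message : String(err);
  return BENIGN_EDIT_PATTERNS.some((re) => re.test(msg));
}
```

Anything that doesn't match falls through to the normal error path (log + user-facing reply).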

### 📝 CLAUDE.md: PM2 references updated to launchd

The project `CLAUDE.md` still said *"PM2: `alvin-bot` Prozess, Config in `ecosystem.config.cjs`"* ("PM2: `alvin-bot` process, config in `ecosystem.config.cjs`") — outdated since the 4.8.6 switch to launchd. Updated to reflect the actual process manager (`~/Library/LaunchAgents/com.alvinbot.app.plist`, `KeepAlive=true`, `RunAtLoad=true`), the log paths, and a note that `watchdog.ts` only brakes process crash-loops — it does **not** kill long-running sessions or sub-agents. `ecosystem.config.cjs` is now labelled legacy.

The global `~/.claude/CLAUDE.md` was also corrected: `alvin-bot` was removed from the VPS PM2-process list (it runs locally, not on the VPS) and the cron-hub note now correctly says "als **launchd LaunchAgent**" ("as a **launchd LaunchAgent**").

## [4.8.7] — 2026-04-11

### 🐛 `/update` now detects a stale runtime (rebuild without restart)

Caught immediately after publishing 4.8.6 on the Mac mini: `/update` reported "Already up to date — no new commits" even though the running process was on **v4.8.5** while the disk was already built at **v4.8.6**. The user could see the version mismatch in `/status` (v4.8.5), but `/update` refused to acknowledge it.

**Root cause**: The updater only compared **git commits** (or the **npm registry version**) against the local install. It never checked whether the **running process's in-memory version** was older than the **on-disk built version**. This is the dev/CI loop scenario:

1. You edit `src/`, bump `package.json`, commit + push
2. `npm run build` regenerates `dist/` at the new version
3. The running process has the OLD code in memory
4. You run `/update` in Telegram
5. git: HEAD == origin/main (just pushed) → 0 commits behind → "up to date"
6. The process never restarts → keeps running the OLD code

**Fix**: New `isRuntimeStale()` check at the very start of `runUpdate()`. It compares `BOT_VERSION` (captured in memory at process start) against `package.json.version` from disk via the existing semver compare. If the disk is newer, it returns success with `requiresRestart=true` immediately — skipping the git/npm fetch entirely and just signalling a restart so the fresh code takes effect.
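The check itself is tiny. A minimal sketch — `compareSemver` here is a simplified stand-in for the updater's existing semver compare, and the version arguments stand in for `BOT_VERSION` and the freshly read `package.json.version`:

```javascript
// Compare two "major.minor.patch" strings: negative / zero / positive
// like a classic comparator. Pre-release tags are ignored in this sketch.
function compareSemver(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if ((pa[i] || 0) !== (pb[i] || 0)) return (pa[i] || 0) - (pb[i] || 0);
  }
  return 0;
}

// Disk newer than the in-memory version => rebuilt without a restart.
function isRuntimeStale(runtimeVersion, diskVersion) {
  return compareSemver(diskVersion, runtimeVersion) > 0;
}
```

This is exactly the v4.8.5-in-memory / v4.8.6-on-disk case from the bug report.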

After 4.8.7, running `/update` after a manual rebuild will correctly say *"Disk is already built at vX, running vY. Restarting to pick up the new code..."* and trigger the restart.

### ✨ Internal watchdog with crash-loop brake (`src/services/watchdog.ts`)

Ali asked for "derbe persistent" (German slang: seriously persistent) — we were already 95% there with `KeepAlive: true` from 4.8.6, but the missing piece was a brake to stop the bot from restart-looping forever when a deterministic crash happens (corrupt state file, missing dependency, broken upgrade).

**New module**: `src/services/watchdog.ts`. Four pieces:

**1. Liveness beacon.** Every 30 s the bot writes `~/.alvin-bot/state/watchdog.json` with `{lastBeat, pid, bootTime, crashCount, crashWindowStart, version}` — a fast disk write that doesn't block the event loop.

**2. Crash-loop brake.** On every fresh boot, the watchdog reads the previous beacon:

- If the previous beacon is **less than 90 s old** → the previous process exited very recently → that's a crash (or a deliberate restart, treated the same way for the brake's purpose). Increment `crashCount`.
- If the previous beacon is **older than 90 s** → the previous process had clean uptime → reset the counter to 0.
- The crash window is **10 minutes**. Crashes within this window accumulate; older ones don't count.
- If `crashCount` reaches **10**, the brake engages:
  - Writes `~/.alvin-bot/state/crash-loop.alert` with the timestamp, version, error log path, and recovery steps
  - Tries to `launchctl unload -w` its own LaunchAgent so launchd stops retrying (otherwise `KeepAlive: true` would keep burning CPU forever)
  - Exits with code 3

**3. Recovery.** After **5 minutes of clean uptime**, the watchdog auto-resets the crash counter to 0, so a healthy bot that occasionally has a transient hiccup doesn't slowly accumulate toward the brake over days.
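The boot-time decision above can be condensed into one pure function. This is a hypothetical distillation (`nextCrashState` is not the package's API) using the thresholds from the changelog — 90 s beacon freshness, a 10-minute window, brake at 10 crashes:

```javascript
const FRESH_MS = 90_000;          // beacon younger than this => prior crash
const WINDOW_MS = 10 * 60_000;    // crash-accumulation window
const MAX_CRASHES = 10;           // brake threshold

// prev: previous beacon ({lastBeat, crashCount, crashWindowStart}) or null.
// Returns the counter state for this boot plus the brake decision.
function nextCrashState(prev, now) {
  // No beacon, or a stale one => the previous process had clean uptime.
  if (!prev || now - prev.lastBeat >= FRESH_MS) {
    return { crashCount: 0, crashWindowStart: now, engageBrake: false };
  }
  // Fresh beacon => crash. Restart the window if the old one expired.
  const expired = now - prev.crashWindowStart > WINDOW_MS;
  const crashCount = (expired ? 0 : prev.crashCount) + 1;
  return {
    crashCount,
    crashWindowStart: expired ? now : prev.crashWindowStart,
    engageBrake: crashCount >= MAX_CRASHES,
  };
}
```

Keeping the decision pure like this makes the 90 s / 10 min / 10-crash boundaries trivial to unit-test.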

**4. Brake check at startup.** `checkCrashLoopBrake()` runs in `index.ts` **before** any expensive init — if the alert file already exists, the bot exits cleanly with code 3 and tries to unload itself again. This prevents launchd from spinning the bot up just to write the same alert over and over.

**Recovery from a tripped brake**:

```bash
# 1. Investigate the error log
cat ~/.alvin-bot/logs/alvin-bot.err.log

# 2. Fix whatever was wrong

# 3. Remove the alert file
rm ~/.alvin-bot/state/crash-loop.alert

# 4. Reload the LaunchAgent
alvin-bot launchd install
```

**What this catches**:

- Process crashes (segfault, OOM kill) → exit non-zero → brake increments
- `process.exit()` from an unhandled rejection → similar
- Tight crash loops → brake engages at 10 within 10 min
- Corrupted state files that crash on read → brake engages eventually

**What this does NOT catch (yet)**:

- Event-loop deadlocks where the process is alive but completely frozen. The beacon needs a live event loop, so the watchdog can't detect a freeze. A future release will add an external sister LaunchAgent (`com.alvinbot.watchdog`) that runs every 2 minutes via `StartInterval` and kills the main bot if its beacon file is too stale. Tracked as a follow-up.

**Telemetry surface**: `alvin-bot status` could read the beacon file in a future release to show "crash count: X in last Y minutes" — for now, the alert file is the main user-facing signal.

### 🛡 LaunchAgent: ProcessType + LimitLoadToSessionType

Two small plist hardening tweaks:

- **`ProcessType: Background`** — an explicit hint to launchd that this is a long-running background service. macOS gives Background processes friendlier scheduling and is less likely to kill them under memory pressure (vs `Standard`, the default for unlabeled jobs).
- **`LimitLoadToSessionType: Aqua`** — only loads in user GUI sessions. This prevents the LaunchAgent from accidentally loading in non-GUI contexts (e.g. an SSH login session) where it would not have Keychain access. Defensive: it matches our existing assumption that the bot needs the GUI keychain unlocked for Claude SDK OAuth.

These don't change behaviour for normal use, but they make our intent explicit: macOS treats the bot as a proper background service rather than a generic foreground job.

### Tests

87 still passing — no test changes (the stale-runtime check is a fast-path branch that doesn't disturb the existing git/npm logic).

## [4.8.6] — 2026-04-11

### 🐛 LaunchAgent: `/restart` left the bot down forever

package/bin/cli.js
CHANGED

@@ -1466,6 +1466,12 @@ function renderLaunchdPlist({ label, nodePath, entryPoint, cwd, home, logDir })
     <key>ThrottleInterval</key>
     <integer>5</integer>

+    <key>ProcessType</key>
+    <string>Background</string>
+
+    <key>LimitLoadToSessionType</key>
+    <string>Aqua</string>
+
     <key>StandardOutPath</key>
     <string>${logDir}/alvin-bot.out.log</string>

package/dist/config.js
CHANGED

@@ -45,7 +45,13 @@ export const config = {
     compactionThreshold: Number(process.env.COMPACTION_THRESHOLD) || 80000,
     // Sub-Agents
     maxSubAgents: Number(process.env.MAX_SUBAGENTS) || 4,
-
+    // Default sub-agent timeout. -1 / 0 = unlimited (no hard cut-off).
+    // The runtime value lives in sub-agents.json and can be changed at runtime
+    // via /subagents timeout; this constant only seeds the initial config on
+    // first launch when SUBAGENT_TIMEOUT is not set.
+    subAgentTimeout: process.env.SUBAGENT_TIMEOUT !== undefined && process.env.SUBAGENT_TIMEOUT !== ""
+        ? Number(process.env.SUBAGENT_TIMEOUT)
+        : -1,
     // TTS Provider
     ttsProvider: (process.env.TTS_PROVIDER || "edge"),
     elevenlabs: {

package/dist/handlers/commands.js
CHANGED

@@ -1277,9 +1277,29 @@ export function registerCommands(bot) {
         `Commands: /cron add · delete · toggle · run · info`, { parse_mode: "HTML", reply_markup: keyboard });
         return;
     }
-    // /cron add <schedule> <type> <payload>
+    // /cron add <schedule> <type> <payload> [--timeout <sec|off>]
     if (arg.startsWith("add ")) {
-
+        let rest = arg.slice(4).trim();
+        // Extract optional --timeout flag from anywhere in the command.
+        // Accepts seconds, "off", "unlimited", "-1", or "0" — anything ≤ 0
+        // or non-numeric collapses to -1 (unlimited).
+        let timeoutMs;
+        const timeoutMatch = rest.match(/(^|\s)--timeout\s+(\S+)/);
+        if (timeoutMatch) {
+            const val = timeoutMatch[2].toLowerCase();
+            if (["off", "unlimited", "infinite", "-1", "0"].includes(val)) {
+                timeoutMs = -1;
+            }
+            else {
+                const secs = Number(timeoutMatch[2]);
+                if (!Number.isFinite(secs) || secs < 0) {
+                    await ctx.reply(`❌ Invalid <code>--timeout</code> value: ${timeoutMatch[2]}`, { parse_mode: "HTML" });
+                    return;
+                }
+                timeoutMs = Math.floor(secs * 1000);
+            }
+            rest = rest.replace(/(^|\s)--timeout\s+\S+/, "").trim();
+        }
     // Natural language schedule shortcuts (German + English)
     const naturalSchedules = {
         "täglich": "0 8 * * *", "daily": "0 8 * * *",
@@ -1342,7 +1362,7 @@ export function registerCommands(bot) {
     else {
         const sp = rest.indexOf(" ");
         if (sp < 0) {
-            await ctx.reply("Format: <code>/cron add <schedule> <type> <payload>
+            await ctx.reply("Format: <code>/cron add <schedule> <type> <payload> [--timeout <sec|off>]</code>\n\nSchedule options:\n• <b>Intervals:</b> 5m, 1h, 30s, 2d\n• <b>Natural:</b> daily, weekly, monthly, weekdays, hourly\n• <b>With time:</b> 8:30 daily, weekdays 9:00\n• <b>German:</b> täglich, wöchentlich, morgens, abends\n• <b>Cron:</b> \"0 9 * * 1-5\"\n\nOptional <code>--timeout</code> in seconds, or <code>off</code>/<code>-1</code> for unlimited.", { parse_mode: "HTML" });
             return;
         }
         schedule = rest.slice(0, sp);
@@ -1381,12 +1401,19 @@ export function registerCommands(bot) {
             payload,
             target: { platform: "telegram", chatId: String(chatId) },
             createdBy: `telegram:${userId}`,
+            ...(timeoutMs !== undefined ? { timeoutMs } : {}),
         });
         const readableSched = humanReadableSchedule(job.schedule);
+        const timeoutLine = typeof job.timeoutMs === "number"
+            ? job.timeoutMs <= 0
+                ? `<b>Timeout:</b> ∞ (unlimited)\n`
+                : `<b>Timeout:</b> ${Math.round(job.timeoutMs / 1000)}s\n`
+            : "";
         await ctx.reply(`✅ <b>Cron Job created</b>\n\n` +
             `<b>Name:</b> ${job.name}\n` +
             `📅 <b>${readableSched}</b>\n` +
             `<b>Type:</b> ${job.type}\n` +
+            timeoutLine +
             `<b>Next run:</b> ${formatNextRun(job.nextRunAt)}\n` +
             `<b>ID:</b> <code>${job.id}</code>`, { parse_mode: "HTML" });
         return;
@@ -1734,7 +1761,7 @@ export function registerCommands(bot) {
     // type both "/sub-agents" and "/subagents" — Telegram routes both to this.
     bot.command(["sub_agents", "subagents"], async (ctx) => {
         const lang = getSession(ctx.from.id).language;
-        const { listSubAgents, cancelSubAgent, getSubAgentResult, getMaxParallelAgents, getConfiguredMaxParallel, setMaxParallelAgents, findSubAgentByName, getVisibility, setVisibility, getQueueCap, setQueueCap, } = await import("../services/subagents.js");
+        const { listSubAgents, cancelSubAgent, getSubAgentResult, getMaxParallelAgents, getConfiguredMaxParallel, setMaxParallelAgents, findSubAgentByName, getVisibility, setVisibility, getQueueCap, setQueueCap, getDefaultTimeoutMs, setDefaultTimeoutMs, } = await import("../services/subagents.js");
         const arg = (ctx.match || "").trim();
         const tokens = arg.split(/\s+/).filter(Boolean);
         const sub = tokens[0]?.toLowerCase() || "";
@@ -1792,6 +1819,47 @@ export function registerCommands(bot) {
         await ctx.reply(lines.join("\n"), { parse_mode: "Markdown" });
         return;
     }
+    // /subagents timeout [sec|off|unlimited|-1] — set default sub-agent timeout
+    if (sub === "timeout") {
+        const val = tokens[1];
+        const formatTimeout = (ms) => {
+            if (ms <= 0)
+                return "∞ (unlimited)";
+            if (ms < 1000)
+                return `${ms}ms`;
+            const sec = ms / 1000;
+            if (sec < 60)
+                return `${sec}s`;
+            const min = sec / 60;
+            if (min < 60)
+                return `${min.toFixed(min < 10 ? 1 : 0)}min`;
+            return `${(min / 60).toFixed(1)}h`;
+        };
+        if (!val) {
+            const current = getDefaultTimeoutMs();
+            await ctx.reply(`⏱ Default sub-agent timeout: *${formatTimeout(current)}*\n\n` +
+                `Usage: \`/subagents timeout <sec>\` · \`/subagents timeout off\`\n` +
+                `\`off\`, \`unlimited\`, \`-1\` oder \`0\` = kein Timeout. ` +
+                `Gilt für neue Subagents und ai-query Cron-Jobs ohne eigenen Wert.`, { parse_mode: "Markdown" });
+            return;
+        }
+        const lower = val.toLowerCase();
+        let ms;
+        if (["off", "unlimited", "infinite", "-1", "0"].includes(lower)) {
+            ms = -1;
+        }
+        else {
+            const secs = Number(val);
+            if (!Number.isFinite(secs) || secs < 0) {
+                await ctx.reply(`❌ Ungültiger Wert \`${val}\`. Nutze Sekunden (z.B. \`300\`) oder \`off\`.`, { parse_mode: "Markdown" });
+                return;
+            }
+            ms = Math.floor(secs * 1000);
+        }
+        const effective = setDefaultTimeoutMs(ms);
+        await ctx.reply(`✅ Default sub-agent timeout: *${formatTimeout(effective)}*`, { parse_mode: "Markdown" });
+        return;
+    }
     // /subagents queue <n> — set bounded-queue cap (0 disables queue)
     if (sub === "queue") {
         const n = parseInt(tokens[1] || "", 10);
@@ -1921,6 +1989,10 @@ export function registerCommands(bot) {
         ? `${t("bot.subagents.maxLabel", lang)} 0 ${t("bot.subagents.autoSuffix", lang, { n: effective })}`
         : `${t("bot.subagents.maxLabel", lang)} ${configured}`;
     const visibilityLabel = `${t("bot.subagents.visibilityLabel", lang)} *${getVisibility()}*`;
+    const currentTimeout = getDefaultTimeoutMs();
+    const timeoutLabel = currentTimeout <= 0
+        ? `⏱ Timeout: *∞ (unlimited)*`
+        : `⏱ Timeout: *${Math.round(currentTimeout / 1000)}s*`;
     const agents = listSubAgents();
     let body = "";
     if (agents.length === 0) {
@@ -1931,7 +2003,7 @@ export function registerCommands(bot) {
     }
     const header = t("bot.subagents.header", lang);
     const usage = `\n\n${t("bot.subagents.usage", lang)}`;
-    const full = `${header}\n${maxLabel}\n${visibilityLabel}${body}${usage}`;
+    const full = `${header}\n${maxLabel}\n${visibilityLabel}\n${timeoutLabel}${body}${usage}`;
     await ctx.reply(full, { parse_mode: "Markdown" }).catch(() => ctx.reply(full));
 });
}
package/dist/i18n.js
CHANGED

@@ -519,10 +519,10 @@ const strings = {
         fr: "Durée : {sec}s · Tokens : {in}/{out}",
     },
     "bot.subagents.usage": {
-        en: "Commands:\n/subagents — show status\n/subagents max <n> — set parallel limit (0=auto)\n/subagents visibility <auto|banner|silent|live> — delivery mode\n/subagents queue <n> — bounded-queue cap (0 = disabled)\n/subagents stats — last 24h run stats\n/subagents list — list all\n/subagents cancel <name|id> — cancel one\n/subagents result <name|id> — show result",
-        de: "Befehle:\n/subagents — Status anzeigen\n/subagents max <n> — Parallel-Limit setzen (0=auto)\n/subagents visibility <auto|banner|silent|live> — Delivery-Modus\n/subagents list — alle anzeigen\n/subagents cancel <name|id> — abbrechen\n/subagents result <name|id> — Ergebnis anzeigen",
-        es: "Comandos:\n/subagents — ver estado\n/subagents max <n> — establecer límite (0=auto)\n/subagents visibility <auto|banner|silent|live> — modo de entrega\n/subagents list — listar todos\n/subagents cancel <nombre|id> — cancelar uno\n/subagents result <nombre|id> — ver resultado",
-        fr: "Commandes :\n/subagents — état\n/subagents max <n> — limite parallèle (0=auto)\n/subagents visibility <auto|banner|silent|live> — mode de livraison\n/subagents list — lister tous\n/subagents cancel <nom|id> — annuler un\n/subagents result <nom|id> — voir résultat",
+        en: "Commands:\n/subagents — show status\n/subagents max <n> — set parallel limit (0=auto)\n/subagents timeout <sec|off> — default timeout (off = unlimited)\n/subagents visibility <auto|banner|silent|live> — delivery mode\n/subagents queue <n> — bounded-queue cap (0 = disabled)\n/subagents stats — last 24h run stats\n/subagents list — list all\n/subagents cancel <name|id> — cancel one\n/subagents result <name|id> — show result",
+        de: "Befehle:\n/subagents — Status anzeigen\n/subagents max <n> — Parallel-Limit setzen (0=auto)\n/subagents timeout <sec|off> — Default-Timeout (off = unendlich)\n/subagents visibility <auto|banner|silent|live> — Delivery-Modus\n/subagents queue <n> — Queue-Cap (0 = deaktiviert)\n/subagents list — alle anzeigen\n/subagents cancel <name|id> — abbrechen\n/subagents result <name|id> — Ergebnis anzeigen",
+        es: "Comandos:\n/subagents — ver estado\n/subagents max <n> — establecer límite (0=auto)\n/subagents timeout <seg|off> — timeout por defecto (off = sin límite)\n/subagents visibility <auto|banner|silent|live> — modo de entrega\n/subagents list — listar todos\n/subagents cancel <nombre|id> — cancelar uno\n/subagents result <nombre|id> — ver resultado",
+        fr: "Commandes :\n/subagents — état\n/subagents max <n> — limite parallèle (0=auto)\n/subagents timeout <sec|off> — délai par défaut (off = illimité)\n/subagents visibility <auto|banner|silent|live> — mode de livraison\n/subagents list — lister tous\n/subagents cancel <nom|id> — annuler un\n/subagents result <nom|id> — voir résultat",
     },
     "bot.subagents.visibilityLabel": {
         en: "Visibility:",
package/dist/index.js
CHANGED

@@ -14,6 +14,11 @@ if (hasLegacyData()) {
 }
 // 3. Seed defaults for any files that don't exist yet (fresh install)
 seedDefaults();
+// 4. Crash-loop brake check — if we've crashed N times in a short window,
+//    refuse to start, write an alert file, and unload our LaunchAgent so
+//    launchd stops retrying. Runs BEFORE any expensive init so a broken
+//    state file doesn't tank the whole CPU.
+checkCrashLoopBrake();
 // ── Normal imports (safe now — DATA_DIR is ready) ──────────────────
 import { Bot, InlineKeyboard } from "grammy";
 import { config } from "./config.js";
@@ -76,6 +81,7 @@ import { loadSkills } from "./services/skills.js";
 import { loadHooks } from "./services/hooks.js";
 import { registerShutdownHandler } from "./services/restart.js";
 import { cancelAllSubAgents } from "./services/subagents.js";
+import { startWatchdog, stopWatchdog, checkCrashLoopBrake } from "./services/watchdog.js";
 import { getRegistry } from "./engine.js";
 import { scanAssets } from "./services/asset-index.js";
 // Scan asset directory and generate INDEX.json + INDEX.md
@@ -210,10 +216,20 @@ if (hasTelegram) {
     bot.on("message:photo", handlePhoto);
     bot.on("message:document", handleDocument);
     bot.on("message:text", handleMessage);
-    // Error handling — log but don't crash
+    // Error handling — log but don't crash.
     bot.catch((err) => {
         const ctx = err.ctx;
         const e = err.error;
+        // Telegram's "message is not modified" (400) is harmless — it fires
+        // when a callback handler re-renders an inline keyboard / edited
+        // message with content that happens to match the current message
+        // exactly (e.g. double-tapped toggle button, identical list after
+        // re-render). Swallow it silently so it neither pollutes the logs
+        // nor bubbles up to the user as "internal error".
+        const msg = e instanceof Error ? e.message : String(e);
+        if (/message is not modified/i.test(msg) || /specified new message content.*exactly the same/i.test(msg)) {
+            return;
+        }
         console.error(`Error handling update ${ctx?.update?.update_id}:`, e);
         // Try to notify the user
         if (ctx?.chat?.id) {
@@ -235,6 +251,7 @@ const shutdown = async () => {
     // agents can post a cancellation message to Telegram before the bot
     // stops. Capped at 5s internally so a hang can't block shutdown.
     await cancelAllSubAgents(true);
+    stopWatchdog();
     stopScheduler();
     stopSessionCleanup();
     if (queueInterval)
@@ -472,6 +489,8 @@ if (bot) {
     console.log(`   Users: ${config.allowedUsers.length} authorized`);
     // Start heartbeat monitor
     startHeartbeat();
+    // Start internal watchdog (crash-loop brake + liveness beacon)
+    startWatchdog();
     // Index memory vectors in background (non-blocking)
     initEmbeddings().catch(() => { });
 },
@@ -483,5 +502,6 @@ else {
     console.log(`   WebUI: http://localhost:${process.env.WEB_PORT || 3100}`);
     // Start heartbeat monitor even without Telegram
     startHeartbeat();
+    startWatchdog();
     initEmbeddings().catch(() => { });
 }
package/dist/services/cron.js
CHANGED

@@ -122,11 +122,16 @@ async function executeJob(job) {
         }
         case "shell": {
             const cmd = job.payload.command || "echo 'no command'";
-
-
+            // Per-job timeout, default = no timeout (execSync treats timeout=0
+            // or "undefined" as infinite). Users opt in via /cron add … --timeout N.
+            const shellOpts = {
                 stdio: "pipe",
                 env: { ...process.env, PATH: process.env.PATH + ":/opt/homebrew/bin:/usr/local/bin" },
-            }
+            };
+            if (typeof job.timeoutMs === "number" && job.timeoutMs > 0) {
+                shellOpts.timeout = job.timeoutMs;
+            }
+            const output = execSync(cmd, shellOpts).toString().trim();
             // Notify with output
             if (notifyCallback && output) {
                 await notifyCallback(job.target, `🔧 ${job.name}\n\`\`\`\n${output.slice(0, 3000)}\n\`\`\``);
@@ -173,14 +178,20 @@ async function executeJob(job) {
             ? Number(job.target.chatId)
             : undefined;
         const result = await new Promise((resolve, reject) => {
-
+            // Only pass `timeout` through when the job has a per-job value.
+            // Otherwise the sub-agent inherits the current /subagents default.
+            const spawnConfig = {
                 name: job.name,
                 prompt,
                 workingDir: BOT_ROOT,
                 source: "cron",
                 parentChatId,
                 onComplete: (r) => resolve(r),
-            }
+            };
+            if (typeof job.timeoutMs === "number") {
+                spawnConfig.timeout = job.timeoutMs;
+            }
+            spawnSubAgent(spawnConfig).catch(reject);
         });
         // Non-success: don't notify here. The I3 delivery router has
         // already posted the appropriate banner (cancelled / timeout /
@@ -309,6 +320,7 @@ export function createJob(input) {
         nextRunAt: null,
         runCount: 0,
         createdBy: input.createdBy || "unknown",
+        ...(typeof input.timeoutMs === "number" ? { timeoutMs: input.timeoutMs } : {}),
     };
     // Calculate first run
     job.nextRunAt = calculateNextRun(job);

package/dist/services/subagents.js
CHANGED

@@ -21,6 +21,14 @@ let configCache = null;
 function isValidVisibility(v) {
     return v === "auto" || v === "banner" || v === "silent" || v === "live";
 }
+/** Resolve the initial default timeout from config.ts, which itself seeds
+ * from the SUBAGENT_TIMEOUT env var. -1 = unlimited. */
+function seedDefaultTimeout() {
+    const raw = config.subAgentTimeout;
+    if (typeof raw !== "number" || !Number.isFinite(raw) || raw <= 0)
+        return -1;
+    return Math.floor(raw);
+}
 function loadSubAgentsConfig() {
     if (configCache)
         return configCache;
@@ -33,14 +41,18 @@ function loadSubAgentsConfig() {
             queueCap: typeof parsed.queueCap === "number"
                 ? Math.max(0, Math.min(Math.floor(parsed.queueCap), ABSOLUTE_MAX_QUEUE))
                 : DEFAULT_QUEUE_CAP,
+            defaultTimeoutMs: typeof parsed.defaultTimeoutMs === "number" && Number.isFinite(parsed.defaultTimeoutMs)
+                ? (parsed.defaultTimeoutMs <= 0 ? -1 : Math.floor(parsed.defaultTimeoutMs))
+                : seedDefaultTimeout(),
         };
     }
     catch {
-        // File missing or invalid — seed from env
+        // File missing or invalid — seed from env vars then default to auto/unlimited
         configCache = {
             maxParallel: Number(process.env.MAX_SUBAGENTS) || 0,
             visibility: "auto",
             queueCap: DEFAULT_QUEUE_CAP,
+            defaultTimeoutMs: seedDefaultTimeout(),
         };
     }
     return configCache;
@@ -102,6 +114,18 @@ export function setQueueCap(n) {
     saveSubAgentsConfig({ ...cfg, queueCap: clamped });
     return clamped;
 }
+/** Current default timeout in ms. -1 = unlimited. */
+export function getDefaultTimeoutMs() {
+    return loadSubAgentsConfig().defaultTimeoutMs;
+}
+/** Set the default timeout in ms. Any value ≤ 0 or non-finite collapses
+ * to -1 (unlimited). Returns the persisted value. */
+export function setDefaultTimeoutMs(ms) {
+    const normalized = !Number.isFinite(ms) || ms <= 0 ? -1 : Math.floor(ms);
+    const cfg = loadSubAgentsConfig();
+    saveSubAgentsConfig({ ...cfg, defaultTimeoutMs: normalized });
+    return normalized;
+}
 // ── State ───────────────────────────────────────────────
 const activeAgents = new Map();
 // ── Name resolver (B2) ──────────────────────────────────
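The loader and setter hunks above share one normalization rule: any non-finite or ≤ 0 value collapses to `-1` ("unlimited"), everything else is floored to whole milliseconds. A standalone restatement of that rule (not the shipped module, which also persists the result):

```javascript
// Normalize a user-supplied timeout: -1 means "no timeout".
function normalizeTimeoutMs(ms) {
  return !Number.isFinite(ms) || ms <= 0 ? -1 : Math.floor(ms);
}

console.log(normalizeTimeoutMs(3600 * 1000)); // 3600000  (1 h)
console.log(normalizeTimeoutMs(0));           // -1       (unlimited)
console.log(normalizeTimeoutMs(NaN));         // -1       (garbage → unlimited)
console.log(normalizeTimeoutMs(1500.9));      // 1500     (floored to whole ms)
```

Collapsing `0`, negatives, `NaN`, and `Infinity` all into `-1` means every downstream consumer only has to test `timeout > 0` to decide whether to arm a timer.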
@@ -433,14 +457,23 @@ export function spawnSubAgent(agentConfig) {
     const resolved = resolveAgentName(agentConfig.name);
     const resolvedName = resolved.name;
     const id = crypto.randomUUID();
-
+    // Timeout resolution order:
+    // 1. Per-spawn override (agentConfig.timeout) — used by cron jobs that
+    //    carry their own timeoutMs.
+    // 2. Runtime default from sub-agents.json (set via /subagents timeout).
+    // 3. config.subAgentTimeout fallback (seeded from SUBAGENT_TIMEOUT env).
+    // Any value ≤ 0 means "no timeout" — we simply don't arm the abort timer.
+    // The existing null-safe `clearTimeout(timeoutId)` call sites make this
+    // a safe no-op when the agent finishes or is cancelled.
+    const timeout = agentConfig.timeout ?? getDefaultTimeoutMs();
     const abort = new AbortController();
-    const timeoutId = setTimeout(() => abort.abort(), timeout);
+    const timeoutId = timeout > 0 ? setTimeout(() => abort.abort(), timeout) : null;
     const willRunImmediately = running < maxParallel;
     const canQueue = !willRunImmediately && queueCap > 0 && queuedLen < queueCap;
     if (!willRunImmediately && !canQueue) {
         // No slot, no queue room → priority-aware reject
-
+        if (timeoutId)
+            clearTimeout(timeoutId);
         const source = sourceOf(agentConfig);
         const runningAgents = [...activeAgents.values()].filter((a) => a.info.status === "running");
         const userSlots = runningAgents.filter((a) => a.info.source === "user").length;
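The core of the hunk above is "don't arm the timer when the timeout is unlimited": with `timeout ≤ 0` no `setTimeout` is ever created, so nothing can fire, and the `clearTimeout` call sites just check for `null`. A simplified sketch (`armAbort` is a hypothetical helper; the real code inlines this in spawnSubAgent after resolving the per-spawn override):

```javascript
// Arm an AbortController deadline only when a positive timeout is given.
function armAbort(timeoutMs) {
  const abort = new AbortController();
  const timeoutId = timeoutMs > 0 ? setTimeout(() => abort.abort(), timeoutMs) : null;
  // Null-safe disarm: a no-op for unlimited agents or after completion.
  const cancel = () => { if (timeoutId) clearTimeout(timeoutId); };
  return { signal: abort.signal, cancel };
}

const limited = armAbort(50);
limited.cancel(); // agent finished before the deadline — timer disarmed
const unlimited = armAbort(-1); // -1 / 0 → no timer is ever armed
console.log(limited.signal.aborted, unlimited.signal.aborted); // false false
```

Skipping the timer entirely (rather than passing a huge delay) also avoids Node's 32-bit `setTimeout` cap, where delays over ~24.8 days fire immediately.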
package/dist/services/updater.js
CHANGED
@@ -19,6 +19,7 @@ import { resolve, dirname } from "path";
 import { fileURLToPath } from "url";
 import fs from "fs";
 import os from "os";
+import { BOT_VERSION } from "../version.js";
 const execAsync = promisify(exec);
 const PROJECT_ROOT = resolve(dirname(fileURLToPath(import.meta.url)), "../..");
 const DATA_DIR = process.env.ALVIN_DATA_DIR || resolve(os.homedir(), ".alvin-bot");
@@ -84,12 +85,40 @@ function compareSemver(a, b) {
     }
     return 0;
 }
+/**
+ * Is the running bot's in-memory version older than what's already built
+ * on disk? This happens when the dev/CI rebuilt the bot mid-session and
+ * the process hasn't restarted yet. A manual /update without a git/npm
+ * fetch should still trigger a restart in this case so the fresh code
+ * takes effect.
+ */
+function isRuntimeStale() {
+    const onDisk = readLocalVersion();
+    if (!onDisk || !BOT_VERSION || BOT_VERSION === "unknown")
+        return false;
+    return compareSemver(BOT_VERSION, onDisk) < 0;
+}
 /** Pull latest changes, install deps, rebuild. Returns a structured result
  * instead of throwing so the /update command can report cleanly to Telegram.
  * Dispatches to the git path for source installs and the npm path for
- * npm-global installs. */
+ * npm-global installs.
+ *
+ * Before doing any fetch, checks whether the disk is already newer than
+ * the running process (i.e. someone rebuilt between the process start
+ * and this call). If so, returns success with requiresRestart=true so
+ * the command handler can trigger a graceful restart.
+ */
 export async function runUpdate() {
     try {
+        // Stale-runtime check: disk is already newer than the running code.
+        if (isRuntimeStale()) {
+            const onDisk = readLocalVersion();
+            return {
+                ok: true,
+                message: `Disk is already built at v${onDisk}, running v${BOT_VERSION}. Restarting to pick up the new code...`,
+                requiresRestart: true,
+            };
+        }
        if (isOwnGitRepo()) {
            return await runGitUpdate();
        }
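The stale-runtime check reduces to one semver comparison: if the version on disk sorts after the version the process booted with, restart without fetching. A self-contained sketch under stated assumptions — this `compareSemver` is a hypothetical minimal reimplementation (numeric dotted components only, no pre-release tags), and this `isRuntimeStale` takes its versions as parameters instead of reading them from disk:

```javascript
// Compare two "x.y.z" versions: -1 if a < b, 1 if a > b, 0 if equal.
function compareSemver(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if ((pa[i] || 0) < (pb[i] || 0)) return -1;
    if ((pa[i] || 0) > (pb[i] || 0)) return 1;
  }
  return 0;
}

// Stale = a newer build already sits on disk; bail out on unknowns so a
// broken version file can never trigger a restart loop.
function isRuntimeStale(running, onDisk) {
  if (!onDisk || !running || running === "unknown") return false;
  return compareSemver(running, onDisk) < 0;
}

console.log(isRuntimeStale("4.8.6", "4.8.8")); // true  — restart picks up the new build
console.log(isRuntimeStale("4.8.8", "4.8.8")); // false — nothing to do
console.log(isRuntimeStale("unknown", "4.8.8")); // false — can't tell, don't loop
```

The "unknown" guard matters: returning `requiresRestart: true` on an unreadable version would restart the bot on every /update forever.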
package/dist/services/watchdog.js
ADDED
@@ -0,0 +1,236 @@
+/**
+ * Internal Watchdog — Self-monitoring for crash-loop detection.
+ *
+ * Writes a liveness beacon file every 30 s with the current pid + boot
+ * time + crash counter. On startup, reads the beacon to detect whether
+ * the previous process exited cleanly or crashed. If too many crashes
+ * happen in a short window, refuses to keep restarting and writes an
+ * alert file so the user can investigate.
+ *
+ * Persistence layers this complements:
+ *   - launchd KeepAlive: true → restarts on any exit (good)
+ *   - ThrottleInterval: 5 → minimum 5 s between restarts (good)
+ *   - This watchdog → caps the total restart count so we
+ *     don't burn CPU on a truly broken state
+ *
+ * What this CAN catch:
+ *   - Process crash → exit non-zero → launchd restarts → next boot reads
+ *     beacon, sees a recent exit, increments crash counter
+ *   - Tight crash loop → counter accumulates → hits brake at 10
+ *
+ * What this CANNOT catch (yet):
+ *   - True event-loop deadlocks (process alive but frozen). That requires
+ *     an external watchdog process — tracked as a follow-up.
+ */
+import fs from "fs";
+import { resolve } from "path";
+import os from "os";
+import { execSync } from "child_process";
+import { BOT_VERSION } from "../version.js";
+const DATA_DIR = process.env.ALVIN_DATA_DIR || resolve(os.homedir(), ".alvin-bot");
+const STATE_DIR = resolve(DATA_DIR, "state");
+const BEACON_FILE = resolve(STATE_DIR, "watchdog.json");
+const ALERT_FILE = resolve(STATE_DIR, "crash-loop.alert");
+const BEACON_INTERVAL_MS = 30_000; // write a beacon every 30 s
+const CRASH_WINDOW_MS = 10 * 60 * 1000; // 10 min — crashes within this count toward the brake
+const CRASH_BRAKE_THRESHOLD = 10; // after this many crashes in the window, brake
+const STALE_BEACON_MS = 90_000; // a beacon older than this is considered "old enough that previous process really exited"
+const RECOVERY_UPTIME_MS = 5 * 60 * 1000; // 5 min of clean uptime resets the counter
+let beaconTimer = null;
+let resetTimer = null;
+let bootTime = 0;
+function ensureStateDir() {
+    try {
+        fs.mkdirSync(STATE_DIR, { recursive: true });
+    }
+    catch (err) {
+        console.error("[watchdog] failed to create state dir:", err);
+    }
+}
+function readBeacon() {
+    try {
+        const raw = fs.readFileSync(BEACON_FILE, "utf-8");
+        const parsed = JSON.parse(raw);
+        if (typeof parsed.lastBeat === "number" &&
+            typeof parsed.pid === "number" &&
+            typeof parsed.bootTime === "number" &&
+            typeof parsed.crashCount === "number" &&
+            typeof parsed.crashWindowStart === "number" &&
+            typeof parsed.version === "string") {
+            return parsed;
+        }
+        return null;
+    }
+    catch {
+        return null;
+    }
+}
+function writeBeacon(data) {
+    try {
+        fs.writeFileSync(BEACON_FILE, JSON.stringify(data, null, 0), "utf-8");
+    }
+    catch (err) {
+        console.error("[watchdog] failed to write beacon:", err);
+    }
+}
+function writeAlert(reason, crashCount) {
+    try {
+        const content = [
+            `Alvin Bot crash-loop brake hit at ${new Date().toISOString()}`,
+            `Version: ${BOT_VERSION}`,
+            `Crashes in the last ${CRASH_WINDOW_MS / 60_000} minutes: ${crashCount}`,
+            `Threshold: ${CRASH_BRAKE_THRESHOLD}`,
+            ``,
+            `Reason: ${reason}`,
+            ``,
+            `The bot will refuse to start until this file is removed AND the`,
+            `LaunchAgent is reloaded. Investigate the recent error log:`,
+            `  ${resolve(DATA_DIR, "logs", "alvin-bot.err.log")}`,
+            ``,
+            `Recovery steps once you've fixed the underlying issue:`,
+            `  rm "${ALERT_FILE}"`,
+            `  alvin-bot launchd install   # or just kickstart the service`,
+            ``,
+        ].join("\n");
+        fs.writeFileSync(ALERT_FILE, content, "utf-8");
+    }
+    catch (err) {
+        console.error("[watchdog] failed to write alert:", err);
+    }
+}
+/**
+ * Check whether the watchdog has hit the crash-loop brake. Called once
+ * at startup, BEFORE most of the bot initializes. If the brake is set
+ * (alert file exists), the bot exits cleanly with code 3 — and because
+ * launchd's KeepAlive will keep retrying, we also try to unload our
+ * own LaunchAgent so the retries stop.
+ */
+export function checkCrashLoopBrake() {
+    if (!fs.existsSync(ALERT_FILE))
+        return;
+    console.error("");
+    console.error("==================================================");
+    console.error("⛔ alvin-bot crash-loop brake is engaged");
+    console.error("==================================================");
+    try {
+        const content = fs.readFileSync(ALERT_FILE, "utf-8");
+        console.error(content);
+    }
+    catch { /* ignore */ }
+    // Attempt to unload our own LaunchAgent so launchd stops retrying.
+    // If we don't do this, launchd just KeepAlive's us forever and we
+    // burn CPU writing the same alert.
+    if (process.platform === "darwin") {
+        try {
+            const home = os.homedir();
+            const plistPath = resolve(home, "Library", "LaunchAgents", "com.alvinbot.app.plist");
+            if (fs.existsSync(plistPath)) {
+                execSync(`launchctl unload -w "${plistPath}"`, { stdio: "pipe" });
+                console.error("[watchdog] LaunchAgent unloaded — bot will not auto-restart.");
+            }
+        }
+        catch (err) {
+            console.error("[watchdog] failed to unload LaunchAgent:", err);
+        }
+    }
+    // Exit with a distinct code so logs make the cause obvious
+    process.exit(3);
+}
+/**
+ * Start the watchdog. Called from src/index.ts after all services are
+ * initialized. Reads the previous beacon, increments crash counter if
+ * the previous run exited recently, schedules the periodic beacon
+ * writer, and schedules a recovery-mark reset after RECOVERY_UPTIME_MS
+ * of clean uptime.
+ */
+export function startWatchdog() {
+    ensureStateDir();
+    bootTime = Date.now();
+    const previous = readBeacon();
+    let crashCount = 0;
+    let crashWindowStart = bootTime;
+    if (previous) {
+        const timeSinceLastBeat = bootTime - previous.lastBeat;
+        const inWindow = bootTime - previous.crashWindowStart < CRASH_WINDOW_MS;
+        if (timeSinceLastBeat < STALE_BEACON_MS) {
+            // Previous process exited very recently → that's a crash (or a
+            // graceful exit immediately followed by a restart, which we treat
+            // the same way for the brake — the goal is to detect rapid cycles).
+            if (inWindow) {
+                crashCount = previous.crashCount + 1;
+                crashWindowStart = previous.crashWindowStart;
+            }
+            else {
+                // Previous crash was outside the window → reset counter
+                crashCount = 1;
+            }
+            console.log(`[watchdog] detected restart after ${Math.round(timeSinceLastBeat / 1000)}s — crash ${crashCount}/${CRASH_BRAKE_THRESHOLD} in current ${CRASH_WINDOW_MS / 60_000}min window`);
+            if (crashCount >= CRASH_BRAKE_THRESHOLD) {
+                console.error(`[watchdog] crash-loop brake triggered (${crashCount} crashes in ${CRASH_WINDOW_MS / 60_000}min)`);
+                writeAlert(`Process restarted ${crashCount} times within ${CRASH_WINDOW_MS / 60_000} minutes. Last beacon was ${Math.round(timeSinceLastBeat / 1000)}s ago. Most likely a deterministic crash on startup.`, crashCount);
+                // Re-use the brake check to unload + exit cleanly
+                checkCrashLoopBrake();
+            }
+        }
+        else {
+            // Previous beacon was old → process had clean uptime before exit,
+            // OR system was rebooted between runs. Reset crash count.
+            crashCount = 0;
+            crashWindowStart = bootTime;
+        }
+    }
+    // Write the first beacon immediately so a fresh restart updates the file
+    writeBeacon({
+        lastBeat: bootTime,
+        pid: process.pid,
+        bootTime,
+        crashCount,
+        crashWindowStart,
+        version: BOT_VERSION,
+    });
+    // Periodic beacon writer
+    beaconTimer = setInterval(() => {
+        writeBeacon({
+            lastBeat: Date.now(),
+            pid: process.pid,
+            bootTime,
+            crashCount,
+            crashWindowStart,
+            version: BOT_VERSION,
+        });
+    }, BEACON_INTERVAL_MS);
+    // Schedule a recovery counter reset after RECOVERY_UPTIME_MS of clean
+    // uptime. If we make it that far without dying, the bot is healthy
+    // again and we shouldn't penalize a future single crash.
+    resetTimer = setTimeout(() => {
+        if (crashCount > 0) {
+            console.log(`[watchdog] ${RECOVERY_UPTIME_MS / 60_000}min clean uptime — resetting crash counter from ${crashCount} to 0`);
+            crashCount = 0;
+            crashWindowStart = Date.now();
+            writeBeacon({
+                lastBeat: Date.now(),
+                pid: process.pid,
+                bootTime,
+                crashCount,
+                crashWindowStart,
+                version: BOT_VERSION,
+            });
+        }
+    }, RECOVERY_UPTIME_MS);
+    console.log(`[watchdog] started — beacon every ${BEACON_INTERVAL_MS / 1000}s, brake at ${CRASH_BRAKE_THRESHOLD} crashes per ${CRASH_WINDOW_MS / 60_000}min, recovery after ${RECOVERY_UPTIME_MS / 60_000}min uptime`);
+}
+/**
+ * Stop the watchdog cleanly. Called from the shutdown handler in
+ * index.ts so beacon timers don't keep the process alive after the
+ * grammy bot has stopped.
+ */
+export function stopWatchdog() {
+    if (beaconTimer) {
+        clearInterval(beaconTimer);
+        beaconTimer = null;
+    }
+    if (resetTimer) {
+        clearTimeout(resetTimer);
+        resetTimer = null;
+    }
+}
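The startup branch of the watchdog reduces to a small decision table: fresh beacon within the crash window → increment, fresh beacon outside the window → restart the window at 1, stale beacon or no beacon → reset to 0. A pure-function sketch of just that accounting (`accountCrash` is a hypothetical extraction for illustration; the shipped file inlines this in startWatchdog alongside the logging and brake check):

```javascript
const CRASH_WINDOW_MS = 10 * 60 * 1000; // same thresholds as the shipped file
const STALE_BEACON_MS = 90_000;

// Given the previous beacon (or null) and the current boot time, compute
// the new crash counter and the start of the crash window it counts in.
function accountCrash(previous, bootTime) {
  if (!previous) return { crashCount: 0, crashWindowStart: bootTime };
  const timeSinceLastBeat = bootTime - previous.lastBeat;
  if (timeSinceLastBeat >= STALE_BEACON_MS) {
    // Old beacon → clean uptime before exit (or a reboot): reset.
    return { crashCount: 0, crashWindowStart: bootTime };
  }
  const inWindow = bootTime - previous.crashWindowStart < CRASH_WINDOW_MS;
  return inWindow
    ? { crashCount: previous.crashCount + 1, crashWindowStart: previous.crashWindowStart }
    : { crashCount: 1, crashWindowStart: bootTime };
}

const now = Date.now();
// Crashed 10 s after the last beat, inside the window → 3 becomes 4.
console.log(accountCrash({ lastBeat: now - 10_000, crashCount: 3, crashWindowStart: now - 60_000 }, now).crashCount); // 4
```

Keeping this logic pure makes the brake condition (`crashCount >= CRASH_BRAKE_THRESHOLD`) easy to reason about: it only ever counts rapid restarts inside one sliding 10-minute window.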