alvin-bot 4.9.3 → 4.9.4

package/CHANGELOG.md CHANGED
@@ -2,6 +2,49 @@
 
  All notable changes to Alvin Bot are documented here.

+ ## [4.9.4] - 2026-04-13
+
+ ### 🔌 Web UI fully decoupled from main bot: port conflicts no longer crash anything
+
+ Colleague feedback (WhatsApp voice note, 2026-04-13):
+ > *"The gateway binds to port 3100 like OpenClaw. When the bot restarts,
+ > the port is often still held → catastrophic crash. I ended up
+ > decoupling the gateway process completely, because the actual bot
+ > runs independently of the gateway - it can still answer Telegram
+ > even if the web endpoint isn't reachable yet. It's weird that the
+ > main routine crashes when the port is busy. It should just run in
+ > the background, watch for the port to become free, and connect
+ > then. Zero impact on the main routine."*
+
+ He was right. My v4.9.0 `stopWebServer()` fix was *prevention* - it stopped the bot itself from holding 3100 across restarts. But it didn't cover the *resilience* side: a foreign process holding 3100 (another dev server, an OpenClaw-style orphan, a TIME_WAIT race after SIGKILL) still crashed the boot, because `startWebServer()` was synchronous and the `uncaught exception` from `server.listen()` escaped to the main event loop.
+
+ **Complete rewrite of the bind loop:**
+
+ - **`src/web/bind-strategy.ts` (new) - pure decision helper.** `decideNextBindAction(err, attempt, opts)` returns either `{type: "retry-port", port, attempt}` (climb the ladder) or `{type: "retry-background", delayMs, port}` (back off, retry the original port in 30 s). EADDRINUSE with attempts remaining → ladder. EADDRINUSE exhausted → background. Any other error → background. 8 unit tests covering every branch + purity.
+
+ - **`src/web/server.ts` startWebServer - non-blocking, fresh-server-per-attempt.** Returns `void` synchronously, NEVER throws, NEVER blocks on bind. Each attempt creates a new `http.Server` (no state-recycling bugs) and attaches its own error handler. On failure, cleans up and calls `decideNextBindAction` to decide the next move. If the ladder is exhausted, schedules a 30 s background retry at the original port - the Telegram bot keeps running the whole time, the web UI just isn't reachable yet.
+
+ - **`src/web/server.ts` WebSocketServer attached POST-bind.** The `ws` library's `WebSocketServer` constructor installs its own event plumbing on the underlying `http.Server` and, crucially, causes EADDRINUSE errors to escape as uncaught exceptions when attached pre-listen. Debugging this chewed an hour on 2026-04-13. Fix: only `new WebSocketServer({ server })` AFTER `listen()` has fired its callback. The unit-test `test/web-server-integration.test.ts "when the primary port is taken"` pins this behaviour.
+
+ - **`src/web/server.ts` error handler: `on` not `once`.** Previous version used `.once("error", handler)` and a node edge case where a single bind failure emits TWO error events left the second one uncaught. Handler is now `on` with a `handled` guard - idempotent, and a post-bind quiet logger replaces it on success.
+
+ - **`src/web/server.ts` defensive try/catch around `server.listen()`.** In the wild Node sometimes throws synchronously for edge-case binds (already-listening, invalid backlog, kernel race). The catch funnels sync throws through the same `handleBindFailure` path as async error events.
+
+ - **`src/web/server.ts` `closeHttpServerGracefully(server)` + `stopWebServer()`.** The old `stopWebServer(server)` took an explicit server arg; it's been split into a low-level helper (`closeHttpServerGracefully(server)`, exported for tests) and a stateful top-level (`stopWebServer()`, no args, cleans up `currentServer` + `wsServerRef` + `bindRetryTimer`). Safe to call before start, safe to call twice, cancels pending background retries.
+
+ - **`src/index.ts` call sites adjusted.** `const webServer = startWebServer()` → `startWebServer()`. `stopWebServer(webServer)` → `stopWebServer()`. The comment above the call explains the decoupling so nobody accidentally re-couples it in a future "clean up" refactor.
+
+ **Testing: 186 → 201 (+15 new).**
+
+ - `test/web-server-resilience.test.ts` - 8 unit tests for `decideNextBindAction`
+ - `test/web-server-integration.test.ts` - 7 real-server integration tests: startWebServer returns void, binds, stops, is idempotent, survives primary-port conflict by climbing the ladder, closes servers with hanging sockets.
+ - **Live-verified on the maintainer's machine**: `launchctl unload` + dual-stack Node hog on port 3100 + `launchctl load` → bot booted cleanly → out.log contained `[web] port 3100 busy (EADDRINUSE) - trying 3101` → `🌐 Web UI: http://localhost:3101 (Port 3100 was busy, using 3101 instead)` → Telegram responsive throughout. Exactly what the colleague described.
+
+ **Non-goals / intentionally unchanged:**
+ - Timeouts stay unlimited (v4.8.8 behaviour preserved).
+ - The primary port is still `WEB_PORT || 3100` - no config schema change.
+ - When the bot binds on a non-primary port (e.g. 3101), the README permalink still points at 3100. Users hitting a ladder-climbed bot should check the startup log; this is rare and temporary.
+
  ## [4.9.3] - 2026-04-11

  ### 🛠 Two UX bugs found in production after v4.9.2 - now closed
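The "`on` not `once`" bullet in the 4.9.4 entry above can be illustrated in isolation. The following is a minimal sketch, not the bot's code: it uses a plain `EventEmitter` in place of an `http.Server`, and the helper names `makeIdempotentHandler` and `emitTwice` are hypothetical. It assumes only the documented Node behaviour that emitting `"error"` with no listener attached throws.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical helper: wraps a handler so only the first call has an effect,
// mirroring the `handled` guard described in the changelog.
function makeIdempotentHandler<T>(fn: (arg: T) => void): (arg: T) => void {
  let handled = false;
  return (arg: T) => {
    if (handled) return;
    handled = true;
    fn(arg);
  };
}

// Emits two "error" events for one simulated failure and reports
// how often the handler ran and whether the second event escaped.
function emitTwice(
  register: (emitter: EventEmitter, handler: (err: Error) => void) => void,
): { calls: number; threw: boolean } {
  const emitter = new EventEmitter();
  let calls = 0;
  register(emitter, () => { calls += 1; });
  let threw = false;
  try {
    emitter.emit("error", new Error("EADDRINUSE"));
    emitter.emit("error", new Error("EADDRINUSE"));
  } catch {
    // With no "error" listener left, EventEmitter throws the error itself.
    threw = true;
  }
  return { calls, threw };
}

const withOnce = emitTwice((e, h) => { e.once("error", h); });
const withGuard = emitTwice((e, h) => { e.on("error", makeIdempotentHandler(h)); });

console.log(withOnce);  // { calls: 1, threw: true }
console.log(withGuard); // { calls: 1, threw: false }
```

With `once`, the listener is gone by the time the second event fires, so the error is thrown. With `on` plus the guard, the listener stays attached and the second event is absorbed.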
package/README.md CHANGED
@@ -114,7 +114,18 @@ That's it. The setup wizard validates everything:
 
  **Requires:** Node.js 18+ ([nodejs.org](https://nodejs.org)) · Telegram bot token ([@BotFather](https://t.me/BotFather)) · Your Telegram user ID ([@userinfobot](https://t.me/userinfobot))

- Free AI providers available - no credit card needed.
+ Free AI providers available - no credit card needed. **Privacy-first?** Pick the 🔒 **Offline - Gemma 4 E4B** option in setup for a fully local LLM via Ollama (macOS/Linux: automated install; Windows: manual).
+
+ ### 📘 First-time setup walkthroughs
+
+ Step-by-step guides with screenshots and screen-for-screen instructions:
+
+ | Platform | PDF (printable) |
+ |---|---|
+ | 🍎 **macOS** (with `launchd` background service) | [Download PDF](https://github.com/alvbln/Alvin-Bot/releases/latest/download/Alvin-Bot-macOS-Setup-Guide.pdf) |
+ | 🪟 **Windows** (with Task Scheduler / Startup folder) | [Download PDF](https://github.com/alvbln/Alvin-Bot/releases/latest/download/Alvin-Bot-Windows-Setup-Guide.pdf) |
+
+ Both guides cover: Node.js install · Telegram bot creation · first-time `setup` · foreground test · background service · offline Gemma 4 mode · troubleshooting. ~15 min end-to-end for a first-time user.

  ### macOS: use `launchd` instead of pm2 (recommended)
 
package/dist/index.js CHANGED
@@ -267,7 +267,7 @@ const shutdown = async () => {
      }
      // Release :3100 so the next launchd boot doesn't hit EADDRINUSE.
      // Must happen before exit - see src/web/server.ts stopWebServer() comment.
-     await stopWebServer(webServer).catch((err) => console.warn("[shutdown] stopWebServer failed:", err));
+     await stopWebServer().catch((err) => console.warn("[shutdown] stopWebServer failed:", err));
      await unloadPlugins().catch(() => { });
      await disconnectMCP().catch(() => { });
      // Tear down any bot-managed local runners (Ollama, LM Studio, …) so VRAM
@@ -404,8 +404,13 @@ async function startOptionalPlatforms() {
      }
  }
  startOptionalPlatforms().catch(err => console.error("Platform startup error:", err));
- // Start Web UI (ALWAYS - regardless of Telegram/AI config)
- const webServer = startWebServer();
+ // Start Web UI (ALWAYS - regardless of Telegram/AI config).
+ // startWebServer is now non-blocking and will never throw: if port 3100
+ // is busy (foreign process, TIME_WAIT, another bot instance), it climbs
+ // the port ladder up to 3119 and then enters a background retry loop
+ // at 3100 every 30s. The Telegram bot runs independently - Web UI is a
+ // feature, not core. See src/web/bind-strategy.ts for the retry rules.
+ startWebServer();
  // Start Cron Scheduler - route notifications through delivery queue for reliability
  setNotifyCallback(async (target, text) => {
      if (target.platform === "web") {
@@ -0,0 +1,42 @@
+ /**
+  * Pure decision helper for the web-server bind loop.
+  *
+  * Decouples the "what should happen next" logic from the side-effect
+  * spaghetti of real http.Server binding so it can be unit-tested in
+  * isolation. See test/web-server-resilience.test.ts for the contract.
+  *
+  * Why this exists: the v4.8.x and earlier implementations crashed the
+  * entire bot when port 3100 was held by a foreign process. A colleague
+  * running an OpenClaw fork hit the same bug years ago and ended up
+  * decoupling the web server completely - the main bot should never be
+  * gated on a web-UI bind. This helper encodes the decision logic so
+  * the new startWebServer() can just act on the returned action.
+  */
+ /**
+  * Decide what the bind loop should do next after a failed listen().
+  *
+  * Rule of thumb:
+  * - EADDRINUSE AND attempts remaining → climb the port ladder.
+  * - EADDRINUSE AND ladder exhausted → background retry at original port.
+  * - any other error (EACCES, listen-called-twice, etc.) → background retry.
+  *
+  * PURE: no timers, no I/O, no mutation of inputs. Safe to call from tests.
+  */
+ export function decideNextBindAction(err, attempt, opts) {
+     const code = err?.code;
+     if (code === "EADDRINUSE" && attempt < opts.maxPortTries - 1) {
+         return {
+             type: "retry-port",
+             port: opts.originalPort + attempt + 1,
+             attempt: attempt + 1,
+         };
+     }
+     // EADDRINUSE with no attempts left, OR any non-EADDRINUSE error:
+     // don't walk the port ladder further, just back off and retry the
+     // original port in the background.
+     return {
+         type: "retry-background",
+         delayMs: opts.backgroundRetryMs,
+         port: opts.originalPort,
+     };
+ }
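To make the retry rules above concrete, here is a small typed sketch that walks the whole ladder with every port reported busy. The function body is transcribed from the compiled helper above; the type names `BindOptions` and `BindAction` and the driver `simulateFullLadder` are assumptions added for illustration, since the compiled JavaScript carries no type information.

```typescript
// Assumed shapes, inferred from the fields the helper reads and returns.
type BindOptions = { originalPort: number; maxPortTries: number; backgroundRetryMs: number };
type BindAction =
  | { type: "retry-port"; port: number; attempt: number }
  | { type: "retry-background"; delayMs: number; port: number };

function decideNextBindAction(
  err: { code?: string } | null | undefined,
  attempt: number,
  opts: BindOptions,
): BindAction {
  if (err?.code === "EADDRINUSE" && attempt < opts.maxPortTries - 1) {
    return { type: "retry-port", port: opts.originalPort + attempt + 1, attempt: attempt + 1 };
  }
  return { type: "retry-background", delayMs: opts.backgroundRetryMs, port: opts.originalPort };
}

// Hypothetical driver: pretend every port is busy and record what gets tried.
function simulateFullLadder(opts: BindOptions): { tried: number[]; final: BindAction } {
  const busy = { code: "EADDRINUSE" };
  const tried = [opts.originalPort];
  let action = decideNextBindAction(busy, 0, opts);
  while (action.type === "retry-port") {
    tried.push(action.port);
    action = decideNextBindAction(busy, action.attempt, opts);
  }
  return { tried, final: action };
}

const ladderOpts: BindOptions = { originalPort: 3100, maxPortTries: 20, backgroundRetryMs: 30_000 };
const ladder = simulateFullLadder(ladderOpts);
console.log(ladder.tried.length, ladder.tried[0], ladder.tried[ladder.tried.length - 1], ladder.final.type);
// 20 3100 3119 retry-background

// A non-EADDRINUSE error skips the ladder entirely.
const denied = decideNextBindAction({ code: "EACCES" }, 0, ladderOpts);
console.log(denied.type); // retry-background
```

With `maxPortTries` set to 20, the `attempt < maxPortTries - 1` bound means the ladder covers exactly ports 3100 through 3119 before falling back to the background retry at the original port.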
@@ -30,10 +30,24 @@ import { addCanvasClient } from "./canvas.js";
  import { BOT_ROOT, ENV_FILE, PUBLIC_DIR, MEMORY_DIR, MEMORY_FILE, SOUL_FILE, DATA_DIR, MCP_CONFIG, SKILLS_DIR } from "../paths.js";
  import { broadcast } from "../services/broadcast.js";
  import { BOT_VERSION } from "../version.js";
+ import { decideNextBindAction } from "./bind-strategy.js";
  const WEB_PORT = parseInt(process.env.WEB_PORT || "3100");
- /** Module-scope reference to the WebSocket server so stopWebServer() can
-  * tear it down together with the HTTP server. Set inside startWebServer(). */
+ /** Tuning for the bind loop. Walk the port ladder `MAX_PORT_TRIES` times
+  * then fall back to a `BACKGROUND_RETRY_MS` idle loop - the bot keeps
+  * running on Telegram either way; see bind-strategy.ts for the pure
+  * decision logic. */
+ const MAX_PORT_TRIES = 20;
+ const BACKGROUND_RETRY_MS = 30_000;
+ /** Current live http.Server, if one has successfully bound. */
+ let currentServer = null;
+ /** Current live WebSocketServer attached to currentServer. */
  let wsServerRef = null;
+ /** Background-retry timer handle - set when the bind loop is in its
+  * idle wait between cycles, cleared when stopWebServer() cancels. */
+ let bindRetryTimer = null;
+ /** Flag flipped by stopWebServer(). Every bind-loop callback checks
+  * this and exits silently if set, so stop is truly terminal. */
+ let stopRequested = false;
  const WEB_PASSWORD = process.env.WEB_PASSWORD || "";
  /** The actual port the Web UI is running on (may differ from WEB_PORT if busy). */
  let actualWebPort = WEB_PORT;
@@ -1371,126 +1385,207 @@ function handleWebSocket(wss) {
      });
  }
  // ── Start Server ────────────────────────────────────────
- export function startWebServer() {
-     const server = http.createServer((req, res) => {
-         let body = "";
-         req.on("data", (chunk) => { body += chunk; });
-         req.on("end", () => {
-             const urlPath = (req.url || "/").split("?")[0];
-             // OpenAI-compatible API (/v1/chat/completions, /v1/models)
-             if (urlPath.startsWith("/v1/")) {
-                 handleOpenAICompat(req, res, urlPath, body);
-                 return;
-             }
-             // API routes
-             if (urlPath.startsWith("/api/")) {
-                 handleAPI(req, res, urlPath, body);
-                 return;
-             }
-             // Auth page (if password set and not authenticated)
-             if (WEB_PASSWORD && !checkAuth(req) && urlPath !== "/login.html") {
-                 res.writeHead(302, { Location: "/login.html" });
-                 res.end();
-                 return;
-             }
-             // Canvas UI
-             if (urlPath === "/canvas") {
-                 const canvasFile = resolve(PUBLIC_DIR, "canvas.html");
-                 try {
-                     const content = fs.readFileSync(canvasFile);
-                     res.setHeader("Content-Type", "text/html");
-                     res.end(content);
-                 }
-                 catch {
-                     res.statusCode = 404;
-                     res.end("Not found");
-                 }
-                 return;
-             }
-             // Static files
-             let filePath = urlPath === "/" ? "/index.html" : urlPath;
-             filePath = resolve(PUBLIC_DIR, filePath.slice(1));
-             // Security: prevent path traversal
-             if (!filePath.startsWith(PUBLIC_DIR)) {
-                 res.statusCode = 403;
-                 res.end("Forbidden");
-                 return;
-             }
+ /**
+  * HTTP request handler for the web UI. Hoisted to a top-level function
+  * so every bind attempt can create a fresh http.Server without
+  * rebuilding the handler closure.
+  */
+ function handleWebRequest(req, res) {
+     let body = "";
+     req.on("data", (chunk) => { body += chunk; });
+     req.on("end", () => {
+         const urlPath = (req.url || "/").split("?")[0];
+         // OpenAI-compatible API (/v1/chat/completions, /v1/models)
+         if (urlPath.startsWith("/v1/")) {
+             handleOpenAICompat(req, res, urlPath, body);
+             return;
+         }
+         // API routes
+         if (urlPath.startsWith("/api/")) {
+             handleAPI(req, res, urlPath, body);
+             return;
+         }
+         // Auth page (if password set and not authenticated)
+         if (WEB_PASSWORD && !checkAuth(req) && urlPath !== "/login.html") {
+             res.writeHead(302, { Location: "/login.html" });
+             res.end();
+             return;
+         }
+         // Canvas UI
+         if (urlPath === "/canvas") {
+             const canvasFile = resolve(PUBLIC_DIR, "canvas.html");
              try {
-                 const content = fs.readFileSync(filePath);
-                 const ext = path.extname(filePath);
-                 res.setHeader("Content-Type", MIME[ext] || "application/octet-stream");
+                 const content = fs.readFileSync(canvasFile);
+                 res.setHeader("Content-Type", "text/html");
                  res.end(content);
              }
              catch {
                  res.statusCode = 404;
                  res.end("Not found");
              }
-         });
+             return;
+         }
+         // Static files
+         let filePath = urlPath === "/" ? "/index.html" : urlPath;
+         filePath = resolve(PUBLIC_DIR, filePath.slice(1));
+         // Security: prevent path traversal
+         if (!filePath.startsWith(PUBLIC_DIR)) {
+             res.statusCode = 403;
+             res.end("Forbidden");
+             return;
+         }
+         try {
+             const content = fs.readFileSync(filePath);
+             const ext = path.extname(filePath);
+             res.setHeader("Content-Type", MIME[ext] || "application/octet-stream");
+             res.end(content);
+         }
+         catch {
+             res.statusCode = 404;
+             res.end("Not found");
+         }
      });
-     const wss = new WebSocketServer({ server });
-     wsServerRef = wss;
-     handleWebSocket(wss);
-     // Smart port: try WEB_PORT, increment if busy (up to +20)
-     const MAX_TRIES = 20;
-     function tryListen(port, attempt = 0) {
-         server.once("error", (err) => {
-             if (err.code === "EADDRINUSE" && attempt < MAX_TRIES) {
-                 tryListen(port + 1, attempt + 1);
-             }
-             else {
-                 console.error(`❌ Web UI failed to start: ${err.message}`);
-             }
+ }
+ /**
+  * Kick off the web-UI bind loop. NEVER throws, NEVER blocks.
+  *
+  * History: earlier versions returned an http.Server synchronously and
+  * let listen() errors bubble up as uncaught exceptions - a colleague
+  * flagged this on 2026-04-13 after spending months fighting the exact
+  * same bug on a parallel OpenClaw fork. Their resolution: "the gateway
+  * is a feature, not core. Decouple it."
+  *
+  * New contract:
+  * - Returns `void` immediately. The actual bind happens asynchronously.
+  * - If port 3100 is busy, tries 3101…3119 in sequence (same as before).
+  * - If ALL 20 ports are busy, schedules a background retry at 3100
+  *   in `BACKGROUND_RETRY_MS` - keeps trying forever until success
+  *   or stopWebServer() is called.
+  * - Any non-EADDRINUSE error also falls through to background retry.
+  * - Each attempt uses a FRESH http.Server to avoid node's fragile
+  *   "listen-called-twice" state-recycling behaviour.
+  * - The main Telegram bot is completely independent of this - if the
+  *   web UI never binds, the bot still answers messages.
+  */
+ export function startWebServer() {
+     stopRequested = false;
+     scheduleBindAttempt(WEB_PORT, 0);
+ }
+ function scheduleBindAttempt(port, attempt) {
+     if (stopRequested)
+         return;
+     // Read WEB_PORT live every time rather than closing over the
+     // module-load value, so tests that change process.env.WEB_PORT
+     // between runs see the new port.
+     const originalPort = parseInt(process.env.WEB_PORT || "3100");
+     // Fresh server for each attempt. Recycling a server that has already
+     // emitted an EADDRINUSE error has produced "Listen method has been
+     // called more than once" crashes in the wild.
+     //
+     // IMPORTANT: do NOT attach the WebSocketServer yet. The `ws` library
+     // installs its own event plumbing on the http.Server in its
+     // constructor, which causes bind errors to escape as uncaught
+     // exceptions. We only attach it AFTER listen() has succeeded.
+     const server = http.createServer(handleWebRequest);
+     // Double-invocation guard: on some Node versions `server.listen`
+     // both throws synchronously AND emits an `error` event for the same
+     // bind failure. Without the guard we'd climb the ladder twice in
+     // parallel and end up with two retry cascades racing each other.
+     let handled = false;
+     const cleanupDeadAttempt = () => {
+         try {
+             server.removeAllListeners("error");
+         }
+         catch { /* ignore */ }
+         try {
+             server.close(() => { });
+         }
+         catch { /* ignore */ }
+     };
+     const handleBindFailure = (err) => {
+         if (handled)
+             return;
+         handled = true;
+         cleanupDeadAttempt();
+         if (stopRequested)
+             return;
+         const action = decideNextBindAction(err, attempt, {
+             originalPort,
+             maxPortTries: MAX_PORT_TRIES,
+             backgroundRetryMs: BACKGROUND_RETRY_MS,
          });
+         if (action.type === "retry-port") {
+             console.warn(`[web] port ${port} busy (${err.code || err.message}) - trying ${action.port}`);
+             scheduleBindAttempt(action.port, action.attempt);
+             return;
+         }
+         // action.type === "retry-background"
+         console.warn(`[web] bind failed (${err.code || err.message}) - ` +
+             `backing off ${action.delayMs / 1000}s then retrying port ${action.port}. ` +
+             `Bot is unaffected; Telegram remains live.`);
+         bindRetryTimer = setTimeout(() => {
+             bindRetryTimer = null;
+             scheduleBindAttempt(action.port, 0);
+         }, action.delayMs);
+     };
+     // Use `on` (not `once`) so a pathological server that emits two
+     // error events for a single failure doesn't leave the second one
+     // uncaught. The `handled` guard makes the handler idempotent.
+     server.on("error", handleBindFailure);
+     // Defensive try/catch - `server.listen()` usually emits async errors,
+     // but certain Node versions + edge cases (already-listening server,
+     // invalid backlog, kernel hiccup) can throw synchronously. Catch here
+     // so the main routine never crashes during web-UI bind.
+     try {
          server.listen(port, () => {
+             if (handled)
+                 return; // Should be impossible; paranoia.
+             handled = true;
+             // Now, and only now, attach the WebSocketServer. Before the
+             // bind succeeded, the ws library's constructor would hijack the
+             // http.Server's error event chain and let EADDRINUSE escape as
+             // uncaught. Post-bind is safe.
+             const wss = new WebSocketServer({ server });
+             handleWebSocket(wss);
+             currentServer = server;
+             wsServerRef = wss;
              actualWebPort = port;
+             // Remove the bind error handler - post-listen errors (socket
+             // errors, close events) should not kick off a spurious retry
+             // cycle. Install a quiet logger for any stray error events so
+             // they can't escape as uncaught.
+             server.removeListener("error", handleBindFailure);
+             server.on("error", (err) => {
+                 console.warn(`[web] post-bind server error (ignored): ${err.message}`);
+             });
              console.log(`🌐 Web UI: http://localhost:${actualWebPort}`);
-             if (actualWebPort !== WEB_PORT) {
-                 console.log(` (Port ${WEB_PORT} was busy, using ${actualWebPort} instead)`);
+             if (actualWebPort !== originalPort) {
+                 console.log(` (Port ${originalPort} was busy, using ${actualWebPort} instead)`);
              }
          });
      }
-     tryListen(WEB_PORT);
-     return server;
+     catch (err) {
+         handleBindFailure(err);
+     }
  }
  /**
-  * Gracefully stop the web server so the port is released.
-  *
-  * Why this exists: `shutdown()` in src/index.ts used to stop grammy and the
-  * scheduler but leave the HTTP server listening. macOS then held the
-  * listening socket in the socket table, so launchd's next boot of the bot
-  * hit `EADDRINUSE :::3100`, threw an Uncaught exception and crash-looped.
+  * Gracefully close a specific http.Server - the low-level building
+  * block. Exported for tests and for any future callers that manage
+  * their own servers. Production bot code uses `stopWebServer()` below
+  * which operates on the module-global current server instead.
   *
   * What this does:
-  * 1. Force-close idle keep-alive sockets (otherwise close() hangs on them).
-  * 2. Force-close active open requests (long-poll clients, WebSocket
-  *    upgrades that never completed).
-  * 3. Tear down the WebSocket server so its own sockets don't linger.
-  * 4. Await `server.close()` so the listening socket is truly released
-  *    before the caller's shutdown continues.
+  * 1. Force-close idle keep-alive sockets (Node 18.2+).
+  * 2. Force-close active open requests (long-poll clients).
+  * 3. Await `server.close()` so the listening socket is truly freed.
   *
-  * Safe to call multiple times; no-op when the server is already closed or
-  * never listened. Never throws.
+  * Safe to call on already-closed, never-listened, or mid-listen servers.
+  * Never throws.
   */
- export async function stopWebServer(server) {
-     try {
-         if (wsServerRef) {
-             for (const client of wsServerRef.clients) {
-                 try {
-                     client.terminate();
-                 }
-                 catch { /* ignore */ }
-             }
-             await new Promise((resolve) => wsServerRef.close(() => resolve()));
-             wsServerRef = null;
-         }
-     }
-     catch { /* ignore */ }
+ export async function closeHttpServerGracefully(server) {
      if (!server.listening)
          return;
      try {
-         // Node 18.2+ APIs - break any keep-alive / long-poll stalls so
-         // server.close() can actually resolve.
          const s = server;
          if (typeof s.closeIdleConnections === "function")
              s.closeIdleConnections();
@@ -1499,12 +1594,47 @@ export async function stopWebServer(server) {
      }
      catch { /* ignore */ }
      await new Promise((resolve) => {
-         // close() callback fires with an Error arg when the server wasn't
-         // listening - we just resolve in either case. The caller only cares
-         // that the port is free when this awaits.
          server.close(() => resolve());
      });
  }
+ /**
+  * Stop the web server: cancel any pending background-retry, close
+  * WebSocket clients, then gracefully close the HTTP server.
+  *
+  * Idempotent - safe to call multiple times, and safe to call before
+  * startWebServer() ever successfully bound. Never throws.
+  */
+ export async function stopWebServer() {
+     stopRequested = true;
+     // Cancel any pending background-retry timer so a late retry doesn't
+     // grab the port AFTER we thought we'd shut everything down.
+     if (bindRetryTimer) {
+         clearTimeout(bindRetryTimer);
+         bindRetryTimer = null;
+     }
+     // Tear down the WebSocket server first so its sockets can't keep
+     // the underlying http.Server alive.
+     if (wsServerRef) {
+         try {
+             for (const client of wsServerRef.clients) {
+                 try {
+                     client.terminate();
+                 }
+                 catch { /* ignore */ }
+             }
+             await new Promise((resolve) => wsServerRef.close(() => resolve()));
+         }
+         catch { /* ignore */ }
+         wsServerRef = null;
+     }
+     if (currentServer) {
+         try {
+             await closeHttpServerGracefully(currentServer);
+         }
+         catch { /* ignore */ }
+         currentServer = null;
+     }
+ }
  /** Get the actual port the Web UI is running on. */
  export function getWebPort() {
      return actualWebPort;
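The three steps listed in the `closeHttpServerGracefully` doc comment above can be sketched as a standalone typed helper. This is an illustration under stated assumptions, not the shipped code: the name `closeGracefully` is hypothetical, and the diff elides the lines between the two hunks, so the use of `closeAllConnections()` for step 2 is an assumption based on the comment "Force-close active open requests". Both `closeIdleConnections()` and `closeAllConnections()` are Node 18.2+ `http.Server` methods, which is why they are feature-detected.

```typescript
import * as http from "node:http";

// Hypothetical stand-in for the low-level close helper described above.
async function closeGracefully(server: http.Server): Promise<void> {
  // Nothing to release if the server never bound or is already closed.
  if (!server.listening) return;
  const s = server as http.Server & {
    closeIdleConnections?: () => void;
    closeAllConnections?: () => void;
  };
  try {
    // Step 1: drop idle keep-alive sockets so close() is not held open by them.
    if (typeof s.closeIdleConnections === "function") s.closeIdleConnections();
    // Step 2 (assumed call): drop sockets with requests still in flight.
    if (typeof s.closeAllConnections === "function") s.closeAllConnections();
  } catch {
    /* ignore: closing is best-effort */
  }
  // Step 3: wait until the listening socket is actually released.
  await new Promise<void>((resolve) => {
    server.close(() => resolve());
  });
}

// A server that never listened takes the early-return path.
const neverBound = http.createServer();
const pendingClose = closeGracefully(neverBound);
```

Because the helper takes the server as an argument and holds no module state, it can be tested against throwaway servers, while a stateful wrapper such as `stopWebServer()` decides which server to pass in.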
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "alvin-bot",
-   "version": "4.9.3",
+   "version": "4.9.4",
    "description": "Alvin Bot - Your personal AI agent on Telegram, WhatsApp, Discord, Signal, and Web.",
    "type": "module",
    "main": "dist/index.js",
@@ -19,7 +19,7 @@
   */
  import { describe, it, expect, beforeEach, vi } from "vitest";
  import http from "http";
- import { stopWebServer } from "../src/web/server.js";
+ import { closeHttpServerGracefully as stopWebServer } from "../src/web/server.js";
  import {
    handleStartupCatchup,
    prepareForExecution,
@@ -0,0 +1,189 @@
1
+ /**
2
+ * Fix #16 (integration) โ€” end-to-end tests for the decoupled
3
+ * startWebServer + stopWebServer pair.
4
+ *
5
+ * These tests exercise the ACTUAL http.Server binding, not the pure
6
+ * decision helper. They rely on:
7
+ * - process.env.WEB_PORT to keep the test off the running bot's 3100
8
+ * - process.env.ALVIN_DATA_DIR to keep touch-points away from
9
+ * the maintainer's real ~/.alvin-bot/.env
10
+ *
11
+ * What's covered here:
12
+ * 1. startWebServer() returns synchronously (void) without throwing
13
+ * 2. stopWebServer() releases the port so another server can bind
14
+ * 3. Start โ†’ stop โ†’ start cycle doesn't leak sockets or timers
15
+ * 4. If the configured port is already busy, startWebServer still
16
+ * returns cleanly (no throw); the bot keeps running.
17
+ * 5. stopWebServer() is idempotent โ€” safe to call twice in a row
18
+ * and safe to call before startWebServer ever succeeded.
19
+ *
20
+ * The deliberate EADDRINUSE scenario is tested HERE against a real
21
+ * running hog โ€” no mocking.
22
+ */
23
+ import { describe, it, expect, beforeEach, afterEach, vi } from "vitest";
24
+ import http from "http";
25
+ import fs from "fs";
26
+ import os from "os";
27
+ import { resolve } from "path";
28
+
29
+ const TEST_DATA_DIR = resolve(os.tmpdir(), `alvin-bot-web-int-${process.pid}-${Date.now()}`);
30
+
31
+ function getFreePort(): Promise<number> {
32
+ return new Promise((resolve, reject) => {
33
+ const s = http.createServer();
34
+ s.listen(0, () => {
35
+ const addr = s.address();
36
+ if (typeof addr === "object" && addr) {
37
+ const p = addr.port;
38
+ s.close(() => resolve(p));
39
+ } else {
40
+ reject(new Error("no address"));
41
+ }
42
+ });
43
+ });
44
+ }
45
+
46
+ async function waitForPortBound(port: number, timeoutMs = 3000): Promise<boolean> {
47
+ const deadline = Date.now() + timeoutMs;
48
+ while (Date.now() < deadline) {
49
+ try {
50
+ const code = await new Promise<number>((resolveCode, reject) => {
51
+ const req = http.get(`http://127.0.0.1:${port}/`, (res) => {
52
+ res.resume();
53
+ resolveCode(res.statusCode ?? 0);
54
+ });
55
+ req.on("error", (err) => reject(err));
56
+ req.setTimeout(500, () => {
57
+ req.destroy(new Error("timeout"));
58
+ });
59
+ });
60
+ if (code > 0) return true;
61
+ } catch {
62
+ /* not yet */
63
+ }
64
+ await new Promise((r) => setTimeout(r, 100));
65
+ }
66
+ return false;
67
+ }
68
+
69
+ beforeEach(async () => {
70
+ if (fs.existsSync(TEST_DATA_DIR)) fs.rmSync(TEST_DATA_DIR, { recursive: true, force: true });
71
+ fs.mkdirSync(TEST_DATA_DIR, { recursive: true });
72
+ process.env.ALVIN_DATA_DIR = TEST_DATA_DIR;
73
+ // Write a minimal .env so config.ts loads cleanly
74
+ fs.writeFileSync(`${TEST_DATA_DIR}/.env`, "WEB_PASSWORD=\n", "utf-8");
75
+ process.env.WEB_PORT = String(await getFreePort());
76
+ // Reset module cache so each test imports server.js fresh and
77
+ // picks up the new WEB_PORT env var at module-load time.
78
+ vi.resetModules();
79
+ });
80
+
81
+ afterEach(async () => {
82
+ // Best-effort: stop whatever is running in the current module instance
83
+ try {
84
+ const { stopWebServer } = await import("../src/web/server.js");
85
+ await stopWebServer();
86
+ } catch {
87
+ /* ignore */
88
+ }
89
+ // Give the OS a moment to release ports before the next test
90
+ await new Promise((r) => setTimeout(r, 50));
91
+ });
92
+
93
+ describe("startWebServer / stopWebServer integration (Fix #16)", () => {
94
+ it("startWebServer returns void synchronously without throwing", async () => {
95
+ const { startWebServer } = await import("../src/web/server.js");
96
+ const result = startWebServer();
97
+ // Must return void (undefined). If it returned a Server instance
98
+ // the old API is still in place.
99
+ expect(result).toBeUndefined();
100
+ });
101
+
102
+ it("actually binds the web server and serves HTTP", async () => {
103
+ const port = Number(process.env.WEB_PORT);
104
+ const { startWebServer } = await import("../src/web/server.js");
105
+ startWebServer();
106
+ const up = await waitForPortBound(port, 3000);
107
+ expect(up).toBe(true);
108
+ });
109
+
110
+ it("stopWebServer releases the port", async () => {
111
+ const port = Number(process.env.WEB_PORT);
112
+ const { startWebServer, stopWebServer } = await import("../src/web/server.js");
113
+ startWebServer();
114
+ expect(await waitForPortBound(port, 3000)).toBe(true);
115
+ await stopWebServer();
116
+
117
+ // Port should now be free โ€” a fresh bind must succeed
118
+ const reuse = http.createServer();
119
+ await new Promise<void>((resolve, reject) => {
120
+ reuse.once("error", reject);
121
+ reuse.listen(port, () => resolve());
122
+ });
123
+ await new Promise<void>((r) => reuse.close(() => r()));
124
+ });
125
+
126
+ it("stopWebServer is idempotent: safe to call multiple times", async () => {
127
+ const { startWebServer, stopWebServer } = await import("../src/web/server.js");
128
+ startWebServer();
129
+ await new Promise((r) => setTimeout(r, 200));
130
+ await stopWebServer();
131
+ // Second call must not throw
132
+ await expect(stopWebServer()).resolves.toBeUndefined();
133
+ // Third call must also not throw
134
+ await expect(stopWebServer()).resolves.toBeUndefined();
135
+ });
136
+
137
+ it("stopWebServer is safe to call before startWebServer ever bound", async () => {
138
+ const { stopWebServer } = await import("../src/web/server.js");
139
+ // Module just imported; nothing started yet
140
+ await expect(stopWebServer()).resolves.toBeUndefined();
141
+ });
142
+
143
+ it("when the primary port is taken, startWebServer still returns cleanly (climbs the ladder)", async () => {
144
+ const originalPort = Number(process.env.WEB_PORT);
145
+ // Plant a hog on the primary port BEFORE startWebServer
146
+ const hog = http.createServer();
147
+ await new Promise<void>((r) => hog.listen(originalPort, () => r()));
148
+
149
+ try {
150
+ const { startWebServer } = await import("../src/web/server.js");
151
+ // Must NOT throw even though the port is occupied
152
+ expect(() => startWebServer()).not.toThrow();
153
+
154
+ // The bot should have climbed the ladder: one port higher should
155
+ // now be serving HTTP.
156
+ const climbed = await waitForPortBound(originalPort + 1, 3000);
157
+ expect(climbed).toBe(true);
158
+ } finally {
159
+ await new Promise<void>((r) => hog.close(() => r()));
160
+ }
161
+ });
162
+
163
+ it("closeHttpServerGracefully closes a server that's holding an open socket", async () => {
164
+ const { closeHttpServerGracefully } = await import("../src/web/server.js");
165
+ const port = await getFreePort();
166
+ const server = http.createServer((_req, res) => {
167
+ res.writeHead(200, { "Content-Type": "text/plain" });
168
+ res.write("chunk");
169
+ // never call res.end(), so the client hangs forever
170
+ });
171
+ await new Promise<void>((r) => server.listen(port, () => r()));
172
+
173
+ const req = http.get(`http://127.0.0.1:${port}/hang`);
174
+ req.on("error", () => { /* expected */ });
175
+ await new Promise((r) => setTimeout(r, 100));
176
+
177
+ const t0 = Date.now();
178
+ await closeHttpServerGracefully(server);
179
+ expect(Date.now() - t0).toBeLessThan(2000);
180
+
181
+ // Port is reusable
182
+ const reuse = http.createServer();
183
+ await new Promise<void>((resolve, reject) => {
184
+ reuse.once("error", reject);
185
+ reuse.listen(port, () => resolve());
186
+ });
187
+ await new Promise<void>((r) => reuse.close(() => r()));
188
+ });
189
+ });
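The last test above pins down the contract of `closeHttpServerGracefully`: it must resolve quickly even while a client holds an open socket, release the port, and be safe to call again. The real implementation in `src/web/server.ts` is not part of this hunk, so the following is only a minimal sketch of one way to meet that contract. The name `closeGracefullySketch` and the `forceAfterMs` parameter are hypothetical, and `closeIdleConnections()` / `closeAllConnections()` assume Node 18.2 or later.

```typescript
import * as http from "node:http";

// Hypothetical sketch, not the project's actual closeHttpServerGracefully.
// Resolves once the server has stopped listening and all sockets are gone.
function closeGracefullySketch(
  server: http.Server,
  forceAfterMs = 500, // assumed grace period before sockets are destroyed
): Promise<void> {
  return new Promise<void>((resolve) => {
    // Idempotent: a server that is not listening has nothing to close.
    if (!server.listening) {
      resolve();
      return;
    }
    // After the grace period, destroy connections that are still open
    // (for example a response that never calls res.end()).
    const timer = setTimeout(() => server.closeAllConnections(), forceAfterMs);
    // Stop accepting new connections; the callback fires once every
    // existing connection has ended.
    server.close(() => {
      clearTimeout(timer);
      resolve();
    });
    // Drop keep-alive sockets that are not serving a request right now.
    server.closeIdleConnections();
  });
}
```

The promise never rejects, which matches the "never throw" requirement stated for the helper; whether the real code uses a grace period or destroys sockets immediately is not visible in this diff.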
@@ -0,0 +1,118 @@
1
+ /**
2
+ * Fix #16: Web server must never crash the bot.
3
+ *
4
+ * Colleague feedback (WhatsApp voice note, 2026-04-13):
5
+ * > The gateway binds to port 3100 like OpenClaw. When the bot
6
+ * > restarts, the port is often still held → catastrophic crash.
7
+ * > I ended up decoupling the gateway process completely, because
8
+ * > the actual bot runs independently of the gateway - it can still
9
+ * > answer Telegram even if the web endpoint isn't reachable yet.
10
+ * > It's weird that the main routine crashes when the port is busy.
11
+ * > It should just run in the background, watch for the port to
12
+ * > become free, and connect then. Zero impact on the main routine.
13
+ *
14
+ * This file tests the pure decision helper that the new startWebServer
15
+ * uses to choose between "try the next port immediately" and "retry
16
+ * the default port in the background after a delay".
17
+ *
18
+ * Contract:
19
+ * decideNextBindAction(err, attempt, opts)
20
+ *
21
+ * err.code = "EADDRINUSE", attempt + 1 < maxPortTries
22
+ * → { type: "retry-port", port: opts.originalPort + attempt + 1, attempt: attempt + 1 }
23
+ *
24
+ * err.code = "EADDRINUSE", attempt + 1 >= maxPortTries
25
+ * → { type: "retry-background", delayMs: opts.backgroundRetryMs, port: opts.originalPort }
26
+ *
27
+ * err.code = anything else (EACCES, ECONNRESET, "Listen method called twice"…)
28
+ * → { type: "retry-background", delayMs: opts.backgroundRetryMs, port: opts.originalPort }
29
+ *
30
+ * Pure function, no side effects, no timers, no I/O.
31
+ */
32
+ import { describe, it, expect } from "vitest";
33
+ import { decideNextBindAction } from "../src/web/bind-strategy.js";
34
+
35
+ const defaultOpts = {
36
+ originalPort: 3100,
37
+ maxPortTries: 20,
38
+ backgroundRetryMs: 30_000,
39
+ };
40
+
41
+ describe("decideNextBindAction (Fix #16)", () => {
42
+ it("retries on the next port when EADDRINUSE and attempts remain", () => {
43
+ const err = Object.assign(new Error("EADDRINUSE"), { code: "EADDRINUSE" });
44
+ const result = decideNextBindAction(err, 0, defaultOpts);
45
+ expect(result).toEqual({ type: "retry-port", port: 3101, attempt: 1 });
46
+ });
47
+
48
+ it("walks the port ladder across multiple attempts", () => {
49
+ const err = Object.assign(new Error("EADDRINUSE"), { code: "EADDRINUSE" });
50
+ expect(decideNextBindAction(err, 5, defaultOpts)).toEqual({
51
+ type: "retry-port",
52
+ port: 3106,
53
+ attempt: 6,
54
+ });
55
+ expect(decideNextBindAction(err, 18, defaultOpts)).toEqual({
56
+ type: "retry-port",
57
+ port: 3119,
58
+ attempt: 19,
59
+ });
60
+ });
61
+
62
+ it("switches to background retry when all port attempts are exhausted", () => {
63
+ const err = Object.assign(new Error("EADDRINUSE"), { code: "EADDRINUSE" });
64
+ const result = decideNextBindAction(err, 19, defaultOpts); // 20th failure
65
+ expect(result).toEqual({
66
+ type: "retry-background",
67
+ delayMs: 30_000,
68
+ port: 3100,
69
+ });
70
+ });
71
+
72
+ it("goes straight to background retry on non-EADDRINUSE errors", () => {
73
+ const err = Object.assign(new Error("EACCES"), { code: "EACCES" });
74
+ const result = decideNextBindAction(err, 0, defaultOpts);
75
+ expect(result).toEqual({
76
+ type: "retry-background",
77
+ delayMs: 30_000,
78
+ port: 3100,
79
+ });
80
+ });
81
+
82
+ it("handles errors without a .code field by doing background retry", () => {
83
+ const err = new Error("Listen method has been called more than once");
84
+ const result = decideNextBindAction(err, 3, defaultOpts);
85
+ expect(result.type).toBe("retry-background");
86
+ if (result.type === "retry-background") {
87
+ expect(result.port).toBe(3100);
88
+ }
89
+ });
90
+
91
+ it("respects custom maxPortTries", () => {
92
+ const err = Object.assign(new Error("EADDRINUSE"), { code: "EADDRINUSE" });
93
+ const opts = { ...defaultOpts, maxPortTries: 3 };
94
+ // attempts 0 and 1 still retry; attempt 2 (the 3rd failure) -> background
95
+ expect(decideNextBindAction(err, 0, opts).type).toBe("retry-port");
96
+ expect(decideNextBindAction(err, 1, opts).type).toBe("retry-port");
97
+ expect(decideNextBindAction(err, 2, opts).type).toBe("retry-background");
98
+ });
99
+
100
+ it("respects custom backgroundRetryMs", () => {
101
+ const err = Object.assign(new Error("EACCES"), { code: "EACCES" });
102
+ const opts = { ...defaultOpts, backgroundRetryMs: 5_000 };
103
+ const result = decideNextBindAction(err, 0, opts);
104
+ expect(result).toEqual({
105
+ type: "retry-background",
106
+ delayMs: 5_000,
107
+ port: 3100,
108
+ });
109
+ });
110
+
111
+ it("is pure: same input, same output, no mutation", () => {
112
+ const err = Object.assign(new Error("EADDRINUSE"), { code: "EADDRINUSE" });
113
+ const snapshot = JSON.stringify({ ...defaultOpts });
114
+ decideNextBindAction(err, 5, defaultOpts);
115
+ decideNextBindAction(err, 5, defaultOpts);
116
+ expect(JSON.stringify({ ...defaultOpts })).toBe(snapshot);
117
+ });
118
+ });
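The assertions in this test file are enough to reconstruct the shape of the decision helper. The actual `src/web/bind-strategy.ts` is not shown in this part of the diff, so the code below is a sketch inferred from the tests rather than the shipped source. The type names `BindOptions` and `BindAction` are assumptions; the function name, argument order and returned fields follow the test expectations.

```typescript
// Hypothetical reconstruction from the tests; not the shipped bind-strategy.ts.
interface BindOptions {
  originalPort: number;      // port the web UI prefers (3100 in the tests)
  maxPortTries: number;      // total bind attempts on the port ladder
  backgroundRetryMs: number; // delay before retrying the original port
}

type BindAction =
  | { type: "retry-port"; port: number; attempt: number }
  | { type: "retry-background"; delayMs: number; port: number };

function decideNextBindAction(
  err: unknown,
  attempt: number, // zero-based index of the attempt that just failed
  opts: BindOptions,
): BindAction {
  // Errors without a .code field fall through to the background branch.
  const code = (err as { code?: unknown } | null)?.code;
  const nextAttempt = attempt + 1;

  // Climb the ladder only for "port busy" and only while attempts remain.
  if (code === "EADDRINUSE" && nextAttempt < opts.maxPortTries) {
    return {
      type: "retry-port",
      port: opts.originalPort + nextAttempt,
      attempt: nextAttempt,
    };
  }

  // Ladder exhausted or a different error: back off and retry the
  // original port later, without touching the caller's options object.
  return {
    type: "retry-background",
    delayMs: opts.backgroundRetryMs,
    port: opts.originalPort,
  };
}
```

With `maxPortTries: 20`, attempt 18 still yields port 3119 and attempt 19 switches to the background branch, matching the "20th failure" case above. Keeping the function free of timers and I/O is what lets the tests call it directly with plain objects.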
@@ -17,7 +17,13 @@
17
17
  import { describe, it, expect } from "vitest";
18
18
  import http from "http";
19
19
  import { once } from "events";
20
- import { startWebServer, stopWebServer } from "../src/web/server.js";
20
+ // Fix #1 shipped as stopWebServer(server); Fix #16 (v4.9.4) promoted
21
+ // that to `closeHttpServerGracefully(server)` and reserved the name
22
+ // `stopWebServer()` for the module-state-aware shutdown. The underlying
23
+ // contract (close an http.Server even when clients hold open sockets,
24
+ // release the port, idempotent, never throw) is unchanged; these
25
+ // tests now exercise the renamed helper.
26
+ import { closeHttpServerGracefully as stopWebServer } from "../src/web/server.js";
21
27
 
22
28
  function getFreePort(): Promise<number> {
23
29
  return new Promise((resolve, reject) => {