@rljson/server 0.0.14 → 0.0.16

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -86,7 +86,9 @@ The `Node` class sits above `Server` and `Client`, bridging `@rljson/network` to
 3. **Manages transport**: Uses injectable factories (`CreateHubTransport`/`CreateClientTransport`) to create the transport layer, keeping the Node class transport-agnostic.
 4. **Agent lifecycle**: An optional `createAgent` factory in `NodeDeps` is called on every `ready` event. The returned `AgentHandle.stop()` is called before the next role transition. This enables application-level wiring (e.g. FsAgent) without circular dependencies.
 5. **Serialized transitions**: Role transitions are queued — a new `role-changed` event waits for the previous transition to complete before starting. This prevents race conditions between teardown and setup.
-6. **Error resilience**: Errors in user-provided code (agent factories, transport factories) are caught and logged. The node continues functioning a failed transport degrades connectivity but doesn't crash, a failed agent leaves the node's core intact.
+6. **Hub-changed reconnect** (v0.0.14): Subscribes to `NetworkManager`'s `hub-changed` event in addition to `role-changed`. When the hub changes but the node's role stays `client`, the node tears down its connection and reconnects to the new hub. This prevents split-brain scenarios where clients remain attached to a stale hub.
+7. **Clean socket teardown** (v0.0.14): `_tearDownCurrentRole()` explicitly calls `disconnect()` on client sockets before clearing the reference. This prevents orphaned Socket.IO connections from auto-reconnecting to the old hub.
+8. **Error resilience**: Errors in user-provided code (agent factories, transport factories) are caught and logged. The node continues functioning — a failed transport degrades connectivity but doesn't crash, a failed agent leaves the node's core intact.

 ```text
 ┌─────────────────────────────────────────┐
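The hub-changed reconnect described in point 6 can be modeled in isolation. This is an illustrative TypeScript sketch, not the package's real API: `ReconnectSketch`, the string hub payload, and the `reconnects` counter are stand-ins for the Node class's actual wiring through `@rljson/network`.

```typescript
import { EventEmitter } from "node:events";

type Role = "hub" | "client" | "unassigned";

class ReconnectSketch {
  role: Role = "client";
  currentHub?: string;
  reconnects = 0;

  constructor(manager: EventEmitter) {
    manager.on("hub-changed", (hub: string) => {
      // A role-changed handler would skip here (role unchanged), which was
      // exactly the pre-v0.0.14 bug: nobody reconnected to the new hub.
      if (this.role === "client" && hub !== this.currentHub) {
        this.currentHub = hub; // remember the new hub
        this.reconnects += 1; // tear down the old socket, dial the new one
      }
    });
  }
}
```

A repeated `hub-changed` with the same hub is a no-op; only a genuine hub move triggers a reconnect.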
package/README.blog.md CHANGED
@@ -17,3 +17,11 @@ Add posts as Markdown entries in this file (newest last). Keep each post small a
 - Why it matters
 - Links: PRs, docs, demos
 ```
+
+## 2026-03-20 — v0.0.14: Split-brain fix and hub-changed reconnect
+
+- Node class now listens to `hub-changed` events from NetworkManager — clients reconnect when the hub changes but their role stays `client`
+- `_tearDownCurrentRole()` explicitly disconnects sockets before clearing references — prevents orphaned connections
+- Validated on a 4-node Windows test lab: E2E Reports 18 & 19 both score **38/41 passed, 0 failures**
+- Previous Report 17 showed split-brain (two simultaneous hubs, 23/41 passed) — now fully resolved
+- PR: https://github.com/rljson/server/pull/14
package/README.trouble.md CHANGED
@@ -10,9 +10,42 @@ found in the LICENSE file in the root of this package.

 ## Table of contents <!-- omit in toc -->

+- [Split-Brain: Clients not reconnecting on hub change (fixed in v0.0.14)](#split-brain-clients-not-reconnecting-on-hub-change-fixed-in-v0014)
 - [Vscode Windows: Debugging is not working](#vscode-windows-debugging-is-not-working)
 - [Test Isolation: Socket.IO event listener accumulation](#test-isolation-socketio-event-listener-accumulation)

+## Split-Brain: Clients not reconnecting on hub change (fixed in v0.0.14)
+
+Date: 2026-03-20
+
+**Problem:**
+
+In a 4-node deployment, two nodes simultaneously acted as hub (split-brain). Clients stayed connected to the old hub while a new hub was elected. File sync stopped working because the hub had no real clients.
+
+**Symptoms:**
+
+- E2E Report 17: 23/41 passed, 18 failed
+- Two nodes reporting `role=hub` simultaneously
+- Files written by one hub never appearing on clients
+- File counts diverging between nodes (hub accumulating files, clients stuck)
+
+**Root Cause:**
+
+Two bugs in the `Node` class:
+
+1. **Missing `hub-changed` listener**: Node only subscribed to `role-changed` from NetworkManager. When the hub changed but the node's role stayed `client`, the `role-changed` handler skipped (same role). Clients never reconnected to the new hub.
+
+2. **No socket disconnect on teardown**: `_tearDownCurrentRole()` set `_clientSocket = undefined` without calling `disconnect()`. The orphaned Socket.IO connection kept auto-reconnecting to the old hub (especially with the `socket.connect()` reconnect fix from v0.0.13).
+
+**Solution (v0.0.14):**
+
+1. Added an `_onHubChanged` listener that tears down and reconnects when the hub changes while the role stays `client`
+2. Added an explicit `socket.disconnect()` call in `_tearDownCurrentRole()` before clearing the reference
+
+**Validation:**
+
+- E2E Reports 18 & 19: **38/41 passed, 0 failures** on the 4-node test lab
+
 ## Vscode Windows: Debugging is not working

 Date: 2025-03-08
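The teardown half of the fix boils down to one rule: disconnect before dropping the reference. A minimal sketch, with a stand-in socket interface in place of the real Socket.IO client:

```typescript
// Stand-in for a Socket.IO client socket — only the method this sketch needs.
interface ClientSocket {
  disconnect(): void;
}

class TeardownSketch {
  clientSocket?: ClientSocket;

  attach(socket: ClientSocket): void {
    this.clientSocket = socket;
  }

  // Before v0.0.14 this only cleared the reference; the orphaned socket
  // kept auto-reconnecting to the old hub.
  tearDownCurrentRole(): void {
    this.clientSocket?.disconnect(); // the fix: hang up first
    this.clientSocket = undefined;
  }
}
```

The optional chaining also makes a second teardown harmless: once the reference is cleared, `disconnect()` is not called again.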
package/dist/server.d.ts CHANGED
@@ -49,6 +49,19 @@ export interface ServerOptions {
      * Defaults to false (local cache enabled).
      */
     disableLocalCache?: boolean;
+    /**
+     * Interval in milliseconds for application-level health checks.
+     * The server pings each connected client and prunes those that
+     * do not respond within {@link healthCheckTimeoutMs}.
+     * Defaults to 30 000 (30 s). Set to 0 to disable health checks.
+     */
+    healthCheckIntervalMs?: number;
+    /**
+     * Timeout in milliseconds to wait for a health check pong.
+     * Clients that do not respond within this window are pruned.
+     * Defaults to 10 000 (10 s).
+     */
+    healthCheckTimeoutMs?: number;
 }
 export declare class Server extends BaseNode {
     private _route;
@@ -77,6 +90,9 @@ export declare class Server extends BaseNode {
     private _disableLocalCache;
     private _latestRef;
     private _bootstrapHeartbeatTimer?;
+    private _healthCheckIntervalMs;
+    private _healthCheckTimeoutMs;
+    private _healthCheckTimer?;
     private _tornDown;
     constructor(_route: Route, _localIo: Io, _localBs: Bs, options?: ServerOptions);
     /**
@@ -181,6 +197,18 @@ export declare class Server extends BaseNode {
      * Each client's dedup pipeline will filter out refs it already has.
      */
     private _broadcastBootstrapHeartbeat;
+    /**
+     * Starts the periodic health check timer if not already running.
+     * Each cycle sends a ping to every non-broadcast client and waits
+     * for a pong. Clients that do not respond are pruned.
+     */
+    private _startHealthChecks;
+    /**
+     * Sends a health ping to each connected (non-broadcast) client.
+     * If a client does not respond within `_healthCheckTimeoutMs`,
+     * the server force-disconnects and removes it.
+     */
+    private _runHealthCheck;
     /**
      * Starts the periodic bootstrap heartbeat timer if configured
      * and not already running.
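Given the documented defaults above (30 000 ms interval, 10 000 ms timeout, interval 0 disables checks), option resolution can be sketched as a small standalone helper. `resolveHealthOptions` and its return shape are illustrative, not part of the package:

```typescript
// Mirrors the documented ServerOptions defaults for health checks.
interface HealthOptions {
  healthCheckIntervalMs?: number;
  healthCheckTimeoutMs?: number;
}

function resolveHealthOptions(options?: HealthOptions) {
  const intervalMs = options?.healthCheckIntervalMs ?? 30_000;
  return {
    intervalMs,
    timeoutMs: options?.healthCheckTimeoutMs ?? 10_000,
    enabled: intervalMs > 0, // interval of 0 (or less) disables checks
  };
}
```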
package/dist/server.js CHANGED
@@ -393,9 +393,14 @@ class Client extends BaseNode {
     };
     socket.on("disconnect", disconnectHandler);
     socket.on("connect", reconnectHandler);
+    const healthHandler = (payload) => {
+      sockets.ioUp.emit("__health:pong", { nonce: payload.nonce });
+    };
+    sockets.ioDown.on("__health:ping", healthHandler);
     this._connectionCleanup = () => {
       socket.off("disconnect", disconnectHandler);
       socket.off("connect", reconnectHandler);
+      sockets.ioDown.off("__health:ping", healthHandler);
     };
   }
   /**
@@ -655,6 +660,8 @@ class Server extends BaseNode {
     this._logger = options?.logger ?? noopLogger;
     this._peerInitTimeoutMs = options?.peerInitTimeoutMs ?? 3e4;
     this._disableLocalCache = options?.disableLocalCache ?? false;
+    this._healthCheckIntervalMs = options?.healthCheckIntervalMs ?? 3e4;
+    this._healthCheckTimeoutMs = options?.healthCheckTimeoutMs ?? 1e4;
     this._syncConfig = options?.syncConfig;
     this._refLogSize = options?.refLogSize ?? 1e3;
     this._ackTimeoutMs = options?.ackTimeoutMs ?? options?.syncConfig?.ackTimeoutMs ?? 1e4;
@@ -728,6 +735,10 @@ class Server extends BaseNode {
   // Bootstrap state
   _latestRef;
   _bootstrapHeartbeatTimer;
+  // Health check state
+  _healthCheckIntervalMs;
+  _healthCheckTimeoutMs;
+  _healthCheckTimer;
   _tornDown = false;
   /**
    * Initializes Io and Bs multis on the server.
@@ -786,6 +797,7 @@ class Server extends BaseNode {
     this._registerDisconnectHandler(clientId, ioUp);
     this._sendBootstrap(ioDown);
     this._startBootstrapHeartbeat();
+    this._startHealthChecks();
     this._logger.info("Server", "Client socket added successfully", {
       clientId,
       totalClients: this._clients.size
@@ -836,6 +848,7 @@ class Server extends BaseNode {
     this._registerDisconnectHandler(clientId, ioUp);
     this._sendBootstrap(ioDown);
     this._startBootstrapHeartbeat();
+    this._startHealthChecks();
     this._logger.info("Server", "Broadcast-only socket added", {
       clientId,
       totalClients: this._clients.size
@@ -1034,6 +1047,54 @@ class Server extends BaseNode {
       ioDown.emit(this._events.bootstrap, payload);
     }
   }
+  // ...........................................................................
+  // Health checks
+  // ...........................................................................
+  /**
+   * Starts the periodic health check timer if not already running.
+   * Each cycle sends a ping to every non-broadcast client and waits
+   * for a pong. Clients that do not respond are pruned.
+   */
+  _startHealthChecks() {
+    if (this._healthCheckTimer || this._healthCheckIntervalMs <= 0) return;
+    this._healthCheckTimer = setInterval(() => {
+      this._runHealthCheck();
+    }, this._healthCheckIntervalMs);
+    this._healthCheckTimer.unref();
+  }
+  /**
+   * Sends a health ping to each connected (non-broadcast) client.
+   * If a client does not respond within `_healthCheckTimeoutMs`,
+   * the server force-disconnects and removes it.
+   */
+  _runHealthCheck() {
+    for (const [clientId, { ioUp, ioDown }] of this._clients.entries()) {
+      if (clientId.startsWith("broadcast_")) continue;
+      const nonce = Math.random().toString(36).slice(2);
+      let resolved = false;
+      const handler = (payload) => {
+        if (payload?.nonce !== nonce) return;
+        resolved = true;
+        ioUp.off("__health:pong", handler);
+        clearTimeout(timer);
+      };
+      ioUp.on("__health:pong", handler);
+      const timer = setTimeout(() => {
+        if (resolved) return;
+        ioUp.off("__health:pong", handler);
+        this._logger.warn(
+          "Server.Health",
+          "Client failed health check — pruning",
+          { clientId }
+        );
+        if ("disconnect" in ioUp) {
+          ioUp.disconnect(true);
+        }
+        this.removeSocket(clientId);
+      }, this._healthCheckTimeoutMs);
+      ioDown.emit("__health:ping", { nonce });
+    }
+  }
   /**
    * Starts the periodic bootstrap heartbeat timer if configured
    * and not already running.
@@ -1315,6 +1376,10 @@ class Server extends BaseNode {
       clearInterval(this._bootstrapHeartbeatTimer);
       this._bootstrapHeartbeatTimer = void 0;
     }
+    if (this._healthCheckTimer) {
+      clearInterval(this._healthCheckTimer);
+      this._healthCheckTimer = void 0;
+    }
     this._removeAllListeners();
     for (const cleanup of this._disconnectCleanups.values()) {
       cleanup();
@@ -1569,6 +1634,18 @@ class Node {
         await this._becomeClient();
         break;
     }
+    if (!this._running) return;
+    const networkRole = this._networkManager.getTopology().myRole;
+    if (networkRole !== this._role && networkRole !== "unassigned") {
+      this._logger.info(
+        "Node",
+        `Reconciling stale role: node=${this._role} → network=${networkRole}`
+      );
+      await this._performTransition({
+        previous: this._role,
+        current: networkRole
+      });
+    }
   }
   async _becomeHub() {
     await this._ioMem.init();
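The `_runHealthCheck` implementation above follows a nonce-tagged ping/pong pattern: each ping carries a fresh random nonce, only a pong echoing that nonce counts, and the client is pruned if the timeout fires first. A standalone sketch of the same pattern, with plain `EventEmitter`s standing in for the server's socket pair:

```typescript
import { EventEmitter } from "node:events";

// One health-check cycle for a single client, mirroring the pattern in
// server.js. `onPrune` stands in for disconnect + removeSocket.
function healthCheck(
  ioDown: EventEmitter, // server → client channel (ping goes out here)
  ioUp: EventEmitter, // client → server channel (pong comes back here)
  timeoutMs: number,
  onPrune: () => void,
): void {
  const nonce = Math.random().toString(36).slice(2);
  let resolved = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const handler = (payload: { nonce?: string }) => {
    if (payload?.nonce !== nonce) return; // ignore pongs from older pings
    resolved = true;
    ioUp.off("__health:pong", handler);
    clearTimeout(timer);
  };
  ioUp.on("__health:pong", handler);
  timer = setTimeout(() => {
    if (resolved) return;
    ioUp.off("__health:pong", handler);
    onPrune(); // no pong within the window: disconnect and remove the client
  }, timeoutMs);
  ioDown.emit("__health:ping", { nonce });
}
```

Tagging the pong with the ping's nonce prevents a late pong from an earlier cycle from keeping a dead client alive, and removing the listener on both paths avoids the listener accumulation issue documented in README.trouble.md.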
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@rljson/server",
-  "version": "0.0.14",
+  "version": "0.0.16",
   "description": "Rljson server description",
   "homepage": "https://github.com/rljson/server",
   "bugs": "https://github.com/rljson/server/issues",