@rljson/server 0.0.14 → 0.0.16
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.architecture.md +3 -1
- package/README.blog.md +8 -0
- package/README.trouble.md +33 -0
- package/dist/README.architecture.md +3 -1
- package/dist/README.blog.md +8 -0
- package/dist/README.trouble.md +33 -0
- package/dist/server.d.ts +28 -0
- package/dist/server.js +77 -0
- package/package.json +1 -1
package/README.architecture.md
CHANGED
@@ -86,7 +86,9 @@ The `Node` class sits above `Server` and `Client`, bridging `@rljson/network` to
 3. **Manages transport**: Uses injectable factories (`CreateHubTransport`/`CreateClientTransport`) to create the transport layer, keeping the Node class transport-agnostic.
 4. **Agent lifecycle**: An optional `createAgent` factory in `NodeDeps` is called on every `ready` event. The returned `AgentHandle.stop()` is called before the next role transition. This enables application-level wiring (e.g. FsAgent) without circular dependencies.
 5. **Serialized transitions**: Role transitions are queued — a new `role-changed` event waits for the previous transition to complete before starting. This prevents race conditions between teardown and setup.
-6. **
+6. **Hub-changed reconnect** (v0.0.14): Subscribes to `NetworkManager`'s `hub-changed` event in addition to `role-changed`. When the hub changes but the node's role stays `client`, the node tears down its connection and reconnects to the new hub. This prevents split-brain scenarios where clients remain attached to a stale hub.
+7. **Clean socket teardown** (v0.0.14): `_tearDownCurrentRole()` explicitly calls `disconnect()` on client sockets before clearing the reference. This prevents orphaned Socket.IO connections from auto-reconnecting to the old hub.
+8. **Error resilience**: Errors in user-provided code (agent factories, transport factories) are caught and logged. The node continues functioning — a failed transport degrades connectivity but doesn't crash, a failed agent leaves the node's core intact.
 
 ```text
 ┌─────────────────────────────────────────┐

package/README.blog.md
CHANGED
@@ -17,3 +17,11 @@ Add posts as Markdown entries in this file (newest last). Keep each post small a
 - Why it matters
 - Links: PRs, docs, demos
 ```
+
+## 2026-03-20 — v0.0.14: Split-brain fix and hub-changed reconnect
+
+- Node class now listens to `hub-changed` events from NetworkManager — clients reconnect when hub changes but role stays `client`
+- `_tearDownCurrentRole()` explicitly disconnects sockets before clearing references — prevents orphaned connections
+- Validated on 4-node Windows test lab: E2E Reports 18 & 19 both score **38/41 passed, 0 failures**
+- Previous Report 17 showed split-brain (two simultaneous hubs, 23/41 passed) — now fully resolved
+- PR: https://github.com/rljson/server/pull/14

package/README.trouble.md
CHANGED
@@ -10,9 +10,42 @@ found in the LICENSE file in the root of this package.
 
 ## Table of contents <!-- omit in toc -->
 
+- [Split-Brain: Clients not reconnecting on hub change (fixed in v0.0.14)](#split-brain-clients-not-reconnecting-on-hub-change-fixed-in-v0014)
 - [Vscode Windows: Debugging is not working](#vscode-windows-debugging-is-not-working)
 - [Test Isolation: Socket.IO event listener accumulation](#test-isolation-socketio-event-listener-accumulation)
 
+## Split-Brain: Clients not reconnecting on hub change (fixed in v0.0.14)
+
+Date: 2026-03-20
+
+**Problem:**
+
+In a 4-node deployment, two nodes simultaneously acted as hub (split-brain). Clients stayed connected to the old hub while a new hub was elected. File sync stopped working because the hub had no real clients.
+
+**Symptoms:**
+
+- E2E Report 17: 23/41 passed, 18 failed
+- Two nodes reporting `role=hub` simultaneously
+- Files written by one hub never appearing on clients
+- File counts diverging between nodes (hub accumulating files, clients stuck)
+
+**Root Cause:**
+
+Two bugs in the `Node` class:
+
+1. **Missing `hub-changed` listener**: Node only subscribed to `role-changed` from NetworkManager. When the hub changed but the node's role stayed `client`, the `role-changed` handler was skipped (same role). Clients never reconnected to the new hub.
+
+2. **No socket disconnect on teardown**: `_tearDownCurrentRole()` set `_clientSocket = undefined` without calling `disconnect()`. The orphaned Socket.IO connection kept auto-reconnecting to the old hub (especially with the `socket.connect()` reconnect fix from v0.0.13).
+
+**Solution (v0.0.14):**
+
+1. Added `_onHubChanged` listener that tears down and reconnects when hub changes while role stays `client`
+2. Added explicit `socket.disconnect()` call in `_tearDownCurrentRole()` before clearing the reference
+
+**Validation:**
+
+- E2E Reports 18 & 19: **38/41 passed, 0 failures** on 4-node test lab
+
 ## Vscode Windows: Debugging is not working
 
 Date: 2025-03-08

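Taken together, the two fixes amount to: disconnect before dropping the socket reference, and treat `hub-changed` as a reconnect trigger even when the role is unchanged. A minimal TypeScript sketch of that shape (the `FakeSocket` type, `connectTo`, and the log are illustrative; only the `_clientSocket` and `_tearDownCurrentRole` names come from the report above):

```typescript
// Minimal model of the v0.0.14 fixes. FakeSocket stands in for the
// real Socket.IO client socket; only the teardown ordering and the
// hub-changed handling are taken from the trouble report.
interface FakeSocket {
  connected: boolean;
  disconnect(): void;
}

function makeSocket(): FakeSocket {
  const socket: FakeSocket = {
    connected: true,
    disconnect: () => { socket.connected = false; },
  };
  return socket;
}

class NodeSketch {
  private _clientSocket?: FakeSocket;
  public log: string[] = [];

  connectTo(hub: string): void {
    this._clientSocket = makeSocket();
    this.log.push(`connected:${hub}`);
  }

  // Fix 2: disconnect BEFORE clearing the reference, so the old
  // connection cannot keep auto-reconnecting to the old hub.
  private _tearDownCurrentRole(): void {
    this._clientSocket?.disconnect();
    this._clientSocket = undefined;
  }

  // Fix 1: react to hub-changed even when the role stays `client`.
  onHubChanged(newHub: string): void {
    this._tearDownCurrentRole();
    this.connectTo(newHub);
  }
}

const node = new NodeSketch();
node.connectTo("hub-a");
node.onHubChanged("hub-b");
console.log(node.log.join(","));
```

Running the sketch connects to one hub and, on a hub change, tears down and reconnects to the new one, leaving no live socket behind for the old hub.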
package/dist/README.architecture.md
CHANGED
@@ -86,7 +86,9 @@ The `Node` class sits above `Server` and `Client`, bridging `@rljson/network` to
 3. **Manages transport**: Uses injectable factories (`CreateHubTransport`/`CreateClientTransport`) to create the transport layer, keeping the Node class transport-agnostic.
 4. **Agent lifecycle**: An optional `createAgent` factory in `NodeDeps` is called on every `ready` event. The returned `AgentHandle.stop()` is called before the next role transition. This enables application-level wiring (e.g. FsAgent) without circular dependencies.
 5. **Serialized transitions**: Role transitions are queued — a new `role-changed` event waits for the previous transition to complete before starting. This prevents race conditions between teardown and setup.
-6. **
+6. **Hub-changed reconnect** (v0.0.14): Subscribes to `NetworkManager`'s `hub-changed` event in addition to `role-changed`. When the hub changes but the node's role stays `client`, the node tears down its connection and reconnects to the new hub. This prevents split-brain scenarios where clients remain attached to a stale hub.
+7. **Clean socket teardown** (v0.0.14): `_tearDownCurrentRole()` explicitly calls `disconnect()` on client sockets before clearing the reference. This prevents orphaned Socket.IO connections from auto-reconnecting to the old hub.
+8. **Error resilience**: Errors in user-provided code (agent factories, transport factories) are caught and logged. The node continues functioning — a failed transport degrades connectivity but doesn't crash, a failed agent leaves the node's core intact.
 
 ```text
 ┌─────────────────────────────────────────┐

package/dist/README.blog.md
CHANGED
@@ -17,3 +17,11 @@ Add posts as Markdown entries in this file (newest last). Keep each post small a
 - Why it matters
 - Links: PRs, docs, demos
 ```
+
+## 2026-03-20 — v0.0.14: Split-brain fix and hub-changed reconnect
+
+- Node class now listens to `hub-changed` events from NetworkManager — clients reconnect when hub changes but role stays `client`
+- `_tearDownCurrentRole()` explicitly disconnects sockets before clearing references — prevents orphaned connections
+- Validated on 4-node Windows test lab: E2E Reports 18 & 19 both score **38/41 passed, 0 failures**
+- Previous Report 17 showed split-brain (two simultaneous hubs, 23/41 passed) — now fully resolved
+- PR: https://github.com/rljson/server/pull/14

package/dist/README.trouble.md
CHANGED
@@ -10,9 +10,42 @@ found in the LICENSE file in the root of this package.
 
 ## Table of contents <!-- omit in toc -->
 
+- [Split-Brain: Clients not reconnecting on hub change (fixed in v0.0.14)](#split-brain-clients-not-reconnecting-on-hub-change-fixed-in-v0014)
 - [Vscode Windows: Debugging is not working](#vscode-windows-debugging-is-not-working)
 - [Test Isolation: Socket.IO event listener accumulation](#test-isolation-socketio-event-listener-accumulation)
 
+## Split-Brain: Clients not reconnecting on hub change (fixed in v0.0.14)
+
+Date: 2026-03-20
+
+**Problem:**
+
+In a 4-node deployment, two nodes simultaneously acted as hub (split-brain). Clients stayed connected to the old hub while a new hub was elected. File sync stopped working because the hub had no real clients.
+
+**Symptoms:**
+
+- E2E Report 17: 23/41 passed, 18 failed
+- Two nodes reporting `role=hub` simultaneously
+- Files written by one hub never appearing on clients
+- File counts diverging between nodes (hub accumulating files, clients stuck)
+
+**Root Cause:**
+
+Two bugs in the `Node` class:
+
+1. **Missing `hub-changed` listener**: Node only subscribed to `role-changed` from NetworkManager. When the hub changed but the node's role stayed `client`, the `role-changed` handler was skipped (same role). Clients never reconnected to the new hub.
+
+2. **No socket disconnect on teardown**: `_tearDownCurrentRole()` set `_clientSocket = undefined` without calling `disconnect()`. The orphaned Socket.IO connection kept auto-reconnecting to the old hub (especially with the `socket.connect()` reconnect fix from v0.0.13).
+
+**Solution (v0.0.14):**
+
+1. Added `_onHubChanged` listener that tears down and reconnects when hub changes while role stays `client`
+2. Added explicit `socket.disconnect()` call in `_tearDownCurrentRole()` before clearing the reference
+
+**Validation:**
+
+- E2E Reports 18 & 19: **38/41 passed, 0 failures** on 4-node test lab
+
 ## Vscode Windows: Debugging is not working
 
 Date: 2025-03-08

package/dist/server.d.ts
CHANGED
@@ -49,6 +49,19 @@ export interface ServerOptions {
      * Defaults to false (local cache enabled).
      */
     disableLocalCache?: boolean;
+    /**
+     * Interval in milliseconds for application-level health checks.
+     * The server pings each connected client and prunes those that
+     * do not respond within {@link healthCheckTimeoutMs}.
+     * Defaults to 30 000 (30 s). Set to 0 to disable health checks.
+     */
+    healthCheckIntervalMs?: number;
+    /**
+     * Timeout in milliseconds to wait for a health check pong.
+     * Clients that do not respond within this window are pruned.
+     * Defaults to 10 000 (10 s).
+     */
+    healthCheckTimeoutMs?: number;
 }
 export declare class Server extends BaseNode {
     private _route;
@@ -77,6 +90,9 @@ export declare class Server extends BaseNode {
     private _disableLocalCache;
     private _latestRef;
     private _bootstrapHeartbeatTimer?;
+    private _healthCheckIntervalMs;
+    private _healthCheckTimeoutMs;
+    private _healthCheckTimer?;
     private _tornDown;
     constructor(_route: Route, _localIo: Io, _localBs: Bs, options?: ServerOptions);
     /**
@@ -181,6 +197,18 @@ export declare class Server extends BaseNode {
      * Each client's dedup pipeline will filter out refs it already has.
      */
     private _broadcastBootstrapHeartbeat;
+    /**
+     * Starts the periodic health check timer if not already running.
+     * Each cycle sends a ping to every non-broadcast client and waits
+     * for a pong. Clients that do not respond are pruned.
+     */
+    private _startHealthChecks;
+    /**
+     * Sends a health ping to each connected (non-broadcast) client.
+     * If a client does not respond within `_healthCheckTimeoutMs`,
+     * the server force-disconnects and removes it.
+     */
+    private _runHealthCheck;
     /**
      * Starts the periodic bootstrap heartbeat timer if configured
      * and not already running.

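The defaulting of the new `ServerOptions` fields follows the pattern visible in `dist/server.js` (`?? 3e4` and `?? 1e4`), with `0` preserved so health checks can be disabled. A small self-contained sketch of that resolution, where `resolveHealthOptions` is a hypothetical helper and not part of the package API:

```typescript
// Hypothetical helper mirroring the option defaulting seen in dist/server.js.
interface HealthOptions {
  healthCheckIntervalMs?: number; // 0 disables health checks
  healthCheckTimeoutMs?: number;
}

function resolveHealthOptions(options?: HealthOptions) {
  return {
    // `??` (unlike `||`) keeps an explicit 0, so "disabled" survives defaulting.
    intervalMs: options?.healthCheckIntervalMs ?? 30_000, // 3e4 in dist
    timeoutMs: options?.healthCheckTimeoutMs ?? 10_000,   // 1e4 in dist
  };
}

console.log(JSON.stringify(resolveHealthOptions()));
// prints: {"intervalMs":30000,"timeoutMs":10000}
console.log(JSON.stringify(resolveHealthOptions({ healthCheckIntervalMs: 0 })));
// prints: {"intervalMs":0,"timeoutMs":10000}
```

Using `??` rather than `||` here is what makes `healthCheckIntervalMs: 0` a valid "off switch", matching the `_healthCheckIntervalMs <= 0` guard in `_startHealthChecks`.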
package/dist/server.js
CHANGED
@@ -393,9 +393,14 @@ class Client extends BaseNode {
     };
     socket.on("disconnect", disconnectHandler);
     socket.on("connect", reconnectHandler);
+    const healthHandler = (payload) => {
+      sockets.ioUp.emit("__health:pong", { nonce: payload.nonce });
+    };
+    sockets.ioDown.on("__health:ping", healthHandler);
     this._connectionCleanup = () => {
       socket.off("disconnect", disconnectHandler);
       socket.off("connect", reconnectHandler);
+      sockets.ioDown.off("__health:ping", healthHandler);
     };
   }
   /**
@@ -655,6 +660,8 @@ class Server extends BaseNode {
     this._logger = options?.logger ?? noopLogger;
     this._peerInitTimeoutMs = options?.peerInitTimeoutMs ?? 3e4;
     this._disableLocalCache = options?.disableLocalCache ?? false;
+    this._healthCheckIntervalMs = options?.healthCheckIntervalMs ?? 3e4;
+    this._healthCheckTimeoutMs = options?.healthCheckTimeoutMs ?? 1e4;
     this._syncConfig = options?.syncConfig;
     this._refLogSize = options?.refLogSize ?? 1e3;
     this._ackTimeoutMs = options?.ackTimeoutMs ?? options?.syncConfig?.ackTimeoutMs ?? 1e4;
@@ -728,6 +735,10 @@ class Server extends BaseNode {
   // Bootstrap state
   _latestRef;
   _bootstrapHeartbeatTimer;
+  // Health check state
+  _healthCheckIntervalMs;
+  _healthCheckTimeoutMs;
+  _healthCheckTimer;
   _tornDown = false;
   /**
    * Initializes Io and Bs multis on the server.
@@ -786,6 +797,7 @@ class Server extends BaseNode {
     this._registerDisconnectHandler(clientId, ioUp);
     this._sendBootstrap(ioDown);
     this._startBootstrapHeartbeat();
+    this._startHealthChecks();
     this._logger.info("Server", "Client socket added successfully", {
       clientId,
       totalClients: this._clients.size
@@ -836,6 +848,7 @@ class Server extends BaseNode {
     this._registerDisconnectHandler(clientId, ioUp);
     this._sendBootstrap(ioDown);
     this._startBootstrapHeartbeat();
+    this._startHealthChecks();
     this._logger.info("Server", "Broadcast-only socket added", {
       clientId,
       totalClients: this._clients.size
@@ -1034,6 +1047,54 @@ class Server extends BaseNode {
       ioDown.emit(this._events.bootstrap, payload);
     }
   }
+  // ...........................................................................
+  // Health checks
+  // ...........................................................................
+  /**
+   * Starts the periodic health check timer if not already running.
+   * Each cycle sends a ping to every non-broadcast client and waits
+   * for a pong. Clients that do not respond are pruned.
+   */
+  _startHealthChecks() {
+    if (this._healthCheckTimer || this._healthCheckIntervalMs <= 0) return;
+    this._healthCheckTimer = setInterval(() => {
+      this._runHealthCheck();
+    }, this._healthCheckIntervalMs);
+    this._healthCheckTimer.unref();
+  }
+  /**
+   * Sends a health ping to each connected (non-broadcast) client.
+   * If a client does not respond within `_healthCheckTimeoutMs`,
+   * the server force-disconnects and removes it.
+   */
+  _runHealthCheck() {
+    for (const [clientId, { ioUp, ioDown }] of this._clients.entries()) {
+      if (clientId.startsWith("broadcast_")) continue;
+      const nonce = Math.random().toString(36).slice(2);
+      let resolved = false;
+      const handler = (payload) => {
+        if (payload?.nonce !== nonce) return;
+        resolved = true;
+        ioUp.off("__health:pong", handler);
+        clearTimeout(timer);
+      };
+      ioUp.on("__health:pong", handler);
+      const timer = setTimeout(() => {
+        if (resolved) return;
+        ioUp.off("__health:pong", handler);
+        this._logger.warn(
+          "Server.Health",
+          "Client failed health check — pruning",
+          { clientId }
+        );
+        if ("disconnect" in ioUp) {
+          ioUp.disconnect(true);
+        }
+        this.removeSocket(clientId);
+      }, this._healthCheckTimeoutMs);
+      ioDown.emit("__health:ping", { nonce });
+    }
+  }
   /**
    * Starts the periodic bootstrap heartbeat timer if configured
    * and not already running.
@@ -1315,6 +1376,10 @@ class Server extends BaseNode {
       clearInterval(this._bootstrapHeartbeatTimer);
       this._bootstrapHeartbeatTimer = void 0;
     }
+    if (this._healthCheckTimer) {
+      clearInterval(this._healthCheckTimer);
+      this._healthCheckTimer = void 0;
+    }
    this._removeAllListeners();
    for (const cleanup of this._disconnectCleanups.values()) {
      cleanup();
@@ -1569,6 +1634,18 @@ class Node {
         await this._becomeClient();
         break;
     }
+    if (!this._running) return;
+    const networkRole = this._networkManager.getTopology().myRole;
+    if (networkRole !== this._role && networkRole !== "unassigned") {
+      this._logger.info(
+        "Node",
+        `Reconciling stale role: node=${this._role} → network=${networkRole}`
+      );
+      await this._performTransition({
+        previous: this._role,
+        current: networkRole
+      });
+    }
   }
   async _becomeHub() {
     await this._ioMem.init();