rollbridge 0.1.4 → 0.1.6
- package/README.md +137 -4
- package/TODO.md +47 -45
- package/docs/cli.md +169 -6
- package/docs/config.md +160 -3
- package/docs/logging.md +77 -0
- package/docs/nginx.md +104 -0
- package/docs/releasing.md +53 -0
- package/docs/tensorbuzz-runbook.md +129 -0
- package/docs/velocious.md +238 -0
- package/docs/workers.md +115 -0
- package/package.json +3 -2
- package/src/cli.js +317 -1
- package/src/config.js +240 -6
- package/src/daemon.js +284 -4
- package/src/doctor.js +177 -0
- package/src/event-log.js +47 -0
- package/src/managed-process.js +287 -22
- package/src/process-memory.js +110 -0
- package/src/recover.js +134 -0
- package/src/release-group.js +80 -21
- package/src/state-store.js +103 -0
- package/src/system-ids.js +71 -0
- package/src/template.js +32 -0
- package/test/completion.test.js +64 -0
- package/test/config-validation.test.js +267 -0
- package/test/doctor.test.js +205 -3
- package/test/event-log.test.js +46 -0
- package/test/fixtures/memory-hog.js +19 -0
- package/test/managed-process.test.js +376 -0
- package/test/process-memory.test.js +40 -0
- package/test/recover.test.js +162 -0
- package/test/release-group.test.js +22 -0
- package/test/rollbridge.test.js +716 -6
- package/test/state-store.test.js +69 -0
- package/test/system-ids.test.js +24 -0
- package/scripts/release-patch.js +0 -83
package/README.md
CHANGED
|
@@ -84,13 +84,73 @@ more or fewer lines for chatty or quiet processes.
|
|
|
84
84
|
Set `control.mode` to an octal permission string (for example `"660"`) to
|
|
85
85
|
chmod the control socket after it binds. This restricts which users can send
|
|
86
86
|
control commands — useful when several deploy users share a group. When unset,
|
|
87
|
-
the socket keeps the default permissions from the daemon's umask.
|
|
87
|
+
the socket keeps the default permissions from the daemon's umask. Pair it with
|
|
88
|
+
`control.owner` and `control.group` (a numeric id or a user/group name) to
|
|
89
|
+
`chown` the socket to a shared deploy group; names resolve via
|
|
90
|
+
`/etc/passwd`/`/etc/group`, and the daemon must run as a user allowed to chown it.
|
|
88
91
|
|
|
89
92
|
Set the proxied process's `health.startDelayMs` (default `0`) to wait that long
|
|
90
93
|
after the process starts before the first health probe — like a readiness
|
|
91
94
|
probe's initial delay, useful for apps with a known boot time. The delay runs
|
|
92
95
|
before the `health.timeoutMs` window begins.
|
|
93
96
|
|
|
97
|
+
Set a process's `restart` policy to control automatic restarts after a crash.
|
|
98
|
+
`restart.maxRestarts` caps how many restarts are allowed within `restart.windowMs`
|
|
99
|
+
before Rollbridge gives up and leaves the process `failed` (`maxRestarts: 0`
|
|
100
|
+
disables restarts entirely), while `restart.backoffFactor` — with an optional
|
|
101
|
+
`restart.maxDelayMs` cap — backs off the `restartDelayMs` delay on each successive
|
|
102
|
+
restart. With no `restart` block, a crashed process keeps restarting after
|
|
103
|
+
`restartDelayMs`, as before. See [`docs/config.md`](docs/config.md#processesrestart).
|
|
104
|
+
|
|
105
|
+
```js
|
|
106
|
+
restart: {maxRestarts: 5, windowMs: 60000, backoffFactor: 2, maxDelayMs: 30000}
|
|
107
|
+
```
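
As a rough illustration of how these fields could interact, the sketch below computes the delay before each restart and decides when to give up. The helper names (`nextRestartDelay`, `shouldGiveUp`) and the exact formula (`restartDelayMs * backoffFactor ** n`, counted over a trailing window) are assumptions for illustration, not Rollbridge's actual implementation.

```js
// Hypothetical helpers: not part of Rollbridge's API.
// Assumes the delay grows as restartDelayMs * backoffFactor^n, capped at maxDelayMs.
function nextRestartDelay(restartDelayMs, restart, restartsSoFar) {
  const factor = restart.backoffFactor ?? 1
  const raw = restartDelayMs * factor ** restartsSoFar
  return restart.maxDelayMs === undefined ? raw : Math.min(raw, restart.maxDelayMs)
}

// Assumes "give up" means maxRestarts restarts already happened inside the trailing window.
// With maxRestarts: 0 this is always true, so the process is never restarted.
function shouldGiveUp(restart, restartTimestamps, now) {
  const recent = restartTimestamps.filter((t) => now - t < restart.windowMs)
  return recent.length >= restart.maxRestarts
}

const restart = {maxRestarts: 5, windowMs: 60000, backoffFactor: 2, maxDelayMs: 30000}
const delays = [0, 1, 2, 3, 4, 5].map((n) => nextRestartDelay(1000, restart, n))
console.log(delays)
```

Under these assumptions and a `restartDelayMs` of `1000`, the delays are 1000, 2000, 4000, 8000, and 16000 ms, and then 30000 ms once the `maxDelayMs` cap applies.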
|
|
108
|
+
|
|
109
|
+
Set a process's `memory` policy to supervise its resident memory (RSS) and
|
|
110
|
+
gracefully restart it when it grows too large. `memory.limitBytes` is the RSS
|
|
111
|
+
limit (measured across the whole process group, not just the wrapper);
|
|
112
|
+
`memory.warnBytes` logs a warning before the limit; `memory.checkIntervalMs`
|
|
113
|
+
(default `5000`) sets how often RSS is sampled. A memory restart is reported in
|
|
114
|
+
`status` and recorded in `events` (a `process started` with `reason: "memory"`).
|
|
115
|
+
See [`docs/config.md`](docs/config.md#processesmemory).
|
|
116
|
+
|
|
117
|
+
```js
|
|
118
|
+
memory: {limitBytes: 536870912, warnBytes: 402653184, checkIntervalMs: 5000}
|
|
119
|
+
```
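
The byte values above are easier to read as MiB multiples. The sketch below shows that arithmetic and one way a single RSS sample might be classified against the policy. `classifyRss` is a hypothetical helper, and whether Rollbridge compares with `>=` or `>` is an assumption here.

```js
// Hypothetical helpers for illustration only.
const MiB = 1024 * 1024

// Classify one RSS sample against a `memory` policy.
function classifyRss(rssBytes, memory) {
  if (rssBytes >= memory.limitBytes) return "restart"
  if (memory.warnBytes !== undefined && rssBytes >= memory.warnBytes) return "warn"
  return "ok"
}

// 512 MiB limit and 384 MiB warning threshold, the same values as the example above.
const memory = {limitBytes: 512 * MiB, warnBytes: 384 * MiB, checkIntervalMs: 5000}
console.log(memory.limitBytes, memory.warnBytes)
console.log(classifyRss(400 * MiB, memory))
```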
|
|
120
|
+
|
|
121
|
+
Set a process's `stopSignal` (default `"SIGTERM"`) to the signal it quiets on, so
|
|
122
|
+
a worker finishes its in-flight work before exiting. Rollbridge sends `stopSignal`
|
|
123
|
+
to gracefully stop the process and `SIGKILL`s it only if it hasn't exited within
|
|
124
|
+
`gracefulStopMs`. For example, a job worker that drains on `SIGINT`:
|
|
125
|
+
|
|
126
|
+
```js
|
|
127
|
+
{id: "worker", policy: "companion", command: "…", stopSignal: "SIGINT", gracefulStopMs: 60000}
|
|
128
|
+
```
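
On the worker side, draining on `SIGINT` could look roughly like the sketch below: stop accepting new jobs, wait for in-flight jobs, then exit. `createDrainTracker` is a hypothetical helper, not something Rollbridge or Velocious provides; it only illustrates the contract that `stopSignal` and `gracefulStopMs` rely on.

```js
// Hypothetical worker-side sketch; createDrainTracker is not a Rollbridge API.
function createDrainTracker() {
  let inFlight = 0
  let draining = false
  let onIdle = null

  return {
    // Returns false once draining has begun, so the caller stops taking new jobs.
    begin() {
      if (draining) return false
      inFlight += 1
      return true
    },
    // Call when a job completes; releases the drain once nothing is in flight.
    finish() {
      inFlight -= 1
      if (draining && inFlight === 0 && onIdle) onIdle()
    },
    // Resolves when every job that already started has finished.
    drain() {
      draining = true
      return new Promise((resolve) => {
        if (inFlight === 0) resolve()
        else onIdle = resolve
      })
    }
  }
}

const tracker = createDrainTracker()
// Matches stopSignal: "SIGINT" in the config above.
process.once("SIGINT", () => {
  tracker.drain().then(() => process.exit(0))
})
```

If the drain takes longer than `gracefulStopMs`, Rollbridge's `SIGKILL` fallback still ends the process, so that value should exceed the longest job the worker is expected to finish.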
|
|
129
|
+
|
|
130
|
+
Set `replicas` on a port-less `companion` to run a pool of identical workers.
|
|
131
|
+
Each instance runs as `<id>#<index>` (`worker#0`, `worker#1`, …) — visible in
|
|
132
|
+
`status` and targetable by `rollbridge restart` (base id for all, `worker#0` for
|
|
133
|
+
one) — and gets `{{replicaIndex}}`/`{{replicaCount}}` and
|
|
134
|
+
`ROLLBRIDGE_REPLICA_INDEX`/`_COUNT` so each instance can pick a distinct shard or
|
|
135
|
+
queue. See [`docs/config.md`](docs/config.md#processesreplicas).
|
|
136
|
+
|
|
137
|
+
```js
|
|
138
|
+
{id: "worker", policy: "companion", command: "npx velocious background-jobs-worker", replicas: 4}
|
|
139
|
+
```
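
Inside the worker, the replica index and count can be used to pick a shard. The sketch below assumes the shorthand `_COUNT` above expands to `ROLLBRIDGE_REPLICA_COUNT`, that jobs carry a non-negative integer id, and that simple modulo sharding is acceptable. `replicaInfo` and `ownsJob` are hypothetical helpers.

```js
// Hypothetical worker-side helpers; only the env var names come from the docs above.
function replicaInfo(env) {
  const index = Number.parseInt(env.ROLLBRIDGE_REPLICA_INDEX ?? "0", 10)
  const count = Number.parseInt(env.ROLLBRIDGE_REPLICA_COUNT ?? "1", 10)
  return {index, count}
}

// Assumes jobs carry a non-negative integer id and are sharded by simple modulo.
function ownsJob(jobId, {index, count}) {
  return jobId % count === index
}

const me = replicaInfo(process.env)
console.log(`replica ${me.index} of ${me.count}`)
```

Falling back to index `0` and count `1` keeps the same worker code usable when it runs outside Rollbridge or without `replicas`.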
|
|
140
|
+
|
|
141
|
+
For workers that quiesce or drain via a command, set a `lifecycle` block —
|
|
142
|
+
Rollbridge runs `quietCommand`, then drains (`drainCommand`/`drainTimeoutMs`),
|
|
143
|
+
then `stopCommand`/`stopSignal`, then `SIGKILL` after `gracefulStopMs` when
|
|
144
|
+
gracefully stopping the process. Each hook is bounded so it can't wedge a stop.
|
|
145
|
+
|
|
146
|
+
Set `nonBlockingDrain: true` on a worker companion to start its graceful stop the
|
|
147
|
+
moment its release is retired — in parallel with the proxied connection drain,
|
|
148
|
+
not after it — so new workers handle new work while the old workers finish theirs.
|
|
149
|
+
|
|
150
|
+
See [`docs/workers.md`](docs/workers.md) for the full safe background-job worker
|
|
151
|
+
deployment pattern — companion policy, `replicas`, and finishing in-flight jobs
|
|
152
|
+
on deploy with `stopSignal`/`lifecycle` + `gracefulStopMs`.
|
|
153
|
+
|
|
94
154
|
Set `releaseRetention` to bound how many stopped (drained) releases the daemon
|
|
95
155
|
keeps in memory and reports in `status`. `keep` (default `10`) retains the most
|
|
96
156
|
recent stopped releases; `maxAgeMs` (default `0`, disabled) also prunes stopped
|
|
@@ -102,6 +162,18 @@ owns cleaning up on-disk release directories.
|
|
|
102
162
|
releaseRetention: {keep: 5, maxAgeMs: 86400000}
|
|
103
163
|
```
|
|
104
164
|
|
|
165
|
+
Set `statePath` to have the daemon persist its state to a file (active/draining
|
|
166
|
+
releases, process pids, counters, recent events). On the next startup it reads
|
|
167
|
+
any leftover file and reports managed processes still alive from a daemon that
|
|
168
|
+
didn't shut down cleanly — advisory orphan detection. After a crash, run
|
|
169
|
+
`rollbridge recover` to list those leftovers and `rollbridge recover --force` to
|
|
170
|
+
stop them before restarting the daemon. A clean `shutdown` removes the file. See
|
|
171
|
+
[`docs/config.md`](docs/config.md#statepath).
|
|
172
|
+
|
|
173
|
+
```js
|
|
174
|
+
statePath: "/var/lib/rollbridge/ticket-server.state.json"
|
|
175
|
+
```
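
Orphan detection of this kind usually comes down to checking whether a recorded pid is still alive. A minimal sketch follows, assuming a list of `{id, pid}` entries; that shape is hypothetical, not Rollbridge's state file format. It relies on Node's `process.kill(pid, 0)`, which tests for existence without delivering a signal.

```js
// Hypothetical sketch; the entry shape is assumed, not Rollbridge's state format.
function isAlive(pid) {
  try {
    process.kill(pid, 0) // signal 0 checks for existence without delivering a signal
    return true
  } catch (error) {
    // EPERM means the process exists but belongs to another user.
    return error.code === "EPERM"
  }
}

// Keep only entries whose recorded pid still refers to a live process.
function findOrphans(processes) {
  return processes.filter((entry) => Number.isInteger(entry.pid) && isAlive(entry.pid))
}

console.log(findOrphans([{id: "web", pid: process.pid}]))
```

A live pid only shows that some process has that number; the operating system may have recycled it, which is why this kind of check is advisory.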
|
|
176
|
+
|
|
105
177
|
A function export receives no arguments and lets you build the config at load
|
|
106
178
|
time:
|
|
107
179
|
|
|
@@ -131,7 +203,14 @@ Referencing a placeholder with no value (including an unset `{{env.<NAME>}}`)
|
|
|
131
203
|
fails the process start with a clear error, so typos surface immediately.
|
|
132
204
|
|
|
133
205
|
Production-ready examples live in `examples/`, including
|
|
134
|
-
`examples/tensorbuzz.com.js` for the current TensorBuzz backend deployment
|
|
206
|
+
`examples/tensorbuzz.com.js` for the current TensorBuzz backend deployment; see
|
|
207
|
+
[`docs/tensorbuzz-runbook.md`](docs/tensorbuzz-runbook.md) for the matching
|
|
208
|
+
production runbook (ports, deploy ordering, rollback constraints, and day-to-day
|
|
209
|
+
operations).
|
|
210
|
+
|
|
211
|
+
See [`docs/velocious.md`](docs/velocious.md) for a Velocious deployment guide —
|
|
212
|
+
how Beacon, background-jobs-main, background-jobs-worker, and the web process map
|
|
213
|
+
to Rollbridge policies, with startup ordering and deploy behavior.
|
|
135
214
|
|
|
136
215
|
See [`docs/config.md`](docs/config.md) for the full config reference — every
|
|
137
216
|
field, its default, validation rules, template variables, and the environment
|
|
@@ -332,8 +411,12 @@ rollbridge status --config rollbridge.js
|
|
|
332
411
|
|
|
333
412
|
`status` reports each managed process's `state`, `pid`, recent `logs`, last
|
|
334
413
|
`exitCode`/`exitSignal`, and — per process — its automatic-restart count
|
|
335
|
-
(`restarts`), last start time (`startedAt`),
|
|
336
|
-
|
|
414
|
+
(`restarts`), last start time (`startedAt`), current `uptimeMs` while running,
|
|
415
|
+
and why it last started (`lastStartReason`: `deploy`, `crash`, `manual`, or
|
|
416
|
+
`memory`). The same reason appears on each `process started` entry in
|
|
417
|
+
`rollbridge events`. For memory-supervised processes it also reports current
|
|
418
|
+
`rssBytes`, `memoryRestarts`, `lastMemoryRestartAt`, and `children` (the sampled
|
|
419
|
+
process tree — each group member's `pid`, `command`, and `rssBytes`).
|
|
337
420
|
|
|
338
421
|
Print the recent captured stdout/stderr per process (a one-shot snapshot of the
|
|
339
422
|
retained `outputLines`, not a live stream):
|
|
@@ -343,18 +426,53 @@ rollbridge logs --config rollbridge.js
|
|
|
343
426
|
rollbridge logs --config rollbridge.js --process web
|
|
344
427
|
```
|
|
345
428
|
|
|
429
|
+
Print the daemon's recent structured event history — deploys, traffic switches,
|
|
430
|
+
release stops, process crashes/restarts, and failed commands (the most recent
|
|
431
|
+
1000 events, in memory):
|
|
432
|
+
|
|
433
|
+
```bash
|
|
434
|
+
rollbridge events --config rollbridge.js
|
|
435
|
+
rollbridge events --config rollbridge.js --limit 20
|
|
436
|
+
```
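
A capped in-memory history like this can be pictured as a bounded list that drops its oldest entry. The class below is a hypothetical sketch of that idea, not Rollbridge's own `EventLog`; the `recent(limit)` method mirrors what `--limit` does conceptually.

```js
// Hypothetical class for illustration; not Rollbridge's EventLog.
class BoundedEventLog {
  constructor(capacity = 1000) {
    this.capacity = capacity
    this.events = []
  }

  record(event) {
    this.events.push(event)
    // Drop the oldest entry once the cap is exceeded.
    if (this.events.length > this.capacity) this.events.shift()
  }

  // Return only the newest `limit` events, oldest first.
  recent(limit = this.capacity) {
    return limit > 0 ? this.events.slice(-limit) : []
  }
}

const log = new BoundedEventLog(3)
for (const n of [1, 2, 3, 4, 5]) log.record({n})
console.log(log.recent(2))
```

Because the history lives only in memory, it starts empty again after a daemon restart unless persisted state is configured.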
|
|
437
|
+
|
|
346
438
|
Stop the active release:
|
|
347
439
|
|
|
348
440
|
```bash
|
|
349
441
|
rollbridge stop --config rollbridge.js
|
|
350
442
|
```
|
|
351
443
|
|
|
444
|
+
Roll back to a previous release: Rollbridge restarts it, health-checks it, and switches
|
|
445
|
+
traffic back (defaults to the most recently retired release; a failed rollback
|
|
446
|
+
leaves the current release active). Rollback manages processes only, not
|
|
447
|
+
database migrations:
|
|
448
|
+
|
|
449
|
+
```bash
|
|
450
|
+
rollbridge rollback --config rollbridge.js # the previous release
|
|
451
|
+
rollbridge rollback --config rollbridge.js --release-id v3
|
|
452
|
+
```
|
|
453
|
+
|
|
454
|
+
Restart non-proxied processes in place — all of them, one by id, or a policy
|
|
455
|
+
group (the proxied process is never restarted; use `deploy` for that):
|
|
456
|
+
|
|
457
|
+
```bash
|
|
458
|
+
rollbridge restart --config rollbridge.js # all non-proxied processes
|
|
459
|
+
rollbridge restart --config rollbridge.js --process background-jobs-worker
|
|
460
|
+
rollbridge restart --config rollbridge.js --policy companion
|
|
461
|
+
```
|
|
462
|
+
|
|
352
463
|
Shut down the daemon and managed processes:
|
|
353
464
|
|
|
354
465
|
```bash
|
|
355
466
|
rollbridge shutdown --config rollbridge.js
|
|
356
467
|
```
|
|
357
468
|
|
|
469
|
+
Enable shell completion (bash or zsh) for command names and option flags:
|
|
470
|
+
|
|
471
|
+
```bash
|
|
472
|
+
source <(rollbridge completion bash) # add to ~/.bashrc
|
|
473
|
+
source <(rollbridge completion zsh) # add to ~/.zshrc
|
|
474
|
+
```
|
|
475
|
+
|
|
358
476
|
## Nginx
|
|
359
477
|
|
|
360
478
|
Nginx should proxy to Rollbridge, not directly to Velocious:
|
|
@@ -371,6 +489,10 @@ location / {
|
|
|
371
489
|
}
|
|
372
490
|
```
|
|
373
491
|
|
|
492
|
+
See [`docs/nginx.md`](docs/nginx.md) for the full guide — WebSocket upgrade
|
|
493
|
+
headers, timeouts for long-lived connections, forwarded headers, and common
|
|
494
|
+
failure modes (502/503, dropped WebSockets).
|
|
495
|
+
|
|
374
496
|
## Running under systemd
|
|
375
497
|
|
|
376
498
|
Run the long-lived daemon as a systemd service so it starts on boot and is
|
|
@@ -403,6 +525,10 @@ The daemon is long-lived and survives deploys. **Deploy with
|
|
|
403
525
|
release paths are passed per deploy. Use `command -v rollbridge` to find the
|
|
404
526
|
absolute CLI path for `ExecStart`.
|
|
405
527
|
|
|
528
|
+
See [`docs/logging.md`](docs/logging.md) for where the daemon's JSON logs go
|
|
529
|
+
(stdout / journald / the `--daemon-log-path` file) and how to rotate them — the
|
|
530
|
+
daemon holds its log file open, so logrotate needs `copytruncate`.
|
|
531
|
+
|
|
406
532
|
## Deployment Notes
|
|
407
533
|
|
|
408
534
|
Run migrations before `rollbridge deploy`, and keep migrations backwards-compatible while old and new web releases overlap. For stable local brokers such as Velocious Beacon or `background-jobs-main`, use `service` when the process should survive deploys and restart from the latest successful release if it crashes.
|
|
@@ -417,6 +543,13 @@ Maintainers can publish a patch release from the latest default branch:
|
|
|
417
543
|
npm run release:patch
|
|
418
544
|
```
|
|
419
545
|
|
|
546
|
+
The release script owns the package version bump, lockfile update, default-branch
|
|
547
|
+
commit, push, and npm publish. Do not run `npm version` manually before running
|
|
548
|
+
it.
|
|
549
|
+
|
|
550
|
+
See [`docs/releasing.md`](docs/releasing.md) for the maintainer release checklist
|
|
551
|
+
— the pre-flight checks before `npm run release:patch` and what to verify after.
|
|
552
|
+
|
|
420
553
|
## License
|
|
421
554
|
|
|
422
555
|
Rollbridge is released under the [MIT License](LICENSE).
|
package/TODO.md
CHANGED
|
@@ -19,57 +19,58 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
|
|
|
19
19
|
|
|
20
20
|
## Major Features
|
|
21
21
|
|
|
22
|
-
- [
|
|
23
|
-
- [
|
|
24
|
-
- [
|
|
25
|
-
- [
|
|
26
|
-
- [
|
|
27
|
-
- [
|
|
22
|
+
- [x] Memory supervision.
|
|
23
|
+
- [x] Add per-process memory config with an RSS limit, check interval, warning threshold, and restart policy.
|
|
24
|
+
- [x] Measure the managed process tree, not only the shell wrapper PID. (Sums RSS across the process group via `/proc`.)
|
|
25
|
+
- [x] Report memory stats and last memory-triggered restart in `status`.
|
|
26
|
+
- [x] Restart memory-heavy workers gracefully when possible, with a forced stop timeout.
|
|
27
|
+
- [x] Add tests with a fixture process that allocates memory above the configured limit.
|
|
28
28
|
- [ ] Worker auto-restart and restart policy controls.
|
|
29
|
-
- [
|
|
30
|
-
- [
|
|
31
|
-
- [
|
|
32
|
-
- [
|
|
33
|
-
- [
|
|
34
|
-
- [
|
|
35
|
-
- [
|
|
36
|
-
- [
|
|
37
|
-
- [
|
|
38
|
-
- [
|
|
39
|
-
- [
|
|
40
|
-
- [
|
|
41
|
-
- [
|
|
42
|
-
- [
|
|
43
|
-
- [
|
|
44
|
-
- [
|
|
45
|
-
- [
|
|
46
|
-
- [
|
|
47
|
-
- [
|
|
48
|
-
- [
|
|
49
|
-
- [
|
|
50
|
-
- [
|
|
51
|
-
- [
|
|
52
|
-
- [
|
|
29
|
+
- [x] Add config for max restarts, restart window, exponential backoff, and disabled restart behavior (per-process `restart` policy).
|
|
30
|
+
- [x] Distinguish crash restarts, deploy replacements, manual restarts, and memory restarts in status/events. (Per-process `lastStartReason` + a `reason` on the `process started` event; the `memory` reason is wired and fires once memory supervision restarts a process.)
|
|
31
|
+
- [x] Add a `restart` CLI command for a single process, a policy group, or all non-proxied workers.
|
|
32
|
+
- [x] Keep restart behavior safe for job workers by using lifecycle hooks before termination. (Manual restart, memory restart, and deploy-drain stops all run the `lifecycle` hooks via `stop()`.)
|
|
33
|
+
- [x] Graceful job-worker lifecycle.
|
|
34
|
+
- [x] Add generic lifecycle hooks such as `quietCommand`, `drainCommand`, `drainTimeoutMs`, and `stopCommand` (per-process `lifecycle`).
|
|
35
|
+
- [x] Support signal-only lifecycle steps for workers that can quiet on a Unix signal. (Per-process `stopSignal`; sent before the `SIGKILL`-after-`gracefulStopMs` fallback.)
|
|
36
|
+
- [x] Add a non-blocking drain mode so new workers can start while old workers finish running jobs (per-process `nonBlockingDrain`; drains the worker in parallel with the connection drain).
|
|
37
|
+
- [x] Document a Velocious background-jobs-worker recipe once the lifecycle contract is implemented (`docs/velocious.md` → Worker recipe).
|
|
38
|
+
- [x] Replicas and stable worker indexes. (Supported on port-less `companion` processes; `proxied`/`singleton`/ported processes stay single.)
|
|
39
|
+
- [x] Allow one process config to start multiple replicas (`replicas`, companion-only for now).
|
|
40
|
+
- [x] Expose `ROLLBRIDGE_REPLICA_INDEX`, replica count, and per-replica template context (`{{replicaIndex}}`/`{{replicaCount}}`).
|
|
41
|
+
- [x] Restart or stop one replica without affecting the rest (`rollbridge restart --process worker#0`).
|
|
42
|
+
- [x] Preserve readable status output for replica groups (each instance shown as `<id>#<index>`).
|
|
43
|
+
- [x] Persistent daemon state and recovery.
|
|
44
|
+
- [x] Persist active release, draining releases, process metadata, counters, and recent events (opt-in `statePath`; atomic snapshot on change + periodic).
|
|
45
|
+
- [x] Reconnect status to still-running child processes after daemon restart where possible. (Feasible subset: `status` now includes an `orphans` array — still-alive managed processes from the prior daemon's persisted state, re-checked each call. Full re-management/stdout-exit re-attach stays infeasible; the daemon reports them and `rollbridge recover` stops them.)
|
|
46
|
+
- [x] Detect and report orphaned Rollbridge-managed processes. (On startup, reports persisted process pids that are still alive; advisory, see `statePath`.)
|
|
47
|
+
- [x] Add a recovery mode for safe startup after daemon crash or machine reboot. (`rollbridge recover` lists orphaned processes from the persisted state and, with `--force`, stops them and clears the state; refuses while a daemon is running.)
|
|
48
|
+
- [x] Rollback support.
|
|
49
|
+
- [x] Keep enough release metadata to switch traffic back to a previous healthy release.
|
|
50
|
+
- [x] Add a `rollback` CLI command that health-checks the target before switching.
|
|
51
|
+
- [x] Define how rollback interacts with singleton workers and draining releases. (Reuses the deploy flow: replaces singletons and drains the current release.)
|
|
52
|
+
- [x] Document migration constraints for rollback.
|
|
53
53
|
- [ ] Observability and diagnostics.
|
|
54
|
-
- [
|
|
54
|
+
- [x] Add structured event history for deploys, switches, stops, crashes, memory restarts, and failed commands. (In-memory `EventLog` tapping the daemon logger; memory-restart events populate once memory supervision logs them.)
|
|
55
55
|
- [x] Add restart counters and uptime to status (exit reasons already reported via `exitCode`/`exitSignal`/`state`).
|
|
56
|
-
- [
|
|
56
|
+
- [x] Add memory stats and child-process-tree details to status (with memory supervision). (`rssBytes`/`memoryRestarts`/`lastMemoryRestartAt` plus `children`: the sampled process tree with each member's pid, command, and RSS.)
|
|
57
57
|
- [x] Add a `logs` CLI command (recent per-process output from status).
|
|
58
|
-
- [
|
|
59
|
-
- [
|
|
58
|
+
- [x] Add an `events` CLI command (after structured event history lands).
|
|
59
|
+
- [x] Add optional file logging with rotation guidance (`docs/logging.md`; daemon log file via `--daemon-log-path`, logrotate `copytruncate`).
|
|
60
60
|
- [x] Add machine-readable JSON output for all CLI commands (data commands print JSON; `validate`/`doctor`/`logs` take `--json`).
|
|
61
61
|
- [ ] Config validation and doctoring.
|
|
62
62
|
- [x] Add `validate` to parse config and report all config errors without starting the daemon.
|
|
63
63
|
- [x] Add `doctor` to check config validity, control socket reachability, proxy port availability, and control-socket directory writability.
|
|
64
|
-
- [
|
|
64
|
+
- [x] Extend `doctor` with state-path checks: state-path directory writability and orphaned-process reporting from a prior state file.
|
|
65
|
+
- [x] Extend `doctor` with process-command and release-path checks once those are resolvable (they need per-release rendered templates, which only exist at deploy time). (`rollbridge doctor --release-path <path>` renders each process's command/cwd/env against that release and checks the release directory, template resolvability, and rendered working directories; uses representative ports and replica index 0.)
|
|
65
66
|
- [x] Validate duplicate process IDs, missing ports on proxied processes, invalid ranges, and the single-proxied-process policy rule.
|
|
66
|
-
- [
|
|
67
|
+
- [x] Validate unsupported lifecycle-hook combinations once worker lifecycle hooks land. (`lifecycle.drainCommand` requires a positive `drainTimeoutMs`; `nonBlockingDrain` is companion-only; a `lifecycle.stopCommand` may not be combined with a custom `stopSignal`, since the command runs instead of the signal.)
|
|
67
68
|
- [x] Include example fixes in validation output.
|
|
68
69
|
|
|
69
70
|
## Minor Features
|
|
70
71
|
|
|
71
72
|
- [x] Add a control-socket permission option (`control.mode`) for shared deploy users.
|
|
72
|
-
- [
|
|
73
|
+
- [x] Add control-socket owner/group options for shared deploy users (`control.owner`/`control.group`, numeric id or name resolved via `/etc/passwd`/`/etc/group`).
|
|
73
74
|
- [x] Make stale control socket diagnostics clearer when another daemon is still alive.
|
|
74
75
|
- [x] Add old-release cleanup policies by age, count, and stopped state (`releaseRetention`).
|
|
75
76
|
- [x] Add port allocation diagnostics when a range is exhausted.
|
|
@@ -77,11 +78,11 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
|
|
|
77
78
|
- [x] Add process output retention config instead of a fixed recent-log count.
|
|
78
79
|
- [x] Add environment variable interpolation from the daemon environment.
|
|
79
80
|
- [x] Add `--config` default lookup resolving to `rollbridge.js` when no path is given.
|
|
80
|
-
- [
|
|
81
|
+
- [x] Add shell completion generation for common shells (`rollbridge completion bash|zsh`).
|
|
81
82
|
- [x] Add npm package metadata such as repository, license, bugs, and homepage.
|
|
82
83
|
- [x] Add systemd service examples for the Rollbridge daemon.
|
|
83
84
|
- [x] Add tests for malformed control socket JSON and unknown control commands.
|
|
84
|
-
- [
|
|
85
|
+
- [x] Add tests for duplicate IDs and singleton replacement failure behavior.
|
|
85
86
|
- [x] Add tests for proxy behavior when the active release exits unexpectedly.
|
|
86
87
|
|
|
87
88
|
## Documentation TODO
|
|
@@ -89,12 +90,13 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
|
|
|
89
90
|
- [x] Write a full config reference covering every field, default, and template variable (`docs/config.md`).
|
|
90
91
|
- [x] Write a CLI reference for `daemon`, `ensure-daemon`, `deploy`, `status`, `stop`, `shutdown`, and future commands (`docs/cli.md`).
|
|
91
92
|
- [x] Expand process policy docs with deployment examples for `proxied`, `companion`, `singleton`, and `service`.
|
|
92
|
-
- [
|
|
93
|
-
- [
|
|
94
|
-
- [
|
|
95
|
-
- [
|
|
93
|
+
- [x] Document memory checks and auto-restart behavior after the feature lands (`docs/config.md` → `processes[].memory`).
|
|
94
|
+
- [x] Document safe background-job deployment patterns (`docs/workers.md`: companion + `replicas` + `stopSignal` + `gracefulStopMs`, old/new worker overlap).
|
|
95
|
+
- [x] Document worker lifecycle hooks (`docs/config.md` → `processes[].lifecycle`, `docs/workers.md`).
|
|
96
|
+
- [x] Add a Velocious deployment guide with Beacon, background-jobs-main, background-jobs-worker, and web process examples (`docs/velocious.md`).
|
|
97
|
+
- [x] Add an Nginx guide with WebSocket headers, timeouts, and common failure modes (`docs/nginx.md`).
|
|
96
98
|
- [x] Add deploy-tool recipes that call Rollbridge CLI commands directly (`docs/deploy-recipes.md`).
|
|
97
99
|
- [x] Add a Capistrano recipe showing shell commands only; do not add a Capistrano plugin or Rollbridge-specific Capistrano tasks (`docs/deploy-recipes.md`).
|
|
98
|
-
- [
|
|
100
|
+
- [x] Add a TensorBuzz-specific runbook for current production ports, external services, deploy ordering, and rollback constraints (`docs/tensorbuzz-runbook.md`).
|
|
99
101
|
- [x] Add troubleshooting docs for health-check failures, port conflicts, stale sockets, crash loops, and stuck draining releases (`docs/troubleshooting.md`).
|
|
100
|
-
- [
|
|
102
|
+
- [x] Add a release checklist for maintainers using `npm run release:patch` (`docs/releasing.md`).
|
package/docs/cli.md
CHANGED
|
@@ -46,7 +46,8 @@ already accepting commands, waits until it responds, then prints the daemon
|
|
|
46
46
|
status JSON. Idempotent — safe to call before every deploy.
|
|
47
47
|
|
|
48
48
|
- `--daemon-log-path <path>` — file the detached daemon's stdout/stderr is
|
|
49
|
-
appended to. Default: `/tmp/rollbridge-<application>.log`.
|
|
49
|
+
appended to. Default: `/tmp/rollbridge-<application>.log`. See
|
|
50
|
+
[`logging.md`](logging.md) for the log format and rotation guidance.
|
|
50
51
|
- `--daemon-pid-path <path>` — file the detached daemon's PID is written to.
|
|
51
52
|
Default: `/tmp/rollbridge-<application>.pid`.
|
|
52
53
|
- `--daemon-start-timeout-ms <ms>` — how long to wait for the daemon to accept
|
|
@@ -79,6 +80,35 @@ active and the command errors.
|
|
|
79
80
|
- `--ensure-daemon` — start the daemon first if it isn't running (honors the
|
|
80
81
|
same `--daemon-*` options as `ensure-daemon`).
|
|
81
82
|
|
|
83
|
+
## `rollback`
|
|
84
|
+
|
|
85
|
+
```
|
|
86
|
+
rollbridge rollback [--config <path>] [--release-id <id>]
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
Rolls back to a previously active release by re-running the deploy flow on its
|
|
90
|
+
retained metadata: it re-starts that release, health-checks the proxied process,
|
|
91
|
+
switches traffic, replaces singletons, and drains the current release — exactly
|
|
92
|
+
like a deploy. With no `--release-id`, it targets the **most recently retired**
|
|
93
|
+
release (the one active just before the current). Prints the same
|
|
94
|
+
`{"activeReleaseId", "previousReleaseId"}` result as `deploy`.
|
|
95
|
+
|
|
96
|
+
Because rollback reuses the deploy flow, a failed rollback (the target won't
|
|
97
|
+
start or health-check) leaves the current release active and errors — it never
|
|
98
|
+
takes the site down. Singletons are replaced (old stopped, then the target's
|
|
99
|
+
started) and the current release is drained, just like any deploy.
|
|
100
|
+
|
|
101
|
+
Errors when there is no previous release, the `--release-id` is not a retained
|
|
102
|
+
release, or the target is already active. Only releases Rollbridge still retains
|
|
103
|
+
(see [`releaseRetention`](config.md#releaseretention)) can be rolled back to.
|
|
104
|
+
|
|
105
|
+
**Migration constraints.** Rollback only manages processes — it does **not**
|
|
106
|
+
revert database migrations or other external state. The target release's on-disk
|
|
107
|
+
directory must still exist, and its code must be compatible with the current
|
|
108
|
+
schema. Keep migrations backwards-compatible (the same rule that lets old and
|
|
109
|
+
new releases overlap during a deploy) so rolling code back to a retained release
|
|
110
|
+
stays safe.
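
The target-selection rules above can be sketched as a small function. `pickRollbackTarget` is a hypothetical helper, and it assumes the retained release ids are available as a list ordered from oldest to newest; Rollbridge's internal bookkeeping may differ.

```js
// Hypothetical helper; `retired` is assumed to be ordered oldest to newest.
function pickRollbackTarget(retired, activeReleaseId, requestedId) {
  if (requestedId === undefined) {
    if (retired.length === 0) throw new Error("no previous release to roll back to")
    // Default: the most recently retired release.
    return retired[retired.length - 1]
  }
  if (requestedId === activeReleaseId) throw new Error(`release ${requestedId} is already active`)
  if (!retired.includes(requestedId)) throw new Error(`release ${requestedId} is not a retained release`)
  return requestedId
}

console.log(pickRollbackTarget(["v1", "v2", "v3"], "v4"))
```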
|
|
111
|
+
|
|
82
112
|
## `status`
|
|
83
113
|
|
|
84
114
|
```
|
|
@@ -87,8 +117,19 @@ rollbridge status [--config <path>]
|
|
|
87
117
|
|
|
88
118
|
Prints the daemon status JSON: the active release id, the proxy address, and —
|
|
89
119
|
per release, service, and singleton process — its `state`, `pid`, automatic
|
|
90
|
-
`restarts`, `startedAt`, `uptimeMs`, last `exitCode`/`exitSignal`,
|
|
91
|
-
`logs`.
|
|
120
|
+
`restarts`, `startedAt`, `uptimeMs`, last `exitCode`/`exitSignal`,
|
|
121
|
+
`lastStartReason` (`deploy`, `crash`, `manual`, or `memory`), and recent `logs`.
|
|
122
|
+
Memory-supervised processes also report `rssBytes`, `memoryRestarts`,
|
|
123
|
+
`lastMemoryRestartAt`, and `children` (the process tree: each group member's
|
|
124
|
+
`pid`, `command`, and `rssBytes`).
|
|
125
|
+
|
|
126
|
+
When [`statePath`](config.md#statepath) is configured, status also includes an
|
|
127
|
+
`orphans` array: managed processes from a **previous** daemon that are still
|
|
128
|
+
alive (`id`, `pid`, `releaseId`) — for example after the daemon restarted but its
|
|
129
|
+
detached children kept running. It is empty in the normal case. Liveness is
|
|
130
|
+
re-checked on each call, so the list clears itself as you stop the leftovers (see
|
|
131
|
+
[`recover`](#recover)). These are reported only — the new daemon can't re-adopt
|
|
132
|
+
them.
|
|
92
133
|
|
|
93
134
|
## `stop`
|
|
94
135
|
|
|
@@ -100,6 +141,52 @@ Stops the active release (or the release named by `--release-id`) and prints the
|
|
|
100
141
|
updated status JSON. With no active release, the proxy answers `503` until the
|
|
101
142
|
next deploy.
|
|
102
143
|
|
|
144
|
+
## `restart`
|
|
145
|
+
|
|
146
|
+
```
|
|
147
|
+
rollbridge restart [--config <path>] [--process <id>] [--policy <policy>]
|
|
148
|
+
```
|
|
149
|
+
|
|
150
|
+
Restarts **non-proxied** processes and prints `{"restarted": [<ids>]}`. Like
|
|
151
|
+
`systemctl restart`, a running process is bounced (stop, then start) and a
|
|
152
|
+
crashed or stopped one is revived — so this is also how you bring back a process
|
|
153
|
+
that exhausted its `restart` budget (see [`config.md`](config.md#processesrestart)).
|
|
154
|
+
Selectors:
|
|
155
|
+
|
|
156
|
+
- no selector — restart every non-proxied process (companions, singletons, and services);
|
|
157
|
+
- `--process <id>` — restart only that process;
|
|
158
|
+
- `--policy <companion|singleton|service>` — restart only processes with that policy.
|
|
159
|
+
|
|
160
|
+
The proxied process is never restarted in place — that would drop traffic.
|
|
161
|
+
Targeting it (by id or `--policy proxied`) is an error; use `rollbridge deploy`
|
|
162
|
+
for a zero-downtime replacement. `--process <id>` with an id that is not a
|
|
163
|
+
managed process (unknown, or a companion with no active release) is also an
|
|
164
|
+
error. Restarting a `service` bounces a shared broker (for example Velocious
|
|
165
|
+
Beacon), which briefly disrupts every process that depends on it.
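
The selector rules above can be summarized in code. This is a sketch under stated assumptions: `selectRestartTargets` is a hypothetical helper, processes are modeled as `{id, policy}` entries, and replica ids such as `worker#0` are not handled.

```js
// Hypothetical helper; `processes` is assumed to be a list of {id, policy} entries.
function selectRestartTargets(processes, {process: processId, policy} = {}) {
  if (policy === "proxied") throw new Error("the proxied process is never restarted in place; use deploy")
  let targets = processes
  if (processId !== undefined) {
    const match = processes.find((entry) => entry.id === processId)
    if (!match) throw new Error(`not a managed process: ${processId}`)
    if (match.policy === "proxied") throw new Error("the proxied process is never restarted in place; use deploy")
    targets = [match]
  }
  // With no selector, every non-proxied process is restarted.
  return targets
    .filter((entry) => entry.policy !== "proxied")
    .filter((entry) => policy === undefined || entry.policy === policy)
    .map((entry) => entry.id)
}

const managed = [
  {id: "web", policy: "proxied"},
  {id: "worker", policy: "companion"},
  {id: "beacon", policy: "service"}
]
console.log(selectRestartTargets(managed))
```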
|
|
166
|
+
|
|
167
|
+
## `recover`
|
|
168
|
+
|
|
169
|
+
```
|
|
170
|
+
rollbridge recover [--config <path>] [--force]
|
|
171
|
+
```
|
|
172
|
+
|
|
173
|
+
Cleans up orphaned managed processes left by a **crashed** daemon. It reads the
|
|
174
|
+
persisted state ([`statePath`](config.md#statepath)) and finds managed processes
|
|
175
|
+
whose pids are still alive. Without `--force` it only **lists** them (a dry run);
|
|
176
|
+
with `--force` it stops each one's process group (`SIGTERM`, then `SIGKILL` after
|
|
177
|
+
`proxy.forceStopTimeoutMs`) and clears the stale state file.
|
|
178
|
+
|
|
179
|
+
Run it **before** restarting the daemon after a crash. It refuses to run while a
|
|
180
|
+
daemon (or another process) holds the control socket — those pids belong to a
|
|
181
|
+
live daemon, not a crash. A recycled pid can be a false positive, so review the
|
|
182
|
+
dry-run list before using `--force`.
|
|
183
|
+
|
|
184
|
+
If `--force` cannot stop an orphan (for example one now owned by another user,
|
|
185
|
+
so it can't be signaled), that process is reported as still running, the state
|
|
186
|
+
file is **kept** so you can investigate and re-run `recover`, and the command
|
|
187
|
+
exits non-zero. `recover` requires `statePath`; it also exits non-zero when `statePath` is unset or a
|
|
188
|
+
daemon is running.
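
To make the "still alive" and "can't be signaled" cases above concrete, here is a minimal sketch of a pid liveness probe in Node.js. It is not taken from rollbridge's source, and `isPidAlive` is a hypothetical name. It relies on `process.kill(pid, 0)`, which tests for a process without delivering a signal.

```javascript
// Sketch of a pid liveness probe; rollbridge's own check may differ.
function isPidAlive(pid) {
  try {
    // Signal 0 runs the existence and permission checks without
    // delivering a signal to the target.
    process.kill(pid, 0);
    return true;
  } catch (error) {
    // EPERM: the pid exists but belongs to a user we may not signal.
    if (error.code === "EPERM") return true;
    // ESRCH (or anything else): treat the pid as gone.
    return false;
  }
}
```

A probe like this only proves that some process holds the pid. It cannot tell a recycled pid from the original, which is why the dry-run list deserves a manual review.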
|
|
189
|
+
|
|
103
190
|
## `shutdown`
|
|
104
191
|
|
|
105
192
|
```
|
|
@@ -123,15 +210,53 @@ issue with an example fix. Exits `1` when issues are found. With `--json`, print
|
|
|
123
210
|
## `doctor`
|
|
124
211
|
|
|
125
212
|
```
|
|
126
|
-
rollbridge doctor [--config <path>]
|
|
213
|
+
rollbridge doctor [--config <path>]
|
|
214
|
+
[--release-path <path>]
|
|
215
|
+
[--release-id <id>]
|
|
216
|
+
[--revision <sha>]
|
|
217
|
+
[--json]
|
|
127
218
|
```
|
|
128
219
|
|
|
129
220
|
Validates the config, then probes the environment: whether a daemon already
|
|
130
221
|
holds the control socket, whether the control socket's directory is writable,
|
|
131
|
-
and whether the proxy port can be bound.
|
|
132
|
-
|
|
222
|
+
and whether the proxy port can be bound. When [`statePath`](config.md#statepath)
|
|
223
|
+
is configured, it also checks that the state file's directory is writable and
|
|
224
|
+
reports any **orphaned processes** — managed processes still alive in a prior
|
|
225
|
+
state file, left by a daemon that didn't shut down cleanly (advisory; a recycled
|
|
226
|
+
pid can be a false positive, so verify before stopping). Exits `1` when any check
|
|
227
|
+
fails (so a green `doctor` means a fresh daemon can start). With `--json`, prints
|
|
133
228
|
`{"checks": [{"name", "ok", "detail"}], "ok"}`.
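
A deploy script can consume the `--json` report instead of parsing text. The sketch below assumes the report has already been parsed with `JSON.parse`; the `failedChecks` helper and the sample check names are made up for illustration, and only the `checks`/`name`/`ok`/`detail` keys come from the shape above.

```javascript
// Sketch: list the failing checks from a parsed `rollbridge doctor --json` report.
function failedChecks(report) {
  return report.checks
    .filter((check) => !check.ok)
    .map((check) => `${check.name}: ${check.detail}`);
}

// Hypothetical sample report; real check names and details will differ.
const sampleReport = {
  ok: false,
  checks: [
    { name: "config", ok: true, detail: "valid" },
    { name: "proxy port", ok: false, detail: "address already in use" },
  ],
};

console.log(failedChecks(sampleReport));
```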
|
|
134
229
|
|
|
230
|
+
### Pre-flighting a release with `--release-path`
|
|
231
|
+
|
|
232
|
+
Process commands, working directories, and env values are
|
|
233
|
+
[templates](config.md#template-variables) (`{{releasePath}}`, `{{port}}`, …) that
|
|
234
|
+
are only rendered at deploy time, against a specific release. Pass
|
|
235
|
+
`--release-path <path>` to a **prepared release directory** to add deploy-time
|
|
236
|
+
checks against it:
|
|
237
|
+
|
|
238
|
+
- **release path** — the release directory exists.
|
|
239
|
+
- **process templates** — every process's `command`, `cwd`, and `env` templates
|
|
240
|
+
resolve (no `{{…}}` references an undefined variable). Ports are rendered with
|
|
241
|
+
the low end of each process's configured range.
|
|
242
|
+
- **process working directories** — each process's rendered `cwd` (defaulting to
|
|
243
|
+
the release path) exists.
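
The **process templates** check amounts to strict rendering: any `{{…}}` that names an unknown variable is an error rather than an empty string. The following is a minimal sketch of that idea, not the code in rollbridge's template module. The `renderTemplate` name and the exact placeholder syntax accepted (word characters, optional inner whitespace) are assumptions.

```javascript
// Sketch of strict {{variable}} rendering; the real renderer may accept
// a different placeholder syntax.
function renderTemplate(template, variables) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(variables, name)) {
      // A typo such as {{relasePath}} fails here instead of at deploy time.
      throw new Error(`undefined template variable: ${name}`);
    }
    return String(variables[name]);
  });
}
```

With `{ releasePath: "/srv/app", port: 3000 }`, the template `{{releasePath}}/bin/server --port {{port}}` renders to `/srv/app/bin/server --port 3000`.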
|
|
244
|
+
|
|
245
|
+
`--release-id` and `--revision` set `{{releaseId}}`/`{{revision}}` for rendering
|
|
246
|
+
(defaulting the way `deploy` does: `--release-id` falls back to `--revision` or
|
|
247
|
+
the release path's basename, and `--revision` falls back to `--release-id`). Run
|
|
248
|
+
it as part of a deploy pipeline, after preparing the release and before
|
|
249
|
+
`rollbridge deploy`, to catch a template typo or a missing directory before
|
|
250
|
+
traffic is involved:
|
|
251
|
+
|
|
252
|
+
```bash
|
|
253
|
+
rollbridge doctor --config /etc/rollbridge/app.js --release-path /srv/app/releases/20260524
|
|
254
|
+
```
|
|
255
|
+
|
|
256
|
+
These checks render replica index `0` and use representative ports, so they
|
|
257
|
+
catch template and path problems but not values that only exist once the daemon
|
|
258
|
+
allocates real ports and spawns processes.
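
The `--release-id`/`--revision` defaulting described above can be written out as a short function. This is a sketch under stated assumptions: `resolveReleaseIdentity` is a hypothetical name, the basename logic is simplified to POSIX-style paths, and it assumes `--revision` falls back to the already resolved release id when neither flag is given.

```javascript
// Sketch of the documented fallbacks; not rollbridge's actual code.
function resolveReleaseIdentity({ releasePath, releaseId, revision }) {
  // Simplified basename: drop trailing slashes, keep the last path segment.
  const basename = releasePath.replace(/\/+$/, "").split("/").pop();
  // --release-id falls back to --revision, then to the path's basename.
  const resolvedReleaseId = releaseId ?? revision ?? basename;
  // --revision falls back to the release id (assumed: the resolved one).
  const resolvedRevision = revision ?? resolvedReleaseId;
  return { releaseId: resolvedReleaseId, revision: resolvedRevision };
}
```

For the example command above, which passes neither flag, both values would come out as `20260524` under these assumptions.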
|
|
259
|
+
|
|
135
260
|
## `logs`
|
|
136
261
|
|
|
137
262
|
```
|
|
@@ -143,6 +268,44 @@ snapshot of each process's `outputLines`, not a live stream. `--process <id>`
|
|
|
143
268
|
limits output to one process. With `--json`, prints
|
|
144
269
|
`[{"id", "source", "logs": [{"at", "line", "stream"}]}]`.
|
|
145
270
|
|
|
271
|
+
## `events`
|
|
272
|
+
|
|
273
|
+
```
|
|
274
|
+
rollbridge events [--config <path>] [--limit <count>] [--json]
|
|
275
|
+
```
|
|
276
|
+
|
|
277
|
+
Prints the daemon's recent structured event history — deploys (`deploy
|
|
278
|
+
starting`, `traffic switched`, `deploy failed`), release stops (`release
|
|
279
|
+
stopped`, `release drained`), process lifecycle (`process started` — with a
|
|
280
|
+
`reason` of `deploy`, `crash`, `manual`, or `memory` — `process exited`,
|
|
281
|
+
`memory limit exceeded`, `restart limit reached`, `process restart requested`),
|
|
282
|
+
and failed control commands (`command failed`). Each event has a timestamp, a
|
|
283
|
+
message, and a structured data payload. The daemon keeps the most recent 1000 events in
|
|
284
|
+
memory (cleared on restart). `--limit <count>` shows only the most recent
|
|
285
|
+
`count` events. With `--json`, prints `[{"at", "message", "data"}]`.
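
Because the JSON output is a flat array, a quick tally is easy to build on top of it. The sketch below counts events per message from an already parsed array. `countByMessage` is a hypothetical helper, and it reads only the `message` key from the shape above.

```javascript
// Sketch: tally parsed `rollbridge events --json` entries by message.
function countByMessage(events) {
  const counts = {};
  for (const event of events) {
    counts[event.message] = (counts[event.message] ?? 0) + 1;
  }
  return counts;
}
```

A rising count for `restart limit reached` or `memory limit exceeded` is a reasonable signal to look at the affected process's logs.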
|
|
286
|
+
|
|
287
|
+
## `completion`
|
|
288
|
+
|
|
289
|
+
```
|
|
290
|
+
rollbridge completion <bash|zsh>
|
|
291
|
+
```
|
|
292
|
+
|
|
293
|
+
Prints a shell completion script to stdout, generated by introspecting the
|
|
294
|
+
command set (so it never drifts from the real commands and options). It
|
|
295
|
+
completes command names and each command's option flags, and falls back to file
|
|
296
|
+
completion after an option that takes a value (bash). Enable it for the current
|
|
297
|
+
session, or add the line to your shell startup file:
|
|
298
|
+
|
|
299
|
+
```bash
|
|
300
|
+
# bash (~/.bashrc)
|
|
301
|
+
source <(rollbridge completion bash)
|
|
302
|
+
|
|
303
|
+
# zsh (~/.zshrc)
|
|
304
|
+
source <(rollbridge completion zsh)
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
An unsupported shell exits `1` with the list of supported shells.
|
|
308
|
+
|
|
146
309
|
## Exit codes
|
|
147
310
|
|
|
148
311
|
- `0` — success.
|