rollbridge 0.1.5 → 0.1.7

package/README.md CHANGED
@@ -84,7 +84,10 @@ more or fewer lines for chatty or quiet processes.
84
84
  Set `control.mode` to an octal permission string (for example `"660"`) to
85
85
  chmod the control socket after it binds. This restricts which users can send
86
86
  control commands — useful when several deploy users share a group. When unset,
87
- the socket keeps the default permissions from the daemon's umask.
87
+ the socket keeps the default permissions from the daemon's umask. Pair it with
88
+ `control.owner` and `control.group` (a numeric id or a user/group name) to
89
+ `chown` the socket to a shared deploy group; names resolve via
90
+ `/etc/passwd`/`/etc/group`, and the daemon must run as a user allowed to chown it.
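
As a rough illustration of the three `control` options together, the fragment below is a sketch only: the `deploy` user and group names are placeholders, not values taken from the package, and the bit arithmetic merely shows how an octal string such as `"660"` maps to permission bits.

```js
// Sketch of a `control` block; "deploy" is a hypothetical user/group name.
const control = {
  mode: "660",     // octal string: read/write for owner and group, nothing for others
  owner: "deploy", // numeric id or user name
  group: "deploy"  // numeric id or group name
};

// Illustration only: how an octal permission string maps to mode bits.
const modeBits = parseInt(control.mode, 8);
const groupCanWrite = (modeBits & 0o020) !== 0;
const othersCanWrite = (modeBits & 0o002) !== 0;
```

With `"660"`, members of the shared group can write to the socket while other users cannot, which is the usual goal when several deploy users share a group.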
88
91
 
89
92
  Set the proxied process's `health.startDelayMs` (default `0`) to wait that long
90
93
  after the process starts before the first health probe — like a readiness
@@ -103,6 +106,51 @@ restart. With no `restart` block, a crashed process keeps restarting after
103
106
  restart: {maxRestarts: 5, windowMs: 60000, backoffFactor: 2, maxDelayMs: 30000}
104
107
  ```
105
108
 
109
+ Set a process's `memory` policy to supervise its resident memory (RSS) and
110
+ gracefully restart it when it grows too large. `memory.limitBytes` is the RSS
111
+ limit (measured across the whole process group, not just the wrapper);
112
+ `memory.warnBytes` logs a warning before the limit; `memory.checkIntervalMs`
113
+ (default `5000`) sets how often RSS is sampled. A memory restart is reported in
114
+ `status` and recorded in `events` (a `process started` event with `reason: "memory"`).
115
+ See [`docs/config.md`](docs/config.md#processesmemory).
116
+
117
+ ```js
118
+ memory: {limitBytes: 536870912, warnBytes: 402653184, checkIntervalMs: 5000}
119
+ ```
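
The byte counts in the example above are easier to audit when written as multiples of a mebibyte. The following sketch builds the same policy that way; the `MiB` constant is a local helper introduced here, not a Rollbridge option.

```js
// Local helper (not part of Rollbridge): one mebibyte in bytes.
const MiB = 1024 * 1024;

// Same values as the README example: 512 MiB limit, 384 MiB warning threshold.
const memory = {
  limitBytes: 512 * MiB,
  warnBytes: 384 * MiB,
  checkIntervalMs: 5000
};
```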
120
+
121
+ Set a process's `stopSignal` (default `"SIGTERM"`) to the signal it quiets on, so
122
+ a worker finishes its in-flight work before exiting. Rollbridge sends `stopSignal`
123
+ to gracefully stop the process and `SIGKILL`s it only if it hasn't exited within
124
+ `gracefulStopMs`. For example, a job worker that drains on `SIGINT`:
125
+
126
+ ```js
127
+ {id: "worker", policy: "companion", command: "…", stopSignal: "SIGINT", gracefulStopMs: 60000}
128
+ ```
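
On the worker side, draining on `SIGINT` could look roughly like the sketch below. It assumes a Node.js worker; `accepting`, `inFlight`, `runJob`, and `readyToExit` are hypothetical names for illustration, and only the signal name comes from the example above.

```js
// Hypothetical worker internals; only the SIGINT signal name comes from the example above.
let accepting = true; // flips to false once the stop signal arrives
let inFlight = 0;     // number of jobs currently running

process.on("SIGINT", () => {
  accepting = false; // stop taking new jobs; running jobs are left to finish
});

async function runJob(job) {
  if (!accepting) return false; // quieted: refuse new work
  inFlight += 1;
  try {
    await job();
  } finally {
    inFlight -= 1;
  }
  return true;
}

// True once the worker has been quieted and has nothing left to finish.
function readyToExit() {
  return !accepting && inFlight === 0;
}
```

A worker built this way would exit on its own once `readyToExit()` returns true. `gracefulStopMs` should be longer than the slowest job so the `SIGKILL` fallback is rarely reached.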
129
+
130
+ Set `replicas` on a port-less `companion` to run a pool of identical workers.
131
+ Each instance runs as `<id>#<index>` (`worker#0`, `worker#1`, …) — visible in
132
+ `status` and targetable by `rollbridge restart` (base id for all, `worker#0` for
133
+ one) — and gets `{{replicaIndex}}`/`{{replicaCount}}` and
134
+ `ROLLBRIDGE_REPLICA_INDEX`/`_COUNT` so each instance can pick a distinct shard or
135
+ queue. See [`docs/config.md`](docs/config.md#processesreplicas).
136
+
137
+ ```js
138
+ {id: "worker", policy: "companion", command: "npx velocious background-jobs-worker", replicas: 4}
139
+ ```
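
Inside each replica, picking a distinct shard from the environment might look like the sketch below. It reads the `_COUNT` shorthand above as `ROLLBRIDGE_REPLICA_COUNT`, which is an assumption, and `replicaFromEnv` and `ownsShard` are hypothetical helper names for worker-side code, not Rollbridge APIs.

```js
// Hypothetical worker-side code: read the replica position from the environment.
// ROLLBRIDGE_REPLICA_COUNT is assumed to be the full name behind the `_COUNT` shorthand.
function replicaFromEnv(env) {
  const index = Number.parseInt(env.ROLLBRIDGE_REPLICA_INDEX ?? "0", 10);
  const count = Number.parseInt(env.ROLLBRIDGE_REPLICA_COUNT ?? "1", 10);
  return {index, count};
}

// A replica owns a shard when the shard number modulo the pool size equals its index.
function ownsShard(shardNumber, {index, count}) {
  return shardNumber % count === index;
}
```

A worker would typically call `replicaFromEnv(process.env)` once at startup. With `replicas: 4`, replica `worker#2` would then own shards 2, 6, 10, and so on.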
140
+
141
+ For workers that quiesce or drain via a command, set a `lifecycle` block —
142
+ Rollbridge runs `quietCommand`, then drains (`drainCommand`/`drainTimeoutMs`),
143
+ then `stopCommand`/`stopSignal`, then `SIGKILL` after `gracefulStopMs` when
144
+ gracefully stopping the process. Each hook is bounded so it can't wedge a stop.
145
+
146
+ Set `nonBlockingDrain: true` on a worker companion to start its graceful stop the
147
+ moment its release is retired — in parallel with the proxied connection drain,
148
+ not after it — so new workers handle new work while the old workers finish theirs.
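
A `lifecycle` block combined with `nonBlockingDrain` might be written as in the sketch below. The `bin/worker-*` commands and the timeout values are placeholders invented for illustration; the field names are the ones listed above. No custom `stopSignal` is set, on the assumption that `stopCommand` takes its place.

```js
// Sketch of a worker companion with lifecycle hooks; the bin/worker-* commands are hypothetical.
const worker = {
  id: "worker",
  policy: "companion",
  command: "…",
  nonBlockingDrain: true, // start the graceful stop as soon as the release is retired
  gracefulStopMs: 60000,  // SIGKILL fallback after this long
  lifecycle: {
    quietCommand: "bin/worker-quiet", // stop picking up new jobs
    drainCommand: "bin/worker-drain", // wait for in-flight jobs
    drainTimeoutMs: 45000,            // bounds the drain step; must be positive
    stopCommand: "bin/worker-stop"    // runs instead of a stop signal
  }
};
```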
149
+
150
+ See [`docs/workers.md`](docs/workers.md) for the full safe background-job worker
151
+ deployment pattern — companion policy, `replicas`, and finishing in-flight jobs
152
+ on deploy with `stopSignal`/`lifecycle` + `gracefulStopMs`.
153
+
106
154
  Set `releaseRetention` to bound how many stopped (drained) releases the daemon
107
155
  keeps in memory and reports in `status`. `keep` (default `10`) retains the most
108
156
  recent stopped releases; `maxAgeMs` (default `0`, disabled) also prunes stopped
@@ -114,6 +162,32 @@ owns cleaning up on-disk release directories.
114
162
  releaseRetention: {keep: 5, maxAgeMs: 86400000}
115
163
  ```
116
164
 
165
+ Set `statePath` to have the daemon persist its state to a file (active/draining
166
+ releases, process pids, counters, recent events). On the next startup it reads
167
+ any leftover file and reports managed processes still alive from a daemon that
168
+ didn't shut down cleanly — advisory orphan detection. After a crash, run
169
+ `rollbridge recover` to list those leftovers and `rollbridge recover --force` to
170
+ stop them before restarting the daemon. A clean `shutdown` removes the file. See
171
+ [`docs/config.md`](docs/config.md#statepath).
172
+
173
+ ```js
174
+ statePath: "/var/lib/rollbridge/ticket-server.state.json"
175
+ ```
176
+
177
+ During the first migration from an old supervisor, set `legacyTakeover` and run
178
+ `rollbridge predeploy-cleanup --release-path <path>` before `rollbridge deploy`.
179
+ Rollbridge will only stop configured legacy processes when no reusable active
180
+ Rollbridge release is running.
181
+
182
+ ```js
183
+ legacyTakeover: {
184
+ screens: ["ticket-server"],
185
+ processes: [
186
+ {name: "legacy web", includes: ["/home/dev/ticket-server/", "velocious server", "--port 8082"]}
187
+ ]
188
+ }
189
+ ```
190
+
117
191
  A function export receives no arguments and lets you build the config at load
118
192
  time:
119
193
 
@@ -143,7 +217,10 @@ Referencing a placeholder with no value (including an unset `{{env.<NAME>}}`)
143
217
  fails the process start with a clear error, so typos surface immediately.
144
218
 
145
219
  Production-ready examples live in `examples/`, including
146
- `examples/tensorbuzz.com.js` for the current TensorBuzz backend deployment.
220
+ `examples/tensorbuzz.com.js` for the current TensorBuzz backend deployment; see
221
+ [`docs/tensorbuzz-runbook.md`](docs/tensorbuzz-runbook.md) for the matching
222
+ production runbook (ports, deploy ordering, rollback constraints, and day-to-day
223
+ operations).
147
224
 
148
225
  See [`docs/velocious.md`](docs/velocious.md) for a Velocious deployment guide —
149
226
  how Beacon, background-jobs-main, background-jobs-worker, and the web process map
@@ -348,8 +425,12 @@ rollbridge status --config rollbridge.js
348
425
 
349
426
  `status` reports each managed process's `state`, `pid`, recent `logs`, last
350
427
  `exitCode`/`exitSignal`, and — per process — its automatic-restart count
351
- (`restarts`), last start time (`startedAt`), and current `uptimeMs` while
352
- running.
428
+ (`restarts`), last start time (`startedAt`), current `uptimeMs` while running,
429
+ and why it last started (`lastStartReason`: `deploy`, `crash`, `manual`, or
430
+ `memory`). The same reason appears on each `process started` entry in
431
+ `rollbridge events`. For memory-supervised processes it also reports current
432
+ `rssBytes`, `memoryRestarts`, `lastMemoryRestartAt`, and `children` (the sampled
433
+ process tree — each group member's `pid`, `command`, and `rssBytes`).
353
434
 
354
435
  Print the recent captured stdout/stderr per process (a one-shot snapshot of the
355
436
  retained `outputLines`, not a live stream):
@@ -359,12 +440,31 @@ rollbridge logs --config rollbridge.js
359
440
  rollbridge logs --config rollbridge.js --process web
360
441
  ```
361
442
 
443
+ Print the daemon's recent structured event history — deploys, traffic switches,
444
+ release stops, process crashes/restarts, and failed commands (the most recent
445
+ 1000 events, in memory):
446
+
447
+ ```bash
448
+ rollbridge events --config rollbridge.js
449
+ rollbridge events --config rollbridge.js --limit 20
450
+ ```
451
+
362
452
  Stop the active release:
363
453
 
364
454
  ```bash
365
455
  rollbridge stop --config rollbridge.js
366
456
  ```
367
457
 
458
+ Roll back to a previous release: Rollbridge re-starts it, health-checks it, and switches
459
+ traffic back (defaults to the most recently retired release; a failed rollback
460
+ leaves the current release active). Rollback manages processes only, not
461
+ database migrations:
462
+
463
+ ```bash
464
+ rollbridge rollback --config rollbridge.js # the previous release
465
+ rollbridge rollback --config rollbridge.js --release-id v3
466
+ ```
467
+
368
468
  Restart non-proxied processes in place — all of them, one by id, or a policy
369
469
  group (the proxied process is never restarted; use `deploy` for that):
370
470
 
@@ -380,6 +480,20 @@ Shut down the daemon and managed processes:
380
480
  rollbridge shutdown --config rollbridge.js
381
481
  ```
382
482
 
483
+ Prepare a first Rollbridge deploy by recovering Rollbridge-managed orphans and
484
+ stopping configured legacy processes:
485
+
486
+ ```bash
487
+ rollbridge predeploy-cleanup --config rollbridge.js --release-path /srv/app/current
488
+ ```
489
+
490
+ Enable shell completion (bash or zsh) for command names and option flags:
491
+
492
+ ```bash
493
+ source <(rollbridge completion bash) # add to ~/.bashrc
494
+ source <(rollbridge completion zsh) # add to ~/.zshrc
495
+ ```
496
+
383
497
  ## Nginx
384
498
 
385
499
  Nginx should proxy to Rollbridge, not directly to Velocious:
@@ -432,6 +546,10 @@ The daemon is long-lived and survives deploys. **Deploy with
432
546
  release paths are passed per deploy. Use `command -v rollbridge` to find the
433
547
  absolute CLI path for `ExecStart`.
434
548
 
549
+ See [`docs/logging.md`](docs/logging.md) for where the daemon's JSON logs go
550
+ (stdout / journald / the `--daemon-log-path` file) and how to rotate them — the
551
+ daemon holds its log file open, so logrotate needs `copytruncate`.
552
+
435
553
  ## Deployment Notes
436
554
 
437
555
  Run migrations before `rollbridge deploy`, and keep migrations backwards-compatible while old and new web releases overlap. For stable local brokers such as Velocious Beacon or `background-jobs-main`, use `service` when the process should survive deploys and restart from the latest successful release if it crashes.
@@ -450,6 +568,9 @@ The release script owns the package version bump, lockfile update, default-branc
450
568
  commit, push, and npm publish. Do not run `npm version` manually before running
451
569
  it.
452
570
 
571
+ See [`docs/releasing.md`](docs/releasing.md) for the maintainer release checklist
572
+ — the pre-flight checks before `npm run release:patch` and what to verify after.
573
+
453
574
  ## License
454
575
 
455
576
  Rollbridge is released under the [MIT License](LICENSE).
package/TODO.md CHANGED
@@ -19,57 +19,58 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
19
19
 
20
20
  ## Major Features
21
21
 
22
- - [ ] Memory supervision.
23
- - [ ] Add per-process memory config with an RSS limit, check interval, warning threshold, and restart policy.
24
- - [ ] Measure the managed process tree, not only the shell wrapper PID.
25
- - [ ] Report memory stats and last memory-triggered restart in `status`.
26
- - [ ] Restart memory-heavy workers gracefully when possible, with a forced stop timeout.
27
- - [ ] Add tests with a fixture process that allocates memory above the configured limit.
28
- - [ ] Worker auto-restart and restart policy controls.
22
+ - [x] Memory supervision.
23
+ - [x] Add per-process memory config with an RSS limit, check interval, warning threshold, and restart policy.
24
+ - [x] Measure the managed process tree, not only the shell wrapper PID. (Sums RSS across the process group via `/proc`.)
25
+ - [x] Report memory stats and last memory-triggered restart in `status`.
26
+ - [x] Restart memory-heavy workers gracefully when possible, with a forced stop timeout.
27
+ - [x] Add tests with a fixture process that allocates memory above the configured limit.
28
+ - [x] Worker auto-restart and restart policy controls.
29
29
  - [x] Add config for max restarts, restart window, exponential backoff, and disabled restart behavior (per-process `restart` policy).
30
- - [ ] Distinguish crash restarts, deploy replacements, manual restarts, and memory restarts in status/events.
30
+ - [x] Distinguish crash restarts, deploy replacements, manual restarts, and memory restarts in status/events. (Per-process `lastStartReason` + a `reason` on the `process started` event; the `memory` reason is wired and fires once memory supervision restarts a process.)
31
31
  - [x] Add a `restart` CLI command for a single process, a policy group, or all non-proxied workers.
32
- - [ ] Keep restart behavior safe for job workers by using lifecycle hooks before termination.
33
- - [ ] Graceful job-worker lifecycle.
34
- - [ ] Add generic lifecycle hooks such as `quietCommand`, `drainCommand`, `drainTimeoutMs`, and `stopCommand`.
35
- - [ ] Support signal-only lifecycle steps for workers that can quiet on a Unix signal.
36
- - [ ] Add a non-blocking drain mode so new workers can start while old workers finish running jobs.
37
- - [ ] Document a Velocious background-jobs-worker recipe once the lifecycle contract is implemented.
38
- - [ ] Replicas and stable worker indexes.
39
- - [ ] Allow one process config to start multiple replicas.
40
- - [ ] Expose `ROLLBRIDGE_REPLICA_INDEX`, replica count, and per-replica template context.
41
- - [ ] Restart or stop one replica without affecting the rest.
42
- - [ ] Preserve readable status output for replica groups.
43
- - [ ] Persistent daemon state and recovery.
44
- - [ ] Persist active release, draining releases, process metadata, counters, and recent events.
45
- - [ ] Reconnect status to still-running child processes after daemon restart where possible.
46
- - [ ] Detect and report orphaned Rollbridge-managed processes.
47
- - [ ] Add a recovery mode for safe startup after daemon crash or machine reboot.
48
- - [ ] Rollback support.
49
- - [ ] Keep enough release metadata to switch traffic back to a previous healthy release.
50
- - [ ] Add a `rollback` CLI command that health-checks the target before switching.
51
- - [ ] Define how rollback interacts with singleton workers and draining releases.
52
- - [ ] Document migration constraints for rollback.
53
- - [ ] Observability and diagnostics.
54
- - [ ] Add structured event history for deploys, switches, stops, crashes, memory restarts, and failed commands.
32
+ - [x] Keep restart behavior safe for job workers by using lifecycle hooks before termination. (Manual restart, memory restart, and deploy-drain stops all run the `lifecycle` hooks via `stop()`.)
33
+ - [x] Graceful job-worker lifecycle.
34
+ - [x] Add generic lifecycle hooks such as `quietCommand`, `drainCommand`, `drainTimeoutMs`, and `stopCommand` (per-process `lifecycle`).
35
+ - [x] Support signal-only lifecycle steps for workers that can quiet on a Unix signal. (Per-process `stopSignal`; sent before the `SIGKILL`-after-`gracefulStopMs` fallback.)
36
+ - [x] Add a non-blocking drain mode so new workers can start while old workers finish running jobs (per-process `nonBlockingDrain`; drains the worker in parallel with the connection drain).
37
+ - [x] Document a Velocious background-jobs-worker recipe once the lifecycle contract is implemented (`docs/velocious.md` → Worker recipe).
38
+ - [x] Replicas and stable worker indexes. (Supported on port-less `companion` processes; `proxied`/`singleton`/ported processes stay single.)
39
+ - [x] Allow one process config to start multiple replicas (`replicas`, companion-only for now).
40
+ - [x] Expose `ROLLBRIDGE_REPLICA_INDEX`, replica count, and per-replica template context (`{{replicaIndex}}`/`{{replicaCount}}`).
41
+ - [x] Restart or stop one replica without affecting the rest (`rollbridge restart --process worker#0`).
42
+ - [x] Preserve readable status output for replica groups (each instance shown as `<id>#<index>`).
43
+ - [x] Persistent daemon state and recovery.
44
+ - [x] Persist active release, draining releases, process metadata, counters, and recent events (opt-in `statePath`; atomic snapshot on change + periodic).
45
+ - [x] Reconnect status to still-running child processes after daemon restart where possible. (Feasible subset: `status` now includes an `orphans` array — still-alive managed processes from the prior daemon's persisted state, re-checked each call. Full re-management/stdout-exit re-attach stays infeasible; the daemon reports them and `rollbridge recover` stops them.)
46
+ - [x] Detect and report orphaned Rollbridge-managed processes. (On startup, reports persisted process pids that are still alive; advisory, see `statePath`.)
47
+ - [x] Add a recovery mode for safe startup after daemon crash or machine reboot. (`rollbridge recover` lists orphaned processes from the persisted state and, with `--force`, stops them and clears the state; refuses while a daemon is running.)
48
+ - [x] Rollback support.
49
+ - [x] Keep enough release metadata to switch traffic back to a previous healthy release.
50
+ - [x] Add a `rollback` CLI command that health-checks the target before switching.
51
+ - [x] Define how rollback interacts with singleton workers and draining releases. (Reuses the deploy flow: replaces singletons and drains the current release.)
52
+ - [x] Document migration constraints for rollback.
53
+ - [x] Observability and diagnostics.
54
+ - [x] Add structured event history for deploys, switches, stops, crashes, memory restarts, and failed commands. (In-memory `EventLog` tapping the daemon logger; memory-restart events populate once memory supervision logs them.)
55
55
  - [x] Add restart counters and uptime to status (exit reasons already reported via `exitCode`/`exitSignal`/`state`).
56
- - [ ] Add memory stats and child-process-tree details to status (with memory supervision).
56
+ - [x] Add memory stats and child-process-tree details to status (with memory supervision). (`rssBytes`/`memoryRestarts`/`lastMemoryRestartAt` plus `children`: the sampled process tree with each member's pid, command, and RSS.)
57
57
  - [x] Add a `logs` CLI command (recent per-process output from status).
58
- - [ ] Add an `events` CLI command (after structured event history lands).
59
- - [ ] Add optional file logging with rotation guidance.
58
+ - [x] Add an `events` CLI command (after structured event history lands).
59
+ - [x] Add optional file logging with rotation guidance (`docs/logging.md`; daemon log file via `--daemon-log-path`, logrotate `copytruncate`).
60
60
  - [x] Add machine-readable JSON output for all CLI commands (data commands print JSON; `validate`/`doctor`/`logs` take `--json`).
61
- - [ ] Config validation and doctoring.
61
+ - [x] Config validation and doctoring.
62
62
  - [x] Add `validate` to parse config and report all config errors without starting the daemon.
63
63
  - [x] Add `doctor` to check config validity, control socket reachability, proxy port availability, and control-socket directory writability.
64
- - [ ] Extend `doctor` with process-command, release-path, and log/state-path checks once those are resolvable (rendered templates, persisted state).
64
+ - [x] Extend `doctor` with state-path checks: state-path directory writability and orphaned-process reporting from a prior state file.
65
+ - [x] Extend `doctor` with process-command and release-path checks once those are resolvable (they need per-release rendered templates, which only exist at deploy time). (`rollbridge doctor --release-path <path>` renders each process's command/cwd/env against that release and checks the release directory, template resolvability, and rendered working directories; uses representative ports and replica index 0.)
65
66
  - [x] Validate duplicate process IDs, missing ports on proxied processes, invalid ranges, and the single-proxied-process policy rule.
66
- - [ ] Validate unsupported lifecycle-hook combinations once worker lifecycle hooks land.
67
+ - [x] Validate unsupported lifecycle-hook combinations once worker lifecycle hooks land. (`lifecycle.drainCommand` requires a positive `drainTimeoutMs`; `nonBlockingDrain` is companion-only; a `lifecycle.stopCommand` may not be combined with a custom `stopSignal`, since the command runs instead of the signal.)
67
68
  - [x] Include example fixes in validation output.
68
69
 
69
70
  ## Minor Features
70
71
 
71
72
  - [x] Add a control-socket permission option (`control.mode`) for shared deploy users.
72
- - [ ] Add control-socket owner/group options for shared deploy users (needs name-to-id resolution).
73
+ - [x] Add control-socket owner/group options for shared deploy users (`control.owner`/`control.group`, numeric id or name resolved via `/etc/passwd`/`/etc/group`).
73
74
  - [x] Make stale control socket diagnostics clearer when another daemon is still alive.
74
75
  - [x] Add old-release cleanup policies by age, count, and stopped state (`releaseRetention`).
75
76
  - [x] Add port allocation diagnostics when a range is exhausted.
@@ -77,7 +78,7 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
77
78
  - [x] Add process output retention config instead of a fixed recent-log count.
78
79
  - [x] Add environment variable interpolation from the daemon environment.
79
80
  - [x] Add `--config` default lookup resolving to `rollbridge.js` when no path is given.
80
- - [ ] Add shell completion generation for common shells.
81
+ - [x] Add shell completion generation for common shells (`rollbridge completion bash|zsh`).
81
82
  - [x] Add npm package metadata such as repository, license, bugs, and homepage.
82
83
  - [x] Add systemd service examples for the Rollbridge daemon.
83
84
  - [x] Add tests for malformed control socket JSON and unknown control commands.
@@ -89,12 +90,13 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
89
90
  - [x] Write a full config reference covering every field, default, and template variable (`docs/config.md`).
90
91
  - [x] Write a CLI reference for `daemon`, `ensure-daemon`, `deploy`, `status`, `stop`, `shutdown`, and future commands (`docs/cli.md`).
91
92
  - [x] Expand process policy docs with deployment examples for `proxied`, `companion`, `singleton`, and `service`.
92
- - [ ] Document memory checks and auto-restart behavior after the feature lands.
93
- - [ ] Document worker lifecycle hooks and safe background-job deployment patterns after the feature lands.
93
+ - [x] Document memory checks and auto-restart behavior after the feature lands (`docs/config.md` → `processes[].memory`).
94
+ - [x] Document safe background-job deployment patterns (`docs/workers.md`: companion + `replicas` + `stopSignal` + `gracefulStopMs`, old/new worker overlap).
95
+ - [x] Document worker lifecycle hooks (`docs/config.md` → `processes[].lifecycle`, `docs/workers.md`).
94
96
  - [x] Add a Velocious deployment guide with Beacon, background-jobs-main, background-jobs-worker, and web process examples (`docs/velocious.md`).
95
97
  - [x] Add an Nginx guide with WebSocket headers, timeouts, and common failure modes (`docs/nginx.md`).
96
98
  - [x] Add deploy-tool recipes that call Rollbridge CLI commands directly (`docs/deploy-recipes.md`).
97
99
  - [x] Add a Capistrano recipe showing shell commands only; do not add a Capistrano plugin or Rollbridge-specific Capistrano tasks (`docs/deploy-recipes.md`).
98
- - [ ] Add a TensorBuzz-specific runbook for current production ports, external services, deploy ordering, and rollback constraints.
100
+ - [x] Add a TensorBuzz-specific runbook for current production ports, external services, deploy ordering, and rollback constraints (`docs/tensorbuzz-runbook.md`).
99
101
  - [x] Add troubleshooting docs for health-check failures, port conflicts, stale sockets, crash loops, and stuck draining releases (`docs/troubleshooting.md`).
100
- - [ ] Add a release checklist for maintainers using `npm run release:patch`.
102
+ - [x] Add a release checklist for maintainers using `npm run release:patch` (`docs/releasing.md`).
package/docs/cli.md CHANGED
@@ -46,7 +46,8 @@ already accepting commands, waits until it responds, then prints the daemon
46
46
  status JSON. Idempotent — safe to call before every deploy.
47
47
 
48
48
  - `--daemon-log-path <path>` — file the detached daemon's stdout/stderr is
49
- appended to. Default: `/tmp/rollbridge-<application>.log`.
49
+ appended to. Default: `/tmp/rollbridge-<application>.log`. See
50
+ [`logging.md`](logging.md) for the log format and rotation guidance.
50
51
  - `--daemon-pid-path <path>` — file the detached daemon's PID is written to.
51
52
  Default: `/tmp/rollbridge-<application>.pid`.
52
53
  - `--daemon-start-timeout-ms <ms>` — how long to wait for the daemon to accept
@@ -79,6 +80,35 @@ active and the command errors.
79
80
  - `--ensure-daemon` — start the daemon first if it isn't running (honors the
80
81
  same `--daemon-*` options as `ensure-daemon`).
81
82
 
83
+ ## `rollback`
84
+
85
+ ```
86
+ rollbridge rollback [--config <path>] [--release-id <id>]
87
+ ```
88
+
89
+ Rolls back to a previously-active release by re-running the deploy flow on its
90
+ retained metadata: it re-starts that release, health-checks the proxied process,
91
+ switches traffic, replaces singletons, and drains the current release — exactly
92
+ like a deploy. With no `--release-id`, it targets the **most recently retired**
93
+ release (the one active just before the current). Prints the same
94
+ `{"activeReleaseId", "previousReleaseId"}` result as `deploy`.
95
+
96
+ Because rollback reuses the deploy flow, a failed rollback (the target won't
97
+ start or health-check) leaves the current release active and errors — it never
98
+ takes the site down. Singletons are replaced (old stopped, then the target's
99
+ started) and the current release is drained, just like any deploy.
100
+
101
+ Errors when there is no previous release, the `--release-id` is not a retained
102
+ release, or the target is already active. Only releases Rollbridge still retains
103
+ (see [`releaseRetention`](config.md#releaseretention)) can be rolled back to.
104
+
105
+ **Migration constraints.** Rollback only manages processes — it does **not**
106
+ revert database migrations or other external state. The target release's on-disk
107
+ directory must still exist, and its code must be compatible with the current
108
+ schema. Keep migrations backwards-compatible (the same rule that lets old and
109
+ new releases overlap during a deploy) so rolling code back to a retained release
110
+ stays safe.
111
+
82
112
  ## `status`
83
113
 
84
114
  ```
@@ -87,8 +117,19 @@ rollbridge status [--config <path>]
87
117
 
88
118
  Prints the daemon status JSON: the active release id, the proxy address, and —
89
119
  per release, service, and singleton process — its `state`, `pid`, automatic
90
- `restarts`, `startedAt`, `uptimeMs`, last `exitCode`/`exitSignal`, and recent
91
- `logs`.
120
+ `restarts`, `startedAt`, `uptimeMs`, last `exitCode`/`exitSignal`,
121
+ `lastStartReason` (`deploy`, `crash`, `manual`, or `memory`), and recent `logs`.
122
+ Memory-supervised processes also report `rssBytes`, `memoryRestarts`,
123
+ `lastMemoryRestartAt`, and `children` (the process tree: each group member's
124
+ `pid`, `command`, and `rssBytes`).
125
+
126
+ When [`statePath`](config.md#statepath) is configured, status also includes an
127
+ `orphans` array: managed processes from a **previous** daemon that are still
128
+ alive (`id`, `pid`, `releaseId`) — for example after the daemon restarted but its
129
+ detached children kept running. It is empty in the normal case. Liveness is
130
+ re-checked on each call, so the list clears itself as you stop the leftovers (see
131
+ [`recover`](#recover)). These are reported only — the new daemon can't re-adopt
132
+ them.
92
133
 
93
134
  ## `stop`
94
135
 
@@ -123,6 +164,49 @@ managed process (unknown, or a companion with no active release) is also an
123
164
  error. Restarting a `service` bounces a shared broker (for example Velocious
124
165
  Beacon), which briefly disrupts every process that depends on it.
125
166
 
167
+ ## `predeploy-cleanup`
168
+
169
+ ```
170
+ rollbridge predeploy-cleanup [--config <path>] [--release-path <path>]
171
+ ```
172
+
173
+ Prepares a host for the first Rollbridge deploy. If a Rollbridge daemon already
174
+ has an active release, the command exits without stopping anything. Otherwise it
175
+ recovers Rollbridge-managed orphans from `statePath` and stops the legacy
176
+ processes configured in [`legacyTakeover`](config.md#legacytakeover), then exits
177
+ before `rollbridge deploy` starts the new daemon/proxy.
178
+
179
+ When `--release-path` is provided, the command also restarts the existing daemon
180
+ if the active release uses a different Rollbridge package version than the
181
+ pending release. It also restarts the daemon when the active daemon's proxy host,
182
+ port, or upstream host differs from the pending config.
183
+
184
+ Use it immediately before `rollbridge deploy --ensure-daemon` when migrating an
185
+ app from `screen`, `process_bot`, or another old supervisor to Rollbridge.
186
+
187
+ ## `recover`
188
+
189
+ ```
190
+ rollbridge recover [--config <path>] [--force]
191
+ ```
192
+
193
+ Cleans up orphaned managed processes left by a **crashed** daemon. It reads the
194
+ persisted state ([`statePath`](config.md#statepath)) and finds managed processes
195
+ whose pids are still alive. Without `--force` it only **lists** them (a dry run);
196
+ with `--force` it stops each one's process group (`SIGTERM`, then `SIGKILL` after
197
+ `proxy.forceStopTimeoutMs`) and clears the stale state file.
198
+
199
+ Run it **before** restarting the daemon after a crash. It refuses to run while a
200
+ daemon (or another process) holds the control socket — those pids belong to a
201
+ live daemon, not a crash. A recycled pid can be a false positive, so review the
202
+ dry-run list before using `--force`.
203
+
204
+ If `--force` cannot stop some orphan (for example one now owned by another user,
205
+ so it can't be signaled), that process is reported as still running, the state
206
+ file is **kept** so you can investigate and re-run `recover`, and the command
207
+ exits non-zero. `recover` requires `statePath` and also exits non-zero when `statePath` is unset or a
208
+ daemon is running.
209
+
126
210
  ## `shutdown`
127
211
 
128
212
  ```
@@ -146,15 +230,53 @@ issue with an example fix. Exits `1` when issues are found. With `--json`, print
146
230
  ## `doctor`
147
231
 
148
232
  ```
149
- rollbridge doctor [--config <path>] [--json]
233
+ rollbridge doctor [--config <path>]
234
+ [--release-path <path>]
235
+ [--release-id <id>]
236
+ [--revision <sha>]
237
+ [--json]
150
238
  ```
151
239
 
152
240
  Validates the config, then probes the environment: whether a daemon already
153
241
  holds the control socket, whether the control socket's directory is writable,
154
- and whether the proxy port can be bound. Exits `1` when any check fails (so a
155
- green `doctor` means a fresh daemon can start). With `--json`, prints
242
+ and whether the proxy port can be bound. When [`statePath`](config.md#statepath)
243
+ is configured, it also checks that the state file's directory is writable and
244
+ reports any **orphaned processes** — managed processes still alive in a prior
245
+ state file, left by a daemon that didn't shut down cleanly (advisory; a recycled
246
+ pid can be a false positive, so verify before stopping). Exits `1` when any check
247
+ fails (so a green `doctor` means a fresh daemon can start). With `--json`, prints
156
248
  `{"checks": [{"name", "ok", "detail"}], "ok"}`.
157
249
 
250
+ ### Pre-flighting a release with `--release-path`
251
+
252
+ Process commands, working directories, and env values are
253
+ [templates](config.md#template-variables) (`{{releasePath}}`, `{{port}}`, …) that
254
+ are only rendered at deploy time, against a specific release. Pass
255
+ `--release-path <path>` to a **prepared release directory** to add deploy-time
256
+ checks against it:
257
+
258
+ - **release path** — the release directory exists.
259
+ - **process templates** — every process's `command`, `cwd`, and `env` templates
260
+ resolve (no `{{…}}` references an undefined variable). Ports are rendered with
261
+ the low end of each process's configured range.
262
+ - **process working directories** — each process's rendered `cwd` (defaulting to
263
+ the release path) exists.
264
+
265
+ `--release-id` and `--revision` set `{{releaseId}}`/`{{revision}}` for rendering
266
+ (defaulting the way `deploy` does: `--release-id` falls back to `--revision` or
267
+ the release path's basename, and `--revision` falls back to `--release-id`). Run
268
+ it as part of a deploy pipeline, after preparing the release and before
269
+ `rollbridge deploy`, to catch a template typo or a missing directory before
270
+ traffic is involved:
271
+
272
+ ```bash
273
+ rollbridge doctor --config /etc/rollbridge/app.js --release-path /srv/app/releases/20260524
274
+ ```
275
+
276
+ These checks render replica index `0` and use representative ports, so they
277
+ catch template and path problems but not values that only exist once the daemon
278
+ allocates real ports and spawns processes.
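
The `--release-id`/`--revision` defaulting described in this section can be read as the sketch below. This is one interpretation of the prose, not Rollbridge's actual code: `basename` and `resolveReleaseIdentity` are hypothetical names, and leaving `revision` undefined when neither flag is given is an assumption.

```js
// One reading of the defaulting rules described above; not Rollbridge's actual code.
function basename(releasePath) {
  // Drop trailing slashes, then take the last path segment.
  return releasePath.replace(/\/+$/, "").split("/").pop();
}

function resolveReleaseIdentity({releasePath, releaseId, revision}) {
  return {
    // --release-id, else --revision, else the release directory's name
    releaseId: releaseId ?? revision ?? basename(releasePath),
    // --revision, else --release-id; undefined when neither flag is given (assumption)
    revision: revision ?? releaseId
  };
}
```

Under this reading, the `doctor` invocation shown earlier in this section, which passes neither flag, would render `{{releaseId}}` as `20260524`.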
279
+
158
280
  ## `logs`
159
281
 
160
282
  ```
@@ -166,6 +288,44 @@ snapshot of each process's `outputLines`, not a live stream. `--process <id>`
166
288
  limits output to one process. With `--json`, prints
167
289
  `[{"id", "source", "logs": [{"at", "line", "stream"}]}]`.
168
290
 
291
+ ## `events`
292
+
293
+ ```
294
+ rollbridge events [--config <path>] [--limit <count>] [--json]
295
+ ```
296
+
297
+ Prints the daemon's recent structured event history — deploys (`deploy
298
+ starting`, `traffic switched`, `deploy failed`), release stops (`release
299
+ stopped`, `release drained`), process lifecycle (`process started` — with a
300
+ `reason` of `deploy`, `crash`, `manual`, or `memory` — `process exited`,
301
+ `memory limit exceeded`, `restart limit reached`, `process restart requested`),
302
+ and failed control commands (`command failed`). Each event has a timestamp, a
303
+ message, and a structured data payload. The daemon keeps the most recent 1000 events in
304
+ memory (cleared on restart). `--limit <count>` shows only the most recent
305
+ `count`. With `--json`, prints `[{"at", "message", "data"}]`.
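
The `--json` output lends itself to simple filtering. The sketch below picks out memory-triggered restarts from an event list. The sample events are invented for illustration, and placing `reason` and `id` inside `data`, as well as the timestamp format, are assumptions about the payload.

```js
// Hypothetical sample of `rollbridge events --json` output; real payloads may differ.
const sampleEvents = [
  {at: "2026-05-24T10:00:00.000Z", message: "process started", data: {id: "web", reason: "deploy"}},
  {at: "2026-05-24T11:30:00.000Z", message: "memory limit exceeded", data: {id: "worker"}},
  {at: "2026-05-24T11:30:05.000Z", message: "process started", data: {id: "worker", reason: "memory"}}
];

// Keep only `process started` events whose reason is "memory".
function memoryRestarts(events) {
  return events.filter(
    (event) => event.message === "process started" && event.data?.reason === "memory"
  );
}
```

In practice the array would come from parsing the command's stdout with `JSON.parse` rather than from a literal.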
306
+
307
+ ## `completion`
308
+
309
+ ```
310
+ rollbridge completion <bash|zsh>
311
+ ```
312
+
313
+ Prints a shell completion script to stdout, generated by introspecting the
314
+ command set (so it never drifts from the real commands and options). It
315
+ completes command names, each command's option flags, and falls back to file
316
+ completion after an option that takes a value (bash). Enable it for the current
317
+ session, or add the line to your shell startup file:
318
+
319
+ ```bash
320
+ # bash (~/.bashrc)
321
+ source <(rollbridge completion bash)
322
+
323
+ # zsh (~/.zshrc)
324
+ source <(rollbridge completion zsh)
325
+ ```
326
+
327
+ An unsupported shell exits `1` with the list of supported shells.
328
+
169
329
  ## Exit codes
170
330
 
171
331
  - `0` — success.