rollbridge 0.1.5 → 0.1.6

package/README.md CHANGED
@@ -84,7 +84,10 @@ more or fewer lines for chatty or quiet processes.
  Set `control.mode` to an octal permission string (for example `"660"`) to
  chmod the control socket after it binds. This restricts which users can send
  control commands — useful when several deploy users share a group. When unset,
- the socket keeps the default permissions from the daemon's umask.
+ the socket keeps the default permissions from the daemon's umask. Pair it with
+ `control.owner` and `control.group` (a numeric id or a user/group name) to
+ `chown` the socket to a shared deploy group; names resolve via
+ `/etc/passwd`/`/etc/group`, and the daemon must run as a user allowed to chown it.
 
  Set the proxied process's `health.startDelayMs` (default `0`) to wait that long
  after the process starts before the first health probe — like a readiness
@@ -103,6 +106,51 @@ restart. With no `restart` block, a crashed process keeps restarting after
  restart: {maxRestarts: 5, windowMs: 60000, backoffFactor: 2, maxDelayMs: 30000}
  ```
 
+ Set a process's `memory` policy to supervise its resident memory (RSS) and
+ gracefully restart it when it grows too large. `memory.limitBytes` is the RSS
+ limit (measured across the whole process group, not just the wrapper);
+ `memory.warnBytes` logs a warning before the limit; `memory.checkIntervalMs`
+ (default `5000`) sets how often RSS is sampled. A memory restart is reported in
+ `status` and recorded in `events` (a `process started` with `reason: "memory"`).
+ See [`docs/config.md`](docs/config.md#processesmemory).
+
+ ```js
+ memory: {limitBytes: 536870912, warnBytes: 402653184, checkIntervalMs: 5000}
+ ```
+
+ Set a process's `stopSignal` (default `"SIGTERM"`) to the signal it quiets on, so
+ a worker finishes its in-flight work before exiting. Rollbridge sends `stopSignal`
+ to gracefully stop the process and `SIGKILL`s it only if it hasn't exited within
+ `gracefulStopMs`. For example, a job worker that drains on `SIGINT`:
+
+ ```js
+ {id: "worker", policy: "companion", command: "…", stopSignal: "SIGINT", gracefulStopMs: 60000}
+ ```
+
+ Set `replicas` on a port-less `companion` to run a pool of identical workers.
+ Each instance runs as `<id>#<index>` (`worker#0`, `worker#1`, …) — visible in
+ `status` and targetable by `rollbridge restart` (base id for all, `worker#0` for
+ one) — and gets `{{replicaIndex}}`/`{{replicaCount}}` and
+ `ROLLBRIDGE_REPLICA_INDEX`/`_COUNT` so each instance can pick a distinct shard or
+ queue. See [`docs/config.md`](docs/config.md#processesreplicas).
+
+ ```js
+ {id: "worker", policy: "companion", command: "npx velocious background-jobs-worker", replicas: 4}
+ ```
+
+ For workers that quiesce or drain via a command, set a `lifecycle` block —
+ Rollbridge runs `quietCommand`, then drains (`drainCommand`/`drainTimeoutMs`),
+ then `stopCommand`/`stopSignal`, then `SIGKILL` after `gracefulStopMs` when
+ gracefully stopping the process. Each hook is bounded so it can't wedge a stop.
+
+ Set `nonBlockingDrain: true` on a worker companion to start its graceful stop the
+ moment its release is retired — in parallel with the proxied connection drain,
+ not after it — so new workers handle new work while the old workers finish theirs.
+
+ See [`docs/workers.md`](docs/workers.md) for the full safe background-job worker
+ deployment pattern — companion policy, `replicas`, and finishing in-flight jobs
+ on deploy with `stopSignal`/`lifecycle` + `gracefulStopMs`.
+
  Set `releaseRetention` to bound how many stopped (drained) releases the daemon
  keeps in memory and reports in `status`. `keep` (default `10`) retains the most
  recent stopped releases; `maxAgeMs` (default `0`, disabled) also prunes stopped
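
The `replicas` paragraph in the hunk above says each instance receives `ROLLBRIDGE_REPLICA_INDEX` and `ROLLBRIDGE_REPLICA_COUNT` so it can pick a distinct shard. The following is a minimal worker-side sketch of one way to do that, not Rollbridge's own code. The helper names `replicaFromEnv` and `ownsShard`, the fallback to a single instance when the variables are unset, and the modulo sharding rule are all assumptions made for illustration.

```javascript
// Hypothetical worker-side helper: reads the replica variables Rollbridge is
// described as exporting. Falling back to index 0 of 1 when they are unset is
// an assumption for running the worker outside Rollbridge.
function replicaFromEnv(env) {
  const index = Number.parseInt(env.ROLLBRIDGE_REPLICA_INDEX ?? "0", 10);
  const count = Number.parseInt(env.ROLLBRIDGE_REPLICA_COUNT ?? "1", 10);
  const valid =
    Number.isInteger(index) && Number.isInteger(count) &&
    count >= 1 && index >= 0 && index < count;
  if (!valid) {
    throw new Error(`invalid replica settings: index=${index} count=${count}`);
  }
  return {index, count};
}

// Assumed sharding rule: a replica owns every shard whose number is congruent
// to its index modulo the replica count.
function ownsShard(replica, shardNumber) {
  return shardNumber % replica.count === replica.index;
}

// Example with made-up values: replica 1 of 4 looking at shards 0..7.
const replica = replicaFromEnv({
  ROLLBRIDGE_REPLICA_INDEX: "1",
  ROLLBRIDGE_REPLICA_COUNT: "4",
});
const shards = [0, 1, 2, 3, 4, 5, 6, 7].filter((s) => ownsShard(replica, s));
console.log(shards); // [ 1, 5 ]
```

In a real worker the argument would be `process.env`. Modulo sharding keeps every shard owned by exactly one replica as long as all replicas see the same count.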
@@ -114,6 +162,18 @@ owns cleaning up on-disk release directories.
  releaseRetention: {keep: 5, maxAgeMs: 86400000}
  ```
 
+ Set `statePath` to have the daemon persist its state to a file (active/draining
+ releases, process pids, counters, recent events). On the next startup it reads
+ any leftover file and reports managed processes still alive from a daemon that
+ didn't shut down cleanly — advisory orphan detection. After a crash, run
+ `rollbridge recover` to list those leftovers and `rollbridge recover --force` to
+ stop them before restarting the daemon. A clean `shutdown` removes the file. See
+ [`docs/config.md`](docs/config.md#statepath).
+
+ ```js
+ statePath: "/var/lib/rollbridge/ticket-server.state.json"
+ ```
+
  A function export receives no arguments and lets you build the config at load
  time:
 
@@ -143,7 +203,10 @@ Referencing a placeholder with no value (including an unset `{{env.<NAME>}}`)
  fails the process start with a clear error, so typos surface immediately.
 
  Production-ready examples live in `examples/`, including
- `examples/tensorbuzz.com.js` for the current TensorBuzz backend deployment.
+ `examples/tensorbuzz.com.js` for the current TensorBuzz backend deployment; see
+ [`docs/tensorbuzz-runbook.md`](docs/tensorbuzz-runbook.md) for the matching
+ production runbook (ports, deploy ordering, rollback constraints, and day-to-day
+ operations).
 
  See [`docs/velocious.md`](docs/velocious.md) for a Velocious deployment guide —
  how Beacon, background-jobs-main, background-jobs-worker, and the web process map
@@ -348,8 +411,12 @@ rollbridge status --config rollbridge.js
 
  `status` reports each managed process's `state`, `pid`, recent `logs`, last
  `exitCode`/`exitSignal`, and — per process — its automatic-restart count
- (`restarts`), last start time (`startedAt`), and current `uptimeMs` while
- running.
+ (`restarts`), last start time (`startedAt`), current `uptimeMs` while running,
+ and why it last started (`lastStartReason`: `deploy`, `crash`, `manual`, or
+ `memory`). The same reason appears on each `process started` entry in
+ `rollbridge events`. For memory-supervised processes it also reports current
+ `rssBytes`, `memoryRestarts`, `lastMemoryRestartAt`, and `children` (the sampled
+ process tree — each group member's `pid`, `command`, and `rssBytes`).
 
  Print the recent captured stdout/stderr per process (a one-shot snapshot of the
  retained `outputLines`, not a live stream):
@@ -359,12 +426,31 @@ rollbridge logs --config rollbridge.js
  rollbridge logs --config rollbridge.js --process web
  ```
 
+ Print the daemon's recent structured event history — deploys, traffic switches,
+ release stops, process crashes/restarts, and failed commands (the most recent
+ 1000 events, in memory):
+
+ ```bash
+ rollbridge events --config rollbridge.js
+ rollbridge events --config rollbridge.js --limit 20
+ ```
+
  Stop the active release:
 
  ```bash
  rollbridge stop --config rollbridge.js
  ```
 
+ Roll back to a previous release — re-starts it, health-checks it, and switches
+ traffic back (defaults to the most recently retired release; a failed rollback
+ leaves the current release active). Rollback manages processes only, not
+ database migrations:
+
+ ```bash
+ rollbridge rollback --config rollbridge.js # the previous release
+ rollbridge rollback --config rollbridge.js --release-id v3
+ ```
+
  Restart non-proxied processes in place — all of them, one by id, or a policy
  group (the proxied process is never restarted; use `deploy` for that):
 
@@ -380,6 +466,13 @@ Shut down the daemon and managed processes:
  rollbridge shutdown --config rollbridge.js
  ```
 
+ Enable shell completion (bash or zsh) for command names and option flags:
+
+ ```bash
+ source <(rollbridge completion bash) # add to ~/.bashrc
+ source <(rollbridge completion zsh) # add to ~/.zshrc
+ ```
+
  ## Nginx
 
  Nginx should proxy to Rollbridge, not directly to Velocious:
@@ -432,6 +525,10 @@ The daemon is long-lived and survives deploys. **Deploy with
  release paths are passed per deploy. Use `command -v rollbridge` to find the
  absolute CLI path for `ExecStart`.
 
+ See [`docs/logging.md`](docs/logging.md) for where the daemon's JSON logs go
+ (stdout / journald / the `--daemon-log-path` file) and how to rotate them — the
+ daemon holds its log file open, so logrotate needs `copytruncate`.
+
  ## Deployment Notes
 
  Run migrations before `rollbridge deploy`, and keep migrations backwards-compatible while old and new web releases overlap. For stable local brokers such as Velocious Beacon or `background-jobs-main`, use `service` when the process should survive deploys and restart from the latest successful release if it crashes.
@@ -450,6 +547,9 @@ The release script owns the package version bump, lockfile update, default-branc
  commit, push, and npm publish. Do not run `npm version` manually before running
  it.
 
+ See [`docs/releasing.md`](docs/releasing.md) for the maintainer release checklist
+ — the pre-flight checks before `npm run release:patch` and what to verify after.
+
  ## License
 
  Rollbridge is released under the [MIT License](LICENSE).
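
The README changes above describe `memory.limitBytes` and `memory.warnBytes` as raw byte counts and say RSS is summed across the whole process group, with the sampled members reported as `children`. The sketch below shows how the example values decompose into MiB and how one sample could be classified. The helper `classifySample` and its `>=` comparisons are assumptions for illustration; the diff does not state whether the daemon compares with `>` or `>=`.

```javascript
const MiB = 1024 * 1024;

// Same values as the README example, written as MiB multiples.
const memory = {
  limitBytes: 512 * MiB, // 536870912
  warnBytes: 384 * MiB, // 402653184
  checkIntervalMs: 5000,
};

// Hypothetical helper: sums rssBytes over the sampled process tree (the shape
// of `children` in status) and classifies the total against the policy.
// The inclusive comparisons are an assumption, not documented behaviour.
function classifySample(children, policy) {
  const totalRssBytes = children.reduce((sum, child) => sum + child.rssBytes, 0);
  if (totalRssBytes >= policy.limitBytes) return "restart";
  if (policy.warnBytes !== undefined && totalRssBytes >= policy.warnBytes) return "warn";
  return "ok";
}

// Made-up sample: a wrapper plus one child, 400 MiB in total.
const sample = [
  {pid: 100, command: "sh -c worker", rssBytes: 100 * MiB},
  {pid: 101, command: "node worker.js", rssBytes: 300 * MiB},
];
console.log(classifySample(sample, memory)); // "warn"
```

Writing the limits as `512 * MiB` in a config file keeps them readable while producing the same numbers as the literal byte counts.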
package/TODO.md CHANGED
@@ -19,57 +19,58 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
 
  ## Major Features
 
- - [ ] Memory supervision.
- - [ ] Add per-process memory config with an RSS limit, check interval, warning threshold, and restart policy.
- - [ ] Measure the managed process tree, not only the shell wrapper PID.
- - [ ] Report memory stats and last memory-triggered restart in `status`.
- - [ ] Restart memory-heavy workers gracefully when possible, with a forced stop timeout.
- - [ ] Add tests with a fixture process that allocates memory above the configured limit.
+ - [x] Memory supervision.
+ - [x] Add per-process memory config with an RSS limit, check interval, warning threshold, and restart policy.
+ - [x] Measure the managed process tree, not only the shell wrapper PID. (Sums RSS across the process group via `/proc`.)
+ - [x] Report memory stats and last memory-triggered restart in `status`.
+ - [x] Restart memory-heavy workers gracefully when possible, with a forced stop timeout.
+ - [x] Add tests with a fixture process that allocates memory above the configured limit.
  - [ ] Worker auto-restart and restart policy controls.
  - [x] Add config for max restarts, restart window, exponential backoff, and disabled restart behavior (per-process `restart` policy).
- - [ ] Distinguish crash restarts, deploy replacements, manual restarts, and memory restarts in status/events.
+ - [x] Distinguish crash restarts, deploy replacements, manual restarts, and memory restarts in status/events. (Per-process `lastStartReason` + a `reason` on the `process started` event; the `memory` reason is wired and fires once memory supervision restarts a process.)
  - [x] Add a `restart` CLI command for a single process, a policy group, or all non-proxied workers.
- - [ ] Keep restart behavior safe for job workers by using lifecycle hooks before termination.
- - [ ] Graceful job-worker lifecycle.
- - [ ] Add generic lifecycle hooks such as `quietCommand`, `drainCommand`, `drainTimeoutMs`, and `stopCommand`.
- - [ ] Support signal-only lifecycle steps for workers that can quiet on a Unix signal.
- - [ ] Add a non-blocking drain mode so new workers can start while old workers finish running jobs.
- - [ ] Document a Velocious background-jobs-worker recipe once the lifecycle contract is implemented.
- - [ ] Replicas and stable worker indexes.
- - [ ] Allow one process config to start multiple replicas.
- - [ ] Expose `ROLLBRIDGE_REPLICA_INDEX`, replica count, and per-replica template context.
- - [ ] Restart or stop one replica without affecting the rest.
- - [ ] Preserve readable status output for replica groups.
- - [ ] Persistent daemon state and recovery.
- - [ ] Persist active release, draining releases, process metadata, counters, and recent events.
- - [ ] Reconnect status to still-running child processes after daemon restart where possible.
- - [ ] Detect and report orphaned Rollbridge-managed processes.
- - [ ] Add a recovery mode for safe startup after daemon crash or machine reboot.
- - [ ] Rollback support.
- - [ ] Keep enough release metadata to switch traffic back to a previous healthy release.
- - [ ] Add a `rollback` CLI command that health-checks the target before switching.
- - [ ] Define how rollback interacts with singleton workers and draining releases.
- - [ ] Document migration constraints for rollback.
+ - [x] Keep restart behavior safe for job workers by using lifecycle hooks before termination. (Manual restart, memory restart, and deploy-drain stops all run the `lifecycle` hooks via `stop()`.)
+ - [x] Graceful job-worker lifecycle.
+ - [x] Add generic lifecycle hooks such as `quietCommand`, `drainCommand`, `drainTimeoutMs`, and `stopCommand` (per-process `lifecycle`).
+ - [x] Support signal-only lifecycle steps for workers that can quiet on a Unix signal. (Per-process `stopSignal`; sent before the `SIGKILL`-after-`gracefulStopMs` fallback.)
+ - [x] Add a non-blocking drain mode so new workers can start while old workers finish running jobs (per-process `nonBlockingDrain`; drains the worker in parallel with the connection drain).
+ - [x] Document a Velocious background-jobs-worker recipe once the lifecycle contract is implemented (`docs/velocious.md` → Worker recipe).
+ - [x] Replicas and stable worker indexes. (Supported on port-less `companion` processes; `proxied`/`singleton`/ported processes stay single.)
+ - [x] Allow one process config to start multiple replicas (`replicas`, companion-only for now).
+ - [x] Expose `ROLLBRIDGE_REPLICA_INDEX`, replica count, and per-replica template context (`{{replicaIndex}}`/`{{replicaCount}}`).
+ - [x] Restart or stop one replica without affecting the rest (`rollbridge restart --process worker#0`).
+ - [x] Preserve readable status output for replica groups (each instance shown as `<id>#<index>`).
+ - [x] Persistent daemon state and recovery.
+ - [x] Persist active release, draining releases, process metadata, counters, and recent events (opt-in `statePath`; atomic snapshot on change + periodic).
+ - [x] Reconnect status to still-running child processes after daemon restart where possible. (Feasible subset: `status` now includes an `orphans` array — still-alive managed processes from the prior daemon's persisted state, re-checked each call. Full re-management/stdout-exit re-attach stays infeasible; the daemon reports them and `rollbridge recover` stops them.)
+ - [x] Detect and report orphaned Rollbridge-managed processes. (On startup, reports persisted process pids that are still alive; advisory, see `statePath`.)
+ - [x] Add a recovery mode for safe startup after daemon crash or machine reboot. (`rollbridge recover` lists orphaned processes from the persisted state and, with `--force`, stops them and clears the state; refuses while a daemon is running.)
+ - [x] Rollback support.
+ - [x] Keep enough release metadata to switch traffic back to a previous healthy release.
+ - [x] Add a `rollback` CLI command that health-checks the target before switching.
+ - [x] Define how rollback interacts with singleton workers and draining releases. (Reuses the deploy flow: replaces singletons and drains the current release.)
+ - [x] Document migration constraints for rollback.
  - [ ] Observability and diagnostics.
- - [ ] Add structured event history for deploys, switches, stops, crashes, memory restarts, and failed commands.
+ - [x] Add structured event history for deploys, switches, stops, crashes, memory restarts, and failed commands. (In-memory `EventLog` tapping the daemon logger; memory-restart events populate once memory supervision logs them.)
  - [x] Add restart counters and uptime to status (exit reasons already reported via `exitCode`/`exitSignal`/`state`).
- - [ ] Add memory stats and child-process-tree details to status (with memory supervision).
+ - [x] Add memory stats and child-process-tree details to status (with memory supervision). (`rssBytes`/`memoryRestarts`/`lastMemoryRestartAt` plus `children`: the sampled process tree with each member's pid, command, and RSS.)
  - [x] Add a `logs` CLI command (recent per-process output from status).
- - [ ] Add an `events` CLI command (after structured event history lands).
- - [ ] Add optional file logging with rotation guidance.
+ - [x] Add an `events` CLI command (after structured event history lands).
+ - [x] Add optional file logging with rotation guidance (`docs/logging.md`; daemon log file via `--daemon-log-path`, logrotate `copytruncate`).
  - [x] Add machine-readable JSON output for all CLI commands (data commands print JSON; `validate`/`doctor`/`logs` take `--json`).
  - [ ] Config validation and doctoring.
  - [x] Add `validate` to parse config and report all config errors without starting the daemon.
  - [x] Add `doctor` to check config validity, control socket reachability, proxy port availability, and control-socket directory writability.
- - [ ] Extend `doctor` with process-command, release-path, and log/state-path checks once those are resolvable (rendered templates, persisted state).
+ - [x] Extend `doctor` with state-path checks: state-path directory writability and orphaned-process reporting from a prior state file.
+ - [x] Extend `doctor` with process-command and release-path checks once those are resolvable (they need per-release rendered templates, which only exist at deploy time). (`rollbridge doctor --release-path <path>` renders each process's command/cwd/env against that release and checks the release directory, template resolvability, and rendered working directories; uses representative ports and replica index 0.)
  - [x] Validate duplicate process IDs, missing ports on proxied processes, invalid ranges, and the single-proxied-process policy rule.
- - [ ] Validate unsupported lifecycle-hook combinations once worker lifecycle hooks land.
+ - [x] Validate unsupported lifecycle-hook combinations once worker lifecycle hooks land. (`lifecycle.drainCommand` requires a positive `drainTimeoutMs`; `nonBlockingDrain` is companion-only; a `lifecycle.stopCommand` may not be combined with a custom `stopSignal`, since the command runs instead of the signal.)
  - [x] Include example fixes in validation output.
 
  ## Minor Features
 
  - [x] Add a control-socket permission option (`control.mode`) for shared deploy users.
- - [ ] Add control-socket owner/group options for shared deploy users (needs name-to-id resolution).
+ - [x] Add control-socket owner/group options for shared deploy users (`control.owner`/`control.group`, numeric id or name resolved via `/etc/passwd`/`/etc/group`).
  - [x] Make stale control socket diagnostics clearer when another daemon is still alive.
  - [x] Add old-release cleanup policies by age, count, and stopped state (`releaseRetention`).
  - [x] Add port allocation diagnostics when a range is exhausted.
@@ -77,7 +78,7 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
  - [x] Add process output retention config instead of a fixed recent-log count.
  - [x] Add environment variable interpolation from the daemon environment.
  - [x] Add `--config` default lookup resolving to `rollbridge.js` when no path is given.
- - [ ] Add shell completion generation for common shells.
+ - [x] Add shell completion generation for common shells (`rollbridge completion bash|zsh`).
  - [x] Add npm package metadata such as repository, license, bugs, and homepage.
  - [x] Add systemd service examples for the Rollbridge daemon.
  - [x] Add tests for malformed control socket JSON and unknown control commands.
@@ -89,12 +90,13 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
  - [x] Write a full config reference covering every field, default, and template variable (`docs/config.md`).
  - [x] Write a CLI reference for `daemon`, `ensure-daemon`, `deploy`, `status`, `stop`, `shutdown`, and future commands (`docs/cli.md`).
  - [x] Expand process policy docs with deployment examples for `proxied`, `companion`, `singleton`, and `service`.
- - [ ] Document memory checks and auto-restart behavior after the feature lands.
- - [ ] Document worker lifecycle hooks and safe background-job deployment patterns after the feature lands.
+ - [x] Document memory checks and auto-restart behavior after the feature lands (`docs/config.md` → `processes[].memory`).
+ - [x] Document safe background-job deployment patterns (`docs/workers.md`: companion + `replicas` + `stopSignal` + `gracefulStopMs`, old/new worker overlap).
+ - [x] Document worker lifecycle hooks (`docs/config.md` → `processes[].lifecycle`, `docs/workers.md`).
  - [x] Add a Velocious deployment guide with Beacon, background-jobs-main, background-jobs-worker, and web process examples (`docs/velocious.md`).
  - [x] Add an Nginx guide with WebSocket headers, timeouts, and common failure modes (`docs/nginx.md`).
  - [x] Add deploy-tool recipes that call Rollbridge CLI commands directly (`docs/deploy-recipes.md`).
  - [x] Add a Capistrano recipe showing shell commands only; do not add a Capistrano plugin or Rollbridge-specific Capistrano tasks (`docs/deploy-recipes.md`).
- - [ ] Add a TensorBuzz-specific runbook for current production ports, external services, deploy ordering, and rollback constraints.
+ - [x] Add a TensorBuzz-specific runbook for current production ports, external services, deploy ordering, and rollback constraints (`docs/tensorbuzz-runbook.md`).
  - [x] Add troubleshooting docs for health-check failures, port conflicts, stale sockets, crash loops, and stuck draining releases (`docs/troubleshooting.md`).
- - [ ] Add a release checklist for maintainers using `npm run release:patch`.
+ - [x] Add a release checklist for maintainers using `npm run release:patch` (`docs/releasing.md`).
package/docs/cli.md CHANGED
@@ -46,7 +46,8 @@ already accepting commands, waits until it responds, then prints the daemon
  status JSON. Idempotent — safe to call before every deploy.
 
  - `--daemon-log-path <path>` — file the detached daemon's stdout/stderr is
- appended to. Default: `/tmp/rollbridge-<application>.log`.
+ appended to. Default: `/tmp/rollbridge-<application>.log`. See
+ [`logging.md`](logging.md) for the log format and rotation guidance.
  - `--daemon-pid-path <path>` — file the detached daemon's PID is written to.
  Default: `/tmp/rollbridge-<application>.pid`.
  - `--daemon-start-timeout-ms <ms>` — how long to wait for the daemon to accept
@@ -79,6 +80,35 @@ active and the command errors.
  - `--ensure-daemon` — start the daemon first if it isn't running (honors the
  same `--daemon-*` options as `ensure-daemon`).
 
+ ## `rollback`
+
+ ```
+ rollbridge rollback [--config <path>] [--release-id <id>]
+ ```
+
+ Rolls back to a previously-active release by re-running the deploy flow on its
+ retained metadata: it re-starts that release, health-checks the proxied process,
+ switches traffic, replaces singletons, and drains the current release — exactly
+ like a deploy. With no `--release-id`, it targets the **most recently retired**
+ release (the one active just before the current). Prints the same
+ `{"activeReleaseId", "previousReleaseId"}` result as `deploy`.
+
+ Because rollback reuses the deploy flow, a failed rollback (the target won't
+ start or health-check) leaves the current release active and errors — it never
+ takes the site down. Singletons are replaced (old stopped, then the target's
+ started) and the current release is drained, just like any deploy.
+
+ Errors when there is no previous release, the `--release-id` is not a retained
+ release, or the target is already active. Only releases Rollbridge still retains
+ (see [`releaseRetention`](config.md#releaseretention)) can be rolled back to.
+
+ **Migration constraints.** Rollback only manages processes — it does **not**
+ revert database migrations or other external state. The target release's on-disk
+ directory must still exist, and its code must be compatible with the current
+ schema. Keep migrations backwards-compatible (the same rule that lets old and
+ new releases overlap during a deploy) so rolling code back to a retained release
+ stays safe.
+
  ## `status`
 
  ```
@@ -87,8 +117,19 @@ rollbridge status [--config <path>]
 
  Prints the daemon status JSON: the active release id, the proxy address, and —
  per release, service, and singleton process — its `state`, `pid`, automatic
- `restarts`, `startedAt`, `uptimeMs`, last `exitCode`/`exitSignal`, and recent
- `logs`.
+ `restarts`, `startedAt`, `uptimeMs`, last `exitCode`/`exitSignal`,
+ `lastStartReason` (`deploy`, `crash`, `manual`, or `memory`), and recent `logs`.
+ Memory-supervised processes also report `rssBytes`, `memoryRestarts`,
+ `lastMemoryRestartAt`, and `children` (the process tree: each group member's
+ `pid`, `command`, and `rssBytes`).
+
+ When [`statePath`](config.md#statepath) is configured, status also includes an
+ `orphans` array: managed processes from a **previous** daemon that are still
+ alive (`id`, `pid`, `releaseId`) — for example after the daemon restarted but its
+ detached children kept running. It is empty in the normal case. Liveness is
+ re-checked on each call, so the list clears itself as you stop the leftovers (see
+ [`recover`](#recover)). These are reported only — the new daemon can't re-adopt
+ them.
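
As a rough illustration of the `orphans` array described above, a monitoring script could parse the `rollbridge status` JSON and summarise any leftovers. This is a sketch under assumptions: the helper `describeOrphans` is hypothetical, the sample object is made up, and only the `id`, `pid`, and `releaseId` fields named in the text are used. Placing `orphans` at the top level of the status object is also an assumption.

```javascript
// Hypothetical helper: turns the documented orphan fields into readable lines.
// Treats a missing `orphans` key (no statePath configured) as an empty list.
function describeOrphans(status) {
  const orphans = Array.isArray(status.orphans) ? status.orphans : [];
  return orphans.map(
    (orphan) => `${orphan.id} (pid ${orphan.pid}, release ${orphan.releaseId})`
  );
}

// Made-up sample in the described shape; real output comes from
// `rollbridge status` and would be passed through JSON.parse first.
const sampleStatus = {
  orphans: [{id: "worker#0", pid: 4242, releaseId: "v3"}],
};

for (const line of describeOrphans(sampleStatus)) {
  console.log(`orphaned: ${line}`);
}
```

Because the list is advisory and a recycled pid can be a false positive, a script like this is better suited to alerting than to stopping processes automatically.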
 
  ## `stop`
 
@@ -123,6 +164,29 @@ managed process (unknown, or a companion with no active release) is also an
  error. Restarting a `service` bounces a shared broker (for example Velocious
  Beacon), which briefly disrupts every process that depends on it.
 
+ ## `recover`
+
+ ```
+ rollbridge recover [--config <path>] [--force]
+ ```
+
+ Cleans up orphaned managed processes left by a **crashed** daemon. It reads the
+ persisted state ([`statePath`](config.md#statepath)) and finds managed processes
+ whose pids are still alive. Without `--force` it only **lists** them (a dry run);
+ with `--force` it stops each one's process group (`SIGTERM`, then `SIGKILL` after
+ `proxy.forceStopTimeoutMs`) and clears the stale state file.
+
+ Run it **before** restarting the daemon after a crash. It refuses to run while a
+ daemon (or another process) holds the control socket — those pids belong to a
+ live daemon, not a crash. A recycled pid can be a false positive, so review the
+ dry-run list before using `--force`.
+
+ If `--force` cannot stop some orphan (for example one now owned by another user,
+ so it can't be signaled), that process is reported as still running, the state
+ file is **kept** so you can investigate and re-run `recover`, and the command
+ exits non-zero. Requires `statePath`; also exits non-zero when it is unset or a
+ daemon is running.
+
  ## `shutdown`
 
  ```
@@ -146,15 +210,53 @@ issue with an example fix. Exits `1` when issues are found. With `--json`, print
  ## `doctor`
 
  ```
- rollbridge doctor [--config <path>] [--json]
+ rollbridge doctor [--config <path>]
+ [--release-path <path>]
+ [--release-id <id>]
+ [--revision <sha>]
+ [--json]
  ```
 
  Validates the config, then probes the environment: whether a daemon already
  holds the control socket, whether the control socket's directory is writable,
- and whether the proxy port can be bound. Exits `1` when any check fails (so a
- green `doctor` means a fresh daemon can start). With `--json`, prints
+ and whether the proxy port can be bound. When [`statePath`](config.md#statepath)
+ is configured, it also checks that the state file's directory is writable and
+ reports any **orphaned processes** — managed processes still alive in a prior
+ state file, left by a daemon that didn't shut down cleanly (advisory; a recycled
+ pid can be a false positive, so verify before stopping). Exits `1` when any check
+ fails (so a green `doctor` means a fresh daemon can start). With `--json`, prints
  `{"checks": [{"name", "ok", "detail"}], "ok"}`.
 
+ ### Pre-flighting a release with `--release-path`
+
+ Process commands, working directories, and env values are
+ [templates](config.md#template-variables) (`{{releasePath}}`, `{{port}}`, …) that
+ are only rendered at deploy time, against a specific release. Pass
+ `--release-path <path>` to a **prepared release directory** to add deploy-time
+ checks against it:
+
+ - **release path** — the release directory exists.
+ - **process templates** — every process's `command`, `cwd`, and `env` templates
+ resolve (no `{{…}}` references an undefined variable). Ports are rendered with
+ the low end of each process's configured range.
+ - **process working directories** — each process's rendered `cwd` (defaulting to
+ the release path) exists.
+
+ `--release-id` and `--revision` set `{{releaseId}}`/`{{revision}}` for rendering
+ (defaulting the way `deploy` does: `--release-id` falls back to `--revision` or
+ the release path's basename, and `--revision` falls back to `--release-id`). Run
+ it as part of a deploy pipeline, after preparing the release and before
+ `rollbridge deploy`, to catch a template typo or a missing directory before
+ traffic is involved:
+
+ ```bash
+ rollbridge doctor --config /etc/rollbridge/app.js --release-path /srv/app/releases/20260524
+ ```
+
+ These checks render replica index `0` and use representative ports, so they
+ catch template and path problems but not values that only exist once the daemon
+ allocates real ports and spawns processes.
+
  ## `logs`
 
  ```
@@ -166,6 +268,44 @@ snapshot of each process's `outputLines`, not a live stream. `--process <id>`
  limits output to one process. With `--json`, prints
  `[{"id", "source", "logs": [{"at", "line", "stream"}]}]`.
 
+ ## `events`
+
+ ```
+ rollbridge events [--config <path>] [--limit <count>] [--json]
+ ```
+
+ Prints the daemon's recent structured event history — deploys (`deploy
+ starting`, `traffic switched`, `deploy failed`), release stops (`release
+ stopped`, `release drained`), process lifecycle (`process started` — with a
+ `reason` of `deploy`, `crash`, `manual`, or `memory` — `process exited`,
+ `memory limit exceeded`, `restart limit reached`, `process restart requested`),
+ and failed control commands (`command failed`). Each event has a timestamp, a
+ message, and a structured data payload. The daemon keeps the most recent 1000 events in
+ memory (cleared on restart). `--limit <count>` shows only the most recent
+ `count`. With `--json`, prints `[{"at", "message", "data"}]`.
+
+ ## `completion`
+
+ ```
+ rollbridge completion <bash|zsh>
+ ```
+
+ Prints a shell completion script to stdout, generated by introspecting the
+ command set (so it never drifts from the real commands and options). It
+ completes command names, each command's option flags, and falls back to file
+ completion after an option that takes a value (bash). Enable it for the current
+ session, or add the line to your shell startup file:
+
+ ```bash
+ # bash (~/.bashrc)
+ source <(rollbridge completion bash)
+
+ # zsh (~/.zshrc)
+ source <(rollbridge completion zsh)
+ ```
+
+ An unsupported shell exits `1` with the list of supported shells.
+
  ## Exit codes
 
  - `0` — success.
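
The `events` section added in this hunk documents `--json` output shaped as `[{"at", "message", "data"}]`, with a `reason` on each `process started` event. The sketch below counts process starts by reason from such a list. It is illustrative only: `startsByReason` is a hypothetical helper, the sample events are invented, and reading the reason from `data.reason` (and the process id from `data.id`) is an assumption about the payload layout.

```javascript
// Made-up events in the documented {at, message, data} shape. The timestamp
// format and the keys inside `data` are assumptions.
const sampleEvents = [
  {at: "2026-05-24T10:00:00.000Z", message: "process started", data: {id: "web", reason: "deploy"}},
  {at: "2026-05-24T10:05:00.000Z", message: "memory limit exceeded", data: {id: "worker#0"}},
  {at: "2026-05-24T10:05:02.000Z", message: "process started", data: {id: "worker#0", reason: "memory"}},
];

// Hypothetical helper: tallies `process started` events by their reason and
// files events without a reason under "unknown".
function startsByReason(events) {
  const counts = {};
  for (const event of events) {
    if (event.message !== "process started") continue;
    const reason = event.data?.reason ?? "unknown";
    counts[reason] = (counts[reason] ?? 0) + 1;
  }
  return counts;
}

console.log(startsByReason(sampleEvents)); // { deploy: 1, memory: 1 }
```

Since the daemon keeps only the most recent 1000 events in memory and clears them on restart, a tally like this reflects recent history rather than a full audit trail.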