npm - rollbridge - Versions diffs - 0.1.5 → 0.1.7 - Mend

rollbridge 0.1.5 → 0.1.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)

package/README.md +125 -4
package/TODO.md +45 -43
package/docs/cli.md +166 -6
package/docs/config.md +172 -2
package/docs/logging.md +77 -0
package/docs/releasing.md +53 -0
package/docs/tensorbuzz-runbook.md +129 -0
package/docs/velocious.md +49 -11
package/docs/workers.md +115 -0
package/package.json +1 -1
package/src/cli.js +327 -1
package/src/config.js +268 -6
package/src/daemon.js +216 -13
package/src/doctor.js +177 -0
package/src/event-log.js +47 -0
package/src/managed-process.js +225 -16
package/src/predeploy-cleanup.js +340 -0
package/src/process-memory.js +110 -0
package/src/recover.js +134 -0
package/src/release-group.js +71 -21
package/src/state-store.js +103 -0
package/src/system-ids.js +71 -0
package/src/template.js +32 -0
package/test/completion.test.js +64 -0
package/test/config-validation.test.js +268 -0
package/test/doctor.test.js +205 -3
package/test/event-log.test.js +46 -0
package/test/fixtures/memory-hog.js +19 -0
package/test/managed-process.test.js +290 -0
package/test/predeploy-cleanup.test.js +131 -0
package/test/process-memory.test.js +40 -0
package/test/recover.test.js +162 -0
package/test/release-group.test.js +22 -0
package/test/rollbridge.test.js +523 -6
package/test/state-store.test.js +69 -0
package/test/system-ids.test.js +24 -0

package/docs/config.md CHANGED

@@ -23,9 +23,11 @@ export default {
 | --- | --- | --- | --- |
 | `application` | string | basename of the config file's directory | Names the app; used in the default control-socket path and the `ROLLBRIDGE_APPLICATION` env var. |
 | `control` | object | — | Control-socket settings (see below). |
+| `legacyTakeover` | object | unset | Optional matchers for `rollbridge predeploy-cleanup` to stop pre-Rollbridge supervisors during first handover (see below). |
 | `proxy` | object | **required** | Proxy listener and shared defaults (see below). |
 | `processes` | array | **required** | Managed processes (see below). Exactly one must be `proxied`. |
 | `releaseRetention` | object | — | How many stopped releases the daemon retains (see below). |
+| `statePath` | string | unset (no persistence) | File the daemon persists its state to, enabling orphaned-process detection on the next startup (see [`statePath`](#statepath)). |
 ## `control`
@@ -33,6 +35,14 @@ export default {
 | --- | --- | --- | --- |
 | `control.path` | string | `/tmp/rollbridge-<application>.sock` | Unix domain socket the CLI uses to talk to the daemon. |
 | `control.mode` | octal string (e.g. `"660"`) or octal number (`0o660`) | unset | `chmod` applied to the socket after it binds, to share it with a deploy group. When unset, the daemon umask applies. |
+| `control.owner` | non-negative integer uid or user name | unset | `chown` owner applied to the socket after it binds. |
+| `control.group` | non-negative integer gid or group name | unset | `chown` group applied to the socket after it binds, so a shared deploy group can use it. |
+Names are resolved via `/etc/passwd`/`/etc/group` (local users and groups); use
+numeric ids for NSS-only principals. The daemon must run as a user permitted to
+`chown` the socket (root, or a member of the target group) — otherwise it fails
+to start with a clear error. Combine `control.group` with `control.mode: "660"`
+to let a deploy group talk to the daemon.
 ## `proxy`
@@ -56,6 +66,61 @@ export default {
 Active and draining releases are never pruned. This governs Rollbridge's own
 release records; the deploy tool still owns on-disk release directories.
+## `statePath`
+When set, the daemon persists a state snapshot — the active and draining
+releases, each managed process's metadata (including pid), restart counters, and
+recent events — to this file (atomically, on changes and every few seconds). On a
+clean `shutdown` the file is removed.
+On the **next startup**, the daemon reads any leftover file and reports managed
+processes whose pids are still alive — likely orphans from a daemon that crashed
+without shutting down cleanly — in its log and event history, and in the
+`orphans` array of [`rollbridge status`](cli.md#status). This is **advisory**:
+Rollbridge cannot re-adopt detached children, so it does not stop them
+automatically; the operator verifies and stops the leftovers. A recycled pid can
+be a false positive, so treat a report as a prompt to investigate. Use
+[`rollbridge recover`](cli.md#recover) to list and (with `--force`) stop those
+orphans after a crash.
+```js
+statePath: "/var/lib/rollbridge/ticket-server.state.json"
+```
+Leave `statePath` unset to disable persistence (the default).
+## `legacyTakeover`
+`legacyTakeover` lets deploy scripts run `rollbridge predeploy-cleanup` during
+the first migration from an old supervisor. The command only uses these matchers
+when no active Rollbridge release is running. If a Rollbridge daemon already has
+an active release, it exits without stopping legacy processes.
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `legacyTakeover.screens` | array of strings | `[]` | GNU Screen session names to stop with `screen -S <name> -X quit`. |
+| `legacyTakeover.processes` | array | `[]` | Process command-line matchers. Each entry must define `includes`, and may define `name`. |
+| `legacyTakeover.forceStopTimeoutMs` | number | `proxy.forceStopTimeoutMs` | Grace period after `SIGTERM` before `SIGKILL` is sent to matched legacy processes. |
+Each `legacyTakeover.processes[]` entry:
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `includes` | array of strings | **required** | Every string must appear in a process command line for it to be considered a legacy seed process. Descendants of seed processes are stopped too. |
+| `name` | string | generated | Human-readable label for diagnostics. |
+Example:
+```js
+legacyTakeover: {
+  forceStopTimeoutMs: 10000,
+  screens: ["ticket-server"],
+  processes: [
+    {name: "legacy web", includes: ["/home/dev/ticket-server/", "velocious server", "--port 8082"]}
+  ]
+}
+```
 ## `processes[]`
 | Field | Type | Default | Description |
@@ -67,11 +132,85 @@ release records; the deploy tool still owns on-disk release directories.
 | `env` | object of string → string | `{}` | Extra environment variables (values templated). Merged over the injected `ROLLBRIDGE_*` vars. |
 | `port` | number or `{from, to}` | unset | Port (or range) allocated per release. **Required for the `proxied` process.** A plain number `n` means the fixed port `n` (`{from: n, to: n}`). |
 | `health` | object or `false` | enabled with defaults | Health check for the `proxied` process; set `false` to disable (see below). |
-| `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | `SIGTERM`→`SIGKILL` window for this process. |
+| `stopSignal` | signal name (e.g. `"SIGTERM"`, `"SIGINT"`, `"SIGQUIT"`) | `"SIGTERM"` | Signal sent to gracefully stop the process; after `gracefulStopMs` it is `SIGKILL`ed. Use a worker's quit signal so it finishes in-flight work before exiting. |
+| `nonBlockingDrain` | boolean | `false` | When a release is retired, drain this process **immediately** (in parallel with the proxied connection drain) instead of after it. Companion processes only — typically background workers (see below). |
+| `lifecycle` | object | no hooks | Command hooks run when gracefully stopping the process (see below). |
+| `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | Graceful-stop window: time between `stopSignal`/`stopCommand` and `SIGKILL` for this process. |
 | `restartDelayMs` | number | `1000` | Base delay before restarting this process after a crash (the backoff base; see `restart`). |
 | `restart` | object | unlimited restarts, constant delay | Automatic-restart policy: cap, rolling window, and backoff (see below). |
+| `memory` | object | unset (no monitoring) | Memory supervision: restart the process when its RSS exceeds a limit (see below). |
+| `replicas` | positive integer | `1` | Run this many instances of the process (see below). |
 | `outputLines` | positive integer | `50` | Recent stdout/stderr lines retained per process and reported by `status`/`logs`. |
+### `processes[].replicas`
+Run a pool of identical instances of one process — for example several
+background-job workers. `replicas` greater than `1` is supported only on a
+**`companion`** process **without a `port`** (the worker-pool case);
+`proxied`, `singleton`, and ported processes must keep `replicas: 1`.
+```js
+{id: "worker", policy: "companion", command: "npx velocious background-jobs-worker", replicas: 4}
+```
+Each replica runs as its own managed process with id `<id>#<index>` (`worker#0`,
+`worker#1`, …) — that id is what appears in `status` and what
+[`rollbridge restart`](cli.md#restart) targets (use the base id `worker` to
+restart every replica, or `worker#0` for one). Replicas get `replicaIndex`/
+`replicaCount` template variables and `ROLLBRIDGE_REPLICA_INDEX`/`_COUNT` in their
+environment, so each instance can pick a distinct shard, queue, or lock. A single
+process (`replicas: 1`) keeps its plain id and is replica `0` of `1`.
+### `processes[].lifecycle`
+Command hooks run when Rollbridge **gracefully stops** the process — during a
+deploy's drain, a `rollbridge restart`, a memory restart, or shutdown. They let a
+job worker quiesce and finish in-flight work before it is terminated. Omit
+`lifecycle` for the default behavior (just `stopSignal` then `SIGKILL`).
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `lifecycle.quietCommand` | string | unset | Run first to tell the process to stop accepting new work. |
+| `lifecycle.drainCommand` | string | unset | Run after quieting to wait until the process has drained (it blocks until done). When unset, Rollbridge instead waits up to `drainTimeoutMs` for the process to exit on its own. Requires a positive `drainTimeoutMs` (which bounds it). |
+| `lifecycle.drainTimeoutMs` | non-negative number | `0` | Bounds the drain step. `0` **skips the drain step entirely** (no `drainCommand`, no wait). |
+| `lifecycle.stopCommand` | string | unset | Run to stop the process instead of sending `stopSignal`, if it is still running after draining. |
+Because `stopCommand` runs **instead of** sending `stopSignal`, setting both a
+`stopCommand` and a custom `stopSignal` is rejected — the signal would be silently
+ignored. Use one or the other.
+The full stop sequence is: run `quietCommand` → drain (`drainCommand`, or wait
+`drainTimeoutMs` for the process to exit) → if still running, run `stopCommand`
+or send `stopSignal` → `SIGKILL` after `gracefulStopMs`. Each hook command is run
+through a shell with the process's environment plus `ROLLBRIDGE_PID` (the
+process-group leader's pid, so a hook can `kill -TSTP -$ROLLBRIDGE_PID`). Every
+hook is **bounded by a timeout** (its drain timeout, or `gracefulStopMs`) and its
+failure is non-fatal — the sequence proceeds and `SIGKILL` is always the final
+fallback, so a slow or broken hook can't wedge a stop.
+```js
+{id: "worker", policy: "companion", command: "…", lifecycle: {quietCommand: "kill -TSTP -$ROLLBRIDGE_PID", drainTimeoutMs: 60000}}
+```
+### `processes[].nonBlockingDrain`
+By default, when a release is retired its processes are stopped **after** the
+proxied process's connections have drained (or `proxy.drainTimeoutMs` elapses).
+That keeps a worker alive in case the draining web process still depends on it —
+but it also holds a background worker open for the whole connection drain.
+Set `nonBlockingDrain: true` on a `companion` whose work is independent of the
+proxied process (a job worker on a shared queue). Its graceful stop — `lifecycle`
+hooks, or `stopSignal` then `SIGKILL` after `gracefulStopMs` — then starts **as
+soon as the release is retired**, in parallel with the connection drain, rather
+than after it. The new release's workers (started before traffic switches) handle
+new work while the retired release's workers finish their in-flight jobs. The
+whole drain stays non-blocking — the deploy returns immediately.
+```js
+{id: "worker", policy: "companion", command: "…", nonBlockingDrain: true, stopSignal: "SIGINT", gracefulStopMs: 60000}
+```
 ### `processes[].restart`
 Controls automatic restarts of a crashed process (a release's active processes
@@ -90,6 +229,29 @@ With the defaults a crashed process restarts indefinitely after `restartDelayMs`
 Pair `backoffFactor`/`windowMs` to back off and self-heal after a clean run, or
 set `maxRestarts` to give up on a process stuck in a crash loop.
+### `processes[].memory`
+Monitors the resident memory (RSS) of the process and **gracefully restarts** it
+(`SIGTERM`, then `SIGKILL` after `gracefulStopMs`) when it exceeds `limitBytes`.
+RSS is measured across the whole managed process group (the spawned wrapper and
+its children), not just the wrapper. Omit `memory` to disable monitoring. Memory
+measurement uses `/proc` and is a no-op on platforms without it.
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `memory.limitBytes` | positive integer | **required** | RSS limit in bytes; exceeding it restarts the process. |
+| `memory.warnBytes` | non-negative integer | `0` (off) | Log a `memory warning` once when RSS first crosses this threshold (set below `limitBytes`). |
+| `memory.checkIntervalMs` | positive number | `5000` | How often to measure RSS. |
+```js
+{id: "worker", policy: "companion", command: "…", memory: {limitBytes: 536870912, warnBytes: 402653184, checkIntervalMs: 5000}}
+```
+A memory restart is reported in `status` (`memoryRestarts`, `lastMemoryRestartAt`,
+current `rssBytes`) and recorded in the event history (a `process started` event
+with `reason: "memory"`). `status` also reports `children` — the sampled process
+tree, with each group member's `pid`, `command`, and `rssBytes`.
 ### `processes[].health`
 Only the `proxied` process is health-checked (before traffic switches to a new
@@ -115,6 +277,7 @@ start with a clear error.
 | `{{releasePath}}` | The deploy's `--release-path`. |
 | `{{revision}}` | The deploy's `--revision` (falls back to the release id). |
 | `{{processId}}` | This process's `id`. |
+| `{{replicaIndex}}`, `{{replicaCount}}` | This instance's zero-based replica index and the total replica count (`0` and `1` for a single process). |
 | `{{port}}` | The port allocated to this process. |
 | `{{ports.<id>}}` | The port allocated to another process. |
 | `{{proxy.host}}`, `{{proxy.port}}`, `{{proxy.upstreamHost}}` | The configured proxy bind host/port and upstream host. |
@@ -128,7 +291,8 @@ Rollbridge sets these in every managed process's environment (the process's own
 | Variable | Value |
 | --- | --- |
 | `ROLLBRIDGE_APPLICATION` | `application` |
-| `ROLLBRIDGE_PROCESS_ID` | This process's `id`. |
+| `ROLLBRIDGE_PROCESS_ID` | This process's `id` (the base id, not the `#index` instance id). |
+| `ROLLBRIDGE_REPLICA_INDEX`, `ROLLBRIDGE_REPLICA_COUNT` | This instance's zero-based replica index and total replica count (`0` and `1` for a single process). |
 | `ROLLBRIDGE_RELEASE_ID` | The release id. |
 | `ROLLBRIDGE_RELEASE_PATH` | The release path. |
 | `ROLLBRIDGE_REVISION` | The revision (or release id). |
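Note on the replica variables in the hunk above: a worker can read `ROLLBRIDGE_REPLICA_INDEX` and `ROLLBRIDGE_REPLICA_COUNT` to claim a distinct share of the work. The sketch below is only an illustration of that idea. `readReplica` and `ownsJob` are hypothetical helper names, and modulo sharding over numeric job ids is an assumption, not something Rollbridge prescribes.

```javascript
// Sketch only: how a worker might use the replica variables Rollbridge injects.
// `readReplica` and `ownsJob` are hypothetical helpers, not part of Rollbridge.
function readReplica(env = process.env) {
  // Fall back to "replica 0 of 1", the documented values for a single process.
  const index = Number.parseInt(env.ROLLBRIDGE_REPLICA_INDEX ?? "0", 10)
  const count = Number.parseInt(env.ROLLBRIDGE_REPLICA_COUNT ?? "1", 10)
  const valid = Number.isInteger(index) && Number.isInteger(count) && count >= 1 && index >= 0 && index < count

  if (!valid) {
    throw new Error(`Invalid replica variables: index=${env.ROLLBRIDGE_REPLICA_INDEX} count=${env.ROLLBRIDGE_REPLICA_COUNT}`)
  }

  return {index, count}
}

// Assumption: jobs carry a numeric id, and each replica takes the ids that
// fall into its modulo class.
function ownsJob(jobId, replica) {
  return jobId % replica.count === replica.index
}
```

With `replicas: 4`, the instance `worker#2` would see index `2` and count `4`, so under this scheme it would take job ids 2, 6, 10, and so on.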
@@ -144,5 +308,11 @@ Rollbridge sets these in every managed process's environment (the process's own
 - Process `id`s must be unique.
 - `port` must be a positive port number or an ascending `{from, to}` range.
 - `control.mode` must be an octal mode between `0` and `0o777`.
+- `control.owner` and `control.group` must each be a non-negative integer id or a non-empty name (resolved at daemon start).
 - `outputLines` and `releaseRetention.keep` must be positive/non-negative integers; `health.startDelayMs` and `releaseRetention.maxAgeMs` must be non-negative numbers.
 - `restart.maxRestarts` must be a non-negative integer (omit it for unlimited restarts); `restart.backoffFactor` must be a number ≥ 1; `restart.windowMs` and `restart.maxDelayMs` must be non-negative numbers.
+- When `memory` is set, `memory.limitBytes` must be a positive integer, `memory.warnBytes` a non-negative integer, and `memory.checkIntervalMs` a positive number.
+- `replicas` must be a positive integer; `replicas > 1` is allowed only on a `companion` process without a `port`. Process ids must not contain `#` (reserved for replica instance ids).
+- `lifecycle.quietCommand`/`drainCommand`/`stopCommand` must be strings when set, and `lifecycle.drainTimeoutMs` a non-negative number; `lifecycle.drainCommand` requires a positive `lifecycle.drainTimeoutMs`. A `lifecycle.stopCommand` may not be combined with a custom `stopSignal` (the `stopCommand` runs instead of the signal, so the signal would be ignored).
+- `nonBlockingDrain` must be a boolean, and is allowed only on a `companion` process.
+- `statePath` must be a string when set.
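Note on the validation rules added above: the per-process rules for `replicas`, `lifecycle`, and `nonBlockingDrain` can be read as a small checker. The sketch below restates them in code for illustration; it is not Rollbridge's validator (that lives in `package/src/config.js`, which is not shown here). `validateProcess` is a hypothetical name, and two readings are assumptions: a "custom" `stopSignal` is taken to mean any `stopSignal` set explicitly, and `nonBlockingDrain: false` is accepted on any policy.

```javascript
// Illustration of the documented rules only; not Rollbridge's actual validator.
// `validateProcess` is a hypothetical helper that returns a list of messages.
function validateProcess(proc) {
  const errors = []
  const replicas = proc.replicas ?? 1
  const lifecycle = proc.lifecycle ?? {}
  const drainTimeoutMs = lifecycle.drainTimeoutMs ?? 0

  // "#" is reserved for replica instance ids such as "worker#0".
  if (typeof proc.id !== "string" || proc.id.includes("#")) {
    errors.push("id must be a string without '#'")
  }

  if (!Number.isInteger(replicas) || replicas < 1) {
    errors.push("replicas must be a positive integer")
  } else if (replicas > 1 && (proc.policy !== "companion" || proc.port !== undefined)) {
    errors.push("replicas > 1 requires a companion process without a port")
  }

  if (lifecycle.drainCommand !== undefined && !(drainTimeoutMs > 0)) {
    errors.push("lifecycle.drainCommand requires a positive lifecycle.drainTimeoutMs")
  }

  // Assumption: any explicitly set stopSignal counts as "custom".
  if (lifecycle.stopCommand !== undefined && proc.stopSignal !== undefined) {
    errors.push("lifecycle.stopCommand cannot be combined with stopSignal")
  }

  if (proc.nonBlockingDrain !== undefined) {
    if (typeof proc.nonBlockingDrain !== "boolean") {
      errors.push("nonBlockingDrain must be a boolean")
    } else if (proc.nonBlockingDrain && proc.policy !== "companion") {
      // Assumption: only `true` is rejected on non-companion processes.
      errors.push("nonBlockingDrain is allowed only on a companion process")
    }
  }

  return errors
}
```

Returning every message instead of throwing on the first one is a design choice for the sketch, so that a config with several mistakes can be fixed in one pass.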

package/docs/logging.md ADDED

@@ -0,0 +1,77 @@
+# Logging
+The Rollbridge daemon writes one structured JSON line per operational event
+(deploys, traffic switches, process starts/exits, restarts, memory events, and
+failed commands):
+```json
+{"at":"2026-05-23T14:31:09.512Z","message":"traffic switched","data":{"previousReleaseId":"v3","releaseId":"v4"}}
+```
+These lines go to the daemon's **stdout**; where that ends up depends on how the
+daemon was started.
+## Where logs go
+| How the daemon runs | Destination |
+| --- | --- |
+| `rollbridge daemon` (foreground) | stdout — redirect it (`rollbridge daemon … >> /var/log/rollbridge/app.log 2>&1`) or let your service manager capture it. |
+| systemd (`examples/rollbridge.service`) | the journal — `journalctl -u rollbridge`. journald rotates on its own. |
+| `rollbridge ensure-daemon` / `rollbridge deploy --ensure-daemon` | the **daemon log file**: `--daemon-log-path <path>`, default `/tmp/rollbridge-<application>.log`. The detached daemon's stdout and stderr are appended there. |
+Point `--daemon-log-path` at a path your rotation tooling manages, for example:
+```bash
+rollbridge deploy --ensure-daemon \
+  --config /etc/rollbridge/rollbridge.js \
+  --daemon-log-path /var/log/rollbridge/app.log \
+  --release-path "$release_path"
+```
+The daemon log file is the durable, append-only stream of the daemon's own
+events. It is distinct from the two in-memory views:
+- `rollbridge logs` — recent stdout/stderr of each **managed process** (your app),
+  bounded per process by `outputLines`.
+- `rollbridge events` — the recent structured daemon event history (the most
+  recent 1000 events), the same events written to the log file.
+Both are cleared when the daemon restarts; the log file persists.
+## Rotation
+### systemd / journald
+When the daemon runs under systemd its logs are in the journal, which rotates
+automatically. Bound journal disk use with `SystemMaxUse=` in
+`/etc/systemd/journald.conf` (or a per-namespace drop-in). No logrotate config is
+needed for the daemon itself.
+### The daemon log file (logrotate)
+The detached daemon keeps the log file **open for its whole lifetime** (its
+stdout/stderr file descriptors point at it). A plain `rename`-based rotation
+would leave the daemon writing to the old, now-renamed inode while the new file
+stays empty. Use logrotate's **`copytruncate`**, which copies the file and then
+truncates it in place, keeping the daemon's open descriptor valid:
+```
+/var/log/rollbridge/*.log {
+  daily
+  rotate 14
+  compress
+  missingok
+  notifempty
+  copytruncate
+}
+```
+`copytruncate` has a small race window — log lines written between the copy and
+the truncate can be lost — which is acceptable for the daemon's low-volume,
+milestone-level logging. Rollbridge does not reopen its log file on a signal, so
+`copytruncate` (rather than `create` + a reopen signal) is the recommended
+approach for the daemon log file.
+Prefer running under systemd (journald) when you can; reach for `--daemon-log-path`
++ logrotate when you run the daemon outside a service manager that captures
+stdout.
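Note on consuming the daemon log file described above: because the detached daemon's stderr is appended to the same file and `copytruncate` can cut a line short, a reader should tolerate lines that are not valid JSON. The sketch below is one way to do that. The field names `at`, `message`, and `data` come from the example line in the doc; `parseLogLines` is a hypothetical helper, not a Rollbridge API.

```javascript
// Sketch: extract structured events from daemon log text, skipping anything
// that is not a JSON object with a string `message`.
// `parseLogLines` is a hypothetical helper, not part of Rollbridge.
function parseLogLines(text) {
  const events = []

  for (const line of text.split("\n")) {
    const trimmed = line.trim()
    if (trimmed === "") continue

    try {
      const event = JSON.parse(trimmed)
      if (event && typeof event.message === "string") events.push(event)
    } catch {
      // Not JSON (for example stderr output or a truncated line): ignore it.
    }
  }

  return events
}
```

Filtering the result, for example `parseLogLines(text).filter((event) => event.message === "traffic switched")`, would then list the traffic switches recorded in the file.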

package/docs/releasing.md ADDED

@@ -0,0 +1,53 @@
+# Releasing (maintainers)
+Rollbridge publishes **patch** releases from the default branch with:
+```bash
+npm run release:patch
+```
+That script (the `release-patch` package) owns the version bump, lockfile update,
+default-branch commit, push, and `npm publish`. Don't run `npm version` yourself
+first — let the script own the bump. Use this checklist around it.
+The default branch is `master` for this repo; the checks below stay
+branch-agnostic so they stay correct if that ever changes. Capture the name once
+and reuse it (the commands below assume it is set):
+```bash
+default_branch=$(git rev-parse --abbrev-ref origin/HEAD | sed 's@^origin/@@')   # e.g. master
+```
+## Before releasing
+- [ ] You're on the default branch and synced with it: `git switch "$default_branch"`,
+      then `git fetch && git status` shows it up to date, with a **clean working tree**.
+- [ ] CI is green for that commit, and `npm run all-checks` passes locally
+      (typecheck, lint, and the full test suite).
+- [ ] `README.md` and `docs/` reflect every user-visible change shipped since the
+      last release (config fields, CLI commands/flags, status/event output,
+      operational behavior).
+- [ ] `TODO.md` checkboxes for the shipped work are updated.
+- [ ] You can publish: `npm whoami` shows an account with publish rights to the
+      `rollbridge` package, and you can push to the default branch.
+## Release
+```bash
+npm run release:patch
+```
+The script bumps the patch version, updates `package-lock.json`, commits the bump
+to the default branch, pushes it, and publishes the package to npm.
+## After releasing
+- [ ] The new version is on the registry: `npm view rollbridge version` matches
+      the bumped `package.json` version.
+- [ ] The version-bump commit reached the remote (not just local): `git fetch`,
+      then `git log --oneline -1 "origin/$default_branch"` shows the bump — a
+      failed or blocked push won't satisfy this.
+- [ ] Your working tree is clean and still on the default branch.
+`release:patch` only does patch releases — a minor or major version bump is a
+manual decision and is not covered by this script.

package/docs/tensorbuzz-runbook.md ADDED

@@ -0,0 +1,129 @@
+# TensorBuzz production runbook
+Operating the TensorBuzz backend under Rollbridge. The production config lives at
+[`examples/tensorbuzz.com.js`](../examples/tensorbuzz.com.js); this runbook
+assumes it is deployed to a stable path (`/etc/rollbridge/tensorbuzz.com.js`
+below) and the daemon runs as a systemd service (see
+[Running under systemd](../README.md#running-under-systemd)). For the general
+Velocious topology and the worker recipe, see [`docs/velocious.md`](velocious.md).
+## Ports
+| Port | Process | Notes |
+| --- | --- | --- |
+| `4500` | Rollbridge proxy | The stable public port. **Nginx proxies the backend host to `127.0.0.1:4500`** — never to a release's web port. |
+| `7330` | `beacon` (`service`) | Fixed; the shared broker every release connects to. |
+| `7331` | `background-jobs-main` (`service`) | Fixed; the job coordinator. |
+| `14500`–`14599` | `web` (`proxied`) | One port per release, allocated per deploy; Rollbridge forwards `4500` here. |
+| (none) | `background-jobs-worker` (`companion`) | A per-release worker; no listening port. |
+Control socket: `/tmp/rollbridge-tensorbuzz.sock`.
+## Process topology
+- **`beacon`** and **`background-jobs-main`** are `service`s: one daemon-wide
+  instance each, on their fixed ports, surviving deploys.
+- **`background-jobs-worker`** is a `companion`: a fresh worker per release,
+  running that release's code, with `gracefulStopMs: 60000` so an in-flight job
+  finishes before `SIGKILL`.
+- **`web`** is the one `proxied` process, health-checked at `/ping` before
+  traffic switches.
+Each process waits for its dependencies with `wait-for-it` (`beacon` →
+`background-jobs-main` → `worker`/`web`), so nothing starts talking to Beacon or
+the job coordinator before they listen.
+## External services
+Rollbridge manages **only the four processes above**. Everything else the
+Velocious app depends on — the database and any other backing services — is
+**provisioned and operated outside Rollbridge**: Rollbridge does not start, stop,
+health-check, or know about them. Configure those connections through the app's
+own environment/config. When such a dependency is down, the `web` process's
+`/ping` health check is what gates a deploy (a release that can't reach its
+database won't pass health and won't go live).
+## Deploying
+Drive deploys through the CLI (see [`docs/deploy-recipes.md`](deploy-recipes.md)).
+Run **backwards-compatible** migrations before switching traffic, because the old
+and new releases overlap during the drain:
+```bash
+release_path=/srv/tensorbuzz/releases/<timestamp>     # prepared by your pipeline
+(cd "$release_path/backend" && npx velocious db:migrate)
+rollbridge deploy \
+  --ensure-daemon \
+  --config /etc/rollbridge/tensorbuzz.com.js \
+  --release-path "$release_path" \
+  --revision "$(git -C "$release_path/backend" rev-parse HEAD)"
+```
+### Deploy ordering
+On `rollbridge deploy`, Rollbridge:
+1. starts any missing `service` (`beacon`, `background-jobs-main`);
+2. starts the new release's `background-jobs-worker`, then its `web` process, and
+   health-checks `web` on its `{{port}}`/`/ping`;
+3. switches new traffic to the new `web`;
+4. refreshes the services' restart templates to the new release;
+5. drains the previous release's connections, then stops its `web` and worker.
+If the new release fails to start or health-check, **the previous release stays
+active** and the command exits non-zero — so a failed deploy never takes the site
+down.
+## Rollback
+```bash
+rollbridge rollback --config /etc/rollbridge/tensorbuzz.com.js
+# or a specific retained release:
+rollbridge rollback --config /etc/rollbridge/tensorbuzz.com.js --release-id <id>
+```
+Rollback re-runs the deploy flow on a retained release, health-checks it, and
+switches traffic back. Constraints:
+- **Migrations are not reverted.** Rollback only manages processes; if a release
+  bumped the schema, rolling code back requires that the old code still works
+  against the new schema — keep migrations backwards-compatible (the same rule as
+  deploys).
+- The target release's on-disk directory must still exist (don't prune it from
+  disk before you might roll back to it).
+- Only releases Rollbridge still retains (`releaseRetention`) can be targeted.
+## Day-to-day operations
+```bash
+C=/etc/rollbridge/tensorbuzz.com.js
+rollbridge status  --config "$C"                 # active release, ports, per-process state
+rollbridge logs    --config "$C" --process web   # recent stdout/stderr of a process
+rollbridge events  --config "$C"                 # deploys, switches, crashes, restarts
+rollbridge doctor  --config "$C"                 # pre-flight: socket, proxy port, state
+rollbridge restart --config "$C" --process background-jobs-worker   # bounce the worker
+```
+Restarting `beacon` or `background-jobs-main` bounces a shared broker and briefly
+disrupts everything that depends on it; prefer `deploy`/`rollback` for code
+changes. See [`docs/troubleshooting.md`](troubleshooting.md) for health-check
+failures, port conflicts, stale sockets, crash loops, and stuck draining
+releases.
+## Crash recovery
+Set [`statePath`](config.md#statepath) in the config to have the daemon persist
+its state. After a daemon crash or reboot, `rollbridge doctor` reports any
+**orphaned** processes still alive from the previous daemon. To clean them up
+before restarting the daemon, run `rollbridge recover` (a dry run that lists
+them), then `rollbridge recover --force` to stop them:
+```bash
+rollbridge recover --config /etc/rollbridge/tensorbuzz.com.js          # list leftovers
+rollbridge recover --config /etc/rollbridge/tensorbuzz.com.js --force  # stop them
+```
+A machine reboot kills every process, so there are usually no orphans afterward —
+the daemon just starts fresh.
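Note on the orphan detection described under Crash recovery: the docs say the daemon reports persisted processes "whose pids are still alive". In Node.js, one common way to test that is sending signal `0`, which performs the existence check without delivering a signal. The sketch below shows that technique only. It is not taken from `package/src/recover.js`, the `{id, pid}` entry shape is an assumption about the state file, and `isPidAlive` and `findLikelyOrphans` are hypothetical names.

```javascript
// Sketch: check whether a pid currently exists, using signal 0.
// Not Rollbridge's implementation; helper names are hypothetical.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0) // Signal 0 tests for existence without signalling.
    return true
  } catch (error) {
    // EPERM means the process exists but belongs to another user.
    return error.code === "EPERM"
  }
}

// Assumption: the persisted state yields entries shaped like {id, pid}.
function findLikelyOrphans(entries) {
  return entries.filter((entry) => Number.isInteger(entry.pid) && entry.pid > 0 && isPidAlive(entry.pid))
}
```

A live pid does not prove the process is the original child, since pids can be recycled, which is why the docs treat a report as a prompt to investigate.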

package/docs/velocious.md CHANGED

@@ -156,17 +156,55 @@ The worker is a `companion`, so each release runs its own workers:
 - On deploy, the **new** release's workers start (running the new code) before
   traffic switches; the **old** release's workers are stopped when that release
-  is drained and retired — `SIGTERM`, then `SIGKILL` after `gracefulStopMs`.
-- Set `gracefulStopMs` on the worker to at least your longest in-flight job so a
-  job gets time to finish on `SIGTERM` before the forced kill. The example uses
-  `60000` (60s).
-> **Planned:** graceful job-worker draining via lifecycle hooks
-> (`quietCommand`/`drainCommand`/`stopCommand` and a non-blocking drain mode so
-> new workers start while old workers finish) is on the
-> [roadmap](../TODO.md#major-features) and not yet implemented. Until then, the
-> `gracefulStopMs` window above is the mechanism for letting in-flight jobs
-> finish.
+  is drained and retired — the worker's `stopSignal`, then `SIGKILL` after
+  `gracefulStopMs`.
+- Set `stopSignal` to the signal your worker drains on and `gracefulStopMs` to at
+  least your longest in-flight job, so a job gets time to finish before the
+  forced kill. Set `replicas` to run a pool of workers.
+See [`docs/workers.md`](workers.md) for the full safe background-job deployment
+pattern (companion + `replicas` + `stopSignal`/`lifecycle` hooks +
+`gracefulStopMs`), the old/new worker overlap, and `nonBlockingDrain` to start the
+old workers' drain immediately when a release is retired.
+### Worker recipe
+A complete `background-jobs-worker` entry that runs a pool and finishes in-flight
+jobs across a deploy:
+```js
+{
+  id: "background-jobs-worker",
+  policy: "companion",
+  cwd: "{{releasePath}}/backend",
+  env: {
+    NODE_ENV: "production",
+    VELOCIOUS_ENV: "production",
+    VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
+    VELOCIOUS_BACKGROUND_JOBS_PORT: "{{ports.background-jobs-main}}"
+  },
+  command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- wait-for-it 127.0.0.1:{{ports.background-jobs-main}} --strict -- npx velocious background-jobs-worker",
+  replicas: 4,
+  gracefulStopMs: 60000
+}
+```
+- `replicas: 4` runs four worker instances (`background-jobs-worker#0` … `#3`),
+  each with `ROLLBRIDGE_REPLICA_INDEX`/`ROLLBRIDGE_REPLICA_COUNT` if you shard work.
+- On deploy the new release's workers start before traffic switches; the old
+  release's workers receive `SIGTERM` (the default `stopSignal`) when the old
+  release is retired, then `SIGKILL` after `gracefulStopMs` — so size
+  `gracefulStopMs` to your longest job. Both releases' workers briefly consume the
+  shared queue, so keep job code backwards-compatible and jobs idempotent.
+If your worker quiesces on a command or a non-default signal, add a `lifecycle`
+block — Rollbridge runs `quietCommand`, drains for up to `drainTimeoutMs`, then
+stops. For example, send a quiet signal to the worker's process group before the
+drain:
+```js
+lifecycle: {quietCommand: "kill -TSTP -$ROLLBRIDGE_PID", drainTimeoutMs: 60000}
+```
 ### Choosing the jobs-main policy