npm - rollbridge - Versions diffs - 0.1.2 → 0.1.5 - Mend

rollbridge 0.1.2 → 0.1.5

Files changed (27)

package/LICENSE +21 -0
package/README.md +205 -5
package/TODO.md +21 -18
package/docs/cli.md +174 -0
package/docs/config.md +148 -0
package/docs/deploy-recipes.md +102 -0
package/docs/nginx.md +104 -0
package/docs/troubleshooting.md +102 -0
package/docs/velocious.md +200 -0
package/package.json +22 -2
package/src/cli.js +168 -2
package/src/config.js +146 -8
package/src/daemon.js +138 -6
package/src/doctor.js +114 -0
package/src/health.js +4 -0
package/src/managed-process.js +73 -10
package/src/release-group.js +42 -4
package/test/config-validation.test.js +145 -0
package/test/doctor.test.js +228 -0
package/test/fixtures/crasher.js +2 -0
package/test/health.test.js +63 -0
package/test/logs.test.js +99 -0
package/test/managed-process.test.js +146 -0
package/test/package-metadata.test.js +29 -0
package/test/release-retention.test.js +107 -0
package/test/rollbridge.test.js +249 -5
package/scripts/release-patch.js +0 -83

package/docs/config.md ADDED

@@ -0,0 +1,148 @@
+# Config reference
+A Rollbridge config is a JavaScript module that `export default`s a config
+object (or a sync/async function returning one). When `--config` is omitted,
+the CLI loads `rollbridge.js` from the working directory. Run
+`rollbridge validate` to check a config without starting the daemon.
+```js
+// rollbridge.js
+export default {
+  application: "ticket-server",
+  control: {path: "/tmp/rollbridge-ticket-server.sock"},
+  proxy: {host: "127.0.0.1", port: 8182},
+  processes: [
+    {id: "web", policy: "proxied", cwd: "{{releasePath}}", command: "npx velocious server --port {{port}}", port: {from: 18182, to: 18299}, health: {path: "/ping"}}
+  ]
+}
+```
+## Top-level fields
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `application` | string | basename of the config file's directory | Names the app; used in the default control-socket path and the `ROLLBRIDGE_APPLICATION` env var. |
+| `control` | object | — | Control-socket settings (see below). |
+| `proxy` | object | **required** | Proxy listener and shared defaults (see below). |
+| `processes` | array | **required** | Managed processes (see below). Exactly one must be `proxied`. |
+| `releaseRetention` | object | — | How many stopped releases the daemon retains (see below). |
+## `control`
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `control.path` | string | `/tmp/rollbridge-<application>.sock` | Unix domain socket the CLI uses to talk to the daemon. |
+| `control.mode` | octal string (e.g. `"660"`) or octal number (`0o660`) | unset | `chmod` applied to the socket after it binds, to share it with a deploy group. When unset, the daemon umask applies. |
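As an illustration of the two accepted `control.mode` forms, the following minimal sketch normalizes either form to a number and applies the documented `0` to `0o777` bound. `parseControlMode` and its error messages are hypothetical and are not taken from Rollbridge's source.

```javascript
// Hypothetical helper (not Rollbridge's code): normalize a `control.mode`
// value. Accepts an octal string ("660") or a number (0o660).
function parseControlMode(mode) {
  if (mode === undefined) return undefined // unset: the daemon umask applies

  let value = mode
  if (typeof mode === "string") {
    // Only octal digits are allowed, so "660" is read as 0o660 (432 decimal).
    if (!/^[0-7]+$/.test(mode)) {
      throw new Error(`control.mode "${mode}" is not an octal string`)
    }
    value = Number.parseInt(mode, 8)
  }

  if (!Number.isInteger(value) || value < 0 || value > 0o777) {
    throw new Error("control.mode must be an octal mode between 0 and 0o777")
  }
  return value
}

console.log(parseControlMode("660") === 0o660) // true
console.log(parseControlMode(0o660).toString(8)) // 660
```

Parsing strings with an explicit radix of 8 avoids the common mistake of reading `"660"` as the decimal number 660, which is a different permission set.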
+## `proxy`
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `proxy.host` | string | `"127.0.0.1"` | Interface the stable proxy binds. |
+| `proxy.port` | number | `8182` | Stable port Nginx (or another front end) points at. |
+| `proxy.upstreamHost` | string | `proxy.host`, or `"127.0.0.1"` when `proxy.host` is `0.0.0.0`/`::` | Host Rollbridge uses for release health checks and proxy targets. |
+| `proxy.healthPath` | string | `"/ping"` | Default health-check path for proxied processes. |
+| `proxy.healthTimeoutMs` | number | `30000` | Default health-check timeout for proxied processes. |
+| `proxy.drainTimeoutMs` | number | `60000` | How long to drain open connections from a retired release before stopping it. |
+| `proxy.forceStopTimeoutMs` | number | `10000` | Default per-process graceful-stop timeout (`SIGTERM`, then `SIGKILL`). |
+## `releaseRetention`
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `releaseRetention.keep` | non-negative integer | `10` | Number of most-recent **stopped** releases the daemon keeps in memory and reports in `status`. |
+| `releaseRetention.maxAgeMs` | non-negative number | `0` (disabled) | Also prune stopped releases older than this many milliseconds. |
+Active and draining releases are never pruned. This governs Rollbridge's own
+release records; the deploy tool still owns on-disk release directories.
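The retention rule above can be sketched as a pure function. This is only an illustration: the record shape (`{id, state, stoppedAt}`), the function name `pruneReleases`, and the way `keep` and `maxAgeMs` are combined (a stopped release must satisfy both) are assumptions, not Rollbridge's data model.

```javascript
// Hypothetical sketch of the retention rule; record shape and names are assumed.
function pruneReleases(releases, {keep = 10, maxAgeMs = 0} = {}, now = Date.now()) {
  // Active and draining releases are never pruned.
  const live = releases.filter((release) => release.state !== "stopped")

  const stopped = releases
    .filter((release) => release.state === "stopped")
    .sort((a, b) => b.stoppedAt - a.stoppedAt) // most recently stopped first
    .slice(0, keep) // keep only the `keep` most recent
    .filter((release) => maxAgeMs === 0 || now - release.stoppedAt <= maxAgeMs)

  return [...live, ...stopped]
}

const now = 1_000_000
const releases = [
  {id: "v4", state: "active"},
  {id: "v3", state: "draining"},
  {id: "v2", state: "stopped", stoppedAt: now - 1_000},
  {id: "v1", state: "stopped", stoppedAt: now - 90_000}
]

console.log(pruneReleases(releases, {keep: 1}, now).map((release) => release.id))
// [ 'v4', 'v3', 'v2' ]
```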
+## `processes[]`
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `id` | string | **required** | Unique identifier. Appears in `status`, logs, and `ROLLBRIDGE_*` env vars. |
+| `policy` | `"proxied"` \| `"companion"` \| `"singleton"` \| `"service"` | `"companion"` | Lifecycle policy (see [README → Process Policies](../README.md#process-policies)). Exactly one process must be `proxied`. |
+| `command` | string | **required** | Shell command to run (templated). |
+| `cwd` | string | the release path | Working directory (templated). |
+| `env` | object of string → string | `{}` | Extra environment variables (values templated). Merged over the injected `ROLLBRIDGE_*` vars. |
+| `port` | number or `{from, to}` | unset | Port (or range) allocated per release. **Required for the `proxied` process.** A plain number `n` means the fixed port `n` (`{from: n, to: n}`). |
+| `health` | object or `false` | enabled with defaults | Health check for the `proxied` process; set `false` to disable (see below). |
+| `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | `SIGTERM`→`SIGKILL` window for this process. |
+| `restartDelayMs` | number | `1000` | Base delay before restarting this process after a crash (the backoff base; see `restart`). |
+| `restart` | object | unlimited restarts, constant delay | Automatic-restart policy: cap, rolling window, and backoff (see below). |
+| `outputLines` | positive integer | `50` | Recent stdout/stderr lines retained per process and reported by `status`/`logs`. |
+### `processes[].restart`
+Controls automatic restarts of a crashed process (a release's active processes
+and daemon-wide `service`s). The base delay is the process's `restartDelayMs`;
+when the policy's limit is reached the process is left `failed` and not
+restarted again.
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `restart.maxRestarts` | non-negative integer | unset (unlimited) | Maximum automatic restarts allowed within `windowMs` before Rollbridge stops restarting the process. `0` disables automatic restarts entirely. |
+| `restart.windowMs` | non-negative number | `0` (process lifetime) | Rolling window over which `maxRestarts` is counted and after which the backoff resets. `0` counts over the process's whole lifetime. |
+| `restart.backoffFactor` | number ≥ 1 | `1` (constant) | Multiplier applied to `restartDelayMs` on each successive restart in the window: `delay = restartDelayMs × backoffFactor ^ n`. `1` keeps a constant delay. |
+| `restart.maxDelayMs` | non-negative number | `0` (no cap) | Upper bound on the backed-off delay. `0` means no cap. |
+With the defaults, a crashed process is restarted indefinitely, waiting `restartDelayMs` before each restart.
+Set `backoffFactor` together with `windowMs` so the delay grows during a crash loop and resets once the window has passed, or
+set `maxRestarts` to give up on a process stuck in a crash loop.
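The backoff formula from the table above can be worked through in a few lines. This is a sketch under one stated assumption: `n` is taken to be `0` for the first restart in the window, which the table does not specify. `restartDelay` is a hypothetical name.

```javascript
// Sketch of the documented formula:
//   delay = restartDelayMs × backoffFactor ^ n, capped at maxDelayMs.
// Assumption: n is 0 for the first restart in the window.
function restartDelay(n, {restartDelayMs = 1000, backoffFactor = 1, maxDelayMs = 0} = {}) {
  const delay = restartDelayMs * backoffFactor ** n
  return maxDelayMs > 0 ? Math.min(delay, maxDelayMs) : delay // 0 means no cap
}

const policy = {restartDelayMs: 1000, backoffFactor: 2, maxDelayMs: 10000}
console.log([0, 1, 2, 3, 4].map((n) => restartDelay(n, policy)))
// [ 1000, 2000, 4000, 8000, 10000 ]
```

With the default `backoffFactor` of `1`, every call returns `restartDelayMs`, which matches the constant-delay behaviour described above.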
+### `processes[].health`
+Only the `proxied` process is health-checked (before traffic switches to a new
+release). Set `health: false` to disable it.
+| Field | Type | Default | Description |
+| --- | --- | --- | --- |
+| `health.path` | string | `proxy.healthPath` | HTTP path probed on the process's port. |
+| `health.timeoutMs` | number | `proxy.healthTimeoutMs` | Total time to wait for the first healthy response. |
+| `health.intervalMs` | number | `250` | Delay between probes. |
+| `health.startDelayMs` | non-negative number | `0` | Wait this long after the process starts before the first probe (runs before the `timeoutMs` window). |
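How the four health fields interact can be sketched as a small probe loop. This is an illustration only: `waitForHealthy` and the injected `probe` function are hypothetical, and a real check would issue an HTTP request to `health.path` on the process's port instead of calling a fake.

```javascript
// Hypothetical sketch of a probe loop driven by the health fields.
// `probe` is an injected async function that resolves to true when healthy.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function waitForHealthy(probe, {timeoutMs = 30000, intervalMs = 250, startDelayMs = 0} = {}) {
  await sleep(startDelayMs) // runs before the timeoutMs window opens
  const deadline = Date.now() + timeoutMs

  while (Date.now() < deadline) {
    let healthy = false
    try {
      healthy = await probe()
    } catch {
      healthy = false // e.g. ECONNREFUSED while the app is still booting
    }
    if (healthy) return true
    await sleep(intervalMs)
  }
  return false // timed out: the new release would not go live
}

// Usage with a fake probe that becomes healthy on the third attempt.
let attempts = 0
const fakeProbe = async () => ++attempts >= 3

waitForHealthy(fakeProbe, {timeoutMs: 1000, intervalMs: 10}).then((healthy) => {
  console.log(healthy, attempts) // true 3
})
```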
+## Template variables
+`command`, `cwd`, and `env` values support `{{...}}` placeholders, rendered when
+the process starts. Referencing a placeholder with no value fails the process
+start with a clear error.
+| Placeholder | Value |
+| --- | --- |
+| `{{application}}` | `application` |
+| `{{releaseId}}` | The deploy's release id. |
+| `{{releasePath}}` | The deploy's `--release-path`. |
+| `{{revision}}` | The deploy's `--revision` (falls back to the release id). |
+| `{{processId}}` | This process's `id`. |
+| `{{port}}` | The port allocated to this process. |
+| `{{ports.<id>}}` | The port allocated to another process. |
+| `{{proxy.host}}`, `{{proxy.port}}`, `{{proxy.upstreamHost}}` | The configured proxy bind host/port and upstream host. |
+| `{{env.<NAME>}}` | A variable from the daemon's own environment, e.g. `{{env.HOME}}`. |
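To make the placeholder behaviour concrete, here is a minimal sketch of `{{...}}` rendering that fails on a missing value. `renderTemplate`, the shape of the `values` object, and the error wording are assumptions for illustration and do not reflect Rollbridge's implementation.

```javascript
// Hypothetical sketch of `{{...}}` rendering; not Rollbridge's implementation.
// Dotted names such as `ports.beacon` or `env.HOME` are looked up as paths.
function renderTemplate(template, values) {
  return template.replace(/\{\{([^{}]+)\}\}/g, (match, name) => {
    const value = name
      .split(".")
      .reduce((scope, key) => (scope == null ? undefined : scope[key]), values)

    if (value === undefined || value === null) {
      throw new Error(`Template placeholder ${match} has no value`)
    }
    return String(value)
  })
}

const values = {
  releasePath: "/srv/ticket-server/releases/20260523120000",
  port: 18182,
  ports: {beacon: 7330, "background-jobs-main": 7331},
  env: {HOME: "/home/deploy"}
}

console.log(renderTemplate("npx velocious server --port {{port}}", values))
// npx velocious server --port 18182
console.log(renderTemplate("wait-for-it 127.0.0.1:{{ports.beacon}} --strict", values))
// wait-for-it 127.0.0.1:7330 --strict
```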
+## Injected environment variables
+Rollbridge sets these in every managed process's environment (the process's own
+`env` is merged on top and can override them):
+| Variable | Value |
+| --- | --- |
+| `ROLLBRIDGE_APPLICATION` | `application` |
+| `ROLLBRIDGE_PROCESS_ID` | This process's `id`. |
+| `ROLLBRIDGE_RELEASE_ID` | The release id. |
+| `ROLLBRIDGE_RELEASE_PATH` | The release path. |
+| `ROLLBRIDGE_REVISION` | The revision (or release id). |
+| `ROLLBRIDGE_PORT` | This process's allocated port (only when it has one). |
+| `ROLLBRIDGE_<ID>_PORT` | Each process's allocated port, where `<ID>` is the process id uppercased with non-alphanumerics replaced by `_` (e.g. `background-jobs-main` → `ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT`). |
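The naming rule for `ROLLBRIDGE_<ID>_PORT` can be reproduced in one expression, which may help when reading ports from the environment in application code. `portEnvName` is a hypothetical helper name; only the rule itself comes from the table above.

```javascript
// Sketch of the documented naming rule: uppercase the process id and replace
// every non-alphanumeric character with "_".
function portEnvName(processId) {
  return `ROLLBRIDGE_${processId.toUpperCase().replace(/[^A-Z0-9]/g, "_")}_PORT`
}

console.log(portEnvName("background-jobs-main")) // ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT
console.log(portEnvName("web")) // ROLLBRIDGE_WEB_PORT
```

Inside a managed process, the value would then be read with something like `process.env[portEnvName("beacon")]`.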
+## Validation rules
+`rollbridge validate` reports every violation of these rules at once, each with an example fix:
+- `application` is optional (its default is filled in); `proxy` and `processes` must be present and well-typed.
+- Exactly one process must be `proxied`, and the `proxied` process must define a `port`.
+- Process `id`s must be unique.
+- `port` must be a positive port number or an ascending `{from, to}` range.
+- `control.mode` must be an octal mode between `0` and `0o777`.
+- `outputLines` must be a positive integer and `releaseRetention.keep` a non-negative integer; `health.startDelayMs` and `releaseRetention.maxAgeMs` must be non-negative numbers.
+- `restart.maxRestarts` must be a non-negative integer (omit it for unlimited restarts); `restart.backoffFactor` must be a number ≥ 1; `restart.windowMs` and `restart.maxDelayMs` must be non-negative numbers.
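As a worked example of the `port` rule above, this sketch normalizes a plain number to a `{from, to}` range and rejects non-ascending or out-of-range values. `normalizePort`, the added `size` field, and the 65535 upper bound (the TCP maximum) are illustrative assumptions; Rollbridge's own checks and messages may differ.

```javascript
// Hypothetical sketch of the `port` rule: a plain number n is the fixed range
// {from: n, to: n}; a range must be ascending.
function normalizePort(port) {
  const range = typeof port === "number" ? {from: port, to: port} : port
  const isPort = (value) => Number.isInteger(value) && value > 0 && value <= 65535

  if (!range || !isPort(range.from) || !isPort(range.to) || range.from > range.to) {
    throw new Error("port must be a positive port number or an ascending {from, to} range")
  }
  return {from: range.from, to: range.to, size: range.to - range.from + 1}
}

console.log(normalizePort(7330)) // { from: 7330, to: 7330, size: 1 }
console.log(normalizePort({from: 18182, to: 18299}).size) // 118
```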

package/docs/deploy-recipes.md ADDED

@@ -0,0 +1,102 @@
+# Deploy-tool recipes
+Rollbridge is deploy-tool agnostic: it ships no plugins or tasks for any deploy
+tool. Whatever you use — a shell script, CI, or Capistrano — drives Rollbridge
+by **calling its CLI** (see [`cli.md`](cli.md)). The daemon is long-lived;
+deploys just hand it a prepared release path.
+The deploy contract is the same everywhere:
+1. Prepare the release directory (checkout, install dependencies, build assets).
+2. Run **backwards-compatible** migrations *before* switching traffic (the old
+   and new web releases overlap during the drain).
+3. Run `rollbridge deploy` — it starts the new release, health-checks the
+   proxied process, switches traffic, then drains and stops the old release.
+   It exits non-zero (leaving the previous release active) if the new release
+   fails to start or health-check, so your script should stop on a failed
+   deploy.
+Point `--config` at a stable, daemon-wide config file; release paths are passed
+per deploy. `rollbridge deploy --ensure-daemon` starts the daemon first if it
+isn't already running, so the recipes below work whether or not the daemon is
+already managed by systemd.
+## Shell script
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+app_dir=/srv/ticket-server
+config=/etc/rollbridge/rollbridge.js
+# Read the revision from the source repo (not the script's cwd, which may not be
+# a checkout under cron/systemd/CI).
+revision="$(git -C "$app_dir/repo" rev-parse HEAD)"
+release_path="$app_dir/releases/$(date -u +%Y%m%d%H%M%S)-$revision"
+# 1. Prepare the release.
+# Use a file:// URL: git ignores --depth when cloning from a plain local path.
+git clone --depth 1 "file://$app_dir/repo" "$release_path"
+(cd "$release_path" && npm ci && npm run build)
+# 2. Run backwards-compatible migrations before switching traffic.
+(cd "$release_path" && npx velocious db:migrate)
+# 3. Switch traffic (and start the daemon if needed). A non-zero exit here means
+#    the new release failed health checks and the previous one is still active;
+#    `set -e` aborts the script so the bad release is not promoted.
+rollbridge deploy \
+  --ensure-daemon \
+  --config "$config" \
+  --release-path "$release_path" \
+  --revision "$revision"
+```
+## CI
+In CI, build/test the release, then run the same `rollbridge deploy` over SSH
+on the target host (CI rarely runs the long-lived daemon itself):
+```bash
+# after the build/test job has produced a release at $RELEASE_PATH on the host
+ssh deploy@app.example.com \
+  "rollbridge deploy --ensure-daemon \
+     --config /etc/rollbridge/rollbridge.js \
+     --release-path '$RELEASE_PATH' \
+     --revision '$GIT_SHA'"
+```
+`rollbridge deploy` exits non-zero on a failed health check, which fails the CI
+step — no extra gating needed. Use `rollbridge validate --json` / `rollbridge
+doctor --json` earlier in the pipeline if you want to fail fast before building.
+## Capistrano
+Rollbridge ships **no Capistrano plugin or tasks** — you only run its CLI as a
+shell command from your own `deploy.rb`. Capistrano already uploads the release
+to `release_path`, so the deploy step is a single `execute` of the CLI:
+```ruby
+# config/deploy.rb — just a shell command; no Rollbridge-specific Capistrano code.
+after "deploy:publishing", "rollbridge:deploy"
+namespace :rollbridge do
+  task :deploy do
+    on roles(:app) do
+      within release_path do
+        execute :npx, "velocious", "db:migrate"
+      end
+      execute "rollbridge", "deploy",
+        "--ensure-daemon",
+        "--config", "/etc/rollbridge/rollbridge.js",
+        "--release-path", release_path,
+        "--revision", fetch(:current_revision)
+    end
+  end
+end
+```
+`execute` runs the command over SSH and raises if it exits non-zero, so a failed
+Rollbridge health check fails the Capistrano deploy. Keep Capistrano's own
+`linked_dirs`/`keep_releases` for on-disk release directories; Rollbridge only
+manages the running processes and its own in-memory release records (see
+`releaseRetention`).

package/docs/nginx.md ADDED

@@ -0,0 +1,104 @@
+# Nginx guide
+Nginx should always proxy to the **stable Rollbridge proxy port**
+(`proxy.host:proxy.port`), never directly to a release process — release ports
+are allocated per deploy and change. Rollbridge forwards both HTTP and WebSocket
+traffic to the active release and drains old connections across deploys.
+## Server block
+```nginx
+# Maps the Upgrade header so WebSocket requests get "Connection: upgrade" and
+# all other requests get "Connection: close".
+map $http_upgrade $connection_upgrade {
+  default upgrade;
+  ''      close;
+}
+server {
+  listen 443 ssl;
+  server_name app.example.com;
+  # ssl_certificate / ssl_certificate_key ...
+  location / {
+    proxy_pass http://127.0.0.1:8182;   # Rollbridge proxy.host:proxy.port
+    # WebSocket upgrade
+    proxy_http_version 1.1;
+    proxy_set_header Upgrade $http_upgrade;
+    proxy_set_header Connection $connection_upgrade;
+    # Pass the real client through to the app
+    proxy_set_header Host $host;
+    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+    proxy_set_header X-Forwarded-Proto $scheme;
+    proxy_set_header X-Real-IP $remote_addr;
+    # Long-lived connections (WebSocket/SSE) — see "Timeouts" below
+    proxy_read_timeout 3600s;
+    proxy_send_timeout 3600s;
+  }
+}
+```
+The repository README shows a minimal version of this block; the additions here
+matter for production.
+## WebSocket headers
+Rollbridge's proxy has WebSocket support enabled, so the only requirement is that
+Nginx forwards the upgrade handshake:
+- `proxy_http_version 1.1` — WebSocket upgrades require HTTP/1.1 (the default is 1.0).
+- `proxy_set_header Upgrade $http_upgrade;` and `proxy_set_header Connection $connection_upgrade;` — forward the upgrade. Using the `map` above is preferred over a hard-coded `Connection "upgrade"`, so non-WebSocket requests aren't forced into an upgrade.
+If these are missing, WebSocket clients fail to connect (the handshake never
+completes) while plain HTTP still works.
+## Timeouts
+Nginx's `proxy_read_timeout`/`proxy_send_timeout` default to **60s**. An idle
+WebSocket (or a slow streaming response) is closed once that elapses, so
+long-lived connections silently drop after a minute unless you raise them — set
+them on the relevant `location` (or globally) to a value above your longest idle
+period.
+Related Rollbridge timeouts (configured in `rollbridge.js`, not Nginx):
+- `proxy.healthTimeoutMs` gates how long a new release has to become healthy
+  before a deploy aborts — it does not affect request timeouts.
+- `proxy.drainTimeoutMs` is how long Rollbridge keeps an old release alive for
+  in-flight connections during a deploy. Keep Nginx's `proxy_read_timeout` for
+  WebSocket locations comfortably above it so the front end doesn't cut
+  connections Rollbridge is still draining.
+## Forwarded headers
+Set `X-Forwarded-For`, `X-Forwarded-Proto`, and `Host` so the app behind
+Rollbridge sees the real client and scheme. Rollbridge proxies with
+`X-Forwarded-*` enabled, but it can only forward what Nginx provides — terminate
+TLS at Nginx and pass `X-Forwarded-Proto $scheme` so the app knows the original
+request was HTTPS.
+For Server-Sent Events or other streamed responses, also disable response
+buffering on that location so events flush immediately:
+```nginx
+location /events {
+  proxy_pass http://127.0.0.1:8182;
+  proxy_http_version 1.1;
+  proxy_buffering off;
+  proxy_read_timeout 3600s;
+}
+```
+## Common failure modes
+| Symptom | Cause | Fix |
+| --- | --- | --- |
+| `502 Bad Gateway` | Rollbridge can't reach the active release's process (it crashed or is restarting); Rollbridge returns `Bad gateway` and Nginx relays it. | Check `rollbridge status` / `rollbridge logs --process <id>` (see [troubleshooting.md](troubleshooting.md)). The process auto-restarts on its port. |
+| `503` / `No active release` | No release is active — before the first deploy, or after `rollbridge stop`. | Deploy a release (`rollbridge deploy`). |
+| WebSocket drops after ~60s | `proxy_read_timeout` left at the 60s default. | Raise `proxy_read_timeout`/`proxy_send_timeout` on the WebSocket location. |
+| WebSocket never connects (plain HTTP works) | Missing `proxy_http_version 1.1` or the `Upgrade`/`Connection` headers. | Add the WebSocket directives shown above. |
+| `504 Gateway Timeout` | A slow response exceeded `proxy_read_timeout`. | Raise the timeout, or speed up the endpoint. |
+| Connections cut mid-deploy | Nginx `proxy_read_timeout` shorter than `proxy.drainTimeoutMs`. | Raise the Nginx timeout above `proxy.drainTimeoutMs`. |

package/docs/troubleshooting.md ADDED

@@ -0,0 +1,102 @@
+# Troubleshooting
+Start with these commands, which diagnose most problems without guessing:
+- `rollbridge validate` — config errors, with an example fix for each.
+- `rollbridge doctor` — control socket reachability, socket-directory writability, and proxy-port availability before the daemon starts.
+- `rollbridge status` / `rollbridge logs` — live release/process state, restart counts, exit codes, connection counts, and recent process output.
+For scripting, `validate`, `doctor`, and `logs` accept a `--json` flag, and
+`status` already prints JSON — so every command's output is easy to parse.
+## Health-check failures
+**Symptom.** `rollbridge deploy` exits non-zero with:
+```
+Health check failed for http://127.0.0.1:18182/ping: HTTP 503
+```
+(the reason is `HTTP <status>` or a connection error such as `ECONNREFUSED`). The
+new release never went live; the previous release stays active.
+**Diagnose.** The new release's `proxied` process didn't return a healthy
+response in time. Check its output with `rollbridge logs --process <id>` and its
+state/`exitCode` with `rollbridge status`. Common causes: the app doesn't listen
+on the templated `{{port}}`, the `health.path` returns a non-2xx status, or the
+app boots slower than `health.timeoutMs`.
+**Fix.** Make the proxied command bind `{{port}}` and serve `health.path` with a
+2xx status. For slow boots, raise `health.timeoutMs` or set `health.startDelayMs`
+so probing begins after the app is up.
+## Port conflicts / exhausted ranges
+**Symptom.** A deploy fails with:
+```
+No available ports in range 18182-18299 (118 ports on 127.0.0.1): 0 reserved by this deploy, 118 already in use. Widen the port range, free a port, or check bind permissions.
+```
+**Diagnose.** The counts tell you which case it is:
+- **reserved by this deploy** high → the range is too small for the processes that share it.
+- **already in use** → another process (or an old release that has not finished draining) holds the ports.
+- **could not be bound (e.g. EACCES)** → permission problem, e.g. a privileged (`<1024`) port.
+`rollbridge doctor` reports whether the configured `proxy.port` is bindable.
+**Fix.** Widen the process's `port` range, free the conflicting port (`ss -ltnp`
+or `lsof -i :<port>` to find the holder), or avoid privileged ports / grant the
+needed capability.
+## Stale or busy control socket
+**Symptom.** `rollbridge daemon` (or `ensure-daemon`) errors with one of:
+```
+A Rollbridge daemon for application "ticket-server" is already running on /tmp/rollbridge-ticket-server.sock (active release: v3). Run "rollbridge status" to inspect it or "rollbridge shutdown" to stop it, or set a different control.path.
+The control socket /tmp/rollbridge-ticket-server.sock is already in use by another process. Stop that process or set a different control.path.
+```
+**Diagnose.** Run `rollbridge status` (does a daemon answer?) and `rollbridge
+doctor` (control-socket check). A leftover socket *file* with no live daemon
+behind it is removed automatically the next time the daemon starts — no action
+needed.
+**Fix.** If a Rollbridge daemon is already running, use it, or
+`rollbridge shutdown` before starting another. If a non-Rollbridge process owns
+the path, stop it or point `control.path` somewhere else.
+## Crash loops
+**Symptom.** `rollbridge status` shows a process with a climbing `restarts`
+count and a `state` that flips between `running` and `failed`, with repeated
+`process started` / `process exited` log lines.
+**Diagnose.** `rollbridge logs --process <id>` shows the crash output;
+`rollbridge status` shows `exitCode`, `exitSignal`, `restarts`, and `uptimeMs`
+(a tiny `uptimeMs` that keeps resetting is a fast crash loop). Crashed
+active-release and `service` processes auto-restart after `restartDelayMs`.
+**Fix.** Correct the command, environment, or dependency that makes the process
+exit; raise `restartDelayMs` to slow a tight loop. Note that a release which
+fails its health check never receives traffic, so a crash-looping proxied
+process in a *failed* deploy does not take the site down — the previous release
+stays active.
+## Stuck draining releases
+**Symptom.** Long after a deploy, `rollbridge status` still shows an old release
+in `state: "draining"` with non-zero `connections` (often `websocket`).
+**Diagnose.** Long-lived connections (WebSockets, SSE, streaming responses) keep
+the retired release alive until they close or `proxy.drainTimeoutMs` elapses.
+`status` shows the release's `connections.http`/`connections.websocket` and
+`drainStartedAt`.
+**Fix.** Draining ends automatically when those connections close, or after
+`proxy.drainTimeoutMs` (then the release is stopped regardless). Lower
+`proxy.drainTimeoutMs` to force-stop sooner, or make clients reconnect (for
+example, have the front end close idle WebSockets on deploy). Once stopped, the
+release is pruned per `releaseRetention`.

package/docs/velocious.md ADDED

@@ -0,0 +1,200 @@
+# Velocious deployment guide
+A Velocious backend typically runs four kinds of process: **Beacon** (the
+message broker other processes connect to), **background-jobs-main** (the job
+coordinator), **background-jobs-worker** (runs the jobs), and the **web/API**
+server. This guide maps each to a Rollbridge process policy, shows a complete
+`rollbridge.js`, and explains startup ordering and what happens on a deploy.
+A production version of this config lives at
+[`examples/tensorbuzz.com.js`](../examples/tensorbuzz.com.js).
+## Process mapping
+| Velocious process | Policy | Why |
+| --- | --- | --- |
+| `beacon` | `service` | A shared broker the other processes connect to. It should survive deploys and keep a **stable port**, so workers and the web process always reach the same Beacon. |
+| `background-jobs-main` | `service` (or `singleton`) | The job coordinator. Run it as a `service` when it should outlive releases on a stable port; run it as a `singleton` when it must run the latest release's code after every deploy (see [Choosing the jobs-main policy](#choosing-the-jobs-main-policy)). |
+| `background-jobs-worker` | `companion` | Release-scoped: one set of workers per active release, started before the web process and running that release's code. |
+| `web` | `proxied` | Receives external HTTP/WebSocket traffic, is health-checked before traffic switches, and is drained on the next deploy. Exactly one process is `proxied`. |
+See [README → Process Policies](../README.md#process-policies) for the full
+semantics of each policy and [`docs/config.md`](config.md) for every field.
+## Example `rollbridge.js`
+```js
+// rollbridge.js
+export default {
+  application: "tensorbuzz",
+  control: {path: "/tmp/rollbridge-tensorbuzz.sock"},
+  proxy: {
+    host: "127.0.0.1",
+    port: 4500,          // the stable port Nginx points at
+    healthPath: "/ping",
+    healthTimeoutMs: 30000,
+    drainTimeoutMs: 60000,
+    forceStopTimeoutMs: 10000
+  },
+  processes: [
+    // Shared broker — one daemon-wide instance on a stable port.
+    {
+      id: "beacon",
+      policy: "service",
+      cwd: "{{releasePath}}/backend",
+      env: {NODE_ENV: "production", VELOCIOUS_BEACON_PORT: "{{port}}"},
+      command: "npx velocious beacon",
+      port: 7330
+    },
+    // Job coordinator — waits for Beacon, stable port other jobs processes use.
+    {
+      id: "background-jobs-main",
+      policy: "service",
+      cwd: "{{releasePath}}/backend",
+      env: {
+        NODE_ENV: "production",
+        VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
+        VELOCIOUS_BACKGROUND_JOBS_PORT: "{{port}}"
+      },
+      command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- npx velocious background-jobs-main",
+      port: 7331
+    },
+    // Workers — one set per release; raise gracefulStopMs to let in-flight
+    // jobs finish during a deploy.
+    {
+      id: "background-jobs-worker",
+      policy: "companion",
+      cwd: "{{releasePath}}/backend",
+      env: {
+        NODE_ENV: "production",
+        VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
+        VELOCIOUS_BACKGROUND_JOBS_PORT: "{{ports.background-jobs-main}}"
+      },
+      command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- wait-for-it 127.0.0.1:{{ports.background-jobs-main}} --strict -- npx velocious background-jobs-worker",
+      gracefulStopMs: 60000
+    },
+    // Web/API — the one proxied process.
+    {
+      id: "web",
+      policy: "proxied",
+      cwd: "{{releasePath}}/backend",
+      env: {
+        NODE_ENV: "production",
+        VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
+        VELOCIOUS_BACKGROUND_JOBS_PORT: "{{ports.background-jobs-main}}"
+      },
+      command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- wait-for-it 127.0.0.1:{{ports.background-jobs-main}} --strict -- npx velocious server --host 127.0.0.1 --port {{port}}",
+      port: {from: 14500, to: 14599},
+      health: {path: "/ping", timeoutMs: 30000, intervalMs: 500}
+    }
+  ]
+}
+```
+## Wiring processes together
+Beacon and `background-jobs-main` get **fixed** ports (`7330`, `7331`) because
+they are `service`s — a stable port lets every release's workers and web process
+find them. The proxied `web` process gets a **range** (`{from: 14500, to:
+14599}`); Rollbridge allocates a free port per release so the old and new web
+releases can run side by side during the drain.
+Cross-reference ports with `{{ports.<id>}}` and pass them to Velocious through
+`env`. Rollbridge also injects `ROLLBRIDGE_<ID>_PORT` for every process (e.g.
+`ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT`), so you can read ports from the
+environment instead of templating if you prefer — see
+[`docs/config.md`](config.md#injected-environment-variables).
+### Startup ordering
+Only the `proxied` process is health-checked, so dependent processes must wait
+for their dependencies themselves. Two mechanisms combine:
+1. **Policy ordering.** On each deploy Rollbridge starts `service`s first, then
+   the release's `companion`s, then the `proxied` process (see
+   [README → Deploy ordering](../README.md#deploy-ordering)).
+2. **Readiness gating.** `wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- …`
+   blocks the command until Beacon's port accepts connections, so
+   `background-jobs-main`, the worker, and `web` don't start talking to Beacon
+   before it is listening. `wait-for-it` is a small standalone script (install it
+   on the host); any equivalent port-wait works.
+## Deploying
+Drive deploys through the Rollbridge CLI — Rollbridge ships no deploy-tool
+plugins (see [`docs/deploy-recipes.md`](deploy-recipes.md) for shell/CI/Capistrano
+recipes). The minimal step after a release directory is prepared:
+```bash
+release_path=/srv/tensorbuzz/releases/20260523120000  # prepared by your pipeline
+# Run backwards-compatible migrations BEFORE switching traffic: the old and new
+# web releases overlap during the drain.
+(cd "$release_path/backend" && npx velocious db:migrate)
+rollbridge deploy \
+  --ensure-daemon \
+  --config /etc/rollbridge/rollbridge.js \
+  --release-path "$release_path" \
+  --revision "$(git -C "$release_path/backend" rev-parse HEAD)"
+```
+`rollbridge deploy` starts the new release's worker and web process,
+health-checks `web` at `/ping` on its allocated `{{port}}`, switches traffic, then drains and
+stops the previous release. It exits non-zero (leaving the previous release
+active) if the new release fails to start or health-check, so a failed deploy
+never promotes a broken release.
+## Background jobs across a deploy
+The worker is a `companion`, so each release runs its own workers:
+- On deploy, the **new** release's workers start (running the new code) before
+  traffic switches; the **old** release's workers are stopped when that release
+  is drained and retired — `SIGTERM`, then `SIGKILL` after `gracefulStopMs`.
+- Set `gracefulStopMs` on the worker to at least your longest in-flight job so a
+  job gets time to finish on `SIGTERM` before the forced kill. The example uses
+  `60000` (60s).
+> **Planned:** graceful job-worker draining via lifecycle hooks
+> (`quietCommand`/`drainCommand`/`stopCommand` and a non-blocking drain mode so
+> new workers start while old workers finish) is on the
+> [roadmap](../TODO.md#major-features) and not yet implemented. Until then, the
+> `gracefulStopMs` window above is the mechanism for letting in-flight jobs
+> finish.
+### Choosing the jobs-main policy
+`background-jobs-main` is duplicate-unsafe (you never want two coordinators), so
+it is either a `service` or a `singleton` — never a `companion`:
+- **`service`** — keeps running across deploys on its stable port. Workers from
+  every release talk to the same coordinator, so there's no coordination gap on
+  deploy. The trade-off: a `service` keeps running the **release it was started
+  from** and only adopts the latest release's template if it crashes and
+  restarts (or the daemon restarts). If `background-jobs-main` itself needs the
+  newest code immediately after every deploy, this is the wrong policy.
+- **`singleton`** — Rollbridge stops the old instance and then starts the new
+  one on each deploy, so it always runs the latest release's code and two copies
+  never overlap. The trade-off: a brief coordination gap while it restarts.
+Beacon is a broker rather than code that changes per release, so `service` is
+almost always right for it.
+## Verifying
+After a deploy, `rollbridge status` should show `beacon` and
+`background-jobs-main` as long-lived `service`s with unchanged ports across
+deploys, one `background-jobs-worker` for the active release, and the `web`
+process `proxied` with its connection counts. Use
+[`rollbridge logs --process <id>`](cli.md) to read recent output from any
+process, and [`docs/troubleshooting.md`](troubleshooting.md) for health-check,
+port, and draining problems.
+For the front end, point Nginx at the stable `proxy.port` (here `4500`), never at
+a release's web port — see [`docs/nginx.md`](nginx.md).