rollbridge 0.1.5 → 0.1.7

package/docs/config.md CHANGED
@@ -23,9 +23,11 @@ export default {
23
23
  | --- | --- | --- | --- |
24
24
  | `application` | string | basename of the config file's directory | Names the app; used in the default control-socket path and the `ROLLBRIDGE_APPLICATION` env var. |
25
25
  | `control` | object | — | Control-socket settings (see below). |
26
+ | `legacyTakeover` | object | unset | Optional matchers for `rollbridge predeploy-cleanup` to stop pre-Rollbridge supervisors during first handover (see below). |
26
27
  | `proxy` | object | **required** | Proxy listener and shared defaults (see below). |
27
28
  | `processes` | array | **required** | Managed processes (see below). Exactly one must be `proxied`. |
28
29
  | `releaseRetention` | object | — | How many stopped releases the daemon retains (see below). |
30
+ | `statePath` | string | unset (no persistence) | File the daemon persists its state to, enabling orphaned-process detection on the next startup (see [`statePath`](#statepath)). |
29
31
 
30
32
  ## `control`
31
33
 
@@ -33,6 +35,14 @@ export default {
33
35
  | --- | --- | --- | --- |
34
36
  | `control.path` | string | `/tmp/rollbridge-<application>.sock` | Unix domain socket the CLI uses to talk to the daemon. |
35
37
  | `control.mode` | octal string (e.g. `"660"`) or octal number (`0o660`) | unset | `chmod` applied to the socket after it binds, to share it with a deploy group. When unset, the daemon umask applies. |
38
+ | `control.owner` | non-negative integer uid or user name | unset | `chown` owner applied to the socket after it binds. |
39
+ | `control.group` | non-negative integer gid or group name | unset | `chown` group applied to the socket after it binds, so a shared deploy group can use it. |
40
+
41
+ Names are resolved via `/etc/passwd`/`/etc/group` (local users and groups); use
42
+ numeric ids for NSS-only principals. The daemon must run as a user permitted to
43
+ `chown` the socket (root to change the owner; the socket's owner may also assign a group it belongs to); otherwise it fails
44
+ to start with a clear error. Combine `control.group` with `control.mode: "660"`
45
+ to let a deploy group talk to the daemon.
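As a rough illustration of the two accepted `control.mode` forms, the sketch below normalises either form to a numeric mode and applies the documented `0` to `0o777` bound. `normalizeMode`, the socket path, and the `deploy` group name are assumptions made for this example, not Rollbridge's own code.

```js
// Hypothetical helper: normalise a `control.mode` value to a numeric mode.
function normalizeMode(mode) {
  if (typeof mode === "string" && !/^[0-7]+$/.test(mode)) {
    throw new Error(`control.mode is not an octal string: ${JSON.stringify(mode)}`)
  }
  const value = typeof mode === "string" ? Number.parseInt(mode, 8) : mode
  if (!Number.isInteger(value) || value < 0 || value > 0o777) {
    throw new Error(`control.mode must be between 0 and 0o777: ${JSON.stringify(mode)}`)
  }
  return value
}

// Illustrative settings: share the socket with a local "deploy" group.
const control = {
  path: "/tmp/rollbridge-ticket-server.sock", // assumed application name
  group: "deploy", // assumed group name
  mode: "660"
}

console.log(normalizeMode(control.mode).toString(8)) // prints "660"
```

Both `"660"` and `0o660` normalise to the same number, so either spelling grants read/write to the owner and the group and nothing to others.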
36
46
 
37
47
  ## `proxy`
38
48
 
@@ -56,6 +66,61 @@ export default {
56
66
  Active and draining releases are never pruned. This governs Rollbridge's own
57
67
  release records; the deploy tool still owns on-disk release directories.
58
68
 
69
+ ## `statePath`
70
+
71
+ When set, the daemon persists a state snapshot — the active and draining
72
+ releases, each managed process's metadata (including pid), restart counters, and
73
+ recent events — to this file (atomically, on changes and every few seconds). On a
74
+ clean `shutdown` the file is removed.
75
+
76
+ On the **next startup**, the daemon reads any leftover file and reports managed
77
+ processes whose pids are still alive — likely orphans from a daemon that crashed
78
+ without shutting down cleanly — in its log and event history, and in the
79
+ `orphans` array of [`rollbridge status`](cli.md#status). This is **advisory**:
80
+ Rollbridge cannot re-adopt detached children, so it does not stop them
81
+ automatically; the operator verifies and stops the leftovers. A recycled pid can
82
+ be a false positive, so treat a report as a prompt to investigate. Use
83
+ [`rollbridge recover`](cli.md#recover) to list and (with `--force`) stop those
84
+ orphans after a crash.
85
+
86
+ ```js
87
+ statePath: "/var/lib/rollbridge/ticket-server.state.json"
88
+ ```
89
+
90
+ Leave `statePath` unset to disable persistence (the default).
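The liveness check behind an orphan report can be sketched with Node's `process.kill(pid, 0)`, which tests for a process without delivering a signal. This is only an approximation of the idea: the `snapshot.processes` shape and both function names are hypothetical, and the real state-file format is not documented here.

```js
// Returns true when a process with this pid currently exists.
// Signal 0 performs the existence and permission checks without sending a signal.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0)
    return true
  } catch (error) {
    // EPERM means the process exists but belongs to another user.
    return error.code === "EPERM"
  }
}

// `snapshot.processes` is a hypothetical shape, not Rollbridge's state-file format.
function findPossibleOrphans(snapshot) {
  return snapshot.processes.filter((entry) => isPidAlive(entry.pid))
}

// The current process stands in for a leftover managed process.
const snapshot = {processes: [{id: "web", pid: process.pid}]}
console.log(findPossibleOrphans(snapshot).map((entry) => entry.id))
```

A pid-only check cannot tell a surviving child from an unrelated process that reused the pid, which is why the report is advisory.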
91
+
92
+ ## `legacyTakeover`
93
+
94
+ `legacyTakeover` lets deploy scripts run `rollbridge predeploy-cleanup` during
95
+ the first migration from an old supervisor. The command only uses these matchers
96
+ when no active Rollbridge release is running. If a Rollbridge daemon already has
97
+ an active release, it exits without stopping legacy processes.
98
+
99
+ | Field | Type | Default | Description |
100
+ | --- | --- | --- | --- |
101
+ | `legacyTakeover.screens` | array of strings | `[]` | GNU Screen session names to stop with `screen -S <name> -X quit`. |
102
+ | `legacyTakeover.processes` | array | `[]` | Process command-line matchers. Each entry must define `includes`, and may define `name`. |
103
+ | `legacyTakeover.forceStopTimeoutMs` | number | `proxy.forceStopTimeoutMs` | Grace period after `SIGTERM` before `SIGKILL` is sent to matched legacy processes. |
104
+
105
+ Each `legacyTakeover.processes[]` entry:
106
+
107
+ | Field | Type | Default | Description |
108
+ | --- | --- | --- | --- |
109
+ | `includes` | array of strings | **required** | Every string must appear in a process command line for it to be considered a legacy seed process. Descendants of seed processes are stopped too. |
110
+ | `name` | string | generated | Human-readable label for diagnostics. |
111
+
112
+ Example:
113
+
+ ```js
+ legacyTakeover: {
+   forceStopTimeoutMs: 10000,
+   screens: ["ticket-server"],
+   processes: [
+     {name: "legacy web", includes: ["/home/dev/ticket-server/", "velocious server", "--port 8082"]}
+   ]
+ }
+ ```
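The `includes` rule (every string must appear in the command line) can be sketched as below. `matchesLegacyProcess` and the two sample command lines are invented for illustration; how Rollbridge reads process command lines is not shown here.

```js
// A process is a seed candidate only when every `includes` fragment is present.
function matchesLegacyProcess(commandLine, matcher) {
  return matcher.includes.every((fragment) => commandLine.includes(fragment))
}

const matcher = {
  name: "legacy web",
  includes: ["/home/dev/ticket-server/", "velocious server", "--port 8082"]
}

// Made-up command lines: the second one listens on a different port.
const commandLines = [
  "node /home/dev/ticket-server/node_modules/.bin/velocious server --port 8082",
  "node /home/dev/ticket-server/node_modules/.bin/velocious server --port 9000"
]

console.log(commandLines.map((line) => matchesLegacyProcess(line, matcher))) // [ true, false ]
```

Listing several fragments keeps the matcher narrow, so an unrelated process that merely shares the directory or the command name is not selected.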
123
+
59
124
  ## `processes[]`
60
125
 
61
126
  | Field | Type | Default | Description |
@@ -67,11 +132,85 @@ release records; the deploy tool still owns on-disk release directories.
67
132
  | `env` | object of string → string | `{}` | Extra environment variables (values templated). Merged over the injected `ROLLBRIDGE_*` vars. |
68
133
  | `port` | number or `{from, to}` | unset | Port (or range) allocated per release. **Required for the `proxied` process.** A plain number `n` means the fixed port `n` (`{from: n, to: n}`). |
69
134
  | `health` | object or `false` | enabled with defaults | Health check for the `proxied` process; set `false` to disable (see below). |
70
- | `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | `SIGTERM`→`SIGKILL` window for this process. |
135
+ | `stopSignal` | signal name (e.g. `"SIGTERM"`, `"SIGINT"`, `"SIGQUIT"`) | `"SIGTERM"` | Signal sent to gracefully stop the process; after `gracefulStopMs` it is `SIGKILL`ed. Use a worker's quit signal so it finishes in-flight work before exiting. |
136
+ | `nonBlockingDrain` | boolean | `false` | When a release is retired, drain this process **immediately** (in parallel with the proxied connection drain) instead of after it. Companion processes only — typically background workers (see below). |
137
+ | `lifecycle` | object | no hooks | Command hooks run when gracefully stopping the process (see below). |
138
+ | `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | Graceful-stop window: time between `stopSignal`/`stopCommand` and `SIGKILL` for this process. |
71
139
  | `restartDelayMs` | number | `1000` | Base delay before restarting this process after a crash (the backoff base; see `restart`). |
72
140
  | `restart` | object | unlimited restarts, constant delay | Automatic-restart policy: cap, rolling window, and backoff (see below). |
141
+ | `memory` | object | unset (no monitoring) | Memory supervision: restart the process when its RSS exceeds a limit (see below). |
142
+ | `replicas` | positive integer | `1` | Run this many instances of the process (see below). |
73
143
  | `outputLines` | positive integer | `50` | Recent stdout/stderr lines retained per process and reported by `status`/`logs`. |
74
144
 
145
+ ### `processes[].replicas`
146
+
147
+ Run a pool of identical instances of one process — for example several
148
+ background-job workers. `replicas` greater than `1` is supported only on a
149
+ **`companion`** process **without a `port`** (the worker-pool case);
150
+ `proxied`, `singleton`, and ported processes must keep `replicas: 1`.
151
+
152
+ ```js
153
+ {id: "worker", policy: "companion", command: "npx velocious background-jobs-worker", replicas: 4}
154
+ ```
155
+
156
+ Each replica runs as its own managed process with id `<id>#<index>` (`worker#0`,
157
+ `worker#1`, …) — that id is what appears in `status` and what
158
+ [`rollbridge restart`](cli.md#restart) targets (use the base id `worker` to
159
+ restart every replica, or `worker#0` for one). Replicas get `replicaIndex`/
160
+ `replicaCount` template variables and `ROLLBRIDGE_REPLICA_INDEX`/`_COUNT` in their
161
+ environment, so each instance can pick a distinct shard, queue, or lock. A single
162
+ process (`replicas: 1`) keeps its plain id and is replica `0` of `1`.
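Inside a worker, the replica variables might be consumed along these lines. The modulo sharding rule and both function names are assumptions for this sketch; only the two environment variable names and their single-process defaults come from the text above.

```js
// Read the replica identity, falling back to replica 0 of 1.
function readReplica(env) {
  const index = Number.parseInt(env.ROLLBRIDGE_REPLICA_INDEX ?? "0", 10)
  const count = Number.parseInt(env.ROLLBRIDGE_REPLICA_COUNT ?? "1", 10)
  return {index, count}
}

// Hypothetical sharding rule: replica i takes the numeric job ids where id % count === i.
function ownsJob(jobId, replica) {
  return jobId % replica.count === replica.index
}

// In a real worker this would be readReplica(process.env).
const replica = readReplica({ROLLBRIDGE_REPLICA_INDEX: "2", ROLLBRIDGE_REPLICA_COUNT: "4"})
console.log([1, 2, 3, 4, 5, 6].filter((id) => ownsJob(id, replica))) // [ 2, 6 ]
```

Because a single process is replica `0` of `1`, the same code keeps working when `replicas` is left at its default.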
163
+
164
+ ### `processes[].lifecycle`
165
+
166
+ Command hooks run when Rollbridge **gracefully stops** the process — during a
167
+ deploy's drain, a `rollbridge restart`, a memory restart, or shutdown. They let a
168
+ job worker quiesce and finish in-flight work before it is terminated. Omit
169
+ `lifecycle` for the default behavior (just `stopSignal` then `SIGKILL`).
170
+
171
+ | Field | Type | Default | Description |
172
+ | --- | --- | --- | --- |
173
+ | `lifecycle.quietCommand` | string | unset | Run first to tell the process to stop accepting new work. |
174
+ | `lifecycle.drainCommand` | string | unset | Run after quieting to wait until the process has drained (it blocks until done). When unset, Rollbridge instead waits up to `drainTimeoutMs` for the process to exit on its own. Requires a positive `drainTimeoutMs` (which bounds it). |
175
+ | `lifecycle.drainTimeoutMs` | non-negative number | `0` | Bounds the drain step. `0` **skips the drain step entirely** (no `drainCommand`, no wait). |
176
+ | `lifecycle.stopCommand` | string | unset | Run to stop the process instead of sending `stopSignal`, if it is still running after draining. |
177
+
178
+ Because `stopCommand` runs **instead of** sending `stopSignal`, setting both a
179
+ `stopCommand` and a custom `stopSignal` is rejected — the signal would be silently
180
+ ignored. Use one or the other.
181
+
182
+ The full stop sequence is: run `quietCommand` → drain (`drainCommand`, or wait
183
+ `drainTimeoutMs` for the process to exit) → if still running, run `stopCommand`
184
+ or send `stopSignal` → `SIGKILL` after `gracefulStopMs`. Each hook command is run
185
+ through a shell with the process's environment plus `ROLLBRIDGE_PID` (the
186
+ process-group leader's pid, so a hook can `kill -TSTP -$ROLLBRIDGE_PID`). Every
187
+ hook is **bounded by a timeout** (its drain timeout, or `gracefulStopMs`) and its
188
+ failure is non-fatal — the sequence proceeds and `SIGKILL` is always the final
189
+ fallback, so a slow or broken hook can't wedge a stop.
190
+
191
+ ```js
192
+ {id: "worker", policy: "companion", command: "…", lifecycle: {quietCommand: "kill -TSTP -$ROLLBRIDGE_PID", drainTimeoutMs: 60000}}
193
+ ```
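To make the ordering concrete, the sketch below turns a process entry into the longest sequence of steps the stop could take. It is a reading of the sequence described above, not the daemon's implementation: `stopPlan`, the step names, and the `10000` value used for `proxy.forceStopTimeoutMs` are illustrative assumptions.

```js
// Build the ordered stop steps for one process entry.
// Later steps only apply if the process is still running when they are reached.
function stopPlan(proc, proxy) {
  const lifecycle = proc.lifecycle ?? {}
  const drainTimeoutMs = lifecycle.drainTimeoutMs ?? 0
  const steps = []

  if (lifecycle.quietCommand) steps.push({step: "quiet", run: lifecycle.quietCommand})

  // A drain timeout of 0 skips the drain step entirely.
  if (drainTimeoutMs > 0) {
    steps.push(lifecycle.drainCommand
      ? {step: "drain", run: lifecycle.drainCommand, timeoutMs: drainTimeoutMs}
      : {step: "wait-for-exit", timeoutMs: drainTimeoutMs})
  }

  // stopCommand runs instead of the signal, never in addition to it.
  steps.push(lifecycle.stopCommand
    ? {step: "stop", run: lifecycle.stopCommand}
    : {step: "signal", signal: proc.stopSignal ?? "SIGTERM"})

  steps.push({step: "kill", signal: "SIGKILL", afterMs: proc.gracefulStopMs ?? proxy.forceStopTimeoutMs})
  return steps
}

const worker = {
  id: "worker",
  policy: "companion",
  lifecycle: {quietCommand: "kill -TSTP -$ROLLBRIDGE_PID", drainTimeoutMs: 60000}
}
console.log(stopPlan(worker, {forceStopTimeoutMs: 10000}).map((entry) => entry.step))
// [ 'quiet', 'wait-for-exit', 'signal', 'kill' ]
```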
194
+
195
+ ### `processes[].nonBlockingDrain`
196
+
197
+ By default, when a release is retired its processes are stopped **after** the
198
+ proxied process's connections have drained (or `proxy.drainTimeoutMs` elapses).
199
+ That keeps a worker alive in case the draining web process still depends on it —
200
+ but it also holds a background worker open for the whole connection drain.
201
+
202
+ Set `nonBlockingDrain: true` on a `companion` whose work is independent of the
203
+ proxied process (a job worker on a shared queue). Its graceful stop — `lifecycle`
204
+ hooks, or `stopSignal` then `SIGKILL` after `gracefulStopMs` — then starts **as
205
+ soon as the release is retired**, in parallel with the connection drain, rather
206
+ than after it. The new release's workers (started before traffic switches) handle
207
+ new work while the retired release's workers finish their in-flight jobs. The
208
+ whole drain stays non-blocking — the deploy returns immediately.
209
+
210
+ ```js
211
+ {id: "worker", policy: "companion", command: "…", nonBlockingDrain: true, stopSignal: "SIGINT", gracefulStopMs: 60000}
212
+ ```
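The timing difference can be pictured with a small calculation. This sketch assumes time is measured in milliseconds from the moment the release is retired; `companionStopStartMs` and the `30000` value for `proxy.drainTimeoutMs` are hypothetical.

```js
// Milliseconds after retirement at which a companion's graceful stop begins.
function companionStopStartMs(proc, connectionDrainMs, proxy) {
  // Non-blocking: the stop starts as soon as the release is retired.
  if (proc.nonBlockingDrain) return 0
  // Default: wait for the connection drain, bounded by proxy.drainTimeoutMs.
  return Math.min(connectionDrainMs, proxy.drainTimeoutMs)
}

const proxy = {drainTimeoutMs: 30000} // illustrative value
const slowDrainMs = 45000 // connections that would take 45s to finish on their own

console.log(companionStopStartMs({id: "worker"}, slowDrainMs, proxy)) // 30000
console.log(companionStopStartMs({id: "worker", nonBlockingDrain: true}, slowDrainMs, proxy)) // 0
```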
213
+
75
214
  ### `processes[].restart`
76
215
 
77
216
  Controls automatic restarts of a crashed process (a release's active processes
@@ -90,6 +229,29 @@ With the defaults a crashed process restarts indefinitely after `restartDelayMs`
90
229
  Pair `backoffFactor`/`windowMs` to back off and self-heal after a clean run, or
91
230
  set `maxRestarts` to give up on a process stuck in a crash loop.
92
231
 
232
+ ### `processes[].memory`
233
+
234
+ Monitors the resident memory (RSS) of the process and **gracefully restarts** it
235
+ (the process's `stopSignal`, `SIGTERM` by default, then `SIGKILL` after `gracefulStopMs`) when it exceeds `limitBytes`.
236
+ RSS is measured across the whole managed process group (the spawned wrapper and
237
+ its children), not just the wrapper. Omit `memory` to disable monitoring. Memory
238
+ measurement uses `/proc` and is a no-op on platforms without it.
239
+
240
+ | Field | Type | Default | Description |
241
+ | --- | --- | --- | --- |
242
+ | `memory.limitBytes` | positive integer | **required** | RSS limit in bytes; exceeding it restarts the process. |
243
+ | `memory.warnBytes` | non-negative integer | `0` (off) | Log a `memory warning` once when RSS first crosses this threshold (set below `limitBytes`). |
244
+ | `memory.checkIntervalMs` | positive number | `5000` | How often to measure RSS. |
245
+
246
+ ```js
247
+ {id: "worker", policy: "companion", command: "…", memory: {limitBytes: 536870912, warnBytes: 402653184, checkIntervalMs: 5000}}
248
+ ```
249
+
250
+ A memory restart is reported in `status` (`memoryRestarts`, `lastMemoryRestartAt`,
251
+ current `rssBytes`) and recorded in the event history (a `process started` event
252
+ with `reason: "memory"`). `status` also reports `children` — the sampled process
253
+ tree, with each group member's `pid`, `command`, and `rssBytes`.
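The byte values in the example are 512 MiB and 384 MiB. A minimal sketch of the per-check decision, assuming the warning fires at or above `warnBytes` and the restart strictly above `limitBytes` (the exact comparisons are not specified above), could look like this; `mib` and `memoryAction` are names invented here.

```js
// Convert mebibytes to bytes so limits stay readable.
const mib = (n) => n * 1024 * 1024

// Decide what one RSS sample should trigger.
// `alreadyWarned` models the "log once" behaviour of the warning.
function memoryAction(rssBytes, memory, alreadyWarned) {
  if (rssBytes > memory.limitBytes) return "restart"
  if (memory.warnBytes > 0 && rssBytes >= memory.warnBytes && !alreadyWarned) return "warn"
  return "ok"
}

const memory = {limitBytes: mib(512), warnBytes: mib(384), checkIntervalMs: 5000}

console.log(memory.limitBytes, memory.warnBytes) // 536870912 402653184
console.log(memoryAction(mib(400), memory, false)) // warn
console.log(memoryAction(mib(600), memory, true)) // restart
```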
254
+
93
255
  ### `processes[].health`
94
256
 
95
257
  Only the `proxied` process is health-checked (before traffic switches to a new
@@ -115,6 +277,7 @@ start with a clear error.
115
277
  | `{{releasePath}}` | The deploy's `--release-path`. |
116
278
  | `{{revision}}` | The deploy's `--revision` (falls back to the release id). |
117
279
  | `{{processId}}` | This process's `id`. |
280
+ | `{{replicaIndex}}`, `{{replicaCount}}` | This instance's zero-based replica index and the total replica count (`0` and `1` for a single process). |
118
281
  | `{{port}}` | The port allocated to this process. |
119
282
  | `{{ports.<id>}}` | The port allocated to another process. |
120
283
  | `{{proxy.host}}`, `{{proxy.port}}`, `{{proxy.upstreamHost}}` | The configured proxy bind host/port and upstream host. |
@@ -128,7 +291,8 @@ Rollbridge sets these in every managed process's environment (the process's own
128
291
  | Variable | Value |
129
292
  | --- | --- |
130
293
  | `ROLLBRIDGE_APPLICATION` | `application` |
131
- | `ROLLBRIDGE_PROCESS_ID` | This process's `id`. |
294
+ | `ROLLBRIDGE_PROCESS_ID` | This process's `id` (the base id, not the `#index` instance id). |
295
+ | `ROLLBRIDGE_REPLICA_INDEX`, `ROLLBRIDGE_REPLICA_COUNT` | This instance's zero-based replica index and total replica count (`0` and `1` for a single process). |
132
296
  | `ROLLBRIDGE_RELEASE_ID` | The release id. |
133
297
  | `ROLLBRIDGE_RELEASE_PATH` | The release path. |
134
298
  | `ROLLBRIDGE_REVISION` | The revision (or release id). |
@@ -144,5 +308,11 @@ Rollbridge sets these in every managed process's environment (the process's own
144
308
  - Process `id`s must be unique.
145
309
  - `port` must be a positive port number or an ascending `{from, to}` range.
146
310
  - `control.mode` must be an octal mode between `0` and `0o777`.
311
+ - `control.owner` and `control.group` must each be a non-negative integer id or a non-empty name (resolved at daemon start).
147
312
  - `outputLines` must be a positive integer and `releaseRetention.keep` a non-negative integer; `health.startDelayMs` and `releaseRetention.maxAgeMs` must be non-negative numbers.
148
313
  - `restart.maxRestarts` must be a non-negative integer (omit it for unlimited restarts); `restart.backoffFactor` must be a number ≥ 1; `restart.windowMs` and `restart.maxDelayMs` must be non-negative numbers.
314
+ - When `memory` is set, `memory.limitBytes` must be a positive integer, `memory.warnBytes` a non-negative integer, and `memory.checkIntervalMs` a positive number.
315
+ - `replicas` must be a positive integer; `replicas > 1` is allowed only on a `companion` process without a `port`. Process ids must not contain `#` (reserved for replica instance ids).
316
+ - `lifecycle.quietCommand`/`drainCommand`/`stopCommand` must be strings when set, and `lifecycle.drainTimeoutMs` a non-negative number; `lifecycle.drainCommand` requires a positive `lifecycle.drainTimeoutMs`. A `lifecycle.stopCommand` may not be combined with a custom `stopSignal` (the `stopCommand` runs instead of the signal, so the signal would be ignored).
317
+ - `nonBlockingDrain` must be a boolean, and is allowed only on a `companion` process.
318
+ - `statePath` must be a string when set.
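Several of the per-process rules above can be expressed as a short checker. This is a sketch of the rules as written, not Rollbridge's validator: `validateProcess` and its error strings are invented, and it assumes that a "custom" `stopSignal` means any explicitly set signal other than `SIGTERM`.

```js
// Collect rule violations for one `processes[]` entry.
function validateProcess(proc) {
  const errors = []
  const replicas = proc.replicas ?? 1
  const lifecycle = proc.lifecycle ?? {}

  if (proc.id.includes("#")) errors.push("id must not contain '#'")

  if (!Number.isInteger(replicas) || replicas < 1) {
    errors.push("replicas must be a positive integer")
  } else if (replicas > 1 && (proc.policy !== "companion" || proc.port !== undefined)) {
    errors.push("replicas > 1 requires a companion process without a port")
  }

  if (proc.nonBlockingDrain !== undefined && typeof proc.nonBlockingDrain !== "boolean") {
    errors.push("nonBlockingDrain must be a boolean")
  } else if (proc.nonBlockingDrain === true && proc.policy !== "companion") {
    errors.push("nonBlockingDrain is allowed only on a companion process")
  }

  if (lifecycle.drainCommand !== undefined && !(lifecycle.drainTimeoutMs > 0)) {
    errors.push("lifecycle.drainCommand requires a positive lifecycle.drainTimeoutMs")
  }

  // Assumption: "custom" means an explicitly set signal other than the default.
  if (lifecycle.stopCommand !== undefined && proc.stopSignal !== undefined && proc.stopSignal !== "SIGTERM") {
    errors.push("lifecycle.stopCommand may not be combined with a custom stopSignal")
  }

  return errors
}

console.log(validateProcess({id: "web", policy: "proxied", port: 8082, replicas: 2}))
// [ 'replicas > 1 requires a companion process without a port' ]
```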
@@ -0,0 +1,77 @@
1
+ # Logging
2
+
3
+ The Rollbridge daemon writes one structured JSON line per operational event
4
+ (deploys, traffic switches, process starts/exits, restarts, memory events, and
5
+ failed commands):
6
+
7
+ ```json
8
+ {"at":"2026-05-23T14:31:09.512Z","message":"traffic switched","data":{"previousReleaseId":"v3","releaseId":"v4"}}
9
+ ```
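Because each event is a single JSON object per line, a consumer can parse the stream line by line. The sketch below parses the sample line above; `parseLogLine` is a name made up for this example, and fields beyond `at`, `message`, and `data` are not assumed.

```js
// Parse one daemon log line into a timestamp, a message, and its data payload.
function parseLogLine(line) {
  const {at, message, data = {}} = JSON.parse(line)
  return {at: new Date(at), message, data}
}

const line = '{"at":"2026-05-23T14:31:09.512Z","message":"traffic switched","data":{"previousReleaseId":"v3","releaseId":"v4"}}'
const event = parseLogLine(line)

console.log(`${event.message}: ${event.data.previousReleaseId} -> ${event.data.releaseId}`)
// prints "traffic switched: v3 -> v4"
```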
10
+
11
+ These lines go to the daemon's **stdout**; where that ends up depends on how the
12
+ daemon was started.
13
+
14
+ ## Where logs go
15
+
16
+ | How the daemon runs | Destination |
17
+ | --- | --- |
18
+ | `rollbridge daemon` (foreground) | stdout — redirect it (`rollbridge daemon … >> /var/log/rollbridge/app.log 2>&1`) or let your service manager capture it. |
19
+ | systemd (`examples/rollbridge.service`) | the journal — `journalctl -u rollbridge`. journald rotates on its own. |
20
+ | `rollbridge ensure-daemon` / `rollbridge deploy --ensure-daemon` | the **daemon log file**: `--daemon-log-path <path>`, default `/tmp/rollbridge-<application>.log`. The detached daemon's stdout and stderr are appended there. |
21
+
22
+ Point `--daemon-log-path` at a path your rotation tooling manages, for example:
23
+
+ ```bash
+ rollbridge deploy --ensure-daemon \
+   --config /etc/rollbridge/rollbridge.js \
+   --daemon-log-path /var/log/rollbridge/app.log \
+   --release-path "$release_path"
+ ```
30
+
31
+ The daemon log file is the durable, append-only stream of the daemon's own
32
+ events. It is distinct from the two in-memory views:
33
+
34
+ - `rollbridge logs` — recent stdout/stderr of each **managed process** (your app),
35
+ bounded per process by `outputLines`.
36
+ - `rollbridge events` — the recent structured daemon event history (the most
37
+ recent 1000 events), the same events written to the log file.
38
+
39
+ Both are cleared when the daemon restarts; the log file persists.
40
+
41
+ ## Rotation
42
+
43
+ ### systemd / journald
44
+
45
+ When the daemon runs under systemd its logs are in the journal, which rotates
46
+ automatically. Bound journal disk use with `SystemMaxUse=` in
47
+ `/etc/systemd/journald.conf` (or a per-namespace drop-in). No logrotate config is
48
+ needed for the daemon itself.
49
+
50
+ ### The daemon log file (logrotate)
51
+
52
+ The detached daemon keeps the log file **open for its whole lifetime** (its
53
+ stdout/stderr file descriptors point at it). A plain `rename`-based rotation
54
+ would leave the daemon writing to the old, now-renamed inode while the new file
55
+ stays empty. Use logrotate's **`copytruncate`**, which copies the file and then
56
+ truncates it in place, keeping the daemon's open descriptor valid:
57
+
+ ```
+ /var/log/rollbridge/*.log {
+   daily
+   rotate 14
+   compress
+   missingok
+   notifempty
+   copytruncate
+ }
+ ```
68
+
69
+ `copytruncate` has a small race window — log lines written between the copy and
70
+ the truncate can be lost — which is acceptable for the daemon's low-volume,
71
+ milestone-level logging. Rollbridge does not reopen its log file on a signal, so
72
+ `copytruncate` (rather than `create` + a reopen signal) is the recommended
73
+ approach for the daemon log file.
74
+
75
+ Prefer running under systemd (journald) when you can; reach for `--daemon-log-path`
76
+ + logrotate when you run the daemon outside a service manager that captures
77
+ stdout.
@@ -0,0 +1,53 @@
1
+ # Releasing (maintainers)
2
+
3
+ Rollbridge publishes **patch** releases from the default branch with:
4
+
5
+ ```bash
6
+ npm run release:patch
7
+ ```
8
+
9
+ That script (the `release-patch` package) owns the version bump, lockfile update,
10
+ default-branch commit, push, and `npm publish`. Don't run `npm version` yourself
11
+ first — let the script own the bump. Use this checklist around it.
12
+
13
+ The default branch is `master` for this repo; the checks below stay
14
+ branch-agnostic so they remain correct if that ever changes. Capture the name once
15
+ and reuse it (the commands below assume it is set):
16
+
17
+ ```bash
18
+ default_branch=$(git rev-parse --abbrev-ref origin/HEAD | sed 's@^origin/@@') # e.g. master
19
+ ```
20
+
21
+ ## Before releasing
22
+
23
+ - [ ] You're on the default branch and synced with it: `git switch "$default_branch"`,
24
+ then `git fetch && git status` shows it up to date, with a **clean working tree**.
25
+ - [ ] CI is green for that commit, and `npm run all-checks` passes locally
26
+ (typecheck, lint, and the full test suite).
27
+ - [ ] `README.md` and `docs/` reflect every user-visible change shipped since the
28
+ last release (config fields, CLI commands/flags, status/event output,
29
+ operational behavior).
30
+ - [ ] `TODO.md` checkboxes for the shipped work are updated.
31
+ - [ ] You can publish: `npm whoami` shows an account with publish rights to the
32
+ `rollbridge` package, and you can push to the default branch.
33
+
34
+ ## Release
35
+
36
+ ```bash
37
+ npm run release:patch
38
+ ```
39
+
40
+ The script bumps the patch version, updates `package-lock.json`, commits the bump
41
+ to the default branch, pushes it, and publishes the package to npm.
42
+
43
+ ## After releasing
44
+
45
+ - [ ] The new version is on the registry: `npm view rollbridge version` matches
46
+ the bumped `package.json` version.
47
+ - [ ] The version-bump commit reached the remote (not just local): `git fetch`,
48
+ then `git log --oneline -1 "origin/$default_branch"` shows the bump — a
49
+ failed or blocked push won't satisfy this.
50
+ - [ ] Your working tree is clean and still on the default branch.
51
+
52
+ `release:patch` only does patch releases — a minor or major version bump is a
53
+ manual decision and is not covered by this script.
@@ -0,0 +1,129 @@
1
+ # TensorBuzz production runbook
2
+
3
+ Operating the TensorBuzz backend under Rollbridge. The production config lives at
4
+ [`examples/tensorbuzz.com.js`](../examples/tensorbuzz.com.js); this runbook
5
+ assumes it is deployed to a stable path (`/etc/rollbridge/tensorbuzz.com.js`
6
+ below) and the daemon runs as a systemd service (see
7
+ [Running under systemd](../README.md#running-under-systemd)). For the general
8
+ Velocious topology and the worker recipe, see [`docs/velocious.md`](velocious.md).
9
+
10
+ ## Ports
11
+
12
+ | Port | Process | Notes |
13
+ | --- | --- | --- |
14
+ | `4500` | Rollbridge proxy | The stable public port. **Nginx proxies the backend host to `127.0.0.1:4500`** — never to a release's web port. |
15
+ | `7330` | `beacon` (`service`) | Fixed; the shared broker every release connects to. |
16
+ | `7331` | `background-jobs-main` (`service`) | Fixed; the job coordinator. |
17
+ | `14500`–`14599` | `web` (`proxied`) | One port per release, allocated per deploy; Rollbridge forwards `4500` here. |
18
+ | (none) | `background-jobs-worker` (`companion`) | A per-release worker; no listening port. |
19
+
20
+ Control socket: `/tmp/rollbridge-tensorbuzz.sock`.
21
+
22
+ ## Process topology
23
+
24
+ - **`beacon`** and **`background-jobs-main`** are `service`s: one daemon-wide
25
+ instance each, on their fixed ports, surviving deploys.
26
+ - **`background-jobs-worker`** is a `companion`: a fresh worker per release,
27
+ running that release's code, with `gracefulStopMs: 60000` so an in-flight job
28
+ finishes before `SIGKILL`.
29
+ - **`web`** is the one `proxied` process, health-checked at `/ping` before
30
+ traffic switches.
31
+
32
+ Each process waits for its dependencies with `wait-for-it` (`beacon` →
33
+ `background-jobs-main` → `worker`/`web`), so nothing starts talking to Beacon or
34
+ the job coordinator before they listen.
35
+
36
+ ## External services
37
+
38
+ Rollbridge manages **only the four processes above**. Everything else the
39
+ Velocious app depends on — the database and any other backing services — is
40
+ **provisioned and operated outside Rollbridge**: Rollbridge does not start, stop,
41
+ health-check, or know about them. Configure those connections through the app's
42
+ own environment/config. When such a dependency is down, the `web` process's
43
+ `/ping` health check is what gates a deploy (a release that can't reach its
44
+ database won't pass health and won't go live).
45
+
46
+ ## Deploying
47
+
48
+ Drive deploys through the CLI (see [`docs/deploy-recipes.md`](deploy-recipes.md)).
49
+ Run **backwards-compatible** migrations before switching traffic, because the old
50
+ and new releases overlap during the drain:
51
+
+ ```bash
+ release_path=/srv/tensorbuzz/releases/<timestamp> # prepared by your pipeline
+ (cd "$release_path/backend" && npx velocious db:migrate)
+
+ rollbridge deploy \
+   --ensure-daemon \
+   --config /etc/rollbridge/tensorbuzz.com.js \
+   --release-path "$release_path" \
+   --revision "$(git -C "$release_path/backend" rev-parse HEAD)"
+ ```
62
+
63
+ ### Deploy ordering
64
+
65
+ On `rollbridge deploy`, Rollbridge:
66
+
67
+ 1. starts any missing `service` (`beacon`, `background-jobs-main`);
68
+ 2. starts the new release's `background-jobs-worker`, then its `web` process, and
69
+ health-checks `web` on its `{{port}}`/`/ping`;
70
+ 3. switches new traffic to the new `web`;
71
+ 4. refreshes the services' restart templates to the new release;
72
+ 5. drains the previous release's connections, then stops its `web` and worker.
73
+
74
+ If the new release fails to start or health-check, **the previous release stays
75
+ active** and the command exits non-zero — so a failed deploy never takes the site
76
+ down.
77
+
78
+ ## Rollback
79
+
80
+ ```bash
81
+ rollbridge rollback --config /etc/rollbridge/tensorbuzz.com.js
82
+ # or a specific retained release:
83
+ rollbridge rollback --config /etc/rollbridge/tensorbuzz.com.js --release-id <id>
84
+ ```
85
+
86
+ Rollback re-runs the deploy flow on a retained release, health-checks it, and
87
+ switches traffic back. Constraints:
88
+
89
+ - **Migrations are not reverted.** Rollback only manages processes; if a release
90
+ bumped the schema, rolling code back requires that the old code still works
91
+ against the new schema — keep migrations backwards-compatible (the same rule as
92
+ deploys).
93
+ - The target release's on-disk directory must still exist (don't prune it from
94
+ disk before you might roll back to it).
95
+ - Only releases Rollbridge still retains (`releaseRetention`) can be targeted.
96
+
97
+ ## Day-to-day operations
98
+
99
+ ```bash
100
+ C=/etc/rollbridge/tensorbuzz.com.js
101
+
102
+ rollbridge status --config "$C" # active release, ports, per-process state
103
+ rollbridge logs --config "$C" --process web # recent stdout/stderr of a process
104
+ rollbridge events --config "$C" # deploys, switches, crashes, restarts
105
+ rollbridge doctor --config "$C" # pre-flight: socket, proxy port, state
106
+ rollbridge restart --config "$C" --process background-jobs-worker # bounce the worker
107
+ ```
108
+
109
+ Restarting `beacon` or `background-jobs-main` bounces a shared broker and briefly
110
+ disrupts everything that depends on it; prefer `deploy`/`rollback` for code
111
+ changes. See [`docs/troubleshooting.md`](troubleshooting.md) for health-check
112
+ failures, port conflicts, stale sockets, crash loops, and stuck draining
113
+ releases.
114
+
115
+ ## Crash recovery
116
+
117
+ Set [`statePath`](config.md#statepath) in the config to have the daemon persist
118
+ its state. After a daemon crash or reboot, `rollbridge doctor` reports any
119
+ **orphaned** processes still alive from the previous daemon. To clean them up
120
+ before restarting the daemon, run `rollbridge recover` (a dry run that lists
121
+ them), then `rollbridge recover --force` to stop them:
122
+
123
+ ```bash
124
+ rollbridge recover --config /etc/rollbridge/tensorbuzz.com.js # list leftovers
125
+ rollbridge recover --config /etc/rollbridge/tensorbuzz.com.js --force # stop them
126
+ ```
127
+
128
+ A machine reboot kills every process, so there are usually no orphans afterward —
129
+ the daemon just starts fresh.
package/docs/velocious.md CHANGED
@@ -156,17 +156,55 @@ The worker is a `companion`, so each release runs its own workers:
156
156
 
157
157
  - On deploy, the **new** release's workers start (running the new code) before
158
158
  traffic switches; the **old** release's workers are stopped when that release
159
- is drained and retired — `SIGTERM`, then `SIGKILL` after `gracefulStopMs`.
160
- - Set `gracefulStopMs` on the worker to at least your longest in-flight job so a
161
- job gets time to finish on `SIGTERM` before the forced kill. The example uses
162
- `60000` (60s).
163
-
164
- > **Planned:** graceful job-worker draining via lifecycle hooks
165
- > (`quietCommand`/`drainCommand`/`stopCommand` and a non-blocking drain mode so
166
- > new workers start while old workers finish) is on the
167
- > [roadmap](../TODO.md#major-features) and not yet implemented. Until then, the
168
- > `gracefulStopMs` window above is the mechanism for letting in-flight jobs
169
- > finish.
159
+ is drained and retired — the worker's `stopSignal`, then `SIGKILL` after
160
+ `gracefulStopMs`.
161
+ - Set `stopSignal` to the signal your worker drains on and `gracefulStopMs` to at
162
+ least your longest in-flight job, so a job gets time to finish before the
163
+ forced kill. Set `replicas` to run a pool of workers.
164
+
165
+ See [`docs/workers.md`](workers.md) for the full safe background-job deployment
166
+ pattern (companion + `replicas` + `stopSignal`/`lifecycle` hooks +
167
+ `gracefulStopMs`), the old/new worker overlap, and `nonBlockingDrain` to start the
168
+ old workers' drain immediately when a release is retired.
169
+
170
+ ### Worker recipe
171
+
172
+ A complete `background-jobs-worker` entry that runs a pool and finishes in-flight
173
+ jobs across a deploy:
174
+
+ ```js
+ {
+   id: "background-jobs-worker",
+   policy: "companion",
+   cwd: "{{releasePath}}/backend",
+   env: {
+     NODE_ENV: "production",
+     VELOCIOUS_ENV: "production",
+     VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
+     VELOCIOUS_BACKGROUND_JOBS_PORT: "{{ports.background-jobs-main}}"
+   },
+   command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- wait-for-it 127.0.0.1:{{ports.background-jobs-main}} --strict -- npx velocious background-jobs-worker",
+   replicas: 4,
+   gracefulStopMs: 60000
+ }
+ ```
191
+
192
+ - `replicas: 4` runs four worker instances (`background-jobs-worker#0` … `#3`),
193
+ each with `ROLLBRIDGE_REPLICA_INDEX`/`ROLLBRIDGE_REPLICA_COUNT` set in its environment, in case you shard work.
194
+ - On deploy the new release's workers start before traffic switches; the old
195
+ release's workers receive `SIGTERM` (the default `stopSignal`) when the old
196
+ release is retired, then `SIGKILL` after `gracefulStopMs` — so size
197
+ `gracefulStopMs` to your longest job. Both releases' workers briefly consume the
198
+ shared queue, so keep job code backwards-compatible and jobs idempotent.
199
+
200
+ If your worker quiesces on a command or a non-default signal, add a `lifecycle`
201
+ block — Rollbridge runs `quietCommand`, drains for up to `drainTimeoutMs`, then
202
+ stops. For example, send a quiet signal to the worker's process group before the
203
+ drain:
204
+
205
+ ```js
206
+ lifecycle: {quietCommand: "kill -TSTP -$ROLLBRIDGE_PID", drainTimeoutMs: 60000}
207
+ ```
170
208
 
171
209
  ### Choosing the jobs-main policy
172
210