rollbridge 0.1.4 → 0.1.6

package/docs/config.md CHANGED
@@ -26,6 +26,7 @@ export default {
26
26
  | `proxy` | object | **required** | Proxy listener and shared defaults (see below). |
27
27
  | `processes` | array | **required** | Managed processes (see below). Exactly one must be `proxied`. |
28
28
  | `releaseRetention` | object | — | How many stopped releases the daemon retains (see below). |
29
+ | `statePath` | string | unset (no persistence) | File the daemon persists its state to, enabling orphaned-process detection on the next startup (see [`statePath`](#statepath)). |
29
30
 
30
31
  ## `control`
31
32
 
@@ -33,6 +34,14 @@ export default {
33
34
  | --- | --- | --- | --- |
34
35
  | `control.path` | string | `/tmp/rollbridge-<application>.sock` | Unix domain socket the CLI uses to talk to the daemon. |
35
36
  | `control.mode` | octal string (e.g. `"660"`) or octal number (`0o660`) | unset | `chmod` applied to the socket after it binds, to share it with a deploy group. When unset, the daemon umask applies. |
37
+ | `control.owner` | non-negative integer uid or user name | unset | `chown` owner applied to the socket after it binds. |
38
+ | `control.group` | non-negative integer gid or group name | unset | `chown` group applied to the socket after it binds, so a shared deploy group can use it. |
39
+
40
+ Names are resolved via `/etc/passwd`/`/etc/group` (local users and groups); use
41
+ numeric ids for NSS-only principals. The daemon must run as a user permitted to
42
+ `chown` the socket (root, or a member of the target group) — otherwise it fails
43
+ to start with a clear error. Combine `control.group` with `control.mode: "660"`
44
+ to let a deploy group talk to the daemon.
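As a rough illustration of the two accepted `control.mode` spellings, the sketch below normalizes either form to the numeric mode a `chmod` call would take. `normalizeMode` is a hypothetical helper written for this note, not Rollbridge's own code, and the `deploy` group name is only a placeholder; the range check mirrors the documented `0` to `0o777` rule.

```js
// Hypothetical helper: accept "660" (octal string) or 0o660 (number) and
// return the numeric mode, rejecting anything outside 0..0o777.
function normalizeMode(mode) {
  let value = mode;
  if (typeof mode === "string") {
    if (!/^[0-7]{1,4}$/.test(mode)) throw new Error(`invalid octal mode: ${mode}`);
    value = parseInt(mode, 8);
  }
  if (!Number.isInteger(value) || value < 0 || value > 0o777) {
    throw new Error(`mode out of range: ${mode}`);
  }
  return value;
}

// Placeholder config fragment: share the socket with a local "deploy" group.
const control = {group: "deploy", mode: "660"};

console.log(normalizeMode(control.mode) === 0o660); // true
console.log(normalizeMode(0o660).toString(8)); // 660
```

Both spellings resolve to the same number, so a config can use whichever reads better.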
36
45
 
37
46
  ## `proxy`
38
47
 
@@ -56,6 +65,29 @@ export default {
56
65
  Active and draining releases are never pruned. This governs Rollbridge's own
57
66
  release records; the deploy tool still owns on-disk release directories.
58
67
 
68
+ ## `statePath`
69
+
70
+ When set, the daemon persists a state snapshot — the active and draining
71
+ releases, each managed process's metadata (including pid), restart counters, and
72
+ recent events — to this file (atomically, on changes and every few seconds). On a
73
+ clean `shutdown` the file is removed.
74
+
75
+ On the **next startup**, the daemon reads any leftover file and reports managed
76
+ processes whose pids are still alive — likely orphans from a daemon that crashed
77
+ without shutting down cleanly — in its log and event history, and in the
78
+ `orphans` array of [`rollbridge status`](cli.md#status). This is **advisory**:
79
+ Rollbridge cannot re-adopt detached children, so it does not stop them
80
+ automatically; the operator verifies and stops the leftovers. A recycled pid can
81
+ be a false positive, so treat a report as a prompt to investigate. Use
82
+ [`rollbridge recover`](cli.md#recover) to list and (with `--force`) stop those
83
+ orphans after a crash.
84
+
85
+ ```js
86
+ statePath: "/var/lib/rollbridge/ticket-server.state.json"
87
+ ```
88
+
89
+ Leave `statePath` unset to disable persistence (the default).
90
+
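The orphan check described above comes down to asking whether a recorded pid is still alive. A minimal sketch of that idea in Node.js follows; the snapshot shape (`processes` with `id` and `pid`) is assumed for illustration and is not the documented state-file format. `process.kill(pid, 0)` is the standard Node.js existence test and sends no signal.

```js
// Sketch only: returns true when a process with this pid currently exists.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0 checks for existence without signalling
    return true;
  } catch (error) {
    // EPERM: the pid exists but belongs to another user. ESRCH: it is gone.
    return error.code === "EPERM";
  }
}

// Assumed snapshot shape; a live pid is only a candidate, since pids are recycled.
function findOrphanCandidates(snapshot) {
  return snapshot.processes.filter(
    (entry) => Number.isInteger(entry.pid) && entry.pid > 0 && isPidAlive(entry.pid)
  );
}

const snapshot = {processes: [{id: "web", pid: process.pid}, {id: "worker", pid: null}]};
console.log(findOrphanCandidates(snapshot).map((entry) => entry.id).join(", ")); // web
```

Because a recycled pid passes the same test, a result like this is a prompt to investigate, matching the advisory wording above.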
59
91
  ## `processes[]`
60
92
 
61
93
  | Field | Type | Default | Description |
@@ -67,10 +99,126 @@ release records; the deploy tool still owns on-disk release directories.
67
99
  | `env` | object of string → string | `{}` | Extra environment variables (values templated). Merged over the injected `ROLLBRIDGE_*` vars. |
68
100
  | `port` | number or `{from, to}` | unset | Port (or range) allocated per release. **Required for the `proxied` process.** A plain number `n` means the fixed port `n` (`{from: n, to: n}`). |
69
101
  | `health` | object or `false` | enabled with defaults | Health check for the `proxied` process; set `false` to disable (see below). |
70
- | `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | `SIGTERM`→`SIGKILL` window for this process. |
71
- | `restartDelayMs` | number | `1000` | Delay before restarting this process after a crash. |
102
+ | `stopSignal` | signal name (e.g. `"SIGTERM"`, `"SIGINT"`, `"SIGQUIT"`) | `"SIGTERM"` | Signal sent to gracefully stop the process; after `gracefulStopMs` it is `SIGKILL`ed. Use a worker's quit signal so it finishes in-flight work before exiting. |
103
+ | `nonBlockingDrain` | boolean | `false` | When a release is retired, drain this process **immediately** (in parallel with the proxied connection drain) instead of after it. Companion processes only — typically background workers (see below). |
104
+ | `lifecycle` | object | no hooks | Command hooks run when gracefully stopping the process (see below). |
105
+ | `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | Graceful-stop window: time between `stopSignal`/`stopCommand` and `SIGKILL` for this process. |
106
+ | `restartDelayMs` | number | `1000` | Base delay before restarting this process after a crash (the backoff base; see `restart`). |
107
+ | `restart` | object | unlimited restarts, constant delay | Automatic-restart policy: cap, rolling window, and backoff (see below). |
108
+ | `memory` | object | unset (no monitoring) | Memory supervision: restart the process when its RSS exceeds a limit (see below). |
109
+ | `replicas` | positive integer | `1` | Run this many instances of the process (see below). |
72
110
  | `outputLines` | positive integer | `50` | Recent stdout/stderr lines retained per process and reported by `status`/`logs`. |
73
111
 
112
+ ### `processes[].replicas`
113
+
114
+ Run a pool of identical instances of one process — for example several
115
+ background-job workers. `replicas` greater than `1` is supported only on a
116
+ **`companion`** process **without a `port`** (the worker-pool case);
117
+ `proxied`, `singleton`, and ported processes must keep `replicas: 1`.
118
+
119
+ ```js
120
+ {id: "worker", policy: "companion", command: "npx velocious background-jobs-worker", replicas: 4}
121
+ ```
122
+
123
+ Each replica runs as its own managed process with id `<id>#<index>` (`worker#0`,
124
+ `worker#1`, …) — that id is what appears in `status` and what
125
+ [`rollbridge restart`](cli.md#restart) targets (use the base id `worker` to
126
+ restart every replica, or `worker#0` for one). Replicas get `replicaIndex`/
127
+ `replicaCount` template variables and `ROLLBRIDGE_REPLICA_INDEX`/`_COUNT` in their
128
+ environment, so each instance can pick a distinct shard, queue, or lock. A single
129
+ process (`replicas: 1`) keeps its plain id and is replica `0` of `1`.
130
+
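To make the naming and environment rules concrete, here is a small sketch of how one entry could expand into replica instances, and how a worker might use the injected variables. `expandReplicas` and `pickShard` are hypothetical helpers for illustration; only the id format and the `ROLLBRIDGE_*` variable names come from the text.

```js
// Hypothetical expansion of one process entry into its replica instances.
function expandReplicas(processConfig) {
  const replicaCount = processConfig.replicas ?? 1;
  return Array.from({length: replicaCount}, (_, replicaIndex) => ({
    // A single process keeps its plain id; pools get "<id>#<index>".
    instanceId: replicaCount === 1 ? processConfig.id : `${processConfig.id}#${replicaIndex}`,
    env: {
      ROLLBRIDGE_PROCESS_ID: processConfig.id, // the base id, not the instance id
      ROLLBRIDGE_REPLICA_INDEX: String(replicaIndex),
      ROLLBRIDGE_REPLICA_COUNT: String(replicaCount)
    }
  }));
}

// Inside a worker: derive a distinct shard from the injected index.
function pickShard(env, shardCount) {
  return Number(env.ROLLBRIDGE_REPLICA_INDEX ?? 0) % shardCount;
}

const instances = expandReplicas({id: "worker", policy: "companion", replicas: 4});
console.log(instances.map((instance) => instance.instanceId).join(", ")); // worker#0, worker#1, worker#2, worker#3
console.log(pickShard(instances[2].env, 4)); // 2
```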
131
+ ### `processes[].lifecycle`
132
+
133
+ Command hooks run when Rollbridge **gracefully stops** the process — during a
134
+ deploy's drain, a `rollbridge restart`, a memory restart, or shutdown. They let a
135
+ job worker quiesce and finish in-flight work before it is terminated. Omit
136
+ `lifecycle` for the default behavior (just `stopSignal` then `SIGKILL`).
137
+
138
+ | Field | Type | Default | Description |
139
+ | --- | --- | --- | --- |
140
+ | `lifecycle.quietCommand` | string | unset | Run first to tell the process to stop accepting new work. |
141
+ | `lifecycle.drainCommand` | string | unset | Run after quieting to wait until the process has drained (it blocks until done). When unset, Rollbridge instead waits up to `drainTimeoutMs` for the process to exit on its own. Requires a positive `drainTimeoutMs` (which bounds it). |
142
+ | `lifecycle.drainTimeoutMs` | non-negative number | `0` | Bounds the drain step. `0` **skips the drain step entirely** (no `drainCommand`, no wait). |
143
+ | `lifecycle.stopCommand` | string | unset | Run to stop the process instead of sending `stopSignal`, if it is still running after draining. |
144
+
145
+ Because `stopCommand` runs **instead of** sending `stopSignal`, setting both a
146
+ `stopCommand` and a custom `stopSignal` is rejected — the signal would be silently
147
+ ignored. Use one or the other.
148
+
149
+ The full stop sequence is: run `quietCommand` → drain (`drainCommand`, or wait
150
+ `drainTimeoutMs` for the process to exit) → if still running, run `stopCommand`
151
+ or send `stopSignal` → `SIGKILL` after `gracefulStopMs`. Each hook command is run
152
+ through a shell with the process's environment plus `ROLLBRIDGE_PID` (the
153
+ process-group leader's pid, so a hook can `kill -TSTP -$ROLLBRIDGE_PID`). Every
154
+ hook is **bounded by a timeout** (its drain timeout, or `gracefulStopMs`) and its
155
+ failure is non-fatal — the sequence proceeds and `SIGKILL` is always the final
156
+ fallback, so a slow or broken hook can't wedge a stop.
157
+
158
+ ```js
159
+ {id: "worker", policy: "companion", command: "…", lifecycle: {quietCommand: "kill -TSTP -$ROLLBRIDGE_PID", drainTimeoutMs: 60000}}
160
+ ```
161
+
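The stop sequence above can be read as an ordered plan. The sketch below builds that plan from a process entry; `planGracefulStop` is a hypothetical helper, the `gracefulStopMs` value in the example is made up, and bounding the quiet and stop hooks by `gracefulStopMs` is this sketch's reading of the timeout sentence rather than confirmed behavior.

```js
// Hypothetical planner: lists the steps of a graceful stop in order.
function planGracefulStop({lifecycle = {}, stopSignal = "SIGTERM", gracefulStopMs}) {
  const {quietCommand, drainCommand, drainTimeoutMs = 0, stopCommand} = lifecycle;
  const steps = [];
  if (quietCommand) steps.push({step: "quiet", run: quietCommand, timeoutMs: gracefulStopMs});
  if (drainTimeoutMs > 0) {
    // With no drainCommand, the drain step just waits for the process to exit.
    steps.push(drainCommand
      ? {step: "drain", run: drainCommand, timeoutMs: drainTimeoutMs}
      : {step: "drain", waitForExitMs: drainTimeoutMs});
  }
  // stopCommand runs instead of the signal, never in addition to it.
  steps.push(stopCommand
    ? {step: "stop", run: stopCommand, timeoutMs: gracefulStopMs}
    : {step: "stop", signal: stopSignal});
  steps.push({step: "kill", signal: "SIGKILL", afterMs: gracefulStopMs});
  return steps;
}

const plan = planGracefulStop({
  lifecycle: {quietCommand: "kill -TSTP -$ROLLBRIDGE_PID", drainTimeoutMs: 60000},
  gracefulStopMs: 30000 // example value only
});
console.log(plan.map((entry) => entry.step).join(" -> ")); // quiet -> drain -> stop -> kill
```

With `drainTimeoutMs` left at `0` the drain entry disappears from the plan, which matches the "skips the drain step entirely" rule.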
162
+ ### `processes[].nonBlockingDrain`
163
+
164
+ By default, when a release is retired its processes are stopped **after** the
165
+ proxied process's connections have drained (or `proxy.drainTimeoutMs` elapses).
166
+ That keeps a worker alive in case the draining web process still depends on it —
167
+ but it also holds a background worker open for the whole connection drain.
168
+
169
+ Set `nonBlockingDrain: true` on a `companion` whose work is independent of the
170
+ proxied process (a job worker on a shared queue). Its graceful stop — `lifecycle`
171
+ hooks, or `stopSignal` then `SIGKILL` after `gracefulStopMs` — then starts **as
172
+ soon as the release is retired**, in parallel with the connection drain, rather
173
+ than after it. The new release's workers (started before traffic switches) handle
174
+ new work while the retired release's workers finish their in-flight jobs. The
175
+ whole drain stays non-blocking — the deploy returns immediately.
176
+
177
+ ```js
178
+ {id: "worker", policy: "companion", command: "…", nonBlockingDrain: true, stopSignal: "SIGINT", gracefulStopMs: 60000}
179
+ ```
180
+
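One way to picture the difference is to split a retired release's processes into two phases. `planRetirement` below is a hypothetical helper for illustration, and the `mailer` process is invented; it only encodes the ordering described in this section.

```js
// Hypothetical split of a retired release's processes into stop phases.
function planRetirement(processes) {
  const onRetire = [];
  const afterConnectionDrain = [];
  for (const processConfig of processes) {
    const parallel = processConfig.policy === "companion" && processConfig.nonBlockingDrain === true;
    (parallel ? onRetire : afterConnectionDrain).push(processConfig.id);
  }
  return {onRetire, afterConnectionDrain};
}

const phases = planRetirement([
  {id: "web", policy: "proxied"},
  {id: "mailer", policy: "companion"},
  {id: "worker", policy: "companion", nonBlockingDrain: true}
]);
console.log(JSON.stringify(phases));
// {"onRetire":["worker"],"afterConnectionDrain":["web","mailer"]}
```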
181
+ ### `processes[].restart`
182
+
183
+ Controls automatic restarts of a crashed process (a release's active processes
184
+ and daemon-wide `service`s). The base delay is the process's `restartDelayMs`;
185
+ when the policy's limit is reached the process is left `failed` and not
186
+ restarted again.
187
+
188
+ | Field | Type | Default | Description |
189
+ | --- | --- | --- | --- |
190
+ | `restart.maxRestarts` | non-negative integer | unset (unlimited) | Maximum automatic restarts allowed within `windowMs` before Rollbridge stops restarting the process. `0` disables automatic restarts entirely. |
191
+ | `restart.windowMs` | non-negative number | `0` (process lifetime) | Rolling window over which `maxRestarts` is counted and after which the backoff resets. `0` counts over the process's whole lifetime. |
192
+ | `restart.backoffFactor` | number ≥ 1 | `1` (constant) | Multiplier applied to `restartDelayMs` on each successive restart in the window: `delay = restartDelayMs × backoffFactor ^ n`. `1` keeps a constant delay. |
193
+ | `restart.maxDelayMs` | non-negative number | `0` (no cap) | Upper bound on the backed-off delay. `0` means no cap. |
194
+
195
+ With the defaults a crashed process restarts indefinitely after `restartDelayMs`.
196
+ Pair `backoffFactor`/`windowMs` to back off and self-heal after a clean run, or
197
+ set `maxRestarts` to give up on a process stuck in a crash loop.
198
+
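A worked sketch of the delay formula and the restart cap may help. Both functions are hypothetical helpers, not Rollbridge's implementation; in particular, treating `n` as the number of restarts already counted in the window (so the first restart uses `n = 0`) is an assumption.

```js
// delay = restartDelayMs × backoffFactor ^ n, capped by maxDelayMs when it is > 0.
function nextRestartDelay({restartDelayMs = 1000, restart = {}}, restartsInWindow) {
  const {backoffFactor = 1, maxDelayMs = 0} = restart;
  const delay = restartDelayMs * backoffFactor ** restartsInWindow;
  return maxDelayMs > 0 ? Math.min(delay, maxDelayMs) : delay;
}

// True while the process is still under its restart cap.
function mayRestart({restart = {}}, restartTimestamps, now) {
  const {maxRestarts, windowMs = 0} = restart;
  if (maxRestarts === undefined) return true; // unset means unlimited
  const counted = windowMs > 0
    ? restartTimestamps.filter((at) => now - at < windowMs)
    : restartTimestamps; // windowMs 0 counts over the whole lifetime
  return counted.length < maxRestarts;
}

const worker = {restartDelayMs: 1000, restart: {backoffFactor: 2, maxDelayMs: 10000, maxRestarts: 5, windowMs: 60000}};
console.log([0, 1, 2, 3, 4].map((n) => nextRestartDelay(worker, n)).join(", "));
// 1000, 2000, 4000, 8000, 10000
```

The fifth delay would be 16000 ms uncapped; `maxDelayMs` holds it at 10000 ms.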
199
+ ### `processes[].memory`
200
+
201
+ Monitors the resident memory (RSS) of the process and **gracefully restarts** it
202
+ (`SIGTERM`, then `SIGKILL` after `gracefulStopMs`) when it exceeds `limitBytes`.
203
+ RSS is measured across the whole managed process group (the spawned wrapper and
204
+ its children), not just the wrapper. Omit `memory` to disable monitoring. Memory
205
+ measurement uses `/proc` and is a no-op on platforms without it.
206
+
207
+ | Field | Type | Default | Description |
208
+ | --- | --- | --- | --- |
209
+ | `memory.limitBytes` | positive integer | **required** | RSS limit in bytes; exceeding it restarts the process. |
210
+ | `memory.warnBytes` | non-negative integer | `0` (off) | Log a `memory warning` once when RSS first crosses this threshold (set below `limitBytes`). |
211
+ | `memory.checkIntervalMs` | positive number | `5000` | How often to measure RSS. |
212
+
213
+ ```js
214
+ {id: "worker", policy: "companion", command: "…", memory: {limitBytes: 536870912, warnBytes: 402653184, checkIntervalMs: 5000}}
215
+ ```
216
+
217
+ A memory restart is reported in `status` (`memoryRestarts`, `lastMemoryRestartAt`,
218
+ current `rssBytes`) and recorded in the event history (a `process started` event
219
+ with `reason: "memory"`). `status` also reports `children` — the sampled process
220
+ tree, with each group member's `pid`, `command`, and `rssBytes`.
221
+
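The byte values in the example above are 512 MiB and 384 MiB. The sketch below writes them that way and shows how a sample could be classified against the two thresholds. `classifyRss` is a hypothetical helper; treating a sample equal to `warnBytes` as a warning is an assumption, and the once-only warning behavior is not modeled.

```js
const MiB = 1024 ** 2;

// Hypothetical classifier for one RSS sample against the configured thresholds.
function classifyRss(rssBytes, {limitBytes, warnBytes = 0}) {
  if (rssBytes > limitBytes) return "restart";
  if (warnBytes > 0 && rssBytes >= warnBytes) return "warn";
  return "ok";
}

const memory = {limitBytes: 512 * MiB, warnBytes: 384 * MiB, checkIntervalMs: 5000};
console.log(memory.limitBytes, memory.warnBytes); // 536870912 402653184
console.log(classifyRss(400 * MiB, memory)); // warn
console.log(classifyRss(600 * MiB, memory)); // restart
```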
74
222
  ### `processes[].health`
75
223
 
76
224
  Only the `proxied` process is health-checked (before traffic switches to a new
@@ -96,6 +244,7 @@ start with a clear error.
96
244
  | `{{releasePath}}` | The deploy's `--release-path`. |
97
245
  | `{{revision}}` | The deploy's `--revision` (falls back to the release id). |
98
246
  | `{{processId}}` | This process's `id`. |
247
+ | `{{replicaIndex}}`, `{{replicaCount}}` | This instance's zero-based replica index and the total replica count (`0` and `1` for a single process). |
99
248
  | `{{port}}` | The port allocated to this process. |
100
249
  | `{{ports.<id>}}` | The port allocated to another process. |
101
250
  | `{{proxy.host}}`, `{{proxy.port}}`, `{{proxy.upstreamHost}}` | The configured proxy bind host/port and upstream host. |
@@ -109,7 +258,8 @@ Rollbridge sets these in every managed process's environment (the process's own
109
258
  | Variable | Value |
110
259
  | --- | --- |
111
260
  | `ROLLBRIDGE_APPLICATION` | `application` |
112
- | `ROLLBRIDGE_PROCESS_ID` | This process's `id`. |
261
+ | `ROLLBRIDGE_PROCESS_ID` | This process's `id` (the base id, not the `#index` instance id). |
262
+ | `ROLLBRIDGE_REPLICA_INDEX`, `ROLLBRIDGE_REPLICA_COUNT` | This instance's zero-based replica index and total replica count (`0` and `1` for a single process). |
113
263
  | `ROLLBRIDGE_RELEASE_ID` | The release id. |
114
264
  | `ROLLBRIDGE_RELEASE_PATH` | The release path. |
115
265
  | `ROLLBRIDGE_REVISION` | The revision (or release id). |
@@ -125,4 +275,11 @@ Rollbridge sets these in every managed process's environment (the process's own
125
275
  - Process `id`s must be unique.
126
276
  - `port` must be a positive port number or an ascending `{from, to}` range.
127
277
  - `control.mode` must be an octal mode between `0` and `0o777`.
278
+ - `control.owner` and `control.group` must each be a non-negative integer id or a non-empty name (resolved at daemon start).
128
279
  - `outputLines` and `releaseRetention.keep` must be positive/non-negative integers; `health.startDelayMs` and `releaseRetention.maxAgeMs` must be non-negative numbers.
280
+ - `restart.maxRestarts` must be a non-negative integer (omit it for unlimited restarts); `restart.backoffFactor` must be a number ≥ 1; `restart.windowMs` and `restart.maxDelayMs` must be non-negative numbers.
281
+ - When `memory` is set, `memory.limitBytes` must be a positive integer, `memory.warnBytes` a non-negative integer, and `memory.checkIntervalMs` a positive number.
282
+ - `replicas` must be a positive integer; `replicas > 1` is allowed only on a `companion` process without a `port`. Process ids must not contain `#` (reserved for replica instance ids).
283
+ - `lifecycle.quietCommand`/`drainCommand`/`stopCommand` must be strings when set, and `lifecycle.drainTimeoutMs` a non-negative number; `lifecycle.drainCommand` requires a positive `lifecycle.drainTimeoutMs`. A `lifecycle.stopCommand` may not be combined with a custom `stopSignal` (the `stopCommand` runs instead of the signal, so the signal would be ignored).
284
+ - `nonBlockingDrain` must be a boolean, and is allowed only on a `companion` process.
285
+ - `statePath` must be a string when set.
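Several of the process rules above can be checked mechanically. The sketch below is a hypothetical validator covering only a few of them, not Rollbridge's own validation; it assumes a "custom" `stopSignal` means any value other than the default `"SIGTERM"`, and that `nonBlockingDrain: false` is acceptable on any process.

```js
// Hypothetical validator for a subset of the documented process rules.
function validateProcesses(processes) {
  const errors = [];
  const seen = new Set();
  for (const p of processes) {
    if (seen.has(p.id)) errors.push(`${p.id}: duplicate id`);
    seen.add(p.id);
    if (p.id.includes("#")) errors.push(`${p.id}: id must not contain "#"`);

    const replicas = p.replicas ?? 1;
    if (!Number.isInteger(replicas) || replicas < 1) {
      errors.push(`${p.id}: replicas must be a positive integer`);
    } else if (replicas > 1 && (p.policy !== "companion" || p.port !== undefined)) {
      errors.push(`${p.id}: replicas > 1 needs a companion without a port`);
    }

    if (p.nonBlockingDrain !== undefined && typeof p.nonBlockingDrain !== "boolean") {
      errors.push(`${p.id}: nonBlockingDrain must be a boolean`);
    } else if (p.nonBlockingDrain === true && p.policy !== "companion") {
      errors.push(`${p.id}: nonBlockingDrain is for companion processes only`);
    }

    const lifecycle = p.lifecycle ?? {};
    if (lifecycle.drainCommand !== undefined && !((lifecycle.drainTimeoutMs ?? 0) > 0)) {
      errors.push(`${p.id}: drainCommand requires a positive drainTimeoutMs`);
    }
    if (lifecycle.stopCommand !== undefined && (p.stopSignal ?? "SIGTERM") !== "SIGTERM") {
      errors.push(`${p.id}: stopCommand cannot be combined with a custom stopSignal`);
    }
  }
  return errors;
}

console.log(validateProcesses([
  {id: "web", policy: "proxied", port: 3000, replicas: 2},
  {id: "worker", policy: "companion", replicas: 4, nonBlockingDrain: true}
]).join("\n"));
// web: replicas > 1 needs a companion without a port
```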
@@ -0,0 +1,77 @@
1
+ # Logging
2
+
3
+ The Rollbridge daemon writes one structured JSON line per operational event
4
+ (deploys, traffic switches, process starts/exits, restarts, memory events, and
5
+ failed commands):
6
+
7
+ ```json
8
+ {"at":"2026-05-23T14:31:09.512Z","message":"traffic switched","data":{"previousReleaseId":"v3","releaseId":"v4"}}
9
+ ```
10
+
11
+ These lines go to the daemon's **stdout**; where that ends up depends on how the
12
+ daemon was started.
13
+
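Because each event is one JSON object per line, the log is easy to post-process. The sketch below parses such lines and picks out traffic switches; `parseLogLines` is a hypothetical helper, and it skips non-JSON lines on the assumption that a captured log may also contain plain-text stderr output.

```js
// Parse daemon log text: one JSON object per line, ignoring anything else.
function parseLogLines(text) {
  const events = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      events.push(JSON.parse(line));
    } catch {
      // Not JSON (for example stderr output); skip it.
    }
  }
  return events;
}

const sample = [
  '{"at":"2026-05-23T14:31:09.512Z","message":"traffic switched","data":{"previousReleaseId":"v3","releaseId":"v4"}}',
  "some plain-text stderr line"
].join("\n");

for (const event of parseLogLines(sample)) {
  if (event.message === "traffic switched") {
    console.log(`${event.at} ${event.data.previousReleaseId} -> ${event.data.releaseId}`);
  }
}
// 2026-05-23T14:31:09.512Z v3 -> v4
```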
14
+ ## Where logs go
15
+
16
+ | How the daemon runs | Destination |
17
+ | --- | --- |
18
+ | `rollbridge daemon` (foreground) | stdout — redirect it (`rollbridge daemon … >> /var/log/rollbridge/app.log 2>&1`) or let your service manager capture it. |
19
+ | systemd (`examples/rollbridge.service`) | the journal — `journalctl -u rollbridge`. journald rotates on its own. |
20
+ | `rollbridge ensure-daemon` / `rollbridge deploy --ensure-daemon` | the **daemon log file**: `--daemon-log-path <path>`, default `/tmp/rollbridge-<application>.log`. The detached daemon's stdout and stderr are appended there. |
21
+
22
+ Point `--daemon-log-path` at a path your rotation tooling manages, for example:
23
+
24
+ ```bash
25
+ rollbridge deploy --ensure-daemon \
26
+   --config /etc/rollbridge/rollbridge.js \
27
+   --daemon-log-path /var/log/rollbridge/app.log \
28
+   --release-path "$release_path"
29
+ ```
30
+
31
+ The daemon log file is the durable, append-only stream of the daemon's own
32
+ events. It is distinct from the two in-memory views:
33
+
34
+ - `rollbridge logs` — recent stdout/stderr of each **managed process** (your app),
35
+ bounded per process by `outputLines`.
36
+ - `rollbridge events` — the recent structured daemon event history (the most
37
+ recent 1000 events), the same events written to the log file.
38
+
39
+ Both are cleared when the daemon restarts; the log file persists.
40
+
41
+ ## Rotation
42
+
43
+ ### systemd / journald
44
+
45
+ When the daemon runs under systemd its logs are in the journal, which rotates
46
+ automatically. Bound journal disk use with `SystemMaxUse=` in
47
+ `/etc/systemd/journald.conf` (or a per-namespace drop-in). No logrotate config is
48
+ needed for the daemon itself.
49
+
50
+ ### The daemon log file (logrotate)
51
+
52
+ The detached daemon keeps the log file **open for its whole lifetime** (its
53
+ stdout/stderr file descriptors point at it). A plain `rename`-based rotation
54
+ would leave the daemon writing to the old, now-renamed inode while the new file
55
+ stays empty. Use logrotate's **`copytruncate`**, which copies the file and then
56
+ truncates it in place, keeping the daemon's open descriptor valid:
57
+
58
+ ```
59
+ /var/log/rollbridge/*.log {
60
+   daily
61
+   rotate 14
62
+   compress
63
+   missingok
64
+   notifempty
65
+   copytruncate
66
+ }
67
+ ```
68
+
69
+ `copytruncate` has a small race window — log lines written between the copy and
70
+ the truncate can be lost — which is acceptable for the daemon's low-volume,
71
+ milestone-level logging. Rollbridge does not reopen its log file on a signal, so
72
+ `copytruncate` (rather than `create` + a reopen signal) is the recommended
73
+ approach for the daemon log file.
74
+
75
+ Prefer running under systemd (journald) when you can; reach for `--daemon-log-path`
76
+ + logrotate when you run the daemon outside a service manager that captures
77
+ stdout.
package/docs/nginx.md ADDED
@@ -0,0 +1,104 @@
1
+ # Nginx guide
2
+
3
+ Nginx should always proxy to the **stable Rollbridge proxy port**
4
+ (`proxy.host:proxy.port`), never directly to a release process — release ports
5
+ are allocated per deploy and change. Rollbridge forwards both HTTP and WebSocket
6
+ traffic to the active release and drains old connections across deploys.
7
+
8
+ ## Server block
9
+
10
+ ```nginx
11
+ # Maps the Upgrade header so WebSocket requests get "Connection: upgrade" and
12
+ # normal requests get "Connection: close".
13
+ map $http_upgrade $connection_upgrade {
14
+   default upgrade;
15
+   '' close;
16
+ }
17
+
18
+ server {
19
+   listen 443 ssl;
20
+   server_name app.example.com;
21
+   # ssl_certificate / ssl_certificate_key ...
22
+
23
+   location / {
24
+     proxy_pass http://127.0.0.1:8182; # Rollbridge proxy.host:proxy.port
25
+
26
+     # WebSocket upgrade
27
+     proxy_http_version 1.1;
28
+     proxy_set_header Upgrade $http_upgrade;
29
+     proxy_set_header Connection $connection_upgrade;
30
+
31
+     # Pass the real client through to the app
32
+     proxy_set_header Host $host;
33
+     proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
34
+     proxy_set_header X-Forwarded-Proto $scheme;
35
+     proxy_set_header X-Real-IP $remote_addr;
36
+
37
+     # Long-lived connections (WebSocket/SSE); see "Timeouts" below
38
+     proxy_read_timeout 3600s;
39
+     proxy_send_timeout 3600s;
40
+   }
41
+ }
42
+ ```
43
+
44
+ The repository README shows a minimal version of this block; the additions here
45
+ matter for production.
46
+
47
+ ## WebSocket headers
48
+
49
+ Rollbridge's proxy has WebSocket support enabled, so the only requirement is that
50
+ Nginx forwards the upgrade handshake:
51
+
52
+ - `proxy_http_version 1.1` — WebSocket upgrades require HTTP/1.1 (the default is 1.0).
53
+ - `proxy_set_header Upgrade $http_upgrade;` and `proxy_set_header Connection $connection_upgrade;` — forward the upgrade. Using the `map` above is preferred over a hard-coded `Connection "upgrade"`, so non-WebSocket requests aren't forced into an upgrade.
54
+
55
+ If these are missing, WebSocket clients fail to connect (the handshake never
56
+ completes) while plain HTTP still works.
57
+
58
+ ## Timeouts
59
+
60
+ Nginx's `proxy_read_timeout`/`proxy_send_timeout` default to **60s**. An idle
61
+ WebSocket (or a slow streaming response) is closed once that elapses, so
62
+ long-lived connections silently drop after a minute unless you raise them — set
63
+ them on the relevant `location` (or globally) to a value above your longest idle
64
+ period.
65
+
66
+ Related Rollbridge timeouts (configured in `rollbridge.js`, not Nginx):
67
+
68
+ - `proxy.healthTimeoutMs` gates how long a new release has to become healthy
69
+ before a deploy aborts — it does not affect request timeouts.
70
+ - `proxy.drainTimeoutMs` is how long Rollbridge keeps an old release alive for
71
+ in-flight connections during a deploy. Keep Nginx's `proxy_read_timeout` for
72
+ WebSocket locations comfortably above it so the front end doesn't cut
73
+ connections Rollbridge is still draining.
74
+
75
+ ## Forwarded headers
76
+
77
+ Set `X-Forwarded-For`, `X-Forwarded-Proto`, and `Host` so the app behind
78
+ Rollbridge sees the real client and scheme. Rollbridge proxies with
79
+ `X-Forwarded-*` enabled, but it can only forward what Nginx provides — terminate
80
+ TLS at Nginx and pass `X-Forwarded-Proto $scheme` so the app knows the original
81
+ request was HTTPS.
82
+
83
+ For Server-Sent Events or other streamed responses, also disable response
84
+ buffering on that location so events flush immediately:
85
+
86
+ ```nginx
87
+ location /events {
88
+   proxy_pass http://127.0.0.1:8182;
88
+   proxy_http_version 1.1;
89
+   proxy_buffering off;
90
+   proxy_read_timeout 3600s;
92
+ }
93
+ ```
94
+
95
+ ## Common failure modes
96
+
97
+ | Symptom | Cause | Fix |
98
+ | --- | --- | --- |
99
+ | `502 Bad Gateway` | Rollbridge can't reach the active release's process (it crashed or is restarting); Rollbridge returns `Bad gateway` and Nginx relays it. | Check `rollbridge status` / `rollbridge logs --process <id>` (see [troubleshooting.md](troubleshooting.md)). The process auto-restarts on its port. |
100
+ | `503` / `No active release` | No release is active — before the first deploy, or after `rollbridge stop`. | Deploy a release (`rollbridge deploy`). |
101
+ | WebSocket drops after ~60s | `proxy_read_timeout` left at the 60s default. | Raise `proxy_read_timeout`/`proxy_send_timeout` on the WebSocket location. |
102
+ | WebSocket never connects (plain HTTP works) | Missing `proxy_http_version 1.1` and the `Upgrade`/`Connection` headers. | Add the WebSocket directives shown above. |
103
+ | `504 Gateway Timeout` | A slow response exceeded `proxy_read_timeout`. | Raise the timeout, or speed up the endpoint. |
104
+ | Connections cut mid-deploy | Nginx `proxy_read_timeout` shorter than `proxy.drainTimeoutMs`. | Raise the Nginx timeout above `proxy.drainTimeoutMs`. |
@@ -0,0 +1,53 @@
1
+ # Releasing (maintainers)
2
+
3
+ Rollbridge publishes **patch** releases from the default branch with:
4
+
5
+ ```bash
6
+ npm run release:patch
7
+ ```
8
+
9
+ That script (the `release-patch` package) owns the version bump, lockfile update,
10
+ default-branch commit, push, and `npm publish`. Don't run `npm version` yourself
11
+ first — let the script own the bump. Use this checklist around it.
12
+
13
+ The default branch is `master` for this repo; the checks below stay
14
+ branch-agnostic so they stay correct if that ever changes. Capture the name once
15
+ and reuse it (the commands below assume it is set):
16
+
17
+ ```bash
18
+ default_branch=$(git rev-parse --abbrev-ref origin/HEAD | sed 's@^origin/@@') # e.g. master
19
+ ```
20
+
21
+ ## Before releasing
22
+
23
+ - [ ] You're on the default branch and synced with it: `git switch "$default_branch"`,
24
+ then `git fetch && git status` shows it up to date, with a **clean working tree**.
25
+ - [ ] CI is green for that commit, and `npm run all-checks` passes locally
26
+ (typecheck, lint, and the full test suite).
27
+ - [ ] `README.md` and `docs/` reflect every user-visible change shipped since the
28
+ last release (config fields, CLI commands/flags, status/event output,
29
+ operational behavior).
30
+ - [ ] `TODO.md` checkboxes for the shipped work are updated.
31
+ - [ ] You can publish: `npm whoami` shows an account with publish rights to the
32
+ `rollbridge` package, and you can push to the default branch.
33
+
34
+ ## Release
35
+
36
+ ```bash
37
+ npm run release:patch
38
+ ```
39
+
40
+ The script bumps the patch version, updates `package-lock.json`, commits the bump
41
+ to the default branch, pushes it, and publishes the package to npm.
42
+
43
+ ## After releasing
44
+
45
+ - [ ] The new version is on the registry: `npm view rollbridge version` matches
46
+ the bumped `package.json` version.
47
+ - [ ] The version-bump commit reached the remote (not just local): `git fetch`,
48
+ then `git log --oneline -1 "origin/$default_branch"` shows the bump — a
49
+ failed or blocked push won't satisfy this.
50
+ - [ ] Your working tree is clean and still on the default branch.
51
+
52
+ `release:patch` only does patch releases — a minor or major version bump is a
53
+ manual decision and is not covered by this script.
@@ -0,0 +1,129 @@
1
+ # TensorBuzz production runbook
2
+
3
+ Operating the TensorBuzz backend under Rollbridge. The production config lives at
4
+ [`examples/tensorbuzz.com.js`](../examples/tensorbuzz.com.js); this runbook
5
+ assumes it is deployed to a stable path (`/etc/rollbridge/tensorbuzz.com.js`
6
+ below) and the daemon runs as a systemd service (see
7
+ [Running under systemd](../README.md#running-under-systemd)). For the general
8
+ Velocious topology and the worker recipe, see [`docs/velocious.md`](velocious.md).
9
+
10
+ ## Ports
11
+
12
+ | Port | Process | Notes |
13
+ | --- | --- | --- |
14
+ | `4500` | Rollbridge proxy | The stable public port. **Nginx proxies the backend host to `127.0.0.1:4500`** — never to a release's web port. |
15
+ | `7330` | `beacon` (`service`) | Fixed; the shared broker every release connects to. |
16
+ | `7331` | `background-jobs-main` (`service`) | Fixed; the job coordinator. |
17
+ | `14500`–`14599` | `web` (`proxied`) | One port per release, allocated per deploy; Rollbridge forwards `4500` here. |
18
+ | (none) | `background-jobs-worker` (`companion`) | A per-release worker; no listening port. |
19
+
20
+ Control socket: `/tmp/rollbridge-tensorbuzz.sock`.
21
+
22
+ ## Process topology
23
+
24
+ - **`beacon`** and **`background-jobs-main`** are `service`s: one daemon-wide
25
+ instance each, on their fixed ports, surviving deploys.
26
+ - **`background-jobs-worker`** is a `companion`: a fresh worker per release,
27
+ running that release's code, with `gracefulStopMs: 60000` so an in-flight job
28
+ finishes before `SIGKILL`.
29
+ - **`web`** is the one `proxied` process, health-checked at `/ping` before
30
+ traffic switches.
31
+
32
+ Each process waits for its dependencies with `wait-for-it` (`beacon` →
33
+ `background-jobs-main` → `worker`/`web`), so nothing starts talking to Beacon or
34
+ the job coordinator before they listen.
35
+
36
+ ## External services
37
+
38
+ Rollbridge manages **only the four processes above**. Everything else the
39
+ Velocious app depends on — the database and any other backing services — is
40
+ **provisioned and operated outside Rollbridge**: Rollbridge does not start, stop,
41
+ health-check, or know about them. Configure those connections through the app's
42
+ own environment/config. When such a dependency is down, the `web` process's
43
+ `/ping` health check is what gates a deploy (a release that can't reach its
44
+ database won't pass health and won't go live).
45
+
46
+ ## Deploying
47
+
48
+ Drive deploys through the CLI (see [`docs/deploy-recipes.md`](deploy-recipes.md)).
49
+ Run **backwards-compatible** migrations before switching traffic, because the old
50
+ and new releases overlap during the drain:
51
+
52
+ ```bash
53
+ release_path=/srv/tensorbuzz/releases/<timestamp> # prepared by your pipeline
54
+ (cd "$release_path/backend" && npx velocious db:migrate)
55
+
56
+ rollbridge deploy \
57
+   --ensure-daemon \
58
+   --config /etc/rollbridge/tensorbuzz.com.js \
59
+   --release-path "$release_path" \
60
+   --revision "$(git -C "$release_path/backend" rev-parse HEAD)"
61
+ ```
62
+
63
+ ### Deploy ordering
64
+
65
+ On `rollbridge deploy`, Rollbridge:
66
+
67
+ 1. starts any missing `service` (`beacon`, `background-jobs-main`);
68
+ 2. starts the new release's `background-jobs-worker`, then its `web` process, and
69
+ health-checks `web` on its `{{port}}`/`/ping`;
70
+ 3. switches new traffic to the new `web`;
71
+ 4. refreshes the services' restart templates to the new release;
72
+ 5. drains the previous release's connections, then stops its `web` and worker.
73
+
74
+ If the new release fails to start or health-check, **the previous release stays
75
+ active** and the command exits non-zero — so a failed deploy never takes the site
76
+ down.
77
+
78
+ ## Rollback
79
+
80
+ ```bash
81
+ rollbridge rollback --config /etc/rollbridge/tensorbuzz.com.js
82
+ # or a specific retained release:
83
+ rollbridge rollback --config /etc/rollbridge/tensorbuzz.com.js --release-id <id>
84
+ ```
85
+
86
+ Rollback re-runs the deploy flow on a retained release, health-checks it, and
87
+ switches traffic back. Constraints:
88
+
89
+ - **Migrations are not reverted.** Rollback only manages processes; if a release
90
+ bumped the schema, rolling code back requires that the old code still works
91
+ against the new schema — keep migrations backwards-compatible (the same rule as
92
+ deploys).
93
+ - The target release's on-disk directory must still exist (don't prune it from
94
+ disk before you might roll back to it).
95
+ - Only releases Rollbridge still retains (`releaseRetention`) can be targeted.
96
+
97
+ ## Day-to-day operations
98
+
99
+ ```bash
100
+ C=/etc/rollbridge/tensorbuzz.com.js
101
+
102
+ rollbridge status --config "$C" # active release, ports, per-process state
103
+ rollbridge logs --config "$C" --process web # recent stdout/stderr of a process
104
+ rollbridge events --config "$C" # deploys, switches, crashes, restarts
105
+ rollbridge doctor --config "$C" # pre-flight: socket, proxy port, state
106
+ rollbridge restart --config "$C" --process background-jobs-worker # bounce the worker
107
+ ```
108
+
109
+ Restarting `beacon` or `background-jobs-main` bounces a shared broker and briefly
110
+ disrupts everything that depends on it; prefer `deploy`/`rollback` for code
111
+ changes. See [`docs/troubleshooting.md`](troubleshooting.md) for health-check
112
+ failures, port conflicts, stale sockets, crash loops, and stuck draining
113
+ releases.
114
+
115
+ ## Crash recovery
116
+
117
+ Set [`statePath`](config.md#statepath) in the config to have the daemon persist
118
+ its state. After a daemon crash or reboot, `rollbridge doctor` reports any
119
+ **orphaned** processes still alive from the previous daemon. To clean them up
120
+ before restarting the daemon, run `rollbridge recover` (a dry run that lists
121
+ them), then `rollbridge recover --force` to stop them:
122
+
123
+ ```bash
124
+ rollbridge recover --config /etc/rollbridge/tensorbuzz.com.js # list leftovers
125
+ rollbridge recover --config /etc/rollbridge/tensorbuzz.com.js --force # stop them
126
+ ```
127
+
128
+ A machine reboot kills every process, so there are usually no orphans afterward —
129
+ the daemon just starts fresh.