rollbridge 0.1.2 → 0.1.5

package/docs/config.md ADDED
@@ -0,0 +1,148 @@
1
+ # Config reference
2
+
3
+ A Rollbridge config is a JavaScript module that `export default`s a config
4
+ object (or a sync/async function returning one). When `--config` is omitted,
5
+ the CLI loads `rollbridge.js` from the working directory. Run
6
+ `rollbridge validate` to check a config without starting the daemon.
7
+
8
+ ```js
9
+ // rollbridge.js
10
+ export default {
11
+ application: "ticket-server",
12
+ control: {path: "/tmp/rollbridge-ticket-server.sock"},
13
+ proxy: {host: "127.0.0.1", port: 8182},
14
+ processes: [
15
+ {id: "web", policy: "proxied", cwd: "{{releasePath}}", command: "npx velocious server --port {{port}}", port: {from: 18182, to: 18299}, health: {path: "/ping"}}
16
+ ]
17
+ }
18
+ ```
19
+
20
+ ## Top-level fields
21
+
22
+ | Field | Type | Default | Description |
23
+ | --- | --- | --- | --- |
24
+ | `application` | string | basename of the config file's directory | Names the app; used in the default control-socket path and the `ROLLBRIDGE_APPLICATION` env var. |
25
+ | `control` | object | — | Control-socket settings (see below). |
26
+ | `proxy` | object | **required** | Proxy listener and shared defaults (see below). |
27
+ | `processes` | array | **required** | Managed processes (see below). Exactly one must be `proxied`. |
28
+ | `releaseRetention` | object | — | How many stopped releases the daemon retains (see below). |
29
+
30
+ ## `control`
31
+
32
+ | Field | Type | Default | Description |
33
+ | --- | --- | --- | --- |
34
+ | `control.path` | string | `/tmp/rollbridge-<application>.sock` | Unix domain socket the CLI uses to talk to the daemon. |
35
+ | `control.mode` | octal string (e.g. `"660"`) or octal number (`0o660`) | unset | `chmod` applied to the socket after it binds, to share it with a deploy group. When unset, the daemon umask applies. |
36
+
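The two accepted forms of `control.mode` can be illustrated with a small normalizer. This is a minimal sketch, not Rollbridge's own code: `parseControlMode` is a hypothetical helper, and it assumes the string form contains octal digits only (as in `"660"`).

```javascript
// Hypothetical helper (not part of Rollbridge): turn `control.mode` into a
// numeric file mode, accepting "660" (octal string) or 0o660 (octal number).
function parseControlMode(mode) {
  if (mode === undefined) return undefined; // unset: the daemon umask applies

  let value;
  if (typeof mode === "string") {
    // Assumption: the string form holds octal digits only.
    if (!/^[0-7]+$/.test(mode)) {
      throw new Error(`control.mode must be an octal string, got "${mode}"`);
    }
    value = parseInt(mode, 8);
  } else if (Number.isInteger(mode)) {
    value = mode;
  } else {
    throw new Error("control.mode must be an octal string or number");
  }

  // The validation rules require a mode between 0 and 0o777.
  if (value < 0 || value > 0o777) {
    throw new Error("control.mode must be between 0 and 0o777");
  }
  return value;
}
```

Both `parseControlMode("660")` and `parseControlMode(0o660)` yield the same numeric mode, which is the value a `chmod` call would receive.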
37
+ ## `proxy`
38
+
39
+ | Field | Type | Default | Description |
40
+ | --- | --- | --- | --- |
41
+ | `proxy.host` | string | `"127.0.0.1"` | Interface the stable proxy binds. |
42
+ | `proxy.port` | number | `8182` | Stable port Nginx (or another front end) points at. |
43
+ | `proxy.upstreamHost` | string | `proxy.host`, or `"127.0.0.1"` when `proxy.host` is `0.0.0.0`/`::` | Host Rollbridge uses for release health checks and proxy targets. |
44
+ | `proxy.healthPath` | string | `"/ping"` | Default health-check path for proxied processes. |
45
+ | `proxy.healthTimeoutMs` | number | `30000` | Default health-check timeout for proxied processes. |
46
+ | `proxy.drainTimeoutMs` | number | `60000` | How long to drain open connections from a retired release before stopping it. |
47
+ | `proxy.forceStopTimeoutMs` | number | `10000` | Default per-process graceful-stop timeout (`SIGTERM`, then `SIGKILL`). |
48
+
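The `proxy.upstreamHost` default in the table can be read as the following rule. This is an illustrative sketch only; `resolveUpstreamHost` is a hypothetical name, and the real daemon may resolve the host differently.

```javascript
// Hypothetical helper: pick the host used for health checks and proxy targets.
function resolveUpstreamHost(proxy = {}) {
  if (proxy.upstreamHost) return proxy.upstreamHost; // explicit value wins

  const host = proxy.host ?? "127.0.0.1"; // documented default for proxy.host
  // A wildcard bind address is not a usable connect target, so fall back to
  // loopback as the table describes.
  return host === "0.0.0.0" || host === "::" ? "127.0.0.1" : host;
}
```

For example, binding the proxy to `0.0.0.0` still gives `127.0.0.1` as the upstream host unless `proxy.upstreamHost` is set.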
49
+ ## `releaseRetention`
50
+
51
+ | Field | Type | Default | Description |
52
+ | --- | --- | --- | --- |
53
+ | `releaseRetention.keep` | non-negative integer | `10` | Number of most-recent **stopped** releases the daemon keeps in memory and reports in `status`. |
54
+ | `releaseRetention.maxAgeMs` | non-negative number | `0` (disabled) | Also prune stopped releases older than this many milliseconds. |
55
+
56
+ Active and draining releases are never pruned. This governs Rollbridge's own
57
+ release records; the deploy tool still owns on-disk release directories.
58
+
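As a rough model of how `keep` and `maxAgeMs` interact, the sketch below prunes a list of release records. The record shape (`id`, `state`, `stoppedAt`) and the function name are assumptions made for illustration; they are not taken from Rollbridge's source.

```javascript
// Hypothetical sketch: decide which release records survive pruning.
// Assumed record shape: {id, state, stoppedAt} with stoppedAt in epoch ms.
function pruneStoppedReleases(releases, {keep = 10, maxAgeMs = 0} = {}, now = Date.now()) {
  // Active and draining releases are never pruned.
  const live = releases.filter((release) => release.state !== "stopped");

  const stopped = releases
    .filter((release) => release.state === "stopped")
    .sort((a, b) => b.stoppedAt - a.stoppedAt) // newest first
    .filter((release) => maxAgeMs === 0 || now - release.stoppedAt <= maxAgeMs)
    .slice(0, keep); // keep only the most recent `keep` records

  return [...live, ...stopped];
}
```

With `keep: 0` only active and draining releases remain; with `maxAgeMs: 0` the age limit is disabled and only `keep` applies.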
59
+ ## `processes[]`
60
+
61
+ | Field | Type | Default | Description |
62
+ | --- | --- | --- | --- |
63
+ | `id` | string | **required** | Unique identifier. Appears in `status`, logs, and `ROLLBRIDGE_*` env vars. |
64
+ | `policy` | `"proxied"` \| `"companion"` \| `"singleton"` \| `"service"` | `"companion"` | Lifecycle policy (see [README → Process Policies](../README.md#process-policies)). Exactly one process must be `proxied`. |
65
+ | `command` | string | **required** | Shell command to run (templated). |
66
+ | `cwd` | string | the release path | Working directory (templated). |
67
+ | `env` | object of string → string | `{}` | Extra environment variables (values templated). Merged over the injected `ROLLBRIDGE_*` vars. |
68
+ | `port` | number or `{from, to}` | unset | Port (or range) allocated per release. **Required for the `proxied` process.** A plain number `n` means the fixed port `n` (`{from: n, to: n}`). |
69
+ | `health` | object or `false` | enabled with defaults | Health check for the `proxied` process; set `false` to disable (see below). |
70
+ | `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | `SIGTERM`→`SIGKILL` window for this process. |
71
+ | `restartDelayMs` | number | `1000` | Base delay before restarting this process after a crash (the backoff base; see `restart`). |
72
+ | `restart` | object | unlimited restarts, constant delay | Automatic-restart policy: cap, rolling window, and backoff (see below). |
73
+ | `outputLines` | positive integer | `50` | Recent stdout/stderr lines retained per process and reported by `status`/`logs`. |
74
+
75
+ ### `processes[].restart`
76
+
77
+ Controls automatic restarts of a crashed process (a release's active processes
78
+ and daemon-wide `service`s). The base delay is the process's `restartDelayMs`;
79
+ when the policy's limit is reached the process is left `failed` and not
80
+ restarted again.
81
+
82
+ | Field | Type | Default | Description |
83
+ | --- | --- | --- | --- |
84
+ | `restart.maxRestarts` | non-negative integer | unset (unlimited) | Maximum automatic restarts allowed within `windowMs` before Rollbridge stops restarting the process. `0` disables automatic restarts entirely. |
85
+ | `restart.windowMs` | non-negative number | `0` (process lifetime) | Rolling window over which `maxRestarts` is counted and after which the backoff resets. `0` counts over the process's whole lifetime. |
86
+ | `restart.backoffFactor` | number ≥ 1 | `1` (constant) | Multiplier applied to `restartDelayMs` on each successive restart in the window: `delay = restartDelayMs × backoffFactor ^ n`. `1` keeps a constant delay. |
87
+ | `restart.maxDelayMs` | non-negative number | `0` (no cap) | Upper bound on the backed-off delay. `0` means no cap. |
88
+
89
+ With the defaults, a crashed process restarts indefinitely after `restartDelayMs`.
90
+ Pair `backoffFactor`/`windowMs` to back off and self-heal after a clean run, or
91
+ set `maxRestarts` to give up on a process stuck in a crash loop.
92
+
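The backoff formula and the `maxRestarts` cap can be worked through with a short sketch. The helper names are hypothetical, and the code assumes `n` is the number of restarts already made in the current window (so the first restart uses `n = 0`); the daemon's exact indexing is not stated in this reference.

```javascript
// Hypothetical sketch of delay = restartDelayMs × backoffFactor ^ n,
// capped by maxDelayMs when that is non-zero.
function nextRestartDelay({restartDelayMs = 1000, restart = {}} = {}, n = 0) {
  const {backoffFactor = 1, maxDelayMs = 0} = restart;
  const delay = restartDelayMs * backoffFactor ** n;
  return maxDelayMs > 0 ? Math.min(delay, maxDelayMs) : delay; // 0 means no cap
}

// Hypothetical sketch of the cap: unset maxRestarts means unlimited,
// and 0 disables automatic restarts entirely.
function canRestart({maxRestarts} = {}, restartsInWindow = 0) {
  return maxRestarts === undefined || restartsInWindow < maxRestarts;
}
```

With `restartDelayMs: 1000`, `backoffFactor: 2`, and `maxDelayMs: 5000`, successive delays under this model are 1000, 2000, 4000, then 5000 ms once the cap applies.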
93
+ ### `processes[].health`
94
+
95
+ Only the `proxied` process is health-checked (before traffic switches to a new
96
+ release). Set `health: false` to disable it.
97
+
98
+ | Field | Type | Default | Description |
99
+ | --- | --- | --- | --- |
100
+ | `health.path` | string | `proxy.healthPath` | HTTP path probed on the process's port. |
101
+ | `health.timeoutMs` | number | `proxy.healthTimeoutMs` | Total time to wait for the first healthy response. |
102
+ | `health.intervalMs` | number | `250` | Delay between probes. |
103
+ | `health.startDelayMs` | non-negative number | `0` | Wait this long after the process starts before the first probe (runs before the `timeoutMs` window). |
104
+
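To see how the per-process `health` settings inherit from `proxy`, the sketch below merges the two using the defaults listed in the tables. `resolveHealth` is a hypothetical helper written for this illustration, not an API Rollbridge exposes.

```javascript
// Hypothetical helper: compute the effective health-check settings for the
// proxied process, or null when health checks are disabled.
function resolveHealth(proxy = {}, health = {}) {
  if (health === false) return null; // `health: false` disables the check

  return {
    path: health.path ?? proxy.healthPath ?? "/ping",
    timeoutMs: health.timeoutMs ?? proxy.healthTimeoutMs ?? 30000,
    intervalMs: health.intervalMs ?? 250,
    startDelayMs: health.startDelayMs ?? 0
  };
}
```

A process that sets only `health: {path: "/up"}` would, under this model, still inherit `timeoutMs` from `proxy.healthTimeoutMs`.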
105
+ ## Template variables
106
+
107
+ `command`, `cwd`, and `env` values support `{{...}}` placeholders, rendered when
108
+ the process starts. Referencing a placeholder with no value fails the process
109
+ start with a clear error.
110
+
111
+ | Placeholder | Value |
112
+ | --- | --- |
113
+ | `{{application}}` | `application` |
114
+ | `{{releaseId}}` | The deploy's release id. |
115
+ | `{{releasePath}}` | The deploy's `--release-path`. |
116
+ | `{{revision}}` | The deploy's `--revision` (falls back to the release id). |
117
+ | `{{processId}}` | This process's `id`. |
118
+ | `{{port}}` | The port allocated to this process. |
119
+ | `{{ports.<id>}}` | The port allocated to another process. |
120
+ | `{{proxy.host}}`, `{{proxy.port}}`, `{{proxy.upstreamHost}}` | The configured proxy bind host/port and upstream host. |
121
+ | `{{env.<NAME>}}` | A variable from the daemon's own environment, e.g. `{{env.HOME}}`. |
122
+
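Placeholder rendering can be pictured with the sketch below. It is an illustration under assumptions: `renderTemplate` and the shape of its `context` argument are hypothetical, and the error text is made up for the example rather than copied from Rollbridge.

```javascript
// Hypothetical sketch: render {{...}} placeholders from a context object and
// fail when a placeholder has no value.
function renderTemplate(template, context) {
  return template.replace(/\{\{\s*([^{}]+?)\s*\}\}/g, (match, path) => {
    // Split on the first dot only, so `ports.background-jobs-main` looks up
    // the key "background-jobs-main" inside `context.ports`.
    const [head, ...rest] = path.split(".");
    const value = rest.length === 0 ? context[head] : context[head]?.[rest.join(".")];

    if (value === undefined || value === null) {
      throw new Error(`No value for placeholder {{${path}}}`);
    }
    return String(value);
  });
}

// Example context, with values borrowed from the examples in these docs.
const exampleContext = {
  releasePath: "/srv/ticket-server/releases/1",
  port: 18182,
  ports: {beacon: 7330, "background-jobs-main": 7331},
  env: {HOME: "/home/deploy"}
};
```

Calling `renderTemplate("npx velocious server --port {{port}}", exampleContext)` substitutes the allocated port, while a template that references `{{revision}}` throws because the example context has no such value.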
123
+ ## Injected environment variables
124
+
125
+ Rollbridge sets these in every managed process's environment (the process's own
126
+ `env` is merged on top and can override them):
127
+
128
+ | Variable | Value |
129
+ | --- | --- |
130
+ | `ROLLBRIDGE_APPLICATION` | `application` |
131
+ | `ROLLBRIDGE_PROCESS_ID` | This process's `id`. |
132
+ | `ROLLBRIDGE_RELEASE_ID` | The release id. |
133
+ | `ROLLBRIDGE_RELEASE_PATH` | The release path. |
134
+ | `ROLLBRIDGE_REVISION` | The revision (or release id). |
135
+ | `ROLLBRIDGE_PORT` | This process's allocated port (only when it has one). |
136
+ | `ROLLBRIDGE_<ID>_PORT` | Each process's allocated port, where `<ID>` is the process id uppercased with non-alphanumerics replaced by `_` (e.g. `background-jobs-main` → `ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT`). |
137
+
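The `ROLLBRIDGE_<ID>_PORT` naming rule is easy to reproduce when a process wants to look up a sibling's port. A minimal sketch, assuming each non-alphanumeric character maps to its own `_` (the table does not say whether runs are collapsed); `portEnvName` is a hypothetical helper.

```javascript
// Hypothetical helper: derive the injected port variable name for a process id.
function portEnvName(processId) {
  // Uppercase first, then replace every non-alphanumeric character with "_".
  const id = processId.toUpperCase().replace(/[^A-Z0-9]/g, "_");
  return `ROLLBRIDGE_${id}_PORT`;
}
```

Inside a managed process, `process.env[portEnvName("background-jobs-main")]` would then read `ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT`.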
138
+ ## Validation rules
139
+
140
+ `rollbridge validate` reports all of these at once with an example fix:
141
+
142
+ - `application` falls back to its default when omitted; `proxy` and `processes` must be present and well-typed.
143
+ - Exactly one process must be `proxied`, and the `proxied` process must define a `port`.
144
+ - Process `id`s must be unique.
145
+ - `port` must be a positive port number or an ascending `{from, to}` range.
146
+ - `control.mode` must be an octal mode between `0` and `0o777`.
147
+ - `outputLines` must be a positive integer and `releaseRetention.keep` a non-negative integer; `health.startDelayMs` and `releaseRetention.maxAgeMs` must be non-negative numbers.
148
+ - `restart.maxRestarts` must be a non-negative integer (omit it for unlimited restarts); `restart.backoffFactor` must be a number ≥ 1; `restart.windowMs` and `restart.maxDelayMs` must be non-negative numbers.
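The `port` rule above (a positive port number or an ascending `{from, to}` range) can be sketched as a normalizer. `normalizePort` is a hypothetical helper; it assumes `from === to` is allowed, since a plain number `n` is documented as `{from: n, to: n}`, and it applies the usual TCP upper bound of 65535.

```javascript
// Hypothetical sketch: normalize `port` to a {from, to} range and validate it.
function normalizePort(port) {
  const range = typeof port === "number" ? {from: port, to: port} : port;
  const isPort = (value) => Number.isInteger(value) && value >= 1 && value <= 65535;

  if (!range || !isPort(range.from) || !isPort(range.to) || range.from > range.to) {
    throw new Error("port must be a positive port number or an ascending {from, to} range");
  }
  return {from: range.from, to: range.to};
}
```

`normalizePort(7330)` and `normalizePort({from: 7330, to: 7330})` produce the same range, while `{from: 18299, to: 18182}` is rejected as descending.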
@@ -0,0 +1,102 @@
1
+ # Deploy-tool recipes
2
+
3
+ Rollbridge is deploy-tool agnostic: it ships no plugins or tasks for any deploy
4
+ tool. Whatever you use — a shell script, CI, or Capistrano — drives Rollbridge
5
+ by **calling its CLI** (see [`cli.md`](cli.md)). The daemon is long-lived;
6
+ deploys just hand it a prepared release path.
7
+
8
+ The deploy contract is the same everywhere:
9
+
10
+ 1. Prepare the release directory (checkout, install dependencies, build assets).
11
+ 2. Run **backwards-compatible** migrations *before* switching traffic (the old
12
+ and new web releases overlap during the drain).
13
+ 3. Run `rollbridge deploy` — it starts the new release, health-checks the
14
+ proxied process, switches traffic, then drains and stops the old release.
15
+ It exits non-zero (leaving the previous release active) if the new release
16
+ fails to start or health-check, so your script should stop on a failed
17
+ deploy.
18
+
19
+ Point `--config` at a stable, daemon-wide config file; release paths are passed
20
+ per deploy. `rollbridge deploy --ensure-daemon` starts the daemon first if it
21
+ isn't already running, so the recipes below work whether or not the daemon is
22
+ already managed by systemd.
23
+
24
+ ## Shell script
25
+
26
+ ```bash
27
+ #!/usr/bin/env bash
28
+ set -euo pipefail
29
+
30
+ app_dir=/srv/ticket-server
31
+ config=/etc/rollbridge/rollbridge.js
32
+ # Read the revision from the source repo (not the script's cwd, which may not be
33
+ # a checkout under cron/systemd/CI).
34
+ revision="$(git -C "$app_dir/repo" rev-parse HEAD)"
35
+ release_path="$app_dir/releases/$(date -u +%Y%m%d%H%M%S)-$revision"
36
+
37
+ # 1. Prepare the release.
38
+ git clone --depth 1 "file://$app_dir/repo" "$release_path"
39
+ (cd "$release_path" && npm ci && npm run build)
40
+
41
+ # 2. Run backwards-compatible migrations before switching traffic.
42
+ (cd "$release_path" && npx velocious db:migrate)
43
+
44
+ # 3. Switch traffic (and start the daemon if needed). A non-zero exit here means
45
+ # the new release failed health checks and the previous one is still active;
46
+ # `set -e` aborts the script so the bad release is not promoted.
47
+ rollbridge deploy \
48
+ --ensure-daemon \
49
+ --config "$config" \
50
+ --release-path "$release_path" \
51
+ --revision "$revision"
52
+ ```
53
+
54
+ ## CI
55
+
56
+ In CI, build/test the release, then run the same `rollbridge deploy` over SSH
57
+ on the target host (CI rarely runs the long-lived daemon itself):
58
+
59
+ ```bash
60
+ # after the build/test job has produced a release at $RELEASE_PATH on the host
61
+ ssh deploy@app.example.com \
62
+ "rollbridge deploy --ensure-daemon \
63
+ --config /etc/rollbridge/rollbridge.js \
64
+ --release-path '$RELEASE_PATH' \
65
+ --revision '$GIT_SHA'"
66
+ ```
67
+
68
+ `rollbridge deploy` exits non-zero on a failed health check, which fails the CI
69
+ step — no extra gating needed. Use `rollbridge validate --json` / `rollbridge
70
+ doctor --json` earlier in the pipeline if you want to fail fast before building.
71
+
72
+ ## Capistrano
73
+
74
+ Rollbridge ships **no Capistrano plugin or tasks** — you only run its CLI as a
75
+ shell command from your own `deploy.rb`. Capistrano already uploads the release
76
+ to `release_path`, so the deploy step is a single `execute` of the CLI:
77
+
78
+ ```ruby
79
+ # config/deploy.rb — just a shell command; no Rollbridge-specific Capistrano code.
80
+ after "deploy:publishing", "rollbridge:deploy"
81
+
82
+ namespace :rollbridge do
83
+ task :deploy do
84
+ on roles(:app) do
85
+ within release_path do
86
+ execute :npx, "velocious", "db:migrate"
87
+ end
88
+ execute "rollbridge", "deploy",
89
+ "--ensure-daemon",
90
+ "--config", "/etc/rollbridge/rollbridge.js",
91
+ "--release-path", release_path,
92
+ "--revision", fetch(:current_revision)
93
+ end
94
+ end
95
+ end
96
+ ```
97
+
98
+ `execute` runs the command over SSH and raises if it exits non-zero, so a failed
99
+ Rollbridge health check fails the Capistrano deploy. Keep Capistrano's own
100
+ `linked_dirs`/`keep_releases` for on-disk release directories; Rollbridge only
101
+ manages the running processes and its own in-memory release records (see
102
+ `releaseRetention`).
package/docs/nginx.md ADDED
@@ -0,0 +1,104 @@
1
+ # Nginx guide
2
+
3
+ Nginx should always proxy to the **stable Rollbridge proxy port**
4
+ (`proxy.host:proxy.port`), never directly to a release process — release ports
5
+ are allocated per deploy and change. Rollbridge forwards both HTTP and WebSocket
6
+ traffic to the active release and drains old connections across deploys.
7
+
8
+ ## Server block
9
+
10
+ ```nginx
11
+ # Maps the Upgrade header so WebSocket requests get "Connection: upgrade" and
12
+ # normal requests get "Connection: close".
13
+ map $http_upgrade $connection_upgrade {
14
+ default upgrade;
15
+ '' close;
16
+ }
17
+
18
+ server {
19
+ listen 443 ssl;
20
+ server_name app.example.com;
21
+ # ssl_certificate / ssl_certificate_key ...
22
+
23
+ location / {
24
+ proxy_pass http://127.0.0.1:8182; # Rollbridge proxy.host:proxy.port
25
+
26
+ # WebSocket upgrade
27
+ proxy_http_version 1.1;
28
+ proxy_set_header Upgrade $http_upgrade;
29
+ proxy_set_header Connection $connection_upgrade;
30
+
31
+ # Pass the real client through to the app
32
+ proxy_set_header Host $host;
33
+ proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
34
+ proxy_set_header X-Forwarded-Proto $scheme;
35
+ proxy_set_header X-Real-IP $remote_addr;
36
+
37
+ # Long-lived connections (WebSocket/SSE) — see "Timeouts" below
38
+ proxy_read_timeout 3600s;
39
+ proxy_send_timeout 3600s;
40
+ }
41
+ }
42
+ ```
43
+
44
+ The repository README shows a minimal version of this block; the additions here
45
+ matter for production.
46
+
47
+ ## WebSocket headers
48
+
49
+ Rollbridge's proxy has WebSocket support enabled, so the only requirement is that
50
+ Nginx forwards the upgrade handshake:
51
+
52
+ - `proxy_http_version 1.1` — WebSocket upgrades require HTTP/1.1 (the default is 1.0).
53
+ - `proxy_set_header Upgrade $http_upgrade;` and `proxy_set_header Connection $connection_upgrade;` — forward the upgrade. Using the `map` above is preferred over a hard-coded `Connection "upgrade"`, so non-WebSocket requests aren't forced into an upgrade.
54
+
55
+ If these are missing, WebSocket clients fail to connect (the handshake never
56
+ completes) while plain HTTP still works.
57
+
58
+ ## Timeouts
59
+
60
+ Nginx's `proxy_read_timeout`/`proxy_send_timeout` default to **60s**. An idle
61
+ WebSocket (or a slow streaming response) is closed once that elapses, so
62
+ long-lived connections silently drop after a minute unless you raise them — set
63
+ them on the relevant `location` (or globally) to a value above your longest idle
64
+ period.
65
+
66
+ Related Rollbridge timeouts (configured in `rollbridge.js`, not Nginx):
67
+
68
+ - `proxy.healthTimeoutMs` gates how long a new release has to become healthy
69
+ before a deploy aborts — it does not affect request timeouts.
70
+ - `proxy.drainTimeoutMs` is how long Rollbridge keeps an old release alive for
71
+ in-flight connections during a deploy. Keep Nginx's `proxy_read_timeout` for
72
+ WebSocket locations comfortably above it so the front end doesn't cut
73
+ connections Rollbridge is still draining.
74
+
75
+ ## Forwarded headers
76
+
77
+ Set `X-Forwarded-For`, `X-Forwarded-Proto`, and `Host` so the app behind
78
+ Rollbridge sees the real client and scheme. Rollbridge proxies with
79
+ `X-Forwarded-*` enabled, but it can only forward what Nginx provides — terminate
80
+ TLS at Nginx and pass `X-Forwarded-Proto $scheme` so the app knows the original
81
+ request was HTTPS.
82
+
83
+ For Server-Sent Events or other streamed responses, also disable response
84
+ buffering on that location so events flush immediately:
85
+
86
+ ```nginx
87
+ location /events {
88
+ proxy_pass http://127.0.0.1:8182;
89
+ proxy_http_version 1.1;
90
+ proxy_buffering off;
91
+ proxy_read_timeout 3600s;
92
+ }
93
+ ```
94
+
95
+ ## Common failure modes
96
+
97
+ | Symptom | Cause | Fix |
98
+ | --- | --- | --- |
99
+ | `502 Bad Gateway` | Rollbridge can't reach the active release's process (it crashed or is restarting); Rollbridge returns `Bad gateway` and Nginx relays it. | Check `rollbridge status` / `rollbridge logs --process <id>` (see [troubleshooting.md](troubleshooting.md)). The process auto-restarts on its port. |
100
+ | `503` / `No active release` | No release is active — before the first deploy, or after `rollbridge stop`. | Deploy a release (`rollbridge deploy`). |
101
+ | WebSocket drops after ~60s | `proxy_read_timeout` left at the 60s default. | Raise `proxy_read_timeout`/`proxy_send_timeout` on the WebSocket location. |
102
+ | WebSocket never connects (plain HTTP works) | Missing `proxy_http_version 1.1` and the `Upgrade`/`Connection` headers. | Add the WebSocket directives shown above. |
103
+ | `504 Gateway Timeout` | A slow response exceeded `proxy_read_timeout`. | Raise the timeout, or speed up the endpoint. |
104
+ | Connections cut mid-deploy | Nginx `proxy_read_timeout` shorter than `proxy.drainTimeoutMs`. | Raise the Nginx timeout above `proxy.drainTimeoutMs`. |
@@ -0,0 +1,102 @@
1
+ # Troubleshooting
2
+
3
+ Start with these three commands — they diagnose most problems without guessing:
4
+
5
+ - `rollbridge validate` — config errors, with an example fix for each.
6
+ - `rollbridge doctor` — control socket reachability, socket-directory writability, and proxy-port availability before the daemon starts.
7
+ - `rollbridge status` / `rollbridge logs` — live release/process state, restart counts, exit codes, connection counts, and recent process output.
8
+
9
+ For scripting, `validate`, `doctor`, and `logs` accept a `--json` flag, and
10
+ `status` already prints JSON — so every command's output is easy to parse.
11
+
12
+ ## Health-check failures
13
+
14
+ **Symptom.** `rollbridge deploy` exits non-zero with:
15
+
16
+ ```
17
+ Health check failed for http://127.0.0.1:18182/ping: HTTP 503
18
+ ```
19
+
20
+ (the reason is `HTTP <status>` or a connection error such as `ECONNREFUSED`). The
21
+ new release never went live; the previous release stays active.
22
+
23
+ **Diagnose.** The new release's `proxied` process didn't return a healthy
24
+ response in time. Check its output with `rollbridge logs --process <id>` and its
25
+ state/`exitCode` with `rollbridge status`. Common causes: the app doesn't listen
26
+ on the templated `{{port}}`, the `health.path` returns a non-2xx status, or the
27
+ app boots slower than `health.timeoutMs`.
28
+
29
+ **Fix.** Make the proxied command bind `{{port}}` and serve `health.path` with a
30
+ 2xx status. For slow boots, raise `health.timeoutMs` or set `health.startDelayMs`
31
+ so probing begins after the app is up.
32
+
33
+ ## Port conflicts / exhausted ranges
34
+
35
+ **Symptom.** A deploy fails with:
36
+
37
+ ```
38
+ No available ports in range 18182-18299 (118 ports on 127.0.0.1): 0 reserved by this deploy, 118 already in use. Widen the port range, free a port, or check bind permissions.
39
+ ```
40
+
41
+ **Diagnose.** The counts tell you which case it is:
42
+
43
+ - **reserved by this deploy** high → the range is too small for the processes that share it.
44
+ - **already in use** → another process (or an old release that has not finished draining) holds the ports.
45
+ - **could not be bound (e.g. EACCES)** → permission problem, e.g. a privileged (`<1024`) port.
46
+
47
+ `rollbridge doctor` reports whether the configured `proxy.port` is bindable.
48
+
49
+ **Fix.** Widen the process's `port` range, free the conflicting port (`ss -ltnp`
50
+ or `lsof -i :<port>` to find the holder), or avoid privileged ports / grant the
51
+ needed capability.
52
+
53
+ ## Stale or busy control socket
54
+
55
+ **Symptom.** `rollbridge daemon` (or `ensure-daemon`) errors with one of:
56
+
57
+ ```
58
+ A Rollbridge daemon for application "ticket-server" is already running on /tmp/rollbridge-ticket-server.sock (active release: v3). Run "rollbridge status" to inspect it or "rollbridge shutdown" to stop it, or set a different control.path.
59
+ The control socket /tmp/rollbridge-ticket-server.sock is already in use by another process. Stop that process or set a different control.path.
60
+ ```
61
+
62
+ **Diagnose.** Run `rollbridge status` (does a daemon answer?) and `rollbridge
63
+ doctor` (control-socket check). A leftover socket *file* with no live daemon
64
+ behind it is removed automatically the next time the daemon starts — no action
65
+ needed.
66
+
67
+ **Fix.** If a Rollbridge daemon is already running, use it, or
68
+ `rollbridge shutdown` before starting another. If a non-Rollbridge process owns
69
+ the path, stop it or point `control.path` somewhere else.
70
+
71
+ ## Crash loops
72
+
73
+ **Symptom.** `rollbridge status` shows a process with a climbing `restarts`
74
+ count and a `state` that flips between `running` and `failed`, with repeated
75
+ `process started` / `process exited` log lines.
76
+
77
+ **Diagnose.** `rollbridge logs --process <id>` shows the crash output;
78
+ `rollbridge status` shows `exitCode`, `exitSignal`, `restarts`, and `uptimeMs`
79
+ (a tiny `uptimeMs` that keeps resetting is a fast crash loop). Crashed
80
+ active-release and `service` processes auto-restart after `restartDelayMs`.
81
+
82
+ **Fix.** Correct the command, environment, or dependency that makes the process
83
+ exit; raise `restartDelayMs` to slow a tight loop. Note that a release which
84
+ fails its health check never receives traffic, so a crash-looping proxied
85
+ process in a *failed* deploy does not take the site down — the previous release
86
+ stays active.
87
+
88
+ ## Stuck draining releases
89
+
90
+ **Symptom.** Long after a deploy, `rollbridge status` still shows an old release
91
+ in `state: "draining"` with non-zero `connections` (often `websocket`).
92
+
93
+ **Diagnose.** Long-lived connections (WebSockets, SSE, streaming responses) keep
94
+ the retired release alive until they close or `proxy.drainTimeoutMs` elapses.
95
+ `status` shows the release's `connections.http`/`connections.websocket` and
96
+ `drainStartedAt`.
97
+
98
+ **Fix.** Draining ends automatically when those connections close, or after
99
+ `proxy.drainTimeoutMs` (then the release is stopped regardless). Lower
100
+ `proxy.drainTimeoutMs` to force-stop sooner, or make clients reconnect (for
101
+ example, have the front end close idle WebSockets on deploy). Once stopped, the
102
+ release is pruned per `releaseRetention`.
@@ -0,0 +1,200 @@
1
+ # Velocious deployment guide
2
+
3
+ A Velocious backend typically runs four kinds of process: **Beacon** (the
4
+ message broker other processes connect to), **background-jobs-main** (the job
5
+ coordinator), **background-jobs-worker** (runs the jobs), and the **web/API**
6
+ server. This guide maps each to a Rollbridge process policy, shows a complete
7
+ `rollbridge.js`, and explains startup ordering and what happens on a deploy.
8
+
9
+ A production version of this config lives at
10
+ [`examples/tensorbuzz.com.js`](../examples/tensorbuzz.com.js).
11
+
12
+ ## Process mapping
13
+
14
+ | Velocious process | Policy | Why |
15
+ | --- | --- | --- |
16
+ | `beacon` | `service` | A shared broker the other processes connect to. It should survive deploys and keep a **stable port**, so workers and the web process always reach the same Beacon. |
17
+ | `background-jobs-main` | `service` (or `singleton`) | The job coordinator. Run it as a `service` when it should outlive releases on a stable port; run it as a `singleton` when it must run the latest release's code after every deploy (see [Choosing the jobs-main policy](#choosing-the-jobs-main-policy)). |
18
+ | `background-jobs-worker` | `companion` | Release-scoped: one set of workers per active release, started before the web process and running that release's code. |
19
+ | `web` | `proxied` | Receives external HTTP/WebSocket traffic, is health-checked before traffic switches, and is drained on the next deploy. Exactly one process is `proxied`. |
20
+
21
+ See [README → Process Policies](../README.md#process-policies) for the full
22
+ semantics of each policy and [`docs/config.md`](config.md) for every field.
23
+
24
+ ## Example `rollbridge.js`
25
+
26
+ ```js
27
+ // rollbridge.js
28
+ export default {
29
+ application: "tensorbuzz",
30
+ control: {path: "/tmp/rollbridge-tensorbuzz.sock"},
31
+
32
+ proxy: {
33
+ host: "127.0.0.1",
34
+ port: 4500, // the stable port Nginx points at
35
+ healthPath: "/ping",
36
+ healthTimeoutMs: 30000,
37
+ drainTimeoutMs: 60000,
38
+ forceStopTimeoutMs: 10000
39
+ },
40
+
41
+ processes: [
42
+ // Shared broker — one daemon-wide instance on a stable port.
43
+ {
44
+ id: "beacon",
45
+ policy: "service",
46
+ cwd: "{{releasePath}}/backend",
47
+ env: {NODE_ENV: "production", VELOCIOUS_BEACON_PORT: "{{port}}"},
48
+ command: "npx velocious beacon",
49
+ port: 7330
50
+ },
51
+
52
+ // Job coordinator — waits for Beacon, stable port other jobs processes use.
53
+ {
54
+ id: "background-jobs-main",
55
+ policy: "service",
56
+ cwd: "{{releasePath}}/backend",
57
+ env: {
58
+ NODE_ENV: "production",
59
+ VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
60
+ VELOCIOUS_BACKGROUND_JOBS_PORT: "{{port}}"
61
+ },
62
+ command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- npx velocious background-jobs-main",
63
+ port: 7331
64
+ },
65
+
66
+ // Workers — one set per release; raise gracefulStopMs to let in-flight
67
+ // jobs finish during a deploy.
68
+ {
69
+ id: "background-jobs-worker",
70
+ policy: "companion",
71
+ cwd: "{{releasePath}}/backend",
72
+ env: {
73
+ NODE_ENV: "production",
74
+ VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
75
+ VELOCIOUS_BACKGROUND_JOBS_PORT: "{{ports.background-jobs-main}}"
76
+ },
77
+ command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- wait-for-it 127.0.0.1:{{ports.background-jobs-main}} --strict -- npx velocious background-jobs-worker",
78
+ gracefulStopMs: 60000
79
+ },
80
+
81
+ // Web/API — the one proxied process.
82
+ {
83
+ id: "web",
84
+ policy: "proxied",
85
+ cwd: "{{releasePath}}/backend",
86
+ env: {
87
+ NODE_ENV: "production",
88
+ VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
89
+ VELOCIOUS_BACKGROUND_JOBS_PORT: "{{ports.background-jobs-main}}"
90
+ },
91
+ command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- wait-for-it 127.0.0.1:{{ports.background-jobs-main}} --strict -- npx velocious server --host 127.0.0.1 --port {{port}}",
92
+ port: {from: 14500, to: 14599},
93
+ health: {path: "/ping", timeoutMs: 30000, intervalMs: 500}
94
+ }
95
+ ]
96
+ }
97
+ ```
98
+
99
+ ## Wiring processes together
100
+
101
+ Beacon and `background-jobs-main` get **fixed** ports (`7330`, `7331`) because
102
+ they are `service`s — a stable port lets every release's workers and web process
103
+ find them. The proxied `web` process gets a **range** (`{from: 14500, to:
104
+ 14599}`); Rollbridge allocates a free port per release so the old and new web
105
+ releases can run side by side during the drain.
106
+
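Why a range lets two web releases overlap can be shown with a toy allocator. This is only a sketch under assumptions: `allocatePort` is a hypothetical function that picks the lowest port not already taken, whereas the real daemon's selection order and bind checks are not described here.

```javascript
// Hypothetical sketch: pick the lowest port in the range that is not in use.
function allocatePort({from, to}, inUse) {
  for (let port = from; port <= to; port++) {
    if (!inUse.has(port)) return port;
  }
  // Illustrative message only; Rollbridge's own error text differs.
  throw new Error(`no free port in ${from}-${to}`);
}

// The old release still holds 14500 while it drains, so the new release
// gets the next free port in the same range.
const portsInUse = new Set([14500]);
const newReleasePort = allocatePort({from: 14500, to: 14599}, portsInUse);
```

A fixed port such as `7330` is the degenerate case `{from: 7330, to: 7330}`: only one instance can hold it at a time, which suits a `service` that outlives releases.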
107
+ Cross-reference ports with `{{ports.<id>}}` and pass them to Velocious through
108
+ `env`. Rollbridge also injects `ROLLBRIDGE_<ID>_PORT` for every process (e.g.
109
+ `ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT`), so you can read ports from the
110
+ environment instead of templating if you prefer — see
111
+ [`docs/config.md`](config.md#injected-environment-variables).
112
+
113
+ ### Startup ordering
114
+
115
+ Only the `proxied` process is health-checked, so dependent processes must wait
116
+ for their dependencies themselves. Two mechanisms combine:
117
+
118
+ 1. **Policy ordering.** On each deploy Rollbridge starts `service`s first, then
119
+ the release's `companion`s, then the `proxied` process (see
120
+ [README → Deploy ordering](../README.md#deploy-ordering)).
121
+ 2. **Readiness gating.** `wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- …`
122
+ blocks the command until Beacon's port accepts connections, so
123
+ `background-jobs-main`, the worker, and `web` don't start talking to Beacon
124
+ before it is listening. `wait-for-it` is a small standalone script (install it
125
+ on the host); any equivalent port-wait works.
126
+
127
+ ## Deploying
128
+
129
+ Drive deploys through the Rollbridge CLI — Rollbridge ships no deploy-tool
130
+ plugins (see [`docs/deploy-recipes.md`](deploy-recipes.md) for shell/CI/Capistrano
131
+ recipes). The minimal step after a release directory is prepared:
132
+
133
+ ```bash
134
+ release_path=/srv/tensorbuzz/releases/20260523120000 # prepared by your pipeline
135
+
136
+ # Run backwards-compatible migrations BEFORE switching traffic: the old and new
137
+ # web releases overlap during the drain.
138
+ (cd "$release_path/backend" && npx velocious db:migrate)
139
+
140
+ rollbridge deploy \
141
+ --ensure-daemon \
142
+ --config /etc/rollbridge/rollbridge.js \
143
+ --release-path "$release_path" \
144
+ --revision "$(git -C "$release_path/backend" rev-parse HEAD)"
145
+ ```
146
+
147
+ `rollbridge deploy` starts the new release's worker and web process,
148
+ health-checks `web` on its `{{port}}`/`/ping`, switches traffic, then drains and
149
+ stops the previous release. It exits non-zero (leaving the previous release
150
+ active) if the new release fails to start or health-check, so a failed deploy
151
+ never promotes a broken release.
152
+
153
+ ## Background jobs across a deploy
154
+
155
+ The worker is a `companion`, so each release runs its own workers:
156
+
157
+ - On deploy, the **new** release's workers start (running the new code) before
158
+ traffic switches; the **old** release's workers are stopped when that release
159
+ is drained and retired — `SIGTERM`, then `SIGKILL` after `gracefulStopMs`.
160
+ - Set `gracefulStopMs` on the worker to at least your longest in-flight job so a
161
+ job gets time to finish on `SIGTERM` before the forced kill. The example uses
162
+ `60000` (60s).
163
+
164
+ > **Planned:** graceful job-worker draining via lifecycle hooks
165
+ > (`quietCommand`/`drainCommand`/`stopCommand` and a non-blocking drain mode so
166
+ > new workers start while old workers finish) is on the
167
+ > [roadmap](../TODO.md#major-features) and not yet implemented. Until then, the
168
+ > `gracefulStopMs` window above is the mechanism for letting in-flight jobs
169
+ > finish.
170
+
171
+ ### Choosing the jobs-main policy
172
+
173
+ `background-jobs-main` is duplicate-unsafe (you never want two coordinators), so
174
+ it is either a `service` or a `singleton` — never a `companion`:
175
+
176
+ - **`service`** — keeps running across deploys on its stable port. Workers from
177
+ every release talk to the same coordinator, so there's no coordination gap on
178
+ deploy. The trade-off: a `service` keeps running the **release it was started
179
+ from** and only adopts the latest release's template if it crashes and
180
+ restarts (or the daemon restarts). If `background-jobs-main` itself needs the
181
+ newest code immediately after every deploy, this is the wrong policy.
182
+ - **`singleton`** — Rollbridge stops the old instance and then starts the new
183
+ one on each deploy, so it always runs the latest release's code and two copies
184
+ never overlap. The trade-off: a brief coordination gap while it restarts.
185
+
186
+ Beacon is a broker rather than code that changes per release, so `service` is
187
+ almost always right for it.
188
+
189
+ ## Verifying
190
+
191
+ After a deploy, `rollbridge status` should show `beacon` and
192
+ `background-jobs-main` as long-lived `service`s with unchanged ports across
193
+ deploys, one `background-jobs-worker` for the active release, and the `web`
194
+ process `proxied` with its connection counts. Use
195
+ [`rollbridge logs --process <id>`](cli.md) to read recent output from any
196
+ process, and [`docs/troubleshooting.md`](troubleshooting.md) for health-check,
197
+ port, and draining problems.
198
+
199
+ For the front end, point Nginx at the stable `proxy.port` (here `4500`), never at
200
+ a release's web port — see [`docs/nginx.md`](nginx.md).