rollbridge 0.1.4 → 0.1.5

package/README.md CHANGED
@@ -91,6 +91,18 @@ after the process starts before the first health probe — like a readiness
91
91
  probe's initial delay, useful for apps with a known boot time. The delay runs
92
92
  before the `health.timeoutMs` window begins.
93
93
 
94
+ Set a process's `restart` policy to control automatic restarts after a crash.
95
+ `restart.maxRestarts` caps how many restarts are allowed within `restart.windowMs`
96
+ before Rollbridge gives up and leaves the process `failed` (`maxRestarts: 0`
97
+ disables restarts entirely), while `restart.backoffFactor` — with an optional
98
+ `restart.maxDelayMs` cap — backs off the `restartDelayMs` delay on each successive
99
+ restart. With no `restart` block, a crashed process keeps restarting after
100
+ `restartDelayMs`, as before. See [`docs/config.md`](docs/config.md#processesrestart).
101
+
102
+ ```js
103
+ restart: {maxRestarts: 5, windowMs: 60000, backoffFactor: 2, maxDelayMs: 30000}
104
+ ```
105
+
94
106
  Set `releaseRetention` to bound how many stopped (drained) releases the daemon
95
107
  keeps in memory and reports in `status`. `keep` (default `10`) retains the most
96
108
  recent stopped releases; `maxAgeMs` (default `0`, disabled) also prunes stopped
@@ -133,6 +145,10 @@ fails the process start with a clear error, so typos surface immediately.
133
145
  Production-ready examples live in `examples/`, including
134
146
  `examples/tensorbuzz.com.js` for the current TensorBuzz backend deployment.
135
147
 
148
+ See [`docs/velocious.md`](docs/velocious.md) for a Velocious deployment guide —
149
+ how Beacon, background-jobs-main, background-jobs-worker, and the web process map
150
+ to Rollbridge policies, with startup ordering and deploy behavior.
151
+
136
152
  See [`docs/config.md`](docs/config.md) for the full config reference — every
137
153
  field, its default, validation rules, template variables, and the environment
138
154
  variables Rollbridge injects.
@@ -349,6 +365,15 @@ Stop the active release:
349
365
  rollbridge stop --config rollbridge.js
350
366
  ```
351
367
 
368
+ Restart non-proxied processes in place — all of them, one by id, or a policy
369
+ group (the proxied process is never restarted; use `deploy` for that):
370
+
371
+ ```bash
372
+ rollbridge restart --config rollbridge.js # all non-proxied processes
373
+ rollbridge restart --config rollbridge.js --process background-jobs-worker
374
+ rollbridge restart --config rollbridge.js --policy companion
375
+ ```
376
+
352
377
  Shut down the daemon and managed processes:
353
378
 
354
379
  ```bash
@@ -371,6 +396,10 @@ location / {
371
396
  }
372
397
  ```
373
398
 
399
+ See [`docs/nginx.md`](docs/nginx.md) for the full guide — WebSocket upgrade
400
+ headers, timeouts for long-lived connections, forwarded headers, and common
401
+ failure modes (502/503, dropped WebSockets).
402
+
374
403
  ## Running under systemd
375
404
 
376
405
  Run the long-lived daemon as a systemd service so it starts on boot and is
@@ -417,6 +446,10 @@ Maintainers can publish a patch release from the latest default branch:
417
446
  npm run release:patch
418
447
  ```
419
448
 
449
+ The release script owns the package version bump, lockfile update, default-branch
450
+ commit, push, and npm publish. Do not run `npm version` manually before running
451
+ it.
452
+
420
453
  ## License
421
454
 
422
455
  Rollbridge is released under the [MIT License](LICENSE).
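
Review note on the README's new `restart` paragraph: it says `restart.maxRestarts` caps restarts within `restart.windowMs`, that `maxRestarts: 0` disables restarts, and that an absent `restart` block keeps restarts unlimited. The sketch below is one way such a rolling-window count could work. `createRestartBudget` and `tryRestart` are hypothetical names chosen for illustration, not Rollbridge internals, and the sketch models only the counting, not what happens to a process once it is left `failed`.

```javascript
// Hypothetical sketch of a rolling-window restart budget; not Rollbridge's code.
function createRestartBudget({maxRestarts, windowMs = 0}) {
  /** @type {number[]} Timestamps (ms) of restarts still inside the window. */
  let restartTimes = []

  return {
    // Returns true and records the restart when one is still allowed at `now` (ms).
    tryRestart(now) {
      // An unset cap means unlimited restarts, as documented for the default.
      if (maxRestarts === undefined) return true

      // windowMs of 0 counts over the whole lifetime, so nothing ever expires.
      if (windowMs > 0) restartTimes = restartTimes.filter((time) => now - time < windowMs)

      // Budget exhausted: the caller would leave the process failed.
      if (restartTimes.length >= maxRestarts) return false

      restartTimes.push(now)
      return true
    }
  }
}

const budget = createRestartBudget({maxRestarts: 2, windowMs: 60000})
const results = [0, 30000, 70000, 80000].map((now) => budget.tryRestart(now))

console.log(results) // [ true, true, true, false ]
```

The third restart is allowed because the one at `0` has aged out of the 60-second window by then; the fourth is refused because two restarts (at `30000` and `70000`) are still inside it.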
package/TODO.md CHANGED
@@ -26,9 +26,9 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
26
26
  - [ ] Restart memory-heavy workers gracefully when possible, with a forced stop timeout.
27
27
  - [ ] Add tests with a fixture process that allocates memory above the configured limit.
28
28
  - [ ] Worker auto-restart and restart policy controls.
29
- - [ ] Add config for max restarts, restart window, exponential backoff, and disabled restart behavior.
29
+ - [x] Add config for max restarts, restart window, exponential backoff, and disabled restart behavior (per-process `restart` policy).
30
30
  - [ ] Distinguish crash restarts, deploy replacements, manual restarts, and memory restarts in status/events.
31
- - [ ] Add a `restart` CLI command for a single process, a policy group, or all non-proxied workers.
31
+ - [x] Add a `restart` CLI command for a single process, a policy group, or all non-proxied workers.
32
32
  - [ ] Keep restart behavior safe for job workers by using lifecycle hooks before termination.
33
33
  - [ ] Graceful job-worker lifecycle.
34
34
  - [ ] Add generic lifecycle hooks such as `quietCommand`, `drainCommand`, `drainTimeoutMs`, and `stopCommand`.
@@ -81,7 +81,7 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
81
81
  - [x] Add npm package metadata such as repository, license, bugs, and homepage.
82
82
  - [x] Add systemd service examples for the Rollbridge daemon.
83
83
  - [x] Add tests for malformed control socket JSON and unknown control commands.
84
- - [ ] Add tests for duplicate IDs and singleton replacement failure behavior.
84
+ - [x] Add tests for duplicate IDs and singleton replacement failure behavior.
85
85
  - [x] Add tests for proxy behavior when the active release exits unexpectedly.
86
86
 
87
87
  ## Documentation TODO
@@ -91,8 +91,8 @@ This roadmap tracks planned Rollbridge features and documentation. Rollbridge sh
91
91
  - [x] Expand process policy docs with deployment examples for `proxied`, `companion`, `singleton`, and `service`.
92
92
  - [ ] Document memory checks and auto-restart behavior after the feature lands.
93
93
  - [ ] Document worker lifecycle hooks and safe background-job deployment patterns after the feature lands.
94
- - [ ] Add a Velocious deployment guide with Beacon, background-jobs-main, background-jobs-worker, and web process examples.
95
- - [ ] Add an Nginx guide with WebSocket headers, timeouts, and common failure modes.
94
+ - [x] Add a Velocious deployment guide with Beacon, background-jobs-main, background-jobs-worker, and web process examples (`docs/velocious.md`).
95
+ - [x] Add an Nginx guide with WebSocket headers, timeouts, and common failure modes (`docs/nginx.md`).
96
96
  - [x] Add deploy-tool recipes that call Rollbridge CLI commands directly (`docs/deploy-recipes.md`).
97
97
  - [x] Add a Capistrano recipe showing shell commands only; do not add a Capistrano plugin or Rollbridge-specific Capistrano tasks (`docs/deploy-recipes.md`).
98
98
  - [ ] Add a TensorBuzz-specific runbook for current production ports, external services, deploy ordering, and rollback constraints.
package/docs/cli.md CHANGED
@@ -100,6 +100,29 @@ Stops the active release (or the release named by `--release-id`) and prints the
100
100
  updated status JSON. With no active release, the proxy answers `503` until the
101
101
  next deploy.
102
102
 
103
+ ## `restart`
104
+
105
+ ```
106
+ rollbridge restart [--config <path>] [--process <id>] [--policy <policy>]
107
+ ```
108
+
109
+ Restarts **non-proxied** processes and prints `{"restarted": [<ids>]}`. Like
110
+ `systemctl restart`, a running process is bounced (stop, then start) and a
111
+ crashed or stopped one is revived — so this is also how you bring back a process
112
+ that exhausted its `restart` budget (see [`config.md`](config.md#processesrestart)).
113
+ Selectors:
114
+
115
+ - no selector — restart every non-proxied process (companions, singletons, and services);
116
+ - `--process <id>` — restart only that process;
117
+ - `--policy <companion|singleton|service>` — restart only processes with that policy.
118
+
119
+ The proxied process is never restarted in place — that would drop traffic.
120
+ Targeting it (by id or `--policy proxied`) is an error; use `rollbridge deploy`
121
+ for a zero-downtime replacement. `--process <id>` with an id that is not a
122
+ managed process (unknown, or a companion with no active release) is also an
123
+ error. Restarting a `service` bounces a shared broker (for example Velocious
124
+ Beacon), which briefly disrupts every process that depends on it.
125
+
103
126
  ## `shutdown`
104
127
 
105
128
  ```
package/docs/config.md CHANGED
@@ -68,9 +68,28 @@ release records; the deploy tool still owns on-disk release directories.
68
68
  | `port` | number or `{from, to}` | unset | Port (or range) allocated per release. **Required for the `proxied` process.** A plain number `n` means the fixed port `n` (`{from: n, to: n}`). |
69
69
  | `health` | object or `false` | enabled with defaults | Health check for the `proxied` process; set `false` to disable (see below). |
70
70
  | `gracefulStopMs` | number | `proxy.forceStopTimeoutMs` | `SIGTERM`→`SIGKILL` window for this process. |
71
- | `restartDelayMs` | number | `1000` | Delay before restarting this process after a crash. |
71
+ | `restartDelayMs` | number | `1000` | Base delay before restarting this process after a crash (the backoff base; see `restart`). |
72
+ | `restart` | object | unlimited restarts, constant delay | Automatic-restart policy: cap, rolling window, and backoff (see below). |
72
73
  | `outputLines` | positive integer | `50` | Recent stdout/stderr lines retained per process and reported by `status`/`logs`. |
73
74
 
75
+ ### `processes[].restart`
76
+
77
+ Controls automatic restarts of a crashed process (a release's active processes
78
+ and daemon-wide `service`s). The base delay is the process's `restartDelayMs`;
79
+ when the policy's limit is reached the process is left `failed` and not
80
+ restarted again.
81
+
82
+ | Field | Type | Default | Description |
83
+ | --- | --- | --- | --- |
84
+ | `restart.maxRestarts` | non-negative integer | unset (unlimited) | Maximum automatic restarts allowed within `windowMs` before Rollbridge stops restarting the process. `0` disables automatic restarts entirely. |
85
+ | `restart.windowMs` | non-negative number | `0` (process lifetime) | Rolling window over which `maxRestarts` is counted and after which the backoff resets. `0` counts over the process's whole lifetime. |
86
+ | `restart.backoffFactor` | number ≥ 1 | `1` (constant) | Multiplier applied to `restartDelayMs` on each successive restart in the window: `delay = restartDelayMs × backoffFactor ^ n`. `1` keeps a constant delay. |
87
+ | `restart.maxDelayMs` | non-negative number | `0` (no cap) | Upper bound on the backed-off delay. `0` means no cap. |
88
+
89
+ With the defaults a crashed process restarts indefinitely after `restartDelayMs`.
90
+ Pair `backoffFactor`/`windowMs` to back off and self-heal after a clean run, or
91
+ set `maxRestarts` to give up on a process stuck in a crash loop.
92
+
74
93
  ### `processes[].health`
75
94
 
76
95
  Only the `proxied` process is health-checked (before traffic switches to a new
@@ -126,3 +145,4 @@ Rollbridge sets these in every managed process's environment (the process's own
126
145
  - `port` must be a positive port number or an ascending `{from, to}` range.
127
146
  - `control.mode` must be an octal mode between `0` and `0o777`.
128
147
  - `outputLines` and `releaseRetention.keep` must be positive/non-negative integers; `health.startDelayMs` and `releaseRetention.maxAgeMs` must be non-negative numbers.
148
+ - `restart.maxRestarts` must be a non-negative integer (omit it for unlimited restarts); `restart.backoffFactor` must be a number ≥ 1; `restart.windowMs` and `restart.maxDelayMs` must be non-negative numbers.
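
Review note on the `processes[].restart` table in this file's diff: the documented backoff rule is `delay = restartDelayMs × backoffFactor ^ n`, bounded by `restart.maxDelayMs` when that is non-zero. A minimal sketch of that arithmetic follows. The function name `restartDelay` is hypothetical, and the sketch assumes `n` is the number of restarts already made in the current window (so the first restart uses `n = 0`); the diff does not state where `n` starts.

```javascript
// Hypothetical helper illustrating the documented formula; not Rollbridge's code.
// Assumption: n counts restarts already made in the window (0 for the first restart).
function restartDelay({restartDelayMs = 1000, backoffFactor = 1, maxDelayMs = 0}, n) {
  const delay = restartDelayMs * Math.pow(backoffFactor, n)

  // maxDelayMs of 0 means "no cap", matching the documented default.
  return maxDelayMs > 0 ? Math.min(delay, maxDelayMs) : delay
}

// Same values as the README example: backoffFactor 2 with a 30s cap.
const policy = {restartDelayMs: 1000, backoffFactor: 2, maxDelayMs: 30000}
const delays = [0, 1, 2, 3, 4, 5].map((n) => restartDelay(policy, n))

console.log(delays) // [ 1000, 2000, 4000, 8000, 16000, 30000 ]
```

With the documented defaults (`backoffFactor: 1`, `maxDelayMs: 0`) the same function returns `restartDelayMs` for every `n`, which matches the "constant delay" behavior described for processes without a `restart` block.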
package/docs/nginx.md ADDED
@@ -0,0 +1,104 @@
1
+ # Nginx guide
2
+
3
+ Nginx should always proxy to the **stable Rollbridge proxy port**
4
+ (`proxy.host:proxy.port`), never directly to a release process — release ports
5
+ are allocated per deploy and change. Rollbridge forwards both HTTP and WebSocket
6
+ traffic to the active release and drains old connections across deploys.
7
+
8
+ ## Server block
9
+
10
+ ```nginx
11
+ # Maps the Upgrade header so WebSocket requests get "Connection: upgrade" and
12
+ # normal requests get "Connection: close".
13
+ map $http_upgrade $connection_upgrade {
14
+ default upgrade;
15
+ '' close;
16
+ }
17
+
18
+ server {
19
+ listen 443 ssl;
20
+ server_name app.example.com;
21
+ # ssl_certificate / ssl_certificate_key ...
22
+
23
+ location / {
24
+ proxy_pass http://127.0.0.1:8182; # Rollbridge proxy.host:proxy.port
25
+
26
+ # WebSocket upgrade
27
+ proxy_http_version 1.1;
28
+ proxy_set_header Upgrade $http_upgrade;
29
+ proxy_set_header Connection $connection_upgrade;
30
+
31
+ # Pass the real client through to the app
32
+ proxy_set_header Host $host;
33
+ proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
34
+ proxy_set_header X-Forwarded-Proto $scheme;
35
+ proxy_set_header X-Real-IP $remote_addr;
36
+
37
+ # Long-lived connections (WebSocket/SSE) — see "Timeouts" below
38
+ proxy_read_timeout 3600s;
39
+ proxy_send_timeout 3600s;
40
+ }
41
+ }
42
+ ```
43
+
44
+ The repository README shows a minimal version of this block; the additions here
45
+ matter for production.
46
+
47
+ ## WebSocket headers
48
+
49
+ Rollbridge's proxy has WebSocket support enabled, so the only requirement is that
50
+ Nginx forwards the upgrade handshake:
51
+
52
+ - `proxy_http_version 1.1` — WebSocket upgrades require HTTP/1.1 (the default is 1.0).
53
+ - `proxy_set_header Upgrade $http_upgrade;` and `proxy_set_header Connection $connection_upgrade;` — forward the upgrade. Using the `map` above is preferred over a hard-coded `Connection "upgrade"`, so non-WebSocket requests aren't forced into an upgrade.
54
+
55
+ If these are missing, WebSocket clients fail to connect (the handshake never
56
+ completes) while plain HTTP still works.
57
+
58
+ ## Timeouts
59
+
60
+ Nginx's `proxy_read_timeout`/`proxy_send_timeout` default to **60s**. An idle
61
+ WebSocket (or a slow streaming response) is closed once that elapses, so
62
+ long-lived connections silently drop after a minute unless you raise them — set
63
+ them on the relevant `location` (or globally) to a value above your longest idle
64
+ period.
65
+
66
+ Related Rollbridge timeouts (configured in `rollbridge.js`, not Nginx):
67
+
68
+ - `proxy.healthTimeoutMs` gates how long a new release has to become healthy
69
+ before a deploy aborts — it does not affect request timeouts.
70
+ - `proxy.drainTimeoutMs` is how long Rollbridge keeps an old release alive for
71
+ in-flight connections during a deploy. Keep Nginx's `proxy_read_timeout` for
72
+ WebSocket locations comfortably above it so the front end doesn't cut
73
+ connections Rollbridge is still draining.
74
+
75
+ ## Forwarded headers
76
+
77
+ Set `X-Forwarded-For`, `X-Forwarded-Proto`, and `Host` so the app behind
78
+ Rollbridge sees the real client and scheme. Rollbridge proxies with
79
+ `X-Forwarded-*` enabled, but it can only forward what Nginx provides — terminate
80
+ TLS at Nginx and pass `X-Forwarded-Proto $scheme` so the app knows the original
81
+ request was HTTPS.
82
+
83
+ For Server-Sent Events or other streamed responses, also disable response
84
+ buffering on that location so events flush immediately:
85
+
86
+ ```nginx
87
+ location /events {
88
+ proxy_pass http://127.0.0.1:8182;
89
+ proxy_http_version 1.1;
90
+ proxy_buffering off;
91
+ proxy_read_timeout 3600s;
92
+ }
93
+ ```
94
+
95
+ ## Common failure modes
96
+
97
+ | Symptom | Cause | Fix |
98
+ | --- | --- | --- |
99
+ | `502 Bad Gateway` | Rollbridge can't reach the active release's process (it crashed or is restarting); Rollbridge returns `Bad gateway` and Nginx relays it. | Check `rollbridge status` / `rollbridge logs --process <id>` (see [troubleshooting.md](troubleshooting.md)). The process auto-restarts on its port. |
100
+ | `503` / `No active release` | No release is active — before the first deploy, or after `rollbridge stop`. | Deploy a release (`rollbridge deploy`). |
101
+ | WebSocket drops after ~60s | `proxy_read_timeout` left at the 60s default. | Raise `proxy_read_timeout`/`proxy_send_timeout` on the WebSocket location. |
102
+ | WebSocket never connects (plain HTTP works) | Missing `proxy_http_version 1.1` and the `Upgrade`/`Connection` headers. | Add the WebSocket directives shown above. |
103
+ | `504 Gateway Timeout` | A slow response exceeded `proxy_read_timeout`. | Raise the timeout, or speed up the endpoint. |
104
+ | Connections cut mid-deploy | Nginx `proxy_read_timeout` shorter than `proxy.drainTimeoutMs`. | Raise the Nginx timeout above `proxy.drainTimeoutMs`. |
package/docs/velocious.md ADDED
@@ -0,0 +1,200 @@
1
+ # Velocious deployment guide
2
+
3
+ A Velocious backend typically runs four kinds of process: **Beacon** (the
4
+ message broker other processes connect to), **background-jobs-main** (the job
5
+ coordinator), **background-jobs-worker** (runs the jobs), and the **web/API**
6
+ server. This guide maps each to a Rollbridge process policy, shows a complete
7
+ `rollbridge.js`, and explains startup ordering and what happens on a deploy.
8
+
9
+ A production version of this config lives at
10
+ [`examples/tensorbuzz.com.js`](../examples/tensorbuzz.com.js).
11
+
12
+ ## Process mapping
13
+
14
+ | Velocious process | Policy | Why |
15
+ | --- | --- | --- |
16
+ | `beacon` | `service` | A shared broker the other processes connect to. It should survive deploys and keep a **stable port**, so workers and the web process always reach the same Beacon. |
17
+ | `background-jobs-main` | `service` (or `singleton`) | The job coordinator. Run it as a `service` when it should outlive releases on a stable port; run it as a `singleton` when it must run the latest release's code after every deploy (see [Choosing the jobs-main policy](#choosing-the-jobs-main-policy)). |
18
+ | `background-jobs-worker` | `companion` | Release-scoped: one set of workers per active release, started before the web process and running that release's code. |
19
+ | `web` | `proxied` | Receives external HTTP/WebSocket traffic, is health-checked before traffic switches, and is drained on the next deploy. Exactly one process is `proxied`. |
20
+
21
+ See [README → Process Policies](../README.md#process-policies) for the full
22
+ semantics of each policy and [`docs/config.md`](config.md) for every field.
23
+
24
+ ## Example `rollbridge.js`
25
+
26
+ ```js
27
+ // rollbridge.js
28
+ export default {
29
+ application: "tensorbuzz",
30
+ control: {path: "/tmp/rollbridge-tensorbuzz.sock"},
31
+
32
+ proxy: {
33
+ host: "127.0.0.1",
34
+ port: 4500, // the stable port Nginx points at
35
+ healthPath: "/ping",
36
+ healthTimeoutMs: 30000,
37
+ drainTimeoutMs: 60000,
38
+ forceStopTimeoutMs: 10000
39
+ },
40
+
41
+ processes: [
42
+ // Shared broker — one daemon-wide instance on a stable port.
43
+ {
44
+ id: "beacon",
45
+ policy: "service",
46
+ cwd: "{{releasePath}}/backend",
47
+ env: {NODE_ENV: "production", VELOCIOUS_BEACON_PORT: "{{port}}"},
48
+ command: "npx velocious beacon",
49
+ port: 7330
50
+ },
51
+
52
+ // Job coordinator — waits for Beacon, stable port other jobs processes use.
53
+ {
54
+ id: "background-jobs-main",
55
+ policy: "service",
56
+ cwd: "{{releasePath}}/backend",
57
+ env: {
58
+ NODE_ENV: "production",
59
+ VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
60
+ VELOCIOUS_BACKGROUND_JOBS_PORT: "{{port}}"
61
+ },
62
+ command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- npx velocious background-jobs-main",
63
+ port: 7331
64
+ },
65
+
66
+ // Workers — one set per release; raise gracefulStopMs to let in-flight
67
+ // jobs finish during a deploy.
68
+ {
69
+ id: "background-jobs-worker",
70
+ policy: "companion",
71
+ cwd: "{{releasePath}}/backend",
72
+ env: {
73
+ NODE_ENV: "production",
74
+ VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
75
+ VELOCIOUS_BACKGROUND_JOBS_PORT: "{{ports.background-jobs-main}}"
76
+ },
77
+ command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- wait-for-it 127.0.0.1:{{ports.background-jobs-main}} --strict -- npx velocious background-jobs-worker",
78
+ gracefulStopMs: 60000
79
+ },
80
+
81
+ // Web/API — the one proxied process.
82
+ {
83
+ id: "web",
84
+ policy: "proxied",
85
+ cwd: "{{releasePath}}/backend",
86
+ env: {
87
+ NODE_ENV: "production",
88
+ VELOCIOUS_BEACON_PORT: "{{ports.beacon}}",
89
+ VELOCIOUS_BACKGROUND_JOBS_PORT: "{{ports.background-jobs-main}}"
90
+ },
91
+ command: "wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- wait-for-it 127.0.0.1:{{ports.background-jobs-main}} --strict -- npx velocious server --host 127.0.0.1 --port {{port}}",
92
+ port: {from: 14500, to: 14599},
93
+ health: {path: "/ping", timeoutMs: 30000, intervalMs: 500}
94
+ }
95
+ ]
96
+ }
97
+ ```
98
+
99
+ ## Wiring processes together
100
+
101
+ Beacon and `background-jobs-main` get **fixed** ports (`7330`, `7331`) because
102
+ they are `service`s — a stable port lets every release's workers and web process
103
+ find them. The proxied `web` process gets a **range** (`{from: 14500, to:
104
+ 14599}`); Rollbridge allocates a free port per release so the old and new web
105
+ releases can run side by side during the drain.
106
+
107
+ Cross-reference ports with `{{ports.<id>}}` and pass them to Velocious through
108
+ `env`. Rollbridge also injects `ROLLBRIDGE_<ID>_PORT` for every process (e.g.
109
+ `ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT`), so you can read ports from the
110
+ environment instead of templating if you prefer — see
111
+ [`docs/config.md`](config.md#injected-environment-variables).
112
+
113
+ ### Startup ordering
114
+
115
+ Only the `proxied` process is health-checked, so dependent processes must wait
116
+ for their dependencies themselves. Two mechanisms combine:
117
+
118
+ 1. **Policy ordering.** On each deploy Rollbridge starts `service`s first, then
119
+ the release's `companion`s, then the `proxied` process (see
120
+ [README → Deploy ordering](../README.md#deploy-ordering)).
121
+ 2. **Readiness gating.** `wait-for-it 127.0.0.1:{{ports.beacon}} --strict -- …`
122
+ blocks the command until Beacon's port accepts connections, so
123
+ `background-jobs-main`, the worker, and `web` don't start talking to Beacon
124
+ before it is listening. `wait-for-it` is a small standalone script (install it
125
+ on the host); any equivalent port-wait works.
126
+
127
+ ## Deploying
128
+
129
+ Drive deploys through the Rollbridge CLI — Rollbridge ships no deploy-tool
130
+ plugins (see [`docs/deploy-recipes.md`](deploy-recipes.md) for shell/CI/Capistrano
131
+ recipes). The minimal step after a release directory is prepared:
132
+
133
+ ```bash
134
+ release_path=/srv/tensorbuzz/releases/20260523120000 # prepared by your pipeline
135
+
136
+ # Run backwards-compatible migrations BEFORE switching traffic: the old and new
137
+ # web releases overlap during the drain.
138
+ (cd "$release_path/backend" && npx velocious db:migrate)
139
+
140
+ rollbridge deploy \
141
+ --ensure-daemon \
142
+ --config /etc/rollbridge/rollbridge.js \
143
+ --release-path "$release_path" \
144
+ --revision "$(git -C "$release_path/backend" rev-parse HEAD)"
145
+ ```
146
+
147
+ `rollbridge deploy` starts the new release's worker and web process,
148
+ health-checks `web` on its `{{port}}`/`/ping`, switches traffic, then drains and
149
+ stops the previous release. It exits non-zero (leaving the previous release
150
+ active) if the new release fails to start or health-check, so a failed deploy
151
+ never promotes a broken release.
152
+
153
+ ## Background jobs across a deploy
154
+
155
+ The worker is a `companion`, so each release runs its own workers:
156
+
157
+ - On deploy, the **new** release's workers start (running the new code) before
158
+ traffic switches; the **old** release's workers are stopped when that release
159
+ is drained and retired — `SIGTERM`, then `SIGKILL` after `gracefulStopMs`.
160
+ - Set `gracefulStopMs` on the worker to at least your longest in-flight job so a
161
+ job gets time to finish on `SIGTERM` before the forced kill. The example uses
162
+ `60000` (60s).
163
+
164
+ > **Planned:** graceful job-worker draining via lifecycle hooks
165
+ > (`quietCommand`/`drainCommand`/`stopCommand` and a non-blocking drain mode so
166
+ > new workers start while old workers finish) is on the
167
+ > [roadmap](../TODO.md#major-features) and not yet implemented. Until then, the
168
+ > `gracefulStopMs` window above is the mechanism for letting in-flight jobs
169
+ > finish.
170
+
171
+ ### Choosing the jobs-main policy
172
+
173
+ `background-jobs-main` is duplicate-unsafe (you never want two coordinators), so
174
+ it is either a `service` or a `singleton` — never a `companion`:
175
+
176
+ - **`service`** — keeps running across deploys on its stable port. Workers from
177
+ every release talk to the same coordinator, so there's no coordination gap on
178
+ deploy. The trade-off: a `service` keeps running the **release it was started
179
+ from** and only adopts the latest release's template if it crashes and
180
+ restarts (or the daemon restarts). If `background-jobs-main` itself needs the
181
+ newest code immediately after every deploy, this is the wrong policy.
182
+ - **`singleton`** — Rollbridge stops the old instance and then starts the new
183
+ one on each deploy, so it always runs the latest release's code and two copies
184
+ never overlap. The trade-off: a brief coordination gap while it restarts.
185
+
186
+ Beacon is a broker rather than code that changes per release, so `service` is
187
+ almost always right for it.
188
+
189
+ ## Verifying
190
+
191
+ After a deploy, `rollbridge status` should show `beacon` and
192
+ `background-jobs-main` as long-lived `service`s with unchanged ports across
193
+ deploys, one `background-jobs-worker` for the active release, and the `web`
194
+ process `proxied` with its connection counts. Use
195
+ [`rollbridge logs --process <id>`](cli.md) to read recent output from any
196
+ process, and [`docs/troubleshooting.md`](troubleshooting.md) for health-check,
197
+ port, and draining problems.
198
+
199
+ For the front end, point Nginx at the stable `proxy.port` (here `4500`), never at
200
+ a release's web port — see [`docs/nginx.md`](nginx.md).
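
Review note on the "Wiring processes together" section of the Velocious guide: it says Rollbridge injects `ROLLBRIDGE_<ID>_PORT` for every process and gives `ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT` as the example for `background-jobs-main`. The sketch below shows how a managed process might read such a port from its environment instead of templating it. `portEnvName` and `readPort` are hypothetical helpers, and the naming rule (uppercase the id, replace runs of other characters with `_`) is an assumption inferred from that single example; `docs/config.md` is the authority on the exact rule.

```javascript
// Hypothetical helpers; the id-to-name rule is inferred from one documented example.
function portEnvName(processId) {
  return `ROLLBRIDGE_${processId.toUpperCase().replace(/[^A-Z0-9]+/g, "_")}_PORT`
}

// Reads and validates the port for a process id from an environment object.
function readPort(processId, env = process.env) {
  const name = portEnvName(processId)
  const port = Number(env[name])

  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`${name} is not set to a valid port`)
  }

  return port
}

// Stand-in environment using the fixed port from the example config.
const exampleEnv = {ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT: "7331"}

console.log(portEnvName("background-jobs-main")) // ROLLBRIDGE_BACKGROUND_JOBS_MAIN_PORT
console.log(readPort("background-jobs-main", exampleEnv)) // 7331
```

Failing loudly on a missing or malformed variable keeps a misnamed process id from silently falling back to a wrong port.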
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "rollbridge",
3
- "version": "0.1.4",
3
+ "version": "0.1.5",
4
4
  "description": "Zero-downtime process supervisor and local traffic switcher for deploy-managed apps.",
5
5
  "keywords": [
6
6
  "deploy",
@@ -28,7 +28,7 @@
28
28
  "scripts": {
29
29
  "all-checks": "npm run typecheck && npm run lint && npm test",
30
30
  "lint": "eslint",
31
- "release:patch": "node scripts/release-patch.js",
31
+ "release:patch": "release-patch",
32
32
  "test": "node --test test/*.test.js",
33
33
  "typecheck": "tsc --noEmit"
34
34
  },
@@ -46,6 +46,7 @@
46
46
  "eslint": "^10.4.0",
47
47
  "eslint-plugin-jsdoc": "^62.9.0",
48
48
  "globals": "^17.6.0",
49
+ "release-patch": "^1.0.0",
49
50
  "typescript": "^6.0.3"
50
51
  }
51
52
  }
package/src/cli.js CHANGED
@@ -136,6 +136,33 @@ export async function runCli(argv) {
136
136
  console.log(JSON.stringify(response, null, 2))
137
137
  })
138
138
 
139
+ program
140
+ .command("restart")
141
+ .description("Restart running non-proxied processes (by id, by policy, or all).")
142
+ .option("-c, --config <path>", "Config file path (defaults to rollbridge.js)")
143
+ .option("--process <id>", "Restart only the process with this id")
144
+ .option("--policy <policy>", "Restart only processes with this policy (companion, singleton, or service)")
145
+ .action(async (options) => {
146
+ if (options.policy !== undefined && !["companion", "service", "singleton"].includes(options.policy)) {
147
+ console.error("--policy must be one of: companion, singleton, service.")
148
+ process.exitCode = 1
149
+ return
150
+ }
151
+
152
+ const configPath = await resolveConfigPath(options.config)
153
+ const config = await loadConfig(configPath)
154
+ const response = await sendControlCommand({
155
+ command: {
156
+ command: "restart",
157
+ policy: options.policy,
158
+ processId: options.process
159
+ },
160
+ path: config.control.path
161
+ })
162
+
163
+ console.log(JSON.stringify(response, null, 2))
164
+ })
165
+
139
166
  program
140
167
  .command("shutdown")
141
168
  .option("-c, --config <path>", "Config file path (defaults to rollbridge.js)")
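
Review note on the new `restart` command: the CLI only validates `--policy` and forwards `processId` and `policy` over the control socket, so the selection rules described in `docs/cli.md` must be enforced by the daemon, whose code is not part of this excerpt. The sketch below is one way those documented rules could be expressed. `selectRestartTargets` and the `{id, policy}` process shape are hypothetical, and the error messages are illustrative only.

```javascript
// Hypothetical daemon-side selection following the rules in docs/cli.md; not Rollbridge's code.
function selectRestartTargets(processes, {processId, policy} = {}) {
  const proxiedError = "The proxied process cannot be restarted in place; use deploy."

  // Targeting the proxied process by policy is documented as an error.
  if (policy === "proxied") throw new Error(proxiedError)

  if (processId !== undefined) {
    const target = processes.find((entry) => entry.id === processId)

    // An id that is not a managed process is documented as an error.
    if (!target) throw new Error(`Unknown process: ${processId}`)
    // Targeting the proxied process by id is documented as an error too.
    if (target.policy === "proxied") throw new Error(proxiedError)
  }

  return processes
    .filter((entry) => entry.policy !== "proxied")
    .filter((entry) => processId === undefined || entry.id === processId)
    .filter((entry) => policy === undefined || entry.policy === policy)
    .map((entry) => entry.id)
}

// Process ids and policies taken from the Velocious guide's example config.
const processes = [
  {id: "beacon", policy: "service"},
  {id: "background-jobs-main", policy: "service"},
  {id: "background-jobs-worker", policy: "companion"},
  {id: "web", policy: "proxied"}
]

const all = selectRestartTargets(processes)
const companions = selectRestartTargets(processes, {policy: "companion"})

console.log(all) // [ 'beacon', 'background-jobs-main', 'background-jobs-worker' ]
console.log(companions) // [ 'background-jobs-worker' ]
```

How the daemon treats `--process` and `--policy` given together is not stated in the diff; in this sketch the two filters are simply combined.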
package/src/config.js CHANGED
@@ -9,7 +9,8 @@ import {pathToFileURL} from "node:url"
9
9
  * @typedef {{from: number, to: number}} PortRange
10
10
  * @typedef {{path: string, startDelayMs: number, timeoutMs: number, intervalMs: number}} HealthConfig
11
11
  * @typedef {"proxied" | "companion" | "singleton" | "service"} ProcessPolicy
12
- * @typedef {{cwd?: string, env: Record<string, string>, gracefulStopMs: number, health?: HealthConfig, id: string, outputLines: number, policy: ProcessPolicy, port?: PortRange, restartDelayMs: number, command: string}} ProcessConfig
12
+ * @typedef {{backoffFactor: number, maxDelayMs: number, maxRestarts: number | undefined, windowMs: number}} RestartConfig
13
+ * @typedef {{cwd?: string, env: Record<string, string>, gracefulStopMs: number, health?: HealthConfig, id: string, outputLines: number, policy: ProcessPolicy, port?: PortRange, restart: RestartConfig, restartDelayMs: number, command: string}} ProcessConfig
13
14
  * @typedef {{mode?: number, path: string}} ControlConfig
14
15
  * @typedef {{drainTimeoutMs: number, forceStopTimeoutMs: number, healthPath: string, healthTimeoutMs: number, host: string, port: number, upstreamHost: string}} ProxyConfig
15
16
  * @typedef {{keep: number, maxAgeMs: number}} ReleaseRetentionConfig
@@ -175,7 +176,7 @@ function normalizeProcess(value, index, proxy, issues) {
175
176
  if (!isPlainObject(value)) {
176
177
  issues.push({fix: `Define processes[${index}] as a mapping with id, policy, and command.`, message: `processes[${index}] must be an object`})
177
178
 
178
- return {command: "", cwd: undefined, env: {}, gracefulStopMs: proxy.forceStopTimeoutMs, health: undefined, id: "", outputLines: 50, policy: "companion", port: undefined, restartDelayMs: 1000}
179
+ return {command: "", cwd: undefined, env: {}, gracefulStopMs: proxy.forceStopTimeoutMs, health: undefined, id: "", outputLines: 50, policy: "companion", port: undefined, restart: defaultRestartConfig(), restartDelayMs: 1000}
179
180
  }
180
181
 
181
182
  const source = value
@@ -190,10 +191,80 @@ function normalizeProcess(value, index, proxy, issues) {
190
191
  outputLines: normalizeOutputLines(source.outputLines, `processes[${index}].outputLines`, issues),
191
192
  policy: normalizePolicy(source.policy, `processes[${index}].policy`, issues),
192
193
  port: normalizePortRange(source.port, `processes[${index}].port`, issues),
194
+ restart: normalizeRestart(source.restart, `processes[${index}].restart`, issues),
193
195
  restartDelayMs: normalizeNumber(source.restartDelayMs, `processes[${index}].restartDelayMs`, issues, {default: 1000})
194
196
  }
195
197
  }
196
198
 
199
+ /**
200
+ * @returns {RestartConfig} Default restart policy: unlimited restarts with a constant delay.
201
+ */
202
+ function defaultRestartConfig() {
203
+ return {backoffFactor: 1, maxDelayMs: 0, maxRestarts: undefined, windowMs: 0}
204
+ }
205
+
206
+ /**
207
+ * @param {JsonValue} value - Raw restart policy.
208
+ * @param {string} key - Config key.
209
+ * @param {ConfigIssue[]} issues - Issue collector.
210
+ * @returns {RestartConfig} Normalized restart policy.
211
+ */
212
+ function normalizeRestart(value, key, issues) {
213
+ if (value === undefined || value === null) return defaultRestartConfig()
214
+
215
+ if (!isPlainObject(value)) {
216
+ issues.push({fix: `Set ${key} to a mapping with maxRestarts, windowMs, backoffFactor, and maxDelayMs.`, message: `${key} must be an object`})
217
+
218
+ return defaultRestartConfig()
219
+ }
220
+
221
+ const windowMs = normalizeNumber(value.windowMs, `${key}.windowMs`, issues, {default: 0})
222
+ const maxDelayMs = normalizeNumber(value.maxDelayMs, `${key}.maxDelayMs`, issues, {default: 0})
223
+
224
+ return {
225
+ backoffFactor: normalizeBackoffFactor(value.backoffFactor, `${key}.backoffFactor`, issues),
226
+ maxDelayMs: nonNegativeOrDefault(maxDelayMs, `${key}.maxDelayMs`, issues, 0, false),
227
+ maxRestarts: normalizeMaxRestarts(value.maxRestarts, `${key}.maxRestarts`, issues),
228
+ windowMs: nonNegativeOrDefault(windowMs, `${key}.windowMs`, issues, 0, false)
229
+ }
230
+ }
231
+
232
+ /**
233
+ * @param {JsonValue} value - Raw maximum restart count.
234
+ * @param {string} key - Config key.
235
+ * @param {ConfigIssue[]} issues - Issue collector.
236
+ * @returns {number | undefined} Restart cap, or undefined for unlimited restarts.
237
+ */
238
+ function normalizeMaxRestarts(value, key, issues) {
239
+ if (value === undefined || value === null) return undefined
240
+
241
+ if (typeof value !== "number" || !Number.isInteger(value) || value < 0) {
242
+ issues.push({fix: `Set ${key} to a non-negative integer (0 disables automatic restarts), or omit it for unlimited restarts.`, message: `${key} must be a non-negative integer`})
243
+
244
+ return undefined
245
+ }
246
+
247
+ return value
248
+ }
249
+
250
+ /**
251
+ * @param {JsonValue} value - Raw backoff factor.
252
+ * @param {string} key - Config key.
253
+ * @param {ConfigIssue[]} issues - Issue collector.
254
+ * @returns {number} Backoff multiplier (>= 1; 1 keeps a constant delay).
255
+ */
256
+ function normalizeBackoffFactor(value, key, issues) {
257
+ const factor = normalizeNumber(value, key, issues, {default: 1})
258
+
259
+ if (factor < 1) {
260
+ issues.push({fix: `Set ${key} to a number >= 1 (1 keeps a constant delay; 2 doubles the delay each restart).`, message: `${key} must be a number greater than or equal to 1`})
261
+
262
+ return 1
263
+ }
264
+
265
+ return factor
266
+ }
267
+
197
268
  /**
198
269
  * @param {JsonValue} value - Raw output retention value.
199
270
  * @param {string} key - Config key.
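
The normalization added in this hunk can be exercised in isolation. The following is a condensed, standalone sketch rather than the package's code: `sketchNormalizeRestart` and `nonNegative` are hypothetical names, issues are collected as plain strings instead of `{fix, message}` objects, and the behaviour of the package's `normalizeNumber` and `nonNegativeOrDefault` helpers (not shown in this diff) is assumed to be "fall back to the default and record an issue".

```javascript
// Standalone sketch of the restart-policy normalization; not a package export.
function sketchNormalizeRestart(value, key = "restart") {
  const issues = []
  const defaults = {backoffFactor: 1, maxDelayMs: 0, maxRestarts: undefined, windowMs: 0}

  // A missing block keeps the pre-0.1.5 behaviour: unlimited restarts, constant delay.
  if (value === undefined || value === null) return {config: defaults, issues}

  if (typeof value !== "object" || Array.isArray(value)) {
    issues.push(`${key} must be an object`)
    return {config: defaults, issues}
  }

  // Assumed helper behaviour: bad numbers fall back to 0 and record an issue.
  const nonNegative = (raw, name) => {
    if (raw === undefined || raw === null) return 0
    if (typeof raw !== "number" || !Number.isFinite(raw) || raw < 0) {
      issues.push(`${key}.${name} must be a non-negative number`)
      return 0
    }
    return raw
  }

  let backoffFactor = value.backoffFactor ?? 1
  if (typeof backoffFactor !== "number" || !(backoffFactor >= 1)) {
    issues.push(`${key}.backoffFactor must be a number greater than or equal to 1`)
    backoffFactor = 1
  }

  // undefined means "unlimited"; 0 is valid and disables automatic restarts.
  let maxRestarts = value.maxRestarts ?? undefined
  if (maxRestarts !== undefined && (!Number.isInteger(maxRestarts) || maxRestarts < 0)) {
    issues.push(`${key}.maxRestarts must be a non-negative integer`)
    maxRestarts = undefined
  }

  return {
    config: {
      backoffFactor,
      maxDelayMs: nonNegative(value.maxDelayMs, "maxDelayMs"),
      maxRestarts,
      windowMs: nonNegative(value.windowMs, "windowMs")
    },
    issues
  }
}

console.log(sketchNormalizeRestart({maxRestarts: 0}).config)
```

Every invalid field degrades to its default while still reporting an issue, so one bad value does not hide problems in the others.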
package/src/daemon.js CHANGED
@@ -233,6 +233,13 @@ export default class RollbridgeDaemon {
233
233
  return this.status()
234
234
  }
235
235
 
236
+ if (commandName === "restart") {
237
+ return await this.restartProcesses({
238
+ policy: stringOrUndefined(data.policy),
239
+ processId: stringOrUndefined(data.processId)
240
+ })
241
+ }
242
+
236
243
  if (commandName === "shutdown") {
237
244
  setImmediate(() => {
238
245
  this.shutdown().catch((error) => {
@@ -365,6 +372,7 @@ export default class RollbridgeDaemon {
365
372
  env: nextDefinition.env,
366
373
  logger: nextDefinition.logger,
367
374
  outputLines: nextDefinition.outputLines,
375
+ restart: nextDefinition.restart,
368
376
  restartDelayMs: nextDefinition.restartDelayMs,
369
377
  shouldRestart: nextDefinition.shouldRestart,
370
378
  stopTimeoutMs: nextDefinition.stopTimeoutMs
@@ -394,6 +402,75 @@ export default class RollbridgeDaemon {
394
402
  }
395
403
  }
396
404
 
405
+ /**
406
+ * Restarts non-proxied processes selected by id or policy, or all of them: running
407
+ * processes are bounced (stop then start) and crashed or stopped ones are revived,
408
+ * matching the conventional meaning of "restart".
409
+ *
410
+ * The proxied process is never restarted in place (that would drop traffic); use a
411
+ * deploy for a zero-downtime replacement.
412
+ * @param {{policy?: string, processId?: string}} selector - Restart selector; restarts all non-proxied processes when both are omitted.
413
+ * @returns {Promise<Record<string, JsonValue>>} The ids that were restarted.
414
+ */
415
+ async restartProcesses({policy, processId} = {}) {
416
+ if (policy === "proxied" || (processId !== undefined && this.isProxiedId(processId))) {
417
+ throw new Error('The proxied process cannot be restarted in place; use "rollbridge deploy" for a zero-downtime replacement.')
418
+ }
419
+
420
+ const targets = this.collectRestartTargets({policy, processId})
421
+
422
+ if (processId !== undefined && targets.length === 0) {
423
+ throw new Error(`No managed process with id "${processId}" to restart.`)
424
+ }
425
+
426
+ for (const target of targets) {
427
+ this.logger("process restart requested", {processId: target.id})
428
+ await target.process.stop()
429
+ await target.process.start()
430
+ }
431
+
432
+ return {restarted: targets.map((target) => target.id)}
433
+ }
434
+
435
+ /**
436
+ * @param {{policy?: string, processId?: string}} selector - Restart selector.
437
+ * @returns {{id: string, process: import("./managed-process.js").default}[]} Managed non-proxied process instances matching the selector, whether running, stopped, or failed.
438
+ */
439
+ collectRestartTargets({policy, processId}) {
440
+ const targets = /** @type {{id: string, process: import("./managed-process.js").default}[]} */ ([])
441
+
442
+ for (const processConfig of this.config.processes) {
443
+ if (processConfig.policy === "proxied") continue
444
+ if (processId !== undefined && processConfig.id !== processId) continue
445
+ if (policy !== undefined && processConfig.policy !== policy) continue
446
+
447
+ const process = this.findProcessInstance(processConfig)
448
+
449
+ if (process) targets.push({id: processConfig.id, process})
450
+ }
451
+
452
+ return targets
453
+ }
454
+
455
+ /**
456
+ * @param {import("./config.js").ProcessConfig} processConfig - Process definition.
457
+ * @returns {import("./managed-process.js").default | undefined} The managed instance, if any (it may be stopped or failed).
458
+ */
459
+ findProcessInstance(processConfig) {
460
+ if (processConfig.policy === "service") return this.services.get(processConfig.id)
461
+ if (processConfig.policy === "singleton") return this.singletons.get(processConfig.id)
462
+
463
+ return this.activeRelease ? this.activeRelease.getProcess(processConfig.id) : undefined
464
+ }
465
+
466
+ /**
467
+ * @param {string} id - Process id.
468
+ * @returns {boolean} True when the id belongs to the proxied process.
469
+ */
470
+ isProxiedId(id) {
471
+ return this.config.processes.some((processConfig) => processConfig.policy === "proxied" && processConfig.id === id)
472
+ }
473
+
397
474
  /**
398
475
  * @param {string | undefined} releaseId - Release id, or active release when omitted.
399
476
  * @returns {Promise<void>} Resolves when stopped.
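
The selector rules in `restartProcesses` and `collectRestartTargets` can be summarised as a pure function over the process list. This is a minimal sketch under stated assumptions: `selectRestartTargets` is a hypothetical name, it returns ids only, and it skips the lookup of a live `ManagedProcess` instance that the daemon performs before adding a target. The sample ids mirror the ones used in the tests in this diff.

```javascript
// Standalone sketch of the restart selector; not part of the package.
function selectRestartTargets(processes, {policy, processId} = {}) {
  const isProxiedId = processes.some((config) => config.policy === "proxied" && config.id === processId)

  // The proxied process is refused up front, whether selected by policy or by id.
  if (policy === "proxied" || isProxiedId) {
    throw new Error("The proxied process cannot be restarted in place")
  }

  const ids = processes
    .filter((config) => config.policy !== "proxied")
    .filter((config) => processId === undefined || config.id === processId)
    .filter((config) => policy === undefined || config.policy === policy)
    .map((config) => config.id)

  // An explicit id that matches nothing is an error; an empty policy match is not.
  if (processId !== undefined && ids.length === 0) {
    throw new Error(`No managed process with id "${processId}" to restart.`)
  }

  return ids
}

const sampleProcesses = [
  {id: "beacon", policy: "service"},
  {id: "worker", policy: "companion"},
  {id: "web", policy: "proxied"},
  {id: "jobs-main", policy: "singleton"}
]

console.log(selectRestartTargets(sampleProcesses))
console.log(selectRestartTargets(sampleProcesses, {policy: "companion"}))
```

With both selectors omitted, every non-proxied process is selected, which matches the "restart with no selector" test later in this diff.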
@@ -8,7 +8,7 @@ import {spawn} from "node:child_process"
8
8
  * @typedef {"starting" | "running" | "stopping" | "stopped" | "failed"} ManagedProcessState
9
9
  * @typedef {import("node:child_process").ChildProcess["signalCode"]} ProcessExitSignal
10
10
  * @typedef {{at: string, line: string, stream: "stdout" | "stderr"}} ManagedProcessLog
11
- * @typedef {{command: string, cwd: string | undefined, env: Record<string, string | undefined>, logger: (message: string, data?: Record<string, import("./json.js").JsonValue>) => void, outputLines: number, restartDelayMs: number, shouldRestart: () => boolean, stopTimeoutMs: number}} ManagedProcessDefinition
11
+ * @typedef {{command: string, cwd: string | undefined, env: Record<string, string | undefined>, logger: (message: string, data?: Record<string, import("./json.js").JsonValue>) => void, outputLines: number, restart: import("./config.js").RestartConfig, restartDelayMs: number, shouldRestart: () => boolean, stopTimeoutMs: number}} ManagedProcessDefinition
12
12
  * @typedef {{command: string, cwd: string | undefined, exitCode: number | null | undefined, exitSignal: ProcessExitSignal | undefined, id: string, logs: ManagedProcessLog[], pid: number | undefined, restarts: number, startedAt: string | undefined, state: ManagedProcessState, uptimeMs: number | undefined}} ManagedProcessStatus
13
13
  */
14
14
 
@@ -21,11 +21,12 @@ export default class ManagedProcess extends EventEmitter {
21
21
  * @param {string} args.id - Process id.
22
22
  * @param {(message: string, data?: Record<string, JsonValue>) => void} args.logger - Logger callback.
23
23
  * @param {number} args.outputLines - Recent stdout/stderr lines to retain and report.
24
+ * @param {import("./config.js").RestartConfig} [args.restart] - Restart policy (defaults to unlimited restarts with a constant delay).
24
25
  * @param {number} args.restartDelayMs - Restart delay.
25
26
  * @param {() => boolean} args.shouldRestart - Restart policy callback.
26
27
  * @param {number} args.stopTimeoutMs - Stop timeout.
27
28
  */
28
- constructor({command, cwd, env, id, logger, outputLines, restartDelayMs, shouldRestart, stopTimeoutMs}) {
29
+ constructor({command, cwd, env, id, logger, outputLines, restart = {backoffFactor: 1, maxDelayMs: 0, maxRestarts: undefined, windowMs: 0}, restartDelayMs, shouldRestart, stopTimeoutMs}) {
29
30
  super()
30
31
 
31
32
  this.command = command
@@ -34,12 +35,14 @@ export default class ManagedProcess extends EventEmitter {
34
35
  this.id = id
35
36
  this.logger = logger
36
37
  this.outputLines = outputLines
38
+ this.restart = restart
37
39
  this.restartDelayMs = restartDelayMs
38
40
  this.shouldRestart = shouldRestart
39
41
  this.stopTimeoutMs = stopTimeoutMs
40
42
  this.state = /** @type {ManagedProcessState} */ ("stopped")
41
43
  this.logs = /** @type {ManagedProcessLog[]} */ ([])
42
44
  this.restarts = 0
45
+ this.recentRestarts = /** @type {number[]} */ ([])
43
46
  this.startedAtMs = /** @type {number | undefined} */ (undefined)
44
47
  this.intentionalStop = false
45
48
  this.restartTimer = undefined
@@ -106,6 +109,7 @@ export default class ManagedProcess extends EventEmitter {
106
109
  this.env = definition.env
107
110
  this.logger = definition.logger
108
111
  this.outputLines = definition.outputLines
112
+ this.restart = definition.restart
109
113
  this.restartDelayMs = definition.restartDelayMs
110
114
  this.shouldRestart = definition.shouldRestart
111
115
  this.stopTimeoutMs = definition.stopTimeoutMs
@@ -146,14 +150,66 @@ export default class ManagedProcess extends EventEmitter {
146
150
  this.emit("exit", {code, signal})
147
151
 
148
152
  if (!wasIntentional && this.shouldRestart()) {
149
- this.restartTimer = setTimeout(() => {
150
- this.restartTimer = undefined
151
- this.restarts += 1
152
- this.start().catch((error) => {
153
- this.logger("process restart failed", {error: error instanceof Error ? error.message : String(error), id: this.id})
154
- })
155
- }, this.restartDelayMs)
153
+ this.scheduleRestart()
154
+ }
155
+ }
156
+
157
+ /**
158
+ * Schedules an automatic restart per the restart policy, or gives up once the policy's limit is reached.
159
+ * @returns {void}
160
+ */
161
+ scheduleRestart() {
162
+ const {backoffFactor, maxRestarts, windowMs} = this.restart
163
+
164
+ // Fast path: unlimited restarts with a constant delay needs no per-restart bookkeeping.
165
+ // The delay is constant across attempts here (backoffFactor is 1), so restartDelayFor(0)
166
+ // gives the right value while still applying any maxDelayMs cap.
167
+ if (maxRestarts === undefined && backoffFactor === 1) {
168
+ this.queueRestart(this.restartDelayFor(0))
169
+
170
+ return
171
+ }
172
+
173
+ const now = Date.now()
174
+
175
+ if (windowMs > 0) {
176
+ this.recentRestarts = this.recentRestarts.filter((time) => time > now - windowMs)
156
177
  }
178
+
179
+ if (maxRestarts !== undefined && this.recentRestarts.length >= maxRestarts) {
180
+ this.logger("restart limit reached", {id: this.id, maxRestarts, windowMs})
181
+
182
+ return
183
+ }
184
+
185
+ const delay = this.restartDelayFor(this.recentRestarts.length)
186
+
187
+ this.recentRestarts.push(now)
188
+ this.queueRestart(delay)
189
+ }
190
+
191
+ /**
192
+ * @param {number} attempt - Number of restarts already counted in the current window.
193
+ * @returns {number} Backed-off restart delay in milliseconds, capped by maxDelayMs when set.
194
+ */
195
+ restartDelayFor(attempt) {
196
+ const backedOff = this.restartDelayMs * this.restart.backoffFactor ** attempt
197
+
198
+ return this.restart.maxDelayMs > 0 ? Math.min(backedOff, this.restart.maxDelayMs) : backedOff
199
+ }
200
+
201
+ /**
202
+ * @param {number} delayMs - Delay before the restart attempt.
203
+ * @returns {void}
204
+ */
205
+ queueRestart(delayMs) {
206
+ this.restartTimer = setTimeout(() => {
207
+ this.restartTimer = undefined
208
+ this.restarts += 1
209
+ this.start().catch((error) => {
210
+ this.logger("process restart failed", {error: error instanceof Error ? error.message : String(error), id: this.id})
211
+ })
212
+ }, delayMs)
157
213
  }
158
214
 
159
215
  /**
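
The delay and budget arithmetic in `scheduleRestart` and `restartDelayFor` is easier to follow with concrete numbers. The sketch below replays a list of crash timestamps through the same filter, limit check, and backoff formula. `restartDelay` and `simulateCrashes` are hypothetical names, and the timestamps are invented for illustration. In the daemon a process stays `failed` once the limit is hit, so the final crash in the sample assumes the process was started again by other means and that its restart history was kept.

```javascript
// Standalone sketch of the restart delay and windowed budget; not part of the package.
function restartDelay(baseDelayMs, {backoffFactor, maxDelayMs}, attempt) {
  const backedOff = baseDelayMs * backoffFactor ** attempt

  // maxDelayMs of 0 means "no cap".
  return maxDelayMs > 0 ? Math.min(backedOff, maxDelayMs) : backedOff
}

// Returns the delay chosen for each crash, or null where the budget is exhausted.
function simulateCrashes(crashTimesMs, baseDelayMs, policy) {
  let recent = []

  return crashTimesMs.map((now) => {
    // Only restarts inside the window count against the budget and the backoff exponent.
    if (policy.windowMs > 0) recent = recent.filter((time) => time > now - policy.windowMs)
    if (policy.maxRestarts !== undefined && recent.length >= policy.maxRestarts) return null

    const delay = restartDelay(baseDelayMs, policy, recent.length)

    recent.push(now)

    return delay
  })
}

const policy = {backoffFactor: 2, maxDelayMs: 30000, maxRestarts: 5, windowMs: 60000}
const delays = simulateCrashes([0, 2000, 5000, 10000, 19000, 36000, 70000], 1000, policy)

console.log(delays) // → [1000, 2000, 4000, 8000, 16000, null, 2000]
```

Because the exponent is the number of restarts still inside the window, the delay shrinks again as old restarts age out. With `windowMs: 0` nothing is pruned, so `maxRestarts` acts as a lifetime cap and the backoff keeps growing until `maxDelayMs` applies.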
@@ -80,6 +80,14 @@ export default class ReleaseGroup extends EventEmitter {
80
80
  }
81
81
  }
82
82
 
83
+ /**
84
+ * @param {string} id - Process id.
85
+ * @returns {ManagedProcess | undefined} This release's managed process with the given id, if present.
86
+ */
87
+ getProcess(id) {
88
+ return this.processes.get(id)
89
+ }
90
+
83
91
  /**
84
92
  * Logs process diagnostics before failed startup cleanup stops and removes the release processes.
85
93
  * @param {Error | string} error - Startup failure.
@@ -170,6 +178,7 @@ export default class ReleaseGroup extends EventEmitter {
170
178
  id: processConfig.id,
171
179
  logger: (message, data = {}) => this.logger(message, {processId: processConfig.id, releaseId: this.releaseId, ...data}),
172
180
  outputLines: processConfig.outputLines,
181
+ restart: processConfig.restart,
173
182
  restartDelayMs: processConfig.restartDelayMs,
174
183
  shouldRestart: options.shouldRestart || (() => this.state === "active" || this.state === "starting"),
175
184
  stopTimeoutMs: processConfig.gracefulStopMs
@@ -86,6 +86,46 @@ test("validateConfig defaults outputLines and accepts a positive override", () =
86
86
  assert.equal(config.processes[1].outputLines, 5)
87
87
  })
88
88
 
89
+ test("validateConfig defaults the restart policy, accepts overrides, and rejects bad values", () => {
90
+ /**
91
+ * @param {import("../src/json.js").JsonValue} restart - Restart policy under test, or undefined to omit it.
92
+ * @returns {{config: import("../src/config.js").RollbridgeConfig, issues: import("../src/config.js").ConfigIssue[]}} Validation result.
93
+ */
94
+ const validateRestart = (restart) => validateConfig({
95
+ application: "demo",
96
+ control: {path: "/tmp/demo.sock"},
97
+ processes: [{command: "run web", id: "web", policy: "proxied", port: {from: 18000, to: 18099}, restart}],
98
+ proxy: {host: "127.0.0.1", port: 8182}
99
+ })
100
+
101
+ const defaulted = validateRestart(undefined)
102
+
103
+ assert.deepEqual(defaulted.issues, [])
104
+ assert.deepEqual(defaulted.config.processes[0].restart, {backoffFactor: 1, maxDelayMs: 0, maxRestarts: undefined, windowMs: 0})
105
+
106
+ const custom = validateRestart({backoffFactor: 2, maxDelayMs: 30000, maxRestarts: 5, windowMs: 60000})
107
+
108
+ assert.deepEqual(custom.issues, [])
109
+ assert.deepEqual(custom.config.processes[0].restart, {backoffFactor: 2, maxDelayMs: 30000, maxRestarts: 5, windowMs: 60000})
110
+
111
+ // maxRestarts: 0 disables automatic restarts.
112
+ const disabled = validateRestart({maxRestarts: 0})
113
+
114
+ assert.deepEqual(disabled.issues, [])
115
+ assert.equal(disabled.config.processes[0].restart.maxRestarts, 0)
116
+
117
+ const invalid = validateRestart({backoffFactor: 0.5, maxDelayMs: -1, maxRestarts: -2, windowMs: -3})
118
+ const messages = invalid.issues.map((issue) => issue.message)
119
+
120
+ assert.ok(messages.includes("processes[0].restart.backoffFactor must be a number greater than or equal to 1"), JSON.stringify(messages))
121
+ assert.ok(messages.includes("processes[0].restart.maxRestarts must be a non-negative integer"), JSON.stringify(messages))
122
+ assert.ok(messages.includes("processes[0].restart.maxDelayMs must be a non-negative number"), JSON.stringify(messages))
123
+ assert.ok(messages.includes("processes[0].restart.windowMs must be a non-negative number"), JSON.stringify(messages))
124
+
125
+ // A fractional maxRestarts is rejected (it must be a whole number of restarts).
126
+ assert.ok(validateRestart({maxRestarts: 1.5}).issues.some((issue) => issue.message === "processes[0].restart.maxRestarts must be a non-negative integer"))
127
+ })
128
+
89
129
  test("validateConfig rejects a non-positive-integer outputLines with a fix", () => {
90
130
  const {issues} = validateConfig({
91
131
  application: "demo",
@@ -104,3 +104,89 @@ test("counts automatic restarts and reports startedAt and uptime while running",
104
104
  await managed.stop()
105
105
  }
106
106
  })
107
+
108
+ /**
109
+ * Builds a managed crasher with a specific restart policy.
110
+ * @param {import("../src/config.js").RestartConfig} restart - Restart policy.
111
+ * @returns {ManagedProcess} Managed process.
112
+ */
113
+ function buildCrasher(restart) {
114
+ return new ManagedProcess({
115
+ command: `${JSON.stringify(process.execPath)} ${JSON.stringify(crasherPath)}`,
116
+ cwd: undefined,
117
+ env: {},
118
+ id: "crasher",
119
+ logger: () => {},
120
+ outputLines: 50,
121
+ restart,
122
+ restartDelayMs: 10,
123
+ shouldRestart: () => true,
124
+ stopTimeoutMs: 500
125
+ })
126
+ }
127
+
128
+ test("does not auto-restart when the restart policy is disabled (maxRestarts: 0)", async () => {
129
+ const managed = buildCrasher({backoffFactor: 1, maxDelayMs: 0, maxRestarts: 0, windowMs: 0})
130
+
131
+ try {
132
+ await managed.start()
133
+
134
+ // The fixture exits ~40ms after start; with restarts disabled it should stay failed.
135
+ await waitFor(() => managed.status().state === "failed")
136
+ await new Promise((resolve) => setTimeout(resolve, 100))
137
+
138
+ assert.equal(managed.status().restarts, 0)
139
+ assert.equal(managed.status().state, "failed")
140
+ } finally {
141
+ await managed.stop()
142
+ }
143
+ })
144
+
145
+ test("stops auto-restarting once maxRestarts within the window is reached", async () => {
146
+ const managed = buildCrasher({backoffFactor: 1, maxDelayMs: 0, maxRestarts: 2, windowMs: 60000})
147
+
148
+ try {
149
+ await managed.start()
150
+
151
+ // It restarts at most twice within the window, then gives up and stays failed.
152
+ await waitFor(() => managed.status().restarts === 2 && managed.status().state === "failed")
153
+ await new Promise((resolve) => setTimeout(resolve, 100))
154
+
155
+ assert.equal(managed.status().restarts, 2)
156
+ assert.equal(managed.status().state, "failed")
157
+ } finally {
158
+ await managed.stop()
159
+ }
160
+ })
161
+
162
+ test("applies exponential backoff to restart delays, capped by maxDelayMs", () => {
163
+ const capped = buildCrasher({backoffFactor: 2, maxDelayMs: 500, maxRestarts: undefined, windowMs: 0})
164
+
165
+ // restartDelayMs (10) * 2 ** attempt, capped at 500.
166
+ assert.equal(capped.restartDelayFor(0), 10)
167
+ assert.equal(capped.restartDelayFor(1), 20)
168
+ assert.equal(capped.restartDelayFor(2), 40)
169
+ assert.equal(capped.restartDelayFor(6), 500) // 10 * 64 = 640, capped to 500
170
+ assert.equal(capped.restartDelayFor(7), 500)
171
+
172
+ // maxDelayMs: 0 means no cap.
173
+ const uncapped = buildCrasher({backoffFactor: 3, maxDelayMs: 0, maxRestarts: undefined, windowMs: 0})
174
+
175
+ assert.equal(uncapped.restartDelayFor(0), 10)
176
+ assert.equal(uncapped.restartDelayFor(2), 90)
177
+ })
178
+
179
+ test("the unlimited constant-delay fast path still applies maxDelayMs", () => {
180
+ // restartDelayMs (10) above maxDelayMs (5), with no backoff and unlimited restarts.
181
+ const managed = buildCrasher({backoffFactor: 1, maxDelayMs: 5, maxRestarts: undefined, windowMs: 0})
182
+
183
+ assert.equal(managed.restartDelayFor(0), 5)
184
+
185
+ /** @type {number | undefined} */
186
+ let queued
187
+
188
+ managed.queueRestart = (delayMs) => { queued = delayMs }
189
+ managed.scheduleRestart()
190
+
191
+ assert.equal(queued, 5)
192
+ })
@@ -141,6 +141,38 @@ test("singleton processes restart without overlap during deploy", async () => {
141
141
  }
142
142
  })
143
143
 
144
+ test("a failed singleton replacement surfaces the error after stopping the old singleton", async () => {
145
+ // The singleton's working directory is per-release; only the v1 directory exists, so
146
+ // the v2 replacement cannot spawn (ENOENT on cwd) and its start() rejects.
147
+ const fixture = await createFixture({includeSingleton: true, singletonCwd: "{{releasePath}}/{{releaseId}}"})
148
+ const daemon = await startDaemon(fixture.config)
149
+
150
+ await fs.mkdir(path.join(fixture.root, "v1"))
151
+
152
+ try {
153
+ await daemon.deploy({releaseId: "v1", releasePath: fixture.root, revision: "v1"})
154
+ await waitFor(async () => (await processEvents(fixture.singletonLogPath)).some((event) => event.event === "start" && event.releaseId === "v1"))
155
+
156
+ // The new release's singleton fails to start, so the deploy surfaces the error.
157
+ await assert.rejects(() => daemon.deploy({releaseId: "v2", releasePath: fixture.root, revision: "v2"}))
158
+
159
+ // The old singleton is stopped before the new one is started, so two copies never
160
+ // overlap — even when the replacement then fails.
161
+ await waitFor(async () => (await processEvents(fixture.singletonLogPath)).some((event) => event.event === "stop" && event.releaseId === "v1"))
162
+
163
+ const status = daemon.status()
164
+
165
+ // Traffic switches before singletons are replaced, so the new release is already active,
166
+ // but its singleton is left failed with no replacement running.
167
+ assert.equal(status.activeReleaseId, "v2")
168
+ assert.equal(status.singletons.length, 1)
169
+ assert.equal(status.singletons[0].process.state, "failed")
170
+ } finally {
171
+ await daemon.shutdown()
172
+ await fs.rm(fixture.root, {force: true, recursive: true})
173
+ }
174
+ })
175
+
144
176
  test("service processes start before releases and restart with the latest deploy template", async () => {
145
177
  const fixture = await createFixture({includeService: true, webDependsOnService: true})
146
178
  const daemon = await startDaemon(fixture.config)
@@ -173,6 +205,137 @@ test("service processes start before releases and restart with the latest deploy
173
205
  }
174
206
  })
175
207
 
208
+ test("restart bounces a single process by id", async () => {
209
+ const fixture = await createFixture({includeService: true})
210
+ const daemon = await startDaemon(fixture.config)
211
+
212
+ try {
213
+ await daemon.deploy({releaseId: "v1", releasePath: fixture.root, revision: "v1"})
214
+
215
+ const before = pidsById(daemon.status())
216
+ const result = await daemon.restartProcesses({processId: "beacon"})
217
+
218
+ assert.deepEqual(result.restarted, ["beacon"])
219
+
220
+ const after = pidsById(daemon.status())
221
+
222
+ assert.ok(before.beacon && after.beacon, "beacon should have a pid before and after")
223
+ assert.notEqual(after.beacon, before.beacon)
224
+ } finally {
225
+ await daemon.shutdown()
226
+ await fs.rm(fixture.root, {force: true, recursive: true})
227
+ }
228
+ })
229
+
230
+ test("restart with no selector bounces every non-proxied process but not the proxied one", async () => {
231
+ const fixture = await createFixture({includeCompanion: true, includeService: true, includeSingleton: true})
232
+ const daemon = await startDaemon(fixture.config)
233
+
234
+ try {
235
+ await daemon.deploy({releaseId: "v1", releasePath: fixture.root, revision: "v1"})
236
+
237
+ const before = pidsById(daemon.status())
238
+ const result = await daemon.restartProcesses()
239
+ const restarted = /** @type {string[]} */ (result.restarted)
240
+
241
+ assert.deepEqual([...restarted].sort(), ["beacon", "jobs-main", "worker"])
242
+
243
+ const after = pidsById(daemon.status())
244
+
245
+ assert.equal(after.web, before.web, "proxied process should not be restarted")
246
+ assert.notEqual(after.beacon, before.beacon)
247
+ assert.notEqual(after["jobs-main"], before["jobs-main"])
248
+ assert.notEqual(after.worker, before.worker)
249
+ } finally {
250
+ await daemon.shutdown()
251
+ await fs.rm(fixture.root, {force: true, recursive: true})
252
+ }
253
+ })
254
+
255
+ test("restart --policy targets only processes with that policy", async () => {
256
+ const fixture = await createFixture({includeCompanion: true, includeService: true})
257
+ const daemon = await startDaemon(fixture.config)
258
+
259
+ try {
260
+ await daemon.deploy({releaseId: "v1", releasePath: fixture.root, revision: "v1"})
261
+
262
+ const before = pidsById(daemon.status())
263
+ const result = await daemon.restartProcesses({policy: "companion"})
264
+
265
+ assert.deepEqual(result.restarted, ["worker"])
266
+
267
+ const after = pidsById(daemon.status())
268
+
269
+ assert.notEqual(after.worker, before.worker)
270
+ assert.equal(after.beacon, before.beacon, "the service should be left running")
271
+ } finally {
272
+ await daemon.shutdown()
273
+ await fs.rm(fixture.root, {force: true, recursive: true})
274
+ }
275
+ })
276
+
277
+ test("restart refuses the proxied process and reports unknown ids", async () => {
278
+ const fixture = await createFixture()
279
+ const daemon = await startDaemon(fixture.config)
280
+
281
+ try {
282
+ await daemon.deploy({releaseId: "v1", releasePath: fixture.root, revision: "v1"})
283
+
284
+ await assert.rejects(() => daemon.restartProcesses({processId: "web"}), /proxied process cannot be restarted/)
285
+ await assert.rejects(() => daemon.restartProcesses({policy: "proxied"}), /proxied process cannot be restarted/)
286
+ await assert.rejects(() => daemon.restartProcesses({processId: "missing"}), /No managed process with id "missing"/)
287
+ } finally {
288
+ await daemon.shutdown()
289
+ await fs.rm(fixture.root, {force: true, recursive: true})
290
+ }
291
+ })
292
+
293
+ test("restart revives a stopped process instead of erroring", async () => {
294
+ const fixture = await createFixture({includeCompanion: true})
295
+ const daemon = await startDaemon(fixture.config)
296
+
297
+ try {
298
+ await daemon.deploy({releaseId: "v1", releasePath: fixture.root, revision: "v1"})
299
+
300
+ // Simulate the worker having exited (e.g. crashed and exhausted its restart budget).
301
+ const worker = daemon.activeRelease?.getProcess("worker")
302
+
303
+ assert.ok(worker, "worker process should exist")
304
+ await worker.stop()
305
+ assert.equal(worker.status().state, "stopped")
306
+
307
+ const result = await daemon.restartProcesses({processId: "worker"})
308
+
309
+ assert.deepEqual(result.restarted, ["worker"])
310
+ assert.equal(worker.status().state, "running")
311
+ assert.ok(worker.status().pid)
312
+ } finally {
313
+ await daemon.shutdown()
314
+ await fs.rm(fixture.root, {force: true, recursive: true})
315
+ }
316
+ })
317
+
318
+ test("the restart control command bounces a process over the socket", async () => {
319
+ const fixture = await createFixture({includeService: true})
320
+ const daemon = await startDaemon(fixture.config)
321
+
322
+ try {
323
+ await daemon.deploy({releaseId: "v1", releasePath: fixture.root, revision: "v1"})
324
+
325
+ const before = pidsById(daemon.status())
326
+ const response = await sendControlCommand({
327
+ command: {command: "restart", processId: "beacon"},
328
+ path: fixture.config.control.path
329
+ })
330
+
331
+ assert.deepEqual(response.restarted, ["beacon"])
332
+ assert.notEqual(pidsById(daemon.status()).beacon, before.beacon)
333
+ } finally {
334
+ await daemon.shutdown()
335
+ await fs.rm(fixture.root, {force: true, recursive: true})
336
+ }
337
+ })
338
+
176
339
  test("control socket accepts deploy and status commands", async () => {
177
340
  const fixture = await createFixture()
178
341
  const daemon = await startDaemon(fixture.config)
@@ -336,7 +499,7 @@ test("deploy can ensure the daemon before sending the release command", async ()
336
499
  })
337
500
 
338
501
  /**
339
- * @param {{includeService?: boolean, includeSingleton?: boolean, proxyHost?: string, webCommand?: string, webDependsOnService?: boolean, webHealthTimeoutMs?: number}} [options] - Fixture options.
502
+ * @param {{includeCompanion?: boolean, includeService?: boolean, includeSingleton?: boolean, proxyHost?: string, singletonCwd?: string, webCommand?: string, webDependsOnService?: boolean, webHealthTimeoutMs?: number}} [options] - Fixture options.
340
503
  * @returns {Promise<{config: import("../src/config.js").RollbridgeConfig, root: string, serviceLogPath: string, singletonLogPath: string}>} Fixture data.
341
504
  */
342
505
  async function createFixture(options = {}) {
@@ -359,6 +522,14 @@ async function createFixture(options = {}) {
359
522
  })
360
523
  }
361
524
 
525
+ if (options.includeCompanion) {
526
+ processes.push({
527
+ command: `${JSON.stringify(process.execPath)} -e ${JSON.stringify("setInterval(() => {}, 1000)")}`,
528
+ id: "worker",
529
+ policy: "companion"
530
+ })
531
+ }
532
+
362
533
  processes.push({
363
534
  command: options.webCommand || (options.webDependsOnService
364
535
  ? `${JSON.stringify(process.execPath)} ${JSON.stringify(dependentAppPath)}`
@@ -376,6 +547,7 @@ async function createFixture(options = {}) {
376
547
  if (options.includeSingleton) {
377
548
  processes.push({
378
549
  command: `${JSON.stringify(process.execPath)} ${JSON.stringify(singletonAppPath)}`,
550
+ ...(options.singletonCwd ? {cwd: options.singletonCwd} : {}),
379
551
  env: {
380
552
  ROLLBRIDGE_SINGLETON_LOG: singletonLogPath
381
553
  },
@@ -466,6 +638,27 @@ function statusRelease(daemon, releaseId) {
466
638
  return release
467
639
  }
468
640
 
641
+ /**
642
+ * Maps process id to pid across the active release, services, and singletons.
643
+ * @param {import("../src/daemon.js").DaemonStatus} status - Daemon status payload.
644
+ * @returns {Record<string, number | undefined>} Process id to current pid.
645
+ */
646
+ function pidsById(status) {
647
+ /** @type {Record<string, number | undefined>} */
648
+ const pids = {}
649
+
650
+ for (const release of status.releases) {
651
+ if (release.state !== "active") continue
652
+
653
+ for (const processStatus of release.processes) pids[processStatus.id] = processStatus.pid
654
+ }
655
+
656
+ for (const service of status.services) pids[service.id] = service.process.pid
657
+ for (const singleton of status.singletons) pids[singleton.id] = singleton.process.pid
658
+
659
+ return pids
660
+ }
661
+
469
662
  /**
470
663
  * @param {string} logPath - Log path.
471
664
  * @returns {Promise<Array<{event: string, pid: number, releaseId: string}>>} Events.
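
The restart tests above compare the `pidsById` maps taken before and after a restart, one id at a time. A small helper can report every bounced id at once. This is a sketch for illustration only: `bouncedIds` is a hypothetical name that does not exist in the test suite, and the pids are made up.

```javascript
// Standalone sketch: lists the ids whose pid changed between two id-to-pid maps.
function bouncedIds(before, after) {
  return Object.keys(after)
    // A process counts as bounced only when it had a pid both times and the pid differs.
    .filter((id) => before[id] !== undefined && after[id] !== undefined && before[id] !== after[id])
    .sort()
}

const beforePids = {beacon: 4101, "jobs-main": 4102, web: 4103, worker: 4104}
const afterPids = {beacon: 4201, "jobs-main": 4202, web: 4103, worker: 4204}

console.log(bouncedIds(beforePids, afterPids))
```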
@@ -1,83 +0,0 @@
1
- #!/usr/bin/env node
2
- import {execFileSync} from "node:child_process"
3
-
4
- /**
5
- * Runs a command and inherits stdio.
6
- * @param {string} command - Command to run.
7
- * @param {string[]} [args] - Command arguments.
8
- * @returns {void}
9
- */
10
- function run(command, args = []) {
11
- execFileSync(command, args, {
12
- env: {
13
- ...process.env,
14
- GIT_EDITOR: "true",
15
- GIT_MERGE_AUTOEDIT: "no"
16
- },
17
- stdio: "inherit"
18
- })
19
- }
20
-
21
- /**
22
- * Runs a command and returns trimmed stdout.
23
- * @param {string} command - Command to run.
24
- * @param {string[]} [args] - Command arguments.
25
- * @returns {string} Trimmed stdout.
26
- */
27
- function output(command, args = []) {
28
- return execFileSync(command, args, {encoding: "utf8"}).trim()
29
- }
30
-
31
- /** @returns {string} GitHub remote default branch name. */
32
- function defaultBranch() {
33
- const remoteHead = output("git", ["ls-remote", "--symref", "origin", "HEAD"])
34
- const match = remoteHead.match(/^ref: refs\/heads\/(.+)\s+HEAD$/m)
35
-
36
- if (!match) throw new Error("Unable to determine origin default branch")
37
-
38
- return match[1]
39
- }
40
-
41
- /**
42
- * @param {string} branch - Branch name.
43
- * @returns {boolean} True when the local branch exists.
44
- */
45
- function localBranchExists(branch) {
46
- try {
47
- output("git", ["rev-parse", "--verify", `refs/heads/${branch}`])
48
- return true
49
- } catch (_error) {
50
- return false
51
- }
52
- }
53
-
54
- /** @returns {string} Updated default branch name. */
55
- function updateLocalDefaultBranch() {
56
- run("git", ["fetch", "origin"])
57
- const branch = defaultBranch()
58
-
59
- if (localBranchExists(branch)) {
60
- run("git", ["checkout", branch])
61
- } else {
62
- run("git", ["checkout", "-b", branch, `origin/${branch}`])
63
- }
64
-
65
- run("git", ["merge", "--ff-only", `origin/${branch}`])
66
-
67
- return branch
68
- }
69
-
70
- try {
71
- execFileSync("npm", ["whoami"], {stdio: "ignore"})
72
- } catch {
73
- run("npm", ["login"])
74
- }
75
-
76
- const branch = updateLocalDefaultBranch()
77
-
78
- run("npm", ["version", "patch", "--no-git-tag-version"])
79
- run("npm", ["install"])
80
- run("git", ["add", "package.json", "package-lock.json"])
81
- run("git", ["commit", "-m", "chore: bump patch version"])
82
- run("git", ["push", "origin", branch])
83
- run("npm", ["publish"])