quonfig 0.0.15 → 0.0.17
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -0
- data/README.md +143 -12
- data/lib/quonfig/client.rb +230 -22
- data/lib/quonfig/datadir_watcher.rb +113 -0
- data/lib/quonfig/options.rb +25 -2
- data/lib/quonfig/sse_config_client.rb +536 -225
- data/lib/quonfig/version.rb +1 -1
- data/lib/quonfig.rb +3 -1
- data/quonfig.gemspec +4 -1
- metadata +8 -7
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 68d9721e3220acc150e33b43e993c8b8b2380056453939b62b06b05cc4ef4255
+  data.tar.gz: 643c409f2b8fa3d5291d92fcf8a0d39cf5a2a67b4d697036497a19296f584d72
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c133fdcdf47da1b026465f42dfc71e98be03590bb0df80ebd247a572bce6404a59b5f411b014e83349ca2278af2c4c62974c22c1ea9c0c7b1af474d933e91725
+  data.tar.gz: c982a887a21dcfe7545e2b50b71a09f2fb7465827819676a238bdbd87cbeaa7626f26e723eb3535c8e3f893de4f7bcebc93e31264783b76716add5057ece836c
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,18 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.0.17 - 2026-05-19
|
|
4
|
+
|
|
5
|
+
- **Feat (datadir): opt-in `data_dir_auto_reload` (qfg-mol-2da).** Datadir mode previously loaded the workspace once at construction and served purely from memory. Set `data_dir_auto_reload: true` to have the SDK watch the configured `datadir`, re-read `Quonfig::Datadir.load_envelope`, and fire the existing `on_update` callback whenever files change. Adds `listen ~> 3.8` (FSEvents on macOS, inotify on Linux, polling fallback on Windows) as a runtime dep. Behavior: parse-then-swap (a failed parse keeps the previous envelope and skips the callback), debounced (`data_dir_auto_reload_debounce_ms`, default 200 ms — bursts coalesce to one reload), and gracefully downgrades when watch registration fails (read-only fs, immutable container, missing native backend). Symlinked datadirs are resolved to their real path before watching. Default is `false`; opt-in only.
|
|
6
|
+
- **Feat (datadir + fork): auto-restart the watcher across `fork(2)` (qfg-mol-2da).** The watcher uses a background thread, which does not survive fork. The existing `Process._fork` hook (qfg-ryov, Ruby 3.1+) now also tears the datadir watcher down in the parent before fork and rebuilds a fresh one on the same `Client` in each child — no customer wiring required for Puma clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Resque, or Spring. Ruby 3.0 customers continue to use the documented `Quonfig.fork` pattern in `on_worker_boot`, which rebuilds the watcher alongside the rest of the client.
|
|
7
|
+
|
|
8
|
+
## 0.0.16 - 2026-05-15
|
|
9
|
+
|
|
10
|
+
- **Feat (SSE): replace `ld-eventsource` with an SDK-owned reconnect loop (qfg-35sm).** sdk-ruby was the outlier among the four backend SDKs — sdk-go, sdk-node, and sdk-python all own their reconnect loop, only sdk-ruby handed it off to a library and scraped its log output to observe reconnects. The wire format we actually consume (plain JSON envelopes in single-line `data:` frames, no named events, no retry directives) is trivial enough that an SDK-owned loop is clearer than the library wrapper. New `Quonfig::SSEConfigClient` (~520 LoC, `lib/quonfig/sse_config_client.rb`) handles connect/parse/reconnect end-to-end. `restart_total` is now incremented at exactly one site under a mutex — verifiable, not log-scraped. `ld-eventsource` and the transitive `http` gem are removed from the gemspec. `ReconnectCountingLogger` and the `sse_reconnect_reset_interval` option (both 0.0.15-era defensive scaffolding around upstream behavior) are deleted — the bugs they defended against don't exist when the SDK owns the loop. Chaos: 10/10 in a 36-min run (scenarios 02 silent-stall, 05 sse-down-fallback, 09 flapping kill-storm).
|
|
11
|
+
- **Fix (SSE): contain watchdog `Thread#raise` with `Thread.handle_interrupt` (qfg-tj18).** The new watchdog fires `Thread#raise(SSEReadDeadlineExceeded)` into the worker on a silent stall — the only reliable cross-platform way to unblock a Ruby thread blocked in `Net::HTTP`'s body-read on macOS. The decision to fire is mutex-guarded against `stop()`, but the raise itself is delivered at the worker's next interrupt checkpoint, which could be anywhere in the call stack. `run_loop`'s body now runs under `Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)` so a late-landing raise can only land inside a blocking call; the `read_body` block explicitly switches to `:immediate`. A paranoid backstop `rescue` outside the until-`@stopped` loop ensures an escaped raise can never silently kill the worker.
|
|
12
|
+
- **Fix (SSE): isolate `on_envelope` callback exceptions (qfg-m3lk).** A buggy user-supplied listener that raised during envelope delivery used to propagate out of `read_body`, get caught by `run_loop` as a transport error, bump `restart_total`, and reconnect — a perpetual reconnect storm at api-delivery-sse driven by a customer code bug. The callback is now wrapped in `begin/rescue StandardError` at the invocation site; exceptions are logged with class + message + backtrace sample and the stream continues uninterrupted. `Interrupt` and `SystemExit` are deliberately not caught so `Ctrl-C` still works.
|
|
13
|
+
- **Fix (SSE): classify 401/403/404 as terminal errors (qfg-i5xv).** Non-200 responses used to be treated identically — `SSEHTTPStatusError` raised, `run_loop` rescued, `restart_total` bumped, backoff, retry, forever. For a bad SDK key (401) or revoked workspace (403) that was wasted load on api-delivery-sse with no recovery path short of a customer redeploy. New `SSEHTTPTerminalError` sentinel for 401/403/404; `run_loop` catches it, invokes `on_error`, exits the loop without bumping `restart_total`. Parent `Quonfig::Client` surfaces a terminal `:sse_terminal_failure` state distinct from transient `:error`. 429 and 5xx still retry.
|
|
14
|
+
- **Feat (fork): install `Process._fork` hook so SSE auto-restarts after fork (qfg-ryov).** Ruby threads do not survive `fork(2)`. Customers initializing `Quonfig::Client` in the Puma master (the `preload_app! true` / Rails `config.eager_load = true` convention) used to silently lose SSE in every worker child. New `Quonfig::ForkSafety` module prepends `Process._fork` and fans out across all live `Quonfig::Client` instances (tracked in an `ObjectSpace::WeakMap`): in the parent before the syscall, threaded components (SSE worker, polling supervisor, telemetry reporter) are torn down; in the child after the syscall, they are rebuilt. `@stopped` is preserved so a `stop()`-ed client stays stopped across fork. Covers `Process.fork` / `Kernel#fork`; `Process.spawn` and `system("...")` exec a new program so in-process state doesn't apply. Ruby 3.0 lacks `Process._fork` and is documented as requiring manual `before_fork` / `on_worker_boot` wiring.
|
|
15
|
+
|
|
3
16
|
## 0.0.15 - 2026-05-15
|
|
4
17
|
|
|
5
18
|
- **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
|
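The qfg-tj18 changelog entry above describes deferring a watchdog's `Thread#raise` until the worker is inside a blocking call. The following is a minimal, self-contained sketch of that Ruby mechanism, not the SDK's `run_loop`. The class `ReadDeadlineSketch`, the queue handshake, and the bare `sleep` standing in for a stalled body read are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for the SDK's SSEReadDeadlineExceeded.
class ReadDeadlineSketch < StandardError; end

ready = Queue.new
trace = []

worker = Thread.new do
  # Defer ReadDeadlineSketch until the thread is in a blocking operation.
  Thread.handle_interrupt(ReadDeadlineSketch => :on_blocking) do
    begin
      ready << :armed               # tell the watchdog the mask is active
      1_000.times { |i| i * i }     # non-blocking work: the raise cannot land here
      trace << :reached_blocking_call
      sleep                         # stands in for a stalled body read
    rescue ReadDeadlineSketch
      trace << :unblocked_by_watchdog
    end
  end
end

ready.pop                           # wait until the worker has masked interrupts
worker.raise(ReadDeadlineSketch)    # the "watchdog" fires
worker.join

p trace
```

The point of the mask is that the exception can only surface at a blocking call, so the `rescue` that wraps that call is the single place that has to handle it.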
data/README.md
CHANGED
|
@@ -107,6 +107,107 @@ export QUONFIG_ENVIRONMENT=production
|
|
|
107
107
|
client = Quonfig::Client.new # reads QUONFIG_DIR + QUONFIG_ENVIRONMENT
|
|
108
108
|
```
|
|
109
109
|
|
|
110
|
+
## Datadir mode: auto-reload on file changes
|
|
111
|
+
|
|
112
|
+
In datadir mode the SDK loads the workspace once at construction time and then
|
|
113
|
+
serves config purely from memory. Opt in to `data_dir_auto_reload: true` to
|
|
114
|
+
have the SDK watch the directory and re-read the envelope whenever files
|
|
115
|
+
change — an editor save, a `git pull`, or a build step that rewrites the
|
|
116
|
+
workspace.
|
|
117
|
+
|
|
118
|
+
```ruby
|
|
119
|
+
client = Quonfig::Client.new(
|
|
120
|
+
datadir: '/path/to/workspace',
|
|
121
|
+
environment: 'development',
|
|
122
|
+
data_dir_auto_reload: true # off by default — must be opted in
|
|
123
|
+
)
|
|
124
|
+
|
|
125
|
+
client.on_update do
|
|
126
|
+
puts 'Quonfig configs reloaded from disk'
|
|
127
|
+
end
|
|
128
|
+
|
|
129
|
+
# Edit a file under /path/to/workspace and on_update fires within ~200ms.
|
|
130
|
+
|
|
131
|
+
# On shutdown, stop stops the watcher and cancels any pending debounce.
|
|
132
|
+
client.stop
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
### When to enable
|
|
136
|
+
|
|
137
|
+
- Local development with the datadir checked out from git.
|
|
138
|
+
- Self-hosted servers that `git pull` the datadir on a schedule.
|
|
139
|
+
- CI jobs that mutate the datadir between assertions.
|
|
140
|
+
|
|
141
|
+
### When NOT to enable
|
|
142
|
+
|
|
143
|
+
- **Read-only / immutable filesystems** (some containers, scratch images,
|
|
144
|
+
AWS Lambda). Watch registration may fail; the SDK degrades gracefully
|
|
145
|
+
(logs the error and continues serving the envelope it loaded at init time)
|
|
146
|
+
but you're paying for nothing.
|
|
147
|
+
- **Build-time-embedded workflows** where the datadir is bundled into the
|
|
148
|
+
artifact and never changes at runtime. Watching wastes a thread and a
|
|
149
|
+
native-backend handle.
|
|
150
|
+
- **Production paths where reload timing matters** — e.g. you'd rather pin
|
|
151
|
+
the envelope you shipped with and roll forward through a redeploy than
|
|
152
|
+
have it shift under traffic.
|
|
153
|
+
|
|
154
|
+
Default is `false`; datadir mode is silent until you opt in.
|
|
155
|
+
|
|
156
|
+
### Behavior contract
|
|
157
|
+
|
|
158
|
+
- **Parse-then-swap.** If the new envelope fails to parse (truncated write,
|
|
159
|
+
mid-`git pull` state, invalid JSON), the SDK logs the error and **keeps
|
|
160
|
+
serving the previous envelope**. `on_update` is _not_ fired on parse
|
|
161
|
+
failure — only on a successful swap.
|
|
162
|
+
- **Debounced.** Bursts of filesystem events (atomic-rename editor saves,
|
|
163
|
+
`git pull` touching dozens of files) coalesce into a single re-read.
|
|
164
|
+
Default window: **200ms** — long enough to absorb the 3–5 events a typical
|
|
165
|
+
editor emits in <50ms, short enough that interactive edits feel immediate.
|
|
166
|
+
Tune via `data_dir_auto_reload_debounce_ms` if you need a different
|
|
167
|
+
window.
|
|
168
|
+
- **Graceful degrade.** If watch registration fails (read-only fs, immutable
|
|
169
|
+
container, missing native backend), the SDK logs and continues without
|
|
170
|
+
watching — it does **not** raise from the constructor.
|
|
171
|
+
- **Symlinks.** The watcher resolves `datadir` to its real path at start
|
|
172
|
+
time. Editing the file the symlink points at _is_ detected; atomic flips
|
|
173
|
+
that retarget the link itself are **not**.
|
|
174
|
+
- **Shutdown.** `client.stop` stops the watcher and cancels any pending
|
|
175
|
+
debounce. There is no separate handle to manage — the watcher lifecycle
|
|
176
|
+
is tied to the client.
|
|
177
|
+
|
|
178
|
+
### Fork safety (Puma cluster, Unicorn, Resque, Sidekiq)
|
|
179
|
+
|
|
180
|
+
The auto-reload watcher uses a background thread, which — like any Ruby
|
|
181
|
+
thread — does not survive `fork(2)`. **You do not need to wire this up
|
|
182
|
+
manually on Ruby 3.1+.** The SDK's `Process._fork` hook (see [Rails
|
|
183
|
+
integration](#rails-integration) below) stops the watcher in the parent
|
|
184
|
+
before fork and restarts a fresh watcher in each child after fork. This
|
|
185
|
+
covers Puma clustered mode, Unicorn, Sidekiq's parent-forks-workers model,
|
|
186
|
+
Resque, Spring, and manual `fork { ... }` calls.
|
|
187
|
+
|
|
188
|
+
On Ruby 3.0 (no `Process._fork`), follow the manual `before_fork` /
|
|
189
|
+
`on_worker_boot` pattern in the [Rails integration](#rails-integration)
|
|
190
|
+
section — `Quonfig.fork` rebuilds the full client, including the datadir
|
|
191
|
+
watcher, in the child.
|
|
192
|
+
|
|
193
|
+
### Tuning the debounce window
|
|
194
|
+
|
|
195
|
+
```ruby
|
|
196
|
+
Quonfig::Client.new(
|
|
197
|
+
datadir: '/path/to/workspace',
|
|
198
|
+
data_dir_auto_reload: true,
|
|
199
|
+
data_dir_auto_reload_debounce_ms: 1000 # wait a full second after the last event
|
|
200
|
+
)
|
|
201
|
+
```
|
|
202
|
+
|
|
203
|
+
The default (200 ms) is tuned for interactive editing. Raise it if you have
|
|
204
|
+
a noisy producer (continuously regenerating files) and you'd rather see one
|
|
205
|
+
reload per second than per save. Lower it only if you've measured that 200 ms
|
|
206
|
+
is meaningfully too slow for your use case.
|
|
207
|
+
|
|
208
|
+
See the [open-source / local how-to](https://docs.quonfig.com/docs/how-tos/open-source-local)
|
|
209
|
+
for the cross-SDK story (sdk-node, sdk-go, sdk-ruby, sdk-python, sdk-java).
|
|
210
|
+
|
|
110
211
|
## Environment variables
|
|
111
212
|
|
|
112
213
|
| Variable | Purpose |
|
|
@@ -130,7 +231,9 @@ Quonfig::Client.new(
|
|
|
130
231
|
on_no_default: :error,
|
|
131
232
|
global_context: {},
|
|
132
233
|
datadir: '/path/to/workspace',
|
|
133
|
-
environment: 'production'
|
|
234
|
+
environment: 'production',
|
|
235
|
+
data_dir_auto_reload: false,
|
|
236
|
+
data_dir_auto_reload_debounce_ms: 200
|
|
134
237
|
)
|
|
135
238
|
```
|
|
136
239
|
|
|
@@ -147,6 +250,8 @@ Quonfig::Client.new(
|
|
|
147
250
|
| `global_context` | `Hash` | `{}` | Context applied to every evaluation. |
|
|
148
251
|
| `datadir` | `String` | `ENV['QUONFIG_DIR']` | Path to a local workspace. When set, the SDK runs offline from disk. |
|
|
149
252
|
| `environment` | `String` | `ENV['QUONFIG_ENVIRONMENT']` | Environment to evaluate in datadir mode. Required when `datadir` is set. |
|
|
253
|
+
| `data_dir_auto_reload` | `Boolean` | `false` | Datadir mode only. When `true`, the SDK watches the datadir and re-reads the envelope when files change. See [Datadir mode: auto-reload on file changes](#datadir-mode-auto-reload-on-file-changes). |
|
|
254
|
+
| `data_dir_auto_reload_debounce_ms` | `Integer` (ms) | `200` | Debounce window for the auto-reload watcher — events arriving inside the window are coalesced into a single re-read. Ignored when `data_dir_auto_reload` is `false`. |
|
|
150
255
|
| `logger` | Logger-like object | `nil` | Optional host-app logger (e.g. `Rails.logger`). Must respond to `debug`/`info`/`warn`/`error`. When set, all SDK warnings/errors flow through this logger instead of the default stderr / SemanticLogger backend. |
|
|
151
256
|
|
|
152
257
|
## Typed getters
|
|
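The option rows above say that events arriving inside the debounce window are coalesced into a single re-read. As a rough illustration of that behavior, here is a trailing-edge debouncer sketch. It is not the SDK's `Quonfig::DatadirWatcher` (whose source is not shown in this section); `DebounceSketch`, `event!`, and `cancel` are hypothetical names, and a 50 ms window is used only to keep the demo short.

```ruby
# Hypothetical trailing-edge debouncer; not the SDK's DatadirWatcher.
class DebounceSketch
  def initialize(window_ms, &on_quiet)
    @window = window_ms / 1000.0
    @on_quiet = on_quiet
    @mutex = Mutex.new
    @generation = 0
  end

  # Record an event and (re)start the quiet-period countdown.
  def event!
    mine = @mutex.synchronize { @generation += 1 }
    Thread.new do
      sleep @window
      # Fire only if no newer event (or cancel) arrived during the window.
      @on_quiet.call if @mutex.synchronize { mine == @generation }
    end
  end

  # Invalidate any pending countdown, as a shutdown path would.
  def cancel
    @mutex.synchronize { @generation += 1 }
  end
end

reloads = 0
debouncer = DebounceSketch.new(50) { reloads += 1 }
5.times { debouncer.event! }   # a burst, e.g. one editor save
sleep 0.3                      # let the window elapse
puts "reloads after burst: #{reloads}"
```

Each event bumps a generation counter, and a countdown only fires if its generation is still the latest when it wakes, which is what collapses a burst into one callback.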
@@ -247,15 +352,41 @@ not want to inherit across a `fork(2)`. Forked threads in the child process
|
|
|
247
352
|
are dead — the SSE socket is held open by a thread that no longer exists, and
|
|
248
353
|
the child silently stops receiving live updates.
|
|
249
354
|
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
355
|
+
**On Ruby 3.1+ the SDK installs a `Process._fork` hook at load time** that
|
|
356
|
+
automatically tears down threaded components in the parent and restarts them
|
|
357
|
+
in the child. This covers any `Process.fork` / `Kernel#fork` path — Puma's
|
|
358
|
+
clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Spring, and
|
|
359
|
+
manual `fork { ... }` calls. **No customer wiring is required.**
|
|
360
|
+
|
|
361
|
+
Caveats:
|
|
362
|
+
|
|
363
|
+
- Ruby 3.0 has no hookable choke point — fall back to manual wiring (below).
|
|
364
|
+
- `system("fork-and-exec ...")` and `Process.spawn` are not covered (they do
|
|
365
|
+
not go through `Process._fork`), but those execute a new program, so the
|
|
366
|
+
in-process SSE state is moot.
|
|
367
|
+
- The hook tears down the SSE/polling/telemetry threads in the parent before
|
|
368
|
+
fork (so the child does not inherit a live socket fd) and does **not**
|
|
369
|
+
auto-restart the parent. This mirrors the Puma master case: the master no
|
|
370
|
+
longer serves requests, so it does not need a live SSE connection. If you
|
|
371
|
+
have a non-Puma topology where the parent must keep streaming after fork,
|
|
372
|
+
call `Quonfig.instance.after_fork_in_child` manually in the parent after
|
|
373
|
+
the fork returns.
|
|
254
374
|
|
|
255
375
|
### Puma (clustered mode)
|
|
256
376
|
|
|
377
|
+
With the automatic fork hook, the typical Puma config needs **no Quonfig
|
|
378
|
+
lifecycle wiring** — initialize in your Rails initializer and let the hook
|
|
379
|
+
handle the rest:
|
|
380
|
+
|
|
381
|
+
```ruby
|
|
382
|
+
# config/initializers/quonfig.rb
|
|
383
|
+
Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
|
|
384
|
+
```
|
|
385
|
+
|
|
386
|
+
If you're on Ruby 3.0 (no `Process._fork`), wire the legacy hooks manually:
|
|
387
|
+
|
|
257
388
|
```ruby
|
|
258
|
-
# config/puma.rb
|
|
389
|
+
# config/puma.rb (Ruby 3.0 only)
|
|
259
390
|
before_fork do
|
|
260
391
|
Quonfig.instance.stop # close the master's SSE before forking
|
|
261
392
|
end
|
|
@@ -265,18 +396,18 @@ on_worker_boot do
|
|
|
265
396
|
end
|
|
266
397
|
```
|
|
267
398
|
|
|
268
|
-
If you initialize Quonfig lazily (in a Rails initializer) and run Puma in
|
|
269
|
-
single mode (no clustering), no fork hook is needed.
|
|
270
|
-
|
|
271
399
|
### Sidekiq
|
|
272
400
|
|
|
273
|
-
|
|
401
|
+
On Ruby 3.1+ the automatic fork hook covers Sidekiq workers too — no
|
|
402
|
+
`configure_server` wiring required.
|
|
403
|
+
|
|
404
|
+
On Ruby 3.0:
|
|
274
405
|
|
|
275
406
|
```ruby
|
|
276
407
|
# config/initializers/quonfig.rb
|
|
277
408
|
Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
|
|
278
409
|
|
|
279
|
-
# config/initializers/sidekiq.rb
|
|
410
|
+
# config/initializers/sidekiq.rb (Ruby 3.0 only)
|
|
280
411
|
Sidekiq.configure_server do |config|
|
|
281
412
|
config.on(:startup) { Quonfig.fork if Process.ppid != 1 }
|
|
282
413
|
config.on(:shutdown) { Quonfig.instance.stop rescue nil }
|
|
@@ -284,7 +415,7 @@ end
|
|
|
284
415
|
```
|
|
285
416
|
|
|
286
417
|
For Sidekiq web/CLI processes that don't fork (default `concurrency: 1`),
|
|
287
|
-
`Quonfig.init` in the initializer is sufficient.
|
|
418
|
+
`Quonfig.init` in the initializer is sufficient on any Ruby version.
|
|
288
419
|
|
|
289
420
|
### Spring / Bootsnap preloaders
|
|
290
421
|
|
data/lib/quonfig/client.rb
CHANGED
|
@@ -20,6 +20,29 @@ module Quonfig
|
|
|
20
20
|
class Client
|
|
21
21
|
LOG = Quonfig::InternalLogger.new(self)
|
|
22
22
|
|
|
23
|
+
# qfg-ryov: instance registry for the Process._fork hook. Every live
|
|
24
|
+
# Client is tracked here so the hook can fan out before_fork_in_parent /
|
|
25
|
+
# after_fork_in_child across all of them without the customer needing to
|
|
26
|
+
# name a specific instance. ObjectSpace::WeakMap means a Client that goes
|
|
27
|
+
# out of scope is GC'd without leaking through this registry. Stopped
|
|
28
|
+
# Clients stay in the registry until GC; both fork hooks early-return on
|
|
29
|
+
# +@stopped+ so a stopped instance is effectively a no-op. (We don't use
|
|
30
|
+
# WeakMap#delete because it was added in Ruby 3.3 and the matrix still
|
|
31
|
+
# includes 3.2.)
|
|
32
|
+
@instances = ObjectSpace::WeakMap.new
|
|
33
|
+
@instances_mutex = Mutex.new
|
|
34
|
+
|
|
35
|
+
class << self
|
|
36
|
+
# Iterate live Client instances. Used by Quonfig::ForkSafety.
|
|
37
|
+
def each_instance(&block)
|
|
38
|
+
@instances_mutex.synchronize { @instances.keys }.each(&block)
|
|
39
|
+
end
|
|
40
|
+
|
|
41
|
+
def register_instance(client)
|
|
42
|
+
@instances_mutex.synchronize { @instances[client] = true }
|
|
43
|
+
end
|
|
44
|
+
end
|
|
45
|
+
|
|
23
46
|
attr_reader :options, :resolver, :store, :evaluator, :instance_hash,
|
|
24
47
|
:config_loader, :telemetry_reporter
|
|
25
48
|
|
|
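The registry added in the hunk above can be exercised in isolation. This sketch reproduces the same shape (an `ObjectSpace::WeakMap` guarded by a `Mutex`, with keys snapshotted before iteration) on a hypothetical `TrackedSketch` class; it is an illustration under those assumptions, not the `Quonfig::Client` code itself.

```ruby
# Hypothetical class; mirrors the registry shape shown in the hunk above.
class TrackedSketch
  @instances = ObjectSpace::WeakMap.new
  @instances_mutex = Mutex.new

  class << self
    def register(obj)
      @instances_mutex.synchronize { @instances[obj] = true }
    end

    # Snapshot the keys under the lock, then yield outside it so a
    # callback that registers another instance does not re-enter the mutex.
    def each_instance(&block)
      @instances_mutex.synchronize { @instances.keys }.each(&block)
    end
  end

  attr_reader :name

  def initialize(name)
    @name = name
    self.class.register(self)
  end
end

a = TrackedSketch.new('a')
b = TrackedSketch.new('b')
names = []
TrackedSketch.each_instance { |c| names << c.name }
p names.sort
```

Because the map holds its keys weakly, an instance that is no longer referenced elsewhere can be collected without any explicit deregistration step.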
@@ -48,17 +71,23 @@ module Quonfig
|
|
|
48
71
|
@sse_state = :idle
|
|
49
72
|
@sse_ever_connected = false
|
|
50
73
|
@fallback_engage_timer = nil
|
|
74
|
+
@sse_terminal_failure = false
|
|
51
75
|
|
|
52
76
|
# If the caller injected a store, we're in test/bootstrap mode; skip I/O.
|
|
53
77
|
return if store
|
|
54
78
|
|
|
55
79
|
if @options.datadir
|
|
56
80
|
load_datadir_into_store
|
|
81
|
+
start_datadir_watcher if @options.data_dir_auto_reload
|
|
57
82
|
else
|
|
58
83
|
initialize_network_mode
|
|
59
84
|
end
|
|
60
85
|
|
|
61
86
|
initialize_telemetry
|
|
87
|
+
|
|
88
|
+
# Register only for non-store-injected clients (a caller-supplied store
|
|
89
|
+
# is the test/bootstrap path; the fork hook does not apply there).
|
|
90
|
+
self.class.register_instance(self) unless store
|
|
62
91
|
end
|
|
63
92
|
|
|
64
93
|
# ---- Lookup --------------------------------------------------------
|
|
@@ -264,34 +293,57 @@ module Quonfig
|
|
|
264
293
|
|
|
265
294
|
def stop
|
|
266
295
|
@stopped = true
|
|
267
|
-
|
|
268
|
-
|
|
269
|
-
rescue StandardError => e
|
|
270
|
-
LOG.debug "Error closing SSE client: #{e.message}"
|
|
271
|
-
end
|
|
272
|
-
@sse_client = nil
|
|
296
|
+
tear_down_threaded_components!
|
|
297
|
+
end
|
|
273
298
|
|
|
274
|
-
|
|
299
|
+
# qfg-ryov: pre-fork hook. Close the SSE worker, polling supervisor,
|
|
300
|
+
# telemetry reporter, and any fallback-engage timer. Idempotent — calling
|
|
301
|
+
# twice is safe. Does NOT set @stopped: the client is still expected to
|
|
302
|
+
# be usable post-fork via after_fork_in_child.
|
|
303
|
+
#
|
|
304
|
+
# Why this matters: Ruby threads do not survive fork(2). If we let the
|
|
305
|
+
# child inherit a live Net::HTTP socket, both processes read from the
|
|
306
|
+
# same fd and corrupt each other's bytes. Closing in the parent before
|
|
307
|
+
# fork is the only safe shape.
|
|
308
|
+
def before_fork_in_parent
|
|
309
|
+
return if @stopped
|
|
275
310
|
|
|
276
|
-
|
|
277
|
-
|
|
278
|
-
|
|
279
|
-
|
|
311
|
+
tear_down_threaded_components!
|
|
312
|
+
end
|
|
313
|
+
|
|
314
|
+
# qfg-ryov: post-fork (in child) hook. Re-establish whatever threaded
|
|
315
|
+
# components the client had pre-fork. No-op if the client was already
|
|
316
|
+
# stopped (the customer asked for it to be dead — do not resurrect),
|
|
317
|
+
# or if the client is in datadir mode (no threaded components to start).
|
|
318
|
+
def after_fork_in_child
|
|
319
|
+
return if @stopped
|
|
320
|
+
|
|
321
|
+
if @options.datadir
|
|
322
|
+
start_datadir_watcher if @options.data_dir_auto_reload
|
|
323
|
+
return
|
|
280
324
|
end
|
|
281
|
-
@poll_supervisor = nil
|
|
282
325
|
|
|
283
|
-
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
326
|
+
return if @config_loader.nil? # never finished network init (e.g. invalid key)
|
|
327
|
+
|
|
328
|
+
# SSE state machine carries flags that no longer apply in the child
|
|
329
|
+
# (the parent had connected, the parent had errored, etc.). Reset.
|
|
330
|
+
@state_mutex.synchronize do
|
|
331
|
+
@sse_state = :idle
|
|
332
|
+
@sse_ever_connected = false
|
|
333
|
+
@sse_terminal_failure = false
|
|
287
334
|
end
|
|
288
|
-
|
|
335
|
+
|
|
336
|
+
sse_started = @options.enable_sse && start_sse
|
|
337
|
+
start_polling if @options.enable_polling && !sse_started
|
|
338
|
+
|
|
339
|
+
restart_telemetry_in_child
|
|
289
340
|
end
|
|
290
341
|
|
|
291
342
|
# quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
|
|
292
343
|
# Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
|
|
293
|
-
# incremented
|
|
294
|
-
# Layer 2 (HTTP polling fallback) is wired through
|
|
344
|
+
# incremented once per reconnect attempt by the SDK-owned reconnect
|
|
345
|
+
# loop (qfg-35sm). Layer 2 (HTTP polling fallback) is wired through
|
|
346
|
+
# Quonfig::WorkerSupervisor.
|
|
295
347
|
#
|
|
296
348
|
# Pass +layer:+ ('1' or '2') to read a single layer; default returns the
|
|
297
349
|
# sum across both layers so the chaos harness (and operators) can pull
|
|
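The hunk above splits teardown into three entry points: `stop` (sets `@stopped`), `before_fork_in_parent` (tears down but leaves `@stopped` alone), and `after_fork_in_child` (restarts unless stopped). A miniature of that contract, simulated without an actual fork, might look like the following. `LifecycleSketch` and its single placeholder thread are assumptions for illustration only.

```ruby
# Hypothetical miniature of the stop / pre-fork / post-fork split.
class LifecycleSketch
  def initialize
    @stopped = false
    start_worker
  end

  def stop
    @stopped = true
    tear_down!
  end

  def before_fork_in_parent
    return if @stopped

    tear_down!            # @stopped stays false so a child may restart
  end

  def after_fork_in_child
    return if @stopped    # never resurrect a client the caller stopped

    start_worker
  end

  def running?
    !@worker.nil? && @worker.alive?
  end

  private

  def start_worker
    @worker = Thread.new { sleep }   # placeholder for SSE/polling/watcher threads
  end

  def tear_down!
    @worker&.kill
    @worker&.join
    @worker = nil
  end
end

live = LifecycleSketch.new
live.before_fork_in_parent
live.after_fork_in_child          # simulated without an actual fork
puts "restarted: #{live.running?}"

dead = LifecycleSketch.new
dead.stop
dead.after_fork_in_child
puts "stopped client running: #{dead.running?}"
```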
@@ -357,6 +409,48 @@ module Quonfig
|
|
|
357
409
|
|
|
358
410
|
private
|
|
359
411
|
|
|
412
|
+
# Close every threaded component and drop its reference. Used by both
|
|
413
|
+
# +stop+ (where @stopped is also flipped) and +before_fork_in_parent+
|
|
414
|
+
# (where @stopped is left alone so the child can restart).
|
|
415
|
+
def tear_down_threaded_components!
|
|
416
|
+
begin
|
|
417
|
+
@sse_client&.close
|
|
418
|
+
rescue StandardError => e
|
|
419
|
+
LOG.debug "Error closing SSE client: #{e.message}"
|
|
420
|
+
end
|
|
421
|
+
@sse_client = nil
|
|
422
|
+
|
|
423
|
+
cancel_fallback_engage_timer
|
|
424
|
+
|
|
425
|
+
begin
|
|
426
|
+
@poll_supervisor&.stop
|
|
427
|
+
rescue StandardError => e
|
|
428
|
+
LOG.debug "Error stopping poll supervisor: #{e.message}"
|
|
429
|
+
end
|
|
430
|
+
@poll_supervisor = nil
|
|
431
|
+
|
|
432
|
+
begin
|
|
433
|
+
@telemetry_reporter&.stop
|
|
434
|
+
rescue StandardError => e
|
|
435
|
+
LOG.debug "Error stopping telemetry reporter: #{e.message}"
|
|
436
|
+
end
|
|
437
|
+
@telemetry_reporter = nil
|
|
438
|
+
|
|
439
|
+
begin
|
|
440
|
+
@datadir_watcher&.stop
|
|
441
|
+
rescue StandardError => e
|
|
442
|
+
LOG.debug "Error stopping datadir watcher: #{e.message}"
|
|
443
|
+
end
|
|
444
|
+
@datadir_watcher = nil
|
|
445
|
+
end
|
|
446
|
+
|
|
447
|
+
# Rebuild the telemetry reporter in the child after fork. Mirrors the
|
|
448
|
+
# original initialize_telemetry path — fresh aggregators, fresh reporter.
|
|
449
|
+
def restart_telemetry_in_child
|
|
450
|
+
@telemetry_reporter = nil
|
|
451
|
+
initialize_telemetry
|
|
452
|
+
end
|
|
453
|
+
|
|
360
454
|
# Stamp +last_successful_refresh+ at install time. Called by every code
|
|
361
455
|
# path that hands an envelope to the cache: datadir load, initial HTTP
|
|
362
456
|
# fetch, SSE event apply, and polling worker fetch.
|
|
@@ -402,20 +496,31 @@ module Quonfig
|
|
|
402
496
|
@sse_error_callback ||= ->(error) { handle_sse_error(error) }
|
|
403
497
|
end
|
|
404
498
|
|
|
405
|
-
def handle_sse_error(
|
|
499
|
+
def handle_sse_error(error)
|
|
500
|
+
# qfg-i5xv: classify terminal HTTP failures (401/403/404). The same SDK
|
|
501
|
+
# key that won't auth over SSE won't auth over HTTP polling either, so
|
|
502
|
+
# we must NOT engage the Layer 2 fallback — that just moves the
|
|
503
|
+
# auth-failure storm from one endpoint to another. Once flipped,
|
|
504
|
+
# @sse_terminal_failure latches: a buggy customer retry loop cannot
|
|
505
|
+
# un-classify the failure by driving the state machine.
|
|
506
|
+
@state_mutex.synchronize { @sse_terminal_failure = true } if error.is_a?(Quonfig::SSEConfigClient::SSEHTTPTerminalError)
|
|
406
507
|
handle_sse_state_change(:error)
|
|
407
508
|
end
|
|
408
509
|
|
|
409
510
|
def handle_sse_state_change(new_state)
|
|
410
511
|
state = new_state.to_sym
|
|
411
|
-
ever_connected = @state_mutex.synchronize do
|
|
512
|
+
ever_connected, terminal = @state_mutex.synchronize do
|
|
412
513
|
@sse_state = state
|
|
413
514
|
@sse_ever_connected = true if state == :connected
|
|
414
|
-
@sse_ever_connected
|
|
515
|
+
[@sse_ever_connected, @sse_terminal_failure]
|
|
415
516
|
end
|
|
416
517
|
|
|
417
518
|
return unless @options.respond_to?(:enable_polling) && @options.enable_polling
|
|
418
519
|
return if @stopped
|
|
520
|
+
# qfg-i5xv: a terminal SSE classification suppresses polling engage in
|
|
521
|
+
# every branch — the customer's key is bad and HTTP polling will fail
|
|
522
|
+
# identically. Operators surface this via #terminal_failure?.
|
|
523
|
+
return if terminal
|
|
419
524
|
|
|
420
525
|
case state
|
|
421
526
|
when :connected
|
|
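The hunk above latches a terminal flag when the SSE layer reports 401/403/404 and suppresses the polling fallback afterwards. The reconnect loop that raises the terminal error lives in `sse_config_client.rb`, which is not shown in this section, so the following is only a sketch of the classification idea. All names here (`RetryableStatus`, `TerminalStatus`, `run_loop_sketch`) are hypothetical; only the 401/403/404 versus retryable split comes from the changelog.

```ruby
# Hypothetical names throughout; only the 401/403/404 split comes from the diff.
class RetryableStatus < StandardError; end
class TerminalStatus < StandardError; end

TERMINAL_CODES = [401, 403, 404].freeze

def check_status!(code)
  return if code == 200
  raise TerminalStatus, "HTTP #{code}" if TERMINAL_CODES.include?(code)

  raise RetryableStatus, "HTTP #{code}"
end

# Drive a scripted sequence of responses through a reconnect loop.
def run_loop_sketch(codes)
  restarts = 0
  codes.each do |code|
    begin
      check_status!(code)
      return { outcome: :connected, restarts: restarts }
    rescue RetryableStatus
      restarts += 1          # transient: count it and try the next response
    rescue TerminalStatus
      return { outcome: :terminal, restarts: restarts }  # exit without retrying
    end
  end
  { outcome: :exhausted, restarts: restarts }
end

p run_loop_sketch([503, 429, 200])
p run_loop_sketch([503, 401, 200])
```

In the second call the loop stops at the 401 and never reaches the trailing 200, and the restart counter is not bumped for the terminal response.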
@@ -430,6 +535,21 @@ module Quonfig
|
|
|
430
535
|
end
|
|
431
536
|
end
|
|
432
537
|
|
|
538
|
+
public
|
|
539
|
+
|
|
540
|
+
# qfg-i5xv: true once the SSE layer has classified an HTTP response as
|
|
541
|
+
# terminal (401/403/404) — bad SDK key, revoked workspace permission,
|
|
542
|
+
# or wrong endpoint. The classification latches: the SDK will not
|
|
543
|
+
# auto-recover, and a customer-supplied retry must rebuild the client.
|
|
544
|
+
# Surfaced for operator alerting; `connection_state` still reports
|
|
545
|
+
# `:disconnected` to honor the documented connection_state vocabulary
|
|
546
|
+
# (supervisor-test-contract.md §"connectionState()" — values fixed).
|
|
547
|
+
def terminal_failure?
|
|
548
|
+
@state_mutex.synchronize { @sse_terminal_failure }
|
|
549
|
+
end
|
|
550
|
+
|
|
551
|
+
private
|
|
552
|
+
|
|
433
553
|
def cancel_fallback_engage_timer
|
|
434
554
|
timer = @state_mutex.synchronize do
|
|
435
555
|
t = @fallback_engage_timer
|
|
@@ -568,10 +688,60 @@ module Quonfig
|
|
|
568
688
|
|
|
569
689
|
def load_datadir_into_store
|
|
570
690
|
envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
|
|
691
|
+
apply_datadir_envelope(envelope)
|
|
692
|
+
end
|
|
693
|
+
|
|
694
|
+
# Apply a freshly loaded datadir envelope to the store. Keys that were
|
|
695
|
+
# present before but missing now are deleted, so a `rm configs/foo.json`
|
|
696
|
+
# propagates through the auto-reload path. Records a refresh timestamp.
|
|
697
|
+
# Caller is responsible for firing on_update.
|
|
698
|
+
def apply_datadir_envelope(envelope)
|
|
699
|
+
new_keys = envelope.configs.map { |cfg| cfg['key'] }.compact.to_set
|
|
700
|
+
old_keys = @store.keys.to_set
|
|
701
|
+
(old_keys - new_keys).each { |k| @store.delete(k) }
|
|
571
702
|
envelope.configs.each { |cfg| @store.set(cfg['key'], cfg) }
|
|
572
703
|
record_refresh!
|
|
573
704
|
end
|
|
574
705
|
|
|
706
|
+
# qfg-mol-2da: start the filesystem watcher for datadir auto-reload.
|
|
707
|
+
# On listen-registration failure (read-only fs, missing native backend),
|
|
708
|
+
# log and continue without watching — the SDK keeps serving the envelope
|
|
709
|
+
# captured at init.
|
|
710
|
+
def start_datadir_watcher
|
|
711
|
+
return unless @options.datadir
|
|
712
|
+
|
|
713
|
+
watcher = Quonfig::DatadirWatcher.new(
|
|
714
|
+
datadir: @options.datadir,
|
|
715
|
+
debounce_ms: @options.data_dir_auto_reload_debounce_ms,
|
|
716
|
+
on_change: -> { reload_datadir! },
|
|
717
|
+
on_error: ->(err) { LOG.warn "[quonfig] datadir watcher error: #{err.class}: #{err.message}" }
|
|
718
|
+
)
|
|
719
|
+
unless watcher.start
|
|
720
|
+
LOG.warn '[quonfig] data_dir_auto_reload requested but watcher registration failed; continuing without auto-reload'
|
|
721
|
+
return
|
|
722
|
+
end
|
|
723
|
+
@datadir_watcher = watcher
|
|
724
|
+
end
|
|
725
|
+
|
|
726
|
+
# Re-read the datadir into a fresh envelope and atomically install it.
|
|
727
|
+
# Parse errors (mid-write JSON, garbage file) are logged and swallowed:
|
|
728
|
+
# the previous envelope stays in the store and on_update does NOT fire.
|
|
729
|
+
# qfg-mol-2da.
|
|
730
|
+
def reload_datadir!
|
|
731
|
+
return if @stopped
|
|
732
|
+
return unless @options.datadir
|
|
733
|
+
|
|
734
|
+
begin
|
|
735
|
+
envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
|
|
736
|
+
rescue StandardError => e
|
|
737
|
+
LOG.warn "[quonfig] datadir reload failed; keeping previous envelope: #{e.class}: #{e.message}"
|
|
738
|
+
return
|
|
739
|
+
end
|
|
740
|
+
|
|
741
|
+
apply_datadir_envelope(envelope)
|
|
742
|
+
notify_on_update_callback
|
|
743
|
+
end
|
|
744
|
+
|
|
575
745
|
# Initialize network mode: sync HTTP fetch (bounded by
|
|
576
746
|
# initialization_timeout_sec) then start SSE + polling as requested.
|
|
577
747
|
def initialize_network_mode
|
|
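The `apply_datadir_envelope` and `reload_datadir!` methods in the hunk above implement parse-then-swap with deletion of keys that disappeared from disk. The sketch below shows the same idea on a plain `Hash`, assuming a JSON array of `{"key": ...}` objects as input. `SwapSketch` and that input shape are assumptions; the SDK's store and envelope types are not reproduced here.

```ruby
require 'json'
require 'set'

# Hypothetical in-memory store; the SDK's store and envelope types differ.
class SwapSketch
  attr_reader :store, :updates

  def initialize
    @store = {}
    @updates = 0
  end

  # Parse first; touch the store only when parsing succeeded.
  def reload(raw_json)
    configs = JSON.parse(raw_json)
  rescue JSON::ParserError
    false                       # keep the previous contents, skip the callback
  else
    new_keys = configs.map { |cfg| cfg['key'] }.compact.to_set
    (@store.keys.to_set - new_keys).each { |k| @store.delete(k) }
    configs.each { |cfg| @store[cfg['key']] = cfg }
    @updates += 1               # stands in for firing on_update
    true
  end
end

s = SwapSketch.new
s.reload('[{"key":"a","value":1},{"key":"b","value":2}]')
s.reload('[{"key":"a","value":3}')      # truncated write: rejected
s.reload('[{"key":"a","value":3}]')     # "b" was removed on disk
p s.store.keys
p s.updates
```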
@@ -904,4 +1074,42 @@ module Quonfig
|
|
|
904
1074
|
end
|
|
905
1075
|
end
|
|
906
1076
|
end
|
|
1077
|
+
|
|
1078
|
+
# qfg-ryov: hook into Process._fork so customers using Puma's clustered
|
|
1079
|
+
# mode (or any preload/fork-worker server) don't have to wire
|
|
1080
|
+
# +before_fork+/+on_worker_boot+ manually. Ruby 3.1+ routes every
|
|
1081
|
+
# +Kernel#fork+/+Process.fork+ call through +Process._fork+, so a single
|
|
1082
|
+
# prepend covers them all.
|
|
1083
|
+
#
|
|
1084
|
+
# Process._fork's contract:
|
|
1085
|
+
# - Called in the parent process before the fork syscall.
|
|
1086
|
+
# - Returns 0 in the child, child's pid in the parent.
|
|
1087
|
+
# - +super+ performs the actual fork.
|
|
1088
|
+
#
|
|
1089
|
+
# The parent's view: SSE/polling/telemetry threads are torn down before
|
|
1090
|
+
# the syscall so the child does not inherit a live Net::HTTP socket fd
|
|
1091
|
+
# (which would corrupt both sides). The parent does NOT auto-restart —
|
|
1092
|
+
# that mirrors the Puma master use case where the master process no
|
|
1093
|
+
# longer serves requests after spawning workers.
|
|
1094
|
+
module ForkSafety
|
|
1095
|
+
def _fork
|
|
1096
|
+
Quonfig::Client.each_instance(&:before_fork_in_parent)
|
|
1097
|
+
pid = super
|
|
1098
|
+
Quonfig::Client.each_instance(&:after_fork_in_child) if pid.zero?
|
|
1099
|
+
pid
|
|
1100
|
+
rescue StandardError => e
|
|
1101
|
+
# Fork-hook failures must never break the customer's fork. Worst case
|
|
1102
|
+
# the child inherits dead SSE threads (the pre-qfg-ryov behavior) —
|
|
1103
|
+
# bad, but recoverable. Crashing the fork itself is not.
|
|
1104
|
+
Quonfig::Client::LOG.error "Quonfig fork hook error: #{e.class}: #{e.message}"
|
|
1105
|
+
raise if pid.nil? # super never returned — propagate fork failures
|
|
1106
|
+
|
|
1107
|
+
pid
|
|
1108
|
+
end
|
|
1109
|
+
end
|
|
1110
|
+
|
|
1111
|
+
# Ruby 3.0 lacks Process._fork. There's no hookable choke point on 3.0, so
|
|
1112
|
+
# customers must keep wiring their own Puma before_fork / on_worker_boot
|
|
1113
|
+
# (see README "Rails integration"). On 3.1+ we install the hook globally.
|
|
1114
|
+
Process.singleton_class.prepend(ForkSafety) if Process.respond_to?(:_fork)
|
|
907
1115
|
end
|
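To see the `Process._fork` prepend pattern from the `ForkSafety` module above in action, the following sketch installs a hypothetical hook that only records when each side runs, then forks once and reports what the child observed over a pipe. It assumes a platform where `fork` is available; `ForkHookSketch` and `HOOK_LOG` are illustrative names, and on Ruby 3.0 the hook is simply not installed.

```ruby
# Hypothetical hook module; records when each side of the fork runs.
HOOK_LOG = []

module ForkHookSketch
  def _fork
    HOOK_LOG << :before_fork_in_parent
    pid = super                          # performs the actual fork
    HOOK_LOG << :after_fork_in_child if pid.zero?
    pid
  end
end

hooked = Process.respond_to?(:_fork)     # false on Ruby 3.0
Process.singleton_class.prepend(ForkHookSketch) if hooked

reader, writer = IO.pipe
pid = fork do
  reader.close
  writer.write(HOOK_LOG.join(','))       # report what the child observed
  writer.close
  exit!(0)
end
writer.close
child_log = reader.read
reader.close
Process.wait(pid)

puts "hook installed: #{hooked}"
puts "parent saw: #{HOOK_LOG.inspect}"
puts "child saw:  #{child_log}"
```

The pre-fork entry is recorded before the syscall, so it appears in both processes, while the post-fork entry is appended only where `super` returned zero.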