quonfig 0.0.15 → 0.0.17

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: e4e037ad01a35ca5a3fb3ddcc30ad6b0dab78ad82e4908a4a8ce9e8bab6cab40
4
- data.tar.gz: 8bcccb03befbab5f1fbed1cbae867ce970498ac0081c92e24db7d8eb899d2faa
3
+ metadata.gz: 68d9721e3220acc150e33b43e993c8b8b2380056453939b62b06b05cc4ef4255
4
+ data.tar.gz: 643c409f2b8fa3d5291d92fcf8a0d39cf5a2a67b4d697036497a19296f584d72
5
5
  SHA512:
6
- metadata.gz: 9d4abdeaeaaad881e5f28cb9a653715dd8b1838ba33cc38b6b1f08db5f729173d5eadbf2afebfb6e3ca3a379f0354ab453fafd760a1fd61d13c3efef60ad0aee
7
- data.tar.gz: 890131a3f75092f1b846ee4ca46c1dc20702b1effc3db5803443905d8a8571a33b672a691a18bbb0c3ad8471c5db72006a745f50ac0e919bf0997b49cf202045
6
+ metadata.gz: c133fdcdf47da1b026465f42dfc71e98be03590bb0df80ebd247a572bce6404a59b5f411b014e83349ca2278af2c4c62974c22c1ea9c0c7b1af474d933e91725
7
+ data.tar.gz: c982a887a21dcfe7545e2b50b71a09f2fb7465827819676a238bdbd87cbeaa7626f26e723eb3535c8e3f893de4f7bcebc93e31264783b76716add5057ece836c
data/CHANGELOG.md CHANGED
@@ -1,5 +1,18 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.0.17 - 2026-05-19
4
+
5
+ - **Feat (datadir): opt-in `data_dir_auto_reload` (qfg-mol-2da).** Datadir mode previously loaded the workspace once at construction and served purely from memory. Set `data_dir_auto_reload: true` to have the SDK watch the configured `datadir`, re-read `Quonfig::Datadir.load_envelope`, and fire the existing `on_update` callback whenever files change. Adds `listen ~> 3.8` (FSEvents on macOS, inotify on Linux, polling fallback on Windows) as a runtime dep. Behavior: parse-then-swap (a failed parse keeps the previous envelope and skips the callback), debounced (`data_dir_auto_reload_debounce_ms`, default 200 ms — bursts coalesce to one reload), and gracefully downgrades when watch registration fails (read-only fs, immutable container, missing native backend). Symlinked datadirs are resolved to their real path before watching. Default is `false`; opt-in only.
6
+ - **Feat (datadir + fork): auto-restart the watcher across `fork(2)` (qfg-mol-2da).** The watcher uses a background thread, which does not survive fork. The existing `Process._fork` hook (qfg-ryov, Ruby 3.1+) now also tears the datadir watcher down in the parent before fork and rebuilds a fresh one on the same `Client` in each child — no customer wiring required for Puma clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Resque, or Spring. Ruby 3.0 customers continue to use the documented `Quonfig.fork` pattern in `on_worker_boot`, which rebuilds the watcher alongside the rest of the client.
7
+
8
+ ## 0.0.16 - 2026-05-15
9
+
10
+ - **Feat (SSE): replace `ld-eventsource` with an SDK-owned reconnect loop (qfg-35sm).** sdk-ruby was the outlier among the four backend SDKs — sdk-go, sdk-node, and sdk-python all own their reconnect loop, only sdk-ruby handed it off to a library and scraped its log output to observe reconnects. The wire format we actually consume (plain JSON envelopes in single-line `data:` frames, no named events, no retry directives) is trivial enough that an SDK-owned loop is clearer than the library wrapper. New `Quonfig::SSEConfigClient` (~520 LoC, `lib/quonfig/sse_config_client.rb`) handles connect/parse/reconnect end-to-end. `restart_total` is now incremented at exactly one site under a mutex — verifiable, not log-scraped. `ld-eventsource` and the transitive `http` gem are removed from the gemspec. `ReconnectCountingLogger` and the `sse_reconnect_reset_interval` option (both 0.0.15-era defensive scaffolding around upstream behavior) are deleted — the bugs they defended against don't exist when the SDK owns the loop. Chaos: 10/10 in a 36-min run (scenarios 02 silent-stall, 05 sse-down-fallback, 09 flapping kill-storm).
11
+ - **Fix (SSE): contain watchdog `Thread#raise` with `Thread.handle_interrupt` (qfg-tj18).** The new watchdog fires `Thread#raise(SSEReadDeadlineExceeded)` into the worker on a silent stall — the only reliable cross-platform way to unblock a Ruby thread blocked in `Net::HTTP`'s body-read on macOS. The decision to fire is mutex-guarded against `stop()`, but the raise itself is delivered at the worker's next interrupt checkpoint, which could be anywhere in the call stack. `run_loop`'s body now runs under `Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)` so a late-landing raise can only land inside a blocking call; the `read_body` block explicitly switches to `:immediate`. A paranoid backstop `rescue` outside the until-`@stopped` loop ensures an escaped raise can never silently kill the worker.
12
+ - **Fix (SSE): isolate `on_envelope` callback exceptions (qfg-m3lk).** A buggy user-supplied listener that raised during envelope delivery used to propagate out of `read_body`, get caught by `run_loop` as a transport error, bump `restart_total`, and reconnect — a perpetual reconnect storm at api-delivery-sse driven by a customer code bug. The callback is now wrapped in `begin/rescue StandardError` at the invocation site; exceptions are logged with class + message + backtrace sample and the stream continues uninterrupted. `Interrupt` and `SystemExit` are deliberately not caught so `Ctrl-C` still works.
13
+ - **Fix (SSE): classify 401/403/404 as terminal errors (qfg-i5xv).** Non-200 responses used to be treated identically — `SSEHTTPStatusError` raised, `run_loop` rescued, `restart_total` bumped, backoff, retry, forever. For a bad SDK key (401) or revoked workspace (403) that was wasted load on api-delivery-sse with no recovery path short of a customer redeploy. New `SSEHTTPTerminalError` sentinel for 401/403/404; `run_loop` catches it, invokes `on_error`, exits the loop without bumping `restart_total`. Parent `Quonfig::Client` surfaces a terminal `:sse_terminal_failure` state distinct from transient `:error`. 429 and 5xx still retry.
14
+ - **Feat (fork): install `Process._fork` hook so SSE auto-restarts after fork (qfg-ryov).** Ruby threads do not survive `fork(2)`. Customers initializing `Quonfig::Client` in the Puma master (the `preload_app! true` / Rails `config.eager_load = true` convention) used to silently lose SSE in every worker child. New `Quonfig::ForkSafety` module prepends `Process._fork` and fans out across all live `Quonfig::Client` instances (tracked in an `ObjectSpace::WeakMap`): in the parent before the syscall, threaded components (SSE worker, polling supervisor, telemetry reporter) are torn down; in the child after the syscall, they are rebuilt. `@stopped` is preserved so a `stop()`-ed client stays stopped across fork. Covers `Process.fork` / `Kernel#fork`; `Process.spawn` and `system("...")` exec a new program so in-process state doesn't apply. Ruby 3.0 lacks `Process._fork` and is documented as requiring manual `before_fork` / `on_worker_boot` wiring.
15
+
3
16
  ## 0.0.15 - 2026-05-15
4
17
 
5
18
  - **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
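The 0.0.16 entry for qfg-i5xv describes a split between terminal and retryable HTTP statuses. A minimal sketch of that classification follows, assuming a simplified loop that receives one status code per connection attempt. `classify_status` and `simulate_run_loop` are hypothetical names for illustration, not part of the SDK, and treating every other non-200 code as retryable is an assumption.

```ruby
# Sketch only: mirrors the classification described in the changelog.
# These helper names are hypothetical and are not SDK API.
TERMINAL_STATUSES = [401, 403, 404].freeze

def classify_status(status)
  return :ok if status == 200
  return :terminal if TERMINAL_STATUSES.include?(status)

  :retry # 429 and 5xx per the changelog; other codes land here by assumption
end

# Walk a list of response codes the way a reconnect loop would see them.
# Returns the final state plus how many reconnects were counted.
def simulate_run_loop(statuses)
  restart_total = 0
  statuses.each do |status|
    case classify_status(status)
    when :ok       then return [:connected, restart_total]
    when :terminal then return [:sse_terminal_failure, restart_total]
    else restart_total += 1 # only retryable failures bump the counter
    end
  end
  [:error, restart_total]
end
```

In this sketch a terminal status ends the loop without touching the counter, which matches the changelog's statement that `restart_total` is not bumped for 401/403/404.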
data/README.md CHANGED
@@ -107,6 +107,107 @@ export QUONFIG_ENVIRONMENT=production
107
107
  client = Quonfig::Client.new # reads QUONFIG_DIR + QUONFIG_ENVIRONMENT
108
108
  ```
109
109
 
110
+ ## Datadir mode: auto-reload on file changes
111
+
112
+ In datadir mode the SDK loads the workspace once at construction time and then
113
+ serves config purely from memory. Opt in to `data_dir_auto_reload: true` to
114
+ have the SDK watch the directory and re-read the envelope whenever files
115
+ change — an editor save, a `git pull`, or a build step that rewrites the
116
+ workspace.
117
+
118
+ ```ruby
119
+ client = Quonfig::Client.new(
120
+ datadir: '/path/to/workspace',
121
+ environment: 'development',
122
+ data_dir_auto_reload: true # off by default — must be opted in
123
+ )
124
+
125
+ client.on_update do
126
+ puts 'Quonfig configs reloaded from disk'
127
+ end
128
+
129
+ # Edit a file under /path/to/workspace and on_update fires within ~200ms.
130
+
131
+ # On shutdown, stop stops the watcher and cancels any pending debounce.
132
+ client.stop
133
+ ```
134
+
135
+ ### When to enable
136
+
137
+ - Local development with the datadir checked out from git.
138
+ - Self-hosted servers that `git pull` the datadir on a schedule.
139
+ - CI jobs that mutate the datadir between assertions.
140
+
141
+ ### When NOT to enable
142
+
143
+ - **Read-only / immutable filesystems** (some containers, scratch images,
144
+ AWS Lambda). Watch registration may fail; the SDK degrades gracefully
145
+ (logs the error and continues serving the envelope it loaded at init time)
146
+ but you're paying for nothing.
147
+ - **Build-time-embedded workflows** where the datadir is bundled into the
148
+ artifact and never changes at runtime. Watching wastes a thread and a
149
+ native-backend handle.
150
+ - **Production paths where reload timing matters** — e.g. you'd rather pin
151
+ the envelope you shipped with and roll forward through a redeploy than
152
+ have it shift under traffic.
153
+
154
+ Default is `false`; datadir mode is silent until you opt in.
155
+
156
+ ### Behavior contract
157
+
158
+ - **Parse-then-swap.** If the new envelope fails to parse (truncated write,
159
+ mid-`git pull` state, invalid JSON), the SDK logs the error and **keeps
160
+ serving the previous envelope**. `on_update` is _not_ fired on parse
161
+ failure — only on a successful swap.
162
+ - **Debounced.** Bursts of filesystem events (atomic-rename editor saves,
163
+ `git pull` touching dozens of files) coalesce into a single re-read.
164
+ Default window: **200ms** — long enough to absorb the 3–5 events a typical
165
+ editor emits in <50ms, short enough that interactive edits feel immediate.
166
+ Tune via `data_dir_auto_reload_debounce_ms` if you need a different
167
+ window.
168
+ - **Graceful degrade.** If watch registration fails (read-only fs, immutable
169
+ container, missing native backend), the SDK logs and continues without
170
+ watching — it does **not** raise from the constructor.
171
+ - **Symlinks.** The watcher resolves `datadir` to its real path at start
172
+ time. Editing the file the symlink points at _is_ detected; atomic flips
173
+ that retarget the link itself are **not**.
174
+ - **Shutdown.** `client.stop` stops the watcher and cancels any pending
175
+ debounce. There is no separate handle to manage — the watcher lifecycle
176
+ is tied to the client.
177
+
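The debounce rule in the behavior contract above can be illustrated with a small trailing-edge debouncer. This is a sketch under assumptions, not the SDK's watcher: `Debouncer`, `trigger`, and `cancel` are hypothetical names, and spawning one short-lived thread per event is chosen for brevity rather than efficiency.

```ruby
# Sketch only: a trailing-edge debouncer. Names are hypothetical.
class Debouncer
  def initialize(window_ms, &action)
    @window = window_ms / 1000.0
    @action = action
    @mutex = Mutex.new
    @generation = 0
  end

  # Record an event. The action runs once the window has passed
  # with no newer event and no cancel.
  def trigger
    generation = @mutex.synchronize { @generation += 1 }
    Thread.new do
      sleep @window
      latest = @mutex.synchronize { generation == @generation }
      @action.call if latest
    end
  end

  # Invalidate any pending run, as a shutdown path would.
  def cancel
    @mutex.synchronize { @generation += 1 }
  end
end
```

A burst of `trigger` calls inside one window results in a single call to the action, and `cancel` before the window elapses results in none.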
178
+ ### Fork safety (Puma cluster, Unicorn, Resque, Sidekiq)
179
+
180
+ The auto-reload watcher uses a background thread, which — like any Ruby
181
+ thread — does not survive `fork(2)`. **You do not need to wire this up
182
+ manually on Ruby 3.1+.** The SDK's `Process._fork` hook (see [Rails
183
+ integration](#rails-integration) below) stops the watcher in the parent
184
+ before fork and restarts a fresh watcher in each child after fork. This
185
+ covers Puma clustered mode, Unicorn, Sidekiq's parent-forks-workers model,
186
+ Resque, Spring, and manual `fork { ... }` calls.
187
+
188
+ On Ruby 3.0 (no `Process._fork`), follow the manual `before_fork` /
189
+ `on_worker_boot` pattern in the [Rails integration](#rails-integration)
190
+ section — `Quonfig.fork` rebuilds the full client, including the datadir
191
+ watcher, in the child.
192
+
193
+ ### Tuning the debounce window
194
+
195
+ ```ruby
196
+ Quonfig::Client.new(
197
+ datadir: '/path/to/workspace',
198
+ data_dir_auto_reload: true,
199
+ data_dir_auto_reload_debounce_ms: 1000 # wait a full second after the last event
200
+ )
201
+ ```
202
+
203
+ The default (200 ms) is tuned for interactive editing. Raise it if you have
204
+ a noisy producer (continuously regenerating files) and you'd rather see one
205
+ reload per second than per save. Lower it only if you've measured that 200 ms
206
+ is meaningfully too slow for your use case.
207
+
208
+ See the [open-source / local how-to](https://docs.quonfig.com/docs/how-tos/open-source-local)
209
+ for the cross-SDK story (sdk-node, sdk-go, sdk-ruby, sdk-python, sdk-java).
210
+
110
211
  ## Environment variables
111
212
 
112
213
  | Variable | Purpose |
@@ -130,7 +231,9 @@ Quonfig::Client.new(
130
231
  on_no_default: :error,
131
232
  global_context: {},
132
233
  datadir: '/path/to/workspace',
133
- environment: 'production'
234
+ environment: 'production',
235
+ data_dir_auto_reload: false,
236
+ data_dir_auto_reload_debounce_ms: 200
134
237
  )
135
238
  ```
136
239
 
@@ -147,6 +250,8 @@ Quonfig::Client.new(
147
250
  | `global_context` | `Hash` | `{}` | Context applied to every evaluation. |
148
251
  | `datadir` | `String` | `ENV['QUONFIG_DIR']` | Path to a local workspace. When set, the SDK runs offline from disk. |
149
252
  | `environment` | `String` | `ENV['QUONFIG_ENVIRONMENT']` | Environment to evaluate in datadir mode. Required when `datadir` is set. |
253
+ | `data_dir_auto_reload` | `Boolean` | `false` | Datadir mode only. When `true`, the SDK watches the datadir and re-reads the envelope when files change. See [Datadir mode: auto-reload on file changes](#datadir-mode-auto-reload-on-file-changes). |
254
+ | `data_dir_auto_reload_debounce_ms` | `Integer` (ms) | `200` | Debounce window for the auto-reload watcher — events arriving inside the window are coalesced into a single re-read. Ignored when `data_dir_auto_reload` is `false`. |
150
255
  | `logger` | Logger-like object | `nil` | Optional host-app logger (e.g. `Rails.logger`). Must respond to `debug`/`info`/`warn`/`error`. When set, all SDK warnings/errors flow through this logger instead of the default stderr / SemanticLogger backend. |
151
256
 
152
257
  ## Typed getters
@@ -247,15 +352,41 @@ not want to inherit across a `fork(2)`. Forked threads in the child process
247
352
  are dead — the SSE socket is held open by a thread that no longer exists, and
248
353
  the child silently stops receiving live updates.
249
354
 
250
- Use `Quonfig::Client#fork` (or `Quonfig.fork` if you use the module-level
251
- singleton) in any process that fork-spawns workers. It returns a fresh client
252
- configured for the child: a new `ConfigStore`, a new SSE subscription, and
253
- suppressed telemetry double-counting (`Options#is_fork` is set to `true`).
355
+ **On Ruby 3.1+ the SDK installs a `Process._fork` hook at load time** that
356
+ automatically tears down threaded components in the parent and restarts them
357
+ in the child. This covers any `Process.fork` / `Kernel#fork` path: Puma's
358
+ clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Spring, and
359
+ manual `fork { ... }` calls. **No customer wiring is required.**
360
+
361
+ Caveats:
362
+
363
+ - Ruby 3.0 has no hookable choke point — fall back to manual wiring (below).
364
+ - `system("fork-and-exec ...")` and `Process.spawn` are not covered (they do
365
+ not go through `Process._fork`), but those execute a new program, so the
366
+ in-process SSE state is moot.
367
+ - The hook tears down the SSE/polling/telemetry threads in the parent before
368
+ fork (so the child does not inherit a live socket fd) and does **not**
369
+ auto-restart the parent. This mirrors the Puma master case: the master no
370
+ longer serves requests, so it does not need a live SSE connection. If you
371
+ have a non-Puma topology where the parent must keep streaming after fork,
372
+ call `Quonfig.instance.after_fork_in_child` manually in the parent after
373
+ the fork returns.
254
374
 
255
375
  ### Puma (clustered mode)
256
376
 
377
+ With the automatic fork hook, the typical Puma config needs **no Quonfig
378
+ lifecycle wiring** — initialize in your Rails initializer and let the hook
379
+ handle the rest:
380
+
381
+ ```ruby
382
+ # config/initializers/quonfig.rb
383
+ Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
384
+ ```
385
+
386
+ If you're on Ruby 3.0 (no `Process._fork`), wire the legacy hooks manually:
387
+
257
388
  ```ruby
258
- # config/puma.rb
389
+ # config/puma.rb (Ruby 3.0 only)
259
390
  before_fork do
260
391
  Quonfig.instance.stop # close the master's SSE before forking
261
392
  end
@@ -265,18 +396,18 @@ on_worker_boot do
265
396
  end
266
397
  ```
267
398
 
268
- If you initialize Quonfig lazily (in a Rails initializer) and run Puma in
269
- single mode (no clustering), no fork hook is needed.
270
-
271
399
  ### Sidekiq
272
400
 
273
- Sidekiq's parent process forks workers. Wire the same lifecycle:
401
+ On Ruby 3.1+ the automatic fork hook covers Sidekiq workers too; no
402
+ `configure_server` wiring required.
403
+
404
+ On Ruby 3.0:
274
405
 
275
406
  ```ruby
276
407
  # config/initializers/quonfig.rb
277
408
  Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
278
409
 
279
- # config/initializers/sidekiq.rb
410
+ # config/initializers/sidekiq.rb (Ruby 3.0 only)
280
411
  Sidekiq.configure_server do |config|
281
412
  config.on(:startup) { Quonfig.fork if Process.ppid != 1 }
282
413
  config.on(:shutdown) { Quonfig.instance.stop rescue nil }
@@ -284,7 +415,7 @@ end
284
415
  ```
285
416
 
286
417
  For Sidekiq web/CLI processes that don't fork (default `concurrency: 1`),
287
- `Quonfig.init` in the initializer is sufficient.
418
+ `Quonfig.init` in the initializer is sufficient on any Ruby version.
288
419
 
289
420
  ### Spring / Bootsnap preloaders
290
421
 
@@ -20,6 +20,29 @@ module Quonfig
20
20
  class Client
21
21
  LOG = Quonfig::InternalLogger.new(self)
22
22
 
23
+ # qfg-ryov: instance registry for the Process._fork hook. Every live
24
+ # Client is tracked here so the hook can fan out before_fork_in_parent /
25
+ # after_fork_in_child across all of them without the customer needing to
26
+ # name a specific instance. ObjectSpace::WeakMap means a Client that goes
27
+ # out of scope is GC'd without leaking through this registry. Stopped
28
+ # Clients stay in the registry until GC; both fork hooks early-return on
29
+ # +@stopped+ so a stopped instance is effectively a no-op. (We don't use
30
+ # WeakMap#delete because it was added in Ruby 3.3 and the matrix still
31
+ # includes 3.2.)
32
+ @instances = ObjectSpace::WeakMap.new
33
+ @instances_mutex = Mutex.new
34
+
35
+ class << self
36
+ # Iterate live Client instances. Used by Quonfig::ForkSafety.
37
+ def each_instance(&block)
38
+ @instances_mutex.synchronize { @instances.keys }.each(&block)
39
+ end
40
+
41
+ def register_instance(client)
42
+ @instances_mutex.synchronize { @instances[client] = true }
43
+ end
44
+ end
45
+
23
46
  attr_reader :options, :resolver, :store, :evaluator, :instance_hash,
24
47
  :config_loader, :telemetry_reporter
25
48
 
@@ -48,17 +71,23 @@ module Quonfig
48
71
  @sse_state = :idle
49
72
  @sse_ever_connected = false
50
73
  @fallback_engage_timer = nil
74
+ @sse_terminal_failure = false
51
75
 
52
76
  # If the caller injected a store, we're in test/bootstrap mode; skip I/O.
53
77
  return if store
54
78
 
55
79
  if @options.datadir
56
80
  load_datadir_into_store
81
+ start_datadir_watcher if @options.data_dir_auto_reload
57
82
  else
58
83
  initialize_network_mode
59
84
  end
60
85
 
61
86
  initialize_telemetry
87
+
88
+ # Register only for non-store-injected clients (a caller-supplied store
89
+ # is the test/bootstrap path; the fork hook does not apply there).
90
+ self.class.register_instance(self) unless store
62
91
  end
63
92
 
64
93
  # ---- Lookup --------------------------------------------------------
@@ -264,34 +293,57 @@ module Quonfig
264
293
 
265
294
  def stop
266
295
  @stopped = true
267
- begin
268
- @sse_client&.close
269
- rescue StandardError => e
270
- LOG.debug "Error closing SSE client: #{e.message}"
271
- end
272
- @sse_client = nil
296
+ tear_down_threaded_components!
297
+ end
273
298
 
274
- cancel_fallback_engage_timer
299
+ # qfg-ryov: pre-fork hook. Close the SSE worker, polling supervisor,
300
+ # telemetry reporter, and any fallback-engage timer. Idempotent — calling
301
+ # twice is safe. Does NOT set @stopped: the client is still expected to
302
+ # be usable post-fork via after_fork_in_child.
303
+ #
304
+ # Why this matters: Ruby threads do not survive fork(2). If we let the
305
+ # child inherit a live Net::HTTP socket, both processes read from the
306
+ # same fd and corrupt each other's bytes. Closing in the parent before
307
+ # fork is the only safe shape.
308
+ def before_fork_in_parent
309
+ return if @stopped
275
310
 
276
- begin
277
- @poll_supervisor&.stop
278
- rescue StandardError => e
279
- LOG.debug "Error stopping poll supervisor: #{e.message}"
311
+ tear_down_threaded_components!
312
+ end
313
+
314
+ # qfg-ryov: post-fork (in child) hook. Re-establish whatever threaded
315
+ # components the client had pre-fork. No-op if the client was already
316
+ # stopped (the customer asked for it to be dead, so do not resurrect it).
317
+ # In datadir mode, only the auto-reload watcher is restarted (when enabled).
318
+ def after_fork_in_child
319
+ return if @stopped
320
+
321
+ if @options.datadir
322
+ start_datadir_watcher if @options.data_dir_auto_reload
323
+ return
280
324
  end
281
- @poll_supervisor = nil
282
325
 
283
- begin
284
- @telemetry_reporter&.stop
285
- rescue StandardError => e
286
- LOG.debug "Error stopping telemetry reporter: #{e.message}"
326
+ return if @config_loader.nil? # never finished network init (e.g. invalid key)
327
+
328
+ # SSE state machine carries flags that no longer apply in the child
329
+ # (the parent had connected, the parent had errored, etc.). Reset.
330
+ @state_mutex.synchronize do
331
+ @sse_state = :idle
332
+ @sse_ever_connected = false
333
+ @sse_terminal_failure = false
287
334
  end
288
- @telemetry_reporter = nil
335
+
336
+ sse_started = @options.enable_sse && start_sse
337
+ start_polling if @options.enable_polling && !sse_started
338
+
339
+ restart_telemetry_in_child
289
340
  end
290
341
 
291
342
  # quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
292
343
  # Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
293
- # incremented on every on_error edge from ld-eventsource (qfg-ll6r).
294
- # Layer 2 (HTTP polling fallback) is wired through Quonfig::WorkerSupervisor.
344
+ # incremented once per reconnect attempt by the SDK-owned reconnect
345
+ # loop (qfg-35sm). Layer 2 (HTTP polling fallback) is wired through
346
+ # Quonfig::WorkerSupervisor.
295
347
  #
296
348
  # Pass +layer:+ ('1' or '2') to read a single layer; default returns the
297
349
  # sum across both layers so the chaos harness (and operators) can pull
@@ -357,6 +409,48 @@ module Quonfig
357
409
 
358
410
  private
359
411
 
412
+ # Close every threaded component and drop its reference. Used by both
413
+ # +stop+ (where @stopped is also flipped) and +before_fork_in_parent+
414
+ # (where @stopped is left alone so the child can restart).
415
+ def tear_down_threaded_components!
416
+ begin
417
+ @sse_client&.close
418
+ rescue StandardError => e
419
+ LOG.debug "Error closing SSE client: #{e.message}"
420
+ end
421
+ @sse_client = nil
422
+
423
+ cancel_fallback_engage_timer
424
+
425
+ begin
426
+ @poll_supervisor&.stop
427
+ rescue StandardError => e
428
+ LOG.debug "Error stopping poll supervisor: #{e.message}"
429
+ end
430
+ @poll_supervisor = nil
431
+
432
+ begin
433
+ @telemetry_reporter&.stop
434
+ rescue StandardError => e
435
+ LOG.debug "Error stopping telemetry reporter: #{e.message}"
436
+ end
437
+ @telemetry_reporter = nil
438
+
439
+ begin
440
+ @datadir_watcher&.stop
441
+ rescue StandardError => e
442
+ LOG.debug "Error stopping datadir watcher: #{e.message}"
443
+ end
444
+ @datadir_watcher = nil
445
+ end
446
+
447
+ # Rebuild the telemetry reporter in the child after fork. Mirrors the
448
+ # original initialize_telemetry path — fresh aggregators, fresh reporter.
449
+ def restart_telemetry_in_child
450
+ @telemetry_reporter = nil
451
+ initialize_telemetry
452
+ end
453
+
360
454
  # Stamp +last_successful_refresh+ at install time. Called by every code
361
455
  # path that hands an envelope to the cache: datadir load, initial HTTP
362
456
  # fetch, SSE event apply, and polling worker fetch.
@@ -402,20 +496,31 @@ module Quonfig
402
496
  @sse_error_callback ||= ->(error) { handle_sse_error(error) }
403
497
  end
404
498
 
405
- def handle_sse_error(_error)
499
+ def handle_sse_error(error)
500
+ # qfg-i5xv: classify terminal HTTP failures (401/403/404). The same SDK
501
+ # key that won't auth over SSE won't auth over HTTP polling either, so
502
+ # we must NOT engage the Layer 2 fallback — that just moves the
503
+ # auth-failure storm from one endpoint to another. Once flipped,
504
+ # @sse_terminal_failure latches: a buggy customer retry loop cannot
505
+ # un-classify the failure by driving the state machine.
506
+ @state_mutex.synchronize { @sse_terminal_failure = true } if error.is_a?(Quonfig::SSEConfigClient::SSEHTTPTerminalError)
406
507
  handle_sse_state_change(:error)
407
508
  end
408
509
 
409
510
  def handle_sse_state_change(new_state)
410
511
  state = new_state.to_sym
411
- ever_connected = @state_mutex.synchronize do
512
+ ever_connected, terminal = @state_mutex.synchronize do
412
513
  @sse_state = state
413
514
  @sse_ever_connected = true if state == :connected
414
- @sse_ever_connected
515
+ [@sse_ever_connected, @sse_terminal_failure]
415
516
  end
416
517
 
417
518
  return unless @options.respond_to?(:enable_polling) && @options.enable_polling
418
519
  return if @stopped
520
+ # qfg-i5xv: a terminal SSE classification suppresses polling engage in
521
+ # every branch — the customer's key is bad and HTTP polling will fail
522
+ # identically. Operators surface this via #terminal_failure?.
523
+ return if terminal
419
524
 
420
525
  case state
421
526
  when :connected
@@ -430,6 +535,21 @@ module Quonfig
430
535
  end
431
536
  end
432
537
 
538
+ public
539
+
540
+ # qfg-i5xv: true once the SSE layer has classified an HTTP response as
541
+ # terminal (401/403/404) — bad SDK key, revoked workspace permission,
542
+ # or wrong endpoint. The classification latches: the SDK will not
543
+ # auto-recover, and a customer-supplied retry must rebuild the client.
544
+ # Surfaced for operator alerting; `connection_state` still reports
545
+ # `:disconnected` to honor the documented connection_state vocabulary
546
+ # (supervisor-test-contract.md §"connectionState()" — values fixed).
547
+ def terminal_failure?
548
+ @state_mutex.synchronize { @sse_terminal_failure }
549
+ end
550
+
551
+ private
552
+
433
553
  def cancel_fallback_engage_timer
434
554
  timer = @state_mutex.synchronize do
435
555
  t = @fallback_engage_timer
@@ -568,10 +688,60 @@ module Quonfig
568
688
 
569
689
  def load_datadir_into_store
570
690
  envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
691
+ apply_datadir_envelope(envelope)
692
+ end
693
+
694
+ # Apply a freshly loaded datadir envelope to the store. Keys that were
695
+ # present before but missing now are deleted, so a `rm configs/foo.json`
696
+ # propagates through the auto-reload path. Records a refresh timestamp.
697
+ # Caller is responsible for firing on_update.
698
+ def apply_datadir_envelope(envelope)
699
+ new_keys = envelope.configs.map { |cfg| cfg['key'] }.compact.to_set
700
+ old_keys = @store.keys.to_set
701
+ (old_keys - new_keys).each { |k| @store.delete(k) }
571
702
  envelope.configs.each { |cfg| @store.set(cfg['key'], cfg) }
572
703
  record_refresh!
573
704
  end
574
705
 
706
+ # qfg-mol-2da: start the filesystem watcher for datadir auto-reload.
707
+ # On listen-registration failure (read-only fs, missing native backend),
708
+ # log and continue without watching — the SDK keeps serving the envelope
709
+ # captured at init.
710
+ def start_datadir_watcher
711
+ return unless @options.datadir
712
+
713
+ watcher = Quonfig::DatadirWatcher.new(
714
+ datadir: @options.datadir,
715
+ debounce_ms: @options.data_dir_auto_reload_debounce_ms,
716
+ on_change: -> { reload_datadir! },
717
+ on_error: ->(err) { LOG.warn "[quonfig] datadir watcher error: #{err.class}: #{err.message}" }
718
+ )
719
+ unless watcher.start
720
+ LOG.warn '[quonfig] data_dir_auto_reload requested but watcher registration failed; continuing without auto-reload'
721
+ return
722
+ end
723
+ @datadir_watcher = watcher
724
+ end
725
+
726
+ # Re-read the datadir into a fresh envelope and atomically install it.
727
+ # Parse errors (mid-write JSON, garbage file) are logged and swallowed:
728
+ # the previous envelope stays in the store and on_update does NOT fire.
729
+ # qfg-mol-2da.
730
+ def reload_datadir!
731
+ return if @stopped
732
+ return unless @options.datadir
733
+
734
+ begin
735
+ envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
736
+ rescue StandardError => e
737
+ LOG.warn "[quonfig] datadir reload failed; keeping previous envelope: #{e.class}: #{e.message}"
738
+ return
739
+ end
740
+
741
+ apply_datadir_envelope(envelope)
742
+ notify_on_update_callback
743
+ end
744
+
575
745
  # Initialize network mode: sync HTTP fetch (bounded by
576
746
  # initialization_timeout_sec) then start SSE + polling as requested.
577
747
  def initialize_network_mode
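The reload path in the hunk above (parse first, then swap, dropping keys that disappeared) can be sketched in isolation. This is an illustration under assumptions: a plain `Hash` stands in for the store, the envelope is assumed to be a JSON array of objects with a `"key"` field, and `reload!` is a hypothetical helper rather than the SDK method.

```ruby
require 'json'

# Sketch only: parse-then-swap against a Hash-backed store.
# Returns true on a successful swap, false when parsing fails.
def reload!(store, raw_json, on_update:)
  begin
    configs = JSON.parse(raw_json)
  rescue JSON::ParserError
    return false # keep the previous contents and skip the callback
  end

  new_keys = configs.map { |cfg| cfg['key'] }.compact
  # Remove entries whose key is no longer present on disk.
  (store.keys - new_keys).each { |key| store.delete(key) }
  configs.each do |cfg|
    next unless cfg['key']

    store[cfg['key']] = cfg
  end
  on_update.call
  true
end
```

Because parsing happens before any mutation, a truncated file leaves the store exactly as it was, and the callback fires only after a successful swap.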
@@ -904,4 +1074,42 @@ module Quonfig
904
1074
  end
905
1075
  end
906
1076
  end
1077
+
1078
+ # qfg-ryov: hook into Process._fork so customers using Puma's clustered
1079
+ # mode (or any preload/fork-worker server) don't have to wire
1080
+ # +before_fork+/+on_worker_boot+ manually. Ruby 3.1+ routes every
1081
+ # +Kernel#fork+/+Process.fork+ call through +Process._fork+, so a single
1082
+ # prepend covers them all.
1083
+ #
1084
+ # Process._fork's contract:
1085
+ # - Called in the parent process before the fork syscall.
1086
+ # - Returns 0 in the child, child's pid in the parent.
1087
+ # - +super+ performs the actual fork.
1088
+ #
1089
+ # The parent's view: SSE/polling/telemetry threads are torn down before
1090
+ # the syscall so the child does not inherit a live Net::HTTP socket fd
1091
+ # (which would corrupt both sides). The parent does NOT auto-restart —
1092
+ # that mirrors the Puma master use case where the master process no
1093
+ # longer serves requests after spawning workers.
1094
+ module ForkSafety
1095
+ def _fork
1096
+ Quonfig::Client.each_instance(&:before_fork_in_parent)
1097
+ pid = super
1098
+ Quonfig::Client.each_instance(&:after_fork_in_child) if pid.zero?
1099
+ pid
1100
+ rescue StandardError => e
1101
+ # Fork-hook failures must never break the customer's fork. Worst case
1102
+ # the child inherits dead SSE threads (the pre-qfg-ryov behavior) —
1103
+ # bad, but recoverable. Crashing the fork itself is not.
1104
+ Quonfig::Client::LOG.error "Quonfig fork hook error: #{e.class}: #{e.message}"
1105
+ raise if pid.nil? # super never returned — propagate fork failures
1106
+
1107
+ pid
1108
+ end
1109
+ end
1110
+
1111
+ # Ruby 3.0 lacks Process._fork. There's no hookable choke point on 3.0, so
1112
+ # customers must keep wiring their own Puma before_fork / on_worker_boot
1113
+ # (see README "Rails integration"). On 3.1+ we install the hook globally.
1114
+ Process.singleton_class.prepend(ForkSafety) if Process.respond_to?(:_fork)
907
1115
  end
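To see the `Process._fork` pattern in isolation, here is a self-contained sketch that uses a stand-in component instead of a real client. `FakeComponent`, `COMPONENTS`, and `DemoForkHook` are hypothetical names for illustration only. The sketch assumes Ruby 3.1+ on a platform that supports `fork`, and the hook is skipped elsewhere.

```ruby
# Sketch only: a stand-in for a client that owns one background thread.
class FakeComponent
  def initialize
    start
  end

  def start
    @thread = Thread.new { sleep }
  end

  def stop
    @thread&.kill
    @thread = nil
  end

  def running?
    !@thread.nil? && @thread.alive?
  end
end

# Plain array used as the registry in this demo (the diff uses a WeakMap).
COMPONENTS = []

module DemoForkHook
  def _fork
    COMPONENTS.each(&:stop)               # parent, before the syscall
    pid = super                           # performs the actual fork
    COMPONENTS.each(&:start) if pid.zero? # child only: rebuild threads
    pid
  end
end

# Install only where Process._fork exists (Ruby 3.1+).
Process.singleton_class.prepend(DemoForkHook) if Process.respond_to?(:_fork)
```

With the hook installed, a component registered in `COMPONENTS` is stopped in the parent when `fork` is called and started again in the child. As in the diff above, the parent is not restarted automatically.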