quonfig 0.0.14 → 0.0.16

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: b25ea20d7f44acff4ed82e17522a9fb6055791c4f1e0c861075974e5ae37421f
4
- data.tar.gz: e0c260d2d13926e21f2525c7686a24f8dec2f1fa998efa039db59baf4447cd60
3
+ metadata.gz: 6f167f3b60db07394dc7c49b85c3dbc196e0b5c82f3426b35695c0f212339b8b
4
+ data.tar.gz: 4b79e1196c4625359943255a348d907c28865a5cd85432ac464737406d7a6169
5
5
  SHA512:
6
- metadata.gz: da91dbd4f9cc300f2dab9e8f39a73033e642d94272288cbcacf4358eb28f4f9b064f8fbe8301c5c26e1b342cd3cd76179d362029e06379bcac39685c3a050cb2
7
- data.tar.gz: ac77088e6a6e0256d947f40b26abb9527bb55cff8a3fa39eaaebf91c43746379d5fa2325bd06e049922b2cbc8521f78252bdd79106c6f1ae7f1a0264f4033ab6
6
+ metadata.gz: 5aa3a23774245bf31752e4c9918de8bf37cc865e15b6ed160b222181d805a0fe477064cc5cf27dc6810b87cdd1c250f8558b650c98fd5e7a354c3e2e70090c53
7
+ data.tar.gz: 82c4561817b40e4dd0ecfd1b5267e6a5f41ea2e774c2d2537400aaf04886eb579807da457f6964915a065a32dc7beea043a2797d83ff04dba3c9fb4e46c39cb3
data/CHANGELOG.md CHANGED
@@ -1,5 +1,19 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.0.16 - 2026-05-15
4
+
5
+ - **Feat (SSE): replace `ld-eventsource` with an SDK-owned reconnect loop (qfg-35sm).** sdk-ruby was the outlier among the four backend SDKs — sdk-go, sdk-node, and sdk-python all own their reconnect loop, only sdk-ruby handed it off to a library and scraped its log output to observe reconnects. The wire format we actually consume (plain JSON envelopes in single-line `data:` frames, no named events, no retry directives) is trivial enough that an SDK-owned loop is clearer than the library wrapper. New `Quonfig::SSEConfigClient` (~520 LoC, `lib/quonfig/sse_config_client.rb`) handles connect/parse/reconnect end-to-end. `restart_total` is now incremented at exactly one site under a mutex — verifiable, not log-scraped. `ld-eventsource` and the transitive `http` gem are removed from the gemspec. `ReconnectCountingLogger` and the `sse_reconnect_reset_interval` option (both 0.0.15-era defensive scaffolding around upstream behavior) are deleted — the bugs they defended against don't exist when the SDK owns the loop. Chaos: 10/10 in a 36-min run (scenarios 02 silent-stall, 05 sse-down-fallback, 09 flapping kill-storm).
6
+ - **Fix (SSE): contain watchdog `Thread#raise` with `Thread.handle_interrupt` (qfg-tj18).** The new watchdog fires `Thread#raise(SSEReadDeadlineExceeded)` into the worker on a silent stall — the only reliable cross-platform way to unblock a Ruby thread blocked in `Net::HTTP`'s body-read on macOS. The decision to fire is mutex-guarded against `stop()`, but the raise itself is delivered at the worker's next interrupt checkpoint, which could be anywhere in the call stack. `run_loop`'s body now runs under `Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)` so a late-landing raise can only land inside a blocking call; the `read_body` block explicitly switches to `:immediate`. A paranoid backstop `rescue` outside the until-`@stopped` loop ensures an escaped raise can never silently kill the worker.
7
+ - **Fix (SSE): isolate `on_envelope` callback exceptions (qfg-m3lk).** A buggy user-supplied listener that raised during envelope delivery used to propagate out of `read_body`, get caught by `run_loop` as a transport error, bump `restart_total`, and reconnect — a perpetual reconnect storm at api-delivery-sse driven by a customer code bug. The callback is now wrapped in `begin/rescue StandardError` at the invocation site; exceptions are logged with class + message + backtrace sample and the stream continues uninterrupted. `Interrupt` and `SystemExit` are deliberately not caught so `Ctrl-C` still works.
8
+ - **Fix (SSE): classify 401/403/404 as terminal errors (qfg-i5xv).** Non-200 responses used to be treated identically — `SSEHTTPStatusError` raised, `run_loop` rescued, `restart_total` bumped, backoff, retry, forever. For a bad SDK key (401) or revoked workspace (403) that was wasted load on api-delivery-sse with no recovery path short of a customer redeploy. New `SSEHTTPTerminalError` sentinel for 401/403/404; `run_loop` catches it, invokes `on_error`, exits the loop without bumping `restart_total`. Parent `Quonfig::Client` surfaces a terminal `:sse_terminal_failure` state distinct from transient `:error`. 429 and 5xx still retry.
9
+ - **Feat (fork): install `Process._fork` hook so SSE auto-restarts after fork (qfg-ryov).** Ruby threads do not survive `fork(2)`. Customers initializing `Quonfig::Client` in the Puma master (the `preload_app! true` / Rails `config.eager_load = true` convention) used to silently lose SSE in every worker child. New `Quonfig::ForkSafety` module prepends `Process._fork` and fans out across all live `Quonfig::Client` instances (tracked in an `ObjectSpace::WeakMap`): in the parent before the syscall, threaded components (SSE worker, polling supervisor, telemetry reporter) are torn down; in the child after the syscall, they are rebuilt. `@stopped` is preserved so a `stop()`-ed client stays stopped across fork. Covers `Process.fork` / `Kernel#fork`; `Process.spawn` and `system("...")` exec a new program so in-process state doesn't apply. Ruby 3.0 lacks `Process._fork` and is documented as requiring manual `before_fork` / `on_worker_boot` wiring.
10
+
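The status handling described in the qfg-i5xv entry can be sketched roughly as follows. This is an illustration only, not the gem's implementation: `classify_sse_status` and `simulate_run_loop` are hypothetical names, and treating every other non-200 status as retryable is an assumption (the changelog only names 429 and 5xx).

```ruby
# Hypothetical sketch of the terminal-vs-retry split described above.
# None of these names come from the gem.
TERMINAL_STATUSES = [401, 403, 404].freeze

# Returns :ok, :terminal, or :retry for an HTTP status code.
# Assumption: any non-200 status that is not terminal is retried.
def classify_sse_status(status)
  return :ok if status == 200
  return :terminal if TERMINAL_STATUSES.include?(status)

  :retry
end

# Walks a list of simulated response statuses the way a reconnect loop
# might: retryable failures bump the restart counter, a terminal failure
# exits the loop without bumping it.
def simulate_run_loop(statuses)
  restart_total = 0
  statuses.each do |status|
    case classify_sse_status(status)
    when :ok
      return { outcome: :connected, restart_total: restart_total }
    when :terminal
      return { outcome: :sse_terminal_failure, restart_total: restart_total }
    else
      restart_total += 1
    end
  end
  { outcome: :still_retrying, restart_total: restart_total }
end
```

Calling `simulate_run_loop([503, 429, 401])` counts two restarts for the retryable responses and then stops on the 401 without counting a third.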
11
+ ## 0.0.15 - 2026-05-15
12
+
13
+ - **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
14
+ - **Fix (SSE): backoff reset interval (qfg-ie49).** New `sse_reconnect_reset_interval` option, default `1s`. ld-eventsource's 60s default lets the backoff run away under flapping — the SDK is mid-sleep when later kills land and never observes them. 1s mirrors sdk-python's reset-on-every-successful-connect behavior. Sustained outages still back off exponentially (`mark_success` is never called, so the reset never triggers).
15
+ - **Fix (SSE): make `ReconnectCountingLogger` raise-proof (qfg-cf52).** ld-eventsource calls the logger from inside a bare-`Thread` `run_stream` loop with several call sites unguarded by `rescue`. A throwing wrapper would kill the worker with `@stopped=false`, leaving `closed?` false forever — silently wedging the SSE stream (the intermittent chaos scenario 05 flake). Every wrapper step is now independently rescued.
16
+
3
17
  ## 0.0.14 - 2026-05-10
4
18
 
5
19
  - **Feat: expose `variant` and `flag_metadata` on `EvaluationDetails` (qfg-9dbl).** OpenFeature's `EvaluationDetails` Ruby return type now carries the variant name and the flag-level metadata hash alongside the resolved value/reason. Brings sdk-ruby to parity with the other SDKs' detail surfaces and lets host apps (incl. the Ruby OpenFeature provider) read variant/metadata without re-fetching the config.
data/README.md CHANGED
@@ -247,15 +247,41 @@ not want to inherit across a `fork(2)`. Forked threads in the child process
247
247
  are dead — the SSE socket is held open by a thread that no longer exists, and
248
248
  the child silently stops receiving live updates.
249
249
 
250
- Use `Quonfig::Client#fork` (or `Quonfig.fork` if you use the module-level
251
- singleton) in any process that fork-spawns workers. It returns a fresh client
252
- configured for the child: a new `ConfigStore`, a new SSE subscription, and
253
- suppressed telemetry double-counting (`Options#is_fork` is set to `true`).
250
+ **On Ruby 3.1+ the SDK installs a `Process._fork` hook at load time** that
251
+ automatically tears down threaded components in the parent and restarts them
252
+ in the child. This covers any `Process.fork` / `Kernel#fork` path: Puma's
253
+ clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Spring, and
254
+ manual `fork { ... }` calls. **No customer wiring is required.**
255
+
256
+ Caveats:
257
+
258
+ - Ruby 3.0 has no hookable choke point — fall back to manual wiring (below).
259
+ - `system("fork-and-exec ...")` and `Process.spawn` are not covered (they do
260
+ not go through `Process._fork`), but those execute a new program, so the
261
+ in-process SSE state is moot.
262
+ - The hook tears down the SSE/polling/telemetry threads in the parent before
263
+ fork (so the child does not inherit a live socket fd) and does **not**
264
+ auto-restart the parent. This mirrors the Puma master case: the master no
265
+ longer serves requests, so it does not need a live SSE connection. If you
266
+ have a non-Puma topology where the parent must keep streaming after fork,
267
+ call `Quonfig.instance.after_fork_in_child` manually in the parent after
268
+ the fork returns.
254
269
 
255
270
  ### Puma (clustered mode)
256
271
 
272
+ With the automatic fork hook, the typical Puma config needs **no Quonfig
273
+ lifecycle wiring** — initialize in your Rails initializer and let the hook
274
+ handle the rest:
275
+
276
+ ```ruby
277
+ # config/initializers/quonfig.rb
278
+ Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
279
+ ```
280
+
281
+ If you're on Ruby 3.0 (no `Process._fork`), wire the legacy hooks manually:
282
+
257
283
  ```ruby
258
- # config/puma.rb
284
+ # config/puma.rb (Ruby 3.0 only)
259
285
  before_fork do
260
286
  Quonfig.instance.stop # close the master's SSE before forking
261
287
  end
@@ -265,18 +291,18 @@ on_worker_boot do
265
291
  end
266
292
  ```
267
293
 
268
- If you initialize Quonfig lazily (in a Rails initializer) and run Puma in
269
- single mode (no clustering), no fork hook is needed.
270
-
271
294
  ### Sidekiq
272
295
 
273
- Sidekiq's parent process forks workers. Wire the same lifecycle:
296
+ On Ruby 3.1+ the automatic fork hook covers Sidekiq workers too, with no
297
+ `configure_server` wiring required.
298
+
299
+ On Ruby 3.0:
274
300
 
275
301
  ```ruby
276
302
  # config/initializers/quonfig.rb
277
303
  Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
278
304
 
279
- # config/initializers/sidekiq.rb
305
+ # config/initializers/sidekiq.rb (Ruby 3.0 only)
280
306
  Sidekiq.configure_server do |config|
281
307
  config.on(:startup) { Quonfig.fork if Process.ppid != 1 }
282
308
  config.on(:shutdown) { Quonfig.instance.stop rescue nil }
@@ -284,7 +310,7 @@ end
284
310
  ```
285
311
 
286
312
  For Sidekiq web/CLI processes that don't fork (default `concurrency: 1`),
287
- `Quonfig.init` in the initializer is sufficient.
313
+ `Quonfig.init` in the initializer is sufficient on any Ruby version.
288
314
 
289
315
  ### Spring / Bootsnap preloaders
290
316
 
@@ -333,6 +359,24 @@ converge once the envelope finishes applying.
333
359
  `Quonfig.fork` is the only safe way to "carry" a client across `Process.fork`
334
360
  — do not reuse the parent's client in a child process.
335
361
 
362
+ ## Diagnostic health signals
363
+
364
+ `Quonfig::Client` exposes two read-only getters for monitoring SDK liveness:
365
+
366
+ - `client.last_successful_refresh` — a `Time` (UTC) marking the most recent
367
+ envelope install (any source: datadir, initial HTTP fetch, SSE, or fallback
368
+ polling). Returns `nil` before the first install. Preserved across `stop`.
369
+ - `client.connection_state` — a `Symbol` describing the aggregate state:
370
+ `:initializing`, `:connected`, `:disconnected`, or `:falling_back`.
371
+
372
+ > Do not wire `last_successful_refresh` or `connection_state` directly into a Kubernetes liveness probe. These signals are diagnostic, not pass/fail. A liveness probe based on SDK freshness will amplify transient network blips into restart cascades.
373
+
374
+ Compose your own threshold from the two getters if you need a dashboard signal
375
+ — but route alerts through a metrics pipeline, not a probe that restarts the
376
+ process.
377
+
378
+ There is intentionally no `client.healthy?` primitive.
379
+
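One way to compose a dashboard signal from the two getters is sketched below. Only `last_successful_refresh` and `connection_state` come from the README; `quonfig_health_snapshot`, the `stale_after` default of 300 seconds, and the `FakeClient` stand-in are assumptions made so the example runs without the gem.

```ruby
# Hypothetical helper: builds a metrics-friendly snapshot from the two
# documented getters. The 300-second threshold is an arbitrary example.
def quonfig_health_snapshot(client, now: Time.now.utc, stale_after: 300)
  last = client.last_successful_refresh
  age = last ? (now - last) : nil
  {
    connection_state: client.connection_state,
    seconds_since_refresh: age,
    # Treat "never refreshed" as stale for dashboard purposes.
    stale: age.nil? || age > stale_after
  }
end

# Stand-in for Quonfig::Client, used only to exercise the helper.
FakeClient = Struct.new(:last_successful_refresh, :connection_state)
```

The returned hash is meant to be emitted as gauges or tags to a metrics pipeline, in line with the guidance above, rather than used as a probe result.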
336
380
  ## Documentation
337
381
 
338
382
  Full documentation, including SPEC, SDK reference, and operational guides, is
@@ -20,6 +20,29 @@ module Quonfig
20
20
  class Client
21
21
  LOG = Quonfig::InternalLogger.new(self)
22
22
 
23
+ # qfg-ryov: instance registry for the Process._fork hook. Every live
24
+ # Client is tracked here so the hook can fan out before_fork_in_parent /
25
+ # after_fork_in_child across all of them without the customer needing to
26
+ # name a specific instance. ObjectSpace::WeakMap means a Client that goes
27
+ # out of scope is GC'd without leaking through this registry. Stopped
28
+ # Clients stay in the registry until GC; both fork hooks early-return on
29
+ # +@stopped+ so a stopped instance is effectively a no-op. (We don't use
30
+ # WeakMap#delete because it was added in Ruby 3.3 and the matrix still
31
+ # includes 3.2.)
32
+ @instances = ObjectSpace::WeakMap.new
33
+ @instances_mutex = Mutex.new
34
+
35
+ class << self
36
+ # Iterate live Client instances. Used by Quonfig::ForkSafety.
37
+ def each_instance(&block)
38
+ @instances_mutex.synchronize { @instances.keys }.each(&block)
39
+ end
40
+
41
+ def register_instance(client)
42
+ @instances_mutex.synchronize { @instances[client] = true }
43
+ end
44
+ end
45
+
23
46
  attr_reader :options, :resolver, :store, :evaluator, :instance_hash,
24
47
  :config_loader, :telemetry_reporter
25
48
 
@@ -40,9 +63,15 @@ module Quonfig
40
63
  @resolver = Quonfig::Resolver.new(@store, @evaluator)
41
64
  @semantic_logger_filters = {}
42
65
  @sse_client = nil
43
- @poll_thread = nil
66
+ @poll_supervisor = nil
44
67
  @stopped = false
45
68
  @telemetry_reporter = nil
69
+ @state_mutex = Mutex.new
70
+ @last_successful_refresh = nil
71
+ @sse_state = :idle
72
+ @sse_ever_connected = false
73
+ @fallback_engage_timer = nil
74
+ @sse_terminal_failure = false
46
75
 
47
76
  # If the caller injected a store, we're in test/bootstrap mode; skip I/O.
48
77
  return if store
@@ -54,6 +83,10 @@ module Quonfig
54
83
  end
55
84
 
56
85
  initialize_telemetry
86
+
87
+ # Register only for non-store-injected clients (a caller-supplied store
88
+ # is the test/bootstrap path; the fork hook does not apply there).
89
+ self.class.register_instance(self) unless store
57
90
  end
58
91
 
59
92
  # ---- Lookup --------------------------------------------------------
@@ -259,6 +292,121 @@ module Quonfig
259
292
 
260
293
  def stop
261
294
  @stopped = true
295
+ tear_down_threaded_components!
296
+ end
297
+
298
+ # qfg-ryov: pre-fork hook. Close the SSE worker, polling supervisor,
299
+ # telemetry reporter, and any fallback-engage timer. Idempotent — calling
300
+ # twice is safe. Does NOT set @stopped: the client is still expected to
301
+ # be usable post-fork via after_fork_in_child.
302
+ #
303
+ # Why this matters: Ruby threads do not survive fork(2). If we let the
304
+ # child inherit a live Net::HTTP socket, both processes read from the
305
+ # same fd and corrupt each other's bytes. Closing in the parent before
306
+ # fork is the only safe shape.
307
+ def before_fork_in_parent
308
+ return if @stopped
309
+
310
+ tear_down_threaded_components!
311
+ end
312
+
313
+ # qfg-ryov: post-fork (in child) hook. Re-establish whatever threaded
314
+ # components the client had pre-fork. No-op if the client was already
315
+ # stopped (the customer asked for it to be dead — do not resurrect),
316
+ # or if the client is in datadir mode (no threaded components to start).
317
+ def after_fork_in_child
318
+ return if @stopped
319
+ return if @options.datadir
320
+ return if @config_loader.nil? # never finished network init (e.g. invalid key)
321
+
322
+ # SSE state machine carries flags that no longer apply in the child
323
+ # (the parent had connected, the parent had errored, etc.). Reset.
324
+ @state_mutex.synchronize do
325
+ @sse_state = :idle
326
+ @sse_ever_connected = false
327
+ @sse_terminal_failure = false
328
+ end
329
+
330
+ sse_started = @options.enable_sse && start_sse
331
+ start_polling if @options.enable_polling && !sse_started
332
+
333
+ restart_telemetry_in_child
334
+ end
335
+
336
+ # quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
337
+ # Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
338
+ # incremented once per reconnect attempt by the SDK-owned reconnect
339
+ # loop (qfg-35sm). Layer 2 (HTTP polling fallback) is wired through
340
+ # Quonfig::WorkerSupervisor.
341
+ #
342
+ # Pass +layer:+ ('1' or '2') to read a single layer; default returns the
343
+ # sum across both layers so the chaos harness (and operators) can pull
344
+ # per-layer values explicitly while preserving the previous single-number
345
+ # diagnostic surface.
346
+ def worker_restart_total(layer: nil)
347
+ case layer&.to_s
348
+ when '1' then sse_restart_total
349
+ when '2' then poll_restart_total
350
+ else sse_restart_total + poll_restart_total
351
+ end
352
+ end
353
+
354
+ # Wall-clock time of the last installed envelope (any source: datadir,
355
+ # initial HTTP fetch, SSE, or polling fallback). +nil+ before the first
356
+ # install. Preserved after +stop+.
357
+ #
358
+ # **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
359
+ # — a transient network blip will trip any freshness threshold and cause
360
+ # a rolling restart cascade. See the README "Diagnostic health signals"
361
+ # section.
362
+ #
363
+ # Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
364
+ def last_successful_refresh
365
+ @state_mutex.synchronize { @last_successful_refresh }
366
+ end
367
+
368
+ # Aggregate connection state. Returns one of:
369
+ #
370
+ # - +:initializing+ — no envelope has been installed and SSE is not yet
371
+ # connected.
372
+ # - +:connected+ — SSE is live, or the SDK is delivering configs from a
373
+ # loaded envelope (datadir mode or post-initial-fetch with no SSE).
374
+ # - +:disconnected+ — +stop+ was called, or SSE errored and no fallback
375
+ # poller is active.
376
+ # - +:falling_back+ — the Layer 2 HTTP polling supervisor is alive and
377
+ # serving as the active update channel.
378
+ #
379
+ # **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
380
+ # — see the README "Diagnostic health signals" section.
381
+ #
382
+ # Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
383
+ def connection_state
384
+ @state_mutex.synchronize do
385
+ next :disconnected if @stopped
386
+ next :falling_back if @poll_supervisor&.alive?
387
+ next :connected if @sse_state == :connected
388
+ next :disconnected if @sse_state == :error
389
+
390
+ # No SSE state change yet: state is driven by whether any envelope
391
+ # has been installed (datadir / initial fetch).
392
+ @last_successful_refresh.nil? ? :initializing : :connected
393
+ end
394
+ end
395
+
396
+ def fork
397
+ self.class.new(@options.for_fork)
398
+ end
399
+
400
+ def inspect
401
+ "#<Quonfig::Client:#{object_id} environment=#{@options.environment.inspect}>"
402
+ end
403
+
404
+ private
405
+
406
+ # Close every threaded component and drop its reference. Used by both
407
+ # +stop+ (where @stopped is also flipped) and +before_fork_in_parent+
408
+ # (where @stopped is left alone so the child can restart).
409
+ def tear_down_threaded_components!
262
410
  begin
263
411
  @sse_client&.close
264
412
  rescue StandardError => e
@@ -266,9 +414,14 @@ module Quonfig
266
414
  end
267
415
  @sse_client = nil
268
416
 
269
- thread = @poll_thread
270
- @poll_thread = nil
271
- thread&.kill
417
+ cancel_fallback_engage_timer
418
+
419
+ begin
420
+ @poll_supervisor&.stop
421
+ rescue StandardError => e
422
+ LOG.debug "Error stopping poll supervisor: #{e.message}"
423
+ end
424
+ @poll_supervisor = nil
272
425
 
273
426
  begin
274
427
  @telemetry_reporter&.stop
@@ -278,16 +431,161 @@ module Quonfig
278
431
  @telemetry_reporter = nil
279
432
  end
280
433
 
281
- def fork
282
- self.class.new(@options.for_fork)
434
+ # Rebuild the telemetry reporter in the child after fork. Mirrors the
435
+ # original initialize_telemetry path — fresh aggregators, fresh reporter.
436
+ def restart_telemetry_in_child
437
+ @telemetry_reporter = nil
438
+ initialize_telemetry
283
439
  end
284
440
 
285
- def inspect
286
- "#<Quonfig::Client:#{object_id} environment=#{@options.environment.inspect}>"
441
+ # Stamp +last_successful_refresh+ at install time. Called by every code
442
+ # path that hands an envelope to the cache: datadir load, initial HTTP
443
+ # fetch, SSE event apply, and polling worker fetch.
444
+ def record_refresh!
445
+ @state_mutex.synchronize { @last_successful_refresh = Time.now.utc }
446
+ end
447
+
448
+ def sse_restart_total
449
+ sse = @sse_client
450
+ return 0 if sse.nil?
451
+ return 0 unless sse.respond_to?(:restart_total)
452
+
453
+ sse.restart_total.to_i
454
+ end
455
+
456
+ def poll_restart_total
457
+ sup = @poll_supervisor
458
+ return 0 if sup.nil?
459
+ return 0 unless sup.respond_to?(:worker_restart_total)
460
+
461
+ sup.worker_restart_total.to_i
462
+ end
463
+
464
+ # Drive the SSE-side of the connection_state machine. The SSE client
465
+ # invokes this on connect/error edges; tests call it directly via +send+.
466
+ # Documented values: :idle, :connecting, :connected, :error.
467
+ #
468
+ # Also drives the Layer 2 fallback poller's engage/disengage:
469
+ # - :connected clears any pending engage timer and stops an active
470
+ # fallback poller (SSE recovered, drop the second channel).
471
+ # - :error before any successful connect engages immediately
472
+ # (initial-fail path).
473
+ # - :error after a successful connect schedules a 2x-poll-interval
474
+ # grace timer; the timer engages if SSE has not recovered by then.
475
+ # Mirrors sdk-python's `_handle_sse_state_change` and sdk-node's
476
+ # `fallbackPollerActive` engagement behavior. (qfg-47c2.26)
477
+ # Stable callable handed to Quonfig::SSEConfigClient so its +on_error+
478
+ # block can drive @sse_state -> :error on a mid-run socket drop. Without
479
+ # this wiring, +connection_state+ would stay +:connected+ after a
480
+ # disconnect and customers composing staleness checks would see stale
481
+ # data. (qfg-47c2.27)
482
+ def sse_error_callback
483
+ @sse_error_callback ||= ->(error) { handle_sse_error(error) }
484
+ end
485
+
486
+ def handle_sse_error(error)
487
+ # qfg-i5xv: classify terminal HTTP failures (401/403/404). The same SDK
488
+ # key that won't auth over SSE won't auth over HTTP polling either, so
489
+ # we must NOT engage the Layer 2 fallback — that just moves the
490
+ # auth-failure storm from one endpoint to another. Once flipped,
491
+ # @sse_terminal_failure latches: a buggy customer retry loop cannot
492
+ # un-classify the failure by driving the state machine.
493
+ @state_mutex.synchronize { @sse_terminal_failure = true } if error.is_a?(Quonfig::SSEConfigClient::SSEHTTPTerminalError)
494
+ handle_sse_state_change(:error)
495
+ end
496
+
497
+ def handle_sse_state_change(new_state)
498
+ state = new_state.to_sym
499
+ ever_connected, terminal = @state_mutex.synchronize do
500
+ @sse_state = state
501
+ @sse_ever_connected = true if state == :connected
502
+ [@sse_ever_connected, @sse_terminal_failure]
503
+ end
504
+
505
+ return unless @options.respond_to?(:enable_polling) && @options.enable_polling
506
+ return if @stopped
507
+ # qfg-i5xv: a terminal SSE classification suppresses polling engage in
508
+ # every branch — the customer's key is bad and HTTP polling will fail
509
+ # identically. Operators surface this via #terminal_failure?.
510
+ return if terminal
511
+
512
+ case state
513
+ when :connected
514
+ cancel_fallback_engage_timer
515
+ stop_fallback_poller('sse-recovered')
516
+ when :error
517
+ if ever_connected
518
+ schedule_fallback_engage
519
+ else
520
+ start_polling
521
+ end
522
+ end
523
+ end
524
+
525
+ public
526
+
527
+ # qfg-i5xv: true once the SSE layer has classified an HTTP response as
528
+ # terminal (401/403/404) — bad SDK key, revoked workspace permission,
529
+ # or wrong endpoint. The classification latches: the SDK will not
530
+ # auto-recover, and a customer-supplied retry must rebuild the client.
531
+ # Surfaced for operator alerting; `connection_state` still reports
532
+ # `:disconnected` to honor the documented connection_state vocabulary
533
+ # (supervisor-test-contract.md §"connectionState()" — values fixed).
534
+ def terminal_failure?
535
+ @state_mutex.synchronize { @sse_terminal_failure }
287
536
  end
288
537
 
289
538
  private
290
539
 
540
+ def cancel_fallback_engage_timer
541
+ timer = @state_mutex.synchronize do
542
+ t = @fallback_engage_timer
543
+ @fallback_engage_timer = nil
544
+ t
545
+ end
546
+ timer&.kill if timer&.alive?
547
+ end
548
+
549
+ def stop_fallback_poller(reason)
550
+ supervisor = @state_mutex.synchronize do
551
+ s = @poll_supervisor
552
+ @poll_supervisor = nil
553
+ s
554
+ end
555
+ return if supervisor.nil?
556
+
557
+ begin
558
+ supervisor.stop
559
+ LOG.debug "[quonfig] Layer 2 fallback poller stopped (reason=#{reason})"
560
+ rescue StandardError => e
561
+ LOG.debug "Error stopping fallback poller: #{e.message}"
562
+ end
563
+ end
564
+
565
+ # Schedule a 2*poll_interval grace timer after a connected->error edge.
566
+ # If SSE recovers before the timer fires, +cancel_fallback_engage_timer+
567
+ # tears it down. Idempotent — does nothing if a timer is already pending
568
+ # or the supervisor is already alive.
569
+ def schedule_fallback_engage
570
+ poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
571
+ return if poll_interval <= 0
572
+
573
+ grace_seconds = poll_interval * 2.0
574
+
575
+ @state_mutex.synchronize do
576
+ return if @fallback_engage_timer&.alive?
577
+ return if @poll_supervisor&.alive?
578
+ return if @stopped
579
+
580
+ @fallback_engage_timer = Thread.new do
581
+ Thread.current.report_on_exception = false
582
+ sleep grace_seconds
583
+ @state_mutex.synchronize { @fallback_engage_timer = nil }
584
+ start_polling unless @stopped
585
+ end
586
+ end
587
+ end
588
+
291
589
  # Construct and start the telemetry reporter if the options permit it.
292
590
  # The reporter runs on a background thread and periodically POSTs
293
591
  # context-shape and example-context batches to +telemetry_destination+.
@@ -378,6 +676,7 @@ module Quonfig
378
676
  def load_datadir_into_store
379
677
  envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
380
678
  envelope.configs.each { |cfg| @store.set(cfg['key'], cfg) }
679
+ record_refresh!
381
680
  end
382
681
 
383
682
  # Initialize network mode: sync HTTP fetch (bounded by
@@ -412,7 +711,11 @@ module Quonfig
412
711
  return
413
712
  end
414
713
 
415
- handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls')) if result == :failed
714
+ if result == :failed
715
+ handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls'))
716
+ else
717
+ record_refresh!
718
+ end
416
719
  end
417
720
 
418
721
  def handle_init_failure(err)
@@ -429,44 +732,79 @@ module Quonfig
429
732
  def start_sse
430
733
  return false if @options.sse_api_urls.nil? || @options.sse_api_urls.empty?
431
734
 
432
- @sse_client = Quonfig::SSEConfigClient.new(@options, @config_loader)
735
+ @sse_client = Quonfig::SSEConfigClient.new(
736
+ @options,
737
+ @config_loader,
738
+ nil,
739
+ nil,
740
+ on_error: sse_error_callback
741
+ )
433
742
  @sse_client.start do |envelope, _event, _source|
434
743
  next if @stopped
435
744
 
436
745
  begin
437
746
  @config_loader.apply_envelope(envelope)
438
- @on_update&.call
747
+ handle_sse_state_change(:connected)
748
+ record_refresh!
439
749
  rescue StandardError => e
440
750
  LOG.warn "[quonfig] Error applying SSE envelope: #{e.message}"
751
+ next
441
752
  end
753
+ notify_on_update_callback
442
754
  end
443
755
  true
444
756
  rescue StandardError => e
445
757
  LOG.warn "[quonfig] SSE start failed: #{e.message}"
446
758
  @sse_client = nil
759
+ handle_sse_state_change(:error)
447
760
  false
448
761
  end
449
762
 
450
763
  def start_polling
764
+ return if @stopped
765
+ return if @poll_supervisor&.alive?
766
+
451
767
  poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
452
768
  return if poll_interval <= 0
453
769
 
454
- @poll_thread = Thread.new do
455
- Thread.current.name = 'quonfig-poller'
770
+ stopped_ref = -> { @stopped }
771
+ worker = lambda do |notify_delivered|
456
772
  loop do
457
- break if @stopped
773
+ break if stopped_ref.call
458
774
 
459
775
  sleep poll_interval
460
- break if @stopped
461
-
462
- begin
463
- @config_loader.fetch!
464
- @on_update&.call
465
- rescue StandardError => e
466
- LOG.warn "[quonfig] Polling error: #{e.message}"
467
- end
776
+ break if stopped_ref.call
777
+
778
+ @config_loader.fetch!
779
+ record_refresh!
780
+ notify_delivered.call
781
+ notify_on_update_callback
468
782
  end
469
783
  end
784
+
785
+ supervisor = Quonfig::WorkerSupervisor.new(
786
+ name: 'poll', layer: '2', worker: worker
787
+ )
788
+ @state_mutex.synchronize { @poll_supervisor = supervisor }
789
+ supervisor.start
790
+ end
791
+
792
+ # Invoke the customer-supplied on_update callback under a rescue. A raise
793
+ # here is the customer's bug, but it must NOT take down the SSE listener
794
+ # or polling supervisor. Log at ERROR with a message containing
795
+ # "onConfigUpdate callback" so chaos scenario 10's
796
+ # sdkLog('error', /callback|onConfigUpdate/i) assertion matches and so
797
+ # the message is distinguishable from internal envelope-apply errors
798
+ # (qfg-47c2.30).
799
+ def notify_on_update_callback
800
+ cb = @on_update
801
+ return unless cb
802
+
803
+ begin
804
+ cb.call
805
+ rescue StandardError => e
806
+ LOG.error "[quonfig] onConfigUpdate callback raised: #{e.class}: #{e.message}"
807
+ end
470
808
  end
471
809
 
472
810
  def build_context(jit_context)
@@ -673,4 +1011,42 @@ module Quonfig
673
1011
  end
674
1012
  end
675
1013
  end
1014
+
1015
+ # qfg-ryov: hook into Process._fork so customers using Puma's clustered
1016
+ # mode (or any preload/fork-worker server) don't have to wire
1017
+ # +before_fork+/+on_worker_boot+ manually. Ruby 3.1+ routes every
1018
+ # +Kernel#fork+/+Process.fork+ call through +Process._fork+, so a single
1019
+ # prepend covers them all.
1020
+ #
1021
+ # Process._fork's contract:
1022
+ # - Called in the parent process before the fork syscall.
1023
+ # - Returns 0 in the child, child's pid in the parent.
1024
+ # - +super+ performs the actual fork.
1025
+ #
1026
+ # The parent's view: SSE/polling/telemetry threads are torn down before
1027
+ # the syscall so the child does not inherit a live Net::HTTP socket fd
1028
+ # (which would corrupt both sides). The parent does NOT auto-restart —
1029
+ # that mirrors the Puma master use case where the master process no
1030
+ # longer serves requests after spawning workers.
1031
+ module ForkSafety
1032
+ def _fork
1033
+ Quonfig::Client.each_instance(&:before_fork_in_parent)
1034
+ pid = super
1035
+ Quonfig::Client.each_instance(&:after_fork_in_child) if pid.zero?
1036
+ pid
1037
+ rescue StandardError => e
1038
+ # Fork-hook failures must never break the customer's fork. Worst case
1039
+ # the child inherits dead SSE threads (the pre-qfg-ryov behavior) —
1040
+ # bad, but recoverable. Crashing the fork itself is not.
1041
+ Quonfig::Client::LOG.error "Quonfig fork hook error: #{e.class}: #{e.message}"
1042
+ raise if pid.nil? # super never returned — propagate fork failures
1043
+
1044
+ pid
1045
+ end
1046
+ end
1047
+
1048
+ # Ruby 3.0 lacks Process._fork. There's no hookable choke point on 3.0, so
1049
+ # customers must keep wiring their own Puma before_fork / on_worker_boot
1050
+ # (see README "Rails integration"). On 3.1+ we install the hook globally.
1051
+ Process.singleton_class.prepend(ForkSafety) if Process.respond_to?(:_fork)
676
1052
  end
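The prepend pattern used by `ForkSafety` can be tried in isolation with a sketch like the one below. It assumes Ruby 3.1+ on a platform that supports `fork`. `DemoForkHooks`, its `BEFORE` and `AFTER_CHILD` lists, and `demo_fork_hooks` are hypothetical names standing in for the gem's client registry and lifecycle methods.

```ruby
# Hypothetical stand-in for the gem's hook: run callbacks before the fork
# syscall in the parent and after it in the child.
module DemoForkHooks
  BEFORE = []
  AFTER_CHILD = []

  def _fork
    BEFORE.each(&:call)
    pid = super # performs the actual fork
    AFTER_CHILD.each(&:call) if pid.zero? # 0 means we are in the child
    pid
  end
end

# Ruby 3.0 has no Process._fork, so the hook is only installed when present.
Process.singleton_class.prepend(DemoForkHooks) if Process.respond_to?(:_fork)

# Forks once and reports what each side observed.
def demo_fork_hooks
  state = { worker: :running }
  DemoForkHooks::BEFORE << -> { state[:worker] = :torn_down }
  DemoForkHooks::AFTER_CHILD << -> { state[:worker] = :restarted }

  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(state[:worker].to_s)
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end
  writer.close
  child_view = reader.read
  reader.close
  Process.wait(pid)
  { parent: state[:worker], child: child_view.to_sym }
ensure
  DemoForkHooks::BEFORE.clear
  DemoForkHooks::AFTER_CHILD.clear
end
```

With the hook installed, the parent is left in the torn-down state while the child reports the restarted state, which matches the parent-does-not-auto-restart behavior described in the comment above.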