RubyGems - quonfig - Versions diffs - 0.0.14 → 0.0.16 - Mend

quonfig 0.0.14 → 0.0.16

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

checksums.yaml +4 -4
data/CHANGELOG.md +14 -0
data/README.md +55 -11
data/lib/quonfig/client.rb +398 -22
data/lib/quonfig/datadir.rb +8 -3
data/lib/quonfig/sse_config_client.rb +550 -93
data/lib/quonfig/version.rb +1 -1
data/lib/quonfig/worker_supervisor.rb +186 -0
data/lib/quonfig.rb +2 -1
data/quonfig.gemspec +0 -1
metadata +3 -16

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b25ea20d7f44acff4ed82e17522a9fb6055791c4f1e0c861075974e5ae37421f
-  data.tar.gz: e0c260d2d13926e21f2525c7686a24f8dec2f1fa998efa039db59baf4447cd60
+  metadata.gz: 6f167f3b60db07394dc7c49b85c3dbc196e0b5c82f3426b35695c0f212339b8b
+  data.tar.gz: 4b79e1196c4625359943255a348d907c28865a5cd85432ac464737406d7a6169
 SHA512:
-  metadata.gz: da91dbd4f9cc300f2dab9e8f39a73033e642d94272288cbcacf4358eb28f4f9b064f8fbe8301c5c26e1b342cd3cd76179d362029e06379bcac39685c3a050cb2
-  data.tar.gz: ac77088e6a6e0256d947f40b26abb9527bb55cff8a3fa39eaaebf91c43746379d5fa2325bd06e049922b2cbc8521f78252bdd79106c6f1ae7f1a0264f4033ab6
+  metadata.gz: 5aa3a23774245bf31752e4c9918de8bf37cc865e15b6ed160b222181d805a0fe477064cc5cf27dc6810b87cdd1c250f8558b650c98fd5e7a354c3e2e70090c53
+  data.tar.gz: 82c4561817b40e4dd0ecfd1b5267e6a5f41ea2e774c2d2537400aaf04886eb579807da457f6964915a065a32dc7beea043a2797d83ff04dba3c9fb4e46c39cb3

data/CHANGELOG.md CHANGED

@@ -1,5 +1,19 @@
 # Changelog
+## 0.0.16 - 2026-05-15
+- **Feat (SSE): replace `ld-eventsource` with an SDK-owned reconnect loop (qfg-35sm).** sdk-ruby was the outlier among the four backend SDKs — sdk-go, sdk-node, and sdk-python all own their reconnect loop, only sdk-ruby handed it off to a library and scraped its log output to observe reconnects. The wire format we actually consume (plain JSON envelopes in single-line `data:` frames, no named events, no retry directives) is trivial enough that an SDK-owned loop is clearer than the library wrapper. New `Quonfig::SSEConfigClient` (~520 LoC, `lib/quonfig/sse_config_client.rb`) handles connect/parse/reconnect end-to-end. `restart_total` is now incremented at exactly one site under a mutex — verifiable, not log-scraped. `ld-eventsource` and the transitive `http` gem are removed from the gemspec. `ReconnectCountingLogger` and the `sse_reconnect_reset_interval` option (both 0.0.15-era defensive scaffolding around upstream behavior) are deleted — the bugs they defended against don't exist when the SDK owns the loop. Chaos: 10/10 in a 36-min run (scenarios 02 silent-stall, 05 sse-down-fallback, 09 flapping kill-storm).
+- **Fix (SSE): contain watchdog `Thread#raise` with `Thread.handle_interrupt` (qfg-tj18).** The new watchdog fires `Thread#raise(SSEReadDeadlineExceeded)` into the worker on a silent stall — the only reliable cross-platform way to unblock a Ruby thread blocked in `Net::HTTP`'s body-read on macOS. The decision to fire is mutex-guarded against `stop()`, but the raise itself is delivered at the worker's next interrupt checkpoint, which could be anywhere in the call stack. `run_loop`'s body now runs under `Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)` so a late-landing raise can only land inside a blocking call; the `read_body` block explicitly switches to `:immediate`. A paranoid backstop `rescue` outside the until-`@stopped` loop ensures an escaped raise can never silently kill the worker.
+- **Fix (SSE): isolate `on_envelope` callback exceptions (qfg-m3lk).** A buggy user-supplied listener that raised during envelope delivery used to propagate out of `read_body`, get caught by `run_loop` as a transport error, bump `restart_total`, and reconnect — a perpetual reconnect storm at api-delivery-sse driven by a customer code bug. The callback is now wrapped in `begin/rescue StandardError` at the invocation site; exceptions are logged with class + message + backtrace sample and the stream continues uninterrupted. `Interrupt` and `SystemExit` are deliberately not caught so `Ctrl-C` still works.
+- **Fix (SSE): classify 401/403/404 as terminal errors (qfg-i5xv).** Non-200 responses used to be treated identically — `SSEHTTPStatusError` raised, `run_loop` rescued, `restart_total` bumped, backoff, retry, forever. For a bad SDK key (401) or revoked workspace (403) that was wasted load on api-delivery-sse with no recovery path short of a customer redeploy. New `SSEHTTPTerminalError` sentinel for 401/403/404; `run_loop` catches it, invokes `on_error`, exits the loop without bumping `restart_total`. Parent `Quonfig::Client` surfaces a terminal `:sse_terminal_failure` state distinct from transient `:error`. 429 and 5xx still retry.
+- **Feat (fork): install `Process._fork` hook so SSE auto-restarts after fork (qfg-ryov).** Ruby threads do not survive `fork(2)`. Customers initializing `Quonfig::Client` in the Puma master (the `preload_app! true` / Rails `config.eager_load = true` convention) used to silently lose SSE in every worker child. New `Quonfig::ForkSafety` module prepends `Process._fork` and fans out across all live `Quonfig::Client` instances (tracked in an `ObjectSpace::WeakMap`): in the parent before the syscall, threaded components (SSE worker, polling supervisor, telemetry reporter) are torn down; in the child after the syscall, they are rebuilt. `@stopped` is preserved so a `stop()`-ed client stays stopped across fork. Covers `Process.fork` / `Kernel#fork`; `Process.spawn` and `system("...")` exec a new program so in-process state doesn't apply. Ruby 3.0 lacks `Process._fork` and is documented as requiring manual `before_fork` / `on_worker_boot` wiring.
+## 0.0.15 - 2026-05-15
+- **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
+- **Fix (SSE): backoff reset interval (qfg-ie49).** New `sse_reconnect_reset_interval` option, default `1s`. ld-eventsource's 60s default lets the backoff run away under flapping — the SDK is mid-sleep when later kills land and never observes them. 1s mirrors sdk-python's reset-on-every-successful-connect behavior. Sustained outages still back off exponentially (`mark_success` is never called, so the reset never triggers).
+- **Fix (SSE): make `ReconnectCountingLogger` raise-proof (qfg-cf52).** ld-eventsource calls the logger from inside a bare-`Thread` `run_stream` loop with several call sites unguarded by `rescue`. A throwing wrapper would kill the worker with `@stopped=false`, leaving `closed?` false forever — silently wedging the SSE stream (the intermittent chaos scenario 05 flake). Every wrapper step is now independently rescued.
 ## 0.0.14 - 2026-05-10
 - **Feat: expose `variant` and `flag_metadata` on `EvaluationDetails` (qfg-9dbl).** OpenFeature's `EvaluationDetails` Ruby return type now carries the variant name and the flag-level metadata hash alongside the resolved value/reason. Brings sdk-ruby to parity with the other SDKs' detail surfaces and lets host apps (incl. the Ruby OpenFeature provider) read variant/metadata without re-fetching the config.
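
The 0.0.16 entries above describe the consumed wire format as plain JSON envelopes in single-line `data:` frames (qfg-35sm) and say the `on_envelope` callback is now wrapped in `begin/rescue StandardError` (qfg-m3lk). A minimal sketch of that shape might look like the following. `parse_sse_lines`, its `errors:` argument, and the sample frames are hypothetical names chosen for illustration; the shipped `Quonfig::SSEConfigClient` is not shown in this part of the diff and may be structured differently.

```ruby
require 'json'

# Hypothetical helper, NOT part of the quonfig gem. Feeds raw SSE lines to a
# listener, treating each single-line `data:` frame as one JSON envelope.
def parse_sse_lines(lines, errors: [], &on_envelope)
  delivered = 0
  lines.each do |raw|
    line = raw.chomp
    # Skip comments (": keepalive"), blank separators, and any other field.
    next unless line.start_with?('data:')

    payload = line.delete_prefix('data:').strip
    begin
      envelope = JSON.parse(payload)
    rescue JSON::ParserError => e
      errors << "bad frame: #{e.class}"
      next
    end

    begin
      on_envelope.call(envelope)
      delivered += 1
    rescue StandardError => e
      # A listener bug is recorded and the stream keeps going. Interrupt and
      # SystemExit are not StandardError subclasses, so they still propagate.
      errors << "listener raised: #{e.class}: #{e.message}"
    end
  end
  delivered
end

# Usage: the second frame makes the listener raise; the third still arrives.
errors = []
frames = [": keepalive\n", "data: {\"id\": 1}\n", "data: {\"id\": 2}\n", "data: {\"id\": 3}\n"]
parse_sse_lines(frames, errors: errors) do |envelope|
  raise ArgumentError, 'listener bug' if envelope['id'] == 2
end
```

The point of rescuing at the invocation site is that a listener exception never reaches the transport loop, so it cannot be mistaken for a transport error and trigger a reconnect.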

data/README.md CHANGED

@@ -247,15 +247,41 @@ not want to inherit across a `fork(2)`. Forked threads in the child process
 are dead — the SSE socket is held open by a thread that no longer exists, and
 the child silently stops receiving live updates.
-Use `Quonfig::Client#fork` (or `Quonfig.fork` if you use the module-level
-singleton) in any process that fork-spawns workers. It returns a fresh client
-configured for the child: a new `ConfigStore`, a new SSE subscription, and
-suppressed telemetry double-counting (`Options#is_fork` is set to `true`).
+**On Ruby 3.1+ the SDK installs a `Process._fork` hook at load time** that
+automatically tears down threaded components in the parent and restarts them
+in the child. This covers any `Process.fork` / `Kernel#fork` path — Puma's
+clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Spring, and
+manual `fork { ... }` calls. **No customer wiring is required.**
+Caveats:
+- Ruby 3.0 has no hookable choke point — fall back to manual wiring (below).
+- `system("fork-and-exec ...")` and `Process.spawn` are not covered (they do
+  not go through `Process._fork`), but those execute a new program, so the
+  in-process SSE state is moot.
+- The hook tears down the SSE/polling/telemetry threads in the parent before
+  fork (so the child does not inherit a live socket fd) and does **not**
+  auto-restart the parent. This mirrors the Puma master case: the master no
+  longer serves requests, so it does not need a live SSE connection. If you
+  have a non-Puma topology where the parent must keep streaming after fork,
+  call `Quonfig.instance.after_fork_in_child` manually in the parent after
+  the fork returns.
 ### Puma (clustered mode)
+With the automatic fork hook, the typical Puma config needs **no Quonfig
+lifecycle wiring** — initialize in your Rails initializer and let the hook
+handle the rest:
+```ruby
+# config/initializers/quonfig.rb
+Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
+```
+If you're on Ruby 3.0 (no `Process._fork`), wire the legacy hooks manually:
 ```ruby
-# config/puma.rb
+# config/puma.rb (Ruby 3.0 only)
 before_fork do
   Quonfig.instance.stop          # close the master's SSE before forking
 end
@@ -265,18 +291,18 @@ on_worker_boot do
 end
 ```
-If you initialize Quonfig lazily (in a Rails initializer) and run Puma in
-single mode (no clustering), no fork hook is needed.
 ### Sidekiq
-Sidekiq's parent process forks workers. Wire the same lifecycle:
+On Ruby 3.1+ the automatic fork hook covers Sidekiq workers too — no
+`configure_server` wiring required.
+On Ruby 3.0:
 ```ruby
 # config/initializers/quonfig.rb
 Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
-# config/initializers/sidekiq.rb
+# config/initializers/sidekiq.rb (Ruby 3.0 only)
 Sidekiq.configure_server do |config|
   config.on(:startup)  { Quonfig.fork if Process.ppid != 1 }
   config.on(:shutdown) { Quonfig.instance.stop rescue nil }
@@ -284,7 +310,7 @@ end
 ```
 For Sidekiq web/CLI processes that don't fork (default `concurrency: 1`),
-`Quonfig.init` in the initializer is sufficient.
+`Quonfig.init` in the initializer is sufficient on any Ruby version.
 ### Spring / Bootsnap preloaders
@@ -333,6 +359,24 @@ converge once the envelope finishes applying.
 `Quonfig.fork` is the only safe way to "carry" a client across `Process.fork`
 — do not reuse the parent's client in a child process.
+## Diagnostic health signals
+`Quonfig::Client` exposes two read-only getters for monitoring SDK liveness:
+- `client.last_successful_refresh` — a `Time` (UTC) marking the most recent
+  envelope install (any source: datadir, initial HTTP fetch, SSE, or fallback
+  polling). Returns `nil` before the first install. Preserved across `stop`.
+- `client.connection_state` — a `Symbol` describing the aggregate state:
+  `:initializing`, `:connected`, `:disconnected`, or `:falling_back`.
+> Do not wire `last_successful_refresh` or `connection_state` directly into a Kubernetes liveness probe. These signals are diagnostic, not pass/fail. A liveness probe based on SDK freshness will amplify transient network blips into restart cascades.
+Compose your own threshold from the two getters if you need a dashboard signal
+— but route alerts through a metrics pipeline, not a probe that restarts the
+process.
+There is intentionally no `client.healthy?` primitive.
 ## Documentation
 Full documentation, including SPEC, SDK reference, and operational guides, is
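
The README text above suggests composing a threshold from `last_successful_refresh` and `connection_state` for dashboards rather than probes. One way that composition could look is sketched below. `quonfig_staleness`, the 300-second `stale_after` default, and `FakeClient` are assumptions made for illustration; only the two getter names and the state symbols come from the README.

```ruby
# Hypothetical helper, NOT part of the quonfig gem. `client` is any object
# responding to #last_successful_refresh (Time or nil) and #connection_state
# (Symbol), as the README describes.
def quonfig_staleness(client, now: Time.now.utc, stale_after: 300)
  last = client.last_successful_refresh
  age = last ? (now - last) : nil # seconds as a Float, nil before first install
  {
    state: client.connection_state,
    age_seconds: age,
    # Assumption: "never refreshed" is reported as stale. Pick what suits you.
    stale: age.nil? || age > stale_after
  }
end

# Stand-in for a real client so the sketch runs without the gem installed.
FakeClient = Struct.new(:last_successful_refresh, :connection_state)

now = Time.utc(2026, 5, 15, 12, 0, 0)
sample = quonfig_staleness(FakeClient.new(now - 30, :connected), now: now)
puts sample.inspect
```

The returned hash is meant to be emitted as metrics or gauge values. In keeping with the README's warning, nothing here returns a pass/fail verdict suitable for a liveness probe.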

data/lib/quonfig/client.rb CHANGED

@@ -20,6 +20,29 @@ module Quonfig
   class Client
     LOG = Quonfig::InternalLogger.new(self)
+    # qfg-ryov: instance registry for the Process._fork hook. Every live
+    # Client is tracked here so the hook can fan out before_fork_in_parent /
+    # after_fork_in_child across all of them without the customer needing to
+    # name a specific instance. ObjectSpace::WeakMap means a Client that goes
+    # out of scope is GC'd without leaking through this registry. Stopped
+    # Clients stay in the registry until GC; both fork hooks early-return on
+    # +@stopped+ so a stopped instance is effectively a no-op. (We don't use
+    # WeakMap#delete because it was added in Ruby 3.3 and the matrix still
+    # includes 3.2.)
+    @instances = ObjectSpace::WeakMap.new
+    @instances_mutex = Mutex.new
+    class << self
+      # Iterate live Client instances. Used by Quonfig::ForkSafety.
+      def each_instance(&block)
+        @instances_mutex.synchronize { @instances.keys }.each(&block)
+      end
+      def register_instance(client)
+        @instances_mutex.synchronize { @instances[client] = true }
+      end
+    end
     attr_reader :options, :resolver, :store, :evaluator, :instance_hash,
                 :config_loader, :telemetry_reporter
@@ -40,9 +63,15 @@ module Quonfig
       @resolver = Quonfig::Resolver.new(@store, @evaluator)
       @semantic_logger_filters = {}
       @sse_client = nil
-      @poll_thread = nil
+      @poll_supervisor = nil
       @stopped = false
       @telemetry_reporter = nil
+      @state_mutex = Mutex.new
+      @last_successful_refresh = nil
+      @sse_state = :idle
+      @sse_ever_connected = false
+      @fallback_engage_timer = nil
+      @sse_terminal_failure = false
       # If the caller injected a store, we're in test/bootstrap mode; skip I/O.
       return if store
@@ -54,6 +83,10 @@ module Quonfig
       end
       initialize_telemetry
+      # Register only for non-store-injected clients (a caller-supplied store
+      # is the test/bootstrap path; the fork hook does not apply there).
+      self.class.register_instance(self) unless store
     end
     # ---- Lookup --------------------------------------------------------
@@ -259,6 +292,121 @@ module Quonfig
     def stop
       @stopped = true
+      tear_down_threaded_components!
+    end
+    # qfg-ryov: pre-fork hook. Close the SSE worker, polling supervisor,
+    # telemetry reporter, and any fallback-engage timer. Idempotent — calling
+    # twice is safe. Does NOT set @stopped: the client is still expected to
+    # be usable post-fork via after_fork_in_child.
+    #
+    # Why this matters: Ruby threads do not survive fork(2). If we let the
+    # child inherit a live Net::HTTP socket, both processes read from the
+    # same fd and corrupt each other's bytes. Closing in the parent before
+    # fork is the only safe shape.
+    def before_fork_in_parent
+      return if @stopped
+      tear_down_threaded_components!
+    end
+    # qfg-ryov: post-fork (in child) hook. Re-establish whatever threaded
+    # components the client had pre-fork. No-op if the client was already
+    # stopped (the customer asked for it to be dead — do not resurrect),
+    # or if the client is in datadir mode (no threaded components to start).
+    def after_fork_in_child
+      return if @stopped
+      return if @options.datadir
+      return if @config_loader.nil? # never finished network init (e.g. invalid key)
+      # SSE state machine carries flags that no longer apply in the child
+      # (the parent had connected, the parent had errored, etc.). Reset.
+      @state_mutex.synchronize do
+        @sse_state = :idle
+        @sse_ever_connected = false
+        @sse_terminal_failure = false
+      end
+      sse_started = @options.enable_sse && start_sse
+      start_polling if @options.enable_polling && !sse_started
+      restart_telemetry_in_child
+    end
+    # quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
+    # Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
+    # incremented once per reconnect attempt by the SDK-owned reconnect
+    # loop (qfg-35sm). Layer 2 (HTTP polling fallback) is wired through
+    # Quonfig::WorkerSupervisor.
+    #
+    # Pass +layer:+ ('1' or '2') to read a single layer; default returns the
+    # sum across both layers so the chaos harness (and operators) can pull
+    # per-layer values explicitly while preserving the previous single-number
+    # diagnostic surface.
+    def worker_restart_total(layer: nil)
+      case layer&.to_s
+      when '1' then sse_restart_total
+      when '2' then poll_restart_total
+      else          sse_restart_total + poll_restart_total
+      end
+    end
+    # Wall-clock time of the last installed envelope (any source: datadir,
+    # initial HTTP fetch, SSE, or polling fallback). +nil+ before the first
+    # install. Preserved after +stop+.
+    #
+    # **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
+    # — a transient network blip will trip any freshness threshold and cause
+    # a rolling restart cascade. See the README "Diagnostic health signals"
+    # section.
+    #
+    # Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
+    def last_successful_refresh
+      @state_mutex.synchronize { @last_successful_refresh }
+    end
+    # Aggregate connection state. Returns one of:
+    #
+    # - +:initializing+ — no envelope has been installed and SSE is not yet
+    #   connected.
+    # - +:connected+ — SSE is live, or the SDK is delivering configs from a
+    #   loaded envelope (datadir mode or post-initial-fetch with no SSE).
+    # - +:disconnected+ — +stop+ was called, or SSE errored and no fallback
+    #   poller is active.
+    # - +:falling_back+ — the Layer 2 HTTP polling supervisor is alive and
+    #   serving as the active update channel.
+    #
+    # **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
+    # — see the README "Diagnostic health signals" section.
+    #
+    # Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
+    def connection_state
+      @state_mutex.synchronize do
+        next :disconnected if @stopped
+        next :falling_back if @poll_supervisor&.alive?
+        next :connected if @sse_state == :connected
+        next :disconnected if @sse_state == :error
+        # No SSE state change yet: state is driven by whether any envelope
+        # has been installed (datadir / initial fetch).
+        @last_successful_refresh.nil? ? :initializing : :connected
+      end
+    end
+    def fork
+      self.class.new(@options.for_fork)
+    end
+    def inspect
+      "#<Quonfig::Client:#{object_id} environment=#{@options.environment.inspect}>"
+    end
+    private
+    # Close every threaded component and drop its reference. Used by both
+    # +stop+ (where @stopped is also flipped) and +before_fork_in_parent+
+    # (where @stopped is left alone so the child can restart).
+    def tear_down_threaded_components!
       begin
         @sse_client&.close
       rescue StandardError => e
@@ -266,9 +414,14 @@ module Quonfig
       end
       @sse_client = nil
-      thread = @poll_thread
-      @poll_thread = nil
-      thread&.kill
+      cancel_fallback_engage_timer
+      begin
+        @poll_supervisor&.stop
+      rescue StandardError => e
+        LOG.debug "Error stopping poll supervisor: #{e.message}"
+      end
+      @poll_supervisor = nil
       begin
         @telemetry_reporter&.stop
@@ -278,16 +431,161 @@ module Quonfig
       @telemetry_reporter = nil
     end
-    def fork
-      self.class.new(@options.for_fork)
+    # Rebuild the telemetry reporter in the child after fork. Mirrors the
+    # original initialize_telemetry path — fresh aggregators, fresh reporter.
+    def restart_telemetry_in_child
+      @telemetry_reporter = nil
+      initialize_telemetry
     end
-    def inspect
-      "#<Quonfig::Client:#{object_id} environment=#{@options.environment.inspect}>"
+    # Stamp +last_successful_refresh+ at install time. Called by every code
+    # path that hands an envelope to the cache: datadir load, initial HTTP
+    # fetch, SSE event apply, and polling worker fetch.
+    def record_refresh!
+      @state_mutex.synchronize { @last_successful_refresh = Time.now.utc }
+    end
+    def sse_restart_total
+      sse = @sse_client
+      return 0 if sse.nil?
+      return 0 unless sse.respond_to?(:restart_total)
+      sse.restart_total.to_i
+    end
+    def poll_restart_total
+      sup = @poll_supervisor
+      return 0 if sup.nil?
+      return 0 unless sup.respond_to?(:worker_restart_total)
+      sup.worker_restart_total.to_i
+    end
+    # Drive the SSE-side of the connection_state machine. The SSE client
+    # invokes this on connect/error edges; tests call it directly via +send+.
+    # Documented values: :idle, :connecting, :connected, :error.
+    #
+    # Also drives the Layer 2 fallback poller's engage/disengage:
+    # - :connected clears any pending engage timer and stops an active
+    #   fallback poller (SSE recovered, drop the second channel).
+    # - :error before any successful connect engages immediately
+    #   (initial-fail path).
+    # - :error after a successful connect schedules a 2x-poll-interval
+    #   grace timer; the timer engages if SSE has not recovered by then.
+    #   Mirrors sdk-python's `_handle_sse_state_change` and sdk-node's
+    #   `fallbackPollerActive` engagement behavior. (qfg-47c2.26)
+    # Stable callable handed to Quonfig::SSEConfigClient so its +on_error+
+    # block can drive @sse_state -> :error on a mid-run socket drop. Without
+    # this wiring, +connection_state+ would stay +:connected+ after a
+    # disconnect and customers composing staleness checks would see stale
+    # data. (qfg-47c2.27)
+    def sse_error_callback
+      @sse_error_callback ||= ->(error) { handle_sse_error(error) }
+    end
+    def handle_sse_error(error)
+      # qfg-i5xv: classify terminal HTTP failures (401/403/404). The same SDK
+      # key that won't auth over SSE won't auth over HTTP polling either, so
+      # we must NOT engage the Layer 2 fallback — that just moves the
+      # auth-failure storm from one endpoint to another. Once flipped,
+      # @sse_terminal_failure latches: a buggy customer retry loop cannot
+      # un-classify the failure by driving the state machine.
+      @state_mutex.synchronize { @sse_terminal_failure = true } if error.is_a?(Quonfig::SSEConfigClient::SSEHTTPTerminalError)
+      handle_sse_state_change(:error)
+    end
+    def handle_sse_state_change(new_state)
+      state = new_state.to_sym
+      ever_connected, terminal = @state_mutex.synchronize do
+        @sse_state = state
+        @sse_ever_connected = true if state == :connected
+        [@sse_ever_connected, @sse_terminal_failure]
+      end
+      return unless @options.respond_to?(:enable_polling) && @options.enable_polling
+      return if @stopped
+      # qfg-i5xv: a terminal SSE classification suppresses polling engage in
+      # every branch — the customer's key is bad and HTTP polling will fail
+      # identically. Operators surface this via #terminal_failure?.
+      return if terminal
+      case state
+      when :connected
+        cancel_fallback_engage_timer
+        stop_fallback_poller('sse-recovered')
+      when :error
+        if ever_connected
+          schedule_fallback_engage
+        else
+          start_polling
+        end
+      end
+    end
+    public
+    # qfg-i5xv: true once the SSE layer has classified an HTTP response as
+    # terminal (401/403/404) — bad SDK key, revoked workspace permission,
+    # or wrong endpoint. The classification latches: the SDK will not
+    # auto-recover, and a customer-supplied retry must rebuild the client.
+    # Surfaced for operator alerting; `connection_state` still reports
+    # `:disconnected` to honor the documented connection_state vocabulary
+    # (supervisor-test-contract.md §"connectionState()" — values fixed).
+    def terminal_failure?
+      @state_mutex.synchronize { @sse_terminal_failure }
     end
     private
+    def cancel_fallback_engage_timer
+      timer = @state_mutex.synchronize do
+        t = @fallback_engage_timer
+        @fallback_engage_timer = nil
+        t
+      end
+      timer&.kill if timer&.alive?
+    end
+    def stop_fallback_poller(reason)
+      supervisor = @state_mutex.synchronize do
+        s = @poll_supervisor
+        @poll_supervisor = nil
+        s
+      end
+      return if supervisor.nil?
+      begin
+        supervisor.stop
+        LOG.debug "[quonfig] Layer 2 fallback poller stopped (reason=#{reason})"
+      rescue StandardError => e
+        LOG.debug "Error stopping fallback poller: #{e.message}"
+      end
+    end
+    # Schedule a 2*poll_interval grace timer after a connected->error edge.
+    # If SSE recovers before the timer fires, +cancel_fallback_engage_timer+
+    # tears it down. Idempotent — does nothing if a timer is already pending
+    # or the supervisor is already alive.
+    def schedule_fallback_engage
+      poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
+      return if poll_interval <= 0
+      grace_seconds = poll_interval * 2.0
+      @state_mutex.synchronize do
+        return if @fallback_engage_timer&.alive?
+        return if @poll_supervisor&.alive?
+        return if @stopped
+        @fallback_engage_timer = Thread.new do
+          Thread.current.report_on_exception = false
+          sleep grace_seconds
+          @state_mutex.synchronize { @fallback_engage_timer = nil }
+          start_polling unless @stopped
+        end
+      end
+    end
     # Construct and start the telemetry reporter if the options permit it.
     # The reporter runs on a background thread and periodically POSTs
     # context-shape and example-context batches to +telemetry_destination+.
@@ -378,6 +676,7 @@ module Quonfig
     def load_datadir_into_store
       envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
       envelope.configs.each { |cfg| @store.set(cfg['key'], cfg) }
+      record_refresh!
     end
     # Initialize network mode: sync HTTP fetch (bounded by
@@ -412,7 +711,11 @@ module Quonfig
         return
       end
-      handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls')) if result == :failed
+      if result == :failed
+        handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls'))
+      else
+        record_refresh!
+      end
     end
     def handle_init_failure(err)
@@ -429,44 +732,79 @@ module Quonfig
     def start_sse
       return false if @options.sse_api_urls.nil? || @options.sse_api_urls.empty?
-      @sse_client = Quonfig::SSEConfigClient.new(@options, @config_loader)
+      @sse_client = Quonfig::SSEConfigClient.new(
+        @options,
+        @config_loader,
+        nil,
+        nil,
+        on_error: sse_error_callback
+      )
       @sse_client.start do |envelope, _event, _source|
         next if @stopped
         begin
           @config_loader.apply_envelope(envelope)
-          @on_update&.call
+          handle_sse_state_change(:connected)
+          record_refresh!
         rescue StandardError => e
           LOG.warn "[quonfig] Error applying SSE envelope: #{e.message}"
+          next
         end
+        notify_on_update_callback
       end
       true
     rescue StandardError => e
       LOG.warn "[quonfig] SSE start failed: #{e.message}"
       @sse_client = nil
+      handle_sse_state_change(:error)
       false
     end
     def start_polling
+      return if @stopped
+      return if @poll_supervisor&.alive?
       poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
       return if poll_interval <= 0
-      @poll_thread = Thread.new do
-        Thread.current.name = 'quonfig-poller'
+      stopped_ref = -> { @stopped }
+      worker = lambda do |notify_delivered|
         loop do
-          break if @stopped
+          break if stopped_ref.call
           sleep poll_interval
-          break if @stopped
-          begin
-            @config_loader.fetch!
-            @on_update&.call
-          rescue StandardError => e
-            LOG.warn "[quonfig] Polling error: #{e.message}"
-          end
+          break if stopped_ref.call
+          @config_loader.fetch!
+          record_refresh!
+          notify_delivered.call
+          notify_on_update_callback
         end
       end
+      supervisor = Quonfig::WorkerSupervisor.new(
+        name: 'poll', layer: '2', worker: worker
+      )
+      @state_mutex.synchronize { @poll_supervisor = supervisor }
+      supervisor.start
+    end
+    # Invoke the customer-supplied on_update callback under a rescue. A raise
+    # here is the customer's bug, but it must NOT take down the SSE listener
+    # or polling supervisor. Log at ERROR with a message containing
+    # "onConfigUpdate callback" so chaos scenario 10's
+    # sdkLog('error', /callback|onConfigUpdate/i) assertion matches and so
+    # the message is distinguishable from internal envelope-apply errors
+    # (qfg-47c2.30).
+    def notify_on_update_callback
+      cb = @on_update
+      return unless cb
+      begin
+        cb.call
+      rescue StandardError => e
+        LOG.error "[quonfig] onConfigUpdate callback raised: #{e.class}: #{e.message}"
+      end
     end
     def build_context(jit_context)
@@ -673,4 +1011,42 @@ module Quonfig
       end
     end
   end
+  # qfg-ryov: hook into Process._fork so customers using Puma's clustered
+  # mode (or any preload/fork-worker server) don't have to wire
+  # +before_fork+/+on_worker_boot+ manually. Ruby 3.1+ routes every
+  # +Kernel#fork+/+Process.fork+ call through +Process._fork+, so a single
+  # prepend covers them all.
+  #
+  # Process._fork's contract:
+  #   - Called in the parent process before the fork syscall.
+  #   - Returns 0 in the child, child's pid in the parent.
+  #   - +super+ performs the actual fork.
+  #
+  # The parent's view: SSE/polling/telemetry threads are torn down before
+  # the syscall so the child does not inherit a live Net::HTTP socket fd
+  # (which would corrupt both sides). The parent does NOT auto-restart —
+  # that mirrors the Puma master use case where the master process no
+  # longer serves requests after spawning workers.
+  module ForkSafety
+    def _fork
+      Quonfig::Client.each_instance(&:before_fork_in_parent)
+      pid = super
+      Quonfig::Client.each_instance(&:after_fork_in_child) if pid.zero?
+      pid
+    rescue StandardError => e
+      # Fork-hook failures must never break the customer's fork. Worst case
+      # the child inherits dead SSE threads (the pre-qfg-ryov behavior) —
+      # bad, but recoverable. Crashing the fork itself is not.
+      Quonfig::Client::LOG.error "Quonfig fork hook error: #{e.class}: #{e.message}"
+      raise if pid.nil? # super never returned — propagate fork failures
+      pid
+    end
+  end
+  # Ruby 3.0 lacks Process._fork. There's no hookable choke point on 3.0, so
+  # customers must keep wiring their own Puma before_fork / on_worker_boot
+  # (see README "Rails integration"). On 3.1+ we install the hook globally.
+  Process.singleton_class.prepend(ForkSafety) if Process.respond_to?(:_fork)
 end
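
The fallback rules spread across `handle_sse_error` and `handle_sse_state_change` in the diff above can be easier to follow as a pure decision function. The sketch below restates them under that assumption. `terminal_status?`, `fallback_action`, `grace_seconds`, and the returned symbols are hypothetical names for illustration and do not exist in the gem, which performs these steps through instance state and threads.

```ruby
# Statuses the changelog (qfg-i5xv) lists as terminal; 429 and 5xx still retry.
TERMINAL_STATUSES = [401, 403, 404].freeze

# Hypothetical helper, NOT part of the quonfig gem.
def terminal_status?(code)
  TERMINAL_STATUSES.include?(code)
end

# Grace period before engaging the poller after a connected -> error edge:
# twice the poll interval, as in schedule_fallback_engage.
def grace_seconds(poll_interval)
  poll_interval * 2.0
end

# Hypothetical helper, NOT part of the quonfig gem. Returns what the Layer 2
# fallback poller should do for a given SSE state edge.
def fallback_action(state:, ever_connected:, terminal:, polling_enabled: true, stopped: false)
  return :none unless polling_enabled
  return :none if stopped
  return :none if terminal # a bad key would fail identically over HTTP polling

  case state
  when :connected then :stop_fallback # SSE recovered, drop the second channel
  when :error
    ever_connected ? :engage_after_grace : :engage_now
  else :none
  end
end

# Usage: an initial connect failure engages polling immediately, while a
# mid-run drop waits out the grace period first.
first_failure = fallback_action(state: :error, ever_connected: false, terminal: false)
mid_run_drop  = fallback_action(state: :error, ever_connected: true, terminal: false)
puts [first_failure, mid_run_drop, grace_seconds(60)].inspect
```

Keeping the terminal check ahead of the `case` mirrors the diff's ordering: once the failure is classified as terminal, no branch can engage the poller.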