RubyGems - quonfig - Versions diffs - 0.0.14 → 0.0.15 - Mend

quonfig 0.0.14 → 0.0.15

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +6 -0
data/README.md +18 -0
data/lib/quonfig/client.rb +249 -18
data/lib/quonfig/datadir.rb +8 -3
data/lib/quonfig/sse_config_client.rb +150 -4
data/lib/quonfig/version.rb +1 -1
data/lib/quonfig/worker_supervisor.rb +186 -0
data/lib/quonfig.rb +1 -0
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b25ea20d7f44acff4ed82e17522a9fb6055791c4f1e0c861075974e5ae37421f
-  data.tar.gz: e0c260d2d13926e21f2525c7686a24f8dec2f1fa998efa039db59baf4447cd60
+  metadata.gz: e4e037ad01a35ca5a3fb3ddcc30ad6b0dab78ad82e4908a4a8ce9e8bab6cab40
+  data.tar.gz: 8bcccb03befbab5f1fbed1cbae867ce970498ac0081c92e24db7d8eb899d2faa
 SHA512:
-  metadata.gz: da91dbd4f9cc300f2dab9e8f39a73033e642d94272288cbcacf4358eb28f4f9b064f8fbe8301c5c26e1b342cd3cd76179d362029e06379bcac39685c3a050cb2
-  data.tar.gz: ac77088e6a6e0256d947f40b26abb9527bb55cff8a3fa39eaaebf91c43746379d5fa2325bd06e049922b2cbc8521f78252bdd79106c6f1ae7f1a0264f4033ab6
+  metadata.gz: 9d4abdeaeaaad881e5f28cb9a653715dd8b1838ba33cc38b6b1f08db5f729173d5eadbf2afebfb6e3ca3a379f0354ab453fafd760a1fd61d13c3efef60ad0aee
+  data.tar.gz: 890131a3f75092f1b846ee4ca46c1dc20702b1effc3db5803443905d8a8571a33b672a691a18bbb0c3ad8471c5db72006a745f50ac0e919bf0997b49cf202045

data/CHANGELOG.md CHANGED

@@ -1,5 +1,11 @@
 # Changelog
+## 0.0.15 - 2026-05-15
+- **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
+- **Fix (SSE): backoff reset interval (qfg-ie49).** New `sse_reconnect_reset_interval` option, default `1s`. ld-eventsource's 60s default lets the backoff run away under flapping — the SDK is mid-sleep when later kills land and never observes them. 1s mirrors sdk-python's reset-on-every-successful-connect behavior. Sustained outages still back off exponentially (`mark_success` is never called, so the reset never triggers).
+- **Fix (SSE): make `ReconnectCountingLogger` raise-proof (qfg-cf52).** ld-eventsource calls the logger from inside a bare-`Thread` `run_stream` loop with several call sites unguarded by `rescue`. A throwing wrapper would kill the worker with `@stopped=false`, leaving `closed?` false forever — silently wedging the SSE stream (the intermittent chaos scenario 05 flake). Every wrapper step is now independently rescued.
 ## 0.0.14 - 2026-05-10
 - **Feat: expose `variant` and `flag_metadata` on `EvaluationDetails` (qfg-9dbl).** OpenFeature's `EvaluationDetails` Ruby return type now carries the variant name and the flag-level metadata hash alongside the resolved value/reason. Brings sdk-ruby to parity with the other SDKs' detail surfaces and lets host apps (incl. the Ruby OpenFeature provider) read variant/metadata without re-fetching the config.
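The backoff reasoning in the 0.0.15 entry can be illustrated with a small model. This is a simplified sketch, not ld-eventsource's code: `BackoffModel`, `delays_for`, the 1s base, the 30s cap, and the absence of jitter are all assumptions made for illustration. It only shows why a reset interval longer than the connection's uptime lets the delay keep doubling when the connection flaps.

```ruby
# Hypothetical, simplified model of a doubling backoff with a reset interval.
# Not taken from ld-eventsource; no jitter, illustrative base and cap.
class BackoffModel
  def initialize(base:, max:, reset_interval:)
    @base = base
    @max = max
    @reset_interval = reset_interval
    @attempts = 0
  end

  # uptime: seconds the previous connection stayed up before it dropped.
  # Returns the delay to wait before the next reconnect attempt.
  def next_delay(uptime)
    # The backoff only returns to the base once a connection has stayed up
    # at least reset_interval seconds.
    @attempts = 0 if uptime >= @reset_interval
    delay = [@base * (2**@attempts), @max].min
    @attempts += 1
    delay
  end
end

# Replays a list of connection uptimes and returns the delay chosen after each drop.
def delays_for(reset_interval, uptimes)
  model = BackoffModel.new(base: 1, max: 30, reset_interval: reset_interval)
  uptimes.map { |uptime| model.next_delay(uptime) }
end

# A flapping link: each connection survives about 5 seconds.
FLAPPING_UPTIMES = [5, 5, 5, 5, 5].freeze

puts "reset_interval=60: #{delays_for(60, FLAPPING_UPTIMES).inspect}" # [1, 2, 4, 8, 16]
puts "reset_interval=1:  #{delays_for(1, FLAPPING_UPTIMES).inspect}"  # [1, 1, 1, 1, 1]
```

In this model a sustained outage (uptime 0 on every attempt) still backs off exponentially with either setting, because no connection ever stays up long enough to trigger the reset.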

data/README.md CHANGED

@@ -333,6 +333,24 @@ converge once the envelope finishes applying.
 `Quonfig.fork` is the only safe way to "carry" a client across `Process.fork`
 — do not reuse the parent's client in a child process.
+## Diagnostic health signals
+`Quonfig::Client` exposes two read-only getters for monitoring SDK liveness:
+- `client.last_successful_refresh` — a `Time` (UTC) marking the most recent
+  envelope install (any source: datadir, initial HTTP fetch, SSE, or fallback
+  polling). Returns `nil` before the first install. Preserved across `stop`.
+- `client.connection_state` — a `Symbol` describing the aggregate state:
+  `:initializing`, `:connected`, `:disconnected`, or `:falling_back`.
+> Do not wire `last_successful_refresh` or `connection_state` directly into a Kubernetes liveness probe. These signals are diagnostic, not pass/fail. A liveness probe based on SDK freshness will amplify transient network blips into restart cascades.
+Compose your own threshold from the two getters if you need a dashboard signal
+— but route alerts through a metrics pipeline, not a probe that restarts the
+process.
+There is intentionally no `client.healthy?` primitive.
 ## Documentation
 Full documentation, including SPEC, SDK reference, and operational guides, is
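One way to compose a dashboard signal from the two getters described in the README section is sketched here. `quonfig_health_snapshot`, the `stale_after` threshold of 300 seconds, and the `FakeClient` stand-in are hypothetical names chosen for illustration; only `last_successful_refresh` and `connection_state` come from the diff. The result is meant to be emitted as metrics, not used as a liveness probe.

```ruby
# Hypothetical helper: derives dashboard fields from the two getters.
# stale_after is an arbitrary illustrative threshold, not an SDK default.
def quonfig_health_snapshot(client, now: Time.now.utc, stale_after: 300)
  last = client.last_successful_refresh
  age = last ? (now - last) : nil # nil before the first envelope install
  {
    connection_state: client.connection_state,
    seconds_since_refresh: age,
    stale: age.nil? || age > stale_after
  }
end

# Stand-in for Quonfig::Client so the sketch runs without the gem installed.
FakeClient = Struct.new(:last_successful_refresh, :connection_state)

demo_client = FakeClient.new(Time.now.utc - 42, :connected)
puts quonfig_health_snapshot(demo_client).inspect
```

Treating a `nil` refresh time as stale is a design choice of this sketch; a dashboard might prefer to report it as a separate "never refreshed" series.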

data/lib/quonfig/client.rb CHANGED

@@ -40,9 +40,14 @@ module Quonfig
       @resolver = Quonfig::Resolver.new(@store, @evaluator)
       @semantic_logger_filters = {}
       @sse_client = nil
-      @poll_thread = nil
+      @poll_supervisor = nil
       @stopped = false
       @telemetry_reporter = nil
+      @state_mutex = Mutex.new
+      @last_successful_refresh = nil
+      @sse_state = :idle
+      @sse_ever_connected = false
+      @fallback_engage_timer = nil
       # If the caller injected a store, we're in test/bootstrap mode; skip I/O.
       return if store
@@ -266,9 +271,14 @@ module Quonfig
       end
       @sse_client = nil
-      thread = @poll_thread
-      @poll_thread = nil
-      thread&.kill
+      cancel_fallback_engage_timer
+      begin
+        @poll_supervisor&.stop
+      rescue StandardError => e
+        LOG.debug "Error stopping poll supervisor: #{e.message}"
+      end
+      @poll_supervisor = nil
       begin
         @telemetry_reporter&.stop
@@ -278,6 +288,65 @@ module Quonfig
       @telemetry_reporter = nil
     end
+    # quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
+    # Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
+    # incremented on every on_error edge from ld-eventsource (qfg-ll6r).
+    # Layer 2 (HTTP polling fallback) is wired through Quonfig::WorkerSupervisor.
+    #
+    # Pass +layer:+ ('1' or '2') to read a single layer; default returns the
+    # sum across both layers so the chaos harness (and operators) can pull
+    # per-layer values explicitly while preserving the previous single-number
+    # diagnostic surface.
+    def worker_restart_total(layer: nil)
+      case layer&.to_s
+      when '1' then sse_restart_total
+      when '2' then poll_restart_total
+      else          sse_restart_total + poll_restart_total
+      end
+    end
+    # Wall-clock time of the last installed envelope (any source: datadir,
+    # initial HTTP fetch, SSE, or polling fallback). +nil+ before the first
+    # install. Preserved after +stop+.
+    #
+    # **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
+    # — a transient network blip will trip any freshness threshold and cause
+    # a rolling restart cascade. See the README "Diagnostic health signals"
+    # section.
+    #
+    # Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
+    def last_successful_refresh
+      @state_mutex.synchronize { @last_successful_refresh }
+    end
+    # Aggregate connection state. Returns one of:
+    #
+    # - +:initializing+ — no envelope has been installed and SSE is not yet
+    #   connected.
+    # - +:connected+ — SSE is live, or the SDK is delivering configs from a
+    #   loaded envelope (datadir mode or post-initial-fetch with no SSE).
+    # - +:disconnected+ — +stop+ was called, or SSE errored and no fallback
+    #   poller is active.
+    # - +:falling_back+ — the Layer 2 HTTP polling supervisor is alive and
+    #   serving as the active update channel.
+    #
+    # **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
+    # — see the README "Diagnostic health signals" section.
+    #
+    # Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
+    def connection_state
+      @state_mutex.synchronize do
+        next :disconnected if @stopped
+        next :falling_back if @poll_supervisor&.alive?
+        next :connected if @sse_state == :connected
+        next :disconnected if @sse_state == :error
+        # No SSE state change yet: state is driven by whether any envelope
+        # has been installed (datadir / initial fetch).
+        @last_successful_refresh.nil? ? :initializing : :connected
+      end
+    end
     def fork
       self.class.new(@options.for_fork)
     end
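The precedence inside `connection_state` in the hunk above is easier to check when restated as a pure function. The sketch below is a hypothetical restatement for reading purposes: `aggregate_state` and its keyword arguments are not part of the SDK, and they stand in for `@stopped`, `@poll_supervisor&.alive?`, `@sse_state`, and `@last_successful_refresh`.

```ruby
# Hypothetical pure-function restatement of the precedence in connection_state.
# Earlier checks win: stopped, then active fallback poller, then SSE state,
# then whether any envelope has ever been installed.
def aggregate_state(stopped:, poller_alive:, sse_state:, last_refresh:)
  return :disconnected if stopped
  return :falling_back if poller_alive
  return :connected if sse_state == :connected
  return :disconnected if sse_state == :error

  last_refresh.nil? ? :initializing : :connected
end

# SSE has errored but the fallback poller is running: the poller wins.
puts aggregate_state(stopped: false, poller_alive: true,
                     sse_state: :error, last_refresh: Time.now.utc)
# Datadir mode: SSE never changed state, but an envelope was installed.
puts aggregate_state(stopped: false, poller_alive: false,
                     sse_state: :idle, last_refresh: Time.now.utc)
```

Under these assumptions the two calls print `falling_back` and `connected` respectively.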
@@ -288,6 +357,128 @@ module Quonfig
     private
+    # Stamp +last_successful_refresh+ at install time. Called by every code
+    # path that hands an envelope to the cache: datadir load, initial HTTP
+    # fetch, SSE event apply, and polling worker fetch.
+    def record_refresh!
+      @state_mutex.synchronize { @last_successful_refresh = Time.now.utc }
+    end
+    def sse_restart_total
+      sse = @sse_client
+      return 0 if sse.nil?
+      return 0 unless sse.respond_to?(:restart_total)
+      sse.restart_total.to_i
+    end
+    def poll_restart_total
+      sup = @poll_supervisor
+      return 0 if sup.nil?
+      return 0 unless sup.respond_to?(:worker_restart_total)
+      sup.worker_restart_total.to_i
+    end
+    # Drive the SSE-side of the connection_state machine. The SSE client
+    # invokes this on connect/error edges; tests call it directly via +send+.
+    # Documented values: :idle, :connecting, :connected, :error.
+    #
+    # Also drives the Layer 2 fallback poller's engage/disengage:
+    # - :connected clears any pending engage timer and stops an active
+    #   fallback poller (SSE recovered, drop the second channel).
+    # - :error before any successful connect engages immediately
+    #   (initial-fail path).
+    # - :error after a successful connect schedules a 2x-poll-interval
+    #   grace timer; the timer engages if SSE has not recovered by then.
+    #   Mirrors sdk-python's `_handle_sse_state_change` and sdk-node's
+    #   `fallbackPollerActive` engagement behavior. (qfg-47c2.26)
+    # Stable callable handed to Quonfig::SSEConfigClient so its +on_error+
+    # block can drive @sse_state -> :error on a mid-run socket drop. Without
+    # this wiring, +connection_state+ would stay +:connected+ after a
+    # disconnect and customers composing staleness checks would see stale
+    # data. (qfg-47c2.27)
+    def sse_error_callback
+      @sse_error_callback ||= ->(error) { handle_sse_error(error) }
+    end
+    def handle_sse_error(_error)
+      handle_sse_state_change(:error)
+    end
+    def handle_sse_state_change(new_state)
+      state = new_state.to_sym
+      ever_connected = @state_mutex.synchronize do
+        @sse_state = state
+        @sse_ever_connected = true if state == :connected
+        @sse_ever_connected
+      end
+      return unless @options.respond_to?(:enable_polling) && @options.enable_polling
+      return if @stopped
+      case state
+      when :connected
+        cancel_fallback_engage_timer
+        stop_fallback_poller('sse-recovered')
+      when :error
+        if ever_connected
+          schedule_fallback_engage
+        else
+          start_polling
+        end
+      end
+    end
+    def cancel_fallback_engage_timer
+      timer = @state_mutex.synchronize do
+        t = @fallback_engage_timer
+        @fallback_engage_timer = nil
+        t
+      end
+      timer&.kill if timer&.alive?
+    end
+    def stop_fallback_poller(reason)
+      supervisor = @state_mutex.synchronize do
+        s = @poll_supervisor
+        @poll_supervisor = nil
+        s
+      end
+      return if supervisor.nil?
+      begin
+        supervisor.stop
+        LOG.debug "[quonfig] Layer 2 fallback poller stopped (reason=#{reason})"
+      rescue StandardError => e
+        LOG.debug "Error stopping fallback poller: #{e.message}"
+      end
+    end
+    # Schedule a 2*poll_interval grace timer after a connected->error edge.
+    # If SSE recovers before the timer fires, +cancel_fallback_engage_timer+
+    # tears it down. Idempotent — does nothing if a timer is already pending
+    # or the supervisor is already alive.
+    def schedule_fallback_engage
+      poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
+      return if poll_interval <= 0
+      grace_seconds = poll_interval * 2.0
+      @state_mutex.synchronize do
+        return if @fallback_engage_timer&.alive?
+        return if @poll_supervisor&.alive?
+        return if @stopped
+        @fallback_engage_timer = Thread.new do
+          Thread.current.report_on_exception = false
+          sleep grace_seconds
+          @state_mutex.synchronize { @fallback_engage_timer = nil }
+          start_polling unless @stopped
+        end
+      end
+    end
     # Construct and start the telemetry reporter if the options permit it.
     # The reporter runs on a background thread and periodically POSTs
     # context-shape and example-context batches to +telemetry_destination+.
@@ -378,6 +569,7 @@ module Quonfig
     def load_datadir_into_store
       envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
       envelope.configs.each { |cfg| @store.set(cfg['key'], cfg) }
+      record_refresh!
     end
     # Initialize network mode: sync HTTP fetch (bounded by
@@ -412,7 +604,11 @@ module Quonfig
         return
       end
-      handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls')) if result == :failed
+      if result == :failed
+        handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls'))
+      else
+        record_refresh!
+      end
     end
     def handle_init_failure(err)
@@ -429,44 +625,79 @@ module Quonfig
     def start_sse
       return false if @options.sse_api_urls.nil? || @options.sse_api_urls.empty?
-      @sse_client = Quonfig::SSEConfigClient.new(@options, @config_loader)
+      @sse_client = Quonfig::SSEConfigClient.new(
+        @options,
+        @config_loader,
+        nil,
+        nil,
+        on_error: sse_error_callback
+      )
       @sse_client.start do |envelope, _event, _source|
         next if @stopped
         begin
           @config_loader.apply_envelope(envelope)
-          @on_update&.call
+          handle_sse_state_change(:connected)
+          record_refresh!
         rescue StandardError => e
           LOG.warn "[quonfig] Error applying SSE envelope: #{e.message}"
+          next
         end
+        notify_on_update_callback
       end
       true
     rescue StandardError => e
       LOG.warn "[quonfig] SSE start failed: #{e.message}"
       @sse_client = nil
+      handle_sse_state_change(:error)
       false
     end
     def start_polling
+      return if @stopped
+      return if @poll_supervisor&.alive?
       poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
       return if poll_interval <= 0
-      @poll_thread = Thread.new do
-        Thread.current.name = 'quonfig-poller'
+      stopped_ref = -> { @stopped }
+      worker = lambda do |notify_delivered|
         loop do
-          break if @stopped
+          break if stopped_ref.call
           sleep poll_interval
-          break if @stopped
-          begin
-            @config_loader.fetch!
-            @on_update&.call
-          rescue StandardError => e
-            LOG.warn "[quonfig] Polling error: #{e.message}"
-          end
+          break if stopped_ref.call
+          @config_loader.fetch!
+          record_refresh!
+          notify_delivered.call
+          notify_on_update_callback
         end
       end
+      supervisor = Quonfig::WorkerSupervisor.new(
+        name: 'poll', layer: '2', worker: worker
+      )
+      @state_mutex.synchronize { @poll_supervisor = supervisor }
+      supervisor.start
+    end
+    # Invoke the customer-supplied on_update callback under a rescue. A raise
+    # here is the customer's bug, but it must NOT take down the SSE listener
+    # or polling supervisor. Log at ERROR with a message containing
+    # "onConfigUpdate callback" so chaos scenario 10's
+    # sdkLog('error', /callback|onConfigUpdate/i) assertion matches and so
+    # the message is distinguishable from internal envelope-apply errors
+    # (qfg-47c2.30).
+    def notify_on_update_callback
+      cb = @on_update
+      return unless cb
+      begin
+        cb.call
+      rescue StandardError => e
+        LOG.error "[quonfig] onConfigUpdate callback raised: #{e.class}: #{e.message}"
+      end
     end
     def build_context(jit_context)
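The isolation that `notify_on_update_callback` provides can be demonstrated in a few lines. This is a minimal sketch under stated assumptions: `safe_notify`, `DEMO_LOGGER`, and the three-iteration loop are hypothetical stand-ins for the SDK's delivery path, and only the idea (rescue around the customer callback, log at error level) is taken from the diff.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for a guarded callback invocation.
# Returns true if the callback ran cleanly, false if it was absent or raised.
def safe_notify(callback, logger)
  return false unless callback

  callback.call
  true
rescue StandardError => e
  logger.error "onConfigUpdate callback raised: #{e.class}: #{e.message}"
  false
end

LOG_IO = StringIO.new
DEMO_LOGGER = Logger.new(LOG_IO)
DELIVERED = []

# Simulated delivery loop: the callback for the second update raises,
# yet the third update is still delivered.
3.times do |i|
  DELIVERED << i
  safe_notify(-> { raise ArgumentError, 'boom' if i == 1 }, DEMO_LOGGER)
end

puts DELIVERED.inspect # [0, 1, 2]
```

Keeping the rescue around the callback only, rather than around the whole delivery step, is what lets the log message distinguish a customer bug from an internal envelope-apply error.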

data/lib/quonfig/datadir.rb CHANGED

@@ -11,14 +11,16 @@ module Quonfig
   #   <datadir>/configs/*.json
   #   <datadir>/feature-flags/*.json
   #   <datadir>/segments/*.json
-  #   <datadir>/schemas/*.json
   #   <datadir>/log-levels/*.json
   #
+  # schemas/ is intentionally excluded — those files are raw JSON Schema
+  # documents, not Configs, and SDKs do not consume them (qfg-uzsl).
+  #
   # Each <type>/*.json file is a WorkspaceConfigDocument. The loader projects
   # it down to the ConfigResponse shape that the SSE/HTTP delivery path emits,
   # so ConfigStore consumes both transports uniformly.
   module Datadir
-    CONFIG_SUBDIRS = %w[configs feature-flags segments schemas log-levels].freeze
+    CONFIG_SUBDIRS = %w[configs feature-flags segments log-levels].freeze
     module_function
@@ -36,7 +38,10 @@ module Quonfig
            .select { |name| name.end_with?('.json') }
            .sort
            .each do |filename|
-          raw = JSON.parse(File.read(File.join(dir, filename)))
+          path = File.join(dir, filename)
+          raw = JSON.parse(File.read(path))
+          raise ArgumentError, "[quonfig] config has empty key — file is not a Quonfig Config: #{path}" if raw['key'].nil? || raw['key'].to_s.empty?
           configs << to_config_response(raw, env_id)
         end
       end
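The new empty-key guard can be exercised in isolation with a temporary directory. The sketch below is not the SDK's loader: `load_keyed_documents` and `demo_load` are hypothetical helpers that reproduce only the parse-then-validate step, so a raw JSON Schema file (which has no `key`) is rejected instead of being stored under an empty key.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical loader fragment: parse every *.json file in a directory and
# reject documents whose "key" is missing or empty.
def load_keyed_documents(dir)
  Dir.children(dir).select { |name| name.end_with?('.json') }.sort.map do |filename|
    path = File.join(dir, filename)
    raw = JSON.parse(File.read(path))
    raise ArgumentError, "empty key in #{path}" if raw['key'].nil? || raw['key'].to_s.empty?

    raw
  end
end

# Writes the given { filename => document } pairs to a temp dir, loads them,
# and returns the keys in load order.
def demo_load(documents)
  Dir.mktmpdir do |dir|
    documents.each { |name, body| File.write(File.join(dir, name), JSON.generate(body)) }
    load_keyed_documents(dir).map { |doc| doc['key'] }
  end
end

puts demo_load({ 'b.json' => { 'key' => 'beta' }, 'a.json' => { 'key' => 'alpha' } }).inspect
```

Raising at load time, as the diff does, surfaces a misplaced file immediately rather than leaving a config with a blank key in the store.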

data/lib/quonfig/sse_config_client.rb CHANGED

@@ -5,19 +5,99 @@ require 'json'
 module Quonfig
   class SSEConfigClient
+    # ld-eventsource auto-reconnects on a clean socket EOF (server FIN)
+    # *internally* — it never calls +on_error+ for that case, only for
+    # ECONNREFUSED-style failures (qfg-ie49; see chaos scenario 09). The one
+    # signal it emits for any reconnect is an info-level
+    # "Will retry connection after ..." line, logged once per reconnect attempt
+    # and never on the first connect. Wrapping the logger we hand to
+    # SSE::Client lets the SDK observe those internal reconnects without
+    # touching the data path. This is the only reconnect hook ld-eventsource
+    # >= 2.0 exposes.
+    class ReconnectCountingLogger
+      RECONNECT_SIGNAL = 'Will retry connection after'
+      LEVELS = %i[trace debug info warn error fatal].freeze
+      def initialize(wrapped, &on_reconnect)
+        @wrapped = wrapped
+        @on_reconnect = on_reconnect
+      end
+      # Crash-safe by construction: ld-eventsource calls this logger from
+      # inside its bare-Thread +run_stream+ loop, and several of those call
+      # sites (+connect+, +log_and_dispatch_error+, query-param building) are
+      # NOT wrapped in a rescue. Any exception that escapes a logger call kills
+      # the worker thread with +@stopped+ still false, so +closed?+ never flips
+      # true and the SDK's @retry_thread never reconnects — the SSE stream is
+      # silently wedged forever (qfg-cf52, the chaos scenario 05 flake). Every
+      # step here is therefore independently guarded: a throwing message block,
+      # a throwing on_reconnect callback, or a throwing wrapped logger can
+      # never propagate out of this method.
+      LEVELS.each do |level|
+        define_method(level) do |message = nil, &block|
+          begin
+            message = block.call if message.nil? && block
+          rescue StandardError
+            message = nil
+          end
+          if level == :info && message.to_s.include?(RECONNECT_SIGNAL)
+            begin
+              @on_reconnect.call
+            rescue StandardError
+              nil
+            end
+          end
+          begin
+            @wrapped.public_send(level, message) if @wrapped.respond_to?(level)
+          rescue StandardError
+            nil
+          end
+        end
+      end
+      def level
+        @wrapped&.level
+      end
+      def level=(new_level)
+        @wrapped.level = new_level if @wrapped.respond_to?(:level=)
+      end
+    end
     class Options
       attr_reader :sse_read_timeout, :seconds_between_new_connection,
                   :sse_default_reconnect_time, :sleep_delay_for_new_connection_check,
-                  :errors_to_close_connection
+                  :errors_to_close_connection, :sse_reconnect_reset_interval
-      def initialize(sse_read_timeout: 300,
+      # sse_read_timeout: 90s = 3x the 30s server heartbeat. A silent socket
+      # stall trips the read deadline within one missed-heartbeat window
+      # rather than the previous 5-minute idle. See plan
+      # `project/plans/sdk-hardening-and-verification.md` Layer 1.
+      #
+      # sse_reconnect_reset_interval: 1s (ld-eventsource default is 60s). The
+      # ld-eventsource backoff only resets to the base interval once a
+      # connection has stayed up this long; until then each reconnect doubles
+      # the delay (1s, 2s, 4s, 8s...). With the 60s default, a flapping
+      # connection (chaos scenario 09 — proxy killed every 6s) backs off so
+      # fast the SDK is mid-sleep when the next kill lands and never observes
+      # it. Resetting after 1s of healthy connection mirrors sdk-python, which
+      # resets its backoff on every successful connect (sdk-python/quonfig/
+      # sse.py). A *sustained* outage still backs off exponentially: no
+      # connection succeeds, so `mark_success` is never called and the reset
+      # never triggers (qfg-ie49).
+      def initialize(sse_read_timeout: 90,
                      seconds_between_new_connection: 5,
                      sleep_delay_for_new_connection_check: 1,
                      sse_default_reconnect_time: SSE::Client::DEFAULT_RECONNECT_TIME,
+                     sse_reconnect_reset_interval: 1,
                      errors_to_close_connection: [HTTP::ConnectionError])
         @sse_read_timeout = sse_read_timeout
         @seconds_between_new_connection = seconds_between_new_connection
         @sse_default_reconnect_time = sse_default_reconnect_time
+        @sse_reconnect_reset_interval = sse_reconnect_reset_interval
         @sleep_delay_for_new_connection_check = sleep_delay_for_new_connection_check
         @errors_to_close_connection = errors_to_close_connection
       end
@@ -25,12 +105,46 @@ module Quonfig
     LOG = Quonfig::InternalLogger.new(self)
-    def initialize(prefab_options, config_loader, options = nil, logger = nil)
+    # +on_error+: optional callable invoked on every SSE error edge. Parent
+    # Quonfig::Client wires this to drive @sse_state -> :error so that
+    # +connection_state+ reflects the disconnect (qfg-47c2.27). Without it
+    # the SDK's public health primitive would lie about its own state during
+    # a mid-run socket drop.
+    def initialize(prefab_options, config_loader, options = nil, logger = nil, on_error: nil)
       @prefab_options = prefab_options
       @options = options || Options.new
       @config_loader = config_loader
       @connected = false
       @logger = logger || LOG
+      @on_error = on_error
+      @restart_total = 0
+      @restart_mutex = Mutex.new
+    end
+    # qfg-ll6r / qfg-ie49: Layer 1 (SSE) restart counter — counts every
+    # *reconnect*, from two sources:
+    #   1. ld-eventsource's own internal reconnect (clean FIN, read timeout,
+    #      transient errors it doesn't surface) — observed via the
+    #      ReconnectCountingLogger "Will retry connection after" signal.
+    #   2. SDK-driven reconnects in @retry_thread, after a closing error
+    #      (HTTP::ConnectionError) made us close the SSE::Client outright.
+    # These two are mutually exclusive per disconnect, so there is no
+    # double-count. on_error is deliberately NOT a source — ld-eventsource
+    # reconnects internally after most non-closing errors, so counting the
+    # error edge AND the reconnect would double up (qfg-ie49).
+    #
+    # The chaos harness pulls this via Client#worker_restart_total(layer: '1')
+    # so kill-storm scenarios (e.g. scenario 09 — proxy killed 5x in 30s) can
+    # assert restart_total >= 5 even when the kills produce clean FINs that
+    # never reach on_error.
+    def restart_total
+      @restart_mutex.synchronize { @restart_total }
+    end
+    # Bump the Layer 1 reconnect counter. Called from the ld-eventsource
+    # worker thread (via ReconnectCountingLogger) and from @retry_thread.
+    def count_restart!
+      @restart_mutex.synchronize { @restart_total += 1 }
     end
     def close
@@ -60,6 +174,11 @@ module Quonfig
           closed_count = 0
           @logger.debug 'Reconnecting SSE client'
+          # SDK-driven reconnect: a closing error (HTTP::ConnectionError)
+          # closed the previous SSE::Client, so ld-eventsource's own
+          # reconnect loop has exited and won't emit the "Will retry" signal.
+          # Count it here instead (qfg-ie49).
+          count_restart!
           @client = connect(&load_configs)
         end
       end
@@ -70,12 +189,20 @@ module Quonfig
       cursor = current_cursor
       @logger.debug "SSE Streaming Connect to #{url} start_at #{cursor.inspect}"
+      # Wrap the ld-eventsource logger so internal reconnects (clean FIN,
+      # read-timeout, transient errors) bump restart_total — they never reach
+      # on_error (qfg-ie49).
+      sse_logger = ReconnectCountingLogger.new(
+        Quonfig::InternalLogger.new(SSE::Client)
+      ) { count_restart! }
       SSE::Client.new(url,
                       headers: headers,
                       read_timeout: @options.sse_read_timeout,
                       reconnect_time: @options.sse_default_reconnect_time,
+                      reconnect_reset_interval: @options.sse_reconnect_reset_interval,
                       last_event_id: cursor,
-                      logger: Quonfig::InternalLogger.new(SSE::Client)) do |client|
+                      logger: sse_logger) do |client|
         client.on_event do |event|
           if event.data.nil? || event.data.empty?
             @logger.error "SSE Streaming Error: Received empty data for url #{url}"
@@ -106,6 +233,25 @@ module Quonfig
             @logger.error "SSE Streaming Error: #{error.inspect} for url #{url}"
           end
+          # qfg-ie49: restart_total is NOT bumped here. ld-eventsource
+          # auto-reconnects after most non-closing errors, and that reconnect
+          # is already counted via ReconnectCountingLogger; bumping here too
+          # would double-count. For closing errors (HTTP::ConnectionError) the
+          # reconnect is counted in @retry_thread instead. on_error's job is
+          # purely to notify the parent client of the disconnect edge.
+          # Notify the parent client BEFORE deciding whether to close — every
+          # error edge is a disconnect signal as far as @sse_state goes, even
+          # if we let the underlying SSE library handle reconnect itself.
+          # qfg-47c2.27
+          if @on_error
+            begin
+              @on_error.call(error)
+            rescue StandardError => e
+              @logger.error "SSE on_error callback raised: #{e.inspect}"
+            end
+          end
           if @options.errors_to_close_connection.any? { |klass| error.is_a?(klass) }
             @logger.debug "Closing SSE connection for url #{url}"
             client.close
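The pass-through counting technique used by `ReconnectCountingLogger` can be tried without ld-eventsource. The class below is a reduced, hypothetical variant written for illustration: `SignalCountingLogger` is not in the gem, it counts in an instance variable without a mutex (the diff routes the count through `count_restart!`, which takes one), and the log lines fed to it are made-up examples of the signal text.

```ruby
require 'logger'
require 'stringio'

# Simplified, hypothetical pass-through logger: counts info-level lines that
# contain a signal string and forwards every call to the wrapped logger.
class SignalCountingLogger
  attr_reader :count

  def initialize(wrapped, signal)
    @wrapped = wrapped
    @signal = signal
    @count = 0
  end

  %i[debug info warn error].each do |level|
    define_method(level) do |message = nil, &block|
      begin
        # Resolve lazily supplied messages, as Logger-style callers may pass a block.
        message = block.call if message.nil? && block
        @count += 1 if level == :info && message.to_s.include?(@signal)
        @wrapped.public_send(level, message)
      rescue StandardError
        nil # never let a logging failure escape into the caller's thread
      end
    end
  end
end

SINK = StringIO.new
COUNTING = SignalCountingLogger.new(Logger.new(SINK), 'Will retry connection after')
COUNTING.info('Connecting to event stream')
COUNTING.info { 'Will retry connection after 1.000 seconds' }
COUNTING.warn('Will retry connection after 2.000 seconds') # wrong level: not counted
COUNTING.info('Will retry connection after 2.000 seconds')

puts COUNTING.count # 2
```

Unlike this sketch, the class in the diff guards each step separately, so a failing message block cannot prevent the wrapped logger from being called.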

data/lib/quonfig/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Quonfig
-  VERSION = '0.0.14'
+  VERSION = '0.0.15'
 end

data/lib/quonfig/worker_supervisor.rb ADDED

@@ -0,0 +1,186 @@
+# frozen_string_literal: true
+module Quonfig
+  # Internal control-flow exception raised inside a supervised worker thread
+  # to signal cooperative shutdown. Workers may catch and re-raise, or just
+  # propagate.
+  class Shutdown < StandardError; end
+  # Single supervisor for a long-lived background worker (SSE read loop,
+  # fallback poller). Catches unhandled exceptions at the worker boundary,
+  # logs them, increments +worker_restart_total+, and restarts with
+  # exponential backoff capped at 30s.
+  #
+  # Contract: integration-test-data/chaos/supervisor-test-contract.md
+  # Plan:     project/plans/sdk-hardening-and-verification.md (Phase 1)
+  #
+  # The worker is a Proc-like callable invoked as +worker.call(notify_delivered)+
+  # where +notify_delivered+ is a Proc the worker calls when it has handed at
+  # least one envelope to the cache. That signal resets the backoff so a
+  # transient blip doesn't double the delay on the next disconnect.
+  #
+  # Shutdown is signaled by Thread#raise(Quonfig::Shutdown) into the
+  # supervisor thread. Logger writes and bookkeeping use Thread.handle_interrupt
+  # so a concurrent raise doesn't trip Ruby's "log writing failed" path.
+  class WorkerSupervisor
+    METRIC_NAME = 'quonfig_sdk_worker_restart_total'
+    DEFAULT_INITIAL_BACKOFF = 0.5
+    DEFAULT_MAX_BACKOFF     = 30.0
+    DEFAULT_MULTIPLIER      = 2.0
+    SHUTDOWN_TIMEOUT_SEC    = 5.0
+    LOG = Quonfig::InternalLogger.new(self)
+    attr_reader :worker_restart_total, :worker_restart_labels
+    def initialize(name:, worker:, layer: '1',
+                   initial_backoff: DEFAULT_INITIAL_BACKOFF,
+                   max_backoff: DEFAULT_MAX_BACKOFF,
+                   multiplier: DEFAULT_MULTIPLIER,
+                   sleep_proc: nil,
+                   logger: nil)
+      @name = name
+      @layer = layer.to_s
+      @worker = worker
+      @initial_backoff = initial_backoff
+      @max_backoff = max_backoff
+      @multiplier = multiplier
+      @sleep_proc = sleep_proc || ->(seconds) { sleep(seconds) }
+      @logger = logger || LOG
+      @worker_restart_total = 0
+      @worker_restart_labels = {
+        sdk: 'ruby',
+        sdk_version: Quonfig::VERSION,
+        layer: @layer
+      }.freeze
+      @mutex = Mutex.new
+      @stop_requested = false
+      @thread = nil
+      @current_backoff = @initial_backoff
+    end
+    def start
+      @mutex.synchronize do
+        return self if @thread&.alive?
+        @stop_requested = false
+        ready = Queue.new
+        @thread = Thread.new do
+          # Set report_on_exception + signal "ready" BEFORE entering
+          # run_loop. start() blocks on the ready queue so a racing stop()
+          # can never raise into a thread that hasn't yet installed its
+          # Shutdown rescue.
+          Thread.current.report_on_exception = false
+          ready << true
+          run_loop
+        rescue Quonfig::Shutdown
+          # cooperative shutdown raced with thread startup; swallowed
+        end
+        ready.pop
+      end
+      self
+    end
+    def alive?
+      t = @thread
+      !t.nil? && t.alive?
+    end
+    def stop
+      thread = @mutex.synchronize do
+        @stop_requested = true
+        t = @thread
+        @thread = nil
+        t
+      end
+      return if thread.nil?
+      raise_shutdown(thread)
+      thread.join(SHUTDOWN_TIMEOUT_SEC)
+      thread.kill if thread.alive?
+      nil
+    end
+    alias close stop
+    private
+    def raise_shutdown(thread)
+      return if thread.nil?
+      return unless thread.alive?
+      begin
+        thread.raise(Quonfig::Shutdown.new('supervisor stopping'))
+      rescue ThreadError
+        # thread already exited between alive? and raise — fine
+      end
+    end
+    def run_loop
+      Thread.current.name = "quonfig-supervisor-#{@name}"
+      # Don't dump our managed Shutdown to stderr on shutdown.
+      Thread.current.report_on_exception = false
+      loop do
+        break if stop?
+        delivered = false
+        notify_delivered = -> { delivered = true }
+        reason = :worker_exit
+        begin
+          @worker.call(notify_delivered)
+        rescue Quonfig::Shutdown
+          break
+        rescue StandardError => e
+          reason = :worker_throw
+          safe_log(:error,
+                   "[quonfig] supervisor=#{@name} worker raised #{e.class}: #{e.message}")
+          bt = e.backtrace&.first(10)&.join("\n")
+          safe_log(:debug, bt) if bt
+        end
+        break if stop?
+        @worker_restart_total += 1
+        @current_backoff = @initial_backoff if delivered
+        backoff = @current_backoff
+        safe_log(:warn,
+                 "[quonfig] supervisor=#{@name} restarting worker " \
+                 "(reason=#{reason}, restart_total=#{@worker_restart_total}, " \
+                 "backoff_s=#{backoff})")
+        begin
+          @sleep_proc.call(backoff)
+        rescue Quonfig::Shutdown
+          break
+        end
+        @current_backoff = [@current_backoff * @multiplier, @max_backoff].min
+      end
+    rescue Quonfig::Shutdown
+      # supervisor-level cooperative shutdown
+    rescue StandardError => e
+      safe_log(:error, "[quonfig] supervisor=#{@name} crashed: #{e.class}: #{e.message}")
+    end
+    def stop?
+      @mutex.synchronize { @stop_requested }
+    end
+    # Defer Shutdown delivery while we're inside Logger.write so we don't
+    # trip Logger's "log writing failed" -> stderr fallback. Swallow any
+    # other logger error.
+    def safe_log(level, msg)
+      return unless @logger.respond_to?(level)
+      Thread.handle_interrupt(Quonfig::Shutdown => :never) do
+        @logger.public_send(level, msg)
+      end
+    rescue StandardError
+      nil
+    end
+  end
+end
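The supervisor's backoff bookkeeping in `run_loop` can be replayed as a pure function to see the delays it would produce. `backoff_schedule` is a hypothetical helper, not part of the gem; its defaults copy `DEFAULT_INITIAL_BACKOFF`, `DEFAULT_MAX_BACKOFF`, and `DEFAULT_MULTIPLIER` from the file above, and it assumes the worker exits or raises on every run.

```ruby
# Hypothetical replay of the supervisor's backoff bookkeeping.
# deliveries: one boolean per worker run, true if that run delivered at least
# one envelope before exiting. Returns the sleep chosen after each run.
def backoff_schedule(deliveries, initial: 0.5, max: 30.0, multiplier: 2.0)
  current = initial
  deliveries.map do |delivered|
    current = initial if delivered # a delivery resets the backoff
    sleep_for = current
    current = [current * multiplier, max].min # double for next time, capped
    sleep_for
  end
end

# Eight consecutive failures with no delivery: doubles until the 30s cap.
puts backoff_schedule([false] * 8).inspect
# => [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]

# A delivery on the third run resets the delay to the initial value.
puts backoff_schedule([false, false, true, false]).inspect
# => [0.5, 1.0, 0.5, 1.0]
```

In the real class the sleep goes through the injectable `sleep_proc`, which is what makes this schedule observable in tests without waiting.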

data/lib/quonfig.rb CHANGED

@@ -29,6 +29,7 @@ require 'quonfig/evaluation'
 require 'quonfig/evaluation_details'
 require 'quonfig/encryption'
 require 'quonfig/exponential_backoff'
+require 'quonfig/worker_supervisor'
 require 'quonfig/periodic_sync'
 require 'quonfig/errors/initialization_timeout_error'
 require 'quonfig/errors/invalid_sdk_key_error'

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: quonfig
 version: !ruby/object:Gem::Version
-  version: 0.0.14
+  version: 0.0.15
 platform: ruby
 authors:
 - Jeff Dwyer
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2026-05-10 00:00:00.000000000 Z
+date: 2026-05-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -134,6 +134,7 @@ files:
 - lib/quonfig/types.rb
 - lib/quonfig/version.rb
 - lib/quonfig/weighted_value_resolver.rb
+- lib/quonfig/worker_supervisor.rb
 - quonfig.gemspec
 homepage: https://github.com/quonfig/sdk-ruby
 licenses: