quonfig 0.0.15 → 0.0.16 (RubyGems)

Files changed (9)

checksums.yaml +4 -4
data/CHANGELOG.md +8 -0
data/README.md +37 -11
data/lib/quonfig/client.rb +168 -23
data/lib/quonfig/sse_config_client.rb +536 -225
data/lib/quonfig/version.rb +1 -1
data/lib/quonfig.rb +1 -1
data/quonfig.gemspec +0 -1
metadata +1 -15

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: e4e037ad01a35ca5a3fb3ddcc30ad6b0dab78ad82e4908a4a8ce9e8bab6cab40
-  data.tar.gz: 8bcccb03befbab5f1fbed1cbae867ce970498ac0081c92e24db7d8eb899d2faa
+  metadata.gz: 6f167f3b60db07394dc7c49b85c3dbc196e0b5c82f3426b35695c0f212339b8b
+  data.tar.gz: 4b79e1196c4625359943255a348d907c28865a5cd85432ac464737406d7a6169
 SHA512:
-  metadata.gz: 9d4abdeaeaaad881e5f28cb9a653715dd8b1838ba33cc38b6b1f08db5f729173d5eadbf2afebfb6e3ca3a379f0354ab453fafd760a1fd61d13c3efef60ad0aee
-  data.tar.gz: 890131a3f75092f1b846ee4ca46c1dc20702b1effc3db5803443905d8a8571a33b672a691a18bbb0c3ad8471c5db72006a745f50ac0e919bf0997b49cf202045
+  metadata.gz: 5aa3a23774245bf31752e4c9918de8bf37cc865e15b6ed160b222181d805a0fe477064cc5cf27dc6810b87cdd1c250f8558b650c98fd5e7a354c3e2e70090c53
+  data.tar.gz: 82c4561817b40e4dd0ecfd1b5267e6a5f41ea2e774c2d2537400aaf04886eb579807da457f6964915a065a32dc7beea043a2797d83ff04dba3c9fb4e46c39cb3

data/CHANGELOG.md CHANGED

@@ -1,5 +1,13 @@
 # Changelog
+## 0.0.16 - 2026-05-15
+- **Feat (SSE): replace `ld-eventsource` with an SDK-owned reconnect loop (qfg-35sm).** sdk-ruby was the outlier among the four backend SDKs — sdk-go, sdk-node, and sdk-python all own their reconnect loop, only sdk-ruby handed it off to a library and scraped its log output to observe reconnects. The wire format we actually consume (plain JSON envelopes in single-line `data:` frames, no named events, no retry directives) is trivial enough that an SDK-owned loop is clearer than the library wrapper. New `Quonfig::SSEConfigClient` (~520 LoC, `lib/quonfig/sse_config_client.rb`) handles connect/parse/reconnect end-to-end. `restart_total` is now incremented at exactly one site under a mutex — verifiable, not log-scraped. `ld-eventsource` and the transitive `http` gem are removed from the gemspec. `ReconnectCountingLogger` and the `sse_reconnect_reset_interval` option (both 0.0.15-era defensive scaffolding around upstream behavior) are deleted — the bugs they defended against don't exist when the SDK owns the loop. Chaos: 10/10 in a 36-min run (scenarios 02 silent-stall, 05 sse-down-fallback, 09 flapping kill-storm).
+- **Fix (SSE): contain watchdog `Thread#raise` with `Thread.handle_interrupt` (qfg-tj18).** The new watchdog fires `Thread#raise(SSEReadDeadlineExceeded)` into the worker on a silent stall — the only reliable cross-platform way to unblock a Ruby thread blocked in `Net::HTTP`'s body-read on macOS. The decision to fire is mutex-guarded against `stop()`, but the raise itself is delivered at the worker's next interrupt checkpoint, which could be anywhere in the call stack. `run_loop`'s body now runs under `Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)` so a late-landing raise can only land inside a blocking call; the `read_body` block explicitly switches to `:immediate`. A paranoid backstop `rescue` outside the until-`@stopped` loop ensures an escaped raise can never silently kill the worker.
+- **Fix (SSE): isolate `on_envelope` callback exceptions (qfg-m3lk).** A buggy user-supplied listener that raised during envelope delivery used to propagate out of `read_body`, get caught by `run_loop` as a transport error, bump `restart_total`, and reconnect — a perpetual reconnect storm at api-delivery-sse driven by a customer code bug. The callback is now wrapped in `begin/rescue StandardError` at the invocation site; exceptions are logged with class + message + backtrace sample and the stream continues uninterrupted. `Interrupt` and `SystemExit` are deliberately not caught so `Ctrl-C` still works.
+- **Fix (SSE): classify 401/403/404 as terminal errors (qfg-i5xv).** Non-200 responses used to be treated identically — `SSEHTTPStatusError` raised, `run_loop` rescued, `restart_total` bumped, backoff, retry, forever. For a bad SDK key (401) or revoked workspace (403) that was wasted load on api-delivery-sse with no recovery path short of a customer redeploy. New `SSEHTTPTerminalError` sentinel for 401/403/404; `run_loop` catches it, invokes `on_error`, exits the loop without bumping `restart_total`. Parent `Quonfig::Client` surfaces a terminal `:sse_terminal_failure` state distinct from transient `:error`. 429 and 5xx still retry.
+- **Feat (fork): install `Process._fork` hook so SSE auto-restarts after fork (qfg-ryov).** Ruby threads do not survive `fork(2)`. Customers initializing `Quonfig::Client` in the Puma master (the `preload_app! true` / Rails `config.eager_load = true` convention) used to silently lose SSE in every worker child. New `Quonfig::ForkSafety` module prepends `Process._fork` and fans out across all live `Quonfig::Client` instances (tracked in an `ObjectSpace::WeakMap`): in the parent before the syscall, threaded components (SSE worker, polling supervisor, telemetry reporter) are torn down; in the child after the syscall, they are rebuilt. `@stopped` is preserved so a `stop()`-ed client stays stopped across fork. Covers `Process.fork` / `Kernel#fork`; `Process.spawn` and `system("...")` exec a new program so in-process state doesn't apply. Ruby 3.0 lacks `Process._fork` and is documented as requiring manual `before_fork` / `on_worker_boot` wiring.
 ## 0.0.15 - 2026-05-15
 - **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
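
The `Thread.handle_interrupt` containment described in the qfg-tj18 entry can be illustrated in isolation. This is a minimal sketch, not the gem's code: `DeadlineExceeded` and `deferred_raise_demo` are hypothetical names standing in for `SSEReadDeadlineExceeded` and the worker loop. It shows that, under `:on_blocking`, a raise queued while the thread is in a non-blocking section is held back until the thread reaches a blocking call.

```ruby
# Hypothetical stand-in for the gem's SSEReadDeadlineExceeded.
class DeadlineExceeded < StandardError; end

# Returns the ordered list of checkpoints the worker reached.
def deferred_raise_demo
  log = []
  ready = Queue.new
  raise_sent = false

  worker = Thread.new do
    # Defer DeadlineExceeded until the thread is inside a blocking operation.
    Thread.handle_interrupt(DeadlineExceeded => :on_blocking) do
      begin
        ready << :masked          # tell the main thread the mask is installed
        nil until raise_sent      # non-blocking spin: the queued raise waits
        log << :survived_non_blocking_section
        sleep 5                   # blocking call: the raise is delivered here
        log << :not_reached
      rescue DeadlineExceeded
        log << :deadline_handled
      end
    end
  end

  ready.pop                       # wait until the worker has installed the mask
  worker.raise(DeadlineExceeded)  # plays the role of the watchdog
  raise_sent = true
  worker.join
  log
end
```

Without the mask, the same raise could land at any point in the spin loop, which is the "anywhere in the call stack" hazard the changelog entry describes.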

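The qfg-i5xv and qfg-m3lk entries describe two small decisions: which HTTP statuses end the reconnect loop, and how a raising listener is kept from looking like a transport error. A rough sketch of both follows. `TERMINAL_HTTP_CODES` mirrors the constant in the diff; `classify_status` and `deliver_envelope` are hypothetical helpers written for illustration and do not appear in the gem.

```ruby
# Statuses that will not heal by retrying (bad key, revoked access, wrong endpoint).
TERMINAL_HTTP_CODES = [401, 403, 404].freeze

# Hypothetical helper: map an HTTP status to the action a reconnect loop takes.
def classify_status(code)
  return :ok if code == 200

  TERMINAL_HTTP_CODES.include?(code) ? :terminal : :transient
end

# Hypothetical helper: invoke the listener and record any StandardError
# instead of letting it escape into the transport layer. Interrupt and
# SystemExit are not StandardError subclasses, so they still propagate.
def deliver_envelope(envelope, on_envelope, errors)
  on_envelope.call(envelope)
rescue StandardError => e
  errors << "#{e.class}: #{e.message}"
end
```

A loop built on these would exit on `:terminal`, back off and retry on `:transient`, and keep streaming when `deliver_envelope` records an error.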
data/README.md CHANGED

@@ -247,15 +247,41 @@ not want to inherit across a `fork(2)`. Forked threads in the child process
 are dead — the SSE socket is held open by a thread that no longer exists, and
 the child silently stops receiving live updates.
-Use `Quonfig::Client#fork` (or `Quonfig.fork` if you use the module-level
-singleton) in any process that fork-spawns workers. It returns a fresh client
-configured for the child: a new `ConfigStore`, a new SSE subscription, and
-suppressed telemetry double-counting (`Options#is_fork` is set to `true`).
+**On Ruby 3.1+ the SDK installs a `Process._fork` hook at load time** that
+automatically tears down threaded components in the parent and restarts them
+in the child. This covers any `Process.fork` / `Kernel#fork` path — Puma's
+clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Spring, and
+manual `fork { ... }` calls. **No customer wiring is required.**
+Caveats:
+- Ruby 3.0 has no hookable choke point — fall back to manual wiring (below).
+- `system("fork-and-exec ...")` and `Process.spawn` are not covered (they do
+  not go through `Process._fork`), but those execute a new program, so the
+  in-process SSE state is moot.
+- The hook tears down the SSE/polling/telemetry threads in the parent before
+  fork (so the child does not inherit a live socket fd) and does **not**
+  auto-restart the parent. This mirrors the Puma master case: the master no
+  longer serves requests, so it does not need a live SSE connection. If you
+  have a non-Puma topology where the parent must keep streaming after fork,
+  call `Quonfig.instance.after_fork_in_child` manually in the parent after
+  the fork returns.
 ### Puma (clustered mode)
+With the automatic fork hook, the typical Puma config needs **no Quonfig
+lifecycle wiring** — initialize in your Rails initializer and let the hook
+handle the rest:
 ```ruby
-# config/puma.rb
+# config/initializers/quonfig.rb
+Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
+```
+If you're on Ruby 3.0 (no `Process._fork`), wire the legacy hooks manually:
+```ruby
+# config/puma.rb (Ruby 3.0 only)
 before_fork do
   Quonfig.instance.stop          # close the master's SSE before forking
 end
@@ -265,18 +291,18 @@ on_worker_boot do
 end
 ```
-If you initialize Quonfig lazily (in a Rails initializer) and run Puma in
-single mode (no clustering), no fork hook is needed.
 ### Sidekiq
-Sidekiq's parent process forks workers. Wire the same lifecycle:
+On Ruby 3.1+ the automatic fork hook covers Sidekiq workers too — no
+`configure_server` wiring required.
+On Ruby 3.0:
 ```ruby
 # config/initializers/quonfig.rb
 Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
-# config/initializers/sidekiq.rb
+# config/initializers/sidekiq.rb (Ruby 3.0 only)
 Sidekiq.configure_server do |config|
   config.on(:startup)  { Quonfig.fork if Process.ppid != 1 }
   config.on(:shutdown) { Quonfig.instance.stop rescue nil }
@@ -284,7 +310,7 @@ end
 ```
 For Sidekiq web/CLI processes that don't fork (default `concurrency: 1`),
-`Quonfig.init` in the initializer is sufficient.
+`Quonfig.init` in the initializer is sufficient on any Ruby version.
 ### Spring / Bootsnap preloaders

data/lib/quonfig/client.rb CHANGED

@@ -20,6 +20,29 @@ module Quonfig
   class Client
     LOG = Quonfig::InternalLogger.new(self)
+    # qfg-ryov: instance registry for the Process._fork hook. Every live
+    # Client is tracked here so the hook can fan out before_fork_in_parent /
+    # after_fork_in_child across all of them without the customer needing to
+    # name a specific instance. ObjectSpace::WeakMap means a Client that goes
+    # out of scope is GC'd without leaking through this registry. Stopped
+    # Clients stay in the registry until GC; both fork hooks early-return on
+    # +@stopped+ so a stopped instance is effectively a no-op. (We don't use
+    # WeakMap#delete because it was added in Ruby 3.3 and the matrix still
+    # includes 3.2.)
+    @instances = ObjectSpace::WeakMap.new
+    @instances_mutex = Mutex.new
+    class << self
+      # Iterate live Client instances. Used by Quonfig::ForkSafety.
+      def each_instance(&block)
+        @instances_mutex.synchronize { @instances.keys }.each(&block)
+      end
+      def register_instance(client)
+        @instances_mutex.synchronize { @instances[client] = true }
+      end
+    end
     attr_reader :options, :resolver, :store, :evaluator, :instance_hash,
                 :config_loader, :telemetry_reporter
@@ -48,6 +71,7 @@ module Quonfig
       @sse_state = :idle
       @sse_ever_connected = false
       @fallback_engage_timer = nil
+      @sse_terminal_failure = false
       # If the caller injected a store, we're in test/bootstrap mode; skip I/O.
       return if store
@@ -59,6 +83,10 @@ module Quonfig
       end
       initialize_telemetry
+      # Register only for non-store-injected clients (a caller-supplied store
+      # is the test/bootstrap path; the fork hook does not apply there).
+      self.class.register_instance(self) unless store
     end
     # ---- Lookup --------------------------------------------------------
@@ -264,34 +292,52 @@ module Quonfig
     def stop
       @stopped = true
-      begin
-        @sse_client&.close
-      rescue StandardError => e
-        LOG.debug "Error closing SSE client: #{e.message}"
-      end
-      @sse_client = nil
+      tear_down_threaded_components!
+    end
-      cancel_fallback_engage_timer
+    # qfg-ryov: pre-fork hook. Close the SSE worker, polling supervisor,
+    # telemetry reporter, and any fallback-engage timer. Idempotent — calling
+    # twice is safe. Does NOT set @stopped: the client is still expected to
+    # be usable post-fork via after_fork_in_child.
+    #
+    # Why this matters: Ruby threads do not survive fork(2). If we let the
+    # child inherit a live Net::HTTP socket, both processes read from the
+    # same fd and corrupt each other's bytes. Closing in the parent before
+    # fork is the only safe shape.
+    def before_fork_in_parent
+      return if @stopped
-      begin
-        @poll_supervisor&.stop
-      rescue StandardError => e
-        LOG.debug "Error stopping poll supervisor: #{e.message}"
-      end
-      @poll_supervisor = nil
+      tear_down_threaded_components!
+    end
-      begin
-        @telemetry_reporter&.stop
-      rescue StandardError => e
-        LOG.debug "Error stopping telemetry reporter: #{e.message}"
+    # qfg-ryov: post-fork (in child) hook. Re-establish whatever threaded
+    # components the client had pre-fork. No-op if the client was already
+    # stopped (the customer asked for it to be dead — do not resurrect),
+    # or if the client is in datadir mode (no threaded components to start).
+    def after_fork_in_child
+      return if @stopped
+      return if @options.datadir
+      return if @config_loader.nil? # never finished network init (e.g. invalid key)
+      # SSE state machine carries flags that no longer apply in the child
+      # (the parent had connected, the parent had errored, etc.). Reset.
+      @state_mutex.synchronize do
+        @sse_state = :idle
+        @sse_ever_connected = false
+        @sse_terminal_failure = false
       end
-      @telemetry_reporter = nil
+      sse_started = @options.enable_sse && start_sse
+      start_polling if @options.enable_polling && !sse_started
+      restart_telemetry_in_child
     end
     # quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
     # Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
-    # incremented on every on_error edge from ld-eventsource (qfg-ll6r).
-    # Layer 2 (HTTP polling fallback) is wired through Quonfig::WorkerSupervisor.
+    # incremented once per reconnect attempt by the SDK-owned reconnect
+    # loop (qfg-35sm). Layer 2 (HTTP polling fallback) is wired through
+    # Quonfig::WorkerSupervisor.
     #
     # Pass +layer:+ ('1' or '2') to read a single layer; default returns the
     # sum across both layers so the chaos harness (and operators) can pull
@@ -357,6 +403,41 @@ module Quonfig
     private
+    # Close every threaded component and drop its reference. Used by both
+    # +stop+ (where @stopped is also flipped) and +before_fork_in_parent+
+    # (where @stopped is left alone so the child can restart).
+    def tear_down_threaded_components!
+      begin
+        @sse_client&.close
+      rescue StandardError => e
+        LOG.debug "Error closing SSE client: #{e.message}"
+      end
+      @sse_client = nil
+      cancel_fallback_engage_timer
+      begin
+        @poll_supervisor&.stop
+      rescue StandardError => e
+        LOG.debug "Error stopping poll supervisor: #{e.message}"
+      end
+      @poll_supervisor = nil
+      begin
+        @telemetry_reporter&.stop
+      rescue StandardError => e
+        LOG.debug "Error stopping telemetry reporter: #{e.message}"
+      end
+      @telemetry_reporter = nil
+    end
+    # Rebuild the telemetry reporter in the child after fork. Mirrors the
+    # original initialize_telemetry path — fresh aggregators, fresh reporter.
+    def restart_telemetry_in_child
+      @telemetry_reporter = nil
+      initialize_telemetry
+    end
     # Stamp +last_successful_refresh+ at install time. Called by every code
     # path that hands an envelope to the cache: datadir load, initial HTTP
     # fetch, SSE event apply, and polling worker fetch.
@@ -402,20 +483,31 @@ module Quonfig
       @sse_error_callback ||= ->(error) { handle_sse_error(error) }
     end
-    def handle_sse_error(_error)
+    def handle_sse_error(error)
+      # qfg-i5xv: classify terminal HTTP failures (401/403/404). The same SDK
+      # key that won't auth over SSE won't auth over HTTP polling either, so
+      # we must NOT engage the Layer 2 fallback — that just moves the
+      # auth-failure storm from one endpoint to another. Once flipped,
+      # @sse_terminal_failure latches: a buggy customer retry loop cannot
+      # un-classify the failure by driving the state machine.
+      @state_mutex.synchronize { @sse_terminal_failure = true } if error.is_a?(Quonfig::SSEConfigClient::SSEHTTPTerminalError)
       handle_sse_state_change(:error)
     end
     def handle_sse_state_change(new_state)
       state = new_state.to_sym
-      ever_connected = @state_mutex.synchronize do
+      ever_connected, terminal = @state_mutex.synchronize do
         @sse_state = state
         @sse_ever_connected = true if state == :connected
-        @sse_ever_connected
+        [@sse_ever_connected, @sse_terminal_failure]
       end
       return unless @options.respond_to?(:enable_polling) && @options.enable_polling
       return if @stopped
+      # qfg-i5xv: a terminal SSE classification suppresses polling engage in
+      # every branch — the customer's key is bad and HTTP polling will fail
+      # identically. Operators surface this via #terminal_failure?.
+      return if terminal
       case state
       when :connected
@@ -430,6 +522,21 @@ module Quonfig
       end
     end
+    public
+    # qfg-i5xv: true once the SSE layer has classified an HTTP response as
+    # terminal (401/403/404) — bad SDK key, revoked workspace permission,
+    # or wrong endpoint. The classification latches: the SDK will not
+    # auto-recover, and a customer-supplied retry must rebuild the client.
+    # Surfaced for operator alerting; `connection_state` still reports
+    # `:disconnected` to honor the documented connection_state vocabulary
+    # (supervisor-test-contract.md §"connectionState()" — values fixed).
+    def terminal_failure?
+      @state_mutex.synchronize { @sse_terminal_failure }
+    end
+    private
     def cancel_fallback_engage_timer
       timer = @state_mutex.synchronize do
         t = @fallback_engage_timer
@@ -904,4 +1011,42 @@ module Quonfig
       end
     end
   end
+  # qfg-ryov: hook into Process._fork so customers using Puma's clustered
+  # mode (or any preload/fork-worker server) don't have to wire
+  # +before_fork+/+on_worker_boot+ manually. Ruby 3.1+ routes every
+  # +Kernel#fork+/+Process.fork+ call through +Process._fork+, so a single
+  # prepend covers them all.
+  #
+  # Process._fork's contract:
+  #   - Called in the parent process before the fork syscall.
+  #   - Returns 0 in the child, child's pid in the parent.
+  #   - +super+ performs the actual fork.
+  #
+  # The parent's view: SSE/polling/telemetry threads are torn down before
+  # the syscall so the child does not inherit a live Net::HTTP socket fd
+  # (which would corrupt both sides). The parent does NOT auto-restart —
+  # that mirrors the Puma master use case where the master process no
+  # longer serves requests after spawning workers.
+  module ForkSafety
+    def _fork
+      Quonfig::Client.each_instance(&:before_fork_in_parent)
+      pid = super
+      Quonfig::Client.each_instance(&:after_fork_in_child) if pid.zero?
+      pid
+    rescue StandardError => e
+      # Fork-hook failures must never break the customer's fork. Worst case
+      # the child inherits dead SSE threads (the pre-qfg-ryov behavior) —
+      # bad, but recoverable. Crashing the fork itself is not.
+      Quonfig::Client::LOG.error "Quonfig fork hook error: #{e.class}: #{e.message}"
+      raise if pid.nil? # super never returned — propagate fork failures
+      pid
+    end
+  end
+  # Ruby 3.0 lacks Process._fork. There's no hookable choke point on 3.0, so
+  # customers must keep wiring their own Puma before_fork / on_worker_boot
+  # (see README "Rails integration"). On 3.1+ we install the hook globally.
+  Process.singleton_class.prepend(ForkSafety) if Process.respond_to?(:_fork)
 end
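
The `Process._fork` contract that `ForkSafety` relies on can be observed with a standalone module. The sketch below assumes Ruby 3.1+ on a platform that supports `fork`. `ForkHookDemo`, `EVENTS`, and `run_fork_demo` are hypothetical names, not part of quonfig. It records which side of the fork each part of the hook runs on and reports the child's view back through a pipe.

```ruby
# Hypothetical demo module; the gem's real hook is Quonfig::ForkSafety.
module ForkHookDemo
  EVENTS = []

  def _fork
    EVENTS << :before_fork_in_parent  # runs in the parent, before the syscall
    pid = super                       # performs the actual fork
    # pid is 0 in the child and the child's pid in the parent.
    EVENTS << (pid.zero? ? :after_fork_in_child : :back_in_parent)
    pid
  end
end

# Same guard as the diff: only install where Process._fork exists.
Process.singleton_class.prepend(ForkHookDemo) if Process.respond_to?(:_fork)

# Forks once and returns what each process recorded.
def run_fork_demo
  ForkHookDemo::EVENTS.clear
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    writer.write(ForkHookDemo::EVENTS.join(','))
    writer.close
    exit!(0) # skip at_exit handlers in the child
  end

  writer.close
  child_events = reader.read
  reader.close
  Process.wait(pid)

  { parent: ForkHookDemo::EVENTS.dup, child: child_events }
end
```

Because `Kernel#fork` routes through `Process._fork` on Ruby 3.1+, the plain `fork { ... }` call in `run_fork_demo` triggers the hook without any explicit wiring, which is the property the README change depends on.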

data/lib/quonfig/sse_config_client.rb CHANGED

@@ -2,300 +2,611 @@
 require 'base64'
 require 'json'
+require 'net/http'
+require 'uri'
 module Quonfig
+  # Event delivered to on_envelope. +id+ mirrors the SSE +id:+ field and is
+  # consumed by callers that want the server cursor (tests + last-event-id
+  # resume). +data+ is the raw +data:+ payload string. +envelope+ is the
+  # parsed Quonfig::ConfigEnvelope.
+  StreamEvent = Struct.new(:envelope, :id, :data)
+  # SSE client for real-time config delivery from api-delivery-sse.
+  #
+  # Owns its reconnect loop end-to-end. sdk-go, sdk-python, and sdk-node all
+  # reached the same conclusion: the wire format we consume (plain JSON
+  # envelopes in single-line +data:+ frames, no named events, no retry
+  # directives) is simple enough that an SDK-owned loop is clearer than a
+  # library wrapper, and the operator-facing reconnect counter becomes
+  # trivially correct because there is exactly one place that increments it
+  # (qfg-35sm; replaces the ld-eventsource integration from qfg-ie49 +
+  # qfg-cf52, which required log-line scraping and a raise-proof logger
+  # wrapper to observe reconnects through the upstream library).
   class SSEConfigClient
-    # ld-eventsource auto-reconnects on a clean socket EOF (server FIN)
-    # *internally* — it never calls +on_error+ for that case, only for
-    # ECONNREFUSED-style failures (qfg-ie49; see chaos scenario 09). The one
-    # signal it emits for any reconnect is an info-level
-    # "Will retry connection after ..." line, logged once per reconnect attempt
-    # and never on the first connect. Wrapping the logger we hand to
-    # SSE::Client lets the SDK observe those internal reconnects without
-    # touching the data path. This is the only reconnect hook ld-eventsource
-    # >= 2.0 exposes.
-    class ReconnectCountingLogger
-      RECONNECT_SIGNAL = 'Will retry connection after'
-      LEVELS = %i[trace debug info warn error fatal].freeze
-      def initialize(wrapped, &on_reconnect)
-        @wrapped = wrapped
-        @on_reconnect = on_reconnect
-      end
-      # Crash-safe by construction: ld-eventsource calls this logger from
-      # inside its bare-Thread +run_stream+ loop, and several of those call
-      # sites (+connect+, +log_and_dispatch_error+, query-param building) are
-      # NOT wrapped in a rescue. Any exception that escapes a logger call kills
-      # the worker thread with +@stopped+ still false, so +closed?+ never flips
-      # true and the SDK's @retry_thread never reconnects — the SSE stream is
-      # silently wedged forever (qfg-cf52, the chaos scenario 05 flake). Every
-      # step here is therefore independently guarded: a throwing message block,
-      # a throwing on_reconnect callback, or a throwing wrapped logger can
-      # never propagate out of this method.
-      LEVELS.each do |level|
-        define_method(level) do |message = nil, &block|
-          begin
-            message = block.call if message.nil? && block
-          rescue StandardError
-            message = nil
-          end
-          if level == :info && message.to_s.include?(RECONNECT_SIGNAL)
-            begin
-              @on_reconnect.call
-            rescue StandardError
-              nil
-            end
-          end
-          begin
-            @wrapped.public_send(level, message) if @wrapped.respond_to?(level)
-          rescue StandardError
-            nil
-          end
-        end
-      end
-      def level
-        @wrapped&.level
-      end
-      def level=(new_level)
-        @wrapped.level = new_level if @wrapped.respond_to?(:level=)
-      end
-    end
     class Options
-      attr_reader :sse_read_timeout, :seconds_between_new_connection,
-                  :sse_default_reconnect_time, :sleep_delay_for_new_connection_check,
-                  :errors_to_close_connection, :sse_reconnect_reset_interval
+      attr_reader :sse_read_timeout, :sse_connect_timeout,
+                  :sse_initial_reconnect_delay, :sse_max_reconnect_delay
       # sse_read_timeout: 90s = 3x the 30s server heartbeat. A silent socket
-      # stall trips the read deadline within one missed-heartbeat window
-      # rather than the previous 5-minute idle. See plan
-      # `project/plans/sdk-hardening-and-verification.md` Layer 1.
+      # stall trips within one missed-heartbeat window rather than the OS
+      # TCP idle (often hours).
       #
-      # sse_reconnect_reset_interval: 1s (ld-eventsource default is 60s). The
-      # ld-eventsource backoff only resets to the base interval once a
-      # connection has stayed up this long; until then each reconnect doubles
-      # the delay (1s, 2s, 4s, 8s...). With the 60s default, a flapping
-      # connection (chaos scenario 09 — proxy killed every 6s) backs off so
-      # fast the SDK is mid-sleep when the next kill lands and never observes
-      # it. Resetting after 1s of healthy connection mirrors sdk-python, which
-      # resets its backoff on every successful connect (sdk-python/quonfig/
-      # sse.py). A *sustained* outage still backs off exponentially: no
-      # connection succeeds, so `mark_success` is never called and the reset
-      # never triggers (qfg-ie49).
+      # sse_initial_reconnect_delay / sse_max_reconnect_delay: backoff bounds.
+      # Each failed reconnect doubles the delay (with +/-50% jitter) up to the
+      # max. A successful event delivery resets the delay to the initial
+      # value — matches sdk-python's policy. A clean server-initiated FIN is
+      # treated as "not a failure for backoff purposes" because LBs recycling
+      # connections is normal; the reconnect counter still increments.
       def initialize(sse_read_timeout: 90,
-                     seconds_between_new_connection: 5,
-                     sleep_delay_for_new_connection_check: 1,
-                     sse_default_reconnect_time: SSE::Client::DEFAULT_RECONNECT_TIME,
-                     sse_reconnect_reset_interval: 1,
-                     errors_to_close_connection: [HTTP::ConnectionError])
+                     sse_connect_timeout: 10,
+                     sse_initial_reconnect_delay: 1.0,
+                     sse_max_reconnect_delay: 30.0)
         @sse_read_timeout = sse_read_timeout
-        @seconds_between_new_connection = seconds_between_new_connection
-        @sse_default_reconnect_time = sse_default_reconnect_time
-        @sse_reconnect_reset_interval = sse_reconnect_reset_interval
-        @sleep_delay_for_new_connection_check = sleep_delay_for_new_connection_check
-        @errors_to_close_connection = errors_to_close_connection
+        @sse_connect_timeout = sse_connect_timeout
+        @sse_initial_reconnect_delay = sse_initial_reconnect_delay.to_f
+        @sse_max_reconnect_delay = sse_max_reconnect_delay.to_f
       end
     end
     LOG = Quonfig::InternalLogger.new(self)
+    # qfg-i5xv: HTTP status codes the SDK classifies as terminal — these will
+    # not heal by retrying (bad key, revoked permission, missing endpoint).
+    # Anything else (5xx, 429, network errors) stays on the transient path.
+    TERMINAL_HTTP_CODES = [401, 403, 404].freeze
     # +on_error+: optional callable invoked on every SSE error edge. Parent
     # Quonfig::Client wires this to drive @sse_state -> :error so that
-    # +connection_state+ reflects the disconnect (qfg-47c2.27). Without it
-    # the SDK's public health primitive would lie about its own state during
-    # a mid-run socket drop.
+    # +connection_state+ reflects the disconnect (qfg-47c2.27).
     def initialize(prefab_options, config_loader, options = nil, logger = nil, on_error: nil)
       @prefab_options = prefab_options
       @options = options || Options.new
       @config_loader = config_loader
-      @connected = false
       @logger = logger || LOG
       @on_error = on_error
+      @stopped = Concurrent::AtomicBoolean.new(false)
       @restart_total = 0
       @restart_mutex = Mutex.new
+      @on_envelope_error_total = 0
+      @on_envelope_error_mutex = Mutex.new
+      @conn_mutex = Mutex.new
+      @active_http = nil
+      @source_index = -1
+      @last_event_id = nil
     end
-    # qfg-ll6r / qfg-ie49: Layer 1 (SSE) restart counter — counts every
-    # *reconnect*, from two sources:
-    #   1. ld-eventsource's own internal reconnect (clean FIN, read timeout,
-    #      transient errors it doesn't surface) — observed via the
-    #      ReconnectCountingLogger "Will retry connection after" signal.
-    #   2. SDK-driven reconnects in @retry_thread, after a closing error
-    #      (HTTP::ConnectionError) made us close the SSE::Client outright.
-    # These two are mutually exclusive per disconnect, so there is no
-    # double-count. on_error is deliberately NOT a source — ld-eventsource
-    # reconnects internally after most non-closing errors, so counting the
-    # error edge AND the reconnect would double up (qfg-ie49).
-    #
-    # The chaos harness pulls this via Client#worker_restart_total(layer: '1')
-    # so kill-storm scenarios (e.g. scenario 09 — proxy killed 5x in 30s) can
-    # assert restart_total >= 5 even when the kills produce clean FINs that
-    # never reach on_error.
+    # Layer 1 (SSE) reconnect counter. Bumped exactly once per reconnect
+    # attempt — never per error edge, never per envelope. Read by
+    # Quonfig::Client#worker_restart_total(layer: '1') and asserted by chaos
+    # scenario 09 (>= 5 after 5 proxy flaps in 30s).
     def restart_total
       @restart_mutex.synchronize { @restart_total }
     end
-    # Bump the Layer 1 reconnect counter. Called from the ld-eventsource
-    # worker thread (via ReconnectCountingLogger) and from @retry_thread.
-    def count_restart!
-      @restart_mutex.synchronize { @restart_total += 1 }
+    # qfg-m3lk: count of user-supplied on_envelope callback invocations that
+    # raised. Surfaced for operator visibility — a non-zero value here with
+    # restart_total stable means a caller-side listener bug, not a transport
+    # problem. (Pre-fix, those raises propagated into run_loop's rescue and
+    # masqueraded as transport errors, causing reconnect storms.)
+    def on_envelope_error_total
+      @on_envelope_error_mutex.synchronize { @on_envelope_error_total }
     end
-    def close
-      @retry_thread&.kill
-      @client&.close
+    def start(&on_envelope)
+      return if @prefab_options.sse_api_urls.nil? || @prefab_options.sse_api_urls.empty?
+      @worker = Thread.new { run_loop(&on_envelope) }
     end
-    def start(&load_configs)
-      if @prefab_options.sse_api_urls.empty?
-        @logger.debug 'No SSE api_urls configured'
-        return
+    # Shut down. Interrupts the in-flight stream by closing the underlying
+    # socket from this thread — the worker thread observes the resulting
+    # IOError, sees @stopped == true, and exits cleanly.
+    def close
+      @stopped.make_true
+      @conn_mutex.synchronize do
+        begin
+          @active_http&.finish
+        rescue StandardError
+          # already closed / never started — idempotent
+        end
+        @active_http = nil
       end
+      @worker&.join(2)
+      @worker = nil
+    end
+    # Public so tests can assert the headers shape. Body of the request is
+    # always empty; this is the full set api-delivery-sse sees.
+    def headers
+      auth = "1:#{@prefab_options.sdk_key}"
+      auth_string = Base64.strict_encode64(auth)
+      h = {
+        'Authorization' => "Basic #{auth_string}",
+        'Accept' => 'text/event-stream',
+        'Cache-Control' => 'no-cache',
+        'X-Quonfig-SDK-Version' => "ruby-#{Quonfig::VERSION}"
+      }
+      cursor = current_cursor
+      h['Last-Event-Id'] = cursor if cursor
+      h
+    end
-      @client = connect(&load_configs)
+    # Compute a Last-Event-ID for the next request. Three sources, in
+    # priority order:
+    #   1. @last_event_id  -- set by the most recent event we processed
+    #   2. config_loader.version  -- string ETag from last HTTP fetch
+    #   3. config_loader.highwater_mark  -- legacy numeric cursor
+    # Returns nil if no prior state exists.
+    def current_cursor
+      return @last_event_id if @last_event_id && !@last_event_id.empty?
-      closed_count = 0
+      if @config_loader.respond_to?(:version)
+        v = @config_loader.version
+        return v if v.is_a?(String) && !v.empty?
+      end
-      @retry_thread = Thread.new do
-        loop do
-          sleep @options.sleep_delay_for_new_connection_check
+      if @config_loader.respond_to?(:highwater_mark)
+        hw = @config_loader.highwater_mark
+        return hw.to_s if hw.is_a?(Numeric) && hw.positive?
+        return hw if hw.is_a?(String) && !hw.empty?
+      end
-          next unless @client.closed?
+      nil
+    end
-          closed_count += @options.sleep_delay_for_new_connection_check
+    private
-          next unless closed_count > @options.seconds_between_new_connection
+    # Long-lived reconnect loop. One iteration = one connect attempt. Bumps
+    # restart_total *before* every retry — so the counter answers "how many
+    # times have we reconnected after a drop" rather than "how many connect
+    # attempts have occurred." The first attempt is not a restart.
+    #
+    # qfg-tj18: the body is wrapped in
+    # +Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)+ so a
+    # watchdog raise that's already been queued (the watchdog's mutex covers
+    # the *decision* to fire but cannot un-queue a delivered raise) lands
+    # only at a blocking-IO checkpoint. Inside stream_once we explicitly
+    # re-enable +:immediate+ around the +read_body+ block where we *do*
+    # want the raise to wake the read. A per-iteration paranoid rescue
+    # catches any late-landing raise that escapes the inner +rescue
+    # StandardError+ (e.g. lands inside +interruptible_sleep+ between
+    # iterations) so the worker thread never silently dies.
+    def run_loop(&on_envelope)
+      Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking) do
+        delay = @options.sse_initial_reconnect_delay
+        first_attempt = true
+        until @stopped.value
+          begin
+            unless first_attempt
+              increment_restart!
+              interruptible_sleep(jittered(delay))
+              break if @stopped.value
+            end
+            first_attempt = false
-          closed_count = 0
-          @logger.debug 'Reconnecting SSE client'
-          # SDK-driven reconnect: a closing error (HTTP::ConnectionError)
-          # closed the previous SSE::Client, so ld-eventsource's own
-          # reconnect loop has exited and won't emit the "Will retry" signal.
-          # Count it here instead (qfg-ie49).
-          count_restart!
-          @client = connect(&load_configs)
+            connected_at_least_once = false
+            begin
+              stream_once do |event|
+                connected_at_least_once = true
+                # Persist the most recent id so the next reconnect resumes
+                # from there via Last-Event-Id. Updated *before* the user
+                # callback runs so a raising listener still advances the
+                # cursor — the event was delivered to us, the bug is on the
+                # caller side.
+                @last_event_id = event.id if event.id
+                # qfg-m3lk: callback exceptions are isolated. A buggy
+                # listener must not look like a transport error and trigger
+                # a reconnect.
+                invoke_on_envelope_safely(on_envelope, event)
+                # A connection healthy enough to deliver a real envelope
+                # earns a reset of the backoff. Sustained outages never
+                # reach this branch (no event ever delivered) so the
+                # exponential growth still holds.
+                delay = @options.sse_initial_reconnect_delay
+              end
+            rescue StandardError => e
+              handle_error(e) unless @stopped.value
+            end
+            # Backoff only grows on failed connect attempts. A server-
+            # initiated clean FIN after a healthy session (normal LB
+            # recycling) reuses the same delay — punishing it would make
+            # us look broken under benign rolling restarts. Matches
+            # sdk-go's `connectedOK` distinction.
+            delay = [delay * 2, @options.sse_max_reconnect_delay].min unless connected_at_least_once
+          rescue SSEReadDeadlineExceeded => e
+            # Paranoid backstop (qfg-tj18). A watchdog raise that landed
+            # outside +stream_once+ — typically in +interruptible_sleep+
+            # — must not kill the worker thread. We log loudly and let the
+            # +until+ loop carry on.
+            @logger.error "SSE watchdog late-raise contained: #{e.inspect}; resuming loop"
+          end
         end
       end
+    ensure
+      register_active(nil)
     end
-    def connect(&load_configs)
-      url = "#{source}/api/v2/sse/config"
+    # Opens one SSE request and yields each parsed event until the stream
+    # ends (clean FIN, error, or stop). Raises on transport errors so the
+    # caller can apply backoff. Clean FIN returns without raising.
+    #
+    # A watchdog thread closes the socket if no bytes arrive within
+    # +sse_read_timeout+. Net::HTTP#read_timeout is NOT reliable for the
+    # streaming +read_body do |chunk|+ form — the underlying BufferedIO
+    # reads bypass it in practice (a silent server stall blocks indefinitely
+    # against a configured deadline). sdk-go and sdk-node hit the same
+    # gotcha and solve it the same way: per-chunk reset, async close on
+    # expiry (chaos scenario 02 — sse_silent_stall).
+    def stream_once(&block)
+      url = "#{current_url}/api/v2/sse/config"
       cursor = current_cursor
       @logger.debug "SSE Streaming Connect to #{url} start_at #{cursor.inspect}"
-      # Wrap the ld-eventsource logger so internal reconnects (clean FIN,
-      # read-timeout, transient errors) bump restart_total — they never reach
-      # on_error (qfg-ie49).
-      sse_logger = ReconnectCountingLogger.new(
-        Quonfig::InternalLogger.new(SSE::Client)
-      ) { count_restart! }
-      SSE::Client.new(url,
-                      headers: headers,
-                      read_timeout: @options.sse_read_timeout,
-                      reconnect_time: @options.sse_default_reconnect_time,
-                      reconnect_reset_interval: @options.sse_reconnect_reset_interval,
-                      last_event_id: cursor,
-                      logger: sse_logger) do |client|
-        client.on_event do |event|
-          if event.data.nil? || event.data.empty?
-            @logger.error "SSE Streaming Error: Received empty data for url #{url}"
-            client.close
-            next
+      uri = URI(url)
+      http = Net::HTTP.new(uri.host, uri.port)
+      http.use_ssl = (uri.scheme == 'https')
+      http.open_timeout = @options.sse_connect_timeout
+      # Keep Net::HTTP's read_timeout as a backstop for the header read
+      # (where it does apply reliably). The watchdog covers the body path.
+      http.read_timeout = @options.sse_read_timeout
+      req = Net::HTTP::Get.new(uri.request_uri, headers)
+      http.start
+      register_active(http)
+      watchdog = ReadDeadlineWatchdog.new(
+        worker: Thread.current, deadline_s: @options.sse_read_timeout,
+        stopped: @stopped, logger: @logger
+      )
+      watchdog.start
+      begin
+        http.request(req) do |resp|
+          code = resp.code.to_i
+          if TERMINAL_HTTP_CODES.include?(code)
+            # qfg-i5xv: 401/403/404 will not heal by retrying — bad key,
+            # revoked permission, or wrong endpoint. Mark stopped *before*
+            # invoking on_error so the loop's terminal-error branch is
+            # already locked in if the parent callback inspects state, and
+            # so the inner rescue's `handle_error(e) unless @stopped.value`
+            # guard suppresses a second on_error edge.
+            err = SSEHTTPTerminalError.new(code)
+            @logger.error "SSE Streaming Terminal Error: HTTP #{code} for url #{url}; will not retry"
+            @stopped.make_true
+            invoke_on_error(err)
+            raise err
           end
-          begin
-            parsed = JSON.parse(event.data)
-          rescue JSON::ParserError => e
-            @logger.error "SSE Streaming Error: Failed to parse JSON for url #{url}: #{e.message}"
-            client.close
-            next
+          if code != 200
+            err = SSEHTTPStatusError.new(code)
+            @logger.error "SSE Streaming Error: HTTP #{code} for url #{url}"
+            invoke_on_error(err)
+            raise err
           end
-          envelope = Quonfig::ConfigEnvelope.new(
-            configs: parsed['configs'] || [],
-            meta: parsed['meta'] || {}
-          )
-          load_configs.call(envelope, event, :sse)
+          parser = EventParser.new
+          # qfg-tj18: run_loop wraps the body in +:on_blocking+ which
+          # *would* still deliver during read_body (read_body is a
+          # blocking IO call), but be explicit: we want the watchdog raise
+          # to land here without ambiguity.
+          Thread.handle_interrupt(SSEReadDeadlineExceeded => :immediate) do
+            resp.read_body do |chunk|
+              watchdog.reset!
+              break if @stopped.value
+              parser.feed(chunk, &block)
+            end
+          end
+          # read_body returned cleanly — either a server-initiated FIN, or
+          # the watchdog closed the socket on a silent stall. Either way,
+          # the outer loop will reconnect and bump restart_total on the
+          # next iteration.
+          @logger.debug "SSE stream ended for url #{url}"
+        end
+      ensure
+        watchdog.stop
+        register_active(nil)
+        begin
+          http.finish if http.started?
+        rescue StandardError
+          # already closed
         end
+      end
+    end
-        client.on_error do |error|
-          # SSL "unexpected eof" is expected when SSE sessions timeout normally
-          if error.is_a?(OpenSSL::SSL::SSLError) && error.message.include?('unexpected eof')
-            @logger.debug "SSE Streaming: Connection closed (expected timeout) for url #{url}"
-          else
-            @logger.error "SSE Streaming Error: #{error.inspect} for url #{url}"
-          end
+    # Track the active connection so close() can interrupt a blocked
+    # read_body from another thread. Guarded by @conn_mutex.
+    def register_active(http)
+      @conn_mutex.synchronize { @active_http = http }
+    end
-          # qfg-ie49: restart_total is NOT bumped here. ld-eventsource
-          # auto-reconnects after most non-closing errors, and that reconnect
-          # is already counted via ReconnectCountingLogger; bumping here too
-          # would double-count. For closing errors (HTTP::ConnectionError) the
-          # reconnect is counted in @retry_thread instead. on_error's job is
-          # purely to notify the parent client of the disconnect edge.
-          # Notify the parent client BEFORE deciding whether to close — every
-          # error edge is a disconnect signal as far as @sse_state goes, even
-          # if we let the underlying SSE library handle reconnect itself.
-          # qfg-47c2.27
-          if @on_error
-            begin
-              @on_error.call(error)
-            rescue StandardError => e
-              @logger.error "SSE on_error callback raised: #{e.inspect}"
-            end
-          end
+    def increment_restart!
+      @restart_mutex.synchronize { @restart_total += 1 }
+    end
-          if @options.errors_to_close_connection.any? { |klass| error.is_a?(klass) }
-            @logger.debug "Closing SSE connection for url #{url}"
-            client.close
-          end
-        end
+    def handle_error(error)
+      @logger.error "SSE Streaming Error: #{error.inspect}"
+      invoke_on_error(error)
+    end
+    # qfg-m3lk: rescue StandardError (NOT Exception) so SystemExit /
+    # Interrupt / SignalException still escape — Ctrl-C inside a customer
+    # callback must still kill the process. StandardError is the right
+    # boundary for "the caller's listener has a bug".
+    def invoke_on_envelope_safely(on_envelope, event)
+      on_envelope.call(event.envelope, event, :sse)
+    rescue StandardError => e
+      @on_envelope_error_mutex.synchronize { @on_envelope_error_total += 1 }
+      bt = (e.backtrace || []).first(5).join("\n  ")
+      @logger.error "SSE on_envelope callback raised: #{e.class}: #{e.message}\n  #{bt}"
+    end
+    def invoke_on_error(error)
+      return unless @on_error
+      begin
+        @on_error.call(error)
+      rescue StandardError => e
+        @logger.error "SSE on_error callback raised: #{e.inspect}"
       end
     end
-    def headers
-      auth = "1:#{@prefab_options.sdk_key}"
-      auth_string = Base64.strict_encode64(auth)
-      {
-        'Authorization' => "Basic #{auth_string}",
-        'Accept' => 'text/event-stream',
-        'X-Quonfig-SDK-Version' => "ruby-#{Quonfig::VERSION}"
-      }
+    # +/-50% jitter — caps thundering-herd amplitude after a partition heal.
+    # Identical shape to ld-eventsource's Backoff#next_interval (and
+    # sdk-go's runLoop jitter) so we don't surprise operators familiar with
+    # those.
+    def jittered(delay)
+      (delay / 2) + rand(delay / 2.0)
     end
-    def source
-      @source_index = @source_index.nil? ? 0 : @source_index + 1
+    # Sleep with interrupt: chunks the sleep so close() during a long
+    # backoff doesn't block shutdown for tens of seconds.
+    def interruptible_sleep(seconds)
+      deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
+      until @stopped.value
+        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
+        break if remaining <= 0
-      @source_index = 0 if @source_index >= @prefab_options.sse_api_urls.size
+        sleep([remaining, 0.1].min)
+      end
+    end
-      @prefab_options.sse_api_urls[@source_index]
+    # Rotate through configured SSE URLs. The same rotation rule the
+    # previous implementation used, preserved so multi-region failover
+    # behavior is unchanged.
+    def current_url
+      urls = @prefab_options.sse_api_urls
+      @source_index = (@source_index + 1) % urls.size
+      urls[@source_index]
     end
-    # Compute a Last-Event-ID to resume the stream from. Three sources, in
-    # priority order:
-    #   1. config_loader.version  -- string ETag from last HTTP fetch (new path)
-    #   2. config_loader.highwater_mark -- legacy numeric cursor
-    #   3. nil -- no prior state; stream from HEAD
-    def current_cursor
-      if @config_loader.respond_to?(:version)
-        v = @config_loader.version
-        return v if v.is_a?(String) && !v.empty?
+    # Internal: HTTP-status sentinel error for non-200 SSE responses. Surfaces
+    # the status code through #message so parent on_error callbacks can log
+    # meaningfully without depending on ld-eventsource's error hierarchy.
+    class SSEHTTPStatusError < StandardError
+      attr_reader :status_code
+      def initialize(status_code)
+        @status_code = status_code
+        super("HTTP #{status_code}")
       end
+    end
-      if @config_loader.respond_to?(:highwater_mark)
-        hw = @config_loader.highwater_mark
-        return hw.to_s if hw.is_a?(Numeric) && hw.positive?
-        return hw if hw.is_a?(String) && !hw.empty?
+    # qfg-i5xv: terminal HTTP failures the SDK will not retry. 401 = bad key,
+    # 403 = revoked workspace permission, 404 = wrong endpoint / missing
+    # workspace. A subclass of SSEHTTPStatusError so existing on_error
+    # callbacks that only check `is_a?(SSEHTTPStatusError)` keep working,
+    # while customers that want to distinguish (alerting, OpenFeature
+    # provider error events) can dispatch on the subclass.
+    class SSEHTTPTerminalError < SSEHTTPStatusError; end
+    # Raised by the watchdog into the worker thread when the per-chunk
+    # read deadline elapses. Caught by run_loop's rescue, indistinguishable
+    # from any other transport error for backoff/restart purposes.
+    class SSEReadDeadlineExceeded < StandardError; end
+    # Background watchdog that interrupts the worker thread if no chunk
+    # arrives within +deadline_s+ seconds. Uses Thread#raise — the only
+    # reliable cross-platform way to unblock a Ruby thread blocked in
+    # +Net::HTTP+'s body-read on macOS. (Closing or shutting down the
+    # underlying socket from another thread does NOT wake the reader on
+    # macOS; the kernel discards future reads but the in-flight syscall
+    # stays blocked until something else trips. sdk-go and sdk-node solve
+    # the equivalent problem with context cancellation / AbortController,
+    # which Ruby lacks at the IO layer.) Thread#raise is essentially what
+    # +Timeout.timeout+ does internally; using it directly avoids
+    # Timeout.timeout's sketchy reputation around ensure blocks.
+    class ReadDeadlineWatchdog
+      POLL_INTERVAL = 0.25
+      def initialize(worker:, deadline_s:, stopped:, logger:)
+        @worker = worker
+        @deadline_s = deadline_s
+        @stopped = stopped
+        @logger = logger
+        @active = true
+        # Mutex covers @active AND the decision to fire Thread#raise. stop()
+        # holds the mutex when flipping @active false, so a +stop+ that
+        # arrives mid-deadline-check cannot lose the race against the
+        # watchdog's @worker.raise call (which would inject a spurious
+        # SSEReadDeadlineExceeded into the worker thread right after a
+        # clean read_body return).
+        @mutex = Mutex.new
+        @last_read_at = Concurrent::AtomicReference.new(Process.clock_gettime(Process::CLOCK_MONOTONIC))
       end
-      nil
+      def start
+        @thread = Thread.new { watch }
+      end
+      def reset!
+        @last_read_at.set(Process.clock_gettime(Process::CLOCK_MONOTONIC))
+      end
+      def stop
+        @mutex.synchronize { @active = false }
+        @thread&.join(1)
+        @thread = nil
+      end
+      private
+      def watch
+        loop do
+          sleep POLL_INTERVAL
+          break unless @mutex.synchronize { @active } && !@stopped.value
+          idle = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @last_read_at.value
+          next if idle < @deadline_s
+          fired = @mutex.synchronize do
+            next false unless @active && !@stopped.value
+            @logger.debug "SSE read deadline exceeded (#{idle.round(1)}s idle >= #{@deadline_s}s); interrupting worker"
+            @worker.raise(SSEReadDeadlineExceeded.new("SSE read deadline #{@deadline_s}s exceeded"))
+            true
+          end
+          break if fired
+        end
+      rescue StandardError => e
+        # Watchdog must never crash the SDK. Worst case we silently fall
+        # back to Net::HTTP's own (unreliable) read_timeout.
+        @logger.debug "SSE watchdog error: #{e.inspect}"
+      end
+    end
+    # Streaming SSE parser. Accepts byte chunks (any encoding), yields one
+    # Quonfig::StreamEvent per complete event. Tolerates:
+    #   - chunks that split a UTF-8 multi-byte character (buffer in 8-bit,
+    #     transcode whole lines)
+    #   - chunks that split a line mid-way
+    #   - any of CR / LF / CRLF as line terminators
+    #   - +data:+, +data: + (optional space per SSE spec)
+    #   - +:comment+ lines (keepalives — ignored)
+    #   - multi-line +data:+ (concatenated with +\n+, per spec)
+    # Ignores +event:+ and +retry:+ — api-delivery does not emit them and the
+    # Quonfig wire contract does not honor reconnect-time directives.
+    # Malformed +data:+ JSON is logged and skipped; one bad event does not
+    # tear down the stream.
+    class EventParser
+      def initialize(logger: nil)
+        @logger = logger
+        @reader = LineReader.new
+        @data = +''
+        @have_data = false
+        @id = nil
+      end
+      def feed(chunk)
+        @reader.feed(chunk) do |line|
+          if line.empty?
+            event = flush
+            yield event if event
+          elsif line.start_with?(':')
+            # comment / keepalive — ignore
+          else
+            process_field(line)
+          end
+        end
+      end
+      private
+      def process_field(line)
+        idx = line.index(':')
+        return unless idx
+        name = line[0...idx]
+        rest = line[(idx + 1)..]
+        rest = rest[1..] if rest.start_with?(' ')
+        case name
+        when 'data'
+          if @have_data
+            @data << "\n" << rest
+          else
+            @data = rest
+            @have_data = true
+          end
+        when 'id'
+          @id = rest unless rest.include?("\x00")
+          # event: / retry: are intentionally ignored
+        end
+      end
+      def flush
+        return nil unless @have_data
+        data = @data
+        id = @id
+        @data = +''
+        @have_data = false
+        # NB: @id persists across events — the SSE spec says last-event-id
+        # is sticky until overwritten. Matches ld-eventsource.
+        begin
+          parsed = JSON.parse(data)
+        rescue JSON::ParserError => e
+          (@logger || LOG).error "SSE Streaming Error: malformed JSON: #{e.message}"
+          return nil
+        end
+        envelope = Quonfig::ConfigEnvelope.new(
+          configs: parsed['configs'] || [],
+          meta: parsed['meta'] || {}
+        )
+        StreamEvent.new(envelope, id, data)
+      end
+    end
+    # Byte-level line reader. Accepts arbitrary chunks, yields one UTF-8
+    # line per call to the block. Terminator-stripped (CR / LF / CRLF
+    # supported). Modeled on ld-eventsource's BufferedLineReader — same
+    # invariants: split bytes-not-chars while scanning, force-encode to
+    # UTF-8 only once a complete line is sliced out, so a multi-byte
+    # character spanning two chunks does not raise Encoding::CompatibilityError.
+    class LineReader
+      def initialize
+        @buffer = +''.b
+        @last_was_cr = false
+      end
+      def feed(chunk)
+        @buffer << chunk.b
+        loop do
+          idx = @buffer.index(/[\r\n]/)
+          break if idx.nil?
+          ch = @buffer[idx]
+          if idx.zero? && ch == "\n" && @last_was_cr
+            # Dangling LF of a CRLF pair split across chunks — consume and skip.
+            @last_was_cr = false
+            @buffer.slice!(0, 1)
+            next
+          end
+          line = @buffer[0, idx].force_encoding('UTF-8')
+          consume = idx + 1
+          @last_was_cr = false
+          if ch == "\r"
+            if consume == @buffer.bytesize
+              # CR at end of buffer — could be CRLF split across feeds.
+              @last_was_cr = true
+            elsif @buffer[consume] == "\n"
+              consume += 1
+            end
+          end
+          @buffer.slice!(0, consume)
+          yield line
+        end
+      end
     end
   end
 end
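
The reconnect policy described in the `run_loop` and `jittered` comments above (double the delay after a failed attempt, cap it, reset after a delivered event, then jitter the sleep) can be sketched in isolation. This is a minimal illustration, not the gem's code: the method names `jittered_delay` and `next_delay` and the constants `INITIAL_DELAY` and `MAX_DELAY` are hypothetical, and the 1 s / 30 s values are assumptions rather than the gem's defaults. The sketch scales a no-argument `rand` by hand because `Kernel#rand` truncates a Float maximum to an Integer.

```ruby
# Hypothetical names and values for illustration; not part of the quonfig gem.
INITIAL_DELAY = 1.0   # assumed initial reconnect delay, in seconds
MAX_DELAY = 30.0      # assumed cap, in seconds

# Returns a sleep time in [delay / 2, delay). A Random instance is passed in
# so callers can seed it; rand with no argument yields a Float in [0, 1).
def jittered_delay(delay, rng = Random.new)
  half = delay / 2.0
  half + (rng.rand * half)
end

# A session that delivered at least one event resets the delay; a failed
# attempt doubles it, up to the cap.
def next_delay(delay, delivered_event:, initial: INITIAL_DELAY, max: MAX_DELAY)
  return initial if delivered_event

  [delay * 2, max].min
end

# Delay schedule for six consecutive failed attempts.
delays = []
delay = INITIAL_DELAY
6.times do
  delays << delay
  delay = next_delay(delay, delivered_event: false)
end
```

Keeping the jitter separate from the growth rule means the cap applies to the un-jittered delay, so the longest sleep never exceeds `MAX_DELAY`.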

data/lib/quonfig/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Quonfig
-  VERSION = '0.0.15'
+  VERSION = '0.0.16'
 end

data/lib/quonfig.rb CHANGED

@@ -17,7 +17,7 @@ require 'concurrent/atomics'
 require 'concurrent'
 require 'faraday'
 require 'openssl'
-require 'ld-eventsource'
+require 'net/http'
 require 'quonfig/internal_logger'
 require 'quonfig/time_helpers'
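
With `ld-eventsource` no longer required, the SDK parses the `data:` frames itself. The following is a simplified sketch of that idea, not the gem's `EventParser`: the class name `TinySSEParser` is hypothetical, it relies on `String#chomp` for line endings (so LF and CRLF are handled but a lone CR is not), and it ignores `id:` fields. It buffers raw bytes, slices out complete lines, and yields one parsed JSON payload per blank-line-terminated event.

```ruby
require 'json'

# Hypothetical class for illustration; the gem's own parser is EventParser.
class TinySSEParser
  def initialize
    @buffer = +''.b    # binary buffer, so a split multi-byte character is safe
    @data_lines = []
  end

  # Feed a raw chunk; yields one parsed Hash per complete event.
  def feed(chunk)
    @buffer << chunk.b
    while (idx = @buffer.index("\n"))
      # Slice out one complete line and only then treat it as UTF-8.
      line = @buffer.slice!(0, idx + 1).chomp.force_encoding('UTF-8')
      if line.empty?
        # A blank line terminates the event.
        yield JSON.parse(@data_lines.join("\n")) unless @data_lines.empty?
        @data_lines = []
      elsif line.start_with?('data:')
        @data_lines << line.delete_prefix('data:').delete_prefix(' ')
      end
      # Comment lines (":...") and other fields are ignored in this sketch.
    end
  end
end

# Usage: the JSON payload is split across two chunks, followed by a keepalive.
events = []
parser = TinySSEParser.new
['data: {"meta":', "{\"version\":\"v1\"}}\n", ": keepalive\n\n"].each do |chunk|
  parser.feed(chunk) { |event| events << event }
end
```

Unlike the gem's parser, this sketch lets `JSON::ParserError` propagate; skipping a malformed event instead would need a `rescue` around the `JSON.parse` call.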

data/quonfig.gemspec CHANGED

@@ -31,5 +31,4 @@ Gem::Specification.new do |s|
   s.add_dependency 'activesupport', '>= 4'
   s.add_dependency 'concurrent-ruby', '~> 1.0', '>= 1.0.5'
   s.add_dependency 'faraday', '>= 1.0'
-  s.add_dependency 'ld-eventsource', '>= 2.0'
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: quonfig
 version: !ruby/object:Gem::Version
-  version: 0.0.15
+  version: 0.0.16
 platform: ruby
 authors:
 - Jeff Dwyer
@@ -58,20 +58,6 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '1.0'
-- !ruby/object:Gem::Dependency
-  name: ld-eventsource
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '2.0'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '2.0'
 description: Quonfig — feature flags and live config, stored as files in git.
 email: jeff@quonfig.com
 executables: []