quonfig 0.0.15 → 0.0.16

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: e4e037ad01a35ca5a3fb3ddcc30ad6b0dab78ad82e4908a4a8ce9e8bab6cab40
4
- data.tar.gz: 8bcccb03befbab5f1fbed1cbae867ce970498ac0081c92e24db7d8eb899d2faa
3
+ metadata.gz: 6f167f3b60db07394dc7c49b85c3dbc196e0b5c82f3426b35695c0f212339b8b
4
+ data.tar.gz: 4b79e1196c4625359943255a348d907c28865a5cd85432ac464737406d7a6169
5
5
  SHA512:
6
- metadata.gz: 9d4abdeaeaaad881e5f28cb9a653715dd8b1838ba33cc38b6b1f08db5f729173d5eadbf2afebfb6e3ca3a379f0354ab453fafd760a1fd61d13c3efef60ad0aee
7
- data.tar.gz: 890131a3f75092f1b846ee4ca46c1dc20702b1effc3db5803443905d8a8571a33b672a691a18bbb0c3ad8471c5db72006a745f50ac0e919bf0997b49cf202045
6
+ metadata.gz: 5aa3a23774245bf31752e4c9918de8bf37cc865e15b6ed160b222181d805a0fe477064cc5cf27dc6810b87cdd1c250f8558b650c98fd5e7a354c3e2e70090c53
7
+ data.tar.gz: 82c4561817b40e4dd0ecfd1b5267e6a5f41ea2e774c2d2537400aaf04886eb579807da457f6964915a065a32dc7beea043a2797d83ff04dba3c9fb4e46c39cb3
data/CHANGELOG.md CHANGED
@@ -1,5 +1,13 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.0.16 - 2026-05-15
4
+
5
+ - **Feat (SSE): replace `ld-eventsource` with an SDK-owned reconnect loop (qfg-35sm).** sdk-ruby was the outlier among the four backend SDKs — sdk-go, sdk-node, and sdk-python all own their reconnect loop, only sdk-ruby handed it off to a library and scraped its log output to observe reconnects. The wire format we actually consume (plain JSON envelopes in single-line `data:` frames, no named events, no retry directives) is trivial enough that an SDK-owned loop is clearer than the library wrapper. New `Quonfig::SSEConfigClient` (~520 LoC, `lib/quonfig/sse_config_client.rb`) handles connect/parse/reconnect end-to-end. `restart_total` is now incremented at exactly one site under a mutex — verifiable, not log-scraped. `ld-eventsource` and the transitive `http` gem are removed from the gemspec. `ReconnectCountingLogger` and the `sse_reconnect_reset_interval` option (both 0.0.15-era defensive scaffolding around upstream behavior) are deleted — the bugs they defended against don't exist when the SDK owns the loop. Chaos: 10/10 in a 36-min run (scenarios 02 silent-stall, 05 sse-down-fallback, 09 flapping kill-storm).
6
+ - **Fix (SSE): contain watchdog `Thread#raise` with `Thread.handle_interrupt` (qfg-tj18).** The new watchdog fires `Thread#raise(SSEReadDeadlineExceeded)` into the worker on a silent stall — the only reliable cross-platform way to unblock a Ruby thread blocked in `Net::HTTP`'s body-read on macOS. The decision to fire is mutex-guarded against `stop()`, but the raise itself is delivered at the worker's next interrupt checkpoint, which could be anywhere in the call stack. `run_loop`'s body now runs under `Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)` so a late-landing raise can only land inside a blocking call; the `read_body` block explicitly switches to `:immediate`. A paranoid backstop `rescue` outside the until-`@stopped` loop ensures an escaped raise can never silently kill the worker.
7
+ - **Fix (SSE): isolate `on_envelope` callback exceptions (qfg-m3lk).** A buggy user-supplied listener that raised during envelope delivery used to propagate out of `read_body`, get caught by `run_loop` as a transport error, bump `restart_total`, and reconnect — a perpetual reconnect storm at api-delivery-sse driven by a customer code bug. The callback is now wrapped in `begin/rescue StandardError` at the invocation site; exceptions are logged with class + message + backtrace sample and the stream continues uninterrupted. `Interrupt` and `SystemExit` are deliberately not caught so `Ctrl-C` still works.
8
+ - **Fix (SSE): classify 401/403/404 as terminal errors (qfg-i5xv).** Non-200 responses used to be treated identically — `SSEHTTPStatusError` raised, `run_loop` rescued, `restart_total` bumped, backoff, retry, forever. For a bad SDK key (401) or revoked workspace (403) that was wasted load on api-delivery-sse with no recovery path short of a customer redeploy. New `SSEHTTPTerminalError` sentinel for 401/403/404; `run_loop` catches it, invokes `on_error`, exits the loop without bumping `restart_total`. Parent `Quonfig::Client` surfaces a terminal `:sse_terminal_failure` state distinct from transient `:error`. 429 and 5xx still retry.
9
+ - **Feat (fork): install `Process._fork` hook so SSE auto-restarts after fork (qfg-ryov).** Ruby threads do not survive `fork(2)`. Customers initializing `Quonfig::Client` in the Puma master (the `preload_app! true` / Rails `config.eager_load = true` convention) used to silently lose SSE in every worker child. New `Quonfig::ForkSafety` module prepends `Process._fork` and fans out across all live `Quonfig::Client` instances (tracked in an `ObjectSpace::WeakMap`): in the parent before the syscall, threaded components (SSE worker, polling supervisor, telemetry reporter) are torn down; in the child after the syscall, they are rebuilt. `@stopped` is preserved so a `stop()`-ed client stays stopped across fork. Covers `Process.fork` / `Kernel#fork`; `Process.spawn` and `system("...")` exec a new program so in-process state doesn't apply. Ruby 3.0 lacks `Process._fork` and is documented as requiring manual `before_fork` / `on_worker_boot` wiring.
10
+
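The terminal-versus-transient split described in the qfg-i5xv entry can be sketched as a small classifier. This is an illustration only: `classify_sse_status` and its return symbols are hypothetical names, not part of the gem, and the mapping assumes only what the changelog states (401/403/404 are terminal, 429 and 5xx still retry).

```ruby
# Hypothetical helper, not the gem's API: maps an HTTP status code to the
# action a reconnect loop would take.
def classify_sse_status(code)
  case code
  when 200 then :stream             # healthy stream, start reading frames
  when 401, 403, 404 then :terminal # retrying cannot heal these
  else :retry                       # 429, 5xx, and other non-200 responses
  end
end
```

A loop built on this would call `on_error` and exit on `:terminal` without touching the restart counter, and would bump the counter and back off on `:retry`.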
3
11
  ## 0.0.15 - 2026-05-15
4
12
 
5
13
  - **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
data/README.md CHANGED
@@ -247,15 +247,41 @@ not want to inherit across a `fork(2)`. Forked threads in the child process
247
247
  are dead — the SSE socket is held open by a thread that no longer exists, and
248
248
  the child silently stops receiving live updates.
249
249
 
250
- Use `Quonfig::Client#fork` (or `Quonfig.fork` if you use the module-level
251
- singleton) in any process that fork-spawns workers. It returns a fresh client
252
- configured for the child: a new `ConfigStore`, a new SSE subscription, and
253
- suppressed telemetry double-counting (`Options#is_fork` is set to `true`).
250
+ **On Ruby 3.1+ the SDK installs a `Process._fork` hook at load time** that
251
+ automatically tears down threaded components in the parent and restarts them
252
+ in the child. This covers any `Process.fork` / `Kernel#fork` path: Puma's
253
+ clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Spring, and
254
+ manual `fork { ... }` calls. **No customer wiring is required.**
255
+
256
+ Caveats:
257
+
258
+ - Ruby 3.0 has no hookable choke point — fall back to manual wiring (below).
259
+ - `system("fork-and-exec ...")` and `Process.spawn` are not covered (they do
260
+ not go through `Process._fork`), but those execute a new program, so the
261
+ in-process SSE state is moot.
262
+ - The hook tears down the SSE/polling/telemetry threads in the parent before
263
+ fork (so the child does not inherit a live socket fd) and does **not**
264
+ auto-restart the parent. This mirrors the Puma master case: the master no
265
+ longer serves requests, so it does not need a live SSE connection. If you
266
+ have a non-Puma topology where the parent must keep streaming after fork,
267
+ call `Quonfig.instance.after_fork_in_child` manually in the parent after
268
+ the fork returns.
254
269
 
255
270
  ### Puma (clustered mode)
256
271
 
272
+ With the automatic fork hook, the typical Puma config needs **no Quonfig
273
+ lifecycle wiring** — initialize in your Rails initializer and let the hook
274
+ handle the rest:
275
+
257
276
  ```ruby
258
- # config/puma.rb
277
+ # config/initializers/quonfig.rb
278
+ Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
279
+ ```
280
+
281
+ If you're on Ruby 3.0 (no `Process._fork`), wire the legacy hooks manually:
282
+
283
+ ```ruby
284
+ # config/puma.rb (Ruby 3.0 only)
259
285
  before_fork do
260
286
  Quonfig.instance.stop # close the master's SSE before forking
261
287
  end
@@ -265,18 +291,18 @@ on_worker_boot do
265
291
  end
266
292
  ```
267
293
 
268
- If you initialize Quonfig lazily (in a Rails initializer) and run Puma in
269
- single mode (no clustering), no fork hook is needed.
270
-
271
294
  ### Sidekiq
272
295
 
273
- Sidekiq's parent process forks workers. Wire the same lifecycle:
296
+ On Ruby 3.1+ the automatic fork hook covers Sidekiq workers too; no
297
+ `configure_server` wiring required.
298
+
299
+ On Ruby 3.0:
274
300
 
275
301
  ```ruby
276
302
  # config/initializers/quonfig.rb
277
303
  Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
278
304
 
279
- # config/initializers/sidekiq.rb
305
+ # config/initializers/sidekiq.rb (Ruby 3.0 only)
280
306
  Sidekiq.configure_server do |config|
281
307
  config.on(:startup) { Quonfig.fork if Process.ppid != 1 }
282
308
  config.on(:shutdown) { Quonfig.instance.stop rescue nil }
@@ -284,7 +310,7 @@ end
284
310
  ```
285
311
 
286
312
  For Sidekiq web/CLI processes that don't fork (default `concurrency: 1`),
287
- `Quonfig.init` in the initializer is sufficient.
313
+ `Quonfig.init` in the initializer is sufficient on any Ruby version.
288
314
 
289
315
  ### Spring / Bootsnap preloaders
290
316
 
@@ -20,6 +20,29 @@ module Quonfig
20
20
  class Client
21
21
  LOG = Quonfig::InternalLogger.new(self)
22
22
 
23
+ # qfg-ryov: instance registry for the Process._fork hook. Every live
24
+ # Client is tracked here so the hook can fan out before_fork_in_parent /
25
+ # after_fork_in_child across all of them without the customer needing to
26
+ # name a specific instance. ObjectSpace::WeakMap means a Client that goes
27
+ # out of scope is GC'd without leaking through this registry. Stopped
28
+ # Clients stay in the registry until GC; both fork hooks early-return on
29
+ # +@stopped+ so a stopped instance is effectively a no-op. (We don't use
30
+ # WeakMap#delete because it was added in Ruby 3.3 and the matrix still
31
+ # includes 3.2.)
32
+ @instances = ObjectSpace::WeakMap.new
33
+ @instances_mutex = Mutex.new
34
+
35
+ class << self
36
+ # Iterate live Client instances. Used by Quonfig::ForkSafety.
37
+ def each_instance(&block)
38
+ @instances_mutex.synchronize { @instances.keys }.each(&block)
39
+ end
40
+
41
+ def register_instance(client)
42
+ @instances_mutex.synchronize { @instances[client] = true }
43
+ end
44
+ end
45
+
23
46
  attr_reader :options, :resolver, :store, :evaluator, :instance_hash,
24
47
  :config_loader, :telemetry_reporter
25
48
 
@@ -48,6 +71,7 @@ module Quonfig
48
71
  @sse_state = :idle
49
72
  @sse_ever_connected = false
50
73
  @fallback_engage_timer = nil
74
+ @sse_terminal_failure = false
51
75
 
52
76
  # If the caller injected a store, we're in test/bootstrap mode; skip I/O.
53
77
  return if store
@@ -59,6 +83,10 @@ module Quonfig
59
83
  end
60
84
 
61
85
  initialize_telemetry
86
+
87
+ # Register only for non-store-injected clients (a caller-supplied store
88
+ # is the test/bootstrap path; the fork hook does not apply there).
89
+ self.class.register_instance(self) unless store
62
90
  end
63
91
 
64
92
  # ---- Lookup --------------------------------------------------------
@@ -264,34 +292,52 @@ module Quonfig
264
292
 
265
293
  def stop
266
294
  @stopped = true
267
- begin
268
- @sse_client&.close
269
- rescue StandardError => e
270
- LOG.debug "Error closing SSE client: #{e.message}"
271
- end
272
- @sse_client = nil
295
+ tear_down_threaded_components!
296
+ end
273
297
 
274
- cancel_fallback_engage_timer
298
+ # qfg-ryov: pre-fork hook. Close the SSE worker, polling supervisor,
299
+ # telemetry reporter, and any fallback-engage timer. Idempotent — calling
300
+ # twice is safe. Does NOT set @stopped: the client is still expected to
301
+ # be usable post-fork via after_fork_in_child.
302
+ #
303
+ # Why this matters: Ruby threads do not survive fork(2). If we let the
304
+ # child inherit a live Net::HTTP socket, both processes read from the
305
+ # same fd and corrupt each other's bytes. Closing in the parent before
306
+ # fork is the only safe shape.
307
+ def before_fork_in_parent
308
+ return if @stopped
275
309
 
276
- begin
277
- @poll_supervisor&.stop
278
- rescue StandardError => e
279
- LOG.debug "Error stopping poll supervisor: #{e.message}"
280
- end
281
- @poll_supervisor = nil
310
+ tear_down_threaded_components!
311
+ end
282
312
 
283
- begin
284
- @telemetry_reporter&.stop
285
- rescue StandardError => e
286
- LOG.debug "Error stopping telemetry reporter: #{e.message}"
313
+ # qfg-ryov: post-fork (in child) hook. Re-establish whatever threaded
314
+ # components the client had pre-fork. No-op if the client was already
315
+ # stopped (the customer asked for it to be dead — do not resurrect),
316
+ # or if the client is in datadir mode (no threaded components to start).
317
+ def after_fork_in_child
318
+ return if @stopped
319
+ return if @options.datadir
320
+ return if @config_loader.nil? # never finished network init (e.g. invalid key)
321
+
322
+ # SSE state machine carries flags that no longer apply in the child
323
+ # (the parent had connected, the parent had errored, etc.). Reset.
324
+ @state_mutex.synchronize do
325
+ @sse_state = :idle
326
+ @sse_ever_connected = false
327
+ @sse_terminal_failure = false
287
328
  end
288
- @telemetry_reporter = nil
329
+
330
+ sse_started = @options.enable_sse && start_sse
331
+ start_polling if @options.enable_polling && !sse_started
332
+
333
+ restart_telemetry_in_child
289
334
  end
290
335
 
291
336
  # quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
292
337
  # Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
293
- # incremented on every on_error edge from ld-eventsource (qfg-ll6r).
294
- # Layer 2 (HTTP polling fallback) is wired through Quonfig::WorkerSupervisor.
338
+ # incremented once per reconnect attempt by the SDK-owned reconnect
339
+ # loop (qfg-35sm). Layer 2 (HTTP polling fallback) is wired through
340
+ # Quonfig::WorkerSupervisor.
295
341
  #
296
342
  # Pass +layer:+ ('1' or '2') to read a single layer; default returns the
297
343
  # sum across both layers so the chaos harness (and operators) can pull
@@ -357,6 +403,41 @@ module Quonfig
357
403
 
358
404
  private
359
405
 
406
+ # Close every threaded component and drop its reference. Used by both
407
+ # +stop+ (where @stopped is also flipped) and +before_fork_in_parent+
408
+ # (where @stopped is left alone so the child can restart).
409
+ def tear_down_threaded_components!
410
+ begin
411
+ @sse_client&.close
412
+ rescue StandardError => e
413
+ LOG.debug "Error closing SSE client: #{e.message}"
414
+ end
415
+ @sse_client = nil
416
+
417
+ cancel_fallback_engage_timer
418
+
419
+ begin
420
+ @poll_supervisor&.stop
421
+ rescue StandardError => e
422
+ LOG.debug "Error stopping poll supervisor: #{e.message}"
423
+ end
424
+ @poll_supervisor = nil
425
+
426
+ begin
427
+ @telemetry_reporter&.stop
428
+ rescue StandardError => e
429
+ LOG.debug "Error stopping telemetry reporter: #{e.message}"
430
+ end
431
+ @telemetry_reporter = nil
432
+ end
433
+
434
+ # Rebuild the telemetry reporter in the child after fork. Mirrors the
435
+ # original initialize_telemetry path — fresh aggregators, fresh reporter.
436
+ def restart_telemetry_in_child
437
+ @telemetry_reporter = nil
438
+ initialize_telemetry
439
+ end
440
+
360
441
  # Stamp +last_successful_refresh+ at install time. Called by every code
361
442
  # path that hands an envelope to the cache: datadir load, initial HTTP
362
443
  # fetch, SSE event apply, and polling worker fetch.
@@ -402,20 +483,31 @@ module Quonfig
402
483
  @sse_error_callback ||= ->(error) { handle_sse_error(error) }
403
484
  end
404
485
 
405
- def handle_sse_error(_error)
486
+ def handle_sse_error(error)
487
+ # qfg-i5xv: classify terminal HTTP failures (401/403/404). The same SDK
488
+ # key that won't auth over SSE won't auth over HTTP polling either, so
489
+ # we must NOT engage the Layer 2 fallback — that just moves the
490
+ # auth-failure storm from one endpoint to another. Once flipped,
491
+ # @sse_terminal_failure latches: a buggy customer retry loop cannot
492
+ # un-classify the failure by driving the state machine.
493
+ @state_mutex.synchronize { @sse_terminal_failure = true } if error.is_a?(Quonfig::SSEConfigClient::SSEHTTPTerminalError)
406
494
  handle_sse_state_change(:error)
407
495
  end
408
496
 
409
497
  def handle_sse_state_change(new_state)
410
498
  state = new_state.to_sym
411
- ever_connected = @state_mutex.synchronize do
499
+ ever_connected, terminal = @state_mutex.synchronize do
412
500
  @sse_state = state
413
501
  @sse_ever_connected = true if state == :connected
414
- @sse_ever_connected
502
+ [@sse_ever_connected, @sse_terminal_failure]
415
503
  end
416
504
 
417
505
  return unless @options.respond_to?(:enable_polling) && @options.enable_polling
418
506
  return if @stopped
507
+ # qfg-i5xv: a terminal SSE classification suppresses polling engage in
508
+ # every branch — the customer's key is bad and HTTP polling will fail
509
+ # identically. Operators surface this via #terminal_failure?.
510
+ return if terminal
419
511
 
420
512
  case state
421
513
  when :connected
@@ -430,6 +522,21 @@ module Quonfig
430
522
  end
431
523
  end
432
524
 
525
+ public
526
+
527
+ # qfg-i5xv: true once the SSE layer has classified an HTTP response as
528
+ # terminal (401/403/404) — bad SDK key, revoked workspace permission,
529
+ # or wrong endpoint. The classification latches: the SDK will not
530
+ # auto-recover, and a customer-supplied retry must rebuild the client.
531
+ # Surfaced for operator alerting; `connection_state` still reports
532
+ # `:disconnected` to honor the documented connection_state vocabulary
533
+ # (supervisor-test-contract.md §"connectionState()" — values fixed).
534
+ def terminal_failure?
535
+ @state_mutex.synchronize { @sse_terminal_failure }
536
+ end
537
+
538
+ private
539
+
433
540
  def cancel_fallback_engage_timer
434
541
  timer = @state_mutex.synchronize do
435
542
  t = @fallback_engage_timer
@@ -904,4 +1011,42 @@ module Quonfig
904
1011
  end
905
1012
  end
906
1013
  end
1014
+
1015
+ # qfg-ryov: hook into Process._fork so customers using Puma's clustered
1016
+ # mode (or any preload/fork-worker server) don't have to wire
1017
+ # +before_fork+/+on_worker_boot+ manually. Ruby 3.1+ routes every
1018
+ # +Kernel#fork+/+Process.fork+ call through +Process._fork+, so a single
1019
+ # prepend covers them all.
1020
+ #
1021
+ # Process._fork's contract:
1022
+ # - Called in the parent process before the fork syscall.
1023
+ # - Returns 0 in the child, child's pid in the parent.
1024
+ # - +super+ performs the actual fork.
1025
+ #
1026
+ # The parent's view: SSE/polling/telemetry threads are torn down before
1027
+ # the syscall so the child does not inherit a live Net::HTTP socket fd
1028
+ # (which would corrupt both sides). The parent does NOT auto-restart —
1029
+ # that mirrors the Puma master use case where the master process no
1030
+ # longer serves requests after spawning workers.
1031
+ module ForkSafety
1032
+ def _fork
1033
+ Quonfig::Client.each_instance(&:before_fork_in_parent)
1034
+ pid = super
1035
+ Quonfig::Client.each_instance(&:after_fork_in_child) if pid.zero?
1036
+ pid
1037
+ rescue StandardError => e
1038
+ # Fork-hook failures must never break the customer's fork. Worst case
1039
+ # the child inherits dead SSE threads (the pre-qfg-ryov behavior) —
1040
+ # bad, but recoverable. Crashing the fork itself is not.
1041
+ Quonfig::Client::LOG.error "Quonfig fork hook error: #{e.class}: #{e.message}"
1042
+ raise if pid.nil? # super never returned — propagate fork failures
1043
+
1044
+ pid
1045
+ end
1046
+ end
1047
+
1048
+ # Ruby 3.0 lacks Process._fork. There's no hookable choke point on 3.0, so
1049
+ # customers must keep wiring their own Puma before_fork / on_worker_boot
1050
+ # (see README "Rails integration"). On 3.1+ we install the hook globally.
1051
+ Process.singleton_class.prepend(ForkSafety) if Process.respond_to?(:_fork)
907
1052
  end
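A note on the hook's error path, with a possible variation. In `ForkSafety#_fork` as shown, an exception raised by a `before_fork_in_parent` call reaches the method-level `rescue` while `pid` is still `nil`, so `raise if pid.nil?` re-raises it and the caller's fork does not happen. The sketch below guards each phase separately so that only a failure of `super` itself propagates. `ToyClient`, `ToyForkSafety`, and `guard_fork_hook` are hypothetical names used for illustration; the registry mirrors the `ObjectSpace::WeakMap` approach from the diff but is not the gem's code.

```ruby
# Hypothetical stand-in for a client with a weak instance registry.
class ToyClient
  @instances = ObjectSpace::WeakMap.new
  @mutex = Mutex.new

  class << self
    def register(client)
      @mutex.synchronize { @instances[client] = true }
    end

    # Snapshot the keys under the lock, then iterate outside it.
    def each_instance(&block)
      @mutex.synchronize { @instances.keys }.each(&block)
    end
  end

  attr_reader :events

  def initialize
    @events = []
    self.class.register(self)
  end

  def before_fork_in_parent
    @events << :before
  end

  def after_fork_in_child
    @events << :after
  end
end

module ToyForkSafety
  def _fork
    guard_fork_hook { ToyClient.each_instance(&:before_fork_in_parent) }
    pid = super # a failing fork propagates untouched
    guard_fork_hook { ToyClient.each_instance(&:after_fork_in_child) } if pid.zero?
    pid
  end

  private

  # Logs and swallows hook errors so they cannot break the caller's fork.
  def guard_fork_hook
    yield
  rescue StandardError => e
    warn "fork hook error: #{e.class}: #{e.message}"
  end
end

# Process._fork exists on Ruby 3.1+ only, so the prepend is conditional.
Process.singleton_class.prepend(ToyForkSafety) if Process.respond_to?(:_fork)
```

The trade-off is that a failed parent-side teardown is only logged, so the child may inherit a live socket; whether that is preferable to aborting the fork depends on the deployment.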
@@ -2,300 +2,611 @@
2
2
 
3
3
  require 'base64'
4
4
  require 'json'
5
+ require 'net/http'
6
+ require 'uri'
5
7
 
6
8
  module Quonfig
9
+ # Event delivered to on_envelope. +id+ mirrors the SSE +id:+ field and is
10
+ # consumed by callers that want the server cursor (tests + last-event-id
11
+ # resume). +data+ is the raw +data:+ payload string. +envelope+ is the
12
+ # parsed Quonfig::ConfigEnvelope.
13
+ StreamEvent = Struct.new(:envelope, :id, :data)
14
+
15
+ # SSE client for real-time config delivery from api-delivery-sse.
16
+ #
17
+ # Owns its reconnect loop end-to-end. sdk-go, sdk-python, and sdk-node all
18
+ # reached the same conclusion: the wire format we consume (plain JSON
19
+ # envelopes in single-line +data:+ frames, no named events, no retry
20
+ # directives) is simple enough that an SDK-owned loop is clearer than a
21
+ # library wrapper, and the operator-facing reconnect counter becomes
22
+ # trivially correct because there is exactly one place that increments it
23
+ # (qfg-35sm; replaces the ld-eventsource integration from qfg-ie49 +
24
+ # qfg-cf52, which required log-line scraping and a raise-proof logger
25
+ # wrapper to observe reconnects through the upstream library).
7
26
  class SSEConfigClient
8
- # ld-eventsource auto-reconnects on a clean socket EOF (server FIN)
9
- # *internally* — it never calls +on_error+ for that case, only for
10
- # ECONNREFUSED-style failures (qfg-ie49; see chaos scenario 09). The one
11
- # signal it emits for any reconnect is an info-level
12
- # "Will retry connection after ..." line, logged once per reconnect attempt
13
- # and never on the first connect. Wrapping the logger we hand to
14
- # SSE::Client lets the SDK observe those internal reconnects without
15
- # touching the data path. This is the only reconnect hook ld-eventsource
16
- # >= 2.0 exposes.
17
- class ReconnectCountingLogger
18
- RECONNECT_SIGNAL = 'Will retry connection after'
19
-
20
- LEVELS = %i[trace debug info warn error fatal].freeze
21
-
22
- def initialize(wrapped, &on_reconnect)
23
- @wrapped = wrapped
24
- @on_reconnect = on_reconnect
25
- end
26
-
27
- # Crash-safe by construction: ld-eventsource calls this logger from
28
- # inside its bare-Thread +run_stream+ loop, and several of those call
29
- # sites (+connect+, +log_and_dispatch_error+, query-param building) are
30
- # NOT wrapped in a rescue. Any exception that escapes a logger call kills
31
- # the worker thread with +@stopped+ still false, so +closed?+ never flips
32
- # true and the SDK's @retry_thread never reconnects — the SSE stream is
33
- # silently wedged forever (qfg-cf52, the chaos scenario 05 flake). Every
34
- # step here is therefore independently guarded: a throwing message block,
35
- # a throwing on_reconnect callback, or a throwing wrapped logger can
36
- # never propagate out of this method.
37
- LEVELS.each do |level|
38
- define_method(level) do |message = nil, &block|
39
- begin
40
- message = block.call if message.nil? && block
41
- rescue StandardError
42
- message = nil
43
- end
44
-
45
- if level == :info && message.to_s.include?(RECONNECT_SIGNAL)
46
- begin
47
- @on_reconnect.call
48
- rescue StandardError
49
- nil
50
- end
51
- end
52
-
53
- begin
54
- @wrapped.public_send(level, message) if @wrapped.respond_to?(level)
55
- rescue StandardError
56
- nil
57
- end
58
- end
59
- end
60
-
61
- def level
62
- @wrapped&.level
63
- end
64
-
65
- def level=(new_level)
66
- @wrapped.level = new_level if @wrapped.respond_to?(:level=)
67
- end
68
- end
69
-
70
27
  class Options
71
- attr_reader :sse_read_timeout, :seconds_between_new_connection,
72
- :sse_default_reconnect_time, :sleep_delay_for_new_connection_check,
73
- :errors_to_close_connection, :sse_reconnect_reset_interval
28
+ attr_reader :sse_read_timeout, :sse_connect_timeout,
29
+ :sse_initial_reconnect_delay, :sse_max_reconnect_delay
74
30
 
75
31
  # sse_read_timeout: 90s = 3x the 30s server heartbeat. A silent socket
76
- # stall trips the read deadline within one missed-heartbeat window
77
- # rather than the previous 5-minute idle. See plan
78
- # `project/plans/sdk-hardening-and-verification.md` Layer 1.
32
+ # stall trips within one missed-heartbeat window rather than the OS
33
+ # TCP idle (often hours).
79
34
  #
80
- # sse_reconnect_reset_interval: 1s (ld-eventsource default is 60s). The
81
- # ld-eventsource backoff only resets to the base interval once a
82
- # connection has stayed up this long; until then each reconnect doubles
83
- # the delay (1s, 2s, 4s, 8s...). With the 60s default, a flapping
84
- # connection (chaos scenario 09: proxy killed every 6s) backs off so
85
- # fast the SDK is mid-sleep when the next kill lands and never observes
86
- # it. Resetting after 1s of healthy connection mirrors sdk-python, which
87
- # resets its backoff on every successful connect (sdk-python/quonfig/
88
- # sse.py). A *sustained* outage still backs off exponentially: no
89
- # connection succeeds, so `mark_success` is never called and the reset
90
- # never triggers (qfg-ie49).
35
+ # sse_initial_reconnect_delay / sse_max_reconnect_delay: backoff bounds.
36
+ # Each failed reconnect doubles the delay (with +/-50% jitter) up to the
37
+ # max. A successful event delivery resets the delay to the initial
38
+ # value, matching sdk-python's policy. A clean server-initiated FIN is
39
+ # treated as "not a failure for backoff purposes" because LBs recycling
40
+ # connections is normal; the reconnect counter still increments.
91
41
  def initialize(sse_read_timeout: 90,
92
- seconds_between_new_connection: 5,
93
- sleep_delay_for_new_connection_check: 1,
94
- sse_default_reconnect_time: SSE::Client::DEFAULT_RECONNECT_TIME,
95
- sse_reconnect_reset_interval: 1,
96
- errors_to_close_connection: [HTTP::ConnectionError])
42
+ sse_connect_timeout: 10,
43
+ sse_initial_reconnect_delay: 1.0,
44
+ sse_max_reconnect_delay: 30.0)
97
45
  @sse_read_timeout = sse_read_timeout
98
- @seconds_between_new_connection = seconds_between_new_connection
99
- @sse_default_reconnect_time = sse_default_reconnect_time
100
- @sse_reconnect_reset_interval = sse_reconnect_reset_interval
101
- @sleep_delay_for_new_connection_check = sleep_delay_for_new_connection_check
102
- @errors_to_close_connection = errors_to_close_connection
46
+ @sse_connect_timeout = sse_connect_timeout
47
+ @sse_initial_reconnect_delay = sse_initial_reconnect_delay.to_f
48
+ @sse_max_reconnect_delay = sse_max_reconnect_delay.to_f
103
49
  end
104
50
  end
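The backoff policy described in the `Options` comment (double on each failed reconnect, +/-50% jitter, capped at the max, reset on a successful event delivery) could look roughly like the sketch below. `ReconnectBackoff` is a hypothetical class, not part of the gem, and clamping the jittered value to the max is an assumption; the diff does not show how the SDK applies the cap.

```ruby
# Hypothetical helper illustrating the documented backoff bounds.
class ReconnectBackoff
  def initialize(initial: 1.0, max: 30.0, rng: Random.new)
    @initial = initial.to_f
    @max = max.to_f
    @rng = rng
    @current = @initial
  end

  # Returns the sleep for the next retry, then doubles the base delay.
  def next_delay
    base = @current
    @current = [@current * 2, @max].min
    jitter = 0.5 + @rng.rand # uniform in [0.5, 1.5)
    [base * jitter, @max].min # assumption: the jittered delay is also capped
  end

  # Call after a successful event delivery.
  def reset
    @current = @initial
  end
end
```

With the defaults, successive base delays run 1s, 2s, 4s, 8s, 16s, then stay at 30s until `reset` is called.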
105
51
 
106
52
  LOG = Quonfig::InternalLogger.new(self)
107
53
 
54
+ # qfg-i5xv: HTTP status codes the SDK classifies as terminal — these will
55
+ # not heal by retrying (bad key, revoked permission, missing endpoint).
56
+ # Anything else (5xx, 429, network errors) stays on the transient path.
57
+ TERMINAL_HTTP_CODES = [401, 403, 404].freeze
58
+
108
59
  # +on_error+: optional callable invoked on every SSE error edge. Parent
109
60
  # Quonfig::Client wires this to drive @sse_state -> :error so that
110
- # +connection_state+ reflects the disconnect (qfg-47c2.27). Without it
111
- # the SDK's public health primitive would lie about its own state during
112
- # a mid-run socket drop.
61
+ # +connection_state+ reflects the disconnect (qfg-47c2.27).
113
62
  def initialize(prefab_options, config_loader, options = nil, logger = nil, on_error: nil)
114
63
  @prefab_options = prefab_options
115
64
  @options = options || Options.new
116
65
  @config_loader = config_loader
117
- @connected = false
118
66
  @logger = logger || LOG
119
67
  @on_error = on_error
68
+
69
+ @stopped = Concurrent::AtomicBoolean.new(false)
120
70
  @restart_total = 0
121
71
  @restart_mutex = Mutex.new
72
+
73
+ @on_envelope_error_total = 0
74
+ @on_envelope_error_mutex = Mutex.new
75
+
76
+ @conn_mutex = Mutex.new
77
+ @active_http = nil
78
+
79
+ @source_index = -1
80
+ @last_event_id = nil
122
81
  end
123
82
 
124
- # qfg-ll6r / qfg-ie49: Layer 1 (SSE) restart counter counts every
125
- # *reconnect*, from two sources:
126
- # 1. ld-eventsource's own internal reconnect (clean FIN, read timeout,
127
- # transient errors it doesn't surface) observed via the
128
- # ReconnectCountingLogger "Will retry connection after" signal.
129
- # 2. SDK-driven reconnects in @retry_thread, after a closing error
130
- # (HTTP::ConnectionError) made us close the SSE::Client outright.
131
- # These two are mutually exclusive per disconnect, so there is no
132
- # double-count. on_error is deliberately NOT a source — ld-eventsource
133
- # reconnects internally after most non-closing errors, so counting the
134
- # error edge AND the reconnect would double up (qfg-ie49).
135
- #
136
- # The chaos harness pulls this via Client#worker_restart_total(layer: '1')
137
- # so kill-storm scenarios (e.g. scenario 09 — proxy killed 5x in 30s) can
138
- # assert restart_total >= 5 even when the kills produce clean FINs that
139
- # never reach on_error.
83
+ # Layer 1 (SSE) reconnect counter. Bumped exactly once per reconnect
84
+ # attempt never per error edge, never per envelope. Read by
85
+ # Quonfig::Client#worker_restart_total(layer: '1') and asserted by chaos
86
+ # scenario 09 (>= 5 after 5 proxy flaps in 30s).
140
87
  def restart_total
141
88
  @restart_mutex.synchronize { @restart_total }
142
89
  end
143
90
 
144
- # Bump the Layer 1 reconnect counter. Called from the ld-eventsource
145
- # worker thread (via ReconnectCountingLogger) and from @retry_thread.
146
- def count_restart!
147
- @restart_mutex.synchronize { @restart_total += 1 }
91
+ # qfg-m3lk: count of user-supplied on_envelope callback invocations that
92
+ # raised. Surfaced for operator visibility a non-zero value here with
93
+ # restart_total stable means a caller-side listener bug, not a transport
94
+ # problem. (Pre-fix, those raises propagated into run_loop's rescue and
95
+ # masqueraded as transport errors, causing reconnect storms.)
96
+ def on_envelope_error_total
97
+ @on_envelope_error_mutex.synchronize { @on_envelope_error_total }
148
98
  end
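The callback isolation that `on_envelope_error_total` counts (qfg-m3lk) can be sketched as follows. `EnvelopeDispatcher` is a hypothetical class for illustration, not the gem's code; it assumes only what the changelog states, namely that the listener is wrapped in `rescue StandardError` at the invocation site and that `Interrupt` and `SystemExit` are deliberately not caught.

```ruby
# Hypothetical wrapper around a user-supplied listener.
class EnvelopeDispatcher
  def initialize(&on_envelope)
    @on_envelope = on_envelope
    @error_total = 0
    @mutex = Mutex.new
  end

  def error_total
    @mutex.synchronize { @error_total }
  end

  # Delivers one event. A StandardError from the listener is counted and
  # reported; Interrupt and SystemExit are not StandardError subclasses,
  # so they still propagate.
  def deliver(event)
    @on_envelope.call(event)
    true
  rescue StandardError => e
    @mutex.synchronize { @error_total += 1 }
    warn "on_envelope raised #{e.class}: #{e.message}"
    false
  end
end
```

Because the rescue sits at the call site, a raising listener never reaches the transport-level rescue, so the restart counter stays stable while the error counter climbs.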
149
99
 
150
- def close
151
- @retry_thread&.kill
152
- @client&.close
100
+ def start(&on_envelope)
101
+ return if @prefab_options.sse_api_urls.nil? || @prefab_options.sse_api_urls.empty?
102
+
103
+ @worker = Thread.new { run_loop(&on_envelope) }
153
104
  end
154
105
 
155
- def start(&load_configs)
156
- if @prefab_options.sse_api_urls.empty?
157
- @logger.debug 'No SSE api_urls configured'
158
- return
106
+ # Shut down. Interrupts the in-flight stream by closing the underlying
107
+ # socket from this thread — the worker thread observes the resulting
108
+ # IOError, sees @stopped == true, and exits cleanly.
109
+ def close
110
+ @stopped.make_true
111
+ @conn_mutex.synchronize do
112
+ begin
113
+ @active_http&.finish
114
+ rescue StandardError
115
+ # already closed / never started — idempotent
116
+ end
117
+ @active_http = nil
159
118
  end
119
+ @worker&.join(2)
120
+ @worker = nil
121
+ end
122
+
123
+ # Public so tests can assert the headers shape. Body of the request is
124
+ # always empty; this is the full set api-delivery-sse sees.
125
+ def headers
126
+ auth = "1:#{@prefab_options.sdk_key}"
127
+ auth_string = Base64.strict_encode64(auth)
128
+ h = {
129
+ 'Authorization' => "Basic #{auth_string}",
130
+ 'Accept' => 'text/event-stream',
131
+ 'Cache-Control' => 'no-cache',
132
+ 'X-Quonfig-SDK-Version' => "ruby-#{Quonfig::VERSION}"
133
+ }
134
+ cursor = current_cursor
135
+ h['Last-Event-Id'] = cursor if cursor
136
+ h
137
+ end
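
Reviewer note: the header set built by `headers` can be reproduced outside the gem. This is a minimal sketch, assuming a hypothetical free function `sse_headers` and placeholder key/version values; it uses `Array#pack('m0')`, which produces the same strict Base64 output as `Base64.strict_encode64`, so that no extra `require` is needed.

```ruby
# Hypothetical helper mirroring the header shape shown in the diff above.
# sdk_key, sdk_version and cursor are illustrative parameters.
def sse_headers(sdk_key, sdk_version, cursor = nil)
  # pack('m0') is strict Base64 (no line feeds), the same encoding that
  # Base64.strict_encode64 emits.
  token = ["1:#{sdk_key}"].pack('m0')
  h = {
    'Authorization' => "Basic #{token}",
    'Accept' => 'text/event-stream',
    'Cache-Control' => 'no-cache',
    'X-Quonfig-SDK-Version' => "ruby-#{sdk_version}"
  }
  # The resume header is only sent when a cursor is known.
  h['Last-Event-Id'] = cursor if cursor
  h
end
```

Calling `sse_headers('secret', '0.0.16')` omits `Last-Event-Id` entirely, while passing a third argument adds it, which matches the conditional assignment in the diff.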
160
138
 
161
- @client = connect(&load_configs)
139
+ # Compute a Last-Event-ID for the next request. Three sources, in
140
+ # priority order:
141
+ # 1. @last_event_id -- set by the most recent event we processed
142
+ # 2. config_loader.version -- string ETag from last HTTP fetch
143
+ # 3. config_loader.highwater_mark -- legacy numeric cursor
144
+ # Returns nil if no prior state exists.
145
+ def current_cursor
146
+ return @last_event_id if @last_event_id && !@last_event_id.empty?
162
147
 
163
- closed_count = 0
148
+ if @config_loader.respond_to?(:version)
149
+ v = @config_loader.version
150
+ return v if v.is_a?(String) && !v.empty?
151
+ end
164
152
 
165
- @retry_thread = Thread.new do
166
- loop do
167
- sleep @options.sleep_delay_for_new_connection_check
153
+ if @config_loader.respond_to?(:highwater_mark)
154
+ hw = @config_loader.highwater_mark
155
+ return hw.to_s if hw.is_a?(Numeric) && hw.positive?
156
+ return hw if hw.is_a?(String) && !hw.empty?
157
+ end
168
158
 
169
- next unless @client.closed?
159
+ nil
160
+ end
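
Reviewer note: the three-source priority order in `current_cursor` is easy to check in isolation. A minimal sketch, assuming a hypothetical `FakeLoader` struct in place of the real config loader and a free function `pick_cursor` that is not part of the gem (the `respond_to?` guards are dropped because the struct always defines both readers):

```ruby
# Hypothetical stand-in for the config loader; not part of quonfig.
FakeLoader = Struct.new(:version, :highwater_mark)

# Mirrors the priority order described above: last processed event id,
# then the string version, then the legacy high-water mark, else nil.
def pick_cursor(last_event_id, loader)
  return last_event_id if last_event_id && !last_event_id.empty?

  v = loader.version
  return v if v.is_a?(String) && !v.empty?

  hw = loader.highwater_mark
  return hw.to_s if hw.is_a?(Numeric) && hw.positive?
  return hw if hw.is_a?(String) && !hw.empty?

  nil
end
```

An empty string at any level falls through to the next source, so a freshly constructed client with no prior state ends up sending no `Last-Event-Id` at all.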
170
161
 
171
- closed_count += @options.sleep_delay_for_new_connection_check
162
+ private
172
163
 
173
- next unless closed_count > @options.seconds_between_new_connection
164
+ # Long-lived reconnect loop. One iteration = one connect attempt. Bumps
165
+ # restart_total *before* every retry — so the counter answers "how many
166
+ # times have we reconnected after a drop" rather than "how many connect
167
+ # attempts have occurred." The first attempt is not a restart.
168
+ #
169
+ # qfg-tj18: the body is wrapped in
170
+ # +Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)+ so a
171
+ # watchdog raise that's already been queued (the watchdog's mutex covers
172
+ # the *decision* to fire but cannot un-queue a delivered raise) lands
173
+ # only at a blocking-IO checkpoint. Inside stream_once we explicitly
174
+ # re-enable +:immediate+ around the +read_body+ block where we *do*
175
+ # want the raise to wake the read. A per-iteration paranoid rescue
176
+ # catches any late-landing raise that escapes the inner +rescue
177
+ # StandardError+ (e.g. lands inside +interruptible_sleep+ between
178
+ # iterations) so the worker thread never silently dies.
179
+ def run_loop(&on_envelope)
180
+ Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking) do
181
+ delay = @options.sse_initial_reconnect_delay
182
+ first_attempt = true
183
+
184
+ until @stopped.value
185
+ begin
186
+ unless first_attempt
187
+ increment_restart!
188
+ interruptible_sleep(jittered(delay))
189
+ break if @stopped.value
190
+ end
191
+ first_attempt = false
174
192
 
175
- closed_count = 0
176
- @logger.debug 'Reconnecting SSE client'
177
- # SDK-driven reconnect: a closing error (HTTP::ConnectionError)
178
- # closed the previous SSE::Client, so ld-eventsource's own
179
- # reconnect loop has exited and won't emit the "Will retry" signal.
180
- # Count it here instead (qfg-ie49).
181
- count_restart!
182
- @client = connect(&load_configs)
193
+ connected_at_least_once = false
194
+ begin
195
+ stream_once do |event|
196
+ connected_at_least_once = true
197
+ # Persist the most recent id so the next reconnect resumes
198
+ # from there via Last-Event-Id. Updated *before* the user
199
+ # callback runs so a raising listener still advances the
200
+ # cursor — the event was delivered to us, the bug is on the
201
+ # caller side.
202
+ @last_event_id = event.id if event.id
203
+ # qfg-m3lk: callback exceptions are isolated. A buggy
204
+ # listener must not look like a transport error and trigger
205
+ # a reconnect.
206
+ invoke_on_envelope_safely(on_envelope, event)
207
+ # A connection healthy enough to deliver a real envelope
208
+ # earns a reset of the backoff. Sustained outages never
209
+ # reach this branch (no event ever delivered) so the
210
+ # exponential growth still holds.
211
+ delay = @options.sse_initial_reconnect_delay
212
+ end
213
+ rescue StandardError => e
214
+ handle_error(e) unless @stopped.value
215
+ end
216
+
217
+ # Backoff only grows on failed connect attempts. A server-
218
+ # initiated clean FIN after a healthy session (normal LB
219
+ # recycling) reuses the same delay — punishing it would make
220
+ # us look broken under benign rolling restarts. Matches
221
+ # sdk-go's `connectedOK` distinction.
222
+ delay = [delay * 2, @options.sse_max_reconnect_delay].min unless connected_at_least_once
223
+ rescue SSEReadDeadlineExceeded => e
224
+ # Paranoid backstop (qfg-tj18). A watchdog raise that landed
225
+ # outside +stream_once+ — typically in +interruptible_sleep+
226
+ # — must not kill the worker thread. We log loudly and let the
227
+ # +until+ loop carry on.
228
+ @logger.error "SSE watchdog late-raise contained: #{e.inspect}; resuming loop"
229
+ end
183
230
  end
184
231
  end
232
+ ensure
233
+ register_active(nil)
185
234
  end
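
Reviewer note: the backoff rule in `run_loop` reduces to a small pure function. The sketch below is a model, not the gem's code; `next_delay` and its keyword arguments are hypothetical names. Note that in the diff `connected_at_least_once` only flips inside the event block, so "connected" here effectively means "delivered at least one event".

```ruby
# Hypothetical pure-function model of the backoff rule in run_loop.
# delay, initial and max are in seconds.
def next_delay(delay, delivered_event:, initial:, max:)
  # A session that delivered an event resets the backoff, and the
  # doubling step is skipped for that iteration.
  return initial if delivered_event

  # Otherwise the attempt counts as failed: double, capped at max.
  [delay * 2, max].min
end
```

Starting from 1 second, three consecutive failed attempts give 2, 4 and 8 seconds; a later session that delivers an event brings the delay back to 1.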
186
235
 
187
- def connect(&load_configs)
188
- url = "#{source}/api/v2/sse/config"
236
+ # Opens one SSE request and yields each parsed event until the stream
237
+ # ends (clean FIN, error, or stop). Raises on transport errors so the
238
+ # caller can apply backoff. Clean FIN returns without raising.
239
+ #
240
+ # A watchdog thread closes the socket if no bytes arrive within
241
+ # +sse_read_timeout+. Net::HTTP#read_timeout is NOT reliable for the
242
+ # streaming +read_body do |chunk|+ form — the underlying BufferedIO
243
+ # reads bypass it in practice (a silent server stall blocks indefinitely
244
+ # against a configured deadline). sdk-go and sdk-node hit the same
245
+ # gotcha and solve it the same way: per-chunk reset, async close on
246
+ # expiry (chaos scenario 02 — sse_silent_stall).
247
+ def stream_once(&block)
248
+ url = "#{current_url}/api/v2/sse/config"
189
249
  cursor = current_cursor
190
250
  @logger.debug "SSE Streaming Connect to #{url} start_at #{cursor.inspect}"
191
251
 
192
- # Wrap the ld-eventsource logger so internal reconnects (clean FIN,
193
- # read-timeout, transient errors) bump restart_total — they never reach
194
- # on_error (qfg-ie49).
195
- sse_logger = ReconnectCountingLogger.new(
196
- Quonfig::InternalLogger.new(SSE::Client)
197
- ) { count_restart! }
198
-
199
- SSE::Client.new(url,
200
- headers: headers,
201
- read_timeout: @options.sse_read_timeout,
202
- reconnect_time: @options.sse_default_reconnect_time,
203
- reconnect_reset_interval: @options.sse_reconnect_reset_interval,
204
- last_event_id: cursor,
205
- logger: sse_logger) do |client|
206
- client.on_event do |event|
207
- if event.data.nil? || event.data.empty?
208
- @logger.error "SSE Streaming Error: Received empty data for url #{url}"
209
- client.close
210
- next
252
+ uri = URI(url)
253
+ http = Net::HTTP.new(uri.host, uri.port)
254
+ http.use_ssl = (uri.scheme == 'https')
255
+ http.open_timeout = @options.sse_connect_timeout
256
+ # Keep Net::HTTP's read_timeout as a backstop for the header read
257
+ # (where it does apply reliably). The watchdog covers the body path.
258
+ http.read_timeout = @options.sse_read_timeout
259
+
260
+ req = Net::HTTP::Get.new(uri.request_uri, headers)
261
+
262
+ http.start
263
+ register_active(http)
264
+
265
+ watchdog = ReadDeadlineWatchdog.new(
266
+ worker: Thread.current, deadline_s: @options.sse_read_timeout,
267
+ stopped: @stopped, logger: @logger
268
+ )
269
+ watchdog.start
270
+
271
+ begin
272
+ http.request(req) do |resp|
273
+ code = resp.code.to_i
274
+ if TERMINAL_HTTP_CODES.include?(code)
275
+ # qfg-i5xv: 401/403/404 will not heal by retrying — bad key,
276
+ # revoked permission, or wrong endpoint. Mark stopped *before*
277
+ # invoking on_error so the loop's terminal-error branch is
278
+ # already locked in if the parent callback inspects state, and
279
+ # so the inner rescue's `handle_error(e) unless @stopped.value`
280
+ # guard suppresses a second on_error edge.
281
+ err = SSEHTTPTerminalError.new(code)
282
+ @logger.error "SSE Streaming Terminal Error: HTTP #{code} for url #{url}; will not retry"
283
+ @stopped.make_true
284
+ invoke_on_error(err)
285
+ raise err
211
286
  end
212
-
213
- begin
214
- parsed = JSON.parse(event.data)
215
- rescue JSON::ParserError => e
216
- @logger.error "SSE Streaming Error: Failed to parse JSON for url #{url}: #{e.message}"
217
- client.close
218
- next
287
+ if code != 200
288
+ err = SSEHTTPStatusError.new(code)
289
+ @logger.error "SSE Streaming Error: HTTP #{code} for url #{url}"
290
+ invoke_on_error(err)
291
+ raise err
219
292
  end
220
293
 
221
- envelope = Quonfig::ConfigEnvelope.new(
222
- configs: parsed['configs'] || [],
223
- meta: parsed['meta'] || {}
224
- )
225
- load_configs.call(envelope, event, :sse)
294
+ parser = EventParser.new
295
+ # qfg-tj18: run_loop wraps the body in +:on_blocking+ which
296
+ # *would* still deliver during read_body (read_body is a
297
+ # blocking IO call), but be explicit: we want the watchdog raise
298
+ # to land here without ambiguity.
299
+ Thread.handle_interrupt(SSEReadDeadlineExceeded => :immediate) do
300
+ resp.read_body do |chunk|
301
+ watchdog.reset!
302
+ break if @stopped.value
303
+
304
+ parser.feed(chunk, &block)
305
+ end
306
+ end
307
+ # read_body returned cleanly — either a server-initiated FIN, or
308
+ # the watchdog closed the socket on a silent stall. Either way,
309
+ # the outer loop will reconnect and bump restart_total on the
310
+ # next iteration.
311
+ @logger.debug "SSE stream ended for url #{url}"
312
+ end
313
+ ensure
314
+ watchdog.stop
315
+ register_active(nil)
316
+ begin
317
+ http.finish if http.started?
318
+ rescue StandardError
319
+ # already closed
226
320
  end
321
+ end
322
+ end
227
323
 
228
- client.on_error do |error|
229
- # SSL "unexpected eof" is expected when SSE sessions timeout normally
230
- if error.is_a?(OpenSSL::SSL::SSLError) && error.message.include?('unexpected eof')
231
- @logger.debug "SSE Streaming: Connection closed (expected timeout) for url #{url}"
232
- else
233
- @logger.error "SSE Streaming Error: #{error.inspect} for url #{url}"
234
- end
324
+ # Track the active connection so close() can interrupt a blocked
325
+ # read_body from another thread. Guarded by @conn_mutex.
326
+ def register_active(http)
327
+ @conn_mutex.synchronize { @active_http = http }
328
+ end
235
329
 
236
- # qfg-ie49: restart_total is NOT bumped here. ld-eventsource
237
- # auto-reconnects after most non-closing errors, and that reconnect
238
- # is already counted via ReconnectCountingLogger; bumping here too
239
- # would double-count. For closing errors (HTTP::ConnectionError) the
240
- # reconnect is counted in @retry_thread instead. on_error's job is
241
- # purely to notify the parent client of the disconnect edge.
242
-
243
- # Notify the parent client BEFORE deciding whether to close — every
244
- # error edge is a disconnect signal as far as @sse_state goes, even
245
- # if we let the underlying SSE library handle reconnect itself.
246
- # qfg-47c2.27
247
- if @on_error
248
- begin
249
- @on_error.call(error)
250
- rescue StandardError => e
251
- @logger.error "SSE on_error callback raised: #{e.inspect}"
252
- end
253
- end
330
+ def increment_restart!
331
+ @restart_mutex.synchronize { @restart_total += 1 }
332
+ end
254
333
 
255
- if @options.errors_to_close_connection.any? { |klass| error.is_a?(klass) }
256
- @logger.debug "Closing SSE connection for url #{url}"
257
- client.close
258
- end
259
- end
334
+ def handle_error(error)
335
+ @logger.error "SSE Streaming Error: #{error.inspect}"
336
+ invoke_on_error(error)
337
+ end
338
+
339
+ # qfg-m3lk: rescue StandardError (NOT Exception) so SystemExit /
340
+ # Interrupt / SignalException still escape — Ctrl-C inside a customer
341
+ # callback must still kill the process. StandardError is the right
342
+ # boundary for "the caller's listener has a bug".
343
+ def invoke_on_envelope_safely(on_envelope, event)
344
+ on_envelope.call(event.envelope, event, :sse)
345
+ rescue StandardError => e
346
+ @on_envelope_error_mutex.synchronize { @on_envelope_error_total += 1 }
347
+ bt = (e.backtrace || []).first(5).join("\n ")
348
+ @logger.error "SSE on_envelope callback raised: #{e.class}: #{e.message}\n #{bt}"
349
+ end
350
+
351
+ def invoke_on_error(error)
352
+ return unless @on_error
353
+
354
+ begin
355
+ @on_error.call(error)
356
+ rescue StandardError => e
357
+ @logger.error "SSE on_error callback raised: #{e.inspect}"
260
358
  end
261
359
  end
262
360
 
263
- def headers
264
- auth = "1:#{@prefab_options.sdk_key}"
265
- auth_string = Base64.strict_encode64(auth)
266
- {
267
- 'Authorization' => "Basic #{auth_string}",
268
- 'Accept' => 'text/event-stream',
269
- 'X-Quonfig-SDK-Version' => "ruby-#{Quonfig::VERSION}"
270
- }
361
+ # +/-50% jitter — caps thundering-herd amplitude after a partition heal.
362
+ # Identical shape to ld-eventsource's Backoff#next_interval (and
363
+ # sdk-go's runLoop jitter) so we don't surprise operators familiar with
364
+ # those.
365
+ def jittered(delay)
366
+ (delay / 2) + rand(delay / 2.0)
271
367
  end
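
Reviewer note: as written, `jittered` returns a value in the upper half of the interval, from `delay / 2` up to but not including `delay`, rather than a symmetric +/-50% spread around `delay`. Two Ruby details are worth keeping in mind when reading it: `delay / 2` is integer division when `delay` is an Integer, and `Kernel#rand` with a Float argument of 1 or more is documented to truncate the argument and return an Integer. The sketch below shows a float-preserving variant using `Random.rand`; `jittered_float` is a hypothetical name and this is not the gem's implementation.

```ruby
# Illustrative variant, not the gem's implementation.
# Returns a Float in [delay / 2, delay) for any positive delay.
def jittered_float(delay)
  half = delay / 2.0
  return 0.0 if half <= 0 # Random.rand rejects a zero or negative Float

  # Random.rand(Float) returns a Float in [0, half).
  half + Random.rand(half)
end
```

Whether the coarser integer jitter matters in practice depends on the configured delays, which are not visible in this part of the diff.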
272
368
 
273
- def source
274
- @source_index = @source_index.nil? ? 0 : @source_index + 1
369
+ # Sleep with interrupt: chunks the sleep so close() during a long
370
+ # backoff doesn't block shutdown for tens of seconds.
371
+ def interruptible_sleep(seconds)
372
+ deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
373
+ until @stopped.value
374
+ remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
375
+ break if remaining <= 0
275
376
 
276
- @source_index = 0 if @source_index >= @prefab_options.sse_api_urls.size
377
+ sleep([remaining, 0.1].min)
378
+ end
379
+ end
277
380
 
278
- @prefab_options.sse_api_urls[@source_index]
381
+ # Rotate through configured SSE URLs. The same rotation rule the
382
+ # previous implementation used, preserved so multi-region failover
383
+ # behavior is unchanged.
384
+ def current_url
385
+ urls = @prefab_options.sse_api_urls
386
+ @source_index = (@source_index + 1) % urls.size
387
+ urls[@source_index]
279
388
  end
280
389
 
281
- # Compute a Last-Event-ID to resume the stream from. Three sources, in
282
- # priority order:
283
- # 1. config_loader.version -- string ETag from last HTTP fetch (new path)
284
- # 2. config_loader.highwater_mark -- legacy numeric cursor
285
- # 3. nil -- no prior state; stream from HEAD
286
- def current_cursor
287
- if @config_loader.respond_to?(:version)
288
- v = @config_loader.version
289
- return v if v.is_a?(String) && !v.empty?
390
+ # Internal: HTTP-status sentinel error for non-200 SSE responses. Surfaces
391
+ # the status code through #message so parent on_error callbacks can log
392
+ # meaningfully without depending on ld-eventsource's error hierarchy.
393
+ class SSEHTTPStatusError < StandardError
394
+ attr_reader :status_code
395
+
396
+ def initialize(status_code)
397
+ @status_code = status_code
398
+ super("HTTP #{status_code}")
290
399
  end
400
+ end
291
401
 
292
- if @config_loader.respond_to?(:highwater_mark)
293
- hw = @config_loader.highwater_mark
294
- return hw.to_s if hw.is_a?(Numeric) && hw.positive?
295
- return hw if hw.is_a?(String) && !hw.empty?
402
+ # qfg-i5xv: terminal HTTP failures the SDK will not retry. 401 = bad key,
403
+ # 403 = revoked workspace permission, 404 = wrong endpoint / missing
404
+ # workspace. A subclass of SSEHTTPStatusError so existing on_error
405
+ # callbacks that only check `is_a?(SSEHTTPStatusError)` keep working,
406
+ # while customers that want to distinguish (alerting, OpenFeature
407
+ # provider error events) can dispatch on the subclass.
408
+ class SSEHTTPTerminalError < SSEHTTPStatusError; end
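
Reviewer note: because the terminal error is a subclass, an `on_error` callback that wants to distinguish the two cases has to test the subclass first. A minimal sketch, with the two classes copied locally at top level so it runs without the gem; the `classify` function and its return symbols are hypothetical.

```ruby
# Local top-level copies of the two error classes from the diff, so the
# sketch runs without the gem installed.
class SSEHTTPStatusError < StandardError
  attr_reader :status_code

  def initialize(status_code)
    @status_code = status_code
    super("HTTP #{status_code}")
  end
end

class SSEHTTPTerminalError < SSEHTTPStatusError; end

# Hypothetical on_error handler. `case` tries the branches in order, so
# the more specific subclass must be listed before its parent.
def classify(error)
  case error
  when SSEHTTPTerminalError then :will_not_retry
  when SSEHTTPStatusError then :will_retry
  else :transport
  end
end
```

A callback written against 0.0.15-era assumptions that only checks `is_a?(SSEHTTPStatusError)` still matches both classes, which is the compatibility property the comment in the diff describes.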
409
+
410
+ # Raised by the watchdog into the worker thread when the per-chunk
411
+ # read deadline elapses. Caught by run_loop's rescue, indistinguishable
412
+ # from any other transport error for backoff/restart purposes.
413
+ class SSEReadDeadlineExceeded < StandardError; end
414
+
415
+ # Background watchdog that interrupts the worker thread if no chunk
416
+ # arrives within +deadline_s+ seconds. Uses Thread#raise — the only
417
+ # reliable cross-platform way to unblock a Ruby thread blocked in
418
+ # +Net::HTTP+'s body-read on macOS. (Closing or shutting down the
419
+ # underlying socket from another thread does NOT wake the reader on
420
+ # macOS; the kernel discards future reads but the in-flight syscall
421
+ # stays blocked until something else trips. sdk-go and sdk-node solve
422
+ # the equivalent problem with context cancellation / AbortController,
423
+ # which Ruby lacks at the IO layer.) Thread#raise is essentially what
424
+ # +Timeout.timeout+ does internally; using it directly avoids
425
+ # Timeout.timeout's sketchy reputation around ensure blocks.
426
+ class ReadDeadlineWatchdog
427
+ POLL_INTERVAL = 0.25
428
+
429
+ def initialize(worker:, deadline_s:, stopped:, logger:)
430
+ @worker = worker
431
+ @deadline_s = deadline_s
432
+ @stopped = stopped
433
+ @logger = logger
434
+ @active = true
435
+ # Mutex covers @active AND the decision to fire Thread#raise. stop()
436
+ # holds the mutex when flipping @active false, so a +stop+ that
437
+ # arrives mid-deadline-check cannot lose the race against the
438
+ # watchdog's @worker.raise call (which would inject a spurious
439
+ # SSEReadDeadlineExceeded into the worker thread right after a
440
+ # clean read_body return).
441
+ @mutex = Mutex.new
442
+ @last_read_at = Concurrent::AtomicReference.new(Process.clock_gettime(Process::CLOCK_MONOTONIC))
296
443
  end
297
444
 
298
- nil
445
+ def start
446
+ @thread = Thread.new { watch }
447
+ end
448
+
449
+ def reset!
450
+ @last_read_at.set(Process.clock_gettime(Process::CLOCK_MONOTONIC))
451
+ end
452
+
453
+ def stop
454
+ @mutex.synchronize { @active = false }
455
+ @thread&.join(1)
456
+ @thread = nil
457
+ end
458
+
459
+ private
460
+
461
+ def watch
462
+ loop do
463
+ sleep POLL_INTERVAL
464
+ break unless @mutex.synchronize { @active } && !@stopped.value
465
+
466
+ idle = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @last_read_at.value
467
+ next if idle < @deadline_s
468
+
469
+ fired = @mutex.synchronize do
470
+ next false unless @active && !@stopped.value
471
+
472
+ @logger.debug "SSE read deadline exceeded (#{idle.round(1)}s idle >= #{@deadline_s}s); interrupting worker"
473
+ @worker.raise(SSEReadDeadlineExceeded.new("SSE read deadline #{@deadline_s}s exceeded"))
474
+ true
475
+ end
476
+ break if fired
477
+ end
478
+ rescue StandardError => e
479
+ # Watchdog must never crash the SDK. Worst case we silently fall
480
+ # back to Net::HTTP's own (unreliable) read_timeout.
481
+ @logger.debug "SSE watchdog error: #{e.inspect}"
482
+ end
483
+ end
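
Reviewer note: the combination of `Thread#raise` from a watchdog thread and `Thread.handle_interrupt(... => :on_blocking)` in the worker can be tried in isolation. This is a minimal sketch under stated assumptions: `DeadlineExceeded` and `stalled_read_with_deadline` are hypothetical names, and `sleep` stands in for a `read_body` call that never receives data. It omits the mutex, the per-chunk reset and the polling loop of the real `ReadDeadlineWatchdog`.

```ruby
# Hypothetical error class for this sketch only.
class DeadlineExceeded < StandardError; end

# Runs a stand-in "stalled read" and lets a watchdog thread interrupt it.
def stalled_read_with_deadline(deadline_s)
  worker = Thread.current
  watchdog = Thread.new do
    sleep deadline_s
    # Queue an exception on the worker thread from outside.
    worker.raise(DeadlineExceeded, "no data for #{deadline_s}s")
  end
  # :on_blocking defers a queued raise until the thread is in a blocking
  # operation, which is where the stand-in read spends its time.
  Thread.handle_interrupt(DeadlineExceeded => :on_blocking) do
    sleep(deadline_s * 100) # stand-in for a read that never returns data
  end
  :completed
rescue DeadlineExceeded
  :timed_out
ensure
  watchdog&.kill
end
```

With a short deadline such as `stalled_read_with_deadline(0.05)`, the worker is expected to leave the sleep early through the rescue branch instead of waiting out the full stand-in read.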
484
+
485
+ # Streaming SSE parser. Accepts byte chunks (any encoding), yields one
486
+ # Quonfig::StreamEvent per complete event. Tolerates:
487
+ # - chunks that split a UTF-8 multi-byte character (buffer in 8-bit,
488
+ # transcode whole lines)
489
+ # - chunks that split a line mid-way
490
+ # - any of CR / LF / CRLF as line terminators
491
+ # - +data:+, +data: + (optional space per SSE spec)
492
+ # - +:comment+ lines (keepalives — ignored)
493
+ # - multi-line +data:+ (concatenated with +\n+, per spec)
494
+ # Ignores +event:+ and +retry:+ — api-delivery does not emit them and the
495
+ # Quonfig wire contract does not honor reconnect-time directives.
496
+ # Malformed +data:+ JSON is logged and skipped; one bad event does not
497
+ # tear down the stream.
498
+ class EventParser
499
+ def initialize(logger: nil)
500
+ @logger = logger
501
+ @reader = LineReader.new
502
+ @data = +''
503
+ @have_data = false
504
+ @id = nil
505
+ end
506
+
507
+ def feed(chunk)
508
+ @reader.feed(chunk) do |line|
509
+ if line.empty?
510
+ event = flush
511
+ yield event if event
512
+ elsif line.start_with?(':')
513
+ # comment / keepalive — ignore
514
+ else
515
+ process_field(line)
516
+ end
517
+ end
518
+ end
519
+
520
+ private
521
+
522
+ def process_field(line)
523
+ idx = line.index(':')
524
+ return unless idx
525
+
526
+ name = line[0...idx]
527
+ rest = line[(idx + 1)..]
528
+ rest = rest[1..] if rest.start_with?(' ')
529
+
530
+ case name
531
+ when 'data'
532
+ if @have_data
533
+ @data << "\n" << rest
534
+ else
535
+ @data = rest
536
+ @have_data = true
537
+ end
538
+ when 'id'
539
+ @id = rest unless rest.include?("\x00")
540
+ # event: / retry: are intentionally ignored
541
+ end
542
+ end
543
+
544
+ def flush
545
+ return nil unless @have_data
546
+
547
+ data = @data
548
+ id = @id
549
+ @data = +''
550
+ @have_data = false
551
+ # NB: @id persists across events — the SSE spec says last-event-id
552
+ # is sticky until overwritten. Matches ld-eventsource.
553
+
554
+ begin
555
+ parsed = JSON.parse(data)
556
+ rescue JSON::ParserError => e
557
+ (@logger || LOG).error "SSE Streaming Error: malformed JSON: #{e.message}"
558
+ return nil
559
+ end
560
+
561
+ envelope = Quonfig::ConfigEnvelope.new(
562
+ configs: parsed['configs'] || [],
563
+ meta: parsed['meta'] || {}
564
+ )
565
+ StreamEvent.new(envelope, id, data)
566
+ end
567
+ end
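
Reviewer note: the field handling in `EventParser` (optional space after the colon, multi-line `data:` joined with a newline, sticky `id`, ignored comments and unknown fields) can be illustrated over lines that have already been split. A simplified sketch; `parse_sse_lines` is a hypothetical function that returns `[id, data]` pairs and skips the JSON decoding and envelope construction the gem performs.

```ruby
# Illustrative, simplified parser over already-split lines. Returns an
# array of [id, data] pairs. Not the gem's EventParser.
def parse_sse_lines(lines)
  events = []
  data = nil
  id = nil
  lines.each do |line|
    if line.empty?
      # Blank line dispatches the event, but only if data was seen.
      events << [id, data] if data
      data = nil # id is deliberately kept: it is sticky across events
    elsif line.start_with?(':')
      next # comment / keepalive
    else
      name, sep, rest = line.partition(':')
      next if sep.empty? # no colon: ignored, mirroring the diff
      rest = rest[1..] if rest.start_with?(' ')
      case name
      when 'data' then data = data ? "#{data}\n#{rest}" : rest
      when 'id' then id = rest unless rest.include?("\x00")
      end
      # event: and retry: fall through and are ignored
    end
  end
  events
end
```

Feeding it a keepalive, an `id: 7` line, two `data:` lines and a blank line yields one event whose data contains an embedded newline; a following `data:x` event reuses id `7` because no new `id:` line was seen.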
568
+
569
+ # Byte-level line reader. Accepts arbitrary chunks, yields one UTF-8
570
+ # line per call to the block. Terminator-stripped (CR / LF / CRLF
571
+ # supported). Modeled on ld-eventsource's BufferedLineReader — same
572
+ # invariants: split bytes-not-chars while scanning, force-encode to
573
+ # UTF-8 only once a complete line is sliced out, so a multi-byte
574
+ # character spanning two chunks does not raise Encoding::CompatibilityError.
575
+ class LineReader
576
+ def initialize
577
+ @buffer = +''.b
578
+ @last_was_cr = false
579
+ end
580
+
581
+ def feed(chunk)
582
+ @buffer << chunk.b
583
+ loop do
584
+ idx = @buffer.index(/[\r\n]/)
585
+ break if idx.nil?
586
+
587
+ ch = @buffer[idx]
588
+ if idx.zero? && ch == "\n" && @last_was_cr
589
+ # Dangling LF of a CRLF pair split across chunks — consume and skip.
590
+ @last_was_cr = false
591
+ @buffer.slice!(0, 1)
592
+ next
593
+ end
594
+
595
+ line = @buffer[0, idx].force_encoding('UTF-8')
596
+ consume = idx + 1
597
+ @last_was_cr = false
598
+ if ch == "\r"
599
+ if consume == @buffer.bytesize
600
+ # CR at end of buffer — could be CRLF split across feeds.
601
+ @last_was_cr = true
602
+ elsif @buffer[consume] == "\n"
603
+ consume += 1
604
+ end
605
+ end
606
+ @buffer.slice!(0, consume)
607
+ yield line
608
+ end
609
+ end
299
610
  end
300
611
  end
301
612
  end
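
Reviewer note: the reason `LineReader` buffers in binary and only tags a sliced line as UTF-8 is easiest to see with a multi-byte character and a CRLF pair that are both split across chunks. The sketch below restates the same buffering idea in a compact form that returns the lines instead of yielding them; `TinyLineReader` is a hypothetical name and not the class from the diff.

```ruby
# Simplified restatement of the buffering idea, for illustration only.
class TinyLineReader
  def initialize
    @buffer = String.new(encoding: Encoding::BINARY)
    @skip_lf = false # true when the previous chunk ended on a bare CR
  end

  # Appends a chunk and returns every complete line found so far.
  def feed(chunk)
    lines = []
    @buffer << chunk.b
    while (idx = @buffer.index(/[\r\n]/))
      ch = @buffer[idx]
      if idx.zero? && ch == "\n" && @skip_lf
        # Second half of a CRLF pair split across chunks: drop it.
        @skip_lf = false
        @buffer.slice!(0, 1)
        next
      end
      @skip_lf = false
      # Only a complete line is tagged as UTF-8, so a character split
      # across chunks is never inspected while incomplete.
      line = @buffer[0, idx].force_encoding(Encoding::UTF_8)
      consume = idx + 1
      if ch == "\r"
        if consume == @buffer.bytesize
          @skip_lf = true
        elsif @buffer[consume] == "\n"
          consume += 1
        end
      end
      @buffer.slice!(0, consume)
      lines << line
    end
    lines
  end
end
```

Feeding `"data: caf\xC3"`, then `"\xA9\r"`, then `"\n\n"` produces the line `data: café` followed by one empty line, with the dangling LF of the split CRLF consumed silently.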
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Quonfig
4
- VERSION = '0.0.15'
4
+ VERSION = '0.0.16'
5
5
  end
data/lib/quonfig.rb CHANGED
@@ -17,7 +17,7 @@ require 'concurrent/atomics'
17
17
  require 'concurrent'
18
18
  require 'faraday'
19
19
  require 'openssl'
20
- require 'ld-eventsource'
20
+ require 'net/http'
21
21
 
22
22
  require 'quonfig/internal_logger'
23
23
  require 'quonfig/time_helpers'
data/quonfig.gemspec CHANGED
@@ -31,5 +31,4 @@ Gem::Specification.new do |s|
31
31
  s.add_dependency 'activesupport', '>= 4'
32
32
  s.add_dependency 'concurrent-ruby', '~> 1.0', '>= 1.0.5'
33
33
  s.add_dependency 'faraday', '>= 1.0'
34
- s.add_dependency 'ld-eventsource', '>= 2.0'
35
34
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: quonfig
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.15
4
+ version: 0.0.16
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jeff Dwyer
@@ -58,20 +58,6 @@ dependencies:
58
58
  - - ">="
59
59
  - !ruby/object:Gem::Version
60
60
  version: '1.0'
61
- - !ruby/object:Gem::Dependency
62
- name: ld-eventsource
63
- requirement: !ruby/object:Gem::Requirement
64
- requirements:
65
- - - ">="
66
- - !ruby/object:Gem::Version
67
- version: '2.0'
68
- type: :runtime
69
- prerelease: false
70
- version_requirements: !ruby/object:Gem::Requirement
71
- requirements:
72
- - - ">="
73
- - !ruby/object:Gem::Version
74
- version: '2.0'
75
61
  description: Quonfig — feature flags and live config, stored as files in git.
76
62
  email: jeff@quonfig.com
77
63
  executables: []