quonfig 0.0.14 → 0.0.15

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: b25ea20d7f44acff4ed82e17522a9fb6055791c4f1e0c861075974e5ae37421f
4
- data.tar.gz: e0c260d2d13926e21f2525c7686a24f8dec2f1fa998efa039db59baf4447cd60
3
+ metadata.gz: e4e037ad01a35ca5a3fb3ddcc30ad6b0dab78ad82e4908a4a8ce9e8bab6cab40
4
+ data.tar.gz: 8bcccb03befbab5f1fbed1cbae867ce970498ac0081c92e24db7d8eb899d2faa
5
5
  SHA512:
6
- metadata.gz: da91dbd4f9cc300f2dab9e8f39a73033e642d94272288cbcacf4358eb28f4f9b064f8fbe8301c5c26e1b342cd3cd76179d362029e06379bcac39685c3a050cb2
7
- data.tar.gz: ac77088e6a6e0256d947f40b26abb9527bb55cff8a3fa39eaaebf91c43746379d5fa2325bd06e049922b2cbc8521f78252bdd79106c6f1ae7f1a0264f4033ab6
6
+ metadata.gz: 9d4abdeaeaaad881e5f28cb9a653715dd8b1838ba33cc38b6b1f08db5f729173d5eadbf2afebfb6e3ca3a379f0354ab453fafd760a1fd61d13c3efef60ad0aee
7
+ data.tar.gz: 890131a3f75092f1b846ee4ca46c1dc20702b1effc3db5803443905d8a8571a33b672a691a18bbb0c3ad8471c5db72006a745f50ac0e919bf0997b49cf202045
data/CHANGELOG.md CHANGED
@@ -1,5 +1,11 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.0.15 - 2026-05-15
4
+
5
+ - **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
6
+ - **Fix (SSE): backoff reset interval (qfg-ie49).** New `sse_reconnect_reset_interval` option, default `1s`. ld-eventsource's 60s default lets the backoff run away under flapping — the SDK is mid-sleep when later kills land and never observes them. 1s mirrors sdk-python's reset-on-every-successful-connect behavior. Sustained outages still back off exponentially (`mark_success` is never called, so the reset never triggers).
7
+ - **Fix (SSE): make `ReconnectCountingLogger` raise-proof (qfg-cf52).** ld-eventsource calls the logger from inside a bare-`Thread` `run_stream` loop with several call sites unguarded by `rescue`. A throwing wrapper would kill the worker with `@stopped=false`, leaving `closed?` false forever — silently wedging the SSE stream (the intermittent chaos scenario 05 flake). Every wrapper step is now independently rescued.
8
+
3
9
  ## 0.0.14 - 2026-05-10
4
10
 
5
11
  - **Feat: expose `variant` and `flag_metadata` on `EvaluationDetails` (qfg-9dbl).** OpenFeature's `EvaluationDetails` Ruby return type now carries the variant name and the flag-level metadata hash alongside the resolved value/reason. Brings sdk-ruby to parity with the other SDKs' detail surfaces and lets host apps (incl. the Ruby OpenFeature provider) read variant/metadata without re-fetching the config.
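The effect of `sse_reconnect_reset_interval` described in the 0.0.15 entry can be illustrated with a small model. This is a simplified sketch, not ld-eventsource's implementation: it ignores jitter, and `reconnect_delays`, `BASE_DELAY`, and `MAX_DELAY` (including the 30s cap) are hypothetical names and values chosen for illustration.

```ruby
# Simplified model of exponential backoff with a reset interval.
# Assumptions: no jitter, 1s base delay, 30s cap (illustrative values only).
BASE_DELAY = 1.0
MAX_DELAY = 30.0

# uptimes: how long each connection stayed up (seconds) before it dropped.
# Returns the delay applied before each subsequent reconnect.
def reconnect_delays(uptimes, reset_interval:)
  attempts = 0
  uptimes.map do |uptime|
    # A connection that stayed healthy for at least reset_interval
    # resets the backoff to the base delay.
    attempts = 0 if uptime >= reset_interval
    delay = [BASE_DELAY * (2**attempts), MAX_DELAY].min
    attempts += 1
    delay
  end
end

flapping = [5, 5, 5, 5, 5] # each connection survives about 5s between kills
sustained = [0, 0, 0]      # no connection ever succeeds

p reconnect_delays(flapping, reset_interval: 60)  # delays keep doubling
p reconnect_delays(flapping, reset_interval: 1)   # delay stays at the base
p reconnect_delays(sustained, reset_interval: 1)  # still backs off
```

Under these assumptions, a 60s reset interval lets the delay double on every flap, while a 1s interval keeps it at the base delay. A sustained outage backs off either way because no connection stays up long enough to trigger the reset.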
data/README.md CHANGED
@@ -333,6 +333,24 @@ converge once the envelope finishes applying.
333
333
  `Quonfig.fork` is the only safe way to "carry" a client across `Process.fork`
334
334
  — do not reuse the parent's client in a child process.
335
335
 
336
+ ## Diagnostic health signals
337
+
338
+ `Quonfig::Client` exposes two read-only getters for monitoring SDK liveness:
339
+
340
+ - `client.last_successful_refresh` — a `Time` (UTC) marking the most recent
341
+ envelope install (any source: datadir, initial HTTP fetch, SSE, or fallback
342
+ polling). Returns `nil` before the first install. Preserved across `stop`.
343
+ - `client.connection_state` — a `Symbol` describing the aggregate state:
344
+ `:initializing`, `:connected`, `:disconnected`, or `:falling_back`.
345
+
346
+ > Do not wire `last_successful_refresh` or `connection_state` directly into a Kubernetes liveness probe. These signals are diagnostic, not pass/fail. A liveness probe based on SDK freshness will amplify transient network blips into restart cascades.
347
+
348
+ Compose your own threshold from the two getters if you need a dashboard signal
349
+ — but route alerts through a metrics pipeline, not a probe that restarts the
350
+ process.
351
+
352
+ There is intentionally no `client.healthy?` primitive.
353
+
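One possible way to follow the advice in this section is to sample both getters into plain values and hand them to whatever metrics pipeline the host app already uses. The sketch below is illustrative only: `quonfig_health_sample` and `FakeClient` are hypothetical names, and the `Struct` stands in for a real `Quonfig::Client` so the example runs without the gem.

```ruby
# Stand-in for Quonfig::Client: anything that responds to
# last_successful_refresh and connection_state works here.
FakeClient = Struct.new(:last_successful_refresh, :connection_state)

# Hypothetical helper: turns the two getters into dashboard-friendly values.
def quonfig_health_sample(client, now: Time.now.utc)
  refreshed_at = client.last_successful_refresh
  {
    connection_state: client.connection_state,
    # nil until the first envelope install, so report nil rather than 0.
    seconds_since_refresh: refreshed_at ? (now - refreshed_at).round(1) : nil
  }
end

now = Time.utc(2026, 5, 15, 12, 0, 0)
sample = quonfig_health_sample(FakeClient.new(now - 42, :falling_back), now: now)
p sample
```

The sample is meant to be emitted as gauges or tags and alerted on downstream. Nothing in it decides whether the process should be restarted.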
336
354
  ## Documentation
337
355
 
338
356
  Full documentation, including SPEC, SDK reference, and operational guides, is
@@ -40,9 +40,14 @@ module Quonfig
40
40
  @resolver = Quonfig::Resolver.new(@store, @evaluator)
41
41
  @semantic_logger_filters = {}
42
42
  @sse_client = nil
43
- @poll_thread = nil
43
+ @poll_supervisor = nil
44
44
  @stopped = false
45
45
  @telemetry_reporter = nil
46
+ @state_mutex = Mutex.new
47
+ @last_successful_refresh = nil
48
+ @sse_state = :idle
49
+ @sse_ever_connected = false
50
+ @fallback_engage_timer = nil
46
51
 
47
52
  # If the caller injected a store, we're in test/bootstrap mode; skip I/O.
48
53
  return if store
@@ -266,9 +271,14 @@ module Quonfig
266
271
  end
267
272
  @sse_client = nil
268
273
 
269
- thread = @poll_thread
270
- @poll_thread = nil
271
- thread&.kill
274
+ cancel_fallback_engage_timer
275
+
276
+ begin
277
+ @poll_supervisor&.stop
278
+ rescue StandardError => e
279
+ LOG.debug "Error stopping poll supervisor: #{e.message}"
280
+ end
281
+ @poll_supervisor = nil
272
282
 
273
283
  begin
274
284
  @telemetry_reporter&.stop
@@ -278,6 +288,65 @@ module Quonfig
278
288
  @telemetry_reporter = nil
279
289
  end
280
290
 
291
+ # quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
292
+ # Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
293
+ # incremented on every on_error edge from ld-eventsource (qfg-ll6r).
294
+ # Layer 2 (HTTP polling fallback) is wired through Quonfig::WorkerSupervisor.
295
+ #
296
+ # Pass +layer:+ ('1' or '2') to read a single layer; default returns the
297
+ # sum across both layers so the chaos harness (and operators) can pull
298
+ # per-layer values explicitly while preserving the previous single-number
299
+ # diagnostic surface.
300
+ def worker_restart_total(layer: nil)
301
+ case layer&.to_s
302
+ when '1' then sse_restart_total
303
+ when '2' then poll_restart_total
304
+ else sse_restart_total + poll_restart_total
305
+ end
306
+ end
307
+
308
+ # Wall-clock time of the last installed envelope (any source: datadir,
309
+ # initial HTTP fetch, SSE, or polling fallback). +nil+ before the first
310
+ # install. Preserved after +stop+.
311
+ #
312
+ # **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
313
+ # — a transient network blip will trip any freshness threshold and cause
314
+ # a rolling restart cascade. See the README "Diagnostic health signals"
315
+ # section.
316
+ #
317
+ # Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
318
+ def last_successful_refresh
319
+ @state_mutex.synchronize { @last_successful_refresh }
320
+ end
321
+
322
+ # Aggregate connection state. Returns one of:
323
+ #
324
+ # - +:initializing+ — no envelope has been installed and SSE is not yet
325
+ # connected.
326
+ # - +:connected+ — SSE is live, or the SDK is delivering configs from a
327
+ # loaded envelope (datadir mode or post-initial-fetch with no SSE).
328
+ # - +:disconnected+ — +stop+ was called, or SSE errored and no fallback
329
+ # poller is active.
330
+ # - +:falling_back+ — the Layer 2 HTTP polling supervisor is alive and
331
+ # serving as the active update channel.
332
+ #
333
+ # **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
334
+ # — see the README "Diagnostic health signals" section.
335
+ #
336
+ # Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
337
+ def connection_state
338
+ @state_mutex.synchronize do
339
+ next :disconnected if @stopped
340
+ next :falling_back if @poll_supervisor&.alive?
341
+ next :connected if @sse_state == :connected
342
+ next :disconnected if @sse_state == :error
343
+
344
+ # No SSE state change yet: state is driven by whether any envelope
345
+ # has been installed (datadir / initial fetch).
346
+ @last_successful_refresh.nil? ? :initializing : :connected
347
+ end
348
+ end
349
+
281
350
  def fork
282
351
  self.class.new(@options.for_fork)
283
352
  end
@@ -288,6 +357,128 @@ module Quonfig
288
357
 
289
358
  private
290
359
 
360
+ # Stamp +last_successful_refresh+ at install time. Called by every code
361
+ # path that hands an envelope to the cache: datadir load, initial HTTP
362
+ # fetch, SSE event apply, and polling worker fetch.
363
+ def record_refresh!
364
+ @state_mutex.synchronize { @last_successful_refresh = Time.now.utc }
365
+ end
366
+
367
+ def sse_restart_total
368
+ sse = @sse_client
369
+ return 0 if sse.nil?
370
+ return 0 unless sse.respond_to?(:restart_total)
371
+
372
+ sse.restart_total.to_i
373
+ end
374
+
375
+ def poll_restart_total
376
+ sup = @poll_supervisor
377
+ return 0 if sup.nil?
378
+ return 0 unless sup.respond_to?(:worker_restart_total)
379
+
380
+ sup.worker_restart_total.to_i
381
+ end
382
+
383
+ # Drive the SSE-side of the connection_state machine. The SSE client
384
+ # invokes this on connect/error edges; tests call it directly via +send+.
385
+ # Documented values: :idle, :connecting, :connected, :error.
386
+ #
387
+ # Also drives the Layer 2 fallback poller's engage/disengage:
388
+ # - :connected clears any pending engage timer and stops an active
389
+ # fallback poller (SSE recovered, drop the second channel).
390
+ # - :error before any successful connect engages immediately
391
+ # (initial-fail path).
392
+ # - :error after a successful connect schedules a 2x-poll-interval
393
+ # grace timer; the timer engages if SSE has not recovered by then.
394
+ # Mirrors sdk-python's `_handle_sse_state_change` and sdk-node's
395
+ # `fallbackPollerActive` engagement behavior. (qfg-47c2.26)
396
+ # Stable callable handed to Quonfig::SSEConfigClient so its +on_error+
397
+ # block can drive @sse_state -> :error on a mid-run socket drop. Without
398
+ # this wiring, +connection_state+ would stay +:connected+ after a
399
+ # disconnect and customers composing staleness checks would see stale
400
+ # data. (qfg-47c2.27)
401
+ def sse_error_callback
402
+ @sse_error_callback ||= ->(error) { handle_sse_error(error) }
403
+ end
404
+
405
+ def handle_sse_error(_error)
406
+ handle_sse_state_change(:error)
407
+ end
408
+
409
+ def handle_sse_state_change(new_state)
410
+ state = new_state.to_sym
411
+ ever_connected = @state_mutex.synchronize do
412
+ @sse_state = state
413
+ @sse_ever_connected = true if state == :connected
414
+ @sse_ever_connected
415
+ end
416
+
417
+ return unless @options.respond_to?(:enable_polling) && @options.enable_polling
418
+ return if @stopped
419
+
420
+ case state
421
+ when :connected
422
+ cancel_fallback_engage_timer
423
+ stop_fallback_poller('sse-recovered')
424
+ when :error
425
+ if ever_connected
426
+ schedule_fallback_engage
427
+ else
428
+ start_polling
429
+ end
430
+ end
431
+ end
432
+
433
+ def cancel_fallback_engage_timer
434
+ timer = @state_mutex.synchronize do
435
+ t = @fallback_engage_timer
436
+ @fallback_engage_timer = nil
437
+ t
438
+ end
439
+ timer&.kill if timer&.alive?
440
+ end
441
+
442
+ def stop_fallback_poller(reason)
443
+ supervisor = @state_mutex.synchronize do
444
+ s = @poll_supervisor
445
+ @poll_supervisor = nil
446
+ s
447
+ end
448
+ return if supervisor.nil?
449
+
450
+ begin
451
+ supervisor.stop
452
+ LOG.debug "[quonfig] Layer 2 fallback poller stopped (reason=#{reason})"
453
+ rescue StandardError => e
454
+ LOG.debug "Error stopping fallback poller: #{e.message}"
455
+ end
456
+ end
457
+
458
+ # Schedule a 2*poll_interval grace timer after a connected->error edge.
459
+ # If SSE recovers before the timer fires, +cancel_fallback_engage_timer+
460
+ # tears it down. Idempotent — does nothing if a timer is already pending
461
+ # or the supervisor is already alive.
462
+ def schedule_fallback_engage
463
+ poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
464
+ return if poll_interval <= 0
465
+
466
+ grace_seconds = poll_interval * 2.0
467
+
468
+ @state_mutex.synchronize do
469
+ return if @fallback_engage_timer&.alive?
470
+ return if @poll_supervisor&.alive?
471
+ return if @stopped
472
+
473
+ @fallback_engage_timer = Thread.new do
474
+ Thread.current.report_on_exception = false
475
+ sleep grace_seconds
476
+ @state_mutex.synchronize { @fallback_engage_timer = nil }
477
+ start_polling unless @stopped
478
+ end
479
+ end
480
+ end
481
+
291
482
  # Construct and start the telemetry reporter if the options permit it.
292
483
  # The reporter runs on a background thread and periodically POSTs
293
484
  # context-shape and example-context batches to +telemetry_destination+.
@@ -378,6 +569,7 @@ module Quonfig
378
569
  def load_datadir_into_store
379
570
  envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
380
571
  envelope.configs.each { |cfg| @store.set(cfg['key'], cfg) }
572
+ record_refresh!
381
573
  end
382
574
 
383
575
  # Initialize network mode: sync HTTP fetch (bounded by
@@ -412,7 +604,11 @@ module Quonfig
412
604
  return
413
605
  end
414
606
 
415
- handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls')) if result == :failed
607
+ if result == :failed
608
+ handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls'))
609
+ else
610
+ record_refresh!
611
+ end
416
612
  end
417
613
 
418
614
  def handle_init_failure(err)
@@ -429,44 +625,79 @@ module Quonfig
429
625
  def start_sse
430
626
  return false if @options.sse_api_urls.nil? || @options.sse_api_urls.empty?
431
627
 
432
- @sse_client = Quonfig::SSEConfigClient.new(@options, @config_loader)
628
+ @sse_client = Quonfig::SSEConfigClient.new(
629
+ @options,
630
+ @config_loader,
631
+ nil,
632
+ nil,
633
+ on_error: sse_error_callback
634
+ )
433
635
  @sse_client.start do |envelope, _event, _source|
434
636
  next if @stopped
435
637
 
436
638
  begin
437
639
  @config_loader.apply_envelope(envelope)
438
- @on_update&.call
640
+ handle_sse_state_change(:connected)
641
+ record_refresh!
439
642
  rescue StandardError => e
440
643
  LOG.warn "[quonfig] Error applying SSE envelope: #{e.message}"
644
+ next
441
645
  end
646
+ notify_on_update_callback
442
647
  end
443
648
  true
444
649
  rescue StandardError => e
445
650
  LOG.warn "[quonfig] SSE start failed: #{e.message}"
446
651
  @sse_client = nil
652
+ handle_sse_state_change(:error)
447
653
  false
448
654
  end
449
655
 
450
656
  def start_polling
657
+ return if @stopped
658
+ return if @poll_supervisor&.alive?
659
+
451
660
  poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
452
661
  return if poll_interval <= 0
453
662
 
454
- @poll_thread = Thread.new do
455
- Thread.current.name = 'quonfig-poller'
663
+ stopped_ref = -> { @stopped }
664
+ worker = lambda do |notify_delivered|
456
665
  loop do
457
- break if @stopped
666
+ break if stopped_ref.call
458
667
 
459
668
  sleep poll_interval
460
- break if @stopped
461
-
462
- begin
463
- @config_loader.fetch!
464
- @on_update&.call
465
- rescue StandardError => e
466
- LOG.warn "[quonfig] Polling error: #{e.message}"
467
- end
669
+ break if stopped_ref.call
670
+
671
+ @config_loader.fetch!
672
+ record_refresh!
673
+ notify_delivered.call
674
+ notify_on_update_callback
468
675
  end
469
676
  end
677
+
678
+ supervisor = Quonfig::WorkerSupervisor.new(
679
+ name: 'poll', layer: '2', worker: worker
680
+ )
681
+ @state_mutex.synchronize { @poll_supervisor = supervisor }
682
+ supervisor.start
683
+ end
684
+
685
+ # Invoke the customer-supplied on_update callback under a rescue. A raise
686
+ # here is the customer's bug, but it must NOT take down the SSE listener
687
+ # or polling supervisor. Log at ERROR with a message containing
688
+ # "onConfigUpdate callback" so chaos scenario 10's
689
+ # sdkLog('error', /callback|onConfigUpdate/i) assertion matches and so
690
+ # the message is distinguishable from internal envelope-apply errors
691
+ # (qfg-47c2.30).
692
+ def notify_on_update_callback
693
+ cb = @on_update
694
+ return unless cb
695
+
696
+ begin
697
+ cb.call
698
+ rescue StandardError => e
699
+ LOG.error "[quonfig] onConfigUpdate callback raised: #{e.class}: #{e.message}"
700
+ end
470
701
  end
471
702
 
472
703
  def build_context(jit_context)
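The engage/disengage rules that `handle_sse_state_change` and `schedule_fallback_engage` implement in the hunks above can be summarized as a pure decision function. This is a sketch for reading along, not code from the gem: `fallback_action` and its return symbols are hypothetical, and it leaves out the `enable_polling`, `@stopped`, and already-running checks the real methods perform.

```ruby
# Hypothetical summary of the Layer 2 fallback decision.
# Returns [action, delay_seconds].
def fallback_action(sse_state, ever_connected:, poll_interval: 60)
  case sse_state
  when :connected
    # SSE recovered: cancel any pending timer and stop the poller.
    [:stop_poller, nil]
  when :error
    if ever_connected
      # Mid-run drop: wait a 2x-poll-interval grace period first.
      [:engage_after_grace, poll_interval * 2.0]
    else
      # Initial connect failed: start polling immediately.
      [:engage_now, 0.0]
    end
  else
    [:none, nil]
  end
end

p fallback_action(:error, ever_connected: false)
p fallback_action(:error, ever_connected: true, poll_interval: 30)
p fallback_action(:connected, ever_connected: true)
```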
@@ -11,14 +11,16 @@ module Quonfig
11
11
  # <datadir>/configs/*.json
12
12
  # <datadir>/feature-flags/*.json
13
13
  # <datadir>/segments/*.json
14
- # <datadir>/schemas/*.json
15
14
  # <datadir>/log-levels/*.json
16
15
  #
16
+ # schemas/ is intentionally excluded — those files are raw JSON Schema
17
+ # documents, not Configs, and SDKs do not consume them (qfg-uzsl).
18
+ #
17
19
  # Each <type>/*.json file is a WorkspaceConfigDocument. The loader projects
18
20
  # it down to the ConfigResponse shape that the SSE/HTTP delivery path emits,
19
21
  # so ConfigStore consumes both transports uniformly.
20
22
  module Datadir
21
- CONFIG_SUBDIRS = %w[configs feature-flags segments schemas log-levels].freeze
23
+ CONFIG_SUBDIRS = %w[configs feature-flags segments log-levels].freeze
22
24
 
23
25
  module_function
24
26
 
@@ -36,7 +38,10 @@ module Quonfig
36
38
  .select { |name| name.end_with?('.json') }
37
39
  .sort
38
40
  .each do |filename|
39
- raw = JSON.parse(File.read(File.join(dir, filename)))
41
+ path = File.join(dir, filename)
42
+ raw = JSON.parse(File.read(path))
43
+ raise ArgumentError, "[quonfig] config has empty key — file is not a Quonfig Config: #{path}" if raw['key'].nil? || raw['key'].to_s.empty?
44
+
40
45
  configs << to_config_response(raw, env_id)
41
46
  end
42
47
  end
@@ -5,19 +5,99 @@ require 'json'
5
5
 
6
6
  module Quonfig
7
7
  class SSEConfigClient
8
+ # ld-eventsource auto-reconnects on a clean socket EOF (server FIN)
9
+ # *internally* — it never calls +on_error+ for that case, only for
10
+ # ECONNREFUSED-style failures (qfg-ie49; see chaos scenario 09). The one
11
+ # signal it emits for any reconnect is an info-level
12
+ # "Will retry connection after ..." line, logged once per reconnect attempt
13
+ # and never on the first connect. Wrapping the logger we hand to
14
+ # SSE::Client lets the SDK observe those internal reconnects without
15
+ # touching the data path. This is the only reconnect hook ld-eventsource
16
+ # >= 2.0 exposes.
17
+ class ReconnectCountingLogger
18
+ RECONNECT_SIGNAL = 'Will retry connection after'
19
+
20
+ LEVELS = %i[trace debug info warn error fatal].freeze
21
+
22
+ def initialize(wrapped, &on_reconnect)
23
+ @wrapped = wrapped
24
+ @on_reconnect = on_reconnect
25
+ end
26
+
27
+ # Crash-safe by construction: ld-eventsource calls this logger from
28
+ # inside its bare-Thread +run_stream+ loop, and several of those call
29
+ # sites (+connect+, +log_and_dispatch_error+, query-param building) are
30
+ # NOT wrapped in a rescue. Any exception that escapes a logger call kills
31
+ # the worker thread with +@stopped+ still false, so +closed?+ never flips
32
+ # true and the SDK's @retry_thread never reconnects — the SSE stream is
33
+ # silently wedged forever (qfg-cf52, the chaos scenario 05 flake). Every
34
+ # step here is therefore independently guarded: a throwing message block,
35
+ # a throwing on_reconnect callback, or a throwing wrapped logger can
36
+ # never propagate out of this method.
37
+ LEVELS.each do |level|
38
+ define_method(level) do |message = nil, &block|
39
+ begin
40
+ message = block.call if message.nil? && block
41
+ rescue StandardError
42
+ message = nil
43
+ end
44
+
45
+ if level == :info && message.to_s.include?(RECONNECT_SIGNAL)
46
+ begin
47
+ @on_reconnect.call
48
+ rescue StandardError
49
+ nil
50
+ end
51
+ end
52
+
53
+ begin
54
+ @wrapped.public_send(level, message) if @wrapped.respond_to?(level)
55
+ rescue StandardError
56
+ nil
57
+ end
58
+ end
59
+ end
60
+
61
+ def level
62
+ @wrapped&.level
63
+ end
64
+
65
+ def level=(new_level)
66
+ @wrapped.level = new_level if @wrapped.respond_to?(:level=)
67
+ end
68
+ end
69
+
8
70
  class Options
9
71
  attr_reader :sse_read_timeout, :seconds_between_new_connection,
10
72
  :sse_default_reconnect_time, :sleep_delay_for_new_connection_check,
11
- :errors_to_close_connection
73
+ :errors_to_close_connection, :sse_reconnect_reset_interval
12
74
 
13
- def initialize(sse_read_timeout: 300,
75
+ # sse_read_timeout: 90s = 3x the 30s server heartbeat. A silent socket
76
+ # stall trips the read deadline within one missed-heartbeat window
77
+ # rather than the previous 5-minute idle. See plan
78
+ # `project/plans/sdk-hardening-and-verification.md` Layer 1.
79
+ #
80
+ # sse_reconnect_reset_interval: 1s (ld-eventsource default is 60s). The
81
+ # ld-eventsource backoff only resets to the base interval once a
82
+ # connection has stayed up this long; until then each reconnect doubles
83
+ # the delay (1s, 2s, 4s, 8s...). With the 60s default, a flapping
84
+ # connection (chaos scenario 09 — proxy killed every 6s) backs off so
85
+ # fast the SDK is mid-sleep when the next kill lands and never observes
86
+ # it. Resetting after 1s of healthy connection mirrors sdk-python, which
87
+ # resets its backoff on every successful connect (sdk-python/quonfig/
88
+ # sse.py). A *sustained* outage still backs off exponentially: no
89
+ # connection succeeds, so `mark_success` is never called and the reset
90
+ # never triggers (qfg-ie49).
91
+ def initialize(sse_read_timeout: 90,
14
92
  seconds_between_new_connection: 5,
15
93
  sleep_delay_for_new_connection_check: 1,
16
94
  sse_default_reconnect_time: SSE::Client::DEFAULT_RECONNECT_TIME,
95
+ sse_reconnect_reset_interval: 1,
17
96
  errors_to_close_connection: [HTTP::ConnectionError])
18
97
  @sse_read_timeout = sse_read_timeout
19
98
  @seconds_between_new_connection = seconds_between_new_connection
20
99
  @sse_default_reconnect_time = sse_default_reconnect_time
100
+ @sse_reconnect_reset_interval = sse_reconnect_reset_interval
21
101
  @sleep_delay_for_new_connection_check = sleep_delay_for_new_connection_check
22
102
  @errors_to_close_connection = errors_to_close_connection
23
103
  end
@@ -25,12 +105,46 @@ module Quonfig
25
105
 
26
106
  LOG = Quonfig::InternalLogger.new(self)
27
107
 
28
- def initialize(prefab_options, config_loader, options = nil, logger = nil)
108
+ # +on_error+: optional callable invoked on every SSE error edge. Parent
109
+ # Quonfig::Client wires this to drive @sse_state -> :error so that
110
+ # +connection_state+ reflects the disconnect (qfg-47c2.27). Without it
111
+ # the SDK's public health primitive would lie about its own state during
112
+ # a mid-run socket drop.
113
+ def initialize(prefab_options, config_loader, options = nil, logger = nil, on_error: nil)
29
114
  @prefab_options = prefab_options
30
115
  @options = options || Options.new
31
116
  @config_loader = config_loader
32
117
  @connected = false
33
118
  @logger = logger || LOG
119
+ @on_error = on_error
120
+ @restart_total = 0
121
+ @restart_mutex = Mutex.new
122
+ end
123
+
124
+ # qfg-ll6r / qfg-ie49: Layer 1 (SSE) restart counter — counts every
125
+ # *reconnect*, from two sources:
126
+ # 1. ld-eventsource's own internal reconnect (clean FIN, read timeout,
127
+ # transient errors it doesn't surface) — observed via the
128
+ # ReconnectCountingLogger "Will retry connection after" signal.
129
+ # 2. SDK-driven reconnects in @retry_thread, after a closing error
130
+ # (HTTP::ConnectionError) made us close the SSE::Client outright.
131
+ # These two are mutually exclusive per disconnect, so there is no
132
+ # double-count. on_error is deliberately NOT a source — ld-eventsource
133
+ # reconnects internally after most non-closing errors, so counting the
134
+ # error edge AND the reconnect would double up (qfg-ie49).
135
+ #
136
+ # The chaos harness pulls this via Client#worker_restart_total(layer: '1')
137
+ # so kill-storm scenarios (e.g. scenario 09 — proxy killed 5x in 30s) can
138
+ # assert restart_total >= 5 even when the kills produce clean FINs that
139
+ # never reach on_error.
140
+ def restart_total
141
+ @restart_mutex.synchronize { @restart_total }
142
+ end
143
+
144
+ # Bump the Layer 1 reconnect counter. Called from the ld-eventsource
145
+ # worker thread (via ReconnectCountingLogger) and from @retry_thread.
146
+ def count_restart!
147
+ @restart_mutex.synchronize { @restart_total += 1 }
34
148
  end
35
149
 
36
150
  def close
@@ -60,6 +174,11 @@ module Quonfig
60
174
 
61
175
  closed_count = 0
62
176
  @logger.debug 'Reconnecting SSE client'
177
+ # SDK-driven reconnect: a closing error (HTTP::ConnectionError)
178
+ # closed the previous SSE::Client, so ld-eventsource's own
179
+ # reconnect loop has exited and won't emit the "Will retry" signal.
180
+ # Count it here instead (qfg-ie49).
181
+ count_restart!
63
182
  @client = connect(&load_configs)
64
183
  end
65
184
  end
@@ -70,12 +189,20 @@ module Quonfig
70
189
  cursor = current_cursor
71
190
  @logger.debug "SSE Streaming Connect to #{url} start_at #{cursor.inspect}"
72
191
 
192
+ # Wrap the ld-eventsource logger so internal reconnects (clean FIN,
193
+ # read-timeout, transient errors) bump restart_total — they never reach
194
+ # on_error (qfg-ie49).
195
+ sse_logger = ReconnectCountingLogger.new(
196
+ Quonfig::InternalLogger.new(SSE::Client)
197
+ ) { count_restart! }
198
+
73
199
  SSE::Client.new(url,
74
200
  headers: headers,
75
201
  read_timeout: @options.sse_read_timeout,
76
202
  reconnect_time: @options.sse_default_reconnect_time,
203
+ reconnect_reset_interval: @options.sse_reconnect_reset_interval,
77
204
  last_event_id: cursor,
78
- logger: Quonfig::InternalLogger.new(SSE::Client)) do |client|
205
+ logger: sse_logger) do |client|
79
206
  client.on_event do |event|
80
207
  if event.data.nil? || event.data.empty?
81
208
  @logger.error "SSE Streaming Error: Received empty data for url #{url}"
@@ -106,6 +233,25 @@ module Quonfig
106
233
  @logger.error "SSE Streaming Error: #{error.inspect} for url #{url}"
107
234
  end
108
235
 
236
+ # qfg-ie49: restart_total is NOT bumped here. ld-eventsource
237
+ # auto-reconnects after most non-closing errors, and that reconnect
238
+ # is already counted via ReconnectCountingLogger; bumping here too
239
+ # would double-count. For closing errors (HTTP::ConnectionError) the
240
+ # reconnect is counted in @retry_thread instead. on_error's job is
241
+ # purely to notify the parent client of the disconnect edge.
242
+
243
+ # Notify the parent client BEFORE deciding whether to close — every
244
+ # error edge is a disconnect signal as far as @sse_state goes, even
245
+ # if we let the underlying SSE library handle reconnect itself.
246
+ # qfg-47c2.27
247
+ if @on_error
248
+ begin
249
+ @on_error.call(error)
250
+ rescue StandardError => e
251
+ @logger.error "SSE on_error callback raised: #{e.inspect}"
252
+ end
253
+ end
254
+
109
255
  if @options.errors_to_close_connection.any? { |klass| error.is_a?(klass) }
110
256
  @logger.debug "Closing SSE connection for url #{url}"
111
257
  client.close
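The pass-through wrapper pattern used by `ReconnectCountingLogger` can be tried in isolation with the standard library logger. The class below is a compact stand-in written for this example (info level only), not the gem's class. `CountingLogger` and `safely` are hypothetical names, and the log messages are illustrative text rather than captured ld-eventsource output.

```ruby
require 'logger'
require 'stringio'

# Minimal stand-in for the counting wrapper: forwards info lines to the
# wrapped logger and fires a callback when the reconnect signal appears.
class CountingLogger
  SIGNAL = 'Will retry connection after'

  def initialize(wrapped, &on_match)
    @wrapped = wrapped
    @on_match = on_match
  end

  def info(message = nil, &block)
    # Each step is guarded separately so nothing can raise into the caller.
    message = safely { block.call } if message.nil? && block
    safely { @on_match.call } if message.to_s.include?(SIGNAL)
    safely { @wrapped.info(message) }
    nil
  end

  private

  def safely
    yield
  rescue StandardError
    nil
  end
end

sink = StringIO.new
reconnects = 0
logger = CountingLogger.new(Logger.new(sink)) { reconnects += 1 }

logger.info { 'Connecting to event stream' }
logger.info { 'Will retry connection after 1.000 seconds' }
logger.info { raise 'boom' } # a raising message block is swallowed

puts "reconnects counted: #{reconnects}"
```

Only the second call matches the signal, so the counter ends at 1. The third call shows the raise-proofing: the exception from the message block never reaches the caller.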
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Quonfig
4
- VERSION = '0.0.14'
4
+ VERSION = '0.0.15'
5
5
  end
@@ -0,0 +1,186 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Quonfig
4
+ # Internal control-flow exception raised inside a supervised worker thread
5
+ # to signal cooperative shutdown. Workers may catch and re-raise, or just
6
+ # propagate.
7
+ class Shutdown < StandardError; end
8
+
9
+ # Single supervisor for a long-lived background worker (SSE read loop,
10
+ # fallback poller). Catches unhandled exceptions at the worker boundary,
11
+ # logs them, increments +worker_restart_total+, and restarts with
12
+ # exponential backoff capped at 30s.
13
+ #
14
+ # Contract: integration-test-data/chaos/supervisor-test-contract.md
15
+ # Plan: project/plans/sdk-hardening-and-verification.md (Phase 1)
16
+ #
17
+ # The worker is a Proc-like callable invoked as +worker.call(notify_delivered)+
18
+ # where +notify_delivered+ is a Proc the worker calls when it has handed at
19
+ # least one envelope to the cache. That signal resets the backoff so a
20
+ # transient blip doesn't double the delay on the next disconnect.
21
+ #
22
+ # Shutdown is signaled by Thread#raise(Quonfig::Shutdown) into the
23
+ # supervisor thread. Logger writes and bookkeeping use Thread.handle_interrupt
24
+ # so a concurrent raise doesn't trip Ruby's "log writing failed" path.
25
+ class WorkerSupervisor
26
+ METRIC_NAME = 'quonfig_sdk_worker_restart_total'
27
+
28
+ DEFAULT_INITIAL_BACKOFF = 0.5
29
+ DEFAULT_MAX_BACKOFF = 30.0
30
+ DEFAULT_MULTIPLIER = 2.0
31
+ SHUTDOWN_TIMEOUT_SEC = 5.0
32
+
33
+ LOG = Quonfig::InternalLogger.new(self)
34
+
35
+ attr_reader :worker_restart_total, :worker_restart_labels
36
+
37
+ def initialize(name:, worker:, layer: '1',
38
+ initial_backoff: DEFAULT_INITIAL_BACKOFF,
39
+ max_backoff: DEFAULT_MAX_BACKOFF,
40
+ multiplier: DEFAULT_MULTIPLIER,
41
+ sleep_proc: nil,
42
+ logger: nil)
43
+ @name = name
44
+ @layer = layer.to_s
45
+ @worker = worker
46
+ @initial_backoff = initial_backoff
47
+ @max_backoff = max_backoff
48
+ @multiplier = multiplier
49
+ @sleep_proc = sleep_proc || ->(seconds) { sleep(seconds) }
50
+ @logger = logger || LOG
51
+ @worker_restart_total = 0
52
+ @worker_restart_labels = {
53
+ sdk: 'ruby',
54
+ sdk_version: Quonfig::VERSION,
55
+ layer: @layer
56
+ }.freeze
57
+ @mutex = Mutex.new
58
+ @stop_requested = false
59
+ @thread = nil
60
+ @current_backoff = @initial_backoff
61
+ end
62
+
63
+ def start
64
+ @mutex.synchronize do
65
+ return self if @thread&.alive?
66
+
67
+ @stop_requested = false
68
+ ready = Queue.new
69
+ @thread = Thread.new do
70
+ # Set report_on_exception + signal "ready" BEFORE entering
71
+ # run_loop. start() blocks on the ready queue so a racing stop()
72
+ # can never raise into a thread that hasn't yet installed its
73
+ # Shutdown rescue.
74
+ Thread.current.report_on_exception = false
75
+ ready << true
76
+ run_loop
77
+ rescue Quonfig::Shutdown
78
+ # cooperative shutdown raced with thread startup; swallowed
79
+ end
80
+ ready.pop
81
+ end
82
+ self
83
+ end
84
+
85
+ def alive?
86
+ t = @thread
87
+ !t.nil? && t.alive?
88
+ end
89
+
90
+ def stop
91
+ thread = @mutex.synchronize do
92
+ @stop_requested = true
93
+ t = @thread
94
+ @thread = nil
95
+ t
96
+ end
97
+ return if thread.nil?
98
+
99
+ raise_shutdown(thread)
100
+ thread.join(SHUTDOWN_TIMEOUT_SEC)
101
+ thread.kill if thread.alive?
102
+ nil
103
+ end
104
+
105
+ alias close stop
106
+
107
+ private
108
+
109
+ def raise_shutdown(thread)
110
+ return if thread.nil?
111
+ return unless thread.alive?
112
+
113
+ begin
114
+ thread.raise(Quonfig::Shutdown.new('supervisor stopping'))
115
+ rescue ThreadError
116
+ # thread already exited between alive? and raise — fine
117
+ end
118
+ end
119
+
120
+ def run_loop
121
+ Thread.current.name = "quonfig-supervisor-#{@name}"
122
+ # Don't dump our managed Shutdown to stderr on shutdown.
123
+ Thread.current.report_on_exception = false
124
+
125
+ loop do
126
+ break if stop?
127
+
128
+ delivered = false
129
+ notify_delivered = -> { delivered = true }
130
+ reason = :worker_exit
131
+
132
+ begin
133
+ @worker.call(notify_delivered)
134
+ rescue Quonfig::Shutdown
135
+ break
136
+ rescue StandardError => e
137
+ reason = :worker_throw
138
+ safe_log(:error,
139
+ "[quonfig] supervisor=#{@name} worker raised #{e.class}: #{e.message}")
140
+ bt = e.backtrace&.first(10)&.join("\n")
141
+ safe_log(:debug, bt) if bt
142
+ end
143
+
144
+ break if stop?
145
+
146
+ @worker_restart_total += 1
147
+ @current_backoff = @initial_backoff if delivered
148
+ backoff = @current_backoff
149
+
150
+ safe_log(:warn,
151
+ "[quonfig] supervisor=#{@name} restarting worker " \
152
+ "(reason=#{reason}, restart_total=#{@worker_restart_total}, " \
153
+ "backoff_s=#{backoff})")
154
+
155
+ begin
156
+ @sleep_proc.call(backoff)
157
+ rescue Quonfig::Shutdown
158
+ break
159
+ end
160
+
161
+ @current_backoff = [@current_backoff * @multiplier, @max_backoff].min
162
+ end
163
+ rescue Quonfig::Shutdown
164
+ # supervisor-level cooperative shutdown
165
+ rescue StandardError => e
166
+ safe_log(:error, "[quonfig] supervisor=#{@name} crashed: #{e.class}: #{e.message}")
167
+ end
168
+
169
+ def stop?
170
+ @mutex.synchronize { @stop_requested }
171
+ end
172
+
173
+ # Defer Shutdown delivery while we're inside Logger.write so we don't
174
+ # trip Logger's "log writing failed" -> stderr fallback. Swallow any
175
+ # other logger error.
176
+ def safe_log(level, msg)
177
+ return unless @logger.respond_to?(level)
178
+
179
+ Thread.handle_interrupt(Quonfig::Shutdown => :never) do
180
+ @logger.public_send(level, msg)
181
+ end
182
+ rescue StandardError
183
+ nil
184
+ end
185
+ end
186
+ end
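As a worked example of the restart backoff in `WorkerSupervisor#run_loop` above, the sketch below replays only the arithmetic with the default constants (0.5s initial, 2.0 multiplier, 30s cap). `backoff_schedule` is a hypothetical helper written for this illustration. It takes one boolean per worker exit, saying whether that run delivered an envelope, and returns the sleep applied before each restart.

```ruby
# Replays the supervisor's backoff arithmetic; no threads, no sleeping.
def backoff_schedule(delivered_flags, initial: 0.5, max: 30.0, multiplier: 2.0)
  current = initial
  delivered_flags.map do |delivered|
    # A run that delivered at least one envelope resets the backoff.
    current = initial if delivered
    delay = current
    # The next restart waits longer, capped at max.
    current = [current * multiplier, max].min
    delay
  end
end

# Eight consecutive failures without any delivery.
p backoff_schedule([false] * 8)
# => [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]

# A delivery on the fourth run resets the delay to the initial value.
p backoff_schedule([false, false, false, true, false])
# => [0.5, 1.0, 2.0, 0.5, 1.0]
```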
data/lib/quonfig.rb CHANGED
@@ -29,6 +29,7 @@ require 'quonfig/evaluation'
29
29
  require 'quonfig/evaluation_details'
30
30
  require 'quonfig/encryption'
31
31
  require 'quonfig/exponential_backoff'
32
+ require 'quonfig/worker_supervisor'
32
33
  require 'quonfig/periodic_sync'
33
34
  require 'quonfig/errors/initialization_timeout_error'
34
35
  require 'quonfig/errors/invalid_sdk_key_error'
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: quonfig
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.14
4
+ version: 0.0.15
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jeff Dwyer
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2026-05-10 00:00:00.000000000 Z
11
+ date: 2026-05-15 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: activesupport
@@ -134,6 +134,7 @@ files:
134
134
  - lib/quonfig/types.rb
135
135
  - lib/quonfig/version.rb
136
136
  - lib/quonfig/weighted_value_resolver.rb
137
+ - lib/quonfig/worker_supervisor.rb
137
138
  - quonfig.gemspec
138
139
  homepage: https://github.com/quonfig/sdk-ruby
139
140
  licenses: