quonfig 0.0.13 → 0.0.15
- checksums.yaml +4 -4
- data/CHANGELOG.md +11 -0
- data/README.md +18 -0
- data/lib/quonfig/client.rb +301 -23
- data/lib/quonfig/datadir.rb +8 -3
- data/lib/quonfig/evaluation_details.rb +11 -4
- data/lib/quonfig/sse_config_client.rb +150 -4
- data/lib/quonfig/version.rb +1 -1
- data/lib/quonfig/worker_supervisor.rb +186 -0
- data/lib/quonfig.rb +1 -0
- metadata +3 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: e4e037ad01a35ca5a3fb3ddcc30ad6b0dab78ad82e4908a4a8ce9e8bab6cab40
|
|
4
|
+
data.tar.gz: 8bcccb03befbab5f1fbed1cbae867ce970498ac0081c92e24db7d8eb899d2faa
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 9d4abdeaeaaad881e5f28cb9a653715dd8b1838ba33cc38b6b1f08db5f729173d5eadbf2afebfb6e3ca3a379f0354ab453fafd760a1fd61d13c3efef60ad0aee
|
|
7
|
+
data.tar.gz: 890131a3f75092f1b846ee4ca46c1dc20702b1effc3db5803443905d8a8571a33b672a691a18bbb0c3ad8471c5db72006a745f50ac0e919bf0997b49cf202045
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,16 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.0.15 - 2026-05-15
|
|
4
|
+
|
|
5
|
+
- **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
|
|
6
|
+
- **Fix (SSE): backoff reset interval (qfg-ie49).** New `sse_reconnect_reset_interval` option, default `1s`. ld-eventsource's 60s default lets the backoff run away under flapping — the SDK is mid-sleep when later kills land and never observes them. 1s mirrors sdk-python's reset-on-every-successful-connect behavior. Sustained outages still back off exponentially (`mark_success` is never called, so the reset never triggers).
|
|
7
|
+
- **Fix (SSE): make `ReconnectCountingLogger` raise-proof (qfg-cf52).** ld-eventsource calls the logger from inside a bare-`Thread` `run_stream` loop with several call sites unguarded by `rescue`. A throwing wrapper would kill the worker with `@stopped=false`, leaving `closed?` false forever — silently wedging the SSE stream (the intermittent chaos scenario 05 flake). Every wrapper step is now independently rescued.
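The backoff-reset entry above can be illustrated with a small simulation. This is a minimal sketch, not ld-eventsource's implementation: the `BackoffSketch` class, the `flapping_delays` helper, the 30s cap, and the 5s connection lifetime are all hypothetical, and jitter is left out. Only the doubling delays and the "reset once a connection has stayed up for the reset interval" rule come from the changelog.

```ruby
# Hypothetical sketch, not ld-eventsource's code: exponential backoff that
# resets to the base delay once a connection has stayed up for at least
# `reset_interval` seconds. Jitter is omitted for clarity.
class BackoffSketch
  def initialize(base: 1.0, max: 30.0, reset_interval: 1.0)
    @base = base
    @max = max
    @reset_interval = reset_interval
    @attempts = 0
    @connected_at = nil
  end

  # Call when a connection is established; `now` is a timestamp in seconds.
  def mark_success(now)
    @connected_at = now
  end

  # Call when the connection drops; returns the delay before the next attempt.
  def next_delay(now)
    # Reset only if the last connection survived the reset interval.
    @attempts = 0 if @connected_at && (now - @connected_at) >= @reset_interval
    @connected_at = nil
    delay = [@base * (2**@attempts), @max].min
    @attempts += 1
    delay
  end
end

# Simulated flapping link: every connection lives 5s before it is killed.
def flapping_delays(reset_interval)
  backoff = BackoffSketch.new(reset_interval: reset_interval)
  now = 0.0
  Array.new(4) do
    backoff.mark_success(now)
    now += 5.0
    delay = backoff.next_delay(now)
    now += delay
    delay
  end
end
```

In this sketch a 1s reset interval keeps every reconnect delay at the base value, while a 60s interval lets the delay double on each kill because no connection ever lives long enough to reset it. A sustained outage never calls `mark_success`, so it still backs off up to the cap.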
|
|
8
|
+
|
|
9
|
+
## 0.0.14 - 2026-05-10
|
|
10
|
+
|
|
11
|
+
- **Feat: expose `variant` and `flag_metadata` on `EvaluationDetails` (qfg-9dbl).** OpenFeature's `EvaluationDetails` Ruby return type now carries the variant name and the flag-level metadata hash alongside the resolved value/reason. Brings sdk-ruby to parity with the other SDKs' detail surfaces and lets host apps (incl. the Ruby OpenFeature provider) read variant/metadata without re-fetching the config.
|
|
12
|
+
- **Test: regenerate integration tests from rubocop-clean templates (qfg-vrck).** The integration suite under `test/integration/` is now generated from templates that pass `bundle exec rubocop` on first emit, so future regenerations don't trigger a follow-up autofix commit.
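As a hedged illustration of how a host app might consume the new `variant` field, the sketch below splits a variant string into its kind and optional index. The string shapes (`static`, `targeting:<rule_index>`, `split:<weighted_value_index>`, `default`) are taken from `build_variant` in this diff; `parse_variant` itself is a hypothetical helper and not part of the SDK.

```ruby
# Hypothetical host-side helper, not SDK API: split a variant string such as
# "targeting:2" into its kind and an optional integer index.
def parse_variant(variant)
  kind, index = variant.to_s.split(':', 2)
  { kind: kind, index: index && Integer(index) }
end

parse_variant('targeting:2') # kind "targeting", index 2
parse_variant('static')      # kind "static", no index
```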
|
|
13
|
+
|
|
3
14
|
## 0.0.13 - 2026-05-07
|
|
4
15
|
|
|
5
16
|
- **Feat: `IS_PRESENT` and `IS_NOT_PRESENT` targeting operators (qfg-7jnb.6).** Both take only `propertyName` (no `valueToMatch`). `IS_PRESENT` resolves the dotted path against the merged context and returns true iff the value is non-nil. Type-agnostic — empty string `""`, `0`, and `false` all count as **present**; only `nil` / missing keys (including missing nested paths) are absent. `IS_NOT_PRESENT` is the negation. Implemented explicitly without ActiveSupport's `present?` / `blank?`, which would have given the wrong semantics on `""` and `false`. Matches sdk-node, sdk-go, sdk-python, sdk-ruby, sdk-javascript wire behaviour. Closes the integration-test parity gap that left 7 RSpec/Minitest cases red since the operators landed in `integration-test-data`.
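The presence semantics described above can be sketched in a few lines. This is not the SDK's evaluator: `present_at?` is a hypothetical name, and the plain nested-Hash context with string keys is an assumption made for the example.

```ruby
# Hypothetical helper illustrating the IS_PRESENT semantics: resolve a dotted
# path and treat only nil / missing keys as absent. "", 0 and false are present.
def present_at?(context, property_name)
  value = property_name.split('.').reduce(context) do |node, segment|
    break nil unless node.is_a?(Hash) # missing nested path counts as absent

    node[segment]
  end
  !value.nil?
end

context = {
  'user' => { 'email' => '', 'admin' => false, 'visits' => 0, 'plan' => nil }
}

present_at?(context, 'user.email') # present: empty string is non-nil
present_at?(context, 'user.plan')  # absent: value is nil
present_at?(context, 'team.id')    # absent: path is missing
```

`IS_NOT_PRESENT` would simply be `!present_at?(context, property_name)` under the same assumptions.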
|
data/README.md
CHANGED
|
@@ -333,6 +333,24 @@ converge once the envelope finishes applying.
|
|
|
333
333
|
`Quonfig.fork` is the only safe way to "carry" a client across `Process.fork`
|
|
334
334
|
— do not reuse the parent's client in a child process.
|
|
335
335
|
|
|
336
|
+
## Diagnostic health signals
|
|
337
|
+
|
|
338
|
+
`Quonfig::Client` exposes two read-only getters for monitoring SDK liveness:
|
|
339
|
+
|
|
340
|
+
- `client.last_successful_refresh` — a `Time` (UTC) marking the most recent
|
|
341
|
+
envelope install (any source: datadir, initial HTTP fetch, SSE, or fallback
|
|
342
|
+
polling). Returns `nil` before the first install. Preserved across `stop`.
|
|
343
|
+
- `client.connection_state` — a `Symbol` describing the aggregate state:
|
|
344
|
+
`:initializing`, `:connected`, `:disconnected`, or `:falling_back`.
|
|
345
|
+
|
|
346
|
+
> Do not wire `last_successful_refresh` or `connection_state` directly into a Kubernetes liveness probe. These signals are diagnostic, not pass/fail. A liveness probe based on SDK freshness will amplify transient network blips into restart cascades.
|
|
347
|
+
|
|
348
|
+
Compose your own threshold from the two getters if you need a dashboard signal
|
|
349
|
+
— but route alerts through a metrics pipeline, not a probe that restarts the
|
|
350
|
+
process.
|
|
351
|
+
|
|
352
|
+
There is intentionally no `client.healthy?` primitive.
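One way to compose a dashboard signal from the two getters is sketched below. It relies only on `last_successful_refresh` and `connection_state` as documented above; `health_snapshot` and the `FakeClient` stand-in are hypothetical names introduced so the example runs without the gem, and how the values reach a metrics pipeline is left to the host app.

```ruby
# Sketch only: derive dashboard-friendly values from the two getters.
# `health_snapshot` is a hypothetical helper, not part of the SDK.
def health_snapshot(client, now: Time.now.utc)
  refreshed_at = client.last_successful_refresh
  {
    connection_state: client.connection_state,
    # Stays nil until the first envelope install.
    seconds_since_refresh: refreshed_at && (now - refreshed_at).round(1)
  }
end

# Stand-in exposing the same two getters as Quonfig::Client.
FakeClient = Struct.new(:last_successful_refresh, :connection_state)

now = Time.utc(2026, 5, 15, 12, 0, 0)
client = FakeClient.new(now - 42, :falling_back)
snapshot = health_snapshot(client, now: now)
```

The snapshot is meant to be emitted as gauges and alerted on in the metrics pipeline, in line with the guidance above against wiring these values into a liveness probe.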
|
|
353
|
+
|
|
336
354
|
## Documentation
|
|
337
355
|
|
|
338
356
|
Full documentation, including SPEC, SDK reference, and operational guides, is
|
data/lib/quonfig/client.rb
CHANGED
|
@@ -40,9 +40,14 @@ module Quonfig
|
|
|
40
40
|
@resolver = Quonfig::Resolver.new(@store, @evaluator)
|
|
41
41
|
@semantic_logger_filters = {}
|
|
42
42
|
@sse_client = nil
|
|
43
|
-
@
|
|
43
|
+
@poll_supervisor = nil
|
|
44
44
|
@stopped = false
|
|
45
45
|
@telemetry_reporter = nil
|
|
46
|
+
@state_mutex = Mutex.new
|
|
47
|
+
@last_successful_refresh = nil
|
|
48
|
+
@sse_state = :idle
|
|
49
|
+
@sse_ever_connected = false
|
|
50
|
+
@fallback_engage_timer = nil
|
|
46
51
|
|
|
47
52
|
# If the caller injected a store, we're in test/bootstrap mode; skip I/O.
|
|
48
53
|
return if store
|
|
@@ -266,9 +271,14 @@ module Quonfig
|
|
|
266
271
|
end
|
|
267
272
|
@sse_client = nil
|
|
268
273
|
|
|
269
|
-
|
|
270
|
-
|
|
271
|
-
|
|
274
|
+
cancel_fallback_engage_timer
|
|
275
|
+
|
|
276
|
+
begin
|
|
277
|
+
@poll_supervisor&.stop
|
|
278
|
+
rescue StandardError => e
|
|
279
|
+
LOG.debug "Error stopping poll supervisor: #{e.message}"
|
|
280
|
+
end
|
|
281
|
+
@poll_supervisor = nil
|
|
272
282
|
|
|
273
283
|
begin
|
|
274
284
|
@telemetry_reporter&.stop
|
|
@@ -278,6 +288,65 @@ module Quonfig
|
|
|
278
288
|
@telemetry_reporter = nil
|
|
279
289
|
end
|
|
280
290
|
|
|
291
|
+
# quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
|
|
292
|
+
# Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
|
|
293
|
+
# incremented on every on_error edge from ld-eventsource (qfg-ll6r).
|
|
294
|
+
# Layer 2 (HTTP polling fallback) is wired through Quonfig::WorkerSupervisor.
|
|
295
|
+
#
|
|
296
|
+
# Pass +layer:+ ('1' or '2') to read a single layer; default returns the
|
|
297
|
+
# sum across both layers so the chaos harness (and operators) can pull
|
|
298
|
+
# per-layer values explicitly while preserving the previous single-number
|
|
299
|
+
# diagnostic surface.
|
|
300
|
+
def worker_restart_total(layer: nil)
|
|
301
|
+
case layer&.to_s
|
|
302
|
+
when '1' then sse_restart_total
|
|
303
|
+
when '2' then poll_restart_total
|
|
304
|
+
else sse_restart_total + poll_restart_total
|
|
305
|
+
end
|
|
306
|
+
end
|
|
307
|
+
|
|
308
|
+
# Wall-clock time of the last installed envelope (any source: datadir,
|
|
309
|
+
# initial HTTP fetch, SSE, or polling fallback). +nil+ before the first
|
|
310
|
+
# install. Preserved after +stop+.
|
|
311
|
+
#
|
|
312
|
+
# **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
|
|
313
|
+
# — a transient network blip will trip any freshness threshold and cause
|
|
314
|
+
# a rolling restart cascade. See the README "Diagnostic health signals"
|
|
315
|
+
# section.
|
|
316
|
+
#
|
|
317
|
+
# Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
|
|
318
|
+
def last_successful_refresh
|
|
319
|
+
@state_mutex.synchronize { @last_successful_refresh }
|
|
320
|
+
end
|
|
321
|
+
|
|
322
|
+
# Aggregate connection state. Returns one of:
|
|
323
|
+
#
|
|
324
|
+
# - +:initializing+ — no envelope has been installed and SSE is not yet
|
|
325
|
+
# connected.
|
|
326
|
+
# - +:connected+ — SSE is live, or the SDK is delivering configs from a
|
|
327
|
+
# loaded envelope (datadir mode or post-initial-fetch with no SSE).
|
|
328
|
+
# - +:disconnected+ — +stop+ was called, or SSE errored and no fallback
|
|
329
|
+
# poller is active.
|
|
330
|
+
# - +:falling_back+ — the Layer 2 HTTP polling supervisor is alive and
|
|
331
|
+
# serving as the active update channel.
|
|
332
|
+
#
|
|
333
|
+
# **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
|
|
334
|
+
# — see the README "Diagnostic health signals" section.
|
|
335
|
+
#
|
|
336
|
+
# Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
|
|
337
|
+
def connection_state
|
|
338
|
+
@state_mutex.synchronize do
|
|
339
|
+
next :disconnected if @stopped
|
|
340
|
+
next :falling_back if @poll_supervisor&.alive?
|
|
341
|
+
next :connected if @sse_state == :connected
|
|
342
|
+
next :disconnected if @sse_state == :error
|
|
343
|
+
|
|
344
|
+
# No SSE state change yet: state is driven by whether any envelope
|
|
345
|
+
# has been installed (datadir / initial fetch).
|
|
346
|
+
@last_successful_refresh.nil? ? :initializing : :connected
|
|
347
|
+
end
|
|
348
|
+
end
|
|
349
|
+
|
|
281
350
|
def fork
|
|
282
351
|
self.class.new(@options.for_fork)
|
|
283
352
|
end
|
|
@@ -288,6 +357,128 @@ module Quonfig
|
|
|
288
357
|
|
|
289
358
|
private
|
|
290
359
|
|
|
360
|
+
# Stamp +last_successful_refresh+ at install time. Called by every code
|
|
361
|
+
# path that hands an envelope to the cache: datadir load, initial HTTP
|
|
362
|
+
# fetch, SSE event apply, and polling worker fetch.
|
|
363
|
+
def record_refresh!
|
|
364
|
+
@state_mutex.synchronize { @last_successful_refresh = Time.now.utc }
|
|
365
|
+
end
|
|
366
|
+
|
|
367
|
+
def sse_restart_total
|
|
368
|
+
sse = @sse_client
|
|
369
|
+
return 0 if sse.nil?
|
|
370
|
+
return 0 unless sse.respond_to?(:restart_total)
|
|
371
|
+
|
|
372
|
+
sse.restart_total.to_i
|
|
373
|
+
end
|
|
374
|
+
|
|
375
|
+
def poll_restart_total
|
|
376
|
+
sup = @poll_supervisor
|
|
377
|
+
return 0 if sup.nil?
|
|
378
|
+
return 0 unless sup.respond_to?(:worker_restart_total)
|
|
379
|
+
|
|
380
|
+
sup.worker_restart_total.to_i
|
|
381
|
+
end
|
|
382
|
+
|
|
383
|
+
# Drive the SSE-side of the connection_state machine. The SSE client
|
|
384
|
+
# invokes this on connect/error edges; tests call it directly via +send+.
|
|
385
|
+
# Documented values: :idle, :connecting, :connected, :error.
|
|
386
|
+
#
|
|
387
|
+
# Also drives the Layer 2 fallback poller's engage/disengage:
|
|
388
|
+
# - :connected clears any pending engage timer and stops an active
|
|
389
|
+
# fallback poller (SSE recovered, drop the second channel).
|
|
390
|
+
# - :error before any successful connect engages immediately
|
|
391
|
+
# (initial-fail path).
|
|
392
|
+
# - :error after a successful connect schedules a 2x-poll-interval
|
|
393
|
+
# grace timer; the timer engages if SSE has not recovered by then.
|
|
394
|
+
# Mirrors sdk-python's `_handle_sse_state_change` and sdk-node's
|
|
395
|
+
# `fallbackPollerActive` engagement behavior. (qfg-47c2.26)
|
|
396
|
+
# Stable callable handed to Quonfig::SSEConfigClient so its +on_error+
|
|
397
|
+
# block can drive @sse_state -> :error on a mid-run socket drop. Without
|
|
398
|
+
# this wiring, +connection_state+ would stay +:connected+ after a
|
|
399
|
+
# disconnect and customers composing staleness checks would see stale
|
|
400
|
+
# data. (qfg-47c2.27)
|
|
401
|
+
def sse_error_callback
|
|
402
|
+
@sse_error_callback ||= ->(error) { handle_sse_error(error) }
|
|
403
|
+
end
|
|
404
|
+
|
|
405
|
+
def handle_sse_error(_error)
|
|
406
|
+
handle_sse_state_change(:error)
|
|
407
|
+
end
|
|
408
|
+
|
|
409
|
+
def handle_sse_state_change(new_state)
|
|
410
|
+
state = new_state.to_sym
|
|
411
|
+
ever_connected = @state_mutex.synchronize do
|
|
412
|
+
@sse_state = state
|
|
413
|
+
@sse_ever_connected = true if state == :connected
|
|
414
|
+
@sse_ever_connected
|
|
415
|
+
end
|
|
416
|
+
|
|
417
|
+
return unless @options.respond_to?(:enable_polling) && @options.enable_polling
|
|
418
|
+
return if @stopped
|
|
419
|
+
|
|
420
|
+
case state
|
|
421
|
+
when :connected
|
|
422
|
+
cancel_fallback_engage_timer
|
|
423
|
+
stop_fallback_poller('sse-recovered')
|
|
424
|
+
when :error
|
|
425
|
+
if ever_connected
|
|
426
|
+
schedule_fallback_engage
|
|
427
|
+
else
|
|
428
|
+
start_polling
|
|
429
|
+
end
|
|
430
|
+
end
|
|
431
|
+
end
|
|
432
|
+
|
|
433
|
+
def cancel_fallback_engage_timer
|
|
434
|
+
timer = @state_mutex.synchronize do
|
|
435
|
+
t = @fallback_engage_timer
|
|
436
|
+
@fallback_engage_timer = nil
|
|
437
|
+
t
|
|
438
|
+
end
|
|
439
|
+
timer&.kill if timer&.alive?
|
|
440
|
+
end
|
|
441
|
+
|
|
442
|
+
def stop_fallback_poller(reason)
|
|
443
|
+
supervisor = @state_mutex.synchronize do
|
|
444
|
+
s = @poll_supervisor
|
|
445
|
+
@poll_supervisor = nil
|
|
446
|
+
s
|
|
447
|
+
end
|
|
448
|
+
return if supervisor.nil?
|
|
449
|
+
|
|
450
|
+
begin
|
|
451
|
+
supervisor.stop
|
|
452
|
+
LOG.debug "[quonfig] Layer 2 fallback poller stopped (reason=#{reason})"
|
|
453
|
+
rescue StandardError => e
|
|
454
|
+
LOG.debug "Error stopping fallback poller: #{e.message}"
|
|
455
|
+
end
|
|
456
|
+
end
|
|
457
|
+
|
|
458
|
+
# Schedule a 2*poll_interval grace timer after a connected->error edge.
|
|
459
|
+
# If SSE recovers before the timer fires, +cancel_fallback_engage_timer+
|
|
460
|
+
# tears it down. Idempotent — does nothing if a timer is already pending
|
|
461
|
+
# or the supervisor is already alive.
|
|
462
|
+
def schedule_fallback_engage
|
|
463
|
+
poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
|
|
464
|
+
return if poll_interval <= 0
|
|
465
|
+
|
|
466
|
+
grace_seconds = poll_interval * 2.0
|
|
467
|
+
|
|
468
|
+
@state_mutex.synchronize do
|
|
469
|
+
return if @fallback_engage_timer&.alive?
|
|
470
|
+
return if @poll_supervisor&.alive?
|
|
471
|
+
return if @stopped
|
|
472
|
+
|
|
473
|
+
@fallback_engage_timer = Thread.new do
|
|
474
|
+
Thread.current.report_on_exception = false
|
|
475
|
+
sleep grace_seconds
|
|
476
|
+
@state_mutex.synchronize { @fallback_engage_timer = nil }
|
|
477
|
+
start_polling unless @stopped
|
|
478
|
+
end
|
|
479
|
+
end
|
|
480
|
+
end
|
|
481
|
+
|
|
291
482
|
# Construct and start the telemetry reporter if the options permit it.
|
|
292
483
|
# The reporter runs on a background thread and periodically POSTs
|
|
293
484
|
# context-shape and example-context batches to +telemetry_destination+.
|
|
@@ -378,6 +569,7 @@ module Quonfig
|
|
|
378
569
|
def load_datadir_into_store
|
|
379
570
|
envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
|
|
380
571
|
envelope.configs.each { |cfg| @store.set(cfg['key'], cfg) }
|
|
572
|
+
record_refresh!
|
|
381
573
|
end
|
|
382
574
|
|
|
383
575
|
# Initialize network mode: sync HTTP fetch (bounded by
|
|
@@ -412,7 +604,11 @@ module Quonfig
|
|
|
412
604
|
return
|
|
413
605
|
end
|
|
414
606
|
|
|
415
|
-
|
|
607
|
+
if result == :failed
|
|
608
|
+
handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls'))
|
|
609
|
+
else
|
|
610
|
+
record_refresh!
|
|
611
|
+
end
|
|
416
612
|
end
|
|
417
613
|
|
|
418
614
|
def handle_init_failure(err)
|
|
@@ -429,44 +625,79 @@ module Quonfig
|
|
|
429
625
|
def start_sse
|
|
430
626
|
return false if @options.sse_api_urls.nil? || @options.sse_api_urls.empty?
|
|
431
627
|
|
|
432
|
-
@sse_client = Quonfig::SSEConfigClient.new(
|
|
628
|
+
@sse_client = Quonfig::SSEConfigClient.new(
|
|
629
|
+
@options,
|
|
630
|
+
@config_loader,
|
|
631
|
+
nil,
|
|
632
|
+
nil,
|
|
633
|
+
on_error: sse_error_callback
|
|
634
|
+
)
|
|
433
635
|
@sse_client.start do |envelope, _event, _source|
|
|
434
636
|
next if @stopped
|
|
435
637
|
|
|
436
638
|
begin
|
|
437
639
|
@config_loader.apply_envelope(envelope)
|
|
438
|
-
|
|
640
|
+
handle_sse_state_change(:connected)
|
|
641
|
+
record_refresh!
|
|
439
642
|
rescue StandardError => e
|
|
440
643
|
LOG.warn "[quonfig] Error applying SSE envelope: #{e.message}"
|
|
644
|
+
next
|
|
441
645
|
end
|
|
646
|
+
notify_on_update_callback
|
|
442
647
|
end
|
|
443
648
|
true
|
|
444
649
|
rescue StandardError => e
|
|
445
650
|
LOG.warn "[quonfig] SSE start failed: #{e.message}"
|
|
446
651
|
@sse_client = nil
|
|
652
|
+
handle_sse_state_change(:error)
|
|
447
653
|
false
|
|
448
654
|
end
|
|
449
655
|
|
|
450
656
|
def start_polling
|
|
657
|
+
return if @stopped
|
|
658
|
+
return if @poll_supervisor&.alive?
|
|
659
|
+
|
|
451
660
|
poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
|
|
452
661
|
return if poll_interval <= 0
|
|
453
662
|
|
|
454
|
-
|
|
455
|
-
|
|
663
|
+
stopped_ref = -> { @stopped }
|
|
664
|
+
worker = lambda do |notify_delivered|
|
|
456
665
|
loop do
|
|
457
|
-
break if
|
|
666
|
+
break if stopped_ref.call
|
|
458
667
|
|
|
459
668
|
sleep poll_interval
|
|
460
|
-
break if
|
|
461
|
-
|
|
462
|
-
|
|
463
|
-
|
|
464
|
-
|
|
465
|
-
|
|
466
|
-
LOG.warn "[quonfig] Polling error: #{e.message}"
|
|
467
|
-
end
|
|
669
|
+
break if stopped_ref.call
|
|
670
|
+
|
|
671
|
+
@config_loader.fetch!
|
|
672
|
+
record_refresh!
|
|
673
|
+
notify_delivered.call
|
|
674
|
+
notify_on_update_callback
|
|
468
675
|
end
|
|
469
676
|
end
|
|
677
|
+
|
|
678
|
+
supervisor = Quonfig::WorkerSupervisor.new(
|
|
679
|
+
name: 'poll', layer: '2', worker: worker
|
|
680
|
+
)
|
|
681
|
+
@state_mutex.synchronize { @poll_supervisor = supervisor }
|
|
682
|
+
supervisor.start
|
|
683
|
+
end
|
|
684
|
+
|
|
685
|
+
# Invoke the customer-supplied on_update callback under a rescue. A raise
|
|
686
|
+
# here is the customer's bug, but it must NOT take down the SSE listener
|
|
687
|
+
# or polling supervisor. Log at ERROR with a message containing
|
|
688
|
+
# "onConfigUpdate callback" so chaos scenario 10's
|
|
689
|
+
# sdkLog('error', /callback|onConfigUpdate/i) assertion matches and so
|
|
690
|
+
# the message is distinguishable from internal envelope-apply errors
|
|
691
|
+
# (qfg-47c2.30).
|
|
692
|
+
def notify_on_update_callback
|
|
693
|
+
cb = @on_update
|
|
694
|
+
return unless cb
|
|
695
|
+
|
|
696
|
+
begin
|
|
697
|
+
cb.call
|
|
698
|
+
rescue StandardError => e
|
|
699
|
+
LOG.error "[quonfig] onConfigUpdate callback raised: #{e.class}: #{e.message}"
|
|
700
|
+
end
|
|
470
701
|
end
|
|
471
702
|
|
|
472
703
|
def build_context(jit_context)
|
|
@@ -547,19 +778,25 @@ module Quonfig
|
|
|
547
778
|
value: nil,
|
|
548
779
|
reason: Quonfig::EvaluationDetails::REASON_ERROR,
|
|
549
780
|
error_code: Quonfig::EvaluationDetails::ERROR_FLAG_NOT_FOUND,
|
|
550
|
-
error_message: e.message
|
|
781
|
+
error_message: e.message,
|
|
782
|
+
variant: build_variant(Quonfig::EvaluationDetails::REASON_ERROR, nil, nil),
|
|
783
|
+
flag_metadata: build_flag_metadata(nil, nil, nil, nil, nil)
|
|
551
784
|
)
|
|
552
785
|
end
|
|
553
786
|
|
|
554
787
|
if result.nil?
|
|
555
788
|
return Quonfig::EvaluationDetails.new(
|
|
556
789
|
value: nil,
|
|
557
|
-
reason: Quonfig::EvaluationDetails::REASON_DEFAULT
|
|
790
|
+
reason: Quonfig::EvaluationDetails::REASON_DEFAULT,
|
|
791
|
+
variant: build_variant(Quonfig::EvaluationDetails::REASON_DEFAULT, nil, nil),
|
|
792
|
+
flag_metadata: build_flag_metadata(nil, nil, nil, nil, nil)
|
|
558
793
|
)
|
|
559
794
|
end
|
|
560
795
|
|
|
561
796
|
record_evaluation_for_telemetry(result)
|
|
562
797
|
|
|
798
|
+
config_id = result.config&.dig('id') || result.config&.dig(:id)
|
|
799
|
+
config_type = result.config&.dig('type') || result.config&.dig(:type)
|
|
563
800
|
raw_value = result.unwrapped_value
|
|
564
801
|
|
|
565
802
|
begin
|
|
@@ -569,23 +806,64 @@ module Quonfig
|
|
|
569
806
|
value: nil,
|
|
570
807
|
reason: Quonfig::EvaluationDetails::REASON_ERROR,
|
|
571
808
|
error_code: Quonfig::EvaluationDetails::ERROR_TYPE_MISMATCH,
|
|
572
|
-
error_message: e.message
|
|
809
|
+
error_message: e.message,
|
|
810
|
+
variant: build_variant(Quonfig::EvaluationDetails::REASON_ERROR, nil, nil),
|
|
811
|
+
flag_metadata: build_flag_metadata(config_id, config_type, nil, nil, nil)
|
|
573
812
|
)
|
|
574
813
|
end
|
|
575
814
|
|
|
815
|
+
reason = result.of_reason
|
|
576
816
|
Quonfig::EvaluationDetails.new(
|
|
577
817
|
value: coerced,
|
|
578
|
-
reason:
|
|
818
|
+
reason: reason,
|
|
819
|
+
variant: build_variant(reason, result.rule_index, result.weighted_value_index),
|
|
820
|
+
flag_metadata: build_flag_metadata(
|
|
821
|
+
config_id, config_type, result.rule_index, result.weighted_value_index, reason
|
|
822
|
+
)
|
|
579
823
|
)
|
|
580
824
|
rescue StandardError => e
|
|
581
825
|
Quonfig::EvaluationDetails.new(
|
|
582
826
|
value: nil,
|
|
583
827
|
reason: Quonfig::EvaluationDetails::REASON_ERROR,
|
|
584
828
|
error_code: Quonfig::EvaluationDetails::ERROR_GENERAL,
|
|
585
|
-
error_message: e.message
|
|
829
|
+
error_message: e.message,
|
|
830
|
+
variant: build_variant(Quonfig::EvaluationDetails::REASON_ERROR, nil, nil),
|
|
831
|
+
flag_metadata: build_flag_metadata(nil, nil, nil, nil, nil)
|
|
586
832
|
)
|
|
587
833
|
end
|
|
588
834
|
|
|
835
|
+
# Build the variant string per the cross-SDK spec
|
|
836
|
+
# (project/plans/openfeature-resolution-details.md §2).
|
|
837
|
+
def build_variant(reason, rule_index, weighted_value_index)
|
|
838
|
+
case reason
|
|
839
|
+
when Quonfig::EvaluationDetails::REASON_STATIC
|
|
840
|
+
'static'
|
|
841
|
+
when Quonfig::EvaluationDetails::REASON_TARGETING_MATCH
|
|
842
|
+
"targeting:#{rule_index || 0}"
|
|
843
|
+
when Quonfig::EvaluationDetails::REASON_SPLIT
|
|
844
|
+
"split:#{weighted_value_index || 0}"
|
|
845
|
+
else
|
|
846
|
+
'default'
|
|
847
|
+
end
|
|
848
|
+
end
|
|
849
|
+
|
|
850
|
+
# Build the flag_metadata hash per the cross-SDK spec
|
|
851
|
+
# (project/plans/openfeature-resolution-details.md §3) using Ruby's
|
|
852
|
+
# snake_case keys and the wire's snake_case config_type values.
|
|
853
|
+
def build_flag_metadata(config_id, config_type, rule_index, weighted_value_index, reason)
|
|
854
|
+
md = {}
|
|
855
|
+
md['config_id'] = config_id if config_id
|
|
856
|
+
md['config_type'] = config_type if config_type
|
|
857
|
+
env = @options.environment
|
|
858
|
+
md['environment'] = env if env && !env.empty?
|
|
859
|
+
if rule_index && rule_index >= 0 &&
|
|
860
|
+
[Quonfig::EvaluationDetails::REASON_TARGETING_MATCH, Quonfig::EvaluationDetails::REASON_SPLIT].include?(reason)
|
|
861
|
+
md['rule_index'] = rule_index
|
|
862
|
+
end
|
|
863
|
+
md['weighted_value_index'] = weighted_value_index if weighted_value_index && reason == Quonfig::EvaluationDetails::REASON_SPLIT
|
|
864
|
+
md
|
|
865
|
+
end
|
|
866
|
+
|
|
589
867
|
def typed_get(key, expected_type, default:, context:)
|
|
590
868
|
jit = context == NO_DEFAULT_PROVIDED ? NO_DEFAULT_PROVIDED : context
|
|
591
869
|
value = get(key, default, jit)
|
data/lib/quonfig/datadir.rb
CHANGED
|
@@ -11,14 +11,16 @@ module Quonfig
|
|
|
11
11
|
# <datadir>/configs/*.json
|
|
12
12
|
# <datadir>/feature-flags/*.json
|
|
13
13
|
# <datadir>/segments/*.json
|
|
14
|
-
# <datadir>/schemas/*.json
|
|
15
14
|
# <datadir>/log-levels/*.json
|
|
16
15
|
#
|
|
16
|
+
# schemas/ is intentionally excluded — those files are raw JSON Schema
|
|
17
|
+
# documents, not Configs, and SDKs do not consume them (qfg-uzsl).
|
|
18
|
+
#
|
|
17
19
|
# Each <type>/*.json file is a WorkspaceConfigDocument. The loader projects
|
|
18
20
|
# it down to the ConfigResponse shape that the SSE/HTTP delivery path emits,
|
|
19
21
|
# so ConfigStore consumes both transports uniformly.
|
|
20
22
|
module Datadir
|
|
21
|
-
CONFIG_SUBDIRS = %w[configs feature-flags segments
|
|
23
|
+
CONFIG_SUBDIRS = %w[configs feature-flags segments log-levels].freeze
|
|
22
24
|
|
|
23
25
|
module_function
|
|
24
26
|
|
|
@@ -36,7 +38,10 @@ module Quonfig
|
|
|
36
38
|
.select { |name| name.end_with?('.json') }
|
|
37
39
|
.sort
|
|
38
40
|
.each do |filename|
|
|
39
|
-
|
|
41
|
+
path = File.join(dir, filename)
|
|
42
|
+
raw = JSON.parse(File.read(path))
|
|
43
|
+
raise ArgumentError, "[quonfig] config has empty key — file is not a Quonfig Config: #{path}" if raw['key'].nil? || raw['key'].to_s.empty?
|
|
44
|
+
|
|
40
45
|
configs << to_config_response(raw, env_id)
|
|
41
46
|
end
|
|
42
47
|
end
|
data/lib/quonfig/evaluation_details.rb
CHANGED
|
@@ -28,13 +28,16 @@ module Quonfig
|
|
|
28
28
|
ERROR_TYPE_MISMATCH = 'TYPE_MISMATCH'
|
|
29
29
|
ERROR_GENERAL = 'GENERAL'
|
|
30
30
|
|
|
31
|
-
attr_reader :value, :reason, :error_code, :error_message
|
|
31
|
+
attr_reader :value, :reason, :error_code, :error_message, :variant, :flag_metadata
|
|
32
32
|
|
|
33
|
-
def initialize(value:, reason:, error_code: nil, error_message: nil
|
|
33
|
+
def initialize(value:, reason:, error_code: nil, error_message: nil,
|
|
34
|
+
variant: nil, flag_metadata: nil)
|
|
34
35
|
@value = value
|
|
35
36
|
@reason = reason
|
|
36
37
|
@error_code = error_code
|
|
37
38
|
@error_message = error_message
|
|
39
|
+
@variant = variant
|
|
40
|
+
@flag_metadata = flag_metadata
|
|
38
41
|
end
|
|
39
42
|
|
|
40
43
|
def ==(other)
|
|
@@ -42,18 +45,22 @@ module Quonfig
|
|
|
42
45
|
other.value == @value &&
|
|
43
46
|
other.reason == @reason &&
|
|
44
47
|
other.error_code == @error_code &&
|
|
45
|
-
other.error_message == @error_message
|
|
48
|
+
other.error_message == @error_message &&
|
|
49
|
+
other.variant == @variant &&
|
|
50
|
+
other.flag_metadata == @flag_metadata
|
|
46
51
|
end
|
|
47
52
|
alias eql? ==
|
|
48
53
|
|
|
49
54
|
def hash
|
|
50
|
-
[@value, @reason, @error_code, @error_message].hash
|
|
55
|
+
[@value, @reason, @error_code, @error_message, @variant, @flag_metadata].hash
|
|
51
56
|
end
|
|
52
57
|
|
|
53
58
|
def inspect
|
|
54
59
|
parts = ["value=#{@value.inspect}", "reason=#{@reason.inspect}"]
|
|
55
60
|
parts << "error_code=#{@error_code.inspect}" if @error_code
|
|
56
61
|
parts << "error_message=#{@error_message.inspect}" if @error_message
|
|
62
|
+
parts << "variant=#{@variant.inspect}" if @variant
|
|
63
|
+
parts << "flag_metadata=#{@flag_metadata.inspect}" if @flag_metadata
|
|
57
64
|
"#<Quonfig::EvaluationDetails #{parts.join(' ')}>"
|
|
58
65
|
end
|
|
59
66
|
end
|
data/lib/quonfig/sse_config_client.rb
CHANGED
|
@@ -5,19 +5,99 @@ require 'json'
|
|
|
5
5
|
|
|
6
6
|
module Quonfig
|
|
7
7
|
class SSEConfigClient
|
|
8
|
+
# ld-eventsource auto-reconnects on a clean socket EOF (server FIN)
|
|
9
|
+
# *internally* — it never calls +on_error+ for that case, only for
|
|
10
|
+
# ECONNREFUSED-style failures (qfg-ie49; see chaos scenario 09). The one
|
|
11
|
+
# signal it emits for any reconnect is an info-level
|
|
12
|
+
# "Will retry connection after ..." line, logged once per reconnect attempt
|
|
13
|
+
# and never on the first connect. Wrapping the logger we hand to
|
|
14
|
+
# SSE::Client lets the SDK observe those internal reconnects without
|
|
15
|
+
# touching the data path. This is the only reconnect hook ld-eventsource
|
|
16
|
+
# >= 2.0 exposes.
|
|
17
|
+
class ReconnectCountingLogger
|
|
18
|
+
RECONNECT_SIGNAL = 'Will retry connection after'
|
|
19
|
+
|
|
20
|
+
LEVELS = %i[trace debug info warn error fatal].freeze
|
|
21
|
+
|
|
22
|
+
def initialize(wrapped, &on_reconnect)
|
|
23
|
+
@wrapped = wrapped
|
|
24
|
+
@on_reconnect = on_reconnect
|
|
25
|
+
end
|
|
26
|
+
|
|
27
|
+
# Crash-safe by construction: ld-eventsource calls this logger from
|
|
28
|
+
# inside its bare-Thread +run_stream+ loop, and several of those call
|
|
29
|
+
# sites (+connect+, +log_and_dispatch_error+, query-param building) are
|
|
30
|
+
# NOT wrapped in a rescue. Any exception that escapes a logger call kills
|
|
31
|
+
# the worker thread with +@stopped+ still false, so +closed?+ never flips
|
|
32
|
+
# true and the SDK's @retry_thread never reconnects — the SSE stream is
|
|
33
|
+
# silently wedged forever (qfg-cf52, the chaos scenario 05 flake). Every
|
|
34
|
+
# step here is therefore independently guarded: a throwing message block,
|
|
35
|
+
# a throwing on_reconnect callback, or a throwing wrapped logger can
|
|
36
|
+
# never propagate out of this method.
|
|
37
|
+
LEVELS.each do |level|
|
|
38
|
+
define_method(level) do |message = nil, &block|
|
|
39
|
+
begin
|
|
40
|
+
message = block.call if message.nil? && block
|
|
41
|
+
rescue StandardError
|
|
42
|
+
message = nil
|
|
43
|
+
end
|
|
44
|
+
|
|
45
|
+
if level == :info && message.to_s.include?(RECONNECT_SIGNAL)
|
|
46
|
+
begin
|
|
47
|
+
@on_reconnect.call
|
|
48
|
+
rescue StandardError
|
|
49
|
+
nil
|
|
50
|
+
end
|
|
51
|
+
end
|
|
52
|
+
|
|
53
|
+
begin
|
|
54
|
+
@wrapped.public_send(level, message) if @wrapped.respond_to?(level)
|
|
55
|
+
rescue StandardError
|
|
56
|
+
nil
|
|
57
|
+
end
|
|
58
|
+
end
|
|
59
|
+
end
|
|
60
|
+
|
|
61
|
+
def level
|
|
62
|
+
@wrapped&.level
|
|
63
|
+
end
|
|
64
|
+
|
|
65
|
+
def level=(new_level)
|
|
66
|
+
@wrapped.level = new_level if @wrapped.respond_to?(:level=)
|
|
67
|
+
end
|
|
68
|
+
end
|
|
69
|
+
|
|
8
70
|
class Options
|
|
9
71
|
attr_reader :sse_read_timeout, :seconds_between_new_connection,
|
|
10
72
|
:sse_default_reconnect_time, :sleep_delay_for_new_connection_check,
|
|
11
|
-
:errors_to_close_connection
|
|
73
|
+
:errors_to_close_connection, :sse_reconnect_reset_interval
|
|
12
74
|
|
|
13
|
-
|
|
75
|
+
# sse_read_timeout: 90s = 3x the 30s server heartbeat. A silent socket
|
|
76
|
+
# stall trips the read deadline within one missed-heartbeat window
|
|
77
|
+
# rather than the previous 5-minute idle. See plan
|
|
78
|
+
# `project/plans/sdk-hardening-and-verification.md` Layer 1.
|
|
79
|
+
#
|
|
80
|
+
# sse_reconnect_reset_interval: 1s (ld-eventsource default is 60s). The
|
|
81
|
+
# ld-eventsource backoff only resets to the base interval once a
|
|
82
|
+
# connection has stayed up this long; until then each reconnect doubles
|
|
83
|
+
# the delay (1s, 2s, 4s, 8s...). With the 60s default, a flapping
|
|
84
|
+
# connection (chaos scenario 09 — proxy killed every 6s) backs off so
|
|
85
|
+
# fast the SDK is mid-sleep when the next kill lands and never observes
|
|
86
|
+
# it. Resetting after 1s of healthy connection mirrors sdk-python, which
|
|
87
|
+
# resets its backoff on every successful connect (sdk-python/quonfig/
|
|
88
|
+
# sse.py). A *sustained* outage still backs off exponentially: no
|
|
89
|
+
# connection succeeds, so `mark_success` is never called and the reset
|
|
90
|
+
# never triggers (qfg-ie49).
|
|
91
|
+
def initialize(sse_read_timeout: 90,
|
|
14
92
|
seconds_between_new_connection: 5,
|
|
15
93
|
sleep_delay_for_new_connection_check: 1,
|
|
16
94
|
sse_default_reconnect_time: SSE::Client::DEFAULT_RECONNECT_TIME,
|
|
95
|
+
sse_reconnect_reset_interval: 1,
|
|
17
96
|
errors_to_close_connection: [HTTP::ConnectionError])
|
|
18
97
|
@sse_read_timeout = sse_read_timeout
|
|
19
98
|
@seconds_between_new_connection = seconds_between_new_connection
|
|
20
99
|
@sse_default_reconnect_time = sse_default_reconnect_time
|
|
100
|
+
@sse_reconnect_reset_interval = sse_reconnect_reset_interval
|
|
21
101
|
@sleep_delay_for_new_connection_check = sleep_delay_for_new_connection_check
|
|
22
102
|
@errors_to_close_connection = errors_to_close_connection
|
|
23
103
|
end
|
|
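Note on the hunk above: the effect of `sse_reconnect_reset_interval` on a flapping connection can be sketched with a simplified model. This is not ld-eventsource's implementation: jitter is omitted, and `BackoffModel`, `delays_for`, and the 30s cap are hypothetical names and values chosen for illustration.

```ruby
# Simplified model of exponential backoff with a "reset after healthy uptime"
# rule. Hypothetical names; jitter is deliberately left out.
class BackoffModel
  def initialize(base:, reset_interval:, max: 30.0)
    @base = base
    @reset_interval = reset_interval
    @max = max
    @attempts = 0
  end

  # uptime: seconds the previous connection stayed up before it dropped.
  def next_delay(uptime)
    # A connection that stayed up long enough resets the backoff to the base.
    @attempts = 0 if uptime >= @reset_interval
    delay = [@base * (2**@attempts), @max].min
    @attempts += 1
    delay
  end
end

# Delays seen across `drops` disconnects when each connection lives `uptime` seconds.
def delays_for(reset_interval, drops: 5, uptime: 6)
  model = BackoffModel.new(base: 1.0, reset_interval: reset_interval)
  Array.new(drops) { model.next_delay(uptime) }
end

slow = delays_for(60) # 6s of uptime never reaches 60s, so delays keep doubling
fast = delays_for(1)  # 6s of uptime exceeds 1s, so every delay is the base

puts slow.inspect # => [1.0, 2.0, 4.0, 8.0, 16.0]
puts fast.inspect # => [1.0, 1.0, 1.0, 1.0, 1.0]
```

Under this model, the 60s setting leaves the client sleeping for 8s or 16s while a proxy is killed every 6s, which matches the missed-reconnect behaviour the comment describes.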
@@ -25,12 +105,46 @@ module Quonfig
|
|
|
25
105
|
|
|
26
106
|
LOG = Quonfig::InternalLogger.new(self)
|
|
27
107
|
|
|
28
|
-
|
|
108
|
+
# +on_error+: optional callable invoked on every SSE error edge. Parent
|
|
109
|
+
# Quonfig::Client wires this to drive @sse_state -> :error so that
|
|
110
|
+
# +connection_state+ reflects the disconnect (qfg-47c2.27). Without it
|
|
111
|
+
# the SDK's public health primitive would lie about its own state during
|
|
112
|
+
# a mid-run socket drop.
|
|
113
|
+
def initialize(prefab_options, config_loader, options = nil, logger = nil, on_error: nil)
|
|
29
114
|
@prefab_options = prefab_options
|
|
30
115
|
@options = options || Options.new
|
|
31
116
|
@config_loader = config_loader
|
|
32
117
|
@connected = false
|
|
33
118
|
@logger = logger || LOG
|
|
119
|
+
@on_error = on_error
|
|
120
|
+
@restart_total = 0
|
|
121
|
+
@restart_mutex = Mutex.new
|
|
122
|
+
end
|
|
123
|
+
|
|
124
|
+
# qfg-ll6r / qfg-ie49: Layer 1 (SSE) restart counter — counts every
|
|
125
|
+
# *reconnect*, from two sources:
|
|
126
|
+
# 1. ld-eventsource's own internal reconnect (clean FIN, read timeout,
|
|
127
|
+
# transient errors it doesn't surface) — observed via the
|
|
128
|
+
# ReconnectCountingLogger "Will retry connection after" signal.
|
|
129
|
+
# 2. SDK-driven reconnects in @retry_thread, after a closing error
|
|
130
|
+
# (HTTP::ConnectionError) made us close the SSE::Client outright.
|
|
131
|
+
# These two are mutually exclusive per disconnect, so there is no
|
|
132
|
+
# double-count. on_error is deliberately NOT a source — ld-eventsource
|
|
133
|
+
# reconnects internally after most non-closing errors, so counting the
|
|
134
|
+
# error edge AND the reconnect would double up (qfg-ie49).
|
|
135
|
+
#
|
|
136
|
+
# The chaos harness pulls this via Client#worker_restart_total(layer: '1')
|
|
137
|
+
# so kill-storm scenarios (e.g. scenario 09 — proxy killed 5x in 30s) can
|
|
138
|
+
# assert restart_total >= 5 even when the kills produce clean FINs that
|
|
139
|
+
# never reach on_error.
|
|
140
|
+
def restart_total
|
|
141
|
+
@restart_mutex.synchronize { @restart_total }
|
|
142
|
+
end
|
|
143
|
+
|
|
144
|
+
# Bump the Layer 1 reconnect counter. Called from the ld-eventsource
|
|
145
|
+
# worker thread (via ReconnectCountingLogger) and from @retry_thread.
|
|
146
|
+
def count_restart!
|
|
147
|
+
@restart_mutex.synchronize { @restart_total += 1 }
|
|
34
148
|
end
|
|
35
149
|
|
|
36
150
|
def close
|
|
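Note on the hunk above: `restart_total` and `count_restart!` are called from different threads, which is why both go through `@restart_mutex`. A minimal standalone sketch of the same pattern follows; `RestartCounter` is a hypothetical stand-in, not a class from the gem.

```ruby
# Hypothetical stand-in for the mutex-guarded counter pair in the diff.
class RestartCounter
  def initialize
    @total = 0
    @mutex = Mutex.new
  end

  # Read under the same lock that guards writes.
  def total
    @mutex.synchronize { @total }
  end

  # Increment under the lock so concurrent callers cannot lose updates.
  def count!
    @mutex.synchronize { @total += 1 }
  end
end

counter = RestartCounter.new
threads = Array.new(8) { Thread.new { 1000.times { counter.count! } } }
threads.each(&:join)

puts counter.total # => 8000
```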
@@ -60,6 +174,11 @@ module Quonfig
|
|
|
60
174
|
|
|
61
175
|
closed_count = 0
|
|
62
176
|
@logger.debug 'Reconnecting SSE client'
|
|
177
|
+
# SDK-driven reconnect: a closing error (HTTP::ConnectionError)
|
|
178
|
+
# closed the previous SSE::Client, so ld-eventsource's own
|
|
179
|
+
# reconnect loop has exited and won't emit the "Will retry" signal.
|
|
180
|
+
# Count it here instead (qfg-ie49).
|
|
181
|
+
count_restart!
|
|
63
182
|
@client = connect(&load_configs)
|
|
64
183
|
end
|
|
65
184
|
end
|
|
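Note on the hunk above: the comments state that each disconnect is counted by exactly one of two paths. The sketch below models that rule under a simplifying assumption: every non-closing error (and every clean FIN, represented as `nil`) is treated as a library-internal reconnect, whereas the source only says "most". `FakeConnectionError` and `counting_source` are hypothetical names.

```ruby
# Stand-in for HTTP::ConnectionError so the sketch has no gem dependency.
class FakeConnectionError < StandardError; end

ERRORS_TO_CLOSE = [FakeConnectionError].freeze

# Returns which path would count the reconnect for a single disconnect.
# `error` is nil for a clean FIN, where on_error is never invoked.
def counting_source(error)
  closing = !error.nil? && ERRORS_TO_CLOSE.any? { |klass| error.is_a?(klass) }
  closing ? :retry_thread : :library_reconnect
end

disconnects = [nil, IOError.new('reset'), FakeConnectionError.new('refused'), nil]
tally = disconnects.map { |error| counting_source(error) }.tally

puts tally.inspect
```

Because each disconnect maps to a single source, the per-source counts always sum to the number of disconnects.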
@@ -70,12 +189,20 @@ module Quonfig
|
|
|
70
189
|
cursor = current_cursor
|
|
71
190
|
@logger.debug "SSE Streaming Connect to #{url} start_at #{cursor.inspect}"
|
|
72
191
|
|
|
192
|
+
# Wrap the ld-eventsource logger so internal reconnects (clean FIN,
|
|
193
|
+
# read-timeout, transient errors) bump restart_total — they never reach
|
|
194
|
+
# on_error (qfg-ie49).
|
|
195
|
+
sse_logger = ReconnectCountingLogger.new(
|
|
196
|
+
Quonfig::InternalLogger.new(SSE::Client)
|
|
197
|
+
) { count_restart! }
|
|
198
|
+
|
|
73
199
|
SSE::Client.new(url,
|
|
74
200
|
headers: headers,
|
|
75
201
|
read_timeout: @options.sse_read_timeout,
|
|
76
202
|
reconnect_time: @options.sse_default_reconnect_time,
|
|
203
|
+
reconnect_reset_interval: @options.sse_reconnect_reset_interval,
|
|
77
204
|
last_event_id: cursor,
|
|
78
|
-
logger:
|
|
205
|
+
logger: sse_logger) do |client|
|
|
79
206
|
client.on_event do |event|
|
|
80
207
|
if event.data.nil? || event.data.empty?
|
|
81
208
|
@logger.error "SSE Streaming Error: Received empty data for url #{url}"
|
|
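Note on the hunk above: the wrapper passed as `logger:` forwards log calls and runs a block when it sees the reconnect line. The following is a minimal sketch of that idea using only the standard library. `MarkerCountingLogger` is hypothetical and is not the gem's `ReconnectCountingLogger`; in particular, watching every level and the example message text after the marker are assumptions.

```ruby
require 'logger'
require 'stringio'

# Hypothetical pass-through wrapper: forwards each call to the wrapped logger
# and runs a callback when a message contains the marker text.
class MarkerCountingLogger
  MARKER = 'Will retry connection after'
  LEVELS = %i[debug info warn error fatal].freeze

  def initialize(wrapped, &on_marker)
    @wrapped = wrapped
    @on_marker = on_marker
  end

  LEVELS.each do |level|
    define_method(level) do |message = nil, &block|
      # Support both logger.info('text') and logger.info { 'text' }.
      message = block.call if message.nil? && block
      @on_marker&.call if message.to_s.include?(MARKER)
      @wrapped.public_send(level, message) if @wrapped.respond_to?(level)
    end
  end
end

out = StringIO.new
count = 0
logger = MarkerCountingLogger.new(Logger.new(out)) { count += 1 }

logger.info('Will retry connection after 1.000 seconds')
logger.info { 'Connection opened' }
logger.warn('Will retry connection after 2.000 seconds')

puts count # => 2
```

Matching on log text is brittle if the library changes its wording, which is presumably why the changelog calls it the only hook the library exposes.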
@@ -106,6 +233,25 @@ module Quonfig
|
|
|
106
233
|
@logger.error "SSE Streaming Error: #{error.inspect} for url #{url}"
|
|
107
234
|
end
|
|
108
235
|
|
|
236
|
+
# qfg-ie49: restart_total is NOT bumped here. ld-eventsource
|
|
237
|
+
# auto-reconnects after most non-closing errors, and that reconnect
|
|
238
|
+
# is already counted via ReconnectCountingLogger; bumping here too
|
|
239
|
+
# would double-count. For closing errors (HTTP::ConnectionError) the
|
|
240
|
+
# reconnect is counted in @retry_thread instead. on_error's job is
|
|
241
|
+
# purely to notify the parent client of the disconnect edge.
|
|
242
|
+
|
|
243
|
+
# Notify the parent client BEFORE deciding whether to close — every
|
|
244
|
+
# error edge is a disconnect signal as far as @sse_state goes, even
|
|
245
|
+
# if we let the underlying SSE library handle reconnect itself.
|
|
246
|
+
# qfg-47c2.27
|
|
247
|
+
if @on_error
|
|
248
|
+
begin
|
|
249
|
+
@on_error.call(error)
|
|
250
|
+
rescue StandardError => e
|
|
251
|
+
@logger.error "SSE on_error callback raised: #{e.inspect}"
|
|
252
|
+
end
|
|
253
|
+
end
|
|
254
|
+
|
|
109
255
|
if @options.errors_to_close_connection.any? { |klass| error.is_a?(klass) }
|
|
110
256
|
@logger.debug "Closing SSE connection for url #{url}"
|
|
111
257
|
client.close
|
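Note on the hunk above: the `on_error` callback is invoked inside a `begin`/`rescue` so that a failing callback cannot break the SSE error handler. A standalone sketch of that guard follows; `notify_safely` is a hypothetical helper, not part of the gem.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: invoke an optional callback without letting its
# failure escape into the caller's error-handling path.
def notify_safely(callback, error, logger)
  return false unless callback

  callback.call(error)
  true
rescue StandardError => e
  logger.error "on_error callback raised: #{e.inspect}"
  false
end

log_io = StringIO.new
logger = Logger.new(log_io)
seen = []
error = IOError.new('socket closed')

ok = notify_safely(->(err) { seen << err.message }, error, logger)
bad = notify_safely(->(_err) { raise ArgumentError, 'boom' }, error, logger)
none = notify_safely(nil, error, logger)

puts [ok, bad, none].inspect # => [true, false, false]
```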
data/lib/quonfig/version.rb
CHANGED
|
data/lib/quonfig/worker_supervisor.rb
ADDED
|
@@ -0,0 +1,186 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Quonfig
|
|
4
|
+
# Internal control-flow exception raised inside a supervised worker thread
|
|
5
|
+
# to signal cooperative shutdown. Workers may catch and re-raise, or just
|
|
6
|
+
# propagate.
|
|
7
|
+
class Shutdown < StandardError; end
|
|
8
|
+
|
|
9
|
+
# Single supervisor for a long-lived background worker (SSE read loop,
|
|
10
|
+
# fallback poller). Catches unhandled exceptions at the worker boundary,
|
|
11
|
+
# logs them, increments +worker_restart_total+, and restarts with
|
|
12
|
+
# exponential backoff capped at 30s.
|
|
13
|
+
#
|
|
14
|
+
# Contract: integration-test-data/chaos/supervisor-test-contract.md
|
|
15
|
+
# Plan: project/plans/sdk-hardening-and-verification.md (Phase 1)
|
|
16
|
+
#
|
|
17
|
+
# The worker is a Proc-like callable invoked as +worker.call(notify_delivered)+
|
|
18
|
+
# where +notify_delivered+ is a Proc the worker calls when it has handed at
|
|
19
|
+
# least one envelope to the cache. That signal resets the backoff so a
|
|
20
|
+
# transient blip doesn't double the delay on the next disconnect.
|
|
21
|
+
#
|
|
22
|
+
# Shutdown is signaled by Thread#raise(Quonfig::Shutdown) into the
|
|
23
|
+
# supervisor thread. Logger writes and bookkeeping use Thread.handle_interrupt
|
|
24
|
+
# so a concurrent raise doesn't trip Ruby's "log writing failed" path.
|
|
25
|
+
class WorkerSupervisor
|
|
26
|
+
METRIC_NAME = 'quonfig_sdk_worker_restart_total'
|
|
27
|
+
|
|
28
|
+
DEFAULT_INITIAL_BACKOFF = 0.5
|
|
29
|
+
DEFAULT_MAX_BACKOFF = 30.0
|
|
30
|
+
DEFAULT_MULTIPLIER = 2.0
|
|
31
|
+
SHUTDOWN_TIMEOUT_SEC = 5.0
|
|
32
|
+
|
|
33
|
+
LOG = Quonfig::InternalLogger.new(self)
|
|
34
|
+
|
|
35
|
+
attr_reader :worker_restart_total, :worker_restart_labels
|
|
36
|
+
|
|
37
|
+
def initialize(name:, worker:, layer: '1',
|
|
38
|
+
initial_backoff: DEFAULT_INITIAL_BACKOFF,
|
|
39
|
+
max_backoff: DEFAULT_MAX_BACKOFF,
|
|
40
|
+
multiplier: DEFAULT_MULTIPLIER,
|
|
41
|
+
sleep_proc: nil,
|
|
42
|
+
logger: nil)
|
|
43
|
+
@name = name
|
|
44
|
+
@layer = layer.to_s
|
|
45
|
+
@worker = worker
|
|
46
|
+
@initial_backoff = initial_backoff
|
|
47
|
+
@max_backoff = max_backoff
|
|
48
|
+
@multiplier = multiplier
|
|
49
|
+
@sleep_proc = sleep_proc || ->(seconds) { sleep(seconds) }
|
|
50
|
+
@logger = logger || LOG
|
|
51
|
+
@worker_restart_total = 0
|
|
52
|
+
@worker_restart_labels = {
|
|
53
|
+
sdk: 'ruby',
|
|
54
|
+
sdk_version: Quonfig::VERSION,
|
|
55
|
+
layer: @layer
|
|
56
|
+
}.freeze
|
|
57
|
+
@mutex = Mutex.new
|
|
58
|
+
@stop_requested = false
|
|
59
|
+
@thread = nil
|
|
60
|
+
@current_backoff = @initial_backoff
|
|
61
|
+
end
|
|
62
|
+
|
|
63
|
+
def start
|
|
64
|
+
@mutex.synchronize do
|
|
65
|
+
return self if @thread&.alive?
|
|
66
|
+
|
|
67
|
+
@stop_requested = false
|
|
68
|
+
ready = Queue.new
|
|
69
|
+
@thread = Thread.new do
|
|
70
|
+
# Set report_on_exception + signal "ready" BEFORE entering
|
|
71
|
+
# run_loop. start() blocks on the ready queue so a racing stop()
|
|
72
|
+
# can never raise into a thread that hasn't yet installed its
|
|
73
|
+
# Shutdown rescue.
|
|
74
|
+
Thread.current.report_on_exception = false
|
|
75
|
+
ready << true
|
|
76
|
+
run_loop
|
|
77
|
+
rescue Quonfig::Shutdown
|
|
78
|
+
# cooperative shutdown raced with thread startup; swallowed
|
|
79
|
+
end
|
|
80
|
+
ready.pop
|
|
81
|
+
end
|
|
82
|
+
self
|
|
83
|
+
end
|
|
84
|
+
|
|
85
|
+
def alive?
|
|
86
|
+
t = @thread
|
|
87
|
+
!t.nil? && t.alive?
|
|
88
|
+
end
|
|
89
|
+
|
|
90
|
+
def stop
|
|
91
|
+
thread = @mutex.synchronize do
|
|
92
|
+
@stop_requested = true
|
|
93
|
+
t = @thread
|
|
94
|
+
@thread = nil
|
|
95
|
+
t
|
|
96
|
+
end
|
|
97
|
+
return if thread.nil?
|
|
98
|
+
|
|
99
|
+
raise_shutdown(thread)
|
|
100
|
+
thread.join(SHUTDOWN_TIMEOUT_SEC)
|
|
101
|
+
thread.kill if thread.alive?
|
|
102
|
+
nil
|
|
103
|
+
end
|
|
104
|
+
|
|
105
|
+
alias close stop
|
|
106
|
+
|
|
107
|
+
private
|
|
108
|
+
|
|
109
|
+
def raise_shutdown(thread)
|
|
110
|
+
return if thread.nil?
|
|
111
|
+
return unless thread.alive?
|
|
112
|
+
|
|
113
|
+
begin
|
|
114
|
+
thread.raise(Quonfig::Shutdown.new('supervisor stopping'))
|
|
115
|
+
rescue ThreadError
|
|
116
|
+
# thread already exited between alive? and raise — fine
|
|
117
|
+
end
|
|
118
|
+
end
|
|
119
|
+
|
|
120
|
+
def run_loop
|
|
121
|
+
Thread.current.name = "quonfig-supervisor-#{@name}"
|
|
122
|
+
# Don't dump our managed Shutdown to stderr on shutdown.
|
|
123
|
+
Thread.current.report_on_exception = false
|
|
124
|
+
|
|
125
|
+
loop do
|
|
126
|
+
break if stop?
|
|
127
|
+
|
|
128
|
+
delivered = false
|
|
129
|
+
notify_delivered = -> { delivered = true }
|
|
130
|
+
reason = :worker_exit
|
|
131
|
+
|
|
132
|
+
begin
|
|
133
|
+
@worker.call(notify_delivered)
|
|
134
|
+
rescue Quonfig::Shutdown
|
|
135
|
+
break
|
|
136
|
+
rescue StandardError => e
|
|
137
|
+
reason = :worker_throw
|
|
138
|
+
safe_log(:error,
|
|
139
|
+
"[quonfig] supervisor=#{@name} worker raised #{e.class}: #{e.message}")
|
|
140
|
+
bt = e.backtrace&.first(10)&.join("\n")
|
|
141
|
+
safe_log(:debug, bt) if bt
|
|
142
|
+
end
|
|
143
|
+
|
|
144
|
+
break if stop?
|
|
145
|
+
|
|
146
|
+
@worker_restart_total += 1
|
|
147
|
+
@current_backoff = @initial_backoff if delivered
|
|
148
|
+
backoff = @current_backoff
|
|
149
|
+
|
|
150
|
+
safe_log(:warn,
|
|
151
|
+
"[quonfig] supervisor=#{@name} restarting worker " \
|
|
152
|
+
"(reason=#{reason}, restart_total=#{@worker_restart_total}, " \
|
|
153
|
+
"backoff_s=#{backoff})")
|
|
154
|
+
|
|
155
|
+
begin
|
|
156
|
+
@sleep_proc.call(backoff)
|
|
157
|
+
rescue Quonfig::Shutdown
|
|
158
|
+
break
|
|
159
|
+
end
|
|
160
|
+
|
|
161
|
+
@current_backoff = [@current_backoff * @multiplier, @max_backoff].min
|
|
162
|
+
end
|
|
163
|
+
rescue Quonfig::Shutdown
|
|
164
|
+
# supervisor-level cooperative shutdown
|
|
165
|
+
rescue StandardError => e
|
|
166
|
+
safe_log(:error, "[quonfig] supervisor=#{@name} crashed: #{e.class}: #{e.message}")
|
|
167
|
+
end
|
|
168
|
+
|
|
169
|
+
def stop?
|
|
170
|
+
@mutex.synchronize { @stop_requested }
|
|
171
|
+
end
|
|
172
|
+
|
|
173
|
+
# Defer Shutdown delivery while we're inside Logger.write so we don't
|
|
174
|
+
# trip Logger's "log writing failed" -> stderr fallback. Swallow any
|
|
175
|
+
# other logger error.
|
|
176
|
+
def safe_log(level, msg)
|
|
177
|
+
return unless @logger.respond_to?(level)
|
|
178
|
+
|
|
179
|
+
Thread.handle_interrupt(Quonfig::Shutdown => :never) do
|
|
180
|
+
@logger.public_send(level, msg)
|
|
181
|
+
end
|
|
182
|
+
rescue StandardError
|
|
183
|
+
nil
|
|
184
|
+
end
|
|
185
|
+
end
|
|
186
|
+
end
|
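Note on the file above: the restart delay in `run_loop` follows a simple rule: reset to the initial backoff if the worker delivered data, sleep for the current value, then multiply and cap. The sketch below reproduces only that arithmetic with the default constants from the diff; `backoff_schedule` is a hypothetical helper and does no sleeping or threading.

```ruby
# Hypothetical helper mirroring the backoff arithmetic in run_loop.
# delivered_flags: one boolean per worker run, true if that run delivered
# at least one envelope before exiting.
def backoff_schedule(delivered_flags, initial: 0.5, multiplier: 2.0, max: 30.0)
  current = initial
  delivered_flags.map do |delivered|
    current = initial if delivered # a delivery resets the backoff
    slept = current                # the delay used for this restart
    current = [current * multiplier, max].min
    slept
  end
end

sustained = backoff_schedule([false] * 8)
blip = backoff_schedule([false, false, true, false])

puts sustained.inspect # => [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
puts blip.inspect      # => [0.5, 1.0, 0.5, 1.0]
```

The second schedule shows the point made in the class comment: once a run has delivered data, the next disconnect starts again from the initial delay instead of continuing to double.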
data/lib/quonfig.rb
CHANGED
|
@@ -29,6 +29,7 @@ require 'quonfig/evaluation'
|
|
|
29
29
|
require 'quonfig/evaluation_details'
|
|
30
30
|
require 'quonfig/encryption'
|
|
31
31
|
require 'quonfig/exponential_backoff'
|
|
32
|
+
require 'quonfig/worker_supervisor'
|
|
32
33
|
require 'quonfig/periodic_sync'
|
|
33
34
|
require 'quonfig/errors/initialization_timeout_error'
|
|
34
35
|
require 'quonfig/errors/invalid_sdk_key_error'
|
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: quonfig
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.0.
|
|
4
|
+
version: 0.0.15
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Jeff Dwyer
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: bin
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-05-
|
|
11
|
+
date: 2026-05-15 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: activesupport
|
|
@@ -134,6 +134,7 @@ files:
|
|
|
134
134
|
- lib/quonfig/types.rb
|
|
135
135
|
- lib/quonfig/version.rb
|
|
136
136
|
- lib/quonfig/weighted_value_resolver.rb
|
|
137
|
+
- lib/quonfig/worker_supervisor.rb
|
|
137
138
|
- quonfig.gemspec
|
|
138
139
|
homepage: https://github.com/quonfig/sdk-ruby
|
|
139
140
|
licenses:
|