quonfig 0.0.15 → 0.0.16
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +37 -11
- data/lib/quonfig/client.rb +168 -23
- data/lib/quonfig/sse_config_client.rb +536 -225
- data/lib/quonfig/version.rb +1 -1
- data/lib/quonfig.rb +1 -1
- data/quonfig.gemspec +0 -1
- metadata +1 -15
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 6f167f3b60db07394dc7c49b85c3dbc196e0b5c82f3426b35695c0f212339b8b
|
|
4
|
+
data.tar.gz: 4b79e1196c4625359943255a348d907c28865a5cd85432ac464737406d7a6169
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 5aa3a23774245bf31752e4c9918de8bf37cc865e15b6ed160b222181d805a0fe477064cc5cf27dc6810b87cdd1c250f8558b650c98fd5e7a354c3e2e70090c53
|
|
7
|
+
data.tar.gz: 82c4561817b40e4dd0ecfd1b5267e6a5f41ea2e774c2d2537400aaf04886eb579807da457f6964915a065a32dc7beea043a2797d83ff04dba3c9fb4e46c39cb3
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,13 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.0.16 - 2026-05-15
|
|
4
|
+
|
|
5
|
+
- **Feat (SSE): replace `ld-eventsource` with an SDK-owned reconnect loop (qfg-35sm).** sdk-ruby was the outlier among the four backend SDKs — sdk-go, sdk-node, and sdk-python all own their reconnect loop, only sdk-ruby handed it off to a library and scraped its log output to observe reconnects. The wire format we actually consume (plain JSON envelopes in single-line `data:` frames, no named events, no retry directives) is trivial enough that an SDK-owned loop is clearer than the library wrapper. New `Quonfig::SSEConfigClient` (~520 LoC, `lib/quonfig/sse_config_client.rb`) handles connect/parse/reconnect end-to-end. `restart_total` is now incremented at exactly one site under a mutex — verifiable, not log-scraped. `ld-eventsource` and the transitive `http` gem are removed from the gemspec. `ReconnectCountingLogger` and the `sse_reconnect_reset_interval` option (both 0.0.15-era defensive scaffolding around upstream behavior) are deleted — the bugs they defended against don't exist when the SDK owns the loop. Chaos: 10/10 in a 36-min run (scenarios 02 silent-stall, 05 sse-down-fallback, 09 flapping kill-storm).
|
|
6
|
+
- **Fix (SSE): contain watchdog `Thread#raise` with `Thread.handle_interrupt` (qfg-tj18).** The new watchdog fires `Thread#raise(SSEReadDeadlineExceeded)` into the worker on a silent stall — the only reliable cross-platform way to unblock a Ruby thread blocked in `Net::HTTP`'s body-read on macOS. The decision to fire is mutex-guarded against `stop()`, but the raise itself is delivered at the worker's next interrupt checkpoint, which could be anywhere in the call stack. `run_loop`'s body now runs under `Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)` so a late-landing raise can only land inside a blocking call; the `read_body` block explicitly switches to `:immediate`. A paranoid backstop `rescue` outside the until-`@stopped` loop ensures an escaped raise can never silently kill the worker.
|
|
7
|
+
- **Fix (SSE): isolate `on_envelope` callback exceptions (qfg-m3lk).** A buggy user-supplied listener that raised during envelope delivery used to propagate out of `read_body`, get caught by `run_loop` as a transport error, bump `restart_total`, and reconnect — a perpetual reconnect storm at api-delivery-sse driven by a customer code bug. The callback is now wrapped in `begin/rescue StandardError` at the invocation site; exceptions are logged with class + message + backtrace sample and the stream continues uninterrupted. `Interrupt` and `SystemExit` are deliberately not caught so `Ctrl-C` still works.
|
|
8
|
+
- **Fix (SSE): classify 401/403/404 as terminal errors (qfg-i5xv).** Non-200 responses used to be treated identically — `SSEHTTPStatusError` raised, `run_loop` rescued, `restart_total` bumped, backoff, retry, forever. For a bad SDK key (401) or revoked workspace (403) that was wasted load on api-delivery-sse with no recovery path short of a customer redeploy. New `SSEHTTPTerminalError` sentinel for 401/403/404; `run_loop` catches it, invokes `on_error`, exits the loop without bumping `restart_total`. Parent `Quonfig::Client` surfaces a terminal `:sse_terminal_failure` state distinct from transient `:error`. 429 and 5xx still retry.
|
|
9
|
+
- **Feat (fork): install `Process._fork` hook so SSE auto-restarts after fork (qfg-ryov).** Ruby threads do not survive `fork(2)`. Customers initializing `Quonfig::Client` in the Puma master (the `preload_app! true` / Rails `config.eager_load = true` convention) used to silently lose SSE in every worker child. New `Quonfig::ForkSafety` module prepends `Process._fork` and fans out across all live `Quonfig::Client` instances (tracked in an `ObjectSpace::WeakMap`): in the parent before the syscall, threaded components (SSE worker, polling supervisor, telemetry reporter) are torn down; in the child after the syscall, they are rebuilt. `@stopped` is preserved so a `stop()`-ed client stays stopped across fork. Covers `Process.fork` / `Kernel#fork`; `Process.spawn` and `system("...")` exec a new program so in-process state doesn't apply. Ruby 3.0 lacks `Process._fork` and is documented as requiring manual `before_fork` / `on_worker_boot` wiring.
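The terminal-versus-transient split described in the qfg-i5xv entry above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SDK's implementation: `classify_status`, `run_reconnect_loop`, the `connect` callable, and `max_attempts` are hypothetical names. Only the 401/403/404 set, the rule that a terminal status exits without bumping the counter, and the `:sse_terminal_failure` state name come from the changelog. Backoff sleeps are omitted.

```ruby
# Hypothetical sketch of terminal vs. transient HTTP status handling.
# Every method name here is illustrative, not part of the quonfig API.
TERMINAL_HTTP_CODES = [401, 403, 404].freeze

# Returns :ok, :terminal, or :transient for an HTTP status code.
def classify_status(code)
  return :ok if code == 200

  TERMINAL_HTTP_CODES.include?(code) ? :terminal : :transient
end

# Calls `connect` (any callable returning an HTTP status) until it succeeds,
# hits a terminal status, or exhausts `max_attempts`.
def run_reconnect_loop(connect, max_attempts: 10)
  restart_total = 0
  max_attempts.times do
    case classify_status(connect.call)
    when :ok
      return { state: :connected, restart_total: restart_total }
    when :terminal
      # Retrying cannot fix a bad key or a revoked workspace: exit without
      # touching the reconnect counter.
      return { state: :sse_terminal_failure, restart_total: restart_total }
    else
      # Transient (429, 5xx, ...): count the reconnect and try again.
      # A real loop would sleep with backoff here.
      restart_total += 1
    end
  end
  { state: :error, restart_total: restart_total }
end
```

Keeping the counter increment on the transient branch only means a non-zero `restart_total` always points at a transport problem, never at a credential problem.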
|
|
10
|
+
|
|
3
11
|
## 0.0.15 - 2026-05-15
|
|
4
12
|
|
|
5
13
|
- **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
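The callback isolation from the 0.0.16 entry (qfg-m3lk) can be illustrated with a small sketch. It assumes a hypothetical `EnvelopeDispatcher` class and an injectable `log` callable; neither exists in the gem. It shows only the general shape: rescue `StandardError` at the invocation site, count and log the failure, and keep delivering.

```ruby
# Hypothetical sketch: deliver events to a user callback without letting a
# listener bug escape into the transport loop. Names are illustrative.
class EnvelopeDispatcher
  attr_reader :error_total, :delivered_total

  def initialize(log: ->(msg) { warn(msg) }, &on_envelope)
    @on_envelope = on_envelope
    @log = log
    @error_total = 0
    @delivered_total = 0
  end

  # Invokes the callback for one event. StandardError is contained and
  # counted. Interrupt and SystemExit are not StandardError subclasses,
  # so they still propagate and Ctrl-C keeps working.
  def dispatch(event)
    @on_envelope.call(event)
    @delivered_total += 1
  rescue StandardError => e
    @error_total += 1
    sample = Array(e.backtrace).first(3).join(' | ')
    @log.call("on_envelope raised #{e.class}: #{e.message} (#{sample})")
  end
end
```

A caller would wrap its listener once, for example `EnvelopeDispatcher.new { |event| apply(event) }` where `apply` is the caller's own method, and call `dispatch` for each parsed frame.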
|
data/README.md
CHANGED
|
@@ -247,15 +247,41 @@ not want to inherit across a `fork(2)`. Forked threads in the child process
|
|
|
247
247
|
are dead — the SSE socket is held open by a thread that no longer exists, and
|
|
248
248
|
the child silently stops receiving live updates.
|
|
249
249
|
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
250
|
+
**On Ruby 3.1+ the SDK installs a `Process._fork` hook at load time** that
|
|
251
|
+
automatically tears down threaded components in the parent and restarts them
|
|
252
|
+
in the child. This covers any `Process.fork` / `Kernel#fork` path — Puma's
|
|
253
|
+
clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Spring, and
|
|
254
|
+
manual `fork { ... }` calls. **No customer wiring is required.**
|
|
255
|
+
|
|
256
|
+
Caveats:
|
|
257
|
+
|
|
258
|
+
- Ruby 3.0 has no hookable choke point — fall back to manual wiring (below).
|
|
259
|
+
- `system("fork-and-exec ...")` and `Process.spawn` are not covered (they do
|
|
260
|
+
not go through `Process._fork`), but those execute a new program, so the
|
|
261
|
+
in-process SSE state is moot.
|
|
262
|
+
- The hook tears down the SSE/polling/telemetry threads in the parent before
|
|
263
|
+
fork (so the child does not inherit a live socket fd) and does **not**
|
|
264
|
+
auto-restart the parent. This mirrors the Puma master case: the master no
|
|
265
|
+
longer serves requests, so it does not need a live SSE connection. If you
|
|
266
|
+
have a non-Puma topology where the parent must keep streaming after fork,
|
|
267
|
+
call `Quonfig.instance.after_fork_in_child` manually in the parent after
|
|
268
|
+
the fork returns.
|
|
254
269
|
|
|
255
270
|
### Puma (clustered mode)
|
|
256
271
|
|
|
272
|
+
With the automatic fork hook, the typical Puma config needs **no Quonfig
|
|
273
|
+
lifecycle wiring** — initialize in your Rails initializer and let the hook
|
|
274
|
+
handle the rest:
|
|
275
|
+
|
|
257
276
|
```ruby
|
|
258
|
-
# config/
|
|
277
|
+
# config/initializers/quonfig.rb
|
|
278
|
+
Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
|
|
279
|
+
```
|
|
280
|
+
|
|
281
|
+
If you're on Ruby 3.0 (no `Process._fork`), wire the legacy hooks manually:
|
|
282
|
+
|
|
283
|
+
```ruby
|
|
284
|
+
# config/puma.rb (Ruby 3.0 only)
|
|
259
285
|
before_fork do
|
|
260
286
|
Quonfig.instance.stop # close the master's SSE before forking
|
|
261
287
|
end
|
|
@@ -265,18 +291,18 @@ on_worker_boot do
|
|
|
265
291
|
end
|
|
266
292
|
```
|
|
267
293
|
|
|
268
|
-
If you initialize Quonfig lazily (in a Rails initializer) and run Puma in
|
|
269
|
-
single mode (no clustering), no fork hook is needed.
|
|
270
|
-
|
|
271
294
|
### Sidekiq
|
|
272
295
|
|
|
273
|
-
|
|
296
|
+
On Ruby 3.1+ the automatic fork hook covers Sidekiq workers too — no
|
|
297
|
+
`configure_server` wiring required.
|
|
298
|
+
|
|
299
|
+
On Ruby 3.0:
|
|
274
300
|
|
|
275
301
|
```ruby
|
|
276
302
|
# config/initializers/quonfig.rb
|
|
277
303
|
Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
|
|
278
304
|
|
|
279
|
-
# config/initializers/sidekiq.rb
|
|
305
|
+
# config/initializers/sidekiq.rb (Ruby 3.0 only)
|
|
280
306
|
Sidekiq.configure_server do |config|
|
|
281
307
|
config.on(:startup) { Quonfig.fork if Process.ppid != 1 }
|
|
282
308
|
config.on(:shutdown) { Quonfig.instance.stop rescue nil }
|
|
@@ -284,7 +310,7 @@ end
|
|
|
284
310
|
```
|
|
285
311
|
|
|
286
312
|
For Sidekiq web/CLI processes that don't fork (default `concurrency: 1`),
|
|
287
|
-
`Quonfig.init` in the initializer is sufficient.
|
|
313
|
+
`Quonfig.init` in the initializer is sufficient on any Ruby version.
|
|
288
314
|
|
|
289
315
|
### Spring / Bootsnap preloaders
|
|
290
316
|
|
data/lib/quonfig/client.rb
CHANGED
|
@@ -20,6 +20,29 @@ module Quonfig
|
|
|
20
20
|
class Client
|
|
21
21
|
LOG = Quonfig::InternalLogger.new(self)
|
|
22
22
|
|
|
23
|
+
# qfg-ryov: instance registry for the Process._fork hook. Every live
|
|
24
|
+
# Client is tracked here so the hook can fan out before_fork_in_parent /
|
|
25
|
+
# after_fork_in_child across all of them without the customer needing to
|
|
26
|
+
# name a specific instance. ObjectSpace::WeakMap means a Client that goes
|
|
27
|
+
# out of scope is GC'd without leaking through this registry. Stopped
|
|
28
|
+
# Clients stay in the registry until GC; both fork hooks early-return on
|
|
29
|
+
# +@stopped+ so a stopped instance is effectively a no-op. (We don't use
|
|
30
|
+
# WeakMap#delete because it was added in Ruby 3.3 and the matrix still
|
|
31
|
+
# includes 3.2.)
|
|
32
|
+
@instances = ObjectSpace::WeakMap.new
|
|
33
|
+
@instances_mutex = Mutex.new
|
|
34
|
+
|
|
35
|
+
class << self
|
|
36
|
+
# Iterate live Client instances. Used by Quonfig::ForkSafety.
|
|
37
|
+
def each_instance(&block)
|
|
38
|
+
@instances_mutex.synchronize { @instances.keys }.each(&block)
|
|
39
|
+
end
|
|
40
|
+
|
|
41
|
+
def register_instance(client)
|
|
42
|
+
@instances_mutex.synchronize { @instances[client] = true }
|
|
43
|
+
end
|
|
44
|
+
end
|
|
45
|
+
|
|
23
46
|
attr_reader :options, :resolver, :store, :evaluator, :instance_hash,
|
|
24
47
|
:config_loader, :telemetry_reporter
|
|
25
48
|
|
|
@@ -48,6 +71,7 @@ module Quonfig
|
|
|
48
71
|
@sse_state = :idle
|
|
49
72
|
@sse_ever_connected = false
|
|
50
73
|
@fallback_engage_timer = nil
|
|
74
|
+
@sse_terminal_failure = false
|
|
51
75
|
|
|
52
76
|
# If the caller injected a store, we're in test/bootstrap mode; skip I/O.
|
|
53
77
|
return if store
|
|
@@ -59,6 +83,10 @@ module Quonfig
|
|
|
59
83
|
end
|
|
60
84
|
|
|
61
85
|
initialize_telemetry
|
|
86
|
+
|
|
87
|
+
# Register only for non-store-injected clients (a caller-supplied store
|
|
88
|
+
# is the test/bootstrap path; the fork hook does not apply there).
|
|
89
|
+
self.class.register_instance(self) unless store
|
|
62
90
|
end
|
|
63
91
|
|
|
64
92
|
# ---- Lookup --------------------------------------------------------
|
|
@@ -264,34 +292,52 @@ module Quonfig
|
|
|
264
292
|
|
|
265
293
|
def stop
|
|
266
294
|
@stopped = true
|
|
267
|
-
|
|
268
|
-
|
|
269
|
-
rescue StandardError => e
|
|
270
|
-
LOG.debug "Error closing SSE client: #{e.message}"
|
|
271
|
-
end
|
|
272
|
-
@sse_client = nil
|
|
295
|
+
tear_down_threaded_components!
|
|
296
|
+
end
|
|
273
297
|
|
|
274
|
-
|
|
298
|
+
# qfg-ryov: pre-fork hook. Close the SSE worker, polling supervisor,
|
|
299
|
+
# telemetry reporter, and any fallback-engage timer. Idempotent — calling
|
|
300
|
+
# twice is safe. Does NOT set @stopped: the client is still expected to
|
|
301
|
+
# be usable post-fork via after_fork_in_child.
|
|
302
|
+
#
|
|
303
|
+
# Why this matters: Ruby threads do not survive fork(2). If we let the
|
|
304
|
+
# child inherit a live Net::HTTP socket, both processes read from the
|
|
305
|
+
# same fd and corrupt each other's bytes. Closing in the parent before
|
|
306
|
+
# fork is the only safe shape.
|
|
307
|
+
def before_fork_in_parent
|
|
308
|
+
return if @stopped
|
|
275
309
|
|
|
276
|
-
|
|
277
|
-
|
|
278
|
-
rescue StandardError => e
|
|
279
|
-
LOG.debug "Error stopping poll supervisor: #{e.message}"
|
|
280
|
-
end
|
|
281
|
-
@poll_supervisor = nil
|
|
310
|
+
tear_down_threaded_components!
|
|
311
|
+
end
|
|
282
312
|
|
|
283
|
-
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
313
|
+
# qfg-ryov: post-fork (in child) hook. Re-establish whatever threaded
|
|
314
|
+
# components the client had pre-fork. No-op if the client was already
|
|
315
|
+
# stopped (the customer asked for it to be dead — do not resurrect),
|
|
316
|
+
# or if the client is in datadir mode (no threaded components to start).
|
|
317
|
+
def after_fork_in_child
|
|
318
|
+
return if @stopped
|
|
319
|
+
return if @options.datadir
|
|
320
|
+
return if @config_loader.nil? # never finished network init (e.g. invalid key)
|
|
321
|
+
|
|
322
|
+
# SSE state machine carries flags that no longer apply in the child
|
|
323
|
+
# (the parent had connected, the parent had errored, etc.). Reset.
|
|
324
|
+
@state_mutex.synchronize do
|
|
325
|
+
@sse_state = :idle
|
|
326
|
+
@sse_ever_connected = false
|
|
327
|
+
@sse_terminal_failure = false
|
|
287
328
|
end
|
|
288
|
-
|
|
329
|
+
|
|
330
|
+
sse_started = @options.enable_sse && start_sse
|
|
331
|
+
start_polling if @options.enable_polling && !sse_started
|
|
332
|
+
|
|
333
|
+
restart_telemetry_in_child
|
|
289
334
|
end
|
|
290
335
|
|
|
291
336
|
# quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
|
|
292
337
|
# Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
|
|
293
|
-
# incremented
|
|
294
|
-
# Layer 2 (HTTP polling fallback) is wired through
|
|
338
|
+
# incremented once per reconnect attempt by the SDK-owned reconnect
|
|
339
|
+
# loop (qfg-35sm). Layer 2 (HTTP polling fallback) is wired through
|
|
340
|
+
# Quonfig::WorkerSupervisor.
|
|
295
341
|
#
|
|
296
342
|
# Pass +layer:+ ('1' or '2') to read a single layer; default returns the
|
|
297
343
|
# sum across both layers so the chaos harness (and operators) can pull
|
|
@@ -357,6 +403,41 @@ module Quonfig
|
|
|
357
403
|
|
|
358
404
|
private
|
|
359
405
|
|
|
406
|
+
# Close every threaded component and drop its reference. Used by both
|
|
407
|
+
# +stop+ (where @stopped is also flipped) and +before_fork_in_parent+
|
|
408
|
+
# (where @stopped is left alone so the child can restart).
|
|
409
|
+
def tear_down_threaded_components!
|
|
410
|
+
begin
|
|
411
|
+
@sse_client&.close
|
|
412
|
+
rescue StandardError => e
|
|
413
|
+
LOG.debug "Error closing SSE client: #{e.message}"
|
|
414
|
+
end
|
|
415
|
+
@sse_client = nil
|
|
416
|
+
|
|
417
|
+
cancel_fallback_engage_timer
|
|
418
|
+
|
|
419
|
+
begin
|
|
420
|
+
@poll_supervisor&.stop
|
|
421
|
+
rescue StandardError => e
|
|
422
|
+
LOG.debug "Error stopping poll supervisor: #{e.message}"
|
|
423
|
+
end
|
|
424
|
+
@poll_supervisor = nil
|
|
425
|
+
|
|
426
|
+
begin
|
|
427
|
+
@telemetry_reporter&.stop
|
|
428
|
+
rescue StandardError => e
|
|
429
|
+
LOG.debug "Error stopping telemetry reporter: #{e.message}"
|
|
430
|
+
end
|
|
431
|
+
@telemetry_reporter = nil
|
|
432
|
+
end
|
|
433
|
+
|
|
434
|
+
# Rebuild the telemetry reporter in the child after fork. Mirrors the
|
|
435
|
+
# original initialize_telemetry path — fresh aggregators, fresh reporter.
|
|
436
|
+
def restart_telemetry_in_child
|
|
437
|
+
@telemetry_reporter = nil
|
|
438
|
+
initialize_telemetry
|
|
439
|
+
end
|
|
440
|
+
|
|
360
441
|
# Stamp +last_successful_refresh+ at install time. Called by every code
|
|
361
442
|
# path that hands an envelope to the cache: datadir load, initial HTTP
|
|
362
443
|
# fetch, SSE event apply, and polling worker fetch.
|
|
@@ -402,20 +483,31 @@ module Quonfig
|
|
|
402
483
|
@sse_error_callback ||= ->(error) { handle_sse_error(error) }
|
|
403
484
|
end
|
|
404
485
|
|
|
405
|
-
def handle_sse_error(
|
|
486
|
+
def handle_sse_error(error)
|
|
487
|
+
# qfg-i5xv: classify terminal HTTP failures (401/403/404). The same SDK
|
|
488
|
+
# key that won't auth over SSE won't auth over HTTP polling either, so
|
|
489
|
+
# we must NOT engage the Layer 2 fallback — that just moves the
|
|
490
|
+
# auth-failure storm from one endpoint to another. Once flipped,
|
|
491
|
+
# @sse_terminal_failure latches: a buggy customer retry loop cannot
|
|
492
|
+
# un-classify the failure by driving the state machine.
|
|
493
|
+
@state_mutex.synchronize { @sse_terminal_failure = true } if error.is_a?(Quonfig::SSEConfigClient::SSEHTTPTerminalError)
|
|
406
494
|
handle_sse_state_change(:error)
|
|
407
495
|
end
|
|
408
496
|
|
|
409
497
|
def handle_sse_state_change(new_state)
|
|
410
498
|
state = new_state.to_sym
|
|
411
|
-
ever_connected = @state_mutex.synchronize do
|
|
499
|
+
ever_connected, terminal = @state_mutex.synchronize do
|
|
412
500
|
@sse_state = state
|
|
413
501
|
@sse_ever_connected = true if state == :connected
|
|
414
|
-
@sse_ever_connected
|
|
502
|
+
[@sse_ever_connected, @sse_terminal_failure]
|
|
415
503
|
end
|
|
416
504
|
|
|
417
505
|
return unless @options.respond_to?(:enable_polling) && @options.enable_polling
|
|
418
506
|
return if @stopped
|
|
507
|
+
# qfg-i5xv: a terminal SSE classification suppresses polling engage in
|
|
508
|
+
# every branch — the customer's key is bad and HTTP polling will fail
|
|
509
|
+
# identically. Operators surface this via #terminal_failure?.
|
|
510
|
+
return if terminal
|
|
419
511
|
|
|
420
512
|
case state
|
|
421
513
|
when :connected
|
|
@@ -430,6 +522,21 @@ module Quonfig
|
|
|
430
522
|
end
|
|
431
523
|
end
|
|
432
524
|
|
|
525
|
+
public
|
|
526
|
+
|
|
527
|
+
# qfg-i5xv: true once the SSE layer has classified an HTTP response as
|
|
528
|
+
# terminal (401/403/404) — bad SDK key, revoked workspace permission,
|
|
529
|
+
# or wrong endpoint. The classification latches: the SDK will not
|
|
530
|
+
# auto-recover, and a customer-supplied retry must rebuild the client.
|
|
531
|
+
# Surfaced for operator alerting; `connection_state` still reports
|
|
532
|
+
# `:disconnected` to honor the documented connection_state vocabulary
|
|
533
|
+
# (supervisor-test-contract.md §"connectionState()" — values fixed).
|
|
534
|
+
def terminal_failure?
|
|
535
|
+
@state_mutex.synchronize { @sse_terminal_failure }
|
|
536
|
+
end
|
|
537
|
+
|
|
538
|
+
private
|
|
539
|
+
|
|
433
540
|
def cancel_fallback_engage_timer
|
|
434
541
|
timer = @state_mutex.synchronize do
|
|
435
542
|
t = @fallback_engage_timer
|
|
@@ -904,4 +1011,42 @@ module Quonfig
|
|
|
904
1011
|
end
|
|
905
1012
|
end
|
|
906
1013
|
end
|
|
1014
|
+
|
|
1015
|
+
# qfg-ryov: hook into Process._fork so customers using Puma's clustered
|
|
1016
|
+
# mode (or any preload/fork-worker server) don't have to wire
|
|
1017
|
+
# +before_fork+/+on_worker_boot+ manually. Ruby 3.1+ routes every
|
|
1018
|
+
# +Kernel#fork+/+Process.fork+ call through +Process._fork+, so a single
|
|
1019
|
+
# prepend covers them all.
|
|
1020
|
+
#
|
|
1021
|
+
# Process._fork's contract:
|
|
1022
|
+
# - Called in the parent process before the fork syscall.
|
|
1023
|
+
# - Returns 0 in the child, child's pid in the parent.
|
|
1024
|
+
# - +super+ performs the actual fork.
|
|
1025
|
+
#
|
|
1026
|
+
# The parent's view: SSE/polling/telemetry threads are torn down before
|
|
1027
|
+
# the syscall so the child does not inherit a live Net::HTTP socket fd
|
|
1028
|
+
# (which would corrupt both sides). The parent does NOT auto-restart —
|
|
1029
|
+
# that mirrors the Puma master use case where the master process no
|
|
1030
|
+
# longer serves requests after spawning workers.
|
|
1031
|
+
module ForkSafety
|
|
1032
|
+
def _fork
|
|
1033
|
+
Quonfig::Client.each_instance(&:before_fork_in_parent)
|
|
1034
|
+
pid = super
|
|
1035
|
+
Quonfig::Client.each_instance(&:after_fork_in_child) if pid.zero?
|
|
1036
|
+
pid
|
|
1037
|
+
rescue StandardError => e
|
|
1038
|
+
# Fork-hook failures must never break the customer's fork. Worst case
|
|
1039
|
+
# the child inherits dead SSE threads (the pre-qfg-ryov behavior) —
|
|
1040
|
+
# bad, but recoverable. Crashing the fork itself is not.
|
|
1041
|
+
Quonfig::Client::LOG.error "Quonfig fork hook error: #{e.class}: #{e.message}"
|
|
1042
|
+
raise if pid.nil? # super never returned — propagate fork failures
|
|
1043
|
+
|
|
1044
|
+
pid
|
|
1045
|
+
end
|
|
1046
|
+
end
|
|
1047
|
+
|
|
1048
|
+
# Ruby 3.0 lacks Process._fork. There's no hookable choke point on 3.0, so
|
|
1049
|
+
# customers must keep wiring their own Puma before_fork / on_worker_boot
|
|
1050
|
+
# (see README "Rails integration"). On 3.1+ we install the hook globally.
|
|
1051
|
+
Process.singleton_class.prepend(ForkSafety) if Process.respond_to?(:_fork)
|
|
907
1052
|
end
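The prepend-plus-registry pattern used by `ForkSafety` can be tried in isolation with the sketch below. It is an illustration under assumptions, not the gem's code: `Worker`, `ForkHooks`, and `FakeProcess` are hypothetical, and `FakeProcess` stands in for `Process` so the example runs without forking. On Ruby 3.1+ the same module would instead be installed with `Process.singleton_class.prepend(...)`, as the diff does.

```ruby
# Hypothetical sketch of a Process._fork-style hook with a weak registry.
class Worker
  REGISTRY = ObjectSpace::WeakMap.new
  REGISTRY_MUTEX = Mutex.new

  attr_reader :events

  def self.each_instance(&block)
    # Snapshot the live keys under the lock, then iterate outside it.
    REGISTRY_MUTEX.synchronize { REGISTRY.keys }.each(&block)
  end

  def initialize
    @events = []
    @stopped = false
    REGISTRY_MUTEX.synchronize { REGISTRY[self] = true }
  end

  def stop
    @stopped = true
  end

  # Both hooks are no-ops once the worker has been stopped.
  def before_fork_in_parent
    @events << :torn_down unless @stopped
  end

  def after_fork_in_child
    @events << :restarted unless @stopped
  end
end

module ForkHooks
  def _fork
    Worker.each_instance(&:before_fork_in_parent)
    pid = super # the wrapped implementation performs the "fork"
    Worker.each_instance(&:after_fork_in_child) if pid.zero?
    pid
  end
end

# Stand-in for Process: `_fork` returns whatever pid the caller configures
# (0 simulates the child side, anything else the parent side).
class FakeProcess
  class << self
    attr_accessor :next_pid

    def _fork
      next_pid
    end
  end
  singleton_class.prepend(ForkHooks)
end
```

Because the registry holds its keys weakly, a worker that goes out of scope drops out of the fan-out after garbage collection without any explicit deregistration.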
|
data/lib/quonfig/sse_config_client.rb
CHANGED
|
@@ -2,300 +2,611 @@
|
|
|
2
2
|
|
|
3
3
|
require 'base64'
|
|
4
4
|
require 'json'
|
|
5
|
+
require 'net/http'
|
|
6
|
+
require 'uri'
|
|
5
7
|
|
|
6
8
|
module Quonfig
|
|
9
|
+
# Event delivered to on_envelope. +id+ mirrors the SSE +id:+ field and is
|
|
10
|
+
# consumed by callers that want the server cursor (tests + last-event-id
|
|
11
|
+
# resume). +data+ is the raw +data:+ payload string. +envelope+ is the
|
|
12
|
+
# parsed Quonfig::ConfigEnvelope.
|
|
13
|
+
StreamEvent = Struct.new(:envelope, :id, :data)
|
|
14
|
+
|
|
15
|
+
# SSE client for real-time config delivery from api-delivery-sse.
|
|
16
|
+
#
|
|
17
|
+
# Owns its reconnect loop end-to-end. sdk-go, sdk-python, and sdk-node all
|
|
18
|
+
# reached the same conclusion: the wire format we consume (plain JSON
|
|
19
|
+
# envelopes in single-line +data:+ frames, no named events, no retry
|
|
20
|
+
# directives) is simple enough that an SDK-owned loop is clearer than a
|
|
21
|
+
# library wrapper, and the operator-facing reconnect counter becomes
|
|
22
|
+
# trivially correct because there is exactly one place that increments it
|
|
23
|
+
# (qfg-35sm; replaces the ld-eventsource integration from qfg-ie49 +
|
|
24
|
+
# qfg-cf52, which required log-line scraping and a raise-proof logger
|
|
25
|
+
# wrapper to observe reconnects through the upstream library).
|
|
7
26
|
class SSEConfigClient
|
|
8
|
-
# ld-eventsource auto-reconnects on a clean socket EOF (server FIN)
|
|
9
|
-
# *internally* — it never calls +on_error+ for that case, only for
|
|
10
|
-
# ECONNREFUSED-style failures (qfg-ie49; see chaos scenario 09). The one
|
|
11
|
-
# signal it emits for any reconnect is an info-level
|
|
12
|
-
# "Will retry connection after ..." line, logged once per reconnect attempt
|
|
13
|
-
# and never on the first connect. Wrapping the logger we hand to
|
|
14
|
-
# SSE::Client lets the SDK observe those internal reconnects without
|
|
15
|
-
# touching the data path. This is the only reconnect hook ld-eventsource
|
|
16
|
-
# >= 2.0 exposes.
|
|
17
|
-
class ReconnectCountingLogger
|
|
18
|
-
RECONNECT_SIGNAL = 'Will retry connection after'
|
|
19
|
-
|
|
20
|
-
LEVELS = %i[trace debug info warn error fatal].freeze
|
|
21
|
-
|
|
22
|
-
def initialize(wrapped, &on_reconnect)
|
|
23
|
-
@wrapped = wrapped
|
|
24
|
-
@on_reconnect = on_reconnect
|
|
25
|
-
end
|
|
26
|
-
|
|
27
|
-
# Crash-safe by construction: ld-eventsource calls this logger from
|
|
28
|
-
# inside its bare-Thread +run_stream+ loop, and several of those call
|
|
29
|
-
# sites (+connect+, +log_and_dispatch_error+, query-param building) are
|
|
30
|
-
# NOT wrapped in a rescue. Any exception that escapes a logger call kills
|
|
31
|
-
# the worker thread with +@stopped+ still false, so +closed?+ never flips
|
|
32
|
-
# true and the SDK's @retry_thread never reconnects — the SSE stream is
|
|
33
|
-
# silently wedged forever (qfg-cf52, the chaos scenario 05 flake). Every
|
|
34
|
-
# step here is therefore independently guarded: a throwing message block,
|
|
35
|
-
# a throwing on_reconnect callback, or a throwing wrapped logger can
|
|
36
|
-
# never propagate out of this method.
|
|
37
|
-
LEVELS.each do |level|
|
|
38
|
-
define_method(level) do |message = nil, &block|
|
|
39
|
-
begin
|
|
40
|
-
message = block.call if message.nil? && block
|
|
41
|
-
rescue StandardError
|
|
42
|
-
message = nil
|
|
43
|
-
end
|
|
44
|
-
|
|
45
|
-
if level == :info && message.to_s.include?(RECONNECT_SIGNAL)
|
|
46
|
-
begin
|
|
47
|
-
@on_reconnect.call
|
|
48
|
-
rescue StandardError
|
|
49
|
-
nil
|
|
50
|
-
end
|
|
51
|
-
end
|
|
52
|
-
|
|
53
|
-
begin
|
|
54
|
-
@wrapped.public_send(level, message) if @wrapped.respond_to?(level)
|
|
55
|
-
rescue StandardError
|
|
56
|
-
nil
|
|
57
|
-
end
|
|
58
|
-
end
|
|
59
|
-
end
|
|
60
|
-
|
|
61
|
-
def level
|
|
62
|
-
@wrapped&.level
|
|
63
|
-
end
|
|
64
|
-
|
|
65
|
-
def level=(new_level)
|
|
66
|
-
@wrapped.level = new_level if @wrapped.respond_to?(:level=)
|
|
67
|
-
end
|
|
68
|
-
end
|
|
69
|
-
|
|
70
27
|
class Options
|
|
71
|
-
attr_reader :sse_read_timeout, :
|
|
72
|
-
:
|
|
73
|
-
:errors_to_close_connection, :sse_reconnect_reset_interval
|
|
28
|
+
attr_reader :sse_read_timeout, :sse_connect_timeout,
|
|
29
|
+
:sse_initial_reconnect_delay, :sse_max_reconnect_delay
|
|
74
30
|
|
|
75
31
|
# sse_read_timeout: 90s = 3x the 30s server heartbeat. A silent socket
|
|
76
|
-
# stall trips
|
|
77
|
-
#
|
|
78
|
-
# `project/plans/sdk-hardening-and-verification.md` Layer 1.
|
|
32
|
+
# stall trips within one missed-heartbeat window rather than the OS
|
|
33
|
+
# TCP idle (often hours).
|
|
79
34
|
#
|
|
80
|
-
#
|
|
81
|
-
#
|
|
82
|
-
#
|
|
83
|
-
#
|
|
84
|
-
#
|
|
85
|
-
#
|
|
86
|
-
# it. Resetting after 1s of healthy connection mirrors sdk-python, which
|
|
87
|
-
# resets its backoff on every successful connect (sdk-python/quonfig/
|
|
88
|
-
# sse.py). A *sustained* outage still backs off exponentially: no
|
|
89
|
-
# connection succeeds, so `mark_success` is never called and the reset
|
|
90
|
-
# never triggers (qfg-ie49).
|
|
35
|
+
# sse_initial_reconnect_delay / sse_max_reconnect_delay: backoff bounds.
|
|
36
|
+
# Each failed reconnect doubles the delay (with +/-50% jitter) up to the
|
|
37
|
+
# max. A successful event delivery resets the delay to the initial
|
|
38
|
+
# value — matches sdk-python's policy. A clean server-initiated FIN is
|
|
39
|
+
# treated as "not a failure for backoff purposes" because LBs recycling
|
|
40
|
+
# connections is normal; the reconnect counter still increments.
|
|
91
41
|
def initialize(sse_read_timeout: 90,
|
|
92
|
-
|
|
93
|
-
|
|
94
|
-
|
|
95
|
-
sse_reconnect_reset_interval: 1,
|
|
96
|
-
errors_to_close_connection: [HTTP::ConnectionError])
|
|
42
|
+
sse_connect_timeout: 10,
|
|
43
|
+
sse_initial_reconnect_delay: 1.0,
|
|
44
|
+
sse_max_reconnect_delay: 30.0)
|
|
97
45
|
@sse_read_timeout = sse_read_timeout
|
|
98
|
-
@
|
|
99
|
-
@
|
|
100
|
-
@
|
|
101
|
-
@sleep_delay_for_new_connection_check = sleep_delay_for_new_connection_check
|
|
102
|
-
@errors_to_close_connection = errors_to_close_connection
|
|
46
|
+
@sse_connect_timeout = sse_connect_timeout
|
|
47
|
+
@sse_initial_reconnect_delay = sse_initial_reconnect_delay.to_f
|
|
48
|
+
@sse_max_reconnect_delay = sse_max_reconnect_delay.to_f
|
|
103
49
|
end
|
|
104
50
|
end
|
|
105
51
|
|
|
106
52
|
LOG = Quonfig::InternalLogger.new(self)
|
|
107
53
|
|
|
54
|
+
# qfg-i5xv: HTTP status codes the SDK classifies as terminal — these will
|
|
55
|
+
# not heal by retrying (bad key, revoked permission, missing endpoint).
|
|
56
|
+
# Anything else (5xx, 429, network errors) stays on the transient path.
|
|
57
|
+
TERMINAL_HTTP_CODES = [401, 403, 404].freeze
|
|
58
|
+
|
|
108
59
|
# +on_error+: optional callable invoked on every SSE error edge. Parent
|
|
109
60
|
# Quonfig::Client wires this to drive @sse_state -> :error so that
|
|
110
|
-
# +connection_state+ reflects the disconnect (qfg-47c2.27).
|
|
111
|
-
# the SDK's public health primitive would lie about its own state during
|
|
112
|
-
# a mid-run socket drop.
|
|
61
|
+
# +connection_state+ reflects the disconnect (qfg-47c2.27).
|
|
113
62
|
def initialize(prefab_options, config_loader, options = nil, logger = nil, on_error: nil)
|
|
114
63
|
@prefab_options = prefab_options
|
|
115
64
|
@options = options || Options.new
|
|
116
65
|
@config_loader = config_loader
|
|
117
|
-
@connected = false
|
|
118
66
|
@logger = logger || LOG
|
|
119
67
|
@on_error = on_error
|
|
68
|
+
|
|
69
|
+
@stopped = Concurrent::AtomicBoolean.new(false)
|
|
120
70
|
@restart_total = 0
|
|
121
71
|
@restart_mutex = Mutex.new
|
|
72
|
+
|
|
73
|
+
@on_envelope_error_total = 0
|
|
74
|
+
@on_envelope_error_mutex = Mutex.new
|
|
75
|
+
|
|
76
|
+
@conn_mutex = Mutex.new
|
|
77
|
+
@active_http = nil
|
|
78
|
+
|
|
79
|
+
@source_index = -1
|
|
80
|
+
@last_event_id = nil
|
|
122
81
|
end
|
|
123
82
|
|
|
124
|
-
#
|
|
125
|
-
#
|
|
126
|
-
#
|
|
127
|
-
#
|
|
128
|
-
# ReconnectCountingLogger "Will retry connection after" signal.
|
|
129
|
-
# 2. SDK-driven reconnects in @retry_thread, after a closing error
|
|
130
|
-
# (HTTP::ConnectionError) made us close the SSE::Client outright.
|
|
131
|
-
# These two are mutually exclusive per disconnect, so there is no
|
|
132
|
-
# double-count. on_error is deliberately NOT a source — ld-eventsource
|
|
133
|
-
# reconnects internally after most non-closing errors, so counting the
|
|
134
|
-
# error edge AND the reconnect would double up (qfg-ie49).
|
|
135
|
-
#
|
|
136
|
-
# The chaos harness pulls this via Client#worker_restart_total(layer: '1')
|
|
137
|
-
# so kill-storm scenarios (e.g. scenario 09 — proxy killed 5x in 30s) can
|
|
138
|
-
# assert restart_total >= 5 even when the kills produce clean FINs that
|
|
139
|
-
# never reach on_error.
|
|
83
|
+
# Layer 1 (SSE) reconnect counter. Bumped exactly once per reconnect
|
|
84
|
+
# attempt — never per error edge, never per envelope. Read by
|
|
85
|
+
# Quonfig::Client#worker_restart_total(layer: '1') and asserted by chaos
|
|
86
|
+
# scenario 09 (>= 5 after 5 proxy flaps in 30s).
|
|
140
87
|
def restart_total
|
|
141
88
|
@restart_mutex.synchronize { @restart_total }
|
|
142
89
|
end
|
|
143
90
|
|
|
144
|
-
#
|
|
145
|
-
#
|
|
146
|
-
|
|
147
|
-
|
|
91
|
+
# qfg-m3lk: count of user-supplied on_envelope callback invocations that
|
|
92
|
+
# raised. Surfaced for operator visibility — a non-zero value here with
|
|
93
|
+
# restart_total stable means a caller-side listener bug, not a transport
|
|
94
|
+
# problem. (Pre-fix, those raises propagated into run_loop's rescue and
|
|
95
|
+
# masqueraded as transport errors, causing reconnect storms.)
|
|
96
|
+
def on_envelope_error_total
|
|
97
|
+
@on_envelope_error_mutex.synchronize { @on_envelope_error_total }
|
|
148
98
|
end
|
|
149
99
|
|
|
150
|
-
def
|
|
151
|
-
@
|
|
152
|
-
|
|
100
|
+
def start(&on_envelope)
|
|
101
|
+
return if @prefab_options.sse_api_urls.nil? || @prefab_options.sse_api_urls.empty?
|
|
102
|
+
|
|
103
|
+
@worker = Thread.new { run_loop(&on_envelope) }
|
|
153
104
|
end
|
|
154
105
|
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
106
|
+
# Shut down. Interrupts the in-flight stream by closing the underlying
|
|
107
|
+
# socket from this thread — the worker thread observes the resulting
|
|
108
|
+
# IOError, sees @stopped == true, and exits cleanly.
|
|
109
|
+
def close
|
|
110
|
+
@stopped.make_true
|
|
111
|
+
@conn_mutex.synchronize do
|
|
112
|
+
begin
|
|
113
|
+
@active_http&.finish
|
|
114
|
+
rescue StandardError
|
|
115
|
+
# already closed / never started — idempotent
|
|
116
|
+
end
|
|
117
|
+
@active_http = nil
|
|
159
118
|
end
|
|
119
|
+
@worker&.join(2)
|
|
120
|
+
@worker = nil
|
|
121
|
+
end
|
|
122
|
+
|
|
123
|
+
# Public so tests can assert the headers shape. Body of the request is
|
|
124
|
+
# always empty; this is the full set api-delivery-sse sees.
|
|
125
|
+
def headers
|
|
126
|
+
auth = "1:#{@prefab_options.sdk_key}"
|
|
127
|
+
auth_string = Base64.strict_encode64(auth)
|
|
128
|
+
h = {
|
|
129
|
+
'Authorization' => "Basic #{auth_string}",
|
|
130
|
+
'Accept' => 'text/event-stream',
|
|
131
|
+
'Cache-Control' => 'no-cache',
|
|
132
|
+
'X-Quonfig-SDK-Version' => "ruby-#{Quonfig::VERSION}"
|
|
133
|
+
}
|
|
134
|
+
cursor = current_cursor
|
|
135
|
+
h['Last-Event-Id'] = cursor if cursor
|
|
136
|
+
h
|
|
137
|
+
end
|
|
160
138
|
|
|
161
|
-
|
|
139
|
+
# Compute a Last-Event-ID for the next request. Three sources, in
|
|
140
|
+
# priority order:
|
|
141
|
+
# 1. @last_event_id -- set by the most recent event we processed
|
|
142
|
+
# 2. config_loader.version -- string ETag from last HTTP fetch
|
|
143
|
+
# 3. config_loader.highwater_mark -- legacy numeric cursor
|
|
144
|
+
# Returns nil if no prior state exists.
|
|
145
|
+
def current_cursor
|
|
146
|
+
return @last_event_id if @last_event_id && !@last_event_id.empty?
|
|
162
147
|
|
|
163
|
-
|
|
148
|
+
if @config_loader.respond_to?(:version)
|
|
149
|
+
v = @config_loader.version
|
|
150
|
+
return v if v.is_a?(String) && !v.empty?
|
|
151
|
+
end
|
|
164
152
|
|
|
165
|
-
@
|
|
166
|
-
|
|
167
|
-
|
|
153
|
+
if @config_loader.respond_to?(:highwater_mark)
|
|
154
|
+
hw = @config_loader.highwater_mark
|
|
155
|
+
return hw.to_s if hw.is_a?(Numeric) && hw.positive?
|
|
156
|
+
return hw if hw.is_a?(String) && !hw.empty?
|
|
157
|
+
end
|
|
168
158
|
|
|
169
|
-
|
|
159
|
+
nil
|
|
160
|
+
end
|
|
170
161
|
|
|
171
|
-
|
|
162
|
+
private
|
|
172
163
|
|
|
173
|
-
|
|
164
|
+
# Long-lived reconnect loop. One iteration = one connect attempt. Bumps
|
|
165
|
+
# restart_total *before* every retry — so the counter answers "how many
|
|
166
|
+
# times have we reconnected after a drop" rather than "how many connect
|
|
167
|
+
# attempts have occurred." The first attempt is not a restart.
|
|
168
|
+
#
|
|
169
|
+
# qfg-tj18: the body is wrapped in
|
|
170
|
+
# +Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)+ so a
|
|
171
|
+
# watchdog raise that's already been queued (the watchdog's mutex covers
|
|
172
|
+
# the *decision* to fire but cannot un-queue a delivered raise) lands
|
|
173
|
+
# only at a blocking-IO checkpoint. Inside stream_once we explicitly
|
|
174
|
+
# re-enable +:immediate+ around the +read_body+ block where we *do*
|
|
175
|
+
# want the raise to wake the read. A per-iteration paranoid rescue
|
|
176
|
+
# catches any late-landing raise that escapes the inner +rescue
|
|
177
|
+
# StandardError+ (e.g. lands inside +interruptible_sleep+ between
|
|
178
|
+
# iterations) so the worker thread never silently dies.
|
|
179
|
+
def run_loop(&on_envelope)
|
|
180
|
+
Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking) do
|
|
181
|
+
delay = @options.sse_initial_reconnect_delay
|
|
182
|
+
first_attempt = true
|
|
183
|
+
|
|
184
|
+
until @stopped.value
|
|
185
|
+
begin
|
|
186
|
+
unless first_attempt
|
|
187
|
+
increment_restart!
|
|
188
|
+
interruptible_sleep(jittered(delay))
|
|
189
|
+
break if @stopped.value
|
|
190
|
+
end
|
|
191
|
+
first_attempt = false
|
|
174
192
|
|
|
175
|
-
|
|
176
|
-
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
193
|
+
connected_at_least_once = false
|
|
194
|
+
begin
|
|
195
|
+
stream_once do |event|
|
|
196
|
+
connected_at_least_once = true
|
|
197
|
+
# Persist the most recent id so the next reconnect resumes
|
|
198
|
+
# from there via Last-Event-Id. Updated *before* the user
|
|
199
|
+
# callback runs so a raising listener still advances the
|
|
200
|
+
# cursor — the event was delivered to us, the bug is on the
|
|
201
|
+
# caller side.
|
|
202
|
+
@last_event_id = event.id if event.id
|
|
203
|
+
# qfg-m3lk: callback exceptions are isolated. A buggy
|
|
204
|
+
# listener must not look like a transport error and trigger
|
|
205
|
+
# a reconnect.
|
|
206
|
+
invoke_on_envelope_safely(on_envelope, event)
|
|
207
|
+
# A connection healthy enough to deliver a real envelope
|
|
208
|
+
# earns a reset of the backoff. Sustained outages never
|
|
209
|
+
# reach this branch (no event ever delivered) so the
|
|
210
|
+
# exponential growth still holds.
|
|
211
|
+
delay = @options.sse_initial_reconnect_delay
|
|
212
|
+
end
|
|
213
|
+
rescue StandardError => e
|
|
214
|
+
handle_error(e) unless @stopped.value
|
|
215
|
+
end
|
|
216
|
+
|
|
217
|
+
# Backoff only grows on failed connect attempts. A server-
|
|
218
|
+
# initiated clean FIN after a healthy session (normal LB
|
|
219
|
+
# recycling) reuses the same delay — punishing it would make
|
|
220
|
+
# us look broken under benign rolling restarts. Matches
|
|
221
|
+
# sdk-go's `connectedOK` distinction.
|
|
222
|
+
delay = [delay * 2, @options.sse_max_reconnect_delay].min unless connected_at_least_once
|
|
223
|
+
rescue SSEReadDeadlineExceeded => e
|
|
224
|
+
# Paranoid backstop (qfg-tj18). A watchdog raise that landed
|
|
225
|
+
# outside +stream_once+ — typically in +interruptible_sleep+
|
|
226
|
+
# — must not kill the worker thread. We log loudly and let the
|
|
227
|
+
# +until+ loop carry on.
|
|
228
|
+
@logger.error "SSE watchdog late-raise contained: #{e.inspect}; resuming loop"
|
|
229
|
+
end
|
|
183
230
|
end
|
|
184
231
|
end
|
|
232
|
+
ensure
|
|
233
|
+
register_active(nil)
|
|
185
234
|
end
|
|
186
235
|
|
|
187
|
-
|
|
188
|
-
|
|
236
|
+
# Opens one SSE request and yields each parsed event until the stream
|
|
237
|
+
# ends (clean FIN, error, or stop). Raises on transport errors so the
|
|
238
|
+
# caller can apply backoff. Clean FIN returns without raising.
|
|
239
|
+
#
|
|
240
|
+
# A watchdog thread interrupts this read (via Thread#raise) if no bytes arrive within
|
|
241
|
+
# +sse_read_timeout+. Net::HTTP#read_timeout is NOT reliable for the
|
|
242
|
+
# streaming +read_body do |chunk|+ form — the underlying BufferedIO
|
|
243
|
+
# reads bypass it in practice (a silent server stall blocks indefinitely
|
|
244
|
+
# against a configured deadline). sdk-go and sdk-node hit the same
|
|
245
|
+
# gotcha and solve it the same way: per-chunk reset, async close on
|
|
246
|
+
# expiry (chaos scenario 02 — sse_silent_stall).
|
|
247
|
+
def stream_once(&block)
|
|
248
|
+
url = "#{current_url}/api/v2/sse/config"
|
|
189
249
|
cursor = current_cursor
|
|
190
250
|
@logger.debug "SSE Streaming Connect to #{url} start_at #{cursor.inspect}"
|
|
191
251
|
|
|
192
|
-
|
|
193
|
-
|
|
194
|
-
|
|
195
|
-
|
|
196
|
-
|
|
197
|
-
)
|
|
198
|
-
|
|
199
|
-
|
|
200
|
-
|
|
201
|
-
|
|
202
|
-
|
|
203
|
-
|
|
204
|
-
|
|
205
|
-
|
|
206
|
-
|
|
207
|
-
|
|
208
|
-
|
|
209
|
-
|
|
210
|
-
|
|
252
|
+
uri = URI(url)
|
|
253
|
+
http = Net::HTTP.new(uri.host, uri.port)
|
|
254
|
+
http.use_ssl = (uri.scheme == 'https')
|
|
255
|
+
http.open_timeout = @options.sse_connect_timeout
|
|
256
|
+
# Keep Net::HTTP's read_timeout as a backstop for the header read
|
|
257
|
+
# (where it does apply reliably). The watchdog covers the body path.
|
|
258
|
+
http.read_timeout = @options.sse_read_timeout
|
|
259
|
+
|
|
260
|
+
req = Net::HTTP::Get.new(uri.request_uri, headers)
|
|
261
|
+
|
|
262
|
+
http.start
|
|
263
|
+
register_active(http)
|
|
264
|
+
|
|
265
|
+
watchdog = ReadDeadlineWatchdog.new(
|
|
266
|
+
worker: Thread.current, deadline_s: @options.sse_read_timeout,
|
|
267
|
+
stopped: @stopped, logger: @logger
|
|
268
|
+
)
|
|
269
|
+
watchdog.start
|
|
270
|
+
|
|
271
|
+
begin
|
|
272
|
+
http.request(req) do |resp|
|
|
273
|
+
code = resp.code.to_i
|
|
274
|
+
if TERMINAL_HTTP_CODES.include?(code)
|
|
275
|
+
# qfg-i5xv: 401/403/404 will not heal by retrying — bad key,
|
|
276
|
+
# revoked permission, or wrong endpoint. Mark stopped *before*
|
|
277
|
+
# invoking on_error so the loop's terminal-error branch is
|
|
278
|
+
# already locked in if the parent callback inspects state, and
|
|
279
|
+
# so the inner rescue's `handle_error(e) unless @stopped.value`
|
|
280
|
+
# guard suppresses a second on_error edge.
|
|
281
|
+
err = SSEHTTPTerminalError.new(code)
|
|
282
|
+
@logger.error "SSE Streaming Terminal Error: HTTP #{code} for url #{url}; will not retry"
|
|
283
|
+
@stopped.make_true
|
|
284
|
+
invoke_on_error(err)
|
|
285
|
+
raise err
|
|
211
286
|
end
|
|
212
|
-
|
|
213
|
-
|
|
214
|
-
|
|
215
|
-
|
|
216
|
-
|
|
217
|
-
client.close
|
|
218
|
-
next
|
|
287
|
+
if code != 200
|
|
288
|
+
err = SSEHTTPStatusError.new(code)
|
|
289
|
+
@logger.error "SSE Streaming Error: HTTP #{code} for url #{url}"
|
|
290
|
+
invoke_on_error(err)
|
|
291
|
+
raise err
|
|
219
292
|
end
|
|
220
293
|
|
|
221
|
-
|
|
222
|
-
|
|
223
|
-
|
|
224
|
-
)
|
|
225
|
-
|
|
294
|
+
parser = EventParser.new
|
|
295
|
+
# qfg-tj18: run_loop wraps the body in +:on_blocking+ which
|
|
296
|
+
# *would* still deliver during read_body (read_body is a
|
|
297
|
+
# blocking IO call), but be explicit: we want the watchdog raise
|
|
298
|
+
# to land here without ambiguity.
|
|
299
|
+
Thread.handle_interrupt(SSEReadDeadlineExceeded => :immediate) do
|
|
300
|
+
resp.read_body do |chunk|
|
|
301
|
+
watchdog.reset!
|
|
302
|
+
break if @stopped.value
|
|
303
|
+
|
|
304
|
+
parser.feed(chunk, &block)
|
|
305
|
+
end
|
|
306
|
+
end
|
|
307
|
+
# read_body returned cleanly — either a server-initiated FIN, or
|
|
308
|
+
# the watchdog closed the socket on a silent stall. Either way,
|
|
309
|
+
# the outer loop will reconnect and bump restart_total on the
|
|
310
|
+
# next iteration.
|
|
311
|
+
@logger.debug "SSE stream ended for url #{url}"
|
|
312
|
+
end
|
|
313
|
+
ensure
|
|
314
|
+
watchdog.stop
|
|
315
|
+
register_active(nil)
|
|
316
|
+
begin
|
|
317
|
+
http.finish if http.started?
|
|
318
|
+
rescue StandardError
|
|
319
|
+
# already closed
|
|
226
320
|
end
|
|
321
|
+
end
|
|
322
|
+
end
|
|
227
323
|
|
|
228
|
-
|
|
229
|
-
|
|
230
|
-
|
|
231
|
-
|
|
232
|
-
|
|
233
|
-
@logger.error "SSE Streaming Error: #{error.inspect} for url #{url}"
|
|
234
|
-
end
|
|
324
|
+
# Track the active connection so close() can interrupt a blocked
|
|
325
|
+
# read_body from another thread. Guarded by @conn_mutex.
|
|
326
|
+
def register_active(http)
|
|
327
|
+
@conn_mutex.synchronize { @active_http = http }
|
|
328
|
+
end
|
|
235
329
|
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
# would double-count. For closing errors (HTTP::ConnectionError) the
|
|
240
|
-
# reconnect is counted in @retry_thread instead. on_error's job is
|
|
241
|
-
# purely to notify the parent client of the disconnect edge.
|
|
242
|
-
|
|
243
|
-
# Notify the parent client BEFORE deciding whether to close — every
|
|
244
|
-
# error edge is a disconnect signal as far as @sse_state goes, even
|
|
245
|
-
# if we let the underlying SSE library handle reconnect itself.
|
|
246
|
-
# qfg-47c2.27
|
|
247
|
-
if @on_error
|
|
248
|
-
begin
|
|
249
|
-
@on_error.call(error)
|
|
250
|
-
rescue StandardError => e
|
|
251
|
-
@logger.error "SSE on_error callback raised: #{e.inspect}"
|
|
252
|
-
end
|
|
253
|
-
end
|
|
330
|
+
def increment_restart!
|
|
331
|
+
@restart_mutex.synchronize { @restart_total += 1 }
|
|
332
|
+
end
|
|
254
333
|
|
|
255
|
-
|
|
256
|
-
|
|
257
|
-
|
|
258
|
-
|
|
259
|
-
|
|
334
|
+
def handle_error(error)
|
|
335
|
+
@logger.error "SSE Streaming Error: #{error.inspect}"
|
|
336
|
+
invoke_on_error(error)
|
|
337
|
+
end
|
|
338
|
+
|
|
339
|
+
# qfg-m3lk: rescue StandardError (NOT Exception) so SystemExit /
|
|
340
|
+
# Interrupt / SignalException still escape — Ctrl-C inside a customer
|
|
341
|
+
# callback must still kill the process. StandardError is the right
|
|
342
|
+
# boundary for "the caller's listener has a bug".
|
|
343
|
+
def invoke_on_envelope_safely(on_envelope, event)
|
|
344
|
+
on_envelope.call(event.envelope, event, :sse)
|
|
345
|
+
rescue StandardError => e
|
|
346
|
+
@on_envelope_error_mutex.synchronize { @on_envelope_error_total += 1 }
|
|
347
|
+
bt = (e.backtrace || []).first(5).join("\n ")
|
|
348
|
+
@logger.error "SSE on_envelope callback raised: #{e.class}: #{e.message}\n #{bt}"
|
|
349
|
+
end
|
|
350
|
+
|
|
351
|
+
def invoke_on_error(error)
|
|
352
|
+
return unless @on_error
|
|
353
|
+
|
|
354
|
+
begin
|
|
355
|
+
@on_error.call(error)
|
|
356
|
+
rescue StandardError => e
|
|
357
|
+
@logger.error "SSE on_error callback raised: #{e.inspect}"
|
|
260
358
|
end
|
|
261
359
|
end
|
|
262
360
|
|
|
263
|
-
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
|
|
267
|
-
|
|
268
|
-
|
|
269
|
-
'X-Quonfig-SDK-Version' => "ruby-#{Quonfig::VERSION}"
|
|
270
|
-
}
|
|
361
|
+
# +/-50% jitter — caps thundering-herd amplitude after a partition heal.
|
|
362
|
+
# Identical shape to ld-eventsource's Backoff#next_interval (and
|
|
363
|
+
# sdk-go's runLoop jitter) so we don't surprise operators familiar with
|
|
364
|
+
# those.
|
|
365
|
+
def jittered(delay)
|
|
366
|
+
(delay / 2) + rand(delay / 2.0)
|
|
271
367
|
end
|
|
272
368
|
|
|
273
|
-
|
|
274
|
-
|
|
369
|
+
# Sleep with interrupt: chunks the sleep so close() during a long
|
|
370
|
+
# backoff doesn't block shutdown for tens of seconds.
|
|
371
|
+
def interruptible_sleep(seconds)
|
|
372
|
+
deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
|
|
373
|
+
until @stopped.value
|
|
374
|
+
remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
375
|
+
break if remaining <= 0
|
|
275
376
|
|
|
276
|
-
|
|
377
|
+
sleep([remaining, 0.1].min)
|
|
378
|
+
end
|
|
379
|
+
end
|
|
277
380
|
|
|
278
|
-
|
|
381
|
+
# Rotate through configured SSE URLs. The same rotation rule the
|
|
382
|
+
# previous implementation used, preserved so multi-region failover
|
|
383
|
+
# behavior is unchanged.
|
|
384
|
+
def current_url
|
|
385
|
+
urls = @prefab_options.sse_api_urls
|
|
386
|
+
@source_index = (@source_index + 1) % urls.size
|
|
387
|
+
urls[@source_index]
|
|
279
388
|
end
|
|
280
389
|
|
|
281
|
-
#
|
|
282
|
-
#
|
|
283
|
-
#
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
287
|
-
|
|
288
|
-
|
|
289
|
-
|
|
390
|
+
# Internal: HTTP-status sentinel error for non-200 SSE responses. Surfaces
|
|
391
|
+
# the status code through #message so parent on_error callbacks can log
|
|
392
|
+
# meaningfully without depending on ld-eventsource's error hierarchy.
|
|
393
|
+
class SSEHTTPStatusError < StandardError
|
|
394
|
+
attr_reader :status_code
|
|
395
|
+
|
|
396
|
+
def initialize(status_code)
|
|
397
|
+
@status_code = status_code
|
|
398
|
+
super("HTTP #{status_code}")
|
|
290
399
|
end
|
|
400
|
+
end
|
|
291
401
|
|
|
292
|
-
|
|
293
|
-
|
|
294
|
-
|
|
295
|
-
|
|
402
|
+
# qfg-i5xv: terminal HTTP failures the SDK will not retry. 401 = bad key,
|
|
403
|
+
# 403 = revoked workspace permission, 404 = wrong endpoint / missing
|
|
404
|
+
# workspace. A subclass of SSEHTTPStatusError so existing on_error
|
|
405
|
+
# callbacks that only check `is_a?(SSEHTTPStatusError)` keep working,
|
|
406
|
+
# while customers that want to distinguish (alerting, OpenFeature
|
|
407
|
+
# provider error events) can dispatch on the subclass.
|
|
408
|
+
class SSEHTTPTerminalError < SSEHTTPStatusError; end
|
|
409
|
+
|
|
410
|
+
# Raised by the watchdog into the worker thread when the per-chunk
|
|
411
|
+
# read deadline elapses. Caught by run_loop's rescue, indistinguishable
|
|
412
|
+
# from any other transport error for backoff/restart purposes.
|
|
413
|
+
class SSEReadDeadlineExceeded < StandardError; end
|
|
414
|
+
|
|
415
|
+
# Background watchdog that interrupts the worker thread if no chunk
|
|
416
|
+
# arrives within +deadline_s+ seconds. Uses Thread#raise — the only
|
|
417
|
+
# reliable cross-platform way to unblock a Ruby thread blocked in
|
|
418
|
+
# +Net::HTTP+'s body-read on macOS. (Closing or shutting down the
|
|
419
|
+
# underlying socket from another thread does NOT wake the reader on
|
|
420
|
+
# macOS; the kernel discards future reads but the in-flight syscall
|
|
421
|
+
# stays blocked until something else trips. sdk-go and sdk-node solve
|
|
422
|
+
# the equivalent problem with context cancellation / AbortController,
|
|
423
|
+
# which Ruby lacks at the IO layer.) Thread#raise is essentially what
|
|
424
|
+
# +Timeout.timeout+ does internally; using it directly avoids
|
|
425
|
+
# Timeout.timeout's sketchy reputation around ensure blocks.
|
|
426
|
+
class ReadDeadlineWatchdog
|
|
427
|
+
POLL_INTERVAL = 0.25
|
|
428
|
+
|
|
429
|
+
def initialize(worker:, deadline_s:, stopped:, logger:)
|
|
430
|
+
@worker = worker
|
|
431
|
+
@deadline_s = deadline_s
|
|
432
|
+
@stopped = stopped
|
|
433
|
+
@logger = logger
|
|
434
|
+
@active = true
|
|
435
|
+
# Mutex covers @active AND the decision to fire Thread#raise. stop()
|
|
436
|
+
# holds the mutex when flipping @active false, so a +stop+ that
|
|
437
|
+
# arrives mid-deadline-check cannot lose the race against the
|
|
438
|
+
# watchdog's @worker.raise call (which would inject a spurious
|
|
439
|
+
# SSEReadDeadlineExceeded into the worker thread right after a
|
|
440
|
+
# clean read_body return).
|
|
441
|
+
@mutex = Mutex.new
|
|
442
|
+
@last_read_at = Concurrent::AtomicReference.new(Process.clock_gettime(Process::CLOCK_MONOTONIC))
|
|
296
443
|
end
|
|
297
444
|
|
|
298
|
-
|
|
445
|
+
def start
|
|
446
|
+
@thread = Thread.new { watch }
|
|
447
|
+
end
|
|
448
|
+
|
|
449
|
+
def reset!
|
|
450
|
+
@last_read_at.set(Process.clock_gettime(Process::CLOCK_MONOTONIC))
|
|
451
|
+
end
|
|
452
|
+
|
|
453
|
+
def stop
|
|
454
|
+
@mutex.synchronize { @active = false }
|
|
455
|
+
@thread&.join(1)
|
|
456
|
+
@thread = nil
|
|
457
|
+
end
|
|
458
|
+
|
|
459
|
+
private
|
|
460
|
+
|
|
461
|
+
def watch
|
|
462
|
+
loop do
|
|
463
|
+
sleep POLL_INTERVAL
|
|
464
|
+
break unless @mutex.synchronize { @active } && !@stopped.value
|
|
465
|
+
|
|
466
|
+
idle = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @last_read_at.value
|
|
467
|
+
next if idle < @deadline_s
|
|
468
|
+
|
|
469
|
+
fired = @mutex.synchronize do
|
|
470
|
+
next false unless @active && !@stopped.value
|
|
471
|
+
|
|
472
|
+
@logger.debug "SSE read deadline exceeded (#{idle.round(1)}s idle >= #{@deadline_s}s); interrupting worker"
|
|
473
|
+
@worker.raise(SSEReadDeadlineExceeded.new("SSE read deadline #{@deadline_s}s exceeded"))
|
|
474
|
+
true
|
|
475
|
+
end
|
|
476
|
+
break if fired
|
|
477
|
+
end
|
|
478
|
+
rescue StandardError => e
|
|
479
|
+
# Watchdog must never crash the SDK. Worst case we silently fall
|
|
480
|
+
# back to Net::HTTP's own (unreliable) read_timeout.
|
|
481
|
+
@logger.debug "SSE watchdog error: #{e.inspect}"
|
|
482
|
+
end
|
|
483
|
+
end
|
|
484
|
+
|
|
485
|
+
# Streaming SSE parser. Accepts byte chunks (any encoding), yields one
|
|
486
|
+
# Quonfig::StreamEvent per complete event. Tolerates:
|
|
487
|
+
# - chunks that split a UTF-8 multi-byte character (buffer in 8-bit,
|
|
488
|
+
# transcode whole lines)
|
|
489
|
+
# - chunks that split a line mid-way
|
|
490
|
+
# - any of CR / LF / CRLF as line terminators
|
|
491
|
+
# - +data:+, +data: + (optional space per SSE spec)
|
|
492
|
+
# - +:comment+ lines (keepalives — ignored)
|
|
493
|
+
# - multi-line +data:+ (concatenated with +\n+, per spec)
|
|
494
|
+
# Ignores +event:+ and +retry:+ — api-delivery does not emit them and the
|
|
495
|
+
# Quonfig wire contract does not honor reconnect-time directives.
|
|
496
|
+
# Malformed +data:+ JSON is logged and skipped; one bad event does not
|
|
497
|
+
# tear down the stream.
|
|
498
|
+
class EventParser
|
|
499
|
+
def initialize(logger: nil)
|
|
500
|
+
@logger = logger
|
|
501
|
+
@reader = LineReader.new
|
|
502
|
+
@data = +''
|
|
503
|
+
@have_data = false
|
|
504
|
+
@id = nil
|
|
505
|
+
end
|
|
506
|
+
|
|
507
|
+
def feed(chunk)
|
|
508
|
+
@reader.feed(chunk) do |line|
|
|
509
|
+
if line.empty?
|
|
510
|
+
event = flush
|
|
511
|
+
yield event if event
|
|
512
|
+
elsif line.start_with?(':')
|
|
513
|
+
# comment / keepalive — ignore
|
|
514
|
+
else
|
|
515
|
+
process_field(line)
|
|
516
|
+
end
|
|
517
|
+
end
|
|
518
|
+
end
|
|
519
|
+
|
|
520
|
+
private
|
|
521
|
+
|
|
522
|
+
def process_field(line)
|
|
523
|
+
idx = line.index(':')
|
|
524
|
+
return unless idx
|
|
525
|
+
|
|
526
|
+
name = line[0...idx]
|
|
527
|
+
rest = line[(idx + 1)..]
|
|
528
|
+
rest = rest[1..] if rest.start_with?(' ')
|
|
529
|
+
|
|
530
|
+
case name
|
|
531
|
+
when 'data'
|
|
532
|
+
if @have_data
|
|
533
|
+
@data << "\n" << rest
|
|
534
|
+
else
|
|
535
|
+
@data = rest
|
|
536
|
+
@have_data = true
|
|
537
|
+
end
|
|
538
|
+
when 'id'
|
|
539
|
+
@id = rest unless rest.include?("\x00")
|
|
540
|
+
# event: / retry: are intentionally ignored
|
|
541
|
+
end
|
|
542
|
+
end
|
|
543
|
+
|
|
544
|
+
def flush
|
|
545
|
+
return nil unless @have_data
|
|
546
|
+
|
|
547
|
+
data = @data
|
|
548
|
+
id = @id
|
|
549
|
+
@data = +''
|
|
550
|
+
@have_data = false
|
|
551
|
+
# NB: @id persists across events — the SSE spec says last-event-id
|
|
552
|
+
# is sticky until overwritten. Matches ld-eventsource.
|
|
553
|
+
|
|
554
|
+
begin
|
|
555
|
+
parsed = JSON.parse(data)
|
|
556
|
+
rescue JSON::ParserError => e
|
|
557
|
+
(@logger || LOG).error "SSE Streaming Error: malformed JSON: #{e.message}"
|
|
558
|
+
return nil
|
|
559
|
+
end
|
|
560
|
+
|
|
561
|
+
envelope = Quonfig::ConfigEnvelope.new(
|
|
562
|
+
configs: parsed['configs'] || [],
|
|
563
|
+
meta: parsed['meta'] || {}
|
|
564
|
+
)
|
|
565
|
+
StreamEvent.new(envelope, id, data)
|
|
566
|
+
end
|
|
567
|
+
end
|
|
568
|
+
|
|
569
|
+
# Byte-level line reader. Accepts arbitrary chunks, yields one UTF-8
|
|
570
|
+
# line per call to the block. Terminator-stripped (CR / LF / CRLF
|
|
571
|
+
# supported). Modeled on ld-eventsource's BufferedLineReader — same
|
|
572
|
+
# invariants: split bytes-not-chars while scanning, force-encode to
|
|
573
|
+
# UTF-8 only once a complete line is sliced out, so a multi-byte
|
|
574
|
+
# character spanning two chunks does not raise Encoding::CompatibilityError.
|
|
575
|
+
class LineReader
|
|
576
|
+
def initialize
|
|
577
|
+
@buffer = +''.b
|
|
578
|
+
@last_was_cr = false
|
|
579
|
+
end
|
|
580
|
+
|
|
581
|
+
def feed(chunk)
|
|
582
|
+
@buffer << chunk.b
|
|
583
|
+
loop do
|
|
584
|
+
idx = @buffer.index(/[\r\n]/)
|
|
585
|
+
break if idx.nil?
|
|
586
|
+
|
|
587
|
+
ch = @buffer[idx]
|
|
588
|
+
if idx.zero? && ch == "\n" && @last_was_cr
|
|
589
|
+
# Dangling LF of a CRLF pair split across chunks — consume and skip.
|
|
590
|
+
@last_was_cr = false
|
|
591
|
+
@buffer.slice!(0, 1)
|
|
592
|
+
next
|
|
593
|
+
end
|
|
594
|
+
|
|
595
|
+
line = @buffer[0, idx].force_encoding('UTF-8')
|
|
596
|
+
consume = idx + 1
|
|
597
|
+
@last_was_cr = false
|
|
598
|
+
if ch == "\r"
|
|
599
|
+
if consume == @buffer.bytesize
|
|
600
|
+
# CR at end of buffer — could be CRLF split across feeds.
|
|
601
|
+
@last_was_cr = true
|
|
602
|
+
elsif @buffer[consume] == "\n"
|
|
603
|
+
consume += 1
|
|
604
|
+
end
|
|
605
|
+
end
|
|
606
|
+
@buffer.slice!(0, consume)
|
|
607
|
+
yield line
|
|
608
|
+
end
|
|
609
|
+
end
|
|
299
610
|
end
|
|
300
611
|
end
|
|
301
612
|
end
|
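The reconnect loop in this file combines two rules: the delay doubles up to a cap only when an attempt never delivered an event, and each sleep is jittered by roughly +/-50%. The following is a minimal sketch of those two rules in isolation, not the gem's code. The names `jittered_delay` and `next_delay`, the `rng:` parameter, and the sample delays are hypothetical and chosen for illustration. The sketch draws from a Float range with `Random#rand`, because `Kernel#rand` truncates a Float argument to an Integer.

```ruby
# Hypothetical helpers for illustration only; not part of the quonfig gem.

# Returns a delay drawn uniformly from [0.5 * delay, 1.5 * delay].
# Random#rand accepts a Float range, so the result keeps sub-second precision.
def jittered_delay(delay, rng: Random.new)
  return 0.0 unless delay.positive?

  rng.rand((delay * 0.5)..(delay * 1.5))
end

# Doubles the delay up to max_delay, but only when the previous attempt
# never delivered an event (the connected_at_least_once rule in run_loop).
def next_delay(delay, max_delay:, connected:)
  return delay if connected

  [delay * 2, max_delay].min
end

# Simulate five consecutive failed connects starting from a 1 second delay.
delay = 1.0
schedule = 5.times.map do
  current = delay
  delay = next_delay(delay, max_delay: 10.0, connected: false)
  current
end
# schedule is [1.0, 2.0, 4.0, 8.0, 10.0]
```

Keeping the jitter and the growth rule as separate pure functions makes each one testable without opening a socket; the un-jittered schedule stays deterministic while only the sleep duration is randomized.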
data/lib/quonfig/version.rb
CHANGED
data/lib/quonfig.rb
CHANGED
data/quonfig.gemspec
CHANGED
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: quonfig
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.0.
|
|
4
|
+
version: 0.0.16
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Jeff Dwyer
|
|
@@ -58,20 +58,6 @@ dependencies:
|
|
|
58
58
|
- - ">="
|
|
59
59
|
- !ruby/object:Gem::Version
|
|
60
60
|
version: '1.0'
|
|
61
|
-
- !ruby/object:Gem::Dependency
|
|
62
|
-
name: ld-eventsource
|
|
63
|
-
requirement: !ruby/object:Gem::Requirement
|
|
64
|
-
requirements:
|
|
65
|
-
- - ">="
|
|
66
|
-
- !ruby/object:Gem::Version
|
|
67
|
-
version: '2.0'
|
|
68
|
-
type: :runtime
|
|
69
|
-
prerelease: false
|
|
70
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
71
|
-
requirements:
|
|
72
|
-
- - ">="
|
|
73
|
-
- !ruby/object:Gem::Version
|
|
74
|
-
version: '2.0'
|
|
75
61
|
description: Quonfig — feature flags and live config, stored as files in git.
|
|
76
62
|
email: jeff@quonfig.com
|
|
77
63
|
executables: []
|