quonfig 0.0.14 → 0.0.16
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -0
- data/README.md +55 -11
- data/lib/quonfig/client.rb +398 -22
- data/lib/quonfig/datadir.rb +8 -3
- data/lib/quonfig/sse_config_client.rb +550 -93
- data/lib/quonfig/version.rb +1 -1
- data/lib/quonfig/worker_supervisor.rb +186 -0
- data/lib/quonfig.rb +2 -1
- data/quonfig.gemspec +0 -1
- metadata +3 -16
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 6f167f3b60db07394dc7c49b85c3dbc196e0b5c82f3426b35695c0f212339b8b
|
|
4
|
+
data.tar.gz: 4b79e1196c4625359943255a348d907c28865a5cd85432ac464737406d7a6169
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 5aa3a23774245bf31752e4c9918de8bf37cc865e15b6ed160b222181d805a0fe477064cc5cf27dc6810b87cdd1c250f8558b650c98fd5e7a354c3e2e70090c53
|
|
7
|
+
data.tar.gz: 82c4561817b40e4dd0ecfd1b5267e6a5f41ea2e774c2d2537400aaf04886eb579807da457f6964915a065a32dc7beea043a2797d83ff04dba3c9fb4e46c39cb3
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,19 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.0.16 - 2026-05-15
|
|
4
|
+
|
|
5
|
+
- **Feat (SSE): replace `ld-eventsource` with an SDK-owned reconnect loop (qfg-35sm).** sdk-ruby was the outlier among the four backend SDKs — sdk-go, sdk-node, and sdk-python all own their reconnect loop, only sdk-ruby handed it off to a library and scraped its log output to observe reconnects. The wire format we actually consume (plain JSON envelopes in single-line `data:` frames, no named events, no retry directives) is trivial enough that an SDK-owned loop is clearer than the library wrapper. New `Quonfig::SSEConfigClient` (~520 LoC, `lib/quonfig/sse_config_client.rb`) handles connect/parse/reconnect end-to-end. `restart_total` is now incremented at exactly one site under a mutex — verifiable, not log-scraped. `ld-eventsource` and the transitive `http` gem are removed from the gemspec. `ReconnectCountingLogger` and the `sse_reconnect_reset_interval` option (both 0.0.15-era defensive scaffolding around upstream behavior) are deleted — the bugs they defended against don't exist when the SDK owns the loop. Chaos: 10/10 in a 36-min run (scenarios 02 silent-stall, 05 sse-down-fallback, 09 flapping kill-storm).
|
|
6
|
+
- **Fix (SSE): contain watchdog `Thread#raise` with `Thread.handle_interrupt` (qfg-tj18).** The new watchdog fires `Thread#raise(SSEReadDeadlineExceeded)` into the worker on a silent stall — the only reliable cross-platform way to unblock a Ruby thread blocked in `Net::HTTP`'s body-read on macOS. The decision to fire is mutex-guarded against `stop()`, but the raise itself is delivered at the worker's next interrupt checkpoint, which could be anywhere in the call stack. `run_loop`'s body now runs under `Thread.handle_interrupt(SSEReadDeadlineExceeded => :on_blocking)` so a late-landing raise can only land inside a blocking call; the `read_body` block explicitly switches to `:immediate`. A paranoid backstop `rescue` outside the until-`@stopped` loop ensures an escaped raise can never silently kill the worker.
|
|
7
|
+
- **Fix (SSE): isolate `on_envelope` callback exceptions (qfg-m3lk).** A buggy user-supplied listener that raised during envelope delivery used to propagate out of `read_body`, get caught by `run_loop` as a transport error, bump `restart_total`, and reconnect — a perpetual reconnect storm at api-delivery-sse driven by a customer code bug. The callback is now wrapped in `begin/rescue StandardError` at the invocation site; exceptions are logged with class + message + backtrace sample and the stream continues uninterrupted. `Interrupt` and `SystemExit` are deliberately not caught so `Ctrl-C` still works.
|
|
8
|
+
- **Fix (SSE): classify 401/403/404 as terminal errors (qfg-i5xv).** Non-200 responses used to be treated identically — `SSEHTTPStatusError` raised, `run_loop` rescued, `restart_total` bumped, backoff, retry, forever. For a bad SDK key (401) or revoked workspace (403) that was wasted load on api-delivery-sse with no recovery path short of a customer redeploy. New `SSEHTTPTerminalError` sentinel for 401/403/404; `run_loop` catches it, invokes `on_error`, exits the loop without bumping `restart_total`. Parent `Quonfig::Client` surfaces a terminal `:sse_terminal_failure` state distinct from transient `:error`. 429 and 5xx still retry.
|
|
9
|
+
- **Feat (fork): install `Process._fork` hook so SSE auto-restarts after fork (qfg-ryov).** Ruby threads do not survive `fork(2)`. Customers initializing `Quonfig::Client` in the Puma master (the `preload_app! true` / Rails `config.eager_load = true` convention) used to silently lose SSE in every worker child. New `Quonfig::ForkSafety` module prepends `Process._fork` and fans out across all live `Quonfig::Client` instances (tracked in an `ObjectSpace::WeakMap`): in the parent before the syscall, threaded components (SSE worker, polling supervisor, telemetry reporter) are torn down; in the child after the syscall, they are rebuilt. `@stopped` is preserved so a `stop()`-ed client stays stopped across fork. Covers `Process.fork` / `Kernel#fork`; `Process.spawn` and `system("...")` exec a new program so in-process state doesn't apply. Ruby 3.0 lacks `Process._fork` and is documented as requiring manual `before_fork` / `on_worker_boot` wiring.
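The `Thread.handle_interrupt` containment described in the qfg-tj18 entry above can be illustrated in isolation. The sketch below is not the SDK's `run_loop`; `ReadDeadlineExceeded` and `demo_on_blocking_mask` are hypothetical names standing in for `SSEReadDeadlineExceeded` and the worker. It only shows that under an `:on_blocking` mask a cross-thread raise stays pending through non-blocking code and is delivered once the thread enters a blocking call.

```ruby
# Hypothetical stand-in for the SDK's SSEReadDeadlineExceeded.
class ReadDeadlineExceeded < StandardError; end

# Runs a worker under an :on_blocking mask, raises into it from the calling
# thread, and returns the ordered list of events the worker recorded.
def demo_on_blocking_mask
  events  = []
  started = Queue.new

  worker = Thread.new do
    # Defer ReadDeadlineExceeded until the thread enters a blocking call.
    Thread.handle_interrupt(ReadDeadlineExceeded => :on_blocking) do
      begin
        started << true
        # Non-blocking section: spin until the raise has been queued.
        # Under :on_blocking it stays pending here instead of firing.
        Thread.pass until Thread.pending_interrupt?
        events << :critical_section_finished

        sleep 0.2 # blocking call: the pending raise is delivered here
        events << :not_reached
      rescue ReadDeadlineExceeded
        events << :raise_landed_in_blocking_call
      end
    end
  end

  started.pop                        # wait until the mask is in place
  worker.raise(ReadDeadlineExceeded) # what a watchdog would do on a stall
  worker.join
  events
end

p demo_on_blocking_mask
```

The worker should record `:critical_section_finished` before `:raise_landed_in_blocking_call`, and never `:not_reached`. The changelog's `:immediate` switch inside `read_body` is the inverse move: it re-opens delivery at the one place where the worker is expected to be stuck.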
|
|
10
|
+
|
|
11
|
+
## 0.0.15 - 2026-05-15
|
|
12
|
+
|
|
13
|
+
- **Fix (SSE): count ld-eventsource internal reconnects (qfg-ie49).** ld-eventsource auto-reconnects on a clean socket FIN *internally* and never fires `on_error`, so the qfg-ll6r on_error-based `restart_total` counter sat at 0 under flapping outages (chaos scenario 09 — proxy killed 5x in 30s). `restart_total` now counts actual reconnects from two mutually-exclusive sources: ld-eventsource internal reconnects (observed via a pass-through logger wrapper that watches the per-reconnect `"Will retry connection after"` info line — the only hook the library exposes) and SDK-driven reconnects in `@retry_thread`. `on_error` is no longer a counting source.
|
|
14
|
+
- **Fix (SSE): backoff reset interval (qfg-ie49).** New `sse_reconnect_reset_interval` option, default `1s`. ld-eventsource's 60s default lets the backoff run away under flapping — the SDK is mid-sleep when later kills land and never observes them. 1s mirrors sdk-python's reset-on-every-successful-connect behavior. Sustained outages still back off exponentially (`mark_success` is never called, so the reset never triggers).
|
|
15
|
+
- **Fix (SSE): make `ReconnectCountingLogger` raise-proof (qfg-cf52).** ld-eventsource calls the logger from inside a bare-`Thread` `run_stream` loop with several call sites unguarded by `rescue`. A throwing wrapper would kill the worker with `@stopped=false`, leaving `closed?` false forever — silently wedging the SSE stream (the intermittent chaos scenario 05 flake). Every wrapper step is now independently rescued.
|
|
16
|
+
|
|
3
17
|
## 0.0.14 - 2026-05-10
|
|
4
18
|
|
|
5
19
|
- **Feat: expose `variant` and `flag_metadata` on `EvaluationDetails` (qfg-9dbl).** OpenFeature's `EvaluationDetails` Ruby return type now carries the variant name and the flag-level metadata hash alongside the resolved value/reason. Brings sdk-ruby to parity with the other SDKs' detail surfaces and lets host apps (incl. the Ruby OpenFeature provider) read variant/metadata without re-fetching the config.
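Two of the 0.0.16 entries above, terminal-status classification (qfg-i5xv) and listener isolation (qfg-m3lk), can be sketched together without any network I/O. This is a simplified illustration, not the code in `lib/quonfig/sse_config_client.rb`: `classify_status`, `deliver`, and `simulate_run_loop` are hypothetical names, connection attempts are simulated as `[status, envelopes]` pairs, and the placement of the single `restart_total` increment is an assumption.

```ruby
TERMINAL_STATUSES = [401, 403, 404].freeze

# Map an HTTP status to the action a reconnect loop would take.
def classify_status(status)
  return :stream if status == 200

  TERMINAL_STATUSES.include?(status) ? :terminal : :retry
end

# Deliver one envelope to a user listener without letting its bugs escape.
# Only StandardError is rescued, so Interrupt and SystemExit still propagate.
def deliver(envelope, listener, log)
  listener.call(envelope)
  true
rescue StandardError => e
  log << "listener raised #{e.class}: #{e.message}"
  false
end

# Simulated reconnect loop. `attempts` stands in for successive connection
# attempts, each a [status, envelopes] pair.
def simulate_run_loop(attempts, listener:, log: [])
  restart_total = 0
  attempts.each do |status, envelopes|
    case classify_status(status)
    when :terminal
      log << "terminal HTTP #{status}, giving up"
      return [:terminal, restart_total] # exit without bumping the counter
    when :stream
      envelopes.each { |env| deliver(env, listener, log) }
    end
    restart_total += 1 # single increment site; a real loop would back off here
  end
  [:exhausted, restart_total]
end
```

A run with a 503, then a 200 whose listener raises on one envelope, then a 401 would reconnect twice, keep streaming past the listener error, and stop for good at the 401:

```ruby
seen = []
log = []
listener = ->(env) { env == "bad" ? raise(ArgumentError, "boom") : seen << env }
attempts = [[503, []], [200, %w[a bad b]], [401, []]]
p simulate_run_loop(attempts, listener: listener, log: log)
p seen
```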
|
data/README.md
CHANGED
|
@@ -247,15 +247,41 @@ not want to inherit across a `fork(2)`. Forked threads in the child process
|
|
|
247
247
|
are dead — the SSE socket is held open by a thread that no longer exists, and
|
|
248
248
|
the child silently stops receiving live updates.
|
|
249
249
|
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
250
|
+
**On Ruby 3.1+ the SDK installs a `Process._fork` hook at load time** that
|
|
251
|
+
automatically tears down threaded components in the parent and restarts them
|
|
252
|
+
in the child. This covers any `Process.fork` / `Kernel#fork` path — Puma's
|
|
253
|
+
clustered mode, Unicorn, Sidekiq's parent-forks-workers model, Spring, and
|
|
254
|
+
manual `fork { ... }` calls. **No customer wiring is required.**
|
|
255
|
+
|
|
256
|
+
Caveats:
|
|
257
|
+
|
|
258
|
+
- Ruby 3.0 has no hookable choke point — fall back to manual wiring (below).
|
|
259
|
+
- `system("fork-and-exec ...")` and `Process.spawn` are not covered (they do
|
|
260
|
+
not go through `Process._fork`), but those execute a new program, so the
|
|
261
|
+
in-process SSE state is moot.
|
|
262
|
+
- The hook tears down the SSE/polling/telemetry threads in the parent before
|
|
263
|
+
fork (so the child does not inherit a live socket fd) and does **not**
|
|
264
|
+
auto-restart the parent. This mirrors the Puma master case: the master no
|
|
265
|
+
longer serves requests, so it does not need a live SSE connection. If you
|
|
266
|
+
have a non-Puma topology where the parent must keep streaming after fork,
|
|
267
|
+
call `Quonfig.instance.after_fork_in_child` manually in the parent after
|
|
268
|
+
the fork returns.
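The hook mechanism described above can be sketched without actually forking. In this illustration `FakeProcess`, `SketchClient`, `SKETCH_CLIENTS`, and `SketchForkHook` are all hypothetical stand-ins; the real hook prepends onto `Process` and walks a weak registry of `Quonfig::Client` instances. What the sketch shows is the ordering: teardown before the call to `super`, restart only when the return value is `0` (the child), and no effect on a stopped client.

```ruby
# Hypothetical stand-in for Process so the sketch runs without forking.
module FakeProcess
  def self._fork
    0 # pretend the syscall returned in the child
  end
end

# Hypothetical stand-in for a client that owns background threads.
class SketchClient
  attr_reader :events

  def initialize(stopped: false)
    @stopped = stopped
    @events = []
  end

  def before_fork_in_parent
    @events << :torn_down unless @stopped
  end

  def after_fork_in_child
    @events << :restarted unless @stopped
  end
end

SKETCH_CLIENTS = [SketchClient.new, SketchClient.new(stopped: true)].freeze

module SketchForkHook
  def _fork
    SKETCH_CLIENTS.each(&:before_fork_in_parent) # runs before the "syscall"
    pid = super                                  # a real hook forks here
    SKETCH_CLIENTS.each(&:after_fork_in_child) if pid.zero? # child only
    pid
  end
end

FakeProcess.singleton_class.prepend(SketchForkHook)
FakeProcess._fork

p SKETCH_CLIENTS.map(&:events)
```

Prepending onto the singleton class is what lets the hook wrap a module-level method and still reach the original through `super`.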
|
|
254
269
|
|
|
255
270
|
### Puma (clustered mode)
|
|
256
271
|
|
|
272
|
+
With the automatic fork hook, the typical Puma config needs **no Quonfig
|
|
273
|
+
lifecycle wiring** — initialize in your Rails initializer and let the hook
|
|
274
|
+
handle the rest:
|
|
275
|
+
|
|
276
|
+
```ruby
|
|
277
|
+
# config/initializers/quonfig.rb
|
|
278
|
+
Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
|
|
279
|
+
```
|
|
280
|
+
|
|
281
|
+
If you're on Ruby 3.0 (no `Process._fork`), wire the legacy hooks manually:
|
|
282
|
+
|
|
257
283
|
```ruby
|
|
258
|
-
# config/puma.rb
|
|
284
|
+
# config/puma.rb (Ruby 3.0 only)
|
|
259
285
|
before_fork do
|
|
260
286
|
Quonfig.instance.stop # close the master's SSE before forking
|
|
261
287
|
end
|
|
@@ -265,18 +291,18 @@ on_worker_boot do
|
|
|
265
291
|
end
|
|
266
292
|
```
|
|
267
293
|
|
|
268
|
-
If you initialize Quonfig lazily (in a Rails initializer) and run Puma in
|
|
269
|
-
single mode (no clustering), no fork hook is needed.
|
|
270
|
-
|
|
271
294
|
### Sidekiq
|
|
272
295
|
|
|
273
|
-
|
|
296
|
+
On Ruby 3.1+ the automatic fork hook covers Sidekiq workers too — no
|
|
297
|
+
`configure_server` wiring required.
|
|
298
|
+
|
|
299
|
+
On Ruby 3.0:
|
|
274
300
|
|
|
275
301
|
```ruby
|
|
276
302
|
# config/initializers/quonfig.rb
|
|
277
303
|
Quonfig.init(Quonfig::Options.new(sdk_key: ENV.fetch('QUONFIG_BACKEND_SDK_KEY')))
|
|
278
304
|
|
|
279
|
-
# config/initializers/sidekiq.rb
|
|
305
|
+
# config/initializers/sidekiq.rb (Ruby 3.0 only)
|
|
280
306
|
Sidekiq.configure_server do |config|
|
|
281
307
|
config.on(:startup) { Quonfig.fork if Process.ppid != 1 }
|
|
282
308
|
config.on(:shutdown) { Quonfig.instance.stop rescue nil }
|
|
@@ -284,7 +310,7 @@ end
|
|
|
284
310
|
```
|
|
285
311
|
|
|
286
312
|
For Sidekiq web/CLI processes that don't fork (default `concurrency: 1`),
|
|
287
|
-
`Quonfig.init` in the initializer is sufficient.
|
|
313
|
+
`Quonfig.init` in the initializer is sufficient on any Ruby version.
|
|
288
314
|
|
|
289
315
|
### Spring / Bootsnap preloaders
|
|
290
316
|
|
|
@@ -333,6 +359,24 @@ converge once the envelope finishes applying.
|
|
|
333
359
|
`Quonfig.fork` is the only safe way to "carry" a client across `Process.fork`
|
|
334
360
|
— do not reuse the parent's client in a child process.
|
|
335
361
|
|
|
362
|
+
## Diagnostic health signals
|
|
363
|
+
|
|
364
|
+
`Quonfig::Client` exposes two read-only getters for monitoring SDK liveness:
|
|
365
|
+
|
|
366
|
+
- `client.last_successful_refresh` — a `Time` (UTC) marking the most recent
|
|
367
|
+
envelope install (any source: datadir, initial HTTP fetch, SSE, or fallback
|
|
368
|
+
polling). Returns `nil` before the first install. Preserved across `stop`.
|
|
369
|
+
- `client.connection_state` — a `Symbol` describing the aggregate state:
|
|
370
|
+
`:initializing`, `:connected`, `:disconnected`, or `:falling_back`.
|
|
371
|
+
|
|
372
|
+
> Do not wire `last_successful_refresh` or `connection_state` directly into a Kubernetes liveness probe. These signals are diagnostic, not pass/fail. A liveness probe based on SDK freshness will amplify transient network blips into restart cascades.
|
|
373
|
+
|
|
374
|
+
Compose your own threshold from the two getters if you need a dashboard signal
|
|
375
|
+
— but route alerts through a metrics pipeline, not a probe that restarts the
|
|
376
|
+
process.
|
|
377
|
+
|
|
378
|
+
There is intentionally no `client.healthy?` primitive.
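One way to compose a dashboard signal from the two getters is sketched below. The helper names (`quonfig_health_snapshot`, `quonfig_stale?`), the 300-second threshold, and the choice to treat "never refreshed" as stale are assumptions for illustration, not SDK API. `StubClient` is a stand-in so the sketch runs without the SDK; in an application a `Quonfig::Client` would be passed instead.

```ruby
# Hypothetical helper: reads the two documented getters into a plain hash
# suitable for emitting as gauges or log fields.
def quonfig_health_snapshot(client, now: Time.now.utc)
  last = client.last_successful_refresh
  {
    connection_state: client.connection_state,
    seconds_since_refresh: last ? (now - last).round(1) : nil
  }
end

# Hypothetical threshold check for dashboards and alerts (not for probes).
# A client that has never installed an envelope is treated as stale.
def quonfig_stale?(snapshot, threshold_seconds: 300)
  age = snapshot[:seconds_since_refresh]
  age.nil? || age > threshold_seconds
end

# Stand-in exposing the same two getters as the real client.
StubClient = Struct.new(:last_successful_refresh, :connection_state)

now = Time.utc(2026, 5, 15, 12, 0, 0)
snapshot = quonfig_health_snapshot(StubClient.new(now - 42, :falling_back), now: now)
p snapshot
p quonfig_stale?(snapshot)
```

Keeping the snapshot as data rather than a boolean leaves the pass/fail decision to the metrics pipeline, which is in line with the warning above about liveness probes.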
|
|
379
|
+
|
|
336
380
|
## Documentation
|
|
337
381
|
|
|
338
382
|
Full documentation, including SPEC, SDK reference, and operational guides, is
|
data/lib/quonfig/client.rb
CHANGED
|
@@ -20,6 +20,29 @@ module Quonfig
|
|
|
20
20
|
class Client
|
|
21
21
|
LOG = Quonfig::InternalLogger.new(self)
|
|
22
22
|
|
|
23
|
+
# qfg-ryov: instance registry for the Process._fork hook. Every live
|
|
24
|
+
# Client is tracked here so the hook can fan out before_fork_in_parent /
|
|
25
|
+
# after_fork_in_child across all of them without the customer needing to
|
|
26
|
+
# name a specific instance. ObjectSpace::WeakMap means a Client that goes
|
|
27
|
+
# out of scope is GC'd without leaking through this registry. Stopped
|
|
28
|
+
# Clients stay in the registry until GC; both fork hooks early-return on
|
|
29
|
+
# +@stopped+ so a stopped instance is effectively a no-op. (We don't use
|
|
30
|
+
# WeakMap#delete because it was added in Ruby 3.3 and the matrix still
|
|
31
|
+
# includes 3.2.)
|
|
32
|
+
@instances = ObjectSpace::WeakMap.new
|
|
33
|
+
@instances_mutex = Mutex.new
|
|
34
|
+
|
|
35
|
+
class << self
|
|
36
|
+
# Iterate live Client instances. Used by Quonfig::ForkSafety.
|
|
37
|
+
def each_instance(&block)
|
|
38
|
+
@instances_mutex.synchronize { @instances.keys }.each(&block)
|
|
39
|
+
end
|
|
40
|
+
|
|
41
|
+
def register_instance(client)
|
|
42
|
+
@instances_mutex.synchronize { @instances[client] = true }
|
|
43
|
+
end
|
|
44
|
+
end
|
|
45
|
+
|
|
23
46
|
attr_reader :options, :resolver, :store, :evaluator, :instance_hash,
|
|
24
47
|
:config_loader, :telemetry_reporter
|
|
25
48
|
|
|
@@ -40,9 +63,15 @@ module Quonfig
|
|
|
40
63
|
@resolver = Quonfig::Resolver.new(@store, @evaluator)
|
|
41
64
|
@semantic_logger_filters = {}
|
|
42
65
|
@sse_client = nil
|
|
43
|
-
@
|
|
66
|
+
@poll_supervisor = nil
|
|
44
67
|
@stopped = false
|
|
45
68
|
@telemetry_reporter = nil
|
|
69
|
+
@state_mutex = Mutex.new
|
|
70
|
+
@last_successful_refresh = nil
|
|
71
|
+
@sse_state = :idle
|
|
72
|
+
@sse_ever_connected = false
|
|
73
|
+
@fallback_engage_timer = nil
|
|
74
|
+
@sse_terminal_failure = false
|
|
46
75
|
|
|
47
76
|
# If the caller injected a store, we're in test/bootstrap mode; skip I/O.
|
|
48
77
|
return if store
|
|
@@ -54,6 +83,10 @@ module Quonfig
|
|
|
54
83
|
end
|
|
55
84
|
|
|
56
85
|
initialize_telemetry
|
|
86
|
+
|
|
87
|
+
# Register only for non-store-injected clients (a caller-supplied store
|
|
88
|
+
# is the test/bootstrap path; the fork hook does not apply there).
|
|
89
|
+
self.class.register_instance(self) unless store
|
|
57
90
|
end
|
|
58
91
|
|
|
59
92
|
# ---- Lookup --------------------------------------------------------
|
|
@@ -259,6 +292,121 @@ module Quonfig
|
|
|
259
292
|
|
|
260
293
|
def stop
|
|
261
294
|
@stopped = true
|
|
295
|
+
tear_down_threaded_components!
|
|
296
|
+
end
|
|
297
|
+
|
|
298
|
+
# qfg-ryov: pre-fork hook. Close the SSE worker, polling supervisor,
|
|
299
|
+
# telemetry reporter, and any fallback-engage timer. Idempotent — calling
|
|
300
|
+
# twice is safe. Does NOT set @stopped: the client is still expected to
|
|
301
|
+
# be usable post-fork via after_fork_in_child.
|
|
302
|
+
#
|
|
303
|
+
# Why this matters: Ruby threads do not survive fork(2). If we let the
|
|
304
|
+
# child inherit a live Net::HTTP socket, both processes read from the
|
|
305
|
+
# same fd and corrupt each other's bytes. Closing in the parent before
|
|
306
|
+
# fork is the only safe shape.
|
|
307
|
+
def before_fork_in_parent
|
|
308
|
+
return if @stopped
|
|
309
|
+
|
|
310
|
+
tear_down_threaded_components!
|
|
311
|
+
end
|
|
312
|
+
|
|
313
|
+
# qfg-ryov: post-fork (in child) hook. Re-establish whatever threaded
|
|
314
|
+
# components the client had pre-fork. No-op if the client was already
|
|
315
|
+
# stopped (the customer asked for it to be dead — do not resurrect),
|
|
316
|
+
# or if the client is in datadir mode (no threaded components to start).
|
|
317
|
+
def after_fork_in_child
|
|
318
|
+
return if @stopped
|
|
319
|
+
return if @options.datadir
|
|
320
|
+
return if @config_loader.nil? # never finished network init (e.g. invalid key)
|
|
321
|
+
|
|
322
|
+
# SSE state machine carries flags that no longer apply in the child
|
|
323
|
+
# (the parent had connected, the parent had errored, etc.). Reset.
|
|
324
|
+
@state_mutex.synchronize do
|
|
325
|
+
@sse_state = :idle
|
|
326
|
+
@sse_ever_connected = false
|
|
327
|
+
@sse_terminal_failure = false
|
|
328
|
+
end
|
|
329
|
+
|
|
330
|
+
sse_started = @options.enable_sse && start_sse
|
|
331
|
+
start_polling if @options.enable_polling && !sse_started
|
|
332
|
+
|
|
333
|
+
restart_telemetry_in_child
|
|
334
|
+
end
|
|
335
|
+
|
|
336
|
+
# quonfig_sdk_worker_restart_total counter (Tier 1 supervisor contract).
|
|
337
|
+
# Layer 1 (SSE) is tracked on Quonfig::SSEConfigClient#restart_total —
|
|
338
|
+
# incremented once per reconnect attempt by the SDK-owned reconnect
|
|
339
|
+
# loop (qfg-35sm). Layer 2 (HTTP polling fallback) is wired through
|
|
340
|
+
# Quonfig::WorkerSupervisor.
|
|
341
|
+
#
|
|
342
|
+
# Pass +layer:+ ('1' or '2') to read a single layer; default returns the
|
|
343
|
+
# sum across both layers so the chaos harness (and operators) can pull
|
|
344
|
+
# per-layer values explicitly while preserving the previous single-number
|
|
345
|
+
# diagnostic surface.
|
|
346
|
+
def worker_restart_total(layer: nil)
|
|
347
|
+
case layer&.to_s
|
|
348
|
+
when '1' then sse_restart_total
|
|
349
|
+
when '2' then poll_restart_total
|
|
350
|
+
else sse_restart_total + poll_restart_total
|
|
351
|
+
end
|
|
352
|
+
end
|
|
353
|
+
|
|
354
|
+
# Wall-clock time of the last installed envelope (any source: datadir,
|
|
355
|
+
# initial HTTP fetch, SSE, or polling fallback). +nil+ before the first
|
|
356
|
+
# install. Preserved after +stop+.
|
|
357
|
+
#
|
|
358
|
+
# **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
|
|
359
|
+
# — a transient network blip will trip any freshness threshold and cause
|
|
360
|
+
# a rolling restart cascade. See the README "Diagnostic health signals"
|
|
361
|
+
# section.
|
|
362
|
+
#
|
|
363
|
+
# Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
|
|
364
|
+
def last_successful_refresh
|
|
365
|
+
@state_mutex.synchronize { @last_successful_refresh }
|
|
366
|
+
end
|
|
367
|
+
|
|
368
|
+
# Aggregate connection state. Returns one of:
|
|
369
|
+
#
|
|
370
|
+
# - +:initializing+ — no envelope has been installed and SSE is not yet
|
|
371
|
+
# connected.
|
|
372
|
+
# - +:connected+ — SSE is live, or the SDK is delivering configs from a
|
|
373
|
+
# loaded envelope (datadir mode or post-initial-fetch with no SSE).
|
|
374
|
+
# - +:disconnected+ — +stop+ was called, or SSE errored and no fallback
|
|
375
|
+
# poller is active.
|
|
376
|
+
# - +:falling_back+ — the Layer 2 HTTP polling supervisor is alive and
|
|
377
|
+
# serving as the active update channel.
|
|
378
|
+
#
|
|
379
|
+
# **Diagnostic only.** Do NOT wire this into a Kubernetes liveness probe
|
|
380
|
+
# — see the README "Diagnostic health signals" section.
|
|
381
|
+
#
|
|
382
|
+
# Contract: integration-test-data/chaos/supervisor-test-contract.md (Test 6).
|
|
383
|
+
def connection_state
|
|
384
|
+
@state_mutex.synchronize do
|
|
385
|
+
next :disconnected if @stopped
|
|
386
|
+
next :falling_back if @poll_supervisor&.alive?
|
|
387
|
+
next :connected if @sse_state == :connected
|
|
388
|
+
next :disconnected if @sse_state == :error
|
|
389
|
+
|
|
390
|
+
# No SSE state change yet: state is driven by whether any envelope
|
|
391
|
+
# has been installed (datadir / initial fetch).
|
|
392
|
+
@last_successful_refresh.nil? ? :initializing : :connected
|
|
393
|
+
end
|
|
394
|
+
end
|
|
395
|
+
|
|
396
|
+
def fork
|
|
397
|
+
self.class.new(@options.for_fork)
|
|
398
|
+
end
|
|
399
|
+
|
|
400
|
+
def inspect
|
|
401
|
+
"#<Quonfig::Client:#{object_id} environment=#{@options.environment.inspect}>"
|
|
402
|
+
end
|
|
403
|
+
|
|
404
|
+
private
|
|
405
|
+
|
|
406
|
+
# Close every threaded component and drop its reference. Used by both
|
|
407
|
+
# +stop+ (where @stopped is also flipped) and +before_fork_in_parent+
|
|
408
|
+
# (where @stopped is left alone so the child can restart).
|
|
409
|
+
def tear_down_threaded_components!
|
|
262
410
|
begin
|
|
263
411
|
@sse_client&.close
|
|
264
412
|
rescue StandardError => e
|
|
@@ -266,9 +414,14 @@ module Quonfig
|
|
|
266
414
|
end
|
|
267
415
|
@sse_client = nil
|
|
268
416
|
|
|
269
|
-
|
|
270
|
-
|
|
271
|
-
|
|
417
|
+
cancel_fallback_engage_timer
|
|
418
|
+
|
|
419
|
+
begin
|
|
420
|
+
@poll_supervisor&.stop
|
|
421
|
+
rescue StandardError => e
|
|
422
|
+
LOG.debug "Error stopping poll supervisor: #{e.message}"
|
|
423
|
+
end
|
|
424
|
+
@poll_supervisor = nil
|
|
272
425
|
|
|
273
426
|
begin
|
|
274
427
|
@telemetry_reporter&.stop
|
|
@@ -278,16 +431,161 @@ module Quonfig
|
|
|
278
431
|
@telemetry_reporter = nil
|
|
279
432
|
end
|
|
280
433
|
|
|
281
|
-
|
|
282
|
-
|
|
434
|
+
# Rebuild the telemetry reporter in the child after fork. Mirrors the
|
|
435
|
+
# original initialize_telemetry path — fresh aggregators, fresh reporter.
|
|
436
|
+
def restart_telemetry_in_child
|
|
437
|
+
@telemetry_reporter = nil
|
|
438
|
+
initialize_telemetry
|
|
283
439
|
end
|
|
284
440
|
|
|
285
|
-
|
|
286
|
-
|
|
441
|
+
# Stamp +last_successful_refresh+ at install time. Called by every code
|
|
442
|
+
# path that hands an envelope to the cache: datadir load, initial HTTP
|
|
443
|
+
# fetch, SSE event apply, and polling worker fetch.
|
|
444
|
+
def record_refresh!
|
|
445
|
+
@state_mutex.synchronize { @last_successful_refresh = Time.now.utc }
|
|
446
|
+
end
|
|
447
|
+
|
|
448
|
+
def sse_restart_total
|
|
449
|
+
sse = @sse_client
|
|
450
|
+
return 0 if sse.nil?
|
|
451
|
+
return 0 unless sse.respond_to?(:restart_total)
|
|
452
|
+
|
|
453
|
+
sse.restart_total.to_i
|
|
454
|
+
end
|
|
455
|
+
|
|
456
|
+
def poll_restart_total
|
|
457
|
+
sup = @poll_supervisor
|
|
458
|
+
return 0 if sup.nil?
|
|
459
|
+
return 0 unless sup.respond_to?(:worker_restart_total)
|
|
460
|
+
|
|
461
|
+
sup.worker_restart_total.to_i
|
|
462
|
+
end
|
|
463
|
+
|
|
464
|
+
# Drive the SSE-side of the connection_state machine. The SSE client
|
|
465
|
+
# invokes this on connect/error edges; tests call it directly via +send+.
|
|
466
|
+
# Documented values: :idle, :connecting, :connected, :error.
|
|
467
|
+
#
|
|
468
|
+
# Also drives the Layer 2 fallback poller's engage/disengage:
|
|
469
|
+
# - :connected clears any pending engage timer and stops an active
|
|
470
|
+
# fallback poller (SSE recovered, drop the second channel).
|
|
471
|
+
# - :error before any successful connect engages immediately
|
|
472
|
+
# (initial-fail path).
|
|
473
|
+
# - :error after a successful connect schedules a 2x-poll-interval
|
|
474
|
+
# grace timer; the timer engages if SSE has not recovered by then.
|
|
475
|
+
# Mirrors sdk-python's `_handle_sse_state_change` and sdk-node's
|
|
476
|
+
# `fallbackPollerActive` engagement behavior. (qfg-47c2.26)
|
|
477
|
+
# Stable callable handed to Quonfig::SSEConfigClient so its +on_error+
|
|
478
|
+
# block can drive @sse_state -> :error on a mid-run socket drop. Without
|
|
479
|
+
# this wiring, +connection_state+ would stay +:connected+ after a
|
|
480
|
+
# disconnect and customers composing staleness checks would see stale
|
|
481
|
+
# data. (qfg-47c2.27)
|
|
482
|
+
def sse_error_callback
|
|
483
|
+
@sse_error_callback ||= ->(error) { handle_sse_error(error) }
|
|
484
|
+
end
|
|
485
|
+
|
|
486
|
+
def handle_sse_error(error)
|
|
487
|
+
# qfg-i5xv: classify terminal HTTP failures (401/403/404). The same SDK
|
|
488
|
+
# key that won't auth over SSE won't auth over HTTP polling either, so
|
|
489
|
+
# we must NOT engage the Layer 2 fallback — that just moves the
|
|
490
|
+
# auth-failure storm from one endpoint to another. Once flipped,
|
|
491
|
+
# @sse_terminal_failure latches: a buggy customer retry loop cannot
|
|
492
|
+
# un-classify the failure by driving the state machine.
|
|
493
|
+
@state_mutex.synchronize { @sse_terminal_failure = true } if error.is_a?(Quonfig::SSEConfigClient::SSEHTTPTerminalError)
|
|
494
|
+
handle_sse_state_change(:error)
|
|
495
|
+
end
|
|
496
|
+
|
|
497
|
+
def handle_sse_state_change(new_state)
|
|
498
|
+
state = new_state.to_sym
|
|
499
|
+
ever_connected, terminal = @state_mutex.synchronize do
|
|
500
|
+
@sse_state = state
|
|
501
|
+
@sse_ever_connected = true if state == :connected
|
|
502
|
+
[@sse_ever_connected, @sse_terminal_failure]
|
|
503
|
+
end
|
|
504
|
+
|
|
505
|
+
return unless @options.respond_to?(:enable_polling) && @options.enable_polling
|
|
506
|
+
return if @stopped
|
|
507
|
+
# qfg-i5xv: a terminal SSE classification suppresses polling engage in
|
|
508
|
+
# every branch — the customer's key is bad and HTTP polling will fail
|
|
509
|
+
# identically. Operators surface this via #terminal_failure?.
|
|
510
|
+
return if terminal
|
|
511
|
+
|
|
512
|
+
case state
|
|
513
|
+
when :connected
|
|
514
|
+
cancel_fallback_engage_timer
|
|
515
|
+
stop_fallback_poller('sse-recovered')
|
|
516
|
+
when :error
|
|
517
|
+
if ever_connected
|
|
518
|
+
schedule_fallback_engage
|
|
519
|
+
else
|
|
520
|
+
start_polling
|
|
521
|
+
end
|
|
522
|
+
end
|
|
523
|
+
end
|
|
524
|
+
|
|
525
|
+
public
|
|
526
|
+
|
|
527
|
+
# qfg-i5xv: true once the SSE layer has classified an HTTP response as
|
|
528
|
+
# terminal (401/403/404) — bad SDK key, revoked workspace permission,
|
|
529
|
+
# or wrong endpoint. The classification latches: the SDK will not
|
|
530
|
+
# auto-recover, and a customer-supplied retry must rebuild the client.
|
|
531
|
+
# Surfaced for operator alerting; `connection_state` still reports
|
|
532
|
+
# `:disconnected` to honor the documented connection_state vocabulary
|
|
533
|
+
# (supervisor-test-contract.md §"connectionState()" — values fixed).
|
|
534
|
+
def terminal_failure?
|
|
535
|
+
@state_mutex.synchronize { @sse_terminal_failure }
|
|
287
536
|
end
|
|
288
537
|
|
|
289
538
|
private
|
|
290
539
|
|
|
540
|
+
def cancel_fallback_engage_timer
|
|
541
|
+
timer = @state_mutex.synchronize do
|
|
542
|
+
t = @fallback_engage_timer
|
|
543
|
+
@fallback_engage_timer = nil
|
|
544
|
+
t
|
|
545
|
+
end
|
|
546
|
+
timer&.kill if timer&.alive?
|
|
547
|
+
end
|
|
548
|
+
|
|
549
|
+
def stop_fallback_poller(reason)
|
|
550
|
+
supervisor = @state_mutex.synchronize do
|
|
551
|
+
s = @poll_supervisor
|
|
552
|
+
@poll_supervisor = nil
|
|
553
|
+
s
|
|
554
|
+
end
|
|
555
|
+
return if supervisor.nil?
|
|
556
|
+
|
|
557
|
+
begin
|
|
558
|
+
supervisor.stop
|
|
559
|
+
LOG.debug "[quonfig] Layer 2 fallback poller stopped (reason=#{reason})"
|
|
560
|
+
rescue StandardError => e
|
|
561
|
+
LOG.debug "Error stopping fallback poller: #{e.message}"
|
|
562
|
+
end
|
|
563
|
+
end
|
|
564
|
+
|
|
565
|
+
# Schedule a 2*poll_interval grace timer after a connected->error edge.
|
|
566
|
+
# If SSE recovers before the timer fires, +cancel_fallback_engage_timer+
|
|
567
|
+
# tears it down. Idempotent — does nothing if a timer is already pending
|
|
568
|
+
# or the supervisor is already alive.
|
|
569
|
+
def schedule_fallback_engage
|
|
570
|
+
poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
|
|
571
|
+
return if poll_interval <= 0
|
|
572
|
+
|
|
573
|
+
grace_seconds = poll_interval * 2.0
|
|
574
|
+
|
|
575
|
+
@state_mutex.synchronize do
|
|
576
|
+
return if @fallback_engage_timer&.alive?
|
|
577
|
+
return if @poll_supervisor&.alive?
|
|
578
|
+
return if @stopped
|
|
579
|
+
|
|
580
|
+
@fallback_engage_timer = Thread.new do
|
|
581
|
+
Thread.current.report_on_exception = false
|
|
582
|
+
sleep grace_seconds
|
|
583
|
+
@state_mutex.synchronize { @fallback_engage_timer = nil }
|
|
584
|
+
start_polling unless @stopped
|
|
585
|
+
end
|
|
586
|
+
end
|
|
587
|
+
end
|
|
588
|
+
|
|
291
589
|
# Construct and start the telemetry reporter if the options permit it.
|
|
292
590
|
# The reporter runs on a background thread and periodically POSTs
|
|
293
591
|
# context-shape and example-context batches to +telemetry_destination+.
|
|
@@ -378,6 +676,7 @@ module Quonfig
|
|
|
378
676
|
def load_datadir_into_store
|
|
379
677
|
envelope = Quonfig::Datadir.load_envelope(@options.datadir, @options.environment)
|
|
380
678
|
envelope.configs.each { |cfg| @store.set(cfg['key'], cfg) }
|
|
679
|
+
record_refresh!
|
|
381
680
|
end
|
|
382
681
|
|
|
383
682
|
# Initialize network mode: sync HTTP fetch (bounded by
|
|
@@ -412,7 +711,11 @@ module Quonfig
|
|
|
412
711
|
return
|
|
413
712
|
end
|
|
414
713
|
|
|
415
|
-
|
|
714
|
+
if result == :failed
|
|
715
|
+
handle_init_failure(RuntimeError.new('Config fetch failed against all api_urls'))
|
|
716
|
+
else
|
|
717
|
+
record_refresh!
|
|
718
|
+
end
|
|
416
719
|
end
|
|
417
720
|
|
|
418
721
|
def handle_init_failure(err)
|
|
@@ -429,44 +732,79 @@ module Quonfig
|
|
|
429
732
|
def start_sse
|
|
430
733
|
return false if @options.sse_api_urls.nil? || @options.sse_api_urls.empty?
|
|
431
734
|
|
|
432
|
-
@sse_client = Quonfig::SSEConfigClient.new(
|
|
735
|
+
@sse_client = Quonfig::SSEConfigClient.new(
|
|
736
|
+
@options,
|
|
737
|
+
@config_loader,
|
|
738
|
+
nil,
|
|
739
|
+
nil,
|
|
740
|
+
on_error: sse_error_callback
|
|
741
|
+
)
|
|
433
742
|
@sse_client.start do |envelope, _event, _source|
|
|
434
743
|
next if @stopped
|
|
435
744
|
|
|
436
745
|
begin
|
|
437
746
|
@config_loader.apply_envelope(envelope)
|
|
438
|
-
|
|
747
|
+
handle_sse_state_change(:connected)
|
|
748
|
+
record_refresh!
|
|
439
749
|
rescue StandardError => e
|
|
440
750
|
LOG.warn "[quonfig] Error applying SSE envelope: #{e.message}"
|
|
751
|
+
next
|
|
441
752
|
end
|
|
753
|
+
notify_on_update_callback
|
|
442
754
|
end
|
|
443
755
|
true
|
|
444
756
|
rescue StandardError => e
|
|
445
757
|
LOG.warn "[quonfig] SSE start failed: #{e.message}"
|
|
446
758
|
@sse_client = nil
|
|
759
|
+
handle_sse_state_change(:error)
|
|
447
760
|
false
|
|
448
761
|
end
|
|
449
762
|
|
|
450
763
|
def start_polling
|
|
764
|
+
return if @stopped
|
|
765
|
+
return if @poll_supervisor&.alive?
|
|
766
|
+
|
|
451
767
|
poll_interval = @options.respond_to?(:poll_interval) && @options.poll_interval ? @options.poll_interval : 60
|
|
452
768
|
return if poll_interval <= 0
|
|
453
769
|
|
|
454
|
-
|
|
455
|
-
|
|
770
|
+
stopped_ref = -> { @stopped }
|
|
771
|
+
worker = lambda do |notify_delivered|
|
|
456
772
|
loop do
|
|
457
|
-
break if
|
|
773
|
+
break if stopped_ref.call
|
|
458
774
|
|
|
459
775
|
sleep poll_interval
|
|
460
|
-
break if
|
|
461
|
-
|
|
462
|
-
|
|
463
|
-
|
|
464
|
-
|
|
465
|
-
|
|
466
|
-
LOG.warn "[quonfig] Polling error: #{e.message}"
|
|
467
|
-
end
|
|
776
|
+
break if stopped_ref.call
|
|
777
|
+
|
|
778
|
+
@config_loader.fetch!
|
|
779
|
+
record_refresh!
|
|
780
|
+
notify_delivered.call
|
|
781
|
+
notify_on_update_callback
|
|
468
782
|
end
|
|
469
783
|
end
|
|
784
|
+
|
|
785
|
+
supervisor = Quonfig::WorkerSupervisor.new(
|
|
786
|
+
name: 'poll', layer: '2', worker: worker
|
|
787
|
+
)
|
|
788
|
+
@state_mutex.synchronize { @poll_supervisor = supervisor }
|
|
789
|
+
supervisor.start
|
|
790
|
+
end
|
|
791
|
+
|
|
792
|
+
# Invoke the customer-supplied on_update callback under a rescue. A raise
|
|
793
|
+
# here is the customer's bug, but it must NOT take down the SSE listener
|
|
794
|
+
# or polling supervisor. Log at ERROR with a message containing
|
|
795
|
+
# "onConfigUpdate callback" so chaos scenario 10's
|
|
796
|
+
# sdkLog('error', /callback|onConfigUpdate/i) assertion matches and so
|
|
797
|
+
# the message is distinguishable from internal envelope-apply errors
|
|
798
|
+
# (qfg-47c2.30).
|
|
799
|
+
def notify_on_update_callback
|
|
800
|
+
cb = @on_update
|
|
801
|
+
return unless cb
|
|
802
|
+
|
|
803
|
+
begin
|
|
804
|
+
cb.call
|
|
805
|
+
rescue StandardError => e
|
|
806
|
+
LOG.error "[quonfig] onConfigUpdate callback raised: #{e.class}: #{e.message}"
|
|
807
|
+
end
|
|
470
808
|
end
|
|
471
809
|
|
|
472
810
|
def build_context(jit_context)
|
|
@@ -673,4 +1011,42 @@ module Quonfig
|
|
|
673
1011
|
end
|
|
674
1012
|
end
|
|
675
1013
|
end
|
|
1014
|
+
|
|
1015
|
+
# qfg-ryov: hook into Process._fork so customers using Puma's clustered
|
|
1016
|
+
# mode (or any preload/fork-worker server) don't have to wire
|
|
1017
|
+
# +before_fork+/+on_worker_boot+ manually. Ruby 3.1+ routes every
|
|
1018
|
+
# +Kernel#fork+/+Process.fork+ call through +Process._fork+, so a single
|
|
1019
|
+
# prepend covers them all.
|
|
1020
|
+
#
|
|
1021
|
+
# Process._fork's contract:
|
|
1022
|
+
# - Called in the parent process before the fork syscall.
|
|
1023
|
+
# - Returns 0 in the child, child's pid in the parent.
|
|
1024
|
+
# - +super+ performs the actual fork.
|
|
1025
|
+
#
|
|
1026
|
+
# The parent's view: SSE/polling/telemetry threads are torn down before
|
|
1027
|
+
# the syscall so the child does not inherit a live Net::HTTP socket fd
|
|
1028
|
+
# (which would corrupt both sides). The parent does NOT auto-restart —
|
|
1029
|
+
# that mirrors the Puma master use case where the master process no
|
|
1030
|
+
# longer serves requests after spawning workers.
|
|
1031
|
+
module ForkSafety
|
|
1032
|
+
def _fork
|
|
1033
|
+
Quonfig::Client.each_instance(&:before_fork_in_parent)
|
|
1034
|
+
pid = super
|
|
1035
|
+
Quonfig::Client.each_instance(&:after_fork_in_child) if pid.zero?
|
|
1036
|
+
pid
|
|
1037
|
+
rescue StandardError => e
|
|
1038
|
+
# Fork-hook failures must never break the customer's fork. Worst case
|
|
1039
|
+
# the child inherits dead SSE threads (the pre-qfg-ryov behavior) —
|
|
1040
|
+
# bad, but recoverable. Crashing the fork itself is not.
|
|
1041
|
+
Quonfig::Client::LOG.error "Quonfig fork hook error: #{e.class}: #{e.message}"
|
|
1042
|
+
raise if pid.nil? # super never returned — propagate fork failures
|
|
1043
|
+
|
|
1044
|
+
pid
|
|
1045
|
+
end
|
|
1046
|
+
end
|
|
1047
|
+
|
|
1048
|
+
# Ruby 3.0 lacks Process._fork. There's no hookable choke point on 3.0, so
|
|
1049
|
+
# customers must keep wiring their own Puma before_fork / on_worker_boot
|
|
1050
|
+
# (see README "Rails integration"). On 3.1+ we install the hook globally.
|
|
1051
|
+
Process.singleton_class.prepend(ForkSafety) if Process.respond_to?(:_fork)
|
|
676
1052
|
end
|