protobuf-nats 0.13.1.pre1 → 0.13.1.pre2
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -0
- data/README.md +20 -3
- data/lib/protobuf/nats/config.rb +34 -3
- data/lib/protobuf/nats/errors.rb +7 -0
- data/lib/protobuf/nats/response_muxer.rb +8 -0
- data/lib/protobuf/nats/server.rb +73 -23
- data/lib/protobuf/nats/version.rb +1 -1
- data/lib/protobuf/nats.rb +30 -2
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fd150563a177c8d3eddeac0375225c8e05489082e3189da9746c9ad86e4bc188
+  data.tar.gz: a5a4d99c1897458ac511ac3f92bcd1d09338c340b11b97e214900c076a63f69d
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 552ec79ee1fe6b74b85e0ba5b7ee67f1a50beccc20155b515f28249a25632bed90a87bdfc5c9b9badaac6f538f1e7e1ebba7cda6c45f773529af6b4fec2e52aa
+  data.tar.gz: 4f93e5107575a9703e01a0808705ad220dcdee1dadef657da326315f35f5de1d5ac59cf035bb250178697a31ca9a5c4094cc1fe124021437a87fe532396a231a
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,19 @@
 ## Changelog
 
+### 0.13.1.pre2
+Additional edge-case fixes found while reviewing the 0.13.1 changes:
+
+- **ResponseMuxer self-heal could drop to zero dispatchers.** When the sole dispatcher crashed fatally, its self-healing `start` counted the still-alive (but exiting) crashing thread, so it spawned no replacement — leaving zero dispatchers on CRuby (`dispatcher_count == 1`) and the muxer silently delivering no responses. The crashing thread now removes itself from the handler pool before re-topping it up.
+- **Client connection is rebuilt after a terminal close.** `@client_nats_connection` was memoized once and never reset, so once nats-pure gave up and fired `on_close` every later request reused a dead client forever. `on_close` now drops the cached connection so the next request rebuilds.
+- **Dropped error callbacks are now observable.** The bounded `notify_error_callbacks_async` executor silently discarded callbacks when saturated (exactly during an error flood). Drops now bump `Protobuf::Nats.error_callback_drop_count` and emit `error_callback_dropped`, without formatting/logging on the read thread.
+- **Server no longer double-publishes on a response-publish failure.** A transport error while publishing a *successful* response fell into the handler rescue and emitted a second (error) response for the same request. The handler and the success-response publish are now in separate rescue scopes. The handler-failure error response also now sends a generic message ("Internal server error") instead of the raw `error.message`, so internal handler details aren't leaked to clients (the real error is still logged server-side).
+- **Opt-in reclaim of overdue handlers.** Handlers are still never aborted by default. `PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS=true` lets operators reclaim a pool slot held by an orphaned handler (one that outlived the client's `response_timeout`) by raising `Errors::HandlerOverdue` into it; emits `server.handler_reclaimed`.
+- **TLS now verifies the NATS server certificate.** Previously, supplying a prepared `:tls` context made nats-pure skip its own `set_params`, leaving the OpenSSL default `VERIFY_NONE` in force — any certificate was accepted (MITM exposure) — and `tls_ca_cert` was configured but read nowhere. `Config#new_tls_context` now sets `verify_mode = VERIFY_PEER` and trusts the configured `tls_ca_cert` (falling back to the system trust store when none is set). **Breaking for misconfigured deployments:** a server whose certificate does not chain to the trusted CA, which previously connected insecurely, will now be rejected. (Hostname/SAN verification is still not enabled — see known gaps.)
+
+#### Known gaps noted (not changed here)
+- **TLS hostname (SAN/CN) is not verified.** Chain verification is now on, but nats-pure only sets the SSLSocket hostname for a context it builds itself, and a single static hostname would be wrong for a multi-server cluster that reconnects across hosts. Plumbing per-connection hostname verification is tracked separately.
+- **`TLS1_3_VERSION` is assumed defined.** Fine on the JRuby targets; an old MRI/OpenSSL build without the constant would raise `NameError`.
+
 ### 0.13.1
 Fixes a production regression and a set of related issues, all of the same class: assumptions left over from the JNats → nats-pure migration in 0.13.0 that became silently wrong.
 
|
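The "client connection is rebuilt after a terminal close" entry above can be pictured with a small sketch. This is not the gem's code: `FakeClient` and `ConnectionHolder` are hypothetical names, and the fake client only tracks an open/closed flag. It illustrates, under those assumptions, why a memoized reference has to be cleared in `on_close`.

```ruby
# Hypothetical stand-in for a NATS client; it only tracks open/closed state.
class FakeClient
  def initialize
    @on_close = nil
    @closed = false
  end

  def on_close(&block)
    @on_close = block
  end

  def closed?
    @closed
  end

  # Simulates the client library giving up for good.
  def close!
    @closed = true
    @on_close&.call
  end
end

# Hypothetical holder that memoizes one client and rebuilds it after a close.
class ConnectionHolder
  def initialize
    @lock = Mutex.new
    @client = nil
  end

  # Builds a client on first use, and again after a terminal close.
  def client
    @lock.synchronize { @client ||= build_client }
  end

  private

  def build_client
    fresh = FakeClient.new
    # Drop the memo on close so the next call rebuilds. Only clear it if it
    # still points at this client, so a newer client is never discarded.
    fresh.on_close { @client = nil if @client.equal?(fresh) }
    fresh
  end
end
```

Callers that already hold a reference to the closed client keep it; only later calls to `client` see the rebuilt one, which matches the behaviour the changelog describes.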
data/README.md
CHANGED
|
@@ -57,6 +57,12 @@ so the work is orphaned. Defaults above the client's 60s response timeout so leg
 a small grace). If clients use a longer response timeout, raise this so handlers aren't flagged overdue while a client is
 still waiting; if shorter, lower it so orphaned work is surfaced promptly.
 
+`PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS` - When `"true"`, actively reclaim the thread-pool slot held by an overdue
+handler (one past `PB_NATS_SERVER_HANDLER_OVERDUE_MS`, whose client has already given up) by raising
+`Errors::HandlerOverdue` into the worker; emits `server.handler_reclaimed`. **Off by default** — the documented contract
+is that handlers are never aborted, since killing a thread mid-handler can corrupt state. Enable only when orphaned work
+is saturating the pool and the server is NACKing healthy traffic (default: false).
+
 `PB_NATS_CLIENT_ACK_TIMEOUT` - Seconds to wait for an ACK from the rpc server (default: 5 seconds).
 
 `PB_NATS_CLIENT_NACK_BACKOFF_INTERVALS` - Array of milliseconds to wait between NACK retries (default: "0,1,3,5,10").
@@ -115,6 +121,8 @@ An example config looks like this:
 
 When `uses_tls` is set, the client negotiates TLS with a floor of 1.2 and a ceiling of 1.3: it uses TLS 1.3 where the NATS server supports it and falls back to 1.2 otherwise (verified on JRuby 9.4 and 10.0).
 
+The client **verifies the NATS server's certificate chain** (`verify_mode = VERIFY_PEER`). When `tls_ca_cert` is set, only certificates chaining to that CA are trusted (the private-CA case); otherwise the system trust store is used. **Note:** a server whose certificate does not chain to the configured CA will be rejected — if you are upgrading from a release that did not verify (`< 0.13.1`), make sure `tls_ca_cert` points at the CA that signed your NATS server certificate. Hostname (SAN/CN) verification is not yet enabled (chain verification still ensures the certificate is signed by your trusted CA).
+
 ## Usage
 
 This library is designed to be an alternative transport implementation used by the `protobuf` gem. In order to make
@@ -195,8 +203,13 @@ If we were to add another service endpoint called `search` to the `UserService`
 self-heal with exponential backoff.
 - **Server observability** — beyond the thread-pool gauges, the server emits in-flight handler metrics
 (`server.inflight_count`, `server.inflight_oldest_age_ms`, `server.overdue_handler_count`, `server.handler_overdue`,
-`server.pending_intake_queue_size`, `server.thread_pool_saturated`). Long-running handlers are allowed and never aborted
-a handler is only "overdue" once it outlives the client's `response_timeout` (see
+`server.pending_intake_queue_size`, `server.thread_pool_saturated`). Long-running handlers are allowed and never aborted
+by default; a handler is only "overdue" once it outlives the client's `response_timeout` (see
+`PB_NATS_SERVER_HANDLER_OVERDUE_MS`). Overdue handlers can optionally be reclaimed (emitting `server.handler_reclaimed`)
+via `PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS`.
+- **Error-callback observability** — `on_error` callbacks run on a bounded background executor so a slow callback can't
+stall message processing. If that executor saturates under an error flood, dropped callbacks are counted
+(`Protobuf::Nats.error_callback_drop_count`) and emit `error_callback_dropped` rather than being lost silently.
 
 ## Resilience
 
@@ -209,7 +222,11 @@ The client is built to ride out transient NATS hiccups rather than surface them
 - **Missing ACKs and NACKs are retried** with their own timeouts/backoff (`PB_NATS_CLIENT_ACK_TIMEOUT`,
 `PB_NATS_CLIENT_NACK_BACKOFF_INTERVALS`).
 - **Server-side failures fail the caller fast.** If the server cannot process a request after it has ACKed, it publishes
-an encoded RPC error response
+an encoded RPC error response (a generic message; the real error is logged server-side) so the client raises immediately
+instead of blocking until `PB_NATS_CLIENT_RESPONSE_TIMEOUT`. A transport failure while publishing a successful response
+no longer triggers a duplicate error response.
+- **The client connection self-heals after a terminal close.** A permanently closed NATS connection is dropped and
+rebuilt on the next request instead of being reused.
 - **The response dispatcher self-heals.** A crashed muxer dispatcher restarts with exponential backoff, and a brief
 subscription-restart window won't busy-spin the dispatch loop.
 
|
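One detail of the `PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS` setting documented above is easy to miss: the server compares the raw value against the exact string `"true"`. A minimal sketch of that comparison follows; the helper name and the `env` parameter are illustrative assumptions, not part of the gem.

```ruby
# Hypothetical helper mirroring the strict comparison shown in the server diff:
# only the exact string "true" enables reclaim.
def reclaim_overdue_handlers?(env = ENV)
  env.fetch("PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS", "false") == "true"
end

key = "PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS"
reclaim_overdue_handlers?({})               # => false (unset, default off)
reclaim_overdue_handlers?(key => "true")    # => true
reclaim_overdue_handlers?(key => "TRUE")    # => false (case-sensitive)
reclaim_overdue_handlers?(key => "1")       # => false (not a recognised spelling)
```

Under this reading, values such as `1`, `yes`, or `TRUE` leave reclaim disabled, so deployments that want it on should set the variable to lowercase `true`.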
data/lib/protobuf/nats/config.rb
CHANGED
|
@@ -66,9 +66,11 @@ module Protobuf
 end
 
 # Only the keys nats-pure's `connect` actually consumes. App-level settings
-# (uses_tls,
-# are read
-#
+# (uses_tls, tls_client_cert, tls_client_key, tls_ca_cert,
+# server_subscription_key_*, subscription_key_replacements) are read
+# directly via their accessors elsewhere and must NOT be forwarded to
+# nats-pure (it ignores unknown keys today, but that is brittle). The TLS
+# cert/key/CA are folded into the :tls context by #new_tls_context.
 def connection_options(reload = false)
   @connection_options = false if reload
   @connection_options ||= begin
@@ -88,10 +90,39 @@ module Protobuf
   # ssl_version=:TLSv1_2 hard pin). The client offers 1.2 and 1.3 and
   # negotiates the highest the server also supports, so a TLS-1.2-only
   # transport still connects (verified on JRuby 9.4 and 10.0).
+  #
+  # NOTE (#7): this assumes the OpenSSL build defines TLS1_3_VERSION. That
+  # holds on the JRuby targets above, but an older MRI/OpenSSL build without
+  # the constant would raise NameError here. Not guarded yet -- revisit if
+  # CRuby-on-old-OpenSSL becomes a supported target.
   tls_context.min_version = ::OpenSSL::SSL::TLS1_2_VERSION
   tls_context.max_version = ::OpenSSL::SSL::TLS1_3_VERSION
   tls_context.cert = ::OpenSSL::X509::Certificate.new(::File.read(tls_client_cert)) if tls_client_cert
   tls_context.key = ::OpenSSL::PKey::RSA.new(::File.read(tls_client_key)) if tls_client_key
+
+  # Verify the NATS server's certificate chain. This context is handed to
+  # nats-pure as :tls => {:context => ...}; nats-pure uses a supplied
+  # context verbatim and does NOT call #set_params, so verification has to
+  # be configured here. Without this the OpenSSL default (VERIFY_NONE)
+  # stood and any certificate -- including an attacker's -- was accepted.
+  tls_context.verify_mode = ::OpenSSL::SSL::VERIFY_PEER
+  cert_store = ::OpenSSL::X509::Store.new
+  if tls_ca_cert
+    # Trust the configured CA bundle (the private-CA deployment case).
+    cert_store.add_file(tls_ca_cert)
+  else
+    # No CA configured: fall back to the system trust store.
+    cert_store.set_default_paths
+  end
+  tls_context.cert_store = cert_store
+
+  # NOTE: hostname (SAN/CN) verification is NOT enabled here. nats-pure only
+  # sets the SSLSocket hostname from @tls[:hostname], which it populates
+  # itself only when it builds the context; for a supplied context it stays
+  # nil, and a single static hostname would be wrong for a multi-server
+  # cluster that reconnects across hosts. Chain verification above still
+  # ensures the cert is signed by the trusted CA. Plumbing per-connection
+  # hostname verification is tracked separately.
   tls_context
 end
 
|
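The `NOTE (#7)` in the hunk above leaves the `TLS1_3_VERSION` constant unguarded. One possible shape for a guard, shown as a standalone sketch rather than a patch to `Config#new_tls_context`, is below. The method name and the `ca_file:` parameter are assumptions for illustration; the OpenSSL calls are the same ones the diff uses, plus `Module#const_defined?`.

```ruby
require "openssl"

# Hypothetical helper (not the gem's API): builds a TLS context that verifies
# the peer's certificate chain. ca_file is an optional path to a PEM CA bundle.
def build_verifying_tls_context(ca_file: nil)
  context = OpenSSL::SSL::SSLContext.new
  context.min_version = OpenSSL::SSL::TLS1_2_VERSION

  # Only set the 1.3 ceiling when this OpenSSL build defines the constant,
  # so an older build leaves max_version at its default instead of raising
  # NameError.
  if OpenSSL::SSL.const_defined?(:TLS1_3_VERSION)
    context.max_version = OpenSSL::SSL::TLS1_3_VERSION
  end

  # Chain verification has to be switched on explicitly for a hand-built context.
  context.verify_mode = OpenSSL::SSL::VERIFY_PEER

  store = OpenSSL::X509::Store.new
  if ca_file
    store.add_file(ca_file)      # trust only the configured CA bundle
  else
    store.set_default_paths      # fall back to the system trust store
  end
  context.cert_store = store
  context
end
```

Like the diffed code, this sketch does not enable hostname (SAN/CN) verification; it only covers chain verification and the version bounds.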
data/lib/protobuf/nats/errors.rb
CHANGED
|
@@ -16,6 +16,13 @@ module Protobuf
 class MriIOException < ::StandardError
 end
 
+# Raised into a worker thread to reclaim a handler that has outlived the
+# client's response_timeout. Only used when overdue-reclaim is explicitly
+# enabled via PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS (default off); the
+# documented default is that handlers are never aborted.
+class HandlerOverdue < ::StandardError
+end
+
 IOException = MriIOException
 
 # Transient transport errors that mean the NATS connection is unavailable
|
data/lib/protobuf/nats/response_muxer.rb
CHANGED
|
@@ -327,6 +327,14 @@ module Protobuf
 
 # After sleeping, reset the state and try to start again.
 LOCK.synchronize do
+  # Remove ourselves from the handler pool BEFORE start re-tops it up.
+  # This thread is still alive (running this rescue) but is about to
+  # exit, so start's `select!(&:alive?)` would otherwise count it as a
+  # live dispatcher and spawn no replacement -- leaving the pool one
+  # short (zero dispatchers on CRuby, where dispatcher_count == 1, and
+  # the muxer would stop delivering responses entirely).
+  @resp_handlers.delete(::Thread.current)
+
   if @resp_sub
     begin
       @resp_sub.unsubscribe
|
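The reasoning in the comment above (a crashing thread is still `alive?` while it runs its own rescue) can be reproduced in isolation. The following is a minimal sketch, not the muxer itself: `DispatcherPool` and its methods are hypothetical, and jobs are plain lambdas pulled from a queue.

```ruby
# Hypothetical miniature of a self-healing dispatcher pool.
class DispatcherPool
  attr_reader :handlers

  def initialize(size)
    @size = size
    @lock = Mutex.new
    @handlers = []
    @work = Queue.new
  end

  def start
    @lock.synchronize { top_up }
  end

  def live_count
    @lock.synchronize { @handlers.count(&:alive?) }
  end

  def push(job)
    @work << job
  end

  private

  # Prunes dead threads and spawns enough new ones to reach @size.
  # Must be called with @lock held.
  def top_up
    @handlers.select!(&:alive?)
    (@size - @handlers.size).times { @handlers << spawn_dispatcher }
  end

  def spawn_dispatcher
    Thread.new do
      begin
        loop { @work.pop.call }
      rescue StandardError
        @lock.synchronize do
          # The crashing thread is still alive while it runs this rescue.
          # Remove it first, otherwise top_up counts it and spawns nothing.
          @handlers.delete(Thread.current)
          top_up
        end
      end
    end
  end
end
```

With a pool of size 1, deleting the `@handlers.delete(Thread.current)` line makes `top_up` see one live thread and spawn no replacement, which is the zero-dispatcher state the changelog describes.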
data/lib/protobuf/nats/server.rb
CHANGED
|
@@ -64,6 +64,20 @@ module Protobuf
   @handler_overdue_ms ||= ::ENV.fetch("PB_NATS_SERVER_HANDLER_OVERDUE_MS", 65_000).to_i
 end
 
+# Whether to actively reclaim (abort) an overdue handler's pool slot. OFF by
+# default: the documented contract is that handlers are never aborted, since
+# killing a thread mid-handler can corrupt state. Enable only when you would
+# rather shed orphaned work (whose client already gave up) than let it pin a
+# pool slot -- e.g. when overdue handlers are saturating the pool and the
+# server is NACKing healthy traffic. Reclaim raises Errors::HandlerOverdue
+# into the worker, which the handler rescue turns into an RPC error response.
+def reclaim_overdue_handlers?
+  # Memoize the raw string (never falsey, so ||= is safe) and derive the
+  # boolean per call -- avoids the nil-guard dance for a false-able memo.
+  @reclaim_overdue_handlers ||= ::ENV.fetch("PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS", "false")
+  @reclaim_overdue_handlers == "true"
+end
+
 # How long to let in-flight handlers finish on shutdown. Tracks the overdue
 # window (plus grace) so a legitimate long handler isn't killed mid-flight.
 def shutdown_drain_timeout
@@ -91,13 +105,24 @@ module Protobuf
 oldest_age_ms = 0.0
 overdue = 0
 
-@inflight.each_pair do |id,
+@inflight.each_pair do |id, entry|
+  started_at, handler_thread = entry
   count += 1
   age_ms = (now - started_at) * MILLISECOND
   oldest_age_ms = age_ms if age_ms > oldest_age_ms
   next unless overdue_ms.positive? && age_ms >= overdue_ms
 
   overdue += 1
+
+  # Optionally reclaim the slot by aborting the orphaned handler (opt-in;
+  # see #reclaim_overdue_handlers?). Done before the dedupe below so the
+  # reclaim is attempted even after the overdue event was already emitted.
+  if reclaim_overdue_handlers? && handler_thread&.alive?
+    logger.warn "Reclaiming overdue handler (age=#{age_ms.round}ms, client already gave up) to free its pool slot"
+    handler_thread.raise(::Protobuf::Nats::Errors::HandlerOverdue, "handler exceeded #{overdue_ms}ms; reclaimed")
+    ::Protobuf::Nats.instrument("server.handler_reclaimed", age_ms)
+  end
+
   # Emit the per-handler overdue event once (the client has already
   # given up; this handler's result is orphaned).
   next if @overdue_flagged[id]
|
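The reclaim path above relies on `Thread#raise` delivering an exception into a worker that is blocked inside a handler. A minimal sketch of that mechanism, outside the server, is shown below. The `HandlerOverdue` class here is a local stand-in for `Errors::HandlerOverdue`, the plain `Hash` stands in for the server's in-flight map, and the `sleep 30` stands in for a stuck handler; all three are assumptions for illustration.

```ruby
# Local stand-in for Errors::HandlerOverdue.
class HandlerOverdue < StandardError; end

inflight = {}          # request_id => [started_at, worker_thread]
started  = Queue.new   # signals that the handler has registered itself
outcome  = Queue.new   # records how the handler ended

worker = Thread.new do
  begin
    inflight[1] = [Process.clock_gettime(Process::CLOCK_MONOTONIC), Thread.current]
    started << true
    sleep 30                      # stands in for a stuck, orphaned handler
    outcome << :finished
  rescue HandlerOverdue => error
    outcome << error.message      # a real server would publish an error response here
  ensure
    inflight.delete(1)            # the slot is released whichever way the handler ends
  end
end

started.pop                       # wait until the handler is tracked as in-flight
_started_at, handler_thread = inflight[1]
if handler_thread&.alive?
  handler_thread.raise(HandlerOverdue, "handler exceeded 65000ms; reclaimed")
end
worker.join
```

Because `Thread#raise` can interrupt the target at an arbitrary point, including partway through a multi-step update, this sketch also shows why the feature is opt-in: the handler gets no chance to finish whatever it was doing.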
@@ -138,35 +163,60 @@ module Protobuf
 enqueued_at = monotonic
 request_id = @request_seq.increment
 was_enqueued = thread_pool.push do
+  # nil response_data is the "handler failed, don't publish a success
+  # response" sentinel (a successful encode is always a non-nil String,
+  # even when empty).
+  response_data = nil
   begin
     # Instrument the thread pool time-to-execute duration.
     processed_at = monotonic
     ::Protobuf::Nats.instrument("server.thread_pool_execution_delay", (processed_at - enqueued_at) * MILLISECOND)
 
     # Track this handler as in-flight (long handlers are allowed; this is
-    # only for observability -- we never abort it
-
-
-
-
-
-    #
-
-
-  rescue => error
-    logger.debug { "rescued error => #{error}" } if logger.debug?
-    ::Protobuf::Nats.notify_error_callbacks(error)
-
-    # The client has already received our ACK and is now blocked waiting
-    # for the response message. If we don't send one it will hang until
-    # response_timeout (60s by default). Publish an encoded RPC error so
-    # the client fails fast instead. (If the failure was the connection
-    # itself, this publish will also fail and is swallowed below.)
+    # only for observability -- we never abort it unless overdue-reclaim
+    # is explicitly enabled). Store the worker thread so reclaim can
+    # target it; the start time drives age/overdue accounting.
+    @inflight[request_id] = [processed_at, ::Thread.current]
+
+    # Process request. Only the handler is wrapped here so a transport
+    # failure on the success-response publish (below) cannot fall into
+    # this rescue and emit a *second* (error) publish for a request whose
+    # handler actually succeeded.
     begin
-
-
-
-      logger.
+      response_data = handle_request(request_data, 'server' => @server)
+    rescue => error
+      response_data = nil # ensure the success-publish below is skipped
+      logger.debug { "rescued error => #{error}" } if logger.debug?
+      # Logs the real error server-side (via the default log_error
+      # callback) so it isn't lost; the client gets only a generic message.
+      ::Protobuf::Nats.notify_error_callbacks(error)
+
+      # The client has already received our ACK and is now blocked waiting
+      # for the response message. If we don't send one it will hang until
+      # response_timeout (60s by default). Publish an encoded RPC error so
+      # the client fails fast instead. Use a generic message rather than
+      # error.message so internal handler details aren't leaked over the
+      # wire. (If the failure was the connection itself, this publish will
+      # also fail and is swallowed below.)
+      begin
+        error_response = ::Protobuf::Rpc::PbError.new("Internal server error")
+        nats.publish(reply_id, error_response.encode)
+      rescue => publish_error
+        logger.error "Failed to publish error response for #{reply_id}: #{publish_error.message}"
+      end
+    end
+
+    # Publish the successful response. Kept outside the handler rescue so a
+    # publish failure here is logged rather than triggering a duplicate
+    # (error) response for a request that already succeeded.
+    if response_data
+      logger.debug { "Publishing response to #{reply_id}" } if logger.debug?
+      begin
+        nats.publish(reply_id, response_data)
+      rescue => publish_error
+        logger.error "Failed to publish response for #{reply_id}: #{publish_error.message}"
+        ::Protobuf::Nats.notify_error_callbacks(publish_error)
+      end
     end
   ensure
     @inflight.delete(request_id)
|
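The separate rescue scopes in the hunk above can be exercised without NATS. In the sketch below, `RecordingPublisher` and `process` are hypothetical stand-ins (the real code calls `nats.publish` and `handle_request`); the point is only to show that each request produces at most one publish attempt, whichever step fails.

```ruby
# Hypothetical publisher that records every attempt and can be told to fail.
class RecordingPublisher
  attr_reader :attempts

  def initialize(fail_publish: false)
    @fail_publish = fail_publish
    @attempts = []
  end

  def publish(subject, payload)
    @attempts << [subject, payload]
    raise IOError, "connection lost" if @fail_publish
  end
end

# Runs the handler block, then makes exactly one publish attempt per request.
def process(publisher, reply_id)
  response = nil
  begin
    response = yield                         # only the handler is guarded here
  rescue StandardError
    response = nil                           # skip the success publish below
    begin
      publisher.publish(reply_id, "Internal server error")
    rescue StandardError => publish_error
      warn "Failed to publish error response: #{publish_error.message}"
    end
  end

  return unless response

  begin
    publisher.publish(reply_id, response)    # a failure here is logged, not re-answered
  rescue StandardError => publish_error
    warn "Failed to publish response: #{publish_error.message}"
  end
end
```

If the success publish were moved back inside the first `begin`, a transport failure would land in the handler rescue and trigger a second publish attempt, which is the double-publish the changelog entry describes.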
data/lib/protobuf/nats.rb
CHANGED
|
@@ -90,10 +90,32 @@ module Protobuf
   :fallback_policy => :discard
 )
 
+# Count of error callbacks discarded because the bounded executor was
+# saturated. Lets a flood of dropped callbacks during an incident be observed
+# instead of vanishing silently.
+ERROR_CALLBACK_DROP_COUNT = ::Concurrent::AtomicFixnum.new(0)
+
+def self.error_callback_drop_count
+  ERROR_CALLBACK_DROP_COUNT.value
+end
+
 def self.notify_error_callbacks_async(error)
-
+  # #post returns false when the job is rejected. With the :discard fallback
+  # policy the job is silently dropped (returning false) rather than raising,
+  # so the false return is the only drop signal to handle.
+  accepted = ERROR_CALLBACK_EXECUTOR.post { notify_error_callbacks(error) }
+  record_dropped_error_callback unless accepted
   nil
-
+end
+
+# Record a discarded error callback. Kept cheap -- this runs on nats-pure's
+# read/flush thread, so it must NOT format/log the error synchronously (the
+# whole point of the async path). The atomic counter is the durable signal;
+# the instrument gauge emits a discrete event for dashboards (drops only
+# happen under a severe flood, so a notification per drop is acceptable).
+def self.record_dropped_error_callback
+  ERROR_CALLBACK_DROP_COUNT.increment
+  instrument("error_callback_dropped", 1)
   nil
 end
 
@@ -131,6 +153,12 @@ module Protobuf
 
 client.on_close do
   logger.warn("Client NATS connection was closed")
+  # A close is terminal for this client object (nats-pure only reconnects
+  # via on_disconnect/on_reconnect; on_close means it gave up). Drop the
+  # memoized reference so the next start_client_nats_connection rebuilds a
+  # fresh connection instead of reusing a permanently-dead one. In-flight
+  # callers keep their own local reference; only new calls rebuild.
+  @client_nats_connection = nil
 end
 
 client.on_error do |error|
|
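The drop accounting in `notify_error_callbacks_async` above follows a general pattern: a bounded, non-blocking submit that reports rejection, plus a cheap counter on the rejecting side. The sketch below shows that pattern using only the standard library. It deliberately swaps in a `SizedQueue` for the concurrent-ruby executor and a mutex-guarded integer for `Concurrent::AtomicFixnum`; `BoundedCallbackRunner` is a hypothetical name.

```ruby
# Hypothetical bounded callback runner. A SizedQueue stands in for the bounded
# executor, and a mutex-guarded integer for the atomic drop counter.
class BoundedCallbackRunner
  def initialize(capacity)
    @queue = SizedQueue.new(capacity)
    @lock = Mutex.new
    @drop_count = 0
  end

  def drop_count
    @lock.synchronize { @drop_count }
  end

  # Never blocks the caller: returns true if queued, false if dropped.
  def post(&job)
    @queue.push(job, true)        # non_block = true raises ThreadError when full
    true
  rescue ThreadError
    # Keep the drop path cheap: count it, do not format or log anything here.
    @lock.synchronize { @drop_count += 1 }
    false
  end

  # Runs everything currently queued. A real executor would use worker
  # threads; draining inline keeps the sketch deterministic.
  def drain
    @queue.pop(true).call until @queue.empty?
  end
end
```

For example, with `BoundedCallbackRunner.new(2)`, posting three jobs before any drain queues the first two and counts the third as dropped, so `drop_count` reads 1 while the caller was never blocked.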