protobuf-nats 0.13.1.pre1 → 0.13.1.pre2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 547f632aa7ad154f6a546c67df3628166ac2db3acbc2dd532a1e98aa6525788e
-   data.tar.gz: 699fe41c76d6aabb6f5db20f20f8309f5723e99d74f164d906dabd92fc441556
+   metadata.gz: fd150563a177c8d3eddeac0375225c8e05489082e3189da9746c9ad86e4bc188
+   data.tar.gz: a5a4d99c1897458ac511ac3f92bcd1d09338c340b11b97e214900c076a63f69d
  SHA512:
-   metadata.gz: ac2417d75bbc60ad475c01bec82bd334a83ef7b3a7a6051b096273e28406c68c9344756078bb6e1a729f8827c478e91b0341c1d69bbe3ff35fdb2d3f6bab47b8
-   data.tar.gz: 4bbc4f568f992068277a3ee2d9b50d6b715cf80e12e6208e020bed6137cd54c7dcd37a7279933f7ca2f0ebcef52aae9adba939633de9edbf5f76366ed1fb0188
+   metadata.gz: 552ec79ee1fe6b74b85e0ba5b7ee67f1a50beccc20155b515f28249a25632bed90a87bdfc5c9b9badaac6f538f1e7e1ebba7cda6c45f773529af6b4fec2e52aa
+   data.tar.gz: 4f93e5107575a9703e01a0808705ad220dcdee1dadef657da326315f35f5de1d5ac59cf035bb250178697a31ca9a5c4094cc1fe124021437a87fe532396a231a
data/CHANGELOG.md CHANGED
@@ -1,5 +1,19 @@
  ## Changelog

+ ### 0.13.1.pre2
+ Additional edge-case fixes found while reviewing the 0.13.1 changes:
+
+ - **ResponseMuxer self-heal could drop to zero dispatchers.** When the sole dispatcher crashed fatally, its self-healing `start` counted the still-alive (but exiting) crashing thread, so it spawned no replacement — leaving zero dispatchers on CRuby (`dispatcher_count == 1`) and the muxer silently delivering no responses. The crashing thread now removes itself from the handler pool before re-topping it up.
+ - **Client connection is rebuilt after a terminal close.** `@client_nats_connection` was memoized once and never reset, so once nats-pure gave up and fired `on_close` every later request reused a dead client forever. `on_close` now drops the cached connection so the next request rebuilds.
+ - **Dropped error callbacks are now observable.** The bounded `notify_error_callbacks_async` executor silently discarded callbacks when saturated (exactly during an error flood). Drops now bump `Protobuf::Nats.error_callback_drop_count` and emit `error_callback_dropped`, without formatting/logging on the read thread.
+ - **Server no longer double-publishes on a response-publish failure.** A transport error while publishing a *successful* response fell into the handler rescue and emitted a second (error) response for the same request. The handler and the success-response publish are now in separate rescue scopes. The handler-failure error response also now sends a generic message ("Internal server error") instead of the raw `error.message`, so internal handler details aren't leaked to clients (the real error is still logged server-side).
+ - **Opt-in reclaim of overdue handlers.** Handlers are still never aborted by default. `PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS=true` lets operators reclaim a pool slot held by an orphaned handler (one that outlived the client's `response_timeout`) by raising `Errors::HandlerOverdue` into it; emits `server.handler_reclaimed`.
+ - **TLS now verifies the NATS server certificate.** Previously, supplying a prepared `:tls` context made nats-pure skip its own `set_params`, leaving the OpenSSL default `VERIFY_NONE` in force — any certificate was accepted (MITM exposure) — and `tls_ca_cert` was configured but read nowhere. `Config#new_tls_context` now sets `verify_mode = VERIFY_PEER` and trusts the configured `tls_ca_cert` (falling back to the system trust store when none is set). **Breaking for misconfigured deployments:** a server whose certificate does not chain to the trusted CA, which previously connected insecurely, will now be rejected. (Hostname/SAN verification is still not enabled — see known gaps.)
+
+ #### Known gaps noted (not changed here)
+ - **TLS hostname (SAN/CN) is not verified.** Chain verification is now on, but nats-pure only sets the SSLSocket hostname for a context it builds itself, and a single static hostname would be wrong for a multi-server cluster that reconnects across hosts. Plumbing per-connection hostname verification is tracked separately.
+ - **`TLS1_3_VERSION` is assumed defined.** Fine on the JRuby targets; an old MRI/OpenSSL build without the constant would raise `NameError`.
+
  ### 0.13.1
  Fixes a production regression and a set of related issues, all of the same class: assumptions left over from the JNats → nats-pure migration in 0.13.0 that became silently wrong.
data/README.md CHANGED
@@ -57,6 +57,12 @@ so the work is orphaned. Defaults above the client's 60s response timeout so leg
  a small grace). If clients use a longer response timeout, raise this so handlers aren't flagged overdue while a client is
  still waiting; if shorter, lower it so orphaned work is surfaced promptly.

+ `PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS` - When `"true"`, actively reclaim the thread-pool slot held by an overdue
+ handler (one past `PB_NATS_SERVER_HANDLER_OVERDUE_MS`, whose client has already given up) by raising
+ `Errors::HandlerOverdue` into the worker; emits `server.handler_reclaimed`. **Off by default** — the documented contract
+ is that handlers are never aborted, since killing a thread mid-handler can corrupt state. Enable only when orphaned work
+ is saturating the pool and the server is NACKing healthy traffic (default: false).
+
  `PB_NATS_CLIENT_ACK_TIMEOUT` - Seconds to wait for an ACK from the rpc server (default: 5 seconds).

  `PB_NATS_CLIENT_NACK_BACKOFF_INTERVALS` - Array of milliseconds to wait between NACK retries (default: "0,1,3,5,10").
@@ -115,6 +121,8 @@ An example config looks like this:

  When `uses_tls` is set, the client negotiates TLS with a floor of 1.2 and a ceiling of 1.3: it uses TLS 1.3 where the NATS server supports it and falls back to 1.2 otherwise (verified on JRuby 9.4 and 10.0).

+ The client **verifies the NATS server's certificate chain** (`verify_mode = VERIFY_PEER`). When `tls_ca_cert` is set, only certificates chaining to that CA are trusted (the private-CA case); otherwise the system trust store is used. **Note:** a server whose certificate does not chain to the configured CA will be rejected — if you are upgrading from a release that did not verify (`< 0.13.1`), make sure `tls_ca_cert` points at the CA that signed your NATS server certificate. Hostname (SAN/CN) verification is not yet enabled (chain verification still ensures the certificate is signed by your trusted CA).
+
  ## Usage

  This library is designed to be an alternative transport implementation used by the `protobuf` gem. In order to make
@@ -195,8 +203,13 @@ If we were to add another service endpoint called `search` to the `UserService`
  self-heal with exponential backoff.
  - **Server observability** — beyond the thread-pool gauges, the server emits in-flight handler metrics
  (`server.inflight_count`, `server.inflight_oldest_age_ms`, `server.overdue_handler_count`, `server.handler_overdue`,
- `server.pending_intake_queue_size`, `server.thread_pool_saturated`). Long-running handlers are allowed and never aborted;
- a handler is only "overdue" once it outlives the client's `response_timeout` (see `PB_NATS_SERVER_HANDLER_OVERDUE_MS`).
+ `server.pending_intake_queue_size`, `server.thread_pool_saturated`). Long-running handlers are allowed and never aborted
+ by default; a handler is only "overdue" once it outlives the client's `response_timeout` (see
+ `PB_NATS_SERVER_HANDLER_OVERDUE_MS`). Overdue handlers can optionally be reclaimed (emitting `server.handler_reclaimed`)
+ via `PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS`.
+ - **Error-callback observability** — `on_error` callbacks run on a bounded background executor so a slow callback can't
+ stall message processing. If that executor saturates under an error flood, dropped callbacks are counted
+ (`Protobuf::Nats.error_callback_drop_count`) and emit `error_callback_dropped` rather than being lost silently.

  ## Resilience
@@ -209,7 +222,11 @@ The client is built to ride out transient NATS hiccups rather than surface them
  - **Missing ACKs and NACKs are retried** with their own timeouts/backoff (`PB_NATS_CLIENT_ACK_TIMEOUT`,
  `PB_NATS_CLIENT_NACK_BACKOFF_INTERVALS`).
  - **Server-side failures fail the caller fast.** If the server cannot process a request after it has ACKed, it publishes
- an encoded RPC error response so the client raises immediately instead of blocking until `PB_NATS_CLIENT_RESPONSE_TIMEOUT`.
+ an encoded RPC error response (a generic message; the real error is logged server-side) so the client raises immediately
+ instead of blocking until `PB_NATS_CLIENT_RESPONSE_TIMEOUT`. A transport failure while publishing a successful response
+ no longer triggers a duplicate error response.
+ - **The client connection self-heals after a terminal close.** A permanently closed NATS connection is dropped and
+ rebuilt on the next request instead of being reused.
  - **The response dispatcher self-heals.** A crashed muxer dispatcher restarts with exponential backoff, and a brief
  subscription-restart window won't busy-spin the dispatch loop.
@@ -66,9 +66,11 @@ module Protobuf
  end

  # Only the keys nats-pure's `connect` actually consumes. App-level settings
- # (uses_tls, tls_*, server_subscription_key_*, subscription_key_replacements)
- # are read directly via their accessors elsewhere and must NOT be forwarded
- # to nats-pure (it ignores unknown keys today, but that is brittle).
+ # (uses_tls, tls_client_cert, tls_client_key, tls_ca_cert,
+ # server_subscription_key_*, subscription_key_replacements) are read
+ # directly via their accessors elsewhere and must NOT be forwarded to
+ # nats-pure (it ignores unknown keys today, but that is brittle). The TLS
+ # cert/key/CA are folded into the :tls context by #new_tls_context.
  def connection_options(reload = false)
    @connection_options = false if reload
    @connection_options ||= begin
@@ -88,10 +90,39 @@ module Protobuf
    # ssl_version=:TLSv1_2 hard pin). The client offers 1.2 and 1.3 and
    # negotiates the highest the server also supports, so a TLS-1.2-only
    # transport still connects (verified on JRuby 9.4 and 10.0).
+   #
+   # NOTE (#7): this assumes the OpenSSL build defines TLS1_3_VERSION. That
+   # holds on the JRuby targets above, but an older MRI/OpenSSL build without
+   # the constant would raise NameError here. Not guarded yet -- revisit if
+   # CRuby-on-old-OpenSSL becomes a supported target.
    tls_context.min_version = ::OpenSSL::SSL::TLS1_2_VERSION
    tls_context.max_version = ::OpenSSL::SSL::TLS1_3_VERSION
    tls_context.cert = ::OpenSSL::X509::Certificate.new(::File.read(tls_client_cert)) if tls_client_cert
    tls_context.key = ::OpenSSL::PKey::RSA.new(::File.read(tls_client_key)) if tls_client_key
+
+   # Verify the NATS server's certificate chain. This context is handed to
+   # nats-pure as :tls => {:context => ...}; nats-pure uses a supplied
+   # context verbatim and does NOT call #set_params, so verification has to
+   # be configured here. Without this the OpenSSL default (VERIFY_NONE)
+   # stood and any certificate -- including an attacker's -- was accepted.
+   tls_context.verify_mode = ::OpenSSL::SSL::VERIFY_PEER
+   cert_store = ::OpenSSL::X509::Store.new
+   if tls_ca_cert
+     # Trust the configured CA bundle (the private-CA deployment case).
+     cert_store.add_file(tls_ca_cert)
+   else
+     # No CA configured: fall back to the system trust store.
+     cert_store.set_default_paths
+   end
+   tls_context.cert_store = cert_store
+
+   # NOTE: hostname (SAN/CN) verification is NOT enabled here. nats-pure only
+   # sets the SSLSocket hostname from @tls[:hostname], which it populates
+   # itself only when it builds the context; for a supplied context it stays
+   # nil, and a single static hostname would be wrong for a multi-server
+   # cluster that reconnects across hosts. Chain verification above still
+   # ensures the cert is signed by the trusted CA. Plumbing per-connection
+   # hostname verification is tracked separately.
    tls_context
  end
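The verification setup in this hunk can be tried in isolation. What follows is a minimal standalone sketch under stated assumptions, not the gem's code: `build_verified_context` and its `ca_file:` keyword are hypothetical names, and the `const_defined?` guard around `TLS1_3_VERSION` is only one illustrative way to sidestep the `NameError` gap noted in the comments (this release does not add such a guard).

```ruby
require "openssl"

# Hypothetical helper mirroring the shape of the change: a context that
# verifies the peer's certificate chain against either a configured CA
# file or the system trust store.
def build_verified_context(ca_file: nil)
  context = OpenSSL::SSL::SSLContext.new
  context.min_version = OpenSSL::SSL::TLS1_2_VERSION
  # Illustrative guard (an assumption, not in the gem): only set the
  # ceiling when this OpenSSL build defines the constant.
  if OpenSSL::SSL.const_defined?(:TLS1_3_VERSION)
    context.max_version = OpenSSL::SSL::TLS1_3_VERSION
  end

  # A freshly built context does not verify the peer unless asked to.
  context.verify_mode = OpenSSL::SSL::VERIFY_PEER

  store = OpenSSL::X509::Store.new
  if ca_file
    store.add_file(ca_file)   # private-CA case: trust only this bundle
  else
    store.set_default_paths   # otherwise trust the system store
  end
  context.cert_store = store
  context
end

context = build_verified_context
puts context.verify_mode == OpenSSL::SSL::VERIFY_PEER
```

Passing `ca_file: "/path/to/ca.pem"` (a placeholder path) would restrict trust to that CA. Hostname verification is deliberately left out here too, matching the known gap described in the hunk.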
@@ -16,6 +16,13 @@ module Protobuf
  class MriIOException < ::StandardError
  end

+ # Raised into a worker thread to reclaim a handler that has outlived the
+ # client's response_timeout. Only used when overdue-reclaim is explicitly
+ # enabled via PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS (default off); the
+ # documented default is that handlers are never aborted.
+ class HandlerOverdue < ::StandardError
+ end
+
  IOException = MriIOException

  # Transient transport errors that mean the NATS connection is unavailable
@@ -327,6 +327,14 @@ module Protobuf

  # After sleeping, reset the state and try to start again.
  LOCK.synchronize do
+   # Remove ourselves from the handler pool BEFORE start re-tops it up.
+   # This thread is still alive (running this rescue) but is about to
+   # exit, so start's `select!(&:alive?)` would otherwise count it as a
+   # live dispatcher and spawn no replacement -- leaving the pool one
+   # short (zero dispatchers on CRuby, where dispatcher_count == 1, and
+   # the muxer would stop delivering responses entirely).
+   @resp_handlers.delete(::Thread.current)
+
    if @resp_sub
      begin
        @resp_sub.unsubscribe
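The off-by-one described in this hunk is easy to reproduce outside the muxer. The sketch below is a simplified model, not the muxer's code: `replacements_needed` and `simulate_crash` are hypothetical helpers, and the pool is a plain array with a target of one dispatcher (the CRuby case mentioned in the comment).

```ruby
# Hypothetical helper: prune dead threads, then report how many new
# dispatchers would be spawned to reach `target`.
def replacements_needed(pool, target)
  pool.select!(&:alive?)
  target - pool.size
end

# Model a dispatcher that runs its own self-heal logic just before exiting.
def simulate_crash(remove_self:)
  pool = []
  gate = Queue.new
  needed = nil

  crashing = Thread.new do
    gate.pop                                    # wait until registered in the pool
    pool.delete(Thread.current) if remove_self  # the fix: drop ourselves first
    # This thread is still running, so alive? is true for it here.
    needed = replacements_needed(pool, 1)
  end

  pool << crashing
  gate << :go
  crashing.join
  needed
end

puts simulate_crash(remove_self: false) # the exiting thread is counted as live
puts simulate_crash(remove_self: true)  # a replacement is requested
```

Without the self-removal the exiting thread satisfies `alive?`, so no replacement is requested and the pool ends up empty once the thread finishes.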
@@ -64,6 +64,20 @@ module Protobuf
    @handler_overdue_ms ||= ::ENV.fetch("PB_NATS_SERVER_HANDLER_OVERDUE_MS", 65_000).to_i
  end

+ # Whether to actively reclaim (abort) an overdue handler's pool slot. OFF by
+ # default: the documented contract is that handlers are never aborted, since
+ # killing a thread mid-handler can corrupt state. Enable only when you would
+ # rather shed orphaned work (whose client already gave up) than let it pin a
+ # pool slot -- e.g. when overdue handlers are saturating the pool and the
+ # server is NACKing healthy traffic. Reclaim raises Errors::HandlerOverdue
+ # into the worker, which the handler rescue turns into an RPC error response.
+ def reclaim_overdue_handlers?
+   # Memoize the raw string (never falsey, so ||= is safe) and derive the
+   # boolean per call -- avoids the nil-guard dance for a false-able memo.
+   @reclaim_overdue_handlers ||= ::ENV.fetch("PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS", "false")
+   @reclaim_overdue_handlers == "true"
+ end
+
  # How long to let in-flight handlers finish on shutdown. Tracks the overdue
  # window (plus grace) so a legitimate long handler isn't killed mid-flight.
  def shutdown_drain_timeout
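The memoization comment in `reclaim_overdue_handlers?` refers to a general Ruby pitfall: `||=` re-evaluates its right-hand side whenever the stored value is `false`. A small sketch of the difference follows; `FlagReader`, the `"RECLAIM"` key, and the lookup counters are hypothetical and exist only to make the behaviour visible.

```ruby
class FlagReader
  attr_reader :boolean_lookups, :string_lookups

  def initialize(env)
    @env = env
    @boolean_lookups = 0
    @string_lookups = 0
  end

  # Memoizes the boolean itself. When the flag is off the memo is false,
  # which is falsey, so ||= reads the environment again on every call.
  def naive_enabled?
    @naive ||= begin
      @boolean_lookups += 1
      @env.fetch("RECLAIM", "false") == "true"
    end
  end

  # Memoizes the raw string (always truthy) and derives the boolean per call.
  def enabled?
    @raw ||= begin
      @string_lookups += 1
      @env.fetch("RECLAIM", "false")
    end
    @raw == "true"
  end
end

reader = FlagReader.new({})
3.times { reader.naive_enabled?; reader.enabled? }
puts "boolean memo lookups: #{reader.boolean_lookups}"
puts "string memo lookups: #{reader.string_lookups}"
```

The string-memo form keeps the single-lookup property for both `"true"` and `"false"` without needing a `defined?` or `nil` check.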
@@ -91,13 +105,24 @@ module Protobuf
  oldest_age_ms = 0.0
  overdue = 0

- @inflight.each_pair do |id, started_at|
+ @inflight.each_pair do |id, entry|
+   started_at, handler_thread = entry
    count += 1
    age_ms = (now - started_at) * MILLISECOND
    oldest_age_ms = age_ms if age_ms > oldest_age_ms
    next unless overdue_ms.positive? && age_ms >= overdue_ms

    overdue += 1
+
+   # Optionally reclaim the slot by aborting the orphaned handler (opt-in;
+   # see #reclaim_overdue_handlers?). Done before the dedupe below so the
+   # reclaim is attempted even after the overdue event was already emitted.
+   if reclaim_overdue_handlers? && handler_thread&.alive?
+     logger.warn "Reclaiming overdue handler (age=#{age_ms.round}ms, client already gave up) to free its pool slot"
+     handler_thread.raise(::Protobuf::Nats::Errors::HandlerOverdue, "handler exceeded #{overdue_ms}ms; reclaimed")
+     ::Protobuf::Nats.instrument("server.handler_reclaimed", age_ms)
+   end
+
    # Emit the per-handler overdue event once (the client has already
    # given up; this handler's result is orphaned).
    next if @overdue_flagged[id]
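The reclaim mechanism in this hunk relies on `Thread#raise` delivering an exception into another thread. The following is a self-contained sketch of that pattern, assuming an in-flight table shaped like the one in the diff (`id => [started_at, thread]`); `reclaim_overdue`, `start_slow_handler`, and the top-level `HandlerOverdue` class are hypothetical stand-ins for the server's internals.

```ruby
class HandlerOverdue < StandardError; end

# Raise HandlerOverdue into every tracked thread older than overdue_ms.
# Returns the ids that were reclaimed.
def reclaim_overdue(inflight, overdue_ms, now)
  reclaimed = []
  inflight.each_pair do |id, (started_at, thread)|
    age_ms = (now - started_at) * 1000.0
    next unless age_ms >= overdue_ms && thread&.alive?

    thread.raise(HandlerOverdue, "handler exceeded #{overdue_ms}ms; reclaimed")
    reclaimed << id
  end
  reclaimed
end

# Stand-in for a handler that is stuck in slow work. The queue makes sure
# the thread is inside its begin/rescue before the caller continues.
def start_slow_handler
  ready = Queue.new
  thread = Thread.new do
    begin
      ready << :running
      sleep 5
      :completed
    rescue HandlerOverdue
      :reclaimed
    end
  end
  ready.pop
  thread
end

now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
handler = start_slow_handler
inflight = { 1 => [now - 70, handler] } # pretend it started 70s ago
reclaim_overdue(inflight, 65_000, now)
puts handler.value
```

The exception can land at any point in the target thread, which is why the diff keeps this behind an opt-in flag: code interrupted mid-update may leave shared state inconsistent.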
@@ -138,35 +163,60 @@ module Protobuf
  enqueued_at = monotonic
  request_id = @request_seq.increment
  was_enqueued = thread_pool.push do
+   # nil response_data is the "handler failed, don't publish a success
+   # response" sentinel (a successful encode is always a non-nil String,
+   # even when empty).
+   response_data = nil
    begin
      # Instrument the thread pool time-to-execute duration.
      processed_at = monotonic
      ::Protobuf::Nats.instrument("server.thread_pool_execution_delay", (processed_at - enqueued_at) * MILLISECOND)

      # Track this handler as in-flight (long handlers are allowed; this is
-     # only for observability -- we never abort it).
-     @inflight[request_id] = processed_at
-
-     # Process request.
-     response_data = handle_request(request_data, 'server' => @server)
-
-     # Publish response.
-     logger.debug { "Publishing response to #{reply_id}" } if logger.debug?
-     nats.publish(reply_id, response_data)
-   rescue => error
-     logger.debug { "rescued error => #{error}" } if logger.debug?
-     ::Protobuf::Nats.notify_error_callbacks(error)
-
-     # The client has already received our ACK and is now blocked waiting
-     # for the response message. If we don't send one it will hang until
-     # response_timeout (60s by default). Publish an encoded RPC error so
-     # the client fails fast instead. (If the failure was the connection
-     # itself, this publish will also fail and is swallowed below.)
+     # only for observability -- we never abort it unless overdue-reclaim
+     # is explicitly enabled). Store the worker thread so reclaim can
+     # target it; the start time drives age/overdue accounting.
+     @inflight[request_id] = [processed_at, ::Thread.current]
+
+     # Process request. Only the handler is wrapped here so a transport
+     # failure on the success-response publish (below) cannot fall into
+     # this rescue and emit a *second* (error) publish for a request whose
+     # handler actually succeeded.
      begin
-       error_response = ::Protobuf::Rpc::PbError.new(error.message)
-       nats.publish(reply_id, error_response.encode)
-     rescue => publish_error
-       logger.error "Failed to publish error response for #{reply_id}: #{publish_error.message}"
+       response_data = handle_request(request_data, 'server' => @server)
+     rescue => error
+       response_data = nil # ensure the success-publish below is skipped
+       logger.debug { "rescued error => #{error}" } if logger.debug?
+       # Logs the real error server-side (via the default log_error
+       # callback) so it isn't lost; the client gets only a generic message.
+       ::Protobuf::Nats.notify_error_callbacks(error)
+
+       # The client has already received our ACK and is now blocked waiting
+       # for the response message. If we don't send one it will hang until
+       # response_timeout (60s by default). Publish an encoded RPC error so
+       # the client fails fast instead. Use a generic message rather than
+       # error.message so internal handler details aren't leaked over the
+       # wire. (If the failure was the connection itself, this publish will
+       # also fail and is swallowed below.)
+       begin
+         error_response = ::Protobuf::Rpc::PbError.new("Internal server error")
+         nats.publish(reply_id, error_response.encode)
+       rescue => publish_error
+         logger.error "Failed to publish error response for #{reply_id}: #{publish_error.message}"
+       end
+     end
+
+     # Publish the successful response. Kept outside the handler rescue so a
+     # publish failure here is logged rather than triggering a duplicate
+     # (error) response for a request that already succeeded.
+     if response_data
+       logger.debug { "Publishing response to #{reply_id}" } if logger.debug?
+       begin
+         nats.publish(reply_id, response_data)
+       rescue => publish_error
+         logger.error "Failed to publish response for #{reply_id}: #{publish_error.message}"
+         ::Protobuf::Nats.notify_error_callbacks(publish_error)
+       end
      end
    ensure
      @inflight.delete(request_id)
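The separate rescue scopes introduced in this hunk can be illustrated with a stubbed transport. This is a reduced sketch, assuming a publisher object with a `publish(subject, payload)` method; `FlakyPublisher` and `process` are hypothetical names, and plain strings stand in for the encoded protobuf payloads.

```ruby
# Hypothetical transport stub that fails its first `fail_times` publishes
# and records every attempt.
class FlakyPublisher
  attr_reader :attempts

  def initialize(fail_times)
    @fail_times = fail_times
    @attempts = []
  end

  def publish(_subject, payload)
    @attempts << payload
    return nil unless @fail_times.positive?

    @fail_times -= 1
    raise IOError, "transport down"
  end
end

# Runs the handler block and publishes at most one response per request.
def process(publisher, reply_id, log)
  response = nil

  # Scope 1: only the handler. A failure here yields a generic error reply.
  begin
    response = yield
  rescue => error
    response = nil
    log << "handler failed: #{error.message}" # real error stays server-side
    begin
      publisher.publish(reply_id, "Internal server error")
    rescue => publish_error
      log << "error publish failed: #{publish_error.message}"
    end
  end

  # Scope 2: the success publish. A failure here is logged, not re-published.
  return unless response

  begin
    publisher.publish(reply_id, response)
  rescue => publish_error
    log << "success publish failed: #{publish_error.message}"
  end
end

log = []
publisher = FlakyPublisher.new(1)
process(publisher, "reply.1", log) { "ok" }
puts publisher.attempts.inspect
```

With a single shared `rescue`, the failed publish of `"ok"` would have fallen into the handler's error path and produced a second publish attempt for the same request.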
@@ -1,5 +1,5 @@
  module Protobuf
    module Nats
-     VERSION = "0.13.1.pre1"
+     VERSION = "0.13.1.pre2"
    end
  end
data/lib/protobuf/nats.rb CHANGED
@@ -90,10 +90,32 @@ module Protobuf
    :fallback_policy => :discard
  )

+ # Count of error callbacks discarded because the bounded executor was
+ # saturated. Lets a flood of dropped callbacks during an incident be observed
+ # instead of vanishing silently.
+ ERROR_CALLBACK_DROP_COUNT = ::Concurrent::AtomicFixnum.new(0)
+
+ def self.error_callback_drop_count
+   ERROR_CALLBACK_DROP_COUNT.value
+ end
+
  def self.notify_error_callbacks_async(error)
-   ERROR_CALLBACK_EXECUTOR.post { notify_error_callbacks(error) }
+   # #post returns false when the job is rejected. With the :discard fallback
+   # policy the job is silently dropped (returning false) rather than raising,
+   # so the false return is the only drop signal to handle.
+   accepted = ERROR_CALLBACK_EXECUTOR.post { notify_error_callbacks(error) }
+   record_dropped_error_callback unless accepted
    nil
- rescue ::Concurrent::RejectedExecutionError
+ end
+
+ # Record a discarded error callback. Kept cheap -- this runs on nats-pure's
+ # read/flush thread, so it must NOT format/log the error synchronously (the
+ # whole point of the async path). The atomic counter is the durable signal;
+ # the instrument gauge emits a discrete event for dashboards (drops only
+ # happen under a severe flood, so a notification per drop is acceptable).
+ def self.record_dropped_error_callback
+   ERROR_CALLBACK_DROP_COUNT.increment
+   instrument("error_callback_dropped", 1)
    nil
  end
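The drop-counting idea in this hunk does not depend on concurrent-ruby. As a rough standard-library analogue (explicitly not the gem's executor), the sketch below uses a `SizedQueue` with a non-blocking push; `BoundedCallbackQueue`, `post`, `drop_count`, and `drain` are hypothetical names chosen for illustration.

```ruby
# A bounded job queue that counts rejected jobs instead of losing them
# silently. Stdlib-only stand-in for a bounded executor with a discard policy.
class BoundedCallbackQueue
  def initialize(capacity)
    @jobs = SizedQueue.new(capacity)
    @dropped = 0
    @lock = Mutex.new
  end

  # Returns true when the job was accepted, false when it was dropped.
  # The drop path only bumps a counter, so it stays cheap for the caller.
  def post(&job)
    @jobs.push(job, true) # non-blocking: raises ThreadError when full
    true
  rescue ThreadError
    @lock.synchronize { @dropped += 1 }
    false
  end

  def drop_count
    @lock.synchronize { @dropped }
  end

  # Run everything currently queued; returns how many jobs ran.
  def drain
    ran = 0
    until @jobs.empty?
      @jobs.pop(true).call
      ran += 1
    end
    ran
  end
end

queue = BoundedCallbackQueue.new(2)
results = 5.times.map { queue.post { :noop } }
puts results.inspect
puts queue.drop_count
```

As in the diff, the boolean return of `post` is the only drop signal, and the counter gives operators something to alert on after the flood has passed.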
@@ -131,6 +153,12 @@ module Protobuf

  client.on_close do
    logger.warn("Client NATS connection was closed")
+   # A close is terminal for this client object (nats-pure only reconnects
+   # via on_disconnect/on_reconnect; on_close means it gave up). Drop the
+   # memoized reference so the next start_client_nats_connection rebuilds a
+   # fresh connection instead of reusing a permanently-dead one. In-flight
+   # callers keep their own local reference; only new calls rebuild.
+   @client_nats_connection = nil
  end

  client.on_error do |error|
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: protobuf-nats
  version: !ruby/object:Gem::Version
-   version: 0.13.1.pre1
+   version: 0.13.1.pre2
  platform: ruby
  authors:
  - Brandon Dewitt