RubyGems - protobuf-nats - Versions diffs - 0.13.1.pre1 → 0.13.1.pre2 - Mend

protobuf-nats 0.13.1.pre1 → 0.13.1.pre2

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +14 -0
data/README.md +20 -3
data/lib/protobuf/nats/config.rb +34 -3
data/lib/protobuf/nats/errors.rb +7 -0
data/lib/protobuf/nats/response_muxer.rb +8 -0
data/lib/protobuf/nats/server.rb +73 -23
data/lib/protobuf/nats/version.rb +1 -1
data/lib/protobuf/nats.rb +30 -2
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 547f632aa7ad154f6a546c67df3628166ac2db3acbc2dd532a1e98aa6525788e
-  data.tar.gz: 699fe41c76d6aabb6f5db20f20f8309f5723e99d74f164d906dabd92fc441556
+  metadata.gz: fd150563a177c8d3eddeac0375225c8e05489082e3189da9746c9ad86e4bc188
+  data.tar.gz: a5a4d99c1897458ac511ac3f92bcd1d09338c340b11b97e214900c076a63f69d
 SHA512:
-  metadata.gz: ac2417d75bbc60ad475c01bec82bd334a83ef7b3a7a6051b096273e28406c68c9344756078bb6e1a729f8827c478e91b0341c1d69bbe3ff35fdb2d3f6bab47b8
-  data.tar.gz: 4bbc4f568f992068277a3ee2d9b50d6b715cf80e12e6208e020bed6137cd54c7dcd37a7279933f7ca2f0ebcef52aae9adba939633de9edbf5f76366ed1fb0188
+  metadata.gz: 552ec79ee1fe6b74b85e0ba5b7ee67f1a50beccc20155b515f28249a25632bed90a87bdfc5c9b9badaac6f538f1e7e1ebba7cda6c45f773529af6b4fec2e52aa
+  data.tar.gz: 4f93e5107575a9703e01a0808705ad220dcdee1dadef657da326315f35f5de1d5ac59cf035bb250178697a31ca9a5c4094cc1fe124021437a87fe532396a231a

data/CHANGELOG.md CHANGED

@@ -1,5 +1,19 @@
 ## Changelog
+### 0.13.1.pre2
+Additional edge-case fixes found while reviewing the 0.13.1 changes:
+- **ResponseMuxer self-heal could drop to zero dispatchers.** When the sole dispatcher crashed fatally, its self-healing `start` counted the still-alive (but exiting) crashing thread, so it spawned no replacement — leaving zero dispatchers on CRuby (`dispatcher_count == 1`) and the muxer silently delivering no responses. The crashing thread now removes itself from the handler pool before re-topping it up.
+- **Client connection is rebuilt after a terminal close.** `@client_nats_connection` was memoized once and never reset, so once nats-pure gave up and fired `on_close` every later request reused a dead client forever. `on_close` now drops the cached connection so the next request rebuilds.
+- **Dropped error callbacks are now observable.** The bounded `notify_error_callbacks_async` executor silently discarded callbacks when saturated (exactly during an error flood). Drops now bump `Protobuf::Nats.error_callback_drop_count` and emit `error_callback_dropped`, without formatting/logging on the read thread.
+- **Server no longer double-publishes on a response-publish failure.** A transport error while publishing a *successful* response fell into the handler rescue and emitted a second (error) response for the same request. The handler and the success-response publish are now in separate rescue scopes. The handler-failure error response also now sends a generic message ("Internal server error") instead of the raw `error.message`, so internal handler details aren't leaked to clients (the real error is still logged server-side).
+- **Opt-in reclaim of overdue handlers.** Handlers are still never aborted by default. `PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS=true` lets operators reclaim a pool slot held by an orphaned handler (one that outlived the client's `response_timeout`) by raising `Errors::HandlerOverdue` into it; emits `server.handler_reclaimed`.
+- **TLS now verifies the NATS server certificate.** Previously, supplying a prepared `:tls` context made nats-pure skip its own `set_params`, leaving the OpenSSL default `VERIFY_NONE` in force — any certificate was accepted (MITM exposure) — and `tls_ca_cert` was configured but read nowhere. `Config#new_tls_context` now sets `verify_mode = VERIFY_PEER` and trusts the configured `tls_ca_cert` (falling back to the system trust store when none is set). **Breaking for misconfigured deployments:** a server whose certificate does not chain to the trusted CA, which previously connected insecurely, will now be rejected. (Hostname/SAN verification is still not enabled — see known gaps.)
+#### Known gaps noted (not changed here)
+- **TLS hostname (SAN/CN) is not verified.** Chain verification is now on, but nats-pure only sets the SSLSocket hostname for a context it builds itself, and a single static hostname would be wrong for a multi-server cluster that reconnects across hosts. Plumbing per-connection hostname verification is tracked separately.
+- **`TLS1_3_VERSION` is assumed defined.** Fine on the JRuby targets; an old MRI/OpenSSL build without the constant would raise `NameError`.
 ### 0.13.1
 Fixes a production regression and a set of related issues, all of the same class: assumptions left over from the JNats → nats-pure migration in 0.13.0 that became silently wrong.

data/README.md CHANGED

@@ -57,6 +57,12 @@ so the work is orphaned. Defaults above the client's 60s response timeout so leg
 a small grace). If clients use a longer response timeout, raise this so handlers aren't flagged overdue while a client is
 still waiting; if shorter, lower it so orphaned work is surfaced promptly.
+`PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS` - When `"true"`, actively reclaim the thread-pool slot held by an overdue
+handler (one past `PB_NATS_SERVER_HANDLER_OVERDUE_MS`, whose client has already given up) by raising
+`Errors::HandlerOverdue` into the worker; emits `server.handler_reclaimed`. **Off by default** — the documented contract
+is that handlers are never aborted, since killing a thread mid-handler can corrupt state. Enable only when orphaned work
+is saturating the pool and the server is NACKing healthy traffic (default: false).
 `PB_NATS_CLIENT_ACK_TIMEOUT` - Seconds to wait for an ACK from the rpc server (default: 5 seconds).
 `PB_NATS_CLIENT_NACK_BACKOFF_INTERVALS` - Array of milliseconds to wait between NACK retries (default: "0,1,3,5,10").
@@ -115,6 +121,8 @@ An example config looks like this:
 When `uses_tls` is set, the client negotiates TLS with a floor of 1.2 and a ceiling of 1.3: it uses TLS 1.3 where the NATS server supports it and falls back to 1.2 otherwise (verified on JRuby 9.4 and 10.0).
+The client **verifies the NATS server's certificate chain** (`verify_mode = VERIFY_PEER`). When `tls_ca_cert` is set, only certificates chaining to that CA are trusted (the private-CA case); otherwise the system trust store is used. **Note:** a server whose certificate does not chain to the configured CA will be rejected — if you are upgrading from a release that did not verify (`< 0.13.1`), make sure `tls_ca_cert` points at the CA that signed your NATS server certificate. Hostname (SAN/CN) verification is not yet enabled (chain verification still ensures the certificate is signed by your trusted CA).
 ## Usage
 This library is designed to be an alternative transport implementation used by the `protobuf` gem. In order to make
@@ -195,8 +203,13 @@ If we were to add another service endpoint called `search` to the `UserService`
   self-heal with exponential backoff.
 - **Server observability** — beyond the thread-pool gauges, the server emits in-flight handler metrics
   (`server.inflight_count`, `server.inflight_oldest_age_ms`, `server.overdue_handler_count`, `server.handler_overdue`,
-  `server.pending_intake_queue_size`, `server.thread_pool_saturated`). Long-running handlers are allowed and never aborted;
-  a handler is only "overdue" once it outlives the client's `response_timeout` (see `PB_NATS_SERVER_HANDLER_OVERDUE_MS`).
+  `server.pending_intake_queue_size`, `server.thread_pool_saturated`). Long-running handlers are allowed and never aborted
+  by default; a handler is only "overdue" once it outlives the client's `response_timeout` (see
+  `PB_NATS_SERVER_HANDLER_OVERDUE_MS`). Overdue handlers can optionally be reclaimed (emitting `server.handler_reclaimed`)
+  via `PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS`.
+- **Error-callback observability** — `on_error` callbacks run on a bounded background executor so a slow callback can't
+  stall message processing. If that executor saturates under an error flood, dropped callbacks are counted
+  (`Protobuf::Nats.error_callback_drop_count`) and emit `error_callback_dropped` rather than being lost silently.
 ## Resilience
@@ -209,7 +222,11 @@ The client is built to ride out transient NATS hiccups rather than surface them
 - **Missing ACKs and NACKs are retried** with their own timeouts/backoff (`PB_NATS_CLIENT_ACK_TIMEOUT`,
   `PB_NATS_CLIENT_NACK_BACKOFF_INTERVALS`).
 - **Server-side failures fail the caller fast.** If the server cannot process a request after it has ACKed, it publishes
-  an encoded RPC error response so the client raises immediately instead of blocking until `PB_NATS_CLIENT_RESPONSE_TIMEOUT`.
+  an encoded RPC error response (a generic message; the real error is logged server-side) so the client raises immediately
+  instead of blocking until `PB_NATS_CLIENT_RESPONSE_TIMEOUT`. A transport failure while publishing a successful response
+  no longer triggers a duplicate error response.
+- **The client connection self-heals after a terminal close.** A permanently closed NATS connection is dropped and
+  rebuilt on the next request instead of being reused.
 - **The response dispatcher self-heals.** A crashed muxer dispatcher restarts with exponential backoff, and a brief
   subscription-restart window won't busy-spin the dispatch loop.

data/lib/protobuf/nats/config.rb CHANGED

@@ -66,9 +66,11 @@ module Protobuf
       end
       # Only the keys nats-pure's `connect` actually consumes. App-level settings
-      # (uses_tls, tls_*, server_subscription_key_*, subscription_key_replacements)
-      # are read directly via their accessors elsewhere and must NOT be forwarded
-      # to nats-pure (it ignores unknown keys today, but that is brittle).
+      # (uses_tls, tls_client_cert, tls_client_key, tls_ca_cert,
+      # server_subscription_key_*, subscription_key_replacements) are read
+      # directly via their accessors elsewhere and must NOT be forwarded to
+      # nats-pure (it ignores unknown keys today, but that is brittle). The TLS
+      # cert/key/CA are folded into the :tls context by #new_tls_context.
       def connection_options(reload = false)
         @connection_options = false if reload
         @connection_options ||= begin
@@ -88,10 +90,39 @@ module Protobuf
         # ssl_version=:TLSv1_2 hard pin). The client offers 1.2 and 1.3 and
         # negotiates the highest the server also supports, so a TLS-1.2-only
         # transport still connects (verified on JRuby 9.4 and 10.0).
+        #
+        # NOTE (#7): this assumes the OpenSSL build defines TLS1_3_VERSION. That
+        # holds on the JRuby targets above, but an older MRI/OpenSSL build without
+        # the constant would raise NameError here. Not guarded yet -- revisit if
+        # CRuby-on-old-OpenSSL becomes a supported target.
         tls_context.min_version = ::OpenSSL::SSL::TLS1_2_VERSION
         tls_context.max_version = ::OpenSSL::SSL::TLS1_3_VERSION
         tls_context.cert = ::OpenSSL::X509::Certificate.new(::File.read(tls_client_cert)) if tls_client_cert
         tls_context.key = ::OpenSSL::PKey::RSA.new(::File.read(tls_client_key)) if tls_client_key
+        # Verify the NATS server's certificate chain. This context is handed to
+        # nats-pure as :tls => {:context => ...}; nats-pure uses a supplied
+        # context verbatim and does NOT call #set_params, so verification has to
+        # be configured here. Without this the OpenSSL default (VERIFY_NONE)
+        # stood and any certificate -- including an attacker's -- was accepted.
+        tls_context.verify_mode = ::OpenSSL::SSL::VERIFY_PEER
+        cert_store = ::OpenSSL::X509::Store.new
+        if tls_ca_cert
+          # Trust the configured CA bundle (the private-CA deployment case).
+          cert_store.add_file(tls_ca_cert)
+        else
+          # No CA configured: fall back to the system trust store.
+          cert_store.set_default_paths
+        end
+        tls_context.cert_store = cert_store
+        # NOTE: hostname (SAN/CN) verification is NOT enabled here. nats-pure only
+        # sets the SSLSocket hostname from @tls[:hostname], which it populates
+        # itself only when it builds the context; for a supplied context it stays
+        # nil, and a single static hostname would be wrong for a multi-server
+        # cluster that reconnects across hosts. Chain verification above still
+        # ensures the cert is signed by the trusted CA. Plumbing per-connection
+        # hostname verification is tracked separately.
         tls_context
       end
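
The verification setup in this hunk can be tried in isolation with nothing but Ruby's `openssl` standard library. The following is a minimal sketch, not the gem's code: `build_verifying_tls_context` and its `ca_cert_path` argument are hypothetical names, and the `const_defined?` guard around `TLS1_3_VERSION` is an assumption about how the gap noted above might be handled (the gem itself leaves it unguarded).

```ruby
require "openssl"

# Hypothetical helper (not part of protobuf-nats): build a client-side TLS
# context that verifies the server's certificate chain.
def build_verifying_tls_context(ca_cert_path = nil)
  context = OpenSSL::SSL::SSLContext.new
  context.min_version = OpenSSL::SSL::TLS1_2_VERSION

  # Assumption: only set a 1.3 ceiling when the OpenSSL build defines the
  # constant; otherwise no explicit ceiling is set.
  if OpenSSL::SSL.const_defined?(:TLS1_3_VERSION)
    context.max_version = OpenSSL::SSL::TLS1_3_VERSION
  end

  # Without this, a hand-built context keeps the VERIFY_NONE default.
  context.verify_mode = OpenSSL::SSL::VERIFY_PEER

  store = OpenSSL::X509::Store.new
  if ca_cert_path
    store.add_file(ca_cert_path) # trust only the configured CA bundle
  else
    store.set_default_paths      # fall back to the system trust store
  end
  context.cert_store = store
  context
end

context = build_verifying_tls_context
```

As in the hunk, this covers chain verification only; it does not enable hostname checking.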

data/lib/protobuf/nats/errors.rb CHANGED

@@ -16,6 +16,13 @@ module Protobuf
       class MriIOException < ::StandardError
       end
+      # Raised into a worker thread to reclaim a handler that has outlived the
+      # client's response_timeout. Only used when overdue-reclaim is explicitly
+      # enabled via PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS (default off); the
+      # documented default is that handlers are never aborted.
+      class HandlerOverdue < ::StandardError
+      end
       IOException = MriIOException
       # Transient transport errors that mean the NATS connection is unavailable
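
For readers unfamiliar with raising an exception into another thread, the mechanism behind `HandlerOverdue` can be sketched with plain Ruby threads. This is an illustrative sketch only: the local `HandlerOverdue` class, the queues, and the sleeping "handler" are stand-ins, not the gem's server code.

```ruby
# Local stand-in for Errors::HandlerOverdue (illustration only).
class HandlerOverdue < StandardError; end

ready = Queue.new
result = Queue.new

worker = Thread.new do
  begin
    ready << true # signal that we are inside the rescue scope
    sleep         # stand-in for a handler that never finishes
  rescue HandlerOverdue => error
    # The handler's own rescue decides what happens next; here we just
    # record that the slot was reclaimed.
    result << "reclaimed: #{error.message}"
  end
end

ready.pop # wait until the worker has entered its begin block
worker.raise(HandlerOverdue, "handler exceeded 65000ms; reclaimed")
worker.join
outcome = result.pop
```

The exception can arrive at any point in the handler, which is why the diff keeps this behaviour off by default.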

data/lib/protobuf/nats/response_muxer.rb CHANGED

@@ -327,6 +327,14 @@ module Protobuf
             # After sleeping, reset the state and try to start again.
             LOCK.synchronize do
+              # Remove ourselves from the handler pool BEFORE start re-tops it up.
+              # This thread is still alive (running this rescue) but is about to
+              # exit, so start's `select!(&:alive?)` would otherwise count it as a
+              # live dispatcher and spawn no replacement -- leaving the pool one
+              # short (zero dispatchers on CRuby, where dispatcher_count == 1, and
+              # the muxer would stop delivering responses entirely).
+              @resp_handlers.delete(::Thread.current)
               if @resp_sub
                 begin
                   @resp_sub.unsubscribe
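
The reason the crashing thread has to remove itself can be reproduced in a few lines: a thread running its own `rescue` block still reports `alive?` as true, so a pool that counts live threads counts it. A minimal sketch, where the `handlers` array and `lock` are simplified stand-ins for the muxer's state:

```ruby
handlers = []
lock = Mutex.new
observed = {}

worker = nil
# Hold the lock while registering the thread so its rescue block cannot run
# before it has been added to the pool.
lock.synchronize do
  worker = Thread.new do
    begin
      raise "fatal dispatcher crash"
    rescue => _error
      lock.synchronize do
        observed[:alive_in_rescue] = Thread.current.alive?
        observed[:live_before_removal] = handlers.count(&:alive?)
        handlers.delete(Thread.current)
        observed[:live_after_removal] = handlers.count(&:alive?)
      end
    end
  end
  handlers << worker
end

worker.join
```

Before the removal the pool still reports one live dispatcher, so a top-up to a target of one would spawn nothing; after it, the count is zero and a replacement would be started.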

data/lib/protobuf/nats/server.rb CHANGED

@@ -64,6 +64,20 @@ module Protobuf
         @handler_overdue_ms ||= ::ENV.fetch("PB_NATS_SERVER_HANDLER_OVERDUE_MS", 65_000).to_i
       end
+      # Whether to actively reclaim (abort) an overdue handler's pool slot. OFF by
+      # default: the documented contract is that handlers are never aborted, since
+      # killing a thread mid-handler can corrupt state. Enable only when you would
+      # rather shed orphaned work (whose client already gave up) than let it pin a
+      # pool slot -- e.g. when overdue handlers are saturating the pool and the
+      # server is NACKing healthy traffic. Reclaim raises Errors::HandlerOverdue
+      # into the worker, which the handler rescue turns into an RPC error response.
+      def reclaim_overdue_handlers?
+        # Memoize the raw string (never falsey, so ||= is safe) and derive the
+        # boolean per call -- avoids the nil-guard dance for a false-able memo.
+        @reclaim_overdue_handlers ||= ::ENV.fetch("PB_NATS_SERVER_RECLAIM_OVERDUE_HANDLERS", "false")
+        @reclaim_overdue_handlers == "true"
+      end
       # How long to let in-flight handlers finish on shutdown. Tracks the overdue
       # window (plus grace) so a legitimate long handler isn't killed mid-flight.
       def shutdown_drain_timeout
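
The comment in `reclaim_overdue_handlers?` about memoizing the raw string is easy to verify: `||=` re-evaluates its right-hand side whenever the stored value is `false` or `nil`, so a memoized boolean `false` is never actually cached. A small sketch, using a hypothetical `Flags` class and a plain Hash in place of `ENV`:

```ruby
# Hypothetical illustration class; `env` is any object responding to #fetch.
class Flags
  attr_reader :naive_lookups, :string_lookups

  def initialize(env)
    @env = env
    @naive_lookups = 0
    @string_lookups = 0
  end

  # Memoizes the boolean: a false result is looked up again on every call.
  def naive_enabled?
    @naive ||= begin
      @naive_lookups += 1
      @env.fetch("FLAG", "false") == "true"
    end
  end

  # Memoizes the raw string (always truthy) and derives the boolean per call.
  def enabled?
    @raw ||= begin
      @string_lookups += 1
      @env.fetch("FLAG", "false")
    end
    @raw == "true"
  end
end

flags = Flags.new({})
3.times do
  flags.naive_enabled?
  flags.enabled?
end
```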
@@ -91,13 +105,24 @@ module Protobuf
         oldest_age_ms = 0.0
         overdue = 0
-        @inflight.each_pair do |id, started_at|
+        @inflight.each_pair do |id, entry|
+          started_at, handler_thread = entry
           count += 1
           age_ms = (now - started_at) * MILLISECOND
           oldest_age_ms = age_ms if age_ms > oldest_age_ms
           next unless overdue_ms.positive? && age_ms >= overdue_ms
           overdue += 1
+          # Optionally reclaim the slot by aborting the orphaned handler (opt-in;
+          # see #reclaim_overdue_handlers?). Done before the dedupe below so the
+          # reclaim is attempted even after the overdue event was already emitted.
+          if reclaim_overdue_handlers? && handler_thread&.alive?
+            logger.warn "Reclaiming overdue handler (age=#{age_ms.round}ms, client already gave up) to free its pool slot"
+            handler_thread.raise(::Protobuf::Nats::Errors::HandlerOverdue, "handler exceeded #{overdue_ms}ms; reclaimed")
+            ::Protobuf::Nats.instrument("server.handler_reclaimed", age_ms)
+          end
           # Emit the per-handler overdue event once (the client has already
           # given up; this handler's result is orphaned).
           next if @overdue_flagged[id]
@@ -138,35 +163,60 @@ module Protobuf
         enqueued_at = monotonic
         request_id = @request_seq.increment
         was_enqueued = thread_pool.push do
+          # nil response_data is the "handler failed, don't publish a success
+          # response" sentinel (a successful encode is always a non-nil String,
+          # even when empty).
+          response_data = nil
           begin
             # Instrument the thread pool time-to-execute duration.
             processed_at = monotonic
             ::Protobuf::Nats.instrument("server.thread_pool_execution_delay", (processed_at - enqueued_at) * MILLISECOND)
             # Track this handler as in-flight (long handlers are allowed; this is
-            # only for observability -- we never abort it).
-            @inflight[request_id] = processed_at
-            # Process request.
-            response_data = handle_request(request_data, 'server' => @server)
-            # Publish response.
-            logger.debug { "Publishing response to #{reply_id}" } if logger.debug?
-            nats.publish(reply_id, response_data)
-          rescue => error
-            logger.debug { "rescued error => #{error}" }  if logger.debug?
-            ::Protobuf::Nats.notify_error_callbacks(error)
-            # The client has already received our ACK and is now blocked waiting
-            # for the response message. If we don't send one it will hang until
-            # response_timeout (60s by default). Publish an encoded RPC error so
-            # the client fails fast instead. (If the failure was the connection
-            # itself, this publish will also fail and is swallowed below.)
+            # only for observability -- we never abort it unless overdue-reclaim
+            # is explicitly enabled). Store the worker thread so reclaim can
+            # target it; the start time drives age/overdue accounting.
+            @inflight[request_id] = [processed_at, ::Thread.current]
+            # Process request. Only the handler is wrapped here so a transport
+            # failure on the success-response publish (below) cannot fall into
+            # this rescue and emit a *second* (error) publish for a request whose
+            # handler actually succeeded.
             begin
-              error_response = ::Protobuf::Rpc::PbError.new(error.message)
-              nats.publish(reply_id, error_response.encode)
-            rescue => publish_error
-              logger.error "Failed to publish error response for #{reply_id}: #{publish_error.message}"
+              response_data = handle_request(request_data, 'server' => @server)
+            rescue => error
+              response_data = nil # ensure the success-publish below is skipped
+              logger.debug { "rescued error => #{error}" }  if logger.debug?
+              # Logs the real error server-side (via the default log_error
+              # callback) so it isn't lost; the client gets only a generic message.
+              ::Protobuf::Nats.notify_error_callbacks(error)
+              # The client has already received our ACK and is now blocked waiting
+              # for the response message. If we don't send one it will hang until
+              # response_timeout (60s by default). Publish an encoded RPC error so
+              # the client fails fast instead. Use a generic message rather than
+              # error.message so internal handler details aren't leaked over the
+              # wire. (If the failure was the connection itself, this publish will
+              # also fail and is swallowed below.)
+              begin
+                error_response = ::Protobuf::Rpc::PbError.new("Internal server error")
+                nats.publish(reply_id, error_response.encode)
+              rescue => publish_error
+                logger.error "Failed to publish error response for #{reply_id}: #{publish_error.message}"
+              end
+            end
+            # Publish the successful response. Kept outside the handler rescue so a
+            # publish failure here is logged rather than triggering a duplicate
+            # (error) response for a request that already succeeded.
+            if response_data
+              logger.debug { "Publishing response to #{reply_id}" } if logger.debug?
+              begin
+                nats.publish(reply_id, response_data)
+              rescue => publish_error
+                logger.error "Failed to publish response for #{reply_id}: #{publish_error.message}"
+                ::Protobuf::Nats.notify_error_callbacks(publish_error)
+              end
             end
           ensure
             @inflight.delete(request_id)
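
The effect of splitting the rescue scopes can be shown without NATS at all. The sketch below is an assumption-laden miniature of the control flow in this hunk: `RecordingTransport` and `respond` are hypothetical names, and strings stand in for encoded responses.

```ruby
# Hypothetical transport that records every publish attempt and can be told
# to fail the first one.
class RecordingTransport
  attr_reader :attempts

  def initialize(fail_first: false)
    @attempts = []
    @fail_first = fail_first
  end

  def publish(subject, payload)
    @attempts << [subject, payload]
    raise IOError, "connection lost" if @fail_first && @attempts.size == 1
  end
end

def respond(transport, reply_id)
  response = nil
  begin
    response = yield # only the handler is inside this rescue scope
  rescue => _error
    response = nil
    begin
      transport.publish(reply_id, "Internal server error")
    rescue => _publish_error
      nil # the real code logs here
    end
  end

  return unless response

  begin
    transport.publish(reply_id, response)
  rescue => _publish_error
    nil # logged only; no second (error) publish is attempted
  end
end

healthy = RecordingTransport.new
respond(healthy, "reply.1") { "payload" }

failing_handler = RecordingTransport.new
respond(failing_handler, "reply.2") { raise "boom" }

flaky = RecordingTransport.new(fail_first: true)
respond(flaky, "reply.3") { "payload" }
```

In the `flaky` case the only attempt is the failed success publish; with a single shared rescue there would have been a second attempt carrying an error response.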

data/lib/protobuf/nats/version.rb CHANGED

@@ -1,5 +1,5 @@
 module Protobuf
   module Nats
-    VERSION = "0.13.1.pre1"
+    VERSION = "0.13.1.pre2"
   end
 end

data/lib/protobuf/nats.rb CHANGED

@@ -90,10 +90,32 @@ module Protobuf
       :fallback_policy => :discard
     )
+    # Count of error callbacks discarded because the bounded executor was
+    # saturated. Lets a flood of dropped callbacks during an incident be observed
+    # instead of vanishing silently.
+    ERROR_CALLBACK_DROP_COUNT = ::Concurrent::AtomicFixnum.new(0)
+    def self.error_callback_drop_count
+      ERROR_CALLBACK_DROP_COUNT.value
+    end
     def self.notify_error_callbacks_async(error)
-      ERROR_CALLBACK_EXECUTOR.post { notify_error_callbacks(error) }
+      # #post returns false when the job is rejected. With the :discard fallback
+      # policy the job is silently dropped (returning false) rather than raising,
+      # so the false return is the only drop signal to handle.
+      accepted = ERROR_CALLBACK_EXECUTOR.post { notify_error_callbacks(error) }
+      record_dropped_error_callback unless accepted
       nil
-    rescue ::Concurrent::RejectedExecutionError
+    end
+    # Record a discarded error callback. Kept cheap -- this runs on nats-pure's
+    # read/flush thread, so it must NOT format/log the error synchronously (the
+    # whole point of the async path). The atomic counter is the durable signal;
+    # the instrument gauge emits a discrete event for dashboards (drops only
+    # happen under a severe flood, so a notification per drop is acceptable).
+    def self.record_dropped_error_callback
+      ERROR_CALLBACK_DROP_COUNT.increment
+      instrument("error_callback_dropped", 1)
       nil
     end
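
The idea of turning a silent drop into a counted one can be sketched with the standard library alone. Note the swap: this uses a `SizedQueue` with a non-blocking push rather than concurrent-ruby's executor, and `BoundedCallbackQueue` is a hypothetical name, so treat it as an analogy for the pattern rather than the gem's implementation.

```ruby
# Hypothetical stdlib stand-in for a bounded executor with a discard policy.
class BoundedCallbackQueue
  attr_reader :drop_count

  def initialize(capacity)
    @jobs = SizedQueue.new(capacity)
    @drop_count = 0
    @lock = Mutex.new
  end

  # Returns true when the job was accepted, false when it was dropped.
  def post(&job)
    @jobs.push(job, true) # non-blocking push raises ThreadError when full
    true
  rescue ThreadError
    # Keep the drop path cheap: bump a counter, do no formatting or logging.
    @lock.synchronize { @drop_count += 1 }
    false
  end

  # Runs every queued job on the calling thread; returns how many ran.
  def drain
    ran = 0
    until @jobs.empty?
      @jobs.pop(true).call
      ran += 1
    end
    ran
  end
end

queue = BoundedCallbackQueue.new(2)
accepted = 5.times.map { |i| queue.post { i } }
ran = queue.drain
```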
@@ -131,6 +153,12 @@ module Protobuf
         client.on_close do
           logger.warn("Client NATS connection was closed")
+          # A close is terminal for this client object (nats-pure only reconnects
+          # via on_disconnect/on_reconnect; on_close means it gave up). Drop the
+          # memoized reference so the next start_client_nats_connection rebuilds a
+          # fresh connection instead of reusing a permanently-dead one. In-flight
+          # callers keep their own local reference; only new calls rebuild.
+          @client_nats_connection = nil
         end
         client.on_error do |error|
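
The `on_close` change is a small instance of "invalidate the memo when the resource dies". A minimal sketch with stand-in classes (`FakeClient` and `ConnectionHolder` are hypothetical and do not mirror nats-pure's API beyond an `on_close` hook):

```ruby
# Hypothetical client exposing only an on_close hook.
class FakeClient
  def on_close(&block)
    @on_close = block
  end

  def close!
    @on_close&.call
  end
end

class ConnectionHolder
  attr_reader :builds

  def initialize
    @builds = 0
    @lock = Mutex.new
  end

  def connection
    @lock.synchronize do
      @connection ||= begin
        @builds += 1
        client = FakeClient.new
        # Drop the memoized reference so the next call rebuilds.
        client.on_close { @connection = nil }
        client
      end
    end
  end
end

holder = ConnectionHolder.new
first = holder.connection
same = holder.connection
first.close!
second = holder.connection
```

Callers that already hold `first` keep their local reference; only later calls to `connection` see the rebuilt client.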

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: protobuf-nats
 version: !ruby/object:Gem::Version
-  version: 0.13.1.pre1
+  version: 0.13.1.pre2
 platform: ruby
 authors:
 - Brandon Dewitt