protobuf-nats 0.13.0 → 0.13.1.pre1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: dd9d0e1d6f565a66e972312fa19398f5b027c426b363672b5aa96b5a61f00595
4
- data.tar.gz: c253885854d9bafcd5e714f8f1ff2afdb1193758d394f42c91ebf58f70865bd6
3
+ metadata.gz: 547f632aa7ad154f6a546c67df3628166ac2db3acbc2dd532a1e98aa6525788e
4
+ data.tar.gz: 699fe41c76d6aabb6f5db20f20f8309f5723e99d74f164d906dabd92fc441556
5
5
  SHA512:
6
- metadata.gz: c6415db921943a0c61e3c310aea1a67f50fc955dfe03512d8709aeb3ae242228d8da78ce60df8158e9e7ba1b8d56cb8d59512e265a27e8f106d9fcf85fdf2c1c
7
- data.tar.gz: ed7ff5492e9dce9a7c15aaf30885c2fa0e50533186e2c48a662eed6f80ae4db5a485a5469d3707b221956f02d05b68843273a23280cc7c9209217bca66f03c7a
6
+ metadata.gz: ac2417d75bbc60ad475c01bec82bd334a83ef7b3a7a6051b096273e28406c68c9344756078bb6e1a729f8827c478e91b0341c1d69bbe3ff35fdb2d3f6bab47b8
7
+ data.tar.gz: 4bbc4f568f992068277a3ee2d9b50d6b715cf80e12e6208e020bed6137cd54c7dcd37a7279933f7ca2f0ebcef52aae9adba939633de9edbf5f76366ed1fb0188
data/CHANGELOG.md CHANGED
@@ -1,5 +1,25 @@
1
1
  ## Changelog
2
2
 
3
+ ### 0.13.1
4
+ Fixes a production regression and a set of related issues, all of the same class: assumptions left over from the JNats → nats-pure migration in 0.13.0 that became silently wrong.
5
+
6
+ - **Dropped-connection retries were silently disabled.** Dropping JNats collapsed `Errors::IOException` to the never-raised `MriIOException`, so the client's reconnect/retry `rescue` became dead code. A dropped NATS connection then escaped immediately as an `RPC_ERROR` (surfacing as a 500) instead of being retried. The client now rescues the transport errors `nats-pure` and the socket layer actually raise (`EOFError`, `IOError`, `Errno::ECONNRESET`/`EPIPE`/`ECONNREFUSED`/`ETIMEDOUT`, `NATS::IO::ConnectionClosedError`, and `java.io.IOException` on JRuby) via `Errors::RETRYABLE_TRANSPORT_ERRORS` and rides them out with the existing `reconnect_delay` retry loop.
7
+ - **Response muxer `pending_size` drift (could silently drop all responses).** nats-pure increments a subscription's `pending_size` (synchronized) for every inbound message and uses it to enforce the slow-consumer byte limit. For a callback-less subscription it never decrements the counter, so the muxer, as the sole consumer, would have to keep it in sync itself. Rather than mirror that accounting with a lock on every message, the muxer now **disables the byte-based limit** on its response subscription and relies on the message-count limit (the `SizedQueue` depth, tracked accurately for free). This removes the per-message lock from the dispatch hot path (~**2.7× faster** per message on JRuby; see `bench/muxer_resilience_bench.rb`) and eliminates the drift bug entirely.
8
+ - **Dispatcher no longer busy-spins during a restart window.** If `@resp_sub` was briefly `nil` while the muxer restarted, the dispatch loop raised `NoMethodError` every iteration — busy-spinning and emitting a logged error + error-callback per spin. It now parks briefly (~0.2% of the old wasted work, zero errors).
9
+ - **Self-healing backoff counter is now thread-safe.** The shared dispatcher crash counter was a plain `Integer` mutated by multiple dispatcher threads (it lost ~45% of updates under true parallelism on JRuby, corrupting the exponential backoff). It is now a `Concurrent::AtomicFixnum` that decays once a dispatcher is healthy.
10
+ - **Client connection lifecycle hardening.** Connection callbacks (`on_disconnect`/`on_reconnect`/`on_close`/`on_error`) are now registered before `connect`, so handshake-time events are observed; and a failed handshake closes the half-open client so nats-pure's reader/flusher threads aren't leaked.
11
+ - **Removed the dead `:disable_reconnect_buffer` connect option.** nats-pure has no such option (it was a JNats concept), so it was silently ignored. Transient disconnects are now handled by the client's transport-error retry path and `ack_timeout`.
12
+ - **Server no longer leaves clients hanging on handler/publish failure.** If processing a request fails after the ACK was sent, the server now publishes an encoded `RPC_ERROR` response so the client fails fast instead of blocking until `response_timeout` (60s).
13
+ - **Config no longer crashes when the YAML file has no section for the current environment** (or is empty); it falls back to defaults.
14
+ - **TLS now has a floor of 1.2 and a ceiling of 1.3** (replacing the deprecated `ssl_version = :TLSv1_2` hard pin), so TLS 1.3 is used when the server supports it and a TLS-1.2-only transport still negotiates down to 1.2. Verified on JRuby 9.4 and 10.0.
15
+ - **Server request intake is now parallelized.** `SuperSubscriptionManager` drained the shared intake queue with a single thread that also published every ACK/NACK, so on JRuby intake was pinned to one core and one slow publish (e.g. nats-pure's buffer during a reconnect) head-of-line blocked *every* subject. Intake now fans out to `PB_NATS_SERVER_SUBSCRIPTION_HANDLERS` threads (default `processor_count` on JRuby, 1 on CRuby) with per-thread self-healing backoff. NATS queue-group semantics and subscription counts are unchanged — each request is still delivered to exactly one consumer. Measured **~8.5× intake throughput** and head-of-line stall **~505ms → ~0.4ms** at 8 handlers (`bench/server_intake_bench.rb`).
16
+ - **Client retry is bounded and jittered.** `PB_NATS_CLIENT_MAX_RETRIES` (default 3) and `PB_NATS_CLIENT_RECONNECT_DELAY_SPLAY_LIMIT` (default 1000ms) make retries configurable, and the reconnect sleep now adds random jitter so a fleet hitting the same outage doesn't reconnect in lockstep.
17
+ - **More transient errors are retried.** `ConnectionPool::TimeoutError` (subscription-pool exhaustion during a reconnect) is now treated as transient instead of surfacing as an `RPC_ERROR`.
18
+ - **`connection_options` only forwards nats-pure-recognized keys** (servers, max_reconnect_attempts, connect_timeout, tls); app-level settings are read via their own accessors and no longer leak into `nats.connect`. YAML config now uses `safe_load`.
19
+ - **Thread-pool robustness.** `wait_for_termination` prunes under its mutex and returns a real drained/timed-out result; a new `replenish` (called each server tick) respawns a worker killed by a non-StandardError. On shutdown the drain timeout tracks `handler_overdue_ms` so a legitimate long handler isn't killed mid-flight, and abandoned in-flight handlers are logged/instrumented.
20
+ - **Error callbacks run off the read loop.** The nats `on_error` hooks dispatch via a bounded executor (`notify_error_callbacks_async`) so a slow user callback can't stall message processing for every subject.
21
+ - **Server handler observability (long operations are first-class).** Handlers are never aborted — long-running operations (up to and beyond a minute) are allowed. The server now tracks in-flight handlers and emits `server.inflight_count`, `server.inflight_oldest_age_ms`, `server.overdue_handler_count`, `server.handler_overdue`, `server.pending_intake_queue_size`, `server.slow_handler` (opt-in via `PB_NATS_SERVER_SLOW_HANDLER_THRESHOLD_MS`), and `server.thread_pool_saturated`. A handler is only flagged "overdue" once it outlives the client's `response_timeout` (`PB_NATS_SERVER_HANDLER_OVERDUE_MS`, default 65s), so normal long ops are not mislabeled. Server duration metrics now use a monotonic clock.
22
+
3
23
  ### 0.13.0
4
24
  This is a large overhaul of the client and server internals.
5
25
 
data/README.md CHANGED
@@ -39,7 +39,23 @@ file is removed it will resubscribe and restart slow start (default: `nil`).
39
39
 
40
40
  `PB_NATS_SERVER_SUBSCRIPTIONS_PER_RPC_ENDPOINT` - Number of subscriptions to create for each rpc endpoint. This number is
41
41
  used to allow JVM based servers to warm-up slowly to prevent jolts in runtime performance across your RPC network
42
- (default: 10).
42
+ (default: 10). Each subscription joins the NATS queue group for its endpoint, so every request is still delivered to
43
+ exactly one consumer — this knob controls subscription/interest count, not duplicate delivery.
44
+
45
+ `PB_NATS_SERVER_SUBSCRIPTION_HANDLERS` - Number of threads that drain the shared intake queue and publish ACK/NACKs
46
+ (see [How it works](#how-it-works)). Defaults to `Concurrent.processor_count` on JRuby and `1` on CRuby. This is the
47
+ *consumer* parallelism for messages this server has already received; it does not change how many topics are subscribed
48
+ to or the queue-group delivery semantics. Minimum of 1.
49
+
50
+ `PB_NATS_SERVER_SLOW_HANDLER_THRESHOLD_MS` - If set (> 0), emit `server.slow_handler` when a handler runs longer than this
51
+ many milliseconds. Informational/SLA only — handlers are never aborted (default: 0, off).
52
+
53
+ `PB_NATS_SERVER_HANDLER_OVERDUE_MS` - A handler still running past this many milliseconds is reported as "overdue"
54
+ (`server.handler_overdue` + `server.overdue_handler_count`) — i.e. the client has already given up (`response_timeout`)
55
+ so the work is orphaned. Defaults above the client's 60s response timeout so legitimate long operations are not flagged
56
+ (default: 65000). **This should track your clients' `PB_NATS_CLIENT_RESPONSE_TIMEOUT`**: set it to roughly that value, converted from seconds to milliseconds (plus
57
+ a small grace). If clients use a longer response timeout, raise this so handlers aren't flagged overdue while a client is
58
+ still waiting; if shorter, lower it so orphaned work is surfaced promptly.
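
The relationship between the two settings is plain arithmetic, because the client timeout is given in seconds and this threshold in milliseconds. A minimal sketch, assuming a 5 second grace (inferred from the 65000 default against the 60 second client default) and a hypothetical helper name, `handler_overdue_ms`, that is not part of the gem:

```ruby
# Hypothetical helper, not part of protobuf-nats: derive a value for
# PB_NATS_SERVER_HANDLER_OVERDUE_MS from the client response timeout.
def handler_overdue_ms(response_timeout_seconds, grace_ms: 5_000)
  # The client timeout is configured in seconds; the server threshold is in ms.
  (response_timeout_seconds * 1000).to_i + grace_ms
end

puts handler_overdue_ms(60)   # clients on the default 60 second timeout
puts handler_overdue_ms(120)  # clients that wait two minutes
```

The grace value is a judgment call; it only needs to be large enough that a handler finishing just as the client times out is not reported as orphaned.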
43
59
 
44
60
  `PB_NATS_CLIENT_ACK_TIMEOUT` - Seconds to wait for an ACK from the rpc server (default: 5 seconds).
45
61
 
@@ -50,7 +66,11 @@ used to allow JVM based servers to warm-up slowly to prevent jolts in runtime pe
50
66
 
51
67
  `PB_NATS_CLIENT_RESPONSE_TIMEOUT` - Seconds to wait for a non-ACK response from the rpc server (default: 60 seconds).
52
68
 
53
- `PB_NATS_CLIENT_RECONNECT_DELAY` - If we detect a reconnect delay, we will wait this many seconds (default: the ACK timeout).
69
+ `PB_NATS_CLIENT_RECONNECT_DELAY` - When a request hits a transient transport error (e.g. the NATS connection drops or is reset), the client sleeps this many seconds before retrying to give the connection time to re-establish (default: the ACK timeout). See [Resilience](#resilience).
70
+
71
+ `PB_NATS_CLIENT_RECONNECT_DELAY_SPLAY_LIMIT` - Random jitter (milliseconds, `0..limit`) added to the reconnect delay so a fleet hitting the same NATS outage does not reconnect in lockstep (default: 1000). Set to 0 to disable jitter.
72
+
73
+ `PB_NATS_CLIENT_MAX_RETRIES` - Number of attempts for ack-timeouts and transient transport errors before raising (default: 3).
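
How the three settings above interact can be sketched as a bounded retry loop with jitter. This is an illustration only, not the gem's code: the method name `with_transport_retries`, the `RETRYABLE` list (a subset of the errors named in the changelog), and the injectable `sleeper` are all assumptions made for the example.

```ruby
# Illustrative subset of transient transport errors (assumption; the gem keeps
# its own list in Errors::RETRYABLE_TRANSPORT_ERRORS).
RETRYABLE = [EOFError, IOError, Errno::ECONNRESET, Errno::EPIPE].freeze

# Hypothetical helper: run the block, retrying transient errors up to
# max_retries total attempts, sleeping reconnect_delay plus random jitter.
def with_transport_retries(max_retries: 3, reconnect_delay: 5.0,
                           splay_limit_ms: 1000, sleeper: Kernel.method(:sleep))
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *RETRYABLE
    raise if attempts >= max_retries # bounded: give up after the last attempt
    # Jitter in 0..limit milliseconds so a fleet does not reconnect in lockstep.
    jitter = splay_limit_ms.positive? ? rand(0..splay_limit_ms) / 1000.0 : 0.0
    sleeper.call(reconnect_delay + jitter)
    retry
  end
end

# Usage: the first two attempts fail, the third succeeds.
result = with_transport_retries(reconnect_delay: 0.0, splay_limit_ms: 0) do |attempt|
  raise EOFError, "connection dropped" if attempt < 3
  :ok
end
puts result
```

Passing the sleep function in as an argument keeps the sketch testable without real delays; setting `splay_limit_ms` to 0 mirrors the documented way to disable jitter.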
54
74
 
55
75
  `PB_NATS_CLIENT_SUBSCRIPTION_POOL_SIZE` - If subscription pooling is desired for the request/response cycle then the pool size maximum should be set; the pool is lazy and therefore will only start new subscriptions as necessary (default: 0)
56
76
 
@@ -93,6 +113,8 @@ An example config looks like this:
93
113
  - "original_service": "replacement_service"
94
114
  ```
95
115
 
116
+ When `uses_tls` is set, the client negotiates TLS with a floor of 1.2 and a ceiling of 1.3: it uses TLS 1.3 where the NATS server supports it and falls back to 1.2 otherwise (verified on JRuby 9.4 and 10.0).
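
A floor and ceiling of this kind is typically expressed on a Ruby `OpenSSL::SSL::SSLContext` with `min_version` and `max_version`, which replace the deprecated `ssl_version=` pin. The sketch below shows only that part; the helper name `build_tls_context` is hypothetical, and how the gem passes the context to nats-pure is not shown in the source, so it is left out.

```ruby
require "openssl"

# Hypothetical helper: build a context that negotiates TLS 1.3 when the peer
# supports it and falls back to TLS 1.2 otherwise.
def build_tls_context
  ctx = OpenSSL::SSL::SSLContext.new
  ctx.min_version = OpenSSL::SSL::TLS1_2_VERSION # floor
  ctx.max_version = OpenSSL::SSL::TLS1_3_VERSION # ceiling
  ctx
end

ctx = build_tls_context
puts ctx.class
```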
117
+
96
118
  ## Usage
97
119
 
98
120
  This library is designed to be an alternative transport implementation used by the `protobuf` gem. In order to make
@@ -162,13 +184,64 @@ If we were to add another service endpoint called `search` to the `UserService`
162
184
  - **ResponseMuxer** (`lib/protobuf/nats/response_muxer.rb`) — the client uses a single wildcard subscription to multiplex
163
185
  all RPC responses (similar to the Golang NATS client) instead of subscribing/unsubscribing per request. One or more
164
186
  dispatcher threads drain the shared subscription and route each reply to the waiting caller via a `Concurrent::Map`,
165
- keyed by a UUIDv7 request token. Tune the dispatcher count with `PB_NATS_RESPONSE_MUXER_DISPATCHERS`.
187
+ keyed by a UUIDv7 request token. Tune the dispatcher count with `PB_NATS_RESPONSE_MUXER_DISPATCHERS`. Slow-consumer
188
+ protection on the response subscription is by message count (the queue depth); the dispatch hot path does no per-message
189
+ locking. Dispatcher threads self-heal: a crashed dispatcher is restarted with exponential backoff (capped at 60s) that
190
+ decays once healthy.
166
191
  - **SuperSubscriptionManager** (`lib/protobuf/nats/super_subscription_manager.rb`) — the server manages the lifecycle of
167
- RPC endpoint subscriptions, including slow start, pausing, and resubscription.
192
+ RPC endpoint subscriptions (NATS queue groups, so each request is delivered to one consumer), including slow start,
193
+ pausing, and resubscription. All subscriptions feed one shared intake queue drained by `PB_NATS_SERVER_SUBSCRIPTION_HANDLERS`
194
+ handler threads, so a slow ACK publish on one message can't head-of-line block every other subject. Handler threads
195
+ self-heal with exponential backoff.
196
+ - **Server observability** — beyond the thread-pool gauges, the server emits in-flight handler metrics
197
+ (`server.inflight_count`, `server.inflight_oldest_age_ms`, `server.overdue_handler_count`, `server.handler_overdue`,
198
+ `server.pending_intake_queue_size`, `server.thread_pool_saturated`). Long-running handlers are allowed and never aborted;
199
+ a handler is only "overdue" once it outlives the client's `response_timeout` (see `PB_NATS_SERVER_HANDLER_OVERDUE_MS`).
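
The self-healing backoff mentioned for the dispatcher and handler threads can be pictured as a capped exponential delay derived from the crash counter. A minimal sketch, assuming a 0.5 second base delay (the source states only the 60 second cap) and a hypothetical function name:

```ruby
# Hypothetical: seconds to wait before restarting a crashed thread.
# The base value is an assumption; only the 60s cap comes from the text above.
def backoff_seconds(crash_count, base: 0.5, cap: 60.0)
  return 0.0 if crash_count <= 0 # healthy: no delay
  [base * (2**(crash_count - 1)), cap].min
end

(0..8).each { |n| puts "crashes=#{n} delay=#{backoff_seconds(n)}s" }
```

Because the delay is a pure function of the counter, lost counter updates translate directly into a wrong delay, which is why the counter's thread safety matters.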
200
+
201
+ ## Resilience
202
+
203
+ The client is built to ride out transient NATS hiccups rather than surface them as request failures:
204
+
205
+ - **Transient transport errors are retried.** If a request hits a dropped/reset/closed connection (`EOFError`,
206
+ `IOError`, `Errno::ECONNRESET`/`EPIPE`/`ECONNREFUSED`/`ETIMEDOUT`, `NATS::IO::ConnectionClosedError`, or a Java
207
+ `IOException` on JRuby — see `Errors::RETRYABLE_TRANSPORT_ERRORS`), the client sleeps `PB_NATS_CLIENT_RECONNECT_DELAY`
208
+ and retries (up to 3 attempts) while `nats-pure` re-establishes the connection in the background.
209
+ - **Missing ACKs and NACKs are retried** with their own timeouts/backoff (`PB_NATS_CLIENT_ACK_TIMEOUT`,
210
+ `PB_NATS_CLIENT_NACK_BACKOFF_INTERVALS`).
211
+ - **Server-side failures fail the caller fast.** If the server cannot process a request after it has ACKed, it publishes
212
+ an encoded RPC error response so the client raises immediately instead of blocking until `PB_NATS_CLIENT_RESPONSE_TIMEOUT`.
213
+ - **The response dispatcher self-heals.** A crashed muxer dispatcher restarts with exponential backoff, and a brief
214
+ subscription-restart window won't busy-spin the dispatch loop.
215
+
216
+ See `bench/muxer_resilience_bench.rb` for microbenchmarks of the dispatch hot path and these resilience paths.
217
+
218
+ ## Delivery semantics (at-least-once)
219
+
220
+ **Current design choice:** RPC delivery is **at-least-once**, and the gem does **not** deduplicate requests. The resilience features above are the reason: when the client retries on an ACK/response timeout or a transient transport error, the server may have *already received and processed* the original request, so a single client call can run a handler **more than once**. (NATS queue groups guarantee each *delivered* message goes to one consumer, but they do not prevent the client from re-sending after a timeout.)
221
+
222
+ The gem deliberately favors at-least-once over at-most-once: dropping work on a transient blip is usually worse than occasionally repeating it. Making this safe is therefore the **service author's responsibility** — handlers that have side effects should be written to be idempotent:
223
+
224
+ - Key writes on a natural/business id or a client-supplied idempotency token (upsert / `find_or_create`) rather than blind inserts.
225
+ - Make external side effects (charges, emails, downstream RPCs) safe to repeat, or guard them with your own dedup keyed on a request id you put in the message.
226
+ - Naturally idempotent operations (reads, idempotent upserts) need no special handling.
227
+
228
+ **Why no built-in dedup (yet):** correct dedup across a horizontally-scaled service requires a *shared* store (a retry can land on a different server instance), a tuned TTL, and a cached response to replay on duplicates — and it only helps RPCs that aren't already idempotent. A future, **opt-in per-RPC** dedup with a pluggable store may be added; it will not be the default. Until then, treat handlers as potentially re-run.
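
As an illustration of the idempotency-token approach described above, the sketch below runs a side effect once per token and replays the stored result on a retry. Everything in it is hypothetical (the `IdempotentResults` class, the token values, the in-memory hash); a real horizontally-scaled service would need a shared store with a TTL, as noted above.

```ruby
# Hypothetical in-process store. It only demonstrates the pattern; a retry that
# lands on a different server instance would not see this hash.
class IdempotentResults
  def initialize
    @results = {}
    @mutex = Mutex.new
  end

  # Runs the block at most once per token; later calls replay the first result.
  # Holding the mutex while the block runs keeps the sketch simple but
  # serializes all handlers, which a real store would avoid.
  def fetch_or_run(token)
    @mutex.synchronize do
      return @results[token] if @results.key?(token)
      @results[token] = yield
    end
  end
end

results = IdempotentResults.new
charges = 0
handler = ->(token) { results.fetch_or_run(token) { charges += 1; "charged" } }

handler.call("req-1")
handler.call("req-1") # client retry of the same request
puts charges
```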
168
229
 
169
230
  ## Future Improvements (locked behind ruby version)
170
231
  - Migrate from the `uuid7` gem to native `Random#uuid_v7` once the minimum Ruby version supports it (see `UUIDv7Helper`).
171
232
 
233
+ ## Benchmarks
234
+
235
+ Microbenchmarks live in `bench/` and measure both the old and new behavior in one process (no NATS server required). See `bench/bench.md` for details. Highlights on JRuby:
236
+
237
+ - `bench/muxer_resilience_bench.rb`: response-muxer dispatch hot path (~2.7× faster per message with the per-message lock removed), restart-window resilience, and crash-counter accuracy.
238
+ - `bench/server_intake_bench.rb`: server intake fan-out (~8.5× throughput at 8 handlers, head-of-line stall ~505ms → ~0.4ms) and the handler-exhaustion observability.
239
+
240
+ ```
241
+ bundle exec ruby -Ilib bench/server_intake_bench.rb
242
+ bundle exec ruby -Ilib bench/muxer_resilience_bench.rb
243
+ ```
244
+
172
245
  ## Development
173
246
 
174
247
  After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
data/bench/bench.md CHANGED
@@ -1,16 +1,96 @@
1
+ ## Benchmarks
1
2
 
2
- Notes:
3
- `-Xjit.threshold=0` - Setting the threshold to 0 forces JRuby to compile every method into Java bytecode immediately before its very first execution. This is particularly useful for debugging or bypassing warm-up times during profiling
3
+ - `bench/concurrency_bench.rb` — end-to-end hot-path throughput (muxer round-trip,
4
+ subscription-key cache, thread pool) across thread counts. No NATS server needed.
5
+ - `bench/muxer_resilience_bench.rb` — measures the response-muxer hot-path and
6
+ self-healing fixes (both old/baseline and new/patched behavior in one process):
7
+ - **A. Dispatch hot-path** — per-message `pending_size` accounting that was
8
+ removed; the dispatch step is ~**2.7× faster** per message on JRuby once the
9
+ per-message subscription lock is gone.
10
+ - **B. nil-`@resp_sub` resilience** — during a restart window the old loop
11
+ busy-spun (a `NoMethodError` + logged error/callback every iteration); the new
12
+ loop parks, doing **~0.2%** of the old wasted work and emitting **0** errors.
13
+ - **C. Self-healing crash counter** — a plain Integer mutated by N dispatcher
14
+ threads loses ~**45%** of updates on JRuby (corrupting the exponential backoff);
15
+ the `Concurrent::AtomicFixnum` replacement loses none.
4
16
 
17
+ Run: `bundle exec ruby -Ilib bench/muxer_resilience_bench.rb`
18
+ - `bench/server_intake_bench.rb` — server intake fan-out + handler observability
19
+ (old single-handler vs new N-handler intake, in one process):
20
+ - **A. Intake throughput** — with a per-ACK publish cost, N drain threads scale
21
+ intake ~linearly (measured **~8.5×** at 8 handlers on JRuby vs the old single
22
+ intake thread).
23
+ - **B. Head-of-line blocking** — behind one slow (0.5s) publish, 50 quick
24
+ messages finished in **~505ms** with one handler vs **~0.4ms** with N.
25
+ - **C. Observability demo** — with hung handlers the new notifications report
26
+ `inflight_count` / `inflight_oldest_age_ms` / `overdue_handler_count` and fire
27
+ `server.handler_overdue`, where before only `server.message_dropped` was visible.
5
28
 
6
- `-Xjit.threshold=10 -J-XX:CompileThreshold=10` - If you are running benchmarks and want both JRuby and the JVM to aggressively optimize early, you can lower both thresholds simultaneously
29
+ Run: `bundle exec ruby -Ilib bench/server_intake_bench.rb`
30
+ - `bench/soak.rb` — opt-in soak/chaos test: spawns its own `nats-server`, runs a
31
+ real protobuf-nats server + client in-process under sustained concurrency
32
+ (including deliberately long handlers), bounces the nats-server mid-run, and
33
+ asserts recovery (≥90% success) while reporting the resilience signals. Skips
34
+ if `nats-server` isn't on PATH.
7
35
 
8
- `bundle; bx ruby -I lib bench/real_client.rb`
36
+ Run: `SOAK_DURATION=20 SOAK_BOUNCES=3 bundle exec ruby -Ilib bench/soak.rb`
9
37
 
10
- Start local nats server so details can be monitored.
11
- `/opt/homebrew/opt/nats-server/bin/nats-server -DV -m 8222 -p 4222`
38
+ ---
12
39
 
40
+ ## Running benchmarks (warm + reliable)
41
+
42
+ These numbers are meaningless cold. On JRuby the JVM has to load classes and JIT-compile the hot paths before it reaches steady state, so the first second(s) of any run are far slower than production. Always warm up, repeat, and compare like-for-like.
43
+
44
+ ### 1. Use the production engine
45
+
46
+ Run on JRuby (what production uses); CRuby numbers differ because the GVL serializes the parallelism these benches exercise.
47
+
48
+ ```
49
+ rbenv shell jruby-9.4.14.0 # or your deployed JRuby
50
+ ruby -v # confirm engine before trusting any number
51
+ ```
52
+
53
+ ### 2. Benchmarking JRUBY_OPTS
54
+
55
+ Fix the heap so GC resizing doesn't jitter the run, give the young gen room, and don't block on entropy:
56
+
57
+ ```
58
+ export JRUBY_OPTS="-J-Xms4g -J-Xmx4g -J-Xmn1g --disable:did_you_mean -J-Djava.security.egd=file:/dev/./urandom"
59
+ ```
60
+
61
+ - Set `-Xms == -Xmx` so the heap never resizes mid-measurement.
62
+ - Do **not** use `--dev` for benchmarking — it disables the JIT for fast startup and will understate performance.
63
+ - Optional faster warmup (compile sooner): add `-Xjit.threshold=10 -J-XX:CompileThreshold=10`. `-Xjit.threshold=0` forces immediate compilation — useful for profiling, but prefer real warmup for representative steady-state numbers.
64
+
65
+ ### 3. Warm up, then measure
66
+
67
+ - `muxer_resilience_bench.rb` section A uses **benchmark-ips**, which warms up on its own (warmup then a timed window) — no extra flags needed.
68
+ - The loop-driven benches (`concurrency_bench.rb`, and the throughput sections of `server_intake_bench.rb`) measure a fixed window. Give them a real warmup and a longer window:
69
+
70
+ ```
71
+ BENCH_WARMUP=5 BENCH_DURATION=10 BENCH_THREADS=1,4,8,16 bundle exec ruby -Ilib bench/concurrency_bench.rb
72
+ ```
73
+
74
+ ### 4. Repeat and take the median
75
+
76
+ JVM warmup and machine noise make any single run unreliable. Run each bench **3+ times**, discard the first (cold class-load/JIT), and report the **median**. Keep the machine quiet (close other apps, disable CPU throttling / keep laptops on AC) and run one bench at a time.
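
The "discard the first run, report the median" rule can be applied mechanically. A small sketch, assuming you have collected one number per run (for example acks/sec) into an array; the helper name `median_after_warmup` and the sample figures are made up for illustration:

```ruby
# Hypothetical helper: drop the first (cold) run and return the median of the rest.
def median_after_warmup(samples)
  kept = samples.drop(1).sort
  raise ArgumentError, "need at least 2 samples" if kept.empty?
  mid = kept.size / 2
  kept.size.odd? ? kept[mid] : (kept[mid - 1] + kept[mid]) / 2.0
end

runs = [41_000, 160_500, 158_200, 162_900] # illustrative numbers, first run is cold
puts median_after_warmup(runs)
```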
77
+
78
+ ### Per-script tuning knobs
79
+
80
+ | Script | Env knobs (defaults) |
81
+ | --- | --- |
82
+ | `concurrency_bench.rb` | `BENCH_DURATION` (4), `BENCH_WARMUP` (2), `BENCH_THREADS` (`1,4,8,16`), `BENCH_POOL_WORKERS` (8) |
83
+ | `muxer_resilience_bench.rb` | none — benchmark-ips controls warmup/time |
84
+ | `server_intake_bench.rb` | `BENCH_HANDLERS` (cores), `BENCH_MSGS` (20000), `BENCH_PUBLISH_LATENCY_US` (50) |
85
+ | `soak.rb` | `SOAK_DURATION` (15), `SOAK_THREADS` (12), `SOAK_BOUNCES` (2), `SOAK_NATS_PORT` (4299) |
86
+
87
+ ### Real end-to-end run (optional, needs a NATS server)
88
+
89
+ `bench/real_client.rb` drives the example app against a live server. Start a local nats-server (with monitoring) first:
13
90
 
14
91
  ```
15
- export JRUBY_OPTS="--disable:did_you_mean -J-Djava.security.egd=file:/dev/./urandom -J-Xmx2g -J-Xms1024m -J-Xmn512m -Xjit.threshold=10 -J-XX:CompileThreshold=10"
92
+ nats-server -DV -m 8222 -p 4222 # or: /opt/homebrew/opt/nats-server/bin/nats-server ...
93
+ bundle exec ruby -Ilib bench/real_client.rb
16
94
  ```
95
+
96
+ `bench/soak.rb` spawns and bounces its own throwaway nats-server, so it needs only the `nats-server` binary on PATH (it self-skips otherwise).
@@ -0,0 +1,151 @@
1
+ # Benchmarks for the response-muxer hot-path and self-healing changes.
2
+ #
3
+ # This file measures BOTH the old (baseline) and new (patched) behavior in one
4
+ # process so the speedup/robustness delta is reproducible on CRuby and JRuby
5
+ # without a NATS server:
6
+ #
7
+ # A. Dispatch hot-path cost -- per-message pending_size accounting that was
8
+ # removed (#1). benchmark-ips, higher ips is better.
9
+ # B. nil-@resp_sub resilience -- busy-spin vs park during a restart window (#3).
10
+ # C. Self-healing counter -- lost updates with a plain int vs AtomicFixnum
11
+ # under concurrent crashes (#4).
12
+ #
13
+ # Usage:
14
+ # bundle exec ruby -Ilib bench/muxer_resilience_bench.rb
15
+
16
+ require "bundler/setup"
17
+ require "benchmark/ips"
18
+ require "concurrent"
19
+ require "nats/client" # real NATS::Subscription / NATS::Msg
20
+
21
+ def mono
22
+ ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
23
+ end
24
+
25
+ puts "=" * 72
26
+ puts "protobuf-nats response-muxer resilience bench"
27
+ puts "engine=#{RUBY_ENGINE} #{RUBY_VERSION} processor_count=#{::Concurrent.processor_count}"
28
+ puts "=" * 72
29
+
30
+ # --------------------------------------------------------------------------
31
+ # A. Dispatch hot-path: per-message pending_size accounting (removed in #1).
32
+ #
33
+ # Old dispatch did `sub.synchronize { sub.pending_size -= msg.data.size }` for
34
+ # EVERY response message; the new code does nothing here. We compare the old
35
+ # accounting step against the cheapest real per-message op (a Concurrent::Map
36
+ # lookup, which the dispatcher still does) so the delta is the lock overhead we
37
+ # removed from the hot path.
38
+ # --------------------------------------------------------------------------
39
+ puts "\nA. Dispatch hot-path per-message overhead (higher ips = better)\n\n"
40
+
41
+ sub = ::NATS::Subscription.new
42
+ sub.pending_size = 0
43
+ resp_map = ::Concurrent::Map.new
44
+ resp_map["tok"] = { :queue => ::Queue.new }
45
+ size = 64
46
+
47
+ Benchmark.ips do |x|
48
+ x.config(:time => 3, :warmup => 1)
49
+
50
+ x.report("old: synchronize { pending_size -= n } + map lookup") do
51
+ sub.synchronize { sub.pending_size -= size }
52
+ resp_map["tok"]
53
+ end
54
+
55
+ x.report("new: map lookup only (accounting removed)") do
56
+ resp_map["tok"]
57
+ end
58
+
59
+ x.compare!
60
+ end
61
+
62
+ # --------------------------------------------------------------------------
63
+ # B. nil-@resp_sub resilience (#3). During a restart @resp_sub can briefly be
64
+ # nil. The old loop dereferenced it unconditionally (NoMethodError every
65
+ # iteration -> busy-spin + a logged error/callback per spin); the new loop
66
+ # parks. We run each for a fixed window and count iterations and "errors that
67
+ # would be logged/dispatched to callbacks".
68
+ # --------------------------------------------------------------------------
69
+ puts "\nB. Behavior while @resp_sub is nil for #{(WINDOW = 0.5)}s (lower spin = better)\n\n"
70
+
71
+ def run_old_loop(window)
72
+ resp_sub = nil # the restart window
73
+ iters = 0
74
+ errors = 0
75
+ deadline = mono + window
76
+ while mono < deadline
77
+ begin
78
+ resp_sub.pending_queue.pop # NoMethodError on nil
79
+ rescue => _e
80
+ errors += 1 # old code logs + notify_error_callbacks here
81
+ end
82
+ iters += 1
83
+ end
84
+ [iters, errors]
85
+ end
86
+
87
+ def run_new_loop(window)
88
+ resp_sub = nil
89
+ iters = 0
90
+ errors = 0
91
+ deadline = mono + window
92
+ while mono < deadline
93
+ s = resp_sub
94
+ if s.nil?
95
+ sleep 0.01 # park instead of spinning
96
+ iters += 1
97
+ next
98
+ end
99
+ begin
100
+ s.pending_queue.pop
101
+ rescue => _e
102
+ errors += 1
103
+ end
104
+ iters += 1
105
+ end
106
+ [iters, errors]
107
+ end
108
+
109
+ old_iters, old_errs = run_old_loop(WINDOW)
110
+ new_iters, new_errs = run_new_loop(WINDOW)
111
+
112
+ printf(" old loop: %12d iterations, %12d errors logged/dispatched\n", old_iters, old_errs)
113
+ printf(" new loop: %12d iterations, %12d errors logged/dispatched\n", new_iters, new_errs)
114
+ printf(" => new loop does %.5f%% of the old loop's wasted work\n",
115
+ old_iters.zero? ? 0.0 : (new_iters.to_f / old_iters * 100))
116
+
117
+ # --------------------------------------------------------------------------
118
+ # C. Self-healing crash counter (#4). The old counter was a plain Integer
119
+ # mutated by multiple dispatcher threads (`@crash_count = (@crash_count||0)+1`),
120
+ # which loses updates under true parallelism, corrupting the exponential
121
+ # backoff. The new counter is a Concurrent::AtomicFixnum. We have N threads each
122
+ # "crash" K times and check the final count.
123
+ # --------------------------------------------------------------------------
124
+ puts "\nC. Crash-counter accuracy under concurrent crashes (expected == actual is correct)\n\n"
125
+
126
+ def hammer(counter, threads, per_thread)
127
+ ts = threads.times.map do
128
+ ::Thread.new do
129
+ per_thread.times { counter.call }
130
+ end
131
+ end
132
+ ts.each(&:join)
133
+ end
134
+
135
+ threads = [::Concurrent.processor_count, 4].max
136
+ per_thread = 50_000
137
+ expected = threads * per_thread
138
+
139
+ # Old: plain integer read-modify-write (racy).
140
+ plain = 0
141
+ hammer(->{ plain = plain + 1 }, threads, per_thread)
142
+
143
+ # New: atomic increment.
144
+ atomic = ::Concurrent::AtomicFixnum.new(0)
145
+ hammer(->{ atomic.increment }, threads, per_thread)
146
+
147
+ printf(" threads=%d per_thread=%d expected=%d\n", threads, per_thread, expected)
148
+ printf(" old plain Integer: %10d (lost %d updates)\n", plain, expected - plain)
149
+ printf(" new AtomicFixnum: %10d (lost %d updates)\n", atomic.value, expected - atomic.value)
150
+
151
+ puts "\ndone."
@@ -0,0 +1,158 @@
1
+ # Benchmarks for the server intake fan-out (#1) and handler observability (#2).
2
+ #
3
+ # Models old (1 intake handler) vs new (N intake handlers) in one process, plus
4
+ # a demonstration of the #2 in-flight observability. No NATS server required.
5
+ #
6
+ # A. Intake throughput -- acks/sec with 1 vs N drain threads when each ACK
7
+ # publish has some latency (the real bottleneck).
8
+ # B. Head-of-line blocking -- how long other subjects stall behind one slow
9
+ # publish with 1 vs N handlers.
10
+ # C. Observability demo -- with hung handlers, the new server notifications
11
+ # surface the saturation/overdue work that was
12
+ # previously invisible (only message_dropped).
13
+ #
14
+ # Usage:
15
+ # bundle exec ruby -Ilib bench/server_intake_bench.rb
16
+
17
+ require "bundler/setup"
18
+ require "concurrent"
19
+ require "nats/client" # real NATS::Subscription / NATS::Msg
20
+ require "protobuf/nats"
21
+
22
+ ::Protobuf::Logging.logger = ::Logger.new(nil)
23
+
24
+ def mono
25
+ ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
26
+ end
27
+
28
+ HANDLERS = Integer(ENV.fetch("BENCH_HANDLERS", [::Concurrent.processor_count, 4].max.to_s))
29
+ MSGS = Integer(ENV.fetch("BENCH_MSGS", "20000"))
30
+ PUBLISH_LAT_US = Integer(ENV.fetch("BENCH_PUBLISH_LATENCY_US", "50")) # per-ACK publish latency
31
+
32
+ puts "=" * 72
33
+ puts "protobuf-nats server intake bench"
34
+ puts "engine=#{RUBY_ENGINE} #{RUBY_VERSION} processor_count=#{::Concurrent.processor_count}"
35
+ puts "handlers(new)=#{HANDLERS} msgs=#{MSGS} publish_latency=#{PUBLISH_LAT_US}us"
36
+ puts "=" * 72
37
+
+ # --------------------------------------------------------------------------
+ # Shared intake model: a SizedQueue fed with `total` messages, drained by
+ # `handlers` threads. Each message does light work + an ACK "publish" that
+ # costs `publish_latency` seconds (the part that serialized on one thread before).
+ # --------------------------------------------------------------------------
+ def drain(handlers, total, publish_latency)
+   queue = ::SizedQueue.new(total + handlers)
+   total.times { queue.push(:msg) }
+   handlers.times { queue.push(:stop) }
+   processed = ::Concurrent::AtomicFixnum.new(0)
+
+   t0 = mono
+   threads = handlers.times.map do
+     ::Thread.new do
+       loop do
+         m = queue.pop
+         break if m == :stop
+         sleep(publish_latency) if publish_latency.positive?
+         processed.increment
+       end
+     end
+   end
+   threads.each(&:join)
+   elapsed = mono - t0
+   { per_sec: processed.value / elapsed, elapsed: elapsed }
+ end
+
+ puts "\nA. Intake throughput (acks/sec; higher is better)\n\n"
+ lat = PUBLISH_LAT_US / 1_000_000.0
+ old = drain(1, MSGS, lat)
+ new = drain(HANDLERS, MSGS, lat)
+ printf(" old (1 handler): %12.0f acks/s (%.2fs)\n", old[:per_sec], old[:elapsed])
+ printf(" new (%d handlers): %12.0f acks/s (%.2fs)\n", HANDLERS, new[:per_sec], new[:elapsed])
+ printf(" => %.2fx faster intake\n", new[:per_sec] / old[:per_sec])
+
+ # --------------------------------------------------------------------------
+ # B. Head-of-line blocking: one slow publish is enqueued first, followed by
+ # `fast_count` quick messages. Measure how long until all the quick messages
+ # finish. With one handler they wait behind the slow publish; with N they don't.
+ # --------------------------------------------------------------------------
+ def head_of_line(handlers, slow_latency, fast_count)
+   queue = ::SizedQueue.new(fast_count + 1 + handlers)
+   queue.push(:slow)
+   fast_count.times { queue.push(:fast) }
+   handlers.times { queue.push(:stop) }
+
+   fast_done = ::Concurrent::AtomicFixnum.new(0)
+   last_fast_at = ::Concurrent::AtomicReference.new(nil)
+
+   start = mono
+   threads = handlers.times.map do
+     ::Thread.new do
+       loop do
+         m = queue.pop
+         break if m == :stop
+         if m == :slow
+           sleep slow_latency
+         else
+           last_fast_at.set(mono) if fast_done.increment == fast_count
+         end
+       end
+     end
+   end
+   threads.each(&:join)
+   (last_fast_at.get || mono) - start
+ end
+
+ puts "\nB. Head-of-line blocking behind one slow (0.5s) publish (lower is better)\n\n"
+ slow = 0.5
+ old_b = head_of_line(1, slow, 50)
+ new_b = head_of_line(HANDLERS, slow, 50)
+ printf(" old (1 handler): 50 quick messages finished after %6.1f ms (stuck behind the slow publish)\n", old_b * 1000)
+ printf(" new (%d handlers): 50 quick messages finished after %6.1f ms (unaffected)\n", HANDLERS, new_b * 1000)
+
+ # --------------------------------------------------------------------------
+ # C. #2 observability demo: hung handlers occupy the pool. Previously operators
+ # only saw `message_dropped`; now the in-flight gauges + overdue event explain why.
+ # --------------------------------------------------------------------------
+ puts "\nC. Handler-exhaustion observability (what an operator now sees)\n\n"
+
+ ENV["PB_NATS_SERVER_SUBSCRIPTION_HANDLERS"] = "1"
+ ENV["PB_NATS_SERVER_HANDLER_OVERDUE_MS"] = "100"
+
+ class DemoNats
+   def connect(*); end
+   def new_inbox; "_INBOX.demo"; end
+   def subscribe(_s, *_a)
+     sub = ::NATS::Subscription.new
+     sub.pending_queue = ::SizedQueue.new(1024)
+     sub
+   end
+   def publish(*); end
+   def flush(*); end
+   %i[on_disconnect on_reconnect on_close on_error].each { |m| define_method(m) { |*| } }
+   def close; end
+ end
+
+ server = ::Protobuf::Nats::Server.new(:threads => 4, :client => DemoNats.new, :server => "bench")
+ release = ::Queue.new
+ server.define_singleton_method(:handle_request) { |*_| release.pop; "" }
+
+ gauges = {}
+ %w[inflight_count inflight_oldest_age_ms overdue_handler_count handler_overdue pending_intake_queue_size].each do |name|
+   ::ActiveSupport::Notifications.subscribe("server.#{name}.protobuf-nats") { |_, _, _, _, v| gauges[name] = v }
+ end
+
+ 4.times { |i| server.enqueue_request("req#{i}", "inbox#{i}") } # all 4 pool slots now hung
+ sleep 0.15 # exceed the 100ms overdue window
+ server.enqueue_request("req5", "inbox5") # pool full -> NACK + saturated
+ server.instrument_inflight_handlers
+
+ printf(" inflight_count = %s (handlers stuck on the downstream)\n", gauges["inflight_count"])
+ printf(" inflight_oldest_age_ms = %.0f\n", gauges["inflight_oldest_age_ms"] || 0)
+ printf(" overdue_handler_count = %s (client already gave up on these)\n", gauges["overdue_handler_count"])
+ printf(" handler_overdue fired = %s\n", gauges.key?("handler_overdue"))
+ puts " (previously: only server.message_dropped, with no hint that handlers were stuck)"
+
+ release << :go until release.num_waiting.zero?
+ 4.times { release << :go }
+
+ puts "\ndone."