waterdrop 2.10.0 → 2.10.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 5a840a99425c1700eb3ea2cad5da08279e11ba0f4a2600b046dcef14bca9255b
4
- data.tar.gz: 6107f58c3ed66912e56379660a021eb87c1e18a0d7fee7702460c1e90b75fea3
3
+ metadata.gz: 1329d22d7f4f960b2df24949a040dd1e1b57ec73002ed779f3bcd0c5ad4dad20
4
+ data.tar.gz: b2b9868379d00dd3951df5a2e3b45bdce1b24295284fd5cb6cca410709a38a05
5
5
  SHA512:
6
- metadata.gz: 454ff01bc3baa3c2b47c46c6538dd8310c3b2af10cafe3551f00630e78d5980b7afeec7eb2047f769b24d0d6106aba4366e04e99d91045154a4ddb0b34edd4b1
7
- data.tar.gz: '0980ac5585f18983d4d6918f99cf8bf070160d5c99964459c714b532195d74372785dd9b3e40366fae709f92aea9c364e2b71d33c3ffadde0528f60e863d8348'
6
+ metadata.gz: 817244d3151a463cca84a1a7fefd298c5e355337325817430cdef5f5eb1c09013d20f801f9ef95b5e9182539f8cb550f25df62042bf93bc7826e9f50ee4b1caf
7
+ data.tar.gz: c60cc27e7b3787a9610229dacdcbe0440fa7332c5659f5e61348409b4ca2c507718a5bbd93bf5e088f8cc7a5d8b91da981eac0f377257ec0c4394e20dcd92994
data/.ruby-version CHANGED
@@ -1 +1 @@
1
- 4.0.3
1
+ 4.0.5
data/CHANGELOG.md CHANGED
@@ -1,5 +1,28 @@
1
1
  # WaterDrop changelog
2
2
 
3
+ ## 2.10.2 (2026-06-15)
4
+ - [Feature] Expose `Producer#current_variant` as a public method. It returns the variant active for the current dispatch on the current fiber - the custom variant while inside a `#with`/`#variant`-wrapped call, otherwise the producer's default variant - so middleware and instrumentation listeners running synchronously within a dispatch can read the effective per-dispatch settings (`topic_config`, `max_wait_timeout`, `default?`). The lookup is fiber-local and dispatch-scoped: outside a variant-wrapped call (or from an asynchronous delivery callback) it returns the default variant.
5
+ - [Enhancement] Stop allocating one interpolated string per message in `LoggerListener` batch produce handlers. The quoted topic strings were only ever counted (quoting is a 1:1 mapping), never displayed, so counting the raw topic values yields the identical number with zero string allocations - relevant for large `produce_many_*` batches with the default logger listener attached.
6
+ - [Enhancement] Use `Array#concat` in `Producer#buffer_many` instead of appending messages one by one.
7
+ - [Enhancement] Skip building the `message.acknowledged` instrumentation payload in the delivery callback when nothing is subscribed to that event. The notifications bus already short-circuits on empty listeners, but only after the payload hash was allocated - once per delivered message on the polling thread. Mirrors the listener guard already used by the statistics callback. Late subscribers keep working as the check happens on each emission.
8
+ - [Enhancement] Resolve the fiber-local variant once per `#produce` call and once per `#produce_many_sync` wait phase instead of re-resolving it for every usage and for every waited delivery handle. For a 1,000-message sync batch this removes ~2,000 redundant fiber-local lookups.
9
+ - [Enhancement] Do not allocate the fiber-local variants hash on the `Producer#current_variant` read path. Previously every fiber that produced messages got a Hash pinned to it for the fiber's lifetime (per producer use), even when variants were never used - wasteful under fiber-per-request servers (Falcon, async). The hash is now only created by variant wrapper methods that actually need to write to it.
10
+ - [Enhancement] Cache the variant validation contract in a constant instead of instantiating a new `Contracts::Variant` on every `Producer#with` / `Producer#variant` call (mirrors the existing `Transactions::CONTRACT` pattern).
11
+ - [Enhancement] Cache the tombstone validation contract in a constant instead of instantiating a new `Contracts::Tombstone` per tombstone message, removing per-message allocations in the `tombstone_*` APIs (mirrors the existing `Transactions::CONTRACT` pattern).
12
+ - [Enhancement] Replace explicit `Warning[:performance]` opt-in with a dynamic approach using `Warning.categories` (available since Ruby 3.4) to automatically enable all stable opt-in warning categories in the test suite, including `:strict_unused_block` introduced in Ruby 4.0.
13
+ - [Fix] Prevent a deadlock between a transactional single-message dispatch and `#close`. A single `produce_sync`/`produce_async` on a transactional producer incremented the operations counter (which `#close` drains while holding `@transaction_mutex`) before acquiring `@transaction_mutex` for its per-message transaction - an inverted lock order. A dispatch that had counted itself but not yet taken `@transaction_mutex` could deadlock a concurrent `#close` permanently (the close wait loop has no timeout). Transactional dispatches now take `@transaction_mutex` before the operation is counted, matching `#close`'s lock order (`@transaction_mutex` -> `@operating_mutex` -> operations counter).
14
+ - [Fix] Prevent a deadlock (`ThreadError: deadlock; recursive locking`) when closing an idempotent producer (with `reload_on_idempotent_fatal_error` enabled) that has buffered messages whose final flush surfaces a fatal librdkafka error. `#close` performs the final flush while already holding `@operating_mutex`, and the idempotent fatal-error reload tried to re-acquire that same mutex, leaving the producer stuck in `:closing` with the native client leaked. The idempotent reload is now skipped on the closing path, and the final buffer flush is best-effort so client teardown always completes.
15
+ - [Fix] Make concurrent idempotent fatal-error reload thread-safe. When several threads shared an idempotent producer (with `reload_on_idempotent_fatal_error` enabled), a single fatal librdkafka condition failed all their in-flight produces at once and each entered the reload path; the second reload ran `reload!` after the first had already reset `@client` to `nil`, raising `NoMethodError`. The idempotent reload now bails out if another thread already reloaded (mirroring the transactional path's `return if @status.configured?` guard). Additionally, `Status#active?` now classifies the lifecycle from a single atomic read and `Producer#ensure_active!` branches on one snapshot, so a concurrent `configured -> connected` transition during a reload can no longer make `ensure_active!` raise `StatusInvalidError` for a valid, active producer.
16
+ - [Fix] Stop `#flush_async` / `#flush_sync` from silently dropping valid buffered messages when the dispatch fails. `#flush` removes the batch from the internal buffer before dispatching it, and a failure (a single invalid message failing validation before anything is sent, or a mid-batch inline error such as queue full) previously discarded the entire taken batch - the removed messages were never restored. A failed flush now re-buffers the messages that never reached librdkafka (the whole batch on validation failure or on a transactional rollback, the unsent remainder otherwise) so they can be retried instead of being lost.
17
+ - [Fix] Make `Producer#close` fork-safe so the GC finalizer inherited by a forked child can no longer close the parent's client. `#client` registers an `ObjectSpace` finalizer that calls `#close`; that finalizer is inherited across `fork`, and a child that inherited a used producer, never touched it, and exited normally would run `#close` in the child - flushing and closing (with the real rdkafka client, `rd_kafka_destroy` on a fork-inherited handle, i.e. undefined behavior) a client owned by the parent. `#close` now detects when it runs in a process other than the one that built the client, drops the inherited references and finalizer, and returns without touching the native client (matching the existing fork guard on the `#client` path).
18
+ - [Fix] Guard the internal buffer appends in `Producer#buffer` and `Producer#buffer_many` with `@buffer_mutex`. The appends mutated the shared `@messages` buffer without the lock that `flush`/`purge`/`close` hold while swapping it for a fresh array, so a concurrent swap landing between reading `@messages` and appending could drop the message into an orphaned array that is never dispatched - silently losing buffered messages in the documented "buffer in one thread, flush in another" pattern.
19
+ - [Fix] Stop a nested same-producer variant call from clobbering the outer variant inside a variant `transaction` block. `transaction` is the only variant-wrapped method that yields user code, so a variant call nested inside it (another `variant.produce_*`, or a raw producer dispatch in the same scope) used to delete the shared `Fiber.current.waterdrop_clients` entry on return, making the rest of the block silently fall back to the default variant and dispatch with default `topic_config` (timeouts, compression, partitioner) instead of the altered one. The wrapper now saves and restores the previous entry instead of unconditionally deleting it (still deleting when there was none, so the fiber-local hash does not accumulate stale keys).
20
+ - [Fix] Stop `ConnectionPool#shutdown` and `#reload` from silently dropping in-flight messages. Both closed every pooled producer with `close!` (force), which flushes for the max wait timeout and then purges whatever has not drained - so on a slow or unreachable broker, queued `produce_async` messages were cancelled and lost with no delivery report. They now close producers gracefully by default (`#reload` always; `#shutdown` unless called with the new `force: true`), letting messages flush instead of being purged. Pass `pool.shutdown(force: true)` to keep the old force-and-purge behavior.
21
+ - [Fix] Close a race in the FD poller where a producer registered while the last one was being torn down could be left permanently unpolled (sync produces hang until timeout, async deliveries are never acknowledged). The poller thread decided to exit (last producer unregistered) and cleared its thread reference in two separate, unsynchronized steps, so a `register` landing in that gap saw the still-alive exiting thread, skipped starting a fresh one, and then had its producer's state closed by the exiting thread's cleanup. The thread now decides to stop and clears its reference in a single mutex section, so a racing `register` either keeps it running or starts a fresh thread; and the exit cleanup runs only on an abnormal exit, since a normal exit always leaves an empty registry and so can never close a producer registered in the gap.
22
+
23
+ ## 2.10.1 (2026-05-25)
24
+ - [Fix] Prevent `Producer#close` from raising `ThreadError: can't be called from trap context` when invoked from a Ruby signal trap context (e.g. Puma's `after_stopped` DSL hook in single mode). `close` now detects this case and delegates to a background thread, joining it so the caller blocks until the producer is fully closed (#866).
25
+
3
26
  ## 2.10.0 (2026-05-07)
4
27
  - [Fix] Clean up native rdkafka client, global instrumentation callbacks, and poller registration when `init_transactions` fails during producer client construction. Previously, each failed attempt permanently leaked native threads, pipe file descriptors, and callback registry entries because the started `rd_kafka_t` handle was abandoned without being destroyed.
5
28
  - **[Breaking]** Skip emitting librdkafka statistics when nothing is subscribed to `statistics.emitted` at the time the underlying rdkafka client is constructed. When no listener is present at build time, `statistics.interval.ms` is forced to `0` regardless of user configuration and the statistics callback is not registered, saving substantial allocations in the hot path (no JSON parsing, no statistics hash materialization, no decoration work). To use statistics, subscribe a listener to `statistics.emitted` BEFORE the first producer use (before the underlying client is lazily initialized).
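The nested-variant fix listed under 2.10.2 comes down to saving and restoring a fiber-local entry rather than deleting it unconditionally. The following is a minimal sketch of that idea only, not WaterDrop's code: `STORE_KEY`, `current_variant`, `with_variant_clobbering` and `with_variant_restoring` are hypothetical names, and `Thread.current[]` (which is fiber-local in Ruby) stands in for the `Fiber.current.waterdrop_clients` accessor named in the changelog.

```ruby
# Hypothetical stand-in for the fiber-local variants store; not WaterDrop's real code.
# Thread#[] and Thread#[]= are fiber-local in Ruby, which is enough for this sketch.
STORE_KEY = :sketch_variants

# Returns the variant active for producer_id on the current fiber, or :default.
# The read path never creates the hash.
def current_variant(producer_id)
  store = Thread.current[STORE_KEY]
  store ? store.fetch(producer_id, :default) : :default
end

# Pre-fix shape: the entry is deleted unconditionally on return.
def with_variant_clobbering(producer_id, variant)
  store = (Thread.current[STORE_KEY] ||= {})
  store[producer_id] = variant
  yield
ensure
  store.delete(producer_id)
end

# Post-fix shape: the previous entry is restored, or deleted when there was none.
def with_variant_restoring(producer_id, variant)
  store = (Thread.current[STORE_KEY] ||= {})
  had_previous = store.key?(producer_id)
  previous = store[producer_id]
  store[producer_id] = variant
  yield
ensure
  if had_previous
    store[producer_id] = previous
  else
    store.delete(producer_id)
  end
end

clobbered = []
with_variant_clobbering(:producer, :outer) do
  with_variant_clobbering(:producer, :inner) { clobbered << current_variant(:producer) }
  clobbered << current_variant(:producer) # the outer entry is gone at this point
end

restored = []
with_variant_restoring(:producer, :outer) do
  with_variant_restoring(:producer, :inner) { restored << current_variant(:producer) }
  restored << current_variant(:producer) # the outer entry is back at this point
end
```

In the clobbering version the rest of the outer block silently sees the default variant; in the restoring version it keeps the outer one, and the store is left empty once the outermost call returns.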
data/Gemfile CHANGED
@@ -5,7 +5,7 @@ source "https://rubygems.org"
5
5
  gemspec
6
6
 
7
7
  # Relaxed from 2.7 because we support Ruby 3.1
8
- gem "zeitwerk", "~> 2.7.0"
8
+ gem "zeitwerk", "~> 2.8.0"
9
9
 
10
10
  group :development do
11
11
  gem "byebug"
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- waterdrop (2.10.0)
4
+ waterdrop (2.10.2)
5
5
  karafka-core (>= 2.5.12, < 3.0.0)
6
6
  karafka-rdkafka (>= 0.24.0)
7
7
  zeitwerk (~> 2.3)
@@ -16,11 +16,11 @@ GEM
16
16
  drb (2.2.3)
17
17
  ffi (1.17.4)
18
18
  io-console (0.8.2)
19
- json (2.19.3)
20
- karafka-core (2.5.12)
19
+ json (2.19.7)
20
+ karafka-core (2.5.13)
21
21
  karafka-rdkafka (>= 0.20.0)
22
22
  logger (>= 1.6.0)
23
- karafka-rdkafka (0.25.0)
23
+ karafka-rdkafka (0.27.2)
24
24
  ffi (~> 1.17.1)
25
25
  json (> 2.0)
26
26
  logger
@@ -35,7 +35,7 @@ GEM
35
35
  ruby2_keywords (>= 0.0.5)
36
36
  ostruct (0.6.3)
37
37
  prism (1.9.0)
38
- rake (13.3.1)
38
+ rake (13.4.2)
39
39
  reline (0.6.3)
40
40
  io-console (~> 0.5)
41
41
  ruby2_keywords (0.0.5)
@@ -45,8 +45,8 @@ GEM
45
45
  simplecov_json_formatter (~> 0.1)
46
46
  simplecov-html (0.13.2)
47
47
  simplecov_json_formatter (0.1.4)
48
- warning (1.5.0)
49
- zeitwerk (2.7.5)
48
+ warning (1.6.0)
49
+ zeitwerk (2.8.2)
50
50
 
51
51
  PLATFORMS
52
52
  ruby
@@ -60,7 +60,7 @@ DEPENDENCIES
60
60
  simplecov
61
61
  warning
62
62
  waterdrop!
63
- zeitwerk (~> 2.7.0)
63
+ zeitwerk (~> 2.8.0)
64
64
 
65
65
  CHECKSUMS
66
66
  byebug (13.0.0) sha256=d2263efe751941ca520fa29744b71972d39cbc41839496706f5d9b22e92ae05d
@@ -69,24 +69,24 @@ CHECKSUMS
69
69
  drb (2.2.3) sha256=0b00d6fdb50995fe4a45dea13663493c841112e4068656854646f418fda13373
70
70
  ffi (1.17.4) sha256=bcd1642e06f0d16fc9e09ac6d49c3a7298b9789bcb58127302f934e437d60acf
71
71
  io-console (0.8.2) sha256=d6e3ae7a7cc7574f4b8893b4fca2162e57a825b223a177b7afa236c5ef9814cc
72
- json (2.19.3) sha256=289b0bb53052a1fa8c34ab33cc750b659ba14a5c45f3fcf4b18762dc67c78646
73
- karafka-core (2.5.12) sha256=57cbb45a187fbe3df9b9a57af59dda7211f9969524b2afbb83792a64705860e1
74
- karafka-rdkafka (0.25.0) sha256=67b316b942cf9ff7e9d7bbf9029e6f2d91eba97b4c9dc93b9f49fd207dfb80f8
72
+ json (2.19.7) sha256=fe432c8639f6efff69f9d73b518a3705d9581ab93156f981ea72806e1e5bcc3e
73
+ karafka-core (2.5.13) sha256=0acec083043bb6166c4b647a7458091cc7b08066d3b92a026932925ec7e07f61
74
+ karafka-rdkafka (0.27.2) sha256=3ccce96306642be70bff8168e4e737fc10f2ffae20bc0ff0a43d88dbb7452d31
75
75
  logger (1.7.0) sha256=196edec7cc44b66cfb40f9755ce11b392f21f7967696af15d274dde7edff0203
76
76
  mini_portile2 (2.8.9) sha256=0cd7c7f824e010c072e33f68bc02d85a00aeb6fce05bb4819c03dfd3c140c289
77
77
  minitest (6.0.6) sha256=153ea36d1d987a62942382b61075745042a2b3123b1cd48f4c3675af9cc7d6f1
78
78
  mocha (3.1.0) sha256=75f42d69ebfb1f10b32489dff8f8431d37a418120ecdfc07afe3bc183d4e1d56
79
79
  ostruct (0.6.3) sha256=95a2ed4a4bd1d190784e666b47b2d3f078e4a9efda2fccf18f84ddc6538ed912
80
80
  prism (1.9.0) sha256=7b530c6a9f92c24300014919c9dcbc055bf4cdf51ec30aed099b06cd6674ef85
81
- rake (13.3.1) sha256=8c9e89d09f66a26a01264e7e3480ec0607f0c497a861ef16063604b1b08eb19c
81
+ rake (13.4.2) sha256=cb825b2bd5f1f8e91ca37bddb4b9aaf345551b4731da62949be002fa89283701
82
82
  reline (0.6.3) sha256=1198b04973565b36ec0f11542ab3f5cfeeec34823f4e54cebde90968092b1835
83
83
  ruby2_keywords (0.0.5) sha256=ffd13740c573b7301cf7a2e61fc857b2a8e3d3aff32545d6f8300d8bae10e3ef
84
84
  simplecov (0.22.0) sha256=fe2622c7834ff23b98066bb0a854284b2729a569ac659f82621fc22ef36213a5
85
85
  simplecov-html (0.13.2) sha256=bd0b8e54e7c2d7685927e8d6286466359b6f16b18cb0df47b508e8d73c777246
86
86
  simplecov_json_formatter (0.1.4) sha256=529418fbe8de1713ac2b2d612aa3daa56d316975d307244399fa4838c601b428
87
- warning (1.5.0) sha256=0f12c49fea0c06757778eefdcc7771e4fd99308901e3d55c504d87afdd718c53
88
- waterdrop (2.10.0)
89
- zeitwerk (2.7.5) sha256=d8da92128c09ea6ec62c949011b00ed4a20242b255293dd66bf41545398f73dd
87
+ warning (1.6.0) sha256=a49cdfae19fb77d19afff2efbe45f8ab759e9cd25b4e4ce2c79dbaf46bdb6c9e
88
+ waterdrop (2.10.2)
89
+ zeitwerk (2.8.2) sha256=7212a61311083c604184b1ea2574b9aa05cd14f855a0841c06985cabe9181d12
90
90
 
91
91
  BUNDLED WITH
92
92
  4.0.6
@@ -18,7 +18,7 @@ services:
18
18
  start_period: 90s
19
19
 
20
20
  kafka-oauth:
21
- image: confluentinc/cp-kafka:8.2.0
21
+ image: confluentinc/cp-kafka:8.2.1
22
22
  container_name: kafka-oauth
23
23
  depends_on:
24
24
  keycloak:
@@ -1,6 +1,6 @@
1
1
  services:
2
2
  kafka-sasl:
3
- image: confluentinc/cp-kafka:8.2.0
3
+ image: confluentinc/cp-kafka:8.2.1
4
4
  container_name: kafka-sasl
5
5
  ports:
6
6
  - "9095:9095"
data/docker-compose.yml CHANGED
@@ -1,7 +1,7 @@
1
1
  services:
2
2
  kafka:
3
3
  container_name: kafka
4
- image: confluentinc/cp-kafka:8.2.0
4
+ image: confluentinc/cp-kafka:8.2.1
5
5
 
6
6
  ports:
7
7
  - 9092:9092
@@ -113,11 +113,14 @@ module WaterDrop
113
113
  end
114
114
 
115
115
  # Shutdown the global connection pool
116
- def shutdown
116
+ #
117
+ # @param force [Boolean] when true, force-close each producer, purging unflushed messages.
118
+ # Defaults to false (graceful close) so in-flight messages are not silently dropped.
119
+ def shutdown(force: false)
117
120
  return unless @default_pool
118
121
 
119
122
  pool = @default_pool
120
- @default_pool.shutdown
123
+ @default_pool.shutdown(force: force)
121
124
  @default_pool = nil
122
125
 
123
126
  # Emit global event for pool shutdown
@@ -237,9 +240,16 @@ module WaterDrop
237
240
  end
238
241
 
239
242
  # Shutdown the connection pool
240
- def shutdown
243
+ #
244
+ # @param force [Boolean] when true, force-close each producer, purging any messages that do not
245
+ # flush within the producer's max wait timeout. Defaults to false: producers are closed
246
+ # gracefully so in-flight messages are flushed instead of being silently dropped when the
247
+ # broker is slow or unreachable.
248
+ def shutdown(force: false)
241
249
  @pool.shutdown do |producer|
242
- producer.close! if producer&.status&.active?
250
+ next unless producer&.status&.active?
251
+
252
+ force ? producer.close! : producer.close
243
253
  end
244
254
 
245
255
  # Emit event after pool is shut down
@@ -254,11 +264,13 @@ module WaterDrop
254
264
  # for API consistency across both individual producers and connection pools
255
265
  alias_method :close, :shutdown
256
266
 
257
- # Reload all connections in the pool
258
- # Useful for configuration changes or error recovery
267
+ # Reload all connections in the pool. Useful for configuration changes or error recovery
268
+ #
269
+ # @note Producers are always closed gracefully (never force-closed): a reload must not drop
270
+ # in-flight messages, so it waits for them to flush rather than purging the queue.
259
271
  def reload
260
272
  @pool.reload do |producer|
261
- producer.close! if producer&.status&.active?
273
+ producer.close if producer&.status&.active?
262
274
  end
263
275
 
264
276
  # Emit event after pool is reloaded
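The behavioural difference between the graceful default and `force: true` can be illustrated with a toy model. This is a sketch under stated assumptions: `SketchProducer` and `shutdown_pool` are hypothetical, and the forced close simply pretends the broker is unreachable so that nothing drains before the purge.

```ruby
# Hypothetical producer stand-in; it only models what happens to queued messages.
class SketchProducer
  attr_reader :delivered, :purged

  def initialize(queued)
    @queued = queued
    @delivered = []
    @purged = []
    @active = true
  end

  def active?
    @active
  end

  # Graceful close: wait until everything queued has been delivered.
  def close
    @delivered.concat(@queued)
    @queued = []
    @active = false
  end

  # Forced close: assume nothing drained before the timeout, so the rest is purged.
  def close!
    @purged.concat(@queued)
    @queued = []
    @active = false
  end
end

# Mirrors the shape of the new pool shutdown: graceful unless force is requested.
def shutdown_pool(producers, force: false)
  producers.each do |producer|
    next unless producer&.active?

    force ? producer.close! : producer.close
  end
end

graceful = SketchProducer.new(%w[a b])
forced = SketchProducer.new(%w[c d])

shutdown_pool([graceful])
shutdown_pool([forced], force: true)
```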
@@ -80,7 +80,6 @@ module WaterDrop
80
80
  end
81
81
  end
82
82
 
83
- # Alias so we can have a nicer API to abort transactions
84
- # This makes referencing easier
83
+ # Alias so we can have a nicer API to abort transactions. This makes referencing easier
85
84
  AbortTransaction = Errors::AbortTransaction
86
85
  end
@@ -60,7 +60,14 @@ module WaterDrop
60
60
  private
61
61
 
62
62
  # @param delivery_report [Rdkafka::Producer::DeliveryReport] delivery report
63
+ # @note This is the most frequently fired event in the system (once per delivered
64
+ # message) and most users do not subscribe to it. While the notifications bus
65
+ # short-circuits on empty listeners, that happens only after the payload hash is
66
+ # built, so we guard here to keep the no-listeners path allocation-free. We check on
67
+ # each emission to support late subscribers.
63
68
  def instrument_acknowledged(delivery_report)
69
+ return unless listening?
70
+
64
71
  @monitor.instrument(
65
72
  "message.acknowledged",
66
73
  caller: self,
@@ -111,6 +118,13 @@ module WaterDrop
111
118
  def build_error(delivery_report)
112
119
  ::Rdkafka::RdkafkaError.new(delivery_report.error)
113
120
  end
121
+
122
+ # Check if anyone is listening to the acknowledgement events
123
+ # @return [Boolean] true if there are any listeners
124
+ def listening?
125
+ listeners = @monitor.listeners["message.acknowledged"]
126
+ listeners && !listeners.empty?
127
+ end
114
128
  end
115
129
  end
116
130
  end
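To see why the guard has to sit before the `instrument` call, consider a stripped-down monitor. The sketch below is illustrative only: `SketchMonitor` and `SketchCallback` are hypothetical classes, and `payloads_built` is a counter added purely to make the skipped allocation observable.

```ruby
# Hypothetical notifications bus: listeners are stored per event name.
class SketchMonitor
  attr_reader :listeners

  def initialize
    @listeners = {}
  end

  def subscribe(event, &block)
    (@listeners[event] ||= []) << block
  end

  def instrument(event, payload)
    (@listeners[event] || []).each { |listener| listener.call(payload) }
  end
end

# Hypothetical delivery callback that builds the payload only when someone listens.
class SketchCallback
  attr_reader :payloads_built

  def initialize(monitor)
    @monitor = monitor
    @payloads_built = 0
  end

  def call(offset)
    # Checked on every emission, so listeners subscribed later are still served.
    return unless listening?

    @payloads_built += 1
    @monitor.instrument("message.acknowledged", { offset: offset })
  end

  private

  def listening?
    listeners = @monitor.listeners["message.acknowledged"]
    listeners && !listeners.empty?
  end
end

monitor = SketchMonitor.new
callback = SketchCallback.new(monitor)
received = []

callback.call(1) # nobody is subscribed yet: no payload is built
built_before_subscribe = callback.payloads_built

monitor.subscribe("message.acknowledged") { |payload| received << payload[:offset] }
callback.call(2) # late subscriber is picked up on the next emission
```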
@@ -21,8 +21,7 @@ module WaterDrop
21
21
  # @note When there is a particular message produce error (not internal error), the error
22
22
  # is shipped via the delivery callback, not via error callback.
23
23
  def call(client_name, error)
24
- # Emit only errors related to our client
25
- # Same as with statistics (mor explanation there)
24
+ # Emit only errors related to our client, same as with statistics (mor explanation there)
26
25
  return unless @client_name == client_name
27
26
 
28
27
  @monitor.instrument(
@@ -47,7 +47,7 @@ module WaterDrop
47
47
  # @param event [Dry::Events::Event] event that happened with the details
48
48
  def on_messages_produced_async(event)
49
49
  messages = event[:messages]
50
- topics_count = messages.map { |message| "'#{message[:topic]}'" }.uniq.count
50
+ topics_count = messages.map { |message| message[:topic] }.uniq.count
51
51
 
52
52
  info(
53
53
  event,
@@ -62,7 +62,7 @@ module WaterDrop
62
62
  # @param event [Dry::Events::Event] event that happened with the details
63
63
  def on_messages_produced_sync(event)
64
64
  messages = event[:messages]
65
- topics_count = messages.map { |message| "'#{message[:topic]}'" }.uniq.count
65
+ topics_count = messages.map { |message| message[:topic] }.uniq.count
66
66
 
67
67
  info(event, "Sync producing of #{messages.size} messages to #{topics_count} topics")
68
68
 
@@ -218,8 +218,7 @@ module WaterDrop
218
218
  when :brokers
219
219
  statistics.fetch("brokers").each_value do |broker_statistics|
220
220
  # Skip bootstrap nodes
221
- # Bootstrap nodes have nodeid -1, other nodes have positive
222
- # node ids
221
+ # Bootstrap nodes have nodeid -1, other nodes have positive node ids
223
222
  next if broker_statistics["nodeid"] == -1
224
223
 
225
224
  public_send(
@@ -16,8 +16,7 @@ module WaterDrop
16
16
  extend ::Karafka::Core::Configurable
17
17
 
18
18
  # Ruby thread priority for the poller thread
19
- # Valid range: -3 to 3 (Ruby's thread priority range)
20
- # Higher values = higher priority
19
+ # Valid range: -3 to 3 (Ruby's thread priority range). Higher values = higher priority
21
20
  setting :thread_priority, default: 0
22
21
 
23
22
  # IO.select timeout in milliseconds
@@ -33,8 +33,7 @@ module WaterDrop
33
33
  end
34
34
  end
35
35
 
36
- # Waits until the latch is released
37
- # Returns immediately if already released
36
+ # Waits until the latch is released. Returns immediately if already released
38
37
  def wait
39
38
  @mutex.synchronize do
40
39
  @cv.wait(@mutex) until @released
@@ -186,8 +186,7 @@ module WaterDrop
186
186
  @ios_dirty = true
187
187
  end
188
188
 
189
- # Ensures the polling thread is running
190
- # Must be called within @mutex.synchronize
189
+ # Ensures the polling thread is running. Must be called within @mutex.synchronize
191
190
  def ensure_thread_running!
192
191
  return if @thread&.alive?
193
192
 
@@ -200,9 +199,29 @@ module WaterDrop
200
199
  # Main polling loop that runs in a dedicated thread
201
200
  def polling_loop
202
201
  backoff_ms = 0
202
+ clean_exit = false
203
203
 
204
204
  loop do
205
- break if @shutdown
205
+ # Decide whether to stop AND clear @thread in a single critical section. This is what
206
+ # closes the register/shutdown race: a concurrent `register` is serialized by @mutex, so
207
+ # it either runs before this block (we observe its producer plus `@shutdown = false` and
208
+ # keep polling) or after it (it finds `@thread` already nil and starts a fresh thread).
209
+ # Previously the exit decision and the `@thread = nil` teardown were separate and
210
+ # unsynchronized, so a producer registered in that gap was treated as already served by
211
+ # this exiting thread and then closed by its cleanup - left registered but never polled.
212
+ stop = @mutex.synchronize do
213
+ if @shutdown || @producers.empty?
214
+ @thread = nil
215
+ true
216
+ else
217
+ false
218
+ end
219
+ end
220
+
221
+ if stop
222
+ clean_exit = true
223
+ break
224
+ end
206
225
 
207
226
  # Apply backoff from previous error
208
227
  if backoff_ms > 0
@@ -213,9 +232,9 @@ module WaterDrop
213
232
  # Collect readable IOs (queue FDs)
214
233
  readable_ios, io_to_state = collect_readable_ios
215
234
 
216
- # Exit when no producers registered
217
- # New registrations will start a fresh thread via ensure_thread_running!
218
- break if readable_ios.empty?
235
+ # A producer may have registered right after the stop check above; if the cached snapshot
236
+ # is momentarily empty, loop to rebuild it instead of selecting on an empty set.
237
+ next if readable_ios.empty?
219
238
 
220
239
  poll_with_select(readable_ios, io_to_state)
221
240
  rescue => e
@@ -229,13 +248,12 @@ module WaterDrop
229
248
  end
230
249
  end
231
250
  ensure
232
- # Clear thread reference first so new registrations will start a fresh thread
233
- # This prevents race where register sees old thread as alive during cleanup
234
- @mutex.synchronize { @thread = nil }
235
-
236
- # When the poller thread exits (error or clean shutdown), close all remaining states
237
- # This releases any latches that might be waiting in unregister calls
238
- close_all_states
251
+ # A normal exit already cleared @thread above with an empty registry, so there is nothing to
252
+ # release - and skipping cleanup here is what keeps a producer registered in the exit gap
253
+ # from being closed: its fresh thread owns it now. Only an abnormal exit (an exception
254
+ # escaped the loop) can leave producers registered with callers blocked in `unregister`;
255
+ # release those so they don't hang.
256
+ close_all_states unless clean_exit
239
257
  end
240
258
 
241
259
  # Broadcasts an error to all registered producers' monitors
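The race described in the comments above is closed by making "decide to stop" and "clear the thread reference" one critical section. The sketch below reduces the poller to that single property; `SketchPoller` is hypothetical, and the "polling" is just a counter incremented under the lock to keep the example short.

```ruby
# Hypothetical poller: a background thread that serves registered ids and exits
# when the registry is empty.
class SketchPoller
  def initialize
    @mutex = Mutex.new
    @producers = {}
    @polls = Hash.new(0)
    @thread = nil
  end

  def register(id)
    @mutex.synchronize do
      @producers[id] = true
      # Either the running thread has not decided to stop yet (it will see this id),
      # or it already cleared @thread under the same lock and a fresh one is started.
      @thread = Thread.new { polling_loop } unless @thread&.alive?
    end
  end

  def unregister(id)
    @mutex.synchronize { @producers.delete(id) }
  end

  def polls(id)
    @mutex.synchronize { @polls[id] }
  end

  private

  def polling_loop
    loop do
      stop = @mutex.synchronize do
        if @producers.empty?
          @thread = nil # stop decision and reference clearing happen together
          true
        else
          @producers.each_key { |id| @polls[id] += 1 }
          false
        end
      end

      break if stop

      sleep 0.001
    end
  end
end
```

Registering a new id immediately after the last one was removed is the interesting case: whichever side wins the lock, the new id ends up polled.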
@@ -379,13 +397,15 @@ module WaterDrop
379
397
  state.close
380
398
  end
381
399
 
382
- # Closes all remaining producer states
383
- # Called when the poller thread exits to release any pending latches
384
- # This prevents deadlocks if producers are waiting in unregister
400
+ # Releases any producer states still registered when the poller thread exits ABNORMALLY (an
401
+ # exception escaped the loop), so callers blocked in `unregister` waiting on their latch are
402
+ # not left hanging. A normal exit clears the registry through the loop and never calls this,
403
+ # which is why no thread-ownership check is needed here.
385
404
  def close_all_states
386
405
  states = @mutex.synchronize do
387
- to_close = @producers.values.dup
388
- @producers.clear
406
+ @thread = nil
407
+ to_close = @producers.values
408
+ @producers = {}
389
409
  @ios_dirty = true
390
410
  to_close
391
411
  end
@@ -25,8 +25,7 @@ module WaterDrop
25
25
  client.enable_queue_io_events(@writer.fileno)
26
26
  end
27
27
 
28
- # Signals by writing a byte to the pipe
29
- # Used to wake IO.select for continue/close signals
28
+ # Signals by writing a byte to the pipe. Used to wake IO.select for continue/close signals
30
29
  # Thread-safe and non-blocking; silently ignores errors
31
30
  def signal
32
31
  @writer.write_nonblock("W", exception: false)
@@ -53,8 +53,7 @@ module WaterDrop
53
53
  @io = @queue_pipe.reader
54
54
  end
55
55
 
56
- # Drains the queue pipe
57
- # Called before polling to clear any pending signals
56
+ # Drains the queue pipe. Called before polling to clear any pending signals
58
57
  def drain
59
58
  @queue_pipe.drain
60
59
  end
@@ -88,8 +87,7 @@ module WaterDrop
88
87
 
89
88
  private_constant :STALE_CHECK_THROTTLE_MS
90
89
 
91
- # Marks this producer as having been polled
92
- # Called after polling to track staleness
90
+ # Marks this producer as having been polled. Called after polling to track staleness
93
91
  def mark_polled!
94
92
  @last_poll_time = monotonic_now
95
93
  end
@@ -21,7 +21,7 @@ module WaterDrop
21
21
  "message.produced_async",
22
22
  producer_id: id,
23
23
  message: message
24
- ) { produce(message) }
24
+ ) { produce(message, "produce_async") }
25
25
  rescue *SUPPORTED_FLOW_ERRORS => e
26
26
  # We use this syntax here because we want to preserve the original `#cause` when we
27
27
  # instrument the error and there is no way to manually assign `#cause` value
@@ -62,7 +62,7 @@ module WaterDrop
62
62
  ) do
63
63
  with_transaction_if_transactional do
64
64
  messages.each do |message|
65
- dispatched << produce(message)
65
+ dispatched << produce(message, "produce_many_async")
66
66
  end
67
67
  end
68
68
 
@@ -12,12 +12,15 @@ module WaterDrop
12
12
  def buffer(message)
13
13
  ensure_active!
14
14
 
15
+ # The append runs under @buffer_mutex because flush/purge/close swap @messages for a fresh
16
+ # array under the same lock. Without it, a concurrent swap between reading @messages and
17
+ # appending would land the message in the orphaned old array and silently lose it.
15
18
  @monitor.instrument(
16
19
  "message.buffered",
17
20
  producer_id: id,
18
21
  message: message,
19
22
  buffer: @messages
20
- ) { @messages << message }
23
+ ) { @buffer_mutex.synchronize { @messages << message } }
21
24
  end
22
25
 
23
26
  # Adds given messages into the internal producer buffer without flushing them to Kafka
@@ -29,13 +32,16 @@ module WaterDrop
29
32
  def buffer_many(messages)
30
33
  ensure_active!
31
34
 
35
+ # The concat runs under @buffer_mutex for the same reason as #buffer: flush/purge/close swap
36
+ # @messages under the lock, so an unguarded concat could append into an array that has just
37
+ # been captured for dispatch (or discarded), silently losing the messages.
32
38
  @monitor.instrument(
33
39
  "messages.buffered",
34
40
  producer_id: id,
35
41
  messages: messages,
36
42
  buffer: @messages
37
43
  ) do
38
- messages.each { |message| @messages << message }
44
+ @buffer_mutex.synchronize { @messages.concat(messages) }
39
45
  messages
40
46
  end
41
47
  end
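The "buffer in one thread, flush in another" pattern is only safe when the append and the array swap share a lock. A minimal sketch, assuming a hypothetical `SketchBuffer` class with a `take` method standing in for the swap that flush/purge/close perform:

```ruby
# Hypothetical buffer: appends and swaps are serialized by the same mutex.
class SketchBuffer
  def initialize
    @buffer_mutex = Mutex.new
    @messages = []
  end

  def buffer(message)
    @buffer_mutex.synchronize { @messages << message }
  end

  def buffer_many(messages)
    @buffer_mutex.synchronize { @messages.concat(messages) }
  end

  # Swaps the buffer for a fresh array and returns what was collected so far.
  def take
    @buffer_mutex.synchronize do
      taken = @messages
      @messages = []
      taken
    end
  end
end

buffer = SketchBuffer.new
flushed = []

writer = Thread.new { 1_000.times { |i| buffer.buffer(i) } }
flusher = Thread.new { flushed.concat(buffer.take) while writer.alive? }

[writer, flusher].each(&:join)
flushed.concat(buffer.take) # pick up anything buffered after the last swap
```

Because an append can never land between reading `@messages` and swapping it, every message ends up in exactly one taken batch.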
@@ -83,6 +89,32 @@ module WaterDrop
83
89
  return data_for_dispatch if data_for_dispatch.empty?
84
90
 
85
91
  sync ? produce_many_sync(data_for_dispatch) : produce_many_async(data_for_dispatch)
92
+ rescue Errors::ProduceManyError => e
93
+ # A dispatch failed partway through the batch. Re-buffer the messages that never reached
94
+ # librdkafka so a partial failure does not silently drop valid buffered messages. For a
95
+ # transactional producer the whole batch is rolled back (nothing is visible to consumers),
96
+ # so all of it is restored; for a regular producer `e.dispatched` holds the handles already
97
+ # created, so only the remainder is restored.
98
+ requeue_unflushed(transactional? ? data_for_dispatch : data_for_dispatch.drop(e.dispatched.size))
99
+
100
+ raise
101
+ rescue Errors::MessageInvalidError
102
+ # Validation runs before anything is dispatched, so nothing reached librdkafka. Restore the
103
+ # whole batch instead of dropping valid messages alongside the invalid one.
104
+ requeue_unflushed(data_for_dispatch)
105
+
106
+ raise
107
+ end
108
+
109
+ # Puts not-yet-dispatched messages back at the front of the buffer (preserving their original
110
+ # order relative to each other and to anything buffered concurrently), so a failed flush does
111
+ # not lose them.
112
+ #
113
+ # @param messages [Array<Hash>] messages to restore to the buffer
114
+ def requeue_unflushed(messages)
115
+ return if messages.empty?
116
+
117
+ @buffer_mutex.synchronize { @messages.unshift(*messages) }
86
118
  end
87
119
  end
88
120
  end
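The buffering hunks above guard `@messages` with a mutex and put undispatched messages back at the front of the buffer after a partial failure. A minimal sketch of that pattern follows, assuming a hypothetical `SketchBuffer` class and `DispatchError` (neither is part of WaterDrop); the block passed to `flush` stands in for the real dispatch.

```ruby
# Hypothetical stand-in for a producer buffer; not WaterDrop's actual implementation.
class SketchBuffer
  # Carries the items that were handed over before the failure (assumed shape).
  class DispatchError < StandardError
    attr_reader :dispatched

    def initialize(dispatched)
      super("dispatch failed after #{dispatched.size} message(s)")
      @dispatched = dispatched
    end
  end

  def initialize
    @messages = []
    @mutex = Mutex.new
  end

  # Append under the lock so a concurrent flush cannot swap the array mid-append.
  def buffer_many(messages)
    @mutex.synchronize { @messages.concat(messages) }
    messages
  end

  # Capture the buffer under the lock, then dispatch outside of it via the block.
  def flush
    batch = @mutex.synchronize do
      captured = @messages
      @messages = []
      captured
    end
    return batch if batch.empty?

    yield(batch)
  rescue DispatchError => e
    # Restore only the tail that was never handed over, ahead of newer messages.
    requeue(batch.drop(e.dispatched.size))
    raise
  end

  def size
    @mutex.synchronize { @messages.size }
  end

  def to_a
    @mutex.synchronize { @messages.dup }
  end

  private

  def requeue(messages)
    return if messages.empty?

    @mutex.synchronize { @messages.unshift(*messages) }
  end
end
```

In this sketch a failure after two of four messages leaves the remaining two at the head of the buffer, ahead of anything buffered while the flush was running.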
@@ -57,6 +57,15 @@ module WaterDrop
57
57
  # @note After reload, the producer will automatically retry the failed operation
58
58
  def idempotent_reload_client_on_fatal_error(attempt, error)
59
59
  @operating_mutex.synchronize do
60
+ # When several threads share an idempotent producer, one fatal librdkafka condition fails
61
+ # all their in-flight produces at once and each enters this method. The mutex serializes
62
+ # them, but a thread that waited here may arrive after another has already reloaded -
63
+ # resetting @client to nil and moving the producer to the configured state. Running
64
+ # reload! again would call methods on a nil @client and raise NoMethodError, so we bail
65
+ # out and let #produce retry against the freshly reloaded client. This mirrors the
66
+ # `return if @status.configured?` guard on the transactional reload path.
67
+ next if @client.nil? || @status.configured?
68
+
60
69
  # Emit producer.reload event before reload
61
70
  # Users can subscribe to this event and modify event[:caller].config.kafka to change
62
71
  # producer config
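The guard added above relies on `next` leaving the `synchronize` block early when another thread has already reloaded. A small sketch of that behaviour, assuming a hypothetical `SketchReloadable` class with a simplified state symbol in place of the real `@status` object:

```ruby
# Hypothetical model of "reload once, skip for late arrivals"; not WaterDrop code.
class SketchReloadable
  attr_reader :reload_count

  def initialize
    @operating_mutex = Mutex.new
    @client = Object.new # stands in for the native client
    @state = :connected
    @reload_count = 0
  end

  # Every thread that saw the fatal error calls this; only the first one reloads.
  def reload_on_fatal_error
    @operating_mutex.synchronize do
      # `next` exits the block (releasing the mutex) with the given value.
      next :skipped if @client.nil? || @state == :configured

      @client = nil
      @state = :configured
      @reload_count += 1
      :reloaded
    end
  end
end
```

Threads that waited on the mutex observe the already-reset state and return without touching the `nil` client.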
@@ -17,6 +17,16 @@ module WaterDrop
17
17
 
18
18
  private_constant :LIFECYCLE
19
19
 
20
+ # States in which the producer is considered active and able to accept work. Kept as a single
21
+ # set so the current state can be classified in one atomic read (see `#active?` / `#to_sym`)
22
+ # rather than via a chain of predicate calls that could straddle a concurrent transition.
23
+ ACTIVE_STATES = %i[
24
+ connected
25
+ configured
26
+ disconnecting
27
+ disconnected
28
+ ].freeze
29
+
20
30
  # Creates a new instance of status with the initial state
21
31
  # @return [Status]
22
32
  def initialize
@@ -29,7 +39,10 @@ module WaterDrop
29
39
  # established or disconnected, meaning it was working but user disconnected for his own
30
40
  # reasons though sending could reconnect and continue.
31
41
  def active?
32
- connected? || configured? || disconnecting? || disconnected?
42
+ # Single read of @current so a concurrent transition cannot make this return false for a
43
+ # status that is in fact active (for example flipping configured -> connected mid-check
44
+ # while another thread reloads the client after a fatal error).
45
+ ACTIVE_STATES.include?(@current)
33
46
  end
34
47
 
35
48
  # @return [String] current status as a string
@@ -37,6 +50,13 @@ module WaterDrop
37
50
  @current.to_s
38
51
  end
39
52
 
53
+ # @return [Symbol] current lifecycle state captured as a single atomic read. Lets callers
54
+ # branch on one consistent value instead of issuing several predicate calls that could
55
+ # observe different states if the producer is transitioning on another thread.
56
+ def to_sym
57
+ @current
58
+ end
59
+
40
60
  LIFECYCLE.each do |state|
41
61
  # @example
42
62
  # def initial?
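The idea behind `ACTIVE_STATES` and `#to_sym` is to classify the state from one read instead of several predicate calls. A rough sketch under assumptions: `SketchStatus`, its `LIFECYCLE` list, and the `classify` helper are illustrative and may differ from the real `Status` class.

```ruby
# Hypothetical status object; the lifecycle list here is an assumption.
class SketchStatus
  LIFECYCLE = %i[initial configured connected disconnecting disconnected closing closed].freeze
  ACTIVE_STATES = %i[connected configured disconnecting disconnected].freeze

  def initialize
    @current = :initial
  end

  # One read of @current, so a concurrent transition cannot be seen halfway.
  def active?
    ACTIVE_STATES.include?(@current)
  end

  def to_sym
    @current
  end

  LIFECYCLE.each do |state|
    define_method(:"#{state}?") { @current == state }
    define_method(:"#{state}!") { @current = state }
  end
end

# Branches on a single captured value rather than repeated predicate calls.
def classify(status)
  state = status.to_sym
  return :active if SketchStatus::ACTIVE_STATES.include?(state)

  case state
  when :initial then :not_configured
  when :closing, :closed then :closed
  else :invalid
  end
end
```

Because `state` is captured once, every branch in `classify` reasons about the same value even if another thread transitions the status meanwhile.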
@@ -24,7 +24,7 @@ module WaterDrop
24
24
  producer_id: id,
25
25
  message: message
26
26
  ) do
27
- wait(produce(message))
27
+ wait(produce(message, "produce_sync"))
28
28
  end
29
29
  rescue *SUPPORTED_FLOW_ERRORS => e
30
30
  # We use this syntax here because we want to preserve the original `#cause` when we
@@ -84,21 +84,27 @@ module WaterDrop
84
84
  begin
85
85
  with_transaction_if_transactional do
86
86
  messages.each do |message|
87
- dispatched << produce(message)
87
+ dispatched << produce(message, "produce_many_sync")
88
88
  end
89
89
  end
90
90
  rescue *SUPPORTED_FLOW_ERRORS => e
91
91
  inline_error = e
92
92
  end
93
93
 
94
+ # Resolve the variant timeout once instead of re-resolving the fiber-local variant for
95
+ # every single handler we wait on
96
+ max_wait_timeout = current_variant.max_wait_timeout
97
+
94
98
  # This will ensure that we have all verdicts before raising the failure, so we pass
95
99
  # all delivery handles having a final verdict
96
- dispatched.each { |handler| wait(handler, raise_response_error: false) }
100
+ dispatched.each do |handler|
101
+ wait(handler, max_wait_timeout: max_wait_timeout, raise_response_error: false)
102
+ end
97
103
 
98
104
  raise(inline_error) if inline_error
99
105
 
100
106
  # This will raise an error on the first error that have happened
101
- dispatched.each { |handler| wait(handler) }
107
+ dispatched.each { |handler| wait(handler, max_wait_timeout: max_wait_timeout) }
102
108
 
103
109
  dispatched
104
110
  end
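The hunk above resolves the timeout once and passes it to every `wait` call. The reason is that a keyword default such as `max_wait_timeout: current_variant.max_wait_timeout` is evaluated on each call that omits it. A small sketch with a hypothetical `SketchWaiter` class and a lookup counter standing in for the fiber-local variant resolution:

```ruby
# Hypothetical illustration of "resolve once, pass explicitly"; not WaterDrop code.
class SketchWaiter
  attr_reader :lookups

  def initialize(timeout_ms)
    @timeout_ms = timeout_ms
    @lookups = 0
  end

  # Stand-in for the per-call variant lookup.
  def current_timeout
    @lookups += 1
    @timeout_ms
  end

  # The default expression runs on every call that does not pass the keyword.
  def wait(handle, max_wait_timeout: current_timeout)
    handle.call(max_wait_timeout)
  end

  # Batch path: one lookup, reused for all handles.
  def wait_all(handles)
    max_wait_timeout = current_timeout
    handles.map { |handle| wait(handle, max_wait_timeout: max_wait_timeout) }
  end
end
```

Here the handles are plain lambdas; in the real code they would be delivery handles.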
@@ -8,6 +8,11 @@ module WaterDrop
8
8
  # in compacted topics. This module provides a dedicated API so users don't have to manually
9
9
  # construct `produce_*(topic:, key:, payload: nil, ...)` calls.
10
10
  module Tombstone
11
+ # Contract to validate that tombstone message input is correct
12
+ CONTRACT = Contracts::Tombstone.new
13
+
14
+ private_constant :CONTRACT
15
+
11
16
  # Produces a tombstone message to Kafka and waits for it to be delivered
12
17
  #
13
18
  # @param message [Hash] hash with at least `:topic`, `:key`, and `:partition` keys.
@@ -66,10 +71,9 @@ module WaterDrop
66
71
  # @raise [Errors::MessageInvalidError] when key or partition is missing
67
72
  def prepare_tombstone(message)
68
73
  message = message.dup
69
- message.delete(:payload)
70
74
  message[:payload] = nil
71
75
 
72
- Contracts::Tombstone.new.validate!(message, Errors::MessageInvalidError)
76
+ CONTRACT.validate!(message, Errors::MessageInvalidError)
73
77
 
74
78
  message
75
79
  end
@@ -34,7 +34,10 @@ module WaterDrop
34
34
  # When rdkafka-ruby detects empty hash, it will use the librdkafka defaults
35
35
  EMPTY_HASH = {}.freeze
36
36
 
37
- private_constant :EMPTY_HASH
37
+ # Contract to validate that variant alteration data is correct
38
+ CONTRACT = Contracts::Variant.new
39
+
40
+ private_constant :EMPTY_HASH, :CONTRACT
38
41
 
39
42
  attr_reader :max_wait_timeout, :topic_config, :producer
40
43
 
@@ -56,7 +59,7 @@ module WaterDrop
56
59
  @default = default
57
60
  super(producer)
58
61
 
59
- Contracts::Variant.new.validate!(to_h, Errors::VariantInvalidError)
62
+ CONTRACT.validate!(to_h, Errors::VariantInvalidError)
60
63
  end
61
64
 
62
65
  # @return [Boolean] is this a default variant for this producer
@@ -75,23 +78,34 @@ module WaterDrop
75
78
  Transactions
76
79
  ].each do |scope|
77
80
  scope.instance_methods(false).each do |method_name|
81
+ # We save and restore any variant already active for this producer in this fiber rather
82
+ # than unconditionally deleting it. A variant-wrapped method that yields user code (e.g.
83
+ # `transaction`) may wrap a nested same-producer variant call; without save/restore the
84
+ # inner call's `ensure` would clear the slot the outer scope still needs, so the rest of
85
+ # the outer scope would silently fall back to the default variant. When there was no outer
86
+ # entry we still `delete` (not nil-assign) to avoid leaving stale entries behind.
87
+ #
78
88
  # @example
79
89
  # def produce_async(*args, &block)
80
90
  # ref = Fiber.current.waterdrop_clients ||= {}
91
+ # had = ref.key?(@producer.id)
92
+ # prev = ref[@producer.id]
81
93
  # ref[@producer.id] = self
82
94
  #
83
95
  # @producer.produce_async(*args, &block)
84
96
  # ensure
85
- # ref.delete(@producer.id)
97
+ # had ? (ref[@producer.id] = prev) : ref.delete(@producer.id)
86
98
  # end
87
99
  class_eval <<-RUBY, __FILE__, __LINE__ + 1
88
100
  def #{method_name}(*args, &block)
89
101
  ref = Fiber.current.waterdrop_clients ||= {}
102
+ had = ref.key?(@producer.id)
103
+ prev = ref[@producer.id]
90
104
  ref[@producer.id] = self
91
105
 
92
106
  @producer.#{method_name}(*args, &block)
93
107
  ensure
94
- ref.delete(@producer.id)
108
+ had ? (ref[@producer.id] = prev) : ref.delete(@producer.id)
95
109
  end
96
110
  RUBY
97
111
  end
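The save-and-restore logic in the generated wrappers can be exercised in isolation. The sketch below is an approximation: it uses `Thread.current[]` (which is fiber-local in Ruby) with a hypothetical `:sketch_variant_slots` key instead of WaterDrop's `Fiber.current.waterdrop_clients`, and `SketchVariant` is an illustrative class.

```ruby
# Hypothetical variant wrapper showing nested save/restore; not WaterDrop code.
class SketchVariant
  SLOT = :sketch_variant_slots # assumed fiber-local key

  attr_reader :name

  def initialize(producer_id, name)
    @producer_id = producer_id
    @name = name
  end

  # Read-only lookup: never allocates the fiber-local hash.
  def self.current(producer_id, default)
    slots = Thread.current[SLOT]
    (slots && slots[producer_id]) || default
  end

  # Activates this variant for the duration of the block.
  def wrap
    ref = Thread.current[SLOT] ||= {}
    had = ref.key?(@producer_id)
    prev = ref[@producer_id]
    ref[@producer_id] = self

    yield
  ensure
    # Restore the outer variant if there was one, otherwise remove the entry.
    had ? (ref[@producer_id] = prev) : ref.delete(@producer_id)
  end
end
```

With an unconditional `delete` in the `ensure`, the code following a nested call would see the default variant instead of the outer one.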
@@ -152,8 +152,7 @@ module WaterDrop
152
152
 
153
153
  # We should raise an error when trying to use a producer with client from a fork. Always.
154
154
  if @client
155
- # We need to reset the client, otherwise there might be attempt to close the parent
156
- # client
155
+ # We need to reset the client, otherwise there might be attempt to close the parent client
157
156
  @client = nil
158
157
  raise Errors::ProducerUsedInParentProcess, Process.pid
159
158
  end
@@ -264,6 +263,29 @@ module WaterDrop
264
263
  @middleware ||= config.middleware
265
264
  end
266
265
 
266
+ # Returns the variant currently in effect for dispatches on the current fiber.
267
+ #
268
+ # While executing inside a variant-wrapped call (any method invoked on the object returned by
269
+ # {#with} / {#variant}), this returns that variant; otherwise it returns the producer's default
270
+ # variant. It is primarily useful to middleware and instrumentation listeners that run
271
+ # synchronously within a dispatch and want to read the effective per-dispatch settings, such as
272
+ # `#topic_config`, `#max_wait_timeout` or `#default?`.
273
+ #
274
+ # @return [WaterDrop::Producer::Variant] the variant active for the current dispatch on this
275
+ # fiber, or the producer's default variant when not inside a variant-wrapped call
276
+ #
277
+ # @note The lookup is fiber-local and scoped to a single dispatch; it does not represent a
278
+ # producer-wide setting. Called from arbitrary code outside a variant-wrapped call it always
279
+ # returns the default variant. It is likewise not meaningful from asynchronous delivery
280
+ # callbacks (which run on the poller thread, a different fiber) - there it also returns the
281
+ # default variant, not the variant the acknowledged message was dispatched with.
282
+ def current_variant
283
+ # Read-only: the fiber-local hash is created by the variant wrapper methods only when needed,
284
+ # so we must not allocate it here just to look up a variant that may not exist.
285
+ clients = Fiber.current.waterdrop_clients
286
+ (clients && clients[id]) || @default_variant
287
+ end
288
+
267
289
  # Disconnects the producer from Kafka while keeping it configured for potential reconnection
268
290
  #
269
291
  # This method safely disconnects the underlying Kafka client while preserving the producer's
@@ -339,6 +361,19 @@ module WaterDrop
339
361
  # @param force [Boolean] should we force closing even with outstanding messages after the
340
362
  # max wait timeout
341
363
  def close(force: false)
364
+ # If the client was built in a different process, we have been forked. The client and its
365
+ # native resources belong to the parent, so we must never flush or close them here: with the
366
+ # real rdkafka client that is rd_kafka_destroy on a fork-inherited handle (undefined behavior),
367
+ # and it would also tear down a client the parent still uses. We just drop our references and
368
+ # the inherited finalizer and return. This matters most for the GC finalizer, which is
369
+ # inherited across fork and would otherwise run #close in the child at exit.
370
+ if @client && @pid != Process.pid
371
+ @client = nil
372
+ ObjectSpace.undefine_finalizer(id)
373
+
374
+ return
375
+ end
376
+
342
377
  # When closing from within the FD poller thread (e.g., from a callback like
343
378
  # message.acknowledged or error.occurred), we must delegate to a background thread.
344
379
  # Close performs flush which waits for delivery reports, but delivery reports require
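The fork guard at the top of `#close` compares the pid recorded when the client was built with the current one. A minimal sketch of that check, assuming a hypothetical `SketchOwner` class; finalizer handling is left out, and the return symbols exist only to make the branches visible.

```ruby
# Hypothetical owner of a native handle; not WaterDrop's producer.
class SketchOwner
  attr_reader :closed_natively

  def initialize
    @pid = Process.pid   # process that built the client
    @client = Object.new # stands in for the native client
    @closed_natively = false
  end

  def close
    # Built in another process: drop the reference, never close the handle.
    if @client && @pid != Process.pid
      @client = nil
      return :dropped
    end

    return :noop unless @client

    @closed_natively = true # the real close would happen here
    @client = nil
    :closed
  end
end
```

In a forked child the first branch is taken, so the parent's handle is left untouched.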
@@ -382,7 +417,18 @@ module WaterDrop
382
417
 
383
418
  # Flush has its own buffer mutex but even if it is blocked, flushing can still happen
384
419
  # as we close the client after the flushing (even if blocked by the mutex)
385
- flush(true)
420
+ #
421
+ # This is best-effort: if a buffered message surfaces a terminal error here (for example
422
+ # a fatal error on an idempotent producer), we must still proceed to close the underlying
423
+ # client. Otherwise the native client and its resources would leak and the producer would
424
+ # stay stuck in the `:closing` state. The failure is already surfaced via the
425
+ # `error.occurred` instrumentation emitted by the dispatch itself, so swallowing the
426
+ # re-raised wrapper here does not hide it.
427
+ begin
428
+ flush(true)
429
+ rescue Errors::ProduceError
430
+ nil
431
+ end
386
432
 
387
433
  # We should not close the client in several threads at the same time
388
434
  # It is safe to run it several times but not exactly the same moment
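The best-effort flush above means a terminal dispatch error must not prevent the client from being closed. A simplified sketch, assuming a hypothetical `SketchCloser` with its own `ProduceError`; the real method also involves mutexes and instrumentation that are omitted here.

```ruby
# Hypothetical close sequence with a best-effort flush; not WaterDrop code.
class SketchCloser
  class ProduceError < StandardError; end

  attr_reader :state, :client_closed

  def initialize(fail_flush:)
    @fail_flush = fail_flush
    @state = :connected
    @client_closed = false
  end

  def flush
    raise ProduceError, "terminal error on a buffered message" if @fail_flush
  end

  def close
    @state = :closing

    begin
      flush
    rescue ProduceError
      nil # already reported elsewhere in the real code; keep closing
    end

    @client_closed = true
    @state = :closed
  end
end
```

Without the `rescue`, the error would leave the object in `:closing` with the client still open.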
@@ -423,6 +469,20 @@ module WaterDrop
423
469
  end
424
470
  end
425
471
  end
472
+ rescue ThreadError => e
473
+ # Ruby raises ThreadError with this specific message when Mutex#synchronize (or #lock) is
474
+ # called from a signal trap context. There is no public Ruby API to detect trap context
475
+ # proactively - Thread.current is the same object as the main thread, its status is "run",
476
+ # and caller_locations contains no "trap" frame. The only observable difference is that
477
+ # blocking mutex operations raise this error. We re-raise anything else (e.g.
478
+ # "deadlock; recursive locking") so those are not silently swallowed.
479
+ #
480
+ # Puma's `after_stopped` DSL hook in single mode is one example that fires in trap context.
481
+ # We escape by delegating to a background thread and joining so the caller blocks until the
482
+ # producer is fully closed.
483
+ raise unless e.message == "can't be called from trap context"
484
+
485
+ Thread.new { close(force: force) }.value
426
486
  end
427
487
 
428
488
  # Closes the producer with forced close after timeout, purging any outgoing data
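The trap-context escape above can be reproduced with a plain mutex. The sketch below uses a hypothetical `SketchTrapSafe` class; it matches on the same error message as the diff and re-raises any other `ThreadError`.

```ruby
# Hypothetical object whose close needs a mutex; not WaterDrop's producer.
class SketchTrapSafe
  TRAP_MESSAGE = "can't be called from trap context"

  attr_reader :closed

  def initialize
    @mutex = Mutex.new
    @closed = false
  end

  def close
    @mutex.synchronize { @closed = true }
  rescue ThreadError => e
    # Anything else (for example recursive locking) must not be swallowed.
    raise unless e.message == TRAP_MESSAGE

    # A fresh thread is not in trap context; wait for it to finish.
    Thread.new { close }.value
  end
end
```

A usage example would be `Signal.trap("TERM") { producer_like.close }`, where `producer_like` is an instance of the class above.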
@@ -484,15 +544,21 @@ module WaterDrop
484
544
  # Ensures that we don't run any operations when the producer is not configured or when it
485
545
  # was already closed
486
546
  def ensure_active!
487
- return if @status.active?
488
- return if @status.closing? && @operating_mutex.owned?
547
+ # Capture the lifecycle state once. Another thread may be transitioning the producer between
548
+ # states (for example configured -> connected while reloading the client after a fatal error),
549
+ # and issuing several @status predicate calls here could otherwise observe an inconsistent mix
550
+ # of states and raise StatusInvalidError for what is in fact a valid, active producer.
551
+ state = @status.to_sym
489
552
 
490
- raise Errors::ProducerNotConfiguredError, id if @status.initial?
491
- raise Errors::ProducerClosedError, id if @status.closing?
492
- raise Errors::ProducerClosedError, id if @status.closed?
553
+ return if Status::ACTIVE_STATES.include?(state)
554
+ return if state == :closing && @operating_mutex.owned?
555
+
556
+ raise Errors::ProducerNotConfiguredError, id if state == :initial
557
+ raise Errors::ProducerClosedError, id if state == :closing
558
+ raise Errors::ProducerClosedError, id if state == :closed
493
559
 
494
560
  # This should never happen
495
- raise Errors::StatusInvalidError, [id, @status.to_s]
561
+ raise Errors::StatusInvalidError, [id, state.to_s]
496
562
  end
497
563
 
498
564
  # Ensures that the message we want to send out to Kafka is actually valid and that it can be
@@ -506,26 +572,48 @@ module WaterDrop
506
572
  # Waits on a given handler
507
573
  #
508
574
  # @param handler [Rdkafka::Producer::DeliveryHandle]
575
+ # @param max_wait_timeout [Integer] max wait timeout in ms. Resolved from the current variant
576
+ # by default but can be passed in by batch operations that wait on many handlers, so the
577
+ # variant is not re-resolved for each of them.
509
578
  # @param raise_response_error [Boolean] should we raise the response error after we receive the
510
579
  # final result and it is an error.
511
- def wait(handler, raise_response_error: true)
580
+ def wait(handler, max_wait_timeout: current_variant.max_wait_timeout, raise_response_error: true)
512
581
  handler.wait(
513
- max_wait_timeout_ms: current_variant.max_wait_timeout,
582
+ max_wait_timeout_ms: max_wait_timeout,
514
583
  raise_response_error: raise_response_error
515
584
  )
516
585
  end
517
586
 
518
- # @return [Producer::Variant] the variant config. Either custom if built using `#with` or
519
- # a default one.
520
- def current_variant
521
- Fiber.current.waterdrop_clients ||= {}
522
- Fiber.current.waterdrop_clients[id] || @default_variant
587
+ # Dispatches a message, ensuring transactional producers take the transaction lock before the
588
+ # operation is counted.
589
+ #
590
+ # For a transactional producer we wrap the whole dispatch (including the operations-counter
591
+ # bookkeeping) in `transaction`, so `@transaction_mutex` is acquired BEFORE
592
+ # `@operations_in_progress` is incremented. This makes `#produce` acquire locks in the same order
593
+ # as `#close` (`@transaction_mutex` -> `@operating_mutex` -> operations counter) and removes a
594
+ # lock-order inversion: without it, a dispatch that had already counted itself could block forever
595
+ # on `@transaction_mutex` held by a concurrent `#close` that was itself waiting for the operations
596
+ # counter to drain. When we already own the transaction lock (inside an explicit transaction block
597
+ # or the closing flush) the order is already correct, so we dispatch directly.
598
+ #
599
+ # @param message [Hash] message we want to send
600
+ # @param label [String] short name of the public dispatch method (e.g. `"produce_sync"`) that
601
+ # we surface in the `message.*` queue-full error type. Passed explicitly by each public entry
602
+ # point so we never have to walk the call stack to recover it (the number of internal frames
603
+ # varies because the transactional path wraps the dispatch in a `transaction`).
604
+ def produce(message, label)
605
+ if transactional? && !@transaction_mutex.owned?
606
+ transaction { produce_to_client(message, label) }
607
+ else
608
+ produce_to_client(message, label)
609
+ end
523
610
  end
524
611
 
525
612
  # Runs the client produce method with a given message
526
613
  #
527
614
  # @param message [Hash] message we want to send
528
- def produce(message)
615
+ # @param label [String] public dispatch method name used in the queue-full error type
616
+ def produce_to_client(message, label)
529
617
  produce_time ||= monotonic_now
530
618
 
531
619
  # This can happen only during flushing on closing, in case like this we don't have to
@@ -537,16 +625,20 @@ module WaterDrop
537
625
  ensure_active!
538
626
  end
539
627
 
628
+ # The variant is fiber-local and cannot change mid-call, so we resolve it once instead of
629
+ # paying the fiber-local lookup for each usage
630
+ variant = current_variant
631
+
540
632
  # We duplicate the message hash only if it is needed.
541
633
  # It is needed when user is using a custom settings variant or when symbol is provided as
542
634
  # the topic name. We should never mutate user input message as it may be a hash that the
543
635
  # user is using for some other operations
544
- if message[:topic].is_a?(Symbol) || !current_variant.default?
636
+ if message[:topic].is_a?(Symbol) || !variant.default?
545
637
  message = message.dup
546
638
  # In case someone defines topic as a symbol, we need to convert it into a string as
547
639
  # librdkafka does not accept symbols
548
640
  message[:topic] = message[:topic].to_s
549
- message[:topic_config] = current_variant.topic_config
641
+ message[:topic_config] = variant.topic_config
550
642
  end
551
643
 
552
644
  result = if transactional?
@@ -560,8 +652,14 @@ module WaterDrop
560
652
 
561
653
  result
562
654
  rescue SUPPORTED_FLOW_ERRORS.first => e
563
- # Check if this is a fatal error on an idempotent producer and we should reload
564
- if idempotent_reloadable?(e)
655
+ # Check if this is a fatal error on an idempotent producer and we should reload.
656
+ #
657
+ # We must never reload while closing. During `#close` the final `flush` runs while this
658
+ # thread already owns `@operating_mutex`; the idempotent reload re-acquires that same mutex,
659
+ # which Ruby rejects with `ThreadError: deadlock; recursive locking`, and it would also try to
660
+ # rebuild the very client we are tearing down. In that case we let the error propagate so
661
+ # `#close` can finish and release the underlying client.
662
+ if idempotent_reloadable?(e) && !@operating_mutex.owned?
565
663
  # Check if we've exceeded max reload attempts
566
664
  raise unless idempotent_retryable?
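The `!@operating_mutex.owned?` check above exists because Ruby's `Mutex` is not reentrant. A short sketch with a hypothetical `SketchGuardedReload` class shows both the failure and the guard; the return symbols are illustrative only.

```ruby
# Hypothetical model of "do not reload while closing"; not WaterDrop code.
class SketchGuardedReload
  def initialize
    @operating_mutex = Mutex.new
  end

  # The reload path takes the operating mutex.
  def reload
    @operating_mutex.synchronize { :reloaded }
  end

  # Guarded error handling: skip the reload when this thread already holds the lock.
  def handle_fatal_error
    return :propagate if @operating_mutex.owned?

    reload
  end

  # Close holds the mutex while the final flush may hit a fatal error.
  def close
    @operating_mutex.synchronize { handle_fatal_error }
  end

  # Same flow without the guard: re-locking raises ThreadError.
  def unguarded_close
    @operating_mutex.synchronize { reload }
  end
end
```

Calling `unguarded_close` raises `ThreadError` (deadlock; recursive locking), while `close` returns without attempting the reload.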
567
665
 
@@ -597,8 +695,6 @@ module WaterDrop
597
695
  # in an infinite loop, effectively hanging the processing
598
696
  raise unless monotonic_now - produce_time < @config.wait_timeout_on_queue_full
599
697
 
600
- label = caller_locations(2, 1)[0].label.split.last.split("#").last
601
-
602
698
  # We use this syntax here because we want to preserve the original `#cause` when we
603
699
  # instrument the error and there is no way to manually assign `#cause` value. We want to keep
604
700
  # the original cause to maintain the same API across all the errors dispatched to the
@@ -3,5 +3,5 @@
3
3
  # WaterDrop library
4
4
  module WaterDrop
5
5
  # Current WaterDrop version
6
- VERSION = "2.10.0"
6
+ VERSION = "2.10.2"
7
7
  end
data/package-lock.json CHANGED
@@ -286,9 +286,9 @@
286
286
  }
287
287
  },
288
288
  "node_modules/smol-toml": {
289
- "version": "1.6.0",
290
- "resolved": "https://registry.npmjs.org/smol-toml/-/smol-toml-1.6.0.tgz",
291
- "integrity": "sha512-4zemZi0HvTnYwLfrpk/CF9LOd9Lt87kAt50GnqhMpyF9U3poDAP2+iukq2bZsO/ufegbYehBkqINbsWxj4l4cw==",
289
+ "version": "1.6.1",
290
+ "resolved": "https://registry.npmjs.org/smol-toml/-/smol-toml-1.6.1.tgz",
291
+ "integrity": "sha512-dWUG8F5sIIARXih1DTaQAX4SsiTXhInKf1buxdY9DIg4ZYPZK5nGM1VRIYmEbDbsHt7USo99xSLFu5Q1IqTmsg==",
292
292
  "dev": true,
293
293
  "license": "BSD-3-Clause",
294
294
  "engines": {
@@ -312,9 +312,9 @@
312
312
  }
313
313
  },
314
314
  "node_modules/yaml": {
315
- "version": "2.8.2",
316
- "resolved": "https://registry.npmjs.org/yaml/-/yaml-2.8.2.tgz",
317
- "integrity": "sha512-mplynKqc1C2hTVYxd0PU2xQAc22TI1vShAYGksCCfxbn/dFwnHTNi1bvYsBTkhdUNtGIf5xNOg938rrSSYvS9A==",
315
+ "version": "2.9.0",
316
+ "resolved": "https://registry.npmjs.org/yaml/-/yaml-2.9.0.tgz",
317
+ "integrity": "sha512-2AvhNX3mb8zd6Zy7INTtSpl1F15HW6Wnqj0srWlkKLcpYl/gMIMJiyuGq2KeI2YFxUPjdlB+3Lc10seMLtL4cA==",
318
318
  "dev": true,
319
319
  "license": "ISC",
320
320
  "bin": {
data/renovate.json CHANGED
@@ -17,7 +17,7 @@
17
17
  {
18
18
  "minimumReleaseAge": "7 days",
19
19
  "matchDepNames": [
20
- "/*/"
20
+ "*"
21
21
  ]
22
22
  },
23
23
  {
@@ -39,7 +39,15 @@
39
39
  "ruby/setup-ruby",
40
40
  "ruby"
41
41
  ],
42
- "groupName": "ruby setup"
42
+ "groupName": "ruby setup",
43
+ "internalChecksFilter": "strict"
44
+ },
45
+ {
46
+ "description": "Let setup-ruby pass age gate before ruby so it is ready when the group PR is created",
47
+ "matchPackageNames": [
48
+ "ruby/setup-ruby"
49
+ ],
50
+ "minimumReleaseAge": "5 days"
43
51
  }
44
52
  ],
45
53
  "minimumReleaseAge": "7 days",
@@ -47,6 +55,9 @@
47
55
  "dependencies"
48
56
  ],
49
57
  "lockFileMaintenance": {
50
- "enabled": true
58
+ "enabled": true,
59
+ "schedule": [
60
+ "before 4am on the first day of the month"
61
+ ]
51
62
  }
52
63
  }
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: waterdrop
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.10.0
4
+ version: 2.10.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Maciej Mensfeld
@@ -160,7 +160,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
160
160
  - !ruby/object:Gem::Version
161
161
  version: '0'
162
162
  requirements: []
163
- rubygems_version: 4.0.6
163
+ rubygems_version: 4.0.10
164
164
  specification_version: 4
165
165
  summary: Kafka messaging made easy!
166
166
  test_files: []