@drarzter/kafka-client 0.9.3 → 0.10.0

Files changed (170)
  1. package/README.md +625 -8
  2. package/dist/chunk-CMO7SMVK.mjs +4814 -0
  3. package/dist/chunk-CMO7SMVK.mjs.map +1 -0
  4. package/dist/cli/dlq.d.ts +119 -0
  5. package/dist/cli/dlq.d.ts.map +1 -0
  6. package/dist/cli/index.d.ts +3 -0
  7. package/dist/cli/index.d.ts.map +1 -0
  8. package/dist/{chunk-TPIP5VV7.mjs → cli/index.js} +965 -265
  9. package/dist/cli/index.js.map +1 -0
  10. package/dist/cli/index.mjs +355 -0
  11. package/dist/cli/index.mjs.map +1 -0
  12. package/dist/client/config/from-env.d.ts +188 -0
  13. package/dist/client/config/from-env.d.ts.map +1 -0
  14. package/dist/client/config/index.d.ts +2 -0
  15. package/dist/client/config/index.d.ts.map +1 -0
  16. package/dist/client/errors.d.ts +67 -0
  17. package/dist/client/errors.d.ts.map +1 -0
  18. package/dist/client/kafka.client/admin/ops.d.ts +114 -0
  19. package/dist/client/kafka.client/admin/ops.d.ts.map +1 -0
  20. package/dist/client/kafka.client/consumer/features/delayed.d.ts +24 -0
  21. package/dist/client/kafka.client/consumer/features/delayed.d.ts.map +1 -0
  22. package/dist/client/kafka.client/consumer/features/dlq-replay.d.ts +52 -0
  23. package/dist/client/kafka.client/consumer/features/dlq-replay.d.ts.map +1 -0
  24. package/dist/client/kafka.client/consumer/features/routed.d.ts +4 -0
  25. package/dist/client/kafka.client/consumer/features/routed.d.ts.map +1 -0
  26. package/dist/client/kafka.client/consumer/features/snapshot.d.ts +10 -0
  27. package/dist/client/kafka.client/consumer/features/snapshot.d.ts.map +1 -0
  28. package/dist/client/kafka.client/consumer/features/window.d.ts +5 -0
  29. package/dist/client/kafka.client/consumer/features/window.d.ts.map +1 -0
  30. package/dist/client/kafka.client/consumer/handler.d.ts +149 -0
  31. package/dist/client/kafka.client/consumer/handler.d.ts.map +1 -0
  32. package/dist/client/kafka.client/consumer/ops.d.ts +51 -0
  33. package/dist/client/kafka.client/consumer/ops.d.ts.map +1 -0
  34. package/dist/client/kafka.client/consumer/pipeline.d.ts +167 -0
  35. package/dist/client/kafka.client/consumer/pipeline.d.ts.map +1 -0
  36. package/dist/client/kafka.client/consumer/queue.d.ts +37 -0
  37. package/dist/client/kafka.client/consumer/queue.d.ts.map +1 -0
  38. package/dist/client/kafka.client/consumer/retry-topic.d.ts +65 -0
  39. package/dist/client/kafka.client/consumer/retry-topic.d.ts.map +1 -0
  40. package/dist/client/kafka.client/consumer/setup.d.ts +63 -0
  41. package/dist/client/kafka.client/consumer/setup.d.ts.map +1 -0
  42. package/dist/client/kafka.client/consumer/start.d.ts +7 -0
  43. package/dist/client/kafka.client/consumer/start.d.ts.map +1 -0
  44. package/dist/client/kafka.client/consumer/stop.d.ts +19 -0
  45. package/dist/client/kafka.client/consumer/stop.d.ts.map +1 -0
  46. package/dist/client/kafka.client/consumer/subscribe-retry.d.ts +4 -0
  47. package/dist/client/kafka.client/consumer/subscribe-retry.d.ts.map +1 -0
  48. package/dist/client/kafka.client/context.d.ts +72 -0
  49. package/dist/client/kafka.client/context.d.ts.map +1 -0
  50. package/dist/client/kafka.client/index.d.ts +155 -0
  51. package/dist/client/kafka.client/index.d.ts.map +1 -0
  52. package/dist/client/kafka.client/infra/circuit-breaker.manager.d.ts +61 -0
  53. package/dist/client/kafka.client/infra/circuit-breaker.manager.d.ts.map +1 -0
  54. package/dist/client/kafka.client/infra/dedup.store.d.ts +28 -0
  55. package/dist/client/kafka.client/infra/dedup.store.d.ts.map +1 -0
  56. package/dist/client/kafka.client/infra/inflight.tracker.d.ts +22 -0
  57. package/dist/client/kafka.client/infra/inflight.tracker.d.ts.map +1 -0
  58. package/dist/client/kafka.client/infra/metrics.manager.d.ts +67 -0
  59. package/dist/client/kafka.client/infra/metrics.manager.d.ts.map +1 -0
  60. package/dist/client/kafka.client/producer/lifecycle.d.ts +41 -0
  61. package/dist/client/kafka.client/producer/lifecycle.d.ts.map +1 -0
  62. package/dist/client/kafka.client/producer/ops.d.ts +70 -0
  63. package/dist/client/kafka.client/producer/ops.d.ts.map +1 -0
  64. package/dist/client/kafka.client/producer/send.d.ts +21 -0
  65. package/dist/client/kafka.client/producer/send.d.ts.map +1 -0
  66. package/dist/client/kafka.client/validate-options.d.ts +11 -0
  67. package/dist/client/kafka.client/validate-options.d.ts.map +1 -0
  68. package/dist/client/message/envelope.d.ts +105 -0
  69. package/dist/client/message/envelope.d.ts.map +1 -0
  70. package/dist/client/message/schema-registry.d.ts +105 -0
  71. package/dist/client/message/schema-registry.d.ts.map +1 -0
  72. package/dist/client/message/topic.d.ts +138 -0
  73. package/dist/client/message/topic.d.ts.map +1 -0
  74. package/dist/client/message/versioned-schema.d.ts +53 -0
  75. package/dist/client/message/versioned-schema.d.ts.map +1 -0
  76. package/dist/client/outbox/index.d.ts +4 -0
  77. package/dist/client/outbox/index.d.ts.map +1 -0
  78. package/dist/client/outbox/outbox.relay.d.ts +90 -0
  79. package/dist/client/outbox/outbox.relay.d.ts.map +1 -0
  80. package/dist/client/outbox/outbox.store.d.ts +42 -0
  81. package/dist/client/outbox/outbox.store.d.ts.map +1 -0
  82. package/dist/client/outbox/outbox.types.d.ts +144 -0
  83. package/dist/client/outbox/outbox.types.d.ts.map +1 -0
  84. package/dist/client/security/acl.d.ts +108 -0
  85. package/dist/client/security/acl.d.ts.map +1 -0
  86. package/dist/client/security/index.d.ts +5 -0
  87. package/dist/client/security/index.d.ts.map +1 -0
  88. package/dist/client/security/providers.d.ts +88 -0
  89. package/dist/client/security/providers.d.ts.map +1 -0
  90. package/dist/client/security/resolve-security.d.ts +19 -0
  91. package/dist/client/security/resolve-security.d.ts.map +1 -0
  92. package/dist/client/security/security.types.d.ts +76 -0
  93. package/dist/client/security/security.types.d.ts.map +1 -0
  94. package/dist/client/transport/confluent.transport.d.ts +32 -0
  95. package/dist/client/transport/confluent.transport.d.ts.map +1 -0
  96. package/dist/client/transport/transport.interface.d.ts +216 -0
  97. package/dist/client/transport/transport.interface.d.ts.map +1 -0
  98. package/dist/client/types/admin.interface.d.ts +174 -0
  99. package/dist/client/types/admin.interface.d.ts.map +1 -0
  100. package/dist/client/types/admin.types.d.ts +140 -0
  101. package/dist/client/types/admin.types.d.ts.map +1 -0
  102. package/dist/client/types/client.d.ts +21 -0
  103. package/dist/client/types/client.d.ts.map +1 -0
  104. package/dist/client/types/common.d.ts +84 -0
  105. package/dist/client/types/common.d.ts.map +1 -0
  106. package/dist/client/types/config.types.d.ts +150 -0
  107. package/dist/client/types/config.types.d.ts.map +1 -0
  108. package/dist/client/types/consumer.interface.d.ts +115 -0
  109. package/dist/client/types/consumer.interface.d.ts.map +1 -0
  110. package/dist/{consumer.types-fFCag3VJ.d.mts → client/types/consumer.types.d.ts} +62 -383
  111. package/dist/client/types/consumer.types.d.ts.map +1 -0
  112. package/dist/client/types/dedup.types.d.ts +50 -0
  113. package/dist/client/types/dedup.types.d.ts.map +1 -0
  114. package/dist/client/types/lifecycle.interface.d.ts +72 -0
  115. package/dist/client/types/lifecycle.interface.d.ts.map +1 -0
  116. package/dist/client/types/producer.interface.d.ts +52 -0
  117. package/dist/client/types/producer.interface.d.ts.map +1 -0
  118. package/dist/client/types/producer.types.d.ts +90 -0
  119. package/dist/client/types/producer.types.d.ts.map +1 -0
  120. package/dist/client/types.d.ts +8 -0
  121. package/dist/client/types.d.ts.map +1 -0
  122. package/dist/core.d.ts +10 -314
  123. package/dist/core.d.ts.map +1 -0
  124. package/dist/core.js +1326 -74
  125. package/dist/core.js.map +1 -1
  126. package/dist/core.mjs +39 -3
  127. package/dist/index.d.ts +7 -128
  128. package/dist/index.d.ts.map +1 -0
  129. package/dist/index.js +1343 -74
  130. package/dist/index.js.map +1 -1
  131. package/dist/index.mjs +56 -3
  132. package/dist/index.mjs.map +1 -1
  133. package/dist/nest/kafka.constants.d.ts +5 -0
  134. package/dist/nest/kafka.constants.d.ts.map +1 -0
  135. package/dist/nest/kafka.decorator.d.ts +49 -0
  136. package/dist/nest/kafka.decorator.d.ts.map +1 -0
  137. package/dist/nest/kafka.explorer.d.ts +17 -0
  138. package/dist/nest/kafka.explorer.d.ts.map +1 -0
  139. package/dist/nest/kafka.health.d.ts +7 -0
  140. package/dist/nest/kafka.health.d.ts.map +1 -0
  141. package/dist/nest/kafka.module.d.ts +61 -0
  142. package/dist/nest/kafka.module.d.ts.map +1 -0
  143. package/dist/otel.d.ts +83 -5
  144. package/dist/otel.d.ts.map +1 -0
  145. package/dist/otel.js +100 -6
  146. package/dist/otel.js.map +1 -1
  147. package/dist/otel.mjs +98 -5
  148. package/dist/otel.mjs.map +1 -1
  149. package/dist/testing/client.mock.d.ts +47 -0
  150. package/dist/testing/client.mock.d.ts.map +1 -0
  151. package/dist/testing/index.d.ts +4 -0
  152. package/dist/testing/index.d.ts.map +1 -0
  153. package/dist/testing/test.container.d.ts +63 -0
  154. package/dist/testing/test.container.d.ts.map +1 -0
  155. package/dist/{testing.d.mts → testing/transport.fake.d.ts} +7 -111
  156. package/dist/testing/transport.fake.d.ts.map +1 -0
  157. package/dist/testing.d.ts +2 -318
  158. package/dist/testing.d.ts.map +1 -0
  159. package/dist/testing.js +28 -2
  160. package/dist/testing.js.map +1 -1
  161. package/dist/testing.mjs +28 -2
  162. package/dist/testing.mjs.map +1 -1
  163. package/package.json +22 -9
  164. package/dist/chunk-TPIP5VV7.mjs.map +0 -1
  165. package/dist/client-CBBUDDtu.d.ts +0 -751
  166. package/dist/client-D-SxYV2b.d.mts +0 -751
  167. package/dist/consumer.types-fFCag3VJ.d.ts +0 -958
  168. package/dist/core.d.mts +0 -314
  169. package/dist/index.d.mts +0 -128
  170. package/dist/otel.d.mts +0 -27
package/README.md CHANGED
@@ -24,17 +24,25 @@ Type-safe Kafka client for Node.js. Framework-agnostic core with a first-class N
24
24
  - [Iterator: consume()](#iterator-consume)
25
25
  - [Multiple consumer groups](#multiple-consumer-groups)
26
26
  - [Partition key](#partition-key)
27
+ - [Typed partition keys](#typed-partition-keys)
27
28
  - [Message headers](#message-headers)
28
29
  - [Batch sending](#batch-sending)
30
+ - [Delayed delivery](#delayed-delivery)
29
31
  - [Batch consuming](#batch-consuming)
30
32
  - [Tombstone messages](#tombstone-messages)
31
33
  - [Compression](#compression)
32
34
  - [Transactions](#transactions)
33
35
  - [Consumer interceptors](#consumer-interceptors)
34
36
  - [Instrumentation](#instrumentation)
37
+ - [OpenTelemetry metrics](#opentelemetry-metrics)
38
+ - [Transport security](#transport-security)
39
+ - [AWS MSK IAM & GCP authentication](#aws-msk-iam--gcp-authentication)
40
+ - [ACL requirements](#acl-requirements)
41
+ - [Environment configuration](#environment-configuration)
35
42
  - [Options reference](#options-reference)
36
43
  - [Error classes](#error-classes)
37
44
  - [Deduplication (Lamport Clock)](#deduplication-lamport-clock)
45
+ - [Pluggable deduplication store](#pluggable-deduplication-store)
38
46
  - [Retry topic chain](#retry-topic-chain)
39
47
  - [stopConsumer](#stopconsumer)
40
48
  - [Pause and resume](#pause-and-resume)
@@ -53,7 +61,10 @@ Type-safe Kafka client for Node.js. Framework-agnostic core with a first-class N
53
61
  - [Header-based routing](#header-based-routing)
54
62
  - [Lag-based producer throttling](#lag-based-producer-throttling)
55
63
  - [Transactional consumer](#transactional-consumer)
64
+ - [Transactional outbox](#transactional-outbox)
65
+ - [Schema Registry client](#schema-registry-client)
56
66
  - [Admin API](#admin-api)
67
+ - [DLQ CLI](#dlq-cli)
57
68
  - [Graceful shutdown](#graceful-shutdown)
58
69
  - [Consumer handles](#consumer-handles)
59
70
  - [onMessageLost](#onmessagelost)
@@ -61,8 +72,11 @@ Type-safe Kafka client for Node.js. Framework-agnostic core with a first-class N
61
72
  - [onRebalance](#onrebalance)
62
73
  - [Consumer lag](#consumer-lag)
63
74
  - [Handler timeout warning](#handler-timeout-warning)
75
+ - [Static group membership](#static-group-membership)
64
76
  - [Schema validation](#schema-validation)
77
+ - [Versioned schemas](#versioned-schemas)
65
78
  - [Context-aware validators](#context-aware-validators-schemaparsecontext)
79
+ - [Constructor options validation](#constructor-options-validation)
66
80
  - [Health check](#health-check)
67
81
  - [Testing](#testing)
68
82
  - [Project structure](#project-structure)
@@ -107,13 +121,27 @@ Safe by default. Configurable when you need it. Escape hatches for when you know
107
121
  - **Declarative & imperative** — use `@SubscribeTo()` decorator or `startConsumer()` directly
108
122
  - **Async iterator** — `consume<K>()` returns an `AsyncIterableIterator<EventEnvelope<T[K]>>` for `for await` consumption; breaking out of the loop stops the consumer automatically
109
123
  - **Message TTL** — `messageTtlMs` drops or DLQs messages older than a configurable threshold, preventing stale events from poisoning downstream systems after a lag spike
110
- - **Circuit breaker** — `circuitBreaker` option applies a sliding-window breaker per topic-partition; pauses delivery on repeated DLQ failures and resumes after a configurable recovery window
124
+ - **Circuit breaker** — `circuitBreaker` option applies a sliding-window breaker per topic-partition; pauses delivery on repeated handler failures and resumes after a configurable recovery window
111
125
  - **Seek to offset** — `seekToOffset(groupId, assignments)` seeks individual partitions to explicit offsets for fine-grained replay
112
126
  - **Tombstone messages** — `sendTombstone(topic, key)` sends a null-value record to compact a key out of a log-compacted topic; all instrumentation hooks still fire
113
127
  - **Regex topic subscription** — `startConsumer([/^orders\..+/], handler)` subscribes using a pattern; the broker routes matching topics to the consumer dynamically
114
128
  - **Compression** — per-send `compression` option (`gzip`, `snappy`, `lz4`, `zstd`) in `SendOptions` and `BatchSendOptions`
115
129
  - **Partition assignment strategy** — `partitionAssigner` in `ConsumerOptions` chooses between `cooperative-sticky` (default), `roundrobin`, and `range`
116
130
  - **Admin API** — `listConsumerGroups()`, `describeTopics()`, `deleteRecords()` for group inspection, partition metadata, and message deletion
131
+ - **Typed partition keys** — `topic('orders').type<T>().key(m => m.orderId)` binds a partition-key extractor to a descriptor so related messages land on the same partition without passing `key` at every call site
132
+ - **Versioned schemas** — `versionedSchema({ 1: v1, 2: v2 }, { migrate })` dispatches validation on the `x-schema-version` header and upgrades old shapes to the latest
133
+ - **Constructor validation** — the `KafkaClient` constructor fails fast, throwing a single aggregated error that lists every invalid config value instead of surfacing a confusing driver error on first use
134
+ - **Pluggable deduplication store** — swap the in-memory Lamport-clock store for a `DedupStore` (e.g. Redis-backed) so deduplication survives restarts and rebalances; fail-open on store errors
135
+ - **Delayed delivery** — `sendMessage(..., { deliverAfterMs })` stages messages in `<topic>.delayed`; a `startDelayedRelay()` consumer forwards them transactionally once the deadline passes
136
+ - **OpenTelemetry metrics** — `otelMetricsInstrumentation()` records send/consume counters and a handler-duration histogram; `otelLagGauge()` reports per-partition consumer lag as an observable gauge
137
+ - **Transport security** — `security: { ssl, sasl }` with secure-by-default rules: SASL auto-enables TLS, plaintext to non-local brokers warns once (silenceable via `allowInsecure: true`); SASL mechanisms `plain`, `scram-sha-256`, `scram-sha-512`, `oauthbearer`
138
+ - **AWS MSK / GCP auth** — `awsMskIamProvider({ region })` and `gcpAccessTokenProvider()` supply OAUTHBEARER tokens from the standard AWS / Google credential chains (IRSA, task roles, ADC)
139
+ - **ACL requirements helper** — `describeRequiredAcls()` enumerates every derived topic, companion group, ephemeral group, and transactional id a service needs; render them as `kafka-acls.sh` commands or an MSK IAM policy
140
+ - **Environment configuration** — `kafkaClientConfigFromEnv()`, `consumerOptionsFromEnv()`, and `mergeConsumerOptions()` build config from env vars with `code > env > defaults` precedence
141
+ - **Transactional outbox** — `startOutboxRelay()` publishes rows from a DB outbox table to Kafka inside a transaction; at-least-once with stable `eventId` for downstream dedup
142
+ - **Schema Registry client** — `SchemaRegistryClient` + `registrySchema()` keep locally-defined schemas in lockstep with a Confluent-compatible registry (validation/evolution, not Avro wire-format serde)
143
+ - **Static group membership** — `groupInstanceId` (`group.instance.id`) skips rebalance on k8s rolling restarts within `session.timeout.ms`
144
+ - **DLQ CLI** — `kafka-client-dlq ls | peek | replay` for inspecting and re-publishing dead letter queues from the terminal
117
145
 
118
146
  See the [Roadmap](./ROADMAP.md) for upcoming features and version history.
119
147
 
@@ -605,6 +633,36 @@ await this.kafka.sendMessage(
605
633
  );
606
634
  ```
607
635
 
636
+ ### Typed partition keys
637
+
638
+ Instead of passing `key` at every call site, bind a partition-key extractor to the topic descriptor with `.key()`. The extractor runs on every send through that descriptor, so messages with the same logical key always land on the same partition — you never forget to set it. Available on both `.type<T>()` and `.schema()` descriptors:
639
+
640
+ ```typescript
641
+ import { topic } from '@drarzter/kafka-client';
642
+
643
+ const OrderCreated = topic('order.created')
644
+ .type<{ orderId: string; userId: string; amount: number }>()
645
+ .key((m) => m.orderId);
646
+
647
+ // Key is derived automatically from the payload — no `key` needed
648
+ await kafka.sendMessage(OrderCreated, { orderId: '123', userId: '456', amount: 100 });
649
+ // → produced with key '123'
650
+
651
+ // Works with schema descriptors too
652
+ const PaymentTaken = topic('payment.taken')
653
+ .schema(z.object({ paymentId: z.string(), orderId: z.string() }))
654
+ .key((m) => m.orderId);
655
+ ```
656
+
657
+ The extractor runs on the **original (pre-validation) payload**. An explicit `key` in `SendOptions` — or a batch item's `key` — always wins over the descriptor's extractor:
658
+
659
+ ```typescript
660
+ // Explicit key overrides the extractor
661
+ await kafka.sendMessage(OrderCreated, { orderId: '123', userId: '456', amount: 100 }, {
662
+ key: 'custom-partition-key',
663
+ });
664
+ ```
665
+
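The precedence described above (explicit `key` first, then the descriptor's extractor) can be modelled in a few lines. This is only a sketch of the rule, not the library's code: `resolvePartitionKey`, `KeyExtractor`, and `OrderCreatedPayload` are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper, not part of @drarzter/kafka-client: models the key precedence rule.
type KeyExtractor<T> = (message: T) => string;

function resolvePartitionKey<T>(
  payload: T,
  explicitKey: string | undefined,
  extractor?: KeyExtractor<T>,
): string | undefined {
  // An explicit key always wins over the descriptor's extractor.
  if (explicitKey !== undefined) return explicitKey;
  // Otherwise derive the key from the original payload, if an extractor is bound.
  return extractor ? extractor(payload) : undefined;
}

interface OrderCreatedPayload {
  orderId: string;
  userId: string;
  amount: number;
}

const order: OrderCreatedPayload = { orderId: '123', userId: '456', amount: 100 };
const byOrderId: KeyExtractor<OrderCreatedPayload> = (m) => m.orderId;

const derivedKey = resolvePartitionKey(order, undefined, byOrderId);
const overriddenKey = resolvePartitionKey(order, 'custom-partition-key', byOrderId);
const missingKey = resolvePartitionKey(order, undefined);

console.log(derivedKey, overriddenKey, missingKey);
```

With no explicit key and no extractor the sketch returns `undefined`, which mirrors sending without a key at all.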
608
666
  ## Message headers
609
667
 
610
668
  Attach metadata to messages:
@@ -642,6 +700,33 @@ await this.kafka.sendBatch('order.created', [
642
700
  ]);
643
701
  ```
644
702
 
703
+ ## Delayed delivery
704
+
705
+ Schedule a message for future delivery with `deliverAfterMs`. Instead of going straight to the target topic, the message is produced to a `<topic>.delayed` staging topic carrying `x-delayed-until` (deadline) and `x-delayed-target` headers. A **relay consumer** started via `startDelayedRelay()` holds each message until its deadline passes, then forwards it to the target topic:
706
+
707
+ ```typescript
708
+ // 1. Start the relay once (per process) for the topics you delay-deliver to
709
+ await kafka.startDelayedRelay(['order.reminder']);
710
+
711
+ // 2. Send a message that should arrive in ~1 hour
712
+ await kafka.sendMessage(
713
+ 'order.reminder',
714
+ { orderId: '123', channel: 'email' },
715
+ { deliverAfterMs: 60 * 60 * 1000 },
716
+ );
717
+ // → staged in order.reminder.delayed, forwarded to order.reminder ~1 h later
718
+ ```
719
+
720
+ `deliverAfterMs` also works on `sendBatch` — it applies to the whole batch:
721
+
722
+ ```typescript
723
+ await kafka.sendBatch('order.reminder', messages, { deliverAfterMs: 30_000 });
724
+ ```
725
+
726
+ The relay defaults to a `<defaultGroupId>-delayed-relay` consumer group; override it with `startDelayedRelay(topics, { groupId })`. Forwarding is **transactional** — the produce to the target topic and the source-offset commit happen atomically, so no duplicates are relayed even if the relay crashes mid-forward. The original key, value, and envelope headers (`x-event-id`, `x-correlation-id`, `x-lamport-clock`, `traceparent`) all survive the hop; only the `x-delayed-*` control headers are stripped.
727
+
728
+ > **Delivery time is a lower bound.** The relay pauses a partition until the head-of-line message's deadline, so later messages on the same partition wait behind it (at-least semantics). Delayed messages are only delivered while the relay is running — treat it as a long-lived consumer, not a fire-and-forget scheduler.
729
+
645
730
  ## Batch consuming
646
731
 
647
732
  Process messages in batches for higher throughput. The handler receives an array of `EventEnvelope`s and a `BatchMeta` object with offset management controls:
@@ -821,6 +906,47 @@ const kafka = new KafkaClient('my-app', 'my-group', brokers, {
821
906
 
822
907
  `otelInstrumentation()` injects `traceparent` on send, extracts it on consume, and creates `CONSUMER` spans automatically. The span is set as the **active OTel context** for the handler's duration via `context.with()` — so `trace.getActiveSpan()` works inside your handler and any child spans are automatically parented to the consume span. Requires `@opentelemetry/api` as a peer dependency.
823
908
 
909
+ ### OpenTelemetry metrics
910
+
911
+ `otelInstrumentation()` handles **traces**. For **metrics**, the same entrypoint exports `otelMetricsInstrumentation()` (counters + a duration histogram) and `otelLagGauge()` (an observable consumer-lag gauge). They share nothing with the tracing instrumentation and compose with it in any order:
912
+
913
+ ```typescript
914
+ import {
915
+ otelInstrumentation,
916
+ otelMetricsInstrumentation,
917
+ otelLagGauge,
918
+ } from '@drarzter/kafka-client/otel';
919
+
920
+ const kafka = new KafkaClient('my-app', 'my-group', brokers, {
921
+ instrumentation: [otelInstrumentation(), otelMetricsInstrumentation()],
922
+ });
923
+ ```
924
+
925
+ `otelMetricsInstrumentation()` registers seven instruments under the meter `@drarzter/kafka-client` (created once per instance, not per message):
926
+
927
+ | Instrument | Type | Attributes | Recorded when |
928
+ | ---------- | ---- | ---------- | ------------- |
929
+ | `kafka.client.messages.sent` | Counter | `topic` | a message is sent |
930
+ | `kafka.client.messages.processed` | Counter | `topic` | a handler succeeds |
931
+ | `kafka.client.messages.retried` | Counter | `topic` | a message is queued for retry |
932
+ | `kafka.client.messages.dlq` | Counter | `topic`, `reason` | a message is routed to a DLQ |
933
+ | `kafka.client.messages.duplicate` | Counter | `topic`, `strategy` | a Lamport-clock duplicate is detected |
934
+ | `kafka.client.consume.errors` | Counter | `topic` | a handler throws |
935
+ | `kafka.client.consume.duration` | Histogram (ms) | `topic` | measured across the handler's execution |
936
+
937
+ Pass a custom meter with `otelMetricsInstrumentation({ meter })` to route instruments through your own `MeterProvider`; it defaults to `metrics.getMeter('@drarzter/kafka-client')`.
938
+
939
+ `otelLagGauge()` registers an observable gauge `kafka.client.consumer.lag` (attributes `topic`, `partition`, `groupId`) that polls `getConsumerLag()` on each metric-collection cycle. It returns an **unregister disposer** — call it on shutdown to stop observing:
940
+
941
+ ```typescript
942
+ const unregisterLag = otelLagGauge(kafka, { groupId: 'billing-service' });
943
+
944
+ // ...later, on shutdown:
945
+ unregisterLag();
946
+ ```
947
+
948
+ `groupId` defaults to the client's constructor group (reported as an empty-string attribute), and `meter` overrides the meter as above. Lag-query failures during a collection cycle are swallowed silently — a broker hiccup reports no samples for that cycle rather than breaking metric collection. Both helpers require `@opentelemetry/api` as a peer dependency.
949
+
824
950
  ### Custom instrumentation
825
951
 
826
952
  `beforeConsume` can return a `BeforeConsumeResult` — either the legacy `() => void` cleanup function, or an object with `cleanup` and/or `wrap`:
@@ -917,6 +1043,204 @@ Passing a topic name that has not seen any events returns a zero-valued snapshot
917
1043
 
918
1044
  Counters are incremented in the same code paths that fire the corresponding hooks — they are always active regardless of whether any instrumentation is configured.
919
1045
 
1046
+ ## Transport security
1047
+
1048
+ Configure TLS and SASL through the `security` option on `KafkaClientOptions`. The library applies **secure-by-default** rules so credentials never leak onto plaintext connections by accident:
1049
+
1050
+ - **SASL auto-enables TLS.** When `sasl` is set and `ssl` is left unset, `ssl` is turned on automatically — SASL credentials always travel over TLS unless you explicitly opt out.
1051
+ - **Explicit `ssl: false` with SASL warns.** Setting `sasl` together with `ssl: false` logs a warning that credentials will cross the wire in plaintext — only safe on fully trusted networks.
1052
+ - **Plaintext to non-local brokers warns once.** With no `ssl`/`sasl` at all and at least one non-local broker (anything outside `localhost`, `127.0.0.0/8`, `::1`, `0.0.0.0`, `host.docker.internal`), a single warning is logged per client. Acknowledge and silence it with `allowInsecure: true`.
1053
+
1054
+ Nothing here ever throws or blocks a connection — the defaults protect, you stay in control.
1055
+
1056
+ ```typescript
1057
+ import { KafkaClient } from '@drarzter/kafka-client/core';
1058
+
1059
+ // SASL/SCRAM over TLS — ssl auto-enabled because sasl is set
1060
+ const kafka = new KafkaClient('billing-svc', 'billing-group', ['broker.example.com:9093'], {
1061
+ security: {
1062
+ sasl: {
1063
+ mechanism: 'scram-sha-512',
1064
+ username: 'billing-svc',
1065
+ password: process.env.KAFKA_PASSWORD!,
1066
+ },
1067
+ // ssl: true — inferred automatically; set explicitly if you prefer
1068
+ },
1069
+ });
1070
+ ```
1071
+
1072
+ `KafkaSecurityOptions`:
1073
+
1074
+ | Field | Default | Description |
1075
+ | ----- | ------- | ----------- |
1076
+ | `ssl` | `true` when `sasl` set, else `false` | Enable TLS |
1077
+ | `sasl` | — | SASL authentication (see below) |
1078
+ | `allowInsecure` | `false` | Acknowledge an intentionally insecure (plaintext, non-local) setup and silence the warning. No effect when `ssl`/`sasl` are set |
1079
+
1080
+ `sasl` is a discriminated union on `mechanism`:
1081
+
1082
+ ```typescript
1083
+ // Username / password mechanisms
1084
+ { mechanism: 'plain' | 'scram-sha-256' | 'scram-sha-512', username: string, password: string }
1085
+
1086
+ // Token-based (AWS MSK IAM, GCP, custom)
1087
+ { mechanism: 'oauthbearer', oauthBearerProvider: () => Promise<OAuthBearerToken> }
1088
+ ```
1089
+
1090
+ An `OAuthBearerProvider` is an async factory the driver calls on connect and before each token expiry; it returns `{ value, principal?, lifetimeMs?, extensions? }`.
1091
+
1092
+ ### AWS MSK IAM & GCP authentication
1093
+
1094
+ Two ready-made `oauthbearer` providers cover the common managed-Kafka cases. Both resolve credentials from the platform's standard chain — nothing to hard-code — and rely on an **optional** peer dependency you install alongside this library.
1095
+
1096
+ **AWS MSK IAM** — `awsMskIamProvider({ region })` delegates token signing to `aws-msk-iam-sasl-signer-js`. Credentials come from the standard AWS provider chain, so EKS IRSA, ECS task roles, and env credentials all work unchanged. Authorisation is then governed by IAM policies (`kafka-cluster:*` actions) — see [ACL requirements](#acl-requirements) to generate one:
1097
+
1098
+ ```bash
1099
+ npm install aws-msk-iam-sasl-signer-js
1100
+ ```
1101
+
1102
+ ```typescript
1103
+ import { KafkaClient, awsMskIamProvider } from '@drarzter/kafka-client/core';
1104
+
1105
+ const kafka = new KafkaClient('orders-svc', 'orders-group', brokers, {
1106
+ security: {
1107
+ sasl: {
1108
+ mechanism: 'oauthbearer',
1109
+ oauthBearerProvider: awsMskIamProvider({ region: 'eu-west-1' }),
1110
+ },
1111
+ },
1112
+ });
1113
+ ```
1114
+
1115
+ **GCP** — `gcpAccessTokenProvider()` delegates to `google-auth-library` using Application Default Credentials, so GKE Workload Identity, attached service accounts, and `GOOGLE_APPLICATION_CREDENTIALS` all work unchanged. It supplies a raw ADC access token; verify the exact token format your cluster expects against current Google documentation:
1116
+
1117
+ ```bash
1118
+ npm install google-auth-library
1119
+ ```
1120
+
1121
+ ```typescript
1122
+ import { KafkaClient, gcpAccessTokenProvider } from '@drarzter/kafka-client/core';
1123
+
1124
+ const kafka = new KafkaClient('events-svc', 'events-group', brokers, {
1125
+ security: {
1126
+ sasl: {
1127
+ mechanism: 'oauthbearer',
1128
+ oauthBearerProvider: gcpAccessTokenProvider(),
1129
+ },
1130
+ },
1131
+ });
1132
+ ```
1133
+
1134
+ | Provider | Options | Optional peer dep |
1135
+ | -------- | ------- | ----------------- |
1136
+ | `awsMskIamProvider` | `{ region }` | `aws-msk-iam-sasl-signer-js` |
1137
+ | `gcpAccessTokenProvider` | `{ scopes?, principal?, tokenTtlMs? }` (defaults: `cloud-platform` scope, principal `'gcp'`, 50 min TTL) | `google-auth-library` |
1138
+
1139
+ Neither package is a hard dependency — they are dynamically imported on first token fetch. If the package is missing, the provider throws a clear install hint rather than failing at build time.
1140
+
1141
+ ### ACL requirements
1142
+
1143
+ The features that make this library convenient — retry topics, DLQ, delayed delivery, deduplication routing, DLQ replay, snapshots, clock recovery — quietly create **extra topics and consumer groups** (`<topic>.retry.N`, `<topic>.dlq`, `<topic>.delayed`, `<topic>.duplicates`, `<groupId>-retry.N`, timestamped ephemeral groups, transactional ids). On a locked-down cluster every one of them needs an ACL, and the last place you want to discover a missing grant is production at 3 a.m.
1144
+
1145
+ `describeRequiredAcls()` enumerates the complete set from a declarative usage profile. Feed the result to `toKafkaAclCommands()` for `kafka-acls.sh` commands, or `toMskIamPolicy()` for an AWS MSK IAM policy document:
1146
+
1147
+ ```typescript
1148
+ import {
1149
+ describeRequiredAcls,
1150
+ toKafkaAclCommands,
1151
+ toMskIamPolicy,
1152
+ } from '@drarzter/kafka-client/core';
1153
+
1154
+ const resources = describeRequiredAcls({
1155
+ clientId: 'billing-svc',
1156
+ groupIds: ['billing-svc-group'],
1157
+ produceTopics: ['invoices.created'],
1158
+ consumeTopics: ['orders.created'],
1159
+ features: {
1160
+ retryTopics: { maxRetries: 3 },
1161
+ dlq: true,
1162
+ dlqReplay: true,
1163
+ transactions: true,
1164
+ },
1165
+ });
1166
+
1167
+ // Render kafka-acls.sh commands for a principal
1168
+ for (const cmd of toKafkaAclCommands(resources, 'User:billing-svc', 'broker:9092')) {
1169
+ console.log(cmd);
1170
+ }
1171
+ // kafka-acls.sh --bootstrap-server broker:9092 --add --allow-principal 'User:billing-svc' \
1172
+ // --operation READ --operation DESCRIBE --topic 'orders.created' # startConsumer
1173
+ // kafka-acls.sh ... --topic 'orders.created.dlq' # dlq: true — failed messages routed to DLQ
1174
+ // kafka-acls.sh ... --topic 'orders.created.retry.1' ... --topic 'orders.created.retry.3'
1175
+ // kafka-acls.sh ... --group 'billing-svc-group-retry.' --resource-pattern-type prefixed
1176
+ // kafka-acls.sh ... --transactional-id 'billing-svc-group-' --resource-pattern-type prefixed
1177
+ // kafka-acls.sh ... --group 'orders.created.dlq-replay' --operation DELETE --resource-pattern-type prefixed
1178
+ // ...
1179
+
1180
+ // Or an MSK IAM policy document
1181
+ const policy = toMskIamPolicy(resources, {
1182
+ region: 'eu-west-1',
1183
+ accountId: '123456789012',
1184
+ clusterName: 'prod',
1185
+ clusterUuid: 'abcd-1234',
1186
+ });
1187
+ ```
1188
+
1189
+ `describeRequiredAcls()` returns `AclResource[]`, each carrying `resourceType` (`topic` | `group` | `transactional-id` | `cluster`), `patternType` (`literal` | `prefixed`), `name`, `operations`, and a `reason` naming the feature that requires it. Ephemeral-group features (`dlqReplay`, `snapshots`, `clockRecovery`) request `DELETE` on a **prefixed** pattern, because those groups are timestamped and cleaned up after use.
1190
+
1191
+ | Feature flag | Adds |
1192
+ | ------------ | ---- |
1193
+ | `dlq` | `<topic>.dlq` WRITE per consumed topic |
1194
+ | `retryTopics: { maxRetries }` | `<topic>.retry.1…N` topics; `<groupId>-retry.` prefixed groups; `<groupId>-` prefixed transactional ids |
1195
+ | `delayedDelivery` | `<topic>.delayed` topics; `<groupId>-delayed-relay` group + `-tx` id |
1196
+ | `duplicatesTopic` | `<topic>.duplicates` (or a custom topic name) WRITE |
1197
+ | `dlqReplay` | `<topic>.dlq-replay` prefixed groups (READ, DESCRIBE, **DELETE**) + DLQ READ |
1198
+ | `snapshots` | `<clientId>-snapshot-` prefixed groups (READ, DESCRIBE, **DELETE**) |
1199
+ | `clockRecovery` | `<clientId>-clock-recovery-` prefixed groups (READ, DESCRIBE, **DELETE**) |
1200
+ | `transactions` | `<clientId>-tx` transactional id |
1201
+ | `autoCreateTopics` | cluster `CREATE` (avoid in production) |
1202
+
1203
+ `toMskIamPolicy()` maps Kafka operations to `kafka-cluster:*` actions, turns prefixed patterns into `name*` ARN wildcards, and always includes `kafka-cluster:Connect`. **Review both outputs against your organisation's least-privilege standards and current AWS documentation before applying** — they are a starting point, not a rubber stamp.
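
To make the feature table above concrete, the sketch below derives the companion topic names for one consumed topic and shows how a prefixed pattern turns into a trailing-wildcard name. `companionTopics`, `resourceNamePattern`, and `CompanionOptions` are hypothetical helpers written for this illustration; they are not exported by the library, and `describeRequiredAcls()` remains the source of truth for the real list.

```typescript
// Hypothetical helpers for illustration; NOT part of @drarzter/kafka-client.
// They only mirror the naming patterns documented in the table above.
interface CompanionOptions {
  maxRetries?: number; // from retryTopics: { maxRetries }
  dlq?: boolean;
  delayedDelivery?: boolean;
}

// Companion topics derived from one consumed topic.
function companionTopics(topic: string, opts: CompanionOptions): string[] {
  const names: string[] = [];
  for (let level = 1; level <= (opts.maxRetries ?? 0); level++) {
    names.push(`${topic}.retry.${level}`); // retry levels 1 … N
  }
  if (opts.dlq) names.push(`${topic}.dlq`);
  if (opts.delayedDelivery) names.push(`${topic}.delayed`);
  return names;
}

// A prefixed pattern becomes a trailing-wildcard name; a literal stays as-is.
function resourceNamePattern(name: string, patternType: 'literal' | 'prefixed'): string {
  return patternType === 'prefixed' ? `${name}*` : name;
}

const extraTopics = companionTopics('orders.created', { maxRetries: 3, dlq: true });
const retryGroups = resourceNamePattern('billing-svc-group-retry.', 'prefixed');
```

Counting the names this way is a quick sanity check on how many grants a profile implies before the generated commands are reviewed.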
1204
+
1205
+ ## Environment configuration
1206
+
1207
+ Build client and consumer configuration from environment variables with a strict precedence rule: **explicit code options > env vars > built-in library defaults**. The helpers only *feed* values in — anything you hard-code always wins, and any variable left unset keeps the library default.
1208
+
1209
+ The library never reads a `.env` file itself. Load one first with Node's built-in `--env-file` flag (`node --env-file=.env`, Node 20.6+) or the `dotenv` package, then call the helpers:
1210
+
1211
+ ```typescript
1212
+ import { KafkaClient, kafkaClientConfigFromEnv } from '@drarzter/kafka-client/core';
1213
+
1214
+ const { clientId, groupId, brokers, options } = kafkaClientConfigFromEnv();
1215
+
1216
+ const kafka = new KafkaClient(
1217
+ clientId ?? 'my-svc', // env value or your fallback
1218
+ groupId ?? 'my-grp',
1219
+ brokers ?? ['localhost:9092'],
1220
+ {
1221
+ ...options, // only the keys whose env vars were present
1222
+ onMessageLost: alerting, // code-level value — always applied, not env-configurable
1223
+ },
1224
+ );
1225
+ ```
1226
+
1227
+ `kafkaClientConfigFromEnv(env?, prefix?)` reads `KAFKA_`-prefixed variables (`CLIENT_ID`, `GROUP_ID`, `BROKERS`, `AUTO_CREATE_TOPICS`, `STRICT_SCHEMAS`, `NUM_PARTITIONS`, `TRANSACTIONAL_ID`, `CLOCK_RECOVERY_*`, `LAG_THROTTLE_*`, and the security vars `SSL`, `SASL_MECHANISM`, `SASL_USERNAME`, `SASL_PASSWORD`, `ALLOW_INSECURE`). It returns `{ clientId?, groupId?, brokers?, options }`, emitting only the keys whose variables were set. Malformed booleans/numbers/enums throw with the offending variable named. `oauthbearer` cannot come from env — token providers are functions, so configure them in code.
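
The two behaviours described above (emit only the keys that were set, and name the offending variable on malformed input) can be sketched as follows. This is a minimal illustration, not the library's parser: `readBoolean`, `readPositiveInt`, `sketchOptionsFromEnv`, the accepted spellings, and the rule that an empty string counts as unset are all assumptions.

```typescript
// Illustrative parser; accepted spellings and the empty-string rule are assumptions.
type Env = Record<string, string | undefined>;

function readBoolean(env: Env, name: string): boolean | undefined {
  const raw = env[name];
  if (raw === undefined || raw === '') return undefined; // unset: caller keeps the library default
  if (raw === 'true') return true;
  if (raw === 'false') return false;
  throw new Error(`${name}: expected "true" or "false", got "${raw}"`);
}

function readPositiveInt(env: Env, name: string): number | undefined {
  const raw = env[name];
  if (raw === undefined || raw === '') return undefined;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name}: expected a positive integer, got "${raw}"`);
  }
  return value;
}

interface SketchOptions {
  autoCreateTopics?: boolean;
  numPartitions?: number;
}

// Emits only the keys whose variables were set, so spreading the result
// never overwrites a default with `undefined`.
function sketchOptionsFromEnv(env: Env, prefix = 'KAFKA_'): SketchOptions {
  const options: SketchOptions = {};
  const autoCreate = readBoolean(env, `${prefix}AUTO_CREATE_TOPICS`);
  if (autoCreate !== undefined) options.autoCreateTopics = autoCreate;
  const partitions = readPositiveInt(env, `${prefix}NUM_PARTITIONS`);
  if (partitions !== undefined) options.numPartitions = partitions;
  return options;
}
```

Passing an explicit `env` object, as in `sketchOptionsFromEnv({ KAFKA_NUM_PARTITIONS: '3' })`, keeps such a parser easy to unit-test without touching `process.env`.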
1228
+
1229
+ `consumerOptionsFromEnv(env?, prefix?)` reads `KAFKA_CONSUMER_`-prefixed variables into a `Partial<ConsumerOptions>` (retry, DLQ, deduplication, circuit breaker, TTL, `GROUP_INSTANCE_ID`, and more). Merge it under your code-level options with `mergeConsumerOptions()`, which applies the precedence rule — later layers win, and the nested objects (`retry`, `deduplication`, `circuitBreaker`, `subscribeRetry`) are deep-merged so a code layer can override a single field:
1230
+
1231
+ ```typescript
1232
+ import { consumerOptionsFromEnv, mergeConsumerOptions } from '@drarzter/kafka-client/core';
1233
+
1234
+ const envDefaults = consumerOptionsFromEnv();
1235
+ await kafka.startConsumer(
1236
+ ['orders'],
1237
+ handler,
1238
+ mergeConsumerOptions(envDefaults, { dlq: true }), // code layer wins on conflict
1239
+ );
1240
+ ```
1241
+
1242
+ Both helpers accept an explicit `env` object (handy in tests) and a custom variable `prefix`. See [`docs/configuration.md`](./docs/configuration.md) for the full variable reference and [`.env.example`](./.env.example) for a ready-to-copy template.
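
The layering rule (later layers win, nested objects merged field by field) can be pictured with a small stand-in. `mergeLayers` and `LayerOptions` below are hypothetical and cover only three keys; the field names inside `retry` are placeholders, and the library's `mergeConsumerOptions()` may differ in detail.

```typescript
// Illustrative stand-in for the precedence rule; not the library's mergeConsumerOptions().
interface LayerOptions {
  dlq?: boolean;
  messageTtlMs?: number;
  retry?: { maxRetries?: number; backoffMs?: number }; // placeholder field names
}

function mergeLayers(...layers: LayerOptions[]): LayerOptions {
  const merged: LayerOptions = {};
  for (const layer of layers) {
    // Flat keys: a later layer replaces the value only when it actually sets one.
    if (layer.dlq !== undefined) merged.dlq = layer.dlq;
    if (layer.messageTtlMs !== undefined) merged.messageTtlMs = layer.messageTtlMs;
    // Nested object: merge field by field so one override keeps its siblings.
    if (layer.retry !== undefined) merged.retry = { ...merged.retry, ...layer.retry };
  }
  return merged;
}

const fromEnv: LayerOptions = { dlq: false, retry: { maxRetries: 5, backoffMs: 1000 } };
const fromCode: LayerOptions = { dlq: true, retry: { maxRetries: 2 } };
const effective = mergeLayers(fromEnv, fromCode);
```

Here the code layer overrides `dlq` and `retry.maxRetries`, while `retry.backoffMs` from the environment layer survives, which is the point of deep-merging the nested objects.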
1243
+
920
1244
  ## Options reference
921
1245
 
922
1246
  ### Send options
@@ -931,8 +1255,9 @@ Options for `sendMessage()` — the third argument:
931
1255
  | `schemaVersion` | `1` | Schema version for the payload |
932
1256
  | `eventId` | auto | Override the auto-generated event ID (UUID v4) |
933
1257
  | `compression` | — | Compression codec for the message set: `'gzip'`, `'snappy'`, `'lz4'`, `'zstd'`; omit to send uncompressed |
1258
+ | `deliverAfterMs` | — | Delay delivery by at least this many milliseconds via a `<topic>.delayed` staging topic; requires a running `startDelayedRelay()` (see [Delayed delivery](#delayed-delivery)) |
934
1259
 
935
- `sendBatch()` accepts `compression` as a top-level option (not per-message); all other options are per-message inside the array items.
1260
+ `sendBatch()` accepts `compression` and `deliverAfterMs` as top-level options (not per-message); all other options are per-message inside the array items.
936
1261
 
937
1262
  ### Consumer options
938
1263
 
@@ -951,15 +1276,17 @@ Options for `sendMessage()` — the third argument:
951
1276
  | `handlerTimeoutMs` | — | Log a warning if the handler hasn't resolved within this window (ms) — does not cancel the handler |
952
1277
  | `deduplication.strategy` | `'drop'` | What to do with duplicate messages: `'drop'` silently discards, `'dlq'` forwards to `{topic}.dlq` (requires `dlq: true`), `'topic'` forwards to `{topic}.duplicates` |
953
1278
  | `deduplication.duplicatesTopic` | `{topic}.duplicates` | Custom destination for `strategy: 'topic'` |
1279
+ | `deduplication.store` | in-memory | Pluggable `DedupStore` for the per-partition last-processed clock; supply a persistent store (e.g. Redis) so dedup survives restarts/rebalances (see [Pluggable deduplication store](#pluggable-deduplication-store)) |
954
1280
  | `messageTtlMs` | — | Drop (or DLQ) messages older than this many milliseconds at consumption time; evaluated against the `x-timestamp` header; see [Message TTL](#message-ttl) |
955
- | `circuitBreaker` | — | Enable circuit breaker with `{}` for zero-config defaults; requires `dlq: true`; see [Circuit breaker](#circuit-breaker) |
956
- | `circuitBreaker.threshold` | `5` | DLQ failures within `windowSize` that opens the circuit |
1281
+ | `circuitBreaker` | — | Enable circuit breaker with `{}` for zero-config defaults; see [Circuit breaker](#circuit-breaker) |
1282
+ | `circuitBreaker.threshold` | `5` | Failed handler attempts within `windowSize` that open the circuit |
957
1283
  | `circuitBreaker.recoveryMs` | `30_000` | Milliseconds to wait in OPEN state before entering HALF_OPEN |
958
1284
  | `circuitBreaker.windowSize` | `threshold × 2, min 10` | Sliding window size in messages |
959
1285
  | `circuitBreaker.halfOpenSuccesses` | `1` | Consecutive successes in HALF_OPEN required to close the circuit |
960
1286
  | `queueHighWaterMark` | unbounded | Max messages buffered in the `consume()` iterator queue before the partition is paused; resumes at 50% drain. Only applies to `consume()` |
961
1287
  | `batch` | `false` | (decorator only) Use `startBatchConsumer` instead of `startConsumer` |
962
1288
  | `partitionAssigner` | `'cooperative-sticky'` | Partition assignment strategy: `'cooperative-sticky'` (minimal movement on rebalance, best for horizontal scaling), `'roundrobin'` (even distribution), `'range'` (contiguous partition ranges) |
1289
+ | `groupInstanceId` | — | Static group membership (`group.instance.id`) — a member that restarts within `session.timeout.ms` rejoins with the same partitions and no rebalance. Must be unique per member; not propagated to retry companions. See [Static group membership](#static-group-membership) |
963
1290
  | `onTtlExpired` | — | Per-consumer override of the client-level `onTtlExpired` callback; takes precedence when set. Receives `TtlExpiredContext` — same shape as the client-level hook |
964
1291
  | `onMessageLost` | — | Per-consumer override of the client-level `onMessageLost` callback; takes precedence when set. Use for consumer-specific dead-message alerting or structured logging |
965
1292
  | `onRetry` | — | Per-consumer retry callback; fires **in addition to** the built-in metrics hook (does not replace it). Same signature as `KafkaInstrumentation.onRetry` |
@@ -980,11 +1307,20 @@ Passed to `KafkaModule.register()` or returned from `registerAsync()` factory:
980
1307
  | `autoCreateTopics` | `false` | Auto-create topics on first send (dev only) |
981
1308
  | `numPartitions` | `1` | Number of partitions for auto-created topics |
982
1309
  | `strictSchemas` | `true` | Validate string topic keys against schemas registered via TopicDescriptor |
1310
+ | `security` | — | TLS + SASL transport security with secure-by-default rules (`{ ssl, sasl, allowInsecure }`); see [Transport security](#transport-security) |
983
1311
  | `instrumentation` | `[]` | Client-wide instrumentation hooks (e.g. OTel). Applied to both send and consume paths |
984
1312
  | `transactionalId` | `${clientId}-tx` | Transactional producer ID for `transaction()` calls. Must be unique per producer instance across the cluster — two instances sharing the same ID will be fenced by Kafka. The client logs a warning when the same ID is registered twice within one process |
985
1313
  | `onMessageLost` | — | Called when a message is silently dropped without DLQ — use to alert, log to external systems, or trigger fallback logic |
986
1314
  | `onTtlExpired` | — | Called when a message is dropped due to TTL expiration (`messageTtlMs`) and `dlq` is not enabled; receives `{ topic, ageMs, messageTtlMs, headers }` |
987
1315
  | `onRebalance` | — | Called on every partition assign/revoke event across all consumers created by this client |
1316
+ | `clockRecovery.topics` | — | Topics to scan on `connectProducer()` to recover the highest `x-lamport-clock`, so the clock stays monotonic across restarts (see [Deduplication](#deduplication-lamport-clock)) |
1317
+ | `clockRecovery.timeoutMs` | `30000` | Max time (ms) to wait for clock recovery before proceeding with a partial result |
1318
+ | `lagThrottle` | — | Delay sends when a consumer group's lag exceeds `maxLag` (see [Lag-based producer throttling](#lag-based-producer-throttling)) |
1319
+ | `lagThrottle.maxLag` | — | Lag threshold (messages) above which sends are delayed (required when `lagThrottle` is set) |
1320
+ | `lagThrottle.groupId` | default group | Consumer group whose lag is monitored |
1321
+ | `lagThrottle.pollIntervalMs` | `5000` | How often (ms) to poll `getConsumerLag()` in the background |
1322
+ | `lagThrottle.maxWaitMs` | `30000` | Max time (ms) a send waits while throttled before proceeding anyway (best-effort, not hard back-pressure) |
1323
+ | `transport` | `ConfluentTransport` | Custom `KafkaTransport` implementation — target an alternative broker library or inject a deterministic fake in tests |
988
1324
 
989
1325
  **Module-scoped** (default) — import `KafkaModule` in each module that needs it:
990
1326
 
@@ -1147,6 +1483,50 @@ Deduplication state is **in-memory and per-consumer-instance**. Understand what
1147
1483
 
1148
1484
  Use this feature as a lightweight first line of defence — not as a substitute for idempotent business logic.
1149
1485
 
1486
+ ### Pluggable deduplication store
1487
+
1488
+ The in-memory limitation above is only the **default**. Pass a `store` in `deduplication` to back the per-partition clock with any external system — Redis, a database, anything — so deduplication survives process restarts and rebalances. The store implements the `DedupStore` interface:
1489
+
1490
+ ```typescript
1491
+ // Shape of the interface; import it with: import type { DedupStore } from '@drarzter/kafka-client';
1492
+
1493
+ interface DedupStore {
1494
+ // Return the last processed clock for a group + "topic:partition", or undefined.
1495
+ getLastClock(groupId: string, topicPartition: string): number | undefined | Promise<number | undefined>;
1496
+ // Persist the last processed clock for a group + "topic:partition".
1497
+ setLastClock(groupId: string, topicPartition: string, clock: number): void | Promise<void>;
1498
+ }
1499
+ ```
1500
+
1501
+ Both methods may be synchronous or return a promise. A minimal Redis-backed store:
1502
+
1503
+ ```typescript
+ // `RedisClient` is a placeholder for your Redis client type (for example from `redis` or `ioredis`); `DedupStore` is imported from the package.
1504
+ class RedisDedupStore implements DedupStore {
1505
+ constructor(private readonly redis: RedisClient) {}
1506
+
1507
+ private key(groupId: string, topicPartition: string) {
1508
+ return `dedup:${groupId}:${topicPartition}`;
1509
+ }
1510
+
1511
+ async getLastClock(groupId: string, topicPartition: string) {
1512
+ const raw = await this.redis.get(this.key(groupId, topicPartition));
1513
+ return raw === null ? undefined : Number(raw);
1514
+ }
1515
+
1516
+ async setLastClock(groupId: string, topicPartition: string, clock: number) {
1517
+ await this.redis.set(this.key(groupId, topicPartition), String(clock));
1518
+ }
1519
+ }
1520
+
1521
+ await kafka.startConsumer(['payments'], handler, {
1522
+ deduplication: { strategy: 'drop', store: new RedisDedupStore(redis) },
1523
+ });
1524
+ ```
1525
+
1526
+ **Failure semantics (fail-open):** if `getLastClock` or `setLastClock` throws or rejects, the error is logged and the message is treated as **not** a duplicate. A transient store outage never silently drops messages — it only weakens deduplication until the store recovers, biasing towards at-least-once delivery.
1527
+
1528
+ When `store` is omitted, the built-in `InMemoryDedupStore` is used — the in-session behaviour described above.
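
The fail-open rule can be sketched as a small wrapper around a store. This is an illustration under stated assumptions, not the library's internal code: `isDuplicate` and its `logError` parameter are hypothetical, the `DedupStore` interface is repeated locally so the block is self-contained, and treating `clock <= last` as the duplicate test is an assumption about how the per-partition clock is compared.

```typescript
// Local copy of the interface shape shown above, so the sketch is self-contained.
interface DedupStore {
  getLastClock(groupId: string, topicPartition: string): number | undefined | Promise<number | undefined>;
  setLastClock(groupId: string, topicPartition: string, clock: number): void | Promise<void>;
}

// Hypothetical helper illustrating fail-open semantics.
async function isDuplicate(
  store: DedupStore,
  groupId: string,
  topicPartition: string,
  clock: number,
  logError: (err: unknown) => void = () => {},
): Promise<boolean> {
  try {
    const last = await store.getLastClock(groupId, topicPartition);
    if (last !== undefined && clock <= last) return true; // assumed rule: not newer than the stored clock
    await store.setLastClock(groupId, topicPartition, clock);
    return false;
  } catch (err) {
    logError(err);
    return false; // fail open: a store outage must never drop a message
  }
}
```

Because `await` accepts plain values as well as promises, the same wrapper works with synchronous and asynchronous stores.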
1529
+
1150
1530
  ## Retry topic chain
1151
1531
 
1152
1532
  > **tl;dr — recommended production setup:**
@@ -1246,9 +1626,9 @@ Pausing is non-destructive: the consumer stays connected and Kafka preserves the
1246
1626
 
1247
1627
  ## Circuit breaker
1248
1628
 
1249
- Automatically pause delivery from a topic-partition when its DLQ error rate exceeds a threshold. After a recovery window the partition is resumed automatically.
1629
+ Automatically pause delivery from a topic-partition when its handler failure rate exceeds a threshold. After a recovery window the partition is resumed automatically.
1250
1630
 
1251
- **`dlq: true` is required** the breaker counts DLQ events as failures. Without it no failures are recorded and the circuit never opens.
1631
+ Failures are recorded at the handler-error boundary: every failed handler attempt counts (including in-process retries and retry-topic chain levels), independent of whether the message ends up in a DLQ. `dlq` is **not** required for the breaker to work.
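
For readers who want a mental model of the state machine, here is a minimal sliding-window breaker using the documented option names and defaults. It is a sketch, not the library's implementation: `SketchBreaker` is hypothetical, the clock is injected for testability, and reopening on a failure in HALF_OPEN is an assumption (a common convention that this section does not spell out).

```typescript
// Illustrative sliding-window breaker; option names mirror the documented ones.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface BreakerOptions {
  threshold: number;
  recoveryMs: number;
  windowSize: number;
  halfOpenSuccesses: number;
}

class SketchBreaker {
  private state: BreakerState = 'CLOSED';
  private outcomes: boolean[] = []; // true = failed handler attempt
  private openedAt = 0;
  private halfOpenStreak = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: Partial<BreakerOptions> = {}, now: () => number = () => Date.now()) {
    const threshold = opts.threshold ?? 5;
    this.opts = {
      threshold,
      recoveryMs: opts.recoveryMs ?? 30_000,
      windowSize: opts.windowSize ?? Math.max(threshold * 2, 10),
      halfOpenSuccesses: opts.halfOpenSuccesses ?? 1,
    };
    this.now = now;
  }

  // Current state, promoting OPEN to HALF_OPEN once the recovery window has elapsed.
  current(): BreakerState {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.opts.recoveryMs) {
      this.state = 'HALF_OPEN';
      this.halfOpenStreak = 0;
    }
    return this.state;
  }

  // Record one handler attempt and return the resulting state.
  record(failed: boolean): BreakerState {
    const state = this.current();
    if (state === 'OPEN') return state; // delivery is paused; nothing to record
    if (state === 'HALF_OPEN') {
      if (failed) return this.open(); // assumption: any failure reopens the circuit
      this.halfOpenStreak++;
      if (this.halfOpenStreak >= this.opts.halfOpenSuccesses) {
        this.state = 'CLOSED';
        this.outcomes = [];
      }
      return this.state;
    }
    this.outcomes.push(failed);
    if (this.outcomes.length > this.opts.windowSize) this.outcomes.shift();
    const failures = this.outcomes.filter(Boolean).length;
    return failures >= this.opts.threshold ? this.open() : this.state;
  }

  private open(): BreakerState {
    this.state = 'OPEN';
    this.openedAt = this.now();
    return this.state;
  }
}
```

Counting every failed attempt, rather than only DLQ events, means the window fills during in-process retries too, so the circuit can open before a message ever exhausts its retries.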
1252
1632
 
1253
1633
  Zero-config start — all options have sensible defaults:
1254
1634
 
@@ -1287,7 +1667,7 @@ Options:
1287
1667
 
1288
1668
  | Option | Default | Description |
1289
1669
  | ------ | ------- | ----------- |
1290
- | `threshold` | `5` | DLQ failures within `windowSize` that opens the circuit |
1670
+ | `threshold` | `5` | Failed handler attempts within `windowSize` that open the circuit |
1291
1671
  | `recoveryMs` | `30_000` | Milliseconds to wait in OPEN state before entering HALF_OPEN |
1292
1672
  | `windowSize` | `threshold × 2, min 10` | Sliding window size in messages |
1293
1673
  | `halfOpenSuccesses` | `1` | Consecutive successes in HALF_OPEN required to close the circuit |
@@ -1385,7 +1765,7 @@ await kafka.seekToTimestamp('payments-group', [
1385
1765
  ]);
1386
1766
  ```
1387
1767
 
1388
- Uses `admin.fetchTopicOffsetsByTime` under the hood. If no offset exists at the requested timestamp (e.g. the partition is empty or the timestamp is in the future), the partition falls back to `-1` (end of topic — new messages only).
1768
+ Uses `admin.fetchTopicOffsetsByTimestamp` under the hood. If no offset exists at the requested timestamp (e.g. the partition is empty or the timestamp is in the future), the partition falls back to the current high watermark (end of topic — new messages only).
1389
1769
 
1390
1770
  **Important:** the consumer group must be stopped before seeking. Assignments for the same topic are batched into a single `admin.setOffsets` call.
1391
1771
 
@@ -1691,6 +2071,117 @@ await kafka.startTransactionalConsumer(
1691
2071
 
1692
2072
  `retryTopics: true` is rejected at startup — EOS redelivery on failure is already guaranteed by the transaction. `autoCommit` is always `false` (managed internally).
1693
2073
 
2074
+ ## Transactional outbox
2075
+
2076
+ The transactional-outbox pattern decouples "write my business state" from "publish an event" so the two can never diverge. Application code writes an event row into an outbox table **in the same DB transaction** as its business writes; a relay polls that table and publishes the rows to Kafka, marking them published only after Kafka has acked them. If the process dies after the DB commit but before the publish, the row is still there and gets published on the next poll — the event is never lost.
2077
+
2078
+ `startOutboxRelay()` runs that relay against any `OutboxStore` you implement. The library never touches your database — you own the schema and the queries; it only needs to read unpublished rows oldest-first and durably mark rows published:
2079
+
2080
+ ```typescript
2081
+ import { startOutboxRelay, OutboxStore } from '@drarzter/kafka-client/core';
2082
+
2083
+ // Pseudo-Postgres store — you own the table and the SQL.
2084
+ const store: OutboxStore = {
2085
+ async fetchUnpublished(limit) {
2086
+ const { rows } = await pool.query(
2087
+ `SELECT id, topic, payload, key, correlation_id AS "correlationId",
2088
+ event_id AS "eventId", headers
2089
+ FROM outbox
2090
+ WHERE published_at IS NULL
2091
+ ORDER BY created_at ASC
2092
+ LIMIT $1`,
2093
+ [limit],
2094
+ );
2095
+ return rows;
2096
+ },
2097
+ async markPublished(ids) {
2098
+ await pool.query(`UPDATE outbox SET published_at = now() WHERE id = ANY($1)`, [ids]);
2099
+ },
2100
+ };
2101
+
2102
+ await kafka.connectProducer();
2103
+
2104
+ const relay = startOutboxRelay(kafka, store, {
2105
+ pollIntervalMs: 500, // default 1000
2106
+ batchSize: 200, // default 100 — rows fetched & published per tick
2107
+ onPublished: (n) => metrics.increment('outbox.published', n),
2108
+ onError: (err, batch) => logger.error(`outbox batch of ${batch.length} failed`, err),
2109
+ });
2110
+
2111
+ // On shutdown — stop() halts the timer and awaits any in-flight iteration:
2112
+ await relay.stop();
2113
+ await kafka.disconnect();
2114
+ ```
2115
+
2116
+ Meanwhile, application code inserts outbox rows inside its business transaction:
2117
+
2118
+ ```typescript
+ import { randomUUID } from 'node:crypto';
2119
+ // Inside a DB transaction, alongside your business INSERT/UPDATE:
2120
+ await tx.query(
2121
+ `INSERT INTO outbox (id, topic, payload, key, correlation_id, event_id)
2122
+ VALUES ($1, $2, $3, $4, $5, $6)`,
2123
+ [randomUUID(), 'orders.created', JSON.stringify(order), order.id, corrId, eventId],
2124
+ );
2125
+ ```
2126
+
2127
+ **Delivery guarantee: at-least-once.** Each poll publishes the whole batch inside **one Kafka transaction**, then marks the rows published. If the process crashes *after* the Kafka commit but *before* `markPublished`, those rows are re-published on the next tick — a **duplicate**. Persist a stable `eventId` on each row (surfaced as `x-event-id`) so consumers can deduplicate, either via this library's [Lamport-clock deduplication](#deduplication-lamport-clock) or an application-level idempotency check. Iterations never overlap; the loop never dies on error.
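
The ordering that produces at-least-once delivery is easiest to see in a single relay iteration. The sketch below is not the library's relay: `relayTick`, `SketchStore`, `OutboxRow`, and `Publish` are hypothetical, and `publish` merely stands in for "send the batch in one Kafka transaction".

```typescript
// Illustrative single iteration of an outbox relay; all names are hypothetical.
interface OutboxRow {
  id: string;
  topic: string;
  payload: string;
}

interface SketchStore {
  fetchUnpublished(limit: number): Promise<OutboxRow[]>;
  markPublished(ids: string[]): Promise<void>;
}

// Stands in for publishing the whole batch inside one Kafka transaction.
type Publish = (rows: OutboxRow[]) => Promise<void>;

async function relayTick(store: SketchStore, publish: Publish, batchSize = 100): Promise<number> {
  const rows = await store.fetchUnpublished(batchSize);
  if (rows.length === 0) return 0; // nothing to do this tick

  await publish(rows); // step 1: Kafka acks the batch
  // A crash between step 1 and step 2 leaves the rows unmarked,
  // so the next tick publishes them again: a duplicate, never a loss.
  await store.markPublished(rows.map((row) => row.id)); // step 2
  return rows.length;
}
```

Swapping the two steps would turn the guarantee into at-most-once, because a crash after marking but before publishing would lose the event.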
2128
+
2129
+ `OutboxStore` interface:
2130
+
2131
+ | Method | Description |
2132
+ | ------ | ----------- |
2133
+ | `fetchUnpublished(limit): Promise<OutboxMessage[]>` | Unpublished rows, oldest first, capped at `limit`. Empty array = nothing to do |
2134
+ | `markPublished(ids): Promise<void>` | Durably mark ids published; called only after Kafka acks. Idempotent |
2135
+
2136
+ An `InMemoryOutboxStore` (with `.add()`, `pendingCount`, `publishedCount`) ships for tests and as executable documentation — it is **not** durable, so it does not provide the "same DB transaction as the business write" guarantee that is the whole point of the pattern. A full Postgres reference implementation lives in [`src/integration/postgres-outbox.integration.spec.ts`](./src/integration/postgres-outbox.integration.spec.ts).
2137
+
2138
+ ## Schema Registry client
2139
+
2140
+ `SchemaRegistryClient` is a minimal, dependency-free client for the Confluent Schema Registry REST API (works with Confluent Platform/Cloud, Redpanda, Karapace, and the AWS Glue SR proxy). Its scope is deliberately narrow: **subject/version management and compatibility checks** — the pieces needed to keep your locally-defined schemas in lockstep with a central registry. Payload (de)serialisation stays JSON as everywhere else in this library; Avro/Protobuf **wire-format framing with magic bytes is intentionally out of scope**.
2141
+
2142
+ ```typescript
2143
+ import { SchemaRegistryClient } from '@drarzter/kafka-client/core';
2144
+
2145
+ const registry = new SchemaRegistryClient({
2146
+ baseUrl: 'http://localhost:8081',
2147
+ auth: { username: apiKey, password: apiSecret }, // optional HTTP Basic (Confluent Cloud)
2148
+ cacheTtlMs: 300_000, // latest-version cache TTL — default 5 min
2149
+ });
2150
+
2151
+ // Register (idempotent — re-registering the same schema returns the existing id)
2152
+ const { id } = await registry.registerSchema('order.created-value', JSON.stringify(orderJsonSchema), 'JSON');
2153
+
2154
+ // Fetch (getLatestSchema is cached; getSchemaVersion is not)
2155
+ const latest = await registry.getLatestSchema('order.created-value');
2156
+ const v2 = await registry.getSchemaVersion('order.created-value', 2);
2157
+
2158
+ // Check compatibility against the subject's policy without registering
2159
+ const ok = await registry.checkCompatibility('order.created-value', JSON.stringify(candidate));
2160
+ ```
2161
+
2162
+ | Method | Cached | Description |
2163
+ | ------ | ------ | ----------- |
2164
+ | `getLatestSchema(subject)` | yes (`cacheTtlMs`) | Latest `{ id, version, schema }` for a subject |
2165
+ | `getSchemaVersion(subject, version)` | no | A specific registered version |
2166
+ | `registerSchema(subject, schema, schemaType?)` | invalidates cache | Register (idempotent); returns `{ id }`. `schemaType` defaults to `'JSON'` |
2167
+ | `checkCompatibility(subject, schema, schemaType?)` | no | `true` when the registry reports the schema compatible |
2168
+
2169
+ `registrySchema()` bridges a registry subject to this library's `SchemaLike` seam so you can attach it to a `TopicDescriptor` like any other schema. On each `parse` it resolves the subject's latest version (cached), optionally verifies the message's `x-schema-version` is not newer than what is registered, and delegates structural validation to a local validator:
2170
+
2171
+ ```typescript
2172
+ import { topic, registrySchema } from '@drarzter/kafka-client/core';
2173
+ import { z } from 'zod';
2174
+
2175
+ const OrderCreated = topic('order.created').schema(
2176
+ registrySchema(registry, 'order.created-value', {
2177
+ validator: z.object({ orderId: z.string() }), // local runtime shape check
2178
+ enforceVersion: true, // default — fail loudly if the message version outruns the registry
2179
+ }),
2180
+ );
2181
+ ```
2182
+
2183
+ The division of labour: the **registry governs schema evolution** (compatibility across versions); the **local validator governs runtime shape**. When `enforceVersion` is `true` (the default) a producer publishing a version newer than the latest registered version fails loudly rather than drifting silently.
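
Two small pieces make up the behaviour described here: a TTL cache for the "latest version" lookup and a version guard. The sketch below is hypothetical (`TtlCache` and `assertVersionRegistered` are not library exports, and the error wording is invented); it only illustrates the idea under the assumption that the cache is keyed by subject.

```typescript
// Hypothetical sketch; registrySchema()'s internals are not shown in this document.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (this.now() >= hit.expiresAt) {
      this.entries.delete(key); // stale: force a fresh registry lookup
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Fails loudly when a message claims a version the registry has not seen yet.
function assertVersionRegistered(subject: string, messageVersion: number, latestRegistered: number): void {
  if (messageVersion > latestRegistered) {
    throw new Error(
      `${subject}: message schema version ${messageVersion} is newer than latest registered version ${latestRegistered}`,
    );
  }
}
```

One consequence of caching worth keeping in mind: a version registered by another process may not be visible to this guard until the cached entry expires.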
2184
+
1694
2185
  ## Admin API
1695
2186
 
1696
2187
  Inspect consumer groups, topic metadata, and delete records via the built-in admin client — no separate connection needed.
@@ -1735,6 +2226,34 @@ await kafka.deleteRecords('orders.created', [
1735
2226
 
1736
2227
  Pass `offset: '-1'` to delete all records in a partition (truncate completely).
1737
2228
 
2229
+ ## DLQ CLI
2230
+
2231
+ The package ships a `kafka-client-dlq` binary for inspecting and re-publishing dead letter queues from the terminal — no code needed. It operates on `<topic>.dlq` topics and delegates replay to `KafkaClient.replayDlq`:
2232
+
2233
+ ```bash
2234
+ # List every .dlq topic with its message count (optionally filtered by base-topic prefix)
2235
+ kafka-client-dlq ls --brokers localhost:9092 [--prefix orders]
2236
+
2237
+ # Print up to N messages from <topic>.dlq — offset, x-dlq-* headers, and value
2238
+ kafka-client-dlq peek --brokers localhost:9092 --topic orders.created [--limit 5]
2239
+
2240
+ # Re-publish <topic>.dlq to its original topic (or --target), full or incremental
2241
+ kafka-client-dlq replay --brokers localhost:9092 --topic orders.created [--target orders.manual] [--dry-run] [--from-beginning | --incremental]
2242
+ ```
2243
+
2244
+ | Flag | Command | Description |
2245
+ | ---- | ------- | ----------- |
2246
+ | `--brokers <list>` | all | Comma-separated broker addresses (**required**) |
2247
+ | `--prefix <name>` | `ls` | Only show DLQ topics whose base name starts with `<name>` |
2248
+ | `--topic <name>` | `peek`, `replay` | Base topic name — the CLI reads `<name>.dlq` |
2249
+ | `--limit <n>` | `peek` | Max messages to print (default `10`) |
2250
+ | `--target <t>` | `replay` | Override destination topic (default: `x-dlq-original-topic` header) |
2251
+ | `--dry-run` | `replay` | Count what would be replayed without publishing |
2252
+ | `--from-beginning` | `replay` | Full replay of all DLQ messages every call (default) |
2253
+ | `--incremental` | `replay` | Only messages added since the previous replay |
2254
+
2255
+ `--from-beginning` and `--incremental` are mutually exclusive. Run `kafka-client-dlq --help` (or with no arguments) for the full usage text.
2256
+
1738
2257
  ## Graceful shutdown
1739
2258
 
1740
2259
  `disconnect()` now drains in-flight handlers before tearing down connections — no messages are silently cut off mid-processing.
@@ -1899,6 +2418,20 @@ If the handler hasn't resolved within the window, a `warn` is logged:
1899
2418
 
1900
2419
  The handler is **not** cancelled — the warning is diagnostic only. Combine with `retry` to automatically give up after a fixed number of slow attempts.
1901
2420
 
2421
+ ## Static group membership
2422
+
2423
+ Set `groupInstanceId` in `ConsumerOptions` to give a consumer a **static** identity (`group.instance.id`). A member that restarts within its configured `session.timeout.ms` rejoins the group with the same partition assignment and triggers **no rebalance**. This is ideal for Kubernetes rolling restarts and short redeploys, where a transient rebalance would otherwise stall every consumer in the group:
2424
+
2425
+ ```typescript
2426
+ await kafka.startConsumer(['orders'], handler, {
2427
+ groupInstanceId: `orders-svc-${process.env.HOSTNAME}`,
2428
+ });
2429
+ ```
2430
+
2431
+ The id must be **unique per member** within the consumer group — derive it from a stable per-pod value such as the StatefulSet ordinal or hostname. Two live members sharing the same `groupInstanceId` are fenced by the broker.
2432
+
2433
+ `groupInstanceId` is applied only to the consumer you set it on. It is **not** propagated to retry-chain companion consumers — those run in their own groups (`<groupId>-retry.N`) and rebalance independently. It can also be supplied via the `KAFKA_CONSUMER_GROUP_INSTANCE_ID` environment variable (see [Environment configuration](#environment-configuration)).
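
One way to derive a stable id from a StatefulSet pod is to reuse the pod's ordinal. The helper below is a hypothetical sketch, not part of the library; it assumes the hostname follows the StatefulSet convention of ending in `-<ordinal>` and returns `undefined` otherwise, so the option can simply be left unset.

```typescript
// Hypothetical helper, not part of the library. Assumes StatefulSet pod names,
// which end in a numeric ordinal (for example "orders-svc-2").
function staticInstanceId(service: string, hostname: string | undefined): string | undefined {
  const match = hostname === undefined ? null : /-(\d+)$/.exec(hostname);
  if (match === null) return undefined; // no stable ordinal: fall back to dynamic membership
  return `${service}-${match[1]}`;
}

const instanceId = staticInstanceId('orders-svc', 'orders-svc-2');
```

Pods created by a Deployment get a fresh random suffix on every restart, so a hostname-derived id would not stay stable there and static membership would bring little benefit.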
2434
+
1902
2435
  ## Schema validation
1903
2436
 
1904
2437
  Add runtime message validation using any library with a `.parse()` method — Zod, Valibot, ArkType, or a custom validator. No extra dependency required.
@@ -2019,6 +2552,70 @@ interface SchemaParseContext {
2019
2552
 
2020
2553
  Existing validators (Zod, Valibot, ArkType, custom) that only use the first argument continue to work unchanged — the second argument is silently ignored.
2021
2554
 
2555
+ ### Versioned schemas
2556
+
2557
+ `versionedSchema()` composes per-version validators into a single `SchemaLike` that dispatches on the message's `x-schema-version` header (via `SchemaParseContext.version`). Pass a map of version number → validator, plus an optional `migrate` hook that upgrades older shapes to the latest:
2558
+
2559
+ ```typescript
2560
+ import { topic, versionedSchema } from '@drarzter/kafka-client';
2561
+ import { z } from 'zod';
2562
+
2563
+ const OrderSchema = versionedSchema<{ orderId: string; amountMinor: number }>(
2564
+ {
2565
+ 1: z.object({ orderId: z.string(), amount: z.number() }), // legacy: major units
2566
+ 2: z.object({ orderId: z.string(), amountMinor: z.number().int() }), // current: minor units
2567
+ },
2568
+ {
2569
+ // migrate(data, fromVersion, latestVersion) → data in its latest shape
2570
+ migrate: (data, from) =>
2571
+ from === 1
2572
+ ? { orderId: data.orderId, amountMinor: Math.round(data.amount * 100) }
2573
+ : data,
2574
+ },
2575
+ );
2576
+
2577
+ const OrderCreated = topic('order.created').schema(OrderSchema);
2578
+ ```
2579
+
2580
+ Dispatch rules:
2581
+
2582
+ - **Consume path** — the version comes from the `x-schema-version` header (defaults to `1` when absent).
2583
+ - **Send path** — the version comes from `SendOptions.schemaVersion` (defaults to `1`).
2584
+ - **No parse context** (a direct `schema.parse(data)` call) — the **latest** registered version is assumed.
2585
+
2586
+ After a non-latest version is parsed, `migrate` (if provided) is called so your handler always receives the latest shape. Without a `migrate` hook, older versions are returned as parsed and callers must handle shape differences themselves.
2587
+
2588
+ A message carrying a version with **no registered schema throws** — the error lists every registered version rather than validating against the wrong shape, so a misconfigured producer fails loudly:
2589
+
2590
+ ```text
2591
+ versionedSchema: no schema registered for version 3 (topic "order.created") — registered versions: 1, 2
2592
+ ```
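
The dispatch rules above fit in a few lines. The following is a minimal sketch of the technique, not the library's `versionedSchema()`: `sketchVersioned`, `Parser`, `ParseContext`, and `Migrate` are hypothetical names, and the error wording is simplified.

```typescript
// Illustrative dispatcher; `sketchVersioned` is hypothetical.
interface Parser {
  parse(data: unknown): unknown;
}

interface ParseContext {
  version?: number;
}

type Migrate<T> = (data: unknown, fromVersion: number, latestVersion: number) => T;

function sketchVersioned<T>(versions: Record<number, Parser>, migrate?: Migrate<T>) {
  const registered = Object.keys(versions).map(Number).sort((a, b) => a - b);
  const latest = registered[registered.length - 1];
  if (latest === undefined) throw new Error('at least one version is required');

  return {
    parse(data: unknown, ctx?: ParseContext): T {
      // With a parse context the version defaults to 1; without one, latest is assumed.
      const version = ctx === undefined ? latest : (ctx.version ?? 1);
      const schema = versions[version];
      if (schema === undefined) {
        throw new Error(
          `no schema registered for version ${version}; registered versions: ${registered.join(', ')}`,
        );
      }
      const parsed = schema.parse(data);
      // Older shapes are upgraded only when a migrate hook was supplied.
      return version !== latest && migrate ? migrate(parsed, version, latest) : (parsed as T);
    },
  };
}
```

Validating against the exact version first and migrating afterwards keeps each validator simple: no single schema has to accept every historical shape.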
2593
+
2594
+ ## Constructor options validation
2595
+
2596
+ The `KafkaClient` constructor validates its arguments up front. If anything is invalid it throws a **single aggregated error** listing every problem at once, so a misconfigured client fails at construction with a clear message instead of surfacing a confusing driver error on first use:
2597
+
2598
+ ```typescript
2599
+ new KafkaClient('', '', [], { numPartitions: 0 });
2600
+ // throws:
2601
+ // KafkaClient: invalid configuration:
2602
+ // - clientId must be a non-empty string
2603
+ // - groupId must be a non-empty string
2604
+ // - brokers must be a non-empty array of broker addresses
2605
+ // - numPartitions must be a positive integer (got 0)
2606
+ ```
2607
+
2608
+ Checks performed:
2609
+
2610
+ - `clientId` and `groupId` must be non-empty strings.
2611
+ - `brokers` must be a non-empty array with no empty entries — **unless** a custom `transport` is supplied (e.g. `FakeTransport` in tests), in which case an empty `brokers` array is allowed since no broker is dialled.
2612
+ - `numPartitions`, when set, must be a positive integer.
2613
+ - `transactionalId`, when set, must be non-empty.
2614
+ - `clockRecovery.topics` must be an array; `clockRecovery.timeoutMs`, when set, must be `> 0`.
2615
+ - `lagThrottle.maxLag` must be `>= 0`; `lagThrottle.pollIntervalMs` must be `> 0`; `lagThrottle.maxWaitMs` must be `>= 0` (each validated only when set).
2616
+
2617
+ This applies to both `new KafkaClient(...)` and `KafkaModule.register()` / `registerAsync()`, which construct the client under the hood.
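
The "collect everything, throw once" approach can be sketched as below. This is an illustration only: `collectProblems`, `assertValidConfig`, and the `hasCustomTransport` flag are hypothetical stand-ins, and only four of the documented checks are reproduced.

```typescript
// Illustrative only: gathers every problem before throwing a single error.
interface SketchClientOptions {
  numPartitions?: number;
  hasCustomTransport?: boolean; // stands in for "a custom transport was supplied"
}

function collectProblems(
  clientId: string,
  groupId: string,
  brokers: string[],
  opts: SketchClientOptions = {},
): string[] {
  const problems: string[] = [];
  if (clientId === '') problems.push('clientId must be a non-empty string');
  if (groupId === '') problems.push('groupId must be a non-empty string');

  const brokersOk = brokers.length > 0 && brokers.every((b) => b !== '');
  const emptyButAllowed = brokers.length === 0 && opts.hasCustomTransport === true;
  if (!brokersOk && !emptyButAllowed) {
    problems.push('brokers must be a non-empty array of broker addresses');
  }

  const n = opts.numPartitions;
  if (n !== undefined && (!Number.isInteger(n) || n <= 0)) {
    problems.push(`numPartitions must be a positive integer (got ${n})`);
  }
  return problems;
}

function assertValidConfig(
  clientId: string,
  groupId: string,
  brokers: string[],
  opts: SketchClientOptions = {},
): void {
  const problems = collectProblems(clientId, groupId, brokers, opts);
  if (problems.length > 0) {
    const lines = ['KafkaClient: invalid configuration:', ...problems.map((p) => `  - ${p}`)];
    throw new Error(lines.join('\n'));
  }
}
```

Reporting all problems in one error saves a fix-and-rerun cycle per mistake, which matters most when configuration comes from environment variables set in several places.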
2618
+
2022
2619
  ## Health check
2023
2620
 
2024
2621
  Monitor Kafka connectivity with the built-in health indicator:
@@ -2129,6 +2726,26 @@ The integration suite spins up a single-node KRaft Kafka container and tests sen
2129
2726
 
2130
2727
  Both suites run in CI on every push to `main` and on pull requests.
2131
2728
 
2729
+ **Chaos suite** — fault-injection tests (broker restarts, forced rebalances) that verify redelivery and offset-commit guarantees under failure:
2730
+
2731
+ ```bash
2732
+ npm run test:chaos
2733
+ ```
2734
+
2735
+ **Benchmark** — measure the wrapper's overhead over the raw driver:
2736
+
2737
+ ```bash
2738
+ npm run bench
2739
+ ```
2740
+
2741
+ The throughput benchmark reports roughly **2% overhead** versus using `@confluentinc/kafka-javascript` directly: the typed envelope, Lamport clock, and instrumentation hooks cost very little on the hot path.
2742
+
2743
+ **Clean up stray containers** — if a Testcontainers run is interrupted, remove leftover containers:
2744
+
2745
+ ```bash
2746
+ npm run containers:clean
2747
+ ```
2748
+
2132
2749
  ## File naming conventions
2133
2750
 
2134
2751
  Hyphens within a multi-word name; dot separates the name from its role suffix.