@eventferry/kafka 3.4.0 → 3.5.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,50 @@
1
1
  # @eventferry/kafka
2
2
 
3
+ ## 3.5.0
4
+
5
+ ### Minor Changes
6
+
7
+ - fb0549d: Producer-fenced restart. `PRODUCER_FENCED` and `INVALID_PRODUCER_EPOCH` errors now classify as `errorKind: "fenced"` (previously bundled into `"fatal"`). The new kind is documented as **transient by default** — fences also fire on broker restart and network partition recovery, not only on multi-instance collisions.
8
+
9
+ New publisher option `autoRecoverFromFence: boolean` (default `false`): when on, a publish batch reporting at least one fenced result triggers exactly one `disconnect → connect → re-send same batch` cycle. Transactional producers re-run `initTransactions` as part of the reconnect. If the second send still reports any fenced record, the publisher gives up — silently retrying again would mask a real misconfiguration. Concurrent fenced publishes share a single in-flight reconnect so the producer is not torn down twice mid-restart.
10
+
11
+ New `KafkaPublisherHooks.onProducerFenced(error)` hook fires regardless of the recovery flag — informational signal so dashboards can track fence rates whether or not the publisher attempts recovery.
12
+
13
+ `@eventferry/core` minor: `PublishErrorKind` union gains `"fenced"`. The relay treats unknown / `"retriable"` / `"fenced"` identically (retry per backoff, DLQ on `attempts > maxAttempts`) — no relay-level changes required, but the new kind shows up in logs and the `errorKind` field of `PublishResult`.
14
+
15
+ Multi-instance EOS guidance: leave `autoRecoverFromFence` OFF and use a callable `transactionalId` that derives a stable, unique id per instance (pod name + replica index). Cross-instance fence is the broker telling the loser instance to stop — recovering silently creates a thrashing leadership flip. The README now spells this out in a `Producer-fenced restart` section.
16
+
17
+ - 08d3384: `publisher.healthCheck({ timeoutMs })` — cheap reachability probe usable as the body of `/healthz` or `/readyz`. Borrows a fresh admin client, calls `listTopics`, and returns a stable `HealthStatus` shape: `{ ok, latencyMs, timestamp, error? }`. Default timeout 5000 ms (long enough to ride out a single broker leader election, short enough to fail a liveness probe meaningfully); `timeoutMs: 0` disables the timer entirely.
18
+
19
+ What it proves: the broker is reachable AND the configured credentials still authenticate. What it does NOT prove: the producer's send path is fully operational — a fenced transactional producer would still answer healthy here. Documented as "broker reachable + auth still good", not "publisher fully operational".
20
+
21
+ The borrowed admin is always closed (success, failure, timeout — try/finally). Admin-side close failures are swallowed; health checks aren't the place to crash. Custom drivers without an `admin()` method return `{ ok: false, error: ... }` instead of the throw `publisher.admin()` would surface.
22
+
23
+ - 90b69c6: librdkafka stats hook on the confluent driver. New `onStats: (stats) => void` callback receives the librdkafka periodic statistics JSON, already parsed to a plain object — pipe queue depth, broker latencies, txmsgs counters, per-topic/per-partition stats into your metrics stack without a second client. The wrapper swallows callback exceptions and JSON parse failures so a misbehaving observer cannot take down the producer's event loop. `statsIntervalMs` controls the polling interval; defaults to 30000 ms when `onStats` is set, stays OFF otherwise (librdkafka CPU-bills the JSON serialization every tick — we don't enable it silently). `rawProducerConfig` still wins on precedence. kafkajs driver warns once and ignores both options — kafkajs has no equivalent surface.
24
+
25
+ ### Patch Changes
26
+
27
+ - 715523f: Consumer-side documentation. No API change. The root README gains:
28
+
29
+ - **`Consuming what eventferry produced`** — canonical loop showing `decode(message)` → `extractTraceContext(headers)` → `defineOutbox(registry).decode(topic, bytes)`. Same registry the producer used, in reverse, returns the typed validated payload.
30
+ - **`Consuming the DLQ`** — copy-paste handler that routes by `dlq-error-class` (cleaner than parsing `dlq-reason`), pulls `dlq-attempts` for retry-queue accounting, and shows the alert-vs-retry split.
31
+
32
+ The `@eventferry/kafka` README adds matching subsections under the existing `Consumer helpers` block: **`Typed payload via the producer-side registry`** and **`DLQ recipe`**.
33
+
34
+ `defineOutbox(registry).decode()` was already shipped — the round just makes the symmetric "same registry, both sides" pattern discoverable.
35
+
36
+ - ba81a78: Hardened TLS configuration documentation. No API change — `ssl.ca`, `ssl.servername`, and the rest of `TlsConfig` were already on the surface. This round:
37
+
38
+ - Expanded the `TlsConfig` JSDoc with the driver-parity gap: `servername` is honored by the **kafkajs** driver (Node `tls.connect` reads it directly) but is a documented **no-op on the confluent driver** — librdkafka v1.x's kafkaJS-compat layer doesn't expose an SNI override.
39
+ - README gained explicit "Dev cluster with a self-signed cert" and "IP-literal brokers (cert hostname mismatch)" sections with copy-paste examples covering CA pinning + `servername` for SNI/SAN alignment.
40
+ - Reaffirmed that `rejectUnauthorized: false` is **never** going to ship on this surface. TLS verification is non-negotiable. For dev clusters with self-signed certs, the supported pattern is to pass the cluster CA via `ssl.ca` so verification still happens — just against your CA instead of the system trust store.
41
+
42
+ Companion library updates (changesets, dependabot) on the way; this patch only touches comments + README, so the change is safe to consume immediately.
43
+
44
+ - Updated dependencies [715523f]
45
+ - Updated dependencies [fb0549d]
46
+ - @eventferry/core@3.4.0
47
+
3
48
  ## 3.4.0
4
49
 
5
50
  ### Minor Changes
package/README.md CHANGED
@@ -67,6 +67,40 @@ new KafkaPublisher({
67
67
  > non-negotiable. For dev clusters with self-signed certs, pass the cluster
68
68
  > CA via `ca` so verification succeeds.
69
69
 
70
+ ### Dev cluster with a self-signed cert
71
+
72
+ The right pattern is to pin **your** CA. Verification still happens — just against your CA instead of the system trust store.
73
+
74
+ ```ts
75
+ new KafkaPublisher({
76
+   brokers: ["dev-broker.internal:9093"],
77
+   ssl: {
78
+     ca: readFileSync("/path/to/dev-cluster-ca.pem"),
79
+     // Cluster reachable via DNS that doesn't match the cert SAN?
80
+     // Pin the SNI host the cert was issued for:
81
+     servername: "kafka.dev.internal",
82
+   },
83
+ });
84
+ ```
85
+
86
+ **Never** add `rejectUnauthorized: false` (TS would reject it anyway — it's not in the type). That disables verification entirely and opens every connection to a man-in-the-middle.
87
+
88
+ ### IP-literal brokers (cert hostname mismatch)
89
+
90
+ When the broker address is an IP and the cert was issued for a hostname, set `servername`:
91
+
92
+ ```ts
93
+ new KafkaPublisher({
94
+   brokers: ["10.0.5.12:9093"], // IP literal
95
+   ssl: {
96
+     ca: readFileSync("/etc/ssl/kafka-ca.pem"),
97
+     servername: "broker.example.com", // hostname the cert was issued for
98
+   },
99
+ });
100
+ ```
101
+
102
+ `servername` is honored by the **kafkajs** driver (Node `tls.connect` reads `servername` directly). It's a **documented no-op on the confluent driver** — librdkafka v1.x's kafkaJS-compat layer doesn't expose an SNI override, and SNI is derived from the broker address. Use the kafkajs driver when you need the SNI lever.
103
+
70
104
  ### SASL — username + password (PLAIN / SCRAM)
71
105
 
72
106
  ```ts
@@ -289,6 +323,107 @@ const tracer: KafkaTracer = {
289
323
 
290
324
  The publisher clones each outbound message before injecting (the caller's `PublishableMessage` is never mutated, so the relay's retry path stays correct).
291
325
 
326
+ ## Health check
327
+
328
+ Cheap reachability probe — useful as the body of a `/healthz` or `/readyz` endpoint:
329
+
330
+ ```ts
331
+ import express from "express";
332
+ const app = express();
333
+
334
+ app.get("/healthz", async (_req, res) => {
335
+   const status = await publisher.healthCheck({ timeoutMs: 3_000 });
336
+   res.status(status.ok ? 200 : 503).json({
337
+     ok: status.ok,
338
+     latencyMs: status.latencyMs,
339
+     error: status.error?.message,
340
+   });
341
+ });
342
+ ```
343
+
344
+ `publisher.healthCheck()` opens a fresh admin, calls `listTopics`, and returns:
345
+
346
+ ```ts
347
+ interface HealthStatus {
348
+   ok: boolean; // broker answered within timeout
349
+   latencyMs: number; // probe wall-clock
350
+   timestamp: number; // epoch ms when the probe started
351
+   error?: Error; // present when ok === false
352
+ }
353
+ ```
354
+
355
+ Default `timeoutMs: 5_000` — long enough to ride out a single broker leader election, short enough to fail a liveness probe meaningfully. Set `timeoutMs: 0` to disable the timer.
356
+
357
+ **What this proves**: the broker is reachable AND the configured credentials still authenticate. **What this does NOT prove**: the producer's send path is fully operational — a fenced transactional producer would still answer healthy here. Treat the result as "broker reachable + auth still good", not "publisher fully operational".
358
+
359
+ The borrowed admin is always closed (success or failure). Admin-side close failures don't change the outcome — health checks aren't the place to crash.
360
+
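One detail matters for probe budgets. In the `healthCheck` body shipped in `dist/index.cjs` in this release, `timeoutMs` is applied to the `listTopics()` call, while `admin.connect()` is awaited before the timer starts. If a probe needs a hard upper bound on the whole call, a caller-side deadline is one option. The following is a minimal sketch under that reading of the code; `withDeadline` is a hypothetical helper and not part of `@eventferry/kafka`:

```ts
// Hypothetical helper: resolves with `p`'s value, or with `fallback()` if
// `ms` elapses first. The losing promise is not cancelled; its eventual
// result is simply ignored.
function withDeadline<T>(p: Promise<T>, ms: number, fallback: () => T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback()), ms);
    p.then(
      (v) => {
        clearTimeout(timer);
        resolve(v);
      },
      (e) => {
        clearTimeout(timer);
        reject(e);
      },
    );
  });
}

// Possible usage inside a probe handler (publisher is assumed to exist):
// const status = await withDeadline(
//   publisher.healthCheck({ timeoutMs: 3_000 }),
//   4_000,
//   () => ({ ok: false, latencyMs: 4_000, timestamp: Date.now(),
//            error: new Error("probe deadline exceeded") }),
// );
```

The fallback returns a value rather than throwing so the handler can keep its single `status.ok ? 200 : 503` branch.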
361
+ ## Producer-fenced restart
362
+
363
+ `PRODUCER_FENCED` and `INVALID_PRODUCER_EPOCH` errors classify as `errorKind: "fenced"` — a distinct kind from `fatal` because some fences are **transient** (broker restart, network partition recovery) rather than a permanent multi-instance collision.
364
+
365
+ ### `autoRecoverFromFence: true`
366
+
367
+ Opt in to a single transparent reconnect-and-retry when a publish batch reports a fence:
368
+
369
+ ```ts
370
+ new KafkaPublisher({
371
+   brokers,
372
+   transactional: true,
373
+   transactionalId: "orders-publisher",
374
+   autoRecoverFromFence: true,
375
+ });
376
+ ```
377
+
378
+ What happens on a fenced batch:
379
+
380
+ 1. The `onProducerFenced(error)` hook fires (regardless of the recovery flag — informational).
381
+ 2. The driver is disconnected and reconnected (re-running `initTransactions` for transactional producers).
382
+ 3. The same batch is resent **once**.
383
+ 4. If the second send still reports any fenced record, the publisher gives up and surfaces those failures unchanged — silently retrying again would mask a misconfiguration.
384
+
385
+ Concurrent fenced publishes share a single in-flight reconnect — the producer is not torn down twice while a recovery is in progress.
386
+
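The shared in-flight reconnect is an instance of the general "single-flight" pattern. The sketch below restates that pattern in isolation and is not the publisher's code; `singleFlight` is a hypothetical name. Callers that arrive while a run is in progress receive the same promise, and the slot is cleared once the run settles so a later fence can trigger a fresh reconnect.

```ts
// Hypothetical helper illustrating the single-flight pattern.
function singleFlight<T>(task: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    // A run is already in progress: hand back the same promise.
    if (inFlight) return inFlight;
    const run = task().finally(() => {
      // Clear the slot whether the run succeeded or failed.
      inFlight = null;
    });
    inFlight = run;
    return run;
  };
}
```

With a wrapper like this, two publishes that both observe a fence would await one `disconnect → connect` cycle instead of starting two.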
387
+ **Default is `false`** to preserve the previous "fenced → propagate to relay" behavior. The relay will retry fenced records under the configured backoff and DLQ them when `attempts > retry.maxAttempts`.
388
+
389
+ ### `transactional.id` strategy for multi-instance EOS
390
+
391
+ When running multiple producer instances against the same logical workload, each instance MUST have a stable, unique `transactionalId`. Use the callable form to derive it from runtime context:
392
+
393
+ ```ts
394
+ new KafkaPublisher({
395
+   brokers,
396
+   transactional: true,
397
+   transactionalId: () => `${process.env.POD_NAME}-${process.env.HOSTNAME}`,
398
+   // Leave autoRecoverFromFence OFF — a fence means a real collision
399
+   // worth surfacing.
400
+ });
401
+ ```
402
+
403
+ Cross-instance fence is **not** a transient blip — it's the broker telling one of you that the other is now the canonical producer. Auto-recovery would create a thrashing leadership flip. Keep the option off in multi-instance setups and let the loser instance fail loudly.
404
+
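The changelog describes the id as "pod name + replica index". One way to get a replica index is to read the ordinal suffix of a StatefulSet-style pod name. The sketch below assumes pod names end in `-<ordinal>` (for example `orders-publisher-2`); `transactionalIdFor` is a hypothetical helper, and Deployment pods with random suffixes would not satisfy this assumption.

```ts
// Hypothetical helper: derive a stable per-instance transactional id from a
// pod name that ends in a numeric ordinal, e.g. "orders-publisher-2".
function transactionalIdFor(prefix: string, podName: string | undefined): string {
  const match = /-(\d+)$/.exec(podName ?? "");
  if (!match) {
    // Fail loudly instead of falling back to a shared id, which would make
    // instances fence each other.
    throw new Error(`cannot derive replica index from pod name: ${String(podName)}`);
  }
  return `${prefix}-${match[1]}`;
}
```

It could be wired in as `transactionalId: () => transactionalIdFor("orders-publisher", process.env.POD_NAME)`, assuming `POD_NAME` is injected into the container environment.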
405
+ ## librdkafka stats hook
406
+
407
+ The confluent driver exposes librdkafka's periodic statistics stream as a typed callback. Useful for piping queue depth, broker latency, broker timeout counts, and per-topic/per-partition counters into your metrics stack.
408
+
409
+ ```ts
410
+ new KafkaPublisher({
411
+   brokers,
412
+   driver: "confluent",
413
+   onStats: (stats) => {
414
+     // stats is opaque librdkafka JSON. Reach for the fields you care about.
415
+     promClient.gauge("kafka_msg_cnt").set(stats.msg_cnt as number);
416
+     promClient.gauge("kafka_txmsgs").set(stats.txmsgs as number);
417
+   },
418
+   statsIntervalMs: 30_000, // optional; defaults to 30s when onStats is set
419
+ });
420
+ ```
421
+
422
+ - **`onStats`** receives the librdkafka stats JSON, already parsed to a plain object. The schema is opaque (`Record<string, unknown>`) — librdkafka's stats are huge and evolve across versions. Reference: [librdkafka STATISTICS.md](https://github.com/confluentinc/librdkafka/blob/master/STATISTICS.md).
423
+ - **`statsIntervalMs`** maps to librdkafka's `statistics.interval.ms`. **Defaults to 30000 ms when `onStats` is set; otherwise stays off** (librdkafka CPU-bills the JSON serialization every tick — we don't enable it silently).
424
+ - The wrapper swallows callback exceptions and JSON parse failures — a single dropped sample is preferable to taking down the producer's event loop.
425
+ - **No-op on the kafkajs driver** — kafkajs has no equivalent surface. Logs a one-time warning and ignores both options.
426
+
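Because the stats object is typed as `Record<string, unknown>`, a small amount of runtime checking avoids feeding `undefined` or a string into a gauge after a librdkafka upgrade. The sketch below shows one way to do that. The helper names are hypothetical, and the field names (`brokers`, `rtt.avg`) are taken from librdkafka's STATISTICS.md as an assumption to verify against the version in use.

```ts
type LibrdkafkaStats = Record<string, unknown>;

// Returns the value at `key` only if it is a finite number.
function readNumber(obj: Record<string, unknown>, key: string): number | undefined {
  const v = obj[key];
  return typeof v === "number" && Number.isFinite(v) ? v : undefined;
}

// Collects the per-broker average round-trip time keyed by broker name,
// skipping brokers whose entry lacks a numeric `rtt.avg`.
function brokerRttAvg(stats: LibrdkafkaStats): Map<string, number> {
  const out = new Map<string, number>();
  const brokers = stats["brokers"];
  if (typeof brokers !== "object" || brokers === null) return out;
  for (const [name, broker] of Object.entries(brokers as Record<string, unknown>)) {
    if (typeof broker !== "object" || broker === null) continue;
    const rtt = (broker as Record<string, unknown>)["rtt"];
    if (typeof rtt !== "object" || rtt === null) continue;
    const avg = readNumber(rtt as Record<string, unknown>, "avg");
    if (avg !== undefined) out.set(name, avg);
  }
  return out;
}
```

Inside `onStats`, `readNumber(stats, "msg_cnt")` could replace the `as number` cast, with the gauge updated only when the result is defined.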
292
427
  ## Power-user escape hatches
293
428
 
294
429
  When the high-level options don't reach a knob you need, drop down to the native client config.
@@ -455,6 +590,50 @@ await consumer.run({
455
590
 
456
591
  `extractTraceContext` returns `null` if no `traceparent` header is present or it fails W3C validation (all-zero IDs, `version: ff`, malformed hex). It accepts both raw consumer headers (Buffer values) and already-decoded headers (string values).
457
592
 
593
+ #### Typed payload via the producer-side registry
594
+
595
+ When your consumer lives in the same monorepo as the producer, hand the decoded bytes to the **same `defineOutbox(registry)`** you used to enqueue. `decode` validates against the topic's Standard Schema and returns the typed payload:
596
+
597
+ ```ts
598
+ import { defineOutbox } from "@eventferry/core";
599
+ import { decode } from "@eventferry/kafka/consume";
600
+ import { registry } from "./outbox-registry";
601
+
602
+ const events = defineOutbox(registry); // no store — consumer side
603
+
604
+ await consumer.run({
605
+   eachMessage: async ({ message }) => {
606
+     const m = decode(message, { decoder: "utf8" });
607
+     const event = await events.decode("orders.created", m.value!);
608
+     //    ^? { orderId: string; total: number }
609
+     await handle(event);
610
+   },
611
+ });
612
+ ```
613
+
614
+ `events.decode(topic, bytes)` throws `OutboxValidationError` if the topic isn't in the registry or the payload doesn't match the schema. Cross-language consumers (Go, Java, Python) skip the companion and use their own schema tooling — Confluent Schema Registry for typed wire formats.
615
+
616
+ #### DLQ recipe
617
+
618
+ Records that exhaust retries land on `${topic}.dlq` (or your configured DLQ topic) carrying enriched headers `dlq-reason`, `dlq-error-class`, `dlq-original-topic`, `dlq-failed-at`, `dlq-attempts` (and optionally `dlq-stack` when you opt in). Route them with `dlq-error-class` rather than parsing `dlq-reason`:
619
+
620
+ ```ts
621
+ await dlqConsumer.run({
622
+   eachMessage: async ({ message }) => {
623
+     const m = decode(message);
624
+     const errClass = m.headers["dlq-error-class"];
625
+     if (errClass === "KafkaJSProtocolError" && m.headers["dlq-reason"]?.includes("MESSAGE_TOO_LARGE")) {
626
+       await ticket.create({ title: `Oversized DLQ from ${m.headers["dlq-original-topic"]}` });
627
+     } else {
628
+       await retryQueue.put({
629
+         payload: m.value,
630
+         attemptsSoFar: Number(m.headers["dlq-attempts"] ?? "0"),
631
+       });
632
+     }
633
+   },
634
+ });
635
+ ```
636
+
458
637
  ### `validateTopicsOnConnect`
459
638
 
460
639
  Fail-fast at startup if expected topics are missing:
package/dist/index.cjs CHANGED
@@ -51,6 +51,7 @@ function classifyKafkajsError(err) {
51
51
  if (e.name === "KafkaJSNonRetriableError") return "fatal";
52
52
  const type = typeof e.type === "string" ? e.type : void 0;
53
53
  if (type) {
54
+ if (FENCED_TYPES.has(type)) return "fenced";
54
55
  if (RETRIABLE_TYPES.has(type)) return "retriable";
55
56
  if (POISON_TYPES.has(type)) return "poison";
56
57
  if (FATAL_TYPES.has(type)) return "fatal";
@@ -84,9 +85,11 @@ var POISON_TYPES = /* @__PURE__ */ new Set([
84
85
  "INVALID_REQUIRED_ACKS",
85
86
  "INVALID_PARTITIONS"
86
87
  ]);
87
- var FATAL_TYPES = /* @__PURE__ */ new Set([
88
+ var FENCED_TYPES = /* @__PURE__ */ new Set([
88
89
  "INVALID_PRODUCER_EPOCH",
89
- "PRODUCER_FENCED",
90
+ "PRODUCER_FENCED"
91
+ ]);
92
+ var FATAL_TYPES = /* @__PURE__ */ new Set([
90
93
  "TOPIC_AUTHORIZATION_FAILED",
91
94
  "CLUSTER_AUTHORIZATION_FAILED",
92
95
  "TRANSACTIONAL_ID_AUTHORIZATION_FAILED",
@@ -117,8 +120,8 @@ var CODE_TO_KIND = /* @__PURE__ */ new Map([
117
120
  // TOPIC_AUTHORIZATION_FAILED
118
121
  [31, "fatal"],
119
122
  // CLUSTER_AUTHORIZATION_FAILED
120
- [47, "fatal"],
121
- // INVALID_PRODUCER_EPOCH
123
+ [47, "fenced"],
124
+ // INVALID_PRODUCER_EPOCH — retryable once via publisher reconnect
122
125
  [58, "fatal"],
123
126
  // SASL_AUTHENTICATION_FAILED
124
127
  [74, "retriable"],
@@ -151,7 +154,10 @@ var UNSUPPORTED_BY_KAFKAJS = [
151
154
  "maxRequestSize",
152
155
  // Confluent-only escape hatches; ignored on kafkajs.
153
156
  "compressionLevel",
154
- "rawProducerConfig"
157
+ "rawProducerConfig",
158
+ // librdkafka stats — kafkajs has no equivalent surface.
159
+ "onStats",
160
+ "statsIntervalMs"
155
161
  ];
156
162
  var KafkaJsDriver = class {
157
163
  transactional;
@@ -437,8 +443,8 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
437
443
  // ERR__TRANSPORT
438
444
  [-198, "poison"],
439
445
  // ERR__BAD_COMPRESSION
440
- [-144, "fatal"],
441
- // ERR__FENCED — producer fenced by another with same txn id
446
+ [-144, "fenced"],
447
+ // ERR__FENCED — producer fenced; publisher reconnect attempts a transparent recovery once
442
448
  [-150, "fatal"],
443
449
  // ERR__FATAL — unrecoverable librdkafka error
444
450
  [-169, "fatal"],
@@ -470,8 +476,8 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
470
476
  // TOPIC_AUTHORIZATION_FAILED
471
477
  [31, "fatal"],
472
478
  // CLUSTER_AUTHORIZATION_FAILED
473
- [47, "fatal"],
474
- // INVALID_PRODUCER_EPOCH
479
+ [47, "fenced"],
480
+ // INVALID_PRODUCER_EPOCH — retryable once via publisher reconnect
475
481
  [58, "fatal"],
476
482
  // SASL_AUTHENTICATION_FAILED
477
483
  [74, "retriable"],
@@ -485,7 +491,7 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
485
491
  ]);
486
492
  var NAME_TO_KIND = /* @__PURE__ */ new Map([
487
493
  ["ERR__QUEUE_FULL", "backpressure"],
488
- ["ERR__FENCED", "fatal"],
494
+ ["ERR__FENCED", "fenced"],
489
495
  ["ERR__FATAL", "fatal"],
490
496
  ["ERR__AUTHENTICATION", "fatal"],
491
497
  ["ERR__SSL", "fatal"],
@@ -494,7 +500,7 @@ var NAME_TO_KIND = /* @__PURE__ */ new Map([
494
500
  ["ERR__BAD_COMPRESSION", "poison"],
495
501
  ["ERR_TOPIC_AUTHORIZATION_FAILED", "fatal"],
496
502
  ["ERR_CLUSTER_AUTHORIZATION_FAILED", "fatal"],
497
- ["ERR_INVALID_PRODUCER_EPOCH", "fatal"],
503
+ ["ERR_INVALID_PRODUCER_EPOCH", "fenced"],
498
504
  ["ERR_SASL_AUTHENTICATION_FAILED", "fatal"],
499
505
  ["ERR_CORRUPT_MESSAGE", "poison"],
500
506
  ["ERR_MSG_SIZE_TOO_LARGE", "poison"],
@@ -530,6 +536,12 @@ function buildConfluentClientConfig(opts) {
530
536
  if (opts.compressionLevel !== void 0) {
531
537
  librdkafka["compression.level"] = opts.compressionLevel;
532
538
  }
539
+ if (opts.onStats) {
540
+ librdkafka["stats_cb"] = wrapStatsCallback(opts.onStats);
541
+ librdkafka["statistics.interval.ms"] = opts.statsIntervalMs ?? 3e4;
542
+ } else if (opts.statsIntervalMs !== void 0) {
543
+ librdkafka["statistics.interval.ms"] = opts.statsIntervalMs;
544
+ }
533
545
  const tlsRequested = opts.ssl === true || isTlsConfig(opts.ssl);
534
546
  const saslRequested = !!opts.sasl;
535
547
  if (saslRequested && tlsRequested) {
@@ -567,6 +579,20 @@ function buildConfluentClientConfig(opts) {
567
579
  function isTlsConfig(v) {
568
580
  return typeof v === "object" && v !== null;
569
581
  }
582
+ function wrapStatsCallback(onStats) {
583
+ return (raw) => {
584
+ let parsed;
585
+ try {
586
+ parsed = typeof raw === "string" ? JSON.parse(raw) : raw;
587
+ } catch {
588
+ return;
589
+ }
590
+ try {
591
+ onStats(parsed);
592
+ } catch {
593
+ }
594
+ };
595
+ }
570
596
  function stringifyPem(input) {
571
597
  if (Array.isArray(input)) {
572
598
  return input.map((x) => typeof x === "string" ? x : x.toString("utf8")).join("\n");
@@ -808,11 +834,17 @@ var KafkaPublisher = class {
808
834
  hooks;
809
835
  tracer;
810
836
  validateTopicsOnConnect;
837
+ autoRecoverFromFence;
838
+ // Serialize reconnects so concurrent publish() calls hitting a fence
839
+ // all observe the same single reconnect attempt — the second publish
840
+ // doesn't try to disconnect a producer the first is still re-initing.
841
+ fenceRecovery = null;
811
842
  constructor(opts) {
812
843
  this.logger = opts.logger;
813
844
  this.hooks = opts.hooks ?? {};
814
845
  this.tracer = opts.tracer ?? new NoopKafkaTracer();
815
846
  this.validateTopicsOnConnect = opts.validateTopicsOnConnect ? Object.freeze([...opts.validateTopicsOnConnect]) : void 0;
847
+ this.autoRecoverFromFence = opts.autoRecoverFromFence ?? false;
816
848
  const onTransactionAbort = this.hooks.onTransactionAbort ? (error) => {
817
849
  void safeHook(
818
850
  this.logger,
@@ -935,6 +967,20 @@ var KafkaPublisher = class {
935
967
  await safeHook(this.logger, "onError", () => this.hooks.onError?.(error));
936
968
  throw err;
937
969
  }
970
+ const firstFenced = results.find(
971
+ (r) => !r.ok && r.errorKind === "fenced"
972
+ );
973
+ if (firstFenced) {
974
+ const fenceErr = firstFenced.error ?? new Error("producer fenced");
975
+ await safeHook(
976
+ this.logger,
977
+ "onProducerFenced",
978
+ () => this.hooks.onProducerFenced?.(fenceErr)
979
+ );
980
+ if (this.autoRecoverFromFence) {
981
+ results = await this.recoverAndRetry(outgoing, results);
982
+ }
983
+ }
938
984
  const byId = new Map(messages.map((m) => [m.recordId, m]));
939
985
  let allOk = true;
940
986
  for (const r of results) {
@@ -985,6 +1031,110 @@ var KafkaPublisher = class {
985
1031
  get transactional() {
986
1032
  return this.driver.transactional;
987
1033
  }
1034
+ /**
1035
+ * Cheap reachability probe. Borrows a fresh admin client, calls
1036
+ * `listTopics`, and returns timing + outcome. Useful as the body of a
1037
+ * `/healthz` or `/readyz` endpoint — proves the broker is reachable
1038
+ * AND that the configured credentials still authenticate against it,
1039
+ * without writing a record.
1040
+ *
1041
+ * Does NOT exercise the producer's send path — a healthy admin
1042
+ * connection doesn't guarantee `publish()` will succeed (a fenced
1043
+ * transactional producer would still answer healthy here). Treat this
1044
+ * as "broker reachable + auth still good", not "publisher is fully
1045
+ * operational".
1046
+ *
1047
+ * Default timeout 5_000 ms — long enough to ride out a single broker
1048
+ * leader election, short enough to fail a liveness probe meaningfully.
1049
+ * Set `timeoutMs: 0` to disable the timer entirely.
1050
+ *
1051
+ * The driver must implement `admin()` (the built-ins do); custom
1052
+ * drivers without admin get `{ ok: false, error: ... }` instead of
1053
+ * the throw `publisher.admin()` would surface — health checks are
1054
+ * not the place to crash.
1055
+ */
1056
+ async healthCheck(opts = {}) {
1057
+ const timeoutMs = opts.timeoutMs ?? 5e3;
1058
+ const startedAt = Date.now();
1059
+ if (!this.driver.admin) {
1060
+ return {
1061
+ ok: false,
1062
+ latencyMs: 0,
1063
+ timestamp: startedAt,
1064
+ error: new Error(
1065
+ "KafkaPublisher.healthCheck: configured driver does not implement admin()"
1066
+ )
1067
+ };
1068
+ }
1069
+ let admin = null;
1070
+ try {
1071
+ admin = await this.driver.admin();
1072
+ await admin.connect();
1073
+ const probe = admin.listTopics();
1074
+ if (timeoutMs > 0) {
1075
+ await raceWithTimeout(probe, timeoutMs, "healthCheck");
1076
+ } else {
1077
+ await probe;
1078
+ }
1079
+ return {
1080
+ ok: true,
1081
+ latencyMs: Date.now() - startedAt,
1082
+ timestamp: startedAt
1083
+ };
1084
+ } catch (err) {
1085
+ const error = err instanceof Error ? err : new Error(String(err));
1086
+ return {
1087
+ ok: false,
1088
+ latencyMs: Date.now() - startedAt,
1089
+ timestamp: startedAt,
1090
+ error
1091
+ };
1092
+ } finally {
1093
+ try {
1094
+ await admin?.close();
1095
+ } catch {
1096
+ }
1097
+ }
1098
+ }
1099
+ /**
1100
+ * Disconnect + re-connect the driver and re-send the batch ONCE. Used
1101
+ * by the fence-recovery path. Concurrent fence recoveries dedupe on a
1102
+ * shared in-flight promise (`fenceRecovery`) so we don't tear the
1103
+ * producer down while another batch is mid-restart.
1104
+ *
1105
+ * If the second send STILL reports any fenced records, those failures
1106
+ * are returned unchanged — another instance has almost certainly taken
1107
+ * the same `transactionalId` and silently retrying again would mask
1108
+ * the misconfiguration.
1109
+ */
1110
+ async recoverAndRetry(outgoing, firstResults) {
1111
+ if (!this.fenceRecovery) {
1112
+ this.fenceRecovery = (async () => {
1113
+ try {
1114
+ await this.driver.disconnect();
1115
+ await this.driver.connect();
1116
+ } finally {
1117
+ this.fenceRecovery = null;
1118
+ }
1119
+ })();
1120
+ }
1121
+ try {
1122
+ await this.fenceRecovery;
1123
+ } catch (err) {
1124
+ const reconnectErr = err instanceof Error ? err : new Error(String(err));
1125
+ await safeHook(
1126
+ this.logger,
1127
+ "onError",
1128
+ () => this.hooks.onError?.(reconnectErr)
1129
+ );
1130
+ return firstResults;
1131
+ }
1132
+ try {
1133
+ return await this.driver.sendBatch(outgoing);
1134
+ } catch {
1135
+ return firstResults;
1136
+ }
1137
+ }
988
1138
  /**
989
1139
  * Start a span for the batch following the OTel messaging conventions.
990
1140
  *
@@ -1003,6 +1153,26 @@ var KafkaPublisher = class {
1003
1153
  });
1004
1154
  }
1005
1155
  };
1156
+ function raceWithTimeout(p, ms, label) {
1157
+ return new Promise((resolve, reject) => {
1158
+ const timer = setTimeout(() => {
1159
+ reject(new Error(`${label} timed out after ${ms}ms`));
1160
+ }, ms);
1161
+ if (typeof timer.unref === "function") {
1162
+ timer.unref();
1163
+ }
1164
+ p.then(
1165
+ (v) => {
1166
+ clearTimeout(timer);
1167
+ resolve(v);
1168
+ },
1169
+ (e) => {
1170
+ clearTimeout(timer);
1171
+ reject(e);
1172
+ }
1173
+ );
1174
+ });
1175
+ }
1006
1176
  function selectDriver(opts) {
1007
1177
  const kind = opts.driver ?? "kafkajs";
1008
1178
  switch (kind) {