@eventferry/kafka 3.4.0 → 3.5.0

Files changed (9)

package/CHANGELOG.md CHANGED

@@ -1,5 +1,50 @@
 # @eventferry/kafka
+## 3.5.0
+### Minor Changes
+- fb0549d: Producer-fenced restart. `PRODUCER_FENCED` and `INVALID_PRODUCER_EPOCH` errors now classify as `errorKind: "fenced"` (previously bundled into `"fatal"`). The new kind is documented as **transient by default** — fences also fire on broker restart and network partition recovery, not only on multi-instance collisions.
+  New publisher option `autoRecoverFromFence: boolean` (default `false`): when on, a publish batch reporting at least one fenced result triggers exactly one `disconnect → connect → re-send same batch` cycle. Transactional producers re-run `initTransactions` as part of the reconnect. If the second send still reports any fenced record, the publisher gives up — silently retrying again would mask a real misconfiguration. Concurrent fenced publishes share a single in-flight reconnect so the producer is not torn down twice mid-restart.
+  New `KafkaPublisherHooks.onProducerFenced(error)` hook fires regardless of the recovery flag — informational signal so dashboards can track fence rates whether or not the publisher attempts recovery.
+  `@eventferry/core` minor: `PublishErrorKind` union gains `"fenced"`. The relay treats unknown / `"retriable"` / `"fenced"` identically (retry per backoff, DLQ on `attempts > maxAttempts`) — no relay-level changes required, but the new kind shows up in logs and the `errorKind` field of `PublishResult`.
+  Multi-instance EOS guidance: leave `autoRecoverFromFence` OFF and use a callable `transactionalId` that derives a stable, unique id per instance (pod name + replica index). Cross-instance fence is the broker telling the loser instance to stop — recovering silently creates a thrashing leadership flip. The README now spells this out in a `Producer-fenced restart` section.
+- 08d3384: `publisher.healthCheck({ timeoutMs })` — cheap reachability probe usable as the body of `/healthz` or `/readyz`. Borrows a fresh admin client, calls `listTopics`, and returns a stable `HealthStatus` shape: `{ ok, latencyMs, timestamp, error? }`. Default timeout 5000 ms (long enough to ride out a single broker leader election, short enough to fail a liveness probe meaningfully); `timeoutMs: 0` disables the timer entirely.
+  What it proves: the broker is reachable AND the configured credentials still authenticate. What it does NOT prove: the producer's send path is fully operational — a fenced transactional producer would still answer healthy here. Documented as "broker reachable + auth still good", not "publisher fully operational".
+  The borrowed admin is always closed (success, failure, timeout — try/finally). Admin-side close failures are swallowed; health checks aren't the place to crash. Custom drivers without an `admin()` method return `{ ok: false, error: ... }` instead of the throw `publisher.admin()` would surface.
+- 90b69c6: librdkafka stats hook on the confluent driver. New `onStats: (stats) => void` callback receives the librdkafka periodic statistics JSON, already parsed to a plain object — pipe queue depth, broker latencies, txmsgs counters, per-topic/per-partition stats into your metrics stack without a second client. The wrapper swallows callback exceptions and JSON parse failures so a misbehaving observer cannot take down the producer's event loop. `statsIntervalMs` controls the polling interval; defaults to 30000 ms when `onStats` is set, stays OFF otherwise (librdkafka CPU-bills the JSON serialization every tick — we don't enable it silently). `rawProducerConfig` still wins on precedence. kafkajs driver warns once and ignores both options — kafkajs has no equivalent surface.
+### Patch Changes
+- 715523f: Consumer-side documentation. No API change. The root README gains:
+  - **`Consuming what eventferry produced`** — canonical loop showing `decode(message)` → `extractTraceContext(headers)` → `defineOutbox(registry).decode(topic, bytes)`. Same registry the producer used, in reverse, returns the typed validated payload.
+  - **`Consuming the DLQ`** — copy-paste handler that routes by `dlq-error-class` (cleaner than parsing `dlq-reason`), pulls `dlq-attempts` for retry-queue accounting, and shows the alert-vs-retry split.
+  The `@eventferry/kafka` README adds matching subsections under the existing `Consumer helpers` block: **`Typed payload via the producer-side registry`** and **`DLQ recipe`**.
+  `defineOutbox(registry).decode()` was already shipped — the round just makes the symmetric "same registry, both sides" pattern discoverable.
+- ba81a78: Hardened TLS configuration documentation. No API change — `ssl.ca`, `ssl.servername`, and the rest of `TlsConfig` were already on the surface. This round:
+  - Expanded the `TlsConfig` JSDoc with the driver-parity gap: `servername` is honored by the **kafkajs** driver (Node `tls.connect` reads it directly) but is a documented **no-op on the confluent driver** — librdkafka v1.x's kafkaJS-compat layer doesn't expose an SNI override.
+  - README gained explicit "Dev cluster with a self-signed cert" and "IP-literal brokers (cert hostname mismatch)" sections with copy-paste examples covering CA pinning + `servername` for SNI/SAN alignment.
+  - Reaffirmed that `rejectUnauthorized: false` is **never** going to ship on this surface. TLS verification is non-negotiable. For dev clusters with self-signed certs, the supported pattern is to pass the cluster CA via `ssl.ca` so verification still happens — just against your CA instead of the system trust store.
+  Companion library updates (changesets, dependabot) on the way; this patch only touches comments + README, so the change is safe to consume immediately.
+- Updated dependencies [715523f]
+- Updated dependencies [fb0549d]
+  - @eventferry/core@3.4.0
 ## 3.4.0
 ### Minor Changes

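The fb0549d entry above says that concurrent fenced publishes share a single in-flight reconnect. A minimal sketch of that single-flight idea follows, assuming a driver that exposes only `connect()` and `disconnect()`. `SketchDriver` and `FenceRecovery` are hypothetical names invented for this illustration and are not part of `@eventferry/kafka`; the package's own implementation appears in the `dist/index.cjs` diff.

```ts
// Hypothetical driver shape for this sketch; not the package's real types.
interface SketchDriver {
  connect(): Promise<void>;
  disconnect(): Promise<void>;
}

class FenceRecovery {
  private readonly driver: SketchDriver;
  // The reconnect cycle currently running, or null when idle.
  private inFlight: Promise<void> | null = null;

  constructor(driver: SketchDriver) {
    this.driver = driver;
  }

  // Callers that hit a fence while a cycle is running await the same promise
  // instead of starting a second disconnect/connect cycle.
  recover(): Promise<void> {
    if (this.inFlight) return this.inFlight;
    const cycle = (async () => {
      await this.driver.disconnect();
      await this.driver.connect();
    })();
    // Clear the slot once the cycle settles so a later fence can start a new one.
    const tracked = cycle.finally(() => {
      this.inFlight = null;
    });
    this.inFlight = tracked;
    return tracked;
  }
}
```

In this sketch a failed reconnect rejects for every waiting caller, and each caller then decides whether to fall back to its original results. Re-sending the batch after a successful cycle is left out on purpose.
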
package/README.md CHANGED

@@ -67,6 +67,40 @@ new KafkaPublisher({
 > non-negotiable. For dev clusters with self-signed certs, pass the cluster
 > CA via `ca` so verification succeeds.
+### Dev cluster with a self-signed cert
+The right pattern is to pin **your** CA. Verification still happens — just against your CA instead of the system trust store.
+```ts
+new KafkaPublisher({
+  brokers: ["dev-broker.internal:9093"],
+  ssl: {
+    ca: readFileSync("/path/to/dev-cluster-ca.pem"),
+    // Cluster reachable via DNS that doesn't match the cert SAN?
+    // Pin the SNI host the cert was issued for:
+    servername: "kafka.dev.internal",
+  },
+});
+```
+**Never** add `rejectUnauthorized: false` (TS would reject it anyway — it's not in the type). That disables verification entirely and opens every connection to a man-in-the-middle.
+### IP-literal brokers (cert hostname mismatch)
+When the broker address is an IP and the cert was issued for a hostname, set `servername`:
+```ts
+new KafkaPublisher({
+  brokers: ["10.0.5.12:9093"],          // IP literal
+  ssl: {
+    ca: readFileSync("/etc/ssl/kafka-ca.pem"),
+    servername: "broker.example.com",   // hostname the cert was issued for
+  },
+});
+```
+`servername` is honored by the **kafkajs** driver (Node `tls.connect` reads `servername` directly). It's a **documented no-op on the confluent driver** — librdkafka v1.x's kafkaJS-compat layer doesn't expose an SNI override, and SNI is derived from the broker address. Use the kafkajs driver when you need the SNI lever.
 ### SASL — username + password (PLAIN / SCRAM)
 ```ts
@@ -289,6 +323,107 @@ const tracer: KafkaTracer = {
 The publisher clones each outbound message before injecting (the caller's `PublishableMessage` is never mutated, so the relay's retry path stays correct).
+## Health check
+Cheap reachability probe — useful as the body of a `/healthz` or `/readyz` endpoint:
+```ts
+import express from "express";
+const app = express();
+app.get("/healthz", async (_req, res) => {
+  const status = await publisher.healthCheck({ timeoutMs: 3_000 });
+  res.status(status.ok ? 200 : 503).json({
+    ok: status.ok,
+    latencyMs: status.latencyMs,
+    error: status.error?.message,
+  });
+});
+```
+`publisher.healthCheck()` opens a fresh admin, calls `listTopics`, and returns:
+```ts
+interface HealthStatus {
+  ok: boolean;          // broker answered within timeout
+  latencyMs: number;    // probe wall-clock
+  timestamp: number;    // epoch ms when the probe started
+  error?: Error;        // present when ok === false
+}
+```
+Default `timeoutMs: 5_000` — long enough to ride out a single broker leader election, short enough to fail a liveness probe meaningfully. Set `timeoutMs: 0` to disable the timer.
+**What this proves**: the broker is reachable AND the configured credentials still authenticate. **What this does NOT prove**: the producer's send path is fully operational — a fenced transactional producer would still answer healthy here. Treat the result as "broker reachable + auth still good", not "publisher fully operational".
+The borrowed admin is always closed (success or failure). Admin-side close failures don't change the outcome — health checks aren't the place to crash.
+## Producer-fenced restart
+`PRODUCER_FENCED` and `INVALID_PRODUCER_EPOCH` errors classify as `errorKind: "fenced"` — a distinct kind from `fatal` because some fences are **transient** (broker restart, network partition recovery) rather than a permanent multi-instance collision.
+### `autoRecoverFromFence: true`
+Opt in to a single transparent reconnect-and-retry when a publish batch reports a fence:
+```ts
+new KafkaPublisher({
+  brokers,
+  transactional: true,
+  transactionalId: "orders-publisher",
+  autoRecoverFromFence: true,
+});
+```
+What happens on a fenced batch:
+1. The `onProducerFenced(error)` hook fires (regardless of the recovery flag — informational).
+2. The driver is disconnected and reconnected (re-running `initTransactions` for transactional producers).
+3. The same batch is resent **once**.
+4. If the second send still reports any fenced record, the publisher gives up and surfaces those failures unchanged — silently retrying again would mask a misconfiguration.
+Concurrent fenced publishes share a single in-flight reconnect — the producer is not torn down twice while a recovery is in progress.
+**Default is `false`** to preserve the previous "fenced → propagate to relay" behavior. The relay will retry fenced records under the configured backoff and DLQ them when `attempts > retry.maxAttempts`.
+### `transactional.id` strategy for multi-instance EOS
+When running multiple producer instances against the same logical workload, each instance MUST have a stable, unique `transactionalId`. Use the callable form to derive it from runtime context:
+```ts
+new KafkaPublisher({
+  brokers,
+  transactional: true,
+  transactionalId: () => `${process.env.POD_NAME}-${process.env.HOSTNAME}`,
+  // Leave autoRecoverFromFence OFF — a fence means a real collision
+  // worth surfacing.
+});
+```
+Cross-instance fence is **not** a transient blip — it's the broker telling one of you that the other is now the canonical producer. Auto-recovery would create a thrashing leadership flip. Keep the option off in multi-instance setups and let the loser instance fail loudly.
+## librdkafka stats hook
+The confluent driver exposes librdkafka's periodic statistics stream as a typed callback. Useful for piping queue depth, broker latency, broker timeout counts, and per-topic/per-partition counters into your metrics stack.
+```ts
+new KafkaPublisher({
+  brokers,
+  driver: "confluent",
+  onStats: (stats) => {
+    // stats is opaque librdkafka JSON. Reach for the fields you care about.
+    promClient.gauge("kafka_msg_cnt").set(stats.msg_cnt as number);
+    promClient.gauge("kafka_txmsgs").set(stats.txmsgs as number);
+  },
+  statsIntervalMs: 30_000, // optional; defaults to 30s when onStats is set
+});
+```
+- **`onStats`** receives the librdkafka stats JSON, already parsed to a plain object. The schema is opaque (`Record<string, unknown>`) — librdkafka's stats are huge and evolve across versions. Reference: [librdkafka STATISTICS.md](https://github.com/confluentinc/librdkafka/blob/master/STATISTICS.md).
+- **`statsIntervalMs`** maps to librdkafka's `statistics.interval.ms`. **Defaults to 30000 ms when `onStats` is set; otherwise stays off** (librdkafka CPU-bills the JSON serialization every tick — we don't enable it silently).
+- The wrapper swallows callback exceptions and JSON parse failures — a single dropped sample is preferable to taking down the producer's event loop.
+- **No-op on the kafkajs driver** — kafkajs has no equivalent surface. Logs a one-time warning and ignores both options.
 ## Power-user escape hatches
 When the high-level options don't reach a knob you need, drop down to the native client config.
@@ -455,6 +590,50 @@ await consumer.run({
 `extractTraceContext` returns `null` if no `traceparent` header is present or it fails W3C validation (all-zero IDs, `version: ff`, malformed hex). It accepts both raw consumer headers (Buffer values) and already-decoded headers (string values).
+#### Typed payload via the producer-side registry
+When your consumer lives in the same monorepo as the producer, hand the decoded bytes to the **same `defineOutbox(registry)`** you used to enqueue. `decode` validates against the topic's Standard Schema and returns the typed payload:
+```ts
+import { defineOutbox } from "@eventferry/core";
+import { decode } from "@eventferry/kafka/consume";
+import { registry } from "./outbox-registry";
+const events = defineOutbox(registry); // no store — consumer side
+await consumer.run({
+  eachMessage: async ({ message }) => {
+    const m = decode(message, { decoder: "utf8" });
+    const event = await events.decode("orders.created", m.value!);
+    //    ^? { orderId: string; total: number }
+    await handle(event);
+  },
+});
+```
+`events.decode(topic, bytes)` throws `OutboxValidationError` if the topic isn't in the registry or the payload doesn't match the schema. Cross-language consumers (Go, Java, Python) skip the companion and use their own schema tooling — Confluent Schema Registry for typed wire formats.
+#### DLQ recipe
+Records that exhaust retries land on `${topic}.dlq` (or your configured DLQ topic) carrying enriched headers `dlq-reason`, `dlq-error-class`, `dlq-original-topic`, `dlq-failed-at`, `dlq-attempts` (and optionally `dlq-stack` when you opt in). Route them with `dlq-error-class` rather than parsing `dlq-reason`:
+```ts
+await dlqConsumer.run({
+  eachMessage: async ({ message }) => {
+    const m = decode(message);
+    const errClass = m.headers["dlq-error-class"];
+    if (errClass === "KafkaJSProtocolError" && m.headers["dlq-reason"]?.includes("MESSAGE_TOO_LARGE")) {
+      await ticket.create({ title: `Oversized DLQ from ${m.headers["dlq-original-topic"]}` });
+    } else {
+      await retryQueue.put({
+        payload: m.value,
+        attemptsSoFar: Number(m.headers["dlq-attempts"] ?? "0"),
+      });
+    }
+  },
+});
+```
 ### `validateTopicsOnConnect`
 Fail-fast at startup if expected topics are missing:

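The DLQ recipe in the README diff above mixes the routing decision with I/O (`ticket.create`, `retryQueue.put`). One way to make the decision unit-testable is to pull it into a pure function. The sketch below assumes headers have already been decoded to strings, as in the README example. `routeDlqRecord`, `DlqHeaders`, and `DlqDecision` are hypothetical names for this illustration; only the `dlq-*` header names come from the README.

```ts
// Headers after decoding; a missing header is undefined.
type DlqHeaders = Record<string, string | undefined>;

type DlqDecision =
  | { action: "alert"; originalTopic: string }
  | { action: "retry"; attemptsSoFar: number };

// Hypothetical helper: decides alert vs retry from the enriched DLQ headers.
function routeDlqRecord(headers: DlqHeaders): DlqDecision {
  const errClass = headers["dlq-error-class"];
  const reason = headers["dlq-reason"] ?? "";
  // Oversized records will not succeed on retry, so they go to a human.
  if (errClass === "KafkaJSProtocolError" && reason.includes("MESSAGE_TOO_LARGE")) {
    return { action: "alert", originalTopic: headers["dlq-original-topic"] ?? "unknown" };
  }
  // Everything else is retried; a missing or malformed attempt count falls back to 0.
  const parsed = Number(headers["dlq-attempts"] ?? "0");
  return { action: "retry", attemptsSoFar: Number.isFinite(parsed) ? parsed : 0 };
}
```

The consumer handler then reduces to calling `routeDlqRecord(m.headers)` and dispatching on `action`, which keeps the alert-vs-retry policy in one place.
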
package/dist/index.cjs CHANGED

@@ -51,6 +51,7 @@ function classifyKafkajsError(err) {
   if (e.name === "KafkaJSNonRetriableError") return "fatal";
   const type = typeof e.type === "string" ? e.type : void 0;
   if (type) {
+    if (FENCED_TYPES.has(type)) return "fenced";
     if (RETRIABLE_TYPES.has(type)) return "retriable";
     if (POISON_TYPES.has(type)) return "poison";
     if (FATAL_TYPES.has(type)) return "fatal";
@@ -84,9 +85,11 @@ var POISON_TYPES = /* @__PURE__ */ new Set([
   "INVALID_REQUIRED_ACKS",
   "INVALID_PARTITIONS"
 ]);
-var FATAL_TYPES = /* @__PURE__ */ new Set([
+var FENCED_TYPES = /* @__PURE__ */ new Set([
   "INVALID_PRODUCER_EPOCH",
-  "PRODUCER_FENCED",
+  "PRODUCER_FENCED"
+]);
+var FATAL_TYPES = /* @__PURE__ */ new Set([
   "TOPIC_AUTHORIZATION_FAILED",
   "CLUSTER_AUTHORIZATION_FAILED",
   "TRANSACTIONAL_ID_AUTHORIZATION_FAILED",
@@ -117,8 +120,8 @@ var CODE_TO_KIND = /* @__PURE__ */ new Map([
   // TOPIC_AUTHORIZATION_FAILED
   [31, "fatal"],
   // CLUSTER_AUTHORIZATION_FAILED
-  [47, "fatal"],
-  // INVALID_PRODUCER_EPOCH
+  [47, "fenced"],
+  // INVALID_PRODUCER_EPOCH — retryable once via publisher reconnect
   [58, "fatal"],
   // SASL_AUTHENTICATION_FAILED
   [74, "retriable"],
@@ -151,7 +154,10 @@ var UNSUPPORTED_BY_KAFKAJS = [
   "maxRequestSize",
   // Confluent-only escape hatches; ignored on kafkajs.
   "compressionLevel",
-  "rawProducerConfig"
+  "rawProducerConfig",
+  // librdkafka stats — kafkajs has no equivalent surface.
+  "onStats",
+  "statsIntervalMs"
 ];
 var KafkaJsDriver = class {
   transactional;
@@ -437,8 +443,8 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
   // ERR__TRANSPORT
   [-198, "poison"],
   // ERR__BAD_COMPRESSION
-  [-144, "fatal"],
-  // ERR__FENCED — producer fenced by another with same txn id
+  [-144, "fenced"],
+  // ERR__FENCED — producer fenced; publisher reconnect attempts a transparent recovery once
   [-150, "fatal"],
   // ERR__FATAL — unrecoverable librdkafka error
   [-169, "fatal"],
@@ -470,8 +476,8 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
   // TOPIC_AUTHORIZATION_FAILED
   [31, "fatal"],
   // CLUSTER_AUTHORIZATION_FAILED
-  [47, "fatal"],
-  // INVALID_PRODUCER_EPOCH
+  [47, "fenced"],
+  // INVALID_PRODUCER_EPOCH — retryable once via publisher reconnect
   [58, "fatal"],
   // SASL_AUTHENTICATION_FAILED
   [74, "retriable"],
@@ -485,7 +491,7 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
 ]);
 var NAME_TO_KIND = /* @__PURE__ */ new Map([
   ["ERR__QUEUE_FULL", "backpressure"],
-  ["ERR__FENCED", "fatal"],
+  ["ERR__FENCED", "fenced"],
   ["ERR__FATAL", "fatal"],
   ["ERR__AUTHENTICATION", "fatal"],
   ["ERR__SSL", "fatal"],
@@ -494,7 +500,7 @@ var NAME_TO_KIND = /* @__PURE__ */ new Map([
   ["ERR__BAD_COMPRESSION", "poison"],
   ["ERR_TOPIC_AUTHORIZATION_FAILED", "fatal"],
   ["ERR_CLUSTER_AUTHORIZATION_FAILED", "fatal"],
-  ["ERR_INVALID_PRODUCER_EPOCH", "fatal"],
+  ["ERR_INVALID_PRODUCER_EPOCH", "fenced"],
   ["ERR_SASL_AUTHENTICATION_FAILED", "fatal"],
   ["ERR_CORRUPT_MESSAGE", "poison"],
   ["ERR_MSG_SIZE_TOO_LARGE", "poison"],
@@ -530,6 +536,12 @@ function buildConfluentClientConfig(opts) {
   if (opts.compressionLevel !== void 0) {
     librdkafka["compression.level"] = opts.compressionLevel;
   }
+  if (opts.onStats) {
+    librdkafka["stats_cb"] = wrapStatsCallback(opts.onStats);
+    librdkafka["statistics.interval.ms"] = opts.statsIntervalMs ?? 3e4;
+  } else if (opts.statsIntervalMs !== void 0) {
+    librdkafka["statistics.interval.ms"] = opts.statsIntervalMs;
+  }
   const tlsRequested = opts.ssl === true || isTlsConfig(opts.ssl);
   const saslRequested = !!opts.sasl;
   if (saslRequested && tlsRequested) {
@@ -567,6 +579,20 @@ function buildConfluentClientConfig(opts) {
 function isTlsConfig(v) {
   return typeof v === "object" && v !== null;
 }
+function wrapStatsCallback(onStats) {
+  return (raw) => {
+    let parsed;
+    try {
+      parsed = typeof raw === "string" ? JSON.parse(raw) : raw;
+    } catch {
+      return;
+    }
+    try {
+      onStats(parsed);
+    } catch {
+    }
+  };
+}
 function stringifyPem(input) {
   if (Array.isArray(input)) {
     return input.map((x) => typeof x === "string" ? x : x.toString("utf8")).join("\n");
@@ -808,11 +834,17 @@ var KafkaPublisher = class {
   hooks;
   tracer;
   validateTopicsOnConnect;
+  autoRecoverFromFence;
+  // Serialize reconnects so concurrent publish() calls hitting a fence
+  // all observe the same single reconnect attempt — the second publish
+  // doesn't try to disconnect a producer the first is still re-initing.
+  fenceRecovery = null;
   constructor(opts) {
     this.logger = opts.logger;
     this.hooks = opts.hooks ?? {};
     this.tracer = opts.tracer ?? new NoopKafkaTracer();
     this.validateTopicsOnConnect = opts.validateTopicsOnConnect ? Object.freeze([...opts.validateTopicsOnConnect]) : void 0;
+    this.autoRecoverFromFence = opts.autoRecoverFromFence ?? false;
     const onTransactionAbort = this.hooks.onTransactionAbort ? (error) => {
       void safeHook(
         this.logger,
@@ -935,6 +967,20 @@ var KafkaPublisher = class {
       await safeHook(this.logger, "onError", () => this.hooks.onError?.(error));
       throw err;
     }
+    const firstFenced = results.find(
+      (r) => !r.ok && r.errorKind === "fenced"
+    );
+    if (firstFenced) {
+      const fenceErr = firstFenced.error ?? new Error("producer fenced");
+      await safeHook(
+        this.logger,
+        "onProducerFenced",
+        () => this.hooks.onProducerFenced?.(fenceErr)
+      );
+      if (this.autoRecoverFromFence) {
+        results = await this.recoverAndRetry(outgoing, results);
+      }
+    }
     const byId = new Map(messages.map((m) => [m.recordId, m]));
     let allOk = true;
     for (const r of results) {
@@ -985,6 +1031,110 @@ var KafkaPublisher = class {
   get transactional() {
     return this.driver.transactional;
   }
+  /**
+   * Cheap reachability probe. Borrows a fresh admin client, calls
+   * `listTopics`, and returns timing + outcome. Useful as the body of a
+   * `/healthz` or `/readyz` endpoint — proves the broker is reachable
+   * AND that the configured credentials still authenticate against it,
+   * without writing a record.
+   *
+   * Does NOT exercise the producer's send path — a healthy admin
+   * connection doesn't guarantee `publish()` will succeed (a fenced
+   * transactional producer would still answer healthy here). Treat this
+   * as "broker reachable + auth still good", not "publisher is fully
+   * operational".
+   *
+   * Default timeout 5_000 ms — long enough to ride out a single broker
+   * leader election, short enough to fail a liveness probe meaningfully.
+   * Set `timeoutMs: 0` to disable the timer entirely.
+   *
+   * The driver must implement `admin()` (the built-ins do); custom
+   * drivers without admin get `{ ok: false, error: ... }` instead of
+   * the throw `publisher.admin()` would surface — health checks are
+   * not the place to crash.
+   */
+  async healthCheck(opts = {}) {
+    const timeoutMs = opts.timeoutMs ?? 5e3;
+    const startedAt = Date.now();
+    if (!this.driver.admin) {
+      return {
+        ok: false,
+        latencyMs: 0,
+        timestamp: startedAt,
+        error: new Error(
+          "KafkaPublisher.healthCheck: configured driver does not implement admin()"
+        )
+      };
+    }
+    let admin = null;
+    try {
+      admin = await this.driver.admin();
+      await admin.connect();
+      const probe = admin.listTopics();
+      if (timeoutMs > 0) {
+        await raceWithTimeout(probe, timeoutMs, "healthCheck");
+      } else {
+        await probe;
+      }
+      return {
+        ok: true,
+        latencyMs: Date.now() - startedAt,
+        timestamp: startedAt
+      };
+    } catch (err) {
+      const error = err instanceof Error ? err : new Error(String(err));
+      return {
+        ok: false,
+        latencyMs: Date.now() - startedAt,
+        timestamp: startedAt,
+        error
+      };
+    } finally {
+      try {
+        await admin?.close();
+      } catch {
+      }
+    }
+  }
+  /**
+   * Disconnect + re-connect the driver and re-send the batch ONCE. Used
+   * by the fence-recovery path. Concurrent fence recoveries dedupe on a
+   * shared in-flight promise (`fenceRecovery`) so we don't tear the
+   * producer down while another batch is mid-restart.
+   *
+   * If the second send STILL reports any fenced records, those failures
+   * are returned unchanged — another instance has almost certainly taken
+   * the same `transactionalId` and silently retrying again would mask
+   * the misconfiguration.
+   */
+  async recoverAndRetry(outgoing, firstResults) {
+    if (!this.fenceRecovery) {
+      this.fenceRecovery = (async () => {
+        try {
+          await this.driver.disconnect();
+          await this.driver.connect();
+        } finally {
+          this.fenceRecovery = null;
+        }
+      })();
+    }
+    try {
+      await this.fenceRecovery;
+    } catch (err) {
+      const reconnectErr = err instanceof Error ? err : new Error(String(err));
+      await safeHook(
+        this.logger,
+        "onError",
+        () => this.hooks.onError?.(reconnectErr)
+      );
+      return firstResults;
+    }
+    try {
+      return await this.driver.sendBatch(outgoing);
+    } catch {
+      return firstResults;
+    }
+  }
   /**
    * Start a span for the batch following the OTel messaging conventions.
    *
@@ -1003,6 +1153,26 @@ var KafkaPublisher = class {
     });
   }
 };
+function raceWithTimeout(p, ms, label) {
+  return new Promise((resolve, reject) => {
+    const timer = setTimeout(() => {
+      reject(new Error(`${label} timed out after ${ms}ms`));
+    }, ms);
+    if (typeof timer.unref === "function") {
+      timer.unref();
+    }
+    p.then(
+      (v) => {
+        clearTimeout(timer);
+        resolve(v);
+      },
+      (e) => {
+        clearTimeout(timer);
+        reject(e);
+      }
+    );
+  });
+}
 function selectDriver(opts) {
   const kind = opts.driver ?? "kafkajs";
   switch (kind) {