@eventferry/kafka 3.4.0 → 3.5.0
- package/CHANGELOG.md +45 -0
- package/README.md +179 -0
- package/dist/index.cjs +181 -11
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +147 -3
- package/dist/index.d.ts +147 -3
- package/dist/index.js +181 -11
- package/dist/index.js.map +1 -1
- package/package.json +7 -4
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,50 @@
 # @eventferry/kafka
 
+## 3.5.0
+
+### Minor Changes
+
+- fb0549d: Producer-fenced restart. `PRODUCER_FENCED` and `INVALID_PRODUCER_EPOCH` errors now classify as `errorKind: "fenced"` (previously bundled into `"fatal"`). The new kind is documented as **transient by default** — fences also fire on broker restart and network partition recovery, not only on multi-instance collisions.
+
+  New publisher option `autoRecoverFromFence: boolean` (default `false`): when on, a publish batch reporting at least one fenced result triggers exactly one `disconnect → connect → re-send same batch` cycle. Transactional producers re-run `initTransactions` as part of the reconnect. If the second send still reports any fenced record, the publisher gives up — silently retrying again would mask a real misconfiguration. Concurrent fenced publishes share a single in-flight reconnect so the producer is not torn down twice mid-restart.
+
+  New `KafkaPublisherHooks.onProducerFenced(error)` hook fires regardless of the recovery flag — informational signal so dashboards can track fence rates whether or not the publisher attempts recovery.
+
+  `@eventferry/core` minor: `PublishErrorKind` union gains `"fenced"`. The relay treats unknown / `"retriable"` / `"fenced"` identically (retry per backoff, DLQ on `attempts > maxAttempts`) — no relay-level changes required, but the new kind shows up in logs and the `errorKind` field of `PublishResult`.
+
+  Multi-instance EOS guidance: leave `autoRecoverFromFence` OFF and use a callable `transactionalId` that derives a stable, unique id per instance (pod name + replica index). Cross-instance fence is the broker telling the loser instance to stop — recovering silently creates a thrashing leadership flip. The README now spells this out in a `Producer-fenced restart` section.
+
+- 08d3384: `publisher.healthCheck({ timeoutMs })` — cheap reachability probe usable as the body of `/healthz` or `/readyz`. Borrows a fresh admin client, calls `listTopics`, and returns a stable `HealthStatus` shape: `{ ok, latencyMs, timestamp, error? }`. Default timeout 5000 ms (long enough to ride out a single broker leader election, short enough to fail a liveness probe meaningfully); `timeoutMs: 0` disables the timer entirely.
+
+  What it proves: the broker is reachable AND the configured credentials still authenticate. What it does NOT prove: the producer's send path is fully operational — a fenced transactional producer would still answer healthy here. Documented as "broker reachable + auth still good", not "publisher fully operational".
+
+  The borrowed admin is always closed (success, failure, timeout — try/finally). Admin-side close failures are swallowed; health checks aren't the place to crash. Custom drivers without an `admin()` method return `{ ok: false, error: ... }` instead of the throw `publisher.admin()` would surface.
+
+- 90b69c6: librdkafka stats hook on the confluent driver. New `onStats: (stats) => void` callback receives the librdkafka periodic statistics JSON, already parsed to a plain object — pipe queue depth, broker latencies, txmsgs counters, per-topic/per-partition stats into your metrics stack without a second client. The wrapper swallows callback exceptions and JSON parse failures so a misbehaving observer cannot take down the producer's event loop. `statsIntervalMs` controls the polling interval; defaults to 30000 ms when `onStats` is set, stays OFF otherwise (librdkafka CPU-bills the JSON serialization every tick — we don't enable it silently). `rawProducerConfig` still wins on precedence. kafkajs driver warns once and ignores both options — kafkajs has no equivalent surface.
+
+### Patch Changes
+
+- 715523f: Consumer-side documentation. No API change. The root README gains:
+
+  - **`Consuming what eventferry produced`** — canonical loop showing `decode(message)` → `extractTraceContext(headers)` → `defineOutbox(registry).decode(topic, bytes)`. Same registry the producer used, in reverse, returns the typed validated payload.
+  - **`Consuming the DLQ`** — copy-paste handler that routes by `dlq-error-class` (cleaner than parsing `dlq-reason`), pulls `dlq-attempts` for retry-queue accounting, and shows the alert-vs-retry split.
+
+  The `@eventferry/kafka` README adds matching subsections under the existing `Consumer helpers` block: **`Typed payload via the producer-side registry`** and **`DLQ recipe`**.
+
+  `defineOutbox(registry).decode()` was already shipped — the round just makes the symmetric "same registry, both sides" pattern discoverable.
+
+- ba81a78: Hardened TLS configuration documentation. No API change — `ssl.ca`, `ssl.servername`, and the rest of `TlsConfig` were already on the surface. This round:
+
+  - Expanded the `TlsConfig` JSDoc with the driver-parity gap: `servername` is honored by the **kafkajs** driver (Node `tls.connect` reads it directly) but is a documented **no-op on the confluent driver** — librdkafka v1.x's kafkaJS-compat layer doesn't expose an SNI override.
+  - README gained explicit "Dev cluster with a self-signed cert" and "IP-literal brokers (cert hostname mismatch)" sections with copy-paste examples covering CA pinning + `servername` for SNI/SAN alignment.
+  - Reaffirmed that `rejectUnauthorized: false` is **never** going to ship on this surface. TLS verification is non-negotiable. For dev clusters with self-signed certs, the supported pattern is to pass the cluster CA via `ssl.ca` so verification still happens — just against your CA instead of the system trust store.
+
+  Companion library updates (changesets, dependabot) on the way; this patch only touches comments + README, so the change is safe to consume immediately.
+
+- Updated dependencies [715523f]
+- Updated dependencies [fb0549d]
+  - @eventferry/core@3.4.0
+
 ## 3.4.0
 
 ### Minor Changes
package/README.md
CHANGED
@@ -67,6 +67,40 @@ new KafkaPublisher({
 > non-negotiable. For dev clusters with self-signed certs, pass the cluster
 > CA via `ca` so verification succeeds.
 
+### Dev cluster with a self-signed cert
+
+The right pattern is to pin **your** CA. Verification still happens — just against your CA instead of the system trust store.
+
+```ts
+new KafkaPublisher({
+  brokers: ["dev-broker.internal:9093"],
+  ssl: {
+    ca: readFileSync("/path/to/dev-cluster-ca.pem"),
+    // Cluster reachable via DNS that doesn't match the cert SAN?
+    // Pin the SNI host the cert was issued for:
+    servername: "kafka.dev.internal",
+  },
+});
+```
+
+**Never** add `rejectUnauthorized: false` (TS would reject it anyway — it's not in the type). That disables verification entirely and opens every connection to a man-in-the-middle.
+
+### IP-literal brokers (cert hostname mismatch)
+
+When the broker address is an IP and the cert was issued for a hostname, set `servername`:
+
+```ts
+new KafkaPublisher({
+  brokers: ["10.0.5.12:9093"], // IP literal
+  ssl: {
+    ca: readFileSync("/etc/ssl/kafka-ca.pem"),
+    servername: "broker.example.com", // hostname the cert was issued for
+  },
+});
+```
+
+`servername` is honored by the **kafkajs** driver (Node `tls.connect` reads `servername` directly). It's a **documented no-op on the confluent driver** — librdkafka v1.x's kafkaJS-compat layer doesn't expose an SNI override, and SNI is derived from the broker address. Use the kafkajs driver when you need the SNI lever.
+
 ### SASL — username + password (PLAIN / SCRAM)
 
 ```ts
@@ -289,6 +323,107 @@ const tracer: KafkaTracer = {
 
 The publisher clones each outbound message before injecting (the caller's `PublishableMessage` is never mutated, so the relay's retry path stays correct).
 
+## Health check
+
+Cheap reachability probe — useful as the body of a `/healthz` or `/readyz` endpoint:
+
+```ts
+import express from "express";
+const app = express();
+
+app.get("/healthz", async (_req, res) => {
+  const status = await publisher.healthCheck({ timeoutMs: 3_000 });
+  res.status(status.ok ? 200 : 503).json({
+    ok: status.ok,
+    latencyMs: status.latencyMs,
+    error: status.error?.message,
+  });
+});
+```
+
+`publisher.healthCheck()` opens a fresh admin, calls `listTopics`, and returns:
+
+```ts
+interface HealthStatus {
+  ok: boolean; // broker answered within timeout
+  latencyMs: number; // probe wall-clock
+  timestamp: number; // epoch ms when the probe started
+  error?: Error; // present when ok === false
+}
+```
+
+Default `timeoutMs: 5_000` — long enough to ride out a single broker leader election, short enough to fail a liveness probe meaningfully. Set `timeoutMs: 0` to disable the timer.
+
+**What this proves**: the broker is reachable AND the configured credentials still authenticate. **What this does NOT prove**: the producer's send path is fully operational — a fenced transactional producer would still answer healthy here. Treat the result as "broker reachable + auth still good", not "publisher fully operational".
+
+The borrowed admin is always closed (success or failure). Admin-side close failures don't change the outcome — health checks aren't the place to crash.
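
Editor's sketch (not part of the package diff): the "never throw, always return a status" contract described above can be illustrated without a broker. `ProbeStatus` and `probeWithTimeout` are hypothetical names that only mirror the documented `HealthStatus` shape; this is an assumption-laden illustration, not the library's implementation.

```typescript
// Hypothetical stand-in for the documented HealthStatus shape.
interface ProbeStatus {
  ok: boolean;
  latencyMs: number;
  timestamp: number;
  error?: Error;
}

// Runs `probe` and converts any rejection or timeout into `{ ok: false }`.
// A `timeoutMs` of 0 (or less) disables the timer, like the documented option.
async function probeWithTimeout(
  probe: () => Promise<unknown>,
  timeoutMs = 5_000,
): Promise<ProbeStatus> {
  const startedAt = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const pending = probe();
    if (timeoutMs > 0) {
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`probe timed out after ${timeoutMs}ms`)),
          timeoutMs,
        );
      });
      // Whichever settles first decides the outcome.
      await Promise.race([pending, timeout]);
    } else {
      await pending;
    }
    return { ok: true, latencyMs: Date.now() - startedAt, timestamp: startedAt };
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    return { ok: false, latencyMs: Date.now() - startedAt, timestamp: startedAt, error };
  } finally {
    // Always release the timer so a fast probe does not keep it pending.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Returning a status object on every path lets an HTTP handler map `ok` to 200 or 503 without its own try/catch.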
+
+## Producer-fenced restart
+
+`PRODUCER_FENCED` and `INVALID_PRODUCER_EPOCH` errors classify as `errorKind: "fenced"` — a distinct kind from `fatal` because some fences are **transient** (broker restart, network partition recovery) rather than a permanent multi-instance collision.
+
+### `autoRecoverFromFence: true`
+
+Opt in to a single transparent reconnect-and-retry when a publish batch reports a fence:
+
+```ts
+new KafkaPublisher({
+  brokers,
+  transactional: true,
+  transactionalId: "orders-publisher",
+  autoRecoverFromFence: true,
+});
+```
+
+What happens on a fenced batch:
+
+1. The `onProducerFenced(error)` hook fires (regardless of the recovery flag — informational).
+2. The driver is disconnected and reconnected (re-running `initTransactions` for transactional producers).
+3. The same batch is resent **once**.
+4. If the second send still reports any fenced record, the publisher gives up and surfaces those failures unchanged — silently retrying again would mask a misconfiguration.
+
+Concurrent fenced publishes share a single in-flight reconnect — the producer is not torn down twice while a recovery is in progress.
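
Editor's sketch (not part of the package diff): the shared in-flight reconnect is an instance of the single-flight pattern. The `SingleFlight` class below is a hypothetical, self-contained illustration of that pattern under the assumption that a reconnect is an async task. It is not the publisher's actual code.

```typescript
// Hypothetical helper: callers that arrive while a task is in flight
// await the same promise instead of starting a second run.
class SingleFlight {
  private inFlight: Promise<void> | null = null;

  run(task: () => Promise<void>): Promise<void> {
    if (this.inFlight) return this.inFlight;
    // Start the task on a microtask so a synchronous throw inside `task`
    // still becomes a rejection of the shared promise.
    const current: Promise<void> = Promise.resolve()
      .then(task)
      .finally(() => {
        // Clear the slot so a later fence can trigger a fresh reconnect.
        if (this.inFlight === current) this.inFlight = null;
      });
    this.inFlight = current;
    return current;
  }
}
```

With this shape, three publishes that hit a fence at the same moment would trigger one disconnect and connect cycle, and a fence arriving after that cycle has settled would start a new one.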
+
+**Default is `false`** to preserve the previous "fenced → propagate to relay" behavior. The relay will retry fenced records under the configured backoff and DLQ them when `attempts > retry.maxAttempts`.
+
+### `transactional.id` strategy for multi-instance EOS
+
+When running multiple producer instances against the same logical workload, each instance MUST have a stable, unique `transactionalId`. Use the callable form to derive it from runtime context:
+
+```ts
+new KafkaPublisher({
+  brokers,
+  transactional: true,
+  transactionalId: () => `${process.env.POD_NAME}-${process.env.HOSTNAME}`,
+  // Leave autoRecoverFromFence OFF — a fence means a real collision
+  // worth surfacing.
+});
+```
+
+Cross-instance fence is **not** a transient blip — it's the broker telling one of you that the other is now the canonical producer. Auto-recovery would create a thrashing leadership flip. Keep the option off in multi-instance setups and let the loser instance fail loudly.
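
Editor's sketch (not part of the package diff): one way to get an id that is both stable across restarts and unique per instance is to reuse the ordinal of a Kubernetes StatefulSet pod name, which has the form `<statefulset-name>-<ordinal>`. The helper name `deriveTransactionalId` and the `orders-publisher` prefix are assumptions for illustration only.

```typescript
// Hypothetical helper, not part of @eventferry/kafka.
// Builds `<prefix>-<ordinal>` from a StatefulSet-style pod name such as
// "orders-relay-2". Fails loudly rather than guessing an id.
function deriveTransactionalId(prefix: string, podName: string | undefined): string {
  if (!podName) {
    throw new Error("pod name is not set; refusing to guess a transactional id");
  }
  const match = /-(\d+)$/.exec(podName);
  if (!match) {
    throw new Error(`pod name "${podName}" has no trailing ordinal`);
  }
  return `${prefix}-${match[1]}`;
}
```

It could be wired in through the callable form shown above, for example `transactionalId: () => deriveTransactionalId("orders-publisher", process.env.POD_NAME)`. Pods created by a Deployment carry a random suffix instead of an ordinal, so their names are not stable across restarts and this approach would not fit them.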
+
+## librdkafka stats hook
+
+The confluent driver exposes librdkafka's periodic statistics stream as a typed callback. Useful for piping queue depth, broker latency, broker timeout counts, and per-topic/per-partition counters into your metrics stack.
+
+```ts
+new KafkaPublisher({
+  brokers,
+  driver: "confluent",
+  onStats: (stats) => {
+    // stats is opaque librdkafka JSON. Reach for the fields you care about.
+    promClient.gauge("kafka_msg_cnt").set(stats.msg_cnt as number);
+    promClient.gauge("kafka_txmsgs").set(stats.txmsgs as number);
+  },
+  statsIntervalMs: 30_000, // optional; defaults to 30s when onStats is set
+});
+```
+
+- **`onStats`** receives the librdkafka stats JSON, already parsed to a plain object. The schema is opaque (`Record<string, unknown>`) — librdkafka's stats are huge and evolve across versions. Reference: [librdkafka STATISTICS.md](https://github.com/confluentinc/librdkafka/blob/master/STATISTICS.md).
+- **`statsIntervalMs`** maps to librdkafka's `statistics.interval.ms`. **Defaults to 30000 ms when `onStats` is set; otherwise stays off** (librdkafka CPU-bills the JSON serialization every tick — we don't enable it silently).
+- The wrapper swallows callback exceptions and JSON parse failures — a single dropped sample is preferable to taking down the producer's event loop.
+- **No-op on the kafkajs driver** — kafkajs has no equivalent surface. Logs a one-time warning and ignores both options.
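
Editor's sketch (not part of the package diff): because the stats object is typed as `Record<string, unknown>`, a callback may prefer a runtime check over an `as number` cast, so a field that is missing or renamed in a later librdkafka version is skipped instead of reported as `NaN`. The helper names below are hypothetical; `msg_cnt` and `txmsgs` are the fields used in the README example.

```typescript
type Stats = Record<string, unknown>;

// Returns the field as a finite number, or undefined when it is absent
// or not numeric.
function readNumber(stats: Stats, field: string): number | undefined {
  const value = stats[field];
  return typeof value === "number" && Number.isFinite(value) ? value : undefined;
}

// Collects only the requested top-level gauges that are actually present.
function extractGauges(stats: Stats, fields: readonly string[]): Map<string, number> {
  const gauges = new Map<string, number>();
  for (const field of fields) {
    const value = readNumber(stats, field);
    if (value !== undefined) gauges.set(field, value);
  }
  return gauges;
}
```

Inside `onStats`, the resulting map could be iterated to set gauges in whatever metrics client is in use.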
+
 ## Power-user escape hatches
 
 When the high-level options don't reach a knob you need, drop down to the native client config.
@@ -455,6 +590,50 @@ await consumer.run({
 
 `extractTraceContext` returns `null` if no `traceparent` header is present or it fails W3C validation (all-zero IDs, `version: ff`, malformed hex). It accepts both raw consumer headers (Buffer values) and already-decoded headers (string values).
 
+#### Typed payload via the producer-side registry
+
+When your consumer lives in the same monorepo as the producer, hand the decoded bytes to the **same `defineOutbox(registry)`** you used to enqueue. `decode` validates against the topic's Standard Schema and returns the typed payload:
+
+```ts
+import { defineOutbox } from "@eventferry/core";
+import { decode } from "@eventferry/kafka/consume";
+import { registry } from "./outbox-registry";
+
+const events = defineOutbox(registry); // no store — consumer side
+
+await consumer.run({
+  eachMessage: async ({ message }) => {
+    const m = decode(message, { decoder: "utf8" });
+    const event = await events.decode("orders.created", m.value!);
+    //    ^? { orderId: string; total: number }
+    await handle(event);
+  },
+});
+```
+
+`events.decode(topic, bytes)` throws `OutboxValidationError` if the topic isn't in the registry or the payload doesn't match the schema. Cross-language consumers (Go, Java, Python) skip the companion and use their own schema tooling — Confluent Schema Registry for typed wire formats.
+
+#### DLQ recipe
+
+Records that exhaust retries land on `${topic}.dlq` (or your configured DLQ topic) carrying enriched headers `dlq-reason`, `dlq-error-class`, `dlq-original-topic`, `dlq-failed-at`, `dlq-attempts` (and optionally `dlq-stack` when you opt in). Route them with `dlq-error-class` rather than parsing `dlq-reason`:
+
+```ts
+await dlqConsumer.run({
+  eachMessage: async ({ message }) => {
+    const m = decode(message);
+    const errClass = m.headers["dlq-error-class"];
+    if (errClass === "KafkaJSProtocolError" && m.headers["dlq-reason"]?.includes("MESSAGE_TOO_LARGE")) {
+      await ticket.create({ title: `Oversized DLQ from ${m.headers["dlq-original-topic"]}` });
+    } else {
+      await retryQueue.put({
+        payload: m.value,
+        attemptsSoFar: Number(m.headers["dlq-attempts"] ?? "0"),
+      });
+    }
+  },
+});
+```
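
Editor's sketch (not part of the package diff): the alert-versus-retry split in the recipe can be factored into a pure function that is easy to unit test. `routeDlq`, `DlqHeaders`, and `DlqAction` are hypothetical names, and the sketch assumes headers have already been decoded to strings.

```typescript
type DlqHeaders = Record<string, string | undefined>;

type DlqAction =
  | { kind: "alert"; reason: string }
  | { kind: "retry"; attemptsSoFar: number };

// Routes by `dlq-error-class`. Anything not listed in `alertClasses`
// goes back to a retry queue with its attempt count carried along.
function routeDlq(headers: DlqHeaders, alertClasses: ReadonlySet<string>): DlqAction {
  const errClass = headers["dlq-error-class"];
  if (errClass !== undefined && alertClasses.has(errClass)) {
    return { kind: "alert", reason: headers["dlq-reason"] ?? "unknown" };
  }
  // Treat a missing or malformed attempts header as zero attempts.
  const parsed = Number(headers["dlq-attempts"] ?? "0");
  const attemptsSoFar = Number.isInteger(parsed) && parsed >= 0 ? parsed : 0;
  return { kind: "retry", attemptsSoFar };
}
```

Keeping the decision separate from the consumer loop means the set of alert-worthy error classes can change without touching the message handling code.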
+
 ### `validateTopicsOnConnect`
 
 Fail-fast at startup if expected topics are missing:
package/dist/index.cjs
CHANGED
@@ -51,6 +51,7 @@ function classifyKafkajsError(err) {
   if (e.name === "KafkaJSNonRetriableError") return "fatal";
   const type = typeof e.type === "string" ? e.type : void 0;
   if (type) {
+    if (FENCED_TYPES.has(type)) return "fenced";
     if (RETRIABLE_TYPES.has(type)) return "retriable";
     if (POISON_TYPES.has(type)) return "poison";
     if (FATAL_TYPES.has(type)) return "fatal";
@@ -84,9 +85,11 @@ var POISON_TYPES = /* @__PURE__ */ new Set([
   "INVALID_REQUIRED_ACKS",
   "INVALID_PARTITIONS"
 ]);
-var
+var FENCED_TYPES = /* @__PURE__ */ new Set([
   "INVALID_PRODUCER_EPOCH",
-  "PRODUCER_FENCED"
+  "PRODUCER_FENCED"
+]);
+var FATAL_TYPES = /* @__PURE__ */ new Set([
   "TOPIC_AUTHORIZATION_FAILED",
   "CLUSTER_AUTHORIZATION_FAILED",
   "TRANSACTIONAL_ID_AUTHORIZATION_FAILED",
@@ -117,8 +120,8 @@ var CODE_TO_KIND = /* @__PURE__ */ new Map([
   // TOPIC_AUTHORIZATION_FAILED
   [31, "fatal"],
   // CLUSTER_AUTHORIZATION_FAILED
-  [47, "
-  // INVALID_PRODUCER_EPOCH
+  [47, "fenced"],
+  // INVALID_PRODUCER_EPOCH — retryable once via publisher reconnect
   [58, "fatal"],
   // SASL_AUTHENTICATION_FAILED
   [74, "retriable"],
@@ -151,7 +154,10 @@ var UNSUPPORTED_BY_KAFKAJS = [
   "maxRequestSize",
   // Confluent-only escape hatches; ignored on kafkajs.
   "compressionLevel",
-  "rawProducerConfig"
+  "rawProducerConfig",
+  // librdkafka stats — kafkajs has no equivalent surface.
+  "onStats",
+  "statsIntervalMs"
 ];
 var KafkaJsDriver = class {
   transactional;
@@ -437,8 +443,8 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
   // ERR__TRANSPORT
   [-198, "poison"],
   // ERR__BAD_COMPRESSION
-  [-144, "
-  // ERR__FENCED — producer fenced
+  [-144, "fenced"],
+  // ERR__FENCED — producer fenced; publisher reconnect attempts a transparent recovery once
   [-150, "fatal"],
   // ERR__FATAL — unrecoverable librdkafka error
   [-169, "fatal"],
@@ -470,8 +476,8 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
   // TOPIC_AUTHORIZATION_FAILED
   [31, "fatal"],
   // CLUSTER_AUTHORIZATION_FAILED
-  [47, "
-  // INVALID_PRODUCER_EPOCH
+  [47, "fenced"],
+  // INVALID_PRODUCER_EPOCH — retryable once via publisher reconnect
   [58, "fatal"],
   // SASL_AUTHENTICATION_FAILED
   [74, "retriable"],
@@ -485,7 +491,7 @@ var CODE_TO_KIND2 = /* @__PURE__ */ new Map([
 ]);
 var NAME_TO_KIND = /* @__PURE__ */ new Map([
   ["ERR__QUEUE_FULL", "backpressure"],
-  ["ERR__FENCED", "
+  ["ERR__FENCED", "fenced"],
   ["ERR__FATAL", "fatal"],
   ["ERR__AUTHENTICATION", "fatal"],
   ["ERR__SSL", "fatal"],
@@ -494,7 +500,7 @@ var NAME_TO_KIND = /* @__PURE__ */ new Map([
   ["ERR__BAD_COMPRESSION", "poison"],
   ["ERR_TOPIC_AUTHORIZATION_FAILED", "fatal"],
   ["ERR_CLUSTER_AUTHORIZATION_FAILED", "fatal"],
-  ["ERR_INVALID_PRODUCER_EPOCH", "
+  ["ERR_INVALID_PRODUCER_EPOCH", "fenced"],
   ["ERR_SASL_AUTHENTICATION_FAILED", "fatal"],
   ["ERR_CORRUPT_MESSAGE", "poison"],
   ["ERR_MSG_SIZE_TOO_LARGE", "poison"],
@@ -530,6 +536,12 @@ function buildConfluentClientConfig(opts) {
   if (opts.compressionLevel !== void 0) {
     librdkafka["compression.level"] = opts.compressionLevel;
   }
+  if (opts.onStats) {
+    librdkafka["stats_cb"] = wrapStatsCallback(opts.onStats);
+    librdkafka["statistics.interval.ms"] = opts.statsIntervalMs ?? 3e4;
+  } else if (opts.statsIntervalMs !== void 0) {
+    librdkafka["statistics.interval.ms"] = opts.statsIntervalMs;
+  }
   const tlsRequested = opts.ssl === true || isTlsConfig(opts.ssl);
   const saslRequested = !!opts.sasl;
   if (saslRequested && tlsRequested) {
@@ -567,6 +579,20 @@ function buildConfluentClientConfig(opts) {
 function isTlsConfig(v) {
   return typeof v === "object" && v !== null;
 }
+function wrapStatsCallback(onStats) {
+  return (raw) => {
+    let parsed;
+    try {
+      parsed = typeof raw === "string" ? JSON.parse(raw) : raw;
+    } catch {
+      return;
+    }
+    try {
+      onStats(parsed);
+    } catch {
+    }
+  };
+}
 function stringifyPem(input) {
   if (Array.isArray(input)) {
     return input.map((x) => typeof x === "string" ? x : x.toString("utf8")).join("\n");
@@ -808,11 +834,17 @@ var KafkaPublisher = class {
   hooks;
   tracer;
   validateTopicsOnConnect;
+  autoRecoverFromFence;
+  // Serialize reconnects so concurrent publish() calls hitting a fence
+  // all observe the same single reconnect attempt — the second publish
+  // doesn't try to disconnect a producer the first is still re-initing.
+  fenceRecovery = null;
   constructor(opts) {
     this.logger = opts.logger;
     this.hooks = opts.hooks ?? {};
     this.tracer = opts.tracer ?? new NoopKafkaTracer();
     this.validateTopicsOnConnect = opts.validateTopicsOnConnect ? Object.freeze([...opts.validateTopicsOnConnect]) : void 0;
+    this.autoRecoverFromFence = opts.autoRecoverFromFence ?? false;
     const onTransactionAbort = this.hooks.onTransactionAbort ? (error) => {
       void safeHook(
         this.logger,
@@ -935,6 +967,20 @@ var KafkaPublisher = class {
       await safeHook(this.logger, "onError", () => this.hooks.onError?.(error));
       throw err;
     }
+    const firstFenced = results.find(
+      (r) => !r.ok && r.errorKind === "fenced"
+    );
+    if (firstFenced) {
+      const fenceErr = firstFenced.error ?? new Error("producer fenced");
+      await safeHook(
+        this.logger,
+        "onProducerFenced",
+        () => this.hooks.onProducerFenced?.(fenceErr)
+      );
+      if (this.autoRecoverFromFence) {
+        results = await this.recoverAndRetry(outgoing, results);
+      }
+    }
     const byId = new Map(messages.map((m) => [m.recordId, m]));
     let allOk = true;
     for (const r of results) {
@@ -985,6 +1031,110 @@ var KafkaPublisher = class {
   get transactional() {
     return this.driver.transactional;
   }
+  /**
+   * Cheap reachability probe. Borrows a fresh admin client, calls
+   * `listTopics`, and returns timing + outcome. Useful as the body of a
+   * `/healthz` or `/readyz` endpoint — proves the broker is reachable
+   * AND that the configured credentials still authenticate against it,
+   * without writing a record.
+   *
+   * Does NOT exercise the producer's send path — a healthy admin
+   * connection doesn't guarantee `publish()` will succeed (a fenced
+   * transactional producer would still answer healthy here). Treat this
+   * as "broker reachable + auth still good", not "publisher is fully
+   * operational".
+   *
+   * Default timeout 5_000 ms — long enough to ride out a single broker
+   * leader election, short enough to fail a liveness probe meaningfully.
+   * Set `timeoutMs: 0` to disable the timer entirely.
+   *
+   * The driver must implement `admin()` (the built-ins do); custom
+   * drivers without admin get `{ ok: false, error: ... }` instead of
+   * the throw `publisher.admin()` would surface — health checks are
+   * not the place to crash.
+   */
+  async healthCheck(opts = {}) {
+    const timeoutMs = opts.timeoutMs ?? 5e3;
+    const startedAt = Date.now();
+    if (!this.driver.admin) {
+      return {
+        ok: false,
+        latencyMs: 0,
+        timestamp: startedAt,
+        error: new Error(
+          "KafkaPublisher.healthCheck: configured driver does not implement admin()"
+        )
+      };
+    }
+    let admin = null;
+    try {
+      admin = await this.driver.admin();
+      await admin.connect();
+      const probe = admin.listTopics();
+      if (timeoutMs > 0) {
+        await raceWithTimeout(probe, timeoutMs, "healthCheck");
+      } else {
+        await probe;
+      }
+      return {
+        ok: true,
+        latencyMs: Date.now() - startedAt,
+        timestamp: startedAt
+      };
+    } catch (err) {
+      const error = err instanceof Error ? err : new Error(String(err));
+      return {
+        ok: false,
+        latencyMs: Date.now() - startedAt,
+        timestamp: startedAt,
+        error
+      };
+    } finally {
+      try {
+        await admin?.close();
+      } catch {
+      }
+    }
+  }
+  /**
+   * Disconnect + re-connect the driver and re-send the batch ONCE. Used
+   * by the fence-recovery path. Concurrent fence recoveries dedupe on a
+   * shared in-flight promise (`fenceRecovery`) so we don't tear the
+   * producer down while another batch is mid-restart.
+   *
+   * If the second send STILL reports any fenced records, those failures
+   * are returned unchanged — another instance has almost certainly taken
+   * the same `transactionalId` and silently retrying again would mask
+   * the misconfiguration.
+   */
+  async recoverAndRetry(outgoing, firstResults) {
+    if (!this.fenceRecovery) {
+      this.fenceRecovery = (async () => {
+        try {
+          await this.driver.disconnect();
+          await this.driver.connect();
+        } finally {
+          this.fenceRecovery = null;
+        }
+      })();
+    }
+    try {
+      await this.fenceRecovery;
+    } catch (err) {
+      const reconnectErr = err instanceof Error ? err : new Error(String(err));
+      await safeHook(
+        this.logger,
+        "onError",
+        () => this.hooks.onError?.(reconnectErr)
+      );
+      return firstResults;
+    }
+    try {
+      return await this.driver.sendBatch(outgoing);
+    } catch {
+      return firstResults;
+    }
+  }
   /**
    * Start a span for the batch following the OTel messaging conventions.
    *
@@ -1003,6 +1153,26 @@ var KafkaPublisher = class {
     });
   }
 };
+function raceWithTimeout(p, ms, label) {
+  return new Promise((resolve, reject) => {
+    const timer = setTimeout(() => {
+      reject(new Error(`${label} timed out after ${ms}ms`));
+    }, ms);
+    if (typeof timer.unref === "function") {
+      timer.unref();
+    }
+    p.then(
+      (v) => {
+        clearTimeout(timer);
+        resolve(v);
+      },
+      (e) => {
+        clearTimeout(timer);
+        reject(e);
+      }
+    );
+  });
+}
 function selectDriver(opts) {
   const kind = opts.driver ?? "kafkajs";
   switch (kind) {