@checkstack/backend-api 0.18.0 → 0.20.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,256 @@
1
1
  # @checkstack/backend-api
2
2
 
3
+ ## 0.20.0
4
+
5
+ ### Minor Changes
6
+
7
+ - a57f7db: fix(backend): give advisory locks a dedicated connection pool to prevent pool-starvation deadlock
8
+
9
+ Both the session-lock service and `withXactLock` HOLD a Postgres connection for
10
+ the lock's whole lifetime while the gated work runs on a _different_ connection.
11
+ Both lock and work were drawing from the single shared `adminPool` (which, with
12
+ no explicit config, defaulted to `max: 10` and `connectionTimeoutMillis: 0` -
13
+ wait forever). Under concurrency >= pool size, every slot became a lock-holding
14
+ connection waiting for a work connection that could never free up: a permanent
15
+ deadlock. It surfaced as all connections stuck `idle in transaction` on
16
+ `pg_advisory_xact_lock` and every API request hanging into an upstream 502,
17
+ only after the server had been running long enough to hit that concurrency
18
+ (e.g. a burst of health-check evaluations or incident dedups).
19
+
20
+ Advisory locks now run on a dedicated `lockPool`, separate from `adminPool`, so
21
+ the acquire graph is acyclic (`lockPool -> adminPool`, never back) and the
22
+ deadlock class is impossible. `AdvisoryLockService` gains a pooled
23
+ `withXactLock({ key, fn })` method (lock on the lock pool, work on the admin
24
+ pool); healthcheck's per-system serializer, incident's dedup-create, and the
25
+ automation single-mode concurrency lock now use it. The deadlock-prone
26
+ standalone `withXactLock({ db, ... })` helper is REMOVED.
27
+
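The connection accounting behind this entry can be illustrated with a toy, synchronous model. This is a minimal sketch, not the package's implementation: `TinyPool` and `simulate` are hypothetical names, and the model only counts slots (a real `pg` pool queues waiters rather than failing immediately).

```typescript
// Toy, synchronous model of connection accounting (not the real `pg` Pool).
// `TinyPool` and `simulate` are hypothetical names used only for illustration.
class TinyPool {
  private inUse = 0;
  constructor(readonly max: number) {}

  /** Non-blocking checkout: true if a slot was free. */
  tryAcquire(): boolean {
    if (this.inUse >= this.max) return false;
    this.inUse += 1;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse -= 1;
  }
}

interface SimulationResult {
  completed: number;
  stuck: number;
}

/**
 * Phase 1: every operation checks out its lock connection.
 * Phase 2: every lock holder asks `workPool` for a work connection.
 * A holder that cannot get one stays stuck, still pinning its lock connection.
 */
function simulate(lockPool: TinyPool, workPool: TinyPool, ops: number): SimulationResult {
  let holders = 0;
  for (let i = 0; i < ops; i++) {
    if (lockPool.tryAcquire()) holders += 1;
  }

  let completed = 0;
  for (let i = 0; i < holders; i++) {
    if (workPool.tryAcquire()) {
      workPool.release(); // gated work finished
      lockPool.release(); // lock connection returned
      completed += 1;
    }
  }
  return { completed, stuck: holders - completed };
}

// Pre-fix wiring: lock and work share one pool, concurrency equals pool size.
const shared = new TinyPool(4);
const sharedResult = simulate(shared, shared, 4);

// Post-fix wiring: the lock pool only ever waits on the work pool, never back.
const dedicatedResult = simulate(new TinyPool(4), new TinyPool(4), 4);
```

In the shared case every slot ends up held by a lock holder and no work connection is ever available; with two pools each holder finds a work slot and releases in turn.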
28
+ Both pools are explicitly configured with `connectionTimeoutMillis` so any
29
+ future exhaustion fails fast and self-heals instead of hanging, and both get a
30
+ pool-level `error` handler (an idle pooled client whose backend dies otherwise
31
+ crashes the pod). The lock pool additionally sets
32
+ `idle_in_transaction_session_timeout` and `lock_timeout` so a stalled critical
33
+ section is reaped server-side (auto-releasing the lock) rather than stranding a
34
+ key forever. The advisory-lock service also now removes its per-client error
35
+ listener on release (it previously leaked one listener per acquisition on each
36
+ reused pooled connection - an unbounded `MaxListenersExceeded` leak).
37
+
38
+ New env vars (all optional): `DATABASE_POOL_MAX` (default 20),
39
+ `DATABASE_LOCK_POOL_MAX` (default 10), `DATABASE_POOL_CONNECTION_TIMEOUT_MS`
40
+ (default 10000), `DATABASE_POOL_IDLE_TIMEOUT_MS` (default 30000),
41
+ `DATABASE_LOCK_IDLE_TX_TIMEOUT_MS` (default 30000), `DATABASE_LOCK_TIMEOUT_MS`
42
+ (default 30000). Size pools so that
43
+ `N_pods * (DATABASE_POOL_MAX + DATABASE_LOCK_POOL_MAX) <= max_connections`.
44
+
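The sizing inequality can be checked with a few lines of arithmetic. A minimal sketch, assuming the documented defaults (20 and 10); the helper names are hypothetical, and the calculation ignores connections used by other clients or reserved by the server.

```typescript
// Defaults taken from the changelog entry above; the helper names are hypothetical.
interface PoolSizing {
  poolMax: number; // DATABASE_POOL_MAX
  lockPoolMax: number; // DATABASE_LOCK_POOL_MAX
}

const DEFAULT_SIZING: PoolSizing = { poolMax: 20, lockPoolMax: 10 };

/** Connections one pod may open at peak: both pools fully checked out. */
function connectionsPerPod(s: PoolSizing = DEFAULT_SIZING): number {
  return s.poolMax + s.lockPoolMax;
}

/** True when N_pods * (poolMax + lockPoolMax) <= max_connections. */
function fitsWithin(
  maxConnections: number,
  pods: number,
  s: PoolSizing = DEFAULT_SIZING,
): boolean {
  return pods * connectionsPerPod(s) <= maxConnections;
}

/** Largest pod count that still satisfies the inequality. */
function maxPods(maxConnections: number, s: PoolSizing = DEFAULT_SIZING): number {
  return Math.floor(maxConnections / connectionsPerPod(s));
}
```

With the defaults, each pod can open up to 30 connections, so a server with `max_connections = 100` fits three pods (90 connections) but not four (120).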
45
+ BREAKING CHANGE: the standalone `withXactLock({ db, key, fn })` export is
46
+ removed - use `coreServices.advisoryLock.withXactLock({ key, fn })` instead.
47
+ `IncidentService`'s constructor now requires an `AdvisoryLockService` as its
48
+ second argument, and the healthcheck `createHealthEntitySerializer` /
49
+ `executeHealthCheckJob` / `setupHealthCheckWorker` helpers take `advisoryLock`
50
+ instead of `db` for the serializer.
51
+
52
+ ### Patch Changes
53
+
54
+ - @checkstack/cache-api@0.3.8
55
+ - @checkstack/queue-api@0.3.8
56
+
57
+ ## 0.19.0
58
+
59
+ ### Minor Changes
60
+
61
+ - 270ef29: Fix automation provider actions and `secretEnv` script actions throwing in production.
62
+
63
+ The automation dispatch engine resolved provider-action dependencies (the integration connection store, the secret resolver) through a `getService` that was a throwing stub, so Jira / Teams / Webex actions and `secretEnv` script actions threw at execute time in production. The whole dispatch test suite stubbed `getService`, so the break was invisible.
64
+
65
+ Root cause: the plugin `env` exposed `registerService` but no resolver, so the dispatch path (the only context that resolves arbitrary cross-plugin refs outside an RPC handler) had nothing real to call.
66
+
67
+ Changes:
68
+
69
+ - `@checkstack/backend-api`: add `getService<S>(ref: ServiceRef<S>): Promise<S>` to the plugin `env` (`BackendPluginRegistry`). It resolves a service registered by any plugin through the real `ServiceRegistry` using the calling plugin's identity, and throws a clear error if the ref is not registered (never silently `undefined`). **NEW PLUGIN-AUTHOR CONTRACT**: `env.getService` is now available to resolve arbitrary cross-plugin service refs at init / afterPluginsReady time.
70
+ - `@checkstack/backend`: implement `env.getService` in both the plugin loader and the runtime single-plugin registration path, backed by `ServiceRegistry.get(ref, { pluginId })`.
71
+ - `@checkstack/automation-backend`: wire the dispatch `getService` to `env.getService` (was a throwing stub). This also activates run-wide provider-credential masking, because resolving the connection store / secret resolver now flows through the run's masking interceptor.
72
+
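The "throws a clear error, never silently `undefined`" contract can be sketched as follows. This is an illustration under stated assumptions: `TinyServiceRegistry`, `createServiceRef`, and `createGetService` are hypothetical stand-ins, since the real `ServiceRef` and `ServiceRegistry` shapes are not part of this diff.

```typescript
// Hypothetical, simplified shapes: the real ServiceRef and ServiceRegistry are not shown in this diff.
interface ServiceRef<S> {
  readonly id: string;
  // Phantom field so `S` participates in inference; never set at runtime.
  readonly __type?: S;
}

function createServiceRef<S>(id: string): ServiceRef<S> {
  return { id };
}

class TinyServiceRegistry {
  private readonly services = new Map<string, unknown>();

  register<S>(ref: ServiceRef<S>, impl: S): void {
    this.services.set(ref.id, impl);
  }

  /** Resolve or throw: a missing ref never comes back as `undefined`. */
  resolve<S>(ref: ServiceRef<S>, opts: { pluginId: string }): S {
    if (!this.services.has(ref.id)) {
      throw new Error(
        `Service "${ref.id}" is not registered (requested by plugin "${opts.pluginId}")`,
      );
    }
    return this.services.get(ref.id) as S;
  }
}

/** Build a per-plugin `getService`, bound to the calling plugin's identity. */
function createGetService(registry: TinyServiceRegistry, pluginId: string) {
  return async <S>(ref: ServiceRef<S>): Promise<S> => registry.resolve(ref, { pluginId });
}

interface SecretResolver {
  resolve(name: string): string;
}
const secretResolverRef = createServiceRef<SecretResolver>("secret-resolver");
const unregisteredRef = createServiceRef<number>("not-there");

const registry = new TinyServiceRegistry();
registry.register(secretResolverRef, { resolve: (name) => `value-of-${name}` });
const getService = createGetService(registry, "automation");
```

A caller would `await getService(secretResolverRef)`; asking for `unregisteredRef` rejects with an error naming both the ref and the requesting plugin.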
73
+ Also fixes a test-only seam where the `core/backend` test preload registered a no-op `registerRouter`, silently disabling oRPC router registration across the suite.
74
+
75
+ - 270ef29: Fix suspend/resume durability + complete the run-wide secret-masking guarantee.
76
+
77
+ A panel review confirmed several defects in the automation dispatch engine's suspend/resume durability and in the run-wide masking choke point. These survived because the unit suite stubbed the seam under test; the fixes ship with tests that exercise the real suspend / sweep / resume paths.
78
+
79
+ Suspend/resume durability:
80
+
81
+ - **Stalled sweeper no longer re-runs intentional waits.** `findStalledRunIds` now joins `automation_runs` and returns only `status = 'running'` runs, and suspend-finalisation no longer clobbers the run's `lastActionPath` checkpoint to `null`. Previously any wait longer than the stale window (>60s) was re-walked from the top every sweep cycle, re-firing pre-wait side effects and leaking wait locks. The wait-aware sweeps now also run before the stalled-run sweep.
82
+ - **Stalled recovery refuses a run holding a live wait lock.** `recoverStalledRun` now only recovers a genuinely-`running` run with no wait lock; a crash-mid-wait recovery is left to the wait/resume paths instead of re-walking from the top and creating a duplicate lock + duplicate delay job.
83
+ - **Cancelled runs can no longer resurrect.** `resumeRun` guards on `status === 'waiting'` (mirroring `checkWaitUntil`) and drops any stale lock for a non-waiting run, so `wakeWaitingRuns` / delay-expiry / a racing queue job can't wake a cancelled or terminal run. `cancelActiveRuns` (restart mode) now deletes the cancelled runs' wait locks + run-state in the same operation.
84
+ - **Concurrency check-then-create is serialized.** The `mode` check + `createRun` now run under a transaction-scoped advisory lock keyed on `(automationId, scope)`, so two concurrent fires can't both pass a `single`-mode "no active run" check and double-run.
85
+
86
+ Masking guarantee (now genuinely covers scope + artifacts):
87
+
88
+ - **The run-wide masking choke point now also masks the durable scope snapshot and produced artifacts.** The `RunSecretRegistry` is threaded into `RunStateStore.upsert` (masks `scopeSnapshot`) and `ArtifactStore.record` (masks `data`) so a resolved connection credential threaded into `scope.variables` or surfaced into an artifact is redacted before persist - and therefore cannot reach a read-only user via `getRunScopeForReplay`. **GUARANTEE CHANGE**: run-wide masking now covers step output, run error, scope snapshot, and artifact data for every action.
89
+ - **`testConnection` / `testProviderConnection` mask provider errors.** These RPCs run outside a dispatch run, so they build a per-call mask set from the resolved/submitted connection config and run any provider error through it before returning, so a provider error echoing a token can't cross back to the browser.
90
+ - **Short secrets surface a warning.** `setSecret` now warns that a value shorter than `MIN_MASKABLE_LENGTH` (4) cannot be auto-redacted (the threshold is intentionally not lowered).
91
+
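The masking rule and the short-secret caveat can be sketched like this. A minimal illustration only: the `MIN_MASKABLE_LENGTH` value comes from the entry above, while the `maskSecrets` signature, the `isMaskable` helper, and the `***` placeholder are assumptions for the sketch.

```typescript
const MIN_MASKABLE_LENGTH = 4; // value stated in the changelog
const MASK = "***"; // placeholder chosen for this sketch

/** True when a value is long enough to be redacted without mangling ordinary text. */
function isMaskable(value: string): boolean {
  return value.length >= MIN_MASKABLE_LENGTH;
}

/** Replace every occurrence of each maskable secret in `text`. */
function maskSecrets(text: string, secrets: string[]): string {
  // Longest first, so a secret that contains another is replaced whole.
  const maskable = secrets.filter(isMaskable).sort((a, b) => b.length - a.length);
  let out = text;
  for (const secret of maskable) {
    out = out.split(secret).join(MASK);
  }
  return out;
}

const maskedLine = maskSecrets("token=abcd1234 pin=12", ["abcd1234", "12"]);
```

Here the eight-character token is redacted while the two-character value is left alone, which is why a short secret can only be warned about rather than masked.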
92
+ Internal:
93
+
94
+ - `@checkstack/backend-api`: `withXactLock`'s `fn` now receives the transaction handle `tx` so a critical section can run on the locked connection; the doc clarifies why running on the pool inside the lock window is still safe. The incident dedup caller's comment is corrected accordingly. `RunStore` gains `findWaitLocksByRun`.
95
+
96
+ - 270ef29: Fix several correctness defects around distributed coordination and stored-data handling.
97
+
98
+ - Dwell `for:` timers now fire via an atomic `DELETE ... RETURNING` claim, so two pods (or the stalled sweeper vs the queue consumer) can no longer both fire the same dwell.
99
+ - Postgres session-level advisory locks now keep connection affinity. A shared `AdvisoryLockService` (backed by a dedicated pooled client) replaces the previous acquire/release-on-different-connection pattern that leaked locks. Used by the script-packages installer election, the automation run resume + stalled sweeper, and (via a new transaction-scoped `withXactLock`) incident dedup.
100
+ - A storage migration that crashed mid-flight is now resumed on startup under the installer-election lock, instead of permanently wedging installs.
101
+ - Distributed script-package blobs carry a `blobSha256` and are verified before extraction (the SRI `integrity` hashes the npm tarball, not the transported archive). Backward-safe: entries without the field skip verification until a re-install regenerates the manifest.
102
+ - Archive extraction rejects zip-slip paths (absolute or `..` entries) before writing anything.
103
+ - `incident.create` with `dedupe_open_for_system` serializes its check-then-create per system, so concurrent triggers for the same system can't both open a duplicate incident.
104
+ - Seeded auto-incident filter expressions JSON-encode interpolated ids so a quote/backslash can't corrupt the expression.
105
+ - Stored jsonb snapshots (dwell `actorSnapshot`, wait-lock `waitConfig`) are validated with zod on load and degrade safely instead of flowing through as the wrong type.
106
+
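The zip-slip check from the list above could look roughly like this. A minimal sketch, not the package's code: `isSafeEntryPath` and `assertSafeArchive` are hypothetical names, and rejecting Windows drive prefixes is an extra precaution assumed here.

```typescript
import { posix } from "node:path";

/** Reject absolute paths and any entry that would resolve outside the extraction root. */
function isSafeEntryPath(entryPath: string): boolean {
  if (entryPath.length === 0) return false;
  // Treat backslashes as separators too, so "..\\evil" cannot slip through.
  const unified = entryPath.replace(/\\/g, "/");
  if (unified.startsWith("/") || /^[A-Za-z]:/.test(unified)) return false;
  const normalized = posix.normalize(unified);
  return normalized !== ".." && !normalized.startsWith("../");
}

/** Validate every entry up front so nothing is written if any path is unsafe. */
function assertSafeArchive(entryPaths: string[]): void {
  const bad = entryPaths.filter((p) => !isSafeEntryPath(p));
  if (bad.length > 0) {
    throw new Error(`Refusing to extract: unsafe entry paths ${JSON.stringify(bad)}`);
  }
}
```

Checking the full entry list before extraction is what gives the "before writing anything" property: one bad path aborts the whole archive.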
107
+ - b995afb: Harden the advisory-lock service against holder-connection termination.
108
+
109
+ A session-level advisory lock is held on a dedicated checked-out pool client.
110
+ If that backend is terminated (admin kill, failover, network drop) while the
111
+ lock is held, `pg` emits an `'error'` on the client; with no listener attached
112
+ that error is re-thrown by the EventEmitter and crashes the pod. The service
113
+ now attaches an error listener to the held client so the loss degrades
114
+ gracefully - the session lock is auto-released server-side when the backend
115
+ dies, and the key simply becomes acquirable again.
116
+
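The failure mode described here is standard Node `EventEmitter` behaviour, and can be shown without a database. A minimal sketch: `FakeClient` stands in for a checked-out pooled client and `guardHeldClient` is a hypothetical helper; only the listener handling is modelled.

```typescript
import { EventEmitter } from "node:events";

// `FakeClient` stands in for a checked-out pooled client; only its EventEmitter behaviour matters here.
class FakeClient extends EventEmitter {}

/** Attach an error listener while a lock is held; returns the matching detach function. */
function guardHeldClient(client: EventEmitter, onLoss: (err: Error) => void): () => void {
  const listener = (err: Error): void => onLoss(err);
  client.on("error", listener);
  // Remove exactly this listener on release so a reused client does not
  // accumulate one listener per acquisition.
  return () => {
    client.off("error", listener);
  };
}

const client = new FakeClient();
const losses: string[] = [];

// Simulate many acquire/release cycles on the same reused client.
for (let i = 0; i < 50; i++) {
  const detach = guardHeldClient(client, (err) => losses.push(err.message));
  if (i === 0) client.emit("error", new Error("backend terminated"));
  detach();
}

const listenersAfter = client.listenerCount("error");

// With no listener attached, emitting "error" throws instead of being handled.
let threw = false;
try {
  client.emit("error", new Error("unhandled"));
} catch {
  threw = true;
}
```

The guarded emit is recorded and swallowed, the unguarded one throws, and detaching on every release keeps the listener count at zero across reuse.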
117
+ Also de-flaked the advisory-lock integration test: it now terminates only the
118
+ lock-holding backend (found via `pg_locks`) instead of every backend in the
119
+ database - the old blanket kill also tore down the pool's idle connections,
120
+ whose async errors flaked the run and left the pool unusable.
121
+
122
+ - 270ef29: Add in-UI script testing for automation `run_script` / `run_shell` actions.
123
+
124
+ A new `testScript` RPC runs a TypeScript or shell script against an
125
+ editable, auto-seeded sample context using the same sandboxed runner the
126
+ real action uses, so operators can test scripts directly in the editor
127
+ without dispatching a whole automation. Surfaces beneath any script field
128
+ flagged `x-script-testable` via the new `ScriptTestPanel` /
129
+ `ContextSampleEditor` components in `@checkstack/ui` and the
130
+ `scriptTestRenderer` prop threaded through `DynamicForm`.
131
+
132
+ - `@checkstack/automation-common`: adds the `testScript` contract +
133
+ `ScriptTest*` schemas (gated by `automation.manage`).
134
+ - `@checkstack/automation-backend`: implements `testScript` reusing the
135
+ shared ESM / shell runners; central-only, time-bounded.
136
+ - `@checkstack/backend-api`: new `x-script-testable` config-schema
137
+ metadata propagated to the frontend JSON Schema.
138
+ - `@checkstack/ui`: new `ScriptTestPanel` + `ContextSampleEditor`
139
+ components and a `scriptTestRenderer` prop on `DynamicForm`.
140
+ - `@checkstack/automation-frontend`: wires the test panel into the action
141
+ editor.
142
+ - `@checkstack/integration-script-backend`: marks the `run_script` /
143
+ `run_shell` script fields as testable.
144
+
145
+ - 270ef29: Activate npm packages in script execution: thread the managed
146
+ `resolutionRoot` into every user-script call site so an allowlisted package
147
+ can actually be `import`ed.
148
+
149
+ - `@checkstack/backend-api`: the ESM runner now always writes a per-run
150
+ `bunfig.toml` with `[install] auto = "disable"` and runs with that dir as
151
+ CWD. Without this Bun silently auto-installs any imported package from the
152
+ registry (verified), defeating the allowlist; with it, imports resolve
153
+ only against the reconciled `current/node_modules` (when a `resolutionRoot`
154
+ is set) and otherwise fail fast.
155
+ - `@checkstack/script-packages-backend`: `resolveResolutionRoot` /
156
+ `resolveResolutionRootFromStore` / `resolveResolutionRootForHost` decide a
157
+ host's resolution-root status (`none` / `ready` / `notReady`) from the
158
+ local `<store>/current`.
159
+ - `run_script` (integration-script-backend), the inline-script collector
160
+ (healthcheck-script-backend, core + satellite), and the in-UI `testScript`
161
+ / `testCollectorScript` endpoints all resolve the root per run and pass it
162
+ to the runner; `run_script` surfaces a clear "npm packages not ready"
163
+ error when configured-but-unsynced. Shell paths are unaffected (no module
164
+ resolution).
165
+
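The per-run directory setup described in the first bullet might be sketched as below. This is an assumption-laden illustration: `prepareRunDir` and the `run-` prefix are hypothetical, and spawning the Bun subprocess with this directory as its CWD is left out.

```typescript
import { mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Contents quoted from the changelog: disable Bun's registry auto-install.
const BUNFIG = '[install]\nauto = "disable"\n';

/**
 * Create a per-run directory (inside `resolutionRoot` when set, otherwise the
 * OS temp dir) and write a bunfig.toml that turns auto-install off.
 */
function prepareRunDir(resolutionRoot?: string): string {
  const parent = resolutionRoot ?? tmpdir();
  const runDir = mkdtempSync(join(parent, "run-"));
  writeFileSync(join(runDir, "bunfig.toml"), BUNFIG);
  return runDir;
}

const runDir = prepareRunDir();
const written = readFileSync(join(runDir, "bunfig.toml"), "utf8");
rmSync(runDir, { recursive: true, force: true }); // clean up the demo directory
```

Placing the run directory under the resolution root is what lets module resolution walk up to `<resolutionRoot>/node_modules`, while the config file keeps unlisted imports from being fetched.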
166
+ An opt-in end-to-end test (`CHECKSTACK_E2E_NETWORK=1`) proves an allowlisted
167
+ package imports successfully through the real `run_script` action execute
168
+ path, with non-network degradation tests running always.
169
+
170
+ BREAKING CHANGES: `@checkstack/backend-api`'s `defaultEsmScriptRunner` now
171
+ always disables Bun auto-install for the user subprocess. A script that
172
+ previously relied on Bun silently fetching an un-vendored package from the
173
+ registry at import time will now fail to resolve it. This is intentional -
174
+ package availability is governed by the admin allowlist - but any caller
175
+ depending on the old implicit auto-install behavior must add the package to
176
+ the allowlist instead. The new `EsmScriptRunOptions.resolutionRoot` field is
177
+ optional and additive (defaults to today's `os.tmpdir()` behavior when
178
+ unset), so the runner API itself is source-compatible.
179
+
180
+ - 270ef29: Add the per-host script-package reconciler and the runner resolution root.
181
+
182
+ - `@checkstack/backend-api`: `EsmScriptRunOptions.resolutionRoot` - when
183
+ set, the per-run temp dir is created inside it so module resolution walks
184
+ up to `<resolutionRoot>/node_modules` and user scripts can `import`
185
+ managed npm packages. Defaults to today's `os.tmpdir()` behavior when
186
+ unset (backward-compatible; isolation unchanged - the subprocess still
187
+ only sees `SAFE_ENV_VARS`).
188
+ - `@checkstack/script-packages-backend`: content-addressed cache archive
189
+ (tar+gzip per package), pure delta diff (`computeMissingBlobs`), atomic
190
+ `current` symlink swap, the host reconciler (`reconcileToHash` -
191
+ idempotent: pull only missing blobs, materialize a versioned tree via
192
+ `bun install --offline`, atomically flip `current`), the concrete fs/Bun
193
+ adapter, the central install resolver, and the `script-packages.changed`
194
+ broadcast hook. An opt-in end-to-end test
195
+ (`CHECKSTACK_E2E_NETWORK=1`) proves resolve -> publish -> cold reconcile
196
+ (no registry) -> offline materialize -> import.
197
+
198
+ - 270ef29: Secrets platform Phase 2: secret -> env-var mapping with central resolve, inject, and mask.
199
+
200
+ - Script consumers declare a least-privilege `secretEnv` allowlist
201
+ (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The automation `run_script` /
202
+ `run_shell` actions resolve ONLY the declared secrets via
203
+ `secretResolverRef.resolveForRun`, inject them into the runner env for
204
+ that run (memory-only; the ESM runner gained a per-run `env` option), and
205
+ mask their values out of stdout/stderr/result/error via the run-scoped
206
+ masking context. A missing required secret fails the run clearly. No
207
+ ambient secret access.
208
+ - Test panel: `testScript` / `testCollectorScript` inject named
209
+ `__SECRET_<NAME>__` placeholders by default, or user-supplied per-secret
210
+ overrides; real production values are never resolved in the test path,
211
+ and overrides are masked out of the result.
212
+ - Healthcheck collectors carry the `secretEnv` field for authoring +
213
+ the test panel; runtime injection on satellites lands in Phase 3.
214
+ - Editor UX: a new `@checkstack/ui` `SecretEnvEditor` renders `x-secret-env`
215
+ record fields with `${{ secrets.* }}` name autocomplete (from
216
+ `listSecretNames`), wired into the automation action editor and the
217
+ healthcheck collector editor. New `withConfigMeta` helper +
218
+ `x-secret-env` config-meta key in `@checkstack/backend-api`.
219
+
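The least-privilege `secretEnv` mapping can be sketched as a small resolver. A minimal illustration, not the real `secretResolverRef.resolveForRun`: the function name, the lookup callback, and the accepted secret-name character set are assumptions.

```typescript
// Matches exactly one `${{ secrets.NAME }}` reference; the accepted name charset is an assumption.
const SECRET_REF = /^\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}$/;

type SecretEnv = Record<string, string>;

/**
 * Resolve ONLY the secrets the consumer declared, producing the env map to
 * inject for one run. A malformed entry or a missing secret fails loudly.
 */
function resolveSecretEnv(
  secretEnv: SecretEnv,
  lookup: (name: string) => string | undefined,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [envName, template] of Object.entries(secretEnv)) {
    const match = SECRET_REF.exec(template);
    if (!match) {
      throw new Error("secretEnv." + envName + " must be a ${{ secrets.NAME }} reference");
    }
    const secretName = String(match[1]);
    const value = lookup(secretName);
    if (value === undefined) {
      throw new Error(`Missing required secret "${secretName}" for ${envName}`);
    }
    env[envName] = value;
  }
  return env;
}

// Hypothetical store holding more secrets than the script declares.
const vault: Record<string, string> = { API_TOKEN: "s3cr3t-token", UNRELATED: "never-exposed" };
const runEnv = resolveSecretEnv(
  { TOKEN: "${{ secrets.API_TOKEN }}" },
  (name) => vault[name],
);
```

Only `TOKEN` reaches the run's environment; `UNRELATED` stays in the store because the script never declared it.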
220
+ - 270ef29: Secrets platform Phase 3: just-in-time secret delivery to satellites + source-side masking, and central-execution injection for healthcheck collectors.
221
+
222
+ - New satellite WS messages `request_run_secrets` / `run_secrets`: just
223
+ before a satellite runs a collector that declares a `secretEnv`, it asks
224
+ core for that collector's resolved env; core resolves ONLY the secrets the
225
+ collector's OWN persisted assignment declares (least-privilege — the
226
+ satellite cannot choose) and replies with the env map (or a clear error).
227
+ The satellite injects it memory-only for the run and drops it on
228
+ completion. Secrets never ride the persisted assignment and never touch
229
+ disk.
230
+ - Source-side masking: the satellite runs `maskSecrets` over the collector's
231
+ stdout/stderr/result/error using the run's delivered values BEFORE the
232
+ result leaves the satellite (defense in depth).
233
+ - `CollectorStrategy.execute` gains an optional `secretEnv`. The
234
+ inline-script and shell collectors inject it into the runner
235
+ (`process.env` / `$VAR`) and mask the values out of their output.
236
+ - Healthcheck collectors running centrally (the queue executor) also resolve
237
+ and inject `secretEnv` via `secretResolverRef`, closing the gap where a
238
+ centrally-run secretEnv collector got no secrets. A missing required
239
+ secret fails the run clearly in all paths.
240
+
241
+ ### Patch Changes
242
+
243
+ - Updated dependencies [270ef29]
244
+ - Updated dependencies [270ef29]
245
+ - Updated dependencies [270ef29]
246
+ - Updated dependencies [b995afb]
247
+ - Updated dependencies [b995afb]
248
+ - Updated dependencies [270ef29]
249
+ - Updated dependencies [270ef29]
250
+ - @checkstack/healthcheck-common@1.4.0
251
+ - @checkstack/cache-api@0.3.7
252
+ - @checkstack/queue-api@0.3.7
253
+
3
254
  ## 0.18.0
4
255
 
5
256
  ### Minor Changes
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@checkstack/backend-api",
3
- "version": "0.18.0",
3
+ "version": "0.20.0",
4
4
  "license": "Elastic-2.0",
5
5
  "type": "module",
6
6
  "main": "./src/index.ts",
@@ -10,11 +10,11 @@
10
10
  "lint:code": "eslint . --max-warnings 0"
11
11
  },
12
12
  "dependencies": {
13
- "@checkstack/common": "0.11.0",
14
- "@checkstack/healthcheck-common": "1.2.0",
15
- "@checkstack/cache-api": "0.3.5",
16
- "@checkstack/queue-api": "0.3.5",
17
- "@checkstack/signal-common": "0.2.4",
13
+ "@checkstack/common": "0.12.0",
14
+ "@checkstack/healthcheck-common": "1.3.0",
15
+ "@checkstack/cache-api": "0.3.6",
16
+ "@checkstack/queue-api": "0.3.6",
17
+ "@checkstack/signal-common": "0.2.5",
18
18
  "@orpc/client": "^1.13.14",
19
19
  "@orpc/contract": "^1.13.14",
20
20
  "@orpc/openapi": "^1.13.2",
@@ -26,9 +26,11 @@
26
26
  "zod": "^4.2.1"
27
27
  },
28
28
  "devDependencies": {
29
- "@types/bun": "latest",
29
+ "@checkstack/scripts": "0.3.4",
30
30
  "@checkstack/tsconfig": "0.0.7",
31
- "@checkstack/scripts": "0.3.3"
31
+ "@types/bun": "latest",
32
+ "@types/pg": "^8.20.0",
33
+ "pg": "^8.21.0"
32
34
  },
33
35
  "peerDependencies": {
34
36
  "hono": "^4.12.14",
@@ -0,0 +1,282 @@
1
+ /**
2
+ * Integration test (real Postgres) for the advisory-lock CONNECTION-POOL
3
+ * contract — the behaviour that silently wedged production and that fakes
4
+ * cannot model: a held advisory lock keeps its connection checked out while the
5
+ * gated work runs on a *different* connection, so lock-pool / work-pool sizing
6
+ * decides whether the system makes progress or deadlocks.
7
+ *
8
+ * It pins three things against a live server:
9
+ *
10
+ * 1. REPRODUCE THE BUG: when the lock and its work share ONE pool, concurrency
11
+ * at the pool size deadlocks (every slot is a lock-holder waiting for a
12
+ * work connection that can never free up). This is a guard — if a refactor
13
+ * makes this stop deadlocking, the throughput test below is no longer
14
+ * proving anything.
15
+ * 2. THE FIX: with the lock on a DEDICATED pool, the same (and much higher)
16
+ * concurrency completes with zero failures.
17
+ * 3. CORRECTNESS ACROSS INSTANCES: independent service instances with their
18
+ * OWN pools (simulating N pods on one database) serialize a find-then-
19
+ * create on a shared key down to exactly ONE row — with a no-lock control
20
+ * proving the lock is what enforces it.
21
+ *
22
+ * Gated behind `CHECKSTACK_IT=1`; the integration CI job provides the Postgres
23
+ * service container. Connection from `CHECKSTACK_IT_PG_URL`.
24
+ */
25
+ import { afterAll, beforeAll, describe, expect, it } from "bun:test";
26
+ import { Pool } from "pg";
27
+ import { createAdvisoryLockService } from "./advisory-lock";
28
+
29
+ const PG_URL =
30
+ process.env.CHECKSTACK_IT_PG_URL ??
31
+ "postgres://postgres:postgres@localhost:5432/postgres";
32
+
33
+ const DEDUP_TABLE = "it_advisory_dedup";
34
+
35
+ describe.skipIf(!process.env.CHECKSTACK_IT)(
36
+ "advisory-lock pool contract (real Postgres)",
37
+ () => {
38
+ /** Pools created during a test; ended in afterEach-style cleanup helpers. */
39
+ const tracked: Pool[] = [];
40
+ function makePool(max: number, connectionTimeoutMillis = 5000): Pool {
41
+ const pool = new Pool({
42
+ connectionString: PG_URL,
43
+ max,
44
+ connectionTimeoutMillis,
45
+ idleTimeoutMillis: 1000,
46
+ });
47
+ // A held-lock client can error asynchronously (timeout / termination);
48
+ // swallow so it never surfaces as an unhandled error and fails the file.
49
+ pool.on("error", () => {});
50
+ tracked.push(pool);
51
+ return pool;
52
+ }
53
+ async function endTrackedPools(): Promise<void> {
54
+ await Promise.all(tracked.splice(0).map((p) => p.end().catch(() => {})));
55
+ }
56
+
57
+ let setupPool: Pool;
58
+ beforeAll(async () => {
59
+ setupPool = new Pool({ connectionString: PG_URL });
60
+ await setupPool.query(
61
+ `CREATE TABLE IF NOT EXISTS ${DEDUP_TABLE} (lock_key text NOT NULL, id text NOT NULL)`,
62
+ );
63
+ });
64
+ afterAll(async () => {
65
+ await setupPool.query(`DROP TABLE IF EXISTS ${DEDUP_TABLE}`);
66
+ await setupPool.end();
67
+ await endTrackedPools();
68
+ });
69
+
70
+ /**
71
+ * Find-then-create on `workPool`: insert exactly once per key. The 15ms gap
72
+ * between the read and the write widens the race window so an UNSERIALIZED
73
+ * run reliably double-inserts — making the lock's effect observable.
74
+ */
75
+ async function dedupCreate(workPool: Pool, key: string): Promise<boolean> {
76
+ const client = await workPool.connect();
77
+ try {
78
+ const { rows } = await client.query(
79
+ `SELECT id FROM ${DEDUP_TABLE} WHERE lock_key = $1 LIMIT 1`,
80
+ [key],
81
+ );
82
+ if (rows.length > 0) return false;
83
+ await new Promise((r) => setTimeout(r, 15));
84
+ await client.query(
85
+ `INSERT INTO ${DEDUP_TABLE} (lock_key, id) VALUES ($1, $2)`,
86
+ [key, crypto.randomUUID()],
87
+ );
88
+ return true;
89
+ } finally {
90
+ client.release();
91
+ }
92
+ }
93
+
94
+ async function countFor(key: string): Promise<number> {
95
+ const { rows } = await setupPool.query<{ n: string }>(
96
+ `SELECT count(*)::text AS n FROM ${DEDUP_TABLE} WHERE lock_key = $1`,
97
+ [key],
98
+ );
99
+ return Number(rows[0]?.n ?? "0");
100
+ }
101
+
102
+ it(
103
+ "REPRODUCES the deadlock when lock + work share one pool (the bug)",
104
+ async () => {
105
+ const POOL_MAX = 4;
106
+ // Single shared pool — the pre-fix wiring. The lock client AND the work
107
+ // client both come from here. Short connect timeout so the deadlock
108
+ // surfaces as a fast rejection rather than a long hang.
109
+ const pool = makePool(POOL_MAX, 1500);
110
+ const svc = createAdvisoryLockService(pool);
111
+ const runId = crypto.randomUUID();
112
+
113
+ // Exactly POOL_MAX concurrent ops, each on a DISTINCT key (so there is
114
+ // NO lock contention — the only thing that can stall is connection
115
+ // accounting). Each holds a lock client, then asks the same pool for a
116
+ // work client that will never come.
117
+ const results = await Promise.allSettled(
118
+ Array.from({ length: POOL_MAX }, (_, i) =>
119
+ svc.withXactLock({
120
+ key: `deadlock:${runId}:${i}`,
121
+ fn: async () => {
122
+ const c = await pool.connect();
123
+ try {
124
+ await c.query("SELECT 1");
125
+ } finally {
126
+ c.release();
127
+ }
128
+ },
129
+ }),
130
+ ),
131
+ );
132
+
133
+ const rejected = results.filter((r) => r.status === "rejected").length;
134
+ // The deadlock manifests as connection-acquire timeouts on the work
135
+ // checkout. If this ever becomes 0, the single-pool design no longer
136
+ // deadlocks and the throughput proof below must be re-examined.
137
+ expect(rejected).toBeGreaterThan(0);
138
+
139
+ await endTrackedPools();
140
+ },
141
+ 30_000,
142
+ );
143
+
144
+ it(
145
+ "does NOT deadlock under high throughput with a dedicated lock pool (the fix)",
146
+ async () => {
147
+ // Deliberately TINY pools so any deadlock would be hit immediately; the
148
+ // fix is that lock and work draw from different pools.
149
+ const lockPool = makePool(4);
150
+ const workPool = makePool(4);
151
+ const svc = createAdvisoryLockService(lockPool);
152
+ const runId = crypto.randomUUID();
153
+
154
+ const CONCURRENCY = 200;
155
+ const results = await Promise.allSettled(
156
+ Array.from({ length: CONCURRENCY }, (_, i) =>
157
+ svc.withXactLock({
158
+ key: `throughput:${runId}:${i}`,
159
+ fn: async () => {
160
+ const c = await workPool.connect();
161
+ try {
162
+ await c.query("SELECT 1");
163
+ } finally {
164
+ c.release();
165
+ }
166
+ },
167
+ }),
168
+ ),
169
+ );
170
+
171
+ const rejected = results.filter((r) => r.status === "rejected");
172
+ // Every single operation must complete: no deadlock, no timeout.
173
+ expect(rejected).toHaveLength(0);
174
+
175
+ await endTrackedPools();
176
+ },
177
+ 30_000,
178
+ );
179
+
180
+ it(
181
+ "serializes find-then-create across INSTANCES to exactly one row",
182
+ async () => {
183
+ // Each "pod" is an independent service instance with its OWN pools, all
184
+ // pointing at the same database — the real multi-instance topology. The
185
+ // advisory lock space is global to the server, so they must serialize.
186
+ const PODS = 6;
187
+ const ATTEMPTS_PER_POD = 5;
188
+ const key = `dedupe:${crypto.randomUUID()}`;
189
+
190
+ const pods = Array.from({ length: PODS }, () => {
191
+ const workPool = makePool(2);
192
+ const svc = createAdvisoryLockService(makePool(2));
193
+ return { workPool, svc };
194
+ });
195
+
196
+ const attempts = pods.flatMap((pod) =>
197
+ Array.from({ length: ATTEMPTS_PER_POD }, () =>
198
+ pod.svc.withXactLock({
199
+ key,
200
+ fn: () => dedupCreate(pod.workPool, key),
201
+ }),
202
+ ),
203
+ );
204
+
205
+ const settled = await Promise.allSettled(attempts);
206
+ const created = settled.filter(
207
+ (r) => r.status === "fulfilled" && r.value === true,
208
+ ).length;
209
+
210
+ // Exactly one attempt created the row; the rest observed it and no-oped.
211
+ expect(await countFor(key)).toBe(1);
212
+ expect(created).toBe(1);
213
+
214
+ await endTrackedPools();
215
+ },
216
+ 30_000,
217
+ );
218
+
219
+ it(
220
+ "a STALLED critical section is reaped by idle_in_transaction_session_timeout, freeing the key",
221
+ async () => {
222
+ // The lock pool sets a short idle-in-transaction timeout. A held lock
223
+ // sits "idle in transaction" for the whole time `fn` runs, so a hung
224
+ // `fn` trips it: Postgres aborts the session, auto-releasing the lock -
225
+ // proving a stall self-heals instead of stranding the key forever.
226
+ const lockPool = new Pool({
227
+ connectionString: PG_URL,
228
+ max: 4,
229
+ connectionTimeoutMillis: 5000,
230
+ idle_in_transaction_session_timeout: 1000,
231
+ });
232
+ lockPool.on("error", () => {});
233
+ tracked.push(lockPool);
234
+ const svc = createAdvisoryLockService(lockPool);
235
+ const key = `stall:${crypto.randomUUID()}`;
236
+
237
+ let releaseHang!: () => void;
238
+ const hang = new Promise<void>((r) => (releaseHang = r));
239
+
240
+ // Holder whose critical section hangs (never issues another query).
241
+ const stalled = svc
242
+ .withXactLock({ key, fn: () => hang })
243
+ .catch(() => "rejected-as-expected");
244
+
245
+ // Wait past the 1s idle timeout so the server reaps the stalled holder.
246
+ await new Promise((r) => setTimeout(r, 1800));
247
+
248
+ // The key must be acquirable again now that the stalled session was
249
+ // aborted server-side.
250
+ const t0 = Date.now();
251
+ const got = await svc.withXactLock({ key, fn: async () => "ok" });
252
+ expect(got).toBe("ok");
253
+ expect(Date.now() - t0).toBeLessThan(3000);
254
+
255
+ releaseHang();
256
+ await stalled; // let the stalled call unwind (COMMIT fails on dead conn)
257
+ await endTrackedPools();
258
+ },
259
+ 30_000,
260
+ );
261
+
262
+ it(
263
+ "CONTROL: the same workload WITHOUT the lock races into duplicates",
264
+ async () => {
265
+ // Proves the lock — not some incidental ordering — is what enforces
266
+ // single-creation above. Same widened-window find-then-create, run
267
+ // concurrently with NO advisory lock, must double-insert.
268
+ const workPool = makePool(8);
269
+ const key = `dedupe-nolock:${crypto.randomUUID()}`;
270
+
271
+ await Promise.all(
272
+ Array.from({ length: 8 }, () => dedupCreate(workPool, key)),
273
+ );
274
+
275
+ expect(await countFor(key)).toBeGreaterThan(1);
276
+
277
+ await endTrackedPools();
278
+ },
279
+ 30_000,
280
+ );
281
+ },
282
+ );