@ayepi/work 0.1.0 → 0.2.0

@@ -0,0 +1,408 @@
1
+ <!--
2
+ ayepi-work-ports.md — reference for `@ayepi/work` (ports, backends & JSON codec), written for coding agents.
3
+
4
+ Copy this file into any project that depends on `@ayepi/work` (e.g. into your repo's
5
+ `docs/` or `.claude/` directory) and reference it from your agents and slash commands.
6
+ It documents the public API, the patterns the package expects, and how it works under the
7
+ hood, with copy-pasteable examples. Keep it in sync with the installed package version.
8
+ -->
9
+
10
+ # `@ayepi/work` — ports, custom backends & JSON codec
11
+
12
+ Part of the `@ayepi/work` doc set. See [`ayepi-work.md`](./ayepi-work.md) for the core API
13
+ and [`ayepi-work-deps-schedule.md`](./ayepi-work-deps-schedule.md) for dependencies &
14
+ scheduling. All durations are **milliseconds**.
15
+
16
+ ---
17
+
18
+ ## The three ports
19
+
20
+ Everything sits on three interfaces. A `Backend` bundles them:
21
+
22
+ ```ts
23
+ interface Backend {
24
+ readonly queue: Queue
25
+ readonly pubsub: PubSub
26
+ readonly store: Store
27
+ }
28
+ ```
29
+
30
+ Supply all three to `createWork` to go distributed; supply none for the bundled in-memory
31
+ backend (zero-config).
32
+
33
+ ### `Queue` — the durable work log
34
+
35
+ At-least-once delivery with a **visibility timeout**: a popped item is invisible to other
36
+ workers until its lease elapses; the worker keeps the lease alive with `heartbeat` and
37
+ removes the item with `ack`. A worker that dies without acking lets the lease expire, and
38
+ the item is redelivered.
39
+
40
+ ```ts
41
+ interface Queue {
42
+ push(body: string, opts?: PushOptions): void | Promise<void>
43
+ pop(max: number, visibility: number): PulledWork[] | Promise<PulledWork[]> // lease ≤ max, hide each for `visibility`
44
+ heartbeat(pulled: PulledWork, visibility: number): void | Promise<void> // extend a lease
45
+ ack(pulled: PulledWork): void | Promise<void> // permanently remove (completed)
46
+ fail(pulled: PulledWork, delay?: number): void | Promise<void> // return to queue, visible after `delay`
47
+ deadLetter?(body: string, error: string): void | Promise<void> // optional dead-letter sink
48
+ }
49
+
50
+ interface PushOptions {
51
+ readonly delay?: number // delay before first visible (backoff / scheduled work)
52
+ readonly dedupeKey?: string // optional idempotency key — best-effort, not all backends dedupe
53
+ }
54
+
55
+ interface PulledWork {
56
+ readonly body: string // the JSON work envelope
57
+ readonly handle: unknown // backend-specific lease/receipt token — round-trip to heartbeat/ack/fail
58
+ readonly attempt: number // delivery attempt for this body, starting at 1
59
+ }
60
+ ```
61
+
62
+ `pop` must **reclaim** items whose lease expired (redelivery, incrementing `attempt`)
63
+ before leasing fresh visible ones. `ack`/`heartbeat`/`fail` should be **token-gated**: a
64
+ stale worker whose lease lapsed must not ack work another worker now owns.
65
+
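To make the lease contract concrete, here is a minimal single-process sketch of a token-gated visibility queue. It is an illustration under stated assumptions, not the bundled backend's code: `makeSketchQueue`, `Entry`, and `Lease` are hypothetical names, `PulledWork` is redeclared locally so the block stands alone, and `attempt` simply counts deliveries.

```typescript
// Hypothetical sketch of the Queue lease contract; not the library's implementation.
interface PulledWork {
  readonly body: string
  readonly handle: unknown
  readonly attempt: number
}

interface Entry { body: string; attempt: number; visibleAt: number; token: number | undefined }
interface Lease { entry: Entry; token: number }

function makeSketchQueue(now: () => number = Date.now) {
  let entries: Entry[] = []
  let nextToken = 1

  // A handle is valid only while its token is still the entry's current lease token.
  const owned = (pulled: PulledWork): Entry | undefined => {
    const lease = pulled.handle as Lease
    const current = lease.entry.token === lease.token && entries.indexOf(lease.entry) !== -1
    return current ? lease.entry : undefined
  }

  return {
    push(body: string, delay = 0): void {
      entries.push({ body, attempt: 0, visibleAt: now() + delay, token: undefined })
    },
    pop(max: number, visibility: number): PulledWork[] {
      const t = now()
      const due = entries.filter((e) => e.visibleAt <= t)
      // Expired leases (entries still carrying a token) are reclaimed before fresh items.
      const reclaimed = due.filter((e) => e.token !== undefined)
      const fresh = due.filter((e) => e.token === undefined)
      return [...reclaimed, ...fresh].slice(0, max).map((e) => {
        const token = nextToken++
        e.token = token // a new token invalidates every older handle
        e.attempt += 1 // simplification: counts deliveries
        e.visibleAt = t + visibility
        const lease: Lease = { entry: e, token }
        return { body: e.body, handle: lease, attempt: e.attempt }
      })
    },
    heartbeat(pulled: PulledWork, visibility: number): void {
      const e = owned(pulled)
      if (e) e.visibleAt = now() + visibility
    },
    ack(pulled: PulledWork): void {
      const e = owned(pulled)
      if (e) entries = entries.filter((x) => x !== e)
    },
    fail(pulled: PulledWork, delay = 0): void {
      const e = owned(pulled)
      if (e) {
        e.token = undefined
        e.visibleAt = now() + delay
      }
    },
    size: (): number => entries.length,
  }
}
```

Because `ack`, `heartbeat`, and `fail` all go through `owned`, a handle from a lapsed lease becomes a silent no-op once another `pop` has re-leased the entry. A distributed backend would need the same check to be atomic on the server side.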
66
+ `fail(pulled, delay)` is also the engine's **put-back primitive**: the loop uses it to
67
+ return an item it isn't ready to run (a saturated doer, an `accept` decline, or an item that
68
+ arrived **before** its `startAt`) so it becomes visible again after `delay`. A backend whose
69
+ single delay is capped (e.g. SQS) need only honor `delay` up to its own ceiling — the engine
70
+ re-checks `startAt` on the next pop and re-defers until the item is actually due (see
71
+ [Early-arrival re-defer](#early-arrival-re-defer-far-future-scheduling)).
72
+
73
+ > **State the engine guarantees won't leak.** Each leased/dispatched item's transient state is
74
+ > paired: an acquire returns its own teardown, and the runner always runs it in a `finally`, so the
75
+ > active-set entry + heartbeat are torn down on every path (success, throw, deferral, dead-letter) —
76
+ > and even if a misbehaving doer throws on handoff. The group "open-work" counter (`+1` at submit,
77
+ > `-1` at settle) is likewise **rolled back** if an enqueue fails before the item reaches the queue,
78
+ > so a failed `push` (e.g. a backend that exhausts its retries) can't leave a group stuck open and
79
+ > hang an awaiting handle. A custom backend only needs to honor the port contracts above; these
80
+ > invariants are the engine's own.
81
+
82
+ ### `PubSub` — best-effort cross-instance fanout
83
+
84
+ Identical in shape to `@ayepi/core`'s `Broker`: publish an opaque string, subscribe to
85
+ every published string. Used to wake distributed waiters and nudge gates.
86
+
87
+ ```ts
88
+ interface PubSub {
89
+ publish(message: string): void | Promise<void>
90
+ subscribe(listener: (message: string) => void): () => void // returns an unsubscribe fn
91
+ }
92
+ ```
93
+
94
+ ### `Store` — key/value with TTL + compare-and-set
95
+
96
+ `setIfNotExists` is the single atom every distributed claim relies on (dependency
97
+ fire-once, scheduler lease, group-handled claim, the waiter registry).
98
+
99
+ ```ts
100
+ interface Store {
101
+ get(key: string): string | undefined | Promise<string | undefined>
102
+ set(key: string, value: string, ttl?: number): void | Promise<void>
103
+ delete?(key: string): void | Promise<void>
104
+ // set only if absent; returns true if this caller won the slot
105
+ setIfNotExists(key: string, value: string, ttl?: number): boolean | Promise<boolean>
106
+ // atomic add (may be negative); returns the new value. Backs the group open-counter.
107
+ increment?(key: string, by: number, ttl?: number): number | Promise<number>
108
+ }
109
+ ```
110
+
111
+ > **`increment` is optional but important.** When absent, the engine falls back to a
112
+ > non-atomic get+set for the group open-work counter — **safe only on a single process**.
113
+ > Any multi-pod backend (Redis, etc.) must implement `increment` atomically, or groups can
114
+ > settle incorrectly under concurrency.
115
+
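As a rough illustration of the `Store` contract, the sketch below implements `setIfNotExists` and `increment` over a `Map` with TTLs. `makeSketchStore` and `Slot` are hypothetical names, and the atomicity here comes only from JavaScript running each method to completion on one thread. A networked backend would have to get the same guarantee from the server (on Redis, for instance, `SET` with the `NX` and `PX` options and `INCRBY` are the usual building blocks).

```typescript
// Hypothetical single-process Store sketch; not the library's memoryStore.
interface Slot { value: string; expiresAt: number }

function makeSketchStore(now: () => number = Date.now) {
  const slots = new Map<string, Slot>()

  // Return the slot only if it has not expired; drop it lazily otherwise.
  const live = (key: string): Slot | undefined => {
    const slot = slots.get(key)
    if (slot !== undefined && slot.expiresAt <= now()) {
      slots.delete(key)
      return undefined
    }
    return slot
  }
  const expiry = (ttl?: number): number => (ttl === undefined ? Infinity : now() + ttl)

  return {
    get(key: string): string | undefined {
      const slot = live(key)
      return slot === undefined ? undefined : slot.value
    },
    set(key: string, value: string, ttl?: number): void {
      slots.set(key, { value, expiresAt: expiry(ttl) })
    },
    // Check and write happen in one synchronous step, so exactly one caller wins.
    setIfNotExists(key: string, value: string, ttl?: number): boolean {
      if (live(key) !== undefined) return false
      slots.set(key, { value, expiresAt: expiry(ttl) })
      return true
    },
    // Read, add, and write in one synchronous step; a missing key counts as 0.
    increment(key: string, by: number, ttl?: number): number {
      const slot = live(key)
      const next = (slot === undefined ? 0 : Number(slot.value)) + by
      slots.set(key, { value: String(next), expiresAt: expiry(ttl) })
      return next
    },
  }
}
```

The non-atomic fallback described above corresponds to calling `get` and `set` as two separate round trips, which is where two pods can read the same counter value and lose an update.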
116
+ ## Swapping in a custom backend
117
+
118
+ The same engine runs distributed when you pass your own ports. It uses the bundled
118
+ in-memory backend only when one or more of `queue`/`pubsub`/`store` is missing, so
119
+ provide **all three** to go fully custom:
121
+
122
+ ```ts
123
+ import { createWork } from '@ayepi/work'
124
+ import type { Queue, PubSub, Store } from '@ayepi/work'
125
+
126
+ const queue: Queue = makeRedisQueue(/* ... */)
127
+ const pubsub: PubSub = makeRedisPubSub(/* ... */)
128
+ const store: Store = makeRedisStore(/* ... */) // implement setIfNotExists + increment atomically
129
+
130
+ const w = createWork({ queue, pubsub, store, work: [/* ... */] as const })
131
+ ```
132
+
133
+ Every key the engine writes is namespaced by `prefix` (default `'work:'`), so multiple
134
+ systems can share one Redis/store instance without colliding.
135
+
136
+ ---
137
+
138
+ ## The bundled in-memory backend
139
+
140
+ A zero-dependency implementation of all three ports that **simulates the distributed
141
+ protocol** (visibility-timeout leases with heartbeat-driven redelivery, TTL'd store with
142
+ atomic `setIfNotExists`/`increment`, in-process fanout). Exported as four factories:
143
+
144
+ ```ts
145
+ function memoryQueue(opts?: MemoryQueueOptions): MemoryQueue
146
+ function memoryPubSub(): PubSub
147
+ function memoryStore(opts?: MemoryOptions): Store
148
+ function memoryBackend(opts?: MemoryBackendOptions): Backend // the three together, sharing one clock
149
+
150
+ interface MemoryOptions {
151
+ readonly now?: Clock // clock injection for deterministic tests (default Date.now)
152
+ }
153
+ ```
154
+
155
+ `memoryStore` implements a real atomic `increment`, so a single shared in-memory backend is
156
+ correct for multi-instance **tests** within one process.
157
+
158
+ ### Durable (file-backed) queue
159
+
160
+ The queue can persist to a file so **pending work survives a process restart** — single-process
161
+ durability with no Redis/SQS. State is written atomically (a temp file renamed over the target)
162
+ after every mutation; a steady-state heartbeat is *not* persisted (lease expiry is reset on
163
+ reload anyway). On startup the file is reloaded and any **in-flight (leased) item is redelivered**
164
+ — the worker holding its lease is gone — with its `attempt` bumped.
165
+
166
+ ```ts
167
+ interface MemoryQueuePersistence {
168
+ readonly file?: string // persist here; omit for a pure in-memory queue
169
+ readonly fs?: QueueFsLike // injected fs (default synchronous node:fs)
170
+ readonly onError?: (err: unknown) => void // observe a corrupt-file load / failed write (best-effort)
171
+ }
172
+ interface MemoryQueueOptions extends MemoryOptions, MemoryQueuePersistence {}
173
+ interface MemoryBackendOptions extends MemoryOptions {
174
+ readonly queue?: MemoryQueuePersistence // file-back the queue; store/pubsub stay in memory
175
+ }
176
+ ```
177
+
178
+ ```ts
179
+ import { createWork, memoryBackend, defineWork } from '@ayepi/work'
180
+
+ const add = defineWork('add', (i: { a: number; b: number }, ctx) => ctx.result(i.a + i.b))
+
181
+ // a single durable worker — enqueued work outlives a crash/restart
182
+ const backend = memoryBackend({ queue: { file: './work-queue.json' } })
183
+ const work = createWork({ ...backend, work: [add] as const })
184
+ ```
185
+
186
+ Persistence is **best-effort**: a corrupt file loads as empty and a failed write is reported to
187
+ `onError` (never thrown), since the in-memory state stays authoritative for the running process.
188
+ `QueueFsLike` is a tiny synchronous fs seam (`readFile`/`writeFile`/`rename`/`mkdir`) — `node:fs`
189
+ is the default; inject your own for tests or a custom backing store. (Durability is per-process;
190
+ for multi-pod, supply distributed ports.)
191
+
192
+ ### Sharing one backend across instances (multi-pod tests)
193
+
194
+ Share one `memoryBackend()` between several `createWork` calls to model a multi-pod
195
+ deployment in a single process — work fans out, waiters resolve cross-instance, and `accept`
196
+ shards by type:
197
+
198
+ ```ts
199
+ import { createWork, memoryBackend, defineWork } from '@ayepi/work'
200
+
201
+ const backend = memoryBackend()
202
+ const add = defineWork('add', (i: { a: number; b: number }, ctx) => ctx.result(i.a + i.b))
203
+
204
+ const podA = createWork({ ...backend, work: [add] as const }) // share one backend
205
+ const podB = createWork({ ...backend, work: [add] as const }) // = two pods
206
+
207
+ const sum = await podA.enqueue(add({ a: 1, b: 2 })).result() // may run on either pod
208
+ ```
209
+
210
+ ### `MemoryQueue` test extras
211
+
212
+ `memoryQueue` returns a `MemoryQueue` (a `Queue` plus synchronous test helpers). You can
213
+ reach it via `w.backend.queue`:
214
+
215
+ ```ts
216
+ interface MemoryQueue extends Queue {
217
+ pop(max: number, visibility: number): PulledWork[] // synchronous
218
+ readonly dead: readonly DeadLettered[] // items moved to the dead-letter sink
219
+ size(): number // count still in the queue (leased or visible)
220
+ }
221
+
222
+ interface DeadLettered { readonly body: string; readonly error: string }
223
+ ```
224
+
225
+ ```ts
226
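+ import { createWork } from '@ayepi/work'
+ import type { MemoryQueue } from '@ayepi/work'
+ // assumed context: `boom` is a work type whose handler always throws; `expect` comes from your test runner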
227
+
228
+ const w = createWork({ work: [boom] as const })
229
+ await expect(w.enqueue(boom({})).result()).rejects.toThrow()
230
+ const dead = (w.backend.queue as MemoryQueue).dead
231
+ expect(dead.length).toBe(1)
232
+ ```
233
+
234
+ > The in-memory backend shares state only **within one process**. For real multi-pod
235
+ > deployments, supply distributed ports.
236
+
237
+ ---
238
+
239
+ ## The JSON codec
240
+
241
+ Work inputs, outputs, and group results cross the wire as strings. A plain `JSON.stringify`
242
+ silently drops `undefined`, throws on `BigInt`, turns a `Date` into a plain string, and
243
+ collapses `Map`/`Set` to `{}`. `defaultCodec` round-trips all of them with a tagged-wrapper
244
+ replacer/reviver.
245
+
246
+ ```ts
247
+ interface JsonCodec {
248
+ stringify(value: unknown): string
249
+ parse(text: string): unknown
250
+ }
251
+
252
+ const defaultCodec: JsonCodec
253
+ ```
254
+
255
+ `defaultCodec` tags non-JSON-native values so they survive `stringify` → `parse`:
256
+
257
+ | Value | Encoded as |
258
+ |---|---|
259
+ | `undefined` | `{ $ayepi:'undefined' }` |
260
+ | `bigint` | `{ $ayepi:'BigInt', value:'123' }` |
261
+ | `Date` | `{ $ayepi:'Date', value:<iso> }` |
262
+ | `Map` | `{ $ayepi:'Map', value:[[k,v]…] }` |
263
+ | `Set` | `{ $ayepi:'Set', value:[…] }` |
264
+ | `Error` | `{ $ayepi:'Error', value:{name,message,stack} }` |
265
+
266
+ ```ts
267
+ import { defaultCodec } from '@ayepi/work'
268
+
269
+ const s = defaultCodec.stringify({ when: new Date(0), n: 10n, tags: new Set(['a']) })
270
+ const v = defaultCodec.parse(s) // { when: Date, n: 10n, tags: Set }
271
+ ```
272
+
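For readers writing their own codec, the tagged-wrapper idea can be sketched as below for `Date` and `Set` only. This is an assumption-laden illustration, not `defaultCodec`'s source: `sketchCodec` is a hypothetical name, and only the `$ayepi` tag shape is taken from the table above.

```typescript
// Hypothetical tagged-wrapper codec covering Date and Set; not defaultCodec itself.
const TAG = '$ayepi'

const sketchCodec = {
  stringify(value: unknown): string {
    return JSON.stringify(value, function (this: Record<string, unknown>, key: string, val: unknown) {
      // Date defines toJSON, so `val` is already a string here; read the original from the holder.
      const raw = this[key]
      if (raw instanceof Date) return { [TAG]: 'Date', value: raw.toISOString() }
      if (raw instanceof Set) return { [TAG]: 'Set', value: Array.from(raw) }
      return val
    })
  },
  parse(text: string): unknown {
    // The reviver runs bottom-up, so a Set's members are revived before the Set is rebuilt.
    return JSON.parse(text, (_key, val) => {
      if (val !== null && typeof val === 'object') {
        if (val[TAG] === 'Date') return new Date(val.value)
        if (val[TAG] === 'Set') return new Set(val.value)
      }
      return val
    })
  },
}
```

One design consequence worth noting: any plain object that happens to carry the same tag key would be revived as the tagged type, so a tagging codec reserves that key.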
273
+ ### Custom codecs — global or per type
274
+
275
+ Set a global codec on `createWork` (`codec`), or a per-type codec on `defineWork`
276
+ (`WorkOptions.codec`, which wins for that type). The per-type codec is used to encode/decode
277
+ that type's input and `.result()` output; the **global** codec is always used for the
278
+ group result (the value a `ctx.result(...)` contributes, read via `.group()`).
279
+
280
+ ```ts
281
+ import { createWork, defineWork, defaultCodec } from '@ayepi/work'
282
+ import type { JsonCodec } from '@ayepi/work'
283
+
284
+ // per-type codec for a type carrying a custom class
285
+ const myCodec: JsonCodec = {
286
+ stringify: (v) => defaultCodec.stringify(/* map custom → tagged */ v),
287
+ parse: (t) => /* map tagged → custom */ defaultCodec.parse(t),
288
+ }
289
+
290
+ const job = defineWork('job', handler, { codec: myCodec })
291
+ const w = createWork({ work: [job] as const, codec: defaultCodec }) // global fallback
292
+ ```
293
+
294
+ > **Constraint:** whatever codec you use **must** round-trip every value a handler
295
+ > receives as input or contributes via `ctx.result(...)` (the group value goes through the
296
+ > global codec). A value the codec can't represent will be lost or corrupted across the queue.
297
+
298
+ ---
299
+
300
+ ## Engine mechanics deep dive
301
+
302
+ The abbreviated version lives in [`ayepi-work.md`](./ayepi-work.md#how-it-works-under-the-hood);
303
+ the full detail is here.
304
+
305
+ ### At-least-once delivery + visibility/lease + heartbeat redelivery
306
+
307
+ The `Queue` is a durable log with **at-least-once** delivery and a **visibility timeout**.
308
+ `pop(max, visibility)` leases up to `max` items, hiding each for `visibility` ms. While
309
+ running, the engine **heartbeats** the lease every `heartbeat` ms (default `visibility/3`)
310
+ via `queue.heartbeat`. A worker that dies without acking lets the lease lapse, and `pop`
311
+ **reclaims** it on the next poll (redelivery, `attempt + 1`). `ack` removes a completed
312
+ item; `fail(delay)` returns it to the queue visible again after `delay`. Lease handles are
313
+ token-gated, so a stale worker cannot ack work another worker now owns.
314
+
315
+ The worker loop asks every relevant doer (and batcher) `available()`, pulls up to that
316
+ many (capped at `POLL_BATCH_CAP = 512`), and routes each item. If a doer is saturated, the
317
+ item is `fail`ed back with a `pollInterval` delay to retry shortly or elsewhere.
318
+
319
+ ### Multi-queue fair polling
320
+
321
+ A work system can run several distinct `Queue` instances at once: the system default plus any
322
+ per-type `queue` (`WorkOptions.queue`). Each loop tick, the engine polls **every distinct
323
+ queue**, giving each a fair `ceil(n / queues)` share of the total poll budget `n`, in
324
+ round-robin order (the lead queue rotates each tick so none is consistently polled last). This
325
+ is what makes per-type queues an **isolation boundary**: a type flooding its own queue can't
326
+ starve types whose work lives on another queue — every queue is serviced each tick regardless.
327
+
328
+ The loop avoids busy-spin: it keeps pulling immediately only while some queue returns a **full**
329
+ share (more likely waiting) *and* it actually started work that round; it sleeps `pollInterval`
330
+ when a full round started nothing (only over-capacity or not-yet-due work was available).
331
+
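The share-and-rotate rule can be sketched as a small planning function. `planTick` and `PollSlot` are hypothetical names, queues are identified by index, and the sketch only covers the arithmetic described above, not the engine's actual loop.

```typescript
// Hypothetical sketch of one polling round: equal shares, rotating lead queue.
interface PollSlot { queue: number; share: number }

function planTick(queueCount: number, budget: number, tick: number): PollSlot[] {
  if (queueCount <= 0) return []
  const share = Math.ceil(budget / queueCount) // every queue gets the same share
  const lead = tick % queueCount // the starting queue advances each tick
  const plan: PollSlot[] = []
  for (let i = 0; i < queueCount; i++) {
    plan.push({ queue: (lead + i) % queueCount, share })
  }
  return plan
}

// Three queues and a budget of 10: each queue is asked for ceil(10 / 3) = 4 items.
const tick0 = planTick(3, 10, 0).map((s) => s.queue) // [0, 1, 2]
const tick1 = planTick(3, 10, 1).map((s) => s.queue) // [1, 2, 0]
```

Because of the rounding up, the shares can add up to slightly more than the budget (12 against 10 here). How the engine treats that overshoot is not stated in this document.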
332
+ ### Early-arrival re-defer (far-future scheduling)
333
+
334
+ When the engine pops an item, it first re-checks the item's `startAt`. If the item is still
335
+ more than `SCHED_TOLERANCE = 1000` ms before its `startAt` — i.e. a backend that couldn't
336
+ honor a long single delay handed it back early — the engine **puts it back** with
337
+ `queue.fail(p, startAt - now)` instead of running it, and tries again later. This repeats
338
+ (each round-trip waits at most the backend's delay ceiling) until the item is finally due.
339
+
340
+ This is what makes **far-future scheduling correct** on delay-capping backends. `runAt`
341
+ (absolute schedule) and a handler-thrown `WorkDelayError` deferral both resolve to a
342
+ `startAt`; on a backend like SQS (which caps a single delay at 15 min and a visibility at
343
+ 12 h) a far-future item simply **bounces** — received early, re-deferred — every cap-length
344
+ interval until due. A deferral (`WorkDelayError`) re-enqueues at the resolved `startAt`
345
+ **without advancing `attempt`** and emits a `deferred` event; the early-arrival put-back is a
346
+ plain `fail` (no event, no attempt change).
347
+
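The bounce behaviour is easy to check with a small simulation. The sketch below assumes an idealized backend that redelivers an item exactly when its (capped) delay elapses. `onPop` and `simulateBounces` are hypothetical names, and only `SCHED_TOLERANCE` and the `startAt - now` put-back delay come from the text above.

```typescript
// Hypothetical simulation of the early-arrival re-defer on a delay-capping backend.
const SCHED_TOLERANCE = 1000 // ms, as listed in the engine constants

type PopDecision = { action: 'run' } | { action: 'putBack'; delay: number }

// Run the item unless it arrived more than SCHED_TOLERANCE ms before startAt.
function onPop(startAt: number, now: number): PopDecision {
  return startAt - now > SCHED_TOLERANCE
    ? { action: 'putBack', delay: startAt - now }
    : { action: 'run' }
}

// Count how many times an item is put back before it finally runs.
function simulateBounces(startAt: number, now: number, delayCap: number): { bounces: number; ranAt: number } {
  if (delayCap <= 0) throw new Error('delayCap must be positive')
  let t = now
  let bounces = 0
  for (;;) {
    const decision = onPop(startAt, t)
    if (decision.action === 'run') return { bounces, ranAt: t }
    t += Math.min(decision.delay, delayCap) // the backend honors at most its own ceiling
    bounces += 1
  }
}

// An item due in 1 h on a backend capped at 15 min is put back four times.
const oneHour = simulateBounces(3_600_000, 0, 900_000) // { bounces: 4, ranAt: 3_600_000 }
```

With an uncapped backend the same item is put back at most once, since the first `fail` delay already lands on `startAt`.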
348
+ ### Group open-counter + group-done
349
+
350
+ Each group keeps an integer **open-work counter** at `group:<id>:open`, bumped `+1` when an
351
+ item is queued and `-1` when it settles (via `Store.increment`, falling back to a
352
+ non-atomic get+set when the store lacks `increment` — single-process only). When the
353
+ counter hits `0`, the group is **done**: a `group:<id>:done` flag is set, a `group-done`
354
+ message is published, the `group-done` event fires, and the orphan check is scheduled.
355
+ Because children are queued (incrementing the counter) **before** the parent settles, the
356
+ group only completes once every descendant has settled.
357
+
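The ordering argument can be seen in a toy counter. This is a deliberately simplified, in-process sketch with hypothetical names (`makeGroupCounter`, `queued`, `settled`); the real counter lives in the `Store` and is updated via `increment`.

```typescript
// Hypothetical in-process model of the group open-work counter.
function makeGroupCounter() {
  let open = 0
  let done = false
  return {
    // +1 when an item is queued into the group
    queued(): void {
      open += 1
    },
    // -1 when an item settles; returns true once the group is done
    settled(): boolean {
      open -= 1
      if (open === 0) done = true
      return done
    },
    isDone: (): boolean => done,
  }
}

const group = makeGroupCounter()
group.queued() // parent submitted: open = 1
group.queued() // child A queued while the parent is still running: open = 2
group.queued() // child B queued: open = 3
const afterParent = group.settled() // parent settles: open = 2, not done
const afterChildA = group.settled() // open = 1, not done
const afterChildB = group.settled() // open = 0, group done
```

If the parent settled before its children were queued, the counter would touch `0` after the first `settled()` and the group would be marked done too early, which is why the increment must come first.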
358
+ ### Distributed wait (PubSub + Store poll)
359
+
360
+ A `WorkHandle`'s `.result()` / `.group()` registers a "someone is waiting" key
361
+ (`wait:<groupId>`) and then races two signals: a `PubSub` subscription (the engine
362
+ publishes `{ kind: 'done', id }` on item completion and `{ kind: 'group-done', groupId }`
363
+ on group completion) **and** a store poll every `WAIT_POLL = 250` ms. Either one triggers a
364
+ re-read of `result:<id>` / `group:<groupId>:result` from the store. This works
365
+ cross-instance: a waiter on pod A resolves when pod B finishes the work, since results live
366
+ in the shared store and pub/sub fans out across pods.
367
+
368
+ ### `setIfNotExists` idempotency (claims & leases)
369
+
370
+ Every "exactly once across the fleet" concern is one `Store.setIfNotExists` (a
371
+ compare-and-set):
372
+
373
+ - **Dependency fire-once** — `ctx.claim('dep:<key>:fired')` ensures dependents are queued
374
+ once even under redelivery.
375
+ - **Scheduler lease** — `sched:<name>:<second-bucket>` ensures one instance fires per cron
376
+ occurrence.
377
+ - **Group-handled (orphan)** — `group-handled:<groupId>` ensures the
378
+ `unhandledWorkGroup` hook fires at most once.
379
+
380
+ ### Backoff math
381
+
382
+ A failed attempt that will retry sleeps `backoff(attempt, retry, random)` =
383
+ `min(base · factor^(attempt-1), max) · (1 − jitter · random())`, then re-enters the queue
384
+ with `attempt + 1` and a recomputed `startAt`.
385
+
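The formula translates directly into code. In the sketch below the `RetrySketch` shape is an assumption: its field names are taken from the symbols in the formula, and the package's actual retry option type may differ.

```typescript
// Hypothetical retry shape; field names mirror the symbols in the formula above.
interface RetrySketch {
  base: number // first delay in ms
  factor: number // exponential growth factor
  max: number // ceiling in ms, applied before jitter
  jitter: number // fraction of the delay that jitter may remove (0..1)
}

// min(base * factor^(attempt - 1), max) * (1 - jitter * random())
function backoff(attempt: number, retry: RetrySketch, random: () => number = Math.random): number {
  const capped = Math.min(retry.base * Math.pow(retry.factor, attempt - 1), retry.max)
  return capped * (1 - retry.jitter * random())
}

const retry: RetrySketch = { base: 100, factor: 2, max: 1000, jitter: 0.5 }
const third = backoff(3, retry, () => 0) // 100 * 2^2 = 400, no jitter removed
const cappedDelay = backoff(5, retry, () => 0) // 1600 capped to 1000
const jittered = backoff(2, retry, () => 0.5) // 200 * (1 - 0.25) = 150
```

Note that jitter only ever shortens the delay, so `max` is a true upper bound on the sleep.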
386
+ ### `unhandledWorkGroup` orphan hook
387
+
388
+ When a group settles, the engine waits `UNHANDLED_GRACE = 100` ms (so an in-process awaiter
389
+ can register its `wait:` key first), claims `group-handled:<groupId>` via `setIfNotExists`,
390
+ and — if no `wait:<groupId>` key exists — calls `unhandledWorkGroup({ groupId, lastResult,
391
+ states })` at most once.
392
+
393
+ ### Tunable constants (engine internals, not configurable)
394
+
395
+ `POLL_BATCH_CAP = 512`, `RESULT_TTL = 86_400_000` (24 h, for results/states/group keys),
396
+ `WAIT_TTL = 3_600_000` (1 h, for the wait registry), `WAIT_POLL = 250`,
397
+ `UNHANDLED_GRACE = 100`, `UNKNOWN_TYPE_DELAY = 5000` (redelivery delay for an unknown type,
398
+ in case another instance knows it), `SCHED_TOLERANCE = 1000` (a popped item this far before
399
+ its `startAt` is put back rather than run — drives the early-arrival re-defer),
400
+ `SCHED_TICK = 1000`, `SCHED_LEASE_TTL = 90_000`,
401
+ `STOP_DRAIN = 5000` (max `stop()` drain wait), `DEP_RETRY_ATTEMPTS = 1` (a dependency
402
+ dead-letters on timeout rather than retrying).
403
+
404
+ ---
405
+
406
+ See also: [`ayepi-work.md`](./ayepi-work.md) (core API, "how it works under the hood",
407
+ gotchas) and [`ayepi-work-deps-schedule.md`](./ayepi-work-deps-schedule.md) (dependencies &
408
+ scheduling).