@bluelibs/runner 6.3.0 → 6.3.1

@@ -0,0 +1,2270 @@
1
+ # Durable Workflows (Node-only) — Architecture v2
2
+
3
+ ← [Back to main README](../README.md)
4
+
5
+ ---
6
+
7
+ > Durable workflows are Runner tasks with "save points". If your process dies, deploys, or scales horizontally, the workflow comes back and continues like nothing happened (except now you can finally sleep at night).
8
+
9
+ ## Table of Contents
10
+
11
+ - [Start Here](#start-here)
12
+ - [Quickstart](#quickstart)
13
+ - [Tagging Workflows for Discovery](#tagging-workflows-for-discovery-required)
14
+ - [Why You'd Want This (In One Minute)](#why-youd-want-this-in-one-minute)
15
+ - [Core Insight](#core-insight)
16
+ - [Abstract Interfaces](#abstract-interfaces)
17
+ - [API Design](#api-design)
18
+ - [Safety & Semantics](#safety--semantics)
19
+ - [Signals (wait for external events)](#signals-wait-for-external-events)
20
+ - [Testing Utilities](#testing-utilities)
21
+ - [Compensation / Rollback Pattern](#compensation--rollback-pattern)
22
+ - [Branching with durableContext.switch()](#branching-with-durablecontextswitch)
23
+ - [Describing a Flow (Static Shape Export)](#describing-a-flow-static-shape-export)
24
+ - [Scheduling & Cron Jobs](#scheduling--cron-jobs)
25
+ - [Gotchas & Troubleshooting](#gotchas--troubleshooting)
26
+
27
+ ## Start Here
28
+
29
+ - If you want the short version: `readmes/DURABLE_WORKFLOWS_AI.md`
30
+ - If you're new to Runner concepts (tasks/resources/events/middleware): `readmes/COMPACT_GUIDE.md`
31
+ - Platform note (why this is Node-only): `readmes/MULTI_PLATFORM.md`
32
+
33
+ ## Quickstart
34
+
35
+ ### 0) Create durable support + a durable backend
36
+
37
+ The recommended integration is:
38
+
39
+ - register `resources.durable` once for durable tags/events support
40
+ - fork a concrete durable backend (`resources.memoryWorkflow` / `resources.redisWorkflow`)
41
+
42
+ The concrete durable backend:
43
+
44
+ - Executes Runner tasks via DI (`taskRunner.run(...)`).
45
+ - Provides a **per-resource** durable context, accessed via `durable.use()`.
46
+ - Optionally embeds a worker (`worker: true`) to consume the queue in that process.
47
+
48
+ ### 1) Define a durable task (steps + sleep + signal)
49
+
50
+ ```ts
51
+ import { r, run } from "@bluelibs/runner";
52
+ import { resources, tags } from "@bluelibs/runner/node";
53
+
54
+ const Approved = r.event<{ approvedBy: string }>("approved").build();
55
+
56
+ const durable = resources.memoryWorkflow.fork("app-durable");
57
+
58
+ const durableRegistration = durable.with({
59
+ worker: true, // single-process dev/tests
60
+ });
61
+
62
+ const approveOrder = r
63
+ .task("approve-order")
64
+ .dependencies({ durable })
65
+ .tags([tags.durableWorkflow.with({ category: "orders" })])
66
+ .run(async (input: { orderId: string }, { durable }) => {
67
+ const durableContext = durable.use();
68
+
69
+ await durableContext.step("validate", async () => {
70
+ // fetch order, validate invariants, etc.
71
+ return { ok: true };
72
+ });
73
+
74
+ const outcome = await durableContext.waitForSignal(Approved, {
75
+ timeoutMs: 86_400_000,
76
+ });
77
+ if (outcome.kind === "timeout") {
78
+ return { status: "timed_out" };
79
+ }
80
+
81
+ await durableContext.step("ship", async () => {
82
+ // ship only after approval
83
+ return { shipped: true };
84
+ });
85
+
86
+ return {
87
+ status: "approved",
88
+ approvedBy: outcome.payload.approvedBy,
89
+ };
90
+ })
91
+ .build();
92
+
93
+ const app = r
94
+ .resource("app")
95
+ .register([resources.durable, durableRegistration, approveOrder])
96
+ .build();
97
+
98
+ await run(app, { logs: { printThreshold: null } });
99
+ ```
100
+
101
+ ## Tagging Workflows for Discovery (Required)
102
+
103
+ Durable workflows are regular Runner tasks, but **must be tagged with `tags.durableWorkflow`**
104
+ to make them discoverable at runtime. Always add this tag to your workflow tasks:
105
+
106
+ ```ts
107
+ import { r } from "@bluelibs/runner";
108
+ import { resources, tags } from "@bluelibs/runner/node";
109
+
110
+ const durable = resources.memoryWorkflow.fork("app-durable");
111
+
112
+ const onboarding = r
113
+ .task("onboarding")
114
+ .dependencies({ durable })
115
+ .tags([
116
+ tags.durableWorkflow.with({
117
+ category: "users",
118
+ defaults: { invitedBy: "system" },
119
+ }),
120
+ ])
121
+ .run(async (_input, { durable }) => {
122
+ const durableContext = durable.use();
123
+ await durableContext.step("create-user", async () => ({ ok: true }));
124
+ return { ok: true };
125
+ })
126
+ .build();
127
+
128
+ // later, after run(...)
129
+ // const durableRuntime = runtime.getResourceValue(durable);
130
+ // const workflows = durableRuntime.getWorkflows();
131
+ ```
132
+
133
+ `tags.durableWorkflow` is **required** — workflows without this tag will not be discoverable
134
+ via `getWorkflows()`. Register `resources.durable` once in the app so the durable tag
135
+ definition and durable events are available at runtime.
136
+
137
+ `tags.durableWorkflow` is discovery metadata only. The unified response envelope
138
+ is produced by `durable.startAndWait(...)`:
139
+ `{ durable: { executionId }, data }`.
140
+
141
+ `tags.durableWorkflow` also supports optional `defaults` used by
142
+ `durable.describe(task)` **only when no explicit describe input is provided**.
143
+ This does not affect `start()`, `startAndWait()`, `schedule()`, or `ensureSchedule()`.
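The precedence rule is simply "explicit describe input wins; `defaults` fill the gap only when nothing is provided". As a tiny sketch of that rule (a hypothetical helper, not the library's `describe` implementation):

```typescript
// Explicit describe input wins outright; tag defaults apply only when no
// explicit input was provided at all (illustrative helper, not library code).
function resolveDescribeInput<T>(
  defaults: T | undefined,
  explicit: T | undefined,
): T | undefined {
  return explicit !== undefined ? explicit : defaults;
}

const usedDefaults = resolveDescribeInput({ invitedBy: "system" }, undefined);
const usedExplicit = resolveDescribeInput(
  { invitedBy: "system" },
  { invitedBy: "alice" },
);
```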
144
+
145
+ ### Starting Durable Workflows From Resource Dependencies (HTTP route)
146
+
147
+ The tag makes workflow tasks discoverable; it does not execute anything. Execution is explicit:
148
+ start with `durable.start(...)` (fire-and-track) or
149
+ `durable.startAndWait(...)` (start-and-wait).
150
+
151
+ ```ts
152
+ import express from "express";
153
+ import { r, run } from "@bluelibs/runner";
154
+ import { resources, tags } from "@bluelibs/runner/node";
155
+
156
+ const durable = resources.memoryWorkflow.fork("app-durable");
157
+
158
+ const approveOrder = r
159
+ .task("approve-order")
160
+ .dependencies({ durable })
161
+ .tags([tags.durableWorkflow.with({ category: "orders" })])
162
+ .run(async (input: { orderId: string }, { durable }) => {
163
+ const durableContext = durable.use();
164
+ await durableContext.step("approve", async () => ({ approved: true }));
165
+ return { orderId: input.orderId, status: "approved" as const };
166
+ })
167
+ .build();
168
+
169
+ const api = r
170
+ .resource("api")
171
+ .register([resources.durable, durable.with({ worker: false }), approveOrder])
172
+ .dependencies({ durable, approveOrder })
173
+ .init(async (_cfg, { durable, approveOrder }) => {
174
+ const app = express();
175
+ app.use(express.json());
176
+
177
+ app.post("/orders/:id/approve", async (req, res) => {
178
+ const executionId = await durable.start(approveOrder, {
179
+ orderId: req.params.id,
180
+ });
181
+
182
+ res.status(202).json({ executionId });
183
+ });
184
+
185
+ app.listen(3000);
186
+ })
187
+ .build();
188
+
189
+ await run(api);
190
+ ```
191
+
192
+ ### Production wiring (Redis + RabbitMQ)
193
+
194
+ For production, swap the in-memory backends:
195
+
196
+ ```ts
197
+ import { resources } from "@bluelibs/runner/node";
198
+
199
+ const durable = resources.redisWorkflow.fork("app-durable");
200
+
201
+ const durableRegistration = durable.with({
202
+ redis: { url: process.env.REDIS_URL! },
203
+ queue: { url: process.env.RABBITMQ_URL! },
204
+ worker: true,
205
+ });
206
+ ```
207
+
208
+ Isolation note: `resources.redisWorkflow` derives Redis key prefixes, pub/sub prefixes, and default queue names from the durable resource id (the value you pass to `.fork("...")`). Use different ids (or set `{ namespace }`) to run multiple durable "apps" safely on the same Redis/RabbitMQ.
209
+
210
+ API nodes typically **disable polling and the embedded worker**:
211
+
212
+ ```ts
213
+ const durable = resources.redisWorkflow.fork("app-durable");
214
+ const durableRegistration = durable.with({
215
+ redis: { url: process.env.REDIS_URL! },
216
+ queue: { url: process.env.RABBITMQ_URL! },
217
+ worker: false,
218
+ polling: { enabled: false },
219
+ });
220
+ ```
221
+
222
+ In a typical deployment:
223
+
224
+ - API nodes call `start()` / `signal()` / `wait()`.
225
+ - Worker nodes run the durable resource with `worker: true`.
226
+
227
+ ### Scaling in production (recommended topology)
228
+
229
+ Durable workflows are designed to scale **horizontally**.
230
+ The core idea is: **the store is the source of truth**, and the queue distributes work.
231
+
232
+ **Recommended split:**
233
+
234
+ - **API nodes** (stateless): accept HTTP/webhooks, call `start()` / `signal()` / `wait()`.
235
+ - **Worker nodes** (scalable): consume the durable queue and run executions.
236
+
237
+ **API node config (no background work):**
238
+
239
+ ```ts
240
+ const durable = resources.redisWorkflow.fork("app-durable");
241
+ const durableRegistration = durable.with({
242
+ redis: { url: process.env.REDIS_URL! },
243
+ queue: { url: process.env.RABBITMQ_URL! },
244
+ worker: false,
245
+ polling: { enabled: false },
246
+ });
247
+ ```
248
+
249
+ **Worker node config (does background work):**
250
+
251
+ ```ts
252
+ const durable = resources.redisWorkflow.fork("app-durable");
253
+ const durableRegistration = durable.with({
254
+ redis: { url: process.env.REDIS_URL! },
255
+ queue: { url: process.env.RABBITMQ_URL! },
256
+ worker: true,
257
+ polling: { enabled: true, interval: 1000 },
258
+ });
259
+ ```
260
+
261
+ **How it scales:**
262
+
263
+ - Increase worker replicas: each one consumes from the queue, so throughput scales with workers.
264
+ - Crash/redeploy safety: a worker can die at any time; the next worker resumes from the last checkpoint.
265
+ - Multi-worker correctness: executions/steps are coordinated through the store, not through in-memory state.
266
+
267
+ **Timers, sleeps, and schedules (important):**
268
+
269
+ Timers (used by `durableContext.sleep(...)`, signal timeouts, and scheduling) are driven by the durable polling loop.
270
+ In multi-process setups you typically either:
271
+
272
+ - run a **single poller** (one worker replica with `polling.enabled: true`), or
273
+ - use a store implementation that provides **atomic timer claiming** so multiple pollers are safe.
274
+
275
+ If you enable polling in multiple processes without atomic claiming, you may get duplicate resume attempts.
276
+ This is still designed to be safe (at-least-once), but it can increase load/noise.
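The claiming idea behind `claimTimer` can be sketched as an in-memory compare-and-set with a TTL. This is an illustration only; a production store such as `RedisStore` must perform the check-and-set atomically (e.g. server-side), not in application code:

```typescript
// Sketch of claimTimer(timerId, workerId, ttlMs): the first worker to claim a
// timer wins; other workers are rejected until the claim's TTL expires.
interface Claim {
  workerId: string;
  expiresAt: number;
}
const claims = new Map<string, Claim>();

function claimTimer(
  timerId: string,
  workerId: string,
  ttlMs: number,
  now: number = Date.now(),
): boolean {
  const existing = claims.get(timerId);
  if (existing && existing.expiresAt > now && existing.workerId !== workerId) {
    return false; // another worker holds a live claim on this timer
  }
  claims.set(timerId, { workerId, expiresAt: now + ttlMs });
  return true;
}

const t0 = Date.now();
const a = claimTimer("timer-1", "worker-A", 5_000, t0); // first claim wins
const b = claimTimer("timer-1", "worker-B", 5_000, t0); // rejected: still live
const c = claimTimer("timer-1", "worker-B", 5_000, t0 + 6_000); // claim expired
```

With claiming in place, multiple pollers can safely call `getReadyTimers()` concurrently: only the worker whose claim succeeds resumes the execution.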
277
+
278
+ ### 2) Start an execution (store the executionId)
279
+
280
+ ```ts
281
+ // d = runtime.getResourceValue(durable) — the durable runtime obtained after run(app)
+ const executionId = await d.start(approveOrder, {
282
+ orderId: "order-123",
283
+ });
284
+ // store executionId on the order record so your webhook can resume the workflow later
285
+ ```
286
+
287
+ ### Reading status later (no double-sync required)
288
+
289
+ If you store the `executionId` in your main database (e.g. `orders.durable_execution_id`), you can fetch live workflow status on-demand from the durable store.
290
+ This avoids mirroring every durable transition into Postgres.
291
+
292
+ ```ts
293
+ import { DurableOperator, RedisStore } from "@bluelibs/runner/node";
294
+
295
+ const durableStorePrefix = process.env.DURABLE_STORE_PREFIX!; // same value used by your durable runtime config
296
+
297
+ // Read-only store client for status lookups (same redis url + prefix)
298
+ const store = new RedisStore({
299
+ redis: process.env.REDIS_URL!,
300
+ prefix: durableStorePrefix,
301
+ });
302
+
303
+ // Minimal: just the execution row (status/result/error)
304
+ const execution = await store.getExecution(executionId);
305
+
306
+ // Rich: execution + steps + audit (dashboard-like view)
307
+ const operator = new DurableOperator(store);
308
+ const detail = await operator.getExecutionDetail(executionId);
309
+ ```
310
+
311
+ Keep the durable store prefix in one shared config module and reuse it for both workflow runtime wiring and read-only status lookups.
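One way to keep that in a single place is a small shared config module. The module name and fallback value below are illustrative, not part of the library:

```typescript
// durable-config.ts (hypothetical) — single source of truth for the durable
// store prefix, imported by both runtime wiring and read-only status lookups.
function resolveDurablePrefix(
  env: Record<string, string | undefined>,
): string {
  // Fall back to a stable default so dev environments work without extra setup.
  return env.DURABLE_STORE_PREFIX ?? "app-durable";
}

const fromEnv = resolveDurablePrefix({ DURABLE_STORE_PREFIX: "prod-durable" });
const fallback = resolveDurablePrefix({});
```

At call sites you would pass `process.env`, so the workflow runtime and the read-only `RedisStore` client can never drift apart.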
312
+
313
+ If you already have the durable resource instance (dependency injection), you can use the operator API directly:
314
+
315
+ ```ts
316
+ const detail = await durable.operator.getExecutionDetail(executionId);
317
+ ```
318
+
319
+ ### 3) Resume from the outside (webhook / callback)
320
+
321
+ ```ts
322
+ // d = runtime.getResourceValue(durable)
+ await d.signal(executionId, Approved, { approvedBy: "admin@company.com" });
323
+ const result = await d.wait(executionId, { timeout: 30_000 });
324
+ ```
325
+
326
+ ## Why You'd Want This (In One Minute)
327
+
328
+ - Your workflow needs to span time: minutes, hours, days (payments, shipping, approvals).
329
+ - You want deterministic retries without duplicating side-effects (charge twice, email twice, etc.).
330
+ - You want horizontal scaling without "who owns this in-memory timeout?" problems.
331
+ - You want explicit, type-safe "outside world pokes the workflow" via signals.
332
+
333
+ ## Core Insight
334
+
335
+ The key insight (Temporal/Inngest-style) is that workflows are just functions with checkpoints. We provide a `DurableContext` that gives tasks:
336
+
337
+ 1. **`step(id, fn)`** - Execute a function once, cache the result, return cached on replay
338
+ 2. **`sleep(ms)`** - Durable sleep that survives process restarts
339
+ 3. **`emit(event, data)`** - Publish a best-effort notification, de-duplicated via `step()` (not guaranteed delivery)
340
+ 4. **`waitForSignal(signal)`** - Suspend until an external signal is delivered (e.g. payment confirmation)
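The replay mechanics behind `step()` can be shown with a deliberately simplified, synchronous sketch. A `Map` stands in for the durable store here; this is not the library's implementation, just the memoization idea:

```typescript
// Minimal illustration of step memoization: run fn once, replay from the
// cache afterwards. A Map stands in for the durable store.
type StepCache = Map<string, unknown>;

function step<T>(cache: StepCache, stepId: string, fn: () => T): T {
  if (cache.has(stepId)) {
    // Replay: return the memoized result without re-running the side effect.
    return cache.get(stepId) as T;
  }
  const result = fn();
  cache.set(stepId, result); // checkpoint before moving on
  return result;
}

// Simulate a crash/replay: same cache, same step id — fn runs only once.
const cache: StepCache = new Map();
let charges = 0;
const first = step(cache, "charge-payment", () => ({ chargeId: ++charges }));
const replayed = step(cache, "charge-payment", () => ({ chargeId: ++charges }));
```

In the real system the cache is the store, so a replay after a process crash behaves exactly like the second call above: the payment is not charged twice.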
341
+
342
+ **Scalability Model:** Multiple worker instances can process executions concurrently. Work is distributed via a durable queue (RabbitMQ quorum queues by default), with state stored in Redis.
343
+
344
+ ```mermaid
345
+ graph TB
346
+ subgraph Clients
347
+ C1[Client 1]
348
+ C2[Client 2]
349
+ end
350
+
351
+ subgraph DurableInfra[Durable Infrastructure]
352
+ Q[(RabbitMQ - Quorum Queue)]
353
+ R[(Redis - State/PubSub)]
354
+ end
355
+
356
+ subgraph Workers[Scalable Workers]
357
+ W1[Worker 1]
358
+ W2[Worker 2]
359
+ W3[Worker N]
360
+ end
361
+
362
+ C1 -->|enqueue| Q
363
+ C2 -->|enqueue| Q
364
+
365
+ Q -->|consume| W1
366
+ Q -->|consume| W2
367
+ Q -->|consume| W3
368
+
369
+ W1 <-->|state| R
370
+ W2 <-->|state| R
371
+ W3 <-->|state| R
372
+
373
+ R -.->|pub/sub| W1
374
+ R -.->|pub/sub| W2
375
+ R -.->|pub/sub| W3
376
+ ```
377
+
378
+ ---
379
+
380
+ ## Abstract Interfaces
381
+
382
+ Three pluggable interfaces allow swapping backends without changing application code:
383
+
384
+ ### 1. IDurableStore - State Storage
385
+
386
+ The **Store** is the absolute source of truth. It persists execution state, step results, timers, and schedules. If it's not in the store, it didn't happen.
387
+
388
+ ```typescript
389
+ // interfaces/IDurableStore.ts
390
+
391
+ export interface IDurableStore {
392
+ // Executions (The primary workflow records)
393
+ saveExecution(execution: Execution): Promise<void>;
394
+ getExecution(id: string): Promise<Execution | null>;
395
+ updateExecution(id: string, updates: Partial<Execution>): Promise<void>;
396
+ listIncompleteExecutions(): Promise<Execution[]>;
397
+
398
+ // Steps (Memoized results for exactly-once-ish semantics)
399
+ getStepResult(
400
+ executionId: string,
401
+ stepId: string,
402
+ ): Promise<StepResult | null>;
403
+ saveStepResult(result: StepResult): Promise<void>;
404
+
405
+ // Timers (Drives sleep(), signal timeouts, and cron)
406
+ createTimer(timer: Timer): Promise<void>;
407
+ getReadyTimers(now?: Date): Promise<Timer[]>;
408
+ markTimerFired(timerId: string): Promise<void>;
409
+ deleteTimer(timerId: string): Promise<void>;
410
+
411
+ // Schedules (Cron and Interval orchestration)
412
+ createSchedule(schedule: Schedule): Promise<void>;
413
+ getSchedule(id: string): Promise<Schedule | null>;
414
+ updateSchedule(id: string, updates: Partial<Schedule>): Promise<void>;
415
+ deleteSchedule(id: string): Promise<void>;
416
+ listSchedules(): Promise<Schedule[]>;
417
+ listActiveSchedules(): Promise<Schedule[]>;
418
+
419
+ // Optional: Distributed Timer Coordination
420
+ claimTimer?(
421
+ timerId: string,
422
+ workerId: string,
423
+ ttlMs: number,
424
+ ): Promise<boolean>;
425
+
426
+ // Optional: Idempotency (dedupe start calls)
427
+ getExecutionIdByIdempotencyKey?(params: {
428
+ taskId: string;
429
+ idempotencyKey: string;
430
+ }): Promise<string | null>;
431
+ setExecutionIdByIdempotencyKey?(params: {
432
+ taskId: string;
433
+ idempotencyKey: string;
434
+ executionId: string;
435
+ }): Promise<boolean>;
436
+
437
+ // Optional: Dashboard & Operator API
438
+ listExecutions?(options?: ListExecutionsOptions): Promise<Execution[]>;
439
+ listStepResults?(executionId: string): Promise<StepResult[]>;
440
+ retryRollback?(executionId: string): Promise<void>;
441
+ skipStep?(executionId: string, stepId: string): Promise<void>;
442
+ forceFail?(
443
+ executionId: string,
444
+ error: { message: string; stack?: string },
445
+ ): Promise<void>;
446
+ editStepResult?(
447
+ executionId: string,
448
+ stepId: string,
449
+ newResult: unknown,
450
+ ): Promise<void>;
451
+
452
+ // Lifecycle
453
+ init?(): Promise<void>;
454
+ dispose?(): Promise<void>;
455
+
456
+ // Optional: Locking (if store handles its own concurrency)
457
+ acquireLock?(resource: string, ttlMs: number): Promise<string | null>;
458
+ releaseLock?(resource: string, lockId: string): Promise<void>;
459
+ }
460
+ ```
461
+
462
+ **Implementations:**
463
+
464
+ - `MemoryStore` - Dev/test, no persistence
465
+ - `RedisStore` - Production default, distributed locking
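The optional idempotency hooks are easiest to understand as a set-if-absent mapping from `(taskId, idempotencyKey)` to an `executionId`. An in-memory sketch of that contract (not the library code):

```typescript
// Sketch of the idempotency-key contract: the first start() for a given
// (taskId, idempotencyKey) records its executionId; later attempts see the
// recorded id instead of creating a duplicate execution.
const byKey = new Map<string, string>();

function keyOf(taskId: string, idempotencyKey: string): string {
  return `${taskId}::${idempotencyKey}`;
}

function getExecutionIdByIdempotencyKey(
  taskId: string,
  idempotencyKey: string,
): string | null {
  return byKey.get(keyOf(taskId, idempotencyKey)) ?? null;
}

function setExecutionIdByIdempotencyKey(
  taskId: string,
  idempotencyKey: string,
  executionId: string,
): boolean {
  const k = keyOf(taskId, idempotencyKey);
  if (byKey.has(k)) return false; // lost the race: an id is already recorded
  byKey.set(k, executionId);
  return true;
}

const won = setExecutionIdByIdempotencyKey("approve-order", "order-123", "exec-1");
const lost = setExecutionIdByIdempotencyKey("approve-order", "order-123", "exec-2");
const resolved = getExecutionIdByIdempotencyKey("approve-order", "order-123");
```

The boolean return from the setter is what lets callers detect that they deduplicated against an existing execution rather than starting a new one.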
466
+
467
+ ### 2. IEventBus - Pub/Sub
468
+
469
+ For event notifications across workers (timer ready, execution complete, etc).
470
+
471
+ ```typescript
472
+ // interfaces/IEventBus.ts
473
+
474
+ export type EventHandler = (event: BusEvent) => Promise<void>;
475
+
476
+ export interface IEventBus {
477
+ // Publish event to all subscribers
478
+ publish(channel: string, event: BusEvent): Promise<void>;
479
+
480
+ // Subscribe to events on a channel
481
+ subscribe(channel: string, handler: EventHandler): Promise<void>;
482
+
483
+ // Unsubscribe from a channel
484
+ unsubscribe(channel: string): Promise<void>;
485
+
486
+ // Lifecycle
487
+ init?(): Promise<void>;
488
+ dispose?(): Promise<void>;
489
+ }
490
+
491
+ export interface BusEvent {
492
+ type: string;
493
+ payload: unknown;
494
+ timestamp: Date;
495
+ }
496
+ ```
497
+
498
+ **Implementations:**
499
+
500
+ - `MemoryEventBus` - Dev/test, single-process only
501
+ - `RedisEventBus` - Production default, uses Redis Pub/Sub
502
+
503
+ **Serialization note:** `RedisEventBus` serializes events using Runner's serializer (tree mode) so `BusEvent.timestamp: Date` (and other supported built-in types) round-trip correctly across Redis Pub/Sub.
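The note matters because plain `JSON.stringify`/`JSON.parse` silently turns a `Date` into a `string`. A minimal reviver-based round-trip shows the problem and one generic fix; this is illustrative only, not Runner's serializer:

```typescript
// Plain JSON loses Date: stringify emits an ISO string, parse leaves a string.
const lost = JSON.parse(JSON.stringify({ timestamp: new Date() }));
const lostIsString = typeof lost.timestamp === "string"; // the type is gone

// A reviver can restore Dates from ISO-shaped strings (simplified heuristic).
const isoRe = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{3})?Z$/;
function parseWithDates(json: string): unknown {
  return JSON.parse(json, (_key, value) =>
    typeof value === "string" && isoRe.test(value) ? new Date(value) : value,
  );
}

const restored = parseWithDates(
  JSON.stringify({ timestamp: new Date("2024-01-01T00:00:00.000Z") }),
) as { timestamp: Date };
```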
504
+
505
+ ### 3. IDurableQueue - Work Distribution
506
+
507
+ For distributing execution work across multiple workers with durability guarantees.
508
+
509
+ ```typescript
510
+ // interfaces/IDurableQueue.ts
511
+
512
+ export interface QueueMessage<T = unknown> {
513
+ id: string;
514
+ type: "execute" | "resume" | "schedule";
515
+ payload: T;
516
+ attempts: number;
517
+ maxAttempts: number;
518
+ createdAt: Date;
519
+ }
520
+
521
+ export type MessageHandler<T = unknown> = (
522
+ message: QueueMessage<T>,
523
+ ) => Promise<void>;
524
+
525
+ export interface IDurableQueue {
526
+ // Send message to queue
527
+ enqueue<T>(
528
+ message: Omit<QueueMessage<T>, "id" | "createdAt">,
529
+ ): Promise<string>;
530
+
531
+ // Start consuming messages (calls handler for each)
532
+ consume<T>(handler: MessageHandler<T>): Promise<void>;
533
+
534
+ // Acknowledge successful processing
535
+ ack(messageId: string): Promise<void>;
536
+
537
+ // Negative acknowledge (requeue or dead-letter)
538
+ nack(messageId: string, requeue?: boolean): Promise<void>;
539
+
540
+ // Lifecycle
541
+ init?(): Promise<void>;
542
+ dispose?(): Promise<void>;
543
+ }
544
+ ```
545
+
546
+ **Message types note:** Runner currently enqueues `execute` and `resume`. `schedule` is accepted by `DurableWorker` as an alias of `resume` (an execution hint) so custom adapters can use it, but built-in cron/interval scheduling is driven by timers + `resume`.
547
+
548
+ **Implementations:**
549
+
550
+ - `MemoryQueue` - Dev/test, no persistence
551
+ - `RabbitMQQueue` - Production default, quorum queues for durability
552
+
553
+ ---
554
+
555
+ ## Adapting to Your Flow: Custom Backends
556
+
557
+ One of Runner's core philosophies is **zero lock-in**. If your team uses Postgres for state or Kafka for queues, you shouldn't have to change your workflow logic to use them.
558
+
559
+ ### Implementing a Custom Store
560
+
561
+ To implement a custom store (e.g., for SQL), you only need to satisfy the `IDurableStore` interface. The engine is designed to be "dumb" and trust the store for all persistence.
562
+
563
+ **Minimum Viable Store (Pseudo-SQL):**
564
+
565
+ ```typescript
566
+ class MySqlStore implements IDurableStore {
567
+ async saveExecution(e: Execution) {
568
+ await db.query("INSERT INTO durable_executions ...", [e.id, serialize(e)]);
569
+ }
570
+
571
+ async getExecution(id: string) {
572
+ const row = await db.query(
573
+ "SELECT data FROM durable_executions WHERE id = ?",
574
+ [id],
575
+ );
576
+ return row ? deserialize(row.data) : null;
577
+ }
578
+
579
+ // ... implement other methods by mapping to your DB tables
580
+ }
581
+ ```
582
+
583
+ > [!TIP]
584
+ > Look at [MemoryStore.ts](../src/node/durable/store/MemoryStore.ts) for a clean reference of how to manage in-memory state, or [RedisStore.ts](../src/node/durable/store/RedisStore.ts) for a production-grade implementation using Lua scripts for atomicity.
585
+
586
+ ### Implementing a Custom Queue
587
+
588
+ If you want to use a different message broker (SQS, Kafka, Redis Streams), implement `IDurableQueue`.
589
+
590
+ **Key Responsibilities:**
591
+
592
+ - **`enqueue`**: Push a message (task execution hint) to the broker.
593
+ - **`consume`**: Register a listener that calls the provided handler when a message arrives.
594
+ - **`ack` / `nack`**: Handle message confirmation/failure.
595
+
596
+ ```typescript
597
+ class SqsQueue implements IDurableQueue {
+   async enqueue(msg) {
+     const res = await sqs.sendMessage({
+       QueueUrl,
+       MessageBody: JSON.stringify(msg),
+     });
+     return res.MessageId;
+   }
+
+   async consume(handler) {
+     // Polling loop or subscription
+     const { Messages = [] } = await sqs.receiveMessage({ QueueUrl });
+     for (const m of Messages) {
+       await handler(JSON.parse(m.Body));
+       await this.ack(m.ReceiptHandle);
+     }
+   }
+
+   async ack(receiptHandle) {
+     // SQS has no explicit ack; deleting the message confirms processing
+     await sqs.deleteMessage({ QueueUrl, ReceiptHandle: receiptHandle });
+   }
+
+   async nack(receiptHandle) {
+     // Make the message visible again immediately so another worker retries
+     await sqs.changeMessageVisibility({
+       QueueUrl,
+       ReceiptHandle: receiptHandle,
+       VisibilityTimeout: 0,
+     });
+   }
+ }
615
+ ```
616
+
617
+ > [!IMPORTANT]
618
+ > A queue in Durable Workflows is just a **hint**. If a message is lost, the `polling` loop in `DurableService` acts as a safety net to find and resume stuck executions. However, a reliable queue (like RabbitMQ or SQS) is critical for low-latency distribution and high throughput.
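The safety-net idea is that the store, not the queue, decides what is stuck: a poll scans incomplete executions and re-issues resume hints for stale ones. A simplified, in-memory sketch of that selection logic (field names like `updatedAt` are illustrative, not the store schema):

```typescript
// If a queue message is lost, a periodic poll like this finds the execution
// in the store and re-enqueues a resume hint for it.
interface ExecutionRow {
  id: string;
  status: "running" | "sleeping" | "completed" | "failed";
  updatedAt: number; // time of the last checkpoint (illustrative field)
}

function findStuck(rows: ExecutionRow[], now: number, staleMs: number): string[] {
  return rows
    .filter((r) => r.status === "running" && now - r.updatedAt > staleMs)
    .map((r) => r.id);
}

const now = 100_000;
const stuck = findStuck(
  [
    { id: "e1", status: "running", updatedAt: now - 120_000 }, // stale: resume
    { id: "e2", status: "running", updatedAt: now - 1_000 }, // fresh: leave it
    { id: "e3", status: "completed", updatedAt: now - 999_999 }, // done: skip
  ],
  now,
  60_000,
);
```

Because resumed executions replay through memoized steps, re-issuing a resume hint for an execution that was not actually stuck is safe, only wasteful.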
619
+
620
+ ---
621
+
622
+ ## Component Architecture
623
+
624
+ ```mermaid
625
+ graph TB
626
+ subgraph RunnerCore[Runner Core - Unchanged]
627
+ R[Resources]
628
+ T[Tasks]
629
+ E[Events]
630
+ H[Hooks]
631
+ end
632
+
633
+ subgraph DurableModule[src/node/durable/]
634
+ DS[DurableService]
635
+ DC[DurableContext]
636
+ DW[DurableWorker]
637
+
638
+ subgraph Interfaces[Abstract Interfaces]
639
+ IS[IDurableStore]
640
+ IB[IEventBus]
641
+ IQ[IDurableQueue]
642
+ end
643
+
644
+ subgraph StoreImpl[Store Implementations]
645
+ MS[MemoryStore]
646
+ RS[RedisStore]
647
+ end
648
+
649
+ subgraph BusImpl[EventBus Implementations]
650
+ MB[MemoryEventBus]
651
+ RB[RedisEventBus]
652
+ end
653
+
654
+ subgraph QueueImpl[Queue Implementations]
655
+ MQ[MemoryQueue]
656
+ RQ[RabbitMQQueue]
657
+ end
658
+ end
659
+
660
+ DS --> IS
661
+ DS --> IB
662
+ DS --> IQ
663
+ DW --> IQ
664
+ DC --> IS
665
+
666
+ IS -.-> MS
667
+ IS -.-> RS
668
+ IB -.-> MB
669
+ IB -.-> RB
670
+ IQ -.-> MQ
671
+ IQ -.-> RQ
672
+
673
+ T -.->|uses| DC
674
+ ```
675
+
676
+ ---
677
+
678
+ ## API Design
679
+
680
+ ### Basic Usage
681
+
682
+ Durable workflows are **normal Runner tasks** that inject a **durable backend resource** (created via `resources.memoryWorkflow.fork(id)` or `resources.redisWorkflow.fork(id)` and registered via `.with(config)`) and call `durableContext.step(...)` / `durableContext.sleep(...)` from inside their `run` function.
683
+
684
+ ```typescript
685
+ import { r, run } from "@bluelibs/runner";
686
+ import { resources } from "@bluelibs/runner/node";
687
+
688
+ // 1. Create durable resource definition
689
+ const durable = resources.memoryWorkflow.fork("app-durable");
690
+
691
+ // 2. Register durable resource with config
692
+ const durableRegistration = durable.with({
693
+ worker: true,
694
+ polling: { enabled: true, interval: 1000 }, // Timer polling interval
695
+ });
696
+
697
+ // 3. Define a task that uses durable context
698
+ const processOrder = r
699
+ .task("process-order")
700
+ .inputSchema({ orderId: String, customerId: String })
701
+ .dependencies({ durable })
702
+ .run(async (input, { durable }) => {
703
+ const durableContext = durable.use();
704
+
705
+ // Step 1: Validate order (checkpointed)
706
+ const order = await durableContext.step("validate", async () => {
707
+ const o = await db.orders.find(input.orderId);
708
+ if (!o) throw new Error("Order not found");
709
+ return o;
710
+ });
711
+
712
+ // Step 2: Process payment (checkpointed)
713
+ const payment = await durableContext.step("charge-payment", async () => {
714
+ return await payments.charge(order.customerId, order.total);
715
+ });
716
+
717
+ // Durable sleep - survives restart
718
+ await durableContext.sleep(5000);
719
+
720
+ // Step 3: Ship order (checkpointed)
721
+ const shipment = await durableContext.step("create-shipment", async () => {
722
+ return await shipping.create(order.id);
723
+ });
724
+
725
+ return {
726
+ success: true,
727
+ orderId: order.id,
728
+ trackingId: shipment.trackingId,
729
+ };
730
+ })
731
+ .build();
732
+
733
+ // 4. Wire up and run
734
+ const app = r
735
+ .resource("app")
736
+ .register([resources.durable, durableRegistration, processOrder])
737
+ .build();
738
+
739
+ const runtime = await run(app);
740
+
741
+ // 5. Execute durably
742
+ const d = runtime.getResourceValue(durable);
743
+ const result = await d.startAndWait(processOrder, {
744
+ orderId: "order-123",
745
+ customerId: "cust-456",
746
+ });
747
+ ```
748
+
749
+ ### How It Works
750
+
751
+ 1. **`durable.startAndWait(task, input)`** creates an execution record and runs the task
752
+ - Prefer `startAndWait()` when you want "start and wait for result" in one call.
753
+ - Prefer `start()` + `signal()` + `wait()` when the outside world must resume the workflow later (webhooks, approvals).
754
+ 2. **`durableContext.step(id, fn)`** checks if step was already executed:
755
+ - If yes: returns cached result (replay)
756
+ - If no: executes fn, caches result, returns result
757
+ 3. **`durableContext.sleep(ms)`** creates a timer record, suspends execution, resumes when timer fires
758
+ 4. **`durableContext.waitForSignal(signal)`** records a durable wait checkpoint and suspends execution
759
+ 5. **`durable.signal(executionId, signal, payload)`** completes the signal checkpoint and resumes the execution
760
+ 6. If process crashes, **`durableService.recover()`** resumes incomplete executions from their last checkpoint
761
+
762
+ ### `start()` vs `startAndWait()` (clear contract)
763
+
764
+ - `start(taskOrTaskId, input)`:
765
+ returns immediately with `executionId` (`string`).
766
+ - `startAndWait(taskOrTaskId, input)`:
767
+ convenience wrapper for `start(...)` + `wait(executionId)`; returns
768
+ `{ durable: { executionId }, data }`.
769
+
770
+ `start()` and `startAndWait()` are the only supported durable execution APIs.
771
+
772
+ `taskOrTaskId` can be:
773
+
774
+ - an `ITask` (the built task object, returned by `.build()`)
775
+ - a task id `string`
776
+
777
+ It is **not** the injected dependency callable from `.dependencies({ someTask })`. That dependency is a function used to invoke the task directly, not an `ITask` reference.
778
+
779
+ ```ts
780
+ // ✅ built task object
781
+ const executionIdA = await d.start(approveOrder, { orderId: "o1" });
782
+
783
+ // ✅ task id string
784
+ const executionIdB = await d.start(approveOrder.id, {
785
+ orderId: "o2",
786
+ });
787
+
788
+ // ❌ injected callable dependency (different type)
789
+ // await d.start(deps.approveOrder, { orderId: "o3" });
790
+ ```
791
+
792
+ ### What Happens with the Return Value
793
+
794
+ Whatever your workflow function returns becomes the **execution result**, persisted in the durable store. You can retrieve it in three ways depending on your pattern:
795
+
796
+ - **`startAndWait(task, input)`** — starts the workflow **and** waits for it to finish, returning `{ durable: { executionId }, data }`:
797
+
798
+ ```ts
799
+ const result = await d.startAndWait(processOrder, { orderId: "order-123" });
800
+ // result = {
801
+ // durable: { executionId: "..." },
802
+ // data: { success: true, orderId: "order-123", trackingId: "TRK-789" }
803
+ // }
804
+ ```
805
+
806
+ - **`start(task, input)`** + **`wait(executionId)`** — start and wait separately (useful when a webhook or external event resumes the workflow later):
807
+
808
+ ```ts
809
+ const executionId = await d.start(approveOrder, {
810
+ orderId: "order-123",
811
+ });
812
+ // ... later (e.g. in a webhook handler) ...
813
+ await d.signal(executionId, Approved, { approvedBy: "admin@co.com" });
814
+ const result = await d.wait(executionId, { timeout: 30_000 });
815
+ // result = { status: "approved", approvedBy: "admin@co.com" }
816
+ ```
817
+
818
+ - **Read from the store** — fetch the persisted result without blocking:
819
+ ```ts
820
+ const execution = await store.getExecution(executionId);
821
+ // execution.status = "completed" | "failed" | "running" | ...
822
+ // execution.result = the return value of your workflow
823
+ ```
824
+
825
+ If the workflow throws an error instead of returning, the execution is marked as `failed` and `startAndWait()`/`wait()` will reject with that error.
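Conceptually, `wait(executionId, { timeout })` is a race between the persisted result arriving and a deadline expiring. A generic sketch of that pattern (not the library internals):

```typescript
// Race a promise against a deadline; reject with a timeout error if it loses.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

Note that a timeout on `wait()` only means the caller stopped waiting; the execution itself keeps running and its result stays readable from the store.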
826
+
827
+ ---
828
+
829
+ ## Execution Flow
830
+
831
+ ```mermaid
832
+ sequenceDiagram
833
+ participant C as Client
834
+ participant DS as DurableService
835
+ participant S as Store
836
+ participant DC as DurableContext
837
+ participant T as Task Function
838
+
839
+ C->>DS: startAndWait(task, input)
840
+ DS->>S: createExecution(id, task, input)
841
+ DS->>DC: create context for execution
842
+ DS->>T: run task with context
843
+
844
+ T->>DC: step('validate', fn)
845
+ DC->>S: getStepResult(execId, 'validate')
846
+ alt Step not cached
847
+ DC->>DC: execute fn()
848
+ DC->>S: saveStepResult(execId, 'validate', result)
849
+ end
850
+ DC-->>T: return result
851
+
852
+ T->>DC: sleep(5000)
853
+ DC->>S: createTimer(execId, fireAt)
854
+ Note over DC,T: Execution suspends
855
+
856
+ Note over DS: Timer polling...
857
+ DS->>S: getReadyTimers()
858
+ S-->>DS: timer ready!
859
+ DS->>DC: resume execution
860
+
861
+ T->>DC: step('ship', fn)
862
+ DC->>S: getStepResult(execId, 'ship')
863
+ DC->>DC: execute fn()
864
+ DC->>S: saveStepResult(execId, 'ship', result)
865
+ DC-->>T: return result
866
+
867
+ T-->>DS: return final result
868
+ DS->>S: markExecutionComplete(id, result)
869
+ DS-->>C: return result
870
+ ```
871
+
872
+ ---
873
+
874
+ ## Safety & Semantics
875
+
876
+ This section summarizes the safety guarantees and expectations of the durable workflow system.
877
+
878
+ - **Store is the source of truth**
879
+ All durable state (executions, steps, timers, schedules) lives in `IDurableStore`. Queues and pub/sub are optimizations on top; correctness must not rely solely on in-memory state or transient messages.
880
+
881
+ - **At-least-once execution, effectively-once steps**
882
+ - Executions are retried on failure, so the same logical workflow may run more than once.
883
+ - `durableContext.step(stepId, fn)` ensures each step function is _observably_ executed at most once per execution: results are memoized in the store and returned on replay.
884
+ - External side effects inside a step must still be designed to be idempotent or safely repeatable (for example, idempotent payment/refund APIs).
885
+
886
+ - **Sleep and resumption**
887
+ - `durableContext.sleep(ms)` persists a timer and marks the execution as `sleeping`.
888
+ - When the timer fires, execution is resumed from the code _after_ `sleep`, and all previous steps are replayed via cached results (no re‑issuing of side effects wrapped in `step`).
889
+
890
+ - **Event emission without duplicates**
891
+ - `durableContext.emit(event, data)` is implemented as one or more internal `step`s under the hood.
892
+ - Each call is assigned a deterministic internal id like `__emit:<eventId>:<index>` so you can emit the same event type multiple times in one workflow.
893
+ - On replay, memoization prevents duplicates for each individual emission.
894
+ - **Determinism note:** those internal `:<index>` suffixes are derived from call order within the workflow. If you change the workflow structure (branching / adding/removing calls), the internal step ids may shift and past executions may no longer replay cleanly.
895
+
896
+ - **Signals (wait until external confirmation)**
897
+ - `durableContext.waitForSignal(signal)` suspends an execution until `durable.signal(executionId, signal, payload)` is called.
898
+ - Passing `stepId` keeps the same return type (the signal payload, with a thrown error on timeout), while passing `timeoutMs` switches the return to a `{ kind: "signal" | "timeout" }` outcome.
899
+ - Signals are memoized as steps under `__signal:<signal.id>[:index]` (or `__signal:<id>[:index]` for string ids).
900
+ - Repeated waits use `__signal:<id>:<index>` and are resolved by the first available slot; payloads can be buffered for future waits.
901
+ - **Determinism note:** like `emit`, the `:<index>` suffixes are derived from call order within the workflow; code changes can shift indexes on replay.
902
+
903
+ - **Retries and timeouts**
904
+ - `StepOptions.retries` and `DurableServiceConfig.execution.maxAttempts` control step‑level and execution‑level retries respectively.
905
+ - `StepOptions.timeout` and `execution.timeout` bound how long a single step or the whole execution may run.
906
+ - **Global Timeouts**: `execution.timeout` measures the total time from the very first attempt (`createdAt`) and is not reset on retries or resumptions.
907
+
908
+ - **Queue and worker semantics**
909
+ - `IDurableQueue` provides **at-least-once** delivery: messages may be delivered more than once but will not be silently dropped.
910
+ - Workers must treat queue messages as hints to load state from the store, apply `DurableContext` logic, and then `ack` or `nack` the message. Idempotency is achieved by reading/writing through `IDurableStore`, not by trusting the queue alone.
911
+
912
+ - **Multi-node coordination**
913
+ - `IEventBus` is used to reduce `wait()` latency (publish `execution:<id>` completion events) but does not replace the store.
914
+ - Timers (`sleep`, signal timeouts, schedules) are driven by the durable poller (`DurableService` polling loop). In multi-process setups, run a single poller (`polling: { enabled: true }`) or implement atomic timer claiming in your store.
915
+
916
+ - **Reserved step ids**
917
+ - Step ids starting with `__` or `rollback:` are reserved for durable internals. Avoid using them in `durableContext.step(...)` to prevent collisions with system steps.
918
+
919
+ These semantics intentionally favor **safety and debuggability** over perfect "exactly-once" guarantees at the infrastructure level. Application code remains explicit and testable, while the system provides strong, well-defined durability guarantees around that code.
920
+
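The "effectively-once steps" guarantee above can be sketched with a tiny in-memory memoization layer. This is illustrative only — the names are hypothetical and the real `IDurableStore` is asynchronous and persistent:

```typescript
// Minimal sketch of step memoization, assuming an in-memory store.
// (Real stores also track completion separately, so steps that return
// `undefined` memoize too — this sketch skips that detail.)
class SketchStore {
  private results = new Map<string, unknown>();

  get(executionId: string, stepId: string): unknown | undefined {
    return this.results.get(`${executionId}:${stepId}`);
  }
  save(executionId: string, stepId: string, result: unknown): void {
    this.results.set(`${executionId}:${stepId}`, result);
  }
}

// step(): run fn once, return the cached result on every replay.
function step<T>(
  store: SketchStore,
  executionId: string,
  stepId: string,
  fn: () => T,
): T {
  const cached = store.get(executionId, stepId);
  if (cached !== undefined) return cached as T; // replay: skip side effects
  const result = fn();
  store.save(executionId, stepId, result);
  return result;
}

// Demo: the side effect runs once even though step() is called twice.
const store = new SketchStore();
let charges = 0;
const first = step(store, "exec-1", "charge", () => {
  charges++;
  return "ch_1";
});
const replay = step(store, "exec-1", "charge", () => {
  charges++;
  return "ch_2";
});
```

On replay, the second call returns `"ch_1"` from the cache and never invokes its function — which is exactly why external side effects belong inside `step`.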
921
+ ---
922
+
923
+ ## Signals (wait for external events)
924
+
925
+ Durable workflows often need to pause until the outside world confirms something (e.g., payment provider callbacks). Use `durableContext.waitForSignal()` inside the workflow, and `durable.signal()` from the outside.
926
+
927
+ Signal summary:
928
+
929
+ - `stepId` is a stable key only; it does not change return types.
930
+ - `waitForSignal({ stepId })` requires a store that supports listing step results (`listStepResults`) so `durable.signal(...)` can find the waiter.
931
+ - `timeoutMs` changes the return value to a `{ kind: "signal" | "timeout" }` outcome.
932
+ - Without `timeoutMs`, timeouts throw an error (no union result).
933
+
934
+ Return shapes:
935
+
936
+ | Call | Returns |
937
+ | ---------------------------------------------- | ------------------------------------------------------ |
938
+ | `waitForSignal(signal)` | `payload` (throws on timeout) |
939
+ | `waitForSignal(signal, { stepId })` | `payload` (throws on timeout) |
940
+ | `waitForSignal(signal, { timeoutMs })` | `{ kind: "signal", payload }` or `{ kind: "timeout" }` |
941
+ | `waitForSignal(signal, { timeoutMs, stepId })` | `{ kind: "signal", payload }` or `{ kind: "timeout" }` |
942
+
943
+ ### Example: `waitUntilPaid()`
944
+
945
+ ```typescript
946
+ import { r } from "@bluelibs/runner";
947
+ import { resources } from "@bluelibs/runner/node";
948
+
949
+ const Paid = r.event<{ paidAt: number }>("paid").build();
950
+ const durable = resources.memoryWorkflow.fork("app-durable");
951
+ const durableRegistration = durable.with({ worker: true });
952
+
953
+ export const processOrder = r
954
+ .task("process-order")
955
+ .dependencies({ durable })
956
+ .run(async (input: { orderId: string }, { durable }) => {
957
+ const durableContext = durable.use();
958
+
959
+ await durableContext.step("reserve", async () => {
960
+ // reserve inventory, create payment intent, etc.
961
+ return { ok: true };
962
+ });
963
+
964
+ const payment = await durableContext.waitForSignal(Paid);
965
+
966
+ await durableContext.step("ship", async () => {
967
+ // ship only after payment is confirmed
968
+ return { ok: true, paidAt: payment.paidAt };
969
+ });
970
+ })
971
+ .build();
972
+ ```
973
+
974
+ From an API webhook / callback handler:
975
+
976
+ ```typescript
977
+ // Store the workflow `executionId` in your domain data when you start it.
978
+ // You can get it immediately via `await d.start(task, input)`.
979
+ const d = runtime.getResourceValue(durable);
980
+ await d.signal(executionId, Paid, { paidAt: Date.now() });
981
+ ```
982
+
983
+ ### Whichever comes first: signal or timeout
984
+
985
+ If you need "wait for payment confirmation or continue after 1 day", use the timeout variant:
986
+
987
+ ```typescript
988
+ const outcome = await durableContext.waitForSignal(Paid, {
989
+ timeoutMs: 86_400_000,
990
+ });
991
+
992
+ if (outcome.kind === "timeout") {
993
+ // mark order as expired, notify user, etc.
994
+ return;
995
+ }
996
+
997
+ // outcome.kind === "signal"
998
+ await durableContext.step("ship", async () => ({
999
+ paidAt: outcome.payload.paidAt,
1000
+ }));
1001
+ ```
1002
+
1003
+ ### Stable `stepId` without changing behavior
1004
+
1005
+ You can pass a stable step id for replay stability without changing the return type:
1006
+
1007
+ ```typescript
1008
+ const payment = await durableContext.waitForSignal(Paid, {
1009
+ stepId: "stable-paid",
1010
+ });
1011
+ ```
1012
+
1013
+ ---
1014
+
1015
+ ## Compensation / Rollback Pattern
1016
+
1017
+ Instead of relying on a complex saga orchestrator, you implement compensation explicitly:
1018
+
1019
+ ```typescript
1020
+ const processOrderWithRollback = r
1021
+ .task("process-order")
1022
+ .dependencies({ durable })
1023
+ .run(async (input, { durable }) => {
1024
+ const durableContext = durable.use();
1025
+
1026
+ // Reserve inventory
1027
+ const reservation = await durableContext
1028
+ .step("reserve-inventory")
1029
+ .up(async () => inventory.reserve(input.items))
1030
+ .down(async (res) => inventory.release(res.reservationId));
1031
+
1032
+ // Charge payment
1033
+ const payment = await durableContext
1034
+ .step("charge-payment")
1035
+ .up(async () => payments.charge(input.customerId, input.amount))
1036
+ .down(async (p) => payments.refund(p.chargeId));
1037
+
1038
+ try {
1039
+ // Ship order - might fail
1040
+ const shipment = await durableContext.step("ship-order", async () => {
1041
+ return await shipping.ship(input.orderId);
1042
+ });
1043
+ return { success: true, shipment };
1044
+ } catch (error) {
1045
+ await durableContext.rollback();
1046
+ return {
1047
+ success: false,
1048
+ error: error instanceof Error ? error.message : String(error),
1049
+ };
1050
+ }
1051
+ })
1052
+ .build();
1053
+ ```
1054
+
1055
+ This is more explicit and readable than an automatic saga system.
1056
+
1057
+ ---
1058
+
1059
+ ## Branching with durableContext.switch()
1060
+
1061
+ `durableContext.switch()` is a replay-safe branching primitive for durable workflows. Instead of using plain `if/else` (which the flow shape exporter can't capture), model conditional logic with `switch` so that:
1062
+
1063
+ 1. The branch decision is **persisted** — on replay, matchers are skipped and the cached branch result is returned.
1064
+ 2. The branch structure is **visible** to the flow-shape recorder (via `durable.describe(...)`) for documentation and visualization.
1065
+
1066
+ ### API
1067
+
1068
+ ```typescript
1069
+ const result = await durableContext.switch<TValue, TResult>(
1070
+ stepId, // unique step ID (like durableContext.step)
1071
+ value, // the value to match against
1072
+ branches, // array of { id, match, run }
1073
+ defaultBranch?, // optional { id, run } (no match needed)
1074
+ );
1075
+ ```
1076
+
1077
+ ### Example
1078
+
1079
+ ```typescript
1080
+ const fulfillOrder = r
1081
+ .task("fulfill-order")
1082
+ .dependencies({ durable })
1083
+ .run(async (input: { orderId: string; tier: string }, { durable }) => {
1084
+ const durableContext = durable.use();
1085
+
1086
+ const order = await durableContext.step("fetch-order", async () => {
1087
+ return await db.orders.findById(input.orderId);
1088
+ });
1089
+
1090
+ const result = await durableContext.switch(
1091
+ "fulfillment-route",
1092
+ order.tier,
1093
+ [
1094
+ {
1095
+ id: "premium",
1096
+ match: (tier) => tier === "premium",
1097
+ run: async () => {
1098
+ await durableContext.step("express-ship", async () =>
1099
+ shipping.express(order),
1100
+ );
1101
+ return "express-shipped";
1102
+ },
1103
+ },
1104
+ {
1105
+ id: "standard",
1106
+ match: (tier) => tier === "standard",
1107
+ run: async () => {
1108
+ await durableContext.step("standard-ship", async () =>
1109
+ shipping.standard(order),
1110
+ );
1111
+ return "standard-shipped";
1112
+ },
1113
+ },
1114
+ ],
1115
+ {
1116
+ id: "manual-review",
1117
+ run: async () => {
1118
+ await durableContext.step("flag-review", async () =>
1119
+ flagForReview(order),
1120
+ );
1121
+ return "needs-review";
1122
+ },
1123
+ },
1124
+ );
1125
+
1126
+ return { orderId: input.orderId, result };
1127
+ })
1128
+ .build();
1129
+ ```
1130
+
1131
+ ### How it works
1132
+
1133
+ - **First execution**: matchers evaluate in order; the first matching branch's `run()` is called. The branch `id` and result are persisted as a step result.
1134
+ - **Replay**: the cached `{ branchId, result }` is returned immediately — no matchers or `run()` are re-executed.
1135
+ - **Audit**: emits a `switch_evaluated` audit entry with `branchId` and `durationMs`.
1136
+ - **Determinism**: the step ID is user-provided (required), so it's stable across refactors (like `durableContext.step`).
1137
+ - **Fail-fast**: throws if no branch matches and no default is provided.
1138
+
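The persist-then-replay behavior above can be sketched in a few lines. This is an in-memory, synchronous illustration with hypothetical names, not the real (async, store-backed) implementation:

```typescript
interface Branch<V, R> {
  id: string;
  match: (value: V) => boolean;
  run: (value: V) => R;
}

// Persisted decision: which branch ran and what it returned.
type SwitchRecord<R> = { branchId: string; result: R };

const switchCache = new Map<string, SwitchRecord<unknown>>();

function durableSwitch<V, R>(
  stepId: string,
  value: V,
  branches: Branch<V, R>[],
  defaultBranch?: { id: string; run: (value: V) => R },
): SwitchRecord<R> {
  const cached = switchCache.get(stepId);
  if (cached) return cached as SwitchRecord<R>; // replay: no matchers re-run

  const branch = branches.find((b) => b.match(value)) ?? defaultBranch;
  if (!branch) throw new Error(`No branch matched for "${stepId}"`); // fail-fast

  const record = { branchId: branch.id, result: branch.run(value) };
  switchCache.set(stepId, record);
  return record;
}

// First run evaluates matchers; the replay returns the persisted decision
// even though the input value changed in the meantime.
let evaluations = 0;
const firstRun = durableSwitch("route", "premium", [
  {
    id: "premium",
    match: (t) => {
      evaluations++;
      return t === "premium";
    },
    run: () => "express",
  },
]);
const replayRun = durableSwitch("route", "standard", [
  {
    id: "premium",
    match: (t) => {
      evaluations++;
      return t === "premium";
    },
    run: () => "express",
  },
]);
```

Note how the replay ignores the new `"standard"` value: the decision was persisted, so the branch choice stays stable across retries.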
1139
+ ### Interface
1140
+
1141
+ ```typescript
1142
+ interface SwitchBranch<TValue, TResult> {
1143
+ id: string;
1144
+ match: (value: TValue) => boolean;
1145
+ run: (value: TValue) => Promise<TResult>;
1146
+ }
1147
+ ```
1148
+
1149
+ ---
1150
+
1151
+ ## Describing a Flow (Static Shape Export)
1152
+
1153
+ Use `durable.describe(...)` to capture the **structure** of a durable workflow without executing it. It returns a serializable `DurableFlowShape` object that you can use for:
1154
+
1155
+ - Documentation generation
1156
+ - Visual workflow diagrams
1157
+ - Tooling and editor plugins
1158
+ - API schema exports
1159
+
1160
+ ### From an existing task (recommended)
1161
+
1162
+ Call `describe()` on your durable dependency, then pass your task directly — it shims `durable.use()` and records every `durableContext.*` operation:
1163
+
1164
+ ```typescript
1165
+ import { r, run } from "@bluelibs/runner";
1166
+ import { resources } from "@bluelibs/runner/node";
1167
+
1168
+ const durable = resources.memoryWorkflow.fork("app-durable");
1169
+ const app = r
1170
+ .resource("app")
1171
+ .register([resources.durable, durable.with({})])
1172
+ .build();
1173
+ const runtime = await run(app);
1174
+
1175
+ // TInput is inferred from the task:
1176
+ const shape = await runtime.getResourceValue(durable).describe(approveOrder);
1177
+
1178
+ // Or specify input explicitly:
1179
+ const shape2 = await runtime
1180
+ .getResourceValue(durable)
1181
+ .describe<{ orderId: string }>(approveOrder, { orderId: "123" });
1182
+
1183
+ console.log(shape.nodes);
1184
+ // [
1185
+ // { kind: "step", stepId: "validate", hasCompensation: false },
1186
+ // { kind: "waitForSignal", signalId: "approved", ... },
1187
+ // { kind: "step", stepId: "ship", hasCompensation: false },
1188
+ // { kind: "emit", eventId: "shipped", stepId: "notify" },
1189
+ // ]
1190
+ ```
1191
+
1192
+ If your task is tagged with `tags.durableWorkflow.with({ defaults: {...} })`,
1193
+ `describe(task)` (without input) uses a cloned copy of those defaults.
1194
+ Passing `describe(task, input)` always wins and replaces tag defaults.
1195
+
1196
+ That's it. No refactoring — just call `durable.describe(task)` and get the shape.
1197
+
1198
+ ### Output shape
1199
+
1200
+ ```typescript
1201
+ interface DurableFlowShape {
1202
+ nodes: FlowNode[];
1203
+ }
1204
+
1205
+ type FlowNode =
1206
+ | { kind: "step"; stepId: string; hasCompensation: boolean }
1207
+ | { kind: "sleep"; durationMs: number; stepId?: string }
1208
+ | {
1209
+ kind: "waitForSignal";
1210
+ signalId: string;
1211
+ timeoutMs?: number;
1212
+ stepId?: string;
1213
+ }
1214
+ | { kind: "emit"; eventId: string; stepId?: string }
1215
+ | { kind: "switch"; stepId: string; branchIds: string[]; hasDefault: boolean }
1216
+ | { kind: "note"; message: string };
1217
+ ```
1218
+
1219
+ ### How it works
1220
+
1221
+ The recorder runs your task's `run` function with **real runtime dependencies**, but wraps durable resource dependencies so `durable.use()` returns a **recording context**. That context implements `IDurableContext` and captures each `durableContext.*` call as a `FlowNode` instead of executing it.
1222
+
1223
+ The step builder API (`.up()` / `.down()`) is also supported: `hasCompensation` reflects whether `.down()` was called.
1224
+
1225
+ `rollback()` is a no-op in the recorder (it's a runtime concern, not a structural one).
1226
+
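A recording context is essentially an `IDurableContext` whose methods append `FlowNode`s instead of executing. A simplified sketch (only `step` and `sleep` shown; names beyond the documented `FlowNode` shape are illustrative):

```typescript
type FlowNode =
  | { kind: "step"; stepId: string; hasCompensation: boolean }
  | { kind: "sleep"; durationMs: number };

// Captures structure instead of executing: each durableContext.* call
// becomes a node, and the step function itself is never invoked.
class RecordingContext {
  readonly nodes: FlowNode[] = [];

  step(stepId: string, _fn: () => unknown): undefined {
    this.nodes.push({ kind: "step", stepId, hasCompensation: false });
    return undefined; // structural recording only — _fn's result is unused
  }

  sleep(durationMs: number): void {
    this.nodes.push({ kind: "sleep", durationMs });
  }
}

// "Describe" a tiny workflow without running any of its side effects.
const ctx = new RecordingContext();
ctx.step("validate", () => ({ ok: true }));
ctx.sleep(5_000);
ctx.step("ship", () => ({ ok: true }));
```

After the run, `ctx.nodes` holds the flow shape in call order — the same idea `durable.describe(task)` applies by shimming `durable.use()`.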
1227
+ ---
1228
+
1229
+ ## Scheduling & Cron Jobs
1230
+
1231
+ ### One-Time Scheduled Execution
1232
+
1233
+ Run a task at a specific future time:
1234
+
1235
+ ```typescript
1236
+ // Schedule a task to run in 1 hour
1237
+ const executionId = await durable.schedule(
1238
+ processReport,
1239
+ { reportId: "daily-sales" },
1240
+ { at: new Date(Date.now() + 3600000) },
1241
+ );
1242
+
1243
+ // Or use delay helper
1244
+ const executionId = await durable.schedule(
1245
+ sendReminder,
1246
+ { userId: "user-123" },
1247
+ { delay: 24 * 60 * 60 * 1000 }, // 24 hours from now
1248
+ );
1249
+ ```
1250
+
1251
+ ### Recurring Cron Jobs
1252
+
1253
+ Define tasks that run on a schedule using cron expressions:
1254
+
1255
+ ```typescript
1256
+ // Define a scheduled task
1257
+ const dailyCleanup = r
1258
+ .task("daily-cleanup")
1259
+ .dependencies({ durable, db })
1260
+ .run(async (input, { durable, db }) => {
1261
+ const durableContext = durable.use();
1262
+
1263
+ await durableContext.step("cleanup-old-sessions", async () => {
1264
+ await db.sessions.deleteOlderThan(7, "days");
1265
+ });
1266
+
1267
+ await durableContext.step("cleanup-temp-files", async () => {
1268
+ await fs.rm("./tmp/*", { recursive: true });
1269
+ });
1270
+
1271
+ return { cleaned: true };
1272
+ })
1273
+ .build();
1274
+
1275
+ // Create schedules once at startup (in a bootstrap resource/task)
1276
+ // ensureSchedule() is idempotent — safe to call on every boot and concurrently
1277
+ await durable.ensureSchedule(
1278
+ dailyCleanup,
1279
+ {},
1280
+ { id: "daily-cleanup", cron: "0 3 * * *" },
1281
+ );
1282
+ await durable.ensureSchedule(
1283
+ syncInventory,
1284
+ { full: false },
1285
+ { id: "hourly-sync", cron: "0 * * * *" },
1286
+ );
1287
+ await durable.ensureSchedule(
1288
+ generateWeeklyReport,
1289
+ { type: "weekly" },
1290
+ { id: "weekly-report", cron: "0 9 * * MON" },
1291
+ );
1292
+ ```
1293
+
1294
+ ### Interval-Based Scheduling
1295
+
1296
+ Run tasks at fixed intervals (e.g., every 30 seconds):
1297
+
1298
+ ```typescript
1299
+ // ensureSchedule() is idempotent — safe to call on every boot and concurrently
1300
+ await durable.ensureSchedule(
1301
+ healthCheckTask,
1302
+ { endpoints: ["api", "db"] },
1303
+ { id: "health-check", interval: 30_000 },
1304
+ );
1305
+ await durable.ensureSchedule(
1306
+ pollExternalApi,
1307
+ {},
1308
+ { id: "poll-external-api", interval: 5 * 60 * 1000 },
1309
+ );
1310
+ await durable.ensureSchedule(
1311
+ metricsSync,
1312
+ { flush: true },
1313
+ { id: "metrics-sync", interval: 60_000 },
1314
+ );
1315
+ ```
1316
+
1317
+ **Interval vs Cron:**
1318
+
1319
+ - **Interval**: Fixed delay between executions. Next run = end of previous + interval. Best for polling, health checks.
1320
+ - **Cron**: Calendar-based. Next run = next matching time. Best for scheduled reports, daily cleanup.
1321
+
1322
+ **Interval Behavior (current implementation):**
1323
+ Intervals are currently measured from when the schedule timer fires / execution is kicked off (not from task completion).
1324
+ If the task runs longer than the interval, the next run will be scheduled after the interval from _kickoff time_, which can cause overlapping executions unless your task logic (or your infrastructure) prevents it.
1325
+
1326
+ ```
1327
+ Task starts at t=0, takes 12s to complete
1328
+ Interval = 10s
1329
+
1330
+ t=0 t=10 t=12
1331
+ |------------|------------|
1332
+ task run A next run B A completes
1333
+ ```
1334
+
1335
+ If you need "completion-based" intervals (no overlap), implement it explicitly inside the workflow:
1336
+
1337
+ - run the work
1338
+ - then `await durableContext.sleep(intervalMs)`
1339
+ - then loop / re-run (or have the schedule fire less frequently and use durable sleeps inside)
1340
+
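The difference between the two strategies is just where the next fire time is measured from. A sketch with illustrative function names, using the same numbers as the timeline above:

```typescript
// Kickoff-based (current behavior): next run is measured from when the
// timer fired, so a task slower than the interval overlaps the next run.
function nextRunFromKickoff(kickoffAt: number, intervalMs: number): number {
  return kickoffAt + intervalMs;
}

// Completion-based (what you get by looping with a durable sleep inside
// the workflow): next run is measured from when the task finished.
function nextRunFromCompletion(completedAt: number, intervalMs: number): number {
  return completedAt + intervalMs;
}

// Task kicks off at t=0, takes 12s to complete, interval is 10s.
const kickoff = nextRunFromKickoff(0, 10_000); // t=10s — overlaps run A
const completion = nextRunFromCompletion(12_000, 10_000); // t=22s — no overlap
```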
1341
+ ### Cron Expression Format
1342
+
1343
+ Standard 5-field cron format:
1344
+
1345
+ ```
1346
+ ┌───────────── minute (0-59)
1347
+ │ ┌─────────── hour (0-23)
1348
+ │ │ ┌───────── day of month (1-31)
1349
+ │ │ │ ┌─────── month (1-12 or JAN-DEC)
1350
+ │ │ │ │ ┌───── day of week (0-6 or SUN-SAT)
1351
+ │ │ │ │ │
1352
+ * * * * *
1353
+ ```
1354
+
1355
+ Common patterns:
1356
+
1357
+ - `* * * * *` - Every minute
1358
+ - `0 * * * *` - Every hour
1359
+ - `0 0 * * *` - Every day at midnight
1360
+ - `0 9 * * MON-FRI` - Weekdays at 9am
1361
+ - `0 0 1 * *` - First of every month
1362
+
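To make the field semantics concrete, here is a toy matcher for the 5-field format above. It is deliberately incomplete — it handles only `*`, plain numbers, and comma lists, not ranges, steps, or `MON`/`JAN` names:

```typescript
// Toy cron matcher: does a given Date satisfy the expression?
function cronMatches(expr: string, date: Date): boolean {
  const [min, hour, dom, month, dow] = expr.trim().split(/\s+/);
  const fieldMatches = (field: string, value: number): boolean =>
    field === "*" || field.split(",").some((part) => Number(part) === value);

  return (
    fieldMatches(min, date.getMinutes()) &&
    fieldMatches(hour, date.getHours()) &&
    fieldMatches(dom, date.getDate()) &&
    fieldMatches(month, date.getMonth() + 1) && // cron months are 1-12
    fieldMatches(dow, date.getDay()) // 0 = Sunday
  );
}

// "0 3 * * *" fires daily at 03:00.
const threeAm = new Date(2024, 0, 15, 3, 0);
const fourAm = new Date(2024, 0, 15, 4, 0);
```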
1363
+ ### Schedule Management API
1364
+
1365
+ ```typescript
1366
+ // Pause a schedule
1367
+ await durable.pauseSchedule("daily-cleanup");
1368
+
1369
+ // Resume a schedule
1370
+ await durable.resumeSchedule("daily-cleanup");
1371
+
1372
+ // Get schedule status
1373
+ const status = await durable.getSchedule("daily-cleanup");
1374
+ // { id, cron, lastRun, nextRun, status: 'active' | 'paused' }
1375
+
1376
+ // List all schedules
1377
+ const schedules = await durable.listSchedules();
1378
+
1379
+ // Update schedule cron
1380
+ await durable.updateSchedule("daily-cleanup", { cron: "0 4 * * *" });
1381
+
1382
+ // Remove schedule
1383
+ await durable.removeSchedule("daily-cleanup");
1384
+ ```
1385
+
1386
+ ### How Scheduling Works
1387
+
1388
+ ```mermaid
1389
+ sequenceDiagram
1390
+ participant DS as DurableService
1391
+ participant S as Store
1392
+ participant T as Task
1393
+
1394
+ Note over DS: Timer polling loop
1395
+
1396
+ loop Every polling interval
1397
+ DS->>S: getReadyTimers
1398
+ S-->>DS: timers ready to fire
1399
+
1400
+ alt Schedule timer
1401
+ DS->>S: getSchedule by scheduleId
1402
+ DS->>DS: execute task with input
1403
+ DS->>S: calculateNextRun from cron
1404
+ DS->>S: createTimer for next run
1405
+ else Sleep timer
1406
+ DS->>DS: resume execution
1407
+ else One-time scheduled
1408
+ DS->>DS: execute task
1409
+ end
1410
+ end
1411
+ ```
1412
+
1413
+ ---
1414
+
1415
+ ## Core Types
1416
+
1417
+ ```typescript
1418
+ // types.ts
1419
+
1420
+ export type ExecutionStatus =
1421
+ | "pending"
1422
+ | "running"
1423
+ | "retrying"
1424
+ | "sleeping"
1425
+ | "completed"
1426
+ | "failed"
1427
+ | "compensation_failed";
1428
+
1429
+ export interface Execution<TInput = unknown, TResult = unknown> {
1430
+ id: string;
1431
+ taskId: string;
1432
+ input: TInput | undefined;
1433
+ status: ExecutionStatus;
1434
+ result?: TResult;
1435
+ error?: {
1436
+ message: string;
1437
+ stack?: string;
1438
+ };
1439
+ attempt: number;
1440
+ maxAttempts: number;
1441
+ timeout?: number;
1442
+ createdAt: Date;
1443
+ updatedAt: Date;
1444
+ completedAt?: Date;
1445
+ }
1446
+
1447
+ export interface StepResult<T = unknown> {
1448
+ executionId: string;
1449
+ stepId: string;
1450
+ result: T;
1451
+ completedAt: Date;
1452
+ }
1453
+
1454
+ export type TimerType =
1455
+ | "sleep"
1456
+ | "timeout"
1457
+ | "scheduled"
1458
+ | "cron"
1459
+ | "retry"
1460
+ | "signal_timeout";
1461
+
1462
+ export interface Timer {
1463
+ id: string;
1464
+ executionId?: string; // For sleep/timeout timers
1465
+ stepId?: string; // For step-specific timers
1466
+ scheduleId?: string; // For cron timers
1467
+ type: TimerType;
1468
+ fireAt: Date;
1469
+ status: "pending" | "fired";
1470
+ }
1471
+
1472
+ export type ScheduleType = "cron" | "interval";
1473
+
1474
+ export interface Schedule<TInput = unknown> {
1475
+ id: string;
1476
+ taskId: string;
1477
+ type: ScheduleType;
1478
+ pattern: string; // Cron expression or interval (ms)
1479
+ input: TInput | undefined;
1480
+ status: "active" | "paused";
1481
+ lastRun?: Date;
1482
+ nextRun?: Date;
1483
+ createdAt: Date;
1484
+ updatedAt: Date;
1485
+ }
1486
+
1487
+ export interface DurableContextState {
1488
+ executionId: string;
1489
+ attempt: number;
1490
+ }
1491
+ ```
1492
+
1493
+ ---
1494
+
1495
+ **Note on Interfaces**: The full technical contracts for `IDurableStore`, `IEventBus`, and `IDurableQueue` are documented in the [Abstract Interfaces](#abstract-interfaces) section.
1496
+
1497
+ ---
1498
+
1499
+ ## DurableContext
1500
+
1501
+ ```typescript
1502
+ // DurableContext.ts
1503
+
1504
+ export interface IDurableContext {
1505
+ readonly executionId: string;
1506
+ readonly attempt: number;
1507
+
1508
+ /**
1509
+ * Execute a step with memoization. On replay, returns cached result.
1510
+ */
1511
+ step<T>(stepId: string, fn: () => Promise<T>): Promise<T>;
1512
+ step<T>(
1513
+ stepId: string,
1514
+ options: StepOptions,
1515
+ fn: () => Promise<T>,
1516
+ ): Promise<T>;
1517
+
1518
+ /**
1519
+ * Durable sleep that survives process restarts.
1520
+ */
1521
+ sleep(durationMs: number): Promise<void>;
1522
+
1523
+ /**
1524
+ * Emit an event durably (as a step).
1525
+ */
1526
+ emit<T>(event: IEvent<T>, data: T): Promise<void>;
1527
+ }
1528
+
1529
+ export interface StepOptions {
1530
+ retries?: number;
1531
+ timeout?: number;
1532
+ }
1533
+ ```
1534
+
1535
+ ---
1536
+
1537
+ ## DurableService
1538
+
1539
+ ```typescript
1540
+ // DurableService.ts (simplified interface)
1541
+
1542
+ export interface ScheduleConfig<TInput = unknown> {
1543
+ id: string;
1544
+ task: ITask<TInput, any>;
1545
+ cron?: string; // Cron expression (e.g., '0 3 * * *')
1546
+ interval?: number; // Interval in ms (e.g., 30000 for 30 seconds)
1547
+ input: TInput;
1548
+ }
1549
+ // Must specify either cron OR interval, not both
1550
+
1551
+ export interface DurableServiceConfig {
1552
+ store: IDurableStore;
1553
+ queue?: IDurableQueue;
1554
+ eventBus?: IEventBus;
1555
+ audit?: {
1556
+ enabled?: boolean; // Default: false
1557
+ };
1558
+ polling?: {
1559
+ enabled?: boolean; // Default: true
1560
+ interval?: number; // Default: 1000ms
1561
+ };
1562
+ execution?: {
1563
+ maxAttempts?: number; // Default: 3
1564
+ timeout?: number; // Default: no timeout
1565
+ };
1566
+ schedules?: ScheduleConfig[]; // Cron schedules to register
1567
+ }
1568
+
1569
+ export interface ScheduleOptions {
1570
+ id?: string; // Stable schedule id (required for ensureSchedule)
1571
+ at?: Date; // Run at specific time
1572
+ delay?: number; // Run after delay (ms)
1573
+ cron?: string; // Cron expression (for recurring)
1574
+ interval?: number; // Interval in ms (for recurring)
1575
+ }
1576
+
1577
+ export interface IDurableService {
1578
+ /**
1579
+ * Start a task durably and wait for it to complete.
1580
+ */
1581
+ startAndWait<TInput, TResult>(
1582
+ task: ITask<TInput, Promise<TResult>, any, any, any, any> | string,
1583
+ input?: TInput,
1584
+ options?: ExecuteOptions,
1585
+ ): Promise<TResult>;
1586
+
1587
+ /**
1588
+ * Start a task execution and return the ID immediately.
1589
+ */
1590
+ start<TInput>(
1591
+ task: ITask<TInput, Promise<unknown>, any, any, any, any> | string,
1592
+ input?: TInput,
1593
+ options?: ExecuteOptions,
1594
+ ): Promise<string>;
1595
+
1596
+ /**
1597
+ * Wait for a previously started execution to complete.
1598
+ */
1599
+ wait<TResult>(
1600
+ executionId: string,
1601
+ options?: { timeout?: number; waitPollIntervalMs?: number },
1602
+ ): Promise<TResult>;
1603
+
1604
+ /**
1605
+ * Deliver a signal payload to a waiting workflow execution.
1606
+ */
1607
+ signal<TPayload>(
1608
+ executionId: string,
1609
+ signal: string | IEventDefinition<TPayload>,
1610
+ payload: TPayload,
1611
+ ): Promise<void>;
1612
+
1613
+ /**
1614
+ * Schedule a one-time task execution.
1615
+ */
1616
+ schedule<TInput>(
1617
+ task: ITask<TInput, Promise<any>, any, any, any, any> | string,
1618
+ input: TInput,
1619
+ options: ScheduleOptions,
1620
+ ): Promise<string>;
1621
+
1622
+ /**
1623
+ * Idempotently create (or update) a recurring schedule (cron/interval).
1624
+ * Safe to call on every boot and concurrently across processes.
1625
+ */
1626
+ ensureSchedule<TInput>(
1627
+ task: ITask<TInput, Promise<any>, any, any, any, any> | string,
1628
+ input: TInput,
1629
+ options: ScheduleOptions & { id: string },
1630
+ ): Promise<string>;
1631
+
1632
+ /**
1633
+ * Recover incomplete executions on startup.
1634
+ */
1635
+ recover(): Promise<void>;
1636
+
1637
+ /**
1638
+ * Start timer polling (called automatically on init).
1639
+ */
1640
+ start(): void;
1641
+
1642
+ /**
1643
+ * Stop timer polling (called on dispose).
1644
+ */
1645
+ stop(): Promise<void>;
1646
+
1647
+ // Schedule management
1648
+ pauseSchedule(scheduleId: string): Promise<void>;
1649
+ resumeSchedule(scheduleId: string): Promise<void>;
1650
+ getSchedule(scheduleId: string): Promise<Schedule | null>;
1651
+ listSchedules(): Promise<Schedule[]>;
1652
+ updateSchedule(
1653
+ scheduleId: string,
1654
+ updates: { cron?: string; interval?: number; input?: unknown },
1655
+ ): Promise<void>;
1656
+ removeSchedule(scheduleId: string): Promise<void>;
1657
+ }
1658
+ ```
1659
+
1660
+ ---
1661
+
1662
+ ## File Structure
1663
+
1664
+ ```
1665
+ src/node/durable/
1666
+ ├── index.ts # Public exports (from `@bluelibs/runner/node`)
1667
+ ├── core/ # Engine (store is the source of truth)
1668
+ │ ├── index.ts
1669
+ │ ├── types.ts
1670
+ │ ├── CronParser.ts
1671
+ │ ├── DurableContext.ts
1672
+ │ ├── DurableService.ts
1673
+ │ ├── DurableWorker.ts
1674
+ │ ├── DurableOperator.ts
1675
+ │ ├── StepBuilder.ts
1676
+ │ └── interfaces/
1677
+ ├── store/
1678
+ │ ├── MemoryStore.ts
1679
+ │ └── RedisStore.ts
1680
+ ├── queue/
1681
+ │ ├── MemoryQueue.ts
1682
+ │ └── RabbitMQQueue.ts
1683
+ ├── bus/
1684
+ │ ├── MemoryEventBus.ts
1685
+ │ ├── NoopEventBus.ts
1686
+ │ └── RedisEventBus.ts
1687
+ └── __tests__/
1688
+ ├── DurableContext.test.ts
1689
+ ├── DurableService.integration.test.ts
1690
+ ├── DurableService.realBackends.integration.test.ts
1691
+ ├── MemoryBackends.test.ts
1692
+ ├── RabbitMQQueue.mock.test.ts
1693
+ ├── RedisEventBus.mock.test.ts
1694
+ └── RedisStore.mock.test.ts
1695
+ ```
1696
+
1697
+ ---
1698
+
1699
+ ## Production Setup with Redis + RabbitMQ
1700
+
1701
+ For production, use Redis for state/pub-sub and RabbitMQ with quorum queues for durable work distribution.
1702
+
1703
+ Install required Node dependencies:
1704
+
1705
+ ```bash
1706
+ npm install ioredis amqplib
1707
+ ```
1708
+
1709
+ ### Quick Start - Production Configuration
1710
+
1711
+ ```typescript
1712
+ import {
1713
+ RedisStore,
1714
+ RedisEventBus,
1715
+ RabbitMQQueue,
1716
+ resources,
1717
+ } from "@bluelibs/runner/node";
1718
+
1719
+ // State storage with Redis
1720
+ const store = new RedisStore({
1721
+ redis: process.env.REDIS_URL || "redis://localhost:6379",
1722
+ prefix: "durable:",
1723
+ });
1724
+
1725
+ // Pub/Sub with Redis
1726
+ const eventBus = new RedisEventBus({
1727
+ redis: process.env.REDIS_URL || "redis://localhost:6379",
1728
+ prefix: "durable:bus:",
1729
+ });
1730
+
1731
+ // Work distribution with RabbitMQ quorum queues
1732
+ const queue = new RabbitMQQueue({
1733
+ url: process.env.RABBITMQ_URL || "amqp://localhost",
1734
+ queue: {
1735
+ name: "durable-executions",
1736
+ quorum: true, // Use quorum queue for durability
1737
+ deadLetter: "durable-dlq", // Dead letter queue for failed messages
1738
+ },
1739
+ prefetch: 10, // Process up to 10 messages concurrently
1740
+ });
1741
+
1742
+ // Create durable resource definition + registration
1743
+ const durable = resources.redisWorkflow.fork("app-durable");
1744
+ const durableRegistration = durable.with({
1745
+ store,
1746
+ eventBus,
1747
+ queue,
1748
+ worker: true, // starts a queue consumer in this process
1749
+ // polling.enabled defaults to true; keep it on for timers/schedules
1750
+ });
1751
+ ```
1752
+
1753
+ If you want API-only nodes to call `start()` / `signal()` / `wait()` **without running the timer poller**, disable polling:
1754
+
1755
+ ```ts
1756
+ const durable = resources.redisWorkflow.fork("app-durable");
1757
+ const durableRegistration = durable.with({
1758
+ store,
1759
+ eventBus,
1760
+ queue,
1761
+ worker: false,
1762
+ polling: { enabled: false },
1763
+ });
1764
+ ```
1765
+
1766
+ Make sure at least one worker process runs with polling enabled, otherwise sleeps/timeouts/schedules will never fire.
1767
+
1768
+ ### RabbitMQ Quorum Queues
1769
+
1770
+ **Why quorum queues?**
1771
+
1772
+ - **Durability** - Messages survive broker restarts
1773
+ - **Replication** - Messages replicated across nodes
1774
+ - **Consistency** - Stronger guarantees than classic mirrored queues
1775
+ - **Dead-letter** - Failed messages go to DLQ for inspection
1776
+
1777
+ ```typescript
1778
+ // queue/RabbitMQQueue.ts
1779
+
1780
+ export interface RabbitMQQueueConfig {
1781
+ url: string;
1782
+ queue: {
1783
+ name: string;
1784
+ quorum?: boolean; // Use quorum queue (default: true)
1785
+ deadLetter?: string; // Dead letter exchange
1786
+ messageTtl?: number; // Message TTL in ms
1787
+ };
1788
+ prefetch?: number; // Consumer prefetch (default: 10)
1789
+ }
1790
+
1791
+ export class RabbitMQQueue implements IDurableQueue {
1792
+ constructor(config: RabbitMQQueueConfig);
1793
+
1794
+ async init(): Promise<void> {
1795
+ // Creates quorum queue with:
1796
+ // - x-queue-type: quorum
1797
+ // - x-dead-letter-exchange: <deadLetter>
1798
+ // - durable: true
1799
+ }
1800
+ }
1801
+ ```
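To make the queue arguments concrete, here is a hedged sketch of how the config above could map to AMQP queue options; `quorumQueueOptions` is a hypothetical helper, not part of the package, though the output shape matches amqplib's `channel.assertQueue(name, options)`:

```typescript
// Illustrative mapping from RabbitMQQueueConfig.queue to AMQP arguments.
// The helper name and exact mapping are assumptions, not package internals.
interface QueueSettings {
  name: string;
  quorum?: boolean; // default: true
  deadLetter?: string;
  messageTtl?: number;
}

function quorumQueueOptions(q: QueueSettings) {
  const args: Record<string, unknown> = {};
  if (q.quorum !== false) args["x-queue-type"] = "quorum";
  if (q.deadLetter) args["x-dead-letter-exchange"] = q.deadLetter;
  if (q.messageTtl !== undefined) args["x-message-ttl"] = q.messageTtl;
  // Shape compatible with amqplib's channel.assertQueue(name, options).
  return { durable: true, arguments: args };
}

const opts = quorumQueueOptions({
  name: "durable-executions",
  quorum: true,
  deadLetter: "durable-dlq",
});
```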
1802
+
1803
+ ### Redis Store Implementation Details
1804
+
1805
+ - **Serialization**: `RedisStore` uses Runner's serializer for persistence. This preserves `Date` objects and other complex types, avoiding "time bombs" where dates become strings after being stored.
1806
+ - **Performance (SCAN vs KEYS)**: All multi-key searches use Redis `SCAN` for non-blocking iteration. This prevents Redis from freezing when thousands of executions are present.
1807
+ - **Concurrency & Atomicity**:
1808
+ - `updateExecution()` uses a Lua script to perform a read/merge/write update atomically.
1809
+ - Execution processing is guarded by `acquireLock()` so only one worker runs an execution attempt at a time.
1810
+ - Signal delivery (`durable.signal`) and signal waits (`durableContext.waitForSignal`) use a per-execution/per-signal lock when supported by the store, to prevent races between "signal arrives" and "wait is being recorded".
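The Lua script matters because, against Redis, the read, merge, and write phases are three separate network round-trips that another worker could interleave with. A hedged in-memory sketch of the merge semantics (hypothetical names; in a single JS process this is atomic by construction):

```typescript
// In-memory sketch of updateExecution's read/merge/write semantics. Against
// Redis the three phases span round-trips, which is why RedisStore pushes the
// same merge into a Lua script. Names here are illustrative.
type ExecutionRecord = { id: string; status: string; attempt: number };

const records = new Map<string, ExecutionRecord>();

function updateExecution(id: string, patch: Partial<ExecutionRecord>): void {
  const current = records.get(id) ?? { id, status: "pending", attempt: 0 }; // read
  records.set(id, { ...current, ...patch }); // merge + write, never a blind overwrite
}

updateExecution("e1", { status: "running" });
updateExecution("e1", { attempt: 1 }); // must not clobber status
```

Without the merge (i.e., with a blind `SET` of the whole record), the second update would silently drop the first one's fields.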
1811
+
1812
+ ### Optimized Client Waiting
1813
+
1814
+ When an `IEventBus` (like `RedisEventBus`) is present, calls to `durable.startAndWait()` or `durable.wait()` use a **reactive event-driven approach**. The service subscribes to completion events for that specific execution ID, resulting in near-instant response times once the workflow finishes, without constant store polling.
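A hedged sketch of the subscribe-then-check pattern this implies, with a plain `EventEmitter` and a `Set` standing in for the event bus and store (names are illustrative, not the package internals):

```typescript
import { EventEmitter } from "node:events";

// Sketch: subscribe to the completion topic first, then do one store check to
// close the race where the workflow finished before we subscribed.
const bus = new EventEmitter();
const completedStore = new Set<string>();

function waitForCompletion(executionId: string): Promise<void> {
  return new Promise((resolve) => {
    const topic = `completed:${executionId}`;
    const onDone = () => resolve();
    bus.once(topic, onDone);
    if (completedStore.has(executionId)) {
      bus.removeListener(topic, onDone); // already finished; no event coming
      resolve();
    }
  });
}

function markCompleted(executionId: string): void {
  completedStore.add(executionId);
  bus.emit(`completed:${executionId}`);
}
```

The single store check after subscribing is what removes the need for interval polling while still handling "finished before we started waiting".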
1815
+
1816
+ ### Horizontal Scaling
1817
+
1818
+ ```mermaid
1819
+ graph TB
1820
+ subgraph Clients[API Servers]
1821
+ A1[API 1]
1822
+ A2[API 2]
1823
+ end
1824
+
1825
+ subgraph RabbitMQ[RabbitMQ Cluster]
1826
+ Q[(Quorum Queue)]
1827
+ DLQ[(Dead Letter Queue)]
1828
+ end
1829
+
1830
+ subgraph Redis[Redis Cluster]
1831
+ RS[(State Store)]
1832
+ RP[(Pub/Sub)]
1833
+ end
1834
+
1835
+ subgraph Workers[Worker Pool - Auto-Scaling]
1836
+ W1[Worker 1]
1837
+ W2[Worker 2]
1838
+ W3[Worker N]
1839
+ end
1840
+
1841
+ A1 -->|enqueue| Q
1842
+ A2 -->|enqueue| Q
1843
+
1844
+ Q -->|consume| W1
1845
+ Q -->|consume| W2
1846
+ Q -->|consume| W3
1847
+
1848
+ W1 <-->|state| RS
1849
+ W2 <-->|state| RS
1850
+ W3 <-->|state| RS
1851
+
1852
+ RP -.->|notify| W1
1853
+ RP -.->|notify| W2
1854
+ RP -.->|notify| W3
1855
+
1856
+ Q -->|failed| DLQ
1857
+ ```
1858
+
1859
+ **Scaling characteristics:**
1860
+
1861
+ - **Workers** - Add more worker instances to increase throughput
1862
+ - **Queue** - RabbitMQ handles work distribution automatically
1863
+ - **State** - All workers share state via Redis
1864
+ - **Events** - Redis pub/sub notifies workers of timer events
1865
+
1866
+ ### Execution Flow with Queue
1867
+
1868
+ ```mermaid
1869
+ sequenceDiagram
1870
+ participant C as Client
1871
+ participant Q as RabbitMQ
1872
+ participant W as Worker
1873
+ participant R as Redis
1874
+
1875
+ C->>R: Create execution record
1876
+ C->>Q: Enqueue execution message
1877
+ C-->>C: Return execution ID
1878
+
1879
+ Note over Q,W: Workers consuming queue
1880
+
1881
+ Q->>W: Deliver message
1882
+ W->>R: Acquire lock on execution
1883
+
1884
+ alt Lock acquired
1885
+ W->>R: Load execution state
1886
+ W->>W: Execute task with DurableContext
1887
+
1888
+ loop For each step
1889
+ W->>R: Check step result cache
1890
+ alt Cache hit
1891
+ R-->>W: Return cached result
1892
+ else Cache miss
1893
+ W->>W: Execute step
1894
+ W->>R: Cache step result
1895
+ end
1896
+ end
1897
+
1898
+ W->>R: Mark execution complete
1899
+ W->>R: Release lock
1900
+ W->>Q: Ack message
1901
+ else Lock not acquired
1902
+ W->>Q: Nack with requeue
1903
+ end
1904
+ ```
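The step-cache loop in the diagram can be sketched in a few lines. This is a minimal illustration, not the package implementation: a cache miss runs the side effect once and records the result; a replay (cache hit) returns the recorded value without re-running anything.

```typescript
// Minimal sketch of the step result cache; names are illustrative.
const stepCache = new Map<string, unknown>();
let sideEffectRuns = 0;

async function step<T>(id: string, fn: () => T | Promise<T>): Promise<T> {
  if (stepCache.has(id)) return stepCache.get(id) as T; // cache hit
  const result = await fn(); // cache miss: execute the side effect
  stepCache.set(id, result); // checkpoint the result
  return result;
}
```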
1905
+
1906
+ ---
1907
+
1908
+ ## Integration with Runner Resources
1909
+
1910
+ The durable module integrates seamlessly with Runner's resource pattern:
1911
+
1912
+ ### As a Dependency
1913
+
1914
+ ```typescript
1915
+ import { r, run } from "@bluelibs/runner";
1916
+ import { resources } from "@bluelibs/runner/node";
1917
+
1918
+ const durable = resources.memoryWorkflow.fork("app-durable");
1919
+ const durableRegistration = durable.with({
1920
+ worker: true, // single-process: also consumes the queue if configured
1921
+ });
1922
+
1923
+ const processOrder = r
1924
+ .task("process-order")
1925
+ .dependencies({ durable })
1926
+ .run(async (input, { durable }) => {
1927
+ const durableContext = durable.use();
1928
+ // ... durable task logic
1929
+ })
1930
+ .build();
1931
+
1932
+ const recoverDurable = r
1933
+ .resource("durable-recover")
1934
+ .dependencies({ durable })
1935
+ .init(async (_cfg, { durable }) => {
1936
+ await durable.recover();
1937
+ })
1938
+ .build();
1939
+
1940
+ const app = r
1941
+ .resource("app")
1942
+ .register([
1943
+ resources.durable,
1944
+ durableRegistration,
1945
+ processOrder,
1946
+ recoverDurable,
1947
+ ])
1948
+ .build();
1949
+ await run(app);
1950
+ ```
1951
+
1952
+ ### Resource Factory Pattern
1953
+
1954
+ Runner resources are definitions built at bootstrap time. Pick the durable resource family up front; don't pass a custom store into `memoryWorkflow`:
1955
+
1956
+ ```typescript
1957
+ const durableRegistration = process.env.REDIS_URL
1958
+ ? resources.redisWorkflow.fork("app-durable").with({
1959
+ redis: { url: process.env.REDIS_URL! },
1960
+ worker: true,
1961
+ })
1962
+ : resources.memoryWorkflow.fork("app-durable").with({
1963
+ worker: true,
1964
+ });
1965
+ ```
1966
+
1967
+ ### Integration with HTTP Exposure
1968
+
1969
+ Expose a starter task or HTTP route that calls `durable.start(...)`. Do not expose the durable workflow task itself through `client.task(...)`, because remote task execution runs outside the durable execution context:
1970
+
1971
+ ```typescript
1972
+ import { createHttpClient, r } from "@bluelibs/runner";
1973
+ import { resources, rpcLanesResource } from "@bluelibs/runner/node";
1974
+
1975
+ const durable = resources.memoryWorkflow.fork("app-durable");
1976
+ const durableRegistration = durable.with({ worker: true });
1977
+
1978
+ const processOrder = r
1979
+ .task("process-order")
1980
+ .dependencies({ durable })
1981
+ .run(async (input, { durable }) => {
1982
+ const durableContext = durable.use();
1983
+ await durableContext.step("process", async () => input.orderId);
1984
+ return { ok: true };
1985
+ })
1986
+ .build();
1987
+
1988
+ const startProcessOrderWorkflow = r
1989
+ .task("start-process-order-workflow")
1990
+ .dependencies({ durable })
1991
+ .run(async (input: { orderId: string }, { durable }) => {
1992
+ const executionId = await durable.start(processOrder, input);
1993
+ return { executionId };
1994
+ })
1995
+ .build();
1996
+
1997
+ const durableLane = r
1998
+ .rpcLane("durable-lane")
1999
+ .applyTo([startProcessOrderWorkflow])
2000
+ .build();
2001
+
2002
+ const topology = r.rpcLane.topology({
2003
+ profiles: { worker: { serve: [durableLane] } },
2004
+ bindings: [{ lane: durableLane, communicator: r.rpcLane.http() }],
2005
+ });
2006
+
2007
+ const app = r
2008
+ .resource("app")
2009
+ .register([
2010
+ resources.durable,
2011
+ durableRegistration,
2012
+ processOrder,
2013
+ startProcessOrderWorkflow,
2014
+ rpcLanesResource.with({
2015
+ profile: "worker",
2016
+ mode: "network",
2017
+ topology,
2018
+ exposure: {
2019
+ http: { basePath: "/__runner", listen: { port: 7070 } },
2020
+ },
2021
+ }),
2022
+ ])
2023
+ .build();
2024
+
2025
+ // Remote clients start the workflow via the starter task
2026
+ const client = createHttpClient({ baseUrl: "http://worker:7070/__runner" });
2027
+ await client.task("start-process-order-workflow", { orderId: "123" });
2028
+ ```
2029
+
2030
+ ## Recovery on Startup
2031
+
2032
+ ```typescript
2033
+ const recoverDurable = r
2034
+ .resource("durable-recover")
2035
+ .dependencies({ durable })
2036
+ .init(async (_cfg, { durable }) => {
2037
+ await durable.recover();
2038
+ })
2039
+ .build();
2040
+
2041
+ const app = r
2042
+ .resource("app")
2043
+ .register([resources.durable, durableRegistration, processOrder, recoverDurable])
2044
+ .build();
2045
+ ```
2046
+
2047
+ The recovery process:
2048
+
2049
+ 1. Load all incomplete executions (status `pending`, `running`, `sleeping`, or `retrying`)
2050
+ 2. For each, re-execute the task within a new DurableContext
2051
+ 3. The task replays through cached steps automatically
2052
+ 4. Execution continues from where it left off
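Step 1 of the process above can be sketched as a simple status filter; the statuses mirror the list, while the function name is illustrative rather than the package API:

```typescript
// Sketch of the recovery scan: select every non-terminal execution for replay.
type ExecutionStatus =
  | "pending" | "running" | "sleeping" | "retrying"
  | "completed" | "failed" | "cancelled";

const INCOMPLETE: ReadonlySet<ExecutionStatus> = new Set<ExecutionStatus>([
  "pending", "running", "sleeping", "retrying",
]);

function selectRecoverable(
  executions: Array<{ id: string; status: ExecutionStatus }>,
): string[] {
  return executions.filter((e) => INCOMPLETE.has(e.status)).map((e) => e.id);
}

const recoverable = selectRecoverable([
  { id: "e1", status: "sleeping" },
  { id: "e2", status: "completed" },
  { id: "e3", status: "retrying" },
]);
```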
2053
+
2054
+ ---
2055
+
2056
+ ## Testing Utilities
2057
+
2058
+ Durable exports a small test harness so you can run workflows with in-memory
2059
+ backends while keeping the `run()` semantics you use in production.
2060
+
2061
+ ```ts
2062
+ import { r, run } from "@bluelibs/runner";
2063
+ import { createDurableTestSetup, resources, waitUntil } from "@bluelibs/runner/node";
2064
+
2065
+ const { durable, durableRegistration, store } = createDurableTestSetup();
2066
+ const Paid = r.event<{ paidAt: number }>("paid").build();
2067
+
2068
+ const task = r
2069
+ .task("spec-durable-wait-for-signal")
2070
+ .dependencies({ durable, Paid })
2071
+ .run(async (_input: undefined, { durable, Paid }) => {
2072
+ const durableContext = durable.use();
2073
+ const payment = await durableContext.waitForSignal(Paid);
2074
+ return { ok: true, paidAt: payment.paidAt };
2075
+ })
2076
+ .build();
2077
+
2078
+ const app = r
2079
+ .resource("spec-app")
2080
+ .register([resources.durable, durableRegistration, Paid, task])
2081
+ .build();
2082
+ const runtime = await run(app);
2083
+ const durableRuntime = runtime.getResourceValue(durable);
2084
+
2085
+ const executionId = await durableRuntime.start(task);
2086
+
2087
+ await waitUntil(
2088
+ async () => (await store.getExecution(executionId))?.status === "sleeping",
2089
+ { timeoutMs: 1000, intervalMs: 5 },
2090
+ );
2091
+
2092
+ await durableRuntime.signal(executionId, Paid, { paidAt: Date.now() });
2093
+ await durableRuntime.wait(executionId);
2094
+
2095
+ await runtime.dispose();
2096
+ ```
2097
+
2098
+ `createDurableTestSetup` uses `MemoryStore`, `MemoryEventBus`, and an optional
2099
+ `MemoryQueue`, so tests stay fast and isolated.
2100
+
2101
+ Tip: Use `stepId` for stability in tests without changing behavior, and use `timeoutMs`
2102
+ when you need an explicit timeout outcome.
2103
+
2104
+ ### Running tests against real backends (Redis + RabbitMQ)
2105
+
2106
+ Runner also ships an integration suite that exercises the durable service with real backends
2107
+ (Redis for store + pub/sub and RabbitMQ for queue). This suite is part of the normal Jest
2108
+ test discovery, but it is **skipped by default** to keep local runs hermetic.
2109
+
2110
+ To enable it, set `DURABLE_INTEGRATION=1` and provide connection URLs (defaults point to localhost):
2111
+
2112
+ ```bash
2113
+ DURABLE_INTEGRATION=1 \
2114
+ DURABLE_TEST_REDIS_URL=redis://127.0.0.1:6379 \
2115
+ DURABLE_TEST_RABBIT_URL=amqp://127.0.0.1:5672 \
2116
+ npm run coverage:ai
2117
+ ```
2118
+
2119
+ ---
2120
+
2121
+ ## Comparison with Previous Design
2122
+
2123
+ | Aspect | Previous Design | New Design |
2124
+ | ------------------- | ----------------------------------------------------------- | ----------------------------------------- |
2125
+ | Components | 8+ (EventManager, WorkflowEngine, TimerManager, Saga, etc.) | 3 (DurableService, DurableContext, Store) |
2126
+ | Files | ~30 | ~12 |
2127
+ | New concepts | Workflows, Sagas, Compensation, DLQ | Just `step()` and `sleep()` |
2128
+ | Changes to core | EventBuilder, TaskBuilder modifications | None - pure node extension |
2129
+ | Learning curve | High | Low |
2130
+ | Implementation time | 12 weeks | 2-3 weeks |
2131
+
2132
+ ## Operator & Observability
2133
+
2134
+ > [!NOTE]
2135
+ > `createDashboardMiddleware` moved out of core and now lives in `@bluelibs/runner-durable-dashboard`.
2136
+
2137
+ ### What is the store?
2138
+
2139
+ The **durable store** (`IDurableStore`) is the persistence layer for durable workflows. It is responsible for saving and loading:
2140
+
2141
+ - executions (id, task id, input, status, attempt/error, timestamps)
2142
+ - step results (memoized outputs for `durableContext.step(...)`)
2143
+ - timers and schedules (for `sleep`, signal timeouts, cron/interval scheduling)
2144
+ - optional audit entries (timeline), and optional operator actions (manual interventions)
2145
+
2146
+ You provide a store implementation when you create the durable resource/service:
2147
+
2148
+ - `MemoryStore` — in-memory, great for local dev/tests (state is lost on restart)
2149
+ - `RedisStore` — Redis-backed, appropriate for production durability
2150
+
2151
+ ### What is `DurableOperator`?
2152
+
2153
+ `DurableOperator` is an **operations/admin helper** around the store. It does not execute workflows; it reads/writes durable state to support external tooling and manual interventions:
2154
+
2155
+ - query executions for listing (filters/pagination)
2156
+ - load execution details (execution + step results + audit)
2157
+ - operator actions: retry rollback, skip steps, force fail, patch a step result
2158
+
2159
+ You can use `DurableOperator` as the backend contract for your own operational UI or APIs.
2160
+
2161
+ ### Audit trail (timeline)
2162
+
2163
+ In addition to `StepResult` records, durable can persist a structured audit trail as the workflow runs:
2164
+
2165
+ - execution status transitions (pending/running/sleeping/retrying/completed/failed/cancelled)
2166
+ - step completions (with durations)
2167
+ - sleep scheduled/completed
2168
+ - signal waiting/delivered/timed-out
2169
+ - user-added notes via `durableContext.note(...)`
2170
+
2171
+ This is implemented via optional `IDurableStore` capabilities:
2172
+
2173
+ - Enable it via `resources.memoryWorkflow.fork("app-durable").with({ audit: { enabled: true }, ... })` or `resources.redisWorkflow.fork("app-durable").with({ audit: { enabled: true }, ... })` (default: off).
2174
+ - `appendAuditEntry(entry)`
2175
+ - `listAuditEntries(executionId)`
2176
+
2177
+ Notes are replay-safe: if the workflow replays after a suspend, the same `durableContext.note(...)` call does not create duplicates.
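The mechanism behind replay safety can be sketched as deterministic, call-order-derived entry ids, so a replayed call writes to the same slot instead of appending a duplicate. This is an illustrative model, not the package internals:

```typescript
// Sketch of replay-safe notes via deterministic call-order ids.
const auditEntries = new Map<string, string>();
let noteCallIndex = 0;

function note(text: string): void {
  const entryId = `note:${noteCallIndex++}`; // deterministic per call order
  auditEntries.set(entryId, text); // idempotent on replay
}

function beginReplay(): void {
  noteCallIndex = 0; // a replay re-walks the workflow from the top
}

note("order shipped");
beginReplay();
note("order shipped"); // replay of the same call: overwrites, not appends
```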
2178
+
2179
+ ### Stream audit entries via Runner events (for mirroring)
2180
+
2181
+ If you want to mirror audit entries to cold storage (S3/Glacier/Postgres), enable:
2182
+
2183
+ - `audit: { enabled: true, emitRunnerEvents: true }`
2184
+
2185
+ Then listen to Runner events (they are excluded from `on("*")` global hooks by default, so subscribe explicitly):
2186
+
2187
+ ```ts
2188
+ import { r } from "@bluelibs/runner";
2189
+ import { durableEvents } from "@bluelibs/runner/node";
2190
+
2191
+ const mirrorAudit = r
2192
+ .hook("app.hooks.durableAuditMirror")
2193
+ .on(durableEvents.audit.appended)
2194
+ .run(async (event) => {
2195
+ const { entry } = event.data;
2196
+ // write entry to your cold store (idempotent by entry.id)
2197
+ })
2198
+ .build();
2199
+ ```
2200
+
2201
+ ---
2202
+
2203
+ ## Gotchas & Troubleshooting
2204
+
2205
+ - **Always put side effects inside `durableContext.step(...)`**: anything outside a step can run multiple times on retries/replays.
2206
+ - **Keep step ids stable**: renaming a step id (or changing control-flow so a different call order happens) can break replay determinism for existing executions.
2207
+ - **Call-order indexing is real**: `emit()` and repeated `waitForSignal()` allocate `:<index>` internally based on call order; refactors that add/remove calls can shift indexes.
2208
+ - **Signals are "deliver to current wait"**: `durableService.signal(executionId, ...)` delivers to the base signal slot if that slot has not completed yet (so the first signal can be buffered even before the workflow reaches its wait). Additional signals are delivered only to subsequent indexed waits; any others are ignored.
2209
+ - **Don't hang forever**: prefer `durableService.wait(executionId, { timeout: ... })` unless you intentionally want an unbounded wait.
2210
+ - **Compensation failures are terminal**: if `durableContext.rollback()` fails, execution becomes `compensation_failed` and `wait()` rejects. Use `DurableOperator.retryRollback(executionId)` after fixing the underlying issue.
2211
+ - **Intervals can overlap**: interval schedules are currently measured from kickoff time, not completion time. If you need non-overlapping behavior, implement it via `durableContext.sleep()` inside the workflow.
2212
+ - **Debugging**: inspect step results + timers via `DurableOperator`/store queries (Redis keys are prefixed by `durable:` by default).
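For the interval-overlap gotcha, the suggested workaround looks roughly like this: express the "interval" inside the workflow so the next round is measured from completion, not kickoff. A plain timer stands in for `durableContext.sleep` here; names are illustrative:

```typescript
// Sketch of a non-overlapping recurring workflow body.
const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

let inFlight = 0;
let overlapped = false;

async function nonOverlappingLoop(rounds: number, gapMs: number): Promise<void> {
  for (let i = 0; i < rounds; i++) {
    inFlight++;
    if (inFlight > 1) overlapped = true; // would indicate overlapping rounds
    await sleep(1); // the "work" for this round
    inFlight--;
    await sleep(gapMs); // gap measured from completion of the work
  }
}
```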
2213
+
2214
+ ## Idempotency & Deduplication
2215
+
2216
+ There are two different "idempotency" problems:
2217
+
2218
+ 1. **Workflow-level deduplication (start only once)**
2219
+
2220
+ - `start(task, input, { idempotencyKey })` supports a store-backed **"start-or-get"** mode.
2221
+ - It returns the same `executionId` for the same `{ taskId, idempotencyKey }` pair, even if multiple callers race.
2222
+ - Important: subsequent calls return the existing `executionId` and do **not** overwrite the originally stored `input`.
2223
+ - Store support: `MemoryStore` and `RedisStore` implement this. Custom stores must implement `getExecutionIdByIdempotencyKey` / `setExecutionIdByIdempotencyKey`.
2224
+ - You should still persist the returned `executionId` in your domain model for observability and to make webhook handling trivial.
2225
+
2226
+ 2. **Schedule-level deduplication (create schedule only once)**
2227
+
2228
+ - Use `ensureSchedule(...)` with a stable `id`. It is designed to be safe to call on every boot and concurrently across processes.
2229
+
2230
+ If you need workflow-level dedupe by business key (for example `orderId`), use it as the `idempotencyKey` (for example `order:${orderId}`), and store the returned `executionId` on the record as well.
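The start-or-get semantics can be sketched with a map keyed by `{ taskId, idempotencyKey }`; this in-memory model (hypothetical names) is single-threaded, which is the guarantee the Lua-backed Redis implementation provides under racing callers:

```typescript
// Sketch of store-backed "start-or-get" deduplication: the first caller
// creates the execution, later callers get the same executionId back, and
// the originally stored input is never overwritten.
const executionIdByKey = new Map<string, string>();
let nextExecution = 0;

function startOrGet(taskId: string, idempotencyKey: string): string {
  const key = `${taskId}:${idempotencyKey}`;
  const existing = executionIdByKey.get(key);
  if (existing !== undefined) return existing;
  const executionId = `exec-${nextExecution++}`;
  executionIdByKey.set(key, executionId);
  return executionId;
}

const firstStart = startOrGet("process-order", "order:123");
const repeatStart = startOrGet("process-order", "order:123");
const otherOrder = startOrGet("process-order", "order:456");
```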
2231
+
2232
+ ## Cancellation (and why it's tricky)
2233
+
2234
+ Durable exposes a first-class cancellation API:
2235
+
2236
+ - `durableService.cancelExecution(executionId, reason?)`
2237
+
2238
+ Semantics:
2239
+
2240
+ - Cancellation is **cooperative**, not preemptive: Node cannot reliably interrupt arbitrary async work.
2241
+ - Cancelling marks the execution as terminal (`cancelled`), unblocks `wait()` / `startAndWait()`, and prevents future resumes (timers/signals won't continue it).
2242
+ - Already-running code will only stop at the next durable checkpoint (for example the next `durableContext.step(...)`, `durableContext.sleep(...)`, `durableContext.waitForSignal(...)`, or `durableContext.emit(...)`).
2243
+
2244
+ Administrative alternatives still exist:
2245
+
2246
+ - `DurableOperator.forceFail(executionId)` is a blunt instrument to stop and mark `failed`.
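Cooperative cancellation can be modeled as a guard that runs at each durable checkpoint (`step`/`sleep`/`waitForSignal`/`emit`), while code between checkpoints always runs to completion. A hedged sketch with illustrative names, not the package API:

```typescript
// Sketch of cooperative cancellation: cancellation is observed only at
// checkpoints, never mid-flight.
const cancelledExecutions = new Set<string>();

function checkpoint(executionId: string): void {
  if (cancelledExecutions.has(executionId)) {
    throw new Error(`execution ${executionId} was cancelled`);
  }
}

async function runWorkflow(
  executionId: string,
  steps: Array<() => void>,
): Promise<string> {
  for (const s of steps) {
    checkpoint(executionId); // cancellation is only observed here
    s(); // in-flight work between checkpoints is never interrupted
  }
  return "completed";
}
```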
2247
+
2248
+ ## What This Design Deliberately Excludes
2249
+
2250
+ 1. **Exactly-once external side effects** – The system provides at-least-once execution with effectively-once steps; true exactly-once semantics at the boundary (e.g., payment processors) are left to idempotent APIs and application logic.
2251
+ 2. **Event sourcing** – Steps are modeled as checkpoints, not a full event stream. This keeps the model simple.
2252
+ 3. **Automatic saga orchestration DSLs** – There is no separate workflow language or visual designer. Compensation is regular TypeScript code using `try/catch` and `durableContext.step`.
2253
+ 4. **Built-in dashboards** – not included in core; observability UIs are intentionally external to the runtime package.
2254
+ 5. **Cross-region or multi-tenant sharding logic** – Multi-region replication and advanced topology concerns are out of scope for v1.
2255
+
2256
+ Also intentionally minimal in v1:
+
+ 6. **Preemptive cancellation** – cancellation is cooperative (checkpoints), not an interrupt/kill mechanism for arbitrary in-flight async work.
+ 7. **Advanced visibility indexes** – `listExecutions` is operator-oriented and not a full-blown search/indexing system.
+ 8. **Cron timezone & misfire policies** – cron is evaluated using the process environment defaults; DST/timezone/misfire handling is not configurable yet.
2257
+
2258
+ These can all be added in future versions if needed, without changing the core `DurableContext` and `DurableService` APIs.
2259
+
2260
+ ---
2261
+
2262
+ ## Why This is Better
2263
+
2264
+ 1. **Fits Runner's philosophy** - No new concepts, just enhanced tasks
2265
+ 2. **No magic** - What you see is what you get
2266
+ 3. **Explicit over implicit** - Compensation is code, not configuration
2267
+ 4. **Simple mental model** - `step()` = checkpoint, that's it
2268
+ 5. **Easy to understand** - Read the code, know what happens
2269
+ 6. **Easy to test** - MemoryStore for tests, no external dependencies
2270
+ 7. **Easy to debug** - Each step is recorded, replay is deterministic