@bluelibs/runner-dev 5.3.0 → 6.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (148)
  1. package/AI.md +25 -3
  2. package/README.md +190 -55
  3. package/dist/cli/generators/artifact.js +2 -14
  4. package/dist/cli/generators/artifact.js.map +1 -1
  5. package/dist/cli/generators/common.d.ts +1 -0
  6. package/dist/cli/generators/common.js +22 -0
  7. package/dist/cli/generators/common.js.map +1 -1
  8. package/dist/cli/generators/printNewHelp.js +2 -2
  9. package/dist/cli/generators/printNewHelp.js.map +1 -1
  10. package/dist/cli/generators/scaffold/templates/package.json.d.ts +2 -2
  11. package/dist/cli/generators/scaffold/templates/package.json.js +2 -2
  12. package/dist/cli/generators/scaffold/templates/src/main.ts.js +7 -9
  13. package/dist/cli/generators/scaffold/templates/src/main.ts.js.map +1 -1
  14. package/dist/cli/generators/scaffold.js +1 -135
  15. package/dist/cli/generators/scaffold.js.map +1 -1
  16. package/dist/cli/generators/templates.js +64 -63
  17. package/dist/cli/generators/templates.js.map +1 -1
  18. package/dist/generated/resolvers-types.d.ts +376 -144
  19. package/dist/index.d.ts +39 -43
  20. package/dist/resources/cli.config.resource.d.ts +1 -1
  21. package/dist/resources/cli.config.resource.js +2 -2
  22. package/dist/resources/cli.config.resource.js.map +1 -1
  23. package/dist/resources/coverage.resource.d.ts +2 -2
  24. package/dist/resources/coverage.resource.js +3 -3
  25. package/dist/resources/coverage.resource.js.map +1 -1
  26. package/dist/resources/dev.resource.d.ts +1 -1
  27. package/dist/resources/dev.resource.js +2 -2
  28. package/dist/resources/dev.resource.js.map +1 -1
  29. package/dist/resources/docs.generator.resource.d.ts +4 -4
  30. package/dist/resources/docs.generator.resource.js +2 -2
  31. package/dist/resources/docs.generator.resource.js.map +1 -1
  32. package/dist/resources/graphql-accumulator.resource.d.ts +2 -2
  33. package/dist/resources/graphql-accumulator.resource.js +6 -3
  34. package/dist/resources/graphql-accumulator.resource.js.map +1 -1
  35. package/dist/resources/graphql.cli.resource.d.ts +1 -1
  36. package/dist/resources/graphql.cli.resource.js +2 -2
  37. package/dist/resources/graphql.cli.resource.js.map +1 -1
  38. package/dist/resources/graphql.query.cli.task.d.ts +14 -16
  39. package/dist/resources/graphql.query.cli.task.js +3 -3
  40. package/dist/resources/graphql.query.cli.task.js.map +1 -1
  41. package/dist/resources/graphql.query.task.d.ts +18 -20
  42. package/dist/resources/graphql.query.task.js +4 -4
  43. package/dist/resources/graphql.query.task.js.map +1 -1
  44. package/dist/resources/http.tag.d.ts +1 -1
  45. package/dist/resources/http.tag.js +2 -2
  46. package/dist/resources/http.tag.js.map +1 -1
  47. package/dist/resources/introspector.cli.resource.d.ts +2 -2
  48. package/dist/resources/introspector.cli.resource.js +14 -6
  49. package/dist/resources/introspector.cli.resource.js.map +1 -1
  50. package/dist/resources/introspector.resource.d.ts +3 -3
  51. package/dist/resources/introspector.resource.js +4 -5
  52. package/dist/resources/introspector.resource.js.map +1 -1
  53. package/dist/resources/live.resource.d.ts +4 -6
  54. package/dist/resources/live.resource.js +38 -25
  55. package/dist/resources/live.resource.js.map +1 -1
  56. package/dist/resources/models/Introspector.d.ts +28 -14
  57. package/dist/resources/models/Introspector.js +334 -161
  58. package/dist/resources/models/Introspector.js.map +1 -1
  59. package/dist/resources/models/durable.runtime.js +36 -10
  60. package/dist/resources/models/durable.runtime.js.map +1 -1
  61. package/dist/resources/models/durable.tools.d.ts +1 -1
  62. package/dist/resources/models/durable.tools.js +6 -3
  63. package/dist/resources/models/durable.tools.js.map +1 -1
  64. package/dist/resources/models/initializeFromStore.js +54 -21
  65. package/dist/resources/models/initializeFromStore.js.map +1 -1
  66. package/dist/resources/models/initializeFromStore.utils.d.ts +7 -6
  67. package/dist/resources/models/initializeFromStore.utils.js +302 -25
  68. package/dist/resources/models/initializeFromStore.utils.js.map +1 -1
  69. package/dist/resources/models/introspector.tools.js +18 -6
  70. package/dist/resources/models/introspector.tools.js.map +1 -1
  71. package/dist/resources/routeHandlers/getDocsData.d.ts +4 -0
  72. package/dist/resources/routeHandlers/getDocsData.js +28 -0
  73. package/dist/resources/routeHandlers/getDocsData.js.map +1 -1
  74. package/dist/resources/routeHandlers/registerHttpRoutes.hook.d.ts +26 -25
  75. package/dist/resources/routeHandlers/registerHttpRoutes.hook.js +10 -9
  76. package/dist/resources/routeHandlers/registerHttpRoutes.hook.js.map +1 -1
  77. package/dist/resources/server.resource.d.ts +20 -22
  78. package/dist/resources/server.resource.js +6 -6
  79. package/dist/resources/server.resource.js.map +1 -1
  80. package/dist/resources/swap.cli.resource.d.ts +4 -4
  81. package/dist/resources/swap.cli.resource.js +2 -2
  82. package/dist/resources/swap.cli.resource.js.map +1 -1
  83. package/dist/resources/swap.resource.d.ts +7 -7
  84. package/dist/resources/swap.resource.js +188 -38
  85. package/dist/resources/swap.resource.js.map +1 -1
  86. package/dist/resources/swap.tools.d.ts +3 -2
  87. package/dist/resources/swap.tools.js +27 -27
  88. package/dist/resources/swap.tools.js.map +1 -1
  89. package/dist/resources/telemetry.resource.d.ts +1 -1
  90. package/dist/resources/telemetry.resource.js +46 -43
  91. package/dist/resources/telemetry.resource.js.map +1 -1
  92. package/dist/runner-compat.d.ts +85 -0
  93. package/dist/runner-compat.js +178 -0
  94. package/dist/runner-compat.js.map +1 -0
  95. package/dist/runner-node-compat.d.ts +2 -0
  96. package/dist/runner-node-compat.js +28 -0
  97. package/dist/runner-node-compat.js.map +1 -0
  98. package/dist/schema/index.js +4 -8
  99. package/dist/schema/index.js.map +1 -1
  100. package/dist/schema/model.d.ts +80 -23
  101. package/dist/schema/model.js.map +1 -1
  102. package/dist/schema/query.js +2 -1
  103. package/dist/schema/query.js.map +1 -1
  104. package/dist/schema/types/AllType.js +6 -3
  105. package/dist/schema/types/AllType.js.map +1 -1
  106. package/dist/schema/types/BaseElementCommon.js +2 -2
  107. package/dist/schema/types/ErrorType.js +1 -1
  108. package/dist/schema/types/ErrorType.js.map +1 -1
  109. package/dist/schema/types/EventType.js +19 -2
  110. package/dist/schema/types/EventType.js.map +1 -1
  111. package/dist/schema/types/LaneSummaryTypes.d.ts +3 -0
  112. package/dist/schema/types/LaneSummaryTypes.js +19 -0
  113. package/dist/schema/types/LaneSummaryTypes.js.map +1 -0
  114. package/dist/schema/types/LiveType.js +67 -0
  115. package/dist/schema/types/LiveType.js.map +1 -1
  116. package/dist/schema/types/ResourceType.js +100 -19
  117. package/dist/schema/types/ResourceType.js.map +1 -1
  118. package/dist/schema/types/RunOptionsType.js +41 -5
  119. package/dist/schema/types/RunOptionsType.js.map +1 -1
  120. package/dist/schema/types/TagType.js +35 -4
  121. package/dist/schema/types/TagType.js.map +1 -1
  122. package/dist/schema/types/TaskType.js +5 -0
  123. package/dist/schema/types/TaskType.js.map +1 -1
  124. package/dist/schema/types/index.d.ts +2 -2
  125. package/dist/schema/types/index.js +6 -7
  126. package/dist/schema/types/index.js.map +1 -1
  127. package/dist/schema/types/middleware/common.d.ts +3 -2
  128. package/dist/schema/types/middleware/common.js +19 -13
  129. package/dist/schema/types/middleware/common.js.map +1 -1
  130. package/dist/ui/.vite/manifest.json +2 -2
  131. package/dist/ui/assets/docs-Btkv97Ls.js +302 -0
  132. package/dist/ui/assets/docs-Btkv97Ls.js.map +1 -0
  133. package/dist/ui/assets/docs-CipvKUxZ.css +1 -0
  134. package/dist/utils/lane-resources.d.ts +55 -0
  135. package/dist/utils/lane-resources.js +143 -0
  136. package/dist/utils/lane-resources.js.map +1 -0
  137. package/dist/utils/zod.js +36 -3
  138. package/dist/utils/zod.js.map +1 -1
  139. package/dist/version.d.ts +1 -1
  140. package/dist/version.js +1 -1
  141. package/package.json +4 -4
  142. package/readmes/runner-AI.md +740 -0
  143. package/readmes/runner-durable-workflows.md +2247 -0
  144. package/readmes/runner-full-guide.md +5869 -0
  145. package/readmes/runner-remote-lanes.md +909 -0
  146. package/dist/ui/assets/docs-BhRuaJ5l.css +0 -1
  147. package/dist/ui/assets/docs-H4oDZj7p.js +0 -302
  148. package/dist/ui/assets/docs-H4oDZj7p.js.map +0 -1
@@ -0,0 +1,2247 @@
1
+ # Durable Workflows (Node-only) — Architecture v2
2
+
3
+ ← [Back to main README](../README.md)
4
+
5
+ ---
6
+
7
+ > Durable workflows are Runner tasks with "save points". If your process dies, deploys, or scales horizontally, the workflow comes back and continues like nothing happened (except now you can finally sleep at night).
8
+
9
+ ## Table of Contents
10
+
11
+ - [Start Here](#start-here)
12
+ - [Quickstart](#quickstart)
13
+ - [Tagging Workflows for Discovery](#tagging-workflows-for-discovery-required)
14
+ - [Why You'd Want This (In One Minute)](#why-youd-want-this-in-one-minute)
15
+ - [Core Insight](#core-insight)
16
+ - [Abstract Interfaces](#abstract-interfaces)
17
+ - [API Design](#api-design)
18
+ - [Safety & Semantics](#safety--semantics)
19
+ - [Signals (wait for external events)](#signals-wait-for-external-events)
20
+ - [Testing Utilities](#testing-utilities)
21
+ - [Compensation / Rollback Pattern](#compensation--rollback-pattern)
22
+ - [Branching with durableContext.switch()](#branching-with-durablecontextswitch)
23
+ - [Describing a Flow (Static Shape Export)](#describing-a-flow-static-shape-export)
24
+ - [Scheduling & Cron Jobs](#scheduling--cron-jobs)
25
+ - [Gotchas & Troubleshooting](#gotchas--troubleshooting)
26
+
27
+ ## Start Here
28
+
29
+ - If you want the short version: `readmes/DURABLE_WORKFLOWS_AI.md`
30
+ - If you're new to Runner concepts (tasks/resources/events/middleware): `readmes/AI.md`
31
+ - Platform note (why this is Node-only): `readmes/MULTI_PLATFORM.md`
32
+
33
+ ## Quickstart
34
+
35
+ ### 0) Create durable support + a durable backend
36
+
37
+ The recommended integration is:
38
+
39
+ - register `resources.durable` once for durable tags/events support
40
+ - fork a concrete durable backend (`resources.memoryWorkflow` / `resources.redisWorkflow`)
41
+
42
+ The concrete durable backend:
43
+
44
+ - Executes Runner tasks via DI (`taskRunner.run(...)`).
45
+ - Provides a **per-resource** durable context, accessed via `durable.use()`.
46
+ - Optionally embeds a worker (`worker: true`) to consume the queue in that process.
47
+
48
+ ### 1) Define a durable task (steps + sleep + signal)
49
+
50
+ ```ts
51
+ import { event, r, run } from "@bluelibs/runner";
52
+ import { resources } from "@bluelibs/runner/node";
53
+
54
+ const Approved = event<{ approvedBy: string }>({ id: "app.signals.approved" });
55
+
56
+ const durable = resources.memoryWorkflow.fork("app-durable");
57
+
58
+ const durableRegistration = durable.with({
59
+ worker: true, // single-process dev/tests
60
+ });
61
+
62
+ const approveOrder = r
63
+ .task("app.tasks.approveOrder")
64
+ .dependencies({ durable })
65
+ .run(async (input: { orderId: string }, { durable }) => {
66
+ const durableContext = durable.use();
67
+
68
+ await durableContext.step("validate", async () => {
69
+ // fetch order, validate invariants, etc.
70
+ return { ok: true };
71
+ });
72
+
73
+ const outcome = await durableContext.waitForSignal(Approved, {
74
+ timeoutMs: 86_400_000,
75
+ });
76
+ if (outcome.kind === "timeout") {
77
+ return { status: "timed_out" };
78
+ }
79
+
80
+ await durableContext.step("ship", async () => {
81
+ // ship only after approval
82
+ return { shipped: true };
83
+ });
84
+
85
+ return {
86
+ status: "approved",
87
+ approvedBy: outcome.payload.approvedBy,
88
+ };
89
+ })
90
+ .build();
91
+
92
+ const app = r
93
+ .resource("app")
94
+ .register([resources.durable, durableRegistration, approveOrder])
95
+ .build();
96
+
97
+ await run(app, { logs: { printThreshold: null } });
98
+ ```
99
+
100
+ ## Tagging Workflows for Discovery (Required)
101
+
102
+ Durable workflows are regular Runner tasks, but **must be tagged with `tags.durableWorkflow`**
103
+ to make them discoverable at runtime. Always add this tag to your workflow tasks:
104
+
105
+ ```ts
106
+ import { r } from "@bluelibs/runner";
107
+ import { resources, tags } from "@bluelibs/runner/node";
108
+
109
+ const durable = resources.memoryWorkflow.fork("app-durable");
110
+
111
+ const onboarding = r
112
+ .task("app.workflows.onboarding")
113
+ .dependencies({ durable })
114
+ .tags([
115
+ tags.durableWorkflow.with({
116
+ category: "users",
117
+ defaults: { invitedBy: "system" },
118
+ }),
119
+ ])
120
+ .run(async (_input, { durable }) => {
121
+ const durableContext = durable.use();
122
+ await durableContext.step("create-user", async () => ({ ok: true }));
123
+ return { ok: true };
124
+ })
125
+ .build();
126
+
127
+ // later, after run(...)
128
+ // const durableRuntime = runtime.getResourceValue(durable);
129
+ // const workflows = durableRuntime.getWorkflows();
130
+ ```
131
+
132
+ `tags.durableWorkflow` is **required** — workflows without this tag will not be discoverable
133
+ via `getWorkflows()`. Register `resources.durable` once in the app so the durable tag
134
+ definition and durable events are available at runtime.
135
+
136
+ `tags.durableWorkflow` is discovery metadata only. The unified response envelope
137
+ is produced by `durable.startAndWait(...)`:
138
+ `{ durable: { executionId }, data }`.
139
+
140
+ `tags.durableWorkflow` also supports optional `defaults` used by
141
+ `durable.describe(task)` **only when no explicit describe input is provided**.
142
+ This does not affect `start()`, `startAndWait()`, `schedule()`, or `ensureSchedule()`.
143
+
144
+ ### Starting Durable Workflows From Resource Dependencies (HTTP route)
145
+
146
+ Tagging makes workflow tasks discoverable, but never runs them. Execution is explicit:
147
+ start with `durable.start(...)` (fire-and-track) or
148
+ `durable.startAndWait(...)` (start-and-wait).
149
+
150
+ ```ts
151
+ import express from "express";
152
+ import { r, run } from "@bluelibs/runner";
153
+ import { resources, tags } from "@bluelibs/runner/node";
154
+
155
+ const durable = resources.memoryWorkflow.fork("app-durable");
156
+
157
+ const approveOrder = r
158
+ .task("app.workflows.approveOrder")
159
+ .dependencies({ durable })
160
+ .tags([tags.durableWorkflow.with({ category: "orders" })])
161
+ .run(async (input: { orderId: string }, { durable }) => {
162
+ const durableContext = durable.use();
163
+ await durableContext.step("approve", async () => ({ approved: true }));
164
+ return { orderId: input.orderId, status: "approved" as const };
165
+ })
166
+ .build();
167
+
168
+ const api = r
169
+ .resource("app.api")
170
+ .register([resources.durable, durable.with({ worker: false }), approveOrder])
171
+ .dependencies({ durable, approveOrder })
172
+ .init(async (_cfg, { durable, approveOrder }) => {
173
+ const app = express();
174
+ app.use(express.json());
175
+
176
+ app.post("/orders/:id/approve", async (req, res) => {
177
+ const executionId = await durable.start(approveOrder, {
178
+ orderId: req.params.id,
179
+ });
180
+
181
+ res.status(202).json({ executionId });
182
+ });
183
+
184
+ app.listen(3000);
185
+ })
186
+ .build();
187
+
188
+ await run(api);
189
+ ```
190
+
191
+ ### Production wiring (Redis + RabbitMQ)
192
+
193
+ For production, swap the in-memory backends:
194
+
195
+ ```ts
196
+ import { resources } from "@bluelibs/runner/node";
197
+
198
+ const durable = resources.redisWorkflow.fork("app-durable");
199
+
200
+ const durableRegistration = durable.with({
201
+ redis: { url: process.env.REDIS_URL! },
202
+ queue: { url: process.env.RABBITMQ_URL! },
203
+ worker: true,
204
+ });
205
+ ```
206
+
207
+ Isolation note: `resources.redisWorkflow` derives Redis key prefixes, pub/sub prefixes, and default queue names from the durable resource id (the value you pass to `.fork("...")`). Use different ids (or set `{ namespace }`) to run multiple durable "apps" safely on the same Redis/RabbitMQ.
208
+
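For example, two independent durable apps can safely share one Redis/RabbitMQ as long as their fork ids differ. A configuration sketch (the ids and env var names here are illustrative, not prescribed):

```ts
import { resources } from "@bluelibs/runner/node";

// Two independent durable "apps" sharing the same infrastructure.
// Distinct fork ids give each its own key prefixes, pub/sub prefixes,
// and default queue names, so they never collide.
const billingDurable = resources.redisWorkflow.fork("billing-durable");
const crmDurable = resources.redisWorkflow.fork("crm-durable");

const billingRegistration = billingDurable.with({
  redis: { url: process.env.REDIS_URL! },
  queue: { url: process.env.RABBITMQ_URL! },
  worker: true,
});

const crmRegistration = crmDurable.with({
  redis: { url: process.env.REDIS_URL! },
  queue: { url: process.env.RABBITMQ_URL! },
  worker: true,
});
```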
209
+ API nodes typically **disable polling and the embedded worker**:
210
+
211
+ ```ts
212
+ const durable = resources.redisWorkflow.fork("app-durable");
213
+ const durableRegistration = durable.with({
214
+ redis: { url: process.env.REDIS_URL! },
215
+ queue: { url: process.env.RABBITMQ_URL! },
216
+ worker: false,
217
+ polling: { enabled: false },
218
+ });
219
+ ```
220
+
221
+ In a typical deployment:
222
+
223
+ - API nodes call `start()` / `signal()` / `wait()`.
224
+ - Worker nodes run the durable resource with `worker: true`.
225
+
226
+ ### Scaling in production (recommended topology)
227
+
228
+ Durable workflows are designed to scale **horizontally**.
229
+ The core idea is: **the store is the source of truth**, and the queue distributes work.
230
+
231
+ **Recommended split:**
232
+
233
+ - **API nodes** (stateless): accept HTTP/webhooks, call `start()` / `signal()` / `wait()`.
234
+ - **Worker nodes** (scalable): consume the durable queue and run executions.
235
+
236
+ **API node config (no background work):**
237
+
238
+ ```ts
239
+ const durable = resources.redisWorkflow.fork("app-durable");
240
+ const durableRegistration = durable.with({
241
+ redis: { url: process.env.REDIS_URL! },
242
+ queue: { url: process.env.RABBITMQ_URL! },
243
+ worker: false,
244
+ polling: { enabled: false },
245
+ });
246
+ ```
247
+
248
+ **Worker node config (does background work):**
249
+
250
+ ```ts
251
+ const durable = resources.redisWorkflow.fork("app-durable");
252
+ const durableRegistration = durable.with({
253
+ redis: { url: process.env.REDIS_URL! },
254
+ queue: { url: process.env.RABBITMQ_URL! },
255
+ worker: true,
256
+ polling: { enabled: true, interval: 1000 },
257
+ });
258
+ ```
259
+
260
+ **How it scales:**
261
+
262
+ - Increase worker replicas: each one consumes from the queue, so throughput scales with workers.
263
+ - Crash/redeploy safety: a worker can die at any time; the next worker resumes from the last checkpoint.
264
+ - Multi-worker correctness: executions/steps are coordinated through the store, not through in-memory state.
265
+
266
+ **Timers, sleeps, and schedules (important):**
267
+
268
+ Timers (used by `durableContext.sleep(...)`, signal timeouts, and scheduling) are driven by the durable polling loop.
269
+ In multi-process setups you typically either:
270
+
271
+ - run a **single poller** (one worker replica with `polling.enabled: true`), or
272
+ - use a store implementation that provides **atomic timer claiming** so multiple pollers are safe.
273
+
274
+ If you enable polling in multiple processes without atomic claiming, you may get duplicate resume attempts.
275
+ This is still designed to be safe (at-least-once), but it can increase load/noise.
276
+
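The atomic-claiming contract mentioned above can be sketched as a toy, single-process `claimTimer` (a Redis-backed store would do this with an atomic set-if-absent; the class and field names here are illustrative):

```ts
// Toy illustration of the claimTimer contract: the first poller wins the
// claim, and other pollers are rejected until the claim's TTL expires.
// Single-process only — a real store must make the check-and-set atomic.
interface TimerClaim {
  workerId: string;
  expiresAt: number;
}

class ToyTimerClaims {
  private claims = new Map<string, TimerClaim>();

  claimTimer(timerId: string, workerId: string, ttlMs: number): boolean {
    const existing = this.claims.get(timerId);
    const now = Date.now();
    if (existing && existing.expiresAt > now) {
      // A live claim exists: only its holder may proceed; other pollers skip.
      return existing.workerId === workerId;
    }
    this.claims.set(timerId, { workerId, expiresAt: now + ttlMs });
    return true;
  }
}
```

With such a method available on the store, every worker replica can enable polling safely: only the claim winner fires a given timer, and the TTL lets another worker take over if the winner dies mid-fire.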
277
+ ### 2) Start an execution (store the executionId)
278
+
279
+ ```ts
280
+ const executionId = await durable.start(approveOrder, {
281
+ orderId: "order-123",
282
+ });
283
+ // store executionId on the order record so your webhook can resume the workflow later
284
+ ```
285
+
286
+ ### Reading status later (no double-sync required)
287
+
288
+ If you store the `executionId` in your main database (e.g. `orders.durable_execution_id`), you can fetch live workflow status on demand from the durable store.
289
+ This avoids mirroring every durable transition into Postgres.
290
+
291
+ ```ts
292
+ import { DurableOperator, RedisStore } from "@bluelibs/runner/node";
293
+
294
+ const durableStorePrefix = process.env.DURABLE_STORE_PREFIX!; // same value used by your durable runtime config
295
+
296
+ // Read-only store client for status lookups (same redis url + prefix)
297
+ const store = new RedisStore({
298
+ redis: process.env.REDIS_URL!,
299
+ prefix: durableStorePrefix,
300
+ });
301
+
302
+ // Minimal: just the execution row (status/result/error)
303
+ const execution = await store.getExecution(executionId);
304
+
305
+ // Rich: execution + steps + audit (dashboard-like view)
306
+ const operator = new DurableOperator(store);
307
+ const detail = await operator.getExecutionDetail(executionId);
308
+ ```
309
+
310
+ Keep the durable store prefix in one shared config module and reuse it for both workflow runtime wiring and read-only status lookups.
311
+
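One way to do that is a tiny shared module that both sides import (a sketch; the file and constant names are just a convention, not a Runner API):

```ts
// durable-config.ts — single source of truth for durable store settings.
// Both the workflow runtime wiring and the read-only status lookups import
// from here, so the prefix can never drift between the two.
export const DURABLE_REDIS_URL = process.env.REDIS_URL!;
export const DURABLE_STORE_PREFIX =
  process.env.DURABLE_STORE_PREFIX ?? "app-durable";
```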
312
+ If you already have the durable resource instance (dependency injection), you can use the operator API directly:
313
+
314
+ ```ts
315
+ const detail = await durable.operator.getExecutionDetail(executionId);
316
+ ```
317
+
318
+ ### 3) Resume from the outside (webhook / callback)
319
+
320
+ ```ts
321
+ await durable.signal(executionId, Approved, { approvedBy: "admin@company.com" });
322
+ const result = await durable.wait(executionId, { timeout: 30_000 });
323
+ ```
324
+
325
+ ## Why You'd Want This (In One Minute)
326
+
327
+ - Your workflow needs to span time: minutes, hours, days (payments, shipping, approvals).
328
+ - You want deterministic retries without duplicating side-effects (charge twice, email twice, etc.).
329
+ - You want horizontal scaling without "who owns this in-memory timeout?" problems.
330
+ - You want explicit, type-safe "outside world pokes the workflow" via signals.
331
+
332
+ ## Core Insight
333
+
334
+ The key insight (Temporal/Inngest-style) is that workflows are just functions with checkpoints. We provide a `DurableContext` that gives tasks:
335
+
336
+ 1. **`step(id, fn)`** - Execute a function once, cache the result, return cached on replay
337
+ 2. **`sleep(ms)`** - Durable sleep that survives process restarts
338
+ 3. **`emit(event, data)`** - Publish a best-effort notification, de-duplicated via `step()` (not guaranteed delivery)
339
+ 4. **`waitForSignal(signal)`** - Suspend until an external signal is delivered (e.g. payment confirmation)
340
+
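The replay semantics behind `step(id, fn)` can be made concrete with a toy memoizer (this is not the real `DurableContext`, just its core contract: execute once, persist the result, and on replay return the cached value instead of re-running the side effect):

```ts
// Toy replay semantics for step(): the first call executes fn and caches
// the result under (executionId, stepId); a replayed call returns the
// cached value and never re-runs the side effect.
class ToyStepCache {
  private results = new Map<string, unknown>();

  async step<T>(
    executionId: string,
    stepId: string,
    fn: () => Promise<T>,
  ): Promise<T> {
    const key = `${executionId}:${stepId}`;
    if (this.results.has(key)) {
      return this.results.get(key) as T; // replay: skip the side effect
    }
    const value = await fn();
    this.results.set(key, value); // checkpoint before returning
    return value;
  }
}
```

In the real engine the cache lives in the durable store rather than a `Map`, which is exactly why a crashed execution can resume on another worker without repeating completed steps.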
341
+ **Scalability Model:** Multiple worker instances can process executions concurrently. Work is distributed via a durable queue (RabbitMQ quorum queues by default), with state stored in Redis.
342
+
343
+ ```mermaid
344
+ graph TB
345
+ subgraph Clients
346
+ C1[Client 1]
347
+ C2[Client 2]
348
+ end
349
+
350
+ subgraph DurableInfra[Durable Infrastructure]
351
+ Q[(RabbitMQ - Quorum Queue)]
352
+ R[(Redis - State/PubSub)]
353
+ end
354
+
355
+ subgraph Workers[Scalable Workers]
356
+ W1[Worker 1]
357
+ W2[Worker 2]
358
+ W3[Worker N]
359
+ end
360
+
361
+ C1 -->|enqueue| Q
362
+ C2 -->|enqueue| Q
363
+
364
+ Q -->|consume| W1
365
+ Q -->|consume| W2
366
+ Q -->|consume| W3
367
+
368
+ W1 <-->|state| R
369
+ W2 <-->|state| R
370
+ W3 <-->|state| R
371
+
372
+ R -.->|pub/sub| W1
373
+ R -.->|pub/sub| W2
374
+ R -.->|pub/sub| W3
375
+ ```
376
+
377
+ ---
378
+
379
+ ## Abstract Interfaces
380
+
381
+ Three pluggable interfaces allow swapping backends without changing application code:
382
+
383
+ ### 1. IDurableStore - State Storage
384
+
385
+ The **Store** is the absolute source of truth. It persists execution state, step results, timers, and schedules. If it's not in the store, it didn't happen.
386
+
387
+ ```typescript
388
+ // interfaces/IDurableStore.ts
389
+
390
+ export interface IDurableStore {
391
+ // Executions (The primary workflow records)
392
+ saveExecution(execution: Execution): Promise<void>;
393
+ getExecution(id: string): Promise<Execution | null>;
394
+ updateExecution(id: string, updates: Partial<Execution>): Promise<void>;
395
+ listIncompleteExecutions(): Promise<Execution[]>;
396
+
397
+ // Steps (Memoized results for exactly-once-ish semantics)
398
+ getStepResult(
399
+ executionId: string,
400
+ stepId: string,
401
+ ): Promise<StepResult | null>;
402
+ saveStepResult(result: StepResult): Promise<void>;
403
+
404
+ // Timers (Drives sleep(), signal timeouts, and cron)
405
+ createTimer(timer: Timer): Promise<void>;
406
+ getReadyTimers(now?: Date): Promise<Timer[]>;
407
+ markTimerFired(timerId: string): Promise<void>;
408
+ deleteTimer(timerId: string): Promise<void>;
409
+
410
+ // Schedules (Cron and Interval orchestration)
411
+ createSchedule(schedule: Schedule): Promise<void>;
412
+ getSchedule(id: string): Promise<Schedule | null>;
413
+ updateSchedule(id: string, updates: Partial<Schedule>): Promise<void>;
414
+ deleteSchedule(id: string): Promise<void>;
415
+ listSchedules(): Promise<Schedule[]>;
416
+ listActiveSchedules(): Promise<Schedule[]>;
417
+
418
+ // Optional: Distributed Timer Coordination
419
+ claimTimer?(
420
+ timerId: string,
421
+ workerId: string,
422
+ ttlMs: number,
423
+ ): Promise<boolean>;
424
+
425
+ // Optional: Idempotency (dedupe start calls)
426
+ getExecutionIdByIdempotencyKey?(params: {
427
+ taskId: string;
428
+ idempotencyKey: string;
429
+ }): Promise<string | null>;
430
+ setExecutionIdByIdempotencyKey?(params: {
431
+ taskId: string;
432
+ idempotencyKey: string;
433
+ executionId: string;
434
+ }): Promise<boolean>;
435
+
436
+ // Optional: Dashboard & Operator API
437
+ listExecutions?(options?: ListExecutionsOptions): Promise<Execution[]>;
438
+ listStepResults?(executionId: string): Promise<StepResult[]>;
439
+ retryRollback?(executionId: string): Promise<void>;
440
+ skipStep?(executionId: string, stepId: string): Promise<void>;
441
+ forceFail?(
442
+ executionId: string,
443
+ error: { message: string; stack?: string },
444
+ ): Promise<void>;
445
+ editStepResult?(
446
+ executionId: string,
447
+ stepId: string,
448
+ newResult: unknown,
449
+ ): Promise<void>;
450
+
451
+ // Lifecycle
452
+ init?(): Promise<void>;
453
+ dispose?(): Promise<void>;
454
+
455
+ // Optional: Locking (if store handles its own concurrency)
456
+ acquireLock?(resource: string, ttlMs: number): Promise<string | null>;
457
+ releaseLock?(resource: string, lockId: string): Promise<void>;
458
+ }
459
+ ```
460
+
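The two optional idempotency methods compose into a simple dedupe protocol for `start()`: reserve the key first, and if the reservation loses, read back the execution id that won. A toy single-process sketch of that contract (real stores must make the reservation atomic; the class name is illustrative):

```ts
// Toy idempotency-key reservation: setExecutionIdByIdempotencyKey returns
// true only for the first writer; later callers read the winning id back.
class ToyIdempotencyIndex {
  private index = new Map<string, string>();

  private keyOf(taskId: string, idempotencyKey: string): string {
    return `${taskId}:${idempotencyKey}`;
  }

  async getExecutionIdByIdempotencyKey(params: {
    taskId: string;
    idempotencyKey: string;
  }): Promise<string | null> {
    return this.index.get(this.keyOf(params.taskId, params.idempotencyKey)) ?? null;
  }

  async setExecutionIdByIdempotencyKey(params: {
    taskId: string;
    idempotencyKey: string;
    executionId: string;
  }): Promise<boolean> {
    const key = this.keyOf(params.taskId, params.idempotencyKey);
    if (this.index.has(key)) return false; // lost the race: keep the winner
    this.index.set(key, params.executionId);
    return true;
  }
}
```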
461
+ **Implementations:**
462
+
463
+ - `MemoryStore` - Dev/test, no persistence
464
+ - `RedisStore` - Production default, distributed locking
465
+
466
+ ### 2. IEventBus - Pub/Sub
467
+
468
+ For event notifications across workers (timer ready, execution complete, etc).
469
+
470
+ ```typescript
471
+ // interfaces/IEventBus.ts
472
+
473
+ export type EventHandler = (event: BusEvent) => Promise<void>;
474
+
475
+ export interface IEventBus {
476
+ // Publish event to all subscribers
477
+ publish(channel: string, event: BusEvent): Promise<void>;
478
+
479
+ // Subscribe to events on a channel
480
+ subscribe(channel: string, handler: EventHandler): Promise<void>;
481
+
482
+ // Unsubscribe from a channel
483
+ unsubscribe(channel: string): Promise<void>;
484
+
485
+ // Lifecycle
486
+ init?(): Promise<void>;
487
+ dispose?(): Promise<void>;
488
+ }
489
+
490
+ export interface BusEvent {
491
+ type: string;
492
+ payload: unknown;
493
+ timestamp: Date;
494
+ }
495
+ ```
496
+
497
+ **Implementations:**
498
+
499
+ - `MemoryEventBus` - Dev/test, single-process only
500
+ - `RedisEventBus` - Production default, uses Redis Pub/Sub
501
+
502
+ **Serialization note:** `RedisEventBus` serializes events using Runner's serializer (tree mode) so `BusEvent.timestamp: Date` (and other supported built-in types) round-trip correctly across Redis Pub/Sub.
503
+
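A minimal single-process implementation of this contract looks roughly like the following (an illustrative sketch of what a `MemoryEventBus` has to do, not the shipped class):

```typescript
// Minimal IEventBus: fan out each published event to every handler
// registered on that channel. Single-process only, no persistence.
interface BusEvent {
  type: string;
  payload: unknown;
  timestamp: Date;
}

type EventHandler = (event: BusEvent) => Promise<void>;

class ToyEventBus {
  private handlers = new Map<string, EventHandler[]>();

  async subscribe(channel: string, handler: EventHandler): Promise<void> {
    const list = this.handlers.get(channel) ?? [];
    list.push(handler);
    this.handlers.set(channel, list);
  }

  async publish(channel: string, event: BusEvent): Promise<void> {
    for (const handler of this.handlers.get(channel) ?? []) {
      await handler(event);
    }
  }

  async unsubscribe(channel: string): Promise<void> {
    this.handlers.delete(channel);
  }
}
```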
504
+ ### 3. IDurableQueue - Work Distribution
505
+
506
+ For distributing execution work across multiple workers with durability guarantees.
507
+
508
+ ```typescript
509
+ // interfaces/IDurableQueue.ts
510
+
511
+ export interface QueueMessage<T = unknown> {
512
+ id: string;
513
+ type: "execute" | "resume" | "schedule";
514
+ payload: T;
515
+ attempts: number;
516
+ maxAttempts: number;
517
+ createdAt: Date;
518
+ }
519
+
520
+ export type MessageHandler<T = unknown> = (
521
+ message: QueueMessage<T>,
522
+ ) => Promise<void>;
523
+
524
+ export interface IDurableQueue {
525
+ // Send message to queue
526
+ enqueue<T>(
527
+ message: Omit<QueueMessage<T>, "id" | "createdAt">,
528
+ ): Promise<string>;
529
+
530
+ // Start consuming messages (calls handler for each)
531
+ consume<T>(handler: MessageHandler<T>): Promise<void>;
532
+
533
+ // Acknowledge successful processing
534
+ ack(messageId: string): Promise<void>;
535
+
536
+ // Negative acknowledge (requeue or dead-letter)
537
+ nack(messageId: string, requeue?: boolean): Promise<void>;
538
+
539
+ // Lifecycle
540
+ init?(): Promise<void>;
541
+ dispose?(): Promise<void>;
542
+ }
543
+ ```
544
+
545
+ **Message types note:** Runner currently enqueues `execute` and `resume`. `schedule` is accepted by `DurableWorker` as an alias of `resume` (an execution hint) so custom adapters can use it, but built-in cron/interval scheduling is driven by timers + `resume`.
546
+
547
+ **Implementations:**
548
+
549
+ - `MemoryQueue` - Dev/test, no persistence
550
+ - `RabbitMQQueue` - Production default, quorum queues for durability
551
+
552
+ ---
553
+
554
+ ## Adapting to Your Flow: Custom Backends
555
+
556
+ One of Runner's core philosophies is **zero lock-in**. If your team uses Postgres for state or Kafka for queues, you shouldn't have to change your workflow logic to use them.
557
+
558
+ ### Implementing a Custom Store
559
+
560
+ To implement a custom store (e.g., for SQL), you only need to satisfy the `IDurableStore` interface. The engine is designed to be "dumb" and trust the store for all persistence.
561
+
562
+ **Minimum Viable Store (Pseudo-SQL):**
563
+
564
+ ```typescript
565
+ class MySqlStore implements IDurableStore {
566
+ async saveExecution(e: Execution) {
567
+ await db.query("INSERT INTO durable_executions ...", [e.id, serialize(e)]);
568
+ }
569
+
570
+ async getExecution(id: string) {
571
+ const row = await db.query(
572
+ "SELECT data FROM durable_executions WHERE id = ?",
573
+ [id],
574
+ );
575
+ return row ? deserialize(row.data) : null;
576
+ }
577
+
578
+ // ... implement other methods by mapping to your DB tables
579
+ }
580
+ ```
581
+
582
+ > [!TIP]
583
+ > Look at [MemoryStore.ts](../src/node/durable/store/MemoryStore.ts) for a clean reference of how to manage in-memory state, or [RedisStore.ts](../src/node/durable/store/RedisStore.ts) for a production-grade implementation using Lua scripts for atomicity.
584
+
585
+ ### Implementing a Custom Queue
586
+
587
+ If you want to use a different message broker (SQS, Kafka, Redis Streams), implement `IDurableQueue`.
588
+
589
+ **Key Responsibilities:**
590
+
591
+ - **`enqueue`**: Push a message (task execution hint) to the broker.
592
+ - **`consume`**: Register a listener that calls the provided handler when a message arrives.
593
+ - **`ack` / `nack`**: Handle message confirmation/failure.
594
+
595
+ ```typescript
596
+ class SqsQueue implements IDurableQueue {
597
+ async enqueue(msg) {
598
+ const res = await sqs.sendMessage({
599
+ QueueUrl,
600
+ MessageBody: JSON.stringify(msg),
601
+ });
602
+ return res.MessageId;
603
+ }
604
+
605
+ async consume(handler) {
606
+ // Polling loop or subscription
607
+ const msgs = await sqs.receiveMessage({ QueueUrl });
608
+ for (const m of msgs) {
609
+ await handler(JSON.parse(m.Body));
610
+ await this.ack(m.ReceiptHandle);
611
+ }
612
+ }
613
+ }
614
+ ```
615
+
616
+ > [!IMPORTANT]
617
+ > A queue in Durable Workflows is just a **hint**. If a message is lost, the `polling` loop in `DurableService` acts as a safety net to find and resume stuck executions. However, a reliable queue (like RabbitMQ or SQS) is critical for low-latency distribution and high throughput.
618
+
619
+ ---
620
+
621
+ ## Component Architecture
622
+
623
+ ```mermaid
624
+ graph TB
625
+ subgraph RunnerCore[Runner Core - Unchanged]
626
+ R[Resources]
627
+ T[Tasks]
628
+ E[Events]
629
+ H[Hooks]
630
+ end
631
+
632
+ subgraph DurableModule[src/node/durable/]
633
+ DS[DurableService]
634
+ DC[DurableContext]
635
+ DW[DurableWorker]
636
+
637
+ subgraph Interfaces[Abstract Interfaces]
638
+ IS[IDurableStore]
639
+ IB[IEventBus]
640
+ IQ[IDurableQueue]
641
+ end
642
+
643
+ subgraph StoreImpl[Store Implementations]
644
+ MS[MemoryStore]
645
+ RS[RedisStore]
646
+ end
647
+
648
+ subgraph BusImpl[EventBus Implementations]
649
+ MB[MemoryEventBus]
650
+ RB[RedisEventBus]
651
+ end
652
+
653
+ subgraph QueueImpl[Queue Implementations]
654
+ MQ[MemoryQueue]
655
+ RQ[RabbitMQQueue]
656
+ end
657
+ end
658
+
659
+ DS --> IS
660
+ DS --> IB
661
+ DS --> IQ
662
+ DW --> IQ
663
+ DC --> IS
664
+
665
+ IS -.-> MS
666
+ IS -.-> RS
667
+ IB -.-> MB
668
+ IB -.-> RB
669
+ IQ -.-> MQ
670
+ IQ -.-> RQ
671
+
672
+ T -.->|uses| DC
673
+ ```
674
+
675
+ ---
676
+
677
+ ## API Design
678
+
679
+ ### Basic Usage
680
+
681
+ Durable workflows are **normal Runner tasks** that inject a **durable backend resource** (created via `resources.memoryWorkflow.fork(id)` or `resources.redisWorkflow.fork(id)` and registered via `.with(config)`) and call `durableContext.step(...)` / `durableContext.sleep(...)` from inside their `run` function.
682
+
683
+ ```typescript
684
+ import { r, run } from "@bluelibs/runner";
685
+ import { MemoryStore, resources } from "@bluelibs/runner/node";
686
+
687
+ // 1. Create store
688
+ const store = new MemoryStore();
689
+
690
+ // 2. Create durable resource definition
691
+ const durable = resources.memoryWorkflow.fork("app-durable");
692
+
693
+ // 3. Register durable resource with config
694
+ const durableRegistration = durable.with({
695
+ store,
696
+ polling: { enabled: true, interval: 1000 }, // Timer polling interval
697
+ });
698
+
699
+ // 4. Define a task that uses the durable context
700
+ const processOrder = r
701
+ .task("app.tasks.processOrder")
702
+ .inputSchema<{ orderId: string; customerId: string }>()
703
+ .dependencies({ durable })
704
+ .run(async (input, { durable }) => {
705
+ const durableContext = durable.use();
706
+
707
+ // Step 1: Validate order (checkpointed)
708
+ const order = await durableContext.step("validate", async () => {
709
+ const o = await db.orders.find(input.orderId);
710
+ if (!o) throw new Error("Order not found");
711
+ return o;
712
+ });
713
+
714
+ // Step 2: Process payment (checkpointed)
715
+ const payment = await durableContext.step("charge-payment", async () => {
716
+ return await payments.charge(order.customerId, order.total);
717
+ });
718
+
719
+ // Durable sleep - survives restart
720
+ await durableContext.sleep(5000);
721
+
722
+ // Step 3: Ship order (checkpointed)
723
+ const shipment = await durableContext.step("create-shipment", async () => {
724
+ return await shipping.create(order.id);
725
+ });
726
+
727
+ return {
728
+ success: true,
729
+ orderId: order.id,
730
+ trackingId: shipment.trackingId,
731
+ };
732
+ })
733
+ .build();
734
+
735
+ // 5. Wire up and run
736
+ const app = r
737
+ .resource("app")
738
+ .register([resources.durable, durableRegistration, processOrder])
739
+ .build();
740
+
741
+ const runtime = await run(app);
742
+
743
+ // 6. Execute durably
744
+ const d = runtime.getResourceValue(durable);
745
+ const result = await d.startAndWait(processOrder, {
746
+ orderId: "order-123",
747
+ customerId: "cust-456",
748
+ });
749
+ ```
750
+
751
+ ### How It Works
752
+
753
+ 1. **`durable.startAndWait(task, input)`** creates an execution record and runs the task
754
+ - Prefer `startAndWait()` when you want "start and wait for result" in one call.
755
+ - Prefer `start()` + `signal()` + `wait()` when the outside world must resume the workflow later (webhooks, approvals).
756
+ 2. **`durableContext.step(id, fn)`** checks if step was already executed:
757
+ - If yes: returns cached result (replay)
758
+ - If no: executes fn, caches result, returns result
759
+ 3. **`durableContext.sleep(ms)`** creates a timer record, suspends execution, resumes when timer fires
760
+ 4. **`durableContext.waitForSignal(signal)`** records a durable wait checkpoint and suspends execution
761
+ 5. **`durable.signal(executionId, signal, payload)`** completes the signal checkpoint and resumes the execution
762
+ 6. If process crashes, **`durableService.recover()`** resumes incomplete executions from their last checkpoint
763
+
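The memoization in step 2 can be sketched with a tiny in-memory cache. This is illustrative only — the real implementation persists results through `IDurableStore` — but it shows the replay contract precisely:

```typescript
// Illustrative sketch of step memoization (not the real implementation).
class TinyStepCache {
  private results = new Map<string, unknown>();

  // Mirrors durableContext.step(stepId, fn): on replay the cached result is
  // returned and fn is never re-executed; otherwise fn runs once and its
  // result is stored under `<executionId>:<stepId>`.
  async step<T>(
    executionId: string,
    stepId: string,
    fn: () => Promise<T>,
  ): Promise<T> {
    const key = `${executionId}:${stepId}`;
    if (this.results.has(key)) {
      return this.results.get(key) as T; // replay: cached result
    }
    const result = await fn(); // first execution
    this.results.set(key, result);
    return result;
  }
}
```

Calling `step` twice with the same execution and step ids executes `fn` only once; the second call returns the cached value.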
764
+ ### `start()` vs `startAndWait()` (clear contract)
765
+
766
+ - `start(taskOrTaskId, input)`:
767
+ returns immediately with `executionId` (`string`).
768
+ - `startAndWait(taskOrTaskId, input)`:
769
+ convenience wrapper for `start(...)` + `wait(executionId)`; returns
770
+ `{ durable: { executionId }, data }`.
771
+
772
+ `start()` and `startAndWait()` are the only supported durable execution APIs.
773
+
774
+ `taskOrTaskId` can be:
775
+
776
+ - an `ITask` (the built task object, returned by `.build()`)
777
+ - a task id `string`
778
+
779
+ It is **not** the injected dependency callable from `.dependencies({ someTask })`. That dependency is a function used to invoke the task directly, not an `ITask` reference.
780
+
781
+ ```ts
782
+ // ✅ built task object
783
+ const executionIdA = await d.start(approveOrder, { orderId: "o1" });
784
+
785
+ // ✅ task id string
786
+ const executionIdB = await d.start(approveOrder.id, {
787
+ orderId: "o2",
788
+ });
789
+
790
+ // ❌ injected callable dependency (different type)
791
+ // await d.start(deps.approveOrder, { orderId: "o3" });
792
+ ```
793
+
794
+ ### What Happens with the Return Value
795
+
796
+ Whatever your workflow function returns becomes the **execution result**, persisted in the durable store. You can retrieve it in three ways depending on your pattern:
797
+
798
+ - **`startAndWait(task, input)`** — starts the workflow **and** waits for it to finish, returning `{ durable: { executionId }, data }`:
799
+
800
+ ```ts
801
+ const result = await d.startAndWait(processOrder, { orderId: "order-123" });
802
+ // result = {
803
+ // durable: { executionId: "..." },
804
+ // data: { success: true, orderId: "order-123", trackingId: "TRK-789" }
805
+ // }
806
+ ```
807
+
808
+ - **`start(task, input)`** + **`wait(executionId)`** — start and wait separately (useful when a webhook or external event resumes the workflow later):
809
+
810
+ ```ts
811
+ const executionId = await d.start(approveOrder, {
812
+ orderId: "order-123",
813
+ });
814
+ // ... later (e.g. in a webhook handler) ...
815
+ await d.signal(executionId, Approved, { approvedBy: "admin@co.com" });
816
+ const result = await d.wait(executionId, { timeout: 30_000 });
817
+ // result = { status: "approved", approvedBy: "admin@co.com" }
818
+ ```
819
+
820
+ - **Read from the store** — fetch the persisted result without blocking:
821
+ ```ts
822
+ const execution = await store.getExecution(executionId);
823
+ // execution.status = "completed" | "failed" | "running" | ...
824
+ // execution.result = the return value of your workflow
825
+ ```
826
+
827
+ If the workflow throws an error instead of returning, the execution is marked as `failed` and `startAndWait()`/`wait()` will reject with that error.
828
+
829
+ ---
830
+
831
+ ## Execution Flow
832
+
833
+ ```mermaid
834
+ sequenceDiagram
835
+ participant C as Client
836
+ participant DS as DurableService
837
+ participant S as Store
838
+ participant DC as DurableContext
839
+ participant T as Task Function
840
+
841
+ C->>DS: startAndWait(task, input)
842
+ DS->>S: createExecution(id, task, input)
843
+ DS->>DC: create context for execution
844
+ DS->>T: run task with context
845
+
846
+ T->>DC: step('validate', fn)
847
+ DC->>S: getStepResult(execId, 'validate')
848
+ alt Step not cached
849
+ DC->>DC: execute fn()
850
+ DC->>S: saveStepResult(execId, 'validate', result)
851
+ end
852
+ DC-->>T: return result
853
+
854
+ T->>DC: sleep(5000)
855
+ DC->>S: createTimer(execId, fireAt)
856
+ Note over DC,T: Execution suspends
857
+
858
+ Note over DS: Timer polling...
859
+ DS->>S: getReadyTimers()
860
+ S-->>DS: timer ready!
861
+ DS->>DC: resume execution
862
+
863
+ T->>DC: step('ship', fn)
864
+ DC->>S: getStepResult(execId, 'ship')
865
+ DC->>DC: execute fn()
866
+ DC->>S: saveStepResult(execId, 'ship', result)
867
+ DC-->>T: return result
868
+
869
+ T-->>DS: return final result
870
+ DS->>S: markExecutionComplete(id, result)
871
+ DS-->>C: return result
872
+ ```
873
+
874
+ ---
875
+
876
+ ## Safety & Semantics
877
+
878
+ This section summarizes the safety guarantees and expectations of the durable workflow system.
879
+
880
+ - **Store is the source of truth**
881
+ All durable state (executions, steps, timers, schedules) lives in `IDurableStore`. Queues and pub/sub are optimizations on top; correctness must not rely solely on in-memory state or transient messages.
882
+
883
+ - **At-least-once execution, effectively-once steps**
884
+ - Executions are retried on failure, so the same logical workflow may run more than once.
885
+ - `durableContext.step(stepId, fn)` ensures each step function is _observably_ executed at most once per execution: results are memoized in the store and returned on replay.
886
+ - External side effects inside a step must still be designed to be idempotent or safely repeatable (for example, idempotent payment/refund APIs).
887
+
888
+ - **Sleep and resumption**
889
+ - `durableContext.sleep(ms)` persists a timer and marks the execution as `sleeping`.
890
+ - When the timer fires, execution is resumed from the code _after_ `sleep`, and all previous steps are replayed via cached results (no re‑issuing of side effects wrapped in `step`).
891
+
892
+ - **Event emission without duplicates**
893
+ - `durableContext.emit(event, data)` is implemented as one or more internal `step`s under the hood.
894
+ - Each call is assigned a deterministic internal id like `__emit:<eventId>:<index>` so you can emit the same event type multiple times in one workflow.
895
+ - On replay, memoization prevents duplicates for each individual emission.
896
+ - **Determinism note:** those internal `:<index>` suffixes are derived from call order within the workflow. If you change the workflow structure (branching / adding/removing calls), the internal step ids may shift and past executions may no longer replay cleanly.
897
+
898
+ - **Signals (wait until external confirmation)**
899
+ - `durableContext.waitForSignal(signal)` suspends an execution until `durable.signal(executionId, signal, payload)` is called.
900
+ - Passing `stepId` keeps the same return type (the payload, throwing on timeout), while passing `timeoutMs` switches the result to a `{ kind: "signal" | "timeout" }` outcome.
901
+ - Signals are memoized as steps under `__signal:<signal.id>[:index]` (or `__signal:<id>[:index]` for string ids).
902
+ - Repeated waits use `__signal:<id>:<index>` and are resolved by the first available slot; payloads can be buffered for future waits.
903
+ - **Determinism note:** like `emit`, the `:<index>` suffixes are derived from call order within the workflow; code changes can shift indexes on replay.
904
+
905
+ - **Retries and timeouts**
906
+ - `StepOptions.retries` and `DurableServiceConfig.execution.maxAttempts` control step‑level and execution‑level retries respectively.
907
+ - `StepOptions.timeout` and `execution.timeout` bound how long a single step or the whole execution may run.
908
+ - **Global Timeouts**: `execution.timeout` measures the total time from the very first attempt (`createdAt`) and is not reset on retries or resumptions.
909
+
910
+ - **Queue and worker semantics**
911
+ - `IDurableQueue` provides **at-least-once** delivery: messages may be delivered more than once but will not be silently dropped.
912
+ - Workers must treat queue messages as hints to load state from the store, apply `DurableContext` logic, and then `ack` or `nack` the message. Idempotency is achieved by reading/writing through `IDurableStore`, not by trusting the queue alone.
913
+
914
+ - **Multi-node coordination**
915
+ - `IEventBus` is used to reduce `wait()` latency (publish `execution:<id>` completion events) but does not replace the store.
916
+ - Timers (`sleep`, signal timeouts, schedules) are driven by the durable poller (`DurableService` polling loop). In multi-process setups, run a single poller (`polling: { enabled: true }`) or implement atomic timer claiming in your store.
917
+
918
+ - **Reserved step ids**
919
+ - Step ids starting with `__` and `rollback:` are reserved for durable internals. Avoid using them in `durableContext.step(...)` to prevent collisions with system steps.
920
+
921
+ These semantics intentionally favor **safety and debuggability** over perfect "exactly-once" guarantees at the infrastructure level. Application code remains explicit and testable, while the system provides strong, well-defined durability guarantees around that code.
922
+
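To make the deterministic-id scheme above concrete, internal emit ids like `__emit:<eventId>:<index>` can be modeled as a per-event call-order counter (a sketch of the documented scheme, not the actual internals):

```typescript
// Sketch: allocate deterministic internal step ids for repeated emits.
class EmitIdAllocator {
  private counts = new Map<string, number>();

  // The index is purely call-order within one workflow run, which is why
  // reordering or inserting emit calls shifts ids and can break replay
  // of past executions.
  next(eventId: string): string {
    const index = this.counts.get(eventId) ?? 0;
    this.counts.set(eventId, index + 1);
    return `__emit:${eventId}:${index}`;
  }
}
```

Two emits of `app.events.shipped` get `__emit:app.events.shipped:0` and `...:1`; inserting a third emit between them renumbers the one that follows.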
923
+ ---
924
+
925
+ ## Signals (wait for external events)
926
+
927
+ Durable workflows often need to pause until the outside world confirms something (e.g. payment provider callbacks). Use `durableContext.waitForSignal()` inside the workflow, and `durable.signal()` from the outside.
928
+
929
+ Signal summary:
930
+
931
+ - `stepId` is a stable key only; it does not change return types.
932
+ - `waitForSignal({ stepId })` requires a store that supports listing step results (`listStepResults`) so `durable.signal(...)` can find the waiter.
933
+ - `timeoutMs` changes the return value to a `{ kind: "signal" | "timeout" }` outcome.
934
+ - Without `timeoutMs`, timeouts throw an error (no union result).
935
+
936
+ Return shapes:
937
+
938
+ | Call | Returns |
939
+ | ---------------------------------------------- | ------------------------------------------------------ |
940
+ | `waitForSignal(signal)` | `payload` (throws on timeout) |
941
+ | `waitForSignal(signal, { stepId })` | `payload` (throws on timeout) |
942
+ | `waitForSignal(signal, { timeoutMs })` | `{ kind: "signal", payload }` or `{ kind: "timeout" }` |
943
+ | `waitForSignal(signal, { timeoutMs, stepId })` | `{ kind: "signal", payload }` or `{ kind: "timeout" }` |
944
+
945
+ ### Example: `waitUntilPaid()`
946
+
947
+ ```typescript
948
+ import { event, r } from "@bluelibs/runner";
949
+ import { MemoryStore, resources } from "@bluelibs/runner/node";
950
+
951
+ const Paid = event<{ paidAt: number }>({ id: "app.signals.paid" });
952
+ const durable = resources.memoryWorkflow.fork("app-durable");
953
+ const durableRegistration = durable.with({ store: new MemoryStore() });
954
+
955
+ export const processOrder = r
956
+ .task("app.tasks.processOrder")
957
+ .dependencies({ durable })
958
+ .run(async (input: { orderId: string }, { durable }) => {
959
+ const durableContext = durable.use();
960
+
961
+ await durableContext.step("reserve", async () => {
962
+ // reserve inventory, create payment intent, etc.
963
+ return { ok: true };
964
+ });
965
+
966
+ const payment = await durableContext.waitForSignal(Paid);
967
+
968
+ await durableContext.step("ship", async () => {
969
+ // ship only after payment is confirmed
970
+ return { ok: true, paidAt: payment.paidAt };
971
+ });
972
+ })
973
+ .build();
974
+ ```
975
+
976
+ From an API webhook / callback handler:
977
+
978
+ ```typescript
979
+ // Store the workflow `executionId` in your domain data when you start it.
980
+ // You can get it immediately via `await d.start(task, input)`.
981
+ const d = runtime.getResourceValue(durable);
982
+ await d.signal(executionId, Paid, { paidAt: Date.now() });
983
+ ```
984
+
985
+ ### Whichever comes first: signal or timeout
986
+
987
+ If you need "wait for payment confirmation or continue after 1 day", use the timeout variant:
988
+
989
+ ```typescript
990
+ const outcome = await durableContext.waitForSignal(Paid, {
991
+ timeoutMs: 86_400_000,
992
+ });
993
+
994
+ if (outcome.kind === "timeout") {
995
+ // mark order as expired, notify user, etc.
996
+ return;
997
+ }
998
+
999
+ // outcome.kind === "signal"
1000
+ await durableContext.step("ship", async () => ({
1001
+ paidAt: outcome.payload.paidAt,
1002
+ }));
1003
+ ```
1004
+
1005
+ ### Stable `stepId` without changing behavior
1006
+
1007
+ You can pass a stable step id for replay stability without changing the return type:
1008
+
1009
+ ```typescript
1010
+ const payment = await durableContext.waitForSignal(Paid, {
1011
+ stepId: "stable-paid",
1012
+ });
1013
+ ```
1014
+
1015
+ ---
1016
+
1017
+ ## Compensation / Rollback Pattern
1018
+
1019
+ Instead of a complex saga orchestrator, users implement compensation explicitly:
1020
+
1021
+ ```typescript
1022
+ const processOrderWithRollback = r
1023
+ .task("app.tasks.processOrder")
1024
+ .dependencies({ durable })
1025
+ .run(async (input, { durable }) => {
1026
+ const durableContext = durable.use();
1027
+
1028
+ // Reserve inventory
1029
+ const reservation = await durableContext
1030
+ .step("reserve-inventory")
1031
+ .up(async () => inventory.reserve(input.items))
1032
+ .down(async (res) => inventory.release(res.reservationId));
1033
+
1034
+ // Charge payment
1035
+ const payment = await durableContext
1036
+ .step("charge-payment")
1037
+ .up(async () => payments.charge(input.customerId, input.amount))
1038
+ .down(async (p) => payments.refund(p.chargeId));
1039
+
1040
+ try {
1041
+ // Ship order - might fail
1042
+ const shipment = await durableContext.step("ship-order", async () => {
1043
+ return await shipping.ship(input.orderId);
1044
+ });
1045
+ return { success: true, shipment };
1046
+ } catch (error) {
1047
+ await durableContext.rollback();
1048
+ return {
1049
+ success: false,
1050
+ error: error instanceof Error ? error.message : String(error),
1051
+ };
1052
+ }
1053
+ })
1054
+ .build();
1055
+ ```
1056
+
1057
+ This is more explicit and readable than an automatic saga system.
1058
+
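A plausible mental model for `rollback()` (assumed semantics, not the actual implementation): each `.down()` handler is recorded when its `.up()` step completes, and rollback runs the recorded handlers in reverse (LIFO) order:

```typescript
// Sketch of compensation bookkeeping: undo the most recent side effect first.
class CompensationStack {
  private downs: Array<() => Promise<void>> = [];

  record(down: () => Promise<void>): void {
    this.downs.push(down); // registered when the corresponding .up() succeeds
  }

  async rollback(): Promise<void> {
    // LIFO order: e.g. refund the payment before releasing the inventory.
    while (this.downs.length > 0) {
      const down = this.downs.pop()!;
      await down();
    }
  }
}
```

Reverse order matters because later steps usually depend on earlier ones; undoing them first keeps the system consistent at every point of the unwind.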
1059
+ ---
1060
+
1061
+ ## Branching with durableContext.switch()
1062
+
1063
+ `durableContext.switch()` is a replay-safe branching primitive for durable workflows. Instead of using plain `if/else` (which the flow shape exporter can't capture), model conditional logic with `switch` so that:
1064
+
1065
+ 1. The branch decision is **persisted** — on replay, matchers are skipped and the cached branch result is returned.
1066
+ 2. The branch structure is **visible** to the flow-shape recorder (via `durable.describe(...)`) for documentation and visualization.
1067
+
1068
+ ### API
1069
+
1070
+ ```typescript
1071
+ const result = await durableContext.switch<TValue, TResult>(
1072
+ stepId, // unique step ID (like durableContext.step)
1073
+ value, // the value to match against
1074
+ branches, // array of { id, match, run }
1075
+ defaultBranch?, // optional { id, run } (no match needed)
1076
+ );
1077
+ ```
1078
+
1079
+ ### Example
1080
+
1081
+ ```typescript
1082
+ const fulfillOrder = r
1083
+ .task("app.tasks.fulfillOrder")
1084
+ .dependencies({ durable })
1085
+ .run(async (input: { orderId: string; tier: string }, { durable }) => {
1086
+ const durableContext = durable.use();
1087
+
1088
+ const order = await durableContext.step("fetch-order", async () => {
1089
+ return await db.orders.findById(input.orderId);
1090
+ });
1091
+
1092
+ const result = await durableContext.switch(
1093
+ "fulfillment-route",
1094
+ order.tier,
1095
+ [
1096
+ {
1097
+ id: "premium",
1098
+ match: (tier) => tier === "premium",
1099
+ run: async () => {
1100
+ await durableContext.step("express-ship", async () =>
1101
+ shipping.express(order),
1102
+ );
1103
+ return "express-shipped";
1104
+ },
1105
+ },
1106
+ {
1107
+ id: "standard",
1108
+ match: (tier) => tier === "standard",
1109
+ run: async () => {
1110
+ await durableContext.step("standard-ship", async () =>
1111
+ shipping.standard(order),
1112
+ );
1113
+ return "standard-shipped";
1114
+ },
1115
+ },
1116
+ ],
1117
+ {
1118
+ id: "manual-review",
1119
+ run: async () => {
1120
+ await durableContext.step("flag-review", async () =>
1121
+ flagForReview(order),
1122
+ );
1123
+ return "needs-review";
1124
+ },
1125
+ },
1126
+ );
1127
+
1128
+ return { orderId: input.orderId, result };
1129
+ })
1130
+ .build();
1131
+ ```
1132
+
1133
+ ### How it works
1134
+
1135
+ - **First execution**: matchers evaluate in order; the first matching branch's `run()` is called. The branch `id` and result are persisted as a step result.
1136
+ - **Replay**: the cached `{ branchId, result }` is returned immediately — no matchers or `run()` are re-executed.
1137
+ - **Audit**: emits a `switch_evaluated` audit entry with `branchId` and `durationMs`.
1138
+ - **Determinism**: the step ID is user-provided (required), so it's stable across refactors (like `durableContext.step`).
1139
+ - **Fail-fast**: throws if no branch matches and no default is provided.
1140
+
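The replay behavior can be illustrated with a minimal in-memory version (a sketch; the real implementation persists `{ branchId, result }` through the durable store):

```typescript
// Minimal replay-safe switch: first run evaluates matchers, replay returns cache.
interface Branch<V, R> {
  id: string;
  match: (value: V) => boolean;
  run: (value: V) => Promise<R>;
}

class TinySwitch {
  private cache = new Map<string, { branchId: string; result: unknown }>();

  async switch<V, R>(
    stepId: string,
    value: V,
    branches: Array<Branch<V, R>>,
    defaultBranch?: { id: string; run: (value: V) => Promise<R> },
  ): Promise<R> {
    const cached = this.cache.get(stepId);
    if (cached) return cached.result as R; // replay: matchers are skipped

    const branch = branches.find((b) => b.match(value)) ?? defaultBranch;
    if (!branch) throw new Error(`No branch matched for "${stepId}"`); // fail-fast

    const result = await branch.run(value);
    this.cache.set(stepId, { branchId: branch.id, result });
    return result;
  }
}
```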
1141
+ ### Interface
1142
+
1143
+ ```typescript
1144
+ interface SwitchBranch<TValue, TResult> {
1145
+ id: string;
1146
+ match: (value: TValue) => boolean;
1147
+ run: (value: TValue) => Promise<TResult>;
1148
+ }
1149
+ ```
1150
+
1151
+ ---
1152
+
1153
+ ## Describing a Flow (Static Shape Export)
1154
+
1155
+ Use `durable.describe(...)` to capture the **structure** of a durable workflow without executing it. It returns a serializable `DurableFlowShape` object that you can use for:
1156
+
1157
+ - Documentation generation
1158
+ - Visual workflow diagrams
1159
+ - Tooling and editor plugins
1160
+ - API schema exports
1161
+
1162
+ ### From an existing task (recommended)
1163
+
1164
+ Call `describe()` on your durable dependency, then pass your task directly — it shims `durable.use()` and records every `durableContext.*` operation:
1165
+
1166
+ ```typescript
1167
+ import { r, run } from "@bluelibs/runner";
1168
+ import { resources } from "@bluelibs/runner/node";
1169
+
1170
+ const durable = resources.memoryWorkflow.fork("app-durable");
1171
+ const app = r
1172
+ .resource("app")
1173
+ .register([resources.durable, durable.with({})])
1174
+ .build();
1175
+ const runtime = await run(app);
1176
+
1177
+ // TInput is inferred from the task:
1178
+ const shape = await runtime.getResourceValue(durable).describe(approveOrder);
1179
+
1180
+ // Or specify input explicitly:
1181
+ const shape2 = await runtime
1182
+ .getResourceValue(durable)
1183
+ .describe<{ orderId: string }>(approveOrder, { orderId: "123" });
1184
+
1185
+ console.log(shape.nodes);
1186
+ // [
1187
+ // { kind: "step", stepId: "validate", hasCompensation: false },
1188
+ // { kind: "waitForSignal", signalId: "app.signals.approved", ... },
1189
+ // { kind: "step", stepId: "ship", hasCompensation: false },
1190
+ // { kind: "emit", eventId: "app.events.shipped", stepId: "notify" },
1191
+ // ]
1192
+ ```
1193
+
1194
+ If your task is tagged with `tags.durableWorkflow.with({ defaults: {...} })`,
1195
+ `describe(task)` (without input) uses a cloned copy of those defaults.
1196
+ Input passed explicitly via `describe(task, input)` always takes precedence over tag defaults.
1197
+
1198
+ That's it. No refactoring — just call `durable.describe(task)` and get the shape.
1199
+
1200
+ ### Output shape
1201
+
1202
+ ```typescript
1203
+ interface DurableFlowShape {
1204
+ nodes: FlowNode[];
1205
+ }
1206
+
1207
+ type FlowNode =
1208
+ | { kind: "step"; stepId: string; hasCompensation: boolean }
1209
+ | { kind: "sleep"; durationMs: number; stepId?: string }
1210
+ | {
1211
+ kind: "waitForSignal";
1212
+ signalId: string;
1213
+ timeoutMs?: number;
1214
+ stepId?: string;
1215
+ }
1216
+ | { kind: "emit"; eventId: string; stepId?: string }
1217
+ | { kind: "switch"; stepId: string; branchIds: string[]; hasDefault: boolean }
1218
+ | { kind: "note"; message: string };
1219
+ ```
1220
+
1221
+ ### How it works
1222
+
1223
+ The recorder runs your task's `run` function with **real runtime dependencies**, but wraps durable resource dependencies so `durable.use()` returns a **recording context**. That context implements `IDurableContext` and captures each `durableContext.*` call as a `FlowNode` instead of executing it.
1224
+
1225
+ The step builder API (`.up()` / `.down()`) is also supported: `hasCompensation` reflects whether `.down()` was called.
1226
+
1227
+ `rollback()` is a no-op in the recorder (it's a runtime concern, not a structural one).
1228
+
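In miniature, a recording context looks like this (a sketch covering only two operations; the real recorder implements the full `IDurableContext` surface):

```typescript
// Sketch: capture durableContext calls as flow nodes instead of executing them.
type RecordedNode =
  | { kind: "step"; stepId: string; hasCompensation: boolean }
  | { kind: "sleep"; durationMs: number };

class RecordingContext {
  readonly nodes: RecordedNode[] = [];

  async step<T>(stepId: string, _fn: () => Promise<T>): Promise<T> {
    this.nodes.push({ kind: "step", stepId, hasCompensation: false });
    // Structure only: the step body is never executed, so there is no
    // meaningful return value during recording.
    return undefined as T;
  }

  async sleep(durationMs: number): Promise<void> {
    this.nodes.push({ kind: "sleep", durationMs });
  }
}
```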
1229
+ ---
1230
+
1231
+ ## Scheduling & Cron Jobs
1232
+
1233
+ ### One-Time Scheduled Execution
1234
+
1235
+ Run a task at a specific future time:
1236
+
1237
+ ```typescript
1238
+ // Schedule a task to run in 1 hour
1239
+ const executionId = await durable.schedule(
1240
+ processReport,
1241
+ { reportId: "daily-sales" },
1242
+ { at: new Date(Date.now() + 3600000) },
1243
+ );
1244
+
1245
+ // Or use delay helper
1246
+ const executionId = await durable.schedule(
1247
+ sendReminder,
1248
+ { userId: "user-123" },
1249
+ { delay: 24 * 60 * 60 * 1000 }, // 24 hours from now
1250
+ );
1251
+ ```
1252
+
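Both options presumably normalize to the same underlying value: an absolute fire time for the timer. A hedged sketch of that normalization (`resolveFireAt` is a hypothetical helper, not a public API):

```typescript
// Normalize { at } / { delay } scheduling options into an absolute Date.
function resolveFireAt(
  options: { at?: Date; delay?: number },
  now: number = Date.now(),
): Date {
  if (options.at) return options.at; // absolute time wins if provided
  if (options.delay !== undefined) return new Date(now + options.delay);
  throw new Error("schedule requires either `at` or `delay`");
}
```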
1253
+ ### Recurring Cron Jobs
1254
+
1255
+ Define tasks that run on a schedule using cron expressions:
1256
+
1257
+ ```typescript
1258
+ // Define a scheduled task
1259
+ const dailyCleanup = r
1260
+ .task("app.tasks.dailyCleanup")
1261
+ .dependencies({ durable, db })
1262
+ .run(async (input, { durable, db }) => {
1263
+ const durableContext = durable.use();
1264
+
1265
+ await durableContext.step("cleanup-old-sessions", async () => {
1266
+ await db.sessions.deleteOlderThan(7, "days");
1267
+ });
1268
+
1269
+ await durableContext.step("cleanup-temp-files", async () => {
1270
+ await fs.rm("./tmp/*", { recursive: true });
1271
+ });
1272
+
1273
+ return { cleaned: true };
1274
+ })
1275
+ .build();
1276
+
1277
+ // Create schedules once at startup (in a bootstrap resource/task)
1278
+ // ensureSchedule() is idempotent — safe to call on every boot and concurrently
1279
+ await durable.ensureSchedule(
1280
+ dailyCleanup,
1281
+ {},
1282
+ { id: "daily-cleanup", cron: "0 3 * * *" },
1283
+ );
1284
+ await durable.ensureSchedule(
1285
+ syncInventory,
1286
+ { full: false },
1287
+ { id: "hourly-sync", cron: "0 * * * *" },
1288
+ );
1289
+ await durable.ensureSchedule(
1290
+ generateWeeklyReport,
1291
+ { type: "weekly" },
1292
+ { id: "weekly-report", cron: "0 9 * * MON" },
1293
+ );
1294
+ ```
1295
+
1296
+ ### Interval-Based Scheduling
1297
+
1298
+ Run tasks at fixed intervals (e.g., every 30 seconds):
1299
+
1300
+ ```typescript
1301
+ // ensureSchedule() is idempotent — safe to call on every boot and concurrently
1302
+ await durable.ensureSchedule(
1303
+ healthCheckTask,
1304
+ { endpoints: ["api", "db"] },
1305
+ { id: "health-check", interval: 30_000 },
1306
+ );
1307
+ await durable.ensureSchedule(
1308
+ pollExternalApi,
1309
+ {},
1310
+ { id: "poll-external-api", interval: 5 * 60 * 1000 },
1311
+ );
1312
+ await durable.ensureSchedule(
1313
+ metricsSync,
1314
+ { flush: true },
1315
+ { id: "metrics-sync", interval: 60_000 },
1316
+ );
1317
+ ```
1318
+
1319
+ **Interval vs Cron:**
1320
+
1321
+ - **Interval**: Fixed delay between executions. Next run = end of previous + interval. Best for polling, health checks.
1322
+ - **Cron**: Calendar-based. Next run = next matching time. Best for scheduled reports, daily cleanup.
1323
+
1324
+ **Interval Behavior (current implementation):**
1325
+ Intervals are currently measured from when the schedule timer fires / execution is kicked off (not from task completion).
1326
+ If the task runs longer than the interval, the next run will be scheduled after the interval from _kickoff time_, which can cause overlapping executions unless your task logic (or your infrastructure) prevents it.
1327
+
1328
+ ```
1329
+ Task starts at t=0, takes 12s to complete
1330
+ Interval = 10s
1331
+
1332
+ t=0          t=10         t=12
1333
+ |------------|------------|
1334
+ task run A   next run B   A completes
1335
+ ```
1336
+
1337
+ If you need "completion-based" intervals (no overlap), implement it explicitly inside the workflow:
1338
+
1339
+ - run the work
1340
+ - then `await durableContext.sleep(intervalMs)`
1341
+ - then loop / re-run (or have the schedule fire less frequently and use durable sleeps inside)
1342
+
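The completion-based pattern above can be sketched as a plain loop (assuming a `sleep` with `durableContext.sleep` semantics is passed in):

```typescript
// Run the work to completion, then wait the full interval before the next
// iteration — the gap is measured from completion, so runs never overlap.
async function completionBasedLoop(
  work: () => Promise<void>,
  sleep: (ms: number) => Promise<void>, // stand-in for durableContext.sleep
  intervalMs: number,
  iterations: number,
): Promise<void> {
  for (let i = 0; i < iterations; i++) {
    await work();            // finish the work first
    await sleep(intervalMs); // then wait: interval starts at completion
  }
}
```

With a durable `sleep`, each pause survives restarts, so the loop resumes where it left off instead of restarting the interval.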
1343
+ ### Cron Expression Format
1344
+
1345
+ Standard 5-field cron format:
1346
+
1347
+ ```
1348
+ ┌───────────── minute (0-59)
1349
+ │ ┌─────────── hour (0-23)
1350
+ │ │ ┌───────── day of month (1-31)
1351
+ │ │ │ ┌─────── month (1-12 or JAN-DEC)
1352
+ │ │ │ │ ┌───── day of week (0-6 or SUN-SAT)
1353
+ │ │ │ │ │
1354
+ * * * * *
1355
+ ```
1356
+
1357
+ Common patterns:
1358
+
1359
+ - `* * * * *` - Every minute
1360
+ - `0 * * * *` - Every hour
1361
+ - `0 0 * * *` - Every day at midnight
1362
+ - `0 9 * * MON-FRI` - Weekdays at 9am
1363
+ - `0 0 1 * *` - First of every month
1364
+
1365
+ ### Schedule Management API
1366
+
1367
+ ```typescript
1368
+ // Pause a schedule
1369
+ await durable.pauseSchedule("daily-cleanup");
1370
+
1371
+ // Resume a schedule
1372
+ await durable.resumeSchedule("daily-cleanup");
1373
+
1374
+ // Get schedule status
1375
+ const status = await durable.getSchedule("daily-cleanup");
1376
+ // { id, cron, lastRun, nextRun, status: 'active' | 'paused' }
1377
+
1378
+ // List all schedules
1379
+ const schedules = await durable.listSchedules();
1380
+
1381
+ // Update schedule cron
1382
+ await durable.updateSchedule("daily-cleanup", { cron: "0 4 * * *" });
1383
+
1384
+ // Remove schedule
1385
+ await durable.removeSchedule("daily-cleanup");
1386
+ ```
1387
+
1388
+ ### How Scheduling Works
1389
+
1390
+ ```mermaid
1391
+ sequenceDiagram
1392
+ participant DS as DurableService
1393
+ participant S as Store
1394
+ participant T as Task
1395
+
1396
+ Note over DS: Timer polling loop
1397
+
1398
+ loop Every polling interval
1399
+ DS->>S: getReadyTimers
1400
+ S-->>DS: timers ready to fire
1401
+
1402
+ alt Schedule timer
1403
+ DS->>S: getSchedule by scheduleId
1404
+ DS->>DS: execute task with input
1405
+ DS->>S: calculateNextRun from cron
1406
+ DS->>S: createTimer for next run
1407
+ else Sleep timer
1408
+ DS->>DS: resume execution
1409
+ else One-time scheduled
1410
+ DS->>DS: execute task
1411
+ end
1412
+ end
1413
+ ```
1414
+
1415
+ ---
1416
+
1417
+ ## Core Types
1418
+
1419
+ ```typescript
1420
+ // types.ts
1421
+
1422
+ export type ExecutionStatus =
1423
+ | "pending"
1424
+ | "running"
1425
+ | "retrying"
1426
+ | "sleeping"
1427
+ | "completed"
1428
+ | "failed"
1429
+ | "compensation_failed";
1430
+
1431
+ export interface Execution<TInput = unknown, TResult = unknown> {
1432
+ id: string;
1433
+ taskId: string;
1434
+ input: TInput | undefined;
1435
+ status: ExecutionStatus;
1436
+ result?: TResult;
1437
+ error?: {
1438
+ message: string;
1439
+ stack?: string;
1440
+ };
1441
+ attempt: number;
1442
+ maxAttempts: number;
1443
+ timeout?: number;
1444
+ createdAt: Date;
1445
+ updatedAt: Date;
1446
+ completedAt?: Date;
1447
+ }
1448
+
1449
+ export interface StepResult<T = unknown> {
1450
+ executionId: string;
1451
+ stepId: string;
1452
+ result: T;
1453
+ completedAt: Date;
1454
+ }
1455
+
1456
+ export type TimerType =
1457
+ | "sleep"
1458
+ | "timeout"
1459
+ | "scheduled"
1460
+ | "cron"
1461
+ | "retry"
1462
+ | "signal_timeout";
1463
+
1464
+ export interface Timer {
1465
+ id: string;
1466
+ executionId?: string; // For sleep/timeout timers
1467
+ stepId?: string; // For step-specific timers
1468
+ scheduleId?: string; // For cron timers
1469
+ type: TimerType;
1470
+ fireAt: Date;
1471
+ status: "pending" | "fired";
1472
+ }
1473
+
1474
+ export type ScheduleType = "cron" | "interval";
1475
+
1476
+ export interface Schedule<TInput = unknown> {
1477
+ id: string;
1478
+ taskId: string;
1479
+ type: ScheduleType;
1480
+ pattern: string; // Cron expression or interval (ms)
1481
+ input: TInput | undefined;
1482
+ status: "active" | "paused";
1483
+ lastRun?: Date;
1484
+ nextRun?: Date;
1485
+ createdAt: Date;
1486
+ updatedAt: Date;
1487
+ }
1488
+
1489
+ export interface DurableContextState {
1490
+ executionId: string;
1491
+ attempt: number;
1492
+ }
1493
+ ```

---

**Note on Interfaces**: The full technical contracts for `IDurableStore`, `IEventBus`, and `IDurableQueue` are documented in the [Abstract Interfaces](#abstract-interfaces) section.

---

## DurableContext

```typescript
// DurableContext.ts

export interface IDurableContext {
  readonly executionId: string;
  readonly attempt: number;

  /**
   * Execute a step with memoization. On replay, returns cached result.
   */
  step<T>(stepId: string, fn: () => Promise<T>): Promise<T>;
  step<T>(
    stepId: string,
    options: StepOptions,
    fn: () => Promise<T>,
  ): Promise<T>;

  /**
   * Durable sleep that survives process restarts.
   */
  sleep(durationMs: number): Promise<void>;

  /**
   * Emit an event durably (as a step).
   */
  emit<T>(event: IEvent<T>, data: T): Promise<void>;
}

export interface StepOptions {
  retries?: number;
  timeout?: number;
}
```
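The replay semantics of `step()` can be modeled with an in-memory cache keyed by execution and step id: the first call runs the function and records its result; a replay with the same keys returns the cached value and skips the side effect entirely. This is a simplified mental model, not the actual implementation (which persists results via the store):

```typescript
// Simplified model of step memoization: the first call runs `fn` and
// caches its result; replays with the same (executionId, stepId) return
// the cached value without re-running the side effect.
const stepCache = new Map<string, unknown>();

async function step<T>(
  executionId: string,
  stepId: string,
  fn: () => Promise<T>,
): Promise<T> {
  const key = `${executionId}:${stepId}`;
  if (stepCache.has(key)) return stepCache.get(key) as T;
  const result = await fn();
  stepCache.set(key, result); // in the real engine this write is persisted
  return result;
}
```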

---

## DurableService

```typescript
// DurableService.ts (simplified interface)

export interface ScheduleConfig<TInput = unknown> {
  id: string;
  task: ITask<TInput, any>;
  cron?: string; // Cron expression (e.g., '0 3 * * *')
  interval?: number; // Interval in ms (e.g., 30000 for 30 seconds)
  input: TInput;
}
// Must specify either cron OR interval, not both

export interface DurableServiceConfig {
  store: IDurableStore;
  queue?: IDurableQueue;
  eventBus?: IEventBus;
  audit?: {
    enabled?: boolean; // Default: false
  };
  polling?: {
    enabled?: boolean; // Default: true
    interval?: number; // Default: 1000ms
  };
  execution?: {
    maxAttempts?: number; // Default: 3
    timeout?: number; // Default: no timeout
  };
  schedules?: ScheduleConfig[]; // Cron schedules to register
}

export interface ScheduleOptions {
  id?: string; // Stable schedule id (required for ensureSchedule)
  at?: Date; // Run at specific time
  delay?: number; // Run after delay (ms)
  cron?: string; // Cron expression (for recurring)
  interval?: number; // Interval in ms (for recurring)
}

export interface IDurableService {
  /**
   * Start a task durably and wait for it to complete.
   */
  startAndWait<TInput, TResult>(
    task: ITask<TInput, Promise<TResult>, any, any, any, any> | string,
    input?: TInput,
    options?: ExecuteOptions,
  ): Promise<TResult>;

  /**
   * Start a task execution and return the ID immediately.
   */
  start<TInput>(
    task: ITask<TInput, Promise<unknown>, any, any, any, any> | string,
    input?: TInput,
    options?: ExecuteOptions,
  ): Promise<string>;

  /**
   * Wait for a previously started execution to complete.
   */
  wait<TResult>(
    executionId: string,
    options?: { timeout?: number; waitPollIntervalMs?: number },
  ): Promise<TResult>;

  /**
   * Deliver a signal payload to a waiting workflow execution.
   */
  signal<TPayload>(
    executionId: string,
    signal: string | IEventDefinition<TPayload>,
    payload: TPayload,
  ): Promise<void>;

  /**
   * Schedule a one-time task execution.
   */
  schedule<TInput>(
    task: ITask<TInput, Promise<any>, any, any, any, any> | string,
    input: TInput,
    options: ScheduleOptions,
  ): Promise<string>;

  /**
   * Idempotently create (or update) a recurring schedule (cron/interval).
   * Safe to call on every boot and concurrently across processes.
   */
  ensureSchedule<TInput>(
    task: ITask<TInput, Promise<any>, any, any, any, any> | string,
    input: TInput,
    options: ScheduleOptions & { id: string },
  ): Promise<string>;

  /**
   * Recover incomplete executions on startup.
   */
  recover(): Promise<void>;

  /**
   * Start timer polling (called automatically on init).
   */
  start(): void;

  /**
   * Stop timer polling (called on dispose).
   */
  stop(): Promise<void>;

  // Schedule management
  pauseSchedule(scheduleId: string): Promise<void>;
  resumeSchedule(scheduleId: string): Promise<void>;
  getSchedule(scheduleId: string): Promise<Schedule | null>;
  listSchedules(): Promise<Schedule[]>;
  updateSchedule(
    scheduleId: string,
    updates: { cron?: string; interval?: number; input?: unknown },
  ): Promise<void>;
  removeSchedule(scheduleId: string): Promise<void>;
}
```

---

## File Structure

```
src/node/durable/
├── index.ts                  # Public exports (from `@bluelibs/runner/node`)
├── core/                     # Engine (store is the source of truth)
│   ├── index.ts
│   ├── types.ts
│   ├── CronParser.ts
│   ├── DurableContext.ts
│   ├── DurableService.ts
│   ├── DurableWorker.ts
│   ├── DurableOperator.ts
│   ├── StepBuilder.ts
│   └── interfaces/
├── store/
│   ├── MemoryStore.ts
│   └── RedisStore.ts
├── queue/
│   ├── MemoryQueue.ts
│   └── RabbitMQQueue.ts
├── bus/
│   ├── MemoryEventBus.ts
│   ├── NoopEventBus.ts
│   └── RedisEventBus.ts
└── __tests__/
    ├── DurableContext.test.ts
    ├── DurableService.integration.test.ts
    ├── DurableService.realBackends.integration.test.ts
    ├── MemoryBackends.test.ts
    ├── RabbitMQQueue.mock.test.ts
    ├── RedisEventBus.mock.test.ts
    └── RedisStore.mock.test.ts
```

---

## Production Setup with Redis + RabbitMQ

For production, use Redis for state and pub/sub, and RabbitMQ with quorum queues for durable work distribution.

Install the required Node dependencies:

```bash
npm install ioredis amqplib
```

### Quick Start - Production Configuration

```typescript
import {
  RedisStore,
  RedisEventBus,
  RabbitMQQueue,
  resources,
} from "@bluelibs/runner/node";

// State storage with Redis
const store = new RedisStore({
  redis: process.env.REDIS_URL || "redis://localhost:6379",
  prefix: "durable:",
});

// Pub/Sub with Redis
const eventBus = new RedisEventBus({
  redis: process.env.REDIS_URL || "redis://localhost:6379",
  prefix: "durable:bus:",
});

// Work distribution with RabbitMQ quorum queues
const queue = new RabbitMQQueue({
  url: process.env.RABBITMQ_URL || "amqp://localhost",
  queue: {
    name: "durable-executions",
    quorum: true, // Use quorum queue for durability
    deadLetter: "durable-dlq", // Dead letter queue for failed messages
  },
  prefetch: 10, // Process up to 10 messages concurrently
});

// Create durable resource definition + registration
const durable = resources.redisWorkflow.fork("app-durable");
const durableRegistration = durable.with({
  store,
  eventBus,
  queue,
  worker: true, // starts a queue consumer in this process
  // polling.enabled defaults to true; keep it on for timers/schedules
});
```

If you want API-only nodes to call `start()` / `signal()` / `wait()` **without running the timer poller**, disable polling:

```ts
const durable = resources.redisWorkflow.fork("app-durable");
const durableRegistration = durable.with({
  store,
  eventBus,
  queue,
  worker: false,
  polling: { enabled: false },
});
```

Make sure at least one worker process runs with polling enabled; otherwise sleeps, timeouts, and schedules will never fire.

### RabbitMQ Quorum Queues

**Why quorum queues?**

- **Durability** - Messages survive broker restarts
- **Replication** - Messages are replicated across nodes
- **Consistency** - Stronger guarantees than classic mirrored queues
- **Dead-lettering** - Failed messages go to a DLQ for inspection

```typescript
// queue/RabbitMQQueue.ts

export interface RabbitMQQueueConfig {
  url: string;
  queue: {
    name: string;
    quorum?: boolean; // Use quorum queue (default: true)
    deadLetter?: string; // Dead letter exchange
    messageTtl?: number; // Message TTL in ms
  };
  prefetch?: number; // Consumer prefetch (default: 10)
}

export class RabbitMQQueue implements IDurableQueue {
  constructor(config: RabbitMQQueueConfig);

  async init(): Promise<void> {
    // Creates a quorum queue with:
    // - x-queue-type: quorum
    // - x-dead-letter-exchange: <deadLetter>
    // - durable: true
  }
}
```
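With `amqplib`, the queue assertion behind `init()` plausibly comes down to building standard RabbitMQ queue arguments (`x-queue-type`, `x-dead-letter-exchange`, `x-message-ttl`). The helper below is a sketch of that mapping, not the actual `RabbitMQQueue` internals:

```typescript
// Sketch: builds amqplib-style assertQueue options from the config above.
// The mapping from RabbitMQQueueConfig fields is an assumption.
function buildQueueOptions(cfg: {
  quorum?: boolean;
  deadLetter?: string;
  messageTtl?: number;
}) {
  const args: Record<string, unknown> = {};
  if (cfg.quorum !== false) args["x-queue-type"] = "quorum"; // default: true
  if (cfg.deadLetter) args["x-dead-letter-exchange"] = cfg.deadLetter;
  if (cfg.messageTtl !== undefined) args["x-message-ttl"] = cfg.messageTtl;
  return { durable: true, arguments: args };
}

// With amqplib (not executed here):
// await channel.assertQueue(name, buildQueueOptions(cfg));
```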

### Redis Store Implementation Details

- **Serialization**: `RedisStore` uses Runner's serializer for persistence. This preserves `Date` objects and other complex types, avoiding "time bombs" where dates become strings after being stored.
- **Performance (SCAN vs KEYS)**: All multi-key searches use Redis `SCAN` for non-blocking iteration. This prevents Redis from freezing when thousands of executions are present.
- **Concurrency & Atomicity**:
  - `updateExecution()` uses a Lua script to perform a read/merge/write update atomically.
  - Execution processing is guarded by `acquireLock()` so only one worker runs an execution attempt at a time.
  - Signal delivery (`durable.signal`) and signal waits (`durableContext.waitForSignal`) use a per-execution/per-signal lock when supported by the store, to prevent races between "signal arrives" and "wait is being recorded".
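The merge semantics of `updateExecution()` can be modeled as a shallow read/merge/write: existing fields survive unless the patch overwrites them, and `updatedAt` is always refreshed. The Redis implementation performs this same merge inside a single Lua script so concurrent writers cannot interleave between the read and the write. A sketch of the merge step only (field names are illustrative):

```typescript
// Models the read/merge/write that the Lua script performs atomically.
interface ExecutionRecord {
  id: string;
  status: string;
  attempt: number;
  updatedAt: Date;
}

function mergeExecution(
  existing: ExecutionRecord,
  patch: Partial<ExecutionRecord>,
  now: Date,
): ExecutionRecord {
  // Fields not present in the patch are preserved from the existing record.
  return { ...existing, ...patch, updatedAt: now };
}
```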

### Optimized Client Waiting

When an `IEventBus` (such as `RedisEventBus`) is present, calls to `durable.startAndWait()` or `durable.wait()` use a **reactive, event-driven approach**. The service subscribes to completion events for that specific execution ID, resulting in near-instant response times once the workflow finishes, without constant store polling.
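This reactive wait can be modeled as a registry of promise resolvers keyed by execution ID: `wait()` registers a resolver, and the bus subscription resolves it the moment the completion event for that ID arrives. An illustrative model only (the real service also falls back to polling and handles timeouts):

```typescript
// Illustrative model: completion events resolve waiting promises directly
// instead of the waiter polling the store on an interval.
const waiters = new Map<string, (result: unknown) => void>();

function waitForCompletion(executionId: string): Promise<unknown> {
  return new Promise((resolve) => waiters.set(executionId, resolve));
}

function onCompletionEvent(executionId: string, result: unknown): void {
  const resolve = waiters.get(executionId);
  if (resolve) {
    waiters.delete(executionId);
    resolve(result); // near-instant wakeup, no store polling
  }
}
```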

### Horizontal Scaling

```mermaid
graph TB
    subgraph Clients[API Servers]
        A1[API 1]
        A2[API 2]
    end

    subgraph RabbitMQ[RabbitMQ Cluster]
        Q[(Quorum Queue)]
        DLQ[(Dead Letter Queue)]
    end

    subgraph Redis[Redis Cluster]
        RS[(State Store)]
        RP[(Pub/Sub)]
    end

    subgraph Workers[Worker Pool - Auto-Scaling]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
    end

    A1 -->|enqueue| Q
    A2 -->|enqueue| Q

    Q -->|consume| W1
    Q -->|consume| W2
    Q -->|consume| W3

    W1 <-->|state| RS
    W2 <-->|state| RS
    W3 <-->|state| RS

    RP -.->|notify| W1
    RP -.->|notify| W2
    RP -.->|notify| W3

    Q -->|failed| DLQ
```

**Scaling characteristics:**

- **Workers** - Add more worker instances to increase throughput
- **Queue** - RabbitMQ handles work distribution automatically
- **State** - All workers share state via Redis
- **Events** - Redis pub/sub notifies workers of timer events

### Execution Flow with Queue

```mermaid
sequenceDiagram
    participant C as Client
    participant Q as RabbitMQ
    participant W as Worker
    participant R as Redis

    C->>R: Create execution record
    C->>Q: Enqueue execution message
    C-->>C: Return execution ID

    Note over Q,W: Workers consuming queue

    Q->>W: Deliver message
    W->>R: Acquire lock on execution

    alt Lock acquired
        W->>R: Load execution state
        W->>W: Execute task with DurableContext

        loop For each step
            W->>R: Check step result cache
            alt Cache hit
                R-->>W: Return cached result
            else Cache miss
                W->>W: Execute step
                W->>R: Cache step result
            end
        end

        W->>R: Mark execution complete
        W->>R: Release lock
        W->>Q: Ack message
    else Lock not acquired
        W->>Q: Nack with requeue
    end
```

---

## Integration with Runner Resources

The durable module integrates seamlessly with Runner's resource pattern:

### As a Dependency

```typescript
import { r, run } from "@bluelibs/runner";
import { MemoryStore, resources } from "@bluelibs/runner/node";

const durable = resources.memoryWorkflow.fork("app-durable");
const durableRegistration = durable.with({
  store: new MemoryStore(),
  worker: true, // single-process: also consumes the queue if configured
});

const processOrder = r
  .task("app.tasks.processOrder")
  .dependencies({ durable })
  .run(async (input, { durable }) => {
    const durableContext = durable.use();
    // ... durable task logic
  })
  .build();

const recoverDurable = r
  .resource("app-durable.recover")
  .dependencies({ durable })
  .init(async (_cfg, { durable }) => {
    await durable.recover();
  })
  .build();

const app = r
  .resource("app")
  .register([
    resources.durable,
    durableRegistration,
    processOrder,
    recoverDurable,
  ])
  .build();
await run(app);
```

### Resource Factory Pattern

Runner resources are definitions built at bootstrap time. If you want to pick a store based on environment/config, do it when you create the resource:

```typescript
const store = process.env.REDIS_URL
  ? new RedisStore({ redis: process.env.REDIS_URL })
  : new MemoryStore();

const durable = resources.memoryWorkflow.fork("app-durable");
const durableRegistration = durable.with({ store });
```

### Integration with HTTP Exposure

Expose durable task execution over HTTP using Runner's remote lanes pattern:

```typescript
import { createHttpClient } from "@bluelibs/runner";
import { rpcLanesResource } from "@bluelibs/runner/node";

const durableLane = r
  .rpcLane("app.rpc.durable")
  .applyTo([processOrder])
  .build();

const topology = r.rpcLane.topology({
  profiles: { worker: { serve: [durableLane] } },
  bindings: [{ lane: durableLane, communicator: r.rpcLane.http() }],
});

const app = r
  .resource("app")
  .register([
    durable,
    processOrder,
    rpcLanesResource.with({
      profile: "worker",
      mode: "network",
      topology,
      exposure: {
        http: { basePath: "/__runner", listen: { port: 7070 } },
      },
    }),
  ])
  .build();

// Remote clients can now call durable tasks via HTTP
const client = createHttpClient({ baseUrl: "http://worker:7070/__runner" });
await client.task("app.tasks.processOrder", { orderId: "123" });
```

## Recovery on Startup

```typescript
const recoverDurable = r
  .resource("app-durable.recover")
  .dependencies({ durable })
  .init(async (_cfg, { durable }) => {
    await durable.recover();
  })
  .build();

const app = r
  .resource("app")
  .register([durable, processOrder, recoverDurable])
  .build();
```

The recovery process:

1. Load all incomplete executions (status `pending`, `running`, `sleeping`, or `retrying`)
2. For each, re-execute the task within a new DurableContext
3. The task replays through cached steps automatically
4. Execution continues from where it left off
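Step 1 of this process amounts to filtering executions by non-terminal status; the actual query shape is store-specific, but the selection logic can be sketched as:

```typescript
// Sketch of the recovery filter: only non-terminal executions are replayed.
const INCOMPLETE = new Set(["pending", "running", "sleeping", "retrying"]);

function selectRecoverable<T extends { status: string }>(executions: T[]): T[] {
  return executions.filter((e) => INCOMPLETE.has(e.status));
}
```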

---

## Testing Utilities

Durable exports a small test harness so you can run workflows with in-memory backends while keeping the `run()` semantics you use in production.

```ts
import { r, run } from "@bluelibs/runner";
import {
  createDurableTestSetup,
  resources,
  waitUntil,
} from "@bluelibs/runner/node";

const { durable, durableRegistration, store } = createDurableTestSetup();
const Paid = r.event<{ paidAt: number }>("app.signals.paid").build();

const task = r
  .task("spec.durable.waitForSignal")
  .dependencies({ durable, Paid })
  .run(async (_input: undefined, { durable, Paid }) => {
    const durableContext = durable.use();
    const payment = await durableContext.waitForSignal(Paid);
    return { ok: true, paidAt: payment.paidAt };
  })
  .build();

const app = r
  .resource("spec.app")
  .register([resources.durable, durableRegistration, Paid, task])
  .build();
const runtime = await run(app);
const durableRuntime = runtime.getResourceValue(durable);

const executionId = await durableRuntime.start(task);

await waitUntil(
  async () => (await store.getExecution(executionId))?.status === "sleeping",
  { timeoutMs: 1000, intervalMs: 5 },
);

await durableRuntime.signal(executionId, Paid, { paidAt: Date.now() });
await durableRuntime.wait(executionId);

await runtime.dispose();
```

`createDurableTestSetup` uses `MemoryStore`, `MemoryEventBus`, and an optional `MemoryQueue`, so tests stay fast and isolated.
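A `waitUntil` helper of the shape used above can be sketched as a polling loop with a deadline; this is an assumption about its behavior, not the exported implementation:

```typescript
// Polls `predicate` every `intervalMs` until it returns true or the
// deadline passes; rejects on timeout so tests fail loudly.
async function waitUntil(
  predicate: () => Promise<boolean> | boolean,
  opts: { timeoutMs: number; intervalMs: number },
): Promise<void> {
  const deadline = Date.now() + opts.timeoutMs;
  while (true) {
    if (await predicate()) return;
    if (Date.now() >= deadline) throw new Error("waitUntil: timed out");
    await new Promise((resolve) => setTimeout(resolve, opts.intervalMs));
  }
}
```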

Tip: Use `stepId` for stability in tests without changing behavior, and use `timeoutMs` when you need an explicit timeout outcome.

### Running tests against real backends (Redis + RabbitMQ)

Runner also ships an integration suite that exercises the durable service with real backends (Redis for store + pub/sub and RabbitMQ for queue). This suite is part of the normal Jest test discovery, but it is **skipped by default** to keep local runs hermetic.

To enable it, set `DURABLE_INTEGRATION=1` and provide connection URLs (defaults point to localhost):

```bash
DURABLE_INTEGRATION=1 \
DURABLE_TEST_REDIS_URL=redis://127.0.0.1:6379 \
DURABLE_TEST_RABBIT_URL=amqp://127.0.0.1:5672 \
npm run coverage:ai
```

---

## Comparison with Previous Design

| Aspect              | Previous Design                                             | New Design                                |
| ------------------- | ----------------------------------------------------------- | ----------------------------------------- |
| Components          | 8+ (EventManager, WorkflowEngine, TimerManager, Saga, etc.) | 3 (DurableService, DurableContext, Store) |
| Files               | ~30                                                         | ~12                                       |
| New concepts        | Workflows, Sagas, Compensation, DLQ                         | Just `step()` and `sleep()`               |
| Changes to core     | EventBuilder, TaskBuilder modifications                     | None - pure node extension                |
| Learning curve      | High                                                        | Low                                       |
| Implementation time | 12 weeks                                                    | 2-3 weeks                                 |

## Operator & Observability

> [!NOTE]
> `createDashboardMiddleware` moved out of core and now lives in `@bluelibs/runner-durable-dashboard`.

### What is the store?

The **durable store** (`IDurableStore`) is the persistence layer for durable workflows. It is responsible for saving and loading:

- executions (id, task id, input, status, attempt/error, timestamps)
- step results (memoized outputs for `durableContext.step(...)`)
- timers and schedules (for `sleep`, signal timeouts, cron/interval scheduling)
- optional audit entries (timeline), and optional operator actions (manual interventions)

You provide a store implementation when you create the durable resource/service:

- `MemoryStore` — in-memory, great for local dev/tests (state is lost on restart)
- `RedisStore` — Redis-backed, appropriate for production durability

### What is `DurableOperator`?

`DurableOperator` is an **operations/admin helper** around the store. It does not execute workflows; it reads/writes durable state to support external tooling and manual interventions:

- query executions for listing (filters/pagination)
- load execution details (execution + step results + audit)
- operator actions: retry rollback, skip steps, force fail, patch a step result

You can use `DurableOperator` as the backend contract for your own operational UI or APIs.

### Audit trail (timeline)

In addition to `StepResult` records, durable can persist a structured audit trail as the workflow runs:

- execution status transitions (pending/running/sleeping/retrying/completed/failed/cancelled)
- step completions (with durations)
- sleep scheduled/completed
- signal waiting/delivered/timed-out
- user-added notes via `durableContext.note(...)`

This is implemented via optional `IDurableStore` capabilities:

- Enable it via `resources.memoryWorkflow.fork("app-durable").with({ audit: { enabled: true }, ... })` or `resources.redisWorkflow.fork("app-durable").with({ audit: { enabled: true }, ... })` (default: off).
- `appendAuditEntry(entry)`
- `listAuditEntries(executionId)`

Notes are replay-safe: if the workflow replays after a suspend, the same `durableContext.note(...)` call does not create duplicates.

### Stream audit entries via Runner events (for mirroring)

If you want to mirror audit entries to cold storage (S3/Glacier/Postgres), enable:

- `audit: { enabled: true, emitRunnerEvents: true }`

Then listen to Runner events (they are excluded from `on("*")` global hooks by default, so subscribe explicitly):

```ts
import { r } from "@bluelibs/runner";
import { durableEvents } from "@bluelibs/runner/node";

const mirrorAudit = r
  .hook("app.hooks.durableAuditMirror")
  .on(durableEvents.audit.appended)
  .run(async (event) => {
    const { entry } = event.data;
    // write entry to your cold store (idempotent by entry.id)
  })
  .build();
```
2177
+
2178
+ ---
2179
+
2180
+ ## Gotchas & Troubleshooting
2181
+
2182
+ - **Always put side effects inside `durableContext.step(...)`**: anything outside a step can run multiple times on retries/replays.
2183
+ - **Keep step ids stable**: renaming a step id (or changing control-flow so a different call order happens) can break replay determinism for existing executions.
2184
+ - **Call-order indexing is real**: `emit()` and repeated `waitForSignal()` allocate `:<index>` internally based on call order; refactors that add/remove calls can shift indexes.
2185
+ - **Signals are "deliver to current wait"**: `durableService.signal(executionId, ...)` delivers to the base signal slot if it's not completed yet (this can buffer the first signal even if the workflow hasn't reached the wait). Additional signals only deliver to subsequent indexed waits; otherwise they are ignored.
2186
+ - **Don't hang forever**: prefer `durableService.wait(executionId, { timeout: ... })` unless you intentionally want an unbounded wait.
2187
+ - **Compensation failures are terminal**: if `durableContext.rollback()` fails, execution becomes `compensation_failed` and `wait()` rejects. Use `DurableOperator.retryRollback(executionId)` after fixing the underlying issue.
2188
+ - **Intervals can overlap**: interval schedules are currently measured from kickoff time, not completion time. If you need non-overlapping behavior, implement it via `durableContext.sleep()` inside the workflow.
2189
+ - **Debugging**: inspect step results + timers via `DurableOperator`/store queries (Redis keys are prefixed by `durable:` by default).
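The call-order indexing gotcha can be modeled as a per-id counter: the first use of a logical id stays bare, later uses get a `:<index>` suffix, which is why inserting or removing an `emit()`/`waitForSignal()` call shifts the slots of executions already in flight. An illustrative model, not the engine's internals:

```typescript
// Models how repeated emit()/waitForSignal() calls get indexed slot ids.
// First use of an id stays bare; subsequent uses append ":<n>".
const counters = new Map<string, number>();

function allocateSlotId(baseId: string): string {
  const n = counters.get(baseId) ?? 0;
  counters.set(baseId, n + 1);
  return n === 0 ? baseId : `${baseId}:${n}`;
}
```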

## Idempotency & Deduplication

There are two different "idempotency" problems:

1. **Workflow-level deduplication (start only once)**

   - `start(task, input, { idempotencyKey })` supports a store-backed **"start-or-get"** mode.
   - It returns the same `executionId` for the same `{ taskId, idempotencyKey }` pair, even if multiple callers race.
   - Important: subsequent calls return the existing `executionId` and do **not** overwrite the originally stored `input`.
   - Store support: `MemoryStore` and `RedisStore` implement this. Custom stores must implement `getExecutionIdByIdempotencyKey` / `setExecutionIdByIdempotencyKey`.
   - You should still persist the returned `executionId` in your domain model for observability and to make webhook handling trivial.

2. **Schedule-level deduplication (create a schedule only once)**

   - Use `ensureSchedule(...)` with a stable `id`. It is designed to be safe to call on every boot and concurrently across processes.

If you need workflow-level dedupe by business key (for example `orderId`), use it as the `idempotencyKey` (for example `order:${orderId}`), and store the returned `executionId` on the record as well.
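The start-or-get behavior can be modeled with a key-to-executionId map: the first caller creates the execution, later (or racing) callers with the same key get the same id back, and the originally stored input is never overwritten. A simplified model of the store contract, not the real implementation:

```typescript
// Simplified model of start-or-get deduplication by idempotency key.
const byKey = new Map<string, string>();
const inputs = new Map<string, unknown>();
let nextId = 0;

function startOrGet(
  taskId: string,
  input: unknown,
  idempotencyKey: string,
): string {
  const key = `${taskId}:${idempotencyKey}`;
  const existing = byKey.get(key);
  if (existing) return existing; // the original input is NOT overwritten
  const executionId = `exec-${nextId++}`;
  byKey.set(key, executionId);
  inputs.set(executionId, input);
  return executionId;
}
```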

## Cancellation (and why it's tricky)

Durable exposes a first-class cancellation API:

- `durableService.cancelExecution(executionId, reason?)`

Semantics:

- Cancellation is **cooperative**, not preemptive: Node cannot reliably interrupt arbitrary async work.
- Cancelling marks the execution as terminal (`cancelled`), unblocks `wait()` / `startAndWait()`, and prevents future resumes (timers/signals won't continue it).
- Already-running code will only stop at the next durable checkpoint (for example the next `durableContext.step(...)`, `durableContext.sleep(...)`, `durableContext.waitForSignal(...)`, or `durableContext.emit(...)`).
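Cooperative cancellation can be modeled as a status check at each durable checkpoint: running code proceeds until it next calls into the context, which consults the stored status and aborts instead of doing more work. Illustrative only; the real engine surfaces this through its own error types:

```typescript
// Models a checkpoint: each durable primitive first checks whether the
// execution was cancelled and aborts instead of continuing.
const statuses = new Map<string, string>();

function checkpoint(executionId: string): void {
  if (statuses.get(executionId) === "cancelled") {
    throw new Error(`execution ${executionId} was cancelled`);
  }
}
```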

Administrative alternatives still exist:

- `DurableOperator.forceFail(executionId)` is a blunt instrument that stops the execution and marks it `failed`.

## What This Design Deliberately Excludes

1. **Exactly-once external side effects** – The system provides at-least-once execution with effectively-once steps; true exactly-once semantics at the boundary (e.g., payment processors) are left to idempotent APIs and application logic.
2. **Event sourcing** – Steps are modeled as checkpoints, not a full event stream. This keeps the model simple.
3. **Automatic saga orchestration DSLs** – There is no separate workflow language or visual designer. Compensation is regular TypeScript code using `try/catch` and `durableContext.step`.
4. **Built-in dashboards** – Not included in core; observability UIs are intentionally external to the runtime package.
5. **Cross-region or multi-tenant sharding logic** – Multi-region replication and advanced topology concerns are out of scope for v1.

Also intentionally minimal in v1:

6. **Preemptive cancellation** – Cancellation is cooperative (checkpoints), not an interrupt/kill mechanism for arbitrary in-flight async work.
7. **Advanced visibility indexes** – `listExecutions` is operator-oriented and not a full-blown search/indexing system.
8. **Cron timezone & misfire policies** – Cron is evaluated using the process environment defaults; DST/timezone/misfire handling is not configurable yet.

These can all be added in future versions if needed, without changing the core `DurableContext` and `DurableService` APIs.

---

## Why This is Better

1. **Fits Runner's philosophy** - No new concepts, just enhanced tasks
2. **No magic** - What you see is what you get
3. **Explicit over implicit** - Compensation is code, not configuration
4. **Simple mental model** - `step()` = checkpoint, that's it
5. **Easy to understand** - Read the code, know what happens
6. **Easy to test** - MemoryStore for tests, no external dependencies
7. **Easy to debug** - Each step is recorded, replay is deterministic