chrono_forge 0.9.1 → 0.10.0

Files changed (41)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +22 -0
  3. data/README.md +305 -44
  4. data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md +1748 -0
  5. data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md.tasks.json +17 -0
  6. data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md +930 -0
  7. data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md.tasks.json +54 -0
  8. data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md +241 -0
  9. data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md.tasks.json +12 -0
  10. data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md +1378 -0
  11. data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md.tasks.json +67 -0
  12. data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md +709 -0
  13. data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md.tasks.json +19 -0
  14. data/docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md +226 -0
  15. data/docs/superpowers/specs/2026-06-25-chrono_forge-dashboard-design.md +190 -0
  16. data/docs/superpowers/specs/2026-06-25-composite-retry-policies-design.md +228 -0
  17. data/docs/superpowers/specs/2026-06-25-reserved-kwarg-guard-design.md +169 -0
  18. data/docs/superpowers/specs/2026-06-25-spawn-merge-branches-design.md +468 -0
  19. data/docs/superpowers/specs/2026-06-26-dashboard-branch-view-design.md +142 -0
  20. data/docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md +265 -0
  21. data/lib/chrono_forge/branch_merge_job.rb +138 -0
  22. data/lib/chrono_forge/branch_probe.rb +26 -0
  23. data/lib/chrono_forge/cleanup.rb +6 -0
  24. data/lib/chrono_forge/execution_log.rb +6 -0
  25. data/lib/chrono_forge/executor/composite_retry_policy.rb +47 -0
  26. data/lib/chrono_forge/executor/methods/branch.rb +185 -0
  27. data/lib/chrono_forge/executor/methods/durably_execute.rb +21 -19
  28. data/lib/chrono_forge/executor/methods/durably_repeat.rb +118 -25
  29. data/lib/chrono_forge/executor/methods/merge_branches.rb +83 -0
  30. data/lib/chrono_forge/executor/methods/wait.rb +2 -4
  31. data/lib/chrono_forge/executor/methods/wait_until.rb +25 -25
  32. data/lib/chrono_forge/executor/methods/workflow_states.rb +16 -0
  33. data/lib/chrono_forge/executor/methods.rb +2 -0
  34. data/lib/chrono_forge/executor/retry_policy.rb +111 -0
  35. data/lib/chrono_forge/executor.rb +216 -28
  36. data/lib/chrono_forge/version.rb +1 -1
  37. data/lib/chrono_forge/workflow.rb +10 -1
  38. data/lib/generators/chrono_forge/migration_actions.rb +1 -0
  39. data/lib/generators/chrono_forge/templates/add_chrono_forge_parent_execution_log.rb +38 -0
  40. metadata +42 -5
  41. data/lib/chrono_forge/executor/retry_strategy.rb +0 -29
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 482a39dce49e0e88ddc0f7341b2806c7893376863caa658eccdd6111c12a4d62
- data.tar.gz: aab4017a4b0f90ff68779a0b21b660e34a10ba53a28325f7bcbb801251796ebb
+ metadata.gz: f03445b6275e345beb34505d4d59a01d8450df220e94f07bc909c8c69059ab8d
+ data.tar.gz: 9ba7aaa7364736f66778da68af4f21d7c944ac01270c7b92ca76bf09bf880738
  SHA512:
- metadata.gz: 1c1a271dccf9c204633846bc81126b327efb1c176b90181c0dece715f99471a021054f8a892f1a223b11655ededeadc720d34a1a67b7e8ca7cf47a6f9749c175
- data.tar.gz: 72745635bb4e34011c3ed735dfa3d648e88362824463b34bd719b32727c7d36602d130044861b882abb9685633231a57ce9a223771f08855a9938b1811db8405
+ metadata.gz: f761f180b4e8323721cfffc0a7c2569f30ea8e5b8e085cc52ab32aeefa64ed2a45caac13dff38c168ae497c437816201cdf5c2946a85ab987814eecd852d97c6
+ data.tar.gz: 22ca2b2ca99188b5117c06e2d9b313e726e0087ed48d3635d1064bcb82eee69b9f649a56a441a42e603c3b32ceb98c1aa788782dc01c5d2066f09dae0593900b
data/CHANGELOG.md CHANGED
@@ -1,5 +1,27 @@
1
1
  ## [Unreleased]
2
2
 
3
+ ## [0.10.0] - 2026-06-27
4
+
5
+ ### Added
6
+
7
+ - **Concurrent sub-workflows** — `branch` blocks plus `spawn` / `spawn_each` dispatch child workflows that run in parallel, joined later with `merge_branches` (or an inline automerge at branch-block close); the completion gate raises if a branch is left unmerged. `spawn_each` streams bulk dispatch with a resumable cursor and keys AR-sourced children by record PK, so a crash mid-dispatch resumes without re-running completed children. Adds the `parent_execution_log_id` column + `[parent_execution_log_id, state]` index (additive migration, installed by the `chrono_forge:upgrade` generator), with `ExecutionLog#spawned_workflows` / `Workflow#parent_execution_log` associations. Join progress is driven by `ChronoForge::BranchMergeJob`, a lightweight poller that holds no lock and never replays the parent: it re-arms on each pass (fenced by a per-pass `poll_token` so a superseded chain stops quietly), rekicks dropped child jobs, and records observable poll state on the branch logs. Requires `activejob >= 7.1`.
8
+ - `ChronoForge::Executor::RetryPolicy` — a single, unified retry abstraction (attempt cap + exponential-with-jitter backoff + error-class predicate) used by every retry site: workflow-level uncaught errors, `durably_execute`, `durably_repeat`, and `wait_until` condition errors. Replaces the three previously-independent retry systems and two backoff algorithms.
9
+ - Class-level `retry_policy` DSL to set a workflow's default retry policy, plus a per-call `retry_policy:` keyword on `durably_execute`, `durably_repeat`, and `wait_until`. Resolution is per-call → class default → per-site built-in. `wait_until` deliberately does not inherit the class default (so a class-wide "retry everything" can't silently retry condition-evaluation bugs).
10
+ - **Composite retry policies** — pass an ordered array of `RetryPolicy` objects (per-call, or to the class-level `retry_policy` DSL as positional args) to give each error type its own independent attempt budget and backoff. The first policy whose `retry_on` matches the raised error wins (subclasses route to the policy that lists their ancestor; a trailing `retry_on: nil` is a catch-all; an unmatched error fails fast). Per-error counts are keyed by each policy's declared errors (`RetryPolicy#budget_key`) and persisted in execution-log metadata (steps) or the job args (workflow-level), so budgets are stable across replays and policy reordering. `RetryPolicy.compose(*policies)` builds one explicitly.
11
+
12
+ ### Changed
13
+
14
+ - **Performance:** completed steps are now resolved from a single bulk read per replay instead of one indexed `SELECT` each. On every resume the engine replays the whole workflow body; previously each already-completed step cost its own lookup, so a workflow with hundreds of steps paid hundreds of `SELECT`s per resume (quadratic over its lifetime). Completed steps are now plucked once into a per-pass cache and short-circuited from a readonly, unsaved stand-in (no row, no round-trip); only not-yet-completed steps still hit the database. `durably_repeat` repetition logs are deliberately excluded from the cache — they accumulate without bound yet are never replayed — so repeat-heavy workflows don't pull their history into memory.
15
+ - **BREAKING:** `durably_execute` and `durably_repeat` no longer accept `max_attempts:`; `wait_until` no longer accepts `retry_on:`. All three now take `retry_policy:` (a `RetryPolicy`). Migrate `max_attempts: N` → `retry_policy: RetryPolicy.new(max_attempts: N)` and `retry_on: [...]` → `retry_policy: RetryPolicy.new(retry_on: [...])`.
16
+ - **BREAKING:** backoff is now exponential with jitter everywhere (previously the workflow level used a fixed array declared as `[1s,5s,30s,2m,10m]` — though the `should_retry? < 3` bug meant only its first three entries `[1s,5s,30s]` were ever reached — and steps used `2**n` capped at 32s). Workflow-level retries default to 10 attempts with a tolerant window of up to ~8.5 min (≈4 min typical with jitter; cap 600s) — wide enough to ride out a transient infra blip (DB failover, deploy restart) on an uncaught `perform` error, since each such retry replays the whole workflow. A *permanently* failing workflow is now retried 10 times before reaching `failed` (vs the previous effective 4). Note this path covers only uncaught errors in `perform`; a step exhausting its own retries stalls the workflow instead.
17
+
18
+ ### Fixed
19
+
20
+ - Continuation jobs are now published only **after** the workflow lock is released. Every deferral primitive (`wait`, `wait_until`, `durably_execute` retry, `durably_repeat`, and the workflow-level retry) previously enqueued its continuation inline, while the enqueuing job still held the lock; an immediately-runnable (`delay == 0`) same-key continuation could be claimed by another worker before the lock was released, surfacing as a spurious `ConcurrentExecutionError` at lock acquisition. The continuation is now recorded during the run and flushed in the executor's `ensure` block after `release_lock`, closing the race.
21
+ - `durably_repeat` catch-up is now O(1) for the skippable run instead of O(missed intervals). When a workflow resumes far behind schedule, the **expired prefix** (ticks older than `timeout`) is fast-forwarded in closed form to the first non-expired grid tick, rather than walking one zero-delay job per missed tick. **Behavior change:** the expired prefix now produces a single summary execution log (`error_class: "TimeoutError"`, `metadata["fast_forwarded"]` = number of ticks skipped) instead of one `"Execution timed out"` row per tick — update any dashboards or alerts that key off per-tick timeout rows. Ticks still inside their `timeout` window continue to execute as normal catch-up work.
22
+ - Workflow-level retry no longer has a contradictory cap (`should_retry?` stopped at 3 while `RetryStrategy.max_attempts` was 5, making the array's `2m`/`10m` entries unreachable). The single `RetryPolicy` is now the sole decider.
23
+ - Removed the dead `retry_method:` argument that `durably_execute` passed on reschedule but `perform` never bound.
24
+
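The closed-form fast-forward described in the `durably_repeat` catch-up entry above can be sketched roughly as follows. This is an illustration under stated assumptions, not the gem's implementation: the helper name `fast_forward`, its keyword arguments, integer-second timestamps, and the expiry rule (a tick is expired once `now` is more than `timeout` past it) are all hypothetical.

```ruby
# Hypothetical helper: skip the expired prefix of a fixed-interval grid in O(1).
# All values are integer seconds. A tick counts as expired when
# now - tick > timeout (an assumed rule, not taken from the gem).
def fast_forward(next_tick:, every:, timeout:, now:)
  cutoff = now - timeout                      # ticks strictly before this are expired
  return [next_tick, 0] if next_tick >= cutoff

  # Integer ceiling division: whole intervals needed to reach the cutoff.
  skipped = (cutoff - next_tick + every - 1) / every
  [next_tick + skipped * every, skipped]
end

# A workflow due at t=0, ticking every 60s with a 300s timeout, resumes at t=3600.
tick, skipped = fast_forward(next_tick: 0, every: 60, timeout: 300, now: 3_600)
```

The second return value is the kind of count the entry says is recorded in `metadata["fast_forwarded"]`; the cost is one division regardless of how many ticks were missed.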
3
25
  ## [0.9.1] - 2026-06-25
4
26
 
5
27
  ### Fixed
data/README.md CHANGED
@@ -7,20 +7,75 @@
7
7
 
8
8
  > A robust framework for building durable, distributed workflows in Ruby on Rails applications
9
9
 
10
- ChronoForge provides a powerful solution for handling long-running processes, managing state, and recovering from failures in your Rails applications. Built on top of ActiveJob, it ensures your critical business processes remain resilient and traceable.
10
+ ChronoForge handles long-running processes, manages state, and recovers from failures in your Rails applications. Built on ActiveJob, it keeps critical business processes resilient and traceable.
11
+
12
+ Workflows are **plain Ruby**. Ordinary `if`/`else`, loops, and early returns drive the flow. There's no declarative DSL to learn and no extra service to run, which makes ChronoForge a good fit for business processes whose shape depends on runtime state: conditional branches, iteration over data, and built-in periodic tasks (`durably_repeat`).
13
+
14
+ > **In production** at **achieve by Petra**, an investment platform in the Petra Group — where it has executed over 3.6 million workflows and 32 million durable steps across scheduled payments, investment rollovers, and membership lifecycle management.
15
+
16
+ ## 🧭 Why ChronoForge
17
+
18
+ Most Rails workflow tools ask you to declare your steps up front in a DSL:
19
+
20
+ ```ruby
21
+ step :send_welcome_email
22
+ step :remind_of_tasks, wait: 2.days
23
+ step :complete_onboarding, wait: 15.days
24
+ ```
25
+
26
+ That reads cleanly for a fixed, linear sequence. But many business processes branch, loop, and react to data that only exists at runtime, and a declarative schema gets awkward there. ChronoForge takes the opposite approach: **a workflow is just a Ruby method.** Conditionals, iteration, early returns, and helper methods all work the way they normally do.
27
+
28
+ There is a real trade-off. Because the flow is ordinary code, ChronoForge can show the steps that **have run** (a replay/history view), but not a roadmap of steps that *haven't* run yet, which a declarative engine can. For workflows whose path isn't fixed in advance, that's a trade worth making; for a simple, fixed sequence ("send email, wait 2 days, send another"), a declarative DSL may read more cleanly, and that's a fine reason to reach for one.
29
+
30
+ ### How it compares
31
+
32
+ | | ChronoForge | GenevaDrive | AcidicJob | Temporal |
33
+ | ---------------------------- | -------------------- | ------------------ | --------------- | --------------- |
34
+ | Programming model | procedural (plain Ruby) | declarative DSL | declarative DSL | procedural (via SDK) |
35
+ | Built-in periodic tasks | ✓ `durably_repeat` | ✗ | ✗ | ✓ |
36
+ | Pending-step visibility | ✗ (procedural) | ✓ | ✓ | ✗ (procedural) |
37
+ | Extra infrastructure | none (DB + ActiveJob)| none | none | server required |
38
+ | License | MIT | LGPL / commercial | MIT | MIT |
39
+
40
+ <sub>Comparison reflects each project's documented features as of mid-2026, to the best of our knowledge; corrections welcome via PR.</sub>
41
+
42
+ A few deliberate choices behind that table:
43
+
44
+ - **Periodic tasks are built in.** `durably_repeat` runs a step on a schedule until a condition holds, with automatic catch-up for missed runs, so a workflow can be its own recurring job and cron-style monitor, right alongside the rest of its logic. Without built-in support, periodic behavior usually lives in a separate scheduler that you reconcile with workflow state by hand.
45
+ - **No extra infrastructure.** ChronoForge is a gem over your existing database and ActiveJob backend. There's no separate server or daemon to operate, unlike Temporal.
46
+ - **Recovery is built into the model.** Steps are append-only history, so a crashed step leaves the workflow `stalled`, recoverable directly with `retry_later`.
47
+ - **MIT licensed.** Permissive and dependency-policy-friendly.
11
48
 
12
49
  ## 🌟 Features
13
50
 
51
+ - **Plain-Ruby control flow**: Branching, loops, and iteration over runtime data, without a DSL or step registry
14
52
  - **Durable Execution**: Automatically tracks and recovers from failures during workflow execution
53
+ - **Periodic tasks built in**: `durably_repeat` runs a step on an interval until a condition is met, with catch-up for missed runs. Acts as a recurring task and a cron-style monitor in one
54
+ - **Wait States**: Time-based waits and condition-based waiting (`wait_until`) that survive restarts
15
55
  - **State Management**: Built-in workflow state tracking with persistent context storage
16
56
  - **Concurrency Control**: Advanced locking mechanisms to prevent parallel execution of the same workflow
17
- - **Error Handling**: Comprehensive error tracking with configurable retry strategies
57
+ - **Error Handling**: Error tracking with a unified, configurable [`RetryPolicy`](#-retry-policies) (including per-error-type policies)
18
58
  - **Execution Logging**: Detailed logging of workflow steps and errors for visibility
19
- - **Wait States**: Support for time-based waits and condition-based waiting
20
- - **Database-Backed**: All workflow state is persisted to ensure durability
59
+ - **Database-Backed**: All workflow state is persisted to ensure durability, with no extra services to run
21
60
  - **ActiveJob Integration**: Compatible with all ActiveJob backends, though database-backed processors (like Solid Queue) provide the most reliable experience for long-running workflows
22
61
  - **Retention & Cleanup**: A schedulable job to prune finished workflows and the unbounded logs that periodic tasks accumulate (see [Cleanup & Retention](#-cleanup--retention))
23
62
 
63
+ ## 🖥️ Dashboard
64
+
65
+ ChronoForge has a free, mountable dashboard for visibility and recovery: workflow list, step replay timeline, context inspector, periodic-task health, wait-state age, and retry/unlock actions. It ships as a separate gem, `chrono_forge-dashboard`, so the core stays lean.
66
+
67
+ [![ChronoForge dashboard](chrono_forge-dashboard/docs/screenshots/workflows.png)](chrono_forge-dashboard/README.md#screenshots)
68
+
69
+ ```ruby
70
+ # Gemfile
71
+ gem "chrono_forge-dashboard"
72
+
73
+ # config/routes.rb
74
+ mount ChronoForge::Dashboard::Engine, at: "/chrono_forge"
75
+ ```
76
+
77
+ See [`chrono_forge-dashboard`](chrono_forge-dashboard/README.md) for setup, authentication, and [more screenshots](chrono_forge-dashboard/README.md#screenshots).
78
+
24
79
  ## 📦 Installation
25
80
 
26
81
  Add to your application's Gemfile:
@@ -136,6 +191,54 @@ class OrderProcessingWorkflow < ApplicationJob
136
191
  end
137
192
  ```
138
193
 
194
+ ### A workflow you can't flatten into a step list
195
+
196
+ The example above is linear, but most real processes aren't. Because a ChronoForge workflow is plain Ruby, branching and dynamic iteration are just… branching and iteration:
197
+
198
+ ```ruby
199
+ class OrderProcessingWorkflow < ApplicationJob
200
+ prepend ChronoForge::Executor
201
+
202
+ def perform(order_id:)
203
+ @order_id = order_id
204
+
205
+ wait_until :payment_confirmed?
206
+ durably_execute :validate_order
207
+
208
+ # Runtime branching: the path depends on data known only at execution time
209
+ if context["requires_compliance_check"]
210
+ durably_execute :run_compliance_review
211
+ wait_until :compliance_approved?, timeout: 48.hours
212
+ end
213
+
214
+ # Iterate over runtime data: one durable, idempotent step per item
215
+ context["line_item_ids"].each do |item_id|
216
+ context["current_item_id"] = item_id
217
+ durably_execute :fulfill_item, name: "fulfill_#{item_id}"
218
+ end
219
+
220
+ # Recurring notification: nudge the customer until they confirm delivery
221
+ durably_repeat :send_delivery_reminder, every: 3.days, till: :delivery_confirmed?
222
+
223
+ durably_execute :complete_order
224
+ end
225
+
226
+ private
227
+
228
+ def fulfill_item
229
+ FulfillmentService.fulfill(@order_id, context["current_item_id"])
230
+ end
231
+
232
+ def send_delivery_reminder
233
+ OrderMailer.delivery_reminder(@order_id).deliver_later
234
+ end
235
+
236
+ # ... other condition and step methods ...
237
+ end
238
+ ```
239
+
240
+ Each `durably_execute` is checkpointed by its step name, so on resume the completed branches and items are skipped and the workflow continues where it left off. A fixed, declared list of steps can't easily express runtime branches, a loop over a runtime-sized collection, and an open-ended recurring notification.
241
+
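The name-keyed checkpointing described above can be pictured with a toy model. This is only a sketch: `ToyReplay` and its in-memory hash are hypothetical stand-ins for ChronoForge's executor and its persisted execution logs.

```ruby
# Toy model of name-keyed checkpointing; not ChronoForge's executor.
class ToyReplay
  attr_reader :calls

  def initialize(completed)
    @completed = completed # stands in for persisted execution logs
    @calls = []
  end

  # Run the block once per unique step name; skip it on later replays.
  def durably_execute(name)
    return if @completed[name]

    yield
    @calls << name
    @completed[name] = true
  end
end

store = {}
workflow = lambda do |run, fail_on: nil|
  [11, 12, 13].each do |item_id|
    run.durably_execute("fulfill_#{item_id}") do
      raise "boom" if item_id == fail_on
    end
  end
end

first = ToyReplay.new(store)
begin
  workflow.call(first, fail_on: 12) # crashes on the second item
rescue RuntimeError
  # the real engine would retry or stall here; the toy just stops
end

second = ToyReplay.new(store)
workflow.call(second) # replay: item 11 is skipped, 12 and 13 run
```

The whole body runs again on the second pass, which is why each loop iteration needs its own step name.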
139
242
  ### Core Workflow Features
140
243
 
141
244
  #### 🚀 Executing Workflows
@@ -162,14 +265,15 @@ OrderProcessingWorkflow.perform_later(
162
265
 
163
266
  #### ⚡ Durable Execution
164
267
 
165
- The `durably_execute` method ensures operations are executed exactly once with automatic retry logic and fault tolerance:
268
+ The `durably_execute` method runs an operation with automatic retries, and skips it on replay once it has completed:
166
269
 
167
270
  ```ruby
168
271
  # Basic execution
169
272
  durably_execute :send_welcome_email
170
273
 
171
- # With custom retry attempts
172
- durably_execute :critical_payment_processing, max_attempts: 5
274
+ # With a custom retry policy
275
+ durably_execute :critical_payment_processing,
276
+ retry_policy: RetryPolicy.new(max_attempts: 5)
173
277
 
174
278
  # With custom name for tracking multiple calls to same method
175
279
  durably_execute :upload_file, name: "profile_image_upload"
@@ -182,10 +286,10 @@ class FileProcessingWorkflow < ApplicationJob
182
286
  @file_id = file_id
183
287
 
184
288
  # This might fail due to network issues, rate limits, etc.
185
- durably_execute :upload_to_s3, max_attempts: 5
289
+ durably_execute :upload_to_s3, retry_policy: RetryPolicy.new(max_attempts: 5)
186
290
 
187
291
  # Process file after successful upload
188
- durably_execute :generate_thumbnails, max_attempts: 3
292
+ durably_execute :generate_thumbnails, retry_policy: RetryPolicy.new(max_attempts: 3)
189
293
  end
190
294
 
191
295
  private
@@ -204,9 +308,77 @@ end
204
308
 
205
309
  **Key Features:**
206
310
  - **Idempotent**: Same operation won't be executed twice during replays
207
- - **Automatic Retries**: Failed executions retry with exponential backoff (2^attempt seconds, capped at 32s)
311
+ - **Automatic Retries**: Failed executions retry per a unified `RetryPolicy` (exponential backoff with jitter; the step default caps at 30s over 3 attempts)
208
312
  - **Error Tracking**: All failures are logged with detailed error information
209
- - **Configurable**: Customize retry attempts and step naming
313
+ - **Configurable**: Pass a `retry_policy:` per call, or set a class-wide default with the `retry_policy` DSL (see [Retry Policies](#-retry-policies))
314
+
315
+ #### 🔁 Retry Policies
316
+
317
+ All retrying in ChronoForge goes through a single `RetryPolicy` (`ChronoForge::Executor::RetryPolicy`). It answers two questions: *should this failure be retried?* and *how long until the next attempt?*
318
+
319
+ ```ruby
320
+ RetryPolicy.new(
321
+ max_attempts: 3, # cap on total attempts; nil = no count cap (bounded elsewhere)
322
+ base: 1, # seconds; delay of the first retry
323
+ cap: 30, # seconds; ceiling for a single delay
324
+ jitter: true, # spread retries with equal jitter
325
+ retry_on: nil # nil = retry any StandardError; [Classes] = only those; [] = none
326
+ )
327
+ ```
328
+
329
+ Backoff is exponential with equal jitter, computed once at re-enqueue time (never replayed, so it stays deterministic where it matters).
330
+
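As a rough illustration of exponential backoff with equal jitter, the sketch below doubles the raw delay per retry up to `cap`, keeps half of it fixed, and randomizes the other half. The function name `backoff_delay` and the exact formula are assumptions; the gem's own computation may differ. The last line checks the documented workflow-level window, assuming a base of 1 second (not stated in the source, but consistent with the ~8.5 min figure).

```ruby
# Assumed formula: raw delay doubles per retry up to `cap`; equal jitter keeps
# half of it fixed and draws the other half uniformly at random.
def backoff_delay(retry_number, base: 1, cap: 30, jitter: true, rng: Random.new)
  raw = [base * (2**(retry_number - 1)), cap].min
  return raw.to_f unless jitter

  half = raw / 2.0
  half + rng.rand * half # uniform in [raw / 2, raw)
end

# Upper bound on the workflow-level window: 10 attempts means 9 retries,
# here with an assumed base of 1s, the documented 600s cap, and no jitter.
max_window = (1..9).sum { |n| backoff_delay(n, base: 1, cap: 600, jitter: false) }
```

Under these assumptions `max_window` is 511 seconds, which matches the "up to ~8.5 min" figure in the defaults table; jittered delays can only shorten it.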
331
+ **Resolution order:**
332
+
333
+ - **`durably_execute`, `durably_repeat`, workflow-level errors**: per-call `retry_policy:` → class-level `retry_policy` default → built-in default.
334
+ - **`wait_until`**: per-call `retry_policy:` → built-in default. It deliberately does **not** inherit the class default, so a class-wide "retry everything" can't silently turn condition-evaluation bugs into retried errors.
335
+
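The resolution order above can be sketched as a small lookup. Everything here is illustrative: `resolve_policy`, the site symbols, and the hash-based `BUILT_IN` table are hypothetical, with the defaults taken from the table in this section.

```ruby
# Hypothetical resolver mirroring the documented order; names are illustrative.
BUILT_IN = {
  step: {max_attempts: 3, cap: 30},
  workflow: {max_attempts: 10, cap: 600},
  wait_until: {retry_on: []} # retry nothing by default
}.freeze

def resolve_policy(site, per_call: nil, class_default: nil)
  return per_call if per_call
  # wait_until deliberately skips the class-level default.
  return class_default if class_default && site != :wait_until

  BUILT_IN.fetch(site)
end

class_default = {max_attempts: 5, base: 2, cap: 60}
step_policy = resolve_policy(:step, class_default: class_default)
wait_policy = resolve_policy(:wait_until, class_default: class_default)
```

Here `step_policy` picks up the class default while `wait_policy` falls through to the built-in "retry nothing" entry.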
336
+ **Built-in defaults:**
337
+
338
+ | Site | Default | Why |
339
+ |------|---------|-----|
340
+ | Steps (`durably_execute`/`durably_repeat`) | 3 attempts, cap 30s, retry any error | flaky calls fail fast |
341
+ | Workflow-level (uncaught errors) | 10 attempts, cap 600s, retry any error | tolerant window up to ~8.5 min (≈4 min typical w/ jitter) for transient infra errors; each retry replays the whole workflow from the top |
342
+ | `wait_until` condition errors | retry nothing | a raised condition is usually a bug, not transient |
343
+
344
+ **Class-wide default via the `retry_policy` DSL:**
345
+
346
+ ```ruby
347
+ class ChargeWorkflow < ApplicationJob
348
+ prepend ChronoForge::Executor
349
+ retry_policy max_attempts: 5, base: 2, cap: 60 # applies to steps + workflow-level
350
+
351
+ def perform
352
+ durably_execute :charge,
353
+ retry_policy: RetryPolicy.new(max_attempts: 8, retry_on: [Net::OpenTimeout])
354
+ wait_until :settled?,
355
+ retry_policy: RetryPolicy.new(retry_on: [BankApiError])
356
+ end
357
+ end
358
+ ```
359
+
360
+ **Composite policies (per-error budgets):**
361
+
362
+ Pass an **array** of policies to handle different error types differently. On a failure, the **first** policy whose `retry_on` matches the raised error applies, and each error type gets its **own attempt budget and backoff**:
363
+
364
+ ```ruby
365
+ durably_execute :charge_card, retry_policy: [
366
+ RetryPolicy.new(retry_on: [NetworkError], max_attempts: 5), # transient: retry hard
367
+ RetryPolicy.new(retry_on: [RateLimitError], max_attempts: 10, base: 5), # back off longer
368
+ RetryPolicy.new(retry_on: [PaymentDeclinedError], max_attempts: 1), # fail fast, never retry
369
+ RetryPolicy.new(retry_on: nil) # catch-all (optional), keep last
370
+ ]
371
+ ```
372
+
373
+ - **Order matters**: the first matching policy wins, so list specific errors first and a catch-all (`retry_on: nil`) last. An error matched by no policy is **not retried** (fails fast).
374
+ - A subclass of a listed error routes to that policy and draws from its budget.
375
+ - Per-error counts are tracked by the policy's declared errors, so the budgets are stable even if you reorder the list.
376
+ - The class-level DSL accepts the same form as positional arguments (applies to steps **and** workflow-level errors):
377
+
378
+ ```ruby
379
+ retry_policy RetryPolicy.new(retry_on: [NetworkError], max_attempts: 5),
380
+ RetryPolicy.new(retry_on: nil, max_attempts: 2)
381
+ ```
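The first-match routing and per-error budgets described above can be modeled in a few lines. This is a sketch with stand-in types: `Policy`, `retry?`, and the string format of `budget_key` are hypothetical and do not reproduce the gem's `RetryPolicy` internals.

```ruby
# Illustrative stand-ins; not the gem's RetryPolicy or its budget_key format.
NetworkError = Class.new(StandardError)
TimeoutNetworkError = Class.new(NetworkError)
PaymentDeclinedError = Class.new(StandardError)

Policy = Struct.new(:retry_on, :max_attempts) do
  # nil retry_on is a catch-all; otherwise match the class or any ancestor.
  def matches?(error)
    retry_on.nil? || retry_on.any? { |klass| error.is_a?(klass) }
  end

  # Key budgets by declared errors so reordering policies keeps counts stable.
  def budget_key
    retry_on.nil? ? "*" : retry_on.map(&:name).sort.join(",")
  end
end

# Returns true if the error should be retried, charging the matching budget.
def retry?(policies, error, counts)
  policy = policies.find { |p| p.matches?(error) }
  return false unless policy # unmatched errors fail fast

  counts[policy.budget_key] += 1 # attempts made so far against this budget
  counts[policy.budget_key] < policy.max_attempts
end

policies = [
  Policy.new([NetworkError], 3),
  Policy.new([PaymentDeclinedError], 1)
]
counts = Hash.new(0)
```

In this model a `TimeoutNetworkError` draws from the `NetworkError` budget, a `PaymentDeclinedError` is never retried (its single attempt is already spent), and an error no policy lists returns `false` without touching any count.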
210
382
 
211
383
  #### ⏱️ Wait States
212
384
 
@@ -243,11 +415,11 @@ wait_until :external_api_ready?,
243
415
  timeout: 30.minutes,
244
416
  check_interval: 1.minute
245
417
 
246
- # Wait with retry on specific errors
418
+ # Wait with retry on specific errors raised while evaluating the condition
247
419
  wait_until :database_migration_complete?,
248
420
  timeout: 2.hours,
249
421
  check_interval: 30.seconds,
250
- retry_on: [ActiveRecord::ConnectionNotEstablished, Net::TimeoutError]
422
+ retry_policy: RetryPolicy.new(retry_on: [ActiveRecord::ConnectionNotEstablished, Net::TimeoutError])
251
423
 
252
424
  # Complex condition example
253
425
  def third_party_service_ready?
@@ -258,7 +430,7 @@ end
258
430
  wait_until :third_party_service_ready?,
259
431
  timeout: 1.hour,
260
432
  check_interval: 2.minutes,
261
- retry_on: [Net::TimeoutError, Net::HTTPClientException]
433
+ retry_policy: RetryPolicy.new(retry_on: [Net::TimeoutError, Net::HTTPClientException])
262
434
  ```
263
435
 
264
436
  **3. Event-driven Waits (`continue_if`)**
@@ -328,7 +500,7 @@ PaymentWorkflow.perform_later("order-#{order_id}", order_id: order_id)
328
500
 
329
501
  #### 🔄 Periodic Tasks
330
502
 
331
- The `durably_repeat` method enables robust periodic task execution within workflows. Tasks are scheduled at regular intervals until a specified condition is met, with automatic catch-up for missed executions and configurable error handling.
503
+ `durably_repeat` runs periodic tasks inside a workflow. A task is scheduled at a regular interval until a condition is met, with automatic catch-up for missed executions and configurable error handling.
332
504
 
333
505
  ```ruby
334
506
  class NotificationWorkflow < ApplicationJob
@@ -379,7 +551,7 @@ end
379
551
 
380
552
  - **Idempotent Execution**: Each repetition gets a unique execution log, preventing duplicates during replays
381
553
  - **Automatic Catch-up**: Missed executions due to downtime are automatically skipped using timeout-based fast-forwarding
382
- - **Flexible Timing**: Support for custom start times and precise interval scheduling
554
+ - **Custom Timing**: Custom start times and precise interval scheduling
383
555
  - **Error Resilience**: Individual execution failures don't break the periodic schedule
384
556
  - **Configurable Error Handling**: Choose between continuing despite failures or failing the entire workflow
385
557
 
@@ -390,7 +562,7 @@ durably_repeat :generate_daily_report,
390
562
  every: 1.day, # Execution interval
391
563
  till: :reports_complete?, # Stop condition
392
564
  start_at: Date.tomorrow.beginning_of_day, # Custom start time (optional)
393
- max_attempts: 5, # Retries per execution (default: 3)
565
+ retry_policy: RetryPolicy.new(max_attempts: 5), # Retry policy per execution (default: step_default)
394
566
  timeout: 2.hours, # Catch-up timeout (default: 1.hour)
395
567
  on_error: :fail_workflow, # Error handling (:continue or :fail_workflow)
396
568
  name: "daily_reports" # Custom task name (optional)
@@ -447,7 +619,7 @@ end
447
619
 
448
620
  The context supports serializable Ruby objects (Hash, Array, String, Integer, Float, Boolean, and nil) and validates types automatically.
449
621
 
450
- Hash and Array values are stored as JSON, which has no symbols so **symbol keys inside a stored hash come back as strings**:
622
+ Hash and Array values are stored as JSON, which has no symbols, so **symbol keys inside a stored hash come back as strings**:
451
623
 
452
624
  ```ruby
453
625
  context[:totals] = { paid: 5, pending: 2 }
@@ -455,33 +627,31 @@ context[:totals] # => { "paid" => 5, "pending" => 2 }
455
627
  context[:totals]["paid"] # => 5 (not context[:totals][:paid])
456
628
  ```
457
629
 
458
- (The top-level context key itself is interchangeable `context[:totals]` and `context["totals"]` refer to the same entry.)
630
+ (The top-level context key itself is interchangeable: `context[:totals]` and `context["totals"]` refer to the same entry.)
459
631
 
460
- Context is meant for **small working state** ids, flags, timestamps, and small structures used to coordinate steps. Each value is capped at **16 KB** (a `ChronoForge::Executor::Context::ValidationError` is raised above that). Store large payloads (documents, uploads, API responses) in their own storage and keep just a reference (an id or key) in the context.
632
+ Context is meant for **small working state**: ids, flags, timestamps, and small structures used to coordinate steps. Each value is capped at **16 KB** (a `ChronoForge::Executor::Context::ValidationError` is raised above that). Store large payloads (documents, uploads, API responses) in their own storage and keep just a reference (an id or key) in the context.
461
633
 
462
634
  ### 🛡️ Error Handling
463
635
 
464
- ChronoForge automatically tracks errors and provides configurable retry capabilities:
636
+ ChronoForge automatically tracks errors and routes all retrying through a single [`RetryPolicy`](#-retry-policies). Configure it per call with `retry_policy:`, or set a class-wide default with the `retry_policy` DSL:
465
637
 
466
638
  ```ruby
467
639
  class MyWorkflow < ApplicationJob
468
640
  prepend ChronoForge::Executor
469
641
 
470
- private
642
+ # Class-wide default for workflow-level errors and steps without an override
643
+ retry_policy max_attempts: 5, base: 2, cap: 60
471
644
 
472
- def should_retry?(error, attempt_count)
473
- case error
474
- when NetworkError
475
- attempt_count < 5 # Retry network errors up to 5 times
476
- when ValidationError
477
- false # Don't retry validation errors
478
- else
479
- attempt_count < 3 # Default retry policy
480
- end
645
+ def perform
646
+ # Retry only network errors, up to 5 times, for this step
647
+ durably_execute :call_external_api,
648
+ retry_policy: RetryPolicy.new(max_attempts: 5, retry_on: [NetworkError])
481
649
  end
482
650
  end
483
651
  ```
484
652
 
653
+ To make an error non-retryable, leave it out of `retry_on:` (an empty `retry_on: []` retries nothing).
654
+
485
655
  ## 🧪 Testing
486
656
 
487
657
  ChronoForge is designed to be easily testable using [ChaoticJob](https://github.com/fractaledmind/chaotic_job), a testing framework that makes it simple to test complex job workflows:
@@ -550,7 +720,7 @@ ChronoForge is ideal for:
550
720
 
551
721
  ## 🧠 Advanced State Management
552
722
 
553
- ChronoForge workflows follow a sophisticated state machine model to ensure durability and fault tolerance. Understanding these states and transitions is essential for troubleshooting and recovery.
723
+ ChronoForge workflows move through a state machine. Understanding these states and transitions helps with troubleshooting and recovery.
554
724
 
555
725
  ### Workflow State Diagram
556
726
 
@@ -609,8 +779,7 @@ stateDiagram-v2
 
  #### Recovering Stalled/Failed Workflows
 
- Re-execute a failed or stalled workflow directly from its record no need to
- constantize the job class or re-pass the key. Execution resumes via replay, so
+ Re-execute a failed or stalled workflow directly from its record. Execution resumes via replay, so
  completed steps are skipped and it picks up at the step that failed:
 
  ```ruby
@@ -621,7 +790,7 @@ workflow.retry_now # re-run inline (console/debugging)
  ```
 
  Only `stalled` or `failed` workflows are retryable. `retryable?` lets you check
- first, and both methods **validate up front** calling `retry_later`
+ first, and both methods **validate up front**: calling `retry_later`
  on a non-retryable workflow raises `ChronoForge::Executor::WorkflowNotRetryableError`
  immediately rather than enqueuing a job that would fail in the worker:
 
@@ -660,14 +829,14 @@ ChronoForge keeps every workflow and execution-log row indefinitely so that
  replays remain idempotent. Over time two things grow without bound:
 
  1. **Terminal workflows** (`completed` / `failed`) that are no longer needed.
- 2. **`durably_repeat` repetition logs** one row per scheduled execution. A
+ 2. **`durably_repeat` repetition logs**: one row per scheduled execution. A
  long-lived periodic workflow never reaches a terminal state, so its
  repetition logs accumulate indefinitely. Past repetitions (those behind the
  task's current frontier) are never read again, since each resume recomputes
- the next execution from the coordination log so they are safe to prune (see
+ the next execution from the coordination log, so they are safe to prune (see
  the safety note below).
 
- `ChronoForge::Cleanup` reclaims both. It is **not** run automatically schedule
+ `ChronoForge::Cleanup` reclaims both. It is **not** run automatically; schedule
  it from your own scheduler so you stay in control of retention:
 
  ```ruby
@@ -692,14 +861,14 @@ Notes:
  that are both older than the window **and** scheduled strictly before the
  periodic task's current frontier (the coordination log's `last_execution_at`).
  Anything at or after the frontier is kept so `durably_repeat`'s catch-up
- mechanism is never disrupted so the window is purely a retention preference
+ mechanism is never disrupted, so the window is purely a retention preference
  and is safe even for yearly schedules.
  - Workflow retention is measured from when a workflow became terminal, not when
- it was created a long-running workflow that only just finished is kept for
+ it was created. A long-running workflow that only just finished is kept for
  the full window. Completed workflows use `completed_at` (immutable); failed
  workflows use `updated_at` (they have no `completed_at`).
  - The composite `[state, completed_at]` index added in this version keeps these
- scans efficient run `chrono_forge:upgrade` if you installed an earlier
+ scans efficient; run `chrono_forge:upgrade` if you installed an earlier
  version.
 
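The first note above combines two conditions before a repetition log may be pruned: it must be older than the retention window, and it must be scheduled strictly before the frontier. A minimal plain-Ruby sketch of that predicate follows. The `RepetitionLog` struct, its field names, and the `prunable?` helper are hypothetical, and measuring age by `created_at` is an assumption; none of this is taken from the gem's schema.

```ruby
# Hypothetical record for illustration; field names are not the gem's schema.
RepetitionLog = Struct.new(:scheduled_for, :created_at, keyword_init: true)

# Prunable only if BOTH hold: older than the retention cutoff, and scheduled
# strictly before the frontier (the coordination log's last_execution_at).
def prunable?(log, frontier:, cutoff:)
  log.created_at < cutoff && log.scheduled_for < frontier
end

now      = Time.utc(2026, 7, 1)
cutoff   = now - (30 * 24 * 60 * 60) # 30-day retention window (assumed value)
frontier = Time.utc(2025, 7, 1)      # yearly task: last execution a year ago

logs = {
  past:        RepetitionLog.new(scheduled_for: Time.utc(2024, 7, 1), created_at: Time.utc(2024, 7, 1)),
  at_frontier: RepetitionLog.new(scheduled_for: Time.utc(2025, 7, 1), created_at: Time.utc(2025, 7, 1))
}

logs.each do |name, log|
  verdict = prunable?(log, frontier: frontier, cutoff: cutoff) ? "prune" : "keep"
  puts "#{name}: #{verdict}"
end
```

In this sketch the log at the frontier is far older than the 30-day window yet is still kept, which is the property that makes a short window safe for a yearly schedule.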
  A ready-made job is bundled so you can schedule it with any recurring-job
@@ -726,6 +895,98 @@ production:
  schedule: every day at 3am
  ```
 
+ ## 🌿 Branches: parallel sub-workflows
+
+ `branch` / `spawn` / `spawn_each` / `merge_branches` let a workflow fan out into
+ child workflows that run concurrently, then join them when their results are
+ needed.
+
+ ### Model
+
+ - **`branch :name do … end`** opens a named branch (a durable step). Inside the
+ block, `spawn` and `spawn_each` create and immediately enqueue child workflows;
+ children start running as soon as the branch block is entered.
+ - **`spawn :name, WorkflowClass, **kwargs`**: enqueues one child workflow.
+ - **`spawn_each :name, source do |item| [WorkflowClass, kwargs] end`**: enqueues
+ one child per item. The block returns the class and kwargs, so one branch can
+ fan out into mixed workflow types. Sources are iterated in constant memory;
+ ActiveRecord relations are streamed by primary key, so pass them **without** an
+ explicit `.order`.
+ - **`automerge: true`**: joins the branch **inline at the block's close**.
+ Execution does not continue past the `branch` call until every child has
+ completed. Use it for "dispatch this group and wait right here."
+ - **`merge_branches :a, :b`** (or the singular alias `merge_branch :a`): the
+ separate join point. Open branches without `automerge`, do other work while the
+ children run, then join when you need their results. `merge_branches` blocks
+ until all named branches are complete.
+
+ ### Worked example
+
+ ```ruby
+ class FulfillmentWorkflow < ApplicationJob
+   prepend ChronoForge::Executor
+
+   def perform(cycle_id:)
+     # automerge: the branch is joined inline, right where the block closes;
+     # `perform` does not continue past it until every child has completed.
+     branch :reconcile, automerge: true do
+       spawn :eu, ReconcileWorkflow, region: "EU"
+       spawn_each :orders, Order.pending do |order|
+         order.priority? ? [PriorityOrderWorkflow, { order_id: order.id }]
+                         : [OrderWorkflow, { order_id: order.id }]
+       end
+     end
+
+     # For branches you want to run concurrently and join later, omit automerge
+     # and use merge_branches:
+     branch :invoices do
+       spawn_each :unpaid, Invoice.unpaid do |inv|
+         [InvoiceWorkflow, { invoice_id: inv.id }]
+       end
+     end
+     branch :shipments do
+       spawn_each :ready, Shipment.ready do |s|
+         [ShipmentWorkflow, { shipment_id: s.id }]
+       end
+     end
+     do_other_work # runs while :invoices and :shipments dispatch/run
+     merge_branches :invoices, :shipments # join both here
+
+     durably_execute :finalize
+   end
+ end
+ ```
+
+ ### Caveats
+
+ > **Every branch must be joined.** A branch opened and never joined raises
+ > `ChronoForge::Executor::UnmergedBranchError` when the workflow tries to
+ > complete: fail-fast, no silently-orphaned children. Use either
+ > `automerge: true` or a matching `merge_branches` call.
+
+ > **The parent isn't replayed while waiting.** A lightweight
+ > `ChronoForge::BranchMergeJob` polls for child completion; the parent workflow
+ > only runs again once the branch is fully done. Polling cadence adapts to how
+ > many children remain.
+
+ > **`spawn_each` sources must re-enumerate deterministically across replays.**
+ > ActiveRecord relations are streamed by primary key (children are keyed by
+ > record id, so crash-resume is idempotent); a relation carrying an explicit
+ > `.order(...)` raises. For non-AR enumerables, items are keyed by position, so
+ > inserting or removing items mid-dispatch would shift keys and break idempotency.
+
+ > **`spawn_each` AR sources must have stable membership.** Dispatch streams by
+ > ascending primary key and resumes from the last key on crash-recovery, so a row
+ > that enters the relation *below* the cursor after it has passed (e.g. a
+ > `where(state: …)` scope whose rows mutate mid-dispatch) will never get a child.
+ > Point `spawn_each` at a set that is fixed for the branch's lifetime: a frozen id
+ > range, an append-only table, or `where(id: [...])` over a snapshot.
+
+ > **`branch` blocks cannot be lexically nested within one workflow.** Opening a
+ > `branch` inside another `branch` block raises `ArgumentError`; spawns belong to
+ > exactly one branch. (A *spawned child workflow* may open its own branches,
+ > because it runs in its own executor, so cross-workflow nesting is fine.)
+
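The stable-membership caveat above is easiest to see with a small simulation. The sketch below is plain Ruby with no ChronoForge or ActiveRecord involved: the `rows` hash stands in for a table, and `next_batch` is a hypothetical helper that mimics streaming a `where(state: ...)` scope by ascending primary key above a cursor. It only illustrates the failure mode under those assumptions.

```ruby
# Hypothetical in-memory "table" (id => state); not ChronoForge code.
rows = { 1 => :pending, 2 => :done, 3 => :pending, 4 => :pending }

# Return up to `size` matching ids in ascending order, strictly above the
# cursor. This stands in for primary-key batching over a scoped relation.
def next_batch(rows, cursor:, size:)
  rows.select { |id, state| id > cursor && state == :pending }.keys.sort.first(size)
end

dispatched = []
cursor = 0

# First batch is dispatched, and the cursor advances past id 3.
batch = next_batch(rows, cursor: cursor, size: 2)
dispatched.concat(batch)
cursor = batch.last

# Row 2 now enters the scope *below* the cursor (its state mutated mid-dispatch).
rows[2] = :pending

# Dispatch resumes from the cursor and drains the remaining batches.
until (batch = next_batch(rows, cursor: cursor, size: 2)).empty?
  dispatched.concat(batch)
  cursor = batch.last
end

missed = rows.select { |_id, state| state == :pending }.keys - dispatched
puts "dispatched: #{dispatched.inspect}"
puts "missed: #{missed.inspect}"
```

Row 2 matches the scope by the end of the run but never receives a child, which is why the caveat recommends a set that is fixed for the branch's lifetime.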
  ## 🚀 Development
 
  After checking out the repo, run:
@@ -762,11 +1023,11 @@ This gem is available as open source under the terms of the [MIT License](https:
 
  | Method | Purpose | Key Parameters |
  |--------|---------|----------------|
- | `durably_execute` | Execute method with retry logic | `method`, `max_attempts: 3`, `name: nil` |
+ | `durably_execute` | Execute method with retry logic | `method`, `retry_policy: nil`, `name: nil` |
  | `wait` | Time-based pause | `duration`, `name` |
- | `wait_until` | Condition-based waiting | `condition`, `timeout: 1.hour`, `check_interval: 15.minutes`, `retry_on: []` |
+ | `wait_until` | Condition-based waiting | `condition`, `timeout: 1.hour`, `check_interval: 15.minutes`, `retry_policy: nil` |
  | `continue_if` | Manual continuation wait | `condition`, `name: nil` |
- | `durably_repeat` | Periodic task execution | `method`, `every:`, `till:`, `start_at: nil`, `max_attempts: 3`, `timeout: 1.hour`, `on_error: :continue` |
+ | `durably_repeat` | Periodic task execution | `method`, `every:`, `till:`, `start_at: nil`, `retry_policy: nil`, `timeout: 1.hour`, `on_error: :continue` |
 
  ### Context Methods