RubyGems - chrono_forge - Versions diffs - 0.9.1 → 0.10.0 - Mend

chrono_forge 0.9.1 → 0.10.0


Files changed (41)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 482a39dce49e0e88ddc0f7341b2806c7893376863caa658eccdd6111c12a4d62
-  data.tar.gz: aab4017a4b0f90ff68779a0b21b660e34a10ba53a28325f7bcbb801251796ebb
+  metadata.gz: f03445b6275e345beb34505d4d59a01d8450df220e94f07bc909c8c69059ab8d
+  data.tar.gz: 9ba7aaa7364736f66778da68af4f21d7c944ac01270c7b92ca76bf09bf880738
 SHA512:
-  metadata.gz: 1c1a271dccf9c204633846bc81126b327efb1c176b90181c0dece715f99471a021054f8a892f1a223b11655ededeadc720d34a1a67b7e8ca7cf47a6f9749c175
-  data.tar.gz: 72745635bb4e34011c3ed735dfa3d648e88362824463b34bd719b32727c7d36602d130044861b882abb9685633231a57ce9a223771f08855a9938b1811db8405
+  metadata.gz: f761f180b4e8323721cfffc0a7c2569f30ea8e5b8e085cc52ab32aeefa64ed2a45caac13dff38c168ae497c437816201cdf5c2946a85ab987814eecd852d97c6
+  data.tar.gz: 22ca2b2ca99188b5117c06e2d9b313e726e0087ed48d3635d1064bcb82eee69b9f649a56a441a42e603c3b32ceb98c1aa788782dc01c5d2066f09dae0593900b

data/CHANGELOG.md CHANGED

@@ -1,5 +1,27 @@
 ## [Unreleased]
+## [0.10.0] - 2026-06-27
+### Added
+- **Concurrent sub-workflows** — `branch` blocks plus `spawn` / `spawn_each` dispatch child workflows that run in parallel, joined later with `merge_branches` (or an inline automerge at branch-block close); the completion gate raises if a branch is left unmerged. `spawn_each` streams bulk dispatch with a resumable cursor and keys AR-sourced children by record PK, so a crash mid-dispatch resumes without re-running completed children. Adds the `parent_execution_log_id` column + `[parent_execution_log_id, state]` index (additive migration, installed by the `chrono_forge:upgrade` generator), with `ExecutionLog#spawned_workflows` / `Workflow#parent_execution_log` associations. Join progress is driven by `ChronoForge::BranchMergeJob`, a lightweight poller that holds no lock and never replays the parent: it re-arms on each pass (fenced by a per-pass `poll_token` so a superseded chain stops quietly), rekicks dropped child jobs, and records observable poll state on the branch logs. Requires `activejob >= 7.1`.
+- `ChronoForge::Executor::RetryPolicy` — a single, unified retry abstraction (attempt cap + exponential-with-jitter backoff + error-class predicate) used by every retry site: workflow-level uncaught errors, `durably_execute`, `durably_repeat`, and `wait_until` condition errors. Replaces the three previously-independent retry systems and two backoff algorithms.
+- Class-level `retry_policy` DSL to set a workflow's default retry policy, plus a per-call `retry_policy:` keyword on `durably_execute`, `durably_repeat`, and `wait_until`. Resolution is per-call → class default → per-site built-in. `wait_until` deliberately does not inherit the class default (so a class-wide "retry everything" can't silently retry condition-evaluation bugs).
+- **Composite retry policies** — pass an ordered array of `RetryPolicy` objects (per-call, or to the class-level `retry_policy` DSL as positional args) to give each error type its own independent attempt budget and backoff. The first policy whose `retry_on` matches the raised error wins (subclasses route to the policy that lists their ancestor; a trailing `retry_on: nil` is a catch-all; an unmatched error fails fast). Per-error counts are keyed by each policy's declared errors (`RetryPolicy#budget_key`) and persisted in execution-log metadata (steps) or the job args (workflow-level), so budgets are stable across replays and policy reordering. `RetryPolicy.compose(*policies)` builds one explicitly.
+### Changed
+- **Performance:** completed steps are now resolved from a single bulk read per replay instead of one indexed `SELECT` each. On every resume the engine replays the whole workflow body; previously each already-completed step cost its own lookup, so a workflow with hundreds of steps paid hundreds of `SELECT`s per resume (quadratic over its lifetime). Completed steps are now plucked once into a per-pass cache and short-circuited from a readonly, unsaved stand-in (no row, no round-trip); only not-yet-completed steps still hit the database. `durably_repeat` repetition logs are deliberately excluded from the cache — they accumulate without bound yet are never replayed — so repeat-heavy workflows don't pull their history into memory.
+- **BREAKING:** `durably_execute` and `durably_repeat` no longer accept `max_attempts:`; `wait_until` no longer accepts `retry_on:`. All three now take `retry_policy:` (a `RetryPolicy`). Migrate `max_attempts: N` → `retry_policy: RetryPolicy.new(max_attempts: N)` and `retry_on: [...]` → `retry_policy: RetryPolicy.new(retry_on: [...])`.
+- **BREAKING:** backoff is now exponential with jitter everywhere (previously the workflow level used a fixed array declared as `[1s,5s,30s,2m,10m]` — though the `should_retry? < 3` bug meant only its first three entries `[1s,5s,30s]` were ever reached — and steps used `2**n` capped at 32s). Workflow-level retries default to 10 attempts with a tolerant window of up to ~8.5 min (≈4 min typical with jitter; cap 600s) — wide enough to ride out a transient infra blip (DB failover, deploy restart) on an uncaught `perform` error, since each such retry replays the whole workflow. A *permanently* failing workflow is now retried 10 times before reaching `failed` (vs the previous effective 4). Note this path covers only uncaught errors in `perform`; a step exhausting its own retries stalls the workflow instead.
+### Fixed
+- Continuation jobs are now published only **after** the workflow lock is released. Every deferral primitive (`wait`, `wait_until`, `durably_execute` retry, `durably_repeat`, and the workflow-level retry) previously enqueued its continuation inline, while the enqueuing job still held the lock; an immediately-runnable (`delay == 0`) same-key continuation could be claimed by another worker before the lock was released, surfacing as a spurious `ConcurrentExecutionError` at lock acquisition. The continuation is now recorded during the run and flushed in the executor's `ensure` block after `release_lock`, closing the race.
+- `durably_repeat` catch-up is now O(1) for the skippable run instead of O(missed intervals). When a workflow resumes far behind schedule, the **expired prefix** (ticks older than `timeout`) is fast-forwarded in closed form to the first non-expired grid tick, rather than walking one zero-delay job per missed tick. **Behavior change:** the expired prefix now produces a single summary execution log (`error_class: "TimeoutError"`, `metadata["fast_forwarded"]` = number of ticks skipped) instead of one `"Execution timed out"` row per tick — update any dashboards or alerts that key off per-tick timeout rows. Ticks still inside their `timeout` window continue to execute as normal catch-up work.
+- Workflow-level retry no longer has a contradictory cap (`should_retry?` stopped at 3 while `RetryStrategy.max_attempts` was 5, making the array's `2m`/`10m` entries unreachable). The single `RetryPolicy` is now the sole decider.
+- Removed the dead `retry_method:` argument that `durably_execute` passed on reschedule but `perform` never bound.
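
The closed-form fast-forward described in the `durably_repeat` entry above can be sketched with plain arithmetic. This is only an illustration: it assumes ticks sit on a grid `start_at + k * every` and that a tick counts as expired once `tick + timeout` is strictly before `now`. The gem's exact boundary rule is not stated in this diff, and `first_live_tick` is a hypothetical name.

```ruby
# Hypothetical sketch, not ChronoForge's implementation.
# Times are plain numbers of seconds to keep the example self-contained.
# Returns [index of the first non-expired tick, number of expired ticks skipped].
def first_live_tick(start_at:, every:, timeout:, now:)
  overdue = now - timeout - start_at    # how far the expiry horizon sits past the first tick
  return [0, 0] if overdue <= 0         # nothing has expired yet

  index = (overdue.to_f / every).ceil   # smallest k with start_at + k * every + timeout >= now
  [index, index]                        # ticks 0...index are exactly the expired prefix
end

# Hourly task, 1-hour timeout, resumed 10 hours after its first tick.
first_live_tick(start_at: 0, every: 3600, timeout: 3600, now: 36_000)
```

Under these assumptions the call returns `[9, 9]`: nine ticks are skipped in one step, which is the count a summary row's `metadata["fast_forwarded"]` would carry, and catch-up resumes at the tenth tick.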
 ## [0.9.1] - 2026-06-25
 ### Fixed

data/README.md CHANGED

@@ -7,20 +7,75 @@
 > A robust framework for building durable, distributed workflows in Ruby on Rails applications
-ChronoForge provides a powerful solution for handling long-running processes, managing state, and recovering from failures in your Rails applications. Built on top of ActiveJob, it ensures your critical business processes remain resilient and traceable.
+ChronoForge handles long-running processes, manages state, and recovers from failures in your Rails applications. Built on ActiveJob, it keeps critical business processes resilient and traceable.
+Workflows are **plain Ruby**. Ordinary `if`/`else`, loops, and early returns drive the flow. There's no declarative DSL to learn and no extra service to run, which makes ChronoForge a good fit for business processes whose shape depends on runtime state: conditional branches, iteration over data, and built-in periodic tasks (`durably_repeat`).
+> **In production** at **achieve by Petra**, an investment platform in the Petra Group — where it has executed over 3.6 million workflows and 32 million durable steps across scheduled payments, investment rollovers, and membership lifecycle management.
+## 🧭 Why ChronoForge
+Most Rails workflow tools ask you to declare your steps up front in a DSL:
+```ruby
+step :send_welcome_email
+step :remind_of_tasks, wait: 2.days
+step :complete_onboarding, wait: 15.days
+```
+That reads cleanly for a fixed, linear sequence. But many business processes branch, loop, and react to data that only exists at runtime, and a declarative schema gets awkward there. ChronoForge takes the opposite approach: **a workflow is just a Ruby method.** Conditionals, iteration, early returns, and helper methods all work the way they normally do.
+There is a real trade-off. Because the flow is ordinary code, ChronoForge can show the steps that **have run** (a replay/history view), but not a roadmap of steps that *haven't* run yet, which a declarative engine can. For workflows whose path isn't fixed in advance, that's a trade worth making; for a simple, fixed sequence ("send email, wait 2 days, send another"), a declarative DSL may read more cleanly, and that's a fine reason to reach for one.
+### How it compares
+|                              | ChronoForge          | GenevaDrive        | AcidicJob       | Temporal        |
+| ---------------------------- | -------------------- | ------------------ | --------------- | --------------- |
+| Programming model            | procedural (plain Ruby) | declarative DSL | declarative DSL | procedural (via SDK) |
+| Built-in periodic tasks      | ✓ `durably_repeat`   | ✗                  | ✗               | ✓               |
+| Pending-step visibility      | ✗ (procedural)       | ✓                  | ✓               | ✗ (procedural)  |
+| Extra infrastructure         | none (DB + ActiveJob)| none               | none            | server required |
+| License                      | MIT                  | LGPL / commercial  | MIT             | MIT             |
+<sub>Comparison reflects each project's documented features as of mid-2026, to the best of our knowledge; corrections welcome via PR.</sub>
+A few deliberate choices behind that table:
+- **Periodic tasks are built in.** `durably_repeat` runs a step on a schedule until a condition holds, with automatic catch-up for missed runs, so a workflow can be its own recurring job and cron-style monitor, right alongside the rest of its logic. Without built-in support, periodic behavior usually lives in a separate scheduler that you reconcile with workflow state by hand.
+- **No extra infrastructure.** ChronoForge is a gem over your existing database and ActiveJob backend. There's no separate server or daemon to operate, unlike Temporal.
+- **Recovery is built into the model.** Steps are append-only history, so a crashed step leaves the workflow `stalled`, recoverable directly with `retry_later`.
+- **MIT licensed.** Permissive and dependency-policy-friendly.
 ## 🌟 Features
+- **Plain-Ruby control flow**: Branching, loops, and iteration over runtime data, without a DSL or step registry
 - **Durable Execution**: Automatically tracks and recovers from failures during workflow execution
+- **Periodic tasks built in**: `durably_repeat` runs a step on an interval until a condition is met, with catch-up for missed runs. Acts as a recurring task and a cron-style monitor in one
+- **Wait States**: Time-based waits and condition-based waiting (`wait_until`) that survive restarts
 - **State Management**: Built-in workflow state tracking with persistent context storage
 - **Concurrency Control**: Advanced locking mechanisms to prevent parallel execution of the same workflow
-- **Error Handling**: Comprehensive error tracking with configurable retry strategies
+- **Error Handling**: Error tracking with a unified, configurable [`RetryPolicy`](#-retry-policies) (including per-error-type policies)
 - **Execution Logging**: Detailed logging of workflow steps and errors for visibility
-- **Wait States**: Support for time-based waits and condition-based waiting
-- **Database-Backed**: All workflow state is persisted to ensure durability
+- **Database-Backed**: All workflow state is persisted to ensure durability, with no extra services to run
 - **ActiveJob Integration**: Compatible with all ActiveJob backends, though database-backed processors (like Solid Queue) provide the most reliable experience for long-running workflows
 - **Retention & Cleanup**: A schedulable job to prune finished workflows and the unbounded logs that periodic tasks accumulate (see [Cleanup & Retention](#-cleanup--retention))
+## 🖥️ Dashboard
+ChronoForge has a free, mountable dashboard for visibility and recovery: workflow list, step replay timeline, context inspector, periodic-task health, wait-state age, and retry/unlock actions. It ships as a separate gem, `chrono_forge-dashboard`, so the core stays lean.
+[![ChronoForge dashboard](chrono_forge-dashboard/docs/screenshots/workflows.png)](chrono_forge-dashboard/README.md#screenshots)
+```ruby
+# Gemfile
+gem "chrono_forge-dashboard"
+# config/routes.rb
+mount ChronoForge::Dashboard::Engine, at: "/chrono_forge"
+```
+See [`chrono_forge-dashboard`](chrono_forge-dashboard/README.md) for setup, authentication, and [more screenshots](chrono_forge-dashboard/README.md#screenshots).
 ## 📦 Installation
 Add to your application's Gemfile:
@@ -136,6 +191,54 @@ class OrderProcessingWorkflow < ApplicationJob
 end
 ```
+### A workflow you can't flatten into a step list
+The example above is linear, but most real processes aren't. Because a ChronoForge workflow is plain Ruby, branching and dynamic iteration are just… branching and iteration:
+```ruby
+class OrderProcessingWorkflow < ApplicationJob
+  prepend ChronoForge::Executor
+  def perform(order_id:)
+    @order_id = order_id
+    wait_until :payment_confirmed?
+    durably_execute :validate_order
+    # Runtime branching: the path depends on data known only at execution time
+    if context["requires_compliance_check"]
+      durably_execute :run_compliance_review
+      wait_until :compliance_approved?, timeout: 48.hours
+    end
+    # Iterate over runtime data: one durable, idempotent step per item
+    context["line_item_ids"].each do |item_id|
+      context["current_item_id"] = item_id
+      durably_execute :fulfill_item, name: "fulfill_#{item_id}"
+    end
+    # Recurring notification: nudge the customer until they confirm delivery
+    durably_repeat :send_delivery_reminder, every: 3.days, till: :delivery_confirmed?
+    durably_execute :complete_order
+  end
+  private
+  def fulfill_item
+    FulfillmentService.fulfill(@order_id, context["current_item_id"])
+  end
+  def send_delivery_reminder
+    OrderMailer.delivery_reminder(@order_id).deliver_later
+  end
+  # ... other condition and step methods ...
+end
+```
+Each `durably_execute` is checkpointed by its step name, so on resume the completed branches and items are skipped and the workflow continues where it left off. A fixed, declared list of steps can't easily express runtime branches, a loop over a runtime-sized collection, and an open-ended recurring notification.
 ### Core Workflow Features
 #### 🚀 Executing Workflows
@@ -162,14 +265,15 @@ OrderProcessingWorkflow.perform_later(
 #### ⚡ Durable Execution
-The `durably_execute` method ensures operations are executed exactly once with automatic retry logic and fault tolerance:
+The `durably_execute` method runs an operation with automatic retries, and skips it on replay once it has completed:
 ```ruby
 # Basic execution
 durably_execute :send_welcome_email
-# With custom retry attempts
-durably_execute :critical_payment_processing, max_attempts: 5
+# With a custom retry policy
+durably_execute :critical_payment_processing,
+  retry_policy: RetryPolicy.new(max_attempts: 5)
 # With custom name for tracking multiple calls to same method
 durably_execute :upload_file, name: "profile_image_upload"
@@ -182,10 +286,10 @@ class FileProcessingWorkflow < ApplicationJob
     @file_id = file_id
     # This might fail due to network issues, rate limits, etc.
-    durably_execute :upload_to_s3, max_attempts: 5
+    durably_execute :upload_to_s3, retry_policy: RetryPolicy.new(max_attempts: 5)
     # Process file after successful upload
-    durably_execute :generate_thumbnails, max_attempts: 3
+    durably_execute :generate_thumbnails, retry_policy: RetryPolicy.new(max_attempts: 3)
   end
   private
@@ -204,9 +308,77 @@ end
 **Key Features:**
 - **Idempotent**: Same operation won't be executed twice during replays
-- **Automatic Retries**: Failed executions retry with exponential backoff (2^attempt seconds, capped at 32s)
+- **Automatic Retries**: Failed executions retry per a unified `RetryPolicy` (exponential backoff with jitter; the step default caps at 30s over 3 attempts)
 - **Error Tracking**: All failures are logged with detailed error information
-- **Configurable**: Customize retry attempts and step naming
+- **Configurable**: Pass a `retry_policy:` per call, or set a class-wide default with the `retry_policy` DSL (see [Retry Policies](#-retry-policies))
+#### 🔁 Retry Policies
+All retrying in ChronoForge goes through a single `RetryPolicy` (`ChronoForge::Executor::RetryPolicy`). It answers two questions: *should this failure be retried?* and *how long until the next attempt?*
+```ruby
+RetryPolicy.new(
+  max_attempts: 3,        # cap on total attempts; nil = no count cap (bounded elsewhere)
+  base: 1,                # seconds; delay of the first retry
+  cap: 30,                # seconds; ceiling for a single delay
+  jitter: true,           # spread retries with equal jitter
+  retry_on: nil           # nil = retry any StandardError; [Classes] = only those; [] = none
+)
+```
+Backoff is exponential with equal jitter, computed once at re-enqueue time (never replayed, so it stays deterministic where it matters).
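
As a rough illustration of the schedule described above, the sketch below assumes the un-jittered delay for retry `n` is `base * 2**(n - 1)` clamped to `cap`, and that "equal jitter" keeps half of that delay fixed and randomizes the other half. Neither formula is quoted from the gem's source, and `backoff_delay` with its parameters is a hypothetical name used only here.

```ruby
# Hypothetical helper, not ChronoForge's implementation.
# retry_number is 1 for the first retry, 2 for the second, and so on.
def backoff_delay(retry_number, base: 1, cap: 30, jitter: true, rng: Random.new)
  raw = [base * (2**(retry_number - 1)), cap].min
  return raw.to_f unless jitter

  half = raw / 2.0
  half + rng.rand * half # equal jitter: somewhere in [raw / 2, raw)
end

# Workflow-level default from the changelog: 10 attempts means at most 9 retries.
ceilings = (1..9).map { |n| backoff_delay(n, base: 1, cap: 600, jitter: false) }
total_ceiling = ceilings.sum # 1 + 2 + 4 + ... + 256 = 511 seconds
```

Under these assumptions the nine un-jittered delays sum to 511 seconds, which is consistent with the "~8.5 min" upper bound quoted for the workflow-level default.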
+**Resolution order:**
+- **`durably_execute`, `durably_repeat`, workflow-level errors**: per-call `retry_policy:` → class-level `retry_policy` default → built-in default.
+- **`wait_until`**: per-call `retry_policy:` → built-in default. It deliberately does **not** inherit the class default, so a class-wide "retry everything" can't silently turn condition-evaluation bugs into retried errors.
+**Built-in defaults:**
+| Site | Default | Why |
+|------|---------|-----|
+| Steps (`durably_execute`/`durably_repeat`) | 3 attempts, cap 30s, retry any error | flaky calls fail fast |
+| Workflow-level (uncaught errors) | 10 attempts, cap 600s, retry any error | tolerant window up to ~8.5 min (≈4 min typical w/ jitter) for transient infra errors; each retry replays the whole workflow from the top |
+| `wait_until` condition errors | retry nothing | a raised condition is usually a bug, not transient |
+**Class-wide default via the `retry_policy` DSL:**
+```ruby
+class ChargeWorkflow < ApplicationJob
+  prepend ChronoForge::Executor
+  retry_policy max_attempts: 5, base: 2, cap: 60   # applies to steps + workflow-level
+  def perform
+    durably_execute :charge,
+      retry_policy: RetryPolicy.new(max_attempts: 8, retry_on: [Net::OpenTimeout])
+    wait_until :settled?,
+      retry_policy: RetryPolicy.new(retry_on: [BankApiError])
+  end
+end
+```
+**Composite policies (per-error budgets):**
+Pass an **array** of policies to handle different error types differently. On a failure, the **first** policy whose `retry_on` matches the raised error applies, and each error type gets its **own attempt budget and backoff**:
+```ruby
+durably_execute :charge_card, retry_policy: [
+  RetryPolicy.new(retry_on: [NetworkError],         max_attempts: 5),            # transient: retry hard
+  RetryPolicy.new(retry_on: [RateLimitError],       max_attempts: 10, base: 5),  # back off longer
+  RetryPolicy.new(retry_on: [PaymentDeclinedError], max_attempts: 1),            # fail fast, never retry
+  RetryPolicy.new(retry_on: nil)                                                 # catch-all (optional), keep last
+]
+```
+- **Order matters**: the first matching policy wins, so list specific errors first and a catch-all (`retry_on: nil`) last. An error matched by no policy is **not retried** (fails fast).
+- A subclass of a listed error routes to that policy and draws from its budget.
+- Per-error counts are tracked by the policy's declared errors, so the budgets are stable even if you reorder the list.
+- The class-level DSL accepts the same form as positional arguments (applies to steps **and** workflow-level errors):
+  ```ruby
+  retry_policy RetryPolicy.new(retry_on: [NetworkError], max_attempts: 5),
+               RetryPolicy.new(retry_on: nil, max_attempts: 2)
+  ```
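
The routing rules in the list above can be pictured with a small stand-alone sketch. `PolicySketch`, `pick_policy`, and the error classes are hypothetical stand-ins rather than ChronoForge's classes. The sketch assumes only the documented semantics: the first match wins, subclasses match their ancestor, `retry_on: nil` matches any `StandardError`, and an unmatched error gets no policy.

```ruby
# Hypothetical stand-ins for illustration; not the gem's RetryPolicy.
PolicySketch = Struct.new(:retry_on, :max_attempts, keyword_init: true) do
  def matches?(error)
    return error.is_a?(StandardError) if retry_on.nil? # catch-all
    retry_on.any? { |klass| error.is_a?(klass) }       # an empty list matches nothing
  end
end

class NetworkError < StandardError; end
class DnsError < NetworkError; end
class RateLimitError < StandardError; end

# Returns the first matching policy, or nil when the error should fail fast.
def pick_policy(policies, error)
  policies.find { |policy| policy.matches?(error) }
end

policies = [
  PolicySketch.new(retry_on: [NetworkError], max_attempts: 5),
  PolicySketch.new(retry_on: [RateLimitError], max_attempts: 10)
]

pick_policy(policies, DnsError.new)      # subclass routes to the NetworkError policy
pick_policy(policies, ArgumentError.new) # no match: nil, so the error is not retried
```

Because lookup stops at the first match, a catch-all placed ahead of a specific policy would shadow it, which is why the list recommends keeping `retry_on: nil` last.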
 #### ⏱️ Wait States
@@ -243,11 +415,11 @@ wait_until :external_api_ready?,
   timeout: 30.minutes,
   check_interval: 1.minute
-# Wait with retry on specific errors
+# Wait with retry on specific errors raised while evaluating the condition
 wait_until :database_migration_complete?,
   timeout: 2.hours,
   check_interval: 30.seconds,
-  retry_on: [ActiveRecord::ConnectionNotEstablished, Net::TimeoutError]
+  retry_policy: RetryPolicy.new(retry_on: [ActiveRecord::ConnectionNotEstablished, Net::TimeoutError])
 # Complex condition example
 def third_party_service_ready?
@@ -258,7 +430,7 @@ end
 wait_until :third_party_service_ready?,
   timeout: 1.hour,
   check_interval: 2.minutes,
-  retry_on: [Net::TimeoutError, Net::HTTPClientException]
+  retry_policy: RetryPolicy.new(retry_on: [Net::TimeoutError, Net::HTTPClientException])
 ```
 **3. Event-driven Waits (`continue_if`)**
@@ -328,7 +500,7 @@ PaymentWorkflow.perform_later("order-#{order_id}", order_id: order_id)
 #### 🔄 Periodic Tasks
-The `durably_repeat` method enables robust periodic task execution within workflows. Tasks are scheduled at regular intervals until a specified condition is met, with automatic catch-up for missed executions and configurable error handling.
+`durably_repeat` runs periodic tasks inside a workflow. A task is scheduled at a regular interval until a condition is met, with automatic catch-up for missed executions and configurable error handling.
 ```ruby
 class NotificationWorkflow < ApplicationJob
@@ -379,7 +551,7 @@ end
 - **Idempotent Execution**: Each repetition gets a unique execution log, preventing duplicates during replays
 - **Automatic Catch-up**: Missed executions due to downtime are automatically skipped using timeout-based fast-forwarding
-- **Flexible Timing**: Support for custom start times and precise interval scheduling
+- **Custom Timing**: Custom start times and precise interval scheduling
 - **Error Resilience**: Individual execution failures don't break the periodic schedule
 - **Configurable Error Handling**: Choose between continuing despite failures or failing the entire workflow
@@ -390,7 +562,7 @@ durably_repeat :generate_daily_report,
   every: 1.day,                          # Execution interval
   till: :reports_complete?,              # Stop condition
   start_at: Date.tomorrow.beginning_of_day, # Custom start time (optional)
-  max_attempts: 5,                       # Retries per execution (default: 3)
+  retry_policy: RetryPolicy.new(max_attempts: 5), # Retry policy per execution (default: step_default)
   timeout: 2.hours,                      # Catch-up timeout (default: 1.hour)
   on_error: :fail_workflow,              # Error handling (:continue or :fail_workflow)
   name: "daily_reports"                  # Custom task name (optional)
@@ -447,7 +619,7 @@ end
 The context supports serializable Ruby objects (Hash, Array, String, Integer, Float, Boolean, and nil) and validates types automatically.
-Hash and Array values are stored as JSON, which has no symbols — so **symbol keys inside a stored hash come back as strings**:
+Hash and Array values are stored as JSON, which has no symbols, so **symbol keys inside a stored hash come back as strings**:
 ```ruby
 context[:totals] = { paid: 5, pending: 2 }
@@ -455,33 +627,31 @@ context[:totals]          # => { "paid" => 5, "pending" => 2 }
 context[:totals]["paid"]  # => 5   (not context[:totals][:paid])
 ```
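
The same key conversion can be reproduced outside ChronoForge with Ruby's standard `json` library. This sketch assumes the context round-trips values in a comparable way; it does not use the gem's own serializer.

```ruby
require "json"

# Serialize a hash with symbol keys, then read it back.
stored = JSON.generate({ paid: 5, pending: 2 }) # symbols are written as JSON strings
restored = JSON.parse(stored)                   # keys come back as strings

restored["paid"] # the string key finds the value
restored[:paid]  # the symbol key finds nothing
```

If symbol keys are needed after reading, `restored.transform_keys(&:to_sym)` converts them back at the point of use.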
-(The top-level context key itself is interchangeable — `context[:totals]` and `context["totals"]` refer to the same entry.)
+(The top-level context key itself is interchangeable: `context[:totals]` and `context["totals"]` refer to the same entry.)
-Context is meant for **small working state** — ids, flags, timestamps, and small structures used to coordinate steps. Each value is capped at **16 KB** (a `ChronoForge::Executor::Context::ValidationError` is raised above that). Store large payloads (documents, uploads, API responses) in their own storage and keep just a reference (an id or key) in the context.
+Context is meant for **small working state**: ids, flags, timestamps, and small structures used to coordinate steps. Each value is capped at **16 KB** (a `ChronoForge::Executor::Context::ValidationError` is raised above that). Store large payloads (documents, uploads, API responses) in their own storage and keep just a reference (an id or key) in the context.
 ### 🛡️ Error Handling
-ChronoForge automatically tracks errors and provides configurable retry capabilities:
+ChronoForge automatically tracks errors and routes all retrying through a single [`RetryPolicy`](#-retry-policies). Configure it per call with `retry_policy:`, or set a class-wide default with the `retry_policy` DSL:
 ```ruby
 class MyWorkflow < ApplicationJob
   prepend ChronoForge::Executor
-  private
+  # Class-wide default for workflow-level errors and steps without an override
+  retry_policy max_attempts: 5, base: 2, cap: 60
-  def should_retry?(error, attempt_count)
-    case error
-    when NetworkError
-      attempt_count < 5  # Retry network errors up to 5 times
-    when ValidationError
-      false  # Don't retry validation errors
-    else
-      attempt_count < 3  # Default retry policy
-    end
+  def perform
+    # Retry only network errors, up to 5 times, for this step
+    durably_execute :call_external_api,
+      retry_policy: RetryPolicy.new(max_attempts: 5, retry_on: [NetworkError])
   end
 end
 ```
+To make an error non-retryable, leave it out of `retry_on:` (an empty `retry_on: []` retries nothing).
 ## 🧪 Testing
 ChronoForge is designed to be easily testable using [ChaoticJob](https://github.com/fractaledmind/chaotic_job), a testing framework that makes it simple to test complex job workflows:
@@ -550,7 +720,7 @@ ChronoForge is ideal for:
 ## 🧠 Advanced State Management
-ChronoForge workflows follow a sophisticated state machine model to ensure durability and fault tolerance. Understanding these states and transitions is essential for troubleshooting and recovery.
+ChronoForge workflows move through a state machine. Understanding these states and transitions helps with troubleshooting and recovery.
 ### Workflow State Diagram
@@ -609,8 +779,7 @@ stateDiagram-v2
 #### Recovering Stalled/Failed Workflows
-Re-execute a failed or stalled workflow directly from its record — no need to
-constantize the job class or re-pass the key. Execution resumes via replay, so
+Re-execute a failed or stalled workflow directly from its record. Execution resumes via replay, so
 completed steps are skipped and it picks up at the step that failed:
 ```ruby
@@ -621,7 +790,7 @@ workflow.retry_now     # re-run inline (console/debugging)
 ```
 Only `stalled` or `failed` workflows are retryable. `retryable?` lets you check
-first, and both methods **validate up front** — calling `retry_later`
+first, and both methods **validate up front**: calling `retry_later`
 on a non-retryable workflow raises `ChronoForge::Executor::WorkflowNotRetryableError`
 immediately rather than enqueuing a job that would fail in the worker:
@@ -660,14 +829,14 @@ ChronoForge keeps every workflow and execution-log row indefinitely so that
 replays remain idempotent. Over time two things grow without bound:
 1. **Terminal workflows** (`completed` / `failed`) that are no longer needed.
-2. **`durably_repeat` repetition logs** — one row per scheduled execution. A
+2. **`durably_repeat` repetition logs**: one row per scheduled execution. A
    long-lived periodic workflow never reaches a terminal state, so its
    repetition logs accumulate indefinitely. Past repetitions (those behind the
    task's current frontier) are never read again, since each resume recomputes
-   the next execution from the coordination log — so they are safe to prune (see
+   the next execution from the coordination log, so they are safe to prune (see
    the safety note below).
-`ChronoForge::Cleanup` reclaims both. It is **not** run automatically — schedule
+`ChronoForge::Cleanup` reclaims both. It is **not** run automatically; schedule
 it from your own scheduler so you stay in control of retention:
 ```ruby
@@ -692,14 +861,14 @@ Notes:
   that are both older than the window **and** scheduled strictly before the
   periodic task's current frontier (the coordination log's `last_execution_at`).
   Anything at or after the frontier is kept so `durably_repeat`'s catch-up
-  mechanism is never disrupted — so the window is purely a retention preference
+  mechanism is never disrupted, so the window is purely a retention preference
   and is safe even for yearly schedules.
 - Workflow retention is measured from when a workflow became terminal, not when
-  it was created — a long-running workflow that only just finished is kept for
+  it was created. A long-running workflow that only just finished is kept for
   the full window. Completed workflows use `completed_at` (immutable); failed
   workflows use `updated_at` (they have no `completed_at`).
 - The composite `[state, completed_at]` index added in this version keeps these
-  scans efficient — run `chrono_forge:upgrade` if you installed an earlier
+  scans efficient; run `chrono_forge:upgrade` if you installed an earlier
   version.
 A ready-made job is bundled so you can schedule it with any recurring-job
@@ -726,6 +895,98 @@ production:
     schedule: every day at 3am
 ```
+## 🌿 Branches: parallel sub-workflows
+`branch` / `spawn` / `spawn_each` / `merge_branches` let a workflow fan out into
+child workflows that run concurrently, then join them when their results are
+needed.
+### Model
+- **`branch :name do … end`** opens a named branch (a durable step). Inside the
+  block, `spawn` and `spawn_each` create and immediately enqueue child workflows —
+  children start running as soon as the branch block is entered.
+- **`spawn :name, WorkflowClass, **kwargs`** — enqueues one child workflow.
+- **`spawn_each :name, source do |item| [WorkflowClass, kwargs] end`** — enqueues
+  one child per item. The block returns the class and kwargs, so one branch can
+  fan out into mixed workflow types. Sources are iterated in constant memory;
+  ActiveRecord relations are streamed by primary key — pass them **without** an
+  explicit `.order`.
+- **`automerge: true`** — joins the branch **inline at the block's close**.
+  Execution does not continue past the `branch` call until every child has
+  completed. Use it for "dispatch this group and wait right here."
+- **`merge_branches :a, :b`** (or the singular alias `merge_branch :a`) — the
+  separate join point. Open branches without `automerge`, do other work while the
+  children run, then join when you need their results. `merge_branches` blocks
+  until all named branches are complete.
+### Worked example
+```ruby
+class FulfillmentWorkflow < ApplicationJob
+  prepend ChronoForge::Executor
+  def perform(cycle_id:)
+    # automerge: the branch is joined inline, right where the block closes —
+    # `perform` does not continue past it until every child has completed.
+    branch :reconcile, automerge: true do
+      spawn :eu, ReconcileWorkflow, region: "EU"
+      spawn_each :orders, Order.pending do |order|
+        order.priority? ? [PriorityOrderWorkflow, { order_id: order.id }]
+                        : [OrderWorkflow, { order_id: order.id }]
+      end
+    end
+    # For branches you want to run concurrently and join later, omit automerge
+    # and use merge_branches:
+    branch :invoices do
+      spawn_each :unpaid, Invoice.unpaid do |inv|
+        [InvoiceWorkflow, { invoice_id: inv.id }]
+      end
+    end
+    branch :shipments do
+      spawn_each :ready, Shipment.ready do |s|
+        [ShipmentWorkflow, { shipment_id: s.id }]
+      end
+    end
+    do_other_work                        # runs while :invoices and :shipments dispatch/run
+    merge_branches :invoices, :shipments # join both here
+    durably_execute :finalize
+  end
+end
+```
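The two join styles in the example above can be loosely pictured with plain Ruby threads. This is only an in-process analogy: ChronoForge children are durable workflows joined through a poller, not threads, and `fan_out` is a hypothetical helper invented for this sketch, not part of the gem.

```ruby
# In-process analogy only: threads stand in for child workflows.
# `fan_out` is a hypothetical helper, not a ChronoForge API.
def fan_out(items)
  items.map { |item| Thread.new { yield item } }
end

log = Queue.new # thread-safe event log

# "automerge" style: start the children and join them on the spot.
fan_out(%w[EU US]) { |region| log << "reconciled #{region}" }.each(&:join)
log << "after reconcile"

# Deferred style: start two groups, do other work, then join both.
invoices  = fan_out([1, 2]) { |id| log << "invoice #{id}" }
shipments = fan_out([7])    { |id| log << "shipment #{id}" }
log << "other work"
(invoices + shipments).each(&:join)
log << "finalize"

# Drain the queue into an array so the ordering can be inspected.
events = []
events << log.pop until log.empty?
```

In this sketch `"after reconcile"` is always logged after both reconcile events, while `"other work"` may interleave with the invoice and shipment events; only `"finalize"` is guaranteed to come after all of them.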
+### Caveats
+> **Every branch must be joined.** A branch opened and never joined raises
+> `ChronoForge::Executor::UnmergedBranchError` when the workflow tries to
+> complete — fail-fast, no silently-orphaned children. Use either
+> `automerge: true` or a matching `merge_branches` call.
+> **The parent isn't replayed while waiting.** A lightweight
+> `ChronoForge::BranchMergeJob` polls for child completion; the parent workflow
+> only runs again once the branch is fully done. Polling cadence adapts to how
+> many children remain.
+> **`spawn_each` sources must re-enumerate deterministically across replays.**
+> ActiveRecord relations are streamed by primary key (children are keyed by
+> record id, so crash-resume is idempotent); a relation carrying an explicit
+> `.order(...)` raises. For non-AR enumerables, items are keyed by position, so
+> inserting or removing items mid-dispatch would shift keys and break idempotency.
+> **`spawn_each` AR sources must have stable membership.** Dispatch streams by
+> ascending primary key and resumes from the last key on crash-recovery, so a row
+> that enters the relation *below* the cursor after it has passed (e.g. a
+> `where(state: …)` scope whose rows mutate mid-dispatch) will never get a child.
+> Point `spawn_each` at a set that is fixed for the branch's lifetime — a frozen id
+> range, an append-only table, or `where(id: [...])` over a snapshot.
+> **`branch` blocks cannot be lexically nested within one workflow.** Opening a
+> `branch` inside another `branch` block raises `ArgumentError`; spawns belong to
+> exactly one branch. (A *spawned child workflow* may open its own branches — it
+> runs in its own executor — so cross-workflow nesting is fine.)
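The two `spawn_each` caveats above can be reproduced with a toy model in plain Ruby. This is a sketch under stated assumptions, not ChronoForge's implementation: `Row`, `pending_after`, and `dispatch_batch` are hypothetical names, an array stands in for the table, and the batch size of 2 is arbitrary.

```ruby
# Toy model of dispatch by ascending primary key with a resumable cursor.
# All names here are hypothetical; nothing below is ChronoForge code.
Row = Struct.new(:id, :state)

table = (1..6).map { |id| Row.new(id, id.even? ? "pending" : "done") }

dispatched = []
cursor = 0

# Stand-in for a `where(state: "pending")` scope read past the cursor.
pending_after = lambda do |after_id|
  table.select { |r| r.state == "pending" && r.id > after_id }.sort_by(&:id)
end

# Dispatch up to `limit` rows, advance the cursor, return the batch size.
dispatch_batch = lambda do |limit|
  batch = pending_after.call(cursor).first(limit)
  batch.each { |row| dispatched << row.id }
  cursor = batch.last.id unless batch.empty?
  batch.size
end

dispatch_batch.call(2)                       # dispatches ids 2 and 4
table[2].state = "pending"                   # id 3 enters the scope below the cursor
loop { break if dispatch_batch.call(2).zero? } # resume until the scope is exhausted

# Rows that match the scope but never received a child.
missed = table.select { |r| r.state == "pending" }.map(&:id) - dispatched

# Position-keyed enumerables: inserting an item shifts every later key.
before = %w[a b c].each_with_index.map { |item, i| [i, item] }.to_h
after  = %w[a x b c].each_with_index.map { |item, i| [i, item] }.to_h
```

Here `missed` ends up holding id 3, the row that became pending after the cursor had passed it, and key `1` maps to `"b"` in `before` but to `"x"` in `after`. Both effects are why the source set should stay fixed for the branch's lifetime.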
 ## 🚀 Development
 After checking out the repo, run:
@@ -762,11 +1023,11 @@ This gem is available as open source under the terms of the [MIT License](https:
 | Method | Purpose | Key Parameters |
 |--------|---------|----------------|
-| `durably_execute` | Execute method with retry logic | `method`, `max_attempts: 3`, `name: nil` |
+| `durably_execute` | Execute method with retry logic | `method`, `retry_policy: nil`, `name: nil` |
 | `wait` | Time-based pause | `duration`, `name` |
-| `wait_until` | Condition-based waiting | `condition`, `timeout: 1.hour`, `check_interval: 15.minutes`, `retry_on: []` |
+| `wait_until` | Condition-based waiting | `condition`, `timeout: 1.hour`, `check_interval: 15.minutes`, `retry_policy: nil` |
 | `continue_if` | Manual continuation wait | `condition`, `name: nil` |
-| `durably_repeat` | Periodic task execution | `method`, `every:`, `till:`, `start_at: nil`, `max_attempts: 3`, `timeout: 1.hour`, `on_error: :continue` |
+| `durably_repeat` | Periodic task execution | `method`, `every:`, `till:`, `start_at: nil`, `retry_policy: nil`, `timeout: 1.hour`, `on_error: :continue` |
 ### Context Methods