chrono_forge 0.9.1 → 0.10.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -0
- data/README.md +305 -44
- data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md +1748 -0
- data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md.tasks.json +17 -0
- data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md +930 -0
- data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md.tasks.json +54 -0
- data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md +241 -0
- data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md.tasks.json +12 -0
- data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md +1378 -0
- data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md.tasks.json +67 -0
- data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md +709 -0
- data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md.tasks.json +19 -0
- data/docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md +226 -0
- data/docs/superpowers/specs/2026-06-25-chrono_forge-dashboard-design.md +190 -0
- data/docs/superpowers/specs/2026-06-25-composite-retry-policies-design.md +228 -0
- data/docs/superpowers/specs/2026-06-25-reserved-kwarg-guard-design.md +169 -0
- data/docs/superpowers/specs/2026-06-25-spawn-merge-branches-design.md +468 -0
- data/docs/superpowers/specs/2026-06-26-dashboard-branch-view-design.md +142 -0
- data/docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md +265 -0
- data/lib/chrono_forge/branch_merge_job.rb +138 -0
- data/lib/chrono_forge/branch_probe.rb +26 -0
- data/lib/chrono_forge/cleanup.rb +6 -0
- data/lib/chrono_forge/execution_log.rb +6 -0
- data/lib/chrono_forge/executor/composite_retry_policy.rb +47 -0
- data/lib/chrono_forge/executor/methods/branch.rb +185 -0
- data/lib/chrono_forge/executor/methods/durably_execute.rb +21 -19
- data/lib/chrono_forge/executor/methods/durably_repeat.rb +118 -25
- data/lib/chrono_forge/executor/methods/merge_branches.rb +83 -0
- data/lib/chrono_forge/executor/methods/wait.rb +2 -4
- data/lib/chrono_forge/executor/methods/wait_until.rb +25 -25
- data/lib/chrono_forge/executor/methods/workflow_states.rb +16 -0
- data/lib/chrono_forge/executor/methods.rb +2 -0
- data/lib/chrono_forge/executor/retry_policy.rb +111 -0
- data/lib/chrono_forge/executor.rb +216 -28
- data/lib/chrono_forge/version.rb +1 -1
- data/lib/chrono_forge/workflow.rb +10 -1
- data/lib/generators/chrono_forge/migration_actions.rb +1 -0
- data/lib/generators/chrono_forge/templates/add_chrono_forge_parent_execution_log.rb +38 -0
- metadata +42 -5
- data/lib/chrono_forge/executor/retry_strategy.rb +0 -29
checksums.yaml
CHANGED
````diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: f03445b6275e345beb34505d4d59a01d8450df220e94f07bc909c8c69059ab8d
+data.tar.gz: 9ba7aaa7364736f66778da68af4f21d7c944ac01270c7b92ca76bf09bf880738
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: f761f180b4e8323721cfffc0a7c2569f30ea8e5b8e085cc52ab32aeefa64ed2a45caac13dff38c168ae497c437816201cdf5c2946a85ab987814eecd852d97c6
+data.tar.gz: 22ca2b2ca99188b5117c06e2d9b313e726e0087ed48d3635d1064bcb82eee69b9f649a56a441a42e603c3b32ceb98c1aa788782dc01c5d2066f09dae0593900b
````
data/CHANGELOG.md
CHANGED
````diff
@@ -1,5 +1,27 @@
 ## [Unreleased]
 
+## [0.10.0] - 2026-06-27
+
+### Added
+
+- **Concurrent sub-workflows** — `branch` blocks plus `spawn` / `spawn_each` dispatch child workflows that run in parallel, joined later with `merge_branches` (or an inline automerge at branch-block close); the completion gate raises if a branch is left unmerged. `spawn_each` streams bulk dispatch with a resumable cursor and keys AR-sourced children by record PK, so a crash mid-dispatch resumes without re-running completed children. Adds the `parent_execution_log_id` column + `[parent_execution_log_id, state]` index (additive migration, installed by the `chrono_forge:upgrade` generator), with `ExecutionLog#spawned_workflows` / `Workflow#parent_execution_log` associations. Join progress is driven by `ChronoForge::BranchMergeJob`, a lightweight poller that holds no lock and never replays the parent: it re-arms on each pass (fenced by a per-pass `poll_token` so a superseded chain stops quietly), rekicks dropped child jobs, and records observable poll state on the branch logs. Requires `activejob >= 7.1`.
+- `ChronoForge::Executor::RetryPolicy` — a single, unified retry abstraction (attempt cap + exponential-with-jitter backoff + error-class predicate) used by every retry site: workflow-level uncaught errors, `durably_execute`, `durably_repeat`, and `wait_until` condition errors. Replaces the three previously-independent retry systems and two backoff algorithms.
+- Class-level `retry_policy` DSL to set a workflow's default retry policy, plus a per-call `retry_policy:` keyword on `durably_execute`, `durably_repeat`, and `wait_until`. Resolution is per-call → class default → per-site built-in. `wait_until` deliberately does not inherit the class default (so a class-wide "retry everything" can't silently retry condition-evaluation bugs).
+- **Composite retry policies** — pass an ordered array of `RetryPolicy` objects (per-call, or to the class-level `retry_policy` DSL as positional args) to give each error type its own independent attempt budget and backoff. The first policy whose `retry_on` matches the raised error wins (subclasses route to the policy that lists their ancestor; a trailing `retry_on: nil` is a catch-all; an unmatched error fails fast). Per-error counts are keyed by each policy's declared errors (`RetryPolicy#budget_key`) and persisted in execution-log metadata (steps) or the job args (workflow-level), so budgets are stable across replays and policy reordering. `RetryPolicy.compose(*policies)` builds one explicitly.
+
+### Changed
+
+- **Performance:** completed steps are now resolved from a single bulk read per replay instead of one indexed `SELECT` each. On every resume the engine replays the whole workflow body; previously each already-completed step cost its own lookup, so a workflow with hundreds of steps paid hundreds of `SELECT`s per resume (quadratic over its lifetime). Completed steps are now plucked once into a per-pass cache and short-circuited from a readonly, unsaved stand-in (no row, no round-trip); only not-yet-completed steps still hit the database. `durably_repeat` repetition logs are deliberately excluded from the cache — they accumulate without bound yet are never replayed — so repeat-heavy workflows don't pull their history into memory.
+- **BREAKING:** `durably_execute` and `durably_repeat` no longer accept `max_attempts:`; `wait_until` no longer accepts `retry_on:`. All three now take `retry_policy:` (a `RetryPolicy`). Migrate `max_attempts: N` → `retry_policy: RetryPolicy.new(max_attempts: N)` and `retry_on: [...]` → `retry_policy: RetryPolicy.new(retry_on: [...])`.
+- **BREAKING:** backoff is now exponential with jitter everywhere (previously the workflow level used a fixed array declared as `[1s,5s,30s,2m,10m]` — though the `should_retry? < 3` bug meant only its first three entries `[1s,5s,30s]` were ever reached — and steps used `2**n` capped at 32s). Workflow-level retries default to 10 attempts with a tolerant window of up to ~8.5 min (≈4 min typical with jitter; cap 600s) — wide enough to ride out a transient infra blip (DB failover, deploy restart) on an uncaught `perform` error, since each such retry replays the whole workflow. A *permanently* failing workflow is now retried 10 times before reaching `failed` (vs the previous effective 4). Note this path covers only uncaught errors in `perform`; a step exhausting its own retries stalls the workflow instead.
+
+### Fixed
+
+- Continuation jobs are now published only **after** the workflow lock is released. Every deferral primitive (`wait`, `wait_until`, `durably_execute` retry, `durably_repeat`, and the workflow-level retry) previously enqueued its continuation inline, while the enqueuing job still held the lock; an immediately-runnable (`delay == 0`) same-key continuation could be claimed by another worker before the lock was released, surfacing as a spurious `ConcurrentExecutionError` at lock acquisition. The continuation is now recorded during the run and flushed in the executor's `ensure` block after `release_lock`, closing the race.
+- `durably_repeat` catch-up is now O(1) for the skippable run instead of O(missed intervals). When a workflow resumes far behind schedule, the **expired prefix** (ticks older than `timeout`) is fast-forwarded in closed form to the first non-expired grid tick, rather than walking one zero-delay job per missed tick. **Behavior change:** the expired prefix now produces a single summary execution log (`error_class: "TimeoutError"`, `metadata["fast_forwarded"]` = number of ticks skipped) instead of one `"Execution timed out"` row per tick — update any dashboards or alerts that key off per-tick timeout rows. Ticks still inside their `timeout` window continue to execute as normal catch-up work.
+- Workflow-level retry no longer has a contradictory cap (`should_retry?` stopped at 3 while `RetryStrategy.max_attempts` was 5, making the array's `2m`/`10m` entries unreachable). The single `RetryPolicy` is now the sole decider.
+- Removed the dead `retry_method:` argument that `durably_execute` passed on reschedule but `perform` never bound.
+
 ## [0.9.1] - 2026-06-25
 
 ### Fixed
````
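The first **Fixed** entry above is about ordering. A deferral now only records its continuation during the run, and the executor publishes it in its `ensure` block after `release_lock`. The sketch below models just that ordering with an in-memory lock and queue. `FakeLock`, `FakeQueue`, `DeferredRunner` and `defer` are hypothetical names chosen for illustration; none of them are ChronoForge classes.

```ruby
# Hypothetical sketch: a continuation is recorded during the run and only
# published after the lock is released. Not ChronoForge's executor code.
class FakeLock
  def initialize
    @held = false
  end

  def held?
    @held
  end

  def acquire!
    raise "lock already held" if @held

    @held = true
  end

  def release!
    @held = false
  end
end

class FakeQueue
  attr_reader :published

  def initialize(lock)
    @lock = lock
    @published = []
  end

  # Remembers whether the lock was still held at publish time.
  def publish(job)
    @published << { job: job, lock_held: @lock.held? }
  end
end

class DeferredRunner
  def initialize(lock, queue)
    @lock = lock
    @queue = queue
    @continuation = nil
  end

  # Called by a deferral (wait, retry, ...): record the job, do not enqueue yet.
  def defer(job)
    @continuation = job
  end

  def run
    @lock.acquire!
    begin
      yield self
    ensure
      @lock.release!
      # Flushed only after release, so a zero-delay continuation claimed by
      # another worker cannot find the lock still held.
      @queue.publish(@continuation) if @continuation
      @continuation = nil
    end
  end
end

lock = FakeLock.new
queue = FakeQueue.new(lock)
DeferredRunner.new(lock, queue).run { |runner| runner.defer("resume-workflow") }
puts queue.published.first[:lock_held] # prints false
```

The queue records `lock_held: false`, so a worker that picks the job up immediately can take the lock. In the old ordering it would have hit the `ConcurrentExecutionError` the entry describes.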
data/README.md
CHANGED
````diff
@@ -7,20 +7,75 @@
 
 > A robust framework for building durable, distributed workflows in Ruby on Rails applications
 
-ChronoForge
+ChronoForge handles long-running processes, manages state, and recovers from failures in your Rails applications. Built on ActiveJob, it keeps critical business processes resilient and traceable.
+
+Workflows are **plain Ruby**. Ordinary `if`/`else`, loops, and early returns drive the flow. There's no declarative DSL to learn and no extra service to run, which makes ChronoForge a good fit for business processes whose shape depends on runtime state: conditional branches, iteration over data, and built-in periodic tasks (`durably_repeat`).
+
+> **In production** at **achieve by Petra**, an investment platform in the Petra Group — where it has executed over 3.6 million workflows and 32 million durable steps across scheduled payments, investment rollovers, and membership lifecycle management.
+
+## 🧭 Why ChronoForge
+
+Most Rails workflow tools ask you to declare your steps up front in a DSL:
+
+```ruby
+step :send_welcome_email
+step :remind_of_tasks, wait: 2.days
+step :complete_onboarding, wait: 15.days
+```
+
+That reads cleanly for a fixed, linear sequence. But many business processes branch, loop, and react to data that only exists at runtime, and a declarative schema gets awkward there. ChronoForge takes the opposite approach: **a workflow is just a Ruby method.** Conditionals, iteration, early returns, and helper methods all work the way they normally do.
+
+There is a real trade-off. Because the flow is ordinary code, ChronoForge can show the steps that **have run** (a replay/history view), but not a roadmap of steps that *haven't* run yet, which a declarative engine can. For workflows whose path isn't fixed in advance, that's a trade worth making; for a simple, fixed sequence ("send email, wait 2 days, send another"), a declarative DSL may read more cleanly, and that's a fine reason to reach for one.
+
+### How it compares
+
+| | ChronoForge | GenevaDrive | AcidicJob | Temporal |
+| ---------------------------- | -------------------- | ------------------ | --------------- | --------------- |
+| Programming model | procedural (plain Ruby) | declarative DSL | declarative DSL | procedural (via SDK) |
+| Built-in periodic tasks | ✓ `durably_repeat` | ✗ | ✗ | ✓ |
+| Pending-step visibility | ✗ (procedural) | ✓ | ✓ | ✗ (procedural) |
+| Extra infrastructure | none (DB + ActiveJob)| none | none | server required |
+| License | MIT | LGPL / commercial | MIT | MIT |
+
+<sub>Comparison reflects each project's documented features as of mid-2026, to the best of our knowledge; corrections welcome via PR.</sub>
+
+A few deliberate choices behind that table:
+
+- **Periodic tasks are built in.** `durably_repeat` runs a step on a schedule until a condition holds, with automatic catch-up for missed runs, so a workflow can be its own recurring job and cron-style monitor, right alongside the rest of its logic. Without built-in support, periodic behavior usually lives in a separate scheduler that you reconcile with workflow state by hand.
+- **No extra infrastructure.** ChronoForge is a gem over your existing database and ActiveJob backend. There's no separate server or daemon to operate, unlike Temporal.
+- **Recovery is built into the model.** Steps are append-only history, so a crashed step leaves the workflow `stalled`, recoverable directly with `retry_later`.
+- **MIT licensed.** Permissive and dependency-policy-friendly.
 
 ## 🌟 Features
 
+- **Plain-Ruby control flow**: Branching, loops, and iteration over runtime data, without a DSL or step registry
 - **Durable Execution**: Automatically tracks and recovers from failures during workflow execution
+- **Periodic tasks built in**: `durably_repeat` runs a step on an interval until a condition is met, with catch-up for missed runs. Acts as a recurring task and a cron-style monitor in one
+- **Wait States**: Time-based waits and condition-based waiting (`wait_until`) that survive restarts
 - **State Management**: Built-in workflow state tracking with persistent context storage
 - **Concurrency Control**: Advanced locking mechanisms to prevent parallel execution of the same workflow
-- **Error Handling**:
+- **Error Handling**: Error tracking with a unified, configurable [`RetryPolicy`](#-retry-policies) (including per-error-type policies)
 - **Execution Logging**: Detailed logging of workflow steps and errors for visibility
-- **
-- **Database-Backed**: All workflow state is persisted to ensure durability
+- **Database-Backed**: All workflow state is persisted to ensure durability, with no extra services to run
 - **ActiveJob Integration**: Compatible with all ActiveJob backends, though database-backed processors (like Solid Queue) provide the most reliable experience for long-running workflows
 - **Retention & Cleanup**: A schedulable job to prune finished workflows and the unbounded logs that periodic tasks accumulate (see [Cleanup & Retention](#-cleanup--retention))
 
+## 🖥️ Dashboard
+
+ChronoForge has a free, mountable dashboard for visibility and recovery: workflow list, step replay timeline, context inspector, periodic-task health, wait-state age, and retry/unlock actions. It ships as a separate gem, `chrono_forge-dashboard`, so the core stays lean.
+
+[](chrono_forge-dashboard/README.md#screenshots)
+
+```ruby
+# Gemfile
+gem "chrono_forge-dashboard"
+
+# config/routes.rb
+mount ChronoForge::Dashboard::Engine, at: "/chrono_forge"
+```
+
+See [`chrono_forge-dashboard`](chrono_forge-dashboard/README.md) for setup, authentication, and [more screenshots](chrono_forge-dashboard/README.md#screenshots).
+
 ## 📦 Installation
 
 Add to your application's Gemfile:
````
````diff
@@ -136,6 +191,54 @@ class OrderProcessingWorkflow < ApplicationJob
 end
 ```
 
+### A workflow you can't flatten into a step list
+
+The example above is linear, but most real processes aren't. Because a ChronoForge workflow is plain Ruby, branching and dynamic iteration are just… branching and iteration:
+
+```ruby
+class OrderProcessingWorkflow < ApplicationJob
+prepend ChronoForge::Executor
+
+def perform(order_id:)
+@order_id = order_id
+
+wait_until :payment_confirmed?
+durably_execute :validate_order
+
+# Runtime branching: the path depends on data known only at execution time
+if context["requires_compliance_check"]
+durably_execute :run_compliance_review
+wait_until :compliance_approved?, timeout: 48.hours
+end
+
+# Iterate over runtime data: one durable, idempotent step per item
+context["line_item_ids"].each do |item_id|
+context["current_item_id"] = item_id
+durably_execute :fulfill_item, name: "fulfill_#{item_id}"
+end
+
+# Recurring notification: nudge the customer until they confirm delivery
+durably_repeat :send_delivery_reminder, every: 3.days, till: :delivery_confirmed?
+
+durably_execute :complete_order
+end
+
+private
+
+def fulfill_item
+FulfillmentService.fulfill(@order_id, context["current_item_id"])
+end
+
+def send_delivery_reminder
+OrderMailer.delivery_reminder(@order_id).deliver_later
+end
+
+# ... other condition and step methods ...
+end
+```
+
+Each `durably_execute` is checkpointed by its step name, so on resume the completed branches and items are skipped and the workflow continues where it left off. A fixed, declared list of steps can't easily express runtime branches, a loop over a runtime-sized collection, and an open-ended recurring notification.
+
 ### Core Workflow Features
 
 #### 🚀 Executing Workflows
````
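The new README paragraph says each `durably_execute` is checkpointed by its step name, so a resume skips completed branches and items. Below is a toy in-memory model of that replay rule. It assumes a set of completed names can stand in for the persisted execution logs; the CHANGELOG's single bulk read of completed steps per replay serves a similar purpose. `ToyReplay` is hypothetical and not part of ChronoForge.

```ruby
require "set"

# Hypothetical toy model of replay keyed by step name; not ChronoForge's API.
class ToyReplay
  attr_reader :runs

  def initialize
    @completed = Set.new # stands in for the completed-step execution logs
    @runs = Hash.new(0)  # how many times each step body really ran
  end

  def durably_execute(name)
    return if @completed.include?(name) # completed earlier: skip on replay

    yield
    @runs[name] += 1
    @completed << name
  end
end

items = [11, 12, 13]
replay = ToyReplay.new

# First pass: the step for item 12 raises, so the pass stops there.
begin
  items.each do |id|
    replay.durably_execute("fulfill_#{id}") { raise "transient failure" if id == 12 }
  end
rescue RuntimeError
  # a real workflow would be retried or stalled at this point
end

# Resume: the whole body runs again, but only unfinished items execute.
items.each { |id| replay.durably_execute("fulfill_#{id}") {} }
puts replay.runs.values.inspect # prints [1, 1, 1]
```

Item 11 is skipped on resume while item 12 runs again. That works only because each iteration has a distinct name, which is what `name: "fulfill_#{item_id}"` provides in the README example.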
````diff
@@ -162,14 +265,15 @@ OrderProcessingWorkflow.perform_later(
 
 #### ⚡ Durable Execution
 
-The `durably_execute` method
+The `durably_execute` method runs an operation with automatic retries, and skips it on replay once it has completed:
 
 ```ruby
 # Basic execution
 durably_execute :send_welcome_email
 
-# With custom retry
-durably_execute :critical_payment_processing,
+# With a custom retry policy
+durably_execute :critical_payment_processing,
+retry_policy: RetryPolicy.new(max_attempts: 5)
 
 # With custom name for tracking multiple calls to same method
 durably_execute :upload_file, name: "profile_image_upload"
````
````diff
@@ -182,10 +286,10 @@ class FileProcessingWorkflow < ApplicationJob
 @file_id = file_id
 
 # This might fail due to network issues, rate limits, etc.
-durably_execute :upload_to_s3, max_attempts: 5
+durably_execute :upload_to_s3, retry_policy: RetryPolicy.new(max_attempts: 5)
 
 # Process file after successful upload
-durably_execute :generate_thumbnails, max_attempts: 3
+durably_execute :generate_thumbnails, retry_policy: RetryPolicy.new(max_attempts: 3)
 end
 
 private
````
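The `RetryPolicy.new(max_attempts: ...)` calls above space their retries with exponential backoff plus jitter. The CHANGELOG puts the workflow-level default (10 attempts, cap 600 s) at "up to ~8.5 min". The sketch below uses one common equal-jitter formulation and rests on three assumptions:

- retry *n* waits between half and all of `min(cap, base * 2**(n - 1))` seconds;
- the base is 1 s;
- `backoff_delay` and `max_window` are hypothetical helpers, not the gem's code.

Under those assumptions, 10 attempts means 9 retries, and their un-jittered delays sum to 1 + 2 + ... + 256 = 511 s, about 8.5 min.

```ruby
# Hypothetical sketch of exponential backoff with equal jitter.
# Assumes retry n (1-based) has an un-jittered delay of min(cap, base * 2**(n - 1)).
def backoff_delay(retry_number, base:, cap:, jitter: true, rng: Random.new)
  ceiling = [cap, base * 2**(retry_number - 1)].min
  return ceiling.to_f unless jitter

  half = ceiling / 2.0
  half + rng.rand * half # equal jitter: uniform in [ceiling / 2, ceiling)
end

# Upper bound of the whole retry window: max_attempts - 1 retries, no jitter.
def max_window(max_attempts:, base:, cap:)
  (1...max_attempts).sum { |n| backoff_delay(n, base: base, cap: cap, jitter: false) }
end

puts max_window(max_attempts: 10, base: 1, cap: 600) # prints 511.0 (about 8.5 minutes)
puts max_window(max_attempts: 3, base: 1, cap: 30)   # prints 3.0
```

Equal jitter keeps at least half of each computed delay. That stops a burst of failures from retrying almost at once while still spreading the retries out.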
````diff
@@ -204,9 +308,77 @@ end
 
 **Key Features:**
 - **Idempotent**: Same operation won't be executed twice during replays
-- **Automatic Retries**: Failed executions retry
+- **Automatic Retries**: Failed executions retry per a unified `RetryPolicy` (exponential backoff with jitter; the step default caps at 30s over 3 attempts)
 - **Error Tracking**: All failures are logged with detailed error information
-- **Configurable**:
+- **Configurable**: Pass a `retry_policy:` per call, or set a class-wide default with the `retry_policy` DSL (see [Retry Policies](#retry-policies))
+
+#### 🔁 Retry Policies
+
+All retrying in ChronoForge goes through a single `RetryPolicy` (`ChronoForge::Executor::RetryPolicy`). It answers two questions: *should this failure be retried?* and *how long until the next attempt?*
+
+```ruby
+RetryPolicy.new(
+max_attempts: 3, # cap on total attempts; nil = no count cap (bounded elsewhere)
+base: 1, # seconds; delay of the first retry
+cap: 30, # seconds; ceiling for a single delay
+jitter: true, # spread retries with equal jitter
+retry_on: nil # nil = retry any StandardError; [Classes] = only those; [] = none
+)
+```
+
+Backoff is exponential with equal jitter, computed once at re-enqueue time (never replayed, so it stays deterministic where it matters).
+
+**Resolution order:**
+
+- **`durably_execute`, `durably_repeat`, workflow-level errors**: per-call `retry_policy:` → class-level `retry_policy` default → built-in default.
+- **`wait_until`**: per-call `retry_policy:` → built-in default. It deliberately does **not** inherit the class default, so a class-wide "retry everything" can't silently turn condition-evaluation bugs into retried errors.
+
+**Built-in defaults:**
+
+| Site | Default | Why |
+|------|---------|-----|
+| Steps (`durably_execute`/`durably_repeat`) | 3 attempts, cap 30s, retry any error | flaky calls fail fast |
+| Workflow-level (uncaught errors) | 10 attempts, cap 600s, retry any error | tolerant window up to ~8.5 min (≈4 min typical w/ jitter) for transient infra errors; each retry replays the whole workflow from the top |
+| `wait_until` condition errors | retry nothing | a raised condition is usually a bug, not transient |
+
+**Class-wide default via the `retry_policy` DSL:**
+
+```ruby
+class ChargeWorkflow < ApplicationJob
+prepend ChronoForge::Executor
+retry_policy max_attempts: 5, base: 2, cap: 60 # applies to steps + workflow-level
+
+def perform
+durably_execute :charge,
+retry_policy: RetryPolicy.new(max_attempts: 8, retry_on: [Net::OpenTimeout])
+wait_until :settled?,
+retry_policy: RetryPolicy.new(retry_on: [BankApiError])
+end
+end
+```
+
+**Composite policies (per-error budgets):**
+
+Pass an **array** of policies to handle different error types differently. On a failure, the **first** policy whose `retry_on` matches the raised error applies, and each error type gets its **own attempt budget and backoff**:
+
+```ruby
+durably_execute :charge_card, retry_policy: [
+RetryPolicy.new(retry_on: [NetworkError], max_attempts: 5), # transient: retry hard
+RetryPolicy.new(retry_on: [RateLimitError], max_attempts: 10, base: 5), # back off longer
+RetryPolicy.new(retry_on: [PaymentDeclinedError], max_attempts: 1), # fail fast, never retry
+RetryPolicy.new(retry_on: nil) # catch-all (optional), keep last
+]
+```
+
+- **Order matters**: the first matching policy wins, so list specific errors first and a catch-all (`retry_on: nil`) last. An error matched by no policy is **not retried** (fails fast).
+- A subclass of a listed error routes to that policy and draws from its budget.
+- Per-error counts are tracked by the policy's declared errors, so the budgets are stable even if you reorder the list.
+- The class-level DSL accepts the same form as positional arguments (applies to steps **and** workflow-level errors):
+
+```ruby
+retry_policy RetryPolicy.new(retry_on: [NetworkError], max_attempts: 5),
+RetryPolicy.new(retry_on: nil, max_attempts: 2)
+```
 
 #### ⏱️ Wait States
 
````
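The composite rules above can be modelled in a few lines:

- the first matching policy wins;
- subclasses route through a listed ancestor;
- `retry_on: nil` catches everything else;
- unmatched errors fail fast;
- counts are keyed by each policy's declared errors.

This is a hedged sketch. `SketchPolicy` and `decide` are hypothetical, and the string key only illustrates the idea behind `RetryPolicy#budget_key`; its real format is not shown in this diff. Backoff is left out.

```ruby
# Hypothetical stand-in for RetryPolicy; illustrates routing and budgets only.
SketchPolicy = Struct.new(:retry_on, :max_attempts, keyword_init: true) do
  # retry_on: nil matches any StandardError; [] matches nothing.
  def matches?(error)
    return error.is_a?(StandardError) if retry_on.nil?

    retry_on.any? { |klass| error.is_a?(klass) } # is_a? also matches subclasses
  end

  # Derived from the declared errors, not the list position, so reordering
  # the policies does not move their counters. Format is illustrative only.
  def budget_key
    retry_on.nil? ? "*" : retry_on.map(&:name).sort.join(",")
  end
end

NetworkError = Class.new(StandardError)
ReadTimeoutError = Class.new(NetworkError)
PaymentDeclinedError = Class.new(StandardError)

# Returns :retry or :fail and bumps the matching policy's own failure count.
def decide(policies, error, counts)
  policy = policies.find { |candidate| candidate.matches?(error) }
  return :fail unless policy # no match: fail fast

  counts[policy.budget_key] += 1
  counts[policy.budget_key] < policy.max_attempts ? :retry : :fail
end

policies = [
  SketchPolicy.new(retry_on: [NetworkError], max_attempts: 3),
  SketchPolicy.new(retry_on: [PaymentDeclinedError], max_attempts: 1)
]
counts = Hash.new(0)

puts decide(policies, ReadTimeoutError.new, counts)     # prints retry (routed via NetworkError)
puts decide(policies, PaymentDeclinedError.new, counts) # prints fail
puts decide(policies, ArgumentError.new, counts)        # prints fail (no catch-all listed)
```

Because each key comes from the error classes rather than the array index, a `PaymentDeclinedError` failure never draws down the network budget. Moving a policy up or down the list does not reset its count either.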
````diff
@@ -243,11 +415,11 @@ wait_until :external_api_ready?,
 timeout: 30.minutes,
 check_interval: 1.minute
 
-# Wait with retry on specific errors
+# Wait with retry on specific errors raised while evaluating the condition
 wait_until :database_migration_complete?,
 timeout: 2.hours,
 check_interval: 30.seconds,
-retry_on: [ActiveRecord::ConnectionNotEstablished, Net::TimeoutError]
+retry_policy: RetryPolicy.new(retry_on: [ActiveRecord::ConnectionNotEstablished, Net::TimeoutError])
 
 # Complex condition example
 def third_party_service_ready?
````
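These `wait_until` changes depend on the README's resolution order. A per-call `retry_policy:` wins, then the class default, then a built-in. The exception is `wait_until`, which never falls back to the class default. The sketch below writes that lookup as a plain function. Hashes stand in for `RetryPolicy` objects, the site symbols are hypothetical, and the numbers come from the README's built-in defaults table.

```ruby
# Hypothetical sketch of the documented resolution order; not ChronoForge's code.
# Hashes stand in for RetryPolicy objects; values follow the README's defaults table.
BUILT_IN_DEFAULTS = {
  step: { max_attempts: 3, cap: 30, retry_on: nil },       # durably_execute / durably_repeat
  workflow: { max_attempts: 10, cap: 600, retry_on: nil }, # uncaught errors in perform
  wait_until: { retry_on: [] }                             # condition errors: retry nothing
}.freeze

def resolve_retry_policy(site, per_call: nil, class_default: nil)
  return per_call if per_call
  # wait_until deliberately skips the class-wide default
  return class_default if class_default && site != :wait_until

  BUILT_IN_DEFAULTS.fetch(site)
end

class_default = { max_attempts: 5, base: 2, cap: 60 }
step_policy = resolve_retry_policy(:step, class_default: class_default)
wait_policy = resolve_retry_policy(:wait_until, class_default: class_default)
puts step_policy[:max_attempts]    # prints 5
puts wait_policy[:retry_on].empty? # prints true
```

Keeping `wait_until` off the class default means a broad "retry everything" setting cannot turn a buggy condition method into an endless retry loop.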
````diff
@@ -258,7 +430,7 @@ end
 wait_until :third_party_service_ready?,
 timeout: 1.hour,
 check_interval: 2.minutes,
-retry_on: [Net::TimeoutError, Net::HTTPClientException]
+retry_policy: RetryPolicy.new(retry_on: [Net::TimeoutError, Net::HTTPClientException])
 ```
 
 **3. Event-driven Waits (`continue_if`)**
````
````diff
@@ -328,7 +500,7 @@ PaymentWorkflow.perform_later("order-#{order_id}", order_id: order_id)
 
 #### 🔄 Periodic Tasks
 
-
+`durably_repeat` runs periodic tasks inside a workflow. A task is scheduled at a regular interval until a condition is met, with automatic catch-up for missed executions and configurable error handling.
 
 ```ruby
 class NotificationWorkflow < ApplicationJob
````
````diff
@@ -379,7 +551,7 @@ end
 
 - **Idempotent Execution**: Each repetition gets a unique execution log, preventing duplicates during replays
 - **Automatic Catch-up**: Missed executions due to downtime are automatically skipped using timeout-based fast-forwarding
-- **
+- **Custom Timing**: Custom start times and precise interval scheduling
 - **Error Resilience**: Individual execution failures don't break the periodic schedule
 - **Configurable Error Handling**: Choose between continuing despite failures or failing the entire workflow
 
````
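The **Automatic Catch-up** bullet ties in with the CHANGELOG's closed-form fast-forward. Ticks older than `timeout` are skipped in one step to the first non-expired grid tick. The number skipped is recorded once, as `metadata["fast_forwarded"]` on a single `TimeoutError` summary log. The arithmetic reduces to one ceiling division. This sketch assumes:

- integer-second times on the grid `next_tick + k * every`;
- a tick exactly `timeout` seconds old is still live;
- `fast_forward` is a hypothetical helper.

```ruby
# Hypothetical closed-form catch-up for a periodic schedule; not ChronoForge's code.
# Times are integer seconds; ticks sit on the grid next_tick + k * every.
# Assumption: a tick is expired when it is more than `timeout` seconds old.
def fast_forward(next_tick:, every:, timeout:, now:)
  oldest_allowed = now - timeout
  return { next_tick: next_tick, skipped: 0 } if next_tick >= oldest_allowed

  gap = oldest_allowed - next_tick
  skipped = (gap + every - 1) / every # ceiling division: O(1), no per-tick loop
  { next_tick: next_tick + skipped * every, skipped: skipped }
end

# Daily task with a 1-hour timeout, resumed 30 minutes after its day-10 tick.
result = fast_forward(next_tick: 0, every: 86_400, timeout: 3_600, now: 865_800)
puts result[:skipped]   # prints 10
puts result[:next_tick] # prints 864000
```

Ticks for days 0 through 9 are expired and collapse into the skip count. The day-10 tick is only 30 minutes old, so it still runs as normal catch-up work.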
````diff
@@ -390,7 +562,7 @@ durably_repeat :generate_daily_report,
 every: 1.day, # Execution interval
 till: :reports_complete?, # Stop condition
 start_at: Date.tomorrow.beginning_of_day, # Custom start time (optional)
-max_attempts: 5,
+retry_policy: RetryPolicy.new(max_attempts: 5), # Retry policy per execution (default: step_default)
 timeout: 2.hours, # Catch-up timeout (default: 1.hour)
 on_error: :fail_workflow, # Error handling (:continue or :fail_workflow)
 name: "daily_reports" # Custom task name (optional)
````
````diff
@@ -447,7 +619,7 @@ end
 
 The context supports serializable Ruby objects (Hash, Array, String, Integer, Float, Boolean, and nil) and validates types automatically.
 
-Hash and Array values are stored as JSON, which has no symbols
+Hash and Array values are stored as JSON, which has no symbols, so **symbol keys inside a stored hash come back as strings**:
 
 ```ruby
 context[:totals] = { paid: 5, pending: 2 }
````
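The symbol-key caveat is plain JSON behaviour, and Ruby's standard `json` library reproduces it directly. This sketch round-trips a hash the way a JSON-backed store would. It does not use ChronoForge's context object.

```ruby
require "json"

# Round-trip a hash through JSON, as a JSON-backed store would.
stored = JSON.generate({ paid: 5, pending: 2 })
loaded = JSON.parse(stored)

p loaded["paid"] # prints 5
p loaded[:paid]  # prints nil: the symbol key did not survive
p loaded.keys    # prints ["paid", "pending"]
```

JSON object keys are always strings, so the symbol-ness of `:paid` is lost as soon as the hash is serialized.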
````diff
@@ -455,33 +627,31 @@ context[:totals] # => { "paid" => 5, "pending" => 2 }
 context[:totals]["paid"] # => 5 (not context[:totals][:paid])
 ```
 
-(The top-level context key itself is interchangeable
+(The top-level context key itself is interchangeable: `context[:totals]` and `context["totals"]` refer to the same entry.)
 
-Context is meant for **small working state
+Context is meant for **small working state**: ids, flags, timestamps, and small structures used to coordinate steps. Each value is capped at **16 KB** (a `ChronoForge::Executor::Context::ValidationError` is raised above that). Store large payloads (documents, uploads, API responses) in their own storage and keep just a reference (an id or key) in the context.
 
 ### 🛡️ Error Handling
 
-ChronoForge automatically tracks errors and
+ChronoForge automatically tracks errors and routes all retrying through a single [`RetryPolicy`](#-retry-policies). Configure it per call with `retry_policy:`, or set a class-wide default with the `retry_policy` DSL:
 
 ```ruby
 class MyWorkflow < ApplicationJob
 prepend ChronoForge::Executor
 
-
+# Class-wide default for workflow-level errors and steps without an override
+retry_policy max_attempts: 5, base: 2, cap: 60
 
-def
-
-
-
-when ValidationError
-false # Don't retry validation errors
-else
-attempt_count < 3 # Default retry policy
-end
+def perform
+# Retry only network errors, up to 5 times, for this step
+durably_execute :call_external_api,
+retry_policy: RetryPolicy.new(max_attempts: 5, retry_on: [NetworkError])
 end
 end
 ```
 
+To make an error non-retryable, leave it out of `retry_on:` (an empty `retry_on: []` retries nothing).
+
 ## 🧪 Testing
 
 ChronoForge is designed to be easily testable using [ChaoticJob](https://github.com/fractaledmind/chaotic_job), a testing framework that makes it simple to test complex job workflows:
````
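The 16 KB per-value cap suggests checking a value's size before it goes into context. This is a hedged sketch under two assumptions: the limit is 16 × 1024 bytes, and it is measured on the JSON-serialized value. The README states neither how size is counted nor whether a KB is 1,000 or 1,024 bytes. `fits_in_context?` and `CONTEXT_VALUE_LIMIT` are hypothetical names. ChronoForge itself raises `ChronoForge::Executor::Context::ValidationError` rather than returning a boolean.

```ruby
require "json"

# Hypothetical pre-check before writing a value into workflow context.
# Assumes the 16 KB limit applies to the JSON-serialized byte size.
CONTEXT_VALUE_LIMIT = 16 * 1024

def fits_in_context?(value)
  JSON.generate(value).bytesize <= CONTEXT_VALUE_LIMIT
end

small = { "document_id" => 42, "status" => "uploaded" }
large = { "body" => "x" * 20_000 }

puts fits_in_context?(small) # prints true
puts fits_in_context?(large) # prints false: store it elsewhere, keep only an id
```

If a value fails the check, the README's advice applies: put the payload in its own storage and keep only its id or key in context.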
````diff
@@ -550,7 +720,7 @@ ChronoForge is ideal for:
 
 ## 🧠 Advanced State Management
 
-ChronoForge workflows
+ChronoForge workflows move through a state machine. Understanding these states and transitions helps with troubleshooting and recovery.
 
 ### Workflow State Diagram
 
````
````diff
@@ -609,8 +779,7 @@ stateDiagram-v2
 
 #### Recovering Stalled/Failed Workflows
 
-Re-execute a failed or stalled workflow directly from its record
-constantize the job class or re-pass the key. Execution resumes via replay, so
+Re-execute a failed or stalled workflow directly from its record. Execution resumes via replay, so
 completed steps are skipped and it picks up at the step that failed:
 
 ```ruby
````
````diff
@@ -621,7 +790,7 @@ workflow.retry_now # re-run inline (console/debugging)
 ```
 
 Only `stalled` or `failed` workflows are retryable. `retryable?` lets you check
-first, and both methods **validate up front
+first, and both methods **validate up front**: calling `retry_later`
 on a non-retryable workflow raises `ChronoForge::Executor::WorkflowNotRetryableError`
 immediately rather than enqueuing a job that would fail in the worker:
 
````
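The validate-up-front behaviour described in this hunk can be sketched in plain Ruby. `RecordSketch`, `NotRetryableSketchError` and the in-memory queue are hypothetical stand-ins for the workflow record, `ChronoForge::Executor::WorkflowNotRetryableError` and the job queue; only the rule that `stalled` and `failed` are the retryable states comes from the README.

```ruby
# Hypothetical sketch of "validate up front"; these are not ChronoForge classes.
class NotRetryableSketchError < StandardError; end

class RecordSketch
  # From the README: only stalled or failed workflows are retryable.
  RETRYABLE_STATES = %w[stalled failed].freeze

  attr_reader :state, :queue

  def initialize(state, queue)
    @state = state
    @queue = queue
  end

  def retryable?
    RETRYABLE_STATES.include?(state)
  end

  # Check in the caller's process first, so a bad call fails here
  # instead of enqueuing a job that would fail in the worker.
  def retry_later
    raise NotRetryableSketchError, "workflow is #{state}" unless retryable?

    queue << self # stand-in for enqueuing the resume job
  end
end

queue = []
RecordSketch.new("failed", queue).retry_later # accepted and enqueued
begin
  RecordSketch.new("completed", queue).retry_later
rescue NotRetryableSketchError => e
  puts "refused: #{e.message}" # prints "refused: workflow is completed"
end
queue.size # => 1, the refused call never reached the queue
```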
````diff
@@ -660,14 +829,14 @@ ChronoForge keeps every workflow and execution-log row indefinitely so that
 replays remain idempotent. Over time two things grow without bound:
 
 1. **Terminal workflows** (`completed` / `failed`) that are no longer needed.
-2. **`durably_repeat` repetition logs
+2. **`durably_repeat` repetition logs**: one row per scheduled execution. A
    long-lived periodic workflow never reaches a terminal state, so its
    repetition logs accumulate indefinitely. Past repetitions (those behind the
    task's current frontier) are never read again, since each resume recomputes
-   the next execution from the coordination log
+   the next execution from the coordination log, so they are safe to prune (see
    the safety note below).
 
-`ChronoForge::Cleanup` reclaims both. It is **not** run automatically
+`ChronoForge::Cleanup` reclaims both. It is **not** run automatically; schedule
 it from your own scheduler so you stay in control of retention:
 
 ```ruby
````
````diff
@@ -692,14 +861,14 @@ Notes:
   that are both older than the window **and** scheduled strictly before the
   periodic task's current frontier (the coordination log's `last_execution_at`).
   Anything at or after the frontier is kept so `durably_repeat`'s catch-up
-  mechanism is never disrupted
+  mechanism is never disrupted, so the window is purely a retention preference
   and is safe even for yearly schedules.
 - Workflow retention is measured from when a workflow became terminal, not when
-  it was created
+  it was created. A long-running workflow that only just finished is kept for
   the full window. Completed workflows use `completed_at` (immutable); failed
   workflows use `updated_at` (they have no `completed_at`).
 - The composite `[state, completed_at]` index added in this version keeps these
-  scans efficient
+  scans efficient; run `chrono_forge:upgrade` if you installed an earlier
   version.
 
 A ready-made job is bundled so you can schedule it with any recurring-job
````
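The two retention rules in these notes reduce to simple predicates. This is a sketch, not `ChronoForge::Cleanup`'s implementation: `RetentionSketch` and its Hash rows are hypothetical, and measuring a repetition log's age from its scheduled time is an assumption (the diff only says "older than the window").

```ruby
# Hypothetical sketch of the retention rules; NOT ChronoForge::Cleanup.
module RetentionSketch
  module_function

  # Repetition logs: prune only rows older than the window AND scheduled
  # strictly before the task's frontier (last_execution_at).
  # Assumption: age is measured from scheduled_at.
  def repetition_prunable?(scheduled_at:, frontier:, now:, window:)
    scheduled_at < now - window && scheduled_at < frontier
  end

  # Terminal workflows age from when they became terminal:
  # completed rows use completed_at, failed rows use updated_at.
  def workflow_prunable?(row, now:, window:)
    terminal_at = { "completed" => row[:completed_at],
                    "failed" => row[:updated_at] }[row[:state]]
    return false if terminal_at.nil? # non-terminal workflows are never pruned

    terminal_at < now - window
  end
end

day = 24 * 60 * 60
now = Time.utc(2026, 7, 1)
window = 30 * day               # retention window in seconds
frontier = Time.utc(2026, 4, 1) # the task's last_execution_at

RetentionSketch.repetition_prunable?(scheduled_at: Time.utc(2026, 3, 1),
                                     frontier: frontier, now: now, window: window) # => true
# Older than the window, but not strictly before the frontier, so it is kept:
RetentionSketch.repetition_prunable?(scheduled_at: Time.utc(2026, 4, 1),
                                     frontier: frontier, now: now, window: window) # => false

long_runner = { state: "completed", created_at: Time.utc(2025, 1, 1),
                completed_at: Time.utc(2026, 6, 30) }
RetentionSketch.workflow_prunable?(long_runner, now: now, window: window) # => false, finished recently
old_failure = { state: "failed", updated_at: Time.utc(2026, 4, 1) }
RetentionSketch.workflow_prunable?(old_failure, now: now, window: window) # => true
```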
````diff
@@ -726,6 +895,98 @@ production:
     schedule: every day at 3am
 ```
 
+## 🌿 Branches: parallel sub-workflows
+
+`branch` / `spawn` / `spawn_each` / `merge_branches` let a workflow fan out into
+child workflows that run concurrently, then join them when their results are
+needed.
+
+### Model
+
+- **`branch :name do … end`** opens a named branch (a durable step). Inside the
+  block, `spawn` and `spawn_each` create and immediately enqueue child workflows —
+  children start running as soon as the branch block is entered.
+- **`spawn :name, WorkflowClass, **kwargs`** — enqueues one child workflow.
+- **`spawn_each :name, source do |item| [WorkflowClass, kwargs] end`** — enqueues
+  one child per item. The block returns the class and kwargs, so one branch can
+  fan out into mixed workflow types. Sources are iterated in constant memory;
+  ActiveRecord relations are streamed by primary key — pass them **without** an
+  explicit `.order`.
+- **`automerge: true`** — joins the branch **inline at the block's close**.
+  Execution does not continue past the `branch` call until every child has
+  completed. Use it for "dispatch this group and wait right here."
+- **`merge_branches :a, :b`** (or the singular alias `merge_branch :a`) — the
+  separate join point. Open branches without `automerge`, do other work while the
+  children run, then join when you need their results. `merge_branches` blocks
+  until all named branches are complete.
+
+### Worked example
+
+```ruby
+class FulfillmentWorkflow < ApplicationJob
+  prepend ChronoForge::Executor
+
+  def perform(cycle_id:)
+    # automerge: the branch is joined inline, right where the block closes —
+    # `perform` does not continue past it until every child has completed.
+    branch :reconcile, automerge: true do
+      spawn :eu, ReconcileWorkflow, region: "EU"
+      spawn_each :orders, Order.pending do |order|
+        order.priority? ? [PriorityOrderWorkflow, { order_id: order.id }]
+                        : [OrderWorkflow, { order_id: order.id }]
+      end
+    end
+
+    # For branches you want to run concurrently and join later, omit automerge
+    # and use merge_branches:
+    branch :invoices do
+      spawn_each :unpaid, Invoice.unpaid do |inv|
+        [InvoiceWorkflow, { invoice_id: inv.id }]
+      end
+    end
+    branch :shipments do
+      spawn_each :ready, Shipment.ready do |s|
+        [ShipmentWorkflow, { shipment_id: s.id }]
+      end
+    end
+    do_other_work # runs while :invoices and :shipments dispatch/run
+    merge_branches :invoices, :shipments # join both here
+
+    durably_execute :finalize
+  end
+end
+```
+
+### Caveats
+
+> **Every branch must be joined.** A branch opened and never joined raises
+> `ChronoForge::Executor::UnmergedBranchError` when the workflow tries to
+> complete — fail-fast, no silently-orphaned children. Use either
+> `automerge: true` or a matching `merge_branches` call.
+
+> **The parent isn't replayed while waiting.** A lightweight
+> `ChronoForge::BranchMergeJob` polls for child completion; the parent workflow
+> only runs again once the branch is fully done. Polling cadence adapts to how
+> many children remain.
+
+> **`spawn_each` sources must re-enumerate deterministically across replays.**
+> ActiveRecord relations are streamed by primary key (children are keyed by
+> record id, so crash-resume is idempotent); a relation carrying an explicit
+> `.order(...)` raises. For non-AR enumerables, items are keyed by position, so
+> inserting or removing items mid-dispatch would shift keys and break idempotency.
+
+> **`spawn_each` AR sources must have stable membership.** Dispatch streams by
+> ascending primary key and resumes from the last key on crash-recovery, so a row
+> that enters the relation *below* the cursor after it has passed (e.g. a
+> `where(state: …)` scope whose rows mutate mid-dispatch) will never get a child.
+> Point `spawn_each` at a set that is fixed for the branch's lifetime — a frozen id
+> range, an append-only table, or `where(id: [...])` over a snapshot.
+
+> **`branch` blocks cannot be lexically nested within one workflow.** Opening a
+> `branch` inside another `branch` block raises `ArgumentError`; spawns belong to
+> exactly one branch. (A *spawned child workflow* may open its own branches — it
+> runs in its own executor — so cross-workflow nesting is fine.)
+
 ## 🚀 Development
 
 After checking out the repo, run:
````
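The two `spawn_each` caveats in this hunk can be reproduced with a toy simulation. Arrays stand in for an ActiveRecord relation, and `dispatch_batch` is a hypothetical helper modelling "stream by ascending primary key, resume after the last dispatched key"; it is not ChronoForge's dispatcher, and the `"item-N"` key format is likewise made up for illustration.

```ruby
# Toy model of spawn_each dispatch over an AR-like source. Hypothetical names
# throughout; arrays stand in for the relation and the durable cursor.

# Dispatch up to +batch+ ids above +cursor+, in ascending order, and return the
# new cursor (the last id dispatched), which is where a resume would start.
def dispatch_batch(source_ids, cursor, batch, dispatched)
  next_ids = source_ids.sort.select { |id| id > cursor }.first(batch)
  dispatched.concat(next_ids)
  next_ids.last || cursor # unchanged when nothing is left
end

dispatched = []
matching = [3, 7, 9, 12] # ids in the relation when dispatch starts
cursor = dispatch_batch(matching, 0, 2, dispatched) # dispatches 3 and 7; cursor is now 7

# Before the run resumes, row 5 starts matching the scope (below the cursor)
# and row 15 starts matching above it.
matching += [5, 15]
cursor = dispatch_batch(matching, cursor, 10, dispatched)

p dispatched # prints [3, 7, 9, 12, 15]
leftover = matching - dispatched
p leftover   # prints [5]: it entered below the cursor, so it never gets a child

# Non-AR sources are keyed by position, so inserting an item shifts later keys.
keys_before = %w[a b c].each_with_index.to_h { |item, i| ["item-#{i}", item] }
keys_after = %w[a x b c].each_with_index.to_h { |item, i| ["item-#{i}", item] }
p keys_before["item-1"] # prints "b"
p keys_after["item-1"]  # prints "x": same key, different item on replay
```

Row 15 is picked up because it lands above the cursor; row 5 is missed because the resume only looks at ids greater than the last key dispatched, which is exactly why the README asks for a membership that is fixed for the branch's lifetime.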
````diff
@@ -762,11 +1023,11 @@ This gem is available as open source under the terms of the [MIT License](https:
 
 | Method | Purpose | Key Parameters |
 |--------|---------|----------------|
-| `durably_execute` | Execute method with retry logic | `method`, `
+| `durably_execute` | Execute method with retry logic | `method`, `retry_policy: nil`, `name: nil` |
 | `wait` | Time-based pause | `duration`, `name` |
-| `wait_until` | Condition-based waiting | `condition`, `timeout: 1.hour`, `check_interval: 15.minutes`, `
+| `wait_until` | Condition-based waiting | `condition`, `timeout: 1.hour`, `check_interval: 15.minutes`, `retry_policy: nil` |
 | `continue_if` | Manual continuation wait | `condition`, `name: nil` |
-| `durably_repeat` | Periodic task execution | `method`, `every:`, `till:`, `start_at: nil`, `
+| `durably_repeat` | Periodic task execution | `method`, `every:`, `till:`, `start_at: nil`, `retry_policy: nil`, `timeout: 1.hour`, `on_error: :continue` |
 
 ### Context Methods
 
````