chrono_forge 0.9.1 → 0.10.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -0
- data/README.md +305 -44
- data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md +1748 -0
- data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md.tasks.json +17 -0
- data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md +930 -0
- data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md.tasks.json +54 -0
- data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md +241 -0
- data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md.tasks.json +12 -0
- data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md +1378 -0
- data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md.tasks.json +67 -0
- data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md +709 -0
- data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md.tasks.json +19 -0
- data/docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md +226 -0
- data/docs/superpowers/specs/2026-06-25-chrono_forge-dashboard-design.md +190 -0
- data/docs/superpowers/specs/2026-06-25-composite-retry-policies-design.md +228 -0
- data/docs/superpowers/specs/2026-06-25-reserved-kwarg-guard-design.md +169 -0
- data/docs/superpowers/specs/2026-06-25-spawn-merge-branches-design.md +468 -0
- data/docs/superpowers/specs/2026-06-26-dashboard-branch-view-design.md +142 -0
- data/docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md +265 -0
- data/lib/chrono_forge/branch_merge_job.rb +138 -0
- data/lib/chrono_forge/branch_probe.rb +26 -0
- data/lib/chrono_forge/cleanup.rb +6 -0
- data/lib/chrono_forge/execution_log.rb +6 -0
- data/lib/chrono_forge/executor/composite_retry_policy.rb +47 -0
- data/lib/chrono_forge/executor/methods/branch.rb +185 -0
- data/lib/chrono_forge/executor/methods/durably_execute.rb +21 -19
- data/lib/chrono_forge/executor/methods/durably_repeat.rb +118 -25
- data/lib/chrono_forge/executor/methods/merge_branches.rb +83 -0
- data/lib/chrono_forge/executor/methods/wait.rb +2 -4
- data/lib/chrono_forge/executor/methods/wait_until.rb +25 -25
- data/lib/chrono_forge/executor/methods/workflow_states.rb +16 -0
- data/lib/chrono_forge/executor/methods.rb +2 -0
- data/lib/chrono_forge/executor/retry_policy.rb +111 -0
- data/lib/chrono_forge/executor.rb +216 -28
- data/lib/chrono_forge/version.rb +1 -1
- data/lib/chrono_forge/workflow.rb +10 -1
- data/lib/generators/chrono_forge/migration_actions.rb +1 -0
- data/lib/generators/chrono_forge/templates/add_chrono_forge_parent_execution_log.rb +38 -0
- metadata +42 -5
- data/lib/chrono_forge/executor/retry_strategy.rb +0 -29
|
@@ -0,0 +1,169 @@
|
|
|
1
|
+
# Reserved-keyword guard + keywords-only enqueue contract
|
|
2
|
+
|
|
3
|
+
Date: 2026-06-25
|
|
4
|
+
Status: Approved (design)
|
|
5
|
+
|
|
6
|
+
## Problem
|
|
7
|
+
|
|
8
|
+
`ChronoForge::Executor#perform` reserves several keyword parameters for internal
|
|
9
|
+
plumbing:
|
|
10
|
+
|
|
11
|
+
```ruby
|
|
12
|
+
def perform(key, attempt: 0, retry_counts: {}, retry_workflow: false, options: {}, **kwargs)
|
|
13
|
+
```
|
|
14
|
+
|
|
15
|
+
Anything not named here lands in `**kwargs`, is persisted on the `Workflow` row,
|
|
16
|
+
and is replayed to the user's job body via `super(**workflow.kwargs.symbolize_keys)`
|
|
17
|
+
(`executor.rb:100`).
|
|
18
|
+
|
|
19
|
+
Two problems follow from the current public enqueue surface
|
|
20
|
+
(`perform_now`/`perform_later`), which only validate that `key` is a String:
|
|
21
|
+
|
|
22
|
+
1. **Silent collision.** A user calling `MyJob.perform_later("k", attempt: 5)`
|
|
23
|
+
silently hijacks the internal retry counter instead of passing their own
|
|
24
|
+
argument. Same risk for `retry_counts` and `retry_workflow`.
|
|
25
|
+
2. **No positional contract.** Extra positional arguments produce Ruby's generic
|
|
26
|
+
`wrong number of arguments` error rather than a contract-specific message —
|
|
27
|
+
even though the executor only ever replays kwargs (keyword-only) to the job
|
|
28
|
+
body, so positionals beyond `key` can never work.
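The silent collision can be reproduced in plain Ruby without ActiveJob. This is a minimal sketch: `CollisionSketch` and its return value are hypothetical stand-ins, and only the keyword list is copied from the signature quoted above.

```ruby
module CollisionSketch
  # Same keyword list as the executor's perform; the body is a stand-in that
  # just reports where each argument ended up.
  def self.perform(key, attempt: 0, retry_counts: {}, retry_workflow: false, options: {}, **kwargs)
    { key: key, attempt: attempt, user_kwargs: kwargs }
  end
end

result = CollisionSketch.perform("k", attempt: 5, order_id: 42)
result[:attempt]     # 5: the caller's value replaced the internal counter
result[:user_kwargs] # only order_id is left; `attempt` never reaches the job body
```

Because Ruby binds named keywords before collecting the rest into `**kwargs`, nothing in the call itself signals that `attempt:` was captured.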
|
|
29
|
+
|
|
30
|
+
## Decisions (settled with maintainer)
|
|
31
|
+
|
|
32
|
+
- **Public argument surface:** `key` (required, first positional), `options`
|
|
33
|
+
(free-form metadata bag), and user `**kwargs`. Nothing else.
|
|
34
|
+
- **`options` is unstructured.** The framework defines **zero** recognized option
|
|
35
|
+
keys. `options` is written to `workflow.options` (`executor.rb:159`) and only
|
|
36
|
+
ever read back by callers (`workflow.options`); no key in it drives behavior.
|
|
37
|
+
- **Reserved keys (rejected on the public path):** `attempt`, `retry_counts`,
|
|
38
|
+
`retry_workflow`. These are internal threading params; users have no legitimate
|
|
39
|
+
reason to pass them.
|
|
40
|
+
- **`retry_workflow` is internal**, reached only through the `retry_now` /
|
|
41
|
+
`retry_later` helpers — not by passing the flag directly.
|
|
42
|
+
- **Keywords-only:** exactly one positional (`key`); everything else must be a
|
|
43
|
+
keyword, enforced with a clear, contract-specific error.
|
|
44
|
+
- **No job-signature validation.** We do *not* introspect the job's `perform`
|
|
45
|
+
parameters to validate unknown/missing kwargs. Out of scope; mismatches still
|
|
46
|
+
surface at execution time as today.
|
|
47
|
+
|
|
48
|
+
## Mechanism: how internal calls bypass the guard
|
|
49
|
+
|
|
50
|
+
The split between "framework may pass reserved keys" and "users may not" rests on
|
|
51
|
+
an ActiveJob implementation detail, confirmed on ActiveJob 7.1.3.4:
|
|
52
|
+
|
|
53
|
+
> `ActiveJob::ConfiguredJob` (returned by `.set(...)`) defines its own
|
|
54
|
+
> `perform_now` / `perform_later` that build a fresh job instance and call the
|
|
55
|
+
> **instance-level** enqueue path. They do **not** dispatch through the
|
|
56
|
+
> **class-level** `perform_*` override.
|
|
57
|
+
|
|
58
|
+
Therefore any enqueue routed through `.set(...)` bypasses the guard:
|
|
59
|
+
|
|
60
|
+
- All framework continuations already use `.set(wait: …).perform_later(key, …)`
|
|
61
|
+
(`executor.rb:138`, `wait.rb`, `wait_until.rb`, `durably_repeat.rb`,
|
|
62
|
+
`durably_execute.rb`) — their `attempt:`/`retry_counts:`/`wait_condition:`
|
|
63
|
+
ride through untouched.
|
|
64
|
+
- `retry_now` / `retry_later` are rewritten to enqueue via `set.perform_*`,
|
|
65
|
+
legitimately injecting `retry_workflow: true` past the guard.
|
|
66
|
+
|
|
67
|
+
This dependency is non-obvious and now load-bearing, so it is documented inline
|
|
68
|
+
at the guard.
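The bypass can be modeled with a toy class hierarchy. This is only a sketch of the dispatch shape described above, not ActiveJob's code: `FakeJob`, `FakeConfiguredJob`, `GuardedEnqueue`, and `GuardedJob` are all hypothetical names, and the real `ConfiguredJob` does considerably more.

```ruby
# Proxy returned by .set(...): builds an instance and enqueues it directly,
# never calling the class-level perform_later.
class FakeConfiguredJob
  def initialize(job_class)
    @job_class = job_class
  end

  def perform_later(*args, **kwargs)
    @job_class.new(*args, **kwargs).enqueue
  end
end

class FakeJob
  attr_reader :args, :kwargs

  def initialize(*args, **kwargs)
    @args = args
    @kwargs = kwargs
  end

  # Stand-in for the instance-level enqueue path.
  def enqueue
    self
  end

  def self.set(**_options)
    FakeConfiguredJob.new(self)
  end

  def self.perform_later(*args, **kwargs)
    new(*args, **kwargs).enqueue
  end
end

# Class-level guard, prepended onto the singleton class.
module GuardedEnqueue
  RESERVED = %i[attempt retry_counts retry_workflow].freeze

  def perform_later(key, **kwargs)
    reserved = kwargs.keys & RESERVED
    raise ArgumentError, "#{reserved.join(', ')} reserved" if reserved.any?
    super
  end
end

class GuardedJob < FakeJob
  singleton_class.prepend(GuardedEnqueue)
end
```

In this model `GuardedJob.perform_later("k", attempt: 1)` raises, while `GuardedJob.set(wait: 5).perform_later("k", attempt: 1)` goes through the proxy and reaches `enqueue` with `attempt:` intact.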
|
|
69
|
+
|
|
70
|
+
## Design
|
|
71
|
+
|
|
72
|
+
All changes land in `lib/chrono_forge/executor.rb`, in the `class << base` block,
|
|
73
|
+
plus one module-level constant. ~30 lines. No schema or behavior changes
|
|
74
|
+
elsewhere.
|
|
75
|
+
|
|
76
|
+
### 1. Reserved-key constant (module level, near `STEP_NAME_DELIMITER`)
|
|
77
|
+
|
|
78
|
+
```ruby
|
|
79
|
+
# Keyword args ChronoForge threads through job args internally. Users must not
|
|
80
|
+
# pass these to perform_now/perform_later; the framework injects them via
|
|
81
|
+
# `.set(...)` continuations, whose ConfiguredJob proxy bypasses the class-level
|
|
82
|
+
# guard below.
|
|
83
|
+
RESERVED_KWARGS = %i[attempt retry_counts retry_workflow].freeze
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
### 2. Public guards — `perform_now` / `perform_later`
|
|
87
|
+
|
|
88
|
+
```ruby
|
|
89
|
+
def perform_now(key, *extra, **kwargs)
|
|
90
|
+
__validate_enqueue!(key, extra, kwargs)
|
|
91
|
+
super(key, **kwargs)
|
|
92
|
+
end
|
|
93
|
+
|
|
94
|
+
def perform_later(key, *extra, **kwargs)
|
|
95
|
+
__validate_enqueue!(key, extra, kwargs)
|
|
96
|
+
super(key, **kwargs)
|
|
97
|
+
end
|
|
98
|
+
|
|
99
|
+
private
|
|
100
|
+
|
|
101
|
+
def __validate_enqueue!(key, extra, kwargs)
|
|
102
|
+
unless key.is_a?(String)
|
|
103
|
+
raise ArgumentError, "Workflow key must be a string as the first argument"
|
|
104
|
+
end
|
|
105
|
+
unless extra.empty?
|
|
106
|
+
raise ArgumentError, "ChronoForge workflows accept only `key` positionally; " \
|
|
107
|
+
"pass everything else as keywords (got #{extra.size} extra positional arg(s))"
|
|
108
|
+
end
|
|
109
|
+
reserved = kwargs.keys & RESERVED_KWARGS
|
|
110
|
+
if reserved.any?
|
|
111
|
+
raise ArgumentError,
|
|
112
|
+
"#{reserved.join(", ")} #{reserved.one? ? "is a reserved" : "are reserved"} " \
|
|
113
|
+
"ChronoForge keyword(s) and cannot be passed to perform_now/perform_later"
|
|
114
|
+
end
|
|
115
|
+
end
|
|
116
|
+
```
|
|
117
|
+
|
|
118
|
+
`*extra` exists solely to catch stray positionals and produce the clear error;
|
|
119
|
+
after validation it is always empty and discarded (only `super(key, **kwargs)`
|
|
120
|
+
is forwarded).
|
|
121
|
+
|
|
122
|
+
### 3. Retry helpers — route past the guard
|
|
123
|
+
|
|
124
|
+
```ruby
|
|
125
|
+
def retry_now(key, **kwargs)
|
|
126
|
+
__validate_enqueue!(key, [], kwargs)
|
|
127
|
+
set.perform_now(key, retry_workflow: true, **kwargs)
|
|
128
|
+
end
|
|
129
|
+
|
|
130
|
+
def retry_later(key, **kwargs)
|
|
131
|
+
__validate_enqueue!(key, [], kwargs)
|
|
132
|
+
set.perform_later(key, retry_workflow: true, **kwargs)
|
|
133
|
+
end
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
They still validate the *user's* kwargs (rejecting any reserved key the user
|
|
137
|
+
supplied), then inject `retry_workflow: true` through the `ConfiguredJob` bypass.
|
|
138
|
+
|
|
139
|
+
### 4. Framework continuations — unchanged
|
|
140
|
+
|
|
141
|
+
`executor.rb:138`, `wait.rb`, `wait_until.rb`, `durably_repeat.rb`,
|
|
142
|
+
`durably_execute.rb` already enqueue via `.set(...)`. No change required.
|
|
143
|
+
|
|
144
|
+
## Scope / caveats
|
|
145
|
+
|
|
146
|
+
- **Executor-only.** The guard lives in the `Executor`-prepended singleton.
|
|
147
|
+
`ChronoForge::CleanupJob` is a plain `ActiveJob::Base` and is unaffected
|
|
148
|
+
(its `perform_now(older_than_days: …)` / arg-less `perform_later` keep working).
|
|
149
|
+
- **Backward compatible.** A full scan of `lib/` and `test/` found no call site
|
|
150
|
+
passing a second positional and no user call passing a reserved key, so the
|
|
151
|
+
existing suite passes unchanged.
|
|
152
|
+
- **`wait_condition`** (internal kwarg in `wait_until`) is intentionally *not*
|
|
153
|
+
added to `RESERVED_KWARGS`: it only ever travels via `.set(...)` and so never
|
|
154
|
+
reaches the guard. Adding it later is a harmless one-line hygiene change if
|
|
155
|
+
desired.
|
|
156
|
+
|
|
157
|
+
## Testing (TDD)
|
|
158
|
+
|
|
159
|
+
New tests (Executor-prepended job):
|
|
160
|
+
|
|
161
|
+
1. `perform_later` / `perform_now` raise `ArgumentError` when passed `attempt:`,
|
|
162
|
+
`retry_counts:`, or `retry_workflow:` — and the message names the key(s).
|
|
163
|
+
2. `perform_later` / `perform_now` raise `ArgumentError` with the contract
|
|
164
|
+
message when passed a second positional argument.
|
|
165
|
+
3. `perform_later("k", kwarg: "x", options: {a: 1})` still enqueues; `options`
|
|
166
|
+
and user kwargs reach the workflow (`workflow.options`, `workflow.kwargs`).
|
|
167
|
+
4. `retry_now` / `retry_later` still unlock-and-continue a stalled workflow
|
|
168
|
+
(existing behavior preserved), and reject reserved keys passed by the caller.
|
|
169
|
+
5. Non-String `key` still raises (regression guard for existing behavior).
|
|
@@ -0,0 +1,468 @@
|
|
|
1
|
+
# Branches — Concurrent Sub-Workflows (`branch` / `spawn` / `merge_branches`) — Design
|
|
2
|
+
|
|
3
|
+
**Date:** 2026-06-25
|
|
4
|
+
**Status:** Implemented (branch `feat/branches`).
|
|
5
|
+
**Scope:** New public API, additive. Introduces parent/child workflows and a
|
|
6
|
+
fan-out/fan-in primitive built to dispatch **hundreds of thousands** of children per
|
|
7
|
+
branch. One new (generic, reusable) column on `chrono_forge_workflows`; reuses the
|
|
8
|
+
execution-log pattern for coordination. No breaking change to existing single-workflow
|
|
9
|
+
execution. **New dependency floor:** `activejob >= 7.1` (for `perform_all_later`).
|
|
10
|
+
|
|
11
|
+
## Problem
|
|
12
|
+
|
|
13
|
+
ChronoForge workflows are strictly sequential. The only way to fan work out today is to
|
|
14
|
+
hand-enqueue independent workflows and poll for them with `wait_until` — a hand-rolled
|
|
15
|
+
fork/join with no idempotent dispatch and no parent/child visibility.
|
|
16
|
+
|
|
17
|
+
Real workflows need durable, large-scale fan-out: "spawn one sub-workflow per record
|
|
18
|
+
across a 500k-row set, run them in parallel, continue once all are done." It must be
|
|
19
|
+
crash-safe, idempotent under replay, and must not hold the batch in memory or serialize
|
|
20
|
+
on a hot row.
|
|
21
|
+
|
|
22
|
+
## Goal
|
|
23
|
+
|
|
24
|
+
A **branch** is the unit of fan-out — a durable step that wraps the work it spawns and
|
|
25
|
+
ties it together for the join. `spawn`/`spawn_each` exist **only inside a `branch`
|
|
26
|
+
block**. The model mirrors git: you branch, then you merge.
|
|
27
|
+
|
|
28
|
+
- `branch(name, automerge: false, &block)` — opens branch `name` (the durable
|
|
29
|
+
`branch$<name>` log), runs the block to **eagerly dispatch** children, and **seals**
|
|
30
|
+
when the block closes. Returns immediately (does **not** wait) so branches run
|
|
31
|
+
concurrently.
|
|
32
|
+
- `spawn(name, WorkflowClass, **kwargs)` — inside a branch: dispatch a **single** named
|
|
33
|
+
child.
|
|
34
|
+
- `spawn_each(name, source, of:) { |item| [WorkflowClass, kwargs] }` — inside a branch:
|
|
35
|
+
dispatch **one child per item**, streamed like ActiveRecord batch loading; AR items are
|
|
36
|
+
keyed `name_<record.id>` (primary key); plain enumerables are keyed `name_<index>`
|
|
37
|
+
(sequential index).
|
|
38
|
+
- `merge_branches(*names)` — the **separate** join: halt until every named branch is
|
|
39
|
+
sealed **and** all its children have completed.
|
|
40
|
+
|
|
41
|
+
```ruby
|
|
42
|
+
def perform(cycle_id:)
|
|
43
|
+
branch :fulfillment, automerge: true do # the step; seals when the block closes
|
|
44
|
+
spawn :reconcile, ReconcileWorkflow, region: "EU" # single child of :fulfillment
|
|
45
|
+
spawn_each :orders, Order.pending do |order| # bulk, streamed; keys orders_<id>…
|
|
46
|
+
order.priority? ? [PriorityOrderWorkflow, { order_id: order.id }]
|
|
47
|
+
: [OrderWorkflow, { order_id: order.id }]
|
|
48
|
+
end
|
|
49
|
+
end
|
|
50
|
+
|
|
51
|
+
branch :invoicing do # a second, concurrent branch
|
|
52
|
+
spawn_each :invoices, Invoice.unpaid do |inv|
|
|
53
|
+
[InvoiceWorkflow, { invoice_id: inv.id }]
|
|
54
|
+
end
|
|
55
|
+
end
|
|
56
|
+
|
|
57
|
+
do_other_work # both branches already running
|
|
58
|
+
|
|
59
|
+
merge_branches :invoicing # join :invoicing; :fulfillment auto-merges
|
|
60
|
+
durably_execute :finalize
|
|
61
|
+
end
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
## Decisions (locked during brainstorming)
|
|
65
|
+
|
|
66
|
+
| Decision | Choice |
|
|
67
|
+
|---|---|
|
|
68
|
+
| Keywords | **`branch` / `spawn` / `spawn_each` / `merge_branches`** — git branch/merge metaphor; `spawn` avoids shadowing `Kernel#fork`. |
|
|
69
|
+
| Branch = the step | A branch **is** its `branch$<name>` execution log. `spawn`/`spawn_each` are valid **only inside a `branch` block** (raise otherwise) — spawns don't exist without a branch. |
|
|
70
|
+
| Dispatch timing | **Eager.** Spawns insert + enqueue as the block runs; children start at once. The branch **seals** (log → `completed`) when the block closes. |
|
|
71
|
+
| Join | **Separate `merge_branches`** so branches run concurrently and work can happen in between. Joins one or more named branches at once (`merge_branches :a` for one). |
|
|
72
|
+
| `merge_branch` alias | **Ship a singular `merge_branch(name, **opts)` alias** that delegates to `merge_branches` — reads naturally for the common one-branch case (`merge_branch :a`) without a plural-method/singular-arg mismatch. Decided, not just mentioned. |
|
|
73
|
+
| Automerge | A property **of the branch**: `branch(name, automerge: true)`. When `true`, `branch` eagerly dispatches inside the block and then immediately calls `merge_branches(name)` at the block's close — execution does not continue past the block until the branch's children complete. No explicit `merge_branches` is needed. |
|
|
74
|
+
| Branch tracking | An **in-memory registry** (`@open_branches`), rebuilt each replay pass: `branch` adds, `merge_branches` removes on completion, the completion gate inspects the remainder. Deterministic replay makes it exact — no persisted `merged`/`automerge` flags. |
|
|
75
|
+
| Every branch must be joined | **No detached branches.** Any branch remaining in `@open_branches` at completion (neither `merge_branches`-d nor automerged) **raises `UnmergedBranchError`** (fail-fast on a forgotten join), rather than silently letting children run orphaned. `automerge: true` branches are joined inline at the block close and are absent from `@open_branches` by the time the completion gate runs. |
|
|
76
|
+
| Spawn identity | Spawns are **named** (`spawn :reconcile, …`, `spawn_each :orders, …`). The name anchors the child key and the per-`spawn_each` cursor, so both stay stable across code reordering (unlike a positional ordinal). AR items are keyed `name_<record.id>` (primary key); plain enumerable items are keyed `name_<index>` (sequential index). |
|
|
77
|
+
| Bulk source | `spawn_each` **streams** the source — `find_in_batches(batch_size: of, start: cursor)` for AR — never materialising the batch in memory. Scales to millions. |
|
|
78
|
+
| Child class | **Returned from the block** (`[WorkflowClass, kwargs]`); one branch may fan out into mixed workflow types. |
|
|
79
|
+
| Child key | Deterministic: `spawn` → `"#{parent.key}$#{branch}$#{spawn_name}"`; AR `spawn_each` item → `"#{parent.key}$#{branch}$#{spawn_name}_#{record.id}"`; enumerable item → `"#{parent.key}$#{branch}$#{spawn_name}_#{index}"`. Idempotency falls out of the unique-key constraint. |
|
|
80
|
+
| Cursor | Per `spawn_each`, stored in the `branch$<name>` log's `metadata` keyed by **spawn name** as `{ pk: <keyset>, n: <count/index> }`; persisted **once per dispatched chunk** (bundled with that chunk's `insert_all`). |
|
|
81
|
+
| Completion | **Poll**, no counter: a branch is done when sealed and has no incomplete children (`branch_log.spawned_workflows.where.not(state: :completed)` empty — read as an O(CAP) capped count). Zero per-completion contention. |
|
|
82
|
+
| Poll mechanism | A dedicated lightweight `ChronoForge::BranchMergeJob` (plain ActiveJob — no lock/replay/context) does the repeated probing and wakes the parent only at completion. The heavy parent runs just twice per merge (kick off + wake). No separate recovery timer. |
|
|
83
|
+
| Poll cadence | **Adaptive, capped-count.** `pending = incomplete.limit(CAP).count` (**O(CAP)**, never O(N)); next delay `clamp(pending * factor, min_interval, max_interval)` — fast when few remain, slow when many. |
|
|
84
|
+
| Determinism | AR items are keyed by **primary key**, so the stream is stable by construction. `spawn_each` rejects an AR relation carrying an explicit `.order` by checking `order_values.present?` up front (raises `NotExecutableError`). Plain enumerable items are keyed by **sequential index** and must re-enumerate deterministically (documented contract). |
|
|
85
|
+
| Failure semantics | **Option A.** A `stalled`/`failed` child keeps the branch incomplete; the parent stays parked; the user recovers the child (`retry_now`/`retry_later`) and the merge then resolves. No new failure states, no cascade. |
|
|
86
|
+
| Nesting | **Free.** A child is a workflow and may open its own branches; the tree forms via `parent_execution_log_id` (child → branch log → parent workflow). |
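The three child-key shapes from the "Child key" row can be sketched as a single helper. `ChildKeySketch` and `child_key` are hypothetical names used for illustration. Only the `$` delimiter and the `name_<suffix>` shape come from the table.

```ruby
module ChildKeySketch
  DELIMITER = "$"

  # suffix is nil for a single `spawn`, a record id for an ActiveRecord item,
  # or a sequential index for a plain enumerable item.
  def self.child_key(parent_key, branch, spawn_name, suffix = nil)
    leaf = suffix.nil? ? spawn_name.to_s : "#{spawn_name}_#{suffix}"
    [parent_key, branch, leaf].join(DELIMITER)
  end
end

ChildKeySketch.child_key("cycle-7", :fulfillment, :reconcile)    # single spawn
ChildKeySketch.child_key("cycle-7", :fulfillment, :orders, 4021) # AR item, by id
ChildKeySketch.child_key("cycle-7", :fulfillment, :batch, 0)     # enumerable item, by index
```

Since the key is a pure function of parent key, branch, spawn name, and suffix, a replay rebuilds the same key and the unique-key constraint turns the repeated insert into a no-op.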
|
|
87
|
+
|
|
88
|
+
## Public API surface
|
|
89
|
+
|
|
90
|
+
```ruby
|
|
91
|
+
branch(name, automerge: false) do
|
|
92
|
+
spawn(name, WorkflowClass, **kwargs)
|
|
93
|
+
spawn_each(name, source, of: 1000) { |item| [WorkflowClass, kwargs] }
|
|
94
|
+
end
|
|
95
|
+
|
|
96
|
+
merge_branches(*names, min_interval: 5.seconds, max_interval: 5.minutes) # halts until done
|
|
97
|
+
merge_branch(name, **opts) # singular alias for the common one-branch case
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
`spawn`/`spawn_each` raise `NotInBranchError` if called outside a `branch` block. A branch
|
|
101
|
+
opened but neither `merge_branches`-d nor `automerge: true` raises `UnmergedBranchError` at
|
|
102
|
+
workflow completion.
|
|
103
|
+
|
|
104
|
+
## Data model
|
|
105
|
+
|
|
106
|
+
`chrono_forge_workflows` gains **one** nullable column (inline in the install migration;
|
|
107
|
+
a follow-up migration template for existing installs):
|
|
108
|
+
|
|
109
|
+
| Column | Type | Notes |
|
|
110
|
+
|---|---|---|
|
|
111
|
+
| `parent_execution_log_id` | FK → `chrono_forge_execution_logs.id`, nullable, indexed | The execution log that spawned this workflow. For branches it's the `branch$<name>` log. **Deliberately generic** — any future step that spawns sub-workflows reuses it. |
|
|
112
|
+
|
|
113
|
+
The branch a child belongs to *is* its `parent_execution_log_id` (the `branch$<name>`
|
|
114
|
+
log), which is globally unique and encodes both the parent workflow (the log's
|
|
115
|
+
`workflow_id`) and the branch (its `step_name`). No `branch_name`/`parent_workflow_id`
|
|
116
|
+
column is needed.
|
|
117
|
+
|
|
118
|
+
**Merge/automerge state is not persisted.** It's tracked in an in-memory registry rebuilt
|
|
119
|
+
each replay pass from the `branch`/`merge_branches` calls (see Execution flow) — `branch`
|
|
120
|
+
adds, `merge_branches` removes, the completion gate inspects the remainder. Deterministic
|
|
121
|
+
replay makes this exact every pass, so no `merged`/`automerge` columns or metadata flags
|
|
122
|
+
are needed; the branch log holds only dispatch cursors.
|
|
123
|
+
|
|
124
|
+
**Composite index** `(parent_execution_log_id, state)`: makes both the merge's capped count
|
|
125
|
+
and the dropped-job re-kick index-only, so each short-circuits even at millions of rows.
|
|
126
|
+
|
|
127
|
+
No new table. The branch is the **`branch$<name>`** execution log (two-segment, like
|
|
128
|
+
`durably_repeat`'s coordination log — preloaded when sealed, never per child):
|
|
129
|
+
|
|
130
|
+
```
|
|
131
|
+
step_name: "branch$fulfillment"
|
|
132
|
+
state: pending (dispatching) | completed (sealed / block closed)
|
|
133
|
+
metadata: { "cursors" => { "orders" => { "pk" => <keyset>, "n" => <count> } } } # keyed by spawn name
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
The **`merge$<names>`** log coordinates a join (`pending` while polling → `completed`).
|
|
137
|
+
|
|
138
|
+
`Workflow belongs_to :parent_execution_log, class_name: "ExecutionLog", optional: true`;
|
|
139
|
+
`ExecutionLog has_many :spawned_workflows, class_name: "Workflow", foreign_key: :parent_execution_log_id`.
|
|
140
|
+
A branch's children are `branch_log.spawned_workflows`; the parent is `branch_log.workflow`.
|
|
141
|
+
|
|
142
|
+
### One bounded log per branch — preload safety
|
|
143
|
+
|
|
144
|
+
The preload (`completed_step_cache`) bulk-loads all `completed` logs except
|
|
145
|
+
`durably_repeat$%$%` (the unbounded three-segment repetition logs). A two-segment
|
|
146
|
+
`branch$<name>` is preloaded when sealed, so a replay past the branch short-circuits.
|
|
147
|
+
**Per-child state is never modelled as sub-segment logs** (`branch$<name>$<child>`):
|
|
148
|
+
those would be pulled into the preload and load millions of rows on every replay.
|
|
149
|
+
Per-child state lives on the child workflow rows; the branch log holds only cursors.
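To see why segment count matters for the preload, the sketch below turns the `durably_repeat$%$%` exclusion into a Ruby regexp and checks a few step names. It assumes the exclusion is applied as a SQL `LIKE` pattern, which the text implies but does not state. `PreloadFilterSketch`, `like_to_regexp`, and the sample step names are hypothetical.

```ruby
module PreloadFilterSketch
  EXCLUDED_LIKE = 'durably_repeat$%$%'

  # Translate a SQL LIKE pattern: % matches any run, _ matches one character.
  def self.like_to_regexp(pattern)
    body = pattern.each_char.map do |ch|
      case ch
      when "%" then ".*"
      when "_" then "."
      else Regexp.escape(ch)
      end
    end.join
    Regexp.new("\\A#{body}\\z", Regexp::MULTILINE)
  end

  # A completed log is preloaded unless its step name matches the exclusion.
  def self.preloaded?(step_name)
    !like_to_regexp(EXCLUDED_LIKE).match?(step_name)
  end
end

PreloadFilterSketch.preloaded?('branch$fulfillment')          # two segments: preloaded
PreloadFilterSketch.preloaded?('durably_repeat$tick$3')       # three-segment repetition log: excluded
PreloadFilterSketch.preloaded?('branch$fulfillment$orders_1') # would be preloaded, hence never modelled
```

The last call shows the hazard the paragraph describes: a per-child `branch$<name>$<child>` log would not match the exclusion and would be loaded on every replay.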
|
|
150
|
+
|
|
151
|
+
## Execution flow
|
|
152
|
+
|
|
153
|
+
### `branch(name, automerge:) { … }` — wrap + eager dispatch + seal
|
|
154
|
+
|
|
155
|
+
Gated by `find_or_create_execution_log!("branch$#{name}")` (the branch log holds only
|
|
156
|
+
dispatch cursors; `automerge`/merge state is in-memory, not seeded here):
|
|
157
|
+
|
|
158
|
+
1. **Sealed** (`completed`, served from `completed_step_cache`) → already fully
|
|
159
|
+
dispatched; **skip the block entirely** (no re-stream) and return. *(This short-circuit
|
|
160
|
+
is the single most important correctness/performance property in the design — the
|
|
161
|
+
expensive source enumeration never re-runs after sealing. It warrants a prominent comment
|
|
162
|
+
directly above the skip path in the implementation.)*
|
|
163
|
+
2. **Pending / new** → set the current-branch context, **yield the block** (named spawns
|
|
164
|
+
dispatch into this branch, advancing their cursors — see below), clear the context,
|
|
165
|
+
mark the `branch$<name>` log `completed` (**sealed**). If `automerge: true`, immediately
|
|
166
|
+
call `merge_branches(name)` — **execution does not continue past the block** until the
|
|
167
|
+
branch's children complete (identical to an explicit `merge_branches` call placed right
|
|
168
|
+
after the block, but guaranteed by the method). Otherwise **return** without halting —
|
|
169
|
+
branches are concurrent; the explicit join is separate.
|
|
170
|
+
|
|
171
|
+
Either way, `branch` **registers the branch in the in-memory registry**
|
|
172
|
+
`@open_branches[name] = { automerge:, log_id: }` — this runs on *every* pass (sealed or
|
|
173
|
+
not), since the `branch` method itself always executes even when its block is skipped.
|
|
174
|
+
|
|
175
|
+
`spawn`/`spawn_each` read the current branch from that context and raise
|
|
176
|
+
`NotInBranchError` if there is none.
|
|
177
|
+
|
|
178
|
+
### `spawn` / `spawn_each` — dispatch within a branch
|
|
179
|
+
|
|
180
|
+
- **`spawn(name, klass, **kwargs)`** → one child, key `"#{parent.key}$#{branch}$#{name}"`,
|
|
181
|
+
`job_class: klass.name`, `parent_execution_log_id: branch_log.id`. Idempotent on the key.
|
|
182
|
+
- **`spawn_each(name, source, of:)`** → stream, resuming from `metadata.cursors[name]`
|
|
183
|
+
(`{ pk:, n: }`); `n` is a running count (AR) or the resume index (enumerable):
|
|
184
|
+
- **AR relation:** rejects `source` if `source.order_values.present?` (raises
|
|
185
|
+
`NotExecutableError` — iteration is by PK and an explicit order conflicts). Resumes via
|
|
186
|
+
`source.find_in_batches(batch_size: of, start: cursor.pk)`. Per batch, for each record: `klass, kw =
|
|
187
|
+
yield(record)`; build child rows (key
|
|
188
|
+
`"#{parent.key}$#{branch}$#{name}_#{record.id}"`, `job_class`, `kwargs`,
|
|
189
|
+
`parent_execution_log_id: branch_log.id`, `state: :idle`); `insert_all(…, unique_by:
|
|
190
|
+
:key)` (on-conflict-ignore); enqueue only those children still `:idle` (dispatch is
|
|
191
|
+
**queue-idempotent** — a crash-resume never re-runs an already-completed/running child);
|
|
192
|
+
advance `metadata.cursors[name]` to `{ pk: batch.last.id, n: n + batch.size }`
|
|
193
|
+
(committed with the inserts).
|
|
194
|
+
- **Enumerable:** resume via `drop(n)`; child key uses `name_#{n}` (sequential index);
|
|
195
|
+
same per-chunk insert/idle-filter/enqueue/advance (`n` only).
|
|
196
|
+
- Enqueue the chunk, **then** advance the cursor — a crash in between re-enqueues only
|
|
197
|
+
that one chunk on resume (idempotent).
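The enumerable path (resume with `drop(n)`, insert ignoring conflicts, enqueue only idle children, then advance the cursor) can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding: `rows` plays the workflows table, `queue` the job backend, `cursor` the `n` stored under `metadata.cursors[name]`, and `crash_after` exists only to simulate a crash between enqueue and cursor advance.

```ruby
class SpawnEachSketch
  attr_reader :rows, :queue, :cursor

  def initialize
    @rows = {}   # child key => state (unique on key)
    @queue = []  # enqueued child keys
    @cursor = 0  # number of items whose chunk was fully dispatched
  end

  def dispatch(prefix, source, of:, crash_after: nil)
    chunks_done = 0
    # Array#drop materialises the remainder; a real stream would stay lazy.
    source.drop(@cursor).each_slice(of) do |chunk|
      keys = chunk.each_index.map { |i| "#{prefix}_#{@cursor + i}" }
      keys.each { |k| @rows[k] ||= :idle }                       # insert, ignore conflicts
      keys.select { |k| @rows[k] == :idle }.each { |k| @queue << k } # enqueue idle only
      chunks_done += 1
      raise "simulated crash" if crash_after == chunks_done
      @cursor += chunk.size                                      # advance after enqueue
    end
  end

  def complete(key)
    @rows[key] = :completed
  end
end

sketch = SpawnEachSketch.new
begin
  sketch.dispatch("orders", %w[a b c d e], of: 2, crash_after: 2)
rescue RuntimeError
  # crashed after enqueueing the second chunk, before advancing the cursor
end
sketch.complete("orders_2")                       # one child finished meanwhile
sketch.dispatch("orders", %w[a b c d e], of: 2)   # resume from the stored cursor
```

On resume only the second chunk is revisited. The completed child is skipped, the still-idle one is enqueued again, and no row is duplicated.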
|
|
198
|
+
|
|
199
|
+
### `merge_branches(*names)` — separate poll-join
|
|
200
|
+
|
|
201
|
+
Each name is validated up front: `$` is rejected via `validate_step_name_segment!`, and `,`
|
|
202
|
+
(the merge step-name separator) is also rejected — both raise `InvalidStepName`.
|
|
203
|
+
|
|
204
|
+
Gated by `find_or_create_execution_log!("merge$#{names.sort.join(',')}")`:
|
|
205
|
+
|
|
206
|
+
1. **Completed** → return, continue.
|
|
207
|
+
2. For each `name`: require it to be in `@open_branches` (opened earlier this pass) — a name
|
|
208
|
+
that was never opened **raises `UnknownBranchError`** (a `NotExecutableError` subclass,
|
|
209
|
+
so it fail-fasts via the existing rescue without broadening the executor); a not-yet-sealed
|
|
210
|
+
branch means "still dispatching".
|
|
211
|
+
3. **Capped-count probe** per branch:
|
|
212
|
+
`branch_log.spawned_workflows.where.not(state: :completed).limit(CAP).count`
|
|
213
|
+
(`where(parent_execution_log_id: branch_log.id, …)`, index-only, **O(CAP) not O(N)**).
|
|
214
|
+
All `0` → done. Otherwise enqueue a `BranchMergeJob` (which polls + re-kicks dropped
|
|
215
|
+
jobs) and `halt_execution!`.
|
|
216
|
+
4. All branches `0` pending → **delete those names from `@open_branches`** (so the
|
|
217
|
+
completion gate sees them as joined), mark the `merge$…` log `completed`, continue.
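The gate's step name and the up-front name validation can be sketched as follows. `MergeStepNameSketch` and its local `InvalidStepName` class are hypothetical stand-ins. Only the `merge$` prefix, the sorted comma-joined names, and the two rejected characters come from the text above.

```ruby
module MergeStepNameSketch
  class InvalidStepName < StandardError; end

  # "$" delimits step-name segments and "," separates branch names inside a
  # merge step name, so neither may appear in a branch name.
  def self.merge_step_name(*names)
    names = names.map(&:to_s)
    names.each do |name|
      if name.include?("$") || name.include?(",")
        raise InvalidStepName, "branch name #{name.inspect} may not contain '$' or ','"
      end
    end
    "merge$#{names.sort.join(',')}"
  end
end

MergeStepNameSketch.merge_step_name(:invoicing, :fulfillment)
MergeStepNameSketch.merge_step_name(:fulfillment, :invoicing) # same log either way
```

Sorting before joining means the argument order of `merge_branches` does not change which coordination log is found or created.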
|
|
218
|
+
|
|
219
|
+
Completion is **poll-based**, delegated to a dedicated lightweight job so the heavy
|
|
220
|
+
parent isn't replayed per check:
|
|
221
|
+
|
|
222
|
+
- `merge_branches` does **one** immediate check; if not done, enqueues
|
|
223
|
+
`ChronoForge::BranchMergeJob` and `halt_execution!`s (parent → `idle`, lock released).
|
|
224
|
+
The parent runs only **twice** per merge: kick off + completion wake.
|
|
225
|
+
- **`BranchMergeJob`** is a plain ActiveJob — *no* lock, replay, or context. Each run:
|
|
226
|
+
```ruby
|
|
227
|
+
pending = branch_log_ids.sum { |id| incomplete(id).limit(CAP).count } # O(CAP), index-only
|
|
228
|
+
if pending.zero? && all_sealed?(branch_log_ids)
|
|
229
|
+
ParentWorkflow.perform_later(parent_key) # wake the parent once
|
|
230
|
+
else
|
|
231
|
+
rekick_dropped_jobs(branch_log_ids) # idle re-kick lives here
|
|
232
|
+
delay = [[pending * factor, min_interval].max, max_interval].min # adaptive cadence
|
|
233
|
+
self.class.set(wait: delay).perform_later(parent_key, branch_log_ids, min_interval, max_interval)
|
|
234
|
+
end
|
|
235
|
+
```
|
|
236
|
+
- On the wake, the parent replays once; sealed branches short-circuit (no re-stream),
|
|
237
|
+
`merge_branches` re-checks, marks the `merge$<names>` log `completed`, and continues.
|
|
238
|
+
**The parent completes its own merge step** — the poller only *detects* and wakes.
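The capped count and the adaptive delay can be tried in isolation. This is a sketch in plain Ruby over an array of states rather than a database query. The `CAP` and `FACTOR` values are assumptions chosen for illustration, and `PollCadenceSketch` is a hypothetical name.

```ruby
module PollCadenceSketch
  CAP = 1_000   # assumed cap on the pending count
  FACTOR = 0.5  # assumed seconds of delay per pending child

  # Counts at most `cap` incomplete children and stops scanning once the cap
  # is reached, mirroring `incomplete.limit(CAP).count`.
  def self.capped_pending(states, cap: CAP)
    states.lazy.reject { |s| s == :completed }.first(cap).size
  end

  # Same result as [[pending * factor, min].max, max].min.
  def self.next_delay(pending, min_interval:, max_interval:, factor: FACTOR)
    (pending * factor).clamp(min_interval, max_interval)
  end
end

pending = PollCadenceSketch.capped_pending([:completed, :idle, :running, :idle], cap: 2)
PollCadenceSketch.next_delay(pending, min_interval: 5, max_interval: 300)
```

With few children pending the delay sits at `min_interval`. It grows linearly with the capped count until it reaches `max_interval`.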
|
|
239
|
+
|
|
240
|
+
`merge_branches` **(re)spawns a poller whenever reached while still pending**, so a manual
|
|
241
|
+
retry of a parked parent self-heals a lost poller (including when the poller was spawned by
|
|
242
|
+
an automerge inline call); a rare double-poller from an external re-trigger is harmless (the
|
|
243
|
+
wake is idempotent).
|
|
244
|
+
|
|
245
|
+
**No separate recovery poll.** The poller is a durable backend-scheduled job — the same
|
|
246
|
+
durability `wait_until`'s reschedule already relies on. A lost poller just parks the
|
|
247
|
+
parent with a pending `merge$…` log, recoverable by retry (Option A). Cost: one tiny job
|
|
248
|
+
per (adaptive) interval plus an **O(CAP)** index-only capped count per branch — no
|
|
249
|
+
counter, no per-child shared write, no hot-row contention at any scale; latency falls
|
|
250
|
+
toward `min_interval` as the branch nears done. Option A falls out — a failed child keeps
|
|
251
|
+
pending > 0, so the parent waits until it is recovered.
|
|
252
|
+
|
|
253
|
+
### Completion gate — every branch must be joined
|
|
254
|
+
|
|
255
|
+
Every branch must be joined — explicitly via `merge_branches` or implicitly via
|
|
256
|
+
`automerge: true`. **There is no detached branch.** `complete_workflow!` (`enforce_branch_joins!`)
|
|
257
|
+
gains a gate that runs **before** it seals the workflow and inspects `@open_branches`, the in-memory
|
|
258
|
+
registry that `branch` populated and `merge_branches` pruned during this pass (rebuilt
|
|
259
|
+
deterministically every replay, so it's exact). The gate does **only** the unmerged-raise
|
|
260
|
+
check:
|
|
261
|
+
|
|
262
|
+
1. **Unmerged check:** any branch remaining in `@open_branches` at completion is a forgotten
|
|
263
|
+
join → **raise `UnmergedBranchError`** naming the branch(es), with the hint *"add
|
|
264
|
+
`merge_branches :x` or `branch(:x, automerge: true)`."* This fails the workflow fast
|
|
265
|
+
rather than letting children run orphaned; the developer fixes the code and retries. The
|
|
266
|
+
check is unconditional (fires even if the branch's children happen to have finished) so
|
|
267
|
+
the contract is deterministic, not timing-dependent.
|
|
268
|
+
|
|
269
|
+
(A branch joined via `merge_branches` was already deleted from `@open_branches` when that
|
|
270
|
+
merge completed, so it's absent here. An `automerge: true` branch is also absent — its join
|
|
271
|
+
ran inline at the `branch` block's close, removing it from `@open_branches` before
|
|
272
|
+
execution ever continued past the block.)
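
The gate described above can be sketched in plain Ruby. This is a minimal illustration, not the gem's executor: `GateSketch` is a hypothetical stand-in, and only `@open_branches`, `branch`, `merge_branches`, `complete_workflow!` and `UnmergedBranchError` are names taken from this spec.

```ruby
# Hypothetical stand-in for the executor; only the gate logic is sketched.
class UnmergedBranchError < StandardError; end

class GateSketch
  def initialize
    @open_branches = {} # in the real design this is rebuilt on every replay
  end

  # Registers the branch; an automerge branch joins inline at the block's close.
  def branch(name, automerge: false)
    @open_branches[name] = true
    yield if block_given?
    merge_branches(name) if automerge
  end

  # Joining prunes the branch from the registry.
  def merge_branches(*names)
    names.each { |n| @open_branches.delete(n) }
  end

  # The gate: any branch still registered at completion is a forgotten join.
  def complete_workflow!
    return :completed if @open_branches.empty?

    names = @open_branches.keys.map(&:inspect).join(", ")
    raise UnmergedBranchError,
      "unmerged branch(es): #{names}; add `merge_branches :x` or `branch(:x, automerge: true)`"
  end
end
```

The check only looks at the registry, never at child state, which is what makes it unconditional rather than timing-dependent.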
|
|
273
|
+
|
|
274
|
+
## Determinism
|
|
275
|
+
|
|
276
|
+
The cursor is only meaningful if iteration is reproducible across replays:
|
|
277
|
+
|
|
278
|
+
- **AR relation:** children are keyed by **primary key** (`name_<record.id>`), so the
|
|
279
|
+
mapping from record to child key is stable regardless of enumeration order. Iteration is
|
|
280
|
+
driven by **primary-key keyset** (`find_in_batches(start:)`). `spawn_each` rejects a relation
|
|
281
|
+
carrying an explicit `.order(...)` by checking `order_values.present?` up front (raises
|
|
282
|
+
`NotExecutableError`) — relying on `find_in_batches`'s `error_on_ignore` is not
|
|
283
|
+
sufficient because `find_in_batches(start:)` is inclusive and a crash-resume re-yields
|
|
284
|
+
the boundary record; the explicit up-front check catches order conflicts before any
|
|
285
|
+
inserts occur.
|
|
286
|
+
- **Enumerable:** items are keyed `name_{index}` by their **sequential position** in the
|
|
287
|
+
stream, so the source must re-enumerate identically across replays (effectively frozen
|
|
288
|
+
for the brief dispatch window — once the branch seals, replay skips the block, so no
|
|
289
|
+
re-enumeration happens thereafter). Deterministic re-enumeration is a documented,
|
|
290
|
+
unverifiable contract — misuse is still *safe* (`insert_all`-ignore + poll) but could
|
|
291
|
+
dispatch the wrong set.
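
To make the difference between the two keying schemes concrete, here is a small sketch. The helper names and the `Record` struct are hypothetical; only the key shapes (`name_<id>` and `name_{index}`) come from the text above.

```ruby
# Hypothetical helpers illustrating the two key shapes described above.
Record = Struct.new(:id)

# AR relation: the key depends only on the record's primary key.
def ar_child_keys(name, records)
  records.map { |r| "#{name}_#{r.id}" }
end

# Enumerable: the key depends on the item's position in the stream.
def enumerable_child_keys(name, items)
  items.each_with_index.map { |_item, i| "#{name}_#{i}" }
end

first  = [Record.new(7), Record.new(3), Record.new(9)]
replay = first.reverse
# Same key set regardless of enumeration order.
same_keys = ar_child_keys("orders", first).sort == ar_child_keys("orders", replay).sort

# Index keys map to different items if the source changes between replays.
mapping = ->(list) { enumerable_child_keys("item", list).zip(list).to_h }
before = mapping.call(%w[a b c])["item_1"]
after  = mapping.call(%w[b c])["item_1"]
```

Here `before` and `after` name different items under the same key, which is the "wrong set" hazard the enumerable contract guards against.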
|
|
292
|
+
|
|
293
|
+
## Idempotency & crash recovery
|
|
294
|
+
|
|
295
|
+
Three layers — **what exists** (DB), **how far dispatch got** (cursor), **step state**
|
|
296
|
+
(the log).
|
|
297
|
+
|
|
298
|
+
- **`find_or_create_execution_log!`** — a sealed `branch$<name>` skips the whole block;
|
|
299
|
+
a completed `merge$…` short-circuits the join.
|
|
300
|
+
- **Deterministic keys + per-spawn cursors = "which children exist."** The branch never
|
|
301
|
+
tracks children individually. Existence is owned by the unique index (`insert_all`-ignore
|
|
302
|
+
is a no-op for rows that exist); dispatch progress is owned by `metadata.cursors[name]`.
|
|
303
|
+
Recovery resumes from the cursor and re-touches **one chunk**, not the whole set.
|
|
304
|
+
*(Dispatch is not bound to the create block: a crash after the log is created must still
|
|
305
|
+
create and enqueue the remaining rows, or the branch would stall forever.)*
|
|
306
|
+
- **`:idle` filter for dropped jobs.** A child dispatched but never run is re-kicked from
|
|
307
|
+
the `BranchMergeJob` poll: re-enqueue branch children with `state: :idle` in an
|
|
308
|
+
incomplete branch. Branch children are pre-inserted with `state: :idle` and never get
|
|
309
|
+
`started_at` set before execution, so `:idle` is the correct "never picked up" signal
|
|
310
|
+
(filtering by `started_at IS NULL` would be unreliable). *Child existence is not enough;
|
|
311
|
+
the merge guarantees every member is actually queued.* (Copy this sentence verbatim into the code comment.)
|
|
312
|
+
The re-kick is batch-capped. Safe because re-enqueue is idempotent: `executable?` is
|
|
313
|
+
`idle || running`, so `acquire_lock` raises `NotExecutableError` for a `completed` child
|
|
314
|
+
(its dispatch can never double-fire) and `ConcurrentExecutionError` for a `running` one.
|
|
315
|
+
Children in other states (running, mid-halt, stalled/failed under Option A) are excluded
|
|
316
|
+
by the `:idle` filter.
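
The re-kick safety argument can be sketched as follows. This is an illustration under assumptions: `Child`, its `locked` flag, and `rekick_candidates` are hypothetical, and the real `acquire_lock` works against the database row rather than an in-memory struct.

```ruby
class NotExecutableError < StandardError; end
class ConcurrentExecutionError < StandardError; end

# Hypothetical in-memory child; `locked` stands in for the row-level lock.
Child = Struct.new(:key, :state, :locked) do
  def executable?
    state == :idle || state == :running
  end
end

# A duplicate enqueue ends here: completed children are rejected,
# and a child that is already running under a held lock is refused.
def acquire_lock(child)
  raise NotExecutableError, child.key unless child.executable?
  raise ConcurrentExecutionError, child.key if child.locked

  child.locked = true
  child.state = :running
  child
end

# The poll only re-enqueues children that were never picked up, batch-capped.
def rekick_candidates(children, limit:)
  children.select { |c| c.state == :idle }.first(limit)
end
```

Because both failure paths raise before any work is done, re-enqueueing an `:idle` child twice costs at most one rejected job.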
|
|
317
|
+
|
|
318
|
+
### Recovery walkthrough — 300,000 children, crash at 250,000
|
|
319
|
+
|
|
320
|
+
A `branch :fulfillment` block's `spawn_each :orders` had committed 250k child rows + jobs
|
|
321
|
+
with `metadata.cursors["orders"]` at `{ pk: <250,000th PK>, n: 250_000 }`; it was mid-chunk
|
|
322
|
+
when the process died. The `branch$fulfillment` log is still `pending` (not sealed), so workflow retry replays
|
|
323
|
+
from the top. The block re-runs; `spawn_each` resumes `find_in_batches(start: cursor)` from PK
|
|
324
|
+
250,000, walking the remaining ~50k rows; the worst-case duplicate enqueue is the single in-flight
|
|
325
|
+
chunk. The 250k already dispatched keep running the whole time. When the source is
|
|
326
|
+
exhausted the block closes and the branch seals; `merge_branches`/automerge then polls to
|
|
327
|
+
completion. Recovery is bounded and idempotent — never a re-fan-out of 300k.
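
The walkthrough can be reproduced at small scale with an in-memory simulation. Everything here is a hypothetical stand-in: a `Set` plays the unique index behind `insert_all`-ignore, an array plays the job queue, and `cursor` plays `metadata.cursors[name]`; the inclusive resume mimics `find_in_batches(start:)`.

```ruby
require "set"

class DispatchSketch
  attr_reader :rows, :enqueued, :cursor

  def initialize(source_ids, of:)
    @source = source_ids.sort
    @of = of
    @rows = Set.new   # unique index: adding an existing id is a no-op
    @enqueued = []    # every enqueue, including duplicates
    @cursor = nil     # last primary key of the last fully committed chunk
  end

  # Streams chunks from the cursor. `crash_in_chunk` simulates the process dying
  # after that chunk's rows and jobs are written but before the cursor advances.
  def run(crash_in_chunk: nil)
    # Inclusive start: the boundary record is re-yielded on resume.
    remaining = @cursor ? @source.select { |id| id >= @cursor } : @source
    remaining.each_slice(@of).with_index do |chunk, i|
      chunk.each { |id| @rows.add(id) }
      @enqueued.concat(chunk)
      raise "simulated crash" if crash_in_chunk == i

      @cursor = chunk.last
    end
    :sealed
  end
end
```

With ten source ids and `of: 3`, a crash in the second chunk followed by a retry leaves ten rows and re-enqueues only the in-flight chunk plus the boundary record, never the whole set.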
|
|
328
|
+
|
|
329
|
+
## Scale & performance (target: hundreds of thousands per branch)
|
|
330
|
+
|
|
331
|
+
| Operation | Frequency | Cost |
|
|
332
|
+
|---|---|---|
|
|
333
|
+
| `spawn_each` dispatch | once per branch, **streamed** | `⌈N/of⌉` `insert_all` + `perform_all_later`, each advancing the cursor — O(N) total, bounded chunks, **constant memory** |
|
|
334
|
+
| Child run | per child | one own-row state transition — **no shared-row contention** |
|
|
335
|
+
| Merge poll | per adaptive interval | lightweight `BranchMergeJob` running an **O(CAP)** capped count per branch; interval scales `min`↔`max` with pending — the heavy parent is *not* replayed per poll |
|
|
336
|
+
| Crash recovery | once | resumes dispatch from the cursor — re-touches **one chunk** |
|
|
337
|
+
| Replay cost | per resume | independent of how many children finished — no member list, no sibling scan, no counter |
|
|
338
|
+
|
|
339
|
+
What deliberately does **not** exist: an in-memory fork registry (streamed instead), a
|
|
340
|
+
single hot completion counter (adaptive capped-count poll instead), and a member-key blob
|
|
341
|
+
in metadata (just per-spawn cursors). The remaining O(N) work — N rows + N jobs — is
|
|
342
|
+
irreducible for N sub-workflows, done in bounded chunks by one parent job.
|
|
343
|
+
|
|
344
|
+
### `perform_all_later` — verified (activejob 7.1.3.4)
|
|
345
|
+
|
|
346
|
+
- **Mixed job classes: supported, no same-class requirement.** `perform_all_later` groups
|
|
347
|
+
by `queue_adapter` (`enqueuing.rb:18`); the adapter's `enqueue_all` then sub-groups by
|
|
348
|
+
**class then queue** (e.g. core Sidekiq adapter does `group_by(&:class).group_by(&:queue_name)`
|
|
349
|
+
→ one `push_bulk` per group, `sidekiq_adapter.rb:36`). So a `spawn_each` returning mixed
|
|
350
|
+
workflow types is fine — enqueue just batches per distinct (class, queue), never falling
|
|
351
|
+
back to per-job for being heterogeneous.
|
|
352
|
+
- **It bypasses ChronoForge's class-level `perform_later` override** (`__validate_enqueue!`)
|
|
353
|
+
and **ActiveJob enqueue callbacks** — it builds instances and hits the adapter directly.
|
|
354
|
+
So `spawn`/`spawn_each` must **build child job instances and validate them
|
|
355
|
+
(String key, no reserved kwargs) themselves**, then `ActiveJob.perform_all_later(jobs)`.
|
|
356
|
+
Execution is unaffected (the executor's logic is in instance `perform`). This mirrors the
|
|
357
|
+
existing sanctioned `.set(...)`-bypasses-the-override pattern.
|
|
358
|
+
- **Requires activejob ≥ 7.1.** The gemspec currently pins no version — add
|
|
359
|
+
`spec.add_dependency "activejob", ">= 7.1"` (or provide a `perform_later`-loop fallback).
|
|
360
|
+
- **Bulk enqueue is adapter-dependent.** Only adapters implementing `enqueue_all`
|
|
361
|
+
(Sidekiq in core; solid_queue/good_job ship their own) batch the enqueue; Test/Inline/
|
|
362
|
+
Async fall back to per-job `enqueue`. The `insert_all` of child rows is **always** bulk;
|
|
363
|
+
job enqueue batching is best-effort.
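
The per-(class, queue) batching described in the first bullet can be illustrated without ActiveJob. This sketch only mirrors the grouping shape; `Job` and `bulk_groups` are hypothetical names, and a real adapter would issue one bulk push per resulting group.

```ruby
# Hypothetical job value; real jobs are ActiveJob instances.
Job = Struct.new(:job_class, :queue_name, :key)

# Group by class first, then by queue, yielding one batch per distinct pair.
def bulk_groups(jobs)
  jobs.group_by(&:job_class).flat_map do |job_class, same_class|
    same_class.group_by(&:queue_name).map do |queue, group|
      { job_class: job_class, queue: queue, keys: group.map(&:key) }
    end
  end
end

jobs = [
  Job.new("OrderWorkflow", "default", "wf$b$orders_1"),
  Job.new("RefundWorkflow", "default", "wf$b$orders_2"),
  Job.new("OrderWorkflow", "default", "wf$b$orders_3"),
  Job.new("OrderWorkflow", "low", "wf$b$orders_4")
]
groups = bulk_groups(jobs)
```

Four heterogeneous jobs collapse into three batches here, so mixing workflow classes in one `spawn_each` costs extra groups, not a per-job fallback.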
|
|
364
|
+
|
|
365
|
+
Other caveats to verify in the plan:
|
|
366
|
+
- `find_in_batches` `start:` semantics (inclusive boundary — the boundary record is re-yielded on
|
|
367
|
+
crash-resume; PK-keyed children dedup via `insert_all`-ignore) and the per-adapter
|
|
368
|
+
bind-param limit for the `of:` chunk size (notably SQLite's `SQLITE_MAX_VARIABLE_NUMBER`).
|
|
369
|
+
- The merge capped count must be index-only on `(parent_execution_log_id, state)`.
|
|
370
|
+
|
|
371
|
+
## Poll-cadence constants (class-configurable defaults)
|
|
372
|
+
|
|
373
|
+
- `CAP` (capped-count limit) — default `5_000`. Bounds each poll's count cost; beyond it,
|
|
374
|
+
the pending count saturates and the delay clamps to `max_interval` (no signal lost).
|
|
375
|
+
- `factor` — maps pending → delay; default tuned so ~100 → ~10s, ~1k → ~1 min.
|
|
376
|
+
- `min_interval` / `max_interval` — clamp; defaults `5.seconds` / `5.minutes`
|
|
377
|
+
(per-`merge_branches` overridable; `automerge` uses the defaults).
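
A minimal sketch of how these constants could combine, assuming a linear pending-to-delay mapping. The `FACTOR` value of 0.1 seconds per pending child is an assumption chosen to match the "~100 → ~10s" data point; the actual default and curve may differ, and in the real design the cap is applied by the database query rather than in Ruby.

```ruby
CAP = 5_000          # capped-count limit (default from this spec)
MIN_INTERVAL = 5     # seconds (5.seconds in the spec)
MAX_INTERVAL = 300   # seconds (5.minutes in the spec)
FACTOR = 0.1         # ASSUMPTION: seconds of delay per pending child

# Stand-in for the capped count: never reports more than `cap`.
def capped_pending(incomplete_count, cap: CAP)
  [incomplete_count, cap].min
end

# Linear mapping, clamped to the configured interval bounds.
def poll_delay(pending, factor: FACTOR, min: MIN_INTERVAL, max: MAX_INTERVAL)
  (pending * factor).clamp(min, max)
end

far_from_done = poll_delay(capped_pending(250_000))
nearly_done   = poll_delay(capped_pending(10))
```

Under these assumptions a branch with far more than `CAP` incomplete children polls at `MAX_INTERVAL`, and one with a handful left polls at `MIN_INTERVAL`.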
|
|
378
|
+
|
|
379
|
+
## Naming & validation
|
|
380
|
+
|
|
381
|
+
- `STEP_NAME_DELIMITER` is `$` (executor.rb). Reserved.
|
|
382
|
+
- The `branch` name, each `spawn`/`spawn_each` name, and each `merge_branches` name pass
|
|
383
|
+
through `validate_step_name_segment!` (no `$`). The merge step name joins sorted branch
|
|
384
|
+
names with `,`; names containing `,` are rejected. (`name_{index}` uses `_`, which is
|
|
385
|
+
unreserved.)
|
|
386
|
+
- Child keys use `$` (`"#{parent.key}$#{branch}$#{spawn_name}"` / `…$#{spawn_name}_#{index}"`).
|
|
387
|
+
Keys are opaque (never parsed), so a `$` already in the parent key is harmless.
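
The naming rules above can be sketched as three small helpers. The helper signatures, the use of `ArgumentError`, and placing the `,` check in the merge-name helper are assumptions for illustration; only `STEP_NAME_DELIMITER`, `validate_step_name_segment!`, and the key and step-name shapes come from this spec.

```ruby
STEP_NAME_DELIMITER = "$"

# Rejects any name segment containing the reserved delimiter.
def validate_step_name_segment!(name)
  segment = name.to_s
  if segment.include?(STEP_NAME_DELIMITER)
    raise ArgumentError, "step name segment #{segment.inspect} must not contain #{STEP_NAME_DELIMITER}"
  end
  segment
end

# Merge step name: sorted branch names joined with ",".
def merge_step_name(*branch_names)
  segments = branch_names.map { |n| validate_step_name_segment!(n) }
  bad = segments.find { |s| s.include?(",") }
  raise ArgumentError, "branch name #{bad.inspect} must not contain ','" if bad

  "merge#{STEP_NAME_DELIMITER}#{segments.sort.join(',')}"
end

# Child key: parent key, branch, and spawn name joined with "$";
# an optional suffix (primary key or index) is appended with "_".
def child_key(parent_key, branch, spawn_name, suffix = nil)
  tail = suffix.nil? ? spawn_name.to_s : "#{spawn_name}_#{suffix}"
  [parent_key, validate_step_name_segment!(branch), tail].join(STEP_NAME_DELIMITER)
end
```

Sorting the branch names means `merge_branches :b, :a` and `merge_branches :a, :b` resolve to the same `merge$a,b` log.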
|
|
388
|
+
|
|
389
|
+
## Non-goals (v1) — with caveats
|
|
390
|
+
|
|
391
|
+
- **Sharded completion counter / instant wake.** v1 polls (adaptive, but still polls). If
|
|
392
|
+
sub-`min_interval` wake latency is ever needed, a
|
|
393
|
+
`fork_counters(branch_log_id, shard, completed)` table (K rows/branch) is the upgrade —
|
|
394
|
+
fully internal to the branch, **zero API change**.
|
|
395
|
+
- **Parallel dispatch.** One parent job streams the dispatch. At ~1M this is minutes and
|
|
396
|
+
crash-safe via the cursor; recursive dispatcher sub-jobs are a future throughput upgrade.
|
|
397
|
+
- **Result aggregation.** Children communicate via their own `context`, which is
|
|
398
|
+
**workflow-scoped** — a parent can't read a child's context today. `branch_log.spawned_workflows`
|
|
399
|
+
returns the child **records**. Aggregation, when added, needs an explicit cross-workflow
|
|
400
|
+
read API.
|
|
401
|
+
- **`merge_branches` timeout.** Blocks indefinitely (Option A); a `timeout:` can come later.
|
|
402
|
+
- **Dashboard nesting.** Parent/child tree + per-child recovery is a follow-up; the
|
|
403
|
+
`parent_execution_log_id` column makes the tree cheap to walk.
|
|
404
|
+
|
|
405
|
+
## Testing strategy
|
|
406
|
+
|
|
407
|
+
Mirror the existing `ChaoticJob` style (`perform_later` + `perform_all_jobs`; assert on
|
|
408
|
+
workflow `state`, `execution_logs`).
|
|
409
|
+
|
|
410
|
+
- **Happy path:** a `branch` with a `spawn :a` + a `spawn_each :b`; assert child keys
|
|
411
|
+
(`…$a`, `…$b_0`, `…$b_1`, …), `parent_execution_log_id`, the branch seals,
|
|
412
|
+
`merge_branches` resumes, the workflow finishes.
|
|
413
|
+
- **Spawn outside branch raises** `NotInBranchError`.
|
|
414
|
+
- **Concurrency:** two branches dispatched before the merge both make progress before the
|
|
415
|
+
join; work between branch blocks and merge runs while children are in flight.
|
|
416
|
+
- **Eager dispatch:** children begin before `merge_branches` is reached.
|
|
417
|
+
- **Class from body:** a `spawn_each` returning mixed classes creates children with the
|
|
418
|
+
right `job_class` per item (and they bulk-enqueue together).
|
|
419
|
+
- **Determinism guard:** an AR relation with a conflicting `.order(...)` **raises**.
|
|
420
|
+
- **Crash mid-dispatch (cursor resume):** glitch after chunk *k*; assert
|
|
421
|
+
`metadata.cursors[name]` (`{ pk:, n: }` for AR; `{ n: }` for enumerable) persisted,
|
|
422
|
+
dispatch resumes from it (not 0), final child count correct, no duplicate rows, only the
|
|
423
|
+
in-flight chunk re-enqueued. (This is the 250k-of-300k recovery walkthrough.)
|
|
424
|
+
- **Dropped-job re-kick:** a sealed branch with a child whose job was lost (state `:idle`,
|
|
425
|
+
never started); assert the poll re-enqueues exactly that child and then resolves.
|
|
426
|
+
- **Poll job:** assert the parent runs only twice (kick-off + wake) regardless of poll
|
|
427
|
+
count; a manual retry of a parked parent re-spawns the poller.
|
|
428
|
+
- **Adaptive cadence:** assert the count is capped at `CAP` (a branch with ≫CAP incomplete
|
|
429
|
+
issues an O(CAP) count and picks `max_interval`); the delay shrinks toward `min_interval`
|
|
430
|
+
as pending drops.
|
|
431
|
+
- **Automerge:** an `automerge: true` branch blocks execution inline at the block's close
|
|
432
|
+
(not at workflow completion) — assert execution does not continue past the block until
|
|
433
|
+
children finish, and that the `merge$<name>` log exists and is `completed` before the
|
|
434
|
+
next step runs. No explicit `merge_branches` is needed.
|
|
435
|
+
- **Unmerged branch raises:** a branch opened with neither `merge_branches` nor
|
|
436
|
+
`automerge: true` raises `UnmergedBranchError` at the completion gate (unconditional —
|
|
437
|
+
fires even if children already finished), naming the branch.
|
|
438
|
+
- **Option A:** a child `permanently_fail`s → merge parks; recover via `retry_later` →
|
|
439
|
+
merge resolves; assert no progress while parked.
|
|
440
|
+
- **Idempotency / replay:** force replays; assert constant child count, no re-dispatch once
|
|
441
|
+
sealed, replay query count independent of completed-child count (`branch$<name>` preloads
|
|
442
|
+
when sealed; no per-child sub-logs).
|
|
443
|
+
- **Scale:** a branch of hundreds of thousands; assert `insert_all` issues `⌈N/of⌉` inserts
|
|
444
|
+
(not N), constant memory (streamed), contention-free child runs, and one capped probe per
|
|
445
|
+
branch per poll. (Job-enqueue batching is adapter-dependent — under the test adapter it
|
|
446
|
+
falls back to per-job enqueue, so don't assert bulk *enqueue* there.)
|
|
447
|
+
- **Empty branch / empty source:** seals immediately; the merge resolves at once.
|
|
448
|
+
- **Nesting:** a child opens its own branch; assert the tree completes bottom-up.
|
|
449
|
+
|
|
450
|
+
## README notes (when shipping)
|
|
451
|
+
|
|
452
|
+
Surface these prominently in user-facing docs, not just here:
|
|
453
|
+
- **Every branch must be merged or `automerge: true`** — otherwise `UnmergedBranchError`.
|
|
454
|
+
- **The heavy parent is not replayed per poll** — a lightweight `BranchMergeJob` does the
|
|
455
|
+
waiting; the parent runs twice per merge.
|
|
456
|
+
- **AR source must be stable during a branch's dispatch window** — AR items are keyed by
|
|
457
|
+
primary key (`name_<id>`), so inserting rows mid-dispatch is safe but re-use of a PK
|
|
458
|
+
(soft-delete/re-insert) can confuse the cursor. Plain enumerables are keyed by sequential
|
|
459
|
+
index; inserting/removing items mid-dispatch (before a crash-replay seals the branch)
|
|
460
|
+
shifts indices. Once sealed, the block never re-enumerates.
|
|
461
|
+
|
|
462
|
+
## Future work
|
|
463
|
+
|
|
464
|
+
- Sharded-counter table for instant (non-poll) wake.
|
|
465
|
+
- Parallel/recursive dispatcher sub-jobs for dispatch throughput beyond one parent job.
|
|
466
|
+
- Result aggregation via an explicit child-context read API.
|
|
467
|
+
- `merge_branches(..., timeout:)`.
|
|
468
|
+
- Dashboard parent/child tree + per-child recovery actions.
|