chrono_forge 0.9.1 → 0.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +22 -0
  3. data/README.md +305 -44
  4. data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md +1748 -0
  5. data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md.tasks.json +17 -0
  6. data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md +930 -0
  7. data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md.tasks.json +54 -0
  8. data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md +241 -0
  9. data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md.tasks.json +12 -0
  10. data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md +1378 -0
  11. data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md.tasks.json +67 -0
  12. data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md +709 -0
  13. data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md.tasks.json +19 -0
  14. data/docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md +226 -0
  15. data/docs/superpowers/specs/2026-06-25-chrono_forge-dashboard-design.md +190 -0
  16. data/docs/superpowers/specs/2026-06-25-composite-retry-policies-design.md +228 -0
  17. data/docs/superpowers/specs/2026-06-25-reserved-kwarg-guard-design.md +169 -0
  18. data/docs/superpowers/specs/2026-06-25-spawn-merge-branches-design.md +468 -0
  19. data/docs/superpowers/specs/2026-06-26-dashboard-branch-view-design.md +142 -0
  20. data/docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md +265 -0
  21. data/lib/chrono_forge/branch_merge_job.rb +138 -0
  22. data/lib/chrono_forge/branch_probe.rb +26 -0
  23. data/lib/chrono_forge/cleanup.rb +6 -0
  24. data/lib/chrono_forge/execution_log.rb +6 -0
  25. data/lib/chrono_forge/executor/composite_retry_policy.rb +47 -0
  26. data/lib/chrono_forge/executor/methods/branch.rb +185 -0
  27. data/lib/chrono_forge/executor/methods/durably_execute.rb +21 -19
  28. data/lib/chrono_forge/executor/methods/durably_repeat.rb +118 -25
  29. data/lib/chrono_forge/executor/methods/merge_branches.rb +83 -0
  30. data/lib/chrono_forge/executor/methods/wait.rb +2 -4
  31. data/lib/chrono_forge/executor/methods/wait_until.rb +25 -25
  32. data/lib/chrono_forge/executor/methods/workflow_states.rb +16 -0
  33. data/lib/chrono_forge/executor/methods.rb +2 -0
  34. data/lib/chrono_forge/executor/retry_policy.rb +111 -0
  35. data/lib/chrono_forge/executor.rb +216 -28
  36. data/lib/chrono_forge/version.rb +1 -1
  37. data/lib/chrono_forge/workflow.rb +10 -1
  38. data/lib/generators/chrono_forge/migration_actions.rb +1 -0
  39. data/lib/generators/chrono_forge/templates/add_chrono_forge_parent_execution_log.rb +38 -0
  40. metadata +42 -5
  41. data/lib/chrono_forge/executor/retry_strategy.rb +0 -29
@@ -0,0 +1,169 @@
1
+ # Reserved-keyword guard + keywords-only enqueue contract
2
+
3
+ Date: 2026-06-25
4
+ Status: Approved (design)
5
+
6
+ ## Problem
7
+
8
+ `ChronoForge::Executor#perform` reserves several keyword parameters for internal
9
+ plumbing:
10
+
11
+ ```ruby
12
+ def perform(key, attempt: 0, retry_counts: {}, retry_workflow: false, options: {}, **kwargs)
13
+ ```
14
+
15
+ Anything not named here lands in `**kwargs`, is persisted on the `Workflow` row,
16
+ and is replayed to the user's job body via `super(**workflow.kwargs.symbolize_keys)`
17
+ (`executor.rb:100`).
18
+
19
+ Two problems follow from the current public enqueue surface
20
+ (`perform_now`/`perform_later`), which only validate that `key` is a String:
21
+
22
+ 1. **Silent collision.** A user calling `MyJob.perform_later("k", attempt: 5)`
23
+ silently hijacks the internal retry counter instead of passing their own
24
+ argument. Same risk for `retry_counts` and `retry_workflow`.
25
+ 2. **No positional contract.** Extra positional arguments produce Ruby's generic
26
+ `wrong number of arguments` error rather than a contract-specific message —
27
+ even though the executor only ever replays kwargs (keyword-only) to the job
28
+ body, so positionals beyond `key` can never work.
29
+
30
+ ## Decisions (settled with maintainer)
31
+
32
+ - **Public keyword surface:** `key` (required, first positional), `options`
33
+ (free-form metadata bag), and user `**kwargs`. Nothing else.
34
+ - **`options` is unstructured.** The framework defines **zero** recognized option
35
+ keys. `options` is written to `workflow.options` (`executor.rb:159`) and only
36
+ ever read back by callers (`workflow.options`); no key in it drives behavior.
37
+ - **Reserved keys (rejected on the public path):** `attempt`, `retry_counts`,
38
+ `retry_workflow`. These are internal threading params; users have no legitimate
39
+ reason to pass them.
40
+ - **`retry_workflow` is internal**, reached only through the `retry_now` /
41
+ `retry_later` helpers — not by passing the flag directly.
42
+ - **Keywords-only:** exactly one positional (`key`); everything else must be a
43
+ keyword, enforced with a clear, contract-specific error.
44
+ - **No job-signature validation.** We do *not* introspect the job's `perform`
45
+ parameters to validate unknown/missing kwargs. Out of scope; mismatches still
46
+ surface at execution time as today.
47
+
48
+ ## Mechanism: how internal calls bypass the guard
49
+
50
+ The split between "framework may pass reserved keys" and "users may not" rests on
51
+ an ActiveJob implementation detail, confirmed on ActiveJob 7.1.3.4:
52
+
53
+ > `ActiveJob::ConfiguredJob` (returned by `.set(...)`) defines its own
54
+ > `perform_now` / `perform_later` that build a fresh job instance and call the
55
+ > **instance-level** enqueue path. They do **not** dispatch through the
56
+ > **class-level** `perform_*` override.
57
+
58
+ Therefore any enqueue routed through `.set(...)` bypasses the guard:
59
+
60
+ - All framework continuations already use `.set(wait: …).perform_later(key, …)`
61
+ (`executor.rb:138`, `wait.rb`, `wait_until.rb`, `durably_repeat.rb`,
62
+ `durably_execute.rb`) — their `attempt:`/`retry_counts:`/`wait_condition:`
63
+ ride through untouched.
64
+ - `retry_now` / `retry_later` are rewritten to enqueue via `set.perform_*`,
65
+ legitimately injecting `retry_workflow: true` past the guard.
66
+
67
+ This dependency is non-obvious and now load-bearing, so it is documented inline
68
+ at the guard.
69
+
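A minimal sketch of the bypass described above, not part of the diffed file. `FakeJob` and `FakeConfiguredJob` are hypothetical stand-ins written in plain Ruby; they only model the shape of the dispatch (class-level guard versus a proxy that builds the instance itself) and are not ActiveJob's actual implementation.

```ruby
# Hypothetical stand-ins; none of this is ActiveJob or ChronoForge code.
class FakeJob
  RESERVED = %i[attempt retry_counts retry_workflow].freeze

  def self.enqueued
    @enqueued ||= []
  end

  # Class-level entry point: the guarded public path.
  def self.perform_later(key, **kwargs)
    reserved = kwargs.keys & RESERVED
    raise ArgumentError, "#{reserved.join(", ")} reserved" if reserved.any?

    new(key, **kwargs).enqueue
  end

  # Models `.set(...)`: returns a proxy instead of dispatching here.
  def self.set(**options)
    FakeConfiguredJob.new(self, options)
  end

  def initialize(key, **kwargs)
    @key = key
    @kwargs = kwargs
  end

  # Instance-level enqueue path; records what would be queued.
  def enqueue(**options)
    self.class.enqueued << [@key, @kwargs, options]
    self
  end
end

class FakeConfiguredJob
  def initialize(job_class, options)
    @job_class = job_class
    @options = options
  end

  # Builds the instance directly, so FakeJob.perform_later (and its guard) never runs.
  def perform_later(key, **kwargs)
    @job_class.new(key, **kwargs).enqueue(**@options)
  end
end

# Framework-style continuation: the reserved key rides through the proxy.
FakeJob.set(wait: 5).perform_later("k", attempt: 5)
```

Calling `FakeJob.perform_later("k", attempt: 5)` directly raises `ArgumentError` in this model, while the `set(...)` call above is recorded as enqueued.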
70
+ ## Design
71
+
72
+ All changes land in `lib/chrono_forge/executor.rb`, in the `class << base` block,
73
+ plus one module-level constant. ~30 lines. No schema or behavior changes
74
+ elsewhere.
75
+
76
+ ### 1. Reserved-key constant (module level, near `STEP_NAME_DELIMITER`)
77
+
78
+ ```ruby
79
+ # Keyword args ChronoForge threads through job args internally. Users must not
80
+ # pass these to perform_now/perform_later; the framework injects them via
81
+ # `.set(...)` continuations, whose ConfiguredJob proxy bypasses the class-level
82
+ # guard below.
83
+ RESERVED_KWARGS = %i[attempt retry_counts retry_workflow].freeze
84
+ ```
85
+
86
+ ### 2. Public guards — `perform_now` / `perform_later`
87
+
88
+ ```ruby
89
+ def perform_now(key, *extra, **kwargs)
90
+ __validate_enqueue!(key, extra, kwargs)
91
+ super(key, **kwargs)
92
+ end
93
+
94
+ def perform_later(key, *extra, **kwargs)
95
+ __validate_enqueue!(key, extra, kwargs)
96
+ super(key, **kwargs)
97
+ end
98
+
99
+ private
100
+
101
+ def __validate_enqueue!(key, extra, kwargs)
102
+ unless key.is_a?(String)
103
+ raise ArgumentError, "Workflow key must be a string as the first argument"
104
+ end
105
+ unless extra.empty?
106
+ raise ArgumentError, "ChronoForge workflows accept only `key` positionally; " \
107
+ "pass everything else as keywords (got #{extra.size} extra positional arg(s))"
108
+ end
109
+ reserved = kwargs.keys & RESERVED_KWARGS
110
+ if reserved.any?
111
+ raise ArgumentError,
112
+ "#{reserved.join(", ")} #{reserved.one? ? "is a reserved" : "are reserved"} " \
113
+ "ChronoForge keyword(s) and cannot be passed to perform_now/perform_later"
114
+ end
115
+ end
116
+ ```
117
+
118
+ `*extra` exists solely to catch stray positionals and produce the clear error;
119
+ after validation it is always empty and discarded (only `super(key, **kwargs)`
120
+ is forwarded).
121
+
122
+ ### 3. Retry helpers — route past the guard
123
+
124
+ ```ruby
125
+ def retry_now(key, **kwargs)
126
+ __validate_enqueue!(key, [], kwargs)
127
+ set.perform_now(key, retry_workflow: true, **kwargs)
128
+ end
129
+
130
+ def retry_later(key, **kwargs)
131
+ __validate_enqueue!(key, [], kwargs)
132
+ set.perform_later(key, retry_workflow: true, **kwargs)
133
+ end
134
+ ```
135
+
136
+ They still validate the *user's* kwargs (rejecting any reserved key the user
137
+ supplied), then inject `retry_workflow: true` through the `ConfiguredJob` bypass.
138
+
139
+ ### 4. Framework continuations — unchanged
140
+
141
+ `executor.rb:138`, `wait.rb`, `wait_until.rb`, `durably_repeat.rb`,
142
+ `durably_execute.rb` already enqueue via `.set(...)`. No change required.
143
+
144
+ ## Scope / caveats
145
+
146
+ - **Executor-only.** The guard lives in the `Executor`-prepended singleton.
147
+ `ChronoForge::CleanupJob` is a plain `ActiveJob::Base` and is unaffected
148
+ (its `perform_now(older_than_days: …)` / arg-less `perform_later` keep working).
149
+ - **Backward compatible.** A full scan of `lib/` and `test/` found no call site
150
+ passing a second positional and no user call passing a reserved key, so the
151
+ existing suite passes unchanged.
152
+ - **`wait_condition`** (internal kwarg in `wait_until`) is intentionally *not*
153
+ added to `RESERVED_KWARGS`: it only ever travels via `.set(...)` and so never
154
+ reaches the guard. Adding it later is a harmless one-line hygiene change if
155
+ desired.
156
+
157
+ ## Testing (TDD)
158
+
159
+ New tests (Executor-prepended job):
160
+
161
+ 1. `perform_later` / `perform_now` raise `ArgumentError` when passed `attempt:`,
162
+ `retry_counts:`, or `retry_workflow:` — and the message names the key(s).
163
+ 2. `perform_later` / `perform_now` raise `ArgumentError` with the contract
164
+ message when passed a second positional argument.
165
+ 3. `perform_later("k", kwarg: "x", options: {a: 1})` still enqueues; `options`
166
+ and user kwargs reach the workflow (`workflow.options`, `workflow.kwargs`).
167
+ 4. `retry_now` / `retry_later` still unlock-and-continue a stalled workflow
168
+ (existing behavior preserved), and reject reserved keys passed by the caller.
169
+ 5. Non-String `key` still raises (regression guard for existing behavior).
@@ -0,0 +1,468 @@
1
+ # Branches — Concurrent Sub-Workflows (`branch` / `spawn` / `merge_branches`) — Design
2
+
3
+ **Date:** 2026-06-25
4
+ **Status:** Implemented (branch `feat/branches`).
5
+ **Scope:** New public API, additive. Introduces parent/child workflows and a
6
+ fan-out/fan-in primitive built to dispatch **hundreds of thousands** of children per
7
+ branch. One new (generic, reusable) column on `chrono_forge_workflows`; reuses the
8
+ execution-log pattern for coordination. No breaking change to existing single-workflow
9
+ execution. **New dependency floor:** `activejob >= 7.1` (for `perform_all_later`).
10
+
11
+ ## Problem
12
+
13
+ ChronoForge workflows are strictly sequential. The only way to fan work out today is to
14
+ hand-enqueue independent workflows and poll for them with `wait_until` — a hand-rolled
15
+ fork/join with no idempotent dispatch and no parent/child visibility.
16
+
17
+ Real workflows need durable, large-scale fan-out: "spawn one sub-workflow per record
18
+ across a 500k-row set, run them in parallel, continue once all are done." It must be
19
+ crash-safe, idempotent under replay, and must not hold the batch in memory or serialize
20
+ on a hot row.
21
+
22
+ ## Goal
23
+
24
+ A **branch** is the unit of fan-out — a durable step that wraps the work it spawns and
25
+ ties it together for the join. `spawn`/`spawn_each` exist **only inside a `branch`
26
+ block**. The model mirrors git: you branch, then you merge.
27
+
28
+ - `branch(name, automerge: false, &block)` — opens branch `name` (the durable
29
+ `branch$<name>` log), runs the block to **eagerly dispatch** children, and **seals**
30
+ when the block closes. Returns immediately (does **not** wait) so branches run
31
+ concurrently.
32
+ - `spawn(name, WorkflowClass, **kwargs)` — inside a branch: dispatch a **single** named
33
+ child.
34
+ - `spawn_each(name, source, of:) { |item| [WorkflowClass, kwargs] }` — inside a branch:
35
+ dispatch **one child per item**, streamed like ActiveRecord batch loading; AR items are
36
+ keyed `name_<record.id>` (primary key); plain enumerables are keyed `name_{index}`
37
+ (sequential index).
38
+ - `merge_branches(*names)` — the **separate** join: halt until every named branch is
39
+ sealed **and** all its children have completed.
40
+
41
+ ```ruby
42
+ def perform(cycle_id:)
43
+ branch :fulfillment, automerge: true do # the step; seals when the block closes
44
+ spawn :reconcile, ReconcileWorkflow, region: "EU" # single child of :fulfillment
45
+ spawn_each :orders, Order.pending do |order| # bulk, streamed; keys orders_<id>…
46
+ order.priority? ? [PriorityOrderWorkflow, { order_id: order.id }]
47
+ : [OrderWorkflow, { order_id: order.id }]
48
+ end
49
+ end
50
+
51
+ branch :invoicing do # a second, concurrent branch
52
+ spawn_each :invoices, Invoice.unpaid do |inv|
53
+ [InvoiceWorkflow, { invoice_id: inv.id }]
54
+ end
55
+ end
56
+
57
+ do_other_work # both branches already running
58
+
59
+ merge_branches :invoicing # join :invoicing; :fulfillment auto-merges
60
+ durably_execute :finalize
61
+ end
62
+ ```
63
+
64
+ ## Decisions (locked during brainstorming)
65
+
66
+ | Decision | Choice |
67
+ |---|---|
68
+ | Keywords | **`branch` / `spawn` / `spawn_each` / `merge_branches`** — git branch/merge metaphor; `spawn` avoids shadowing `Kernel#fork`. |
69
+ | Branch = the step | A branch **is** its `branch$<name>` execution log. `spawn`/`spawn_each` are valid **only inside a `branch` block** (raise otherwise) — spawns don't exist without a branch. |
70
+ | Dispatch timing | **Eager.** Spawns insert + enqueue as the block runs; children start at once. The branch **seals** (log → `completed`) when the block closes. |
71
+ | Join | **Separate `merge_branches`** so branches run concurrently and work can happen in between. Joins one or more named branches at once (`merge_branches :a` for one). |
72
+ | `merge_branch` alias | **Ship a singular `merge_branch(name, **opts)` alias** that delegates to `merge_branches` — reads naturally for the common one-branch case (`merge_branch :a`) without a plural-method/singular-arg mismatch. Decided, not just mentioned. |
73
+ | Automerge | A property **of the branch**: `branch(name, automerge: true)`. When `true`, `branch` eagerly dispatches inside the block and then immediately calls `merge_branches(name)` at the block's close — execution does not continue past the block until the branch's children complete. No explicit `merge_branches` is needed. |
74
+ | Branch tracking | An **in-memory registry** (`@open_branches`), rebuilt each replay pass: `branch` adds, `merge_branches` removes on completion, the completion gate inspects the remainder. Deterministic replay makes it exact — no persisted `merged`/`automerge` flags. |
75
+ | Every branch must be joined | **No detached branches.** Any branch remaining in `@open_branches` at completion (neither `merge_branches`-d nor automerged) **raises `UnmergedBranchError`** (fail-fast on a forgotten join), rather than silently letting children run orphaned. `automerge: true` branches are joined inline at the block close and are absent from `@open_branches` by the time the completion gate runs. |
76
+ | Spawn identity | Spawns are **named** (`spawn :reconcile, …`, `spawn_each :orders, …`). The name anchors the child key and the per-`spawn_each` cursor — stable across code reordering (unlike a positional ordinal). AR items are keyed `name_<record.id>` (primary key); plain enumerable items are keyed `name_{index}` (sequential index). |
77
+ | Bulk source | `spawn_each` **streams** the source — `find_in_batches(batch_size: of, start: cursor)` for AR — never materialising the batch in memory. Scales to millions. |
78
+ | Child class | **Returned from the block** (`[WorkflowClass, kwargs]`); one branch may fan out into mixed workflow types. |
79
+ | Child key | Deterministic: `spawn` → `"#{parent.key}$#{branch}$#{spawn_name}"`; AR `spawn_each` item → `"#{parent.key}$#{branch}$#{spawn_name}_#{record.id}"`; enumerable item → `"#{parent.key}$#{branch}$#{spawn_name}_#{index}"`. Idempotency falls out of the unique-key constraint. |
80
+ | Cursor | Per `spawn_each`, stored in the `branch$<name>` log's `metadata` keyed by **spawn name** as `{ pk: <keyset>, n: <count/index> }`; persisted **once per dispatched chunk** (bundled with that chunk's `insert_all`). |
81
+ | Completion | **Poll**, no counter: a branch is done when sealed and has no incomplete children (`branch_log.spawned_workflows.where.not(state: :completed)` empty — read as an O(CAP) capped count). Zero per-completion contention. |
82
+ | Poll mechanism | A dedicated lightweight `ChronoForge::BranchMergeJob` (plain ActiveJob — no lock/replay/context) does the repeated probing and wakes the parent only at completion. The heavy parent runs just twice per merge (kick off + wake). No separate recovery timer. |
83
+ | Poll cadence | **Adaptive, capped-count.** `pending = incomplete.limit(CAP).count` (**O(CAP)**, never O(N)); next delay `clamp(pending * factor, min_interval, max_interval)` — fast when few remain, slow when many. |
84
+ | Determinism | AR items are keyed by **primary key**, so the stream is stable by construction. `spawn_each` rejects an AR relation carrying an explicit `.order` by checking `order_values.present?` up front (raises `NotExecutableError`). Plain enumerable items are keyed by **sequential index** and must re-enumerate deterministically (documented contract). |
85
+ | Failure semantics | **Option A.** A `stalled`/`failed` child keeps the branch incomplete; the parent stays parked; the user recovers the child (`retry_now`/`retry_later`) and the merge then resolves. No new failure states, no cascade. |
86
+ | Nesting | **Free.** A child is a workflow and may open its own branches; the tree forms via `parent_execution_log_id` (child → branch log → parent workflow). |
87
+
88
+ ## Public API surface
89
+
90
+ ```ruby
91
+ branch(name, automerge: false) do
92
+ spawn(name, WorkflowClass, **kwargs)
93
+ spawn_each(name, source, of: 1000) { |item| [WorkflowClass, kwargs] }
94
+ end
95
+
96
+ merge_branches(*names, min_interval: 5.seconds, max_interval: 5.minutes) # halts until done
97
+ merge_branch(name, **opts) # singular alias for the common one-branch case
98
+ ```
99
+
100
+ `spawn`/`spawn_each` raise `NotInBranchError` if called outside a `branch` block. A branch
101
+ opened but neither `merge_branches`-d nor `automerge: true` raises `UnmergedBranchError` at
102
+ workflow completion.
103
+
104
+ ## Data model
105
+
106
+ `chrono_forge_workflows` gains **one** nullable column (inline in the install migration;
107
+ a follow-up migration template for existing installs):
108
+
109
+ | Column | Type | Notes |
110
+ |---|---|---|
111
+ | `parent_execution_log_id` | FK → `chrono_forge_execution_logs.id`, nullable, indexed | The execution log that spawned this workflow. For branches it's the `branch$<name>` log. **Deliberately generic** — any future step that spawns sub-workflows reuses it. |
112
+
113
+ The branch a child belongs to *is* its `parent_execution_log_id` (the `branch$<name>`
114
+ log), which is globally unique and encodes both the parent workflow (the log's
115
+ `workflow_id`) and the branch (its `step_name`). No `branch_name`/`parent_workflow_id`
116
+ column is needed.
117
+
118
+ **Merge/automerge state is not persisted.** It's tracked in an in-memory registry rebuilt
119
+ each replay pass from the `branch`/`merge_branches` calls (see Execution flow) — `branch`
120
+ adds, `merge_branches` removes, the completion gate inspects the remainder. Deterministic
121
+ replay makes this exact every pass, so no `merged`/`automerge` columns or metadata flags
122
+ are needed; the branch log holds only dispatch cursors.
123
+
124
+ **Composite index** `(parent_execution_log_id, state)` — makes the merge capped count
125
+ and the dropped-job re-kick index-only and short-circuiting at millions of rows.
126
+
127
+ No new table. The branch is the **`branch$<name>`** execution log (two-segment, like
128
+ `durably_repeat`'s coordination log — preloaded when sealed, never per child):
129
+
130
+ ```
131
+ step_name: "branch$fulfillment"
132
+ state: pending (dispatching) | completed (sealed / block closed)
133
+ metadata: { "cursors" => { "orders" => { "pk" => <keyset>, "n" => <count> } } } # keyed by spawn name
134
+ ```
135
+
136
+ The **`merge$<names>`** log coordinates a join (`pending` while polling → `completed`).
137
+
138
+ `Workflow belongs_to :parent_execution_log, class_name: "ExecutionLog", optional: true`;
139
+ `ExecutionLog has_many :spawned_workflows, class_name: "Workflow", foreign_key: :parent_execution_log_id`.
140
+ A branch's children are `branch_log.spawned_workflows`; the parent is `branch_log.workflow`.
141
+
142
+ ### One bounded log per branch — preload safety
143
+
144
+ The preload (`completed_step_cache`) bulk-loads all `completed` logs except
145
+ `durably_repeat$%$%` (the unbounded three-segment repetition logs). A two-segment
146
+ `branch$<name>` is preloaded when sealed, so a replay past the branch short-circuits.
147
+ **Per-child state is never modelled as sub-segment logs** (`branch$<name>$<child>`):
148
+ those would be pulled into the preload and load millions of rows on every replay.
149
+ Per-child state lives on the child workflow rows; the branch log holds only cursors.
150
+
151
+ ## Execution flow
152
+
153
+ ### `branch(name, automerge:) { … }` — wrap + eager dispatch + seal
154
+
155
+ Gated by `find_or_create_execution_log!("branch$#{name}")` (the branch log holds only
156
+ dispatch cursors; `automerge`/merge state is in-memory, not seeded here):
157
+
158
+ 1. **Sealed** (`completed`, served from `completed_step_cache`) → already fully
159
+ dispatched; **skip the block entirely** (no re-stream) and return. *(This short-circuit
160
+ is the single most important correctness/performance property in the design — the
161
+ expensive source enumeration never re-runs after sealing. It warrants a prominent comment
162
+ directly above the skip path in the implementation.)*
163
+ 2. **Pending / new** → set the current-branch context, **yield the block** (named spawns
164
+ dispatch into this branch, advancing their cursors — see below), clear the context,
165
+ mark the `branch$<name>` log `completed` (**sealed**). If `automerge: true`, immediately
166
+ call `merge_branches(name)` — **execution does not continue past the block** until the
167
+ branch's children complete (identical to an explicit `merge_branches` call placed right
168
+ after the block, but guaranteed by the method). Otherwise **return** without halting —
169
+ branches are concurrent; the explicit join is separate.
170
+
171
+ Either way, `branch` **registers the branch in the in-memory registry**
172
+ `@open_branches[name] = { automerge:, log_id: }` — this runs on *every* pass (sealed or
173
+ not), since the `branch` method itself always executes even when its block is skipped.
174
+
175
+ `spawn`/`spawn_each` read the current branch from that context and raise
176
+ `NotInBranchError` if there is none.
177
+
178
+ ### `spawn` / `spawn_each` — dispatch within a branch
179
+
180
+ - **`spawn(name, klass, **kwargs)`** → one child, key `"#{parent.key}$#{branch}$#{name}"`,
181
+ `job_class: klass.name`, `parent_execution_log_id: branch_log.id`. Idempotent on the key.
182
+ - **`spawn_each(name, source, of:)`** → stream, resuming from `metadata.cursors[name]`
183
+ (`{ pk:, n: }`); `n` is a running count (AR) or the resume index (enumerable):
184
+ - **AR relation:** rejects `source` if `source.order_values.present?` (raises
185
+ `NotExecutableError` — iteration is by PK and an explicit order conflicts). Resumes via
186
+ `source.find_in_batches(batch_size: of, start: cursor.pk)`. Per batch, for each record: `klass, kw =
187
+ yield(record)`; build child rows (key
188
+ `"#{parent.key}$#{branch}$#{name}_#{record.id}"`, `job_class`, `kwargs`,
189
+ `parent_execution_log_id: branch_log.id`, `state: :idle`); `insert_all(…, unique_by:
190
+ :key)` (on-conflict-ignore); enqueue only those children still `:idle` (dispatch is
191
+ **queue-idempotent** — a crash-resume never re-runs an already-completed/running child);
192
+ advance `metadata.cursors[name]` to `{ pk: batch.last.id, n: n + batch.size }`
193
+ (committed with the inserts).
194
+ - **Enumerable:** resume via `drop(n)`; child key uses `name_#{n}` (sequential index);
195
+ same per-chunk insert/idle-filter/enqueue/advance (`n` only).
196
+ - Enqueue the chunk, **then** advance the cursor — a crash in between re-enqueues only
197
+ that one chunk on resume (idempotent).
198
+
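The enumerable resume path above can be sketched in plain Ruby, not part of the diffed file. Everything here is a hypothetical in-memory model: `rows` stands in for the unique-keyed workflows table, `cursor[:n]` for the `n` stored in `metadata.cursors[name]`, and `crash_after_chunks` simulates a process dying after a chunk's rows are written but before its cursor advance.

```ruby
# Hypothetical model of per-chunk dispatch with a resumable index cursor.
def dispatch_enumerable(items, prefix:, rows:, cursor:, of:, crash_after_chunks: nil)
  chunks_done = 0
  items.drop(cursor[:n]).each_slice(of) do |chunk|
    chunk.each_with_index do |item, offset|
      key = "#{prefix}_#{cursor[:n] + offset}"
      rows[key] ||= { item: item, state: :idle } # insert, ignoring key conflicts
    end
    # Simulated crash: rows exist, cursor has not advanced for this chunk.
    return :crashed if crash_after_chunks && chunks_done == crash_after_chunks

    cursor[:n] += chunk.size
    chunks_done += 1
  end
  :sealed
end

rows = {}
cursor = { n: 0 }
items = %w[a b c d e f g]

dispatch_enumerable(items, prefix: "orders", rows: rows, cursor: cursor, of: 3, crash_after_chunks: 1)
# cursor is now { n: 3 }; six rows exist because the second chunk was written before the crash

dispatch_enumerable(items, prefix: "orders", rows: rows, cursor: cursor, of: 3)
# resumes at index 3, re-touches only the second chunk, and ends with seven rows
```

The resume re-inserts only the chunk that was in flight; the conflict-ignoring insert makes that a no-op for rows that already exist.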
199
+ ### `merge_branches(*names)` — separate poll-join
200
+
201
+ Each name is validated up front: `$` is rejected via `validate_step_name_segment!`, and `,`
202
+ (the merge step-name separator) is also rejected — both raise `InvalidStepName`.
203
+
204
+ Gated by `find_or_create_execution_log!("merge$#{names.sort.join(',')}")`:
205
+
206
+ 1. **Completed** → return, continue.
207
+ 2. For each `name`: require it to be in `@open_branches` (opened earlier this pass) — a name
208
+ that was never opened **raises `UnknownBranchError`** (a `NotExecutableError` subclass,
209
+ so it fail-fasts via the existing rescue without broadening the executor); a not-yet-sealed
210
+ branch means "still dispatching".
211
+ 3. **Capped-count probe** per branch:
212
+ `branch_log.spawned_workflows.where.not(state: :completed).limit(CAP).count`
213
+ (`where(parent_execution_log_id: branch_log.id, …)`, index-only, **O(CAP) not O(N)**).
214
+ All `0` → done. Otherwise enqueue a `BranchMergeJob` (which polls + re-kicks dropped
215
+ jobs) and `halt_execution!`.
216
+ 4. All branches `0` pending → **delete those names from `@open_branches`** (so the
217
+ completion gate sees them as joined), mark the `merge$…` log `completed`, continue.
218
+
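A sketch of how the merge step name above is formed, not part of the diffed file. The helper `merge_step_name` and the local `InvalidStepName` class are hypothetical stand-ins; the diffed spec performs the `$` check through `validate_step_name_segment!`, whose body is not shown here.

```ruby
# Hypothetical stand-in for the library's error class.
class InvalidStepName < StandardError; end

# Rejects `$` and `,` in any branch name, then builds the sorted, comma-joined step name.
def merge_step_name(*names)
  names.each do |name|
    if name.to_s.match?(/[$,]/)
      raise InvalidStepName, "branch name #{name.inspect} must not contain `$` or `,`"
    end
  end
  "merge$#{names.map(&:to_s).sort.join(",")}"
end

merge_step_name(:invoicing, :fulfillment) # => "merge$fulfillment,invoicing"
```

Sorting makes the step name independent of argument order, so `merge_branches :a, :b` and `merge_branches :b, :a` resolve to the same log.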
219
+ Completion is **poll-based**, delegated to a dedicated lightweight job so the heavy
220
+ parent isn't replayed per check:
221
+
222
+ - `merge_branches` does **one** immediate check; if not done, enqueues
223
+ `ChronoForge::BranchMergeJob` and `halt_execution!`s (parent → `idle`, lock released).
224
+ The parent runs only **twice** per merge: kick off + completion wake.
225
+ - **`BranchMergeJob`** is a plain ActiveJob — *no* lock, replay, or context. Each run:
226
+ ```ruby
227
+ pending = branch_log_ids.sum { |id| incomplete(id).limit(CAP).count } # O(CAP), index-only
228
+ if pending.zero? && all_sealed?(branch_log_ids)
229
+ ParentWorkflow.perform_later(parent_key) # wake the parent once
230
+ else
231
+ rekick_dropped_jobs(branch_log_ids) # idle re-kick lives here
232
+ delay = [[pending * factor, min_interval].max, max_interval].min # adaptive cadence
233
+ self.class.set(wait: delay).perform_later(parent_key, branch_log_ids, min_interval, max_interval)
234
+ end
235
+ ```
236
+ - On the wake, the parent replays once; sealed branches short-circuit (no re-stream),
237
+ `merge_branches` re-checks, marks the `merge$<names>` log `completed`, and continues.
238
+ **The parent completes its own merge step** — the poller only *detects* and wakes.
239
+
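The adaptive cadence above can be illustrated with a short sketch, not part of the diffed file. The values of `CAP` and `factor` are assumptions chosen for the example (the spec does not state them here), intervals are plain integer seconds, and `capped_pending` merely imitates what `incomplete.limit(CAP).count` would return.

```ruby
CAP = 1_000 # hypothetical cap; the real constant's value is not shown in this diff

# Imitates `incomplete.limit(CAP).count`: never reports more than CAP rows.
def capped_pending(incomplete_count)
  [incomplete_count, CAP].min
end

# Same clamp as `[[pending * factor, min_interval].max, max_interval].min`.
def next_poll_delay(pending, factor:, min_interval:, max_interval:)
  (pending * factor).clamp(min_interval, max_interval)
end

next_poll_delay(capped_pending(3), factor: 1, min_interval: 5, max_interval: 300)       # => 5
next_poll_delay(capped_pending(40), factor: 1, min_interval: 5, max_interval: 300)      # => 40
next_poll_delay(capped_pending(500_000), factor: 1, min_interval: 5, max_interval: 300) # => 300
```

With few children left the delay sits at `min_interval`; with a large backlog the capped count saturates and the delay sits at `max_interval`.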
240
+ `merge_branches` **(re)spawns a poller whenever reached while still pending**, so a manual
241
+ retry of a parked parent self-heals a lost poller (including when the poller was spawned by
242
+ an automerge inline call); a rare double-poller from an external re-trigger is harmless (the
243
+ wake is idempotent).
244
+
245
+ **No separate recovery poll.** The poller is a durable backend-scheduled job — the same
246
+ durability `wait_until`'s reschedule already relies on. A lost poller just parks the
247
+ parent with a pending `merge$…` log, recoverable by retry (Option A). Cost: one tiny job
248
+ per (adaptive) interval plus an **O(CAP)** index-only capped count per branch — no
249
+ counter, no per-child shared write, no hot-row contention at any scale; latency falls
250
+ toward `min_interval` as the branch nears done. Option A falls out — a failed child keeps
251
+ pending > 0, so the parent waits until it is recovered.
252
+
253
+ ### Completion gate — every branch must be joined
254
+
255
+ Every branch must be joined — explicitly via `merge_branches` or implicitly via
256
+ `automerge: true`. **There is no detached branch.** `complete_workflow!` gains a gate
257
+ (`enforce_branch_joins!`) that runs **before** it seals the workflow and inspects `@open_branches`, the in-memory
258
+ registry that `branch` populated and `merge_branches` pruned during this pass (rebuilt
259
+ deterministically every replay, so it's exact). The gate does **only** the unmerged-raise
260
+ check:
261
+
262
+ 1. **Unmerged check:** any branch remaining in `@open_branches` at completion is a forgotten
263
+ join → **raise `UnmergedBranchError`** naming the branch(es), with the hint *"add
264
+ `merge_branches :x` or `branch(:x, automerge: true)`."* This fails the workflow fast
265
+ rather than letting children run orphaned; the developer fixes the code and retries. The
266
+ check is unconditional (fires even if the branch's children happen to have finished) so
267
+ the contract is deterministic, not timing-dependent.
268
+
269
+ (A branch joined via `merge_branches` was already deleted from `@open_branches` when that
270
+ merge completed, so it's absent here. An `automerge: true` branch is also absent — its join
271
+ ran inline at the `branch` block's close, removing it from `@open_branches` before
272
+ execution ever continued past the block.)
273
+
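A toy model of the registry and gate described above, not part of the diffed file. `RegistrySketch` is hypothetical and deliberately omits dispatch, polling, and halting: it assumes every merge finds its children already complete, so only the add, remove, and inspect bookkeeping is shown.

```ruby
# Local stand-in for the library's error class.
class UnmergedBranchError < StandardError; end

class RegistrySketch
  def initialize
    @open_branches = {}
  end

  # Registers the branch on every pass; an automerge branch is joined inline.
  def branch(name, automerge: false)
    @open_branches[name] = { automerge: automerge }
    merge_branches(name) if automerge
  end

  # Assumes the join succeeds immediately, then prunes the registry.
  def merge_branches(*names)
    names.each { |name| @open_branches.delete(name) }
  end

  # The completion gate: anything still registered is a forgotten join.
  def complete!
    return :completed if @open_branches.empty?

    names = @open_branches.keys.map { |n| ":#{n}" }.join(", ")
    raise UnmergedBranchError, "unmerged branch(es): #{names}"
  end
end

workflow = RegistrySketch.new
workflow.branch(:fulfillment, automerge: true) # joined inline, never left open
workflow.branch(:invoicing)                    # still open until merged
workflow.merge_branches(:invoicing)
workflow.complete!                             # => :completed
```

Dropping the `merge_branches(:invoicing)` call makes `complete!` raise `UnmergedBranchError` naming `:invoicing`, regardless of whether the children happened to finish.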
274
+ ## Determinism
275
+
276
+ The cursor is only meaningful if iteration is reproducible across replays:
277
+
278
+ - **AR relation:** children are keyed by **primary key** (`name_<record.id>`), so the
279
+ mapping from record to child key is stable regardless of enumeration order. Iteration is
280
+ driven by **primary-key keyset** (`find_in_batches(start:)`). `spawn_each` rejects a relation
281
+ carrying an explicit `.order(...)` by checking `order_values.present?` up front (raises
282
+ `NotExecutableError`) — relying on `find_in_batches`'s `error_on_ignore` is not
283
+ sufficient because `find_in_batches(start:)` is inclusive and a crash-resume re-yields
284
+ the boundary record; the explicit up-front check catches order conflicts before any
285
+ inserts occur.
286
+ - **Enumerable:** items are keyed `name_{index}` by their **sequential position** in the
287
+ stream, so the source must re-enumerate identically across replays (effectively frozen
288
+ for the brief dispatch window — once the branch seals, replay skips the block, so no
289
+ re-enumeration happens thereafter). Deterministic re-enumeration is a documented,
290
+ unverifiable contract — misuse is still *safe* (`insert_all`-ignore + poll) but could
291
+ dispatch the wrong set.
292
+
293
+ ## Idempotency & crash recovery
294
+
295
+ Three layers — **what exists** (DB), **how far dispatch got** (cursor), **step state**
296
+ (the log).
297
+
298
+ - **`find_or_create_execution_log!`** — a sealed `branch$<name>` skips the whole block;
299
+ a completed `merge$…` short-circuits the join.
300
+ - **Deterministic keys + per-spawn cursors = "which children exist."** The branch never
301
+ tracks children individually. Existence is owned by the unique index (`insert_all`-ignore
302
+ is a no-op for rows that exist); dispatch progress is owned by `metadata.cursors[name]`.
303
+ Recovery resumes from the cursor and re-touches **one chunk**, not the whole set.
304
+ *(Dispatch is not bound to the create block: a crash after the log is created must still
305
+ create and enqueue the remaining rows, or the branch would stall forever.)*
306
+ - **`:idle` filter for dropped jobs.** A child dispatched but never run is re-kicked from
307
+ the `BranchMergeJob` poll: re-enqueue branch children with `state: :idle` in an
308
+ incomplete branch. Branch children are pre-inserted with `state: :idle` and never get
309
+ `started_at` set before execution, so `:idle` is the correct "never picked up" signal
310
+ (filtering by `started_at IS NULL` would be unreliable). *Child existence is not enough;
311
+ the merge guarantees every member is actually queued.* (This sentence is copied verbatim into the code comment.)
312
+ The re-kick is batch-capped. Safe because re-enqueue is idempotent: `executable?` is
313
+ `idle || running`, so `acquire_lock` raises `NotExecutableError` for a `completed` child
314
+ (its dispatch can never double-fire) and `ConcurrentExecutionError` for a `running` one.
315
+ Children in other states (running, mid-halt, stalled/failed under Option A) are excluded
316
+ by the `:idle` filter.
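
The re-kick rule above can be sketched in plain Ruby. This is a minimal illustration, not ChronoForge's code: `Child`, `rekick_idle`, and `acquire_lock!` are hypothetical stand-ins, and only the state names and the two error names are taken from the text. It shows why a duplicate enqueue is harmless: a job that starts against a non-idle child raises instead of running twice.

```ruby
# Sketch only: Child, rekick_idle and acquire_lock! are hypothetical stand-ins.
# Only the states and the two error names come from the design text.
class NotExecutableError < StandardError; end
class ConcurrentExecutionError < StandardError; end

Child = Struct.new(:key, :state)

# Select at most `batch_cap` never-started children for re-enqueue.
# Running, completed and failed children are excluded by the :idle filter.
def rekick_idle(children, batch_cap:)
  children.select { |child| child.state == :idle }.first(batch_cap)
end

# What a (possibly duplicate) job does when it starts.
# Mirrors "executable? is idle || running" from the text.
def acquire_lock!(child)
  raise NotExecutableError, child.key unless %i[idle running].include?(child.state)
  raise ConcurrentExecutionError, child.key if child.state == :running

  child.state = :running
end
```

The first job to reach an idle child flips it to `:running`; any later duplicate for the same child raises, so the batch-capped re-kick can err on the side of enqueueing.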
317
+
318
+ ### Recovery walkthrough — 300,000 children, crash at 250,000
319
+
320
+ A `branch :fulfillment` block's `spawn_each :orders` had committed 250k child rows + jobs
321
+ with `metadata.cursors["orders"]` at `{ pk: <250,000th PK>, n: 250_000 }`; it was mid-chunk
322
+ when the process died. The `branch$fulfillment` log is still `pending` (not sealed), so workflow retry replays
323
+ from the top. The block re-runs; `spawn_each` resumes `find_in_batches(start:)` from the cursor's
323
+ `pk` (the 250,000th PK), dispatching the remaining ~50k rows; the worst-case duplicate enqueue is the single in-flight
325
+ chunk. The 250k already dispatched keep running the whole time. When the source is
326
+ exhausted the block closes and the branch seals; `merge_branches`/automerge then polls to
327
+ completion. Recovery is bounded and idempotent — never a re-fan-out of 300k.
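
The walkthrough can be reproduced at small scale with an in-memory simulation. This is a sketch under stated assumptions: `DispatchSim` and `SimulatedCrash` are hypothetical names, a `Set` stands in for the unique index behind `insert_all`-ignore, and the sketch pessimistically re-enqueues every record a resumed run yields (including the inclusive boundary record).

```ruby
require "set"

class SimulatedCrash < StandardError; end

# Hypothetical stand-in for the streamed dispatch: `rows` plays the unique
# index on child keys, `enqueued` records every job push, duplicates included.
class DispatchSim
  attr_reader :rows, :enqueued, :cursor

  def initialize(pks, chunk_size)
    @pks = pks.sort
    @chunk_size = chunk_size
    @rows = Set.new
    @enqueued = []
    @cursor = nil # last PK of the last fully committed chunk
  end

  # Streams chunks from the cursor. The start is inclusive, so a resumed run
  # yields the boundary record again. `crash_in_chunk` aborts after that
  # chunk's rows and jobs are written but before its cursor is saved.
  def run(crash_in_chunk: nil)
    pending = @cursor ? @pks.select { |pk| pk >= @cursor } : @pks
    pending.each_slice(@chunk_size).with_index(1) do |chunk, index|
      chunk.each { |pk| @rows.add("orders_#{pk}") } # no-op for existing rows
      @enqueued.concat(chunk)
      raise SimulatedCrash if crash_in_chunk == index

      @cursor = chunk.last
    end
  end
end

sim = DispatchSim.new((1..20).to_a, 5)
begin
  sim.run(crash_in_chunk: 3) # dies mid-way through the third chunk
rescue SimulatedCrash
  # the process is gone; only rows, jobs and the saved cursor survive
end
sim.run # retry: resumes from the cursor, not from the top

duplicates = sim.enqueued.size - sim.enqueued.uniq.size
puts "rows=#{sim.rows.size} duplicate_enqueues=#{duplicates}"
```

In this model the duplicate enqueues are limited to the in-flight chunk plus the boundary record, and the row count matches the source size regardless of where the crash lands.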
328
+
329
+ ## Scale & performance (target: hundreds of thousands per branch)
330
+
331
+ | Operation | Frequency | Cost |
332
+ |---|---|---|
333
+ | `spawn_each` dispatch | once per branch, **streamed** | `⌈N/of⌉` `insert_all` + `perform_all_later`, each advancing the cursor — O(N) total, bounded chunks, **constant memory** |
334
+ | Child run | per child | one own-row state transition — **no shared-row contention** |
335
+ | Merge poll | per adaptive interval | lightweight `BranchMergeJob` running an **O(CAP)** capped count per branch; interval scales `min`↔`max` with pending — the heavy parent is *not* replayed per poll |
336
+ | Crash recovery | once | resumes dispatch from the cursor — re-touches **one chunk** |
337
+ | Replay cost | per resume | independent of how many children finished — no member list, no sibling scan, no counter |
338
+
339
+ What deliberately does **not** exist: an in-memory fork registry (streamed instead), a
340
+ single hot completion counter (adaptive capped-count poll instead), and a member-key blob
341
+ in metadata (just per-spawn cursors). The remaining O(N) work — N rows + N jobs — is
342
+ irreducible for N sub-workflows, done in bounded chunks by one parent job.
343
+
344
+ ### `perform_all_later` — verified (activejob 7.1.3.4)
345
+
346
+ - **Mixed job classes: supported, no same-class requirement.** `perform_all_later` groups
347
+ by `queue_adapter` (`enqueuing.rb:18`); the adapter's `enqueue_all` then sub-groups by
348
+ **class then queue** (e.g. core Sidekiq adapter does `group_by(&:class).group_by(&:queue_name)`
349
+ → one `push_bulk` per group, `sidekiq_adapter.rb:36`). So a `spawn_each` returning mixed
350
+ workflow types is fine — enqueue just batches per distinct (class, queue), never falling
351
+ back to per-job for being heterogeneous.
352
+ - **It bypasses ChronoForge's class-level `perform_later` override** (`__validate_enqueue!`)
353
+ and **ActiveJob enqueue callbacks** — it builds instances and hits the adapter directly.
354
+ So `spawn`/`spawn_each` must **build child job instances and validate them
355
+ (String key, no reserved kwargs) themselves**, then `ActiveJob.perform_all_later(jobs)`.
356
+ Execution is unaffected (the executor's logic is in instance `perform`). This mirrors the
357
+ existing sanctioned `.set(...)`-bypasses-the-override pattern.
358
+ - **Requires activejob ≥ 7.1.** The gemspec currently pins no version — add
359
+ `spec.add_dependency "activejob", ">= 7.1"` (or provide a `perform_later`-loop fallback).
360
+ - **Bulk enqueue is adapter-dependent.** Only adapters implementing `enqueue_all`
361
+ (Sidekiq in core; solid_queue/good_job ship their own) batch the enqueue; Test/Inline/
362
+ Async fall back to per-job `enqueue`. The `insert_all` of child rows is **always** bulk;
363
+ job enqueue batching is best-effort.
364
+
365
+ Other caveats to verify in the plan:
366
+ - `find_in_batches` `start:` semantics (inclusive boundary — the boundary record is re-yielded on
367
+ crash-resume; PK-keyed children dedup via `insert_all`-ignore) and the per-adapter
368
+ bind-param limit for the `of:` chunk size (notably SQLite's `SQLITE_MAX_VARIABLE_NUMBER`).
369
+ - The merge capped count must be index-only on `(parent_execution_log_id, state)`.
370
+
371
+ ## Poll-cadence constants (class-configurable defaults)
372
+
373
+ - `CAP` (capped-count limit) — default `5_000`. Bounds each poll's count cost; beyond it,
374
+ the pending count saturates at `CAP` and the delay clamps to `max_interval` (no signal lost).
375
+ - `factor` — maps pending → delay; default tuned so ~100 → ~10s, ~1k → ~1 min.
376
+ - `min_interval` / `max_interval` — clamp; defaults `5.seconds` / `5.minutes`
377
+ (per-`merge_branches` overridable; `automerge` uses the defaults).
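
A minimal sketch of how these constants could combine into a poll delay. The linear mapping and the value of `FACTOR` are assumptions made for illustration; the text only fixes `CAP` and the interval defaults, and the real curve may differ. `poll_delay` is a hypothetical helper name.

```ruby
CAP = 5_000          # capped-count limit (default from the text)
MIN_INTERVAL = 5     # seconds (5.seconds in the text)
MAX_INTERVAL = 300   # seconds (5.minutes in the text)
FACTOR = 0.1         # ASSUMED: seconds of delay per pending child

# `pending` is the capped count, so it never exceeds `cap`.
# A saturated count carries no more detail, so it maps straight to `max`.
def poll_delay(pending, cap: CAP, factor: FACTOR, min: MIN_INTERVAL, max: MAX_INTERVAL)
  return max if pending >= cap

  (pending * factor).clamp(min, max)
end
```

With this assumed factor, about 100 pending children give a delay of about 10 seconds, small counts are held at `MIN_INTERVAL`, and anything at or above `CAP` polls at `MAX_INTERVAL`.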
378
+
379
+ ## Naming & validation
380
+
381
+ - `STEP_NAME_DELIMITER` is `$` (executor.rb). Reserved.
382
+ - The `branch` name, each `spawn`/`spawn_each` name, and each `merge_branches` name pass
383
+ through `validate_step_name_segment!` (no `$`). The merge step name joins sorted branch
384
+ names with `,`; names containing `,` are rejected. (`name_{index}` uses `_`, which is
385
+ unreserved.)
386
+ - Child keys use `$` (`"#{parent.key}$#{branch}$#{spawn_name}"` / `…$#{spawn_name}_#{index}"`).
387
+ Keys are opaque (never parsed), so a `$` already in the parent key is harmless.
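
The naming rules above can be sketched as follows. `check_segment!`, `child_key`, `merge_step_name`, and `MERGE_NAME_SEPARATOR` are hypothetical names used for illustration; the real `validate_step_name_segment!` may raise a different error class. The key and step-name shapes follow the formats quoted in the list.

```ruby
STEP_NAME_DELIMITER = "$"   # reserved, per the text
MERGE_NAME_SEPARATOR = ","  # hypothetical constant name

# Reject a name segment that contains a reserved character.
def check_segment!(name, also_reject: [])
  segment = name.to_s
  ([STEP_NAME_DELIMITER] + also_reject).each do |reserved|
    if segment.include?(reserved)
      raise ArgumentError, "#{segment.inspect} contains reserved #{reserved.inspect}"
    end
  end
  segment
end

# "<parent.key>$<branch>$<spawn_name>", optionally suffixed with "_<index or id>".
# The parent key is not validated: keys are opaque and never parsed.
def child_key(parent_key, branch, spawn_name, suffix = nil)
  base = [parent_key, check_segment!(branch), check_segment!(spawn_name)]
         .join(STEP_NAME_DELIMITER)
  suffix.nil? ? base : "#{base}_#{suffix}"
end

# Sorted branch names joined with ",", so the step name is order-independent.
def merge_step_name(branch_names)
  names = branch_names.map { |n| check_segment!(n, also_reject: [MERGE_NAME_SEPARATOR]) }
  "merge#{STEP_NAME_DELIMITER}#{names.sort.join(MERGE_NAME_SEPARATOR)}"
end
```

Sorting before joining means `merge_branches` called with the same branches in a different order resolves to the same log entry on replay.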
388
+
389
+ ## Non-goals (v1) — with caveats
390
+
391
+ - **Sharded completion counter / instant wake.** v1 polls (adaptive, but still polls). If
392
+ sub-`min_interval` wake latency is ever needed, a
393
+ `fork_counters(branch_log_id, shard, completed)` table (K rows/branch) is the upgrade —
394
+ fully internal to the branch, **zero API change**.
395
+ - **Parallel dispatch.** One parent job streams the dispatch. At ~1M this is minutes and
396
+ crash-safe via the cursor; recursive dispatcher sub-jobs are a future throughput upgrade.
397
+ - **Result aggregation.** Children communicate via their own `context`, which is
398
+ **workflow-scoped** — a parent can't read a child's context today. `branch_log.spawned_workflows`
399
+ returns the child **records**. Aggregation, when added, needs an explicit cross-workflow
400
+ read API.
401
+ - **`merge_branches` timeout.** Blocks indefinitely (Option A); a `timeout:` can come later.
402
+ - **Dashboard nesting.** Parent/child tree + per-child recovery is a follow-up; the
403
+ `parent_execution_log_id` column makes the tree cheap to walk.
404
+
405
+ ## Testing strategy
406
+
407
+ Mirror the existing `ChaoticJob` style (`perform_later` + `perform_all_jobs`; assert on
408
+ workflow `state`, `execution_logs`).
409
+
410
+ - **Happy path:** a `branch` with a `spawn :a` + a `spawn_each :b`; assert child keys
411
+ (`…$a`, `…$b_0`, `…$b_1`, …), `parent_execution_log_id`, the branch seals,
412
+ `merge_branches` resumes, the workflow finishes.
413
+ - **Spawn outside branch raises** `NotInBranchError`.
414
+ - **Concurrency:** two branches dispatched before the merge both make progress before the
415
+ join; work between branch blocks and merge runs while children are in flight.
416
+ - **Eager dispatch:** children begin before `merge_branches` is reached.
417
+ - **Class from body:** a `spawn_each` returning mixed classes creates children with the
418
+ right `job_class` per item (and they bulk-enqueue together).
419
+ - **Determinism guard:** an AR relation with a conflicting `.order(...)` **raises**.
420
+ - **Crash mid-dispatch (cursor resume):** glitch after chunk *k*; assert
421
+ `metadata.cursors[name]` (`{ pk:, n: }` for AR; `{ n: }` for enumerable) persisted,
422
+ dispatch resumes from it (not 0), final child count correct, no duplicate rows, only the
423
+ in-flight chunk re-enqueued. (250k-of-300k.)
424
+ - **Dropped-job re-kick:** a sealed branch with a child whose job was lost (state `:idle`,
425
+ never started); assert the poll re-enqueues exactly that child and then resolves.
426
+ - **Poll job:** assert the parent runs only twice (kick-off + wake) regardless of poll
427
+ count; a manual retry of a parked parent re-spawns the poller.
428
+ - **Adaptive cadence:** assert the count is capped at `CAP` (a branch with ≫CAP incomplete
429
+ issues an O(CAP) count and picks `max_interval`); the delay shrinks toward `min_interval`
430
+ as pending drops.
431
+ - **Automerge:** an `automerge: true` branch blocks execution inline at the block's close
432
+ (not at workflow completion) — assert execution does not continue past the block until
433
+ children finish, and that the `merge$<name>` log exists and is `completed` before the
434
+ next step runs. No explicit `merge_branches` is needed.
435
+ - **Unmerged branch raises:** a branch opened with neither `merge_branches` nor
436
+ `automerge: true` raises `UnmergedBranchError` at the completion gate (unconditional —
437
+ fires even if children already finished), naming the branch.
438
+ - **Option A:** a child `permanently_fail`s → merge parks; recover via `retry_later` →
439
+ merge resolves; assert no progress while parked.
440
+ - **Idempotency / replay:** force replays; assert constant child count, no re-dispatch once
441
+ sealed, replay query count independent of completed-child count (`branch$<name>` preloads
442
+ when sealed; no per-child sub-logs).
443
+ - **Scale:** a branch of hundreds of thousands; assert `insert_all` issues `⌈N/of⌉` inserts
444
+ (not N), constant memory (streamed), contention-free child runs, and one capped probe per
445
+ branch per poll. (Job-enqueue batching is adapter-dependent — under the test adapter it
446
+ falls back to per-job enqueue, so don't assert bulk *enqueue* there.)
447
+ - **Empty branch / empty source:** seals immediately; the merge resolves at once.
448
+ - **Nesting:** a child opens its own branch; assert the tree completes bottom-up.
449
+
450
+ ## README notes (when shipping)
451
+
452
+ Surface these prominently in user-facing docs, not just here:
453
+ - **Every branch must be merged or `automerge: true`** — otherwise `UnmergedBranchError`.
454
+ - **The heavy parent is not replayed per poll** — a lightweight `BranchMergeJob` does the
455
+ waiting; the parent runs twice per merge.
456
+ - **AR source must be stable during a branch's dispatch window** — AR items are keyed by
457
+ primary key (`name_<id>`), so inserting rows mid-dispatch is safe but re-use of a PK
458
+ (soft-delete/re-insert) can confuse the cursor. Plain enumerables are keyed by sequential
459
+ index; inserting/removing items mid-dispatch (before a crash-replay seals the branch)
460
+ shifts indices. Once sealed, the block never re-enumerates.
461
+
462
+ ## Future work
463
+
464
+ - Sharded-counter table for instant (non-poll) wake.
465
+ - Parallel/recursive dispatcher sub-jobs for dispatch throughput beyond one parent job.
466
+ - Result aggregation via an explicit child-context read API.
467
+ - `merge_branches(..., timeout:)`.
468
+ - Dashboard parent/child tree + per-child recovery actions.