chrono_forge 0.9.1 → 0.10.0

Files changed (41)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +22 -0
  3. data/README.md +305 -44
  4. data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md +1748 -0
  5. data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md.tasks.json +17 -0
  6. data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md +930 -0
  7. data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md.tasks.json +54 -0
  8. data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md +241 -0
  9. data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md.tasks.json +12 -0
  10. data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md +1378 -0
  11. data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md.tasks.json +67 -0
  12. data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md +709 -0
  13. data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md.tasks.json +19 -0
  14. data/docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md +226 -0
  15. data/docs/superpowers/specs/2026-06-25-chrono_forge-dashboard-design.md +190 -0
  16. data/docs/superpowers/specs/2026-06-25-composite-retry-policies-design.md +228 -0
  17. data/docs/superpowers/specs/2026-06-25-reserved-kwarg-guard-design.md +169 -0
  18. data/docs/superpowers/specs/2026-06-25-spawn-merge-branches-design.md +468 -0
  19. data/docs/superpowers/specs/2026-06-26-dashboard-branch-view-design.md +142 -0
  20. data/docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md +265 -0
  21. data/lib/chrono_forge/branch_merge_job.rb +138 -0
  22. data/lib/chrono_forge/branch_probe.rb +26 -0
  23. data/lib/chrono_forge/cleanup.rb +6 -0
  24. data/lib/chrono_forge/execution_log.rb +6 -0
  25. data/lib/chrono_forge/executor/composite_retry_policy.rb +47 -0
  26. data/lib/chrono_forge/executor/methods/branch.rb +185 -0
  27. data/lib/chrono_forge/executor/methods/durably_execute.rb +21 -19
  28. data/lib/chrono_forge/executor/methods/durably_repeat.rb +118 -25
  29. data/lib/chrono_forge/executor/methods/merge_branches.rb +83 -0
  30. data/lib/chrono_forge/executor/methods/wait.rb +2 -4
  31. data/lib/chrono_forge/executor/methods/wait_until.rb +25 -25
  32. data/lib/chrono_forge/executor/methods/workflow_states.rb +16 -0
  33. data/lib/chrono_forge/executor/methods.rb +2 -0
  34. data/lib/chrono_forge/executor/retry_policy.rb +111 -0
  35. data/lib/chrono_forge/executor.rb +216 -28
  36. data/lib/chrono_forge/version.rb +1 -1
  37. data/lib/chrono_forge/workflow.rb +10 -1
  38. data/lib/generators/chrono_forge/migration_actions.rb +1 -0
  39. data/lib/generators/chrono_forge/templates/add_chrono_forge_parent_execution_log.rb +38 -0
  40. metadata +42 -5
  41. data/lib/chrono_forge/executor/retry_strategy.rb +0 -29
@@ -0,0 +1,142 @@
1
+ # Dashboard Branch View — Design
2
+
3
+ **Date:** 2026-06-26
4
+ **Status:** Design only. **BLOCKED on the branches core feature**
5
+ ([`2026-06-25-spawn-merge-branches-design.md`](2026-06-25-spawn-merge-branches-design.md),
6
+ itself still Draft). Nothing here can be built or tested until `parent_execution_log_id`,
7
+ `branch`/`spawn`/`merge_branches`, and the `spawned_workflows` association exist in the
8
+ core gem. This spec is written so it can be implemented the day branches ships.
9
+ **Scope:** additive views in the `chrono_forge-dashboard` engine. No core changes.
10
+
11
+ ## Problem
12
+
13
+ With fan-out, a parent can park at `merge_branches` (Option A) because **one** child
14
+ among tens or hundreds of thousands is `failed`/`stalled`. Today there is no way to see
15
+ that: the parent looks idle, and the blocking child is a needle in a haystack. The branch
16
+ view exists to answer, in one screen: *which branch is blocking this parent, how many
17
+ children are outstanding, and which specific children are failed/stalled — with a Retry on
18
+ each.* It is what makes Option A (park until recovered) operable in production.
19
+
20
+ ## What it consumes from the core (the contract)
21
+
22
+ This view reads only what the branches spec defines. If any of these change, this spec
23
+ changes with it.
24
+
25
+ - **Column** `chrono_forge_workflows.parent_execution_log_id` (FK → `execution_logs.id`,
26
+ nullable) with composite index `(parent_execution_log_id, state)`.
27
+ - **Associations** `Workflow#parent_execution_log` (→ `ExecutionLog`) and
28
+ `ExecutionLog#spawned_workflows` (→ `Workflow`, FK `parent_execution_log_id`).
29
+ - **Branch log:** an execution log with `step_name` `"branch$<name>"`,
30
+ `state` `pending` (dispatching) | `completed` (sealed), and `metadata`
31
+ `{ "automerge" => bool, "merged" => bool, "cursors" => { "<spawn>" => { "pk", "n" } } }`.
32
+ - **Merge log:** `"merge$<names>"`, `pending` while polling → `completed`.
33
+ - A child is a `Workflow`; its parent branch log is `child.parent_execution_log`; the
34
+ parent workflow is `branch_log.workflow`.
35
+
36
+ The reusable `StepNameParser` (already in the engine) gains `branch` / `merge` kinds.
37
+
38
+ ## Design
39
+
40
+ ### 1. Branches panel on the parent's detail page
41
+
42
+ A new section on `workflows#show`, rendered only when the workflow has any `branch$%`
43
+ logs. One row per branch (`BranchPresenter`), showing **health**:
44
+
45
+ | Field | Source | Notes |
46
+ |---|---|---|
47
+ | name | `StepNameParser.parse(log.step_name).name` | |
48
+ | status | `log.state` | `completed` → **sealed**; `pending` → **dispatching** (still spawning) |
49
+ | join | `metadata.automerge` / `metadata.merged` | "automerge", "merged", or "unmerged" |
50
+ | dispatched | `sum(metadata.cursors[*].n)` + explicit `spawn` count | cheap (from metadata), avoids counting rows |
51
+ | pending | `spawned_workflows.where.not(state: :completed).limit(CAP).count` | **capped, index-only** (O(CAP)); shows `"5000+"` past CAP |
52
+ | blocked | `spawned_workflows.where(state: [:failed, :stalled]).limit(CAP).count` | the actionable number; rendered in rose when > 0 |
53
+
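The `dispatched` and capped-count fields in the table above can be sketched in plain Ruby. This is a minimal illustration, not the engine's `BranchPresenter`: `capped_label` and `dispatched_from_cursors` are hypothetical helper names, the default cap of 5,000 is assumed from the `"5000+"` example, and the metadata hash is hand-written sample data. The explicit `spawn` count is left out.

```ruby
# Hypothetical helpers; the names are illustrative and not part of chrono_forge.

# A capped count never exceeds the cap, so a value that reaches the cap is
# rendered as "CAP+". The default cap of 5_000 is an assumption.
def capped_label(capped_count, cap = 5_000)
  capped_count >= cap ? "#{cap}+" : capped_count.to_s
end

# Sum the per-spawn "n" counters stored under metadata["cursors"].
def dispatched_from_cursors(metadata)
  (metadata["cursors"] || {}).values.sum { |cursor| cursor["n"].to_i }
end

# Hand-written sample shaped like the branch-log metadata described above.
sample_metadata = {
  "automerge" => false,
  "merged" => false,
  "cursors" => {
    "invoices" => { "pk" => 1200, "n" => 1200 },
    "reminders" => { "pk" => 340, "n" => 300 }
  }
}

puts dispatched_from_cursors(sample_metadata)
puts capped_label(42)
puts capped_label(5_000)
```

One consequence of counting with `limit(CAP)` is that a branch with exactly `CAP` matching rows is indistinguishable from one with more, so in this sketch both render as `"CAP+"`.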
54
+ Each branch row links to its **children view** (below) and, when `blocked > 0`, a direct
55
+ "View blocked" link pre-filtered to failed/stalled.
56
+
57
+ A parent parked on a merge also surfaces its `merge$<names>` log(s) here ("merging
58
+ invoicing — pending"), so the park is legible.
59
+
60
+ ### 2. Branch children view (drill-down)
61
+
62
+ A new route + controller, because a branch can hold **hundreds of thousands** of children
63
+ — they are never all rendered.
64
+
65
+ - Route: `GET /workflows/:workflow_id/branches/:branch_log_id` →
66
+ `BranchChildrenController#show` (scoped to the branch log; verifies it belongs to the
67
+ workflow).
68
+ - **Reuses `WorkflowsQuery`** over `branch_log.spawned_workflows` (same state/key filters,
69
+ pagination). **Default filter: `failed` + `stalled` first** — the triage default, so the
70
+ blockers are the landing view rather than page 1 of 500k.
71
+ - Reuses the existing `_workflow_row` partial (children are workflows) plus a per-row
72
+ **Retry** (and the child's own key links to its detail).
73
+ - A capped state-count strip at the top (completed/running/idle/failed/stalled), each an
74
+ O(CAP) index-only count rendered as `"N"` or `"CAP+"`.
75
+
76
+ ### 3. Per-child recovery
77
+
78
+ Children are workflows, so recovery reuses the existing `ActionsController`:
79
+ - Per-child **Retry** (`workflow.retry_later`) in each row and on the child detail page.
80
+ - A **"Retry all blocked in this branch"** bulk action: iterate
81
+ `branch_log.spawned_workflows.where(state: [:failed, :stalled]).find_each(&:retry_later)`
82
+ (a scoped sibling of the existing bulk-retry). After recovery the parent's merge poll
83
+ resolves on its own (Option A) — the view does not touch the parent.
84
+
85
+ ### 4. Child → parent linkage
86
+
87
+ On `workflows#show`, when `@workflow.parent_execution_log_id` is present, render a
88
+ **breadcrumb**: `parent key › branch <name> › this child`, linking to the parent and the
89
+ branch children view. Cheap: one `parent_execution_log` + its `workflow`.
90
+
91
+ ### 5. Tree view (nested branches)
92
+
93
+ Branches nest (a child may open its own branches). The parent panel shows **one level**
94
+ (this workflow's branches + per-branch child summary); you navigate down by opening a
95
+ child (whose own detail shows its branches) rather than rendering an unbounded tree on one
96
+ page. The breadcrumb provides the up-path. This keeps every page O(page_size), never
97
+ O(tree).
98
+
99
+ ## Components
100
+
101
+ - `app/presenters/.../branch_presenter.rb` — one branch log → health struct (capped
102
+ counts, dispatched-from-cursor, sealed/merged flags).
103
+ - `app/presenters/.../branches_presenter.rb` — a workflow's `branch$%` + `merge$%` logs.
104
+ - `app/controllers/.../branch_children_controller.rb` — `#show`, scoped children list.
105
+ - `app/queries/.../workflows_query.rb` — extend to accept a base scope (so it can run over
106
+ `branch_log.spawned_workflows`, not just `Workflow.all`).
107
+ - `ActionsController#bulk_retry_branch` — scoped bulk retry.
108
+ - Views: `_branches.html.erb` (panel on show), `branch_children/show.html.erb`,
109
+ `_parent_breadcrumb.html.erb`; `StepNameParser` branch/merge kinds.
110
+ - Routes: nested `branches/:branch_log_id` under `workflows`; a member `bulk_retry` on it.
111
+
112
+ ## Scale guardrails (non-negotiable)
113
+
114
+ - **Never** `group(:state).count` an unbounded child set on a page load. All counts are
115
+ **capped** (`limit(CAP)`) and index-only on `(parent_execution_log_id, state)`, shown as
116
+ `"CAP+"` past the cap — mirroring the merge probe.
117
+ - **Never** render more than one page of children. Default to the blocked subset.
118
+ - "Dispatched" total comes from `metadata.cursors` (`n`), not a row count.
119
+ - The branches panel issues at most ~2 capped probes per branch (pending, blocked) — bounded
120
+ regardless of child count.
121
+
122
+ ## Testing (once branches exists)
123
+
124
+ Seed parent + `branch$<name>` logs + child workflow rows with `parent_execution_log_id`
125
+ (no need to run real fan-out):
126
+ - branches panel: sealed vs dispatching; automerge/merged/unmerged; pending + blocked
127
+ capped counts (incl. a `>CAP` case showing `"CAP+"`); rose styling when blocked > 0.
128
+ - children view: default filter shows only failed/stalled; state filter + pagination work
129
+ over the scoped relation; per-child Retry calls `retry_later`.
130
+ - scoped bulk retry hits only that branch's failed/stalled children.
131
+ - breadcrumb: a child renders a link to its parent + branch; a non-child renders none.
132
+ - merge log surfaced when the parent is parked.
133
+
134
+ ## Open questions (confirm on review)
135
+
136
+ 1. **Counts beyond CAP** — show `"5000+"` (capped) everywhere, or pay an exact `COUNT` for
137
+ the *blocked* number only (usually small) while capping pending? (Leaning: exact for
138
+ blocked, capped for pending.)
139
+ 2. **Children view default** — land on failed/stalled (triage), or all-with-failed-first?
140
+ (Leaning: failed/stalled, with a clear "show all" toggle.)
141
+ 3. **Tree depth** — one level per page + breadcrumb (this spec), or a shallow expandable
142
+ tree for small fan-outs? (Leaning: one level; revisit if small-N trees feel clunky.)
@@ -0,0 +1,265 @@
1
+ # ChronoForge — deferral continuation race & catch-up surge
2
+
3
+ **Date:** 2026-06-26
4
+ **Gem:** `chrono_forge` 0.9.1
5
+ **Status:** design approved, ready for implementation plan
6
+
7
+ ## Problem
8
+
9
+ Two related findings in how ChronoForge's deferral primitives (`wait`, `wait_until`,
10
+ `durably_execute` retry, `durably_repeat`, workflow-level retry) schedule their
11
+ continuation jobs. Both are functionally benign in 0.9.1 (no lost work, no double
12
+ execution) but generate avoidable job/lock churn and log noise, and they interact.
13
+
14
+ ### Issue 1 — continuation/lock-release race (`ConcurrentExecutionError`)
15
+
16
+ Every deferral primitive enqueues its continuation **inline** and then halts:
17
+
18
+ ```ruby
19
+ self.class.set(wait: delay).perform_later(@workflow.key) # (1) enqueue continuation
20
+ halt_execution! # (2) raise HaltExecutionFlow
21
+ ```
22
+
23
+ The executor releases the lock in `ensure`, **after** the body runs
24
+ (`executor.rb:168-172`). So within one job run the order is: **enqueue continuation →
25
+ halt → (ensure) release lock.** The continuation is published while the current job
26
+ still holds the lock.
27
+
28
+ When the continuation is **immediately runnable** (`delay == 0`), SolidQueue puts it
29
+ straight in `ready_executions`. With multiple workers, a free worker can claim and
30
+ start it in the window between (1) and the `ensure` release. That second job calls
31
+ `acquire_lock`, finds `locked_at > max_duration.ago` (still freshly held by the first
32
+ job), and raises `ConcurrentExecutionError` at lock acquisition (failing
33
+ `execution_log.step_name` is `nil` — before any step).
34
+
35
+ `delay == 0` arises when:
36
+ - `wait` targets computed against wall-clock times already in the past on replay, and
37
+ - **every fast-forwarded tick in Issue 2** (`delay = max(next − now, 0) = 0`).
38
+
39
+ Benign today (loser is rescued, winner proceeds, continuation replays idempotently),
40
+ but costs wasted job executions, redundant lock attempts, and log noise.
41
+
42
+ ### Issue 2 — catch-up is O(missed intervals)
43
+
44
+ When a `durably_repeat` workflow resumes far behind schedule, each missed tick is
45
+ handled by `execute_repetition_now`. For an expired tick it correctly **skips the
46
+ periodic method**, but then advances by exactly **one interval** and enqueues a **new
47
+ job** (`durably_repeat.rb:200-212`, `:271-293`):
48
+
49
+ ```ruby
50
+ if Time.current > repetition_log.metadata["timeout_at"]
51
+ repetition_log.update!(state: :failed, error_class: "TimeoutError")
52
+ schedule_next_execution_after_completion(...) # advance ONE interval + enqueue a job
53
+ return # method NOT run (work correctly skipped)
54
+ end
55
+ ```
56
+
57
+ So expiry is a **work skip, not an iteration skip**. Walking a workflow from a far-past
58
+ `start_at` up to `now` churns through **one `delay == 0` job per missed interval** — each
59
+ job marks one tick timed-out, schedules the next, and halts. Resuming ~14 dormant
60
+ daily/weekly schedulers generated ~6,000 back-to-back `delay == 0` jobs. Every one of
61
+ those is the maximal trigger for Issue 1.
62
+
63
+ Worst case: a workflow resuming from genesis (no prior coordination/repetition logs)
64
+ with an ancient `start_at`.
65
+
66
+ ## Enqueue sites (complete inventory)
67
+
68
+ All 8 continuation enqueues are `.set(wait:).perform_later`; `continue_if` halts with no
69
+ continuation (it waits for an external trigger — correctly needs no fix).
70
+
71
+ | # | Site | kwargs passed | delay |
72
+ |---|------|---------------|-------|
73
+ | 1 | `executor.rb:163` workflow retry | `attempt:, retry_counts:` | backoff |
74
+ | 2 | `wait.rb:107` reschedule | — | duration |
75
+ | 3 | `wait_until.rb:135` cond-error retry | — | backoff |
76
+ | 4 | `wait_until.rb:181` poll | `wait_condition:` | check_interval |
77
+ | 5 | `durably_execute.rb:112` retry | — | backoff |
78
+ | 6 | `durably_repeat.rb:193` schedule-later | — | delay |
79
+ | 7 | `durably_repeat.rb:235` repetition retry | — | backoff |
80
+ | 8 | `durably_repeat.rb:288` schedule-next | — | delay (=0 in surge) |
81
+
82
+ ## Fix — Section 1: deferred continuation flush
83
+
84
+ Primitives stop calling `perform_later` inline. They **record** the intended
85
+ continuation on the instance; the executor flushes it in `ensure`, **after**
86
+ `release_lock`. The continuation becomes claimable only once the lock row reads
87
+ released, so no second worker can lose the acquire race. This is the report's
88
+ **option 1** (fully closes the window), not option 3 (epsilon delay heuristic).
89
+
90
+ **Single slot suffices.** Every primitive enqueues at most one continuation and then
91
+ either raises `HaltExecutionFlow` (sites 2–8) or falls through `rescue => e` into
92
+ `ensure` (site 1). No path schedules two.
93
+
94
+ ```ruby
95
+ # executor.rb — new private helper
96
+ def enqueue_continuation(wait:, **kwargs)
97
+ @continuation = {wait: wait, kwargs: kwargs}
98
+ end
99
+ ```
100
+
101
+ Each of the 8 sites changes from:
102
+
103
+ ```ruby
104
+ self.class.set(wait: delay).perform_later(@workflow.key) # or with kwargs
105
+ halt_execution!
106
+ ```
107
+
108
+ to:
109
+
110
+ ```ruby
111
+ enqueue_continuation(wait: delay) # kwargs preserved per-site
112
+ halt_execution!
113
+ ```
114
+
115
+ Flush in `ensure` (`executor.rb:168`), strictly ordered after release:
116
+
117
+ ```ruby
118
+ ensure
119
+ if lock_acquired
120
+ context.save!
121
+ self.class::LockStrategy.release_lock(job_id, workflow)
122
+ flush_continuation! # NEW — only now is the next job claimable
123
+ end
124
+ end
125
+
126
+ def flush_continuation!
127
+ return unless @continuation
128
+ self.class.set(wait: @continuation[:wait]).perform_later(@workflow.key, **@continuation[:kwargs])
129
+ end
130
+ ```
131
+
132
+ **Ordering guarantee:** `save! → release_lock → flush`. The continuation is published
133
+ only after the lock row is updated to released, so even a `delay == 0` continuation
134
+ finds the lock free.
135
+
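The ordering guarantee can be checked with a toy model that needs neither Rails nor a queue backend. `FakeLock`, `FakeQueue` and `ToyExecutor` are hypothetical stand-ins written for this sketch, not chrono_forge classes. The queue records whether the lock was still held at the moment the continuation became claimable.

```ruby
# Toy model of record-then-flush. All classes here are hypothetical stand-ins.
class FakeLock
  attr_reader :held

  def initialize
    @held = false
  end

  def acquire
    @held = true
  end

  def release
    @held = false
  end
end

class FakeQueue
  attr_reader :published

  def initialize(lock)
    @lock = lock
    @published = []
  end

  # Remember the lock state at the instant the job becomes claimable.
  def publish(wait:)
    @published << { wait: wait, lock_held_at_publish: @lock.held }
  end
end

class ToyExecutor
  def initialize(lock, queue)
    @lock = lock
    @queue = queue
    @continuation = nil
  end

  def perform
    @lock.acquire
    @continuation = { wait: 0 } # a primitive records its intent, then halts
  ensure
    @lock.release
    @queue.publish(**@continuation) if @continuation # flush only after release
  end
end

lock = FakeLock.new
queue = FakeQueue.new(lock)
ToyExecutor.new(lock, queue).perform
p queue.published
```

Publishing inside `perform` before the `ensure` block would record `lock_held_at_publish: true` instead, which is the window Issue 1 describes.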
136
+ **Edge cases:**
137
+ - If `release_lock` raises `LongRunningConcurrentExecutionError` (this job overran
138
+ `max_duration` and lost the lock), we do **not** flush — correct, another job already
139
+ owns the continuation.
140
+ - Site 1 (workflow retry) isn't a halt, but routing it through the same slot keeps all
141
+ enqueues post-release and is harmless (backoff is normally > 0 anyway).
142
+ - `@continuation` is per-job-execution instance state; nil unless a primitive set it.
143
+
144
+ ## Fix — Section 2: closed-form fast-forward of the expired prefix
145
+
146
+ In `durably_repeat` (`durably_repeat.rb:143-151`), after the naive `next_execution_at`
147
+ is computed and before `execute_or_schedule_repetition`, jump past the expired prefix in
148
+ closed form instead of walking one job per tick.
149
+
150
+ **Skip rule (from the code):** a tick `t` is expired iff `Time.current > t + timeout`,
151
+ i.e. `t < now − timeout`. Find the smallest tick on the grid `next_execution_at + n·every`
152
+ (n ≥ 0) that is **not** expired (`t ≥ now − timeout`):
153
+
154
+ ```ruby
155
+ def fast_forward_expired_prefix(next_execution_at, every, timeout)
156
+ cutoff = Time.current - timeout
157
+ return next_execution_at if next_execution_at >= cutoff # nothing expired
158
+
159
+ gap = cutoff - next_execution_at
160
+ n = (gap / every.to_f).ceil # n ≥ 1 here
161
+ Rails.logger.info {
162
+ "ChronoForge:#{self.class}(#{@workflow.key}) durably_repeat fast-forwarded #{n} expired tick(s)"
163
+ }
164
+ next_execution_at + (n * every)
165
+ end
166
+ ```
167
+
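A worked example may make the skip rule concrete. The sketch below is a pure-function variant of the method above, under two assumptions made only for illustration: `now` is passed in rather than read from `Time.current`, and durations are plain seconds so that it runs without ActiveSupport. `first_non_expired_tick` is a hypothetical name.

```ruby
# Pure-function variant for illustration: returns the first non-expired grid
# tick together with the number of expired ticks that were skipped.
def first_non_expired_tick(next_execution_at, every, timeout, now)
  cutoff = now - timeout
  return [next_execution_at, 0] if next_execution_at >= cutoff # nothing expired

  skipped = ((cutoff - next_execution_at) / every.to_f).ceil
  [next_execution_at + (skipped * every), skipped]
end

day = 86_400
hour = 3_600
start = Time.utc(2026, 1, 1)       # daily schedule, first tick at midnight
now = Time.utc(2026, 1, 11, 0, 30) # resumed 10 days and 30 minutes later

tick, skipped = first_non_expired_tick(start, day, hour, now)
puts "skipped #{skipped} tick(s), next tick at #{tick}"
```

With a one-hour timeout the cutoff is 23:30 on January 10, so the ten ticks from January 1 through January 10 are expired and the jump lands on the January 11 midnight tick, which is still in-window and therefore runs its work.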
168
+ **Why anchor on `next_execution_at`, not `start_at`.** `next_execution_at` is always
169
+ already on the canonical grid `anchor + k·every`:
170
+
171
+ 1. `start_at` given, no `last_execution_at` → `next = start_at`. On-grid (k=0).
172
+ 2. No `start_at`, no `last_execution_at` → `next = created_at + every`. On-grid (k=0).
173
+ 3. `last_execution_at` present → `next = last_execution_at + every`. On-grid because
174
+ `last_execution_at` stores the **scheduled** tick time, not wall-clock:
175
+ `schedule_next_execution_after_completion` writes `current_execution_time.iso8601`
176
+ (`durably_repeat.rb:275`), where `current_execution_time` is the scheduled tick, not
177
+ `Time.current`. By induction, lateness never enters the recurrence.
178
+
179
+ So jumping by integer multiples of `every` from `next_execution_at` stays exactly on the
180
+ grid — **no drift**. Anchoring the ceil on `start_at` (as the report's formula literally
181
+ writes) would compute against a different anchor than the grid the workflow is actually
182
+ on (branches 2 and 3) and could land between real ticks.
183
+
184
+ **Boundary correctness — only the expired prefix is skipped.** The jump lands on the
185
+ first tick with `t ≥ now − timeout`, which is either:
186
+ - **in-window** (`now − timeout ≤ t ≤ now`): `execute_or_schedule_repetition` sees
187
+ `t ≤ now` → runs `execute_repetition_now`, which re-checks `now > timeout_at` (now
188
+ false) → **executes the work**. Legitimate catch-up preserved.
189
+ - **future** (`t > now`): → `schedule_repetition_for_later`. Normal.
190
+
191
+ If `timeout > every` there can be several in-window ticks; those still walk one job each
192
+ by design (real work, not bookkeeping). Only the expired prefix collapses to O(1).
193
+
194
+ **Coordination-log bookkeeping.** As part of the fast-forward, set the coordination
195
+ log's `last_execution_at = (first_valid − every).iso8601` (same format the reader
196
+ `Time.parse` expects). A replay then recomputes `naive_next = last_execution_at + every
197
+ = first_valid` — stable and idempotent — and the expired prefix produces **one metadata
198
+ update** instead of N `failed/TimeoutError` repetition rows and N jobs.
199
+
200
+ **One summary row for the skipped prefix (decided).** Instead of N `failed/TimeoutError`
201
+ repetition rows, the fast-forward writes a **single** durable `ExecutionLog` covering the
202
+ whole skipped prefix, so the skip stays dashboard-visible and queryable:
203
+
204
+ - **step_name:** `durably_repeat$<name>$<last_skipped_tick.to_i>`, where
205
+ `last_skipped_tick = first_valid − every`. This is the last expired grid tick, so it is
206
+ unique and **never collides** with the repetition row for `first_valid` (the first
207
+ in-window/future tick, which `execute_or_schedule_repetition` still creates and runs).
208
+ - **state:** `failed` (the enum has only `pending/completed/failed` — no migration),
209
+ **error_class:** `"TimeoutError"`, **error_message:** `"Fast-forwarded N expired tick(s)"`.
210
+ - **metadata:** `{ fast_forwarded: N, from: <first_expired.iso8601>,
211
+ to: <last_skipped.iso8601>, scheduled_for: <last_skipped>, timeout_at: <last_skipped + timeout>,
212
+ parent_id: <coordination_log.id> }` — mirrors the existing repetition-log metadata shape
213
+ plus the `fast_forwarded`/`from`/`to` summary fields.
214
+
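The summary row's fields can be assembled as follows. This is a sketch under stated assumptions: `fast_forward_summary_attrs` is a hypothetical builder that is not part of chrono_forge, durations are plain seconds, and `scheduled_for` / `timeout_at` are kept as `Time` objects because the spec does not state their serialized form.

```ruby
require "time" # for Time#iso8601

# Hypothetical builder; attribute names follow the bullets above.
def fast_forward_summary_attrs(name:, first_expired:, first_valid:, every:, timeout:, parent_id:)
  last_skipped = first_valid - every
  # Expired ticks run from first_expired through last_skipped, inclusive.
  skipped = ((last_skipped - first_expired) / every).round + 1
  {
    step_name: "durably_repeat$#{name}$#{last_skipped.to_i}",
    state: :failed,
    error_class: "TimeoutError",
    error_message: "Fast-forwarded #{skipped} expired tick(s)",
    metadata: {
      fast_forwarded: skipped,
      from: first_expired.iso8601,
      to: last_skipped.iso8601,
      scheduled_for: last_skipped,
      timeout_at: last_skipped + timeout,
      parent_id: parent_id
    }
  }
end

attrs = fast_forward_summary_attrs(
  name: "reminders",
  first_expired: Time.utc(2026, 1, 1),
  first_valid: Time.utc(2026, 1, 11),
  every: 86_400,
  timeout: 3_600,
  parent_id: 1
)
puts attrs[:step_name]
puts attrs[:error_message]
```

Because the step name is keyed on the last skipped tick, it differs from the step name of the `first_valid` repetition row by exactly one interval.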
215
+ Created via `find_or_create_execution_log!`, so it is idempotent on replay (and the
216
+ 3-segment step name is correctly excluded from `completed_step_cache`, matching ordinary
217
+ repetition logs). A `Rails.logger.info { "...fast-forwarded N expired tick(s)" }` line is
218
+ also emitted for ops. This is a deliberate behavior change from 0.9.1's one-row-per-tick.
219
+
220
+ The existing dashboard step-name parser already handles 3-segment
221
+ `durably_repeat$<name>$<ts>` repetition steps, so **no dashboard change is required** for
222
+ this plan; the summary row renders like any other repetition log.
223
+
224
+ **Observable-behavior change → existing tests updated.** Two tests assert the old
225
+ per-tick tombstones via `timeout: -1.second` and must be updated to the new behavior:
226
+ - `durably_repeat_test.rb:116` `test_durably_repeat_with_timeout` — asserts
227
+ `timeout_logs.size > 0` (filtering `error_message == "Execution timed out"`); flip to
228
+ asserting **no** `Execution timed out` rows and exactly **one** `fast_forwarded` summary
229
+ row for the expired prefix.
230
+ - `durably_repeat_test.rb:345` `test_durably_repeat_coordination_log_updated_on_timeout`
231
+ — its `last_execution_at`-advances assertion still holds; its `timeout_logs.size > 0`
232
+ assertion flips to asserting the single `fast_forwarded` summary row instead.
233
+
234
+ The `wait_until` negative-timeout test (`error_log_correlation_test.rb:23`) is a
235
+ different primitive and is unaffected. Catch-up tests using the default positive timeout
236
+ (`test_durably_repeat_with_past_start_at`, etc.) are unaffected because nothing is
237
+ expired under a 1-hour window.
238
+
239
+ **Idempotency / replay safety.** The skipped ticks never get repetition logs, but
240
+ they're never recomputed either (the jump advances `last_execution_at` past them), and
241
+ all execution-log lookups are by exact `step_name` — nothing scans for the missing rows.
242
+ Prior completed/failed ticks from before dormancy are untouched.
243
+
244
+ ## Interaction
245
+
246
+ The two share a root: continuations are published as immediately-claimable, same-key
247
+ jobs while/just-before the lock is released. The catch-up surge (Issue 2) is the
248
+ maximal trigger for the race (Issue 1). Section 1 closes the race structurally;
249
+ Section 2 removes the burst of `delay == 0` continuations that most reliably arms it.
250
+ Both together remove the class of problem.
251
+
252
+ ## Testing
253
+
254
+ - **Issue 1:** unit-test that each of the 8 enqueue sites sets `@continuation` and does
255
+ **not** call `perform_later` inline; that the executor flushes after `release_lock`
256
+ (assert ordering — e.g. the enqueue observes the lock row released); that
257
+ `LongRunningConcurrentExecutionError` from `release_lock` suppresses the flush; that
258
+ per-site kwargs (`attempt`/`retry_counts`, `wait_condition`) are preserved.
259
+ - **Issue 2:** unit-test `fast_forward_expired_prefix` returns `next_execution_at`
260
+ unchanged when nothing is expired; lands exactly on the first non-expired grid tick;
261
+ is on-grid across all three anchor branches; that an in-window first tick executes its
262
+ work while the expired prefix creates no repetition rows; that `last_execution_at` is
263
+ advanced so a replay is stable. Integration: resume a far-past daily schedule and
264
+ assert O(1) jobs/log rows for the expired prefix instead of O(missed intervals).
265
+ ```
@@ -0,0 +1,138 @@
1
+ # frozen_string_literal: true
2
+
3
+ module ChronoForge
4
+ # Lightweight poller that joins one or more branches. NOT a workflow — it holds
5
+ # no lock, does no replay, and carries no context. It exists so the heavy parent
6
+ # workflow is replayed only twice per merge (kick off + completion wake).
7
+ class BranchMergeJob < ActiveJob::Base
8
+ # The poller is the parent's only wake mechanism, so survive TRANSIENT
9
+ # infrastructure errors (DB connection/timeout/deadlock) with backoff. Any
10
+ # other error — a programming bug, a bad guard — is NOT retried: it propagates
11
+ # to the backend's failed-job queue where it's visible, rather than being
12
+ # silently retried-then-discarded (which would orphan the parent in :idle).
13
+ retry_on ActiveRecord::ConnectionNotEstablished,
14
+ ActiveRecord::ConnectionTimeoutError,
15
+ ActiveRecord::Deadlocked,
16
+ ActiveRecord::LockWaitTimeout,
17
+ wait: :polynomially_longer, attempts: 25
18
+
19
+ CAP = 5_000 # cap the pending count; beyond it we just pick max_interval
20
+ FACTOR = 0.06 # seconds of delay per pending child
21
+ REKICK_AFTER = 5.minutes
22
+ REKICK_BATCH = 200 # bound per-run rekicks; later polls handle the rest
23
+
24
+ def perform(parent_key, parent_job_class, branch_log_ids, min_interval, max_interval, token = nil)
25
+ raise ArgumentError, "branch_log_ids must not be empty" if branch_log_ids.empty?
26
+
27
+ # Fencing: every merge_branches pass mints a fresh token and writes it onto
28
+ # the branch logs, so a poller from a superseded chain (parent replay /
29
+ # re-enqueue) holds a stale token. It stops quietly — no poll, no wake, no
30
+ # reschedule — leaving only the newest chain to drive the merge. (A nil token
31
+ # is a pre-upgrade job enqueued before fencing existed; it runs unfenced.)
32
+ return if superseded?(branch_log_ids, token)
33
+
34
+ # Per-branch probe (kept as maps so we can persist each branch's own state,
35
+ # not just the merge aggregate). Same query count as a plain sum/all?.
36
+ pending_by_branch = branch_log_ids.to_h { |id| [id, BranchProbe.incomplete(id).limit(CAP).count] }
37
+ sealed_by_branch = branch_log_ids.to_h { |id| [id, BranchProbe.sealed?(id)] }
38
+ pending = pending_by_branch.values.sum
39
+ sealed = sealed_by_branch.values.all?
40
+
41
+ if sealed && pending.zero?
42
+ record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at: nil)
43
+ parent_job_class.constantize.perform_later(parent_key)
44
+ return
45
+ end
46
+
47
+ rekick_dropped_jobs(branch_log_ids)
48
+
49
+ delay = reschedule_delay(pending, min_interval, max_interval)
50
+ record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at: delay.seconds.from_now)
51
+ self.class.set(wait: delay.seconds)
52
+ .perform_later(parent_key, parent_job_class, branch_log_ids, min_interval, max_interval, token)
53
+ end
54
+
55
+ private
56
+
57
+ # Adaptive poll cadence: scale the wait with the number of pending children,
58
+ # clamped to [min_interval, max_interval]. min_interval <= max_interval is
59
+ # enforced up front in merge_branches, so the clamp can't raise here.
60
+ def reschedule_delay(pending, min_interval, max_interval)
61
+ (pending * FACTOR).clamp(min_interval, max_interval)
62
+ end
63
+
64
+ # A poller is superseded when its token no longer matches what's stored on the
65
+ # branch logs (a newer merge_branches pass rotated it). A plain read is enough
66
+ # for the early-out; the persisting write in record_poll! re-checks the token
67
+ # under a row lock so it can never clobber the newer chain.
68
+ def superseded?(branch_log_ids, token)
69
+ logs = ExecutionLog.where(id: branch_log_ids).to_a
70
+ logs.empty? || logs.any? { |log| log.metadata&.dig("poll_token") != token }
71
+ end
72
+
73
+ # ActiveJob exposes no portable API to enumerate enqueued/scheduled jobs, so a
74
+ # poller in the backend's scheduled set is invisible to a backend-agnostic
75
+ # dashboard. We make the durable log the source of truth instead: each poll
76
+ # stamps its observable state onto every target branch log's metadata, so the
77
+ # dashboard can list in-flight merges (and a next_poll_at long in the past with
78
+ # work still pending is the signal that the poller was dropped). This is purely
79
+ # observational — replay and correctness never read it. It writes a "poll"
80
+ # sub-key, leaving spawn_each's "cursors" metadata untouched.
81
+ def record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at:)
82
+ now = Time.current
83
+ ExecutionLog.where(id: pending_by_branch.keys).find_each do |log|
84
+ # Lock the row so this read-modify-write can't clobber a concurrent token
85
+ # rotation (merge_branches) or another poller's metadata write — both touch
86
+ # the same JSON column. Re-check the token under the lock and skip if we've
87
+ # been superseded mid-run, so a stale poller never overwrites the fence.
88
+ log.with_lock do
89
+ meta = log.metadata || {}
90
+ next unless meta["poll_token"] == token
91
+ meta["poll"] = {
92
+ "last_polled_at" => now.iso8601,
93
+ "next_poll_at" => next_poll_at&.iso8601,
94
+ "pending" => pending_by_branch[log.id],
95
+ "sealed" => sealed_by_branch[log.id],
96
+ "polls" => meta.dig("poll", "polls").to_i + 1
97
+ }
98
+ log.update!(metadata: meta)
99
+ end
100
+ end
101
+ end
102
+
+    # A child that was dispatched but never picked up (its job was dropped by the
+    # backend) sits :idle with started_at nil. setup_workflow! stamps started_at
+    # on a child's first execution, so a nil started_at precisely means "never
+    # ran" — that's what we rekick on. It correctly excludes a child that ran and
+    # is now parked on a wait/wait_until (also :idle, but started_at is set):
+    # rekicking that would re-evaluate the wait condition prematurely and pile up
+    # duplicate scheduled jobs. We also require the row to be stale past
+    # REKICK_AFTER (a freshly dispatched child just hasn't been grabbed yet) and
+    # keep the :idle guard (a running/failed/stalled child must never be
+    # re-dispatched). Re-enqueue of an :idle child a worker just grabbed is still
+    # safe — the lock guard rejects the duplicate. Capped per run.
+    def rekick_dropped_jobs(branch_log_ids)
+      branch_log_ids.each do |id|
+        Workflow.where(parent_execution_log_id: id, state: Workflow.states[:idle], started_at: nil)
+          .where("updated_at < ?", REKICK_AFTER.ago)
+          .limit(REKICK_BATCH)
+          .find_each do |child|
+          # Intentionally uses the GUARDED perform_later (single-child path),
+          # unlike the bulk perform_all_later bypass in dispatch_children.
+          #
+          # Rekick is best-effort recovery, so one bad child must never sink the
+          # poll: a raise here (e.g. cross-version kwarg drift failing the enqueue
+          # guard) would abort the whole run and — since it isn't a transient AR
+          # error — dead-letter the poller, orphaning every healthy sibling. Catch
+          # per child, log, and let the next poll retry it (it's still idle+stale).
+          child.job_klass.perform_later(child.key, **child.kwargs.symbolize_keys)
+        rescue => e
+          Rails.logger.error do
+            "ChronoForge:BranchMergeJob rekick failed for child #{child.key}: " \
+              "#{e.class}: #{e.message}"
+          end
+        end
+      end
+    end
+  end
+end
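
The rekick filter described in the comment above can be restated as a plain-Ruby predicate. This is only a sketch for reasoning about the three conditions: `Child`, `rekick_candidate?`, and the 300-second threshold are hypothetical stand-ins, not part of chrono_forge, and the real code expresses the same filter as an ActiveRecord query against `REKICK_AFTER`.

```ruby
# Hypothetical in-memory stand-in for a child Workflow row (not the gem's model).
Child = Struct.new(:state, :started_at, :updated_at, keyword_init: true)

# True only for a child that is idle, has never started, and has been
# untouched for longer than rekick_after seconds (an assumed threshold).
def rekick_candidate?(child, now:, rekick_after:)
  child.state == :idle &&
    child.started_at.nil? &&
    child.updated_at < now - rekick_after
end

now = Time.utc(2026, 6, 26, 12, 0, 0)

dropped = Child.new(state: :idle, started_at: nil, updated_at: now - 600)
parked = Child.new(state: :idle, started_at: now - 900, updated_at: now - 600)
fresh = Child.new(state: :idle, started_at: nil, updated_at: now - 5)
failed = Child.new(state: :failed, started_at: nil, updated_at: now - 600)

[dropped, parked, fresh, failed].each do |child|
  puts rekick_candidate?(child, now: now, rekick_after: 300)
end
```

Only the first child qualifies. The parked child is excluded by `started_at`, the fresh one by staleness, and the failed one by the state guard.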
@@ -0,0 +1,26 @@
+# frozen_string_literal: true
+
+module ChronoForge
+  # Single source of truth for "is this branch done?" — used by both merge_branches
+  # (boolean) and BranchMergeJob (which needs the sealed flag and pending count
+  # separately for its adaptive poll cadence). Option A: only :completed counts as
+  # done, so a failed/stalled child keeps the branch pending until recovered.
+  module BranchProbe
+    module_function
+
+    # The branch's coordination log is sealed (fully dispatched).
+    def sealed?(branch_log_id)
+      ExecutionLog.where(id: branch_log_id, state: ExecutionLog.states[:completed]).exists?
+    end
+
+    # Relation of this branch's children that are not yet completed.
+    def incomplete(branch_log_id)
+      Workflow.where(parent_execution_log_id: branch_log_id)
+        .where.not(state: Workflow.states[:completed])
+    end
+
+    def done?(branch_log_id)
+      sealed?(branch_log_id) && !incomplete(branch_log_id).exists?
+    end
+  end
+end
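
The "sealed and nothing incomplete" rule in `BranchProbe` can be checked against a few cases without a database. The sketch below is an in-memory analogue, assuming a hypothetical `Branch` struct that holds a sealed flag and a list of child states. It mirrors the logic of `done?` but is not the gem's implementation, which queries `ExecutionLog` and `Workflow`.

```ruby
# Hypothetical in-memory analogue of BranchProbe (not the gem's code).
Branch = Struct.new(:sealed, :child_states, keyword_init: true) do
  # Children that are in any state other than :completed.
  def incomplete
    child_states.reject { |state| state == :completed }
  end

  # Done only when dispatch is sealed and no child is left incomplete.
  def done?
    sealed && incomplete.empty?
  end
end

all_done = Branch.new(sealed: true, child_states: [:completed, :completed])
one_failed = Branch.new(sealed: true, child_states: [:completed, :failed])
unsealed = Branch.new(sealed: false, child_states: [:completed])
empty = Branch.new(sealed: true, child_states: [])

[all_done, one_failed, unsealed, empty].each { |branch| puts branch.done? }
```

As in the original, a failed child keeps the branch pending, and a sealed branch with no children counts as done because there is nothing left incomplete.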
@@ -93,6 +93,12 @@ module ChronoForge
  ids = batch.ids
  next if ids.empty?
 
+ # Branch children point at their parent's branch$ execution log via
+ # parent_execution_log_id. Bulk delete bypasses the dependent: :nullify callback,
+ # so nullify explicitly to avoid dangling references when a parent is reclaimed.
+ Workflow.where(parent_execution_log_id: ExecutionLog.where(workflow_id: ids).select(:id))
+   .update_all(parent_execution_log_id: nil)
+
  # Delete dependent rows in bulk rather than relying on row-by-row
  # dependent: :destroy callbacks.
  result[:execution_logs] += ExecutionLog.where(workflow_id: ids).delete_all
@@ -33,6 +33,12 @@ module ChronoForge
 
  belongs_to :workflow
 
+ has_many :spawned_workflows,
+   class_name: "ChronoForge::Workflow",
+   foreign_key: :parent_execution_log_id,
+   inverse_of: :parent_execution_log,
+   dependent: :nullify
+
  enum :state, %i[
    pending
    completed
@@ -0,0 +1,47 @@
+module ChronoForge
+  module Executor
+    # An ordered list of RetryPolicy objects, each scoped to an error type via
+    # its `retry_on`. On failure the first policy whose `retry_on` matches the
+    # raised error (by `is_a?`) is applied, giving each error type its own
+    # independent attempt budget and backoff curve. Put specific policies first
+    # and a catch-all (`retry_on: nil`) last; an unmatched error is not retried.
+    #
+    # Pure: it never reads storage. The per-error count is supplied by the
+    # caller through the block passed to #retry_backoff, keyed by the matched
+    # policy's budget_key (its declared errors).
+    class CompositeRetryPolicy
+      attr_reader :policies
+
+      def initialize(policies)
+        @policies = Array(policies)
+        if @policies.empty?
+          raise ArgumentError, "composite retry policy needs at least one policy"
+        end
+      end
+
+      # First sub-policy whose retry_on matches the error, or nil.
+      def policy_for(error)
+        @policies.find { |p| p.matches?(error) }
+      end
+
+      # Routes on the live error and delegates the decision to the matched
+      # sub-policy. When a block is given it is called with the matched policy's
+      # budget_key and must return that policy's running attempt count (1-based,
+      # including the current failure); otherwise `attempts` is used.
+      def retry_backoff(error, attempts:)
+        sub = policy_for(error)
+        return nil if sub.nil?
+
+        count = block_given? ? yield(sub.budget_key) : attempts
+        sub.retryable?(error, count) ? sub.backoff_for(count) : nil
+      end
+
+      # Coarsest attempt bound across sub-policies, for the workflow-level
+      # safety-net guard. nil (unbounded) if any sub-policy is unbounded.
+      def max_attempts
+        caps = @policies.map(&:max_attempts)
+        caps.include?(nil) ? nil : caps.max
+      end
+    end
+  end
+end
+ end