chrono_forge 0.9.1 → 0.10.0

Files changed (41)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +22 -0
  3. data/README.md +305 -44
  4. data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md +1748 -0
  5. data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md.tasks.json +17 -0
  6. data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md +930 -0
  7. data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md.tasks.json +54 -0
  8. data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md +241 -0
  9. data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md.tasks.json +12 -0
  10. data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md +1378 -0
  11. data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md.tasks.json +67 -0
  12. data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md +709 -0
  13. data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md.tasks.json +19 -0
  14. data/docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md +226 -0
  15. data/docs/superpowers/specs/2026-06-25-chrono_forge-dashboard-design.md +190 -0
  16. data/docs/superpowers/specs/2026-06-25-composite-retry-policies-design.md +228 -0
  17. data/docs/superpowers/specs/2026-06-25-reserved-kwarg-guard-design.md +169 -0
  18. data/docs/superpowers/specs/2026-06-25-spawn-merge-branches-design.md +468 -0
  19. data/docs/superpowers/specs/2026-06-26-dashboard-branch-view-design.md +142 -0
  20. data/docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md +265 -0
  21. data/lib/chrono_forge/branch_merge_job.rb +138 -0
  22. data/lib/chrono_forge/branch_probe.rb +26 -0
  23. data/lib/chrono_forge/cleanup.rb +6 -0
  24. data/lib/chrono_forge/execution_log.rb +6 -0
  25. data/lib/chrono_forge/executor/composite_retry_policy.rb +47 -0
  26. data/lib/chrono_forge/executor/methods/branch.rb +185 -0
  27. data/lib/chrono_forge/executor/methods/durably_execute.rb +21 -19
  28. data/lib/chrono_forge/executor/methods/durably_repeat.rb +118 -25
  29. data/lib/chrono_forge/executor/methods/merge_branches.rb +83 -0
  30. data/lib/chrono_forge/executor/methods/wait.rb +2 -4
  31. data/lib/chrono_forge/executor/methods/wait_until.rb +25 -25
  32. data/lib/chrono_forge/executor/methods/workflow_states.rb +16 -0
  33. data/lib/chrono_forge/executor/methods.rb +2 -0
  34. data/lib/chrono_forge/executor/retry_policy.rb +111 -0
  35. data/lib/chrono_forge/executor.rb +216 -28
  36. data/lib/chrono_forge/version.rb +1 -1
  37. data/lib/chrono_forge/workflow.rb +10 -1
  38. data/lib/generators/chrono_forge/migration_actions.rb +1 -0
  39. data/lib/generators/chrono_forge/templates/add_chrono_forge_parent_execution_log.rb +38 -0
  40. metadata +42 -5
  41. data/lib/chrono_forge/executor/retry_strategy.rb +0 -29
@@ -0,0 +1,19 @@
1
+ {
2
+ "planPath": "docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md",
3
+ "tasks": [
4
+ {
5
+ "id": 1,
6
+ "subject": "Task 1: Defer all continuation enqueues until after lock release",
7
+ "status": "completed",
8
+ "description": "**Goal:** No continuation job is published while the enqueuing job still holds the workflow lock; all 8 enqueue sites route through one recorded slot flushed in `ensure` after `release_lock`.\n\n**Files:** lib/chrono_forge/executor.rb; lib/chrono_forge/executor/methods/{wait,wait_until,durably_execute,durably_repeat}.rb; test/continuation_flush_test.rb (create)\n\n**Verify:** `bundle exec ruby -Itest test/continuation_flush_test.rb && bundle exec rake test`\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/executor.rb\", \"lib/chrono_forge/executor/methods/wait.rb\", \"lib/chrono_forge/executor/methods/wait_until.rb\", \"lib/chrono_forge/executor/methods/durably_execute.rb\", \"lib/chrono_forge/executor/methods/durably_repeat.rb\", \"test/continuation_flush_test.rb\"], \"verifyCommand\": \"bundle exec ruby -Itest test/continuation_flush_test.rb && bundle exec rake test\", \"acceptanceCriteria\": [\"continuations enqueued only after lock release\", \"per-site kwargs preserved\", \"flush no-ops without recorded continuation and is skipped when release_lock raises\", \"full suite green\"], \"requiresUserVerification\": false}\n```"
9
+ },
10
+ {
11
+ "id": 2,
12
+ "subject": "Task 2: Closed-form fast-forward of the expired prefix in durably_repeat",
13
+ "status": "completed",
14
+ "blockedBy": [1],
15
+ "description": "**Goal:** When `durably_repeat` resumes behind schedule, jump past the expired prefix in O(1), advance the coordination log's `last_execution_at`, and write one summary `ExecutionLog` for the skip — instead of one timed-out row + one zero-delay job per missed tick.\n\n**Files:** lib/chrono_forge/executor/methods/durably_repeat.rb; test/durably_repeat_test.rb (add tests; update 2 timeout tests)\n\n**Verify:** `bundle exec ruby -Itest test/durably_repeat_test.rb && bundle exec rake test`\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/executor/methods/durably_repeat.rb\", \"test/durably_repeat_test.rb\"], \"verifyCommand\": \"bundle exec ruby -Itest test/durably_repeat_test.rb && bundle exec rake test\", \"acceptanceCriteria\": [\"fast_forward returns input when nothing expired\", \"lands on first non-expired grid tick (no drift)\", \"zero per-tick timeout rows + exactly one summary row\", \"coordination last_execution_at advanced for stable replay\", \"first in-window tick still executes\", \"full suite green\"], \"requiresUserVerification\": false}\n```"
16
+ }
17
+ ],
18
+ "lastUpdated": "2026-06-26"
19
+ }
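
Task 2's closed-form skip can be illustrated with a small sketch. It is not the gem's `durably_repeat` code: `fast_forward`, its keyword arguments, the integer-second clock, and the rule that a tick counts as expired once `tick + timeout <= now` are all assumptions made for illustration.

```ruby
# Hypothetical closed-form fast-forward over a fixed tick grid.
# All times are integer seconds. Assumption: a tick is expired once
# tick + timeout <= now.
# Returns [first_non_expired_tick, number_of_ticks_skipped].
def fast_forward(next_at:, every:, timeout:, now:)
  elapsed = now - timeout - next_at
  # Nothing has expired yet: hand the input back unchanged.
  return [next_at, 0] if elapsed < 0

  # Ticks k = 0..floor(elapsed / every) have all expired.
  skipped = elapsed / every + 1
  # Stepping by whole multiples of `every` keeps the result on the grid.
  [next_at + skipped * every, skipped]
end

on_time = fast_forward(next_at: 100, every: 10, timeout: 5, now: 104)
behind  = fast_forward(next_at: 100, every: 10, timeout: 5, now: 137)
```

Because the skip count comes from one integer division, the cost does not grow with the number of missed ticks, and the returned tick always sits on the original grid, so the schedule does not drift.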
@@ -0,0 +1,226 @@
1
+ # Unified RetryPolicy — Design
2
+
3
+ **Date:** 2026-06-03
4
+ **Status:** Approved (pending spec review)
5
+ **Scope:** Internal API. No external callers; clean break, no deprecation shim.
6
+
7
+ ## Problem
8
+
9
+ ChronoForge currently has three independent retry systems, two backoff
10
+ algorithms, and three different "should we retry?" decision models:
11
+
12
+ 1. **Workflow-level** (uncaught errors in `perform`)
13
+ - `should_retry?(error, attempt)` → hardcoded `attempt < 3`, ignores the error
14
+ - `RetryStrategy.schedule_retry` → fixed array `[1s, 5s, 30s, 2m, 10m]`
15
+ - guard: `attempt >= RetryStrategy.max_attempts` (= 5)
16
+ - **Dead config:** `should_retry?` stops at 3, so the array's `2m`/`10m`
17
+ entries and `max_attempts == 5` are unreachable.
18
+
19
+ 2. **Step-level** (`durably_execute`, `durably_repeat`)
20
+ - `max_attempts:` param (default 3)
21
+ - backoff `2**[attempts, 5].min` — a *different* algorithm (exponential,
22
+ 32s cap) than the workflow level
23
+ - `durably_repeat` adds `on_error: :continue | :fail_workflow`
24
+ - **Dead arg:** the reschedule passes `retry_method:`, which `perform`'s
25
+ signature never binds — it falls into `**kwargs` and is ignored. Replay
26
+ skipping completed steps is what actually resumes the step, not this arg.
27
+
28
+ 3. **`wait_until`**
29
+ - `retry_on: [ExceptionClass, …]` — a third model (error-class allowlist)
30
+ - no attempt cap (bounded by `timeout`), same `2**n` backoff
31
+
32
+ Additional finding: workflow-level attempts (the `attempt:` job arg, lives only
33
+ in the job payload) and step attempts (`execution_log.attempts`, a DB column)
34
+ are unrelated counters.
35
+
36
+ Net: backoff is implemented twice and configurable nowhere per-call; "should we
37
+ retry?" is answered three ways (attempt-count / max_attempts / error-class); and
38
+ the workflow-level cap is internally contradictory (3 vs 5).
39
+
40
+ ## Goal
41
+
42
+ Collapse to **one** `RetryPolicy` type with **one** backoff algorithm, used by
43
+ all four sites. Today's three behaviors become three *default configurations* of
44
+ the same type. Retry behavior becomes expressible per-call.
45
+
46
+ The unification is of **type + mechanism**, not of default *values*: each call
47
+ site keeps a default tuned to its purpose, but all defaults are instances of the
48
+ same object and all are overridable.
49
+
50
+ ## Decisions (locked during brainstorming)
51
+
52
+ | Decision | Choice |
53
+ |---|---|
54
+ | Ambition | Option A — one unified `RetryPolicy`, `wait_until` folded in |
55
+ | Backoff curve | Exponential + jitter, single default, per-call overridable |
56
+ | Compatibility | Clean break — internal code, fix all call sites in the same change |
57
+ | Surface | Class-level default DSL + per-call `retry_policy:` override |
58
+ | Attempt counters | Workflow-level stays in the `attempt:` job arg; steps stay on `execution_log.attempts`. Policy unifies; counting storage does not (no migration) |
59
+ | `wait_until` poll cadence | Stays **out** of `RetryPolicy` (`check_interval`/`timeout` are polling, not retry) |
60
+ | Per-site backoff defaults | Steps `max_attempts: 3, cap: 30`. Workflow-level `max_attempts: 10, cap: 600` — a tolerant window up to ~8.5 min (≈4 min typical with jitter) for transient infra errors on uncaught `perform` errors (revised post-review from an inconsistent `8/600`, where the 600 cap was unreachable). `cap: 600` is a per-delay ceiling, not a dead default: it binds when a caller configures more attempts. |
61
+
62
+ ## Design
63
+
64
+ ### 1. `RetryPolicy` value object
65
+
66
+ New file: `lib/chrono_forge/executor/retry_policy.rb`
67
+
68
+ ```ruby
69
+ RetryPolicy.new(
70
+ max_attempts: 3, # Integer cap, or nil = no count cap (bounded elsewhere)
71
+ base: 1, # seconds
72
+ cap: 30, # seconds, max single delay
73
+ jitter: true,
74
+ retry_on: nil # nil = retry any StandardError; [Classes] = only these
75
+ )
76
+ ```
77
+
78
+ Two methods are the entire decision surface. `attempts` is the 1-based count of
79
+ attempts made so far, *including* the one that just failed (matching
80
+ `ExecutionLog#attempts`); on the first failure `attempts == 1`.
81
+
82
+ - `retryable?(error, attempts)` →
83
+ `(max_attempts.nil? || attempts < max_attempts)` **and** the error matches
84
+ `retry_on` (`retry_on.nil?` means any `StandardError`; otherwise
85
+ `retry_on.any? { |k| error.is_a?(k) }`).
86
+ - `backoff_for(attempts)` → `delay = [cap, base * 2**(attempts - 1)].min`, then
87
+ equal jitter when enabled: `delay / 2.0 + rand(0.0..delay / 2.0)`. Returns an
88
+ `ActiveSupport::Duration` suitable for `set(wait:)`.
89
+
90
+ **Jitter & determinism:** `backoff_for` is called once, at the moment a retry
91
+ job is re-enqueued. The result is never persisted or replayed, so `rand`
92
+ introduces no replay nondeterminism. (Stated explicitly because this is a
93
+ replay engine.)
94
+
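
The two decision methods above can be sketched in plain Ruby. This is an illustrative stand-in, not the gem's `retry_policy.rb`: the class name `SketchRetryPolicy` is hypothetical, and delays are returned as Float seconds rather than `ActiveSupport::Duration` so the sketch runs without Rails.

```ruby
# Illustrative stand-in for the RetryPolicy value object described above.
class SketchRetryPolicy
  attr_reader :max_attempts, :base, :cap, :jitter, :retry_on

  def initialize(max_attempts: 3, base: 1, cap: 30, jitter: true, retry_on: nil)
    @max_attempts = max_attempts
    @base = base
    @cap = cap
    @jitter = jitter
    @retry_on = retry_on
  end

  # attempts is 1-based and includes the failure that just happened.
  def retryable?(error, attempts)
    under_cap = max_attempts.nil? || attempts < max_attempts
    under_cap && matches?(error)
  end

  # Capped exponential delay, with equal jitter when enabled:
  # half the delay is fixed, the other half is random.
  def backoff_for(attempts)
    delay = [cap, base * 2**(attempts - 1)].min.to_f
    return delay unless jitter

    delay / 2.0 + rand(0.0..delay / 2.0)
  end

  private

  # nil means any StandardError; an empty list matches nothing.
  def matches?(error)
    return error.is_a?(StandardError) if retry_on.nil?

    retry_on.any? { |klass| error.is_a?(klass) }
  end
end

# Workflow-level defaults with jitter off, to show the raw curve.
policy = SketchRetryPolicy.new(max_attempts: 10, cap: 600, jitter: false)
delays = (1..9).map { |n| policy.backoff_for(n) }
```

With jitter disabled, the nine delays double from 1 to 256 seconds and sum to 511 seconds, which is where the "up to ~8.5 min" figure for the workflow-level default comes from.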
95
+ ### 2. Per-site default policies
96
+
97
+ A single gem-wide default, overridable per class and per call. Two sites need
98
+ distinct *defaults* to preserve current semantics:
99
+
100
+ | Site | Default policy | Rationale (= today's behavior) |
101
+ |---|---|---|
102
+ | `durably_execute`, `durably_repeat` | `max_attempts: 3, base: 1, cap: 30, retry_on: nil` (retry **all** errors) | matches current `rescue => e; retry`; flaky calls fast-fail |
103
+ | Workflow-level | `max_attempts: 10, base: 1, cap: 600, retry_on: nil` | only fires on uncaught `perform` errors (step failures stall instead), which are rare and may be transient infra blips. 10 attempts (up to ~8.5 min, ≈4 min typical with jitter) ride those out; each retry replays the whole workflow, so the count is bounded rather than open-ended. `cap: 600` (10 min) caps any single backoff |
104
+ | `wait_until` (error path) | `retry_on: []` (retry **nothing** by default) | a condition that *raises* is usually a bug, not transient — matches current `retry_on: []` |
105
+
106
+ `wait_until`'s polling cadence (`check_interval` / `timeout`) is **not** retry
107
+ and is untouched. `RetryPolicy` governs only what happens when the condition
108
+ *raises*.
109
+
110
+ ### 3. Surface — class default + per-call override
111
+
112
+ ```ruby
113
+ class ChargeWorkflow < ApplicationJob
114
+ prepend ChronoForge::Executor
115
+ retry_policy max_attempts: 5, base: 2, cap: 60 # class-wide default
116
+
117
+ def perform
118
+ durably_execute :charge, retry_policy: RetryPolicy.new(max_attempts: 8, retry_on: [Net::OpenTimeout])
119
+ wait_until :settled?, retry_policy: RetryPolicy.new(retry_on: [BankApiError])
120
+ end
121
+ end
122
+ ```
123
+
124
+ `retry_policy(**)` is a class-level DSL added by the prepended `Executor` that
125
+ builds and stores a `RetryPolicy` in `default_retry_policy` (a `class_attribute`,
126
+ so it inherits). The per-call kwarg is named `retry_policy:` (not `retry:`)
127
+ because `retry` is a Ruby keyword — a `retry:` parameter could not be read inside
128
+ the method without `binding.local_variable_get(:retry)`. `retry_policy:` also
129
+ reads consistently with the class-level DSL.
130
+
131
+ **Resolution rules (precise — to remove ambiguity):**
132
+
133
+ - **Error-retry sites** (`durably_execute`, `durably_repeat`, workflow-level):
134
+ explicit per-call `retry_policy:` → class `default_retry_policy` → that site's
135
+ built-in default (table above). So a declared class default replaces the
136
+ built-in for *both* steps and the workflow level, collapsing their differing
137
+ built-ins (3/30 vs 10/600) onto one value. This is the intended, predictable
138
+ meaning of "class-wide default."
139
+ - **`wait_until`** does **not** inherit the class `default_retry_policy`. It uses
140
+ its built-in `retry_on: []` unless an explicit per-call `retry_policy:` is passed.
141
+ Rationale: a class-wide "retry all errors 5×" must not silently turn
142
+ condition-evaluation bugs into retried errors. `wait_until`'s retry set is a
143
+ deliberate per-call opt-in, not a class-wide inheritance.
144
+
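
The resolution order can be condensed into a small lookup. The helper `resolve_policy`, the site symbols, and the use of plain hashes in place of `RetryPolicy` objects are hypothetical simplifications, not the gem's API; only the built-in values are taken from the table in section 2.

```ruby
# Hypothetical resolver illustrating the rules above. Policies are modelled
# as plain hashes to keep the sketch self-contained.
BUILT_IN_DEFAULTS = {
  step: { max_attempts: 3, base: 1, cap: 30, retry_on: nil },
  workflow: { max_attempts: 10, base: 1, cap: 600, retry_on: nil },
  wait_until: { retry_on: [] }
}.freeze

def resolve_policy(site, per_call: nil, class_default: nil)
  # 1. An explicit per-call policy always wins.
  return per_call if per_call
  # 2. The class default applies to error-retry sites only;
  #    wait_until deliberately skips it.
  return class_default if class_default && site != :wait_until

  # 3. Otherwise fall back to the site's built-in default.
  BUILT_IN_DEFAULTS.fetch(site)
end

class_wide = { max_attempts: 5, base: 2, cap: 60, retry_on: nil }
step_policy = resolve_policy(:step, class_default: class_wide)
wait_policy = resolve_policy(:wait_until, class_default: class_wide)
```

Here `step_policy` picks up the class-wide default, while `wait_policy` still resolves to `retry_on: []`, matching the rule that `wait_until` retries are a per-call opt-in.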
145
+ ### 4. Integration / deletions
146
+
147
+ - **Delete** `lib/chrono_forge/executor/retry_strategy.rb` (`RetryStrategy`).
148
+ - **Delete** private `should_retry?` in `executor.rb` (the dead `attempt < 3`).
149
+ - **Delete** the dead `retry_method:` arg in `durably_execute`'s reschedule.
150
+ - **Replace** the `max_attempts:` / `retry_on:` kwargs on `durably_execute`,
151
+ `durably_repeat`, and `wait_until` with a single `retry_policy:` kwarg.
152
+ - **`executor.rb#perform`:** the resolved policy here is `default_retry_policy`
153
+ (class DSL) or the workflow-level built-in (`max_attempts: 10, cap: 600`);
154
+ there is no per-call `retry_policy:` since an uncaught error has no call site.
155
+ - top guard becomes `attempt >= resolved_policy.max_attempts`
156
+ - the `rescue => e` block routes through the resolved policy:
157
+ ```ruby
158
+ if policy.retryable?(e, attempt)
159
+ self.class.set(wait: policy.backoff_for(attempt)).perform_later(key, attempt: attempt + 1)
160
+ else
161
+ fail_workflow!(error_log)
162
+ end
163
+ ```
164
+ - **`durably_execute` / `durably_repeat`:** on error, use
165
+ `policy.retryable?(e, execution_log.attempts)` and
166
+ `policy.backoff_for(execution_log.attempts)`; otherwise mark failed and raise
167
+ `ExecutionFailedError` (`durably_repeat` keeps its `on_error` branch).
168
+ - **`wait_until`:** replace the `retry_on.include?(e.class)` check and the
169
+ inline `2**n` backoff with the resolved policy. The poll/timeout path is
170
+ unchanged.
171
+
172
+ The old extensibility model — `self.class::RetryStrategy` magic constant +
173
+ overriding private `should_retry?` — is removed in favor of passing a
174
+ `RetryPolicy`.
175
+
176
+ ### 5. Backoff impact (informational)
177
+
178
+ Delays in seconds; current step/wait curve is `2**min(attempts,5)`, current
179
+ workflow curve is the fixed array (truncated at attempt 3 by the dead config).
180
+
181
+ | Site | Today (actual) | New default |
182
+ |---|---|---|
183
+ | `durably_execute`/`durably_repeat` (`max_attempts:3`) | `2, 4` then fail | `~1, ~2` (jittered) then fail |
184
+ | `wait_until` error path | `2, 4, 8, …` cap 32 | unchanged in shape; cap 30 |
185
+ | Workflow-level | `1, 5, 30` then fail | `~1, 2, 4, 8, 16, 32, 64, 128, 256` (jittered) then fail, `max_attempts:10` (up to ~8.5 min, ≈4 min typical) |
186
+
187
+ Steps and `wait_until` are effectively unchanged (jitter added, cap 32→30). The
188
+ workflow level moves to the same exponential curve and raises the attempt count
189
+ to 10 (the old array listed 5 delays, of which only 3 were reachable). Each
190
+ workflow-level retry replays the whole workflow, so the count stays bounded.
191
+
192
+ ## Files touched
193
+
194
+ - **New:** `lib/chrono_forge/executor/retry_policy.rb`
195
+ - **Delete:** `lib/chrono_forge/executor/retry_strategy.rb`
196
+ - **Edit:** `lib/chrono_forge/executor.rb` (perform rescue + guard, remove
197
+ `should_retry?`), `lib/chrono_forge/workflow.rb` (add `retry_policy` DSL +
198
+ `default_retry_policy`), `lib/chrono_forge/executor/methods/durably_execute.rb`,
199
+ `.../durably_repeat.rb`, `.../wait_until.rb`
200
+ - **Edit:** test suite, example workflows, and `README.md` retry sections
201
+ (~lines 165–261, 393, 765–769) — clean break, all call sites updated together
202
+
203
+ ## Testing
204
+
205
+ **`RetryPolicy` unit tests**
206
+ - `retryable?` truth table: count cap reached/not; `max_attempts: nil` (never
207
+ count-capped); `retry_on: nil` (any StandardError); `retry_on: [A]` match and
208
+ miss; `retry_on: []` (never).
209
+ - `backoff_for`: exponential growth; cap clamp; jitter bounds with a
210
+ seeded/stubbed `rand`; `jitter: false` is exact.
211
+
212
+ **Integration (per method)**
213
+ - retries → succeeds; retries → exhausts → fails (`ExecutionFailedError` /
214
+ `fail_workflow!`); per-call `retry_policy:` override is honored.
215
+ - `wait_until`: fails fast on an unlisted error; retries a listed one; poll
216
+ cadence/timeout unaffected.
217
+ - workflow-level: uncaught error retries with `attempt+1` and the workflow-level
218
+ policy; stops at `max_attempts`.
219
+ - `durably_repeat`: `on_error: :continue` vs `:fail_workflow` still branch
220
+ correctly after exhaustion.
221
+
222
+ ## Out of scope
223
+
224
+ - Migrating workflow-level attempt counting into the DB (explicitly deferred).
225
+ - Changing `wait_until`'s polling model (`check_interval`/`timeout`).
226
+ - `durably_repeat`'s `on_error` semantics (kept as-is).
@@ -0,0 +1,190 @@
1
+ # ChronoForge Dashboard — Design
2
+
3
+ **Date:** 2026-06-25
4
+ **Status:** Approved (pending spec review)
5
+ **Scope:** New companion gem `chrono_forge-dashboard`, a mountable Rails engine.
6
+ Additive; does not change the published `chrono_forge` gem.
7
+
8
+ ## Problem
9
+
10
+ ChronoForge exposes rich per-step data (execution logs, error logs, persistent
11
+ context, wait states, periodic tasks) but no UI. Operators recover stalled
12
+ workflows and inspect failures from a Rails console. Competing job dashboards
13
+ (Sidekiq, GoodJob, Mission Control) show queues and jobs, not the interior of a
14
+ long-running workflow. A free, self-contained dashboard over ChronoForge's data
15
+ is both a useful tool and the project's strongest adoption lever.
16
+
17
+ ## Goal
18
+
19
+ A mountable, zero-build Rails engine giving full visibility and operational
20
+ control over ChronoForge workflows: list/triage, a step **replay timeline**, a
21
+ context inspector, periodic-task health, wait-state age, and the recovery actions
22
+ (`retry_later`, force-unlock, bulk retry) — behind fail-closed auth.
23
+
24
+ ## Decisions (locked during brainstorming)
25
+
26
+ | Decision | Choice |
27
+ |---|---|
28
+ | Repo layout | **Monorepo subfolder** `chrono_forge-dashboard/` with its own gemspec; core gem excludes the dir from `spec.files` so the published `chrono_forge` stays lean. |
29
+ | Scope | **Full build** — all tiers (visibility, triage, timeline, periodic health, wait-state, actions) in v1. |
30
+ | Frontend | **Server-rendered, zero-build** — ERB + one bundled CSS + one vanilla JS file, served by the engine itself. No npm/bundler/importmap; CSP-friendly; polling for live updates. |
31
+ | Auth | **Fail-closed, pluggable** — built-in HTTP Basic, a custom hook, or explicit `:none` (to use routing constraints). Mounting without configuring any of them **raises**. |
32
+ | Data | **Reuse core models read-only**; engine holds its own query objects/presenters. No schema changes; minimal-to-no core changes. Offset pagination. |
33
+ | Engine | Namespace-isolated `ChronoForge::Dashboard::Engine`, Zeitwerk-loaded. |
34
+
35
+ ## Architecture
36
+
37
+ ```
38
+ chrono_forge/ # repo root (core gem)
39
+ lib/ chrono_forge.gemspec # core; rejects chrono_forge-dashboard/ from spec.files
40
+ chrono_forge-dashboard/
41
+ chrono_forge-dashboard.gemspec # add_dependency "chrono_forge", "railties"
42
+ lib/chrono_forge/dashboard.rb # config object + Engine
43
+ lib/chrono_forge/dashboard/engine.rb
44
+ app/controllers/chrono_forge/dashboard/...
45
+ app/views/chrono_forge/dashboard/...
46
+ app/assets/chrono_forge/dashboard/{dashboard.css,dashboard.js}
47
+ app/queries/chrono_forge/dashboard/... # query objects
48
+ app/presenters/chrono_forge/dashboard/... # timeline / context / sparkline builders
49
+ test/ # Combustion dummy app mounting the engine
50
+ ```
51
+
52
+ Host mounts it:
53
+
54
+ ```ruby
55
+ mount ChronoForge::Dashboard::Engine, at: "/chrono_forge"
56
+ ```
57
+
58
+ `isolate_namespace ChronoForge::Dashboard` keeps routes/helpers/table-name
59
+ prefixes contained. Engine views and assets are wholly self-contained.
60
+
61
+ ## Components
62
+
63
+ ### 1. Configuration & auth (`ChronoForge::Dashboard`)
64
+
65
+ A config singleton:
66
+
67
+ ```ruby
68
+ ChronoForge::Dashboard.configure do |c|
69
+ c.http_basic = { username: ENV["CF_USER"], password: ENV["CF_PASS"] } # built-in
70
+ # c.authenticate { |controller| controller.head(:forbidden) unless controller.current_user&.admin? }
71
+ # c.authentication = :none # opt out; you mount behind your own routing constraint
72
+ c.polling_interval = 5 # seconds; 0 disables auto-refresh
73
+ c.page_size = 50
74
+ c.long_wait_threshold = 1.hour
75
+ end
76
+ ```
77
+
78
+ `BaseController` runs `before_action :authenticate!`, resolved fail-closed in
79
+ this order:
80
+
81
+ 1. **hook present** → call it (host integrates Devise/Pundit/etc.).
82
+ 2. **else `http_basic` present** → `authenticate_or_request_with_http_basic`.
83
+ 3. **else `authentication == :none`** → permit (host guards via routing
84
+ constraint).
85
+ 4. **else → raise `ChronoForge::Dashboard::AuthenticationNotConfigured`** at
86
+ request time, with a message naming the three options.
87
+
88
+ So a forgotten config fails loudly instead of leaking workflow context.
89
+
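
The fail-closed ordering can be sketched without Rails. `SketchConfig` and `auth_strategy` are hypothetical names; the real engine would call the hook or `authenticate_or_request_with_http_basic` inside a controller, whereas this sketch only returns which branch would be taken.

```ruby
# Hypothetical stand-ins; field names follow the configure block above.
AuthenticationNotConfigured = Class.new(StandardError)
SketchConfig = Struct.new(:hook, :http_basic, :authentication, keyword_init: true)

# Returns the branch that would handle the request, or raises when
# nothing was configured (fail closed).
def auth_strategy(config)
  return :hook if config.hook
  return :http_basic if config.http_basic
  return :none if config.authentication == :none

  raise AuthenticationNotConfigured,
    "configure an authenticate hook, http_basic, or authentication = :none"
end

basic = SketchConfig.new(http_basic: { username: "ops", password: "secret" })
chosen = auth_strategy(basic)
```

The final `raise` is the important part of the design: an empty config has no permissive fallthrough.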
90
+ ### 2. Read / query layer
91
+
92
+ - Reuses `ChronoForge::Workflow`, `ExecutionLog`, `ErrorLog` read-only.
93
+ - **`WorkflowsQuery`** — filter by `state`, `job_class`, `key` (search),
94
+ date range; offset-paginated; recency-sorted.
95
+ - **`StatsQuery`** — counts by state + recent failure rate in one grouped query
96
+ (no N+1).
97
+ - **`StepNameParser`** — decodes step names into `{kind, name, timestamp}`:
98
+ `durably_execute$<name>`, `wait_until$<condition>`, `durably_repeat$<name>`
99
+ (coordination) and `durably_repeat$<name>$<ts>` (repetition). `$` is the core's
100
+ reserved delimiter, so parsing is unambiguous.
101
+ - Detail-view logs are paginated (a `durably_repeat` workflow accumulates
102
+ unbounded repetition logs; never load them all).
103
+
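
A minimal sketch of the step-name decoding, under stated assumptions: `parse_step_name` and `ParsedStep` are hypothetical names, the timestamp segment is kept as a raw string because its format is not given here, and a third segment is accepted only for `durably_repeat`. Anything else falls back to a raw entry rather than raising.

```ruby
# Hypothetical decoder for the `$`-delimited step names listed above.
ParsedStep = Struct.new(:kind, :name, :timestamp, keyword_init: true)
KNOWN_KINDS = %w[durably_execute wait_until durably_repeat].freeze

def parse_step_name(step_name)
  kind, name, timestamp = step_name.to_s.split("$", 3)
  parseable = KNOWN_KINDS.include?(kind) && !name.to_s.empty? &&
    (timestamp.nil? || kind == "durably_repeat")
  # Legacy or malformed names are shown as-is instead of raising.
  return ParsedStep.new(kind: :raw, name: step_name, timestamp: nil) unless parseable

  ParsedStep.new(kind: kind.to_sym, name: name, timestamp: timestamp)
end

coordination = parse_step_name("durably_repeat$sync_ledger")
repetition   = parse_step_name("durably_repeat$sync_ledger$1719400000")
legacy       = parse_step_name("some_old_step")
```

A coordination log and a repetition log share the same kind and name and differ only in whether `timestamp` is present, which is what lets a presenter roll repetitions up under their coordination entry.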
104
+ ### 3. Presenters
105
+
106
+ - **`TimelinePresenter`** — orders a workflow's `execution_logs` into a replay
107
+ sequence; each entry: kind, status (completed/failed/pending/waiting),
108
+ attempts, started/completed, duration, error summary. Repetitions roll up under
109
+ their coordination log. Marks the "current position" (last failed/running, or
110
+ the active wait).
111
+ - **`ContextPresenter`** — renders the JSON context as a collapsible tree with
112
+ value types and a size-vs-16KB indicator. Read-only.
113
+ - **`PeriodicHealthPresenter`** — per `durably_repeat` coordination log: last run
114
+ (`metadata.last_execution_at`), next scheduled, missed/timed-out count
115
+ (repetition logs with `error_class == "TimeoutError"`), recent-latency
116
+ sparkline data, and per-error `retry_counts` from metadata.
117
+ - **`WaitStatePresenter`** — for idle workflows whose latest step is a pending
118
+ `wait_until`: condition, wait age (`now - last_executed_at`), `timeout_at`.
119
+
120
+ ### 4. Controllers & routes
121
+
122
+ - `WorkflowsController#index` — list + stats + filters + pagination.
123
+ - `WorkflowsController#show` — detail: timeline, context, errors, wait callout,
124
+ periodic health.
125
+ - `WaitStatesController#index` — idle-waiting workflows by wait age, flagging
126
+ those past `long_wait_threshold`.
127
+ - `ActionsController` (POST, CSRF-protected):
128
+ - `#retry` → `workflow.retry_later` (guarded by `retryable?`; 422 + flash if not).
129
+ - `#unlock` → clear `locked_at`/`locked_by`, set `idle` (loud duplicate-exec warning in the UI).
130
+ - `#bulk_retry` → `ChronoForge::Workflow.failed.find_each(&:retry_later)`; returns affected count.
131
+ - `AssetsController#show` — serves `dashboard.css` / `dashboard.js` with long-cache
132
+ headers, so the engine needs no host asset pipeline.
133
+ - Fragment endpoints (`index`/`show` with a partial format) back the JS polling
134
+ refresh.
135
+
136
+ ### 5. Frontend
137
+
138
+ - ERB views + one layout; all classes prefixed `cf-`.
139
+ - One `dashboard.css`, one `dashboard.js` (vanilla), served by `AssetsController`.
140
+ - **CSP-friendly**: no CDN/external fonts; behavior attached via
141
+ `addEventListener` + `data-` attributes (no inline `<script>` handlers, no
142
+ inline event attributes).
143
+ - JS responsibilities: collapsible context tree; confirm dialogs for destructive
144
+ actions; inline-SVG sparklines (no chart lib); polling that refreshes the
145
+ list/stats fragment (and a running workflow's detail) every
146
+ `polling_interval` seconds, with a pause toggle.
147
+
148
+ ## Error handling
149
+
150
+ - Missing/legacy step names that don't parse fall back to a raw display rather
151
+ than raising.
152
+ - Actions on a workflow whose state changed under the operator (e.g. retry on a
153
+ now-running workflow) surface the core's `WorkflowNotRetryableError` as a flash,
154
+ not a 500.
155
+ - Force-unlock always shows the duplicate-execution warning and requires confirm.
156
+ - Auth misconfiguration raises a clear, actionable error (see §1).
157
+
158
+ ## Testing
159
+
160
+ Combustion dummy app (mirroring core's `test/internal`) mounting the engine, with
161
+ seeded workflows across every state and a `durably_repeat` workflow. Minitest +
162
+ standardrb.
163
+
164
+ - **Queries**: `WorkflowsQuery` filters/pagination; `StatsQuery` counts;
165
+ `StepNameParser` for each kind incl. repetitions and unparseable names.
166
+ - **Presenters**: timeline ordering + repetition rollup + current-position;
167
+ context tree + size indicator; periodic health (missed/timeout/sparkline);
168
+ wait-state age.
169
+ - **Controllers**: index filters/pagination; show renders all panels; wait-state
170
+ list + threshold flag.
171
+ - **Actions**: retry calls `retry_later` and guards non-retryable; unlock clears
172
+ the lock; bulk retry count.
173
+ - **Auth**: raises when unconfigured; HTTP Basic accept/reject; hook;
174
+ `:none` permits.
175
+ - **Assets**: `AssetsController` serves CSS/JS with cache headers.
176
+
177
+ ## Build order (for the implementation plan)
178
+
179
+ Engine skeleton + gemspec + core `spec.files` exclusion + auth → list + stats +
180
+ filters → detail (context + errors) → step replay timeline → periodic health +
181
+ wait-state age → operational actions → assets + JS polling → README/docs. Each
182
+ step is independently testable.
183
+
184
+ ## Out of scope (v1)
185
+
186
+ - Real-time push (ActionCable/SSE) — polling only.
187
+ - Editing context or workflow internals from the UI (read-only except the three
188
+ actions).
189
+ - Cross-workflow search by context value (only key/class/state/date filters).
190
+ - Triggering `CleanupJob` from the UI (operator runs cleanup on their schedule).
@@ -0,0 +1,228 @@
1
+ # Composite Retry Policies — Design
2
+
3
+ **Date:** 2026-06-25
4
+ **Status:** Approved (pending spec review)
5
+ **Scope:** Internal API, additive. Builds on the unified `RetryPolicy`
6
+ (2026-06-03). No breaking change — the common single-policy path is byte-for-byte
7
+ unchanged.
8
+
9
+ ## Problem
10
+
11
+ The unified `RetryPolicy` answers "should we retry?" with a single
12
+ `max_attempts`/`backoff`/`retry_on` tuple per retry site. A single tuple cannot
13
+ express *different behavior per error type* — yet that is exactly what real
14
+ workflows (fintech especially) need:
15
+
16
+ - `NetworkError` → retry aggressively, short backoff
17
+ - `RateLimitError` → retry more, longer backoff
18
+ - `PaymentDeclinedError` → fail immediately, do not retry
19
+
20
+ Today you must pick one policy for the whole step. A `retry_on:` allowlist filters
21
+ *which* errors retry, but every retried error shares one `max_attempts` and one
22
+ backoff curve.
23
+
24
+ ## Goal
25
+
26
+ Let a retry site be configured with an **ordered list** of `RetryPolicy` objects.
27
+ On failure, the **first** sub-policy whose `retry_on` matches the raised error
28
+ applies its own `max_attempts`/`backoff`. Each error type gets an **independent
29
+ attempt budget**.
30
+
31
+ The single-policy path is untouched: it keeps using the site's own `attempts`
32
+ counter, with no extra state.
33
+
34
+ ## Decisions (locked during brainstorming)
35
+
36
+ | Decision | Choice |
37
+ |---|---|
38
+ | Counting semantics | **Per-error budgets.** A sub-policy's `max_attempts` counts only failures routed to it. |
39
+ | Where the count lives | A `retry_counts` map keyed by **matched-policy index**. Steps: in the (execution/repetition) log `metadata`. Workflow-level: in the job args, beside `attempt:`. No new column, no `error_logs` query. |
40
+ | Subclass safety | Routing happens **once, on the live error** via `matches?`/`is_a?` — subclass-correct. The count is keyed by the matched policy, never by error class, so no constantizing or class-string matching at decision time. |
41
+ | Single-policy path | Unchanged — uses the site's `attempts` counter directly, no `retry_counts`. |
42
+ | Construction | Named factory `RetryPolicy.compose(*policies)`; passing an `Array` to `retry_policy:` coerces through the same factory. |
43
+ | Scope | **All four sites**, including the workflow-level class default. |
44
+ | Routing | First match wins. `retry_on: nil` (catch-all) typically last; no match → no retry (fail fast). |
45
+ | Purity | `RetryPolicy` and `CompositeRetryPolicy` stay pure value objects. The per-error count is incremented/read by the executor via a block; the policies never touch storage. |
46
+
47
+ ## Why increment-then-check preserves existing semantics
48
+
49
+ `attempts` is 1-based and includes the failure that just happened (so on the
50
+ first failure `attempts == 1`, and `retryable?` checks `attempts < max_attempts`).
51
+ The composite mirrors this: on each failure it **increments** the matched
52
+ policy's counter **then** checks, so the value handed to `retryable?` is "failures
53
+ routed to this policy so far, including the current one" — the same shape the
54
+ single-policy path already uses. The two notions agree, so a per-error count
55
+ substitutes cleanly wherever a plain policy uses `attempts`.
56
+
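
The increment-then-check rule can be traced with a small standalone example. `SubPolicy`, the error classes, and the `decide` lambda are hypothetical stand-ins for illustration; in the design above the count lives in log metadata or job args, not in a local hash.

```ruby
# Minimal stand-in for a sub-policy: just enough for the counting demo.
SubPolicy = Struct.new(:retry_on, :max_attempts, keyword_init: true) do
  def matches?(error)
    retry_on.nil? ? error.is_a?(StandardError) : retry_on.any? { |k| error.is_a?(k) }
  end

  def retryable?(error, attempts)
    matches?(error) && (max_attempts.nil? || attempts < max_attempts)
  end
end

class NetworkError < StandardError; end
class RateLimitError < StandardError; end

policies = [
  SubPolicy.new(retry_on: [NetworkError], max_attempts: 3),
  SubPolicy.new(retry_on: [RateLimitError], max_attempts: 2)
]
retry_counts = Hash.new(0) # keyed by matched-policy index, not error class

decide = lambda do |error|
  idx = policies.index { |p| p.matches?(error) }
  next false if idx.nil? # no match: fail fast, no counter touched

  retry_counts[idx] += 1 # increment first ...
  policies[idx].retryable?(error, retry_counts[idx]) # ... then check
end

failures = [NetworkError, RateLimitError, NetworkError, RateLimitError, NetworkError]
outcomes = failures.map { |klass| decide.call(klass.new) }
```

Each error type spends only its own budget: the interleaved `RateLimitError` failures do not use up any of the `NetworkError` attempts, and each counter reaches its policy's `max_attempts` exactly on the failure that stops retrying.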
57
+ ## Components
58
+
59
+ ### 1. `RetryPolicy` (existing — additions, pure)
60
+
61
+ - `matches?(error)` — public routing predicate; wraps the existing private
62
+ `retryable_error?`. `retry_on: nil` matches any `StandardError` (catch-all);
63
+ `retry_on: []` matches nothing. Subclass-correct (`error.is_a?`).
64
+ - `retry_backoff(error, attempts:) { |policy_index| count }` — returns the backoff
65
+ `Duration` to retry, or `nil` to stop. The plain policy **ignores the block**
66
+ and uses `attempts`:
67
+
68
+ ```ruby
69
+ def retry_backoff(error, attempts:)
70
+ retryable?(error, attempts) ? backoff_for(attempts) : nil
71
+ end
72
+ ```
73
+
74
+ - `self.compose(*policies)` — factory returning a `CompositeRetryPolicy`.
75
+
76
+ `retryable?`, `backoff_for`, `max_attempts` are unchanged (tests and the
77
+ single-policy path depend on them).
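
A minimal sketch of the `matches?` semantics described above. `SketchPolicy`, `GatewayError`, and `GatewayTimeout` are hypothetical names used only for illustration; the real predicate wraps the private `retryable_error?`, whose body is not shown in this spec.

```ruby
# Hypothetical stand-in; models only the routing predicate.
class SketchPolicy
  def initialize(retry_on: nil)
    @retry_on = retry_on
  end

  def matches?(error)
    # retry_on: nil is the catch-all and matches any StandardError.
    return error.is_a?(StandardError) if @retry_on.nil?

    # An empty list matches nothing; is_a? makes subclasses match too.
    Array(@retry_on).any? { |klass| error.is_a?(klass) }
  end
end

class GatewayError < StandardError; end
class GatewayTimeout < GatewayError; end

catch_all = SketchPolicy.new(retry_on: nil)
nothing   = SketchPolicy.new(retry_on: [])
gateway   = SketchPolicy.new(retry_on: [GatewayError])

checks = {
  catch_all_any_error: catch_all.matches?(ArgumentError.new),
  empty_list:          nothing.matches?(GatewayError.new),
  subclass:            gateway.matches?(GatewayTimeout.new),
  unrelated:           gateway.matches?(ArgumentError.new)
}
puts checks.inspect
```

Only `catch_all_any_error` and `subclass` come out `true`; the empty list and the unrelated error class do not match.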
+
+ ### 2. `CompositeRetryPolicy` (new; `executor/composite_retry_policy.rb`, pure)
+
+ ```ruby
+ class CompositeRetryPolicy
+   attr_reader :policies
+
+   def initialize(policies)
+     @policies = Array(policies)
+     raise ArgumentError, "composite retry policy needs at least one policy" if @policies.empty?
+   end
+
+   # First sub-policy whose retry_on matches the error, or nil.
+   def policy_for(error)
+     @policies.find { |p| p.matches?(error) }
+   end
+
+   # Routes on the *live* error, yields the matched policy's index so the caller
+   # can increment and return that policy's running count, then delegates the
+   # decision to the matched sub-policy.
+   def retry_backoff(error, attempts:)
+     idx = @policies.index { |p| p.matches?(error) }
+     return nil if idx.nil?
+
+     sub = @policies[idx]
+     count = block_given? ? yield(idx) : attempts
+     sub.retryable?(error, count) ? sub.backoff_for(count) : nil
+   end
+
+   # Coarsest bound, for the workflow-level safety-net guard in `perform`.
+   # nil if any sub-policy is unbounded.
+   def max_attempts
+     caps = @policies.map(&:max_attempts)
+     caps.include?(nil) ? nil : caps.max
+   end
+ end
+ ```
+
+ Routing is by `matches?`, which is `is_a?`-based, so a subclass of a `retry_on`
+ class routes to the right policy. The returned **index**, not the error class,
+ is the counter key, so subclasses share the budget of the policy they routed to,
+ exactly as intended.
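
The routing and shared-budget behavior can be exercised with a self-contained sketch. `MiniPolicy`, `MiniComposite`, `RateLimited`, and `BurstLimited` are hypothetical stand-ins, and the backoff formula (`base * 2**(n - 1)`) is an assumption made for illustration; only the `retry_backoff` body mirrors the one shown above.

```ruby
# Hypothetical stand-ins; the backoff formula is an assumption.
class MiniPolicy
  attr_reader :max_attempts

  def initialize(retry_on:, max_attempts:, base: 1)
    @retry_on = retry_on
    @max_attempts = max_attempts
    @base = base
  end

  def matches?(error)
    return error.is_a?(StandardError) if @retry_on.nil?

    @retry_on.any? { |klass| error.is_a?(klass) }
  end

  def retryable?(error, attempts)
    matches?(error) && attempts < max_attempts
  end

  def backoff_for(attempts)
    @base * 2**(attempts - 1)
  end
end

class MiniComposite
  def initialize(policies)
    @policies = policies
  end

  # Same shape as CompositeRetryPolicy#retry_backoff in the spec.
  def retry_backoff(error, attempts:)
    idx = @policies.index { |p| p.matches?(error) }
    return nil if idx.nil?

    sub = @policies[idx]
    count = block_given? ? yield(idx) : attempts
    sub.retryable?(error, count) ? sub.backoff_for(count) : nil
  end
end

class RateLimited < StandardError; end
class BurstLimited < RateLimited; end

composite = MiniComposite.new([
  MiniPolicy.new(retry_on: [RateLimited], max_attempts: 3, base: 10),
  MiniPolicy.new(retry_on: nil, max_attempts: 2) # catch-all, listed last
])

counts = Hash.new(0)                 # in-memory stand-in for retry_counts
bump = ->(idx) { counts[idx] += 1 }  # increment, then return the new count

errors = [RateLimited.new, BurstLimited.new, ArgumentError.new, RateLimited.new]
results = errors.map { |err| composite.retry_backoff(err, attempts: 0, &bump) }

puts results.inspect
puts counts.inspect
```

In this run the `BurstLimited` failure draws from the `RateLimited` policy's budget, so the third failure routed to index 0 stops even though it is only the second `RateLimited` instance, while the `ArgumentError` is counted separately under the catch-all.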
+
+ ### 3. Executor wiring
+
+ - **`coerce_policy(value)`**: `Array` → `RetryPolicy.compose(*value)`; a
+   `RetryPolicy` or `CompositeRetryPolicy` passes through; `nil` → `nil`. Applied
+   in `step_retry_policy` and `wait_retry_policy`, and to the class DSL.
+
+ - **Class DSL** `retry_policy(*policies, **opts)`: positional policies →
+   `RetryPolicy.compose(*policies)` stored as `default_retry_policy`; kwargs only →
+   `RetryPolicy.new(**opts)` (unchanged). Mixing both raises `ArgumentError`.
+
+ - **Per-error counter, step sites**: one helper, incrementing the matched
+   policy's slot in the log metadata and returning the new count:
+
+   ```ruby
+   RETRY_COUNTS_KEY = "retry_counts"
+
+   def bump_retry_count!(log, policy_index)
+     meta = log.metadata || {}
+     counts = meta[RETRY_COUNTS_KEY] || {}
+     key = policy_index.to_s
+     counts[key] = counts[key].to_i + 1
+     meta[RETRY_COUNTS_KEY] = counts
+     log.update!(metadata: meta) # explicit reassign so the JSON column is marked dirty
+     counts[key]
+   end
+   ```
+
+ - **Per-error counter, workflow level**: `perform` gains `retry_counts: {}` in
+   its signature; the failure path increments the in-memory map and threads it
+   through the reschedule, beside `attempt:`. No DB write (mirrors `attempt:`).
+
+ - **The four retry sites** change from the `retryable? … backoff_for` pair to a
+   single `retry_backoff` call carrying the site's counter block. A single policy
+   ignores the block, so its path is unchanged and writes no `retry_counts`:
+
+   ```ruby
+   backoff = policy.retry_backoff(e, attempts: COUNT) { |idx| <bump counter for idx> }
+   if backoff
+     self.class.set(wait: backoff).perform_later(@workflow.key, *site_args)
+     halt_execution!
+   else
+     # site-specific terminal action
+   end
+   ```
+
+ Per-site `COUNT` / counter store / terminal action:
+
+ | Site | `COUNT` (single-policy) | Composite counter store | Terminal action |
+ |---|---|---|---|
+ | `perform` (workflow) | `attempt + 1` | job-args `retry_counts` (rescheduled with `attempt: attempts_made, retry_counts:`) | `fail_workflow!(error_log)` |
+ | `durably_execute` | `execution_log.attempts` | `execution_log.metadata` | mark failed, raise `ExecutionFailedError` |
+ | `durably_repeat` | `repetition_log.attempts` | `repetition_log.metadata` | `on_error` (`:continue` / `:fail_workflow`) |
+ | `wait_until` | `execution_log.attempts` | `execution_log.metadata` | mark failed, raise `ExecutionFailedError` |
+
+ `durably_repeat` keys its counter on the per-repetition log, so each repetition
+ gets its own independent per-error budgets.
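
A sketch of how the step-site helper behaves across repetitions. `FakeLog` is a hypothetical in-memory stand-in for an execution or repetition log record; its JSON round-trip only imitates what a JSON column would store and is not part of the gem.

```ruby
require "json"

RETRY_COUNTS_KEY = "retry_counts"

# Hypothetical stand-in for a log record with a JSON `metadata` column.
FakeLog = Struct.new(:metadata) do
  def update!(metadata:)
    # Round-trip through JSON to mimic persistence (keys become strings).
    self.metadata = JSON.parse(JSON.generate(metadata))
  end
end

# Same body as the helper in the spec.
def bump_retry_count!(log, policy_index)
  meta = log.metadata || {}
  counts = meta[RETRY_COUNTS_KEY] || {}
  key = policy_index.to_s # string key, so it survives the JSON round-trip
  counts[key] = counts[key].to_i + 1
  meta[RETRY_COUNTS_KEY] = counts
  log.update!(metadata: meta)
  counts[key]
end

repetition_a = FakeLog.new(nil) # nil metadata initializes cleanly
repetition_b = FakeLog.new(nil)

returned = [
  bump_retry_count!(repetition_a, 0),
  bump_retry_count!(repetition_a, 0),
  bump_retry_count!(repetition_a, 1),
  bump_retry_count!(repetition_b, 0)
]

puts returned.inspect
puts repetition_a.metadata.inspect
puts repetition_b.metadata.inspect
```

Each log carries its own map, so the second repetition starts from a fresh count for index 0 regardless of how many failures the first repetition recorded.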
+
+ ## Notable properties
+
+ - **Per-error backoff escalation.** `backoff_for(count)` uses the per-error count
+   as the exponent, so each error type's backoff escalates on its own schedule.
+ - **No class-string matching at decision time.** Subclass resolution is a single
+   `is_a?` on the live error during routing; the counter is keyed by policy index.
+ - **`wait_until` still does not inherit the class default.** A per-call array is
+   coerced normally; the class-level composite default does not leak in.
+ - **Workflow-level safety net.** `perform`'s early guard
+   (`policy.max_attempts && attempt >= policy.max_attempts`) keeps working because
+   `CompositeRetryPolicy#max_attempts` returns the coarsest bound, a safe
+   over-estimate that never kills a workflow prematurely.
+ - **Ordering matters.** Specific policies first, catch-all (`retry_on: nil`) last;
+   without a catch-all, an unmatched error fails fast. (Documented footgun.)
+ - **Mid-flight reorder caveat.** Counts are keyed by policy index, so reordering a
+   composite's policies while a long-running workflow is in flight can misattribute
+   in-progress counts. Reordering retry config mid-workflow is ambiguous under any
+   scheme; documented as a known edge.
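
The mid-flight reorder caveat can be illustrated with plain data. The policy labels and the `count_for` helper below are hypothetical; the sketch only shows how an index-keyed map is read against two different orderings.

```ruby
# Hypothetical helper: look up the stored count for a policy by its
# current position in the composite.
def count_for(name, order, counts)
  counts[order.index(name).to_s].to_i
end

# Two failures were recorded while "network" sat at index 0.
stored_counts = { "0" => 2 }

original_order  = %w[network rate_limit]
reordered_order = %w[rate_limit network] # same policies, swapped mid-flight

before        = count_for("network", original_order, stored_counts)
after_network = count_for("network", reordered_order, stored_counts)
after_rate    = count_for("rate_limit", reordered_order, stored_counts)

puts "network before reorder: #{before}"
puts "network after reorder: #{after_network}"
puts "rate_limit after reorder: #{after_rate}"
```

After the swap, the two recorded failures are read as belonging to `rate_limit`, and `network` appears to have a fresh budget.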
+
+ ## Testing
+
+ **Unit: `CompositeRetryPolicy`**
+ - routing: first match wins; specific-before-catch-all
+ - catch-all (`retry_on: nil`) matches anything; `retry_on: []` matches nothing
+ - subclass of a `retry_on` class routes to that policy (and yields its index)
+ - no match → `retry_backoff` returns `nil`
+ - `retry_backoff` yields the matched policy's index and uses the yielded count for
+   both the cap check and the backoff exponent
+ - `max_attempts` = coarsest bound; `nil` if any sub-policy unbounded
+ - empty policy list raises `ArgumentError`
+
+ **Unit: `RetryPolicy` additions**
+ - `matches?` semantics for `nil` / `[]` / class list incl. subclasses
+ - `retry_backoff` (plain) ignores the block, returns `nil` past the cap
+ - `RetryPolicy.compose` builds a `CompositeRetryPolicy`
+
+ **Unit: `bump_retry_count!`**
+ - increments the right index slot; independent slots accumulate independently
+ - reassigns `metadata` so the JSON column persists across reload
+ - `nil`/absent `metadata` initializes cleanly
+
+ **Integration**
+ - a step raising different error types accumulates independent per-error budgets
+   and per-error backoff; fail-fast policy (`max_attempts: 1`) stops immediately;
+   subclass of a `retry_on` class draws from the parent policy's budget
+ - regression: a single `RetryPolicy` (per-call, class default, built-in) behaves
+   identically to today and writes no `retry_counts`
+ - array passed to `retry_policy:` is coerced to a composite
+ - workflow-level composite default routes correctly, threads `retry_counts`
+   through reschedules, and the `perform` safety-net guard honors the coarse
+   `max_attempts`