RubyGems - chrono_forge - Versions diffs - 0.9.1 → 0.10.0 - Mend

chrono_forge 0.9.1 → 0.10.0

Files changed (41)

data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md.tasks.json ADDED

@@ -0,0 +1,19 @@
+{
+  "planPath": "docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md",
+  "tasks": [
+    {
+      "id": 1,
+      "subject": "Task 1: Defer all continuation enqueues until after lock release",
+      "status": "completed",
+      "description": "**Goal:** No continuation job is published while the enqueuing job still holds the workflow lock; all 8 enqueue sites route through one recorded slot flushed in `ensure` after `release_lock`.\n\n**Files:** lib/chrono_forge/executor.rb; lib/chrono_forge/executor/methods/{wait,wait_until,durably_execute,durably_repeat}.rb; test/continuation_flush_test.rb (create)\n\n**Verify:** `bundle exec ruby -Itest test/continuation_flush_test.rb && bundle exec rake test`\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/executor.rb\", \"lib/chrono_forge/executor/methods/wait.rb\", \"lib/chrono_forge/executor/methods/wait_until.rb\", \"lib/chrono_forge/executor/methods/durably_execute.rb\", \"lib/chrono_forge/executor/methods/durably_repeat.rb\", \"test/continuation_flush_test.rb\"], \"verifyCommand\": \"bundle exec ruby -Itest test/continuation_flush_test.rb && bundle exec rake test\", \"acceptanceCriteria\": [\"continuations enqueued only after lock release\", \"per-site kwargs preserved\", \"flush no-ops without recorded continuation and is skipped when release_lock raises\", \"full suite green\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 2,
+      "subject": "Task 2: Closed-form fast-forward of the expired prefix in durably_repeat",
+      "status": "completed",
+      "blockedBy": [1],
+      "description": "**Goal:** When `durably_repeat` resumes behind schedule, jump past the expired prefix in O(1), advance the coordination log's `last_execution_at`, and write one summary `ExecutionLog` for the skip — instead of one timed-out row + one zero-delay job per missed tick.\n\n**Files:** lib/chrono_forge/executor/methods/durably_repeat.rb; test/durably_repeat_test.rb (add tests; update 2 timeout tests)\n\n**Verify:** `bundle exec ruby -Itest test/durably_repeat_test.rb && bundle exec rake test`\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/executor/methods/durably_repeat.rb\", \"test/durably_repeat_test.rb\"], \"verifyCommand\": \"bundle exec ruby -Itest test/durably_repeat_test.rb && bundle exec rake test\", \"acceptanceCriteria\": [\"fast_forward returns input when nothing expired\", \"lands on first non-expired grid tick (no drift)\", \"zero per-tick timeout rows + exactly one summary row\", \"coordination last_execution_at advanced for stable replay\", \"first in-window tick still executes\", \"full suite green\"], \"requiresUserVerification\": false}\n```"
+    }
+  ],
+  "lastUpdated": "2026-06-26"
+}
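
Task 2 above describes a closed-form fast-forward over the expired prefix of a `durably_repeat` schedule. The arithmetic could look roughly like the sketch below. The name `fast_forward` appears in the acceptance criteria, but its signature, the integer epoch-second inputs, and the expiry rule (`tick + timeout <= now`) are assumptions made for illustration, not the gem's implementation.

```ruby
# Hypothetical sketch: all arguments are integer seconds (an assumption).
# A tick is treated as expired once `tick + timeout <= now` (also an assumption).
# Returns [first_non_expired_tick, number_of_skipped_ticks].
def fast_forward(next_at, every:, timeout:, now:)
  overdue = now - timeout - next_at
  return [next_at, 0] if overdue < 0      # nothing expired: hand back the input

  skipped = (overdue / every) + 1         # integer division counts expired ticks in O(1)
  [next_at + skipped * every, skipped]    # multiple of `every`, so the grid does not drift
end

landing, skipped = fast_forward(1000, every: 60, timeout: 30, now: 1605)
```

In this example the ten ticks from 1000 through 1540 have expired, so the call lands on 1600 and reports 10 skipped ticks. A caller could write that count into a single summary row instead of one row per missed tick.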

data/docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md ADDED

@@ -0,0 +1,226 @@
+# Unified RetryPolicy — Design
+**Date:** 2026-06-03
+**Status:** Approved (pending spec review)
+**Scope:** Internal API. No external callers; clean break, no deprecation shim.
+## Problem
+ChronoForge currently has three independent retry systems, two backoff
+algorithms, and three different "should we retry?" decision models:
+1. **Workflow-level** (uncaught errors in `perform`)
+   - `should_retry?(error, attempt)` → hardcoded `attempt < 3`, ignores the error
+   - `RetryStrategy.schedule_retry` → fixed array `[1s, 5s, 30s, 2m, 10m]`
+   - guard: `attempt >= RetryStrategy.max_attempts` (= 5)
+   - **Dead config:** `should_retry?` stops at 3, so the array's `2m`/`10m`
+     entries and `max_attempts == 5` are unreachable.
+2. **Step-level** (`durably_execute`, `durably_repeat`)
+   - `max_attempts:` param (default 3)
+   - backoff `2**[attempts, 5].min` — a *different* algorithm (exponential,
+     32s cap) than the workflow level
+   - `durably_repeat` adds `on_error: :continue | :fail_workflow`
+   - **Dead arg:** the reschedule passes `retry_method:`, which `perform`'s
+     signature never binds — it falls into `**kwargs` and is ignored. Replay
+     skipping completed steps is what actually resumes the step, not this arg.
+3. **`wait_until`**
+   - `retry_on: [ExceptionClass, …]` — a third model (error-class allowlist)
+   - no attempt cap (bounded by `timeout`), same `2**n` backoff
+Additional finding: workflow-level attempts (the `attempt:` job arg, which lives
+only in the job payload) and step attempts (`execution_log.attempts`, a DB column)
+are unrelated counters.
+Net: backoff is implemented twice and configurable nowhere per-call; "should we
+retry?" is answered three ways (attempt-count / max_attempts / error-class); and
+the workflow-level cap is internally contradictory (3 vs 5).
+## Goal
+Collapse to **one** `RetryPolicy` type with **one** backoff algorithm, used by
+all four sites. Today's three behaviors become three *default configurations* of
+the same type. Retry behavior becomes expressible per-call.
+The unification is of **type + mechanism**, not of default *values*: each call
+site keeps a default tuned to its purpose, but all defaults are instances of the
+same object and all are overridable.
+## Decisions (locked during brainstorming)
+| Decision | Choice |
+|---|---|
+| Ambition | Option A — one unified `RetryPolicy`, `wait_until` folded in |
+| Backoff curve | Exponential + jitter, single default, per-call overridable |
+| Compatibility | Clean break — internal code, fix all call sites in the same change |
+| Surface | Class-level default DSL + per-call `retry_policy:` override |
+| Attempt counters | Workflow-level stays in the `attempt:` job arg; steps stay on `execution_log.attempts`. Policy unifies; counting storage does not (no migration) |
+| `wait_until` poll cadence | Stays **out** of `RetryPolicy` (`check_interval`/`timeout` are polling, not retry) |
+| Per-site backoff defaults | Steps `max_attempts: 3, cap: 30`. Workflow-level `max_attempts: 10, cap: 600` — a tolerant window up to ~8.5 min (≈4 min typical with jitter) for transient infra errors on uncaught `perform` errors (revised post-review from an inconsistent `8/600`, where the 600 cap was unreachable). `cap: 600` is a per-delay ceiling, not a dead default: it binds when a caller configures more attempts. |
+## Design
+### 1. `RetryPolicy` value object
+New file: `lib/chrono_forge/executor/retry_policy.rb`
+```ruby
+RetryPolicy.new(
+  max_attempts: 3,        # Integer cap, or nil = no count cap (bounded elsewhere)
+  base: 1,                # seconds
+  cap: 30,                # seconds, max single delay
+  jitter: true,
+  retry_on: nil           # nil = retry any StandardError; [Classes] = only these
+)
+```
+Two methods are the entire decision surface. `attempts` is the 1-based count of
+attempts made so far, *including* the one that just failed (matching
+`ExecutionLog#attempts`); on the first failure `attempts == 1`.
+- `retryable?(error, attempts)` →
+  `(max_attempts.nil? || attempts < max_attempts)` **and** the error matches
+  `retry_on` (`retry_on.nil?` means any `StandardError`; otherwise
+  `retry_on.any? { |k| error.is_a?(k) }`).
+- `backoff_for(attempts)` → `delay = [cap, base * 2**(attempts - 1)].min`, then
+  equal jitter when enabled: `delay / 2.0 + rand(0.0..delay / 2.0)`. Returns an
+  `ActiveSupport::Duration` suitable for `set(wait:)`.
+**Jitter & determinism:** `backoff_for` is called once, at the moment a retry
+job is re-enqueued. The result is never persisted or replayed, so `rand`
+introduces no replay nondeterminism. (Stated explicitly because this is a
+replay engine.)
+### 2. Per-site default policies
+A single gem-wide default, overridable per class and per call. Two sites need
+distinct *defaults* to preserve current semantics:
+| Site | Default policy | Rationale (= today's behavior) |
+|---|---|---|
+| `durably_execute`, `durably_repeat` | `max_attempts: 3, base: 1, cap: 30, retry_on: nil` (retry **all** errors) | matches current `rescue => e; retry`; flaky calls fast-fail |
+| Workflow-level | `max_attempts: 10, base: 1, cap: 600, retry_on: nil` | only fires on uncaught `perform` errors (step failures stall instead), which are rare and may be transient infra blips. 10 attempts (up to ~8.5 min, ≈4 min typical with jitter) ride those out; each retry replays the whole workflow, so the count is bounded rather than open-ended. `cap: 600` (10 min) caps any single backoff |
+| `wait_until` (error path) | `retry_on: []` (retry **nothing** by default) | a condition that *raises* is usually a bug, not transient — matches current `retry_on: []` |
+`wait_until`'s polling cadence (`check_interval` / `timeout`) is **not** retry
+and is untouched. `RetryPolicy` governs only what happens when the condition
+*raises*.
+### 3. Surface — class default + per-call override
+```ruby
+class ChargeWorkflow < ApplicationJob
+  prepend ChronoForge::Executor
+  retry_policy max_attempts: 5, base: 2, cap: 60   # class-wide default
+  def perform
+    durably_execute :charge, retry_policy: RetryPolicy.new(max_attempts: 8, retry_on: [Net::OpenTimeout])
+    wait_until :settled?, retry_policy: RetryPolicy.new(retry_on: [BankApiError])
+  end
+end
+```
+`retry_policy(**)` is a class-level DSL added by the prepended `Executor` that
+builds and stores a `RetryPolicy` in `default_retry_policy` (a `class_attribute`,
+so it inherits). The per-call kwarg is named `retry_policy:` (not `retry:`)
+because `retry` is a Ruby keyword — a `retry:` parameter could not be read inside
+the method without `binding.local_variable_get(:retry)`. `retry_policy:` also
+reads consistently with the class-level DSL.
+**Resolution rules (precise — to remove ambiguity):**
+- **Error-retry sites** (`durably_execute`, `durably_repeat`, workflow-level):
+  explicit per-call `retry_policy:` → class `default_retry_policy` → that site's
+  built-in default (table above). So a declared class default replaces the
+  built-in for *both* steps and the workflow level, collapsing their differing
+  built-ins (3/30 vs 10/600) onto one value. This is the intended, predictable
+  meaning of "class-wide default."
+- **`wait_until`** does **not** inherit the class `default_retry_policy`. It uses
+  its built-in `retry_on: []` unless an explicit per-call `retry_policy:` is passed.
+  Rationale: a class-wide "retry all errors 5×" must not silently turn
+  condition-evaluation bugs into retried errors. `wait_until`'s retry set is a
+  deliberate per-call opt-in, not a class-wide inheritance.
+### 4. Integration / deletions
+- **Delete** `lib/chrono_forge/executor/retry_strategy.rb` (`RetryStrategy`).
+- **Delete** private `should_retry?` in `executor.rb` (the dead `attempt < 3`).
+- **Delete** the dead `retry_method:` arg in `durably_execute`'s reschedule.
+- **Replace** the `max_attempts:` / `retry_on:` kwargs on `durably_execute`,
+  `durably_repeat`, and `wait_until` with a single `retry_policy:` kwarg.
+- **`executor.rb#perform`:** the resolved policy here is `default_retry_policy`
+  (class DSL) or the workflow-level built-in (`max_attempts: 10, cap: 600`);
+  there is no per-call `retry_policy:` since an uncaught error has no call site.
+  - top guard becomes `attempt >= resolved_policy.max_attempts`
+  - the `rescue => e` block routes through the resolved policy:
+    ```ruby
+    if policy.retryable?(e, attempt)
+      self.class.set(wait: policy.backoff_for(attempt)).perform_later(key, attempt: attempt + 1)
+    else
+      fail_workflow!(error_log)
+    end
+    ```
+- **`durably_execute` / `durably_repeat`:** on error, use
+  `policy.retryable?(e, execution_log.attempts)` and
+  `policy.backoff_for(execution_log.attempts)`; otherwise mark failed and raise
+  `ExecutionFailedError` (`durably_repeat` keeps its `on_error` branch).
+- **`wait_until`:** replace the `retry_on.include?(e.class)` check and the
+  inline `2**n` backoff with the resolved policy. The poll/timeout path is
+  unchanged.
+The old extensibility model — `self.class::RetryStrategy` magic constant +
+overriding private `should_retry?` — is removed in favor of passing a
+`RetryPolicy`.
+### 5. Backoff impact (informational)
+Delays in seconds; current step/wait curve is `2**min(attempts,5)`, current
+workflow curve is the fixed array (truncated at attempt 3 by the dead config).
+| Site | Today (actual) | New default |
+|---|---|---|
+| `durably_execute`/`durably_repeat` (`max_attempts:3`) | `2, 4` then fail | `~1, ~2` (jittered) then fail |
+| `wait_until` error path | `2, 4, 8, …` cap 32 | unchanged in shape; cap 30 |
+| Workflow-level | `1, 5, 30` then fail | `~1, 2, 4, 8, 16, 32, 64, 128, 256` (jittered) then fail, `max_attempts:10` (up to ~8.5 min, ≈4 min typical) |
+Steps and `wait_until` are effectively unchanged (jitter added, cap 32→30). The
+workflow level replaces the fixed array with the shared curve and allows up to
+`max_attempts: 10`. The count stays bounded because each workflow-level retry
+replays the whole workflow.
+## Files touched
+- **New:** `lib/chrono_forge/executor/retry_policy.rb`
+- **Delete:** `lib/chrono_forge/executor/retry_strategy.rb`
+- **Edit:** `lib/chrono_forge/executor.rb` (perform rescue + guard, remove
+  `should_retry?`), `lib/chrono_forge/workflow.rb` (add `retry_policy` DSL +
+  `default_retry_policy`), `lib/chrono_forge/executor/methods/durably_execute.rb`,
+  `.../durably_repeat.rb`, `.../wait_until.rb`
+- **Edit:** test suite, example workflows, and `README.md` retry sections
+  (~lines 165–261, 393, 765–769) — clean break, all call sites updated together
+## Testing
+**`RetryPolicy` unit tests**
+- `retryable?` truth table: count cap reached/not; `max_attempts: nil` (never
+  count-capped); `retry_on: nil` (any StandardError); `retry_on: [A]` match and
+  miss; `retry_on: []` (never).
+- `backoff_for`: exponential growth; cap clamp; jitter bounds with a
+  seeded/stubbed `rand`; `jitter: false` is exact.
+**Integration (per method)**
+- retries → succeeds; retries → exhausts → fails (`ExecutionFailedError` /
+  `fail_workflow!`); per-call `retry_policy:` override is honored.
+- `wait_until`: fails fast on an unlisted error; retries a listed one; poll
+  cadence/timeout unaffected.
+- workflow-level: uncaught error retries with `attempt+1` and the workflow-level
+  policy; stops at `max_attempts`.
+- `durably_repeat`: `on_error: :continue` vs `:fail_workflow` still branch
+  correctly after exhaustion.
+## Out of scope
+- Migrating workflow-level attempt counting into the DB (explicitly deferred).
+- Changing `wait_until`'s polling model (`check_interval`/`timeout`).
+- `durably_repeat`'s `on_error` semantics (kept as-is).
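
The decision surface in section 1 of the spec above (`retryable?` and `backoff_for`) can be sketched in plain Ruby as follows. This is a minimal illustration under stated assumptions, not the file shipped in the gem: it returns plain seconds instead of an `ActiveSupport::Duration` so it runs without Rails, and the private helper name `matches_error?` is hypothetical.

```ruby
# Sketch of the value object from section 1. Assumption: delays are plain
# numbers of seconds rather than ActiveSupport::Duration.
class RetryPolicy
  attr_reader :max_attempts, :base, :cap, :jitter, :retry_on

  def initialize(max_attempts: 3, base: 1, cap: 30, jitter: true, retry_on: nil)
    @max_attempts = max_attempts
    @base = base
    @cap = cap
    @jitter = jitter
    @retry_on = retry_on
  end

  # `attempts` is 1-based and includes the attempt that just failed.
  def retryable?(error, attempts)
    under_cap = max_attempts.nil? || attempts < max_attempts
    under_cap && matches_error?(error)
  end

  # Exponential delay clamped to `cap`, then equal jitter when enabled.
  def backoff_for(attempts)
    delay = [cap, base * 2**(attempts - 1)].min
    return delay unless jitter

    delay / 2.0 + rand(0.0..delay / 2.0)
  end

  private

  # Hypothetical helper name: nil means any StandardError, [] means nothing.
  def matches_error?(error)
    return error.is_a?(StandardError) if retry_on.nil?

    retry_on.any? { |klass| error.is_a?(klass) }
  end
end

# Workflow-level default with jitter disabled, to make the arithmetic visible.
workflow = RetryPolicy.new(max_attempts: 10, cap: 600, jitter: false)
delays = (1..9).map { |n| workflow.backoff_for(n) }
total = delays.sum
```

With jitter off, the nine delays sum to 511 seconds, which matches the "up to ~8.5 min" figure quoted in the spec. Under the equal-jitter formula each delay falls between half its nominal value and the full value, so the total lies between about 255.5 and 511 seconds.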

data/docs/superpowers/specs/2026-06-25-chrono_forge-dashboard-design.md ADDED

@@ -0,0 +1,190 @@
+# ChronoForge Dashboard — Design
+**Date:** 2026-06-25
+**Status:** Approved (pending spec review)
+**Scope:** New companion gem `chrono_forge-dashboard`, a mountable Rails engine.
+Additive; does not change the published `chrono_forge` gem.
+## Problem
+ChronoForge exposes rich per-step data (execution logs, error logs, persistent
+context, wait states, periodic tasks) but no UI. Operators recover stalled
+workflows and inspect failures from a Rails console. Competing job dashboards
+(Sidekiq, GoodJob, Mission Control) show queues and jobs, not the interior of a
+long-running workflow. A free, self-contained dashboard over ChronoForge's data
+is both a useful tool and the project's strongest adoption lever.
+## Goal
+A mountable, zero-build Rails engine giving full visibility and operational
+control over ChronoForge workflows: list/triage, a step **replay timeline**, a
+context inspector, periodic-task health, wait-state age, and the recovery actions
+(`retry_later`, force-unlock, bulk retry) — behind fail-closed auth.
+## Decisions (locked during brainstorming)
+| Decision | Choice |
+|---|---|
+| Repo layout | **Monorepo subfolder** `chrono_forge-dashboard/` with its own gemspec; core gem excludes the dir from `spec.files` so the published `chrono_forge` stays lean. |
+| Scope | **Full build** — all tiers (visibility, triage, timeline, periodic health, wait-state, actions) in v1. |
+| Frontend | **Server-rendered, zero-build** — ERB + one bundled CSS + one vanilla JS file, served by the engine itself. No npm/bundler/importmap; CSP-friendly; polling for live updates. |
+| Auth | **Fail-closed, pluggable** — built-in HTTP Basic, a custom hook, or explicit `:none` (to use routing constraints). Mounting without configuring any of them **raises**. |
+| Data | **Reuse core models read-only**; engine holds its own query objects/presenters. No schema changes; minimal-to-no core changes. Offset pagination. |
+| Engine | Namespace-isolated `ChronoForge::Dashboard::Engine`, Zeitwerk-loaded. |
+## Architecture
+```
+chrono_forge/                         # repo root (core gem)
+  lib/ chrono_forge.gemspec           # core; rejects chrono_forge-dashboard/ from spec.files
+  chrono_forge-dashboard/
+    chrono_forge-dashboard.gemspec    # add_dependency "chrono_forge", "railties"
+    lib/chrono_forge/dashboard.rb         # config object + Engine
+    lib/chrono_forge/dashboard/engine.rb
+    app/controllers/chrono_forge/dashboard/...
+    app/views/chrono_forge/dashboard/...
+    app/assets/chrono_forge/dashboard/{dashboard.css,dashboard.js}
+    app/queries/chrono_forge/dashboard/...     # query objects
+    app/presenters/chrono_forge/dashboard/...  # timeline / context / sparkline builders
+    test/                              # Combustion dummy app mounting the engine
+```
+Host mounts it:
+```ruby
+mount ChronoForge::Dashboard::Engine, at: "/chrono_forge"
+```
+`isolate_namespace ChronoForge::Dashboard` keeps routes/helpers/table-name
+prefixes contained. Engine views and assets are wholly self-contained.
+## Components
+### 1. Configuration & auth (`ChronoForge::Dashboard`)
+A config singleton:
+```ruby
+ChronoForge::Dashboard.configure do |c|
+  c.http_basic = { username: ENV["CF_USER"], password: ENV["CF_PASS"] }  # built-in
+  # c.authenticate { |controller| controller.head(:forbidden) unless controller.current_user&.admin? }
+  # c.authentication = :none   # opt out; you mount behind your own routing constraint
+  c.polling_interval = 5        # seconds; 0 disables auto-refresh
+  c.page_size = 50
+  c.long_wait_threshold = 1.hour
+end
+```
+`BaseController` runs `before_action :authenticate!`, resolved fail-closed in
+this order:
+1. **hook present** → call it (host integrates Devise/Pundit/etc.).
+2. **else `http_basic` present** → `authenticate_or_request_with_http_basic`.
+3. **else `authentication == :none`** → permit (host guards via routing
+   constraint).
+4. **else → raise `ChronoForge::Dashboard::AuthenticationNotConfigured`** at
+   request time, with a message naming the three options.
+So a forgotten config fails loudly instead of leaking workflow context.
+### 2. Read / query layer
+- Reuses `ChronoForge::Workflow`, `ExecutionLog`, `ErrorLog` read-only.
+- **`WorkflowsQuery`** — filter by `state`, `job_class`, `key` (search),
+  date range; offset-paginated; recency-sorted.
+- **`StatsQuery`** — counts by state + recent failure rate in one grouped query
+  (no N+1).
+- **`StepNameParser`** — decodes step names into `{kind, name, timestamp}`:
+  `durably_execute$<name>`, `wait_until$<condition>`, `durably_repeat$<name>`
+  (coordination) and `durably_repeat$<name>$<ts>` (repetition). `$` is the core's
+  reserved delimiter, so parsing is unambiguous.
+- Detail-view logs are paginated (a `durably_repeat` workflow accumulates
+  unbounded repetition logs; never load them all).
+### 3. Presenters
+- **`TimelinePresenter`** — orders a workflow's `execution_logs` into a replay
+  sequence; each entry: kind, status (completed/failed/pending/waiting),
+  attempts, started/completed, duration, error summary. Repetitions roll up under
+  their coordination log. Marks the "current position" (last failed/running, or
+  the active wait).
+- **`ContextPresenter`** — renders the JSON context as a collapsible tree with
+  value types and a size-vs-16KB indicator. Read-only.
+- **`PeriodicHealthPresenter`** — per `durably_repeat` coordination log: last run
+  (`metadata.last_execution_at`), next scheduled, missed/timed-out count
+  (repetition logs with `error_class == "TimeoutError"`), recent-latency
+  sparkline data, and per-error `retry_counts` from metadata.
+- **`WaitStatePresenter`** — for idle workflows whose latest step is a pending
+  `wait_until`: condition, wait age (`now - last_executed_at`), `timeout_at`.
+### 4. Controllers & routes
+- `WorkflowsController#index` — list + stats + filters + pagination.
+- `WorkflowsController#show` — detail: timeline, context, errors, wait callout,
+  periodic health.
+- `WaitStatesController#index` — idle-waiting workflows by wait age, flagging
+  those past `long_wait_threshold`.
+- `ActionsController` (POST, CSRF-protected):
+  - `#retry` → `workflow.retry_later` (guarded by `retryable?`; 422 + flash if not).
+  - `#unlock` → clear `locked_at`/`locked_by`, set `idle` (loud duplicate-exec warning in the UI).
+  - `#bulk_retry` → `ChronoForge::Workflow.failed.find_each(&:retry_later)`; returns affected count.
+- `AssetsController#show` — serves `dashboard.css` / `dashboard.js` with long-cache
+  headers, so the engine needs no host asset pipeline.
+- Fragment endpoints (`index`/`show` with a partial format) back the JS polling
+  refresh.
+### 5. Frontend
+- ERB views + one layout; all classes prefixed `cf-`.
+- One `dashboard.css`, one `dashboard.js` (vanilla), served by `AssetsController`.
+- **CSP-friendly**: no CDN/external fonts; behavior attached via
+  `addEventListener` + `data-` attributes (no inline `<script>` handlers, no
+  inline event attributes).
+- JS responsibilities: collapsible context tree; confirm dialogs for destructive
+  actions; inline-SVG sparklines (no chart lib); polling that refreshes the
+  list/stats fragment (and a running workflow's detail) every
+  `polling_interval` seconds, with a pause toggle.
+## Error handling
+- Missing/legacy step names that don't parse fall back to a raw display rather
+  than raising.
+- Actions on a workflow whose state changed under the operator (e.g. retry on a
+  now-running workflow) surface the core's `WorkflowNotRetryableError` as a flash,
+  not a 500.
+- Force-unlock always shows the duplicate-execution warning and requires confirm.
+- Auth misconfiguration raises a clear, actionable error (see §1).
+## Testing
+Combustion dummy app (mirroring core's `test/internal`) mounting the engine, with
+seeded workflows across every state and a `durably_repeat` workflow. Minitest +
+standardrb.
+- **Queries**: `WorkflowsQuery` filters/pagination; `StatsQuery` counts;
+  `StepNameParser` for each kind incl. repetitions and unparseable names.
+- **Presenters**: timeline ordering + repetition rollup + current-position;
+  context tree + size indicator; periodic health (missed/timeout/sparkline);
+  wait-state age.
+- **Controllers**: index filters/pagination; show renders all panels; wait-state
+  list + threshold flag.
+- **Actions**: retry calls `retry_later` and guards non-retryable; unlock clears
+  the lock; bulk retry count.
+- **Auth**: raises when unconfigured; HTTP Basic accept/reject; hook;
+  `:none` permits.
+- **Assets**: `AssetsController` serves CSS/JS with cache headers.
+## Build order (for the implementation plan)
+Engine skeleton + gemspec + core `spec.files` exclusion + auth → list + stats +
+filters → detail (context + errors) → step replay timeline → periodic health +
+wait-state age → operational actions → assets + JS polling → README/docs. Each
+step is independently testable.
+## Out of scope (v1)
+- Real-time push (ActionCable/SSE) — polling only.
+- Editing context or workflow internals from the UI (read-only except the three
+  actions).
+- Cross-workflow search by context value (only key/class/state/date filters).
+- Triggering `CleanupJob` from the UI (operator runs cleanup on their schedule).
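
The `StepNameParser` described in the dashboard spec above decodes `$`-delimited step names and falls back to a raw display for names that do not parse. A minimal sketch of that behavior might look like the following. The method names, the `:raw` kind used for the fallback, and keeping the timestamp as an unparsed string are all assumptions; the spec does not state the return shape beyond `{kind, name, timestamp}`.

```ruby
# Hypothetical sketch of the parser from section 2 of the dashboard spec.
module StepNameParser
  KINDS = %w[durably_execute wait_until durably_repeat].freeze

  # Returns a hash with :kind, :name and :timestamp. Never raises on bad input.
  def self.parse(step_name)
    kind, name, timestamp = step_name.to_s.split("$", 3)
    return raw(step_name) unless KINDS.include?(kind) && name && !name.empty?
    # Only durably_repeat repetitions carry a third segment.
    return raw(step_name) if timestamp && kind != "durably_repeat"

    {kind: kind.to_sym, name: name, timestamp: timestamp}
  end

  # Fallback for missing or legacy names (the :raw kind is an assumption).
  def self.raw(step_name)
    {kind: :raw, name: step_name.to_s, timestamp: nil}
  end
end

coordination = StepNameParser.parse('durably_repeat$sync')
repetition   = StepNameParser.parse('durably_repeat$sync$1750000000')
legacy       = StepNameParser.parse('legacy_step')
```

Here a `durably_repeat` name without a third segment is read as the coordination log and one with a third segment as a repetition, which is the distinction the timeline rollup relies on.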

data/docs/superpowers/specs/2026-06-25-composite-retry-policies-design.md ADDED

@@ -0,0 +1,228 @@
+# Composite Retry Policies — Design
+**Date:** 2026-06-25
+**Status:** Approved (pending spec review)
+**Scope:** Internal API, additive. Builds on the unified `RetryPolicy`
+(2026-06-03). No breaking change — the common single-policy path is byte-for-byte
+unchanged.
+## Problem
+The unified `RetryPolicy` answers "should we retry?" with a single
+`max_attempts`/`backoff`/`retry_on` tuple per retry site. A single tuple cannot
+express *different behavior per error type* — yet that is exactly what real
+workflows (fintech especially) need:
+- `NetworkError` → retry aggressively, short backoff
+- `RateLimitError` → retry more, longer backoff
+- `PaymentDeclinedError` → fail immediately, do not retry
+Today you must pick one policy for the whole step. A `retry_on:` allowlist filters
+*which* errors retry, but every retried error shares one `max_attempts` and one
+backoff curve.
+## Goal
+Let a retry site be configured with an **ordered list** of `RetryPolicy` objects.
+On failure, the **first** sub-policy whose `retry_on` matches the raised error
+applies its own `max_attempts`/`backoff`. Each error type gets an **independent
+attempt budget**.
+The single-policy path is untouched: it keeps using the site's own `attempts`
+counter, with no extra state.
+## Decisions (locked during brainstorming)
+| Decision | Choice |
+|---|---|
+| Counting semantics | **Per-error budgets.** A sub-policy's `max_attempts` counts only failures routed to it. |
+| Where the count lives | A `retry_counts` map keyed by **matched-policy index**. Steps: in the (execution/repetition) log `metadata`. Workflow-level: in the job args, beside `attempt:`. No new column, no `error_logs` query. |
+| Subclass safety | Routing happens **once, on the live error** via `matches?`/`is_a?` — subclass-correct. The count is keyed by the matched policy, never by error class, so no constantizing or class-string matching at decision time. |
+| Single-policy path | Unchanged — uses the site's `attempts` counter directly, no `retry_counts`. |
+| Construction | Named factory `RetryPolicy.compose(*policies)`; passing an `Array` to `retry_policy:` coerces through the same factory. |
+| Scope | **All four sites**, including the workflow-level class default. |
+| Routing | First match wins. `retry_on: nil` (catch-all) typically last; no match → no retry (fail fast). |
+| Purity | `RetryPolicy` and `CompositeRetryPolicy` stay pure value objects. The per-error count is incremented/read by the executor via a block; the policies never touch storage. |
+## Why increment-then-check preserves existing semantics
+`attempts` is 1-based and includes the failure that just happened (so on the
+first failure `attempts == 1`, and `retryable?` checks `attempts < max_attempts`).
+The composite mirrors this: on each failure it **increments** the matched
+policy's counter **then** checks, so the value handed to `retryable?` is "failures
+routed to this policy so far, including the current one" — the same shape the
+single-policy path already uses. The two notions agree, so a per-error count
+substitutes cleanly wherever a plain policy uses `attempts`.
+## Components
+### 1. `RetryPolicy` (existing — additions, pure)
+- `matches?(error)` — public routing predicate; wraps the existing private
+  `retryable_error?`. `retry_on: nil` matches any `StandardError` (catch-all);
+  `retry_on: []` matches nothing. Subclass-correct (`error.is_a?`).
+- `retry_backoff(error, attempts:) { |policy_index| count }` — returns the backoff
+  `Duration` to retry, or `nil` to stop. The plain policy **ignores the block**
+  and uses `attempts`:
+  ```ruby
+  def retry_backoff(error, attempts:)
+    retryable?(error, attempts) ? backoff_for(attempts) : nil
+  end
+  ```
+- `self.compose(*policies)` — factory returning a `CompositeRetryPolicy`.
+`retryable?`, `backoff_for`, `max_attempts` are unchanged (tests and the
+single-policy path depend on them).
+### 2. `CompositeRetryPolicy` (new — `executor/composite_retry_policy.rb`, pure)
+```ruby
+class CompositeRetryPolicy
+  attr_reader :policies
+  def initialize(policies)
+    @policies = Array(policies)
+    raise ArgumentError, "composite retry policy needs at least one policy" if @policies.empty?
+  end
+  # First sub-policy whose retry_on matches the error, or nil.
+  def policy_for(error)
+    @policies.find { |p| p.matches?(error) }
+  end
+  # Routes on the *live* error, yields the matched policy's index so the caller
+  # can increment and return that policy's running count, then delegates the
+  # decision to the matched sub-policy.
+  def retry_backoff(error, attempts:)
+    idx = @policies.index { |p| p.matches?(error) }
+    return nil if idx.nil?
+    sub   = @policies[idx]
+    count = block_given? ? yield(idx) : attempts
+    sub.retryable?(error, count) ? sub.backoff_for(count) : nil
+  end
+  # Coarsest bound, for the workflow-level safety-net guard in `perform`.
+  # nil if any sub-policy is unbounded.
+  def max_attempts
+    caps = @policies.map(&:max_attempts)
+    caps.include?(nil) ? nil : caps.max
+  end
+end
+```
+Routing is by `matches?`, which is `is_a?`-based, so a subclass of a `retry_on`
+class routes to the right policy. The returned **index** — not the error class —
+is the counter key, so subclasses share the budget of the policy they routed to,
+exactly as intended.
+### 3. Executor wiring
+- **`coerce_policy(value)`** — `Array` → `RetryPolicy.compose(*value)`; a
+  `RetryPolicy` or `CompositeRetryPolicy` passes through; `nil` → `nil`. Applied
+  in `step_retry_policy` and `wait_retry_policy`, and to the class DSL.
+- **Class DSL** `retry_policy(*policies, **opts)` — positional policies →
+  `RetryPolicy.compose(*policies)` stored as `default_retry_policy`; kwargs only →
+  `RetryPolicy.new(**opts)` (unchanged). Mixing both raises `ArgumentError`.
+- **Per-error counter, step sites** — one helper, incrementing the matched
+  policy's slot in the log metadata and returning the new count:
+  ```ruby
+  RETRY_COUNTS_KEY = "retry_counts"
+  def bump_retry_count!(log, policy_index)
+    meta   = log.metadata || {}
+    counts = meta[RETRY_COUNTS_KEY] || {}
+    key    = policy_index.to_s
+    counts[key] = counts[key].to_i + 1
+    meta[RETRY_COUNTS_KEY] = counts
+    log.update!(metadata: meta)   # explicit reassign so the JSON column is marked dirty
+    counts[key]
+  end
+  ```
+- **Per-error counter, workflow level** — `perform` gains `retry_counts: {}` in
+  its signature; the failure path increments the in-memory map and threads it
+  through the reschedule, beside `attempt:`. No DB write (mirrors `attempt:`).
+- **The four retry sites** change from the `retryable? … backoff_for` pair to a
+  single `retry_backoff` call carrying the site's counter block. A single policy
+  ignores the block, so its path is unchanged and writes no `retry_counts`:
+  ```ruby
+  backoff = policy.retry_backoff(e, attempts: COUNT) { |idx| <bump counter for idx> }
+  if backoff
+    self.class.set(wait: backoff).perform_later(@workflow.key, *site_args)
+    halt_execution!
+  else
+    # site-specific terminal action
+  end
+  ```
+  Per-site `COUNT` / counter store / terminal action:
+  | Site | `COUNT` (single-policy) | Composite counter store | Terminal action |
+  |---|---|---|---|
+  | `perform` (workflow) | `attempt + 1` | job-args `retry_counts` (rescheduled with `attempt: attempts_made, retry_counts:`) | `fail_workflow!(error_log)` |
+  | `durably_execute` | `execution_log.attempts` | `execution_log.metadata` | mark failed, raise `ExecutionFailedError` |
+  | `durably_repeat` | `repetition_log.attempts` | `repetition_log.metadata` | `on_error` (`:continue` / `:fail_workflow`) |
+  | `wait_until` | `execution_log.attempts` | `execution_log.metadata` | mark failed, raise `ExecutionFailedError` |
+  `durably_repeat` keys its counter on the per-repetition log, so each repetition
+  gets its own independent per-error budgets.
+## Notable properties
+- **Per-error backoff escalation.** `backoff_for(count)` uses the per-error count
+  as the exponent, so each error type's backoff escalates on its own schedule.
+- **No class-string matching at decision time.** Subclass resolution is a single
+  `is_a?` on the live error during routing; the counter is keyed by policy index.
+- **`wait_until` still does not inherit the class default.** A per-call array is
+  coerced normally; the class-level composite default does not leak in.
+- **Workflow-level safety net.** `perform`'s early guard
+  (`policy.max_attempts && attempt >= policy.max_attempts`) keeps working because
+  `CompositeRetryPolicy#max_attempts` returns the coarsest bound — a safe
+  over-estimate that never kills a workflow prematurely.
+- **Ordering matters.** Specific policies first, catch-all (`retry_on: nil`) last;
+  without a catch-all, an unmatched error fails fast. (Documented footgun.)
+- **Mid-flight reorder caveat.** Counts are keyed by policy index, so reordering a
+  composite's policies while a long-running workflow is in flight can misattribute
+  in-progress counts. Reordering retry config mid-workflow is ambiguous under any
+  scheme; documented as a known edge.
+## Testing
+**Unit — `CompositeRetryPolicy`**
+- routing: first match wins; specific-before-catch-all
+- catch-all (`retry_on: nil`) matches anything; `retry_on: []` matches nothing
+- subclass of a `retry_on` class routes to that policy (and yields its index)
+- no match → `retry_backoff` returns `nil`
+- `retry_backoff` yields the matched policy's index and uses the yielded count for
+  both the cap check and the backoff exponent
+- `max_attempts` = coarsest bound; `nil` if any sub-policy unbounded
+- empty policy list raises `ArgumentError`
+**Unit — `RetryPolicy` additions**
+- `matches?` semantics for `nil` / `[]` / class list incl. subclasses
+- `retry_backoff` (plain) ignores the block, returns `nil` past the cap
+- `RetryPolicy.compose` builds a `CompositeRetryPolicy`
+**Unit — `bump_retry_count!`**
+- increments the right index slot; independent slots accumulate independently
+- reassigns `metadata` so the JSON column persists across reload
+- `nil`/absent `metadata` initializes cleanly
+**Integration**
+- a step raising different error types accumulates independent per-error budgets
+  and per-error backoff; fail-fast policy (`max_attempts: 1`) stops immediately;
+  subclass of a `retry_on` class draws from the parent policy's budget
+- regression: a single `RetryPolicy` (per-call, class default, built-in) behaves
+  identically to today and writes no `retry_counts`
+- array passed to `retry_policy:` is coerced to a composite
+- workflow-level composite default routes correctly, threads `retry_counts`
+  through reschedules, and the `perform` safety-net guard honors the coarse
+  `max_attempts`
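
To make the per-error budgets in the composite spec above concrete, the sketch below routes a few errors through a two-policy composite and tracks counts in a plain hash. It is an illustration only: the `RetryPolicy` here is a cut-down stand-in without jitter, the in-memory `counts` hash replaces the log metadata that `bump_retry_count!` would write, and the error classes are hypothetical.

```ruby
# Cut-down stand-in for RetryPolicy (no jitter) so the example runs alone.
class RetryPolicy
  attr_reader :max_attempts

  def initialize(max_attempts: 3, base: 1, cap: 30, retry_on: nil)
    @max_attempts = max_attempts
    @base = base
    @cap = cap
    @retry_on = retry_on
  end

  def matches?(error)
    return error.is_a?(StandardError) if @retry_on.nil?

    @retry_on.any? { |klass| error.is_a?(klass) }
  end

  def retryable?(error, attempts)
    (@max_attempts.nil? || attempts < @max_attempts) && matches?(error)
  end

  def backoff_for(attempts)
    [@cap, @base * 2**(attempts - 1)].min
  end
end

# Same routing logic as the spec's CompositeRetryPolicy#retry_backoff.
class CompositeRetryPolicy
  def initialize(policies)
    @policies = Array(policies)
  end

  def retry_backoff(error, attempts:)
    idx = @policies.index { |p| p.matches?(error) }
    return nil if idx.nil?

    count = block_given? ? yield(idx) : attempts
    sub = @policies[idx]
    sub.retryable?(error, count) ? sub.backoff_for(count) : nil
  end
end

# Hypothetical error classes for the demonstration.
NetworkError = Class.new(StandardError)
TimeoutNetworkError = Class.new(NetworkError)
PaymentDeclinedError = Class.new(StandardError)

policy = CompositeRetryPolicy.new([
  RetryPolicy.new(max_attempts: 3, retry_on: [NetworkError]),
  RetryPolicy.new(max_attempts: 1, retry_on: [PaymentDeclinedError])
])

counts = Hash.new(0)                 # stand-in for the retry_counts metadata
bump = ->(idx) { counts[idx] += 1 }  # increment, then return the new count

first     = policy.retry_backoff(NetworkError.new, attempts: 1, &bump)
second    = policy.retry_backoff(TimeoutNetworkError.new, attempts: 2, &bump)
third     = policy.retry_backoff(NetworkError.new, attempts: 3, &bump)
declined  = policy.retry_backoff(PaymentDeclinedError.new, attempts: 4, &bump)
unmatched = policy.retry_backoff(ArgumentError.new, attempts: 5, &bump)
```

The subclass `TimeoutNetworkError` draws from the `NetworkError` budget because the counter is keyed by policy index, the `max_attempts: 1` policy stops on its first failure, and the unmatched `ArgumentError` fails fast without touching any counter.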