RubyGems - chrono_forge - Versions diffs - 0.9.1 → 0.10.0 - Mend

chrono_forge 0.9.1 → 0.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (41)

data/docs/superpowers/specs/2026-06-26-dashboard-branch-view-design.md ADDED

@@ -0,0 +1,142 @@
+# Dashboard Branch View — Design
+**Date:** 2026-06-26
+**Status:** Design only. **BLOCKED on the branches core feature**
+([`2026-06-25-spawn-merge-branches-design.md`](2026-06-25-spawn-merge-branches-design.md),
+itself still Draft). Nothing here can be built or tested until `parent_execution_log_id`,
+`branch`/`spawn`/`merge_branches`, and the `spawned_workflows` association exist in the
+core gem. This spec is written so it can be implemented the day branches ships.
+**Scope:** additive views in the `chrono_forge-dashboard` engine. No core changes.
+## Problem
+With fan-out, a parent can park at `merge_branches` (Option A) because **one** child
+among tens or hundreds of thousands is `failed`/`stalled`. Today there is no way to see
+that: the parent looks idle, and the blocking child is a needle in a haystack. The branch
+view exists to answer, in one screen: *which branch is blocking this parent, how many
+children are outstanding, and which specific children are failed/stalled — with a Retry on
+each.* It is what makes Option A (park until recovered) operable in production.
+## What it consumes from the core (the contract)
+This view reads only what the branches spec defines. If any of these change, this spec
+changes with it.
+- **Column** `chrono_forge_workflows.parent_execution_log_id` (FK → `execution_logs.id`,
+  nullable) with composite index `(parent_execution_log_id, state)`.
+- **Associations** `Workflow#parent_execution_log` (→ `ExecutionLog`) and
+  `ExecutionLog#spawned_workflows` (→ `Workflow`, FK `parent_execution_log_id`).
+- **Branch log:** an execution log with `step_name` `"branch$<name>"`,
+  `state` `pending` (dispatching) | `completed` (sealed), and `metadata`
+  `{ "automerge" => bool, "merged" => bool, "cursors" => { "<spawn>" => { "pk", "n" } } }`.
+- **Merge log:** `"merge$<names>"`, `pending` while polling → `completed`.
+- A child is a `Workflow`; its parent branch log is `child.parent_execution_log`; the
+  parent workflow is `branch_log.workflow`.
+The reusable `StepNameParser` (already in the engine) gains `branch` / `merge` kinds.
+## Design
+### 1. Branches panel on the parent's detail page
+A new section on `workflows#show`, rendered only when the workflow has any `branch$%`
+logs. One row per branch (`BranchPresenter`), showing **health**:
+| Field | Source | Notes |
+|---|---|---|
+| name | `StepNameParser.parse(log.step_name).name` | |
+| status | `log.state` | `completed` → **sealed**; `pending` → **dispatching** (still spawning) |
+| join | `metadata.automerge` / `metadata.merged` | "automerge", "merged", or "unmerged" |
+| dispatched | `sum(metadata.cursors[*].n)` + explicit `spawn` count | cheap (from metadata), avoids counting rows |
+| pending | `spawned_workflows.where.not(state: :completed).limit(CAP).count` | **capped, index-only** (O(CAP)); shows `"5000+"` past CAP |
+| blocked | `spawned_workflows.where(state: [:failed, :stalled]).limit(CAP).count` | the actionable number; rendered in rose when > 0 |
+Each branch row links to its **children view** (below) and, when `blocked > 0`, a direct
+"View blocked" link pre-filtered to failed/stalled.
+A parent parked on a merge also surfaces its `merge$<names>` log(s) here ("merging
+invoicing — pending"), so the park is legible.
+### 2. Branch children view (drill-down)
+A new route + controller, because a branch can hold **hundreds of thousands** of children
+— they are never all rendered.
+- Route: `GET /workflows/:workflow_id/branches/:branch_log_id` →
+  `BranchChildrenController#show` (scoped to the branch log; verifies it belongs to the
+  workflow).
+- **Reuses `WorkflowsQuery`** over `branch_log.spawned_workflows` (same state/key filters,
+  pagination). **Default filter: `failed` + `stalled` first** — the triage default, so the
+  blockers are the landing view rather than page 1 of 500k.
+- Reuses the existing `_workflow_row` partial (children are workflows) plus a per-row
+  **Retry** (and the child's own key links to its detail).
+- A capped state-count strip at the top (completed/running/idle/failed/stalled), each an
+  O(CAP) index-only count rendered as `"N"` or `"CAP+"`.
+### 3. Per-child recovery
+Children are workflows, so recovery reuses the existing `ActionsController`:
+- Per-child **Retry** (`workflow.retry_later`) in each row and on the child detail.
+- A **"Retry all blocked in this branch"** bulk action: iterate
+  `branch_log.spawned_workflows.where(state: [:failed, :stalled]).find_each(&:retry_later)`
+  (a scoped sibling of the existing bulk-retry). After recovery the parent's merge poll
+  resolves on its own (Option A) — the view does not touch the parent.
+### 4. Child → parent linkage
+On `workflows#show`, when `@workflow.parent_execution_log_id` is present, render a
+**breadcrumb**: `parent key › branch <name> › this child`, linking to the parent and the
+branch children view. Cheap: one `parent_execution_log` + its `workflow`.
+### 5. Tree view (nested branches)
+Branches nest (a child may open its own branches). The parent panel shows **one level**
+(this workflow's branches + per-branch child summary); you navigate down by opening a
+child (whose own detail shows its branches) rather than rendering an unbounded tree on one
+page. The breadcrumb provides the up-path. This keeps every page O(page_size), never
+O(tree).
+## Components
+- `app/presenters/.../branch_presenter.rb` — one branch log → health struct (capped
+  counts, dispatched-from-cursor, sealed/merged flags).
+- `app/presenters/.../branches_presenter.rb` — a workflow's `branch$%` + `merge$%` logs.
+- `app/controllers/.../branch_children_controller.rb` — `#show`, scoped children list.
+- `app/queries/.../workflows_query.rb` — extend to accept a base scope (so it can run over
+  `branch_log.spawned_workflows`, not just `Workflow.all`).
+- `ActionsController#bulk_retry_branch` — scoped bulk retry.
+- Views: `_branches.html.erb` (panel on show), `branch_children/show.html.erb`,
+  `_parent_breadcrumb.html.erb`; `StepNameParser` branch/merge kinds.
+- Routes: nested `branches/:branch_log_id` under `workflows`; a member `bulk_retry` on it.
+## Scale guardrails (non-negotiable)
+- **Never** `group(:state).count` an unbounded child set on a page load. All counts are
+  **capped** (`limit(CAP)`) and index-only on `(parent_execution_log_id, state)`, shown as
+  `"CAP+"` past the cap — mirroring the merge probe.
+- **Never** render more than one page of children. Default to the blocked subset.
+- "Dispatched" total comes from `metadata.cursors` (`n`), not a row count.
+- The branches panel issues at most ~2 capped probes per branch (pending, blocked) — bounded
+  regardless of child count.
+## Testing (once branches exists)
+Seed parent + `branch$<name>` logs + child workflow rows with `parent_execution_log_id`
+(no need to run real fan-out):
+- branches panel: sealed vs dispatching; automerge/merged/unmerged; pending + blocked
+  capped counts (incl. a `>CAP` case showing `"CAP+"`); rose styling when blocked > 0.
+- children view: default filter shows only failed/stalled; state filter + pagination work
+  over the scoped relation; per-child Retry calls `retry_later`.
+- scoped bulk retry hits only that branch's failed/stalled children.
+- breadcrumb: a child renders a link to its parent + branch; a non-child renders none.
+- merge log surfaced when the parent is parked.
+## Open questions (confirm on review)
+1. **Counts beyond CAP** — show `"5000+"` (capped) everywhere, or pay an exact `COUNT` for
+   the *blocked* number only (usually small) while capping pending? (Leaning: exact for
+   blocked, capped for pending.)
+2. **Children view default** — land on failed/stalled (triage), or all-with-failed-first?
+   (Leaning: failed/stalled, with a clear "show all" toggle.)
+3. **Tree depth** — one level per page + breadcrumb (this spec), or a shallow expandable
+   tree for small fan-outs? (Leaning: one level; revisit if small-N trees feel clunky.)
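
Reviewer sketch (not part of the package): the spec above derives "dispatched" from `metadata.cursors` and renders capped counts as `"N"` or `"CAP+"`. A minimal plain-Ruby illustration of those two rules follows. The helper names `dispatched_from_cursors` and `capped_label`, the sample metadata, and treating a count that reaches the cap as "CAP or more" are all assumptions, and the explicit `spawn` count the spec adds to the cursor sum is left out.

```ruby
CAP = 5_000 # assumed to match the cap used by the merge probe

# Sum the per-spawn cursor counts ("n") stored on a branch log's metadata.
# Hypothetical helper; the spec also adds an explicit spawn count, omitted here.
def dispatched_from_cursors(metadata)
  metadata.fetch("cursors", {}).values.sum { |cursor| cursor["n"].to_i }
end

# A limit(CAP) count can never exceed CAP, so a count that reaches the cap
# is rendered as "CAP+" (assumption: "reached the cap" means "cap or more").
def capped_label(count, cap = CAP)
  count >= cap ? "#{cap}+" : count.to_s
end

# Sample metadata in the shape the spec describes; the values are made up.
metadata = {
  "automerge" => false,
  "merged" => false,
  "cursors" => {
    "invoices" => { "pk" => 912, "n" => 900 },
    "refunds" => { "pk" => 77, "n" => 40 }
  }
}

puts "dispatched: #{dispatched_from_cursors(metadata)}"
puts "pending:    #{capped_label(12)}"
puts "pending:    #{capped_label(5_000)}"
```

Reading the total from metadata keeps the panel independent of the child row count, which is the point of the spec's scale guardrails.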

data/docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md ADDED

@@ -0,0 +1,265 @@
+# ChronoForge — deferral continuation race & catch-up surge
+**Date:** 2026-06-26
+**Gem:** `chrono_forge` 0.9.1
+**Status:** design approved, ready for implementation plan
+## Problem
+Two related findings in how ChronoForge's deferral primitives (`wait`, `wait_until`,
+`durably_execute` retry, `durably_repeat`, workflow-level retry) schedule their
+continuation jobs. Both are functionally benign in 0.9.1 (no lost work, no double
+execution) but generate avoidable job/lock churn and log noise, and they interact.
+### Issue 1 — continuation/lock-release race (`ConcurrentExecutionError`)
+Every deferral primitive enqueues its continuation **inline** and then halts:
+```ruby
+self.class.set(wait: delay).perform_later(@workflow.key)   # (1) enqueue continuation
+halt_execution!                                            # (2) raise HaltExecutionFlow
+```
+The executor releases the lock in `ensure`, **after** the body runs
+(`executor.rb:168-172`). So within one job run the order is: **enqueue continuation →
+halt → (ensure) release lock.** The continuation is published while the current job
+still holds the lock.
+When the continuation is **immediately runnable** (`delay == 0`), SolidQueue puts it
+straight in `ready_executions`. With multiple workers, a free worker can claim and
+start it in the window between (1) and the `ensure` release. That second job calls
+`acquire_lock`, finds `locked_at > max_duration.ago` (still freshly held by the first
+job), and raises `ConcurrentExecutionError` at lock acquisition (failing
+`execution_log.step_name` is `nil` — before any step).
+`delay == 0` arises when:
+- `wait` targets computed against wall-clock times already in the past on replay, and
+- **every fast-forwarded tick in Issue 2** (`delay = max(next − now, 0) = 0`).
+Benign today (loser is rescued, winner proceeds, continuation replays idempotently),
+but costs wasted job executions, redundant lock attempts, and log noise.
+### Issue 2 — catch-up is O(missed intervals)
+When a `durably_repeat` workflow resumes far behind schedule, each missed tick is
+handled by `execute_repetition_now`. For an expired tick it correctly **skips the
+periodic method**, but then advances by exactly **one interval** and enqueues a **new
+job** (`durably_repeat.rb:200-212`, `:271-293`):
+```ruby
+if Time.current > repetition_log.metadata["timeout_at"]
+  repetition_log.update!(state: :failed, error_class: "TimeoutError")
+  schedule_next_execution_after_completion(...)   # advance ONE interval + enqueue a job
+  return                                           # method NOT run (work correctly skipped)
+end
+```
+So expiry is a **work skip, not an iteration skip**. Walking a workflow from a far-past
+`start_at` up to `now` churns through **one `delay == 0` job per missed interval** — each
+job marks one tick timed-out, schedules the next, and halts. Resuming ~14 dormant
+daily/weekly schedulers generated ~6,000 back-to-back `delay == 0` jobs. Every one of
+those is the maximal trigger for Issue 1.
+Worst case: a workflow resuming from genesis (no prior coordination/repetition logs)
+with an ancient `start_at`.
+## Enqueue sites (complete inventory)
+All 8 continuation enqueues are `.set(wait:).perform_later`; `continue_if` halts with no
+continuation (it waits for an external trigger — correctly needs no fix).
+| # | Site | kwargs passed | delay |
+|---|------|---------------|-------|
+| 1 | `executor.rb:163` workflow retry | `attempt:, retry_counts:` | backoff |
+| 2 | `wait.rb:107` reschedule | — | duration |
+| 3 | `wait_until.rb:135` cond-error retry | — | backoff |
+| 4 | `wait_until.rb:181` poll | `wait_condition:` | check_interval |
+| 5 | `durably_execute.rb:112` retry | — | backoff |
+| 6 | `durably_repeat.rb:193` schedule-later | — | delay |
+| 7 | `durably_repeat.rb:235` repetition retry | — | backoff |
+| 8 | `durably_repeat.rb:288` schedule-next | — | delay (=0 in surge) |
+## Fix — Section 1: deferred continuation flush
+Primitives stop calling `perform_later` inline. They **record** the intended
+continuation on the instance; the executor flushes it in `ensure`, **after**
+`release_lock`. The continuation becomes claimable only once the lock row reads
+released, so no second worker can lose the acquire race. This is the report's
+**option 1** (fully closes the window), not option 3 (epsilon delay heuristic).
+**Single slot suffices.** Every primitive enqueues at most one continuation and then
+either raises `HaltExecutionFlow` (sites 2–8) or falls through `rescue => e` into
+`ensure` (site 1). No path schedules two.
+```ruby
+# executor.rb — new private helper
+def enqueue_continuation(wait:, **kwargs)
+  @continuation = {wait: wait, kwargs: kwargs}
+end
+```
+Each of the 8 sites changes from:
+```ruby
+self.class.set(wait: delay).perform_later(@workflow.key)   # or with kwargs
+halt_execution!
+```
+to:
+```ruby
+enqueue_continuation(wait: delay)                          # kwargs preserved per-site
+halt_execution!
+```
+Flush in `ensure` (`executor.rb:168`), strictly ordered after release:
+```ruby
+ensure
+  if lock_acquired
+    context.save!
+    self.class::LockStrategy.release_lock(job_id, workflow)
+    flush_continuation!                  # NEW — only now is the next job claimable
+  end
+end
+def flush_continuation!
+  return unless @continuation
+  self.class.set(wait: @continuation[:wait]).perform_later(@workflow.key, **@continuation[:kwargs])
+end
+```
+**Ordering guarantee:** `save! → release_lock → flush`. The continuation is published
+only after the lock row is updated to released, so even a `delay == 0` continuation
+finds the lock free.
+**Edge cases:**
+- If `release_lock` raises `LongRunningConcurrentExecutionError` (this job overran
+  `max_duration` and lost the lock), we do **not** flush — correct, another job already
+  owns the continuation.
+- Site 1 (workflow retry) isn't a halt, but routing it through the same slot keeps all
+  enqueues post-release and is harmless (backoff is normally > 0 anyway).
+- `@continuation` is per-job-execution instance state; nil unless a primitive set it.
+## Fix — Section 2: closed-form fast-forward of the expired prefix
+In `durably_repeat` (`durably_repeat.rb:143-151`), after the naive `next_execution_at`
+is computed and before `execute_or_schedule_repetition`, jump past the expired prefix in
+closed form instead of walking one job per tick.
+**Skip rule (from the code):** a tick `t` is expired iff `Time.current > t + timeout`,
+i.e. `t < now − timeout`. Find the smallest tick on the grid `next_execution_at + n·every`
+(n ≥ 0) that is **not** expired (`t ≥ now − timeout`):
+```ruby
+def fast_forward_expired_prefix(next_execution_at, every, timeout)
+  cutoff = Time.current - timeout
+  return next_execution_at if next_execution_at >= cutoff   # nothing expired
+  gap = cutoff - next_execution_at
+  n = (gap / every.to_f).ceil                               # n ≥ 1 here
+  Rails.logger.info {
+    "ChronoForge:#{self.class}(#{@workflow.key}) durably_repeat fast-forwarded #{n} expired tick(s)"
+  }
+  next_execution_at + (n * every)
+end
+```
+**Why anchor on `next_execution_at`, not `start_at`.** `next_execution_at` is always
+already on the canonical grid `anchor + k·every`:
+1. `start_at` given, no `last_execution_at` → `next = start_at`. On-grid (k=0).
+2. No `start_at`, no `last_execution_at` → `next = created_at + every`. On-grid (k=0).
+3. `last_execution_at` present → `next = last_execution_at + every`. On-grid because
+   `last_execution_at` stores the **scheduled** tick time, not wall-clock:
+   `schedule_next_execution_after_completion` writes `current_execution_time.iso8601`
+   (`durably_repeat.rb:275`), where `current_execution_time` is the scheduled tick, not
+   `Time.current`. By induction, lateness never enters the recurrence.
+So jumping by integer multiples of `every` from `next_execution_at` stays exactly on the
+grid — **no drift**. Anchoring the ceil on `start_at` (as the report's formula literally
+writes) would compute against a different anchor than the grid the workflow is actually
+on (branches 2 and 3) and could land between real ticks.
+**Boundary correctness — only the expired prefix is skipped.** The jump lands on the
+first tick with `t ≥ now − timeout`, which is either:
+- **in-window** (`now − timeout ≤ t ≤ now`): `execute_or_schedule_repetition` sees
+  `t ≤ now` → runs `execute_repetition_now`, which re-checks `now > timeout_at` (now
+  false) → **executes the work**. Legitimate catch-up preserved.
+- **future** (`t > now`): → `schedule_repetition_for_later`. Normal.
+If `timeout > every` there can be several in-window ticks; those still walk one job each
+by design (real work, not bookkeeping). Only the expired prefix collapses to O(1).
+**Coordination-log bookkeeping.** As part of the fast-forward, set the coordination
+log's `last_execution_at = (first_valid − every).iso8601` (same format the reader
+`Time.parse` expects). A replay then recomputes `naive_next = last_execution_at + every
+= first_valid` — stable and idempotent — and the expired prefix produces **one metadata
+update** instead of N `failed/TimeoutError` repetition rows and N jobs.
+**One summary row for the skipped prefix (decided).** Instead of N `failed/TimeoutError`
+repetition rows, the fast-forward writes a **single** durable `ExecutionLog` covering the
+whole skipped prefix, so the skip stays dashboard-visible and queryable:
+- **step_name:** `durably_repeat$<name>$<last_skipped_tick.to_i>`, where
+  `last_skipped_tick = first_valid − every`. This is the last expired grid tick, so it is
+  unique and **never collides** with the repetition row for `first_valid` (the first
+  in-window/future tick, which `execute_or_schedule_repetition` still creates and runs).
+- **state:** `failed` (the enum has only `pending/completed/failed` — no migration),
+  **error_class:** `"TimeoutError"`, **error_message:** `"Fast-forwarded N expired tick(s)"`.
+- **metadata:** `{ fast_forwarded: N, from: <first_expired.iso8601>,
+  to: <last_skipped.iso8601>, scheduled_for: <last_skipped>, timeout_at: <last_skipped + timeout>,
+  parent_id: <coordination_log.id> }` — mirrors the existing repetition-log metadata shape
+  plus the `fast_forwarded`/`from`/`to` summary fields.
+Created via `find_or_create_execution_log!`, so it is idempotent on replay (and the
+3-segment step name is correctly excluded from `completed_step_cache`, matching ordinary
+repetition logs). A `Rails.logger.info { "...fast-forwarded N expired tick(s)" }` line is
+also emitted for ops. This is a deliberate behavior change from 0.9.1's one-row-per-tick.
+The existing dashboard step-name parser already handles 3-segment
+`durably_repeat$<name>$<ts>` repetition steps, so **no dashboard change is required** for
+this plan; the summary row renders like any other repetition log.
+**Observable-behavior change → existing tests updated.** Two tests assert the old
+per-tick tombstones via `timeout: -1.second` and must be updated to the new behavior:
+- `durably_repeat_test.rb:116` `test_durably_repeat_with_timeout` — asserts
+  `timeout_logs.size > 0` (filtering `error_message == "Execution timed out"`); flip to
+  asserting **no** `Execution timed out` rows and exactly **one** `fast_forwarded` summary
+  row for the expired prefix.
+- `durably_repeat_test.rb:345` `test_durably_repeat_coordination_log_updated_on_timeout`
+  — its `last_execution_at`-advances assertion still holds; its `timeout_logs.size > 0`
+  assertion flips to asserting the single `fast_forwarded` summary row instead.
+The `wait_until` negative-timeout test (`error_log_correlation_test.rb:23`) is a
+different primitive and is unaffected. Catch-up tests using the default positive timeout
+(`test_durably_repeat_with_past_start_at`, etc.) are unaffected because nothing is
+expired under a 1-hour window.
+**Idempotency / replay safety.** The skipped ticks never get repetition logs, but
+they're never recomputed either (the jump advances `last_execution_at` past them), and
+all execution-log lookups are by exact `step_name` — nothing scans for the missing rows.
+Prior completed/failed ticks from before dormancy are untouched.
+## Interaction
+The two share a root: continuations are published as immediately-claimable, same-key
+jobs while/just-before the lock is released. The catch-up surge (Issue 2) is the
+maximal trigger for the race (Issue 1). Section 1 closes the race structurally;
+Section 2 removes the burst of `delay == 0` continuations that most reliably arms it.
+Both together remove the class of problem.
+## Testing
+- **Issue 1:** unit-test that each of the 8 primitives sets `@continuation` and does
+  **not** call `perform_later` inline; that the executor flushes after `release_lock`
+  (assert ordering — e.g. the enqueue observes the lock row released); that
+  `LongRunningConcurrentExecutionError` from `release_lock` suppresses the flush; that
+  per-site kwargs (`attempt`/`retry_counts`, `wait_condition`) are preserved.
+- **Issue 2:** unit-test `fast_forward_expired_prefix` returns `next_execution_at`
+  unchanged when nothing is expired; lands exactly on the first non-expired grid tick;
+  is on-grid across all three anchor branches; that an in-window first tick executes its
+  work while the expired prefix creates no repetition rows; that `last_execution_at` is
+  advanced so a replay is stable. Integration: resume a far-past daily schedule and
+  assert O(1) jobs/log rows for the expired prefix instead of O(missed intervals).
+```
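
Reviewer sketch (not part of the package): the closed-form fast-forward in Section 2 can be checked with a worked example outside Rails. The version below is a plain-Ruby variant under stated assumptions: `now` is passed in explicitly instead of calling `Time.current`, `every` and `timeout` are plain seconds, and the name `first_non_expired_tick` is hypothetical. It returns both the landing tick and the number of skipped ticks.

```ruby
require "time"

# Plain-Ruby variant of the spec's fast_forward_expired_prefix.
# Assumptions: `now` is injected, `every` and `timeout` are seconds.
def first_non_expired_tick(next_execution_at, every, timeout, now)
  cutoff = now - timeout                       # ticks before this are expired
  return [next_execution_at, 0] if next_execution_at >= cutoff

  gap = cutoff - next_execution_at             # Float seconds of expired prefix
  n = (gap / every.to_f).ceil                  # whole grid steps to jump
  [next_execution_at + (n * every), n]
end

day = 86_400
hour = 3_600
start = Time.utc(2026, 1, 1)                   # first scheduled tick (made up)
now = Time.utc(2026, 1, 11, 0, 30)             # resumed 10 days and 30 minutes later

tick, skipped = first_non_expired_tick(start, day, hour, now)
puts "skipped #{skipped} expired tick(s); next tick is #{tick.iso8601}"
```

With a daily interval and a one-hour timeout, the ten ticks from 1 January through 10 January are all expired, so the jump lands on 11 January 00:00 UTC. That tick is still inside the timeout window at 00:30, so it would run its work rather than being skipped.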

data/lib/chrono_forge/branch_merge_job.rb ADDED

@@ -0,0 +1,138 @@
+# frozen_string_literal: true
+module ChronoForge
+  # Lightweight poller that joins one or more branches. NOT a workflow — it holds
+  # no lock, does no replay, and carries no context. It exists so the heavy parent
+  # workflow is replayed only twice per merge (kick off + completion wake).
+  class BranchMergeJob < ActiveJob::Base
+    # The poller is the parent's only wake mechanism, so survive TRANSIENT
+    # infrastructure errors (DB connection/timeout/deadlock) with backoff. Any
+    # other error — a programming bug, a bad guard — is NOT retried: it propagates
+    # to the backend's failed-job queue where it's visible, rather than being
+    # silently retried-then-discarded (which would orphan the parent in :idle).
+    retry_on ActiveRecord::ConnectionNotEstablished,
+      ActiveRecord::ConnectionTimeoutError,
+      ActiveRecord::Deadlocked,
+      ActiveRecord::LockWaitTimeout,
+      wait: :polynomially_longer, attempts: 25
+    CAP = 5_000          # cap the pending count; beyond it we just pick max_interval
+    FACTOR = 0.06        # seconds of delay per pending child
+    REKICK_AFTER = 5.minutes
+    REKICK_BATCH = 200   # bound per-run rekicks; later polls handle the rest
+    def perform(parent_key, parent_job_class, branch_log_ids, min_interval, max_interval, token = nil)
+      raise ArgumentError, "branch_log_ids must not be empty" if branch_log_ids.empty?
+      # Fencing: every merge_branches pass mints a fresh token and writes it onto
+      # the branch logs, so a poller from a superseded chain (parent replay /
+      # re-enqueue) holds a stale token. It stops quietly — no poll, no wake, no
+      # reschedule — leaving only the newest chain to drive the merge. (A nil token
+      # is a pre-upgrade job enqueued before fencing existed; it runs unfenced.)
+      return if superseded?(branch_log_ids, token)
+      # Per-branch probe (kept as maps so we can persist each branch's own state,
+      # not just the merge aggregate). Same query count as a plain sum/all?.
+      pending_by_branch = branch_log_ids.to_h { |id| [id, BranchProbe.incomplete(id).limit(CAP).count] }
+      sealed_by_branch = branch_log_ids.to_h { |id| [id, BranchProbe.sealed?(id)] }
+      pending = pending_by_branch.values.sum
+      sealed = sealed_by_branch.values.all?
+      if sealed && pending.zero?
+        record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at: nil)
+        parent_job_class.constantize.perform_later(parent_key)
+        return
+      end
+      rekick_dropped_jobs(branch_log_ids)
+      delay = reschedule_delay(pending, min_interval, max_interval)
+      record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at: delay.seconds.from_now)
+      self.class.set(wait: delay.seconds)
+        .perform_later(parent_key, parent_job_class, branch_log_ids, min_interval, max_interval, token)
+    end
+    private
+    # Adaptive poll cadence: scale the wait with the number of pending children,
+    # clamped to [min_interval, max_interval]. min_interval <= max_interval is
+    # enforced up front in merge_branches, so the clamp can't raise here.
+    def reschedule_delay(pending, min_interval, max_interval)
+      (pending * FACTOR).clamp(min_interval, max_interval)
+    end
+    # A poller is superseded when its token no longer matches what's stored on the
+    # branch logs (a newer merge_branches pass rotated it). A plain read is enough
+    # for the early-out; the persisting write in record_poll! re-checks the token
+    # under a row lock so it can never clobber the newer chain.
+    def superseded?(branch_log_ids, token)
+      logs = ExecutionLog.where(id: branch_log_ids).to_a
+      logs.empty? || logs.any? { |log| log.metadata&.dig("poll_token") != token }
+    end
+    # ActiveJob exposes no portable API to enumerate enqueued/scheduled jobs, so a
+    # poller in the backend's scheduled set is invisible to a backend-agnostic
+    # dashboard. We make the durable log the source of truth instead: each poll
+    # stamps its observable state onto every target branch log's metadata, so the
+    # dashboard can list in-flight merges (and a next_poll_at long in the past with
+    # work still pending is the signal that the poller was dropped). This is purely
+    # observational — replay and correctness never read it. It writes a "poll"
+    # sub-key, leaving spawn_each's "cursors" metadata untouched.
+    def record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at:)
+      now = Time.current
+      ExecutionLog.where(id: pending_by_branch.keys).find_each do |log|
+        # Lock the row so this read-modify-write can't clobber a concurrent token
+        # rotation (merge_branches) or another poller's metadata write — both touch
+        # the same JSON column. Re-check the token under the lock and skip if we've
+        # been superseded mid-run, so a stale poller never overwrites the fence.
+        log.with_lock do
+          meta = log.metadata || {}
+          next unless meta["poll_token"] == token
+          meta["poll"] = {
+            "last_polled_at" => now.iso8601,
+            "next_poll_at" => next_poll_at&.iso8601,
+            "pending" => pending_by_branch[log.id],
+            "sealed" => sealed_by_branch[log.id],
+            "polls" => meta.dig("poll", "polls").to_i + 1
+          }
+          log.update!(metadata: meta)
+        end
+      end
+    end
+    # A child that was dispatched but never picked up (its job was dropped by the
+    # backend) sits :idle with started_at nil. setup_workflow! stamps started_at
+    # on a child's first execution, so a nil started_at precisely means "never
+    # ran" — that's what we rekick on. It correctly excludes a child that ran and
+    # is now parked on a wait/wait_until (also :idle, but started_at is set):
+    # rekicking that would re-evaluate the wait condition prematurely and pile up
+    # duplicate scheduled jobs. We also require the row to be stale past
+    # REKICK_AFTER (a freshly dispatched child just hasn't been grabbed yet) and
+    # keep the :idle guard (a running/failed/stalled child must never be
+    # re-dispatched). Re-enqueue of an :idle child a worker just grabbed is still
+    # safe — the lock guard rejects the duplicate. Capped per run.
+    def rekick_dropped_jobs(branch_log_ids)
+      branch_log_ids.each do |id|
+        Workflow.where(parent_execution_log_id: id, state: Workflow.states[:idle], started_at: nil)
+          .where("updated_at < ?", REKICK_AFTER.ago)
+          .limit(REKICK_BATCH)
+          .find_each do |child|
+            # Intentionally uses the GUARDED perform_later (single-child path),
+            # unlike the bulk perform_all_later bypass in dispatch_children.
+            #
+            # Rekick is best-effort recovery, so one bad child must never sink the
+            # poll: a raise here (e.g. cross-version kwarg drift failing the enqueue
+            # guard) would abort the whole run and — since it isn't a transient AR
+            # error — dead-letter the poller, orphaning every healthy sibling. Catch
+            # per child, log, and let the next poll retry it (it's still idle+stale).
+            child.job_klass.perform_later(child.key, **child.kwargs.symbolize_keys)
+          rescue => e
+            Rails.logger.error do
+              "ChronoForge:BranchMergeJob rekick failed for child #{child.key}: " \
+              "#{e.class}: #{e.message}"
+            end
+          end
+      end
+    end
+  end
+end
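
Reviewer sketch (not part of the package): the adaptive poll cadence in `reschedule_delay` is a single clamp, which is easy to try in isolation. The snippet copies `FACTOR` and the method body from the diff above; the interval bounds of 5 and 120 seconds are assumed values for illustration only.

```ruby
FACTOR = 0.06 # seconds of delay per pending child, as in the job above

# Scale the wait with the pending count, clamped to [min_interval, max_interval].
def reschedule_delay(pending, min_interval, max_interval)
  (pending * FACTOR).clamp(min_interval, max_interval)
end

# Assumed bounds: poll no more often than every 5s, no less often than every 120s.
[10, 500, 5_000].each do |pending|
  puts "#{pending} pending -> wait #{reschedule_delay(pending, 5, 120)}s"
end
```

A small backlog is polled at the minimum interval, a mid-sized one at roughly 0.06 seconds per pending child, and a backlog at the cap saturates at the maximum interval. `Comparable#clamp` raises `ArgumentError` if the minimum exceeds the maximum, which is why the source comment notes that the bounds are validated up front in `merge_branches`.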

data/lib/chrono_forge/branch_probe.rb ADDED

@@ -0,0 +1,26 @@
+# frozen_string_literal: true
+module ChronoForge
+  # Single source of truth for "is this branch done?" — used by both merge_branches
+  # (boolean) and BranchMergeJob (which needs the sealed flag and pending count
+  # separately for its adaptive poll cadence). Option A: only :completed counts as
+  # done, so a failed/stalled child keeps the branch pending until recovered.
+  module BranchProbe
+    module_function
+    # The branch's coordination log is sealed (fully dispatched).
+    def sealed?(branch_log_id)
+      ExecutionLog.where(id: branch_log_id, state: ExecutionLog.states[:completed]).exists?
+    end
+    # Relation of this branch's children that are not yet completed.
+    def incomplete(branch_log_id)
+      Workflow.where(parent_execution_log_id: branch_log_id)
+        .where.not(state: Workflow.states[:completed])
+    end
+    def done?(branch_log_id)
+      sealed?(branch_log_id) && !incomplete(branch_log_id).exists?
+    end
+  end
+end

data/lib/chrono_forge/cleanup.rb CHANGED

@@ -93,6 +93,12 @@ module ChronoForge
         ids = batch.ids
         next if ids.empty?
+        # Branch children point at their parent's branch$ execution log via
+        # parent_execution_log_id. Bulk delete bypasses the dependent: :nullify callback,
+        # so nullify explicitly to avoid dangling references when a parent is reclaimed.
+        Workflow.where(parent_execution_log_id: ExecutionLog.where(workflow_id: ids).select(:id))
+          .update_all(parent_execution_log_id: nil)
         # Delete dependent rows in bulk rather than relying on row-by-row
         # dependent: :destroy callbacks.
         result[:execution_logs] += ExecutionLog.where(workflow_id: ids).delete_all

data/lib/chrono_forge/execution_log.rb CHANGED

@@ -33,6 +33,12 @@ module ChronoForge
     belongs_to :workflow
+    has_many :spawned_workflows,
+      class_name: "ChronoForge::Workflow",
+      foreign_key: :parent_execution_log_id,
+      inverse_of: :parent_execution_log,
+      dependent: :nullify
     enum :state, %i[
       pending
       completed

data/lib/chrono_forge/executor/composite_retry_policy.rb ADDED

@@ -0,0 +1,47 @@
+module ChronoForge
+  module Executor
+    # An ordered list of RetryPolicy objects, each scoped to an error type via
+    # its `retry_on`. On failure the first policy whose `retry_on` matches the
+    # raised error (by `is_a?`) is applied, giving each error type its own
+    # independent attempt budget and backoff curve. Put specific policies first
+    # and a catch-all (`retry_on: nil`) last; an unmatched error is not retried.
+    #
+    # Pure: it never reads storage. The per-error count is supplied by the
+    # caller through the block passed to #retry_backoff, keyed by the matched
+    # policy's budget_key (its declared errors).
+    class CompositeRetryPolicy
+      attr_reader :policies
+      def initialize(policies)
+        @policies = Array(policies)
+        if @policies.empty?
+          raise ArgumentError, "composite retry policy needs at least one policy"
+        end
+      end
+      # First sub-policy whose retry_on matches the error, or nil.
+      def policy_for(error)
+        @policies.find { |p| p.matches?(error) }
+      end
+      # Routes on the live error and delegates the decision to the matched
+      # sub-policy. When a block is given it is called with the matched policy's
+      # budget_key and must return that policy's running attempt count (1-based,
+      # including the current failure); otherwise `attempts` is used.
+      def retry_backoff(error, attempts:)
+        sub = policy_for(error)
+        return nil if sub.nil?
+        count = block_given? ? yield(sub.budget_key) : attempts
+        sub.retryable?(error, count) ? sub.backoff_for(count) : nil
+      end
+      # Coarsest attempt bound across sub-policies, for the workflow-level
+      # safety-net guard. nil (unbounded) if any sub-policy is unbounded.
+      def max_attempts
+        caps = @policies.map(&:max_attempts)
+        caps.include?(nil) ? nil : caps.max
+      end
+    end
+  end
+end
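
Reviewer sketch (not part of the package): the routing in `CompositeRetryPolicy#retry_backoff` can be exercised without the gem. The real `RetryPolicy` class is not shown in this diff, so `StubPolicy` below is a hypothetical stand-in that implements only the methods the composite calls; its "count < max_attempts" rule and doubling backoff are assumptions. `CompositeSketch` is a condensed copy of the routing logic above, and the error classes and counts are made up.

```ruby
# Hypothetical stand-in for RetryPolicy: only the methods the composite calls.
# The "count < max_attempts" rule and the doubling backoff are assumptions.
StubPolicy = Struct.new(:retry_on, :max_attempts, :base, keyword_init: true) do
  def matches?(error)
    retry_on.nil? || Array(retry_on).any? { |klass| error.is_a?(klass) }
  end

  def retryable?(error, count)
    matches?(error) && (max_attempts.nil? || count < max_attempts)
  end

  def backoff_for(count)
    base * (2**(count - 1))
  end

  def budget_key
    Array(retry_on).map(&:name).join(",")
  end
end

# Condensed copy of the routing logic from the diff above.
class CompositeSketch
  def initialize(policies)
    @policies = Array(policies)
  end

  def retry_backoff(error, attempts:)
    sub = @policies.find { |policy| policy.matches?(error) }
    return nil if sub.nil?                      # unmatched error: no retry
    count = block_given? ? yield(sub.budget_key) : attempts
    sub.retryable?(error, count) ? sub.backoff_for(count) : nil
  end
end

class TimeoutLike < StandardError; end          # hypothetical error types
class RateLimited < StandardError; end

composite = CompositeSketch.new([
  StubPolicy.new(retry_on: TimeoutLike, max_attempts: 3, base: 1),
  StubPolicy.new(retry_on: RateLimited, max_attempts: 5, base: 10)
])

# Per-budget running counts, supplied by the caller through the block.
counts = { "TimeoutLike" => 2, "RateLimited" => 5 }
a = composite.retry_backoff(TimeoutLike.new, attempts: 99) { |key| counts.fetch(key) }
b = composite.retry_backoff(RateLimited.new, attempts: 99) { |key| counts.fetch(key) }
c = composite.retry_backoff(ArgumentError.new, attempts: 1)
puts [a, b, c].inspect
```

Each error type draws on its own budget: the timeout still has attempts left and gets a backoff, the rate-limit budget is exhausted, and the unmatched `ArgumentError` is not retried because no catch-all policy was listed.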