RubyGems - chrono_forge - Versions diffs - 0.10.0 → 0.11.0 - Mend

chrono_forge 0.10.0 → 0.11.0

Files changed (29)

checksums.yaml +4 -4
data/CHANGELOG.md +34 -1
data/README.md +188 -105
data/Rakefile +4 -0
data/cliff.toml +62 -0
data/docs/design/per-child-commit-overhead.md +213 -0
data/docs/fanout-scale-test.md +247 -0
data/docs/superpowers/plans/2026-06-30-poller-rekick-and-eta-cadence.md +205 -0
data/docs/superpowers/plans/2026-06-30-poller-rekick-and-eta-cadence.md.tasks.json +33 -0
data/docs/superpowers/plans/2026-07-01-workflow-definition-dag.md +1373 -0
data/docs/superpowers/plans/2026-07-01-workflow-definition-dag.md.tasks.json +68 -0
data/docs/superpowers/specs/2026-07-01-workflow-definition-dag-design.md +203 -0
data/lib/chrono_forge/branch_merge_job.rb +158 -21
data/lib/chrono_forge/branch_probe.rb +44 -0
data/lib/chrono_forge/configuration.rb +25 -0
data/lib/chrono_forge/definition.rb +37 -0
data/lib/chrono_forge/definition_analyzer.rb +501 -0
data/lib/chrono_forge/executor/context.rb +23 -0
data/lib/chrono_forge/executor/lock_strategy.rb +10 -3
data/lib/chrono_forge/executor/methods/continue_if.rb +15 -6
data/lib/chrono_forge/executor/methods/durably_execute.rb +15 -7
data/lib/chrono_forge/executor/methods/durably_repeat.rb +30 -14
data/lib/chrono_forge/executor/methods/merge_branches.rb +5 -4
data/lib/chrono_forge/executor/methods/workflow_states.rb +35 -47
data/lib/chrono_forge/executor.rb +34 -9
data/lib/chrono_forge/version.rb +1 -1
data/lib/chrono_forge.rb +8 -0
data/lib/tasks/release.rake +212 -0
metadata +28 -2

data/docs/superpowers/plans/2026-07-01-workflow-definition-dag.md.tasks.json ADDED

@@ -0,0 +1,68 @@
+{
+  "planPath": "docs/superpowers/plans/2026-07-01-workflow-definition-dag.md",
+  "tasks": [
+    {
+      "id": 1,
+      "subject": "Task 1: Definition value objects + Prism dep",
+      "status": "completed",
+      "description": "Graph data model (Definition/Node/Edge) + prism runtime dependency, round-trippable to_h.\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/definition.rb\", \"chrono_forge.gemspec\", \"test/definition_test.rb\"], \"verifyCommand\": \"bundle exec ruby -I test test/definition_test.rb\", \"acceptanceCriteria\": [\"Definition holds nodes/edges/warnings; to_h JSON-safe\", \"Node#dynamic?\", \"prism declared dependency\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 2,
+      "subject": "Task 2: Analyzer — linear steps",
+      "status": "completed",
+      "blockedBy": [1],
+      "description": "DefinitionAnalyzer.call resolves perform via Prism; a node per straight-line durable call with sequential edges from start.\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/definition_analyzer.rb\", \"test/support/definition_fixtures.rb\", \"test/definition_analyzer_test.rb\"], \"verifyCommand\": \"bundle exec ruby -I test test/definition_analyzer_test.rb\", \"acceptanceCriteria\": [\"node per durable call in source order with exact step_name\", \"seq edges start->n1->n2\", \"non-durable Ruby ignored\", \"no DB/exec\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 3,
+      "subject": "Task 3: Analyzer — conditionals & guards",
+      "status": "completed",
+      "blockedBy": [2],
+      "description": "if/unless/case around durable calls -> :conditional edges with guard labels; rejoin skip/body; continue_if false path -> :terminal.\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/definition_analyzer.rb\", \"test/support/definition_fixtures.rb\", \"test/definition_analyzer_test.rb\"], \"verifyCommand\": \"bundle exec ruby -I test test/definition_analyzer_test.rb\", \"acceptanceCriteria\": [\"guarded conditional edge with source guard\", \"rejoin skip+body\", \"continue_if terminal edge\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 4,
+      "subject": "Task 4: Analyzer — branch fan-out + merge join",
+      "status": "completed",
+      "blockedBy": [3],
+      "description": "branch block -> :branch fan-out node + child-group node via :fanout edge; merge_branches -> :join edge from branch.\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/definition_analyzer.rb\", \"test/support/definition_fixtures.rb\", \"test/definition_analyzer_test.rb\"], \"verifyCommand\": \"bundle exec ruby -I test test/definition_analyzer_test.rb\", \"acceptanceCriteria\": [\"branch$name node + child-group with pattern\", \"branch->child :fanout\", \"branch->merge :join\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 5,
+      "subject": "Task 5: Analyzer — repeat, helper tracing, loop warnings",
+      "status": "completed",
+      "blockedBy": [4],
+      "description": "durably_repeat -> :repeat node; trace durable calls in same-class helpers (recursion-guarded); durable call inside a loop -> warning, no crash.\n\n```json:metadata\n{\"files\": [\"lib/chrono_forge/definition_analyzer.rb\", \"test/support/definition_fixtures.rb\", \"test/definition_analyzer_test.rb\"], \"verifyCommand\": \"bundle exec ruby -I test test/definition_analyzer_test.rb\", \"acceptanceCriteria\": [\"durably_repeat$tick single repeat node\", \"same-class helper traced in position\", \"loop-with-durable warns\", \"no infinite recursion\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 6,
+      "subject": "Task 6: Dashboard — DefinitionOverlay",
+      "status": "completed",
+      "blockedBy": [1],
+      "description": "Annotate nodes with runtime status from execution_logs; fan-out/repeat aggregates via BranchProbe/rep logs; append unmapped nodes.\n\n```json:metadata\n{\"files\": [\"chrono_forge-dashboard/app/presenters/chrono_forge/dashboard/definition_overlay.rb\", \"chrono_forge-dashboard/test/definition_overlay_test.rb\"], \"verifyCommand\": \"cd chrono_forge-dashboard && bundle exec rake test TEST=test/definition_overlay_test.rb\", \"acceptanceCriteria\": [\"exact-name status\", \"branch/merge counts\", \"repeat repetitions\", \"unmapped logs appended\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 7,
+      "subject": "Task 7: Dashboard — MermaidRenderer",
+      "status": "completed",
+      "blockedBy": [6],
+      "description": "Statused nodes + edges -> Mermaid flowchart TD string with shapes by kind, :::status classes, guard edge labels, classDef lines.\n\n```json:metadata\n{\"files\": [\"chrono_forge-dashboard/app/presenters/chrono_forge/dashboard/mermaid_renderer.rb\", \"chrono_forge-dashboard/test/mermaid_renderer_test.rb\"], \"verifyCommand\": \"cd chrono_forge-dashboard && bundle exec rake test TEST=test/mermaid_renderer_test.rb\", \"acceptanceCriteria\": [\"flowchart TD header\", \"node line per node with :::status\", \"edge lines with guard labels\", \"classDef per used status\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 8,
+      "subject": "Task 8: Dashboard — definition page (route/controller/view/Mermaid/link)",
+      "status": "completed",
+      "blockedBy": [5, 6, 7],
+      "description": "GET workflows/:id/definition page — analyze class, overlay run, render Mermaid client-side (vendored), warnings panel, graceful degradation, link from detail page.\n\n```json:metadata\n{\"files\": [\"chrono_forge-dashboard/config/routes.rb\", \"chrono_forge-dashboard/app/controllers/chrono_forge/dashboard/definitions_controller.rb\", \"chrono_forge-dashboard/app/views/chrono_forge/dashboard/definitions/show.html.erb\", \"chrono_forge-dashboard/app/assets/chrono_forge/dashboard/mermaid.min.js\", \"chrono_forge-dashboard/app/views/chrono_forge/dashboard/workflows/show.html.erb\", \"chrono_forge-dashboard/test/definitions_controller_test.rb\"], \"verifyCommand\": \"cd chrono_forge-dashboard && bundle exec rake test TEST=test/definitions_controller_test.rb\", \"acceptanceCriteria\": [\"200 with flowchart TD for analyzable wf\", \"unknown class degrades to warning not 500\", \"detail page links to page\"], \"requiresUserVerification\": false}\n```"
+    },
+    {
+      "id": 9,
+      "subject": "Task 9: Full suite + docs",
+      "status": "completed",
+      "blockedBy": [8],
+      "description": "Both packages green, lint clean, document the definition page.\n\n```json:metadata\n{\"files\": [\"chrono_forge-dashboard/README.md\"], \"verifyCommand\": \"bundle exec rake test && cd chrono_forge-dashboard && bundle exec rake test\", \"acceptanceCriteria\": [\"core suite green\", \"dashboard suite green\", \"lint clean on new files\"], \"requiresUserVerification\": false}\n```"
+    }
+  ],
+  "lastUpdated": "2026-07-01T00:00:00Z"
+}
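
Reviewer note: the `blockedBy` fields above form a small dependency graph. As a minimal sketch (not part of the gem), the order in which the tasks can be worked is recoverable with Kahn's algorithm. The `BLOCKED_BY` hash is transcribed by hand from the JSON above, and `dependency_order` is a hypothetical helper name.

```ruby
# Task ids mapped to their blockedBy lists, transcribed from the tasks.json above.
BLOCKED_BY = {
  1 => [], 2 => [1], 3 => [2], 4 => [3], 5 => [4],
  6 => [1], 7 => [6], 8 => [5, 6, 7], 9 => [8]
}.freeze

# Kahn's algorithm: repeatedly emit every task whose blockers have all been
# emitted already. Raises if no task becomes ready, which means a cycle.
def dependency_order(blocked_by)
  remaining = blocked_by.transform_values(&:dup)
  order = []
  until remaining.empty?
    ready = remaining.select { |_id, deps| (deps - order).empty? }.keys.sort
    raise ArgumentError, "dependency cycle among #{remaining.keys}" if ready.empty?
    order.concat(ready)
    ready.each { |id| remaining.delete(id) }
  end
  order
end

puts dependency_order(BLOCKED_BY).inspect
```

Tasks emitted in the same round (for example 2 and 6, which both depend only on 1) have no ordering constraint between them, so the analyzer track and the dashboard overlay track could proceed in parallel until task 8 joins them.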

data/docs/superpowers/specs/2026-07-01-workflow-definition-dag-design.md ADDED

@@ -0,0 +1,203 @@
+# Workflow Definition DAG — static "future timeline" for ChronoForge
+**Status:** Design approved (pending written-spec review)
+**Date:** 2026-07-01
+**Reference:** the `durable_flow` gem's `DefinitionAnalyzer` (Prism-based static analyzer + definition DAG overlaid with runtime status).
+## Problem
+The dashboard today shows only the **historical** timeline of a workflow — the
+`execution_logs` that have already run. There is no forward view: an operator
+can't see the steps a workflow *will* run, where the current run sits in the
+overall shape, or which branches/loops are still ahead.
+ChronoForge workflows are plain Ruby: a `perform` method that the engine
+**replays** every resume, with each durable step identified by a string name
+(`durably_execute$name`, `wait_until$cond`, `branch$name`, `merge$a,b`,
+`durably_repeat$name$<ts>`). Because the structure is expressed in source, we can
+recover a *projection* of the step sequence by statically parsing `perform` with
+Prism — without executing anything — and then paint the run's actual status onto
+that static map.
+## Goal
+A **new per-run dashboard page** that renders a workflow's **conditional DAG**
+(the static definition graph) with the current run's `execution_logs` **overlaid**
+as node status. The existing workflow detail page is unchanged; it gains a link
+to this page.
+Non-goals for v1 are listed under [Scope](#scope-v1).
+## Key decisions (locked during brainstorming)
+1. **Primary consumer:** dashboard overlay — run status painted on the static map
+   (mirrors durable_flow's run → definition-DAG view).
+2. **Map shape:** a **conditional DAG** — guarded edges for `if`/`continue_if`,
+   fan-out groups for branches, joins for merges.
+3. **Fidelity:** **conservative + trace same-class helper methods**. Resolve step
+   names statically where possible; anything unresolvable (computed `name:`,
+   data-dependent loop count, a durable call behind an unknown/external call)
+   becomes an explicit **`dynamic` node with a warning**. No unrolling, no
+   cross-class tracing.
+4. **Rendering:** **Mermaid.js** (client-side, vendored). The analyzer's graph
+   model is rendering-agnostic; a renderer emits Mermaid flowchart text with
+   status encoded as node classes.
+5. **Static vs runtime:** static Prism analysis is the source of the *shape*
+   (only it can show not-yet-run steps and untaken branches); the run log is the
+   *overlay*, never the source of the graph.
+6. **Placement:** a **new route/page**, not an inline addition to the detail page.
+## Architecture
+```
+workflow_class
+   │  DefinitionAnalyzer.call            (core gem; Prism; memoized by class + source digest)
+   ▼
+Definition (Node[], Edge[], warnings)    (plain, JSON-serializable value objects)
+   │  DefinitionOverlay(execution_logs)  (dashboard; read-only queries; per-run; never cached)
+   ▼
+statused Definition
+   │  MermaidRenderer                     (dashboard; statused graph → flowchart text)
+   ▼
+new DAG page  →  vendored Mermaid JS renders client-side (inside data-poll-region)
+```
+### Core gem — `lib/chrono_forge/` (rendering-agnostic, no dashboard/DB dependency)
+- **`ChronoForge::DefinitionAnalyzer`** — `.call(workflow_class) → Definition`.
+  - Resolves `workflow_class.instance_method(:perform).source_location`, reads the
+    file, `Prism.parse`, locates the `perform` def node, and walks its body with a
+    visitor.
+  - **Traces durable calls in same-class helper methods** to a fixed point within
+    the class (a call to a method defined on the same class whose body contains
+    durable DSL calls is expanded inline; recursion is guarded).
+  - Emits nodes, edges, and warnings. **Only reads source text — never touches the
+    DB, never executes workflow code.**
+- **`ChronoForge::Definition`** (+ `Node`, `Edge`) — plain value objects,
+  JSON-serializable so a `Definition` can be cached.
+  - `Node`: `id`, `kind` ∈ `{:execute, :wait, :wait_until, :continue_if, :branch,
+    :merge, :repeat, :dynamic}`, `label`, and **either** an exact `step_name`
+    **or** a `step_name_pattern` (fan-out/repeat/dynamic), plus optional `guard`
+    (condition source label) and `warnings`.
+  - `Edge`: `from`, `to`, optional `guard` label, and a `kind` (`:seq`,
+    `:conditional`, `:fanout`, `:join`, `:terminal`).
+### Dashboard package — `chrono_forge-dashboard/`
+- **`DefinitionOverlay`** — takes a `Definition` + a workflow's `execution_logs`
+  (and, for `:branch`/`:merge` nodes, child-workflow state counts via the existing
+  `BranchProbe`) and annotates each node with a runtime `status`. Read-only.
+- **`MermaidRenderer`** — `statused Definition → flowchart text`; status encoded
+  as `classDef` + `class` assignments.
+- **New controller action + view** — `GET workflows/:id/definition`, plus a
+  "Definition graph" link from the existing detail page.
+- **Vendored Mermaid JS** — the dashboard's first client script, initialized
+  inside the existing `data-poll-region` so the DAG re-renders on the normal
+  page refresh.
+## Node → step-name binding
+Each node knows the step-name it *would* produce, so the overlay is a lookup, not
+guesswork:
+| DSL call | Node kind | Binds to |
+|---|---|---|
+| `durably_execute :m` / `name: "x"` | `:execute` | exact `durably_execute$x` (or `$m`) |
+| `durably_execute :m, name: <expr>` | `:dynamic` | prefix `durably_execute$`, by ordinal |
+| `wait <duration>, "n"` | `:wait` | exact `wait$n` (name is the 2nd positional) |
+| `wait_until :cond` | `:wait_until` | exact `wait_until$cond` |
+| `continue_if :cond` | `:continue_if` | exact `continue_if$cond` |
+| `branch :name { spawn/spawn_each }` | `:branch` (fan-out) | `branch$name` + child-workflow aggregate |
+| `merge_branches :a, :b` | `:merge` (join) | `merge$a,b` (names sorted) |
+| `durably_repeat :name` | `:repeat` (loop) | `durably_repeat$name` coord + `$<ts>` reps |
+**Fan-out (`branch`/`spawn_each`) and `durably_repeat` collapse to a single node
+with aggregate status** — not one node per child/iteration.
+## Overlay status vocabulary (→ Mermaid classes)
+- `done` — matching log is `completed`.
+- `active` — log is `started`/`running`, not completed.
+- `pending` — reached but not done (a coordination log exists, work outstanding).
+- `not_reached` — no log yet.
+- `failed` / `stalled` — from the log state.
+- `conditional` — statically guarded; may be skipped.
+- `dynamic` — unresolved name; bound by prefix + ordinal.
+- `unmapped` — **a runtime log with no matching static node**; appended so
+  analyzer gaps are surfaced, not hidden.
+Aggregates:
+- `:repeat` → "N done, current active, `till` met?" from the coordination log +
+  its `$<ts>` repetition logs.
+- `:branch`/`:merge` → child-workflow state counts (running/idle/completed/failed)
+  via `BranchProbe`.
+## Edges & conditionals
+- Sequential DSL calls → `:seq` edges.
+- `if`/`unless`/`case`/`&&`/`||`/early-return around a step → `:conditional` edge
+  labeled with the condition source; steps only reachable under a guard render
+  `conditional`.
+- `continue_if` → a gate node; its false path is a `:terminal` edge (workflow
+  halts).
+- `branch` block → fans out (`:fanout`) to its spawn/`spawn_each` child-group;
+  `merge_branches` is the `:join` those edges reconnect into.
+- `each`/`times`/`while` containing durable calls → one node + a "dynamic loop
+  count" **warning** (conservative — no unrolling).
+## Error handling
+The analyzer must never break the dashboard:
+- Source unavailable (`source_location` nil, C-defined, `eval`'d, unreadable
+  file) → return a `Definition` carrying a single `unavailable` warning; the page
+  renders "can't be statically analyzed" gracefully. Never raises.
+- Any Prism parse issue degrades the same way (Prism is error-tolerant).
+- Missing/unloadable `job_class`, or a partially-resolved analysis → render what
+  was found plus a warnings panel.
+- **Analyzer is pure/read-only over source text**; the overlay does read-only
+  queries only.
+## Caching
+- Memoize `Definition` by `job_class` + source-file digest — auto-invalidates on
+  dev code reload, stable in prod.
+- The **overlay is never cached** — it is per-run and changes every poll.
+## Testing
+- **Analyzer unit tests (no DB):** a fixture set of workflow classes — linear,
+  conditional/`continue_if`, `branch`+`spawn_each`, `durably_repeat`, dynamic
+  `name:`, helper-traced, unanalyzable loop — asserting node kinds, edges, guards,
+  and warnings. Deterministic and fast.
+- **Overlay tests (dashboard harness):** seed `execution_logs` + child workflows;
+  assert per-node status, fan-out aggregates, repeat counts, and the `unmapped`
+  path.
+- **`MermaidRenderer`:** golden-text tests (statused `Definition` → expected
+  flowchart string).
+## Scope (v1)
+**In:** all seven primitives as nodes + conditional edges + fan-out/repeat
+aggregation + the overlay + the new per-run DAG page + Mermaid rendering;
+same-class helper tracing.
+**Out (deferred):**
+- Cross-class helper tracing.
+- Recursively expanding a spawned child *workflow class* into its own graph
+  (v1 shows it as one fan-out node; "drill into child" is a future feature).
+- Per-node ETA/timing beyond status + counts (that's the separate progress/ETA
+  feature).
+- A class-level (no-overlay) definition view (trivial later addition).
+## Open questions / risks
+- **Helper-tracing fixed point:** need a clear rule for what counts as "a durable
+  call inside a same-class method" vs. ordinary work, and recursion/mutual-call
+  guards. The analyzer stays conservative — when in doubt, emit a `dynamic` node +
+  warning rather than a confident-but-wrong expansion.
+- **Ordinal binding for dynamic siblings** is best-effort; if two dynamic
+  `durably_execute` calls interleave at runtime out of source order, the overlay
+  may mis-bind. Acceptable for v1 (surfaced as `dynamic`).
+- **Mermaid as first client dependency** — keep it vendored and isolated so the
+  rest of the dashboard stays server-rendered.
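
Reviewer note: the overlay and rendering stages described in the spec above can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the `Node`/`Edge` structs are stand-ins that borrow field names from the spec (the real classes in `lib/chrono_forge/definition.rb` may differ), the log state strings and the `STYLES` colours are invented, and `overlay`/`render_mermaid` are hypothetical helper names rather than the dashboard's `DefinitionOverlay` or `MermaidRenderer`.

```ruby
# Stand-in value objects; field names follow the spec, not the gem's source.
Node = Struct.new(:id, :kind, :label, :step_name, keyword_init: true)
Edge = Struct.new(:from, :to, :kind, :guard, keyword_init: true)

# Illustrative colours only; the real classDef styling is not shown in the spec.
STYLES = {
  "done" => "fill:#d4edda",
  "active" => "fill:#fff3cd",
  "failed" => "fill:#f8d7da",
  "not_reached" => "fill:#eeeeee"
}.freeze

# Overlay as a lookup: each node's exact step name is looked up in a
# { step_name => log state } hash (assumed shape for this sketch).
def overlay(nodes, log_states)
  nodes.to_h do |node|
    status = case log_states[node.step_name]
             when "completed" then "done"
             when "started", "running" then "active"
             when "failed" then "failed"
             else "not_reached"
             end
    [node.id, status]
  end
end

# Emit Mermaid flowchart text: one line per node with a :::status class,
# one line per edge (guard as edge label), one classDef per status in use.
def render_mermaid(nodes, edges, statuses)
  lines = ["flowchart TD"]
  nodes.each do |n|
    shape = n.kind == :continue_if ? "{\"#{n.label}\"}" : "[\"#{n.label}\"]"
    lines << "  #{n.id}#{shape}:::#{statuses.fetch(n.id)}"
  end
  edges.each do |e|
    arrow = e.guard ? "-->|#{e.guard}|" : "-->"
    lines << "  #{e.from} #{arrow} #{e.to}"
  end
  statuses.values.uniq.sort.each { |s| lines << "  classDef #{s} #{STYLES.fetch(s)}" }
  lines.join("\n")
end

nodes = [
  Node.new(id: "n1", kind: :execute, label: "charge", step_name: 'durably_execute$charge'),
  Node.new(id: "n2", kind: :continue_if, label: "paid?", step_name: 'continue_if$paid?'),
  Node.new(id: "n3", kind: :execute, label: "ship", step_name: 'durably_execute$ship')
]
edges = [
  Edge.new(from: "n1", to: "n2", kind: :seq),
  Edge.new(from: "n2", to: "n3", kind: :conditional, guard: "paid?")
]
logs = { 'durably_execute$charge' => "completed", 'continue_if$paid?' => "running" }

puts render_mermaid(nodes, edges, overlay(nodes, logs))
```

The point of the sketch is the separation the spec insists on: the graph shape comes only from the static nodes and edges, while the run log contributes nothing but a status per node. Aggregates for `:branch`, `:merge`, and `:repeat` nodes, the `dynamic` and `unmapped` statuses, and label escaping are deliberately left out.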

data/lib/chrono_forge/branch_merge_job.rb CHANGED

@@ -4,7 +4,22 @@ module ChronoForge
   # Lightweight poller that joins one or more branches. NOT a workflow — it holds
   # no lock, does no replay, and carries no context. It exists so the heavy parent
   # workflow is replayed only twice per merge (kick off + completion wake).
+  #
+  # DEPLOY NOTE — queue placement matters. merge_branches enqueues this poller
+  # AFTER dispatching the branch's children, so if it runs on the SAME queue as a
+  # large fan-out's children it is starved behind the whole backlog and only gets a
+  # worker slot near the end. It then polls once, at pending≈0, with no prior sample
+  # (rate 0) and backs off to max_interval — so the parent's convergence lags by up
+  # to max_interval and no mid-drain throughput sample is ever recorded. Set
+  # ChronoForge.config.branch_merge_queue to a queue NOT saturated by the fan-out's
+  # own children so it polls throughout the drain (ETA cadence then converges
+  # tightly). See ChronoForge::Configuration and docs/fanout-scale-test.md.
   class BranchMergeJob < ActiveJob::Base
+    # Resolved per-enqueue from config (a block, so changing the config takes effect
+    # without redefining the job — and it can't be silently reset by a code reload,
+    # unlike a queue_as monkey-patch in a to_prepare block).
+    queue_as { ChronoForge.config.branch_merge_queue }
     # The poller is the parent's only wake mechanism, so survive TRANSIENT
     # infrastructure errors (DB connection/timeout/deadlock) with backoff. Any
     # other error — a programming bug, a bad guard — is NOT retried: it propagates
@@ -16,8 +31,7 @@ module ChronoForge
       ActiveRecord::LockWaitTimeout,
       wait: :polynomially_longer, attempts: 25
-    CAP = 5_000          # cap the pending count; beyond it we just pick max_interval
-    FACTOR = 0.06        # seconds of delay per pending child
+    ETA_FRACTION = 0.5   # poll at this fraction of the projected time-to-drain
     REKICK_AFTER = 5.minutes
     REKICK_BATCH = 200   # bound per-run rekicks; later polls handle the rest
@@ -29,44 +43,135 @@ module ChronoForge
       # re-enqueue) holds a stale token. It stops quietly — no poll, no wake, no
       # reschedule — leaving only the newest chain to drive the merge. (A nil token
       # is a pre-upgrade job enqueued before fencing existed; it runs unfenced.)
-      return if superseded?(branch_log_ids, token)
+      logs = ExecutionLog.where(id: branch_log_ids).to_a
+      return if superseded?(logs, token)
       # Per-branch probe (kept as maps so we can persist each branch's own state,
       # not just the merge aggregate). Same query count as a plain sum/all?.
-      pending_by_branch = branch_log_ids.to_h { |id| [id, BranchProbe.incomplete(id).limit(CAP).count] }
+      # The pending count is UNCAPPED: it feeds the drain signal below (a change in
+      # pending since the prior poll), which a CAP would flatten into a false
+      # "not draining" for large branches.
+      prev_pending_by_branch = logs.to_h { |l| [l.id, l.metadata&.dig("poll", "pending")] }
+      pending_by_branch = branch_log_ids.to_h { |id| [id, BranchProbe.incomplete(id).count] }
       sealed_by_branch = branch_log_ids.to_h { |id| [id, BranchProbe.sealed?(id)] }
       pending = pending_by_branch.values.sum
       sealed = sealed_by_branch.values.all?
+      # Total children spawned per branch. Immutable once the branch is SEALED
+      # (dispatch done), so we count it exactly ONCE and cache it on the metadata;
+      # every later poll (and the dashboard) reuses the cached value, never recounting.
+      # Unsealed (mid-spawn, count still climbing) => nil, and the dashboard falls back
+      # to its capped live count until the seal freezes the total.
+      logs_by_id = logs.index_by(&:id)
+      spawned_by_branch = branch_log_ids.to_h do |id|
+        cached = logs_by_id[id]&.metadata&.dig("poll", "spawned")
+        [id, cached || (sealed_by_branch[id] ? BranchProbe.spawned(id).count : nil)]
+      end
       if sealed && pending.zero?
-        record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at: nil)
+        record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at: nil, interval: nil,
+          rate_by_branch: {}, never_started_by_branch: {}, spawned_by_branch: spawned_by_branch, rekicked_by_branch: {})
         parent_job_class.constantize.perform_later(parent_key)
         return
       end
-      rekick_dropped_jobs(branch_log_ids)
+      # DISPATCHED (never-started) count per branch — the rekick drain signal. A
+      # drop since the prior poll means workers are consuming this branch's queue,
+      # so a still-queued child is in line; a flat count with stale never-started
+      # children is a dropped job to recover. Keyed off this, NOT total pending,
+      # which a wait/wait_until child completing would drop without any never-started
+      # child moving (masking a genuinely-dropped one behind staggered waits).
+      prev_never_started_by_branch = logs.to_h { |l| [l.id, l.metadata&.dig("poll", "never_started")] }
+      never_started_by_branch = branch_log_ids.to_h { |id| [id, BranchProbe.never_started(id).count] }
+      rekicked_by_branch = rekick_dropped_jobs(branch_log_ids, never_started_by_branch, prev_never_started_by_branch)
+      # Cadence is driven by ESTIMATED TIME-TO-DRAIN, measured from the prior
+      # poll's persisted pending. `motion` (EXISTS probes) is the fallback signal
+      # when nothing completed this interval: :running => a live worker is
+      # executing a child (hold the floor, it'll finish); :never_started => the only
+      # motion is a queued/rekicked-but-unpicked child (back off exponentially,
+      # it may never be picked up); :none => blocked/waiting (max backstop).
+      # See reschedule_delay. Computed lazily below, only off the drain path.
+      prior = logs.map { |l| l.metadata&.dig("poll") }
+      # Only trust the AGGREGATE prev_pending when every requested branch log is
+      # loaded AND carries a prior sample — otherwise `pending` (over all
+      # branch_log_ids) and prev_pending (over loaded logs) would cover different
+      # sets and yield a bogus aggregate rate. Missing/partial => no sample =>
+      # bootstrap. Per-branch rate below is independently safe (missing => nil => 0).
+      complete_prior = logs.size == branch_log_ids.size && prior.all?
+      prev_pending = (prior.sum { |p| p["pending"].to_i } if complete_prior)
+      prev_polled_at = prior.filter_map { |p| p && p["last_polled_at"] }.map { |s| Time.zone.parse(s) }.min
+      elapsed = prev_polled_at && (Time.current - prev_polled_at)
+      prev_delay = prior.filter_map { |p| p && p["interval"] }.max
+      # Drain rate = children completed / second since the prior poll — THIS is the
+      # throughput surfaced on the dashboard. Per branch for display; aggregated for
+      # the ETA. Zero unless the branch actually drained (a no-headway / cold poll).
+      # NOTE: the aggregate ETA blurs a heterogeneous multi-branch merge; acceptable
+      # (the common case is single-branch; clamp + per-poll re-estimate bound any
+      # skew, and only poll timing is affected — the parent is still woken).
+      drained = ->(pend, prev) { prev && elapsed && elapsed > 0 && pend < prev }
+      rate_by_branch = pending_by_branch.to_h do |id, pend|
+        prev = prev_pending_by_branch[id]
+        [id, drained.call(pend, prev) ? (prev - pend) / elapsed.to_f : 0.0]
+      end
+      rate = drained.call(pending, prev_pending) ? (prev_pending - pending) / elapsed.to_f : 0.0
+      # Only needed when the ETA branch won't be taken (rate == 0); computing the
+      # EXISTS probes lazily keeps them off the hot drain path. See reschedule_delay.
+      motion = if rate > 0 then nil
+      elsif branch_log_ids.any? { |id| BranchProbe.running?(id) } then :running
+      elsif never_started_by_branch.values.any?(&:positive?) then :never_started
+      else :none
+      end
-      delay = reschedule_delay(pending, min_interval, max_interval)
-      record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at: delay.seconds.from_now)
+      delay = reschedule_delay(pending, rate, motion, prev_delay, min_interval, max_interval)
+      record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at: delay.seconds.from_now,
+        interval: delay, rate_by_branch: rate_by_branch, never_started_by_branch: never_started_by_branch,
+        spawned_by_branch: spawned_by_branch, rekicked_by_branch: rekicked_by_branch)
       self.class.set(wait: delay.seconds)
         .perform_later(parent_key, parent_job_class, branch_log_ids, min_interval, max_interval, token)
     end
     private
-    # Adaptive poll cadence: scale the wait with the number of pending children,
-    # clamped to [min_interval, max_interval]. min_interval <= max_interval is
-    # enforced up front in merge_branches, so the clamp can't raise here.
-    def reschedule_delay(pending, min_interval, max_interval)
-      (pending * FACTOR).clamp(min_interval, max_interval)
+    # Adaptive poll cadence driven by ESTIMATED TIME-TO-DRAIN, not backlog size.
+    # When the branch-set drained since the last poll we project completion from
+    # the measured rate and poll at ETA_FRACTION of it, clamped [min, max]. Because
+    # each poll re-estimates against the shrinking remainder, cadence converges
+    # geometrically and detects the merge within ~min_interval of the last child
+    # finishing — where the old count-based cadence polled SLOWEST (max_interval)
+    # exactly when a fast-draining backlog was about to complete.
+    #
+    # No completion observed this interval — fall back on `motion`:
+    #   :running    => a live worker is executing a child; it will finish, so hold
+    #                  the responsive floor (matches prior behaviour and avoids
+    #                  waking the parent late for a slow/low-fan-out child).
+    #   :never_started => the only motion is a queued/rekicked-but-unpicked child that
+    #                  may never be picked up => exponential backoff from the floor
+    #                  (double prev_delay, capped at max), catching a quick recovery
+    #                  within seconds without spinning on a dead dispatch.
+    #   :none       => nothing can progress (blocked/failed or parked on a wait) =>
+    #                  straight to max_interval, the cheap recovery backstop.
+    # min_interval <= max_interval is enforced in merge_branches, so clamp is safe.
+    # `rate` is children/s measured by the caller (0 => nothing completed since the
+    # prior poll / cold poll).
+    def reschedule_delay(pending, rate, motion, prev_delay, min_interval, max_interval)
+      return (pending / rate * ETA_FRACTION).clamp(min_interval, max_interval) if rate > 0
+      case motion
+      when :running then min_interval
+      when :never_started then prev_delay ? (prev_delay * 2).clamp(min_interval, max_interval) : min_interval
+      else max_interval
+      end
     end
     # A poller is superseded when its token no longer matches what's stored on the
     # branch logs (a newer merge_branches pass rotated it). A plain read is enough
     # for the early-out; the persisting write in record_poll! re-checks the token
     # under a row lock so it can never clobber the newer chain.
-    def superseded?(branch_log_ids, token)
-      logs = ExecutionLog.where(id: branch_log_ids).to_a
+    def superseded?(logs, token)
       logs.empty? || logs.any? { |log| log.metadata&.dig("poll_token") != token }
     end
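
Reviewer note: the cadence rule in `reschedule_delay` above can be exercised outside the job. The sketch below copies the method body from this hunk into a plain script and drives it with a hypothetical drain. The `simulate` helper, the steady 10 children/s rate, and the 1 s / 60 s interval bounds are all assumptions for illustration. In the real job the first poll has no prior sample and takes the `motion` fallback, which this simulation skips by assuming the rate is known from the start.

```ruby
ETA_FRACTION = 0.5 # same constant as in the diff above

# Decision rule copied from the hunk above, lifted out of the ActiveJob class.
def reschedule_delay(pending, rate, motion, prev_delay, min_interval, max_interval)
  return (pending / rate * ETA_FRACTION).clamp(min_interval, max_interval) if rate > 0
  case motion
  when :running then min_interval
  when :never_started then prev_delay ? (prev_delay * 2).clamp(min_interval, max_interval) : min_interval
  else max_interval
  end
end

# Hypothetical helper: children complete at a constant rate, and each poll
# re-estimates the delay against whatever is still pending.
def simulate(pending:, rate:, min_interval:, max_interval:)
  delays = []
  while pending > 0
    delay = reschedule_delay(pending, rate, nil, delays.last, min_interval, max_interval)
    delays << delay
    pending = [pending - rate * delay, 0].max
  end
  delays
end

delays = simulate(pending: 1000.0, rate: 10.0, min_interval: 1, max_interval: 60)
puts delays.inspect

# Fallback paths when nothing completed since the prior poll (rate == 0).
puts reschedule_delay(10, 0.0, :running, nil, 1, 60)
puts reschedule_delay(10, 0.0, :never_started, 4, 1, 60)
puts reschedule_delay(10, 0.0, :none, nil, 1, 60)
```

Under these assumptions each delay is half the previous one until the `min_interval` floor is reached, which is the geometric convergence the comment describes. The removed count-based rule, `(pending * FACTOR).clamp(min_interval, max_interval)` with `FACTOR = 0.06`, would have returned the 60 s maximum for the same 1000-child starting backlog.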
@@ -78,7 +183,7 @@ module ChronoForge
     # work still pending is the signal that the poller was dropped). This is purely
     # observational — replay and correctness never read it. It writes a "poll"
     # sub-key, leaving spawn_each's "cursors" metadata untouched.
-    def record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at:)
+    def record_poll!(pending_by_branch, sealed_by_branch, token, next_poll_at:, interval:, rate_by_branch:, never_started_by_branch:, spawned_by_branch:, rekicked_by_branch:)
       now = Time.current
       ExecutionLog.where(id: pending_by_branch.keys).find_each do |log|
         # Lock the row so this read-modify-write can't clobber a concurrent token
@@ -88,12 +193,25 @@ module ChronoForge
         log.with_lock do
           meta = log.metadata || {}
           next unless meta["poll_token"] == token
+          prev = meta["poll"] || {}
+          n = rekicked_by_branch[log.id].to_i
+          pend = pending_by_branch[log.id]
+          rate = rate_by_branch[log.id].to_f
           meta["poll"] = {
             "last_polled_at" => now.iso8601,
             "next_poll_at" => next_poll_at&.iso8601,
-            "pending" => pending_by_branch[log.id],
+            "interval" => interval,
+            "pending" => pend,
+            "never_started" => never_started_by_branch[log.id],   # never-started count (rekick drain signal)
+            "spawned" => prev["spawned"] || spawned_by_branch[log.id],  # total spawned; immutable once sealed, so sticky
             "sealed" => sealed_by_branch[log.id],
-            "polls" => meta.dig("poll", "polls").to_i + 1
+            "rate" => rate.round(3),                               # children/s (round(3), not (2), so a
+                                                                   # very slow but real drain still reads > 0)
+            "eta_seconds" => (rate > 0 ? (pend / rate).round : nil),
+            "polls" => prev["polls"].to_i + 1,
+            "rekicked" => n,
+            "rekick_total" => prev["rekick_total"].to_i + n,
+            "last_rekick_at" => (n.positive? ? now.iso8601 : prev["last_rekick_at"])
           }
           log.update!(metadata: meta)
         end
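
The derived fields in the `"poll"` hash above (sticky `spawned`, `eta_seconds`, the running `rekick_total`) are easier to check outside the row lock. This is a minimal sketch assuming plain hashes in place of `ExecutionLog` metadata; `build_poll_entry` and its arguments are hypothetical and only mirror the arithmetic shown in the hunk.

```ruby
require "time" # Time#iso8601

# Hypothetical pure-Ruby version of the hash assembled inside record_poll!.
# `prev` is the previous "poll" entry (or {} on the first poll).
def build_poll_entry(prev, pending:, rate:, spawned:, rekicked:, now:)
  {
    "pending" => pending,
    "spawned" => prev["spawned"] || spawned,       # first recorded value wins
    "rate" => rate.round(3),                        # children per second
    "eta_seconds" => (rate > 0 ? (pending / rate).round : nil),
    "polls" => prev["polls"].to_i + 1,
    "rekicked" => rekicked,
    "rekick_total" => prev["rekick_total"].to_i + rekicked,
    "last_rekick_at" => (rekicked.positive? ? now.iso8601 : prev["last_rekick_at"])
  }
end

first = build_poll_entry({}, pending: 120, rate: 0.5, spawned: 200,
  rekicked: 2, now: Time.utc(2026, 1, 1))
second = build_poll_entry(first, pending: 100, rate: 0.0, spawned: 999,
  rekicked: 0, now: Time.utc(2026, 1, 1, 0, 1))

p first["eta_seconds"]     # → 240
p second["eta_seconds"]    # → nil
p second["spawned"]        # → 200
p second["last_rekick_at"] # → "2026-01-01T00:00:00Z"
```

A zero rate yields a `nil` ETA rather than a division by zero, and a poll with no rekicks carries the earlier `last_rekick_at` forward instead of overwriting it.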
@@ -111,10 +229,22 @@ module ChronoForge
     # keep the :idle guard (a running/failed/stalled child must never be
     # re-dispatched). Re-enqueue of an :idle child a worker just grabbed is still
     # safe — the lock guard rejects the duplicate. Capped per run.
-    def rekick_dropped_jobs(branch_log_ids)
-      branch_log_ids.each do |id|
+    def rekick_dropped_jobs(branch_log_ids, never_started_by_branch, prev_never_started_by_branch)
+      cutoff = REKICK_AFTER.ago
+      branch_log_ids.to_h do |id|
+        # Skip a branch whose NEVER-STARTED count dropped since the last poll:
+        # workers are pulling its dispatched children off the queue, so a still-
+        # queued child is in line, not dropped. Deliberately NOT total pending —
+        # a wait/wait_until child completing would drop pending without any
+        # never-started child moving, masking a genuinely-dropped child behind
+        # staggered waits. With no prior sample (cold poll) we don't gate — the
+        # per-child staleness filter below still spares freshly-dispatched rows.
+        prev = prev_never_started_by_branch[id]
+        next [id, 0] if prev && never_started_by_branch[id] < prev
+        count = 0
         Workflow.where(parent_execution_log_id: id, state: Workflow.states[:idle], started_at: nil)
-          .where("updated_at < ?", REKICK_AFTER.ago)
+          .where("updated_at < ?", cutoff)
           .limit(REKICK_BATCH)
           .find_each do |child|
             # Intentionally uses the GUARDED perform_later (single-child path),
@@ -126,12 +256,19 @@ module ChronoForge
             # error — dead-letter the poller, orphaning every healthy sibling. Catch
             # per child, log, and let the next poll retry it (it's still idle+stale).
             child.job_klass.perform_later(child.key, **child.kwargs.symbolize_keys)
+            # Debounce: bump updated_at so this child isn't re-rekicked until it's
+            # been unstarted for another REKICK_AFTER — one redelivery window for a
+            # worker to pick it up. Only on a SUCCESSFUL enqueue; a rescued failure
+            # leaves it stale so the next poll retries.
+            child.touch
+            count += 1
           rescue => e
             Rails.logger.error do
               "ChronoForge:BranchMergeJob rekick failed for child #{child.key}: " \
               "#{e.class}: #{e.message}"
             end
           end
+        [id, count]
       end
     end
   end
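
The gate in `rekick_dropped_jobs` skips a branch whose never-started count fell since the previous poll, and does not gate at all when there is no previous sample. The sketch below models only that decision with in-memory hashes. `rekick_plan`, `REKICK_BATCH_SKETCH`, and the pre-filtered `stale_children` lists are assumptions standing in for the `Workflow` query, the staleness cutoff, and the enqueue.

```ruby
# In-memory sketch of the rekick gate. The data shapes are hypothetical:
# counts are plain hashes keyed by branch id, and stale_children is assumed
# to already contain only idle, unstarted, stale child keys.
REKICK_BATCH_SKETCH = 2

def rekick_plan(branch_ids, never_started, prev_never_started, stale_children)
  branch_ids.to_h do |id|
    prev = prev_never_started[id]
    # Count dropped since the last poll: workers are draining the queue, so skip.
    next [id, []] if prev && never_started[id] < prev
    [id, stale_children.fetch(id, []).first(REKICK_BATCH_SKETCH)]
  end
end

plan = rekick_plan(
  [1, 2, 3],
  {1 => 7, 2 => 4, 3 => 5},   # never-started counts at this poll
  {1 => 10, 2 => 4},          # previous poll; branch 3 has no sample yet
  {1 => %w[a b], 2 => %w[c d e], 3 => %w[f]}
)
p plan # → {1=>[], 2=>["c", "d"], 3=>["f"]}
```

Branch 1 is draining and is left alone, branch 2 has not moved and gets a capped rekick, and branch 3 is a cold poll so only the per-child staleness filter applies.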

data/lib/chrono_forge/branch_probe.rb CHANGED

@@ -19,6 +19,50 @@ module ChronoForge
         .where.not(state: Workflow.states[:completed])
     end
+    # Relation of children that can advance on their own — actively running, or
+    # dispatched-but-not-yet-started (started_at nil). This drives the adaptive
+    # poll cadence. Deliberately EXCLUDES waiting children (idle with started_at
+    # SET — parked on a wait/wait_until) and blocked children (failed/stalled —
+    # awaiting operator recovery): polling can't make either progress, so they
+    # must not pin the cadence at the responsive floor. They still count as
+    # +incomplete+ (the branch stays open), they just don't accelerate polling.
+    def progressing(branch_log_id)
+      base = Workflow.where(parent_execution_log_id: branch_log_id)
+      base.where(state: Workflow.states[:running])
+        .or(base.where(state: Workflow.states[:idle], started_at: nil))
+    end
+    # A child of this branch is actively executing — a live worker will complete
+    # it, so the poller can hold its responsive floor rather than backing off.
+    def running?(branch_log_id)
+      Workflow.where(parent_execution_log_id: branch_log_id, state: Workflow.states[:running]).exists?
+    end
+    # Children dispatched but not yet started (idle, started_at nil) — the queue of
+    # never-started work for this branch. A DROP in this count between polls means
+    # workers are actively pulling it off the queue (so a still-queued child is in
+    # line, not dropped); the rekick gate keys off that. Distinct from total pending,
+    # which a wait/wait_until child completing would drop without any never-started
+    # child moving. (Not to be confused with the dashboard's "Dispatched" column,
+    # which is the TOTAL children spawned.)
+    def never_started(branch_log_id)
+      Workflow.where(parent_execution_log_id: branch_log_id,
+        state: Workflow.states[:idle], started_at: nil)
+    end
+    # A child was dispatched but no worker has started it yet. If this is the only
+    # motion left, it's a queued/rekicked-but-unpicked straggler (which may never be
+    # picked up), NOT active work — so the poller backs off.
+    def never_started?(branch_log_id) = never_started(branch_log_id).exists?
+    # All children spawned into this branch (every state) — the dispatch total. Fixed
+    # once the branch is sealed, so the poller counts it exactly once and caches it on
+    # the branch-log metadata. This is the dashboard's "Spawned" column. Distinct from
+    # #never_started, which is only the idle-and-unstarted subset.
+    def spawned(branch_log_id)
+      Workflow.where(parent_execution_log_id: branch_log_id)
+    end
     def done?(branch_log_id)
       sealed?(branch_log_id) && !incomplete(branch_log_id).exists?
     end
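
The probe comments above sort a branch's children into groups: progressing (running, or idle with no `started_at`), waiting (idle with `started_at` set), and blocked (failed/stalled). A minimal in-memory sketch of that classification follows, assuming a hypothetical `Child` struct with symbol states in place of the `Workflow` model and its relations.

```ruby
# Hypothetical stand-in for a Workflow row: a state symbol and a start time.
Child = Struct.new(:state, :started_at)

# Can advance without operator action: running, or dispatched but unstarted.
def progressing?(child)
  child.state == :running || never_started?(child)
end

# Dispatched but not yet picked up by any worker.
def never_started?(child)
  child.state == :idle && child.started_at.nil?
end

# Keeps the branch open; includes waiting and blocked children.
def incomplete?(child)
  child.state != :completed
end

children = [
  Child.new(:running, Time.now),   # progressing
  Child.new(:idle, nil),           # progressing (never started)
  Child.new(:idle, Time.now),      # waiting on wait/wait_until
  Child.new(:failed, Time.now),    # blocked
  Child.new(:completed, Time.now)  # done
]

p children.count { |c| progressing?(c) }   # → 2
p children.count { |c| never_started?(c) } # → 1
p children.count { |c| incomplete?(c) }    # → 4
```

The waiting and failed children count toward `incomplete?` but not `progressing?`, which is the distinction the comments rely on to keep them from pinning the poll cadence at its floor.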

data/lib/chrono_forge/configuration.rb ADDED

@@ -0,0 +1,25 @@
+# frozen_string_literal: true
+module ChronoForge
+  # Engine-wide configuration. Set via ChronoForge.configure in an initializer.
+  class Configuration
+    # The queue the branch-merge poller (BranchMergeJob) runs on.
+    #
+    # This MUST NOT be a queue that a fan-out's own children saturate: merge_branches
+    # enqueues the poller AFTER dispatching the branch's children, so on a shared
+    # queue it is starved behind the whole backlog and only gets a worker slot near
+    # the end — it then polls once, at pending≈0, and backs off, so the parent's
+    # convergence lags by up to max_interval and no mid-drain throughput is recorded.
+    # Because the poller is OUR code (not the user's job), its placement is a
+    # first-class setting rather than something to monkey-patch onto BranchMergeJob.
+    #
+    # Defaults to :default (fine when fan-outs run on their own queues). For large
+    # fan-outs, point this at a dedicated queue with its own worker so the poller
+    # runs promptly throughout the drain.
+    attr_accessor :branch_merge_queue
+    def initialize
+      @branch_merge_queue = :default
+    end
+  end
+end
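
A possible initializer for this setting is sketched below. The diff confirms `Configuration#branch_merge_queue` and that it is set via `ChronoForge.configure`, but the block-yielding form of `configure`, the `configuration` reader, the initializer path, and the queue name `:chrono_forge_poller` are assumptions. The stand-in module exists only so the snippet runs without the gem installed.

```ruby
# Stand-in for the gem so this runs on its own. Configuration mirrors the class
# in the diff; `configure` yielding the object and the `configuration` reader
# are ASSUMED, not taken from the diff.
module ChronoForge
  class Configuration
    attr_accessor :branch_merge_queue

    def initialize
      @branch_merge_queue = :default
    end
  end

  def self.configuration
    @configuration ||= Configuration.new
  end

  def self.configure
    yield configuration
  end
end

# config/initializers/chrono_forge.rb (hypothetical path and queue name)
ChronoForge.configure do |config|
  # Dedicated queue with its own worker, so the poller is not starved
  # behind a large fan-out's children.
  config.branch_merge_queue = :chrono_forge_poller
end

p ChronoForge.configuration.branch_merge_queue # → :chrono_forge_poller
```

The queue name by itself does nothing unless the job backend actually runs a worker for it, so the dedicated queue would also need to be declared in the backend's own configuration.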