RubyGems - chrono_forge - Versions diffs - 0.10.0 → 0.11.0 - Mend

chrono_forge 0.10.0 → 0.11.0


Files changed (29)

checksums.yaml +4 -4
data/CHANGELOG.md +34 -1
data/README.md +188 -105
data/Rakefile +4 -0
data/cliff.toml +62 -0
data/docs/design/per-child-commit-overhead.md +213 -0
data/docs/fanout-scale-test.md +247 -0
data/docs/superpowers/plans/2026-06-30-poller-rekick-and-eta-cadence.md +205 -0
data/docs/superpowers/plans/2026-06-30-poller-rekick-and-eta-cadence.md.tasks.json +33 -0
data/docs/superpowers/plans/2026-07-01-workflow-definition-dag.md +1373 -0
data/docs/superpowers/plans/2026-07-01-workflow-definition-dag.md.tasks.json +68 -0
data/docs/superpowers/specs/2026-07-01-workflow-definition-dag-design.md +203 -0
data/lib/chrono_forge/branch_merge_job.rb +158 -21
data/lib/chrono_forge/branch_probe.rb +44 -0
data/lib/chrono_forge/configuration.rb +25 -0
data/lib/chrono_forge/definition.rb +37 -0
data/lib/chrono_forge/definition_analyzer.rb +501 -0
data/lib/chrono_forge/executor/context.rb +23 -0
data/lib/chrono_forge/executor/lock_strategy.rb +10 -3
data/lib/chrono_forge/executor/methods/continue_if.rb +15 -6
data/lib/chrono_forge/executor/methods/durably_execute.rb +15 -7
data/lib/chrono_forge/executor/methods/durably_repeat.rb +30 -14
data/lib/chrono_forge/executor/methods/merge_branches.rb +5 -4
data/lib/chrono_forge/executor/methods/workflow_states.rb +35 -47
data/lib/chrono_forge/executor.rb +34 -9
data/lib/chrono_forge/version.rb +1 -1
data/lib/chrono_forge.rb +8 -0
data/lib/tasks/release.rake +212 -0
metadata +28 -2

data/docs/design/per-child-commit-overhead.md ADDED

@@ -0,0 +1,213 @@
+# Per-child commit overhead on large branch fan-out
+Status: **implemented and benchmarked**, targeted for **v0.12** (baseline is the
+released v0.10). All the consolidations under "Implemented" below have shipped to
+the working tree and are validated by matched baseline-vs-consolidation benchmarks
+at 20k/50k/100k — see
+[`fanout-scale-test.md`](../fanout-scale-test.md): **~−50% per-child execution
+time, +20–30% throughput**, flat across scale. The risky `save`+`release` merge was
+deliberately left out. Scope: `chrono_forge` core engine. Independent of the dashboard.
+## Problem
+Branch fan-out throughput tops out at roughly a couple thousand DB commits/sec
+on a single Postgres writer. Each child workflow, even a trivial one that runs a
+single `durably_execute` to completion in one `perform`, produces a string of
+independent transactions. Because every transaction is its own `fsync`, the
+workload is commit-bound (fsync-bound), and the number of children you can drain
+per second is `~(fsync budget) / (commits per child)`.
+For ~8 primary-DB commits per child plus Solid Queue overhead, ~10 workers land
+around ~2,000 fsync/s — that's the ceiling. Lowering commits-per-child raises the
+ceiling proportionally.
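A back-of-envelope sketch of the ceiling formula from the Problem section, `~(fsync budget) / (commits per child)`. The helper name `children_per_second` is hypothetical, the ~2,000 fsync/s budget and the 8 and 5 commit counts are the document's rough estimates, and the model ignores the Solid Queue commits that share the same budget, so treat the results as an upper bound rather than a prediction.

```ruby
# Commit-bound ceiling: children drained per second for a given fsync budget.
# Hypothetical helper for illustration; not part of chrono_forge.
def children_per_second(fsync_budget:, commits_per_child:)
  fsync_budget.fdiv(commits_per_child)
end

FSYNC_BUDGET = 2_000 # rough single-writer estimate quoted in the text

before = children_per_second(fsync_budget: FSYNC_BUDGET, commits_per_child: 8)
after  = children_per_second(fsync_budget: FSYNC_BUDGET, commits_per_child: 5)

puts format("8 commits/child: ~%d children/s", before)
puts format("5 commits/child: ~%d children/s", after)
```

The ratio between the two results (400 vs 250) is the best case for consolidation on a purely commit-bound workload; any shared overhead that is not reduced shrinks the gain.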
+## The actual per-child commit sequence
+Traced for a trivial child: one `durably_execute`, runs to completion in a single
+`perform`. Each row is one transaction = one commit = one fsync.
+| # | Where | Write | Source |
+|---|-------|-------|--------|
+| 1 | `setup_workflow!` | `update_column(:started_at)` — **branch children only** (parent pre-inserts the row, so the create block never stamps it) | `executor.rb:221` |
+| 2 | `LockStrategy.acquire_lock` | `transaction { lock!; update_columns(locked_by, locked_at, state: :running) }` | `executor/lock_strategy.rb:10-33` |
+| 3 | `durably_execute` | INSERT execution log (`find_or_create_execution_log!`) | `executor/methods/durably_execute.rb:70` |
+| 4 | `durably_execute` | `update!(attempts, last_executed_at)` — *before* the step | `executor/methods/durably_execute.rb:80` |
+| 5 | `durably_execute` | `update!(state: :completed)` — *after* the step | `executor/methods/durably_execute.rb:89` |
+| 6 | `complete_workflow!` | **three separate commits**: completion-marker INSERT + `workflow.completed!` UPDATE + log `update!` | `executor/methods/workflow_states.rb:50-70` |
+| 7 | `ensure` → `context.save!` | `update_column(:context)` | `executor/context.rb:70`, called at `executor.rb:187` |
+| 8 | `release_lock` | `update_columns(locked_at: nil, locked_by: nil, …)` | `executor/lock_strategy.rb:53` |
+| + | Solid Queue | claim + finish/delete (2 more, in the **queue** DB) | backend |
+So ~8 primary-DB commits + ~2 queue commits per child. Empirically (SQLite probe
+of a no-step child), step 6 is **three** independent commits, not one — the
+completion-marker INSERT, the workflow→`:completed` UPDATE, and the marker UPDATE
+each commit separately. That made it the single biggest safe win.
+## What's safely reducible (≈⅓, realistically ~8 → ~5)
+> Note: this section is the original sketch. It shipped *differently* — see
+> "Implemented" below. The real target turned out to be `complete_workflow!`'s own
+> three commits (#6), which collapse to one safely; #7+#8 (`context.save!` +
+> `release_lock`) were deliberately left split. Net effect is the same ~⅓.
+These have no external side effect between them, so collapsing them into one
+transaction cannot cause double-execution of anything observable:
+- **Collapse the end-of-run trio (#6, #7, #8).** Completion + context save + lock
+  release happen back-to-back with nothing external in between. One transaction
+  instead of three. Caveat: today `context.save!` and `release_lock` are
+  deliberately ordered in an `ensure` so the lock is released (and the
+  continuation published) *even if the save raises* — see `executor.rb:178-196`.
+  Any consolidation must preserve that "always release the lock, never strand the
+  workflow" guarantee. Merging the happy path while keeping the failure path's
+  release semantics is the careful part.
+- **Fold #1 into #2.** Stamp `started_at` inside the `acquire_lock` transaction
+  instead of as its own `update_column`. Saves one commit on branch children.
+Net: ~8 → ~5 primary commits ⇒ a proportional throughput bump on a commit-bound
+workload. Real and worth doing, but it's ~40%, **not** an order of magnitude.
+## What is NOT reducible (it's the point of the engine)
+- **The per-step INSERT + "completed" UPDATE (#3, #5) must commit independently
+  of other steps.** That committed "completed" marker is exactly what guarantees a
+  side-effecting step (charge a card, send a webhook) runs once and is never
+  replayed. You cannot batch step-completion commits across steps without risking
+  double-execution of external effects. It looks wasteful for a no-op child, but
+  the engine cannot assume a step is side-effect-free.
+- **Lock acquire (#2) must commit before doing work** so other workers see it.
+- **Solid Queue's claim/finish** is backend overhead, outside ChronoForge.
+## The real lever for massive fan-out is architectural, not commit-tuning
+To run millions of trivial children you change the shape, not shave commits:
+- **Bulk / lightweight child mode** for large sets of trivial, idempotent items:
+  process N items as one durable unit, or a leaner child path that skips the full
+  lock + completion ceremony when a child has no deferral points. Cuts per-item
+  commits dramatically, but **trades away per-item isolation and observability** —
+  a feature with real trade-offs, not a tweak.
+- **Opt-in "transactional workflow" mode**: whole `perform` in one transaction → 1
+  commit. Only safe for genuinely side-effect-free workflows.
+- **Scale the write tier out** (sharded Postgres) if you need raw commit headroom.
+## Implemented
+Two consolidations shipped, both behaviour-preserving and covered by tests:
+1. **`complete_workflow!` → one transaction** (`workflow_states.rb`). The marker
+   INSERT, the workflow→`:completed` UPDATE, and the marker UPDATE now share a
+   single `ActiveRecord::Base.transaction`. 3 commits → 1. The failure path
+   re-finds/recreates the marker and records `:failed` outside the rolled-back
+   transaction, so completion observability is preserved and a resume simply
+   retries completion. Test: `WriteConsolidationTest#test_completion_writes_share_a_single_transaction`.
+2. **`started_at` folded into `acquire_lock`** (`lock_strategy.rb`). The branch-child
+   first-execution `started_at` stamp now rides along the existing lock UPDATE
+   (`update_columns`) instead of a standalone `update_column` in `setup_workflow!`.
+   1 commit → 0 (absorbed). The poller's "nil started_at = dropped child" contract
+   is preserved — it's still stamped on first pickup, just in the same write.
+   Tests: `LockStrategyTest#test_acquire_lock_stamps_started_at_*`.
+Net for a trivial branch child: roughly 3 fewer commits/fsyncs per child (the
+note's safe ~⅓), with no behaviour change.
+### Statement flattening (a different axis: round-trips/CPU, not fsyncs)
+Within the now-single transactions, two INSERT-then-UPDATE pairs were collapsed to
+a single INSERT. This does not cut commits further — it cuts statements (DB
+round-trips, parse/plan CPU, a little WAL), so it pays most on round-trip- or
+CPU-bound profiles, less on a pure single-writer fsync-bound one.
+3. **Completion marker born completed** (`workflow_states.rb`). The marker is
+   INSERTed already in `:completed` state (attempts: 1, all timestamps set) instead
+   of INSERTed `:started` then UPDATEd. The rare resume-after-failed-completion /
+   create-race path still flips an existing marker via UPDATE. Completion is now
+   2 statements (marker INSERT + workflow UPDATE), down from 3. Test:
+   `WriteConsolidationTest#test_completion_marker_is_born_completed_in_a_single_insert`.
+4. **`durably_execute` first attempt recorded in the INSERT** (`durably_execute.rb`).
+   A fresh step log is created with `attempts: 1, last_executed_at` baked in, so the
+   pre-execution attempt-bump UPDATE is skipped on first run (detected via
+   `previously_new_record?`). Retries (existing log) still bump via UPDATE — the
+   committed `:completed` write after the side effect is untouched (the once-only
+   boundary). Tests: `WriteConsolidationTest#test_first_step_run_records_attempt_in_the_insert`
+   and `#test_retry_run_bumps_attempt_with_an_update`.
+The same two shapes recur across the engine, so the flattening was applied
+everywhere the pattern is safe (a log written twice in one pass, no deferral
+between the writes):
+5. **`fail_workflow!` born-completed + one transaction** (`workflow_states.rb`).
+   Identical to `complete_workflow!`: the workflow→`:failed` transition and the
+   (born-completed) failure marker are batched in one transaction, marker written
+   terminal in a single INSERT. Test:
+   `WriteConsolidationTest#test_failure_marker_is_born_completed_in_a_single_insert`.
+6. **`continue_if` first evaluation in the INSERT** (`continue_if.rb`). Same as
+   `durably_execute` — a fresh gate bakes `attempts: 1`; only re-evaluations after a
+   not-met halt bump via UPDATE. Test:
+   `WriteConsolidationTest#test_continue_if_first_run_records_attempt_in_the_insert`.
+7. **`durably_repeat` coordination + repetition logs** (`durably_repeat.rb`). Both
+   bake the first attempt into their INSERT; later passes/resumes bump via UPDATE.
+   Test: `WriteConsolidationTest#test_durably_repeat_first_repetition_records_attempt_in_the_insert`.
+The born-completed sites (#3, #5) share one helper, `create_completed_execution_log!`
+(`executor.rb`): a single race/replay-safe INSERT in `:completed` state, falling
+back to an UPDATE only when the row already exists.
+Not flattened: **`merge_branches`** and the **`durably_repeat` fast-forward summary**
+create their log in one pass and complete/finalize it in a *later* pass (across a
+halt), or merge metadata into a reused row — the two writes are inherently separate,
+so there's nothing to collapse. **Waits** (`wait`, `wait_until`) and **branch
+coordination** are the same story: they must persist `:started` before halting.
+`acquire_lock`'s SELECT…FOR UPDATE + UPDATE was left as a SELECT + UPDATE: a
+single conditional UPDATE would flip pessimistic→optimistic locking and complicate
+`ensure_executable!` / the contention-error message for one statement — not worth it.
+Full suite (236 tests) green; behaviour unchanged.
+### Deliberately NOT done: merging `context.save!` + `release_lock`
+Tempting (it's the remaining adjacent pair), but the split is load-bearing. The
+`ensure` block (`executor.rb:178-196`) guarantees the lock is released — and the
+continuation published — *even if `context.save!` raises*, and it publishes the
+continuation only after a successful release so a zero-delay same-key continuation
+can't lose the acquire race. Wrapping save+release in one transaction would roll
+the release back with a failed save and strand the workflow holding its lock.
+A correct merge needs a fallback (standalone release on rollback) plus careful
+handling of the lost-lock case, and it also changes whether context persists when
+this job has lost the lock. That subtlety isn't worth one commit on a safety-
+critical path — left as-is intentionally.
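A minimal plain-Ruby sketch of the ordering this section describes: save and release stay separate, and the release sits in an `ensure` so a failed save cannot strand the lock. `FakeContext`, `FakeLock`, and `finish_run` are hypothetical stand-ins, not the engine's classes, and the real `ensure` block in `executor.rb` handles more cases (such as a lost lock) than this shows.

```ruby
# Hypothetical stand-ins: chrono_forge has no FakeContext or FakeLock.
class FakeContext
  def initialize(fail_on_save:)
    @fail_on_save = fail_on_save
  end

  def save!
    raise "context save failed" if @fail_on_save
  end
end

class FakeLock
  attr_reader :released

  def initialize
    @released = false
  end

  def release!
    @released = true
  end
end

# Save and release stay separate writes. The ensure clause runs whether or
# not save! raised, so a failed save can never leave the lock held.
def finish_run(context, lock, published)
  context.save!
ensure
  lock.release!
  # Publish the continuation only once the release has succeeded.
  published << :continuation if lock.released
end

lock = FakeLock.new
published = []
begin
  finish_run(FakeContext.new(fail_on_save: true), lock, published)
rescue RuntimeError => e
  puts "save raised: #{e.message}"
end
puts "lock released: #{lock.released}, published: #{published.inspect}"
```

If `save!` and `release!` shared one transaction instead, the rollback triggered by the failed save would undo the release as well, which is the stranding scenario the text warns about.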
+## Benchmark outcome
+Measured against matched baseline pairs at 20k/50k/100k (single Postgres, 5×4
+workers; full data in [`fanout-scale-test.md`](../fanout-scale-test.md)):
+- **Per-child execution time roughly halved** — p50 ~35 ms → ~19 ms (−46%), avg
+  −49%, **flat from 20k to 100k**. This is the direct fsync-per-commit saving and it
+  is robust and repeatable.
+- **Aggregate throughput +20–30%, not 2×.** Once the child's own commits are
+  cheaper, the **Solid Queue claim/finish cycle and the shared single-Postgres
+  fsync budget** dominate per-slot time — halving one of several serial commits
+  can't double the whole pipeline. The win also *compresses* with N (+30% at 20k →
+  +22% at 100k) as that shared ceiling takes a larger share.
+- **Dispatch got faster too** (100k fan-out 64 s → 53 s): lighter child commits free
+  fsync headroom for the parent's `spawn_each` inserts.
+So the measure-first question is answered: the workload **is** commit/fsync-bound,
+and the consolidation is a clear, behaviour-preserving win — but the residual
+ceiling is now the queue backend and the single-DB fsync budget, not engine work.
+## Where the remaining headroom is
+1. **Split the queue onto its own database/disk.** Solid Queue's claim/finish
+   fsyncs currently compete with engine-commit fsyncs for one budget; separating
+   them lets the two streams run in parallel. Likely a bigger aggregate lever than
+   any further engine shaving.
+2. **Bulk / lightweight child mode** for large sets of trivial, idempotent items —
+   the path to order-of-magnitude gains, with its isolation/observability
+   trade-offs made explicit and opt-in (see above).
+3. **Scale the write tier out** (sharded Postgres) for raw commit headroom.
+Do not expect a 10× from further commit consolidation — the engine is no longer the
+wall. That comes from (1)/(2) or scaling the write tier.

data/docs/fanout-scale-test.md ADDED

@@ -0,0 +1,247 @@
+# ChronoForge fan-out scale test
+_Run 2026-06-28 on a local Mac (11 cores)._
+Two things were measured here:
+1. **Correctness at scale** — does a fan-out converge, lose nothing, and stay in
+   constant memory up to **500,000 children**?
+2. **Throughput** — what is the steady-state ceiling, where is it, and how much
+   does an experimental **commit-consolidation** change move it (matched
+   baseline-vs-consolidation pairs at 20k / 50k / 100k)?
+## What was tested
+A parent workflow that fans out **N child workflows** via `branch` + `spawn_each`
+(a `Range` source → constant-memory batched `insert_all`, joined inline with
+`automerge`). Each child is one trivial `durably_execute`.
+```ruby
+class ScaleFanout < ActiveJob::Base
+  prepend ChronoForge::Executor
+  def perform(count:)
+    branch :fanout, automerge: true do
+      spawn_each :child, (1..count) { |i| [ScaleChild, {n: i}] }
+    end
+  end
+end
+```
+All runs were on **Solid Queue + Postgres 13 (Docker)** against a dev DB. Workers
+ran on an **isolated `scale` queue** (`BranchMergeJob`
+re-routed to the same queue so it never competed with real jobs).
+| Parameter | Value |
+|---|---|
+| Job backend | Solid Queue, isolated `scale` queue |
+| Database | Postgres 13 (Docker), `max_connections` 100, `shared_buffers` 128 MB |
+| Worker concurrency | 5 processes × 4 threads (= 20 slots) |
+| DB connection pool | `DB_POOL=10` per process |
+| Worker polling interval | 0.1 s |
+| Dispatcher | default Solid Queue dispatcher |
+| Child workload | one trivial `durably_execute` (no real I/O) |
+---
+## Part 1 — Correctness (500k)
+| N | Wall clock | Throughput | Completed | Outcome |
+|---|-----------|------------|-----------|---------|
+| 20,000 | 88.3s | 226/s | 20,000 / 20,000 | ✓ merged |
+| 50,000 | 215.5s | 232/s | 50,000 / 50,000 | ✓ merged |
+| 100,000 | 443.2s | 226/s | 100,000 / 100,000 | ✓ merged |
+| **500,000** | **2,497s (~41.6 min)** | **200/s** | **500,000 / 500,000** | **✓ merged** |
+**Flawless at every scale.** Every child completed and every parent converged to
+`completed`. Streaming dispatch held constant memory; `BranchMergeJob` polled to
+convergence; the parent's replay correctly **skipped the sealed branch** (no
+re-dispatch). Zero failures, zero lost children, no memory wall.
+Throughput held **rock-steady at ~200–230/s from 20k through 500k** — no
+degradation as the tables grew. The path has a **stable steady-state ceiling**.
+---
+## Part 2 — Benchmark: baseline vs commit-consolidation
+**Baseline** is the current released engine (v0.10). **Consolidation** is the
+unreleased patch targeted for **v0.12** that cuts per-child write cost without
+changing behaviour. It (a) folds `started_at` into the lock-acquire
+transaction, (b) collapses `complete_workflow!`'s three separate commits — marker
+INSERT + `workflow.completed!` UPDATE + marker UPDATE — into one transaction, and
+(c) flattens the INSERT-then-UPDATE pairs in the step / completion / failure /
+`continue_if` / `durably_repeat` paths into single INSERTs (the row is born in its
+terminal/attempted state). It deliberately does **not** merge `context.save!` +
+`release_lock` — that split is load-bearing (the lock must release even if the
+save raises). See `docs/design/per-child-commit-overhead.md` for the full set.
+Matched pairs, isolated worker (5 procs × 4 threads), `DB_POOL=10`, single
+Postgres, ~170k-row backdrop held constant across all six runs. Throughput is
+measured over the child window: `N / (max(completed_at) − min(created_at))`.
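The child-window throughput formula above, `N / (max(completed_at) − min(created_at))`, can be sketched as follows. The `exec_throughput` helper and the three sample rows are hypothetical; in the actual runs these timestamps would come from the workflow table.

```ruby
# Throughput over the child window: count divided by the span from the
# earliest created_at to the latest completed_at (in seconds).
# Hypothetical helper; each child is a Hash of Time values.
def exec_throughput(children)
  first_created  = children.map { |c| c[:created_at] }.min
  last_completed = children.map { |c| c[:completed_at] }.max
  children.size / (last_completed - first_created)
end

t0 = Time.utc(2026, 6, 28, 12, 0, 0)
children = [
  { created_at: t0,     completed_at: t0 + 2 },
  { created_at: t0 + 1, completed_at: t0 + 3 },
  { created_at: t0 + 1, completed_at: t0 + 4 }
]

rate = exec_throughput(children) # 3 children over a 4-second window
puts format("%.2f children/s", rate)
```

Because the window starts at the first `created_at`, dispatch time is included, so a slow fan-out lowers this figure even if every child executes quickly.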
+### Throughput
+| Run | N | Fan-out (dispatch) | Dispatch rate | **Exec throughput** |
+|------|------:|------:|------:|------:|
+| baseline 20k | 20,000 | 9.6s | 2,093/s | **226/s** |
+| cons 20k | 20,000 | 7.8s | 2,561/s | **293/s** (+30%) |
+| baseline 50k | 50,000 | 30.4s | 1,647/s | **232/s** |
+| cons 50k | 50,000 | 17.2s | 2,905/s | **279/s** (+20%) |
+| baseline 100k | 100,000 | 63.7s | 1,570/s | **226/s** |
+| cons 100k | 100,000 | 53.0s | 1,886/s | **275/s** (+22%) |
+### Per-child execution time — `completed_at − started_at` (seconds)
+| Run | avg | p50 | p95 | p99 | max |
+|------|----:|----:|----:|----:|----:|
+| baseline 20k | 0.042 | 0.035 | 0.074 | 0.137 | 0.544 |
+| cons 20k | 0.020 | 0.018 | 0.030 | 0.070 | 0.560 |
+| baseline 50k | 0.041 | 0.035 | 0.065 | 0.138 | 0.933 |
+| cons 50k | 0.022 | 0.019 | 0.035 | 0.074 | 0.812 |
+| baseline 100k | 0.042 | 0.036 | 0.072 | 0.148 | 0.911 |
+| cons 100k | 0.022 | 0.019 | 0.038 | 0.079 | 0.578 |
+### Fan-out (spawn) time — child `created_at` span
+How long the parent's `spawn_each` took to enqueue the whole set
+(`max(created_at) − min(created_at)` over the children):
+| Children | baseline | consolidation |
+|---:|---:|---:|
+| 20k | 9.6s (2,093/s) | 7.8s (2,561/s) |
+| 50k | 30.4s (1,647/s) | 17.2s (2,905/s) |
+| 100k | 63.7s (1,570/s) | 53.0s (1,886/s) |
+Spawning 100k children takes ~64s on baseline, ~53s consolidated — roughly
+**0.5–0.6 ms of parent time per child**. The spawn rate **degrades as N grows**
+on baseline (2,093 → 1,570/s) because the parent inserts child rows into the same
+Postgres that's simultaneously draining the early children. Consolidation lightens
+the children's commit load, freeing fsync budget for the parent's inserts, so its
+spawn rate holds up far better.
+---
+## Throughput analysis — a *flat* ceiling, set by fsync
+The ceiling is **single-Postgres commit/`fsync` throughput**, not worker count:
+- Throughput is **flat (~200–230/s) from 20k to 500k** — it neither degrades as
+  tables grow nor rises with more work in flight against 20 worker slots.
+- Halving per-child execution time (consolidation, below) lifts aggregate
+  throughput only **~25%** — so the bottleneck is shared write infrastructure, not
+  per-child work or worker count.
+- **Not** connection-bound (51 / 100 connections used), **not** memory-bound.
+Each trivial baseline child costs ~**10 primary-DB commits + ~2 Solid Queue
+commits** (the `complete_workflow!` entry below is itself three separate commits):
+```
+setup started_at · acquire_lock · step INSERT · step attempt-update ·
+step completed-update · complete_workflow! (×3) · context.save! · release_lock   (+ SQ claim/finish)
+```
+≈ 200 children/s × ~10 commits ≈ **~2,000 fsyncs/s** — the wall.
+**Commit consolidation confirms the diagnosis.** Folding `started_at` into the
+lock-acquire txn, collapsing `complete_workflow!`'s three commits into one, and
+baking each log's first write into its INSERT **halves per-child execution time**
+(p50 ~35 ms → ~19 ms, −46%; avg −49%) and is **flat across 20k→100k**. Yet
+aggregate throughput rises only **+20–30%**, not 2×: once the
+child's own commits are cheaper, the **Solid Queue claim/finish cycle** and the
+shared single-Postgres fsync budget dominate the per-slot time. Halving one of
+several serial commits can't double the whole pipeline.
+## Queue wait is backlog math, not a latency regression
+Per-child **queue wait** (`started_at − created_at`) scales ~linearly with N:
+| Run | avg | p50 | p95 | max |
+|------|----:|----:|----:|----:|
+| baseline 20k | 40.6 | 42.9 | 75.3 | 78.7 |
+| cons 20k | 30.3 | 29.3 | 57.4 | 60.4 |
+| baseline 50k | 93.7 | 95.6 | 173.0 | 185.1 |
+| cons 50k | 80.7 | 80.3 | 154.4 | 161.9 |
+| baseline 100k | 191.6 | 194.9 | 357.8 | 379.5 |
+| cons 100k | 151.7 | 147.5 | 291.1 | 310.9 |
+This is **not** a per-job slowdown. The entire set is enqueued in seconds but
+drains at ~225–290/s, so a child's wait is just its **position in the backlog ÷
+throughput**. The last of 100k children waits ~100,000 / ~270 ≈ ~6 min by
+arithmetic, regardless of how fast any individual child runs. Consolidation
+shrinks the wait proportionally (100k: 192s → 152s) because it lifts throughput —
+the lever for queue wait is throughput, not per-child execution time.
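The backlog arithmetic above, as a small sketch. `queue_wait_seconds` is a hypothetical helper, and the ~270/s figure is the approximate consolidated drain rate quoted in the paragraph; the model assumes the whole set is enqueued up front and drains at a constant rate.

```ruby
# Wait for the child at a given backlog position, assuming a constant drain
# rate. Hypothetical helper for illustration only.
def queue_wait_seconds(position, throughput_per_s)
  position.fdiv(throughput_per_s)
end

last_wait = queue_wait_seconds(100_000, 270)
puts format("last of 100k waits ~%.0f s (~%.1f min)", last_wait, last_wait / 60)

# Raising throughput shortens the wait for every position, whereas a faster
# individual child changes nothing unless it lifts the drain rate.
faster = queue_wait_seconds(100_000, 290)
puts format("at 290/s: ~%.0f s", faster)
```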
+## Dashboard at 500k
+The dashboard's scale-aware design held up live: capped `5000+` counts (no
+`COUNT(*)` over 500k), keyset pagination, blocked-first triage — instant render
+throughout the run.
+On a branch's detail views the counts the poller already records render **exact**
+straight from the branch-log metadata — no live count: `pending` and
+`never-started` (recomputed each poll) and the total `spawned` (immutable once the
+branch is sealed, so it's counted **once** and cached). Only the mutable per-state
+chips (idle / completed / …) stay capped. So a 500k branch shows its real
+`spawned` / `pending` / `never-started` figures, not `5000+`.
+## Poller behavior
+`BranchMergeJob` cadence is driven by **estimated time-to-drain** (from the prior
+poll's uncapped pending count), not backlog size. For a 500k fan-out draining at
+~200/s this is flat `max_interval` (5 min) polling through the long middle, then a
+smooth ramp over the final minutes, tightening to `min_interval` (~5s) for the
+last few thousand children — so the parent is woken within ~5s of the last child
+finishing rather than up to a full `max_interval` late. ~15 cheap polls across the
+run, one branch-scoped index count each (`[parent_execution_log_id, state]`); no
+new indexes. When nothing completes in an interval the fallback is motion-aware: a
+child still running holds the responsive floor (so a slow or single-child branch is
+never woken late), a dispatched-but-unpicked straggler backs off exponentially, and
+a fully blocked/waiting branch decays to `max_interval` instead of spinning.
+Rekick of dropped children is **gated on the never-started count
+delta**: if that count fell since the last poll, workers are consuming the branch's
+queue, so deeply-queued-but-healthy children are left alone. It deliberately does
+NOT use total pending — a `wait_until` child resuming would drop pending without
+any never-started child moving, masking a genuinely-dropped child behind staggered
+waits. Only a branch whose never-started count has gone flat has its stale
+never-started children rekicked, and a `touch` on each rekick debounces it to at
+most once per `REKICK_AFTER`. Rekick counts are stamped on the branch-log metadata
+for the dashboard.
+### ⚠️ Poller queue placement (a trap)
+`merge_branches` enqueues `BranchMergeJob` **after** it dispatches the branch's
+children, so the poller **must not run on the same queue as a large fan-out's
+children**. If it does, it is enqueued behind the entire backlog and starved: it
+gets a worker slot only near the end, polls **once** at `pending≈0` with no prior
+sample (`rate 0`), and backs off to `max_interval`. The consequences are twofold
+and both defeat the point of the ETA cadence:
+- the parent's convergence **lags by up to `max_interval` (~5 min)** after the last
+  child finishes (all children `completed`, parent still `idle`); and
+- the dashboard's **live throughput/ETA never renders** — the poller never took a
+  mid-drain sample, so `rate` stays 0.
+Give the poller a **dedicated, un-starved queue** so it polls throughout the drain
+(then ETA engages and convergence is tight) via the first-class setting:
+```ruby
+ChronoForge.configure { |c| c.branch_merge_queue = :chrono_forge_pollers }
+```
+(and run a worker on that queue). It defaults to `:default`, which is fine when
+fan-outs run on their own queues. For these runs that queue is `:scale_poller`
+(see `config/scale_queue.yml`). This bit us live-driving 20k/100k — the first pass
+had the poller on `:scale` and every parent hung `idle` for 5 min.
+## Environment caveats
+- Local Docker **Postgres 13**, default `shared_buffers` 128 MB, single disk,
+  `max_connections` 100. Absolute numbers are this-laptop-specific; tuned/larger
+  production infra changes them.
+- Part 1's 20k–100k rows are the **same clean baseline runs** as Part 2 (matched
+  pairs at a fixed ~170k-row backdrop). The **500k** row is the original
+  single-growing-table run, so its absolute throughput isn't strictly comparable
+  to the others — but it lands in the same ~200/s band.
+- A child here is a **trivial** `durably_execute`; real children doing actual work
+  shift the bottleneck away from the engine's own commits.

data/docs/superpowers/plans/2026-06-30-poller-rekick-and-eta-cadence.md ADDED

@@ -0,0 +1,205 @@
+# BranchMergeJob: ETA poll cadence, drain-aware rekick, configurable poller queue
+_Implementation record. Started from the two defects below; the final shape was
+driven by three review rounds and a live 20k/100k/500k drive, so this documents
+what shipped, not a pre-implementation plan._
+## Problem
+`ChronoForge::BranchMergeJob` is the lightweight poller that joins a fan-out's
+branches: each pass it counts a branch's incomplete children, wakes the parent
+when all are sealed and complete, otherwise re-arms itself. Two defects, plus a
+deployment trap surfaced by the live drive:
+1. **Rekick re-enqueued healthy children.** `rekick_dropped_jobs` re-dispatched any
+   child that was `idle`, `started_at: nil`, and `updated_at < REKICK_AFTER.ago` —
+   which also matches a healthy child merely waiting deep in a draining backlog
+   (queue wait exceeds `REKICK_AFTER` at N ≥ 100k). And because `perform_later`
+   never touched the row, a rekicked-but-unpicked child stayed stale and was
+   **re-rekicked every poll**, piling up duplicates.
+2. **Cadence overshoot.** The delay was `(progressing × FACTOR).clamp(min, max)`
+   with `progressing` a count capped at `CAP = 5000`, so it saturated to
+   `max_interval` while any large backlog existed — polling *slowest* exactly when
+   a fast-draining backlog was about to finish. A 20k fan-out drains in ~88s but
+   the parent wasn't woken until ~310s; 500k sealed up to 5 min late.
+3. **Poller starvation (deploy trap).** `merge_branches` enqueues the poller *after*
+   dispatching the branch's children, so on a queue that those children saturate, it
+   is starved behind the whole backlog: it polls once, at pending≈0, and backs off,
+   so the parent converges up to `max_interval` late and no throughput is recorded.
+   The only lever was monkey-patching `BranchMergeJob.queue_as` in the host app,
+   which a dev code-reload can silently reset.
+## Solution overview
+- **Cadence** is driven by **estimated time-to-drain**, measured from the branch's
+  own completion rate — so the parent is woken within ~`min_interval` of the last
+  child finishing. When an interval sees no completion, the fallback is
+  **motion-aware** so a slow-but-healthy child isn't woken late.
+- **Rekick** is gated on the **never-started count delta** — the true
+  "workers are pulling this branch's queue" signal — and **debounced** with
+  `child.touch`, so healthy deep-queued children are never rekicked and a dropped
+  child is redelivered at most once per `REKICK_AFTER`.
+- The poller's **queue is a first-class config** (`branch_merge_queue`), since its
+  placement is our concern, not the user's.
+- The measured **rate + ETA are persisted** each poll and surfaced **live on the
+  dashboard**, which now **auto-refreshes on every page**.
+---
+## Cadence — `reschedule_delay(pending, rate, motion, prev_delay, min, max)`
+`lib/chrono_forge/branch_merge_job.rb`
+```ruby
+def reschedule_delay(pending, rate, motion, prev_delay, min_interval, max_interval)
+  return (pending / rate * ETA_FRACTION).clamp(min_interval, max_interval) if rate > 0
+  case motion
+  when :running then min_interval
+  when :never_started then prev_delay ? (prev_delay * 2).clamp(min_interval, max_interval) : min_interval
+  else max_interval
+  end
+end
+```
+- **Draining (`rate > 0`):** poll at `ETA_FRACTION` (0.5) of the projected time-to-
+  drain. Because each poll re-estimates against the shrinking remainder, the cadence
+  converges geometrically and tightens to `min_interval` at the tail.
+- **No completion this interval → `motion`:**
+  - `:running` — a live worker is executing a child; it will finish, so **hold the
+    floor** (`min_interval`). This is the anti-regression case: backing off would
+    wake the parent late for a slow / single-child branch.
+  - `:never_started` — the only motion is a queued/rekicked-but-unpicked child that may
+    never be picked up → **exponential backoff** from the floor (double `prev_delay`,
+    capped at `max`). Catches a quick recovery in seconds; decays instead of spinning.
+  - `:none` — nothing can progress (blocked/failed or parked on a wait) →
+    `max_interval` backstop.
+Inputs, computed in `perform`:
+- **`pending`** is the **uncapped** incomplete count (`BranchProbe.incomplete(id)
+  .count`), served by the existing `[parent_execution_log_id, state]` index — one
+  branch-scoped count per poll (~7 for 20k, ~15 for 500k; a background cost). The
+  old `CAP` flattened this to a constant `5000`, which is why the ETA couldn't reuse
+  it. `CAP` and `FACTOR` are removed; `ETA_FRACTION` added.
+- **`rate`** = `(prev_pending − pending) / elapsed` when the branch drained since its
+  prior poll, else `0.0`. Measured per branch (`rate_by_branch`, for the dashboard)
+  and aggregated for the ETA. Aggregate `prev_pending` is only trusted when every
+  requested branch log is loaded *and* carries a prior sample
+  (`logs.size == branch_log_ids.size && prior.all?`), so a partial set can't yield a
+  bogus rate.
+- **`motion`** is computed **lazily** — only when `rate == 0`, keeping the EXISTS
+  probes off the hot drain path: `:running` if any `BranchProbe.running?`, else
+  `:never_started` if any branch has a positive never-started count, else `:none`.
+- **`prev_delay`** comes from the prior poll's persisted `interval`, driving the
+  exponential backoff.
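The input rules above can be sketched in plain Ruby. The helper names (`drain_rate`, `trusted_prev_pending`, `classify_motion`) and the lambda-based probes are hypothetical stand-ins, not the gem's methods; the real `perform` reads these values from branch logs and `BranchProbe`.

```ruby
# Rate is positive only when the pending count fell since the prior sample.
def drain_rate(prev_pending, pending, elapsed)
  return 0.0 unless prev_pending && elapsed.positive?
  drained = prev_pending - pending
  drained.positive? ? drained / elapsed.to_f : 0.0
end

# Aggregate prev_pending is trusted only when every requested branch has a
# prior sample; a missing log or missing sample yields nil (no rate).
def trusted_prev_pending(requested_ids, prior_pending_by_id)
  priors = requested_ids.map { |id| prior_pending_by_id[id] }
  priors.all? ? priors.sum : nil
end

# Motion is computed lazily: the probes are callables, so they are only
# invoked when the branch did not drain this interval.
def classify_motion(rate, running:, never_started:)
  return nil if rate > 0 # draining: skip the probes entirely
  return :running if running.call
  return :never_started if never_started.call > 0
  :none
end

probe_calls = 0
running_probe = -> { probe_calls += 1; false }
never_started_probe = -> { probe_calls += 1; 3 }

rate = drain_rate(trusted_prev_pending([1, 2], { 1 => 600, 2 => 400 }), 800, 10)
puts rate                                                  # => 20.0
puts classify_motion(rate, running: running_probe, never_started: never_started_probe).inspect # => nil
puts probe_calls                                           # => 0 (probes stayed off the drain path)
puts classify_motion(0.0, running: running_probe, never_started: never_started_probe) # => never_started
```

Passing the probes as callables is just one way to express the laziness; the point is that a draining poll pays for the count and nothing else.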
+## Rekick — `rekick_dropped_jobs(branch_log_ids, never_started_by_branch, prev_never_started_by_branch)`
+```ruby
+prev = prev_never_started_by_branch[id]
+next [id, 0] if prev && never_started_by_branch[id] < prev   # never-started count fell → workers consuming → in line
+# else: scan idle & started_at IS NULL & updated_at < REKICK_AFTER.ago, limit REKICK_BATCH,
+#       guarded perform_later, then child.touch on success (debounce), rescue per child.
+```
+- **Gate on the never-started count delta, not total pending.** A
+  `wait_until` child resuming drops total `pending` without any never-started child
+  being consumed, so a pending-delta gate would mistake that for "draining" and
+  defer recovery of a genuinely-dropped child behind staggered waits. The
+  `idle & started_at IS NULL` count falling is the real signal that workers are
+  pulling this branch's queue. `BranchProbe.dispatched`, a countable relation, was
+  added for this.
+- **Cold poll (no prior sample) doesn't gate** — it falls through to the per-child
+  staleness filter, which already spares freshly-dispatched children, so a dropped
+  child is still recovered on the first poll.
+- **`child.touch` on a successful rekick** bumps `updated_at`, so the child leaves
+  the staleness window for one `REKICK_AFTER` — redelivered at most once per window,
+  killing the re-rekick pile-up. Only on success; a rescued enqueue failure leaves it
+  stale to retry next poll.
+- Best-effort: a per-child rescue keeps one bad child from sinking the whole poll.
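The gate, staleness filter, debounce, and per-child rescue described above can be modelled without a database. This is a sketch under stated assumptions: `Child`, `rekick_stale`, the numeric timestamps, and the values of `REKICK_AFTER` and `REKICK_BATCH` are all invented here, and `enqueue` stands in for the guarded `perform_later`.

```ruby
REKICK_AFTER = 300 # hypothetical staleness window, in seconds
REKICK_BATCH = 100 # hypothetical per-poll limit

Child = Struct.new(:id, :state, :started_at, :updated_at)

# Returns the ids that were re-enqueued on this pass.
def rekick_stale(children, never_started:, prev_never_started:, now:, enqueue:)
  # Gate: the never-started count fell, so workers are consuming the queue.
  # A cold poll (no prior sample) does not gate.
  return [] if prev_never_started && never_started < prev_never_started

  stale = children.select do |c|
    c.state == :idle && c.started_at.nil? && c.updated_at < now - REKICK_AFTER
  end.first(REKICK_BATCH)

  stale.each_with_object([]) do |child, rekicked|
    begin
      enqueue.call(child)
      child.updated_at = now # debounce: leaves the window for one REKICK_AFTER
      rekicked << child.id
    rescue StandardError
      # Best-effort: the child stays stale and is retried on the next poll.
    end
  end
end

children = [
  Child.new(1, :idle, nil, 100),  # stale, never started: rekicked
  Child.new(2, :idle, nil, 900),  # freshly dispatched: spared
  Child.new(3, :running, 50, 50), # already started: ignored
  Child.new(4, :idle, nil, 200)   # stale, but its enqueue fails
]
flaky_enqueue = ->(child) { raise "queue down" if child.id == 4 }

first = rekick_stale(children, never_started: 3, prev_never_started: nil,
                     now: 1000, enqueue: flaky_enqueue)
puts first.inspect # => [1]
again = rekick_stale(children, never_started: 3, prev_never_started: 3,
                     now: 1000, enqueue: flaky_enqueue)
puts again.inspect # => [] (child 1 is debounced, child 4 fails again)
```

The second call shows the debounce: child 1 was touched on success, so only the failed child 4 is still inside the staleness window.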
+## Persisted poll state — `record_poll!`
+Each pass stamps the branch log's `metadata["poll"]` (under `with_lock` + a token
+recheck, leaving `spawn_each`'s cursors untouched):
+`last_polled_at`, `next_poll_at`, `interval`, `pending`, `dispatched`, `sealed`,
+`rate` (children/s, `round(3)` so a very slow but real drain still reads > 0),
+`eta_seconds`, `polls`, `rekicked`, `rekick_total`, `last_rekick_at`.
+`rate`/`eta_seconds` are **free** — already computed for the cadence, no extra query
+— which is what lets the dashboard show live throughput without the aggregate scans
+the scale-aware design avoids.
+## Configurable poller queue
+`lib/chrono_forge/configuration.rb` (the engine's first config object):
+```ruby
+ChronoForge.configure { |c| c.branch_merge_queue = :chrono_forge_pollers } # default: :default
+```
+`BranchMergeJob` reads it via `queue_as { ChronoForge.config.branch_merge_queue }`
+— resolved **per-enqueue**, so a change takes effect without redefining the job and
+can't be silently reset by a code reload (the fragility the live drive exposed). Keep
+the poller off a queue saturated by a fan-out's own children.
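Why a block-based `queue_as` survives configuration changes can be shown with stand-ins. This is a plain-Ruby sketch: `DemoForge` and `DemoJob` are hypothetical and only mimic the `config` / `configure` / `reset_configuration!` surface and the block form of `queue_as`; in the gem the job is an Active Job class.

```ruby
module DemoForge
  class Configuration
    attr_accessor :branch_merge_queue

    def initialize
      @branch_merge_queue = :default
    end
  end

  class << self
    def config
      @config ||= Configuration.new
    end

    def configure
      yield config
    end

    def reset_configuration!
      @config = Configuration.new
    end
  end
end

# Stand-in for a job class: the block is stored and evaluated on each lookup,
# so the queue name is never frozen at class-definition time.
class DemoJob
  def self.queue_as(&block)
    @queue_resolver = block
  end

  def self.queue_name
    @queue_resolver.call
  end

  queue_as { DemoForge.config.branch_merge_queue }
end

puts DemoJob.queue_name # => default
DemoForge.configure { |c| c.branch_merge_queue = :chrono_forge_pollers }
puts DemoJob.queue_name # => chrono_forge_pollers
DemoForge.reset_configuration!
puts DemoJob.queue_name # => default
```

Had the job captured the symbol at definition time, the `configure` call after the class body would have had no effect, which is the fragility the text describes.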
+## Dashboard (`chrono_forge-dashboard`)
+- **Live throughput / ETA on in-flight merges.** `BranchesPresenter::Merge` gains
+  `:rate` / `:eta_seconds` and `throughput? = merging? && rate.to_f > 0`; the merges
+  list renders `<rate>/s` and `ETA <cf_secs>`, both guarded. **Multi-branch merges
+  aggregate** — `merge_throughput` sums per-branch `rate` and recomputes the combined
+  ETA (`Σpending / Σrate`), rather than showing one branch's figure.
+- **Auto-refresh on every page.** The poll region is marked once on the layout's
+  `<main>` (the nav + refresh/time controls sit in `<header>`, outside the swap), so
+  every page — workflow list *and detail*, analytics, waiting, repetitions —
+  refreshes in place, preserving filter text, focus, and scroll. It was previously
+  per-page opt-in, which had silently left the detail page (where the gauge lives)
+  and several others un-refreshing.
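The multi-branch aggregation is simple arithmetic over the persisted poll fields. A minimal sketch, assuming each branch's poll state is a hash with `"rate"` and `"pending"` keys; this `merge_throughput` is an illustration of the `Σpending / Σrate` rule, not the presenter's code, and the return shape is invented.

```ruby
# Sum per-branch rates and recompute one combined ETA from the totals.
def merge_throughput(branch_polls)
  total_rate = branch_polls.sum { |poll| poll["rate"].to_f }
  total_pending = branch_polls.sum { |poll| poll["pending"].to_i }
  return { rate: 0.0, eta_seconds: nil } unless total_rate > 0

  { rate: total_rate, eta_seconds: (total_pending / total_rate).round }
end

polls = [
  { "rate" => 150.0, "pending" => 9_000 }, # alone: ETA 60s
  { "rate" => 50.0,  "pending" => 1_000 }  # alone: ETA 20s
]
puts merge_throughput(polls).inspect # => {:rate=>200.0, :eta_seconds=>50} on Ruby < 3.4

# Three decimals keep a very slow but real drain above zero.
puts 0.004.round(3) # => 0.004
puts 0.004.round(1) # => 0.0
```

The combined figure (50s) matches neither branch's own ETA, which is why showing a single branch's number would mislead for a multi-branch merge.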
+## Files changed
+**`chrono_forge`**
+- `lib/chrono_forge.rb` — `config` / `configure` / `reset_configuration!`.
+- `lib/chrono_forge/configuration.rb` — new; `branch_merge_queue`.
+- `lib/chrono_forge/branch_merge_job.rb` — `queue_as` from config; ETA + motion
+  cadence; dispatched-delta rekick + `touch`; uncapped `pending`; `record_poll!`
+  fields; `superseded?(logs, …)`; removed `CAP`/`FACTOR`, added `ETA_FRACTION`.
+- `lib/chrono_forge/branch_probe.rb` — `running?`, `dispatched`/`dispatched?`;
+  `incomplete` used uncapped (`progressing` retained, unused by the poller).
+- `lib/chrono_forge/executor/methods/merge_branches.rb` — stale cadence comment.
+- `test/branch_merge_job_test.rb`, `test/branch_probe_test.rb` — cadence (motion),
+  dispatched-delta rekick incl. the waits-drain-pending regression, debounce,
+  throughput persistence, configurable queue.
+- `docs/fanout-scale-test.md`, `README.md` — cadence, rekick, queue config, the
+  poller-queue-placement trap.
+**`chrono_forge-dashboard`**
+- `branches_presenter.rb` — `Merge` rate/eta + `throughput?` + `merge_throughput`
+  aggregate.
+- `_branches.html.erb` — throughput/ETA spans (sub-1/s shown to one decimal).
+- `layouts/.../application.html.erb` — `data-poll-region` on `<main>`; removed the
+  per-page markers.
+- `test/branches_test.rb`, `README.md` — aggregation test; docs.
+## Validation
+Full engine suite **263** tests green; dashboard **106**. Live-driven on
+Solid Queue + Postgres (poller on a dedicated `:scale_poller` queue):
+- **20k / 100k** — parents converged; the dashboard showed live `~226/s` + ETA that
+  ramped down and vanished on completion.
+- **500k** — **500,000 / 500,000** children completed, parent converged, **11**
+  poller passes (ETA cadence held throughout, not a single starved poll), and **200**
+  children rekicked + recovered after a mid-drain worker restart — the
+  rekick/debounce path exercised at scale.
+## Review findings resolved
+- **#1** — rekick gate moved from total-pending delta to the never-started count
+  delta (dropped-child recovery no longer deferred behind resuming waits).
+- **#2** — multi-branch merges aggregate rate/ETA in the presenter.
+- **#3** — `rate` stored `round(3)` so a sub-1/s drain still renders throughput/ETA.