chrono_forge 0.10.0 → 0.11.0

@@ -0,0 +1,213 @@
1
+ # Per-child commit overhead on large branch fan-out
2
+
3
+ Status: **implemented and benchmarked**, targeted for **v0.12** (baseline is the
4
+ released v0.10). All the consolidations under "Implemented" below have shipped to
5
+ the working tree and are validated by matched baseline-vs-consolidation benchmarks
6
+ at 20k/50k/100k — see
7
+ [`fanout-scale-test.md`](../fanout-scale-test.md): **~−50% per-child execution
8
+ time, +20–30% throughput**, flat across scale. The risky `save`+`release` merge was
9
+ deliberately left out. Scope: `chrono_forge` core engine. Independent of the dashboard.
10
+
11
+ ## Problem
12
+
13
+ Branch fan-out throughput tops out at roughly a couple thousand DB commits/sec
14
+ on a single Postgres writer. Each child workflow, even a trivial one that runs a
15
+ single `durably_execute` to completion in one `perform`, produces a string of
16
+ independent transactions. Because every transaction is its own `fsync`, the
17
+ workload is commit-bound (fsync-bound), and the number of children you can drain
18
+ per second is `~(fsync budget) / (commits per child)`.
19
+
20
+ At ~8 primary-DB commits per child plus Solid Queue overhead, ~10 workers saturate
21
+ at roughly 2,000 fsyncs/s, and that is the ceiling. Lowering commits-per-child raises the
22
+ ceiling proportionally.
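
As a rough illustration of the ceiling formula above, the following Ruby sketch plugs in the figures quoted in this section. The method name `children_per_second` and the constant `FSYNC_BUDGET` are hypothetical, and the 7-commit case is only a what-if under the stated model, not a measurement.

```ruby
# Back-of-envelope model of a commit-bound fan-out (illustrative only):
#   children/s ≈ fsync budget / commits per child
def children_per_second(fsync_budget:, commits_per_child:)
  fsync_budget.to_f / commits_per_child
end

FSYNC_BUDGET = 2_000 # fsyncs/s, the approximate ceiling quoted above (assumed constant)

# ~8 primary-DB commits + ~2 queue commits per child
baseline = children_per_second(fsync_budget: FSYNC_BUDGET, commits_per_child: 10)

# What-if: three commits saved per child
trimmed = children_per_second(fsync_budget: FSYNC_BUDGET, commits_per_child: 7)

puts format("baseline ≈ %.0f children/s", baseline)
puts format("trimmed  ≈ %.0f children/s", trimmed)
```

The model ignores everything except commit count, so it should be read as an upper bound on what commit-shaving alone could buy.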
23
+
24
+ ## The actual per-child commit sequence
25
+
26
+ Traced for a trivial child: one `durably_execute`, runs to completion in a single
27
+ `perform`. Each row is one transaction = one commit = one fsync.
28
+
29
+ | # | Where | Write | Source |
30
+ |---|-------|-------|--------|
31
+ | 1 | `setup_workflow!` | `update_column(:started_at)` — **branch children only** (parent pre-inserts the row, so the create block never stamps it) | `executor.rb:221` |
32
+ | 2 | `LockStrategy.acquire_lock` | `transaction { lock!; update_columns(locked_by, locked_at, state: :running) }` | `executor/lock_strategy.rb:10-33` |
33
+ | 3 | `durably_execute` | INSERT execution log (`find_or_create_execution_log!`) | `executor/methods/durably_execute.rb:70` |
34
+ | 4 | `durably_execute` | `update!(attempts, last_executed_at)` — *before* the step | `executor/methods/durably_execute.rb:80` |
35
+ | 5 | `durably_execute` | `update!(state: :completed)` — *after* the step | `executor/methods/durably_execute.rb:89` |
36
+ | 6 | `complete_workflow!` | **three separate commits**: completion-marker INSERT + `workflow.completed!` UPDATE + log `update!` | `executor/methods/workflow_states.rb:50-70` |
37
+ | 7 | `ensure` → `context.save!` | `update_column(:context)` | `executor/context.rb:70`, called at `executor.rb:187` |
38
+ | 8 | `release_lock` | `update_columns(locked_at: nil, locked_by: nil, …)` | `executor/lock_strategy.rb:53` |
39
+ | + | Solid Queue | claim + finish/delete (2 more, in the **queue** DB) | backend |
40
+
41
+ So the table lists 8 primary-DB write sites + ~2 queue commits per child. Empirically
42
+ (SQLite probe of a no-step child), step 6 is **three** independent commits, not one:
43
+ the completion-marker INSERT, the workflow→`:completed` UPDATE, and the marker UPDATE
44
+ each commit separately, which puts the real primary-DB total nearer 10. That made it the single biggest safe win.
45
+
46
+ ## What's safely reducible (≈⅓, realistically ~8 → ~5)
47
+
48
+ > Note: this section is the original sketch. It shipped *differently* — see
49
+ > "Implemented" below. The real target turned out to be `complete_workflow!`'s own
50
+ > three commits (#6), which collapse to one safely; #7+#8 (`context.save!` +
51
+ > `release_lock`) were deliberately left split. Net effect is the same ~⅓.
52
+
53
+ These have no external side effect between them, so collapsing them into one
54
+ transaction cannot cause double-execution of anything observable:
55
+
56
+ - **Collapse the end-of-run trio (#6, #7, #8).** Completion + context save + lock
57
+ release happen back-to-back with nothing external in between. One transaction
58
+ instead of three. Caveat: today `context.save!` and `release_lock` are
59
+ deliberately ordered in an `ensure` so the lock is released (and the
60
+ continuation published) *even if the save raises* — see `executor.rb:178-196`.
61
+ Any consolidation must preserve that "always release the lock, never strand the
62
+ workflow" guarantee. Merging the happy path while keeping the failure path's
63
+ release semantics is the careful part.
64
+ - **Fold #1 into #2.** Stamp `started_at` inside the `acquire_lock` transaction
65
+ instead of as its own `update_column`. Saves one commit on branch children.
66
+
67
+ Net: ~8 → ~5 primary commits ⇒ a proportional throughput bump on a commit-bound
68
+ workload. Real and worth doing, but it is a ~40% cut in commits, **not** an order of magnitude.
69
+
70
+ ## What is NOT reducible (it's the point of the engine)
71
+
72
+ - **The per-step INSERT + "completed" UPDATE (#3, #5) must commit independently
73
+ of other steps.** That committed "completed" marker is exactly what guarantees a
74
+ side-effecting step (charge a card, send a webhook) runs once and is never
75
+ replayed. You cannot batch step-completion commits across steps without risking
76
+ double-execution of external effects. It looks wasteful for a no-op child, but
77
+ the engine cannot assume a step is side-effect-free.
78
+ - **Lock acquire (#2) must commit before doing work** so other workers see it.
79
+ - **Solid Queue's claim/finish** is backend overhead, outside ChronoForge.
80
+
81
+ ## The real lever for massive fan-out is architectural, not commit-tuning
82
+
83
+ To run millions of trivial children you change the shape, not shave commits:
84
+
85
+ - **Bulk / lightweight child mode** for large sets of trivial, idempotent items:
86
+ process N items as one durable unit, or a leaner child path that skips the full
87
+ lock + completion ceremony when a child has no deferral points. Cuts per-item
88
+ commits dramatically, but **trades away per-item isolation and observability** —
89
+ a feature with real trade-offs, not a tweak.
90
+ - **Opt-in "transactional workflow" mode**: whole `perform` in one transaction → 1
91
+ commit. Only safe for genuinely side-effect-free workflows.
92
+ - **Scale the write tier out** (sharded Postgres) if you need raw commit headroom.
93
+
94
+ ## Implemented
95
+
96
+ Two consolidations shipped, both behaviour-preserving and covered by tests:
97
+
98
+ 1. **`complete_workflow!` → one transaction** (`workflow_states.rb`). The marker
99
+ INSERT, the workflow→`:completed` UPDATE, and the marker UPDATE now share a
100
+ single `ActiveRecord::Base.transaction`. 3 commits → 1. The failure path
101
+ re-finds/recreates the marker and records `:failed` outside the rolled-back
102
+ transaction, so completion observability is preserved and a resume simply
103
+ retries completion. Test: `WriteConsolidationTest#test_completion_writes_share_a_single_transaction`.
104
+ 2. **`started_at` folded into `acquire_lock`** (`lock_strategy.rb`). The branch-child
105
+ first-execution `started_at` stamp now rides along the existing lock UPDATE
106
+ (`update_columns`) instead of a standalone `update_column` in `setup_workflow!`.
107
+ 1 commit → 0 (absorbed). The poller's "nil started_at = dropped child" contract
108
+ is preserved — it's still stamped on first pickup, just in the same write.
109
+ Tests: `LockStrategyTest#test_acquire_lock_stamps_started_at_*`.
110
+
111
+ Net for a trivial branch child: roughly 3 fewer commits/fsyncs per child (the
112
+ note's safe ~⅓), with no behaviour change.
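
To make the "3 commits → 1" arithmetic concrete, here is a minimal sketch that uses a toy commit counter instead of a real database. `CommitCounter` and `COMPLETION_WRITES` are hypothetical names invented for this illustration; the class only mimics autocommit versus an explicit transaction and is not how `workflow_states.rb` is written.

```ruby
# Toy stand-in for a database connection that only counts commits (illustration only).
class CommitCounter
  attr_reader :commits, :statements

  def initialize
    @commits = 0
    @statements = 0
    @depth = 0
  end

  # Outside a transaction every statement autocommits, i.e. costs one fsync.
  def execute(_sql)
    @statements += 1
    @commits += 1 if @depth.zero?
  end

  # Inside a transaction the statements share a single commit at the end.
  def transaction
    @depth += 1
    begin
      yield
    rescue StandardError
      @depth -= 1
      raise # rolled back, so no commit is counted
    end
    @depth -= 1
    @commits += 1 if @depth.zero?
  end
end

COMPLETION_WRITES = [
  "INSERT completion marker",
  "UPDATE workflow SET state = completed",
  "UPDATE completion marker"
].freeze

# Old shape: three independent autocommitted writes.
before = CommitCounter.new
COMPLETION_WRITES.each { |sql| before.execute(sql) }

# New shape: the same three writes inside one transaction.
after = CommitCounter.new
after.transaction { COMPLETION_WRITES.each { |sql| after.execute(sql) } }

puts "before: #{before.statements} statements, #{before.commits} commits"
puts "after:  #{after.statements} statements, #{after.commits} commits"
```

The statement count is unchanged in both shapes; only the number of commits drops, which is why this consolidation targets fsyncs rather than round-trips.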
113
+
114
+ ### Statement flattening (a different axis: round-trips/CPU, not fsyncs)
115
+
116
+ Within the now-single transactions, two INSERT-then-UPDATE pairs were collapsed to
117
+ a single INSERT. This does not cut commits further — it cuts statements (DB
118
+ round-trips, parse/plan CPU, a little WAL), so it pays most on round-trip- or
119
+ CPU-bound profiles, less on a pure single-writer fsync-bound one.
120
+
121
+ 3. **Completion marker born completed** (`workflow_states.rb`). The marker is
122
+ INSERTed already in `:completed` state (attempts: 1, all timestamps set) instead
123
+ of INSERTed `:started` then UPDATEd. The rare resume-after-failed-completion /
124
+ create-race path still flips an existing marker via UPDATE. Completion is now
125
+ 2 statements (marker INSERT + workflow UPDATE), down from 3. Test:
126
+ `WriteConsolidationTest#test_completion_marker_is_born_completed_in_a_single_insert`.
127
+ 4. **`durably_execute` first attempt recorded in the INSERT** (`durably_execute.rb`).
128
+ A fresh step log is created with `attempts: 1, last_executed_at` baked in, so the
129
+ pre-execution attempt-bump UPDATE is skipped on first run (detected via
130
+ `previously_new_record?`). Retries (existing log) still bump via UPDATE — the
131
+ committed `:completed` write after the side effect is untouched (the once-only
132
+ boundary). Tests: `WriteConsolidationTest#test_first_step_run_records_attempt_in_the_insert`
133
+ and `#test_retry_run_bumps_attempt_with_an_update`.
134
+
135
+ The same two shapes recur across the engine, so the flattening was applied
136
+ everywhere the pattern is safe (a log written twice in one pass, no deferral
137
+ between the writes):
138
+
139
+ 5. **`fail_workflow!` born-completed + one transaction** (`workflow_states.rb`).
140
+ Identical to `complete_workflow!`: the workflow→`:failed` transition and the
141
+ (born-completed) failure marker are batched in one transaction, marker written
142
+ terminal in a single INSERT. Test:
143
+ `WriteConsolidationTest#test_failure_marker_is_born_completed_in_a_single_insert`.
144
+ 6. **`continue_if` first evaluation in the INSERT** (`continue_if.rb`). Same as
145
+ `durably_execute` — a fresh gate bakes `attempts: 1`; only re-evaluations after a
146
+ not-met halt bump via UPDATE. Test:
147
+ `WriteConsolidationTest#test_continue_if_first_run_records_attempt_in_the_insert`.
148
+ 7. **`durably_repeat` coordination + repetition logs** (`durably_repeat.rb`). Both
149
+ bake the first attempt into their INSERT; later passes/resumes bump via UPDATE.
150
+ Test: `WriteConsolidationTest#test_durably_repeat_first_repetition_records_attempt_in_the_insert`.
151
+
152
+ The born-completed sites (#3, #5) share one helper, `create_completed_execution_log!`
153
+ (`executor.rb`): a single race/replay-safe INSERT in `:completed` state, falling
154
+ back to an UPDATE only when the row already exists.
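
The "born completed, UPDATE only if the row already exists" shape can be sketched with an in-memory store. `LogStore` and its methods are hypothetical and exist only for this illustration; the real helper works against the execution-log table and its details are not reproduced here.

```ruby
# Toy in-memory stand-in for the execution-log table (illustration only).
class LogStore
  attr_reader :statements

  def initialize
    @rows = {}       # key => { state:, attempts: }
    @statements = 0  # counts INSERT/UPDATE statements issued
  end

  # Old shape: INSERT as :started, then UPDATE to :completed (2 statements).
  def insert_then_update(key)
    @statements += 1
    @rows[key] = { state: :started, attempts: 0 }
    @statements += 1
    @rows[key] = { state: :completed, attempts: 1 }
  end

  # Flattened shape: the row is born :completed in one INSERT. An existing row
  # (resume or create-race) is flipped with a single UPDATE instead.
  def create_completed(key)
    @statements += 1
    if @rows.key?(key)
      @rows[key][:state] = :completed
    else
      @rows[key] = { state: :completed, attempts: 1 }
    end
  end

  def state_of(key)
    @rows.fetch(key)[:state]
  end
end

old_shape = LogStore.new
old_shape.insert_then_update("wf-1/complete")

fresh = LogStore.new
fresh.create_completed("wf-1/complete") # first run: one INSERT
first_run_statements = fresh.statements
fresh.create_completed("wf-1/complete") # replay: one UPDATE on the existing row

puts "old shape:  #{old_shape.statements} statements"
puts "born-completed first run: #{first_run_statements} statement"
```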
155
+
156
+ Not flattened: **`merge_branches`** and the **`durably_repeat` fast-forward summary**
157
+ create their log in one pass and complete/finalize it in a *later* pass (across a
158
+ halt), or merge metadata into a reused row — the two writes are inherently separate,
159
+ so there's nothing to collapse. **Waits** (`wait`, `wait_until`) and **branch
160
+ coordination** are the same story: they must persist `:started` before halting.
161
+
162
+ `acquire_lock`'s SELECT…FOR UPDATE + UPDATE was left as a SELECT + UPDATE: a
163
+ single conditional UPDATE would flip pessimistic→optimistic locking and complicate
164
+ `ensure_executable!` / the contention-error message for one statement — not worth it.
165
+
166
+ Full suite (236 tests) green; behaviour unchanged.
167
+
168
+ ### Deliberately NOT done: merging `context.save!` + `release_lock`
169
+
170
+ Tempting (it's the remaining adjacent pair), but the split is load-bearing. The
171
+ `ensure` block (`executor.rb:178-196`) guarantees the lock is released — and the
172
+ continuation published — *even if `context.save!` raises*, and it publishes the
173
+ continuation only after a successful release so a zero-delay same-key continuation
174
+ can't lose the acquire race. Wrapping save+release in one transaction would roll
175
+ the release back with a failed save and strand the workflow holding its lock.
176
+ A correct merge needs a fallback (standalone release on rollback) plus careful
177
+ handling of the lost-lock case, and it also changes whether context persists when
178
+ this job has lost the lock. That subtlety isn't worth one commit on a safety-
179
+ critical path — left as-is intentionally.
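
The reason the split is load-bearing can be shown with a small sketch. `ToyContext`, `ToyLock`, `finish_split`, and `finish_merged` are hypothetical stand-ins, and the merged variant models a shared transaction very crudely (a failed save means the release never takes effect); it is an assumption-laden illustration, not the code in `executor.rb`.

```ruby
# Hypothetical stand-ins for the workflow context and its lock (illustration only).
class ToyContext
  def initialize(fail_on_save: false)
    @fail_on_save = fail_on_save
  end

  def save!
    raise IOError, "context save failed" if @fail_on_save
    :saved
  end
end

class ToyLock
  def initialize
    @held = true
  end

  def release
    @held = false
  end

  def held?
    @held
  end
end

# Split shape: the release runs in `ensure`, so it happens even when save! raises.
def finish_split(context, lock)
  context.save!
ensure
  lock.release
end

# Merged shape (crude model of one shared transaction): save and release succeed
# or fail together, so a failed save leaves the lock held.
def finish_merged(context, lock)
  context.save!
  lock.release
end

split_lock = ToyLock.new
begin
  finish_split(ToyContext.new(fail_on_save: true), split_lock)
rescue IOError
  # the save error still propagates to the caller
end

merged_lock = ToyLock.new
begin
  finish_merged(ToyContext.new(fail_on_save: true), merged_lock)
rescue IOError
  # same error, but the lock was never released
end

puts "split:  lock still held after failed save? #{split_lock.held?}"
puts "merged: lock still held after failed save? #{merged_lock.held?}"
```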
180
+
181
+ ## Benchmark outcome
182
+
183
+ Measured against matched baseline pairs at 20k/50k/100k (single Postgres, 5×4
184
+ workers; full data in [`fanout-scale-test.md`](../fanout-scale-test.md)):
185
+
186
+ - **Per-child execution time roughly halved** — p50 ~35 ms → ~19 ms (−46%), avg
187
+ −49%, **flat from 20k to 100k**. This is the direct fsync-per-commit saving and it
188
+ is robust and repeatable.
189
+ - **Aggregate throughput +20–30%, not 2×.** Once the child's own commits are
190
+ cheaper, the **Solid Queue claim/finish cycle and the shared single-Postgres
191
+ fsync budget** dominate per-slot time — halving one of several serial commits
192
+ can't double the whole pipeline. The win also *compresses* with N (+30% at 20k →
193
+ +22% at 100k) as that shared ceiling takes a larger share.
194
+ - **Dispatch got faster too** (100k fan-out 64 s → 53 s): lighter child commits free
195
+ fsync headroom for the parent's `spawn_each` inserts.
196
+
197
+ So the measure-first question is answered: the workload **is** commit/fsync-bound,
198
+ and the consolidation is a clear, behaviour-preserving win — but the residual
199
+ ceiling is now the queue backend and the single-DB fsync budget, not engine work.
200
+
201
+ ## Where the remaining headroom is
202
+
203
+ 1. **Split the queue onto its own database/disk.** Solid Queue's claim/finish
204
+ fsyncs currently compete with engine-commit fsyncs for one budget; separating
205
+ them lets the two streams run in parallel. Likely a bigger aggregate lever than
206
+ any further engine shaving.
207
+ 2. **Bulk / lightweight child mode** for large sets of trivial, idempotent items —
208
+ the path to order-of-magnitude gains, with its isolation/observability
209
+ trade-offs made explicit and opt-in (see above).
210
+ 3. **Scale the write tier out** (sharded Postgres) for raw commit headroom.
211
+
212
+ Do not expect a 10× from further commit consolidation — the engine is no longer the
213
+ wall. That comes from (1)/(2) or scaling the write tier.
@@ -0,0 +1,247 @@
1
+ # ChronoForge fan-out scale test
2
+
3
+ _Run 2026-06-28 on a local Mac (11 cores)._
4
+
5
+ Two things were measured here:
6
+
7
+ 1. **Correctness at scale** — does a fan-out converge, lose nothing, and stay in
8
+ constant memory up to **500,000 children**?
9
+ 2. **Throughput** — what is the steady-state ceiling, where is it, and how much
10
+ does an experimental **commit-consolidation** change move it (matched
11
+ baseline-vs-consolidation pairs at 20k / 50k / 100k)?
12
+
13
+ ## What was tested
14
+
15
+ A parent workflow that fans out **N child workflows** via `branch` + `spawn_each`
16
+ (a `Range` source → constant-memory batched `insert_all`, joined inline with
17
+ `automerge`). Each child is one trivial `durably_execute`.
18
+
19
+ ```ruby
20
+ class ScaleFanout < ActiveJob::Base
21
+   prepend ChronoForge::Executor
22
+   def perform(count:)
23
+     branch :fanout, automerge: true do
24
+       spawn_each(:child, 1..count) { |i| [ScaleChild, {n: i}] }
25
+     end
26
+   end
27
+ end
28
+ ```
29
+
30
+ All runs were on **Solid Queue + Postgres 13 (Docker)** against a dev DB. Workers
31
+ ran on an **isolated `scale` queue** (`BranchMergeJob`
32
+ re-routed to the same queue so it never competed with real jobs).
33
+
34
+ | Parameter | Value |
35
+ |---|---|
36
+ | Job backend | Solid Queue, isolated `scale` queue |
37
+ | Database | Postgres 13 (Docker), `max_connections` 100, `shared_buffers` 128 MB |
38
+ | Worker concurrency | 5 processes × 4 threads (= 20 slots) |
39
+ | DB connection pool | `DB_POOL=10` per process |
40
+ | Worker polling interval | 0.1 s |
41
+ | Dispatcher | default Solid Queue dispatcher |
42
+ | Child workload | one trivial `durably_execute` (no real I/O) |
43
+
44
+ ---
45
+
46
+ ## Part 1 — Correctness (500k)
47
+
48
+ | N | Wall clock | Throughput | Completed | Outcome |
49
+ |---|-----------|------------|-----------|---------|
50
+ | 20,000 | 88.3s | 226/s | 20,000 / 20,000 | ✓ merged |
51
+ | 50,000 | 215.5s | 232/s | 50,000 / 50,000 | ✓ merged |
52
+ | 100,000 | 443.2s | 226/s | 100,000 / 100,000 | ✓ merged |
53
+ | **500,000** | **2,497s (~41.6 min)** | **200/s** | **500,000 / 500,000** | **✓ merged** |
54
+
55
+ **Flawless at every scale.** Every child completed and every parent converged to
56
+ `completed`. Streaming dispatch held constant memory; `BranchMergeJob` polled to
57
+ convergence; the parent's replay correctly **skipped the sealed branch** (no
58
+ re-dispatch). Zero failures, zero lost children, no memory wall.
59
+
60
+ Throughput held **rock-steady at ~200–230/s from 20k through 500k** — no
61
+ degradation as the tables grew. The path has a **stable steady-state ceiling**.
62
+
63
+ ---
64
+
65
+ ## Part 2 — Benchmark: baseline vs commit-consolidation
66
+
67
+ **Baseline** is the current released engine (v0.10). **Consolidation** is the
68
+ unreleased patch targeted for **v0.12** that cuts per-child write cost without
69
+ changing behaviour. It (a) folds `started_at` into the lock-acquire
70
+ transaction, (b) collapses `complete_workflow!`'s three separate commits — marker
71
+ INSERT + `workflow.completed!` UPDATE + marker UPDATE — into one transaction, and
72
+ (c) flattens the INSERT-then-UPDATE pairs in the step / completion / failure /
73
+ `continue_if` / `durably_repeat` paths into single INSERTs (the row is born in its
74
+ terminal/attempted state). It deliberately does **not** merge `context.save!` +
75
+ `release_lock` — that split is load-bearing (the lock must release even if the
76
+ save raises). See `docs/design/per-child-commit-overhead.md` for the full set.
77
+
78
+ Matched pairs, isolated worker (5 procs × 4 threads), `DB_POOL=10`, single
79
+ Postgres, ~170k-row backdrop held constant across all six runs. Throughput is
80
+ measured over the child window: `N / (max(completed_at) − min(created_at))`.
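
For readers who want to reproduce the metrics from raw timestamps, the sketch below computes the child-window throughput and a nearest-rank percentile of per-child execution time. The helper names and the four-row sample are hypothetical; the benchmark's actual queries are not shown in this document.

```ruby
# Each child is a Hash with :created_at, :started_at and :completed_at (Time objects).

# Throughput over the child window: N / (max(completed_at) - min(created_at)).
def exec_throughput(children)
  window = children.map { |c| c[:completed_at] }.max - children.map { |c| c[:created_at] }.min
  children.size / window
end

# Nearest-rank percentile of a list of numbers.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Tiny synthetic sample: children created 1 s apart, picked up 1 s later,
# taking 0.1 s, 0.2 s, 0.3 s and 0.4 s to run.
t0 = Time.at(0)
children = (0...4).map do |i|
  started = t0 + i + 1
  { created_at: t0 + i, started_at: started, completed_at: started + 0.1 * (i + 1) }
end

exec_times = children.map { |c| c[:completed_at] - c[:started_at] }

puts format("throughput: %.3f children/s", exec_throughput(children))
puts format("p50 exec time: %.3f s", percentile(exec_times, 50))
```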
81
+
82
+ ### Throughput
83
+
84
+ | Run | N | Fan-out (dispatch) | Dispatch rate | **Exec throughput** |
85
+ |------|------:|------:|------:|------:|
86
+ | baseline 20k | 20,000 | 9.6s | 2,093/s | **226/s** |
87
+ | cons 20k | 20,000 | 7.8s | 2,561/s | **293/s** (+30%) |
88
+ | baseline 50k | 50,000 | 30.4s | 1,647/s | **232/s** |
89
+ | cons 50k | 50,000 | 17.2s | 2,905/s | **279/s** (+20%) |
90
+ | baseline 100k | 100,000 | 63.7s | 1,570/s | **226/s** |
91
+ | cons 100k | 100,000 | 53.0s | 1,886/s | **275/s** (+22%) |
92
+
93
+ ### Per-child execution time — `completed_at − started_at` (seconds)
94
+
95
+ | Run | avg | p50 | p95 | p99 | max |
96
+ |------|----:|----:|----:|----:|----:|
97
+ | baseline 20k | 0.042 | 0.035 | 0.074 | 0.137 | 0.544 |
98
+ | cons 20k | 0.020 | 0.018 | 0.030 | 0.070 | 0.560 |
99
+ | baseline 50k | 0.041 | 0.035 | 0.065 | 0.138 | 0.933 |
100
+ | cons 50k | 0.022 | 0.019 | 0.035 | 0.074 | 0.812 |
101
+ | baseline 100k | 0.042 | 0.036 | 0.072 | 0.148 | 0.911 |
102
+ | cons 100k | 0.022 | 0.019 | 0.038 | 0.079 | 0.578 |
103
+
104
+ ### Fan-out (spawn) time — child `created_at` span
105
+
106
+ How long the parent's `spawn_each` took to enqueue the whole set
107
+ (`max(created_at) − min(created_at)` over the children):
108
+
109
+ | Children | baseline | consolidation |
110
+ |---:|---:|---:|
111
+ | 20k | 9.6s (2,093/s) | 7.8s (2,561/s) |
112
+ | 50k | 30.4s (1,647/s) | 17.2s (2,905/s) |
113
+ | 100k | 63.7s (1,570/s) | 53.0s (1,886/s) |
114
+
115
+ Spawning 100k children takes ~64s on baseline, ~53s consolidated — roughly
116
+ **0.5–0.6 ms of parent time per child**. The spawn rate **degrades as N grows**
117
+ on baseline (2,093 → 1,570/s) because the parent inserts child rows into the same
118
+ Postgres that's simultaneously draining the early children. Consolidation lightens
119
+ the children's commit load, freeing fsync budget for the parent's inserts, so its
120
+ spawn rate holds up far better.
121
+
122
+ ---
123
+
124
+ ## Throughput analysis — a *flat* ceiling, set by fsync
125
+
126
+ The ceiling is **single-Postgres commit/`fsync` throughput**, not worker count:
127
+
128
+ - Throughput is **flat (~200–230/s) from 20k to 500k** — it neither degrades as
129
+ tables grow nor rises with more work in flight against 20 worker slots.
130
+ - Halving per-child execution time (consolidation, below) lifts aggregate
131
+ throughput only **~25%** — so the bottleneck is shared write infrastructure, not
132
+ per-child work or worker count.
133
+ - **Not** connection-bound (51 / 100 connections used), **not** memory-bound.
134
+
135
+ Each trivial baseline child costs ~**10 primary-DB commits + ~2 Solid Queue
136
+ commits** (the `complete_workflow!` entry below is itself three separate commits):
137
+
138
+ ```
139
+ setup started_at · acquire_lock · step INSERT · step attempt-update ·
140
+ step completed-update · complete_workflow! (×3) · context.save! · release_lock (+ SQ claim/finish)
141
+ ```
142
+
143
+ ≈ 200 children/s × ~10 commits ≈ **~2,000 fsyncs/s** — the wall.
144
+
145
+ **Commit consolidation confirms the diagnosis.** Folding `started_at` into the
146
+ lock-acquire txn, collapsing `complete_workflow!`'s three commits into one, and
147
+ baking each log's first write into its INSERT **halves per-child execution time**
148
+ (p50 ~35 ms → ~19 ms, −46%; avg −49%) and is **flat across 20k→100k**. Yet
149
+ aggregate throughput rises only **+20–30%**, not 2×: once the
150
+ child's own commits are cheaper, the **Solid Queue claim/finish cycle** and the
151
+ shared single-Postgres fsync budget dominate the per-slot time. Halving one of
152
+ several serial commits can't double the whole pipeline.
153
+
154
+ ## Queue wait is backlog math, not a latency regression
155
+
156
+ Per-child **queue wait** (`started_at − created_at`, in seconds) scales ~linearly with N:
157
+
158
+ | Run | avg | p50 | p95 | max |
159
+ |------|----:|----:|----:|----:|
160
+ | baseline 20k | 40.6 | 42.9 | 75.3 | 78.7 |
161
+ | cons 20k | 30.3 | 29.3 | 57.4 | 60.4 |
162
+ | baseline 50k | 93.7 | 95.6 | 173.0 | 185.1 |
163
+ | cons 50k | 80.7 | 80.3 | 154.4 | 161.9 |
164
+ | baseline 100k | 191.6 | 194.9 | 357.8 | 379.5 |
165
+ | cons 100k | 151.7 | 147.5 | 291.1 | 310.9 |
166
+
167
+ This is **not** a per-job slowdown. The entire set is enqueued in seconds but
168
+ drains at ~225–290/s, so a child's wait is just its **position in the backlog ÷
169
+ throughput**. The last of 100k children waits ~100,000 / ~270 ≈ ~6 min by
170
+ arithmetic, regardless of how fast any individual child runs. Consolidation
171
+ shrinks the wait proportionally (100k: 192s → 152s) because it lifts throughput —
172
+ the lever for queue wait is throughput, not per-child execution time.
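
The backlog arithmetic above fits in a few lines. `expected_wait_seconds` is a hypothetical helper, and it assumes the whole set is already enqueued and drains at a steady rate, which real runs only approximate.

```ruby
# Expected queue wait for the child at a given backlog position, assuming a
# fully enqueued backlog that drains at a constant throughput.
def expected_wait_seconds(position:, throughput_per_s:)
  position / throughput_per_s.to_f
end

last_child = expected_wait_seconds(position: 100_000, throughput_per_s: 270)
puts format("last of 100k waits ≈ %.0f s (≈ %.1f min)", last_child, last_child / 60)
```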
173
+
174
+ ## Dashboard at 500k
175
+
176
+ The dashboard's scale-aware design held up live: capped `5000+` counts (no
177
+ `COUNT(*)` over 500k), keyset pagination, blocked-first triage — instant render
178
+ throughout the run.
179
+
180
+ On a branch's detail views the counts the poller already records render **exact**
181
+ straight from the branch-log metadata — no live count: `pending` and
182
+ `never-started` (recomputed each poll) and the total `spawned` (immutable once the
183
+ branch is sealed, so it's counted **once** and cached). Only the mutable per-state
184
+ chips (idle / completed / …) stay capped. So a 500k branch shows its real
185
+ `spawned` / `pending` / `never-started` figures, not `5000+`.
186
+
187
+ ## Poller behavior
188
+
189
+ `BranchMergeJob` cadence is driven by **estimated time-to-drain** (from the prior
190
+ poll's uncapped pending count), not backlog size. For a 500k fan-out draining at
191
+ ~200/s this is flat `max_interval` (5 min) polling through the long middle, then a
192
+ smooth ramp over the final minutes, tightening to `min_interval` (~5s) for the
193
+ last few thousand children — so the parent is woken within ~5s of the last child
194
+ finishing rather than up to a full `max_interval` late. ~15 cheap polls across the
195
+ run, one branch-scoped index count each (`[parent_execution_log_id, state]`); no
196
+ new indexes. When nothing completes in an interval the fallback is motion-aware: a
197
+ child still running holds the responsive floor (so a slow or single-child branch is
198
+ never woken late), a dispatched-but-unpicked straggler backs off exponentially, and
199
+ a fully blocked/waiting branch decays to `max_interval` instead of spinning.
200
+
201
+ Rekick of dropped children is **gated on the never-started count
202
+ delta**: if that count fell since the last poll, workers are consuming the branch's
203
+ queue, so deeply-queued-but-healthy children are left alone. It deliberately does
204
+ NOT use total pending — a `wait_until` child resuming would drop pending without
205
+ any never-started child moving, masking a genuinely-dropped child behind staggered
206
+ waits. Only a branch whose never-started count has gone flat has its stale
207
+ never-started children rekicked, and a `touch` on each rekick debounces it to at
208
+ most once per `REKICK_AFTER`. Rekick counts are stamped on the branch-log metadata
209
+ for the dashboard.
210
+
211
+ ### ⚠️ Poller queue placement (a trap)
212
+
213
+ `merge_branches` enqueues `BranchMergeJob` **after** it dispatches the branch's
214
+ children, so the poller **must not run on the same queue as a large fan-out's
215
+ children**. If it does, it is enqueued behind the entire backlog and starved: it
216
+ gets a worker slot only near the end, polls **once** at `pending≈0` with no prior
217
+ sample (`rate 0`), and backs off to `max_interval`. The consequences are twofold
218
+ and both defeat the point of the ETA cadence:
219
+
220
+ - the parent's convergence **lags by up to `max_interval` (~5 min)** after the last
221
+ child finishes (all children `completed`, parent still `idle`); and
222
+ - the dashboard's **live throughput/ETA never renders** — the poller never took a
223
+ mid-drain sample, so `rate` stays 0.
224
+
225
+ Give the poller a **dedicated, un-starved queue** so it polls throughout the drain
226
+ (then ETA engages and convergence is tight) via the first-class setting:
227
+
228
+ ```ruby
229
+ ChronoForge.configure { |c| c.branch_merge_queue = :chrono_forge_pollers }
230
+ ```
231
+
232
+ (and run a worker on that queue). It defaults to `:default`, which is fine when
233
+ fan-outs run on their own queues. For these runs that queue is `:scale_poller`
234
+ (see `config/scale_queue.yml`). This bit us live-driving 20k/100k — the first pass
235
+ had the poller on `:scale` and every parent hung `idle` for 5 min.
236
+
237
+ ## Environment caveats
238
+
239
+ - Local Docker **Postgres 13**, default `shared_buffers` 128 MB, single disk,
240
+ `max_connections` 100. Absolute numbers are this-laptop-specific; tuned/larger
241
+ production infra changes them.
242
+ - Part 1's 20k–100k rows are the **same clean baseline runs** as Part 2 (matched
243
+ pairs at a fixed ~170k-row backdrop). The **500k** row is the original
244
+ single-growing-table run, so its absolute throughput isn't strictly comparable
245
+ to the others — but it lands in the same ~200/s band.
246
+ - A child here is a **trivial** `durably_execute`; real children doing actual work
247
+ shift the bottleneck away from the engine's own commits.
@@ -0,0 +1,205 @@
1
+ # BranchMergeJob: ETA poll cadence, drain-aware rekick, configurable poller queue
2
+
3
+ _Implementation record. Started from the two defects below; the final shape was
4
+ driven by three review rounds and a live 20k/100k/500k drive, so this documents
5
+ what shipped, not a pre-implementation plan._
6
+
7
+ ## Problem
8
+
9
+ `ChronoForge::BranchMergeJob` is the lightweight poller that joins a fan-out's
10
+ branches: each pass it counts a branch's incomplete children, wakes the parent
11
+ when all are sealed and complete, otherwise re-arms itself. Two defects, plus a
12
+ deployment trap surfaced by the live drive:
13
+
14
+ 1. **Rekick re-enqueued healthy children.** `rekick_dropped_jobs` re-dispatched any
15
+ child that was `idle`, `started_at: nil`, and `updated_at < REKICK_AFTER.ago` —
16
+ which also matches a healthy child merely waiting deep in a draining backlog
17
+ (queue wait exceeds `REKICK_AFTER` at N ≥ 100k). And because `perform_later`
18
+ never touched the row, a rekicked-but-unpicked child stayed stale and was
19
+ **re-rekicked every poll**, piling up duplicates.
20
+
21
+ 2. **Cadence overshoot.** The delay was `(progressing × FACTOR).clamp(min, max)`
22
+ with `progressing` a count capped at `CAP = 5000`, so it saturated to
23
+ `max_interval` while any large backlog existed — polling *slowest* exactly when
24
+ a fast-draining backlog was about to finish. A 20k fan-out drains in ~88s but
25
+ the parent wasn't woken until ~310s; 500k sealed up to 5 min late.
26
+
27
+ 3. **Poller starvation (deploy trap).** `merge_branches` enqueues the poller *after*
28
+ dispatching the branch's children, so on a queue that those children saturate the
29
+ poller is starved behind the whole backlog: it polls once, at pending≈0, and backs
30
+ off, so the parent converges up to `max_interval` late and no throughput is recorded.
31
+ The only lever was monkey-patching `BranchMergeJob.queue_as` in the host app,
32
+ which a dev code-reload can silently reset.
33
+
34
+ ## Solution overview
35
+
36
+ - **Cadence** is driven by **estimated time-to-drain**, measured from the branch's
37
+ own completion rate — so the parent is woken within ~`min_interval` of the last
38
+ child finishing. When an interval sees no completion, the fallback is
39
+ **motion-aware** so a slow-but-healthy child isn't woken late.
40
+ - **Rekick** is gated on the **never-started count delta** — the true
41
+ "workers are pulling this branch's queue" signal — and **debounced** with
42
+ `child.touch`, so healthy deep-queued children are never rekicked and a dropped
43
+ child is redelivered at most once per `REKICK_AFTER`.
44
+ - The poller's **queue is a first-class config** (`branch_merge_queue`), since its
45
+ placement is our concern, not the user's.
46
+ - The measured **rate + ETA are persisted** each poll and surfaced **live on the
47
+ dashboard**, which now **auto-refreshes on every page**.
48
+
49
+ ---
50
+
51
+ ## Cadence — `reschedule_delay(pending, rate, motion, prev_delay, min, max)`
52
+
53
+ `lib/chrono_forge/branch_merge_job.rb`
54
+
55
+ ```ruby
56
+ def reschedule_delay(pending, rate, motion, prev_delay, min_interval, max_interval)
57
+   return (pending / rate * ETA_FRACTION).clamp(min_interval, max_interval) if rate > 0
58
+
59
+   case motion
60
+   when :running then min_interval
61
+   when :never_started then prev_delay ? (prev_delay * 2).clamp(min_interval, max_interval) : min_interval
62
+   else max_interval
63
+   end
64
+ end
65
+ ```
66
+
67
+ - **Draining (`rate > 0`):** poll at `ETA_FRACTION` (0.5) of the projected time-to-
68
+ drain. Because each poll re-estimates against the shrinking remainder, the cadence
69
+ converges geometrically and tightens to `min_interval` at the tail.
70
+ - **No completion this interval → `motion`:**
71
+   - `:running`: a live worker is executing a child; it will finish, so **hold the
72
+     floor** (`min_interval`). This is the anti-regression case: backing off would
73
+     wake the parent late for a slow / single-child branch.
74
+   - `:never_started`: the only motion is a queued/rekicked-but-unpicked child that may
75
+     never be picked up → **exponential backoff** from the floor (double `prev_delay`,
76
+     capped at `max_interval`). Catches a quick recovery in seconds; decays instead of spinning.
77
+   - `:none`: nothing can progress (blocked/failed or parked on a wait) →
78
+     `max_interval` backstop.
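To make the convergence concrete, here is a small simulation sketch. It copies `reschedule_delay` from above and uses the stated `ETA_FRACTION` of 0.5; the helper `simulate_drain`, the 5 s floor, the 300 s ceiling, and the steady 200 children/s rate are all hypothetical numbers chosen for illustration, not values from the engine.

```ruby
ETA_FRACTION = 0.5 # value stated in the cadence description above

# Copied from the snippet above so this sketch runs on its own.
def reschedule_delay(pending, rate, motion, prev_delay, min_interval, max_interval)
  return (pending / rate * ETA_FRACTION).clamp(min_interval, max_interval) if rate > 0

  case motion
  when :running then min_interval
  when :never_started then prev_delay ? (prev_delay * 2).clamp(min_interval, max_interval) : min_interval
  else max_interval
  end
end

# Hypothetical helper: assumes the branch drains at a constant `rate`
# and records the delay chosen at each poll until nothing is pending.
def simulate_drain(pending, rate, min_interval, max_interval)
  delays = []
  while pending > 0
    delay = reschedule_delay(pending, rate, nil, delays.last, min_interval, max_interval)
    delays << delay
    pending = [pending - rate * delay, 0].max
  end
  delays
end

# 20,000 children at 200/s: each poll lands at half the remaining ETA,
# so the delay halves until it reaches the floor.
p simulate_drain(20_000.0, 200.0, 5.0, 300.0)
```

Under these assumptions the delays halve on each poll until they reach the floor, which is the "converges geometrically and tightens to `min_interval`" behaviour described above. The stalled `:never_started` case instead doubles from the floor toward the ceiling.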
79
+
80
+ Inputs, computed in `perform`:
81
+
82
+ - **`pending`** is the **uncapped** incomplete count (`BranchProbe.incomplete(id).count`),
83
+ served by the existing `[parent_execution_log_id, state]` index: one
84
+ branch-scoped count per poll (~7 for 20k, ~15 for 500k; a background cost). The
85
+ old `CAP` flattened this to a constant `5000`, which is why the ETA couldn't reuse
86
+ it. `CAP` and `FACTOR` are removed; `ETA_FRACTION` is added.
87
+ - **`rate`** = `(prev_pending − pending) / elapsed` when the branch drained since its
88
+ prior poll, else `0.0`. Measured per branch (`rate_by_branch`, for the dashboard)
89
+ and aggregated for the ETA. Aggregate `prev_pending` is only trusted when every
90
+ requested branch log is loaded *and* carries a prior sample
91
+ (`logs.size == branch_log_ids.size && prior.all?`), so a partial set can't yield a
92
+ bogus rate.
93
+ - **`motion`** is computed **lazily** — only when `rate == 0`, keeping the EXISTS
94
+ probes off the hot drain path: `:running` if any `BranchProbe.running?`, else
95
+ `:never_started` if any branch has a positive never-started count, else `:none`.
96
+ - **`prev_delay`** comes from the prior poll's persisted `interval`, driving the
97
+ exponential backoff.
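The `rate` and aggregate-trust rules above can be sketched in plain Ruby. This is a minimal illustration, not the code in `perform`: the helper names `drain_rate` and `aggregate_prev_pending`, and the hash of prior samples keyed by branch log id, are assumptions made for the example.

```ruby
# Hypothetical helper: children/s drained since the prior poll, or 0.0
# when there is no prior sample or nothing drained.
def drain_rate(prev_pending, pending, elapsed)
  return 0.0 if prev_pending.nil? || elapsed.nil? || elapsed <= 0

  drained = prev_pending - pending
  drained > 0 ? drained / elapsed.to_f : 0.0
end

# Hypothetical helper: the aggregate prior count is only trusted when every
# requested branch has a prior sample (the `prior.all?` idea above).
# `prior_by_id` maps branch log id => previously persisted pending count.
def aggregate_prev_pending(requested_ids, prior_by_id)
  prior = requested_ids.map { |id| prior_by_id[id] }
  return nil unless prior.all?

  prior.sum
end

prev = aggregate_prev_pending([1, 2], { 1 => 500, 2 => 300 })
p drain_rate(prev, 600, 10)                                           # both samples present
p drain_rate(aggregate_prev_pending([1, 2], { 1 => 500 }), 600, 10)   # partial set, rate falls back to 0.0
```

Returning `nil` for a partial set means the rate falls back to `0.0`, which in turn routes the poll through the `motion` branch rather than trusting a bogus ETA.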
98
+
99
+ ## Rekick — `rekick_dropped_jobs(branch_log_ids, never_started_by_branch, prev_never_started_by_branch)`
100
+
101
+ ```ruby
102
+ prev = prev_never_started_by_branch[id]
103
+ next [id, 0] if prev && never_started_by_branch[id] < prev # never-started count fell → workers consuming → in line
104
+ # else: scan idle & started_at IS NULL & updated_at < REKICK_AFTER.ago, limit REKICK_BATCH,
105
+ # guarded perform_later, then child.touch on success (debounce), rescue per child.
106
+ ```
107
+
108
+ - **Gate on the never-started count delta, not total pending.** A
109
+ `wait_until` child resuming drops total `pending` without any never-started child
110
+ being consumed, so a pending-delta gate would mistake that for "draining" and
111
+ defer recovery of a genuinely-dropped child behind staggered waits. The
112
+ `idle & started_at IS NULL` count falling is the real signal that workers are
113
+ pulling this branch's queue (added `BranchProbe.dispatched`, a countable relation).
114
+ - **Cold poll (no prior sample) doesn't gate** — it falls through to the per-child
115
+ staleness filter, which already spares freshly-dispatched children, so a dropped
116
+ child is still recovered on the first poll.
117
+ - **`child.touch` on a successful rekick** bumps `updated_at`, so the child leaves
118
+ the staleness window for one `REKICK_AFTER` — redelivered at most once per window,
119
+ killing the re-rekick pile-up. Only on success; a rescued enqueue failure leaves it
120
+ stale to retry next poll.
121
+ - Best-effort: a per-child rescue keeps one bad child from sinking the whole poll.
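The gate, staleness filter, and debounce described above can be modelled without a database. The sketch below is an in-memory approximation under stated assumptions: `Child` is a hypothetical struct standing in for a child record, timestamps are plain numbers of seconds, the `REKICK_AFTER` and `REKICK_BATCH` values are made up, and assigning `updated_at` stands in for `child.touch`.

```ruby
REKICK_AFTER = 60  # seconds; hypothetical value for illustration
REKICK_BATCH = 100 # hypothetical value for illustration

# Hypothetical stand-in for a child workflow row.
Child = Struct.new(:id, :state, :started_at, :updated_at)

# Returns how many children were rekicked in this pass.
def rekick(children, never_started, prev_never_started, now:, enqueue:)
  # Gate: the never-started count fell since the prior poll, so workers are
  # pulling this branch's queue and nothing needs redelivery.
  return 0 if prev_never_started && never_started < prev_never_started

  # A cold poll (no prior sample) falls through to the staleness filter.
  stale = children.select do |c|
    c.state == :idle && c.started_at.nil? && c.updated_at < now - REKICK_AFTER
  end

  stale.first(REKICK_BATCH).count do |child|
    begin
      enqueue.call(child)
      child.updated_at = now # debounce: leaves the staleness window
      true
    rescue StandardError
      false # stays stale, so the next poll retries it
    end
  end
end
```

Calling `rekick` twice in a row on the same stale child redelivers it only once, because the first successful pass moves `updated_at` out of the window. A raising `enqueue` leaves the child untouched, matching the "only on success" rule.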
122
+
123
+ ## Persisted poll state — `record_poll!`
124
+
125
+ Each pass stamps the branch log's `metadata["poll"]` (under `with_lock` + a token
126
+ recheck, leaving `spawn_each`'s cursors untouched):
127
+
128
+ `last_polled_at`, `next_poll_at`, `interval`, `pending`, `dispatched`, `sealed`,
129
+ `rate` (children/s, `round(3)` so a very slow but real drain still reads > 0),
130
+ `eta_seconds`, `polls`, `rekicked`, `rekick_total`, `last_rekick_at`.
131
+
132
+ `rate`/`eta_seconds` are **free** — already computed for the cadence, no extra query
133
+ — which is what lets the dashboard show live throughput without the aggregate scans
134
+ the scale-aware design avoids.
135
+
136
+ ## Configurable poller queue
137
+
138
+ `lib/chrono_forge/configuration.rb` (the engine's first config object):
139
+
140
+ ```ruby
141
+ ChronoForge.configure { |c| c.branch_merge_queue = :chrono_forge_pollers } # default: :default
142
+ ```
143
+
144
+ `BranchMergeJob` reads it via `queue_as { ChronoForge.config.branch_merge_queue }`,
145
+ resolved **per-enqueue**, so a change takes effect without redefining the job and
146
+ can't be silently reset by a code reload (the fragility the live-driven run exposed).
147
+ Keep the poller off a queue saturated by a fan-out's own children.
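The difference between resolving the queue once and resolving it per enqueue can be shown without Active Job. The sketch below is an assumption-laden illustration: `ForgeSketch`, `EagerJob`, and `LazyJob` are hypothetical names, and the lambda only imitates what a `queue_as` block does; it is not the engine's configuration code.

```ruby
# Hypothetical module mirroring the config / configure / reset_configuration!
# trio listed under "Files changed".
module ForgeSketch
  class Configuration
    attr_accessor :branch_merge_queue

    def initialize
      @branch_merge_queue = :default
    end
  end

  def self.config
    @config ||= Configuration.new
  end

  def self.configure
    yield config
  end

  def self.reset_configuration!
    @config = Configuration.new
  end
end

# Captures the queue once, when the class body is evaluated.
class EagerJob
  QUEUE = ForgeSketch.config.branch_merge_queue

  def self.queue_name
    QUEUE
  end
end

# Re-reads the config every time, like a block passed to `queue_as`.
class LazyJob
  RESOLVER = -> { ForgeSketch.config.branch_merge_queue }

  def self.queue_name
    RESOLVER.call
  end
end

ForgeSketch.configure { |c| c.branch_merge_queue = :chrono_forge_pollers }
p EagerJob.queue_name # still the value captured at load time
p LazyJob.queue_name  # follows the current configuration
```

The eager variant keeps whatever value was current when the class loaded, which is the kind of stale binding a per-enqueue lookup avoids.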
148
+
149
+ ## Dashboard (`chrono_forge-dashboard`)
150
+
151
+ - **Live throughput / ETA on in-flight merges.** `BranchesPresenter::Merge` gains
152
+ `:rate` / `:eta_seconds` and `throughput? = merging? && rate.to_f > 0`; the merges
153
+ list renders `<rate>/s` and `ETA <cf_secs>`, both guarded. **Multi-branch merges
154
+ aggregate** — `merge_throughput` sums per-branch `rate` and recomputes the combined
155
+ ETA (`Σpending / Σrate`), rather than showing one branch's figure.
156
+ - **Auto-refresh on every page.** The poll region is marked once on the layout's
157
+ `<main>` (the nav + refresh/time controls sit in `<header>`, outside the swap), so
158
+ every page — workflow list *and detail*, analytics, waiting, repetitions —
159
+ refreshes in place, preserving filter text, focus, and scroll. It was previously
160
+ per-page opt-in, which had silently left the detail page (where the gauge lives)
161
+ and several others un-refreshing.
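The multi-branch aggregation described above (`Σpending / Σrate`) might look like the following. This is a sketch under assumptions: the method takes an array of per-branch poll hashes with string keys `"rate"` and `"pending"`, and the return shape is invented for the example; the presenter's real signature is not shown in this document.

```ruby
# Hypothetical standalone version of the aggregate: sum per-branch rates and
# recompute one combined ETA instead of reporting a single branch's figure.
def merge_throughput(polls)
  rate = polls.sum { |poll| poll["rate"].to_f }
  return { rate: 0.0, eta_seconds: nil } unless rate > 0

  pending = polls.sum { |poll| poll["pending"].to_i }
  { rate: rate.round(3), eta_seconds: (pending / rate).round }
end

polls = [
  { "rate" => 150.0, "pending" => 3000 },
  { "rate" => 50.0,  "pending" => 1000 }
]
p merge_throughput(polls)
```

Note that a merge with no measured rate returns no ETA at all, which lines up with the guarded rendering (`throughput?`) rather than showing a misleading figure.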
162
+
163
+ ## Files changed
164
+
165
+ **`chrono_forge`**
166
+ - `lib/chrono_forge.rb` — `config` / `configure` / `reset_configuration!`.
167
+ - `lib/chrono_forge/configuration.rb` — new; `branch_merge_queue`.
168
+ - `lib/chrono_forge/branch_merge_job.rb` — `queue_as` from config; ETA + motion
169
+ cadence; dispatched-delta rekick + `touch`; uncapped `pending`; `record_poll!`
170
+ fields; `superseded?(logs, …)`; removed `CAP`/`FACTOR`, added `ETA_FRACTION`.
171
+ - `lib/chrono_forge/branch_probe.rb` — `running?`, `dispatched`/`dispatched?`;
172
+ `incomplete` used uncapped (`progressing` retained, unused by the poller).
173
+ - `lib/chrono_forge/executor/methods/merge_branches.rb` — stale cadence comment.
174
+ - `test/branch_merge_job_test.rb`, `test/branch_probe_test.rb` — cadence (motion),
175
+ dispatched-delta rekick incl. the waits-drain-pending regression, debounce,
176
+ throughput persistence, configurable queue.
177
+ - `docs/fanout-scale-test.md`, `README.md` — cadence, rekick, queue config, the
178
+ poller-queue-placement trap.
179
+
180
+ **`chrono_forge-dashboard`**
181
+ - `branches_presenter.rb` — `Merge` rate/eta + `throughput?` + `merge_throughput`
182
+ aggregate.
183
+ - `_branches.html.erb` — throughput/ETA spans (sub-1/s shown to one decimal).
184
+ - `layouts/.../application.html.erb` — `data-poll-region` on `<main>`; removed the
185
+ per-page markers.
186
+ - `test/branches_test.rb`, `README.md` — aggregation test; docs.
187
+
188
+ ## Validation
189
+
190
+ Full engine suite **263** tests green; dashboard **106**. Live-driven on Solid Queue
191
+ + Postgres (poller on a dedicated `:scale_poller` queue):
192
+
193
+ - **20k / 100k** — parents converged; the dashboard showed live `~226/s` + ETA that
194
+ ramped down and vanished on completion.
195
+ - **500k** — **500,000 / 500,000** children completed, parent converged, **11**
196
+ poller passes (ETA cadence held throughout, not a single starved poll), and **200**
197
+ children rekicked + recovered after a mid-drain worker restart — the
198
+ rekick/debounce path exercised at scale.
199
+
200
+ ## Review findings resolved
201
+
202
+ - **#1** — rekick gate moved from total-pending delta to the never-started count
203
+ delta (dropped-child recovery no longer deferred behind resuming waits).
204
+ - **#2** — multi-branch merges aggregate rate/ETA in the presenter.
205
+ - **#3** — `rate` stored `round(3)` so a sub-1/s drain still renders throughput/ETA.