chrono_forge 0.9.1 → 0.10.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -0
- data/README.md +305 -44
- data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md +1748 -0
- data/docs/superpowers/plans/2026-06-25-chrono_forge-dashboard.md.tasks.json +17 -0
- data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md +930 -0
- data/docs/superpowers/plans/2026-06-25-composite-retry-policies.md.tasks.json +54 -0
- data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md +241 -0
- data/docs/superpowers/plans/2026-06-25-reserved-kwarg-guard.md.tasks.json +12 -0
- data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md +1378 -0
- data/docs/superpowers/plans/2026-06-26-branches-spawn-merge.md.tasks.json +67 -0
- data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md +709 -0
- data/docs/superpowers/plans/2026-06-26-deferral-continuation-race-and-catchup.md.tasks.json +19 -0
- data/docs/superpowers/specs/2026-06-03-unified-retry-policy-design.md +226 -0
- data/docs/superpowers/specs/2026-06-25-chrono_forge-dashboard-design.md +190 -0
- data/docs/superpowers/specs/2026-06-25-composite-retry-policies-design.md +228 -0
- data/docs/superpowers/specs/2026-06-25-reserved-kwarg-guard-design.md +169 -0
- data/docs/superpowers/specs/2026-06-25-spawn-merge-branches-design.md +468 -0
- data/docs/superpowers/specs/2026-06-26-dashboard-branch-view-design.md +142 -0
- data/docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md +265 -0
- data/lib/chrono_forge/branch_merge_job.rb +138 -0
- data/lib/chrono_forge/branch_probe.rb +26 -0
- data/lib/chrono_forge/cleanup.rb +6 -0
- data/lib/chrono_forge/execution_log.rb +6 -0
- data/lib/chrono_forge/executor/composite_retry_policy.rb +47 -0
- data/lib/chrono_forge/executor/methods/branch.rb +185 -0
- data/lib/chrono_forge/executor/methods/durably_execute.rb +21 -19
- data/lib/chrono_forge/executor/methods/durably_repeat.rb +118 -25
- data/lib/chrono_forge/executor/methods/merge_branches.rb +83 -0
- data/lib/chrono_forge/executor/methods/wait.rb +2 -4
- data/lib/chrono_forge/executor/methods/wait_until.rb +25 -25
- data/lib/chrono_forge/executor/methods/workflow_states.rb +16 -0
- data/lib/chrono_forge/executor/methods.rb +2 -0
- data/lib/chrono_forge/executor/retry_policy.rb +111 -0
- data/lib/chrono_forge/executor.rb +216 -28
- data/lib/chrono_forge/version.rb +1 -1
- data/lib/chrono_forge/workflow.rb +10 -1
- data/lib/generators/chrono_forge/migration_actions.rb +1 -0
- data/lib/generators/chrono_forge/templates/add_chrono_forge_parent_execution_log.rb +38 -0
- metadata +42 -5
- data/lib/chrono_forge/executor/retry_strategy.rb +0 -29
@@ -0,0 +1,709 @@
# Deferral Continuation Race & Catch-up Surge — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development (recommended) or superpowers-extended-cc:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Close the continuation/lock-release race (Issue 1) by publishing every continuation only after the lock is released, and collapse `durably_repeat` catch-up from O(missed intervals) to O(1) with a closed-form fast-forward of the expired prefix (Issue 2).

**Architecture:** (1) Deferral primitives stop calling `perform_later` inline; they record an intended continuation on the instance, and the executor flushes it in `ensure` *after* `release_lock`. (2) `durably_repeat` computes the first non-expired grid tick in closed form, advances the coordination log's `last_execution_at`, and writes a single summary `ExecutionLog` for the skipped prefix instead of one timed-out row per tick.

**Tech Stack:** Ruby 3.2, Rails (ActiveJob/ActiveRecord), Minitest + `chaotic_job`, SolidQueue (prod). Gem: `chrono_forge` 0.9.1.

**Spec:** `docs/superpowers/specs/2026-06-26-deferral-continuation-race-and-catchup-design.md`

**User Verification:** NO — no user verification required (automated tests are the acceptance gate).

**Test command (single file):** `bundle exec ruby -Itest test/<file>_test.rb`
**Full suite:** `bundle exec rake test`

---

## File Structure

| File | Responsibility | Change |
|---|---|---|
| `lib/chrono_forge/executor.rb` | Continuation recording + post-release flush | Add `enqueue_continuation` / `flush_continuation!`; flush in `ensure`; convert workflow-retry enqueue |
| `lib/chrono_forge/executor/methods/wait.rb` | `wait` reschedule | Convert inline enqueue → `enqueue_continuation` |
| `lib/chrono_forge/executor/methods/wait_until.rb` | poll + cond-error retry | Convert 2 inline enqueues |
| `lib/chrono_forge/executor/methods/durably_execute.rb` | retry backoff | Convert 1 inline enqueue |
| `lib/chrono_forge/executor/methods/durably_repeat.rb` | schedule-later, repetition-retry, schedule-next, **fast-forward** | Convert 3 inline enqueues (Task 1); add `fast_forward_expired_prefix` (Task 2) |
| `test/continuation_flush_test.rb` | Issue 1 tests | Create (Task 1) |
| `test/durably_repeat_test.rb` | Issue 2 tests + updates | Add fast-forward tests; update 2 timeout tests (Task 2) |

---

### Task 1: Defer all continuation enqueues until after lock release

**Goal:** No continuation job is published while the enqueuing job still holds the workflow lock; all 8 enqueue sites route through one recorded slot flushed in `ensure` after `release_lock`.

**Files:**
- Modify: `lib/chrono_forge/executor.rb` (add helpers near `halt_execution!` ~`:305`; flush in `ensure` `:168-173`; convert workflow-retry enqueue `:162-164`)
- Modify: `lib/chrono_forge/executor/methods/wait.rb:106-108`
- Modify: `lib/chrono_forge/executor/methods/wait_until.rb:134-138` and `:180-185`
- Modify: `lib/chrono_forge/executor/methods/durably_execute.rb:111-113`
- Modify: `lib/chrono_forge/executor/methods/durably_repeat.rb:192-194`, `:234-236`, `:287-289`
- Test: `test/continuation_flush_test.rb` (create)

**Acceptance Criteria:**
- [ ] Every continuation observes the workflow lock already released (`locked_by == nil`) at enqueue time.
- [ ] Per-site kwargs are preserved (`wait_condition:` for the `wait_until` poll; `attempt:`/`retry_counts:` for the workflow retry).
- [ ] `flush_continuation!` is a no-op when no continuation was recorded, and is skipped when `release_lock` raises (overrun loses the lock).
- [ ] Full suite still green (regression guard for retry/attempt threading).
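
As a rough illustration of the three behavioural criteria above, the following self-contained sketch models the record-then-flush ordering with a toy lock. `ToyLock`, `ToyExecutor`, and `steal` are hypothetical names introduced only for this example; the real executor works against ActiveJob and the `LockStrategy` referenced in this plan, and may differ in detail.

```ruby
# Hypothetical toy model; not the chrono_forge implementation.
class ToyLock
  attr_reader :owner

  def acquire(job_id)
    raise "lock held by #{owner}" if owner
    @owner = job_id
  end

  # Raises when the caller no longer owns the lock (for example, it overran).
  def release(job_id)
    raise "lock not owned by #{job_id}" unless owner == job_id
    @owner = nil
  end

  # Simulates another job taking over an expired lock.
  def steal(job_id)
    @owner = job_id
  end
end

class ToyExecutor
  attr_reader :published

  def initialize(lock)
    @lock = lock
    @published = [] # [continuation, lock owner observed at publish time]
  end

  def run(job_id)
    @continuation = nil
    lock_acquired = false
    @lock.acquire(job_id)
    lock_acquired = true
    yield self
  ensure
    if lock_acquired
      @lock.release(job_id) # if this raises, the flush below is skipped
      flush_continuation!
    end
  end

  # Records the intended continuation; nothing is published yet.
  def enqueue_continuation(wait:, **kwargs)
    @continuation = {wait: wait, kwargs: kwargs}
  end

  private

  # Publishes the recorded continuation, if any, noting who holds the lock.
  def flush_continuation!
    return unless @continuation
    @published << [@continuation, @lock.owner]
  end
end

executor = ToyExecutor.new(ToyLock.new)
executor.run("job-1") { |e| e.enqueue_continuation(wait: 0, wait_condition: "my_cond") }
continuation, owner_at_publish = executor.published.first
puts owner_at_publish.inspect # nil: the lock was already free at publish time
puts continuation[:kwargs].inspect
```

Because the flush sits after the release inside the same `ensure` branch, an exception from the release leaves the recorded continuation unpublished, which is the behaviour the third criterion asks for.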

**Verify:** `bundle exec ruby -Itest test/continuation_flush_test.rb` → all pass; then `bundle exec rake test` → green.

**Steps:**

- [ ] **Step 1: Write the failing tests**

Create `test/continuation_flush_test.rb`:

```ruby
require "test_helper"

class ContinuationFlushTest < ActiveJob::TestCase
  include ChaoticJob::Helpers

  def setup
    ChronoForge::Workflow.destroy_all
  end

  # The core ordering guarantee: a continuation must only become claimable after
  # the enqueuing job has released the lock. We observe the workflow's lock owner
  # in the DB at the instant each same-key continuation is enqueued; it must be nil.
  def test_continuation_is_enqueued_only_after_lock_released
    key = "flush_order_#{Time.now.to_i}_#{rand(10_000)}"

    locked_owners = []
    subscriber = ActiveSupport::Notifications.subscribe("enqueue.active_job") do |*args|
      event = ActiveSupport::Notifications::Event.new(*args)
      job = event.payload[:job]
      next unless job.arguments.first == key
      wf = ChronoForge::Workflow.find_by(key: key)
      locked_owners << (wf && wf.locked_by)
    end

    begin
      WaitContinuationJob.perform_later(key)
      perform_all_jobs_before(1.second)
    ensure
      ActiveSupport::Notifications.unsubscribe(subscriber)
    end

    # At least one continuation enqueue must have been observed from inside the job.
    refute locked_owners.empty?, "expected to observe a continuation enqueue"
    assert locked_owners.all?(&:nil?),
      "continuation must be enqueued only after lock release; observed owners: #{locked_owners.inspect}"
  end

  # flush_continuation! must round-trip arbitrary kwargs into the continuation.
  def test_flush_continuation_preserves_kwargs
    key = "flush_kwargs_#{Time.now.to_i}_#{rand(10_000)}"
    workflow = ChronoForge::Workflow.create!(
      key: key, job_class: "KitchenSink", kwargs: {}, options: {}, context: {}, state: :idle
    )

    job = KitchenSink.new
    job.instance_variable_set(:@workflow, workflow)
    job.send(:enqueue_continuation, wait: 0.seconds, wait_condition: "my_cond")

    assert_difference -> { enqueued_jobs.size }, 1 do
      job.send(:flush_continuation!)
    end

    last = enqueued_jobs.last
    assert_includes last.to_s, key, "continuation should target the workflow key"
    assert_includes last.to_s, "my_cond", "continuation must carry the wait_condition kwarg"
  end

  # No recorded continuation => flush does nothing.
  def test_flush_continuation_is_noop_without_recorded_continuation
    job = KitchenSink.new
    assert_no_difference -> { enqueued_jobs.size } do
      job.send(:flush_continuation!)
    end
  end
end

class WaitContinuationJob < WorkflowJob
  prepend ChronoForge::Executor

  def perform
    # First pass: wait period not elapsed -> records a continuation and halts.
    wait 1.hour, "long_wait"
  end
end
```

- [ ] **Step 2: Run tests to verify they fail**

Run: `bundle exec ruby -Itest test/continuation_flush_test.rb`
Expected: FAIL —
- `test_continuation_is_enqueued_only_after_lock_released`: observed owner is the job id (non-nil), because `wait` enqueues before the `ensure` release.
- `test_flush_continuation_preserves_kwargs` / `..._noop_...`: `NoMethodError: undefined method 'enqueue_continuation'/'flush_continuation!'`.

- [ ] **Step 3: Add the recording + flush helpers in the executor**

In `lib/chrono_forge/executor.rb`, add near `halt_execution!` (private section, ~`:305`):

```ruby
# Record the continuation this job intends to enqueue. It is NOT published
# here: publishing while the lock is still held lets another worker claim it
# and lose the lock-acquisition race. The executor flushes it in `ensure`,
# after release_lock (see #flush_continuation!). At most one continuation is
# recorded per job run (every primitive records one then halts, or falls
# through the workflow-retry rescue).
def enqueue_continuation(wait:, **kwargs)
  @continuation = {wait: wait, kwargs: kwargs}
end

# Publish the recorded continuation, if any. Called from `ensure` only after
# the lock row has been updated to released, so even a zero-delay continuation
# finds the lock free.
def flush_continuation!
  return unless @continuation

  self.class
    .set(wait: @continuation[:wait])
    .perform_later(@workflow.key, **@continuation[:kwargs])
end
```

- [ ] **Step 4: Flush in `ensure`, after release_lock**

In `lib/chrono_forge/executor.rb`, change the `ensure` block (`:168-173`) from:

```ruby
ensure
  if lock_acquired # Only release lock if we acquired it
    context.save!
    self.class::LockStrategy.release_lock(job_id, workflow)
  end
end
```

to:

```ruby
ensure
  if lock_acquired # Only release lock if we acquired it
    context.save!
    self.class::LockStrategy.release_lock(job_id, workflow)
    # Publish the continuation only now — after the lock is released — so a
    # zero-delay, same-key continuation can't lose the acquire race against
    # this still-locked job. If release_lock raised (this job overran and
    # lost the lock), we never reach here and another job owns continuation.
    flush_continuation!
  end
end
```

- [ ] **Step 5: Convert the workflow-level retry enqueue**

In `lib/chrono_forge/executor.rb`, change (`:161-164`):

```ruby
if backoff
  self.class
    .set(wait: backoff)
    .perform_later(workflow.key, attempt: attempts_made, retry_counts: retry_counts)
else
```

to:

```ruby
if backoff
  enqueue_continuation(wait: backoff, attempt: attempts_made, retry_counts: retry_counts)
else
```

- [ ] **Step 6: Convert the `wait` enqueue**

In `lib/chrono_forge/executor/methods/wait.rb`, change (`:105-111`):

```ruby
# Reschedule the job
self.class
  .set(wait: duration)
  .perform_later(@workflow.key)

# Halt current execution
halt_execution!
```

to:

```ruby
# Record the reschedule; the executor publishes it after lock release.
enqueue_continuation(wait: duration)

# Halt current execution
halt_execution!
```

- [ ] **Step 7: Convert both `wait_until` enqueues**

In `lib/chrono_forge/executor/methods/wait_until.rb`, change the cond-error retry (`:132-141`):

```ruby
if backoff
  # Reschedule with the policy's backoff
  self.class
    .set(wait: backoff)
    .perform_later(
      @workflow.key
    )

  # Halt current execution
  halt_execution!
```

to:

```ruby
if backoff
  # Reschedule with the policy's backoff (published after lock release).
  enqueue_continuation(wait: backoff)

  # Halt current execution
  halt_execution!
```

Then change the poll reschedule (`:179-188`):

```ruby
# Reschedule with delay
self.class
  .set(wait: check_interval)
  .perform_later(
    @workflow.key,
    wait_condition: condition
  )

# Halt current execution
halt_execution!
```

to:

```ruby
# Reschedule the poll (published after lock release).
enqueue_continuation(wait: check_interval, wait_condition: condition)

# Halt current execution
halt_execution!
```

- [ ] **Step 8: Convert the `durably_execute` retry enqueue**

In `lib/chrono_forge/executor/methods/durably_execute.rb`, change (`:107-116`):

```ruby
if backoff
  # Reschedule with the policy's backoff. The workflow replays on
  # resume and skips completed steps, so the rescheduled run picks
  # this step up again by its persisted execution log.
  self.class
    .set(wait: backoff)
    .perform_later(@workflow.key)

  # Halt current execution
  halt_execution!
```

to:

```ruby
if backoff
  # Reschedule with the policy's backoff (published after lock release).
  # The workflow replays on resume and skips completed steps, so the
  # rescheduled run picks this step up again by its execution log.
  enqueue_continuation(wait: backoff)

  # Halt current execution
  halt_execution!
```

- [ ] **Step 9: Convert all three `durably_repeat` enqueues**

In `lib/chrono_forge/executor/methods/durably_repeat.rb`, `schedule_repetition_for_later` (`:191-197`):

```ruby
# Schedule the workflow to run at the specified time
self.class
  .set(wait: delay)
  .perform_later(@workflow.key)

# Halt current execution until scheduled time
halt_execution!
```

to:

```ruby
# Schedule the workflow to run at the specified time (published after release).
enqueue_continuation(wait: delay)

# Halt current execution until scheduled time
halt_execution!
```

The repetition retry (`:232-239`):

```ruby
if backoff
  # Reschedule this same repetition with the policy's backoff
  self.class
    .set(wait: backoff)
    .perform_later(@workflow.key)

  # Halt current execution
  halt_execution!
```

to:

```ruby
if backoff
  # Reschedule this same repetition with the policy's backoff (after release).
  enqueue_continuation(wait: backoff)

  # Halt current execution
  halt_execution!
```

And `schedule_next_execution_after_completion` (`:286-292`):

```ruby
# Schedule the workflow to run for the next periodic execution
self.class
  .set(wait: delay)
  .perform_later(@workflow.key)

# Halt current execution
halt_execution!
```

to:

```ruby
# Schedule the next periodic execution (published after lock release).
enqueue_continuation(wait: delay)

# Halt current execution
halt_execution!
```

- [ ] **Step 10: Run the new tests — expect PASS**

Run: `bundle exec ruby -Itest test/continuation_flush_test.rb`
Expected: PASS (3 tests).

- [ ] **Step 11: Run the full suite — expect green**

Run: `bundle exec rake test`
Expected: all tests pass (retry/attempt threading preserved through the flush).

- [ ] **Step 12: Commit**

```bash
git add lib/chrono_forge/executor.rb \
  lib/chrono_forge/executor/methods/wait.rb \
  lib/chrono_forge/executor/methods/wait_until.rb \
  lib/chrono_forge/executor/methods/durably_execute.rb \
  lib/chrono_forge/executor/methods/durably_repeat.rb \
  test/continuation_flush_test.rb
git commit -m "fix(executor): publish continuations after lock release to close acquire race"
```

```json:metadata
{"files": ["lib/chrono_forge/executor.rb", "lib/chrono_forge/executor/methods/wait.rb", "lib/chrono_forge/executor/methods/wait_until.rb", "lib/chrono_forge/executor/methods/durably_execute.rb", "lib/chrono_forge/executor/methods/durably_repeat.rb", "test/continuation_flush_test.rb"], "verifyCommand": "bundle exec ruby -Itest test/continuation_flush_test.rb && bundle exec rake test", "acceptanceCriteria": ["continuations enqueued only after lock release", "per-site kwargs preserved", "flush no-ops without recorded continuation and is skipped when release_lock raises", "full suite green"], "requiresUserVerification": false}
```

---

### Task 2: Closed-form fast-forward of the expired prefix in `durably_repeat`

**Goal:** When `durably_repeat` resumes behind schedule, jump past the expired prefix in O(1), advance the coordination log's `last_execution_at`, and write one summary `ExecutionLog` for the skip — instead of one timed-out row + one zero-delay job per missed tick.

**Files:**
- Modify: `lib/chrono_forge/executor/methods/durably_repeat.rb` (call fast-forward in `durably_repeat` after `:149`; add private `fast_forward_expired_prefix`)
- Test: `test/durably_repeat_test.rb` (add new tests; update `test_durably_repeat_with_timeout` `:116` and `test_durably_repeat_coordination_log_updated_on_timeout` `:345`)

**Acceptance Criteria:**
- [ ] `fast_forward_expired_prefix` returns the input unchanged when nothing is expired (`next >= now − timeout`).
- [ ] When ticks are expired, it returns the first grid tick `>= now − timeout` (exact grid landing, `n = ceil((cutoff − next)/every)` intervals).
- [ ] The expired prefix produces **zero** `Execution timed out` rows and exactly **one** summary row (`error_class: "TimeoutError"`, `metadata["fast_forwarded"] == n`, step on the last skipped grid tick) that does not collide with the first-valid repetition row.
- [ ] Coordination `last_execution_at` is advanced to `(first_valid − every).iso8601`, so a replay is stable (recomputes the same `first_valid`).
- [ ] The first in-window tick still executes its work (boundary preserved).
- [ ] Full suite green.
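
To make the arithmetic in these criteria concrete, here is a small worked example using plain `Time` values and numeric seconds rather than ActiveSupport durations. The specific numbers (a 5 s grid, a 2 s timeout, a first pending tick 61 s in the past) are assumptions chosen for illustration only.

```ruby
require "time"

# Worked example of the closed form; values are illustrative assumptions.
now     = Time.utc(2026, 6, 26, 12, 0, 0)
every   = 5         # grid spacing in seconds
timeout = 2         # a tick expires `timeout` seconds after its scheduled time
next_at = now - 61  # first pending tick, 61 s in the past

cutoff = now - timeout                      # ticks strictly before this are expired
n = ((cutoff - next_at) / every.to_f).ceil  # ceil(59 / 5) expired ticks to skip
first_valid  = next_at + n * every          # first tick at or after the cutoff
last_skipped = first_valid - every          # what last_execution_at would record

puts n                    # 12
puts first_valid.iso8601  # 2026-06-26T11:59:59Z
puts last_skipped < cutoff && first_valid >= cutoff # true
# Replay stability: the recorded tick plus one interval is first_valid again.
puts last_skipped + every == first_valid            # true
```

The twelve skipped ticks run from 61 s to 6 s before `now`, all older than the 2 s cutoff, and the result stays on the original grid because it is computed as `next_at` plus a whole number of intervals.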

**Verify:** `bundle exec ruby -Itest test/durably_repeat_test.rb` → all pass; then `bundle exec rake test` → green.

**Steps:**

- [ ] **Step 1: Write the failing unit test for the closed form**

Add to `test/durably_repeat_test.rb` (inside `class DurablyRepeatTest`):

```ruby
def test_fast_forward_returns_input_when_nothing_expired
  workflow = ChronoForge::Workflow.create!(
    key: "ff_noop_#{rand(10_000)}", job_class: "KitchenSink",
    kwargs: {}, options: {}, context: {}, state: :idle
  )
  coordination = workflow.execution_logs.create!(
    step_name: "durably_repeat$x", state: :pending, metadata: {}
  )
  job = KitchenSink.new
  job.instance_variable_set(:@workflow, workflow)

  next_at = Time.current + 5.seconds # future tick, not expired
  result = job.send(:fast_forward_expired_prefix, coordination, next_at, 2.seconds, 1.hour)

  assert_in_delta next_at.to_f, result.to_f, 0.001, "future tick must be returned unchanged"
end

def test_fast_forward_lands_on_first_non_expired_grid_tick
  workflow = ChronoForge::Workflow.create!(
    key: "ff_jump_#{rand(10_000)}", job_class: "KitchenSink",
    kwargs: {}, options: {}, context: {}, state: :idle
  )
  coordination = workflow.execution_logs.create!(
    step_name: "durably_repeat$x", state: :pending, metadata: {}
  )
  job = KitchenSink.new
  job.instance_variable_set(:@workflow, workflow)

  every = 1.second
  timeout = 1.second
  # 60 ticks back, 1s grid, 1s timeout => cutoff = now-1s; first non-expired
  # tick is the smallest grid tick >= now-1s.
  next_at = Time.current - 60.seconds
  cutoff = Time.current - timeout

  result = job.send(:fast_forward_expired_prefix, coordination, next_at, every, timeout)

  # On-grid: result == next_at + n*every for integer n.
  n = ((result - next_at) / every.to_f).round
  assert_in_delta next_at.to_f + n * every.to_f, result.to_f, 0.001, "result must stay on the grid"
  assert_operator result, :>=, cutoff, "result must be the first non-expired tick"
  assert_operator result - every, :<, cutoff, "the tick before result must still be expired"

  # Coordination advanced so replay recomputes the same first_valid.
  coordination.reload
  assert coordination.metadata["last_execution_at"], "last_execution_at must be set"
  assert_in_delta (result - every).to_f,
    Time.parse(coordination.metadata["last_execution_at"]).to_f, 0.001

  # Exactly one summary row written, on the last skipped grid tick, with the count.
  summary = workflow.execution_logs.where("step_name LIKE ?", "durably_repeat$x$%").to_a
  assert_equal 1, summary.size, "exactly one summary row for the skipped prefix"
  assert_equal "TimeoutError", summary.first.error_class
  assert_operator summary.first.metadata["fast_forwarded"].to_i, :>=, 1
end
```

- [ ] **Step 2: Run unit tests to verify they fail**

Run: `bundle exec ruby -Itest test/durably_repeat_test.rb -n "/fast_forward/"`
Expected: FAIL — `NoMethodError: undefined method 'fast_forward_expired_prefix'`.

- [ ] **Step 3: Implement `fast_forward_expired_prefix` and wire it in**

In `lib/chrono_forge/executor/methods/durably_repeat.rb`, in `durably_repeat`, insert the call right after `next_execution_at` is computed (after `:149`, before `execute_or_schedule_repetition` at `:151`):

```ruby
next_execution_at = fast_forward_expired_prefix(coordination_log, next_execution_at, every, timeout)

execute_or_schedule_repetition(method, coordination_log, next_execution_at, every, policy, timeout, on_error)
```

Add the private method (alongside the other privates):

```ruby
# Catch-up fast-forward. A tick `t` is expired (its work is skipped) iff
# `Time.current > t + timeout`, i.e. `t < now - timeout`. Rather than
# walking one zero-delay job per expired tick, jump straight to the first
# non-expired tick on the same grid in closed form.
#
# Anchoring the arithmetic on `next_execution_at` (already on the canonical
# grid: start_at / created_at+every / last_execution_at+every all land on
# it, because last_execution_at stores the *scheduled* time, not wall-clock)
# keeps the result exactly on the grid — no drift.
#
# Returns `next_execution_at` unchanged when nothing is expired. Otherwise
# advances the coordination log's last_execution_at so a replay recomputes
# the same first tick, and writes ONE summary ExecutionLog for the whole
# skipped prefix (no per-tick timeout rows).
def fast_forward_expired_prefix(coordination_log, next_execution_at, every, timeout)
  cutoff = Time.current - timeout
  return next_execution_at if next_execution_at >= cutoff

  n = ((cutoff - next_execution_at) / every.to_f).ceil
  first_valid = next_execution_at + (n * every)
  last_skipped = first_valid - every

  Rails.logger.info {
    "ChronoForge:#{self.class}(#{@workflow.key}) durably_repeat fast-forwarded " \
      "#{n} expired tick(s) to #{first_valid.iso8601}"
  }

  # Single summary row for the skipped prefix, on the last skipped grid
  # tick (unique; never collides with the first_valid repetition row).
  summary_step = "#{coordination_log.step_name}$#{last_skipped.to_i}"
  find_or_create_execution_log!(summary_step) do |log|
    log.started_at = Time.current
    log.metadata = {
      fast_forwarded: n,
      from: next_execution_at.iso8601,
      to: last_skipped.iso8601,
      scheduled_for: last_skipped,
      timeout_at: last_skipped + timeout,
      parent_id: coordination_log.id
    }
  end.update!(
    state: :failed,
    error_class: "TimeoutError",
    error_message: "Fast-forwarded #{n} expired tick(s)",
    completed_at: Time.current
  )

  # Record progress: a replay recomputes naive_next = last + every = first_valid.
  coordination_log.update!(
    metadata: coordination_log.metadata.merge("last_execution_at" => last_skipped.iso8601)
  )

  first_valid
end
```
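
The method's header comment contrasts the closed form with walking one tick at a time. As a sanity check on that claim, the sketch below compares a naive loop against the formula over a small grid of integer epoch-second inputs. Both helper names (`walk_to_first_valid`, `jump_to_first_valid`) are hypothetical and exist only in this example; the side effects of the real method (logging, summary row, coordination update) are deliberately left out.

```ruby
# Naive catch-up: advance one tick at a time until the tick is no longer expired.
# Cost grows with the number of missed intervals.
def walk_to_first_valid(next_at, every, cutoff)
  steps = 0
  while next_at < cutoff
    next_at += every
    steps += 1
  end
  [next_at, steps]
end

# Closed form: the same landing tick and skip count, computed without a loop.
def jump_to_first_valid(next_at, every, cutoff)
  return [next_at, 0] if next_at >= cutoff

  n = ((cutoff - next_at) / every.to_f).ceil
  [next_at + n * every, n]
end

# Compare both strategies for every start in 0..120 and every interval in 1..7.
cutoff = 100
mismatches = []
(0..120).each do |next_at|
  (1..7).each do |every|
    walked = walk_to_first_valid(next_at, every, cutoff)
    jumped = jump_to_first_valid(next_at, every, cutoff)
    mismatches << [next_at, every] unless walked == jumped
  end
end

puts mismatches.empty? # true
```

Both functions return the landing tick together with the number of skipped intervals, so the comparison also covers the count that the plan stores as `fast_forwarded`.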

- [ ] **Step 4: Run unit tests — expect PASS**

Run: `bundle exec ruby -Itest test/durably_repeat_test.rb -n "/fast_forward/"`
Expected: PASS (2 tests).

- [ ] **Step 5: Add an integration test for catch-up (red→green in one step since impl now exists)**

Add to `test/durably_repeat_test.rb` (class body + a job class at the bottom with the others):

|
|
590
|
+
```ruby
|
|
591
|
+
def test_durably_repeat_catch_up_fast_forwards_expired_prefix
|
|
592
|
+
unique_key = "catchup_#{Time.now.to_i}_#{rand(10_000)}"
|
|
593
|
+
|
|
594
|
+
# start_at far in the past with a short timeout => a long expired prefix.
|
|
595
|
+
CatchUpJob.perform_later(unique_key, start_time: Time.current - 60.seconds)
|
|
596
|
+
|
|
597
|
+
perform_all_jobs_before(5.seconds)
|
|
598
|
+
|
|
599
|
+
workflow = ChronoForge::Workflow.find_by(key: unique_key)
|
|
600
|
+
|
|
601
|
+
# No per-tick timeout tombstones for the expired prefix.
|
|
602
|
+
timed_out = workflow.execution_logs.select { |l| l.error_message == "Execution timed out" }
|
|
603
|
+
assert_empty timed_out, "expired prefix must not create per-tick timeout rows"
|
|
604
|
+
|
|
605
|
+
# Exactly one fast-forward summary row.
|
|
606
|
+
summaries = workflow.execution_logs.select { |l| l.metadata && l.metadata["fast_forwarded"] }
|
|
607
|
+
assert_equal 1, summaries.size, "expired prefix collapses to one summary row"
|
|
608
|
+
assert_operator summaries.first.metadata["fast_forwarded"].to_i, :>=, 1
|
|
609
|
+
end
|
|
610
|
+
```
|
|
611
|
+
|
|
```ruby
class CatchUpJob < WorkflowJob
  prepend ChronoForge::Executor

  def perform(start_time:)
    context.set_once(:execution_count, 0)
    start_obj = start_time.is_a?(String) ? Time.parse(start_time) : start_time
    durably_repeat :catch_up_task, every: 1.second, till: :done?,
      start_at: start_obj, timeout: 1.second
  end

  private

  def catch_up_task(_scheduled = nil)
    context[:execution_count] = context.fetch(:execution_count, 0) + 1
  end

  def done?
    context.fetch(:execution_count, 0) >= 1
  end
end
```

- [ ] **Step 6: Update the two existing timeout tests to the new behavior**

In `test/durably_repeat_test.rb`, in `test_durably_repeat_with_timeout` (`:116-131`), replace the timeout-tombstone assertion. Change:

```ruby
# Should have timeout failures
timeout_logs = workflow.execution_logs.select { |log|
  log.failed? && log.error_message == "Execution timed out"
}
assert_operator timeout_logs.size, :>, 0, "should have timeout failures"
```

to:

```ruby
# Expired ticks are now fast-forwarded: no per-tick "Execution timed out"
# rows; the skipped prefix collapses to a single fast_forwarded summary row.
timeout_logs = workflow.execution_logs.select { |log|
  log.error_message == "Execution timed out"
}
assert_empty timeout_logs, "expired ticks should be fast-forwarded, not tombstoned per tick"

summaries = workflow.execution_logs.select { |log| log.metadata && log.metadata["fast_forwarded"] }
assert_operator summaries.size, :>=, 1, "should record a fast-forward summary row"
```

Then, in `test_durably_repeat_coordination_log_updated_on_timeout` (`:345-384`), change the same tombstone block (`:369-373`) from:

```ruby
# Find timeout logs
timeout_logs = workflow.execution_logs.select { |log|
  log.failed? && log.error_message == "Execution timed out"
}
assert_operator timeout_logs.size, :>, 0, "should have timeout failures"
```

to:

```ruby
# Expired ticks are fast-forwarded into a single summary row, not per-tick rows.
summaries = workflow.execution_logs.select { |log| log.metadata && log.metadata["fast_forwarded"] }
assert_operator summaries.size, :>=, 1, "should record a fast-forward summary row"
```

(The remaining assertions in that test, which check that `last_execution_at` is present and advanced, still hold and stay unchanged.)

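The two filters used in the updated assertions can be tried in isolation. The sketch below uses a hypothetical `LogRow` struct in place of the real execution log model and assumes `metadata` may be `nil` on ordinary rows, which is why the `log.metadata &&` guard comes before the key lookup.

```ruby
# Hypothetical stand-in for an execution log row; only the two fields the
# assertions read are modelled. This is not the gem's model class.
LogRow = Struct.new(:error_message, :metadata)

logs = [
  LogRow.new(nil, nil),                                                       # ordinary row without metadata
  LogRow.new("Fast-forwarded 3 expired tick(s)", { "fast_forwarded" => 3 }),  # summary row
  LogRow.new(nil, { "other" => true })                                        # metadata without the key
]

# Per-tick tombstones: none are expected once expired ticks are fast-forwarded.
timed_out = logs.select { |log| log.error_message == "Execution timed out" }

# Summary rows: the nil guard keeps rows without metadata from raising NoMethodError.
summaries = logs.select { |log| log.metadata && log.metadata["fast_forwarded"] }
```

With this sample data `timed_out` is empty and `summaries` holds only the summary row, mirroring what the rewritten tests assert.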
- [ ] **Step 7: Run the durably_repeat suite, expect PASS**

Run: `bundle exec ruby -Itest test/durably_repeat_test.rb`
Expected: PASS (all, including updated timeout tests).

- [ ] **Step 8: Run the full suite, expect green**

Run: `bundle exec rake test`
Expected: all tests pass.

- [ ] **Step 9: Commit**

```bash
git add lib/chrono_forge/executor/methods/durably_repeat.rb test/durably_repeat_test.rb
git commit -m "fix(durably_repeat): fast-forward expired catch-up prefix in closed form"
```

```json:metadata
{"files": ["lib/chrono_forge/executor/methods/durably_repeat.rb", "test/durably_repeat_test.rb"], "verifyCommand": "bundle exec ruby -Itest test/durably_repeat_test.rb && bundle exec rake test", "acceptanceCriteria": ["fast_forward returns input when nothing expired", "lands on first non-expired grid tick (no drift)", "zero per-tick timeout rows + exactly one summary row", "coordination last_execution_at advanced for stable replay", "first in-window tick still executes", "full suite green"], "requiresUserVerification": false}
```

---

## Self-Review

- **Spec coverage:** Section 1 (deferred flush, all 8 sites) → Task 1. Section 2 (closed-form fast-forward, summary row, coordination advance, test updates) → Task 2. Both covered.
- **Placeholder scan:** none; every step has concrete code/commands.
- **Type/name consistency:** `enqueue_continuation`/`flush_continuation!`/`@continuation`/`fast_forward_expired_prefix` are used identically across plan and tests. The summary metadata key `fast_forwarded` is consistent in the implementation and all assertions.
- **Verification scan:** the spec requires no human-in-the-loop verification → User Verification = NO; no verification task needed.