RubyGems - ruby_reactor - Versions diffs - 0.5.2 → 0.5.3 - Mend

ruby_reactor 0.5.2 → 0.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (27)

checksums.yaml +4 -4
data/.release-please-manifest.json +1 -1
data/CHANGELOG.md +7 -0
data/README.md +147 -34
data/lib/ruby_reactor/configuration.rb +66 -2
data/lib/ruby_reactor/context_serializer.rb +9 -4
data/lib/ruby_reactor/executor/ordered_lock_support.rb +1 -1
data/lib/ruby_reactor/executor/retry_manager.rb +7 -2
data/lib/ruby_reactor/executor/step_executor.rb +25 -5
data/lib/ruby_reactor/executor.rb +85 -3
data/lib/ruby_reactor/lock.rb +13 -0
data/lib/ruby_reactor/map/collector.rb +41 -0
data/lib/ruby_reactor/map/dispatcher.rb +42 -0
data/lib/ruby_reactor/map/element_executor.rb +39 -0
data/lib/ruby_reactor/map/helpers.rb +10 -3
data/lib/ruby_reactor/map/sweeper.rb +110 -0
data/lib/ruby_reactor/reactor.rb +7 -5
data/lib/ruby_reactor/sidekiq_adapter.rb +9 -8
data/lib/ruby_reactor/sidekiq_workers/sweeper_worker.rb +73 -0
data/lib/ruby_reactor/sidekiq_workers/worker.rb +42 -34
data/lib/ruby_reactor/step/map_step.rb +18 -2
data/lib/ruby_reactor/storage/redis_adapter.rb +83 -60
data/lib/ruby_reactor/storage/redis_locking.rb +8 -0
data/lib/ruby_reactor/sweeper.rb +58 -0
data/lib/ruby_reactor/version.rb +1 -1
data/lib/ruby_reactor.rb +42 -0
metadata +4 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 66c0a0c5591cd862dc61063d752f4d92111370808256bfdae749825fec68429b
-  data.tar.gz: 212ac4ce7ef87e5d28606cab0aff8358afde437389fbb5b2d717b1b5874daa77
+  metadata.gz: 778bc5305c6d1f20833819afd9ccd46f5a5b4c2c135d8e63344d6530f9a733f1
+  data.tar.gz: 32e769816eba846f419e3f31e8290b94e8ff04fe6ea71fef125bb128b3085b82
 SHA512:
-  metadata.gz: 75e3bd7ead2281ef7bd1a74fe42a1aaaa1ff5ac92db2b72172e779b0fa9591878268b9829c939def027a6d368ad0ad7b00eddaa05bbb5497f0678229b68b17c0
-  data.tar.gz: 46c44e73bef1e7f2a11d7a5de51de83a2a9a2505aa8c5b06cf64215bc97c4c272f67f23a220187431a188a6dcca216ee13741566fc61e1ead6a8d7dc7b392d74
+  metadata.gz: 592db1d0ef94153a4ea028aa3cdeb59f7e4c73929ebec5afd5a9795a93d27a0cfb0c4bd3e135ccf1b14c0329123be6cf0905ad4a141018993832d21212754db3
+  data.tar.gz: 2fd90e47af8e26cf2d58468a1f0629fd1dc04890df0a6e1704e8d6cba5c65b1e050f10f56628c27c7c90b0545d081e33169e3bfd62f416affd761b62edeb24a0

data/.release-please-manifest.json CHANGED

@@ -1,3 +1,3 @@
 {
-  ".": "0.5.2"
+  ".": "0.5.3"
 }

data/CHANGELOG.md CHANGED

@@ -1,5 +1,12 @@
 # Changelog
+## [0.5.3](https://github.com/arturictus/ruby_reactor/compare/v0.5.2...v0.5.3) (2026-06-17)
+### Features
+* Durability & Recovery ([#39](https://github.com/arturictus/ruby_reactor/issues/39)) ([103e583](https://github.com/arturictus/ruby_reactor/commit/103e5835b413eec2302fa63f3e998d487cfd9eaf))
 ## [0.5.2](https://github.com/arturictus/ruby_reactor/compare/v0.5.1...v0.5.2) (2026-06-14)

data/README.md CHANGED

@@ -36,6 +36,7 @@ The key value is **Reliability**: if any part of your workflow fails, Ruby React
 | Locks / sem / rate / per | Yes          | No              | No          | Manual              |
 | Built-in web dashboard   | Yes          | No              | No          | No                  |
 | Async with Sidekiq       | Yes          | No              | Limited     | Yes                 |
+| Durable crash recovery   | Yes          | No              | No          | Manual              |
 ## Real-World Use Cases
@@ -44,6 +45,7 @@ The key value is **Reliability**: if any part of your workflow fails, Ruby React
 - **Subscription Billing**: Coordinate Stripe charges, invoice email generation, and internal entitlement updates. Use interrupts to pause the workflow when 3rd-party APIs are required to continue the workflow or when specific customer approval is needed.
 ## Table of Contents
 - [Features](#features)
 - [Comparison](#comparison)
 - [Real-World Use Cases](#real-world-use-cases)
@@ -57,6 +59,7 @@ The key value is **Reliability**: if any part of your workflow fails, Ruby React
   - [Async Execution](#async-execution)
     - [Full Reactor Async](#full-reactor-async)
     - [Step-Level Async](#step-level-async)
+  - [Durability & Recovery](#durability--recovery)
   - [Interrupts (Pause & Resume)](#interrupts-pause--resume)
   - [Locks, Semaphores & Ordered Locks](#locks-semaphores--ordered-locks)
   - [Map & Parallel Execution](#map--parallel-execution)
@@ -90,53 +93,96 @@ Or install it yourself as:
 ## Configuration
-Configure RubyReactor with your Sidekiq and Redis settings:
+Every setting is **optional** — RubyReactor ships with the defaults shown. Drop
+this into an initializer (e.g. `config/initializers/ruby_reactor.rb`); pasted as-is
+it changes nothing, so it doubles as a reference of every knob.
-Every setting below is **optional** — RubyReactor ships with the defaults shown. Override only what you need.
+> **Reading the block:** lines starting with `##` are documentation. Lines starting
+> with a single `#` (a `config.…` call) are real settings commented at their
+> default — uncomment one to enable it.
 ```ruby
 RubyReactor.configure do |config|
-  # Storage adapter. Default: :redis (the only adapter shipped today).
-  config.storage.adapter = :redis
-  # Redis URL. Default: "redis://localhost:6379/0".
+  ## === Storage (Redis) ===
+  ## Storage adapter. Default: :redis (the only adapter shipped today).
+  # config.storage.adapter = :redis
+  ## Redis URL. Default: "redis://localhost:6379/0".
   config.storage.redis_url = ENV.fetch("REDIS_URL", "redis://localhost:6379/0")
-  # Extra options passed to Redis.new. Default: {}.
-  config.storage.redis_options = { timeout: 1 }
-  # Sidekiq queue used by RubyReactor's async worker. Default: :default.
-  config.sidekiq_queue = :default
-  # Sidekiq retry count for infrastructure failures only (deserialization,
-  # Redis, network). Step retries are managed separately. Default: 3.
-  config.sidekiq_retry_count = 3
-  # Lock/semaphore/rate-limit/ordered-lock contention snooze behavior for
-  # async reactors. When a Sidekiq worker cannot acquire a primitive it
-  # re-enqueues itself with `lock_snooze_base_delay + rand(0..lock_snooze_jitter)`
-  # seconds (rate-limit uses a precise `retry_after_seconds` hint from the error;
-  # ordered-lock waits re-poll at the base delay so a successor catches its
-  # blocker finishing fast), up to `lock_snooze_max_attempts` times before
-  # marking the context :failed. Defaults: 5 / 5 / 20. Set max_attempts to
-  # :infinity to never give up.
-  config.lock_snooze_base_delay = 5
-  config.lock_snooze_jitter = 5
-  config.lock_snooze_max_attempts = 20
-  # Named rate limits shared across reactors. Reference them with
-  # `with_rate_limit(:stripe)`. See Locks, Semaphores, Rate Limits & Periods.
-  config.rate_limits.register(:stripe, limits: { second: 3, minute: 100 })
-  # Logger. Default: Logger.new($stderr).
-  config.logger = Logger.new($stdout)
+  ## Extra options passed to Redis.new. Default: {}.
+  # config.storage.redis_options = { timeout: 1 }
+  ## === Sidekiq ===
+  ## Sidekiq queue used by RubyReactor's async worker. Default: :default.
+  # config.sidekiq_queue = :default
+  ## Sidekiq retry count for infrastructure failures only (deserialization,
+  ## Redis, network). Step retries are managed separately. Default: 3.
+  # config.sidekiq_retry_count = 3
+  ## === Contention snooze (locks / semaphores / rate limits / ordered locks) ===
+  ## When a Sidekiq worker cannot acquire a primitive it re-enqueues itself with
+  ## `lock_snooze_base_delay + rand(0..lock_snooze_jitter)` seconds (rate-limit
+  ## uses a precise `retry_after_seconds` hint from the error; ordered-lock waits
+  ## re-poll at the base delay so a successor catches its blocker finishing fast),
+  ## up to `lock_snooze_max_attempts` times before marking the context :failed.
+  ## Set max_attempts to :infinity to never give up.
+  # config.lock_snooze_base_delay = 5
+  # config.lock_snooze_jitter = 5
+  # config.lock_snooze_max_attempts = 20
+  ## === Durability & crash recovery (see "Durability & Recovery" below) ===
+  ## Retention TTL (seconds) for stored reactor/map state. Must exceed your
+  ## worst-case snooze/retry window; re-stamped on every write. Default: 86_400.
+  # config.context_ttl = 86_400
-  # Async router. Default: RubyReactor::SidekiqAdapter. Swap for a custom
-  # adapter if you don't use Sidekiq — the adapter only needs to respond to
-  # `perform_async(serialized_context, reactor_class_name, **)`.
+  ## TTL (seconds) for the per-context liveness lock. A live worker auto-extends
+  ## it; its absence is the sweeper's "worker died" signal. Must exceed the
+  ## longest a single step can run without yielding the GIL. Default: 60.
+  # config.context_lock_ttl = 60
+  ## Minimum seconds between per-step checkpoints within one run. 0 = checkpoint
+  ## after every step (strongest guarantee). Raise to coalesce mid-run writes for
+  ## long reactors — only safe when steps are idempotent. Default: 0.
+  # config.checkpoint_min_interval = 0
+  ## Recovery sweeper (the chain is kicked once by `RubyReactor.start_sweeper!`).
+  # config.sweeper_enabled = true    # run recovery by default
+  # config.sweeper_interval = 30     # seconds between sweeps = recovery-latency bound
+  # config.sweeper_limit = 1000      # max contexts/maps inspected per sweep
+  ## === Misc ===
+  ## Logger. Default: Logger.new($stdout).
+  # config.logger = Logger.new($stdout)
+  ## Async router. Default: RubyReactor::SidekiqAdapter. Swap for a custom adapter
+  ## if you don't use Sidekiq — it only needs to respond to
+  ## `perform_async(context_id, reactor_class_name, **)`.
   # config.async_router = MyCustomAdapter
+  ## === Examples (no default — set these to use the feature) ===
+  ## Named rate limits shared across reactors. Reference with `with_rate_limit(:stripe)`.
+  # config.rate_limits.register(:stripe, limits: { second: 3, minute: 100 })
+  ## OpenTelemetry / custom middlewares. Default: [].
+  # config.middlewares = [RubyReactor::OpenTelemetry]
 end
 ```
 You can also leave out the `configure` block entirely — defaults work for local development against a Redis on `localhost:6379`.
+> **Crash recovery needs a kick.** The `sweeper_*` settings above only configure
+> the recovery sweeper — they do not start it. Call `RubyReactor.start_sweeper!`
+> once at boot (ideally from a Sidekiq `on(:startup)` hook) or no crashed reactor
+> will ever resume. See [Durability & Recovery](#durability--recovery).
 ## Quick Start
@@ -341,6 +387,73 @@ def create(params)
 end
 ```
+### Durability & Recovery
+Async reactors are durable: state lives in Redis, not in the job payload. Before
+any background job is enqueued the root context is persisted, and after every
+completed step a checkpoint advances the stored blob — so a crash re-runs at most
+one step, never the whole reactor. Each running reactor also holds a short
+**liveness lock** that a live worker auto-extends; its absence is how a dead
+worker is detected.
+**Recovery is not automatic until you start the sweeper.** A crashed worker's
+reactor only resumes when the recovery sweeper notices the lapsed liveness lock
+and re-enqueues it. The sweeper is a self-rescheduling chain — **kick it once per
+process boot:**
+The recommended spot is a Sidekiq server startup hook, so only the worker
+process runs recovery (not your web/console/client processes):
+```ruby
+# config/initializers/sidekiq.rb
+Sidekiq.configure_server do |config|
+  config.on(:startup) { RubyReactor.start_sweeper! }
+end
+```
+Anywhere that runs once at boot works too — e.g. a Rails initializer:
+```ruby
+# config/initializers/ruby_reactor.rb
+RubyReactor.start_sweeper!
+```
+That's all that's required: `start_sweeper!` is idempotent (safe to call on every
+boot — duplicate kicks collapse to one chain), runs both the top-level reactor
+sweeper and the map sweeper every `config.sweeper_interval` seconds, and stops if
+you set `config.sweeper_enabled = false`. The interval is your recovery-latency
+bound.
+> **Sidekiq Enterprise `super_fetch` compatibility:** the chain is safe under
+> reliable fetch. `super_fetch` re-runs a job whose worker died mid-execution, so
+> a tick that crashes *after* enqueuing its successor but *before* acking would,
+> with naive single-flight, be recovered alongside that successor and fork the
+> chain (doubling every interval). RubyReactor avoids this: it never relies on
+> "one job in the chain" — each next tick is claimed by a per-time-window lock, so
+> a `super_fetch`-recovered tick computes the same window, loses the claim, and
+> collapses back to a single successor. The startup hook above is likewise
+> idempotent across multiple `super_fetch` server processes.
+**Prefer your own scheduler?** Set `config.sweeper_enabled = false` (which makes
+`start_sweeper!` a no-op) and drive recovery from cron, a Kubernetes `CronJob`,
+`sidekiq-cron`, `sidekiq-scheduler`, or Rails recurring tasks. Each tick is one
+call:
+```ruby
+RubyReactor.sweep_once # => { reactors: <n re-enqueued>, maps: <n recovered> }
+```
+For example, a rake task a system cron / CronJob can invoke:
+```ruby
+# lib/tasks/ruby_reactor.rake
+namespace :ruby_reactor do
+  task sweep: :environment do
+    RubyReactor.sweep_once
+  end
+end
+```
 ### Interrupts (Pause & Resume)
 Pause execution to wait for external events like webhooks or user approvals.

data/lib/ruby_reactor/configuration.rb CHANGED

@@ -9,12 +9,76 @@ module RubyReactor
     attr_writer :sidekiq_queue, :sidekiq_retry_count, :logger, :async_router,
                 :lock_snooze_base_delay, :lock_snooze_jitter, :lock_snooze_max_attempts,
-                :middlewares
+                :middlewares, :context_ttl, :context_lock_ttl, :checkpoint_min_interval,
+                :sweeper_enabled, :sweeper_interval, :sweeper_limit
     def sidekiq_queue
       @sidekiq_queue ||= :default
     end
+    # Retention TTL (seconds) for a stored reactor context. Storage is
+    # load-bearing for resume, so this must comfortably exceed the worst-case
+    # snooze/retry window. Refreshed on every checkpoint write.
+    def context_ttl
+      @context_ttl ||= 86_400
+    end
+    # Minimum wall-clock seconds between two PER-STEP durable checkpoints within a
+    # single worker run. The save-per-step checkpoint (`on_step_complete`) bounds
+    # crash re-execution to one step, but re-serializes and re-writes the WHOLE
+    # root blob after every Success — O(steps × context_size) writes for a long,
+    # large reactor. This throttle coalesces the mid-run intermediate checkpoints:
+    # a checkpoint is written only if at least this many seconds have elapsed since
+    # the last one. The final terminal/handoff state is ALWAYS persisted (by the
+    # run's ensure-save and the pre-enqueue checkpoint), so throttling only affects
+    # mid-run granularity. Tradeoff: with interval > 0, a crash may re-run every
+    # step completed inside the last interval — safe only when those steps are
+    # idempotent or side-effect-free.
+    #
+    # Default 0 -> checkpoint after EVERY step (strongest guarantee, no coalescing).
+    def checkpoint_min_interval
+      @checkpoint_min_interval ||= 0
+    end
+    # Whether the recovery sweepers run. The host kicks the self-rescheduling
+    # chain once (`RubyReactor.start_sweeper!`, e.g. from an initializer); each
+    # tick re-checks this flag, so flipping it to false stops the chain at the
+    # next tick. Default on: durability is inert without a running sweeper, so
+    # recovery must work out of the box.
+    def sweeper_enabled
+      @sweeper_enabled = true if @sweeper_enabled.nil?
+      @sweeper_enabled
+    end
+    # Seconds between sweeps. This is the upper bound on recovery latency for a
+    # dead worker — lower it for faster recovery, raise it to cut scan load.
+    def sweeper_interval
+      @sweeper_interval ||= 30
+    end
+    # Max contexts/maps inspected per sweep (passed to each sweeper's run_once).
+    def sweeper_limit
+      @sweeper_limit ||= 1000
+    end
+    # TTL (seconds) for the per-context liveness lock (`async:<id>`). Short by
+    # design — it is a liveness signal, not retention. A live worker auto-extends
+    # it (every ttl/3 s, from a background thread); its absence is the sweeper's
+    # "worker died" signal.
+    #
+    # SAFETY CONSTRAINT: this MUST exceed the longest a single step can run
+    # WITHOUT letting the auto-extend thread make progress. Under MRI the
+    # extender shares the GIL, so a step that holds the GIL continuously for
+    # longer than this TTL (a long CPU-bound pure-Ruby loop, a C extension that
+    # never releases the GIL, or a stop-the-world GC pause) lets the lock lapse.
+    # A lapsed lock looks "dead" to the sweeper, which may re-enqueue a duplicate
+    # that runs CONCURRENTLY with the still-live original — a double-run. I/O-bound
+    # steps release the GIL and keep the lock fresh, so the default 60s suits
+    # typical workloads; raise it if you run long synchronous CPU-bound steps.
+    def context_lock_ttl
+      @context_lock_ttl ||= 60
+    end
     def sidekiq_retry_count
       @sidekiq_retry_count ||= 3
     end
@@ -36,7 +100,7 @@ module RubyReactor
     end
     def logger
-      @logger ||= Logger.new($stderr)
+      @logger ||= Logger.new($stdout)
     end
     def async_router
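
In the hunk above, `sweeper_enabled` is the one getter that uses an explicit `nil?` check instead of `||=`. A minimal sketch of why that matters for boolean settings follows. `SketchConfig` and its attribute names are hypothetical and exist only for this illustration.

```ruby
class SketchConfig
  attr_writer :enabled_naive, :enabled_safe, :interval

  # `||=` treats false like nil, so an explicit `false` is overwritten.
  def enabled_naive
    @enabled_naive ||= true
  end

  # Only fall back to the default when nothing was assigned.
  def enabled_safe
    @enabled_safe = true if @enabled_safe.nil?
    @enabled_safe
  end

  # Numeric defaults are safe with `||=` because 0 is truthy in Ruby.
  def interval
    @interval ||= 30
  end
end

config = SketchConfig.new
config.enabled_naive = false
config.enabled_safe = false
config.interval = 0

config.enabled_naive # => true  (the override was lost)
config.enabled_safe  # => false (the override survived)
config.interval      # => 0
```

The same reasoning explains why `checkpoint_min_interval` can default to `0` through `||=`: in Ruby only `nil` and `false` are falsy.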

data/lib/ruby_reactor/context_serializer.rb CHANGED

@@ -19,13 +19,18 @@ module RubyReactor
       def deserialize(serialized_data)
         decompressed = decompress_if_needed(serialized_data)
-        data = JSON.parse(decompressed, symbolize_names: false)
+        deserialize_hash(JSON.parse(decompressed, symbolize_names: false))
+      rescue JSON::ParserError => e
+        raise RubyReactor::Error::DeserializationError, "Failed to parse serialized context: #{e.message}"
+      end
+      # Deserialize from an already-parsed Hash (e.g. what the storage adapter's
+      # `retrieve_context` returns). Lets the rehydrate-by-id worker path avoid a
+      # second JSON parse while still schema-validating. Schema validation lives
+      # here so both the string and Hash entry points enforce it.
+      def deserialize_hash(data)
         validate_schema_version(data)
         Context.deserialize_from_retry(data)
-      rescue JSON::ParserError => e
-        raise RubyReactor::Error::DeserializationError, "Failed to parse serialized context: #{e.message}"
       end
       # rubocop:disable Metrics/CyclomaticComplexity, Metrics/MethodLength

data/lib/ruby_reactor/executor/ordered_lock_support.rb CHANGED

@@ -180,7 +180,7 @@ module RubyReactor
       end
       def stored_context_status
-        reactor_class_name = @reactor_class.name || "AnonymousReactor-#{@reactor_class.object_id}"
+        reactor_class_name = RubyReactor.reactor_storage_name(@reactor_class)
         data = RubyReactor.configuration.storage_adapter.retrieve_context(@context.context_id, reactor_class_name)
         return nil unless data

data/lib/ruby_reactor/executor/retry_manager.rb CHANGED

@@ -48,7 +48,7 @@ module RubyReactor
                                  @context.root_context || @context
                                end
-        reactor_class_name = context_to_serialize.reactor_class.name
+        reactor_class_name = RubyReactor.reactor_storage_name(context_to_serialize.reactor_class)
         @middlewares.on(:before_async_enqueue, context_to_serialize)
@@ -72,7 +72,12 @@ module RubyReactor
             fail_fast: map_args[:fail_fast]
           )
         else
-          configuration.async_router.perform_in(delay, serialized_context, reactor_class_name)
+          # Persist BEFORE enqueue — the job payload is identity-only (F2). The
+          # rescheduled job rehydrates the root by id from storage.
+          configuration.storage_adapter.store_context(
+            context_to_serialize.context_id, serialized_context, reactor_class_name
+          )
+          configuration.async_router.perform_in(delay, context_to_serialize.context_id, reactor_class_name)
         end
       end
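
The change above moves from shipping the serialized context in the job payload to persisting it first and enqueuing only its id. The ordering can be sketched as follows. `MemoryStorage`, `MemoryQueue`, and `BillingReactor` are hypothetical stand-ins. Only the method names `store_context`, `retrieve_context`, and `perform_async` mirror calls visible in the diff, and their real signatures may carry more arguments.

```ruby
require "json"
require "securerandom"

# Hypothetical stand-in for the storage adapter.
class MemoryStorage
  def initialize
    @blobs = {}
  end

  def store_context(id, blob, class_name)
    @blobs["#{class_name}:#{id}"] = blob
  end

  # Returns the parsed Hash, or nil when nothing is stored under the key.
  def retrieve_context(id, class_name)
    blob = @blobs["#{class_name}:#{id}"]
    blob && JSON.parse(blob)
  end
end

# Hypothetical stand-in for the async router.
class MemoryQueue
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  def perform_async(context_id, class_name)
    @jobs << [context_id, class_name]
  end
end

storage = MemoryStorage.new
queue = MemoryQueue.new
context = { "context_id" => SecureRandom.uuid, "status" => "running",
            "results" => { "charge" => 42 } }

# 1. Persist first, so the job can never outrun its own state.
storage.store_context(context["context_id"], JSON.generate(context), "BillingReactor")
# 2. Enqueue only the identity.
queue.perform_async(context["context_id"], "BillingReactor")

# Worker side: rehydrate the full context by id.
job_id, job_class = queue.jobs.first
rehydrated = storage.retrieve_context(job_id, job_class)
```

If the two steps were reversed, a fast worker could pick up the job before the blob exists and find nothing to rehydrate.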

data/lib/ruby_reactor/executor/step_executor.rb CHANGED

@@ -11,6 +11,7 @@ module RubyReactor
         @result_handler = managers[:result_handler]
         @compensation_manager = managers[:compensation_manager]
         @middlewares = managers[:middlewares] || context.middlewares || Executor.middlewares_for(reactor_class)
+        @on_step_complete = managers[:on_step_complete]
       end
       def execute_all_steps
@@ -45,8 +46,14 @@ module RubyReactor
             # If a step returns InterruptResult, we need to stop execution and return it
             return result if result.is_a?(RubyReactor::InterruptResult)
-            # If result is nil, it means async was executed inline (test mode), continue
-            next if result.nil?
+            # Only a continue-Success reaches here (Async/Retry/Skipped/Failure/
+            # Interrupt all returned above; nil is inline-async test mode). It is
+            # the one outcome where the loop proceeds to more steps with no other
+            # save in between — every terminal/handoff result persists via its own
+            # path. Write a durable checkpoint so a crash re-runs at most this one
+            # step. Ordering: side-effect -> record result (inside execute_step) ->
+            # checkpoint here.
+            @on_step_complete&.call if result.is_a?(RubyReactor::Success)
           end
         end
@@ -198,20 +205,33 @@ module RubyReactor
         # Use root context if available to ensure we serialize the full tree
         context_to_serialize = @context.root_context || @context
-        reactor_class_name = context_to_serialize.reactor_class.name
+        reactor_class_name = RubyReactor.reactor_storage_name(context_to_serialize.reactor_class)
         # Inject OTel context before serialization
         @middlewares.on(:before_async_enqueue, context_to_serialize)
-        serialized_context = ContextSerializer.serialize(context_to_serialize)
+        # Storage is load-bearing: the job payload is identity-only, so the root
+        # context MUST be persisted BEFORE the job is enqueued (F2). The reactor
+        # class name used for the storage key must match the one handed to the
+        # worker, so compute it once and reuse it for both.
+        checkpoint_root!(context_to_serialize, reactor_class_name)
         configuration.async_router.perform_async(
-          serialized_context,
+          context_to_serialize.context_id,
           reactor_class_name,
           intermediate_results: @context.intermediate_results
         )
       end
+      # Persist the root context under its storage key. Mirrors Executor#checkpoint!
+      # but lives here because handle_async_step runs inside the StepExecutor and
+      # must serialize AFTER the before_async_enqueue middleware has injected its
+      # OTel context.
+      def checkpoint_root!(root, reactor_class_name)
+        storage = RubyReactor::Configuration.instance.storage_adapter
+        storage.store_context(root.context_id, ContextSerializer.serialize(root), reactor_class_name)
+      end
       def handle_interrupt_step(step_config)
         # Check if we have a result for this step (resuming)
         if @context.intermediate_results.key?(step_config.name)

data/lib/ruby_reactor/executor.rb CHANGED

@@ -38,14 +38,24 @@ module RubyReactor
           retry_manager: @retry_manager,
           result_handler: @result_handler,
           compensation_manager: @compensation_manager,
-          middlewares: @middlewares
+          middlewares: @middlewares,
+          # Save-per-step durable checkpoint. checkpoint! resolves the ROOT
+          # context, so this same callback — wired into every executor including
+          # the nested ones ComposeStep builds — always advances the root blob
+          # (F8): a mid-child crash re-runs one sub-step, not the whole child.
+          # `throttle: true` lets checkpoint_min_interval coalesce these mid-run
+          # writes (default 0 = write every step); the terminal save still runs.
+          on_step_complete: -> { checkpoint!(throttle: true) }
         }
       )
       @result = nil
       @acquired_lock = nil
       @acquired_semaphore = nil
+      @acquired_context_lock = nil
+      @context_lock_owner = nil
       @contention_snooze = false
       @skip_context_persist = false
+      @last_checkpoint_at = nil
     end
     def self.resolve_middlewares(reactor_class)
@@ -150,7 +160,7 @@ module RubyReactor
       end
     end
-    def resume_execution # rubocop:disable Metrics/MethodLength,Metrics/PerceivedComplexity
+    def resume_execution # rubocop:disable Metrics/MethodLength,Metrics/PerceivedComplexity,Metrics/CyclomaticComplexity
       middlewares.on(:start_reactor, reactor_class.name, context.inputs, @context)
       completed = false
@@ -175,6 +185,13 @@ module RubyReactor
       @context.status = :running
       check_rate_limit if first_run
+      # Per-context liveness lock: serializes duplicate deliveries of the same
+      # root context (e.g. a sweeper re-enqueue racing a still-live worker) and
+      # doubles as the sweeper's "worker alive" signal. Only the ROOT executor
+      # holds it — composed/nested children resume inline under the root worker
+      # and must not contend on the root's own key.
+      acquire_context_lock
       # Resumes intentionally skip check_rate_limit (a paused run must not
       # block itself on resume), so acquire lock/semaphore directly rather
       # than via acquire_locks.
@@ -217,6 +234,8 @@ module RubyReactor
       @result
     ensure
       release_locks
+      @acquired_context_lock&.release
+      @acquired_context_lock = nil
       leave_ordered_lock_scope
       save_context unless skip_context_persist?
@@ -241,13 +260,40 @@ module RubyReactor
     def save_context
       storage = RubyReactor::Configuration.instance.storage_adapter
-      reactor_class_name = @reactor_class.name || "AnonymousReactor-#{@reactor_class.object_id}"
+      reactor_class_name = RubyReactor.reactor_storage_name(@reactor_class)
       # Serialize context
       serialized_context = ContextSerializer.serialize(@context)
       storage.store_context(@context.context_id, serialized_context, reactor_class_name)
     end
+    # Durable per-step checkpoint. Unlike save_context (which serializes THIS
+    # executor's @context — the observability path, F1), checkpoint! always
+    # serializes and stores the ROOT context under the root's key — the unit the
+    # async worker rehydrates by id. For a top-level reactor root == @context; for
+    # a composed/nested child it stores the root with the child's live state
+    # embedded via composed_contexts. TTL is re-stamped on every write (Phase 4).
+    def checkpoint!(throttle: false)
+      return if throttle && !checkpoint_due?
+      root = @context.root_context || @context
+      storage = RubyReactor::Configuration.instance.storage_adapter
+      reactor_class_name = RubyReactor.reactor_storage_name(root.reactor_class)
+      storage.store_context(root.context_id, ContextSerializer.serialize(root), reactor_class_name)
+      @last_checkpoint_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+    end
+    # Whether a throttled (per-step) checkpoint is due. With checkpoint_min_interval
+    # <= 0 (default) every step checkpoints; otherwise mid-run checkpoints are
+    # coalesced to at most one per interval. The first step of a run always writes
+    # (@last_checkpoint_at is nil), and the run's terminal save is never throttled.
+    def checkpoint_due?
+      interval = RubyReactor.configuration.checkpoint_min_interval.to_f
+      return true if interval <= 0 || @last_checkpoint_at.nil?
+      (Process.clock_gettime(Process::CLOCK_MONOTONIC) - @last_checkpoint_at) >= interval
+    end
     def persist_context?
       @context.status.to_s != "pending" ||
         @context.execution_trace.any? ||
@@ -340,6 +386,42 @@ module RubyReactor
       RubyReactor::Period.key(base, config[:every])
     end
+    # Per-execution liveness lock on the root context id. Owner is a fresh UUID
+    # per execution (NOT the context_id): a duplicate delivery of the *same*
+    # context from a different worker must be blocked, so reentrancy by id would
+    # defeat the guard. Only the root executor acquires — a composed/nested child
+    # resumes inline under the root worker and shares the root's lock, so it must
+    # not try to re-acquire the same key with a different owner (self-deadlock).
+    def acquire_context_lock
+      root = @context.root_context || @context
+      return unless root.equal?(@context) # only the root executor holds it
+      # In Sidekiq::Testing.inline! the retry/snooze `perform_in` re-enters the
+      # worker synchronously, nested inside this still-running frame that holds
+      # the lock — it would self-contend forever. The lock guards concurrent
+      # cross-process delivery, which cannot happen under inline testing, so skip.
+      return if inline_testing_mode?
+      lock = RubyReactor::Lock.new(
+        "async:#{root.context_id}",
+        owner: @context_lock_owner ||= SecureRandom.uuid,
+        ttl: RubyReactor.configuration.context_lock_ttl,
+        wait: 0,            # fail fast -> snooze; never block the worker thread
+        auto_extend: true   # keep the liveness signal fresh while we run
+      )
+      lock.acquire
+      @acquired_context_lock = lock
+    rescue RubyReactor::Lock::AcquisitionError => e
+      # We lost the race to a live original holding this context's lock. We did
+      # no work, so we must NOT persist on the way out — saving our (older)
+      # rehydrated snapshot would clobber the original's newer checkpoint.
+      @skip_context_persist = true
+      raise RubyReactor::Lock::ContextLockContention.new(e.message, context_lock_key: "async:#{root.context_id}")
+    end
+    def inline_testing_mode?
+      defined?(Sidekiq::Testing) && Sidekiq::Testing.respond_to?(:inline?) && Sidekiq::Testing.inline?
+    end
     def acquire_exclusive_lock
       config = @reactor_class.lock_config
       key = config[:key_proc].call(@context.inputs)
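
The `checkpoint!` / `checkpoint_due?` pair in this hunk coalesces mid-run writes using a monotonic clock. A standalone sketch of the same throttling rule is shown below. `CheckpointThrottle` and its injectable `clock:` parameter are hypothetical additions that make the behavior easy to exercise without sleeping, and the counter replaces the real storage write.

```ruby
class CheckpointThrottle
  attr_reader :writes

  def initialize(min_interval:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @min_interval = min_interval.to_f
    @clock = clock
    @last_at = nil
    @writes = 0
  end

  # Returns true when a write happened, false when it was coalesced away.
  def checkpoint(throttle: false)
    return false if throttle && !due?

    @writes += 1 # stands in for the storage write
    @last_at = @clock.call
    true
  end

  private

  # The first call always writes; later calls wait out the interval.
  def due?
    return true if @min_interval <= 0 || @last_at.nil?

    (@clock.call - @last_at) >= @min_interval
  end
end

now = 0.0
throttle = CheckpointThrottle.new(min_interval: 5, clock: -> { now })

results = [0.0, 1.0, 4.9, 5.0, 6.0].map do |t|
  now = t
  throttle.checkpoint(throttle: true)
end
# results => [true, false, false, true, false]

final = throttle.checkpoint # unthrottled terminal save => true
```

A monotonic clock is used rather than `Time.now` so that wall-clock adjustments cannot make an interval appear to have elapsed early or never.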

data/lib/ruby_reactor/lock.rb CHANGED

@@ -4,6 +4,19 @@ module RubyReactor
   class Lock
     class AcquisitionError < StandardError; end
+    # Raised specifically for the per-context liveness lock (`async:<id>`).
+    # Carries the bare key so the worker can exempt it from the snooze cap:
+    # a duplicate of the *same* execution may legitimately wait arbitrarily
+    # long for the live original to finish.
+    class ContextLockContention < AcquisitionError
+      attr_reader :context_lock_key
+      def initialize(message, context_lock_key:)
+        super(message)
+        @context_lock_key = context_lock_key
+      end
+    end
     # Minimum interval between auto-extend pings; protects very small TTLs.
     MIN_EXTEND_INTERVAL = 1.0
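
The comment on `ContextLockContention` says the worker can exempt it from the snooze cap. The worker code is not part of this excerpt, so the sketch below is only one way such an exemption could look. `snooze_decision` and `MAX_ATTEMPTS` are hypothetical, and the two error classes are redefined locally so the example runs on its own.

```ruby
class AcquisitionError < StandardError; end

class ContextLockContention < AcquisitionError
  attr_reader :context_lock_key

  def initialize(message, context_lock_key:)
    super(message)
    @context_lock_key = context_lock_key
  end
end

MAX_ATTEMPTS = 3 # hypothetical cap for ordinary contention

# The subclass must be matched before its parent, or the cap would apply.
def snooze_decision(error, attempt)
  case error
  when ContextLockContention then :snooze
  when AcquisitionError then attempt < MAX_ATTEMPTS ? :snooze : :fail
  else :raise
  end
end

duplicate = ContextLockContention.new("held by live worker", context_lock_key: "async:abc")
snooze_decision(duplicate, 99)                          # => :snooze
snooze_decision(AcquisitionError.new("lock busy"), 3)   # => :fail
```

Because `ContextLockContention` inherits from `AcquisitionError`, any existing `rescue AcquisitionError` still catches it, so the ordering of the branches is what carries the exemption.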