RubyGems - kairos-chain - Versions diffs - 3.24.0 → 3.24.3 - Mend

kairos-chain 3.24.0 → 3.24.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 20e2a223137f51dc61025e57dd6fd205a8f702ef923b5b0b2e0d464a308d279f
-  data.tar.gz: 51627cb487cf5fc2e46b8e6055bf36e0cf5c8839f15cb2c2236f2ff56efa2def
+  metadata.gz: fb099806eedb198afc167cbb810110ab7bffac2fceea3684eb14a24e3e7b46fb
+  data.tar.gz: 61223eff1c6cd146eea47d44ab0ee95506a8baa6e15a0f1f5c2e929e9d44a5b1
 SHA512:
-  metadata.gz: 9fd4b17a28bdc06b7b19195e7274ef41e4ba7a93fc70e2c3a38083c58eae85ea8dfd432cb7cc6219cf7ba49ba637dbed12c48ea12312f4a3f26f46dfb936438f
-  data.tar.gz: fadb35fdbf47eeebfc9b667685452a0c222ecb396dbb2031de08ac27b2df1de1a3985fc41c03e4e767d22a58ec98a77d52c2059e921f8084d1663a2a099bdd1d
+  metadata.gz: 91d4a86fc2df06025fefb5f0252e9f019d85cb9fc31f6ceec0d2e0bbe1209c8d24277e7448faace8ced8ba28c889dee0af7f68fbf50d6dda6da27bbb3366e588
+  data.tar.gz: b6847081bc03d40d2ae77ec317d52955ba7fdc2597b91d10addba2a052067339171542316810d8e1dc95c5f4eb35eb5963425f7f82961ae13bbc5e3bfa8e2f50
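
The new digests above can be checked against a locally unpacked gem. The following is a minimal sketch, assuming the checksums have already been loaded into a Hash with the same shape as checksums.yaml; `checksum_ok?` is a hypothetical helper, not part of RubyGems or kairos-chain, and the payload is demo data rather than the real archive.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: compare a file's SHA256 with the value recorded
# under the 'SHA256' key of a checksums.yaml-shaped Hash.
def checksum_ok?(path, checksums, entry)
  expected = checksums.dig('SHA256', entry)
  return false unless expected # unknown entry: treat as a mismatch

  Digest::SHA256.file(path).hexdigest == expected
end

# Demo with a throwaway file standing in for data.tar.gz.
CHECK_RESULTS = Tempfile.create('data.tar.gz') do |f|
  f.write('demo payload')
  f.flush
  checksums = {
    'SHA256' => { 'data.tar.gz' => Digest::SHA256.hexdigest('demo payload') }
  }
  [
    checksum_ok?(f.path, checksums, 'data.tar.gz'), # digest matches
    checksum_ok?(f.path, checksums, 'metadata.gz')  # entry not recorded
  ]
end
```

For a real check, the Hash would come from parsing checksums.yaml and the path would point at the extracted data.tar.gz.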

data/CHANGELOG.md CHANGED

@@ -4,6 +4,54 @@ All notable changes to the `kairos-chain` gem will be documented in this file.
 This project follows [Semantic Versioning](https://semver.org/).
+## [3.24.1] - 2026-04-27
+### Fixed (multi_llm_review_wait)
+Self-referential validation: ran multi_llm_review on the new wait tool itself
+(both Path A Bash workflow and Path B MCP SkillSet, 7 reviewers total) — found
+1 P0 + 7 P1 bugs not caught by the v3.24.0 test suite. All fixed:
+- **P0** `config_parallel` had dead `unless ... || true` guard so YAML never
+  loaded — all configured wait caps silently fell back to defaults. Removed
+  the bogus guard, added explicit `require 'yaml'` at the top of the file.
+- **P1** Streak read-then-write was split: state was read at entry, streak
+  written later via `update_state`, so two concurrent waiters could both
+  observe the same N and both write N+1, undercounting. Now the increment
+  is fully inside the `update_state` RMW block.
+- **P1** `still_pending` next_action message read streak limit from
+  `state.dig('wait_still_pending_streak_limit')` (a key never written),
+  always falling back to the default constant. Now `streak_limit` is
+  threaded through `translate_outcome` so the displayed denominator
+  matches the effective config.
+- **P1** Post-wait deadline revalidation missing: `WaitForWorker.wait`
+  could return `:timeout` after the collect deadline elapsed during the
+  blocking wait, but the tool returned `still_pending`. Now re-checks
+  `Time.now >= deadline_at_entry` after the wait and returns
+  `past_collect_deadline` if so.
+- **P1** Pre-wait streak guard ran before the ready/results-file check,
+  so a worker that finished while streak was at limit was misclassified
+  as `crashed/wait_exhausted`. Reordered: ready check now runs first.
+- **P1** Internal exceptions returned `status: 'error'`, outside the
+  declared 6-status enum. Now mapped to `crashed` with
+  `crashed_reason: 'internal_error'`.
+- **P1** Malformed `collect_deadline` (non-ISO8601 string) was silently
+  rescued to nil, skipping all deadline checks. Now returns `crashed`
+  with `crashed_reason: 'malformed_state'`.
+- **P1** `safe_path` swallowed PendingState errors, masking real failures
+  as benign "not collected". Removed; errors now surface to the outer
+  rescue and become `crashed/internal_error`.
+### Other improvements
+- Deadline-cap arithmetic uses `ceil` instead of `to_i` so the wait can
+  actually run up to the deadline; the post-wait revalidation catches any
+  overshoot.
+- `elapsed_seconds` field now correctly uses `outcome[:waited_seconds]`
+  for the `:timeout` path (was always 0.0 in v3.24.0).
+- 8 new regression tests covering each of the bugs above (22 wait tool
+  tests total).
 ## [3.24.0] - 2026-04-27
 ### Added
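
The streak fix in the 3.24.1 entry comes down to keeping the read and the write of the counter inside one critical section. A minimal sketch of that idea follows, assuming an in-memory stand-in for the persisted state; `DemoState` and its `update_state` are hypothetical and only mirror the name used in the changelog, not the gem's actual implementation.

```ruby
# Hypothetical in-memory stand-in for the gem's persisted wait state.
class DemoState
  def initialize
    @mutex = Mutex.new
    @data = { 'streak' => 0 }
  end

  # Yields the state Hash while holding the lock, so a read followed by
  # a write inside the block is one atomic read-modify-write step.
  def update_state
    @mutex.synchronize { yield @data }
  end

  def streak
    @mutex.synchronize { @data['streak'] }
  end
end

demo_state = DemoState.new
waiters = 8.times.map do
  Thread.new do
    # The increment happens entirely inside update_state, so two waiters
    # cannot both observe N and both write N + 1.
    100.times { demo_state.update_state { |s| s['streak'] += 1 } }
  end
end
waiters.each(&:join)
FINAL_STREAK = demo_state.streak
```

Reading the streak before entering the block and writing it afterwards would reintroduce the undercount described above.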

data/lib/kairos_mcp/version.rb CHANGED

@@ -1,4 +1,4 @@
 module KairosMcp
-  VERSION = "3.24.0"
+  VERSION = "3.24.3"
   CHANGELOG_URL = "https://github.com/masaomi/KairosChain_2026/blob/main/CHANGELOG.md"
 end

data/templates/skillsets/llm_client/lib/llm_client/headless.rb CHANGED

@@ -53,16 +53,16 @@ module KairosMcp
           # from codex 5.5 + cursor).
           # Guarded by `defined?` so non-worker consumers (MCP direct call)
           # that never load multi_llm_review/main_state don't NameError.
-          bracket = defined?(KairosMcp::SkillSets::MultiLlmReview::MainState)
-          if bracket
-            KairosMcp::SkillSets::MultiLlmReview::MainState.enter_call!
-          end
-          begin
-            result = CallRouter.perform(args, @config)
-          ensure
-            if bracket
-              KairosMcp::SkillSets::MultiLlmReview::MainState.exit_call!
+          # v3.24.3: use with_call to enforce ensure-bracketed enter/exit.
+          # enter_call!/exit_call! are now private; with_call is the only
+          # supported pattern. defined?-guard preserved so non-worker
+          # consumers (MCP direct call) don't NameError.
+          if defined?(KairosMcp::SkillSets::MultiLlmReview::MainState)
+            result = KairosMcp::SkillSets::MultiLlmReview::MainState.with_call do
+              CallRouter.perform(args, @config)
             end
+          else
+            result = CallRouter.perform(args, @config)
           end
           # Shape matches BaseTool#text_content (symbol :text key) — what
           # Dispatcher consumes today via `b[:text] || b['text']`.
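
The `defined?` guard in this hunk lets the same call path work whether or not the liveness module has been loaded. A minimal sketch of the pattern follows; `DemoApp`, `DemoApp::Liveness` and `perform_call` are hypothetical stand-ins for the gem's classes, chosen only to keep the example self-contained.

```ruby
module DemoApp
  # Hypothetical stand-in for CallRouter.perform.
  def self.perform_call
    :llm_result
  end

  # Brackets the call only when the optional tracker constant exists.
  # defined? returns nil for a missing constant instead of raising NameError.
  def self.run
    if defined?(DemoApp::Liveness)
      DemoApp::Liveness.with_call { perform_call }
    else
      perform_call
    end
  end
end

# Tracker not loaded yet: the plain path is taken.
WITHOUT_TRACKER = DemoApp.run

# Simulate the optional module being loaded later.
module DemoApp
  module Liveness
    @calls = 0
    class << self
      attr_reader :calls

      # Returns the block's value; the ensure clause runs on every exit.
      def with_call
        yield
      ensure
        @calls += 1
      end
    end
  end
end

WITH_TRACKER = DemoApp.run
```

Both paths return the same value, which is why the caller in the hunk can assign `result` from either branch.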

data/templates/skillsets/multi_llm_review/bin/dispatch_worker.rb CHANGED

@@ -103,28 +103,31 @@ def self_timeout_at_from_state(token, request)
   end
 end
-# Pulse thread: touches worker.tick IFF main is alive (counter advanced OR
-# still inside an adapter.call within its expected timeout).  (C3b/P0-3)
+# Pulse thread: touches worker.tick IFF main is alive. v3.24.3 uses the
+# per-thread (counter, in_flight, oldest_ts) snapshot from MainState and
+# delegates the alive decision to MainState.compute_alive (pure function,
+# unit-testable). Emits a diagnostic log line every ~5s so future incidents
+# can be diagnosed from worker.log without filesystem mtime archaeology.
 pulse_thread = Thread.new do
   begin
     last_counter = -1
-    # Loaded below; read now for the timeout window
-    max_call_t = 300
-    call_margin = 60
+    log_emit_at = 0
+    threshold = 360  # max_call_t (300) + call_margin (60)
     loop do
-      # MainState.snapshot reads ts FIRST then counter, per the v0.3.2
-      # reader-ordering invariant. Pulse must use snapshot (not raw struct
-      # reads) so any future change to the invariant is observed here.
-      counter, ts = MLR::MainState.snapshot
-      alive =
-        if counter != last_counter
-          true
-        elsif ts
-          (Process.clock_gettime(Process::CLOCK_MONOTONIC) - ts) < (max_call_t + call_margin)
-        else
-          false
-        end
+      counter, in_flight, oldest_ts = MLR::MainState.snapshot
+      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      alive = MLR::MainState.compute_alive(
+        counter, last_counter, in_flight, oldest_ts, now, threshold
+      )
       FileUtils.touch(PS.worker_tick_path(token)) if alive
+      if now - log_emit_at >= 5
+        oldest_age = oldest_ts ? (now - oldest_ts).round(1) : nil
+        warn "[pulse] counter=#{counter} in_flight=#{in_flight} " \
+             "oldest_age=#{oldest_age || 'nil'}s alive=#{alive}"
+        log_emit_at = now
+      end
       last_counter = counter
       sleep 2
     end
@@ -263,8 +266,9 @@ begin
     review_context: request['review_context'] || 'independent'
   )
-  # Advance counter so pulse observes "progress since dispatch entered".
-  MLR::MainState.exit_call!
+  # v3.24.3: counter-only signal (no enter_call!/exit_call! pair). bump_counter!
+  # advances pulse's progress signal without touching ts_by_thread.
+  MLR::MainState.bump_counter!
   check_shutdown!(token)
   elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
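
The diagnostic line in the pulse loop is throttled by comparing the current monotonic time with the last emit time. Because the loop sleeps 2 seconds per iteration, the effective cadence is somewhat longer than the 5 second threshold. The sketch below replays that arithmetic on a fixed list of timestamps; `throttled_emits` is a hypothetical helper and the timestamps are made up for illustration.

```ruby
# Hypothetical helper: given the monotonic timestamps at which the loop
# wakes up, return the timestamps at which a log line would be emitted.
def throttled_emits(poll_times, interval: 5)
  log_emit_at = 0
  poll_times.select do |now|
    emit = (now - log_emit_at) >= interval
    log_emit_at = now if emit # remember the last emit, as the pulse loop does
    emit
  end
end

# Wake-ups every 2 seconds, starting at an arbitrary monotonic time.
EMITS = throttled_emits([100, 102, 104, 106, 108, 110, 112])
# EMITS == [100, 106, 112]: one line roughly every 6 seconds
```

Starting `log_emit_at` at 0 means the first iteration emits immediately whenever the monotonic clock already reads at least 5.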

data/templates/skillsets/multi_llm_review/lib/multi_llm_review/dispatcher.rb CHANGED

@@ -132,7 +132,10 @@ module KairosMcp
         def bump_main_state_counter
           return unless defined?(KairosMcp::SkillSets::MultiLlmReview::MainState)
-          KairosMcp::SkillSets::MultiLlmReview::MainState.exit_call!
+          # v3.24.3: counter-only bump. exit_call! is private; bump_counter!
+          # is the public counter-only progress signal (does not touch
+          # ts_by_thread).
+          KairosMcp::SkillSets::MultiLlmReview::MainState.bump_counter!
         rescue StandardError
           nil
         end

data/templates/skillsets/multi_llm_review/lib/multi_llm_review/main_state.rb CHANGED

@@ -3,61 +3,123 @@
 module KairosMcp
   module SkillSets
     module MultiLlmReview
-      # Main-thread liveness state for the worker's pulse mechanism (v0.3 P0-3,
-      # v0.3.2 C3b). Read by the pulse thread to decide whether worker.tick
-      # should be touched; written by the main thread around each adapter.call.
+      # ──────────────────────────────────────────────────────────────────
+      # MainState — main-thread liveness state for the worker pulse
+      # ──────────────────────────────────────────────────────────────────
       #
-      # ORDERING INVARIANT (v0.3.2 C3b):
-      #   exit_call! increments `counter` FIRST, clears `in_llm_call_since_mono`
-      #   SECOND. A torn two-field read by the pulse thread therefore always
-      #   lands in one of:
-      #     (old_counter, old_ts)  — in-call, recent       → alive
-      #     (new_counter, old_ts)  — counter advanced      → alive
-      #     (new_counter, nil)     — exit complete         → alive via counter
-      #   Never (old_counter, nil), which would look stalled.
+      # Tracks per-thread enter/exit timestamps so the pulse thread can tell
+      # whether the worker's main path is still progressing through LLM calls.
+      # Replaces the v0.3.2 process-global single-ts design which raced under
+      # parallel reviewer threads (incident token 5b75ff8c-..., 2026-04-27).
       #
-      # MRI atomicity note: integer accessor and flonum Float accessor reads
-      # are each atomic via GVL-serialized method dispatch; the PAIR is not.
-      # The invariant above makes pair torn reads benign.
-      MAIN_STATE = Struct.new(:counter, :in_llm_call_since_mono).new(0, nil)
+      # ORDERING / ATOMICITY INVARIANTS (v3.24.3):
+      #
+      # 1. counter and ts_by_thread mutations AND reads are bracketed by a
+      #    single Mutex (MUTEX). Readers (snapshot) take the same mutex, so
+      #    they never observe a torn (counter, ts_by_thread) pair.
+      #    Replaces the v0.3.2 "ts-first/counter-second" ordering invariant
+      #    which assumed single-threaded callers.
+      #
+      # 2. with_call { ... } is the ONLY supported call-bracketing pattern.
+      #    Direct enter_call!/exit_call! calls are private (see
+      #    private_class_method below). This guarantees that any exception
+      #    from the LLM call propagates AFTER ts_by_thread has been cleaned
+      #    up (via `ensure exit_call!`), preventing per-thread entry leaks.
+      #
+      # 3. Thread.current.object_id is used as the per-thread key. MRI's
+      #    object_id stays stable for the lifetime of a Thread object;
+      #    reuse only happens after the Thread has been GC'd. Within a
+      #    single with_call invocation, the Thread is on-stack and therefore
+      #    not GC-eligible, so the key is unique.
+      #
+      # 4. Mutex#synchronize is Thread.kill-safe under MRI (Ruby's internal
+      #    `ensure unlock`). The `ensure exit_call!` inside with_call also
+      #    runs under Thread.kill, so cleanup is guaranteed even if the
+      #    dispatch thread is forcibly terminated.
+      #
+      # 5. NON-REENTRANT: nested with_call on the same thread is NOT
+      #    supported. The inner enter_call! would overwrite the outer
+      #    ts_by_thread[tid], and the outer ensure exit_call! would delete
+      #    the entry while the inner call is still tracked. Current
+      #    multi_llm_review code paths never nest LLM calls; if a future
+      #    adapter calls another LLM, this contract must be revisited.
+      MAIN_STATE = Struct.new(:counter, :ts_by_thread).new(0, {})
+      MUTEX = Mutex.new
       module MainState
         module_function
-        # Called immediately before adapter.call enters a blocking LLM syscall.
-        def enter_call!
-          MAIN_STATE.in_llm_call_since_mono =
-            Process.clock_gettime(Process::CLOCK_MONOTONIC)
+        # PUBLIC: bracket an LLM call. The block runs between enter_call!
+        # and exit_call!; ensure guarantees exit_call! even on exception or
+        # Thread.kill. Returns the value of the block.
+        def with_call
+          enter_call!
+          yield
+        ensure
+          exit_call!
         end
-        # Called in the `ensure` block around adapter.call. Must be idempotent:
-        # if enter_call! never ran (e.g., exception before entry), clearing a
-        # nil timestamp is a no-op and counter is still bumped so a pulse read
-        # observes progress.
-        def exit_call!
-          MAIN_STATE.counter += 1                    # INVARIANT: counter first
-          MAIN_STATE.in_llm_call_since_mono = nil    # then clear timestamp
+        # PUBLIC: counter-only progress signal. Used by dispatcher's join
+        # cleanup loop where there is no LLM call in flight but the main
+        # thread is still doing useful work (joining worker threads). Does
+        # NOT touch ts_by_thread.
+        def bump_counter!
+          MUTEX.synchronize { MAIN_STATE.counter += 1 }
         end
-        # Read current state as a plain Array snapshot.
-        #
-        # READER ORDERING (mirrors writer's C3b invariant): read ts FIRST,
-        # counter SECOND. If reader observes ts == nil, the writer MUST
-        # already have completed counter+=1 (writer writes counter before ts).
-        # Therefore (old_counter, nil) is unreachable by any reader using
-        # this snapshot. The pulse thread uses this helper — do not change
-        # the order without also changing the writer invariant.
+        # PUBLIC: snapshot of current state. Returns (counter, in_flight,
+        # oldest_ts). in_flight = ts_by_thread.size; oldest_ts = min of
+        # in-flight ts (nil if idle). Always atomic via MUTEX.
         def snapshot
-          ts = MAIN_STATE.in_llm_call_since_mono
-          counter = MAIN_STATE.counter
-          [counter, ts]
+          MUTEX.synchronize do
+            ts_values = MAIN_STATE.ts_by_thread.values
+            [MAIN_STATE.counter, ts_values.size, ts_values.min]
+          end
+        end
+        # PUBLIC PURE FUNCTION: determine alive state from a snapshot
+        # tuple. Extracted so unit tests can table-drive the four branches
+        # without forking a worker. The pulse thread calls this with the
+        # result of snapshot().
+        def compute_alive(counter, last_counter, in_flight, oldest_ts, now_mono, threshold_seconds)
+          if counter != last_counter
+            true                                                  # progress observed
+          elsif in_flight > 0 && oldest_ts
+            (now_mono - oldest_ts) < threshold_seconds            # in-call, recent
+          elsif in_flight > 0
+            true                                                  # in-call but ts not visible (transient)
+          else
+            false                                                 # idle, no progress
+          end
         end
-        # Reset for tests. NOT safe for runtime use.
+        # TEST API: clear all state. NOT safe for runtime use.
         def reset!
-          MAIN_STATE.counter = 0
-          MAIN_STATE.in_llm_call_since_mono = nil
+          MUTEX.synchronize do
+            MAIN_STATE.counter = 0
+            MAIN_STATE.ts_by_thread.clear
+          end
+        end
+        # ── private (do not call from outside MainState; use with_call) ──
+        def enter_call!
+          tid = Thread.current.object_id
+          MUTEX.synchronize do
+            MAIN_STATE.ts_by_thread[tid] =
+              Process.clock_gettime(Process::CLOCK_MONOTONIC)
+          end
+        end
+        private_class_method :enter_call!
+        def exit_call!
+          tid = Thread.current.object_id
+          MUTEX.synchronize do
+            MAIN_STATE.counter += 1
+            MAIN_STATE.ts_by_thread.delete(tid)
+          end
         end
+        private_class_method :exit_call!
       end
     end
   end
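
Since `compute_alive` is described as a pure function meant to be table-driven, its branches can be exercised without a worker process. The sketch below copies the function body from the hunk above into a top-level method so it runs standalone; the input rows are illustrative values, not taken from the gem's test suite.

```ruby
# Standalone copy of the logic shown in the diff, for illustration only.
def compute_alive(counter, last_counter, in_flight, oldest_ts, now_mono, threshold_seconds)
  if counter != last_counter
    true                                        # progress observed
  elsif in_flight > 0 && oldest_ts
    (now_mono - oldest_ts) < threshold_seconds  # in-call, recent
  elsif in_flight > 0
    true                                        # in-call, ts not visible
  else
    false                                       # idle, no progress
  end
end

# Columns: counter, last_counter, in_flight, oldest_ts, now, threshold, expected
ALIVE_CASES = [
  [5, 4, 0, nil,   1000.0, 360, true],  # counter advanced
  [5, 5, 1, 900.0, 1000.0, 360, true],  # oldest call is 100s old
  [5, 5, 1, 600.0, 1000.0, 360, false], # oldest call is 400s old
  [5, 5, 1, nil,   1000.0, 360, true],  # in flight, ts missing
  [5, 5, 0, nil,   1000.0, 360, false]  # idle, counter unchanged
].freeze

ALIVE_RESULTS = ALIVE_CASES.map do |row|
  compute_alive(*row[0..-2]) == row.last
end
```

The second and third rows cover both outcomes of the threshold comparison, so the table has five rows for four branches.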

data/templates/skillsets/multi_llm_review/lib/multi_llm_review/wait_for_worker.rb CHANGED

@@ -6,17 +6,45 @@ module KairosMcp
       # Phase 2's polling loop for the detached worker's subprocess_results.json.
       # Returns one of four outcomes:
       #   :ready       — subprocess_results.json parsed successfully
-      #   :crashed     — state.subprocess_status terminal OR heartbeat stale
+      #   :crashed     — state.subprocess_status == crashed/self_timed_out
+      #                  OR state == done but results never parseable within
+      #                     wall-clock budget (reason: done_but_no_results)
+      #                  OR heartbeat stale (only while non-terminal state)
       #                  OR pid present but no heartbeat within grace OR
       #                  no pid/heartbeat within startup grace
       #   :timeout     — wall-clock max_wait exceeded with live worker
       #   (raises on unexpected errors from PendingState)
+      #
+      # v3.24.2: 'done' state now bypasses the heartbeat staleness check.
+      # The heartbeat thread is killed in the worker's ensure block, so
+      # mtime stops advancing the moment the worker transitions to 'done'.
+      # Without this bypass, a transient parse-mid-rename of
+      # subprocess_results.json combined with the killed heartbeat could
+      # surface a false-positive 'heartbeat_stale' for a successfully
+      # completed worker.
       module WaitForWorker
         STARTUP_GRACE_DEFAULT        = 30
         HEARTBEAT_STALE_DEFAULT      = 15
         POLL_INTERVAL_DEFAULT        = 0.5
         SUSPEND_JUMP_THRESHOLD       = 5.0
+        # All possible :crashed outcome reasons. Single source of truth for
+        # the crash-reason taxonomy; operators grep these in worker.log and
+        # next_action redispatch hints. v3.24.3 declares the constant; usage
+        # sites still use string literals (replacement scheduled for v3.24.4
+        # to avoid bundling unrelated refactors).
+        CRASH_REASONS = %w[
+          heartbeat_stale
+          heartbeat_never_started
+          worker_never_started
+          done_but_no_results
+          crashed
+          self_timed_out
+          wait_exhausted
+          internal_error
+          malformed_state
+        ].freeze
         module_function
         def wait(token, opts = {})
@@ -48,17 +76,40 @@ module KairosMcp
               # transient parse mid-rename — keep polling
             end
-            # 2. Explicit crash marker from worker
+            # 2. Explicit terminal status from worker
             state = PendingState.load_state(token)
-            if state && (state['subprocess_status'] == 'crashed' ||
-                         state['subprocess_status'] == 'self_timed_out')
-              return {
-                status: :crashed,
-                reason: state['crash_reason'] || state['subprocess_status'],
-                pid: read_pid(token),
-                pgid: read_pgid_from_file(token),
-                log_tail: tail_log(token)
-              }
+            if state
+              status = state['subprocess_status']
+              if status == 'crashed' || status == 'self_timed_out'
+                return {
+                  status: :crashed,
+                  reason: state['crash_reason'] || status,
+                  pid: read_pid(token),
+                  pgid: read_pgid_from_file(token),
+                  log_tail: tail_log(token)
+                }
+              end
+              # Worker exited cleanly. subprocess_results.json should be (or
+              # imminently become) loadable via step 1 on a subsequent poll.
+              # The heartbeat thread is intentionally killed at worker exit
+              # (dispatch_worker.rb ensure block), so the heartbeat-stale
+              # check below would false-positive. Skip liveness checks while
+              # 'done', and rely on step 1 retry until results parse or the
+              # wall-clock budget exhausts.
+              if status == 'done'
+                if now_mono > deadline
+                  return {
+                    status: :crashed,
+                    reason: 'done_but_no_results',
+                    pid: read_pid(token),
+                    pgid: read_pgid_from_file(token),
+                    log_tail: tail_log(token)
+                  }
+                end
+                sleep poll_interval
+                next
+              end
             end
             # 3. Heartbeat-based liveness checks
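
The reworked step 2 can be read as a small decision function over the worker status and the wall-clock deadline. The following is a sketch of that reading only; `step2_decision` and its symbol results are hypothetical names, and the real method returns richer hashes (pid, pgid, log tail) and sleeps before polling again.

```ruby
# Hypothetical pure-function view of step 2 in the polling loop.
# Returns [action, reason]; reason is nil unless the action is :crashed.
def step2_decision(state, now_mono, deadline)
  return [:check_heartbeat, nil] unless state

  status = state['subprocess_status']
  case status
  when 'crashed', 'self_timed_out'
    [:crashed, state['crash_reason'] || status]
  when 'done'
    # Heartbeat is dead by design once the worker is done, so skip the
    # liveness checks and keep retrying the results file until the deadline.
    now_mono > deadline ? [:crashed, 'done_but_no_results'] : [:poll_again, nil]
  else
    [:check_heartbeat, nil]
  end
end

STEP2_RESULTS = [
  step2_decision(nil, 10.0, 60.0),
  step2_decision({ 'subprocess_status' => 'crashed' }, 10.0, 60.0),
  step2_decision({ 'subprocess_status' => 'self_timed_out',
                   'crash_reason' => 'internal_error' }, 10.0, 60.0),
  step2_decision({ 'subprocess_status' => 'done' }, 10.0, 60.0),
  step2_decision({ 'subprocess_status' => 'done' }, 61.0, 60.0),
  step2_decision({ 'subprocess_status' => 'running' }, 10.0, 60.0)
].freeze
```

The `'running'` status in the last row is an assumed non-terminal value used only to reach the fall-through branch.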

data/templates/skillsets/multi_llm_review/test/test_main_state.rb ADDED

@@ -0,0 +1,152 @@
+# frozen_string_literal: true
+# v3.24.3: per-thread MainState concurrency tests. Covers the per-thread
+# Hash invariants that fix the v0.3.2 single-ts process-global race
+# (incident token 5b75ff8c-..., 2026-04-27).
+require 'minitest/autorun'
+require_relative '../lib/multi_llm_review/main_state'
+module KairosMcp
+  module SkillSets
+    module MultiLlmReview
+      class TestMainStateConcurrency < Minitest::Test
+        def setup
+          MainState.reset!
+        end
+        # T1 enter, T2 enter, T1 exit. Verify oldest_ts becomes T2's ts
+        # (not stuck at T1's). This is the exact scenario that v0.3.2 broke
+        # under: T1.exit cleared the single global ts while T2 was still
+        # in-call.
+        def test_oldest_ts_advances_when_first_enter_exits
+          enter_order = Queue.new
+          can_exit_t1 = Queue.new
+          can_exit_t2 = Queue.new
+          t1_ts = nil
+          t2_ts = nil
+          t1 = Thread.new do
+            MainState.with_call do
+              # capture our ts via snapshot
+              _, _, oldest_ts = MainState.snapshot
+              t1_ts = oldest_ts
+              enter_order << :t1
+              can_exit_t1.pop  # wait for main to release
+            end
+          end
+          # Wait for t1 to enter
+          assert_equal :t1, enter_order.pop
+          t2 = Thread.new do
+            MainState.with_call do
+              enter_order << :t2
+              can_exit_t2.pop
+            end
+          end
+          # Wait for t2 to enter
+          assert_equal :t2, enter_order.pop
+          # Both in flight. Capture snapshot.
+          _, in_flight, oldest_ts_both = MainState.snapshot
+          assert_equal 2, in_flight
+          assert_equal t1_ts, oldest_ts_both, 'oldest_ts is T1 (earliest enter)'
+          # T2 entered after T1, so T2's ts > T1's ts. T2's ts is not captured
+          # directly; after T1 exits, oldest_ts must become T2's ts, which the
+          # assertion below checks by requiring it to exceed T1's ts.
+          can_exit_t1 << :go
+          t1.join
+          _, in_flight_after, oldest_ts_after = MainState.snapshot
+          assert_equal 1, in_flight_after, 'T2 still in-flight'
+          refute_nil oldest_ts_after
+          assert oldest_ts_after > t1_ts,
+            "oldest_ts must advance past T1's anchor after T1 exits " \
+            "(was #{t1_ts}, now #{oldest_ts_after})"
+          can_exit_t2 << :go
+          t2.join
+          # Both exited
+          counter, in_flight_final, oldest_ts_final = MainState.snapshot
+          assert_equal 2, counter
+          assert_equal 0, in_flight_final
+          assert_nil oldest_ts_final
+        end
+        # 4 threads cycling enter/exit 250 times each = 1000 total cycles.
+        # Verifies counter and ts_by_thread stay consistent under contention.
+        def test_concurrent_with_call_stress
+          srand(20260427)  # deterministic seed
+          n_threads = 4
+          cycles_per_thread = 250
+          start_at = Time.now
+          threads = n_threads.times.map do
+            Thread.new do
+              cycles_per_thread.times do
+                MainState.with_call { }
+              end
+            end
+          end
+          threads.each(&:join)
+          elapsed = Time.now - start_at
+          assert elapsed < 10, "stress test took #{elapsed.round(2)}s, budget 10s"
+          counter, in_flight, oldest_ts = MainState.snapshot
+          assert_equal n_threads * cycles_per_thread, counter
+          assert_equal 0, in_flight, 'ts_by_thread leaked entries'
+          assert_nil oldest_ts
+        end
+        # If with_call raises mid-block across many threads, ts_by_thread
+        # must still be cleaned for every thread.
+        def test_concurrent_with_call_exception_cleanup
+          n_threads = 4
+          threads = n_threads.times.map do |i|
+            Thread.new do
+              begin
+                MainState.with_call { raise "boom from thread #{i}" }
+              rescue StandardError
+                # expected
+              end
+            end
+          end
+          threads.each(&:join)
+          counter, in_flight, oldest_ts = MainState.snapshot
+          assert_equal n_threads, counter, 'counter bumps even on exception'
+          assert_equal 0, in_flight, 'ts_by_thread must be cleaned on exception'
+          assert_nil oldest_ts
+        end
+        # bump_counter! is racy with concurrent with_call but must not
+        # corrupt ts_by_thread or under-count counter.
+        def test_bump_counter_concurrent_with_with_call
+          n_threads = 4
+          n_bumps = 100
+          n_cycles = 100
+          bump_threads = n_threads.times.map do
+            Thread.new { n_bumps.times { MainState.bump_counter! } }
+          end
+          call_threads = n_threads.times.map do
+            Thread.new { n_cycles.times { MainState.with_call { } } }
+          end
+          (bump_threads + call_threads).each(&:join)
+          counter, in_flight, oldest_ts = MainState.snapshot
+          assert_equal n_threads * (n_bumps + n_cycles), counter
+          assert_equal 0, in_flight
+          assert_nil oldest_ts
+        end
+      end
+    end
+  end
+end
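
The tests above cover exceptions but not the Thread.kill case mentioned in invariant 4 of main_state.rb. A minimal sketch of how that could be checked follows, assuming a simplified re-implementation; `DemoMainState` is hypothetical and only mimics the per-thread Hash plus `ensure` cleanup, it is not the gem's MainState.

```ruby
# Hypothetical, simplified stand-in for MainState's per-thread tracking.
module DemoMainState
  MUTEX = Mutex.new
  TS_BY_THREAD = {}

  def self.with_call
    tid = Thread.current.object_id
    MUTEX.synchronize do
      TS_BY_THREAD[tid] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    end
    yield
  ensure
    # Runs on normal return, on exception, and when the thread is killed.
    MUTEX.synchronize { TS_BY_THREAD.delete(tid) }
  end

  def self.in_flight
    MUTEX.synchronize { TS_BY_THREAD.size }
  end
end

entered = Queue.new
worker = Thread.new do
  DemoMainState.with_call do
    entered << :in # signal that the entry has been recorded
    sleep          # block until killed
  end
end

entered.pop # wait until the worker is inside with_call
IN_FLIGHT_DURING = DemoMainState.in_flight
worker.kill
worker.join
IN_FLIGHT_AFTER = DemoMainState.in_flight
```

The Queue handshake keeps the kill from racing with thread startup, so the in-flight count is observed only after the entry exists.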