RubyGems - kairos-chain - Versions diffs - 3.24.0 → 3.24.1 - Mend

kairos-chain 3.24.0 → 3.24.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

checksums.yaml +4 -4
data/CHANGELOG.md +48 -0
data/lib/kairos_mcp/version.rb +1 -1
data/templates/skillsets/multi_llm_review/test/test_multi_llm_review_wait.rb +154 -0
data/templates/skillsets/multi_llm_review/tools/multi_llm_review_wait.rb +156 -92
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 20e2a223137f51dc61025e57dd6fd205a8f702ef923b5b0b2e0d464a308d279f
-  data.tar.gz: 51627cb487cf5fc2e46b8e6055bf36e0cf5c8839f15cb2c2236f2ff56efa2def
+  metadata.gz: 27275c4874d145bd69a309c8e89d96a371fdb614f6b0e9fdde6d22f810d907de
+  data.tar.gz: 904a59d5332379a17162ad8dd371df5b0024a0f5eff01f652c5322b2f9bcbc57
 SHA512:
-  metadata.gz: 9fd4b17a28bdc06b7b19195e7274ef41e4ba7a93fc70e2c3a38083c58eae85ea8dfd432cb7cc6219cf7ba49ba637dbed12c48ea12312f4a3f26f46dfb936438f
-  data.tar.gz: fadb35fdbf47eeebfc9b667685452a0c222ecb396dbb2031de08ac27b2df1de1a3985fc41c03e4e767d22a58ec98a77d52c2059e921f8084d1663a2a099bdd1d
+  metadata.gz: 49790dfd44ccc67e64bb58cf41bc1aeba3c0a79c5693329f47aa1b4024016215f63cf62691f4dbb162fbe31b6ebee5cf38fe78b6350c81ea6dbb958e480c902b
+  data.tar.gz: 4fc908310610778a521837702b50bc427d3234a7680988fc22b683c8296e893f83652dacf1cbe4f178a19d9cd2031833c56d634024c54cd73f5f8b5a58133e55
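
The digests above cover the two archives packed inside the .gem file. As a minimal sketch of how one of them could be checked locally, assuming the archive bytes are already in memory: `digest_matches?` is a hypothetical helper written for this note, not part of RubyGems.

```ruby
require 'digest'

# Hypothetical helper: compare the SHA256 of raw bytes with an expected hex digest.
def digest_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# SHA256 of the empty string, a well-known test vector.
EMPTY_SHA256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'

puts digest_matches?('', EMPTY_SHA256)  # true
puts digest_matches?('x', EMPTY_SHA256) # false
```

For a real check, the bytes would come from something like `File.binread('data.tar.gz')` after untarring the .gem, and the expected value from the SHA256 entry in checksums.yaml.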

data/CHANGELOG.md CHANGED

@@ -4,6 +4,54 @@ All notable changes to the `kairos-chain` gem will be documented in this file.
 This project follows [Semantic Versioning](https://semver.org/).
+## [3.24.1] - 2026-04-27
+### Fixed (multi_llm_review_wait)
+Self-referential validation: ran multi_llm_review on the new wait tool itself
+(both Path A Bash workflow and Path B MCP SkillSet, 7 reviewers total) — found
+1 P0 + 7 P1 bugs not caught by the v3.24.0 test suite. All fixed:
+- **P0** `config_parallel` had a dead `unless ... || true` guard and the file
+  lacked an explicit `require 'yaml'`; YAML never loaded, so all configured
+  wait caps silently fell back to defaults. Removed the bogus guard, added
+  explicit `require 'yaml'` at the top of the file.
+- **P1** Streak read-then-write was split: state was read at entry, streak
+  written later via `update_state`, so two concurrent waiters could both
+  observe the same N and both write N+1, undercounting. Now the increment
+  is fully inside the `update_state` RMW block.
+- **P1** `still_pending` next_action message read streak limit from
+  `state.dig('wait_still_pending_streak_limit')` (a key never written),
+  always falling back to the default constant. Now `streak_limit` is
+  threaded through `translate_outcome` so the displayed denominator
+  matches the effective config.
+- **P1** Post-wait deadline revalidation missing: `WaitForWorker.wait`
+  could return `:timeout` after the collect deadline elapsed during the
+  blocking wait, but the tool returned `still_pending`. Now re-checks
+  `Time.now >= deadline_at_entry` after the wait and returns
+  `past_collect_deadline` if so.
+- **P1** Pre-wait streak guard ran before the ready/results-file check,
+  so a worker that finished while streak was at limit was misclassified
+  as `crashed/wait_exhausted`. Reordered: ready check now runs first.
+- **P1** Internal exceptions returned `status: 'error'`, outside the
+  declared 6-status enum. Now mapped to `crashed` with
+  `crashed_reason: 'internal_error'`.
+- **P1** Malformed `collect_deadline` (non-ISO8601 string) was silently
+  rescued to nil, skipping all deadline checks. Now returns `crashed`
+  with `crashed_reason: 'malformed_state'`.
+- **P1** `safe_path` swallowed PendingState errors, masking real failures
+  as benign "not collected". Removed; errors now surface to the outer
+  rescue and become `crashed/internal_error`.
+### Other improvements
+- Deadline-cap arithmetic uses `ceil` instead of `to_i` so the wait can
+  actually run up to the deadline; the post-wait revalidation catches any
+  overshoot.
+- `elapsed_seconds` field now correctly uses `outcome[:waited_seconds]`
+  for the `:timeout` path (was always 0.0 in v3.24.0).
+- 8 new regression tests covering each of the bugs above (22 wait tool
+  tests total).
 ## [3.24.0] - 2026-04-27
 ### Added
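
The P0 entry is easier to follow with the guard isolated. In Ruby, `||` binds tighter than the `unless` modifier, so `return {} unless x || true` can never take the early return: the guard is a no-op, not itself the source of an empty result. One plausible reading, consistent with the added `require 'yaml'`, is that a constant lookup failed and the blanket `rescue StandardError` turned the failure into `{}`. The sketch below illustrates both points; `guard_fires?`, `load_with_blanket_rescue` and `MissingYamlLikeConstant` are hypothetical names, not code from the gem.

```ruby
# Hypothetical reduction of the v3.24.0 guard: does the early return fire?
def guard_fires?(const_defined)
  return true unless const_defined || true # condition is always true
  false
end

# Hypothetical loader: referencing an undefined constant raises NameError,
# which a blanket `rescue StandardError` silently converts into {}.
def load_with_blanket_rescue
  MissingYamlLikeConstant.safe_load_file('config.yml')
rescue StandardError
  {}
end

puts guard_fires?(false)              # false: the early return is never taken
puts guard_fires?(true)               # false
puts load_with_blanket_rescue.inspect # {}
```

Logging the rescued exception, as v3.24.1 does in `load_config_parallel`, is what makes this class of failure visible.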

data/lib/kairos_mcp/version.rb CHANGED

@@ -1,4 +1,4 @@
 module KairosMcp
-  VERSION = "3.24.0"
+  VERSION = "3.24.1"
   CHANGELOG_URL = "https://github.com/masaomi/KairosChain_2026/blob/main/CHANGELOG.md"
 end

data/templates/skillsets/multi_llm_review/test/test_multi_llm_review_wait.rb CHANGED

@@ -224,6 +224,160 @@ module KairosMcp
         end
       end
+      # ── v3.24.1 regression tests for v3.24.0 review findings ───────────
+      class TestMultiLlmReviewWaitV3_24_1Regressions < Minitest::Test
+        def setup
+          @tmp = Dir.mktmpdir('mlr-wait-v341-')
+          @orig_cwd = Dir.pwd
+          Dir.chdir(@tmp)
+          @tool = Tools::MultiLlmReviewWait.new
+          @token = '22222222-3333-4444-8555-666666666666'
+        end
+        def teardown
+          Dir.chdir(@orig_cwd)
+          FileUtils.rm_rf(@tmp)
+        end
+        def write_state(extra = {})
+          PendingState.create_token_dir!(@token)
+          PendingState.write_state(@token, {
+            'schema_version' => 4,
+            'token' => @token,
+            'created_at' => Time.now.iso8601,
+            'collect_deadline' => (Time.now + 1800).iso8601,
+            'subprocess_status' => 'pending',
+            'subprocess_total' => 3,
+            'parallel' => true
+          }.merge(extra))
+          FileUtils.touch(PendingState.collect_lock_path(@token))
+        end
+        def call_wait(args = {})
+          JSON.parse(@tool.call({ 'collect_token' => @token }.merge(args)).first[:text])
+        end
+        # Bug #1 (P0): config_parallel had dead `unless ... || true` guard so
+        # YAML was never loaded. Verify config keys actually take effect now.
+        def test_config_parallel_loads_yaml_when_file_exists
+          # Use ruby reflection: invoke the private loader directly.
+          loaded = @tool.send(:load_config_parallel)
+          assert_kind_of Hash, loaded
+          # Real config file ships with these keys (v3.24.0):
+          assert loaded.key?('wait_max_default_seconds') ||
+                 loaded.key?('poll_interval_seconds'),
+                 "load_config_parallel returned empty hash — YAML not actually loaded. Got: #{loaded.inspect}"
+        end
+        # Bug #6: streak guard ran BEFORE ready check, so a worker that
+        # finished while streak was at limit was misclassified as crashed.
+        def test_ready_check_takes_precedence_over_streak_guard
+          # Token is past the default streak limit (3) AND has subprocess_results.json.
+          write_state('wait_still_pending_streak' => 5)
+          PendingState.write_subprocess_results(@token, {
+            'results' => [
+              { 'role_label' => 'r1', 'raw_text' => 'APPROVE', 'status' => 'success' },
+              { 'role_label' => 'r2', 'raw_text' => 'APPROVE', 'status' => 'success' }
+            ],
+            'elapsed_seconds' => 5.0
+          })
+          payload = call_wait('max_wait_seconds' => 1)
+          assert_equal 'ready', payload['status'],
+            "Expected ready (worker finished) even though streak limit was hit; got: #{payload.inspect}"
+          assert_equal 'multi_llm_review_collect', payload['next_action']['tool']
+        end
+        # Bug #4: post-wait deadline revalidation. If deadline elapses during
+        # WaitForWorker.wait, the post-wait check should return
+        # past_collect_deadline rather than still_pending.
+        def test_post_wait_deadline_revalidation
+          # Deadline is 1.5s from now. Heartbeat live → WaitForWorker.wait
+          # would return :timeout after max_wait=2s, but deadline-cap clamps
+          # to ~1.5s. After the wait, Time.now >= deadline_at_entry → return
+          # past_collect_deadline.
+          write_state('collect_deadline' => (Time.now + 1.5).iso8601)
+          FileUtils.touch(PendingState.worker_heartbeat_path(@token))
+          PendingState.write_worker_pid(@token, { 'pid' => Process.pid, 'pgid' => Process.pid })
+          payload = call_wait('max_wait_seconds' => 2)
+          # Outcome should NOT be still_pending — either past_collect_deadline
+          # (post-wait revalidation fired) or ready (if results file appeared).
+          # What we forbid is still_pending when the deadline is gone.
+          refute_equal 'still_pending', payload['status'],
+            "Should not return still_pending when deadline elapsed during wait. Got: #{payload.inspect}"
+        end
+        # Bug #7: malformed collect_deadline → previously silently nilled and
+        # skipped checks. Now should return crashed/malformed_state.
+        def test_malformed_collect_deadline_returns_crashed
+          write_state('collect_deadline' => 'not-an-iso8601-timestamp')
+          payload = call_wait('max_wait_seconds' => 1)
+          assert_equal 'crashed', payload['status']
+          assert_equal 'malformed_state', payload['crashed_reason']
+        end
+        # Bug #5: internal exceptions previously returned status: 'error',
+        # outside the declared 6-status enum. Now should map to crashed.
+        def test_internal_error_returns_crashed_status_in_enum
+          # nil arguments yield an empty token, which fails valid_token? and
+          # takes the unknown_token path, not internal_error. Exercising the
+          # internal_error path itself would need a stub that makes
+          # PendingState raise; this test only checks the enum is respected.
+          payload = JSON.parse(@tool.call(nil).first[:text])
+          assert_includes %w[unknown_token crashed], payload['status']
+          refute_equal 'error', payload['status']
+        end
+        # Bug #2: streak increment via update_state RMW is atomic. Verify
+        # that under sequential timeouts, streak increments correctly.
+        def test_streak_increments_atomically_via_update_state
+          write_state
+          FileUtils.touch(PendingState.worker_heartbeat_path(@token))
+          PendingState.write_worker_pid(@token, { 'pid' => Process.pid, 'pgid' => Process.pid })
+          p1 = call_wait('max_wait_seconds' => 1)
+          assert_equal 'still_pending', p1['status']
+          assert_equal 1, p1['still_pending_streak']
+          # Reload state and verify persistence.
+          state_after_1 = PendingState.load_state(@token)
+          assert_equal 1, state_after_1['wait_still_pending_streak']
+          p2 = call_wait('max_wait_seconds' => 1)
+          assert_equal 'still_pending', p2['status']
+          assert_equal 2, p2['still_pending_streak']
+        end
+        # Bug #3: still_pending hint should report the *effective* streak
+        # limit (from config), not nil from state['wait_still_pending_streak_limit'].
+        def test_still_pending_hint_reports_correct_streak_limit
+          write_state
+          FileUtils.touch(PendingState.worker_heartbeat_path(@token))
+          PendingState.write_worker_pid(@token, { 'pid' => Process.pid, 'pgid' => Process.pid })
+          p = call_wait('max_wait_seconds' => 1)
+          assert_equal 'still_pending', p['status']
+          # Hint must mention "streak N/M" with M being the actual limit (3 by default).
+          purpose = p['next_action']['purpose']
+          assert_match(%r{streak 1/3}, purpose,
+            "Expected '/3' (effective limit) in next_action purpose; got: #{purpose}")
+        end
+        # Off-by-one: when no time remains before the deadline, return
+        # past_collect_deadline rather than clamping to 1 and entering WaitForWorker.
+        def test_remaining_lt_one_second_returns_past_deadline_immediately
+          write_state('collect_deadline' => (Time.now + 0.4).iso8601)
+          # Sleep briefly so remaining is genuinely < 0.
+          sleep 0.5
+          t0 = Time.now
+          p = call_wait('max_wait_seconds' => 60)
+          elapsed = Time.now - t0
+          assert_equal 'past_collect_deadline', p['status']
+          assert_operator elapsed, :<, 1.0
+        end
+      end
       # ── backward compat: collect can still be called without wait ────────
       # Verifies that introducing wait does not break the existing
       # "delegation_pending → collect" path. The collect tool already polls
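
The deadline-related regression tests above exercise three decisions: reject a malformed `collect_deadline`, short-circuit when no time remains, and otherwise cap the wait using `ceil`. A minimal standalone sketch of that ordering follows, assuming a hypothetical `plan_wait` helper that returns plain symbols and integers instead of the tool's JSON replies.

```ruby
require 'time'

# Hypothetical helper mirroring the ordering in the diff:
#   :malformed_state       - deadline string is not ISO 8601
#   :past_collect_deadline - no time remains
#   Integer                - seconds to wait, capped by the deadline
def plan_wait(deadline_str, requested_max, now: Time.now)
  return requested_max if deadline_str.nil?

  deadline = begin
    Time.iso8601(deadline_str)
  rescue ArgumentError
    return :malformed_state
  end

  remaining = deadline - now
  return :past_collect_deadline if remaining <= 0

  # ceil, not to_i: 0.4s remaining becomes 1, not 0.
  [remaining.ceil, requested_max].min
end

now = Time.utc(2026, 4, 27, 12, 0, 0)
p plan_wait('not-an-iso8601-timestamp', 60, now: now) # :malformed_state
p plan_wait('2026-04-27T11:59:59Z', 60, now: now)     # :past_collect_deadline
p plan_wait('2026-04-27T12:00:10Z', 60, now: now)     # 10
```

Because `ceil` can round the wait slightly past the deadline, the real tool pairs it with the post-wait `Time.now >= deadline_at_entry` check shown in the diff.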

data/templates/skillsets/multi_llm_review/tools/multi_llm_review_wait.rb CHANGED

@@ -2,6 +2,7 @@
 require 'json'
 require 'time'
+require 'yaml'
 require_relative '../lib/multi_llm_review/pending_state'
 require_relative '../lib/multi_llm_review/wait_for_worker'
@@ -18,23 +19,25 @@ module KairosMcp
         #
         # Without this tool, orchestrator can still call collect directly —
         # collect's own internal polling covers worker completion. wait is a
-        # tool-chain checkpoint that surfaces structural status (ready,
-        # crashed, exhausted) with explicit next_action recovery hints, so
-        # the LLM can choose the right next step deterministically.
+        # tool-chain checkpoint that surfaces structural status with explicit
+        # next_action recovery hints, so the LLM can choose the right next
+        # step deterministically.
         #
-        # Status enum (R10):
+        # Status enum:
         #   ready                  — subprocess_results.json present, proceed to collect
         #   still_pending          — max_wait elapsed, worker healthy, may call wait again
-        #   crashed                — worker terminal failure (with reason)
+        #   crashed                — worker terminal failure or internal error (with reason)
         #   unknown_token          — token dir missing (never existed or GC'd)
         #   already_collected      — collected.json present, retrieve cached payload
         #   past_collect_deadline  — token alive but past deadline; collect would reject
+        #
+        # Internal exceptions are mapped to `crashed` (reason: internal_error)
+        # to keep the public response strictly inside the declared enum.
         class MultiLlmReviewWait < KairosMcp::Tools::BaseTool
-          # Per-call hard cap on max_wait_seconds (R7).
-          MAX_WAIT_HARD_CAP_DEFAULT = 1800
-          # Default streak limit before still_pending escalates to crashed (R7).
+          MAX_WAIT_HARD_CAP_DEFAULT       = 1800
           STILL_PENDING_STREAK_LIMIT_DEFAULT = 3
+          DEFAULT_MAX_WAIT_SECONDS        = 600
+          DEFAULT_POLL_INTERVAL_SECONDS   = 1.0
           def name
             'multi_llm_review_wait'
@@ -70,8 +73,8 @@ module KairosMcp
                 max_wait_seconds: {
                   type: 'integer',
                   description: 'Server-side blocking duration cap in seconds. ' \
-                    'Default from config (delegation.parallel.wait_max_default_seconds). ' \
-                    'Hard cap 1800 (delegation.parallel.wait_max_hard_cap_seconds).'
+                    'Default from config (delegation.parallel.wait_max_default_seconds = 600). ' \
+                    'Hard cap from config (delegation.parallel.wait_max_hard_cap_seconds = 1800).'
                 }
               },
               required: %w[collect_token]
@@ -79,22 +82,17 @@ module KairosMcp
           end
           def call(arguments)
-            token = arguments['collect_token'].to_s
+            token = (arguments.is_a?(Hash) ? arguments['collect_token'] : nil).to_s
             unless PendingState.valid_token?(token)
-              return text_content(JSON.generate({
-                'status' => 'unknown_token',
-                'collect_token' => token,
-                'elapsed_seconds' => 0.0,
-                'next_action' => next_action_redispatch(
-                  'Token format invalid. Re-run multi_llm_review to start a new dispatch.'
-                )
-              }))
+              return reply_unknown_token(token,
+                'Token format invalid. Re-run multi_llm_review to start a new dispatch.')
             end
-            cfg = config_parallel
-            default_max  = (cfg['wait_max_default_seconds'] || 600).to_i
+            cfg          = load_config_parallel
+            default_max  = (cfg['wait_max_default_seconds'] || DEFAULT_MAX_WAIT_SECONDS).to_i
             hard_cap     = (cfg['wait_max_hard_cap_seconds'] || MAX_WAIT_HARD_CAP_DEFAULT).to_i
-            poll_int     = (cfg['wait_poll_interval_seconds'] || 1.0).to_f
+            poll_int     = (cfg['wait_poll_interval_seconds'] || DEFAULT_POLL_INTERVAL_SECONDS).to_f
             streak_limit = (cfg['wait_still_pending_streak_limit'] ||
                             STILL_PENDING_STREAK_LIMIT_DEFAULT).to_i
@@ -102,57 +100,85 @@ module KairosMcp
             requested_max = hard_cap if requested_max > hard_cap
             requested_max = 1 if requested_max < 1
-            # 1. already_collected check (collected.json present) — before any
-            #    deadline / token-dir checks so a successful collect always
+            # 1. already_collected — check first so a successful collect always
             #    returns deterministically even after deadline expiry.
-            if File.exist?(safe_path { PendingState.collected_path(token) })
+            collected_path = PendingState.collected_path(token)
+            if File.exist?(collected_path)
               return reply('already_collected', token, 0.0,
                 next_action: next_action_collect_replay(token,
                   'Collect already completed for this token. Call multi_llm_review_collect ' \
                   'to retrieve the cached final consensus (idempotent replay).'))
             end
-            # 2. unknown_token check (state.json missing).
+            # 2. ready check BEFORE streak guard (Bug #6 from v3.24.0 review).
+            #    If subprocess_results.json is already on disk, return ready
+            #    regardless of streak — the worker finished, completion wins.
+            results_path = PendingState.subprocess_results_path(token)
+            if File.exist?(results_path)
+              return reply_ready_from_results_file(token, results_path)
+            end
+            # 3. unknown_token — state.json missing.
             state = PendingState.load_state(token)
             if state.nil?
-              return reply('unknown_token', token, 0.0,
-                next_action: next_action_redispatch(
-                  'Token not found (never existed or already garbage-collected). ' \
-                  'Re-run multi_llm_review to start a new dispatch.'))
+              return reply_unknown_token(token,
+                'Token not found (never existed or already garbage-collected). ' \
+                'Re-run multi_llm_review to start a new dispatch.')
             end
-            # 3. past_collect_deadline early exit (collect would reject anyway).
-            deadline = (Time.iso8601(state['collect_deadline']) rescue nil)
+            # 4. Detect malformed collect_deadline (Bug #7) — return crashed
+            #    with a clear reason rather than silently skipping the check.
+            deadline = nil
+            if state['collect_deadline']
+              deadline = (Time.iso8601(state['collect_deadline']) rescue :malformed)
+              if deadline == :malformed
+                return reply('crashed', token, 0.0,
+                  crashed_reason: 'malformed_state',
+                  next_action: next_action_redispatch(
+                    'state.json has malformed collect_deadline. The token is unrecoverable; ' \
+                    're-run multi_llm_review.'))
+              end
+            end
+            # 5. past_collect_deadline early exit — collect would reject anyway.
             if deadline && Time.now > deadline
               return reply('past_collect_deadline', token, 0.0,
-                subprocess_total: state['subprocess_total'] ||
-                                  (PendingState.load_request(token)&.dig('reviewers')&.size),
+                subprocess_total: subprocess_total_from(state, token),
                 next_action: next_action_redispatch(
                   'Token deadline elapsed. multi_llm_review_collect would reject. ' \
-                  'Re-run multi_llm_review to start a new dispatch.'))
+                  'Re-run multi_llm_review.'))
             end
-            # 4. Cap max_wait by remaining deadline (R7) so we never block
-            #    longer than the useful lifetime of the token.
+            # 6. Cap max_wait by remaining deadline. If no time remains, return
+            #    past_collect_deadline directly (v3.24.0 clamped the wait to 1s
+            #    and entered WaitForWorker pointlessly).
             if deadline
-              remaining = (deadline - Time.now).to_i
+              remaining_f = deadline - Time.now
+              if remaining_f <= 0
+                return reply('past_collect_deadline', token, 0.0,
+                  subprocess_total: subprocess_total_from(state, token),
+                  next_action: next_action_redispatch(
+                    'Token deadline elapsed. Re-run multi_llm_review.'))
+              end
+              # Ceil rather than floor so the wait can actually run up to the
+              # deadline. The post-wait revalidation in translate_outcome
+              # catches any overshoot (Bug #4 defense-in-depth).
+              remaining = remaining_f.ceil
               requested_max = remaining if remaining < requested_max
-              requested_max = 1 if requested_max < 1
             end
-            # 5. Streak guard: if still_pending was returned too many times in
-            #    a row, escalate to crashed/wait_exhausted.
-            streak = (state['wait_still_pending_streak'] || 0).to_i
-            if streak >= streak_limit
+            # 7. Streak guard — runs AFTER ready check (Bug #6 fix).
+            current_streak = state['wait_still_pending_streak'].to_i
+            if current_streak >= streak_limit
               return reply('crashed', token, 0.0,
                 crashed_reason: 'wait_exhausted',
-                still_pending_streak: streak,
+                still_pending_streak: current_streak,
                 next_action: next_action_redispatch(
-                  "still_pending streak reached limit (#{streak_limit}). Worker may be " \
-                  'wedged or pathologically slow. Re-run multi_llm_review.'))
+                  "still_pending streak reached limit (#{current_streak}/#{streak_limit}). " \
+                  'Worker may be wedged or pathologically slow. Re-run multi_llm_review.'))
             end
-            # 6. Delegate to existing WaitForWorker for the actual polling.
+            # 8. Delegate to existing WaitForWorker for the actual polling.
             outcome = WaitForWorker.wait(token, {
               max_wait_seconds: requested_max,
               poll_interval_seconds: poll_int,
@@ -160,24 +186,27 @@ module KairosMcp
               heartbeat_stale_threshold_seconds: cfg['heartbeat_stale_threshold_seconds'] || 15
             })
-            translate_outcome(token, outcome, streak, requested_max, state)
+            translate_outcome(token, outcome, requested_max, streak_limit, deadline)
           rescue StandardError => e
             warn "[multi_llm_review_wait] INTERNAL ERROR: #{e.class}: #{e.message}"
             warn e.backtrace.first(10).join("\n") if e.backtrace
-            text_content(JSON.generate({
-              'status' => 'error',
-              'error_class' => 'internal',
-              'error' => "#{e.class}: #{e.message}",
-              'collect_token' => arguments['collect_token']
-            }))
+            # Map internal errors to declared enum (Bug #5: previously returned
+            # status: 'error' which was outside the documented 6 statuses).
+            safe_token = (arguments.is_a?(Hash) ? arguments['collect_token'] : nil).to_s
+            reply('crashed', safe_token, 0.0,
+              crashed_reason: 'internal_error',
+              next_action: next_action_redispatch(
+                "Internal error (#{e.class}). Re-run multi_llm_review."))
           end
           private
-          def translate_outcome(token, outcome, prior_streak, requested_max, state)
-            elapsed = (outcome[:elapsed] || requested_max).to_f
-            subprocess_total = state['subprocess_total'] ||
-                               PendingState.load_request(token)&.dig('reviewers')&.size
+          def translate_outcome(token, outcome, requested_max, streak_limit, deadline_at_entry)
+            # WaitForWorker returns :elapsed for ready, :waited_seconds for
+            # timeout. Use the first non-nil so still_pending and crashed
+            # paths report real wait time, not 0.0.
+            elapsed = (outcome[:elapsed] || outcome[:waited_seconds] || 0.0).to_f
+            subprocess_total = subprocess_total_from(PendingState.load_state(token), token)
             case outcome[:status]
             when :ready
@@ -197,16 +226,38 @@ module KairosMcp
                 subprocess_total: subprocess_total,
                 next_action: next_action_redispatch(
                   "Worker terminated abnormally (#{outcome[:reason] || 'crashed'}). " \
-                  'Re-run multi_llm_review to start a new dispatch.'))
+                  'Re-run multi_llm_review.'))
             when :timeout
-              new_streak = prior_streak + 1
-              persist_streak(token, new_streak)
+              # Post-wait deadline revalidation (Bug #4 fix). The deadline
+              # may have elapsed during the blocking wait; if so, return
+              # past_collect_deadline rather than still_pending. Use >= so
+              # the boundary case (Time.now == deadline) is treated as past.
+              if deadline_at_entry && Time.now >= deadline_at_entry
+                return reply('past_collect_deadline', token, elapsed,
+                  subprocess_total: subprocess_total,
+                  next_action: next_action_redispatch(
+                    'Deadline elapsed during wait. Re-run multi_llm_review.'))
+              end
+              # Atomic increment via PendingState.update_state RMW (Bug #2).
+              # The block reads the current persisted streak and writes
+              # current+1 in one transaction, so concurrent waiters cannot
+              # both read the same N and both write N+1.
+              new_streak = nil
+              PendingState.update_state(token) do |st|
+                next nil unless st
+                new_streak = st['wait_still_pending_streak'].to_i + 1
+                st['wait_still_pending_streak'] = new_streak
+                st
+              end
+              new_streak ||= 1
               reply('still_pending', token, elapsed,
                 subprocess_total: subprocess_total,
                 still_pending_streak: new_streak,
                 next_action: next_action_wait(token,
                   "Worker still healthy after #{requested_max}s. Call multi_llm_review_wait " \
-                  "again with the same token (streak #{new_streak}/#{(state.dig('wait_still_pending_streak_limit') || STILL_PENDING_STREAK_LIMIT_DEFAULT)})."))
+                  "again with the same token (streak #{new_streak}/#{streak_limit})."))
             else
               reply('crashed', token, elapsed,
                 crashed_reason: "unknown_outcome:#{outcome[:status]}",
@@ -216,20 +267,44 @@ module KairosMcp
             end
           end
+          def reply_ready_from_results_file(token, results_path)
+            data = PendingState.load_subprocess_results(token)
+            done = (data && data['results'].is_a?(Array)) ? data['results'].size : nil
+            elapsed = (data && data['elapsed_seconds'].to_f) || 0.0
+            reset_streak(token)
+            reply('ready', token, elapsed,
+              subprocess_done: done,
+              subprocess_total: subprocess_total_from(PendingState.load_state(token), token) || done,
+              next_action: next_action_collect(token,
+                'Subprocess reviewers complete. Submit your persona Agent findings to ' \
+                'multi_llm_review_collect to compute the final consensus.'))
+          end
+          def reply_unknown_token(token, purpose)
+            reply('unknown_token', token, 0.0,
+              next_action: next_action_redispatch(purpose))
+          end
           def reply(status, token, elapsed, **fields)
             payload = {
               'status' => status,
               'collect_token' => token,
-              'elapsed_seconds' => elapsed.round(3)
+              'elapsed_seconds' => elapsed.to_f.round(3)
             }
-            payload['subprocess_done']        = fields[:subprocess_done] if fields.key?(:subprocess_done)
-            payload['subprocess_total']       = fields[:subprocess_total] if fields.key?(:subprocess_total)
-            payload['crashed_reason']         = fields[:crashed_reason] if fields.key?(:crashed_reason)
-            payload['still_pending_streak']   = fields[:still_pending_streak] if fields.key?(:still_pending_streak)
-            payload['next_action']            = fields[:next_action] if fields.key?(:next_action)
+            payload['subprocess_done']      = fields[:subprocess_done] if fields.key?(:subprocess_done)
+            payload['subprocess_total']     = fields[:subprocess_total] if fields.key?(:subprocess_total)
+            payload['crashed_reason']       = fields[:crashed_reason] if fields.key?(:crashed_reason)
+            payload['still_pending_streak'] = fields[:still_pending_streak] if fields.key?(:still_pending_streak)
+            payload['next_action']          = fields[:next_action] if fields.key?(:next_action)
             text_content(JSON.generate(payload))
           end
+          def subprocess_total_from(state, token)
+            return state['subprocess_total'] if state.is_a?(Hash) && state['subprocess_total']
+            req = PendingState.load_request(token) rescue nil
+            req&.dig('reviewers')&.size
+          end
           def next_action_collect(token, purpose)
             {
               'tool' => 'multi_llm_review_collect',
@@ -265,18 +340,9 @@ module KairosMcp
             }
           end
-          # Streak persistence via PendingState.update_state (atomic RMW).
-          def persist_streak(token, n)
-            PendingState.update_state(token) do |state|
-              next nil unless state
-              state['wait_still_pending_streak'] = n
-              state
-            end
-          rescue StandardError
-            # Best-effort. Streak loss = orchestrator gets one more retry,
-            # acceptable degradation.
-          end
+          # Atomic streak reset via update_state RMW. Errors are logged (not
+          # silently swallowed — Bug #8) so genuine PendingState failures
+          # surface in stderr.
           def reset_streak(token)
             PendingState.update_state(token) do |state|
               next nil unless state
@@ -287,23 +353,21 @@ module KairosMcp
                 nil
               end
             end
-          rescue StandardError
-            # Best-effort.
-          end
-          def safe_path
-            yield
-          rescue StandardError
-            '/dev/null/never_exists'
+          rescue StandardError => e
+            warn "[multi_llm_review_wait] reset_streak failed: #{e.class}: #{e.message}"
           end
-          def config_parallel
-            return {} unless self.class.const_defined?(:CONFIG_PATH) || true
+          # Load the delegation.parallel config block. v3.24.0 had a dead guard
+          # here (`unless ... || true` can never take the early return) and no
+          # explicit `require 'yaml'`; any load error was rescued to {}.
+          # v3.24.1 removes the guard, requires 'yaml' at the top of the
+          # file, and logs load failures.
+          def load_config_parallel
             path = File.expand_path('../config/multi_llm_review.yml', __dir__)
             return {} unless File.exist?(path)
             cfg = YAML.safe_load_file(path, permitted_classes: [Symbol], aliases: true)
             (cfg.dig('delegation', 'parallel') || {}).to_h
-          rescue StandardError
+          rescue StandardError => e
+            warn "[multi_llm_review_wait] config load failed: #{e.class}: #{e.message}"
             {}
           end
         end
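
The streak fix hinges on doing the read and the write inside one critical section. The sketch below reproduces the idea with an in-memory store and a `Mutex`. `TinyState` and its `update_state` are hypothetical stand-ins; how `PendingState.update_state` actually serializes access (for example with a file lock) is not shown in this diff.

```ruby
# Hypothetical in-memory stand-in for a state store with an atomic
# read-modify-write (RMW) primitive.
class TinyState
  def initialize
    @state = { 'wait_still_pending_streak' => 0 }
    @lock = Mutex.new
  end

  # Yields a copy of the current state under the lock and stores the
  # block's return value, so read and write happen in one critical section.
  def update_state
    @lock.synchronize do
      updated = yield(@state.dup)
      @state = updated if updated
    end
  end

  def streak
    @lock.synchronize { @state['wait_still_pending_streak'] }
  end
end

store = TinyState.new
threads = 8.times.map do
  Thread.new do
    store.update_state do |st|
      st['wait_still_pending_streak'] = st['wait_still_pending_streak'].to_i + 1
      st
    end
  end
end
threads.each(&:join)
puts store.streak # 8: no increment is lost
```

Reading the streak outside the block and writing N+1 inside it would reintroduce the lost-update race described in the changelog.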

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: kairos-chain
 version: !ruby/object:Gem::Version
-  version: 3.24.0
+  version: 3.24.1
 platform: ruby
 authors:
 - Masaomi Hatakeyama